LIA: Cost-efficient LLM Inference Acceleration with Intel Advanced Matrix Extensions and CXL


Abstract

The limited memory capacity of single GPUs constrains large language model (LLM) inference, necessitating cost-prohibitive multi-GPU deployments or frequent, performance-limiting CPU-GPU transfers over slow PCIe. In this work, we first benchmark recent Intel CPUs with Advanced Matrix Extensions (AMX), including 4th-generation (Sapphire Rapids) and 6th-generation (Granite Rapids) Xeon Scalable Processors, demonstrating matrix multiplication throughput of 20 TFLOPS and 40 TFLOPS, respectively, comparable to some recent GPUs. These findings unlock more extensive computation offloading to CPUs, reducing CPU-GPU transfers and alleviating throughput bottlenecks compared to prior-generation CPUs. Building on these insights, we design LIA, a single-GPU LLM inference acceleration framework that leverages cooperative AMX-enabled CPU-GPU computation and CXL offloading. LIA systematically offloads computation to CPUs, optimizing both latency and throughput. The framework also introduces a memory offloading policy that seamlessly integrates affordable CXL memory with DDR memory to enhance performance in throughput-driven tasks. On Sapphire Rapids (Granite Rapids) systems with a single H100 GPU, LIA achieves up to 5.1× (19×) lower latency and 3.7× (5.1×) higher throughput compared to the latest single-GPU offloading framework. Furthermore, LIA with CXL offloading yields an additional 1.5× throughput improvement over LIA using only DDR memory, with a 1.8× increase in maximum batch size (900→1.6K).
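To illustrate the kind of AMX throughput measurement the abstract refers to, below is a minimal sketch (not from the paper) of a bf16 GEMM benchmark on the CPU. It assumes a PyTorch build whose oneDNN backend dispatches bfloat16 matrix multiplication to AMX tile kernels on Sapphire Rapids or Granite Rapids Xeons; on other CPUs the same code runs but does not exercise AMX, so the reported number is only indicative on such hardware.

```python
import time
import torch

# Square bf16 GEMM; on AMX-capable Xeons PyTorch's oneDNN backend can map this
# to AMX tile instructions. Matrix size and iteration count are arbitrary choices.
N = 4096
a = torch.randn(N, N, dtype=torch.bfloat16)
b = torch.randn(N, N, dtype=torch.bfloat16)

# Warm-up so one-time kernel selection and allocation are excluded from timing.
for _ in range(3):
    torch.matmul(a, b)

iters = 20
start = time.perf_counter()
for _ in range(iters):
    torch.matmul(a, b)
elapsed = time.perf_counter() - start

flops = 2 * N**3 * iters  # 2*N^3 floating-point operations per square GEMM
print(f"~{flops / elapsed / 1e12:.1f} TFLOPS (bf16 GEMM on CPU)")
```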

Authors

Hyungyo Kim*, Nachuan Wang*, Qirong Xia*, Jinghan Huang*, Amir Yazdanbakhsh, Nam Sung Kim*

* External author

Venue

ISCA 2025