EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models
Abstract:
The rapid advancement of artificial intelligence (AI), and of Large Language Models (LLMs) in particular, has profoundly changed how we work and communicate. However, deploying LLMs on resource-constrained edge devices (such as robots) remains challenging due to intensive computation requirements, heavy memory access, diverse operator types, and compilation difficulties. In this work, we propose EdgeLLM to address these issues. First, focusing on computation, we design a mixed-precision processing-element array with a group systolic architecture that efficiently supports FP16-activation/FP16-weight computation for the Multi-Head Attention (MHA) block and FP16-activation/INT4-weight computation for the Feed-Forward Network (FFN) layers. In addition, a specific optimization for log-scale structured weight sparsity further increases throughput. Second, to address compilation and deployment, we analyze all operators within LLM models and develop a unified compilation framework that keeps input and output features in the same data format, so different operators can be processed without any data transformation. We then build an end-to-end LLM deployment tool on a CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves 19× higher throughput and 7.55× higher energy efficiency than a commercial GPU (NVIDIA A100-SXM4-80G). Compared with the state-of-the-art FPGA accelerator HighLLM, it obtains 1.24× better HBM bandwidth utilization, energy efficiency, and LLM throughput.
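The FP16-activation/INT4-weight scheme for the FFN layers can be illustrated with a software sketch. The snippet below is a minimal NumPy model of one such matmul, assuming signed INT4 weights with one FP16 scale per weight group; the group size (64) and per-group scaling are illustrative assumptions, not the paper's exact quantization format.

```python
import numpy as np

def quantize_int4(w, group_size=64):
    """Quantize FP16 weights to signed INT4 ([-8, 7]) with one FP16 scale
    per group of `group_size` consecutive weights (assumed grouping)."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(x_fp16, q, scale, w_shape):
    """Dequantize INT4 weights back to FP16, then run an FP16 matmul,
    mimicking an FP16-activation x INT4-weight processing element."""
    w = (q.astype(np.float16) * scale).reshape(w_shape)
    return x_fp16 @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64)).astype(np.float16)   # FFN weight tile
x = rng.standard_normal((4, 128)).astype(np.float16)    # FP16 activations
q, s = quantize_int4(w)
y = dequant_matmul(x, q, s, w.shape)
err = np.abs(y - x @ w).max()  # quantization error vs. full-FP16 matmul
```

In hardware, the dequantization (INT4 value times FP16 group scale) would be fused into the processing element rather than materialized as a full FP16 weight matrix; this sketch only models the arithmetic.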
Index Terms —
Large language model, AI accelerator.