A High-Throughput FPGA Accelerator for Lightweight CNNs With Balanced Dataflow
Abstract:
FPGA accelerators for lightweight convolutional neural networks (LWCNNs) have recently attracted significant attention. Most existing LWCNN accelerators adopt a single-computing-engine (CE) architecture with local optimizations. However, these designs typically suffer from high on-chip/off-chip memory overhead and low computational efficiency due to their layer-by-layer dataflow and inflexible resource-mapping mechanisms. To tackle these issues, a novel multi-CE accelerator with balanced dataflow is proposed to efficiently accelerate LWCNNs through memory-oriented and computing-oriented optimizations. First, a streaming architecture with hybrid CEs is designed to minimize off-chip memory access while keeping the on-chip buffer cost low. Second, a balanced dataflow strategy is introduced for streaming architectures to enhance computational efficiency by improving resource mapping and mitigating data congestion. Furthermore, a resource-aware memory and parallelism allocation methodology, based on a performance model, is proposed to achieve better performance and scalability. Evaluated on the Xilinx ZC706 platform, the proposed accelerator reduces memory access by 63.8% on average and saves up to 6.3× off-chip memory access for ShuffleNetV2. It achieves a 5.2× improvement over the reference design and delivers up to 2092.4 FPS with a state-of-the-art ACE score of up to 94.5%, at high on-chip resource utilization and a DSP utilization of 95%, thus significantly outperforming current LWCNN accelerators.