A Low-Latency Feed-Forward Architecture for Image Filtering via Row-by-Row Processing
Abstract:
This paper presents a novel accelerator architecture for real-time image filtering applications that require high throughput and low latency. The proposed architecture consists of two systolic arrays: a Row Convolution (RConv) Array and a Column Convolution (CConv) Array. In each clock cycle, the RConv Array processes one row of the input image and computes one row of an intermediate image, which is then fed to the CConv Array. The CConv Array in turn computes one row of the output image per clock cycle. A novel aspect of the architecture is that this parallelism allows the line delays within the convolution architecture to be used efficiently, which leads to a significant reduction in the memory requirement and latency of the architecture. While the row convolution in the RConv Array is implemented using a direct-form FIR filter, the column convolution in the CConv Array is implemented using a transpose-form FIR filter. Thus, the processing elements in the CConv Array are fully pipelined and do not require additional pipelining delays. The parallelized row-by-row processing significantly reduces the memory overhead per output pixel. The proposed architecture is then generalized so that multiple rows of the image can be processed in a clock cycle, leading to further improvements in latency and memory requirements. The proposed architecture is fully feed-forward and can be pipelined further. Case studies show that, on a Xilinx Virtex-7 200T FPGA device operating at 100 MHz, the architecture achieves 16.0×, 86.3×, and 943× improvements in the LUT-time², FF-time², and DSP-time² products, respectively, over a baseline serial architecture for the non-separable case with image size 128 × 128 and filter size 11 × 11. The proposed architecture can operate at a throughput of 1.28 Gpixels/s, whereas the baseline can only operate at 100 Mpixels/s.
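The row-by-row dataflow described above can be illustrated with a small software model. The following is a minimal sketch, not the hardware design itself. It assumes a separable kernel (a 1-D row kernel `h_row` and a 1-D column kernel `h_col`), zero initial state, and causal alignment. All function and variable names are hypothetical and chosen only for illustration. Each loop iteration in `filter_rows_streaming` stands in for one clock cycle: a direct-form FIR produces one intermediate row, and a transpose-form FIR whose registers each hold a whole row of partial sums produces one output row.

```python
def fir_direct(x, h):
    """Direct-form FIR: delay line on the input, then a weighted sum of taps."""
    delay = [0] * len(h)                      # delay[k] holds x[n - k]
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(sum(hk * dk for hk, dk in zip(h, delay)))
    return out


def fir_transpose(x, h):
    """Transpose-form FIR: the input is broadcast to every tap and the
    partial sums move through the registers toward the output."""
    state = [0] * len(h)                      # last entry is always 0
    out = []
    for sample in x:
        out.append(h[0] * sample + state[0])
        state = [h[i + 1] * sample + state[i + 1]
                 for i in range(len(h) - 1)] + [0]
    return out


def filter_rows_streaming(image, h_row, h_col):
    """Separable filtering that consumes one image row per step (assumed model)."""
    width = len(image[0])
    taps = len(h_col)
    # Row-wide partial-sum registers of the transpose-form column filter.
    state = [[0] * width for _ in range(taps)]
    output = []
    for row in image:                         # one iteration ~ one clock cycle
        mid = fir_direct(row, h_row)          # row stage: one intermediate row
        output.append([h_col[0] * m + s for m, s in zip(mid, state[0])])
        state = [[h_col[i + 1] * m + s for m, s in zip(mid, state[i + 1])]
                 for i in range(taps - 1)] + [[0] * width]
    return output


def reference_2d(image, h_row, h_col):
    """Plain nested-loop causal 2-D convolution used only as a cross-check."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for i, hc in enumerate(h_col):
                for j, hr in enumerate(h_row):
                    if r - i >= 0 and c - j >= 0:
                        out[r][c] += hc * hr * image[r - i][c - j]
    return out


# Small integer example so the comparison is exact.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]
h_row = [1, 2, 1]
h_col = [1, 0, -1]
streamed = filter_rows_streaming(image, h_row, h_col)
assert streamed == reference_2d(image, h_row, h_col)
assert fir_direct(image[0], h_row) == fir_transpose(image[0], h_row)
```

In this model the transpose-form column stage adds each new intermediate row directly into its running partial sums, so no separate buffer of past intermediate rows is kept. This mirrors, in simplified form, the claim that the CConv processing elements need no additional pipelining delays.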