In general, all the FFT processors can be categorized into two main groups: pipelined processors or shared-memory processors. Examples of pipelined FFT processors. A pipelined architecture provides high throughputs, but it requires more hardware resources at the same time. One or multiple pipelines are often implemented, each consisting of butterfly units and control logic. In contrast, the shared-memory-based architecture requires the least amount of hardware resources at the expense of slower throughput. In the radix-2 shared-memory architecture, the FFT data are organized into two memory banks. At each clock cycle, two FFT data are provided by memory banks and one butterfly unit is used to process the data. At the next clock cycle, the calculation results are written back to the memory banks and replace the old data. The scope of this brief is limited to the shared-memory architecture.
In the shared-memory architecture, an efficient addressing scheme for FFT data as well as coefficients (called twiddle factors) is required. For split-radix FFT, it conventionally involves an L-shaped butterfly data path whose irregular shape has uneven latencies and makes scheduling difficult. In this brief, we show that the SRFFT can be computed by using a modified radix-2 butterfly structure. Our contribution consists of mapping the split-radix FFT algorithm to the shared-memory architecture, leveraging the lower multiplicative complexity of the algorithm to reduce the dynamic power and developing two novel twiddle factor addressing schemes for the split-radix FFT.
- More Delay
- Less Efficiency
- No Burst Mode Data Transfer
In this paper, we are proposed to Increases the size of radix-2 to radix-4, and increases the point 1024 to 2048, with complex valued transform, and show the performance of area, power and delay. With Radix-4 Burst I/O Solution, the FFT uses Radix-4 Butterfly processing engine. It loads and/or unloads data separately from calculating the transform. Data I/O and processing are not simultaneous. When the FFT is started, the data is loaded. After a full frame has been loaded, the core computes the transform. When the computation has finished, the data can be unloaded, but cannot be loaded or unloaded during the calculation process. The data loading and unloading processes can be overlapped if the data is unloaded in digit reversed order. This architecture has lower resource usage than the Pipelined, Streaming I/O architecture, but a longer transform time, and supports point sizes from 64 to 65536. Data and phase factors can be stored in block RAM or in distributed RAM (the latter for point sizes less than or equal to 1024).
- Less Delay
- More Efficiency
- Burst Mode Data Transfer