Proposed Title :
Much research has been done on the floating-point fused multiply-add (FMA) unit. It has several advantages over discrete floating-point adders and multipliers in a floating-point unit design. Not only can a fused multiply-add unit reduce the latency of an application that executes a multiplication followed by an addition, but the unit may entirely replace a processor's floating-point adder and floating-point multiplier. Many DSP algorithms have been rewritten to take advantage of the presence of FMA units. For example, a radix-16 FFT algorithm has been presented that speeds up FFTs in systems with FMA units. High-throughput digital filter implementations are also possible with the use of FMA units. FMA units are utilized in embedded signal processing and graphics applications, and they are used to perform division and argument reduction. This is why the FMA has become an integral unit of many commercial processors, such as those of IBM, HP and Intel.
Similar to the operations performed by an FMA, calculating the sum of the products of two sets of operands (the dot product) is a frequently used operation in many DSP algorithms and in other fields. For example, it is required in the computation of the FFT and DCT butterfly operations. In traditional floating-point hardware, the dot product is performed with two multiplications and an addition. These operations may be performed serially, which limits the throughput. Alternatively, the multiplications may be performed in parallel with two independent floating-point multipliers followed by a floating-point adder, which is expensive in silicon area and in power consumption.
This paper investigates the implementation of a floating-point fused dot-product (FDP) unit, shown in Fig. 1. It performs the following operation:
Y = A × B + C × D    (1)
The numerical operation performed by this unit can be used to improve many DSP algorithms. In particular, multiplication of complex operands benefits greatly from the FDP. Implementations of the FFT butterfly operation, the DCT butterfly operation, vector multiplication and the wavelet transform could all benefit from the speedup offered by this unit. For example, consider the FFT radix-2 decimation-in-frequency butterfly shown in Fig. 2. An implementation with discrete floating-point adders and multipliers requires ten operations (six additions and four multiplications). Alternatively, two fused dot-product operations and four additions can be used.
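The operation count above can be illustrated with a minimal Python sketch. It assumes the usual radix-2 decimation-in-frequency form x = a + b, y = (a - b) * w, and it models the FDP in software with exact rational arithmetic followed by one rounding. The names `fdp` and `dif_butterfly` are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def fdp(a, b, c, d):
    """Software model of Eq. (1): a*b + c*d computed exactly, then rounded once.

    Assumption: the hardware unit rounds only the final result.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))


def dif_butterfly(a, b, w):
    """Radix-2 DIF butterfly on (re, im) tuples: x = a + b, y = (a - b) * w."""
    # Four real additions/subtractions.
    x_re = a[0] + b[0]
    x_im = a[1] + b[1]
    t_re = a[0] - b[0]
    t_im = a[1] - b[1]
    # Two fused dot products replace four multiplications and two additions.
    # Negating t_im is only a sign flip, not an extra arithmetic operation.
    y_re = fdp(t_re, w[0], -t_im, w[1])
    y_im = fdp(t_re, w[1], t_im, w[0])
    return (x_re, x_im), (y_re, y_im)


# a = 1 + 2j, b = 3 - 1j, twiddle factor w = -j
x, y = dif_butterfly((1.0, 2.0), (3.0, -1.0), (0.0, -1.0))
print(x, y)
```

The complex product (a - b) * w accounts for the four multiplications and two of the six additions in the discrete version, which is where the two FDP operations are substituted.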
- Less Security
- More Area and More Power
There are two approaches that can be taken with conventional floating-point adders and multipliers to realize the dot product. The parallel implementation shown in Fig. 2 uses two multipliers operating in parallel and an adder. The parallel approach is appropriate for applications where maximizing the throughput is more important than minimizing the area or the power consumption. The serial implementation shown in Fig. 3 uses a single adder and a single multiplier, with multiplexers and a register for intermediate results. It has lower throughput but also lower area and power consumption. Fig. 5 shows the architecture of a conventional single-path floating-point multiplier. Much of the multiplier is devoted to operations that can be shared if it is extended to perform the dot product.
In a conventional parallel implementation of the dot product (such as that shown in Fig. 2), two floating-point multipliers are used in addition to a floating-point adder, so three rounding operations are performed in generating the result. In the FDP unit shown in Fig. 6, a multiplier tree, an aligner and a 4:2 reduction tree are added to a conventional floating-point multiplier (FPM) to perform the dot-product operation. The remaining components of the FPM are used as is, which results in a significant area reduction compared to the conventional implementation.
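The effect of the three rounding operations can be seen in a small Python sketch. It assumes that the fused unit rounds only once, at the end, which is the usual behavior of fused units but is not stated explicitly above. The function names are hypothetical, and the fused path is modeled with exact rational arithmetic rather than with the hardware datapath.

```python
from fractions import Fraction


def dot_discrete(a, b, c, d):
    """Conventional datapath: each product is rounded, then the sum is rounded."""
    p = a * b      # rounding 1
    q = c * d      # rounding 2
    return p + q   # rounding 3


def dot_fused(a, b, c, d):
    """Assumed fused behavior: exact products and sum, one final rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d)
    return float(exact)


# a * b is exactly 1 - 2**-60, which does not fit in a double and rounds to 1.0.
a, b = 1.0 + 2.0**-30, 1.0 - 2.0**-30
c, d = -1.0, 1.0

discrete = dot_discrete(a, b, c, d)
fused = dot_fused(a, b, c, d)
print(discrete, fused)
```

With these operands the discrete path loses the result entirely to cancellation, while the single-rounding model keeps the small residual.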
The exponent compare circuit is shown in Fig. 7. Although it is not especially attractive, a system could use this unit to replace a floating-point adder and a floating-point multiplier. If operands B and D are set to one, the unit performs addition only. With simple data-forwarding multiplexers that let operands A and C skip the multiplication trees, the addition is slower than a discrete floating-point adder by only one multiplexer delay.
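As a brief sketch of the addition-only mode, the following Python model sets B and D to one and checks that the result matches an ordinary floating-point addition. It is a numerical model only: the forwarding multiplexers and their delay are not represented, and the names `fdp` and `add_via_fdp` are hypothetical.

```python
from fractions import Fraction


def fdp(a, b, c, d):
    """Software model of Y = A*B + C*D with a single final rounding (assumed)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c) * Fraction(d))


def add_via_fdp(a, c):
    """Addition-only mode: operands B and D are tied to 1.0."""
    return fdp(a, 1.0, c, 1.0)


samples = [(0.1, 0.2), (1e16, 1.0), (-3.5, 3.5), (2.0**-1074, 1.0)]
for a, c in samples:
    print(a, c, add_via_fdp(a, c), a + c)
```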
- More Security algorithm
- Less Area and less Power