RECENTLY, the deep neural network (DNN) emerged as a powerful tool for various applications including image classification and speech recognition. Since an enormous amount of vector-matrix multiplication computations are required in a typical DNN application, a variety of dedicated hardware for machine learning have been proposed to accelerate the computations. In a machine learning accelerator, a large number of multiply–accumulate (MAC) units are included for parallel computations, and timing-critical paths of the system are often found in the unit.
A multiplier typically consists of several computational parts including a partial product generation, a column addition, and a final addition. An accumulator consists of the carry-propagation adder. Long critical paths through these stages lead to the performance degradation of the overall system. To minimize this problem, various methods have been studied. The Wallace and Dadda multipliers are well-known examples for the achievement of a fast column addition, and the carry-look ahead (CLA) adder is often used to reduce the critical path in the accumulator or the final addition stage of the multiplier.
Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a partial sum that is the accumulation of the input multiplied by the weight. In a MAC unit, the multiply and accumulate operations are usually merged to reduce the number of carry-propagation steps from two to one. Such a structure, however, still comprises a long critical path delay that is approximately equal to the critical path delay of a multiplier.
It is well known that pipelining is one of the most popular approaches for increasing the operation clock frequency. Although pipelining is an efficient way to reduce the critical path delays, it results in an increase in the area and the power consumption due to the insertion of many flip-flops. In particular, the number of flip-flops tends to be large because the flip-flops must be inserted in the feed forward-cutset to ensure functional equality before and after the pipelining. The problem worsens as the number of pipeline stages is increased.
The main idea of this paper is the ability to relax the feed forward-cutset rule in the MAC design for machine learning applications, because only the final value is used out of the large number of multiply–accumulations. In other words, different from the usage of the conventional MAC unit, intermediate accumulation values are not used here, and hence, they do not need to be correct as long as the final value is correct. Under such a condition, the final value can become correct if each binary input of the adders inside the MAC participates in the calculation once and only once, irrespective of the cycle. Therefore, it is not necessary to set an accurate pipeline boundary.
Based on the previously explained idea, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator. The proposed design method reduces the area and the power consumption by decreasing the number of inserted flip-flops for the pipelining.
- Number of inserted flip flops increases the pipeline stages.
- Consumes larger area and high critical path delay.
- Power consumption is high.
MAC(Multiply Accumulate Unit) computation plays a important role in (DSP) Digital Signal Processing. The MAC is common step that computes the product of two numbers and add that product to an accumulator. Generally, the Pipelined architecture is used to improve the performance by reducing the length of the critical path. But, more number of flip flops are used when using the pipeline architecture that reduces the efficiency of MAC and increases the power consumption. On the basis of machine learning algorithm, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator. The proposed design method reduces the area and the power consumption by decreasing the number of inserted flip-flops for the pipelining when compared to the existing pipelined architecture for MAC computation. Finally, the proposed feed forward cutset free pipelined architecture for MAC is implemented in the VHDL and synthesized in XILINX and compared in terms of area, power and delay reports.
While the conventional pipelining method is advantageous because it effectively reduces the critical path delays, it leads mostly to an increase in the area and the power consumption due to the insertion of a large number of flip-flops. Moreover, the constraint that requires the insertion of the flip-flops according to the feed forward-cutset rule tends to significantly increase the overhead. This section proposes a method for the selective elimination of the flip-flops from the conventional pipeline boundary by exploiting the unique characteristics of the machine learning algorithm.
Proposed FCF Pipelining
Fig. 2 shows examples of the two-stage 32-bit pipelined accumulator (PA) that is based on the ripple carry adder (RCA). A[31 : 0] represents data that move from the outside to the input buffer register. A Reg[31 : 0] represents the data that are stored in the input buffer. S[31 : 0] represents the data that are stored in the output buffer register as a result of the accumulation. In the conventional PA structure [Fig. 2(a)], the flip-flops must be inserted along the feed forward-cutset to ensure functional equality. Since the accumulator in Fig. 2(a) comprises two pipeline stages, the number of additional flip-flops for the pipelining is 33 (gray-colored flip-flops). If the accumulator is pipelined to the n-stage, the number of inserted flip-flops becomes 33(n−1), which confirms that the number of flip-flops for the pipelining increases significantly as the number of pipeline stages is increased.
Fig. 2(b) shows the proposed FCF-PA. For the FCF-PA, only one flip-flop is inserted for the two-stage pipelining. Therefore, the number of additional flip-flops for the n-stage pipeline is n − 1 only.
In the conventional PA, the correct accumulation values of all the inputs up to the corresponding clock cycle are produced in each clock cycle as shown in the timing diagram of Fig. 2(a). A two-cycle difference exists between the input and the corresponding output due to the two-stage pipeline. On the other hand, in the proposed architecture, only the final accumulation result is valid as shown in the timing diagram of Fig. 2(b).
Fig. 3 shows examples of the ways that the conventional PA and the proposed method (FCF-PA) work. In the conventional two-stage PA, the accumulation output (S) is produced two clock-cycle after the corresponding input is stored in the input buffer. On the other hand, regarding the proposed structure, the output is generated one clock cycle after the input arrives. Moreover, for the proposed scheme, the generated carry from the lower half of the 32-bit adder is involved in the accumulation one clock cycle later than the case of the conventional pipelining. For example, in the conventional case, the generated carry from the lower half and the corresponding inputs are fed into the upper half adder in the same clock cycle as shown in the cycles 4 and 5 of Fig. 3 (left). On the other hand, in the proposed FCF-PA, the carry from the lower half is fed into the upper half one cycle later than the corresponding input for the upper half, as depicted in the clock cycles 3-5 of Fig. 3 (right). This characteristic makes the intermediate result that is stored in the output buffer of the proposed accumulator different from the result of the conventional pipelining case.
The proposed accumulator, however, shows the same final output (cycle 5) as that of the conventional one. In addition, regarding the two architectures, the number of cycles from the initial input to the final output is the same. The characteristic of the proposed FCF pipelining method can be summarized as follows: In the case where adders are used to process data in an accumulator, the final accumulation result is the same even if binary inputs are fed to the adders in an arbitrary clock cycle as far as they are fed once and only once.
In the machine learning algorithm, only the final result of the weighted sum of the multiplication between the input feature map and the filters is used for the subsequent operation, so the proposed accumulator would produce the same results as the conventional accumulator.
Meanwhile, the CLA adder has been mostly used to reduce the critical path delay of the accumulator. The carry prediction logic in the CLA, however, causes a significant increase in the area and the power consumption. For the same critical path delay, the FCF-PA can be implemented with less area and lower power consumption compared with the accumulator that is based on the CLA.
Modified FCF-PA for Further Power Reductions
Although the proposed FCF-PA can reduce the area and the power consumption by replacing the CLA, there are certain input conditions in which the undesired data transition in the output buffer occurs, thereby reducing the power efficiency when 2’s complement numbers are used. Fig. 4 shows an example of the undesired data transition. The inputs are 4-bit 2’s complement binary numbers. AReg[7 : 4] is the sign extension of AReg, which is the sign bit of AReg[3 : 0]. In the conventional pipelining [Fig. 4 (left)], the accumulation result (S) in cycle 3 and the data stored in the input buffer (AReg) in cycle 2 are added and stored in the output buffer (S) in cycle 4. In this case, the “1” in AReg in cycle 2 and the “1” in S in cycle 3 are added, thereby generating a carry. The carry is transmitted to the upper half of the S, and hence, S[7:4] remains as “0000” in cycle 4.
Pipelined MAC Unit
The column addition in the MAC operation is for the calculation of binary numbers in each addition stage using the half-adders and/or full adders and then for the passing of the results to the next addition stage. Since MAC computations are based on such additions, the proposed pipelining method can also be applied to the machine learning-specific MAC structure. In this section, the proposed pipelining method is applied to the MAC architecture by using the unique characteristic of Dadda multiplier. The Dadda multiplier performs the column addition in a similar fashion to the Wallace multiplier which is widely used, and it has less area and shorter critical path delay than the Wallace multiplier.
Fig. 7 shows the pipelined column addition structures in the Dadda multiplier. The Dadda multiplier performs the column addition to reduce the height of each stage. If a particular column already satisfies the target height for the next column addition stage, then no operation is performed during the stage . Using this property, the proposed pipelining method can be applied to the MAC structure as well. Fig. 7(a) is an example of pipelining where the conventional method is used. All of the edges in the feed forward-cutset are subject to pipelining. On the other hand, in the proposed FCF pipelining case [Fig. 7(b)], if a node in the column addition stage does not need to participate in the height reduction, it can be excluded from the pipelining [the group in the dotted box of Fig. 7(b)]. In other words, in the conventional pipelining method, all the edges in the feed forward-cutset must be pipelined to ensure functional equality regardless of a timing slack of each edge [Fig. 7(a)]. However, in the FCF pipelining method, some edges in the cutset do not necessarily have to be pipelined if the edges have enough timing slacks [Fig. 7(b)]. As a result, a smaller number of flip-flops are required compared with the conventional pipelining case. On the other hand, in the Wallace multiplier, as many partial products as possible are involved in the calculation for each column addition stage. Because the partial products do not have enough timing slack to be excluded from pipelining, the effectiveness of the proposed FCF pipelining method is smaller in the Wallace multiplier case than in the Dadda multiplier case.
Fig. 8 shows the block diagrams of pipelined MAC architectures. The proposed MAC architecture [Fig. 8(b)] combines the FCF-MAC (MAC with the proposed FCF pipelining) for the column addition and the MFCF-PA for the accumulation. Instead of pipelining all of the final nodes in the column addition stage as in Fig. 8(a), the proposed FCF-MAC architecture is used to selectively choose the nodes for the pipelining. For the proposed architecture, the merged multiply–accumulation style is adopted. The final adder is placed in the last stage of the MAC operation. In general, the final adder is designed using the CLA to achieve a short critical path delay. In contrast, the proposed design uses the MFCF-PA style in the accumulation stage in consideration of the greater power and the area efficiency of the MFCF-PA.
- Feed Forward Cutset Free technique decreases the Pipeline stages.
- Less area and shorter critical path delay when using the concept of DADDA multiplier.
- Power consumption is low.