MBS: A High-Precision Approximation Method for Softmax and Efficient Hardware Implementation
Abstract:
The softmax function is invoked frequently in the multi-head attention layers of Transformer networks. Because Transformers have higher computational complexity than DNNs and other networks, they place stricter demands on both the accuracy and the hardware performance of softmax computation. We therefore propose mixed-base softmax (MBS), to our knowledge the first approximation of the softmax function that combines exponential functions with bases 2 and 4, both of which are advantageous for hardware implementation. MBS closely matches the softmax function and delivers strong inference performance in Transformer networks. Through algorithmic transformation and hardware optimization, we design a low-complexity, highly parallel hardware architecture that occupies only a small amount of additional hardware resources compared with base-2 softmax while achieving higher accuracy. Experimental results show that, under TSMC 90 nm CMOS technology at a frequency of 0.5 GHz, our design achieves an efficiency of 236.18 G/(s·mm²·mW) with an area of 324 μm². Furthermore, MBS exhibits higher computational accuracy and inference precision than base-2 softmax.
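To make the mixed-base idea concrete, below is a minimal NumPy sketch of one way base-2 and base-4 exponentials can combine to approximate e^x: rewrite e^x as 2^(x·log2 e), realize the integer part of the exponent as a shift, and handle the quantized fractional part r with a base-4 term (4^(r/2) = 2^r). The decomposition, the frac_bits parameter, and the function names are our illustration under these assumptions, not the paper's exact MBS formulation.

```python
import numpy as np

LOG2E = np.log2(np.e)  # ~1.4427, converts e^x to a base-2 exponent

def softmax(x):
    """Reference softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max())
    return e / e.sum()

def base2_softmax(x):
    """Base-2 approximation: replaces e^x with the shift-friendly 2^x."""
    p = np.exp2(x - x.max())
    return p / p.sum()

def mbs_softmax(x, frac_bits=4):
    """Illustrative mixed-base softmax sketch (assumed scheme, not the
    paper's exact method): e^x == 2^(x*log2 e); the integer part of the
    exponent maps to a hardware shift, and the fractional part, quantized
    to `frac_bits` bits, is realized as a base-4 power (4^(r/2) == 2^r)."""
    u = (x - x.max()) * LOG2E
    k = np.floor(u)                                       # shift amount
    r = np.round((u - k) * 2**frac_bits) / 2**frac_bits   # quantized fraction
    p = np.exp2(k) * 4.0 ** (r / 2.0)                     # 2^k * 4^(r/2)
    return p / p.sum()

# Quick comparison against the exact softmax.
x = np.array([3.2, 1.0, 0.5, 2.7])
print(softmax(x))
print(base2_softmax(x))
print(mbs_softmax(x))
```

In such a scheme, accuracy is controlled by how finely the fractional exponent is quantized; the paper's hardware-level decomposition and coefficient choices may differ.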