Approximate circuits provide high performance and require low power. Sum-of-products (SOP) units are key elements in many digital signal processing applications. In this brief, three approximate SOP (ASOP) models based on distributed arithmetic are proposed. They are designed for different levels of accuracy. The first ASOP model achieves an improvement of up to 64% in area and 70% in power when compared with the conventional unit. The other two models provide improvements of 32% and 48% in area and 54% and 58% in power, respectively, with a reduced error rate compared with the first model. The third model achieves a mean relative error and normalized error distance as low as 0.05% and 0.009%, respectively. The performance of the approximate units is evaluated with a noisy image smoothing application, where the proposed models achieve a higher peak signal-to-noise ratio than existing state-of-the-art techniques. It is shown that the proposed approximate models achieve higher processing accuracy than existing works, together with significant improvements in power and performance.
Approximate computing provides an efficient solution for the design of power-efficient digital systems. For applications such as multimedia and data processing, approximate circuits are a promising alternative for reducing area and power in digital systems that can tolerate some loss of precision. Although they are key components in arithmetic circuits, sum-of-products (SOP) units have received less attention in terms of approximate implementation. Distributed arithmetic is a very efficient means of calculating the inner product of two vectors. It implements multiplication through a series of table lookups and shift-and-accumulate operations, so it requires no multiplier and provides an efficient mechanism for the SOP operation. In its basic form, distributed arithmetic is a bit-serial operation: it processes one bit position of the inputs at a time while handling all elements of the two vectors in parallel. Because the level of parallelism in the distributed arithmetic structure is flexible, the area-speed tradeoff can be adjusted, and bit-parallel versions of distributed arithmetic have been proposed. In this brief, three models of SOP units based on parallel distributed arithmetic are proposed. An approximate SOP (ASOP) model based on truncation has been discussed previously. That scheme simply truncates the number of lookup tables by eliminating the least significant part of the distributed arithmetic operation.
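The table-lookup and shift-and-accumulate idea described above can be illustrated with a short software model. This is only a sketch under stated assumptions, not the hardware design of this brief: the inputs are treated as unsigned integers, and the function name da_sum_of_products and its parameters are illustrative.

```python
def da_sum_of_products(a, b, width=16):
    """Inner product of a and b using only table lookups, shifts, and adds.

    Assumption: all elements are unsigned integers that fit in `width` bits.
    """
    k = len(a)
    # One table entry per combination of the k control bits:
    # entry idx holds the sum of every a[i] whose bit i is set in idx.
    table = [sum(a[i] for i in range(k) if (idx >> i) & 1)
             for idx in range(1 << k)]
    acc = 0
    for n in range(width):
        # Gather bit n of each b[i] into a k-bit table index.
        idx = 0
        for i in range(k):
            idx |= ((b[i] >> n) & 1) << i
        # Shift the looked-up partial sum to its bit weight and accumulate.
        acc += table[idx] << n
    return acc


a = [0b0011001000101110, 0b0001011000101011, 0b0010011001101000]
b = [0b0001001011101001, 0b0001101000101110, 0b0000101011101011]
print(da_sum_of_products(a, b) == sum(x * y for x, y in zip(a, b)))
```

For K = 3 the table has eight entries, and one lookup per bit position of b replaces three multiplications.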
Multipliers have been extensively studied for approximate implementation. One earlier work proposes two models of approximate compressors with reduced erroneous outputs to accumulate the partial products of the Dadda tree multiplier. A probability-based multiplier alters the partial products and reduces the generated partial product tree based on their probability. The partial product perforation (PPP) multiplier omits k partial products starting from the jth position, which in turn reduces the number of adders used in the accumulation of partial products.
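As a rough illustration of partial product perforation, the following sketch skips k partial products starting at position j during a shift-and-add multiplication. It is a behavioral model for unsigned operands only; the function name and the default width are assumptions made for this example.

```python
def perforated_multiply(a, b, j, k, width=16):
    """Shift-and-add multiply that omits partial products j .. j+k-1.

    Assumption: a and b are unsigned and b fits in `width` bits.
    """
    acc = 0
    for i in range(width):
        if j <= i < j + k:
            continue  # this partial product is perforated (never generated)
        if (b >> i) & 1:
            acc += a << i
    return acc


# With k = 0 nothing is skipped and the product is exact.
print(perforated_multiply(13, 11, j=0, k=0, width=8))
# Skipping the two least significant partial products loses 13 * 3.
print(perforated_multiply(13, 11, j=0, k=2, width=8))
```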
In this brief, novel ASOP designs are proposed using the efficient distributed arithmetic structure. Approximation involves changes to the word length, the number of lookup tables, and the number of elements in the final accumulator. Three models are proposed. The first model provides significant power reduction with low mean relative error (MRE) and normalized error distance (NED). The second and third models provide better accuracy at the cost of increased area and power compared with the first model. In the proposed approximate structures, reductions in the number of lookup tables, the length of the adders, and the accumulator size are employed for approximation. Compared with the exact SOP unit, the proposed models have reduced circuit complexity. NED is an effective metric for quantifying the approximation irrespective of the size of the circuit. The traditional MRE metric is also used to evaluate the impact of approximation. The error distance is the difference between the exact value and the approximate value, whereas the relative error is the error distance divided by the exact value. NED is calculated by normalizing the error distance by the maximum possible exact output. MRE is calculated as the mean of the relative errors over all possible values.
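The error metrics defined above can be written out as a small helper. This is a sketch under two assumptions that the text does not state explicitly: NED is taken as the mean error distance divided by the maximum exact output, and inputs whose exact value is zero are skipped when averaging the relative error. The function name error_metrics is illustrative.

```python
def error_metrics(exact, approx, max_exact):
    """Return (MRE, NED) for paired lists of exact and approximate outputs."""
    # Error distance: absolute difference between exact and approximate values.
    eds = [abs(e - x) for e, x in zip(exact, approx)]
    # Relative error: error distance divided by the exact value
    # (assumption: skip samples with an exact value of zero).
    res = [ed / abs(e) for ed, e in zip(eds, exact) if e != 0]
    mre = sum(res) / len(res)
    # Assumption: NED is the mean error distance over the maximum exact output.
    ned = (sum(eds) / len(eds)) / max_exact
    return mre, ned


mre, ned = error_metrics(exact=[100, 200], approx=[90, 200], max_exact=400)
print(f"MRE = {mre:.4f}, NED = {ned:.4f}")
```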
- More Mean Relative Error
- More Logic Size
- More Power and Delay
Recent research shows that approximate computing can provide a low bit error rate together with power-efficient, small-area designs in digital signal processing applications. As a key component of arithmetic operations, the sum-of-products (SOP) unit is a natural target for approximate implementation, for example in the calculation of inner products between vectors. Here, the approximate sum-of-products (ASOP) unit is designed in three variants for an SOP structure with K = 3 and N = 16, none of which uses a multiplier: 1) the generic method, ASOP1; 2) the priority encoder method, ASOP2; and 3) the multi-operation method with truncated 18 − m bits, ASOP3. This work implements ASOP1, ASOP2, and ASOP3 in an FIR filter design and compares the performance of the three methods. The logic is designed in VHDL and implemented on a Xilinx FPGA-S6lx9, and the performance is reported in terms of area, delay, and power.
Proposed Approximate Sum-of-Products Model ASOP1:
In approximate model 1, K is 3 and N is reduced: m bits at the least significant part of ak and bk, for k = 1, 2, and 3, are truncated. Versions with m = 8, 6, and 4 bits are implemented. This implementation requires three two-input (16 − m)-bit adders, one three-input (16 − m)-bit adder, 16 − m lookup tables with eight cases each, and a final accumulator with 16 − m elements. This considerably reduces hardware utilization at all levels. The approximate model with reduced elements is shown in Fig. 2. Implementing the computation with limits m to N − 1 alone reduces the number of lookup tables to 16 − m, and 16 − m elements are sent to the final accumulator, which is then of size (16 − m) × 18. In ASOP1, the number of input bits to the adders is reduced as well, which further reduces the complexity of the accumulator to (16 − m) × (18 − m).
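A behavioral sketch of ASOP1 for K = 3 may help to make the truncation concrete. It assumes unsigned inputs and rescales the result by 2^(2m) so that it can be compared with the exact SOP. Both choices are assumptions made for illustration, as is the function name asop1.

```python
def asop1(a, b, m, width=16):
    """Truncation-based approximate SOP for three unsigned input pairs."""
    # Drop the m least significant bits of every a_k and b_k.
    a1, a2, a3 = (x >> m for x in a)
    b_t = [x >> m for x in b]
    # Eight-case lookup table: three two-input sums and one three-input sum.
    table = [0, a1, a2, a1 + a2, a3, a1 + a3, a2 + a3, a1 + a2 + a3]
    acc = 0
    for n in range(width - m):  # only 16 - m lookups remain
        idx = (((b_t[0] >> n) & 1)
               | (((b_t[1] >> n) & 1) << 1)
               | (((b_t[2] >> n) & 1) << 2))
        acc += table[idx] << n
    # Assumption: restore the weight removed by truncating both operands.
    return acc << (2 * m)


a = [0b0011001000101110, 0b0001011000101011, 0b0010011001101000]
b = [0b0001001011101001, 0b0001101000101110, 0b0000101011101011]
exact = sum(x * y for x, y in zip(a, b))
for m in (4, 6, 8):
    approx = asop1(a, b, m)
    print(m, approx, (exact - approx) / exact)
```

Under the unsigned assumption the approximate result never exceeds the exact one, because truncation only removes non-negative terms.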
Proposed Approximate Sum-of-Products Model ASOP2:
ASOP2 is similar to ASOP1, with the addition of an m-bit leading one predictor. This increases the accuracy and makes the model more suitable for DSP applications, as discussed later in this section. In our method, leading one prediction for ak and bk, k = 1, 2, and 3, requires an OR operation over the most significant m bits of ak and of bk, followed by a priority encoder. The function of the OR gates is amOR = a1m|a2m|a3m and bmOR = b1m|b2m|b3m, where akm (or bkm) denotes the most significant m bits of the kth element, for m = 4, 6, or 8. After the leading one prediction, the ASOP1 structure is used to compute the elements starting from the leading one position.
For example, consider the input elements a1 = “0011001000101110,” a2 = “0001011000101011,” a3 = “0010011001101000,” b1 = “0001001011101001,” b2 = “0001101000101110,” and b3 = “0000101011101011.” For m = 4, amOR = 0011, so the leading one predictor finds zeros in the first two bits (bit positions “15” and “14”) of a1, a2, and a3. The 12-bit (16 − m) fields from bit position “13” down to “2” of a1, a2, and a3 (“110010001011,” “010110001010,” and “100110011010”) are taken and fed to the inputs of the lookup tables.
For m = 4, bmOR = 0001, so the leading one predictor finds zeros in the first three bits (bit positions “15,” “14,” and “13”) of b1, b2, and b3. The 12-bit (16 − m) fields from bit position “12” down to “1” of b1, b2, and b3 (“100101110100,” “110100010111,” and “010101110101”) are taken and fed as the control signals of the lookup tables. The overall structure of ASOP2 is given in Fig. 2, where LZA refers to the leading zeros in amOR and LZB refers to the leading zeros in bmOR. ASOP2 reduces the negative effects of truncation, especially when there is information only in the least significant parts of the inputs. In DSP applications, pixel values are highly correlated, and the numbers of initial zeros of ak and bk for k = 1, 2, 3 have a high chance of being the same. Using OR gates to combine the elements and applying a single leading one predictor afterward reduces the hardware resources required.
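The window selection in the worked example above can be reproduced with a short behavioral model. The helper names leading_zeros_top and select_window are illustrative, and the leading-zero count is assumed to be limited to the m bits seen by the priority encoder.

```python
def leading_zeros_top(values, m, width=16):
    """Leading zeros (0..m) in the OR of the top m bits of all values."""
    ored = 0
    for v in values:
        ored |= v >> (width - m)  # top m bits of each element
    lz = 0
    while lz < m and ((ored >> (m - 1 - lz)) & 1) == 0:
        lz += 1
    return lz


def select_window(values, m, width=16):
    """Pick the (width - m)-bit field that starts at the predicted leading one."""
    lz = leading_zeros_top(values, m, width)
    shift = m - lz                    # low-order bits that are still dropped
    mask = (1 << (width - m)) - 1
    return lz, [(v >> shift) & mask for v in values]


a = [0b0011001000101110, 0b0001011000101011, 0b0010011001101000]
b = [0b0001001011101001, 0b0001101000101110, 0b0000101011101011]
lza, a_win = select_window(a, m=4)
lzb, b_win = select_window(b, m=4)
print(lza, [format(v, "012b") for v in a_win])
print(lzb, [format(v, "012b") for v in b_win])
```

For these inputs the model gives LZA = 2 and LZB = 3 and the same 12-bit fields as listed in the example, so only 2 bits of each ak and 1 bit of each bk are lost instead of 4.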
Proposed Approximate Sum-of-Products Model ASOP3:
In ASOP1, the least significant m = 8, 6, or 4 bits are truncated: m bits are truncated from the 18-bit outputs of the lookup table contents, and the m control signals b1n, b2n, and b3n of the lookup tables for n = 0, 1, …, m − 1 are also truncated. In ASOP3, approximation is employed instead of truncation. The lookup table output contents are divided into 18 − m bits and m bits, and the inputs b are divided into a group of 16 − m bits and a group of m bits. ASOP1 is used for the first group of 16 − m bits. For the group of the least significant m bits of bk, k = 1, 2, 3, the control signals are grouped in pairs, so that the m lookup tables are reduced to m/2 tables. The additional hardware required for ASOP3 is given in Fig. 3.
For example, consider the input elements a1 = “0011001000101110,” a2 = “0001011000101011,” a3 = “0010011001101000,” b1 = “0001001011101001,” b2 = “0001101000101110,” and b3 = “0000101011101011.” For m = 4, a23, a13, a12, and a123 are calculated. All bits except the least significant m bits are then given to the ASOP1 structure, and the 12-bit (16 − m) fields starting from the most significant bit of b1, b2, and b3 are taken and fed as the control signals of the lookup tables.
For the least significant bits calculation, the least significant m bits of a23, a13, a12, and a123 are used as inputs to the lookup tables. The number of lookup tables is reduced by half by ORing each pair of control signals. In this example, for the lookup table of n = 1 | 0, the control signals would be 111.
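The pairing of control signals in the least significant group can be sketched as follows. The model covers only the ORing step that halves the number of lookup tables, not the full ASOP3 datapath, and the function name paired_controls is an assumption for this example.

```python
def paired_controls(b, m):
    """OR adjacent control bits (n+1 | n) in the low m bits of each b_k.

    Returns one tuple (b1, b2, b3) of control signals per remaining
    lookup table, starting with the pair n = 1 | 0.
    """
    pairs = []
    for n in range(0, m, 2):
        pairs.append(tuple(((x >> n) | (x >> (n + 1))) & 1 for x in b))
    return pairs


b = [0b0001001011101001, 0b0001101000101110, 0b0000101011101011]
# m = 4 control bits per input are merged into m/2 = 2 control triples.
print(paired_controls(b, m=4))
```

For the example inputs the first triple is (1, 1, 1), which matches the control signals 111 stated for the lookup table of n = 1 | 0.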
FIR Filter Design:
An FIR filter can be designed with any type of multiplier. In recent years, the MCM technique has frequently been used in proposed FIR filter designs, but its drawback is that one MCM block does not handle both signed and unsigned operation, so separate MCM blocks must be designed for signed and unsigned multiplication. Here, we propose an MCM with a rounding-based approximate multiplier that supports both signed and unsigned operation in a single multiplier. This multiplier is implemented in the FIR filter, and its efficiency in terms of area, power, and delay is shown.
Filter coefficients very often remain constant and are known a priori in signal processing applications. This feature has been utilized to reduce the complexity of realizing the multiplications. Several designs have been suggested by various researchers for the efficient realization of FIR filters with fixed coefficients using distributed arithmetic (DA) and multiple constant multiplication (MCM) methods. DA-based designs use lookup tables (LUTs) to store precomputed results and thereby reduce the computational complexity. The MCM method, on the other hand, reduces the number of additions required to realize the multiplications by sharing common subexpressions when a given input is multiplied by a set of constants. The MCM scheme becomes more effective as a common operand is multiplied by a larger number of constants. Therefore, the MCM scheme is suitable for the implementation of large-order FIR filters with fixed coefficients. However, MCM blocks can be formed only in the transpose form configuration of FIR filters.
The block-processing method is popularly used to derive high-throughput hardware structures. It not only provides a throughput-scalable design but also improves the area-delay efficiency. The derivation of a block-based FIR structure is straightforward when the direct-form configuration is used, whereas the transpose form configuration does not directly support block processing. However, to take advantage of the computational savings of MCM, the FIR filter must be realized in the transpose form configuration. In addition, transpose form structures are inherently pipelined and are expected to offer a higher operating frequency to support a higher sampling rate.
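To show why the transpose form suits MCM, the sketch below models a transpose-form FIR filter in which every incoming sample is multiplied by all coefficients at once. It is a plain behavioral model with ordinary multiplications standing in for the MCM block; the function name fir_transpose is illustrative.

```python
def fir_transpose(samples, coeffs):
    """Transpose-form FIR filter, processed one input sample at a time."""
    n_taps = len(coeffs)
    regs = [0] * (n_taps - 1)  # delay registers between the adders
    out = []
    for x in samples:
        # The same operand x is multiplied by every constant coefficient;
        # this is the set of products an MCM block would generate.
        products = [c * x for c in coeffs]
        y = products[0] + (regs[0] if regs else 0)
        # Each register takes its product plus the previous value of the
        # next register (ascending order keeps the old values intact).
        for i in range(n_taps - 2):
            regs[i] = products[i + 1] + regs[i + 1]
        if regs:
            regs[-1] = products[-1]
        out.append(y)
    return out


# An impulse at the input reproduces the coefficients at the output.
print(fir_transpose([1, 0, 0, 0], [1, 2, 3]))
```

In the direct form, by contrast, each coefficient multiplies a different delayed sample, so there is no common operand to share.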
- Less Mean Relative Error
- Less Logic Size
- Less Power and Delay