## Description

**Existing System:**

There is a persistent demand for higher computational performance at low energy cost for emerging applications. It is unlikely that improvements from manufacturing processes alone, such as technology nodes or many-core system-on-chip, will be able to cope with this challenge. Thus there is a genuine need to develop disruptive design approaches to achieve transformational energy reductions. Approximate computing systems design is a promising approach to this end [1]. The basic premise of approximate computing is to replace traditional complex and energy-wasteful data processing blocks by low-complexity ones with reduced logic counts. As a result, effective chip area and energy consumption are reduced at the cost of imprecision introduced to the processed data. Research has shown that the majority of modern applications such as digital signal processing, computer vision, robotics, multi-media and data analytics have some level of tolerance to such imprecision [2]. This can be leveraged as an opportunity for energy-efficient systems design for current and future generations of application-specific systems.

Multipliers are crucial arithmetic units in many of these applications, for two major reasons. Firstly, they are characterized by complex logic design, being one of the most energy demanding data processing units in modern microprocessors. Secondly, compute-intensive applications typically exercise a large number of multiplication operations to compute outcomes. These factors have prompted close attention in approximate multiplier design research, since improvements made in the power/speed of a multiplier are expected to substantially impact on overall system power/performance trade-offs.

Existing approximate multiplier designs can be largely categorized as modifications of either timing or functional behavior. Firstly, timing behavior can be modified using aggressive supply voltage scaling techniques. Operating below nominal voltage allows for reductions in energy consumption at the cost of timing-induced errors. These errors cannot be rigorously bounded, and so extra error compensation circuits need to be incorporated. Secondly, functional modifications deal with logic reduction techniques and can be performed by relaxing the need for accurate Boolean equivalence in favor of energy and circuit area reductions. For example, truncating multiplier product terms allows for the elimination of some of the least significant partial product terms. As more columns are eliminated, further energy reduction is achieved; however, errors also increase. Modular re-design with low-complexity combinational logic is another effective technique. This allows for building larger energy-efficient multipliers using small approximate ones; however, the hierarchical organization of small approximate blocks will eventually propagate errors, which increase with the multiplier size. A software-based perforation technique has been proposed that obtains an optimized set of partial product terms based on power-area-accuracy trade-offs. Automated design approaches present design flows for generating approximate circuits using circuit activity profiles and quality bounds, and an evolutionary design process based on Cartesian Genetic Programming (CGP) has been utilized to implement approximate multipliers. A number of power- and area-efficient multiplier redesign approaches have been proposed by changing the functional behavior. These changes extend from the architecture level to the transistor level. The key principle of the above studies is to achieve reduced logic complexity, which is also the main aim of our work.

A typical (N × N) accurate multiplier generates $N^2$ product terms, which are then accumulated as a final product of size 2N. The accuracy of this product depends largely on the significance of bits; preserving higher-significance bits is likely to generate an outcome closer to the exact product than preserving lower-significance bits. This can be exploited to progressively compress higher order combinatorial terms systematically and to achieve substantial energy savings at low loss of accuracy. In our work, we leverage this opportunity to make the following key contributions:

1) We propose a novel energy-efficient approximate multiplier design approach using bit significance-driven logic compression (SDLC).

2) At the core of our approach is a configurable logic clustering of product terms appropriately chosen for a given energy-accuracy trade-off, followed by remapping using their commutative properties to reduce the resulting number of product terms.

3) We demonstrate the comparative gains (with up to an order of magnitude energy reduction) through the design and synthesis of multipliers of different sizes (from 4 to 128 bits). Furthermore, we implement the multiplier in a real case-study image processing application to highlight its key advantages.

**Disadvantages:**

- Consumes more power.
- Efficiency is low.

**Proposed System:**

Approximate computing reduces design complexity in a digital system and provides efficient solutions that improve its performance. Compared with an exact computing device, an approximate computing device reduces design complexity, power consumption, delay and area. This work describes highly efficient approximate multipliers based on the bit significance-driven logic compression (SDLC) concept. The proposed multiplier rests on two ideas: lossy compression through logic clustering, and remapping of the partial products. Rows of the partial product matrix are lossily compressed by logic clusters according to their progressive bit significance. The number of product rows is then reduced by remapping the resulting rows of the partial product matrix. As a result, the logic cell count of the multiplier and the length of its critical path are drastically reduced compared with conventional exact multipliers. The proposed multiplier designs are described in VHDL, synthesized in Xilinx 14.2, and compared in terms of area, power and delay.

**Proposed Approximate Multiplier Design**

Our proposed approach consists of two major steps. In the first, lossy compression is carried out through logic clustering. The resulting compressed terms are then remapped using their commutative properties. These steps, together with the variable compression method, are described below.

1) Logic Compression: Parallel multiplication design is generally divided into three consecutive stages: partial product formation, accumulation, and carry propagation addition. In an (N × N) multiplier, $N^2$ AND gates are utilized in parallel to generate the partial product bit-matrix. This matrix is then column-wise accumulated to generate the final product by using carry propagation adders. The proposed approach begins by generating all partial products using the same number of AND gates, similar to conventional multiplication. Before proceeding to the accumulation stage, the number of bits in the partial product matrix is reduced by performing lossy logic compression. The aim is to reduce the number of rows in the partial product matrix, thereby achieving low-complexity hardware before proceeding to accumulation. Figure 1 shows the difference between the design stages in accurate and the proposed multiplication. The shaded box highlights the contribution in this paper. To achieve lossy compression, we follow three key principles as follows.

a. Clustering a group of rows: The proposed multiplier organizes the partial product terms using different sizes of significance-driven logic clusters. Each logic cluster targets a group of columns containing two bits, starting from the least significant bits in successive partial products. In general, each 2 × L logic cluster is responsible for two operations: i) generating 2L partial product bits within two contiguous rows, i.e., L pairs of vertically aligned bits, by utilizing 2L AND gates; then ii) reducing these 2L bits by half using L OR gates. Figure 2 illustrates the utilization of four sizes of logic clusters in an 8-bit parallel multiplier. The first 2 × 7 logic cluster forms 14 partial products by utilizing 14 AND gates and extracts a 7-bit value by using an array of 7 OR gates. The second 2 × 6 logic cluster reduces 12 partial products to 6 bits. In a similar way, the third and fourth logic clusters use 2 × 5 and 2 × 4 to reduce 10 and 8 partial products to 5 and 4 bits respectively. By doing so, each logic cluster compresses a group of vertically aligned bits within two successive partial products based on their progressive bit significance.

b. Generation of a reduced set of product terms: Using an array of OR gates in each logic cluster compresses the partial product terms by half. A reduced, preprocessed partial product matrix is thus ready to be accumulated by applying any convenient scheme of multiplication, such as carry-save array, Wallace or Dadda tree. A two-input OR gate is sufficient to sum two bits when at most one of them is ‘1’, i.e., ‘0’+‘1’ = ‘1’+‘0’ = ‘0’ OR ‘1’ = ‘1’ OR ‘0’ = ‘1’, and also ‘0’+‘0’ = ‘0’ OR ‘0’ = ‘0’. However, the OR gate fails to give an accurate sum if both inputs are ‘1’: the adder returns ‘10’ while the OR gate outputs ‘1’, a difference of ‘1’.

c. Significance-driven progressive cluster sizing: Since the main goal is to design a power-efficient multiplier with negligible loss of accuracy, the size of the logic clusters is decreased when going down in the partial product matrix. The more significant bits are treated with progressively higher precision, while bits with lower significance are compressed using the SDLC approach. This permits the most significant product terms to be accumulated on a carry-propagation basis as in the conventional multiplier. Thus, the accuracy of the significant bits of the final product is less affected.

Despite using the same number of AND gates as the accurate multiplier, this approach will deterministically reduce the hardware complexity of partial product accumulation, e.g., the count of the compressor cells needed in column compression multiplication for Wallace and Dadda cases, and also the number of half and full adders in the carry-save array will be decreased since the number of bits in the accumulation tree is minimized.
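The logic compression step can be sketched as a small functional model in Python. This is a minimal illustration, not the authors' implementation: the function name `sdlc_multiply` and the exact placement of each cluster (cluster `k` covers the `n - 1 - k` lowest columns where rows `2k` and `2k + 1` overlap, which gives the 2 × 7, 2 × 6, 2 × 5 and 2 × 4 clusters for `n = 8`) are assumptions inferred from the description above, and `n` is assumed to be even.

```python
def sdlc_multiply(a: int, b: int, n: int = 8) -> int:
    """Approximate n x n product using 2-row OR-based logic compression.

    Assumed cluster placement: for row pair k (rows 2k and 2k+1), the
    n-1-k lowest overlapping columns are compressed with OR gates; all
    remaining partial-product bits are accumulated exactly.
    """
    def pp(i: int, j: int) -> int:
        # Partial-product bit a_i AND b_j (0 if i lies outside the operand).
        if 0 <= i < n:
            return (a >> i) & (b >> j) & 1
        return 0

    product = 0
    for k in range(n // 2):
        top, bot = 2 * k, 2 * k + 1            # the two rows of cluster k
        length = n - 1 - k                     # 7, 6, 5, 4 for n = 8
        clustered = range(bot, bot + length)   # columns compressed by OR gates
        for col in range(top, bot + n):        # every column these rows touch
            x = pp(col - top, top)             # bit of the upper row in this column
            y = pp(col - bot, bot)             # bit of the lower row in this column
            if col in clustered:
                product += (x | y) << col      # lossy: OR replaces the 1-bit sum
            else:
                product += (x + y) << col      # exact accumulation
    return product


if __name__ == "__main__":
    for a, b in [(1, 200), (3, 3)]:
        print(a, b, a * b, sdlc_multiply(a, b))
```

For operands 3 and 3 the model returns 7 rather than 9: the two bits in column 1 are both ‘1’, so the OR gate loses one unit of weight 2. Operands whose vertically aligned bits never collide, such as 1 and 200, are multiplied exactly.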

2) Commutative Remapping: The logic compression step (Section II-1) reduces the number of partial product terms. This reduction can be leveraged to reduce the number of rows prior to the accumulation stage. This can be achieved by remapping the partial product terms based on the commutative property of the bits, i.e., bits with the same weight are gathered in the same column. Due to the reduced number of rows, the critical path delay is drastically reduced (see Section IV). Figure 3 demonstrates how the size of the partial product bit-matrix in the case of an (8 × 8) multiplier is reduced using the SDLC approach. The lined boxes refer to a group of bits targeted by different sizes of logic clusters in which the height of the critical column is reduced by half.
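The effect of remapping on the accumulation tree can be checked by counting how many bits of each weight remain after compression, since gathering bits of equal weight into one column means the number of rows needed equals the tallest column. The sketch below is illustrative only; `column_heights` is a hypothetical helper, and the cluster placement (cluster `k` covers the `n - 1 - k` lowest overlapping columns of rows `2k` and `2k + 1`) is an assumption inferred from the text, with `n` assumed even.

```python
def column_heights(n: int = 8, compress: bool = True) -> list:
    """Number of partial-product bits per column (weight) of an n x n multiplier.

    With compress=True, each clustered column of a row pair contributes one
    OR-ed bit instead of two (assumed cluster placement, see lead-in).
    """
    heights = [0] * (2 * n)
    for k in range(n // 2):
        top, bot = 2 * k, 2 * k + 1
        clustered = range(bot, bot + (n - 1 - k)) if compress else range(0)
        for col in range(top, bot + n):
            in_top = top <= col < top + n      # upper row has a bit in this column
            in_bot = bot <= col < bot + n      # lower row has a bit in this column
            bits = 1 if col in clustered else int(in_top) + int(in_bot)
            heights[col] += bits
    return heights


if __name__ == "__main__":
    exact = column_heights(8, compress=False)
    sdlc = column_heights(8, compress=True)
    # After remapping, the number of rows equals the tallest column.
    print(max(exact), max(sdlc))   # prints "8 4"
```

Under these assumptions the critical column of the (8 × 8) case drops from 8 bits to 4, which is consistent with the halving described above, and the total number of bits to accumulate drops from 64 to 42.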

The proposed approach is scalable for any (N × N) multiplier, as shown in Algorithm 1. This algorithm generates a reduced and ordered partial product bit-matrix, which can then be treated as an accumulation tree by any scheme of multiplication. Line (9) indicates how partial product bits are compressed using logic clusters. The main loop (lines 6 to 17) is responsible for remapping product terms in an ordered bit-matrix, as demonstrated in Figure 3.

Figure 5: Dot notation shows the two major steps in the SDLC approach in the case of an (8 × 8) multiplier: (a) clustering a group of rows in the partial product bit-matrix after bitwise multiplication; (b) generating a reduced set of product terms after applying logic compression; (c) ordered matrix after applying commutative remapping of the bit sequence resulting from the SDLC approach. The dotted rectangles indicate the height of the critical column, which is reduced by half compared to the accurate accumulation tree.

3) Variable Logic Cluster Approach: The proposed approach is capable of achieving higher degrees of compression by increasing the logic cluster depth. Figure 4 demonstrates the impact of increasing the depth to 3 and 4 bits in the case of an (8 × 8) multiplier, showing the key steps in logic compression and commutative remapping. As can be seen, with increased depth we can achieve further reduction in the partial product terms, leading to fewer rows for final accumulation.

High-performance adders are essential in fast computer arithmetic operations. This is especially important in massive data processing applications such as digital signal processing, image processing, graphics, and other on-line data crunching operations where the speed of the sum operation is crucial. The speed of an adder depends almost entirely on the carry propagation delay. In carry look-ahead adders this propagation delay is substantially reduced; nevertheless, because of the increase in both gate count and overall modular delay, look-ahead modules with more than four bits are not practical. For more than two decades, a dominant and more effective procedure for high-speed addition has been the use of carry generate and carry propagate terms in ripple-carry adders [1-7]. Specifically, for given operands $A = (a_{n-1}, \ldots, a_2, a_1, a_0)$ and $B = (b_{n-1}, \ldots, b_2, b_1, b_0)$ we create two functions, generate, $G_i = a_i \cdot b_i$, and propagate, $P_i = a_i \oplus b_i$, for each pair of bits $a_i$ and $b_i$ (see Fig. 1). Then the carry of the $i$th stage, $C_i$, may be expressed as

$$C_i = G_i + P_i \cdot C_{i-1}$$

Now, we have not only cut down the time for a single carry propagation considerably, but we have also established a criterion for the distance that a carry might propagate. More clearly, if $T_a$ is the total time for adding two numbers A and B, then $T_a = n_p \cdot T_p$, where $n_p$ is the Maximum Propagation Length (MPL) of a carry and $T_p$ is a single-bit propagation delay. The effect of $P_i$ and $G_i$ in reducing the MPL has long been investigated in the literature [1, 2, 4], and we are not going to discuss it here. Reducing $T_p$ is another means to speed up the carry propagation, and this is the subject of this correspondence. We start with the Manchester carry adder [1], where the carry nodes are lined up in a chain of pass transistors, as shown in Fig. 2.

During the precharge period of the clock, nodes $C_0$, $C_1$, $C_2$, and $C_3$ are precharged by the PMOS transistors, and during the evaluation period, depending on the logical value of $P_i$ and $G_i$, the node $C_i$ is either pulled down to zero, or it stays charged, or it is charged to the level of the preceding node. Typically, we do not have any problem with the two former cases; it is, however, the latter case which adds to the overall delay in the sum operation. Figure 6 shows the SPICE simulation results of a 4-bit Manchester carry chain when the MPL is maximized, i.e., $n_p = 4$ (or $P_i = 1$ for every $i$), and $C_0 = 0$. Pomper et al. [7] have proposed a solution for this worst-case propagation delay by providing a bypass chain from $C_0$ to the end of the chain.

**Selective Pre-charge Technique**

In this article we present a new precharging technique, called the selective pre-charging (SPC) scheme. We show that by using the SPC technique we substantially reduce the propagation delay associated with a single bit position, $T_p$. In the previous section we showed the worst-case delay in a 4-bit Manchester carry chain; this happens when a zero carry is going to propagate through the nodes already pre-charged (set at …

**Implementation of selective pre-charging**

The selective pre-charging technique and its effect on reducing the switching time in a Manchester carry chain were discussed in the previous section. In this section we investigate the implementation of the selective pre-charging technique for carry propagation in Manchester carry adders. Figure 3 shows a circuit for creating the carry functions and the sum term for a pair of bits, $a_i$ and $b_i$. First, notice that instead of creating two carry functions, namely propagate ($P_i$) and generate ($G_i$), we create three functions in this configuration: propagate ($P_i$), 1-generate ($G_i$), and 0-generate ($Q_i$), where $G_i = a_i \cdot b_i$, $Q_i = \overline{a_i + b_i}$, and $P_i = a_i \oplus b_i$. The choice of three functions instead of two is essential in this technique, because there is no absolute precharging assigned to every node within the carry chain. However, it must be noted that we have created these three functions with the same hardware required for the original two; i.e., no extra silicon area is spent on creating the extra function $Q_i$. Other important features of these carry functions are as follows: i) $Q_i \wedge G_i = 0$, $G_i \wedge P_i = 0$, and $P_i \wedge Q_i = 0$, and ii) $Q_i \vee G_i \vee P_i = 1$. In other words, for any given values of $a_i$ and $b_i$, one and only one of the carry functions $P_i$, $G_i$, or $Q_i$ is true and the other two are false.

Now that the carry functions are created, it remains to properly implement them in a carry chain circuit similar to the Manchester carry chain. Figure 4 shows a modified Manchester carry chain circuit developed for alternating pre-charging and pre-discharging of nodes in the chain. In order to remove the effect of signal delays and clock skew in this design we have adopted a two-phase non-overlapping clock system, $\Phi_1$ and $\Phi_2$, as shown in Fig. 5. During the off period of $\Phi_2$ the function $X_i = P_i \cdot \Phi_2$ (see Fig. 3) is low, and this stops the transmission gates within the chain from conducting. As a result, the nodes $C_0$, $C_1$, $C_2$, and $C_3$ are electrically disconnected from each other. On the other hand, during the on period of the clock $\Phi_1$ the nodes $C_0$, $C_1$, $C_2$, and $C_3$ are selectively pre-charged/pre-discharged, as illustrated in Fig. 4. Notice that the carry chain circuit is designed so that selective pre-charging takes place regardless of the logical status of the carry functions $Q_i$ and $G_i$. This is called the selective pre-charging (SPC) period. During the evaluation period (on period of $\Phi_2$) one of the following three cases might happen for each node (say $C_i$):

1- The 0-generate function $Q_i$ is high (and the two other functions, $G_i$ and $P_i$, are low); therefore, $C_i = 0$.

2- The 1-generate function $G_i$ is high; therefore, $C_i = 1$.

3- The propagate function $P_i$ is high, which leads to $X_i = P_i \cdot \Phi_2$ also being high, and the propagation is from $C_{i-1}$ to $C_i$, resulting in $C_i = C_{i-1}$.

Cases 1 and 2 represent a single category, which is carry generate. There is no propagation involved in this category, and therefore the result is almost immediate. It is, however, case 3 which creates delay through propagation. Again we consider the worst case which hap

**Error Analysis**

A number of simulations are carried out to examine the impact of error on the proposed approach for different sizes of multiplier. Several error metrics have been discussed in [16] and [17] for evaluating the effectiveness and quantifying the errors of approximate adders and multipliers. For any (N × N) approximate multiplier, the error distance (ED) is defined as the arithmetic difference between the accurate product ($P$) and the erroneous product ($P'$), i.e., $ED = |P - P'|$. The relative error distance (RED) is the ratio of ED over the accurate output, i.e., $RED = \frac{ED}{P} = \frac{|P - P'|}{P}$. The error rate (ER) is defined as the ratio of incorrect outputs with respect to the total number of outputs.

Figure 10: Dot notation showing the impact of increasing the depth of the logic clusters in the case of an (8 × 8) multiplier: (a) clustering a group of bits within three successive rows in the partial product bit-matrix after bitwise multiplication; (b) generating a reduced set of product terms after targeting the depth of 3-row logic compression; (c) ordered matrix after applying commutative remapping of the bit sequence resulting from the SDLC approach; (d), (e) and (f) the same process when applying 4-bit logic clusters. The dotted rectangles indicate the heights of the critical columns, which are further reduced compared to the accurate accumulation tree.

For any (N × N) approximate multiplier, the mean RED (MRED) is defined as [17]:

$$\mathrm{MRED} = \frac{\sum_{i=0}^{2^{2N}-1} RED_i}{2^{2N}} \qquad (1)$$

The Mean Error Distance (MED) is another useful error metric, defined as the average of the ED values, i.e., $\mathrm{MED} = \frac{\sum_{i=0}^{2^{2N}-1} ED_i}{2^{2N}}$. For comparing multipliers of different sizes, the normalized MED (NMED) is defined as [17]:

$$\mathrm{NMED} = \frac{\mathrm{MED}}{P_{max}} = \frac{\sum_{i=0}^{2^{2N}-1} ED_i}{2^{2N} \, P_{max}} \qquad (2)$$

where $P_{max}$ is the maximum product that can be obtained from an (N × N) accurate multiplier, i.e., $P_{max} = (2^N - 1)^2$.
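The metrics above can be computed exhaustively for small bit-widths, as in the following Python sketch. It is a minimal illustration under stated assumptions: `error_metrics` and `drop_lsb` are hypothetical names, `drop_lsb` is only a stand-in approximate multiplier (it clears the least significant product bit) and is not the SDLC design, and RED is taken as 0 when the exact product is 0, a case the definitions leave open.

```python
def error_metrics(approx_mul, n: int) -> dict:
    """Exhaustively evaluate ER, MED, MRED and NMED for an n x n multiplier.

    approx_mul(a, b) is any callable returning the approximate product.
    """
    total = 1 << (2 * n)                 # 2^(2N) operand combinations
    p_max = ((1 << n) - 1) ** 2          # largest exact product
    sum_ed, sum_red, wrong = 0, 0.0, 0
    for a in range(1 << n):
        for b in range(1 << n):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # error distance
            sum_ed += ed
            wrong += ed != 0
            if exact:                            # assumption: RED = 0 when P = 0
                sum_red += ed / exact
    med = sum_ed / total
    return {
        "ER": wrong / total,
        "MED": med,
        "MRED": sum_red / total,
        "NMED": med / p_max,
    }


def drop_lsb(a: int, b: int) -> int:
    # Stand-in approximate multiplier for demonstration only.
    return (a * b) & ~1


if __name__ == "__main__":
    print(error_metrics(drop_lsb, 2))
```

For the 2-bit stand-in, four of the sixteen products are odd and lose one unit each, so ER = 0.25, MED = 0.25 and MRED = 1/9. Any functional model of an approximate multiplier can be passed in place of `drop_lsb`.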

Exhaustive simulations are performed in MATLAB by implementing a functional model of the SDLC approach. The response of each approximate multiplier is evaluated for all possible combinations of operands. Table II shows four error metrics using varying sizes of the proposed multiplier. It can be seen that MRED and NMED fall drastically as the size of the multiplier is increased from 4-bit to 16-bit. The increasing trend in the error rate is expected due to the increased bit-width of the multiplier: a wider multiplier has a growing likelihood of containing a pair of vertically aligned “ones” in two successive rows, and in such cases the corresponding OR gate returns an error.

However, such error rates can be misleading, as the eventual impact of error is reflected in error distance metrics such as MRED and NMED [18]. Also, the readings of MAX(RED) do not denote severe degradation of the final output because such errors occur very rarely. This can be seen in Figure 5, which shows the probability distribution of all relative errors resulting from three different sizes of multipliers using the SDLC approach. The probability distribution shows that the proposed approach tends to produce exact or close to exact results. This is seen in the sharp decline of the probability of errors with higher REDs, e.g., the MAX(RED) listed in Table II. Furthermore, as the bit-width of the multiplier is increased, the mass of the distribution is gradually concentrated at a lower error distance. This is because the proposed approach does not sacrifice the precision of the more significant bits when using significance-driven logic compression.

Table III depicts the error trade-off with increased degree of compression achieved through higher depths of logic clusters in an (8 × 8) multiplier. As expected, increased depth leads to higher error rates (up to 78%) when clustering with 4-row logic compression. However, results for the MRED metric are only marginally higher when compared with logic compression with 2- or 3-bit logic clusters. Similar observations can be made in the case of the NMED metric. The impact of increased degree of compression is further investigated in the application case-study in Section IV.

To demonstrate the proposed approach, we applied it to eight different sizes of widely known multipliers ranging from 4-bit to 128-bit. For the purpose of fair comparison, accurate ripple adders were used in both accurate and approximate multipliers to accumulate the partial product rows within the accumulation stage (see Figure 1). Generic SystemVerilog code was used to generate synthesizable modules for all accurate and approximate versions. These modules are parametrized and configured differently during instantiation according to the bit-width of the multiplier. The generated code was implemented and synthesised using two different off-the-shelf tools: Mentor Graphics Questa Sim was used to compile the SystemVerilog code and run the associated test benches, and Synopsys Design Compiler was used to synthesise all sizes of accurate and proposed multipliers, mapping the circuits to Faraday’s 90nm technology library and evaluating power, delay and area.

Figure 6 presents a comparison of dynamic/leakage power, area, delay and energy reductions for all eight sizes of proposed multipliers when compared with a conventional accurate multiplier (Figure 1(a)). As seen, there are significant improvements in all design trade-offs. This is because the SDLC approach reduces the complexity of the multiplier implementation by reducing the number of rows in the accumulation tree. Furthermore, this reduction in hardware complexity leads to low switching capacitance and leakage readings as well as shortened critical paths. The experiments show noteworthy reductions in power consumption, run-time and silicon area. For dynamic and leakage power, the reductions obtained from applying the SDLC approach range from 37.5%-67.4% and 34%-72.1% respectively when the bit-width ranges from a 4-bit to a 128-bit multiplier. Furthermore, the range of savings in the operating delay for the same sizes of the proposed multiplier is 38.5%-65.6%. The reduction in complexity also reduces silicon area by 33.4%-62.9%, and the energy consumed is substantially reduced by 65.5%-88.74%. The non-linear trend of the bars in some cases is attributed to the inconsistency of the ratio of the array of additions in the accumulation tree between the approximate and the accurate multiplier.

Figure 7 illustrates the dynamic/leakage power, delay, area and energy savings with increased degree of logic compression. Higher depth of clustering achieves considerable savings in all design trade-offs, since increasing the depth of the logic clusters lowers the number of product rows and, with it, the associated hardware complexity.
We evaluate the efficiency of the proposed technique on a real-life image-processing application. Such an application consists of additions and multiplications using key multipliers as building blocks. Our analysis considers the Gaussian blur filter [19] since it is widely used in graphics software, typically to reduce image noise and detail by acting as a low-pass filter. This filter involves the convolution of a ‘kernel’, described by a Gaussian function, with the pixels of the image. The value of a given pixel in the output image is calculated by multiplying each kernel value by the corresponding input image pixel value; all the obtained values are then added, and the result is the value for the current pixel that overlaps with the centre of the kernel.

Figure 7: Dynamic power, leakage power, delay, area and energy savings for different degrees of logic compression of 8-bit multiplier.

To illustrate the effect of variable logic clusters in the proposed approach, different versions of an 8-bit approximate multiplier together with the Gaussian blur algorithm are implemented in MATLAB, covering 2-, 3- and 4-bit depth clustering. The Gaussian kernel is (3 × 3) with a standard deviation of 1.5; it uses 8-bit fixed-point arithmetic and is applied to an 8-bit grayscale input image of (200 × 200) pixels. We approximate Gaussian blur by replacing the standard multiplication in the Gaussian filter with the aforementioned approximate (8 × 8) multipliers. The peak signal-to-noise ratio (PSNR) is a fidelity metric used to measure the quality of the output images. PSNR is expressed as:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right)$$

where MSE is the mean squared error measured with respect to the reference pixel. Figure 8 demonstrates the impact of different bit-depth clustering on the image quality after applying the Gaussian blur filter. The standard (8 × 8) multiplier and three different levels of approximation for the proposed (8 × 8) multiplier are used. The PSNR values for 2-, 3- and 4-bit depth clustering are 50.2 dB, 39 dB and 30 dB respectively, computed against the image obtained by Gaussian blur filtering with exact multiplication. Thus, the proposed approach can provide a significant dynamic energy saving of up to 68.3% with acceptable output image quality, especially when utilizing smaller bit-depth clusters.
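The case-study pipeline (quantised Gaussian kernel, convolution with a replaceable multiplier, PSNR against the exact result) can be sketched in plain Python as follows. This is a hedged illustration rather than the MATLAB model used in the study: the function names, the fixed-point scale of 256, the replicated-edge border handling and the normalisation by the kernel sum are all assumptions made for the example.

```python
import math


def gaussian_kernel(size: int = 3, sigma: float = 1.5, scale: int = 256) -> list:
    """size x size Gaussian kernel quantised to integers (assumed scale 256)."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[round(scale * v / total) for v in row] for row in raw]


def blur(image: list, kernel: list, mul=lambda a, b: a * b) -> list:
    """Convolve an 8-bit grayscale image; `mul` can be an approximate multiplier."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    norm = sum(sum(row) for row in kernel)       # keeps overall brightness
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky, krow in enumerate(kernel):
                for kx, kv in enumerate(krow):
                    yy = min(max(y + ky - half, 0), h - 1)   # replicate edges
                    xx = min(max(x + kx - half, 0), w - 1)
                    acc += mul(kv, image[yy][xx])
            out[y][x] = min(255, acc // norm)
    return out


def psnr(ref: list, test: list) -> float:
    """PSNR in dB between two equally sized 8-bit images."""
    count = len(ref) * len(ref[0])
    mse = sum((r - t) ** 2
              for ref_row, test_row in zip(ref, test)
              for r, t in zip(ref_row, test_row)) / count
    return math.inf if mse == 0 else 10.0 * math.log10(255 ** 2 / mse)


if __name__ == "__main__":
    img = [[(x * y) % 256 for x in range(16)] for y in range(16)]
    k = gaussian_kernel()
    reference = blur(img, k)                              # exact multiplication
    approx = blur(img, k, mul=lambda a, b: (a * b) & ~1)  # stand-in multiplier
    print(psnr(reference, approx))
```

Passing a functional model of an approximate (8 × 8) multiplier as `mul` reproduces the structure of the experiment: the exact blur serves as the reference image and the PSNR quantifies the degradation introduced by the approximation.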

In this paper, a novel approximate multiplier design is proposed using significance-driven logic compression (SDLC). This design approach utilizes an algorithmic and configurable lossy compression based on bit significance to form a reduced set of partial product terms. This set is then reorganized and accumulated using various schemes of parallel multiplication. On a statistical basis, the results of the NMED and MRED metrics show how the impact of error is alleviated when the size of the multiplier is increased. Additionally, the error distributions show high right-skewness for error probabilities, indicating that the proposed multiplier gives close to exact products for most inputs. The results obtained after synthesis show a substantial decrease in run-time, power consumption and silicon area. We demonstrate energy-accuracy trade-offs for different levels of approximation achieved through configurable logic clustering. To illustrate the effect of variable logic clusters, a case study of an image-processing application shows that the proposed approach can provide significant energy and area savings with negligible loss in output quality, especially when utilizing smaller bit-depth clusters. We believe that the proposed approach can be used with already existing low-power compute units to extract manifold benefits with a minimal loss in output quality.

**Advantages:**

- Excellent delay and power consumption
- High accuracy
- Multiplication of both luminance and chrominance is achieved