VLSI IEEE TRANSACTION
Title: A 2.5-ps Bin Size and 6.7-ps Resolution FPGA Time-to-Digital Converter Based on Delay Wrapping and Averaging
Abstract: A high-resolution time-to-digital converter (TDC) implemented with field programmable gate array (FPGA) based on delay wrapping and averaging is presented. The fundamental idea is to pass a single clock through a series of delay elements to generate multiple reference clocks with different phases for input time quantization. Due to periodicity, those phases will be equivalently wrapped within one reference clock period to achieve the required fine resolution. In practice, a hybrid delay matrix is created to significantly reduce the required number of delay cells. Multiple TDC cores are constructed for parallel measurements and then exquisite routing control and averaging are applied to smooth out the large quantization errors caused by the in homogeneity of the TDC delay lines for both linearity and single-shot precision enhancement. To reduce the impact of temperature sensitivity, a cancellation circuit is created to substantially reduce the offset and confine the output difference within 2 LSB for the same input interval over the full operation temperature range of FPGA. With such a fine resolution of 2.5 ps, the integral nonlinearity is measured to be from merely -2.98 to 3.23 LSB and the corresponding rms resolution is 4.99–6.72 ps. The proposed TDC is tested to be fully functional over 0 °C–50 °C ambient temperature range with extremely low resolution variation. Its performance is even superior to many full-custom-designed TDCs.
Title: Adaptive Multi-bit Crosstalk-Aware Error Control Coding Scheme for On-Chip Communication
Abstract: The presence of different noise sources and continuous increase in crosstalk in the deep sub micrometer technology raised concerns for on-chip communication reliability, leading to the incorporation of crosstalk avoidance techniques in error control coding schemes. This brief proposes joint crosstalk avoidance with adaptive error control scheme to reduce the power consumption by providing appropriate communication resiliency based on runtime noise level. By switching between shielding and duplication as the crosstalk avoidance technique and between hybrid automatic repeat request and forward error correction as the error control policies, three modes of error resiliencies are provided. The results show that, in reduced mode, the scheme achieves up to 25.3% power savings at 3-mm wire length as compared to the original non-adaptive scheme at the cost of only 3.4% power overhead in high protection mode.
Title: Coordinate Rotation-Based Low Complexity K-Means Clustering Architecture
Abstract: In this brief, we propose a low-complexity architectural implementation of the K-means-based clustering algorithm used widely in mobile health monitoring applications for unsupervised and supervised learning. The iterative nature of the algorithm computing the distance of each data point from a respective centroid for a successful cluster formation until convergence presents a significant challenge to map it onto a low-power architecture. This has been addressed by the use of a 2-D Coordinate Rotation Digital Computer-based low-complexity engine for computing the n-dimensional Euclidean distance involved during clustering. The proposed clustering engine was synthesized using the TSMC 130-nm technology library, and a place and route was performed following which the core area and power were estimated as 0.36 mm2 and 9.21mW at 100 MHz, respectively, making the design applicable for low-power real-time operations within a sensor node.
Title: Low-Power Scan-Based Built-In Self-Test Based on Weighted Pseudorandom Test Pattern Generation and Reseeding
Abstract: A new low-power (LP) scan-based built-in self-test (BIST) technique is proposed based on weighted pseudorandom test pattern generation and reseeding. A new LP scan architecture is proposed, which supports both pseudorandom testing and deterministic BIST. During the pseudorandom testing phase, an LP weighted random test pattern generation scheme is proposed by disabling a part of scan chains. During the deterministic BIST phase, the design-for-testability architecture is modified slightly while the linear-feedback shift register is kept short. In both the cases, only a small number of scan chains are activated in a single cycle. Sufficient experimental results are presented to demonstrate the performance of the proposed LP BIST approach.
Title:A Way-Filtering-Based Dynamic Logical–Associative Cache Architecture for Low-Energy Consumption
Abstract : Last-level caches (LLCs) help improve performance but suffer from energy overhead because of their large sizes. An effective solution to this problem is to selectively power down several cache ways, which, however, reduces cache associativity and performance and thus limits its effectiveness in reducing energy consumption. To overcome this limitation, we propose a new cache architecture that can logically increase cache associativity of way-powered-down LLCs. Our proposed scheme is designed to be dynamic in activating an appropriate number of cache ways in order to eliminate the need for static profiling to determine an energy-optimized cache configuration. The experimental results show that our proposed dynamic scheme reduces the energy consumption of LLCs by 34% and 40% on single- and dual-core systems, respectively, compared with the best performing conventional static cache configuration. The overall system energy consumption including CPU, L2 cache, and DRAM is reduced by 9.2% on quad-core systems.
Title: Resource-Efficient SRAM-based Ternary Content Addressable Memory
Abstract:Static random access memory (SRAM)-based ternary content addressable memory (TCAM) offers TCAM functionality by emulating it with SRAM. However, this emulation suffers from reduced memory efficiency while mapping the TCAM table on SRAM units. This is due to the limited capacity of the physical addresses in the SRAM unit. This brief offers a novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources. The SRAM unit is divided into multiple virtual blocks to store the address information presented in the TCAM table. This approach virtually increases the overall address space of the SRAM unit, mapping a greater portion of the TCAM table in SRAM and increasing the overall emulated TCAM bits/SRAM at the cost of reduced throughput. A 72×28-bit REST consumes only one 36-kbit SRAM and a few distributed RAMs via implementation on a Xilinx Kintex-7 field-programmable gate array. It uses only 3.5% of the memory resources compared with a conventional SRAM-based TCAM (hybrid-partitioned TCAM).
Title: Write-Amount-Aware Management Policies for STT-RAM Caches
Abstract:Spin-transfer torque random access memory (STT-RAM) technology has emerged as one of the most promising memory technologies owing to its non-volatility, high density, and low-leakage power characteristics. However, STT-RAM has certain drawbacks such as high write energy consumption and limits to the number of write cycles. To enable the adoption of STT-RAM in the implementation of cache memories, new cache hierarchy management policies are required to overcome such drawbacks. In this brief, we evaluated several cache hierarchy management policies in the context of static random access memory L1 caches and an STT-RAM L2 cache. We found that a nonexclusive policy is superior to non-inclusive and exclusive policies in terms of energy consumption and endurance. We also propose a subblock-based management policy because the write energy consumption and endurance are proportional and inversely proportional to the amount of written data, respectively. A combination of the proposed policy with a nonexclusive policy reduces the L2 cache energy consumption by 33.3% (31.5%) and improves the lifetime by 56.3% (56.8%) in a single-core (quad-core) system.
Title: Fault Diagnosis Schemes for Low-Energy Block Cipher Midori Benchmarked on FPGA
Abstract:Achieving secure high-performance implementations for constrained applications such as implantable and wearable medical devices are a priority in efficient block ciphers. However, security of these algorithms is not guaranteed in the presence of malicious and natural faults. Recently, a new lightweight block cipher, Midori, has been proposed that optimizes the energy consumption besides having low latency and hardware complexity. In this paper, fault diagnosis schemes for variants of Midori are proposed. To the best of the authors’ knowledge, there has been no fault diagnosis scheme presented in the literature for Midori to date. The fault diagnosis schemes are provided for the nonlinear S-box layer and for the round structures with both 64-bit and 128-bit Midori symmetric key ciphers. The proposed schemes are benchmarked on a field programmable gate array and their error coverage is assessed with fault-injection simulations. These proposed error detection architectures make the implementations of this new low-energy lightweight block cipher more reliable.
Title: High-Throughput and Energy-Efficient Belief Propagation Polar Code Decoder
Abstract:Owing to their capacity-achieving performance and low encoding and decoding complexity, polar codes have received significant attention recently. Successive cancellation decoding (SCD) and belief propagation decoding (BPD) are two popular approaches for decoding polar codes. SCD, despite having less computational complexity when compared with BPD, suffers from long latency due to the serial nature of the SC algorithm. BPD, on the other hand, is parallel in nature and is more attractive for low-latency applications. However, due to the iterative nature of BPD, the required latency and energy dissipation increase linearly with the number of iterations. In this paper, we propose a novel scheme based on sub-factor graph freezing to reduce the average number of computations as well as the average number of iterations required by BPD, which directly translates into lower latency and energy dissipation. Simulation results show that the proposed scheme has no performance degradation and achieves significant reduction in computation complexity over the existing methods. Moreover, the hardware architecture for the proposed scheme is developed and compared with the state-of-the-art BPD implementations for (1024, 512) polar codes. A decoding throughput of 13.9 Gb/s is achieved along with a 60%–73% improvement in energy reduction and two times increase in hardware efficiency when compared with the existing BPD implementations.
Title: High-Speed Parallel LFSR Architectures Based on Improved State-Space Transformations
Abstract:Linear feedback shift register (LFSR) has been widely applied in BCH and CRC encoding. In order to increase the system throughput, the parallelization of LFSR is usually needed. Previously, a technique named state-space transformation was presented to reduce the complexity of parallel LFSR architectures. Exhaustive searches are performed to find good transformation matrix candidates. This brief proposes a new technique for construction of the transformation matrix together with a more efficient searching algorithm. The realization results indicate that the proposed architecture outperforms the prior arts, improving the hardware efficiency by around 35% and the corresponding searching algorithm finds the desirable transformation matrix much faster.
Title: Scalable Approach for Power Droop Reduction During Scan-Based Logic BIST
Abstract:The generation of significant power droop (PD) during at-speed test performed by Logic Built-In Self Test (LBIST) is a serious concern for modern ICs. In fact, the PD originated during test may delay signal transitions of the circuit under test (CUT): an effect that may be erroneously recognized as delay faults, with consequent erroneous generation of test fails and increase in yield loss. In this paper, we propose a novel scalable approach to reduce the PD during at-speed test of sequential circuits with scan-based LBIST using the launch-on capture scheme. This is achieved by reducing the activity factor of the CUT, by proper modification of the test vectors generated by the LBIST of sequential ICs. Our scalable solution allows us to reduce PD to a value similar to that occurring during the CUT in field operation, without increasing the number of test vectors required to achieve target fault coverage (FC). We present a hardware implementation of our approach that requires limited area overhead. Finally, we show that, compared with recent alternative solutions providing a similar PD reduction, our approach enables a significant reduction of the number of test vectors (by more than 50%), thus the test time, to achieve a target FC.
Title: Stochastic Implementation and Analysis of Dynamical Systems Similar to the Logistic Map
Abstract:Stochastic computing (SC) is a digital computation approach that operates on random bit streams to perform complex tasks with much smaller hardware footprints compared with conventional binary radix approaches. SC works based on the assumption that input bit streams are independent random sequences of 1s and 0s. Previous SC efforts have avoided implementing functions that have feedback, because doing so has the potential for creating highly correlated inputs. We propose a number of solutions to overcome the challenges of implementing feedback in stochastic logic. We use a family of dynamical system functions that are similar to the well-known logistic map x?µx(1-x)as case studies. We show that complex behaviors, such as period doubling and chaos, do indeed occur in digital logic with only a few gates operating on a few 0s and 1s. Our energy consumption is between 21% and 31% of the conventional binary approach. In order to verify our design methodology, we have measured the mean switching rate between the basins of attraction of two coexisting fixed points and the peak width of the steady-state distribution of the output using a logistic-map-like function as an example. Theoretical results match well with our numerical experiments.
HIGH SPEED DATA TRANSMISSION
Title: Efficient Designs of Multi-ported Memory on FPGA
Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multi-ported memory designs on field programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous xor-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multi ported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.
Title: High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
Abstract: In this paper, a novel high-speed elliptic curve cryptography (ECC) processor implementation for point multiplication (PM) on field-programmable gate array (FPGA) is proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and the Lopez-Dahab Montgomery PM algorithm is modified for careful scheduling to avoid data dependency resulting in a drastic reduction in the number of clock cycles (CCs) required. The proposed ECC architecture has been implemented on Xilinx FPGAs’ Virtex4, Virtex5, and Virtex7 families. To the best of our knowledge, our single- and three-multiplier-based designs show the fastest performance to date when compared with reported works individually. Our one-multiplier-based ECC processor also achieves the highest reported speed together with the best reported area-time performance on Virtex4 (5.32 µs at 210 MHz), on Virtex5 (4.91µs at 228 MHz), and on the more advanced Virtex7 (3.18 µsat 352 MHz). Finally, the proposed three-multiplier-based ECC implementation is the first work reporting the lowest number of CCs and the fastest ECC processor design on FPGA (450 CCs to get 2.83 µs on Virtex7).
Title:An On-Chip Monitoring Circuit for Signal-Integrity Analysis of 8-Gb/s Chip-to-Chip Interfaces With Source-Synchronous Clock
Abstract: This paper presents an on-chip monitoring circuit (OCMC) for analyzing the signal integrity of high speed signals for a chip-to-chip interface with a source synchronous clocking scheme. The proposed OCMC consists of a fractional-N phase-locked loop (PLL)-based frequency synthesizer, a high-bandwidth track-and-hold circuit, and a 10-bit analog-to-digital converter (ADC) to implement a subsampling scheme. The proposed fractional-N PLL-based frequency synthesizer improves the time jitter accumulated in a voltage controlled oscillator using a fractional frequency divider operated by an eight-phase clock. The bandwidth of the track-and hold circuit is designed to be 6 GHz, using inductive peaking realized through a source follower. The OCMC samples 49 points over two unit intervals of a high-speed input signal when the frequency multiplication of the frequency synthesizer is 6.125/6. The 10-bit ADC uses the architecture of a pipelined successive approximation register ADC to reduce the power consumption and chip area. The proposed OCMC is implemented with 65-nm CMOS technology and a 1.2 V supply. The 8-Gb/s chip-to-chip interface signal is reconstructed with time and voltage resolutions of 5.1 ps and 1.17 mV, respectively.
Title:A 2.4–3.6-GHz Wideband Sub-harmonically Injection-Locked PLL with Adaptive Injection Timing Alignment Technique
Abstract: This paper proposes a wideband sub harmonically injection-locked PLL (SILPLL) with adaptive injection timing alignment technique. The SILPLL includes three main circuit blocks: one-oscillator-period constant-delay (OOPCD) divider, timing-adjusted phase detector (TPD), and pulse generator (PG). The proposed injection timing alignment technique can align the injection timing adaptively in a wide range of the output clock frequency using the two blocks (OOPCD and TPD) and a falling edge locking scheme of pulses. It can avoid the risk that SILPLL may lock to the wrong frequency or even fail to lock. The PG block is used for half-integral injection to relax the tradeoff between the phase noise of SILPLL and the output frequency resolution. The OOPCD circuit occupies a negligible area. After the injection timing alignment is finished, the OOPCD is powered off so that no extra power is consumed. The SILPLL is implemented in the 65-nm 1P9M CMOS process. It consumes 8.6 mW at 1.2 V supply and occupies an active core area of 1×0.6mm2 . The measured output frequency range is 2.4~3.6 GHz with an output frequency resolution of 200 MHz and the phase noise is-127.6 dBc/Hz at an offset of 1 MHz from a carrier frequency of 3.4 GHz. The rms jitter integrated from 1 kHz to 30 MHz is less than 112 fs for all the covered frequency points. Under the supply voltage range from 1.1 to 1.3 V and the temperature range from -20 °C to 70 °C, the rms jitter variation of all the covered frequency points is less than 27 fs, which shows good robustness over environmental variation.
Title:Hardware-Efficient Built-In Redundancy Analysis for Memory With Various Spares
Abstract: Memory capacity continues to increase, and many semiconductor manufacturing companies are trying to stack memory dice for larger memory capacities. Therefore, built-in redundancy analysis (BIRA) is of utmost importance because the probability of fault occurrence increases with a larger memory capacity. A traditional spare structure that consists of simple rows and columns is somewhat inadequate for multiple memory blocks BIRA because the hardware overhead and spare allocation efficiency are degraded. The proposed BIRA uses various types of spares and can achieve a higher yield than a simple row and column spare structure. Herein, we propose a BIRA that can achieve an optimal repair rate using various spare types. The proposed analyzer can exhaustively search not only row and column spare types but also global and local spare types. In addition, this paper proposes a fault-storing content-addressable memory (CAM) structure. The proposed CAM is small and collects faults efficiently. The experimental results show a high repair rate with a small hardware overhead and a short analysis time.
Title: Fast Automatic Frequency Calibrator Using an Adaptive Frequency Search Algorithm
Abstract:A new adaptive frequency search algorithm (A-FSA) is presented for a fast automatic frequency calibrator in wideband phase-locked loops (PLLs). The proposed A-FSA optimizes the number of clock counts for each frequency comparison cycle, depending on the difference between the target frequency and the PLL output frequency, as opposed to a binary frequency search algorithm (B-FSA), where the frequency search time per cycle is fixed. This eliminates unnecessary clocking times during the frequency comparison process, and thus reduces the total PLL lock time. The additional circuitry needed for A-FSA is only a simple counter controller, thus minimizing hardware overhead. To verify the effectiveness of the proposed algorithm, two wideband PLLs are designed and simulated using a 65-nm CMOS technology: one with B-FSA, and the other with A-FSA. The latter achieves a lock time faster than the former by at least a factor of 2, even under worst case conditions.
Title: A High-Efficiency 6.78-MHz Full Active Rectifier with Adaptive Time Delay Control for Wireless Power Transmission
Abstract:This paper presents a full active rectifier consisting of GaN devices and a CMOS controller designed for wireless power transmission in high-power consumer devices. An adaptive time delay control circuit is developed to maximize the conduction interval of the GaN switch, which can significantly reduce the power loss caused by the forward voltage imposed by the diode. The proposed control algorithm also eliminates the reverse leakage current of the rectifier, and thus further improves its power transfer efficiency. The controller implemented based on a high voltage 0.18-µm CMOS process and the power stage consisting of four GaN transistors are assembled on the same printed circuit board (PCB) board. The proposed rectifier provides a maximum output current of 3 A at 5 V, with a 6.78-MHz ac input voltage.
Title: Scalable Device Array for Statistical Characterization of BTI-Related Parameters
Abstract:A device array circuit, scalable in terms of the number of transistors used, is proposed. The proposed array facilitates accurate and simultaneous bias voltage application to a large number of devices, making it suitable for the measurement based statistical characterization of device degradation, known as bias temperature instability. Using the proposed array, the degradation measurement of thousands of transistors is made possible in a practical amount of time. The experimental results show that the defect-centric model can approximate the statistical variation in magnitudes of threshold voltage shifts (delta- VTH) and that the variance of delta- VTH bears an inverse relationship to the channel areas of transistors. The degradation variability under ac stress conditions is also presented for the first time.
AREA EFFICIENT/ TIMING & DELAY REDUCTION
Title:VLSI Design of 64bit × 64bit High Performance Multiplier with Redundant Binary Encoding
Abstract: For multiplier dominated applications such as digital signal processing, wireless communications, and computer applications, high speed multiplier designs has always been a primary requisite. In this paper a high performance 16×16 bit redundant binary (RB) multiplier have been designed by using recently proposed redundant binary encoding approach to eliminate the error correcting word and a delay efficient parallel prefix Ling adder for final redundant binary to normal binary (RB-NB) conversion. Since redundant binary (RB) representation allows carry-free addition and adaptability, it has been used in 16×16 bit high-performance RB multiplier design for summation of partial product terms. The design of multiplier also reduces redundant partial product accumulation stage when eliminating the error correcting word which improves the complexity and the critical path delay. The performance of RB multiplier design compared with conventional RB modified booth encoding multiplier (CRBMBE). The comparison is based on synthesis result obtained by synthesizing both multiplier architectures targeting a Xilinx FPGA in terms of area and delay analysis.
Title:A Method to Design Single Error Correction Codes with Fast Decoding for a Subset of Critical Bits
Abstract: Single error correction (SEC) codes are widely used to protect data stored in memories and registers. In some applications, such as networking, a few control bits are added to the data to facilitate their processing. For example, flags to mark the start or the end of a packet are widely used. Therefore, it is important to have SEC codes that protect both the data and the associated control bits. It is attractive for these codes to provide fast decoding of the control bits, as these are used to determine the processing of the data and are commonly on the critical timing path. In this brief, a method to extend SEC codes to support a few additional control bits is presented. The derived codes support fast decoding of the additional control bits and are therefore suitable for networking applications.
Title:ENFIRE: A Spatio-Temporal Fine-Grained Reconfigurable Hardware
Abstract: Field programmable gate arrays (FPGAs) are well-established as fine-grained reconfigurable computing platforms. However, FPGAs demonstrate poor scalability in advanced technology nodes due to the large negative impact of the elaborate programmable interconnects (PIs). The need for such vast PIs arises from two key factors: 1) fine-grained bit-level data manipulation in the configurable logic blocks and 2) the purely spatial computing model followed in the FPGAs. In this paper, we propose ENFIRE, a novel memory-based spatio-temporal framework designed to provide the flexibility of reconfigurable bit-level information processing while improving scalability and energy efficiency. Dense 2-D memory arrays serve as the main computing elements storing not only the data to be processed but also the functional behavior of the application mapped into lookup tables. Computing elements are spatially distributed, communicating as needed over a hierarchical bus interconnect, while the functions are evaluated temporally inside each computing element. A custom software framework facilitates application mapping to the framework. By leveraging both spatial and temporal computing, ENFIRE significantly reduces the interconnect overhead when compared with FPGA. Simulation results show an improvement of 7.6×in energy, 1.6×in energy efficiency, 1.1×in leakage, and 5.3×in unified energy efficiency, a metric that considers energy and area together, compared with comparable FPGA implementations.
Title: Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs
AbstractHybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18×for addition/subtraction, 1.29×for multiplication, 3.07–4.05×for division, and 3.11–3.81×for square root, and use 90.7–94.6% less area than dedicated fused multiply– add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed point software kernel by 3.69–7.28×for addition/subtraction, 1.22–2.03×for multiplication, 14.4×for division, and 31.9× for square root, and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply–add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply– add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11–15.9× and use 38.2 95.3% less area than dedicated FMA hardware.
Title: Efficient Soft Cancelation Decoder Architectures for Polar Codes
Abstract: The flooding belief propagation (FO-BP) and the soft-cancelation (SCAN) algorithms are the two most popular soft-output BP algorithms for the decoding of capacity-achieving polar codes. The FO-BP algorithm has high throughput at the cost of performance degradation in high signal-to-noise ratio (SNR) region or with large block length. The SCAN algorithm has much better decoding performance while suffering from long decoding latency and low throughput. In this paper, an improved BP algorithm, named reduced complexity soft cancelation (RCSC) algorithm, is proposed. Compared with the SCAN algorithm, the number of memory entries required by the RCSC algorithm is reduced by more than 50% in general, while achieving comparable or even better (e.g., when block size N=215) decoding performance. When block size is large(e.g., N =215), the proposed RCSC algorithm reduces the required memory entries by more than 23% compared with the state-of-the-art FO-BP algorithm. The numerical results show that the error performance improvement of the RCSC algorithm is more significant when the SNR increases. For a different tradeoff, a reduced latency soft-cancelation (RLSC) algorithm is proposed to reduce the decoding latency and increase the throughput of the RCSC algorithm while slightly sacrificing decoding performance. Finally, the optimized VLSI architectures are presented for the RCSC and RLSC algorithms, respectively. The synthesis results demonstrate the efficiency of the proposed algorithms and architectures.
Title: Low-Complexity Digit-Serial Multiplier Over GF(2m) Based on Efficient Toeplitz Block Toeplitz Matrix–Vector Product Decomposition
Abstract: In this paper, we have shown that a regular Toeplitz matrix-vector product (TMVP) can be transformed into a Toeplitz block TMVP (TBTMVP) using a suitable permutation matrix. Based on the TBTMVP representation, we have proposed a new (a,b)-way TBTMVP decomposition algorithm for implementing a digit-serial multiplication. Moreover, it is shown that, based on iterative block recombination, we can improve the space complexity of the proposed TBTMVP decomposition. From the synthesis results, we have shown that the proposed TBTMVP-based multiplier involves less area, less area–delay product, and higher throughput compared with the existing digit serial multipliers
Title: Hybrid LUT Multiplexer FPGA Logic Architectures
Abstract: Hybrid configurable logic block architectures for field-programmable gate arrays that contain a mixture of look up tables and hardened multiplexers are evaluated toward the goal of higher logic density and area reduction. Multiple hybrid configurable logic block architectures, both non fracturable and fracturable with varying MUX:LUT logic element ratios are evaluated across two benchmark suites using a custom tool flow consisting of Leg Up-HLS, Odin-II front-end synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing, and architecture exploration. Technology mapping optimizations that target the proposed architectures are also implemented within ABC. Experimentally, we show that for non fracturable architectures, without any mapper optimizations, we naturally save up to~8%area post place and route; both accounting for complex logic block and routing area while maintaining mapping depth. With architecture-aware technology mapper optimizations in ABC, additional area is saved, post-place-and-route. For fracturable architectures, experiments show that only marginal gains are seen after place-and-route up to~2%.
Title: Sign-Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication
Abstract: Decimal X×Y multiplication is a complex operation, where intermediate partial products (IPPs) are commonly selected from a set of pre-computed radix-10Xmultiples. Some works require only[0,5]×X via recoding digits of Y to one-hot representation of signed digits in[-5,5]. This reduces the selection logic at the cost of one extra IPP. Two’s complement signed-digit (TCSD) encoding is often used to represent IPPs, where dynamic negation (via one xor per bit of X multiples) is required for the recoded digits of Y in [-5,-1].In this paper, despite generation of 17 IPPs, for 16-digit operands, we manage to start the partial product reduction (PPR) with 16 IPPs that enhance the VLSI regularity. Moreover, we save 75% of negating xor’s via representing precomputed multiples by sign-magnitude signed-digit (SMSD) encoding. For the first-level PPR, we devise an efficient adder, with two SMSD input numbers, whose sum is represented with TCSD encoding. Thereafter, multilevel TCSD 2:1 reduction leads to two TCSD accumulated partial products, which collectively undergo a special early initiated conversion scheme to get at the final binary-coded decimal product. As such, a VLSI implementation of 16×16-digit parallel decimal multiplier is synthesized, where evaluations show some performance improvement over previous relevant designs.
Title: FPGA Realization of Low Register Systolic All-One-Polynomial Multipliers over GF (2m) and Their Applications in Trinomial Multipliers
Abstract: Systolic all-one-polynomial (AOP) multipliers usually suffer from the problem of high register complexity, especially in field-programmable gate array (FPGA) platforms where the register resources are not that abundant. In this paper, we have shown that the AOP-based systolic multipliers can easily achieve low register-complexity implementations and the proposed architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting scheme in which the register complexity involved within existing AOP-based systolic multipliers is significantly reduced. We have found out that the modified AOP-based structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel precomputed modular operation, and the systolic structures based on this algorithm fully inherit the advantages brought from the AOP-based core (low register complexity, low critical-path delay, and low latency) except some marginal hardware overhead brought by a precomputation unit. The proposed architectures are then implemented by Xilinx ISE 14.1 and it is shown that compared with the existing designs, the proposed designs achieve at least 61.8% and 47.6% less area-delay product and power delay product than the best of competing designs, respectively.
Title:Low-Complexity Transformed Encoder Architectures for Quasi-Cyclic Non-binary LDPC Codes Over Subfields
Abstract: Quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital communication and storage systems. The encoding of these codes is traditionally done by multiplying the message vector with a generator matrix consisting of dense circulant submatrices. To reduce the encoder complexity, this paper introduces two schemes making use of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of dimension (2r-1) × (2r-1) and the entries are elements of GF(2p), where p divides r, and hence, GF(2p) is a subfield of GF(2r). These cover a broad range of codes, and binary LDPC codes are a special case. Making use of conjugacy constraints, low-complexity architectures are developed for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field arithmetic is exploited to eliminate the computations associated with message mapping and reduce the complexity of Fourier transform. For a (2016, 1074) non binary QC-LDPC code whose generator matrix consists of circulants of dimension 63×63 with GF(22)entries, the proposed encoders achieve 22% area reduction compared with the conventional encoders without sacrificing the throughput.
Title:Antiwear Leveling Design for SSDs With Hybrid ECC Capability
Abstract: With the joint considerations of reliability and performance, hybrid error correction code (ECC) becomes an option in the designs of solid-state drives (SSDs). Unfortunately, wear leveling (WL) might result in the early performance degradation to SSDs, which is common with a limited number of P/E cycles, due to the efforts to delay the bit-error-rate growth. In this paper, an anti-WL design is proposed to avoid such a performance problem so that the performance of SSDs with hybrid ECC capability can be improved without sacrificing their reliability. The capability of the proposed design was evaluated by a series of experiments, for which it was shown that the proposed design could greatly improve the read and write performance of SSDs up to 50% without affecting the endurance of the investigated SSDs, compared with traditional approaches.
Title:Energy-Efficient VLSI Realization of Binary64 Division with Redundant Number Systems
Abstract: VLSI realizations of digit-recurrence binary division usually use redundant representation of partial remainders and quotient digits. The former allows for fast carry-free computation of the next partial remainder, and the latter leads to less number of the required divisor multiples. In studying the previous relevant works, we have noted that the binary carry save (CS) number system is prevalent in the representation of partial remainders, and redundant high radix representation of quotient digits is popular in order to reduce the cycle count. In this paper, we explore a design space containing four division architectures. These are based on binary CS or radix-16 signed digit (SD) representations of partial remainders. On the other hand, they use full or partial pre computation of divisor multiples. The latter uses smaller multiplexer at the cost two extra adders, where one of the operands is constant within all cycles. The quotient digits are represented by radix-16 [-9,9]SDs. Our synthesis-based evaluation of VLSI realizations of the best previous relevant work and the four proposed designs show reduced power and energy figures in the proposed designs at the cost of more silicon area and delay measures. However, our energy-delay product is 26%–35% less than that of the reference work.
AUDIO, IMAGE & VIDEO PROCESSING
Title:A Dual-Clock VLSI Design of H.265 Sample Adaptive Offset Estimation for 8k Ultra-HD TV Encoding
Abstract: Sample adaptive offset (SAO) is a newly introduced in-loop filtering component in H.265/High Efficiency Video Coding (HEVC). While SAO contributes to a notable coding efficiency improvement, the estimation of SAO parameters dominates the complexity of in-loop filtering in HEVC encoding. This paper presents an efficient VLSI design for SAO estimation. Our design features a dual-clock architecture that processes statistics collection (SC) and parameter decision (PD), the two main functional blocks of SAO estimation, at high- and low speed clocks, respectively. Such a strategy reduces the overall area by 56% by addressing the heterogeneous data flows of SC and PD. To further improve the area and power efficiency, algorithm-architecture co-optimizations are applied, including a coarse range selection (CRS) and an accumulator bit width reduction (ABR). CRS shrinks the range of fine processed bands for the band offset estimation. ABR further reduces the area by narrowing the accumulators of SC. They together achieve another 25% area reduction. The proposed VLSI design is capable of processing 8k at 120-frames/s encoding. It occupies 51k logic gates, only one-third of the circuit area of the state-of-the-art implementations.
Title:RoBA Multiplier: A Rounding-Based Approximate Multiplier for High-Speed yet Energy-Efficient Digital Signal Processing
Abstract: In this paper, we propose an approximate multiplier that is high speed yet energy efficient. The approach is to round the operands to the nearest exponent of two. This way the computational intensive part of the multiplication is omitted improving speed and energy consumption at the price of a small error. The proposed approach is applicable to both signed and unsigned multiplications. We propose three hardware implementations of the approximate multiplier that includes one for the unsigned and two for the signed operations. The efficiency of the proposed multiplier is evaluated by comparing its performance with those of some approximate and accurate multipliers using different design parameters. In addition, the efficacy of the proposed approximate multiplier is studied in two image processing applications, i.e., image sharpening and smoothing.
Title:Energy-Efficient Reduce-and-Rank Using Input-Adaptive Approximations
Abstract: Approximate computing is an emerging design paradigm that exploits the intrinsic ability of applications to produce acceptable outputs even when their computations are executed approximately. In this paper, we explore approximate computing for a key computation pattern, reduce-andrank (RnR), which is prevalent in a wide range of workloads, including video processing, recognition, search, and data mining. An RnR kernel performs a reduction operation (e.g., distance computation, dot product, and L1-norm) between an input vector and each of a set of reference vectors, and ranks the reduction outputs to select the top reference vectors for the current input. We propose three complementary approximation strategies for the RnR computation pattern. The first is interleaved reduction and-ranking, wherein the vector reductions are decomposed into multiple partial reductions and interleaved with the rank computation. Leveraging this transformation, we propose the use of intermediate reduction results and ranks to identify future computations that are likely to have a low impact on the output, and can, hence, be approximated. The second strategy, input similarity-based approximation, exploits the spatial or temporal correlation of inputs (e.g., pixels of an image or frames of a video) to identify computations that are amenable to approximation. The third strategy, reference vector reordering, rearranges the order in which the reference vectors are processed such that vectors that are relatively more critical in evaluating the correct output, are processed at the beginning of RnR operation. The number of these critical reference vectors is usually small, which renders a substantial portion of the total computation to be amenable to approximation. These strategies address a key challenge in approximate computing—identification of which computations to approximate—and may be used to drive any approximation mechanism, such as computation skipping or precision scaling to realize performance and energy improvements. A second key challenge in approximate computing is that the extent to which computations can be approximated varies significantly from application to application, and across inputs for even a single application. Hence, input-adaptive approximation, or the ability to automatically modulate the degree of approximation based on the nature of each individual input, is essential for obtaining optimal energy savings. In addition, to enable quality configurability in RnR kernels, we propose a kernel-level quality metric that correlates well to application-level quality, and identify key parameters that can be used to tune the proposed approximation strategies dynamically. We develop a runtime framework that modulates the identified parameters during the execution of RnR kernels to minimize their energy while meeting a given target quality. To evaluate the proposed concepts, we designed quality-configurable hardware implementations of six RnR-based applications from the recognition, mining, search, and video processing application domains in 45-nm technology.
Title:Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers
Abstract: In this paper, we propose four 4:2 compressors, which have the flexibility of switching between the exact and approximate operating modes. In the approximate mode, these dual-quality compressors provide higher speeds and lower power consumptions at the cost of lower accuracy. Each of these compressors has its own level of accuracy in the approximate mode as well as different delays and power dissipations in the approximate and exact modes. Using these compressors in the structures of parallel multipliers provides configurable multipliers whose accuracies (as well as their powers and speeds) may change dynamically during the runtime. The efficiencies of these compressors in a 32-bit Dadda multiplier are evaluated in a 45-nm standard CMOS technology by comparing their parameters with those of the state-of-the-art approximate multipliers. The results of comparison indicate, on average, 46% and 68% lower delay and power consumption in the approximate mode. Also, the effectiveness of these compressors is assessed in some image processing applications.
Title:An FPGA-Based Hardware Accelerator for Traffic Sign Detection
Abstract: Traffic sign detection plays an important role in a number of practical applications, such as intelligent driver assistance and roadway inventory management. In order to process the large amount of data from either real-time videos or large off-line databases, a high-throughput traffic sign detection system is required. In this paper, we propose an FPGA-based hardware accelerator for traffic sign detection based on cascade classifiers. To maximize the throughput and power efficiency, we propose several novel ideas, including: 1) rearranged numerical operations; 2) shared image storage; 3) adaptive workload distribution; and 4) fast image block integration. The proposed design is evaluated on a Xilinx ZC706 board. When processing high-definition (1080p) video, it achieves the throughput of 126 frames/s and the energy efficiency of 0.041 J/frame.
Title: Soft Error Rate Reduction of Combinational Circuits Using Gate Sizing in the Presence of Process Variations
Abstract: Soft errors in combinational logic circuits are emerging as a significant reliability concern for nano scale VLSI designs. This paper presents a novel sensitivity-based gate sizing methodology to reduce the soft error rate (SER) of combinational circuits in the presence of process variations. The proposed method is based on modeling the statistics of SER of the circuit gates as a random variable to formulate a statistical optimization problem. A backward traversing algorithm with capability for incremental analysis is developed for computing the distribution of circuit gates of SER random variables. We present a gate resizing algorithm in which the gates with the most contribution to the circuit SER are selected in a candidate set using a statistical ordering approach. The proposed algorithm trades off SER reduction and area overheads. The experimental results show that using the proposed methodology, the circuit statistical SER can be reduced by up to 56.4% compared with the 14.8% SER reduction of a circuit obtained using the worst case methodology at the expense of 10% area overhead under 10% process variation ratio. The results also show that the proposed method achieves about 40% more SER reduction compared with that obtained using closed-form analysis for statistical soft error rate estimation (CASSER), the most recently published similar work, in the same experimental conditions. Comparing the runtime of the proposed optimization algorithm with the optimization based on CASSER, it is observed that the proposed method is two orders of magnitude faster than CASSER due to its incremental analysis property.
Title: Time-Encoded Values for Highly Efficient Stochastic Circuits
Abstract: Stochastic computing (SC) is a promising technique for applications that require low area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm, logical computation is performed on randomized bit streams. In prior work, streams were generated with linear feedback shift registers; these contributed heavily to the hardware cost and consumed a significant amount of power. This paper introduces a new approach for encoding signal values: computation is performed on analog periodic pulse signals. Exploiting pulse width modulation, time-encoded signals corresponding to specific values are generated by adjusting the frequency and duty cycles of pulse width modulated (PWM) signals. With this approach, the latency, area, and energy consumption are all greatly reduced. Experimental results on image processing applications show up to 99% performance speedup, 98% saving in energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits synthesized with the proposed approach can work as fast and energy-efficiently as a conventional binary design while retaining the fault-tolerance and low cost advantages of conventional stochastic designs.
Title: Design of Power and Area Efficient Approximate Multipliers
Abstract: Approximate computing can decrease the design complexity with an increase in performance and power efficiency for error resilient applications. This brief deals with a new design approach for approximation of multipliers. The partial products of the multiplier are altered to introduce varying probability terms. Logic complexity of approximation is varied for the accumulation of altered partial products based on their probability. The proposed approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that two proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an exact multiplier. They have better precision when compared to existing approximate multipliers. Mean relative error figures are as low as 7.6% and 0.02% for the proposed approximate multipliers, which are better than the previous works. Performance of the proposed multipliers is evaluated with an image processing application, where one of the proposed models achieves the highest peak signal to noise ratio
Title:COMEDI: Combinatorial Election of Diagnostic Vectors From Detection Test Sets for Logic Circuits
Abstract: Although the modern automatic test pattern generation (ATPG) tools can efficiently produce near-optimal test sets with high fault-coverage for a circuit-under-test, a diagnostic test set (DTS), which is needed for fault localization, is much more challenging to construct. The DTS is used to analyze the responses of failing chips during manufacturing test for the purpose of identifying the root cause of observed errors. In this paper, a novel technique for selecting a powerful DTS for stuck-at faults from a pool of ATPG detection vectors is proposed. Unlike existing methods, this technique does not use any diagnostic test generation, circuit modification, or miter-based approach. It constructs a combinatorial cover of the pool to determine a test set with high diagnostic coverage (DC). Two variants of the covering algorithm are proposed based on this technique. The experimental results on several combinational and scan-based benchmark circuits demonstrate the effectiveness of our method in terms of the size of the DTS, DC, and CPU time.
Title:Reordering Tests for Efficient Fail Data Collection and Tester Time Reduction
Abstract: During fail data collection, a tester collects information that is useful for defect diagnosis. If fail data collection can be terminated early, the tester time as well as the volume of fail data will be reduced. Test reordering can enhance the ability to terminate the process early without affecting the quality of diagnosis. In this paper, test reordering targets logic defects based on information that is derived during defect diagnosis. The defect diagnosis procedure is enhanced to identify tests that are useful for defect diagnosis across a sample of faulty instances of a circuit. Tests that are determined to be useful for more faulty instances of a circuit are placed earlier in the test set based on the expectation that the same tests will be useful for other faulty instances of the circuit. The experimental results for logic defects in benchmark circuits support the effectiveness of this approach and indicate that test reordering helps to terminate fail data collection early without impacting the diagnosis quality.
NETWORKING ON CHIP (NOC)
Title:Multicast-Aware High-Performance Wireless Network-on-Chip Architectures
Abstract: Today’s multiprocessor platforms employ the network-on-chip (NoC) architecture as the preferable communication backbone. Conventional NoCs are designed predominantly for unicast data exchanges. In such NoCs, the multicast traffic is generally handled by converting each multicast message to multiple unicast transmissions. Hence, applications dominated by multicast traffic experience high queuing latencies and significant performance penalties when running on systems designed with unicast-based NoC architectures. Various multicast mechanisms such as XY-tree multicast and path multicast have already been proposed to enhance the performance of the traditional wireline mesh NoC incorporating multicast traffic. However, even with such added features, the multihop nature of the wireline mesh NoC leads to high network latencies and thus limits the achievable system performance. In this paper, to sustain the high-bandwidth and high-throughput requirements of emerging applications, we propose the design of a wireless NoC (WiNoC) architecture incorporating necessary multicast support. By integrating congestion-aware multicast routing with network coding, the WiNoC is able to efficiently handle heavy multicast injections.
TANNER & MICROWIND/DSCH3 & HSPICE
Title:Temporarily Fine-Grained Sleep Technique for Near- and Sub-threshold Parallel Architectures
Abstract: This paper presents a design approach for improving energy-efficiency and throughput of parallel architectures in near- and sub-threshold voltage circuits. The focus is to suppress leakage energy dissipation of the idle portions of circuits during active modes, which can allow us to wholly transform the throughput improvement from parallel architectures into energy savings via deep voltage scaling. We begin by investigating the efficacy of parallel and pipeline architectures in the near- and sub-threshold circuits. The investigation reveals that active energy dissipation largely undermines the ability of deep voltage scaling to transform excessive throughput into energy savings. Techniques, such as power-gating switches (PGSs), can mitigate active-leakage power dissipation; however, the over head for entering and exiting sleep modes can offset the energy savings provided by sleep mode, particularly if sleep time is fine grained for suppressing active leakage. Therefore, in this paper, we propose a PGS design technique, inspired by the so-called zigzag super cutoff CMOS, in order to optimize the overheads of mode transitions of PGS in near- and sub-threshold circuits. The proposed technique enables to have circuits in sleep mode for as short as a single clock cycle with a negligible amount of energy and delay overheads. We apply our proposed design to parallel multiplier-based test circuits operating at near- and sub-threshold voltages. Simulations show a significant improvement in energy efficiency over baselines at the same throughput.
Title: Low-Power Design for a Digit-Serial Polynomial Basis Finite Field Multiplier Using Factoring Technique
Abstract: In CMOS-based application-specific integrated circuit (ASIC) designs, total power consumption is dominated by dynamic power, where dynamic power consists of two major components, namely, switching power and internal power. In this paper, we present a low-power design for a digit-serial finite field multiplier in GF(2m ). In the proposed design, a factoring technique is used to minimize switching power. To the best of our knowledge, factoring method has not been reported in the literature being used in the design of a finite field multiplier at an architectural level. Logic gate substitution is also utilized to reduce internal power. Our proposed design along with several existing similar works have been realized for GF(2233)on ASIC platform, and a comparison is made between them. The synthesis results show that the proposed multiplier design consumes at least 27.8% lower total power than any previous work in comparison.
Title: Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator
Abstract: The need for ultralow-power, area efficient, and high speed analog-to-digital converters is pushing toward the use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis on the delay of the dynamic comparators will be presented and analytical expressions are derived. From the analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power and fast operation even in small supply voltages. Without complicating the design and by adding few transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced delay time. Post layout simulation results in a 0.18-µm CMOS technology confirm the analysis results. It is shown that in the proposed dynamic comparator both the power consumption and delay time are significantly reduced.
Title:10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for Low Switching Power and Ultralow RBL Leakage
Abstract: We present, in this paper, a new 10T static random access memory cell having single ended decoupled read-bitline (RBL) with a 4T read port for low power operation and leakage reduction. The RBL is precharged at half the cell’s supply voltage, and is allowed to charge and discharge according to the stored data bit. An inverter, driven by the complementary data node (QB), connects the RBL to the virtual power rails through a transmission gate during the read operation. RBL increases toward the VDD level for a read-1, and discharges toward the ground level for a read-0. Virtual power rails have the same value of the RBL precharging level during the write and the hold mode, and are connected to true supply levels only during the read operation. Dynamic control of virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a commercial 65 nm technology is 2.47×the size of 6T with ß=2, provides 2.3×read static noise margin, and reduces the read power dissipation by 50% than that of 6T. The value of RBL leakage is reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and competitive performance is achieved.
Title:Delay Analysis for Current Mode Threshold Logic Gate Designs
Abstract: Current mode is a popular CMOS-based implementation of threshold logic functions, where the gate delay depends on the sensor size. This paper presents a new implementation of current mode threshold functions for improved gate delay and switching energy. An analytical method is also proposed in order to identify quickly the sensor size that minimizes the gate delay. Simulation results on different gates implemented using the optimum sensor size indicate that the proposed current mode implementation method outperforms consistently the existing implementations in delay as well as switching energy.
Title:Area and Energy-Efficient Complementary Dual-Modular Redundancy Dynamic Memory for Space Applications
Abstract: The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-correction capability is added to the memory architecture for robust soft-error protection. The proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a 3.5×smaller silicon footprint than other radiation-hardened bit cells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.
Title:Probability-Driven Multi-bit Flip-Flop Integration With Clock Gating
Abstract: Data-driven clock gated (DDCG) and multi-bit flip-flops (MBFFs) are two low-power design techniques that are usually treated separately. Combining these techniques into a single grouping algorithm and design flow enables further power savings. We study MBFF multiplicity and its synergy with FF data-to-clock toggling probabilities. A probabilistic model is implemented to maximize the expected energy savings by grouping FFs in increasing order of their data-to-clock toggling probabilities. We present a front-end design flow, guided by physical layout considerations for a 65-nm 32-bit MIPS and a 28-nm industrial network processor. It is shown to achieve the power savings of 23% and 17%, respectively, compared with designs with ordinary FFs. About half of the savings was due to integrating the DDCG into the MBFFs.
Title:A High-Speed and Power-Efficient Voltage Level Shifter for Dual-Supply Applications
Abstract: This brief presents a fast and power-efficient voltage level shifting circuit capable of converting extremely low levels of input voltages into high output voltage levels. The efficiency of the proposed circuit is due to the fact that not only the strength of the pull-up device is significantly reduced when the pull-down device is pulling down the output node, but the strength of the pull-down device is also increased using a low-power auxiliary circuit. Post layout simulation results of the proposed circuit in a 0.18-µm technology demonstrate a total energy per transition of 157 fJ, a static power dissipation of 0.3 nW, and a propagation delay of 30 ns for input frequency of 1 MHz, low supply voltage level of VDDL=0.4V, and high supply voltage level of VDDH=1.8V.
Title:A 0.1–2-GHz Quadrature Correction Loop for Digital Multiphase Clock Generation Circuits in 130-nm CMOS
Abstract: A 100-MHz–2-GHz closed-loop analog in-phase/quadrature correction circuit for digital clocks is presented. The proposed circuit consists of a phase-locked loop- type architecture for quadrature error correction. The circuit corrects the phase errortowithina1.5°upto1GHzandtowithin3°at2GHz. It consumes 5.4 mA from a 1.2 V supply at 2 GHz. The circuit was designed in UMC 0.13-µm mixed-mode CMOS with an active area of 102µm×95µm. The impact of duty cycle distortion has been analyzed. High-frequency quadrature measurement related issues have been discussed. The proposed circuit was used in two different applications for which the functionality has been verified.
Title:Conditional-Boosting Flip-Flop for Near-Threshold Voltage Application
Abstract: A conditional-boosting flip-flop is proposed for ultra-low voltage application where the supply voltage is scaled down to the near-threshold region. The proposed flip-flop adopts voltage boosting to provide low latency with reduced performance variability in the near threshold voltage region. It also adopts conditional capture to minimize the switching power consumption by eliminating redundant boosting operations. Experimental results in a 65-nm CMOS process indicated that the proposed flip-flop provided up to 72% lower latency with 75% less performance variability due to process variation, and up to 67% improved energy-delay product at 25% switching activity compared with conventional pre-charged differential flip-flops.
Title:An All-MOSFET Sub-1-V Voltage Reference With a-51-dB PSR up to 60 MHz
Abstract: This paper presents a voltage reference (VR) with a power supply rejection (PSR) better than 50 dB for frequencies of up to 60 MHz, and uses MOSFETs in strong inversion. Another innovation is a compact MOSFET low-pass filter, which was developed along with a feedback technique for a wide-bandwidth PSR not achieved in previous works. The proposed all-MOSFET VR was fabricated using a standard 0.18µm CMOS process.
Title:A 65-nm CMOS Constant Current Source with Reduced PVT Variation
Abstract: This paper presents a new nanometer-based low-power constant current reference that attains a small value in the total process–voltage–temperature variation. The circuit architecture is based on the embodiment of a process-tolerant bias current circuit and a scaled process-tracking bias voltage source for the dedicated temperature-compensated voltage to-current conversion in a pre-regulator loop. Fabricated in a UMC 65-nm CMOS process, it consumes 7.18µWwitha1.4V supply. The measured results indicate that the current reference achieves an average temperature coefficient of 119ppm/°C over 12 samples in a temperature range from-30 °C to 90 °C without any calibration. Besides, a low line sensitivity of 180 ppm/V is obtained. This paper offers a better sensitivity figure of merit with respect to the reported representative counterparts.
Title:A Fault Tolerance Technique for Combinational Circuits Based on Selective-Transistor Redundancy
Abstract: With fabrication technology reaching nano-levels, systems are becoming more prone to manufacturing defects with higher susceptibility to soft errors. This paper is focused on designing combinational circuits for soft error tolerance with minimal area overhead. The idea is based on analyzing random pattern testability of faults in a circuit and protecting sensitive transistors, whose soft error detection probability is relatively high, until desired circuit reliability is achieved or a given area overhead constraint is met. Transistors are protected based on duplicating and sizing a subset of transistors necessary for providing the protection. In addition to that, a novel gate-level reliability evaluation technique is proposed that provides similar results to reliability evaluation at the transistor level (using SPICE) with the orders of magnitude reduction in CPU time. LGSynth’91 benchmark circuits are used to evaluate the proposed algorithm. Simulation results show that the proposed algorithm achieves better reliability than other transistor sizing-based techniques and the triple modular redundancy technique with significantly lower area overhead for 130-nm process technology at a ground level.
Title:Preweighted Linearized VCO Analog-to-Digital Converter
Abstract: A linearization technique of voltage-to-frequency characteristics of voltage-controlled oscillator (VCO) analog-to-digital converters (ADCs) is presented. In contrast to previous works, the proposed technique is an open-loop calibration-free configuration, so it can operate at higher frequencies. It is also independent of the delay element structure, so it can be applied to various VCO ADC topologies. The analog input signal is first mapped through a preweighted resistor network in which every delay element experiences a different version of the input and produces the corresponding delay. As a result, the proposed approach suppresses the impact of V/F nonlinearity on the ADC performance by expanding a linear region of the transfer curve over the full rail-to-rail input. This technique shows substantial improvement results by keeping nonlinearity within ±0.5% over the full input scale (dBFS) and achieves a peak signal-to-noise and distortion ratio (SNDR) of 75.7 and 60.4 dB for input of -8 and 0 dBFS, respectively.
Title:A 100-mA, 99.11% Current Efficiency, 2-mVppRipple Digitally Controlled LDO with Active Ripple Suppression
Abstract: Digital low-dropout (DLDO) regulators are gaining attention due to their design scalability for distributed multiple voltage domain applications required in state-of-the-art system on-chips. Due to the discrete nature of the output current and the discrete-time control loop, the steady-state response of the DLDO has inherent output voltage ripple. A hybrid DLDO (HD-LDO) with fast response and stable operation across a wide load range while reducing the output voltage ripple is proposed. In the HD-LDO, a DLDO and a low current analog ripple cancelation amplifier (RCA) work in parallel. The output dc of the RCA is sensed by a 2-bit analog-to-digital converter, and the digitized linear stage current is fed into the DLDO as an error signal. During load transients, a gear-shift controller enables fast transient response using dynamic load estimation. The DLDO suppresses the output dc of the RCA within its current resolution. With this arrangement, a majority of the dc load current is provided by the DLDO and the RCA supplies ripple cancelation current. The HD-LDO is designed and fabricated in a 180-nm CMOS technology, and occupies 0.697 mm2 of the die area. The HD-LDO operates with an input voltage range of 1.43–2.0 V and an output voltage range of 1.0–1.57 V. At 100-mA load current, the HD-LDO achieves a current peak efficiency of 99.11% and a settling time of 15 clock periods with a 0.5-MHz clock for a current switching between 10 and 90 mA. The RCA suppresses fundamental, second, and third harmonics of the switching frequency by 13.7, 13.3, and 14.1 dB, respectively.
Title:Sense Amplifier Half-Buffer (SAHB): A Low-Power High-Performance Asynchronous Logic QDI Cell Template
Abstract: We propose a novel asynchronous logic (async) quasi-delay-insensitive (QDI) sense-amplifier half-buffer (SAHB) cell design approach, with emphases on high operational robustness, high speed, and low power dissipation. There are five key features of our proposed SAHB. First, the SAHB cell embodies the async QDI 4-phase (4f) signaling protocol to accommodate process–voltage–temperature variations. Second, the sense amplifier (SA) block in SAHB cells embodies a cross-coupled latch with a positive feedback mechanism to speed up the output evaluation. Third, the evaluation block in the SAHB comprises both nMOS pull-up and pull-down networks with minimum transistor sizing to reduce the parasitic capacitance. Fourth, both the evaluation block and SA block are tightly coupled to reduce redundant internal switching nodes. Fifth, the SAHB cell is designed in CMOS static logic and hence appropriate for full range dynamic voltage scaling operation for VDD ranging from nominal voltage (1 V) to subthreshold voltage (~0.3 V). When six library cells embodying our proposed SAHB are compared with those embodying the conventional async QDI pre-charged half buffer (PCHB) approach, the proposed SAHB cells collectively feature simultaneous ~64% lower power, ~21% faster, and ~6% smaller IC area; the PCHB cell is inappropriate for subthreshold operation. A prototype 64-bit Kogge–Stone pipeline adder based on the SAHB approach (at 65 nm CMOS) is designed. For a 1-GHz throughput and at nominal VDD, the design based on the SAHB approach simultaneously features ~56% lower energy and~24% lower transistor count advantages than its PCHB counterpart. When benchmarked against the ubiquitous synchronous logic counterpart, our SAHB dissipates~39% lower energy at the 1-GHz throughput.
Title:On Micro-architectural Mechanisms for Cache Wear out Reduction
Abstract: Hot carrier injection (HCI) and bias temperature instability (BTI) are two of the main deleterious effects that increase a transistor’s threshold voltage over the lifetime of a microprocessor. This voltage degradation causes slower transistor switching and eventually can result in faulty operation. HCI manifests itself when transistors switch from logic “0” to “1” and vice versa, whereas BTI is the result of a transistor maintaining the same logic value for an extended period of time. These failure mechanisms are especially acute in those transistors used to implement the SRAM cells of first-level (L1) caches, which are frequently accessed, so they are critical to performance, and they are continuously aging. This paper focuses on micro architectural solutions to reduce transistor aging effects induced by both HCI and BTI in the data array of L1 data caches. First, we show that the majority of cell flips are concentrated in a small number of specific bits within each data word. In addition, we also build upon the previous studies, showing that logic “0” is the most frequently written value in a cache by identifying which cells hold a given logic value for a significant amount of time. Based on these observations, this paper introduces a number of architectural techniques that spread the number of flips evenly across memory cells and reduce the amount of time that logic “0” values are stored in the cells by switching OFF specific data bytes. Experimental results show that the threshold voltage degradation savings range from 21.8% to 44.3% depending on the application.
Title:Energy-Efficient TCAM Search Engine Design Using Priority-Decision in Memory Technology
Abstract: Ternary content-addressable memory (TCAM)-based search engines generally need a priority encoder (PE) to select the highest priority match entry for resolving the multiple match problem due to the don’t care (X) features of TCAM. In contemporary network security, TCAM-based search engines are widely used in regular expression matching across multiple packets to protect against attacks, such as by viruses and spam. However, the use of PE results in increased energy consumption for pattern updates and search operations. Instead of using PEs to determine the match, our solution is a three-phase search operation that utilizes the length information of the matched patterns to decide the longest pattern match data. This paper proposes a promising memory technology called priority-decision in memory (PDM), which eliminates the need for PEs and removes restrictions on ordering, implying that patterns can be stored in an arbitrary order without sorting their lengTHP. Moreover, we present a sequential input-state (SIS) scheme to disable the mass of redundant search operations in state segments on the basis of an analysis distribution of hex signatures in a virus database. Experimental results demonstrate that the PDM-based technology can improve update energy consumption of nonvolatile TCAM (nvTCAM) search engines by 36%–67%, because most of the energy in these search engines is used to reorder. By adopting the SIS-based method to avoid unnecessary search operations in a TCAM array, the search energy reductionis around 64% of nvTCAM search engines.
Title:A 92-dB DR, 24.3-mW, 1.25-MHz BW Sigma–Delta Modulator Using Dynamically Biased Op Amp Sharing
Abstract: A 2–2 cascaded switched-capacitor sigma-delta modulator is presented for design of low-voltage, low-power, broadband analog-to-digital conversion. To reduce power dissipation in both analog and digital circuits and ensure low-voltage operation, a half-sample delayed-input feed forward architecture is employed in combination with 4-bit quantization, which results in reduced integrator output swings and relaxed timing constraint in the feedback path. The integrator power is further reduced by sharing an op amp in the two integrators in each stage and periodically changing the op amp bias condition between a high-current and a low-current mode using a fast low-power high-precision charge pump circuit. Implemented in a 0.18-µm CMOS technology, the experimental prototype achieves a 92-dB dynamic range, a 91-dB peak signal-to-noise ratio, and an 84-dB peak signal-to-noise plus distortion ratio, respectively for a signal bandwidth of 1.25 MHz Operated at a 40-MHz sampling rate, the modulator dissipates 24.3mW from a 1 V supply.
Title:A 0.45 V 147–375 nW ECG Compression Processor With Wavelet Shrinkage and Adaptive Temporal Decimation Architectures
Abstract: This paper presents a real-time electrocardiogram (ECG) data compression processor with improved energy efficiency while maintaining high accuracy and real-time operation. Wavelet shrinkage is exploited to filter the noise and achieve sparse ECG signal representation. Adaptive temporal decimation is proposed to achieve configurable processing to adaptively reduce the data amount and computational activities for further power reduction. Modified Huffman and run-length wavelet source coding (MHRLC) is also designed to represent wavelet coefficients with optimized average code length and reduced memory requirement. Fabricated in 0.18-µmCMOS, the ECG processor is implemented with customized near-threshold digital logics for minimum energy operation. The prototype was fully validated with the MIT-BIH Arrhythmia database.