Base Paper Abstract:
The Linear Feedback Shift Register (LFSR) is a widely used circuit structure in electronic systems, often employed as a Pseudo-Random Number Generator (PRNG) for generating pseudo-random sequences. However, given the significant challenges of privacy protection and data encryption, traditional PRNGs frequently fail to meet the increasing security demands of electronic systems. In contrast, True Random Number Generators (TRNGs) have emerged as essential security primitives in hardware security and are attracting increasing attention. In response to these challenges, this paper proposes a novel lightweight TRNG architecture based on a Galois LFSR. The design incorporates inverters and two-to-one multiplexers to modify the feedback path. The proposed structure has been implemented on AMD Xilinx Artix-7 and Kintex-7 FPGA boards. Notably, it is resource-efficient, using only 17 Look-Up Tables (LUTs) and 9 D Flip-Flops (DFFs), while generating random numbers at a throughput of 300 Mbps. Furthermore, the structure passes both randomness and robustness tests, indicating its promising application potential in secure electronic systems.
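The abstract does not detail the modified feedback path, so the sketch below shows only the baseline structure the TRNG starts from: a plain 8-bit Galois LFSR used as a PRNG. The tap mask 0xB8 (polynomial x^8 + x^6 + x^5 + x^4 + 1) and the function names are assumptions chosen for illustration; the proposed inverter/multiplexer feedback and its entropy source are not modeled.

```python
def galois_lfsr_step(state: int, mask: int = 0xB8) -> int:
    """Advance an 8-bit right-shifting Galois LFSR by one clock."""
    lsb = state & 1          # bit shifted out this cycle
    state >>= 1
    if lsb:
        state ^= mask        # feedback XORed into the tap positions
    return state


def lfsr_period(seed: int = 1, mask: int = 0xB8) -> int:
    """Count clocks until the register returns to its (non-zero) seed."""
    state, steps = galois_lfsr_step(seed, mask), 1
    while state != seed:
        state = galois_lfsr_step(state, mask)
        steps += 1
    return steps


if __name__ == "__main__":
    s = 1
    for _ in range(8):
        s = galois_lfsr_step(s)
        print(f"{s:08b}")
    print("period:", lfsr_period())  # 255 for this maximal-length tap mask
```

Because the sequence is fully determined by the seed and taps, it is only pseudo-random; that determinism is exactly what a TRNG design has to break with a physical entropy source.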
List of the following materials will be included with the Downloaded Backup:
Base Paper Abstract:
Multiple precision modes are needed in a floating-point processing element (PE) because they provide flexibility in handling numerical data with varying levels of precision and performance requirements. High-precision floating-point operations produce highly accurate results and allow a greater range of numerical representation. Conversely, low-precision operations offer faster computation and lower power consumption. In this paper, we propose a configurable multi-precision PE that supports Half Precision, Single Precision, Double Precision, BrainFloat-16 (BF-16), and TensorFloat-32 (TF-32). The design is realized in GPDK 45 nm technology and operates at a clock frequency of 281.9 MHz. The design was also implemented on a Xilinx ZCU104 FPGA evaluation board. Compared with previous state-of-the-art (SOTA) multi-precision PEs, the proposed design supports two additional floating-point data formats, namely BF-16 and TF-32. It achieves the best energy performance with 2368.91 GFLOPS/W and offers 63% improvement in operating
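The five formats differ only in how the bits are split between exponent and mantissa, which is what allows one PE datapath to be reconfigured among them. The field widths below are the standard IEEE 754 half/single/double, bfloat16, and TF32 layouts; the decoder is a hypothetical software reference model for finite values, not the paper's hardware.

```python
import struct

# (exponent bits, mantissa bits); every format also has 1 sign bit.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "TF32": (8, 10),   # 19 significant bits, usually held in a 32-bit container
    "FP32": (8, 23),
    "FP64": (11, 52),
}


def decode(bits: int, fmt: str) -> float:
    """Decode a finite value given as a raw bit pattern in the named format."""
    e_bits, m_bits = FORMATS[fmt]
    sign = (bits >> (e_bits + m_bits)) & 1
    exp = (bits >> m_bits) & ((1 << e_bits) - 1)
    man = bits & ((1 << m_bits) - 1)
    bias = (1 << (e_bits - 1)) - 1
    if exp == (1 << e_bits) - 1:
        raise ValueError("infinity/NaN not handled in this sketch")
    if exp == 0:  # zero or subnormal: no hidden 1, exponent fixed at 1 - bias
        value = man / (1 << m_bits) * 2.0 ** (1 - bias)
    else:         # normal number: hidden leading 1
        value = (1 + man / (1 << m_bits)) * 2.0 ** (exp - bias)
    return -value if sign else value


if __name__ == "__main__":
    # Cross-check FP32 decoding against Python's own float32 packing.
    raw = struct.unpack("<I", struct.pack("<f", 3.14159))[0]
    print(hex(raw), decode(raw, "FP32"))
```

BF-16 keeps the FP32 exponent range with a shorter mantissa, while TF-32 keeps the FP16 mantissa width with the FP32 exponent, so both can share the FP32 exponent logic.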
Base Paper Abstract:
Deep Neural Network (DNN) hardware accelerators are essential in a spectrum of safety-critical edge-AI applications with stringent reliability, energy efficiency, and latency requirements. Multiplication is the most resource-hungry operation in a neural network's processing elements. This paper proposes a scalable adaptive fault-tolerant approximate multiplier (AdAM) tailored for ASIC-based DNN accelerators at the algorithm and circuit levels. AdAM employs an adaptive adder that makes unconventional use of the input Leading One Detector (LOD) values for fault detection by exploiting otherwise unutilized adder resources. A gate-level optimized LOD design and a hybrid adder design are also proposed as part of the adaptive multiplier to improve hardware performance. The proposed architecture uses a lightweight fault mitigation technique that sets the detected faulty bits to zero. Hardware resource utilization and the DNN accelerator's reliability metrics are used to compare the proposed solution against Triple Modular Redundancy (TMR) multiplication, unprotected exact multiplication, and unprotected approximate multiplication. It is demonstrated that the proposed architecture enables multiplication with a reliability level close to that of TMR-protected multipliers while utilizing 2.74× less area and achieving a 39.06% lower power-delay product compared with the exact multiplier. Moreover, it has area, delay, and power consumption similar to state-of-the-art approximate multipliers of similar accuracy while also providing fault detection and mitigation. Index Terms: Deep neural networks, approximate computing, circuit design, reliability, DNN accelerator.
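The abstract does not give AdAM's datapath. As background on why a Leading One Detector appears in approximate multipliers at all, the sketch below implements Mitchell's classic logarithmic multiplier, a different and well-known design (not AdAM, and without its fault detection). The LOD finds each operand's leading-one position, which acts as the integer part of its base-2 logarithm; function names are hypothetical.

```python
def leading_one(x: int) -> int:
    """Leading One Detector: bit position of the most significant 1 (x > 0)."""
    return x.bit_length() - 1


def mitchell_multiply(a: int, b: int) -> int:
    """Mitchell's logarithmic approximate product of two unsigned integers."""
    if a == 0 or b == 0:
        return 0
    k1, k2 = leading_one(a), leading_one(b)
    f1, f2 = a - (1 << k1), b - (1 << k2)   # bits below each leading one
    # With a = 2**k1 * (1 + x1) and b = 2**k2 * (1 + x2):
    # s = (x1 + x2) * 2**(k1 + k2), computed with shifts only.
    s = (f1 << k2) + (f2 << k1)
    base = 1 << (k1 + k2)
    if s < base:                 # x1 + x2 < 1: product ~ 2**(k1+k2) * (1 + x1 + x2)
        return base + s
    return 2 * s                 # otherwise ~ 2**(k1+k2+1) * (x1 + x2)
```

The result never exceeds the exact product and is at most 1/9 (about 11.1%) below it, the worst case being when both fractional parts equal 0.5, e.g. 3 × 3 is approximated as 8.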
Proposed Abstract:
Sum of Absolute Differences (SAD) is mainly applied in block-matching tasks such as motion estimation for video compression, stereo matching for depth/disparity calculation, template matching in image/object detection, image registration (including medical imaging), and lightweight optical-flow/tracking systems, because it is simple, fast, and hardware-friendly. Traditional exact SAD hardware provides accurate results but consumes high power and requires a large area, while existing approximate designs reduce cost but often suffer from large errors and poor FPGA-specific optimization. To overcome these limitations, this work proposes an improved SAD hardware architecture that replaces the conventional full adder with a lightweight XOR–MUX structure. This change reduces delay and area by removing redundant logic and optimizing FPGA resource utilization. The novelty of the design lies in combining approximation with FPGA-aware optimization, achieving bounded error, reduced power consumption, and a higher operating frequency. The proposed system is implemented in Verilog HDL and tested on a Xilinx FPGA, showing improvements in LUT usage, clock frequency, and power efficiency, which makes it suitable for real-time video and image processing applications.
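The abstract does not say where the approximation enters the datapath, so this sketch covers only the pieces it names: the exact SAD reference model that an approximate design is scored against, and a common XOR–MUX full-adder formulation (which, as written here, is exact) chained into a ripple-carry adder. All function names are hypothetical.

```python
def xor_mux_full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Full adder built from XORs and one 2:1 MUX (1-bit inputs)."""
    p = a ^ b                     # propagate signal
    s = p ^ cin                   # sum bit
    cout = cin if p else a        # MUX: pass carry-in when propagating, else a (== b)
    return s, cout


def ripple_add(x: int, y: int, width: int) -> int:
    """Add two unsigned width-bit numbers with the XOR-MUX cell; returns width+1 bits."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = xor_mux_full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)


def sad(block_a: list[int], block_b: list[int]) -> int:
    """Exact Sum of Absolute Differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must be the same size")
    return sum(abs(p - q) for p, q in zip(block_a, block_b))
```

On an FPGA the MUX maps naturally onto LUT and carry-chain primitives, which is one plausible reason an XOR–MUX cell can beat a gate-level full adder there.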
Base Paper Abstract:
This brief presents an ultra-low-leakage, fast-conversion level shifter with a wide voltage conversion range and a wide operating frequency range. The proposed level shifter adopts leakage shutoff transistors, which completely cut off the static current when the circuit is on standby. The pull-down network employs a low-threshold transistor for a fast fall transition. The proposed level shifter also solves the swing problem and achieves fast conversion by using a voltage hysteresis transistor that strengthens the pull-up network, ensuring the internal node is charged quickly and fully. Measurement results in a 55 nm process show that the average leakage of the proposed level shifter is 34.8 pW when converting from a 0.3 V input to a 1.2 V output. Meanwhile, the average propagation delay and the average energy per transition are 13.86 ns and 22.71 fJ, respectively, for an input frequency of 1 MHz. The maximum conversion range is from 0.13 V to 1.2 V. Index Terms: Level shifter, ultra-low power, multi-supply voltage circuit, sub-threshold operation.
Base Paper Abstract:
This letter introduces an innovative approximate multiplier (AM) architecture that leverages bit streams stochastically generated by a Linear Feedback Shift Register (LFSR). The AM is applied to matrix-vector multiplication (MVM) in Neural Networks (NNs). Hardware implementations in 90 nm CMOS technology demonstrate superior power and area efficiency compared with state-of-the-art designs. Additionally, the study applies stochastic computing to LSTM NNs, showing improved energy efficiency and speed.
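To illustrate the principle the letter builds on, the sketch below multiplies two numbers stochastically: each operand becomes a bit stream by comparing it against an LFSR state, and a single AND gate multiplies the stream probabilities (unipolar encoding). The 7-bit and 8-bit LFSRs, their tap masks, and the function names are assumptions; this is not the letter's AM architecture.

```python
def galois_step(state: int, mask: int) -> int:
    """One clock of a right-shifting Galois LFSR."""
    lsb = state & 1
    state >>= 1
    return state ^ mask if lsb else state


def sc_multiply(a: int, b: int, cycles: int) -> int:
    """Unipolar stochastic multiply; returns the number of 1s in (stream_a AND stream_b).

    a (0..127) is encoded as a/127 with a 7-bit LFSR,
    b (0..255) is encoded as b/255 with an 8-bit LFSR.
    """
    s7, s8, ones = 1, 1, 0
    for _ in range(cycles):
        bit_a = 1 if s7 <= a else 0     # comparator-based stochastic number generator
        bit_b = 1 if s8 <= b else 0
        ones += bit_a & bit_b           # the AND gate is the whole multiplier
        s7 = galois_step(s7, 0x60)      # x^7 + x^6 + 1, period 127
        s8 = galois_step(s8, 0xB8)      # x^8 + x^6 + x^5 + x^4 + 1, period 255
    return ones


if __name__ == "__main__":
    n = 127 * 255
    print(sc_multiply(64, 128, n) / n, (64 / 127) * (128 / 255))
```

Because the two periods (127 and 255) are coprime, every pair of LFSR states occurs exactly once in 127 × 255 cycles, so the count equals a·b exactly; shorter streams trade accuracy for latency, which is the usual stochastic-computing trade-off.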
Base Paper Abstract:
Numerous obstacles to enhancing the performance of computing systems have spurred the emergence of approximate computing. Extensive studies on approximate computing have sought high-performance, energy-efficient hardware designs tailored to error-resilient applications. In this brief, we propose 8-bit approximate multipliers with 15 levels of accuracy using three techniques: recursive, bit-wise, and hybrid approximation using partial bit OR (PBO). Compared with existing multipliers, the investigated designs improve area, power, delay, Power-Delay Product (PDP), and Power-Area-Delay Product (PADP) by 41.68%, 73.16%, 35.57%, 72.65%, and 75.42%, respectively, on average. Compared with the accurate multiplier, area, power, delay, PDP, and PADP improve by 54.41%, 57.57%, 25.73%, 60.14%, and 74.33%, respectively, on average. Applied to image smoothing, edge detection, and image sharpening benchmarks, the multipliers achieve Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) values exceeding (30 dB, 94%), (31 dB, 96%), and (26 dB, 95%), respectively. Moreover, when evaluated in hardware implementations of deep neural networks, the multipliers attain performance exceeding 95%. The obtained results confirm that the proposed multipliers are well suited to a wide range of applications.
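The image-quality figures above are PSNR values (SSIM is more involved and omitted here). A minimal reference computation, assuming 8-bit grayscale images flattened into equally sized lists; the function name is illustrative:

```python
import math


def psnr(reference: list[int], test: list[int], max_value: int = 255) -> float:
    """Peak Signal-to-Noise Ratio in dB between two flattened images."""
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf          # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

In an approximate-multiplier study, `reference` would be the image filtered with exact multiplication and `test` the same filter run with the approximate multiplier.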
Proposed Abstract:
Digital clocks and stopwatches are widely used in daily applications such as consumer electronics, embedded devices, portable medical instruments, and time monitoring systems, as they provide simple and accurate time tracking. These systems offer advantages such as low cost, user-friendly operation, and high reliability; however, when clock and stopwatch functions are implemented separately, they suffer from hardware redundancy, higher power consumption, and limited integration. The main problem addressed in this work is the lack of a unified architecture that performs both digital clock and stopwatch operations using shared resources, which leads to inefficient hardware utilization and increased complexity in existing designs. Conventional systems generally use independent controllers and dedicated display drivers, resulting in additional overhead. To overcome this limitation, we propose a finite state machine (FSM) based architecture that integrates the digital clock and stopwatch modules into a single design with common display hardware. The system employs multiplexers and control signals to switch seamlessly between clock and stopwatch modes, while states such as idle, hour, minute, second, and pause are managed by the FSM logic. The novelty of this work lies in the resource-sharing approach, in which a common seven-segment display is driven by multiplexed outputs, reducing area, power, and switching complexity without compromising accuracy. The proposed design is implemented in a hardware description language and simulated on FPGA-based platforms to verify precise timing, functional correctness, and display reliability. Performance evaluation confirms that the system achieves efficient utilization of logic resources, accurate real-time operation, and flexibility for future extension to low-power VLSI and IoT-based applications.
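A behavioral sketch of the resource-sharing idea: the clock always keeps time, the stopwatch runs under its own small FSM, and a single mode multiplexer selects which count reaches the shared display path. The stopwatch states (IDLE, RUN, PAUSE), the 1 Hz `tick` interface, and the class name are assumptions; the abstract's time-setting states (hour, minute, second) are not modeled.

```python
class ClockStopwatch:
    """Clock and stopwatch sharing one display path through a mode multiplexer."""

    STOPWATCH_NEXT = {"IDLE": "RUN", "RUN": "PAUSE", "PAUSE": "RUN"}

    def __init__(self):
        self.mode = "CLOCK"        # display mux select: "CLOCK" or "STOPWATCH"
        self.clock_s = 0           # seconds since midnight
        self.sw_s = 0              # elapsed stopwatch seconds
        self.sw_state = "IDLE"

    def press_start_pause(self) -> None:
        self.sw_state = self.STOPWATCH_NEXT[self.sw_state]

    def reset_stopwatch(self) -> None:
        self.sw_state, self.sw_s = "IDLE", 0

    def tick(self) -> None:
        """Called once per 1 Hz enable pulse; the clock keeps time in every mode."""
        self.clock_s = (self.clock_s + 1) % 86_400
        if self.sw_state == "RUN":
            self.sw_s = (self.sw_s + 1) % 360_000   # wraps after 99:59:59

    def display(self) -> str:
        """Shared display path: select a source, then split it into HH:MM:SS digits."""
        t = self.clock_s if self.mode == "CLOCK" else self.sw_s
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}"
```

Only one divider/decoder chain exists, so switching modes changes the mux select rather than duplicating the seven-segment driver logic.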
Base Paper Abstract:
This letter presents a novel hardware-efficient approximate 4-2 compressor design that significantly enhances accuracy through a systematic analysis of input patterns obtained from practical applications. We incorporate a majority operation and a compound gate in the compressor design to effectively boost hardware efficiency in multiplications. Our design approach results in substantial error reductions, with normalized mean error distance (NMED) and mean relative error distance (MRED) decreasing by up to 74.84% and 82.04%, respectively, compared to existing approximate multipliers discussed in this letter. When implemented in a 32-nm CMOS technology, the approximate multiplier adopting the proposed 4-2 compressor achieves excellent hardware efficiency, reducing area, power, and energy consumption by up to 8.95%, 13.02%, and 13.02%, respectively, compared to the other alternatives. Moreover, our design delivers enhanced performance in image processing tasks, achieving up to a 4.84× increase in peak signal-to-noise ratio (PSNR) compared to other designs, all while optimizing hardware efficiency. Index Terms—Approximate multiplier, majority operation, compound gate, image processing, approximate 4-2 compressor.
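Approximate 4-2 compressors are judged against the exact one. As a reference point (the textbook exact design, not the proposed approximate compressor), the sketch builds it from two full adders; each full adder's carry output is the majority function mentioned above. Function names are illustrative.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)   # majority function
    return s, carry


def exact_compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """Exact 4-2 compressor built from two cascaded full adders.

    Returns (sum, carry, cout) with x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)      # cout feeds the next column's cin
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout
```

Since `cout` does not depend on `cin`, carries do not ripple across a row of compressors; approximate designs typically simplify this structure by dropping `cin`/`cout` and accepting a small error on rare input patterns.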
Base Paper Abstract:
In this article, a framework for the analog implementation of a deep convolutional neural network (CNN) is introduced and used to derive a new circuit architecture composed of an improved analog multiplier and circuit blocks implementing the ReLU activation function and the argmax operator. The operating principles of the individual blocks, as well as those of the complete architecture, are analysed and used to realize a low-power analog classifier consuming less than 1.8 µW. The proper operation of the classifier is verified by comparison with an equivalent software implementation, and its performance is evaluated against existing circuit architectures. The proposed architecture is implemented in a TSMC 90-nm CMOS process and simulated using Cadence IC Suite for both schematic and layout design. Corner and Monte Carlo mismatch simulations of the schematic and the physical circuit (post-layout) were conducted to evaluate the effect of transistor mismatch and process, voltage, and temperature (PVT) variations, and to showcase a proposed systematic method for offsetting their effect.
Base Paper Abstract:
The primary goal of approximate computing is to enhance system performance metrics such as energy efficiency, speed, and form factor. Despite the growing use of approximate multipliers, the design of efficient approximate compressors, a fundamental building block of multipliers, remains a significant challenge. In this brief, 8-transistor and 14-transistor 4:2 compressors are proposed. Both compressors are implemented in CMOS technology and apply constant and conditional approximation to selected inputs, producing fewer negative errors. As a result, a resource-expensive error recovery module is eliminated, yielding superior performance compared with prior art. The 14-transistor architecture yields a lower error rate than the 8-transistor architecture, trading area for accuracy. A multiplier architecture tailored to the compressors is also proposed and evaluated on image multiplication. The proposed multiplier exhibits 50% area savings and a 93% lower power-delay product (PDP) than the exact multiplier, as well as higher accuracy and a 38% PDP improvement compared with the state of the art.
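Error rate, NMED, and MRED are the usual accuracy measures for approximate compressors and multipliers. A minimal sketch of their common definitions follows; conventions vary slightly between papers, and excluding zero exact outputs from MRED is an assumption made here. The function name is hypothetical.

```python
def error_metrics(exact: list[int], approx: list[int], max_output: int) -> dict[str, float]:
    """Common error metrics for approximate arithmetic circuits.

    ER   : fraction of outputs that differ from the exact result
    MED  : mean error distance |approx - exact|
    NMED : MED normalized by the largest possible exact output
    MRED : mean of |approx - exact| / exact over cases where exact != 0
    """
    n = len(exact)
    eds = [abs(a - e) for e, a in zip(exact, approx)]
    reds = [ed / e for ed, e in zip(eds, exact) if e != 0]
    med = sum(eds) / n
    return {
        "ER": sum(1 for ed in eds if ed) / n,
        "MED": med,
        "NMED": med / max_output,
        "MRED": sum(reds) / len(reds),
    }


if __name__ == "__main__":
    # Toy example: a 4-bit x 4-bit multiplier that simply drops the product's LSB.
    exact = [a * b for a in range(16) for b in range(16)]
    approx = [p & ~1 for p in exact]
    print(error_metrics(exact, approx, max_output=15 * 15))
```

For a multiplier, `max_output` is normally the largest exact product, (2^n - 1)^2 for n-bit operands, evaluated exhaustively over all input pairs.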
Base Paper Abstract:
The integrated electronic nose (e-nose), which combines sensor arrays and recognition algorithms, has been widely used in different fields. However, current integrated e-nose systems usually suffer from low accuracy with simple algorithm structures and slow speed with complex ones. In this article, we propose a method for implementing a deep neural network for odor identification on a small-scale Field-Programmable Gate Array (FPGA). First, a lightweight odor identification depthwise separable convolutional neural network (OI-DSCNN) is proposed to reduce parameters and accelerate the hardware implementation. Next, the OI-DSCNN is implemented on a Zynq-7020 SoC chip using a quantization method, namely, the saturation-flooring KL divergence scheme (SF-KL). The OI-DSCNN was evaluated on a Chinese herbal medicine dataset, and both simulation experiments and the hardware implementation validate its effectiveness. These findings shed light on fast and accurate odor identification on FPGAs.
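The parameter saving that motivates a depthwise separable CNN can be checked with simple arithmetic. The layer sizes used in the check are illustrative, not taken from OI-DSCNN, and biases are ignored.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution layer."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1 x 1 convolution mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    std, dws = conv_params(3, 32, 64), depthwise_separable_params(3, 32, 64)
    print(std, dws, round(std / dws, 2))
```

For a 3 × 3 layer with 32 input and 64 output channels this gives 18,432 versus 2,336 weights (roughly 7.9× fewer), and the multiply count shrinks by the same factor per output pixel, which is what makes the network fit a small FPGA.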
Proposed Abstract:
Intelligent elevator systems are used in many smart buildings, offices, hospitals, and tall apartment buildings to move people quickly, reduce waiting time, and save energy. They have many advantages, such as faster operation, better safety, and the ability to handle requests from many floors at the same time. However, they also have disadvantages, such as slow response under heavy use, fixed movement patterns that cannot adjust to real-time needs, weak security for restricted floors, and no use of advanced AI features for learning and prediction. Most existing elevator systems are built using microcontrollers with fixed scheduling methods, which cannot easily change their operation or add smart features. The goal of this work is to create an elevator system that is faster, more secure, adaptable to different situations, and ready for AI integration, while keeping passengers safe. In this project, we design an elevator controller on an FPGA using a finite state machine (FSM). The system includes floor request handling, priority scheduling, emergency stop, overload detection, automatic door timing, floor number display, passcode access for special floors, and a fire alarm mode. The novelty of this work is to combine the speed and flexibility of FPGA hardware with an FSM design that can later connect to AI for learning passenger habits and predicting movement needs. This makes the system fast, safe, and adaptable. The design is written in Verilog HDL, tested in ModelSim, and implemented on a Xilinx FPGA board. We measure performance by checking response time, scheduling efficiency, and safety accuracy, and the results show that it is suitable for future smart building use.
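A minimal FSM sketch of the request-handling core, assuming one floor of travel per step and a simple SCAN-style rule (keep moving while requests lie ahead, otherwise reverse). The state names, input priorities, and class API are hypothetical; passcode access, fire-alarm mode, the door timer, and the display are not modeled.

```python
class Elevator:
    """Simplified FSM with states IDLE, MOVING_UP, MOVING_DOWN, DOOR_OPEN, EMERGENCY."""

    def __init__(self, floors: int):
        self.floors = floors
        self.floor = 0
        self.direction = 1            # +1 up, -1 down
        self.state = "IDLE"
        self.requests: set[int] = set()
        self.served: list[int] = []

    def request(self, floor: int) -> None:
        if not 0 <= floor < self.floors:
            raise ValueError("no such floor")
        self.requests.add(floor)

    def step(self, emergency: bool = False, overload: bool = False) -> str:
        """Advance one clock step; inputs are sampled once per step."""
        if emergency:                                   # highest priority: stop in place
            self.state = "EMERGENCY"
        elif self.state == "DOOR_OPEN" and overload:
            pass                                        # keep the door open while overloaded
        elif self.floor in self.requests:               # arrived at a requested floor
            self.requests.discard(self.floor)
            self.served.append(self.floor)
            self.state = "DOOR_OPEN"
        elif not self.requests:
            self.state = "IDLE"
        else:
            ahead = [f for f in self.requests if (f - self.floor) * self.direction > 0]
            if not ahead:                               # nothing left this way: reverse
                self.direction = -self.direction
            self.floor += self.direction
            self.state = "MOVING_UP" if self.direction > 0 else "MOVING_DOWN"
        return self.state
```

Placing the safety inputs first in the `if` chain mirrors how a hardware FSM would give emergency and overload signals priority over normal scheduling.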
Base Paper Abstract:
This paper presents an efficient FPGA-based system for automatic brain tumor detection from MRI images using a 3×3 convolutional edge detection method with stride 1. The proposed architecture is developed as a soft IP core in Verilog HDL and synthesized on a Xilinx Zynq 7000 FPGA platform. The system applies a customized 3×3 convolution kernel over each MRI image with stride 1, ensuring that every pixel is processed and fine image details are preserved for accurate tumor detection. Edge detection results are used to segment and highlight abnormal regions, and a thresholding mechanism is employed to differentiate between normal and abnormal images. Hardware resource utilization, including look-up tables (LUTs) and flip-flops (FFs), and power consumption are analyzed after synthesis to verify system efficiency. Experimental results confirm that the proposed FPGA implementation provides real-time processing and reliable brain tumor detection with low power usage, making it suitable for portable and embedded medical devices. The stride 1 approach guarantees maximum detection accuracy and detailed edge representation in all test cases.
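A software reference model of the stride-1 pipeline (3×3 window, edge response, threshold) is useful for checking an HDL implementation pixel by pixel. The Laplacian kernel is an assumed stand-in for the paper's customized kernel, and the function names are hypothetical.

```python
KERNEL = [[0, -1, 0],
          [-1, 4, -1],
          [0, -1, 0]]   # Laplacian: assumed stand-in for the paper's custom kernel


def conv3x3_stride1(img: list[list[int]], kernel=KERNEL) -> list[list[int]]:
    """'Valid' 3x3 convolution with stride 1: every interior pixel gets an output.

    The kernel is symmetric, so correlation and convolution coincide here.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * img[r + i - 1][c + j - 1]
            row.append(acc)
        out.append(row)
    return out


def edge_map(response: list[list[int]], threshold: int) -> list[list[int]]:
    """Mark pixels whose absolute edge response reaches the threshold."""
    return [[1 if abs(v) >= threshold else 0 for v in row] for row in response]
```

With stride 1 the window slides one pixel at a time, so an H×W image yields (H−2)×(W−2) responses; in hardware this is usually fed by two line buffers so each pixel is read from memory only once.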
Base Paper Abstract:
Computing in memory (CIM), which alleviates the need to transfer large amounts of data between processor and memory and thereby significantly reduces latency and energy consumption, is a promising computing architecture for addressing the von Neumann bottleneck. This article proposes a CIM array structure composed of self-recycling 10T static random access memory (SRAM) cells, which can realize orthogonal data writing and multiple Boolean logic operations across the entire array. The self-recycling and full-array activation characteristics make it well suited to accelerating diverse data processing algorithms such as the Advanced Encryption Standard (AES). A 4-kb SRAM is implemented in 55-nm CMOS technology to verify the effectiveness of the design. Compared with other state-of-the-art architectures, the throughput and operating frequency of the proposed CIM macro increase to 843 GOPS/kb (2.64×) and 823.7 MHz (2.6×), respectively. The energy efficiency reaches 246.9 TOPS/W. When applied to AES, the energy consumption is 35.77% lower than that of a digital CIM architecture without self-recycling.
Base Paper Abstract:
In recent years, FPGA-based convolutional neural network (CNN) accelerators have received tremendous research interest, especially in fields such as autonomous driving and robotics. The Winograd fast convolution algorithm is frequently employed to accelerate convolution computations. However, when the Winograd algorithm is implemented on an FPGA, multiple rounding operations occur, and their accuracy substantially affects the convolution results. Compared with other rounding algorithms, banker's rounding has advantages such as a more symmetric error distribution and smaller errors, making it suitable for Winograd convolution. However, the conventional banker's rounding algorithm was designed for floating-point calculations, whereas FPGAs implement fixed-point arithmetic. Moreover, it frequently rounds 0.5 to 0, causing convolution weight invalidation and introducing significant errors. To overcome these challenges, an improved hardware circuit for implementing fixed-point banker's rounding is proposed. Experimental results show that, compared with common round-up and round-down methods, the proposed algorithm exhibits smaller errors and effectively resolves the weight invalidation issue of conventional banker's rounding, yielding a 55.6% improvement in computational accuracy.
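For context, this is conventional banker's rounding (round half to even) applied to a fixed-point value with k fractional bits, using only shifts and comparisons as hardware would. It reproduces the 0.5 → 0 behavior described above; the paper's improved circuit is not shown, and the function name is hypothetical.

```python
def round_half_even_fixed(v: int, k: int) -> int:
    """Round the fixed-point value v / 2**k to the nearest integer, ties to even.

    v is a signed integer holding k fractional bits (two's complement semantics).
    """
    if k == 0:
        return v
    q = v >> k                    # arithmetic shift = floor(v / 2**k)
    r = v - (q << k)              # discarded fraction bits, 0 <= r < 2**k
    half = 1 << (k - 1)
    if r > half or (r == half and q & 1):   # above half, or a tie with odd q
        q += 1
    return q


if __name__ == "__main__":
    for v in (2, 6, -2, -6):      # with k = 2: 0.5, 1.5, -0.5, -1.5
        print(v / 4, "->", round_half_even_fixed(v, 2))
```

For comparison, round-down is simply `v >> k` and round-half-up is `(v + (1 << (k - 1))) >> k`. The tie case `0.5 -> 0` is the one that can turn a small Winograd weight into zero.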
Base Paper Abstract:
This paper presents a lightweight, high-entropy true random number generator architecture featuring an innovative quad cross-coupled feedback mechanism to enhance randomness. The primary goal is to develop an efficient and secure true random number generator that addresses the growing demand for reliable random number generation in cryptographic and security-critical applications. The motivation stems from the need to improve entropy, reduce resource utilization, and ensure robustness across varying technologies. To achieve near-perfect randomness, the Quad-Input Oscillating Circuit module integrates self-coupled, jitter-inducing ring oscillators with cross-coupled feedback loops to induce metastability. Comprehensive evaluations confirm a Shannon entropy of 0.999818, a min-entropy of 0.977257, and a collision entropy of 0.999636. The design was synthesized using Synopsys Design Compiler at 45 nm, 32 nm, and 14 nm, achieving a maximum frequency of 6.7 GHz, power consumption as low as 72 μW, and an area of 24 μm² at 14 nm. Rigorous validation through multiple statistical test suites, including AIS-31, autocorrelation, deviation, Diehard, National Institute of Standards and Technology (NIST) SP 800-22 and SP 800-90B, and TestU01, confirms its efficiency and reliability. The generated random bits were output as oscilloscope-viewable signals on an Altera Cyclone V Field Programmable Gate Array, representing a significant advancement in secure random number generation technologies.
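The three entropy figures above follow standard definitions. The sketch below computes simple plug-in (frequency-count) estimates per bit; the NIST SP 800-90B min-entropy estimators are considerably more conservative, so this only illustrates the definitions, and the function name is hypothetical.

```python
import math
from collections import Counter


def entropy_estimates(bits: list[int]) -> dict[str, float]:
    """Plug-in Shannon, min-, and collision entropy per bit of a 0/1 sequence."""
    n = len(bits)
    probs = [count / n for count in Counter(bits).values()]
    return {
        "shannon": -sum(p * math.log2(p) for p in probs),
        "min": -math.log2(max(probs)),                    # set by the most likely symbol
        "collision": -math.log2(sum(p * p for p in probs)),  # Renyi entropy of order 2
    }
```

For any distribution the estimates satisfy min ≤ collision ≤ Shannon ≤ 1 bit, which matches the ordering of the reported values (0.977257 ≤ 0.999636 ≤ 0.999818).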
Base Paper Abstract:
In this paper, we propose a novel approximate floating-point divider based on bi-dimensional linear approximation. In our approach, the mantissa quotient is seen as a function of the two input mantissas of the divider. The domain of this two-variable function is partitioned into nx × ny subregions, called tiles, where nx and ny are chosen as powers of two. In each tile, the quotient is approximated by a linear combination of the input mantissas. To achieve fine accuracy, an optimization problem is formulated within each tile to determine the optimal coefficients for the linear combination, which minimize the Mean Relative Error Distance (MRED) of the divider. Furthermore, to make the hardware implementation more efficient, the minimization problem is modified to search for optimal quantized coefficients. The hardware structure of the divider requires only a small look-up table to store the linear approximation coefficients and a carry-save adder tree. The proposed architecture is highly tunable at design time over a wide range of accuracy, depending on the number of tiles chosen for the approximation. The obtained results demonstrate error performance and hardware features superior to the state of the art. The proposed dividers define the Pareto front for the trade-offs between power-delay product vs. MRED and area-delay product vs. MRED, for MRED in the range of 4 × 10⁻³ to 2 × 10⁻². Application results for JPEG compression and tone mapping further highlight the strength of our proposal, which exhibits a Structural Similarity Index (SSIM) very close to 1 in all cases and a Peak Signal-to-Noise Ratio (PSNR) of up to 45 dB. Index Terms: Floating-point divider, approximate computing, error correction, low-power.
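The tiling idea can be sketched in software. As a deliberate simplification, the per-tile coefficients below come from a first-order Taylor expansion of x/y at each tile centre rather than from the paper's MRED-optimal, quantized fit, and floats stand in for fixed-point hardware; all function names are hypothetical.

```python
def tile_coeffs(nx: int, ny: int) -> list[list[tuple[float, float, float]]]:
    """Per-tile (c0, cx, cy) so that x / y ~ c0 + cx * x + cy * y for mantissas in [1, 2)."""
    coeffs = []
    for i in range(nx):
        row = []
        for j in range(ny):
            x0 = 1 + (i + 0.5) / nx          # tile centre
            y0 = 1 + (j + 0.5) / ny
            cx = 1 / y0                      # dq/dx at the centre
            cy = -x0 / y0 ** 2               # dq/dy at the centre
            c0 = x0 / y0 - cx * x0 - cy * y0
            row.append((c0, cx, cy))
        coeffs.append(row)
    return coeffs


def approx_quotient(x: float, y: float, coeffs) -> float:
    nx, ny = len(coeffs), len(coeffs[0])
    # With power-of-two nx, ny these indices are just the top mantissa bits.
    i = min(int((x - 1) * nx), nx - 1)
    j = min(int((y - 1) * ny), ny - 1)
    c0, cx, cy = coeffs[i][j]                # the small coefficient look-up table
    return c0 + cx * x + cy * y


def mred(nx: int, ny: int, samples: int = 64) -> float:
    """Mean relative error distance over a uniform grid of mantissa pairs."""
    coeffs = tile_coeffs(nx, ny)
    total = 0.0
    for a in range(samples):
        for b in range(samples):
            x = 1 + (a + 0.5) / samples
            y = 1 + (b + 0.5) / samples
            q = x / y
            total += abs(approx_quotient(x, y, coeffs) - q) / q
    return total / samples ** 2
```

Doubling the tiles per axis roughly quarters the error of a first-order fit, which is the design-time accuracy knob the abstract describes; in hardware the sum c0 + cx·x + cy·y is what the carry-save adder tree evaluates.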
Base Paper Abstract:
The dual-edge-triggered flip-flop samples data on both the positive and negative edges of the clock. Hence, it can lower clock-related power consumption compared with a single-edge-triggered flip-flop while maintaining the same data throughput. In this paper, we present two low-power, low-energy dual-edge-triggered TSPC flip-flops based on a latch-mux methodology. These two flip-flops, Low-Power at Low Data Activity (LPLD-DET) and Low-Power at High Data Activity (LPHD-DET), are suitable for low-power applications. Both flip-flops are fully static and contention-free. Post-layout simulation results in TSMC 65 nm CMOS technology suggest that, compared with other state-of-the-art dual-edge-triggered TSPC flip-flops, LPLD-DET is the most power-efficient for data activities up to 30%, and LPHD-DET is the most power-efficient for data activities of 45% and above.
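A purely behavioral sketch (not a transistor-level TSPC model) of why a dual-edge-triggered flip-flop keeps the same data throughput at half the clock frequency: it captures data on every clock transition rather than only on rising ones. The sampled-waveform representation and the function name are assumptions.

```python
def flip_flop(clk: list[int], d: list[int], dual_edge: bool, q0: int = 0) -> list[int]:
    """Sampled-waveform model: q updates on rising edges, or on both edges if dual_edge."""
    q, out = q0, []
    for i, level in enumerate(clk):
        if i > 0:
            rising = clk[i - 1] == 0 and level == 1
            falling = clk[i - 1] == 1 and level == 0
            if rising or (dual_edge and falling):
                q = d[i]
        out.append(q)
    return out


if __name__ == "__main__":
    clk = [0, 1, 0, 1, 0]
    d = [5, 6, 7, 8, 9]
    print(flip_flop(clk, d, dual_edge=False))
    print(flip_flop(clk, d, dual_edge=True))
```

The single-edge version misses every value presented before a falling edge, so it would need a clock twice as fast (and twice the clock-network switching) to capture the same data stream.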
Base Paper Abstract:
Wearable Artificial Intelligence-of-Things (AIoT) devices require smart gadgets that are both resource- and energy-efficient. In this paper, we explore an efficient implementation of a binary convolutional neural network employing function merging and block reuse techniques. The hardware, implemented on a field programmable gate array (FPGA) platform, classifies ventricular beats in electrocardiograms with an accuracy of 97.5%, sensitivity of 85.7%, specificity of 99.0%, precision of 92.3%, and F1-score of 88.9%, while consuming only 10.5 µW of dynamic power.
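Binary CNNs replace multiply-accumulate with XNOR and popcount, which is what makes them cheap on FPGAs. The sketch below shows that identity on ±1 vectors; the bit packing and function names are assumptions, and the paper's function-merging and block-reuse techniques are not modeled.

```python
def binarize(v: list[int]) -> int:
    """Pack a +1/-1 vector into an integer: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, x in enumerate(v):
        if x not in (1, -1):
            raise ValueError("binary networks use only +1/-1 values")
        if x == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a: list[int], w: list[int]) -> int:
    """Dot product of two +1/-1 vectors computed with XNOR and popcount."""
    n = len(a)
    mask = (1 << n) - 1
    matches = ~(binarize(a) ^ binarize(w)) & mask   # XNOR: 1 where signs agree
    pop = bin(matches).count("1")
    return 2 * pop - n                              # agreements minus disagreements
```

Each agreeing position contributes +1 and each disagreeing one −1, so the dot product is 2·popcount − n; a hardware neuron then only needs XNOR gates, a popcount tree, and a threshold comparison.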
Proposed Abstract:
Embedded memories are increasingly used in advanced System-on-Chip (SoC) designs for applications such as networking, automotive control, and medical imaging, where reliability and performance are critical. Ensuring fault-free operation of these memories is essential, yet memory testing remains a major challenge. Conventional MBIST architectures, while effective, often introduce significant silicon overhead, add design complexity, and lack flexibility for post-fabrication updates. In addition, existing memory test algorithms have their own drawbacks: March-C is widely applied and provides high fault coverage, but it requires long test times due to bit-oriented operations and large numbers of read–write cycles; MATS+ is simple and efficient but suffers from lower coverage, particularly for coupling and complex dynamic faults; and MATS++ improves on MATS+ with better detection capability, yet it still trades off hardware cost and scalability when applied to larger 32-bit word-oriented memories. Furthermore, most existing implementations are optimized for small SRAMs and are not easily scalable to clustered embedded memories in SoCs, nor do they fully exploit standard boundary-scan infrastructure for low-cost testing. To address these problems, this work proposes a scalable JTAG-based 32-bit memory test architecture that reuses IEEE 1149.1 boundary-scan resources to apply and compare March-C, MATS+, and MATS++ algorithms in both single-bit and multi-bit test modes. The proposed framework minimizes additional hardware cost by integrating BIST control into boundary-scan registers, while enabling algorithm programmability and flexibility for different memory clusters. The novelty lies in providing a detailed performance comparison of these algorithms under a unified boundary-scan-based architecture, focusing on trade-offs between fault coverage, test time, and silicon overhead. 
The design is implemented in Verilog HDL and synthesized on an FPGA using Xilinx Vivado, where parameters such as area, power, and latency are evaluated to validate efficiency and practical applicability for SoC-level memory testing.
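A behavioral sketch of the three March algorithms on a 32-bit word-oriented memory with injected stuck-at faults. It uses the widely used 10N March C- variant for March-C and solid all-0/all-1 data backgrounds; these choices, and the class and function names, are assumptions rather than the proposed JTAG/boundary-scan architecture. The operation counts (5N, 6N, 10N) illustrate the test-time trade-off the abstract describes.

```python
ALL0, ALL1 = 0x00000000, 0xFFFFFFFF   # solid data backgrounds for 32-bit words

# Each element is (address order, operations); ("w", v) writes v, ("r", v) reads and expects v.
MATS_PLUS = [("up", [("w", ALL0)]),
             ("up", [("r", ALL0), ("w", ALL1)]),
             ("down", [("r", ALL1), ("w", ALL0)])]
MATS_PLUS_PLUS = [("up", [("w", ALL0)]),
                  ("up", [("r", ALL0), ("w", ALL1)]),
                  ("down", [("r", ALL1), ("w", ALL0), ("r", ALL0)])]
MARCH_C_MINUS = [("up", [("w", ALL0)]),
                 ("up", [("r", ALL0), ("w", ALL1)]),
                 ("up", [("r", ALL1), ("w", ALL0)]),
                 ("down", [("r", ALL0), ("w", ALL1)]),
                 ("down", [("r", ALL1), ("w", ALL0)]),
                 ("up", [("r", ALL0)])]


class FaultyMemory:
    """Word-addressed memory with optional stuck-at faults given as {(addr, bit): value}."""

    def __init__(self, size: int, stuck_at=None):
        self.words = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr: int, value: int) -> None:
        for (a, bit), v in self.stuck_at.items():
            if a == addr:                    # a faulty cell ignores the written value
                value = (value | (1 << bit)) if v else (value & ~(1 << bit))
        self.words[addr] = value & ALL1

    def read(self, addr: int) -> int:
        return self.words[addr]


def run_march(mem: FaultyMemory, algorithm) -> tuple[bool, int]:
    """Apply a March algorithm; returns (passed, number of memory operations)."""
    n, ops, passed = len(mem.words), 0, True
    for order, element in algorithm:
        addrs = range(n) if order == "up" else range(n - 1, -1, -1)
        for addr in addrs:
            for op, value in element:
                ops += 1
                if op == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    passed = False
    return passed, ops
```

Simple stuck-at faults are caught by all three algorithms; the extra elements of MATS++ and March C- pay for themselves on coupling and transition faults, which this minimal fault model does not include.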
We provide online support worldwide, including proper execution and explanation, along with an explanation video file covering execution.
NXFEE provides 24x7 online support. You can call or text +91 9789443203, or email us at nxfee.innovation@gmail.com.
Customers are advised to watch the project output video and test it against their requirements before payment; corrections can be requested at that stage.
After payment, corrections to the project are accepted, but changes to the requirements incur updated charges based on the new requirements.
After payment, students' doubts, correction requests, software errors, hardware errors, and coding questions are accepted.
Online support will not be given more than three times.
The first session includes a complete explanation with video file support; the other two cover doubt clarification only.
If there is any issue with a software license or a system error, we will support you and rectify it by the end of the day.
Extra charges apply for a duplicate copy of the bill. The bill must be paid in full; no partial payment will be accepted.
After payment, you must send the payment receipt to our email ID.
Powered by NXFEE INNOVATION, Pondicherry.
Copyright © 2024 Nxfee Innovation.