A High-Throughput Hardware Accelerator for Lossless Compression of a DDR4 Command Trace
In a memory system, understanding how the host stresses the memory is important for improving memory performance. Accordingly, the need to analyze the memory command trace, which the memory controller sends to the dynamic random access memory, has increased. However, this trace is very large; consequently, a high-throughput hardware (HW) accelerator that can efficiently compress these data in real time is required. This paper proposes a high-throughput HW accelerator for lossless compression of the command trace. The proposed HW is designed as a pipeline that processes Huffman tree generation, encoding, and stream merge. To avoid the HW cost increase caused by high-throughput processing, the Huffman tree is implemented efficiently using static random access memory-based queues and bitmaps. In addition, variable-length stream merge is performed at very low cost by reducing the HW wire width using the mathematical properties of Huffman coding and by processing the metadata and the Huffman codewords separately through FIFOs. Furthermore, to improve the compression efficiency of DDR4 memory commands, the proposed design includes two preprocessing operations, the “don’t care bits override” and the “bits arrange,” which exploit the operating characteristics of DDR4 memory. The proposed compression architecture with these preprocessing operations achieves a high throughput of 8 GB/s with a compression ratio of 40.13% on average. Moreover, the total HW resource per unit throughput of the proposed architecture is superior to previous implementations.
Owing to the increasing demand for big data and neural networks, the role of dynamic random access memory (DRAM) has become increasingly important, and the memory bandwidth (BW) has rapidly increased. To understand how the host stresses the DRAM system and how the memory becomes vulnerable in certain situations, it is necessary to analyze the memory access pattern through which the system controls the DRAM. Memory performance depends on the memory access patterns, and memory trace analysis for these access patterns has been actively conducted by memory manufacturers to improve DRAM performance. However, as the memory command trace data have a throughput of 32 B/cycle based on the most commonly used DDR4 memory, the amount of data to be collected for analysis is very large. Because the cost of storing and analyzing such large amounts of data is considerably high, and the memory BW required to transfer them to storage in real time is also very large, various compression studies have been undertaken to reduce the trace data size.
For efficient trace data compression, the following three conditions should be satisfied. First, lossless compression is essential, as no loss of memory command trace data may occur. Second, a hardware (HW) design with high speed and high throughput is required to compress on-the-fly data with a high BW of 32 B/cycle in a high-speed system environment above 250 MHz. As data BW and operating frequency are gradually increasing in memory systems, high-throughput processing of 8 GB/s is generally required. This characteristic has become even more important in recent years, as memory structures such as high-BW memory are in the spotlight for supporting the rapid growth in data volume. Therefore, a trace analysis system essentially requires an HW-based compression accelerator instead of a software (SW)-based compression method; SW-based compression is much slower than HW-based compression, which degrades overall system performance. Third, HW cost and complexity should be low while maintaining high compression efficiency. In other words, the tradeoff between compression efficiency and HW resources must be carefully considered.
Lossless compression is classified into two categories: the dictionary-based method and the statistical method. The dictionary-based method, represented by the Lempel–Ziv series, compresses data using the location information of previously seen data. This method achieves a relatively high compression ratio but requires large HW resources and high complexity, because the dictionary table must be updated continuously for each input datum by checking its dependence on, and location relative to, previous data. On the other hand, the entropy-based statistical method, represented by Huffman coding, constructs a tree from the frequency of symbol occurrence and compresses data by allocating codewords according to the tree. This method can be implemented with relatively low complexity but yields a relatively low compression ratio. To combine the advantages of both methods while compensating for their respective disadvantages, lossless compression schemes combining various compression algorithms have been widely studied, and the HW implementation of these lossless compression techniques has also been actively investigated.
Nunez and Jones propose an HW architecture of the X-MatchPro algorithm that compresses data at a throughput of 1.6 Gbit/s with high HW scalability. Lin et al. propose an HW structure that combines parallel dictionary Lempel–Ziv–Welch (PDLZW) and dynamic Huffman coding; it offers lower HW complexity and better compression efficiency than dynamic Huffman coding alone. Fowers et al. propose an HW architecture of the DEFLATE algorithm that delivers relatively high-throughput processing of 5.6 GB/s with high HW scalability. All these previous studies use Huffman coding, but not optimal Huffman coding. Nunez and Jones and Fowers et al. use a predetermined tree structure, so they cannot compress data while adapting the tree structure to the data. Lin et al. cannot process compression and tree formation at the same time because they build an approximated tree structure in an offline process. Thus, the compression ratio is inevitably sacrificed. Furthermore, the operating frequency of these previous schemes is relatively low, and except for the design of Fowers et al., they can only process input data with a small BW, rendering it difficult to handle high-throughput data of 8 GB/s. This means that these previous schemes are incomplete in terms of compression efficiency and HW complexity/cost.
Herein, to compensate for these drawbacks, the tradeoff between compression efficiency and HW cost is carefully considered through efficient pipeline processing of block-based compression and HW-optimized design techniques. More specifically, this paper proposes an HW compression accelerator architecture capable of high-throughput processing based on block Huffman coding. Block Huffman coding is a relatively simple statistical compression method that applies Huffman coding to the input data one block at a time. Although its compression efficiency is slightly lower than that of the original Huffman coding, the two-pass operation of the original Huffman coding can be performed as a one-pass operation through the block-based pipeline structure. Moreover, because the frequency of symbol occurrence is limited by the block size, the worst-case Huffman codeword length can be significantly reduced. Therefore, it is possible to drastically reduce the HW resources and to process high-throughput on-the-fly data compression. In the HW implementation of Huffman coding, Huffman tree formation and the merge processing of variable-length coding are very difficult. To address these problems, the proposed HW architecture efficiently generates a new tree for each block of input data using static random access memory (SRAM)-based queues and bitmaps, and processes the variable-length stream at low cost using FIFOs and HW wire-width optimization. All of these processes are pipelined to achieve a high throughput of 8 GB/s. In addition, as the Huffman tree is generated for every block, the scheme achieves a higher compression ratio than a predefined or approximated Huffman tree. Consequently, the proposed HW is excellent in terms of the tradeoff between HW cost and throughput.
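As a rough software sketch of this block-based scheme, each block gets its own freshly built Huffman tree. The block size here is an assumption, and a heap stands in for the paper's queue-based construction; only the resulting code lengths are computed, since they determine the payload size:

```python
import heapq
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size; the paper's actual value may differ

def code_lengths(block: bytes) -> dict:
    """Huffman code lengths for one block (heap-based, for clarity)."""
    freq = Counter(block)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, tie_breaker, {symbol: depth-so-far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees deepens every contained symbol by one.
        heapq.heappush(heap, (f1 + f2, tie,
                              {s: d + 1 for s, d in {**left, **right}.items()}))
        tie += 1
    return heap[0][2]

def block_huffman_bits(data: bytes) -> int:
    """Total payload bits when a fresh tree is built per block."""
    total = 0
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        lengths = code_lengths(block)
        freq = Counter(block)
        total += sum(freq[s] * lengths[s] for s in freq)
    return total
```

Because every block carries its own tree, the decoder also needs the per-block symbol/frequency metadata; that overhead is what the metadata module described later in the paper accounts for.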
For these reasons, block Huffman coding, which satisfies the second (i.e., high throughput) and third (i.e., low HW cost) conditions mentioned above, has been selected in this paper for memory trace compression.
- Encoding efficiency is low in terms of delay and power consumption.
- Lower throughput.
- Larger number of memories required.
In today's digital networks, data transfer involves increasing complexity owing to the number of data bits and memory operations, which leads to data losses and low throughput. Therefore, the proposed work of this paper presents lossless data compression with a low-memory architecture based on the Huffman compression technique. The Huffman machine presented here is lossless, uses a compact memory configuration, and supports multibit operations for data compression. In this design, an input of 640 data bits is compressed to 388 data bits using variable-length block Huffman encoding. Internally, the Huffman machine takes 160 data bits as input and produces 45 codewords together with 52 bits of metadata. Finally, this work is implemented in Verilog HDL and synthesized for a Virtex FPGA to obtain area, delay, and power results.
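As a quick sanity check on the figures above (assuming "compression ratio" is reported as the fraction of input bits saved, which is one common reading), the 640-to-388-bit example is consistent with the roughly 40% average reported in the abstract:

```python
# Figures quoted above: 640 input bits compressed to 388 output bits.
original_bits = 640
compressed_bits = 388

# Fraction of input bits saved; about 39.4%, close to the 40.13% average.
saved = 1 - compressed_bits / original_bits
```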
Fig. 1 shows the overall structure of the proposed HW design. It consists of three parts: block Huffman, parallel stream merge, and preprocessing. The block Huffman module consists of a buffer, a Huffman front part, and a Huffman back part, as shown in Fig. 2. In the Huffman front part, the frequency count module calculates the symbol frequency of the input data, and the sorting module arranges the symbols in frequency order by merge sort. In the Huffman back part, a Huffman tree is formed using the symbols and frequencies sorted in the Huffman front part. In this process, the module uses SRAM-based queues and bitmaps to form the Huffman tree and the Huffman codewords efficiently. The parallel stream merge module packs the output streams of four parallel compression machines into one stream. It consists of the metadata, intramerge, intermerge, and stream out modules. In the metadata module, symbol and frequency information are packed into a single stream, because they are needed to reconstruct the tree during decoding. The intramerge module merges the eight variable-length streams from a single compression machine into one stream using shifters and OR operations. The intermerge module packs the metadata and intramerge streams into a single stream, again with shifters and OR operations. The stream out module slices the final compressed stream from the intermerge module to a fixed size and sends it to the final output. This stream merge module is essential for variable-length coding but normally requires a tremendous HW cost for high-throughput data processing. To address this drawback, the parallel stream merge module proposed herein efficiently pipelines these merge processes and optimizes the HW wire width using the mathematical properties of Huffman coding, enabling high-throughput data processing at a very low HW cost.
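The shift-and-OR merging of variable-length codewords can be illustrated with a small software sketch. This packer is byte-granular for readability, whereas the HW packs wider fixed-width words in parallel:

```python
class BitPacker:
    """Merge variable-length codewords into one stream using
    shift-and-OR operations, mimicking the intramerge step
    (software sketch only; widths are an illustration)."""

    def __init__(self):
        self.acc = 0       # bit accumulator that codewords are ORed into
        self.nbits = 0     # number of valid bits held in the accumulator
        self.out = bytearray()

    def push(self, code: int, length: int):
        # Shift the previous bits left and OR in the new codeword.
        self.acc = (self.acc << length) | code
        self.nbits += length
        while self.nbits >= 8:        # flush whole bytes, MSB first
            self.nbits -= 8
            self.out.append((self.acc >> self.nbits) & 0xFF)

    def flush(self) -> bytes:
        if self.nbits:                # zero-pad the final partial byte
            self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
            self.acc = self.nbits = 0
        return bytes(self.out)
```

For example, pushing the codewords `1`, `01`, and `00001` (1, 2, and 5 bits) yields the single byte `0b10100001`. In HW, the same effect is achieved with barrel shifters and OR trees, and the accumulator width is what the paper's wire-width optimization reduces.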
Consequently, the proposed HW accelerator performs lossless data compression with a high throughput of 8 GB/s and achieves an excellent HW cost per unit throughput. Furthermore, it preserves the compression ratio of optimal Huffman coding because it continuously generates a new Huffman tree according to the input data.
Even if the block Huffman coding module is implemented efficiently, compressing the DDR4 command trace data with Huffman coding alone results in very low compression efficiency, because the DDR4 command trace data have high entropy owing to their complex structure, and Huffman coding generally has a lower compression ratio than arithmetic coding or dictionary-based coding. It is noteworthy that the pattern distribution of the DDR4 command trace data is irregular because these data consist of a combination of addresses and control signals, and the information contained in each signal differs depending on the command type. Herein, to achieve a high compression ratio for the DDR4 command trace data, two preprocessing schemes, the “don’t care bits override” and the “bits arrange,” which utilize the operating characteristics of the DDR4 memory, are proposed. The don’t care bits override module lowers the entropy of the data by overriding the don’t-care bits of the device deselect command to a fixed value of 0 or 1. The bits arrange module lowers the entropy of the data by dividing each 32-bit memory command into four groups of highly correlated bits and obtains a high compression ratio by compressing each partition separately. The memory command data are divided into four groups because the block Huffman coding module compresses them in units of 8-bit symbols.
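A minimal sketch of the two preprocessing steps follows. The command-word layout here is entirely hypothetical (the real DDR4 field positions and the actual grouping of correlated bits are not given in the text above); only the two transformations themselves are taken from the description:

```python
# Hypothetical 32-bit command-word layout, for illustration only.
CMD_TYPE_MASK = 0x00000007   # assumed: low 3 bits encode the command type
DESELECT_CODE = 0x00000007   # assumed: pattern marking "device deselect"

def dont_care_override(cmd: int) -> int:
    """On a deselect command, force all don't-care bits to 0 so that
    repeated deselects become identical symbols (lower entropy)."""
    if cmd & CMD_TYPE_MASK == DESELECT_CODE:
        return cmd & CMD_TYPE_MASK  # keep only the command-type bits
    return cmd

def bits_arrange(cmd: int) -> tuple:
    """Split the 32-bit command into four 8-bit groups so that each
    group (plain byte lanes here, standing in for the paper's
    'high correlation bits') feeds a separate 8-bit-symbol
    Huffman compression machine."""
    return ((cmd >> 24) & 0xFF, (cmd >> 16) & 0xFF,
            (cmd >> 8) & 0xFF, cmd & 0xFF)
```

The point of the split is that bits which toggle together end up in the same 8-bit symbol stream, so each of the four Huffman machines sees a lower-entropy symbol distribution than the interleaved original.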
Block Huffman Coding: Fig. 2 shows the HW architecture of the Huffman machine in the block Huffman module. The preprocessed data are stored in the buffer for a certain block size while being sorted in the Huffman front-part module. If the input data are sorted by frequency, the time complexity of the tree construction can be reduced from O(n log n) to O(n) using two queues. To achieve this, a finite-state machine (FSM) with six states, as shown in Fig. 3, is used. Once sorting is complete in the Huffman front part, the FSM moves from the IDLE state to the INIT state, and the sorted symbols and frequencies are stored in the first queue. After storing, the FSM goes to the EXTRACT state and dequeues two data items from each queue. Subsequently, in the COMPARE state, the two items with the lowest frequencies are determined from the four items extracted from the queues. In the SELECT state, the item with the lowest frequency is selected as the left child, and the remaining one is selected as the right child. If the frequencies are equal, the symbol that appears later in the ascending sort becomes the left child. Finally, in the INSERT state, the parent item, whose frequency is the sum of the left and right child frequencies, is stored in the second queue. When only one item remains across the two queues, the FSM returns to the IDLE state; otherwise, the operation above is repeated. These processes are performed through the queue controller, queues, and bitmap controller, as shown in Fig. 2.
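The two-queue construction described above can be sketched in software as follows. The O(n) property holds because each queue stays sorted by construction: leaves enter the first queue pre-sorted, and merged parents enter the second queue in nondecreasing frequency order, so the global minimum is always at one of the two queue fronts. Tie-breaking here simply prefers the first queue, a simplification of the left/right-child rule described above:

```python
from collections import deque

def two_queue_code_lengths(sorted_syms):
    """Huffman code lengths via two queues on frequency-sorted input,
    following the EXTRACT/COMPARE/SELECT/INSERT states above.
    `sorted_syms` is a non-empty list of (symbol, freq) pairs in
    ascending frequency order. Runs in O(n) after sorting."""
    q1 = deque((f, {s: 0}) for s, f in sorted_syms)  # leaf nodes
    q2 = deque()                                     # merged parents

    def pop_min():
        # The front of each queue is its minimum, so comparing the two
        # fronts finds the global minimum in O(1).
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        f1, left = pop_min()   # EXTRACT + COMPARE + SELECT
        f2, right = pop_min()
        # INSERT: the parent's frequency is the sum of its children's;
        # every symbol under it gains one bit of code length.
        q2.append((f1 + f2,
                   {s: d + 1 for s, d in {**left, **right}.items()}))
    return (q1 or q2)[0][1]
```

When only one entry remains across both queues (the FSM's return to IDLE), that entry is the root and carries the final code length of every symbol.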
- Encoding efficiency is high in terms of delay and power consumption.
- High throughput.
- Smaller number of memories required.