The concept of memoization was first proposed in the context of declarative languages, such as Prolog, and has been used mainly there. Memoization techniques in processor architectures have also been investigated. Citron et al. proposed memoizing multiplication and division units for multimedia applications running on processors in order to reduce wall-clock time; power saving, if any, is not reported. Khalvati and Aagaard proposed a two-wide superscalar pipeline-based window memoization for image processing on FPGA, where two incoming windows are examined to check whether a previous computation exists in on-chip memory.
The architecture discussed was generated manually. Memoization as a means of reducing processor power consumption has also been studied. One work puts forward the concept but does not report any results. Another employs instruction-level memoization of floating-point operations in lossy multimedia applications, such as image compression and speech recognition; consequently, the number of floating-point computations decreases, resulting in power saving. That work does not propose a general methodology but instead focuses on memoization for a select set of target application programs, which are extensively characterized together with predefined input data for a target embedded processor, the Hitachi SH4. A further work uses the concept of tolerant region reuse, which involves memoizing a region that is tolerant to error. However, it also focuses on a specific data set for the three image compression algorithms considered and does not describe how data sets would be handled in general. These are simple data sets with one or two distinct images, characterized for three different embedded processors whose instruction set architectures are then modified, based on the characterization, to achieve energy efficiency by reusing tolerant regions. In this paper, we make no such strict assumptions about the data set or the target architecture (an FPGA in this case). Instead, our focus is on developing a general methodology to analyze data sets and applications from different domains, together with automatic hardware architecture generation for mapping to the FPGA.
Another work discusses a methodology for design space exploration of approximate programs through HLS and a genetic-algorithm-based search. However, the optimizations and knobs mentioned therein to produce different approximate design points, such as constant propagation and bit-width sizing, are standard procedures in typical HLS flows. Approximate computing on FPGA has also been discussed in work that focuses on overclocking the adders used in image-processing filters for performance acceleration.
We analyze memoization-based approximate computing with FPGA as the computing substrate from all of the above perspectives. We present and discuss static memoization and dynamic memoization, as well as the comparison measure settings required for applications to benefit from these memoization techniques. The nature and availability of data sets are analyzed, which determines the proper memoization method for a better quality of results. An integrated high-level synthesis (HLS) flow is developed to automatically synthesize the memoization-based design. Based on our experiments using both image-processing applications and common computation kernels with their data sets, we make recommendations regarding the selection of proper memoization techniques and comparison measures. The experimental results show a significant power saving (around 20%) with small quality degradation.
Classification of Memoization:
Memoization can be classified into two types: static memoization and dynamic memoization. Static memoization refers to the case where the value to be read out (rather than computed) is predetermined, and the exact condition under which memoization takes place is predetermined as well. We use the example of edge detection in images to illustrate the difference between the two. Static memoization can be used when the input is predicted to vary only slightly around a reference value.
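As a minimal illustration of static memoization, the sketch below assumes a reference input, its precomputed output, and a tolerance are all fixed at design time (e.g., from offline profiling); the names, values, and the toy filter are hypothetical, not the actual design.

```cpp
#include <cstdlib>

// Hypothetical static memoization: the reference value, the stored result,
// and the tolerance are all predetermined at design time (assumed values).
const int REF_PIXEL = 128;   // predetermined reference input
const int REF_RESULT = 0;    // predetermined output to read out
const int TH = 4;            // predetermined tolerance around the reference

// The exact computation that memoization tries to skip (toy stand-in for a
// real edge-detection filter).
int edge_detect_pixel(int pixel) {
    return (pixel > 150) ? 255 : 0;
}

int memoized_edge_detect(int pixel) {
    // The memoization condition is fixed beforehand: input close to REF_PIXEL.
    if (std::abs(pixel - REF_PIXEL) <= TH)
        return REF_RESULT;           // read out, do not compute
    return edge_detect_pixel(pixel); // fall back to the exact computation
}
```

Inputs near the reference reuse the stored result; all others take the exact path.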
Dynamic memoization refers to the case where the memoization condition is ill-defined, i.e., the exact value or range against which an input should be examined is unknown a priori. Such a value or range is determined at run time during application execution. This can be the case when an application's input data set shows a relatively large variation over time: with the static method, a high variance requires a large number of reference vectors to be stored (as a result of offline analysis), and the resulting large number of comparisons may negate the power saving due to memoization. For such cases, only the method used to check the input value is fixed beforehand. For instance, if dynamic memoization is applied to detect edges in images obtained from speed-limit cameras, then the output value to be read out is determined at run time, as is the value against which the central pixel needs to be checked. However, the fact that the central pixel will be examined is decided beforehand. Similarly, instead of examining the central pixel, a designer may decide to use a distance metric to check the input pixel values. Whatever method is chosen to check the input values, it must be decided beforehand based on some basic understanding of the application. Thus, there are use cases for both static memoization and dynamic memoization.
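The central-pixel example above can be sketched as follows. Here only the check method (compare the central pixel of a 3x3 window against the previously seen one) is fixed; the reference value and the stored result are captured at run time. All names and the toy filter are hypothetical.

```cpp
#include <cstdlib>

// Run-time memo state: the reference value and the result to read out are
// learned during execution, not predetermined.
struct DynMemo {
    bool valid = false;
    int last_center = 0;   // reference value, captured at run time
    int last_result = 0;   // output to read out, captured at run time
};

// Toy stand-in for a 3x3 edge filter keyed on the central pixel (index 4).
int edge_detect_window(const int win[9]) {
    return (win[4] > 150) ? 255 : 0;
}

int memoized_window(DynMemo &m, const int win[9], int th) {
    // Fixed beforehand: examine the central pixel. Everything else is dynamic.
    if (m.valid && std::abs(win[4] - m.last_center) <= th)
        return m.last_result;          // hit: reuse the previous result
    m.last_center = win[4];            // miss: compute and update the memo
    m.last_result = edge_detect_window(win);
    m.valid = true;
    return m.last_result;
}
```

Successive windows whose central pixels differ by at most `th` reuse the previous result, skipping the filter computation.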
Proposed design flow for memoization-based automatic architecture generation:
We design our memoization-based architecture generation flow by adding a post-processing step to existing HLS flows. Fig. 1 shows an overview of the flow. It shows snippets of code from a C-language source file, where a function named edge_detect has been identified for HLS, along with an example configuration file that identifies the function to be memoized.
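The configuration file might look roughly as follows; the exact syntax is not specified in the text, so every key name here is a hypothetical placeholder based on the inputs described later (function name, similarity measure, and optional threshold Th).

```
# Hypothetical memoization configuration (illustrative syntax only)
function           = edge_detect        # function identified for memoization
similarity_measure = euclidean_distance # or: equality
threshold          = 4                  # Th; omitted when equality is chosen
```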
We assume an iterative design flow for memoization-based approximate computing, the details of which are shown in Fig. 2. Here, P1 and P2 refer to the power values obtained without and with memoization, respectively; R1 and R2 refer to the computed values obtained without and with memoization, respectively; P and T are the power and result-accuracy thresholds, respectively. Fig. 3 shows not only the memoization architecture generation flow but also the power and result-accuracy considerations that must be taken into account. The red block shows that an application or task described in C/C++ is synthesized using an HLS tool. The blue block shows the memoization architecture generator, which generates the RTL wrapper module that wraps the HLS-synthesized block with memoization-related circuit blocks. As a result of this wrapping, the RTL design of the memoized architecture is generated (purple block), i.e., the top-level module containing the RTL wrapper and the HLS-synthesized block. After placement and routing on the target FPGA using a vendor-specific placement and routing tool (such as Xilinx ISE), simulation-based dynamic power analysis of both the HLS-synthesized design and the memoized architecture is performed separately to evaluate the potential power saving (green block). The power analysis is performed using the data set corresponding to the application (strong or weak). The percentage difference between P1 and P2 should be greater than the user-defined threshold P, weighed against the area overhead due to the wrapper.
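The accept/iterate decision at the end of this flow can be sketched as below. The function name is hypothetical, and the use of relative error between R1 and R2 as the accuracy metric is an assumption; the text only states that the power saving must exceed P and that T bounds the result-accuracy degradation.

```cpp
#include <cmath>

// Hypothetical decision step of the iterative design flow.
// P1/P2: dynamic power without/with memoization; R1/R2: computed results
// without/with memoization; P and T: user-defined thresholds.
bool accept_memoized_design(double P1, double P2,
                            double R1, double R2,
                            double P, double T) {
    // Percentage power saving relative to the non-memoized design.
    double power_saving_pct = 100.0 * (P1 - P2) / P1;
    // Assumed accuracy metric: relative error of the memoized result.
    double error_pct = 100.0 * std::fabs(R1 - R2) / R1;
    // Accept only if the saving exceeds P and degradation stays under T;
    // otherwise iterate (e.g., adjust the similarity measure or threshold).
    return power_saving_pct > P && error_pct < T;
}
```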
Memoization Architecture Generator:
The memoization architecture generator contains a memoization wrapper generator that generates the architectural blocks needed for memoization. The wrapper generator architecture is shown in Fig. 3. It takes as input the similarity measure, the threshold (Th), the user inputs, and the user selection, all of which are present in the configuration file. The similarity measure is defined by the user to compare two temporally spaced inputs; the threshold (Th) is the value within which the result of the similarity measurement must lie. Examples of similarity measures are distance metrics, such as Euclidean distance, and other measures, such as equality. The threshold (Th) is an optional input, specified or not depending on the similarity measure chosen; for instance, no threshold is required if equality is chosen as the similarity measure.
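The two similarity measures named above behave as in the following sketch. The function names and signatures are illustrative assumptions, not the wrapper generator's actual interface; note that equality needs no threshold, while Euclidean distance does.

```cpp
#include <cmath>
#include <vector>

// Equality measure: two temporally spaced inputs memoize only when they
// match exactly, so no threshold Th is required.
bool similar_equal(const std::vector<int> &a, const std::vector<int> &b) {
    return a == b;
}

// Euclidean-distance measure: memoize when the distance between the two
// input vectors lies within the threshold Th.
bool similar_euclidean(const std::vector<int> &a, const std::vector<int> &b,
                       double th) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i] - b[i]);
        sum += d * d;
    }
    return std::sqrt(sum) <= th;
}
```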
Power and Computational Accuracy Analysis:
After the generated memoized architecture has been placed and routed on the FPGA device using a vendor-specific placement and routing tool, its dynamic power analysis is performed using the FPGA vendor tools. A strong data set is used for the statically memoized architecture, and a weak data set for the dynamically memoized architecture. Timing simulation of the design after placement and routing provides both the computation results and information on switching activity. The switching activity information is used as input to the power analysis tools for dynamic power analysis.