Proposed Title :
Several new coding structures have been introduced in HEVC: the coding unit (CU), prediction unit (PU), and transform unit (TU). The CU is the basic unit of region splitting used for intra/inter coding. It can be split from the largest coding unit (LCU, as large as 64×64 pixels) down to the smallest coding unit (SCU, 8×8 pixels). Coupled with the CU, the PU carries the information related to the prediction processes. The TU is used for transform and quantization, and it depends on the PU partitioning modes.
The blocking effect is one of the most visible and objectionable artifacts of block-based compression methods. Artifacts commonly seen in prior video coding standards at medium and low bitrates, such as blocking, ringing, color biases, and blurring, may still exist in HEVC. HEVC therefore adopts in-loop filters to reduce these artifacts. HEVC defines two in-loop filters, shown in Fig. 1. In addition to a DF similar to the one in H.264/AVC, HEVC further introduces a completely new tool: sample adaptive offset (SAO). On average, the DF yields a 1.3-3.3% BD-rate reduction, and SAO achieves a 3.5% BD-rate reduction at the same quality, where BD-rate (Bjøntegaard delta rate) measures the average bitrate difference between two rate-distortion curves at equal quality.
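To make the BD-rate figures above concrete, the sketch below estimates a BD-rate from two rate-distortion curves. The standard calculation fits cubic polynomials to (PSNR, log-rate) points; this is a simplified piecewise-linear variant (an assumption made for brevity, not the official cubic fit), and the function name is hypothetical.

```python
import math

def bd_rate(anchor, candidate):
    """Approximate BD-rate (percent) between two RD curves.

    anchor, candidate: lists of (bitrate_kbps, psnr_db) points.
    Simplified piecewise-linear variant: integrate the log-rate gap
    over the overlapping PSNR range instead of fitting cubics.
    """
    def prep(curve):
        pts = sorted(curve, key=lambda p: p[1])        # sort by PSNR
        return [(p[1], math.log(p[0])) for p in pts]   # (psnr, log rate)

    a, c = prep(anchor), prep(candidate)

    def interp(pts, x):
        # linear interpolation of log-rate at PSNR x
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        raise ValueError("PSNR outside curve range")

    lo = max(a[0][0], c[0][0])       # overlapping PSNR range
    hi = min(a[-1][0], c[-1][0])
    n = 100
    diff = sum(interp(c, lo + (hi - lo) * i / n) -
               interp(a, lo + (hi - lo) * i / n) for i in range(n + 1))
    return (math.exp(diff / (n + 1)) - 1.0) * 100.0    # negative = savings
```

A candidate curve with 10% lower rate at every quality point yields a BD-rate of about -10%.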
The coding efficiency of HEVC comes at a cost: its computational complexity is very high. From an encoder perspective, an encoder fully exploiting the capabilities of HEVC is expected to be several times more complex than an H.264/AVC encoder. To understand the computational complexity of HEVC, one study mapped the HEVC codec onto existing systems. The authors ran the HM (HEVC Test Model) encoder on a cluster of Xeon-based servers (E5670 clocked at 2.93 GHz), compiled with gcc 4.4.5. Even in the intra-only case, the encoding time exceeds 1000 times real time (for sequences of 832×480 resolution at 30 f/s).
The increasing diversity of services, the growing popularity of HD video, and the emergence of beyond-HD formats (e.g., 4K×2K or 8K×4K resolution) are creating even stronger needs for hardware implementations with throughput superior to H.264/AVC. Hence, hardware realization of the HEVC standard for real-time applications is an essential and challenging task. There is some previous work on in-loop filters: a five-stage pipelined and hybrid edge-filtering sequence has been applied; a five-stage pipelined and resource-shared dual-edge filter generating two filtering results every cycle has been proposed; and a parallelized scheme processing the luminance and chrominance samples simultaneously has been proposed. However, none of these works supports the HEVC in-loop filters. An HEVC in-loop filter architecture composed of fully utilized DF and SAO has been proposed, but it does not support the HEVC encoder. A hybrid pipeline with two processing levels has been proposed for the HEVC DF, using one 1-D filter and single-port on-chip SRAM. Two parallel datapaths have also been used in an HEVC DF design to increase its performance.
Due to the increasing demand for high-resolution applications, the data traffic between on-chip memory and external memory becomes even more critical. Considering the tradeoff between on-chip memory area and external memory traffic, we present an interlaced pipeline that combines the DF with SAO on a quarter-LCU basis; a quarter-LCU is defined as a 32×32-pixel block. In the DF process, a novel filter is suggested to keep the result identical on a picture basis under the quarter-LCU structure. We also propose an interlacing memory scheme that arranges the data in on-chip memory so it can be accessed efficiently during both vertical and horizontal filtering in the DF phase.

In the SAO statistics-collection process, the overall number of comparators is reduced by 83% with our proposed configurable comparator array. We also present a fragmentation adder scheme to balance the computational burden between the pipeline stages of SAO. In addition, a simplified bitrate-estimation method for the rate-distortion cost calculation is adopted to reduce the computational complexity of the SAO mode decision.
Fig. 1 shows the overall proposed hardware architecture of the combined DF and SAO. We adopt an interlaced pipeline architecture to speed up the combined DF and SAO. The whole process is partitioned into three phases: DF, SAO statistics collection, and SAO mode decision.
De-blocking filter architecture:
We use an image block of size 64×64 pixels (one LCU) and split it into 32×32, 16×16, 8×8, and 4×4 blocks before applying the de-blocking filter.
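The splitting above can be sketched as a plain quadtree enumeration. The helper below is hypothetical: it lists every sub-block position at every level, ignoring the rate-distortion decisions a real encoder would use to choose the split depth.

```python
def split_blocks(x, y, size, min_size=4):
    """Enumerate quadtree sub-blocks of an LCU as (x, y, size) tuples.

    Yields the block at the current level, then recursively yields the
    four quadrants, down to blocks of min_size x min_size pixels.
    """
    yield (x, y, size)
    if size > min_size:
        half = size // 2
        for dx in (0, half):
            for dy in (0, half):
                yield from split_blocks(x + dx, y + dy, half, min_size)
```

For a 64×64 LCU split down to 4×4, this enumerates 1 + 4 + 16 + 64 + 256 = 341 blocks.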
The deblocking filter reduces the blocking artifacts (visible discontinuities in the video) caused by block-based encoding with strong quantization. It is applied by modifying samples along horizontal and vertical boundaries of PUs and TUs of size not smaller than 8×8 samples. Filtering is applied separately in 4×4 blocks (so-called P and Q blocks), as shown in Fig. 1. Normal and strong filtering modes modify 2 and 3 luma samples along each boundary, respectively. The example in Fig. 2 shows a vertical boundary; horizontal boundary filtering is analogous.
Our design has two main units:
(1) Filtering decisions
(2) Filtering operations
This unit determines whether two given 4×4 blocks need filtering. For the input-sample convention, please refer to Fig. 2. By examining the equations in Fig. 5, it can be noted that conditions (1), (2), (3), (8) and (9) share similar equations and the same input samples. Partial results from conditions (2) and (3) are reused for conditions (1), (8) and (9). Hence, we employ hardware reuse to design a merged data path for those conditions, as depicted in Fig. 6. The condition equations require some multiplications by constants; we replace these multiplications with adders and shift operations to use fewer hardware resources. We abbreviate conditions (1), (2), (3), (8) and (9) as c1, c2, c3, c8 and c9, respectively. With that, those five conditions are generated in only one clock cycle.
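The shift-and-add substitution is a standard hardware strength reduction; the identities it relies on are shown below in Python for two constants that appear in the filter equations (the function names are illustrative only).

```python
# Constant multiplications realised with one shifter and one adder
# each, instead of a full multiplier.
def mul3(x):
    return (x << 1) + x   # 3*x = 2*x + x

def mul9(x):
    return (x << 3) + x   # 9*x = 8*x + x
```

In the RTL, each such constant multiplier collapses to wiring plus a single adder, which is why the merged data path fits in one clock cycle.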
Data paths for the remaining conditions are depicted in Fig. 6. The data paths for conditions (4) and (5) are identical, differing only in their input samples (condition (4) uses the first row and condition (5) the last row of the 4×4 blocks). We include two instances of this data path to compute both conditions (c4 and c5) in the same clock cycle. The same is done for conditions (6) and (7). Condition (10) is an additional filtering decision applied to Δ0 for the four rows of 4×4 blocks after the normal filtering operation. Our architecture includes two instances of this data path to compute c10 for the four rows in two clock cycles, each instance computing two rows of samples. Additional data paths for the β and tc multiplications, needed to compute all the conditions, are also shown in Fig. 6. The β and tc values are generated by a lookup table indexed by the QP value.
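The Δ0 decision can be sketched per sample pair as follows. The helper name is hypothetical; the arithmetic follows the per-sample form of the HEVC normal-mode luma filter, with the |Δ0| < 10·tc check corresponding to condition (10) above.

```python
def normal_filter_delta(p1, p0, q0, q1, tc):
    """Normal-mode deblocking of the boundary samples p0, q0.

    delta0 = (9*(q0-p0) - 3*(q1-p1) + 8) >> 4; the samples are
    modified only when |delta0| < 10*tc, and the applied offset is
    clipped to [-tc, tc]. Returns the (possibly filtered) p0', q0'.
    """
    delta = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4
    if abs(delta) >= 10 * tc:
        return p0, q0                      # condition (10) fails: no filtering
    delta = max(-tc, min(tc, delta))       # clip offset to [-tc, tc]
    clip8 = lambda v: max(0, min(255, v))  # 8-bit sample range
    return clip8(p0 + delta), clip8(q0 - delta)
```

A small step across the boundary is smoothed, while a large step (likely a real edge) is left untouched by the 10·tc check.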
Boundary strength and Edge level adaptivity
- Boundary strength (BS) is calculated for boundaries that are either prediction unit boundaries or transform unit boundaries.
- The boundary strength can take the values 0, 1, or 2.
- For the luma component, only block boundaries with BS values of 1 or 2 are filtered. Therefore there is no filtering in static areas, which avoids repeatedly filtering the same areas where pixels are copied from one picture to another with a residual equal to zero; such repeated filtering could cause over-smoothing.
- For the chroma components, only boundaries with BS equal to 2 are filtered. Hence, every filtered chroma block boundary has at least one of its two adjacent blocks intra predicted.
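The BS rules above can be sketched as a simple decision cascade. The boolean flag names below are hypothetical stand-ins for the actual coding information (intra mode, coded coefficients, motion vector and reference picture differences) that a real derivation would read from the two adjacent blocks.

```python
def boundary_strength(p_intra, q_intra,
                      p_has_coeffs, q_has_coeffs,
                      mv_diff_ge_one_sample, different_refs):
    """Simplified boundary-strength derivation for a P/Q block pair.

    2: either block is intra coded (chroma and luma are filtered),
    1: inter/inter boundary with coded coefficients, a motion vector
       difference of at least one sample, or different reference
       pictures (only luma is filtered),
    0: otherwise (no filtering).
    """
    if p_intra or q_intra:
        return 2
    if (p_has_coeffs or q_has_coeffs or
            mv_diff_ge_one_sample or different_refs):
        return 1
    return 0
```

This matches the bullet above: a chroma boundary is filtered only when BS is 2, i.e., when at least one adjacent block is intra predicted.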
Sample adaptive offset:
SAO may use different offsets sample by sample in a region depending on the sample classification, and SAO parameters are adapted from region to region. Two SAO types that can satisfy the requirements of low complexity are adopted in HEVC: edge offset (EO) and band offset (BO). For EO, the sample classification is based on comparison between current samples and neighboring samples. For BO, the sample classification is based on sample values. Please note that each color component may have its own SAO parameters. To achieve low encoding latency and to reduce the buffer requirement, the region size is fixed to one CTB. To reduce side information, multiple CTUs can be merged together to share SAO parameters.
In order to keep reasonable balance between complexity and coding efficiency, EO uses four 1-D directional patterns for sample classification: horizontal, vertical, 135° diagonal, and 45° diagonal, as shown in Fig. 9, where the label “c” represents a current sample and the labels “a” and “b” represent two neighboring samples. According to these patterns, four EO classes are specified, and each EO class corresponds to one pattern. On the encoder side, only one EO class can be selected for each CTB that enables EO. Based on rate-distortion optimization, the best EO class is sent in the bitstream as side information. Since the patterns are 1-D, the results of the classifier do not exactly correspond to extreme samples. For a given EO class, each sample inside the CTB is classified into one of five categories. The current sample value, labeled as “c,” is compared with its two neighbors along the selected 1-D pattern. The classification rules for each sample are summarized in Table I. Categories 1 and 4 are associated with a local valley and a local peak along the selected 1-D pattern, respectively. Categories 2 and 3 are associated with concave and convex corners along the selected 1-D pattern, respectively. If the current sample does not belong to EO categories 1–4, then it is category 0 and SAO is not applied. The meanings of edge offset signs are illustrated in Fig. 10 and explained as follows. Positive offsets for categories 1 and 2 result in smoothing since local valleys and concave corners become smoother, while negative offsets for these categories result in sharpening. On the other hand, the meanings are opposite for categories 3 and 4, where negative offsets result in smoothing and positive offsets result in sharpening. From the statistical analyses, the EO in HEVC disallows sharpening and sends absolute values of offsets, while signs of offsets are implicitly derived according to EO categories.
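The classification rules of Table I can be sketched directly: for the selected 1-D pattern, the current sample c is compared against its two neighbors a and b.

```python
def eo_category(a, c, b):
    """Edge-offset category of sample c with neighbors a and b.

    1: local valley (c below both neighbors),
    2: concave corner (c below one neighbor, equal to the other),
    3: convex corner (c above one neighbor, equal to the other),
    4: local peak (c above both neighbors),
    0: none of the above (SAO is not applied).
    """
    if c < a and c < b:
        return 1
    if (c < a and c == b) or (c == a and c < b):
        return 2
    if (c > a and c == b) or (c == a and c > b):
        return 3
    if c > a and c > b:
        return 4
    return 0
```

Monotonic ramps (e.g., a < c < b) fall into category 0, so smooth gradients are left untouched while valleys, peaks, and corners receive offsets.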
BO implies one offset is added to all samples of the same band. The sample value range is equally divided into 32 bands. For 8-bit samples ranging from 0 to 255, the width of a band is 8, and sample values from 8k to 8k+ 7 belong to band k, where k ranges from 0 to 31. The average difference between the original samples and reconstructed samples in a band (i.e., offset of a band) is signaled to the decoder. There is no constraint on offset signs.
Only offsets of four consecutive bands and the starting band position are signaled to the decoder. The number of signaled offsets in BO was decided to be reduced from 16 to 4 and is now equal to the number of signaled offsets in EO for reducing the line buffer requirement. Another reason of selecting only four bands is that the sample range in a region can be quite limited after the regions are reduced from picture quadtree partitions to CTBs.
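A minimal sketch of the band classification and offset addition described above (the helper names and the offset-list layout are hypothetical):

```python
def band_index(sample, bit_depth=8):
    """Band index 0..31: the sample range is split into 32 equal bands.

    For 8-bit samples this is sample >> 3, so values 8k..8k+7 map to
    band k, as described in the text.
    """
    return sample >> (bit_depth - 5)

def apply_bo(sample, start_band, offsets, bit_depth=8):
    """Add the signalled offset if the sample falls in one of the four
    consecutive signalled bands, then clip to the valid sample range.
    """
    k = band_index(sample, bit_depth)
    if start_band <= k < start_band + 4:
        sample += offsets[k - start_band]
    return max(0, min((1 << bit_depth) - 1, sample))
```

Samples outside the four signalled bands pass through unchanged, which is why signalling only the starting band position plus four offsets suffices.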
For each CTU, there are three options: reusing the SAO parameters of the left CTU by setting the syntax element sao_merge_left_flag to true, reusing the SAO parameters of the above CTU by setting the syntax element sao_merge_up_flag to true, or transmitting new SAO parameters. Please note that the SAO merging information is shared by the three color components. As shown in Fig. 11, a CTU includes the corresponding luma CTB, Cb CTB, and Cr CTB. When the current CTU selects SAO merge-left or SAO merge-up, all SAO parameters of the left or above CTU are copied for reuse, so no further information is sent. This CTU-based SAO information sharing can reduce side information effectively.