1. Introduction
Recently, video-based application service industries, such as over-the-top (OTT) media
services, video conferencing, and real-time streaming services, have developed
rapidly. As a result, video data accounts for an ever-increasing share of Internet
traffic, and consumers demand more diverse and higher-quality video content, such
as 360-degree videos for immersive media, screen sharing, and game application services,
including 4K ultra high definition (UHD) and 8K UHD video [1,2]. In order to efficiently serve these various types of large-capacity video data,
a new codec capable of providing higher compression efficiency than High Efficiency
Video Coding (H.265/HEVC) [3] was required. Accordingly, the international standardization organizations ITU Telecommunication
Standardization Sector (ITU-T) Video Coding Experts Group (VCEG) and ISO/IEC Moving
Picture Experts Group (MPEG) jointly formed the Joint Video Experts Team (JVET) with the
goal of doubling the compression efficiency of H.265/HEVC. The standardization of
H.266/VVC was officially completed in July 2020 [4].
H.266/VVC was developed to facilitate efficient compression for a wide range of video
content and services, such as high-resolution (up to 8K or higher), high dynamic range/wide
color gamut (HDR/WCG) video and screen contents, and 360-degree videos. In addition,
H.266/VVC has the same block-based hybrid coding structure as H.264/AVC [5], H.265/HEVC [6], etc., which are conventional video coding standards, and additional techniques to
obtain higher coding performance than the existing video coding standards, showing
approximately 40% bitrate reduction compared to H.265/HEVC. In H.266/VVC, as in H.265/HEVC,
a frame is partitioned into non-overlapping coding tree units (CTUs), the basic coding
unit. Furthermore, a maximum CTU size of 128${\times}$128 is allowed for efficient
processing of higher-resolution video. In H.265/HEVC, each CTU was partitioned only into
square coding units (CUs) through a recursive quad-tree (QT) split. In contrast,
H.266/VVC has adopted a multi-type tree (MTT) split consisting of binary tree (BT) and
ternary tree (TT) splits, which can be applied to QT leaf nodes to produce rectangular
CUs. H.266/VVC has been developed by improving its predecessor's (H.265/HEVC)
coding techniques and adopting various new coding techniques to compress various video
content effectively. The intra prediction in H.266/VVC allows up to 67 angular prediction
modes instead of the 35 modes of H.265/HEVC. In addition, matrix-based prediction using
pretrained matrices and correlation-based prediction between luma and chroma have been newly adopted. For
an inter prediction, H.266/VVC supports both the whole block-based (same as H.265/HEVC)
and subblock-based prediction techniques. In addition, the motion vector candidate
list construction has been extended, and motion refinement tools have been adopted
to obtain more accurate motion information. In addition, H.266/VVC also supports performing
the transform of the residual signal by explicitly or implicitly selecting from among
various transform kernels, and a secondary transform has been adopted to further reduce
the redundancy of transform coefficients. Moreover, new in-loop filtering techniques
have been introduced into H.266/VVC to use signal ranges better to improve the coding
efficiency and reduce the coding artifacts introduced by the quantization and transform
process. In addition, various new techniques have been introduced in H.266/VVC, which
will be discussed in detail in later sections.
With many new techniques and coding structures, H.266/VVC can obtain subjective quality
comparable to H.265/HEVC at only approximately half the bitrate for test sequences.
On the other hand, there are some concerns regarding the complexity of the encoder
and decoder for practical commercial implementations. Many studies have been conducted
to address this problem, but further research on optimization, acceleration, and
parallelization is needed because real-time encoding remains difficult, hindering
the commercialization of H.266/VVC.
Several overview papers for H.266/VVC have been reported [7-9]. Those papers commonly describe H.266/VVC tools in comparison with H.265/HEVC, albeit briefly.
In addition, [7] focuses on an overview of the first version of H.266/VVC including
a comparison against H.265/HEVC, [8] focuses on several interesting consumer electronics
use cases and applications, and [9] aims to explain how the new features of H.266/VVC
provide versatility across applications and functionalities. Furthermore, they also
present experimental results for BD-rate comparison against H.265/HEVC and the
en/decoding complexity of each module in H.266/VVC. This paper
provides a detailed description of all the coding tools newly adopted in H.266/VVC.
It reports a statistical analysis of the coding techniques in each module of H.266/VVC by
analyzing bitstreams encoded using the VVC test model, VTM12.0 [10]. In particular, the coding efficiency is analyzed according to the CTU size, which
is the basic unit of the block-based coding process, and the complexity and impact
of the BT and TT splits, which are newly introduced in the block partitioning structure
of H.266/VVC, are evaluated. The H.266/VVC encoding process involves considerable
complexity. Therefore, this paper also reviews the performance impact of adjusting
the number of available reference pictures in the encoding configuration and provides
data for video encoder design and future research by analyzing the selection ratio
of each coding technique according to the reference structure and video resolution.
The remainder of this paper is organized as follows. Section 2 examines the history
of video coding standard technology, and Section 3 examines the newly added technologies
adopted in H.266/VVC. Section 4 performs a comparative analysis of the major technologies
and structures changed or expanded in H.266/VVC along with a statistical analysis
of the above-mentioned compression techniques of H.266/VVC. Section 5 concludes the
paper with an outlook.
3. Overview of Versatile Video Coding
Fig. 2 shows a block diagram of the H.266/VVC encoder. H.266/VVC uses a block-based hybrid coding
structure like conventional video coding standards such as H.264/AVC, H.265/HEVC,
etc. The hybrid refers to combining predictive coding and transform coding with the
quantization of a residual signal to reduce the spatial and temporal redundancy in
the video signal. As shown in Fig. 2, there are tools in H.266/VVC that are the same as or modified from H.265/HEVC. In
addition, some techniques have been newly adopted to improve coding efficiency. This
section describes the main techniques of H.266/VVC, which achieves approximately twice
the compression efficiency of H.265/HEVC at the same visual quality.
H.266/VVC supports three types of hierarchical temporal prediction structures, the same
as H.265/HEVC, to ensure efficient compression performance depending on the purpose of
compression: the all-intra (AI), low-delay (LD), and random access (RA) structures. In
addition, H.266/VVC specifies two types of intra random access point (IRAP) pictures,
the instantaneous decoding refresh (IDR) picture and the clean random access (CRA) picture,
as well as one type of gradual decoding refresh (GDR) picture [17] for random access. For IDR and CRA pictures, which have the same concept as in H.265/HEVC,
the bitrate increases rapidly because the entire picture is encoded using intra prediction.
The GDR picture was introduced in H.266/VVC for low-latency applications to alleviate
this problem. A GDR picture consists of three areas: an intra-coded area, a clean (or
refreshed) area, and a dirty (or non-refreshed) area, and errors in the dirty area can
never propagate into the clean area [18]. Because the clean area of the current GDR picture is reconstructed by referencing
the clean area of previous pictures in the GDR period, it is possible to reconstruct
the entire area completely without errors occurring in the transmission process, even
if the entire area is not intra-predicted. Therefore, H.266/VVC can also be applied
to ultra-low latency applications because GDR can smooth out the bitrate of a bitstream,
reducing end-to-end latency significantly [19].
In H.266/VVC, each picture can be divided into multiple subpictures that can be en/decoded
and transmitted independently. Each picture or subpicture consists of one or more
slices with multiple CTUs. In addition, each CTU can be partitioned into CUs by applying
the QT+MTT split structure, then performing prediction, transform, and reconstruction.
Moreover, H.266/VVC has adopted a CTU dual tree structure where luma and chroma components
can have separate coding trees [20].
The prediction of H.266/VVC can be classified mainly into intra prediction and inter
prediction [21,23]. For intra prediction, several techniques have been newly adopted as follows:
65 directional angular prediction modes, approximately double the 33 directional
angular prediction modes of H.265/HEVC; a wide angle intra prediction (WAIP) mode
for rectangular CUs; multiple reference line (MRL) prediction, which can use
non-adjacent reference lines; and a matrix-based intra prediction (MIP) mode.
The MIP mode performs a prediction using adjacent reference samples
and a predefined matrix obtained from pre-training. Similar to residual quad-tree
(RQT) of H.265/HEVC, intra sub-partition (ISP) mode divides the current block into
subblocks and performs prediction and transform for each subblock. To improve the
coding efficiency of the chroma signal using the luma-chroma correlation, a cross-component
linear model (CCLM) mode performs prediction based on a linear model whose parameters
are derived from the collocated reconstructed luma samples and reconstructed adjacent
chroma samples. In addition, a position dependent prediction combination (PDPC) that
generates a final prediction signal by combining the initial intra-predicted samples
and adjacent reference samples has been adopted.
In a H.266/VVC inter prediction, motion compensation (MC) is performed by generating
a motion vector predictor (MVP) based on the merge or advanced motion vector prediction
(AMVP) modes, as in H.265/HEVC. In particular, a pairwise average MVP (PAMVP) has
been adopted in H.266/VVC that generates a new motion vector candidate from already
constructed motion vector candidates of the merge list to increase the coding efficiency
from various motion vector candidates. A history-based MVP (HMVP) has also been adopted
to increase the coding efficiency; it uses the MVs of previously coded blocks as
motion vector candidates for the current block. In addition, two subblock-based MC methods
are newly introduced in H.266/VVC: affine MC (AMC) and the subblock-based temporal MVP
(SbTMVP) mode. The AMC performs subblock-based MC with MVs derived from an affine motion
model that can represent not only translational motion but also rotation and zoom in/out.
The SbTMVP mode performs subblock-based MC using the motion information of the
corresponding block in a previously decoded picture. Various
techniques have been adopted to improve coding efficiency by refining the MVs of the
current block. In particular, the merge mode with MV difference (MMVD) mode derives
the MVD value with simplified signaling without explicitly transmitting MVD in merge
mode. Unlike the MMVD, the decoder-side MV refinement (DMVR) mode has been adopted, which
applies bilateral matching to refine the accuracy of the initial MV obtained
from the signaled merge index. Moreover, the motion or prediction signal of the current
block can be refined by performing bi-directional optical flow (BDOF), a technique
that refines motion in pixel-based optical flow, and a prediction refinement with
optical flow (PROF), which refines the pixel values in a prediction block generated
by the AMC. In addition, the techniques of generating a final prediction signal by
combining multiple prediction signals have been newly adopted as follows: the combined
inter-intra prediction (CIIP) mode, which generates the final prediction signal from
the weighted sum of the intra prediction signal and inter prediction signal; the geometric
partitioning mode (GPM) mode, which generates the final prediction signal from a weighted
sum of two inter prediction signals by applying a mask determined according to the
signaled mode; the bi-directional prediction with CU weights (BCW), which uses the
signaled weights to generate a final prediction signal from a weighted summation of
the two inter prediction signals.
Furthermore, techniques for reducing the MVD bits to be transmitted have also been
adopted: the adaptive motion vector resolution (AMVR) and the symmetric MVD (SMVD). The
AMVR mode can reduce the transmission bits of MVD by adaptively changing the resolution
of MVD. The SMVD mode signals MVD only for one reference list and derives MVD for
the other by assuming linear motion. Various techniques in the transform and quantization
process for coding efficiency have been adopted in H.266/VVC [24,25].
In H.265/HEVC, the transform using the DCT-II kernel was applied to each square transform
unit (TU), while the transform using the DST-VII kernel was applied only to 4${\times}$4
intra-predicted luma blocks. H.266/VVC supports transforms using vertical and horizontal
transform kernels of different types and lengths for rectangular transform blocks
(TBs) because of the different shapes of blocks. In particular, multiple transform
selection (MTS) has been adopted in H.266/VVC to transform using DCT-II, DST-VII,
and DCT-VIII kernels based on the prediction mode or explicit signaling. The low-frequency
non-separable transform (LFNST) has been also adopted, which can achieve an additional
coding efficiency by applying the secondary transform to the low-frequency coefficients
after the transform for the intra predicted block. In addition, subblock-based transform
(SBT) has been adopted, transforming only on subblocks for the inter predicted block
according to the signaling index. For the quantization process of H.266/VVC, a dependent
quantization (DQ) and joint coding of chroma residual (JCCR) have been adopted. The
DQ uses two quantizers and performs the transition between the two quantizers according
to the transform coefficient level. The JCCR is used when the quantized chroma residual
signals are similar, transmitting only a single chroma residual signal. The entropy
coding process of H.266/VVC is performed through context-based adaptive binary arithmetic
coding (CABAC) [25]. The CABAC of H.266/VVC uses the same algorithm as H.265/HEVC but uses a multi-hypothesis
probability estimation model instead of a look-up table to increase the accuracy of
probability estimation.
The in-loop filtering process of H.266/VVC specifies the adaptive loop filter (ALF),
cross-component ALF (CC-ALF), and luma mapping with chroma scaling (LMCS) in addition
to the deblocking filter (DF) and sample adaptive offset (SAO) used in H.265/HEVC [26]. The LMCS is a technique to reflect the characteristics of HDR videos. It aims at
using the signal range better for improved coding efficiency rather than specifically
addressing coding artifact reduction. The ALF and CC-ALF are the third filtering process
in the decoding process of H.266/VVC, which performs block-based linear filtering
and adaptive clipping with the filter coefficients determined to minimize the reconstruction
error.
As previously mentioned, H.266/VVC uses the hybrid coding structure, and various techniques
have been modified or newly adopted compared to H.265/HEVC. The subsequent subsections
describe the newly introduced or modified tools compared to the predecessor (H.265/HEVC)
within each module. Table 33 lists the notations used in Section 3.
Fig. 2. Block diagram of H.266/VVC encoder.
3.1 Structure of Picture/Block Partitioning
A video sequence is composed of multiple pictures, and each picture can be divided
hierarchically and processed for coding efficiency and parallel processing. In H.264/AVC,
the macroblock, fixed to 16${\times}$16, was used as the basic unit of compression
[5]. In H.265/HEVC, each picture could be divided into one or several tiles and slices,
which are sequences of CTUs. Subsequently, a recursive partition is performed on
each CTU with a QT split structure, supporting more block sizes
than its predecessor (H.264/AVC), and each leaf node is called a CU. Thereafter, the
CU is divided into a prediction unit (PU), which is a basic unit of prediction, and
a TU, which is a basic unit of transform, to perform compression efficiently [14].
As mentioned earlier, in H.266/VVC, a picture can be divided into one or several tiles
and slices, which have similar concepts to H.265/HEVC, and subpictures consisting
of one or more rectangular slices. In addition, H.266/VVC can use a CTU of up to 128${\times}$128,
which is four times larger than H.265/HEVC, as a basic unit of compression to process
video with a larger resolution than H.265/HEVC. In addition, each CTU can be first
divided through the QT split, and the QT leaf node can then be partitioned further
through the MTT split for more flexible partitioning. As shown in Fig. 3, the MTT split has four splitting types: vertical binary tree (VBT) split, horizontal
binary tree (HBT) split, vertical ternary tree (VTT) split, and horizontal ternary
tree (HTT) split. The QT+MTT leaf node is a CU, and compression can be performed using
a rectangular CU in H.266/VVC. Unlike H.265/HEVC, the block sizes of CU, PU, and TU
are the same except when CU is larger than the maximum TU size, when predicting in
ISP mode, and performing a transform through SBT mode. In addition, H.266/VVC has
adopted the GPM mode, which performs the prediction and reconstruction through a more
flexible non-rectangular partition rather than a vertical and horizontal rectangular
partition, showing improved coding efficiency. In addition, the CTU dual tree is adopted
in H.266/VVC, which allows the luma and chroma components to use separate partitioning
structures. This can only be applied to an intra-slice, and the compression efficiency
can be improved because luma and chroma components are divided to suit each characteristic.
In the case of inter-slice, luma and chroma components are divided into the same QT+MTT
split structure [20].
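To make the QT+MTT geometry concrete, the following minimal Python sketch (illustrative only; not the VTM implementation, and the function name is ours) computes the child block sizes produced by one application of each split type.

```python
# A minimal sketch of how the QT+MTT splits divide a W x H block into
# child blocks. Sizes are in luma samples.

def split_children(w: int, h: int, mode: str):
    """Return the child block sizes produced by one split of a W x H CU."""
    if mode == "QT":        # quad-tree: four equal quadrants
        return [(w // 2, h // 2)] * 4
    if mode == "VBT":       # vertical binary tree: two halves side by side
        return [(w // 2, h)] * 2
    if mode == "HBT":       # horizontal binary tree: two stacked halves
        return [(w, h // 2)] * 2
    if mode == "VTT":       # vertical ternary tree: 1/4, 1/2, 1/4 of the width
        return [(w // 4, h), (w // 2, h), (w // 4, h)]
    if mode == "HTT":       # horizontal ternary tree: 1/4, 1/2, 1/4 of the height
        return [(w, h // 4), (w, h // 2), (w, h // 4)]
    raise ValueError(mode)

# Example: a 128x128 CTU split by QT, then one QT leaf split by VTT.
print(split_children(128, 128, "QT"))   # [(64, 64)] * 4
print(split_children(64, 64, "VTT"))    # [(16, 64), (32, 64), (16, 64)]
```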
Table 1. Angular prediction modes replaced by WAIP modes depending on the aspect ratio of the CU.
Aspect ratio (W:H) | Angular prediction modes (to be replaced) | Replacing WAIP modes
16:1 | 2 to 15 | 67 to 80
8:1 | 2 to 13 | 67 to 78
4:1 | 2 to 11 | 67 to 76
2:1 | 2 to 7 | 67 to 72
1:1 | None | None
1:2 | 61 to 66 | -6 to -1
1:4 | 57 to 66 | -10 to -1
1:8 | 55 to 66 | -12 to -1
1:16 | 53 to 66 | -14 to -1
When designing and implementing a hardware video decoder, the major coding techniques
are divided into functional stages, and pipelines are configured to enable
parallel processing. The maximum block size of a pipeline stage is determined by the TB
size, and the decoder can be designed to operate by dividing the block to be processed
into arbitrarily small blocks. For this reason, a virtual pipeline data unit (VPDU)
has been introduced in H.266/VVC, like H.265/HEVC. The size of the VPDU is 64${\times}$64
because the maximum TB size in H.266/VVC uses 64${\times}$64, or it is limited to
the size of the maximum CU if the size of the maximum CU is smaller than 64${\times}$64.
In addition, because consecutive VPDUs are processed simultaneously in the pipeline
stages of a hardware video decoder, a constraint is defined that disallows specific
splits for CUs with a width or height of 128, i.e., CUs larger than the VPDU size,
as shown in Fig. 4.
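The following hedged sketch illustrates this kind of constraint check. The disallowed cases encode the intent of Fig. 4, namely that no BT or TT split may create a partition straddling a 64${\times}$64 VPDU; the exact normative conditions are in the H.266/VVC specification, and the function is an illustrative assumption reusing the split labels from the earlier sketch.

```python
# A hedged sketch of the 64x64 VPDU split constraints: TT is disallowed
# when either dimension exceeds the VPDU size, and BT may not create
# partitions that straddle multiple VPDUs.

def split_allowed(w: int, h: int, mode: str, vpdu: int = 64) -> bool:
    if mode in ("VTT", "HTT") and (w > vpdu or h > vpdu):
        return False                  # e.g., no TT for 128x128 or 128x64 CUs
    if mode == "VBT" and h > vpdu and w <= vpdu:
        return False                  # e.g., no vertical BT for a 64x128 CU
    if mode == "HBT" and w > vpdu and h <= vpdu:
        return False                  # e.g., no horizontal BT for a 128x64 CU
    return True

print(split_allowed(128, 128, "VTT"))  # False
print(split_allowed(64, 128, "VBT"))   # False
print(split_allowed(128, 128, "VBT"))  # True (creates two 64x128 halves)
```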
Fig. 3. An example of MTT split types.
Fig. 4. Disallowed BT and TT split for 64×64 VPDU constraints.
3.2 Intra Prediction
This section describes the main intra prediction techniques of H.266/VVC, as shown
in Fig. 5. In H.265/HEVC, intra prediction is performed using 33 angular prediction modes,
planar mode, and DC mode, while in H.266/VVC, various techniques have been adopted
to achieve improved coding efficiency, resulting in better prediction accuracy by
performing prediction using 65 angular prediction modes, wide-angle intra prediction
(WAIP) mode, multiple reference line (MRL), intra sub-partition (ISP) mode, matrix-based
intra prediction (MIP) mode, planar mode, DC mode, position dependent prediction combination
(PDPC) mode, and cross component linear model (CCLM) mode [21].
Angular prediction modes with 65 angles and wide angle intra prediction (WAIP): H.266/VVC
basically supports 65 angular prediction modes, approximately twice as many as
H.265/HEVC, to perform precise prediction for CUs of larger sizes.
In general, for a square CU, the same number of angular prediction modes is assigned
to the above and left sides of the CU. In H.266/VVC, however,
compression is performed using both square and rectangular CUs. WAIP, which allocates
more angular prediction modes to the longer side (the greater of width and height),
has been adopted to enable sufficient angular prediction for rectangular
CUs and ensure improved prediction accuracy. Because the WAIP mode is used without
additional signaling, some angular prediction modes are replaced by modes rotated
by 180 degrees depending on the aspect ratio of the CU, as shown in Table 1.
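As an illustration of Table 1, the sketch below remaps a conventional angular mode to its wide-angle replacement for a rectangular CU; the helper is our own construction, but the mode ranges follow Table 1.

```python
# A minimal sketch of the WAIP mode remapping in Table 1: for a
# rectangular CU, some conventional angular modes are replaced by
# wide-angle modes toward the longer side, without extra signaling.

def remap_waip(mode: int, w: int, h: int) -> int:
    bounds = {16: 15, 8: 13, 4: 11, 2: 7}        # aspect ratio -> mode bound
    if w > h:
        last = bounds[w // h]                    # e.g., 2:1 -> modes 2..7 replaced
        if 2 <= mode <= last:
            return mode + 65                     # 2 -> 67, ..., 15 -> 80
    elif h > w:
        first = 68 - bounds[h // w]              # e.g., 1:2 -> modes 61..66 replaced
        if first <= mode <= 66:
            return mode - 67                     # 61 -> -6, ..., 66 -> -1
    return mode                                  # square CU: no remapping

print(remap_waip(2, 32, 16))   # 67  (2:1 CU, mode 2 replaced)
print(remap_waip(66, 16, 32))  # -1  (1:2 CU, mode 66 replaced)
```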
Reference sample filtering: Similar to H.265/HEVC, two filtering techniques, reference
sample smoothing, and interpolation filtering, are applied to reference samples for
an intra prediction of H.266/VVC. The reference samples may have a discontinuity because
block-based prediction and reconstruction are performed. In addition, the reference
samples located at the fractional-sample position must be generated to perform a prediction
in fractional-slope angular modes. This can be addressed by performing the filtering
process on the reference samples as follows: for the reference sample smoothing, the
reference sample is filtered with the finite impulse response filter {1, 2, 1}/4.
Furthermore, reference sample smoothing is performed for integer-slope angular modes
if the number of samples in the current block is more than 32. Interpolation filtering
is applied to the reference samples around the fractional-sample position with a DCT-based
interpolation filter (DCTIF) for fractional-slope angular modes. The DCTIF is constructed
in the same way as the chroma DCTIF used for motion compensation in both H.265/HEVC
and H.266/VVC [27]. For the luma block, 4-tap interpolation filters are used for reference sample filtering,
and the linear 2-tap interpolation filter of H.265/HEVC is used in H.266/VVC for the
chroma components.
Table 2. Number of sub-partitions for prediction and transform in ISP mode.
CU size | Hor. split: Num. of Pred. sub-part. | Hor. split: Num. of Trans. sub-part. | Ver. split: Num. of Pred. sub-part. | Ver. split: Num. of Trans. sub-part.
4×8 | 2 | 2 | 1 | 2
8×4 | 2 | 2 | 2 | 2
4×N (N>8) | 4 | 4 | 1 | 4
8×N (N>4) | 4 | 4 | 2 | 4
Others | 4 | 4 | 4 | 4
Table 3. MIP cases depending on the CU size.
MIP case | CU size
Case 1 | 4×4
Case 2 | (W == 4 or H == 4) or 8×8
Case 3 | Others
Most probable mode (MPM): In H.266/VVC, the MPM list is constructed based on the left
and above neighboring CUs and is used to reduce the number of transmitted bits for
signaling prediction modes, as in H.265/HEVC. Unlike H.265/HEVC, however, the MPM
list is constructed with 6 MPMs, including planar mode, which is signaled with a separate
flag. The prediction mode of a neighboring CU is considered to be planar mode if it
cannot be referenced or if the neighboring CU is predicted with MIP mode. The MPM list
is constructed based on four cases according to the prediction modes of the neighboring
blocks to signal the prediction mode.
Position dependent prediction combination (PDPC): In H.266/VVC, because intra prediction
is performed using the reference samples above and left of the CU, the prediction
accuracy may be lower for prediction samples at a location far from the reference
samples, such as at the bottom right of the CU. H.266/VVC has adopted the PDPC to
address this problem. The PDPC generates the final prediction samples through a
linear combination of the initial intra-predicted samples and adjacent reference
samples, which are filtered or unfiltered depending on the CU size and the intra
prediction mode.
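A hedged sketch of this combination for the planar/DC case is given below; the position-dependent weight decay follows the commonly published PDPC design, while the exact normative treatment of angular modes differs and is omitted here.

```python
import math

# A hedged sketch of the PDPC combination for the planar/DC case: each
# initial prediction sample is blended with the left and top reference
# samples using weights that decay with the distance from the references.

def pdpc_sample(pred, ref_left, ref_top, x, y, w, h):
    """pred: initial intra prediction at (x, y); ref_left[y], ref_top[x]:
    adjacent reference samples. The weight decay scales with block size."""
    scale = (int(math.log2(w)) + int(math.log2(h)) - 2) >> 2
    w_t = 32 >> min(31, (2 * y) >> scale)    # top weight fades going down
    w_l = 32 >> min(31, (2 * x) >> scale)    # left weight fades going right
    return (w_l * ref_left[y] + w_t * ref_top[x]
            + (64 - w_l - w_t) * pred + 32) >> 6

ref = [128] * 16
print(pdpc_sample(100, ref, ref, x=0, y=0, w=16, h=16))  # pulled toward 128
```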
Multiple reference line (MRL): In H.264/AVC and H.265/HEVC, intra prediction is performed
using the above and left nearest reference line of the CU based on the correlation
of the adjacent pixels. On the other hand, because the reference samples in the reference
line are reconstructed samples, not original, compression error may damage the adjacent
reference samples, which may be unsuitable for prediction. To compensate for this
problem, H.266/VVC uses reference samples in two non-adjacent reference lines in addition
to the adjacent reference line. The index for the reference line is signaled, and
when a prediction is performed using non-adjacent reference lines, only the MPM-based
prediction, excluding planar mode, can be used, and the PDPC cannot be applied.
Intra sub-partition (ISP): In general, a prediction is performed in a video compression
process, and a transform process is performed on a residual signal that is a difference
value between the original signal and the prediction signal. In this case, if the
residual signal is distributed biased toward one side of the residual block, performing
the transform on the whole block at once may spread the compression error to the entire
block. In addition, even for a relatively large block with uniform directionality,
prediction samples far from the reference samples may be inaccurate if the prediction
is performed at once. The prediction accuracy may be improved by dividing the block
into multiple small CUs, but the bits for signaling the partitioning increase
accordingly. ISP mode has been adopted in H.266/VVC to solve this problem. ISP mode
cannot be applied to
a 4${\times}$4 CU. Moreover, if the ISP mode is applied, prediction and transform
processes are performed by dividing the PU or TU into two or four vertically or horizontally
according to the aspect ratio of the CU, as shown in Table 2. Sub-partitions with a width less than 4, such as 1${\times}$N and
2${\times}$N, could pose an issue for hardware implementation. Therefore, such
sub-partitions are grouped into a 4${\times}$N region, a prediction is performed on
that region at once using only its adjacent reference samples, and the corresponding
four 1${\times}$N or two 2${\times}$N TBs perform the transform in parallel [21,28]. In addition, the same prediction mode is applied to every sub-partition, the ISP mode
is applied only when the MRL index is set to 0, and PDPC can be applied to all ISP
sub-partitions in the same way as in the non-ISP case.
Matrix-based intra prediction (MIP): MIP mode is an intra prediction
mode newly adopted in H.266/VVC. It was initially proposed as a neural network-based
prediction method. However, to reduce computational complexity, a set of pre-trained
matrices for three cases according to the CU size is defined as a table, and the
selection is simplified to a signaled index, as shown in Table 3 [29]. The prediction process of the MIP mode is performed in three steps, as shown
in Fig. 6. First, averaging is performed on the above and left reconstructed reference samples,
refT and refL, to obtain the reduced smaller boundaries, refT' and refL'. In the second
step, the predMIP signal is generated by matrix-vector multiplication through the
vector p, which is concatenated with refT’ and refL’ in the first step, and the matrix
$A_{k}$, which is determined by the signaled index. In the final step, the remaining
samples in PU are derived by linear interpolation, where horizontal interpolation
is first performed, and vertical interpolation is then performed.
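The toy sketch below walks through the three steps for an 8${\times}$8 CU. The matrix A_k here is a random stand-in for the pre-trained matrix selected by the signaled index, and step 3 uses simple sample repetition as a stand-in for the horizontal-then-vertical linear interpolation.

```python
import numpy as np

# A simplified sketch of the three MIP steps for a toy 8x8 CU.

def mip_predict(ref_top, ref_left, A_k, w, h):
    # Step 1: averaging - reduce each boundary to 4 samples (8x8 case).
    red_t = ref_top.reshape(4, -1).mean(axis=1)
    red_l = ref_left.reshape(4, -1).mean(axis=1)
    p = np.concatenate([red_t, red_l])            # reduced boundary vector

    # Step 2: matrix-vector multiplication gives a downsampled prediction.
    pred_red = (A_k @ p).reshape(4, 4)            # 4x4 predMIP signal

    # Step 3: upsample to the full CU size (sample repetition here stands
    # in for the horizontal-then-vertical linear interpolation).
    return np.repeat(np.repeat(pred_red, w // 4, axis=1), h // 4, axis=0)

ref_top, ref_left = np.full(8, 128.0), np.full(8, 100.0)
A_k = np.random.rand(16, 8)                       # toy stand-in matrix
print(mip_predict(ref_top, ref_left, A_k, 8, 8).shape)  # (8, 8)
```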
Cross component linear model (CCLM): Generally, when compressing video in a YUV color
format, the luma component is compressed first, followed by the chroma components.
Since the video signal generally has local dependency between different color
components, H.266/VVC exploits this property through the CCLM mode, a prediction
technique that improves the coding efficiency by minimizing the regression error
between the reconstructed luma component and the reconstructed chroma component.
Since deriving a linear model from all reference samples consumes considerable
computational complexity, the derivation has been simplified to use reference samples
at pre-determined positions according to each CCLM mode [30].
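A hedged sketch of such a simplified derivation is shown below, using the minimum/maximum pair among a few luma reference samples to fit the linear model; the exact sample positions and integer arithmetic of H.266/VVC are omitted.

```python
# A hedged sketch of a simplified CCLM parameter derivation: the linear
# model is fit from the min/max pair of a few co-located reference
# samples instead of a full least-squares regression.

def cclm_params(luma_ref, chroma_ref):
    """luma_ref, chroma_ref: co-located reference sample lists."""
    i_min = min(range(len(luma_ref)), key=lambda i: luma_ref[i])
    i_max = max(range(len(luma_ref)), key=lambda i: luma_ref[i])
    denom = luma_ref[i_max] - luma_ref[i_min]
    alpha = (chroma_ref[i_max] - chroma_ref[i_min]) / denom if denom else 0.0
    beta = chroma_ref[i_min] - alpha * luma_ref[i_min]
    return alpha, beta

alpha, beta = cclm_params([80, 120, 160, 200], [60, 75, 90, 105])
predict_chroma = lambda rec_luma: alpha * rec_luma + beta
print(predict_chroma(140))  # 82.5: chroma predicted from reconstructed luma
```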
Fig. 5. Block diagram of intra prediction in H.266/VVC decoder.
Fig. 6. Flowchart of MIP process[21].
3.3 Inter Prediction
As shown in Fig. 7, H.266/VVC has adopted various inter prediction techniques. The coding efficiency
heavily relies on the efficient representation of motion information. Efficient motion
data coding is realized by AMVP mode that predicts the MV values using a list of predictors
and merge/skip modes that derive the complete motion information based on the neighboring
motion data. Merge and AMVP modes in H.266/VVC are an extension of those in H.265/HEVC
to increase the prediction accuracy while minimizing the bitrate. In merge modes of
H.266/VVC, there are block-based merge modes consisting of general merge mode, merge
with motion vector difference (MMVD) mode, combined inter and intra prediction (CIIP)
mode, geometric partitioning mode (GPM), and subblock-based merge mode consisting
of subblock-based temporal motion vector prediction (SbTMVP) and affine motion model-based
motion compensation prediction. Inter-prediction modes in skip mode, where residuals
are not signaled and inferred to be zero, are the same as the merge modes, excluding
CIIP mode. AMVP modes in H.266/VVC consist of general inter mode (regular AMVP mode),
symmetric MVD (SMVD) mode, and affine inter mode. In addition, H.266/VVC has adopted
decoder-side motion vector refinement techniques, which are decoder-side MV refinement
(DMVR), bi-directional optical flow (BDOF), and a prediction refinement with optical
flow (PROF) to improve the prediction accuracy without signaling additional information.
Fig. 7. Block diagram of inter prediction in H.266/VVC decoder.
3.3.1 Merge Mode Inter Prediction
Regular merge mode: A merge candidate list consists of newly introduced history-based
motion vector prediction (HMVP) and pairwise average MVP (PAMVP) following spatial
merge candidates and temporal merge candidates similar to those in H.265/HEVC. HMVP
candidates are the MVs of previously coded CUs, mostly non-adjacent ones, which
are stored in a five-entry table and updated using a first-in-first-out (FIFO)
rule. The PAMVP is generated by averaging the MVs of the first and second candidates
in the merge candidate list for each reference picture list, even if they have different
reference pictures, and assigning the reference picture index of the first candidate
to that of the PAMVP. If there is only one MV available, it is used directly. If no
MV is available for one of the reference picture lists, the MV for the list is considered
invalid.
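A minimal sketch of the PAMVP rule for one reference picture list follows; the integer rounding of the averaging is simplified here.

```python
# A minimal sketch of the pairwise average MVP construction described
# above, for one reference picture list.

def pairwise_average(mv0, mv1):
    """mv0, mv1: (x, y) MVs of the first two merge candidates for one
    reference list, or None if a candidate has no MV for that list."""
    if mv0 is not None and mv1 is not None:
        # Averaged even if the two candidates use different reference
        # pictures; the first candidate's reference index is reused.
        return ((mv0[0] + mv1[0]) // 2, (mv0[1] + mv1[1]) // 2)
    if mv0 is not None:
        return mv0          # only one MV available: use it directly
    if mv1 is not None:
        return mv1
    return None             # no MV for this list: considered invalid

print(pairwise_average((4, -2), (10, 6)))  # (7, 2)
```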
Merge with motion vector difference (MMVD): In MMVD, one of the first two existing
candidates in the merge candidate list is selected as the base motion. The base motion
is refined with an MVD obtained from a predefined direction and a predefined distance.
The predefined direction of the MVD can be positive or negative along either the
horizontal or the vertical axis, resulting in four directions. A direction index is
signaled to indicate the selected MVD direction, as shown in Table 4. A predefined distance indicates how far the base motion is refined. There
are two predefined distance tables, as shown in Table 5. Based on one of the tables selected at the picture level and a signaled distance
index for MVD, the distance from the base motion can be derived for the refined motion.
MMVD improves the accuracy of MV from the general merge mode, even though it cannot
provide MVs as accurate as those in AMVP mode.
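The sketch below assembles an MMVD refinement from a direction index and a distance index. The direction and distance tables are the commonly cited H.266/VVC values corresponding to Tables 4 and 5; treat them as illustrative.

```python
# A hedged sketch of MMVD refinement: a base merge MV is offset along one
# of four axis-aligned directions by one of eight predefined distances,
# with a second distance table selectable at the picture level.

DIRECTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]   # cf. Table 4
DIST_FRAC  = [0.25, 0.5, 1, 2, 4, 8, 16, 32]        # cf. Table 5, set 1
DIST_FULL  = [1, 2, 4, 8, 16, 32, 64, 128]          # cf. Table 5, set 2

def mmvd_refine(base_mv, dir_idx, dist_idx, full_pel=False):
    """base_mv: one of the first two merge candidates, in luma samples."""
    dist = (DIST_FULL if full_pel else DIST_FRAC)[dist_idx]
    dx, dy = DIRECTIONS[dir_idx]
    return (base_mv[0] + dx * dist, base_mv[1] + dy * dist)

print(mmvd_refine((3.0, -1.5), dir_idx=2, dist_idx=3))  # (3.0, 0.5)
```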
Combined inter and intra prediction (CIIP): Similar to the bi-prediction, i.e., a
linear superposition of two motion compensated predictions that can further reduce
the energy of the prediction error than uni-prediction, a uni-directional inter prediction
can be superposed with intra prediction. In CIIP mode, the predicted signal, $P_{CIIP}$,
is generated by a weighted sum of inter-predicted signals using merge mode, $P_{inter}$,
and the intra-predicted signal using the planar mode, $P_{intra}$, as described in Eq.
(3.3.1):
$$P_{CIIP}=\left(W_{inter}\cdot P_{inter}+W_{intra}\cdot P_{intra}+2\right)\gg 2 \quad (3.3.1)$$
The weights, $W_{inter}$ for inter prediction and $W_{intra}$ for intra prediction,
are derived based on whether the above and left neighboring CUs are coded
using intra or inter prediction mode. When both the above and left neighboring CUs
are intra-coded or both are inter-coded, ($W_{inter}$, $W_{intra}$) is set to (1, 3) or (3,
1), respectively. Furthermore, it is set to (2, 2) if only one of the two neighboring
CUs is intra-coded.
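A minimal sketch of the weight selection and the weighted sum of Eq. (3.3.1):

```python
# A minimal sketch of the CIIP weighting rule: the weights depend on how
# many of the above/left neighboring CUs are intra-coded, and the
# weighted sum is normalized by 4 (Eq. (3.3.1)).

def ciip_weights(above_is_intra: bool, left_is_intra: bool):
    n_intra = int(above_is_intra) + int(left_is_intra)
    return {0: (3, 1), 1: (2, 2), 2: (1, 3)}[n_intra]  # (W_inter, W_intra)

def ciip_sample(p_inter: int, p_intra: int, w_inter: int, w_intra: int):
    return (w_inter * p_inter + w_intra * p_intra + 2) >> 2

w_inter, w_intra = ciip_weights(True, False)     # one intra neighbor
print(ciip_sample(100, 140, w_inter, w_intra))   # (2*100 + 2*140 + 2) >> 2 = 120
```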
Geometric partitioning mode (GPM): H.266/VVC does not employ geometric block partitioning
that could increase the partitioning precision for moving objects boundaries because
of implementation complexity. Instead, GPM mode has been adopted to provide a similar
effect to geometric block partitioning. In GPM mode, a CU is partitioned into two
parts using a straight line. The straight line is parametrized by an angle and an
offset so that there are 64 partitions for a CU with size $w\times h=2^{m}\times 2^{n}$
with $m,n\in\{3,\ldots,6\}$, excluding 8${\times}$64 and 64${\times}$8. In each
partition, its own MV is derived from performing a block-based motion compensation.
The final prediction for the CU is generated by performing a blending process with
adaptive weights based on the position of each sample relative to the geometry partitioning
boundary, as shown in Fig. 8. The MV for each partition is uni-directional, so the complexity of motion compensation
for the CU is the same as that of bi-directional prediction. The MV for each partition
is derived from the regular merge list, as shown in Fig. 9, with the signaled GPM merge index.
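The blending step can be sketched as below with a toy straight-line mask; the actual H.266/VVC weights come from quantized angle/offset tables with a narrow integer blending ramp, so this floating-point version is illustrative only.

```python
import numpy as np

# A simplified sketch of GPM blending: two uni-directional predictions are
# combined with a per-sample weight derived from each sample's signed
# distance to the partitioning line (here parametrized by angle/offset).

def gpm_blend(p0, p1, angle_deg=45.0, offset=0.0):
    h, w = p0.shape
    y, x = np.mgrid[0:h, 0:w]
    nx, ny = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    d = (x - w / 2) * nx + (y - h / 2) * ny - offset   # signed distance
    w0 = np.clip(0.5 + d / 4.0, 0.0, 1.0)   # narrow blend band near the line
    return w0 * p0 + (1.0 - w0) * p1

p0 = np.full((8, 8), 100.0)
p1 = np.full((8, 8), 200.0)
blended = gpm_blend(p0, p1)
print(blended[0, 0], blended[7, 7])  # 200.0, 100.0 on the two far sides
```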
Subblock-based merge mode: In subblock-based merge mode, a CU with a height and width
larger than or equal to eight luma samples is divided into 8${\times}$8 subblocks,
and a different MV is derived for each subblock. A subblock-based merge mode consists
of SbTMVP similar to TMVP in H.265/HEVC and motion vector prediction based on an affine
motion model. SbTMVP is only applied to the subblock-based merge mode, and an affine
motion model-based prediction is applied to the merge and AMVP modes. Section 3.3.3
describes the affine motion model-based prediction. Fig. 10 presents the derivation process of SbTMVP consisting of two steps: 1) derivation
of a displacement vector (DV), and 2) derivation of motion information for each subblock
based on the motion information derived by the DV. If the MV of the left-bottom neighboring
block refers to the collocated picture, that MV is used as the DV. Otherwise, zero
MV is used as the DV. The DV is applied to the central position of the current CU
to locate the corresponding sample position in the collocated picture. The SbTMVP
is considered available if the block containing the corresponding sample position
in the collocated picture is inter-coded. If available, the motion information of
the corresponding subblocks in the collocated picture is found by applying the DV
to each subblock in the current CU. Finally, the motion vector of each subblock in
the current CU is derived from the motion information of the corresponding subblock,
similar to the TMVP process in H.265/HEVC, where temporal motion scaling is applied
to align the reference pictures of the temporal motion vectors to those of the current
CU [31]. The derived motion vectors for subblocks in the current CU become the SbTMVP.
Fig. 8. Derivation of predicted signal using GPM.
Fig. 9. GPM merge list construction.
Fig. 10. Derivation of SbTMVP[22].
3.3.2 AMVP Mode Inter Prediction
In the AMVP mode, the components of an MV are coded differentially using an MVP and
MV difference (MVD). In H.266/VVC, the AMVP mode is extended using improved predictors,
providing a more flexible MVD signaling to improve the tradeoff between the motion
accuracy and overhead motion bits. These enhancements on MV coding and motion compensation,
including the revised AMVP candidate list construction, AMVR, BCW, and SMVD, are described
in the following.
General AMVP mode: The MV prediction algorithm of H.266/VVC is based on the AMVP of
H.265/HEVC. The AMVP introduced in H.265/HEVC explicitly signals one of the two potential
MVP candidates derived from five spatially neighboring and two temporally co-located
MVs. In this way, the motion information, including reference picture indices, MVP
indices, and MVDs for reference picture lists 0 and 1, is signaled in the
AMVP mode. As a new feature, H.266/VVC adds HMVP candidates to the merge
and AMVP candidate lists. The HMVP allows the reuse of MVs of previously coded
non-adjacent CUs. In addition, the AMVP candidate list construction process is revised
in terms of complexity.
Adaptive motion vector resolution (AMVR): H.266/VVC increases the MV precision to
1/16 luma sample, while the HEVC uses only a quarter-luma-sample precision. On top
of the higher precision MV representations, a CU-level AMVR method is applied to customize
the balance between quality and the MV bit cost overhead. For a CU with translational
motion in AMVP mode, MVDs can be coded in units of a quarter, half, integer, or four
luma samples. For the affine AMVP mode, MVDs can be switched among the quarter, integer,
or 1/16 luma samples. An alternative six-tap smoothing interpolation filter (IF) is
used instead of the eight-tap IF from HEVC when a half-luma-sample MV accuracy is
used in AMVP mode.
The MVP is rounded to the indicated precision before being added together with the
MVD to ensure the reconstructed MV uses the same precision as the MVD. The CU-level
MV resolution indication is conditionally signaled if the current CU has at least
one non-zero MVD component. Quarter-luma-sample MVD resolution is inferred if all
MVD components (i.e., horizontal and vertical MVDs for reference lists 0 and 1, respectively)
are zero.
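A minimal sketch of this rounding-then-adding rule, with MVs stored in 1/16-luma-sample units; the rounding convention used here (round half up) is an illustrative assumption.

```python
# A minimal sketch of AMVR reconstruction: the MVP is rounded to the
# signaled MVD precision before the MVD is added, so the reconstructed MV
# has the same precision as the MVD. MVs are in 1/16-luma-sample units.

def round_to_precision(v: int, shift: int) -> int:
    """Round a 1/16-pel value to a coarser precision (units of 2^shift)."""
    if shift == 0:
        return v
    offset = 1 << (shift - 1)
    return ((v + offset) >> shift) << shift

# shift: 2 = quarter-pel, 3 = half-pel, 4 = integer-pel, 6 = 4-pel units
def amvr_reconstruct(mvp: int, mvd: int, shift: int) -> int:
    return round_to_precision(mvp, shift) + mvd

mvp = 37                      # 37/16 luma samples
mvd = 64                      # +4 luma samples, signaled at integer precision
print(amvr_reconstruct(mvp, mvd, shift=4))  # 32 + 64 = 96 (= 6 luma samples)
```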
Bi-prediction with CU-level weights (BCW): BCW provides the weighted averaging of
the two prediction signals for bi-prediction at the CU-level, in addition to the traditional
weighted prediction (WP) for which the weights are specified at the slice level for
each reference picture. In H.266/VVC, the legacy explicit-weighted prediction scheme
is kept and extended with CU-level syntax control for the weighted averaging. Five
weights are predefined, $w\in\{-2,3,4,5,10\}$, and an index (denoted as $wIdx$)
is signaled at the CU level to specify the selected weight $w$ of the prediction block
from list 1. All five weights are used when all the reference pictures temporally
precede the current picture in display order. Otherwise, only the weights
$w\in\{3,4,5\}$ are used.
Each luma/chroma prediction sample of BCW is calculated as follows:
$$P_{bi-pred}=\left(\left(8-w\right)\cdot P_{L0}+w\cdot P_{L1}+4\right)\gg 3 \quad (3.3.2)$$
where $P_{bi-pred}$ is the final prediction and $P_{L0}$ and $P_{L1}$ are prediction
samples pointed to by the MVs from reference picture lists 0 and 1, respectively.
BCW is only applied to CUs with a CU size larger than or equal to 256 luma samples.
To avoid interactions between WP and BCW, if a CU uses WP, then $wIdx$ is not signaled,
and $w$ is inferred to be 4 (i.e., equal weight is applied). For regular merge mode
or affine merge mode, $wIdx$ is inferred from neighboring blocks based on the
merge candidate index. CIIP and BCW cannot be jointly applied for a CU. When a CU
is coded with CIIP mode, $wIdx$ is set to 2 (i.e., equal weight). The DMVR and
BDOF are both turned off when the weight is non-equal.
Symmetric motion vector difference (SMVD): When the motion of the current block is
on a constant trajectory over a past and future reference picture in display order,
the corresponding MVs and reference picture indices tend to be symmetrical. SMVD exploits
this assumption of linear motion to save bits for MVDs and reference picture index
signaling in the true bi-direction mode that uses past and future reference pictures,
as shown in Fig. 11. When SMVD is applied for a CU, only the MVP indices of lists 0 and 1 and the MVD
for list 0 are signaled. Other motion information is derived at the decoder side without
signaling. That is, first, the MVD for list 1, $\left(mvdx_{L1},mvdy_{L1}\right)$, is
set to the reverse of the list 0 MVD, as shown below:
$$\left(mvdx_{L1},mvdy_{L1}\right)=\left(-mvdx_{L0},-mvdy_{L0}\right) \quad (3.3.3)$$
Second, the lists 0 and 1 reference picture indices are implicitly derived at the
slice level. That is, each reference picture is the nearest picture among all pictures
in its list, and they have opposite directions to each other.
Fig. 11. Illustration for SMVD mode.
3.3.3 Affine Motion Compensation
Because H.265/HEVC only considered the translational motion model for motion compensation
prediction, it is inefficient when there is motion, such as zoom-in/-out, rotation,
or perspective motions. H.266/VVC has adopted the 4-/6-parameter affine motion model
for motion compensation prediction to improve the coding efficiency, especially for
those cases.
Two affine motion models based on control point MVs (CPMVs) are supported: a 4-parameter
affine motion model using two CPMVs and a 6-parameter affine motion model using three
CPMVs, as depicted in Fig. 12. In the case of the 4-parameter affine motion model using the two CPMVs $v_{0}$
and $v_{1}$, an MV ($mv_{x}$, $mv_{y}$) at sample location (x, y) in a CU of size
$W\times H$ is derived as follows:
$$mv_{x}=\frac{v_{1x}-v_{0x}}{W}x-\frac{v_{1y}-v_{0y}}{W}y+v_{0x},\qquad mv_{y}=\frac{v_{1y}-v_{0y}}{W}x+\frac{v_{1x}-v_{0x}}{W}y+v_{0y} \quad (3.3.4)$$
In the case of the 6-parameter affine motion model using the three CPMVs $v_{0}$,
$v_{1}$, and $v_{2}$, an MV ($mv_{x}$, $mv_{y}$) at sample location (x,
y) in a CU of size $W\times H$ is derived as follows:
$$mv_{x}=\frac{v_{1x}-v_{0x}}{W}x+\frac{v_{2x}-v_{0x}}{H}y+v_{0x},\qquad mv_{y}=\frac{v_{1y}-v_{0y}}{W}x+\frac{v_{2y}-v_{0y}}{H}y+v_{0y} \quad (3.3.5)$$
As expressed in Eqs. (3.3.4) and (3.3.5), an MV at every sample location in a CU can be derived with an affine motion model
using CPMVs. On the other hand, to reduce complexity, an MV is derived for each 4${\times}$4
subblock in a CU, and motion compensation is performed with the derived 4${\times}$4
block level MV, which is an MV at the center sample position of the 4${\times}$4 block.
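The sketch below derives the per-subblock MVs of the 4-parameter model (Eq. (3.3.4)) at each 4${\times}$4 subblock center, as described above.

```python
# A minimal sketch of 4-parameter affine subblock MV derivation: one MV is
# computed at the center of each 4x4 subblock from the two CPMVs
# v0 (top-left corner) and v1 (top-right corner), following Eq. (3.3.4).

def affine_subblock_mvs(v0, v1, w, h, sub=4):
    ax = (v1[0] - v0[0]) / w        # per-pixel horizontal MV gradient
    ay = (v1[1] - v0[1]) / w        # per-pixel vertical MV gradient
    mvs = {}
    for sy in range(0, h, sub):
        for sx in range(0, w, sub):
            cx, cy = sx + sub / 2, sy + sub / 2    # subblock center sample
            mv_x = ax * cx - ay * cy + v0[0]
            mv_y = ay * cx + ax * cy + v0[1]
            mvs[(sx, sy)] = (mv_x, mv_y)
    return mvs

mvs = affine_subblock_mvs(v0=(0.0, 0.0), v1=(4.0, 2.0), w=16, h=16)
print(mvs[(0, 0)], mvs[(12, 12)])   # rotation + zoom, varying per subblock
```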
The affine merge mode and affine inter mode are based on the affine motion model in
H.266/VVC depending on the ways of obtaining CPMVs. In the affine merge mode, which
is a part of the subblock-based merge mode, the affine merge candidates, i.e., CPMV
candidates, are added to the subblock-based merge candidate list so that the motion
compensated prediction can be performed based on the affine motion model with the
derived CPMVs. There are two types of CPMV candidates in affine merge mode: inherited
affine merge candidates and constructed affine merge candidates. The inherited affine
merge candidates are the CPMVs of the current CU derived from the CPMVs of the above
or left neighboring CU, which is motion compensated based on the affine motion model.
The constructed affine merge candidates are constructed by combining the translational
motion information of neighboring CUs corresponding to the control points.
The affine inter mode, which is a type of AMVP mode, requires information such as
a flag to indicate the motion model type, i.e., the 4-parameter model or 6-parameter
model, and two or three motion vector differences (MVDs) between the CPMVs and the
predictors from an affine MVP list. For up to two affine AMVP candidates for an affine
MVP list, there are the inherited affine AMVP candidates and the constructed affine
AMVP candidates similar to the affine merge mode. The inherited affine AMVP candidates
are derived from the CPMVs of the above or left CU having the same reference picture
as the current CU, and the constructed affine AMVP candidates are derived by combining
the translational motion information of the neighboring CUs having the same reference
picture as the current CU. In addition, if there are fewer than two affine AMVP candidates,
the same translational MV of a neighboring CU is assigned to two or three CPMVs in
the affine AMVP list, depending on the motion model type.
Fig. 12. CPMV based affine motion model.
3.3.4 Decoder-side MV Refinement Tools
Refinements of motion and prediction at the decoder side are introduced to improve
the prediction quality without increasing the bit overhead of the signaling motion
parameters. DMVR is used to improve the accuracy of the MVs of the regular merge mode
with a low-complexity motion refinement. Unlike block-based motion compensation (MC),
optical flow is expected to achieve the effect of sample-wise inter prediction. It
is implemented in H.266/VVC as BDOF to improve the bi-prediction efficiency and as
PROF to refine the subblock prediction of the affine MC (AMC).
Decoder-side motion vector refinement (DMVR): DMVR refines the bi-prediction motion
of the regular merge mode using a bilateral search. To ensure the bilateral search
with equal distance, DMVR is allowed only if the merge MV pair point to two reference
pictures that have an equal and opposite temporal distance to the current picture.
As shown in Fig. 13, DMVR applies bilateral matching to refine the accuracy of the input MV pair {$MV_{0},MV_{1}$}.
That is, it searches the candidate MVs around the initial MVs in lists 0 and 1 with
a mirrored MV offset $MV_{diff}$:
$$MV'_{0}=MV_{0}+MV_{diff},\qquad MV'_{1}=MV_{1}-MV_{diff} \quad (3.3.6)$$
The refined pair {$MV'_{0},MV'_{1}$} obtained by Eq. (3.3.6) is used for the
motion-compensated prediction of both the luma and chroma CBs of a CU.
The searching process consists of an integer sample MV offset search and a fractional
sample MV offset refinement. The integer sample MV search calculates the sums of the
absolute differences (SADs) between each pair of candidate reference blocks in lists
0 and 1 within the search range of ${\pm}$2 integer luma samples from the initial
MVs. The fractional sample refinement is derived using a parametric error surface
approximation instead of additional SAD comparisons.
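A simplified sketch of the integer search follows: mirrored offsets within ${\pm}$2 luma samples are scored by SAD, and the best offset gives $MV_{diff}$ of Eq. (3.3.6). The block-fetch callables are illustrative assumptions, and the parametric fractional refinement is omitted.

```python
import numpy as np

# A simplified sketch of the DMVR integer search: candidate mirrored
# offsets are scored by the SAD between the two reference blocks, and the
# minimizing offset refines both MVs symmetrically.

def dmvr_integer_search(block_l0_at, block_l1_at, search_range=2):
    """block_lX_at(offset) -> reference block (numpy array) sampled at the
    initial list-X MV plus the given offset."""
    best, best_sad = (0, 0), float("inf")
    for oy in range(-search_range, search_range + 1):
        for ox in range(-search_range, search_range + 1):
            sad = np.abs(block_l0_at((ox, oy)) -
                         block_l1_at((-ox, -oy))).sum()   # mirrored offset
            if sad < best_sad:
                best, best_sad = (ox, oy), sad
    return best  # MV_diff; MV0' = MV0 + MV_diff, MV1' = MV1 - MV_diff

patch = np.random.rand(8, 8)
l0 = lambda off: patch                                    # toy: already aligned
l1 = lambda off: patch if off == (0, 0) else patch + 1.0  # toy mismatch model
print(dmvr_integer_search(l0, l1))                        # (0, 0)
```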
Bi-directional optical flow (BDOF): BDOF is another coding tool for improving the
bi-prediction signal using a motion refinement performed by the decoder. In particular,
BDOF aims at compensating the sample-wise fine motion that is limited in the block-based
MC based on the optical flow concept at the 4${\times}$4 subblock level. It is applied
to CUs coded either in merge mode or AMVP mode and assumes constant motion trajectory.
As the same constraint applied to DMVR, BDOF is applied only if the two different
reference pictures have an equal distance in picture order count (POC) to the current
picture.
For each 4${\times}$4 subblock, a motion difference relative to CU MVs is calculated
by solving an optical flow equation that minimizes the difference between the prediction
subblocks of lists 0 and 1. The derived motion differences and the prediction sample
gradients are then used to adjust the bi-predicted sample values.
Let $I\left(i,j,t\right)$ be the luminance value of a sample at time $t$ in position
$\left(i,j\right)$. Assuming the luminance of a sample does not change during the
object motion, the optical flow equation can be expressed as follows:
$$\frac{\partial I}{\partial t}+v_{x}\frac{\partial I}{\partial x}+v_{y}\frac{\partial I}{\partial y}=0 \quad (3.3.7)$$
where the motion $(v_{x},v_{y})$ describes the remaining motion applied on top of
the original MV at each sample position.
As shown in Fig. 14, the motion $(v_{x},v_{y})$ from $I_{c}$ to $I_{0}$ is symmetrical to its motion
from $I_{c}$ to $I_{1}$, where $I_{c}$, $I_{0}$ and $I_{1}$ are arrays of luminance
values in the current block and the two prediction blocks from the lists 0 and 1 reference
pictures, respectively. According to the aforementioned constraint that BDOF is applied
only to a true bi-directional prediction with the same prediction distance, the remaining
motions relative to both reference pictures are assumed to be in a mirroring relation.
In such a symmetric motion model illustrated in Fig. 14, each sample in $I_{c}$ can be approximated from two directions, one from its correspondence
$A$ in $I_{0}$ and the other from its correspondence $B$ in $I_{1}$, using Eq. (3.3.7). By minimizing the difference between the two predictions with refined motion, the value
of $(v_{x},v_{y})$ is calculated as
$$\left(v_{x},v_{y}\right)=\underset{\left(v_{x},v_{y}\right)}{\arg\min}\sum_{\left(i,j\right)\in\Omega}\left(I_{0}\left(A\right)-I_{1}\left(B\right)\right)^{2} \quad (3.3.8)$$
The vector $(v_{x},v_{y})$ of each 4${\times}$4 subblock is calculated from the extended
6${\times}$6 window (denoted as $\Omega $) containing a subblock in the center, assuming
that it is constant in each subblock. This way, a more stable motion field is derived
with reduced computational complexity. The optimization problem in Eq. (3.3.8) can be solved using the auto- and cross-correlation of the horizontal and vertical
gradients for each prediction sample [22].
Based on the derived motion refinement $(v_{x},v_{y})$ and the prediction sample gradients,
the following adjustment is calculated for each sample in the subblock:
$$b\left(i,j\right)=\frac{v_{x}\left(g_{x}^{\left(1\right)}\left(i,j\right)-g_{x}^{\left(0\right)}\left(i,j\right)\right)+v_{y}\left(g_{y}^{\left(1\right)}\left(i,j\right)-g_{y}^{\left(0\right)}\left(i,j\right)\right)}{2} \quad (3.3.9)$$
where $g_{x}^{\left(k\right)}$ and $g_{y}^{\left(k\right)}$ are the horizontal and vertical
gradients of the list $k$ prediction signal. Finally, the bi-prediction signal of BDOF,
$I'_{c}\left(i,j\right)$, is calculated by adjusting the bi-prediction samples as follows:
$$I'_{c}\left(i,j\right)=\left(I_{0}\left(i,j\right)+I_{1}\left(i,j\right)+b\left(i,j\right)\right)/2 \quad (3.3.10)$$
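Given a refinement $(v_{x},v_{y})$ already derived for a subblock, the per-sample adjustment of Eqs. (3.3.9) and (3.3.10) can be sketched as follows (floating point, without the normative rounding and shifts).

```python
import numpy as np

# A simplified sketch of the per-sample BDOF adjustment: the subblock
# motion refinement (vx, vy) and the gradients of the two prediction
# signals correct the plain bi-prediction average.

def bdof_adjust(i0, i1, gx0, gy0, gx1, gy1, vx, vy):
    """i0, i1: list-0/1 prediction subblocks; gx*, gy*: their horizontal
    and vertical gradients; returns the refined bi-prediction."""
    b = (vx * (gx1 - gx0) + vy * (gy1 - gy0)) / 2.0   # Eq. (3.3.9)
    return (i0 + i1 + b) / 2.0                        # Eq. (3.3.10)

i0 = np.full((4, 4), 100.0); i1 = np.full((4, 4), 104.0)
g = np.ones((4, 4))
print(bdof_adjust(i0, i1, g, g, 2 * g, 2 * g, vx=0.5, vy=0.0)[0, 0])  # 102.125
```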
When DMVR and BDOF are applied to a CU, DMVR is performed first and followed by BDOF.
If BCW, WP, CIIP, or GPM, which include the blending process, is enabled for a CU,
then the BDOF is disabled. BDOF is also disabled when a CU is coded with SMVD mode.
Prediction refinement with optical flow (PROF): PROF is used to compensate for the
prediction error of a subblock-based AMC with the optical flow-based sample-wise refinement.
In this way, a finer granularity of AMC, which is conducted in a block-wise manner
for the trade-off between prediction accuracy and complexity, can be achieved.
After the subblock-based AMC is performed, each luma prediction sample is refined
by adding a difference derived based on the optical flow equation. PROF is not applied
to chroma samples.
The prediction at position $\left(i,j\right)$ in the current block $P\left(i,j\right)$
is predicted from the sample at position $\left(x,y\right)$ in the reference picture
$I\left(x,y\right)$ with the subblock MV. Let $\Delta v\left(i,j\right)$ be the difference
between the sample MV computed by an affine model and the MV of the subblock to which
the sample $\left(i,j\right)$ belongs, as shown in Fig. 15. The prediction with the sample MV, $I'\left(x,y\right)$, would be:
$$I'\left(x,y\right)=I\left(x,y\right)+g_{x}\left(i,j\right)\Delta v_{x}\left(i,j\right)+g_{y}\left(i,j\right)\Delta v_{y}\left(i,j\right) \quad (3.3.11)$$
where $g_{x}\left(i,j\right)$ and $g_{y}\left(i,j\right)$ are the horizontal and vertical
gradients of the subblock prediction, respectively, which are calculated at each sample
location similar to BDOF.
The prediction refinement $\Delta I\left(i,j\right)$ is derived using the spatial
gradients of each prediction sample and the sample-based MV offset relative to the
centered subblock MV, $\Delta v\left(i,j\right)$, as follows:
$$\Delta I\left(i,j\right)=g_{x}\left(i,j\right)\Delta v_{x}\left(i,j\right)+g_{y}\left(i,j\right)\Delta v_{y}\left(i,j\right) \quad (3.3.12)$$
The prediction refinement is added to the affine subblock prediction to form the final
affine prediction:
$$P'\left(i,j\right)=P\left(i,j\right)+\Delta I\left(i,j\right) \quad (3.3.13)$$
3.4 Transform Coding
As shown in Fig. 16, H.266/VVC has adopted various transform and quantization techniques. First, after
intra/inter prediction, a transform is applied to the residual signal, and the residual
signals in the spatial domain are converted to the frequency domain. An integer transform
based on DCT-II has been used widely in the previous video coding standards. In most
video coding standards, the two dimensional (2D) transform of the residual signal
is performed through two 1D transforms in the horizontal and vertical directions using
separability. 1D $N$-point transform and inverse transform are defined in Eqs. (3.4.1) and (3.4.2), respectively.
where $N$ is the transform size, $F\left(u\right)$ is the transformed coefficient,
$p\left(x\right)$ is original signal, and $v_{u,x}$ is the basis element of $N\times
1$ basis vector.
The H.266/VVC transform inherits the basic framework of H.265/HEVC, such as integer
transform, fixed point arithmetic operation, and intermediate data representation
[23]. H.266/VVC introduced extended transform techniques to achieve better energy compaction.
The new transform design of H.266/VVC is as follows.
Primary transform: The primary transform is a technique that has been used in the
existing video coding, and it was named to distinguish it from the secondary transform.
In H.265/HEVC, the separable transform was applied to square blocks of up to 32${\times}$32
[14]. In H.266/VVC, the maximum transform size is extended to 64${\times}$64, and non-square
blocks are also supported. Furthermore, zeroing out is introduced to reduce the decoder
complexity because of the increased transform size. For the 64-point transform, only
the first 32 low-frequency coefficients are maintained, and the high-frequency coefficients
are zeroed out. In addition to the conventional DCT-II, H.266/VVC specifies the alternative
transforms, such as DST-VII and DCT-VIII. The DCT-II, DST-VII, and DCT-VIII can be
applied to the luma blocks, and only the DCT-II is used for chroma blocks. The basis
functions of the 1D $N$-point DCT-II, DST-VII, and DCT-VIII are formulated in Eqs.
(3.4.3)-(3.4.5), respectively:
$$v_{u,x}=\omega_{u}\sqrt{\frac{2}{N}}\cos\left(\frac{\pi u\left(2x+1\right)}{2N}\right),\quad \omega_{u}=\begin{cases}\sqrt{1/2}, & u=0\\ 1, & u\neq 0\end{cases} \quad (3.4.3)$$
$$v_{u,x}=\sqrt{\frac{4}{2N+1}}\sin\left(\frac{\pi\left(2u+1\right)\left(x+1\right)}{2N+1}\right) \quad (3.4.4)$$
$$v_{u,x}=\sqrt{\frac{4}{2N+1}}\cos\left(\frac{\pi\left(2u+1\right)\left(2x+1\right)}{4N+2}\right) \quad (3.4.5)$$
The DCT-II is applied to transform block sizes from 4${\times}$4 to 64${\times}$64,
while the DST-VII and DCT-VIII are applied to transform block sizes from 4${\times}$4
to 32${\times}$32. Similar to the 64-point DCT-II, the coefficients outside the first
sixteen (16) low frequency ones are zeroed out for the 32-point DST-VII and DCT-VIII.
A method that applies the transform by selecting one of the three transforms is called
multiple transform selection (MTS), and there are two types of MTS, called explicit
MTS and implicit MTS. The explicit MTS applies to both intra and inter coded blocks,
and the selected transform combination is signaled through an index. On the other
hand, implicit MTS is only available in intra coded blocks, and the transform types
are derived from the coded information known to both the encoder and decoder with
no index signaled. The number of multiplications and additions of multiple transforms
in H.266/VVC can be found elsewhere [32].
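The sketch below builds the three transform bases directly from Eqs. (3.4.3)-(3.4.5) and applies a separable 2D transform; VTM uses scaled integer approximations of these matrices rather than floating point.

```python
import numpy as np

# A sketch of the three MTS primary-transform bases from their
# trigonometric definitions, applied as a separable 2D transform.

def dct2(n):
    u, x = np.mgrid[0:n, 0:n]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * u * (2 * x + 1) / (2 * n))
    m[0, :] /= np.sqrt(2.0)            # normalize the DC basis row
    return m

def dst7(n):
    u, x = np.mgrid[0:n, 0:n]
    return np.sqrt(4.0 / (2 * n + 1)) * np.sin(
        np.pi * (2 * u + 1) * (x + 1) / (2 * n + 1))

def dct8(n):
    u, x = np.mgrid[0:n, 0:n]
    return np.sqrt(4.0 / (2 * n + 1)) * np.cos(
        np.pi * (2 * u + 1) * (2 * x + 1) / (4 * n + 2))

# Separable 2D transform: vertical kernel on rows, horizontal on columns.
residual = np.random.rand(4, 4)
coeffs = dst7(4) @ residual @ dct8(4).T
print(np.allclose(dct2(4) @ dct2(4).T, np.eye(4)))  # orthonormal basis: True
```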
Secondary transform: The secondary transform is a newly adopted transform tool in
H.266/VVC and means an additional transform process that follows the primary transform.
The low frequency non-separable transform (LFNST) is a non-separable transform that
applies to the top-left low-frequency region of intra-coded blocks that use the primary
transform using the DCT-II. There are two types of LFNST according to the size of
the transform block. In particular, a 16${\times}$48 (row${\times}$column) kernel
is applied to the top-left 8${\times}$8 region when the size of the transform block
is greater or equal to 8${\times}$8. This is referred to as LFNST8. A 16${\times}$16
kernel is applied to the top-left 4${\times}$4 regions when the width or height of
the transform block is 4, referred to as LFNST4. Four kernel sets are defined in LFNST
according to the intra prediction modes, and two kinds of kernels exist in each kernel
set. The information on whether to use LFNST and which kernel is selected is explicitly
signaled through an index per CU. Similar to the zeroing out used with the primary
transform, the coefficients outside of the LFNST output region are zeroed out. Fig. 17 shows an example of applying LFNST8 to a block size larger than 8${\times}$8 or equal
to 8${\times}$8.
Subblock transform (SBT): SBT is a transform method that splits the inter-coded residual
block and encodes only one of the two sub-partitions. There are eight modes according
to the size and position of the transform block. The residual block can be divided
horizontally or vertically by half or quarter size of the CU, and the coded sub-partition
can be located at the left, right, top, or bottom part. The transform kernel is chosen
adaptively based on the location of the transform block, as shown in Fig. 18. Only DCT-II can be used for horizontal and vertical transform when the width or
height of an SBT transform block exceeds 32. Fig. 18 shows eight SBT modes in H.266/VVC and selected transform combinations based on the
location of the SBT transform block.
Fig. 13. Bilateral matching based DVMR.
Fig. 14. BDOF using a symmetric motion model[22].
Fig. 15. Subblock AMC and sample-based AMC[22].
Fig. 16. Block diagram of transform and quantization in H.266/VVC decoder.
Fig. 17. Example of applying LFNST8.
Fig. 18. SBT modes and kernel combination.
3.5 Quantization
Quantization is an irreversible operation that maps a specific range of input values
to a single representative value for the input. The quantization is applied to the
transform coefficients. H.265/HEVC uses scalar quantization based on uniform reconstruction
quantizers (URQs). The URQ is a design in which the quantization rate is varied with
a single parameter, the quantization step size $\Delta $. There are 52 quantization
step sizes, which are real numbers rather than integers. A quantization parameter
(QP), an integer value from 0 to 51, is therefore used to avoid direct access to or
operation on real numbers. When QP = 4, the quantization step size equals 1, and an
increase of 6 in QP doubles the quantization step size. The relationship between $QP$
and the quantization step size is expressed as follows:

$\Delta =2^{\left(QP-4\right)/6}$
In the quantization process, a quantized coefficient level ($level$) is obtained by
dividing a transform coefficient $C$ by $\Delta $:

$level=\mathrm{round}\left(C/\Delta \right)$

In the de-quantization process, a reconstructed value $C'$ can be obtained by
multiplying $level$ by $\Delta $:

$C'=level\cdot \Delta $
In practice, a pre-computed quantization scale (the product of $\Delta $ and a scale
factor) and bitwise shift operations are used to avoid real-number division.
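In floating-point terms (the standards implement this with integer scale tables and shifts, as noted above), the three relations reduce to a few lines:

```python
def qp_to_step(qp: int) -> float:
    """Quantization step size: Delta = 2 ** ((QP - 4) / 6), so Delta(4) = 1
    and every +6 in QP doubles the step size."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(c: float, qp: int) -> int:
    """Forward quantization: level = round(C / Delta)."""
    return round(c / qp_to_step(qp))

def dequantize(level: int, qp: int) -> float:
    """De-quantization: C' = level * Delta."""
    return level * qp_to_step(qp)

assert qp_to_step(4) == 1.0
assert qp_to_step(10) == 2.0   # an increase of 6 in QP doubles the step size
```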
H.266/VVC modifies and extends the H.265/HEVC quantization techniques as follows
[25].
Extended quantization: H.266/VVC retains the URQ design of H.265/HEVC, but the
maximum QP is extended from 51 to 63. The adaptive frequency weighting quantization and
sign data hiding (SDH) used in H.265/HEVC are also supported in H.266/VVC.
Dependent quantization (DQ): DQ defines two inverse scalar quantizers with different
reconstruction levels, denoted by Q0 and Q1, and enables switching between the two
quantizers when decoding each transform coefficient. The admissible reconstruction
values for the current transform coefficient depend on the values of the preceding
transform coefficient levels $k$. The switching between the two quantizers is achieved
through a state machine with four states, and the selected quantizer is not explicitly
signaled. Fig. 19 shows the state transition and quantizer selection; Fig. 20 shows the scalar quantizers Q0 and Q1. The initial state is set equal to 0.
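A minimal decoder-side sketch of DQ is shown below, assuming the widely described four-state design in which Q0 reconstructs even multiples of $\Delta $, Q1 reconstructs odd-shifted multiples, and the next state is chosen by the parity of the decoded level; the exact tables should be taken from the specification.

```python
# Dependent quantization, decoder side: four states, Q0 for states {0, 1} and
# Q1 for states {2, 3}; the next state depends only on the parity of the current
# level, so the quantizer selection need not be signaled.
STATE_TRANSITION = [(0, 2), (2, 0), (1, 3), (3, 1)]  # next_state[state][level & 1]

def dq_reconstruct(levels, step):
    """Reconstruct transform coefficients from levels with dependent quantization.

    Q0 reconstructs t' = 2 * k * step; Q1 reconstructs t' = (2 * k - sgn(k)) * step.
    """
    state = 0                     # the initial state is 0
    coeffs = []
    for k in levels:
        if state < 2:             # quantizer Q0
            t = 2 * k * step
        else:                     # quantizer Q1
            sgn = (k > 0) - (k < 0)
            t = (2 * k - sgn) * step
        coeffs.append(t)
        state = STATE_TRANSITION[state][abs(k) & 1]
    return coeffs

print(dq_reconstruct([3, -1, 0, 2], step=1.0))   # [6.0, -1.0, 0.0, 3.0]
```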
Joint Coding of Chroma Residuals (JCCR): Instead of transmitting two quantized chroma
residual blocks, the encoder uses JCCR mode to send one residual block using the correlation
between the quantized residual signals. The decoder uses the transmitted joint residual
block to generate the two chroma residual blocks. The JCCR is applicable only if at
least one of the chroma coded block flags (cbfs) is nonzero, and H.266/VVC supports
six modes according to a rotation angle. A TU-level flag indicates the JCCR mode, and
the selected mode is indicated by the chroma cbfs and a sign of the mode.
Fig. 19. State transition and quantizer selection.
Fig. 20. Scalar quantizer Q0 and Q1.
3.6 In-loop Filtering
The block-partitioning and the quantization steps in H.266/VVC may incur undesired
coding artifacts as in the previous video coding standards [3,5]. There are four H.266/VVC in-loop filters: deblocking filter (DF), sample adaptive
offset (SAO), adaptive loop filtering (ALF), and luma mapping with chroma scaling
(LMCS) [26], as shown in Fig. 21. The filters are applied during the picture reconstruction inside encoding and decoding
loops in the order of inverse LMCS, DF, SAO, and ALF. The output pictures are stored
in the decoded picture buffer (DPB). DF and SAO alleviate blocking artifacts and ringing
artifacts as in H.265/HEVC while being adapted to new coding blocks in H.266/VVC.
Compared to H.265/HEVC, ALF and LMCS are new in-loop filters adopted in H.266/VVC.
ALF attempts to reduce the mean square error (MSE) between the original and the reconstructed
samples by determining the filter coefficients based on a Wiener-Hopf equation [33]. The ALF significantly improves the coding performance of H.266/VVC, with approximately
4% BD-rate reduction [34], while its implementation was simplified to keep the computational complexity low.
LMCS adjusts the dynamic range of pixel values in a picture to improve the objective
quality of a reconstructed picture. The DF, SAO, and ALF are used in the original
sample domain after the inverse LMCS because in-loop filters are originally designed
to improve the subjective quality of a reconstructed picture.
H.266/VVC DF and SAO inherit the same design principles as H.265/HEVC with minor modifications
to adapt to the new block partitioning. The following subsections will describe the
details of ALF and LMCS.
Fig. 21. Block diagram of in-loop filtering in the H.266/VVC decoder [26].
3.6.1 Adaptive Loop Filter
ALF is based upon adaptive linear filters that restore the reconstructed picture toward
the original picture by deriving the filter coefficients from a Wiener-Hopf equation.
The output samples of SAO are used as input samples, as shown in Fig. 21. When the filter coefficients are transmitted to a decoder, the derivation process
of filter coefficients can be conducted using RDO to improve the coding performance.
Despite the improved coding performance, the derivation process requires a heavy computational
load for a practical real-time decoder in a consumer device. The filtering mechanisms
were therefore simplified during the H.266/VVC standardization while maintaining the
coding performance.
The ALF uses two 2D finite impulse response (FIR) filters with a 7${\times}$7 diamond
shape and a 5${\times}$5 diamond shape applied to the luma and chroma samples, respectively
[35]. Fig. 22 exhibits the two filter shapes to correct the center samples. The filter shapes and
sizes are determined from extensive experiments with various resolutions of test videos
to consider the trade-off between the coding efficiency and computational complexity
[36,40]. The ALF uses symmetric FIR filters, in which the numbers of coefficients are 13
and 7 for the 7${\times}$7 and 5${\times}$5 filter shapes, respectively, which reduces
the computational complexity [38,39]. A line buffer reduction is also applied to reduce the storage requirements for ALF
[40].
The spatially neighboring samples to the center points are used for deriving the corresponding
filter coefficients $c_{i}$ in Fig. 22. A filtered sample $\overset{˜}{I}\left(x,y\right)$ in the current position is corrected
from the reconstructed sample $I\left(x,y\right)$ using the weighted linear combinations
of $c_{i}$ with a 7-bit fractional precision and the spatially neighboring samples.
Once $c_{i}$ is derived from the Wiener-Hopf equation, the filtering is defined as

$\overset{˜}{I}\left(x,y\right)=I\left(x,y\right)+\left(\left(\sum_{i=0}^{N-2}c_{i}\, r_{i}+64\right)\gg 7\right)$

where $N$ is 13 and 7 for the luma and chroma samples, respectively. $r_{i}$ refers
to a clipped difference between the current sample and a neighboring sample specified in Fig. 22. Specifically, $r_{i}$ is computed as follows:

$r_{i}=\mathrm{Clip3}\left(-b_{i},\, b_{i},\, I\left(x+x_{i},\, y+y_{i}\right)-I\left(x,y\right)\right)$
where $b_{i}$ refers to a clipping parameter determined by a clipping index $d_{i}$
and a sample bit depth $BD$. The clipping parameter $b_{i}$ is set to $2^{BD}$ when
$d_{i}=0.$ Otherwise ($d_{i}$= 1, 2, and 3), it is $2^{BD-1-2{d_{i}}}$. The clipping
index is transmitted to a decoder.
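Putting the two equations together, a plain-Python rendering of the clipped ALF filtering looks as follows (picture-boundary handling omitted; for brevity, offsets lists every neighbor of the diamond with its own coefficient, whereas the standard shares each coefficient between a symmetric pair of positions):

```python
import numpy as np

def alf_filter_sample(img, x, y, coeffs, clips, offsets):
    """Clipped ALF filtering of one sample:
    r_i = Clip3(-b_i, b_i, I(x + dx_i, y + dy_i) - I(x, y)),
    I~(x, y) = I(x, y) + ((sum_i c_i * r_i + 64) >> 7)  with 7-bit coefficients."""
    center = int(img[y, x])
    acc = 0
    for c, b, (dx, dy) in zip(coeffs, clips, offsets):
        r = int(img[y + dy, x + dx]) - center   # difference to a neighbor of the diamond
        r = max(-b, min(b, r))                  # clipping with parameter b_i
        acc += c * r
    return center + ((acc + 64) >> 7)           # rounding and 7-bit downshift

# Toy call on a flat patch: all differences are zero, so the sample is unchanged.
img = np.full((5, 5), 100)
offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
assert alf_filter_sample(img, 2, 2, [10] * 8, [64] * 8, offsets) == 100
```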
The ALF maintains up to 25 sets of filter coefficients for the luma samples, which can
be applied adaptively to each 4${\times}$4 subblock. A 4${\times}$4 subblock is categorized
into one of 25 classes. The classification derives directionality and activity using
local gradients computed with Laplacian filters. Specifically, the classification index
is derived from a combination of five directional properties (texture, strong and weak
horizontal/vertical, and strong and weak diagonal) and five activity levels of a subblock.
As a result, a different filter can be assigned to each class. Furthermore, a geometric
transform, such as a 90-degree rotation or a diagonal or vertical flip, can be applied
to the filter coefficients before the filtering. Various directionalities can be
considered using the geometric transform, and the ALF thus handles more diverse block
characteristics with fewer filter coefficients [41].
In addition to the subblock adaptation, for coding tree block (CTB)-level adaptation,
a set of filter coefficients is either derived online from the current or previously
coded slices or selected from 16 offline-trained filter sets. For chroma samples, H.266/VVC
ALF uses only the CTB-level filter adaptation with up to eight filters [42]. In H.266/VVC, the adaptation parameter set (APS) [19] is used to carry the ALF filter parameters, which include up to 25 and eight sets
of filter coefficients for the luma and chroma components, respectively, together with
the clipping indices. When the same ALF coefficients are used for different slices,
only the ID of a reference APS is signaled instead of a redundant transmission.
Furthermore, the ALF is controlled using on/off flags signaled at the sequence, picture,
slice, and CTB levels. The chroma ALF is enabled only when the luma ALF is enabled at
the corresponding level.
H.266/VVC supports a wider variety of video applications with HDR and WCG, for which
the in-loop filters attempt to improve the visual quality of both luma and chroma
samples. Cross-component adaptive loop filtering (CC-ALF) [43] corrects the chroma samples in parallel with the ALF, using the correlation between
the current chroma samples and the luma samples at the corresponding positions. CC-ALF
applies a linear filtering operation to the luma samples to produce a correction for
the chroma samples. The CC-ALF uses diamond-shaped FIR filters without symmetry
constraints.
Fig. 22. Filter shapes and sizes in VVC ALF: 7${\times}$7 and 5${\times}$5 symmetric diamond-shaped filters for luma and chroma samples, respectively.
3.6.2 Luma Mapping with Chroma Scaling
LMCS has been adopted in H.266/VVC to improve the coding efficiency by processing
the dynamic ranges of input samples rather than improving visual quality directly
[44]. LMCS was originally proposed to improve the coding efficiency of HDR and
WCG PQ video contents, in which most input video samples tend to be distributed within
a relatively narrow range compared with SDR video contents [45]. LMCS supports both HDR and SDR video content in H.266/VVC.
The LMCS consists of a luma mapping (LM) module and a chroma scaling (CS) module.
The LM module maps luma code values from the original sample domain to an LMCS domain
using a forward luma mapping ($FwdMap$), or vice versa using an inverse luma mapping
($InvMap$). In the CS module, a luma-dependent chroma residual scaling is applied
to balance the impact of luma remapping [46].
Fig. 23 depicts the LM and CS processes in the H.266/VVC decoder. The shadowed regions, including
the inverse quantization and transform, the intra prediction, and the luma reconstruction
of an intra prediction, represent decoding processes performed in the LMCS domain [45]. Accordingly, inter and intra prediction signals go through different
procedures with LMCS. For example, a residual signal and an inter prediction signal
are obtained in an LMCS domain and the original domain, respectively. $FwdMap$ is
conducted to convert a domain of a reference signal in a DPB for motion compensation,
and the reconstruction signals are stored in a DPB after $InvMap$.
$FwdMap$ is determined using an adaptive piecewise linear model derived from related
syntax elements of LMCS APS according to a dynamic range of input video samples. Specifically,
the original code values are sampled uniformly into 16 pieces to calculate $OrgCW$.
For each piece $i$, the number of mapped code values is defined as $binCW\left[i\right]$.
To define $FwdMap$, the slope $scaleY\left[i\right]$ is calculated as

$scaleY\left[i\right]=binCW\left[i\right]/OrgCW$

and $invScaleY\left[i\right]$, the slope of $InvMap$, is calculated as the inverse
of $scaleY\left[i\right]$. In a decoder, $scaleY\left[i\right]$ and $invScaleY\left[i\right]$
can be derived because the difference between $OrgCW$ and $binCW\left[i\right]$ is
signaled in the LMCS adaptive parameter set (APS).
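The forward mapping can be sketched directly from these definitions. In the sketch below, $OrgCW=2^{BD}/16$ for bit depth $BD$, and bin_cw plays the role of $binCW\left[i\right]$ recovered from the LMCS APS; real implementations use fixed-point arithmetic instead of the floating-point slope used here.

```python
def build_fwd_map(bin_cw, bit_depth=10):
    """Piecewise-linear FwdMap: 16 equal pieces of OrgCW input codewords, each
    mapped to binCW[i] output codewords with slope scaleY[i] = binCW[i] / OrgCW."""
    org_cw = (1 << bit_depth) // 16
    # Mapped value at the start of each piece (cumulative sum of binCW).
    starts = [0]
    for cw in bin_cw:
        starts.append(starts[-1] + cw)

    def fwd_map(v):
        i = min(v // org_cw, 15)               # piece index of input value v
        scale_y = bin_cw[i] / org_cw           # slope scaleY[i] of piece i
        return starts[i] + scale_y * (v - i * org_cw)

    return fwd_map

# When every piece keeps OrgCW codewords, the mapping is the identity:
fwd = build_fwd_map([64] * 16, bit_depth=10)
assert fwd(512) == 512
```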
For CS, forward and inverse scaling are applied to the chroma residue with a factor
of $ScaleC$ and $invScaleC$, respectively. In H.266/VVC, $invScaleC$ is defined as

$invScaleC\left[i\right]=OrgCW/\left(binCW\left[i\right]+deltaCRS\right)$

where $deltaCRS$ is a chroma scaling offset value. In this manner, as shown in Fig. 23, a scaled chroma residue value is produced from the decoding process, and the chroma
residue value is calculated by multiplying $invScaleC$ [47].
Fig. 23. LMCS in the H.266/VVC decoder [26].
3.6.3 Towards Deep Learning In-loop Filter
Current in-loop filters are designed to perform a pixel-level restoration of a reconstructed
image. Deep convolutional neural networks (CNNs) have attracted considerable attention
from video coding experts because many data-driven approaches to image denoising
have been actively studied in the image processing and computer vision research areas
[48]. During the H.266/VVC standardization, the benefit of deep learning techniques in video
compression has been discussed in AhG9 [48,49], and various CNN-based in-loop filters have been tested extensively in core experiments
[50] to investigate the coding efficiency and computational complexity.
Although a CNN-based in-loop filter has not been adopted in the H.266/VVC specification,
this work points to a future research and development direction for in-loop filters in
hybrid video coding standards. Currently, video coding experts continue to verify the
effectiveness of neural network video coding (NNVC) [51].
3.7 Screen Contents Coding Tools
A screen content coding (SCC) tool can efficiently encode computer-generated video
that exhibits different signal characteristics from the usual video captured by a
camera. Screen content video mainly contains characters, lines, graphs, and patterns.
Thus, it is characterized by sharp edges, uniformly flat areas, repeating patterns,
and highly saturated colors, most of which are rarely found in camera-captured images/videos.
To address these rather unique characteristics, new coding tools were added to
H.265/HEVC through the HEVC range extensions (RExt) and SCC extensions [52]. After much improvement and refinement, the tools adopted in H.266/VVC
are as follows: transform skip residual coding (TSRC), block-based differential pulse-coded
modulation (BDPCM), intra block copy (IBC), adaptive color transform (ACT), and palette
mode coding [53]. TSRC is integrated with the transform skip mode (TSM), which has existed since H.265/HEVC.
IBC, ACT, and the palette mode are inherited from the H.265/HEVC SCC extensions. This
section describes the five main screen content coding tools in H.266/VVC.
3.7.1 Transform Skip Residual Coding
Transform skip residual coding (TSRC) is a CABAC entropy coding scheme designed especially
for transform skip residual blocks. The H.265/HEVC RExt extension had already considered
the statistical differences between regular residual coding (RRC) and transform skip
residuals, but only partially, by introducing a dedicated context model for the
greater-than-zero flags and by rotating intra-predicted transform skip residuals by
180 degrees. TSRC, which is newly introduced in H.266/VVC, directly addresses the differences
by employing the following three main features. First, instead of transmitting the
last significant scan position, it encodes the quantization indices of all scan positions
of a transform block. As shown in Fig. 24, the scanning direction is from top-left to bottom-right, which is the reverse
of the RRC scan.
Second, even though the global distribution of the levels is almost uniform, the
non-stationarity of the symbols makes it possible to code the symbol indicators more
efficiently using context models. Last, the binarization of the absolute level values
is changed by coding more context-coded ``greater than $x$'' flags and by modifying
the Rice parameter derivation for the Golomb-Rice code suffix, resulting in a higher
cutoff for the unary binarization prefix.
Fig. 24. Coefficient group scan for TSRC in H.266/VVC.
3.7.2 Block Differential Pulse Coded Modulation
Block differential pulse coded modulation (BDPCM), which is especially useful for screen
content, is one of the intra prediction modes in H.266/VVC. It aims at better decorrelation
of the intra-predicted residual of screen content by replacing the usual DCT or DST
transform. Whether to use BDPCM is signaled by a flag at the CU level. When BDPCM is
used, an additional flag further signals the prediction direction. The BDPCM predictor
is generated through either horizontal or vertical prediction using unfiltered reference
samples. The residual values are quantized, and the difference signal $\overset{˜}{r}_{i,j}$
is calculated using Eq. (3.7.1) for vertical BDPCM and Eq. (3.7.2) for horizontal BDPCM.
Here, $r_{i,j}$ denotes the intra-predicted residual signal at position $\left(i,j\right)$,
with $0\leq i\leq H-1$ and $0\leq j\leq W-1$, inside a block of size $H\times W$, and
$Q\left(\cdot \right)$ denotes the quantization operation. The difference signal is
transmitted to the decoder using TSRC. In the decoding process, the quantized
residual $Q\left(r_{i,j}\right)$ is reconstructed using Eq. (3.7.3) for vertical BDPCM and Eq. (3.7.4) for horizontal BDPCM.
The dequantized residual $Q^{-1}\left(Q\left(r_{i,j}\right)\right)$ is added to the
intra prediction to generate the reconstructed sample.
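A direct transcription of this scheme is shown below: the encoder transmits row-wise (or column-wise) differences of the already-quantized residual $Q\left(r\right)$, and the decoder undoes them with a running sum, matching the roles of Eqs. (3.7.1)-(3.7.4).

```python
import numpy as np

def bdpcm_encode(q: np.ndarray, vertical: bool) -> np.ndarray:
    """Difference signal r~: the first row/column is kept, then
    Q(r[i,j]) - Q(r[i-1,j]) (vertical) or Q(r[i,j]) - Q(r[i,j-1]) (horizontal)."""
    d = q.astype(np.int64).copy()
    if vertical:
        d[1:, :] -= q[:-1, :]
    else:
        d[:, 1:] -= q[:, :-1]
    return d

def bdpcm_decode(d: np.ndarray, vertical: bool) -> np.ndarray:
    """Reconstruct Q(r[i,j]) as the cumulative sum of the transmitted differences."""
    axis = 0 if vertical else 1
    return np.cumsum(d, axis=axis)

q = np.array([[4, 1], [3, 2]])
assert np.array_equal(bdpcm_decode(bdpcm_encode(q, True), True), q)
```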
3.7.3 Intra Block Copy
Intra block copy (IBC) finds a reference block similar to a current block inside a
designated area of the current frame. The reference block works as an intra-predictor
of IBC, and its location is represented by a block vector (BV), which specifies its
displacement from the current block. The difference of the BV from its block vector
predictor (BVP), denoted as the block vector difference (BVD), is encoded using either
the merge mode or the AMVP mode. The merge mode is the most efficient coding tool for
the BV because it only sends the merge index, with the BVD treated as zero (thus, not
encoded). The merge index indicates one BVP in a list of six merge candidates. The
merge candidate list is filled, until it has six members, in the order of the BVs of
spatially adjacent blocks (bottom-left and top-right), history-based BVs, and zero
vectors. The AMVP mode transmits the AMVP flag and the BVD, where the AMVP flag
indicates one selected BVP in an AMVP candidate list of size two.
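The construction order described above can be sketched as a simple list-building routine. The function below is a hypothetical illustration: spatial_bvs holds the BVs of the two adjacent neighbors (when available), history_bvs is the history-based table, and the pruning is simplified to exact-duplicate removal.

```python
def build_ibc_merge_list(spatial_bvs, history_bvs, size=6):
    """Fill the IBC merge candidate list: spatially adjacent BVs first,
    then history-based BVs, then zero vectors, skipping duplicates."""
    candidates = []
    for bv in list(spatial_bvs) + list(history_bvs):
        if bv is not None and bv not in candidates:
            candidates.append(bv)
        if len(candidates) == size:
            return candidates
    while len(candidates) < size:       # pad the remaining slots with zero vectors
        candidates.append((0, 0))
    return candidates

# Example: one spatial BV available, two history entries (one is a duplicate).
print(build_ibc_merge_list([(-16, 0), None], [(-16, 0), (-32, -8)]))
```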
Because the IBC is developed as an intra prediction coding tool specialized for screen
content, its encoder only searches the reference blocks at integer positions. Noting
that the screen content is generated pixel-by-pixel by a computer, no consideration
is given to representing the BV at sub-pel resolution. The AMVR allows the BVD to be
encoded at 1-pel or 4-pel resolution. It is nevertheless interesting to investigate
the effectiveness of sub-pel BV resolution by further estimating the BV at half-pel
and quarter-pel resolutions. Experimental results for screen content video [54] confirm that, compared with camera-captured video, half-pel and quarter-pel BVs
are chosen less often in actual encoding even when they are available, although not
negligibly so; the outcome depends on how the screen content is rendered, as shown in Fig. 25. Screen content with many characters, lines, and graphics, such as `SlideEditing'
(this type can be referred to as legacy screen content), chooses very few sub-pel BVs,
so permitting the additional BV resolutions in encoding results in coding loss [53]. On the other hand, the more natural-looking computer-generated video produced by
sophisticated rendering processes, which is common in recent screen content, chooses
sub-pel BVs non-trivially, as shown in Fig. 25 [54].
Fig. 25. Average ratio of BV resolution in camera-captured and screen video contents[54].
3.7.4 Adaptive Color Transform
Adaptive color transform (ACT) is a particularly effective technique for video sequences
expressed in the RGB color space because it can effectively reduce the correlation
among the three color components in 4:4:4 chroma format. The ACT technique in H.266/VVC
has been carried over from the H.265/HEVC SCC extensions. It selectively transforms the residual signal in
the input color space (RGB) into the YCgCo-R luma-chroma color representation according
to a CU-level flag. The maximum ACT size cannot exceed 32${\times}$32 samples to ease
the cache requirement of temporarily storing all three transform blocks. The YCgCo-R
transform is fully reversible and can be applied to lossless coding.
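The YCgCo-R transform behind ACT is a lifting-based, integer-reversible variant of YCgCo, which is what makes lossless coding possible; the forward and inverse steps below invert each other exactly.

```python
def rgb_to_ycgco_r(r: int, g: int, b: int):
    """Forward YCgCo-R (lifting steps, exactly reversible in integer arithmetic)."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, cg, co

def ycgco_r_to_rgb(y: int, cg: int, co: int):
    """Inverse YCgCo-R: undo the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = co + b
    return r, g, b

assert ycgco_r_to_rgb(*rgb_to_ycgco_r(90, 200, 30)) == (90, 200, 30)
```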
3.7.5 Palette Mode
Palette mode, one of the intra prediction modes, supports all chroma formats. Considering
the relatively insignificant coding gain in small blocks and the complexity of the
palette mode, it is not applied unless a CU has more than 16 samples. Each sample in
a CU coded in palette mode signals a palette index into a representative color set
called a palette; otherwise, an escape symbol is signaled. If a sample is coded using
the escape symbol, its quantized component values are signaled. The palette
is defined separately for the luma (Y component) and chroma (Cb and Cr components)
for slices in a dual tree. In this case, the entries in the luma palette hold just
Y values and the chroma palette entries include Cb and Cr values. For the single tree
slices, a palette entry contains Y, Cb, and Cr values, and the palette is applied
to Y, Cb, and Cr components jointly. For the single tree, the maximum palette predictor
size is 63, and the maximum palette table size for a CU is 31. For the dual tree,
the maximum predictor size and table size are 31 and 15, respectively.
4. Performance Evaluation of H.266/VVC
The various coding techniques of H.266/VVC described previously are more complex
than those included in the H.265/HEVC standard. Efficient encoding search algorithms
that can reduce the encoding time while maintaining compression efficiency are needed
to implement a real-time H.266/VVC encoder; in particular, early termination strategies
should be developed considering the coding efficiency and selection ratio of each
technique. Therefore, this section evaluates the selection ratio and coding efficiency of
the coding techniques for each module of H.266/VVC. First, the selection
ratio of each coding technique is analyzed using the bitstreams generated by the
VTM12.0 encoder under the JVET common test condition (CTC) [55]. The selection ratio for each tool is normalized to 4${\times}$4 units per CU or TU
to indicate the ratio of the area in which each coding tool is selected over the entire
sequence. In addition, experiments are conducted using the VTM12.0 encoder and
decoder under the JVET CTC to evaluate the coding performance according to the block
partitioning structure, the CTU and TU sizes, and the group of pictures (GOP) size. All
the experiments are conducted in a cluster environment consisting of Intel®
i7-10700K 3.8 GHz processors and 64 GB of RAM. The class F sequences, which use the
screen content tools, are excluded from the selection ratio analysis so that the
analysis covers techniques commonly used for the different content types.
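The normalization used for the selection ratios can be illustrated with a small tally: for each tool, count the 4${\times}$4 units covered by the CUs or TUs that selected it and divide by the total number of 4${\times}$4 units. The routine below is a hypothetical stand-in for the actual instrumentation in the VTM decoder.

```python
from collections import defaultdict

def tool_selection_ratios(blocks, pic_area_4x4):
    """blocks: iterable of (tool_name, width, height), one entry per coded CU/TU.
    Returns the per-tool ratio of covered area, normalized to 4x4 units."""
    units = defaultdict(int)
    for tool, w, h in blocks:
        units[tool] += (w // 4) * (h // 4)     # 4x4 units covered by this block
    return {tool: n / pic_area_4x4 for tool, n in units.items()}

# Toy example: one 64x64 intra CU and three 64x64 inter CUs in a 128x128 picture.
ratios = tool_selection_ratios(
    [("intra", 64, 64), ("inter", 64, 64), ("inter", 64, 64), ("inter", 64, 64)],
    pic_area_4x4=(128 // 4) * (128 // 4),
)
print(ratios)   # {'intra': 0.25, 'inter': 0.75}
```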
4.1 Prediction Modes
This study first examines the overall selection ratio of intra and inter prediction
before analyzing each prediction tool according to encoding configurations. Table 6 shows the selection ratios of intra and inter prediction in all classes of JVET CTC.
In the LB and LP configurations, where all frames except the first consist of B-frames
and P-frames, respectively, the intra-predicted region ratios are 2.77% and 3.07%. In
the RA configuration, where I-frames are inserted at each random access point (RAP),
the intra-predicted region ratio is 13.35%. In particular, the proportion of
intra-predicted regions in class A1 is about 27% in the RA configuration. This ratio
is higher than for the other sequences because the class A1 sequences contain many
complex textures; the other sequences show intra-predicted region proportions of less
than 10%. Accordingly, when designing an H.266/VVC encoder, the encoding complexity
can be expected to be reduced by adjusting the balance between rate-distortion
optimization (RDO) and the early termination conditions of the intra and inter
prediction based on the target video for the application.
Table 6. Ratios of intra and inter prediction.

| Config. | Intra prediction | Inter prediction |
|---|---|---|
| LB | 2.77% | 97.23% |
| LP | 3.07% | 96.93% |
| RA | 13.35% | 86.65% |
4.2 Intra Prediction Tools
Tables 7 and 8 show the selection ratios of the intra prediction tools for the luma
and chroma components, respectively. On the AI and RA configurations, the intra
prediction tools selected per region are similar for each class. In particular, the
planar, angular, and MIP modes are selected the most in the luma component. The angular
prediction modes are selected in more than 30% of the entire intra-predicted region.
Specifically, the proportion of the horizontal modes among the angular-predicted regions
is approximately 30% and 60% on the AI and RA configurations, respectively. For the MIP
mode, the matrices of each case, as defined in Table 3, are selected almost equally for each sequence, with averages of 6.25%, 12.5%,
and 16.67% on both configurations. In the MRL tool, the selection ratio of the farther
non-adjacent reference line (Line2) is up to double that of the nearer one (Line1). In
addition, the error correction through PDPC is applied to more than 70% of the entire
intra-predicted region, and the prediction mode is signaled through the MPM list in
70% or more of the region.
In the chroma component, the proportions of the derived mode (DM) and the CCLM modes
are the highest, and the selection ratios of the non-angular (DC, planar) and angular
prediction modes are similar, except for class A1 in the case of the DM mode. In
addition, the selection ratios of CCLM and PDPC are more than 40% and over 85%,
respectively.
In addition, the MIP mode is turned off on the LB and LP configurations according
to the JVET CTC. As a result, the selection ratios of the planar and angular prediction
modes are larger than on the AI and RA configurations. The angular prediction modes are
selected for more than 40% of the entire intra-predicted region. Moreover, the error
correction through PDPC covers more than 70% of the entire intra-predicted region, and
the prediction mode is signaled based on the MPM list in more than 68% of the region.
In the chroma components, the DM and CCLM modes have the highest proportions, and
together they account for more than 80% of the region, indicating that they are selected
as the prediction mode for most regions. In addition, PDPC is applied at rates higher
than 70% on the LB and LP configurations, as on the AI and RA configurations.
For all configurations in the JVET CTC, setting early termination conditions that
consider these ratios should reduce the encoding complexity, because the MPM-list-based
prediction for the luma component, the CCLM modes for the chroma components, and the
PDPC correction for all components apply to most intra-predicted regions.
Table 7. Ratios of intra prediction tools for the luma component.

| Config. | DC | Planar | Ang. (Reg.) | Ang. (Wide) | MIP Case1 | MIP Case2 | MIP Case3 | ISP Hor. 1/2 | ISP Hor. 1/4 | ISP Hor. 2/4 | ISP Hor. 4/4 | ISP Ver. 1/2 | ISP Ver. 1/4 | ISP Ver. 2/4 | ISP Ver. 4/4 | PDPC | MRL Line1 | MRL Line2 | MPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI | 5.02% | 23.72% | 29.18% | 1.45% | 0.10% | 2.38% | 23.13% | 0.48% | 7.02% | 3.24% | 1.33% | 0.38% | 0.66% | 0.93% | 0.97% | 77.82% | 1.95% | 4.46% | 73.72% |
| LB | 7.03% | 35.22% | 44.28% | 2.98% | 0.00% | 0.00% | 0.00% | 0.24% | 0.69% | 2.39% | 1.44% | 0.21% | 0.98% | 2.74% | 1.80% | 70.54% | 1.42% | 1.89% | 68.13% |
| LP | 7.13% | 34.51% | 44.88% | 3.07% | 0.00% | 0.00% | 0.00% | 0.23% | 0.68% | 2.42% | 1.40% | 0.19% | 0.99% | 2.77% | 1.73% | 70.16% | 1.48% | 1.94% | 67.53% |
| RA | 2.94% | 19.24% | 36.87% | 1.41% | 0.03% | 1.45% | 17.64% | 0.29% | 10.70% | 3.99% | 3.07% | 0.23% | 0.54% | 0.71% | 0.89% | 70.62% | 6.25% | 13.09% | 79.32% |
Table 8. Ratios of intra prediction tools for the chroma component.

| Config. | DC | Planar | Hor. | Ver. | DM (Non-Ang.) | DM (Ang. Reg.) | DM (Ang. Wide) | CCLM (T) | CCLM (L) | CCLM (LT) | PDPC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AI | 4.42% | 3.26% | 3.41% | 3.34% | 16.36% | 23.65% | 0.65% | 9.94% | 9.69% | 25.28% | 86.34% |
| LB | 1.28% | 1.74% | 0.90% | 1.48% | 32.18% | 33.22% | 2.23% | 5.65% | 4.95% | 16.38% | 70.61% |
| LP | 1.24% | 1.68% | 0.87% | 1.43% | 32.03% | 33.91% | 2.32% | 5.59% | 4.84% | 16.09% | 70.00% |
| RA | 1.68% | 1.68% | 5.68% | 1.14% | 12.36% | 39.14% | 1.29% | 8.05% | 7.42% | 21.57% | 68.60% |
Table 9. Ratios of the reference types and motion information coding modes.

| Config. | Uni | Bi | Uni Merge | Uni Merge/skip | Uni AMVP | Bi Merge | Bi Merge/skip | Bi AMVP |
|---|---|---|---|---|---|---|---|---|
| LB | 25.22% | 74.78% | 6.67% | 12.82% | 5.73% | 18.16% | 48.42% | 8.20% |
| LP | 100% | 0.00% | 25.91% | 64.01% | 10.09% | 0.00% | 0.00% | 0.00% |
| RA | 19.47% | 80.53% | 4.66% | 9.10% | 5.71% | 13.99% | 56.98% | 9.56% |
Table 10. Ratios of the inter prediction tools in the uni-directional prediction (M/S = merge/skip).

| Uni | Merge Regular | Merge MMVD | Merge CIIP | Merge SbTMVP | Merge Aff.4 | Merge Aff.6 | M/S Regular | M/S MMVD | M/S SbTMVP | M/S Aff.4 | M/S Aff.6 | AMVP Regular | AMVP Aff.4 | AMVP Aff.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LB | 10.44% | 6.53% | 2.85% | 0.00% | 1.53% | 5.10% | 30.07% | 7.84% | 0.00% | 3.68% | 9.25% | 10.46% | 5.58% | 6.68% |
| LP | 11.84% | 4.41% | 2.93% | 2.58% | 1.04% | 3.11% | 39.75% | 5.61% | 11.28% | 2.35% | 5.02% | 4.44% | 2.48% | 3.17% |
| RA | 11.15% | 5.50% | 2.54% | 0.86% | 1.06% | 2.83% | 28.22% | 4.20% | 8.78% | 1.94% | 3.59% | 16.14% | 4.55% | 8.63% |
Table 11. Ratios of the inter prediction tools in the bi-directional prediction (M/S = merge/skip).

| Bi | Merge Regular | Merge MMVD | Merge GPM | Merge CIIP | Merge SbTMVP | Merge Aff.4 | Merge Aff.6 | M/S Regular | M/S MMVD | M/S GPM | M/S SbTMVP | M/S Aff.4 | M/S Aff.6 | AMVP Regular | AMVP Aff.4 | AMVP Aff.6 | AMVP SMVD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LB | 10.46% | 3.73% | 3.09% | 1.63% | 3.02% | 0.77% | 1.58% | 38.63% | 5.07% | 3.33% | 13.69% | 1.62% | 2.41% | 6.07% | 2.50% | 2.40% | 0.00% |
| RA | 9.93% | 3.50% | 0.85% | 0.62% | 0.65% | 0.54% | 1.28% | 56.32% | 3.99% | 1.03% | 7.03% | 0.75% | 1.35% | 3.45% | 1.26% | 2.10% | 5.07% |
4.3 Inter Prediction Tools
In order to analyze the selection ratio of each inter prediction tool according to the
reference type and the motion information coding method (merge, merge/skip, and AMVP)
in the inter-predicted regions, the proportions of uni-/bi-directionally predicted
regions and the selection ratios of the motion information coding methods are examined
first, as shown in Table 9. In the LB and RA configurations, where both uni-directional and bi-directional prediction
are available, the uni-directional prediction modes are applied to 25.22% and 19.47%
of the regions, respectively. In addition, the merge/skip mode is selected most often
for motion coding under all conditions. Tables 10 and 11 present the proportion of each inter prediction tool according to the motion coding
method.
Table 12. Ratios of BCW weights.

| Config. | Mode | w = -2 | w = 3 | w = 4 | w = 5 | w = 10 |
|---|---|---|---|---|---|---|
| LB | Merge | 0.05% | 2.81% | 85.96% | 9.47% | 1.71% |
| LB | Merge/skip | 0.04% | 1.53% | 89.17% | 7.39% | 1.88% |
| LB | AMVP | 0.29% | 10.15% | 57.43% | 26.95% | 5.18% |
| RA | Merge | 0.02% | 10.53% | 77.91% | 11.49% | 0.05% |
| RA | Merge/skip | 0.00% | 4.20% | 91.31% | 4.47% | 0.02% |
| RA | AMVP | 0.07% | 23.45% | 51.31% | 25.10% | 0.08% |
Table 13. Ratios of BDOF and DMVR.

| Mode | BDOF | DMVR |
|---|---|---|
| Merge | 36.15% | 40.02% |
| Merge/skip | 10.56% | 72.11% |
| AMVP | 68.69% | 0.00% |
Table 14. Ratios of PROF for the uni-directional AMC-predicted CU.

| Config. | Mode | Affine | L0 | L1 |
|---|---|---|---|---|
| LB | Merge | 4-parameter | 94.42% | 85.64% |
| LB | Merge | 6-parameter | 98.21% | 93.98% |
| LB | Merge/skip | 4-parameter | 96.02% | 87.05% |
| LB | Merge/skip | 6-parameter | 98.23% | 90.05% |
| LB | AMVP | 4-parameter | 97.72% | 0.00% |
| LB | AMVP | 6-parameter | 99.97% | 0.00% |
| LP | Merge | 4-parameter | 80.95% | 0.00% |
| LP | Merge | 6-parameter | 94.13% | 0.00% |
| LP | Merge/skip | 4-parameter | 80.16% | 0.00% |
| LP | Merge/skip | 6-parameter | 94.02% | 0.00% |
| LP | AMVP | 4-parameter | 96.09% | 0.00% |
| LP | AMVP | 6-parameter | 99.34% | 0.00% |
| RA | Merge | 4-parameter | 41.21% | 24.85% |
| RA | Merge | 6-parameter | 67.64% | 50.21% |
| RA | Merge/skip | 4-parameter | 23.04% | 13.53% |
| RA | Merge/skip | 6-parameter | 41.54% | 30.54% |
| RA | AMVP | 4-parameter | 28.23% | 18.47% |
| RA | AMVP | 6-parameter | 53.79% | 40.86% |
Table 15. Ratios of PROF for the bi-directional AMC-predicted CU.

| Config. | Mode | Affine | Only L0 | Only L1 | Both |
|---|---|---|---|---|---|
| LB | Merge | 4-parameter | 2.75% | 22.68% | 56.64% |
| LB | Merge | 6-parameter | 0.80% | 8.49% | 88.19% |
| LB | Merge/skip | 4-parameter | 2.90% | 24.90% | 51.51% |
| LB | Merge/skip | 6-parameter | 1.02% | 11.93% | 83.46% |
| LB | AMVP | 4-parameter | 3.33% | 52.25% | 40.07% |
| LB | AMVP | 6-parameter | 0.56% | 41.56% | 57.72% |
| RA | Merge | 4-parameter | 2.88% | 3.12% | 15.85% |
| RA | Merge | 6-parameter | 4.84% | 3.95% | 44.80% |
| RA | Merge/skip | 4-parameter | 2.04% | 1.65% | 9.50% |
| RA | Merge/skip | 6-parameter | 4.01% | 2.91% | 34.12% |
| RA | AMVP | 4-parameter | 4.15% | 3.77% | 14.73% |
| RA | AMVP | 6-parameter | 5.65% | 3.59% | 46.52% |
The most frequently selected inter prediction tool is the regular merge/skip mode,
chosen for up to 56.32% of the regions, which performs whole-block motion compensation
using the coded motion information without any refinement tool. Among the inter
prediction tools, the AMC (Aff. 4 and Aff. 6) and GPM modes show the highest coding
efficiency [34], but they have lower selection ratios under all conditions. This suggests that these
tools obtain a coding efficiency that is high relative to their selection proportion,
because a more accurate prediction is performed in the regions where the AMC and GPM
modes are applied.
Table 12 shows the ratios of the signaled weight ($w$) when bi-directional prediction is
performed using Eq. (3.2.2) with BCW. The equal-weight case, in which $P_{L0}$ and $P_{L1}$ are weighted identically,
has the highest selection ratio, up to 91.31%, while the weights -2 and 10, for which
the weights of $P_{L0}$ and $P_{L1}$ differ the most, have the lowest selection ratios.
On the other hand, the VTM12.0 encoder performs RDO in ascending order of the index
values without considering these selection ratios. Therefore, if the selection
proportions are considered in future H.266/VVC encoder designs, the encoding complexity
can be reduced while maintaining the coding performance.
Tables 13-15 show the selection ratios of the following: BDOF, a tool for performing
motion refinement on the decoder side based on optical flow; PROF, which performs
sample-wise refinement within an AMC block based on optical flow; and DMVR, which
performs motion refinement through bilateral matching. BDOF and DMVR can only be
applied in the RA configuration. The proportion of motion refinement performed through
the BDOF tool is 10.56% in merge/skip mode and 68.69% in AMVP mode, as shown in Table 13, a significant difference depending on the motion coding method.
The DMVR tool shows a high selection ratio of 72.11% in merge/skip mode, which skips
the transform through the skip flag. This result shows that even when DMVR is applied,
the CU can be reconstructed close to the original signal without a signaled MVD,
because more accurate predictors are generated through the motion refinement. Tables 14 and 15 list the proportions of PROF for AMC-predicted CUs according to the reference
type. For uni-directional prediction in the LB and LP configurations, PROF is applied
to most AMC-predicted CUs; in particular, up to 99.97% of the AMC-predicted CUs in
AMVP mode use PROF. Furthermore, for bi-directional prediction in the LB and RA
configurations, PROF is applied mainly in both the L0 and L1 directions rather than
in only one of them.
Table 16. Ratios of the AMVR index for non-AMC-predicted CUs.

| AMVR (non-affine) | 1/4-pel | 1-pel | 4-pel | Half-pel |
|---|---|---|---|---|
| LB | 76.50% | 13.25% | 3.53% | 6.72% |
| LP | 69.09% | 17.69% | 3.03% | 10.19% |
| RA | 49.33% | 31.31% | 11.27% | 8.08% |
Table 17. Ratios of the AMVR index for AMC-predicted CUs.

| AMVR (affine) | 1/4-pel | 1/16-pel | 1-pel |
|---|---|---|---|
| RA | 59.85% | 8.14% | 32.01% |
Table 18. Ratios of the transform tools of the intra-predicted CUs for the luma component.

| Config. | Transform skip | MTS Implicit | MTS Explicit | Hor. DCT2 | Hor. DCT8 | Hor. DST7 | Ver. DCT2 | Ver. DCT8 | Ver. DST7 | LFNST |
|---|---|---|---|---|---|---|---|---|---|---|
| AI | 24.02% | 33.83% | 66.17% | 16.18% | 0.0012% | 83.82% | 21.28% | 42.27% | 36.45% | 18.31% |
| LB | 17.50% | 14.23% | 85.77% | 2.85% | 4.66% | 92.50% | 5.03% | 29.77% | 65.20% | 0.00% |
| LP | 17.99% | 14.33% | 85.67% | 2.87% | 4.69% | 92.43% | 5.00% | 29.59% | 65.40% | 0.00% |
| RA | 27.13% | 44.08% | 55.92% | 31.06% | 0.0011% | 68.94% | 31.89% | 37.61% | 30.50% | 15.79% |
Table 19. Ratios of the transform tools of the inter-predicted CUs for the luma component.

| Config. | Transform skip | SBT on | Half Ver. Left | Half Ver. Right | Half Hor. Top | Half Hor. Down | Quad Ver. Left | Quad Ver. Right | Quad Hor. Top | Quad Hor. Down | Hor. DCT2 | Hor. DCT8 | Hor. DST7 | Ver. DCT2 | Ver. DCT8 | Ver. DST7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LB | 70.96% | 3.22% | 29.12% | 13.67% | 25.45% | 11.96% | 7.18% | 3.27% | 6.55% | 2.82% | 94.04% | 1.64% | 4.32% | 94.04% | 1.44% | 4.51% |
| LP | 71.47% | 2.99% | 29.13% | 13.58% | 25.74% | 11.91% | 7.09% | 3.24% | 6.56% | 2.76% | 94.31% | 1.56% | 4.13% | 94.31% | 1.39% | 4.30% |
| RA | 80.64% | 1.73% | 28.12% | 12.62% | 26.97% | 11.77% | 7.13% | 3.23% | 7.12% | 3.04% | 95.41% | 1.28% | 3.31% | 95.41% | 1.16% | 3.43% |
Table 20. Ratios of transform tools of the intra-predicted CUs for the chroma components.

| Config. | Transform skip | LFNST |
|---|---|---|
| AI | 58.96% | 20.55% |
| LB | 87.46% | 0.00% |
| LP | 87.85% | 0.00% |
| RA | 67.28% | 2.42% |
Table 21. Ratios of transform tools of the inter-predicted CUs for the chroma components.

| Config. | Transform skip |
|---|---|
| LB | 96.91% |
| LP | 96.96% |
| RA | 91.07% |
Tables 16 and 17 show the selection ratios of the MVD precisions signaled through the
AMVR index in AMVP mode. According to the JVET CTC, sps_affine_amvr_enabled_flag is set
to 0 on the LB and LP configurations, so the MVD precision is fixed to 1/4-pel for
AMC-predicted CUs there. In the RA configuration, 1/4-pel is the most frequently
selected precision for AMC-predicted CUs, with 59.85%, followed by 1-pel with 32.01%.
For non-AMC-predicted CUs, 1/4-pel is the most selected on every configuration; the
remaining precisions follow in the order of 1-pel, half-pel, and 4-pel on the LB and
LP configurations, and 1-pel, 4-pel, and half-pel on the RA configuration.
4.4 Transform Tools
Tables 18-21 show the selection ratios of the transform tools in the luma and chroma
components for each region. Specifically, Tables 18 and 19 show the proportions of the transform tools for the residual signals
of the intra- and inter-predicted CUs of the luma component, and Tables 20 and 21 show the selection ratios of the transform tools for the residual signals of the
intra- and inter-predicted CUs of the chroma components, respectively.
Table 22. Ratios of SAO.

| Config. | Comp. | New | Merge | Off | EO (0°) | EO (90°) | EO (135°) | EO (45°) | BO |
|---|---|---|---|---|---|---|---|---|---|
| AI | Luma | 16.68% | 80.08% | 3.24% | 66.63% | 19.80% | 3.48% | 4.09% | 2.77% |
| AI | Chroma | 7.56% | 80.08% | 12.36% | 64.08% | 18.04% | 1.45% | 1.49% | 2.58% |
| LB | Luma | 6.57% | 22.04% | 71.39% | 18.26% | 7.13% | 1.08% | 1.06% | 1.08% |
| LB | Chroma | 0.96% | 6.66% | 92.39% | 5.14% | 1.69% | 0.21% | 0.23% | 0.35% |
| LP | Luma | 8.02% | 26.04% | 65.94% | 21.68% | 8.60% | 1.40% | 1.51% | 0.86% |
| LP | Chroma | 1.10% | 7.34% | 91.56% | 5.61% | 1.96% | 0.25% | 0.27% | 0.36% |
| RA | Luma | 3.78% | 26.65% | 69.57% | 23.37% | 4.81% | 0.62% | 0.70% | 0.93% |
| RA | Chroma | 3.34% | 23.46% | 73.20% | 20.48% | 4.29% | 0.54% | 0.62% | 0.87% |
Table 23. Ratios of ALF and CC-ALF for the chroma components.

| Config. | Comp. | ALF Off | ALF idx1 | ALF idx2 | ALF idx3 | ALF idx4 | ALF idx5 | ALF idx6 | ALF idx7 | CC-ALF Off | CC-ALF idx1 | CC-ALF idx2 | CC-ALF idx3 | CC-ALF idx4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI | Cb | 59.30% | 9.59% | 5.98% | 4.39% | 3.65% | 4.15% | 5.02% | 7.93% | 38.26% | 1.31% | 0.98% | 2.05% | 57.41% |
| AI | Cr | 59.37% | 11.87% | 6.66% | 4.75% | 4.31% | 3.86% | 3.90% | 5.28% | 36.31% | 0.00% | 0.02% | 1.09% | 62.58% |
| LB | Cb | 89.67% | 4.84% | 2.16% | 1.20% | 0.84% | 0.56% | 0.42% | 0.30% | 50.62% | 1.51% | 1.69% | 3.49% | 42.68% |
| LB | Cr | 90.50% | 4.53% | 2.01% | 1.08% | 0.74% | 0.51% | 0.37% | 0.27% | 61.99% | 0.70% | 1.13% | 3.87% | 32.30% |
| LP | Cb | 89.19% | 4.85% | 2.31% | 1.30% | 0.84% | 0.63% | 0.46% | 0.41% | 50.32% | 1.44% | 1.24% | 3.12% | 43.88% |
| LP | Cr | 90.11% | 4.62% | 2.01% | 1.18% | 0.72% | 0.56% | 0.42% | 0.38% | 60.39% | 1.13% | 1.39% | 3.88% | 33.22% |
| RA | Cb | 77.65% | 5.59% | 3.43% | 2.60% | 2.12% | 2.20% | 2.63% | 3.79% | 58.38% | 2.65% | 1.36% | 2.96% | 34.64% |
| RA | Cr | 78.78% | 5.86% | 3.12% | 2.33% | 2.02% | 1.94% | 2.39% | 3.57% | 54.18% | 2.05% | 1.75% | 3.33% | 38.70% |
In the case of intra-predicted CUs, both the luma and chroma components show similar
ratios on the AI and RA configurations and on the LB and LP configurations, respectively.
For the luma component, DST-VII is the most selected horizontal kernel in all
configurations. For the vertical kernel, DCT-VIII is the most selected on the AI and RA
configurations, and DST-VII on the LB and LP configurations. Moreover, determining the
transform kernel by explicit index signaling is selected less often than deriving it
implicitly in all cases. In particular, DCT-VIII is rarely selected as the horizontal
kernel on the AI and RA configurations, so the coding efficiency is not expected to
suffer even if it is skipped in the encoding process. LFNST, which performs the
secondary transform on the primary transform coefficients of the residual signals of
intra-predicted CUs, is applied only on the AI and RA configurations according to the
JVET CTC, where it covers 18.31% and 15.79% of the regions, respectively.
In the chroma components, the transform is skipped in most regions on the LB and LP
configurations, unlike the luma component. Furthermore, LFNST is applied to 20.55% and
2.42% of the regions on the AI and RA configurations, respectively. Considering the
selection ratios of intra and inter prediction in Table 6, LFNST is a technique that applies to only approximately 2% of the luma component
in the RA environment, yet it provides a coding gain of approximately 0.7%
[34].
In contrast to intra-predicted CUs, the DST-VII and DCT-VIII kernels can be applied to
inter-predicted CUs only when SBT is used, and the vertical and horizontal transform
kernels of non-SBT TUs use only the DCT-II according to the JVET CTC. As shown in
Tables 19 and 21, the transform is skipped for most regions in both color components; more than 90%
is skipped for the chroma components in particular. Therefore, the encoding complexity
would be improved if a strong early termination condition were added for the chroma
components in an H.266/VVC encoder. In addition, SBT is applied only in small regions,
at 3.22%, 2.99%, and 1.73% on the LB, LP, and RA configurations, respectively. More
specifically, when only a 1/2 or 1/4 region of the TU is transformed through horizontal
or vertical division, transforming only the left or the top subblock is selected most
often.
4.5 In-loop Filtering Tools
Table 22 presents the selection ratios of SAO among the in-loop filtering tools for the
luma and chroma components. SAO is applied to almost all blocks of the luma component
on the AI configuration, but it is not applied to at least 65% of the luma regions on
the LB, LP, and RA configurations. The chroma components show a similar tendency, with
an even larger proportion of regions where SAO is not applied than for the luma
component. Among the SAO categories, the horizontal edge offset (EO 0°) is applied the
most, especially within intra-predicted regions.
Tables 23 and 24 show the proportions of ALF and CC-ALF in the chroma and luma components,
respectively; these tools achieve the highest coding efficiency among the newly adopted
tools of H.266/VVC
[34]. For the luma component, ALF is applied to almost all regions on the AI configuration,
and the most selected filter set is the 16$^{\mathrm{th}}$ index filter set among the
pre-trained filter sets, with 91.96%. In addition, the selection ratio of ALF for the
chroma components is not large except on the AI configuration. On the other hand,
CC-ALF is applied to more than half of the entire region, and the 4$^{\mathrm{th}}$
index filter is selected the most. Therefore, the encoding complexity could be reduced
without coding performance loss if the search order in the RDO process of an H.266/VVC
encoder were changed considering these selection ratios.
Table 24. Ratios of ALF for the luma component.

| Config. | Off | Offline trained set (1~16) | Others (17~26) |
|---|---|---|---|
| AI | 0.75% | 99.25% | 0.00% |
| LB | 24.50% | 38.75% | 36.76% |
| LP | 22.56% | 39.29% | 38.16% |
| RA | 15.51% | 53.10% | 31.39% |
Table 25. Experimental results of setting the CTU size to 64${\times}$64 (Anchor: 128${\times}$128).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.40% | 0.67% | 1.39% | 97% | 102% | 2.25% | 2.29% | 3.51% | 84% | 101% |
| A2 | 0.63% | 0.94% | 0.38% | 96% | 102% | 3.50% | 3.78% | 4.78% | 82% | 100% |
| B | 0.39% | 0.14% | 0.93% | 97% | 100% | 1.78% | 3.69% | 4.61% | 87% | 102% |
| C | 0.30% | 0.34% | 0.72% | 97% | 99% | 0.32% | 1.61% | 2.18% | 95% | 101% |
| E | 0.40% | -0.25% | 2.30% | 97% | 100% | - | - | - | - | - |
| All | 0.41% | 0.34% | 1.09% | 97% | 101% | 1.83% | 2.87% | 3.77% | 87% | 101% |
| D | 0.12% | -0.07% | 0.46% | 97% | 100% | 0.16% | 0.61% | 0.97% | 99% | 100% |
| F | -1.17% | -1.45% | -1.06% | 85% | 101% | 0.46% | 0.68% | 1.48% | 84% | 105% |
Table 26. Experimental results of setting the CTU size to 32${\times}$32 (Anchor: 128${\times}$128).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 1.83% | 6.39% | 9.58% | 86% | 129% | 14.00% | 17.84% | 23.05% | 43% | 129% |
| A2 | 2.43% | 12.64% | 5.94% | 89% | 126% | 16.15% | 21.90% | 20.98% | 44% | 126% |
| B | 1.59% | 7.78% | 11.71% | 93% | 121% | 9.75% | 18.58% | 20.07% | 48% | 130% |
| C | 0.85% | 3.03% | 3.95% | 94% | 111% | 3.37% | 6.87% | 8.01% | 59% | 120% |
| E | 2.33% | 10.39% | 14.32% | 93% | 125% | - | - | - | - | - |
| All | 1.73% | 7.73% | 9.10% | 91% | 122% | 10.18% | 15.97% | 17.63% | 49% | 126% |
| D | 0.36% | 1.16% | 1.76% | 93% | 112% | 2.02% | 3.96% | 4.99% | 73% | 116% |
| F | 0.10% | 2.34% | 4.53% | 79% | 120% | 6.56% | 9.96% | 10.97% | 59% | 146% |
4.6 Block Structure and Partitioning Schemes
This subsection reports the coding efficiency according to the CTU size and the
results of tool-off experiments on the BT, TT, and dual tree (DT), which are newly
adopted partitioning schemes compared to H.265/HEVC.
Tables 25 and 26 show the performance when the CTU size is set to 64${\times}$64 and 32${\times}$32,
respectively, compared to the maximum CTU size of H.266/VVC (128${\times}$128).
Coding loss grows gradually as the CTU size is set smaller. This tendency appears
mainly in classes A1 and A2, which have UHD 4K resolution. H.266/VVC uses a CTU size
four times larger than H.265/HEVC, which inevitably increases the hardware area of a
hardware decoder compared to H.265/HEVC. Therefore, if the CTU size is limited
considering the resolution of the compressed video, the hardware area is expected to
be reduced without significantly reducing the coding efficiency. On the other hand,
limiting the CTU size to 32${\times}$32 would be difficult because of the significant
coding performance degradation shown in Table 26.
Table 27 shows the performance when the maximum transform block size (maxTB) is set to 32,
compared to the value of 64 used in the JVET CTC. Because the intra prediction of
H.266/VVC is performed per TU, setting maxTB to 32 reduces the maximum block size of
intra prediction, so more block partitioning is performed in each intra-predicted CU.
In addition, coding loss is observed due to the increase in signaling information
transmitted in each TU, such as cbfs. This tendency appears mainly in classes A1 and
A2, similar to the results of adjusting the CTU size in Tables 25 and 26. These results show that H.266/VVC obtains better coding efficiency by increasing
the CTU size and the maximum TU size in the block structure, in addition to the newly
adopted coding techniques, compared to H.265/HEVC.
Table 27. Experimental results of setting maxTB to 32 (Anchor: 64).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.56% | 2.66% | 2.60% | 95% | 107% | 4.79% | 7.54% | 8.63% | 93% | 109% |
| A2 | 0.56% | 2.71% | 1.74% | 96% | 104% | 1.22% | 2.30% | 1.71% | 96% | 103% |
| B | 0.28% | 2.96% | 3.39% | 97% | 102% | 0.92% | 2.43% | 3.05% | 97% | 104% |
| C | 0.07% | 0.46% | 0.24% | 98% | 97% | 0.27% | 0.39% | 0.43% | 100% | 101% |
| E | 0.52% | 3.42% | 3.42% | 97% | 101% | - | - | - | - | - |
| All | 0.37% | 2.39% | 2.29% | 97% | 102% | 1.58% | 2.88% | 3.20% | 97% | 104% |
| D | 0.05% | 0.40% | 0.19% | 98% | 97% | 0.02% | 0.52% | 0.15% | 102% | 100% |
| F | 0.16% | 1.14% | 1.14% | 102% | 101% | 0.33% | 0.86% | 1.07% | 100% | 105% |
Table 28. Experimental results of BT off (Anchor: VTM12.0).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 3.97% | 4.55% | 5.27% | 31% | 97% | 5.71% | 6.07% | 7.47% | 36% | 100% |
| A2 | 2.82% | 7.99% | 6.82% | 29% | 98% | 5.76% | 8.65% | 7.77% | 43% | 104% |
| B | 3.51% | 7.65% | 8.65% | 28% | 97% | 5.93% | 9.69% | 10.62% | 41% | 104% |
| C | 4.98% | 6.98% | 7.68% | 23% | 94% | 6.49% | 8.98% | 10.33% | 38% | 101% |
| E | 5.04% | 7.32% | 7.59% | 29% | 97% | - | - | - | - | - |
| All | 4.06% | 6.99% | 7.39% | 28% | 97% | 6.00% | 8.57% | 9.34% | 40% | 102% |
| D | 4.86% | 7.48% | 7.71% | 23% | 94% | 6.94% | 10.99% | 11.55% | 41% | 102% |
| F | 7.12% | 8.99% | 9.47% | 31% | 97% | 8.24% | 9.75% | 10.27% | 44% | 104% |
Table 29. Experimental results of TT off (Anchor: VTM12.0).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.95% | 1.52% | 1.60% | 56% | 97% | 2.82% | 3.68% | 4.28% | 48% | 103% |
| A2 | 1.00% | 1.66% | 1.46% | 51% | 97% | 2.80% | 3.51% | 3.25% | 55% | 101% |
| B | 1.16% | 1.59% | 1.66% | 51% | 95% | 2.70% | 3.17% | 3.52% | 54% | 100% |
| C | 1.61% | 2.28% | 2.39% | 47% | 94% | 3.37% | 4.38% | 4.48% | 49% | 99% |
| E | 1.86% | 1.37% | 1.73% | 52% | 95% | - | - | - | - | - |
| All | 1.32% | 1.71% | 1.79% | 43% | 96% | 2.92% | 3.66% | 3.87% | 51% | 100% |
| D | 1.49% | 2.21% | 2.65% | 47% | 95% | 3.29% | 4.82% | 4.66% | 54% | 98% |
| F | 2.38% | 2.84% | 2.92% | 53% | 96% | 3.80% | 4.26% | 4.86% | 57% | 101% |
Tables 28-31 show the results of turning off BT, TT, both BT and TT, and DT, respectively.
Although BT consumes more encoding complexity than TT, it provides better coding
efficiency. When BT and TT are both turned off to use the same block partitioning
structure as H.265/HEVC, a coding loss of about 8% (AI) to 13% (RA) is observed, but
the encoding complexity decreases by up to a factor of 10 compared to the VTM12.0
encoder. These results arise because the coding techniques are searched recursively
over the different partitionings in the video encoding process. Although the individual
coding efficiencies of BT, TT, and DT cannot be isolated precisely, H.266/VVC owes a
significant part of its compression efficiency over H.265/HEVC to BT, TT, and DT, as
shown in Tables 28-31.
4.7 Group of Pictures Structure
In addition to the block structure and partitioning scheme analysis in the previous
subsection, the coding efficiency according to the number of reference pictures constituting
the GOP in the RA configuration is analyzed comparatively.
Table 30. Experimental results of both BT and TT off (Anchor: VTM12.0).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 9.14% | 9.72% | 11.43% | 12% | 90% | 15.33% | 16.54% | 19.04% | 12% | 93% |
| A2 | 5.15% | 14.35% | 12.02% | 10% | 93% | 13.04% | 19.18% | 17.13% | 18% | 101% |
| B | 6.38% | 13.96% | 15.62% | 10% | 91% | 11.41% | 18.92% | 19.80% | 18% | 101% |
| C | 9.39% | 12.79% | 13.96% | 8% | 85% | 13.26% | 18.10% | 19.59% | 17% | 97% |
| E | 10.35% | 13.42% | 13.30% | 12% | 91% | - | - | - | - | - |
| All | 7.97% | 12.97% | 13.57% | 10% | 90% | 13.01% | 18.28% | 19.06% | 16% | 99% |
| D | 8.80% | 12.94% | 13.13% | 9% | 86% | 12.60% | 19.56% | 20.23% | 23% | 97% |
| F | 16.68% | 19.68% | 21.13% | 11% | 89% | 20.75% | 23.49% | 24.88% | 25% | 102% |
Table 31. Experimental results of DT off (Anchor: VTM12.0).

| Class | Y (AI) | Cb (AI) | Cr (AI) | EncT (AI) | DecT (AI) | Y (RA) | Cb (RA) | Cr (RA) | EncT (RA) | DecT (RA) |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.40% | 0.67% | 1.39% | 97% | 102% | 0.02% | 0.87% | 0.16% | 102% | 102% |
| A2 | 0.63% | 0.94% | 0.38% | 96% | 102% | 0.08% | 3.40% | 2.05% | 101% | 102% |
| B | 0.39% | 0.14% | 0.93% | 97% | 100% | 0.15% | 5.04% | 7.29% | 102% | 101% |
| C | 0.30% | 0.34% | 0.72% | 97% | 99% | 0.16% | 2.20% | 2.74% | 101% | 100% |
| E | 0.40% | -0.25% | 2.30% | 97% | 100% | - | - | - | - | - |
| All | 0.41% | 0.34% | 1.09% | 97% | 101% | 0.12% | 3.12% | 3.60% | 102% | 101% |
| D | 0.12% | -0.07% | 0.46% | 97% | 100% | 0.16% | 3.29% | 3.85% | 102% | 100% |
| F | -1.11% | -1.37% | -0.98% | 82% | 101% | 0.10% | 2.20% | 3.30% | 105% | 102% |
Table 32. Experimental results of setting the GOP size to 16 on the RA configuration (Anchor: GOP 32).

| Class | Y | Cb | Cr | EncT | DecT |
|---|---|---|---|---|---|
| A1 | 2.00% | 3.43% | 4.46% | 101% | 102% |
| A2 | 3.29% | 6.52% | 5.81% | 101% | 100% |
| B | 4.36% | 7.79% | 8.55% | 102% | 102% |
| C | 3.61% | 6.41% | 6.72% | 101% | 100% |
| All | 3.47% | 6.30% | 6.70% | 101% | 101% |
| D | 3.51% | 6.99% | 7.06% | 100% | 100% |
| F | 6.97% | 8.02% | 7.87% | 105% | 100% |
H.265/HEVC and H.266/VVC use a hierarchical GOP structure when performing compression
on the RA configuration according to the CTC, and the number of pictures constituting
the GOP (GOP size) is set to 16 and 32, respectively. Therefore, the coding efficiencies
for GOP sizes of 16 and 32 are compared, as shown in Table 32, to assess the coding efficiency of H.266/VVC against H.265/HEVC on the RA configuration
fairly. As shown in Table 32, when the GOP size is set to 16, a coding loss of approximately 3.5% occurs in the
luma component. This result is attributed to the fact that, when the GOP size is set
to 32, reference pictures with higher temporal IDs are inserted between the pictures
of the GOP of size 16, and the sliceQP of these inserted pictures is set to a higher
QP during encoding. As a result, the coding gain of H.266/VVC over H.265/HEVC on the
RA configuration comes from both the newly adopted tools and the larger GOP size.
Table 33. Notations used in Section 3.