Multi-Type Self-Attention-Based Convolutional-Neural-Network Post-Filtering for AV1 Codec

Gwun, Woowoen; Choi, Kiho; Park, Gwang Hoon

doi:10.3390/math12182874


by Woowoen Gwun ¹, Kiho Choi ²,³,* and Gwang Hoon Park ¹,*

¹ Department of Computer Science and Engineering, College of Software, Kyung Hee University, Yongin 17104, Gyeonggi-do, Republic of Korea

² Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Gyeonggi-do, Republic of Korea

³ Department of Electronic Engineering, Kyung Hee University, Yongin 17104, Gyeonggi-do, Republic of Korea

* Authors to whom correspondence should be addressed.

Mathematics 2024, 12(18), 2874; https://doi.org/10.3390/math12182874

Submission received: 15 August 2024 / Revised: 12 September 2024 / Accepted: 13 September 2024 / Published: 15 September 2024

(This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision)


Abstract:

Over the past few years, there has been substantial interest and research activity surrounding the application of Convolutional Neural Networks (CNNs) for post-filtering in video coding. Most current research efforts have focused on using CNNs with various kernel sizes for post-filtering, primarily concentrating on High-Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC). This narrow focus has limited the exploration and application of these techniques to other video coding standards such as AV1, developed by the Alliance for Open Media, which offers excellent compression efficiency, reducing bandwidth usage and improving video quality, making it highly attractive for modern streaming and media applications. This paper introduces a novel approach that extends beyond traditional CNN methods by integrating three different self-attention layers into the CNN framework. Applied to the AV1 codec, the proposed method significantly improves video quality by incorporating these distinct self-attention layers. This enhancement demonstrates the potential of self-attention mechanisms to revolutionize post-filtering techniques in video coding beyond the limitations of convolution-based methods. The experimental results show that the proposed network achieves an average BD-rate reduction of 10.40% for the Luma component and 19.22% and 16.52% for the Chroma components compared to the AV1 anchor. Visual quality assessments further validated the effectiveness of our approach, showcasing substantial artifact reduction and detail enhancement in videos.

Keywords: video compression; AV1; self-attention; CNN

MSC: 94A08

1. Introduction

To meet the market’s demand for more efficient codecs and new technology trends such as 4K, 8K, and HDR, organizations such as the Joint Video Exploration Team (JVET) and the Alliance for Open Media (AOM) are dedicated to developing high-performance codecs. The JVET was established in 2015 as a collaboration between the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC JTC1 SC29 Moving Picture Experts Group (MPEG). In 2017, the JVET began developing Versatile Video Coding (VVC) [1], releasing the first test model in 2018 and completing the standardization in 2020. VVC achieved performance improvements of 50% and 75% over High-Efficiency Video Coding/H.265 (HEVC) and Advanced Video Coding/H.264 (AVC), respectively. Another recent video compression standard is AOMedia Video 1 (AV1). The AOM was formed in 2015 by major IT companies, including Google, Amazon, Netflix, Cisco, Microsoft, and Intel, with the goal of developing a next-generation video codec for streaming services. In March 2018, the AOM officially released the first version of the AV1 codec [2]. AV1 was developed based on technologies from Google’s VP9 codec, Xiph.org’s Daala project, and Cisco’s Thor project. AV1 offers over 30% better compression efficiency than VP9 and HEVC. By late 2018, major web browsers (e.g., Google Chrome and Firefox) started supporting AV1, and in 2019, major streaming services such as Netflix and YouTube began streaming with the AV1 codec. By early 2020, various hardware manufacturers had released chipsets and devices supporting AV1 decoding. In AV1, the traditional block-based hybrid video coding architecture is maintained while introducing numerous new coding tools that enhance its efficiency. The framework includes transformation, quantization, entropy coding, intra prediction, inter prediction, and in-loop filtering.
The “transform” unit converts prediction residuals from the spatial domain to the frequency domain, concentrating energy in low-frequency regions, using techniques like Discrete Cosine Transform (DCT) and Asymmetric Discrete Sine Transform (ADST). Quantization reduces the precision of the transform coefficients, which is the primary cause of coding distortion, and is managed through different parameters to balance compression and quality. Entropy coding, which in AV1 is performed by a context-adaptive multi-symbol arithmetic coder, converts data into a compact bitstream for efficient storage and transmission. Intra- and inter-prediction methods eliminate spatial and temporal redundancies, with AV1 supporting various intra prediction modes and advanced inter-prediction methods such as compound prediction and warped motion estimation. The loop filter enhances video quality and compression efficiency through several modules: the Deblocking Filter (DBF) reduces artifacts at block boundaries, the Constrained Directional Enhancement Filter (CDEF) smooths the areas insufficiently addressed by DBF, and the Loop Restoration applies filters (e.g., Wiener and Self-Guided Restoration) for further quality enhancement. These tools collectively enable AV1 to achieve high compression efficiency while maintaining video quality, making it highly effective for modern streaming and media applications [2,3].

Despite the release of high-performance codecs, there is a persistent demand for ultra-high-definition user experiences. Recent research has thus concentrated on enhancing compression efficiency through neural networks, utilizing either end-to-end (E2E) approaches [4] or modular approaches where neural network models are integrated into specific components of the traditional video coding pipeline. Among these, in-loop filtering [5,6,7,8,9,10] and post-filtering [11,12,13,14,15,16,17,18] are the most actively researched areas for improving codec performance with neural networks. The in-loop filter and post-filter each offer distinct advantages and disadvantages. The in-loop filter, an integral component of the video codec, as illustrated in Figure 1a, must always be used during both the encoding and decoding processes. It reduces artifacts and improves visual quality during processing. Because the in-loop filter operates with access to coding information, it can potentially be more efficient. Conversely, the post-filter functions at the end of the decoder stage, as shown in Figure 1b, and is applied only when sufficient computational resources are available on the decoding terminal. This selective application enhances the visual quality and coding efficiency without being a mandatory part of the codec’s standard operations. This flexibility significantly increases the terminal’s practical usability, particularly in resource-constrained environments. However, it also has limitations, as it can only access the resulting image after video coding.

Related Work

For in-loop filtering, Wang et al. [5] proposed a Residual Dense Convolutional Neural Network (DRN) for HEVC, utilizing residual learning and dense shortcuts to address gradient vanishing and feature reuse issues. In a follow-up paper, Chen et al. [6] applied the DRN to VVC. Zhao et al. [7] proposed a multi-scale Convolutional Neural Network for in-loop filtering that jointly processes luminance and chrominance components within the VVC framework. Kathariya et al. [8] introduced a novel framework called TSF-Net for video coding in VVC, integrating both pixel and frequency-decomposed information using a channel-wise transformer. This method significantly improved the in-loop filtering performance, achieving up to 10.26% bitrate saving and enhanced video quality compared to the existing VVC standard. For the AV1 standard, Ding et al. [9] presented a depth-variable Convolutional Neural Network designed to improve both intra and inter coding in the AV1 video codec by adapting to different levels of distortion. The experimental results demonstrated an average BD-rate reduction of 7.27% for intra coding and 5.57% for inter-coding compared to the AV1 anchor. Xia et al. [10] introduced an Asymmetric Convolutional Residual Network for AV1 in-loop filtering, which improves texture restoration and directional feature extraction, balancing performance and complexity, resulting in an average coding efficiency improvement of 8.48% compared to the AV1 anchor.

In the realm of post-filtering, several Convolutional Neural Network (CNN)-based methods have been proposed to enhance HEVC-compressed videos. Guan et al. [11] introduced MFQE 2.0, a multi-frame quality enhancement technique utilizing CNNs to improve the quality of HEVC-compressed videos. This method effectively harnesses temporal information across multiple frames to achieve superior enhancement results. Lin et al. [12] proposed partition-aware adaptive switching neural networks that selectively apply post-processing based on partition information, yielding significant improvements in video quality. This approach adapts to the varying characteristics of different video segments, enhancing the overall performance. With the advent of the VVC standard, research efforts have shifted towards optimizing this new compression format. Zhang et al. [13] presented a CNN-based post-processing method tailored for VVC, particularly in random access configurations. Their network, trained on an extensive video dataset, is deployed at the decoder to enhance the reconstruction quality. Lin et al. [14] developed MFRNet, which integrates multi-level feature review residual dense blocks for both post-processing and in-loop filtering, substantially improving video quality. This network leverages dense connectivity and residual learning to refine video frames effectively. Liu et al. [15] introduced DFNN, a fusion network combining CNN and transformer models with channel-wise attention mechanisms for enhanced VVC post-processing. This hybrid approach exploits the strengths of both CNNs and transformers to achieve superior performance. Santamaria et al. [16] introduced a content-adaptive CNN post-processing filter for video coding, which incorporates learnable multipliers into neural networks. These multipliers are overfitted to specific video content, effectively reducing compression artifacts and emphasizing the significance of content-adaptive mechanisms in video compression. 
Das et al. [17] presented a deep CNN-based post-processing method that uses Quantization Parameter (QP) maps to enhance generalization and simplifies the network architecture by minimizing skip connections and employing a straightforward design. Although CNN-based post-processing methods specifically for the AV1 standard are not yet prevalent, Zhang et al. [18] developed a CNN-based post-processing technique integrated with both VVC and AV1 standards. This method demonstrates bitrate savings of 4.0% for VVC and 5.8% for AV1, indicating its potential for enhancing AV1-compressed videos as well.

In-loop filtering and post-filtering are two key approaches in neural network-based video compression enhancement, each with distinct advantages and limitations. In-loop filtering is integrated directly within the video codec loop, providing real-time improvements in the coding efficiency and video quality during both encoding and decoding. This integration allows for more effective artifact reduction and tighter control over video quality. However, a significant limitation is that it requires modification to the codec itself, which can reduce flexibility and adaptability, particularly within established standards such as AV1. Post-filtering, by contrast, offers a flexible, codec-agnostic approach that is applied after video decoding. This method does not require changes to the existing encoding pipeline, making it highly adaptable to various codecs and implementations. Its flexibility is especially advantageous when enhancements are needed for finalized codecs, including AV1. However, post-filtering typically does not achieve the same level of artifact reduction as in-loop filtering, since it is applied after the video has been fully decoded. Considering these factors, post-filtering was selected for this research due to its adaptability and its potential to improve video quality within the established AV1 standard without necessitating modifications to the encoding process.

2. Proposed Method

In this paper, we propose an AV1 CNN with a Multi-type Self-Attention (MTSA) network designed to enhance the quality of AV1-decoded videos by effectively reducing artifacts. The MTSA network is composed of three main parts: (1) shallow feature extraction, (2) deep feature extraction, and (3) image reconstruction. Figure 2 illustrates the overall network architecture of the MTSA. The shallow feature extraction component, also referred to as the head part, consists of a convolutional layer followed by a PReLU activation function and two Residual Convolution Blocks (RCBs). This stage is crucial for capturing low-level features from the input video frames, preserving fine details and textures in the reconstructed images. Each convolutional layer in this part has a kernel size of 3 × 3. The deep feature extraction component, or the backbone part, is the core of the network and integrates the unique MTSA. It includes three distinct self-attention layers: Channel-wise Self-Attention (CWSA), Block-wise Spatial Self-Attention (BWSSA), and Patch-wise Self-Attention (PWSA). These layers work together to capture complex relationships and dependencies in the feature maps, allowing the network to understand both local and global contexts. The backbone part is further reinforced with RCBs between each self-attention layer, enhancing feature extraction and refinement. The image reconstruction component, structurally symmetric to the shallow feature extraction part, aims to generate the final high-quality output image from the deep features. This part also includes convolutional layers but without the PReLU activation function after the last layer, ensuring that the reconstruction process is efficient and accurate. By integrating these components, the MTSA network effectively reduces artifacts and improves the visual quality of AV1-decoded videos.

2.1. Shallow Feature Extraction

The shallow feature extraction part, as shown in Figure 2, is a crucial initial step in the CNN architecture for video post-filtering. This stage is designed to capture low-level features from the input video frames, which are essential for preserving fine details and textures in the reconstructed images. It is widely used in image regression problems such as image denoising or super-resolution, as early convolution layers excel in initial visual processing, resulting in more stable optimization and improved outcomes [19]. The proposed shallow feature extraction consists of three main components: a point-wise convolutional layer, a PReLU activation function, and two RCBs.

Residual Convolution Block

The RCB is a critical component in our architecture, building upon the principles introduced by ResNet [20]. ResNet demonstrated that adding residual connections between two to three convolution layers significantly improves the training efficiency and model performance. Following this approach, each RCB in our architecture consists of two sets of two 3 × 3 convolution layers, each followed by Batch Normalization and a PReLU activation function, coupled with a residual connection. This design helps to capture and refine local spatial patterns and details, which are essential for constructing detailed feature maps in video post-filtering tasks. The residual connections facilitate the flow of information and gradients through the network, leading to more efficient training and better performance. By iteratively refining the features extracted from the input frames, the RCBs contribute to the overall robustness and accuracy of the model. The implementation of RCBs ensures that the network can effectively capture fine details and textures, which are crucial for the high-quality reconstruction of YCbCr images, as illustrated in Figure 3a. Furthermore, these RCBs are integrated with the three types of self-attention layers later in the network, allowing for a powerful combination of local feature extraction and global context understanding. This synergy enhances the model’s capability to detect and correct artifacts in reconstructed images, leading to superior video post-filtering results.
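The block structure described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' PyTorch implementation: it reads "two sets of two 3 × 3 convolution layers" as two residual units, each made of two conv, Batch Normalization, PReLU stages with a skip connection. The helper names, the PReLU slope of 0.25, and the batch-statistics normalization without learned scale or shift are all assumptions.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """3x3 convolution with zero padding. x: (B, Cin, H, W), weight: (Cout, Cin, 3, 3)."""
    _, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((x.shape[0], weight.shape[0], h, w))
    for i in range(3):
        for j in range(3):
            # Accumulate the contribution of kernel tap (i, j) over all input channels.
            out += np.einsum("bchw,oc->bohw", padded[:, :, i:i + h, j:j + w], weight[:, :, i, j])
    return out + bias[None, :, None, None]

def batch_norm(x, eps=1e-5):
    """Per-channel normalization with batch statistics (assumed: no learned scale/shift)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def prelu(x, slope=0.25):
    """PReLU with a fixed slope; in a trained network the slope is learned."""
    return np.where(x > 0, x, slope * x)

def residual_unit(x, params):
    """Two conv -> BN -> PReLU stages, then the skip connection."""
    y = x
    for weight, bias in params:
        y = prelu(batch_norm(conv3x3(y, weight, bias)))
    return x + y

def rcb(x, unit_params):
    """Assumed RCB layout: a chain of residual units (two in the text)."""
    for params in unit_params:
        x = residual_unit(x, params)
    return x

def init_params(channels, rng, scale=0.1):
    """Random weights for one residual unit (two 3x3 convolutions)."""
    return [(scale * rng.standard_normal((channels, channels, 3, 3)), np.zeros(channels))
            for _ in range(2)]
```

With all convolution weights set to zero, each unit reduces to the identity, which is the property that makes residual blocks easy to optimize.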

2.2. Deep Feature Extraction

The goal of deep feature extraction is to recognize complex and high-level patterns in the image that are essential for making accurate predictions. In the proposed network, deep feature extraction consists of three different self-attention layers and two RCBs between those self-attention layers. The three different self-attention layers are used to extract various relationships between features by focusing on different parts of the image simultaneously, enhancing the network’s ability to capture global dependencies and contextual information. The RCBs, integrated with self-attention mechanisms, help in preserving spatial hierarchies and refining feature maps through residual learning. This combination of self-attention modules with a CNN allows the network to maintain a high level of accuracy by leveraging both local and global feature interactions [21], leading to more robust and detailed feature representations.

2.2.1. Channel-Wise Self-Attention

The first self-attention layer used in the proposed architecture is CWSA, which is shown in Figure 3b. This layer enhances the inter-channel dependencies after the initial local features are extracted [22]. Enhancing the relationships and dependencies between feature channels is crucial, especially since we are using three separate images as the original input (i.e., Y, Cb, and Cr). Capturing these inter-channel dependencies can further improve the network performance by allowing the network to better understand and integrate the distinct information provided by each color channel, leading to more accurate and robust feature representations.

The input tensor of CWSA is represented by:

X \in R^{B \times C \times H \times W},

(1)

where B represents the batch size, C is the number of channels, H is the height, and W is the width of the image. The query, key, and value matrices are projected by passing input tensor X through point-wise convolution and are represented, respectively, as Q_α, K_α, and V_α, as shown in Equation (2):

Q_{α}, K_{α}, V_{α} = \mathrm{Conv}_{1 \times 1}(X) \in R^{B \times C \times H \times W} .

(2)

After extracting Q_α, K_α, and V_α from input tensor X, they are reshaped so that the height H and width W are combined into a single dimension, as in Equation (3):

Q_{α}, K_{α}, V_{α} \to R^{B \times C \times (H \cdot W)} .

(3)

In self-attention mechanisms, the “Scaled Dot-Product Attention” technique involves scaling the dot products of queries and keys by 1/√d_k, where d_k represents the dimension of the key, to prevent large values that could cause the Softmax function to operate in regions with very small gradients. This scaling stabilizes gradients and improves the performance, especially when dealing with higher-dimensional key/query vectors [23]. In the proposed network, the attention score A_α is computed by batch multiplication between Q_α and the transpose of K_α (i.e., K_α^t). The operation is normalized using √C, and then Softmax is applied to convert the attention score into attention weights:

A_{α} = \mathrm{Softmax}\left(\frac{Q_{α} \cdot K_{α}^{t}}{\sqrt{C}}\right) \in R^{B \times C \times C} .

(4)

Batch multiplication is performed on value V_α and the attention weight previously calculated to obtain the weighted sum of value O_α, which results in a tensor of shape B × C × (H · W):

O_{α} = A_{α} \cdot V_{α} \in R^{B \times C \times (H \cdot W)} .

(5)

After obtaining the weighted sum of value O_α, it is reshaped back to B × C × H × W to match input tensor X:

O_{α} \to R^{B \times C \times H \times W} .

(6)

Finally, the weighted sum O_α is added to the input tensor X to include the residual connection, resulting in Y_α:

Y_{α} = O_{α} + X .

(7)
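Equations (1) to (7) can be traced step by step in a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the 1 × 1 convolutions are modeled as bias-free C × C channel-mixing matrices (`wq`, `wk`, `wv` are hypothetical names), and the function also returns the attention weights so their shape can be inspected.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pointwise_conv(x, weight):
    """1x1 convolution as a channel-mixing product. x: (B, C, H, W), weight: (C, C)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def channel_wise_self_attention(x, wq, wk, wv):
    b, c, h, w = x.shape
    # Eq. (2)-(3): project, then flatten H and W into a single axis.
    q = pointwise_conv(x, wq).reshape(b, c, h * w)
    k = pointwise_conv(x, wk).reshape(b, c, h * w)
    v = pointwise_conv(x, wv).reshape(b, c, h * w)
    # Eq. (4): (B, C, HW) @ (B, HW, C) -> (B, C, C), scaled by sqrt(C).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c), axis=-1)
    # Eq. (5)-(6): weight the values and restore the spatial layout.
    out = (attn @ v).reshape(b, c, h, w)
    # Eq. (7): residual connection.
    return out + x, attn
```

Because the attention matrix is only C × C, the cost of this layer grows linearly with the number of pixels, which is why it can be applied to full feature maps.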

2.2.2. Block-Wise Spatial Self-Attention

BWSSA focuses on the relationships between different spatial locations (i.e., pixels) within each block of the feature maps. This approach allows each pixel in a block to attend to every other pixel within the same block, capturing long-range spatial dependencies that traditional convolutions might miss. This is crucial because current video codec architectures operate on a block-wise basis. By considering the entire block’s spatial context, BWSSA helps the network understand images in blocks, thereby improving its ability to capture global patterns and structures within each block, enhancing the overall performance of the network.

The input tensor of BWSSA is represented by X, the same as for the CWSA network shown in Equation (1). To perform spatial self-attention on a block basis, features are divided into smaller blocks. Figure 4a shows a simplified feature map with a channel size of 3, a height of 4, and a width of 4. Each pixel is drawn as an individual element in the figure, with each channel denoted by subscript numbers 1 to 3. To perform self-attention on a block basis, unfolding is performed to divide the features into blocks, as shown in Equation (8):

U = X.Unfold(H,b).Unfold(W,b).

(8)

Unfolding X by both height, H, and width, W, results in tensor U with the shape:

U \in R^{B \times C \times \frac{H}{b} \times \frac{W}{b} \times b \times b}

(9)

where U is the unfolded tensor, B is the batch size, C is the number of channels, H is the height, W is the width of the image, and b is the size of the block. Figure 4b shows the results of unfolding from Figure 4a. In the figure, the features in each channel are divided into 4 separate blocks. Since the feature map has a height and width of 4 and the block size b is 2, the number of blocks is H/b = 2 along the height and W/b = 2 along the width.

After unfolding, permuting and reshaping are performed to prepare U to be used as query, key, and value. Permuting places the channel and block dimensions last for the self-attention calculation. Reshaping combines the width and height dimensions of each block into a single dimension and folds the number of blocks into the batch dimension. The result is tensor S, shown in Figure 4c, and its shape is given in Equation (10):

S \in R^{(B \cdot \frac{H}{b} \cdot \frac{W}{b}) \times C \times (b \cdot b)} .

(10)
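The unfolding and reshaping in Equations (8) to (10) can be checked on the 3-channel, 4 × 4 example of Figure 4. The NumPy sketch below uses a reshape and transpose to obtain the same non-overlapping blocks that the unfolding step is described as producing; the function names are hypothetical, and the inverse helper is included only to show that the rearrangement loses no information.

```python
import numpy as np

def unfold_blocks(x, b):
    """(B, C, H, W) -> (B, C, H/b, W/b, b, b), the shape given in Eq. (9)."""
    bsz, c, h, w = x.shape
    assert h % b == 0 and w % b == 0, "H and W must be multiples of the block size"
    u = x.reshape(bsz, c, h // b, b, w // b, b)
    # Move the two block-index axes ahead of the two within-block axes.
    return u.transpose(0, 1, 2, 4, 3, 5)

def to_block_batch(u):
    """(B, C, H/b, W/b, b, b) -> (B*H/b*W/b, C, b*b), the shape given in Eq. (10)."""
    bsz, c, nh, nw, b, _ = u.shape
    s = u.transpose(0, 2, 3, 1, 4, 5)
    return s.reshape(bsz * nh * nw, c, b * b)

def from_block_batch(s, shape, b):
    """Inverse of unfold_blocks followed by to_block_batch."""
    bsz, c, h, w = shape
    u = s.reshape(bsz, h // b, w // b, c, b, b)
    # (B, nh, nw, C, b, b) -> (B, C, nh, b, nw, b) -> (B, C, H, W)
    return u.transpose(0, 3, 1, 4, 2, 5).reshape(bsz, c, h, w)

# Figure 4 example: one image, 3 channels, 4 x 4 pixels, block size 2.
x = np.arange(48).reshape(1, 3, 4, 4)
u = unfold_blocks(x, 2)
s = to_block_batch(u)
```

Here `u` has shape (1, 3, 2, 2, 2, 2) and `s` has shape (4, 3, 4): four blocks, three channels, and four pixels per block.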

Permuted and reshaped tensor S is then passed through point-wise convolution to obtain Q_β, K_β, and V_β, as shown in Equation (11):

Q_{β}, K_{β}, V_{β} = \mathrm{Conv}_{1 \times 1}(S) \in R^{(B \cdot \frac{H}{b} \cdot \frac{W}{b}) \times C \times (b \cdot b)} .

(11)

The attention score A_β is computed using batch multiplication between the transpose of Q_β (i.e., Q_β^t) and K_β. The operation is normalized using √(b · b), and then Softmax is applied to convert the attention score into attention weights:

A_{β} = \mathrm{Softmax}\left(\frac{Q_{β}^{t} \cdot K_{β}}{\sqrt{b \cdot b}}\right) \in R^{(B \cdot \frac{H}{b} \cdot \frac{W}{b}) \times (b \cdot b) \times (b \cdot b)} .

(12)

Batch multiplication is performed on value V_β and the attention weight previously calculated to obtain the weighted sum of value O_β, which results in a tensor of shape (B · H/b · W/b) × C × (b · b) as follows:

O_{β} = A_{β} \cdot V_{β} \in R^{(B \cdot \frac{H}{b} \cdot \frac{W}{b}) \times C \times (b \cdot b)} .

(13)

After the weighted sum of value O_β is obtained, it is permuted and reshaped back to B × C × H × W to match the input tensor X as follows:

O_{β} \to R^{B \times C \times H \times W} .

(14)

Finally, the weighted sum O_β is added to the input tensor X to include the residual connection, resulting in Y_β:

Y_{β} = O_{β} + X .

(15)

The process of BWSSA is shown in Figure 5a.
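The whole BWSSA path, Equations (8) to (15), fits in one NumPy function. This is a sketch under assumptions, not the authors' implementation: the 1 × 1 convolutions are bias-free C × C matrices with hypothetical names, and the product in Equation (13) is evaluated as V_β times the transposed attention matrix so that the result has the stated (B · H/b · W/b) × C × (b · b) shape.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_wise_spatial_self_attention(x, wq, wk, wv, b):
    bsz, c, h, w = x.shape
    nh, nw = h // b, w // b
    # Eq. (8)-(10): split into non-overlapping b x b blocks, fold blocks into the batch.
    s = x.reshape(bsz, c, nh, b, nw, b).transpose(0, 2, 4, 1, 3, 5)
    s = s.reshape(bsz * nh * nw, c, b * b)
    # Eq. (11): a 1x1 convolution mixes channels independently at every pixel.
    q, k, v = wq @ s, wk @ s, wv @ s
    # Eq. (12): (N, bb, C) @ (N, C, bb) -> (N, bb, bb); row i holds the weights of pixel i.
    attn = softmax(q.transpose(0, 2, 1) @ k / np.sqrt(b * b), axis=-1)
    # Eq. (13): o[:, :, i] = sum_j attn[:, i, j] * v[:, :, j]
    o = v @ attn.transpose(0, 2, 1)
    # Eq. (14): undo the block folding to get back (B, C, H, W).
    o = o.reshape(bsz, nh, nw, c, b, b).transpose(0, 3, 1, 4, 2, 5).reshape(bsz, c, h, w)
    # Eq. (15): residual connection.
    return o + x, attn
```

Since every block is processed as its own batch entry, a change inside one block cannot affect the output of any other block, which mirrors the block-wise operation of the codec.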

2.2.3. Patch-Wise Self-Attention

PWSA abstracts the image into smaller patches and models the relationships between these patches. This approach, inspired by the Vision Transformer (ViT) introduced in [24] by Dosovitskiy et al., allows the network to capture high-level global context and interactions between different regions of the image. By understanding the relationships between patches, the network can integrate information from various parts of the image, leading to a more comprehensive understanding of the overall scene. This is essential for high-level tasks and enhances the network’s capability to generalize. The PWSA process is illustrated in Figure 5b.

The input tensor of PWSA is represented by X, the same as for the CWSA network shown in Equation (1). The query, key, and value matrices are projected by passing input tensor X through convolution with a kernel size and stride equal to patch size P, resulting in tensors of shape B × C × H/P × W/P, as shown in Equation (16):

Q_{γ}, K_{γ}, V_{γ} = \mathrm{Conv}_{P \times P}(X) \in R^{B \times C \times \frac{H}{P} \times \frac{W}{P}} .

(16)

After extracting Q_γ, K_γ, and V_γ from the input tensor X, they are reshaped so that the numbers of patches along the height, H/P, and the width, W/P, are combined into one dimension as follows:

Q_{γ}, K_{γ}, V_{γ} \to R^{B \times C \times (\frac{H}{P} \cdot \frac{W}{P})} .

(17)

The attention score A_γ is computed using batch multiplication between the transpose of Q_γ (i.e., Q_γ^t) and K_γ. The operation is normalized using √(H/P · W/P), and then Softmax is applied to convert the attention score into attention weights:

A_{γ} = \mathrm{Softmax}\left(\frac{Q_{γ}^{t} \cdot K_{γ}}{\sqrt{\frac{H}{P} \cdot \frac{W}{P}}}\right) \in R^{B \times (\frac{H}{P} \cdot \frac{W}{P}) \times (\frac{H}{P} \cdot \frac{W}{P})} .

(18)

Batch multiplication is performed on value V_γ and the attention weight previously calculated to obtain the weighted sum of value M, which results in a tensor of shape B × C × (H/P · W/P) as follows:

M = A_{γ} \cdot V_{γ} \in R^{B \times C \times (\frac{H}{P} \cdot \frac{W}{P})} .

(19)

After the weighted sum of value M is calculated, it is passed through transpose convolution so that the reduced spatial dimensions can be restored,

O_{γ} = \mathrm{TranConv}_{16 \times 16}(M) \in R^{B \times C \times \frac{H}{P} \times \frac{W}{P}},

(20)

and then reshaped back to B × C × H × W to match input tensor X:

O_{γ} \to R^{B \times C \times H \times W} .

(21)

Finally, the weighted sum O_γ is added to the input tensor X to include the residual connection, resulting in Y_γ:

Y_{γ} = O_{γ} + X .

(22)
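The patch-wise path can also be sketched in NumPy. A convolution whose kernel size equals its stride is a linear projection of each non-overlapping patch, and the matching transposed convolution spreads each patch token back over a P × P area, so both are written here as `einsum` calls. The weight names, the bias-free layers, the order "reshape to the patch grid, then transposed convolution", and evaluating the weighted sum as V_γ times the transposed attention matrix are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_embed(x, weight, p):
    """Conv with kernel = stride = P. x: (B, C, H, W), weight: (Cout, C, P, P)."""
    bsz, c, h, w = x.shape
    patches = x.reshape(bsz, c, h // p, p, w // p, p)
    return np.einsum("bcirjs,ocrs->boij", patches, weight)

def patch_expand(m, weight, p):
    """Transposed conv with kernel = stride = P. m: (B, C, H/P, W/P), weight: (C, Cout, P, P)."""
    bsz, _, nh, nw = m.shape
    out = np.einsum("bcij,cors->boirjs", m, weight)
    return out.reshape(bsz, weight.shape[1], nh * p, nw * p)

def patch_wise_self_attention(x, wq, wk, wv, wt, p):
    bsz, c, h, w = x.shape
    n = (h // p) * (w // p)
    # Eq. (16)-(17): one token per patch, flattened to (B, C, N).
    q = patch_embed(x, wq, p).reshape(bsz, c, n)
    k = patch_embed(x, wk, p).reshape(bsz, c, n)
    v = patch_embed(x, wv, p).reshape(bsz, c, n)
    # Eq. (18): (B, N, C) @ (B, C, N) -> (B, N, N), scaled by sqrt(N).
    attn = softmax(q.transpose(0, 2, 1) @ k / np.sqrt(n), axis=-1)
    # Eq. (19): weighted sum of values, shape (B, C, N).
    m = v @ attn.transpose(0, 2, 1)
    # Eq. (20)-(21): back to the patch grid, then up to (B, C, H, W).
    o = patch_expand(m.reshape(bsz, c, h // p, w // p), wt, p)
    # Eq. (22): residual connection.
    return o + x, attn
```

The attention matrix here has one row per patch, so its size depends on (H/P) · (W/P) rather than on the number of pixels.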

2.3. Image Reconstruction

The image reconstruction part of the network is the final stage in the proposed MTSA-based CNN architecture. This stage is responsible for generating the high-quality output image from the deep features extracted by the preceding layers. The image reconstruction process is crucial, as it translates the learned features back into the image space, ensuring that the output image is a refined and enhanced version of the input video frame.

Structure of the Image Reconstruction

The image reconstruction component of the proposed MTSA network is essential for generating the final high-quality output image from the deep features extracted in previous layers. Structurally similar to the shallow feature extraction part, the image reconstruction part includes RCBs, a point-wise convolution layer, and a Tanh activation function. Initially, the deep features are processed through two RCBs to refine and enhance the details. These refined features are then passed through a point-wise convolution layer, which uses a 1 × 1 kernel to combine the feature maps efficiently. The final step involves the Tanh activation function, which scales the output to a range between −1 and 1, ensuring balanced intensity levels and enhanced visual quality. This combination produces a smooth, natural-looking image with reduced artifacts and preserved fine details. By integrating these components, the image reconstruction part synthesizes a high-quality output image, retaining the natural colors, textures, and details of the original frame. This approach demonstrates significant improvements in video quality and compression efficiency across various test sequences.

2.4. Training and Testing Configuration

2.4.1. Training Dataset

For training our proposed model, we utilized the BVI-DVC dataset [25], a dataset widely adopted in deep learning-based video compression research. The BVI-DVC dataset includes 800 video sequences, each consisting of 64 frames in 10-bit YCbCr 4:2:0 format. These sequences span a broad range of resolutions, from 270 p to 2160 p, providing a comprehensive dataset for training neural networks in video compression tasks. Details of the BVI-DVC dataset are shown in Table 1. The diversity of this dataset, in terms of both content type and resolution, ensures that the network is exposed to a wide variety of scenarios, including diverse motion dynamics, textures, and lighting conditions, which are critical for training robust models. The BVI-DVC dataset is currently utilized by the JVET standardization organization for the development of neural network-based video coding technology. This further demonstrates the dataset’s suitability for training models in cutting-edge video compression research. It includes a mix of natural and synthetic video content, improving the generalizability of the models trained on it, ensuring that the network can effectively handle different real-world conditions. In this study, the BVI-DVC sequences were compressed using the SVT-AV1 encoder [26], following the parameters outlined in Table 2 to closely replicate the AVM Common Test Condition (CTC) for random access (RA) [27]. This approach ensured that the training data mirrored real-world conditions of AV1-compressed videos, optimizing the model for AV1 codec environments.

Since the BVI-DVC dataset is distributed in MP4 format, the videos were converted to YCbCr format with 4:2:0 Chroma sampling and a 10-bit depth. All 64 frames were extracted from each video, resulting in a training dataset of 51,200 frames. These frames were divided into patches of 256 × 256 pixels, creating a total of 1,984,000 patches. Patches of size 256 × 256 are used for training because AV1 uses a maximum super-block size of 128 × 128. With a 256 × 256 patch, each training sample spans several super-blocks, so the network can learn the artifacts that remain at super-block boundaries after in-loop filtering; hence, better results are expected. During training, the Chroma (Cb and Cr) channels were up-sampled by a factor of two to match the spatial resolution of the Luma (Y) channel, as the proposed network cannot process different input sizes.
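The frame and patch totals can be reproduced with a short calculation. The sketch below assumes the four BVI-DVC resolutions as published for the dataset (Table 1 is not reproduced in this section), 200 sequences per resolution, and non-overlapping 256 × 256 tiling with partial tiles discarded; the constant and function names are hypothetical.

```python
# Assumed BVI-DVC resolutions as (width, height), with 200 sequences at each one.
RESOLUTIONS = [(3840, 2176), (1920, 1088), (960, 544), (480, 272)]
SEQUENCES_PER_RESOLUTION = 200
FRAMES_PER_SEQUENCE = 64
PATCH_SIZE = 256

def patches_per_frame(width, height, patch=PATCH_SIZE):
    """Number of whole, non-overlapping patch x patch tiles that fit in one frame."""
    return (width // patch) * (height // patch)

def total_frames():
    return len(RESOLUTIONS) * SEQUENCES_PER_RESOLUTION * FRAMES_PER_SEQUENCE

def total_patches():
    frames_per_resolution = SEQUENCES_PER_RESOLUTION * FRAMES_PER_SEQUENCE
    return sum(patches_per_frame(w, h) * frames_per_resolution for w, h in RESOLUTIONS)
```

Under these assumptions the per-frame counts are 120, 28, 6, and 1, and the totals come out to 51,200 frames and 1,984,000 patches, matching the figures quoted in the text.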

Five distinct models were trained with the same network architecture using different QP values—20, 32, 43, 55, and 63—as shown in Table 3. These models were later used in the evaluation phase for different base QP values. Each model underwent training for 200 epochs using the Adam optimizer, with a learning rate of 10⁻⁴ and hyper-parameters of β1 = 0.9 and β2 = 0.999 to calculate gradient averages during the learning process.
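For readers unfamiliar with how β1 and β2 enter the update, a single Adam step is sketched below in NumPy. The learning rate and the two β values are the ones stated above; the epsilon of 1e-8 and the function and state names are assumptions, and this is the textbook update rule rather than the framework code used in the experiments.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and the two running averages."""
    state["t"] += 1
    # Exponential moving averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

param = np.zeros(2)
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
param = adam_step(param, np.array([2.0, -0.5]), state)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of the gradient magnitude.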

2.4.2. Training Strategy

In video coding, the Y channel represents the brightness information of an image, while the Cb and Cr channels represent the color information. The human visual system is much more sensitive to changes in brightness than to changes in color, making the Y channel particularly important for maintaining the perceived quality of the video. Ensuring high fidelity in the Y channel results in a video that appears sharper and more detailed to viewers, even if the Cb and Cr channels are compressed more heavily [28].

To assess this quality, the Peak Signal-to-Noise Ratio (PSNR) is commonly used, as it quantifies the difference between the original and compressed images, taking into account both brightness and color information [29]. The per-channel PSNR values are combined with a weighting of 12:1:1, as shown in Equation (23), so that the Y component carries 12 times the weight of either the Cb or the Cr component [28,30].

\mathrm{PSNR}_{YUV} = \frac{12 \cdot \mathrm{PSNR}_{Y} + \mathrm{PSNR}_{Cb} + \mathrm{PSNR}_{Cr}}{14}

(23)

The L2 loss function, also known as the Mean Squared Error (MSE), is widely used in regression problems and neural networks. It measures the average squared difference between the predicted values and the actual values, as given in Equation (24).

\text{L2 loss (MSE)} = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f(x_{i}))}^{2}

(24)

where y_{i} and f(x_{i}) represent the original and predicted values, respectively, and n is the number of values compared.
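Equations (23) and (24) can be illustrated with a short, self-contained sketch. It assumes the MSE is the mean over all n compared samples, as the text describes, and that the per-channel PSNR uses the standard definition with a 10-bit peak value of 1023; the function names are illustrative and not taken from the authors' code.

```python
import math


def mse(original, predicted):
    """Mean squared error between two equally long sample sequences (Equation (24))."""
    if len(original) != len(predicted) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - f) ** 2 for y, f in zip(original, predicted)) / len(original)


def psnr(mse_value, bit_depth=10):
    """Standard PSNR in dB for a given MSE; the peak is 2**bit_depth - 1."""
    if mse_value == 0:
        return math.inf  # identical signals
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak ** 2 / mse_value)


def psnr_yuv(psnr_y, psnr_cb, psnr_cr):
    """Weighted PSNR with the 12:1:1 ratio of Equation (23)."""
    return (12.0 * psnr_y + psnr_cb + psnr_cr) / 14.0


# Hypothetical per-channel PSNR values, for illustration only.
print(round(psnr_yuv(40.0, 42.0, 44.0), 3))  # 40.429
```

Because the Y term contributes 12 of the 14 weight units, a 1 dB change in the Y PSNR moves the combined value by about 0.86 dB, whereas a 1 dB change in Cb or Cr moves it by about 0.07 dB.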

2.4.3. Experimental Setup

The experiments used PyTorch [31] as the deep learning framework on an Arch Linux operating system. The hardware setup included two AMD EPYC 7513 32-Core CPUs, 384 GB of RAM, and an NVIDIA A6000 GPU. Training and testing were also performed on a system with less computational power, consisting of an AMD Ryzen Threadripper 1950X 16-Core CPU, 48 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti, by reducing the batch size to 4.

3. Results

3.1. Evaluation Process

3.1.1. Testing Dataset

To evaluate the performance of the proposed network, we used sequences from Class A1 to A5 of the AVM-CTC [27]. Although the AVM-CTC is primarily used for testing the AV2 codec, it consists of high-quality sequences that are also highly suitable for evaluating AV1 performance. The 45 sequences in the AVM-CTC represent a variety of real-world video content, each containing 130 frames, covering a wide range of resolutions from 480 × 270 to 4 K, as well as various motion characteristics, textures, and scene complexities. This diversity ensures a comprehensive evaluation of the network’s ability to generalize across different video types. The sequences used for testing are listed in Table 4. To ensure consistency between the training and testing datasets, all 8-bit sequences were converted to 10-bit, matching the format used for training with the BVI-DVC dataset. This conversion allows for uniform input processing across both training and testing, ensuring that the model is evaluated under consistent conditions. It is important to note that potential overlaps between the training and testing datasets were identified. Because these overlaps could introduce biases, we investigated them thoroughly; the impact of the overlapping sequences is examined and addressed in the Ablation Studies section.

3.1.2. Testing Strategy

To evaluate the effectiveness of the proposed method, the test sequences were compressed with SVT-AV1 using the parameters listed in Table 2. The proposed post-filter network was then applied to assess the performance improvement. The Bjontegaard Delta Bit Rate (BD-BR) metric [32] is used by JVET to evaluate bitrate reduction. This metric is valuable for comparing the coding efficiency of different video codecs or encoding settings because it considers both the bitrate and video quality. By measuring the difference in bitrate needed to achieve the same quality level between two codecs, the BD-BR metric allows for an objective assessment of the compression efficiency [32,33]. A lower (more negative) BD-BR value indicates better coding efficiency. The results in Table 5 show that the proposed method achieved coding gains for the large majority of test sequences. Table 5 summarizes the compression performance of the proposed architecture.
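The BD-BR computation described above can be sketched as follows. This is a minimal illustration of the classical Bjontegaard procedure: a cubic polynomial is fitted to the logarithm of the bitrate as a function of PSNR for each curve, and the average gap between the two fits over the shared PSNR range is converted to a percentage. The least-squares fit is written out with normal equations to keep the sketch dependency-free, the function names and sample rate points are hypothetical, and the exact tool used by the authors is not stated in the text.

```python
import math


def _fit_cubic(xs, ys):
    """Least-squares cubic fit; returns [c0, c1, c2, c3] for c0 + c1*x + c2*x^2 + c3*x^3."""
    n = 4
    # Normal equations (A^T A) c = A^T y for the basis 1, x, x^2, x^3.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(ata[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aty[row] - tail) / ata[row][row]
    return coeffs


def _integral(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients over [lo, hi]."""
    def antiderivative(x):
        return sum(c * x ** (i + 1) / (i + 1) for i, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test curve relative to the anchor at equal PSNR."""
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    center = 0.5 * (lo + hi)  # shift PSNR values to improve numerical conditioning
    fit_a = _fit_cubic([p - center for p in psnr_anchor], [math.log(r) for r in rate_anchor])
    fit_t = _fit_cubic([p - center for p in psnr_test], [math.log(r) for r in rate_test])
    gap = _integral(fit_t, lo - center, hi - center) - _integral(fit_a, lo - center, hi - center)
    return (math.exp(gap / (hi - lo)) - 1.0) * 100.0


# Hypothetical five-point curves: the test curve needs 10% less bitrate at every PSNR.
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
quality = [30.0, 33.5, 36.0, 38.2, 40.0]
test_rate = [0.9 * r for r in anchor_rate]
print(bd_rate(anchor_rate, quality, test_rate, quality))  # approximately -10.0
```

A negative result means the test configuration reaches the same PSNR at a lower bitrate than the anchor, which is why the gains in Table 5 appear as negative percentages.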

The proposed method achieves overall coding gains of 10.40% for the Y component and 19.22% and 16.52% for the Cb and Cr components, respectively, compared to the SVT-AV1-compressed content. The improvements are largest for high-resolution sequences: the 4 K content in Class A1 achieves BD-rates of −14.84%, −24.32%, and −21.17% for the Y, Cb, and Cr components, respectively, and the FHD content in Class A2 achieves BD-rates of −12.57%, −26.24%, and −21.78% for the Y, Cb, and Cr components, respectively.

3.1.3. Rate-Distortion Plot Analysis

The average rate-distortion (RD) curves for each video class are shown in Figure 6. Because the AVM-CTC contains many sequences, the bitrate and PSNR were averaged over the test sequences in each class, and these mean values were used to plot one RD curve per class. As shown in Figure 6, the proposed network demonstrates superior performance across the five QP levels and across video sequences of various resolutions. Interestingly, unlike the CNN-based post-filtering methods [11,26], which perform better on low-resolution videos and show greater improvement at lower bitrates, our proposed method excels on high-resolution sequences and provides consistent improvement across all bitrates.

3.1.4. Visual Quality Evaluation

Figure 7, Figure 8 and Figure 9 show the visual quality comparison between the original image from the AVM-CTC sequence, the SVT-AV1-compressed image, and the image processed with the proposed post-filter network. By integrating self-attention mechanisms with a CNN, the post-filter network specifically targets and reduces encoding artifacts such as blocking and ringing. The self-attention mechanism excels at capturing long-range spatial dependencies across different regions of the image, enabling the network to better restore fine details and maintain spatial coherence. This is particularly effective in areas where severe artifacts disrupt the image’s structure. Meanwhile, the CNN focuses on refining local spatial patterns, contributing to the overall reduction of artifacts. The yellow boxes highlight a zoomed-in section of the image to show detailed changes.

Figure 7 features an image from the Class A1 PierSeaSide sequence. In Figure 7c, blocking effects cause pixelation and distortion, which are particularly noticeable in the rigging wires, the edges of the “Harbor Office” sign, and the building structure. The blocking artifacts blur the wires, making them blend into the sky and surrounding structures, while the edges of the sign and windows appear jagged and poorly defined. These compression-induced artifacts result in a significant loss of clarity and fine details. After applying the proposed post-filter network, as seen in Figure 7d, many of these artifacts are successfully reduced, leading to a more natural and coherent appearance. The wires and the building structure regain some of their definition, and the edges of the “Harbor Office” sign and the building appear more refined. However, certain details, such as the thinner wires that were almost entirely lost in the compressed image, could not be fully recovered by the post-filter network. Despite this, the overall image demonstrates improved consistency and visual quality compared to the compressed version.

Figure 8 displays an image from Class A1 Tango. In Figure 8c, the edge of the woman’s face, particularly around the nose, is distorted due to the blocking effect, while severe color bleeding is evident in the lips and chin due to the red clothing of the person in the background. After applying the proposed post-filter, as seen in Figure 8d, the boundaries of the nose and the face of the person in the background become more distinct, and the color bleeding around the lips and chin is significantly reduced. Finally, Figure 9 presents an image from Class A2 RushFieldCuts. In the encoded picture in Figure 9c, the ringing effect is visible around the bodies of the running people on the stadium field, which, combined with the blocking effect in that area, results in a visually cluttered image. Figure 9d shows the image after applying the proposed post-filter network. Although it does not achieve the same level of sharpness as the original image, the background texture is much more consistent, and the shapes of the people are clearer.

4. Discussion

4.1. Computational Complexity Increase by Using Patch-Wise Self-Attention

The majority of the computational complexity in the proposed network arises from the third self-attention mechanism, PWSA, as shown in Table 6. For CWSA, the computational load is not significant, since it calculates self-attention across channels. In the case of BWSSA, calculating spatial self-attention for an entire image patch would require a tremendous amount of computation. To address this issue, our study divides the image patch into smaller blocks and computes spatial self-attention within each block. PWSA, as proposed in [24], was originally used to solve classification problems, where dividing an image into smaller patches keeps the computational load low. In our study, however, it is applied to an image regression problem, which requires an additional deconvolution process and thereby significantly increases the computational load. In future research, we plan to propose methods to optimize the computational load of PWSA to better suit the current network’s objectives.
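The saving obtained by restricting spatial self-attention to blocks can be illustrated by counting query-key pairs. The sketch below is a rough estimate under stated assumptions: attention is taken over every spatial position of a 256 × 256 feature map, and the block size of 16 × 16 is a hypothetical value chosen for illustration, not the size used in the proposed network.

```python
def attention_pairs_full(height, width):
    """Query-key pairs when every position attends to every other position."""
    tokens = height * width
    return tokens * tokens


def attention_pairs_blockwise(height, width, block):
    """Query-key pairs when attention is restricted to non-overlapping block x block tiles."""
    if height % block or width % block:
        raise ValueError("block size must divide both dimensions")
    num_blocks = (height // block) * (width // block)
    tokens_per_block = block * block
    return num_blocks * tokens_per_block ** 2


full = attention_pairs_full(256, 256)
blocked = attention_pairs_blockwise(256, 256, 16)  # 16 is a hypothetical block size
print(full)             # 4294967296
print(blocked)          # 16777216
print(full // blocked)  # 256
```

In general, the block-wise count is the full count divided by the number of blocks, so smaller blocks reduce the cost further at the price of a shorter attention range.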

4.2. Ablation Studies

4.2.1. Impact of Overlapping Sequences between Training and Testing Datasets

During the evaluation phase, we identified five sequences that overlap between the BVI-DVC training dataset and the AVM CTC testing dataset, as shown in Table 7. The sequences in the BVI-DVC dataset consisted of 64 frames, while their counterparts in the AVM CTC dataset contained 130 frames and were used for testing. Within these five sequences, the overlapping frames constitute nearly half of each sequence, potentially leading the model to memorize specific features during training. This overlap could result in an artificial boost in performance for these sequences during testing, as the model may have learned specific characteristics from these frames during training. Although this affects a small subset of sequences, we recognize that such an overlap could influence the overall evaluation results, particularly for these sequences, due to the risk of biased testing. To address this potential concern, we removed these five overlapping sequences from the results and recalculated the BD-rate using only the remaining non-overlapping sequences. The recalculated results are presented in Table 8, ensuring that the model’s performance is assessed purely on unseen data. The overall BD-rate performances for all CTC sequences, including the overlapping ones, are Y = −10.40%, Cb = −19.22%, and Cr = −16.52%. After removing the overlapping sequences, the recalculated BD-rates are Y = −10.17%, Cb = −18.88%, and Cr = −16.34%. The comparison between these two sets of results shows only minor differences, reaffirming that the network consistently demonstrates strong performance across both cases. The removal of the overlapping sequences did not significantly affect the overall conclusions, confirming that the proposed method remains robust and effective even when tested solely on non-overlapping data. This further highlights the network’s ability to generalize to unseen video sequences, providing confidence in its applicability to a wide range of real-world content.

4.2.2. Processing of Boundary Pixels in Training Data

In this study, the performance of post-filtering was improved using multiple self-attention mechanisms. When generating the training data, the BVI-DVC dataset [25] images were divided into 256 × 256 patches. The sizes of the BVI-DVC dataset images vary from 480 × 272 to 3840 × 2176, as shown in Table 1. The image dimensions are therefore not all multiples of 256, and the patches at the image edges are only partially filled. These edge patches can be handled in various ways; initially, we trained the network by simply filling the empty spaces with YCbCr values of 0, as shown in Figure 10a. While this method did not cause significant issues in a network using only a CNN [17], adding self-attention to the network resulted in performance degradation, making us skeptical about incorporating self-attention in this study.

Upon investigating the reasons for the performance drop, we found that the zero-filled BVI-DVC training patches were all dark at the bottom and that, unlike networks composed solely of a CNN, self-attention captures long-range dependencies in image data. As a result, when a patch at the edge of an image was input during testing, the network incorrectly reconstructed the image, making it dark like the BVI-DVC patches used in training, as shown in Figure 11a,b. To resolve this issue, instead of filling the empty spaces with 0, we filled them with boundary pixel values, as shown in Figure 10b, making the patch appear as if it was not located at the edge. This significantly improved the performance, as shown in Table 9, demonstrating the effectiveness of using self-attention.
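The two filling strategies compared in Figure 10 can be sketched on a small single-channel example. This is an illustrative sketch only: the function name is hypothetical, the patch is represented as a plain list of rows, and padding is assumed to be needed only on the bottom and right sides of a partial patch.

```python
def pad_patch(patch, size, mode="edge"):
    """Pad a 2-D list of samples on the bottom/right up to size x size.

    mode="zero" fills the empty area with 0 (Figure 10a);
    mode="edge" repeats the nearest boundary sample (Figure 10b).
    """
    if mode not in ("zero", "edge"):
        raise ValueError("mode must be 'zero' or 'edge'")
    h, w = len(patch), len(patch[0])
    padded = []
    for r in range(size):
        src = patch[min(r, h - 1)]  # last valid row is reused below the image
        if mode == "edge":
            row = [src[min(c, w - 1)] for c in range(size)]
        else:
            row = [src[c] if (r < h and c < w) else 0 for c in range(size)]
        padded.append(row)
    return padded


partial = [[1, 2],
           [3, 4]]
print(pad_patch(partial, 3, mode="zero"))  # [[1, 2, 0], [3, 4, 0], [0, 0, 0]]
print(pad_patch(partial, 3, mode="edge"))  # [[1, 2, 2], [3, 4, 4], [3, 4, 4]]
```

With zero filling, the padded area forms a dark band that attention layers can associate with image edges; with boundary extension, the padded area has the same statistics as the neighbouring content.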

Table 10 shows the results of applying this method to a network using only a CNN [17], which resulted in slight improvements but not as dramatic as those seen in the network using self-attention. Ultimately, it was determined that it is better not to use patches located at the edges for training, and thus, these patches were removed during the training process in this study.

5. Conclusions

In this study, we introduced a novel post-filtering approach for the AV1 codec that integrates multiple self-attention mechanisms into a CNN framework. The proposed method significantly enhances video quality by effectively reducing artifacts and improving the compression efficiency across various test sequences. The experimental results demonstrated that the proposed method achieves notable BD-rate reductions, with average gains of 10.40% for the Y component and 19.22% and 16.52% for the Cb and Cr components compared to SVT-AV1-compressed content. For high-resolution sequences in particular, our method showed substantial improvements, achieving BD-rates of −14.84%, −24.32%, and −21.17% for the Y, Cb, and Cr components in 4 K content and −12.57%, −26.24%, and −21.78% for the Y, Cb, and Cr components in FHD content.

Visual quality evaluations further confirmed the effectiveness of our method, highlighting significant artifact reduction and detail restoration in high-resolution images. The proposed network demonstrated consistent performance across different bitrates and video resolutions, outperforming traditional CNN-based post-filtering methods, particularly in high-resolution scenarios. These findings underscore the potential of incorporating self-attention mechanisms into CNN frameworks to enhance post-filtering techniques in video coding, offering a robust solution for improving AV1 codec performance and paving the way for future advancements in video compression technologies.

However, it is important to acknowledge the limitations of the proposed method. The primary limitation is the increased computational complexity, particularly due to the PWSA mechanism. While PWSA effectively divides the image into smaller patches to manage the computational load, the additional deconvolution process required for solving image regression problems significantly increases the overall computational burden and contributes to higher memory usage due to the large number of parameters involved. Another significant limitation is the reliance on five distinct models trained using different QP values (e.g., 20, 32, 43, 55, and 63), as outlined in the paper. While this approach allows for specialized optimization across different QP ranges, it introduces complexity in model management and deployment, as each QP range requires a different network. This could hinder the scalability and generalization of the method in practical applications.

Future research will focus on applying the proposed method to other widely adopted video codecs, particularly VVC, as part of ongoing efforts to compare its performance with existing AI models in this domain. By testing the network on VVC, we aim to verify the generalizability of the self-attention-based approach and conduct a more thorough comparison with other post-filtering methods designed for VVC. This extension will provide deeper insights into the effectiveness of our method across different codecs and enable us to validate its performance beyond AV1.

Additionally, future research should focus on optimizing both the computational and memory efficiency of PWSA to better align with the network’s objectives and to make the approach more feasible for real-time applications. An important direction for future work is to explore the possibility of simplifying the model architecture by developing a single network capable of handling the full range of QP values. This would reduce the overall complexity of the system, making it more practical and easier to implement without compromising video quality. Further exploration into reducing the complexity and memory requirements of self-attention mechanisms, as well as consolidating the multiple models into a single, more generalized network, could lead to more efficient and scalable solutions, paving the way for future enhancements in video compression technologies.

Author Contributions

Conceptualization, W.G.; methodology, W.G. and K.C.; software, W.G.; validation, W.G. and K.C.; formal analysis, K.C. and G.H.P.; investigation, W.G. and G.H.P.; resources, W.G.; data curation, W.G. and K.C.; writing—original draft preparation, W.G.; writing—review and editing, K.C. and G.H.P.; visualization, W.G.; supervision, K.C. and G.H.P.; project administration, K.C. and G.H.P.; funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00220204, Development of Standard Technologies on next-generation media compression for transmission and storage of Metaverse contents).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Liu, S.; Bross, B.; Chen, J. Versatile Video Coding (Draft 6). In Proceedings of the Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, 15th Meeting, Gothenburg, Sweden, 3–12 July 2019. JVET-O2001-vE.
2. Chen, Y.; Mukherjee, D.; Han, J.; Grange, A.; Xu, Y.; Liu, Z.; Parker, S.; Chen, C.; Su, H.; Joshi, U.; et al. An overview of core coding tools in the AV1 video codec. In Proceedings of the 2018 Picture Coding Symposium (PCS), San Francisco, CA, USA, 24–27 June 2018; pp. 41–45.
3. Han, J.; Li, B.; Mukherjee, D.; Chiang, C.; Grange, A.; Chen, C.; Su, H.; Parker, S.; Deng, S.; Joshi, U.; et al. A Technical Overview of AV1. arXiv 2020, arXiv:2008.06091.
4. Zou, N.; Zhang, H.; Cricri, F.; Youvalari, R.G.; Tavakoli, H.R.; Lainema, J.; Aksu, E.; Hannuksela, M.; Rahtu, E. Adaptation and Attention for Neural Video Coding. arXiv 2021, arXiv:2112.08767.
5. Wang, Y.; Zhu, H.; Li, Y.; Chen, Z.; Liu, S. Dense Residual Convolutional Neural Network based In-Loop Filter for HEVC. In Proceedings of the 2018 IEEE Visual Communications and Image Processing (VCIP), Taichung, Taiwan, 9–12 December 2018; pp. 1–4.
6. Chen, S.; Chen, Z.; Wang, Y.; Liu, S. In-Loop Filter with Dense Residual Convolutional Neural Network for VVC. In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China, 6–8 August 2020; pp. 149–152.
7. Zhao, Y.; Lin, K.; Wang, S.; Ma, S. Joint Luma and Chroma Multi-Scale CNN In-loop Filter for Versatile Video Coding. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 3205–3208.
8. Kathariya, B.; Li, Z.; Van der Auwera, G. Joint Pixel and Frequency Feature Learning and Fusion via Channel-Wise Transformer for High-Efficiency Learned In-Loop Filter in VVC. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 4070–4082.
9. Ding, D.; Chen, G.; Mukherjee, D.; Joshi, U.; Chen, Y. A CNN-based In-loop Filtering Approach for AV1 Video Codec. In Proceedings of the 2019 Picture Coding Symposium (PCS), Ningbo, China, 12–15 November 2019; pp. 1–5.
10. Xia, J.; Wen, J. Asymmetric Convolutional Residual Network for AV1 Intra in-Loop Filtering. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 1291–1295.
11. Guan, Z.; Xing, Q.; Xu, M.; Yang, R.; Liu, T.; Wang, Z. MFQE 2.0: A New Approach for Multi-Frame Quality Enhancement on Compressed Video. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 949–962.
12. Lin, W.; He, X.; Han, X.; Liu, D.; See, J.; Zou, J.; Xiong, H.; Wu, F. Partition-Aware Adaptive Switching Neural Networks for Post-Processing in HEVC. IEEE Trans. Multimedia 2020, 22, 2749–2763.
13. Zhang, F.; Feng, C.; Bull, D.R. Enhancing VVC through CNN-Based Post-Processing. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
14. Lin, J.; Yang, Y. Multi-Frequency Residual Convolutional Neural Network for Steganalysis of Color Images. IEEE Access 2021, 9, 141938–141950.
15. Liu, T.; Cui, W.; Hui, C.; Jiang, F.; Gao, Y.; Xie, S.; Wu, P. AHG11: Post-Process Filter Based on Fusion of CNN and Transformer. In Proceedings of the Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29, 26th Meeting, Teleconference, 20–29 April 2022. JVET-Z0101-v2.
16. Santamaria, M.; Yang, R.; Cricri, F.; Zhang, H.; Lainema, J.; Youvalari, R.G.; Tavakoli, H.R.; Hannuksela, M.M. Overfitting Multiplier Parameters for Content-Adaptive Post-Filtering in Video Coding. In Proceedings of the 10th European Workshop on Visual Information Processing (EUVIP), Lisbon, Portugal, 11–14 September 2022; pp. 1–6.
17. Das, T.; Choi, K.; Choi, J. High Quality Video Frames From VVC: A Deep Neural Network Approach. IEEE Access 2023, 11, 54254–54264.
18. Zhang, F.; Ma, D.; Feng, C.; Bull, D.R. Video Compression With CNN-Based Postprocessing. IEEE MultiMedia 2021, 28, 74–83.
19. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R. Early Convolutions Help Transformers See Better. arXiv 2021, arXiv:2106.14881v3.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Bello, I.; Zoph, B.; Le, Q.; Vaswani, A.; Shlens, J. Attention Augmented Convolutional Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294.
22. Li, Y.; Rusanovskyy, D.; Karczewicz, M. EE1-1.5: Report on Implementation of HOP In-Loop Filter with Transformer Blocks. In Proceedings of the Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29, 33rd Meeting, Teleconference, 17–26 January 2024. Document JVET-AG0162_v1.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
25. Ma, D.; Zhang, F.; Bull, D.R. BVI-DVC: A Training Database for Deep Video Compression. IEEE Trans. Multimedia 2022, 24, 3847–3858.
26. AOMediaCodec. SVT-AV1: Scalable Video Technology for AV1 Encoder. Available online: https://gitlab.com/AOMediaCodec/SVT-AV1 (accessed on 2 March 2024).
27. Zhao, X.; Lei, Z.; Norkin, A.; Daede, T.; Tourapis, A. AOM Common Test Conditions v2.0. Alliance for Open Media, Codec Working Group. 2021. Available online: https://aomedia.org/docs/CWG-B075o_AV2_CTC_v2.pdf (accessed on 2 March 2024).
28. Prangnell, L. Visible Light-Based Human Visual System Conceptual Model. arXiv 2016, arXiv:1609.04830.
29. Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2366–2369.
30. Alshina, E.; Galpin, F.; Rusanovskyy, D. AhG11/AhG14 teleconference. In Proceedings of the Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29, 33rd Meeting, Teleconference, 17–26 January 2024. JVET-AG0041-v1.
31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 8026–8037, Article 721.
32. Bjontegaard, G. Response to Call for Proposals for H.26L. ITU-T SG16 Doc. Q15-F-11. In Proceedings of the International Telecommunication Union, Sixth Meeting, Seoul, Republic of Korea, 3–6 November 1998.
33. Barman, N.; Martini, M.G.; Reznik, Y. Revisiting Bjontegaard Delta Bitrate (BD-BR) Computation for Codec Compression Efficiency Comparison. In Proceedings of the 1st Mile-High Video Conference, MHV ’22, New York, NY, USA, 1–3 March 2022; pp. 113–114.

Figure 1. (a) Illustration showing where in-loop filter is located in video codec pipeline; (b) illustration showing where post-filter is located in pipeline.

Figure 2. Proposed MTSA-based CNN.

Figure 3. (a) RCB; (b) CWSA.

Figure 4. (a) Simplified feature map with channel size of 3 and height and width sizes of 4; (b) feature map unfolded into smaller blocks; (c) feature map permuted and reshaped.

Figure 5. (a) BWSSA; (b) PWSA.

Figure 6. R-D curves by SVT-AV1 and MTSA. (a) class A1; (b) class A2; (c) class A3; (d) class A4; (e) class A5.

Figure 7. Example sequence of Class A1 PierSeaSide. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP55; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.

Figure 8. Example sequence of Class A1 Tango. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP55; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.

Figure 9. Example sequence of Class A2 RushFieldCuts. (a) Original image from the AVM-CTC sequence; (b) detail inside the yellow box from (a) in the original image; (c) detail inside the yellow box from (a) in the compressed image using SVT-AV1 with QP43; (d) detail inside the yellow box from (a) after applying the post-filter using the proposed network.

Figure 10. Methods to handle empty spaces for edge patches; (a) empty spaces filled with zero value; (b) empty spaces filled with edge pixel value extended.

Figure 11. Network wrongly turning edge pixel into darker value; (a) pixel value difference between the original video frame and the AV1-encoded frame; (b) pixel value difference between the original video frame and the AV1-encoded frame processed by the proposed network, with larger positive pixel differences in Y indicating that the processed frame is darker, at the bottom of the image.

Table 1. Details of the BVI-DVC dataset.

Class	Video Resolution	Number of Videos	Frames	Bit Depth	Chroma Sampling
A	3840 × 2176	200	64	10	4:2:0
B	1920 × 1088	200	64	10	4:2:0
C	960 × 544	200	64	10	4:2:0
D	480 × 272	200	64	10	4:2:0

Table 2. SVT-AV1 parameters used.

Configuration Parameter	Command Line	Range	Setting Used	Description
RateControlMode	--rc	[0–2]	0	Rate control mode [0: CRF or CQP (if --aq-mode is 0) [Default], 1: VBR, 2: CBR]
AdaptiveQuantization	--aq-mode	[0–2]	0	Set adaptive QP level [0: off, 1: variance base using AV1 segments, 2: deltaq pred efficiency]
QuantizationParameter	--qp	[1–63]	20, 32, 43, 55, 63	Initial QP level value
FrameRate	--fps	[1–240]	Sequence dependent	Input video frame rate, integer values only, inferred if y4m
EncoderColorFormat	--color-format	[0–3]	1	Color format, only yuv420 is supported at this time [0: yuv400, 1: yuv420, 2: yuv422, 3: yuv444]
EncoderBitDepth	--input-depth	[8, 10]	10	Input video file and output bitstream bit-depth
PredStructure	--pred-struct	[1–2]	2	Set prediction structure [1: low delay, 2: random access]

Table 3. Network models for QP range.

Model	QP Base Range
Model QP20	QP_base < 26
Model QP32	26 ≤ QP_base < 37.5
Model QP43	37.5 ≤ QP_base < 49
Model QP55	49 ≤ QP_base < 59
Model QP63	59 ≤ QP_base ≤ 63

Table 4. Class A sequences from AVM-CTC.

Class	Sequence	Resolution	Frame Rate	Bit-Depth
A1	BoxingPractice	3840 × 2160	59.94	10
	Crosswalk	3840 × 2160	59.94	10
	FoodMarket2	3840 × 2160	59.94	10
	Neon1224	3840 × 2160	29.97	10
	NocturneDance	3840 × 2160	60	10
	PierSeaSide	3840 × 2160	29.97	10
	Tango	3840 × 2160	59.94	10
	TimeLapse	3840 × 2160	59.94	10
A2	Aerial3200	1920 × 1080	59.94	10
	Boat	1920 × 1080	59.94	10
	CrowdRun	1920 × 1080	50	8 *
	DinnerSceneCropped	1920 × 1080	29.97	10
	FoodMarket	1920 × 1080	59.94	10
	GregoryScarf	1080 × 1920	30	10
	MeridianTalksdr	1920 × 1080	59.94	10
	Motorcycle	1920 × 1080	30	8 *
	OldTownCross	1920 × 1080	30	8 *
	PedestrianArea	1920 × 1080	25	8 *
	RitualDance	1920 × 1080	59.94	10
	Riverbed	1920 × 1080	25	8 *
	RushFieldCuts	1920 × 1080	29.97	8 *
	Skater227	1920 × 1080	30	10
	ToddlerFountainCropped	1080 × 1920	29.97	10
	TreesAndGrass	1920 × 1080	30	8 *
	TunnelFlag	1920 × 1080	59.94	10
	Verticalbees	1080 × 1920	29.97	8 *
	WorldCup	1920 × 1080	30	8 *
A3	ControlledBurn	1280 × 720	30	8 *
	DrivingPOV	1280 × 720	59.94	10
	Johnny	1280 × 720	60	8 *
	KristenAndSara	1280 × 720	60	8 *
	RollerCoaster	1280 × 720	59.94	10
	Vidyo3	1280 × 720	60	8 *
	Vidyo4	1280 × 720	60	8 *
	WestWindEasy	1280 × 720	30	8 *
A4	BlueSky	640 × 360	25	8 *
	RedKayak	640 × 360	29.97	8 *
	SnowMountain	640 × 360	29.97	8 *
	SpeedBag	640 × 360	29.97	8 *
	Stockholm	640 × 360	59.94	8 *
	TouchdownPass	640 × 360	29.97	8 *
A5	FourPeople	480 × 270	60	8 *
	ParkJoy	480 × 270	50	8 *
	SparksElevator	480 × 270	59.94	10
	VerticalBayshore	270 × 480	29.97	8 *

* 8-bit sequences were converted to 10-bit for testing.
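The conversion method is not specified in the note above. One common option, shown here purely as an assumption, is to shift each 8-bit sample left by two bits so that the range 0–255 maps to 0–1020. The function name `to_10bit` is hypothetical.

```python
def to_10bit(samples_8bit):
    """Scale 8-bit samples to the 10-bit range with a left shift of 2.

    This is an assumed conversion, not necessarily the one used for testing.
    """
    converted = []
    for sample in samples_8bit:
        if not 0 <= sample <= 255:
            raise ValueError("expected an 8-bit sample in [0, 255]")
        converted.append(sample << 2)  # multiply by 4
    return converted


if __name__ == "__main__":
    print(to_10bit([0, 128, 255]))
```

A left shift leaves the two least significant bits at zero, so the top of the 10-bit range (1021–1023) is never reached; other conversions fill those bits differently.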

Table 5. BD-BR for all AVM CTC sequences in random access configuration.

Class	Sequence	Y BD-BR (%)	Cb BD-BR (%)	Cr BD-BR (%)
A1	BoxingPractice	−18.31%	−27.86%	−22.23%
	Crosswalk	−13.06%	−19.48%	−11.35%
	FoodMarket2	−12.61%	−19.29%	−21.84%
	Neon1224	−16.37%	−23.06%	−25.46%
	NocturneDance	−19.60%	−24.55%	−20.02%
	PierSeaSide	−15.52%	−26.94%	−25.54%
	Tango	−16.64%	−34.52%	−30.50%
	TimeLapse	−6.60%	−18.85%	−12.40%
	Average	−14.84%	−24.32%	−21.17%
A2	Aerial3200	−4.54%	−16.19%	−24.99%
	Boat	−10.03%	−42.92%	−26.80%
	CrowdRun	−13.21%	−32.24%	−27.83%
	DinnerSceneCropped	−14.09%	−35.18%	−18.85%
	FoodMarket	−11.67%	−28.86%	−25.61%
	GregoryScarf	−11.08%	−37.28%	−22.58%
	MeridianTalksdr	−12.51%	−33.22%	−21.34%
	Motorcycle	−10.98%	−21.85%	−23.55%
	OldTownCross	−13.94%	−45.59%	−21.14%
	PedestrianArea	−15.98%	−17.45%	−20.61%
	RitualDance	−16.57%	−26.43%	−34.26%
	Riverbed	−9.93%	−12.56%	−10.13%
	RushFieldCuts	−10.73%	−15.35%	−13.92%
	Skater227	−15.60%	−19.88%	−12.87%
	ToddlerFountainCropped	−12.13%	−21.78%	−21.56%
	TreesAndGrass	−4.39%	−17.25%	−6.69%
	TunnelFlag	−21.59%	−40.11%	−45.87%
	Verticalbees	−11.45%	−10.59%	−12.68%
	WorldCup	−18.46%	−23.86%	−22.63%
	Average	−12.57%	−26.24%	−21.78%
A3	ControlledBurn	−7.83%	−26.81%	−21.66%
	DrivingPOV	−14.20%	−24.55%	−22.48%
	Johnny	−12.36%	−13.77%	−14.01%
	KristenAndSara	−10.60%	−13.89%	−12.38%
	RollerCoaster	−17.18%	−18.66%	−27.76%
	Vidyo3	−11.03%	−6.71%	−7.97%
	Vidyo4	−9.93%	−11.60%	−15.35%
	WestWindEasy	−12.67%	−38.16%	−25.02%
	Average	−11.98%	−19.27%	−18.33%
A4	BlueSky	−10.21%	−14.54%	−29.31%
	RedKayak	−6.13%	−16.46%	14.80%
	SnowMountain	4.45%	1.49%	2.52%
	SpeedBag	−9.24%	−8.94%	−11.27%
	Stockholm	−10.53%	−10.64%	−19.96%
	TouchdownPass	−7.64%	−21.13%	−16.71%
	Average	−6.55%	−11.70%	−9.99%
A5	FourPeople	−7.66%	−14.09%	−13.12%
	ParkJoy	−3.92%	−16.67%	−8.84%
	SparksElevator	−4.11%	−10.28%	−10.62%
	VerticalBayshore	−8.61%	−17.32%	−12.70%
	Average	−6.07%	−14.59%	−11.32%
Average		−10.40%	−19.22%	−16.52%
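The overall averages in the last row of Table 5 are consistent with an unweighted mean of the five class averages, rather than a mean over all 45 sequences (which would give roughly −11.5% for Y). The short check below reproduces the last row under that reading; the names `CLASS_AVERAGES` and `mean_of_class_averages` are illustrative.

```python
# Class averages (Y, Cb, Cr) in percent, copied from Table 5.
CLASS_AVERAGES = {
    "A1": (-14.84, -24.32, -21.17),
    "A2": (-12.57, -26.24, -21.78),
    "A3": (-11.98, -19.27, -18.33),
    "A4": (-6.55, -11.70, -9.99),
    "A5": (-6.07, -14.59, -11.32),
}


def mean_of_class_averages(class_averages):
    """Average each component over classes, giving every class equal weight."""
    count = len(class_averages)
    return tuple(
        round(sum(values[i] for values in class_averages.values()) / count, 2)
        for i in range(3)
    )


if __name__ == "__main__":
    y, cb, cr = mean_of_class_averages(CLASS_AVERAGES)
    print(f"Y {y:.2f}%  Cb {cb:.2f}%  Cr {cr:.2f}%")
```

Equal class weighting means the four A5 sequences count as much as the nineteen A2 sequences in the reported overall figure.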

Table 6. Number of parameters for each sub-network in the proposed network.

Network	Number of Parameters
RCB	295,682
CWSA	49,536
BWSSA	49,536
PWSA	12,583,296

Table 7. Overlapping sequences in BVI-DVC and AVM-CTC datasets.

BVI-DVC	AVM-CTC
BoxingPracticeHarmonics	BoxingPractice
DCrosswalkHarmonics	Crosswalk
CrowdRunMCLV	CrowdRun
TunnelFlagS1Harmonics	TunnelFlag
DrivingPOVHarmonics	DrivingPOV
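Every BVI-DVC name in Table 7 contains its AVM-CTC counterpart as a substring, so a first-pass overlap check can be sketched as a case-insensitive substring search. This heuristic is an assumption for illustration, not the procedure used to build the table, and any match would still need manual confirmation. The function name `find_overlaps` is hypothetical.

```python
def find_overlaps(training_names, test_names):
    """Return (training, test) pairs where the test name occurs in the training name."""
    pairs = []
    for train in training_names:
        for test in test_names:
            if test.lower() in train.lower():
                pairs.append((train, test))
    return pairs


if __name__ == "__main__":
    bvi_dvc = [
        "BoxingPracticeHarmonics", "DCrosswalkHarmonics", "CrowdRunMCLV",
        "TunnelFlagS1Harmonics", "DrivingPOVHarmonics",
    ]
    # A few AVM-CTC names, including two that do not overlap.
    avm_ctc = ["BoxingPractice", "Crosswalk", "CrowdRun", "TunnelFlag",
               "DrivingPOV", "Tango", "Boat"]
    for train, test in find_overlaps(bvi_dvc, avm_ctc):
        print(f"{train} <-> {test}")
```

Substring matching can produce false positives for names that share a prefix (for example FoodMarket and FoodMarket2), which is one reason to treat its output only as a list of candidates.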

Table 8. BD-BR for AVM CTC with overlapping sequences from training removed in random access configuration.

Class	Sequence	Y BD-BR (%)	Cb BD-BR (%)	Cr BD-BR (%)
A1	FoodMarket2	−12.61%	−19.29%	−21.84%
	Neon1224	−16.37%	−23.06%	−25.46%
	NocturneDance	−19.60%	−24.55%	−20.02%
	PierSeaSide	−15.52%	−26.94%	−25.54%
	Tango	−16.64%	−34.52%	−30.50%
	TimeLapse	−6.60%	−18.85%	−12.40%
	Average	−14.56%	−24.53%	−22.63%
A2	Aerial3200	−4.54%	−16.19%	−24.99%
	Boat	−10.03%	−42.92%	−26.80%
	DinnerSceneCropped	−14.09%	−35.18%	−18.85%
	FoodMarket	−11.67%	−28.86%	−25.61%
	GregoryScarf	−11.08%	−37.28%	−22.58%
	MeridianTalksdr	−12.51%	−33.22%	−21.34%
	Motorcycle	−10.98%	−21.85%	−23.55%
	OldTownCross	−13.94%	−45.59%	−21.14%
	PedestrianArea	−15.98%	−17.45%	−20.61%
	RitualDance	−16.57%	−26.43%	−34.26%
	Riverbed	−9.93%	−12.56%	−10.13%
	RushFieldCuts	−10.73%	−15.35%	−13.92%
	Skater227	−15.60%	−19.88%	−12.87%
	ToddlerFountainCropped	−12.13%	−21.78%	−21.56%
	TreesAndGrass	−4.39%	−17.25%	−6.69%
	Verticalbees	−11.45%	−10.59%	−12.68%
	WorldCup	−18.46%	−23.86%	−22.63%
	Average	−12.00%	−25.07%	−20.01%
A3	ControlledBurn	−7.83%	−26.81%	−21.66%
	Johnny	−12.36%	−13.77%	−14.01%
	KristenAndSara	−10.60%	−13.89%	−12.38%
	RollerCoaster	−17.18%	−18.66%	−27.76%
	Vidyo3	−11.03%	−6.71%	−7.97%
	Vidyo4	−9.93%	−11.60%	−15.35%
	WestWindEasy	−12.67%	−38.16%	−25.02%
	Average	−11.66%	−18.51%	−17.74%
A4	BlueSky	−10.21%	−14.54%	−29.31%
	RedKayak	−6.13%	−16.46%	14.80%
	SnowMountain	4.45%	1.49%	2.52%
	SpeedBag	−9.24%	−8.94%	−11.27%
	Stockholm	−10.53%	−10.64%	−19.96%
	TouchdownPass	−7.64%	−21.13%	−16.71%
	Average	−6.55%	−11.70%	−9.99%
A5	FourPeople	−7.66%	−14.09%	−13.12%
	ParkJoy	−3.92%	−16.67%	−8.84%
	SparksElevator	−4.11%	−10.28%	−10.62%
	VerticalBayshore	−8.61%	−17.32%	−12.70%
	Average	−6.07%	−14.59%	−11.32%
Average		−10.17%	−18.88%	−16.34%
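To make the effect of the exclusion concrete, the sketch below recomputes the class A1 luma average with and without the sequences listed in Table 7. The per-sequence values are copied from Table 5; the names `A1_Y`, `OVERLAPPING`, and `class_average` are illustrative.

```python
# Y BD-BR (%) for class A1, copied from Table 5.
A1_Y = {
    "BoxingPractice": -18.31, "Crosswalk": -13.06, "FoodMarket2": -12.61,
    "Neon1224": -16.37, "NocturneDance": -19.60, "PierSeaSide": -15.52,
    "Tango": -16.64, "TimeLapse": -6.60,
}

# AVM-CTC sequences that also appear in the training data (Table 7).
OVERLAPPING = {"BoxingPractice", "Crosswalk", "CrowdRun", "TunnelFlag", "DrivingPOV"}


def class_average(results, excluded=frozenset()):
    """Mean BD-BR over the sequences that are not excluded, rounded to 2 decimals."""
    kept = [value for name, value in results.items() if name not in excluded]
    return round(sum(kept) / len(kept), 2)


if __name__ == "__main__":
    print("A1 Y, all sequences:   ", class_average(A1_Y))
    print("A1 Y, overlaps removed:", class_average(A1_Y, OVERLAPPING))
```

The two results match the A1 luma averages in Tables 5 and 8 (−14.84% and −14.56%), so removing the overlapping sequences changes this class average by under 0.3 percentage points.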

Table 9. Performance differences based on empty space handling methods for edge patches in testing images in MTSA.

Class	Zero-Filled Y	Zero-Filled Cb	Zero-Filled Cr	Edge-Extended Y	Edge-Extended Cb	Edge-Extended Cr
A2	−7.42%	−19.78%	−21.06%	−11.72%	−18.82%	−21.27%
A3	−2.84%	−17.02%	−16.72%	−10.34%	−16.22%	−18.73%
A4	2.71%	−11.63%	−10.88%	−6.07%	−11.75%	−11.49%
A5	−2.16%	−12.71%	−12%	−5.01%	−15.55%	−13%
Average	−2.43%	−15.28%	−15.17%	−8.28%	−15.58%	−16.04%

Table 10. Performance differences based on empty space handling methods for edge patches in testing images in traditional CNN.

Class	Zero-Filled Y	Zero-Filled Cb	Zero-Filled Cr	Edge-Extended Y	Edge-Extended Cb	Edge-Extended Cr
A2	−5.17%	−15.38%	−14.66%	−5.36%	−15.44%	−14.74%
A3	−6.26%	−13.59%	−14.85%	−6.55%	−14.14%	−14.96%
A4	−3.92%	−8.91%	−9.83%	−4.14%	−9.23%	−9.83%
A5	−2.71%	−13.88%	−11.58%	−2.90%	−14.17%	−11.70%
Average	−4.51%	−12.94%	−12.73%	−4.74%	−13.24%	−12.81%
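The two handling methods compared in Tables 9 and 10 can be illustrated with a small padding routine. This is a minimal sketch under stated assumptions: the image is a 2-D list of samples, the empty space lies only on the right and bottom, and the helper name `pad_to_multiple` and its `mode` values are hypothetical rather than taken from the paper.

```python
def pad_to_multiple(image, patch, mode="edge"):
    """Pad a 2-D list on the right and bottom to a multiple of the patch size.

    mode="zero" fills the empty space with zeros;
    mode="edge" repeats the nearest edge sample instead.
    """
    if mode not in ("zero", "edge"):
        raise ValueError("mode must be 'zero' or 'edge'")
    height, width = len(image), len(image[0])
    pad_h = (-height) % patch  # rows missing to reach the next multiple
    pad_w = (-width) % patch   # columns missing to reach the next multiple

    padded = []
    for row in image:
        fill = row[-1] if mode == "edge" else 0
        padded.append(list(row) + [fill] * pad_w)
    for _ in range(pad_h):
        if mode == "edge":
            padded.append(list(padded[height - 1]))  # repeat last real row
        else:
            padded.append([0] * (width + pad_w))
    return padded


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    for mode in ("zero", "edge"):
        print(mode)
        for row in pad_to_multiple(image, patch=2, mode=mode):
            print("  ", row)
```

With zero filling, an edge patch contains an artificial step between real samples and the padded area; edge extension avoids that discontinuity, which is one plausible reading of why the luma results in Table 9 differ so strongly between the two methods.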

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).