1. Introduction
With the emergence of 4K/8K, 3D, VR/AR/MR, High Frame Rate (HFR), High Dynamic Range (HDR), and other Ultra-High-Definition (UHD) video, the requirements on video compression capability are becoming increasingly stringent, and the need for more efficient video coding standards is increasingly urgent. To satisfy the requirements of ultra-high-definition video compression, the Joint Video Experts Team (JVET), jointly established by the Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG), developed Versatile Video Coding (VVC), which was finalized in July 2020 as the next-generation international video coding standard [
1]. VVC inherits the basic framework of HEVC and introduces new techniques such as the quad-tree with nested multi-type tree (QTMTT) partitioning structure [
2], Intra Sub-Partitions (ISP), Multiple reference line (MRL), Local Illumination Compensation (LIC), Bi-directional Optical flow (BIO), Affine motion compensated prediction (AMC) [
3], and Adaptive Loop Filtering (ALF). These new techniques give VVC significantly stronger coding performance than the preceding High Efficiency Video Coding (HEVC) standard, but they also bring a substantial increase in coding time complexity. Among them, the QTMTT structure is the main contributor to this increase [
4,
5]. In HEVC, only the quadtree (QT) partitioning structure is allowed for partitioning each coding tree unit (CTU), whereas QTMTT supports binary-tree (BT) and ternary-tree (TT) partitioning in addition to QT partitioning. This flexibility allows adaptation to different image features, making the partitioning structure more flexible and improving coding efficiency. However, the introduction of BT and TT leads to a more complex partition search process, resulting in significant time complexity overhead. Previous studies have shown that partition search accounts for more than 90% of the time complexity in VVC [
6]. Therefore, it is necessary to reduce the time complexity of VVC and decrease the time spent in the partition search process [
7].
In past research, fast CU partitioning has been widely used to reduce time complexity, and a large variety of fast partitioning methods has emerged. In HEVC, where only the single QT partitioning structure exists, numerous studies achieved superior performance by deciding whether to partition through manually designed features. For VVC, however, the newly introduced BT and TT partition structures increase the number of partitioning modes to six: quadtree (QT), horizontal binary-tree (BTH), vertical binary-tree (BTV), horizontal ternary-tree (TTH), vertical ternary-tree (TTV), and non-split (NS). CU partitioning has thus become much more flexible, making it difficult to apply previous algorithms directly to reduce the complexity overhead. Recent studies have attempted to address this challenge through statistical analysis or machine learning methods, whose fundamental idea is to skip modes and terminate early.
The new QTMTT structure, as shown in
Figure 1a, allows CUs to be split with five partition modes and greatly enhances the flexibility of CU blocks. Compared with HEVC, to improve coding efficiency and adapt to higher-resolution video content, VVC increases the default CTU size to 128 × 128 and specifies a minimum CU size of 4 × 4. During the CU partitioning process of VVC, the CTU is always quadtree partitioned first, since the maximum CU is 64 × 64 by default. Furthermore, VVC specifies that only quadtree partitioning is allowed until the CU size reaches 32 × 32. Five partition modes are allowed for 32 × 32 CUs and their sub-CUs, which results in 15 different sizes of CU blocks, as shown in
Figure 1b. This partitioning structure greatly increases the flexibility of CU blocks, and the different CU sizes can better adapt to different texture features. The CU partitioning process is essentially a brute-force search: VVC checks the RD cost of all partitioning modes for each CU and its sub-CUs and selects the partitioning combination with the optimal RD cost. In HEVC, up to 21 combinations need to be checked for each 32 × 32 CU, but in VVC, 361 combinations need to be checked. Therefore, the QTMTT structure significantly increases the overall coding complexity of VVC.
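To make the cost of this exhaustive search concrete, the following toy sketch recursively evaluates the non-split, quad-tree, and binary-tree options for a block and keeps the cheapest combination; it is only an illustration, not the VTM implementation, and the variance-based rd_cost, the omission of ternary-tree splits, and the 8-pixel minimum split size are simplifying assumptions.

import numpy as np

def rd_cost(block):
    # Stand-in for the true rate-distortion cost; here simply the pixel variance.
    return float(np.var(block))

def splits(block):
    # Enumerate a simplified set of splits (QT, BTH, BTV); TT is omitted for brevity.
    h, w = block.shape
    out = []
    if h >= 8 and w >= 8:   # quad-tree
        out.append([block[:h//2, :w//2], block[:h//2, w//2:],
                    block[h//2:, :w//2], block[h//2:, w//2:]])
    if h >= 8:              # horizontal binary-tree
        out.append([block[:h//2, :], block[h//2:, :]])
    if w >= 8:              # vertical binary-tree
        out.append([block[:, :w//2], block[:, w//2:]])
    return out

def best_cost(block):
    # Recursively compare non-split against every split and keep the cheapest option.
    cost = rd_cost(block)
    for sub_blocks in splits(block):
        cost = min(cost, sum(best_cost(b) for b in sub_blocks))
    return cost

print(best_cost(np.random.randint(0, 255, (32, 32)).astype(np.float32)))

Even with ternary-tree splits left out, this recursion already evaluates thousands of candidate blocks for a single 32 × 32 CU, which is why skipping part of the search pays off.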
Due to the complexity of the QTMTT structure, CU blocks come in a wide variety of sizes, which makes unified inference with a single model difficult. In this paper, we propose a new block segmentation and block connection (BSC) structure that represents some of the blocks in a QTMTT at the same size and then uses CNN models at different levels to directly predict the partitioning mode of each block. In the CNN models, we introduce symmetric and asymmetric convolutional kernels to extract texture features of different dimensions and add hand-crafted features in the fully connected layer to make the models more compatible with the BSC structure. After obtaining the model output, we use a multi-thresholding scheme to decide the final partitioning mode, making the trade-off between coding time complexity and coding performance tunable.
The main contributions of this paper are as follows:
Block segmentation and block connection structure: we design a new representation structure based on texture features that represents a subset of CU blocks at the same size.
Different levels of CNN models: we design CNN models at different levels to predict the partitioning modes of CU blocks, introduce asymmetric convolutional kernels to extract different features, and add several external features, demonstrating their effectiveness.
Multi-thresholding design: we propose a multi-thresholding scheme that sets different thresholds for the different levels, realizing a trade-off between coding time complexity and coding performance.
The rest of the paper is organized as follows.
Section 2 summarizes the background and related work.
Section 3 explores the details of the overall algorithm.
Section 4 presents the experimental results.
Section 5 describes the conclusions.
2. Related Works
In previous work, a large amount of effort has been devoted to fast CU partitioning, with the expectation of reducing the time complexity of CU partitioning. We categorize the previous methods into two groups: statistical analysis-based methods and machine learning-based methods. In this section, we summarize and review the previous research results.
2.1. Methods Based on Statistical Analysis
Methods based on statistical analysis attempt to utilize the data or features generated during the coding process to determine the final classification result. In [
8], a bi-directional depth search method is proposed using previously encoded CUs and predicted mode costs. In [
9], an adaptive early-termination algorithm based on coding unit depth history is proposed, which tracks CU depth history based on CTU temporal correlation to determine the depth range of the target CUs and terminate partitioning early. In [
10], depth difference and RD loss ratio are utilized to model and perform split and early termination decisions for CUs. In [
11], the correlation of the rate-distortion cost distribution of neighboring blocks under different quantization parameters is analyzed, and a threshold is set based on this correlation to terminate the partition early. In [
12], a methodology is introduced for Coding Unit (CU) segmentation based on the keypoint-based CU depth decision (KCD). Meanwhile, in [
13], a novel fast CU segmentation decision approach is proposed, leveraging SAO edge category information as spatiotemporal encoding parameters.
In [
14], a methodology is presented where partition modes are determined based on the gradient features and variance of the partitions. In [
15], a methodology is introduced that utilizes the sum of the mean absolute deviation (SMAD) to quantitatively measure the vertical and horizontal texture complexity. In [
16], the features of the current block and coding context are explored based on the selected intra-prediction mode. This exploration aims to skip unnecessary partitioning computations. In [
17], the encoding results of BT partitions are incorporated as features in the decision process for TT partitioning. This inclusion is aimed at reducing the time complexity of TT partitioning. In [
18], an efficient algorithm is proposed for the selection of the partitioning direction of the current CU. This algorithm utilizes entropy and texture contrast as effective features to discriminate the optimal partitioning direction for the current CU. In [
19], depth information from temporally and spatially adjacent blocks is extracted to predict the optimal depth of CUs. This enables the early termination of partitioning, reducing unnecessary time expenditures. In [
20], explicit VVC features (EVFs) and derived VVC features (DVFs) are manually designed based on the correlation with the QTMTT structure. These features are utilized to facilitate the early termination of the nested TT block structure after QT partitioning. In [
21], the distortion of CUs is obtained by calculating the difference between the original luminance pixels and the predicted luminance pixels. This feature is utilized to establish an early skip decision model for both BT and TT partitioning for each CU.
However, these algorithms rely heavily on manually designed features and heuristic rules to make decisions for video coding. Although such methods perform well, manually designed features usually struggle to capture complex nonlinear relationships and to generalize to different situations. Moreover, features summarized from data analysis are not representative of all situations and only reflect high correlations, which makes it difficult for these algorithms to improve their performance further.
2.2. Methods Based on Machine Learning
In recent years, machine learning has been widely used in a large number of studies, has shown excellent performance, and is rapidly being adopted in the field of fast video coding. Machine learning models can automatically learn features from large amounts of data and can represent complex nonlinear relationships well. In [
22,
23,
24], each Coding Tree Unit (CTU) is partitioned into blocks of size 32 × 32. A Convolutional Neural Network (CNN) is employed to predict the depth range of 32 × 32 Coding Unit (CU) blocks, facilitating the premature termination of unnecessary rate-distortion optimization (RDO) computations. However, this premature termination strategy may yield counterproductive results in cases of high texture complexity, as there is no need to skip RDO calculations in such scenarios. In [
25,
26], a partition map is employed to represent the block partitioning structure based on QTMTT. In these works, convolutional neural network (CNN) models are constructed to predict the optimal partition map based on the original pixel values. However, the prediction of partition maps relies on intricate sub-networks, leading to suboptimal computational efficiency and posing challenges for hardware implementation. Moreover, the complexity of the partition maps may contribute to inaccuracies in the prediction process. In [
27,
28], a methodology is implemented wherein each 64 × 64 block is subdivided into multiple 4 × 4-sized sub-blocks. CNN is leveraged to deduce, for each 4 × 4 sub-block, a probability vector denoting the likelihood of its borders serving as partition boundaries. This information is harnessed to formulate the overall partitioning structure for the 64 × 64 block.
In [
29], a methodology is introduced wherein 32 × 32 blocks are stratified based on their side lengths. The proposed Hierarchical Grid Fully Convolutional Network (HG-FCN) is employed to predict probability vectors denoting the likelihood of side lengths serving as boundaries within 32 × 32 blocks across different hierarchical levels. In [
30], a partition homogeneity map (PHM) is introduced, and a Fully Convolutional Network (FCN) is employed to infer the final results. However, fundamentally, this approach aims to predict the probability of 7 × 7 blocks serving as boundaries. The inherent bottom–up predictive structure may introduce redundant computations and compromise the accuracy of predictions.
In [
31], an approach is introduced, employing asymmetric convolutional kernels for the prediction of partition modes. Similarly, in [
32], CNN is utilized to directly predict the partition modes of CUs. In [
33,
34], an adaptive pooling-variable CNN is proposed to predict the partitioning of CUs of varying sizes. However, the pooling process inevitably introduces the loss of certain features.
In the realm of current video coding research, efficient CU partitioning strategies are crucial for enhancing the encoding efficiency and reducing computational complexity. Although various deep learning-based approaches have been proposed to predict the optimal partitioning modes of CUs, these methods typically confront the challenge of utilizing models to replace partition search. Addressing this issue, this paper introduces an innovative BSC structure capable of representing CUs of varying sizes as model inputs of the same dimension. With this structure, we can directly predict a wider range of CU partitioning modes using a CNN model. Moreover, our approach not only maintains high accuracy but also incorporates a threshold scheme to flexibly adjust the acceleration efficiency, thereby achieving a superior balance between encoding efficiency and computational complexity. Our “Fast” scheme reduces the average complexity by 57.14% and increases the BDBR by 1.86%, while the “Moderate” scheme reduces the average complexity by 50.14% and increases the BDBR by only 1.39%.
3. The Proposed Method
Our proposed method uses a CNN model to replace part of the time-consuming partition search process.
Figure 2 shows the flowchart of the overall algorithm. The BSC structure and the CNN models are used to determine the partitioning modes of most CUs, and the encoder's partitioning process is guided by the model predictions. The overall architecture of the method contains BSC mapping, partition mode prediction, post-processing, and CU encoding. We design a 64 × 64 CNN model, a 32 × 32 CNN model, and a 16 × 16 CNN model to predict the partitioning modes of CUs of the corresponding sizes. To enable more CUs to be predicted by the 16 × 16 CNN model, we propose the BSC structure, which maps more blocks to 16 × 16. Considering that predictions for BSC-mapped blocks may not conform to the partitioning rules, we design a post-processing algorithm to solve this problem. In addition, to increase the accuracy of the model predictions and reduce errors, we design a multi-thresholding scheme to further increase the stability of the model. After obtaining the prediction results, the encoder can directly skip the partition search at that level. Owing to the multi-thresholding scheme and the multi-level network structure, the encoder can freely choose whether to use the network at a given level and can adjust the threshold at any level, thus realizing flexible acceleration settings.
3.1. Block Segmentation and Block Connection Structure
VVC spends a great deal of time on the partition search process, so streamlining this process is crucial for encoder efficiency. However, CUs in the same layer may have different sizes and CUs in different layers may have the same size, and this irregularity makes direct prediction difficult. To predict CUs directly, the first problem to solve is finding a single representation that unifies CU sizes. As shown in
Figure 3, we counted the blocks in the dataset to which the BSC structure applies. It is clear that the BSC structure enables the 16 × 16 CNN model to process an additional 40% of CU blocks, which is highly valuable for reducing the encoding time.
We propose a new BSC structure based on the correlation of image texture features, aiming to represent part of the CUs at the same size. As shown in
Figure 4a, for CUs of 16 × 32 and 32 × 16 size, we segment the CU from the middle into two 16 × 16 images that represent it. For CUs of 8 × 32 and 32 × 8 size, we first split the CU from the center into two 8 × 16 (16 × 8) images and then concatenate the left (bottom) half with the right (top) half to form a new 16 × 16 image. In this way, the BSC structure unifies four different CU sizes into the same 16 × 16 representation, on which models can perform unified inference.
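A minimal numpy sketch of the two mappings is given below; the blocks are treated as (height, width) arrays, and the ordering of the concatenated halves follows our reading of Figure 4a, so it should be taken as an assumption rather than the exact implementation.

import numpy as np

def block_segmentation(cu):
    # Split a 16x32 or 32x16 luma block from the middle into two 16x16 blocks.
    h, w = cu.shape
    if (h, w) == (16, 32):
        return cu[:, :16], cu[:, 16:]
    if (h, w) == (32, 16):
        return cu[:16, :], cu[16:, :]
    raise ValueError("block segmentation applies to 16x32 / 32x16 CUs only")

def block_connection(cu):
    # Split an 8x32 (32x8) block from the middle and stack the halves into one 16x16 block.
    h, w = cu.shape
    if (h, w) == (8, 32):
        return np.vstack([cu[:, :16], cu[:, 16:]])   # left half above right half
    if (h, w) == (32, 8):
        return np.hstack([cu[:16, :], cu[16:, :]])   # top half beside bottom half
    raise ValueError("block connection applies to 8x32 / 32x8 CUs only")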
Specifically, for each possible partitioning, we map the texture features and the partitioning mode of the original CU onto the newly composed 16 × 16 image. As shown in
Figure 4b, in the block segmentation structure (taking 16 × 32 as an example), the NS mode indicates that the texture of the original CU is not complex and does not need to be partitioned; we therefore consider that neither of the two segmented images has complex texture, so both correspond to the NS mode. For the BTV and TTV partition modes, there is a difference between the texture features of the left and right parts of the original image, and we assume that the two segmented images still retain this difference, so they correspond to the original partition modes. For the TTH partition mode, the original CU exhibits three different levels of texture features in the vertical direction; when the CU is cut from the middle, each half exhibits two of these levels, which corresponds to the BTH partition mode. It is worth noting that the BSC structure has no design for the BTH partition case: when the CU is cut from the middle, the two new images carry information about the next-level partitioning, which cannot be extracted at the current level. We handle this situation with the original encoder in post-processing.
As shown in
Figure 4c, in the block connection structure (taking 8 × 32 as an example), the NS mode indicates that the texture of the original CU is not complex and does not need to be divided, and the connected image therefore corresponds to the NS mode. The BTV mode means that the texture information differs between the left and right sides of the original image; after the connect operation, this appears as a three-level structure in the new image, corresponding to the TTV mode. The BTH mode means that the upper and lower sides have different texture features, which after the connect operation becomes a difference between the left and right sides of the new image, corresponding to the BTV mode. In the TTH mode, the CU exhibits three levels of texture features in the horizontal direction, and after the connection this characteristic is more consistent with the QT mode. In general, BSC is a complete mapping from a CU to a 16 × 16 image; therefore, the partitioning decision of the CU can be converted into a prediction on a 16 × 16 image. The BSC structure is equivalent to preprocessing the original CU to facilitate the inference of the CNN model and is not encoded as coding information. After the CNN obtains the partition mode, the inference result is mapped back to the partition mode of the original CU, and the original CU is encoded accordingly.
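The mode correspondences described above can be written as small lookup tables; the following sketch reflects our reading of Figure 4b,c, with modes given as plain strings rather than encoder enumerations.

# Original-CU partition mode -> mode predicted on the mapped 16x16 image.
SEGMENTATION_MAP = {      # block segmentation (e.g., a 16x32 CU)
    "NS":  "NS",
    "BTV": "BTV",
    "TTV": "TTV",
    "TTH": "BTH",         # three vertical levels become two levels per half
    # "BTH" has no mapping at this level; it is handled by the encoder in post-processing
}

CONNECTION_MAP = {        # block connection (e.g., an 8x32 CU)
    "NS":  "NS",
    "BTV": "TTV",         # a left/right difference becomes a three-level structure
    "BTH": "BTV",         # a top/bottom difference becomes a left/right difference
    "TTH": "QT",          # three horizontal levels resemble the QT pattern
}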
3.2. The Structure of CNN Models
In this paper, we use CNNs to determine the partitioning modes of CUs in VVC intra prediction. For a CU block, the VTM computes the RD cost of all partitioning modes of that CU and its sub-CUs and selects the least costly combination. We design CNNs at different levels for 64 × 64, 32 × 32, and 16 × 16 CUs, including the BSC-mapped blocks, to predict their partitioning modes and thus replace part of the tedious partition search, which is more time-consuming for these blocks than for others. Our model structure is shown in
Figure 5. We designed three different model architectures for blocks of different sizes.
3.2.1. 64 × 64 CNN Model
The 64 × 64 CNN model takes the luminance channel of the 64 × 64 CU block as input and outputs the probabilities of the two partition modes. As shown in
Figure 5a, in the context of HD video processing, we designed a 7 × 7 convolution kernel in the first layer to extract features, because pixels in HD video are highly redundant and a larger convolution kernel enlarges the receptive field of the next convolution layer. This ensures that the extracted features are highly representative. To effectively reduce complexity, the model's running time should be as short as possible, so we did not design a deep or complex network architecture and only used simple convolution, pooling, and batch normalization layers to further extract features. A CU block of HD video does not carry much texture information, and this simple structure can extract the image features well without requiring a more complex model. After that, we use Global AvgPool and Flatten layers to transform the features into a one-dimensional vector, and finally output two predicted probabilities after a fully connected layer and a Softmax activation function, representing the probabilities of the two partitioning modes of the 64 × 64 CU block.
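A compact PyTorch sketch of this kind of network is shown below; the 7 × 7 stem with 16 channels and the 64 backbone features follow Table 1, while the stride, the intermediate channel count, and the layer depth are illustrative assumptions.

import torch
import torch.nn as nn

class CNN64(nn.Module):
    # Sketch of the 64x64 model: 7x7 stem, a light conv/BN/pool backbone,
    # global average pooling, and a 2-way Softmax head.
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 16, kernel_size=7, stride=2, padding=3),
                                  nn.BatchNorm2d(16), nn.ReLU())
        self.backbone = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64, 2), nn.Softmax(dim=1))

    def forward(self, luma):                 # luma: (N, 1, 64, 64)
        x = self.backbone(self.stem(luma))
        x = x.mean(dim=(2, 3))               # global average pooling -> (N, 64)
        return self.head(x)                  # probabilities of the two partition modes

probs = CNN64()(torch.rand(1, 1, 64, 64))    # one 64x64 luma block as an example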
3.2.2. The 32 × 32 CNN Model
Unlike the 64 × 64 CU block, the 32 × 32 CU block allows six partitioning modes. The 32 × 32 CNN model takes the luminance channel of the 32 × 32 CU block as input, but its output is the probability vector over the six partitioning modes. As shown in
Figure 5b, in the design of the 32 × 32 CNN model, instead of using the previous 7 × 7 convolution kernel in the first layer, we use three convolution kernels of different sizes, 4 × 4, 5 × 3, and 3 × 5, as three branches to extract features of different dimensions, and their outputs are concatenated as the input to the next layer. Owing to the limited pixel size of a 32 × 32 CU, a 7 × 7 convolutional kernel is too large and can easily lose local features. We therefore use a relatively small 4 × 4 convolutional kernel and combine it with two asymmetric convolutional kernels, 3 × 5 and 5 × 3, to extract features in different directions, which helps predict directional partitions such as BT and TT. The subsequent design follows the backbone network of the 64 × 64 CNN model and uses Global AvgPool and Flatten to transform the features into a one-dimensional vector. Notably, in the fully connected layer of the 32 × 32 CNN model, we add the QP of the CU as an external feature, because the partitioning decision of the same CU may differ under different QPs, and this information makes the CNN inference more accurate. Finally, the Softmax activation function outputs six predicted probabilities, representing the probabilities of the six partitioning modes of the 32 × 32 CU block.
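The multi-branch first layer and the QP feature can be sketched as follows; the channel split (16 channels for the 4 × 4 kernel, 8 each for 3 × 5 and 5 × 3) and the 256 backbone features follow Table 1, while the strides, paddings, and backbone depth are illustrative assumptions.

import torch
import torch.nn as nn

class CNN32(nn.Module):
    # Sketch of the 32x32 model: three parallel stems (4x4, 3x5, 5x3), a shared
    # backbone, and the QP concatenated before the 6-way Softmax head.
    def __init__(self):
        super().__init__()
        self.branch_sq = nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1)
        self.branch_h  = nn.Conv2d(1, 8,  kernel_size=(3, 5), stride=2, padding=(1, 2))
        self.branch_v  = nn.Conv2d(1, 8,  kernel_size=(5, 3), stride=2, padding=(2, 1))
        self.backbone = nn.Sequential(
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(256 + 1, 6), nn.Softmax(dim=1))

    def forward(self, luma, qp):             # luma: (N, 1, 32, 32), qp: (N, 1)
        x = torch.cat([self.branch_sq(luma), self.branch_h(luma), self.branch_v(luma)], dim=1)
        x = self.backbone(x).mean(dim=(2, 3))        # global average pooling -> (N, 256)
        return self.head(torch.cat([x, qp], dim=1))  # probabilities of the six modes

probs = CNN32()(torch.rand(2, 1, 32, 32), torch.full((2, 1), 32.0))   # QP = 32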
3.2.3. 16 × 16 CNN Model
The 16 × 16 CNN model uses the same structure as the 32 × 32 CNN model. The difference is that the input of the 16 × 16 CNN model includes not only the luminance channels of the original 16 × 16 CU blocks but also the new 16 × 16 luminance maps formed by applying BSC to the 16 × 32, 32 × 16, 32 × 8, and 8 × 32 luminance channels. So, as shown in
Figure 5c, in addition to the QP, we add the width and height of the CU as external features in the fully connected layer, because they are important attributes of the block that further influence the partitioning decision, which makes the CNN inference more accurate.
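The only structural change with respect to the 32 × 32 model is the head, which also receives the original CU's width and height; a small sketch is given below (the 128 backbone features follow Table 1, and the example CU size is hypothetical).

import torch
import torch.nn as nn

# Head of the 16x16 model: QP, width, and height are appended to the pooled features.
head = nn.Sequential(nn.Linear(128 + 3, 6), nn.Softmax(dim=1))
pooled = torch.rand(1, 128)                          # stand-in for the backbone output
qp, width, height = 32.0, 8.0, 32.0                  # e.g., an 8x32 CU mapped by the BSC
extras = torch.tensor([[qp, width, height]])
probs = head(torch.cat([pooled, extras], dim=1))     # probabilities of the six modes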
Table 1 shows the number of channels in the first convolutional layer and the number of features in the flatten layer for the models at different levels. In the 64 × 64 CNN model, the first convolutional layer has 16 channels using a 7 × 7 convolution, and the backbone ultimately extracts 64 feature maps. In the 32 × 32 CNN model and the 16 × 16 CNN model, the first convolutional layer has 16 channels using a 4 × 4 convolution plus eight channels each using 3 × 5 and 5 × 3 convolutions, and the backbones ultimately extract 256 and 128 feature maps, respectively. The channel counts are kept moderate to limit the number of model parameters, while the 32 × 32 and 16 × 16 CNN models, which classify more categories, extract more feature maps to increase their representational ability.
3.3. Dataset
The importance of datasets in training deep learning models cannot be overlooked, as they directly influence the performance and generalization capabilities of the models. We employ the Div2K [
35] public image dataset for model training; it is an open dataset for super-resolution tasks and encompasses a diverse collection of images covering a broad array of scenes and content. This diversity ensures that the model performs well under various conditions. We encode the dataset using the VTM encoder in the All-Intra (AI) configuration at multiple QP values to extract complete CU partition information and RD costs. Based on the partition information, the partition mode and luminance channel of each CU are extracted to serve as the labels and data for training. Notably, for CU blocks that conform to the BSC structure, labels and data are assigned according to the mapping depicted in
Figure 4. By utilizing the Div2K dataset, we acquired over 1.78 million instances of 64 × 64 block partitions, more than 5.91 million instances of 32 × 32 block partitions, and over 12.73 million instances of 16 × 16 block partitions.
Figure 6 shows the proportion of each label within the dataset, clearly indicating a significant imbalance in the data distribution among categories, regardless of CU block size. This imbalance significantly affects the model's fitting. For each block size, we denote by M the number of samples of the least-represented label. To mitigate the imbalance, we randomly select a number of samples proportional to M from each category for the training set and a disjoint portion for the test set, thereby not only addressing the imbalance but also facilitating the identification of an optimal training set representation.
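A small sketch of this balancing step is shown below; the train/test fractions are illustrative placeholders, since the exact sampling proportions are those described above.

import random
from collections import defaultdict

def balance(samples, train_frac=0.8):
    # Down-sample every category to the size M of the least-represented label,
    # then split into training and test sets (fractions here are placeholders).
    by_label = defaultdict(list)
    for data, label in samples:
        by_label[label].append((data, label))
    m = min(len(v) for v in by_label.values())        # M: size of the rarest label
    train, test = [], []
    for v in by_label.values():
        picked = random.sample(v, m)
        cut = int(train_frac * m)
        train += picked[:cut]
        test += picked[cut:]
    return train, test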
3.4. Loss
The approach in this paper treats CU mode prediction as a multi-class classification problem. In neural networks, the cross-entropy loss function is a natural choice when the task is to map the input to one of multiple categories. It is an effective measure of how the probability distribution of the model output differs from the true label. Moreover, the gradient of the cross-entropy loss is relatively simple to compute, which eases optimization. We therefore apply the basic cross-entropy as the loss function, with the following expression:

$$ L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log p_{i,\,y_i} $$

where $N$ is the size of the minibatch, $y_i$ denotes the true split mode of the $i$th CU, and $p_{i,m}$ denotes the predicted probability that the $i$th CU uses split mode $m$.
In VVC, distinct partitioning modes result in significant variations in RD cost. Consequently, choosing different partitioning modes can entail considerably different additional encoding costs. This characteristic suggests that the conventional cross-entropy loss alone may not fully capture the discrepancy between predicted and actual values. To account for this during model training, we incorporate the RD cost into the loss function so that it more accurately reflects the impact of different partitioning modes on encoding efficiency. The formula can be expressed as follows:

$$ L_{RD} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m} p_{i,m}\,\tilde{r}_{i,m}, \qquad \tilde{r}_{i,m} = \frac{r_{i,m} - r_i^{\min}}{r_i^{\min}} $$

where $r_{i,m}$ is the RD cost of the $i$th CU at split mode $m$, and $r_i^{\min}$ is the minimum RD cost of this CU across all possible split modes. In the above equation, $\tilde{r}_{i,m}$ can be interpreted as the normalized RD cost. The term $p_{i,m}\,\tilde{r}_{i,m}$ levies increased penalties on modes with larger RD costs $r_{i,m}$, in line with the RD optimization objectives in VVC. Combining $L_{CE}$ and $L_{RD}$, the overall loss function is:

$$ L = L_{CE} + \lambda L_{RD} $$
Here, $\lambda$ is a positive scalar set to 1, used to adjust the relative weight of the RD cost term against the cross-entropy term so that both terms can be optimized effectively. Thus, the three CNN models can be trained correctly by minimizing $L$.
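The combined loss can be sketched in PyTorch as follows; the normalization used for the RD term is our reading of the description above (an assumption, not a verified formula), and the tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def combined_loss(probs, labels, rd_costs, lam=1.0):
    # probs:    (N, M) Softmax outputs of the CNN
    # labels:   (N,)   ground-truth split modes
    # rd_costs: (N, M) RD cost of every split mode for each CU
    ce = F.nll_loss(torch.log(probs + 1e-12), labels)           # L_CE
    r_min = rd_costs.min(dim=1, keepdim=True).values
    r_norm = (rd_costs - r_min) / (r_min + 1e-12)               # normalized RD cost
    rd = (probs * r_norm).sum(dim=1).mean()                     # L_RD penalizes costly modes
    return ce + lam * rd                                        # L = L_CE + lambda * L_RD

# Example with random stand-in values (M = 6 split modes)
p = torch.softmax(torch.rand(4, 6), dim=1)
loss = combined_loss(p, torch.randint(0, 6, (4,)), torch.rand(4, 6) * 1000)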
3.5. Post-Processing Operations for the BSC Structure
Our BSC structure maps 16 × 32, 32 × 16, 32 × 8, and 8 × 32 CU blocks to 16 × 16; after inference by the model, we need to post-process the results for two main purposes:
Error reduction: In the block segmentation structure, we split a 16 × 32 or 32 × 16 block into two blocks and infer on them separately. However, the two newly segmented blocks may exhibit different texture features. For a block whose true mode is BTV, one of the two segmented blocks may exhibit obvious BTV partition features while the other has no distinct texture features. To avoid errors, when the two blocks receive different predictions, we add the partition modes mapped from both outcomes to the set of modes checked by the VTM. It is worth noting that we always add the BTH mode for 32 × 16 blocks and the BTV mode for 16 × 32 blocks to the VTM, because the small blocks formed by block segmentation represent features of the next level, and these two cases cannot be mapped through the block segmentation structure.
Rule conformance: a trained network cannot predict every block with complete accuracy. Inevitably, some predicted values will be outliers; such a prediction is not only an error but may also violate the partitioning rules. For example, the QT partition is not allowed for the original blocks of the BSC structure. When a predicted mode does not conform to the partitioning rules of the original block, we discard the prediction and encode the block using the VTM.
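The two rules can be summarized in a short sketch; mode names are plain strings, inverse_map denotes the inverse of the segmentation mapping sketched earlier, and the interface is hypothetical rather than the actual VTM integration.

def post_process(pred_a, pred_b, cu_size, inverse_map):
    # pred_a / pred_b: modes predicted on the two 16x16 halves of a 16x32 or 32x16 CU.
    # Returns the set of original-CU modes to check, or None to fall back to a full search.
    modes = set()
    for pred in (pred_a, pred_b):
        if pred not in inverse_map:          # rule 2: prediction violates the partition rules
            return None                      # discard and let the VTM encode this block
        modes.add(inverse_map[pred])         # rule 1: if the halves disagree, keep both modes
    # The mode that block segmentation cannot represent is always added as a candidate.
    modes.add("BTH" if cu_size == (32, 16) else "BTV")
    return modes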
3.6. Multi-Threshold Settings
Deciding the partitioning mode of CUs exclusively from the model's predictions minimizes time complexity. However, the predictions of a trained network are not completely accurate, and model errors can introduce incorrect partitioning modes, thus degrading RD performance. The error introduced varies with block size; relatively speaking, a prediction error on a block at a smaller depth causes a greater loss, as it affects the deeper partitions. Therefore, we propose a multi-thresholding scheme to realize a trade-off between coding complexity and RD performance.
3.6.1. Fixed Threshold Scheme
For the 64 × 64 model, errors in the prediction results may bring a much larger loss in RD performance than errors for other, smaller CU blocks. We use a fixed threshold $T_{64}$ to determine the confidence of the prediction, with $T_{64}$ ranging from 0 to 1. Prediction results below $T_{64}$ are discarded; the encoder then performs a partition search to determine the partitioning mode, and the next level of prediction models is disabled during this search to ensure a correct decision.
3.6.2. Variable Threshold Scheme
For the 32 × 32 and 16 × 16 models, multiple modes are predicted. When a simple fixed threshold is used to judge the confidence of the prediction results, setting the threshold too high discards many predictions and brings only a limited reduction in time complexity, while setting it too low causes too many unnecessary partition modes to be checked. We therefore use a variable thresholding scheme in which the threshold is dynamically adjusted according to the probability vector output by the model. The threshold is defined as:

$$ T = a \cdot P_{\max} $$

where $a$ is a manually set fixed factor, $P_{\max}$ is the maximum value among the model outputs $P_1, P_2, \ldots, P_M$, and $M$ represents the number of different splitting modes of the CU. The threshold $T$ is the final value; since the model outputs vary with different CU blocks, $T$ is variable. For all possible modes $m$ of the CU, only the modes with a probability $P_m \geq T$ are checked during the encoder's partition search, while the others are skipped.
For the most aggressive setting, $a = 1$, the encoder only examines modes where the model output $P_m \geq P_{\max}$, and the partitioning mode of the CUs is entirely determined by the model. In contrast, for $a = 0$, the encoder checks modes where the model output $P_m \geq 0$, and the partitioning mode of the CUs is entirely determined by the original encoder's partition search. As a compromise, the parameter $a$ is typically set between 0 and 1 in practice.
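A short sketch of the variable-threshold selection is given below, assuming the threshold takes the form T = a * P_max, consistent with the behaviour described for a = 0 and a = 1.

import numpy as np

def candidate_modes(probs, a):
    # Keep every mode whose predicted probability reaches T = a * max(probs);
    # a = 1 keeps only the top mode, a = 0 keeps all modes.
    probs = np.asarray(probs)
    threshold = a * probs.max()
    return [m for m, p in enumerate(probs) if p >= threshold]

print(candidate_modes([0.05, 0.40, 0.25, 0.15, 0.10, 0.05], a=0.5))   # -> [1, 2]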