1. Introduction
Currently, video is the dominant traffic type on the Internet. With the convergence of emerging technologies such as 5G, artificial intelligence (AI) and the Internet of Things (IoT), an increasing number of videos are generated by edge devices and consumed by machines for vision applications in fields such as autonomous driving, video surveillance and smart cities. However, owing to the huge volume of this growing video data, video compression remains a crucial challenge in machine vision applications. Traditional video coding standards aim to achieve the best quality for human consumption under a given bitrate constraint by exploiting the characteristics of the human visual system (HVS). However, these standards may be inefficient for machine consumption in vision tasks, such as image classification, object detection and segmentation, owing to their different purposes and evaluation metrics. For example, in object detection, the important information lies in the positions and shapes of the objects rather than in the background. Therefore, instead of compressing the entire image with high perceptual quality as traditional standards do, focusing the compression on only the information crucial for the detection task can be much more efficient than conventional methods. Additionally, compressing the entire image to a uniform quality level to meet a given bitrate with conventional methods may discard information that is important for object detection, such as the shapes of objects. Consequently, a more efficient video coding standard for machine consumption is needed. Accordingly, the Moving Picture Experts Group (MPEG) is developing a new standard called video coding for machines (VCM) [1,2,3].
Currently, the MPEG VCM group is developing the new standard in two tracks: Track 1, which mainly deals with the compression of features extracted from the input image/video, and Track 2, which deals with the compression of the input image/video itself [4].
Figure 1a shows a possible processing pipeline for the feature compression considered in VCM Track 1 [5], whereas Figure 1b shows the processing pipeline for the image/video compression of VCM Track 2. For both pipelines, when versatile video coding (VVC), the most recent video coding standard [6], is used as the codec, the machine vision task performances measured at the given bitrates are defined as the feature anchor and the image anchor of VCM, respectively, for the evaluation of potential technologies [4].
VCM Track 1 explores the compression of the multiscale features generated by the feature pyramid network (FPN), which is the backbone of the machine vision networks selected for the object detection and segmentation tasks in the evaluation framework [4]. In general, compressing a feature pyramid, which consists of multiscale features of different sizes corresponding to each layer, is inefficient because its total data size is significantly larger than that of the corresponding input image/video.
To effectively reduce the size of the multiscale features, a framework named multiscale feature compression (MSFC) was first proposed [7]. The MSFC framework fuses the multiscale features into a single feature map and compresses it. As shown in Figure 2, the MSFC framework consists of three modules: a multiscale feature fusion (MSFF) module, a single-stream feature compression (SSFC) module and a multiscale feature reconstruction (MSFR) module. The MSFC framework was introduced to VCM, where the original bottom-up MSFR structure was modified into a top-down MSFR structure that reconstructs low-level features from high-level features [8].
However, the performance of feature compression using the existing MSFC model did not surpass that of the image anchor defined by the VCM group, implying that the conventional approach of compressing images based on HVS characteristics was still more efficient. Therefore, this paper proposes an MSFC-based feature compression method that outperforms the image anchor in terms of compression performance. Specifically, we propose an enhanced MSFC (E-MSFC) [9] based on the existing MSFC methods [7,8] to efficiently compress the multiscale features of the FPN for object detection tasks. The proposed E-MSFC further reduces the number of feature channels to be packed into a single feature map and compresses the single feature map using VVC. For this further channel-wise reduction, the architectures of the MSFF and MSFR modules in the existing MSFC are extended in the E-MSFC, and the SSFC module is replaced by VVC for the compression of the single feature map. The proposed E-MSFC achieves much higher performance than the existing methods and significantly outperforms the VCM image anchor. Therefore, the proposed multiscale feature compression method using VVC is expected to be a potential candidate solution for the feature compression of VCM.
The rest of this paper is organized as follows. In Section 2, the existing MSFC framework is briefly described. Section 3 presents the proposed E-MSFC by comparing it to the existing methods and also describes an extension of the MSFF with a bottom-up structure. The experimental conditions and results are presented in Section 4. Lastly, the conclusion is presented in Section 5.
3. Proposed Enhanced MSFC
The SSFC module in the existing MSFC compresses the single feature map output from the MSFF module using channel-wise reduction and uniform quantization. In contrast, in the exploration phase of VCM, a single feature map has been compressed in various ways, such as using a neural network [13,14,15] or a conventional video codec [16,17,18,19,20]. Moreover, compression using a conventional video codec or a neural network exhibits improved performance compared to simply quantizing the features [7,8]. Therefore, to enhance the machine vision task performance, this paper proposes an enhanced MSFC (E-MSFC) [9] model that combines the existing MSFC model and the feature compression methods explored in the VCM.
As mentioned above, a single feature map fusing multiscale features can be compressed using neural networks or VVC. In the case of neural-network-based compression, a codec that outperforms VVC can be devised by training it to specialize in feature compression. However, this may require training a different compression network for each target bitrate. For example, according to VCM's common test conditions (CTCs) [21], the machine vision task performance is compared to the anchor at six predefined bitrate points; thus, separate feature compression networks would have to be trained for the six bitrate points. In addition, devising an enhanced MSFC framework in this way is time-consuming because the feature compression network may require re-training whenever the MSFF and/or MSFR structures evolve.
For compression using VVC, the feature map of each channel is packed into a single-frame feature map, as shown in Figure 3. The packed feature map is then compressed using VVC. When VVC is used for feature compression, the task performances at the six bitrate points can be easily evaluated according to the CTCs of VCM. The performance of feature compression using VVC may be lower than that of neural-network-based approaches [13,14,15,16,17,18,19]. However, although VVC is optimized based on the HVS for the compression of content to be consumed by humans, the feature map can still be compressed effectively using VVC. For example, Figure 4 indicates that the intraprediction of VVC works effectively in the compression of the feature map. In addition, it is relatively easy to develop an E-MSFC framework with MSFF and/or MSFR module modifications.
Therefore, as shown in Figure 5, the proposed E-MSFC employs VVC as the core codec to compress the single feature map instead of the existing SSFC module. In addition, to improve the VCM performance, the structures of the MSFF and MSFR modules of the existing MSFC are modified and extended in the proposed E-MSFC model. In the overall feature compression pipeline, the proposed method inserts the improved MSFC, which enhances performance with a slightly expanded version of the existing MSFC structure, between the backbone and the inference networks of the vision network. Therefore, considering the entire pipeline and excluding the feature map compression with VVC itself, the proposed method introduces only a minor increase in complexity.
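The overall processing order implied by Figure 5 can be summarized with the following minimal sketch. It is only an illustration of the pipeline described above; all of the callables (backbone, msff, pack, vvc_encode, vvc_decode, unpack, msfr, detector) are hypothetical placeholders for the corresponding stages rather than the actual implementation.

```python
def e_msfc_pipeline(image, backbone, msff, pack, vvc_encode, vvc_decode,
                    unpack, msfr, detector, qp):
    """Minimal sketch of the E-MSFC pipeline; every argument is a placeholder
    callable for the corresponding stage, not the authors' implementation."""
    pyramid = backbone(image)          # FPN multiscale features (feature pyramid)
    fused = msff(pyramid)              # extended MSFF: fusion + channel-wise reduction
    frame, f_min, f_max = pack(fused)  # pack channels into a 10-bit single frame
    bitstream = vvc_encode(frame, qp)  # VVC replaces the existing SSFC module
    frame_hat = vvc_decode(bitstream)
    fused_hat = unpack(frame_hat, f_min, f_max)
    pyramid_hat = msfr(fused_hat)      # extended MSFR: reconstruct the 256-channel pyramid
    return detector(pyramid_hat)       # machine vision inference (object detection)
```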
3.1. Extension of MSFF and MSFR
Considering the trade-off between the bitrate and the task performance, it is essential to appropriately determine the size of the single feature map to be compressed by VVC. The bitrate depends on the size of the single-frame feature map, which is determined by the number of feature channels constituting the single feature map. In this respect, the E-MSFC reduces the number of feature channels constituting the single feature map from 256 by extending the existing MSFF and MSFR structures, thereby enhancing the bitrate-task performance.
As shown in Figure 2, in the existing MSFF, the number of reweighted feature channels to be included in a single feature map is reduced from 1024 to 256 using a convolutional layer. In the extended MSFF, to reduce the size of the single feature map to be encoded using VVC, a further channel-wise reduction is performed using a convolutional layer with fewer output channels than that of the existing MSFF's convolutional layer. In addition, the architecture of the existing MSFR is modified to reconstruct a feature pyramid from the single feature map with the reduced number of feature channels generated by the extended MSFF.
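As a rough illustration of this further channel-wise reduction, the following PyTorch sketch fuses four 256-channel FPN levels into a single feature map with n_ch channels. The kernel sizes, the alignment target and the omission of the channel-reweighting stage of the existing MSFF are simplifying assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExtendedMSFF(nn.Module):
    """Sketch of the extended MSFF: concatenated multiscale features are reduced
    to 256 channels as in the existing MSFF, then further reduced to n_ch channels.
    Layer settings are illustrative assumptions."""

    def __init__(self, n_ch=64, num_levels=4, channels=256):
        super().__init__()
        # Existing MSFF reduction: 4 x 256 = 1024 concatenated channels -> 256 channels.
        self.reduce_to_256 = nn.Conv2d(num_levels * channels, 256, kernel_size=3, padding=1)
        # Additional convolution performing the further channel-wise reduction.
        self.reduce_to_n = nn.Conv2d(256, n_ch, kernel_size=3, padding=1)

    def forward(self, pyramid):
        # Align every level to a common spatial size (here, the smallest level) and concatenate.
        target = pyramid[-1].shape[-2:]
        aligned = [F.interpolate(p, size=target, mode='nearest') for p in pyramid]
        fused = self.reduce_to_256(torch.cat(aligned, dim=1))
        return self.reduce_to_n(fused)   # single feature map with n_ch channels
```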
The architectures of the extended MSFF and the extended MSFR for the further channel-wise reduction and its reconstruction in the E-MSFC are shown in Figure 5 [9]. The number of feature channels included in the single feature map was reduced from 256 to 192, 144, or 64. In the extended MSFR module, the single feature map decoded by VVC is reconstructed into a feature pyramid in which each layer consists of 256 feature channels by adding a convolutional layer to each layer, as follows. The highest-level feature map is reconstructed using a convolutional layer that restores the reduced feature channels to 256. For the reconstruction of the lower-level feature maps, a transpose convolutional layer is used to upscale the feature map in the top-down architecture, instead of the nearest-neighbor interpolation and convolutional layers used in the existing MSFR, and a convolutional layer is added at each layer to restore the reduced channels of the upscaled feature to 256. This ensures that the decoded single feature map with the reduced feature channels is reconstructed into a feature pyramid with 256 feature channels per layer. The process of aligning and concatenating proceeds in the same way as in the existing MSFF, as described in Section 2.1.
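The reconstruction side can be sketched in the same spirit. The following is an illustrative PyTorch version of the extended MSFR described above, under the assumption that the decoded single feature map is at the resolution of the highest pyramid level; the kernel sizes and the number of levels are likewise assumptions.

```python
import torch.nn as nn

class ExtendedMSFR(nn.Module):
    """Sketch of the extended MSFR: the decoded single feature map with n_ch channels
    is expanded back into a top-down pyramid whose levels each have 256 channels.
    Layer settings are illustrative assumptions."""

    def __init__(self, n_ch=64, num_levels=4):
        super().__init__()
        # Restores the reduced channels of the highest-level feature to 256 channels.
        self.restore_top = nn.Conv2d(n_ch, 256, kernel_size=3, padding=1)
        # Transpose convolutions upscale the reduced-channel feature top-down ...
        self.upscale = nn.ModuleList(
            nn.ConvTranspose2d(n_ch, n_ch, kernel_size=2, stride=2)
            for _ in range(num_levels - 1))
        # ... and one convolution per level restores the upscaled feature to 256 channels.
        self.restore = nn.ModuleList(
            nn.Conv2d(n_ch, 256, kernel_size=3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, fused_hat):
        pyramid = [self.restore_top(fused_hat)]    # highest level (smallest resolution)
        x = fused_hat
        for up, restore in zip(self.upscale, self.restore):
            x = up(x)                              # upscale the reduced-channel feature
            pyramid.append(restore(x))             # restore 256 channels at this level
        return pyramid[::-1]                       # ordered from the lowest level upward
```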
To compress the single feature map using VVC, the E-MSFC generates a single-frame feature map using min–max normalization. That is, each channel of the single feature map is spatially packed into a single frame in raster-scan order of ascending channel index, as shown in Figure 3. Thereafter, each element of the packed frame is converted into a 10-bit depth format suitable for encoding with VVC using min–max normalization.
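A minimal sketch of this packing and normalization step is given below. The tiling parameter (tiles_per_row) and the handling of the unused frame area are illustrative choices and are not specified in the paper.

```python
import numpy as np

def pack_single_frame(fused, tiles_per_row):
    """Pack a fused single feature map of shape (C, H, W) into one 10-bit frame.
    Illustrative sketch; layout details are assumptions."""
    c, h, w = fused.shape
    f_min, f_max = float(fused.min()), float(fused.max())
    scale = (f_max - f_min) or 1.0

    # Min-max normalization of every element to the 10-bit range [0, 1023].
    fused10 = np.round((fused - f_min) / scale * 1023).astype(np.uint16)

    # Spatial packing of each channel in raster-scan order of ascending channel index.
    rows = -(-c // tiles_per_row)                          # ceil(c / tiles_per_row)
    frame = np.zeros((rows * h, tiles_per_row * w), dtype=np.uint16)
    for idx in range(c):
        r, col = divmod(idx, tiles_per_row)
        frame[r * h:(r + 1) * h, col * w:(col + 1) * w] = fused10[idx]

    # f_min and f_max must be sent to the decoder for de-normalization (see Section 4).
    return frame, f_min, f_max
```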
3.2. E-MSFC with a Bottom-Up MSFF
To further improve the machine vision task performance, an additional extension of the MSFF module is employed in the E-MSFC. The feature pyramid of the FPN consists of five multiscale feature maps, with higher-level feature maps having smaller sizes, as shown in Table 1. In the inference of object detection, the higher-level feature maps are mainly used to detect large objects, whereas the lower-level feature maps are used to detect small objects. In this network, the smaller feature maps can be regarded as higher-level feature maps and the larger ones as lower-level feature maps. In the MSFR, the lower-level feature maps are reconstructed from the higher-level feature maps using a top-down structure. In such a top-down structure, information about the lower-level features is likely to be lost. In addition, when lower-level features that have not been properly reconstructed are used for inference, small objects may be incorrectly classified or even missed entirely.
Therefore, to compensate for this shortcoming, the MSFF in the E-MSFC is further extended with a bottom-up structure so that the higher-level features also contain information on the lower-level features, which increases the overall task performance through an improved detection of small objects [22].
As shown in Figure 6, the extended MSFF with a bottom-up structure includes additional preprocessing of the multiscale feature maps. To add the lower-level feature map information to the higher-level feature maps, the lower-level feature maps are downscaled and added to the higher-level feature maps. In detail, to add the information of the lowest-level feature map to the feature map one level above it, the lowest-level feature map is downscaled using a convolutional layer and added to that upper-level feature map. Thereafter, an updated feature map for that level is generated through a convolutional layer that fine-tunes the summed feature. The same process is repeated at the next level to generate its updated feature map. The updated feature maps therefore contain the lower-level feature map information, and they are used for the fusion of the single feature map in place of the original feature maps in the bottom-up MSFF. In contrast, as the highest-level feature map contains the most important information for the inference, bottom-up processing is not applied to it to prevent possible information distortion. The fusion into the single feature map then proceeds in the same manner as in the extended MSFF described in Section 3.1; the bottom-up preprocessing itself is sketched below.
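This is only an illustration of the procedure described above; the layer configuration (strided 3 × 3 convolutions for downscaling, 3 × 3 convolutions for fine-tuning) and the number of processed levels are assumptions rather than the exact architecture of Figure 6.

```python
import torch.nn as nn

class BottomUpPreprocess(nn.Module):
    """Sketch of the bottom-up preprocessing in the extended MSFF:
    lower-level information is propagated upward before the fusion.
    Layer settings are illustrative assumptions."""

    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # Strided convolutions downscale a lower-level map to the next level's size.
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels - 2))
        # Convolutions that fine-tune each summed feature.
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels - 2))

    def forward(self, pyramid):
        # pyramid[0] is the lowest level (largest map); pyramid[-1] is the highest level.
        out, prev = [pyramid[0]], pyramid[0]
        for i, (down, fuse) in enumerate(zip(self.down, self.fuse), start=1):
            prev = fuse(down(prev) + pyramid[i])   # downscale, add, then fine-tune
            out.append(prev)
        out.append(pyramid[-1])                    # highest level is left unchanged
        return out
```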
4. Experimental Results
As mentioned previously, in the proposed E-MSFC, the feature pyramid of multiscale features extracted from the FPN is compressed using VVC instead of the SSFC. To train the extended MSFF and MSFR modules of the E-MSFC, the E-MSFC was integrated into the Faster R-CNN X101-FPN model of Detectron2 [23]. In the training, only the extended MSFF and MSFR modules were trained while the parameters of the Faster R-CNN were frozen. The modules were trained using the COCO train2017 dataset [24], and an OpenImage V6 [25] validation dataset of 5 K images was used for the evaluation according to the CTCs of VCM [21]. The initial learning rate was set to 0.0005 and the training was iterated 300,000 times with a batch size of two. The extended MSFF and MSFR modules were trained separately for each channel number of the single feature map.
As mentioned previously, the single feature map to be compressed by VVC is converted into a 10-bit depth format using min–max normalization. Therefore, the minimum and maximum values of the single feature map, each of which is a 32-bit floating-point value, must be transmitted to the decoder side to reconstruct the feature pyramid. Consequently, the data size of the min/max values is included in the bitrate calculation measured in bits per pixel (BPP).
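For clarity, this bitrate accounting can be written as a simple formula. The exact accounting is defined by the CTCs [21]; the sketch below assumes that one min/max pair (two 32-bit floating-point values) is signaled per input image:

$$\mathrm{BPP} = \frac{B_{\mathrm{VVC}} + 2 \times 32}{W \times H},$$

where $B_{\mathrm{VVC}}$ is the size of the VVC bitstream in bits and $W \times H$ is the resolution of the original input image.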
Figure 7 shows examples of the single-frame feature maps packing the fused single feature map with different channel-wise reductions. The size of the single-frame feature map to be compressed decreases as the number of feature channels decreases.
According to the VCM CTCs, the VVC test model (VTM)-12.0 [26] was used as the video codec for the single-frame feature map compression. The overall performance was measured as the mean average precision (mAP) of the object detection task at the given bitrate in BPP, that is, the BPP-mAP performance [21].
Table 2 shows the BPP-mAP performances for the set of quantization parameters (QPs) {22, 27, 32, 37, 42, 47} and for channel numbers of {256, 192, 144, 64} obtained with the proposed channel-wise reduction. The BD-rate gains of the proposed E-MSFC over the image anchor of VCM [4] are shown in the last row of Table 2. Figure 8 shows the BPP-mAP curves corresponding to the results in Table 2. Compared to the image anchor, the proposed method achieved BD-rate gains of 43.17%, 59.20%, 65.40% and 84.98% when the number of channels in the single feature map is 256, 192, 144 and 64, respectively. As shown in Figure 8, the overall compression efficiency improves as the number of channels decreases. However, since the maximum mAP over the entire bitrate range varied only slightly with the change in the number of feature channels, a considerable amount of redundancy may still remain within the single feature map. Therefore, how far the number of channels can be reduced can be estimated from the performance change as the number of channels decreases.
The performance of the E-MSFC was also compared to those of the existing MSFC methods [7,8], which compress the 256-channel feature maps into a 64-channel single feature map and quantize the 32-bit floating-point feature values into values of various reduced bit depths. The existing MSFC [8] was trained on the COCO train2017 dataset and evaluated using the COCO validation 2017 dataset [24]. Table 3 shows the comparison of the compression performances between the existing MSFC [8] and the proposed E-MSFC. In this comparison, the E-MSFC with a 64-channel single feature map was evaluated on the same COCO validation 2017 dataset.
Figure 9 shows the BPP-mAP curves corresponding to the results in Table 3. The E-MSFC significantly outperformed the existing MSFC, indicating the high efficiency of VVC in the compression of feature maps. As shown in Table 3, even without compression, the proposed method exhibited better performance than the existing method, implying that the proposed E-MSFC generates feature maps that retain the information required for machine vision tasks more efficiently.
The E-MSFC with the bottom-up MSFF structure was trained under the same conditions used to train the E-MSFC. Its performance was compared to that of the E-MSFC for single feature maps with 192 or 64 feature channels. As shown in Table 4, the E-MSFC with the bottom-up MSFF provides additional BD-rate gains of 2.72% and 0.96% over the E-MSFC for the 192-channel and 64-channel feature maps, respectively. Figure 10 shows the BPP-mAP curves for the feature maps with 192 and 64 channels.
Figure 11 shows the inference results of object detection when the E-MSFC and the E-MSFC with the bottom-up MSFF were applied to the original image. In the figure, the red circles indicate objects additionally detected compared to the E-MSFC. For instance, in Figure 11a, the E-MSFC fails to detect small birds, which are successfully detected after applying the bottom-up MSFF module. Similarly, in Figure 11b, the E-MSFC fails to detect persons who appear small owing to their distance, but they are successfully detected with the bottom-up MSFF. As a result, as shown in Figure 11a,b, the E-MSFC detects 12 objects, whereas 15 objects are detected when the bottom-up MSFF is applied. The E-MSFC with the bottom-up MSFF, which embeds lower-level feature information into higher-level features, enables the additional detection of smaller objects compared to the E-MSFC. Therefore, the E-MSFC with the bottom-up MSFF exhibits a better inference performance than the E-MSFC at the same bitrate.
5. Conclusions
In this paper, we proposed the E-MSFC framework, which efficiently compresses, using VVC, the multiscale features extracted from the feature pyramid network (FPN). In the E-MSFC, the multiscale features of a feature pyramid are fused and packed into a single feature map with further channel-wise reduction. The single feature map is compressed using VVC with min–max normalization instead of the existing SSFC. Thereafter, the compressed single feature map is transmitted, decoded and reconstructed into the feature pyramid with multiscale features to perform the object detection task. The structures of the MSFF and MSFR modules in the existing MSFC were extended for the channel-wise reduction and its reconstruction. In addition, the MSFF module of the E-MSFC was further extended with a bottom-up structure to enhance the BPP-mAP performance by preserving lower-level feature information in the higher-level features during the single feature map generation.
Experimental results revealed that the proposed E-MSFC significantly outperformed the VCM image anchor over a wider bitrate range with a BD-rate gain of up to 84.98%. In addition, the bottom-up MSFF further enhanced the performance of the E-MSFC with an additional BD-rate gain of up to 2.72%.
The proposed multiscale feature compression method using VVC was evaluated according to the experimental conditions defined by the VCM evaluation framework for the object detection task. However, multiscale features are common in various networks for different vision tasks, which allows the proposed method to be applied to a wide range of vision tasks and networks. Therefore, the proposed method is expected to be a potential candidate approach for the feature compression of VCM. It can be further enhanced by extracting more appropriate features to be compressed, based on recent works on compressed feature representation [27,28].