1. Introduction
Nowadays, the accuracy of machine vision tasks has improved considerably owing to advances in deep-learning technology. This has facilitated the widespread integration of deep-learning-based models into the IoT devices that we encounter in our day-to-day lives [1,2]. As a result, an exponential surge in machine-to-machine (M2M) communication, where devices autonomously communicate and process tasks, has been observed. In M2M communication, the typical process involves compressing and transmitting input image/video data captured by one device before another device analyzes them. Over the years, conventional video coding standards such as high-efficiency video coding (H.265/HEVC) [3] and versatile video coding (H.266/VVC) [4] have been developed; these techniques prioritize human visual characteristics in terms of subjective and objective quality. However, their effectiveness in efficient visual signal-level compression for image/video analysis, which is primarily performed by machines rather than humans, is debatable. Thus, there is a pressing need for a new compression scheme that efficiently maintains data for machine vision while preserving human visual quality and ensuring accurate recognition of image/video information [5,6].
In response to this emerging need, the Moving Picture Experts Group (MPEG) initiated a new standardization activity in 2019 called video coding for machines (VCM). The MPEG-VCM group highlighted use cases such as smart cities, intelligent transportation, content management, and surveillance as prime applications for the VCM standard technology [6]. These applications typically involve machine vision tasks such as object detection, instance segmentation, and object tracking. Accordingly, these were identified as the main tasks, and three deep-learning networks were selected for VCM standardization: Faster R-CNN X101-FPN, Mask R-CNN X101-FPN, and JDE-1088 × 608. MPEG-VCM aims to establish an efficient bitstream from an encoded video, descriptors, or feature maps extracted from a video by considering both the bitrate and the performance of the machine vision task post-decoding [6]. Therefore, MPEG-VCM is divided into two tracks based on the compression target: feature map domain compression and video domain compression. Given that most intelligent edge devices are limited in storage capacity and computational power, cloud servers with deep-learning networks have been introduced to share these burdens while also supporting analytics on the data transmitted from these devices [7,8]. A collaborative intelligence (CI) paradigm can be utilized to distribute the computational workload between edge devices and cloud servers by compressing and transmitting intermediate feature maps [9,10]. This is particularly significant for edge devices, such as mobile and embedded devices, as deep-learning network architectures continue to grow in complexity. Moreover, feature map compression techniques developed for MPEG-VCM can be effectively integrated within these CI paradigms. Consequently, this study concentrates on feature map compression for VCM.
Multi-level feature maps extracted from the feature pyramid network (FPN) structure commonly employed in the aforementioned deep-learning networks were established as the compression targets for feature map domain compression in MPEG-VCM. Specifically, the P-layer feature maps (P2, P3, P4, and P5) were targeted for object detection and instance segmentation, while the L-layer feature maps (L0, L1, and L2) were used for object tracking, as depicted in Figure 1. However, the total data volume of the multi-level FPN feature maps is significantly larger than that of the input image [11]. Therefore, several methods have been proposed to eliminate redundancy within and between the multi-level feature maps using a transform-based mechanism. Lee et al. [11] proposed a receptive block-based principal component analysis (RPCA) for feature map compression. While superior to discrete cosine transform-based compression methods, this approach is limited when compressing low-level feature maps, which account for more than half of the total multi-level feature map data. In addition, various PCA-based multi-level feature map compression methods have been proposed, such as those by Park et al. [12], Lee et al. [13], and Rosewarne and Nguyen [14]. These methods commonly employ a dimension-reduction mechanism to eliminate redundancy within each feature map and between low-level feature maps, leveraging the properties of the FPN structure. In addition, all these methods utilize PCA to extract basis kernels that efficiently reduce the dimensionality of the multi-level feature maps, eliminating redundancy between correlated data in each feature map. However, this requires executing an independent PCA on one or more multi-level feature maps for each image or video frame and compressing the basis matrices, coefficients, and mean vectors required for feature map reconstruction. These components have distinct characteristics and may necessitate different codecs to reach sufficient compression efficiency. To compress these components according to their spatial redundancy, Park et al. [12] and Lee et al. [13] utilized VVC and DeepCABAC [15], while Rosewarne and Nguyen [14] allocated the three components to different areas of a picture, defined as a single sub-picture. This sub-picture was compressed by the VVC quantizer at a lower quantization parameter (QP) than the QP used for the other components. However, this approach is not ideal for applications that require low bitrates.
To circumvent these issues, a transform-based feature map compression method employing a predefined generalized basis matrix and mean vector is proposed. In this method, the PCA process is not incorporated in the proposed VCM encoder. Instead, the generalized PCA (GPCA) process [16] is executed separately prior to encoding. This process generates the generalized basis matrix and mean vector from several images that do not overlap with the test datasets, eliminating the need to iterate through all feature maps of every input image. Generally, each channel within a feature map exhibits a strong correlation with the others, irrespective of the input data provided to the deep-learning network [17,18]. Thus, the generalized basis matrix and mean vector (TGB and TGM) can be utilized to reduce the redundancy between the channels of each feature map, instead of using an input-dependent basis matrix and mean vector derived by performing PCA on all feature maps of each input. Moreover, by capitalizing on the spatial redundancy of the multi-level feature maps, a generalized common basis matrix and mean vector (TGCB and TGCM) for the low-level feature maps were obtained via the GPCA process. As TGB, TGM, TGCB, and TGCM are predefined on both the encoder and decoder sides, there is no need to send bitstreams representing them to the decoder. Therefore, the proposed method only needs to compress and transmit the coefficients required for feature map reconstruction, which allows efficient compression while preserving the accuracy of machine vision tasks. This work extends our contribution to the 140th MPEG meeting [19]. The remainder of this paper is organized as follows. Section 2 elaborates on the proposed feature map compression method. Section 3 presents and discusses the experimental results of the proposed method, validating its effectiveness. Finally, Section 4 concludes the paper.
2. Proposed Transform-Based Feature Map Compression Method
This section provides a detailed description of the proposed transform-based feature map compression method. First, an outline of the proposed scheme is given, followed by a more detailed explanation of each component in the remaining subsections.
Figure 2 shows the overall pipeline of the proposed method. Similar to previous works [11,12,13,14], our method utilizes PCA to reduce the dimensions of the multi-level feature maps intended for compression. Previous methods pose certain drawbacks: they necessitate performing an independent PCA on one or more feature maps for each input datum during the encoding process, and they compress the basis matrices, coefficients, and mean vectors, which have distinct characteristics, for feature map reconstruction.
To mitigate these limitations, our method employs the predefined TGB, TGM, TGCB, and TGCM obtained by applying PCA to multiple feature maps extracted from a set of images that are not included in the test data of any machine vision task. This process occurs separately before encoding. Consequently, it is possible to extract a generalized basis matrix and mean vector that capture the strong correlation between channels for each feature map of each deep-learning network. Hence, the overall pipeline of the proposed method does not incorporate the PCA process at the encoder side.
The proposed method compresses multi-level feature maps, i.e., the P- or L-layer feature maps, depending on the machine vision task. Table 1 outlines the dimensions of each multi-level feature map. As indicated in the table, the lower-level feature maps (P2 and P3, or L0 and L1) account for over 85% of the total feature map data owing to their large spatial dimensions [12].
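The dominance of the low-level maps can be verified with a quick back-of-the-envelope calculation. The sketch below assumes the standard Detectron2 FPN configuration (256 channels, strides 4/8/16/32 for P2–P5) and an example input resolution; the exact figures in Table 1 depend on the actual network and dataset.

```python
# Rough data-volume estimate for the FPN P-layer feature maps (a sketch; assumes the
# standard Detectron2 FPN configuration: 256 channels, strides 4/8/16/32 for P2-P5).
H, W, C = 800, 1216, 256          # example input resolution and FPN channel count

strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
sizes = {k: C * (H // s) * (W // s) for k, s in strides.items()}
total = sum(sizes.values())

for name, n in sizes.items():
    print(f"{name}: {n / total:.1%} of the P-layer data")
print(f"P2 + P3 together: {(sizes['P2'] + sizes['P3']) / total:.1%}")
print(f"all P-layers vs. the RGB input: {total / (3 * H * W):.1f}x")
```

With these assumptions, P2 and P3 together account for roughly 94% of the P-layer data, and the P-layers as a whole hold several times more values than the input image, which is consistent with the motivation given above.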
The deep-learning networks used for each machine vision task commonly employ an FPN structure as the backbone network, which generates multi-level feature maps through a top-down architecture [20]. There exists spatial redundancy between these feature maps. Therefore, to achieve effective compression, our method combines the two low-level feature maps, P2 and P3 (or L0 and L1), by leveraging the nearest-neighbor down-sampling used in the FPN structure and concatenating along the channel dimension. This process is represented in Figure 3a. Consequently, the feature map count is reduced by one, and the feature map with the largest dimensions is eliminated. The decoded combined feature map is then reconstructed into the individual low-level feature maps by performing the inverse process, as shown in Figure 3b. Consequently, one or two high-level feature maps (P4 and P5, or L2) and the combined feature map are compressed. Furthermore, we use an independent set of TGB, TGM, TGCB, and TGCM for each feature map to be compressed, tailored to the specific machine vision task.
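As an illustration of this combination step, the following minimal sketch implements one plausible reading of Figure 3: P2 is reduced to the resolution of P3 by nearest-neighbor down-sampling (here, simple 2× subsampling) and concatenated along the channel axis, and the decoder splits and up-samples to recover the two maps. The function names, the assumption that P2 has exactly twice the spatial size of P3, and the choice of sample positions are illustrative, not the reference implementation.

```python
import numpy as np

def combine_low_levels(p2, p3):
    """Combine P2 (C, 2h, 2w) and P3 (C, h, w) into one (2C, h, w) feature map.

    Nearest-neighbour down-sampling of P2 by a factor of two is done here by
    simple subsampling; a sketch of the combination in Figure 3a, not the
    reference implementation.
    """
    p2_down = p2[:, ::2, ::2]                       # nearest-neighbour down-sampling
    return np.concatenate([p2_down, p3], axis=0)    # concatenate along the channel axis

def split_low_levels(combined):
    """Inverse process (Figure 3b): split the channels and up-sample the P2 part."""
    c = combined.shape[0] // 2
    p2_down, p3 = combined[:c], combined[c:]
    p2 = p2_down.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour up-sampling
    return p2, p3
```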
By predefining the generalized basis matrix and mean vector for each feature map of each task on both the encoder and decoder sides, these components do not need to be compressed or transmitted, thereby avoiding the degradation their compression would introduce. As illustrated in Figure 2, the proposed encoder generates coefficients by centering each feature map using the TGM or TGCM and transforming it using the TGB or TGCB. To compress the coefficients with VVC, quantization and packing processes are performed. The result is then encoded with VVC to generate a bitstream. On the receiving end, the proposed decoder unpacks and dequantizes the coefficients decoded with VVC. Following this, the multi-level feature maps are reconstructed through mean compensation and an inverse transform using the generalized basis matrix and mean vector. Finally, the reconstructed multi-level feature maps are fed into the machine vision inference network, as shown in Figure 2, to produce inference results for each task.
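A minimal sketch of the transform and inverse-transform steps is given below, assuming the basis matrix stores eigenvectors as columns sorted by eigenvalue and that each spatial point of the feature map is treated as a C-dimensional sample; the names, shapes, and the choice of keeping the first k basis vectors are illustrative assumptions.

```python
import numpy as np

def transform(feature, tgb, tgm, k):
    """Center a (C, H, W) feature map with the generalized mean vector and project it
    onto the first k columns of the generalized basis matrix, giving (k, H, W) coefficients."""
    c, h, w = feature.shape
    samples = feature.reshape(c, -1).T          # one (1 x 1 x C) spatial point per row
    coeff = (samples - tgm) @ tgb[:, :k]        # centering + transform
    return coeff.T.reshape(k, h, w)

def inverse_transform(coeff, tgb, tgm, c):
    """Mean compensation and inverse transform back to a (C, H, W) feature map."""
    k, h, w = coeff.shape
    samples = coeff.reshape(k, -1).T @ tgb[:, :k].T + tgm
    return samples.T.reshape(c, h, w)
```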
2.1. Generalized Basis Matrix and Mean Vector
This subsection elaborates on the GPCA process used to derive each task's generalized basis matrix and mean vector. Generally, the feature map of an intermediate layer of a deep-learning network consists of channels extracted by convolution filters with fixed, trained parameters. Notably, the channels in the feature map of a given layer are generated by the same convolution filters regardless of the input data fed into the network. Consequently, the relationships between the channels of the compression target feature map can be regarded as relatively similar for any input data [17,18]. Leveraging this characteristic, we obtain the generalized basis matrix and mean vector for each compression target feature map through the GPCA process before encoding. These are predefined in the proposed VCM encoder and decoder and are used to reduce inter-channel redundancy by transforming each feature map. The generalized basis matrix and mean vector for each high-level feature map and the combined feature map are derived per machine vision task as follows: considering each spatial point (1 × 1 × C) of a feature map (W × H × C) as a sample and performing PCA, a basis matrix (C × C) consisting of eigenvectors sorted by eigenvalue and a mean vector (C × 1) are obtained. These are then used to reduce the correlation between the feature map channels.
The aforementioned process exploits the strong correlation among channels within the feature map generated by the same layer, regardless of the input data. To leverage this correlation for dimensionality reduction and compression, each three-dimensional feature map extracted from the selected data points is reshaped into two dimensions, as shown in Figure 4. This allows the concatenation of feature maps with different spatial resolutions. The GPCA is then performed on each sample set, consisting of one or more feature maps. These sets are obtained by concatenating the reshaped feature maps, or the reshaped combined feature maps concatenated along the channel axis, as illustrated in Figure 4. This process yields the TGB and TGM for each high-level feature map (P4, P5, and L2) and the TGCB and TGCM for each combined feature map, which is composed of two low-level feature maps (P2 and P3, or L0 and L1), as shown in Figure 5. These processes are performed separately for each task, yielding two sets of TGB and TGM and one set of TGCB and TGCM for instance segmentation and object detection. For object tracking, one set of TGB and TGM (for L2) and one set of TGCB and TGCM (for the combined L0/L1) are derived. These sets are predefined in both the proposed VCM encoder and decoder for coefficient compression in each task. The proposed method uses the same dataset across all tasks to derive the generalized basis matrix and mean vector. Furthermore, the number of coefficients to be compressed was determined experimentally, and this is discussed in detail in the next section.
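For concreteness, the sketch below shows one way the GPCA step could be realized: spatial points from the feature maps of several images are pooled into a single sample matrix, and a single PCA yields the generalized basis matrix and mean vector. Applying the same function to the combined 2C-channel low-level maps would yield the TGCB and TGCM. This is an illustration under the reading of Figures 4 and 5 described above, not the reference implementation.

```python
import numpy as np

def generalized_pca(feature_maps):
    """Derive a generalized basis matrix (C x C) and mean vector (C,) from a list of
    (C, H, W) feature maps extracted from several images.

    Every spatial point is treated as a C-dimensional sample and pooled across all
    maps, so a single PCA covers all inputs; a sketch of the GPCA step, not the
    reference implementation.
    """
    samples = np.concatenate(
        [f.reshape(f.shape[0], -1).T for f in feature_maps], axis=0)   # (N, C)
    mean = samples.mean(axis=0)                                        # generalized mean vector
    cov = np.cov(samples - mean, rowvar=False)                         # (C, C) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                                  # sort by eigenvalue, descending
    return eigvecs[:, order], mean                                     # generalized basis matrix, mean
```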
2.2. Quantization and Packing
The coefficients are initially represented as 32-bit floating-point values. For compression with VVC, these values must be quantized to an integer representation. Various quantization techniques aim to minimize the resulting error, such as Lloyd–Max quantization [21]. However, prior studies have indicated that the impact of quantization errors in the feature map domain on task performance is not substantial [11]. Therefore, the previous feature map compression methods and the MPEG-VCM feature anchor generation process [22] adopted a uniform integer quantization method that relies only on the minimum and maximum values and the bit depth [11,12,13,14].
Figure 6 shows a feature map and the coefficients obtained using the TGB and TGM. As illustrated in Figure 6a, the feature map captures object edges in the input data. Because the coefficients represent the principal components of the feature map, they exhibit similar edge-like characteristics, as demonstrated in Figure 6b. Therefore, the proposed VCM codec uses the following uniform integer quantization process:
$$\hat{c} = \mathrm{round}\!\left(\frac{c - c_{\min}}{c_{\max} - c_{\min}} \times \left(2^{B} - 1\right)\right),$$

where $c$ and $\hat{c}$ represent the original and quantized coefficients, respectively; $B$ is the bit-depth for quantization; and $c_{\min}$ and $c_{\max}$ denote the minimum and maximum values of all coefficients for the input image or video, respectively. The dequantization process performed on the decoder side reconstructs an approximation of the original coefficient values. It can be expressed as follows:

$$\tilde{c} = \hat{c}' \times \frac{c_{\max} - c_{\min}}{2^{B} - 1} + c_{\min},$$

where $\hat{c}'$ represents the decoded coefficients and $\tilde{c}$ the dequantized coefficients. The dequantized coefficients are derived by multiplying by the quantization step and adding the minimum value, resulting in floating-point values in the original range.
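The following sketch mirrors these two equations in code; the handling of the min/max side information (which must reach the decoder) is an assumption made for illustration.

```python
import numpy as np

def quantize(coeff, bit_depth=8):
    """Uniform integer quantization of the coefficients (the first equation above)."""
    c_min, c_max = float(coeff.min()), float(coeff.max())
    q = np.round((coeff - c_min) / (c_max - c_min) * (2 ** bit_depth - 1))
    return q.astype(np.uint16), c_min, c_max     # min/max travel as side information (assumed)

def dequantize(decoded, c_min, c_max, bit_depth=8):
    """Dequantization on the decoder side (the second equation above)."""
    return decoded.astype(np.float32) * (c_max - c_min) / (2 ** bit_depth - 1) + c_min
```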
In the proposed VCM encoder, a packing process is carried out after the quantization process, as illustrated in Figure 2. The spatial dimensions of each coefficient plane are the same as the spatial resolution of the corresponding feature map, and the number of coefficients to be compressed is determined by the retained ratio of the TGB or TGCB. The quantized coefficients can be rearranged using spatial or temporal packing for efficient compression. However, the gain of a spatiotemporal arrangement is not significant compared with a spatial arrangement, as reported in [23]. Thus, the packing process is performed in a spatial tiling fashion in z-scan order, as demonstrated in Figure 7. Finally, the packed and quantized coefficients of each feature map are encoded using VVC.
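A minimal sketch of such spatial tiling is shown below; the tile-grid shape, the padding policy, and the bit-interleaving definition of the z-scan index are assumptions made for illustration rather than the packing used in the reference software.

```python
import numpy as np

def _z_index(r, c, bits=16):
    """Interleave the bits of (row, column) to obtain the z-scan index of a tile."""
    z = 0
    for i in range(bits):
        z |= ((c >> i) & 1) << (2 * i) | ((r >> i) & 1) << (2 * i + 1)
    return z

def pack(coeff, cols):
    """Tile the (K, h, w) quantized coefficient planes into a single picture,
    filling the tile grid in z-scan order (a sketch; grid shape and padding are assumed)."""
    k, h, w = coeff.shape
    rows = -(-k // cols)                                   # ceil(K / cols)
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: _z_index(*rc))
    picture = np.zeros((rows * h, cols * w), dtype=coeff.dtype)
    for idx in range(k):
        r, c = cells[idx]
        picture[r * h:(r + 1) * h, c * w:(c + 1) * w] = coeff[idx]
    return picture
```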
3. Experimental Results
Experimental results of the proposed feature map compression method against the feature anchors [24,25] under the MPEG-VCM common test conditions (CTC) [22] are presented. The experimental setup is described in detail, and then the objective performance of the proposed method is evaluated for each machine vision task.
Table 2 outlines the experimental setup, detailing the deep-learning network, test dataset, and metrics for each machine vision task, as defined in the MPEG-VCM CTC. The experiment employed three pretrained deep-learning networks: Faster R-CNN X101-FPN and Mask R-CNN X101-FPN from Detectron2 [26], and the DarkNet-53 ImageNet pretrained model [27] for JDE-1088 × 608. Experiments were conducted on four test sets from three datasets: two subsets of the OpenImage V6 dataset [28], one of 5k images for object detection and another of 5k images for instance segmentation; the SFU-HW-Objects-v1 dataset [29] for object detection; and three video sequences from the Tencent video dataset (TVD) [30] for object tracking. The performance of the proposed scheme was evaluated using mean average precision (mAP) for object detection and instance segmentation and multiple object tracking accuracy (MOTA) for object tracking. Bitrate was measured in bits per pixel (bpp) for the image datasets and as bitrate for the video dataset. The Bjøntegaard-delta (BD)-rate [31] metric was also used to evaluate the average bitrate savings at equivalent machine vision task performance.
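For reference, the BD-rate used throughout the results can be computed as in the sketch below, which follows the classic cubic-fit formulation of [31] with the task metric (mAP or MOTA) taking the role that PSNR plays in video coding; production tools typically use piecewise-cubic interpolation, so exact values may differ slightly.

```python
import numpy as np

def bd_rate(rate_anchor, metric_anchor, rate_test, metric_test):
    """Bjontegaard-delta rate: average bitrate difference (%) at equal task accuracy.

    Cubic-fit variant with the task metric (mAP/MOTA) in place of PSNR; a sketch,
    since production tools usually use piecewise-cubic interpolation.
    """
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    p_a = np.polyfit(metric_anchor, lr_a, 3)     # log-rate as a cubic in the metric
    p_t = np.polyfit(metric_test, lr_t, 3)
    lo = max(min(metric_anchor), min(metric_test))
    hi = min(max(metric_anchor), max(metric_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100        # negative values mean bitrate savings
```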
To evaluate the proposed method, three sets of generalized basis matrix and mean vector are required for object detection and instance segmentation, and two sets for object tracking. These were obtained through the GPCA process described in Section 2.1. The dataset used for this purpose comprised 344 frames randomly sampled from the TVD training videos, none of which overlapped with the test data of any task. Furthermore, the numbers of TGB and TGCB basis vectors were determined at the saturation point for each machine vision task by comparing the amount of data to be compressed with the corresponding performance (mAP and MOTA) as a function of the TGB ratio. As illustrated in Figure 8, the TGCB ratio for P2 and P3 was set to 15%, and the TGB ratio for P4 and P5 was set to 30%; at this point, mAP saturated for object detection. The TGB ratio saturated for instance segmentation at the same point as for object detection. Additionally, MOTA saturated for object tracking when the TGCB ratio for L0 and L1 was set to 30% and the TGB ratio for L2 was set to 20%.
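Under these ratios, a rough estimate of how much coefficient data remains to be compressed, relative to the uncompressed P-layer feature maps, can be obtained as follows; the 256-channel P-layers, the strides, and the example resolution are assumptions, so the resulting figure is only indicative.

```python
# Indicative fraction of data left to compress for object detection under the chosen
# ratios (assumes 256-channel P-layers, strides 4/8/16/32, and an example resolution).
H, W, C = 800, 1216, 256
orig = {"P2": C * (H // 4) * (W // 4), "P3": C * (H // 8) * (W // 8),
        "P4": C * (H // 16) * (W // 16), "P5": C * (H // 32) * (W // 32)}

kept = {
    # Combined P2/P3 lives at P3 resolution with 2C channels; 15% of its basis is kept.
    "P2+P3": round(0.15 * 2 * C) * (H // 8) * (W // 8),
    "P4": round(0.30 * C) * (H // 16) * (W // 16),
    "P5": round(0.30 * C) * (H // 32) * (W // 32),
}

print(f"coefficients / original feature data: {sum(kept.values()) / sum(orig.values()):.1%}")
```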
We also employed the VVC test model (VTM12.0 [32]) for compression of the coefficients. For all tasks, the coefficients were quantized with a quantization bit-depth $B$ of 8. Furthermore, the quantized coefficients were encoded with six QPs per task, using the all-intra (AI) configuration for object detection and instance segmentation and the random access (RA) configuration for object tracking, as per the VVC common test conditions [33].
Table 3, Table 4 and Table 5 outline the performance of the proposed method compared with the feature anchor for each machine vision task's test dataset. Experiments were carried out at six evaluation rate points (QPs) for each dataset, and the BD-rate was subsequently calculated. The corresponding rate performance (RP) curves are shown in Figure 9. The six evaluation rate points for each dataset were determined as follows. For the OpenImage V6 and TVD datasets, experiments were first run on all available QPs, and the six QPs whose mAP values were closest to those of the feature anchor results were then selected. For the 14 sequences of the SFU-HW-Objects-v1 dataset, the same six QPs per sequence as in the feature anchor results were used.
As indicated in Table 3, the proposed method showed superior performance compared with the feature anchor across all evaluation points. Specifically, it achieved 85.35%, 70.33%, 85.15%, and 72.37% BD-rate reductions compared with the feature anchor for the OpenImage V6 dataset and each class of the SFU-HW-Objects-v1 dataset, respectively. Moreover, the proposed method also demonstrated improved performance in instance segmentation compared with the feature anchor, obtaining an 89.27% BD-rate reduction, as shown in Table 4. In terms of object tracking, the experimental results summarized in Table 5 show that the proposed method achieved BD-rate reductions of 50.49%, 72.47%, and 79.09% for TVD-01, TVD-02, and TVD-03, respectively. It displayed comparable MOTA performance at a lower bitrate than the feature anchor at each rate point and achieved a 64.91% overall BD-rate gain.
Table 6 shows the BD-rate reductions achieved by our method against the previous PCA-based feature map compression methods [12,13,14] in object detection and instance segmentation on the OpenImage V6 dataset. In the object detection task, the proposed method achieved an 85.35% BD-rate reduction against the feature anchor and outperformed the previous methods [12,13,14] with BD-rate reductions of 69.20%, 65.29%, and 49.87% against them, respectively. In addition, the proposed algorithm achieved an 89.27% BD-rate reduction against the feature anchor in the instance segmentation task, while outperforming [14] with a 41.90% BD-rate reduction.
Table 5 reveals that the MOTA performance of the proposed method at each QP is similar to that of the feature anchor for TVD-03. However, for TVD-01 and TVD-02, the MOTA performance of the proposed method is lower than that of the feature anchor, particularly at high QPs. This discrepancy also appears in some sequences of the SFU-HW-Objects-v1 dataset. To analyze these results, we examined the ground truth within all frames of the TVD dataset, as depicted in Figure 10. As shown in the figure, nearly all ground-truth bounding boxes in TVD-01 and TVD-02 are small relative to the video resolution, whereas almost all ground-truth bounding boxes in TVD-03 are larger than those in TVD-01 and TVD-02. The same phenomenon appears in the SFU-HW-Objects-v1 sequences that showed low performance compared with the feature anchor at the high bitrates of the proposed algorithm. Nevertheless, the proposed feature map compression method delivered superior overall performance, including accuracy at high bitrates, for sequences that mainly contain medium- or large-sized objects. Therefore, we believe that the somewhat lower accuracy observed at higher bitrates for some sequences arises from the low accuracy of small-object detection. More specifically, the proposed method achieves higher compression efficiency by compressing the combined low-level feature maps using the characteristics of the FPN structure; however, this process may degrade the low-level feature map information needed for small-object detection. Despite this, the proposed algorithm generally outperforms all feature anchors of MPEG-VCM, as shown in Table 3, Table 4 and Table 5 and Figure 9. In particular, the proposed algorithm obtained superior accuracy on all machine vision tasks and datasets at low bitrates, which is important for utilizing MPEG-VCM technology in low-power devices. As a result, the proposed compression of multi-level feature maps can offer high machine vision task accuracy at a reduced bit requirement, indicating that the proposed scheme is especially efficient for medium- or large-sized objects.
4. Conclusions
In this paper, we presented a transform-based feature map compression approach for VCM. The compression target feature maps for each VCM task exhibit spatial similarities, as they are extracted from an FPN structure with a top-down architecture. Additionally, each channel within the feature map generated by a given layer is strongly correlated with the others, irrespective of the input data fed to the deep-learning network. Given these properties, the proposed method conducts the GPCA process separately from encoding to derive the TGB, TGCB, TGM, and TGCM for effective compression. These elements are predefined in the proposed VCM encoder and decoder and minimize the cross-channel redundancy of the compression target feature maps through transforms. Consequently, the PCA process is excluded from the VCM encoder, and no compression loss is incurred for the generalized data. The encoder compresses only the coefficients required for the reconstruction of each feature map, using VVC. Experimental evaluations conducted under the MPEG-VCM common test conditions across various machine vision tasks demonstrated the effectiveness of the proposed method. For object detection on the OpenImage V6 and SFU-HW-Objects-v1 datasets, our method attained BD-rate reductions of 85.35%, 70.33%, 85.15%, and 72.37% compared with the feature anchor. In instance segmentation, the method yielded an 89.27% BD-rate reduction, and for object tracking, it achieved a 64.91% BD-rate reduction relative to the feature anchor. Moreover, the proposed algorithm outperformed the previous PCA-based feature map compression methods. In future work, we plan to enhance the proposed algorithm to ensure superior performance even for small objects, which would make our approach applicable to a broader range of use cases.