1. Introduction
With the recent proliferation of smart devices, video-based application services have become popular, increasing the demand for high-quality video. Therefore, the accurate perception of video quality is essential for video sharing and streaming platforms. As the foundation for the inpainting and enhancement of low-quality videos, video quality assessment (VQA) has become a promising research field over the past few decades. Quality evaluation methods fall into two categories: subjective and objective. In general, subjective video quality evaluation by human experts is considered the most accurate method [
1]. The scores obtained from subjective methods are often taken as the ground truth of objective evaluation during the training process [
2]. However, this approach is time-consuming and expensive, which renders it impractical for testing the large number of videos transmitted in real time. By contrast, objective quality assessment by a computer is cheap and efficient. Objective VQA methods can mimic the human visual system (HVS), distinguish distortions in a video, and thus reasonably perceive its quality. Depending on the availability of reference videos, VQA methods fall into three categories: full-reference (FR) [
3,
4,
5,
6], reduced-reference (RR) [
7,
8], and no-reference (NR) [
9,
10]. Since videos without distortions can be used as the evaluation criterion, in most cases the scoring results of FR/RR methods achieve satisfactory similarity to human perception. When evaluating the quality of videos captured in real time, however, it is difficult to obtain distortion-free videos as references. Thus, the importance of NR-VQA methods cannot be overemphasized.
Among the existing studies, one category of NR-VQA methods perceives artificial distortions generated in the laboratory, while the other perceives real distortions. Current NR-VQA methods for synthetically distorted videos have achieved phenomenal success. Nevertheless, a significant proportion of videos in real life are obtained by shooting in the wild. Hence, NR-VQA metrics for authentically distorted videos have become a matter of concern. Compared to videos with synthetic distortions, the frames of in-the-wild videos contain various types of objects, some of which are beyond the cognitive scope of existing models. Meanwhile, due to camera movement, abnormal exposure, and defocus, distortions are randomly distributed in the video, causing difficulties during the feature extraction process. We believe that NR-VQA for authentic distortions currently faces two challenges: first, the selection of features, and second, the formation of temporal memory.
In recent years, the application of deep learning in video processing tasks (e.g., action recognition) has gradually attracted attention. Unlike image processing tasks, video analysis tasks focus not only on the information contained in a single frame but also on the association among frames in a certain period, which are referred to as spatial and temporal information, respectively. Spatial features are usually extracted by using 2D convolutional networks such as AlexNet [
11], VGG [
12], and ResNet [
13], while optical flow networks [
14,
15,
16,
17], 3D convolution [
18,
19,
20,
21], recurrent neural networks [
22,
23], etc., are often used to obtain temporal features or associations. These experiences inspire objective VQA. Since research has proven that image quality assessment results are intrinsically associated with the awareness of content [
24], we can instinctively assume that there is a correlation between video quality evaluation and motion esthesia.
Based on this intuition, we propose a multi-dimensional hybrid feature network to process static spatial features and dynamic temporal features for NR-VQA. Specifically, spatial features are content-related features formed by 2D convolutional networks, and temporal features include motion-related features between adjacent frames (or a clip) formed by high-dimensional convolutional networks. These spatial and temporal characteristics contain a wealth of underlying information. For the purpose of extracting the information from a long-term sequence and modeling temporal memory, we employ a structure containing recurrent neural networks, which is called the temporal memory module (TMM) in this article. In order to better fit the subjective scoring datasets, we simulate the temporal hysteresis effect in human viewing habits by using a self-generated parametric network and controlling the impact of historical quality on the overall evaluation accordingly. In addition, we use multiple datasets for network training, thereby ensuring that the model has better generalization performance.
The main contributions of this paper are as follows: (i) unique multi-dimensional spatiotemporal feature extraction and integration strategies for objective NR-VQA; (ii) the design of a time-memory module containing recurrent neural networks for long-term sequence-dependence modeling; and (iii) adaptive parameter networks to imitate the impact of historical quality on assessment score predictions.
The remainder of the paper is organized as follows. In
Section 2, previous works on NR-I/VQA are reviewed, while
Section 3 contains the introduction of the proposed approach in detail.
Section 4 exhibits the experimental validation of the method and related techniques on mainstream VQA databases with corresponding analysis. Finally, the paper is concluded in
Section 5 with possible future directions for research in this area.
2. Related Works
In this section, we provide a brief summary of the existing related methods for NR I/VQA tasks. Similar to the image quality evaluation metrics based on classic methods [
25,
26,
27,
28,
29] and deep learning [
30,
31,
32,
33,
34], most of the VQA methods can be divided into two steps: feature extraction and (feature-based) quality assessment. In addition, HVS characteristics, such as the influence of ambient illumination level [
35] and temporal hysteresis effect [
36], are equally important. The difference is that VQA places more emphasis on spatio-temporal motion information. For instance, the motion-coherency-related algorithm V-BLIINDS by Saad et al. [
37], which uses discrete cosine transform (DCT) domain natural scene statistics (NSS), and features in the 3D discrete cosine transform (3D-DCT) domain based on spatiotemporal natural video statistics (NVS) were proven to be effective [
38]. Wu et al. [
39] proposed an NR-VQA metric to estimate the SSIM of a single video frame. Using video mean subtracted contrast normalized (MSCN) coefficients and spatiotemporal Gabor bandpass-filtered outputs, [
40] established an asymmetric generalized Gaussian distribution (AGGD) model to perceive distortions. In the meantime, optical flow [
4,
41], ST-chip [
42], multi-scale trajectory [
43], and bitstream level features [
44,
45,
46,
47] were also used to quantify distortion in video data. Although many of these methods contribute greatly to the perception of specific distortions without reference, they are not satisfactory for evaluating the quality of in-the-wild videos with sophisticated distortion.
The success of CNNs in object detection, instance segmentation, video understanding, and other fields has attracted the attention of VQA researchers. Specifically, studies of perceptual similarity [
24,
48] showed that quality analysis is intrinsically linked to object recognition, and thus it will be effective to use existing pre-trained CNN models for video quality analysis. The authors of [
49] provided an efficient deep-learning metric called DIQM to reduce the computational complexity in mimicking the HVS. For perceivable encoding artifacts (PEAs), [
50] proposed a CNN for identifying different kinds of distortions. The convolutional neural network and multi-regression-based evaluation (COME) method [
51] uses a multi-regression model to imitate human psychological perception. Concerning the limitations of HDR-VDP-2, [
52] developed NoR-VDPNet to predict global quality with a substantially lower computational cost. Wei et al. utilized a two-level network based on semantic information to estimate image quality [
53]. Entropic differences learned by the CNN network were used to capture distortions in [
54]. In order to equip models with time-series memory, recurrent neural networks are used in many metrics. For example, Li et al. [
55] trained a GRU with CNN features for NR-VQA in order to obtain a perception of video frame content and distortion. The combination of 3D-CNN and LSTM was used in [
56] for distortion perception. With the help of transfer learning and temporal pooling, [
57] developed a new NR-VQA architecture. In this paper, we construct a GRU-based structure with shortcut connections for temporal memory. On the one hand, this addresses long-term sequence dependence; on the other hand, it reduces information loss.
For NR-I/VQA, there are already some databases suitable for training deep learning networks, such as UPIQ [
58], CVD2014 [
59], KoNViD-1k [
60], LIVE-Qualcomm [
61], LIVE-VQC [
62], etc. Network models trained for a specific database often perform poorly in terms of prediction in other databases due to the differences between individual databases. In order to help the models in obtaining better generalization performance, researchers have recently proposed some methods for cross-dataset training. Zhang et al. [
63] built a training set with image pairs in order to avoid subjective quality evaluation for different datasets. Based on the study of different feature distributions for different datasets, UGC-VQA [
64] considered a selected fusion of BVQA models to reduce the inconsistency in subjective assessment among datasets. Li et al. [
65] divided the evaluation of network prediction scores into three steps: relative quality esthesia, perceptual quality awareness, and subjective quality generation, followed by a multi-parameter structure for the transitions between tiers to suit different datasets.
As described in this section, the fusion of multi-dimensional CNN features, which was proven to be quite effective in video understanding, has rarely been taken into account in the VQA task. Therefore, exploring appropriate fusion methods and designing reasonable processing frameworks for them in time series can be considered a promising area of research.
3. Proposed Method
In the proposed method, a fusion of multidimensional CNN features is considered. Such features are very common in video understanding tasks such as action recognition but are rarely used in VQA tasks. As shown in
Figure 1, our network structure can be divided into three parts: (i) a multidimensional feature fusion module, (ii) a temporal memory module, and (iii) an adaptive perception score generation module. In the multidimensional feature fusion module, different network structures and convolutional kernels are utilized to process the video sequences and thus generate rich spatio-temporal features. We then fuse these features and feed them into the temporal memory module. In the temporal memory module, in order to generate features containing previous temporal memory information, we use a special recurrent network structure with shortcuts to form evaluation memories over the entire video sequence. In the score generation module, we use a self-generated parameter structure to cope with the impact of frame quality variations on overall quality perception.
3.1. Multidimensional Features Fusion
The purpose of multi-dimensional fusion is to obtain rich features that characterize spatio-temporal information by applying different convolutional kernels and network structures to the original video clips; these features are subsequently propagated to the rest of the network. The feature maps generated by convolutional networks of different dimensions are shown in
Figure 2. We used ResNet networks [
13], R(2+1)D networks, and R3D networks [
21], which have similar structures. In addition to the 2D-CNN, we select two different multi-dimensional features because, on the one hand, their temporal and spatial meanings differ and, on the other hand, it is necessary to prevent an imbalance between spatial and temporal information.
In the image/video quality evaluation task, subjective evaluation results are shown to be correlated with content such as scenes and objects [
55]. Advances in deep 2D convolutional neural networks in fields such as object recognition suggest that 2D-CNNs can competently mimic the human perception of static content in video sequences. Simultaneously, the deep features generated by such networks have been proven to be distortion-sensitive [
66]. Therefore, 2D-CNN backbones for image recognition, such as ResNet, are regularly used in image/video quality assessment. Typically, these networks are initially pre-trained on image classification databases such as ImageNet to generate feature maps related to static content/distortion. Evaluation scores are then obtained by using the deep feature maps after subsequent processing. In addition, it is fairly common to generate evaluations by using transfer learning. As for video tasks, in parallel to static scenes and objects, human perceptual content also consists of temporally manifested motion. When imitating the human visual system, in addition to focusing on 2D scene/object features, a 3D convolutional network can be used to pay attention to the effect of low-level motion content on the evaluation results.
Suppose the video $V$ contains $n$ image frames $\{F_1, F_2, \ldots, F_n\}$, and each group of $t$ adjacent frames forms a clip. Then, there are $m$ image clips $\{C_1, C_2, \ldots, C_m\}$, where $m = \lfloor n/t \rfloor$. The features extracted by CNN models of different dimensions can be denoted as follows, where $M^{2D}_i$ denotes the 2D-CNN features, $M^{(2+1)D}_j$ denotes the (2+1)D-CNN features, and $M^{3D}_j$ denotes the 3D-CNN features:

$$M^{2D}_i = \mathrm{CNN}_{2D}(F_i), \quad i = 1, \ldots, n,$$
$$M^{(2+1)D}_j = \mathrm{CNN}_{(2+1)D}(C_j), \qquad M^{3D}_j = \mathrm{CNN}_{3D}(C_j), \quad j = 1, \ldots, m.$$
The 2D-CNN convolves individual image frames, while 3D-CNN convolves a clip of several image frames.
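For concreteness, a minimal PyTorch sketch of the three-stream feature extraction is given below. It assumes torchvision backbones (ResNet-50, R(2+1)D-18, and R3D-18, consistent with Section 3.4), a hypothetical clip length T, and non-overlapping clips; the truncation points and preprocessing are illustrative rather than the exact configuration of our implementation.

```python
import torch
import torchvision.models as tvm
import torchvision.models.video as tvv

T = 16  # illustrative clip length t; the exact value is a configuration choice

# 2D backbone for per-frame content features (truncated before global pooling).
resnet2d = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V1)
backbone2d = torch.nn.Sequential(*list(resnet2d.children())[:-2]).eval()   # -> (n, 2048, h, w)

# (2+1)D and 3D backbones for per-clip motion features.
r2p1d = tvv.r2plus1d_18(weights=tvv.R2Plus1D_18_Weights.KINETICS400_V1)
r3d = tvv.r3d_18(weights=tvv.R3D_18_Weights.KINETICS400_V1)
backbone2p1d = torch.nn.Sequential(*list(r2p1d.children())[:-2]).eval()    # -> (m, 512, t', h, w)
backbone3d = torch.nn.Sequential(*list(r3d.children())[:-2]).eval()        # -> (m, 512, t', h, w)

@torch.no_grad()
def extract_feature_maps(video):
    """video: (n, 3, H, W) tensor of normalized frames F_1..F_n."""
    n = video.shape[0]
    m = n // T                                         # number of non-overlapping clips C_1..C_m
    clips = video[: m * T].reshape(m, T, 3, *video.shape[-2:]).permute(0, 2, 1, 3, 4)
    maps_2d = backbone2d(video)                        # M^2D_i, one map per frame
    maps_2p1d = backbone2p1d(clips)                    # M^(2+1)D_j, one map per clip
    maps_3d = backbone3d(clips)                        # M^3D_j, one map per clip
    return maps_2d, maps_2p1d, maps_3d
```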
After the convolution operation, a global average pooling (GAP) layer transforms the feature maps $M^{2D}_i$, $M^{(2+1)D}_j$, and $M^{3D}_j$ into feature vectors $f^{2D}_i$, $f^{(2+1)D}_j$, and $f^{3D}_j$, thus enabling the recurrent neural network to be used for memorization. In order to avoid excessive information loss from the GAP operation on the 2D convolutional features, we also use global standard deviation pooling (GSP) to obtain variation information. Finally, the outputs of the two pooling layers are concatenated as follows, where ⊕ denotes the concatenation operation:

$$f^{2D}_i = \mathrm{GAP}(M^{2D}_i) \oplus \mathrm{GSP}(M^{2D}_i), \qquad f^{(2+1)D}_j = \mathrm{GAP}(M^{(2+1)D}_j), \qquad f^{3D}_j = \mathrm{GAP}(M^{3D}_j).$$
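A minimal sketch of this pooling step follows, assuming the feature-map shapes produced above; combining GAP and GSP on the 2D stream yields a 4096-dimensional vector for a 2048-channel map, matching the feature size reported in Section 3.4.

```python
import torch

def gap(feature_map):
    """Global average pooling over all spatial (and temporal) axes -> (B, C)."""
    return feature_map.mean(dim=tuple(range(2, feature_map.dim())))

def gsp(feature_map):
    """Global standard deviation pooling over the same axes -> (B, C)."""
    return feature_map.std(dim=tuple(range(2, feature_map.dim())))

def pool_features(maps_2d, maps_2p1d, maps_3d):
    f2d = torch.cat([gap(maps_2d), gsp(maps_2d)], dim=-1)   # 2048 + 2048 = 4096 dims per frame
    f2p1d = gap(maps_2p1d)                                   # 512 dims per clip
    f3d = gap(maps_3d)                                       # 512 dims per clip
    return f2d, f2p1d, f3d
```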
For each frame, the network generates a 2D convolutional feature vector; however, for
t frames, there is only one 3D feature vector and one (2+1)D feature vector, which results in the feature-length difference in time sequence. As shown in
Figure 3, in order to align vectors for the concatenation operation, we consider two kinds of rescaling methods: shortening the long vector and amplifying the short vector. Shortening methods include long vector sub-sampling and sum-pooling, while amplification methods include nearest-neighbors upsampling and global upsampling.
The long vector sub-sampling means selecting only the vector generated in one of the
t image frames as the representative vector for the concatenation operation. In sum-pooling, we sum the t features in a clip to avoid the information attenuation caused by sub-sampling. The nearest-neighbors upsampling alignment is implemented by replicating each clip-generated feature t times, and the global upsampling alignment copies the feature vector of all clips in a video
t times to make the long and short vectors the same length. In subsequent experiments, we find that long vector sub-sampling loses a large amount of two-dimensional perceptual information, resulting in poor model performance, while the global upsampling method performs relatively better. Let the vector length be
L; after aligning the three vectors, we perform a concatenation operation on them to obtain the features
f containing spatio-temporal information.
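The sketch below illustrates the two upsampling-based alignments and the final concatenation; how exactly the global upsampling tiles the clip-level sequence is our reading of the description above, and the helper names are hypothetical.

```python
import torch

def nearest_neighbor_upsample(clip_feats, t):
    """Repeat each clip-level feature t times so it covers its t frames: (m, C) -> (m*t, C)."""
    return clip_feats.repeat_interleave(t, dim=0)

def global_upsample(clip_feats, t):
    """Tile the whole clip-level sequence t times to reach frame-level length: (m, C) -> (m*t, C)."""
    return clip_feats.repeat(t, 1)

def fuse(f2d, f2p1d, f3d, t):
    """Align the clip-level streams to the frame-level 2D stream and concatenate (⊕)."""
    f2p1d_up = global_upsample(f2p1d, t)
    f3d_up = global_upsample(f3d, t)
    n = min(f2d.shape[0], f2p1d_up.shape[0])          # assume n ≈ m * t
    return torch.cat([f2d[:n], f2p1d_up[:n], f3d_up[:n]], dim=-1)   # (n, 4096 + 512 + 512)
```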
3.2. Temporal Memory Module
In the above subsection, we use 3D-CNN to model the connection between adjacent frames. The feature
f can be considered as an encoding of low-level motion characteristics. In order to further develop a long-time series modeling for high-level features, we use recurrent neural networks (RNN). GRU [
23] is a typical recurrent neural network that uses a gating mechanism to control input, memory, and other information when making predictions at the current time step. It has the advantage of preserving information over long sequences without discarding it, even when it is weakly correlated with the current prediction. In our network, we implement a multi-level cascade of fully connected layers and recurrent networks. On top of the GRU, we add a short path to enhance the learning ability of the network, the structure of which is depicted in
Figure 4. This temporal memory block (TMB) is composed of a GRU branch and a shortcut branch, representing historical quality memory and current quality perception, respectively, and thus avoiding an excessive loss of information. Specifically, we concatenate the GRU hidden state
$h_s$ at the current moment $s$ with the input information $x$ after dimension reduction and finally feed the result into the nonlinear activation layer.
In order to enhance the understanding of the time-series information, we link several TMB blocks and then use a fully connected layer to generate a history-related quality score for each clip. The scores of all clips in a video form a video rating vector $\mathbf{q} = [q_1, q_2, \ldots, q_m]$.
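As an illustration, a possible PyTorch realization of the TMB and of a three-block TMM with a fully connected scoring head is sketched below; the hidden and shortcut dimensions are placeholders rather than the values listed in Table 1, and the choice of ReLU as the nonlinear activation is an assumption.

```python
import torch
import torch.nn as nn

class TMB(nn.Module):
    """Temporal memory block: a GRU branch (historical quality memory) in
    parallel with a reduced-dimension shortcut branch (current perception)."""
    def __init__(self, in_dim, hidden_dim, shortcut_dim):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.shortcut = nn.Linear(in_dim, shortcut_dim)    # dimension reduction of the input
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, seq_len, in_dim)
        h, _ = self.gru(x)                                 # hidden states at every step
        s = self.shortcut(x)
        return self.act(torch.cat([h, s], dim=-1))         # (batch, seq_len, hidden + shortcut)

class TMM(nn.Module):
    """Three cascaded TMBs followed by an FC layer producing one score per clip."""
    def __init__(self, in_dim, dims=((256, 64), (128, 32), (64, 16))):
        super().__init__()
        blocks, d = [], in_dim
        for hidden, short in dims:                         # placeholder sizes, not those of Table 1
            blocks.append(TMB(d, hidden, short))
            d = hidden + short
        self.blocks = nn.Sequential(*blocks)
        self.score = nn.Linear(d, 1)

    def forward(self, x):
        return self.score(self.blocks(x)).squeeze(-1)      # (batch, seq_len) clip scores
```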
3.3. Adaptive Perception Score Generation
By processing all clips in the video, we generate an array of scores associated with the historical quality impact. In this subsection, we introduce a parameter-adaptive video score generation strategy. Research has shown that a decrease in video quality leaves a stronger impression on the scorer than a comparable enhancement. This phenomenon is referred to as the temporal hysteresis effect [36]. It can be inferred that when there are poor-quality clips in the video sequence, the rating perception drops significantly, while it is not as sensitive to rising quality. Based on this, we generate the video quality score $Q$ from the clip scores $\mathbf{q}$ (see Figure 5). As mentioned in [55], the final evaluation score consists of two components: the memory of the historical worst perceptions $l_j$ and the current rating status $c_j$. $l_j$ is generated by a Min pooling block, $c_j$ is generated by a Softmin-weighted average pooling, and $w$ is the weight vector generated by a differentiable Softmin function. The subjective quality score of each clip can be approximated by linearly combining $l_j$ and $c_j$ with a parameter $\gamma$, as follows:

$$q'_j = \gamma\, l_j + (1 - \gamma)\, c_j.$$
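A sketch of this pooling is given below, assuming that the Min pooling acts over a window of preceding clip scores and the Softmin-weighted average over the current and following scores, in the spirit of the temporal hysteresis model of [36,55]; the window length tau and the fixed gamma are placeholders (the adaptive generation of gamma is described next).

```python
import torch
import torch.nn.functional as F

def hysteresis_pool(q, tau=12, gamma=0.5):
    """q: (m,) clip scores from the TMM; returns hysteresis-adjusted clip scores."""
    m = q.shape[0]
    l = torch.empty(m)          # memory component: Min pooling over previous scores
    c = torch.empty(m)          # current component: Softmin-weighted average of upcoming scores
    for j in range(m):
        l[j] = q[max(0, j - tau): j + 1].min()
        cur = q[j: min(m, j + tau)]
        w = F.softmin(cur, dim=0)               # lower scores receive larger weights
        c[j] = (w * cur).sum()
    return gamma * l + (1.0 - gamma) * c        # q'_j = gamma * l_j + (1 - gamma) * c_j
```

The video-level score $Q$ can then be obtained, for example, by averaging the adjusted clip scores.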
In the human visual system, the memory of history is also affected by the current status. If the current clip exhibits exceptional performance (relatively good or relatively poor), the test subjects will be impressed, whereas if the current clip is relatively mediocre, the test subjects will recall previous scenes more often. Thus, the proportion of these two components in the final evaluation should be dynamic. Therefore, we design an adaptive weight ($\gamma$) generation structure using an FC layer and a nonlinear activation layer. During the training process, the network learns the weight on its own from the score vector $\mathbf{q}$.
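A minimal sketch of such an adaptive weight generator is shown below; compressing the variable-length score vector into fixed-size statistics (mean, standard deviation, minimum, maximum) before the FC layer, and using a sigmoid to keep gamma in (0, 1), are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveGamma(nn.Module):
    """Generate the combination weight gamma from the clip score vector q."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)     # FC layer over summary statistics of q
        self.act = nn.Sigmoid()       # nonlinear activation keeping gamma in (0, 1)

    def forward(self, q):
        stats = torch.stack([q.mean(), q.std(), q.min(), q.max()])
        return self.act(self.fc(stats)).squeeze(-1)

# The learned gamma would replace the fixed value used in hysteresis_pool above.
```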
3.4. Implementation Details
In this paper, we use ResNet50 pre-trained on the ImageNet [
67] dataset and later fine-tuned on the image quality evaluation task of [
68] as a 2D-CNN feature extractor for a better perception of the distortions. We extract 2D features from the ‘res5c’ layer in ResNet50 and then set the feature size to 4096 after pooling. R(2+1)D-18 and R3D-18 [
21], which are pre-trained on the human action recognition dataset Kinetics [
69], are chosen for multidimensional feature extraction, with features taken from the ‘conv5_x’ layer; this provides the ability to perceive motion information. After the pooling operation, the feature sizes are both 512. The feature extraction module is separated from the model training process in order to avoid excessive computation time. In order to form temporal memory, we cascade three TMB blocks. The dimensions of each block are shown in
Table 1. The learning rate in our work is set to
and Adam is used as the optimizer to train our model for 40 epochs, with a batch size of 32. As in [
65], the model loss is defined as the softmax weighted average of the numerical summation of L1 loss, monotonicity-related loss, and accuracy-related loss in each dataset. We implement our model using PyTorch and conduct training as well as testing on a single NVIDIA 1080Ti GPU.
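The sketch below shows one plausible reading of this loss; the concrete forms of the monotonicity term (a pairwise rank hinge) and the accuracy term (1 − PLCC) are our assumptions and may differ from the exact formulation in [65].

```python
import torch
import torch.nn.functional as F

def monotonicity_loss(pred, mos):
    """Pairwise rank hinge: penalize pairs whose predicted order disagrees with the MOS order."""
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)
    dm = mos.unsqueeze(0) - mos.unsqueeze(1)
    return F.relu(-dp * torch.sign(dm)).mean()

def accuracy_loss(pred, mos):
    """1 - PLCC between predictions and MOS (assumed form of the accuracy-related term)."""
    p = pred - pred.mean()
    m = mos - mos.mean()
    plcc = (p * m).sum() / (p.norm() * m.norm() + 1e-8)
    return 1.0 - plcc

def multi_dataset_loss(batches):
    """batches: list of (pred, mos) pairs, one per dataset.
    Per-dataset losses (L1 + monotonicity + accuracy) are combined with softmax weights,
    so datasets with larger losses receive larger weights during training."""
    per_dataset = torch.stack([
        F.l1_loss(pred, mos) + monotonicity_loss(pred, mos) + accuracy_loss(pred, mos)
        for pred, mos in batches
    ])
    weights = torch.softmax(per_dataset.detach(), dim=0)
    return (weights * per_dataset).sum()
```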
5. Conclusions and Future Work
In this paper, we have presented an empirical study of the temporal effects in objective NR-VQA. Many current approaches are based purely on content-aware or motion-aware features while ignoring other equally important features. In addition, finding a better temporal network for perceiving video quality is worth investigating. With these motivations, we fuse content-oriented 2D-CNNs with motion-oriented 3D-CNNs and complement them with (2+1)D-CNNs to form convolutional features containing both static spatial and dynamic temporal information. As the extracted dynamic features contain only low-level temporal information generated among image frames, we further used a modified recurrent network structure for high-level quality perception over a long time scale. In an attempt to simulate the temporal hysteresis effect of the human visual system, a weighted-average evaluation model with adaptive weighting parameters was developed to generate the final scores. In order to verify the validity and generalization performance of the model, we conducted experimental validation on four public video quality datasets (CVD2014, KoNViD-1k, LIVE-Qualcomm, and LIVE-VQC) using SROCC and PLCC as metrics. The results reveal the superiority of the proposed method over current state-of-the-art methods, which demonstrates that fusing multidimensional information followed by reasonable temporal modeling is highly feasible for NR-VQA.
Current 3D convolution is computationally intensive and time-consuming, making it challenging to train evaluation networks end-to-end, and thus hindering further improvements in network performance. In the future, it will be important to find a more lightweight multidimensional feature extraction module for video quality awareness to enable end-to-end network training.