1. Introduction
Video is ubiquitous in daily life: TV shows, computer games, and online meetings all depend on video quality. As mentioned in [1], video sensor networks (VSNs) are communication infrastructures that involve video coding, transmission, and display/storage. Via VSNs, dense visual information is captured and transmitted to applications on different devices for users to view. The size of video clips is many times larger than that of images and text. To reduce bandwidth usage and save storage space, video coding is widely used. Recent technologies such as that in [2] transmit only the difference in data between the past and current images, where the difference data are computed by an MPEG-4 Visual video encoder. Other technologies are integrated into encoders for video compression; for example, a denoising algorithm is combined with a high-efficiency video coding (HEVC) encoder to improve compression efficiency in [3]. More importantly, if the quality of the network is poor, the perceptual quality continues to deteriorate as fewer packets are received [4]. This is another problem faced during transmission. Before displaying to users, the decoder restores the data to a video. However, video quality is often degraded after such a long process. The question is, "How can we measure quality?".
In the real world, objects are three-dimensional (3D). To measure perceptual quality, many aspects need to be considered. Resolutions of texture and mesh are combined to measure 3D perceptual quality in [5]; optimized linear combinations of accurate geometry and texture quality measurements are used in [6]; and a multi-attribute computational model optimized by machine learning is used in [7]. For videos, all 3D objects are projected onto a 2D plane; thus, estimating quality for videos is simpler than for 3D objects. Quality assessment can be performed with two types of scores: subjective and objective. Users must manually rate videos to allow the computation of precise subjective scores; however, this process is very time-consuming. Thus, researchers have tried to develop objective scores that estimate subjective scores automatically. Many aspects affect objective scores, such as contrast, frequency, pattern, and color perception, so some metrics are developed to analyze specific features. Objective metrics can be further divided into three categories: full-reference (FR) methods, reduced-reference (RR) methods, and no-reference (NR) methods.
There are many image quality assessment (IQA) methods. For instance, SSIM [8] and PSNR [9] are FR IQA methods, while the methods in [10,11] and NIMA [12] are NR IQA methods. Since videos are composed of image frames, we consider extending IQA methods to assess video quality. Some VQA methods are extended from IQA methods, including PSNR [9], the extension of PSNR based on the HVS (PSNR-HVS) [13], PSNR based on between-coefficient contrast masking (PSNR-HVS-M) [14], structural similarity (SSIM) [8], and the extension of SSIM to video structural similarity (VSSIM) [15]. However, existing full-reference VQA methods do not correlate highly with human perception when the video content varies considerably. In addition, some methods simply average the IQA scores over all frames; by doing so, they only consider spatial features such as color and illumination and neglect temporal features. Our goal is to combine motion features and 2D spatial features together. In our work, we extend two IQA methods, namely SSIM [8] and PSNR [9], to obtain the RAVA scores. We chose these IQA metrics because SSIM and PSNR are the most widely used IQAs; we want to apply them to video quality evaluation and compare their performance with existing VQA methods, especially the VQA methods derived from them. We divide each frame into foreground and background regions, calculate the IQA scores for those regions, and then assign them different weights based on the motion features. We also notice that if we linearly combine the foreground and background scores as the VQA score, the range and mean value differ considerably for videos with different content. To generalize the VQA scores for videos with various types of content, we introduce a self-supervised video content distinction network. Finally, the foreground, background, and content distinction features are passed to a support vector regression (SVR) [16] model to obtain the final VQA score.
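As a rough illustration of the region-weighting step, the Python sketch below computes a motion-weighted combination of foreground and background SSIM scores for a single frame pair. The function name, the mask input, and the Farneback flow parameters are our own assumptions for illustration; in the actual method, the foreground and background features are passed to an SVR together with the content distinction features rather than being combined linearly.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def region_weighted_frame_score(ref_prev, ref_curr, dist_curr, fg_mask):
    """Illustrative sketch: motion-weighted foreground/background IQA score.

    ref_prev, ref_curr: consecutive grayscale reference frames (uint8).
    dist_curr: distorted frame aligned with ref_curr.
    fg_mask: boolean mask marking the foreground (moving-object) region.
    """
    # Dense optical flow on the reference video (Farneback is one option;
    # the parameters here are generic defaults, not the paper's).
    flow = cv2.calcOpticalFlowFarneback(ref_prev, ref_curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)

    # Per-pixel SSIM map; any full-reference IQA metric could stand in here.
    _, ssim_map = structural_similarity(ref_curr, dist_curr,
                                        data_range=255, full=True)
    fg_score = ssim_map[fg_mask].mean()
    bg_score = ssim_map[~fg_mask].mean()

    # Weight each region by its share of the average flow magnitude.
    fg_motion, bg_motion = mag[fg_mask].mean(), mag[~fg_mask].mean()
    fg_weight = fg_motion / (fg_motion + bg_motion + 1e-8)
    return fg_weight * fg_score + (1.0 - fg_weight) * bg_score
```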
For evaluation, we used the LIVE Mobile VQA database [17], the MCL-V database [18], and the Netflix Public Dataset [19]. The video categories in these datasets are quite varied, and the datasets are widely used to analyze video quality. The LIVE Mobile VQA database [17] and the Netflix Public Dataset [19] provide the subjective differential mean opinion score (DMOS), while the MCL-V database [18] provides the mean opinion score (MOS). As mentioned in [20], the MOS is a typical subjective quality of experience (QoE) assessment score; for instance, users rate quality from 1 (bad quality) to 5 (excellent quality). It is considered the most accurate way to measure QoE, since actual users provide the ratings. The DMOS is calculated by first taking the difference scores from the raw quality scores and then converting them into Z-scores with outliers removed [21]; here, lower is better. To evaluate the results, we analyzed Pearson's (linear) correlation coefficient (PCC) [22] and Spearman's rank correlation coefficient (SCC) [23] between the RAVA scores and the subjective scores.
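For reference, both correlation coefficients are available in SciPy; the sketch below uses made-up score vectors purely to show the computation. Since lower DMOS values indicate better quality, a good metric yields a strongly negative correlation with DMOS (absolute values are often reported).

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted scores and subjective DMOS values (lower DMOS is better).
predicted = [62.1, 55.4, 70.8, 48.3, 66.0]
dmos      = [35.2, 48.9, 27.5, 55.1, 31.8]

pcc, _ = pearsonr(predicted, dmos)   # linear correlation
scc, _ = spearmanr(predicted, dmos)  # rank (monotonic) correlation
print(f"PCC = {pcc:.3f}, SCC = {scc:.3f}")
```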
The contributions of our paper are as follows: (1) we propose a region-based VQA method that estimates video quality by extracting and processing information from the background regions and the moving objects in the foreground regions; (2) we integrate a self-supervised video content distinction network to generalize the VQA scores for videos with different content; and (3) we extend two full-reference IQA metrics to VQA metrics in the experiments, which demonstrates the possibility of applying the RAVA technique to other FR IQA methods.
2. Related Work
Objective video quality assessment techniques can be categorized into three types: full-reference (FR) methods, reduced-reference (RR) methods, and no-reference (NR) methods. FR methods utilize the entire original video to determine the quality score. Nevertheless, their accuracy is relatively poor; thus, perceptual factors of the human visual system (HVS) need to be incorporated to develop reliable video quality assessment techniques [24]. Some FR VQA methods are as follows.
Netflix proposed video multi-method assessment fusion (VMAF) [19] in 2016. VMAF calculates the visual information fidelity (VIF) [25], the detail loss metric (DLM) [26], and a motion feature, which is defined as the average absolute pixel difference of the luminance component between adjacent frames (a simplified sketch of this feature follows this paragraph). A support vector regressor (SVR) is then used to fuse these elementary metrics. In 2018, Netflix reported that adding AVX optimization and frame-level multi-threading accelerated execution roughly threefold and improved prediction accuracy [27]. Liu et al. [28] proposed a VQA metric using space–time slice mappings. They first use spatial–temporal slices (STS) [29] to obtain a set of STS maps. Then, on each reference–distorted STS map pair, they calculate IQA scores via a full-reference IQA algorithm. Finally, they apply feature pooling over the IQA scores of those maps to obtain the final score. Aabed et al. [30] proposed a perceptual video quality assessment (PVQA) metric that analyzes the power spectral density (PSD) of a group of pictures. The authors built 2D time-aggregated PSD (or tempospatial PSD) planes for several sets of frames of both the original and distorted videos to capture spatio-temporal changes in the pixel domain. They then built a local cross-correlation map; the perceptual quality score is the average of the values in this map, with a higher value implying better quality.
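As a point of reference, the VMAF motion feature described above reduces to a mean absolute luma difference between adjacent frames. The sketch below is our own simplified rendering; the actual VMAF implementation also low-pass filters the frames before differencing, which we omit here.

```python
import numpy as np

def motion_feature(luma_frames):
    """Mean absolute luminance difference between adjacent frames.

    luma_frames: array of shape (num_frames, height, width) holding the
    luma (Y) component of each frame. Returns one value per frame pair.
    (Simplified: VMAF itself filters the frames before differencing.)
    """
    frames = np.asarray(luma_frames, dtype=np.float64)
    return np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))
```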
RR methods extract salient features from both the original and acquired videos, compare these features, and obtain the objective score. For example, the Institute for Telecommunication Sciences (ITS) proposed the video quality metric (VQM) [31], which was adopted as a standard by the American National Standards Institute (ANSI) and the International Telecommunication Union (ITU) [31].
VQM is defined in (1):

$$\mathrm{VQM} = -0.2097\, si\_loss + 0.5969\, hv\_loss + 0.2483\, hv\_gain + 0.0192\, chroma\_spread - 2.3416\, si\_gain + 0.0431\, ct\_ati\_gain + 0.0076\, chroma\_extreme \quad (1)$$

Here, $h$ and $v$ represent the horizontal and vertical axes, respectively; $si\_loss$ detects the loss of or decrease in spatial information; $si\_gain$ detects edge sharpening or enhancement; $hv\_loss$ captures the shift of edges from vertical and horizontal orientations to a diagonal orientation; $hv\_gain$ finds the shift of edges from diagonal to horizontal; $chroma\_spread$ finds changes in the spread of the distribution of 2D color samples; $chroma\_extreme$ measures serious localized color impairments; and $ct\_ati\_gain$ is the product of a contrast feature and a temporal information feature [32].
NR methods assess the quality of a new video without referring to the original video. Li et al. [33] proposed VSFA (quality assessment of in-the-wild videos), which integrates two eminent effects of the human visual system: content-dependency and temporal-memory effects. Content-dependency effects are obtained by extracting features from an image classification neural network pre-trained on ImageNet; temporal-memory effects are integrated by adding a gated recurrent unit and a subjectively inspired temporal pooling layer to the network. The method does not refer to the original video when predicting video quality. Zadtootaghaj et al. [34] proposed an NR VQA method, DEMI. DEMI first trains on the scores predicted by the pre-trained VMAF model [19], is then fine-tuned on a small image quality dataset, and finally applies a random forest for feature pooling.
The problem with existing FR and RR methods is that they do not work well when the content of the videos in a dataset varies considerably; the correlation values of many existing FR and RR VQA scores with human perception are low. For NR methods, the predicted quality tends to be more affected by the content than by the distortions. NR methods are often trained and tested on in-the-wild video datasets, whose videos are collected from real-world video sequences. The content of these datasets does vary significantly, but the amount of distortion involved is uncertain.
Since we want to extend image quality assessment metrics to video, we introduce below the two full-reference IQA metrics used in our work.
2.1. PSNR
Peak signal-to-noise ratio (PSNR) [9] is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. A higher value of PSNR is better:

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{MAX^2}{\mathrm{MSE}} \right)$$

where $MAX$ is the highest value in the two input variables (it is normally 255 for RGB images) and $\mathrm{MSE}$ is the mean squared error of the two inputs.
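A minimal NumPy rendering of the formula above, assuming two images of the same shape:

```python
import numpy as np

def psnr(ref, dist, max_val=255.0):
    """PSNR in dB between a reference and a distorted image."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: noise power is zero
    return 10.0 * np.log10(max_val ** 2 / mse)
```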
2.2. SSIM
SSIM [8] is a perception-based model that considers image degradation as a perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance and contrast masking. Structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close. For SSIM, a higher value is better:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ is the average of $x$, $\mu_y$ is the average of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $c_1$ and $c_2$ are two variables that stabilize the division with a weak denominator, with $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$. $L$ is the dynamic range of the pixel values (typically $2^{\#\text{bits per pixel}} - 1$). $k_1 = 0.01$ and $k_2 = 0.03$ by default.
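In practice, SSIM need not be implemented from scratch; scikit-image provides it, with `data_range` corresponding to $L$ and `K1`/`K2` to $k_1$/$k_2$ in the formula above. A small self-contained usage sketch with synthetic frames:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Synthetic stand-ins for a reference frame and a mildly distorted copy.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noise = rng.integers(-10, 11, size=(64, 64))
dist = np.clip(ref.astype(int) + noise, 0, 255).astype(np.uint8)

score = structural_similarity(ref, dist, data_range=255, K1=0.01, K2=0.03)
print(f"SSIM = {score:.4f}")
```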
5. Conclusions
We introduced a new video quality evaluation approach that integrates region-based detection, temporal weights from optical flow, and content distinction features with existing image quality assessment methods. Our RAVA technique was applied to extend two full-reference IQA metrics. We first separated the foreground and background regions of all video frames. Then, we integrated the motion features into the weights while designing the VQA metrics: the region weights were defined as each region's share of the average optical-flow magnitude across all regions. The foreground feature was the weighted average of the foreground IQA scores, and the background feature was the simple average of the background IQA scores. Furthermore, a content distinction network was added to generalize the RAVA scores for videos with various types of content. All features were passed to an SVR model to predict the final VQA score. We tested on two different datasets to validate the RAVA technique: the LIVE Mobile VQA database and the MCL-V database are widely used VQA datasets, so we used them to compare the performance of the RAVA methods with existing methods. By analyzing the correlation of the RAVA scores with the DMOS (or MOS) provided by the datasets, we observed that both RAVA variants performed very well. Furthermore, the PSNR-based RAVA produced better results than the PSNR scores of existing video quality assessment methods, and the SSIM-based RAVA performed better than SS-SSIM and MS-SSIM.
In summary, we believe that the RAVA approach has practical significance. It can extend IQA methods to VQA methods, and we expect it to be widely applicable for video quality assessment in the future.