1. Introduction
With the rapid development of multimedia technology, a large number of videos and images are generated and processed every day, and these are often subject to quality degradation. As a fundamental technology for various image applications, image quality assessment (IQA) has always been an important issue. The aim of IQA is to automatically estimate image quality to assist in the handling of low-quality images during the capture process. IQA methods can be divided into three major classes based on whether reference images are available: full-reference IQA (FR-IQA) [1,2], reduced-reference IQA (RR-IQA) [3], and no-reference IQA (NR-IQA). In most cases, no reference version of a distorted image is available; consequently, it is both more realistic and increasingly important to develop an NR-IQA model that can be widely applied [4]. NR-IQA models are also called blind IQA (BIQA) models. Notably, deep neural networks (DNNs) have performed well in many computer vision tasks [5,6,7,8], which has encouraged researchers to use the formidable feature representation power of DNNs to perform end-to-end optimized BIQA, an approach called DNN-based BIQA. These methods use prior knowledge from the IQA domain, such as the relationship among entropy, distortion, and image quality, and attempt to solve IQA tasks using the powerful learning ability of neural networks. Accordingly, there is a strong need for DNN-based BIQA models in the many cases where image quality is crucial.
However, attempts to use DNNs for the BIQA task have been limited by the conflicting characteristics of DNNs and IQA [9]. DNNs require massive amounts of training data to comprehensively learn the relationships between image data and score labels; however, classical IQA databases are much smaller than the computer vision datasets available for deep learning. An IQA database is composed of a series of distorted images and corresponding subjective score labels. Because obtaining a large number of reliable human-subjective labels is time consuming, the construction of IQA databases requires many volunteers and complex, long-term experiments. Therefore, expanding the available number of distorted image samples and labels that fully reflect human visual system (HVS)-aware quality factors for training is a key problem for DNN-based BIQA.
Based on the baseline datasets considered for expansion, current DNN-based BIQA methods can be divided into two general approaches. The first approach is to use the images in an existing IQA dataset as the parent samples; we call this approach small-scale expansion. In this case, the goal of expansion is achieved by dividing the distorted images from the IQA dataset into small patches and assigning to each patch a separate quality label that conforms to human visual perception. The second strategy is to expand the number of distorted images by using another, non-IQA dataset as the parent dataset; we call this approach large-scale expansion. In this approach, nondistorted images from outside the IQA dataset are first selected, and distortion is added to these images based on the types of distortion present in the IQA dataset to construct new distorted images on a large scale. The newly generated distorted images are then simply labeled with values that reflect their ranking in terms of human visual perceptual quality.
The small-scale expansion strategy relies on division. The initial algorithm [10] assigns the score label of each parent image to the corresponding small patches and then uses a shallow CNN to perform end-to-end optimization. The small patches and their labels are input directly to the network during training, and the predicted scores for all the small patches are averaged to obtain the overall image score during prediction. However, this type of expansion is not strictly consistent with the principles of the HVS. Previous studies have shown that saliency exerts a crucial influence on human-perceived quality; thus, saliency should be considered in IQA together with distortion and content [11,12,13]. These studies have shown that the human eye tends to focus on certain regions when assessing an image’s visual quality and that different regions have different influences on the perceived quality of a distorted image. Therefore, it is not appropriate to assign identical quality labels to all patches from a single image, because local perceptual quality is not always consistent with global perceptual quality [14,15]: an uneven spatial distortion distribution will result in varying local scores for different image patches. Many works have therefore attempted to account for this aspect of the problem. Saliency was first considered in DNN-based BIQA algorithms in [16,17]: the authors still assigned identical initial quality labels to the small patches, but the predicted scores for all small patches were eventually multiplied by different weights based on their saliency to obtain the overall image scores, thereby weakening the influence of patches with inaccurate labels in nonsalient regions on the overall image quality. In [18,19], strategies based on proxy quality scores [18] and an objective error map [19] were used to further improve the accuracy of the labels for different patches. All these strategies further increased the accuracy of this type of expansion and led to better predictions, confirming that jointly considering the influence of saliency and distortion on image quality more comprehensively reflects HVS-related perceptual factors. However, division strategies have obvious inherent drawbacks. First, because expansion is applied only to the existing distorted images in the IQA database (the expansion parent), the diversity of the training sample contents is not increased. The different levels of quality influenced by saliency and distortion must already be present in the training dataset, but it is difficult to claim that a typical small IQA database can comprehensively represent the influence of HVS factors on quality; hence, such methods are easily susceptible to overfitting. Second, there is a tradeoff between the extent of expansion achieved and the patch size. When the patch size is too small, each individual patch no longer contains sufficient distorted semantic information for IQA, inevitably destroying the correlations between image patches. In contrast, a large patch size results in smaller-scale expansion, meaning that only a shallow network can be used for training; moreover, the generated saliency-based patch weights will show large deviations from the real salient regions.
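As a rough illustration of the division strategy, the following minimal sketch (not the exact pipeline of [10]; the patch size, stand-in shallow CNN, and loss are illustrative assumptions) shows patches inheriting the parent image's score during training and patch predictions being averaged at prediction time:

```python
import torch
import torch.nn as nn

# Division-strategy sketch: every patch inherits the parent image's score;
# the image-level prediction is the mean over patch predictions.
def extract_patches(image, patch=32):
    # image: (C, H, W) tensor -> (N, C, patch, patch) non-overlapping patches
    c, _, _ = image.shape
    tiles = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

shallow_cnn = nn.Sequential(          # stand-in for the shallow CNN of [10]
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

def train_step(image, parent_score, optimizer):
    patches = extract_patches(image)
    labels = torch.full((patches.size(0), 1), parent_score)   # identical labels
    loss = nn.functional.mse_loss(shallow_cnn(patches), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def predict(image):
    return shallow_cnn(extract_patches(image)).mean().item()  # average patch scores
```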
To avoid dividing the images in the IQA database while still not requiring human labeling, the large-scale expansion strategy instead involves creating new distorted images by adding distortion to a large number of high-definition images obtained from outside the IQA database. Separate values that reflect the overall quality level are assigned to the distorted images obtained from each original parent image. Because the labels of the newly generated images are not direct quality scores, the expanded database is used only to pretrain the DNN, which is then fine-tuned on the IQA database. This approach alleviates the training pressure placed on the small IQA dataset and successfully avoids the drawbacks of division encountered in the small-scale strategy: the number of labeled training images is expanded by a large amount, increasing the diversity of the training sample content. Such unrestricted, large-scale expansion also makes it possible to use deeper networks; in fact, a deep model pretrained on an image recognition task can also be used to further enhance the effect. The large-scale expansion approach was developed over the past two years and has shown a much better effect than small-scale expansion algorithms. However, large-scale expansion also has significant shortcomings. Although the newly added images with quality-level labels are consistent with human perception, they reflect only part of the HVS-aware quality factors, namely, the overall distortion; the joint effects of saliency and distortion are not considered. Moreover, large-scale expanded datasets are typically prepared to assist in specific IQA tasks. The more similar the extended pretraining dataset is to the original IQA dataset for the target task, the more effectively it can support the IQA task. In this context, a “similar” dataset is an expanded dataset that fully reflects the influences of the HVS-related perceptual factors (saliency and distortion) embodied in the IQA task of interest. The current algorithms [15,20] that use this approach mainly follow the lead of RankIQA [20]: they generate a series of distorted versions of each original parent image by adding different levels of distortion (with uniform distortion for each image region) and assign different numerical labels to reflect the overall quality level. Consequently, the quality degradation of each distorted image depends only on the level of the distortion added to the whole image, and HVS-aware quality factors are not well embedded into the expanded database. Using this type of extended dataset to pretrain the network will simply teach it that a greater level of distortion leads to greater quality degradation; the network will be unable to discern that salient regions are more important than nonsalient regions and that different regions contribute differently to the overall image quality. Obviously, this type of expansion does not yield an ideal pretraining dataset for IQA.
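For concreteness, a minimal sketch of this RankIQA-style uniform expansion follows (only Gaussian blur is shown, and the blur radii are illustrative placeholders rather than RankIQA's actual parameters); note that every version is distorted uniformly over the whole image and labeled only by its rank:

```python
from PIL import Image, ImageFilter

# RankIQA-style expansion sketch: each parent image yields several uniformly
# distorted versions whose labels encode only the overall distortion rank.
BLUR_RADII = [1, 2, 4, 8, 16]    # illustrative levels, not RankIQA's values

def expand_uniform(parent_path):
    parent = Image.open(parent_path).convert("RGB")
    # label 1 = least distorted (highest quality), label 5 = most distorted
    return [(parent.filter(ImageFilter.GaussianBlur(r)), rank)
            for rank, r in enumerate(BLUR_RADII, start=1)]
```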
In this paper, we introduce saliency into the large-scale expansion method, with the aim of constructing DNN-based BIQA models that will be effective in various cases where image quality is crucial. The objective is to be able to automatically estimate image quality to assist in handling low-quality images during the capture process. Moreover, by virtue of the introduction of saliency, our proposed model can achieve better prediction accuracy for large-aperture images (with clear foregrounds and blurred backgrounds), which are currently popular. We propose a new approach for incorporating saliency into BIQA that is perfectly compatible with the large-scale data expansion approach to ensure the full consideration of HVS-related factors in the mapping process. Specifically, we introduce saliency factors through regional distortion, thereby conveniently combining saliency and distortion factors during the expansion of each image to generate a series of distorted image versions. Then, we use the information entropy to rank these images based on their quality to complete the labeling process. By constructing a more efficient pretraining tool for DNN-based BIQA, we improve the prediction performance of the final model. We use our generated large-scale dataset to pretrain a DNN (VGG-16) and then use the original small IQA dataset to fine-tune the pretrained model. Extensive experimental results obtained by applying the final model to four IQA databases demonstrate that compared with existing BIQA models, our proposed BIQA method achieves state-of-the-art performance, and it is effective on both synthetic and authentic distorted images. Therefore, we conclude that a data expansion approach that fully reflects HVS-aware quality factors is beneficial for IQA. This study presents a novel method for incorporating saliency into IQA tasks, namely, representing it as regional distortion.
Our contributions can be summarized as follows: (1) We introduce saliency into the large-scale expansion method in a manner that fully reflects the influence of HVS-aware factors on image quality, representing a new means of considering saliency in IQA. With the incorporation of the saliency factor, the proposed data expansion method overcomes the main drawback of its predecessor algorithm, RankIQA [20], which supports learning only the quality decline caused by the overall distortion level. Our approach enables the construction of an efficient pretraining dataset for DNN-based BIQA tasks and results in improved prediction accuracy compared to previous BIQA methods. (2) We propose a new data expansion method that fully reflects HVS-aware factors by generating distorted images based on both distortion and saliency and assigning labels based on entropy. This method successfully embeds the joint influence of saliency and distortion into a large-scale expanded distorted image dataset.
The remainder of this paper is organized as follows. Section 2 describes the important factors that affect image quality and explores how those factors affect human judgments of image quality. Section 3 introduces the proposed expansion method and describes its use in IQA in detail. Section 4 reports the experimental results and presents corresponding discussions. Finally, Section 5 offers conclusions.
3. Proposed Method
As mentioned above, our main goal is to construct a newly expanded dataset to support DNN-based BIQA tasks. We introduce saliency into the large-scale expansion strategy for the first time by creating distorted images based on the joint consideration of both saliency and distortion; we then label the images based on their information entropy. The degradation of image quality in our new expanded dataset is not only related to the distortion level (as in RankIQA [20]) but also fully reflects the joint influence of distortion and saliency on image quality. We use this large-scale expanded dataset to pretrain a DNN and then use the original small IQA dataset to fine-tune the pretrained DNN. After fine-tuning, we obtain the final BIQA model. The flow chart of our proposed method is shown in Figure 4.
In this section, we present a detailed description of our method, which is divided into two main stages: dataset expansion and the use of the expanded dataset. First, we introduce our novel method of incorporating saliency into the large-scale dataset expansion process for IQA. Then, we describe the dataset generation process: image expansion based on saliency and distortion and image labeling guided by the information entropy. Finally, we describe how the expanded dataset is used in the IQA task, which involves a two-step training process to ensure that the DNN fully learns how HVS-aware factors influence image quality.
3.1. The Use of Saliency in IQA
The incorporation of saliency into the expansion procedure is a key step because we want to consciously capture the influences of both saliency and distortion when generating distorted images. Previous algorithms [16,17,18,19] introduced saliency into the IQA task by assigning different weights to different regions of a distorted image when predicting the final score. This use of saliency is suitable for small-scale expansion but cannot be applied in the case of large-scale expansion. Moreover, there is no straightforward way to add saliency factors to the existing distorted image versions generated for RankIQA (large-scale expansion), in which several images with different distortion intensities were created and labeled by quality rank: because each label is a single number that represents the overall quality level, using regional saliency weights is insufficient. Furthermore, the salient regions in a given image may shift under different distortion levels; examples of this distortion-induced attentional shift are shown in Figure 5. As the level of distortion increases, the salient areas also shift. Thus, differently distorted images with the same content should have different local saliency weight values. This saliency shift further increases the difficulty of adding saliency to the existing distorted images generated for RankIQA. Therefore, finding a new way to introduce saliency into the large-scale expansion process for IQA is crucial.
On the one hand, the large-scale expansion strategy has the following characteristics: the time-consuming psychometric approach is not employed to obtain subjective score labels, and each distorted image derived from a given image by applying a given type of distortion has only a simple numerical label that represents its quality level. On the other hand, Section 2.1 shows that the influences of salient and nonsalient regions on quality are quite different. Based on these two considerations, we are inspired to introduce saliency into the expanded dataset in the form of regional distortion. We can generate multiple distorted images by adding distortion to high-resolution reference images: some will be subjected to global distortion, some will be distorted only in the salient regions of the reference images, and others will be distorted only in the nonsalient regions. Because the locality of the distortion (regional or global) differs across the extended set of distorted images, these images will have different perceptual qualities. Next, instead of asking volunteers to provide subjective scores, we can sort the distorted images based on their information entropy and assign simple numerical labels that represent their quality ranking. In this way, the combined effects of both saliency and distortion on quality will be reflected in the expanded dataset.
To implement the approach proposed above, we performed two preparatory steps. First, we needed to choose a saliency model. From among the many possible saliency models, we selected [24] because it emphasizes the identification of the entire salient area. Second, we needed to establish a measure of how each impact factor affects quality (as discussed in Section 2). In addition to the information entropy, this is another important measure for guiding the image generation and labeling processes during our expansion procedure. Based on these two preparatory steps, we introduce the details of our expansion method below.
3.2. Generating Images for the Expanded Dataset
We selected the Waterloo database [25], which includes a large number of high-resolution images (4744), as the parent database for the expansion process. Using MATLAB, we added distortion to these images to construct a large-scale expanded dataset containing a total of 4744 × 4 × 9 distorted images. Here, the factor of 4 arises from the 4 types of distortion (JP2K, JPEG, WN, and GB) applied to each parent image; we adopted these four distortion types because they are found in most available IQA databases. The factor of 9 arises because, for each distortion type, nine distorted images of different qualities were generated from a total of five distortion levels. We summarize this information in Table 1. Please note that because we used MATLAB to simulate the types of distortion present in the LIVE dataset, the distortion functions and distortion factors used may differ from those used in LIVE; therefore, the parameters in Table 1 are slightly different from those in Figure 3. Next, with the help of Figure 6, we explain how we used the five distortion levels and the saliency maps to generate nine distorted versions of each parent image.
As an example, we chose one original parent image (“shrimp” from the Waterloo database); panel (b) shows its saliency map, generated as described in [24]. Due to space constraints, only the nine distorted images generated using GB distortion are shown in Figure 6; nine corresponding distorted versions were also generated from each original parent image for each of the other three distortion types. As Figure 6 shows, during the expansion procedure, we used the method introduced in [24] to extract the saliency map of each original parent image. Then, according to the saliency map, we defined the region with pixel values greater than 30 as the salient region and the remaining area as the nonsalient region, thus dividing each image into two parts. Next, we independently added different levels of distortion to these two regions of the original image and spliced the results to obtain a distorted image. The distortion levels applied to the salient and nonsalient regions to generate the nine distorted images are shown in the GB distortion level column (e.g., “level 0 + level 1” for image (c) means that this image was generated by adding GB distortion of level 0 to the salient region and GB distortion of level 1 to the nonsalient region of image (a)). The definitions of distortion levels 1–5 for each distortion type can be found in Table 1; a level of 0 means no distortion.
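The following sketch illustrates this region-wise generation for GB distortion. It is a minimal sketch under several assumptions: the saliency model is abstracted as a precomputed grayscale map, the threshold of 30 follows the text, the blur radii are illustrative stand-ins for the level definitions in Table 1, and only the first three level combinations (for versions c, d, and e) come from the text, the rest being hypothetical placeholders for the combinations in Figure 6.

```python
import numpy as np
from PIL import Image, ImageFilter

# Illustrative Gaussian-blur radii for GB levels 1-5 (level 0 = no distortion);
# the real level definitions are those given in Table 1.
GB_RADIUS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}

# (salient level, nonsalient level) pairs for the nine versions c..k.
# The first three follow the text (c, d, e); the remaining six are
# hypothetical placeholders -- the actual combinations are in Figure 6.
LEVEL_PAIRS = [(0, 1), (0, 2), (1, 0),
               (1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (5, 5)]

def blur(img, level):
    return img if level == 0 else img.filter(ImageFilter.GaussianBlur(GB_RADIUS[level]))

def generate_versions(image, saliency_map):
    """Distort salient and nonsalient regions independently, then splice."""
    mask = np.asarray(saliency_map.convert("L")) > 30      # salient: pixel value > 30
    mask3 = np.repeat(mask[:, :, None], 3, axis=2)
    versions = []
    for sal_lv, non_lv in LEVEL_PAIRS:
        sal = np.asarray(blur(image, sal_lv))
        non = np.asarray(blur(image, non_lv))
        versions.append(Image.fromarray(np.where(mask3, sal, non)))
    return versions
```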
Our expanded set of distorted images fully reflects the influence of HVS-aware quality factors. The nine distorted image versions generated from each parent image contain different levels of distortion across the entire image region, thus representing the influence of the overall distortion level on quality. In addition, some distorted images have different levels of distortion in the salient and nonsalient regions, thus representing the joint influence of saliency and distortion on quality. We ranked the nine distorted images of the same distortion type generated from each original image separately. The corresponding distorted image versions of decreasing quality can fully reflect the quality degradation caused by various HVS-aware factors.
3.3. Entropy-Based Image Quality Ranking of the Expanded Dataset
After generating the distorted images, we next assigned quality labels to them. Each image in an IQA database has an assigned quality score generated through a time-consuming psychometric experiment, an option that is unavailable to us and, in fact, unnecessary: labels that simply reflect the quality ranking are sufficient to create the needed effect (as discussed in detail in Section 3.4). We refer to the nine distorted images of the same distortion type generated from the same parent image as a group; thus, there are a total of 4744 × 4 groups in our expanded dataset. We sorted the nine distorted images in each group separately by quality using the information entropy defined on the basis of Shannon’s theorem, because the information entropy is a measure that reflects the richness of the information contained in an image: the larger the information entropy of an image is, the richer its information and the better its quality. Moreover, the information entropy value is sensitive to image distortion and quality; distortion in the salient region leads to a significant reduction in the entropy value. Therefore, the information entropy is a suitable basis for our labeling procedure. The formula is as follows:

$$H = -\sum_{i=0}^{255} p_i \log_2 p_i,$$

where $H$ represents the information entropy of the image and $p_i$ represents the proportion of pixels with a grayscale value of $i$ in the grayscale version of the image. The ordering of the information entropy values reflects the quality ranking of a group of images. We used this formula to calculate the information entropy of each of the nine distorted images in one group and ranked these nine images in order of their information entropy values; accordingly, labels 1–9 were assigned to represent the image quality ranking. We use the letters c to k to denote the distorted image versions that compose each group (where c represents the distorted image generated by adding no distortion to the salient region and level 1 distortion to the nonsalient region of the original image). For each of these nine distorted image versions, we calculated the average entropy over the corresponding 4744 × 4 images, as shown in Table 2. The information entropy ranking results for most groups are consistent with the average order listed in Table 2. For each group, the labels for distorted images c to k range from 1 to 9, representing their sequentially decreasing quality. For example, for the nine images in Figure 6, the entropy decreases sequentially in the order in which they are displayed, so the labels range from 1 to 9. Some groups also exist in which the information entropy order differs from the average order in Table 2; in most such cases, the entropy values of images d (in which only the nonsalient region is distorted, at level 2) and e (in which only the salient region is distorted, at level 1) are reversed. However, we still rank image e below image d in quality to emphasize the importance of the salient region.
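A minimal sketch of this entropy-based labeling follows, assuming 8-bit grayscale conversion; the special handling of images d and e described above is noted but not enforced in the code.

```python
import numpy as np
from PIL import Image

def entropy(img):
    """Shannon entropy H = -sum_i p_i * log2(p_i) over the grayscale histogram."""
    gray = np.asarray(img.convert("L"))
    p = np.bincount(gray.ravel(), minlength=256) / gray.size
    p = p[p > 0]                     # ignore empty histogram bins
    return float(-(p * np.log2(p)).sum())

def rank_group(versions):
    """Label the nine versions of one group 1 (best) to 9 (worst) by entropy.
    Note: per the text, version e is always ranked below version d even when
    their entropy values happen to be reversed (not enforced here)."""
    order = sorted(range(len(versions)),
                   key=lambda i: entropy(versions[i]), reverse=True)
    labels = [0] * len(versions)
    for rank, idx in enumerate(order, start=1):
        labels[idx] = rank
    return labels
```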
These information entropy results are consistent with the previous conclusions regarding how HVS factors affect image quality. Images with only background distortion have higher quality indices than those with foreground distortion or whole-region distortion: because nonsalient regions have smaller entropy, distortion confined to nonsalient regions leads to only weak quality degradation. Correspondingly, images with only foreground distortion and images with overall distortion at the same level are of similar quality. In addition, as discussed in Section 2, the quality of the salient regions is highly consistent with that of the whole image. Please note that for a few landscape images in the Waterloo database, which have no obvious salient regions, we treated the entire image as the salient region to avoid negative effects. Although no convincing quality score labels could be assigned to the expanded images, we were still able to use the expanded database for our BIQA task by adopting a Siamese network and a corresponding training method, as discussed in the next section.
3.4. Using the Expanded Dataset for the IQA Task
Now, we will introduce the use of our new expanded dataset. Our training process consists of two steps: pretraining on the expanded dataset and fine-tuning on the IQA database. We trained a model based on VGG-16 [26], with the number of neurons in the output layer modified to 1. In our expanded database, for each original image and each distortion type, there are nine distorted images with corresponding labels from 1 to 9 that represent their quality ranking. We followed the training setup used by the authors of RankIQA [20]. During pretraining, to train the network on the quality ranking task, we used a double-branch version of VGG-16 (called a Siamese network) with shared parameters and a hinge loss. We show a schematic diagram of the pretraining process in Figure 7 and explain the training process in conjunction with the figure. Each input to the network consists of two images and two labels: a pair of images of different quality randomly selected from among the nine distorted images in one group. The image with the lower label (indicating higher quality) is always sent to the higher branch, and the other image is sent to the lower branch. When the outputs of the two branches are consistent with the order of the two labels, meaning that the network correctly ranks the two images by quality, the loss is 0. Otherwise, the loss is not 0, and the parameters are adjusted (by decreasing the gradient of the higher branch and increasing the gradient of the lower branch) as follows:

$$L(x_h, x_l; \theta) = \max\big(0,\; f(x_l; \theta) - f(x_h; \theta) + \varepsilon\big),$$

where $\theta$ represents the network parameters, $f(\cdot\,; \theta)$ denotes the output of a branch, $x_h$ and $x_l$ denote the higher- and lower-quality images of the input pair, respectively, and $\varepsilon$ is the margin. Thus, the loss function is continuously optimized by comparing the outputs of the two branches, and eventually, the training of the quality ranking model is complete. Because any two of the nine distorted images in a group may be paired to form the input, the network is efficiently forced to learn the joint influence of saliency and distortion on image quality. After pretraining, either network branch can produce a value for an input image (because the two branches share parameters), and the quality ranking of different input images will be reflected by the order of their corresponding output values.
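As an illustration of this pretraining step, the following PyTorch sketch implements the pairwise hinge loss over shared-parameter branches; the margin value, optimizer settings, and the use of torchvision's VGG-16 are assumptions for the sketch, not the exact training configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# VGG-16 with a single output neuron; both Siamese branches are this
# same module, so the parameters are shared by construction.
net = vgg16()
net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, 1)

margin = 1.0   # assumed hinge margin (epsilon); not the paper's exact value
optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)

def pretrain_step(x_high, x_low):
    """x_high: batch of higher-quality images; x_low: their lower-quality pairs."""
    y_high, y_low = net(x_high), net(x_low)
    # Hinge loss: zero when the outputs agree with the quality order.
    loss = torch.clamp(y_low - y_high + margin, min=0).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```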
This pretrained model is already very close to the final IQA model and can effectively judge the effects of saliency and distortion on quality. However, the output of this network is not a direct image quality score; only when multiple different images are input to obtain different output values does the order of these values reflect the order of the images in terms of quality. Therefore, to facilitate the comparison of our model with other BIQA models and to transform the network output into a direct quality score, our method includes a fine-tuning step based on the IQA database. From the pretrained model, we extract one branch to obtain a single VGG-16 network and train it on the original IQA dataset to complete the fine-tuning process. In each round of training, the input to the network is one image, and the corresponding quality score from the IQA database is the label; thus, the network learns an accurate mapping from distorted images to scores. Again following the approach of RankIQA, we use the sum of the squared errors as the loss function during fine-tuning.
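A minimal continuation of the sketch above for the fine-tuning step; `net` is one branch of the pretrained Siamese model, and the optimizer settings are again placeholder assumptions.

```python
import torch

# Fine-tuning sketch, continuing from the block above: `net` is one branch
# of the pretrained Siamese model (either branch works; parameters are shared).
ft_optimizer = torch.optim.SGD(net.parameters(), lr=1e-5, momentum=0.9)

def finetune_step(images, mos_scores):
    """images: batch from the IQA database; mos_scores: subjective score labels."""
    preds = net(images).squeeze(1)
    loss = ((preds - mos_scores) ** 2).sum()   # sum of squared errors
    ft_optimizer.zero_grad(); loss.backward(); ft_optimizer.step()
    return loss.item()
```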