1. Introduction
Screen technologies have made incredible progress in recent years. They are able to display brighter and darker pixels with more vivid colors than ever and, thus, create more impressive and realistic images.
Indeed, the new generation of screens can display a luminance that can go below 0.01
and up to 10,000
, thus, allowing them to handle images and video with a High Dynamic Range (HDR) of luminance. For comparison, screens with a standard dynamic range (SDR) are traditionally able to display luminance between 1 and 100
. To handle HDR images, new transfer functions have to be used to transform the true linear light to perceptually linear light (Opto-Electronic Transfer Function (OETF)). The function used to transform back the perceptually linear light to true linear light is called the Electro-Optic Transfer Function (EOTF). The OETF and the EOTF functions are not exactly the inverse of each other. This non-linearity compensates the differences in tonal perception between the environment of the camera and that of the display. The SDR legacy transfer functions called gamma functions are normalized in BT.709 [
1] and BT.1886 [
2]. For HDR video compression, two transfer functions were standardized: PQ [
3] (Perceptual Quantizer) and HLG (Hybrid-Log-Gamma) [
4].
Screen enhancements do not only focus on increasing the luminance range but also on the size of the color space it can cover. Indeed, the color space that a screen can display is limited by the chromatic coordinates of its three primary colors (Red, Green and Blue) corresponding to the three kinds of display photo-transmitters. The gamut, i.e., a subset of visible colors that can be represented by a color space, used to encode SDR images (normalized in BT.709 [
1]) is not wide enough to cover the gamut that could be displayed by a Wide Color Gamut (WCG) screen. The BT.2020 [
5] recommendation define how to handle this wider gamut. For the moment, there is no screen that can cover this gamut in its totality, but some are really close. The standard BT.2100 [
6] sums up all the aforementioned HDR/WCG standards.
For these new images and videos, new quality assessment metrics are required. Indeed, quality metrics are key tools to assess performances of diverse image processing applications such as image and video compression. Unfortunately, SDR image quality metrics are not appropriate for HDR/WCG contents. To overcome this problem, we can follow two strategies. The first one is to adapt existing SDR metrics to a higher dynamic range. For instance, instead of using a classical gamma transfer function, Aydin et al. [
7] defined a transfer function, called the Perceptually Uniform (PU) function, which corresponds to the gamma non-linearity (defined in BT.1886 [
2]) for luminance value between 0.1 and 80
while retaining perceptual linearity above. This method can be used for any metrics using the luminance corrected with the gamma function, such as PSNR, SSIM [
8], VIF [
9], Multiscale-SSIM [
10] (MS-SSIM). In this paper, the metrics using the Perceptually Uniform (PU) function have the prefix PU- (PU-PSNR, PU-SSIM). The second strategy is to design dedicated metrics for HDR contents. We can mention HDR-VDP2 [
11,
12] for still images and HDR-VQM [
13] for videos.
Several studies have already benchmarked the performances of HDR metrics. In [
14], the authors assessed the performances of 35 quality metrics over 240 HDR images compressed with JPEG XT [
15]. They conclude that HDR-VDP2 (version 2.2.1 [
12]) and HDR-VQM were the best performing metrics, closely followed by PU-MS-SSIM. In [
16], the authors came to the conclusion that HDR-VDP2 (version 2.1.1) can be successfully used for predicting the quality of video pair comparison contrary to HDR-VQM. In [
17], authors showed that HDR-VDP2, HDR-VQM, PU-VIF and PU-SSIM provide similar performances. In [
18], results indicate that PU-VIF and HDR-VDP2 have similar performances, although PU-VIF has a slightly better reliability than HDR-VDP2 for lower quality scores. More recently, Zerman et al. [
19] demonstrated that HDR-VQM is the best full-reference HDR quality metric.
The above studies have two major limitations. First, as all of these metrics are color-blind, they only provide an answer to the increase of luminance range but they do not consider the WCG gamut. Second, the databases used to evaluate the different metrics were most of the time defined with an HDR display only capable of displaying a BT.709 [
1] gamut. The WCG gamut BT.2020 [
5] is currently addressed neither by current metrics nor by current databases.
To overcome these limitations, in this paper, we adapt existing SDR metrics to HDR/WCG images using uniform color spaces adapted to HDR. Indeed, most SDR metrics assume that the representation of images is perceptually linear. To be able to evaluate HDR metrics that include both luminance and chromatic information, we propose two new image databases, that include chrominance artifacts within the BT.2020 wide color gamut.
This paper is organized as follows. First, we describe the adaptation of SDR metrics to HDR/WCG images using perceptually uniform color spaces. Second, we present the methodology used to evaluate the performances of these metrics. In a third part, the performances of the considered metrics are presented. Results are discussed in a fourth part. A fifth section describes our recommendation to assess the quality of HDR/WCG images. The last section concludes this paper.
2. From State-of-the-Art SDR Quality Assessment Metrics to HDR/WCG Quality Assessment Metrics
In this section we first present the perceptually uniform color spaces able to encode the HDR/WCG content. Then in a second part, we elaborate on the color difference metrics associated with these color space. In a third part, we describe a selection of SDR quality metrics. Finally, we present how we tailor SDR quality metrics to HDR/WCG content.
2.1. Perceptually Uniform Color Spaces
For many image processing applications such as compression and quality assessment, pixels are encoded with a three-dimensional representation: one dimension corresponds to an achromatic component the luminance and the two others correspond to the chromatic information. An example of this kind of representations is the linear color space CIE- where Y represents the luminance and X and Z the chromatic information. This color space is often used as a reference from which many other color spaces are derived (including most of color spaces). However this space is not a uniform (or perceptually uniform) color space. A uniform color space is defined so that the difference between two values always corresponds to the same amount of visually perceived change.
Three uniform color spaces are considered in this article:
[
20], the HDR extension of the CIE 1976
[
21] and two other HDR/WCG color spaces designed to be perceptually linear, and simple to use:
[
6] and
[
22]. Unlike the
color space in which all components are always non-negative, these three uniform color spaces represent the chromatic information using a color-opponent model which is coherent with the Human Visual System (HVS) and the opponent color theory.
In this article, the luminance component of the uniform color spaces is called uniform luminance instead of, according to the case, lightness, brightness or luma to avoid unnecessary complexity. For example, the uniform luminance of should be called lightness while the uniform luminance of should be called brightness.
2.1.1. HDR-Lab
One of the most popular uniform color spaces is the CIE 1976
or CIELAB which is suited for SDR content. An extension of this color space for HDR images was proposed in [
20]. The proposition is to tailor CIELAB for HDR by changing the non-linear function applied to the pixel
values. This color space is calculated as follows:
where
,
and
are the
coordinates of the diffuse white. The non-linear function
f is used to output perceptually linear values.
f is defined for HDR as follows:
where
is the relative luminance of the surround and
is the absolute luminance of the diffuse white or reference white. The diffuse white corresponds to the chromatic coordinates in the
domain of a 100% reflectance white card without any specular highlight. In HDR imaging, the luminance
Y of the diffuse white is different from the luminance of the peak brightness. Light coming from specular reflections or emissive light sources can reach much higher luminance values. The luminance of the diffuse white is often chosen during the color grading of the images.
The use of color space is somewhat difficult since it requires to know the relative luminance of the surround, , as well as the diffuse white luminance, . Unfortunately these two parameters are most of the time unknown for HDR contents. To cope with this issue, we consider two different diffuse whites to compute the color space, i.e., 100 and 1000 . These two color spaces are named and , respectively.
In addition to the
color space, Fairchild et al. [
20] also proposed the
color space, which aims to extent the IPT color space [
23] to HDR content. This color space is not studied in this article due to its high similarity with
.
2.1.2. ICtCp
has a better chrominance and luminance decorrelation and has a better hue linearity than the classical
color space [
24]. This color space is calculated in three steps:
First, the linear
values (in the BT.2020 gamut) are converted into
values which correspond to the quantity of light absorbed by the cones:
Second, the inverse EOTF PQ [
6] is applied to each
component:
Finally, the luminance component
I and the chrominance components
and
are deduced as follows:
The
color space [
6] is particularly well adapted to video compression and more importantly to the PQ EOTF as defined in BT.2100 [
6].
2.1.3.
[
22] is a uniform color space allowing to increase the hue uniformity and to predict accurately small and large color differences, while keeping a low computational cost. It is computed from the
values (with a standard illuminant D65) in five steps:
First, the
values are adjusted to remove a deviation in the blue hue.
where
and
.
Second, the
values are converted to
values
Third, as for
, the inverse EOTF PQ is applied on each
component
Fourth, the luminance
and the chrominance
and
are calculated
Finally, to handle the highlight, the luminance is adjusted:
where
is the adjusted luminance,
and
is a small constant:
.
2.2. Color Difference Metrics
In this section, we present the color difference metrics associated to each HDR color space. Because the color spaces are uniform, it is possible to calculate the perceptual difference between two colors.
For
color space, the Euclidean distance
is used:
For the
, Safdar et al. [
22] proposed the following formula:
where
corresponds to the color saturation and
to the hue:
where
and
correspond to the saturation of the two compared colors.
For
, a weighted Euclidean distance formula was proposed in [
25]:
Then, to have a
color space truly perceptually linear, the coefficient
is applied to the
component before using any SDR metric.
These color difference metrics work well for measuring perceptual differences of uniform patches. Although that we do not perceive color differences in the same way in textured images or in uniform and large patches, they are often used to compare the distortion between two images. The mean of the difference between the distorted and the reference images can be used as an indicator of image quality:
where
I and
J correspond to the dimensions of the image and
corresponds to the spatial coordinates of the pixel.
2.3. SDR Quality Assessment Metrics
We have selected 12 SDR metrics commonly used in academic research, standardization or industry. There are six achromatic or color-blind metrics (PSNR, SSIM, MS-SSIM, FSIM, PSNR-HVS-M and PSNR-HMA) and six metrics including chrominance information (
,
, SSIMc, CSSIM, FSIMc, PSNR-HMAc).
Table 1 summarizes the principle of each metrics. More detailed information about these metrics can be found in a
Supplementary Materials.
2.4. Adapting SDR Metrics to HDR/WCG Images
For adapting SDR metrics to HDR/WCG images, the reference and distorted images are first converted in a perceptually linear color space. A remapping function is then applied. Finally the SDR metrics is used to determine the quality score.
Figure 1 presents the diagram of the proposed method.
2.4.1. Color Space Conversion
Most SDR metrics were designed with the assumption that the images are encoded in the legacy
[
1] color space; this color space is approximately perceptually uniform for SDR content.
To use SDR metrics with HDR images, we propose to leverage perceptually uniform color spaces adapted to HDR and WCG images (, , and ).
To illustrate the importance of using uniform color space, we also consider two non-uniform color spaces, namely
and
color spaces as defined in the BT.2020 recommendation [
5]. This last space can’t be considered as approximatly uniform for HDR content as it uses the classical gamma function. This function is applied to the
component of an image:
where
,
and
E is one of the
R,
G and
B channel normalized by the reference white level. In SDR, it is supposed to be equal to the peak brightness of the display, so we choose as being the maximum value taken by our own HDR images: 4250
.
From the non-linear
color space, the
color space can be easily deduced:
In addition to the previous color space, for the color-blind metrics, we use the PU-mapping function for the luminance [
7]. As mentioned earlier, this transfer function keeps the same behaviour than the
with a reference white of 80
(which is perceptually linear within a SDR range) and retains perceptual linearity above. Thus any color-blind metrics can be used thanks to this mapping.
2.4.2. Remapping Function
The six aforementioned color spaces, i.e.,
,
,
,
,
and
, have a different range of values. As most of SDR metrics have constant values defined for pixel values between 0 and 255, it is required to adapt the color spaces. We remap them in a way that their respective perceptually linear luminances fit a similar range as the luminances encoded with the PU transfer function between 0 and 100
. We choose 100
as a normalization point because it roughly corresponds to the peak brightness of an SDR screen. Moreover, the PU-encoding is used as a reference to remap the color spaces because it is already adapted to SDR metrics. The goal of this process is to obtain HDR images with the same luminance scale than SDR images in the range 0 to 100
while preserving the perceptual uniformity of the color spaces. The remapping of the perceptual color spaces is done as follows:
where
corresponds to the value in the
domain of the pixel with the spatial coordinates
i and
j.
corresponds to the same pixel value after the remapping.
is the luminance value in the PU space when linear luminance value is 100
.
is the same value but for the luminance component
of the
color space. A similar operation is applied to
and
,
and
color spaces. The resulting luminances for the aforementioned color-space as well as the PU-encoding luminance are plot on
Figure 2. For these figures, we chose a surround luminance of 20
for the two
color spaces.
Remark 1. Note that, to adapt the metrics, the blurring model used in this metrics is first applied to the color space of the images and then the different color difference metrics are calculated. In the case of the color space instead of the color difference metrics presented in Equation (15), we use a simpler Euclidean distance between the pixel values.
In the following sections, the naming convention used for all metrics is . For example, the PSNR metrics used with the color space is called .
4. Results
In this section, we present the performances of the different metrics presented in the previous sections. For the sake of completeness, we also study the performances of the following color-blind HDR metrics: PU-VIF [
9], HDR-VDP2 [
11] (version 2.2.1) [
12] and HDR-VQM [
13]. Note that HDR-VDP2 requires a number of parameters such as the angular resolution, the surround luminance and the spectral emission of the screen. For these parameters, we use the values corresponding to the different subjective tests. We measured the spectrum of the Sony BVM-X300 and the SIM2 HDR47ES4MB monitor using the “X-Rite Eye one Pro 2” probe (more details are given in [
45]. All these parameters are summarized in
Table 3.
4.1. 4Kdtb Database
With our proposed 4Kdtb (cf.
Figure 8), for each color-blind metrics, the best color spaces are always the
,
and the PU-encoding.
and
provide the lowest performances. The best performing color-blind metrics is FSIM used with the PU-encoding, closely followed by
and
. MS-SSIM used with the PU encoding,
and,
are almost on par with the second best performing metrics HDR-VDP2 (cf.
Appendix B). The only color space that provides good performances on all color metrics is the
color space.
4.2. Zerman et al. Database
With the Zerman et al. database, as previously, the color spaces,
,
and the PU-encoding provide the best performances for almost all color-blind metrics (cf.
Figure 9). However, there is one exception with FSIM. Used with the following color spaces,
,
and
, it provides slightly better performances than
and the PU-encoding. The best performing color-blind metrics are, with almost the same performances, HDR-VDP2, HDR-VQM and
.
4.3. HDdtb Database
With our proposed HDdtb (cf.
Figure 10), for color-blind metrics, the color space
provides slightly lower performances for all metrics, except with FSIM. For this metric, the performances with
are higher. The best performing color-blind metrics for this database are
,
and
. For the color metrics, the metrics based on color difference metrics (
,
and CSSIM) have very low performances. This is partially due to the presence of the gamut mismatch artifact. As noticeable on
Table 4, discarding this artifact increases the performances of these metrics. For the participants of our subjective test, the distortions on the images are clearly visible but are not directly associated with a loss in quality perception.
4.4. Korshunov et al. Database
The Korshunov et al. database is the less selective database (cf.
Figure 11). Most of the metrics have high correlation coefficients and the choice of color space has close to no impact on the performances especially on color-blind metrics. Even using non-perceptually linear color space like the
color space impacts only moderately the performances of MS-SSIM, FSIM, PSNR-HVS-M and PSNR-HMA. For this database, the best performing color-blind metrics are
,
and
.
4.5. Narwaria et al. Database
With the Narwaria et al. database (cf.
Figure 12),
is the best color space for SSIM and MS-SSIM while the PU-encoding and the
are the best color spaces for FSIM. The best metrics for this database are
, HDR-VDP2 and HDR-VQM. The good performances of HDR-VDP2 were expected for this database because it was part of the training set of this metric. For this database, the performances of the PSNR and the PSNR-HVS-M are relatively low compared to the other databases. The fact that PSNR-HMA with the adequate color space significantly increases the performances of PSNR-HVS-M suggests that the backward compatible compression used by Narwaria et al. (
Section 3.1.1) creates distortions that impact the mean luminance and the contrast of the images. Indeed PSNR-HMA is an improvement of PSNR-HVS-M that takes into account these two kinds of artifacts [
50].
4.6. Results Summary
For all studied databases, HDR-VDP2 has generally high performances although it is not always on the top three metrics (cf.
Appendix B). FSIM and MS-SSIM with appropriate perceptually uniform color space are often on par if not better than HDR-VDP2.
Among all metrics, FSIM is the less sensitive metrics to the choice of color space assuming that this color space is perceptually uniform.
The color extension of FSIM, namely FSIMc, does not improve the performances of FSIM even for our proposed database 4Kdtb which focuses on chromatic distortions. Worst, the metrics becomes much more sensitive to the color space choice. We observe the same behavior for the color extension of PSNR-HMA, PSNR-HMAc, which decreases the performances of the metrics for any color spaces.
When using the two non-uniform color space and , the performances of all metrics drop significantly compared to the other color spaces for all the databases and especially for our proposed database 4Kdtb, the Zerman et al. database and the Narwaria et al. database. It emphasizes the importance of perceptually uniform color space for predicting the quality of HDR images.
7. Conclusions
In this article, we reviewed the relevance of using SDR metrics with perceptually uniform color spaces to assess the quality of HDR/WCG contents. We studied twelve different metrics along with six different color spaces. To evaluate the performances of these metrics, we used three existing HDR image databases annotated with MOS and created two more databases specifically dedicated to WCG and chrominance artifacts. We showed that the use of perceptually uniform color spaces increases, in most cases, the performances of SDR metrics for HDR/WCG contents.
In this study, we also highlight two weaknesses of state-of-art metrics. First, The relationship between the diffuse white used for grading the image and the diffuse white used for the color space is not always easy to define. In a number of cases, we do not know the value of the diffuse white used for grading of the image. Choosing an arbitrary diffuse white for the color space may significantly alter the objective quality assessment. Further analysis of this relationship is required. A better understanding could help to evaluate compression of images using the HLG EOTF for which the diffuse white depends on the display. Second, to the best of our knowledge, the quality assessment of DR/WCG images with chrominance distortions is still an open-issue, because of the lack of relevant objective metrics.
In a broader perspective, the relevance of subjective tests can also be questioned. For example, on the proposed database HDdtb, viewers did not perceive the gamut mismatch artifact as a loss of quality. However, this kind of artifact changes completely the appearance of images. Some other artifact could also alter the image appearance like the tone mapping/tone expansion used during backward compatible compression. In some cases, asking the viewers to not assess only the quality of the images but also their fidelity to the image appearance can be valuable to fully evaluate image processing algorithms.