Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?
Abstract
1. Introduction
2. Related Work
2.1. Perceptual Quality Assessment
2.2. Large Multimodal Models
3. Evaluation Methodology
3.1. Evaluation Architecture
3.2. LMM Features
3.2.1. CLIP (Contrastive Language–Image Pretraining)
3.2.2. ALTCLIP (Alter Ego CLIP)
3.2.3. HPS (Human Preference Score)
3.2.4. ALIGN (A Large-Scale ImaGe and Noisy-Text Embedding)
3.3. Distortion Features
3.3.1. Statistical Features
3.3.2. Lightweight CNN Features
3.3.3. Meta-Learning Features
3.4. Combined Features
3.5. Datasets and Implementation Details
4. Experiments
4.1. Experimental Settings
4.2. Evaluation Results
5. Conclusions
- The features of LMMs are still not as effective as those of state-of-the-art IQA models. This is likely because the training data of these LMMs do not specifically include distorted images labeled with quality scores. However, LMM features perform much better than hand-crafted features or lightweight CNN features.
- CLIP features achieve the best results on only one dataset. In fact, the performance of different LMMs varies widely depending on the dataset and the feature combination.
- The features of ALIGN, CLIP's main competitor, are usually less effective than those of CLIP and its variants for image quality assessment. Meanwhile, the cases involving HPS features show exceptional performance, especially on the large IQA datasets.
- An important finding is that distortion features can be combined with LMM features to significantly boost performance. Even the very basic distortion features of BRISQUE help improve the performance of LMM features (see the sketch after this list).
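The following is a minimal sketch of this kind of feature fusion, not the authors' exact pipeline: frozen CLIP visual embeddings are concatenated with hand-crafted distortion features and fed to a simple regressor that predicts MOS. The specific checkpoint (openai/clip-vit-base-patch32), the SVR regressor, and the brisque_features() helper are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of LMM + distortion feature fusion (not the authors' exact pipeline).
import numpy as np
import torch
from PIL import Image
from scipy.stats import pearsonr, spearmanr
from sklearn.svm import SVR
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_features(path: str) -> np.ndarray:
    """Frozen CLIP image embedding (the LMM feature)."""
    inputs = clip_proc(images=Image.open(path).convert("RGB"), return_tensors="pt").to(device)
    return clip_model.get_image_features(**inputs).squeeze(0).cpu().numpy()


def brisque_features(path: str) -> np.ndarray:
    """Hypothetical helper: return the 36-dim BRISQUE NSS feature vector.
    Any BRISQUE implementation that exposes the raw features could be plugged in here."""
    raise NotImplementedError


def fused_features(path: str) -> np.ndarray:
    # Simple fusion by concatenating LMM and distortion features.
    return np.concatenate([clip_features(path), brisque_features(path)])


def evaluate(train_paths, train_mos, test_paths, test_mos):
    """Fit a regressor on fused features and report PLCC/SRCC on the test split."""
    reg = SVR().fit(np.stack([fused_features(p) for p in train_paths]), train_mos)
    preds = reg.predict(np.stack([fused_features(p) for p in test_paths]))
    return pearsonr(preds, test_mos)[0], spearmanr(preds, test_mos)[0]
```

The design choice illustrated here is deliberately simple: the LMM backbone stays frozen and only a lightweight regressor is trained on the concatenated feature vector, which is why even basic distortion features can add complementary information.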
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Nguyen, D.; Tran, H.; Thang, T.C. An ensemble learning-based no reference qoe model for user generated contents. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
- Zhu, H.; Li, L.; Wu, J.; Dong, W.; Shi, G. MetaIQA: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14143–14152. [Google Scholar]
- Nguyen, H.N.; Vu, T.; Le, H.T.; Ngoc, N.P.; Thang, T.C. Smooth quality adaptation method for VBR video streaming over HTTP. In Proceedings of the 2015 International Conference on Communications, Management and Telecommunications (ComManTel), DaNang, Vietnam, 28–30 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 184–188. [Google Scholar]
- Tran, H.T.; Nguyen, D.; Thang, T.C. An open software for bitstream-based quality prediction in adaptive video streaming. In Proceedings of the 11th ACM Multimedia Systems Conference, Istanbul, Turkey, 8–11 June 2020; pp. 225–230. [Google Scholar]
- Tran, H.T.; Ngoc, N.P.; Hoßfeld, T.; Seufert, M.; Thang, T.C. Cumulative quality modeling for HTTP adaptive streaming. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–24. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Alpay, T.; Magg, S.; Broze, P.; Speck, D. Multimodal video retrieval with CLIP: A user study. Inf. Retr. J. 2023, 26, 6. [Google Scholar] [CrossRef]
- Wu, H.H.; Seetharaman, P.; Kumar, K.; Bello, J.P. Wav2clip: Learning robust audio representations from clip. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4563–4567. [Google Scholar]
- Flaherty, J.; Onuoha, C.; Paik, I.; Thang, T.C. AI to Judge AI-Generated Images: Both Semantics and Perception Matter. In Proceedings of the 2023 IEEE 13th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), Berlin, Germany, 3–5 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 51–52. [Google Scholar]
- Lan, Y.; Li, X.; Liu, X.; Li, Y.; Qin, W.; Qian, W. Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4389–4400. [Google Scholar]
- Zhao, M.; Li, B.; Wang, J.; Li, W.; Zhou, W.; Zhang, L.; Xuyang, S.; Yu, Z.; Yu, X.; Li, G.; et al. Towards video text visual question answering: Benchmark and baseline. Adv. Neural Inf. Process. Syst. 2022, 35, 35549–35562. [Google Scholar]
- Barraco, M.; Cornia, M.; Cascianelli, S.; Baraldi, L.; Cucchiara, R. The unreasonable effectiveness of CLIP features for image captioning: An experimental analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4662–4670. [Google Scholar]
- Tang, M.; Wang, Z.; Liu, Z.; Rao, F.; Li, D.; Li, X. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 4858–4862. [Google Scholar]
- Shen, S.; Li, L.H.; Tan, H.; Bansal, M.; Rohrbach, A.; Chang, K.W.; Yao, Z.; Keutzer, K. How much can clip benefit vision-and-language tasks? arXiv 2021, arXiv:2107.06383. [Google Scholar]
- Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
- He, Y.; Huang, Z.; Liu, Q.; Wang, Y. Incremental Object Detection with CLIP. arXiv 2023, arXiv:2310.08815. [Google Scholar]
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11686–11695. [Google Scholar]
- Wang, J.; Wang, H.; Deng, J.; Wu, W.; Zhang, D. Efficientclip: Efficient cross-modal pre-training by ensemble confident learning and language modeling. arXiv 2023, arXiv:2109.04699. [Google Scholar]
- Huang, S.; Gong, B.; Pan, Y.; Jiang, J.; Lv, Y.; Li, Y.; Wang, D. VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6565–6574. [Google Scholar]
- Xia, X.; Dong, G.; Li, F.; Zhu, L.; Ying, X. When CLIP meets cross-modal hashing retrieval: A new strong baseline. Inf. Fusion 2023, 100, 101968. [Google Scholar] [CrossRef]
- Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Better aligning text-to-image models with human preference. arXiv 2023, arXiv:2303.14420. [Google Scholar]
- Chen, Z.; Liu, G.; Zhang, B.W.; Ye, F.; Yang, Q.; Wu, L. Altclip: Altering the language encoder in clip for extended language capabilities. arXiv 2022, arXiv:2211.06679. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Bosse, S.; Maniry, D.; Müller, K.R.; Wiegand, T.; Samek, W. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 2017, 27, 206–219. [Google Scholar] [CrossRef]
- He, L.; Gao, F.; Hou, W.; Hao, L. Objective image quality assessment: A survey. Int. J. Comput. Math. 2014, 91, 2374–2388. [Google Scholar] [CrossRef]
- Akhtar, Z.; Falk, T.H. Audio-visual multimedia quality assessment: A comprehensive survey. IEEE Access 2017, 5, 21090–21117. [Google Scholar] [CrossRef]
- Gao, X.; Gao, F.; Tao, D.; Li, X. Universal blind image quality assessment metrics via natural scene statistics and multiple kernel learning. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 2013–2026. [Google Scholar]
- Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
- Moorthy, A.K.; Bovik, A.C. A two-step framework for constructing blind image quality indices. IEEE Signal Process. Lett. 2010, 17, 513–516. [Google Scholar] [CrossRef]
- Saad, M.A.; Bovik, A.C.; Charrier, C. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Trans. Image Process. 2012, 21, 3339–3352. [Google Scholar] [CrossRef] [PubMed]
- Kang, L.; Ye, P.; Li, Y.; Doermann, D. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1733–1740. [Google Scholar]
- Zhang, W.; Ma, K.; Yan, J.; Deng, D.; Wang, Z. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuits Syst. Video Technol. 2018, 30, 36–47. [Google Scholar] [CrossRef]
- Bianco, S.; Celona, L.; Napoletano, P.; Schettini, R. On the use of deep learning for blind image quality assessment. Signal Image Video Process. 2018, 12, 355–362. [Google Scholar] [CrossRef]
- Zeng, H.; Zhang, L.; Bovik, A.C. A probabilistic quality representation approach to deep blind image quality prediction. arXiv 2017, arXiv:1708.08190. [Google Scholar]
- Talebi, H.; Milanfar, P. NIMA: Neural image assessment. IEEE Trans. Image Process. 2018, 27, 3998–4011. [Google Scholar] [CrossRef]
- Ma, K.; Liu, W.; Zhang, K.; Duanmu, Z.; Wang, Z.; Zuo, W. End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 2017, 27, 1202–1213. [Google Scholar] [CrossRef]
- Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
- Ribani, R.; Marengoni, M. A survey of transfer learning for convolutional neural networks. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T), Rio de Janeiro, Brazil, 28–31 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 47–57. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Hentschel, S.; Kobs, K.; Hotho, A. CLIP knows image aesthetics. Front. Artif. Intell. 2022, 5, 976235. [Google Scholar] [CrossRef] [PubMed]
- Lin, Z.; Geng, S.; Zhang, R.; Gao, P.; de Melo, G.; Wang, X.; Dai, J.; Qiao, Y.; Li, H. Frozen clip models are efficient video learners. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 388–404. [Google Scholar]
- Khandelwal, A.; Weihs, L.; Mottaghi, R.; Kembhavi, A. Simple but effective: Clip embeddings for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14829–14838. [Google Scholar]
- RECOMMENDATION P.800.1; Mean Opinion Score (MOS) Terminology. ITU-T: Geneva, Switzerland, 2016.
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
- Byeon, M.; Park, B.; Kim, H.; Lee, S.; Baek, W.; Kim, S. COYO-700M: Image-Text Pair Dataset. 2022. Available online: https://github.com/kakaobrain/coyo-dataset (accessed on 20 November 2023).
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html (accessed on 20 November 2023).
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
- Xu, B. NLP Chinese Corpus: Large Scale Chinese Corpus for NLP. Zenodo 2019. [Google Scholar]
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Yoon, B.; Lee, Y.; Baek, W. COYO-ALIGN. 2022. Available online: https://github.com/kakaobrain/coyo-align (accessed on 13 August 2023).
- Ponomarenko, N.; Ieremeiev, O.; Lukin, V.; Egiazarian, K.; Jin, L.; Astola, J.; Vozel, B.; Chehdi, K.; Carli, M.; Battisti, F.; et al. Color image database TID2013: Peculiarities and preliminary results. In Proceedings of the European Workshop on Visual Information Processing (EUVIP), Paris, France, 10–12 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 106–111. [Google Scholar]
- Lin, H.; Hosu, V.; Saupe, D. KADID-10k: A Large-scale Artificially Distorted IQA Database. In Proceedings of the 2019 Tenth International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, 5–7 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–3. [Google Scholar]
- Hosu, V.; Lin, H.; Sziranyi, T.; Saupe, D. KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 2020, 29, 4041–4056. [Google Scholar] [CrossRef] [PubMed]
- Ghadiyaram, D.; Bovik, A.C. Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 2015, 25, 372–387. [Google Scholar] [CrossRef] [PubMed]
- Fang, Y.; Zhu, H.; Zeng, Y.; Ma, K.; Wang, Z. Perceptual quality assessment of smartphone photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3677–3686. [Google Scholar]
- Virtanen, T.; Nuutinen, M.; Vaahteranoksa, M.; Oittinen, P.; Häkkinen, J. CID2013: A database for evaluating no-reference image quality assessment algorithms. IEEE Trans. Image Process. 2014, 24, 390–402. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
| Dataset | Total Samples | Subjective Assessment | MOS Range | Distortion Details |
|---|---|---|---|---|
| LIVE | 1162 | Crowd-sourced | 0–100 | Blur, grain, noise, over-/under-exposure, etc. |
| KonIQ-10k | 10,073 | Crowd-sourced | 1–5 | Camera shake, wrong focus, motion blur, compression, noise, etc. |
| SPAQ | 11,125 | Laboratory | 0–100 | Lens flare, chromatic aberrations, camera movement, etc. |
| CID2013 | 474 | Laboratory | 0–100 | Photon noise, pixel defects, pixel saturation, optical aberrations, etc. |
| Model | LIVE PLCC↑ (Std) | LIVE SRCC↑ (Std) | KonIQ-10k PLCC↑ (Std) | KonIQ-10k SRCC↑ (Std) | SPAQ PLCC↑ (Std) | SPAQ SRCC↑ (Std) | CID2013 PLCC↑ (Std) | CID2013 SRCC↑ (Std) |
|---|---|---|---|---|---|---|---|---|
| BRISQUE | 0.5071 (0.050) | 0.4846 (0.052) | 0.7197 (0.0109) | 0.7013 (0.013) | 0.8170 (0.006) | 0.8090 (0.007) | 0.4410 (0.039) | 0.4426 (0.037) |
| CNNIQA | 0.4527 (0.051) | 0.4263 (0.064) | 0.7172 (0.012) | 0.6505 (0.014) | 0.8050 (0.008) | 0.7978 (0.009) | 0.5495 (0.052) | 0.3298 (0.063) |
| MetaIQA | 0.8387 (0.023) | 0.8025 (0.028) | 0.8835 (0.005) | 0.8454 (0.005) | 0.9030 (0.004) | 0.9007 (0.004) | 0.7324 (0.029) | 0.6561 (0.048) |
| CLIP | 0.8079 (0.015) | 0.7622 (0.013) | 0.8608 (0.004) | 0.8273 (0.008) | 0.8844 (0.003) | 0.8802 (0.003) | 0.7126 (0.022) | 0.6764 (0.029) |
| ALIGN | 0.7243 (0.028) | 0.7099 (0.018) | 0.8620 (0.006) | 0.8385 (0.006) | 0.8743 (0.004) | 0.8725 (0.004) | 0.5953 (0.036) | 0.5369 (0.049) |
| ALTCLIP | 0.7444 (0.034) | 0.7171 (0.025) | 0.8609 (0.005) | 0.8251 (0.005) | 0.8914 (0.004) | 0.8868 (0.004) | 0.5194 (0.036) | 0.4798 (0.045) |
| HPS | 0.8290 (0.024) | 0.8028 (0.024) | 0.8600 (0.008) | 0.8284 (0.009) | 0.8893 (0.004) | 0.8860 (0.004) | 0.6534 (0.022) | 0.5972 (0.020) |
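For reference, the PLCC and SRCC values reported in these tables can be computed from predicted and ground-truth scores as in the following minimal scipy-based sketch; the values below are toy numbers, and the authors' evaluation script may additionally apply a nonlinear mapping before computing PLCC.

```python
# Toy example of computing the correlation metrics reported in these tables.
from scipy.stats import pearsonr, spearmanr

predicted = [72.1, 55.3, 81.0, 40.2]   # model predictions (toy values)
mos       = [70.0, 58.0, 85.0, 38.5]   # ground-truth MOS  (toy values)

plcc, _ = pearsonr(predicted, mos)    # linear agreement with subjective scores
srcc, _ = spearmanr(predicted, mos)   # rank (monotonic) agreement
print(f"PLCC = {plcc:.4f}, SRCC = {srcc:.4f}")
```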
| Fusion Type | LIVE PLCC↑ (Std) | LIVE SRCC↑ (Std) | KonIQ-10k PLCC↑ (Std) | KonIQ-10k SRCC↑ (Std) | SPAQ PLCC↑ (Std) | SPAQ SRCC↑ (Std) | CID2013 PLCC↑ (Std) | CID2013 SRCC↑ (Std) |
|---|---|---|---|---|---|---|---|---|
| CLIP + BRISQUE | 0.8276 (0.013) | 0.7806 (0.015) | 0.8838 (0.003) | 0.8549 (0.006) | 0.8959 (0.003) | 0.8921 (0.003) | 0.7457 (0.021) | 0.7171 (0.026) |
| ALIGN + BRISQUE | 0.7895 (0.028) | 0.7695 (0.023) | 0.8782 (0.004) | 0.8579 (0.005) | 0.8902 (0.005) | 0.8889 (0.004) | 0.6953 (0.037) | 0.6367 (0.055) |
| ALTCLIP + BRISQUE | 0.8372 (0.018) | 0.8012 (0.014) | 0.8794 (0.006) | 0.8503 (0.005) | 0.9017 (0.003) | 0.8980 (0.003) | 0.6091 (0.032) | 0.5615 (0.041) |
| HPS + BRISQUE | 0.8529 (0.019) | 0.8255 (0.016) | 0.8962 (0.005) | 0.8687 (0.006) | 0.9069 (0.003) | 0.9032 (0.003) | 0.7030 (0.020) | 0.6516 (0.023) |
| Fusion Type | LIVE PLCC↑ (Std) | LIVE SRCC↑ (Std) | KonIQ-10k PLCC↑ (Std) | KonIQ-10k SRCC↑ (Std) | SPAQ PLCC↑ (Std) | SPAQ SRCC↑ (Std) | CID2013 PLCC↑ (Std) | CID2013 SRCC↑ (Std) |
|---|---|---|---|---|---|---|---|---|
| CLIP + CNNIQA | 0.8359 (0.011) | 0.7925 (0.014) | 0.8918 (0.004) | 0.8607 (0.007) | 0.9044 (0.003) | 0.9005 (0.003) | 0.7658 (0.018) | 0.7316 (0.022) |
| ALIGN + CNNIQA | 0.8002 (0.025) | 0.7791 (0.020) | 0.8870 (0.004) | 0.8644 (0.005) | 0.8963 (0.004) | 0.8950 (0.004) | 0.7741 (0.029) | 0.7152 (0.036) |
| ALTCLIP + CNNIQA | 0.8469 (0.017) | 0.8144 (0.013) | 0.8866 (0.006) | 0.8564 (0.006) | 0.9056 (0.000) | 0.9023 (0.004) | 0.7123 (0.029) | 0.6598 (0.031) |
| HPS + CNNIQA | 0.8602 (0.018) | 0.8364 (0.015) | 0.9028 (0.006) | 0.8738 (0.007) | 0.9117 (0.000) | 0.9084 (0.003) | 0.6864 (0.027) | 0.6283 (0.015) |
| Fusion Type | LIVE PLCC↑ (Std) | LIVE SRCC↑ (Std) | KonIQ-10k PLCC↑ (Std) | KonIQ-10k SRCC↑ (Std) | SPAQ PLCC↑ (Std) | SPAQ SRCC↑ (Std) | CID2013 PLCC↑ (Std) | CID2013 SRCC↑ (Std) |
|---|---|---|---|---|---|---|---|---|
| CLIP + MetaIQA | 0.8623 (0.018) | 0.8326 (0.023) | 0.9105 (0.004) | 0.8794 (0.006) | 0.9085 (0.004) | 0.9051 (0.003) | 0.7719 (0.031) | 0.7230 (0.042) |
| ALIGN + MetaIQA | 0.8529 (0.024) | 0.8187 (0.028) | 0.9098 (0.004) | 0.8834 (0.005) | 0.9061 (0.004) | 0.9033 (0.003) | 0.7516 (0.026) | 0.6888 (0.043) |
| ALTCLIP + MetaIQA | 0.8541 (0.021) | 0.8200 (0.026) | 0.9109 (0.004) | 0.8801 (0.005) | 0.9099 (0.004) | 0.9069 (0.003) | 0.7474 (0.034) | 0.6818 (0.050) |
| HPS + MetaIQA | 0.8503 (0.023) | 0.8161 (0.027) | 0.9097 (0.005) | 0.8777 (0.006) | 0.9076 (0.004) | 0.9046 (0.004) | 0.7519 (0.033) | 0.6901 (0.049) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).