A Visual Enhancement Network with Feature Fusion for Image Aesthetic Assessment
Abstract
1. Introduction
- (1) We propose an end-to-end trainable network consisting of two sub-modules. The former module models top-down neural attention and foveal visual characteristics; the latter extracts and integrates the features learned by the CNN at different stages.
- (2) An adaptive filter is designed to select filters in the spatial domain. Specifically, each pixel in the image adjusts the filter parameters according to a normalized interest matrix extracted by neural feedback.
- (3) We optimize a feature fusion unit to combine low-level information with image semantics. The added pooling layers process the corresponding features, increasing training speed and improving the precision of score prediction. Moreover, the unit fuses the features so that each makes its maximum contribution.
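As a rough illustration of contribution (2), the sketch below blends a smoothing and a sharpening response per pixel, weighted by a normalized interest matrix. The function names, the box-blur stand-in for the paper's filters, and the boost factor are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def box_blur(img, k=3):
    # simple k x k mean filter with edge padding (stand-in for a smoothing filter)
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def adaptive_filter(img, interest, boost=1.5):
    """Per-pixel blend of sharpening and smoothing.

    interest: matrix normalized to [0, 1]; high-interest pixels are
    sharpened (high-boost style), low-interest pixels are smoothed.
    The boost factor is an assumed value for illustration.
    """
    blurred = box_blur(img)
    sharpened = img + boost * (img - blurred)  # high-boost sharpening
    return interest * sharpened + (1.0 - interest) * blurred
```

With an all-zero interest matrix the output reduces to the smoothed image, and with an all-one matrix to the sharpened one; intermediate values interpolate between the two, which is one plausible reading of "each pixel adjusts the parameters of the filters".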
2. Related Work
3. Proposed Method
3.1. Top-Down Neural Attention
3.2. Adaptive Filtering
3.3. Features at Different Stages
3.4. Feature Fusion Unit
3.5. EMD Loss
4. Experiments
4.1. Datasets
4.2. Details of the Experiment
4.3. Comparison on AVA Dataset
4.4. Comparison on Photo.net Dataset
4.5. Evaluation of Two Sub-Modules
4.6. Quality-Based Comparison
4.7. Different Shallow Features
4.8. Model Size Comparison
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sun, S.; Yu, T.; Xu, J.; Zhou, W.; Chen, Z. GraphIQA: Learning Distortion Graph Representations for Blind Image Quality Assessment. IEEE Trans. Multimed. 2022.
- Zhu, H.; Li, L.; Wu, J.; Dong, W.; Shi, G. MetaIQA: Deep Meta-Learning for No-Reference Image Quality Assessment. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14131–14140.
- Yan, M.; Xiong, R.; Shen, Y.; Jin, C. Intelligent generation of Peking opera facial masks with deep learning frameworks. Herit. Sci. 2023, 11, 20.
- Deng, Y.; Loy, C.C.; Tang, X. Image aesthetic assessment: An experimental survey. IEEE Signal Process. Mag. 2017, 34, 80–106.
- Golestaneh, S.A.; Dadsetan, S.; Kitani, K.M. No-reference image quality assessment via transformers, relative ranking, and self-consistency. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3989–3999.
- Zhou, J.; Zhang, Q.; Fan, J.H.; Sun, W.; Zheng, W.S. Joint regression and learning from pairwise rankings for personalized image aesthetic assessment. Comput. Vis. Media 2021, 7, 241–252.
- Yan, M.; Lou, X.; Chan, C.A.; Wang, Y.; Jiang, W. A semantic and emotion-based dual latent variable generation model for a dialogue system. CAAI Trans. Intell. Technol. 2023, 1–12.
- Qian, Q.; Cheng, K.; Qian, W.; Deng, Q.; Wang, Y. Image Segmentation Using Active Contours with Hessian-Based Gradient Vector Flow External Force. Sensors 2022, 22, 4956.
- Wang, D.; Zhang, H.; Shao, Y. A Robust Invariant Local Feature Matching Method for Changing Scenes. Wirel. Commun. Mob. Comput. 2021, 2021, 8927822.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Talebi, H.; Milanfar, P. NIMA: Neural image assessment. IEEE Trans. Image Process. 2018, 27, 3998–4011.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. Comput. Sci. 2014, 1–14.
- Saraee, E.; Jalal, M.; Betke, M. Visual complexity analysis using deep intermediate-layer features. Comput. Vis. Image Underst. 2020, 195, 102949.
- Wang, P.; Cottrell, G.W. Central and peripheral vision for scene recognition: A neurocomputational modeling exploration. J. Vis. 2017, 17, 5155.
- Ma, S.; Liu, J.; Chen, C.W. A-Lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 722–731.
- Zhang, X.; Gao, X.; Lu, W.; Yu, Y.; He, L. Fusion global and local deep representations with neural attention for aesthetic quality assessment. Signal Process. Image Commun. 2019, 78, 42–50.
- Zhang, X.; Gao, X.; Lu, W.; He, L. A gated peripheral-foveal convolutional neural network for unified image aesthetic prediction. IEEE Trans. Multimed. 2019, 21, 2815–2826.
- Datta, R.; Joshi, D.; Li, J.; Wang, J.Z. Studying aesthetics in photographic images using a computational approach. In Proceedings of the 9th European Conference on Computer Vision (ECCV 2006), Graz, Austria, 8–11 May 2006; pp. 288–301.
- Sun, X.; Yao, H.; Ji, R.; Liu, S. Photo assessment based on computational visual attention model. In Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, 19–24 October 2009; pp. 541–544.
- Dhar, S.; Ordonez, V.; Berg, T.L. High level describable attributes for predicting aesthetics and interestingness. In Proceedings of the 2011 Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1657–1664.
- Bhattacharya, S.; Sukthankar, R.; Shah, M. A holistic approach to aesthetic enhancement of photographs. ACM Trans. Multimed. Comput. Commun. Appl. 2011, 7, 1–21.
- Tang, X.; Luo, W.; Wang, X. Content-based photo quality assessment. IEEE Trans. Multimed. 2013, 15, 1930–1943.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Lu, X.; Lin, Z.; Shen, X.; Mech, R.; Wang, J.Z. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 990–998.
- Jin, X.; Wu, L.; Li, X.; Zhang, X.; Chi, J.; Peng, S.; Ge, S.; Zhao, G.; Li, S. ILGNet: Inception modules with connected local and global features for efficient image aesthetic quality classification using domain adaptation. IET Comput. Vis. 2019, 13, 206–212.
- Yan, G.; Bi, R.; Guo, Y.; Peng, W. Image aesthetic assessment based on latent semantic features. Information 2020, 11, 223.
- Zhang, X.; Gao, X.; Lu, W.; He, L.; Li, J. Beyond vision: A multimodal recurrent attention convolutional neural network for unified image aesthetic prediction tasks. IEEE Trans. Multimed. 2020, 23, 611–623.
- She, D.; Lai, Y.K.; Yi, G.; Xu, K. Hierarchical layout-aware graph convolutional network for unified aesthetics assessment. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8471–8480.
- Zhao, L.; Shang, M.; Gao, F.; Li, R.; Yu, J. Representation learning of image composition for aesthetic prediction. Comput. Vis. Image Underst. 2020, 199, 103024.
- Zhang, J.; Bargal, S.A.; Lin, Z.; Brandt, J.; Shen, X.; Sclaroff, S. Top-down neural attention by excitation backprop. Int. J. Comput. Vis. 2018, 126, 1084–1102.
- Kucer, M.; Loui, A.C.; Messinger, D.W. Leveraging expert feature knowledge for predicting image aesthetics. IEEE Trans. Image Process. 2018, 27, 5100–5112.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
- Jin, X.; Wu, L.; Li, X.; Chen, S.; Peng, S.; Chi, J.; Ge, S.; Song, C.; Zhao, G. Predicting aesthetic score distribution through cumulative Jensen–Shannon divergence. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 77–84.
- Murray, N.; Marchesotti, L.; Perronnin, F. AVA: A large-scale database for aesthetic visual analysis. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2408–2415.
- Joshi, D.; Datta, R.; Fedorovskaya, E.; Luong, Q.-T.; Wang, J.Z.; Li, J.; Luo, J. Aesthetics and Emotions in Images. IEEE Signal Process. Mag. 2011, 28, 94–115.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Zeng, H.; Zhang, L.; Bovik, A.C. A probabilistic quality representation approach to deep blind image quality prediction. CoRR 2017, 1–12.
- Marchesotti, L.; Perronnin, F.; Larlus, D.; Csurka, G. Assessing the aesthetic quality of photographs using generic image descriptors. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1784–1791.
- Kao, Y.; He, R.; Huang, K. Deep aesthetic quality assessment with semantic information. IEEE Trans. Image Process. 2017, 26, 1482–1495.
- Wang, W.; Shen, J.; Ling, H. A Deep Network Solution for Attention and Aesthetics Aware Photo Cropping. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 4, 1531–1544.
- Wang, W.; Shen, J. Deep visual attention prediction. IEEE Trans. Image Process. 2018, 27, 2368–2378.
Filter | Size | k | b | σ
---|---|---|---|---
High-boost filter (including Laplace filter) | 9 × 9 | - | |
Gaussian filter | 9 × 9 | - | - |
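The two filter families in the table can be sketched as kernels. The parameter values (k, b, σ) are not given in this excerpt, so the defaults below are illustrative assumptions; the high-boost example also uses a minimal 3 × 3 Laplacian rather than the 9 × 9 size listed.

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    # normalized 2-D Gaussian smoothing kernel (sigma is an assumed value)
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def high_boost_kernel(k=1.0):
    # identity minus k times the Laplacian: passes the image through
    # while amplifying edges (a Laplace-based high-boost sharpener)
    laplace = np.array([[0, 1, 0],
                        [1, -4, 1],
                        [0, 1, 0]], dtype=float)
    identity = np.zeros((3, 3))
    identity[1, 1] = 1.0
    return identity - k * laplace
```

Convolving with the Gaussian kernel smooths; convolving with the high-boost kernel returns the image plus a scaled edge map, since the Laplacian responds only to intensity changes.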
Layer | Input Size | Number of Layers
---|---|---
Conv3-64 | 224 × 224 × 3 | 2
Conv3-128 | 112 × 112 × 64 | 2
Conv3-256 * | 56 × 56 × 128 | 3
Conv3-512 | 28 × 28 × 256 | 3
Conv3-512 * | 14 × 14 × 512 | 3
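The input sizes in the table follow from VGG16's design: each stage halves the spatial resolution via 2 × 2 pooling, and the next stage sees the previous stage's output channels. A quick check:

```python
# Reproduce the per-stage input sizes from the table.
# Each tuple is (output channels, number of conv layers) for a VGG16 stage.
stages = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

size, channels = 224, 3  # the network input is 224 x 224 x 3
shapes = []
for out_channels, _ in stages:
    shapes.append((size, size, channels))  # input seen by this stage
    size //= 2                             # 2 x 2 pooling halves H and W
    channels = out_channels

print(shapes)
# [(224, 224, 3), (112, 112, 64), (56, 56, 128), (28, 28, 256), (14, 14, 512)]
```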
Layer | Input Size | Number of Layers
---|---|---
PCFS | 28 × 28 × 256, 7 × 7 × 512 | 1
FC | 7 × 7 × 256 + 7 × 7 × 512 | 1
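One plausible reading of the PCFS row, sketched under assumptions (the PCFS internals are not specified in this excerpt; average pooling and channel-wise concatenation are stand-ins): pool the 28 × 28 × 256 shallow map down to 7 × 7 and concatenate it with the 7 × 7 × 512 deep map, yielding the 7 × 7 × 256 + 7 × 7 × 512 input the FC row lists.

```python
import numpy as np

def avg_pool(x, stride):
    # non-overlapping average pooling over the spatial dims of an (H, W, C) map
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    return x.mean(axis=(1, 3))

def fuse(shallow, deep):
    """Pool the shallow 28x28x256 map to 7x7 and concatenate it
    channel-wise with the deep 7x7x512 map (assumed fusion scheme)."""
    pooled = avg_pool(shallow, stride=4)            # 28 / 4 = 7
    return np.concatenate([pooled, deep], axis=-1)  # 7 x 7 x (256 + 512)
```

Pooling before concatenation keeps the fused tensor small (7 × 7 × 768 rather than upsampling the deep map), which is consistent with the claim that the added pooling layers speed up training.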
Network Architecture | Accuracy (%) | LCC | SRCC | MAE | RMSE | EMD |
---|---|---|---|---|---|---|
SPP-Net [36] | 74.41 | 0.5869 | 0.6007 | 0.4611 | 0.5878 | 0.0539 |
AA-Net [37] | 77.00 | - | - | - | - | - |
InceptionNet [9] | 79.43 | 0.6865 | 0.6756 | 0.4154 | 0.5359 | 0.0466 |
NIMA [10] | 81.51 | 0.636 | 0.612 | - | - | 0.050 |
GPF-CNN [17] | 81.81 | 0.7042 | 0.6900 | 0.4072 | 0.5246 | 0.045 |
ReLIC++ [29] | 82.35 | 0.760 | 0.748 | - | - | - |
FF-VEN | 83.64 | 0.773 | 0.755 | 0.4011 | 0.5109 | 0.044 |
Network Architecture | Accuracy (%) | LCC | SRCC | MAE | RMSE | EMD |
---|---|---|---|---|---|---|
GIST-SVM [38] | 59.9 | - | - | - | - | - |
FV-SIFT-SVM [38] | 60.8 | - | - | - | - | - |
MRTLCNN [39] | 65.2 | - | - | - | - | - |
GLFN [16] | 75.6 | 0.5464 | 0.5217 | 0.4242 | 0.5211 | 0.070 |
FF-VEN | 78.1 | 0.6381 | 0.6175 | 0.4278 | 0.5285 | 0.062 |
Network Architecture | Accuracy (%) | LCC | SRCC | MAE | RMSE | EMD |
---|---|---|---|---|---|---|
VGG16 [40] | 74.41 | 0.5869 | 0.6007 | 0.4611 | 0.5878 | 0.0539 |
Random-VGG16 [22] | 78.54 | 0.6382 | 0.6274 | 0.4410 | 0.5660 | 0.0510 |
Saliency-VGG16 [40] | 79.19 | 0.6711 | 0.6601 | 0.4228 | 0.5430 | 0.0475 |
GPF-VGG16 [17] | 80.70 | 0.6868 | 0.6762 | 0.4144 | 0.5347 | 0.0460 |
VE-CNN (VGG16) | 81.03 | 0.7395 | 0.7185 | 0.4073 | 0.5279 | 0.0441 |
SDFF (VGG16) | 81.47 | 0.7119 | 0.7021 | 0.4103 | 0.5317 | 0.0462 |
Mean Score | Network Architecture | Accuracy (%) | LCC | SRCC | MAE | RMSE | EMD
---|---|---|---|---|---|---|---
[0, 4) | NIMA [10] | 78.46 | 0.6265 | 0.6043 | 0.5577 | 0.6897 | 0.067
[0, 4) | ReLIC++ [29] | 80.02 | 0.6887 | 0.6765 | - | - | -
[0, 4) | FF-VEN | 80.59 | 0.7095 | 0.6971 | 0.5037 | 0.6139 | 0.059
[4, 7) | NIMA [10] | 80.43 | 0.7271 | 0.7028 | 0.4037 | 0.5256 | 0.048
[4, 7) | ReLIC++ [29] | 81.15 | 0.8733 | 0.8547 | - | - | -
[4, 7) | FF-VEN | 81.33 | 0.8945 | 0.8831 | 0.3748 | 0.4851 | 0.039
[7, 10] | NIMA [10] | 94.93 | 0.5936 | 0.5645 | 0.5927 | 0.7314 | 0.073
[7, 10] | ReLIC++ [29] | 96.64 | 0.6223 | 0.6084 | - | - | -
[7, 10] | FF-VEN | 98.71 | 0.6113 | 0.6492 | 0.5343 | 0.6457 | 0.061
Layer | Accuracy (%) | LCC | SRCC | MAE | RMSE | EMD |
---|---|---|---|---|---|---|
Conv3-64 | 80.21 | 0.692 | 0.682 | 0.4163 | 0.5376 | 0.046 |
Conv3-128 | 81.47 | 0.716 | 0.691 | 0.4025 | 0.5284 | 0.045 |
Conv3-256 | 83.64 | 0.773 | 0.755 | 0.4011 | 0.5109 | 0.044 |
Conv3-512 | 82.34 | 0.751 | 0.737 | 0.4047 | 0.5201 | 0.044 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, X.; Jiang, X.; Song, Q.; Zhang, P. A Visual Enhancement Network with Feature Fusion for Image Aesthetic Assessment. Electronics 2023, 12, 2526. https://doi.org/10.3390/electronics12112526