Zero-Shot Day–Night Domain Adaptation for Face Detection Based on DAl-CLIP-Dino
Abstract
1. Introduction
- ZSDA is designed to extract reflectance-based, illumination-invariant features from both naturally well-lit and synthetically generated low-light images. The pre-trained RetinexNet [10] network is used to further strengthen this module, with dedicated illumination-invariance enhancement strategies to boost its performance.
- The exchange–decomposition coherence process [13] is proposed to improve the quality of image decomposition under the Retinex theory. By introducing a recomposition coherence loss across the two decomposition stages, it enforces reflectance consistency and improves the stability and accuracy of image reconstruction.
- ZSDA allows the model to be trained solely on well-lit source-domain images and still perform accurate evaluation in low-light target domains for which no images are available, which enhances the model's adaptability and generalization under extreme lighting variations. Additionally, we further strengthen the model by merging the image encoder of CLIP [14] (Contrastive Language-Image Pre-training) with DINO [15] (Self-Distillation with No Labels).
- The generalization ability of CLIP's ViT-based image encoder and the self-supervised learning characteristics of DINO are used jointly to improve the model's capacity to capture detail in low-light environments and to understand complex scenes. This enables the model to recognize and process dimly lit images more accurately, improving the accuracy and reliability of object detection.
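To make the last contribution concrete, the sketch below shows one simple way global and local embeddings could be fused: a CLIP-style global image embedding and a DINO-style self-supervised embedding are concatenated and passed through a learned projection. The dimensions, the ReLU projection, and the function names are illustrative assumptions, not the paper's actual architecture; random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(clip_feat, dino_feat, w, b):
    """Concatenate a global embedding (e.g. from CLIP's ViT image
    encoder) with a local/self-supervised embedding (e.g. from DINO),
    then apply a linear projection followed by a ReLU."""
    x = np.concatenate([clip_feat, dino_feat], axis=-1)  # (B, clip_dim + dino_dim)
    return np.maximum(x @ w + b, 0.0)                    # (B, out_dim), non-negative

# Illustrative sizes: 512-d CLIP features, 384-d DINO features, 256-d fused output.
clip_dim, dino_dim, out_dim = 512, 384, 256
w = rng.standard_normal((clip_dim + dino_dim, out_dim)) * 0.01
b = np.zeros(out_dim)

fused = fuse_features(rng.standard_normal((2, clip_dim)),
                      rng.standard_normal((2, dino_dim)), w, b)
```

A downstream detection head would then consume `fused` in place of either embedding alone, which is the intuition behind combining the two encoders.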
2. Related Work
2.1. Object Detection
2.2. Processing of Low-Light Images
2.3. Zero-Shot Domain Adaptation
- Global and local feature extraction imbalance: models often fail to balance global and local feature extraction, resulting in incomplete feature representations, particularly in low-light environments.
- Feature degradation in low-light conditions: severe lighting variations lead to degraded feature quality, while current methods lack robust illumination-invariant feature modeling.
- Weak domain-invariant feature learning: insufficient mechanisms for domain-invariant representation learning hinder performance in zero-shot domain adaptation tasks.
- Limited generalization: adapting to complex scenarios, such as day–night transitions, remains a challenge due to inadequate integration of global and local features.
3. Method
Algorithm 1 Framework outline of the methodology.
Input: Well-lit image , artificially generated low-light image
Output: Refined reflectance predictions
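The outline above can be sketched as follows, under the Retinex assumption that a well-lit image and its synthetic low-light counterpart share the same reflectance. The `decompose` function here is a crude analytical stand-in for a learned decomposition network such as RetinexNet, and the loss is a plain mean absolute difference; both are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def decompose(img):
    """Stand-in for a Retinex-style decomposition network:
    illumination is approximated by the per-pixel max over channels,
    reflectance is the image divided by that illumination."""
    illum = img.max(axis=-1, keepdims=True) + 1e-6   # (H, W, 1)
    refl = img / illum                               # (H, W, 3)
    return refl, illum

def reflectance_consistency(well_lit, low_light):
    """Mean absolute difference between the reflectance maps of a
    well-lit image and its synthetic low-light counterpart; under the
    Retinex assumption both should share the same reflectance."""
    r_hi, _ = decompose(well_lit)
    r_lo, _ = decompose(low_light)
    return np.abs(r_hi - r_lo).mean()

rng = np.random.default_rng(0)
day = rng.uniform(0.2, 1.0, size=(8, 8, 3))
night = day * 0.1   # crude synthetic darkening: a uniform gain
loss = reflectance_consistency(day, night)
# for a pure per-pixel gain, reflectance is (almost) unchanged, so loss is near zero
```

Training then minimizes this consistency term (alongside reconstruction losses) so that the refined reflectance predictions are stable across the day-night illumination gap.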
3.1. Lighting Invariance Enhancement
3.2. Low-Light Image Reconstruction
3.3. CLIP (ViT-Based Image Encoder) Framework
3.4. DINO Framework
3.5. Network Training
4. Experiment
4.1. Datasets and Evaluation Indicators
4.1.1. Datasets
4.1.2. Evaluation Indicators
4.2. Ablation Experiment
4.3. Comparative Experiment
4.4. Visualization of Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.; Lakshminarayanan, B.; Snoek, J. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Adv. Neural Inf. Process. Syst. 2019, 32, 1254. [Google Scholar]
- Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 12504–12513. [Google Scholar]
- Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
- Wang, W.; Xu, Z.; Huang, H.; Liu, J. Self-aligned concave curve: Illumination enhancement for unsupervised adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 2617–2626. [Google Scholar]
- Wang, W.; Yang, W.; Liu, J. Hla-face: Joint high-low adaptation for low light face detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16195–16204. [Google Scholar]
- Wu, W.; Weng, J.; Zhang, P.; Wang, X.; Yang, W.; Jiang, J. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5901–5910. [Google Scholar]
- Cui, Z.; Qi, G.-J.; Gu, L.; You, S.; Zhang, Z.; Harada, T. Multitask AET with orthogonal tangent regularity for dark object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2553–2562. [Google Scholar]
- Loh, Y.P.; Chan, C.S. Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 2019, 178, 30–42. [Google Scholar] [CrossRef]
- Yang, S.; Luo, P.; Loy, C.-C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533. [Google Scholar]
- Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
- Guo, J.; Deng, J.; Lattas, A.; Zafeiriou, S. Sample and computation redistribution for efficient face detection. arXiv 2021, arXiv:2105.04714. [Google Scholar]
- Li, J.; Wang, Y.; Wang, C.; Tai, Y.; Qian, J.; Yang, J.; Wang, C.; Li, J.; Huang, F. DSFD: Dual shot face detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5060–5069. [Google Scholar]
- Du, Z.; Shi, M.; Deng, J. Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 12666–12676. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Tian, Z.; Chu, X.; Wang, X.; Wei, X.; Shen, C. Fully convolutional one-stage 3D object detection on lidar range images. Adv. Neural Inf. Process. Syst. 2022, 35, 34899–34911. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Dai, D.; Van Gool, L. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3819–3824. [Google Scholar]
- Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5203–5212. [Google Scholar]
- Hu, P.; Ramanan, D. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 951–959. [Google Scholar]
- Liu, Y.; Shi, M.; Zhao, Q.; Wang, X. Point in, box out: Beyond counting persons in crowds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6469–6478. [Google Scholar]
- Liu, Y.; Tang, X. BFBox: Searching face-appropriate backbone and feature pyramid network for face detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13568–13577. [Google Scholar]
- Ming, X.; Wei, F.; Zhang, T.; Chen, D.; Wen, F. Group sampling for scale invariant face detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3446–3456. [Google Scholar]
- Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
- Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
- Sasagawa, Y.; Nagahara, H. YOLO in the dark-domain adaptation method for merging multiple models. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI. Springer: Berlin/Heidelberg, Germany, 2020; pp. 345–359. [Google Scholar]
- Mo, Y.; Han, G.; Zhang, H.; Xu, X.; Qu, W. Highlight-assisted nighttime vehicle detection using a multi-level fusion network and label hierarchy. Neurocomputing 2019, 355, 13–23. [Google Scholar] [CrossRef]
- Vankadari, M.; Garg, S.; Majumder, A.; Kumar, S.; Behera, A. Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVIII. Springer: Berlin/Heidelberg, Germany, 2020; pp. 443–459. [Google Scholar]
- Liu, Y.; Wang, F.; Deng, J.; Zhou, Z.; Sun, B.; Li, H. MoGFace: Towards a deeper appreciation on face detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4093–4102. [Google Scholar]
- Hashmi, K.A.; Kallempudi, G.; Stricker, D.; Afzal, M.Z. Featenhancer: Enhancing hierarchical features for object detection and beyond under low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6725–6735. [Google Scholar]
- Liu, M.-Y.; Tuzel, O. Coupled generative adversarial networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Wang, Q.; Breckon, T.P. Generalized zero-shot domain adaptation via coupled conditional variational autoencoders. Neural Netw. 2023, 163, 40–52. [Google Scholar] [CrossRef] [PubMed]
- Gao, H.; Guo, J.; Wang, G.; Zhang, Q. Cross-domain correlation distillation for unsupervised domain adaptation in nighttime semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9913–9923. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
- Deng, X.; Wang, P.; Lian, X.; Newsam, S. NightLab: A dual-level architecture with hardness detection for segmentation at night. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16938–16948. [Google Scholar]
- Sakaridis, C.; Dai, D.; Gool, L.V. Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7374–7383. [Google Scholar]
- Du, Z.; Deng, J.; Shi, M. Domain-general crowd counting in unseen scenarios. Proc. AAAI Conf. Artif. Intell. 2023, 37, 561–570. [Google Scholar] [CrossRef]
- Luo, R.; Wang, W.; Yang, W.; Liu, J. Similarity min-max: Zero-shot day-night domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8104–8114. [Google Scholar]
- Land, E.H. The retinex theory of color vision. Sci. Am. 1977, 237, 108–129. [Google Scholar] [CrossRef] [PubMed]
- Lengyel, A.; Garg, S.; Milford, M.; van Gemert, J.C. Zero-shot day-night domain adaptation with a physics prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4399–4409. [Google Scholar]
Method | mAP (%)
---|---
DAI-Net + DINO | 25.37
DAI-Net + CLIP (ViT-based image encoder) | 26.81
DAI-Net + CLIP (ViT-based image encoder) + DINO | 29.60
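A quick check of the deltas in the ablation table above confirms that the full combination outperforms either single addition (values copied from the table):

```python
# mAP values (%) from the ablation table
map_dino = 25.37   # DINO only
map_clip = 26.81   # CLIP (ViT-based image encoder) only
map_both = 29.60   # CLIP + DINO combined

gain_over_clip = round(map_both - map_clip, 2)
gain_over_dino = round(map_both - map_dino, 2)
```

That is, combining both encoders adds 2.79 mAP over CLIP alone and 4.23 mAP over DINO alone.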
Share and Cite
Sun, H.; Liu, Y.; Chen, Z.; Zhang, P. Zero-Shot Day–Night Domain Adaptation for Face Detection Based on DAl-CLIP-Dino. Electronics 2025, 14, 143. https://doi.org/10.3390/electronics14010143