ConceptVAE: Self-Supervised Fine-Grained Concept Disentanglement from 2D Echocardiographies
Abstract
1. Introduction
- We introduce ConceptVAE, a novel SSL training framework that yields models capable of fine-grained disentanglement of concepts and styles from medical images. We evaluate the model using 2D cardiac echocardiographies, given the accessibility of datasets for pre-training and validation. Nevertheless, ConceptVAE is designed to be versatile and can potentially be applied to any 2D image modality.
- We qualitatively validate ConceptVAE and demonstrate its ability to identify concepts specialized for anatomical structures, such as blood pools or septum walls.
- We quantitatively validate ConceptVAE and show consistent improvements over traditional SSL methods across various tasks, including instance retrieval, semantic segmentation, object detection, and OOD detection.
- We assess ConceptVAE’s ability to generate data conditioned on concept semantics and discuss its potential to enhance robustness in dense prediction tasks.
2. Related Work
3. ConceptVAE
3.1. Model Architecture
3.2. Training Objectives
3.3. Pre-Training Data and Hyper-Parameters
4. Latent Space and Qualitative Analysis
- The prior constraint, which requires regions outside the cone to be modeled solely by the first concept (i.e., the background concept at index 0), is generally respected (see the sketch after this list). Exceptions occur at grid locations in the cone’s proximity, particularly at the boundaries between the cone and the background. As these are transition regions, they are not particularly concerning, since the model’s confidence is expected to be low there.
- Certain concepts are specialized for specific anatomical structures. For example, one concept models blood pools within the cone, another represents the Left Ventricle (LV) free wall on the right-hand side of the cone, two concepts correspond to septum walls, and another covers the right-heart side of the cone, among others.
- Certain concepts appear more isolated, spanning only a single grid location. By qualitatively assessing multiple input samples, we hypothesize that these concepts encode information about the local anatomical shapes of nearby, larger concept islands. They also appear to be assigned higher confidence than the average confidence inside larger concept islands. We term them modifier concepts.
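As an illustration of how the background prior can be checked, the following minimal PyTorch sketch (function names and tensor shapes are our assumptions, not the paper’s exact interfaces) derives the discrete concept map from the concept probability grid and measures how often out-of-cone grid locations fall on the background concept:

```python
import torch

def concept_map_and_prior_check(concept_probs: torch.Tensor,
                                cone_mask: torch.Tensor):
    """Derive the argmax concept map and check the background prior.

    concept_probs: (B, K, H, W) concept probability grid; index 0 is
                   assumed to be the background concept.
    cone_mask:     (B, H, W) binary mask, 1 inside the ultrasound cone.
    """
    concept_map = concept_probs.argmax(dim=1)   # (B, H, W) discrete concepts
    outside = cone_mask == 0                    # out-of-cone grid cells
    # Fraction of out-of-cone cells assigned to the background concept.
    respected = (concept_map[outside] == 0).float().mean()
    return concept_map, respected
```

Visualizing `concept_map` per input image is also how the larger concept islands and the single-cell modifier concepts described above can be inspected.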
5. Quantitative Model Analysis
5.1. Region-Based Instance Retrieval
- The concept probability grid indeed encodes semantic content and thus functions as a spatial arrangement of concepts, which ConceptVAE defines as composable, higher-level discrete features.
- ConceptVAE shows promising results for zero-shot instance retrieval based on local-region queries, unlike more traditional approaches that operate at the image level and need additional fine-tuning.
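A minimal sketch of how such local-region queries can be implemented, assuming the grid for one image has shape (K, H, W); the descriptor pooling and cosine-similarity ranking below are our assumptions rather than the paper’s exact retrieval pipeline:

```python
import torch
import torch.nn.functional as F

def region_descriptor(concept_probs: torch.Tensor,
                      box: tuple[int, int, int, int]) -> torch.Tensor:
    """Pool the concept probabilities inside a grid-aligned region.

    concept_probs: (K, H, W) concept probability grid for one image.
    box:           (y0, y1, x0, x1) region in grid coordinates.
    """
    y0, y1, x0, x1 = box
    return concept_probs[:, y0:y1, x0:x1].mean(dim=(1, 2))   # (K,)

def rank_candidates(query: torch.Tensor,
                    candidates: torch.Tensor) -> torch.Tensor:
    """Rank candidate descriptors (N, K) by cosine similarity to the query (K,)."""
    sims = F.cosine_similarity(query.unsqueeze(0), candidates, dim=1)
    return torch.argsort(sims, descending=True)
```

Because the descriptors are built directly from the frozen concept grid, this kind of retrieval requires no additional fine-tuning, in line with the zero-shot claim above.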
| Landmark | ConceptVAE | Baseline |
|---|---|---|
| left annulus | 0.418 | 0.148 |
| mid-septum | 0.281 | 0.098 |
| apex | 0.518 | 0.345 |
| mid-free-wall | 0.263 | 0.094 |
| right annulus | 0.371 | 0.128 |
| average | 0.370 | 0.163 |
5.2. Semantic Segmentation
5.3. Near-OOD Detection
5.4. Aortic Valve Detection
5.5. Style-Based Synthetic Data Generation
- The model uses the concept grid to decode semantic content, such as anatomical structures like chamber walls, blood pools, and valves, while the style code is used to particularize local textures, shadows, and speckles.
- With ConceptVAE, synthetic data can be generated by modifying only textures and speckles while retaining anatomical structures. This allows for the generation of novel samples that can serve as style augmentations without modifying the content, potentially enhancing the training performance of dense downstream models, such as those used for segmentation.
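A minimal sketch of this generation scheme, assuming (hypothetically) that the encoder returns a (concept grid, style code) pair and that the decoder consumes both; the actual ConceptVAE interfaces may differ:

```python
import torch

@torch.no_grad()
def style_swap(encoder, decoder,
               content_img: torch.Tensor,
               style_img: torch.Tensor) -> torch.Tensor:
    """Recombine the anatomy of one image with the style of another.

    Keeps the spatial concept grid (anatomical content) of `content_img`
    and replaces its style code (textures, shadows, speckles) with the
    one extracted from `style_img`.
    """
    concept_grid, _ = encoder(content_img)   # anatomical content
    _, style_code = encoder(style_img)       # local texture/speckle style
    return decoder(concept_grid, style_code)
```

Applying `style_swap` with many different style images to the same content image yields a family of style augmentations sharing identical anatomy, which is what makes them usable for training dense prediction models without re-labeling.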
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Model | Kernel | Concept Only | Style Only | Concept & Style |
|---|---|---|---|---|
| ConceptVAE | 1 × 1 | 0.4853 | | |
| ConceptVAE | 3 × 3 | 0.1741 | | |
| ConceptVAE | 5 × 5 | 0.1087 | | |
| ConceptVAE | 7 × 7 | 0.0938 | | |
| ConceptVAE | 9 × 9 | 0.0900 | | |
| ConceptVAE (rand. init.) | 1 × 1 | 0.6790 | | |
| ConceptVAE (rand. init.) | 3 × 3 | 0.4655 | | |
| ConceptVAE (rand. init.) | 5 × 5 | 0.2901 | | |
| ConceptVAE (rand. init.) | 7 × 7 | 0.2016 | | |
| ConceptVAE (rand. init.) | 9 × 9 | 0.1715 | | |
| VICReg | 1 × 1 | 0.187 | | |
| Metric | ConceptVAE | Baseline |
|---|---|---|
| “open-AV” class AP | 0.337 | 0.297 |
| “closed-AV” class AP | 0.386 | 0.459 |
| mean AP | 0.362 | 0.378 |
| objectness AP | 0.786 | 0.665 |