Hypergraph Neural Network for Multimodal Depression Recognition
Abstract
1. Introduction
- This study is the first to apply hypergraph neural networks to multimodal depression recognition. Using multimodal data, including speech, text, and facial video, hypergraphs model the complex, high-order relationships among patients with depression. We propose HYNMDR, a multimodal recognition framework that integrates a temporal embedding module and a hypergraph classification module.
- The temporal embedding module introduces a Euclidean distance-based negative sampling loss function and employs Temporal Convolutional Networks to extract feature embeddings from unimodal and multimodal long-sequence data. Because depression may manifest as anomalies in specific elements of these embeddings, the hypergraph classification module uses a threshold-based hyperedge construction method to capture this variability (illustrative sketches of both modules follow these highlights).
- Experimental results on the publicly available DAIC-WOZ and E-DAIC datasets demonstrate that HYNMDR significantly improves depression recognition accuracy. Ablation studies further highlight the contribution of each module to the overall classification performance.
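The sketch below is a minimal, hypothetical PyTorch rendering of the temporal embedding idea summarized above: a stack of dilated temporal convolutions pools a long feature sequence into a fixed-size embedding, and a Euclidean distance-based negative sampling loss pulls same-label embeddings together while pushing sampled negatives apart. The names (TCNEmbedder, euclidean_negative_sampling_loss), layer sizes, pooling choice, and margin are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): a dilated temporal convolution encoder
# plus a Euclidean distance-based negative sampling loss of the kind the
# temporal embedding module describes. Sizes and margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNEmbedder(nn.Module):
    """Stacked dilated 1-D convolutions mapping a long feature sequence
    (batch, channels, time) to a fixed-size embedding."""
    def __init__(self, in_channels: int, hidden: int = 64, levels: int = 4, embed_dim: int = 128):
        super().__init__()
        layers, ch = [], in_channels
        for i in range(levels):
            dilation = 2 ** i  # exponentially growing receptive field over the sequence
            layers += [
                nn.Conv1d(ch, hidden, kernel_size=3, padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
            ch = hidden
        self.tcn = nn.Sequential(*layers)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.tcn(x)          # (batch, hidden, time)
        h = h.mean(dim=-1)       # global average pooling over time
        return self.proj(h)      # (batch, embed_dim)

def euclidean_negative_sampling_loss(anchor, positive, negative, margin: float = 1.0):
    """Pull same-label embeddings together; push negative-sampled embeddings
    at least `margin` apart, using Euclidean distances."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return (d_pos.pow(2) + F.relu(margin - d_neg).pow(2)).mean()
```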
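Likewise, the following sketch shows one plausible reading of the threshold-based hyperedge construction together with a standard spectral hypergraph convolution of the HGNN form: each embedding dimension whose value crosses a threshold groups the corresponding samples into a hyperedge, and the resulting incidence matrix drives message passing for classification. The helper names (build_incidence_by_threshold, HypergraphConv), the per-dimension thresholding rule, the value of tau, and the layer sizes are assumptions for illustration rather than the paper's exact method.

```python
# Minimal sketch (an assumed reading, not the paper's exact construction):
# one hyperedge per embedding dimension, connecting the samples whose value on
# that dimension exceeds a threshold, followed by a standard HGNN layer.
import torch
import torch.nn as nn

def build_incidence_by_threshold(X: torch.Tensor, tau: float) -> torch.Tensor:
    """X: (n_samples, d) embeddings. Returns incidence H where H[v, e] = 1 if
    sample v's value on dimension e exceeds the threshold tau."""
    H = (X > tau).float()
    keep = H.sum(dim=0) > 1          # drop degenerate hyperedges with <2 vertices
    return H[:, keep]

class HypergraphConv(nn.Module):
    """One HGNN layer: X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta
    (edge weights W taken as identity for simplicity)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, X: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Dv = H.sum(dim=1).clamp(min=1)       # vertex degrees
        De = H.sum(dim=0).clamp(min=1)       # hyperedge degrees
        Dv_inv_sqrt = Dv.pow(-0.5).diag()
        De_inv = De.pow(-1.0).diag()
        A = Dv_inv_sqrt @ H @ De_inv @ H.t() @ Dv_inv_sqrt
        return torch.relu(A @ self.theta(X))

# Illustrative usage: embeddings from the temporal module feed the classifier.
X = torch.randn(32, 128)                      # 32 subjects, 128-dim fused embeddings
H = build_incidence_by_threshold(X, tau=0.5)  # hyperedges from thresholded dimensions
logits = nn.Linear(64, 2)(HypergraphConv(128, 64)(X, H))  # depressed vs. control
```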
2. Related Work
2.1. Single-Modal Depression Recognition Methods
2.2. Multimodal Depression Recognition Methods
2.3. Graph Neural Network Model
3. Multimodal Depression Recognition Method Based on Hypergraph
3.1. Temporal Embedding Module
3.2. Hypergraph Classification Module
3.2.1. Definition of Hypergraph
3.2.2. Hyperedge Construction Method Based on Threshold Segmentation
3.2.3. Depression Classification Based on Hypergraph
4. Experimental Analysis
4.1. Data Collection
4.2. Experimental Setup and Evaluation Metrics
4.3. Threshold Parameter Selection
4.4. Experimental Results and Analysis
4.5. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Marwaha, S.; Palmer, E.; Suppes, T.; Cons, E.; Young, A.H.; Upthegrove, R. Novel and emerging treatments for major depression. Lancet 2023, 401, 141–153.
- Li, X.; Zhang, X.; Zhu, J.; Mao, W.; Sun, S.; Wang, Z.; Xia, C.; Hu, B. Depression recognition using machine learning methods with different feature generation strategies. Artif. Intell. Med. 2019, 99, 101696.
- Yang, L.; Jiang, D.; Xia, X.; Pei, E.; Oveneke, M.C.; Sahli, H. Multimodal measurement of depression using deep learning models. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23 October 2017; pp. 53–59.
- AlHanai, T.; Ghassemi, M.M.; Glass, J.R. Detecting Depression with Audio/Text Sequence Modeling of Interviews. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1716–1720.
- Haque, A.; Guo, M.; Miner, A.S.; Li, F.-F. Measuring depression symptom severity from spoken language and 3D facial expressions. arXiv 2018, arXiv:1811.08592.
- Lam, G.; Dongyan, H.; Lin, W. Context-aware deep learning for multi-modal depression detection. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3946–3950.
- Chen, T.; Hong, R.; Guo, Y.; Hao, S.; Hu, B. S²-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection. IEEE Trans. Cybern. 2022, 53, 7749–7759.
- Hu, B.; Wang, X.; Wang, X.; Song, M.; Chen, D. Survey on hypergraph learning: Algorithm classification and application analysis. J. Softw. 2022, 33, 498–523.
- Daros, A.R.; Ruocco, A.C.; Rule, N. Identifying mental disorder from the faces of women with borderline personality disorder. J. Nonverbal Behav. 2016, 40, 255–281.
- Ansari, L.; Ji, S.; Chen, Q.; Cambria, E. Ensemble hybrid learning methods for automated depression detection. IEEE Trans. Comput. Soc. Syst. 2022, 10, 211–219.
- de Melo, W.C.; Granger, E.; Lopez, M. MDN: A deep maximization-differentiation network for spatio-temporal depression detection. IEEE Trans. Affect. Comput. 2021, 14, 578–590.
- Huang, Z.; Epps, J.; Joachim, D.; Sethu, V. Natural language processing methods for acoustic and landmark event-based features in speech-based depression detection. IEEE J. Sel. Top. Signal Process. 2019, 14, 435–448.
- Shao, W.; You, Z.; Liang, L.; Hu, X.; Li, C.; Wang, W.; Hu, B. A multi-modal gait analysis-based detection system of the risk of depression. IEEE J. Biomed. Health Inform. 2021, 26, 4859–4868.
- Yoon, J.; Kang, C.; Kim, S.; Han, J. D-vlog: Multimodal vlog dataset for depression detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; pp. 12226–12234.
- Shen, T.; Jia, J.; Shen, G.; Feng, F.; He, X.; Luan, H.; Tang, J.; Tiropanis, T.; Chua, T.S.; Hall, W. Cross-domain depression detection via harvesting social media. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1611–1617.
- Yang, L.; Jiang, D.; Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—Hybrid architectures. IEEE Trans. Affect. Comput. 2018, 12, 239–253.
- Mao, K.; Zhang, W.; Wang, D.B.; Li, A.; Jiao, R.; Zhu, Y.; Wu, B.; Zheng, T.; Qian, L.; Lyu, W. Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time distributed CNN. IEEE Trans. Affect. Comput. 2022, 14, 2251–2265.
- Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y. Graph attention model embedded with multi-modal knowledge for depression detection. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
- Zheng, W.; Yan, L.; Gou, C.; Zhang, Z.-C.; Zhang, J.J.; Hu, M.; Wang, F.-Y. Pay attention to doctor–patient dialogues: Multi-modal knowledge graph attention image-text embedding for COVID-19 diagnosis. Inf. Fusion 2021, 75, 168–185.
- Niu, M.; Chen, K.; Chen, Q.; Yang, L. HCAG: A hierarchical context-aware graph attention model for depression detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4235–4239.
- Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271.
- Wan, Z.; Yang, R.; Huang, M.; Zeng, N.; Liu, X. A review on transfer learning in EEG signal analysis. Neurocomputing 2021, 421, 1–14.
- Erzhankyzy, B. Negative-sampling word-embedding method. Neurocomputing 2022, 10, 15–21.
- Gao, Y.; Zhang, Z.; Lin, H.; Zhao, X.; Du, S.; Zou, C. Hypergraph learning: Methods and practices. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2548–2566.
- DAIC-WOZ Database & Extended DAIC Database. Available online: https://dcapswoz.ict.usc.edu/ (accessed on 14 March 2023).
- Baltrušaitis, T.; Robinson, P.; Morency, L.-P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), New York, NY, USA, 7–10 March 2016; pp. 1–10.
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
- Ringeval, F.; Schuller, B.; Valstar, M.; Cummins, N.; Cowie, R.; Tavabi, L.; Schmitt, M.; Alisamir, S.; Amiriparian, S.; Messner, E.-M. AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 3–12.
- Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.-T.; Dagli, C.; Quatieri, T.F. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 11–18.
- Gong, Y.; Poellabauer, C. Topic modeling based multi-modal depression detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23–27 October 2017; pp. 69–76.
- Dinkel, H.; Wu, M.; Yu, K. Text-based depression detection on sparse data. arXiv 2019, arXiv:1904.05154.
- Han, Z.; Shang, Y.; Shao, Z.; Liu, J.; Guo, G.; Liu, T.; Ding, H.; Hu, Q. Spatial-Temporal Feature Network for Speech-Based Depression Recognition. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 308–318.
- Kim, A.Y.; Jang, E.H.; Lee, S.-H.; Choi, K.-Y.; Park, J.G.; Shin, H.-C. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach. J. Med. Internet Res. 2023, 25, e34474.
- Xu, X.; Zhang, G.; Lu, Q.; Mao, X. Multimodal Depression Recognition that Integrates Audio and Text. In Proceedings of the 2023 4th International Symposium on Computer Engineering and Intelligent Communications (ISCEIC), New York, NY, USA, 18–20 August 2023; pp. 164–170.
Category | Method | Data Modality | Strengths | Weaknesses
---|---|---|---|---
Single-Modal Depression Recognition | Daros et al. [9] | Visual | Extracts facial features related to borderline personality disorder, inferring mental health status. | Limited to a single data modality; may miss multimodal context.
Single-Modal Depression Recognition | Ansari et al. [10] | Text | Uses sentiment lexicons and a DNN for cross-domain adaptation from Twitter to Weibo. | Text-only approach may miss non-verbal cues.
Single-Modal Depression Recognition | Melo et al. [11] | Visual | Maximization and differentiation modules capture smooth and sudden facial changes. | Limited to visual features; does not integrate text or audio.
Single-Modal Depression Recognition | Huang et al. [12] | Audio | Utilizes features from acoustic landmark events for depression detection. | Limited to the acoustic modality; does not integrate visual or textual features.
Multimodal Depression Recognition | AlHanai et al. [4] | Audio + Text | Sequential feature learning detects depression without manual question selection. | High resource demands of LSTM models limit real-time use in constrained settings.
Multimodal Depression Recognition | Lam et al. [6] | Audio + Text | Topic modeling enhances detection accuracy by focusing on relevant data segments. | May lack generalizability in real-world settings due to reliance on structured clinical data.
Multimodal Depression Recognition | Haque et al. [5] | Audio + Visual + Text | Causal CNNs outperform RNNs in processing long, unstructured interview sequences. | Limited applicability in real-world settings due to reliance on structured clinical data.
Multimodal Depression Recognition | Shao et al. [13] | Visual + Text | Skeleton and silhouette data fusion boosts accuracy to 85.45% by combining complementary information. | Skeleton data outperform silhouette data, highlighting limitations of silhouette-only models.
Multimodal Depression Recognition | Yoon et al. [14] | Audio + Visual | Demonstrates strong generalization across various datasets, including clinical ones. | Gender imbalance in the dataset may bias the model’s predictions.
Multimodal Depression Recognition | Ansari et al. [10] | Visual + Text | The text classifier uses hybrid and ensemble approaches to enhance depression detection performance. | Computational complexity may increase with additional modalities.
Multimodal Depression Recognition | Shen et al. [15] | Visual + Text | Enables effective knowledge transfer across platforms by addressing cultural differences between Twitter and Weibo. | Limited generalizability to other social media platforms or offline contexts.
Multimodal Depression Recognition | Yang et al. [16] | Audio + Visual + Text | PV and HDR capture unique depression markers, improving feature representation. | Model accuracy depends on input quality, which can vary in real-world settings.
Multimodal Depression Recognition | Mao et al. [17] | Audio + Text | Attention-based multimodal model using Bi-LSTM and a time-distributed CNN for audio and Bi-LSTM for text. | Unequal contributions from audio and text data may introduce prediction biases.
GNN-Based Medical Diagnostics | Zheng et al. [18,19] | Audio + Visual + Text | Multimodal self-attention network captures high-order knowledge–attention representations. | Limited to structured datasets with prior knowledge graphs.
GNN-Based Medical Diagnostics | Niu et al. [20] | Text + Audio | Hierarchical context-aware GNN aggregates question–answer pairs for accurate classification. | Limited modality integration beyond text and speech.
GNN-Based Medical Diagnostics | Chen et al. [7] | Audio + EEG | Multimodal GNN extracts cross-modal embeddings and uses an attention mechanism for representation. | Complexity may increase with additional data modalities.
Dataset | Model | Modality | F1 | Pre | Rec
---|---|---|---|---|---
DAIC-WOZ | SVM | A | 0.462 | 0.316 | 0.857
DAIC-WOZ | SVM | V | 0.500 | 0.600 | 0.428
DAIC-WOZ | SVM | AV | 0.500 | 0.600 | 0.428
DAIC-WOZ | DepAudioNet | A | 0.520 | 0.350 | 1.000
DAIC-WOZ | BioFeGPM | A | 0.570 | - | -
DAIC-WOZ | BioFeGPM | V | 0.530 | - | -
DAIC-WOZ | BioFeGPM | T | 0.840 | - | -
DAIC-WOZ | BioFeGPM | AVT | 0.810 | - | -
DAIC-WOZ | MulMol | A | 0.630 | 0.710 | 0.560
DAIC-WOZ | MulMol | T | 0.670 | 0.570 | 0.800
DAIC-WOZ | MulMol | AT | 0.770 | 0.710 | 0.830
DAIC-WOZ | ConaRe | AVT | 0.700 | - | -
DAIC-WOZ | BGRU | T | 0.870 | 0.850 | 0.930
DAIC-WOZ | CNN + Transformer | AVT | 0.870 | 0.910 | 0.830
DAIC-WOZ | STFN | AT | 0.769 | 0.650 | 0.920
DAIC-WOZ | MS-GNN | AVT | 0.830 | 0.800 | 0.860
DAIC-WOZ | HYNMDR | AVT | 0.924 | 0.873 | 0.983
E-DAIC | DCNN | AT | 0.780 | 0.834 | 0.735
E-DAIC | M-CBLALL | AT | 0.847 | 0.815 | 0.884
E-DAIC | HYNMDR | AVT | 0.914 | 0.889 | 0.936
Model | F1 | Pre | Rec |
---|---|---|---|
RNN | 0.640 | 0.628 | 0.652 |
LSTM | 0.747 | 0.710 | 0.791 |
GRU | 0.742 | 0.698 | 0.788 |
TEM | 0.783 | 0.734 | 0.833 |
Model | F1 | Pre | Rec |
---|---|---|---|
TEM-DNN | 0.780 | 0.834 | 0.735 |
GNN | 0.847 | 0.815 | 0.884 |
HGNN | 0.894 | 0.875 | 0.917 |
HCM | 0.911 | 0.889 | 0.936 |
Model | Modality | F1 | Pre | Rec
---|---|---|---|---
HYNMDR | AVT | 0.911 | 0.888 | 0.936
HYNMDR | A | 0.883 | 0.873 | 0.895
HYNMDR | V | 0.877 | 0.881 | 0.874
HYNMDR | T | 0.872 | 0.863 | 0.882
HYNMDR | AV | 0.893 | 0.875 | 0.913
HYNMDR | VT | 0.892 | 0.861 | 0.927
HYNMDR | AT | 0.885 | 0.873 | 0.898