Article

Hypergraph Neural Network for Multimodal Depression Recognition

1 School of Computer Science, Hunan University of Technology and Business, Changsha 410205, China
2 School of Big Data and Computer Science, Hechi University, Yizhou 546300, China
3 Hunan Provincial General University Key Laboratory of IoT Intelligent Sensing and Distributed Collaborative Optimization, Changsha 410205, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(22), 4544; https://doi.org/10.3390/electronics13224544
Submission received: 17 October 2024 / Revised: 12 November 2024 / Accepted: 15 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Digital Intelligence Technology and Applications)

Abstract

Deep learning-based approaches for automatic depression recognition offer advantages of low cost and high efficiency. However, depression symptoms are challenging to detect and vary significantly between individuals. Traditional deep learning methods often struggle to capture and model these nuanced features effectively, leading to lower recognition accuracy. This paper introduces a novel multimodal depression recognition method, HYNMDR, which utilizes hypergraphs to represent the complex, high-order relationships among patients with depression. HYNMDR comprises two primary components: a temporal embedding module and a hypergraph classification module. The temporal embedding module employs a temporal convolutional network and a negative sampling loss function based on Euclidean distance to extract feature embeddings from unimodal and cross-modal long-time series data. To capture the unique ways in which depression may manifest in certain feature elements, the hypergraph classification module introduces a threshold segmentation-based hyperedge construction method. This method is the first attempt to apply hypergraph neural networks to multimodal depression recognition. Experimental evaluations on the DAIC-WOZ and E-DAIC datasets demonstrate that HYNMDR outperforms existing methods in automatic depression monitoring, achieving an F1 score of 91.1% and an accuracy of 94.0%.

1. Introduction

Depression is one of the most common mental disorders, characterized by persistent low mood, sadness, and a lack of interest in activities. Early detection is crucial because it enables timely intervention and treatment, which can alleviate symptoms, reduce the incidence and mortality of comorbid conditions, lower relapse rates, and prevent other adverse outcomes. However, clinical diagnosis of depression remains challenging because of its subjective nature, relying heavily on physician assessments and patient self-reports [1]. Moreover, economic pressures and social stigma may lead some patients to conceal their mental health and physical symptoms, complicating timely identification [2]. Therefore, developing automated, cost-effective, and accurate depression diagnosis technologies is of significant practical importance.
With the rapid advancement of artificial intelligence, deep learning-based models have shown significant promise in the early detection of depression. Symptoms of depression are often reflected in patients’ language content, vocal tone, facial expressions, and other behavioral indicators. Deep learning models can extract latent features from these multimodal data, enabling the differentiation of depressive disorders from other mental health conditions and the general population. In recent years, researchers have increasingly applied deep learning techniques to depression recognition. Yang et al. [3] developed a depression recognition method based on Deep Convolutional Neural Networks (DCNNs). Alhanai et al. [4] employed text and audio data from the study population to assess depression tendencies with Long Short-Term Memory (LSTM) networks. Haque et al. [5] proposed a depression recognition framework using causal convolutional networks, which generate sentence-level embeddings from audio, text, and video data. Lam et al. [6] introduced a data augmentation approach based on topic modeling and employed Transformers to enhance model accuracy further. Unlike traditional neural networks, Graph Neural Networks (GNNs) can model complex relationships between entities. Researchers have begun exploring GNN to capture the associations among depression patients, aiding in the recognition of depression. Chen et al. [7] proposed a multimodal fusion framework based on GNN, which extracts cross-modal and modality-specific embedding vectors through modality-sharing and modality-specific networks, respectively. Additionally, they used a reconstruction network and attention mechanisms to achieve a multimodal representation of depression.
In recent years, deep learning-based methods for automatic depression recognition have advanced significantly, yet two significant challenges persist. First, depression symptoms are often subtle, and non-severe patients may only exhibit abnormalities over very brief periods, requiring a method capable of detecting abnormal symptom features in short time intervals. Second, depression is a highly diverse mental disorder with symptoms varying widely across individuals. To accurately classify samples, it is essential to capture the complex, high-order relationships between depression patients across multimodal data.
To address these challenges, this paper leverages the powerful expressive capabilities of hypergraphs to model the complex, high-order relationships among individuals with depression [8]. We propose a novel multimodal depression recognition method based on hypergraph neural networks (HGNNs) called HYNMDR (Hypergraph Neural Network for Multimodal Depression Recognition). The HYNMDR framework consists of two main components: the Temporal Embedding Module (TEM) and the Hypergraph Classification Module (HCM). The TEM module utilizes Temporal Convolutional Networks (TCNs) as its core framework. It applies one-dimensional convolutions and residual connections to extract features from long data sequences while embedding sequences of varying lengths into a shared low-dimensional space. Inspired by word embeddings and negative sampling techniques, we introduce a Euclidean distance-based negative sampling loss function to enhance the quality of the embeddings. This loss function minimizes the embedding distance between positive samples and maximizes the distance between negative samples, facilitating effective embedding learning for both unimodal and cross-modal data. To address potential anomalies in specific elements of depression-related embeddings and avoid low recognition accuracy from overall vector similarity comparisons, the HCM module introduces a threshold-based hyperedge construction method. Finally, classification is performed through the hypergraph neural network, utilizing the embedding vectors of the target subjects and the constructed hyperedges to achieve automatic depression recognition.
The main contributions of this paper are as follows:
  • This study is the first to apply hypergraph neural networks to multimodal depression recognition. By utilizing multimodal data, including speech, text, and facial video, hypergraphs are employed to model the complex, high-order relationships among depression patients. We propose the HYNMDR method, which is a multimodal recognition framework that integrates a temporal embedding module and a hypergraph classification module.
  • The temporal embedding module introduces a Euclidean distance-based negative sampling loss function. It employs Temporal Convolution Networks to extract feature embeddings from unimodal and multimodal long-sequence data. Recognizing that depression may manifest as anomalies in specific elements of the feature embeddings, the hypergraph classification module employs a threshold-based hyperedge construction method to address this variability.
  • Experimental results on the publicly available DAIC-WOZ and E-DAIC datasets demonstrate that HYNMDR significantly improves depression recognition accuracy. Additionally, ablation studies highlight the contribution of each module to the overall classification performance.

2. Related Work

To enable the early identification of depression patients, deep learning-based models have shown substantial potential for practical application. This section reviews relevant work from two perspectives based on the number of data modalities used: single-modal and multimodal depression recognition methods. In recent years, GNNs have gained significant attention in deep learning. These models effectively utilize graph structures to capture implicit relationships within medical data, making them highly suitable for diagnostic tasks. GNNs have been widely applied in disease recognition, and this section also introduces several medical diagnostic methods based on GNN. Table 1 shows the comparison of methods for depression detection.

2.1. Single-Modal Depression Recognition Methods

In recent years, depression recognition methods based on single-modal data have increasingly relied on deep learning techniques to extract high-level features associated with depressive disorders. These methods typically use data from diagnostic interviews, where individuals’ interactions, clinical interview questionnaires, or social media posts are observed. The data types involved include text, audio, facial expressions, video, electroencephalogram (EEG) signals, and gait patterns. For instance, Daros et al. [9] extracted facial features related to borderline personality disorder from static images to infer the mental health status of the subjects. Based on posts generated by individuals with depression or post-traumatic stress disorder on social media, Ansari et al. [10] proposed an ensemble hybrid method for depression recognition that utilizes various sentiment lexicons. This method employs a deep neural network with adaptive feature transformation and combination strategies, enabling domain adaptation from Twitter data to Weibo data. For video data, Melo et al. [11] developed a depression recognition method featuring a maximization module to capture smooth facial changes and a differential module for sudden ones. Huang et al. [12] developed a system framework using acoustic and landmark-based speech data, leveraging speech features to enhance depression recognition. Despite the significant progress demonstrated by the methods mentioned above in depression recognition, single-modal approaches often lack the nuanced understanding provided by multimodal data, which may reduce their accuracy in real-world depression detection applications. Our multimodal approach combines audio, text, and visual data to improve depression detection accuracy and generalizability. By integrating these features, our hypergraph-based model captures a richer representation of depression indicators, overcoming the limitations of single-modal methods and enhancing real-world performance.

2.2. Multimodal Depression Recognition Methods

Using data from multiple modalities enhances model accuracy and generalization. Multimodal deep learning models integrate information from various modalities to achieve superior classification performance. Several multimodal approaches for learning embeddings to detect depression have been proposed, demonstrating promising results. For example, Alhanai et al. [4] developed a multimodal LSTM model that leverages audio and text data for automatic depression recognition by extracting cross-modal semantic features. Lam et al. [6] employed Transformer and 1D CNN for acoustic feature modeling, proposing a multimodal depression recognition model based on data augmentation. Haque et al. [5] used causal convolutional networks to process multimodal interview data, including audio, video, and text, generating multimodal co-occurring semantic representations of depression. Shao et al. [13] introduced a multimodal fusion model that combines classifiers by extracting gait analysis features, such as skeleton, side, and front profiles, using LSTM and CNN networks. Yoon et al. [14] utilized a Transformer encoder to generate multimodal representations for depression recognition based on the DVlog dataset, comprising 961 YouTube video blogs. Ansari et al. [10] employed a Transformer with a cross-attention mechanism for multimodal depression recognition. Addressing data heterogeneity, Shen et al. [15] proposed a cross-domain neural network model that leverages Twitter as a source domain and Weibo as a target domain, applying data from the source to detect depression in the target domain. Yang et al. [16] presented a system framework incorporating deep and shallow audio, video, and text data models. Their audiovisual multimodal model estimates depression severity, while a segment vector–support vector machine model infers the individual’s physical and mental state. Based on the publicly available DAIC-WOZ depression dataset, Mao et al. [17] introduced an attention-based multimodal speech–text representation model for depression prediction. This model uses bidirectional LSTM networks and time-distributed CNN networks to learn audio features, and it uses Bi-LSTM networks to extract text features. Human physiological signals, often unordered, irregular, and non-Euclidean, are difficult to represent with matrices. Traditional LSTM or CNN network models, which are designed for regular 2D lattice data and 1D sequence data, struggle to extract high-level semantic features from such irregular, non-Euclidean graph-structured data as found in physiological signals. Unlike traditional models that rely on simple pairwise connections, our hypergraph neural network allows for connections between multiple nodes within hyperedges, effectively modeling intricate associations across audio, text, and visual modalities. This design improves the model’s capacity to recognize nuanced patterns within and across modalities, enhancing its robustness and accuracy in detecting depression.

2.3. Graph Neural Network Model

Graph Neural Networks (GNNs) are specialized neural networks designed to process graph-structured data consisting of vertices and edges. In recent years, GNNs have become a prominent research focus within deep learning, especially suited for handling the irregular and unordered nature of human physiological data, which makes them ideal for capturing implicit relationships in medical diagnostics. For example, in depression and COVID-19 diagnosis, Zheng et al. [18,19] introduced a multimodal knowledge graph that utilizes a multimodal self-attention network to generate knowledge representation vectors. These vectors are then combined with time-convolution self-attention networks to produce attention-based representation vectors, ultimately forming a joint knowledge–attention representation for medical diagnosis. Niu et al. [20] proposed a hierarchical context-aware graph attention model (HCAG) for depression recognition based on doctor–patient interview data, using question–answer pairs as node features. This graph attention network aggregates these features for depression classification. Chen et al. [7] developed a multimodal GNN fusion framework for depression recognition using audio and EEG data. This framework extracts cross-modal and modality-specific embedding vectors through shared and specific networks, respectively, and employs a reconstruction network to capture high-level semantic information. An attention mechanism further enriches the multimodal representation of depression. These studies demonstrate the potential of GNN to model complex relationships in human physiological data. Existing GNN methods have shown potential in depression and other medical diagnostic tasks, but they also have notable limitations. GNN models rely on binary relationships formed by vertices and edges, and this binary connection limits their ability to capture complex higher-order relationships. The hypergraph modeling approach proposed in this paper offers an enhanced method for capturing complex, higher-order relationships between nodes to address the limitations of binary relationships between graph nodes.

3. Multimodal Depression Recognition Method Based on Hypergraph

This paper introduces HYNMDR, a hypergraph-based method for multimodal depression recognition, leveraging the dataset $X$, which captures multimodal data, including text, audio, and facial video of diagnostic subjects. The overall framework of the network model is shown in Figure 1. The model comprises two main parts: a temporal embedding module and a hypergraph classification module. The temporal embedding module operates on multimodal long-sequence data $X = \{x_1, x_2, \ldots, x_N\}$ (where $x_i$ is the $i$-th sample, $1 \le i \le N$, and $N$ is the number of samples) to extract embedded representations of the sample dataset $I = \{I_1, I_2, \ldots, I_N\}$. The hypergraph classification module constructs hyperedges based on the embedded representations $I$ and, on this basis, derives the depression recognition results $Y = \{y_1, y_2, \ldots, y_N\}$. The overall HYNMDR process can be summarized as follows:
$$I = \mathrm{TEM}(X) \in \mathbb{R}^{d \times N}, \qquad Y = \mathrm{HCM}(I) \in \mathbb{R}^{2 \times N}$$
Here, $d$ represents the dimension of $x_i$, and $Y \in \mathbb{R}^{2 \times N}$ gives the probability that each patient belongs to the “depressed” or “normal” category. Section 3.1 provides a detailed description of the structure of the TEM module and the Euclidean distance-based negative sampling loss function. Section 3.2 introduces the HCM module and the threshold-based hyperedge construction method.

3.1. Temporal Embedding Module

As illustrated in Figure 1, the temporal embedding module preprocesses data for the three modalities: audio, text, and facial video. For audio data, feature extraction algorithms such as COVAREP extract audio features, including prosody, spectral, and vocal tract control features. For text processing, we selected the Doc2Vec word embedding technique over Word2Vec because Doc2Vec captures semantic information at the document level, which is essential for understanding the longer passages relevant to depressive states in depression detection; Word2Vec, in contrast, focuses on word-level embeddings and may lack sufficient context for document-level sentiment analysis. For video processing, we selected the OpenFace facial behavior analysis toolkit over CNN-based approaches because OpenFace provides a more efficient solution that is especially suited to continuous facial expression data, whereas CNN-based approaches, although powerful for image processing, typically require extensive labeled datasets and can be computationally intensive. The input dataset is denoted as $X = \{x_1, x_2, \ldots, x_N\}$, which is multimodal, with each sample $x_i = (a_i, t_i, v_i)$ comprising audio, text, and video feature sets. Specifically, $a_i \in \mathbb{R}^{d_1 \times l_1}$ is the audio feature time sequence, where $d_1$ is the feature dimension and $l_1$ is the sequence length; $t_i \in \mathbb{R}^{d_2 \times l_2}$ is the text feature (sentence-level) time sequence, where $d_2$ is the text feature dimension and $l_2$ is the sequence length; and $v_i \in \mathbb{R}^{d_3 \times l_3}$ is the video feature (facial key point) time sequence, where $d_3$ is the video feature dimension and $l_3$ is the sequence length. It is worth noting that in the temporal embedding module, the dimensions $d_1$, $d_2$, and $d_3$ remain constant during preprocessing regardless of the specific data processing method. For instance, text features are unified into 80-dimensional word vectors during embedding ($d_2 = 80$), while the sequence lengths $l_1$, $l_2$, and $l_3$ vary with the recording length of each participant, meaning that the lengths of the multimodal feature sequences are not aligned.
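To make these shape conventions concrete, the sketch below bundles one participant's preprocessed features into a simple container. It is illustrative only: the field names and the example dimensions (e.g., a 74-dimensional COVAREP frame and 136 OpenFace landmark coordinates) are assumptions for demonstration, not the exact configuration used in the paper; only $d_2 = 80$ is stated above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    """One participant x_i = (a_i, t_i, v_i); feature dimensions are fixed,
    sequence lengths vary with the length of each recording."""
    audio: np.ndarray  # shape (d1, l1): COVAREP-style frame features (d1 assumed 74)
    text: np.ndarray   # shape (d2, l2): Doc2Vec sentence vectors, d2 = 80
    video: np.ndarray  # shape (d3, l3): OpenFace key-point features (d3 assumed 136)

# Example: two participants with different interview lengths (unaligned sequences)
x1 = MultimodalSample(np.zeros((74, 1200)), np.zeros((80, 95)), np.zeros((136, 900)))
x2 = MultimodalSample(np.zeros((74, 2400)), np.zeros((80, 180)), np.zeros((136, 1800)))
```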
Following data preprocessing, the three modalities $a_i$, $t_i$, and $v_i$, which have varying sequence lengths, are mapped into a common low-dimensional space using three temporal convolution networks. To support both unimodal and cross-modal depression embedding learning, we define the feature embeddings for the audio, text, and video modalities as $I_i^a$, $I_i^t$, and $I_i^v$, respectively, and the cross-modal embedding as $I_i^c$. The reconstruction loss functions for each modality are defined as $\mathcal{L}_a$, $\mathcal{L}_t$, and $\mathcal{L}_v$, and for cross-modal learning, $\mathcal{L}_c$. Taking the audio modality as an example, for the unequal-length sequence $a_i = \{a_i(1), a_i(2), a_i(3), \ldots, a_i(l_1)\}$, we use the TCN network, denoted as the mapping function $f_{TCN}$, to generate the feature embedding for the audio modality:
$$I_i^a = f_{TCN}\left(a_i(1), a_i(2), a_i(3), \ldots, a_i(l_1)\right)$$
The TCN network employs causal convolution, dilation, and residual connections to capture sequential information effectively [21]. By setting appropriate filter sizes and layer depths, the entire historical sequence is compressed into the output, and the final convolutional layer generates the fixed-dimensional unimodal embedding $I_i^a$. To learn a high-quality embedding representation, we optimize from the following two aspects:
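As an illustration of how $f_{TCN}$ can map a variable-length feature sequence to a fixed-size embedding, the following PyTorch sketch stacks dilated causal 1-D convolutions with residual connections and reads out the last time step. The layer count, kernel size, and embedding width are illustrative choices, not the 10-layer configuration reported in Section 4.2.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """One dilated causal conv block with a residual connection."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                                # x: (batch, channels, length)
        out = self.conv(nn.functional.pad(x, (self.pad, 0)))
        return self.act(out) + x                         # residual connection

class TemporalEmbedder(nn.Module):
    """Maps a (d_in, l) sequence to a fixed d_emb embedding I_i (sketch of f_TCN)."""
    def __init__(self, d_in, d_emb, n_blocks=4, kernel_size=3):
        super().__init__()
        self.inp = nn.Conv1d(d_in, d_emb, kernel_size=1)
        self.blocks = nn.Sequential(
            *[CausalBlock(d_emb, kernel_size, dilation=2 ** b) for b in range(n_blocks)]
        )

    def forward(self, x):                                # x: (batch, d_in, length)
        h = self.blocks(self.inp(x))
        return h[:, :, -1]                               # last step summarizes the history

# Usage: embed one audio sequence a_i of length 1200 with d_1 = 74 features
audio_tcn = TemporalEmbedder(d_in=74, d_emb=64)
I_a = audio_tcn(torch.randn(1, 74, 1200))                # -> shape (1, 64)
```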
(1) Optimization based on embedding representations for classification. Given the relatively low error and fast training speed of deep neural networks on classification tasks, classification can effectively drive the learning of high-quality embeddings [22]. Here, $I_i^a$ serves as the input for the depression diagnosis classification task, and $y_i$ is the classification output, where $y_i = 1$ denotes a depressed patient and $y_i = 0$ a non-depressed patient. For this binary classification task, we minimize the cross-entropy loss $\mathcal{L}_{a1}$ as the optimization objective to learn the embedding representation $I_i^a$:
$$\mathcal{L}_{a1} = -\sum_{i=1}^{N}\left[y_i \log \hat{y}_i^a + \left(1 - y_i\right)\log\left(1 - \hat{y}_i^a\right)\right]$$
Here, $\hat{y}_i^a$ represents the probability that $I_i^a$ is predicted as a depressed patient, while $1 - \hat{y}_i^a$ is the probability that $I_i^a$ is predicted as not depressed.
(2) Embedding optimization based on graph neural network nodes. Cross-entropy loss optimization focuses primarily on the classification results, which may overlook the feature-space structure of the embeddings of depressed patients. This oversight could hinder the model’s ability to accurately reconstruct co-occurrence patterns among sample embeddings. To remedy this, it is necessary to design a loss function that targets the embedding representations of the graph neural network nodes. In word embedding learning, the co-occurrence of words and the maximization of the conditional probability of each word given its context serve as the common objective for embedding. Drawing inspiration from this approach for depression recognition [23], we design the embedding so that positive samples are positioned closer to each other in the embedding space, while negative samples are farther apart. During training, negative sampling randomly selects $K$ samples from categories different from that of $I_i^a$. We denote the positive sample and the $K$ negative samples for $I_i^a$ as $\bar{I}^a$ and $\tilde{I}_1^a, \tilde{I}_2^a, \ldots, \tilde{I}_K^a$, respectively. The objective function $\mathcal{L}_{a2}$ is defined to optimize the embedding representation, minimizing the Euclidean distance between the sample and the positive sample in the embedding space while maximizing the Euclidean distance between the sample and the $K$ negative samples:
$$\mathcal{L}_{a2} = \sum_{i=1}^{N}\left[\log \delta\left(\left\|I_i^a - \bar{I}^a\right\|_{d_1}\right) - \sum_{j=1}^{K}\log \delta\left(\left\|I_i^a - \tilde{I}_j^a\right\|_{d_1}\right)\right]$$
where $\|\cdot\|$ denotes the Euclidean distance between two vectors, and $\delta(\cdot)$ represents the sigmoid function. By combining Equations (3) and (4), we obtain the objective for the feature embedding representation of the audio modality data:
$$\mathcal{L}_a = \mathcal{L}_{a1} + \lambda_a \mathcal{L}_{a2}$$
where $\lambda_a$ is a system parameter used to balance the classification-based embedding optimization $\mathcal{L}_{a1}$ and the graph-node embedding optimization $\mathcal{L}_{a2}$. Similar to the process of learning the feature embedding for the audio modality, we minimize $\mathcal{L}_t$ and $\mathcal{L}_v$, as shown in Equations (6) and (7), to obtain the feature embedding representations of the text and visual modalities, denoted as $I_i^t$ and $I_i^v$, respectively.
$$\mathcal{L}_t = \mathcal{L}_{t1} + \lambda_c \mathcal{L}_{t2}, \quad
\mathcal{L}_{t1} = -\sum_{i=1}^{N}\sum_{p=1}^{P} y_{ip}\ln \hat{y}_{ip}^t, \quad
\mathcal{L}_{t2} = \sum_{i=1}^{N}\left[\ln \delta\left(\left\|I_i^t - \bar{I}^t\right\|_{d_2}\right) - \sum_{j=1}^{K}\ln \delta\left(\left\|I_i^t - \tilde{I}_j^t\right\|_{d_2}\right)\right]$$
$$\mathcal{L}_v = \mathcal{L}_{v1} + \lambda_c \mathcal{L}_{v2}, \quad
\mathcal{L}_{v1} = -\sum_{i=1}^{N}\sum_{p=1}^{P} y_{ip}\ln \hat{y}_{ip}^v, \quad
\mathcal{L}_{v2} = \sum_{i=1}^{N}\left[\ln \delta\left(\left\|I_i^v - \bar{I}^v\right\|_{d_3}\right) - \sum_{j=1}^{K}\ln \delta\left(\left\|I_i^v - \tilde{I}_j^v\right\|_{d_3}\right)\right]$$
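A minimal sketch of the per-modality objective $\mathcal{L} = \mathcal{L}_1 + \lambda \mathcal{L}_2$ follows, assuming the sign conventions of the reconstructed Equations (3)–(7): the sigmoid of the Euclidean distance to a same-class sample is driven down while the corresponding terms for $K$ negative samples are driven up. The batch-wise sampling scheme, the default $K$, and $\lambda$ are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(emb, labels, K=5):
    """Euclidean-distance negative sampling term (sketch of L_2).
    Assumes each mini-batch contains both classes; the sampled positive
    may occasionally be the sample itself, which is harmless for a sketch."""
    loss = 0.0
    for i in range(emb.size(0)):
        pos_idx = torch.where(labels == labels[i])[0]
        neg_idx = torch.where(labels != labels[i])[0]
        pos = pos_idx[torch.randint(len(pos_idx), (1,))]   # one positive sample
        neg = neg_idx[torch.randint(len(neg_idx), (K,))]   # K negative samples
        d_pos = torch.norm(emb[i] - emb[pos], dim=-1)      # small distance is good
        d_neg = torch.norm(emb[i] - emb[neg], dim=-1)      # large distance is good
        loss += torch.log(torch.sigmoid(d_pos)).sum() \
                - torch.log(torch.sigmoid(d_neg)).sum()
    return loss / emb.size(0)

def modality_loss(logits, emb, labels, lam=0.1, K=5):
    """L = L_1 (cross-entropy on logits standing in for y_hat) + lambda * L_2."""
    l1 = F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
    l2 = negative_sampling_loss(emb, labels, K)
    return l1 + lam * l2
```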
To obtain the cross-modal embedding representation, we apply the concept of a multimodal shared semantic space. The transformed feature vectors of the audio, text, and video modalities, $I_i^a$, $I_i^t$, and $I_i^v$, are combined to obtain the cross-modal embedding representation $I_i^c$, as shown in Equation (8):
$$I_i^c = \left(\left(I_i^a\right)^T W_{ac} + \left(I_i^t\right)^T W_{tc} + \left(I_i^v\right)^T W_{vc}\right)^T$$
where $W_{ac} \in \mathbb{R}^{d_1 \times (d_1 + d_2 + d_3)}$, $W_{tc} \in \mathbb{R}^{d_2 \times (d_1 + d_2 + d_3)}$, and $W_{vc} \in \mathbb{R}^{d_3 \times (d_1 + d_2 + d_3)}$ are the parameter matrices that map the audio, text, and video modalities to the cross-modal embedding representation $I_i^c$. The objective of learning the cross-modal feature embedding is to minimize $\mathcal{L}_c$:
$$\mathcal{L}_c = \mathcal{L}_{c1} + \lambda_c \mathcal{L}_{c2}, \quad
\mathcal{L}_{c1} = -\sum_{i=1}^{N}\sum_{p=1}^{P} y_{ip}\ln \hat{y}_{ip}^c, \quad
\mathcal{L}_{c2} = \sum_{i=1}^{N}\left[\ln \delta\left(\left\|I_i^c - \bar{I}^c\right\|_{d_1+d_2+d_3}\right) - \sum_{j=1}^{K}\ln \delta\left(\left\|I_i^c - \tilde{I}_j^c\right\|_{d_1+d_2+d_3}\right)\right]$$
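Equation (8) amounts to three learned linear maps into a shared $(d_1 + d_2 + d_3)$-dimensional space whose outputs are summed; a brief sketch with bias-free nn.Linear layers standing in for $W_{ac}$, $W_{tc}$, and $W_{vc}$ (embedding widths are illustrative):

```python
import torch
import torch.nn as nn

class CrossModalProjector(nn.Module):
    """I_c = (I_a^T W_ac + I_t^T W_tc + I_v^T W_vc)^T, as in Equation (8)."""
    def __init__(self, d1, d2, d3):
        super().__init__()
        d_c = d1 + d2 + d3
        self.w_ac = nn.Linear(d1, d_c, bias=False)   # stands in for W_ac
        self.w_tc = nn.Linear(d2, d_c, bias=False)   # stands in for W_tc
        self.w_vc = nn.Linear(d3, d_c, bias=False)   # stands in for W_vc

    def forward(self, I_a, I_t, I_v):                # each: (batch, d_m)
        return self.w_ac(I_a) + self.w_tc(I_t) + self.w_vc(I_v)

# Usage with illustrative per-modality embedding widths of 64
proj = CrossModalProjector(d1=64, d2=64, d3=64)
I_c = proj(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64))  # (8, 192)
```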
Combining $\mathcal{L}_a$, $\mathcal{L}_t$, $\mathcal{L}_v$, and $\mathcal{L}_c$, we obtain the overall objective function of the temporal embedding module, as shown in Equation (10). The relationships among the multimodal input data, embedding vectors, output probabilities, and loss functions are illustrated in Figure 2.
$$\left\{I_i^a, I_i^t, I_i^v, I_i^c\right\} = \arg\min\left(\mathcal{L}_a + \mathcal{L}_t + \mathcal{L}_v + \mathcal{L}_c\right)$$
The temporal embedding module utilizes Temporal Convolutional Networks to process multimodal data sequences, embedding sequences of varying lengths into a shared low-dimensional space. The complexity of a TCN typically scales linearly with the input sequence length $L$, the number of filters $F$, and the number of layers $D$, giving an approximate complexity of $O(L \times F \times D)$.

3.2. Hypergraph Classification Module

The temporal embedding module produces embedded representation vectors for speech, text, and video modalities and cross-modal features for the training samples. In this section, we first introduce several foundational definitions. Using the obtained single-modal and multimodal embedding representation vectors, we model the complex high-order relationships among depressed patients. Building on this, we propose a hypergraph-based multimodal depression recognition method.

3.2.1. Definition of Hypergraph

Definition 1. 
Hypergraph. A hypergraph is defined as a triplet $G = (V, E, W)$, where $V$ is the set of vertices and $E$ is a set of non-empty subsets of $V$; these non-empty subsets are called hyperedges. $W$ is the set of hyperedge weights, representing their importance, strength, or other characteristics. The size $N$ of the vertex set is called the order of the hypergraph, denoted $N = |V|$, and the number $M$ of hyperedges in the edge set is called the size of the hypergraph, denoted $M = |E|$.
Definition 2. 
Adjacency Matrix. In a hypergraph $G = (V, E, W)$, the two-dimensional matrix that stores the relationships between vertices and hyperedges is defined as the adjacency matrix $\mathbf{H}$ of the graph $G$, where the element in the $i$-th row and $j$-th column, $h_{i,j}$, is defined as
$$h_{i,j} = \begin{cases} 1, & v_i \in e_j \\ 0, & v_i \notin e_j \end{cases}$$
where $v_i$ is the $i$-th vertex in the vertex set ($1 \le i \le N$), $e_j$ is the $j$-th hyperedge in the edge set ($1 \le j \le M$), and $w_j \in W$ is the weight of hyperedge $e_j$. When $h_{i,j} = 1$, $v_i \in e_j$, indicating that hyperedge $e_j$ contains vertex $v_i$; when $h_{i,j} = 0$, $v_i \notin e_j$, indicating that hyperedge $e_j$ does not contain vertex $v_i$.
Definition 3. 
Vertex Degree Matrix. For a vertex $v_i \in V$, the degree of $v_i$, denoted $D(v_i)$, is defined as the sum of the weights of all hyperedges containing vertex $v_i$, i.e., $D(v_i) = \sum_{j=1}^{M} w_j h_{i,j}$. The vertex degree matrix $D_v$ is defined as the diagonal matrix of the vertex degrees.
Definition 4. 
Hyperedge Degree Matrix. For a hyperedge $e_j \in E$, the degree of $e_j$, denoted $D(e_j)$, is defined as the number of vertices contained in hyperedge $e_j$, i.e., $D(e_j) = \sum_{i=1}^{N} h_{i,j}$. The hyperedge degree matrix $D_e$ is defined as the diagonal matrix of the hyperedge degrees.
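Following Definitions 2–4, the two degree matrices can be computed directly from the adjacency matrix $\mathbf{H}$ and the hyperedge weights; a short sketch (uniform weights by default, as assumed in Section 3.2.3):

```python
import torch

def degree_matrices(H, w=None):
    """H: (N, M) hypergraph adjacency matrix; w: (M,) hyperedge weights (defaults to 1).
    Returns the diagonal matrices D_v (N, N) and D_e (M, M) per Definitions 3 and 4."""
    N, M = H.shape
    if w is None:
        w = torch.ones(M)
    D_v = torch.diag(H @ w)          # D(v_i) = sum_j w_j * h_ij
    D_e = torch.diag(H.sum(dim=0))   # D(e_j) = sum_i h_ij
    return D_v, D_e
```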

3.2.2. Hyperedge Construction Method Based on Threshold Segmentation

Before detailing the hypergraph-based depression classification method, it is essential to outline the hypergraph construction process. Here, we represent the $N$ training samples as nodes of the hypergraph, using the cross-modal embedding vectors derived from the temporal embedding module as their node representations. For the construction of hyperedges, we utilize the single-modality embedding vectors of the samples, obtained from the temporal embedding module, to capture the complex relationships between vertices within each modality.
Currently, existing methods for generating hyperedges can be broadly categorized into four categories: attribute-based, network-based, distance-based, and representation-based methods [24]. Attribute-based approaches are often unsuitable for hyperedge construction in depression detection due to the high sensitivity of medical data and patient privacy requirements, which limit the availability of additional attribute information. Moreover, network-based methods are ineffective, as depression recognition data typically lack an inherent graph structure, existing independently at the sample level. Distance-based methods typically take the current vertex as the center and find the k nearest vertices to form a hyperedge. Each node in the hypergraph is successively treated as the current vertex, ultimately constructing N hyperedges, each with a degree of k + 1 . Representation-based methods utilize sparse coding to model node relationships by optimizing the embedding of the central vertex alongside its k nearest neighbors, calculating edge weights and forming hyperedges with non-zero weights. However, these distance-based methods focus on overall similarity among vertex embeddings, overlooking the fact that depression feature embeddings may exhibit abnormalities in specific elements only, limiting their applicability to depression recognition tasks.
Inspired by threshold segmentation in the image domain, we propose a hyperedge construction method based on threshold segmentation. This approach avoids direct comparison of overall embedding vector similarity by focusing on individual feature values within the depression embedding vectors. Notably, each dimension of the depression embedding vector represents the likelihood that the sample contains a certain high-level feature, with higher values indicating a greater likelihood. Two samples with high values in the same dimension can therefore be regarded as similar in that feature; conversely, two samples with low feature values in that dimension are not considered similar in that feature. By applying the threshold segmentation method, we iteratively assign the samples that are similar in each dimension of the embedding vector to a hyperedge. This process allows us to construct an adjacency matrix $\mathbf{H}$ for a single modality. We assume $N$ training samples with corresponding feature embedding representations $I_1, \ldots, I_N \in \mathbb{R}^d$, where $d$ is the dimension of the feature embedding. Based on the hyperedge construction algorithm using threshold segmentation, the matrix elements $h_{i,j}$ of $\mathbf{H}$ are given by
$$h_{i,j} = \begin{cases} 1, & I_{i,j} \ge \theta \\ 0, & I_{i,j} < \theta \end{cases}$$
where $1 \le i \le N$ and $1 \le j \le d$. Figure 3 illustrates the process of generating the adjacency matrix $\mathbf{H}$ with seven sample embedding vectors of dimension 7; the resulting matrix $\mathbf{H}$ contains seven hyperedges. Assuming a threshold $\theta = 0.8$, since the embedding features of the 1st and 7th samples are greater than or equal to the threshold $\theta$ in the first dimension, the first hyperedge includes vertices 1 and 7. Following this approach, all hyperedge values are computed to form the adjacency matrix $\mathbf{H}$. Using the same method, three separate adjacency matrices are constructed for the audio, text, and facial video modalities: $\mathbf{H}_a \in \mathbb{R}^{N \times d_1}$, $\mathbf{H}_t \in \mathbb{R}^{N \times d_2}$, and $\mathbf{H}_v \in \mathbb{R}^{N \times d_3}$. These matrices are then concatenated to form the full hypergraph adjacency matrix $\mathbf{H} = [\mathbf{H}_a, \mathbf{H}_t, \mathbf{H}_v] \in \mathbb{R}^{N \times (d_1 + d_2 + d_3)}$.
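In code, the threshold-segmentation construction of Equation (12) reduces to binarizing each unimodal embedding matrix element-wise and concatenating the results column-wise, so that each embedding dimension yields one hyperedge. The sketch below assumes the embedding values are normalized to [0, 1]; the default threshold follows the value selected in Section 4.3.

```python
import torch

def build_incidence(I_a, I_t, I_v, theta=0.85):
    """Threshold-segmentation hyperedge construction (Equation (12)).
    Each embedding dimension spawns one hyperedge containing every sample
    whose value in that dimension is >= theta.
    I_a: (N, d1), I_t: (N, d2), I_v: (N, d3) unimodal embeddings in [0, 1]."""
    H_a = (I_a >= theta).float()               # (N, d1)
    H_t = (I_t >= theta).float()               # (N, d2)
    H_v = (I_v >= theta).float()               # (N, d3)
    return torch.cat([H_a, H_t, H_v], dim=1)   # H in {0,1}^{N x (d1+d2+d3)}

# Toy check mirroring Figure 3: 7 samples, 7-dimensional embedding, theta = 0.8;
# each column of H corresponds to one hyperedge.
I_toy = torch.rand(7, 7)
H_toy = (I_toy >= 0.8).float()
```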

3.2.3. Depression Classification Based on Hypergraph

In the hyperedge weight matrix $\mathbf{W}$, assuming that all hyperedge weights are identical, $\mathbf{W}$ is the identity (unit diagonal) matrix. Based on the hypergraph adjacency matrix $\mathbf{H}$ constructed in Section 3.2.2 and the definitions of the vertex and hyperedge degree matrices, we generate the vertex degree matrix $D_v$ and the hyperedge degree matrix $D_e$. For the vertex embedding representations of the hypergraph $G = (V, E, W)$, we apply the mapping function $f: \mathbb{R}^{N \times (d_1 + d_2 + d_3)} \rightarrow \mathbb{R}^{N \times P}$ to obtain the classification results of the depression recognition samples $F = (y_i)$, where $y_i$ is the classification result of sample $v_i$. The model uses an activation function in the final layer to transform the output into a probability distribution. We transform the depression recognition classification task into the following objective:
$$\arg\min_{f} \left\{ R_{emp}(f) + \lambda\, \Omega(f) \right\}$$
Here, $R_{emp}(f)$ is the loss function for the classification task, and $\Omega(f)$ is the hypergraph regularization term, which assesses the smoothness of the hypergraph signal. $\lambda$ is a system parameter that balances the two objectives. For the classification loss $R_{emp}(f)$, we employ the least squares error as one of the optimization objectives:
$$R_{emp}(f) = (y - f)^2$$
Minimizing the smoothness of the hypergraph signal is another optimization objective of the current task.
$$\Omega(f) = \frac{1}{2}\sum_{k=1}^{d_1+d_2+d_3}\ \sum_{1 \le i,j \le N} \frac{w_k}{\delta_k}\left\|\frac{f_i}{\sqrt{\rho_i}} - \frac{f_j}{\sqrt{\rho_j}}\right\|^2 = f^{T}\left(I - D_v^{-1/2}\mathbf{H}\mathbf{W}D_e^{-1}\mathbf{H}^{T}D_v^{-1/2}\right)f$$
Combining Equations (14) and (15), we derive the overall optimization objective of the hypergraph classification module, as shown in Equation (13). The optimized $f$ is obtained by training stacked hypergraph convolutional layers, each of which is defined as
$$U^{(l+1)} = \sigma_l\left(D_v^{-1/2}\mathbf{H}\mathbf{W}D_e^{-1}\mathbf{H}^{T}D_v^{-1/2}\, U^{(l)}\,\Theta^{(l)}\right)$$
where $U^{(l)}$ denotes the hypergraph node features at the $l$-th layer, $\Theta^{(l)}$ denotes the parameters learned during training to extract the features of the $(l+1)$-th layer, and $\sigma_l$ is the non-linear activation function at the $l$-th layer.
$$U^{(1)} = \left[I_i^a, I_i^t, I_i^v, I_i^c\right] \in \mathbb{R}^{N \times (d_1 + d_2 + d_3)}, \qquad f = U^{(l+1)} \in \mathbb{R}^{N}$$
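A compact sketch of one hypergraph convolutional layer implementing Equation (16), together with a two-layer stack and a sigmoid readout of per-node depression probabilities, is given below. The layer widths and the toy adjacency matrix are illustrative; dropout and the three-layer configuration of Section 4.2 are omitted for brevity.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One propagation step of Equation (16); the activation is applied by the caller."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)   # Theta^{(l)}

    def forward(self, U, H, w=None):
        N, M = H.shape
        w = torch.ones(M, device=H.device) if w is None else w
        Dv_inv_sqrt = torch.diag((H @ w).clamp(min=1e-6) ** -0.5)   # D_v^{-1/2}
        De_inv = torch.diag(H.sum(dim=0).clamp(min=1e-6) ** -1)     # D_e^{-1}
        G = Dv_inv_sqrt @ H @ torch.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
        return G @ self.theta(U)

# Two stacked layers followed by a per-node depression probability
H = (torch.rand(32, 192) > 0.7).float()          # toy adjacency matrix, 32 samples
U = torch.randn(32, 192)                         # node features U^{(1)}
layer1, layer2 = HypergraphConv(192, 64), HypergraphConv(64, 1)
f = torch.sigmoid(layer2(torch.relu(layer1(U, H)), H)).squeeze(-1)   # shape (32,)
```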
The HCM constructs hyperedges based on threshold segmentation and applies a hypergraph neural network to classify nodes, capturing complex higher-order relationships. The threshold-based hyperedge construction involves comparing feature vectors, with a complexity that depends on the number of samples $N$ and the feature dimension $d$; if exhaustive pairwise comparisons are performed, the complexity is approximately $O(N^2 \times d)$. Furthermore, the complexity of the hypergraph neural network depends on the number of hyperedges $M$ and the degree of each hyperedge, with an information propagation complexity of approximately $O(N + M)$.

4. Experimental Analysis

In this section, we introduce the two public datasets used for depression recognition research, DAIC-WOZ and E-DAIC, followed by a description of the experimental setup and model evaluation metrics. We then compare the proposed hypergraph-based HYNMDR method for multimodal depression recognition with several existing approaches and present a series of ablation experiments to validate HYNMDR’s effectiveness.

4.1. Data Collection

DAIC-WOZ dataset: This dataset consists of speech, text, and facial video data from 142 depressed and non-depressed patients. It comes from semi-structured clinical interviews, in which clinicians remotely control a virtual agent that converses with the patient, asking a series of questions. Based on the patient’s answers, the clinician determines whether the patient is depressed. Detailed information can be found on the original dataset website [25]. The audio recordings were captured via headset microphones at a 16 kHz sampling rate, transcribed, and segmented into sentences and phrases with millisecond-level timestamps. The DAIC-WOZ dataset employs the COVAREP algorithm to extract prosodic and spectral features from the speech data, utilizes the Doc2Vec word embedding method [27] to generate sentence-level vector representations for the text data, and applies the OpenFace toolkit [26] to label facial key points, capturing continuous facial point clouds that reflect subtle expressions.
E-DAIC dataset [28]: This dataset is an extended version of the DAIC-WOZ dataset, containing 275 audiovisual and text data samples. For the speech modality, a bag-of-words model is applied, extracting features such as Mel-frequency cepstral coefficients and Gammatone filter cepstral coefficients. The text modality utilizes the Doc2Vec method to obtain 50-dimensional sentence vectors, while the video modality leverages the OpenFace algorithm to capture the subjects’ appearance, facial geometry, and action units.

4.2. Experimental Setup and Evaluation Metrics

In our experiments, to address class imbalance in the dataset, we implemented an oversampling strategy to enhance model accuracy and robustness. Oversampling increases the representation of minority classes by duplicating existing samples and generating synthetic ones, leading to a more balanced class distribution. In depression recognition, where subtle differences in depressive states can be crucial, the under-representation of the minority class could hinder accurate recognition. By increasing the number of minority-class samples, oversampling helps the model capture these distinctions, improving accuracy and generalization for the effective, consistent detection of emotional states in real-world applications.
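The paper does not specify the exact oversampling procedure, so the sketch below shows one straightforward realization: randomly duplicating minority-class indices until the two classes are balanced.

```python
import numpy as np

def oversample_indices(labels, seed=0):
    """Duplicate minority-class samples until both classes are equally represented.
    Assumes both classes are present in `labels`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx_pos, idx_neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
    minority, majority = sorted([idx_pos, idx_neg], key=len)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([majority, minority, extra])

# Example: 100 non-depressed vs. 30 depressed -> 200 balanced training indices
balanced = oversample_indices([0] * 100 + [1] * 30)
```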
To maximize the use of experimental data, we split the dataset into a training set and a test set at an 8:2 ratio, re-splitting them for each experiment. The TEM module employs a 10-layer temporal convolutional network for each modality with a dropout rate of 0.5. The hypergraph classification module consists of three layers of a hypergraph neural network, also using a dropout rate of 0.5. We set the number of training iterations to 600 and applied a weight decay coefficient of 0.0005. The model is trained on an Nvidia GeForce RTX 2080ti. The model performance is evaluated using precision (PRE), recall (REC), and F1 score, which are defined as follows:
$$\mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{REC} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$
Here, TP, TN, FP, and FN denote True Positive, True Negative, False Positive, and False Negative, respectively.
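For completeness, the three metrics can be computed directly from the confusion counts; the counts in the usage example are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """PRE, REC, and F1 from confusion counts (Section 4.2)."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# Illustrative counts only
print(precision_recall_f1(40, 5, 2))   # ~ (0.889, 0.952, 0.920)
```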

4.3. Threshold Parameter Selection

In this model, the threshold $\theta$ is critical for hypergraph generation and recognition performance. This section discusses the optimal value of the threshold parameter $\theta$ in the threshold-based hyperedge construction method. Experiments were conducted on the E-DAIC dataset, with $\theta$ set between 0.1 and 0.95. Figure 4 shows the depression recognition results of HYNMDR with respect to the parameter $\theta$. When $\theta$ is small, the hyperedge construction resembles a fully connected graph, resulting in relatively high recognition accuracy. However, as $\theta$ increases, the model becomes less effective at differentiating dissimilar samples, resulting in a gradual decline in accuracy. Once $\theta$ exceeds 0.5, recognition accuracy begins to improve, as the model can better distinguish samples that are similar in individual features. When $\theta$ is between 0.8 and 0.9, recognition accuracy reaches its highest point. Further increases in $\theta$ make it difficult to group samples that are similar in individual features into the same hyperedge, reducing connectivity and decreasing recognition accuracy. Therefore, we set the optimal value of $\theta$ to 0.85.

4.4. Experimental Results and Analysis

To verify the effectiveness of the HYNMDR method, we compared its performance on key depression recognition metrics on the DAIC-WOZ dataset against various established methods. The comparison methods include the Support Vector Machine (SVM), the CNN- and LSTM-based DepAudioNet [7], the Biomarker Features-based Gaussian Process Model (BioFeGPM) [29], the Multimodal LSTM depression recognition model (MulMol) [4], the Context-aware Regression model (ConaRe) [30], the multitask bidirectional gated recurrent unit (BGRU) [31], the CNN- and Transformer-based multimodal model (CNN + Transformer) [6], the Vector Quantized Transformer Network-based Spatial-Temporal Feature Network (STFN) [32], and the Graph Convolutional Network model MS²-GNN [7]. Since few depression recognition methods have been applied to the E-DAIC dataset, we benchmarked HYNMDR against the deep CNN model (DCNN) [33] and the hybrid CNN-BiLSTM-Attention + LSTM-LSTM model (M-CBLALL) [34]. The experimental results are presented in Table 1, where “A”, “V”, and “T” denote the audio, visual, and textual modalities, respectively. Combinations such as “AV”, “AT”, and “AVT” indicate multimodal setups involving two or all three data types. Dashes indicate metrics not reported in the respective papers.
On the DAIC-WOZ dataset, the proposed HYNMDR method shows a substantial advantage in precision (Pre), recall (Rec), and F1 score over single- and dual-modal depression recognition methods, such as SVM, DepAudioNet, BioFeGPM, MulMol, and STFN, as well as over tri-modal (“AVT”) methods such as BioFeGPM, ConaRe, CNN + Transformer, and MS²-GNN. Among the compared baselines, CNN + Transformer achieved relatively strong results, with 0.91 precision, 0.83 recall, and 0.87 F1 score. Although HYNMDR’s precision (0.873) was slightly lower than that of CNN + Transformer, it showed a notable advantage in recall, which reached 0.98, raising the F1 score to 0.924. Compared to the Graph Neural Network-based depression recognition method MS²-GNN, HYNMDR performed better across all three metrics (Pre: HYNMDR 0.873 vs. MS²-GNN 0.8; Rec: HYNMDR 0.983 vs. MS²-GNN 0.857; F1: HYNMDR 0.924 vs. MS²-GNN 0.857). On the E-DAIC dataset, HYNMDR outperformed two recently introduced multimodal depression recognition methods, DCNN and M-CBLALL, in Pre, Rec, and F1 score. The higher recall indicates HYNMDR’s effectiveness in capturing subtle depressive cues, and its superior F1 score highlights improved classification accuracy. This success stems from HYNMDR’s use of Temporal Convolutional Networks to extract nuanced depressive features from lengthy sequences. Furthermore, by leveraging hypergraphs, HYNMDR can effectively model the complex, hierarchical relationships within depressive states, which traditional graph structures may not capture adequately.
The results in Table 2 and Table 3 summarize model performance across different modalities, but further analysis of these trends could deepen our understanding of each modality’s contribution to depression detection. Among the three modalities—audio, text, and video—text-based features performed relatively worse on the DAIC-WOZ dataset. This may be due to the difficulty in capturing depressive symptoms through textual data alone, as effective analysis often requires nuanced contextual understanding that a single embedding might miss. Additionally, the text data in our datasets may lack detailed expressions of depressive cues compared to the richer emotional content conveyed by audio and video data. In contrast, audio data often capture subtle variations in tone, pitch, and rhythm that are strong indicators of emotional states, making it a particularly informative modality in depression detection. Video data, similarly, provide visual cues like facial expressions and microexpressions, which are critical for identifying depression. These distinct characteristics likely explain the stronger performance of audio and video data. To enhance interpretability, future work can focus on modality-specific analyses to identify which particular features within each modality (e.g., vocal pitch in audio or eye movement in video) most effectively predict depressive symptoms, aiding in the targeted enhancement of multimodal depression detection models.
To evaluate the data embedding quality during the learning process of the HYNMDR network model, we conducted a visual assessment of the sample embeddings generated at various stages. Specifically, using the DAIC-WOZ and E-DAIC datasets, we input the embeddings from the Temporal Embedding Module and the second and third layers of the Hypergraph Classification Module into the t-SNE tool for two-dimensional visualization. The visualization results are shown in Figure 5 and Figure 6. In Figure 5, the green triangles represent normal samples (multimodal data of non-depressed patients), while the purple dots represent abnormal samples (multimodal data of depressed patients). In Figure 6, the yellow dots represent normal samples, and the blue triangles represent abnormal samples. As observed, there is some clustering of normal and abnormal samples after the TEM stage, though a notable overlap remains between their regions. Following the second layer of the HCM, the samples show a clearer separation, with abnormal samples clustering effectively. In the third HCM layer, the clustering of abnormal samples becomes even more distinct. Figure 6 shows that the HYNMDR model effectively distinguishes between normal and abnormal samples in the E-DAIC dataset with the TEM, and recognition accuracy further improves after passing through the HCM.
Through a comparative analysis of 10 experimental results, we compared the F1 scores of the baseline model and the HYNMDR model and conducted a paired t-test, yielding a p-value of 0.03. According to conventional statistical standards (p < 0.05), this difference can be considered statistically significant, indicating that the HYNMDR model has achieved a notable performance improvement over the baseline model. This statistical significance test further validates the stability and reliability of the HYNMDR model in multimodal depression recognition tasks, ruling out the possibility that the observed performance improvement is merely due to random variation.

4.5. Ablation Study

To verify the effectiveness of the TEM and the HCM, we conducted ablation experiments on the DAIC-WOZ and E-DAIC datasets. To effectively extract features from long-sequence data in the TEM, we tested commonly used models—Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU)—as replacements for the TEM module. From the original dataset, we randomly selected 30 normal samples and 30 depressive samples, using RNN, LSTM, GRU, and TEM to evaluate the quality of long-sequence feature embeddings. Table 3 presents the average recognition results of the four models across the two datasets and three modalities. As shown in Table 3, the RNN model produced the lowest-quality embeddings, demonstrating limited suitability for long time-series data. LSTM and GRU showed similar performance, achieving F1 scores of around 0.74, with GRU, as a variant of LSTM, performing comparably. Utilizing a temporal convolutional network, the TEM exhibited strong capabilities in modeling long-sequence data, yielding high-quality sample embeddings and demonstrating its superiority in this task.
To verify the effectiveness of the HCM, we conducted ablation experiments based on the TEM using different model architectures as follows: (1) TEM-DNN: a two-layer fully connected deep neural network connected to the TEM without utilizing a Graph Neural Network; (2) GNN: a graph neural network that constructs edges between vertices by selecting the k shortest edges based on Euclidean distance; and (3) HGNN: a hypergraph neural network that, unlike the HCM, does not employ threshold-based hyperedge construction but instead forms hyperedges containing k + 1 vertices using a Euclidean distance strategy. Table 4 presents the precision, recall, and F1 score results of TEM-DNN, GNN, HGNN, and HCM on the E-DAIC dataset across the three modalities. The results indicate that using the hypergraph neural network with a threshold-based hyperedge construction strategy enhances the accuracy of depression recognition. Similar results were obtained on the DAIC-WOZ dataset, confirming the robustness of the HCM approach.
To evaluate the multimodal learning ability and generalization ability of HYNMDR, we conducted modality ablation experiments on the E-DAIC dataset, testing the model with one or two data modalities. As can be seen from the results of the modality ablation experiments in Table 5, HYNMDR achieved the highest scores in Pre, Rec, and F1 across all three modalities with a recognition accuracy of 0.911. These results indicate that HYNMDR is well adapted to multimodal scenarios.

5. Conclusions

To address the complex challenges of depression recognition, such as subtlety in symptoms and individual variations, this paper introduces HYNMDR, which is a hypergraph-based multimodal depression recognition method. HYNMDR utilizes a hypergraph to extract high-level semantic features and model complex, high-order relationships associated with depression. The method comprises two main components: a Temporal Embedding Module, which uses a Temporal Convolutional Network and a Euclidean distance-based loss function to learn embedded representations from both unimodal and multimodal data, and a hypergraph classification module, which employs a threshold-based hyperedge construction method. This module utilizes the embedded vectors of hypergraph nodes and constructed hyperedges to perform multimodal depression recognition through a hypergraph neural network. Experimental results on two public datasets, DAIC-WOZ and E-DAIC, demonstrate that HYNMDR outperforms existing multimodal depression recognition algorithms across multiple metrics, including precision and F1 score.
In addition to these promising results, exploring future applications and research directions could further enhance the practical utility of HYNMDR. One potential direction is optimizing HYNMDR for real-time applications, enabling instant emotional feedback in scenarios like clinical settings or wearable devices. Moreover, integrating additional data sources, such as physiological signals or contextual data, could further improve depression recognition accuracy. For example, capturing real-time physiological indicators like heart rate or skin conductance and using environmental data to supplement existing multimodal information may enhance depression detection capability across broader real-world scenarios and diverse emotional expressions.

Author Contributions

Conceptualization, Z.L.; Methodology, X.L., Y.Y. and Z.L.; Software, Y.D.; Validation, Z.L.; Formal analysis, Y.Y.; Investigation, Y.D.; Writing—original draft, X.L. and Y.D.; Writing—review & editing, X.L. and S.Y.; Visualization, Y.Y.; Project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the IoT Intelligent Perception Technology Innovation Team of Hunan Province Ordinary Higher Education Institutions; the General Project of Xiangjiang Laboratory (23XJ03014); the Hunan Provincial Natural Science Foundation Project (2023JJ70005); Key Research and Development Program of Hunan Province under grant NO. 2024JK2007, and Hunan Provincial Natural Science Foundation of China under grant No. 2023JJ40237.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Marwaha, S.; Palmer, E.; Suppes, T.; Cons, E.; Young, A.H.; Upthegrove, R. Novel and emerging treatments for major depression. Lancet 2023, 401, 141–153. [Google Scholar] [CrossRef] [PubMed]
  2. Li, X.; Zhang, X.; Zhu, J.; Mao, W.; Sun, S.; Wang, Z.; Xia, C.; Hu, B. Depression recognition using machine learning methods with different feature generation strategies. Artif. Intell. Med. 2019, 99, 101696. [Google Scholar] [CrossRef] [PubMed]
  3. Yang, L.; Jiang, D.; Xia, X.; Pei, E.; Oveneke, M.C.; Sahli, H. Multimodal measurement of depression using deep learning models. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23 October 2017; pp. 53–59. [Google Scholar]
  4. AlHanai, T.; Ghassemi, M.M.; Glass, J.R. Detecting Depression with Audio/Text Sequence Modeling of Interviews. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 1716–1720. [Google Scholar]
  5. Haque, A.; Guo, M.; Miner, A.S.; Li, F.-F. Measuring depression symptom severity from spoken language and 3D facial expressions. arXiv 2018, arXiv:1811.08592. [Google Scholar]
  6. Lam, G.; Dongyan, H.; Lin, W. Context-aware deep learning for multi-modal depression detection. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3946–3950. [Google Scholar]
  7. Chen, T.; Hong, R.; Guo, Y.; Hao, S.; Hu, B. MS²-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection. IEEE Trans. Cybern. 2022, 53, 7749–7759. [Google Scholar] [CrossRef] [PubMed]
  8. Hu, B.; Wang, X.; Wang, X.; Song, M.; Chen, D. Survey on hypergraph learning: Algorithm classification and application analysis. J. Softw. 2022, 33, 498–523. [Google Scholar]
  9. Daros, A.R.; Ruocco, A.C.; Rule, N. Identifying mental disorder from the faces of women with borderline personality disorder. J. Nonverbal Behav. 2016, 40, 255–281. [Google Scholar] [CrossRef]
  10. Ansari, L.; Ji, S.; Chen, Q.; Cambria, E. Ensemble hybrid learning methods for automated depression detection. IEEE Trans. Comput. Soc. Syst. 2022, 10, 211–219. [Google Scholar] [CrossRef]
  11. de Melo, W.C.; Granger, E.; Lopez, M. MDN: A deep maximization-differentiation network for spatio-temporal depression detection. IEEE Trans. Affect. Comput. 2021, 14, 578–590. [Google Scholar] [CrossRef]
  12. Huang, Z.; Epps, J.; Joachim, D.; Sethu, V. Natural language processing methods for acoustic and landmark event-based features in speech-based depression detection. IEEE J. Sel. Top. Signal Process. 2019, 14, 435–448. [Google Scholar] [CrossRef]
  13. Shao, W.; You, Z.; Liang, L.; Hu, X.; Li, C.; Wang, W.; Hu, B. A multi-modal gait analysis-based detection system of the risk of depression. IEEE J. Biomed. Health Inform. 2021, 26, 4859–4868. [Google Scholar] [CrossRef] [PubMed]
  14. Yoon, J.; Kang, C.; Kim, S.; Han, J. D-vlog: Multimodal vlog dataset for depression detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; pp. 12226–12234. [Google Scholar]
  15. Shen, T.; Jia, J.; Shen, G.; Feng, F.; He, X.; Luan, H.; Tang, J.; Tiropanis, T.; Chua, T.S.; Hall, W. Cross-domain depression detection via harvesting social media. In Proceedings of the International Joint Conferences on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1611–1617. [Google Scholar]
  16. Yang, L.; Jiang, D.; Sahli, H. Integrating deep and shallow models for multi-modal depression analysis—Hybrid architectures. IEEE Trans. Affect. Comput. 2018, 12, 239–253. [Google Scholar] [CrossRef]
  17. Mao, K.; Zhang, W.; Wang, D.B.; Li, A.; Jiao, R.; Zhu, Y.; Wu, B.; Zheng, T.; Qian, L.; Lyu, W. Prediction of depression severity based on the prosodic and semantic features with bidirectional LSTM and time distributed CNN. IEEE Trans. Affect. Comput. 2022, 14, 2251–2265. [Google Scholar] [CrossRef]
  18. Zheng, W.; Yan, L.; Gou, C.; Wang, F.-Y. Graph attention model embedded with multi-modal knowledge for depression detection. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
  19. Zheng, W.; Yan, L.; Gou, C.; Zhang, Z.-C.; Zhang, J.J.; Hu, M.; Wang, F.-Y.J. Pay attention to doctor–patient dialogues: Multi-modal knowledge graph attention image-text embedding for COVID-19 diagnosis. Inf. Fusion 2021, 75, 168–185. [Google Scholar] [CrossRef] [PubMed]
  20. Niu, M.; Chen, K.; Chen, Q.; Yang, L. Hcag: A hierarchical context-aware graph attention model for depression detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4235–4239. [Google Scholar]
  21. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  22. Wan, Z.; Yang, R.; Huang, M.; Zeng, N.; Liu, X. A review on transfer learning in EEG signal analysis. Neurocomputing 2021, 421, 1–14. [Google Scholar] [CrossRef]
  23. Erzhankyzy, B. Negative-sampling word-embedding method. Neurocomputing 2022, 10, 15–21. [Google Scholar]
  24. Gao, Y.; Zhang, Z.; Lin, H.; Zhao, X.; Du, S.; Zou, C. Hypergraph learning: Methods and practices. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2548–2566. [Google Scholar]
  25. DAIC-WOZ Database & Extended DAIC Database. Available online: https://dcapswoz.ict.usc.edu/ (accessed on 14 March 2023).
  26. Baltrušaitis, T.; Robinson, P.; Morency, L.-P. Openface: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), New York, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
  27. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
  28. Ringeval, F.; Schuller, B.; Valstar, M.; Cummins, N.; Cowie, R.; Tavabi, L.; Schmitt, M.; Alisamir, S.; Amiriparian, S.; Messner, E.-M. AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 3–12. [Google Scholar]
  29. Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.-T.; Dagli, C.; Quatieri, T.F. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 11–18. [Google Scholar]
  30. Gong, Y.; Poellabauer, C. Topic modeling based multi-modal depression detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, Mountain View, CA, USA, 23–27 October 2017; pp. 69–76. [Google Scholar]
  31. Dinkel, H.; Wu, M.; Yu, K. Text-based depression detection on sparse data. arXiv 2019, arXiv:1904.05154. [Google Scholar]
  32. Han, Z.; Shang, Y.; Shao, Z.; Liu, J.; Guo, G.; Liu, T.; Ding, H.; Hu, Q. Spatial-Temporal Feature Network for Speech-Based Depression Recognition. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 308–318. [Google Scholar] [CrossRef]
  33. Kim, A.Y.; Jang, E.H.; Lee, S.-H.; Choi, K.-Y.; Park, J.G.; Shin, H.-C. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach. J. Med. Internet Res. 2023, 25, e34474. [Google Scholar] [CrossRef] [PubMed]
  34. Xu, X.; Zhang, G.; Lu, Q.; Mao, X. Multimodal Depression Recognition that Integrates Audio and Text. In Proceedings of the 2023 4th International Symposium on Computer Engineering and Intelligent Communications (ISCEIC), New York, NY, USA, 18–20 August 2023; pp. 164–170. [Google Scholar]
Figure 1. The overall framework diagram of a multimodal depression recognition method based on hypergraphs.
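For readers who want a concrete reference point for the temporal convolutional backbone of the temporal embedding module shown in Figure 1, the snippet below is a minimal PyTorch sketch of a causal, dilated 1-D convolution block in the style of the TCN of Bai et al. [21]. The class name TemporalBlock, the channel sizes, kernel size, and dropout rate are illustrative assumptions rather than the configuration used in HYNMDR.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """One causal, dilated 1-D convolution block in the TCN style of Bai et al. [21].
    Channel sizes, kernel size, and dropout are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding only -> causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)
        self.down = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                # x: (batch, channels, time)
        out = nn.functional.pad(x, (self.pad, 0))        # pad on the left only
        out = self.drop(self.relu(self.conv(out)))
        return self.relu(out + self.down(x))             # residual connection

# Example: embed an 8-sample batch of 500-step, 64-channel feature sequences
x = torch.randn(8, 64, 500)
block = TemporalBlock(64, 128, dilation=2)
print(block(x).shape)                                    # torch.Size([8, 128, 500])
```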
Figure 2. The loss function relationship between data in the temporal embedding module.
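Figure 2 depicts the relationships that the Euclidean-distance-based negative sampling loss enforces among embeddings in the temporal embedding module. The snippet below is a minimal sketch of one plausible margin-based form of such an objective, pulling an anchor toward a positive sample and pushing it away from sampled negatives; the function name, the margin value, and the exact formulation are assumptions and may differ from the loss used in the paper.

```python
import torch

def euclidean_negative_sampling_loss(anchor, positive, negatives, margin=1.0):
    """Sketch of a contrastive negative-sampling objective over Euclidean distances.
    Pulls the anchor toward the positive sample and pushes it at least `margin`
    away from each negative sample; the exact form in HYNMDR may differ.
    anchor, positive: (batch, dim); negatives: (batch, k, dim)."""
    d_pos = torch.norm(anchor - positive, dim=-1)                  # (batch,)
    d_neg = torch.norm(anchor.unsqueeze(1) - negatives, dim=-1)    # (batch, k)
    hinge = torch.clamp(margin - d_neg, min=0.0).mean(dim=1)       # penalize close negatives
    return (d_pos + hinge).mean()

# Example with random embeddings
a, p = torch.randn(16, 128), torch.randn(16, 128)
n = torch.randn(16, 5, 128)
print(euclidean_negative_sampling_loss(a, p, n))
```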
Figure 3. An example of generating adjacency matrix H based on threshold segmentation.
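Figure 3 illustrates how the matrix H is generated by threshold segmentation. As a rough illustration, the sketch below builds an incidence-style matrix by splitting the samples along each embedding dimension at a threshold θ, grouping the "high" and "low" samples of each dimension into separate hyperedges; the per-dimension min-max normalization and the function name threshold_incidence_matrix are assumptions, not the paper's exact construction rule.

```python
import numpy as np

def threshold_incidence_matrix(X, theta=0.5):
    """Build a hypergraph incidence matrix H from sample embeddings X (n_samples, n_dims).
    For every feature dimension, one hyperedge connects the samples whose min-max
    normalized value is >= theta and a second hyperedge connects the rest, so H has
    shape (n_samples, 2 * n_dims). A sketch of threshold-segmentation-based hyperedge
    construction; the rule used in HYNMDR may differ."""
    X = np.asarray(X, dtype=float)
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-8)
    high = (X_norm >= theta).astype(float)   # hyperedges of "high" samples per dimension
    low = 1.0 - high                         # hyperedges of the remaining samples
    return np.concatenate([high, low], axis=1)

H = threshold_incidence_matrix(np.random.rand(6, 4), theta=0.5)
print(H.shape)   # (6, 8)
```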
Figure 4. Depression recognition results with respect to the threshold parameter θ.
Figure 5. t-SNE visualization of sample representations on the DAIC-WOZ dataset.
Figure 6. t-SNE visualization of sample representations on the E-DAIC dataset.
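Figures 5 and 6 project the learned sample representations into two dimensions with t-SNE. The snippet below is a minimal sketch of how such a plot can be produced with scikit-learn and matplotlib, assuming the learned embeddings and binary depression labels are available as NumPy arrays; embeddings and labels here are random placeholders rather than the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for the learned sample embeddings and labels
embeddings = np.random.rand(150, 128)          # (n_samples, embedding_dim)
labels = np.random.randint(0, 2, size=150)     # 0 = non-depressed, 1 = depressed

# Project the high-dimensional embeddings to 2-D for visualization
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for lab, name in [(0, "non-depressed"), (1, "depressed")]:
    mask = labels == lab
    plt.scatter(coords[mask, 0], coords[mask, 1], s=12, label=name)
plt.legend()
plt.title("t-SNE of sample representations")
plt.show()
```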
Table 1. Comparison of methods for depression detection.
Category | Method | Data Modality | Strengths | Weaknesses
Single-Modal Depression Recognition | Daros et al. [9] | Visual | Extracts facial features related to borderline personality disorder, inferring mental health status. | Limited to a single data modality; may miss multimodal context.
 | Ansari et al. [10] | Text | Uses sentiment lexicons and DNN for cross-domain adaptation from Twitter to Weibo. | Text-only approach may miss non-verbal cues.
 | Melo et al. [11] | Visual | Maximization and differentiation modules capture smooth and sudden facial changes. | Limited to visual features; does not integrate text or audio.
 | Huang et al. [12] | Audio | Utilizes features from acoustic landmark events for depression detection. | Limited to the acoustic modality; does not integrate visual or textual features.
Multimodal Depression Recognition | Alhanai et al. [4] | Audio + Text | Sequential feature learning detects depression without manual question selection. | High resource demands of LSTM models limit real-time use in constrained settings.
 | Lam et al. [6] | Audio + Text | Topic modeling enhances detection accuracy by focusing on relevant data segments. | The model may lack generalizability in real-world settings due to reliance on structured clinical data.
 | Haque et al. [5] | Audio + Visual + Text | Causal CNNs outperform RNNs in processing long, unstructured interview sequences. | Limited applicability in real-world settings due to reliance on structured clinical data.
 | Shao et al. [13] | Visual + Text | Skeleton and silhouette data fusion boosts accuracy to 85.45% by combining complementary information. | Skeleton data outperform silhouette data, highlighting limitations of silhouette-only models.
 | Yoon et al. [14] | Audio + Visual | Demonstrates strong generalization across various datasets, including clinical ones. | Gender imbalance in the dataset may bias the model’s predictions.
 | Ansari et al. [10] | Visual + Text | The text classifier uses hybrid and ensemble approaches to enhance depression detection performance. | Computational complexity may increase with additional modalities.
 | Shen et al. [15] | Visual + Text | Enables effective knowledge transfer across platforms by addressing cultural differences between Twitter and Weibo. | Limited generalizability to other social media platforms or offline contexts.
 | Yang et al. [16] | Audio + Visual + Text | PV and HDR capture unique depression markers, improving feature representation. | Model accuracy depends on input quality, which can vary in real-world settings.
 | Mao et al. [17] | Audio + Text | Attention-based multimodal model using Bi-LSTM and CNN for audio and Bi-LSTM for text. | Unequal contributions from audio and text data may introduce prediction biases.
GNN-Based Medical Diagnostics | Zheng et al. [18,19] | Audio + Visual + Text | Multimodal self-attention network captures high-order knowledge–attention representations. | Limited to structured datasets with prior knowledge graphs.
 | Niu et al. [20] | Text + Audio | Hierarchical context-aware GNN aggregates question–answer pairs for accurate classification. | Limited modality integration beyond text and speech.
 | Chen et al. [7] | Audio + EEG | Multimodal GNN extracts cross-modal embeddings and uses an attention mechanism for representation. | Complexity may increase with additional data modalities.
Table 2. Depression recognition results.
Dataset | Model | Modality | F1 | Precision | Recall
DAIC-WOZ | SVM | A | 0.462 | 0.316 | 0.857
 |  | V | 0.500 | 0.600 | 0.428
 |  | AV | 0.500 | 0.600 | 0.428
 | DepAudioNet | A | 0.520 | 0.350 | 1.000
 | BioFeGPM | A | 0.570 | - | -
 |  | V | 0.530 | - | -
 |  | T | 0.840 | - | -
 |  | AVT | 0.810 | - | -
 | MulMol | A | 0.630 | 0.710 | 0.560
 |  | T | 0.670 | 0.570 | 0.800
 |  | AT | 0.770 | 0.710 | 0.830
 | ConaRe | AVT | 0.700 | - | -
 | BGRU | T | 0.870 | 0.850 | 0.930
 | CNN + Transformer | AVT | 0.870 | 0.910 | 0.830
 | STFN | AT | 0.769 | 0.650 | 0.920
 | MS²-GNN | AVT | 0.830 | 0.800 | 0.860
 | HYNMDR | AVT | 0.924 | 0.873 | 0.983
E-DAIC | DCNN | AT | 0.780 | 0.834 | 0.735
 | M-CBLALL | AT | 0.847 | 0.815 | 0.884
 | HYNMDR | AVT | 0.914 | 0.889 | 0.936
Table 3. Ablation experiment results for the Temporal Embedding Module.
Model | F1 | Precision | Recall
RNN | 0.640 | 0.628 | 0.652
LSTM | 0.747 | 0.710 | 0.791
GRU | 0.742 | 0.698 | 0.788
TEM | 0.783 | 0.734 | 0.833
Table 4. Ablation experiment results of hypergraph classification module.
Model | F1 | Precision | Recall
TEM-DNN | 0.780 | 0.834 | 0.735
GNN | 0.847 | 0.815 | 0.884
HGNN | 0.894 | 0.875 | 0.917
HCM | 0.911 | 0.889 | 0.936
Table 5. Modality ablation experiment results.
Model | Modality | F1 | Precision | Recall
HYNMDR | AVT | 0.911 | 0.888 | 0.936
 | A | 0.883 | 0.873 | 0.895
 | V | 0.877 | 0.881 | 0.874
 | T | 0.872 | 0.863 | 0.882
 | AV | 0.893 | 0.875 | 0.913
 | VT | 0.892 | 0.861 | 0.927
 | AT | 0.885 | 0.873 | 0.898
