1. Introduction
Depression is a disorder involving a loss of pleasure or interest in activities for long periods and is associated with sustained mood deterioration [
1]. It can affect several aspects of life, including relationships and work. According to World Health Organisation (WHO) 2023 estimates, 5% of adults worldwide (approximately 300 million people) experience depression, with women 50% more likely to experience it than men. Depression is also a significant contributor to the roughly 700,000 suicides that occur worldwide every year [
2]. Despite this, more than 75% of people in low- and middle-income countries receive no treatment due to a lack of investment in mental health, a lack of healthcare professionals and social stigma associated with mental health disorders [
2]. For those who receive treatment, antidepressant medications are often the first line of treatment. However, their effectiveness is limited: large clinical trials show that only one-third of patients achieve symptom remission [
3,
4].
Therefore, interest has grown towards approaches that supplement clinical interventions. Studies have shown that lifestyle interventions, such as better sleep hygiene [
5], practising mindfulness [
6], physical activity interventions [
7] and dietary interventions [
8,
9,
10], have promise in managing depression [
11]. Given the prevalence of devices with sensors that can be used to monitor lifestyle activities, such as smartphones and smartwatches, researchers are proposing using such devices to detect, monitor and manage depression [
12]. The use of wearable technology to supplement clinical approaches is particularly appealing as it is unobtrusive, operates in real time, is often passive (requiring little or no active input from the depressed individual/patient), offers finer granularity (more data in the same time period) and allows assessments to occur in the person’s usual environment [
13].
As changes in mood and consistently low mood are often associated with depression, studies have tried to use mood as an indicator to monitor and predict the progression of depression. Previous studies have used data from various sensors on wearable devices to either detect or predict future changes in mood. They have used GPS location [
14,
15,
16], phone- and app-usage patterns [
17,
18,
19], voice and ambient noise [
20] and motion sensor information [
Ecological Momentary Assessment [
22] has also been used to predict mood [
23,
24] in depressed individuals. These studies have focused primarily on using Machine Learning (ML) and its subtype Deep Learning (DL) models to develop predictive models owing to their excellent ability to learn associations in complex data. Moreover, other studies categorise the sensor data into activity data, sleep data, heart data or phone-usage data and then build ML- and DL-based predictive models by using them [
21,
25,
26,
27,
28,
29,
30,
31].
Nevertheless, most previous studies using ML- and DL-based predictive models have focussed on cross-sectional research, despite the limited generalisability of cross-sectional findings to larger, more representative samples [
32]. Moreover, cross-sectional works fail to account for the substantial interindividual variability in clinical response to the same treatment or behavioural recommendations for depression due to genetic, environmental, behavioural, lifestyle and interpersonal risk factors [
33,
34]. Personalised models built on longitudinal data are more suited to account for such variability. Therefore, recent works have begun focusing on personalised predictive models for depression [
16,
23,
25,
35].
Furthermore, predicting mood scores is often insufficient in a clinical setting. Most ML and DL approaches are black-box approaches, i.e., they do not show how they reached a prediction [
36]. Without explaining why a model predicts a mood score, healthcare professionals cannot determine what insights the prediction contains [
37]. These insights can then be used to check a model’s fidelity (whether the model predictions make sense) [
38] and suggest interventions that help manage the symptoms in a personalised fashion.
Recent advances in explainable Artificial Intelligence (XAI) offer solutions to the problem of trustworthiness in ML and DL models. Explainable models (we use the terms explainability and interpretability interchangeably in this work [
38]) such as Decision Trees [
36] can be easily processed/simplified to explain their outputs [
39]. However, their expressive power is limited by their size, and increasing their expressiveness decreases their interpretability. DL models can make more complex associations from multimodal data and yield better-performing models [
37,
40] but are not explainable [
36]. With the availability of post hoc explainable methods, such as Shapley Additive Explanations (SHAP) [
41] and Local Interpretable Model-agnostic Explanations (LIME) [
42], explaining performant black-box DL models has become easier [
36].
Studies such as [
43,
44,
45] use explainability techniques on ML models to obtain insights into the model outputs. Moreover, recent works have begun exploring explainability in mental health settings [
24,
46,
47,
48,
49]. However, the use of explainability has been limited to the extraction of the most influential model features/inputs using SHAP or LIME [
50]. Despite the high expressive power of DL models, the suitability of personalised models for depressive-mood prediction and the utility of explainable AI in establishing trustworthiness, the use of explainability methods in personalised DL mood-score prediction is currently lacking in the academic literature.
Therefore, this work developed a novel DL-based post hoc explainable framework for personalised mood-score prediction. The models can be used to predict current mood scores from current biophysical signals and to explain how patients’ activities affect their mood scores, suggesting to healthcare professionals and patients (for self-management) possible indicators upon which to intervene. We illustrate our approach by using an existing multimodal dataset (from [
24]) containing longitudinal Ecological Momentary Assessments (EMAs) of depression, data from wearables and neurocognitive sampling synchronised with electroencephalography for 14 mild to moderately depressed participants over one month. The work in [
24] established the possibility of applying Machine Learning to a multimodal depression dataset with personalised prediction. We significantly extend that work by making three main contributions:
A parallelised DL modelling and optimisation framework is proposed that helps train and compare multiple Multilayer Perceptron (MLP) DL models to predict participants’ mood scores, a discrete score used to assess the severity of patients’ depressive symptoms. The MLP framework exceeds the performance of 10 classical ML models.
Multiple post hoc explainable methods [
36] are combined to provide comprehensive insights into which biophysical indicators contribute most to a participant’s mood scores.
The generation and analysis of rule-based (IF–THEN) explanations for individual mood scores are presented.
2. Materials and Methods
The dataset used in this work was published previously [
24]. This dataset was gathered during a one-month study of 14 adult human subjects (mean age of 21.6 ± 2.8 years; ten female participants) before the onset of the COVID-19 pandemic.
2.1. Study Summary
Human participants were recruited to the study from the University of California San Diego College Mental Health Program [
51]. The study included participants experiencing moderate depression symptoms assessed by using the Patient Health Questionnaire (PHQ-9) scale [
52]. Participants with PHQ-9 scores greater than nine were included, with participant scores ranging between 10 and 17. While no structured interview was conducted for this study, suicidal behaviours were screened by using the Columbia Suicide Severity Rating Scale [
53]. Any participants on psychotropic medications maintained a stable dose throughout the one-month study, and no participants demonstrated suicidal behaviours during this study. The study protocol was approved by the University of California San Diego institutional review board, UCSD IRB# 180140.
The data were collected through two data-acquisition modes. First, lifestyle and physiological data were collected by using a Samsung Galaxy wristwatch (wearable) that all participants wore throughout the study, except while charging the watch for a few hours once every 2–3 days. Participants also used an application named BrainE on their iOS/Android smartphone [
54] to register their daily Ecological Momentary Assessments (EMAs) four times a day for 30 days. During each EMA, participants rated their depression and anxiety on a 7-point Likert scale (with severity increasing from 1 to 7), participated in a 30 s stress assessment and reported their diet (e.g., fatty and sugary food items consumed from a list provided and servings of coffee). Also, neurocognitive and EEG data were collected during assessments in a lab on days 1, 15 and 30 of the one-month study. Participants completed six cognitive assessment games to assess inhibitory control, interference processing, working memory, emotion bias, internal attention and reward processing. Finally, the gathered raw data, which had different sampling frequencies (seconds to minutes for the smartwatch data, hours for the EMA data and days for the neurocognitive data), were reconciled through aggregation or extrapolation to match the sampling frequency of the output variable, i.e., the depressed mood scores.
2.2. Dataset
The raw dataset contained 48 features (or predictors) for each participant. We removed three speed-based features (such as the cumulative step speed) as they were computed from noisy distance features. Of the remaining 45 features, we chose 43 input features (i.e., inputs to a model), 1 output feature (i.e., the predicted feature) and 1 feature to preserve timing information. The input features included both the smartwatch and neurocognitive-assessment data. Sixteen input features were obtained from the Samsung wearable, and the remaining twenty-seven were obtained from the neurocognitive assessments. The wearable and EMA features collected from the smartphone are presented in
Table 1. Supplementary Table S1 of [
24] describes the remaining features.
Moreover, the feature
depressed with a value between 1 and 7 was used as the output feature. The severity of the depressed mood increases from 1 to 7, with 1 indicating feeling not depressed and 7 indicating feeling severely depressed. The
datestamp feature was used to order the dataset chronologically before any data preprocessing was performed.
Table 2 contains sample information for each participant, and
Figure 1 shows the output-label distribution for each participant.
As seen from
Table 2, nine out of fourteen participants have features where some values are missing. This could be due to device error or participant behaviour (e.g., a participant may forget to wear the smartwatch for a few hours). However, there are no samples where all the feature values/data points are missing. Also, the total number of samples varies between the participants. Participants 14, 18, 21 and 29 have fewer samples, which could have a bearing on the performance of the models [
40].
Moreover, we can see from
Figure 1 that the label classes (depressed-state values) across participants are not balanced. This is expected as the participants are mild to moderately depressed, and the highest and lowest ends of the depressed mood scale (which correspond to no depression and severe depression, respectively) will be rarely represented. As this is an expected behaviour and we want the model to learn this behaviour, we do not use any methods to balance the dataset prior to training.
Furthermore, we noticed that a few participants (such as Participants 10, 15, 18, 21 and 23) had a few features with constant values, i.e., the same value repeated for each sample. This may make sense for neurocognitive-assessment features (where a participant may perform consistently on the tests) but not for features acquired through the wearable. For instance, a participant would be highly unlikely to have the same nonzero value for features like exercise calories or heart rate for 30 days. We deal with invalid and missing values in the following data-preprocessing section.
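To make the data issues above concrete, a minimal sketch (not the authors’ code) of how constant-valued and missing features could be flagged with pandas is shown below; the DataFrame `df` and the column list `wearable_cols` are hypothetical placeholders for one participant’s data.

```python
# Hypothetical sketch: profile one participant's feature matrix for missing and
# constant values; `df` is a pandas DataFrame with the 43 input features.
import pandas as pd

def profile_features(df: pd.DataFrame) -> pd.DataFrame:
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),          # missing data points per feature
        "n_unique": df.nunique(dropna=True),   # 1 => constant (possibly invalid) feature
    })
    summary["constant"] = summary["n_unique"] <= 1
    return summary

# e.g. wearable features that never change over the month (wearable_cols is hypothetical):
# print(profile_features(df[wearable_cols]).query("constant"))
```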
2.3. Data Preprocessing
As the dataset contained missing data points and invalid values, we preprocessed the data by using three data-preprocessing methods and built models for each to compare which method suited the dataset. We started with a simple data-preprocessing method and progressively increased the algorithm’s complexity.
For the first method, we used Deletion to ensure that each participant had all 43 features with no missing data points. We began by removing the participants with constant smartwatch feature values. This step eliminated Participants 10, 15, 18, 21, 23 and 24. Then, we removed the samples/rows with any missing data. This step reduced the number of samples for some participants. However, this method was the most straightforward data-preprocessing method we used and provided a good baseline against the more sophisticated data-preprocessing methods discussed next.
For the second method, we used Manual Imputation, which utilised information on the data type (discrete, continuous or neurocognitive) in a feature to impute/fill data. We removed the wearable features (data acquired from the smartwatch) where all values were constant and incorrect. Next, for features with discrete data, the missing values in a feature column were imputed with its most frequent value. In contrast, for features with continuous data, the missing values were imputed by using an iterative method that computes the missing values in each feature by considering it as a function of all other features in a round-robin manner (see Iterative Imputer in
Table 3) [
55]. Finally, we imputed the missing values in the neurocognitive features with zero, as a zero in an assessment typically implies an empty/void assessment.
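A minimal sketch of this Manual Imputation step is given below, assuming hypothetical column lists for the discrete, continuous and neurocognitive features; the scikit-learn imputers stand in for the Iterative Imputer referenced in Table 3.

```python
# Hypothetical sketch of the Manual Imputation scheme; column lists are assumed.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer

def manual_impute(df: pd.DataFrame, discrete_cols, continuous_cols, neuro_cols) -> pd.DataFrame:
    df = df.copy()
    # Discrete features: fill missing entries with the column's most frequent value.
    df[discrete_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[discrete_cols])
    # Continuous features: round-robin iterative imputation, each feature modelled
    # as a function of all the others.
    df[continuous_cols] = IterativeImputer(random_state=0).fit_transform(df[continuous_cols])
    # Neurocognitive features: zero is treated as an empty/void assessment.
    df[neuro_cols] = df[neuro_cols].fillna(0)
    return df
```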
For the third method, we employed Automatic Imputation, which automated the imputation stage. We removed the wearable features where all values were constant and incorrect. Next, we handled missing data by choosing a data-imputation method that preserved the original data distribution. Instead of manually choosing an appropriate method, we automated the process and applied seven different data-filling methods to each feature with missing values. The chosen methods are summarised in
Table 3. Finally, we compared the methods by using the distribution of the filled-in feature and the original feature vectors. For this, we used the two-sample Kolmogorov–Smirnov (KS) test, which compares two distributions by finding the maximum difference between the Cumulative Distribution Functions (CDFs) of the two distributions [
58]. As different methods were chosen for different features for every participant, we do not list the selected methods here for brevity. More information about the approaches discussed in this section can be found in
Appendix A.1.1.
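The method-selection step of the Automatic Imputation could look like the following sketch, in which `candidates` is a hypothetical mapping from imputation-method names to already-filled copies of one feature; the candidate with the smallest two-sample KS statistic against the originally observed values is retained.

```python
# Hypothetical sketch: pick the imputation that least distorts one feature's
# distribution, scored with the two-sample Kolmogorov–Smirnov statistic.
import numpy as np
from scipy.stats import ks_2samp

def best_imputation(observed: np.ndarray, candidates: dict) -> str:
    observed = observed[~np.isnan(observed)]   # keep only the originally measured values
    # Smaller KS statistic => the filled feature is closer to the original distribution.
    scores = {name: ks_2samp(observed, filled).statistic
              for name, filled in candidates.items()}
    return min(scores, key=scores.get)
```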
2.4. Model Development
Since the depression scale is ordinal, i.e., there is an order in the value of the depression/mood score, and it increases from 1 to 7, we can consider the current mood-score-prediction problem as either a regression or classification problem [
59]. Framed as regression, the goal is to develop a model that predicts values close to the actual mood scores. Framed as classification, the model treats the mood scores as seven classes and tries to predict a class from the input. We used MLP models to build the regression and classification models. Also, we used ten common classical regression and classification ML models to build baseline models against which to compare the performance of the MLP models. Moreover, we built the models for the different types of data-imputation schemes (see
Section 2.3). The MLP model-development framework is shown in
Figure 2.
2.4.1. Base Model
We built a set of base models to act as a baseline for the predictive performance of MLP models on the dataset. We trained ten common classical ML algorithms (eight of which were used in [
24]) on the three preprocessed datasets for each participant: Adaboost Regressor, Adaboost Classifier, Elasticnet Regressor, Gradient Boosting Classifier, Gradient Boosting Regressor, Poisson Regressor, Random Forest Classifier, Random Forest Regressor, Support Vector Classifier and Support Vector Regressor. Also, we used a simple grid search (as used in [
24]) to tune the hyperparameters of the models. The grid search is a brute-force method that tries all possible combinations of hyperparameters and chooses the combination that provides the best prediction performance.
Furthermore, a Stratified 5-fold Cross-Validation (CV) scheme was used to validate the model performance during and after training. This scheme divides the normalised dataset into five parts, trains a model on four parts (the training dataset) and tests on the remaining part (the testing dataset). It does so in a round-robin fashion. The division is stratified, meaning each fold contains the same proportion of the different output labels. So, for a 5-fold CV, we built five separate models (with the same architecture) on five training and test datasets. The overall performance was obtained by taking the mean of the training and test performance values over the five sets. Also, the test datasets do not overlap between the folds. This method ensures that the evaluation of the model is free of data-selection bias, which may arise when using a simple train–test split, as the performance depends on the particular split of the train and test sets.
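As an illustration (under assumed, simplified hyperparameter grids rather than those in Appendix A.3), the base-model procedure of a grid search scored with Stratified 5-fold CV and the MAE could be sketched as follows for one of the ten models.

```python
# Hypothetical sketch for one of the ten base models; the grid is simplified.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def fit_base_model(X, y):
    grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
    folds = list(StratifiedKFold(n_splits=5).split(X, y))  # folds stratified on the discrete mood scores
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid=grid,
        scoring="neg_mean_absolute_error",  # the model with the lowest mean test MAE wins
        cv=folds,
    )
    search.fit(X, y)
    return search.best_estimator_, -search.best_score_  # best model and its CV MAE
```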
For each participant, the model (out of the ten) with the lowest Mean Absolute Error (MAE) after hyperparameter tuning, irrespective of classification or regression, was chosen as the base model. More details on the grid search and the hyperparameters used for each model are provided in
Appendix A.3. Note that the base models were only used for performance comparison with the MLP models and were not used for a model-explanation comparison as the explainability of such models in a mood-prediction setting has been explored in [
24].
2.4.2. Artificial Neural Networks
Artificial Neural Networks (ANNs) are networks of artificial neurons that attempt to model the behaviour of biological neurons by using mathematical functions composed of linear computations and nonlinear functions called activations, such as
sigmoid, hyperbolic tangent (
tanh) and others [
40]. Through training, ANNs determine nonlinear relationships between a provided set of inputs and their corresponding outputs. They are often designed as networks of several layers with an input layer, a few hidden layers and an output layer in succession [
40]. Many types of ANNs exist, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), with the most basic type being a Multilayer Perceptron (MLP) network. Once trained over the data, the networks make inferences when exposed to new but statistically similar input data [
40]. This ability allows them to perform tasks such as the classification or regression of input data and language translation. MLPs are particularly well suited for tabular data and are used in this work.
2.4.3. MLP Model Architecture
The model architecture differed between the regression and the classification models. As mentioned in the previous section, we used ANNs (MLP) to build the model. While both models had an input layer, a few hidden layers and an output layer, the number of neurons in the output layer differed between the regression and classification model. As a regression model predicts a single continuous output value for each input, all regression models used only one neuron in the output layer with no activation.
On the other hand, the classification models had seven neurons corresponding to the seven classes (mood scores). Outputs from the neurons were normalised (squashed) by using a
softmax activation. These squashed values (for each neuron) lie between 0 and 1 and represent the probability of an input belonging to that class. The class corresponding to the highest probability value was taken as the output. Model hyperparameters, such as the actual number of layers, the number of neurons in each layer and the activation for each layer, were determined by a hyperparameter-optimisation algorithm described in
Section 2.4.5.
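A minimal Keras sketch of the two output-layer configurations is shown below; the hidden-layer widths and activations are placeholders, since in our framework they are set by the hyperparameter optimisation of Section 2.4.5.

```python
# Hypothetical sketch of the two MLP output configurations (hidden layers are placeholders).
import tensorflow as tf

def build_mlp(n_features: int, task: str, hidden_layers=(64, 32), activation="relu"):
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for units in hidden_layers:
        model.add(tf.keras.layers.Dense(units, activation=activation))
    if task == "regression":
        model.add(tf.keras.layers.Dense(1))                        # one linear output neuron
    else:
        model.add(tf.keras.layers.Dense(7, activation="softmax"))  # one probability per mood score
    return model
```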
2.4.4. MLP Model Training
All models were built and trained in Python by using a loss function and an optimiser. The loss function evaluates the model prediction against the actual output value and produces a numeric value based on how different the prediction and the actual values are. Moreover, the optimiser optimises/modifies the weights/parameters of the ANNs to minimise the loss.
For the classification models, we used a version of the cross-entropy loss (see Equation (1)) called the Sparse Categorical Cross-Entropy. The regression models used either the Mean Squared Error (MSE) or the MAE between the predicted and actual values as the loss function. We used a version of stochastic gradient descent called the Adam [60] optimiser to minimise the loss function of all models.

$L_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$ (1)

where $C$ is the number of classes in the data, $y_c$ is the expected output and $\hat{y}_c$ is the predicted output for class $c$.
The preprocessed dataset was time-sorted based on the timestamps and normalised before being fed into the training models. This normalisation ensures a smoother convergence of the loss function. We used the standard normalisation procedure. It centres the data around zero and gives the dataset a unit standard deviation. In this work, we standard-normalised the preprocessed data by subtracting the feature mean ($\mu$) from each feature and dividing the result by the standard deviation ($\sigma$) of the feature (see Equation (2)).

$x' = \frac{x - \mu}{\sigma}$ (2)
We used a Stratified 5-fold Cross-Validation (CV) scheme to validate the model performance during and after training, similar to the base-model evaluation. The samples in the normalised folds were then randomised and fed into the MLP models for training, i.e., the MLP models took an input of shape $(N, F)$, where $N$ is the number of input samples and $F$ is the number of features.
Moreover, we followed these steps for all MLP models built for regression and classification, irrespective of the data-imputation method. We trained each model by using batches of train data for 100 epochs, i.e., for 100 iterations of the entire training data (divided into batches). We only saved the best model across the epochs and used the early-stopping strategy, which stops the training before the epochs finish if the model’s performance does not improve for a certain number of epochs. Early stopping helps ensure that the models do not overfit the training data [
40].
Figure 2 shows the training framework.
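The per-fold training routine could be sketched as follows, reusing `build_mlp` from the previous sketch; the patience value and batch size are illustrative assumptions, and the mood-score labels are assumed to be re-indexed to 0–6 for the sparse categorical cross-entropy.

```python
# Hypothetical sketch of training one cross-validation fold.
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

def train_fold(X_train, y_train, X_val, y_val, task="classification"):
    scaler = StandardScaler().fit(X_train)     # Equation (2): (x - mu) / sigma, fitted on the training fold
    X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

    model = build_mlp(X_train.shape[1], task)  # from the previous sketch
    loss = "sparse_categorical_crossentropy" if task == "classification" else "mse"
    metrics = ["accuracy"] if task == "classification" else ["mae"]
    model.compile(optimizer="adam", loss=loss, metrics=metrics)

    # Early stopping keeps the best weights seen across the (up to) 100 epochs.
    early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=100, batch_size=32, callbacks=[early_stop], verbose=0)
    return model
```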
2.4.5. MLP Model Optimisation
It is usually challenging to infer the architecture of an ANN that gives the best possible performance, as multiple model and training parameters often influence the performance of an ANN. Instead of manually choosing and tweaking a few parameters to obtain better performance, as we did with the base models, we used an automated method. We chose multiple Evolutionary Algorithm (EA)-based algorithms and stochastic algorithms to optimise the main model and training parameters (called hyperparameters in ML parlance) for a better prediction performance.
We used eight different EA and statistical methods to optimise the number of hidden layers in the model, the number of neurons in the input layer, the activation of the hidden layers and the training batch size. We optimised the number of neurons in the input layer but did not optimise the neurons in each layer as that would increase the number of optimisation variables. Increasing the number of optimisation variables increases the optimisation space, making the optimisation problem more difficult. Instead, we linearly interpolated the neurons in the hidden layers by using the number of neurons in the input layer and the number of neurons in the output layer (which depends on whether the model is classification or regression). The eight EA methods we used and the parameters we modified are mentioned in
Table 4. We use N.A. wherever the default optimisation parameters were used.
Table 5 also contains upper and lower limits for each hyperparameter used during the optimisation.
Figure 3 shows the optimisation schematic. The hyperparameter optimisation existed as an outer loop to the inner loop of model training (which optimises the model weights). The optimisation repeated over 100 iterations, during which the model was trained by using the 5-fold CV procedure. We averaged the model performance over the five folds and used that as the performance metric to optimise the hyperparameters. We used the metrics F1-score and balanced-accuracy to optimise the classification models and used the MSE and MAE to optimise the regression models. These metrics served as indicators of the model performance and guided the optimisation process towards a set of hyperparameters that provided the best model performance.
At the end of the 100 iterations, we took the best models, i.e., models with the best mean 5-fold performance, from each method and found the best among the eight best (one for each optimisation method) models as well. We used the same metrics to optimise the hyperparameters and find the best models. The performance of these best models was then taken as the best for a particular combination of the optimisation metric, problem type and data-imputation method.
The optimisation procedure was entirely parallelised, and the number of parallel processes was determined by the number of cores in the system used for optimisation and training. We ran the optimisation (and training) in a Docker container containing all the required Python libraries, such as TensorFlow–Keras (for training the models) [
66] and Nevergrad (for the optimisation) [
65]. Parallelisation significantly reduced the optimisation time, making the procedure scalable to a high number of optimisation iterations.
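A simplified sketch of the outer optimisation loop with Nevergrad is given below; the search space, budget and the `train_and_score` objective (assumed to run the 5-fold CV training and return the mean validation error) are stand-ins for the settings listed in Tables 4 and 5.

```python
# Hypothetical sketch of the outer hyperparameter-optimisation loop with Nevergrad.
import nevergrad as ng

parametrization = ng.p.Instrumentation(
    n_hidden=ng.p.Scalar(lower=1, upper=5).set_integer_casting(),          # number of hidden layers
    n_input_neurons=ng.p.Scalar(lower=8, upper=128).set_integer_casting(), # width of the input layer
    activation=ng.p.Choice(["relu", "tanh", "sigmoid"]),                   # hidden-layer activation
    batch_size=ng.p.Choice([16, 32, 64]),                                  # training batch size
)

# TwoPointsDE is one of several evolutionary optimisers available in Nevergrad;
# the paper compares eight such methods over a budget of 100 iterations.
optimizer = ng.optimizers.TwoPointsDE(parametrization=parametrization, budget=100, num_workers=4)

# `train_and_score` is assumed to run the 5-fold CV training and return the mean
# validation error (e.g., the MAE) for the given hyperparameters.
# recommendation = optimizer.minimize(lambda **hp: train_and_score(**hp))
# best_hyperparameters = recommendation.kwargs
```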
2.5. MLP Model Evaluation and Explanation
We gathered one best model for each combination of problem type (classification or regression), data-preprocessing methodology (3 methods) and metric (2 metrics) used for hyperparameter optimisation (i.e., 12 combinations). Because the MAE and the Mean Absolute Percentage Error (MAPE) are the only performance metrics shared by the regression and classification models, we used the MAE to find the best overall model, as it measures absolute rather than relative error (unlike the MAPE).
Hence, for each participant, we collected the best models from the 12 combinations, found the model with the lowest test MAE and used it as the overall best-optimised model. This overall best model was the final model for the participant, and we used this to extract indicators/features that were important as well as to explain how those features affect mood. To this end, we used three post hoc explainability methods from the explainable AI (XAI) literature [
36]. We used Shapley Additive Explanations (SHAP) [
41], Accumulated Local Effects (ALE) plots [
67] and Anchors [
68].
SHAP explains a prediction (a single prediction) of a data instance by computing the contribution of each feature to the prediction and provides a linear approximation to the Shapley values. The values are computed in relation to the average model prediction. Thus, a SHAP value of −0.2 for a feature in a sample means that this feature’s contribution lowers the model prediction by 0.2 relative to the average prediction. Here, we used SHAP to find the top five important features of each participant. We obtained one SHAP value per data instance per feature, and to compute the global feature importance for a model and a dataset, we took the mean of the absolute SHAP values over all instances in the dataset to obtain the overall SHAP value of a feature. Also, when computing SHAP values, we focused only on features acquired by using wearables and EMAs, as our focus was on finding interventions that could be implemented in a depressed individual’s personal environment, such as at home. Also, as we used a 5-fold Cross-Validation approach, we found the SHAP values for each fold and averaged them.
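A minimal sketch of this SHAP step, assuming a model-agnostic KernelExplainer, a hypothetical `predict_fn` and a small background sample, is shown below; for classification models SHAP returns one array per class, whereas a single-output (regression-style) prediction is assumed here.

```python
# Hypothetical sketch: global feature importance as the mean |SHAP| value over the test fold.
import numpy as np
import shap

def top_shap_features(predict_fn, X_background, X_test, feature_names, k=5):
    explainer = shap.KernelExplainer(predict_fn, shap.sample(X_background, 50))
    shap_values = explainer.shap_values(X_test)        # one value per instance per feature
    importance = np.abs(shap_values).mean(axis=0)      # mean absolute SHAP value per feature
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order[:k]]
```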
Furthermore, ALE plots describe how certain features influence the model prediction, and their value can be interpreted as the main effect of the feature at a certain value compared to the average prediction of the data. ALE works well even when features are correlated and is well suited for our moderately correlated dataset (see plot
Figure A1). In this work, we used ALE plots to find how the top-five important features obtained through SHAP influence the model prediction. The plots show how the feature effects on the prediction vary with the value of the feature. This gives us an idea of whether a feature’s increase (or decrease) leads to a corresponding increase (or decrease) in the model prediction compared to the average prediction. As before, we used the test dataset to compute the ALE value for each fold and found the overall ALE value by taking the mean of the ALE values obtained for the five folds.
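The ALE computation could be sketched as follows by using the `alibi` library (one of several ALE implementations; the paper does not prescribe a specific one), with `predict_fn` operating on un-normalised inputs and `top_feature_idx` holding the indices of the top-five SHAP features.

```python
# Hypothetical sketch of ALE curves for the top SHAP features, using alibi.
from alibi.explainers import ALE, plot_ale

def ale_for_top_features(predict_fn, X_test, feature_names, top_feature_idx):
    ale = ALE(predict_fn, feature_names=feature_names, target_names=["mood score"])
    explanation = ale.explain(X_test)                  # ALE values per feature (computed per fold in the paper)
    plot_ale(explanation, features=top_feature_idx)    # e.g. indices of the top-five SHAP features
    return explanation
```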
Finally, Anchors explain a prediction for a data instance of any black-box classification model by finding an IF–THEN decision rule that sufficiently anchors the prediction. A rule is said to anchor a prediction if changes in the other features do not affect the prediction. Moreover, the method includes a notion of coverage, stating to which other, possibly unseen, instances an anchor applies. We used Anchors to show how specific predictions from the classification models could be explained in a rule-based manner. This makes Anchors a good candidate for explaining anomalous changes in mood. Furthermore, to produce comprehensive rules, we considered all features, including the neurocognitive-assessment features.
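A minimal sketch of generating such a rule with alibi’s AnchorTabular (an assumed implementation choice) is shown below; `predict_fn` is the trained classifier’s prediction function and `instance` is a single un-normalised feature row whose mood-score prediction we want to explain.

```python
# Hypothetical sketch of a rule-based explanation for one prediction, using alibi's AnchorTabular.
from alibi.explainers import AnchorTabular

def anchor_rule(predict_fn, X_train, feature_names, instance):
    explainer = AnchorTabular(predict_fn, feature_names=feature_names)
    explainer.fit(X_train)                                     # learns the feature quantiles used in the rules
    explanation = explainer.explain(instance, threshold=0.95)  # e.g. IF anxious > 4.0 AND coffee <= 1.0 THEN class 5
    return explanation.anchor, explanation.precision, explanation.coverage
```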
Additionally, all post hoc explainability methods take the un-normalised data as input (the data are internally normalised before being fed into the models). Using un-normalised data ensures that the explanations are produced in the actual data range, which makes them easier to interpret.
Appendix A.4 contains additional details about the explainability approaches used.
4. Discussion and Conclusions
Depression affects a large population worldwide and has a substantial global healthcare burden [
2,
69]. With the number of sensor-equipped devices around us, we generate significant amounts of personal data. The recent literature has focussed on using data-driven methodologies to create predictive models for depression [
14,
15,
16,
17,
18,
19,
21,
25,
26,
27,
28,
29,
30,
31]. With the variability seen among depressed people [
33,
34], personalised predictive models have been suggested in recent years [
11].
Hence, we proposed a novel explainable framework to utilise multimodal data to build personalised and explainable Deep Learning (DL) models for people experiencing depression. To illustrate the framework, we used a dataset with 14 mild to moderately depressed participants from a previously published work [
24]. The dataset, collected over one month, contained activity data from a smartwatch, diet and mood-assessment reports from Ecological Momentary Assessments (EMAs) and neurocognitive data from in-person sessions. We preprocessed the raw data through multiple data-imputation schemes and trained both classification- and regression-based MLP (Multilayer Perceptron) models to produce predictions of mood scores—a discrete score based on the severity of their depressive symptoms.
The models were optimised through eight Evolutionary and Statistical optimisation algorithms to find the hyperparameters offering the best model performance, evaluated by using a five-fold Cross-Validation training routine to obtain a robust estimate of the model performance. We compared this performance against ten classical ML-based baseline models and showed that the MLP models outperformed the baseline models. The best-performing MLP models were further analysed by using SHAP (Shapley Additive Explanations) [
41] and ALE (Accumulated Local Effects) [
67] plots to extract the top features/indicators that influenced the model and reveal the associations between the top feature indicators and depression. Moreover, we demonstrated how rule-based explanations predicated on features could be generated from the models by using Anchors [
68]. Such explanations can potentially guide clinical or self-management interventions for depression.
Our work differs from previous research on explainable depression modelling through mood-score prediction in many ways. Works like [
35,
70] perform an analysis on cross-sectional datasets, whereas we use a longitudinal dataset. Moreover, most studies, like [
21,
25], employ simple ML to build predictive models for depression, and studies that employ DL, like [
26,
28], do not employ a parallel, multiple Evolutionary Algorithm-based optimisation scheme to optimise the model hyperparameters. Furthermore, most studies like [
46,
47,
48,
49] use explainability to develop population-level explanations of various mental health disorders, while our work produces personalised insights.
The work by Shah et al. [
24] shares the most similarity with ours: it develops personalised mood-prediction models on the same dataset and uses methods from the explainable AI literature [36]. However, it uses explainability primarily to extract the features (using SHAP) that have the most influence on the model’s prediction of the mood/depression score. We extend their work by using Accumulated Local Effects (ALE) plots to show how changes in the value of such features influence the model’s prediction of mood scores. We pipeline SHAP and ALE to show how the top wearable- and EMA-based features affect mood scores. We focus on these features because their trends allow one to suggest lifestyle interventions, such as those based on diet and activity, which can be monitored comfortably in real time in a person’s usual environment. We further generate rule-based (IF–THEN) explanations, using Anchors, for instances showing sudden changes in the mood score (an increase or decrease in depression). These rule-based explanations include bounds on features, which can be used to quantify interventions based on those features.
In general, we found that the MLP models were better able to learn the relationships between the input features and the mood score. Although the model-hyperparameter search methodology differed between the MLP and the base models, the results indicate how powerful MLP models (a comparatively simple DL method) are at learning meaningful representations of mood scores from digital data. Moreover, our results on the top-five features for individuals differed slightly from those in [
24] due to the differences in model types, data preprocessing and feature design. Interestingly, the results for the population-level top-five features were similar to those in [
24], with diet- and anxiety-related features being the most frequent top-five features.
We also found that SHAP and ALE plots had the potential to help clinicians find the most influential features/indicators for intervention and how their values influence the mood score. Moreover, human-readable rules from Anchors could help clinicians obtain a quantitative estimate of feature limits (range) for individual predictions of mood scores in depressed individuals. By observing the feature ranges in the rules over time, a clinician could advise interventions focussed on certain activities and food items. The numerical bounds in the rules should help determine the limits for such interventions.
The overall framework presented in this work can be extended to other kinds of modelling approaches, data types and optimisation schemes; however, the results presented in this study are limited by the quality and quantity of the dataset and by some shortcomings of the explainability approaches used. The dataset for some participants had missing and invalid data. Even though the data-imputation schemes handled both issues, a complete dataset for those participants could have yielded more performant models and better explanations.
Moreover, one of the pitfalls when analysing models by using explainable methods is the need to distinguish association from causation. All explainability methods discussed here provide information only on association, not causation. For instance, if an increase in the feature anxious is seen to increase the mood score in an ALE plot, we cannot say that being more anxious causes a participant to be more depressed. It could be the case that an increase in depression causes an increase in anxiety for that participant. Therefore, all we can say is that an increase in anxiety is associated with an increase in depression.
Furthermore, explainability methods are model-based, and the explanations produced are explanations for the model and not the underlying data distribution. This implies that if the model is poor, the explanations produced by using the model will not be reliable either. Thus, for the two participants (Participant 19 and Participant 21), where our models had a high MAPE value, the explanations (important features and feature trends) may be unreliable. Also, there may be instances where the explanations obtained from one of the three methods discussed in this work may seem counterintuitive. In such instances, we propose validating the results through the remaining two explainability methods.
Personalised models of depression built by using wearable and other relevant data provide an opportunity for personalised treatment approaches, as long as data of good quality and quantity are available and the pitfalls associated with model-explainability methods are understood. Accurate, personalised models and the explanations generated from them can help build associations between individual activities and depression severity, assisting medical professionals and patients in managing depression through targeted interventions. This work presents a framework to achieve this. In the future, a combination of cross-sectional and longitudinal methodologies could solve the data-quantity problem. Also, work on incorporating other modalities of data, such as speech and facial emotions, and different kinds of models, such as time-aware DL models, could improve the predictive models further.