1. Introduction
Since the end of the 20th century, mental health and well-being have become the new driving forces of psychology. Positive psychology refers not to the treatment of mental illnesses but to the exploration and nurturing of the elements that contribute to human fulfillment [
1]. Indeed, research has shown that having a sense of well-being can lead to positive outcomes in life, including improved health, flourishing relationships, and better academic performance [
2], and, in organizations, to increased productivity, collaboration, and customer satisfaction, as well as reduced turnover [
3,
4]. Thus, understanding and promoting individual well-being is essential to the health of the workforce and the long-term success of an organization. However, despite these benefits, identifying individual well-being in the case of collaboration within a co-located team can prove challenging [
5]. In addition, most current tools for assessing subjective well-being rely on time-consuming surveys and questionnaires, which limit the possibility of providing real-time feedback necessary to raise awareness and change individual behavior [
6]. Since non-verbal communication, mostly visual cues [
7,
8], offers a precious and non-intrusive way to gather emotional and cognitive information on an individual’s state of mind [
9,
10,
11], the aim of this study is to understand the non-verbal communication process in teamwork, using video data to identify significant predictors of individual well-being in teamwork. We address the three following research questions:
RQ1: Which features of videos taken in a team setting will be predictive of individual and team well-being measured with PERMA (Positive Emotion, Engagement, Relationships, Meaning, and Accomplishments) surveys?
RQ2: How can the relevance of attributes for predicting individual well-being in a collaborative work context be measured?
RQ3: How can theories and hypotheses relevant to positive psychology be derived from AI-driven team video analysis?
Answering these questions will help experts from sociology and psychology elaborate new theories and hypotheses based on large amounts of in-the-wild data representative of the full diversity of human behavior. Among other things, this information will be useful for organizing more effective and collaborative teamwork sessions. It could also help promote policies that favor individual well-being, thereby increasing employee happiness and retention in companies.
The main contribution of this paper is the development of a framework for understanding the process of non-verbal communication in teamwork, using video data to identify significant predictors of individual well-being. The framework relies on video acquisition technologies and state-of-the-art artificial intelligence tools to extract individual, relative, and environmental characteristics from panoramic video.
In the following, a brief overview of the non-verbal communication and well-being data analysis research will be carried out in
Section 2. The proposed framework to extract relevant features of non-verbal communication and well-being analysis will be presented in
Section 3. The experiment developed to test this framework as well as the results obtained will be presented in
Section 4. These results will be discussed in
Section 5. This will lead to the conclusions in
Section 6, about significant predictors of individual well-being in teamwork as well as to possible directions for future research.
2. Related Work
2.1. PERMA and the Notion of Well-Being
Striving for happiness has been a primary goal of humanity going back to antiquity. Measuring happiness and lifetime satisfaction has become an active area of research over the last century. The benefits of well-being as the overall state of an individual’s happiness, health, and comfort [
12] are widely recognized for individuals, organizations, and society as a whole [
2,
3,
4]. Positive psychology is the branch of psychology concerned with the notion of well-being, as it explores and nurtures the elements that contribute to human flourishing [
1]. Providing a holistic view of well-being, one of the leading figures of the positive psychology movement, Seligman [
13] proposed the PERMA model. Based on the well-being theory established by Forgeard et al. [
14], the PERMA model decomposes well-being into five pillars, described as the level of pleasant emotions experienced, such as happiness and joy (
Positive emotions) [
13], the level of absorption experienced during an activity (
Engagement), the degree of connection with other individuals (
Relationships) [
15], the degree to which the individual finds meaning in life (
Meaning), and finally, the level of realization of one’s full potential (
Accomplishment) [
16]. Based on the model of Seligman [
13], a number of PERMA measurement tools have been proposed for general assessments [
15] or more work-related environments [
16,
17]. The PERMA+4 framework proposed by Donaldson et al. [
17] represents a lean tool specifically tailored to the working environment, allowing survey time to be reduced. The speed of data collection provided by this method is a considerable advantage over other methods, since it simplifies the collection of a dataset large enough to enable data-based analysis of individual well-being in a collaborative work context.
Although surveys like the ones listed above are subjective, until the availability of video analysis and AI it was not possible to collect personal happiness measurements in other ways. However, researchers have used other individual parameters as proxies that have been shown to be good predictors of happiness, such as longevity, social network ties, and attitude [
18]. In his happiness research, Mihaly Csikszentmihalyi [
19] introduced experience-based sampling, using devices that asked respondents, at random points during the day, to enter their happiness into a survey. In more recent research, the happimeter, a smartwatch-based solution that measures happiness based on body language, was developed [
6]. While some participants in the research described in this paper were wearing the happimeter, this was not part of the research design for this framework.
2.2. Team Collaboration and Well-Being Data Analysis
Until recently, the only way to collect well-being and happiness ratings was using surveys. Ekman [
10] introduced a facial rating system to measure different emotions, which was initially used to manually label the emotions of faces in videos. In the last few years, machine learning models that measure human emotions have greatly accelerated research in this area. The most accurate results have been achieved using multimodal emotion analysis combining video, audio, and text inputs. For instance, combined models have been built by [
20]. The drawback of the multimodal method is the amount of work necessary to collect the different input channels, which makes it rather cumbersome. This is why in this research we focus on video analysis.
Various approaches for data collection in teamwork environments are widely available in the literature. Online settings have been used to measure emotional conditions or engagement in e-sports teams [
21] and student groups [
22,
23], respectively. One of the advantages of the online setting is that it limits the need for data preparation since the records of each individual are already disentangled. Other studies focus on measuring team behavior in a co-located environment within surgical teams [
24,
25,
26] and laboratory teams [
27,
28,
29,
30], working in highly controlled environments. While [
27,
28,
29,
30] used multimodal frameworks, whereas Guerlain et al. [24] and Ivarsson and Åberg [25] used audiovisual data. Stefanini et al. [
26] used sociometric badges developed by Kim et al. [
31] to extract behavioral features such as mutual gaze, interpersonal distance, and movement patterns.
All the research mentioned above uses data from highly controlled environments compared to in-the-wild data collected in real-world conditions, outside of a controlled environment, with multiple teams working in parallel.
While the examples listed above use sensors to measure interpersonal interaction, most teamwork is studied through surveys, which makes analyzing well-being in collaborative work all the more complex as surveys are generally time-consuming and intrusive [
32].
3. Methods
To understand the non-verbal communication process in teams, we propose to use video data to identify significant predictors of individual well-being in teamwork. Towards this goal, a two-step facial-analysis-system (FAS), illustrated in
Figure 1 and detailed below, has been developed. It leverages state-of-the-art deep learning technologies to combine a
multi-face tracking approach and a
multi-task feature extraction.
We start by recording a video of the team members during the entire period they interact with each other, with a 360-degree camera pointed at their faces. This video is then preprocessed and cleaned. In the next step, multi-face tracking, the faces are detected and tagged with anonymous identifiers, thus preserving individual anonymity. In the final step, multi-task feature extraction, the 3D gaze pattern estimation computes whether people are looking at each other; their facial emotions are also computed, as are the upper body posture (as people mostly sit around a table) and the image brightness. This process is described in detail below.
3.1. Data Presentation
To test the proposed FAS, video data was collected. To do so, an experiment was conducted over three days with 20 co-located on-site teams, each composed of 4 master’s students. During those teamwork sessions, participants were asked to work on a team project composed of different tasks such as project design and stakeholder analysis. The study only includes data from the 56 students who signed the informed consent form. Its purpose is to record non-verbal dynamics during collaborative teamwork in order to understand the non-verbal communication process, using video data to identify significant predictors of individual well-being in teamwork.
The experimental setup represented in
Figure 2 has been replicated on each of the 20 team’s tables.
As shown in
Figure 2, the four participants in each team are placed on opposite sides of the table, in pairs, facing each other. A wide-angle camera [
33] is placed at the exact center of the table (in both x and y directions) to record the 1.5 h of daily teamwork. The camera is stacked on top of a mini-PC and connected to it via USB to minimize the size and intrusiveness of the measurement setup. Finally, to reduce visual background noise, whiteboards topped with folding partitions were placed between adjacent tables.
The acquisition of full panoramic scenes allows the analysis of non-verbal cues such as 3D gaze pattern estimation. The structure selected for recording is a stack of two 180-degree images. Participants on either side of the table are systematically observed on the top or bottom image, respectively. This arrangement facilitates subsequent analysis of the video data by the FAS.
The final data collection and cleaning resulted in approximately 93 h of video data, stored as MP4 files, for the 20 teams analyzed over the three days of observation. This corresponds, on average, to 4.5 h of video data per team, i.e., 1.5 h per team per day, and was taken as the data source for the subsequent well-being analysis.
The video data collected had to be labeled with well-being attributes in order to be used to analyze participants’ well-being. For this reason, participants were asked to complete a PERMA+4 questionnaire at the end of each work session to assess their level of well-being according to the different pillars designated by the PERMA framework.
The PERMA data collected resulted in 104 data points from the 56 study participants over the three days. These data points are used as ground truth for training the machine learning model with the video data collected with the proposed FAS detailed below.
3.2. Multi-Face Tracking
Each video is analyzed to determine the respective trajectory of each face present in the recording, using a
multi-face tracking approach. All faces present in a single video frame are detected and embedded using the
RetinaFace model [
34] and the
ArcFace model [
35], respectively. The
RetinaFace model detects a set of faces F in a given frame. Each face in F is transformed into a lower-dimensional face embedding in E using ArcFace for greater computational efficiency. Finally, an ID database is generated by clustering a sample of frames from the video based on the number of individuals per team. It is then used to identify and track each individual in the video through face identification. The challenge of re-identification—the process of correctly identifying person identities across video frames—is tackled by calculating the cosine distances between the preprocessed face templates of the ID database and the detected face embeddings E. Then, the Hungarian algorithm [
36] is used to solve the assignment problem. This approach allows efficient tracking of multiple faces in a video stream; no tracking algorithm in the traditional sense is implemented, as the focus is on facial attributes.
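As an illustration, the matching between stored identity templates and the embeddings detected in one frame can be sketched as follows (a minimal sketch; the function and variable names are illustrative, and SciPy's `linear_sum_assignment`, a Hungarian-style solver, stands in for the assignment step):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(templates, embeddings):
    """Match detected face embeddings to stored ID templates.

    templates:  (n_ids, d) array of preprocessed face templates.
    embeddings: (n_faces, d) array of embeddings detected in one frame.
    Returns a list of (template_index, face_index) pairs.
    """
    # Cosine distance matrix: 1 - cosine similarity of unit-normalized rows.
    t = templates / np.linalg.norm(templates, axis=1, keepdims=True)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cost = 1.0 - t @ e.T
    # Solve the optimal assignment (Hungarian-style algorithm).
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```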
3.3. Multi-Task Feature Extraction
After the face of each member is identified, the second step of the proposed FAS, the multi-task feature extraction, is employed on the detected faces F to extract features for the subsequent well-being analysis. Four direct features are extracted.
Face emotion recognition (FER) is used to identify and classify human emotions based on facial expressions using the
residual masking network [
37], which performs state-of-the-art analysis on the FER2013 dataset to estimate the six Ekman emotions [
10] plus an added “neutral” emotion for increased machine learning accuracy. Face alignment is not explicitly employed in this methodology to prevent potential information loss or artifacts.
The body landmarks are based on the face-center position, while the gaze estimation evaluates who is looking at whom in a panoramic scene. The approach is based on 3D head pose and facial landmark estimation to identify where a person is looking. Specifically,
SynergyNet [
38] is used to estimate the full 3D facial geometry. The head poses and facial landmarks are first spatially transformed to reconstruct the original 3D scene. Then, a visibility algorithm is employed to detect gaze exchanges among individuals. To do so, the human field of view (FOV) angle for the 3D gaze pattern estimation has to be set to a specific angle. The number of gaze exchanges is captured in a gaze matrix populated over the duration of the video stream and illustrated in
Figure 3.
Finally, the brightness of the image is extracted directly from the video, reflecting an environmental characteristic. Each team member is assigned the perceived image brightness calculated across all images using the root mean square (RMS) described in Equation (
1). It weighs the contributions of the red (
R), green (
G), and blue (
B) channels to take into account the heterogeneity of human perception [
39].
While the face emotion recognition and body landmarks are specific to each individual, the gaze patterns are relative since they result from interactions between team members. Those direct features are used to extract derivative features valuable for the machine learning models and summarized in
Table 1.
The emotion recognition data include details about the emotional and affective states of every team member. The time series for each of Ekman’s six basic emotions plus “neutral”, alongside the distribution of each emotion (
Max Emotion) and the frequency of changes in emotion (
Freq Emotion changes), are extracted. The Body Landmarks data provide the position of the head centers of individuals using the standard deviation of the 2D kernel density data distributions in the X and Y directions. They express the spatial extent to which the individual moved during the analyzed video. From these data, the velocity of the head’s movement is extracted as a time series by calculating the difference in position between two consecutive frames. Additionally, the presence feature represents the percentage of frames an individual is identified in. The level of brightness is directly extracted from the video as a time series. Finally, the 3D gaze pattern estimation is used to generate interaction matrices and extract social network metrics. The gaze matrix, illustrated in
Figure 3, is computed by counting the number of times each individual looks at a team member. This asymmetrical matrix is transformed into two symmetrical matrices, the gaze difference matrix and the mutual gaze matrix. The first represents the difference between the total gazes emitted by person
i to person
j and the reciprocal, while the second only incorporates entries where two participants look at each other simultaneously. Features are extracted from those three matrices using 8 basic statistics: mean, standard deviation, median, max, min, slope, 75th percentile, and 25th percentile. Social network analysis of the gaze matrix allows us to extract in-degree and out-degree centrality for each individual.
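The derivation of the two symmetric matrices and the degree centralities from the asymmetric gaze matrix can be sketched as follows (mutual gaze is approximated here by the elementwise minimum, a simplifying assumption; the paper detects simultaneous looks directly from the time-stamped data):

```python
import numpy as np

def gaze_features(gaze):
    """Derive symmetric matrices and centralities from an asymmetric
    gaze matrix, where gaze[i, j] = number of times i looked at j."""
    diff = np.abs(gaze - gaze.T)        # gaze difference matrix
    mutual = np.minimum(gaze, gaze.T)   # mutual gaze matrix (approximation)
    out_degree = gaze.sum(axis=1)       # gazes emitted by each individual
    in_degree = gaze.sum(axis=0)        # gazes received by each individual
    return diff, mutual, out_degree, in_degree
```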
Linear interpolation is used to fill in missing numerical data while a rolling average with a time-series-specific window is used to smooth noise.
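The gap-filling and smoothing step can be sketched with pandas (the window length is series-specific; the value used here is illustrative):

```python
import pandas as pd

def clean_series(values, window):
    """Fill missing values by linear interpolation, then smooth the
    result with a rolling average of the given window length."""
    s = pd.Series(values, dtype=float)
    s = s.interpolate(method="linear", limit_direction="both")
    return s.rolling(window, min_periods=1).mean()
```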
The result of the proposed FAS is a dataset of 125 features generated using, once again, the 8 basic statistical features to describe each time series (mean, standard deviation, median, max, min, slope, 75th percentile, and 25th percentile).
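The 8 basic statistics used to summarize each time series can be sketched as follows (computing the slope as the coefficient of a linear fit over time is an assumption):

```python
import numpy as np

def eight_stats(series):
    """The 8 basic statistics describing one feature time series."""
    x = np.asarray(series, dtype=float)
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0]  # slope of a linear fit over time
    return {
        "mean": x.mean(), "std": x.std(), "median": np.median(x),
        "max": x.max(), "min": x.min(), "slope": slope,
        "p75": np.percentile(x, 75), "p25": np.percentile(x, 25),
    }
```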
4. Results
4.1. Data Collection
To test the proposed framework, the following experiment was conducted. The experiment was based on the exploitation of panoramic video files of work teams and PERMA survey forms completed by each individual at the end of filmed work sessions. Based on the work of [
40], audio and video data were collected simultaneously in distributed teams.
The results of each question of the PERMA+4 survey by Donaldson et al. [
17] were averaged by pillar in order to obtain a dataset Y of five target variables representing the five pillars of the PERMA model for each individual in each video file.
Figure 4 summarizes the experiment in which the proposed framework is implemented to better understand non-verbal communication processes in teamwork, using video data to identify significant predictors of individual well-being. The panoramic video files are formatted and linked to the PERMA surveys in the Data preparation phase (green). The panoramic video data collected with the 360-degree camera are fed into a data preparation system, which identifies the faces, the exchanges of gazes among people, and the other features that are then used for training the machine learning system with regression and classification models in the Data analysis phase (yellow), in order to obtain a prediction and classification of individual well-being. The PERMA survey is used as the dependent variable for training the system. Two types of machine learning, regression and classification, are tried to identify the best approach. Finally, SHAP values are computed to identify the most relevant features.
The explainability of the prediction and classification by the identification of significant predictors is provided in the Feature importance phase (blue) by the computation of SHAP values.
Each of these phases will now be described in detail.
4.2. Data Preparation
The first phase is the Data preparation. The panoramic video files are preprocessed to extract pertinent information usable by the machine learning models.
First, the proposed FAS presented in
Section 3 is used to generate the dataset of features related to each individual in each video.
It extracts multiple initial features directly from the video stream in a time series structure as summarized in
Table 1.
The human field of view (FOV) angle for 3D gaze pattern estimation is set to 60°. A window size of 30 s is chosen for the rolling average on face emotion to reduce noise.
Then, each record is linked to the associated PERMA labels in Y. The PERMA data in the Y dataset are preprocessed to handle missing values and outliers, and both the feature dataset and the Y dataset are normalized for use in the machine learning models. Thus, in the Data preprocessing step, all records linked to a missing value or to a constant value throughout all the pillars of the PERMA survey are removed. The PERMA variables are normalized using min-max normalization, while the dataset features are normalized using standard or robust scaling depending on their distribution [
41].
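These normalization choices can be sketched as follows (the skewness-based decision rule and its threshold are illustrative assumptions; the actual rule follows [41]):

```python
import numpy as np
from scipy import stats

def minmax(y):
    """Min-max normalization, as used for the PERMA variables."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())

def scale_feature(x, skew_threshold=1.0):
    """Standard-scale roughly symmetric features; robust-scale skewed ones.

    The skewness threshold is an illustrative assumption.
    """
    x = np.asarray(x, dtype=float)
    if abs(stats.skew(x)) > skew_threshold:
        iqr = np.percentile(x, 75) - np.percentile(x, 25)
        return (x - np.median(x)) / iqr   # robust scaling
    return (x - x.mean()) / x.std()       # standard scaling
```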
The PERMA variables contained in the
Y dataset are continuous variables. Regression is therefore the most straightforward data analysis model. However, it may also be useful to classify each variable into binary categories (high- or low-level), as this aligns with the overall goal of the research. Classification metrics offer more intelligible performance scores than regression metrics [
42]. Thus, a new, binarized dataset is generated by discretizing the Y dataset. The discretization is carried out by applying a median threshold to each dimension of Y for binary classification. In order to reduce the complexity of the methodology and provide interpretable results, each targeted variable present in Y and in its binarized counterpart is analyzed independently, as a univariate problem.
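The median-threshold discretization can be sketched as follows (whether ties fall into the high or low class is an assumption):

```python
import numpy as np

def binarize_perma(Y):
    """Discretize each PERMA pillar into high/low via a median threshold.

    Y: (n_samples, 5) array of continuous pillar scores.
    Returns a 0/1 array of the same shape (1 = at or above the median).
    """
    medians = np.median(Y, axis=0)
    return (Y >= medians).astype(int)
```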
To further limit the complexity of the models and comply with Occam’s razor principle, the extracted features are then selected in the Feature selection step to generate the X dataset. The attribute selection method is preferred to the dimensionality reduction method for reasons of interpretability of the results [43]. To perform feature selection only within the training set and prevent data leakage, the feature dataset, the Y dataset, and its binarized counterpart are each divided into a training set and a test set representing 80% and 20% of the total dataset, respectively. Then, a voting strategy among the filters presented in
Table 2 is defined for feature selection. Those filters are chosen since they are relatively computationally efficient and model-agnostic.
Sets of features are evaluated for each target variable by the voting system using Equation (2),

score(S) = Σ_j w_j · s_j(S),   (2)

where S represents the set of features considered, j the filter ID, s_j(S) the ensemble scores from filter j for all features in S, and, finally, w_j the weight given to filter j based on the importance of the filter to the issue at hand [43].
Since there is no contextual information that would allow one filter to be preferred to another in the proposed case study, the weights w_j of the voting system described by Equation (2) are set to 1.
The set of features with the highest score is chosen for the Data Analysis phase.
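The voting strategy of Equation (2) can be sketched as follows (the feature names and the structure of the per-filter scores are illustrative; scores are assumed to be normalized to comparable scales):

```python
def vote_score(feature_set, filter_scores, weights=None):
    """Score a candidate feature set by a weighted vote among filters.

    filter_scores: one dict per filter, mapping feature name -> that
    filter's score for the feature.
    weights: one weight per filter; all set to 1 when no filter is preferred.
    """
    if weights is None:
        weights = [1.0] * len(filter_scores)
    return sum(w * sum(scores[f] for f in feature_set)
               for w, scores in zip(weights, filter_scores))

def select_best(candidate_sets, filter_scores, weights=None):
    """Return the candidate feature set with the highest vote score."""
    return max(candidate_sets,
               key=lambda s: vote_score(s, filter_scores, weights))
```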
4.3. Data Analysis
The prediction of the PERMA scores is approached both as a regression and as a binary classification task (classification of the PERMA score level as high or low). Thus, different models are used and their respective hyperparameters that have to be tuned for proper performance of the models.
Table 3 provides a summary of the models used for the classification and the regression task, respectively. It also summarizes the various hyperparameters tuned using grid search and cross-validation on the training dataset.
For each target variable of the PERMA survey, the training set is split into k folds in order to find the best combination of hyperparameters. The chosen model is the one with the lowest validation error for regression (MAE) or the highest performance metric for classification (balanced accuracy). Finally, the models are trained using the training sets.
A 5-fold cross-validation on the training set is used to tune the models under consideration. Each pillar of the PERMA model is analyzed independently.
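The hyperparameter search can be sketched with scikit-learn (the data below are a synthetic stand-in; in the paper, X holds the FAS features and y the continuous score of one PERMA pillar):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data for one PERMA pillar regression task.
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search with 5-fold cross-validation, restricted to the training set.
grid = GridSearchCV(
    BayesianRidge(),
    param_grid={"alpha_1": [1e-6, 1e-4], "lambda_1": [1e-6, 1e-4]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X_tr, y_tr)
```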
Table 4 and
Table 5 depict the regression and classification models, respectively, as well as their hyperparameters offering the best performance on the validation set.
The predominance of the CatBoostClassifier model in the classification task is evident in
Table 5. This model is chosen for the classification of the level of four of PERMA’s five pillars. There is no such evidence in the regression task since, as described in
Table 4, each pillar is predicted by a different model, with the exception of pillars R and A, which are both predicted by the BayesianRidge model.
The best models and their associated hyperparameters are trained and tested using the training and the test set, respectively.
The performance on the test set of the regression models is calculated using the MAE metric to measure the mean absolute error between predicted and actual values [
46]. In
Figure 5, the performance of each model is compared to a baseline whose prediction is the average pillar value observed over the test set.
The performance on the test set of the classification models is calculated using the balanced accuracy metric to encourage the model to correctly predict examples from all classes, regardless of their size [
48]. This is achieved by averaging the percentage of correctly predicted examples for each class individually. In the case of binary classification, the probability of predicting the right class when the data distribution is uniform is 50% [
49]. Thus, a naive classifier with 50% balanced accuracy is used as the baseline. The comparison between the baseline and the performance of the classification models for each PERMA pillar is shown in
Figure 6.
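Balanced accuracy, and why a naive constant classifier scores 50% on a binary task, can be sketched as follows:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, so every class counts equally
    regardless of its size."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```

A classifier that always predicts the majority class gets recall 1 on that class and 0 on the other, hence a balanced accuracy of exactly 0.5, which is the baseline used here.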
The results show that for most of the PERMA dimensions, with the exception of the Meaning dimension, the best performing regression and binary classification models outperform the baseline.
The regression and binary classification models outperform the baseline, on average, by 1.5% and 5.6%, respectively. This may indicate that the models discovered significant relationships in the data.
4.4. Feature Importance
Regression models can be used to analyze the coefficients associated with each attribute to determine its importance. Tree-based models can also provide insight into the importance of attributes by analyzing the mean decrease in the impurity (MDI). However, they do not really give any indication of the impact of attributes on prediction or classification [
50]. For this purpose, the SHAP value can be used [
51].
SHAP values are computed by averaging the influence of one feature over all possible combinations of features in the model [
52]. In this way, the data from each of the models generated and trained during the
Data analysis phase (
Section 4.3) are analyzed in order to extract the influence of features across multiple models, allowing the comparison of the effects of each feature and the identification of the most influential features for the prediction and classification of each PERMA pillar.
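The coalition averaging behind SHAP values can be illustrated with a brute-force exact Shapley computation on a toy model (production SHAP implementations use faster approximations; replacing absent features by the background mean is a common simplifying assumption):

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shap(model, x, background):
    """Exact Shapley values for one sample x: average each feature's
    marginal contribution over all coalitions of the other features,
    with absent features set to their background mean."""
    n = len(x)
    base = background.mean(axis=0)

    def value(subset):
        z = base.copy()
        z[list(subset)] = x[list(subset)]
        return model(z)

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi
```

For a linear model the Shapley values recover the coefficient-weighted deviations from the background, which makes the toy case easy to verify by hand.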
An analysis examining the Pearson correlation coefficient between each of the PERMA pillars and the individual features indicated at most weak correlations, with the highest being roughly 0.3.
To better understand the impact and dynamics of each feature on the final prediction and classification, a SHAP value analysis is undertaken. The SHAP analysis of the best binary classifier for the classification of each PERMA pillar is computed and the obtained results are proposed in
Table 6.
As presented in
Table 6, the attributes influencing classification vary greatly from pillar to pillar. The case study results indicate that the positive emotions (
P), accomplishment (
A) and meaning (
M) pillars are largely influenced by the attributes derived from emotions. Based on Ekman’s basic emotions, a high minimum level of surprise and a low maximum level of neutral emotion seem to positively influence pillar
P, while low levels of the sadness standard deviation and third quartile seem to positively influence pillar
A. This suggests that more stable emotional states are correlated with greater accomplishment. A high level of valence and dominance slope seems to be linked to the Meaning pillar (
M) of the PERMA model. The engagement pillar (
E) seems to be linked to head and gaze movements. A low minimum head velocity and a high average level of gaze exchange seem to have a positive impact on individuals’ engagement in collaborative work. Finally, the relationships pillar (
R) seems to be linked to the environment in which the experiment takes place. Thus, attributes linked to luminosity have a strong impact on this pillar, with an advantage for low luminosity levels.
With the same objective of explicability, the SHAP analysis of the best regression model for the prediction of each PERMA pillar is computed and the obtained results are proposed in
Table 7.
As with the binary classifiers, and as presented in
Table 7, the attributes influencing prediction and classification vary greatly from pillar to pillar. Once again, the case study results indicate that the accomplishment (
A) and meaning (
M) pillars are largely influenced by the attributes derived from emotions. However, the attributes used vary. The valence level as well as the number of times sadness is experienced by the participants seems to have an impact on the accomplishment pillar (
A). For the meaning pillar (
M), the dominance (slope and first quartile value) is once again influential with a positive correlation between meaning value and dominance attribute levels. Contrary to the binary classification model, the key element for the positive emotion pillar (
P) in the regression task seems to be linked to the SNA metric of out-degree centrality. The more participants look at others, the more positive emotions they experience. The engagement pillar (
E) also seems to be linked to the participants’ emotions, since the value of the third quartile of valence and the standard deviation observed for the neutral emotion are the most influential attributes for this pillar. Finally, it is interesting that the relationships pillar (
R) seems, once again, to be linked to the brightness of the environment in which the experiment takes place, but also to head movement. As with the binary classifiers, attributes linked to brightness have a strong impact on this pillar, with an advantage for low brightness levels; however, contrary to the binary classifier, the minimum head velocity seems to have a positive impact on individuals’ relationships.
5. Discussion
To recall, the aim of the proposed study was to understand the non-verbal communication process in teamwork using video data and identify significant predictors of individual well-being in teamwork. The experiment conducted and the results obtained serve as a basis for discussion of the proposed research questions.
RQ1: Which features of videos taken in a team setting will be predictive of individual and team well-being measured with PERMA surveys?
By combining video analysis with PERMA well-being measurement, we identified relevant features predicting individual well-being: if individuals exchange more gazes with others, experience more surprise, behave more dominantly (i.e., speak more), and are more emotionally stable, they report higher satisfaction as measured through PERMA. This means that simply experiencing constant happiness is not the best way to achieve a positive team experience; rather, individuals should engage in active social exchange with others and pursue surprising avenues in their work.
A framework combining state-of-the-art tools has been proposed in
Section 3, which extracts non-verbal cues from panoramic video data, such as facial emotions, gaze patterns, and head motions, as input for individual well-being analysis. An experiment presented in
Section 4 applies the proposed framework and links the extracted attributes to the results of PERMA+4 surveys evaluating the various pillars of well-being defined in positive psychology. This way, a dataset of 125 features has been generated to predict the different pillars of the PERMA analysis. Machine learning models were then trained for the regression and binary classification tasks to predict individual well-being scores, as defined by the PERMA framework.
When applied to a case study of collaboration within 20 co-located work teams, regression models outperform the baselines in four of the five PERMA dimensions, with a notable 1.5% improvement in MAE. Bayesian ridge regression was identified as particularly effective. In comparison, binary classification emerged as a more reliable approach, with models yielding a balanced accuracy improvement of 5.1%, also outperforming the baseline in four out of five PERMA dimensions. Ensemble models, specifically CatBoost, showed superior performance in this setting. Notably, the Meaning dimension of PERMA proved challenging in both prediction and classification settings, indicating difficulty in discerning a participant’s sense of meaning purely from video cues.
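The two metrics used to report these improvements can be made concrete. The sketch below is illustrative only (it is not the study's original code, and the example values are hypothetical): mean absolute error for the regression task and balanced accuracy for the binary classification task.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute deviation between predicted and true scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls on the positive and negative classes;
    robust to class imbalance, unlike plain accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    return 0.5 * (tp / pos + tn / neg)

# Hypothetical PERMA scores (regression) and high/low labels (classification)
print(mean_absolute_error([6.0, 4.5, 7.0], [5.5, 5.0, 7.0]))  # ≈ 0.333
print(balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))          # 0.75
```

Balanced accuracy is the appropriate choice here because a median split of PERMA scores can still leave mildly unequal classes per pillar.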
RQ2: How can the relevance of attributes for predicting individual well-being in a collaborative work context be measured?
In this work, we have developed an approach for measuring self-reported well-being from facial expressions and upper-body posture. Using SHAP values to quantify the contribution of each attribute to the PERMA prediction, we can identify the most predictive attributes. Applying this approach, we found, for instance, that actively exchanging mutual glances and engaging in an active dialog increase the well-being of team members.
SHAP values are used to interpret the impact of features on prediction and classification, independently of the machine learning model used. They also rank features according to their importance for the model under study [
51]. Derived from cooperative game theory, SHAP values quantify the importance of each feature for every data point, since each prediction is decomposed into a sum of feature contributions. This provides a more transparent description of the model’s behavior and therefore greater interpretability of the models [
51]. Furthermore, this approach facilitates the identification of the most appropriate features for PERMA prediction by allowing the comparison of the influence of features across multiple models [
51].
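The game-theoretic decomposition behind SHAP can be illustrated on a toy example. The sketch below uses a hypothetical value function and feature names, and enumerates orderings exactly (tractable only for a handful of features; the actual SHAP tooling relies on efficient approximations). It shows the defining property used above: the per-feature contributions sum to the difference between the full model output and the baseline.

```python
from itertools import permutations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: each feature's average marginal
    contribution over all orderings of the feature set."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = set()
        for f in order:
            before = value(coalition)
            coalition.add(f)
            phi[f] += value(coalition) - before
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in phi.items()}

def toy_model(observed):
    """Hypothetical PERMA-score predictor: baseline 4.0 plus
    contributions of observed cues, with one interaction term."""
    score = 4.0
    if "mutual_gaze" in observed:
        score += 1.0
    if "speaking_time" in observed:
        score += 0.5
    if {"mutual_gaze", "speaking_time"} <= observed:
        score += 0.5  # interaction: gaze and dialog reinforce each other
    return score

phi = shapley_values(["mutual_gaze", "speaking_time"], toy_model)
# Efficiency property: contributions sum to f(all) - f(empty) = 6.0 - 4.0
assert abs(sum(phi.values()) - 2.0) < 1e-9
```

Note how the interaction term is split evenly between the two features, which is exactly why correlated or interacting features complicate the interpretation of individual SHAP values.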
RQ3: How can theories and hypotheses relevant to positive psychology be derived from AI-driven team video analysis?
The general process for developing new theories consists of correlating personal features computed from video recordings of team interaction, such as emotions, turn-taking, and gazing at others, with team outcomes and personal well-being measured with PERMA. In this way, we are able to identify the behavioral patterns of an individual that correlate most strongly with high individual satisfaction and team success. These insights allow individuals and teams to change their behavior accordingly.
From the feature analysis with SHAP values, various theories and hypotheses potentially relevant to experts in the field of positive psychology could be derived, for instance from the distribution of data points in the SHAP analyses. Based on the results of the case study, preliminary insights for teamwork could be gained: paying attention to (i.e., looking at) team members appears instrumental in fostering happiness (P); calmer head movements seem to enhance engagement (E) and interpersonal relationships (R); the brightness of the environment (more light) may have an important impact on relationships (R); the sense of meaning (M) seems to be strongly tied to an increasing feeling of control; and, finally, the results suggest that steady emotional states provide a greater sense of accomplishment (A).
Limitations
The results presented here are valid only for the discussed case study. Thus, although the methodology employed is generalizable, similar case studies in different contexts and with different participants should be conducted to further investigate these conclusions in the field of cognitive sciences. Moreover, these results show associations but do not allow causality to be determined. This is one of the limitations of the proposed methodology, but other factors should also be acknowledged.
In data preparation, the FAS did not utilize explicit face alignment and treated each video frame in isolation, possibly overlooking the importance of temporal dynamics. These two factors could have a negative impact on the performance of the proposed model, as they could, respectively, complicate emotion recognition and neglect temporal entanglements. Furthermore, inherent assumptions in the employed algorithms, such as using the field-of-view (FOV) cone model for gaze pattern estimation, can introduce errors into the proposed findings. The same is true for the data preprocessing techniques employed, such as smoothing and linear interpolation, which, coupled with the dependence on specific feature selection strategies, may introduce biases and uncertainties.
Another limitation of the proposed study is the small number of data points available, which restricts an accurate exploration of the feature space and limits the predictive models’ capacity to generalize beyond this study. While the hyperparameter search space was explored through grid-search cross-validation, the chosen grids might not capture all potential configurations. The SHAP-based feature analysis also brings its own set of challenges. Finally, the modeling strategy relies on the fundamental assumption of relative independence among features, an ideal that is challenging to achieve consistently; as a result, the model may not accurately capture interactions between features or possible non-linear effects.
6. Conclusions
Theories and hypotheses from sociology and psychology are necessary to better understand the behaviors and aspirations of the individuals and societies around us. However, developing these theories and hypotheses is often difficult, as manual data collection for qualitative analysis by domain experts is time-consuming, limited, and prone to bias. To help experts develop theories based on a wider range of objective data, we propose a methodology to understand the non-verbal communication process in teamwork using video data and identify significant predictors of individual well-being in teamwork.
Numerous studies analyze the well-being of individuals and teamwork, but these studies are situated in virtual or highly controlled environments (see
Section 2). However, collaborative work generally takes place in uncontrolled, co-located environments.
To fill this gap, the proposed framework leverages video acquisition technologies and state-of-the-art artificial intelligence tools to extract individual, relative, and environmental features from panoramic video. Statistical analysis is applied to each time series, leading to the generation of a dataset of 125 features that are then linked to PERMA surveys.
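The aggregation step from per-frame time series to a fixed-length feature vector can be sketched as follows. The signal names and the particular summary statistics shown here are illustrative placeholders; the full framework computes a richer set of statistics per signal to arrive at the 125 features.

```python
import statistics

def aggregate_features(series_by_name):
    """Collapse each per-frame time series into summary statistics,
    yielding one fixed-length feature vector per participant."""
    features = {}
    for name, series in series_by_name.items():
        features[f"{name}_min"] = min(series)
        features[f"{name}_max"] = max(series)
        features[f"{name}_mean"] = statistics.fmean(series)
        features[f"{name}_std"] = statistics.stdev(series)
    return features

# Hypothetical per-frame signals for one participant
signals = {
    "head_velocity": [0.1, 0.3, 0.2, 0.4],
    "brightness": [120.0, 118.0, 125.0, 122.0],
}
feats = aggregate_features(signals)
```

Reducing each variable-length series to fixed statistics is what makes the subsequent regression and classification models applicable, at the cost of discarding temporal ordering, one of the limitations noted above.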
A SHAP-based feature analysis unveils key indicators associated with the PERMA scores.
Applied to a case study, this method allows us to identify several hypotheses. For example, it seems that paying attention to team members is key to happiness. It also appears that calm head movements promote individual engagement and interpersonal relations. Other hypotheses include the importance of the environment (brightness) for relationships, the close link between a sense of control and meaning, and the greater sense of accomplishment that stable emotional states bring.
However, these results must be qualified, since a single case study is not enough to generalize these theories. Generalizing these results through the analysis of other case studies in various contexts is a promising line of research that will be interesting to pursue in the near future. In addition, practical improvements to the proposed FAS should be considered, such as explicit face alignment for better emotion recognition, accounting for the effects of temporal dynamics across successive frames, and identifying and managing possible biases due to interpolation and smoothing.
This study has identified some promising avenues of research. One lies in the fusion of different modalities for the analysis of individual well-being during teamwork. Indeed, the analysis of non-verbal communication could be combined with the analysis of verbal communication to obtain a holistic view of communication patterns and to develop an integrated framework for analyzing the communication factors impacting individual well-being.
Author Contributions
Methodology, M.M., T.Z. and P.A.G.; Software, M.M.; Formal analysis, M.M. and T.Z.; Investigation, P.A.G., M.M., T.Z., I.V. and J.H.; Data curation, M.M. and T.Z.; Writing—original draft, M.M. and A.D.; Writing—review & editing, P.A.G., J.H. and I.V. All authors have read and agreed to the published version of the manuscript.
Funding
Moritz Mueller’s stay at MIT was supported by the German Academic Exchange Service (DAAD).
Institutional Review Board Statement
This study was approved by MIT COUHES under IRB 1701817083 dated 19 January 2023.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy reasons.
Acknowledgments
We thank Bryan Moser for his invaluable support in the integration of our experiment during the Independent Activity Period at MIT.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Seligman, M.; Csikszentmihalyi, M. Positive Psychology: An Introduction. Am. Psychol. 2000, 55, 5–14. [Google Scholar] [CrossRef] [PubMed]
- Maccagnan, A.; Wren-Lewis, S.; Brown, H.; Taylor, T. Wellbeing and Society: Towards Quantification of the Co-benefits of Wellbeing. Soc. Indic. Res. 2019, 141, 217–243. [Google Scholar] [CrossRef]
- Lyubomirsky, S.; King, L.; Diener, E. The Benefits of Frequent Positive Affect: Does Happiness Lead to Success? Psychol. Bull. 2005, 131, 803–855. [Google Scholar] [CrossRef]
- Kompaso, S.; Sridevi, M. Employee Engagement: The Key to Improving Performance. Int. J. Bus. Manag. 2010, 5, 89–96. [Google Scholar] [CrossRef]
- Wright, T.; Cropanzano, R. Psychological well-being and job satisfaction as predictors of job performance. J. Occup. Health Psychol. 2000, 5, 84–94. [Google Scholar] [CrossRef] [PubMed]
- Gloor, P. Happimetrics; Edward Elgar Publishing: Cheltenham, UK, 2022; pp. 103–120. [Google Scholar] [CrossRef]
- Mehrabian, A. Silent Messages; Wadsworth: Oxford, UK, 1971; p. 13. [Google Scholar]
- Birdwhistell, R.L. Kinesics and Context; University of Pennsylvania Press: Philadelphia, PA, USA, 1971. [Google Scholar] [CrossRef]
- Knapp, M.; Hall, J. Non-Verbal Communication in Human Interaction, 7th ed.; Wadsworth, Cengage Learning: Boston, MA, USA, 2010. [Google Scholar]
- Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124–129. [Google Scholar] [CrossRef]
- Pantic, M.; Rothkrantz, L.J.M. Toward an Affect-Sensitive Multimodal Human–Computer Interaction. Proc. IEEE 2003, 91, 1370–1390. [Google Scholar] [CrossRef]
- Pinto, S.; Fumincelli, L.; Mazzo, A.; Caldeira, S.; Martins, J.C. Comfort, well-being and quality of life: Discussion of the differences and similarities among the concepts. Porto Biomed. J. 2017, 2, 6–12. [Google Scholar] [CrossRef]
- Seligman, M.E.P. Flourish: A Visionary New Understanding of Happiness and Well-Being; Free Press: New York, NY, USA, 2011; p. 349-xii. [Google Scholar]
- Forgeard, M.J.C.; Jayawickreme, E.; Kern, M.L.; Seligman, M.E.P. Doing the Right Thing: Measuring Well-Being for Public Policy. Int. J. Wellbeing 2011, 1, 76–106. [Google Scholar] [CrossRef]
- Butler, J.; Kern, M.L. The PERMA-Profiler: A brief multidimensional measure of flourishing. Int. J. Wellbeing 2016, 6, 1–48. [Google Scholar] [CrossRef]
- Kun, A.; Balogh, P.; Gerákné Krasz, K. Development of the Work-Related Well-Being Questionnaire Based on Seligman’s PERMA Model. Period. Polytech. Soc. Manag. Sci. 2017, 25, 56–63. [Google Scholar] [CrossRef]
- Donaldson, S.I.; van Zyl, L.E.; Donaldson, S.I. PERMA+4: A Framework for Work-Related Wellbeing, Performance and Positive Organizational Psychology 2.0. Front. Psychol. 2021, 12, 817244. [Google Scholar] [CrossRef]
- Zaraska, M. Growing Young: How Friendship, Optimism and Kindness Can Help You Live to 100; Appetite by Random House: Vancouver, BC, Canada, 2020. [Google Scholar]
- Csikszentmihalyi, M. Flow: The Psychology of Happiness; Ebury Publishing Random House: London, UK, 2013. [Google Scholar]
- Kruse, J.A. Comparing Unimodal and Multimodal Emotion Classification Systems on Cohesive Data. Master’s Thesis, Chair of Media Technology, TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany, 2022. [Google Scholar]
- Abramov, S.; Korotin, A.; Somov, A.; Burnaev, E.; Stepanov, A.; Nikolaev, D.; Titova, M.A. Analysis of Video Game Players’ Emotions and Team Performance: An Esports Tournament Case Study. IEEE J. Biomed. Health Inform. 2022, 26, 3597–3606. [Google Scholar] [CrossRef]
- Nezami, O.M.; Dras, M.; Hamey, L.; Richards, D.; Wan, S.; Paris, C. Automatic Recognition of Student Engagement using Deep Learning and Facial Expression. arXiv 2018, arXiv:1808.02324. [Google Scholar]
- Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
- Guerlain, S.; Shin, T.; Guo, H.; Adams, R.; Calland James, M.D. A Team Performance Data Collection and Analysis System. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2002, 46, 1443–1447. [Google Scholar] [CrossRef]
- Ivarsson, J.; Åberg, M. Role of requests and communication breakdowns in the coordination of teamwork: A video-based observational study of hybrid operating rooms. BMJ Open 2020, 10, 35194. [Google Scholar] [CrossRef] [PubMed]
- Stefanini, A.; Aloini, D.; Gloor, P. Silence is golden: The role of team coordination in health operations. Int. J. Oper. Prod. Manag. 2020, 40, 1421–1447. [Google Scholar] [CrossRef]
- Salvador Vazquez Rodarte, I. An Experimental Multi-Modal Approach to Instrument the Sensemaking Process at the Team-Level. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2022. [Google Scholar]
- Koutsombogera, M.; Vogel, C. Modeling Collaborative Multimodal Behavior in Group Dialogues: The MULTISIMO Corpus; Technical Report; European Language Resources Association (ELRA): Miyazaki, Japan, 2018. [Google Scholar]
- Kontogiorgos, D.; Sibirtseva, E.; Pereira, A.; Skantze, G.; Gustafson, J. Multimodal reference resolution in collaborative assembly tasks. In Proceedings of the 4th Workshop on Multimodal Analyses Enabling Artificial Agents in Human–Machine Interaction, MA3HMI 2018—In Conjunction with ICMI 2018, Boulder, CO, USA, 16 October 2018; Association for Computing Machinery, Inc.: New York, NY, USA, 2018; Volume 10, pp. 38–42. [Google Scholar] [CrossRef]
- Sanchez-Cortes, D.; Aran, O.; Mast, M.S.; Gatica-Perez, D. A nonverbal behavior approach to identify emergent leaders in small groups. IEEE Trans. Multimed. 2012, 14, 816–832. [Google Scholar] [CrossRef]
- Kim, T.; McFee, E.; Olguin, D.O.; Waber, B.; Pentland, A.S. Sociometric badges: Using sensor technology to capture new forms of collaboration. J. Organ. Behav. 2012, 33, 412–427. [Google Scholar] [CrossRef]
- Kahneman, D.; Krueger, A.B.; Schkade, D.A.; Schwarz, N.; Stone, A.A. A survey method for characterizing daily life experience: The day reconstruction method. Science 2004, 306, 1776–1780. [Google Scholar] [CrossRef] [PubMed]
- Available online: https://en.j5create.com/products/jvcu360 (accessed on 30 January 2024).
- Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
- Sharma, A. Master of Computer Applications; Technical Report; Universities Press (India) Private Limited: Telangana, India, 2002. [Google Scholar]
- Pham, L.; Vu, T.H.; Tran, T.A. Facial Expression Recognition Using Residual Masking Network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021. [Google Scholar] [CrossRef]
- Wu, C.Y.; Xu, Q.; Neumann, U. Synergy between 3DMM and 3D Landmarks for Accurate 3D Facial Geometry. In Proceedings of the 2021 International Conference on 3D Vision, (3DV), London, UK, 1–3 December 2021; pp. 453–463. [Google Scholar] [CrossRef]
- Smith, A.R. Color Gamut Transform Pairs. In ACM SIGGRAPH Computer Graphics; Technical Report; Association for Computing Machinery: New York, NY, USA, 1978; Volume 12, pp. 12–19. [Google Scholar] [CrossRef]
- Törlind, P. A framework for data collection of collaborative design research. In Proceedings of the ICED 2007 the 16th International Conference on Engineering Design, Paris, France, 28–31 July 2007. [Google Scholar]
- Raschka, S.; Mirjalili, V. Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-Learn, and Tensorflow 2; Packt Publishing: Birmingham, UK, 2017; p. 741. [Google Scholar]
- James, G.G.M.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: New York, NY, USA, 2013; p. 426. [Google Scholar] [CrossRef]
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. 2017, 50, 1–45. [Google Scholar] [CrossRef]
- Casella, G.; Berger, R.L.; Santana, D. Statistical Inference—Solutions Manual; 2002. [Google Scholar]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) (Hardcover); Technical Report; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2012. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
- Läuter, J. Multiple Testing Procedures with Applications to Genomics. S. Dudoit and M. J. van der Laan (2008). New York: Springer Science+Business Media, LLC. ISBN: 978-0-387-49316-9. Biom. J. 2010, 52, 699. [Google Scholar] [CrossRef]
- Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
- Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013. [Google Scholar]
- Molnar, C. Interpretable Machine Learning a Guide for Making Black Box Models Explainable; Technical Report; 2019. Available online: https://leanpub.com/interpretable-machine-learning (accessed on 30 January 2024).
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30, pp. 4768–4777. [Google Scholar] [CrossRef]
- Scapin, D.; Cisotto, G.; Gindullina, E.; Badia, L. Shapley Value as an Aid to Biomedical Machine Learning: A Heart Disease Dataset Analysis. In Proceedings of the 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Taormina, Italy, 16–19 May 2022; pp. 933–939. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).