Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset

Othman, Walaa; Hamoud, Batol; Shilov, Nikolay; Kashevnik, Alexey

doi:10.3390/app142210510

Open AccessArticle

Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset

Saint Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10510; https://doi.org/10.3390/app142210510

Submission received: 27 August 2024 / Revised: 31 October 2024 / Accepted: 12 November 2024 / Published: 14 November 2024

(This article belongs to the Special Issue Advances in Human–Machine Systems, Human–Machine Interfaces and Human Wearable Device Performance)

Download

Browse Figures

Versions Notes

Abstract

:

The detection of the human mental fatigue state holds immense significance due to its direct impact on work efficiency, specifically in system operation control. Numerous approaches have been proposed to address the challenge of fatigue detection, aiming to identify signs of fatigue and alert the individual. This paper introduces an approach to human mental fatigue assessment based on the application of machine learning techniques to the video of a working operator. For validation purposes, the approach was applied to a dataset, “Human Fatigue Assessment Based on Video Data” (HFAVD) integrating video data with features computed by using our computer vision deep learning models. The incorporated features encompass head movements represented by Euler angles (roll, pitch, and yaw), vital signs (blood pressure, heart rate, oxygen saturation, and respiratory rate), and eye and mouth states (blinking and yawning). The integration of these features eliminates the need for the manual calculation or detection of these parameters, and it obviates the requirement for sensors and external devices, which are commonly employed in existing datasets. The main objective of our work is to advance research in fatigue detection, particularly in work and academic settings. For this reason, we conducted a series of experiments by utilizing machine learning techniques to analyze the dataset and assess the fatigue state based on the features predicted by our models. The results reveal that the random forest technique consistently achieved the highest accuracy and F1-score across all experiments, predominantly exceeding 90%. These findings suggest that random forest is a highly promising technique for this task and prove the strong connection and association among the predicted features used to annotate the videos and the state of fatigue.

Keywords:

operator fatigue detection; computer vision; physiological indicator; machine learning

1. Introduction

The issue of accurately detecting fatigue is of paramount importance, given the direct correlation between fatigue and endangering numerous lives, particularly those of complex system operators [1]. Operator fatigue is often the underlying cause of high-cost errors and mishaps when controlling complex systems, such as industrial robotic complexes, chemical reaction systems, or systems of multiple coordinated units [2].

Fatigue is typically induced when an individual does not receive an adequate amount of sleep or engages in extended periods of laborious activity. According to a comprehensive assessment of sleep patterns among the U.S. population, it was observed that 37% of the workforce received less than the recommended baseline of seven hours of sleep [3]. The principal challenge associated with fatigue is the difficulty in recognizing the onset of this state until it has escalated to a point where it significantly impairs work and system operation efficiency. Fatigued individuals often exhibit decreased decision-making speed and prolonged reaction times, which render them less effective in their professional roles and more susceptible to risks while working [4].

Numerous solutions have been proposed to tackle the issue of fatigue detection, with the primary aims of identifying signs of fatigue and subsequently alerting the concerned individual to this hazardous state. The majority of these methodologies employ machine learning classifiers or leverage deep learning models to construct a high-efficacy system for fatigue detection [5]. Most of the research on fatigue monitoring has concentrated on identifying fatigue during cognitive tasks, such as driving [6], and some researchers dedicated their endeavors to identifying signs of fatigue in natural-viewing situations, where individuals are not actively engaged in cognitive tasks [7]. Non-intrusive techniques, including those that utilize remote or webcam-based eye tracking, generally monitor variations in an individual’s pupil response, blink rate, and eye movement patterns to deduce signs of fatigue during such cognitive tasks [8].

Within the fields of machine learning and deep learning, the selection of the training dataset is a critical factor that influences several aspects of the model, including its accuracy and generalizability. Consequently, numerous datasets have been curated, developed, and utilized for the advancement of fatigue or drowsiness detection techniques [5]. Some of these datasets rely on the use of sensors and devices to collect pertinent features associated with an individual’s fatigue state and mental cognition. These include eye trackers [9], electroencephalogram (EEG) signals [10], and electrocardiography (ECG) signals [11], as well as other physiological indicators. However, the evolution of data collection techniques has led to the creation of more sophisticated datasets. These datasets comprise videos of drivers or operators engaged in a task or driving. The availability of such datasets opens up new avenues for experimentation, particularly with deep learning models. These models can process video data to recognize signs of fatigue with no need for any additional devices for the individual. Such non-intrusive methods enhance the applicability and usability of these approaches in safety and monitoring systems, thereby offering potential solutions for real-world fatigue detection challenges.

In this article, we present a selection of datasets that offer various advantages, particularly in terms of the wide range of the facial characteristics, behaviors, ethnic backgrounds, lighting conditions, recording settings, and facial poses they encompass. However, we also draw attention to a big limitation, that is, a noticeable trend of existing datasets of primarily focusing on simulating driving scenarios, which has created a lack of recordings that depict operator (computer user)-related tasks where fatigue exerts a substantial influence on both work performance and productivity. Therefore, our current research focuses on operator fatigue detection in work and academic settings. In this study, subjects experienced the causes of the two types of mental fatigue [12]. The first type is active fatigue, which arises from prolonged task-related perceptual–motor adjustments; for example, operators reading scientific articles develop cognitive strategies that enhance their ability to process complex information. This adaptation includes adjustments in reading speed and focused attention on key concepts. The second type is passive fatigue, which occurs due to extended periods of inactivity, such as sitting in front of a computer for a long time, leading to mental exhaustion. The originality of the proposal is in the usage of physiological indicators obtained by using computer vision for human fatigue state detection based on facial videos. The videos are taken from the public dataset OperatorEYEVP [13]. Therefore, we applied multiple advanced deep learning and computer vision models to analyze the dataset videos. In doing so, we successfully extracted physiological characteristics, which represents a novel and original contribution to the field. Unlike the usage of wearable or other sensors attached to the human body, computer vision techniques can be more useful in many situations, since they are non-intrusive and do not limit human mobility. Our contribution can be summarized as follows:

Innovative approach to non-intrusive fatigue detection: Diverging from conventional sensor-based research on mental fatigue, this study introduces a novel, non-intrusive method using deep learning, computer vision, and machine learning models. Unlike the traditional methods that leverage sensor-based data for mental fatigue detection, we extract features by using a simple camera and apply machine learning techniques for fatigue prediction. This significant shift offers practical benefits due to its accessibility and user friendliness, demonstrating the potential of our approach as a valuable tool for fatigue research.
Estimation of physiological indicators: We utilize the OperatorEYEVP dataset [13], which contains a diverse collection of videos capturing the daily tasks of employees and students. Through the application of deep learning and computer vision models developed in our previous work, we estimate a set of physiological indicators that provide an evaluation of the subject’s physiological condition in these videos.
Comprehensive dataset HFAVD: The resulting dataset includes physiological indicators besides other data (Landolt rings test results, VAS_F, and sensor data related to eye tracking and HRV). These indicators comprise head movement (as Euler angles: roll, pitch, and yaw), vital signs, and eye and mouth states (eye closure and yawning), which were automatically calculated and detected by using deep learning models, eliminating the need for manual intervention, which is often required by other existing datasets.
Applicability to work and academic settings: The resulting dataset holds particular significance for work and academic environments, where the impact of fatigue on work performance and efficiency is significant. Therefore, it has the potential to advance fatigue detection research in these contexts.

Our contribution provides a significant opportunity for researchers to advance their comprehension of the intricate relationship between mental fatigue and the diverse range of state changes displayed by subjects.

The paper is structured as follows: Section 2 describes several approaches and datasets for the task of fatigue and drowsiness detection. Section 3 begins by providing an overview of the original dataset we used for our research. Next, we explain how we made our own contribution to this dataset. Then, we describe the structure of models used with the details of the final tables that describe the dataset. Section 4 outlines the experiments we conducted to analyze the dataset with the results.

2. Related Work

In the context of monitoring drivers and operators, researchers often equate these terms due to the overlapping symptoms, notably including diminished concentration, delayed environmental processing, and reduced task performance, whether related to driving or computer-based work. This chapter provides an overview of current approaches and datasets utilized in the field of fatigue detection, encompassing both human drowsiness and fatigue resulting from work or study.

2.1. Existing Approaches to Human Fatigue Assessment

In this section, we detail the proposed approaches and methodologies employed for mental fatigue assessment for drivers and operators, since the biggest focus in the existing literature is on the mental fatigue caused by the driving situation.

The article [14] proposes a lightweight neural network model designed to address driver fatigue detection. The model comprises two primary components: object detection and fatigue assessment. Initially, a lightweight object detection network is developed to rapidly identify the opening and closing states of the driver’s eyes and mouth in the video. Following this, the EYE-MOUTH (EM) driver fatigue detection model encodes these states and computes two critical metrics: the Percentage of Eyelid Closure over the Pupil (PERCLOS) and the Frequency of Open Mouth (FOM). Finally, a multi-feature fusion judgment algorithm is employed to assess the driver’s fatigue state. The achieved accuracy was 98.30%.

Other researchers [15] introduced a system aimed at detecting drowsiness in drivers, featuring an architecture specifically designed to recognize signs of fatigue. The system incorporates four deep learning models, AlexNet, VGG-FaceNet, FlowImageNet, and ResNet, which analyze RGB video footage of drivers to facilitate drowsiness detection. The architecture focuses on four distinct feature categories: hand gestures, facial expressions, behavioral traits, and head movements. AlexNet is utilized to manage variations in environmental conditions, while VGG-FaceNet extracts facial features such as gender and ethnicity. FlowImageNet is responsible for assessing behavioral traits and head movements, and ResNet specializes in detecting hand gestures. The system classifies the detected features into four categories: alert, drowsy with eye blinking, yawning, and nodding. The outputs from these models are processed through an ensemble algorithm, culminating in a final classification via a SoftMax classifier. This system obtained an accuracy of 85%.

The authors of [16] utilized a pre-trained model employing Histogram of Oriented Gradients (HOG) [17] in conjunction with a linear Support Vector Machine (SVM) to extract the positions of the eyes, nose, and mouth, while also calculating the eye aspect ratio (EAR), mouth opening ratio (MOR), and nose length ratio (NLR). Once these three features are determined, drowsiness detection is performed on the extracted frames. Initially, adaptive thresholding is applied for the classification of blinking, yawning, and head tilting, followed by the implementation of various machine learning algorithms to further categorize the data into drowsy and non-drowsy. The SVM demonstrated the highest performance, achieving an accuracy of 96.4%.

Another proposed framework that facilitates drowsiness detection by examining the driver’s facial features captured through a camera was presented in [18]. The initial step involves face detection, where the driver’s face is identified within a video frame by utilizing Histogram of Oriented Gradients (HoG) features in conjunction with a Support Vector Machine (SVM). Subsequently, specific facial landmark positions are determined. These landmarks assist in estimating head orientation and identifying instances of blinking. By systematically monitoring blink duration, blink frequency, and PERCLOS (Percentage of Eyelid Closure over the Pupil over time), the level of driver drowsiness is assessed through a Fuzzy Inference System (FIS). The performance of the proposed system was tested against the subjective evaluation of the driver based on Karolinska Sleepiness Scale (KSS) [19], with an average normalized root mean square error (NRMSE) of 15.63%.

The study [20] introduces a real-time method for monitoring driver disturbances by utilizing Convolutional Neural Networks (CNNs). The detection process employs advanced CNN architectures, including InceptionV3, VGG16, and ResNet50. Consequently, the ResNet50 model outperforms the other deep learning architectures, attaining an accuracy of 93.69% with a loss of 0.6931.

The paper [21] introduces a cost-effective framework for predicting driver fatigue levels, which aims to detect fatigue in its initial stages. The framework assesses driver fatigue by analyzing behavioral cues related to the eyes, mouth, and head, utilizing an infrared (IR) camera. The features are detected by different architectures of CNNs, and the fatigue classification process is completed by using a logical model. To evaluate the effectiveness of the proposed fatigue prediction framework, experiments were conducted by using real datasets under both nighttime and daytime lighting conditions. The results indicate that the proposed method successfully predicts driver fatigue levels with an overall accuracy of 93.3%.

The authors of [22] introduced an algorithm for recognizing fatigue states. The methodology begins with the application of a Multitask Convolutional Neural Network (MTCNN) to detect human faces. Subsequently, DLIB (v.19.19.0), an open-source software library, is utilized to identify facial key points, enabling the extraction of fatigue feature vectors from each frame. These fatigue feature vectors from multiple frames are then concatenated into a temporal feature sequence, which is input into a Long Short-Term Memory (LSTM) network to derive a final fatigue feature value. The accuracy rates of fatigue detection of their method were 88% on the YawDD dataset [23] and 90% on their self-built dataset.

The article [24] employed an RGB-D camera to capture both RGB and infrared video of the face. The facial video is analyzed independently by using the Joint Approximate Diagonalization of Eigenmatrices (JADE) algorithm [25], Fourier transform, and Triangular Surface Patch (TSP) descriptor [26] to extract three key fatigue features: heart rate, eye opening level, and mouth opening level. Subsequently, the authors developed a novel Multimodal Fusion Recurrent Neural Network (MFRNN) that integrates these three features to enhance the accuracy of driver fatigue detection. Furthermore, a Recurrent Neural Network (RNN) layer is incorporated within the MFRNN to capture the temporal dynamics of the features. Fuzzy reasoning is combined with the RNN to extract the temporal information of the heart rate to address issues related to fuzziness and noise. The proposed method achieved fatigue detection accuracy rates of 86.75% and 91.67% across two distinct datasets.

The study [27] investigated the use of geometric measurements of facial features to assess mental fatigue in construction equipment operators. Seventeen male operators participated in excavation tasks, with data collected consistently over several days during the same morning hours under clear weather conditions. Mental fatigue was evaluated through both subjective NASA-TLX scores and objective electrodermal activity (EDA) recorded via the Empatica E4 sensor. The analysis focused on specific facial features, including the eyebrow, mouth contours, head movements, eye region, and overall facial area. The results revealed significant differences in these metrics between high and low fatigue states, particularly in the eye and facial area measurements, with mean differences of 45.88% and 26.9%, respectively. The findings indicate that geometric facial assessments can serve as an effective, non-invasive method for detecting mental fatigue in this occupational context.

The authors of the OperatorEYEVP dataset, which will be utilized in our experiments, developed a method for detecting mental fatigue based on human eye movements by using this dataset [28]. Their research was grounded in the analysis of eye-tracking data collected from human operators. Within the framework of the proposed methodology, they introduced a technique for identifying the most pertinent gaze characteristics associated with mental fatigue detection, incorporating various machine learning techniques. The experimental findings indicated that the most significant characteristics included the average velocity within the fixation area, the average curvature of the gaze trajectory, the minimum curvature of the gaze trajectory, the minimum saccade length, the percentage of fixations lasting less than 150 ms, and the proportion of time spent in fixations shorter than 150 ms. The processing of eye movement data using this method is conducted in real time, achieving a maximum accuracy of 85% and an F1-score of 80%, with the random forest algorithm demonstrating the best performance.

The research study reported in [29] identified mental fatigue in operators of a construction site by using flexible headband-based sensors for the collection of raw EEG data, along with deep learning networks. The study involved fifteen operators performing a one-hour excavation task on a construction site. The NASA-TLX score served as the ground truth for assessing mental fatigue, while brain activity patterns were captured by using a wearable EEG sensor. The collected raw EEG data were utilized to develop classification models based on deep learning techniques. The performance of these models, specifically Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), and one-dimensional Convolutional Networks, was evaluated. The results indicated that the Bi-LSTM model significantly surpassed the other models, achieving an accuracy of 99.941%. These results underscore the effectiveness of the Bi-LSTM model for wearable sensor-based recognition and classification of mental fatigue, thereby enhancing health and safety measures on construction sites.

The article [30] presents an innovative methodology that employs three machine learning models in conjunction with multimodal data fusion to classify mental fatigue states. Data were collected during an excavation operation through the integration of electroencephalography (EEG), electrodermal activity (EDA), and video signals. Specifically, a headband was utilized to acquire EEG data from construction equipment operators, an E4 watch was attached to their wrists to collect EDA data, and a video camera was installed on the interior of the excavator’s front screen to capture facial feature information. Furthermore, a questionnaire was administered to gather subjective assessments of mental fatigue. The decision tree model, utilizing multimodal sensor data fusion, demonstrated superior performance compared with other models, achieving an accuracy of 96.2%. The findings suggest that multimodal sensor data fusion has the potential to facilitate the development of a real-time system for classifying mental fatigue.

The paper [31] presents the development of a non-invasive smart cushion designed to assess the mental fatigue of construction equipment operators through the real-time continuous monitoring of heart rate and respiration signals. A laboratory experiment was conducted with twelve participants who performed simulated excavator operations for data collection. Subsequently, relevant time-domain and frequency-domain features of the heart rate and respiration signals were extracted to create a random forest classification model aimed at exploring the relationship between vital signs and self-reported mental fatigue, as measured by the NASA-TLX scale. The results indicated that the random forest model achieved a classification accuracy of 92%. Furthermore, the combination of heart rate and respiration features proved to be more effective than using a single feature for identifying and classifying mental fatigue among construction equipment operators.

Table 1 summarizes the various approaches discussed previously, highlighting the features utilized and the techniques implemented for fatigue detection, along with the accuracy achieved in each case. While numerous studies report high performance and favorable outcomes, certain limitations are evident. Notably, many methods rely heavily on features derived solely from the eyes and mouth to ascertain fatigue levels. Although these features are known to correlate significantly with fatigue, there are circumstances—such as the presence of sunglasses, reflections from lighting on eyewear, or mouth coverings due to medical reasons—where accurate detection may be compromised. Furthermore, some studies have employed fuzzy logic systems or logical models for fatigue classification. These systems are often hindered by their lack of learning capabilities and their dependence on expert knowledge, as their effectiveness is contingent upon the quality of the rules established by specialists. Conversely, other research has focused on deep learning methodologies, including CNNs and RNNs, which, while powerful, are computationally intensive and necessitate substantial datasets for effective training. Moreover, the metrics used to define the fatigue states were mostly subjective, such as NASA-TLX or data collected from sensors (EDA). Nevertheless, numerous studies investigating operators’ mental fatigue have relied on sensors to gather the features utilized in their research. This reliance on sensors may be viewed as a limitation, as these devices can introduce discomfort for the participants involved in the studies. Additionally, a noticeable trend is the exclusive focus on detecting driver fatigue, which overlooks other contexts in which individuals, such as operators and students, may also experience fatigue. This narrow scope may limit the applicability of the findings across broader domains.

In an effort to address the aforementioned limitations, our study focuses on detecting mental fatigue associated with cognitively taxing tasks. This form of fatigue parallels the exhaustion experienced by students and workers after extended periods of studying or working. Importantly, our approach extends beyond the exclusive reliance on eye and mouth features, incorporating multiple features to enhance fatigue recognition. It should be emphasized that all features employed in this study are derived from pre-trained deep learning models, eliminating the need for additional sensor use or retraining, thus reducing costs. Furthermore, we utilize cost-effective machine learning techniques, such as random forest, for fatigue detection based on these estimated features. This strategy makes our approach cost-effective, straightforward, and user-friendly.

2.2. Existing Datasets

This section outlines several widely utilized datasets designed specifically for the detection of fatigue and drowsiness. Our research primarily concentrates on video datasets, which do not contain data derived from physical sensors, and the videos are captured by using ordinary cameras.

The UTA-RLDD dataset [32] comprises approximately 30 h of RGB videos featuring 60 healthy participants from diverse ethnic backgrounds, with varying ages ranging from 20 to 59 years old. Each participant contributed 1 video for each of three distinct states, i.e., alertness, low vigilance, and drowsiness, totaling 180 videos. Recordings were captured from diverse angles in real-life environments against different backgrounds. A notable aspect of this dataset is the authenticity of drowsiness samples, which contrast with staged portrayals found in other datasets. Moreover, data collection involved the use of different cameras, with each participant recording themselves in a real-life indoor setting of their choosing by using either a cell phone or a web camera.

The YawDD dataset [23] comprises two publicly available sub-datasets featuring RGB videos. Participants in these videos engage in simulated driving scenarios, while the vehicle remains stationary. In the first sub-dataset, a camera is positioned beneath the front mirror of the car, capturing three to four videos per participant. This sub-dataset offers a total of 322 videos depicting drivers (57 males and 50 females), with and without glasses/sunglasses, representing diverse ethnic backgrounds. In the second sub-dataset, which includes 29 videos, the camera is mounted on the car’s dashboard in front of the driver. Each video showcases various mouth conditions, including normal, talking/singing, and yawning.

However, some individuals in the YawDD dataset videos simulate yawning rather than naturally exhibiting the behavior. Hence, the authors of [22] decided to collect their own dataset to capture more authentic fatigue. Ten volunteers each participated in the creation of two videos: one depicting normal behavior and the other featuring signs of fatigue, including moments of eye closure, conversation, laughter, and yawning. The dataset aims to mitigate discrepancies in experimental outcomes arising from variations in skin color.

The NTHU-DDD dataset [33] consists of 360 (infrared (IR) and RGB) videos of 36 subjects showcasing a range of simulated driving situations, including regular driving, yawning, slow blinking, nodding off, and laughing. These recordings capture subjects under both daytime and nighttime lighting conditions. They were directed by an experimenter to perform a sequence of eight actions across five different scenarios: BareFace, Glasses, Sunglasses, Night-BareFace, and Night-Glasses.

The authors of [34] created the Licensed Crane Operators dataset, which features videos obtained through interviews conducted by the authors with five skilled crane operators. The videos encompasses three distinct behaviors: alertness, low vigilance, and fatigue. The recordings were captured from different camera angles and conducted across various scenarios, including working at computer stations, simulated or actual driving environments, and simulated crane operations. The participants exhibited diverse facial characteristics, behaviors, and ethnic backgrounds.

The authors of [35] introduced a multimodal database called DROZY. This database was constructed by using multimodal data obtained from 14 participants, who underwent three consecutive sessions of the authors’ version of psychomotor vigilance tests (PVTs) proposed by [36] under conditions of escalating sleep deprivation. This test produces an objective measure of vigilance and, indirectly, of drowsiness. For each participant and each PVT session, the database includes time-synchronized raw data, encompassing Karolinska Sleepiness Scale (KSS) measurements [19], polysomnography (PSG) signals, PVT data (including reaction times), and near-infrared (NIR)-intensity and -range images of the face.

The creators of the FatigueView dataset [37] introduced a comprehensive, large-scale multi-camera dataset tailored for investigating drowsiness detection in real-world driving scenarios. The dataset includes recordings from both RGB and infrared (IR) cameras positioned at five different angles, capturing genuine instances of drowsy driving along with various visual indicators of drowsiness, ranging from subtle to overt. It encompasses 17,403 distinct instances of yawning across more than 124 million frames sourced from 95 participants.

The SUST-DDD dataset [38] comprises 2074 videos captured by participants by using their mobile phone cameras while experiencing fatigue and normal states during authentic driving scenarios. Notably, participants were not directed to exhibit specific behaviors while driving, thereby ensuring the authenticity and safety of the driving experience. Nineteen participants, aged between 24 and 42, were tasked with recording videos by using their personal phones positioned in front of the driver’s seat when they felt drowsy or normal. The videos were recorded in periods of reduced daylight, such as driving to or from work.

In Table 2, we present the highlighted features of the above-mentioned datasets. It is evident that these datasets offer a considerable number of videos showcasing diverse facial characteristics, behaviors, ethnic backgrounds, lighting conditions, recording settings, and facial poses. Nonetheless, an observable trend unfolds, that is, the majority of existing datasets prioritize the simulation of driving scenarios. This emphasis consequently leads to an insufficiency of recordings that portray tasks associated with computer operation. This gap is particularly pertinent, as such tasks constitute the routine activities of numerous employees and students who may also experience fatigue and drowsiness. Moreover, upon analysis of datasets capturing tasks associated with office work or studying, it becomes evident that they lack a substantial number of informative features related to fatigue, such as vital signs, head movement, and the states of the eyes and mouth. Hence, the detection of these features is mandatory for the implementation of a fatigue recognition system using these dataset, which could be both computationally intensive and time-consuming. To address this limitation, we introduce our dataset, designed to rectify this deficiency by offering a more comprehensive collection of recordings relevant to the daily tasks of employees and students annotated with fatigue related features estimated by deep learning and computer vision methods. Such a dataset holds promise for facilitating fatigue detection research, given its direct relevance to work and academic settings, where fatigue significantly influences work performance and efficiency.

3. Dataset Description

This article introduces a dataset comprising video data integrated with numerical metrics, offering a comprehensive assessment of the subject’s physiological condition as depicted in the videos. The dataset provides parameters such as head movement in terms of Euler angles (roll, pitch, and yaw), vital signs, and eye and mouth states (eye closure and yawning), eliminating the necessity for manual calculation or detection of these parameters, which is commonly required by other existing datasets. In the following subsections, we describe the original dataset we worked with, explain our main contribution to the dataset, delineate the structure of the models used with a detailed explanation of the final tables.

3.1. OperatorEYEVP Dataset

For this purpose, we utilized the OperatorEYEVP dataset as introduced by [13]. This dataset features video recordings of ten unique individuals participating in diverse activities. These activities were documented at three distinct moments throughout the day (morning at 9:00 a.m., afternoon at 13:00 p.m., and evening at 18:00 p.m.), over a period spanning eight to ten days. The recordings occurred on weekdays from December 2022 to June 2023. The recorded data not only offer a frontal view of the participants’ facial expressions, but they also include a set of additional information. This encompasses data pertaining to pupil and head movement (including specific coordinates), scene visuals, pulse rate for each interval, and Choice Reaction Time measured twice. Furthermore, the dataset incorporates responses to various questionnaires and scales, notably the VAS-F scale. The VAS-F scale, an 18-question measure, addresses the subjective experiences of fatigue among participants. It is worth noting that these questions were administered prior to the beginning of the experimental session. The timeline of a session is shown in Figure 1.

The experimental procedure included a daily sleep quality survey conducted prior to the morning session. This was succeeded by the VAS-F questionnaire, a Choice Reaction Time (CRT) task, reading a text in a scientific style, performing the “Landolt rings” correction test, playing the “Tetris” game, and a second CRT. The latter was incorporated at the discretion of the authors, cognizant of the potential variability in operator fatigue levels between the inception and termination of the recording session. The aggregate duration of such sessions, on average, approximated one hour.

Throughout the Choice Reaction Time (CRT) assessment, a comprehensive group of parameters was recorded, such as the average reaction time, its standard deviation, and the number of errors made by participants during task performance. Participants engaged in the reading of scientific-style text, which can be a typical work-related activity. This task served as a static cognitive load task aimed at evaluating cognitive performance. The Landolt rings correction test is a well-established method for evaluating visual acuity. Based on [13], the following parameters were recorded during the Landolt rings test:

Time spent (t).
The total number of characters viewed up to the last selected character (N).
The total number of lines viewed (C).
The total number of characters to be crossed out (n).
The total number of characters crossed out (M).
Correctly selected letters (S).
Letter characters missed (P).
Wrongly selected characters (O).

These parameters facilitate the computation of various metrics, including attention productivity (K), mental performance (Au), and stability of attention concentration (Ku), using the following equations:

A u = \frac{N}{t} * \frac{M - O + P}{n}

(1)

K = \frac{(M - O) * 100}{n}

(2)

K u = C * \frac{C}{P + O + 1}

(3)

Finally, the Tetris game served as a dynamic active task to investigate hand–eye coordination. The recorded variables included the number of games played, the levels reached, the scores achieved, and the lines cleared.

This dataset’s demographic is primarily composed of individuals with a mean age of 27.3 years, extending from a minimum age of 21 to a maximum of 50 years. It is evident that the sample is skewed towards the younger age spectrum, with the majority below the age of 33. Furthermore, there were no reported instances of significant health issues that could impact vital signs, with the exception of the first participant, who exhibited symptoms of depression. However, it is important to note that the participants predominantly have lighter skin tones. This demographic characteristic may present limitations for future applications of this approach, particularly in assessing fatigue among older individuals, those with serious health conditions, and individuals with darker skin tones. This observation highlights the future need to expand the dataset to include a more diverse range of ages, health conditions, and ethnicities, thereby enhancing the generalizability of the approach.

3.2. Annotation of Dataset with Physiological Indicators

Our primary contribution to the dataset entails the annotation, performed at one-minute intervals, of the videos encompassing four main physiological indicators: blood pressure, heart rate, oxygen saturation, and respiratory rate. We selected the one-minute interval because, as we discuss in more detail in Section 3.3, the models used for parameter estimation make predictions at the level of seconds, frames, or minutes. Therefore, a one-minute interval was deemed the most suitable choice for synchronizing the estimations. Moreover, supplementary metrics were computed, including estimation of head pose via Euler angles (roll, pitch, and yaw), the proportion of frames exhibiting Euler angles surpassing 30 degrees relative to the total frames within a one-minute span, ratios of eye closure and mouth opening, instances of yawning, numbers of prolonged eye closures exceeding two seconds, and characteristics pertaining to breathing patterns, encompassing rhythmicity and stability.

The computation of the aforementioned parameters was carried out by using computer vision and deep learning approaches developed by members of our laboratory, which are discussed in the subsequent subsection. These approaches were applied to each video within the OperatorEYEVP dataset, and annotations were made for the 15 parameters at each minute of the recordings. For each participant, a comprehensive CSV file was created to document all the details of the session. This file encompassed information related to the CRT and Tetris tasks, as well as eye parameters and HRV parameters provided by the dataset in addition to the parameters extracted by our models with additional features we calculated from the coordinates of the eye-tracking data. A more in-depth exploration of the CSV files, including their structure and contents, is provided later in the discussion.

3.3. HFAVD Dataset Structure

In this subsection, we present an overview of the models utilized for extracting physiological features used in the annotation of the videos. These models leverage deep learning, machine learning, and computer vision techniques to estimate specific physiological indicators based on facial video recordings of the subjects, as shown in Figure 2. Through these models, we were able to estimate 15 features at each minute of every video, capturing a comprehensive range of physiological information. The physiological features were extracted by using deep learning models with a sampling frequency of one second, as will be detailed in subsequent sections. However, many of these features exhibit minimal variation within a one-second interval, which can result in numerous identical or highly similar samples. This redundancy may compromise the reliability of the experimental outcomes. Consequently, we opted to compute the average of these indicators for each minute and label them with the corresponding mental performance value. This approach was selected to retain as many data as possible, ensuring they remained adequate for the machine learning models. Therefore, increasing the sampling interval from 1 min to 5, for example, could lead to insufficient data, adversely affecting performance, while decreasing it would result in duplicated samples, thereby rendering the experiments invalid and unreliable.

The estimated physiological features were compiled and organized into comprehensive CSV files. These CSV files enhance the distinctiveness of the dataset compared with existing ones. They include almost every detectable or extractable feature from the subject while the latter performs various tasks, without the need for sensors or external devices apart from the widely available camera. This distinguishing characteristic contributes to the dataset’s versatility and accessibility for researchers. These models are explained as follows.

3.3.1. Respiratory Rate and Breathing Characteristics

To estimate the respiratory rate, the approach proposed by [39] was employed, encompassing the following steps: (a) Chest keypoint detection using OpenPose: The OpenPose framework was utilized to detect the keypoint corresponding to the chest region in the video frames. (b) Displacement detection using SelFlow: An Optical Flow-based neural network called SelFlow was employed to detect the displacement of the chest keypoint between consecutive frames. (c) Projection of displacement along the x- and y-axes: The displacement values obtained were projected onto the x- and y-axes, separating the movement into vertical (up/down) and horizontal (left/right) directions. (d) Signal processing techniques: Signal processing methods, including filtering and detrending, were applied to the displacement data to enhance the accuracy of respiratory rate estimation. (e) Calculation of true peak count and scaling: The true peak count was calculated based on the processed displacement data. This count was then scaled to estimate the number of breaths per minute. Furthermore, in addition to estimating the respiratory rate, various breathing characteristics, such as stability and rhythmicity, were assessed based on the amplitude and wavelength of the respiratory wave. The mean absolute error (MAE) for respiratory rate estimation was 1.5 breaths per minute (bpm). Regarding the limitations of this model, its effectiveness is restricted by its inability to handle body movement efficiently and its reliance on the subject maintaining a stationary position during recording. This requirement poses the risk of inaccurate estimations. Consequently, the model is unsuitable for use when the subject is walking, engaged in motion, or even driving on the road, due to the presence of vehicle vibrations.

3.3.2. Heart Rate

The heart rate estimation approach proposed by [40] was employed in this study. The methodology involved the following steps: (a) Extraction of the Region of Interest (Face): Landmarks obtained from the 3DDFA_V2 framework [41] were utilized to extract the Region of Interest, specifically the face, from the input data. (b) Processing through a Vision Transformer: The extracted face region was then processed through a Vision Transformer model, incorporating multi-skip connections. This process generated features from five different levels. (c) Block processing: The output from each level of the Vision Transformer was passed through a block structure. This block structure consisted of a Bidirectional Long Short-Term Memory (BiLSTM) layer, batch normalization, 1D convolution, and a fully connected layer. (d) Averaging of block outputs: The outputs from the five different blocks were averaged to obtain the minimum, maximum, and mean heart rates. (e) Weighted averaging: The minimum, maximum, and mean heart rates were then combined by using a weighted averaging scheme to estimate the final heart rate. This model demonstrated superior accuracy in heart rate estimation compared with previously published methods, as evidenced by its performance on the LGI-PPGI and V4V datasets, achieving mean absolute errors (MAEs) of 2.68 and 9.996 bpm, respectively. Nevertheless, it is important to note that the dataset utilized for training the model exhibited a lack of samples representing both very low and very high heart rates. Consequently, the model may produce inaccurate estimations when presented with subjects exhibiting specific medical conditions characterized by extreme heart rate values.

3.3.3. Blood Pressure

The estimation of blood pressure followed the approach proposed by [42]. The methodology consisted of the following steps: (a) Extraction of Regions of Interest: The left and right cheeks were identified as the Regions of Interest (ROIs) in each frame of every video. These sequential images were used as input for subsequent analysis. (b) Convolutional Neural Network (CNN) for spatial feature extraction: A CNN architecture was employed to capture spatial features from the ROIs. For systolic blood pressure estimation, EfficientNet_B3 was utilized for the left cheek, while EfficientNet_B5 was used for the right cheek. Conversely, for diastolic blood pressure estimation, an ensemble approach was adopted. This involved combining EfficientNet_B3 and ResNet50V2 for the left cheek and the right cheek, respectively. (c) Long Short-Term Memory (LSTM) for temporal feature extraction: The outputs obtained from the previous step were fed into an LSTM network. This enabled the extraction of temporal features within the sequence of images, considering the dynamic nature of blood pressure. (d) Fully connected layers for blood pressure estimation: Two fully connected layers were employed to derive the blood pressure values based on the extracted spatial and temporal features. Upon training and testing the proposed model, the mean absolute errors recorded were 11.8 and 10.7 mmHg for systolic and diastolic blood pressure, respectively. In parallel, the model’s mean accuracies were found to be 89.5% for systolic blood pressure and 86.2% for diastolic blood pressure. The main challenge faced in training this model was the lack of diversity in the dataset, particularly in terms of the subjects’ skin color. Most of the subjects had light-to-medium skin tones, which impacted the accuracy of the models in predicting blood pressure in individuals with dark skin tones. Another limitation was the dataset’s imbalance, with a scarcity of unusual values for systolic and diastolic blood pressure.

3.3.4. Oxygen Saturation

The estimation of oxygen saturation in this study was conducted based on the method proposed by [43]. The procedure involved the following steps: (a) Extraction of face region: The face region was initially extracted by using the 3DDFA_V2 framework. This facilitated the isolation of the relevant area for subsequent analysis. (b) Input to VGG19: The extracted face region was then used as input for the VGG19 model, which had been pre-trained on the ImageNet dataset. This allowed for the extraction of relevant features from the face region. (c) Utilization of XGBoost: The output obtained from VGG19 was subsequently fed into an XGBoost algorithm. XGBoost is a machine learning framework that combines multiple decision trees to make predictions. In this case, it was employed to obtain the estimation of the oxygen saturation value. Following the training phase, the model was subjected to testing by using two distinct datasets. The derived mean absolute error (MAE) for the first test set was found to be 1.17%, while for the second test set, it was marginally lower, at 0.84%. The primary challenge encountered in training this model pertained to the limited number of samples depicting SpO2 levels below 85. This scarcity poses a significant obstacle for the models, particularly when examining subjects with specific health conditions that are associated with low SpO2 levels.

3.3.5. Head Pose

The determination of the head pose, characterized by Euler angles (roll, pitch, and yaw), was carried out by using the methodology described in [44]. The approach involved the following steps: (a) Face detection using YOLO Tiny: Initially, the YOLO Tiny framework was utilized to detect the face in the input data. This allowed for the identification of the Region of Interest. (b) Three-dimensional face reconstruction for landmark alignment: Subsequently, a 3D face reconstruction technique was applied to align the facial landmarks. This process ensured accurate landmark detection even for those not directly visible to the camera. (c) Landmark analysis for Euler angle calculation: Once the facial landmarks were detected, the transitions and rotations between landmarks across successive frames were analyzed. This analysis enabled the calculation of the Euler angles, providing information about the head pose. The limitation of the article is that the model proposed can only estimate head angles up to a maximum value of 70 degrees. This limitation restricts its applicability in scenarios where larger head angles need to be accurately detected. Additionally, the model’s performance is slower compared with the Dlib detector, which itself loses accuracy when the head angle exceeds 35 degrees. This slower performance could hinder real-time applications that require fast and accurate head angle estimation.

3.3.6. Eye and Mouth State

The eye state was assessed through the utilization of a trained model. This model takes the detected face, identified via the FaceBoxes framework, as its input and generates an output that signifies whether the eyes are open or closed. Similarly, the yawning state is recognized by employing a modified version of MobileNet, as proposed by [45]. This model achieved an accuracy of 95.20%. However, a limitation of this model is that it relies on a private dataset for validation. This could potentially limit the generalizability of the findings, as the dataset is not representative of the broader population.

The videos in the OperatorEYEVP dataset exhibit a high degree of similarity to the conditions found in the datasets used for training the models. These similarities encompass factors such as lighting conditions, skin color, and the demographic characteristics of the subjects, who were healthy and young. Consequently, the likelihood of encountering extreme or abnormal values in vital signs, particularly oxygen saturation, is minimized. However, the insufficiency of sensor data in the OperatorEYEVP dataset prevents the calculation of estimation errors. Nevertheless, considering the aforementioned conditions, it is reasonable to assume that the models exhibit good accuracy. This assumption is further supported by the alignment of the obtained results with expectations and their consistency with the existing literature.

3.4. Data Example

As previously stated, inclusive CSV files were generated to meticulously record all the details of each video within the OperatorEYEVP dataset. In this subsection, we provide a thorough description of these files, elucidating the significance of each feature included therein, as well as their intended purpose. Additionally, we provide examples of how these files could potentially be utilized by other researchers.

For each participant, a CSV file encompassing 199 columns was generated to provide an extensive description of the session performed by the subject (see Table 3). The initial columns within the file include metadata associated with the session, including participant ID, session name, the type of activity undertaken, and the time of day during which the session occurred. Subsequently, information pertaining to the CRT sessions, specifically the first and second sessions, is listed. This information includes metrics such as the average reaction time, standard deviation of reaction time, and the number of errors made.

Following that, we included the pertinent information related to the Landolt rings test, encompassing a variety of indicators and metrics. This includes many indicators about attention, stability of attention and work accuracy, the quantification of mental productivity through a mental productivity index, and the evaluation of mental performance. Additionally, we incorporated a concentration indicator, the qualitative characteristic of concentration, the capacity of visual information processing, and the processing rate. Furthermore, we listed additional information related to the test results, such as the total number of characters crossed out, letters correctly selected, letter characters missed, wrongly selected characters, total number of lines viewed, and other relevant variables.

Subsequent to this, we added a collection of features related to eye movement strategy (pupil movement) observed during each session. These features, totaling 60 in number, encompass measurements related to fixation and saccadic time and velocity, alongside additional parameters associated with acceleration. Additionally, an extensive set of 79 parameters were included, encompassing various domains such as time, frequency, and nonlinearity, all of which are related to heart rate variability. Notable examples within this set include VLF, LF, and HF band frequencies, as well as relative and absolute power measurements for these frequency bands, among numerous other relevant metrics. In addition, we extracted the absolute heart rate for each minute from the PPI signal provided. We thought that it would be useful to include both real and predicted heart rate (by our model (Section 3.3.2)), so that many experiments and comparisons may be carried out by researchers to evaluate the possibility of integrating and relying on computer science for such tasks. All the aforementioned features were provided by [13].

The next stage involved extracting the relevant features from the x- and y-coordinates from the eye movement data, which were represented as two separate time series for each activity. To accomplish this, we employed basic Python (v. 3.6.9) code in conjunction with the NumPy library to compute seven key features from each series, namely, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile, and 75th percentile.

Finally, we added the 15 features extracted by our models after processing the videos from the OperatorEYEVP dataset. These features included respiratory rate, rhythmicity and stability of breathing, eye closure ratio, mouth opening ratio, head Euler angles (roll, pitch, and yaw), the ratio of frames where any Euler angle exceeds 30 degrees relative to the total frames in one minute (angles > 30), the count of eye closures lasting more than 2 s (count of eye closure > 2 s), the count of yawns, heart rate, systolic and diastolic blood pressure, and blood oxygen saturation. The final column in the dataset represents the subjective assessment of fatigue, specifically measured by using the Visual Analogue Scale for Fatigue (VAS-F). The participants themselves provided this assessment, indicating their perceived level of fatigue.

We believe that our contribution to the OperatorEYEVP dataset offers a valuable opportunity for researchers to enhance their understanding of the relationship between mental fatigue and various responses exhibited by subjects. These responses encompass both observable cues, such as head movement and eye and mouth state, as well as underlying indicators, such as vital signs. Researchers can gain insights beyond fatigue alone by exploring the correlations between these responses and multiple features, including attention, work accuracy, concentration stability, and mental performance. We consider that our previous approaches and research have been employed in an optimal manner. The advantage of utilizing these approaches lies in their ability to provide access to all the aforementioned features without the need for sensors or external devices. This not only saves considerable time and effort in conducting studies but also eliminates the need to develop costly approaches to detecting specific features. Consequently, researchers can promptly apply and test their hypotheses and theories. Furthermore, the availability of annotating videos capturing various activities, not solely limited to fatigue, can provide a deeper understanding of the human mechanism.

4. The Proposed Approach and Its Application to the HFAVD Dataset

In this section, we delve into the experiments used to analyze the dataset, along with presenting the resulting outcomes. The dataset provided was utilized to detect fatigue levels based on the psychological traits extracted by our developed models, as explained in Section 3.2. The overall framework is illustrated in Figure 3.

As illustrated in the diagram, our approach involves extracting key physiological indicators from the videos of operators using a set of specialized models developed for this purpose. These indicators include heart rate, respiratory rate, blood pressure, oxygen saturation, eye closure ratio, head pose, and yawning frequency. Each of these physiological signals is calculated on a per-minute basis, ensuring a continuous, time-resolved input into the fatigue detection process. Once these indicators are extracted, they are used as input features for our fatigue detection model. This model is designed to assess the operator’s fatigue state based on the mental performance (AU), which is derived from the results of the Landolt rings test—a well-known measure of visual acuity and attention. The Landolt test provides insights into the operator’s cognitive state, specifically focusing on attention and concentration.

To determine the optimal threshold for AU, which separates the fatigue state from the non-fatigue state, we conducted a detailed analysis of the data. This analysis involved evaluating the mental performance across different AU values to identify the point at which fatigue became evident. In addition to AU, we also analyzed two other important cognitive measures: the attention level (K) and the stability of concentration (Ku). These metrics helped us gain a more comprehensive understanding of how attention and focus degrade over time, particularly as fatigue sets in.

We employed the results of the Landolt test, particularly mental performance, to assess the fatigue level due to its capacity to effectively capture the cognitive and attentional aspects associated with fatigue. Unlike subjective measures, such as the Visual Analog Scale for Fatigue (VAS-F), which rely on self-reported perceptions, which can be swayed by individual biases and emotional states, AU provides a more objective and standardized evaluation of fatigue. Table 4 shows the mental performance (Au, attention (K), and the stability of concentration of attention (Ku) for each participant.

Given that attention decreases when a person is fatigued, we can draw meaningful insights from our data. For example, participant number 6 has the attention level above 60, which is considered good [46]. That indicates that the participant 6 is not fatigued. Therefore, we can conclude that the fatigue threshold for mental performance must be below 0.59, as individuals with mental performance scores above this threshold maintain high attention levels.

To predict the fatigue state, we based our study on the hypothesis that mental performance becomes lower with the increase in fatigue. Figure 4 illustrates the relationship between mental performance (as measured by the Landolt rings test) and the inverse of fatigue (1/Fatigue). We assume that there is a threshold value (a point or a small range of points) where fatigue level change causes a significant change in physiological indicators. We also assume the this change can be captured by a machine learning model.

We assessed each participant’s mental performance, classifying them as fatigued if their performance fell below a predetermined threshold. To identify the optimal threshold, we conducted multiple experiments by using the random forest algorithm, testing thresholds from 0.2 (chosen empirically from our dataset) to 0.59 (discussed above) in increments of 0.015. This increment value was chosen experimentally, as performance differences were negligible with smaller increment values. We selected the random forest algorithm for evaluation due to its superior results demonstrated in our previous research [2]. Figure 5 shows the F1-score for different threshold values obtained through experiments.

Moreover, we recognized that vital signs vary significantly between individuals. To address this, we meticulously normalized these metrics before incorporating them into our predictive models. This normalization allowed our models to accurately account for individual differences, thereby providing more reliable predictions of fatigue. We used the min–max normalization function shown in Equation (4).

X^{'} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}

(4)

where the following apply:

X is the original vital sign value.
$X^{'}$ is the normalized vital sign value.
$X_{\min}$ is the minimum value of the vital sign for the participant.
$X_{\max}$ is the maximum value of the vital sign for the participant.

As we can see from the figure, there is a threshold value, 0.38, which achieves the best F1-score.

To predict the fatigue state, several methods were evaluated, including Support Vector Classifier (SVC), logistic regression, Multi-Layer Perceptron (MLP), decision tree, XGBoost, and random forest. Each method has distinct characteristics, which are shown in Table 5.

The hyperparameters for each model were fine-tuned by using grid search to optimize performance. For SVC, we used a Radial Basis Function (RBF) kernel with a regularization parameter C = 1.0 and the kernel coefficient set to ‘scale’. In logistic regression, we applied an L2 (Ridge) penalty with a regularization strength C = 0.001 and used the ‘lbfgs’ solver for optimization. The MLP classifier was configured with two hidden layers, with each using a linear activation function. The ReLU function was applied as the activation function for the neurons, Adam was selected as the solver, and the learning rate was set to

1 \times 10^{- 3}

. For the decision tree, the Gini impurity was used as the splitting criterion, with a maximum tree depth of 20, a minimum of two samples required per leaf, and a minimum of two samples required to split a node. In the XGBoost model, the ‘gbtree’ booster was utilized. Key hyperparameters included a learning rate of 0.1, a maximum depth of 3, a minimum loss reduction (gamma) of 2, a feature subsample ratio (colsample_bytree) of 0.8, and 100 estimators (trees). The random forest classifier also employed 100 estimators with the Gini impurity criterion. It had a maximum tree depth of 20, used ‘log2’ as the maximum number of features to consider when splitting, and required a minimum of two samples to split a node. The environment used for these experiments was Python, with scikit-learn employed for SVC, logistic regression, MLP, decision tree, and random forest, while the XGBoost library was used for the XGBoost method. This setup ensures that the characteristics of each classification method are clearly presented and understood.

Table 5 shows the performance of the models using the threshold of 0.38 to detect the fatigue state based on the mental performance.

The random forest method addresses the issue of overfitting commonly associated with decision trees by combining the predictions of multiple trees to improve accuracy and robustness. While a single decision tree may capture noise in the data, random forests reduce this risk by averaging predictions across many trees, leading to better generalization. Random forest is an ensemble learning method that builds multiple decision trees by using bootstrap sampling (bagging) and random feature selection to reduce correlation between trees [47]. Each decision tree is constructed by using algorithms like ID3 (Iterative Dichotomiser 3), C4.5, or CART (Classification and Regression Trees). These algorithms split nodes based on criteria such as Gini impurity or entropy/information gain for classification problems, like detecting human fatigue. The final random forest prediction is made by aggregating results—majority voting for classification. It also uses out-of-bag (OOB) error for performance estimation and can calculate feature importance.

It is important to mention that the dataset, collected during working hours at various times of the day, reflects a greater number of non-fatigue state samples compared with fatigue state samples. For the testing dataset, this imbalance is evident, with the non-fatigue state (853 samples) samples being almost twice as numerous as the fatigue state samples (358 samples). This imbalance can make the accuracy metric less reliable, as it may not fully capture the performance of the models on the minority class. To address this issue, we also evaluated the F1-score, which is a more comprehensive measure by considering both precision and recall. As shown in Table 4, while all models exhibit high accuracy, there is significant variation in the F1-scores. Models such as SVC and logistic regression, despite their high accuracy, have relatively low F1-scores (0.4831 and 0.5215, respectively), indicating poorer performance in predicting the minority class. Conversely, models like MLP, decision tree, XGBoost, and random forest demonstrate both high accuracy and high F1-scores, with random forest achieving the highest F1-score, 0.9465. This indicates a better handling of the imbalanced data. The experiments revealed that the threshold of 0.38 yielded the best F1-score, with the random forest model achieving the F1-score of 0.9465 in fatigue prediction based on vital signs.

To analyze the effect of feature correlation on classification performance, it is essential to asses the relationship between the input variables. In this context, features exhibiting high correlation include num_eye_closure > 2 s and eye_closure_ratio, as well as rhythmicity_coeff and stability_coef, which were assessed by using Spearman rank correlation coefficient. However, other potential correlations, such as those between head Euler angles (roll, pitch, and yaw) with angles exceeding 30 degrees and the relationship between mouth_openness_ratio and num_yawns, were not captured by the Spearman method. This limitation may arise from the non-monotonic relationship between these variables. Notably, when training a random forest classifier without including the participant ID, the model achieved an F1-score of 0.7990 by using the original 15 features. Subsequent removing the redundant features, identified through correlation analysis, led to an improvement in the F1-score to 0.8384, highlighting the significance of feature selection in enhancing classification performance. This finding underscores the importance of addressing feature redundancy to optimize model performance in classification tasks.

Another crucial aspect is analyzing the most relevant characteristics for classification. Given the numerous variables used to characterize the problem, it is essential to determine their relevance and impact on classification performance. The relevance of each feature was analyzed, revealing that the most to least relevant features are BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, average pitch, average roll, mouth_openess_ratio, average yaw, average RR, rhythmicity_coeff, and stability_coef. The experiments shown in Table 6 revealed that using all 13 features resulted in an accuracy of 98.1% and an F1-score of 0.8658. Reducing the number of features to 11 increased the accuracy to 98.27% and the F1-score to 0.87481. A further reduction to nine features improved the accuracy to 98.43% and the F1-score to 0.8891. However, using only seven features slightly decreased the accuracy to 98.1% and dropped the F1-score to 0.8597. Leaving less than seven features produces significantly worse results. These results indicate that reducing the number of features can enhance classification performance up to a point, highlighting that not all features are equally relevant. Thus, focusing on the most significant features can enhance the model’s efficiency and accuracy. Another notable observation is that the high importance of the participant ID feature means that the fatigue assessment models tend to be person-dependent. Table 7 shows the drop in the methods’ performance after dropping the participant ID feature.

The participant ID’s role in classification processes is underscored by a set of research observing differential reactions to changes in mental fatigue states among subjects, as evidenced in alterations in vital signs and eye movement. In one study, ref. [48] discovered that exposure to a fatigue-related task led to an increase in blood pressure in some participants, while others registered a decrease. This observation extended to respiratory rates, reinforcing the individual variance in physiological responses to fatigue. Moreover, ref. [49] found a correlation between mental fatigue and eye movement data (including blink frequency) in three of their subjects, as captured by an eye detection system, while the rest of the subjects did not show any significant relationship. These findings underscore the crucial nature of participant ID as a parameter. It enables machine learning models to account for person-specific dependencies, thereby enhancing their predictive accuracy for fatigue levels. Thus, the participant ID emerges as a significant factor in optimizing classification processes by accommodating individual variability.

Table 8 presents a comparative analysis of existing methodologies for detecting fatigue through images or videos without the use of sensors or specialized equipment. While our approach demonstrates good accuracy (98.43%), it is important to note that the dataset employed is relatively small compared with those utilized in other studies, such as [22]. The mentioned accuracy rates in the table were calculated based on different data, where fatigue was identified based on different metrics (subjective, objective, or based on monitoring some features); hence, the approaches listed cannot be compared directly. Nevertheless, the accuracy of our models is high enough for the proposed models to be used for fatigue detection. Additionally, this limitation stems from the scarcity of public datasets addressing mental fatigue among operators in work and academic contexts, highlighting a significant gap that our study aims to fill, encouraging further research efforts to address this. Nevertheless, the only approach that allows for a valid comparison is [28], as it utilized the same dataset. However, our approach demonstrates superior performance by achieving higher accuracy through the use of physiological features rather than relying on eye-tracking data. Additionally, it is noteworthy that many existing approaches primarily rely on features related to the eyes and mouth. However, factors such as the use of sunglasses, reflections from lighting on eyewear, or mouth coverings for medical purposes can affect the detection accuracy. In contrast, our method incorporates a diverse range of parameters that account for both the internal and external states of the subject, thereby enhancing the robustness of our approach in managing these challenges. Furthermore, the majority of the existing datasets rely on subjective indicators to evaluate fatigue levels among participants. These subjective measures can be influenced by personal perceptions and biases, which may compromise the validity of the findings. On the other hand, our study adopts a more robust approach by employing an objective metric to assess fatigue. This choice not only minimizes potential biases but also enhances the overall reliability and validity of our results.

5. Discussion

The results of the article shed light on the proposed association among mental performance, mental fatigue, and attention indicators. This is particularly noteworthy considering the existing literature that suggests strong correlations between mental fatigue and a range of physiological features, including head movement [50], vital signs (blood pressure, heart rate, oxygen saturation, and respiratory rate) [48,51], blinks [52], and yawns [53]. These findings provide empirical support for the idea that these physiological measures can serve as reliable indicators of mental fatigue. Moreover, the study’s findings reveal that employing the identified mental performance threshold value of 0.38 leads to a high F1-score. These findings further indicate a robust relationship between mental performance and the aforementioned physiological features, which subsequently reflect the presence of mental fatigue.

In addition, the study findings not only reinforce the proposed connection among mental performance, mental fatigue, and attention indicators, but they also provide empirical evidence of the reliability and validity of utilizing specific physiological features and thresholds to assess mental fatigue. These insights contribute to a deeper understanding of the intricate relationship between mental performance and associated physiological manifestations, highlighting the potential for employing these measures in improving fatigue detection systems.

In our analysis, we believe that advanced deep learning techniques such as Transformers, CNNs, and RNNs are suited for more complex tasks, such as extracting deep layer features from video; hence, these type of models are used to extract vital signs from video data, a critical aspect of our research, after demonstrating their high capability due to their training on extensive datasets. Conversely, for the specific task at hand, traditional machine learning algorithms are more appropriate. This is primarily because they require significantly fewer data and are computationally efficient, allowing us to streamline our approach and avoid unnecessary complexity and time consumption.

Another reason that led us to exclude advanced deep learning techniques for the classification task is that the available data are insufficient for the effective implementation of deep learning algorithms. Additionally, it is noteworthy that most state-of-the-art (SOTA) fatigue assessment methodologies primarily address driving fatigue, often overlooking the fatigue associated with office work. Consequently, we consider our research to be novel and innovative, as it specifically targets this issue through objective measurement techniques, i.e., the mental performance assessment using the Landolt ring.

This study faced certain limitations due its relatively small size. However, this problem cannot be solved by merging our dataset with the other datasets mentioned in Section 2.2, because we collected different fatigue indicators in our dataset, such as VAS-F, Landolt rings test, and CRT. Based on our experiments described in [28], we found that Landolt rings test is the most reasonable in the considered case of fatigue detection. As a result, our models have been trained for the fatigue level estimated by the Landolt rings test, which is absent in the datasets mentioned in Section 2.2; hence, it is not possible to extend our dataset with the datasets introduced in Section 2.2. Additionally, another limitation was including group of participants with limited age diversity and ethnic background. As people age, heart rate and respiratory rate remain relatively stable, but maximal heart rate decreases, blood pressure—particularly systolic—tends to rise due to stiffening arteries, and oxygen saturation may decline, especially in those with chronic conditions. However, we have not studied how the data change across different age groups and ethnicities. These factors may hinder the generalizability of the proposed approach. One potential constraint that could impede the generalizability of the trained model is the fact that fatigue detection is human-dependent. This factor could pose challenges in applying the proposed model to novel subjects. Moreover, there remains ambiguity in determining the fatigue threshold, as the VAS-F result is subjective. To address this, we opted for the mental performance indicator as a replacement for fatigue. However, for greater reliability, it is recommended to establish an explicit value for the fatigue threshold.

Another challenge that opens up the discussion is the assessment of the validity of the data predicted by deep learning models. The majority of models employed for physiological indicator estimation have undergone testing on multiple datasets, distinct from the ones used for training. This is as detailed in the respective articles, which are cited in the reference section. The outcomes demonstrated high performance and low error margins, thereby validating their application to the current dataset. This dataset has a significant resemblance to the conditions present in the training datasets, including aspects such as lighting conditions and the skin tone and demographic characteristics of the subjects, who were predominantly healthy and young. As a result, the probability of encountering abnormal values in vital signs is considerably reduced. However, this limitation can be effectively addressed by expanding the training datasets to include a more diverse population. Specifically, incorporating individuals from various age ranges, ethnic backgrounds, and those who suffer from serious health conditions that may impact their vital signs will enhance the models’ robustness. Doing so would better equip the models to be generalized across different demographic groups and health conditions, ultimately improving their accuracy and reliability in estimating physiological indicators for a broader population. This approach would not only mitigate existing biases but also ensure that the models can more accurately reflect the complexities of real-world health scenarios.

6. Conclusions

This research article emphasizes the significance of fatigue detection in system operation contexts and its direct impact on efficiency. The objective of this study is to assess the mental performance of operators who work long hours in front of their computers. Mental performance is considered the opposite of mental fatigue. Accordingly, we concentrate on office routines that do not involve physical work. This study starts with exploring various approaches and datasets that utilize videos, sensors, and devices to capture features associated with fatigue. This paper introduces an ML-based approach that integrates video data with features computed by using computer vision deep learning models. These features encompass head movement (roll, pitch, and yaw), vital signs (blood pressure, heart rate, oxygen saturation, and respiratory rate), and eye and mouth states (blinking and yawning). By incorporating these features, the need for manual calculation or external sensors is eliminated, distinguishing it from existing datasets that rely on such methods. The primary objective of this research is to advance fatigue detection in work and academic settings.

The experiments were carried out by using the publicly available dataset OperatorEYEVP. The choice to apply the estimation process to this specific public dataset was motivated by the relevance of the videos provided, which depict the daily tasks of operators and students. As the main objective of this research is to investigate fatigue in this specific context, it was deemed appropriate to utilize this dataset. The specific task examined is reading scientific articles, as it necessitates a focused and alert mind to effectively analyze the information presented. Additionally, some individuals engage in gaming during their breaks, which is presented in the dataset used by the Tetris game, which also requires a significant level of concentration. Furthermore, we used an objective measurement of the mental performance by using the Landolt rings test to evaluate visual acuity. The physiologically indicative parameters, computed through computer vision techniques, were compiled into another publicly accessible dataset, HFAVD, presented in this paper. Future endeavors are directed towards applying this methodology to other video datasets and expanding the HFAVD dataset.

Through a series of experiments utilizing machine learning techniques, the dataset was analyzed to assess the fatigue state based on the predicted features by using deep learning models (Section 3.3). The mental performance indicator obtained from the Landolt rings test was utilized as a measure of mental fatigue. The primary objective was to investigate the relationship between this indicator and the attention metric in order to determine the optimal starting threshold for dividing our ground truth into fatigued and non-fatigued classes. Furthermore, a series of experiments were conducted to identify the most effective final value for identifying sessions characterized by mental fatigue in participants.

The research findings indicate that the threshold of 0.38 exhibits the highest potential for effectively discriminating between subjects experiencing mental fatigue and those who are not fatigued for achieving superior performance in the classification process. We employed a range of machine learning techniques, including Support Vector Classifier (SVC), logistic regression, Multi-Layer Perceptron (MLP), decision tree, XGBoost, and random forest. To optimize model performance, we utilized grid search to identify the best hyperparameters that yield the highest accuracy and F1-score. Our results indicate that most of these techniques achieved impressive accuracy levels, all exceeding 90%. However, we prioritized the F1-score as a critical performance metric, given that our test set was imbalanced. By attaining a high F1-score, we ensured that the models performed well across both classes, thereby mitigating potential bias issues. The outcomes consistently showed that random forest achieved the highest F1-score, 94%. These results highlight random forest as a highly promising technique for fatigue detection. XGBoost, decision tree, and MLP reached good F1-scores of 86%, 82%, and 87%, respectively.

It was also revealed that participant ID emerged as a significant factor in optimizing classification, showing that accounting for person-specific dependencies enhances the predictive accuracy and F1-score of machine learning models for the fatigue state. This observation was made by comparing the classification metrics when incorporating participant ID as a feature in the model versus when it was excluded. We found that the F1-score declined from 94% to 81%, highlighting the significance of individual differences in the context of mental fatigue classification.

Additionally, the concept of feature importance motivated us to investigate which features contribute most significantly to enhancing classification performance. Consequently, we conducted an experiment in which we sequentially analyzed the top n features based on their relevance. The results indicate that using 9 out of the 13 features achieved the highest accuracy, 98.43%. This finding suggests that we can streamline our approach by excluding certain less impactful features, such as those related to respiratory rate, thereby reducing the time required for implementation.

In summary, this study contributes to the topic of fatigue detection by presenting an ML-based approach and its application to a comprehensive dataset, demonstrating the effectiveness of machine learning in accurately assessing fatigue levels. The integration of video data and the use of deep learning techniques offer a more efficient and practical approach to identifying signs of fatigue, potentially enhancing safety in various contexts. Further research and development in this area can lead to the implementation of effective fatigue detection systems that mitigate risks and improve overall well-being.

Author Contributions

Conceptualization, N.S.; methodology, N.S. and W.O.; software, W.O. and B.H.; validation, N.S. and W.O.; formal analysis, N.S. and W.O.; investigation, N.S., W.O. and B.H.; resources, N.S.; data curation, A.K., W.O. and B.H.; writing—original draft preparation, A.K., W.O. and B.H.; writing—review and editing, N.S. and A.K.; visualization, W.O.; supervision, N.S.; project administration, N.S.; funding acquisition, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research study was funded by the Russian Science Foundation (project 24-21-00300).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://disk.yandex.ru/d/A49WtGSODl7eCw (accessed on 11 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Bi-LSTM	Bidirectional Long Short-Term Memory
CNN	Convolutional Neural Networks
CRT	Choice Reaction Time
CSV	Comma-Separated Values
EAR	eye aspect ratio
ECG	electrocardiography signals
EEG	electroencephalogram signals
EDA	electrodermal activity
EM	EYE-MOUTH
FIS	Fuzzy Inference System
FOM	Frequency of Open Mouth
HF	high frequency
HFAVD	Human Fatigue Assessment Based on Video Data
HOG	Histogram of Oriented Gradients
HRV	heart rate variability
JADE	Joint Approximate Diagonalization of Eigenmatrices
KSS	Karolinska Sleepiness Scale
LF	low frequency
LSTM	Long Short-Term Memory
MAE	Mean Absolute Error
MFRNN	Multimodal Fusion Recurrent Neural Network
MLP	Multi-Layer Perceptron
MOR	mouth opening ratio
MTCNN	Multitask Convolutional Neural Network
NLR	nose length ratio
NRMSE	normalized root mean square error
PERCLOS	Percentage of Eyelid Closure over the Pupil
PSG	polysomnography
PVT	Psychomotor Vigilance Test
RBF	Radial Basis Function
RNN	Recurrent Neural Network
ROI	Region of Interest
TSP	Triangular Surface Patch
SVC	Support Vector Classifier
SVM	Support Vector Machine
VAS-F	Visual Analogue Scale to Evaluate Fatigue Severity
VLF	very low frequency

References

Zhu, T.; Zhang, C.; Wu, T.; Ouyang, Z.; Li, H.; Na, X.; Liang, J.; Li, W. Research on a Real-Time Driver Fatigue Detection Algorithm Based on Facial Video Sequences. Appl. Sci. 2022, 12, 2224. [Google Scholar] [CrossRef]
Shilov, N.; Othman, W.; Hamoud, B. Operator Fatigue Detection via Analysis of Physiological Indicators Estimated Using Computer Vision. In Proceedings of the 26th International Conference on Enterprise Information Systems—Volume 2: ICEIS, Angers, France, 28–30 April 2024; SciTePress: Setúbal, Portugal, 2024; pp. 422–432. [Google Scholar] [CrossRef]
Li, X.; Lian, X.; Liu, F. Rear-End Road Crash Characteristics Analysis Based on Chinese In-Depth Crash Study Data. In Proceedings of the 16th COTA International Conference of Transportation Professionals (CICTP 2016), Shanghai, China, 6–9 July 2016; pp. 1536–1545. [Google Scholar] [CrossRef]
Zhang, G.; Yau, K.K.; Zhang, X.; Li, Y. Traffic accidents involving fatigue driving and their extent of casualties. Accid. Anal. Prev. 2016, 87, 34–42. [Google Scholar] [CrossRef] [PubMed]
Alharbey, R.; Dessouky, M.M.; Sedik, A.; Siam, A.I.; Elaskily, M.A. Fatigue State Detection for Tired Persons in Presence of Driving Periods. IEEE Access 2022, 10, 79403–79418. [Google Scholar] [CrossRef]
Di Stasi, L.L.; Renner, R.; Catena, A.; Cañas, J.J.; Velichkovsky, B.M.; Pannasch, S. Towards a driver fatigue test based on the saccadic main sequence: A partial validation by subjective report data. Transp. Res. Part C Emerg. Technol. 2012, 21, 122–133. [Google Scholar] [CrossRef]
Yamada, Y.; Kobayashi, M. Fatigue Detection Model for Older Adults Using Eye-Tracking Data Gathered While Watching Video: Evaluation Against Diverse Fatiguing Tasks. In Proceedings of the 2017 IEEE International Conference on Healthcare Informatics (ICHI), Park City, UT, USA, 23–26 August 2017; pp. 275–284. [Google Scholar] [CrossRef]
Dawson, D.; Searle, A.K.; Paterson, J.L. Look before you (s)leep: Evaluating the use of fatigue detection technologies within a fatigue risk management system for the road transport industry. Sleep Med. Rev. 2014, 18, 141–152. [Google Scholar] [CrossRef]
Gomer, J.; Walker, A.; Gilles, F.; Duchowski, A. Eye-Tracking in a Dual-Task Design: Investigating Eye-Movements, Mental Workload, and Performance. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2008, 52, 1589–1593. [Google Scholar] [CrossRef]
Hu, X.; Lodewijks, G. Detecting fatigue in car drivers and aircraft pilots by using non-invasive measures: The value of differentiation of sleepiness and mental fatigue. J. Saf. Res. 2020, 72, 173–187. [Google Scholar] [CrossRef]
Matuz, A.; van der Linden, D.; Kisander, Z.; Hernádi, I.; Kázmér, K.; Csathó, Á. Enhanced cardiac vagal tone in mental fatigue: Analysis of heart rate variability in Time-on-Task, recovery, and reactivity. PLoS ONE 2021, 16, e0238670. [Google Scholar] [CrossRef]
Desmond, P.A.; Hancock, P.A. Active and passive fatigue states. In Stress, Workload, and Fatigue; CRC Press: Boca Raton, FL, USA, 2000; pp. 455–465. [Google Scholar]
Kovalenko, S.; Mamonov, A.; Kuznetsov, V.; Bulygin, A.; Shoshina, I.; Brak, I.; Kashevnik, A. OperatorEYEVP: Operator Dataset for Fatigue Detection Based on Eye Movements, Heart Rate Data, and Video Information. Sensors 2023, 23, 6197. [Google Scholar] [CrossRef]
Cui, Z.; Sun, H.M.; Yin, R.N.; Gao, L.; Sun, H.B.; Jia, R.S. Real-time detection method of driver fatigue state based on deep learning of face video. Multimed. Tools Appl. 2021, 80, 25495–25515. [Google Scholar] [CrossRef]
Dua, M.; Shakshi; Singla, R.; Raj, S.; Jangra, A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput. Appl. 2021, 33, 3155–3168. [Google Scholar] [CrossRef]
Dey, S.; Chowdhury, S.A.; Sultana, S.; Hossain, M.A.; Dey, M.; Das, S.K. Real Time Driver Fatigue Detection Based on Facial Behaviour along with Machine Learning Approaches. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON), Dhaka, Bangladesh, 28–30 November 2019; pp. 135–140. [Google Scholar] [CrossRef]
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
Salzillo, G.; Natale, C.; Fioccola, G.B.; Landolfi, E. Evaluation of Driver Drowsiness based on Real-Time Face Analysis. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; pp. 328–335. [Google Scholar] [CrossRef]
Shahid, A.; Wilkinson, K.; Marcu, S.; Shapiro, C.M. Karolinska Sleepiness Scale (KSS). In STOP, THAT and One Hundred Other Sleep Scales; Shahid, A., Wilkinson, K., Marcu, S., Shapiro, C.M., Eds.; Springer: New York, NY, USA, 2012; pp. 209–210. [Google Scholar] [CrossRef]
Minhas, A.A.; Jabbar, S.; Farhan, M.; ul Islam, M.N. A smart analysis of driver fatigue and drowsiness detection using convolutional neural networks. Multimed. Tools Appl. 2022, 81, 26969–26986. [Google Scholar] [CrossRef]
Kassem, H.A.; Chowdhury, M.; Abawajy, J.H. Drivers Fatigue Level Prediction Using Facial, and Head Behavior Information. IEEE Access 2021, 9, 121686–121697. [Google Scholar] [CrossRef]
Chen, L.; Xin, G.; Liu, Y.; Huang, J. Driver Fatigue Detection Based on Facial Key Points and LSTM. Secur. Commun. Networks 2021, 2021, 5383573. [Google Scholar] [CrossRef]
Abtahi, S.; Omidyeganeh, M.; Shirmohammadi, S.; Hariri, B. YawDD: A yawning detection dataset. In Proceedings of the 5th ACM Multimedia Systems Conference, Singapore, 19 March 2014; pp. 24–28. [Google Scholar] [CrossRef]
Du, G.; Li, T.; Li, C.; Liu, P.X.; Li, D. Vision-Based Fatigue Driving Recognition Method Integrating Heart Rate and Facial Features. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3089–3100. [Google Scholar] [CrossRef]
Cardoso, J.; Souloumiac, A. Blind beamforming for non-gaussian signals. IEE Proc. F (Radar Signal Process.) 1993, 140, 362–370. [Google Scholar] [CrossRef]
Papazov, C.; Marks, T.K.; Jones, M. Real-time 3D head pose and facial landmark estimation from depth images using triangular surface patch features. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4722–4730. [Google Scholar] [CrossRef]
Mehmood, I.; Li, H.; Umer, W.; Ma, J.; Saad Shakeel, M.; Anwer, S.; Fordjour Antwi-Afari, M.; Tariq, S.; Wu, H. Non-invasive detection of mental fatigue in construction equipment operators through geometric measurements of facial features. J. Saf. Res. 2024, 89, 234–250. [Google Scholar] [CrossRef]
Kashevnik, A.; Kovalenko, S.; Mamonov, A.; Hamoud, B.; Bulygin, A.; Kuznetsov, V.; Shoshina, I.; Brak, I.; Kiselev, G. Intelligent Human Operator Mental Fatigue Assessment Method Based on Gaze Movement Monitoring. Sensors 2024, 24, 6805. [Google Scholar] [CrossRef]
Mehmood, I.; Li, H.; Qarout, Y.; Umer, W.; Anwer, S.; Wu, H.; Hussain, M.; Fordjour Antwi-Afari, M. Deep learning-based construction equipment operators’ mental fatigue classification using wearable EEG sensor data. Adv. Eng. Inform. 2023, 56, 101978. [Google Scholar] [CrossRef]
Mehmood, I.; Li, H.; Umer, W.; Arsalan, A.; Anwer, S.; Mirza, M.A.; Ma, J.; Antwi-Afari, M.F. Multimodal integration for data-driven classification of mental fatigue during construction equipment operations: Incorporating electroencephalography, electrodermal activity, and video signals. Dev. Built Environ. 2023, 15, 100198. [Google Scholar] [CrossRef]
Wang, L.; Li, H.; Yao, Y.; Han, D.; Yu, C.; Lyu, W.; Wu, H. Smart cushion-based non-invasive mental fatigue assessment of construction equipment operators: A feasible study. Adv. Eng. Inform. 2023, 58, 102134. [Google Scholar] [CrossRef]
Ghoddoosian, R.; Galib, M.; Athitsos, V. A Realistic Dataset and Baseline Temporal Model for Early Drowsiness Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019; pp. 178–187. [Google Scholar]
Weng, C.H.; Lai, Y.H.; Lai, S.H. Driver Drowsiness Detection via a Hierarchical Temporal Deep Belief Network. In Computer Vision—ACCV 2016 Workshops; Chen, C.S., Lu, J., Ma, K.K., Eds.; Springer: Cham, Switzerland, 2017; pp. 117–133. [Google Scholar]
Liu, P.; Chi, H.L.; Li, X.; Guo, J. Effects of dataset characteristics on the performance of fatigue detection for crane operators using hybrid deep neural networks. Autom. Constr. 2021, 132, 103901. [Google Scholar] [CrossRef]
Massoz, Q.; Langohr, T.; François, C.; Verly, J.G. The ULg multimodality drowsiness database (called DROZY) and examples of use. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–7. [Google Scholar] [CrossRef]
Basner, M.; Dinges, D.F. Maximizing Sensitivity of the Psychomotor Vigilance Test (PVT) to Sleep Loss. Sleep 2011, 34, 581–591. [Google Scholar] [CrossRef] [PubMed]
Yang, C.; Yang, Z.; Li, W.; See, J. FatigueView: A Multi-Camera Video Dataset for Vision-Based Drowsiness Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 233–246. [Google Scholar] [CrossRef]
Kavalci Yilmaz, E.; Akcayol, M.A. SUST-DDD: A Real-Drive Dataset for Driver Drowsiness Detection. In Proceedings of the 31st IEEE Conference of Open Innovations Association FRUCT, Helsinki, Finland, 27–29 April 2022. [Google Scholar] [CrossRef]
Othman, W.; Kashevnik, A.; Ryabchikov, I.; Shilov, N. Contactless Camera-Based Approach for Driver Respiratory Rate Estimation in Vehicle Cabin. Lect. Notes Netw. Syst. 2022, 5431, 429–442. [Google Scholar]
Othman, W.; Kashevnik, A.; Ali, A.; Shilov, N.; Ryumin, D. Remote Heart Rate Estimation Based on Transformer with Multi-Skip Connection Decoder: Method and Evaluation in the Wild. Sensors 2024, 24, 775. [Google Scholar] [CrossRef]
Guo, J.; Zhu, X.; Yang, Y.; Yang, F.; Lei, Z.; Li, S.Z. Towards Fast, Accurate and Stable 3D Dense Face Alignment. arXiv 2021, arXiv:2009.09960. [Google Scholar]
Hamoud, B.; Kashevnik, A.; Othman, W.; Shilov, N. Neural Network Model Combination for Video-Based Blood Pressure Estimation: New Approach and Evaluation. Sensors 2023, 23, 1753. [Google Scholar] [CrossRef]
Hamoud, B.; Othman, W.; Shilov, N.; Kashevnik, A. Contactless Oxygen Saturation Detection Based on Face Analysis: An Approach and Case Study. In Proceedings of the 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 24–26 May 2023; pp. 54–62. [Google Scholar] [CrossRef]
Kashevnik, A.; Ali, A.; Lashkov, I.; Zubok, D. Human Head Angle Detection Based on Image Analysis. In Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1; Arai, K., Kapoor, S., Bhatia, R., Eds.; Springer: Cham, Switzerland, 2021; pp. 233–242. [Google Scholar]
Hasan, F.; Kashevnik, A. State-of-the-Art Analysis of Modern Drowsiness Detection Algorithms Based on Computer Vision. In Proceedings of the 2021 29th Conference of Open Innovations Association (FRUCT), Tampere, Finland, 12–14 May 2021; pp. 141–149. [Google Scholar] [CrossRef]
Bourdone Test—Landolt Rings Online. Available online: https://metodorf.com/tests/bourdon/bourdonlandolt.php (accessed on 11 November 2024).
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
Meng, J.; Zhao, B.; Ma, Y.; Yiyu, J.; Nie, B. Effects of fatigue on the physiological parameters of labor employees. Nat. Hazards 2014, 74, 1127–1140. [Google Scholar] [CrossRef]
Sampei, K.; Ogawa, M.; Torres, C.C.C.; Sato, M.; Miki, N. Mental Fatigue Monitoring Using a Wearable Transparent Eye Detection System. Micromachines 2016, 7, 20. [Google Scholar] [CrossRef]
Ye, M.; Zhang, W.; Cao, P.; Liu, K. Driver Fatigue Detection Based on Residual Channel Attention Network and Head Pose Estimation. Appl. Sci. 2021, 11, 9195. [Google Scholar] [CrossRef]
Luo, H.; Lee, P.A.; Clay, I.; Jaggi, M.; De Luca, V. Assessment of Fatigue Using Wearable Sensors: A Pilot Study. Digit. Biomarkers 2020, 4, 59–72. [Google Scholar] [CrossRef] [PubMed]
Zhao, Q.; Nie, B.; Bian, T.; Ma, X.; Sha, L.; Wang, K.; Meng, J. Experimental study on eye movement characteristics of fatigue of selected college students. Res. Sq. 2023. preprint. [Google Scholar] [CrossRef]
Gupta, N.K.; Bari, A.K.; Kumar, S.; Garg, D.; Gupta, K. Review Paper on Yawning Detection Prediction System for Driver Drowsiness. In Proceedings of the 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 3–5 June 2021; pp. 1–6. [Google Scholar]

Figure 1. Timeline of each session.

Figure 2. Models used to label the videos.

Figure 3. The overall scheme used for detecting the fatigue state.

Figure 4. The relationship between the mental performance and the inverse of fatigue (red dotted line denotes an example threshold value separating fatigued and not fatigued states).

Figure 5. The relationship between the threshold and the F1-score.

Table 1. Summary of approaches employed for fatigue detection using facial videos.

Article	Used Features	Technique Used for Fatigue Detection	Accuracy of Fatigue Detection Models
Cui et al. [14]	Eye and mouth features	A multi-feature fusion judgment algorithm	98.30%
Dua et al. [15]	Hand gestures, facial expressions, behavioral features, and head movements	SoftMax classifier	85%
Dey et al. [16]	Eye, mouth, and nose features	SVM	96.4%
Salzillo et al. [18]	Eye features	Fuzzy logic-based system	NRMSE of 15.63%
Minhas et al. [20]	Features detected by CNN layers	CNN (ResNet50)	93.69%
Kassem et al. [21]	Eye, mouth, and head features	Logical model	93.3%
Chen et al. [22]	Eyes and mouth features	LSTM	88% and 90%
Du et al. [24]	Heart rate, and eye and mouth features	MFRNN	86.75% and 91.67%
Mehmood et al. [27]	Eyebrows, mouth contours, head movements, eye region, and facial area	Static tests	Significant differences in these metrics between high and low fatigue states
Kashevnik et al. [28]	Eye-tracking data	Machine learning techniques	85%
Mehmood et al. [29]	EEG data collected from sensor	Deep learning techniques	99.941%
Mehmood et al. [30]	EEG and EDA data collected from sensors and facial features obtained from videos	Machine learning techniques	96.2%
Wang et al. [31]	Heart rate and respiration signals	Machine learning techniques	92%

Table 2. Summary of datasets.

Dataset	Number of Subjects	Age Range	Number of Videos	Illumination	Camera Type	Human State
						Alertness,
UTA-RLDD [32]	60	20–59	180	Daytime and nighttime	RGB	low vigilance, and
						drowsiness
						Talking,
YawDD [23]	107	-	322 and 29	Daytime	RGB	yawning, and
						singing
						Talking,
Chen et al. [22]	10	-	20	Daytime	RGB	yawning, and
						laughing
						Nodding,
						regular driving,
NTHU-DDD [33]	36	-	360	Daytime and nighttime	RGB and IR	yawning,
						slow blinking, and
						laughing
Licensed						Alertness,
Crane	5	30–50	-	Daytime	RGB	low vigilance, and
Operators [34]						fatigue
				1st PVT: 8:30 a.m.		Alertness,
DROZY [35]	14	22.7 ± 2.3	-	2nd PVT: 3:30 a.m.	IR	low vigilance, and
				3rd PVT: 12:00 p.m.		drowsiness
						Normal,
						alertness,
FatigueView [37]	95	Over 18	124 million frames	Daytime and nighttime	RGB and IR	signs of sleepness,
						sleepy, and
						sleepness
SUST-DDD [38]	19	24–42	2074	Reduced daylight	RGB	Normal and
						drowsiness

Table 3. The contents of the CSV files.

Type of Features	Features	Description
Info about the video	Four features included (participant ID, video name, time of day, and type of activity)	General info about the participant and the session
Questionnaire	One value presents VAS-F result	Subjective assessment of fatigue performed by the participants themselves
Task results	6 features describe CRT results	Metrics related to the performance during the CRT sessions
	19 features related to Landolt rings test results	Metrics related to the performance during the Landolt rings test session
Sensor features	60 eye movement strategy features	Eye characteristics regarding saccadic and fixation time and speed
	14 eye movement statistics features	Mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile, and 75th percentile of each series on both coordinates
	79 HRV features	Features related to the HRV in multiple domains (time, frequency, and nonlinear)
	One value presents the actual heart rate	Real heart rate extracted from the PPI signals
Features extracted by our models	15 features, including head movement angles (roll, pitch, and yaw), vital signs (blood pressure, respiratory rate, heart rate, and oxygen saturation), and eye and mouth states (eye closure and yawning)	Detected features from each video at each minute using deep learning models proposed in our previous works [39,40,42,43,44,45]

Table 4. Mental performance (Au), attention (K), and the stability of concentration of attention (Ku) for each participant in the dataset.

Participant Id	Ku	K	Au
1	10–450	53.6–98.2	0.18–2.85
2	16–225	56.8–95.8	0.39–2.71
3	14–450	45.7–98.4	0.00–2.87
4	12–300	57.6–97.1	0.45–2.81
5	9–180	53.3–95.8	0.18–2.70
6	11–900	63.6–100.0	0.59–3.13
7	1–300	33.0–97.5	0.00–2.83
8	12–900	57.3–100.0	0.31–2.99
9	3–12	36.3–69.7	0.00–0.83
10	5–90	45.5–91.1	0.00–2.45

Table 5. The results of the models after normalizing the vital signs for each participant and considering fatigue states lower than 0.38.

Method	Distinct Characteristics	Accuracy	F1-Score
SVC	RBF kernel, C = 1.0, gamma = ‘scale’	93.48%	0.4831
Logistic regression	L2 penalty, C = 0.001, solver = ‘lbfgs’	96.27%	0.52147
MLP	Two hidden layers, ReLU activation, Adam solver, lr = $1 \times 10^{- 3}$	96.27%	0.8719
Decision tree	Gini criterion, max_depth: 20, min_samples_leaf: 2, min_samples_split: 2	97.11%	0.8256
XGboost	lr = 0.1, max depth = 3, 100 estimators, gamma: 2, colsample_bytree: 0.8	98.10%	0.8658
Random forest	100 estimators, max_depth: 20, max_features: log2, min_samples_split: 2	98.1%	0.9465

Table 6. Performance of XGBoost model with varying number of features.

Number of Features	Used Features	Accuracy	F1-Score
All 13 features	BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, average pitch, average roll, mouth_openess_ratio, average yaw, average RR, rhythmicity_coeff, and stability_coef	98.1%	0.8658
Top 11 features	BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, average pitch, average roll, mouth_openess_ratio, average yaw, and average RR	98.27%	0.8748
Top 9 features	BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, average pitch, average roll, and mouth_openess_ratio	98.43%	0.8891
Top 7 features	BP Diastolic, participant ID, BP Systolic, eye_closure_ratio, heart_rate, oxygen_saturation, and average pitch	98.1%	0.8597

Table 7. Comparison of F1-scores for different classification methods with and without the participant ID feature.

Method	F1-Score with Participant ID	F1-Score Without Participant ID
Random forest	0.9465	0.8190
Logistic regression	0.52147	0.4788
XGBoost	0.8658	0.8128
Decision tree	0.8256	0.7237

Table 8. Comparison with existing approaches for fatigue detection.

Approach	Type of Fatigue	Number of Subject	Fatigue Assessment	Best Result Achieved
Cui et al. [14]	Driving fatigue	2000 facial images with fatigue state	Threshold based on eye and mouth features	98.30%
Dua et al. [15]	Driving fatigue	36 subjects (360 videos)	Facial landmark points and head posture data	85%
Dey et al. [16]	Driving fatigue	36 subjects	PERCLOS (Percentage of Eyelid Closure over the Pupil) and voice	96.4%
Salzillo et al. [18]	Driving fatigue	10 subjects	KSS (Karolinska Sleepiness Scale)	NRMSE of 15.63%
Minhas et al. [20]	Driving fatigue	Thousands of images of sleepy and active faces	Expert assessment	93.69%
Kassem et al. [21]	Driving fatigue	36 subjects (360 videos) + 2423 images of eyes	Facial landmark points and head posture data	93.3%
Chen et al. [22]	Driving fatigue	110 subjects (120 videos)	Yawns	90%
Du et al. [24]	Driving fatigue	20 subjects	KSS (Karolinska Sleepiness Scale)	91.67%
Kashevnik et al. [28]	Operator fatigue	10 subjects	Landolt rings test (mental performance indicator)	85%
Mehmood et al. [29]	Operator fatigue	15 subjects	NASA-TLX	99.941%
Mehmood et al. [30]	Operator fatigue	16 subjects	NASA-TLX	96.2%
Wang et al. [31]	Operator fatigue	12 subjects	NASA-TLX	92%
Our approach	Operator fatigue	10 subjects (817 videos)	Landolt rings test (mental performance indicator)	98.43%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Othman, W.; Hamoud, B.; Shilov, N.; Kashevnik, A. Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset. Appl. Sci. 2024, 14, 10510. https://doi.org/10.3390/app142210510

AMA Style

Othman W, Hamoud B, Shilov N, Kashevnik A. Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset. Applied Sciences. 2024; 14(22):10510. https://doi.org/10.3390/app142210510

Chicago/Turabian Style

Othman, Walaa, Batol Hamoud, Nikolay Shilov, and Alexey Kashevnik. 2024. "Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset" Applied Sciences 14, no. 22: 10510. https://doi.org/10.3390/app142210510

APA Style

Othman, W., Hamoud, B., Shilov, N., & Kashevnik, A. (2024). Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset. Applied Sciences, 14(22), 10510. https://doi.org/10.3390/app142210510

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset

Abstract

1. Introduction

2. Related Work

2.1. Existing Approaches to Human Fatigue Assessment

2.2. Existing Datasets

3. Dataset Description

3.1. OperatorEYEVP Dataset

3.2. Annotation of Dataset with Physiological Indicators

3.3. HFAVD Dataset Structure

3.3.1. Respiratory Rate and Breathing Characteristics

3.3.2. Heart Rate

3.3.3. Blood Pressure

3.3.4. Oxygen Saturation

3.3.5. Head Pose

3.3.6. Eye and Mouth State

3.4. Data Example

4. The Proposed Approach and Its Application to the HFAVD Dataset

5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI