1. Introduction
According to the most recently published World Health Organization (WHO) report, an estimated 1.25 million people were killed on the roads worldwide in 2013, making road traffic injuries a leading cause of death globally [1]. Most of these deaths occurred in low- and middle-income countries, where rapid economic growth has been accompanied by increased motorization and, therefore, more road traffic injuries. In addition to deaths on the roads, up to 50 million people incur non-fatal injuries each year as a result of road traffic crashes, and there are additional indirect health consequences associated with this growing epidemic. Road traffic injuries are currently estimated to be the ninth leading cause of death across all age groups globally, and are predicted to become the seventh leading cause of death by 2030 [1].
Distracted driving is a serious and growing threat to road safety [
1]. Collisions caused by distracted driving have captured the attention of the US Government and professional medical organizations in recent years [2]. Its prevalence and identification as a contributing factor in crashes was described as an "epidemic of American roadways", in the words of Ray LaHood, then US Secretary of Transportation, in 2012 [3]. There is no exact figure for the number of accidents caused by inattention (and its subtypes), since studies are conducted in different places, over different time frames and, therefore, under different conditions. The studies referenced below report both statistics about inattention in general and those recorded for distraction and fatigue in particular. These authors have estimated that distraction and inattention account for somewhere between 25% and 75% of all crashes and near-crashes [4,5,6,7,8].
The growing use of in-vehicle information systems (IVISs) is a critical trend [9] because these systems induce visual, manual and cognitive distraction [10] and may affect driving performance in qualitatively different ways [11]. Additionally, the advancement and prevalence of personal communication devices has exacerbated the problem in recent years [12]. All these factors increase the number of tasks subordinate to the driving activity. These tasks, known as secondary tasks, may lead to distraction [13] and include eating, drinking, reaching for an object, tuning the radio and the use of cell phones and other technologies. Secondary tasks that take drivers' eyes off the forward roadway [14,15], reduce visual scanning [16] and increase cognitive load may be particularly dangerous [13]. For example, according to naturalistic studies [17], the use of cell phones while driving causes thousands of fatalities in the United States every year [18,19].
The purpose of this paper is to analyze the state of the art in the detection of driver distraction. The scope of the paper is shown in Figure 1 and is outlined as follows. The main methods for face detection, face tracking and detection of facial landmarks are summarized in Section 2 because they are a key component of many video-based inattention monitoring systems. In Section 3, Section 4 and Section 5, the main algorithms for biomechanical, visual and cognitive distraction detection are reviewed, respectively. Section 6 reviews algorithms that detect mixed types of distraction. The relationship between facial expressions and distraction is explored in Section 7. The key points for the development and implementation of sensors to carry out the detection of distraction are considered in Section 8, and Section 9 summarizes the key points for testing and training driving monitoring systems. Privacy issues related to camera sensors are discussed in Section 10. Lastly, conclusions, future aspects and challenges ahead are considered in Section 11.
With the objective of introducing the scope and limitations of this review, some key aspects are briefly introduced as follows. Driver distraction is just one form of inattention, which occurs when drivers divert their attention away from the driving task to focus on another activity. Therefore, a "complete" solution should consider all aspects of inattention; at the very least, the system should detect both distraction and drowsiness, as the main contributing factors in crashes and near-crashes. As stated before, only distraction algorithms are summarized in this work, but one must not forget that other forms of inattention should also be taken into account. Moreover, the use of on-board sensors already available in the vehicle to analyze driver behaviour is a low-cost and powerful alternative to vision-based monitoring systems [20,21]. However, these systems should not be treated as mutually exclusive alternatives, because they can be used together (fused) to obtain indicators for monitoring [22]. Nevertheless, this review covers only "purely" vision-based monitoring systems.
One of the challenges in decreasing the prevalence of distracted drivers is that many of them report that they believe they can drive safely while distracted [23]. However, in connection with the use of mobile phones while driving, there is a great deal of evidence that interacting with mobile devices, such as sending messages or engaging in conversations, can impair driving performance because this interaction creates distraction. Moreover, recent research showed that phone notifications alone significantly disrupted performance, even when drivers did not directly interact with a mobile device during the task [24]. Another study suggests that people in general can reduce both inattention and hyperactivity symptoms simply by silencing their smartphones and avoiding notifications [25]. Therefore, drivers should neither use nor even notice the presence of smartphones inside the car while driving. It should be pointed out that the generation of distraction is a very complex process and is scarcely addressed here. We recommend some research papers that focus on driver distraction generation: Angell et al. [26] focused on the process of cognitive load in naturalistic driving; Liang et al. [27] addressed the adaptive behaviour of the driver under task engagement and reported results on visual, cognitive and combined distraction; Caird analyzed the effects of texting on driving [28]. In the context of intelligent vehicles, Ohn et al. [29] highlight the role of humans by means of computer vision techniques.
1.1. Taxonomy
Both distraction and inattention have been inconsistently defined and the relationship between them remains unclear [30]. The use of different, and sometimes inconsistent, definitions of driver distraction can create a number of problems for researchers and road safety professionals [31]. Inconsistent definitions across studies can make the comparison of research findings difficult or impossible, can lead to different interpretations of crash data and, therefore, to different estimates of the role of distraction in crashes. This problem can be further seen in these recent works [32,33,34,35]. Many definitions of distraction have been proposed [5,7,8,31]. Regan et al. [
35] proposed a taxonomy of both driver distraction and inattention in which distraction is conceptualized as just one of several factors that may give rise to inattention. They concluded that driver inattention means “
insufficient or no attention to activities critical for safe driving”. They defined driver distraction as “
the diversion of attention away from activities critical for safe driving toward a competing activity, which may result in insufficient or no attention to activities critical for safe driving”. The definition proposed here is almost identical to that coined for driver distraction by Lee et al. [
31].
It is acknowledged that the taxonomy proposed by Regan et al. [35] suffers from "hindsight bias", that is, the forms of driver inattention proposed are derived from studies of crashes and critical incidents in which judgements were made after the fact about whether or not a driver was attentive to an activity critical for safe driving [35]. Driving consists of a variety of sub-tasks and it may not be possible to attend to all of them at the same time. Determining which sub-task is more important (and which the driver should therefore attend to) can often only be established after the fact (i.e., after a crash or incident has occurred) and, hence, this attribution of inattention is somewhat arbitrary [36]. Additionally, the dynamics of distraction [37], which identify breakdowns in the management of interruptions as an important contributor to distraction, should also be considered as part of this taxonomy; hence, timing and context have implications for algorithm design that should be taken into account.
1.2. Methodology
Papers addressed in this review fall within the topic of distraction detection using vision-based systems. The search and review strategy is described below. A comprehensive review of the English-language scientific literature was performed, covering the period from 1 January 1980 to 31 August 2016. The following databases were used: EBSCO, ResearchGate, ScienceDirect, Scopus, PubMed, Google Scholar and Web of Knowledge. Search terms related to driver distraction were employed in all combinations: driver, visual, cognitive, manual, biomechanical, vision, vision-based, impairment, distraction, distractions, review, task, tasks, inattention, performance, phone, sms, vehicle, problem, looking, face, head, pose, glasses, illumination, self-driving, tracking, sensors, image, traffic, safety, facts, privacy, issues, porting, taxonomy. The many items returned by these search criteria were then screened using the following criteria. Exclusion criteria were papers that were obviously non-relevant or that belonged to medical, electronic, networking, marketing or patent topics. Only publications from peer-reviewed English-language journals were considered for inclusion. Additionally, the reviewed papers were ordered by number of references in order to include all relevant papers. Finally, search filters were applied to retrieve the latest published papers, restricting results to the years 2015 and 2016. References and bibliographies from the selected papers were examined to identify potentially additional papers. A total of approximately 1500 publications were examined in the review process.
4. Visual Distraction
Visual distraction is often related to the on-board presence of electronic devices such as mobile phones, navigation or multimedia systems, requiring active control from the driver. It can also be related to the presence of salient visual information away from the road causing spontaneous off-road eye glances and momentary rotation of the head. A 2006 report on the results of a 100-car field experiment [
4] showed that almost 80% of all crashes and 65% of all near-crashes involved drivers looking away from the forward roadway just prior to the incident.
Engagement in visually distracting activities diverts drivers' attention from the road and causes occasional lapses, such as imprecise control of the vehicle [108], missed events [28] and increased reaction times [108]. Visual time sharing between the driving task and a secondary task reveals that the glance frequency to in-car devices is correlated with the task duration, but the average glance duration does not change with task time or glance frequency [
109]. Drivers do not usually increase the glance duration for more difficult or longer tasks but rather increase the accumulated visual time sharing duration by increasing the number of glances away from the road [
110]. As both single long glances and accumulated glance duration have been found to be detrimental for safety [
110,
111,
112], a driver distraction detection algorithm based on visual behaviour should take both glance duration and repeated glances into account [
113].
On the one hand, high-resolution cameras placed throughout the cabin are needed to view the driver's eyes from all head positions and at all times. Several economic and technical challenges of integrating and calibrating multiple cameras must be tackled to achieve this. Technically, eye orientation cannot always be measured in vehicular environments because the eye region can be occluded by (1) sunlight reflections on eyeglasses; (2) the driver's eye blinks; (3) a large head rotation; (4) sunglasses; (5) the wearing of some kind of mask; (6) direct sunlight; (7) hats, caps or scarves; or (8) varying real-world illumination conditions.
On the other hand, many safety systems do not require such detailed gaze direction but only need coarse gaze direction to reduce false warnings [114,115]. For example, forward collision warning (FCW) systems need not only exterior observations but also interior observations of the driver's attention to reduce false warnings (which distract and bother the driver); that is, coarse gaze direction can be used to control the timing of warning emission when the system detects that the driver is not facing forwards.
Taking into account that errors in facial feature detection greatly affect gaze estimation [
116], many researchers have measured coarse gaze direction by using only head orientation with the assumption that coarse gaze direction can be approximated by head orientation [
117]. Head pose is a strong indicator of a driver’s field of view and his/her focus of attention [
59]. It is intrinsically linked to visual gaze estimation, which is the ability to characterize the direction in which a person is looking [
118]. However, it should also be noted that drivers use a time-sharing strategy when engaged in a visual-manual task, where the gaze is constantly shifted between the secondary task and the driving scene for short intervals of time [119]; they often position the head in between the two involved gaze targets and only use the eyes to quickly move between them. In this situation, a face tracking algorithm would classify the situation as distracted based on head position, even though the driver is repeatedly looking at the road ahead. Therefore, in an ideal situation, both driver gaze tracking and eyes-off-road should be detected together [
49].
In short, visual distraction detection can be categorized into two main approaches, as can be seen in
Figure 4. In the first approach, which can be called “coarse”, researchers measured the coarse gaze direction and the focus of attention by using only head orientation with the assumption that the coarse gaze direction can be approximated by the head orientation. In the second approach, which can be called “fine”, researchers considered both head and eye orientation in order to estimate detailed and local gaze direction.
Moreover, considering their operating principles, visual distraction detection systems can be grouped into two main categories: hardware- and software-based methods. Additionally, some systems combine these two approaches and, therefore, a third category can also be considered, as seen in
Figure 4.
4.1. Hardware-Based Methods to Extract Gaze Direction
Hardware-based approaches to head pose and gaze estimation rely on Near Infrared (NIR) illuminators to generate the bright pupil effect. These methods use two ring-type IR light-emitting diodes: one located near the camera’s optical axis and the other located far from it [
120,
121,
122,
123,
124,
125,
126]. The light source near the camera's optical axis produces a bright pupil image caused by the red-eye effect, while the other light source produces a normal dark pupil image. The pupil can then be easily localized using the difference between the bright and dark pupil images. Ji et al. used the size, shape and intensity of the pupils, as well as the distance between the left and right pupil, to estimate the head orientation. Specifically, the authors used the pupil-glint displacement to estimate nine discrete gaze zones [121,122], a geometric disposition of the IR LEDs similar to that of Morimoto et al. [120] and two Charge Coupled Device (CCD) cameras embedded in the dashboard of the vehicle. The first is a narrow-angle camera focusing on the driver's eyes to monitor eyelid movement, while the second is a wide-angle camera focusing on the head to track and monitor head movement. Based on this work, Gu et al. [124] proposed combining Kalman filtering with the head motion to predict the feature locations and used Gabor wavelets to detect the eyes within the vicinity of the predicted locations. Another approach, proposed by Batista et al., used dual Purkinje images to estimate a driver's discrete gaze direction [125]. A rough estimation of the head-eye gaze was described based on the position of the pupils. The shape of the face is modeled with an ellipse and the 3D face pose is recovered from a single image assuming a ratio of the major and minor axes obtained through anthropometric face statistics. Further research is necessary to improve the accuracy of the face orientation estimation in this method, which is highly dependent on the detection of the face ellipse in the image.
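As a rough illustration of this bright/dark pupil differencing principle, the following minimal sketch (in Python with OpenCV; the function name, thresholds and blob-size limits are assumptions, not details of the cited systems) locates pupil candidates from a pair of synchronized frames:

```python
# A minimal sketch (assumed helper, not from the cited systems): locate pupil
# candidates by differencing a bright-pupil frame (on-axis NIR LED ring) and a
# dark-pupil frame (off-axis ring). Both inputs are assumed to be 8-bit grayscale.
import cv2

def locate_pupils(bright_frame, dark_frame, min_area=20, max_area=400):
    """Return centroids of blobs that glow only under on-axis NIR illumination."""
    diff = cv2.subtract(bright_frame, dark_frame)        # pupils stand out in the difference image
    diff = cv2.GaussianBlur(diff, (5, 5), 0)              # suppress sensor noise
    _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        if min_area <= cv2.contourArea(c) <= max_area:    # keep pupil-sized blobs only
            m = cv2.moments(c)
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids
```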
The aforementioned NIR illumination systems work particularly well at night. The major advantage of these methods is the exact and rapid localization of the pupil. However, performance can drop dramatically due to the contamination introduced by external light sources [
126,
127]. In addition, during the daytime, sunlight is usually far stronger than NIR light sources and, hence, the red-eye effect may not occur. Moreover, these methods cannot work with drivers wearing glasses because the lenses create large specular reflections and scatter the NIR illumination [
127,
128,
129]. While the contamination due to artificial lights can easily be filtered with a narrow band pass filter, sunlight contamination will still exist [
126]. Furthermore, such systems are vulnerable to eye occlusion caused by head rotation and blinking [
114].
4.2. Software-Based Methods to Extract Gaze Direction
Combining facial feature locations with statistical elliptical face modelling, Batista et al. [
83] presented a framework to determine the gaze of a driver. To determine the gaze of the face, elliptical face modelling was used, taking the pupil locations to constrain the shape, size and location of the ellipse. The proposed solution can measure yaw head rotation over the [−30°, +30°] interval and pitch head rotation over the [−20°, +20°] interval.
Furthermore, despite the technical challenges of integrating multiple cameras, Bergasa et al. [
130] proposed a subspace-based tracker based on head pose estimation using two cameras. More specifically, the initialization phase was performed using the Viola and Jones algorithm [40] and a 3D model of the face was constructed and tracked. In this work, the head pose algorithm, which was the basis for visual distraction estimation, could track the face correctly up to [−40°, +40°].
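A minimal sketch of this kind of model-based head pose estimation from 2D facial landmarks is given below; the generic 3D face model points and approximate camera matrix are assumptions, and the sketch illustrates the general technique rather than the specific methods of [83] or [130]:

```python
# Minimal sketch of a software-based head pose estimate from 2D facial landmarks.
# The 3D model points and the camera matrix are rough assumptions for illustration.
import cv2
import numpy as np

# Generic 3D face model points (nose tip, chin, eye corners, mouth corners), in mm.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def head_pose(image_points, frame_size):
    """Estimate yaw/pitch/roll (degrees) from a (6, 2) float array of landmarks
    given in the same order as MODEL_POINTS."""
    h, w = frame_size
    camera_matrix = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))                      # assume negligible lens distortion
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix, dist_coeffs)
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)                   # rotation vector -> rotation matrix
    angles, *_ = cv2.RQDecomp3x3(rotation)              # Euler angles in degrees (x, y, z)
    pitch, yaw, roll = angles
    return yaw, pitch, roll
```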
A limitation of software-based methods is that they often cannot be applied at night [
126,
131]. This has motivated some researchers to use active illumination based on IR LEDs, exploiting the bright pupil effect, which constitutes the basis of these systems [
126,
131] (explained in previous section), or combine both methods, which can be seen in the next section.
4.3. Hardware- and Software-Based Methods to Extract Gaze Direction
Lee et al. [
114] proposed a system for both day and night conditions: a vision-based real-time gaze zone estimator based on a driver's head orientation, composed of yaw and pitch. The authors focused on estimating a driver's gaze zone on the basis of his/her head orientation, which is essential in determining a driver's inattention level. For night conditions, additional illumination was provided to capture the driver's facial image. The face detection rate was higher than 99% for both daytime and nighttime.
The use of face salient points to track the head was introduced by Jimenez et al. [
132], instead of attempting to directly find the eyes using object recognition methods or the analysis of image intensities around the eyes. The camera was modified to include an 850 nm band-pass filter lens covering both the image sensor and the IR LEDs in order (a) to improve the rejection of external sources of IR radiation and reduce changes in illumination and (b) to facilitate the detection of the pupils, because the retina is highly reflective of the NIR illumination of the LEDs. An advantage of salient point tracking is that the approach is more robust to eye occlusions whenever they occur due to the driver's head or body motion.
Later on, the same authors extended their prior work in order to improve non-invasive systems for sensing a driver’s state of alert [
133]. They used a kinematic model of the driver's motion and a grid of salient points tracked using the Lucas-Kanade optical flow method [132]. The advantage of this approach is that it does not require directly detecting the eyes; therefore, if the eyes are occluded or not visible from the camera when the head turns, the system does not lose track of the eyes or the face, because it relies on the grid of salient points and on the knowledge of the driver's motion model. Experiments involving fifteen people showed the effectiveness of the approach, with a correct eye detection rate of 99.41% on average. It should be noted that this work focuses on sensing the driver's state of alert, which is calculated by measuring the percentage of eyelid closure over time (PERCLOS), and not on distraction detection.
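A minimal sketch of this kind of salient-point tracking with pyramidal Lucas-Kanade optical flow is shown below; the parameter values are illustrative assumptions and not those of [132,133]:

```python
# Minimal sketch: track a grid of salient facial points between consecutive
# grayscale frames with pyramidal Lucas-Kanade optical flow (OpenCV).
import cv2
import numpy as np

def detect_salient_points(gray, face_roi):
    """(Re)initialize up to 100 corners inside the face region (x, y, w, h)."""
    x, y, w, h = face_roi
    mask = np.zeros_like(gray)
    mask[y:y + h, x:x + w] = 255
    return cv2.goodFeaturesToTrack(gray, maxCorners=100, qualityLevel=0.01,
                                   minDistance=7, mask=mask)

def track_salient_points(prev_gray, next_gray, prev_points):
    """Track points from one frame to the next; return the surviving point pairs."""
    lk_params = dict(winSize=(21, 21), maxLevel=3,
                     criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    next_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, prev_points,
                                                      None, **lk_params)
    good = status.ravel() == 1                           # keep only successfully tracked points
    return prev_points[good], next_points[good]
```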
An Eyes Off the Road (EOR) detection system is proposed in [49]. The system collects videos from a CCD camera installed on the steering wheel column and tracks facial features. Using a 3D head model, the system estimates the head pose and gaze direction. For nighttime operation, the system requires IR illumination. The proposed system does not suffer from the common drawbacks of NIR-based systems [
121,
122,
125], because it does not rely on the bright pupil effect. The system works reliably with drivers of different ethnicities wearing different types of glasses. However, if the driver is wearing sunglasses, it is not possible to robustly detect the pupil. Thus, to produce a reliable EOR estimation in this situation, only head pose angles are taken into account.
Cyganek et al. [
134] proposed a setup of two cameras operating in the visible and near-infrared spectra for monitoring inattention. In each case (visible and IR), two cascades of classifiers are used: the first for detecting the eye regions and the second for the verification stage.
Murphy-Chutorian et al. used Local Gradient Orientation (LGO) and Support Vector Regression (SVR) to estimate the driver’s continuous yaw and pitch [
135]. They used the head pose information extracted by LGO and SVR to recognize the driver's awareness. The algorithm was further developed in [59] by introducing a head tracking module built upon 3D motion estimation and a mesh model of the driver's head. A general weakness here is that the tracking module may easily diverge for face shapes that differ greatly from the given mesh model.
4.4. Driver Distraction Algorithms Based on Gaze Direction
In the previous Section 4.1, Section 4.2 and Section 4.3, gaze direction is extracted using different methods. The next step is to detect distraction from gaze direction, regardless of the method used to extract this information; this is discussed next.
Many software-based methods have been proposed to detect visual distraction, many of which rely on "coarse" information extracted from visual cues [
114,
136,
137,
138,
139]. Hattori et al. [
136] introduced an FCW system using the driver's behavioural information. Their system determines distraction when it detects that the driver is not looking straight ahead. Following this approach, an Android app [
137] has been developed to detect and alert drivers of dangerous driving conditions and behaviour. Images from the front camera of the mobile phone are scanned to find the relative position of the driver’s face. By means of a trained model [
38] four face related categories were detected: (1) no face is present; (2) facing forwards, towards the road; (3) facing to the left and (4) facing to the right. Another related system is proposed by Flores et al. [
138], in which an alarm cue is issued to alert the driver of a dangerous situation when the system detects that the face position is not frontal. Lee et al. [
114] proposed a vision-based real-time gaze zone estimator based on a driver's head orientation, composed of yaw and pitch. This algorithm is based on normalized histograms of horizontal and vertical edge projections, combined with an ellipsoidal face model and an SVM classifier for gaze estimation. Along the same research line, but in a more elaborate fashion, Yuging et al. [139] used machine vision techniques to monitor the driver's state. The face detection algorithm is based on the detection of facial parts. Afterwards, the facial rotation angle is calculated from the analysis of the driver's head rotation angles. When the facial orientation angle remains outside a reasonable range for a relatively long time, the driver is assumed to be distracted and warning information is provided.
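The following sketch illustrates the general logic shared by such "coarse" approaches: head yaw is classified into discrete facing zones and a warning is raised when the driver is not facing forwards for longer than a threshold. The zone boundaries and the 2-s threshold are assumptions for illustration, not values from the cited papers:

```python
# Illustrative sketch of a "coarse" detector: discrete facing zones from head yaw
# plus a duration threshold. All constants are assumptions.
def facing_zone(yaw_deg):
    if yaw_deg is None:
        return "no_face"
    if yaw_deg < -20.0:
        return "left"
    if yaw_deg > 20.0:
        return "right"
    return "forward"

class CoarseDistractionMonitor:
    def __init__(self, max_off_road_s=2.0):
        self.max_off_road_s = max_off_road_s
        self.off_road_s = 0.0

    def update(self, yaw_deg, dt):
        """Feed one head-yaw sample (or None if no face) and the frame period dt (seconds)."""
        if facing_zone(yaw_deg) == "forward":
            self.off_road_s = 0.0                      # reset as soon as the driver faces the road
        else:
            self.off_road_s += dt                      # accumulate time spent not facing forwards
        return self.off_road_s > self.max_off_road_s   # True -> issue a warning
```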
Additionally, other software-based approaches rely on “fine” information considering both head and eye orientation in order to estimate distraction [
83,
130,
140,
141]. Pohl et al. [
140] focused on estimating the driver's visual distraction level using head pose and eye gaze information, with the assumption that the visual distraction level is non-linear: visual distraction increased with time while the driver looked away from the road scene but decreased nearly instantaneously when the driver re-focused on the road scene. Based on the pose and eye signals, they established their algorithm for visual distraction detection. Firstly, they used a Distraction Calculation (DC) to compute the instantaneous distraction level. Secondly, a Distraction Decision-Maker (DDM) determined whether the current distraction level represented a potentially distracted driver. However, to increase the robustness of the method, the robustness of the eye and head tracking device to adverse lighting conditions also has to be improved.
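A minimal sketch of this asymmetric behaviour (slow accumulation while the gaze is off the road, near-instantaneous recovery when it returns), loosely following the DC/DDM structure described above, is given below; the gains and threshold are illustrative assumptions rather than the published parameters:

```python
# Illustrative sketch of asymmetric distraction-level dynamics (rise slowly while the
# gaze is off the road, fall almost instantly when it returns). Constants are assumptions.
class DistractionCalculation:
    def __init__(self, rise_per_s=0.5, fall_per_s=5.0):
        self.level = 0.0
        self.rise_per_s = rise_per_s    # slow accumulation while looking away
        self.fall_per_s = fall_per_s    # near-instantaneous recovery when looking back

    def update(self, eyes_on_road, dt):
        if eyes_on_road:
            self.level = max(0.0, self.level - self.fall_per_s * dt)
        else:
            self.level = min(1.0, self.level + self.rise_per_s * dt)
        return self.level

class DistractionDecisionMaker:
    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def is_distracted(self, level):
        return level >= self.threshold   # True -> potentially distracted driver
```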
Bergasa et al. [
126] presented a hardware- and software-based approach for monitoring driver vigilance. It is based on a hardware system for the real-time acquisition of the driver's images using an active IR illuminator, together with a software implementation for real-time pupil tracking, ocular measures and face pose estimation. Finally, the driver's vigilance level is determined by fusing the measured parameters in a fuzzy system. The authors reported an accuracy close to 100% both at night and for users not wearing glasses. However, the performance of the system decreases during the daytime, especially on bright days, and, at the moment, the system does not work with drivers wearing glasses [
126].
Recently, Lee et al. [
141] evaluated four different vision-based algorithms for distraction under different driving conditions. These algorithms were chosen for their ability to distinguish between distracted and non-distracted states using eye-tracking data [
141]. The resulting four algorithms, summarized in
Table 5, are described next:
Eyes off forward roadway (EOFR) estimates distraction based on the cumulative glances away from the road within a 6-s window [
7].
Risky Visual Scanning Pattern (RVSP) estimates distraction by combining the current glance and the cumulative glance durations [
142].
“AttenD” estimates distraction associated with three categories of glances (glances to the forward roadway, glances necessary for safe driving (i.e., at the speedometer or mirrors), and glances not related to driving), and it uses a buffer to represent the amount of road information the driver possesses [
143,
144,
145].
Multi distraction detection (MDD) estimates both visual distraction using the percent of glances to the middle of the road and long glances away from the road, and cognitive distraction by means of the concentration of the gaze on the middle of the road. The implemented algorithm was modified from Victor et al. [
146] to include additional sensor inputs (head and seat sensors) and adjust the thresholds for the algorithm variables to improve robustness with potential loss of tracking.
Considering the results of the ROC curves, AUC values, accuracy and precision, it is apparent that a trade-off exists between ensuring distraction detection and avoiding false alarms, which complicates determining the most promising algorithm. More specifically, the MDD algorithm showed the best performance across all evaluation metrics (accuracy, precision, AUC). Although the EOFR algorithm had promising AUC values, the AttenD algorithm often yielded better accuracy and precision. Additionally, the RVSP algorithm consistently yielded the lowest values for both accuracy and precision, but a slightly higher AUC value than AttenD. All of the algorithms succeeded in detecting distraction well above chance detection (AUC = 0.5). The performance of the algorithms varied by task, with little difference in performance for the looking and reaching task (bug) but starker differences for the looking and touching task (arrows). The AUC for each task for each algorithm is provided in
Table 5.
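To illustrate the buffer-based logic behind algorithms such as AttenD, the following sketch maintains a time buffer that depletes while the driver looks away from the road, tolerates short driving-related glances (mirrors, speedometer) and refills when the gaze returns to the forward roadway; all constants are assumptions for illustration only and do not reproduce the published algorithm:

```python
# Illustrative sketch of a buffer-based detector in the spirit of AttenD [143,144,145].
# All constants (2-s capacity, 1-s grace period) are assumptions.
class AttenDLikeBuffer:
    ROAD, DRIVING_RELATED, OTHER = "road", "driving_related", "other"

    def __init__(self, capacity_s=2.0, related_grace_s=1.0):
        self.capacity_s = capacity_s
        self.related_grace_s = related_grace_s
        self.buffer_s = capacity_s
        self.related_glance_s = 0.0

    def update(self, glance_target, dt):
        """glance_target is one of ROAD, DRIVING_RELATED, OTHER; dt is the frame period (s)."""
        if glance_target == self.ROAD:
            self.buffer_s = min(self.capacity_s, self.buffer_s + dt)   # refill while on the road
            self.related_glance_s = 0.0
        elif glance_target == self.DRIVING_RELATED:
            self.related_glance_s += dt
            if self.related_glance_s > self.related_grace_s:           # long mirror/speedometer glance
                self.buffer_s = max(0.0, self.buffer_s - dt)
        else:
            self.buffer_s = max(0.0, self.buffer_s - dt)               # deplete while off the road
            self.related_glance_s = 0.0
        return self.buffer_s == 0.0     # True -> driver classified as distracted
```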
7. The Relationship between Facial Expressions and Distraction
Facial expressions can be described at different levels [
198]. A widely used description is the Facial Action Coding System (FACS) [199], which is a human-observer-based system developed to capture subtle changes in facial expressions. With FACS, these expressions are decomposed into one or more Action Units (AUs) [
200]. AU recognition and detection have attracted much attention recently [
201]. Meanwhile, psychophysical studies indicate that basic emotions have corresponding universal facial expressions across all cultures [
202]. This is reflected in most current facial expression recognition systems, which attempt to recognize a set of prototypic emotional expressions including disgust, fear, joy, surprise, sadness and anger [
201], which can be helpful in predicting driving behaviour [
203].
Therefore, in this work, the main facial expression studies in the driving environment are described in accordance with the two aforementioned levels (FACS and prototypic emotional expressions), together with how they relate to distraction.
On the one hand, in connection with FACS and distraction while driving, the reference work is the one proposed by Li et al. [
194]. The authors performed the analysis of driver’s facial features under cognitive and visual distractions. In addition to the obvious facial movement associated with secondary tasks such as talking, they hypothesized that facial expression can play an important role in cognitive distraction detection. They studied the top five features (from a total of 186 features) to predict both cognitive and visual distraction. For cognitive distraction, the most important features to consider are: (1) head yaw; (2) Lip Corner Depressor (AU15); (3) Lip Puckerer (AU18); (4) Lip Tightener (AU23) and (5) head roll. For visual distraction, the most important features to consider are: (1) Lip Tightener (AU23); (2) jaw drop (AU26); (3) head yaw; (4) Lip Suck (AU28) and (5) Blink (AU45). The results indicated that gaze and AU features are useful for detecting visual distractions, while AU features are particularly important for cognitive distractions. It should be pointed out that since the cognitive tasks considered in this study are closely related to talking activities, their future work will include the analysis of other cognitive tasks (e.g., thinking or solving math problems).
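As an illustration of how such features could be combined, the following hedged sketch builds a feature vector from head pose angles and AU intensities and trains a generic SVM classifier; the feature names, the upstream AU extractor and the classifier settings are assumptions and do not reproduce the setup of Li et al. [194]:

```python
# Hedged sketch: head pose + Action Unit (AU) intensities as features for a
# distraction classifier. Feature extraction and model settings are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Top features reported for cognitive distraction: head yaw, AU15, AU18, AU23, head roll.
COGNITIVE_FEATURES = ["head_yaw", "AU15", "AU18", "AU23", "head_roll"]

def feature_vector(frame_measurements):
    """frame_measurements: dict mapping feature names to values for one video frame."""
    return np.array([frame_measurements[name] for name in COGNITIVE_FEATURES])

def train_classifier(X, y):
    """X: one row per sample; y: 1 = cognitively distracted, 0 = baseline driving."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X, y)
    return model
```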
On the other hand, in connection with prototypic emotional expressions, there are some works trying to study how these emotions affect behaviour.
The relationship between emotion and cognition is complex, but it is widely accepted that human performance is altered when a person is in any emotional state. It is really important to fully understand the impact of emotion on driving performance because, for example, roadways are lined with billboard advertisements and messages containing a lot of different emotional information. Moreover, the distracting effects of emotions may come in other forms, such as cell phone, passenger conversations, radio information or texting information [
204]. For example, Chan et al. [
204] conducted a study to examine the potential for distraction from emotional information presented on roadside billboards. The findings of this study showed that emotional distraction (a) can seriously modulate attention and decision-making abilities and have adverse impacts on driving behavior for several reasons and (b) can impact driving performance by reorienting attention away from the primary driving task towards the emotional content, negatively influencing the decision-making process. In another study along a similar line, Chan et al. [
205] showed that emotion-related auditory distraction can modulate attention to differentially influence driving performance. Specifically, negative distractions reduced lateral control and slowed driving speeds compared to positive and neutral distractions.
Some studies have shown that drivers who are more likely to become angry (e.g., those with high trait anger rates) tend to engage in more aggressive behavior on the road, which can result in negative outcomes such as crashes [
206]. Moreover, anger negatively influences several aspects of driving performance and risky behaviors such as infractions, lane deviations, speeding and collisions [
207].
In conclusion, aggressiveness and anger are emotional states that strongly influence driving behaviour and increase the risk of accidents. However, an excessively low level of activation (e.g., resulting from emotional states like sadness) also leads to reduced attention and distraction, as well as prolonged reaction times and, therefore, lower driving performance [
208]. On this basis, research and experience have demonstrated that being in a good mood is the best precondition for safe driving and that happy drivers produce fewer accidents [
209]. In other words,
happy drivers are better drivers [
208,
210]. Facial expression and emotion recognition can be used in advanced car safety systems, which, on the one hand, can identify hazardous emotional states of the driver that can lead to distraction and, on the other, can provide tailored (according to each state and its associated hazards) suggestions and warnings to the driver [
211].
9. Simulated vs. Real Environment to Test and Train Driving Monitoring Systems
The development of the computer vision algorithm represents only one part of the whole product design cycle. One of the hardest tasks is to validate the whole system over the wide variety of driving scenarios [256]. In order to complete the whole "development process" of vision-based ADAS, some key points are presented.
In order to monitor both the driver and his/her driving behaviour, several hardware and software algorithms are being developed, but they are tested mostly in simulated environments instead of in real driving ones. This is due to the danger of testing inattention in real driving environments [
21]. Experimental control, efficiency, safety, and ease of data collection are the main advantages of using simulators [
257,
258]. Some studies have validated that driving simulators can create driving environments relatively similar to road experiments [
259,
260,
261]. However, some considerations should be taken into account since simulators can produce inconsistent, contradictory and conflicting results. For example, low-fidelity simulators may evoke unrealistic driving behavior and, therefore, produce invalid research outcomes. One common issue is that real danger and the real consequences of actions do not occur in a driving simulator, giving rise to a false sense of safety, responsibility, or competence [
262]. Moreover, simulator sickness symptoms may undermine training effectiveness and negatively affect the usability of simulators [
262].
A study on distraction in both simulated and real environments was conducted in [11] and it was found that the driver's physiological activity showed significant differences. Engström et al. [
11] stated that physiological workload and steering activity were both higher under real driving conditions compared to simulated environments. In [
257], the authors compared the impact of a narrower lane using both a simulator and real data, showing that the speed was higher in the simulated roads, consistent with other studies. In [
263], controlled driving yielded more frequent and longer eye glances than the simulated driving setting, while driving errors were more common in simulated driving. In [
167], the driver's heart rate changed significantly while performing the visual task in real-world driving relative to a baseline condition, suggesting that visual task performance in real driving was more stressful.
After the system is properly validated in a driver simulator, it should be validated in real conditions as well, because various factors including light variations and noise can also affect the driver’s attention. The application on a real moving vehicle presents new challenges like changing backgrounds and sudden variations of lighting [
264]. Moreover, a useful system should guarantee real time performance and quick adaptability to a variable set of users and to natural movements performed during driving [
264]. Thus, it is necessary to make simulated environments appear more realistic [
203].
To conclude, in most previous studies, independent evaluations using different equipment and conditions (mainly simulated environments) have resulted in time-consuming and redundant efforts. Moreover, inconsistency in the algorithm performance metrics makes it difficult to compare algorithms. Hence, the only way to compare most algorithms and systems is through the metrics provided by each author, but with scarce information about the images and conditions used. Public data sets covering simulated and real driving environments should be released in the near future, as previously stated by some authors [
203].
10. Privacy Issues Related to Camera Sensors
Although there is widespread agreement that intelligent vehicles can improve safety, the study of driver behaviour to design and evaluate intelligent vehicles requires large amounts of naturalistic driving data [265]. However, in the current literature, there is a lack of publicly available naturalistic driving data, largely due to concerns over individual privacy. It should also be noted that a real-time visual-based distraction detection system does not have to save the video stream. Therefore, privacy issues are mostly relevant in research works where the video feed is collected and stored to be studied at a later stage, for example in the large naturalistic studies conducted in the US.
Protecting individuals' privacy in a video sequence is commonly referred to as "de-identification" [266]. Although de-identification helps protect the identities of individual drivers, it impedes the purpose of sensorizing vehicles to monitor both drivers and their behaviour. In an ideal situation, a de-identification algorithm would protect the identity of drivers while preserving sufficient details to infer their behaviour (e.g., eye gaze, head pose or hand activity) [
265].
Martin et al. [
265,
267] proposed the use of de-identification filters to protect the privacy of drivers while preserving sufficient details to infer their behaviour. Following this idea, a de-identification filter preserving only the mouth region can be used for monitoring yawning or talking and a de-identification filter preserving eye regions can be used for detecting fatigue or gaze direction, which is precisely proposed by Martin et al. [
265,
267]. More specifically, the authors implemented and compared de-identification filters made up of a combination of preserving the eye regions for fine gaze estimation, superimposing head-pose-encoded face masks to provide spatial context and replacing the background with black pixels to ensure privacy protection. A two-part study revealed that a human facial recognition experiment had a success rate well below chance, while gaze zone estimation accuracy reached 65%, 71% and 85% for One-Eye, Two-Eyes and Mask with Two-Eyes, respectively.
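The following minimal sketch illustrates an eye-region-preserving de-identification filter of this kind; it uses an OpenCV Haar cascade as a stand-in eye detector, which is an assumption and not the detector used in the cited works:

```python
# Minimal sketch: keep only detected eye regions visible and black out everything
# else, as an eye-preserving de-identification filter. The Haar cascade is a
# stand-in eye detector used here only for illustration.
import cv2
import numpy as np

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def deidentify_keep_eyes(frame_gray):
    """Return a copy of the grayscale frame in which only detected eye regions remain."""
    eyes = eye_cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
    output = np.zeros_like(frame_gray)                  # black background ensures privacy
    for (x, y, w, h) in eyes:
        output[y:y + h, x:x + w] = frame_gray[y:y + h, x:x + w]
    return output
```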
Fernando et al. [
268] proposed the use of video de-identification in the automobile environment using personalized Facial Action Transfer (FAT), which has recently attracted a lot of attention in computer vision due to its diverse applications in the movie industry, computer games and privacy protection. The goal of FAT is to "clone" the facial actions from the videos of one person (source) to another (target) following a two-step approach. In the first step, their method transfers the shape of the source person to the target subject using the triangle-based deformation transfer method. In the second step, it generates the appearance of the target person using a personalized mapping from shape changes to appearance changes. In this approach, video de-identification is used to pursue two objectives: (1) to remove person-specific facial features and (2) to preserve head pose, gaze and facial expression.
11. General Discussion and Challenges Ahead
The main visual-based approaches reviewed in this paper are summarized in
Table 8 according to some key factors.
A major finding emerging from two recent research works reveals that just-driving baselines may, in fact, not be “just driving” [
26,
269], containing a considerable amount of cognitive activity in the form of daydreaming and lost-in-thought activity. Moreover, eye-gaze patterns are somewhat idiosyncratic when visual scanning is disrupted by cognitive workload [
27]. Additionally, “look-but-failed-to-see” impairment under cognitive workload is an obvious detriment to traffic safety. For example, Strayer et al. [
270] found that recognition memory for objects in the driving environment was reduced by 50% when the driver was talking on a hands-free cell phone, inducing failures of visual attention during driving. Indeed, visual, manual and cognitive distraction often occur simultaneously while driving (e.g., texting while driving and other cell-phone reading and writing activities). Therefore, estimates of crash risk based on comparisons of activities to just-driving baselines may need to be reconsidered in light of the possible finding that just-driving baselines may contain the aforementioned frequent cognitive activity. As a result, for example, the effects of secondary tasks while driving should be revisited [
269]. Accordingly, as detecting driver distraction depends on how distraction changes the driver's behavior compared to normal driving without distraction, periods with minimal or no cognitive activity should be identified in order to train the distraction detection algorithms.
Additionally, computer vision techniques can be used not only for extracting information inside the car, but also for extracting information outside the car, such as traffic, road hazards, external conditions of the road ahead, intersections, or even the position relative to other cars. The final step should be the correlation between the driver's understanding and the traffic context. One of the first works trying to fuse "out" information (visual lane analysis) and "in" information (driver monitoring) is the one proposed by Apostoloff et al. [
271], pointing out the benefits of this approach. Indeed, visual lane analysis can be used for “higher-order tasks”, which are defined by interacting with other modules in a complex driver assistance system (e.g., understanding the driver’s attentiveness—distraction—to the lane-keeping task [
272]). Hirayama et al. [273] focused on temporal relationships between the driver's eye gaze and the behaviour of peripheral vehicles. In particular, they concluded that the time at which a driver gazes towards an overtaking event under cognitive distraction is later than under the neutral state. Therefore, they showed that the temporal factor, that is, the timing of a reaction, is important for understanding the driver's state, focusing on cognitive distraction in a car-driving situation. Additionally, Rezaei et al. [
87] proposed a system correlating the driver’s head pose to road hazards (vehicle detection and distance estimation) by analyzing both simultaneously. Ohn et al. [
274] proposed a framework for early detection of driving maneuvers using cues from the driver, the environment and the vehicle. Tawari et al. [
275] provided early detection of driver distraction by continuously monitoring the driver and the surrounding traffic situation. Martin et al. [
276] focused on intersections and studied the interaction of head, eyes and hands as the driver approaches a stop-controlled intersection. In this line of research, Jain et al. [277] dealt with the problem of anticipating driving maneuvers a few seconds before the driver performs them.
There are many factors that can modulate distraction. For example, as discussed in
Section 7, emotional information can modulate attention and decision-making abilities. Additionally, numerous studies link highly aroused stress states with impaired decision-making capabilities [
278], decreased situational awareness [
279], and degraded performance, which could impair driving ability [
280]. Another driver state that can lead to distraction, and is often responsible for traffic violations and even road accidents, is confusion or irritation, as it is related to a loss of self-control and, therefore, of vehicle control; it can be provoked by non-intuitive user interfaces or defective navigation systems, as well as by complex traffic conditions, misleading signs and complicated routing. Moreover, the amount of information that needs to be processed simultaneously during driving is a source of confusion, especially for older people [
281], who have slower perception and reaction times. Just like stress, confusion or irritation leads to the impairment of driving capabilities, including the driver's perception, attention, decision making and strategic planning. Nervousness corresponds to a level of arousal above the "normal" one that best suits the driving task [
211]. It is an affective state with a negative impact on both the decision-making process and strategic planning. Nervousness can be induced by a variety of causes, either directly related to the driving task (as with novice drivers) or by other factors such as personal/physical conditions [
211].
The system should be validated, firstly, in a driver simulator and afterwards, in real conditions, where various factors including variations in lighting and noise can also affect both the driver’s attention and the performance of the developed algorithms. Therefore, public data sets covering simulated and real driving environments should be released. The driver’s physiological responses could be different in a driver simulator from those in real conditions [
11,
167,
257,
263]. Hence, while developing an inattention detection system, the simulated environment must be a perfect replica of the real environment. However, such systems are normally used in research and simulated scenarios, but not in real ones, due to the problems of vision systems working in outdoor environments (lighting changes, sudden movements, etc.). Moreover, they do not work properly with users wearing glasses and may have high computational requirements.
Data-driven applications require large amounts of labeled images for both training and testing the system. Both manual data reduction and data labeling are time-consuming and are also subject to the interpretation of the reductionist. Therefore, to deal with this problem, two approaches are emerging from the literature: (1) unsupervised or semi-supervised learning (SSL) and (2) automatic data reduction. For example, in connection with the first approach, Liu et al. [189] commented on the benefits of SSL methods. Specifically, they explained that the benefits of using SSL increase with the size of the unlabeled data set, showing that, by exploiting the data structure without actually labeling the data, extra information to improve model performance can be obtained. On the other hand, there has been growing interest in automatic data reduction using vehicle dynamics and looking outside on large-scale naturalistic driving data [
282,
283,
284], and looking in at the driver [
285].
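As an illustration of the first approach, the following hedged sketch uses scikit-learn's SelfTrainingClassifier to exploit unlabeled samples via self-training; this is not the SSL method of Liu et al. [189], and the base learner, features and threshold are assumptions:

```python
# Hedged sketch: self-training on partially labeled data with scikit-learn.
# Unlabeled samples are marked with -1, as scikit-learn's convention requires.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

def train_with_unlabeled(X_labeled, y_labeled, X_unlabeled):
    X = np.vstack([X_labeled, X_unlabeled])
    y = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
    base = SVC(probability=True, kernel="rbf")            # base learner must expose predict_proba
    model = SelfTrainingClassifier(base, threshold=0.9)   # pseudo-label only confident predictions
    model.fit(X, y)
    return model
```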
In many distraction detection systems, commercial sensors are commonly used [
77,
239,
241,
245,
246,
247]. We understand that the reason for this is twofold: these systems are well-established solutions offering both head and gaze tracking in the car environment, and research efforts can then be focused on detecting and predicting distraction from the outputs of these commercial sensors instead of developing a new sensor from scratch. These commercial sensors can operate using one camera [
239,
245,
246,
247], two cameras [
241] or even up to 8 cameras [
77] placed all over the vehicle cabin. What we find missing are research works comparing these commercial sensors in order to highlight the pros and cons of each one. Also missing from the literature are comparisons between a new sensor and a commercial one offering a competitive solution for the benefit of the research community.