1. Introduction
Driver distraction and inattention are key factors in traffic accidents. Distracted driving increases the probability of a crash because drivers shift their attention away from the driving task. To recognize and prevent these potential dangers, driving behavior monitoring plays an increasingly significant role in Advanced Driver Assistance Systems (ADAS); high-level ADAS can provide higher degrees of automation, in which drivers may even glance away from the primary operational task and be guided through critical situations.
Human-centric driving monitoring technologies can be divided into two categories: intrusive-sensing technologies and remote-sensing technologies. Intrusive-sensing technologies [1] detect head motion from attached head orientation sensors, and some biomedical sensing technologies [2,3] measure signals from the driver directly and intuitively, but they disturb the driver in the process, leading to complaints of inconvenience. Vision-based applications usually mount remote cameras inside the vehicle and are capable of monitoring the driver in a non-contact and non-invasive way. These applications benefit from advances in information technologies and can run computer vision algorithms on low-cost sensors.
Figure 1 shows the typical RGB-D camera and the corresponding RGB-D data.
In the driving context, the dynamics of a driver’s head and eyes indicate where or what he/she is looking at. The allocation of a driver’s gaze is linked to the driver’s current attention. Therefore, studying a driver’s gaze direction and fixation has been extensively applied to visual distraction detection and to understanding driver activities; in natural driving, many drivers move both their heads and eyes when looking at a target. Many gaze tracking systems have been proposed for monitoring a driver’s attention state [5]. Detailed surveys of gaze estimation and head pose estimation can be found in Refs. [6,7].
Coarse gaze direction based on a driver’s head orientation is usually acceptable in vision-based driver behavior monitoring systems. The probability of driver gaze is often generated by a gaze zone estimator. The discrete gaze zones are defined as the in-vehicle components that drivers look at, such as the windshield, rear-view mirror, side mirrors, etc. Since head pose contributes to gaze direction, many gaze zone estimation methods conveniently consider head orientation as the indicator of the gaze zone, and several studies treat gaze zone estimation as a combination of head pose estimation (head pose value) and gaze estimation (gaze angle of the eyeball) in three degrees of freedom (Euler angles): yaw, pitch and roll. This is consistent with natural driving behavior, in which many drivers move both their heads and eyes when looking at a target.
From the perspective of sensor information, driver gaze zone estimation systems fall into one of two categories: systems using RGB or grayscale cameras, and systems using RGB-D cameras.
RGB or Grayscale Cameras: Most systems that use RGB or grayscale cameras rely largely on precise localization of facial features. The Constrained Local Model (CLM) is one of the Facial Landmark Detection (FLD) methods, and has been commonly employed to extract and analyze the head pose and its dynamics in single-camera [8] or multiple-camera systems [9,10]. The driver’s face is detected in an unconstrained environment, and the frontal facial landmark points are then located under model constraints (typical instances are feature points annotated around the face contour, eyes, eyebrows, nose and mouth). In order to provide a representation robust to illumination and to accelerate detection, Vicente et al. [11] expressed the face shape by the Supervised Descent Method (SDM) using a SIFT descriptor and analyzed the geometric configurations of facial landmark points to estimate the head pose. After the FLD process, the head pose vector or facial feature landmarks are extracted as training features for gaze zone estimation.
In addition, FLD uses eye alignment to locate the eye region for eye pose estimation. By modeling the human eyeball as a spherical 3D eye model with a constant radius, only a few parameters are needed; one of the important parameters is the pupil center. As the pupil is darker than the other parts of the eye region, Fridman et al. [12] used an adaptive threshold on the histogram of the eye image to segment the pupil blob, but this does not work well under non-uniform, varying lighting conditions. On low-resolution eye images, Trawari et al. [10] detected the iris center (taken as the eyeball center) using the HoG descriptor. This method trained on local patches of the eye image under different lighting, but to a great extent needed additional image processing steps to ensure its detection quality. Vicente et al. [11] used an SDM tracker to detect eye landmarks, including six eye contour points and the pupil; their eye estimation followed a 3D eye model-based approach.
However, it remains challenging for the above systems to obtain depth information, especially when solving the 3D head pose from 2D images based on the detected landmarks and their relative 3D configurations under a weak perspective projection model. To address the varying changes of head position and head rotation, ultrasonic sensors [13,14] or dual cameras [15] are used as extra devices that generate additional information to compensate for head movement.
RGB-D Cameras: Standard RGB cameras can take advantage of color information but lack depth information due to inherent hardware restrictions. The great challenge for such works is vulnerability to illumination under poor environmental conditions, where light and shade have negative effects. To overcome some of these difficulties, RGB-D cameras are applied to obtain both RGB images and additional point-cloud information. RGB-D cameras synchronously capture RGB images and depth images, and are implemented by stereo cameras, structured light, time-of-flight or laser scanners; the more expensive the sensor, the more accurate the point-cloud it produces. Gaze zone estimation systems benefit from the depth appearance or point-clouds that RGB-D cameras generate.
To handle the point-clouds, the ICP (Iterative Closest Point) algorithm [16,17], which performs iterative registration between free-form three-dimensional rigid point-cloud surfaces, has been applied to calculate the rotation matrix and offset vector between the source face template and target face templates. Peláez C. et al. [18] presented a gaze zone estimation system that estimates head pose by analyzing the projection of a three-dimensional point-cloud based on ICP. With continuous iterative correction, ICP can minimize the distance from the source point-cloud to the target point-cloud within a given three-dimensional space. However, when the point-cloud size grows larger, the time cost increases dramatically.
Therefore, Bär et al. [19] used the Newton method to optimize the ICP solution process; the Newton method converges faster than gradient descent. Multiple templates were used in the point-cloud alignment to compute the head pose, and the driver’s gaze angle was subsequently analyzed with an eye gaze model. Experimental results show that their system obtains more robust head pose estimates than a single-template system, but the Newton method requires a stricter initial value, so their system suffers from falling into local solutions. Further studies show that, in the process of ICP alignment, adding a filter (such as a temporal filter [20], Kalman filter [21], etc.) to track and learn the state at the next timeframe can solve the ICP anisotropic conversion more effectively and stably. The Particle Swarm Optimization (PSO) algorithm [22] can solve this through the cooperative behavior of generations of evolutionary particles; although PSO achieves better results, its response is too slow.
Based on depth image appearances, a regression model for head pose estimation can be trained by labelling a large number of training samples. Fanelli et al. [23,24] built a random forest regression model and tested depth image appearances with different scanning accuracy. Random forest regression was used to map the depth images to the continuous head pose space by probabilistic voting, in which randomly sampled subsets were adopted to avoid over-fitting. Their results are sensitive to depth image acquisition and preprocessing, and poor inputs lead to poor solutions during online testing. Breitenstein et al. [25] used the depth appearance of the nose region to predict the head pose, collected reference appearances during an offline stage, and then calculated the errors between the candidate depth and the current input. However, these methods have not been applied in driver gaze zone estimation systems; one of the most important reasons is that the depth appearances may be incomplete in real driving environments due to illumination changes and occlusions.
For gaze estimation or head pose estimation using RGB-D cameras [26,27,28], RGB and depth images can also be used in different processing steps. Usually, the depth image is used for foreground segmentation, head localization and object tracking, while the RGB image is used for eye localization and feature extraction. For example, Cazzato et al. [29] located the facial landmark points and the position of the pupil center in RGB images, and predicted the head pose by ICP alignment; the human line of gaze was estimated from oriented feature points surrounding the eye. Mora et al. [30,31] also provided a gaze system combining head pose estimation and gaze estimation, but they used appearance-based gaze estimation methods instead of model-based methods. However, these methods only achieve good estimation accuracy for frontal faces; the errors increase on low-resolution eye images under free head movement.
This work focuses on an applicable gaze zone estimation system using RGB-D cameras that performs in a real-world driving environment, and adapts variants of ICP to align a driver’s face templates. The highlights of the paper are as follows:
An application-oriented ICP-based point-cloud alignment solution for continuous driver gaze zone estimation using an RGB-D camera is proposed, applying multi-zone templates for target face template revision, and particle filter tracking with auxiliary sampling for initializing and updating the best transformation of the source face template. At the same time, the head state is tracked and learned to cope with high rotation velocities under natural head turns, providing reliable head pose values in yaw, pitch and roll.
A novel appearance-based eye gaze estimation with two-stage neighbor selection is utilized, avoiding inaccurate pupil center localization in a real remote driving environment and the vulnerability of eye gaze models under very large head rotation. The proposed eye gaze estimation method treats gaze prediction as a combination of a cascaded nearest neighbor query and local feature regression.
A summary of driver gaze zone estimation using an RGB-D camera is provided in Table 1. Compared with previous gaze zone detection systems using RGB-D cameras, the proposed system provides continuous resolution not only for gaze zone estimation, but also for head pose estimation and gaze estimation. In the multi-template ICP of Ref. [19], the transformation of the point-clouds is determined by averaging the results of multiple templates. However, the target templates change with varying illumination, large head rotations and partial occlusion by eye glasses or light sources. We revise multi-zone ICP to balance the templates’ revision in the real driving scenario, and particle filter tracking is used to initialize and update the best ICP transformation. Unlike model-based gaze estimation methods, which are vulnerable under large head movement, the appearance-based gaze estimation method is a better alternative. Furthermore, we conduct the gaze estimation as a two-stage nearest neighbor selection from both the head pose space and the image feature space, which makes it more efficient. The proposed system outputs the final gaze zone index by classifying the gaze angle with head pose compensation.
The rest of this paper is organized as follows. Section 2 introduces our driver gaze zone estimation system, which combines head pose tracking and gaze estimation. The details of the multi-zone ICP-based head pose estimation appear in Section 2.1, Section 2.2 presents head state tracking by auxiliary particle filter, and Section 2.3 shows the proposed appearance-based gaze estimation with neighbor selection. In Section 3, the proposed system is evaluated and some practical implementation issues are considered. Finally, Section 4 gives a brief conclusion.
2. Proposed System
This paper presents a combination of multi-zone ICP-based head pose tracking and appearance-based gaze estimation to build a continuous driver gaze zone detection system (as shown in Figure 2). These two parts are handled in the depth image and the RGB image, respectively.
On the depth image, the scene depth information can be easily obtained. Therefore, as shown in Figure 3, the face region in the foreground is segmented from the driving environment with adaptive minimum distance restrictions. Simultaneously, face detection using the Viola–Jones method [32] is used to judge whether a driver’s face has been found and to further shrink the face region. At this point, the three-dimensional point-cloud data of the face template has been extracted precisely enough to meet the needs of subsequent operations. Some pre-processing is applied to remove outliers and reduce noise while preserving the geometric characteristics of the point-cloud. After smoothing the depth image, its corresponding three-dimensional point-cloud is generated for rigid transformation. This point-cloud is called the source template.
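As an illustration of this pre-processing chain, the following sketch combines depth-based foreground gating, Viola–Jones face detection and back-projection to a point-cloud with OpenCV and NumPy. The depth thresholds, smoothing kernel and cascade file are illustrative assumptions rather than the exact settings used in this work.

```python
import cv2
import numpy as np

# Hypothetical example values; the paper does not specify exact thresholds.
MIN_DEPTH_MM, MAX_DEPTH_MM = 400, 1200   # assumed driver-to-camera range

def extract_face_cloud(rgb, depth, fx, fy, cx, cy):
    """Segment the driver's face in the foreground and back-project it to a 3D point-cloud.

    rgb   : HxWx3 color image (uint8)
    depth : HxW depth image in millimetres
    fx, fy, cx, cy : depth-camera intrinsics
    """
    # 1) Foreground segmentation by a minimum-distance restriction on the depth image.
    foreground = (depth > MIN_DEPTH_MM) & (depth < MAX_DEPTH_MM)

    # 2) Viola-Jones face detection on the RGB image to shrink the face region.
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection

    # 3) Smooth the depth inside the face box and back-project valid pixels to 3D.
    face_depth = cv2.medianBlur(depth[y:y + h, x:x + w].astype(np.float32), 5)
    vs, us = np.nonzero(foreground[y:y + h, x:x + w] & (face_depth > 0))
    z = face_depth[vs, us]
    px = (us + x - cx) * z / fx
    py = (vs + y - cy) * z / fy
    return np.stack([px, py, z], axis=1)  # source template point-cloud (N x 3)
```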
To estimate head pose under large head rotation, a multi-zone ICP-based method is proposed. Using least squares techniques, the source point-cloud and the corresponding reference point-cloud templates are aligned by iterative operations of alignment, comparison and adjustment. Proper templates at different gaze zones can reduce the accumulated template error under large head motion. In order to handle cases where the iteration does not converge, the head state is tracked and learned by auxiliary particle filtering, and the ICP-based point-cloud alignment is initialized by the predicted head state. The head pose in Euler angles is output from the most recent head transformation. It should be noted that the reference templates for the multiple zones can be captured when a driver sits down and glances at the labeled center of each pre-defined self-centered gaze zone.
On the RGB image, the eye region is localized within the face region. Since the scale of a driver’s face region does not change dramatically, the eye region is easily captured under the constraint of face detection. The normalized eye images are mapped into the image feature space, while the head poses generated by the head pose estimation are mapped into the head pose space. Appearance-based gaze estimation using neighbor selection is utilized, in which both the head pose and the eye image features contribute to gaze prediction. Through a two-stage nearest neighbor search in both the head pose and image feature spaces, more relevant image features can be found for building the mapping relationship between the image feature space and the gaze angle space. The final gaze direction is obtained as the gaze angle with head pose compensation. Gaze zone estimation is then a classification of the final gaze direction by k-Nearest Neighbor.
Detailed information about head pose estimation, head state tracking and gaze estimation is described in the following sections.
2.1. Multi-Zone ICP-Based Head Pose Estimation
The human face region is considered as a rigid three-dimensional surface without deformation. Disregarding perspective transformation and scale factors, and taking into account only the rotation and translation of the coordinate system, the rigid transformation between two human face point-cloud sets is defined as:

$$ q_i = \mathbf{R} p_i + \mathbf{t}, \qquad \mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\mathrm{T}} & 1 \end{bmatrix} $$

where $\mathbf{T}$ is a $4 \times 4$ transformation matrix, $\mathbf{R}$ is a $3 \times 3$ rotation matrix, and $\mathbf{t}$ is a 3 × 1 translation vector. The rotation matrix of the point-cloud alignment is a continuous right multiplication of three orthogonal rotation matrices with a determinant of 1.
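For reference, this rigid transformation can be written as a small NumPy helper; this is only a sketch of the notation above, not part of the proposed pipeline itself.

```python
import numpy as np

def make_transform(R, t):
    """Assemble the 4x4 homogeneous transformation T from a 3x3 rotation R and a 3-vector t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply_transform(T, points):
    """Apply q_i = R p_i + t to an (N, 3) point-cloud."""
    return points @ T[:3, :3].T + T[:3, 3]
```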
To solve the transformation matrix $\mathbf{T}$, the ICP algorithm is applied to align different point-clouds [33]. During data acquisition and rigid transformation, unavoidable data noise always exists and prevents the alignment of the target and source point-clouds from achieving accurate results. Therefore, in order to improve the accuracy of the calculation, it is necessary to find as many effective corresponding point pairs as possible to constrain the transformation matrix.
The main steps of the basic ICP algorithm for point-cloud alignment are: (1) search the nearest neighbor point pairs between the two point-clouds using correspondence estimation; (2) calculate the transformation matrix by the least squares method in an iterative way with all the valid point pairs, until the convergence conditions are met.
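A minimal ICP loop following these two steps is sketched below with NumPy and SciPy. Note that the proposed method uses a Point-to-Plane metric with reciprocal correspondences (Section 2.1.1); the simpler point-to-point variant is shown here only to keep the example short, and the iteration count and tolerance are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(P, Q):
    """Least-squares rotation/translation mapping paired points P onto Q (Kabsch/SVD)."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # enforce a proper rotation (det = +1)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mq - R @ mp
    return R, t

def icp(source, target, R0=np.eye(3), t0=np.zeros(3), max_iter=50, tol=1e-5):
    """Iteratively align `source` to `target`, starting from an initial guess (R0, t0)."""
    tree = cKDTree(target)
    R, t = R0, t0
    prev_err = np.inf
    for _ in range(max_iter):
        moved = source @ R.T + t                             # current estimate of the source
        dists, idx = tree.query(moved)                       # step (1): nearest-neighbour pairs
        R_d, t_d = best_rigid_transform(moved, target[idx])  # step (2): least-squares update
        R, t = R_d @ R, R_d @ t + t_d
        err = dists.mean()
        if abs(prev_err - err) < tol:                        # convergence on the mean residual
            break
        prev_err = err
    return R, t
```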
2.1.1. Nearest Neighbor Search
Given a point-cloud set $P$ and a reference point-cloud set $Q$, a set of nearest neighbor point pairs $(p_i, q_i)$ can be extracted, where $p_i \in P$ and $q_i \in Q$; thus, for every $p_i \in P$, at least one closest point $q_i \in Q$ exists. In order to reduce the computational complexity of the search, the corresponding point pairs are computed along the normal under minimum distance constraints, and the obtained nearest neighbor is therefore an approximate nearest point rather than the ground-truth nearest point.
Figure 4 shows a schematic diagram of the nearest neighbor search process based on the Point-to-Plane method [34]. Firstly, based on the normal of the reference point $q_i$ on the point-cloud $Q$, the intersection $p'$ of this normal with the point-cloud $P$ can be found. Then, the tangent plane at $p'$ is constructed, and the perpendicular from the point $q_i$ to this tangent plane is drawn. Finally, the intersection point $p_i$ on the point-cloud $P$ is computed. Thus far, a point pair $(p_i, q_i)$ is extracted.
Through the Point-to-Plane nearest neighbor search, the found neighbor point pairs are not strictly constrained to a one-to-one correspondence. That means different points on the source point-cloud may build a pair relationship with the same point on the reference point-cloud. Moreover, because of data noise, some outliers are produced and confuse the related point pairs. To eliminate the interference of outliers and build a stable point-pair relationship, reciprocal correspondence point pairs are selected after a filtering method smooths the noise in space. The reciprocal correspondence point pairs are the intersection of the two sets of nearest neighbor point pairs obtained by exchanging the roles of the source and reference point-clouds.
In summary, a reciprocal correspondence nearest-neighbor point-pair search strategy is utilized in the proposed point-cloud alignment method; it accelerates the search, reduces index complexity, and generates effective point pairs for the subsequent transformation computation of the point-cloud alignment.
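A sketch of the reciprocal correspondence selection, assuming KD-tree queries in both directions, could look as follows; the helper is illustrative and omits the normal-based Point-to-Plane step.

```python
import numpy as np
from scipy.spatial import cKDTree

def reciprocal_correspondences(source, target):
    """Keep only point pairs that are mutual nearest neighbours between the two clouds."""
    tree_t = cKDTree(target)
    tree_s = cKDTree(source)
    _, s_to_t = tree_t.query(source)     # nearest target index for every source point
    _, t_to_s = tree_s.query(target)     # nearest source index for every target point
    src_idx = np.arange(len(source))
    mutual = t_to_s[s_to_t] == src_idx   # source -> target -> back to the same source point
    return src_idx[mutual], s_to_t[mutual]
```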
2.1.2. Iterative Computing of Transformation
The calculation process of the transformation matrix is as follows. Firstly, the space mapping reconstruction error function is defined by the least squares method using the generated nearest neighbor point pairs. Then, a coarse transformation matrix is solved by minimizing this error function. By projecting the source point-cloud into the coordinate system of the reference point-cloud, the new source point-cloud for the next iteration is obtained. Each iteration is thus a combination of optimizing the transformation matrix and searching nearest neighbor point pairs for the new source point-cloud. The fine transformation matrix is obtained once the convergence condition is satisfied.
When the final transformation matrix is solved, the rotation angles of the head pose in Euclidean space can be calculated using the right-hand Cartesian coordinate system (as shown in Figure 5):

$$ \alpha = \operatorname{atan2}(r_{21},\, r_{11}), \qquad \beta = \operatorname{atan2}\!\left(-r_{31},\, \sqrt{r_{32}^{2} + r_{33}^{2}}\right), \qquad \gamma = \operatorname{atan2}(r_{32},\, r_{33}) \tag{4} $$

where $r_{ij}$ denotes the element of $\mathbf{R}$ at row $i$, column $j$, and $\alpha$, $\beta$ and $\gamma$ denote the yaw, pitch and roll of the driver’s head pose, respectively.
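A direct transcription of this conversion, assuming the ZYX (yaw–pitch–roll) convention used in the reconstruction of Equation (4) above, is:

```python
import numpy as np

def rotation_to_euler(R):
    """Extract (yaw, pitch, roll) in radians from a rotation matrix, assuming a ZYX convention."""
    yaw   = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll  = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll
```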
In general, there are large rotations of a driver’s head under real driving conditions, but the vast majority of the related head poses are concentrated on several gaze regions, such as the left mirror, right mirror, rear-view mirror, windshield, etc. All these areas are known as gaze zones.
To reduce the cumulative error of the ICP iteration, a multi-zone ICP-based head pose estimation method is proposed, applying templates of different gaze zones in continuous tracking. To accelerate the iterative process of ICP, particle filtering is used to track the head pose and to initialize the coarse transformation matrix; detailed descriptions of the particle filtering are given in Section 2.2. All reference templates are collected with ground-truth head pose values and represent different gaze zones. The head pose estimation system first initializes the reference template with zero angle in the head pose Euclidean space, then calculates the Euclidean distance between the estimated head pose and the corresponding head poses of the reference templates, and determines the current template index by choosing the 1-Nearest Neighbor. Typically, a driver’s head pose varies depending on the driving behavior.
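The template selection described above amounts to a 1-Nearest-Neighbor query in the head pose space; a minimal sketch is given below, with the list of template reference poses as an assumed input.

```python
import numpy as np

def select_template(head_pose, template_poses):
    """Return the index of the gaze-zone template whose reference head pose is closest
    (Euclidean distance in yaw/pitch/roll space) to the currently estimated pose."""
    head_pose = np.asarray(head_pose)                 # (3,) yaw, pitch, roll
    template_poses = np.asarray(template_poses)       # (M, 3) reference poses, one per zone
    return int(np.argmin(np.linalg.norm(template_poses - head_pose, axis=1)))
```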
The steps of the proposed head pose estimation method are shown in Algorithm 1.
Algorithm 1: Multi-zone ICP-based Driver’s Head Pose Estimation.
1: Initialize the multiple point-cloud templates $\{Q_m\}$ for the different driver gaze zones.
2: For each new point-cloud $P_t$, calculate the predicted head state by particle filter tracking, and obtain the initial value of the transformation $(\mathbf{R}_0, \mathbf{t}_0)$.
3: Update the coarse head pose value based on Equation (4) with $\mathbf{R}_0$.
4: Update the current gaze zone index $m$ of the templates using the k-NN method.
5: Search the nearest point pairs between $P_t$ and $Q_m$ using the Nearest Neighbor Search algorithm, $C = f_{\mathrm{NN}}(P_t, Q_m)$, where $f_{\mathrm{NN}}$ is a Point-to-Plane nearest neighbor search function with a reciprocal correspondence strategy.
6: Calculate the optimal transformation $(\mathbf{R}^{*}, \mathbf{t}^{*})$ by minimizing the reconstruction error between $P_t$ and $Q_m$ over the point pairs in $C$; the transformation is computed in an iterative process, until the reconstruction error falls below the given threshold $\varepsilon$.
7: According to the right-hand Cartesian coordinate system, update the fine head pose value based on Equation (4) with $\mathbf{R}^{*}$.
8: Track the head state by particle filter and go to Step 2.
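The per-frame loop of Algorithm 1 can be sketched as follows, reusing the icp, rotation_to_euler and select_template helpers from the earlier sketches. The particle filter interface (predict/update) and the pose_to_transform helper are hypothetical placeholders for the components described in Section 2.2 and Equation (4), not the authors' implementation.

```python
import numpy as np

def process_frame(cloud, templates, template_poses, pf):
    """One iteration of Algorithm 1 (illustrative; helper names are hypothetical).

    cloud          : (N, 3) source face point-cloud for the current frame
    templates      : list of (N_m, 3) reference point-clouds, one per gaze zone
    template_poses : (M, 3) reference head poses of the templates
    pf             : particle filter object exposing predict()/update() (Section 2.2)
    """
    # Step 2: predict the head state and convert it to an initial transformation.
    theta0 = pf.predict()                      # six-dim pose prediction (x, y, z, yaw, pitch, roll)
    R0, t0 = pose_to_transform(theta0)         # hypothetical helper, inverse of Equation (4)

    # Steps 3-4: coarse pose and current gaze-zone template index.
    coarse_pose = theta0[3:]
    m = select_template(coarse_pose, template_poses)

    # Steps 5-6: refine the alignment against the selected zone template.
    R, t = icp(cloud, templates[m], R0=R0, t0=t0)

    # Step 7: fine head pose from the refined rotation.
    yaw, pitch, roll = rotation_to_euler(R)

    # Step 8: feed the measurement back into the particle filter for the next frame.
    pf.update(np.concatenate([t, [yaw, pitch, roll]]))
    return (yaw, pitch, roll), m
```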
2.2. Head State Tracking by Particle Filter
Particle filtering is a nonlinear filtering algorithm based on Bayesian estimation, and has unique advantages in parameter estimation and state tracking. In this section, it is assumed that the driver’s face is a rigid mesh, and we treat the alignment of the 3D point-clouds between the source and the templates as a motion variant of the head pose state. Therefore, the driver’s head state dynamic model is established based on a particle filter, and the translation and rotation of the head in a given state space are tracked and learned by the particle filter. In order to alleviate the particle impoverishment and weight assignment problems of the particle filter, an auxiliary sampling method is used in Sequential Importance Sampling (SIS).
Figure 6 shows the overall framework of head state tracking by particle filter.
2.2.1. State Space Model
In the state space, the unobservable driver’s head state is part of a time series dynamic and is defined as $x_t$. At the same time, observations $y_t$ are made at continuous time points; it is assumed that the state sequence is a Markov chain. In this case, similar to [35], the driver state space model can represent the process of the time series, the main composition of which is:

$$ x_t = f(x_{t-1}, w_t), \qquad y_t = g(x_t, e_t) $$

where $x_t = (\theta_t, v_t)$ is the driver’s head state, and $v_t$ is a two-dimensional vector consisting of the linear velocity and the angular velocity. The driver’s head state $x_t$ and the data $y_t$ are assumed to be generated by nonlinear functions $f$ and $g$, respectively, of the state and the noise disturbances $w_t$ and $e_t$. $\theta_t$ is a six-dimensional vector consisting of the head displacements along the $x$, $y$, $z$ axes and the head rotations $\alpha$, $\beta$, $\gamma$. Based on Equation (4), $\theta_t$ can be converted into the ICP initial values $\mathbf{R}_0$ and $\mathbf{t}_0$ of the rigid transformation.
Generally, the driver’s typical head motions can be divided into two parts. One is static state that focuses on the straight ahead direction without offset. The other motion is the linear dynamics that moves from one position to another. These situations can be modeled as mixed driver’s head state [
36]:
, where
, and
is a binary sign of velocity, with a value of 0 or 1.
denotes the state with a speed of almost zero, while
denotes the state of constant velocity.
and
are random variables that account for changes of the head state from different i.i.d. stochastic sequences.
The driver’s head state observation model is defined as $y_t = \mathbf{H} x_t + e_t$, where $\mathbf{H}$ is the conversion matrix between the two spaces, and $e_t$ is the noise at time $t$. The distribution of $e_t$ depends on the rotational speed of the driver’s head: when the rotational speed lies outside the thresholds $\tau_{\min}$ and $\tau_{\max}$ of rotation speed, the movement of the head exceeds the range; otherwise, the head stays still.
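A simplified sketch of this state space model is shown below; for brevity the velocity is applied directly to the six-dimensional pose vector (the model above uses a two-dimensional linear/angular speed), and the noise levels are illustrative assumptions.

```python
import numpy as np

def propagate_state(theta, v, lam, dt, sigma_w=0.01):
    """Mixed transition model: lam = 0 keeps the pose nearly still, lam = 1 applies
    a constant-velocity motion; sigma_w is an illustrative process-noise level."""
    w = np.random.normal(0.0, sigma_w, size=theta.shape)
    if lam == 0:
        return theta + w                 # static state: pose changes only by noise
    return theta + v * dt + w            # constant-velocity state

def observe(theta, sigma_e=0.02):
    """Observation model y_t = H x_t + e_t, here with H = I on the pose component."""
    return theta + np.random.normal(0.0, sigma_e, size=theta.shape)
```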
2.2.2. Particle Filter Tracking
On the basis of the above driver’s head state space model, the auxiliary particle filter method is applied to improve the probability distribution of the driver’s head state at the new time point. Relying on probabilistic inference of the posterior probability density, the joint probability density of the driver’s head state and the observed state is given as:

$$ p(x_{0:t}, y_{1:t}) = p(x_0) \prod_{k=1}^{t} f(x_k \mid x_{k-1})\, g(y_k \mid x_k) $$

where $p(x_0)$ is the initial probability density of $x_0$. According to the driver’s head state transition model and observation model, the states and the data come from a random sampling process: the sample paths $x_{0:t}$ take the initial value $x_0 \sim p(x_0)$ and otherwise $x_k \sim f(x_k \mid x_{k-1})$, while the corresponding observations $y_{1:t}$ are drawn as $y_k \sim g(y_k \mid x_k)$.
Since it is not possible to accurately obtain the current driver’s head state distribution $p(x_{0:t} \mid y_{1:t})$, the standardized importance distribution $q(x_{0:t} \mid y_{1:t})$ is utilized as an alternative, and the weight of the current state sample is updated from the previously observed driver’s head state. For the $i$-th sample, the importance weight is $w_t^{(i)} \propto p(x_{0:t}^{(i)} \mid y_{1:t}) / q(x_{0:t}^{(i)} \mid y_{1:t})$. By Bayes’ rule, and setting the importance distribution to the sequential form $q(x_{0:t} \mid y_{1:t}) = q(x_t \mid x_{0:t-1}, y_{1:t})\, q(x_{0:t-1} \mid y_{1:t-1})$, the joint probability density can be computed by

$$ p(x_{0:t} \mid y_{1:t}) \propto g(y_t \mid x_t)\, f(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid y_{1:t-1}). $$
Since it is impossible to sample directly according to the density function $p(x_{0:t} \mid y_{1:t})$, the $N$ samples $\{x_t^{(i)}\}_{i=1}^{N}$ are selected based on the probability density $q(x_t \mid x_{0:t-1}, y_{1:t})$, and the sampling importance weights are computed by

$$ w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{g(y_t \mid x_t^{(i)})\, f(x_t^{(i)} \mid x_{t-1}^{(i)})}{q(x_t^{(i)} \mid x_{0:t-1}^{(i)}, y_{1:t})}. $$

All these weights are normalized and mapped to the interval [0, 1].
The weights tend to degenerate after the filter runs for a long time, so importance re-sampling is added after each weight calculation. In order to favor the survival of particles at the next moment, auxiliary sampling is used in the standard re-sampling process of the probability distribution of the driver’s head state. It is assumed that the joint posterior probability function at a time point can be well approximated using the Dirac measure at that time point.
A rough approximation of the transition density, evaluated at a point estimate $\mu_t^{(i)}$ (e.g., its mean), is used in the re-sampling; the joint probability density can then be approximated by

$$ \hat{p}(x_{0:t} \mid y_{1:t}) \propto g(y_t \mid \mu_t^{(i)})\, w_{t-1}^{(i)}. $$

At this point, the generalized importance ratio of the particles is given as

$$ w_t^{(i)} \propto \frac{g(y_t \mid x_t^{(i)})}{g(y_t \mid \mu_t^{(k_i)})}, $$

where $k_i$ is the ancestor index selected in the first-stage re-sampling.
Compared with standard sequential importance sampling, the sampling used here revises the importance weights with the first-stage (auxiliary) weights and the weight ratios with the generalized importance ratio above. In this way, during the re-sampling process before sampling, the particles predicted at the previous moment are extended to increase particle diversity at the current moment and to reduce the variance of the importance weights, producing a more accurate estimate.
At this point, the driver’s head state transition density can be estimated based on the observation density up to a normalization constant $B$. Therefore, the current driver’s head state is computed as the weighted average of the samples, $\hat{x}_t = \sum_{i=1}^{N} w_t^{(i)} x_t^{(i)}$.
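One auxiliary-particle-filter step, written against generic transition and likelihood callables, could look as follows; this is a generic textbook-style sketch rather than the exact update used in the proposed system.

```python
import numpy as np

def auxiliary_pf_step(particles, weights, y, predict_mean, sample_transition, likelihood, rng=None):
    """One auxiliary-particle-filter update (illustrative sketch).

    particles         : (N, d) current particle set
    weights           : (N,) normalized importance weights
    y                 : current observation
    predict_mean      : f(particles) -> point predictions mu_t^i of the transition density
    sample_transition : f(particles) -> samples drawn from the transition density
    likelihood        : g(y, particles) -> (N,) observation densities
    """
    rng = rng or np.random.default_rng()
    N = len(particles)

    # First stage: weight each particle by how well its predicted point explains y.
    mu = predict_mean(particles)
    first_stage = weights * likelihood(y, mu)
    first_stage /= first_stage.sum()

    # Auxiliary re-sampling of ancestor indices before drawing the new particles.
    ancestors = rng.choice(N, size=N, p=first_stage)

    # Second stage: propagate the chosen ancestors and apply the generalized importance ratio.
    new_particles = sample_transition(particles[ancestors])
    new_weights = likelihood(y, new_particles) / likelihood(y, mu[ancestors])
    new_weights /= new_weights.sum()

    # Current head-state estimate: weighted average of the samples.
    estimate = (new_weights[:, None] * new_particles).sum(axis=0)
    return new_particles, new_weights, estimate
```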
2.3. Appearance-Based Gaze Estimation Using Neighbor Selection
The proposed appearance-based gaze estimation is modeled in a local neighbor-based regression manner, which contains three steps: feature extraction, two-stage neighbor selection, and PLSR for gaze regression. Facial landmark detection and eye region localization contribute to extracting the eye images and head pose for gaze prediction. Neighbor selection seeks the neighbors of a test sample in the training dataset; the nearest neighbors have similar properties in head pose and image feature. A gaze regression model based on PLSR (Partial Least Squares Regression) is then fitted using these neighbor samples.
The driver’s face always appears fully in the field of view. After the face region, defined by its bounding box, is localized, it is easy to obtain the eye region according to the landmarks, and the head angle values are computed through trigonometric operations on elements of the rotation matrix. The head vector is converted from the rotation matrix to its axis-magnitude representation by the Rodrigues transform, which can also be used to convert the axis-magnitude representation back into a rotation matrix.
The success of neighbor selection depends highly on the appropriate construction of the neighbor feature space. However, finding the proper neighbors in a large-scale eye image dataset is still a challenging problem. Because eye appearance is sensitive to head movement, the head pose feature is significant for appearance-based gaze estimation with free head movement. Similar gaze directions under the same head pose for the same subject have closely located pupil centers.
Here, gaze directions are regressed under similar head pose and image feature using the local manifold.
As shown in Figure 7, the proposed neighbor selection method consists of a double k-NN query in different feature spaces. This work provides a simple version of our previous work [37]. Here, raw features are used as the appearance descriptor. A training dataset with a query table has been built, in which each item of the table contains an index, an eye image and its corresponding features (head pose and image feature). Within the scope retrieved for the test data, a smaller set of nearest neighbors in image-feature space is then found, and these image features are used as neighbor appearance samples for local gaze regression.
Previous local regression methods based on k-NN usually estimate the gaze angle as the mean of the selected neighbors, which ignores the correlation between samples and gaze angles. To handle this, PLSR is utilized to reduce the dimensionality and project the gaze angle data onto components of maximum covariance with the image feature data. It is a combination of two methods: partial least squares (PLS) analysis and multiple linear regression. Furthermore, the statistically inspired modification of the PLS method (SIMPLS) algorithm is used in the local gaze regression for its competitiveness on large-scale datasets [38].
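The two-stage neighbor selection can be sketched with scikit-learn as below; the neighborhood sizes k_pose and k_feat are illustrative choices, not the values used in the experiments.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_stage_neighbors(test_pose, test_feat, train_poses, train_feats, k_pose=64, k_feat=8):
    """Stage 1: k_pose nearest neighbours in head-pose space.
    Stage 2: k_feat nearest neighbours in image-feature space, restricted to the stage-1 set.
    Returns indices into the training set (k_pose and k_feat are illustrative values)."""
    nn_pose = NearestNeighbors(n_neighbors=k_pose).fit(train_poses)
    _, pose_idx = nn_pose.kneighbors(test_pose.reshape(1, -1))
    candidates = pose_idx[0]

    nn_feat = NearestNeighbors(n_neighbors=k_feat).fit(train_feats[candidates])
    _, feat_idx = nn_feat.kneighbors(test_feat.reshape(1, -1))
    return candidates[feat_idx[0]]
```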
Given eye appearances $\mathbf{X}$ and gaze directions $\mathbf{Y}$, the gaze regression can be modeled as in [39] by

$$ \mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}, \qquad \mathbf{Y} = \mathbf{U}\mathbf{Q}^{\mathrm{T}} + \mathbf{F} $$

where $\mathbf{T}$ and $\mathbf{U}$ are the scores of $\mathbf{X}$ and $\mathbf{Y}$, respectively, $\mathbf{P}$ and $\mathbf{Q}$ are the loadings of $\mathbf{X}$ and $\mathbf{Y}$, respectively, and $\mathbf{E}$ and $\mathbf{F}$ are the residual matrices.
The PLS matrices $\mathbf{T}$ and $\mathbf{U}$ contain latent variables that are calculated as linear combinations of $\mathbf{X}$ and $\mathbf{Y}$. Assume $\mathbf{T} = \mathbf{X}\mathbf{W}$ and $\mathbf{U} = \mathbf{Y}\mathbf{C}$. Thus, according to Ref. [39], the PLSR model is reformulated as follows:

$$ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{F}^{*} $$
where $\boldsymbol{\beta} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1}\mathbf{Q}^{\mathrm{T}}$ is the matrix of regression coefficients and $\mathbf{F}^{*}$ is the residual matrix. The covariance between the score vectors is maximized in each iteration of PLS, and the $i$-th components of $\mathbf{W}$ and $\mathbf{C}$ can be computed by

$$ (\mathbf{w}_i, \mathbf{c}_i) = \arg\max_{\|\mathbf{w}\| = \|\mathbf{c}\| = 1} \operatorname{cov}(\mathbf{X}_i \mathbf{w},\; \mathbf{Y}_i \mathbf{c}) $$

where $\mathbf{t}_i = \mathbf{X}_i \mathbf{w}_i$ is the $i$-th score vector of $\mathbf{X}$. $\mathbf{X}_i$ and $\mathbf{Y}_i$ are the refined values of $\mathbf{X}$ and $\mathbf{Y}$ after subtracting their mean vectors. When the regression coefficients $\boldsymbol{\beta}$ are obtained, the predicted gaze angle can be determined by $\hat{\mathbf{y}} = \mathbf{x}\boldsymbol{\beta}$, where $\mathbf{x}$ is the image feature of the test sample.
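Finally, the local gaze regression over the selected neighbors can be sketched with scikit-learn's PLSRegression; note that this implementation uses the NIPALS algorithm rather than SIMPLS, so it only approximates the variant adopted above, and the number of components is an illustrative choice.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def predict_gaze(test_feat, neighbor_feats, neighbor_gazes, n_components=6):
    """Fit a local PLSR model on the selected neighbours and predict the gaze angle
    of the test sample (n_components is an illustrative choice)."""
    n_comp = min(n_components, len(neighbor_feats) - 1, neighbor_feats.shape[1])
    pls = PLSRegression(n_components=n_comp)
    pls.fit(neighbor_feats, neighbor_gazes)            # X: eye-image features, Y: gaze angles
    return pls.predict(test_feat.reshape(1, -1))[0]    # predicted gaze angle(s) for the test sample
```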