1. Introduction
Deceit, the distortion or omission of the (complete) truth, is a frequent and important aspect of human communication. However, most people, even with special training, fail to detect untruthful behavior; some studies show that their odds of succeeding are worse than chance [1]. The problem is that there is no such thing as an infallible source of deceit detection.
The traditional tools for deceit detection (i.e., polygraph tests) are based on several responses of the autonomic nervous system (ANS)—blood pressure, breathing pattern, skin resistance, etc.—correlated with the interrogation of the suspect. No standardized interrogation protocol has been proposed, so there are two main ways of administering these questions [2]: the Control Question Technique (CQT)—largely used in the United States, which aims at detecting psychological responses to the questions, and the Concealed Information Test (CIT) [3]—used in Japan, which is designed to detect concealed crime-related knowledge.
Besides these classical methods, other cues of deceit detection have been considered [4]: emblems, illustrators, micro-expressions and eye movements. The eyes are perhaps the most expressive and salient features of the human face, as they convey tremendous amounts of information: cognitive workload, (visual) attention, neurological processes, just to name a few.
Popular theories state that the position of the eyes can betray deceit; such a belief is used by one of the most popular so-called personal development techniques: Neuro-Linguistic Programming (NLP) [5]. One thesis of NLP states that the gaze direction can be used as an indicator of whether a person is telling the truth or not. More specifically, this theory suggests that humans tend to move their eyes to their left when visualizing past events, and tend to look right when constructing false events. A recent study [6] tested this hypothesis and found no evidence to support the idea that eye movement patterns can indicate lying. Despite several critiques [6,7] and the little scientific evidence to support NLP, this theory is still widely spread on the Internet and used by many NLP practitioners. However, this does not imply that other eye features cannot be used as cues to deceit detection.
In the same context, it is worth mentioning the Facial Action Coding System (FACS) [8]—an anatomical methodology, developed in the late 1970s, to describe all observable human face movements. Basically, this taxonomy breaks down facial expressions into Action Units (AUs): contractions or relaxations of facial muscles. Since its publication, it has undergone multiple revisions and updates, and it is now often used in facial expression analysis. AUs 61 to 64 describe the eye movements: eyes (turn) left, eyes (turn) right, eyes up and eyes down.
Blinking is defined as a rapid closure followed by a re-opening of the eyelids; this process is essential in spreading the tears across the eyes’ surface, thus keeping them hydrated and clean. The average duration of a blink is considered to be around 100–150 ms [9], although some studies suggest longer intervals (100–400 ms) [10]. Eye closures that last longer than 1 s can be a sign of micro-sleeps: short time intervals in which the subject becomes unconscious and is unable to respond to external stimuli. A detailed survey of the oculomotor measures which could be used to detect deceitful behavior can be found in [11].
Blinks (and micro-sleeps) have been examined in a variety of multidisciplinary tasks: deceit detection [12,13], driver drowsiness detection [14,15], human–computer interaction [16] and attention assessment [17], just to name a few.
Automatic blink detection systems can be roughly classified as appearance based or temporal based [18]. Appearance based methods rely on the appearance of the eye (closed or open) in each frame to determine the blink intervals. Temporal based methods detect blinks by analyzing the eyelid motion across video frames.
In [19], the authors propose a real-time liveness detection system based on blink analysis. The authors introduce an appearance based image feature—the eye closity—determined using the AdaBoost algorithm. Based on this feature, eye blinks are detected by a simple inference process in a Conditional Random Field framework.
The work in [15] proposes a novel driver monitoring system based on optical flow and the driver’s kinematics. Among other metrics, the system computes the percentage of eyelid closure over time (PERCLOS) in order to infer the driver’s state. The blinks are detected by analyzing the response of a horizontal Laplacian filter around the eyes. The authors assume that when the eyes are open, numerous vertical line segments (caused by the pupils and the eye corners) should be visible; on the other hand, when the eyes are closed, only horizontal lines should be observed. The system decides on the eye state by applying a simple threshold to the value of the horizontal gradient. The value of this threshold was established heuristically, based on trial and error experiments, such that the number of false positives is minimized.
In [20], the average height–width eye ratio is used to determine whether the eyes are open or closed in a given frame. First, 98 facial landmarks are detected on the face using active shape models; based on the contour of the eyes, the height–width eye ratio is computed and a simple thresholding operation is used to detect eye blinks: if this measure changes from a value larger than 0.12 to a value smaller than 0.02, a blink is detected. As a static threshold value is used, this method cannot cope with real world, “in the wild” video sequences.
In [18], blinks are detected by analyzing the vertical motions which occur around the eye region. A flock of Kanade–Lucas–Tomasi trackers is initialized on a grid around the peri-ocular region and used to compute the motion of the cells which compose this grid. State machines analyze the variance of the detected vertical motions. The method achieves a 99% accuracy rate.
A real-time blink detector designed for very low resolution near-infrared images is presented in [21]. Blinks are detected based on thresholded image differences inside two tracked regions of interest corresponding to the right and left eye, respectively. Next, optical flow is computed in order to determine whether the detected motion belongs to an eyelid closing or opening action.
Automatic detection of the gaze direction implies localizing the iris centers and determining their position relative to the eye corners or to the eye’s bounding rectangle.
Eye trackers have been extensively studied by the computer vision community over the last decades; in [22], the authors present an extensive eye localization and tracking survey, which reviews the existing methods and the future directions that should be addressed to achieve the performance required by real world eye tracking applications. Based on the methodology used to detect and/or track the eyes, the authors identified the following eye tracking system types: shape based, appearance based and hybrid systems.
Shape based methods localize and track the iris centers based on a geometrical definition of the shape of the eye and its surrounding texture; quite often, these methods exploit the circularity of the iris and the pupil [23,24,25,26]. In [26], the iris centers are located as the locus where most of the circular image gradients intersect. An additional post-processing step is applied to ensure that the iris center falls within the black area of the pupil.
Appearance based methods rely on the response of various image filters applied to the peri-ocular region [27]. Finally, hybrid methods [28] combine shape based methods with appearance based methods to overcome their limitations and increase the system’s overall performance.
Nowadays, deep learning has attained impressive results in image classification tasks, and, as expected, deceit detection has also been addressed from this perspective. Several works used convolutional neural networks (CNNs) to spot and recognize micro-expressions—one of the most reliable sources of deceit detection. In [29], the authors trained a CNN on the frames from the start of the video sequence and on the onset, apex and offset frames. The convolutional layers of the trained network are connected to a long short-term memory recurrent neural network, which is capable of spotting micro-expressions. In [30], the authors detect micro-expressions using a CNN trained on image differences. In a post-processing stage, the predictions of the CNN are analyzed in order to find the micro-expression intervals and to eliminate false positives.
The work in [31] presents an automatic deception detection system that uses multi-modal information (video, audio and text). The video analysis module fuses the scores of classifiers trained on low-level video features and on high-level micro-expressions.
This study aims at developing computer vision algorithms to automatically analyze eye movements and to compute several oculomotor metrics which have shown great promise in detecting deceitful behavior. We propose a fast iris center tracking algorithm which combines geometrical and appearance based methods. We also correlate the position of the iris center with the eye corners and eyelid apexes in order to roughly estimate the gaze direction. Based on these features, we compute several ocular cues which have been proposed in the scientific literature as cues to deceit detection: the blink rate, the gaze direction and the eye movement AUs.
This work will highlight the following contributions:
The development of a blink detection system based on the combination of two eye state classifiers: the first classifier estimates the eye state (open or closed) based on the eye’s aspect ratio, while the second classifier is a convolutional neural network which learns the filters needed to infer the eye state.
The estimation of the gaze direction based on the quantization of the angle between the iris center and the eye center. The iris centers are detected using a shape based method which employs only image derivatives and facial proportions. The iris candidates are only selected within the internal eye contour detected by a publicly available library.
The definition of a novel metric, the normalized blink rate deviation (NBRD), which is able to capture the differences between the blink rates in cases of truthful and deceitful behavior. It computes the absolute difference between a reference blink rate and the blink rate of the new session, normalized by the reference blink rate in order to account for inter-subject variability.
The remainder of this manuscript is structured as follows: in Section 2, we present the proposed solution in detail and, in Section 3, we report the results of the experiments we performed. Section 4 presents a discussion of the proposed system and its possible applications and, finally, Section 5 concludes this work.
2. Eye Movement Analysis System: Design and Implementation
The outline of the proposed solution is depicted in Figure 1.
The proposed eye analysis method uses a publicly available face and facial landmark detection library (dlib) [32]; it detects the face area and 68 fiducial points on the face, including the eye corners and two points on each eyelid. These landmarks are used as the starting point of the eye movement detection and blink detection modules.
Based on the eye contour landmarks, the eye aspect ratio is computed and used as a cue for the eye state: a smaller aspect ratio makes it more probable that the eye is closed. In addition, these landmarks are used to crop a square peri-ocular image centered on the eye center, which is fed to a convolutional neural network that detects the eye state. The responses of these two classifiers are combined into a weighted average to obtain the final prediction on the eye state.
The eye AU recognition module computes the iris centers and, based on their positions relative to the eye corners, decides on the AU.
2.1. Blink Detection
We propose a simple yet robust, appearance based algorithm for blink detection, which combines the responses of two eye state classifiers: the first classifier uses the detected fiducial points in order to estimate the eye state, while the latter is a convolutional neural network (CNN) which operates on peri-ocular image regions to detect blinks.
The first classifier analyses the eye aspect ratio ($EAR$); the width and height of the eyes are computed based on the landmarks provided by the dlib face analysis toolkit. The width is determined as the Euclidean distance between the inner and outer eye corners, while the height is determined as the Euclidean distance between the upper and lower eyelid apexes (Figure 2). As the face analysis framework does not compute the eyelid apexes, but two points on each eyelid, we approximate them through interpolation between these two points. The aspect ratios extracted from each frame are stored into an array $A$.
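For concreteness, the following is a minimal Python sketch of this computation, assuming the standard 68-point dlib landmark indexing (eye contours at points 36–41 and 42–47) and approximating each eyelid apex as the midpoint of its two eyelid landmarks:

```python
import numpy as np

# dlib's 68-point model: right eye = points 36-41, left eye = points 42-47;
# within each eye: [outer corner, upper lid x2, inner corner, lower lid x2]
RIGHT_EYE, LEFT_EYE = range(36, 42), range(42, 48)

def eye_aspect_ratio(pts):
    """pts: 6x2 array with one eye's landmarks in dlib order."""
    pts = np.asarray(pts, dtype=float)
    width = np.linalg.norm(pts[0] - pts[3])       # corner-to-corner distance
    upper_apex = (pts[1] + pts[2]) / 2.0          # interpolated upper eyelid apex
    lower_apex = (pts[4] + pts[5]) / 2.0          # interpolated lower eyelid apex
    height = np.linalg.norm(upper_apex - lower_apex)
    return height / width
```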
In the frames where the eyes are closed, the aspect ratio of the eye is expected to decrease below its average value throughout the video sequence. However, in our experiments, we observed that in cases of low resolution or degraded images, or in the presence of occlusions (eyeglasses or hair), the fiducial points are not precisely detected, so this metric is not always reliable.
Therefore, to address this issue, we decided to combine the response of this simple classifier with the predictions of a convolutional neural network (CNN). CNNs have achieved impressive results in several image recognition tasks; as opposed to classical machine learning algorithms, which require the definition and extraction of the training features, CNNs also learn the optimal image filters required to solve the classification problem. To achieve time efficiency, we propose a light-weight CNN inspired by the MobileNet architecture [33]. The key feature of MobileNet is the replacement of classical convolutional layers with depth-wise separable convolutional layers, which factorize the convolutions into a depth-wise convolution followed by a point-wise ($1 \times 1$) convolution. These filters allow building simpler, lighter models which can be run efficiently on computational platforms with low resources, such as embedded systems and mobile devices.
The topology of the proposed network is reported in Table 1.
The network has only four convolutional layers. The input layer accepts gray-scale images and is followed by a classical convolutional layer and three depth-wise separable convolutional layers. A dropout layer with a dropout keep probability of 0.99 is added before the last convolutional layer, which is responsible for the eye state classification. Finally, the output layer is a softmax layer.
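A minimal Keras sketch of such a topology is given below; the filter counts and input size are illustrative assumptions, as the exact layer parameters are those listed in Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_eye_state_cnn(input_size=24):
    """MobileNet-style eye state classifier: one classical convolution,
    three depth-wise separable convolutions, dropout, softmax output."""
    return models.Sequential([
        layers.Input(shape=(input_size, input_size, 1)),      # gray-scale input
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
        layers.SeparableConv2D(32, 3, padding="same", activation="relu"),
        layers.SeparableConv2D(64, 3, strides=2, padding="same", activation="relu"),
        layers.Dropout(0.01),                                 # keep probability 0.99
        layers.SeparableConv2D(2, 3, padding="same"),         # eye state logits
        layers.GlobalAveragePooling2D(),
        layers.Softmax(),
    ])
```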
The training data for the network consists of images from the Closed Eyes In The Wild (CEW) database [34]. The dataset comprises images from 2423 subjects; 1192 subjects have their eyes closed, while the remaining 1231 have open eyes. Some examples of images used to train the neural network are depicted in Figure 3. To ensure that the training set is representative enough and to avoid over-fitting, the training images underwent several distortions: contrast and brightness enhancement, random rotations and random crops.
The network was trained using the softmax cross entropy loss function, the RMSProp optimizer and asynchronous gradient descent, as described in [33].
The problem now is to merge the responses of the two classifiers to obtain the eye state: for each frame index, we combine the two predictions into a weighted average:

$$R_t = \alpha \cdot p_t + (1 - \alpha) \cdot \left(1 - \frac{EAR_t}{M}\right),$$

where $R_t$ is the response of the combined classifiers at frame $t$, $p_t$ is the probability of the eye being closed at frame $t$ as predicted by the CNN, $EAR_t$ is the eye aspect ratio at frame $t$, $M$ is the maximum value from the array $A$ and $\alpha$ is the weight given to the CNN classifier.
The result of combining the estimations of these two classifiers is depicted in Figure 4. In this figure, the aspect ratio is normalized as described in the above equation: it is divided by the maximum value $M$ so that its maximum becomes 1.0 and then inverted by subtracting its value from 1.0. In this way, blinks correspond to higher values in the feature vector.
The value of the weight was determined heuristically through trial and error experiments. All our experiments were performed using the same value, $\alpha = 0.75$, independently of the test database. We observed that the CNN gave predictions around 0.5–0.6 in cases of false positives and very strong predictions (>0.97) for true positives, while the aspect ratio based classifier failed to identify the closed eye state in degraded cases. We concluded that the CNN classifier should have a slightly higher weight, as in the majority of cases it recognized the eye state with high probability. On the other hand, in cases of false positives, combining it with the other classifier improved the overall classification, as the latter compensated for the “errors” made by the CNN.
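As a small illustration, the per-frame fusion can be written directly from the equation above ($\alpha$ = 0.75 as stated; the symbol names follow the reconstruction above):

```python
def combined_response(p_cnn, ear, ear_max, alpha=0.75):
    """Weighted average of the CNN closed-eye probability and the inverted,
    normalized eye aspect ratio; higher values indicate a more closed eye."""
    return alpha * p_cnn + (1.0 - alpha) * (1.0 - ear / ear_max)
```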
The blinks (the intervals when the eyes were closed) should correspond to local maxima in the response vector $R$. Using a sliding window of size $w$, we iterate through these predictions to find the local maxima, as described in Algorithm 1. An item $k$ from the prediction sequence is considered a local maximum if it is the largest element within its window and its value is larger than the average of the elements from this window by a threshold $\theta$.
Algorithm 1: Blink analysis.
Finally, we apply a post-processing step in order to avoid false positives: for each local maximum, we also extract the time interval in which the eyes were closed and we check that this interval is larger than or equal to the minimum duration of an eye blink.
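Since the algorithm box itself is not reproduced here, the following Python sketch shows one plausible implementation of this local-maximum search together with the duration check; the window size, thresholds and frame rate are illustrative assumptions:

```python
import numpy as np

def detect_blinks(R, w=7, theta=0.2, closed_thr=0.5, fps=30.0, min_blink_ms=100.0):
    """Return (start, end) frame intervals of detected blinks.

    R: per-frame combined responses (higher = more likely closed).
    A frame k is a blink candidate if R[k] is the largest value within its
    window of size w and exceeds the window average by theta.
    """
    R = np.asarray(R, dtype=float)
    half = w // 2
    min_frames = max(1, int(round(min_blink_ms / 1000.0 * fps)))
    blinks = []
    for k in range(half, len(R) - half):
        window = R[k - half:k + half + 1]
        if R[k] < window.max() or R[k] <= window.mean() + theta:
            continue
        # expand around the maximum to find the closed-eye interval
        lo, hi = k, k
        while lo > 0 and R[lo - 1] >= closed_thr:
            lo -= 1
        while hi < len(R) - 1 and R[hi + 1] >= closed_thr:
            hi += 1
        # post-processing: discard intervals shorter than a minimal blink
        if hi - lo + 1 >= min_frames and (not blinks or lo > blinks[-1][1]):
            blinks.append((lo, hi))
    return blinks
```

The blink rate then follows directly as `len(blinks) / (len(R) / fps / 60.0)` blinks per minute.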
The blink rate (BR) is computed as the number of detected blinks (i.e., local maxima of the response sequence) per minute. This metric has implications in various important applications: deceit detection, fatigue detection and understanding reading and learning patterns, just to name a few.
2.2. Gaze Direction
Another deception metric computed by the system is the gaze direction. The process of determining this value implies two main steps: iris center localization and gaze direction estimation. The iris centers are detected using a shape based eye detection method, while the gaze direction is determined by analyzing the geometrical relationship between the iris center and the eye corners.
2.2.1. Iris Center Localization
Iris centers are detected using a method similar to [23]: the Fast Radial Symmetry transform (FRST) [35] and anthropometric constraints are employed to localize them. FRST is a circular feature detector which uses image derivatives to determine the weight that each pixel has on the symmetry of the neighboring pixels, by accumulating the orientation and magnitude contributions in the direction of the gradient. For each image pixel $p$, a positively affected pixel $p_{+}$ and a negatively affected pixel $p_{-}$ are computed (Figure 5); the positively affected pixel is defined as the pixel the gradient is pointing to at a distance $r$ from $p$, and the negatively affected pixel is the pixel the gradient is pointing away from, at a distance $r$ from $p$.
The transform can be adapted to search only for dark or bright regions of symmetry: dark regions can be found by considering only the negatively affected pixels, while bright regions are found by considering only positively affected pixels.
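A compact NumPy/OpenCV sketch of the dark-region variant of the transform is shown below; it follows the description above, while the gradient threshold, normalization and smoothing constants are illustrative assumptions:

```python
import cv2
import numpy as np

def frst_dark(gray, radius, alpha=2.0, grad_thr=10.0):
    """Fast Radial Symmetry transform restricted to dark symmetric regions:
    only negatively affected pixels (one radius step against the gradient)
    are accumulated. Dark circular centers appear as local minima, matching
    the convention used in the text."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    h, w = gray.shape
    orient = np.zeros((h, w))     # orientation projection image
    magn = np.zeros((h, w))       # magnitude projection image
    ys, xs = np.nonzero(mag > grad_thr)
    ux, uy = gx[ys, xs] / mag[ys, xs], gy[ys, xs] / mag[ys, xs]
    nx = np.clip(np.round(xs - ux * radius).astype(int), 0, w - 1)
    ny = np.clip(np.round(ys - uy * radius).astype(int), 0, h - 1)
    np.add.at(orient, (ny, nx), 1.0)
    np.add.at(magn, (ny, nx), mag[ys, xs])
    orient = np.minimum(orient, radius) / radius          # clip and normalize
    resp = (orient ** alpha) * (magn / (magn.max() + 1e-9))
    resp = cv2.GaussianBlur(resp, (0, 0), 0.25 * radius)  # radius-scaled smoothing
    return -resp                  # negate so dark centers become minima
```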
One of the main issues of the method in [23] is that, in cases of degraded images or light eye colors, the interior parts of the eyebrows give stronger symmetry responses than the actual iris centers, thus leading to an inaccurate localization. In order to address this problem, we modified the method such that only the iris candidates located within the interior contour of the eye detected by the dlib library are selected.
First, the FRST of the input image is computed; the search radii are estimated based on facial proportions: the eye width is approximately equal to one fifth of the width of the human face, while the ratio between the iris radius and the eye width is 0.42 [23].
To determine the iris candidates, the area of the FRST image within the internal contour of the eye is analyzed and the first three local minima are retained. In order to ensure that the detected minima do not correspond to the same circular object, after a minimum is detected, a circular area around it is masked so that it is ignored when searching for the next minimum.
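A possible implementation of this masked minimum search, assuming a boolean mask of the inner eye contour and an illustrative suppression radius:

```python
import numpy as np

def iris_candidates(frst, eye_mask, k=3, suppress_radius=5):
    """Pick the k strongest local minima of the FRST response inside the
    eye contour; mask a disc around each one so that the next minimum
    belongs to a different circular structure."""
    S = np.where(eye_mask, frst, np.inf)       # ignore pixels outside the eye
    yy, xx = np.indices(S.shape)
    candidates = []
    for _ in range(k):
        y, x = np.unravel_index(np.argmin(S), S.shape)
        if not np.isfinite(S[y, x]):
            break                              # nothing left inside the mask
        candidates.append((y, x))
        S[(yy - y) ** 2 + (xx - x) ** 2 <= suppress_radius ** 2] = np.inf
    return candidates
```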
All the possible left–right iris pairs are generated, and the pair with the best score is selected as the problem’s solution. The score of a pair is computed as the weighted average of the pixel values from the symmetry transform image $S$ located at the coordinates of the left and right iris candidates, respectively.
After this coarse approximation of the iris centers, to ensure that each estimate is located within the black pupil area, a small neighborhood with a radius equal to half the iris radius is traversed and the center is set as the darkest pixel within that area.
2.2.2. Gaze Direction Estimation
NLP practitioners consider that the direction of the eyes can indicate whether a person is constructing or remembering information. In addition, in the FACS methodology, the movements of the eyes are encoded into the following AUs: AU61 and AU62—eyes positioned to the left and right, respectively—and AU63 and AU64, eyes up and eyes down.
The proposed system recognizes the four AUs which describe the eye movements: for each frame of the video sequence, we compute the angle between the center of the eye (computed as the centroid of the inner eye contour detected by dlib) and the center of the iris:

$$\theta = \operatorname{atan2}(y_i - y_e, \; x_i - x_e),$$

where $(x_e, y_e)$ and $(x_i, y_i)$ are the coordinates of the eye and iris centers, respectively, and $\theta$ is the angle between these two points.
The next step consists in the quantization of these angles to determine the eye AU, as illustrated in Figure 6.
We defined some simple rules based on which we recognize the eye movement action units. For AU61, eye movement to the left, the distance between the iris center and the eye center must be larger than a threshold $d_{min}$ and the angle (in degrees) between these two points must lie within a fixed interval around the leftward direction. For AU62, eye movement to the right, the angle between the iris and the eye center must lie in one of two intervals around the rightward direction; in addition, the distance between these two points must again be larger than $d_{min}$. The value of $d_{min}$ was determined heuristically and expressed in terms of facial proportions, as a fraction of the inter-pupillary distance.
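The quantization can be sketched as follows; the interval bounds and the fraction of the inter-pupillary distance used for $d_{min}$ are illustrative assumptions, not the paper’s exact values:

```python
import math

def eye_movement_au(eye_center, iris_center, ipd, k=0.05,
                    left=(150.0, 210.0), right_half_width=30.0):
    """Quantize the eye-center -> iris-center vector into an eye AU.
    Angles are in degrees, measured in image coordinates (y axis points
    down, flipped here so that 90 degrees points up)."""
    dx = iris_center[0] - eye_center[0]
    dy = iris_center[1] - eye_center[1]
    if math.hypot(dx, dy) <= k * ipd:          # iris close to the eye center
        return "neutral"
    ang = math.degrees(math.atan2(-dy, dx)) % 360.0
    if left[0] <= ang <= left[1]:
        return "AU61 (eyes left)"
    if ang <= right_half_width or ang >= 360.0 - right_half_width:
        return "AU62 (eyes right)"
    return "AU63 (eyes up)" if ang < 180.0 else "AU64 (eyes down)"
```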
4. Discussion
In this section, we provide a short discussion on how the proposed method could be used in deceit detection applications.
The blink rate has applications in multiple domains, such as driver monitoring systems and attention assessment.
The Silesian face database was developed in order to help researchers investigate how eye movements, blinks and other facial expressions could be used as cues to deceit detection. It includes annotations of the blink intervals, small eye movements and micro-tensions. A description of how the Silesian database was captured is provided in Section 3.
As some studies suggest that there is a correlation between the blink rate and the difficulty of the task performed by the subject, we investigated whether there are any differences between the blink rates based on the number of questions answered incorrectly by the participants from the Silesian face database. We assumed that the subjects who made multiple mistakes found the experiment more difficult.
On average, the participants answered incorrectly 0.762 of the 10 questions. The histogram of the mistakes made by the participants is depicted in Figure 9, and Figure 10 shows the average blink rate as a function of the number of mistakes made by the participants.
The subjects who correctly answered all the questions and those who made 1, 2 or 3 mistakes have approximately the same blink rates. The subject who got 4 out of 10 questions wrong had a significantly higher blink rate. Strangely, the subject who made the most mistakes (7 of the 10 questions were not answered correctly) had a lower blink rate. Of course, this is not necessarily statistically relevant because of the low number of subjects (one person got four questions wrong and another one got seven questions wrong). We also analyzed the correlation between the blink rate and deceitful behavior. In [36], the authors report the average blink rate per question, without taking into consideration the differences that might occur due to inter-subject variability. Based on this metric, there was no clear correlation between the blink rate and the questions in which the subjects lied about the shape displayed to them on the computer screen.
We propose another metric that takes into consideration the inter-subject variability: the normalized blink rate deviation (NBRD). As there are fewer questions in which the subjects tell the truth (three truthful answers vs. seven deceptive answers), we compute for each subject a reference blink rate $BR_{ref}$ as the average of the blink rates over the first four deceitful answers. For the remaining six questions, we compute the NBRD metric as follows:

$$NBRD_q = \frac{|BR_q - BR_{ref}|}{BR_{ref}},$$

where $BR_q$ is the blink rate for the current question $q$ and $BR_{ref}$ is the reference blink rate computed for the subject as the average blink rate on the four deceitful answers. This methodology is somewhat similar to the Control Question Technique [2], as for each new question we analyze the absolute blink rate difference with respect to a set of “control” questions.
The next step is to apply a simple classifier (a decision stump in our case) and to see whether the deceitful answers can be differentiated from the truthful ones. We randomly selected a subset of subjects to train the decision stump and kept the remaining subjects to test its accuracy. For each subject, we have three truthful questions and three deceitful questions (the other four deceitful questions are kept as the normalization reference), so the test data is balanced.
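A minimal sketch of this evaluation is given below, assuming per-subject blink rates indexed by question; the question indices and the scikit-learn stump are assumptions (a decision stump is simply a depth-1 decision tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def nbrd(blink_rates, ref_idx, eval_idx):
    """blink_rates: per-question blink rates for one subject.
    Returns the NBRD value for each evaluated question."""
    br = np.asarray(blink_rates, dtype=float)
    ref = br[list(ref_idx)].mean()                 # reference ("control") rate
    return np.abs(br[list(eval_idx)] - ref) / ref

# X_train / X_test: NBRD values as a single feature column;
# y_train / y_test: 1 for deceitful answers, 0 for truthful ones
stump = DecisionTreeClassifier(max_depth=1)        # a decision stump
# stump.fit(X_train.reshape(-1, 1), y_train)
# accuracy = stump.score(X_test.reshape(-1, 1), y_test)
```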
The performance of the simple classifier is reported in Table 6 and the corresponding confusion matrix in Table 7.
Therefore, from Table 6, we can conclude that a simple classifier was able to differentiate between the truthful and the deceitful questions based on the proposed NBRD metric.
Regarding the gaze direction in deceit detection, to our knowledge, there is no database annotated with the gaze direction in the context of a deceit detection experiment. The Silesian face database provides annotations of the subtle movements of the gaze (saccades); the following dictionary of eye movements is defined: EyeLeft, EyeRight, EyeUp, EyeDown, Neutral, EyeLeftUp, EyeLeftDown, EyeRightUp, EyeRightDown. However, these movements are short, low-amplitude, ballistic eye movements, and differ a lot from “macro” eye movements.
In the experiment, the subjects are asked to respond to the questions of a person whom they believe to be a telepath, according to instructions displayed on the screen. Therefore, the experimental setup is more controlled and the subjects do not need to access their memory (they need to provide a predefined answer), so it is expected that “macro” movements do not occur. The participants were instructed to tell the truth for questions 1, 2 and 9 and to lie for all the other questions.
Nevertheless, we analyzed the provided saccadic data in order to determine whether there is any connection between the direction of the saccades and deceitful behavior. As opposed to [36], we do not analyze the number of gaze aversions, but split the gaze aversions into four classes corresponding to the eye movements defined in the FACS methodology: eyes up, eyes down, eyes left and eyes right. NLP theories claim that these movements could indicate deceit.
Figure 11 illustrates the average lateral eye movements (left—annotations EyeLeft, EyeLeftUp, EyeLeftDown—vs. right—annotations EyeRight, EyeRightUp, EyeRightDown), while Figure 12 illustrates the vertical eye movements (up vs. down, annotations EyeUp and EyeDown, respectively).
There does not seem to be any distinguishable pattern in the saccadic eye movements which could indicate deceit. We also applied the proposed NBRD metric and a simple decision stump to try to detect deceitful behavior based on these eye movements. The results were not satisfying: we obtained below-average classification rates on both left and right eye movements. However, it is worth mentioning that the saccades from this database are not necessarily non-visual saccades (the type of saccades that has been correlated with deceit), but visual saccades (the subject needs to read a predefined answer from a computer screen).
5. Conclusions
In this manuscript, we presented an automatic facial analysis system that is able to extract various features of the eyes: the iris centers, the approximate gaze direction, the blink intervals and the blink rate.
The iris centers are extracted using a shape based method which exploits the circular (darker) symmetry of the iris area and facial proportions to locate the irises. Next, the relative orientation angle between the eye center and the detected iris center is analyzed in order to determine the gaze direction. This metric can be used as a cue to deception: an interlocutor not making eye contact and often shifting their gaze could indicate that they feel uncomfortable or have something to hide. In addition, some theories in the field of NLP suggest that the gaze direction indicates whether a person remembers or imagines facts.
The proposed system also includes a blink detection algorithm that combines the response of two classifiers to detect blink intervals. The first classifier relies on the eye’s aspect ratio (eye height divided by eye width), while the latter is a light-weight CNN that detects the eye state from peri-ocular images. The blink rate is extracted as the number of detected blinks per minute.
The proposed solution was evaluated on multiple publicly available datasets. On the iris center detection task, the proposed method surpasses other state-of-the-art works. Although this method uses the same image feature as [23] to find the circular iris area, its performance was boosted by more than 6% by selecting the input search space based on the position of the facial landmarks extracted with the dlib library.
We also proposed a new deceit detection metric—NBRD, the normalized blink rate deviation—defined as the absolute difference between the blink rate in a new situation and a reference blink rate, normalized by the subject’s reference blink rate. Based on this metric, a simple decision stump classifier was able to differentiate between the truthful and the deceitful questions with an accuracy of 96%.
As future work, we plan to capture a database intended for deceit detection; our main goal is to let the subjects interact and talk freely in an interrogation-like scenario. For example, the subjects will be asked to randomly select a note which contains a question and an indication of whether they should answer that question truthfully or not. Next, they will read the question out loud and discuss it with an interviewer; the interviewer—who is not aware whether the subject is lying or not—will engage in an active conversation with the participant. In addition, we intend to develop robust motion descriptors based on optical flow that could capture and detect the saccadic eye movements, and to detect these movements with a high speed camera by analyzing the velocity of the detected eye movements.