1. Introduction
Face recognition is a popular biometric authentication technique that is extensively used in many security and online systems. Compared to other biometric authentication schemes, it has the advantages of easy deployment and being non-intrusive. However, current approaches to face recognition have the key disadvantage of being easily spoofed: an impostor can present a photograph or recorded video of another person to the camera. Hence, face liveness detection is a crucial preprocessing step before performing face authentication via face recognition. Various approaches have been proposed for face liveness detection on static images, such as analysis of texture differences between live and fake faces as in Reference [1], motion analysis, and deep Convolutional Neural Network (CNN) architectures. Most of the recent research achieving high-accuracy face liveness detection has focused on a two-step process in which a diffusion step is performed first, followed by either a Support Vector Machine (SVM) classifier [2] or a deep CNN architecture [4]. Existing approaches that have been proposed to address dynamic face spoofing attacks on recorded videos are based on methods such as texture analysis, motion analysis, image quality, and 3D structure information. Recent research has focused on using deep CNN architectures and recurrent neural networks (RNN) for face liveness detection on video frames.
We present real-time solutions for face liveness detection on static images by integrating anisotropic diffusion, which converts the captured image to its diffused form, and a deep CNN into a single framework. The anisotropic diffusion allows the illumination energies to diffuse slowly on a uniform 2D surface, while moving faster on a 3D live face because of its non-uniformity [2]. The diffused image is then fed to a deep CNN architecture, i.e., either a Specialized Convolutional Neural Network (SCNN) or Inception v4. In the real-time solution using the SCNN, the network computes the smoothness of diffusion parameter alpha, whereas, for Inception v4, we fix the value of alpha. We evaluate the performance of our proposed methods on the Replay-Attack and Replay-Mobile datasets. For Inception v4, we experiment with various values of alpha. On average, alpha values of up to 75 gave better classification results in the integrated environment, since higher values blur out important information from the image. We further compare our proposed approaches with previous state-of-the-art approaches as well as other deep architectures for face liveness detection on static images, and find that our end-to-end approaches produce competitive accuracy, with the advantage of running in real time on a standard medium-power computer when detecting the liveness of images to counteract face spoofing.
For face liveness detection on video sequences, we develop a CNN-LSTM architecture where nonlinear diffusion is first applied to the individual frames in the sequence, and then the deep CNN and LSTM capture the deep spatial and temporal features in the sequence. Even though the use of a CNN followed by an LSTM has been reported in the literature, our contribution is to add anisotropic diffusion at the beginning. We evaluate the performance of the proposed video-frame-based face liveness framework on the Replay-Attack and Replay-Mobile datasets. We also compare our proposed approach with previous state-of-the-art approaches for face liveness detection on video frames, and demonstrate that our architecture is very competitive in liveness detection of video sequences to counteract face spoofing. We obtained better results than those reported in the literature on the Replay-Mobile dataset, and the second-best results on the Replay-Attack dataset.
The rest of the paper is organized as follows. Section 2 discusses previous work related to face liveness detection on static images and video sequences. Our proposed real-time methods of integrating anisotropic diffusion and CNN architectures for liveness detection on static images, and the method of applying diffusion followed by a CNN-LSTM for liveness detection on video sequences, are discussed in Section 3. Section 4 presents a performance evaluation of our architectures on the Replay-Attack and Replay-Mobile datasets for face anti-spoofing. Concluding remarks are given in Section 5.
2. Related Work
Many methods have been proposed for determining the liveness of a captured static image by extracting features from a 2D image and then feeding these to a classifier. Some of these extract variations of Local Binary Patterns and use a Support Vector Machine (SVM) classifier to identify whether the face is real or fake. Parveen et al. [1] introduced a texture descriptor known as Dynamic Local Ternary Pattern (DLTP), where the textural properties of facial skin were explored using a dynamic threshold setting, and an SVM with a linear kernel was used for classification. The method proposed by Kim et al. [2] is based on the difference in surface properties between live and fake faces, measured using diffusion speed. They computed the diffusion speed by utilizing the total variation flow and extracted anti-spoofing features based on the local patterns of diffusion speeds, which were then fed to a linear SVM classifier for determining the liveness of the facial image. Gragnaniello et al. [5] proposed a domain-aware CNN architecture by adding appropriate regularization terms to the loss function. In the work proposed by Das et al. [6], hand-crafted features and deep features were extracted by using a combination of a Local Binary Pattern (LBP) and a pre-trained CNN model based on the VGG-16 network architecture for liveness detection.
Some of the recent work in face liveness detection has focused on the use of deep CNN architectures [8], since these provide better liveness detection accuracy than the above-mentioned approaches. Rehman et al. [7] employed data randomization on small mini-batches for the training of deep CNNs for liveness detection. In the work proposed by Alotaibi et al. [3], diffusion of the captured image was followed by only a simple three-layer CNN architecture. The research proposed by Koshy et al. [4] combined nonlinear diffusion with deep CNNs and explored three architectures, namely CNN-5, ResNet50, and Inception v4, of which Inception v4 was determined to be the best. The main drawback of the approaches in References [4] has been the requirement of a preprocessing step to obtain the diffused image before feeding it to a deep CNN for classification, which makes them unsuitable for real-time deployment. A part of our work in this paper enhances the ideas of References [4] in a better integrated approach.
Various approaches have been proposed to address dynamic face spoofing attacks by determining the liveness of a video sequence. Wang et al. [9] proposed a detection approach where the sparse structure information in 3D space was analyzed. Facial landmarks were detected from the given face video, and key frames were selected from which the sparse 3D facial structure was recovered. The structures were then aligned and the structure features were extracted for classification using an SVM classifier. Another technique reported by Anjos et al. [10] is based on foreground and background motion correlation using optical flow, where they detected motion correlations between the head of the user and the background, and then fed the extracted features to a binary classifier to classify the sequence as real or fake. Wen et al. [11] proposed a detection algorithm based on Image Distortion Analysis (IDA). They extracted four different features, namely specular reflection, chromatic moment, blurriness, and color diversity, to form the feature vector, which was then fed to an SVM classifier.
A method based on the Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) operator, combining both space and time information into a single multi-resolution texture descriptor, was presented by de Freitas Pereira et al. [12]. The histograms were computed from the local binary patterns and concatenated for classification using Linear Discriminant Analysis (LDA) and SVM. Bharadwaj et al. [13] used motion magnification followed by two approaches, where one involves texture analysis using LBP and an SVM classifier, and the other involves a motion estimation approach using a Histogram of Oriented Optical Flow (HOOF) descriptor with LDA for classification. Tang et al. [14] proposed a challenge-response liveness detection protocol called face flashing, which flashes randomly generated colors and verifies the reflected light. This was repeated many times so that enough responses could be collected to ensure security. Yeh et al. [15] proposed an approach against face spoofing attacks based on perceptual image quality assessment with multi-scale analysis. They used a combination of an image quality evaluator and a quality assessment model for selecting effective pixels to create the image quality features for liveness detection. Pan et al. [16] presented a time-based presentation attack detection algorithm for capturing the texture changes in a frame sequence. They used a Motion History Image (MHI) descriptor to get the primary features, and used LBP and a pre-trained CNN to get the secondary feature vectors, which were then fed to a classifier network. In the work proposed by Asim et al. [17], LBP-TOP is cascaded with a CNN to extract spatio-temporal features from video sequences, followed by an SVM with a Radial Basis Function (RBF) kernel for classification.
Xu et al. [18] showed that a deep architecture combining LSTM with CNN can be used for face anti-spoofing in videos. Local and dense features were extracted by the CNN, while the LSTM captured the temporal relationships in the input sequences. Tu et al. [19] also proposed a joint CNN-LSTM network for face anti-spoofing in video sequences by focusing on the motion cues across video frames. They used Eulerian motion magnification as preprocessing to enhance the facial expressions of individuals. The CNN was then used for extracting the highly discriminative features of video frames, and the LSTM was used to capture the temporal dynamics in the videos. Costa-Pazo et al. [20] introduced the Replay-Mobile database and described two face presentation attack detection methods that were applied to the database. In one method, Image Quality Measures (IQM) were used as features, and, in the other, a texture-based approach using Gabor-jets was used. Classification was done using a support vector machine with a radial basis function kernel.
Nikisins et al. [21] proposed an anomaly detection, or one-class classifier-based, face presentation attack detection system with better generalization properties against unseen types of attacks. For their experiments, they used an aggregated database consisting of three publicly available datasets, including the Replay-Attack and Replay-Mobile datasets. Their system consisted of a preprocessor, a feature extractor, and a one-class classifier. They also evaluated and reported the results of some successful, previously published face liveness detection systems on the Replay-Attack and Replay-Mobile datasets: the LBP-based system in Reference [22], the IQM-based system, in which the feature vector of a frame is a concatenation of the quality measures introduced in References [23], and the motion-based approach in Reference [24]. Evaluation of the LBP-based and IQM-based systems was done at the frame level, whereas the motion-based approach was evaluated on video sequences. Fatemifar et al. [25] adopted the anomaly detection approach where the detector is trained on genuine accesses only, using one-class classifiers built on representations obtained from deep pre-trained CNN models. Each frame in a video clip was photometrically normalized based on the retina method to reduce the impact of various lighting conditions before being fed to the pre-trained networks. They used different CNN architectures and anomaly detectors, among which the class-specific Mahalanobis distance with GoogleNet features performed best on the Replay-Mobile dataset. Arashloo [26] presented a one-class novelty detection approach based on kernel regression, evaluated on the Replay-Mobile dataset, in which only bona fide samples were used in the training process, as in Reference [25]. A projection function defined in terms of kernel regression maps bona fide samples onto a compact cluster of target samples, and provides the best separability of normal samples from outliers with classification based on the Fisher criterion. Other mechanisms, which include a multiple kernel fusion approach, sparse regularization, and client-specific and probabilistic modeling, were also incorporated to improve performance.
The methods proposed by de Freitas Pereira et al. [12] and Bharadwaj et al. [13] made use of hand-crafted feature extraction, while we use a CNN that learns the features itself, which eliminates hand-engineered feature extraction. Though the method proposed by Tu et al. [19] gave state-of-the-art performance, it relied on the 13 convolution layers of VGG-16, whereas we use only a 4-layer CNN as the front-end of our architecture and still obtain very competitive results. The work presented by Fatemifar et al. [25] and Arashloo [26] used pre-trained models such as GoogleNet and ResNet50, while we designed a CNN-LSTM architecture with which we achieved better results on the Replay-Mobile dataset. The enhancement in our work is to apply nonlinear diffusion to the frames in the sequence to obtain sharp edges and preserve the boundary locations, and then feed the diffused frames to the CNN-LSTM.
3. Proposed Method
For the end-to-end real-time solution for liveness detection on static images, we propose a solution similar to what was done in References [4]. However, instead of using a preprocessing step (via MATLAB code) to diffuse the images before feeding them to the deep CNN, we provide an end-to-end solution with diffusion as well as more advanced deep CNNs. In our framework, we use a combined architecture where the diffusion process and the deep CNN are implemented in a single step. We use two different methods for the end-to-end solution. In the first method, we use a trainable alpha network that computes the smoothness of diffusion parameter (alpha), and the diffused image is then created using this computed alpha. This diffused image is then fed to a pre-trained three-layer CNN model (CNN layers with Batch Normalization) that gave 97.50% accuracy on the Replay-Attack dataset and 99.62% accuracy on the Replay-Mobile dataset. In the second method, we fix the value of the smoothness of diffusion parameter (alpha), use it to create the diffused image, and then feed the diffused image to an Inception v4 network. In either case, we feed the original captured images to the framework, where the first layer computes the nonlinear diffusion based on an Additive Operator Splitting (AOS) scheme and an efficient block-solver called the Tri-Diagonal Matrix Algorithm (TDMA). This enhances the edges and preserves the boundary locations of the real image (similar to the work proposed in References [4]). The diffused input image is then fed to the deep CNN architecture or the Inception v4 network to extract the complex and deep features, and to classify the image as real or fake. Our integrated implementation results in real-time detection of liveness. We also compare our proposed integrated method with previous approaches to liveness detection to determine how well it performs relative to current state-of-the-art methods.
For liveness detection on video frames, we propose a solution where we also first apply the nonlinear diffusion based on the AOS scheme and TDMA to the individual frames of the video sequence. This enhances the edges and preserves the boundary locations of the real image (similar to the static method proposed in References [4]). These diffused input images are then fed one-by-one to the CNN, which acts as the front-end of our architecture and extracts the complex and deep features. The output of the CNN is then fed to the LSTM, which detects the temporal information in the sequence, and, finally, the output dense neural network layer classifies the sequence as real or fake. For the sake of completeness, we provide a brief summary of the key concepts in nonlinear diffusion.
3.1. Nonlinear Diffusion
Linear diffusion smooths the input image at a constant rate in all directions to remove noise. Therefore, the smoothing process does not consider information regarding important image features such as edges [27]. The linear diffusion equation is given by:

$$\frac{\partial I}{\partial t} = \operatorname{div}(d \, \nabla I) \tag{1}$$

where $I$ is the image, $d$ is the scalar diffusivity, and div is the divergence operator. Solving this equation is somewhat equivalent to convolving the image with a Gaussian kernel, and, hence, linear diffusion can be regarded as a low-pass filtering process.
The edge-preserving capability of nonlinear diffusion makes it a powerful denoising technique, as the information contained in high spatial frequency components is preserved [28]. Anisotropic diffusion, which is nonlinear diffusion based on a partial differential equation, prevents the blurring and localization issues associated with linear diffusion, and focuses on reducing the image noise without removing significant parts of the image content such as edges. It improves the scale-space technique, enhances the boundaries, and preserves the edges [29]. The diffusion coefficient is locally adapted and is chosen as a function of the image gradient, so that it varies with both the edge location and its orientation in order to preserve the edges. The nonlinear diffusion process is defined by the equation:

$$\frac{\partial I}{\partial t} = \operatorname{div}\left(g(|\nabla I|) \, \nabla I\right) \tag{2}$$

where $\nabla I$ is the gradient, and the diffusivity $g$ is a function of the gradient.
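As an illustration of how a gradient-dependent diffusivity preserves edges, one commonly used choice is the Perona-Malik diffusivity. This is only an example: the text above does not state which form of $g$ is used, and the contrast parameter $\lambda$ is a symbol introduced here for illustration.

$$g(|\nabla I|) = \frac{1}{1 + |\nabla I|^{2} / \lambda^{2}}$$

With this choice, $g \approx 1$ in flat regions where $|\nabla I| \ll \lambda$, so smoothing proceeds as in linear diffusion, while $g \to 0$ at strong edges where $|\nabla I| \gg \lambda$, so diffusion across the edge is suppressed.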
The Additive Operator Splitting (AOS) scheme addresses the problem of regularization associated with anisotropic diffusion [30]. This semi-implicit scheme is stable for all time-steps and ensures that all co-ordinate axes are treated equally [31]. AOS enables fast diffusion, which smooths the edges in fake images while preserving the edges in real images. The iterative solution in AOS is given in Equation (3):

$$(I_k)^{t+1} = \sum_{l=1}^{m} \left(m I - \tau m^{2} A_l\right)^{-1} I_k^{t} \tag{3}$$

where $I_k$ is the diffused image, $m$ is the number of dimensions, $k$ represents the channel, $I$ (without subscript) is the identity matrix, $A_l$ is the diffusion along dimension $l$, and $\tau$ is the time step (referred to as the parameter alpha in our implementation). In the two-dimensional case, $m = 2$, and the equation then becomes:

$$(I_k)^{t+1} = \left(2 I - 4 \tau A_x\right)^{-1} I_k^{t} + \left(2 I - 4 \tau A_y\right)^{-1} I_k^{t}$$

where $A_x$ and $A_y$ denote the diffusion in the horizontal and vertical directions. In the operator splitting scheme, the equation is split into two parts; the solution to each part is computed separately, and the results are then combined.
The block-solver Tri-Diagonal Matrix Algorithm (TDMA) is a simplified form of Gaussian elimination that is useful for solving tri-diagonal systems of equations. The AOS scheme, together with TDMA, can therefore be used to efficiently solve the nonlinear, scalar-valued diffusion equation [31]. We implement the AOS scheme in the first layer of our implementations.
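To make the interplay between AOS and TDMA concrete, the following is a minimal pure-Python sketch, not the TensorFlow layer used in this work. It relies on the textbook form of the AOS update, in which the new image is the average over the m axes of one-dimensional semi-implicit solves (I − mτA_l)^{-1}, which is algebraically the same as a sum of (mI − m²τA_l)^{-1} terms. Several choices are assumptions made only for illustration: the Perona-Malik diffusivity with contrast parameter `lam`, the computation of the diffusivity from one-dimensional differences along each line (a full implementation would typically use the two-dimensional gradient of a pre-smoothed image), reflecting boundaries, and all function names.

```python
def tdma(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (TDMA).

    lower[i], diag[i], upper[i] are the sub-, main-, and super-diagonal
    entries of row i; lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def diffusivity(grad, lam):
    """Assumed Perona-Malik diffusivity: near 1 in flat areas, near 0 at edges."""
    return 1.0 / (1.0 + (grad / lam) ** 2)


def semi_implicit_step_1d(u, tau, lam):
    """Solve (I - tau * A(u)) u_new = u along one line of pixels."""
    n = len(u)
    # Diffusivity at each pixel from a central difference (reflecting ends).
    g = []
    for i in range(n):
        left, right = u[max(i - 1, 0)], u[min(i + 1, n - 1)]
        g.append(diffusivity(abs(right - left) / 2.0, lam))
    # Diffusivity between neighbouring pixels i and i + 1.
    w = [(g[i] + g[i + 1]) / 2.0 for i in range(n - 1)]
    lower, diag, upper = [0.0] * n, [1.0] * n, [0.0] * n
    for i in range(n - 1):
        upper[i] = -tau * w[i]
        lower[i + 1] = -tau * w[i]
        diag[i] += tau * w[i]
        diag[i + 1] += tau * w[i]
    return tdma(lower, diag, upper, u)


def aos_step_2d(image, tau, lam):
    """One AOS iteration for a 2D image stored as a list of rows (m = 2)."""
    m = 2
    rows, cols = len(image), len(image[0])
    # Horizontal solves: one tridiagonal system per row.
    horiz = [semi_implicit_step_1d(list(row), m * tau, lam) for row in image]
    # Vertical solves: one tridiagonal system per column.
    vert = [
        semi_implicit_step_1d([image[r][c] for r in range(rows)], m * tau, lam)
        for c in range(cols)
    ]
    # Average the two directional results.
    return [
        [(horiz[r][c] + vert[c][r]) / m for c in range(cols)]
        for r in range(rows)
    ]


# A small image with a vertical step edge; tau plays the role of alpha.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.1, 0.9, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
diffused = aos_step_2d(image, tau=5.0, lam=0.1)
```

Each one-dimensional system is strictly diagonally dominant, which is why the Thomas algorithm can be applied without pivoting and why the step remains stable for large values of the time step.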
3.2. End-to-End Diffusion-CNN Networks
3.2.1. Specialized Convolutional Neural Network (SCNN)
We implemented a specialized end-to-end diffusion-CNN network (with Batch Normalization) where the smoothness of diffusion parameter (alpha) is learned by the network. Convolutional Neural Networks (CNNs) work by combining the architectural concepts of local receptive fields, shared weights, and spatial or temporal subsampling in order to ensure some degree of shift, scale, and distortion invariance. To achieve higher accuracy, we use transfer learning by first training the CNN on diffused images created with a fixed smoothness of diffusion value of 15. We then use this pre-trained CNN (with batch normalization) to initialize the integrated diffusion architecture and retrain it to obtain higher accuracy.
In our architecture, the original image is first fed to an alpha network to compute the value of the smoothness of diffusion parameter (alpha). This is a neural network comprising a hidden layer of 15 neurons followed by a dense layer of one neuron, which outputs the alpha. The Rectified Linear Unit (ReLU) activation function is applied to the neurons in these layers. The SCNN model consists of three convolutional layers C1, C2, and C3 with 16, 32, and 64 feature maps, respectively, where kernel sizes of 15 × 15, 7 × 7, and 5 × 5 are used in the convolutions. Each convolution is followed by batch normalization, and max pooling is applied to the C1 and C2 layers after batch normalization to reduce the resolution. The larger filter size in the C1 layer is important for extracting the diffusion-enhanced features for liveness detection. The C3 layer batch normalization is followed by a dense layer of 64 neurons and a dense output layer of one neuron. The ReLU activation function is applied to the convolution layers and the hidden layer, and the sigmoid activation function is applied to the output layer. The SCNN is trained using the binary cross-entropy loss function and the Adam optimizer with an initial learning rate of 0.001.
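As a rough worked example of the layer sizes described above, the sketch below counts the weights in the three convolutional layers. The number of input channels (3, i.e., RGB) is an assumption that is not stated in the text, the function name is hypothetical, and batch normalization and dense layers are left out because their sizes depend on details (such as the input resolution and pooling size) that are not given here.

```python
def conv2d_param_count(kernel, in_channels, out_channels):
    """Kernel weights plus one bias per output feature map."""
    return kernel * kernel * in_channels * out_channels + out_channels


IN_CHANNELS = 3  # assumption: RGB input; not stated in the text

# (layer name, kernel size, number of feature maps) as described above.
layers = [("C1", 15, 16), ("C2", 7, 32), ("C3", 5, 64)]

counts = {}
channels = IN_CHANNELS
for name, kernel, feature_maps in layers:
    counts[name] = conv2d_param_count(kernel, channels, feature_maps)
    channels = feature_maps  # output maps become the next layer's input

print(counts)  # {'C1': 10816, 'C2': 25120, 'C3': 51264}
```

Under this assumption, C1 remains the smallest of the three layers despite its 15 × 15 kernels, because it has few input and output channels.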
Figure 1 below shows the proposed architecture. The nonlinear diffusion code implemented in TensorFlow is used to convert the original image to its diffused form, with the parameter alpha determined by the alpha network (bottom left in Figure 1). During backpropagation, the weights in the alpha network are updated, and the updated weights are used to compute the output of the single neuron in the dense layer (the parameter alpha) on the next forward pass.
3.2.2. Inception v4
To determine whether a more advanced CNN results in better accuracy, we replace the CNN part in Figure 1 with the Inception v4 network. However, instead of using an alpha network that computes the smoothness of diffusion (alpha), we fix the value of alpha in order to improve the real-time performance. The first stage is the nonlinear diffusion stage, whose input is the original 64 × 64 image captured through the webcam; this is followed by the Inception v4 architecture. The Inception network is a CNN architecture designed as a deeper and wider network. It consists of inception modules stacked upon each other with intermittent subsampling layers that reduce the resolution and, thereby, the sensitivity to shift and distortion in the image.
The Inception v4 architecture consists of an inception stem, three different inception blocks, namely inception-A, inception-B, and inception-C, which are used repeatedly 4, 7, and 3 times, respectively, and two reduction blocks for changing the resolution of the grid [32]. Convolutions in the inception modules within a block are performed by applying filters of multiple sizes to the same layer, making the network wider and enhancing the recognition of features at different scales. The resulting feature maps are then aggregated and forwarded to the next layer [33]. The complete architecture for this approach is illustrated in Figure 2. The diffused image obtained from the nonlinear diffusion stage using the fixed value of alpha is fed to the inception stem of the Inception v4 network, whose output is then passed through the inception-A blocks, reduction-A block, inception-B blocks, reduction-B block, and inception-C blocks, followed by an average pooling layer, dropout, and a dense output layer of two neurons with softmax activation. The network was trained using the Adam optimization algorithm. Since our targets are in a categorical format with two classes, fake and real, we used categorical cross-entropy as the loss function. The diffusion block was implemented via direct implementation of the diffusion equations, as described in Section 3.1.
For liveness detection on video frames, we need to keep track of the information in a sequence of frames. Thus, our architecture for this case consists of diffusion followed by a CNN feeding an LSTM layer, as shown in Figure 3. Multiplicative units called gates (input, output, forget) in each LSTM cell provide continuous analogues of write, read, and reset operations for the cells [34]. These units learn to open and close access to the constant error flow through the internal states of the cells [35]. The input gate indicates how much of the new information must be stored in the cell state, the forget gate indicates how much of the internal state can be removed, and the output gate indicates how much of the cell state can be sent as output to the next time-step. LSTMs in combination with CNNs have been used successfully in person identification from lip texture analysis [36], 3D gait recognition [37], and image-to-video person re-identification [38], and a deep bi-directional LSTM with a CNN was used for action recognition in video sequences in Reference [39]. Our enhancement is the addition of diffusion preprocessing to further improve liveness detection.
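The role of the three gates can be illustrated with a single time-step of a standard LSTM cell. The sketch below uses the common textbook formulation (without peephole connections); it is not taken from the implementation used in this work, and the function name, parameter layout, and toy dimensions are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, params):
    """One time-step of a standard LSTM cell.

    params maps each of "input", "forget", "output", "candidate" to a tuple
    (W, U, b): W has shape units x len(x), U has shape units x units, and
    b has length units.
    """
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h_prev))
            + b[j]
            for j in range(len(b))
        ]

    i_gate = [sigmoid(v) for v in affine("input")]    # how much new information to store
    f_gate = [sigmoid(v) for v in affine("forget")]   # how much old state to keep
    o_gate = [sigmoid(v) for v in affine("output")]   # how much state to expose
    cand = [math.tanh(v) for v in affine("candidate")]  # proposed new content

    c = [f_gate[j] * c_prev[j] + i_gate[j] * cand[j] for j in range(len(c_prev))]
    h = [o_gate[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c


# Toy example: 3 input features, 2 LSTM units, all weights set to zero,
# so every gate is half open and the candidate content is zero.
zero_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zero_U = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
params = {name: (zero_W, zero_U, zero_b)
          for name in ("input", "forget", "output", "candidate")}
h, c = lstm_cell_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

In this toy case the forget gate halves the previous cell state, which shows how a gate value between 0 and 1 acts as a soft reset rather than a hard one.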
The CNN part in our CNN-LSTM network has convolutional layers C1 and C2, one subsampling layer, and a fully connected layer of 50 neurons (adapted from the CNN-5 in Reference [4]). This is followed by the LSTM layer, which consists of 60 cells, and a feedforward output layer of two neurons. The sigmoid activation function is applied to the output layer, giving an output in the range 0 to 1. Nonlinear diffusion is first applied to the frames in each input sequence, and the diffused frames are fed to the CNN. The CNN captures the spatial information in the sequence by extracting the complex and deep discriminative features, and the hidden layer of the CNN produces an output of 50 features per frame. The input to the LSTM layer is three-dimensional, where the three dimensions are samples, time-steps, and features (i.e., batch size, 20, 50), where 20 is the number of frames (time-steps) per sequence. The LSTM layer captures the long-term temporal dependencies across frames in the sequence, and the 60 features obtained from the LSTM layer are fed to the output layer, which then classifies the 20-frame sequence as real or fake. The network is trained by backpropagation through time using the Adam optimization algorithm with mean-squared error as the loss function and a batch size of 32. The diffusion block was implemented via direct implementation of the diffusion equations.
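The tensor shapes and layer sizes in this description can be checked with a short calculation. The sketch below assumes a standard LSTM layer without peephole connections (four weight blocks, each with input weights, recurrent weights, and a bias) and uses hypothetical function names; the batch size of 32 and the dimensions 20, 50, 60, and 2 come from the text above.

```python
def lstm_param_count(features, units):
    """Four blocks (input, forget, output, candidate), each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (features + units) + units)


def dense_param_count(inputs, outputs):
    """Weights plus one bias per output neuron."""
    return inputs * outputs + outputs


batch_size, frames, features = 32, 20, 50  # LSTM input: (samples, time-steps, features)
units, classes = 60, 2

lstm_input_shape = (batch_size, frames, features)
lstm_output_shape = (batch_size, units)    # only the final time-step is passed on
output_shape = (batch_size, classes)

print(lstm_input_shape, lstm_output_shape, output_shape)
print(lstm_param_count(features, units))   # 26640
print(dense_param_count(units, classes))   # 122
```

The calculation shows that the recurrent part stays small: under these assumptions the LSTM and output layers together hold fewer than 27,000 trainable parameters.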