1. Introduction
Facial expressions convey a wealth of communicative information, often surpassing that conveyed through language and bodily cues. Numerous studies have contributed to facial expression recognition; however, many of these primarily focus on extracting coarse-grained macroscopic features of facial expressions, potentially overlooking subtle, brief facial information. In this paper, we take micro-expressions as an example to delve into the exploration of visual recognition methods based on fine-grained features.
When people hide their real emotions, subtle changes occur in their facial muscles, resulting in transient and involuntary facial expressions. A micro-expression reflects genuine emotion and is difficult to forge; it therefore plays an important role in fields such as medical diagnosis [1], lie detection [2] and business negotiation [3]. Because of its small amplitude and short duration, a micro-expression is difficult to capture and recognize with the naked eye: even after professional training, human observers achieve an accuracy of only 47% [4]. In the era of artificial intelligence, it is therefore of great significance to study how to capture and recognize micro-expressions with the help of fast visual capture equipment and machine learning. Current research on micro-expressions focuses on video stream data, and the existing methods mainly include LBP (Local Binary Pattern) [5], optical flow and deep learning-based methods [6].
LBP describes and extracts the local texture features of an image and has the advantages of grayscale invariance and rotation invariance. LBP usually extracts appearance and motion features from the sign of the difference between two pixels, so it loses some feature information. To extract more useful information, improved LBP-based methods also consider magnitude and direction, but this leads to higher-dimensional features. Most of these improved methods therefore reduce processing speed and make real-time detection difficult.
Optical flow methods use the relevance between adjacent frames in an image sequence to calculate the motion information of objects. These methods can detect small motion changes, and are suitable for recognizing micro-expressions. Optical flow methods extract geometric features from images, so they have a relatively low dimension of features and a relatively fast processing speed. However, the effect of optical flow feature extraction will be greatly reduced in the scenes with large changes in brightness.
Methods based on LBP or optical flow usually extract the shallow features of micro-expression images, so they cannot fully describe the structural information of samples and can hardly distinguish the relevance between high-dimensional features. Therefore, it is still a challenge to extract useful information and obtain high-quality description from micro-expressions.
For micro-expression recognition, feature extraction is a critical issue [7], especially for optical flow-based micro-expression recognition. Deep learning-based methods can extract deep features from micro-expression images and recognize them [8]. Because the changes in a micro-expression are relatively small, directly converting its RGB image into a matrix and feeding it into a network for convolution makes it hard to obtain discriminative features. Therefore, LBP or optical flow is usually used to extract low-dimensional features from the images before the network extracts deeper features.
Due to the weak intensity of micro-expressions, directly inputting their RGB images into the network for convolution makes it difficult to distinguish effective features. Therefore, we use micro-expression videos as the model input. We employ a deep learning-based video motion magnification network to amplify micro-expression features and use a convolutional neural network (CNN) to estimate optical flow, extracting the optical flow information between adjacent frames in the image sequence.
Existing networks may overlook the correlation of image information in channel and spatial dimensions. To address this issue, we combine ResNet and ConvLSTM to extract spatiotemporal information, adding attention mechanisms between them. Our model effectively focuses on the micro-expression information across multiple dimensions, achieving superior recognition performance.
Micro-expression datasets are typically small and imbalanced, which can hinder the training of deep networks and lead to overfitting. To address these challenges, we employ data augmentation techniques, such as random rotation, to expand the effective dataset. We exclude classes with very few samples to improve training efficiency. The ResNet parameters are initialized with ImageNet pre-trained weights, and the model is further trained on the macro-expression dataset CK+. Additionally, we incorporate dropout layers into our network to selectively ignore some neurons and reduce their interactions in each training batch, helping to prevent overfitting.
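As an illustration of the augmentation step, the sketch below builds such a pipeline with torchvision; the rotation range, flip probability and crop size are illustrative assumptions rather than the settings used in this work.

```python
from PIL import Image
from torchvision import transforms

# Illustrative augmentation pipeline; the rotation range, flip probability and
# crop size are assumptions, not the exact settings used in this work.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirroring
    transforms.RandomRotation(degrees=10),                # small random rotation
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),  # mild random cropping
    transforms.ToTensor(),
])

frame = Image.new("RGB", (256, 256))      # placeholder for one face-cropped frame
augmented = augment(frame)                # -> tensor of shape [3, 224, 224]

# Note: for video clips, the same randomly drawn parameters should be applied
# to every frame of a clip so that the temporal signal is not distorted.
```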
Figure 1 shows our research framework. The input to the model consists of video sequences containing micro-expressions. First, facial detection and alignment are performed on each frame to ensure consistent facial feature points across all frames. Next, an action magnification network is used to amplify micro-expression features, making them easier to capture. Subsequently, a convolutional neural network (CNN) processes the optical flow information between adjacent frames to capture subtle facial muscle movements. Specifically, the optical flow describes the motion patterns between two consecutive frames. Following this, ResNet is utilized to extract spatial features, and ConvLSTM is employed to capture temporal variations. By combining these two methods, richer feature representations are obtained. Additionally, both channel attention mechanisms and spatial attention mechanisms are adopted to achieve multidimensional focus on micro-expressions, enhancing the model’s ability to perceive important features. Finally, a fully connected layer maps the extracted features to the classification space, and a classifier is used for categorization.
Overall, our contributions can be mainly summarized as follows:
Instead of using traditional RGB images, we leverage video motion to recognize micro-expressions, which allows for better capture of expression dynamics.
We combine spatiotemporal features with static features and introduce two types of attention mechanisms—channel attention and spatial attention—to model micro-expressions in a fine-grained manner from multiple dimensions.
To address the challenge of limited micro-expression datasets, we employ effective data augmentation techniques such as rotation and flipping, improving the training process.
We conduct extensive experimental validation, demonstrating that ASTNet outperforms state-of-the-art micro-expression recognition methods in both accuracy and F1-score.
The remainder of this paper is organized as follows: Section 2 briefly reviews existing feature extraction methods based on traditional techniques or deep learning. Section 3 describes our processing pipeline and proposes our network. Section 4 details the experimental procedure, the parameters used, the experimental results and the analysis of module effectiveness. Section 5 presents an application of the model in a smart classroom system, and the final section summarizes our main work and contributions and outlines directions for future work.
3. Our Method
Firstly, we enlarge the features of the micro-expression and extract the optical flow features to improve the inter-class discrimination and enhance the micro-expression features. Then, we design a network structure, ASTNet, to further extract spatiotemporal features and identify emotions. In ASTNet, ResNet and ConvLSTM process the spatial and temporal features of the optical flow image, respectively. Finally, we add channel and spatial attention (CBAM) between them to generate the weight map of the feature map from the two dimensions of space and channel for feature adaptive learning.
Figure 2 shows the structure of our network.
3.1. Processing
- (1)
Micro-expression motion amplification
Micro-expressions change subtly, which leads to low feature intensity and low inter-class discrimination. Therefore, we use motion amplification technology to moderately amplify these subtle dynamic changes and improve the discrimination of emotions. Existing video motion amplification methods can be used for this step. Their working principle is as follows: first, Eulerian Video Magnification [22] decomposes the input video into different spatial frequency bands; then, the same temporal filtering is applied to each band and the signal of interest is multiplied by an amplification factor; finally, the amplified signal is superimposed on the original signal as the output. Mathematically, let $I(x, t)$ be the image intensity at position $x$ and time $t$, and let the displacement function $\delta(t)$ express the observed intensity, that is, $I(x, t) = f(x + \delta(t))$ and $I(x, 0) = f(x)$. After the displacement signal is amplified by a factor of $z$ and superimposed back onto the original image, the synthesized image signal $\hat{I}(x, t)$ is obtained, as shown in Equation (1):

$$\hat{I}(x, t) = f\big(x + (1 + z)\,\delta(t)\big) \tag{1}$$

This paper focuses on the amplification of facial micro-expression movements, setting the amplification factor $z$ to 2, 3 and 4 for comparison.
Figure 3 shows the amplification effects for three emotion categories in the CASME II dataset; each group of experiments compares the same frame.
In the emotion category labeled disgust, as the amplification factor increases, the muscle movements in the subjects' eyebrow area become more noticeable, and the activation level of the frowning action increases. In the happiness category, the muscle movements around the mouth become more noticeable, and the activation level of the mouth-corner lifting action increases. In the surprise category, the muscle movements in the eyebrow and mouth areas are generally amplified to varying degrees compared with other muscle regions.

Additionally, it can be observed that in the happiness category some subjects exhibit eye closure, while in the surprise category the subjects' eyeballs are amplified. This is because the network enhances eye movements such as pupil changes and blinking through the temporal information of consecutive frames. However, these eye movements do not conform to the definition of micro-expressions and do not contribute to their recognition.

From the amplification results of the three emotions, although image noise increases when the amplification factor is set to 3, the motion amplification is generally more noticeable than with a factor of 2. When the amplification factor is set to 4, facial muscles start to distort excessively and image noise increases significantly. Considering these factors, this paper adopts an amplification factor of 3 when amplifying the motion of micro-expression samples.
We magnify the micro-expression images in SMIC and CASME II by a factor of 3 using the motion amplification network [23].
Figure 4 shows the comparison between the images before and after processing.
- (2)
Optical flow feature extraction
In micro-expression recognition, feature extraction is a critical issue. Micro-expressions are brief, involuntary facial expressions, and directly using their RGB images as network input to extract deep features cannot effectively recognize emotions [24]. Therefore, extracting effective features is essential [25].
In this paper, we employ an optical flow-based feature extraction method due to the superior performance of optical flow in handling dynamic information and subtle changes, which is crucial for micro-expression recognition. Micro-expressions are often difficult to detect, and the optical flow method can provide detailed motion vectors, aiding in the more accurate recognition of micro-expressions.
Therefore, we use optical flow features as the input to the network and model them with the motion constraint equation of optical flow [22]. Specifically, for the RGB keyframe sequence, let $I(x, y, t)$ denote the light intensity of the pixel at position $(x, y)$ and time $t$. The pixel moves a distance $(dx, dy)$ to the next frame over a time interval $dt$. Since it is the same pixel, we assume that its light intensity remains unchanged before and after the movement, as shown in Equation (2):

$$I(x, y, t) = I(x + dx, y + dy, t + dt) \tag{2}$$

The flow field is continuously differentiable in both the spatial and temporal domains. Expanding the right-hand side of Equation (2) as a Taylor series gives Equation (3):

$$I(x + dx, y + dy, t + dt) = I(x, y, t) + \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt + \varepsilon \tag{3}$$

where $\varepsilon$ contains the second-order and higher terms. When $dt$ tends to zero, these terms become negligible, and combining Equations (2) and (3) yields the optical flow constraint equation, as shown in Equation (4):

$$\frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = 0 \tag{4}$$

Then, the corresponding optical flow vector is calculated as shown in Equation (5):

$$(u, v) = \left(\frac{dx}{dt}, \frac{dy}{dt}\right) \tag{5}$$
Figure 5 shows the use of LiteFlowNet [26] to extract and visualize the optical flow features between the 11 frames of a 'happiness' sample. In these optical flow images, different colors represent different directions of motion, while different color intensities represent different magnitudes of motion.
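LiteFlowNet requires its published network and trained weights; purely as an illustration of what a dense flow field and its color coding look like, the sketch below computes Farneback optical flow with OpenCV (a stand-in, not the method used in this paper) and maps direction to hue and magnitude to brightness, as in Figure 5.

```python
import cv2
import numpy as np

def flow_to_color(flow):
    """Visualize a dense flow field: hue = direction, brightness = magnitude."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2          # OpenCV hue range is [0, 180)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Two synthetic consecutive frames: a bright patch shifted by two pixels,
# mimicking a small facial movement between adjacent frames.
prev = np.zeros((128, 128), np.uint8)
curr = np.zeros((128, 128), np.uint8)
prev[40:60, 40:60] = 255
curr[42:62, 42:62] = 255

flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
cv2.imwrite("flow_visualization.png", flow_to_color(flow))
```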
3.2. Network
ASTNet consists of three parts: ResNet for extracting spatial features of the micro-expression image sequence, an attention mechanism module for extracting attention maps and ConvLSTM for processing temporal features and recognizing emotions. ResNet can train very deep neural networks and extract deeper, fine-grained spatial features of micro-expression image sequences. The attention mechanism calculates weighted features through attention scores, allowing for a better focus on key features during the micro-expression change process. Since we use micro-expression video sequences that vary over time, temporal modeling is necessary. ConvLSTM excels in capturing spatiotemporal relationships and can simultaneously model spatial relationships to recognize facial emotions. In the following sections, we will detail each component of the model.
- (1)
ResNet
The first part of our network uses transfer learning: we initialize the parameters of ResNet with ImageNet pre-trained weights and further train the network on the macro-expression dataset CK+. Finally, we use this ResNet to extract the spatial features of the optical flow images.
ResNet was proposed by He et al. [27]. They added a residual structure to the network to solve the degradation problem, in which stacking more layers beyond a certain depth worsens performance. Through identity mappings, ResNet also alleviates the vanishing and exploding gradient problems of deep networks.
Figure 6 shows the residual block.
We remove the fully connected layer of ResNet and directly output the convolved and pooled feature maps as the input to the next part of our network.
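A minimal PyTorch sketch of this setup, assuming ResNet-18 (the actual depth and configuration are given in Table 1): ImageNet weights are loaded, the fully connected layer is dropped, and the truncated backbone emits the features passed on to the next stage; the CK+ fine-tuning loop is omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-initialized backbone; ResNet-18 is an assumption here.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Drop the fully connected layer; the remaining layers end with global
# average pooling. (Dropping the pooling as well, i.e. [:-2], would keep a
# spatial feature map for the attention module that follows.)
backbone = nn.Sequential(*list(backbone.children())[:-1])

x = torch.randn(1, 3, 224, 224)          # one optical-flow frame
feat = backbone(x)                        # -> shape [1, 512, 1, 1] for ResNet-18
print(feat.shape)
```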
- (2)
Attention Mechanism
Each region or visual feature of a micro-expression image has a different importance for recognition. Therefore, to model the correlations of micro-expression image pixels in the spatial and channel dimensions, we use CBAM [28] to construct the attention mechanism. CBAM is an attention module for feedforward convolutional neural networks that can be integrated into a CNN. It has two sequential submodules: a channel module and a spatial module. The feature maps of a convolutional block can be adaptively refined through CBAM.
Figure 7 shows CBAM’s structure.
Given a feature map as input, CBAM successively infers a one-dimensional channel attention map $M_c$ and a two-dimensional spatial attention map $M_s$, which are applied to the input in turn.
The channel attention module uses the inter-channel relationships of the features to generate the channel attention map. First, it applies average pooling and maximum pooling to aggregate the spatial information of the feature map, generating the descriptors $F^{c}_{avg}$ and $F^{c}_{max}$. Then, both descriptors are passed through a shared multi-layer perceptron to generate the channel attention map $M_c$. Finally, the resulting feature vectors are merged. The structure of the channel attention module is shown in Figure 8.
The spatial attention module uses the spatial relationships of the features to generate the spatial attention map. First, average pooling and maximum pooling are applied along the channel dimension to aggregate the channel information of the feature map, producing two feature maps, $F^{s}_{avg}$ and $F^{s}_{max}$. Then, a standard convolution layer concatenates and convolves them to generate the 2D spatial attention map $M_s$. The structure of the spatial attention module is shown in Figure 9.
The channel and spatial attention modules are arranged sequentially. Based on our experimental results, we place the channel attention module first, followed by the spatial attention module.
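A compact sketch of a CBAM-style module in the channel-then-spatial order described above; the reduction ratio of 16 and the 7×7 convolution follow the original CBAM paper and are assumptions with respect to the exact settings of this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                     # from F_avg^c
        mx = self.mlp(x.amax(dim=(2, 3)))                      # from F_max^c
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)       # M_c
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                      # F_avg^s
        mx = x.amax(dim=1, keepdim=True)                       # F_max^s
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s
        return x * scale

class CBAM(nn.Module):
    """Channel attention first, then spatial attention, as described above."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```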
- (3)
ConvLSTM
The optical flow features of the micro-expression image sequence have a strong temporal correlation. Thus, we use ConvLSTM [29] to exploit the temporal information of adjacent frames while retaining the spatial information.
LSTM [30] works well on video streams, but when processing spatiotemporal data it models the input-to-state and state-to-state transitions with fully connected operations and flattens the input image into a one-dimensional vector. As a result, LSTM does not encode spatial structure and loses spatial information. ConvLSTM replaces these matrix multiplications with convolutions in both the input-to-state and state-to-state transitions, and can therefore retain the spatial information of the sequence. The structure of ConvLSTM is shown in Figure 10.
ConvLSTM combines the advantages of CNNs and LSTMs, enabling the capture of spatial features while modeling the dynamic variations of time-series data. ConvLSTM controls the flow of information through three gates (input gate, forget gate and output gate), whose computations use convolutions instead of the fully connected operations of traditional LSTMs. The input gate controls the impact of the input data on the cell state, the forget gate determines how much of the previous time step's cell state is forgotten, and the output gate regulates the influence of the current cell state on the hidden state. The gates are computed as detailed in Equations (6)–(8):

$$i_t = \sigma(W_{xi} \ast X_t + W_{hi} \ast H_{t-1} + b_i) \tag{6}$$
$$f_t = \sigma(W_{xf} \ast X_t + W_{hf} \ast H_{t-1} + b_f) \tag{7}$$
$$o_t = \sigma(W_{xo} \ast X_t + W_{ho} \ast H_{t-1} + b_o) \tag{8}$$

where $\sigma$ denotes the sigmoid function, $\ast$ represents the convolution operation, $X_t$ is the input at the current time step, $H_{t-1}$ is the hidden state from the previous time step, and $W$ and $b$ are the weights and biases, respectively. The cell state is updated as shown in Equations (9) and (10):

$$\tilde{C}_t = \tanh(W_{xc} \ast X_t + W_{hc} \ast H_{t-1} + b_c) \tag{9}$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \tag{10}$$

where $\circ$ denotes element-wise multiplication, $\tanh$ is the hyperbolic tangent function, $\tilde{C}_t$ is the new candidate cell state and $C_{t-1}$ is the cell state from the previous time step. The hidden state is updated according to Equation (11):

$$H_t = o_t \circ \tanh(C_t) \tag{11}$$

Under the control of the output gate, the current cell state $C_t$ is passed through the $\tanh$ activation to obtain the current hidden state $H_t$. ConvLSTM outputs the cell state $C_t$ and hidden state $H_t$ at each time step: $H_t$ serves as the processing result of the current time step, while $C_t$ is carried over as input to the next time step.
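A minimal ConvLSTM cell corresponding to Equations (6)–(11): the pre-activations of the three gates and the candidate state are computed by a single convolution over the concatenated input and previous hidden state and then split; the kernel size and the channel sizes in the usage example are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the pre-activations of the input gate i,
        # forget gate f, output gate o (Equations (6)-(8)) and candidate g (Equation (9)).
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h_prev, c_prev = state
        gates = self.conv(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                      # candidate cell state
        c = f * c_prev + i * g                 # Equation (10)
        h = o * torch.tanh(c)                  # Equation (11)
        return h, c

# Usage on a 10-frame feature sequence (shapes are examples only):
cell = ConvLSTMCell(in_channels=512, hidden_channels=64)
h = torch.zeros(1, 64, 7, 7)
c = torch.zeros(1, 64, 7, 7)
for t in range(10):
    x_t = torch.randn(1, 512, 7, 7)           # per-frame feature map
    h, c = cell(x_t, (h, c))
```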
Therefore, combining the above three modules, our network model is shown in
Table 1.
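To make the composition concrete, the sketch below wires the three parts into a simplified forward pass, assuming the CBAM and ConvLSTMCell classes from the earlier sketches are in scope; the channel sizes, the use of ResNet-18, the retention of the spatial feature map before ConvLSTM and the five-class output are placeholders, with the actual configuration listed in Table 1.

```python
import torch
import torch.nn as nn
from torchvision import models

class ASTNetSketch(nn.Module):
    """Simplified forward pass: ResNet features -> CBAM -> ConvLSTM -> FC.

    Reuses the CBAM and ConvLSTMCell sketches defined above; sizes are placeholders.
    """
    def __init__(self, num_classes=5, hidden_channels=64):
        super().__init__()
        resnet = models.resnet18(weights=None)        # ImageNet init omitted for brevity
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep spatial map
        self.cbam = CBAM(512)
        self.cell = ConvLSTMCell(512, hidden_channels)
        self.fc = nn.Linear(hidden_channels, num_classes)
        self.hidden_channels = hidden_channels

    def forward(self, clips):                          # clips: [B, T, 3, H, W]
        b, t = clips.shape[:2]
        h = c = None
        for i in range(t):
            feat = self.cbam(self.backbone(clips[:, i]))       # [B, 512, h, w]
            if h is None:
                h = clips.new_zeros(b, self.hidden_channels, *feat.shape[-2:])
                c = torch.zeros_like(h)
            h, c = self.cell(feat, (h, c))
        pooled = h.mean(dim=(2, 3))                            # global average pool
        return self.fc(pooled)

logits = ASTNetSketch()(torch.randn(2, 10, 3, 224, 224))       # -> [2, 5]
```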
4. Experiment and Analysis
4.1. Datasets and Preprocessing
Considering the frame rate, emotion types and usage frequency of each dataset, we use SMIC [31] and CASME II [32] for the experiments. SMIC contains 164 micro-expression samples from 16 subjects, divided into positive, negative and surprise classes; the positive class corresponds to happiness, while the negative class includes sadness, fear and disgust. CASME II contains 255 video samples from 26 participants, divided into seven emotions: happiness, surprise, repression, disgust, sadness, fear and others.
Table 2 summarizes the characteristics of the two datasets.
In CASME II, the numbers of sadness and fear samples are fewer than 10 each. To prevent an imbalanced data distribution from affecting recognition, we exclude these two emotions from the experiments. Through image operations including mirroring, random rotation and cropping, we expand the datasets 20-fold. The distribution of the original and expanded samples is shown in
Table 3.
The number of frames per sample is not uniform across the micro-expression datasets. Therefore, we extract a continuous sequence of frames around the apex (peak) frame and fix the length at 11 frames. We preprocess each sample sequence with face detection and alignment, facial motion feature amplification and optical flow feature extraction, as follows:
- (1)
Face detection and alignment
First, we select the first frame of each sample and use ASM [33] to extract 68 facial key points; then, we use the Local Weighted Average [34] to transform these key points into transformation matrices, with which all frames of the sample are normalized and aligned. Finally, we locate the left-eye coordinates, determine the distance between the two eyes and crop all frames of the sample.
- (2)
Facial motion feature amplification
We use the video action amplification network based on deep learning to amplify the features of micro-expression. According to our experiment, when the magnification is too small, the micro-expression features are not obvious enough; when the magnification is too large, it will cause too much noise interference and distort the micro-expression image. After many experiments, we determine that the amplification factor is 3.
- (3)
Optical flow extraction
We use LiteFlowNet to extract optical flow features from two adjacent micro-expression images in each sample. After a pair of images are processed through this network, an optical flow file (.flo) is output and visualized. Therefore, one sequence of 11 frames can be turned into an optical flow sequence of 10 frames.
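LiteFlowNet writes its output in the standard Middlebury .flo format (a float32 magic number 202021.25, the width and height as int32, then interleaved horizontal and vertical components); a small reader for such files might look as follows, with the file names in the example being placeholders.

```python
import numpy as np

def read_flo(path):
    """Read a Middlebury .flo optical flow file into an (H, W, 2) float32 array."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        assert magic == 202021.25, "invalid .flo file"
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        data = np.fromfile(f, np.float32, count=2 * w * h)
    return data.reshape(h, w, 2)      # [..., 0] = horizontal u, [..., 1] = vertical v

# Example: a sample of 11 frames yields 10 pairwise flow files (placeholder names).
flows = [read_flo(f"sample01/flow_{i:02d}.flo") for i in range(10)]
```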
4.2. Experimental Environment and Parameter Setting
Our experiments run on Ubuntu 16.04, with PyCharm as the development tool, and are accelerated by an NVIDIA Tesla P100 PCIe GPU. We selected the network parameters with the best performance after many experimental comparisons, as shown in
Table 4.
4.3. Detail of Experiment and Analysis
Our experiments use LOSO (Leave-One-Subject-Out) as the evaluation protocol and use accuracy and the F1-score as the performance indexes of micro-expression recognition. In each fold, the samples of one subject are held out as the test set and the samples of all other subjects are used for training; the final accuracy and F1-score are obtained by averaging the per-subject results over all folds.
The accuracy is calculated as shown in Equation (12):

$$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}} \tag{12}$$

where $N_{\mathrm{correct}}$ is the number of correctly classified samples and $N_{\mathrm{total}}$ is the total number of test samples. The F1-score is defined as Equation (13):

$$F1 = \frac{2 \times P \times R}{P + R} \tag{13}$$

where $P$ and $R$ denote precision and recall, respectively.
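A hedged sketch of this evaluation protocol using scikit-learn's LeaveOneGroupOut; the `train_and_predict` callback and the macro averaging of the F1-score are assumptions standing in for details not spelled out above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score, f1_score

def loso_evaluate(samples, labels, subject_ids, train_and_predict):
    """Leave-One-Subject-Out evaluation.

    `train_and_predict(train_idx, test_idx)` is a placeholder that should train
    the model on the training subjects and return predictions for the held-out subject.
    """
    accs, f1s = [], []
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(samples, labels, groups=subject_ids):
        preds = train_and_predict(train_idx, test_idx)
        y_true = np.asarray(labels)[test_idx]
        accs.append(accuracy_score(y_true, preds))
        # Macro averaging is an assumption; the averaging scheme is not stated above.
        f1s.append(f1_score(y_true, preds, average="macro"))
    # Final scores are averaged over all held-out subjects.
    return float(np.mean(accs)), float(np.mean(f1s))
```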
On the SMIC and CASME II datasets, we compared our method with four traditional methods and three deep learning-based methods that have performed well in recent years. Our method achieved the best accuracy and F1-score among these methods, indicating a clear advantage in micro-expression recognition. The specific experimental results are shown in
Table 5.
The proposed method outperforms other methods in recognition performance on both the CASME II and SMIC datasets. This is because our proposed method does not use traditional RGB images but instead utilizes video motion to recognize micro-expressions, thereby better capturing the dynamics of expressions. Additionally, we combine spatiotemporal characteristics with static features, modeling micro-expressions from multiple perspectives in a fine-grained manner.
DSSN captures micro-expressions from different angles but only focuses on micro-expression changes at the image level without considering optical flow features and dynamic changes. MDMO divides the face into 36 non-overlapping regions and extracts optical flow features frame by frame. However, it only applies optical flow features, neglecting the continuous spatiotemporal changes in micro-expressions, resulting in less effective performance compared to our model.
4.4. Hyperparameter Optimization
First, we compare different learning rate settings. Because micro-expression features are weak, the initial learning rate of the network should be relatively low. In this experiment, we tested three initial learning rates on the CASME II dataset. With the largest learning rate, training oscillated and the accuracy could not improve steadily, because the excessively high learning rate prevented the network from settling into a minimum. With the smallest learning rate, the network converged too slowly, wasting training time. Considering these factors, the intermediate learning rate was used for the experiments.
When training the network, the initial weights need to be drawn from a suitable distribution so that the loss function converges quickly and reaches a good optimum. However, with random initialization, unsuitable initial weights may cause the loss to become trapped in a local minimum and prevent it from reaching the global optimum. Momentum can partially alleviate this problem: the larger the momentum, the more the update retains the velocity of previous steps, making it more likely to escape local minima and move toward better regions of the loss surface. The most common setting for momentum is 0.9.
Other parameter settings are as follows: the optimizer used is Stochastic Gradient Descent (SGD), and lr_scheduler is used to automatically update the learning rate. We compared batch sizes of 16, 32, 64 and 128, and dropout rates of 0.3, 0.4, 0.5, 0.6 and 0.7.
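The corresponding optimizer setup in PyTorch might look like the following; the initial learning rate and the StepLR schedule are illustrative assumptions, since the text only fixes SGD, momentum 0.9 and the use of an lr_scheduler.

```python
import torch
from torch import optim

# `model` stands in for the ASTNet instance; the learning rate and the decay
# schedule below are illustrative, not the values used in this work.
model = torch.nn.Linear(8, 5)             # placeholder module for a runnable example
optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    out = model(torch.randn(4, 8))
    loss = out.pow(2).mean()              # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                      # lr_scheduler decays the learning rate
```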
Figure 11 and
Figure 12 show the impact of batch size and dropout on network recognition, respectively. When the batch size is set to 32 and the dropout rate to 0.5, our network achieves its best recognition performance. Dropout prevents overfitting by discarding some features and improves the generalization ability of the network. However, when the dropout rate is set too high, too much useful information is lost, which impairs the network's learning ability; when it is set too low, too many redundant neurons remain and not enough robust features are extracted.
The confusion matrices on SMIC and CASME II are shown in
Figure 13. It can be seen from
Figure 13 that, among the three emotions in SMIC, ‘negative’ has the highest accuracy, followed by ‘surprise’, while ‘positive’ has the lowest. A likely reason is that ‘negative’ has the most samples and is well separated from the other emotions. Among the five emotions in CASME II, ‘others’ has the highest accuracy and ‘repression’ the lowest, because ‘others’ has the most samples, whereas ‘repression’ has few samples and is poorly separated from the other emotions. Although their sample counts are also small, ‘disgust’ and ‘surprise’ are recognized more accurately than ‘repression’ because they are more clearly distinguished from the other emotions. It can be seen that both the number of samples and the intensity of the features strongly affect recognition.
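For reference, row-normalized confusion matrices like those in Figure 13 can be produced with scikit-learn; the class names below are the five CASME II categories used in the experiments, and the ground truths and predictions shown are dummy placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# The five CASME II classes used in the experiments; y_true / y_pred stand in
# for the aggregated LOSO ground truths and predictions.
classes = ["happiness", "surprise", "disgust", "repression", "others"]
y_true = ["happiness", "surprise", "disgust", "repression", "others", "others"]
y_pred = ["happiness", "surprise", "disgust", "others", "others", "others"]

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, labels=classes, normalize="true", xticks_rotation=45
)
plt.tight_layout()
plt.savefig("confusion_matrix_casme2.png")
```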
4.5. Ablation and Analysis
To validate the effectiveness of each module, we conducted ablation experiments, testing the contributions of the action amplification module, the spatial attention mechanism and the channel attention mechanism. We generated four variants: (1) ResNet-ConvLSTM without the action amplification module and without either attention mechanism; (2) ResNet-ConvLSTM with the action amplification module but without either attention mechanism; (3) ResNet-SpatialAttention-ConvLSTM without the channel attention module; and (4) ResNet-ChannelAttention-ConvLSTM without the spatial attention module. We compared the results of these variants with our ASTNet.
Table 6 illustrates the effectiveness of micro-expression amplification and compares the improvement of different attention modules on the network. It can be seen from
Table 6 that although action amplification increases the noise of the micro-expression images and causes a certain degree of distortion, it significantly strengthens the micro-expression features and improves recognition overall. ResNet+ConvLSTM degrades more without the attention mechanism, which indicates that pixels in different regions and channels of a micro-expression image differ in importance and verifies the effectiveness of our attention mechanism. CBAM improves recognition more than either spatial attention or channel attention alone. The spatial attention mechanism also contributes less than the channel attention mechanism. This is because ConvLSTM already uses convolutions for the input-to-state and state-to-state transitions and thus captures spatial information better than LSTM, so the spatial attention mechanism partly overlaps with the convolutions of ConvLSTM.
We also tried inserting CBAM at different locations within ResNet to explore how the position of the attention module affects the computed attention maps and the performance of the network. As shown in Table 7, ResNet in our model is composed of five intermediate layers; by integrating a CBAM module between each pair of adjacent layers, we obtain four different networks. Table 8 reports how these networks perform on micro-expression recognition. The results show that although integrating the attention mechanism at different locations within ResNet still improves the recognition rate, the improvement is smaller than that of integrating CBAM between ResNet and ConvLSTM. No clear pattern emerges in how the effect of the attention mechanism changes with its location, but inserting it inside ResNet disturbs the transferred weights to some extent, so the overall improvement is not as large as placing the attention module outside the backbone.
5. Application
The ASTNet model has been integrated into a multimodal intelligent analysis system for teaching behavior in smart classrooms (the system is built on the open-source smart_classroom project, https://github.com/hongyao-hongyao/smart_classroom, accessed on 1 October 2023), aiming to improve teaching through advanced technology. The system uses video data to analyze students' micro-expressions in the classroom in detail and to perceive the group's emotions accurately. This emotional insight helps teachers dynamically adjust their teaching methods to improve student engagement and learning efficiency. The system detects and analyzes classroom emotions in real time from three aspects: attention, emotional state and facial expressions.
The specific application scenario is a classroom with a frame width of 640 pixels, height of 480 pixels and frame rate of 30 frames per second, ensuring high-quality data acquisition and analysis.
Figure 14 shows a screenshot of the prototype system, which we now introduce: Firstly, the system uses three specific metrics to evaluate micro-expressions: focused emotion, group emotional state and expressed emotion. The three graphs on the right represent the score curves of these metrics, with the vertical axis showing the metric scores. Higher scores indicate higher levels of focused emotion, group emotional state and expressed emotion. The horizontal axis represents time, showing the emotional values at each time point.
The classroom group emotion perception curve below represents the overall emotional value of the classroom. The radar chart in the lower left corner shows the ratio of occurrences of each behavior to the total number, used as weights. By weighting and summing each behavior, the classroom group emotion value is obtained.
The system interface also includes real-time video data captured during the class, with the video capture time displayed in the upper left corner. The bounding boxes in the video indicate identified students, and the number above each box represents the individual student’s emotion value. These emotion values are crucial for teachers, as higher emotion values indicate better student engagement, allowing teachers to adjust their teaching methods accordingly.
The contribution of this application is significant. By providing real-time, detailed insights into students’ emotions, the system enables teachers to create more responsive and effective learning environments. This proactive educational approach not only improves individual student performance but also enhances the overall classroom atmosphere, fostering a more supportive and engaging learning experience.