In this section, we analyze the recent literature on SOTA approaches for VER, as well as approaches for lip-reading.
2.1. State-of-the-Art Approaches for Visual Emotion Recognition
The SOTA emotion recognition approaches have improved tremendously over the last five years due to the rapid development of modern Machine Learning (ML) techniques and algorithms, as well as the release of large-scale corpora. Emotion recognition has drawn increasingly intense attention from researchers in a variety of scientific fields. Human emotions can be recognized and analyzed from visual data (facial expressions [21,22,23], behavior [24], gesture [25,26,27], pose [28]), acoustic data (speech) [29], physiological signals [30,31,32,33], or even text [34]. In this section, we focus on SOTA approaches for VER based on facial expression analysis.
The visual information of facial expressions is a crucial indicator of a speaker’s emotion. According to the SOTA pipeline for VER, the first step is to localize the Region-of-Interest (ROI). Several open-source Computer Vision (CV) libraries allow the detection of facial variations and facial morphology (e.g., the Dlib (http://dlib.net/, accessed on 24 October 2023) or MediaPipe (https://google.github.io/mediapipe/, accessed on 24 October 2023) libraries). This step estimates the positions of the landmarks on a face, which contain meaningful information about a person’s facial expression and help to perform automatic emotion recognition [35].
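For illustration, a minimal sketch of this landmark-localization step with the Dlib library is shown below; the image path and the pretrained 68-point predictor file are illustrative assumptions rather than part of the cited pipeline.

```python
import dlib

# Hypothetical paths: an input face image and Dlib's pretrained
# 68-point landmark model (distributed separately by the Dlib project).
IMAGE_PATH = "speaker_frame.jpg"
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()       # HOG-based face detector
predictor = dlib.shape_predictor(PREDICTOR_PATH)  # 68-point landmark regressor

image = dlib.load_rgb_image(IMAGE_PATH)
faces = detector(image, 1)                        # upsample once to find smaller faces

for face in faces:
    landmarks = predictor(image, face)
    # Collect the (x, y) coordinates of all 68 landmarks as a simple feature set.
    points = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(68)]
    print(f"Face at {face}: {len(points)} landmarks, nose tip at {points[30]}")
```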
For any utterance, the underlying emotions are classified using SER. SER classification can be carried out in two ways:
- (a) Traditional classifiers;
- (b) DL classifiers.
The latter are an order of magnitude superior to traditional ones in terms of emotion recognition Acc. Therefore, only approaches based on DL are considered below.
A DNN is a multilayered NN used for complex data processing. Many researchers have investigated different deep models, including Deep Belief Networks (DBNs) [36], Convolutional Neural Networks (CNNs) [37], and Long Short-Term Memory (LSTM) networks [38], to improve the performance of SER systems. However, nowadays the most promising approaches to emotion recognition usually rely on CNNs and Recurrent Neural Networks (RNNs) at their core, with attention mechanisms and some additional modifications [18].
A CNN is a NN composed of a sequence of layers. Usually, the model contains several convolutional layers, pooling layers, and fully connected layers, followed by a SoftMax unit. This sequential network forms a feature-extraction pipeline that models the input as an abstract representation. The basis of a CNN is its convolutional layers, which consist of learnable filters. These layers perform a convolution operation and pass the output to the pooling layer. The pooling layer’s main aim is to reduce the resolution of the output of the convolutional layers, thereby reducing the computational load. The resulting outcome is fed to a fully connected layer, where the data are flattened and finally classified by the SoftMax unit.
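A minimal PyTorch sketch of such a pipeline (convolution, pooling, fully connected layer, SoftMax) is given below; the layer sizes, the 48×48 grayscale input resolution, and the eight emotion classes are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class SimpleEmotionCNN(nn.Module):
    """Convolution -> pooling -> fully connected -> SoftMax, as described above."""
    def __init__(self, num_classes: int = 8):        # eight classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # convolutional filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # reduce spatial resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # flatten before the FC layer
            nn.Linear(64 * 12 * 12, num_classes),        # assumes 48x48 input faces
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=-1)             # SoftMax unit

probs = SimpleEmotionCNN()(torch.randn(4, 1, 48, 48))    # batch of 4 grayscale faces
print(probs.shape)                                        # torch.Size([4, 8])
```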
RNNs and their variations are usually designed for capturing information from sequence/time-series data. RNNs work on the recursive equation:

$$h_t = f_W(h_{t-1}, x_t),$$

where $x_t$ is the input at time $t$; $h_t$ (new state) and $h_{t-1}$ (previous state) are the states at time $t$ and $t-1$, respectively; and $f_W$ is the recursive function. The recursive function is a $\tanh$ function. The equation is simplified as:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t), \qquad y_t = W_{hy} h_t,$$

where $W_{hh}$ and $W_{xh}$ are the weights of the previous state and input, respectively, and $y_t$ is the output. An efficient approach based on RNNs for emotion recognition was presented in Ref. [39]. In Ref. [40], it was shown that RNNs can learn temporal aggregation and frame-level characterization over a long period.
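The recursive equations above can be illustrated with a short NumPy sketch; all dimensionalities and weights are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, timesteps = 16, 32, 8, 10

# Randomly initialized weights of the previous state, the input, and the output.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

h = np.zeros(hidden_dim)                      # previous state h_{t-1}
for x_t in rng.normal(size=(timesteps, input_dim)):
    h = np.tanh(W_hh @ h + W_xh @ x_t)        # h_t = f_W(h_{t-1}, x_t)
    y_t = W_hy @ h                            # output y_t at time t
print(y_t.shape)                              # (8,)
```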
With recent developments in NN models, and, more specifically, with the introduction of DNN architectures such as Visual Geometry Group (VGG) [41] and ResNet-like [42] networks that are able to consume raw data without a feature-extraction phase, modern emotion recognition approaches started to shine. Over the last five years, numerous research studies have been published, e.g., [43]. In existing DL emotion recognition models for recognizing spatio-temporal input, there are three common topologies: CNN-RNN [44], 3-Dimensional (3D) CNN [45], and Two-Stream Network (TSN) [46].
3D CNNs apply convolutions across both the spatial and temporal dimensions, which enables them to capture both macro- and micro-motions. However, they cannot incorporate transferred knowledge as conveniently as CNN-RNNs. TSNs contain two parallel CNNs: a spatial network that processes images and a temporal network that processes motions [47].
There are also numerous studies dedicated to multimodal Audio-Visual (AV) emotion recognition [48,49,50,51,52,53,54,55]. In addition to visual information, such NNs consider different information channels. Information fusion is commonly performed using DBNs [56] or attention mechanisms [57].
A notable set of approaches transform each audio sample into a 2-Dimensional (2D) visual representation in order to employ SOTA 2D/3D CNN architectures [58,59]. To this end, the researchers usually compute the discrete Short-Time Fourier Transform (STFT), as follows:

$$X(m, k) = \sum_{n=-\infty}^{\infty} x[n]\, w[n - mR]\, e^{-j 2\pi k n / N},$$

where $x[n]$ is the discrete input signal, $w[n]$ is a window function, $N$ is the STFT length, and $R$ is the hop (step) size. Following computation of the spectrograms as the squared magnitude of the STFT, the resulting values are then converted to a logarithmic scale (decibels) and normalized to the interval $[0, 1]$, generating single-channel images suitable for further image processing.
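A short sketch of this transformation using SciPy is shown below; the window length, hop size, and sampling rate are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft

def log_spectrogram_image(x: np.ndarray, fs: int, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Turn a 1D audio signal into a [0, 1]-normalized log-power spectrogram."""
    _, _, Zxx = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Zxx) ** 2                        # squared magnitude of the STFT
    log_power = 10.0 * np.log10(power + 1e-10)      # convert to decibels
    # Normalize to [0, 1] so the result can be treated as a single-channel image.
    return (log_power - log_power.min()) / (log_power.max() - log_power.min() + 1e-10)

audio = np.random.randn(16000)                      # one second of synthetic 16 kHz audio
image = log_spectrogram_image(audio, fs=16000)
print(image.shape, image.min(), image.max())
```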
Furthermore, the authors of Ref. [60] proposed an audio spectrogram Transformer method based on the well-known Transformer [61], the first convolution-free, purely attention-based model for emotion classification.
Since the main goal is to evaluate emotion in videos rather than in images (frames), the researchers in Ref. [62] focused on video analysis and evaluated two strategies to provide a final prediction for the video. The first collapses the sequence of action units generated at each timestamp into a vector that is the average over all the temporal steps. These features were fed to three static models: a Support Vector Machine (SVM), a k-NN, and a Multilayer Perceptron (MLP). The second strategy employed a sequential Bidirectional (Bi)LSTM with an attention mechanism to extract the video prediction from the sequence of features retrieved from each frame of the video.
As an alternative to the static models, the authors adopted a sequential model under the assumption that there is relevant information not only in the frames themselves but also in the order of the frames. The authors used a BiLSTM network as the sequential model.
The BiLSTM layer operates in a Bi manner, which allows it to collect sequential information in the two directions of the hidden states $h_t$ of the LSTMs. To obtain the emotion prediction from the BiLSTM layers, the embeddings of the outputs of each specific direction are concatenated, according to the equation:

$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t \in \mathbb{R}^{2L},$$

where $\oplus$ denotes the concatenation operator and $L$ the size of each LSTM.
The general goal of the following attention mechanism is to distinguish the most relevant action unit associated with a specific frame while detecting the emotion in the whole video. The researchers evaluated the actual contribution of each embedding through an MLP with a non-linear activation function, similar to [63].
The attention function is a probability distribution applied to the hidden states that allows us to obtain the attention weights that each frame of the video receives. The models estimate the linear combination of the LSTM outputs with the attention weights $\alpha_t$. Finally, the resulting feature vector is fed to a final task-specific layer for emotion recognition. The authors used a Fully Connected (FC) layer of eight neurons, followed by a SoftMax activation that returns the probability distribution over the emotion classes.
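A condensed PyTorch sketch of this second strategy (a BiLSTM over per-frame features, attention pooling, and an FC layer with SoftMax over eight classes) is given below; the feature and hidden sizes are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """BiLSTM over per-frame features with attention pooling, as described above."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64, num_classes: int = 8):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Sequential(nn.Linear(2 * hidden, 64), nn.Tanh(), nn.Linear(64, 1))
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim); outputs concatenate both directions.
        h, _ = self.bilstm(frames)                        # (batch, time, 2 * hidden)
        alpha = torch.softmax(self.attn(h), dim=1)        # attention weight per frame
        video_vec = (alpha * h).sum(dim=1)                # weighted sum of LSTM outputs
        return torch.softmax(self.fc(video_vec), dim=-1)  # class probabilities

probs = BiLSTMAttention()(torch.randn(2, 50, 128))        # 2 videos, 50 frames each
print(probs.shape)                                        # torch.Size([2, 8])
```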
A notable set of approaches in VER rely on the implementation of self-attention [61]. Self-attention is formulated via estimating the dot-product similarity in the latent space, where queries $q$, keys $k$, and values $v$ are learned from the input feature representation via the trainable projection matrices $W_q$, $W_k$, $W_v$, and a resulting representation is calculated based on them:

$$\mathrm{Attention}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{d}}\right) v,$$

where $d$ is the dimensionality of the latent space.
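This formulation maps directly to a few lines of PyTorch; the input and latent dimensionalities below are arbitrary.

```python
import torch
import torch.nn as nn

d_in, d = 128, 64                                   # input and latent dimensionalities
W_q, W_k, W_v = (nn.Linear(d_in, d, bias=False) for _ in range(3))  # projection matrices

x = torch.randn(1, 30, d_in)                        # (batch, sequence, features)
q, k, v = W_q(x), W_k(x), W_v(x)

scores = q @ k.transpose(-2, -1) / d ** 0.5         # dot-product similarity / sqrt(d)
out = torch.softmax(scores, dim=-1) @ v             # weighted sum of the values
print(out.shape)                                    # torch.Size([1, 30, 64])
```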
This idea has been widely employed for solving a variety of tasks in emotion/affect recognition [57,64,65]. In Ref. [64], the authors presented a Transformer-based methodology for unaligned sequences and performed fusion of three modalities: audio, video, and text. The data from each modality are first projected via a 1-Dimensional (1D) convolutional layer to the desired dimensionality, followed by a set of Transformer blocks. Specifically, two Transformer modules were added to each modality. These representations are subsequently concatenated, and another Transformer module is applied in each modality branch on the joint representations. The resulting representations from each of the three branches are finally concatenated for emotion classification.
The authors of Ref. [66] recently proposed the use of Emoformer blocks for VER tasks. The Emoformer has an encoder structure similar to that of the Transformer, excluding the decoder part. The previously described self-attention layer is used to effectively extract features that contain emotional information, and the residual structure ensures the integrity of the original information. Finally, a mapping network decouples the features and reduces the feature dimensions.
For a given input feature $U$, three matrices of query $Q$, key $K$, and value $V$ are calculated from $U$:

$$Q = U W^{Q}, \qquad K = U W^{K}, \qquad V = U W^{V},$$

where $L_Q$, $L_K$, $L_V$ represent the sequence lengths of $Q$, $K$, $V$, respectively; $D_Q$, $D_K$, $D_V$ represent the dimensions of $Q$, $K$, $V$, respectively; and $Q \in \mathbb{R}^{L_Q \times D_Q}$, $K \in \mathbb{R}^{L_K \times D_K}$, $V \in \mathbb{R}^{L_V \times D_V}$.
Multiple self-attention layers are concatenated to obtain a multi-head attention layer:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W,$$

where $\mathrm{head}_1, \ldots, \mathrm{head}_h$ are the outputs of the self-attention layers, $h$ is the number of layers, and $W$ is the weight parameter.
A residual connection with the normalization layer is used to normalize the output of the multi-head attention layer, and a feed-forward layer is employed to obtain the output $G$ of the self-attention parts:

$$Z = \mathrm{Norm}\big(U + \mathrm{MultiHead}(Q, K, V)\big), \qquad G = \max(0, Z W_1 + b_1)\, W_2 + b_2,$$

where $W_1$, $W_2$ are the weight parameters and $b_1$, $b_2$ are the bias parameters.
Finally, the original features $U$ and the output of the self-attention parts $G$ are connected through the residual structure, and a mapping network is employed to obtain the final output $E$:

$$E = \mathrm{Map}(U + G),$$

where $\mathrm{Map}(\cdot)$ represents the mapping network, which consists of five fully connected layers. Combining the above equations, we can obtain different modality emotion vectors from the different input channels with $\mathrm{Emoformer}(\cdot)$:

$$E_a = \mathrm{Emoformer}(U_a), \qquad E_v = \mathrm{Emoformer}(U_v), \qquad E_t = \mathrm{Emoformer}(U_t),$$

where $U_a$, $U_v$, $U_t$ represent the original input audio, visual, and textual features, respectively, and $E_a$, $E_v$, $E_t$ represent the emotion vectors of the respective modalities.
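A rough PyTorch sketch of a block with this structure (multi-head self-attention with a residual connection and normalization, a feed-forward layer, and a five-layer mapping network on a second residual connection) is shown below; all layer sizes, the use of nn.MultiheadAttention, and the final pooling over the sequence are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EmoformerLikeBlock(nn.Module):
    """Self-attention + residual/norm + feed-forward + mapping network (sketch)."""
    def __init__(self, dim: int = 128, heads: int = 4, emo_dim: int = 32):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        # Mapping network: five FC layers that gradually reduce the feature dimension.
        dims = [dim, 96, 64, 48, 40, emo_dim]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.mapping = nn.Sequential(*layers[:-1])    # no activation after the last layer

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = self.norm(u + self.mha(u, u, u)[0])       # residual + normalization
        g = self.ffn(z)                               # feed-forward output G
        e = self.mapping(u + g)                       # residual with original features U
        return e.mean(dim=1)                          # one emotion vector per sequence (assumed pooling)

e_v = EmoformerLikeBlock()(torch.randn(2, 20, 128))   # e.g., visual features U_v
print(e_v.shape)                                      # torch.Size([2, 32])
```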
The authors of Ref. [67] proposed a method to tackle emotion recognition in dialogue videos. The researchers proposed the Recurrence-based Uni-Modality Encoder (RUME), inspired by the famous Transformer architecture [61]. The authors added fully connected networks and residual operations to improve the expressiveness and stability of the emotion recognition system. The proposed structure can be formalized as:

$$\tilde{X} = \mathrm{Norm}\big(X + \mathrm{RNN}(X)\big), \qquad H = \mathrm{Norm}\big(\tilde{X} + \mathrm{FFN}(\tilde{X})\big),$$

where $X$ denotes the feature matrix of the utterance; $\mathrm{RNN}(\cdot)$, $\mathrm{Norm}(\cdot)$, and $\mathrm{FFN}(\cdot)$ denote the RNN, normalization, and feed-forward network layers, respectively. In this study, the $\mathrm{RNN}(\cdot)$ and $\mathrm{Norm}(\cdot)$ default to a Bi Gated Recurrent Unit (GRU) and layer normalization, while the feed-forward layer consists of two fully connected networks, which can be formulated as:

$$\mathrm{FFN}(\tilde{X}) = \mathrm{Drop}\Big(\mathrm{FC}\big(\sigma(\mathrm{FC}(\tilde{X}))\big)\Big),$$

where $\mathrm{FC}(\cdot)$ and $\mathrm{Drop}(\cdot)$ denote the fully connected network and dropout operation, respectively, and $\sigma(\cdot)$ denotes the activation function.
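A brief sketch of such a recurrence-based encoder, under the residual composition assumed in the reconstructed equations above (BiGRU with residual and layer normalization, followed by a two-FC feed-forward layer with dropout), is shown below; the sizes are placeholders.

```python
import torch
import torch.nn as nn

class RUMELikeEncoder(nn.Module):
    """BiGRU with residual/norm followed by a two-FC feed-forward layer (sketch)."""
    def __init__(self, dim: int = 100, dropout: float = 0.1):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x + self.rnn(x)[0])    # recurrent layer + residual + norm
        return self.norm2(h + self.ffn(h))    # feed-forward layer + residual + norm

out = RUMELikeEncoder()(torch.randn(3, 25, 100))   # 3 dialogues, 25 utterance features
print(out.shape)                                   # torch.Size([3, 25, 100])
```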
In this subsection, we have analyzed SOTA SER systems based on visual data processing. DL classifiers are usually used to recognize emotions, and frameworks built on DNNs, CNNs, and RNNs and their modifications, together with attention mechanisms, are usually used to achieve SOTA performance.
2.2. State-of-the-Art Approaches for Lip-Reading
In this subsection, we evaluate recent progress in VSR methodology. More detailed research on these issues can be found in related studies [68,69].
Traditionally, VSR systems consist of two processing stages: feature extraction from video data followed by lip-reading. For traditional methods, features are usually extracted around the mouth ROI. However, in recent years, with the development of DL technology, the feature-extraction step has been replaced with deep bottleneck architectures.
The first CNN image classifier to discriminate visemes was trained by the researchers in Ref. [70]. In Ref. [71], deep bottleneck features were used for word recognition in order to take full advantage of deep convolutional layers and explore highly abstract features. Similarly, this approach was applied to every frame of the video in Ref. [72].
The researchers in Ref. [73] proposed using 3D convolutional filters to process the spatio-temporal information of the lips. A basic 2D convolution layer from $C$ channels to $C'$ channels (without a bias and with unit stride) computes [74]:

$$[\mathrm{conv}(x, w)]_{c'ij} = \sum_{c=1}^{C} \sum_{i'=1}^{k_w} \sum_{j'=1}^{k_h} w_{c'ci'j'}\, x_{c,\, i+i',\, j+j'},$$

for input $x$ and weights $w \in \mathbb{R}^{C' \times C \times k_w \times k_h}$, where we define $x_{c,\, i+i',\, j+j'} = 0$ for indices out of bounds. Spatio-Temporal Convolutional Neural Networks (STCNNs) can process video data by applying the convolution across both the time and spatial dimensions [75], thus:

$$[\mathrm{stconv}(x, w)]_{c'tij} = \sum_{c=1}^{C} \sum_{t'=1}^{k_t} \sum_{i'=1}^{k_w} \sum_{j'=1}^{k_h} w_{c'ct'i'j'}\, x_{c,\, t+t',\, i+i',\, j+j'}.$$
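In PyTorch, the difference between the two operations amounts to choosing Conv2d or Conv3d; the kernel sizes and tensor shapes below are illustrative assumptions only.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 64, 64)        # (batch, channels, time, height, width)

# Plain 2D convolution applied frame by frame: spatial filtering only.
conv2d = nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False)
per_frame = torch.stack([conv2d(frames[:, :, t]) for t in range(frames.shape[2])], dim=2)

# Spatio-temporal convolution: the kernel also spans the time dimension.
stconv = nn.Conv3d(3, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1), bias=False)
spatio_temporal = stconv(frames)

print(per_frame.shape, spatio_temporal.shape)  # both torch.Size([1, 32, 16, 64, 64])
```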
Then, the researchers in [76] applied an attention mechanism to the mouth ROI. The authors proposed the Watch, Listen, Attend, and Spell networks, which learn to predict characters in sentences being spoken from a video of a talking face without audio. The authors model each character $y_i$ in the output character sequence $y = (y_1, y_2, \ldots, y_l)$ as a conditional distribution of the previous characters $y_{<i}$ and the input image sequence $x^{v} = (x_1^{v}, x_2^{v}, \ldots, x_n^{v})$ for lip-reading. Hence, we model the output probability distribution as:

$$P(y \mid x^{v}) = \prod_{i} P(y_i \mid y_{<i}, x^{v}).$$

The model consists of three key components: the image encoder Watch, the audio encoder Listen, and the character decoder Spell. Each encoder transforms the respective input sequence into a fixed-dimensional state vector $s$ and a sequence of encoder outputs $o$. The decoder ingests the state and the attention vectors from both encoders and produces a probability distribution over the output character sequence.
The three modules in the model are trained jointly.
Several training strategies and DNNs have been recently proposed for lip-reading [77]. The task has received a lot of attention due to the availability of large publicly available corpora, e.g., the Lip Reading in the Wild (LRW) [78] and the Lip-Reading Sentences in-the-Wild (LRS) [76] datasets.
The authors of Ref. [76] proposed an image encoder that consists of a convolutional module that generates image features $f_i^{v}$ for every input time-step $x_i^{v}$, and a recurrent module that produces the fixed-dimensional state vector $s^{v}$ and a set of output vectors $o^{v}$:

$$f_i^{v} = \mathrm{CNN}(x_i^{v}), \qquad h_i^{v}, o_i^{v} = \mathrm{LSTM}(f_i^{v}, h_{i+1}^{v}), \qquad s^{v} = h_1^{v}.$$
The majority of SOTA approaches follow a similar lip-reading strategy that consists of a visual encoder, followed by a temporal model and a classification layer. The authors of Ref. [11] proposed a modification of self-distillation. It is based on the idea of training a series of models with the same architecture using distillation and has recently been applied to lip-reading. The teacher network provides additional supervisory signals, including inter-class similarity information. The overall loss $\mathcal{L}$ to be optimized is the weighted combination of the cross-entropy loss $\mathcal{L}_{CE}$ for hard targets and the Kullback-Leibler (KL) divergence loss $\mathcal{L}_{KD}$ for soft targets:

$$\mathcal{L} = \mathcal{L}_{CE}\big(y, \sigma(z_s; \theta_s)\big) + \alpha\, \mathcal{L}_{KD}\big(\sigma(z_t; \theta_t), \sigma(z_s; \theta_s)\big),$$

where $z_s$ and $z_t$ represent the embedded representations from the student and teacher networks, respectively; $\theta_s$ and $\theta_t$ denote the learnable parameters of the student and teacher models, respectively; $y$ is the target label; $\sigma(\cdot)$ stands for the SoftMax function; and $\alpha$ is the balancing weight between the two terms.
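A compact PyTorch sketch of such a combined loss is given below; the temperature, the balancing weight, and the 500-class output (as in LRW word classification) are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      alpha: float = 0.5,
                      temperature: float = 2.0) -> torch.Tensor:
    """Cross-entropy on hard targets + KL divergence on softened teacher targets."""
    ce = F.cross_entropy(student_logits, target)                    # hard-target term
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),  # soft targets
                  reduction="batchmean") * temperature ** 2
    return ce + alpha * kd                                          # L = L_CE + alpha * L_KD

loss = distillation_loss(torch.randn(4, 500), torch.randn(4, 500),
                         torch.randint(0, 500, (4,)))               # 4 clips, 500 word classes
print(loss.item())
```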
A visual encoder was initially proposed in Ref. [79] and has since been widely used and improved in later studies [80]. At the same time, the most recent advances concern the temporal model and the training strategy. Bi GRUs and LSTMs, as well as Multi-Scale Temporal Convolutional Networks (MSTCNs), have been the most popular temporal models [14].
Recently, as an alternative to RNNs for classification, Transformer models using attention mechanisms and temporal convolutional networks have begun to be used [81,82]. The Vision Transformer (ViT) [83] first converts an image into a sequence of patch tokens by dividing it with a certain patch size and then linearly projecting each patch into a token.
A Transformer encoder is composed of a sequence of blocks, where each block contains Multi-head Self-Attention (MSA) with a feed-forward network. The feed-forward network is a two-layer multilayer perceptron with an expanding ratio $r$ at the hidden layer, and one Gaussian Error Linear Unit (GELU) non-linearity is applied after the first linear layer. Layer Normalization (LN) is applied before every block, and residual shortcuts after every block. The input of ViT, $x_0$, and the processing of the $k$-th block can be expressed as:

$$x_0 = [x_{cls} \,\|\, x_{patch}] + x_{pos}, \qquad y_k = x_{k-1} + \mathrm{MSA}\big(\mathrm{LN}(x_{k-1})\big), \qquad x_k = y_k + \mathrm{FFN}\big(\mathrm{LN}(y_k)\big),$$

where $x_{cls} \in \mathbb{R}^{1 \times C}$ and $x_{patch} \in \mathbb{R}^{N \times C}$ are the [CLS] and patch tokens, respectively, and $x_{pos} \in \mathbb{R}^{(1+N) \times C}$ is the position embedding. $N$ and $C$ are the number of patch tokens and the dimension of the embedding, respectively.
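A minimal PyTorch sketch of one such pre-LN encoder block is shown below; the embedding dimension, number of heads, expansion ratio $r = 4$, and the 14×14 patch grid are typical values assumed for illustration, not prescribed by the text.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Pre-LN Transformer encoder block: MSA and an MLP with GELU, both with residuals."""
    def __init__(self, dim: int = 192, heads: int = 3, r: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, r * dim), nn.GELU(), nn.Linear(r * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x + self.msa(self.ln1(x), self.ln1(x), self.ln1(x))[0]  # y_k
        return y + self.mlp(self.ln2(y))                            # x_k

tokens = torch.randn(2, 1 + 196, 192)        # [CLS] + 14x14 patch tokens, C = 192
print(ViTBlock()(tokens).shape)              # torch.Size([2, 197, 192])
```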
However, training such models requires large computing power, as well as a significant amount of training data. A popular approach in such cases is therefore transfer learning [84]: using a pre-trained model to improve predictions within a different but similar task, which reduces the training time and data requirements and improves the performance of the NN.
In order to develop real-world lip-reading systems, high-quality training and testing corpora are essential. A recent trend is web-based corpora, i.e., corpora collected from open sources such as YouTube or TV shows [85]. The most well-known of them are the LRW [78], LRS2-BBC, and LRS3-TED [86] corpora, among others. The combination of SOTA DL methods and large-scale visual corpora has been highly successful, achieving high recognition Acc and even surpassing human performance.