1. Introduction
In today’s society, video surveillance has become an important means of maintaining public security, and thousands of surveillance cameras have been installed in public places. Searching for a person manually in large-scale video data is very difficult, as it consumes too much manpower and time and reduces the efficiency of police investigations. In recent years, automatically searching for persons of interest in large-scale video or image databases has attracted increasing attention from researchers. There are two main ways to conduct a person search: image/video-based and text-based.
Image/video-based person search (IBPS) refers to the use of computer vision (CV) techniques to determine whether a specified person captured by one camera appears in other cameras, also known as person re-identification (reID) [1,2]. In recent years, person reID has attracted extensive interest and become a hot topic in CV. After more than 10 years of development, person reID technology has achieved high accuracy on public person reID datasets. However, person reID has a major limitation: it requires at least one image of the target person, which in some actual cases cannot be obtained. In such cases, the target person can only be searched in the surveillance video/image database based on a textual description of the target person’s appearance provided by a witness, which is called text-based person search (TBPS).
TBPS can be classified into two classes: attribute-based and natural-language-based. Attribute-based person search involves searching for a person with certain attributes in the database [3,4,5], such as wearing a red t-shirt. This method needs a set of predefined semantic attributes to describe the person’s appearance, and the descriptive power of such attributes is limited. It is also very expensive to annotate attributes for large-scale person image datasets. Because person attributes are low-dimensional features, the results of attribute-based person search are often unsatisfactory. Person search based on natural language refers to retrieving the closest person image in a large-scale video surveillance dataset from a language description given by a witness [6,7,8], as shown in Figure 1. A witness’s description is usually a natural language description rather than discrete attributes. Natural language can represent the appearance details of a person more accurately than attributes and can provide more semantic information for person search. Therefore, this paper studies the person search task based on natural language query descriptions.
A person search based on natural language uses a natural language description to retrieve person images from large-scale datasets, which is in fact a fine-grained application of cross-modal retrieval. It must handle both CV and natural language processing (NLP) at the same time, which poses great technical difficulties and challenges. On the one hand, because of modal heterogeneity, it is difficult to directly measure the cross-modal similarity and correlation between the language description features and the image appearance features. On the other hand, because all images belong to the same broad class (the person class), the differences among different persons are not obvious, which is a typical fine-grained recognition problem. Moreover, different witnesses may describe the same person differently, which further increases the difficulty of the task. In short, the task presents two main difficulties: fine-grained discrimination and cross-modality. This paper focuses on solving these two challenges.
For the fine-grained challenge, one key approach is to enhance discriminative feature extraction and to extract discriminative feature representations associated with the language description and the visual appearance. We discuss feature extraction for image and text, respectively.
To enhance discriminative feature extraction for images, this paper mainly focuses on two aspects: introducing an attention mechanism and leveraging midlevel features. As we know, the attention mechanism plays an important role in the human visual perception system [9]. Human visual perception does not process the whole image at once but instead continuously glimpses and selectively focuses on the critical parts to obtain key visual information [10]. This property is conducive to learning discriminative image features. For a person search, local features play an important role in distinguishing different persons. Local features are learned from different locations of the person image, which corresponds to spatial attention (SA). Spatial attention mainly focuses on “where” the critical parts of person images are. In addition, in a convolutional neural network (CNN), different features are dispersed across different channels of the feature map. The network usually uses a fully connected layer and softmax loss at the highest layer for classification tasks. Each channel of the feature map at the highest layer of the network can be regarded as the response of one class, and the responses of different semantic classes are interrelated [11]. In the person search task, different classes correspond to different person IDs. Therefore, by exploring the relationship among different channels of the feature map, person features can be enhanced with specific body semantic information, which corresponds to channel attention (CA). CA mainly focuses on “what” the critical parts of person images are. Both SA and CA are used in this paper and combined in an improved way to produce the novel cubic attention mechanism that we propose.
The features extracted from different layers of a CNN have different characteristics and values. Features of the lower-level network carry more detail and more spatial information, which provide valuable cues for distinguishing different persons and are very beneficial to the fine-grained task of person search. Most existing methods only extract high-level semantic features [6,7]. The high-level network has rich semantic features, but the extracted information is too abstract, leading to insufficient detail, which hurts person search performance. Different from most other methods, which generate both spatial attention and channel attention from the highest convolution layer or from the same layer of the network [12], we propose a cross-layer cubic attention mechanism that generates spatial attention from the midlevel network and channel attention from the high-level network, to fully leverage the spatial information and rich details of the midlevel network and the rich semantics of the high-level network, and thus achieve better performance on this fine-grained task. We did not generate spatial attention from the low-level network because of its insufficient semantic information and high computational cost.
To enhance discriminative feature extraction for the language description, we propose a text attention network based on BiLSTM [13,14] and the self-attention mechanism [15,16]. The language descriptions of persons are context-dependent, and the contribution of each word to the semantics is different. Through the forward and backward propagation of BiLSTM, the network can better capture the hidden dependencies of the context and obtain features with richer semantics. In most existing methods [7], the output vectors of BiLSTM at every time step are summed and averaged directly, giving each word the same contribution to the semantics, which easily leads to unsatisfactory performance. In this work, self-attention is used to fully consider the contribution of different words to the semantics, to pay more attention to the key words in the language description, to increase the weight of key words, and to reduce the weight of redundant or unimportant words, so as to obtain a more discriminative feature representation of the language description and improve the performance of the task.
To deal with the cross-modality challenge, we propose a cross-modal attention mechanism and a joint loss function. Most existing works extract the features of each modality independently and then measure the cross-modal correlation and similarity of the features [7,17]. However, the features of different modalities are quite different and noisy, and the correlation between the features is weak because of cross-modal heterogeneity. This paper proposes a cross-modal attention mechanism (CMAM) to address this problem. The CMAM fully captures the correlation between text and image features and focuses on the correlated parts of the features, thereby improving the final performance of this cross-modal task.
A joint loss function for cross-modal learning is designed in this paper. First, ranking loss is used because it works well in cross-modal retrieval. However, the commonly used ranking loss only requires the network to distinguish person images based on whether an image matches the language description (positive image-text pair or negative image-text pair). This constraint is coarse-grained. We improve the ranking loss for this task and propose a new ranking loss embedded with intra-modal similarity. The distance between samples within the same modality is used as a coefficient on the negative-pair distance in the new similarity ranking loss, which can not only distinguish positive and negative pairs in a coarse-grained way, but also distinguish the similarity between negative pairs in a fine-grained way, so as to achieve a more accurate language-based person search ranking. Second, the similarity ranking loss mainly concerns cross-modal correlation and cannot fully explore the intra-modal distribution of samples. For this reason, classification loss is also used, which leverages identity label information to mine intra-modal semantic information. Classification loss also constrains the similarity ranking loss based on the identity label, so that samples of the same identity are aggregated as much as possible, yielding more distinctive features that further increase the accuracy of the person search. Finally, the similarity ranking loss and classification loss are combined into a joint loss function for cross-modal learning and used to jointly train the network.
The major contributions of our work are listed below:
A cross-layer cubic attention mechanism was proposed to enhance discriminative feature extraction for images. It not only focuses on critical features, but also fully leverages both the detailed features of the midlevel network and the semantic features of the high-level network, improving the performance of this cross-modal, fine-grained person search task.
A text attention network, including both BiLSTM and self-attention, was put forward to enhance discriminative feature extraction for language descriptions by better capturing the hidden dependencies of the context and increasing the weight of key words.
A cross-modal attention mechanism was proposed, which pays more attention to the correlated important parts between text and image features and better addresses the problem of cross-modal heterogeneity.
A joint loss function for cross-modal learning was proposed, including an improved cross-modal ranking loss embedded with intra-modal similarity and an intra-modal classification loss. The loss function leverages similarity ranking and discriminative feature mining to further improve the performance of natural language description based person search.
3. Proposed Method
Given a natural language description of a person, the goal of the task is to search for the closest person images in a large-scale image database. This task needs to handle both CV and NLP, making it a fine-grained cross-modal retrieval problem. The architecture of the proposed hybrid attention network is illustrated in Figure 2. The network is composed of three parts: (1) the image subnetwork, a ResNet50-based [36] CNN extended with a cubic attention mechanism, which extracts the feature maps of person images; (2) the text subnetwork, including Word2Vec [37,38], the bidirectional LSTM (BiLSTM) network [13,14], and self-attention [15,16], which encodes the natural language description and extracts its semantic features; and (3) the cross-modal joint learning module, including the cross-modal attention mechanism and fully connected layers (FC2 and FC3), with a joint loss function (similarity ranking loss and classification loss) guiding the learning process to fully exploit cross-modal and intra-modal information.
3.1. Image Feature Map Extraction
Because CNNs have achieved outstanding performance in many fields, this paper also uses a CNN to extract the visual features of person images. ResNet50 [36], with parameters pre-trained on ImageNet [39], is adopted as the base network. On top of it, this paper designs a new cross-layer cubic attention mechanism. The mechanism enables the model to focus more on the key features of person images and ignore redundant features, and it leverages the midlevel network so that the model can attend to image details and improve the performance of this fine-grained task. The cubic attention mechanism is composed of two modules: a spatial attention module based on object regions and a channel attention module based on object semantics. The details of each module are described below.
3.1.1. Spatial Attention Based on Object Region
The different appearance attributes of a person described by natural language are distributed over different locations of the person image. In order to enable the model to focus more on the spatial locations corresponding to person attribute objects, we improved the commonly used spatial attention model and designed a spatial attention mechanism (SAM) based on object regions. The model can automatically acquire the locations and detail information of the target objects in the image and weight the spatial positions, so that the model focuses more on the important regions and restrains the impact of unimportant regions and noise.
In CNNs, lower-level features usually have higher resolution, capture more details, and are sensitive to small targets, while higher-level features usually have lower resolution, contain more semantic information, and have a larger receptive field, but may lack detail and easily miss small objects. In the language-description person search task, the appearance descriptions of a person (such as shoes, hand-carried items, etc.) usually correspond to small areas of the person image, so detailed information is very important for achieving good performance on this fine-grained task. To obtain more detailed information, unlike other SAMs that create the spatial attention map from the last convolution layer (the conv5 layer of ResNet50), our SAM obtains the spatial attention map from the feature map of the midlevel of the network (the conv4 layer of ResNet50). Because of the insufficient semantic information and high computational cost of the low-level network, we did not create a spatial attention map from the feature maps of lower layers, even though they have higher resolution and more detail.
The structure of our SAM is illustrated in Figure 3. The channel dimension of the input feature map (extracted from the conv4 layer of ResNet50) is processed with max pooling and average pooling, respectively, yielding two feature maps with a channel dimension of 1. The two feature maps are then concatenated and reduced to one channel through a 3 × 3 convolution, and the attention map over spatial locations is generated using the sigmoid function. The attention map is then subsampled so that its height and width are half of the previous values. The final spatial attention feature map is acquired by multiplying the spatial attention map with the output of the channel attention feature map computed from the conv5 layer of ResNet50. The calculation of the spatial attention map is shown in Formula (1):

$$M_S(F) = \sigma\left(f^{3\times3}\big(\big[\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)\big]\big)\right) \tag{1}$$

where $F$ is the input feature map, $M_S(F)$ is the spatial attention map, $f^{3\times3}$ is the 3 × 3 convolution operation, $[\cdot\,;\cdot]$ denotes channel-wise concatenation, and $\sigma$ is the sigmoid function.
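For concreteness, the following is a minimal PyTorch sketch of the spatial attention module described above, assuming a conv4 feature map of shape (N, 1024, 24, 8) as produced by ResNet50 on 384 × 128 person crops; the module layout, tensor shapes, and the stride-2 max pooling used for the subsampling step are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention built from a midlevel (conv4) feature map, cf. Formula (1)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        # 2 input channels: channel-wise max-pooled and average-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()
        # halve height and width so the map matches the conv5 feature resolution
        self.downsample = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, f4):
        max_map, _ = torch.max(f4, dim=1, keepdim=True)   # (N, 1, H, W)
        avg_map = torch.mean(f4, dim=1, keepdim=True)     # (N, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return self.downsample(attn)                      # (N, 1, H/2, W/2)

if __name__ == "__main__":
    f4 = torch.randn(2, 1024, 24, 8)       # assumed conv4 output shape
    print(SpatialAttention()(f4).shape)    # torch.Size([2, 1, 12, 4])
```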
3.1.2. Channel Attention Based on Object Semantics
Spatial attention alone is not enough to achieve satisfactory performance; the features of the person image need to be captured along multiple dimensions. In a CNN, the feature in each channel actually represents the response of the image to a different convolution kernel, and different features are dispersed across different channels. In order to obtain the semantic features of each channel and gain an additional improvement in task performance, we put forward a channel attention mechanism (CAM) based on object semantics.
The structure of the CAM is illustrated in Figure 4. The input feature map (extracted from the conv5 layer of ResNet50) is processed with max pooling and average pooling, respectively, to obtain two feature maps whose height and width are 1. The two feature maps are then concatenated and passed through two fully connected layers, and the attention map over the channel dimension is generated using the sigmoid function. The final channel attention feature map is generated by multiplying the channel attention map with the input feature map. The calculation of the channel attention map is shown in Formula (2):

$$M_C(F) = \sigma\left(W_2\, W_1\big(\big[\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)\big]\big)\right) \tag{2}$$

where $F$ is the input feature map, $M_C(F)$ is the channel attention map, and $W_1$ and $W_2$ are the weight matrices of the fully connected layers FC1 and FC2.
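A corresponding PyTorch sketch of the channel attention module follows; per the description above, the max-pooled and average-pooled vectors are concatenated before the two fully connected layers (rather than summed as in the original CBAM), while the hidden ReLU and the reduction ratio of 16 are our assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention from the conv5 feature map, cf. Formula (2)."""
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        # global max pooling and average pooling over the spatial dimensions
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # two fully connected layers applied to the concatenated descriptor
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),  # FC1 (W1)
            nn.ReLU(inplace=True),                           # assumed nonlinearity
            nn.Linear(channels // reduction, channels),      # FC2 (W2)
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f5):
        n, c, _, _ = f5.shape
        pooled = torch.cat([self.max_pool(f5).view(n, c),
                            self.avg_pool(f5).view(n, c)], dim=1)   # (N, 2C)
        return self.sigmoid(self.fc(pooled)).view(n, c, 1, 1)        # (N, C, 1, 1)

if __name__ == "__main__":
    f5 = torch.randn(2, 2048, 12, 4)        # assumed conv5 output shape
    print(ChannelAttention()(f5).shape)     # torch.Size([2, 2048, 1, 1])
```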
3.1.3. Cross-Layer Cubic Attention
For a CNN, lower-layer features have higher resolution and more detail than higher-layer features, while higher-layer features carry more semantic information. Spatial attention focuses on the details of objects, but the features of the conv5 layer of ResNet50 are too abstract to provide sufficient detail and spatial information for effective spatial attention in such a fine-grained task as person search. So, unlike other methods that generate both SA and CA from the conv5 layer of ResNet50, we propose a cross-layer cubic attention mechanism that generates spatial attention from the conv4 layer and channel attention from the conv5 layer of ResNet50. It leverages both midlevel details and high-level semantics and achieves better performance on the fine-grained task.
The structure of the cross-layer cubic attention mechanism is shown in Figure 5. The cubic attention obtains important regions and key high-level semantic features of the person image through SA and CA, which are combined to obtain the cubic attention weights of the image features. To prevent the gradient from vanishing due to the loss of feature information caused by multiplying attention with the features, we add a certain proportion of the original features to the attention output. The calculation of the cubic attention feature map is shown in Formula (3):

$$F_{att} = M_S \otimes \big(M_C(F_5) \otimes F_5\big) + b \cdot F_5 \tag{3}$$

where $F_5$ is the output feature map of the conv5 layer of ResNet50, $M_S$ is the spatial attention, $M_C$ is the channel attention, $\otimes$ denotes element-wise multiplication with broadcasting, and $b$ is a hyperparameter that weights the proportion of the original conv5 features to prevent the attention mechanism from introducing noise.
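Putting the two modules together, the sketch below shows one way the cross-layer cubic attention of Formula (3) could be wired onto the conv4/conv5 outputs of ResNet50; it reuses the SpatialAttention and ChannelAttention classes from the two sketches above, and the 384 × 128 input size and b = 0.1 are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class CubicAttentionBackbone(nn.Module):
    """ResNet50 backbone with cross-layer cubic attention, cf. Formula (3)."""
    def __init__(self, b=0.1):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4   # conv4, conv5
        self.sa = SpatialAttention()      # from the spatial attention sketch above
        self.ca = ChannelAttention()      # from the channel attention sketch above
        self.b = b                        # residual weight, illustrative value

    def forward(self, x):
        f4 = self.layer3(self.layer2(self.layer1(self.stem(x))))  # conv4 features
        f5 = self.layer4(f4)                                      # conv5 features
        m_s = self.sa(f4)                 # spatial attention from the midlevel map
        m_c = self.ca(f5)                 # channel attention from the high-level map
        return m_s * (m_c * f5) + self.b * f5                     # cubic attention

if __name__ == "__main__":
    imgs = torch.randn(2, 3, 384, 128)
    print(CubicAttentionBackbone()(imgs).shape)   # torch.Size([2, 2048, 12, 4])
```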
3.2. Natural Language Feature Extraction
For the natural language description input, we first use Word2Vec [37,38] to convert the words into vectors containing semantic information. Then, we use BiLSTM [13,14] to acquire the context information and semantic features of the description. Finally, we use self-attention to assign proper weights to the different features in the description and to focus on the discriminative features that represent the person's appearance, which effectively improves person search performance and reduces the adverse effects of noise in the description. The text attention network structure for natural language feature extraction is illustrated in Figure 6.
3.2.1. Word Embedding Using Word2Vec
Different witnesses describe persons in different ways. Because there is no structured or standardized grammar or template, the descriptions are highly unstructured. Therefore, we cannot directly process them with existing mathematical or statistical models; the words in the description must first be converted into vectors. Word2Vec [37,38], an open-source word embedding tool, uses the continuous bag of words (CBOW) or skip-gram model to transform words into high-dimensional real-valued vectors carrying semantic information. CBOW infers the central word from the context words, while skip-gram infers the context words from the central word. Comparing the two models, CBOW trains faster than skip-gram, but its predictions for low-frequency words are poor and its generalization capability is weak. Considering that skip-gram has stronger generalization capability, we selected Word2Vec's skip-gram model for word embedding.
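As an illustration, a skip-gram embedding could be trained with the gensim implementation of Word2Vec roughly as follows; the 300-dimensional vectors, window size, training epochs, and the toy corpus are our assumptions rather than the paper's settings.

```python
from gensim.models import Word2Vec

# toy corpus: each person description tokenized into a list of lower-case words
corpus = [
    "the woman wears a red t-shirt and blue jeans".split(),
    "a man in a black jacket carries a brown backpack".split(),
]

# sg=1 selects the skip-gram model (sg=0 would select CBOW)
model = Word2Vec(sentences=corpus, vector_size=300, window=5,
                 min_count=1, sg=1, workers=4, epochs=50)

vector = model.wv["red"]    # 300-dimensional word vector for the word "red"
print(vector.shape)         # (300,)
```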
3.2.2. Context Information Extraction Using BiLSTM
The descriptions of persons from different witnesses are in the form of natural language. Although the form is free, there is still a context-dependent relationship in the description, and the semantics can be understood more accurately from the context information. A recurrent neural network (RNN) can mine the temporal and contextual semantic information of text, but as the input grows, the RNN's perception of long-term information declines, resulting in the problems of long-term dependency and gradient vanishing [40]. The LSTM network [13] is an improvement of the RNN and can solve the two issues mentioned above.
Although LSTM solves these two main issues of the RNN, it can only learn the preceding information of the text and cannot use the following information. However, in the language description of a person, the semantics of a word is closely related to both the preceding and the following information. Therefore, we use bidirectional LSTM (BiLSTM) [13,14] in place of LSTM to introduce the following information of the description; it better captures bidirectional semantic dependencies and improves prediction accuracy. BiLSTM is composed of two stacked LSTM networks, forward and backward, which capture the effective information of the preceding and following contexts, respectively, and then concatenate the two hidden states to form the final output. The BiLSTM network structure is shown in Figure 6.
For an input language description of a person, such as a sentence $\{w_0, w_1, \ldots, w_t, w_{t+1}, \ldots\}$, the word vectors after word embedding are $\{v_0, v_1, \ldots, v_t, v_{t+1}, \ldots\}$, and BiLSTM calculates the output $h_t$ at any time step $t \in \{1, 2, \ldots, T\}$. $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ represent the outputs of the forward LSTM and backward LSTM at time step $t$, respectively, as shown in Formulas (4) and (5), and the output of BiLSTM is determined by the states of the LSTM networks in both directions, as shown in Formula (6):

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}\big(v_t,\ \overrightarrow{h}_{t-1}\big) \tag{4}$$

$$\overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}\big(v_t,\ \overleftarrow{h}_{t+1}\big) \tag{5}$$

$$h_t = \big[\overrightarrow{h_t};\ \overleftarrow{h_t}\big] \tag{6}$$
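A minimal PyTorch sketch of Formulas (4)-(6): a bidirectional LSTM runs over the embedded word sequence, and its forward and backward hidden states are concatenated at each time step. The 300-dimensional inputs and 512-dimensional hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextBiLSTM(nn.Module):
    """BiLSTM encoder: h_t = [forward h_t ; backward h_t], cf. Formulas (4)-(6)."""
    def __init__(self, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim,
                              num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, word_vectors):
        # word_vectors: (N, T, embed_dim), e.g. Word2Vec embeddings of a description
        h, _ = self.bilstm(word_vectors)   # (N, T, 2 * hidden_dim): both directions concatenated
        return h

if __name__ == "__main__":
    v = torch.randn(2, 40, 300)            # batch of 2 descriptions, 40 words each
    print(TextBiLSTM()(v).shape)           # torch.Size([2, 40, 1024])
```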
3.2.3. Key Semantic Feature Extraction Using Self-Attention
In the descriptive sentences of a person, different words contribute differently to the sentence semantics. In our work, we utilize self-attention to determine the contribution of every word and assign it a corresponding weight, so that the model can focus more on the critical features for this person search task, enhance the expression of key features, and weaken the influence of redundant features, which further improves the performance of person search based on natural language description.
The output $h_t$ of BiLSTM is the hidden vector for the t-th word, which contains a representation of the context information. Self-attention assigns different weights to the BiLSTM outputs at different time steps; different weights represent different degrees of focus. The attention is constructed as follows.
First, the BiLSTM hidden vector $h_t$ is transformed into the hidden value $u_t$ of the attention layer by the nonlinear tanh function, as shown in Formula (7):

$$u_t = \tanh\big(W_a h_t + b_a\big) \tag{7}$$

where $W_a$ and $b_a$ are the coefficient matrix and bias vector of the attention mechanism, which are updated automatically during model training.
Then, the attention weights $\alpha_t$ are acquired through normalization with the softmax function. $\alpha_t$ represents the amount of semantic information in each hidden vector and satisfies $\sum_{t=1}^{T} \alpha_t = 1$, as shown in Formula (8):

$$\alpha_t = \frac{\exp(u_t)}{\sum_{k=1}^{T} \exp(u_k)} \tag{8}$$

Finally, the weighted semantic vector $s$ is obtained as the weighted sum of the hidden vectors:

$$s = \sum_{t=1}^{T} \alpha_t h_t \tag{9}$$
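Below is a PyTorch sketch of Formulas (7)-(9) applied on top of the BiLSTM outputs; the choice of a linear layer that reduces each $h_t$ to a scalar score before the softmax is our assumption about how the weights are normalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordSelfAttention(nn.Module):
    """Self-attention pooling over BiLSTM outputs, cf. Formulas (7)-(9)."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        # W_a and b_a of Formula (7); here they project each h_t to a scalar score
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h):
        # h: (N, T, hidden_dim), the BiLSTM outputs for T words
        u = torch.tanh(self.score(h))          # (N, T, 1), Formula (7)
        alpha = F.softmax(u, dim=1)            # (N, T, 1), Formula (8): sums to 1 over T
        s = torch.sum(alpha * h, dim=1)        # (N, hidden_dim), Formula (9)
        return s, alpha.squeeze(-1)

if __name__ == "__main__":
    h = torch.randn(2, 40, 1024)
    s, alpha = WordSelfAttention()(h)
    print(s.shape, alpha.sum(dim=1))           # torch.Size([2, 1024]), weights sum to 1
```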
3.3. Cross-Modal Attention
Person search based on natural language description is a cross-modal retrieval task. The features of image and text differ greatly, so the task faces the challenge of cross-modal feature heterogeneity, is easily disturbed by noise, and suffers from weak robustness. To generate better cross-modal features and improve performance, we put forward a cross-modal attention mechanism (CMAM), which captures the correlation between text and image features, attends to the correlated parts of the features, and improves the performance of this cross-modal task.
As illustrated in Figure 2, after the semantic features of the text and image are extracted, the two features are input into the CMAM. The structure of the CMAM is shown in Figure 7.
The CMAM utilizes a fully connected layer followed by a sigmoid function to learn the weights of the input features. The parameters of the CMAM are shared between the image subnetwork and the text subnetwork during feature weight learning, which cuts down the number of parameters and creates a common feature expression space. The cross-modal attention maps are determined as follows:

$$A_v = \sigma\big(W f_v + f_v\big), \qquad A_t = \sigma\big(W f_t + f_t\big)$$

where $f_v$ and $f_t$ are the input features of the CMAM for the image and text, $A_v$ and $A_t$ are the cross-modal attention maps, and $W$ is the weight matrix of the shared fully connected layer.
A fully connected layer and a nonlinear activation function can extract the overall semantic information of the feature map, capture the correlated parts of the cross-modal features, and ignore the irrelevant parts. The sigmoid function maps the learned feature weights to [0, 1], and the attention mechanism makes the network focus on and extract relevant, important features. In addition, weight sharing maps the features of the image and text into a common feature expression space, which also means that the features of one modality can be associated with those of the other modality. The residual mapping adds the features in the original modal space, so that the learned feature weights are related to both the original features and the common features. The common feature space can better address the problem of cross-modal heterogeneity.
After obtaining the feature weights, we combine them with the image and text features to obtain the final cross-modal feature representation. If the feature weights are simply multiplied element-wise with the features, the feature values may become too small. In addition, the ReLU [41] function in the network may aggravate feature sparsity, which may cause overfitting and weaken the robustness of the network. To alleviate these issues, we added an identity mapping to the mechanism. The final feature generation formulas for image and text are as follows:

$$\hat{f}_v = A_v \odot f_v + f_v, \qquad \hat{f}_t = A_t \odot f_t + f_t$$

where $\hat{f}_v$ and $\hat{f}_t$ are the output feature maps of the image and text, and $\odot$ denotes element-wise multiplication. They are intermediate features, which are input to the fully connected layers to obtain the final features.
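A sketch of the cross-modal attention mechanism as we read it from the description above: one weight-shared fully connected layer, a sigmoid with a residual term inside, and an identity mapping on the output. The 2048-dimensional feature size matches FC2, while the exact placement of the residual term is our interpretation of the text.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Weight-shared cross-modal attention with identity mapping (CMAM sketch)."""
    def __init__(self, dim=2048):
        super().__init__()
        # single fully connected layer shared by the image and text branches
        self.fc = nn.Linear(dim, dim)
        self.sigmoid = nn.Sigmoid()

    def attend(self, f):
        # residual term inside the sigmoid relates the weights to the original features
        return self.sigmoid(self.fc(f) + f)

    def forward(self, f_img, f_txt):
        a_img, a_txt = self.attend(f_img), self.attend(f_txt)
        # identity mapping keeps the features from becoming too small or too sparse
        out_img = a_img * f_img + f_img
        out_txt = a_txt * f_txt + f_txt
        return out_img, out_txt

if __name__ == "__main__":
    f_img, f_txt = torch.randn(4, 2048), torch.randn(4, 2048)
    o_img, o_txt = CrossModalAttention()(f_img, f_txt)
    print(o_img.shape, o_txt.shape)    # torch.Size([4, 2048]) torch.Size([4, 2048])
```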
The output of the CMAM is connected to FC2 and FC3 for cross-modal joint learning, as illustrated in Figure 2. Both the input and output dimensions of FC2 are 2048. The input dimension of FC3 is 2048, and its output dimension equals the number of person classes, i.e., the number of person IDs in the person image dataset. Because cross-modal data have the characteristic that low-level features are heterogeneous while high-level semantic features are correlated, FC1 in the image subnetwork and FC1 in the text subnetwork adopt different parameters, while FC2 and FC3 in the image and text subnetworks share the same parameters. This parameter strategy is not only beneficial for extracting effective modality-specific feature representations in the low-level layers, but also for establishing cross-modal semantic correlation in the high-level layers.
3.4. Joint Loss Function
As for the loss function design, this paper proposes a new ranking loss embedded with intra-modal similarity, which can not only distinguish positive and negative sample pairs coarsely, but also distinguish the similarity between negative sample pairs in a fine-grained way, so as to obtain a more accurate ranking of persons for a given description. Additionally, we introduce the classification loss and use identity label information to mine intra-modal semantic information, so that samples within the same modality belonging to the same person are aggregated as much as possible to acquire more distinctive features. The classification loss also restrains the similarity ranking loss during model training. Finally, the similarity ranking loss and classification loss form a joint loss function used to jointly train the model and further improve the performance of natural language description based person search.
3.4.1. Similarity Ranking Loss
Ranking loss is widely used in cross-modal tasks. Its purpose is to make the distances of positive pairs shorter than those of negative pairs by a margin. It focuses only on the cross-modal distance, which leads to two potential disadvantages: it ignores the intra-modal feature distribution, and it ignores the feature similarity among samples in the same modality. A new ranking loss embedded with intra-modal similarity (similarity ranking loss) is proposed in this work. The similarity ranking loss considers not only the intermodal distance, but also the similarity among different samples within the same modality and its influence on the final ranking results.
Ranking loss ranks all person image samples in the dataset according to their distance from the language description of the person to be searched. For this image-text matching task, we use the Euclidean distance $d(f_i, f_t) = \|f_i - f_t\|_2$ as the measure of similarity between two samples. Here, $f_i$ and $f_t$ represent the semantic features output by the FC2 layer for image sample $i$ and text sample $t$, respectively, and $\|\cdot\|_2$ is the $L_2$ norm. The ranking loss formula [42,43] is as follows:

$$L_{rank} = \sum_{\hat{t} \in T} \max\big(0,\ \alpha + d(f_i, f_t) - d(f_i, f_{\hat{t}})\big) + \sum_{\hat{i} \in I} \max\big(0,\ \alpha + d(f_t, f_i) - d(f_t, f_{\hat{i}})\big)$$

Here, $f_i$ and $f_t$ represent the semantic features of the matched image and text, and $\alpha$ is the margin parameter. $I$ stands for the image inputs, and $T$ stands for the text inputs. The first part of the loss function sums over all mismatched text samples $\hat{t}$ given an image query $i$, and the second part sums over all mismatched image samples $\hat{i}$ given a text query $t$. The purpose of the ranking loss is to make the distance of a matched image-text pair (positive pair) shorter than that of any mismatched image-text pair (negative pair). Since language-based person search uses a text query to retrieve images, only the second part of the ranking loss formula is calculated. During training, to improve calculation efficiency, the ranking loss does not sum over all negative samples in the training dataset according to the above formula; instead, the negative samples are summed within a mini-batch, balancing calculation efficiency and person search accuracy.
The above ranking loss has a disadvantage. The goal of language-based person search is not only to accurately match the sample to be searched, but also to satisfy the ranking rule, i.e., to rank according to the similarity relationship between all samples in the gallery and the sample to be searched. The above formula only maximizes the distance between positive and negative samples, which is coarse-grained. A novel similarity ranking loss is presented here to match the image and text features for this fine-grained cross-modal retrieval task of natural language description based person search, as shown in Formula (15):

$$L_{sr} = \sum_{\hat{i} \in I} \max\Big(0,\ \alpha + d(f_t, f_i) - \big(1 - S(f_i, f_{\hat{i}})\big)\, d(f_t, f_{\hat{i}})\Big) \tag{15}$$

We define the intra-modal similarity $S(f_i, f_{\hat{i}})$ to measure the similarity relationship between different samples in the same modality, and the cosine similarity is used for this evaluation. In the similarity ranking loss, we constrain the distance difference between the positive and negative sample pairs. The value range of the intra-modal similarity measured by the cosine similarity is [−1, 1]. When $S(f_i, f_{\hat{i}})$ equals 1, which implies the positive and negative image samples are identical, the loss function becomes $\max\big(0,\ \alpha + d(f_t, f_i)\big)$, which only involves the distance of the positive pair. In other words, during network training, if the images selected as negative samples are very similar to the positive samples, it is not necessary to push the positive and negative pairs far apart. When $S(f_i, f_{\hat{i}})$ equals −1, which means the negative sample is completely different from the positive sample, the loss function becomes $\max\big(0,\ \alpha + d(f_t, f_i) - 2\, d(f_t, f_{\hat{i}})\big)$, giving a greater weight to the negative-pair distance; that is, the model is required to separate the distances of negative pairs and positive pairs as much as possible.
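The following PyTorch sketch implements the similarity ranking loss as reconstructed in Formula (15), computed over a mini-batch in which the k-th image and k-th description form a matched pair; the margin value and the batch-wise negative mining are our assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_ranking_loss(img_feat, txt_feat, margin=0.5):
    """
    Similarity ranking loss (cf. Formula (15)), text-to-image direction only.
    img_feat, txt_feat: (N, D) FC2 features; row k of each tensor is a matched pair.
    """
    n = img_feat.size(0)
    # pairwise Euclidean distances between every text query and every image
    dist = torch.cdist(txt_feat, img_feat, p=2)             # (N, N)
    d_pos = dist.diag().unsqueeze(1)                        # d(f_t, f_i) for matched pairs
    # intra-modal cosine similarity between the positive image and each candidate image
    img_norm = F.normalize(img_feat, dim=1)
    intra_sim = img_norm @ img_norm.t()                     # (N, N), values in [-1, 1]
    # hinge with the (1 - S) coefficient on the negative-pair distance
    loss = F.relu(margin + d_pos - (1.0 - intra_sim) * dist)
    # exclude the diagonal (the positive pair itself) and average over the negatives
    mask = 1.0 - torch.eye(n, device=img_feat.device)
    return (loss * mask).sum() / mask.sum()

if __name__ == "__main__":
    img, txt = torch.randn(8, 2048), torch.randn(8, 2048)
    print(similarity_ranking_loss(img, txt))
```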
3.4.2. Classification Loss
Similarity ranking loss mainly focuses on cross-modal correlation and similarity distance and cannot fully explore the intra-modal distribution of features. In this paper, classification loss is therefore introduced, and intra-modal semantic information is extracted using the identity label information of the samples. Classification loss constrains the similarity ranking loss based on identity labels, so that samples belonging to the same person identity within the same modality are gathered together as much as possible, yielding more distinctive features that further increase the accuracy of this task.
The identities (IDs) of different persons in the dataset are used as classes. The IDs are used to classify images and language descriptions separately, and samples with the same ID in the same modality are assigned to the same class. A fully connected layer FC3 is added after the semantic features of the image and text subnetworks, and its output dimension equals the number of person IDs (i.e., the number of classes) in the dataset. Then, softmax is used to calculate the probability of the class to which each image and natural language description belongs. Since person search can be treated as a multiclass classification task, cross-entropy loss is utilized to predict the identity. The classification losses of the person image and language description are expressed as follows:

$$L_{cls}^{img} = -\sum_{c=1}^{C} p_c \log \hat{p}_c(f_i), \qquad L_{cls}^{txt} = -\sum_{c=1}^{C} p_c \log \hat{p}_c(f_t), \qquad L_{cls} = L_{cls}^{img} + L_{cls}^{txt}$$

where $f_i$ and $f_t$ represent the semantic features output by the FC3 layer for image sample $i$ and text sample $t$, respectively, $\hat{p}_c(\cdot)$ are the predicted probabilities after softmax, $p_c$ is the ground truth probability of class $c$, $C$ is the number of classes, $L_{cls}^{img}$ and $L_{cls}^{txt}$ are the classification losses of the two modalities, respectively, and $L_{cls}$ is the final classification loss combining both modalities.
The final loss function of joint cross-modal learning is as follows:

$$L = \lambda_1 L_{sr} + \lambda_2 L_{cls}$$

where $\lambda_1$ and $\lambda_2$ represent the weights of the two losses.
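Finally, a sketch of how the classification loss and the joint objective could be computed; it reuses the similarity_ranking_loss function from the sketch above, and the shared FC3 head, the margin, and the loss weights λ1 = λ2 = 1 are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_loss(img_feat, txt_feat, person_ids, fc3,
               lambda1=1.0, lambda2=1.0, margin=0.5):
    """Joint loss: similarity ranking loss + identity classification loss."""
    # cross-modal term: similarity ranking loss from the previous sketch
    l_rank = similarity_ranking_loss(img_feat, txt_feat, margin)
    # intra-modal term: shared FC3 maps both modalities to person-ID logits
    l_cls_img = F.cross_entropy(fc3(img_feat), person_ids)
    l_cls_txt = F.cross_entropy(fc3(txt_feat), person_ids)
    l_cls = l_cls_img + l_cls_txt
    return lambda1 * l_rank + lambda2 * l_cls

if __name__ == "__main__":
    num_ids = 100
    fc3 = nn.Linear(2048, num_ids)                 # shared classifier head (FC3)
    img, txt = torch.randn(8, 2048), torch.randn(8, 2048)
    ids = torch.randint(0, num_ids, (8,))          # person identity labels
    print(joint_loss(img, txt, ids, fc3))
```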