Article

Image–Text Matching Model Based on CLIP Bimodal Encoding

School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10384; https://doi.org/10.3390/app142210384
Submission received: 27 September 2024 / Revised: 3 November 2024 / Accepted: 4 November 2024 / Published: 12 November 2024

Abstract

Image–text matching is a fundamental task in the multimodal research field, connecting computer vision and natural language processing by aligning visual content with corresponding textual descriptions. Accurate matching is critical for applications such as image captioning and text-based image retrieval, yet it remains challenging due to the differences between data modalities. This paper addresses these challenges by proposing a robust image–text matching model inspired by Contrastive Language–Image Pre-training (CLIP). Our approach employs the Vision Transformer (ViT) model as the image encoder and Bidirectional Encoder Representations from Transformers (Bert) as the text encoder, integrating these into a shared vector space to measure semantic similarity. We enhance the model’s training efficiency using the LiT-tuning paradigm and optimize learning through a cosine decay strategy for dynamic adjustment of the learning rate. We validate our method on two benchmark datasets, WuKong and Flickr30k, demonstrating that our model achieves superior performance and significantly improves key evaluation metrics. The results underscore the model’s effectiveness in achieving accurate and robust image–text alignment.

1. Introduction

In the multimodal field, image–text matching is an important task, with significant implications for cross-modal tasks such as image captioning [1] and machine translation [2]. In real life, images and text are the primary means through which humans acquire information. The task of image–text matching focuses on measuring the semantic similarity between images and text. The current mainstream approach maps the features of images and text into a shared feature space: deep neural networks extract image and text features, and their similarity is then calculated, or the two vectors are merged and fed into a classification network for categorization. When a sentence is used to describe an image, the sentence can be considered a weak annotation for the image. At a macro level, an image and a sentence correspond one-to-one; from a more detailed perspective, however, certain keywords in the text correspond to specific areas in the image, such as nouns or verbs corresponding to key pixels. Image–text matching methods based on overall features therefore overlook, to some extent, the fine-grained correspondence between images and text [3,4].
Image–text matching is a crucial task in the multimodal field, serving as a fundamental component for various cross-modal applications, such as image captioning and visual question answering. With the rapid development of artificial intelligence, accurately understanding and aligning visual and textual data has become increasingly significant. In real-world scenarios, images and text are the primary ways humans perceive and communicate information, making the ability to semantically match these modalities essential for building systems that can effectively interpret and utilize both types of content.
The objective of image–text matching is to measure the semantic similarity between images and their corresponding textual descriptions. The mainstream approach involves mapping image and text features into a shared representation space using deep neural networks. Typically, this involves extracting image features through vision models like the Vision Transformer (ViT), and text features through language models such as Bert. These features are then aligned through similarity computation or feature fusion, enabling tasks such as cross-modal retrieval and categorization.
Despite significant advancements, current methods primarily focus on global semantic alignment, often overlooking the fine-grained correspondences between specific keywords in the text and visual regions in the image. For example, nouns or verbs in a sentence may correspond to particular objects or actions depicted in an image. Addressing this limitation requires models that not only align overall semantic features but also capture intricate associations between textual and visual elements.
In this work, we propose an enhanced image–text matching model inspired by Contrastive Language–Image Pretraining (CLIP). Our approach introduces a joint training framework utilizing the ViT and Bert encoders, optimized through the LiT-tuning paradigm. Furthermore, we adopt a cosine decay strategy to dynamically adjust the learning rate, enhancing model convergence and performance. We validate our model on benchmark datasets, including Flickr30k and WuKong, demonstrating significant improvements in key performance metrics and illustrating the effectiveness of our method.
Our main contributions are summarized as follows:
  • We present a novel image–text matching framework based on CLIP, integrating the ViT for image encoding and Bert for text encoding, and leveraging the LiT-tuning paradigm for enhanced training efficiency.
  • We implement a cosine decay strategy for adaptive learning rate adjustments, which improves convergence and model stability during training.
  • We conduct extensive experiments on the Flickr30k and WuKong datasets, achieving superior results compared to existing baselines and demonstrating the robustness of our model in multimodal alignment.
This paper is organized as follows: Section 2 reviews related work on image–text matching and multimodal learning. Section 3 details our proposed model architecture and training strategies. Section 4 presents experimental results and analysis, while Section 5 provides an overall summary of the paper. The list of abbreviations and their definitions can be found in Table 1 below.

2. Related Work

Faghri et al. [5] mapped entire images and complete sentences into a shared vector space to compute their cosine similarity. For local interactions between image regions and individual words, Lee et al. [6] proposed the SCAN model, which explored the fine-grained correspondence between image regions and text words to calculate similarity. The introduction of this model inspired the work of Zhang et al. [7] and Hu et al. [8] in exploring fine-grained matching between the image and text modalities. While the aforementioned methods generally performed well, they overlooked the relationship between the two modalities of images and text.
Radford et al. [9] first introduced the concept of Contrastive Language–Image Pretraining (CLIP), where the core idea was to map images and text into a vector space and then compute similarity based on this vector space. For modality features, Liu et al. [10] utilized a recurrent residual neural network to extract visual features, enhancing global information at the visual level, while Kiros et al. [11] used a gated recurrent neural network as the text encoder to encode words in the text and obtain textual information. In recent years, attention mechanisms have been widely applied. Wei et al. [12] employed multi-head cross-modal attention to integrate inter-modal and intra-modal connections to enhance the matching between images and text, while Nam et al. [13] used a dual-attention network to jointly utilize visual and textual attention mechanisms to capture fine-grained semantics between visual and textual elements.
Messina et al. [14] proposed an Alignment Learning and Dynamic Inference Network (ALADIN), which aligned fine-grained images and text and then learned a shared embedding space from the fine-grained alignments, where efficient KNN search could be performed, enhancing the performance of relevant cross-modal retrieval tasks. Fu et al. [15] introduced a relationship modeling framework, HREM, which explicitly captured fragment-level and instance-level relationships to learn discriminative and robust cross-modal embeddings. Pan et al. [16] conducted fine-grained image–text matching from the perspective of information encoding and designed an alignment network within the proposed framework, which connected associated region-word pairs and eliminated other irrelevant word-region pairs. A comparison of classic image–text matching methods is shown in Table 2.

3. Joint Encoding Model for Image–Text Matching

This paper proposes an image–text matching model that aggregates two encoders. Specifically, the model aims to match images with descriptive text or retrieve relevant images based on text queries, addressing key challenges in cross-modal retrieval tasks. The proposed approach involves utilizing the Vision Transformer (ViT) model as the image encoder to capture spatial features effectively and derive a robust image representation vector. For text encoding, we employ Bidirectional Encoder Representations from Transformers (Bert), which leverages bidirectional context to generate a comprehensive text representation vector. The similarity between the two modalities is then measured using cosine similarity, enabling efficient and accurate image–text alignment. This integration of the ViT and Bert, based on the principles of Contrastive Language–Image Pretraining (CLIP), enhances the model’s ability to align and retrieve cross-modal information effectively. The overall framework is shown in Figure 1.
This paper follows the principles of CLIP, employing an individual discrimination proxy task as the model’s supervisory signal. For each sample in a batch, its image and text are treated as a positive sample pair, while the pairings of that image or text with the other texts and images in the batch serve as negative sample pairs. During training, the distance between positive sample pairs is minimized, while the distance between negative sample pairs is maximized. The model then uses a contrastive learning loss function to calculate its loss value, which guides the training direction of the model.
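As an illustration, the batch-level individual discrimination objective described above can be sketched in PyTorch. This is a minimal sketch with illustrative names (clip_style_loss, image_emb, text_emb), not the exact training code of this work; it assumes the two encoders have already produced one embedding per image and per text in the batch.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Individual (instance) discrimination over one batch of N image-text pairs.

    image_emb, text_emb: [N, d] outputs of the image and text encoders.
    The i-th image and the i-th text form the positive pair; all other
    pairings within the batch act as negative pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)  # dot product becomes cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature   # [N, N]; diagonal = positives
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric contrastive loss over both retrieval directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```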

3.1. Image Encoder

In recent years, due to the rapid development of transformers in the field of natural language processing [17], many scholars and experts have been applying this model to the computer vision domain. For image data, this paper utilizes the Vision Transformer for encoding, which consists of three components: Patch Embedding, Transformer Encoder, and a multi-layer perceptron classifier.
In a standard transformer, the required input is a token vector, i.e., a two-dimensional matrix. However, the standard image input data format is [H, W, C], which does not comply with the input requirements of the transformer. Therefore, the first step of the model in this paper employs two operations, Patch and Embedding, to transform the image data into a two-dimensional matrix. The Patch operation is implemented using convolutional operations. For instance, for an image with dimensions [224, 224, 3], representing a 3-channel 224 × 224 image, 768 convolutional kernels of size 16 × 16 are used with a stride of 16 and a padding of 0. According to the convolution output size formula:
$N = \frac{W - F + 2P}{S} + 1$
where W is the input image size, F is the convolution kernel size, P is the padding size, and S is the stride. By calculation, a feature map of [14, 14, 768] is obtained. After the Flatten operation, a two-dimensional matrix of [196, 768] is obtained. In order for the model to acquire the category information of the current image, a class token is added at the beginning of the input sequence in Patch Embedding to represent the category information of the current image, so the overall dimension becomes [197, 768]. As the transformer is not as adept at handling sequential order as an RNN, each token needs to be supplemented with positional encoding information; the positional encoding vector is added directly to the tokens. Thus, after processing by the Embedding layer, the final input dimension of the image data is [197, 768], which meets the input requirements of the Transformer Encoder.
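The Patch Embedding step just described can be summarized in a short PyTorch sketch. The class and parameter names below are illustrative assumptions rather than the paper’s actual code; the shapes follow the [224, 224, 3] → [196, 768] → [197, 768] pipeline above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Patch + Embedding: [B, 3, 224, 224] -> [B, 197, 768]."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # 14 * 14 = 196
        # The Patch operation: 768 kernels of size 16 x 16 with stride 16, no padding
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                                      # x: [B, 3, 224, 224]
        x = self.proj(x)                                       # [B, 768, 14, 14]
        x = x.flatten(2).transpose(1, 2)                       # [B, 196, 768]
        cls = self.cls_token.expand(x.size(0), -1, -1)         # class token
        x = torch.cat([cls, x], dim=1)                         # [B, 197, 768]
        return x + self.pos_embed                              # add positional encoding
```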
The structure of the Transformer Encoder consists of stacking multiple layers of Encoder Blocks. Each Encoder Block includes Layer Norm, a multi-head attention mechanism, Dropout, and an MLP Block. The structure of the Encoder Block is shown in Figure 2.
In contrast to the word vector approach in natural language processing, the Encoder Block first conducts layer normalization on the incoming token vectors. Layer normalization is a common data preprocessing transformation, which normalizes the data to a standard normal distribution with a mean of 0 and a variance of 1. This can accelerate model training, enhance model robustness, and contribute to better model performance. The multi-head attention module enables the model to comprehend data from multiple dimensions, and its computation is parallel, aligning with modern hardware architecture to improve computational efficiency. The self-attention mechanism used in multi-head attention allows each output vector to take all input vectors into account. After the multi-head self-attention layer and layer normalization, an output vector $O = [o_1, o_2, \ldots, o_m] \in \mathbb{R}^{m \times d}$ is obtained. The calculation formula is as follows:
$O = \mathrm{LayerNorm}(Z + \mathrm{MultiHead}(Z))$
Then, a feedforward neural network is applied to each component of O, as follows:
$y_i = \mathrm{LayerNorm}(o_i + \mathrm{FFN}(o_i))$
By performing the above operation on all m regions, we obtain $Y = \{y_1, y_2, \ldots, y_m\}$. We then use an average pooling operation to obtain the final embedding vector:
$q = \frac{1}{m} \sum_{i=1}^{m} y_i$
To prevent overfitting, this paper applies a Dropout operation, which leaves the dimensions unchanged. The final module in the ViT is a multi-layer perceptron classifier, consisting of a Layer Norm layer and a linear layer, and is primarily used for classification. It takes the class token from the tokens and feeds it into the fully connected layer; after a non-linear activation function and a softmax normalization, the final classification result is obtained. This paper transfers the sequence processing concept from the natural language processing domain to the computer vision domain. The patch segmentation approach allows for a more detailed extraction of local features from the image, which are then input into the multi-head attention module in the Encoder for computation. While an RNN can only process one token at each time step, the multi-head attention module can process multiple tokens in parallel and take correlations across multiple dimensions into account, making its sequence processing more efficient than that of an RNN.
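For reference, one Encoder Block following the two formulas above as written (multi-head self-attention with a residual connection and layer normalization, followed by a feedforward network with a residual connection and layer normalization) might be sketched as follows; the dimensions and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer Encoder Block, following the two formulas above as written."""

    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):                 # z: [B, 197, 768]
        attn_out, _ = self.attn(z, z, z)  # multi-head self-attention
        o = self.norm1(z + attn_out)      # O = LayerNorm(Z + MultiHead(Z))
        y = self.norm2(o + self.ffn(o))   # y_i = LayerNorm(o_i + FFN(o_i))
        return y
```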

3.2. Text Encoder

The sequential nature is an inherent characteristic of textual data, and the RNN model is specifically designed to handle sequential data. The RNN model possesses a “memory” function, enabling it to record information from previous time steps and pass it on for current time step processing. However, the drawback of the RNN model is that when the sequence is too long, the information from earlier distant time steps has minimal impact on the current time step, leading to the phenomenon of vanishing gradients. This is not very favorable for long sequential text. Subsequently, the Transformer model gained widespread usage. In 2018, Devlin et al. [18] proposed the Bert model, whose underlying architecture is also based on the Transformer model. In simple terms, Bert is formed by stacking multiple transformer encoders. The structure of Bert model is shown in Figure 3.
In text generation tasks, the traditional approach primarily revolves around the Seq2Seq architecture. However, traditional Seq2Seq has two significant drawbacks. Firstly, at the encoder end, all information is compressed into a single vector, which includes all the information from the encoder end, leading to information loss; additionally, the decoder end cannot use attention to focus on important information. Secondly, it cannot perform parallel computation, fundamentally for the same reason that recurrent neural networks cannot be parallelized.
Before formal encoding, the first step is tokenization. In the Bert model, tokenization consists of initial tokenization and subword tokenization. Initial tokenization converts the input text data into Unicode strings, while removing any illegal characters and extra spaces. It then distinguishes between Chinese and English. For Chinese text, each Chinese character is separated by a space, while English remains unchanged. Subsequently, unnecessary characters and spaces are removed, resulting in a tokenization where each element is either a Chinese word, a regular English word, or a phrase. Subword tokenization applies a greedy longest-match-first algorithm to each basic token unit obtained from the initial tokenization, further dividing an English or Chinese word into smaller units depending on whether it appears in the vocabulary. The token sequence is then marked with the CLS and SEP tokens to form the final text tokens. This paper adopts the approach of Devlin et al. [18], using the WordPiece tokenizer for tokenization to obtain the corresponding tokens. Finally, after passing through the token, segment, and position embedding layers, the word vectors of the current text are obtained. These are then fed into Bert’s Transformer Encoder for encoding to obtain the text’s sentence vector representation.
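A brief sketch of this tokenization and encoding step, using the Hugging Face Transformers library listed in Table 3, is given below. The bert-base-chinese checkpoint is used purely for illustration (matching the Chinese-pretrained weights mentioned in Section 4.5); the exact weights and wrapper code used in this work may differ.

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative checkpoint; the experiments use Chinese-pretrained Bert weights.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "一只小狗坐在地上"  # "A puppy sitting on the ground"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# WordPiece tokens with [CLS] and [SEP] added automatically
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state    # [1, seq_len, 768] word-level vectors
```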
Next, the Bert model is used to extract relevant sentence embeddings from the low-level word tokens. The encoder in the Bert model maps the word tokens $\{x_1^t, x_2^t, \ldots, x_n^t\}$ of the sentence to a continuous representation sequence $\{z_1^t, z_2^t, \ldots, z_n^t\}$, and then calculates the vector inner product at each word embedding position using convolution kernels with three window sizes. The convolution output of the k-th word under the m-th window is:
$p_{m,k} = \mathrm{ReLU}(w_m \cdot z_{k:k+m-1} + b_m), \quad m \in \{1, 2, 3\}$
where $w_m$ represents the convolution kernel matrix and $b_m$ is the bias in the linear map. Due to the differing dimensions of the convolution kernels, the word representations are zero-padded to ensure that the sequence dimension is not lost after the convolution calculation. Max pooling is then applied over all the words in the sentence:
$q = \max(p_{m,1}, \ldots, p_{m,n})$
Here, q represents the features obtained from the word sequence under the action of one convolutional kernel. This paper employs multiple convolutional kernels to extract diverse features from the word sequence, which are subsequently passed to a fully connected layer. Finally, after normalization, the overall embedding of the current text sentence, $e^t \in \mathbb{R}^{1 \times d}$, is obtained.
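The convolution-and-pooling step in the two formulas above can be sketched as follows; the kernel counts, output dimension, and module name are illustrative assumptions rather than the configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEmbedder(nn.Module):
    """Convolutions with window sizes 1, 2, 3 over Bert token embeddings,
    followed by max pooling, a fully connected layer, and normalization."""

    def __init__(self, hidden=768, num_kernels=256, out_dim=512):
        super().__init__()
        # One Conv1d per window size m; padding prevents the sequence dimension
        # from being lost for the larger windows.
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden, num_kernels, kernel_size=m, padding=m - 1)
            for m in (1, 2, 3)
        ])
        self.fc = nn.Linear(3 * num_kernels, out_dim)

    def forward(self, z):                       # z: [B, n, 768] token embeddings
        z = z.transpose(1, 2)                   # [B, 768, n] for Conv1d
        feats = []
        for conv in self.convs:
            p = F.relu(conv(z))                 # p_{m,k} = ReLU(w_m · z_{k:k+m-1} + b_m)
            feats.append(p.max(dim=-1).values)  # max pooling over all positions
        e = self.fc(torch.cat(feats, dim=-1))   # fully connected layer
        return F.normalize(e, dim=-1)           # normalized sentence embedding e^t
```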

3.3. Similarity Calculation

In image–text matching, there are various methods for calculating similarity. Word Mover’s Distance (WMD) measures the semantic similarity between two texts, considering the semantic relationships between words. It represents each text as the weighted sum of word embeddings and then calculates the distance between them. Hamming distance represents the text as binary vectors, and then calculates the number of differing bits at corresponding positions in the binary vectors. When calculating the Hamming distance, the length of the strings must be equal, otherwise, it would be meaningless. The calculation formula is as follows:
$d_H(s_1, s_2) = \sum_{i=1}^{n} \delta(s_{1i}, s_{2i})$
where $\delta(s_{1i}, s_{2i})$ is an indicator function that takes the value 1 if the characters at position i are not equal and 0 if they are equal.
The Euclidean distance calculates the straight-line distance between two points, which measures the magnitude of the difference between vectors. The calculation formula is as follows:
$\mathrm{distance}(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
In this model, cosine similarity is used to measure the similarity between vectors. For vector A and vector B, the cosine similarity value is calculated by the following formula:
$\mathrm{cosine\_sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$
The magnitude of the cosine similarity is related to the angle between the two vectors, with a smaller angle indicating a higher similarity between the vectors. In the image–text matching model, we learn a vector space where we map the vector representations of two modal data. Then, we calculate the similarity value between the vectors.
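As a small sketch, the cosine similarity of two embedding vectors in the shared space can be computed directly from the formula above, or with the equivalent PyTorch built-in; the vector dimension is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

# Two embedding vectors A and B mapped into the shared space (dimension assumed)
A = torch.randn(512)
B = torch.randn(512)

cos_sim = torch.dot(A, B) / (A.norm() * B.norm())   # A·B / (||A|| ||B||)
cos_sim_builtin = F.cosine_similarity(A.unsqueeze(0), B.unsqueeze(0)).item()
```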
To enhance the performance of our model in learning these vector representations, we employ the LiT-tuning paradigm (Language–Image Pre-training Tuning). This paradigm fine-tunes the pre-trained Vision Transformer (ViT) and Bert encoders, which were initially trained on large-scale image and text datasets. LiT-tuning allows the model to transfer learned knowledge effectively, optimizing the encoders for the image–text matching task. By leveraging LiT-tuning, the model can achieve better alignment between visual and textual features, leading to more accurate similarity measurements.
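The text does not spell out the tuning configuration line by line; the sketch below assumes the locked-image variant of LiT-tuning, in which the pre-trained image tower is frozen and only the text tower is updated, consistent with the conclusion’s remark that the text vectors converge towards the image vectors. The function name and hyperparameters are illustrative.

```python
import torch

def configure_lit_tuning(image_encoder, text_encoder, base_lr=1e-3):
    """Locked-image tuning sketch: freeze the pre-trained image tower and
    update only the text tower, so text embeddings move towards the fixed
    image embeddings during contrastive training."""
    for p in image_encoder.parameters():
        p.requires_grad = False      # image features stay fixed
    image_encoder.eval()

    # Only the text encoder's parameters are handed to the optimizer
    # (SGD, as chosen in Section 4.3); momentum is an assumed value.
    optimizer = torch.optim.SGD(text_encoder.parameters(), lr=base_lr, momentum=0.9)
    return optimizer
```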

4. Experimental Analysis

4.1. Experimental Environment

Model training was conducted on a cloud server with a Xeon(R) Platinum 8255C CPU @ 2.50 GHz, an Nvidia RTX2080ti GPU with 11 GB of graphics memory, 40 GB of system memory, and CUDA version 10.2. The operating system used for the experiments in this section was Ubuntu 18.04. The relevant third-party libraries are shown in Table 3.

4.2. Datasets

This paper utilized the WuKong dataset [19], a large-scale image–text pair dataset created by Huawei Noah’s Ark, which contains 166 million image–text pairs. This dataset is designed to handle diverse and complex visual and textual content, making it suitable for training models that need to understand a wide variety of real-world scenes and corresponding descriptions. After downloading and processing the data, we selected 1.4 million image–text pairs as the training set and 26,000 pairs from the WuKong test set as the test set. Each image in WuKong is annotated with a comprehensive, sentence-level textual description, which aids in developing robust image–text alignment capabilities.
In addition to WuKong, we employed the domain-specific Flickr30k dataset [20], which is widely used in image–text matching research. Flickr30k comprises 31,000 images, each annotated with five diverse and descriptive sentences, totaling 155,000 sentences. This dataset emphasizes scenes commonly found in social and cultural contexts, providing valuable training data for models focusing on more structured and descriptive image–text relationships. We used 29,000 images as the training set, 1000 images as the validation set, and 1000 images as the test set.

4.3. Learning Rate Scheduling

While the Adam optimizer has been widely recognized for its adaptive learning rate and has gained popularity in multimodal learning, our preliminary experiments indicated that stochastic gradient descent (SGD) was more suitable for our application. Specifically, we observed that SGD achieved better generalization performance across both the Flickr30k and WuKong datasets, while minimizing the risk of overfitting. Moreover, training stability was enhanced with SGD, resulting in smoother convergence for our dual-encoder architecture. Based on these results, we chose to utilize SGD to ensure consistent and reliable performance throughout our experiments.
In the model training process, this paper introduced a learning rate scheduling strategy to find the minimum value more efficiently and facilitate faster convergence. As the model sought to extract deeper information during training, it typically consisted of a larger number of layers. This complexity could lead to challenges, such as gradient disappearance or explosion, complicating the convergence process. Gradient descent operated through iterative updates in the feature space, where features of varying scales may have contributed to gradient instability. To address the issue of inconsistent scales, normalization techniques were employed to ensure uniformity across feature data. This effectively alleviated problems related to gradient explosion and disappearance. Additionally, learning rate scheduling served as another effective solution by modifying the learning rate to adjust the magnitude of parameter updates.
Among the various popular scheduling schemes, the learning rate decay strategy was particularly prevalent. This strategy began by setting a relatively large learning rate as a target value, then employed a warm-up phase to gradually increase the learning rate to this target. Subsequently, the learning rate was reduced during training. In this paper, we utilized the cosine decay strategy, which decreased the learning rate according to a cosine function. The rationale for selecting cosine decay lay in its smooth reduction of the learning rate, which enhanced convergence stability and helped prevent abrupt changes that could destabilize training.
With exponential decay, the learning rate changed according to an exponential function, with the formula:
$\alpha_t = \alpha_0 \beta^{n}$
where $\alpha_0$ represented the initial learning rate, n represented the current training round, and $\beta$ represented the learning rate decay coefficient. Exponential decay meant that the learning rate would decay once every training round. Step decay meant that the learning rate would decay by a fixed multiple after a certain number of rounds. The formula was:
$current\_lr\_rate = lr\_rate \cdot decay\_rate^{\,global\_epoch / decay\_epoch}$
where $lr\_rate$ was the initial learning rate, $decay\_rate$ was the learning rate decay coefficient, $global\_epoch$ was the number of rounds of training completed so far, and $decay\_epoch$ was the learning rate decay cycle, which determined after how many rounds the learning rate decayed. As this was a step decay, its function graph was discrete: when the remainder of the training round divided by the learning rate decay cycle was 0, the above formula was evaluated to obtain the new learning rate value.
This paper used the cosine decay strategy, indicating that the change trend of the learning rate during the model training process followed the cosine function law. The cosine decay formula was:
$global\_step = \min(global\_step, decay\_steps)$
$cosine\_decay = 0.5 \cdot \left(1 + \cos\left(\pi \cdot \frac{global\_step}{decay\_steps}\right)\right)$
$decayed = (1 - \alpha) \cdot cosine\_decay + \alpha$
$decayed\_learning\_rate = learning\_rate \cdot decayed$
where $learning\_rate$ represented the initial learning rate, $decayed$ represented the decay rate, $global\_step$ represented the current iteration round, $decay\_steps$ represented the decay period, indicating how often the learning rate was modified, and $\alpha$ was the minimum fraction of the learning rate retained at the end of the decay. The learning rate was adjusted according to the cosine decay strategy, and its trend resembled the graph of the cosine function, as shown in Figure 4.
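The cosine decay formula above translates directly into a small helper function; the initial learning rate, number of decay steps, and alpha value below are illustrative assumptions rather than the settings used in the experiments.

```python
import math

def cosine_decay_lr(learning_rate, global_step, decay_steps, alpha=0.0):
    """Cosine decay of the learning rate, following the formula above."""
    global_step = min(global_step, decay_steps)
    cosine_decay = 0.5 * (1 + math.cos(math.pi * global_step / decay_steps))
    decayed = (1 - alpha) * cosine_decay + alpha
    return learning_rate * decayed

# Example: the learning rate falls smoothly from 1e-3 towards alpha * 1e-3
for step in (0, 250, 500, 1000):
    print(step, cosine_decay_lr(1e-3, step, decay_steps=1000))
```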

4.4. Loss Function

Because this paper was based on the contrastive learning strategy of CLIP, it adopted the contrastive learning loss (infoNCE loss) [21] to calculate the training loss for each batch. The formula for calculating this loss function was as follows:
$L_i = -\log \frac{\exp\left((q \cdot k_{+}) / \tau\right)}{\sum_{i=0}^{K} \exp\left((q \cdot k_i) / \tau\right)}$
The terms $(q \cdot k_{+})$ and $(q \cdot k_i)$ represented a measure of similarity, which could be cosine similarity or another similarity measure. The variable $\tau$ was a temperature hyperparameter that controlled the model’s discrimination against negative samples. A larger value of this hyperparameter resulted in output probabilities close to a uniform distribution, indicating that the probabilities for all categories were similar; conversely, a smaller value made the output probabilities sharper, leading to a high probability for one category and low probabilities for the others. The numerator calculated the similarity of each sample to its positive sample, while the denominator calculated the sum of similarities of the sample to the positive and all negative samples. This loss function aimed to maximize the similarity between positive sample pairs and minimize the similarity between negative sample pairs. By minimizing the infoNCE loss, the model learned to map similar samples to nearby regions of the embedding space and dissimilar samples to distant regions, thereby improving its embedding representations.
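A toy numerical example illustrates the role of the temperature hyperparameter: with the similarity values assumed below, a smaller τ sharpens the softmax distribution and drives the infoNCE loss of a well-separated positive pair towards zero.

```python
import torch
import torch.nn.functional as F

# Assumed toy similarities for one query q: index 0 is the positive key k+,
# indices 1 and 2 are negative keys.
sims = torch.tensor([0.8, 0.3, 0.1])

for tau in (1.0, 0.07):
    probs = F.softmax(sims / tau, dim=0)
    loss = -torch.log(probs[0])          # infoNCE loss for this sample
    print(f"tau={tau}: p(positive)={probs[0].item():.3f}, loss={loss.item():.3f}")
# A smaller tau sharpens the distribution, so the probability of the positive
# pair approaches 1 and the loss approaches 0 when the pairs are well separated.
```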

4.5. Experimental Results and Analysis

This paper utilized two different image encoders and text encoders, resulting in a total of four combined models. The image encoders were ViT32 and ViT16, while the text encoders were Bert-base-chinese and Bert-base-fin. The text encoder employed pre-trained Bert weights on Chinese corpora, which made it more suitable for the WuKong dataset used in this study. To evaluate the model’s performance on image–text matching tasks, we used R@n (Recall at n) as a primary metric. R@n measured the proportion of correct matches within the top-n retrieved results. Specifically, R@1, R@5, and R@10 represented the recall rates when the correct match appeared within the top 1, 5, and 10 retrieved results, respectively. This metric provided a clear indication of the model’s ability to rank relevant items near the top of the retrieval list.
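R@n can be computed from a similarity matrix as sketched below. The helper assumes one ground-truth item per query, as in the paired WuKong test set; datasets with multiple captions per image (such as Flickr30k in the image-to-text direction) require the per-image variant that counts a hit if any ground-truth caption appears in the top n.

```python
import torch

def recall_at_k(similarity, ks=(1, 5, 10)):
    """R@k from a similarity matrix where row i is a query and the correct
    match for query i is item i (one ground-truth item per query)."""
    n = similarity.size(0)
    targets = torch.arange(n).unsqueeze(1)                  # ground-truth indices
    ranking = similarity.argsort(dim=1, descending=True)    # items ordered by score
    return {
        f"R@{k}": (ranking[:, :k] == targets).any(dim=1).float().mean().item() * 100
        for k in ks
    }

# Example usage (embeddings assumed): text-to-image retrieval
# scores = text_emb @ image_emb.t()
# print(recall_at_k(scores))
```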
To validate the effectiveness and feasibility of our model, we conducted comparative experiments on the Flickr30k and WuKong datasets, calculating the corresponding metrics. In these experiments, we selected SGM [22], NAAF [23], and CVSE [24] as baseline methods for comparison. These methods were representative models in the field of image–text matching: SGM (Scene Graph Matching) achieved relationship-aware image–text matching by matching scene graphs constructed from images and text; NAAF (Negative-Aware Attention Framework) introduced an attention mechanism that explicitly modeled mismatched fragments to enhance cross-modal retrieval performance; and CVSE (Consensus-Aware Visual-Semantic Embedding) incorporated consensus-level concepts into the semantic embedding of images and text. We chose these baseline methods due to their outstanding performance on public datasets and their diverse model architectures and feature extraction approaches, which allowed for a comprehensive evaluation of our proposed method’s performance.
Comparative experiments were conducted on the two datasets, and the corresponding metrics were calculated to validate the model’s effectiveness and feasibility. The experiments encompassed both image retrieval and text retrieval in multimodal matching, with results on the Flickr30k dataset shown in Table 4.
In the Image-to-Text task, the multimodal matching model proposed in this paper achieved an R@1 value 0.2% higher than NAAF and an R@10 value 0.7% higher than the comparative models NAAF and CVSE. In the Text-to-Image task, the R@1 value was 1.2% higher than that of NAAF, and the R@10 value was 1.1% higher than that of SGM. According to the results in the table above, in the Image-to-Text task the coverage of relevant text retrieved by the model at k = 10 was substantial, and the same held for the Text-to-Image task at k = 10. This was because, in the R@k recall rate, k represented the top k returned results, and R@k measured the proportion of queries whose correct result appeared within the top k results; as k increased, more results were included, and the recall rate therefore increased.
In addition to the Flickr30k dataset, this paper also used the WuKong dataset, which is distributed as a CSV file. The URL field in the dataset gives the web address of each image, while the Text field contains the textual description corresponding to the image; it was therefore necessary to download the images to the local environment using the URLs. This dataset pairs each image with one textual description. Comparative experiments with the relevant models were also carried out on this dataset, and the results for the relevant indicators are shown in Table 5.
In the text retrieval direction, our model achieved a 2.0% higher R@1 value, and a 2.4% higher R@10 value compared to the maximum values from other models. In the image retrieval direction, our model achieved a 2.2% higher R@1 value, and a 2.1% higher R@10 value compared to the maximum values from other models. As shown in the above data table, there was a significant improvement in the recall rate when covering the top 10 retrieval results. Additionally, compared to the Flickr30k dataset, the improvement on the WuKong dataset was more significant, possibly due to the models being pre-trained on Chinese corpora.
During the model training process, the changes in the model’s loss were recorded. The corresponding trends are shown in Figure 5.
From the above figures, it can be observed that within the initial few steps, the loss value rapidly increases. This is because the model is in the early stages of training and has not yet adapted to the new dataset, resulting in significant changes in the loss value. However, as the training steps increase, the overall training loss gradually decreases and eventually approaches 0, indicating that the model’s parameters have gradually adapted to the entire dataset.
During the initial training process, the objective function of the model fluctuated around a certain value without reaching convergence. This was due to the use of a large learning rate, which resulted in large parameter updates, preventing the model from approaching the optimal solution. Therefore, this paper employed a cosine decay strategy to dynamically adjust the learning rate, allowing the model to smoothly adjust the learning rate during training according to the cosine function. As the learning rate decreased, the magnitude of the model parameter updates also decreased, gradually improving the overall accuracy of the model.
In this paper, several images and their corresponding text were randomly extracted from the test set. Different models were then used to calculate the corresponding similarity values. The model’s performance was evaluated using similarity scores between images and text descriptions, which indicated how well the image matched each description. Scores closer to 1 suggested a higher degree of match, meaning the image was strongly related to the description, while scores closer to 0 indicated a weaker match and lower relevance. The results are shown in Table 6.

5. Conclusions

This paper is based on the idea of CLIP and utilizes two transformer-based architectures, the Vision Transformer for image encoding and Bert for text encoding, to implement a multimodal matching model. During the model training process, the LiT-tuning training paradigm is employed to jointly train both models, enabling the text vectors to quickly converge towards the image vectors. Additionally, dynamic adjustments to the learning rate during training address the issue of large parameter updates hindering convergence. The entire model is trained based on the idea of contrastive learning, using individual discriminative proxy tasks to provide supervisory signals for model training. The model in this paper achieves further improvement in the task of multimodal matching, promoting further interaction and fusion of information between different modalities.

Author Contributions

Conceptualization, Y.Z. and H.X.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z., H.X., and A.D.; formal analysis, Y.Z.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z. and A.D.; writing—review and editing, A.D. and B.W.; visualization, A.D. and B.W.; supervision, H.X.; project administration, Y.Z., A.D., and B.W.; funding acquisition, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

Research on image visual perception description generation technology integrating attention mechanism, The Scientific Research Project of Jilin Provincial Department of Education, No. JJKH20230841KJ.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  2. Toyama, J.; Misono, M.; Suzuki, M.; Nakayama, K.; Matsuo, Y. Neural machine translation with latent semantic of image and text. arXiv 2016, arXiv:1611.08459. [Google Scholar]
  3. Zhou, T.; Cai, Z.; Liu, F.; Su, J. In pursuit of beauty: Aesthetic-aware and context-adaptive photo selection in crowdsensing. IEEE Trans. Knowl. Data Eng. 2023, 35, 9364–9377. [Google Scholar] [CrossRef]
  4. Cheng, D.; Chen, L.; Lv, C.; Guo, L.; Kou, Q. Light-guided and cross-fusion U-Net for anti-illumination image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8436–8449. [Google Scholar] [CrossRef]
  5. Faghri, F.; Fleet, D.; Kiros, J.; Fidler, S.V. Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612. [Google Scholar]
  6. Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 201–216. [Google Scholar]
  7. Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3536–3545. [Google Scholar]
  8. Hu, Z.; Luo, Y.; Lin, J.; Yan, Y.; Chen, J. Multi-level visual-semantic alignments with relation-wise dual attention network for image and text matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 789–795. [Google Scholar]
  9. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  10. Teney, D.; Liu, L.; van Den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1–9. [Google Scholar]
  11. Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539. [Google Scholar]
  12. Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10941–10950. [Google Scholar]
  13. Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 299–307. [Google Scholar]
  14. Messina, N.; Stefanini, M.; Cornia, M.; Baraldi, L.; Falchi, F.; Amato, G.; Cucchiara, R. Aladin: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-Based Multimedia Indexing, Graz, Austria, 14–16 September 2022; pp. 64–70. [Google Scholar]
  15. Fu, Z.; Mao, Z.; Song, Y.; Zhang, Y. Learning semantic relationship among instances for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15159–15168. [Google Scholar]
  16. Pan, Z.; Wu, F.; Zhang, B. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19275–19284. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  18. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  19. Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 26418–26431. [Google Scholar]
  20. Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2641–2649. [Google Scholar]
  21. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  22. Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 1508–1517. [Google Scholar]
  23. Zhang, K.; Mao, Z.; Wang, Q.; Zhang, Y. Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15661–15670. [Google Scholar]
  24. Wang, H.; Zhang, Y.; Ji, Z.; Pang, Y.; Ma, L. Consensus-aware visual-semantic embedding for image-text matching. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16, 2020. pp. 18–34. [Google Scholar]
Figure 1. Image–Text Matching Architecture.
Figure 2. Encoder Block Structure Diagram.
Figure 3. Bert Model Structure Diagram.
Figure 4. Learning Rate Decay.
Figure 5. Model Loss. (a) Loss trend of ViT16 with Bert-base; (b) loss trend of ViT16 with Bert-base-fin; (c) loss trend of ViT32 with Bert-base; (d) loss trend of ViT32 with Bert-base-fin.
Table 1. List of Abbreviations and Their Definitions.

Full Form | Abbreviation
Contrastive Language–Image Pre-training | CLIP
Vision Transformer | ViT
Bidirectional Encoder Representations from Transformers | Bert
Recurrent Neural Network | RNN
Sequence-to-Sequence | Seq2Seq
Multi-Layer Perceptron | MLP
Feed-Forward Neural Network | FFN
K-Nearest Neighbors | KNN
Information Noise Contrastive Estimation Loss | infoNCE loss
Recall at n | R@n
Stochastic Gradient Descent | SGD
Classification Token | CLS
Separator Token | SEP
Comma-Separated Values | CSV
Table 2. Comparison of Classic Image–Text Matching Methods.

Reference | Model | Key Techniques | Advantages | Limitations
Faghri et al. [5] | Shared Vector Space Model | Cosine similarity, global feature mapping | Simple, effective for overall similarity | Misses fine-grained image–text details
Lee et al. [6] | SCAN Model | Attention-based fine-grained matching | Captures detailed region–word alignment | Ignores inter-modal relationships
Radford et al. [9] | CLIP | Contrastive learning, global + fine-grained mapping | Strong generalization and efficiency | Limited handling of subtle, nuanced details
Wei et al. [12] | Cross-Modal Attention | Multi-head inter- and intra-modal attention | Enhances semantic alignment and matching | High computational cost
Messina et al. [14] | ALADIN | Fine-grained alignment, KNN search | Fast, efficient cross-modal retrieval | Requires accurate feature alignment
Our Method | CLIP-Based Bimodal Encoding | ViT + Bert encoders, LiT-tuning, cosine decay | Superior convergence, robust performance, captures both global and fine-grained semantics | Potential computational complexity increase
Table 3. Third-Party Libraries.

Library Name | Version
Python | 3.8
Pytorch | 1.12.0
Pandas | 1.4.2
Numpy | 1.22.4
Transformers | 4.21.0
Pillow | 9.3.0
Table 4. Results of ViT–Bert on Flickr30k.

Model | Image to Text (R@1 / R@5 / R@10) | Text to Image (R@1 / R@5 / R@10) | rsum
SGM [22] | 73.2 / 92.8 / 96.6 | 57.2 / 87.2 / 94.1 | 501.1
NAAF [23] | 75.0 / 94.2 / 97.4 | 58.1 / 83.6 / 89.7 | 498.0
CVSE [24] | 69.1 / 93.3 / 97.4 | 55.5 / 86.9 / 93.8 | 496.0
ours | 75.2 / 94.1 / 98.1 | 59.3 / 84.2 / 95.2 | 506.1
Table 5. Comparison results of ViT–Bert on the WuKong dataset.

Model | Image to Text (R@1 / R@5 / R@10) | Text to Image (R@1 / R@5 / R@10) | rsum
SGM [22] | 71.2 / 83.1 / 90.3 | 46.5 / 63.4 / 72.6 | 427.1
NAAF [23] | 67.0 / 84.5 / 91.4 | 58.3 / 79.6 / 88.4 | 469.2
CVSE [24] | 70.2 / 79.4 / 88.2 | 66.7 / 71.9 / 80.2 | 456.6
ours | 73.4 / 81.3 / 93.5 | 68.7 / 61.8 / 90.8 | 469.5
Table 6. Test Results.

Image | ViT32 + Bert_base | ViT32 + Bert_base_fin | ViT16 + Bert_base | ViT16 + Bert_base_fin
Applsci 14 10384 i001 | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 7.17 × 10^−14), (‘A puppy sitting on the ground’, 1.82 × 10^−17) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 3.84 × 10^−12), (‘A puppy sitting on the ground’, 1.51 × 10^−14) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 4.42 × 10^−11), (‘A puppy sitting on the ground’, 1.87 × 10^−15) | (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.0), (‘A bus is parked on the roadside’, 3.35 × 10^−12), (‘A puppy sitting on the ground’, 3.36 × 10^−15)
Applsci 14 10384 i002 | (‘A puppy sitting on the ground’, 0.998), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 4.42 × 10^−8), (‘A bus is parked on the roadside’, 1.52 × 10^−10) | (‘A puppy sitting on the ground’, 0.996), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.33 × 10^−9), (‘A bus is parked on the roadside’, 1.15 × 10^−9) | (‘A puppy sitting on the ground’, 0.999), (‘A bus is parked on the roadside’, 1.23 × 10^−11), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.8051 × 10^−14) | (‘A puppy sitting on the ground’, 0.977), (‘A bus is parked on the roadside’, 2.10 × 10^−12), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 5.82 × 10^−13)
Applsci 14 10384 i003 | (‘A bus is parked on the roadside’, 1.0), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 8.04 × 10^−7), (‘A puppy sitting on the ground’, 8.00 × 10^−11) | (‘A bus is parked on the roadside’, 0.998), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 1.59 × 10^−3), (‘A puppy sitting on the ground’, 3.45 × 10^−7) | (‘A bus is parked on the roadside’, 1.0), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 3.1951 × 10^−4), (‘A puppy sitting on the ground’, 1.23 × 10^−10) | (‘A bus is parked on the roadside’, 0.999), (‘Autumn sports car aesthetic pictures desktop wallpaper’, 5.68 × 10^−4), (‘A puppy sitting on the ground’, 1.76 × 10^−12)
Note: Similarity scores indicated how closely each image matched the corresponding text descriptions. Scores closer to 1 represented a stronger match, while scores closer to 0 indicated a weaker match.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
