4.1. Dataset and Preprocessing
For an objective performance evaluation of the proposed method, we used the Microsoft Research Video Description (MSVD) dataset, which has been widely used for the video captioning problem. The MSVD dataset consists of 1970 YouTube clips with human-annotated descriptions in open domains such as cooking and movies. Each video has approximately 41 descriptions.
We tokenized the descriptions using the wordpunct tokenizer from the Natural Language Toolkit, and the descriptions were lowercased. Apart from this simple tokenization, no natural language processing technique was applied. The dataset contains approximately 5.7 million tokens, and each description consists of approximately seven tokens. As in previous studies, we used 1200 videos for training, 100 videos for validation, and 670 videos for evaluation. We sampled a total of 20 frames from each video at a rate of two frames per second. We cropped each frame into a square shape and resized it to 224 × 224 pixels. To normalize the RGB values into the range [−1, 1], we divided them by 127.5 and subtracted 1. Sample training data from the MSVD dataset are shown in Table 2, with only five descriptions listed per video.
From the descriptions of the first video in Table 2, we can observe that the same entity, a squirrel, was referred to by different words, such as “small,” “animal,” “chipmunk,” or “hamster.” Additionally, the nut eaten by the squirrel was expressed as “peanut” or “food.” Similar patterns can be observed in the descriptions of the second video, in which the word “ingredients” is also misspelled as “ingrediants.” The descriptions of the third video indicate an onion, which could be confused with an orange without paying attention to the video. Most of the descriptions were composed in the progressive form.
For transfer learning, we used two image datasets: the Microsoft Common Objects in Context (MS COCO) dataset published in 2014 and the Flickr30k dataset. The images and descriptions were preprocessed in the same manner as the MSVD dataset.
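The frame and description preprocessing described in this subsection can be summarized as in the following sketch, assuming a center crop, the NLTK wordpunct tokenizer, and the standard 1/127.5 scaling with an offset of −1; the function names and the OpenCV-based resizing are illustrative choices, not the authors' code.

```python
import numpy as np
import cv2
from nltk.tokenize import wordpunct_tokenize

def preprocess_description(description: str) -> list:
    """Lowercase a description and split it into word/punctuation tokens."""
    return wordpunct_tokenize(description.lower())

def preprocess_frame(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Center-crop a frame to a square, resize it, and scale RGB values to [-1, 1]."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = frame[top:top + side, left:left + side]
    resized = cv2.resize(square, (size, size)).astype(np.float32)
    return resized / 127.5 - 1.0          # assumed normalization to [-1, 1]

print(preprocess_description("A squirrel is eating a peanut."))
# ['a', 'squirrel', 'is', 'eating', 'a', 'peanut', '.']
```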
4.2. Experimental Environment
We used the ResNet50V2 network [27] pretrained on the ImageNet dataset. The dimensionalities of the frame and word embeddings, the RNN hidden states, the internal projection of the attention mechanisms, and the word logits were all set to 512. The maximum input and output lengths were set to 20 and 10, respectively. We used a leaky rectified linear unit [44] as the activation function. We used the Adam optimizer [45] for training with a learning rate of 5 × 10⁻⁵, β₁ of 0.9, β₂ of 0.999, and ε of 1 × 10⁻⁷. We only used words that appeared more than once in the training set of the MSVD dataset and the image datasets; with this setting, the vocabulary consisted of 21,992 tokens. The hyperparameters of the proposed method are listed in Table 3.
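As a concrete illustration, the settings above can be expressed as the following TensorFlow 2.x configuration sketch (the paper used TensorFlow 2.3); the dictionary keys and variable names are illustrative, not the authors' identifiers.

```python
import tensorflow as tf

config = {
    "model_dim": 512,          # frame/word embeddings, RNNs, attention projection, word logits
    "max_input_length": 20,    # sampled frames per video
    "max_output_length": 10,   # generated tokens per description
    "vocab_size": 21992,       # tokens appearing more than once in the training data
}

activation = tf.keras.layers.LeakyReLU()   # leaky rectified linear unit [44]
optimizer = tf.keras.optimizers.Adam(      # Adam [45] with the listed hyperparameters
    learning_rate=5e-5, beta_1=0.9, beta_2=0.999, epsilon=1e-7
)
```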
The beam search algorithm is a greedy tree search algorithm that limits the number of candidate paths kept at each step to the beam size. Many researchers have used the beam search algorithm to generate word sequences from a trained model. In our case, the goal of the beam search algorithm was to find the combination of words with the highest likelihood in a greedy manner. Paths for which the termination token was generated were stored separately as candidates. We then selected the word sequence with the highest average likelihood among the candidates searched up to the maximum length and the previously stored candidates. Because the given descriptions were short, no additional length penalty was applied. We used a beam size of 5 to generate the descriptions.
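The following is a minimal sketch of this decoding procedure (beam size 5, maximum length 10, finished candidates stored separately, final selection by average log-likelihood, no length penalty); the step_fn interface that returns next-token log-probabilities for a prefix is an assumption, not the authors' API.

```python
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=5, max_len=10):
    """step_fn(prefix) -> log-probability vector over the vocabulary (assumed interface)."""
    beams = [([bos_id], 0.0)]   # (token sequence, summed log-likelihood)
    finished = []               # paths whose termination token was generated
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    # No length penalty: rank finished and maximum-length paths by average log-likelihood.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))[0]
```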
Similar to other studies, we used the bilingual evaluation understudy (BLEU) [46], the metric for evaluation of translation with explicit ordering (METEOR) [47], and the consensus-based image description evaluation (CIDEr) [48] as performance measures. BLEU is widely used for evaluating machine translation methods; it measures the n-gram precision of automatically generated descriptions against the ground truth, and we used BLEU-4 for evaluation. METEOR is likewise widely used for evaluating machine translation methods; it matches the generated description against the ground truth in a semantic manner, using stemming and synonym matching. CIDEr is a measure designed for evaluating image captioning methods; it measures an n-gram F-score weighted by the term frequency–inverse document frequency (TF-IDF). Higher scores on all of these measures imply better performance.
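For reference, these three measures can be computed with the publicly available coco-caption evaluation package (pycocoevalcap), assuming its usual dictionary-based compute_score interface; this is one common tooling choice and not necessarily the implementation used in this work.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Both dictionaries map a video id to a list of tokenized, lowercased captions.
references = {"vid0": ["a squirrel is eating a peanut", "a chipmunk eats some food"]}
hypotheses = {"vid0": ["a squirrel is eating a nut"]}

bleu, _ = Bleu(4).compute_score(references, hypotheses)      # BLEU-1 ... BLEU-4
meteor, _ = Meteor().compute_score(references, hypotheses)   # requires a local Java runtime
cider, _ = Cider().compute_score(references, hypotheses)
print(bleu[3], meteor, cider)                                # BLEU-4, METEOR, CIDEr
```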
To train the proposed method, we used a workstation equipped with an NVIDIA GeForce RTX 3090 graphics card to accelerate training. The proposed method was implemented using TensorFlow 2.3.
4.3. Single Model and Ensemble Model
We pretrained the proposed method for up to 10 epochs using the image datasets with a batch size of 32, and we then trained it for an additional 30,000 steps using the MSVD dataset with a batch size of 8. When training with the MSVD dataset, we stored the parameters every 2000 steps. We refer to each stored set of parameters as a single parameter; in this work, 15 single parameters were stored.
To further enhance the performance of the proposed method, we ensembled the single parameters following previous research [49]. The ensemble methods we used are an arithmetic average of the values of the single parameters and a weighted average according to the METEOR scores of the single parameters. We chose the best five single parameters for the ensemble and also considered the combinations of the best two, the best three, the best four, and the best five for each averaging scheme. Among these 23 candidate parameters (the 15 single parameters and the 8 ensemble parameters), the parameter with the best validation performance was finally finetuned using the validation data.
4.4. Quantitative Results
First, we validated the single parameters using the same beam search algorithm and performance measures as in testing. The BLEU-4 and METEOR scores of the single parameters are shown in Figure 4.
In Figure 4, the highest point is marked by an open circle, and its value is shown next to the circle. The index of the optimal parameter is indicated by a vertical red dashed line. As shown in Figure 4, we chose the parameter stored at 6000 (3 × 2000) steps, which achieved the highest BLEU-4 and METEOR scores, as the optimal single parameter based on the validation set of the MSVD dataset. We then chose the best five single parameters, in the order of their METEOR scores, for the ensemble. The BLEU-4 and METEOR scores of the ensemble parameters are shown in Figure 5.
Through this validation, we finally chose for testing the optimal single parameter, the optimal arithmetic-average ensemble parameter, and the optimal weighted-average ensemble parameter, with METEOR scores of 32.9, 33.51, and 33.54, respectively. The parameter with the highest METEOR score was the weighted-average ensemble parameter, and we finetuned this parameter using the validation data of the MSVD dataset with a learning rate of 5 × 10⁻⁶ for only one epoch. The proposed method with the optimal single parameter is expressed as “Multi-Representation Switching with a single parameter” (MRS-s), and the methods with the optimal ensemble parameters obtained by the arithmetic average and the weighted average are expressed as “MRS-ea” and “MRS-ew,” respectively. Finally, the proposed method with the finetuned parameter is expressed as “MRS-ew+.”
The experimental results are shown in
Table 4. The comparison methods are sorted in ascending order based on the METEOR score.
As shown in
Table 4, the proposed methods consistently showed good performance on all three measures. In the ensemble models, MRS-ea and MRS-ew, the BLEU-4 and METEOR scores increased slightly, whereas the CIDEr score increased significantly compared with the single model, MRS-s. In particular, the finetuned model, MRS-ew+, recorded scores that exceeded the performance of most existing video captioning methods. These scores were achieved in only 6000 training steps for MRS-s and 30,000 training steps for the ensemble models. Because the batch size was 8, one epoch consisted of approximately 6000 training steps, so these models were trained for approximately one epoch and five epochs, respectively.
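For concreteness, the steps-per-epoch figure follows from the size of the training split and the batch size, assuming the dataset-wide average of roughly 41 descriptions per video also holds for the 1200 training videos:

```latex
\frac{1200 \text{ videos} \times 41 \text{ descriptions/video}}{8 \text{ pairs/batch}}
= \frac{49{,}200}{8} = 6150 \approx 6000 \text{ steps per epoch}
```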
We compared the proposed method with PickNet [
8], STAT_V [
6], TSA-ED [
9], and RecNet [
10], which are similar in performance and training method. PickNet was trained in three phases, with up to 100 epochs in each phase: the first phase trains the encoder–decoder networks with a negative log-likelihood loss on the target words, the second phase trains the picking network with reinforcement learning, and the last phase jointly trains both networks. TSA-ED was trained for up to 30 epochs, with training stopped early when the validation performance did not improve for 20 epochs. RecNet, similar to TSA-ED, was trained until the validation performance did not improve for 20 epochs, and it additionally used a reconstruction loss for training. By contrast, the proposed method was trained for only up to five epochs.
The proposed method achieved BLEU-4, METEOR, and CIDEr scores that were 8.2, 0.9, and 4.3 points higher than those of PickNet, without reinforcement learning. Compared with TSA-ED, the proposed method recorded BLEU-4 and CIDEr scores that were 2.3 and 5.4 points higher, with a tied METEOR score. Compared with RecNet, the proposed method achieved a BLEU-4 score that was 2 points higher and a METEOR score that was 0.1 points lower, without any additional loss, and the CIDEr score was tied.
STAT_V consists of a pretrained 2D CNN, C3D [
40], and Faster R-CNN [
41]. Because the MSVD dataset does not provide additional coordinate (bounding-box) annotations for its videos, the Faster R-CNN used in STAT_V cannot be finetuned in an end-to-end manner. This structure has the limitation that the feature extraction phase and the description generation phase must be separated during training and testing. By contrast, in the proposed method, the feature extraction and description generation phases are seamlessly connected so that each network can be smoothly finetuned. The proposed method achieved BLEU-4, METEOR, and CIDEr scores that were 2.3, 0.7, and 6.5 points higher than those of STAT_V.
From this result, it can be confirmed that the proposed structure is effective for extracting information from a given video–description pair. The proposed method can be trained in a fully end-to-end manner, and we did not apply any computer vision or natural language preprocessing or any additional loss function.