Incorporating Concreteness in Multi-Modal Language Models with Curriculum Learning
Round 1
Reviewer 1 Report
The contribution of this work is two-fold: First, a new dataset, created from Wikimedia Commons, is introduced, which contains approximately 3.2M images with 630k captions, 1.96M descriptions, and concreteness labels. Second, a new training scheme for multi-modal pre-training is introduced.
This paper is well written. There are many tables with the results of the models. The paper should be accepted with minor corrections.
- Can you compare the Wikimedia Commons dataset with other datasets?
- Can you compare the performance of other models on the Wikimedia Commons dataset?
Author Response
Reviewer's comments:
The contribution of this work is two-fold: First, a new dataset, created from Wikimedia Commons, is introduced, which contains approximately 3.2M images with 630k captions, 1.96M descriptions, and concreteness labels. Second, a new training scheme for multi-modal pre-training is introduced.
This paper is well written. There are many tables with the results of the models. The paper should be accepted with minor corrections.
Can you compare the Wikimedia Commons dataset with other datasets?
Response 1: We thank the reviewer for raising this issue. We added a new table (p. 8) to the article comparing our proposed dataset with various other multi-modal datasets. We also added a discussion paragraph (p. 8) highlighting the differences between the datasets mentioned above.
Can you compare the performance of other models on the Wikimedia Commons dataset?
Response 2: Running the other multi-modal models on Wikimedia Commons would require substantial compute time because of the size of the dataset and of the models (the number of learnable parameters). Since we had only a few days to complete this revision, these experiments are not yet ready. However, regarding the concern about the informativeness of the proposed dataset in the context of multi-modal datasets, Wikimedia texts are known to be rich: the dataset contains both captions and descriptions, and in multi-modal training captions are preferable because they are more topical and more relevant to the images than other texts.
Reviewer 2 Report
The focus of this work is a language model that takes into account some basic aspects of human language acquisition, namely the role of concrete experience: it is widely accepted that children first acquire words related to concrete objects (see Tomasello's "pointing phase").
Thus, the positive merit of this work is that the authors present a new dataset of 3.2M images with related captions and descriptions, as well as concreteness labels.
However, in the conclusion, the explanation for the lower accuracy of their model compared with, e.g., ViLBERT (54.13 vs. 70.92) is not clear.
In addition, the initial part of the manuscript is too generic: compressing 35 years of neural network history into a few lines is never a good idea, and, in my opinion, it is unnecessary for your goal. For example, I suggest removing lines 1-6 of the ABSTRACT, as well as lines 23-53 of the INTRODUCTION.
As a clarification, in the Abstract (line 3), "characters" as an example of a semantic unit is not correct. In line 26, "and so on" is not correct, since the only additional meaningful unit could be "chunks".
To summarise my comments, I suggest re-focusing the manuscript on the state-of-the-art models (BERT and BERT-derived) by highlighting the similarities and specificities of your approach (the concrete/abstract classifier).
Author Response
The focus of this work is a language model that takes into account some basic aspects of human language acquisition, namely the role of concrete experience: it is widely accepted that children first acquire words related to concrete objects (see Tomasello's "pointing phase").
Thus, the positive merit of this work is that the authors present a new dataset of 3.2M images with related captions and descriptions, as well as concreteness labels.
However, in the conclusion, the explanation for the lower accuracy of their model compared with, e.g., ViLBERT (54.13 vs. 70.92) is not clear.
Response 1: We agree with the reviewer on this comment, since any work should discuss the possible shortcomings of the proposed model; this discussion was missing from our paper. Therefore, we added a new paragraph to the Experimental Results section (lines 561-579) discussing the possible reasons why the model underperforms the state-of-the-art models. You can also find the added part below:
"It should be noted that there are subtle but vital differences between our model and the VilBERT model. The main focus of VilBERT is to process text and image streams in parallel under the transformer architecture to encode their relationship in a pre-trained model to have optimized performance in downstream tasks. On the other hand, the main focus of this work is to optimize the model for the fusion of modalities and curriculum learning. Although our work is much similar to earlier multi-modal works in this regard, our model is a language pre-training model, not a task-specific architecture. The main difference in our work is to add curriculum learning methodology on top of the pre-trained models.
Other than the main focus described above, several reasons might lead to the performance discrepancy between the proposed model and the state-of-the-art models, such as VilBERT. First, the number of learnable parameters in VilBERT is much greater than the proposed model (600M vs. 170M). Second, VilBERT uses the Faster-RCNN model to match each word in the text with the corresponding image patch, while our model uses the Resnet-152 model on the entire image. One could argue that the better alignment provided by the faster-RCNN method might lead to better learning since the model also learns which part in the image a particular word corresponds to. Providing such an alignment could also benefit the proposed model for catching up with the performance of the state-of-the-art models."
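For illustration, a minimal sketch of the global feature extraction contrasted above (ResNet-152 applied to the whole image, without region proposals) could look as follows. This is only a sketch assuming a standard PyTorch/torchvision setup; the function and variable names are illustrative and it is not the exact code used in the paper:

```python
# Minimal sketch, assuming PyTorch/torchvision: one global feature vector per
# image from ResNet-152, as opposed to per-region features produced by an
# object detector such as Faster R-CNN. Names and preprocessing are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-152 with the classification head removed; the output of the final
# pooling layer serves as one 2048-dimensional feature vector for the image.
backbone = torch.nn.Sequential(
    *list(models.resnet152(pretrained=True).children())[:-1]
).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def global_image_feature(path):
    """Return a single 2048-d feature vector for the entire image."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature = backbone(image)         # shape: (1, 2048, 1, 1)
    return feature.flatten(1).squeeze(0)  # shape: (2048,)
```

A detector-based pipeline, in contrast, returns one feature vector per detected region, which can then be aligned with individual words; with a single global vector, such word-to-region alignment is not available to the model.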
In addition, the initial part of the manuscript is too generic: compressing 35 years of neural network history into a few lines is never a good idea, and, in my opinion, it is unnecessary for your goal. For example, I suggest removing lines 1-6 of the ABSTRACT, as well as lines 23-53 of the INTRODUCTION.
As a clarification, in the Abstract (line 3), "characters" as an example of a semantic unit is not correct. In line 26, "and so on" is not correct, since the only additional meaningful unit could be "chunks".
To summarise my comments, I suggest re-focusing the manuscript on the state-of-the-art models (BERT and BERT-derived) by highlighting the similarities and specificities of your approach (the concrete/abstract classifier).
Response 2: We agree that the Introduction section needs to be rewritten with a focus on multi-modal language pre-training. We removed lines 23-53 from the previous version of the manuscript as suggested and rewrote that part accordingly. To connect this part to the subsequent paragraphs, we also deleted the first clause of the sentence at line 53. You can find the new paragraphs below:
"After the success of contextual representations, language model pre-training and fine-tuning the model for downstream tasks have been common practices in NLP. The wide-spread adoption of BERT [1] led to several pre-trained language models that are described as BERT variants [2–5]. Putting BERT at the core, these models provide extensions with different viewpoints, cross-lingual, multi-task, multi-modal, world knowledge, to name a few. Among these models, Albert [3] targets efficiency by using weight sharing and decreasing memory consumption, RoBERTa [2] increases the amount of training data and times and removes the next sentence prediction objective, XLNet [4] uses permutation instead of masking to capture the bidirectional context and combines BERT with autoregressive language modeling, ERNIE [5] aims to exploit world knowledge by masking named entities and phrases rather than random words, and in its updated version [6], the pre-training task is organized as a multi-task objective to capture different relations such as lexical, syntactic, and semantic.
Earlier approaches to bridging vision and language relied on architectures with a visual feature extractor, a text encoder, a multi-modal fusion component, and a classification layer to perform a given multi-modal task, e.g., visual question answering. Robust pre-trained language models have caused a shift from this task-specific perspective to a task-agnostic one: multi-modal language model pre-training.
Multi-modality, especially with vision and language, has been implemented in some BERT variants [7–9] as well. VisualBERT [7] and VideoBERT [8] use similar transformer-based architectures. The former processes image captions together with image regions to discover implicit alignments between language and vision, while the latter works with spoken words paired with a series of images to learn a similar alignment. Distinctively, ViLBERT [9] has a two-stream transformer model, which processes vision and language separately but learns their relationships through co-attention between them. The primary motivation for combining vision and language in these models has been visual grounding: learning visual features under the guidance of textual descriptions. Beyond that, we can leverage visual and language features to mimic human language acquisition."
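For readers unfamiliar with the earlier task-specific design mentioned in the quoted paragraphs above, a rough sketch of such a pipeline could look as follows. This is a minimal illustration assuming PyTorch, with hypothetical dimensions and a simple multiplicative fusion; it does not correspond to any specific published model or to our model, and it is not part of the added manuscript text:

```python
# Rough sketch, assuming PyTorch: a task-specific vision-and-language model for
# visual question answering built from the four generic components named above
# (visual feature extractor, text encoder, fusion component, classification layer).
# All dimensions, the LSTM text encoder, and the multiplicative fusion are
# illustrative choices only.
import torch
import torch.nn as nn

class TaskSpecificVQA(nn.Module):
    def __init__(self, vocab_size=10000, img_dim=2048, txt_dim=512,
                 fused_dim=512, num_answers=1000):
        super().__init__()
        # Text encoder: word embeddings followed by an LSTM over the question.
        self.embed = nn.Embedding(vocab_size, 300)
        self.text_encoder = nn.LSTM(300, txt_dim, batch_first=True)
        # Visual features are assumed to come from a frozen CNN backbone.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.txt_proj = nn.Linear(txt_dim, fused_dim)
        # Classification layer over a fixed answer vocabulary.
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, image_feats, question_tokens):
        _, (hidden, _) = self.text_encoder(self.embed(question_tokens))
        txt = self.txt_proj(hidden[-1])            # (batch, fused_dim)
        img = self.img_proj(image_feats)           # (batch, fused_dim)
        fused = torch.tanh(img) * torch.tanh(txt)  # simple multiplicative fusion
        return self.classifier(fused)              # answer logits
```

Multi-modal pre-training replaces this per-task design with a single pre-trained model whose representations are fine-tuned for each downstream task.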
Response 2 (continued): We also removed lines 1-6 from the Abstract, as suggested.