1. Introduction
The category of concreteness/abstractness (C/A) has been the focus of cognitive research for decades. The problem of representing concrete and abstract objects in the human brain poses a serious challenge to all of cognitive science [1]. Concreteness/abstractness is one of the main organizational axes of the mental vocabulary [2].
The main approach to defining these concepts is as follows [3]. Concrete concepts are those that are perceived by the senses; examples of concrete words are ‘cupcake’ and ‘computer’. Abstract concepts are not perceived by the senses, for example, ‘soul’ or ‘trust’. Similar interpretations are found in many works. Thus, in [4], the following definition is given: ‘abstract nouns are those nouns whose denotata are not part of the concrete physical world and cannot be seen or touched’. A similar definition is offered in [5].
To support such studies, dictionaries with indices characterizing the degree of concreteness/abstractness of words are required. Such a dictionary is created by polling native speakers, who are asked to rate the concreteness/abstractness of given words. For English, the first major dictionary of this kind was created in 1981 [6]. It contains nearly 4000 words and is freely available from the MRC psycholinguistic database (https://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm (accessed on 4 May 2022)). Later, a dictionary of almost 40 thousand words was created [5]: each word received at least 25 ratings from respondents on a 5-point scale, which were averaged. Besides English, a comparable dictionary (30,000 words) has been created only for Dutch [7]. An apparent problem is the enormous effort required to create such dictionaries. For German, the dictionary [8] contains only 4000 words. A database with concreteness/abstractness ratings for 6000 Croatian words has recently been published [9]. The dictionary for Russian [10] currently contains 1000 words and is available at https://kpfu.ru/tehnologiya-sozdaniya-semanticheskih-elektronnyh.html (accessed on 4 May 2022). Similar dictionaries have been created for Italian, Chinese, and some other languages.
Two obvious tasks arise: (1) increasing the size of existing dictionaries and (2) creating dictionaries for other languages. Both can be tackled with modern computational methods by extrapolating from existing human ratings. The main idea in extrapolating ratings to previously unrated words is to use vector semantics of words built on a large corpus of texts and to derive new ratings from the semantic proximity of words in the constructed semantic space. This idea has been implemented in several works. To transfer ratings from one language to another, it is appropriate to create a single multilingual semantic space. To the best of our knowledge, this approach has been implemented in two works [11,12].
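The extrapolation idea can be illustrated with a minimal sketch: the rating of an unrated word is estimated as a similarity-weighted average of the ratings of its nearest neighbors in the semantic space. The vectors and ratings below are toy values invented for illustration; a real system would use embeddings built on a large corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def extrapolate(word_vec, rated, k=3):
    """Estimate a rating for an unrated word as the similarity-weighted
    mean of the ratings of its k nearest rated neighbors."""
    sims = sorted(((cosine(word_vec, v), r) for v, r in rated.values()),
                  reverse=True)[:k]
    total = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / total

# Toy 3-dimensional "embeddings" and 5-point human ratings (illustrative only).
rated = {
    "cupcake":  ([0.9, 0.1, 0.0], 4.9),
    "computer": ([0.8, 0.2, 0.1], 4.8),
    "soul":     ([0.1, 0.9, 0.2], 1.4),
    "trust":    ([0.0, 0.8, 0.3], 1.6),
}

# An unrated word whose vector lies near the concrete cluster
# should receive a high predicted rating.
print(round(extrapolate([0.85, 0.15, 0.05], rated, k=2), 2))
```

The same scheme underlies most corpus-based extrapolation methods; what varies is how the semantic space is built and how neighbor information is aggregated.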
In this article, we will apply technologies based on neural networks with deep learning to both of the above problems for the first time. The article is devoted to the following research tasks:
- (1) Improve the quality of extrapolating human concreteness/abstractness ratings by applying deep learning models.
- (2) Apply deep learning models to transfer ratings from one language to another and evaluate the quality of the ratings obtained in this way.
- (3) Evaluate the impact of dictionary size when dictionaries are used as a training set and during cross-lingual transfer.
- (4) Evaluate the possibility of using data from several languages simultaneously during cross-lingual transfer of concreteness/abstractness ratings.
The article is organized traditionally. Section 2 provides a literature review. Section 3 describes the data (the dictionaries used) and the methods (fine-tuning of several types of neural networks based on the transformer architecture [13]). Section 4 describes the results obtained: the extrapolation of ratings with quality assessment. Finally, Section 5 summarizes the research results and discusses plans for further work.
2. Related Works
Research on concreteness/abstractness is conducted broadly, from psychology and psycholinguistics to neurophysiology and medicine; a recent overview can be found in [14]. In neurophysiology, the localization of the concepts of concreteness/abstractness has been studied extensively. Many experiments using neuroimaging techniques have shown that concrete and abstract words are represented in different neuroanatomical structures of the brain.
In psychological research, the so-called ‘concreteness effect’ has been established, demonstrating the greater ease of processing concrete words in the human mind. Concrete words are better remembered [15], better recognized [16], read faster [17], and learned faster [18]. Their dictionary definitions are easier to write and more detailed [19], and concrete words are easier to associate [20]. The concepts of concreteness and abstractness have long been subjects of study in linguistics. Recently, however, with the emergence of large text corpora and large lexical ontologies, fundamentally new research ideas and results have appeared. The most interesting are the following. In [21], it is shown that the degree of concreteness of words increases over time. In [22], it is shown that the density of the set of semantically close words is higher for concrete words than for abstract ones. In [23], it is observed that in text corpora, abstract words more often co-occur with abstract words and concrete ones with concrete ones. The work [24] compares the categories of concreteness and specificity.
To conduct psychological and neurophysiological experiments, lists of words with estimates of their degree of concreteness/abstractness are needed. Such lists are created by interviewing native speakers and by extrapolating the human estimates with machine methods. A significant number of works have been devoted to the development of machine extrapolation methods [2,25,26,27,28,29,30,31,32,33].
Early work was based on thesauri such as WordNet, with ratings propagated to synonyms. The next step was the use of vector semantics applicable to large corpora of texts: ratings for new words were derived from the semantic proximity of words in the constructed semantic space [25,28,30,31]. Thus, to create a computer dictionary in any language, a necessary condition is the existence of a large corpus of texts on which vector semantics can be built. Vector semantics has been constructed using various methods: Latent Semantic Analysis, the High Dimensional Explorer model [34], the skip-gram vector-embedding model [35], etc.
Evaluating the quality of machine dictionaries is also of fundamental importance. They are assessed by comparison with human dictionaries, computing the correlation coefficient between the two, most often Spearman's. By far the best result achieved so far is the machine dictionary for English in [36], which correlates with human ratings at 0.9. That dictionary was built using fastText to construct the semantic space and an SVM as a classifier; the extrapolation of human estimates was evaluated by cross-validation against a 40,000-word dictionary [5]. In [5], two human dictionaries were compared, and the correlation coefficient between them was found to be 0.919. It is natural to interpret this value as the maximum achievable by machine extrapolation.
In most papers, only one dictionary of a fixed size is considered. In [12], three English dictionaries of different sizes were taken, which makes it possible to reveal the dependence of extrapolation quality on dictionary size. For dictionaries of 22,797, 4061, and 3000 words, the correlation scores were 0.887, 0.872, and 0.848, respectively, which clearly indicates a dependence on the size of the dictionary: the larger the dictionary, the more accurate the extrapolation. At the same time, the quality of extrapolation increases significantly as the dictionary grows from 3 to 4 thousand words, whereas the much larger increase to 23 thousand words affects the result to a lesser extent.
All of the above work was carried out for a single language, in most cases English. Recently, however, [37] proposed a way to combine several languages in a single semantic space based on English. This is accomplished by using Google Translate to obtain English translations of the 10,000 most frequent lexical items in a given language.
The article [12] investigates the cross-lingual transfer of ratings for the Croatian–English language pair; the data are presented in the ‘Concreteness and imageability lexicon MEGA.HR-Crossling’, available at http://hdl.handle.net/11356/1187 (accessed on 4 May 2022) [38]. An SVM regression model and a deep feedforward neural network were applied, with almost identical results. The human dictionaries from [5,9] were used for English and Croatian, respectively. For the transfer of concreteness ratings from English to Croatian, the Spearman correlation coefficient between the transferred ratings and the human ones was 0.724; for transfer in the reverse direction, the correlation was 0.791. For other word properties commonly considered in psychological research, the correlation scores are lower; for example, for the imageability property, when transferring from Croatian to English, the Spearman coefficient was only 0.694 [12].
In [11], this approach was applied to the problem of transferring concreteness ratings from English to other languages, as a result of which concreteness/abstractness dictionaries were obtained for 77 languages. For one language, Dutch, the quality of the dictionary created in this way was assessed: the Spearman correlation coefficient was 0.76, which was interpreted as a very high result. Correlation coefficients for the other languages were not released. A similar approach, with automatic translation of words and transfer of ratings, has been applied to other word properties reflecting the emotional coloring of words: valence, arousal, and dominance. These ratings are very important in the task of sentiment analysis of texts, particularly on social networks [39].
Thus, it remains unclear how well the transfer of ratings from one language to another works and whether it can produce a good dictionary. This question is especially important for low-resource languages: high-quality cross-lingual transfer would make it possible to quickly obtain dictionaries with concreteness/abstractness ratings without conducting time-consuming and expensive surveys of respondents.
4. Results
4.1. Results of Single-Language BERT
The results for English (Table 1) allow us to compare them with the results of other studies. On the test set, Spearman's correlation coefficient equals 0.910 and Pearson's equals 0.920. Thus, the result obtained surpasses the best previous result, achieved in [36]. Taking the correlation coefficient of 0.919 (Spearman's) between two human dictionaries as the maximum possible, the result of [36] amounts to 97.9% of it, while ours amounts to 99%. In [36], on the same vocabulary [5], an SVM classifier was used in combination with fastText. Our result confirms the advantages of BERT over previous architectures. Note that without fine-tuning, the results are very low.
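The single-language fine-tuning setup can be sketched as follows: rating extrapolation is cast as single-output regression, with each word fed to the model as a short text and the model head predicting one real value. This is a sketch assuming the Hugging Face transformers and datasets APIs; the checkpoint name, output directory, and the split helper are illustrative rather than a verbatim reproduction of our training code (the batch size of 64 and 3 epochs mirror the settings reported for our experiments).

```python
def train_test_split(words, ratings, n_test=4000, seed=0):
    """Shuffle (word, rating) pairs and hold out n_test items for testing."""
    import random
    pairs = list(zip(words, ratings))
    random.Random(seed).shuffle(pairs)
    return pairs[n_test:], pairs[:n_test]

def fine_tune(train_pairs, model_name="bert-base-uncased"):
    """Fine-tune a BERT regression head on (word, rating) pairs.
    Requires the `transformers` and `datasets` packages; checkpoint
    name and output directory here are illustrative."""
    from datasets import Dataset, Value
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)
    tok = AutoTokenizer.from_pretrained(model_name)
    # num_labels=1 with problem_type="regression" gives an MSE objective.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1, problem_type="regression")
    ds = Dataset.from_dict(
        {"text": [w for w, _ in train_pairs],
         "label": [float(r) for _, r in train_pairs]}
    ).cast_column("label", Value("float32"))
    ds = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)
    args = TrainingArguments("ca-bert", per_device_train_batch_size=64,
                             num_train_epochs=3)
    Trainer(model=model, args=args, train_dataset=ds, tokenizer=tok).train()
    return tok, model
```

Swapping the checkpoint name for a multilingual one (e.g. a multilingual BERT or MiniLM checkpoint) turns the same recipe into the cross-lingual setup: fine-tune on the source-language dictionary and predict ratings for target-language words.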
We studied the dependence between the quality of extrapolation and the size of the training set.
Table 1 presents the correlation coefficients for the same test set of 4000 words and for different sizes of the training set.
Thus, we obtained the expected result: the quality of extrapolation grows with the size of the training set. This is in good agreement with the results of [12]. However, it should be noted that BERT trained on only 1000 words exceeded most of the previously published estimates obtained with significantly larger training sets; the result is only 5% worse than that obtained with a 35-thousand-word training set. For low-resource languages, where no such large dictionaries with human ratings are available, the result with a 1000-word dictionary should be assessed as quite acceptable.
In several works, including [11], it has been observed that machine ratings differ most from human ratings for words that received extreme ratings from respondents. Table 2 shows the words with the largest discrepancies between the machine and human C/A estimates.
4.2. Cross-Lingual Transfer of Concreteness Estimates
The presence of good large dictionaries for some languages raises the question of transferring ratings from one language to another. This is possible thanks to the existence of multilingual neural networks pre-trained on several languages simultaneously. In the first series of experiments, the dictionaries were used in their entirety. All models were trained with a batch size of 64 for 3 epochs. The best result was shown by the MS-MiniLM model.
Table 3 shows Spearman's correlation coefficients between the ratings transferred from the source language and their human values in the target language for three models; Pearson's coefficients are, on average, 0.01 lower. The best result achieved (for Russian) exceeds the result of [11]. For the multilingual BERT model, the results are, on average, almost the same, only slightly lower, but MS-MiniLM provides the best coefficients among all transfers: those for transferring from English and from German to Russian, 0.80 and 0.79, respectively. The results shown by the distilled multilingual BERT model are noticeably worse.
When considering Table 3, it is noteworthy that the results differ quite significantly across languages. The hardest rankings to obtain are those for Dutch and Croatian. Figure 2 shows a histogram of the word ratings for Dutch. Its shape is noticeably different from the corresponding histograms for English and Russian. Noteworthy is the non-smooth character of the histogram, with statistically unexplained peaks, which may be a consequence of incorrect post-processing of the empirical data, of numerical binning issues, or of anchoring to discrete levels.
Table 3 also shows a significant difference between the coefficients for transfer into Russian and from Russian (the average difference is 0.145). One possible explanation is that the Russian dictionary contains only the 1000 most frequent words (nouns) of the language, for which there were abundant data in the corpora on which the neural networks were pre-trained; such words are better represented in the models, and their extrapolated ratings are easier to predict. The dictionaries for the other languages, especially English, are much larger and contain more rare words, whose ratings are harder to predict. We therefore studied the dependence of the results on the size of the dictionaries.
4.3. Dependence of the Transfer Quality on the Size of Dictionaries
We are interested in how the sizes of the source and target dictionaries affect the quality of the transfer. To study this, we select the 1000 most frequent words from each language's dictionary. Since the Russian dictionary contains only nouns, nouns were chosen for the other languages as well; moreover, only two languages, English and Dutch, have dictionaries containing a sufficiently large number (over 1000) of adjectives or verbs. Thus, to make all languages comparable in terms of the word sets used for training and testing, we report results only for nouns. Our preliminary experiments with adjectives also showed lower results (correlations below 0.6).
Table 4 presents the correlation coefficients for 1000-word-to-1000-word transfer. Comparing Table 3 and Table 4 shows that all coefficients except one become higher when the complete dictionaries are replaced with 1000-word ones. At first glance, this result seems unnatural, but a similar result was obtained in [43] on the material of emotive vocabulary: machine extrapolation on a 1000-word dictionary gave a better correlation coefficient than on a larger one (14 thousand words), for both English and Spanish. A similar result was obtained in [44], also for emotive vocabulary. However, a direct comparison with these two articles is impossible, since in them the sizes of the training and test sets differ proportionally, whereas in our case the difference is not proportional. It therefore makes sense to look more closely at the influence of the sizes of the source and target dictionaries in our task, separating these two factors.
We compare four transfer options: from a 1000-word dictionary to a 1000-word dictionary (1000 -> 1000), from a complete dictionary to a 1000-word dictionary (all -> 1000), from a 1000-word dictionary to a complete dictionary (1000 -> all), and from the full dictionary to the full one (all -> all). As mentioned above, the best results are achieved by MS-MiniLM. The results are presented in Figure 3 and Figure 4; the y-axis shows the Spearman correlation coefficient. Numerical data are given in Appendix A (Table A1, Table A2, Table A3, Table A4 and Table A5).
We consider two separate cases: (A) the size of the source vocabulary is fixed while the target vocabulary changes, and (B) the size of the target vocabulary is fixed while the source is changed.
- A. The size of the source vocabulary is fixed while the target vocabulary changes. This is shown in Figure 3 and Figure 4 by moving from the first block of columns to the third and from the second to the fourth. The correlation coefficients in the third and fourth blocks are noticeably smaller than in the first and second, respectively. This is a natural result: increasing the number of words to which the ratings must be extrapolated leads to a deterioration in quality.
- B. The size of the target vocabulary is fixed while the source changes. This is shown in Figure 3 and Figure 4 by moving from the first block of columns to the second and from the third to the fourth. Increasing the training set from 1000 words to a complete dictionary does not lead to a noticeable improvement: the quality of the cross-lingual transfer remains practically at the same level, and that level can be assessed as sufficiently high. This result is somewhat unexpected: a smaller amount of training data should seemingly worsen the results, as it does for extrapolation within one language (Table 1). With cross-lingual transfer, however, the situation changes, and increasing the amount of training data does not improve the result. Apparently, all the information sufficient for a good transfer of ratings from one language to another is already contained in the 1000 most frequent words, and rarer words add no useful information. Perhaps this is due to the quality of the multilingual semantic space, in which the correspondence between languages is established well for frequent words and worse for rare ones. This issue, however, requires a separate additional study.
The data for the distilled multilingual BERT and multilingual BERT models on 1000-word dictionaries are presented in Appendix A (Table A6 and Table A7). As in the previous experiment, the results of these models turned out to be somewhat worse. For example, the MS-MiniLM model has five coefficients greater than or equal to 0.8, while distilled multilingual BERT has none and multilingual BERT has only one; MS-MiniLM has no coefficient below 0.6, while multilingual BERT has three such cases and distilled multilingual BERT has six.
4.4. Multilingual Transfer
Our next goal is to test the effect of mixing multiple languages in the training set. To compare with the previous results, we form a 1000-word sample by randomly taking 250 words from each of four languages' 1000 most frequent words. We then extrapolate the ratings to a 1000-word list (the same as in the previous experiment) and calculate the correlation coefficient. The results are shown in Table 5.
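The construction of the mixed training sample can be sketched as follows: 250 words are drawn at random from the 1000 most frequent rated words of each of the four source languages, yielding a 1000-word multilingual training set. The language labels, word names, and ratings below are placeholders, not our actual data.

```python
import random

def mixed_sample(freq_lists, per_lang=250, seed=0):
    """Draw per_lang random (word, rating) pairs from each language's
    list of its 1000 most frequent rated words, forming one mixed
    multilingual training set."""
    rng = random.Random(seed)
    sample = []
    for lang, pairs in freq_lists.items():
        sample.extend((lang, w, r) for w, r in rng.sample(pairs, per_lang))
    rng.shuffle(sample)
    return sample

# Placeholder data: each language maps to 1000 (word, rating) pairs.
freq_lists = {
    lang: [(f"{lang}_word{i}", 1 + (i % 5)) for i in range(1000)]
    for lang in ("en", "de", "nl", "hr")
}
train = mixed_sample(freq_lists)
print(len(train))  # 1000 = 250 words x 4 languages
```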
In three cases, the results of using the mixed data of four languages were higher than the arithmetic mean of the four single-language transfers; for one language, Dutch, the result was noticeably higher. For Croatian, the result remained practically unchanged, and for Russian, it even worsened somewhat.
The results presented in Table 5 do not allow us to conclude that using mixed data from different languages is advantageous. The improvement in the first three rows of Table 5 occurs in cases where the source data are taken from languages of different structures (different branches of the Indo-European family). It is possible that this factor is essential, but the limited data do not yet enable us to draw this conclusion. No improvement is observed (the last two rows of Table 5) in cases where most of the source data come from languages of a branch different from that of the target language.
As described above, the other neural network models produce worse results, with one exception: the multilingual BERT model, when applied to English, provided the highest correlation coefficient, 0.843.
5. Discussion and Conclusions
Various models and methods have been used to extrapolate human concreteness/abstractness ratings and obtain large dictionaries: LSA, GloVe, fastText, etc. In this work, the most modern models, BERT and MS-MiniLM, were used for the first time. With extrapolation within one language, we improved on all previous assessments of dictionary quality, reaching a Spearman correlation of 0.910 (for English) against an expected upper bound of 0.919. The BERT model can be recommended for the extrapolation of other ratings (valence, etc.) and in other languages. A prerequisite is a BERT model pre-trained in the given language, or a large corpus of texts on which such a model can be trained; a human-rated dictionary of around 1000 words is also required.
In the absence of a dictionary with human ratings for a language, an approach based on transferring ratings from a language where they already exist can be applied. Previously, this approach was used in combination with the automatic translation of words, thus creating a single semantic space for several languages; the best previously achieved result on this path is 0.76 [11]. The best score we obtained is 0.843, which is close to the scores in the single-language case. Comparison of neural networks of different architectures showed a noticeable advantage of MS-MiniLM.
In addition to the architecture of the neural networks, the result is influenced by many other factors analyzed in this article, first of all the sizes of the source- and target-language dictionaries. We have shown that increasing the training set is important in the single-language case but not essential for cross-lingual transfer: a 1000-word dictionary is sufficient for fine-tuning the models and obtaining a high-quality dictionary in the target language.
This article is the first to consider transfer from the mixed data of several languages to a target language. In several cases, this led to some improvement in the results; we hypothesize that linguistic diversity in the source data may be a favorable factor.
When considering different languages, the hypothesis that the typological proximity of languages matters suggests itself. However, it is not confirmed by our data: Russian and Croatian are Slavic languages, while the other three are Germanic, yet this linguistic affinity did not yield better results when transferring between Russian and Croatian than transfers involving the Germanic languages did. Thus, the main results are as follows:
- (1) We applied a deep learning model to the task of extrapolating human concreteness/abstractness ratings and achieved a result that exceeds those previously published.
- (2) A methodology for transferring ratings from one language to another is described, and estimates of the quality of the transferred ratings are obtained.
- (3) It has been established that dictionaries of 1000 words provide sufficiently high extrapolation quality when used as a training set; in cross-lingual transfer, they give almost the same quality as large dictionaries.
- (4) The simultaneous use of data from several languages does not provide a significant improvement in the results of cross-lingual transfer.
- (5) The listed results were obtained using the data of all languages for which sufficiently large dictionaries with human concreteness/abstractness ratings are publicly available.
- (6) Summing up, we note that the latest generation of neural networks makes it possible to obtain rating dictionaries of very high quality, providing a radical reduction in labor costs.
Using the described techniques for extrapolating ratings and transferring them across languages can be of practical importance for the rapid construction of dictionaries in low-resource languages. The significance and usability of the results obtained in this article lie in defining a method for the fast and inexpensive construction of a dictionary with concreteness/abstractness ratings of nouns. Our results allow us to claim that:
- to build such a dictionary in a new language, it suffices to apply cross-lingual transfer learning from an already existing dictionary in one of the languages (for example, English);
- for the transfer, it is advisable to use Microsoft's Multilingual-MiniLM-L12-H384 model, which showed better results than BERT and the other models;
- in the source language, it is advisable to use a dictionary of the 1000 most frequent words.