Semantic similarity is used to identify concepts that share a common 'feature' [23]. Semantic-similarity matching is essentially a measure of similarity between text data. The purpose is to capture the strength of semantic interactions between semantic elements (e.g., words, concepts) according to their meaning. It has many application scenarios, such as question answering (QA), automated customer service, search engines, semantic understanding, and automated marking. To address this problem, various semantic-similarity methods have been proposed over the years, ranging from traditional NLP techniques, such as string-based approaches, to the latest research in artificial intelligence. According to their underlying principles, they are categorised into string-based methods [24], knowledge-based methods and corpus-based methods [25], which are summarised in Table 1.
2.2.1. String-Based Method
In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) measures the distance between two text strings for approximate string matching, comparison, and fuzzy string searching. A requirement for a string metric (in contrast to string matching) is the fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close [26]. A string metric returns a number that gives an algorithm-specific indication of the distance between two strings.
Table 1. Semantic-similarity methods.
| Types | Methods | Reference | Features |
|---|---|---|---|
| String-based | Edit distance | [27] | Based on features of glyphs, without semantics. |
| String-based | Jaccard | [28] | Similarities and differences between finite sample sets, without semantics. |
| Knowledge-based | WordNet | [29] | Structured word ontology, but too few words. |
| Knowledge-based | Wikipedia | [30] | A rich and frequently updated corpus, but requires networking and is time-consuming. |
| Corpus-based | Word2vec | [31] | Fast and generalized, considering the context, but cannot solve polysemy. |
| Corpus-based | GloVe | [32] | Fast and generalized, considering the context and global corpus, but cannot solve polysemy. |
| Corpus-based | fastText | [33] | Fast; represents rare words and out-of-lexicon words by n-grams. |
| Corpus-based | BERT | [15] | Effectively extracts contextual information, but unsuitable for semantic-similarity search. |
| Corpus-based | Sentence-BERT | [16] | Fast and supports over 100 languages. |
Edit distance: This method [27] is a quantitative measure of the degree of difference between two strings. Given two strings a and b, the measure counts how many operations are needed to change string a into string b, where the allowed operations are insertion, deletion and substitution. Edit distance has applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from the dictionary that have a low distance to that word.
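To make the operation-counting idea concrete, the following is a minimal sketch of the classical dynamic-programming formulation of edit distance; the function name and example strings are illustrative, not taken from the cited work.

```python
# A minimal sketch of Levenshtein edit distance using dynamic programming.
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("Sam", "Samuel"))  # 3: three insertions
```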
Jaccard: The Jaccard index [28] is a classical measure of set similarity with many practical applications in information retrieval, data mining, and machine learning [34]. The Jaccard distance, which measures the dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1 or, equivalently, by dividing the difference between the sizes of the union and the intersection of the two sets by the size of the union [35].
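As a minimal sketch of these definitions, the snippet below computes the Jaccard index and distance over word sets; the whitespace tokenisation and the example sentences are illustrative assumptions.

```python
# A minimal sketch of the Jaccard index and Jaccard distance on word sets.
def jaccard_index(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # two empty sets are conventionally treated as identical
    return len(a & b) / len(a | b)  # |intersection| / |union|

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - jaccard_index(a, b)

s1 = set("the cat sat on the mat".split())
s2 = set("the cat lay on the rug".split())
print(jaccard_index(s1, s2))     # ~0.43 (3 shared words out of 7 distinct)
print(jaccard_distance(s1, s2))  # ~0.57
```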
2.2.2. Knowledge-Based Method
In many applications dealing with textual data, such as natural language processing, knowledge acquisition and information retrieval, the estimation of semantic similarity between words is of great importance. Semantic-similarity measures make use of knowledge sources as a basis for estimation. The knowledge-based semantic-similarity approach calculates the semantic similarity between two terms based on information obtained from one or more underlying knowledge sources (e.g., ontology/lexical databases, thesauri, and dictionaries, etc.). The underlying knowledge base provides a structured representation of terms or concepts connected by semantic relations for these methods, further providing a semantic measure that is free of ambiguity, as the actual meaning of the terms is taken into account.
WordNet [29] is characterized by wide coverage of the English lexical–semantic network, organizing lexical information according to word meaning rather than word form. Nouns, verbs, adjectives and adverbs are each organized into networks of synonym sets (synsets); each synset represents a basic semantic concept, and the sets are connected by various relations. A polysemous word appears in one synset for each of its meanings. A synset can be seen as the central unit of the semantic relationship between word forms. Given a synset, the WordNet network can be traversed to find synsets with related meanings. Each synset is connected to a root by one or more hypernym (superordinate) paths, and two synsets connected to the same root may share some hypernyms. If two synsets share a specific hypernym low in the hypernym hierarchy, they must be closely related. Therefore, the semantic similarity between words can be computed on the basis of WordNet as an ontology library.
In general, WordNet is a well-structured knowledge base that includes not only general dictionary functions but also word-classification information. Similarity calculation methods based on WordNet have therefore been proposed [36], such as computing the shortest path between two words: the relative position of each word in WordNet with respect to their closest common ancestor gives the shortest distance between them, and this distance is used to quantify their similarity. However, its disadvantages cannot be ignored: WordNet contains only a limited number of words, so domain-specific words cannot be recognized, and it cannot reflect the meaning of words in context.
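A minimal sketch of this path-based idea, using NLTK's WordNet interface, is shown below; taking only the first (most common) sense of each word is a simplifying assumption, and a real matcher might compare all sense pairs and keep the maximum score.

```python
# A minimal sketch of WordNet path-based similarity with NLTK
# (requires `nltk.download('wordnet')` on first use).
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]  # first sense: dog.n.01
cat = wn.synsets("cat")[0]
car = wn.synsets("car")[0]

# path_similarity is based on the shortest path between synsets in the
# hypernym hierarchy; 1.0 means identical concepts.
print(dog.path_similarity(cat))  # closely related concepts score higher (about 0.2)
print(dog.path_similarity(car))  # weakly related concepts score lower (well below 0.2)
```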
Wikipedia is a large-scale knowledge resource built by Internet users who contribute freely and collaboratively, creating a very practical ontology repository that is completely open. The basic unit of information in Wikipedia is an article: each article describes a single concept, and each concept has its own article. Since each article focuses on a single issue and discusses it in detail, each Wikipedia article describes a complete entity.
Since articles contain, for example, titles, tables of contents, categories, text summaries, sections, citations, and hyperlinks, these can be considered as features of the concept. Therefore, it is natural to think of using Wikipedia’s conceptual features to measure the similarity between words. Since the titles of Wikipedia articles are concise phrases, similar to the terms in traditional thesauri, we can also think of the title as the name of the concept. To calculate the similarity value between two concepts, we can select some features to represent the concept. For example, the four parts of the Wikipedia concept—synonyms, glosses, anchors, and categories—can be considered as features representing the Wikipedia concept.
Jiang et al. [30] propose a feature-based approach that relies entirely on Wikipedia, which provides a very large domain-independent encyclopedic repository and semantic network for computing the semantic similarity of concepts with broader coverage than the usual ontologies. To implement feature-based similarity assessment using Wikipedia, they first present a formal representation of Wikipedia concepts. They then give a framework for feature similarity based on this formal representation. Finally, they investigate several feature-based semantic-similarity measures that emerge as instances of this framework and evaluate them. In general, several of their proposed methods show good agreement with human judgement and constitute effective methods for determining similarities between Wikipedia concepts.
Knowledge-based systems are highly dependent on the underlying resources, resulting in the need for frequent updates, which require time and significant computational resources. While powerful ontologies such as WordNet exist in English, similar resources are not available in other languages, which necessitates the creation of robust, structured knowledge bases to enable knowledge-based approaches in different languages as well as in different domains.
2.2.3. Corpus-Based Method
Natural language is a complex system for expressing the thoughts of the human brain, and in this system words are the basic units of meaning. The technique of mapping words to real-valued vectors is known as word embedding. Word embeddings use the distributional hypothesis to construct vectors and rely on information retrieved from large corpora; thus, word embeddings are part of the corpus-based approach to semantic similarity. The distributional hypothesis holds that words with similar meanings tend to occur in similar contexts: it examines the distribution of a word throughout a text and compares it with the distributions of words with similar or related meanings, and its basic principle can be summarised as 'similar words frequently occur together'. In recent years, word embeddings have gradually become essential knowledge for natural language processing. Word embeddings provide a vector representation of word meaning while preserving the underlying linguistic relationships between words. Methods for word embedding include artificial neural networks [31], dimensionality reduction on word co-occurrence matrices, and explicit representation of the contexts in which words occur.
Word2vec takes a text corpus as input, first constructs a vocabulary from the training text data, and then learns a vector representation of the words. It maps each word to a vector of fixed length, which expresses the similarities and contrasts between words well. Tomas Mikolov et al. [31] propose two model structures, the CBOW and Skip-gram models, for computing continuous vector representations of words from very large datasets. The quality of these representations was measured on word-similarity and syntactic-similarity tasks, and the results were compared with the previously best-performing techniques based on different types of neural networks. They show that very accurate high-dimensional word vectors can be computed from much larger datasets at much lower computational cost using very simple model architectures.
In summary, Word2vec can be trained through two different structures to obtain a weight matrix, and this weight matrix is the word-vector dictionary we ultimately want: each word is represented by its corresponding word vector. Therefore, the similarity between different words can be calculated from their word vectors. In the Google Word2vec model used in this matcher, the word-vector table consists mainly of phrases and lowercase words, and uppercase words are not recognized; the words to be compared therefore need to be preprocessed when using this model. Moreover, since this method is essentially a word-vector dictionary trained on a corpus, special abbreviations and words that are not in the dictionary have no word vector and cannot be used in the lexical-similarity calculation. In addition, since the word-to-vector relationship is one-to-one, the method cannot handle words with multiple meanings, such as "bank".
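The sketch below illustrates this word-vector lookup with pretrained Word2vec vectors loaded through gensim; the file name GoogleNews-vectors-negative300.bin refers to the publicly released Google News model, and the lowercasing step and helper function are illustrative assumptions rather than the matcher's exact implementation.

```python
# A minimal sketch of word-level similarity with pretrained Word2vec vectors via gensim.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def word_similarity(w1: str, w2: str) -> float:
    # Simple preprocessing: lowercase the inputs before lookup.
    w1, w2 = w1.lower(), w2.lower()
    if w1 not in wv or w2 not in wv:
        return 0.0  # out-of-vocabulary words have no vector in Word2vec
    return float(wv.similarity(w1, w2))  # cosine similarity of the two word vectors

print(word_similarity("car", "automobile"))
print(word_similarity("bank", "river"))  # one vector per word: polysemy is not resolved
```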
GloVe: Jeffrey Pennington et al. [32] constructed a new global log-linear regression model, which they call GloVe, for unsupervised lexical-representation learning; it outperforms other models on lexical-analogy, lexical-similarity and named-entity-recognition tasks because the statistics of the global corpus are captured directly by the model. The model combines the strengths of two major families of models: global matrix-factorization methods and local context-window methods. It makes effective use of statistical information by training only on the non-zero elements of the word–word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus, and it produces a vector space with meaningful substructure. In addition, it outperforms related models on similarity tasks and named-entity recognition.
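Since GloVe vectors are usually distributed as a plain-text file of one word and its vector per line, a minimal sketch of loading them and comparing two words by cosine similarity is given below; the file name glove.6B.300d.txt refers to the commonly distributed Stanford vectors and, like the example word pairs, is an illustrative assumption.

```python
# A minimal sketch of loading pretrained GloVe vectors and computing cosine similarity.
import numpy as np

def load_glove(path: str) -> dict:
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.300d.txt")
print(cosine(glove["ice"], glove["steam"]))    # related words score higher
print(cosine(glove["ice"], glove["fashion"]))  # unrelated words score lower
```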
Fasttext: In linguistics, morphology studies word formation and lexical relationships; however, Word2vec and GloVe do not exploit the internal structure of words. The fastText [33] model proposes a subword-embedding method that represents a word by the character n-grams (subwords of n characters) it contains. Using n-grams has several advantages. It can generate better word vectors for rare words: even if a word appears very few times, the character n-grams that make up the word are shared with other words, which improves the generated word vector. For out-of-vocabulary words, word vectors can still be constructed from their character-level n-grams even if the words never appear in the training corpus. In addition, n-grams allow the model to learn partial information about local word order.
Neither Word2vec nor GloVe can provide word vectors for words that are not in the dictionary. Compared to them, fastText has the following advantages. First, it works better for word vectors of low-frequency words, because their n-grams can be shared with other words. Secondly, word vectors can still be constructed for words outside the training lexicon by superimposing their character-level n-gram vectors. Thus, when fastText is used for lexical-similarity computation, an important feature is its ability to generate word vectors for any word, even unseen words, concatenated or compound words, and specialized domain abbreviations. This is mainly because fastText builds word vectors from the character substrings contained in words, which allows it to generate word vectors even for misspelled or concatenated words. In addition, fastText is also faster than other methods, which makes it more suitable for computation on small datasets.
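The out-of-vocabulary behaviour can be illustrated with gensim's FastText implementation as in the sketch below; the toy corpus, hyperparameters and the query word "medications" are illustrative assumptions, not a realistic training setup.

```python
# A minimal sketch of fastText subword embeddings with gensim.
from gensim.models import FastText

sentences = [
    ["the", "patient", "received", "medication"],
    ["the", "doctor", "prescribed", "medication"],
    ["the", "nurse", "recorded", "the", "dosage"],
]

# min_n/max_n control the character n-gram lengths used to build word vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# "medications" never appears in the corpus, but a vector can still be assembled
# from its character n-grams, which Word2vec and GloVe cannot do.
print(model.wv.similarity("medication", "medications"))
```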
BERT [15] (Bidirectional Encoder Representations from Transformers) is a pre-trained model that has achieved excellent results on a variety of NLP benchmarks. The network architecture of BERT uses the encoder stack of the multilayer Transformer proposed in "Attention Is All You Need", and the overall framework of BERT consists of two phases: pre-training and fine-tuning. In contrast to Word2vec or GloVe, BERT produces context-dependent representations. For example, the word "bank" has different meanings in different contexts, but a static embedding such as Word2vec provides the same embedding for "bank" in all of them, whereas BERT derives the meaning of a word from the sentence context and thus avoids this ambiguity.
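The context dependence can be seen in the sketch below, which extracts the BERT vector of "bank" in two different sentences using the Hugging Face transformers library; the checkpoint bert-base-uncased, the use of the last hidden state, and the example sentences are conventional illustrative choices, not prescribed by the cited work.

```python
# A minimal sketch of contextual embeddings: the two occurrences of "bank"
# receive different vectors depending on the surrounding sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]  # vector of the target token

v1 = embedding_of("she sat by the river bank", "bank")
v2 = embedding_of("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1: the contexts differ
```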
Sentence-BERT: Owing to the excellent performance of the BERT model, many scholars have since conducted research based on BERT. Sentence-BERT is also based on the BERT model and extends its range of application.
Nils Reimers and Iryna Gurevych [16] proposed Sentence-BERT (SBERT), a modification of the BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings. This allows BERT to be used for certain new tasks. The framework can compute sentence/text embeddings in over 100 languages, and these embeddings can then be compared, for example with cosine similarity, to find sentences with similar meanings. This is useful for semantic textual similarity, semantic search and paraphrase mining. The model is trained on all available training data (over 1 billion training pairs) and is designed as a general-purpose model; it is not only fast but also maintains high quality.
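A minimal sketch of this workflow with the sentence-transformers library is given below; the checkpoint name all-MiniLM-L6-v2 and the example sentences are illustrative choices rather than the specific model used in the cited work.

```python
# A minimal sketch of sentence similarity with Sentence-BERT.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "What are the steps to recover my account login?",
    "The weather is nice today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the first sentence and the other two.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the paraphrase scores much higher than the unrelated sentence
```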