1. Introduction
Similarity is the primary notion for expressing the relation between two objects, numerical or categorical, and yields a result that can be interpreted by humans and processed by machines. The concept of a similarity measure is widely used in many scientific and interdisciplinary fields such as decision making, cognitive science, natural language processing (NLP), recommender systems, and many other areas [1,2,3]. In NLP, a similarity metric is used to capture the semantic relation of words in a given context, showing to what extent the words express the same meaning.
Semantic similarity measures the strength of interactions between words and captures any direct or indirect semantic relations (Harispe et al. [4]). Textual semantic similarity is used to identify common characteristics between pairs such as word–word, word–document, document–document, or query–document. Semantic similarity is especially helpful for topic modeling analysis and semantic information retrieval systems, where it identifies the optimal match between two objects. Generally speaking, different approaches to semantic measurement should be used depending on the type of data, the task formulation, and the algorithm in which it is applied [5,6]. Given words represented in vector form, one can use a distance measure to decide whether their numerical representations are similar based on the decision rule: "the words are semantically similar if the distance between them is less than some threshold".
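As a minimal sketch of this decision rule (the vectors and the threshold below are purely illustrative toy values, not taken from any trained model):

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; in practice these would come
# from a trained embedding model (word2vec, GloVe, BERT, ...).
vectors = {
    "car":  np.array([0.8, 0.1, 0.3, 0.5]),
    "auto": np.array([0.7, 0.2, 0.4, 0.5]),
    "fish": np.array([0.1, 0.9, 0.8, 0.0]),
}

def semantically_similar(w1, w2, threshold=0.5):
    """Decision rule: the words are similar if the Euclidean distance
    between their vectors is below the chosen threshold."""
    dist = np.linalg.norm(vectors[w1] - vectors[w2])
    return dist < threshold

print(semantically_similar("car", "auto"))  # True for these toy vectors
print(semantically_similar("car", "fish"))  # False for these toy vectors
```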
The success of any similarity metric relies on the vector representation of words. Zhang et al. [7] show that bag of words (BOW) counts the frequency with which a word appears in a sentence or document without considering word order, syntax, or semantics, where the dimension of the vector space is the number of unique words in the corpus. Term frequency–inverse document frequency (TF-IDF) works in the same manner, but instead of giving more importance to common words, it reduces a word's weight the more often the word appears in other documents, as stated in Ramos et al. [8]. The two word2vec methods proposed in Mikolov et al. [9], the continuous bag-of-words and skip-gram models, take the weights of a shallow neural network, trained on the proximity probability of a word to its surrounding context, as the vector of each word, with dimension equal to a predefined number of neurons. To address the issue of uninterpretable word vectors, Pennington et al. [10] introduced global vectors for word representation (GloVe), which is based on statistics of word co-occurrences in the corpus and generates numerical representations of words regardless of their location in the sentence or possible homonym pairs. In 2019, Devlin et al. [11] introduced bidirectional encoder representations from transformers (BERT), a context-dependent embedding in which the same word apple has different vectors in juicy apple and apple juice. These vectors are coordinates of words in a vector space, and by calculating the distance between them, the similarity of words can be found.
Kusner et al. [12] proposed a new metric, denoted word mover's distance (WMD), which measures the distance between texts in a semantic way based on the word2vec model. The intuition behind WMD is that it minimizes the cumulative traveling distance between the sets of words in two documents, D and T.
However, WMD relies on minimizing the total traveling distance between all of the documents' words, including low-relevancy ones. This limits WMD to certain embedding types and lowers its relevance for many NLP tasks. In [13], the author stresses the need to incorporate word importance, with respect to words' syntactic level of connectivity, into WMD.
In our work, we develop such a text similarity metric, denoted greedy texts similarity mapping (GTSM), which combines the semantics of word embeddings with word weights and assembles them, via a meta-heuristic, into a practical estimator. The estimator can therefore be used with any preferred vectorization model, word weighting, and distance metric. We construct a heuristic that is simple, has a small number of parameters, and can still capture the semantic features of words.
The benchmark algorithms used to compare the performance of GTSM are WMD; sentence mover's similarity (SMS), which is identical to WMD but with averaged sentence embeddings, introduced by Clark et al. [14]; and the cosine similarity metric. The choice of WMD is justified by its low retrieval error rate, as stated by Kusner et al. [12], and by its error rate in the k-NN document classification task, where it outperformed other similarity measurements such as BOW, TF-IDF, latent semantic indexing, and latent Dirichlet allocation. The full list and descriptions can be found in Kusner et al. [12].
2. Related Work
In 2015, Kusner et al. [12] published a paper on WMD that excited the NLP community with its impressive results on similarity metrics. The main engine on which WMD is built is the earth mover's distance (EMD), also known as the Wasserstein distance, which minimizes the total work needed to transform one probability distribution into another. EMD has mainly been applied in areas such as content-based image retrieval (Rubner et al. [15]), histogram comparison (Ling and Okada [16]), phishing web page detection (Fu et al. [17]), and recently vector-based text similarity (Kusner et al. [12]).
Kusner et al. [12] assign each word $i$ in document $D$ the weight $d_i = c_i / \sum_{j=1}^{n} c_j$, where $c_i$ is the number of times word $i$ appears in $D$ and $\sum_{j=1}^{n} c_j$ is the word count of document $D$.
The semantic similarity for the WMD metric between a pair of words is captured through the Euclidean distance, denoted as the "travel cost" between word $i$ and word $j$:
$$c(i,j) = \lVert x_i - x_j \rVert_2,$$
where $x_i$ and $x_j$ are the embedding vectors of words $i$ and $j$, and $\lVert \cdot \rVert_2$ is the $L_2$ norm in Euclidean space.
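These two ingredients, the normalized word weights and the pairwise travel costs, can be computed with a few lines of numpy and scipy; in the sketch below the vocabulary, helper names, and random vectors (standing in for word2vec embeddings) are purely illustrative:

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import cdist

def nbow_weights(tokens, vocab):
    """Normalized bag-of-words weights d_i = c_i / sum_j c_j."""
    counts = Counter(tokens)
    c = np.array([counts[w] for w in vocab], dtype=float)
    return c / c.sum()

# Hypothetical embeddings: one row per vocabulary word (dimension 3 here).
vocab = ["obama", "president", "press", "chef"]
X = np.random.rand(len(vocab), 3)          # stand-in for word2vec vectors

d = nbow_weights(["obama", "press", "press"], vocab)
# Travel cost c(i, j) = ||x_i - x_j||_2 between every pair of vocabulary words.
C = cdist(X, X, metric="euclidean")
print(d, C.shape)
```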
The transportation problem below states that each word in document $D$ finds the closest words in document $T$, and the distances aggregate into the total distance that $D$ needs to travel to exactly match $T$:
$$\min_{F \geq 0} \sum_{i,j=1}^{n} F_{ij}\, c(i,j)$$
subject to
$$\sum_{j=1}^{n} F_{ij} = d_i \quad \forall i \in \{1,\dots,n\}, \qquad (3)$$
$$\sum_{i=1}^{n} F_{ij} = d'_j \quad \forall j \in \{1,\dots,n\}, \qquad (4)$$
where $F_{ij}$ denotes what 'amount' of word $i$ in document $D$ transfers to word $j$ in document $T$, $d$ and $d'$ are the weight vectors of $D$ and $T$, respectively, and $n$ is the size of the vocabulary. Equations (3) and (4) ensure that the total amount transferred from word $i$ in $D$ to words in $T$ is $d_i$ and that the total amount of words in $D$ transferred into word $j$ of $T$ is $d'_j$.
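For illustration, this transportation problem can be handed to a generic linear-programming solver; the sketch below uses scipy.optimize.linprog on toy data (the function name and random embeddings are ours, and dedicated optimal-transport libraries such as POT are far faster in practice):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wmd(XD, d, XT, t):
    """Solve the WMD transportation LP with a generic LP solver.
    XD, XT: embedding matrices of the two documents' words;
    d, t:   their normalized word weights (each sums to 1)."""
    nD, nT = len(d), len(t)
    C = cdist(XD, XT, metric="euclidean").ravel()   # travel costs c(i, j)

    # Row constraints: sum_j F_ij = d_i (all mass of word i leaves D).
    A_rows = np.zeros((nD, nD * nT))
    for i in range(nD):
        A_rows[i, i * nT:(i + 1) * nT] = 1.0
    # Column constraints: sum_i F_ij = t_j (word j in T receives its mass).
    A_cols = np.zeros((nT, nD * nT))
    for j in range(nT):
        A_cols[j, j::nT] = 1.0

    # One equality constraint is redundant (both weight vectors sum to 1);
    # the HiGHS solver tolerates this.
    res = linprog(C, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([d, t]), bounds=(0, None),
                  method="highs")
    return res.fun

# Toy example with random embeddings standing in for word2vec vectors.
rng = np.random.default_rng(0)
XD, XT = rng.random((3, 5)), rng.random((4, 5))
print(wmd(XD, np.full(3, 1 / 3), XT, np.full(4, 1 / 4)))
```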
Although the results of WMD are very promising, the method is computationally inefficient, with average time complexity $O(p^3 \log p)$, where $p$ is the number of unique words in the documents. As Kusner et al. [12] state, it is "prohibitive" to use this method for high-dimensional datasets. As an alternative, to speed up the process, they also proposed the word centroid distance and the relaxed WMD, which trade speed against an increase in error rate.
The sentence mover's similarity (SMS) introduced by Clark et al. [14] circumvents the complexity issue by averaging word embeddings so that the similarity evaluation is performed only at the sentence level, which significantly reduces computation time. That is, sentences are to documents as words are to sentences. The weight of each sentence embedding is its word count normalized by the number of words in the document, $d_i = |i| / \sum_{k \in D} |k|$, where in this case $i$ is a sentence in document $D$ and $|i|$ is the word count of the corresponding sentence. As suggested in Kilickaya et al. [18], we transform WMD into the word mover's similarity (WMS) as $\mathrm{WMS} = \exp(-\mathrm{WMD})$.
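A rough sketch of these last two points is given below; the helper names are ours, and word_vectors is assumed to be a dictionary mapping tokens to numpy arrays:

```python
import numpy as np

def word_mover_similarity(wmd_value):
    """Turn a mover distance into a similarity score, WMS = exp(-WMD)."""
    return np.exp(-wmd_value)

def sentence_embeddings(sentences, word_vectors):
    """Average word vectors within each sentence; each sentence's weight is
    its word count divided by the document's total word count."""
    total = sum(len(s) for s in sentences)
    embs = np.stack([np.mean([word_vectors[w] for w in s], axis=0)
                     for s in sentences])
    weights = np.array([len(s) / total for s in sentences])
    return embs, weights
```

The sentence embeddings and weights can then be fed into the same transportation problem as above, only with sentences playing the role of words.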
5. Conclusions
We propose a text similarity measure, GTSM, that can be applied to various tasks and corpus sizes using any preferred vectorization method. The proposed algorithm aims to capture the essential word relations between two texts and evaluate their similarity according to the weights of words. The attractiveness of this algorithm lies in its simplicity of implementation, its flexibility of tuning for specific data, its relatively cheap computation (the average time complexity depends on the number of unique words in the documents), its easy parallelization, and its acceptable F1 score. It is also self-sufficient and does not need additional resources or computationally expensive steps (such as the preliminary training of word2vec or BERT), so it is not bound to languages with pre-trained vector models and can handle low-resource languages. We showed that a co-occurrence matrix can be a good alternative to other vectorization models such as BERT. A further advantage of the presented algorithm is that it relies only on vectorized word representations; assuming such representations exist, there are no further concerns regarding the linguistic properties of the analyzed dataset.
The algorithm provides a unified way of using word embeddings and word weightings to find a relation between two bags of words. It is worth noting that the method is flexible enough to be used as a feature-generating mechanism, e.g., in graph analysis. Nevertheless, the feature representation and the number of classes strongly impact the classification quality, as seen on the RECIPE dataset.
The GTSM algorithm has plenty of room for further extensions, e.g., regarding the selection of more advanced heuristics for computing "ideal" relations or the application of alternative vectorizations and weightings (as discussed in Ibrahim and Landa-Silva [31] for TF-IDF) in order to obtain more efficient training and lower error rates.