3.1. Text Similarity Calculation Based on Different Text Vector Representations
Among the various methods for calculating text similarity, the most commonly used is the vector space model (VSM), which represents each text by its term frequency-inverse document frequency (TF-IDF) weights and then calculates the cosine similarity between the two text vectors. However, this bag-of-words representation considers only the co-occurrence of words between texts and ignores the semantics of the words. This shortcoming is particularly significant in implicit citation detection: when citing, authors often generalize or paraphrase the content of the cited reference instead of repeating its phrases verbatim, so the citing and cited texts share far fewer words even though they remain semantically similar.
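As an illustration (not from the paper), a minimal TF-IDF vector space sketch makes this blindness to paraphrase concrete: two sentences that share no words get a cosine similarity of zero, however close their meanings. The tokenization and the smoothed-IDF formula used here are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a TF-IDF vector over the joint vocabulary.

    Uses a smoothed IDF (an illustrative choice, not the paper's formula).
    """
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    return [[Counter(d)[w] * idf[w] for w in vocab] for d in docs]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

original   = "the model improves retrieval accuracy".split()
paraphrase = "this approach boosts search precision".split()  # same idea, no shared words
near_copy  = "the model improves search accuracy".split()

vecs = tfidf_vectors([original, paraphrase, near_copy])
print(cosine(vecs[0], vecs[1]))  # 0.0: no co-occurring words, similarity vanishes
print(cosine(vecs[0], vecs[2]))  # > 0: surface word overlap is rewarded instead
```

The paraphrase, despite expressing the same idea, scores zero, which is exactly the failure mode that motivates the semantic models below.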
With the advancement of deep learning techniques, the CBOW and skip-gram word embedding (Word2Vec) models [24] made it possible to train word vectors that express the semantics of each word from a large unlabeled corpus. These semantic word vectors were a major breakthrough for computing semantic similarity between words. Building on word vectors, two document embedding (Doc2Vec) models, PV-DM and PV-DBOW, were proposed [25]. In these models, a document (or a sentence or paragraph) is added as a special vocabulary item to all of its local contexts, and the vector representation of the document is then derived with the word embedding model. A disadvantage of Doc2Vec, however, is that it does not accurately reflect the weight of each word in the document.
Overall, the vector space model and the deep neural network-based document vector representation model each have advantages and disadvantages. The vector space model accurately weights the words in a document but ignores the semantic relationships between words; the deep neural network model captures word semantics but ignores the weights of words in a document. To accurately compute the semantic similarity between citation candidates and documents, this paper explores document vector representation methods that combine the vector space model with the deep neural network model, and proposes two combinations. The first embeds TF-IDF weights into the deep neural network-based document vector representation model. The second uses deep neural network-based word vectors in the TF-IDF-based vector space model in place of the representation of individual words in the original model.
- (1)
Document vector representation model based on TF-IDF weights and word vectors.
In the traditional vector space model, a document is treated as a bag of words. For each word in the bag, a word vector can be trained from a large unlabeled corpus using a deep neural network-based word embedding method. In this paper, we use a linear weighted combination of these word vectors to predict the vector of the bag of words (i.e., the document). Which combination of weights most accurately reflects the actual vector of the bag of words, however, needs verification. Given the computational cost of such a validation experiment on a large text corpus and the difficulty of obtaining the actual vector representations of documents, a small polysemous-word corpus is used instead to investigate the relationship between different linear combinations of word vectors and the bag-of-words vectors.
Following the approach of Nahar et al. [26], this paper also treats a document as a special vocabulary item. Since a document can express multiple meanings, this special vocabulary item is itself a polysemous word. Each sense of a polysemous word can be regarded as a special semantic vocabulary item, and a polysemous word is then a bag of words consisting of all the semantic vocabulary items it contains. To study the word vector representation of polysemous words, the SENSEVAL corpus [5] is used in this paper; SENSEVAL is a word sense disambiguation corpus of polysemous words created under the Association for Computational Linguistics (ACL) [27]. In this corpus, example sentences are given for each polysemous word to illustrate the use of each sense. For example, the polysemous word line has six senses: cord (rope), division (separation), formation (team), telephone (phone), product (product), and text (text). Each sense is treated as a special semantic vocabulary item: line_rope, line_division, line_formation, line_phone, line_product, and line_text. The polysemous word in each example sentence is then replaced with the corresponding semantic vocabulary item, and a training corpus containing both the original and the replaced example sentences is formed, as shown in
Table 1. Based on this corpus, Word2Vec [
6], a word vector training algorithm developed by Google (Mountain View, CA, USA), is used to train the word vector representation of each polysemous word and semantic vocabulary in the corpus.
Since a polysemous word can be considered a bag of words made up of all the “semantic words” it contains, a linear combination of the word vectors of these “semantic words” can be used to compute (predict) the word vector of the polysemous word. In this paper, two linear combination models are defined: the average model, which takes the average of the semantic vocabulary word vectors, and the weighted average model, which takes their weighted average. The mathematical representation of the two models is shown below. Let the bag of words W be denoted as W = {w_1, w_2, …, w_i, …, w_n}, where w_i is the i-th word in the bag of words. The word vector representation of the bag of words W based on the average model is shown in Equation (1), and that based on the weighted average model is shown in Equation (2):

v(W) = (1/n) Σ_{i=1..n} v(w_i)  (1)

v(W) = Σ_{i=1..n} tf_i · v(w_i) / Σ_{i=1..n} tf_i  (2)

where v(w_i) is the word vector of the word w_i, and tf_i is the term frequency (tf) weight of the i-th word w_i in the bag of words W.
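The two combination models referred to in Equations (1) and (2) can be sketched compactly, with toy two-dimensional word vectors standing in for trained embeddings (the embeddings and the bag of words are illustrative assumptions):

```python
from collections import Counter

def avg_vector(bag, emb):
    """Average model: mean of the word vectors of the distinct words in the bag."""
    words = sorted(set(bag))
    dim = len(next(iter(emb.values())))
    return [sum(emb[w][k] for w in words) / len(words) for k in range(dim)]

def tf_weighted_vector(bag, emb):
    """Weighted average model: word vectors weighted by term frequency (tf)."""
    tf = Counter(bag)
    total = sum(tf.values())
    dim = len(next(iter(emb.values())))
    return [sum(tf[w] * emb[w][k] for w in tf) / total for k in range(dim)]

# Toy embeddings for two "semantic words" of a polysemous word.
emb = {"line_phone": [1.0, 0.0], "line_text": [0.0, 1.0]}
bag = ["line_phone", "line_phone", "line_phone", "line_text"]  # tf = 3 and 1

print(avg_vector(bag, emb))          # [0.5, 0.5]: each sense counts equally
print(tf_weighted_vector(bag, emb))  # [0.75, 0.25]: the frequent sense dominates
```

The weighted variant pulls the bag vector toward the more frequent sense, which is the behavior the Table 2 comparison rewards.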
The predicted word vectors of the polysemous words are calculated from the word vectors of their semantic vocabulary items using the two linear combination models above. By comparing them with the real word vectors of the polysemous words trained on the SENSEVAL corpus, we can determine which linear model better represents the word vectors of polysemous words.
Table 2 shows the cosine similarity between the real word vectors of four polysemous words and the word vectors of their semantic vocabulary items, as well as the predicted word vectors from the two linear combination models. From Table 2, it can be seen that for these four polysemous words (i.e., bags of words), the word vectors computed with the weighted average model are closest to the real word vectors, with cosine similarities above 0.9 in all cases. This indicates that the term-frequency-weighted average model can be used to represent the vector of a bag of words.
Although a document is also considered a bag of words, the bag of words of a document differs from the bag of words of a polysemous word in one important respect. The words in a document have, in addition to the term frequency (tf) weight, a more important weight, the TF-IDF weight, which more accurately reflects the importance of each word in the document. Therefore, based on the term-frequency-weighted average model above, the tf weight of each constituent word is replaced with its TF-IDF weight to obtain the TF-IDF weighted average model (TFIDF-AWV) for document vector prediction, as shown in Equation (3):

v(D) = Σ_{i=1..n} tfidf(w_i, D) · v(w_i) / Σ_{i=1..n} tfidf(w_i, D)  (3)

where D denotes a document, w_i denotes the i-th word in document D, tfidf(w_i, D) denotes the TF-IDF weight of the word w_i in the document D, and v(w_i) is the word vector of the word w_i.
- (2)
Vector space model based on TF-IDF weights and word vectors
When calculating the similarity between texts with the traditional vector space model, only the co-occurrence of words in the texts is considered. If a word does not occur in a text, its weight is 0, and the possible presence of semantically identical or similar substitute words in the text is ignored entirely. To address this limitation of the traditional vector space model, this paper uses semantically similar words as mutual substitutes and proposes the vector space model based on TF-IDF weights and word vectors (PTFIDF-VSM). Let V(D) = (v_1, v_2, …, v_i, …, v_n) be the vector space representation of document D, where v_i denotes the weight of the i-th word w_i in the document. If w_i occurs in the document, it is assigned its TF-IDF weight; if it does not occur, the TF-IDF weight of the semantically most similar word w* that does occur in the document (often a synonym or near-synonym) is used instead, corrected by the semantic similarity between the two words. The specific calculation is shown in Equations (4) and (5):

v_i = tfidf(w_i, D) if w_i occurs in D; otherwise v_i = tfidf(w*, D) · sim(w_i, w*)  (4)

w* = argmax_{w in D} sim(w_i, w)  (5)

where w_i is the i-th word, w* is the word in document D most semantically similar to w_i, and sim(w_i, w*) is the semantic similarity between them, which can be calculated as the cosine similarity of the word vectors trained by Word2Vec.
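The substitution rule of Equations (4) and (5) can be sketched as follows; the vocabulary, TF-IDF weights, and similarity function are toy stand-ins for real TF-IDF weights and Word2Vec cosine similarities:

```python
def ptfidf_vector(vocab, doc_terms, tfidf, sim):
    """PTFIDF-VSM sketch: present terms keep their TF-IDF weight; absent
    terms borrow the weight of the most similar present term (Eq. (5)),
    scaled by that similarity (Eq. (4))."""
    vec = []
    for w in vocab:
        if w in doc_terms:
            vec.append(tfidf[w])
        else:
            best = max(doc_terms, key=lambda u: sim(w, u))  # Eq. (5)
            vec.append(tfidf[best] * sim(w, best))          # Eq. (4), absent case
    return vec

# Toy document, weights, and similarities (assumed, for illustration only).
doc_terms = {"automobile", "road"}
tfidf = {"automobile": 2.0, "road": 1.0}
toy_sims = {("car", "automobile"): 0.9, ("car", "road"): 0.3,
            ("automobile", "automobile"): 1.0, ("road", "road"): 1.0}
sim = lambda a, b: toy_sims.get((a, b), 0.0)

vec = ptfidf_vector(["car", "automobile", "road"], doc_terms, tfidf, sim)
print(vec)  # [1.8, 2.0, 1.0]: "car" inherits 2.0 * 0.9 from "automobile"
```

The absent term "car" thus receives a nonzero weight instead of the hard zero the traditional VSM would assign.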
In summary, two document vector representation methods are investigated in this paper: the document vector representation model based on TF-IDF weights and word vectors (TFIDF-AWV) and the vector space model based on TF-IDF weights and word vectors (PTFIDF-VSM). In the following, these two models are used for the vector representation of citation candidates and documents in the text similarity calculation.
3.2. Automatic Identification of Implicit Citation Sentences
Using the document vector representation models proposed in Section 3.1, the citation sentences, the citing documents, and the cited references can all be represented as document vectors based on the trained word vectors. Cosine similarity is then used to compare the semantic similarity of a citation sentence to the citing document and to the cited reference.
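As a sketch (not the paper's exact code), once every text is represented as a vector, the comparison step reduces to checking which of two cosine similarities is larger; the toy vectors below are illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def closer_to_reference(sentence_vec, citing_doc_vec, reference_vec):
    """True if a candidate sentence is semantically closer to the cited
    reference than to the citing document that contains it."""
    return cosine(sentence_vec, reference_vec) > cosine(sentence_vec, citing_doc_vec)

# Toy vectors (assumed): this candidate sentence points toward the reference.
print(closer_to_reference([0.9, 0.1], [0.2, 0.8], [1.0, 0.0]))  # True
```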
- (1)
Data preparation
To train the word vectors, more than 23,500 articles were first collected randomly from the ACL Anthology corpus [28] (https://aclanthology.org/) (accessed on 1 October 2022). The full articles were downloaded in PDF format and converted to machine-processable text using the Apache PDFBox tool. The word vectors of all words were then trained using the Word2Vec tool developed by Google (Mountain View, CA, USA).
To compare the performance of different document vector models for implicit citation sentence identification, a sample of papers was collected and all citation texts contained in them (mainly the implicit citation sentences) were manually annotated to build the experimental corpus. While explicit citations are easy to identify from citation markers, implicit citations are not; they require reading and understanding both the cited references and the citation context of the citing document. Identifying all the implicit citation sentences for the cited references of a paper (if present) is therefore very time-consuming. In view of this, only seven papers published from 2014 to 2017 were randomly selected from three journals [29] in the field of computing, and every citation text in each paper was manually identified to build a small corpus. The seven papers contained a total of 207 citation texts, of which 139 (67.1%) contained only explicit citation sentences, while the other 68 (32.9%) comprised both explicit and implicit citation sentences, involving a total of 98 implicit citation sentences.
Table 3 lists the seven papers selected for building the experimental corpus.
Given that the experimental corpus is too small to evaluate the final detection effectiveness for implicit citation sentences, two highly cited papers, by Jacobs and Hoste [37] and by Färber and Jatowt [38], were selected; about 200 citing papers were randomly crawled for each, and the citation text (including explicit and implicit citation sentences) referring to each cited paper was manually annotated to build the final evaluation corpus. Since citation styles may differ across research areas, the citing papers were crawled from three different databases: Scopus, ProQuest, and EBSCOhost.
- (2)
Examination of research hypotheses
The research hypothesis is that implicit citation sentences are semantically more similar to the cited references than to the citing document. Since the full text of a document is often difficult to obtain in practice, both abstracts and full texts were used to represent documents in the experiment. Different document vector representation models are used to represent each implicit citation sentence, the citing document (full text or abstract), and the cited reference (full text or abstract), and the cosine similarity between the implicit citation sentence and each of the two documents is then compared. The comparison results for the different document vector representation models are shown in Table 4. It can be seen that more than half of the implicit citation sentences (at least 57.11%) are more similar to their cited references, regardless of which document vector representation model is used and whether documents are represented by abstracts or full texts.
With the document vector representation model based on TF-IDF weights and word vectors and with document abstracts, the effect is most pronounced: 80.33% of the implicit citation sentences are more similar to the cited references. Moreover, the similarity between implicit citation sentences and cited references is more pronounced when abstracts (of both the citing documents and the cited references) are used instead of full texts. This is because citation sentences tend to summarize the content of the cited reference, and an abstract is likewise a summary, so the semantic similarity between citation sentences and abstracts is higher than with full texts. The experimental results show that the text-similarity-based method proposed in this paper for identifying implicit citation sentences is feasible.
- (3)
Determining the range of candidate citation sentences
The text similarity-based method for identifying implicit citation sentences proposed here faces two main problems. The first is determining the range of candidate citation sentences: if the range is too large, noise increases and precision drops; if it is too small, truly implicit citation sentences are missed and recall drops. The second is that the vector representation of documents directly affects the accuracy of the text similarity computation; two document vector representation models are proposed in this paper, and it must be determined which yields better detection performance.
In this paper, the range of candidate citation sentences is determined empirically on the experimental corpus. The range consists of a left window and a right window, where the left window covers the sentences before the explicit citation sentence and the right window covers the sentences after it. The lengths of the two windows are set independently, and the effect of each on the detection result is examined using the F1 value as the evaluation metric. First, the length of the right window is fixed at 10 (i.e., 10 sentences after the explicit citation) while the length of the left window is varied from 1 to 9 (i.e., 1-9 sentences before the explicit citation), and the sentences in the two windows are used as citation candidates. The detection results for the different left-window lengths are shown in Figure 2, where the curves correspond to the different document vector representation models. For all models, the F1 value for detecting implicit citation sentences is highest when the left window length is 2, so the left window length is set to 2. Next, with the left window fixed at 2, the right window length is varied from 1 to 14.
The detection results for the different right-window lengths are shown in Figure 3. The F1 value gradually increases with the right window length for all the document vector representation models; when the right window length reaches 10, the increase flattens out and no longer changes significantly. Therefore, the right window length is set to 10. The range of candidate citation sentences is thus the two sentences before the explicit citation sentence and the ten sentences after it.
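The resulting windowing rule can be sketched as follows (the sentence list and index are illustrative assumptions):

```python
def candidate_citations(sentences, explicit_idx, left=2, right=10):
    """Return the candidate implicit-citation sentences around an explicit
    citation sentence: `left` sentences before it and `right` sentences
    after it, clipped at the document boundaries."""
    lo = max(0, explicit_idx - left)
    return sentences[lo:explicit_idx] + sentences[explicit_idx + 1:explicit_idx + 1 + right]

sents = [f"s{i}" for i in range(30)]
window = candidate_citations(sents, explicit_idx=5)
print(window)  # ['s3', 's4', 's6', ..., 's15']: 2 before + 10 after, 12 in total
```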
- (4)
Evaluation and Analysis of Implicit Citation Sentence Detection Based on Different Document Vector Representation Models
After determining the range of candidate citation sentences, the experimental corpus is used to evaluate the performance of the two document vector representation models proposed in this paper for implicit citation sentence detection. For comparison, the traditional vector space model and the Doc2Vec document vector representation model [39] are used as benchmarks.
Table 5 shows the performance of implicit citation sentence detection with the various document vector representation models, evaluated by precision (P), recall (R), and F1. The best-performing model is the document vector model based on TF-IDF weights and word vectors, with F1 values of 88.87% and 82.43% when documents are represented by abstracts and full texts, respectively. Identification is better with abstracts than with full texts; since abstracts are far easier to obtain than full texts in practice, this finding is significant for practical applications. For all the document vector representation models, detection precision is very high, above 80% in every case and above 99% in some, but recall is less satisfactory, peaking at only about 80%. This indicates that some implicit citations are still missed, so the key to improving overall detection performance is raising the recall.
To improve the recall, detection methods based on different document vector representation models are combined: the first model is used to separate the candidate citation sentences into citation and non-citation sentences, and the sentences rejected in this first step are then re-examined with the second model, which recovers implicit citation sentences missed by the first.
Table 6 shows the performance of the different combination modes for identifying implicit citation sentences. The combined models further improve the recall and thus greatly improve overall detection performance. The best combination is the document vector model based on TF-IDF weights and word vectors (TFIDF-AWV) followed by the vector space model based on TF-IDF weights and word vectors (PTFIDF-VSM); when abstracts represent the documents, the F1 value exceeds 94%. The order of the two models in the combination has a slight effect on detection performance, but the difference is negligible.
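The two-stage combination can be sketched as a simple cascade; the detector functions below are placeholders for the two similarity-based models, not the paper's actual classifiers:

```python
def cascade(candidates, first_model, second_model):
    """Two-stage combination: sentences the first model rejects are
    re-examined by the second model, trading a little precision for recall."""
    accepted = [s for s in candidates if first_model(s)]
    rejected = [s for s in candidates if not first_model(s)]
    recovered = [s for s in rejected if second_model(s)]
    return accepted + recovered

# Placeholder detectors: each flags a different subset of sentences.
first_model = lambda s: "method" in s
second_model = lambda s: "approach" in s

candidates = ["their method is faster", "this approach was refined", "unrelated remark"]
print(cascade(candidates, first_model, second_model))
# ['their method is faster', 'this approach was refined']
```

Anything either model flags is kept, so recall can only rise relative to the first model alone.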
- (5)
Final evaluation of implicit citation sentence detection based on the best combination model
To evaluate the final detection performance for implicit citation sentences, the best detection model (i.e., the combined model TFIDF-AWV + PTFIDF-VSM) is used to automatically recognize the implicit citation sentences for the two highly cited papers in the evaluation corpus. Abstracts were used to represent both the citing papers and the cited references, and the results are shown in Table 7 and Table 8, respectively. The experimental results show that implicit citation sentence detection for both highly cited papers is satisfactory, with F1 values as high as 92.2%, indicating that the method proposed in this paper is very effective. By comparison, the precision of implicit citation sentence detection for the highly cited deep neural network paper (89.2%) is lower than for the highly cited LDA topic model paper (96.8%), but its recall (96.6%) is higher than the latter's (88.5%). There is no significant difference in the effectiveness of identifying implicit citation sentences in citing papers from different fields [
40].