3.2. WWE Model
We propose a WWE model based on TF-IWF and CBOW. The WWE model is shown in Figure 2.
Many scholars use TF-IDF to weight the words in the vector, which greatly improves document representations based on static word vectors. However, IDF reflects the importance of words and the distribution of feature words through the quotient of the total number of documents divided by the number of documents containing a specific word: when a word appears in many documents, the quotient becomes smaller and the word is judged less important, which does not accurately reflect a specific corpus environment. Within a single domain, the fact that a word appears many times across different documents actually indicates that the word is more important. Unlike IDF, IWF reduces the influence of similar texts on word weights in the corpus and expresses the importance of words in the documents to be checked more accurately. For example, when a word appears in multiple documents but its total word frequency is relatively small, the IWF value is larger, indicating that the word is more important, which is closer to the facts. We therefore use TF-IWF to weight the words in the vector for short text.
TF-IDF is a commonly used feature weighting technique in information retrieval and text mining, and is often applied to text topic extraction and word segmentation weighting. TF-IDF is a purely statistical method. Its core idea is that the importance of a word is proportional to the frequency of its appearance in a given document and inversely proportional to the frequency of its appearance in other documents. It is defined as Equation (1):

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i} \tag{1}$$
TF denotes term frequency, that is, the number of times a word appears in a document. Because this count tends to grow with document length, the word frequency is normalized, usually by dividing the number of occurrences by the total number of words in the document:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{2}$$

In Equation (2), the numerator $n_{i,j}$ denotes the frequency of word $i$ in document $j$, and the denominator $\sum_{k} n_{k,j}$ denotes the total number of words in document $j$.
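As a small numeric illustration (the counts here are assumed, not drawn from the paper's corpus), a word occurring 3 times in a 50-word document has

$$\mathrm{TF} = \frac{3}{50} = 0.06.$$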
IDF represents the inverse document frequency, defined via the total number of documents divided by the number of documents containing a given word:

$$\mathrm{IDF}_{i} = \log \frac{|C|}{|\{\, j : w_{i} \in d_{j} \,\}|} \tag{3}$$

In Equation (3), $|C|$ denotes the total number of documents in corpus $C$, and the denominator $|\{\, j : w_{i} \in d_{j} \,\}|$ denotes the number of documents in corpus $C$ that contain the word $w_{i}$. In applications, in order to avoid the denominator being 0, the denominator is generally taken as $|\{\, j : w_{i} \in d_{j} \,\}| + 1$.
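As an illustrative calculation with assumed counts, if corpus $C$ contains $|C| = 10$ documents and a word appears in 4 of them, the smoothed denominator gives

$$\mathrm{IDF} = \log \frac{10}{4 + 1} = \log 2 \approx 0.69,$$

so a word spread across many documents receives a small IDF weight.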
In essence, IDF is a weighting method that tries to suppress noise: it assumes that less frequent words are more important and more frequent words are less important. This is not entirely correct for most text information. When the model cannot perform weight adjustment well, the keywords extracted by IDF fail to effectively reflect the importance of words and the distribution of feature words. Especially in similar corpora, this method is flawed, and some keywords of such similar texts are often masked.
To solve the shortcomings of IDF with short text and similar corpora, this paper adopts TF-IWF to weight the word embedding in each text. TF is consistent with the definition in TF-IDF, and IWF is defined as Equation (4):

$$\mathrm{IWF}_{i} = \log \frac{\sum_{k} n_{k}}{n_{i}} \tag{4}$$

In Equation (4), the numerator $\sum_{k} n_{k}$ represents the sum of the frequencies of all words in the corpus, and the denominator $n_{i}$ represents the total frequency of word $i$ in the corpus. Therefore, TF-IWF is defined by Equation (5):

$$\mathrm{TF\text{-}IWF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IWF}_{i} \tag{5}$$
According to Equation (5), we calculate the TF-IWF weight of every word in each text of the corpus, denoted by $w_{i,j}$, where $i$ represents the text number and $j$ represents the $j$th word in the text.
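For concreteness, the following is a minimal sketch of this weighting step in Python; the toy corpus, the tokenization, and the logarithmic form of IWF reconstructed in Equation (4) are illustrative assumptions rather than the paper's exact implementation.

```python
from collections import Counter
from math import log

# Toy tokenized corpus (hypothetical example data).
corpus = [
    ["deep", "learning", "for", "short", "text"],
    ["topic", "model", "for", "short", "text"],
    ["word", "embedding", "and", "topic", "model"],
]

# Corpus-level word frequencies (denominator of Equation (4)) and their sum (numerator).
corpus_freq = Counter(w for doc in corpus for w in doc)
total_freq = sum(corpus_freq.values())

def tf_iwf(doc):
    """TF-IWF weight of every word in one document (Equations (2), (4), and (5))."""
    doc_freq = Counter(doc)
    doc_len = len(doc)
    weights = {}
    for word, count in doc_freq.items():
        tf = count / doc_len                        # Equation (2)
        iwf = log(total_freq / corpus_freq[word])   # Equation (4), assumed log form
        weights[word] = tf * iwf                    # Equation (5)
    return weights

for i, doc in enumerate(corpus):
    print(i, tf_iwf(doc))
```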
Suppose that corpus $C$ forms a vocabulary $V$, and each word in vocabulary $V$ is encoded by the one-hot method. According to the idea of the CBOW model, each word (one-hot encoded) in vocabulary $V$ can be mapped into a low-dimensional dense word vector through training, and these vectors form an embedding matrix. Therefore, we can obtain the word vectors of a short text by looking them up in this matrix, and we finally generate the WWE representation of each short text by multiplying the word vectors by their corresponding TF-IWF weights $w_{i,j}$ and accumulating the weighted vectors.
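The sketch below illustrates the WWE construction under two assumptions: that gensim's Word2Vec in CBOW mode (sg = 0) supplies the trained word vectors, and that the TF-IWF weights come from the previous step (they are hard-coded here with made-up values).

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus (same illustrative data as the TF-IWF sketch).
corpus = [
    ["deep", "learning", "for", "short", "text"],
    ["topic", "model", "for", "short", "text"],
    ["word", "embedding", "and", "topic", "model"],
]

# Train CBOW word vectors (sg=0 selects CBOW in gensim).
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2,
                min_count=1, sg=0, epochs=100)

def wwe(doc, weights):
    """WWE of one short text: accumulate TF-IWF weight times CBOW word vector."""
    vec = np.zeros(cbow.vector_size)
    for word in doc:
        vec += weights.get(word, 0.0) * cbow.wv[word]
    return vec

# Illustrative (made-up) TF-IWF weights for the first document.
weights = {"deep": 0.22, "learning": 0.22, "for": 0.08, "short": 0.14, "text": 0.14}
print(wwe(corpus[0], weights).shape)   # (50,)
```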
3.3. Extended Topic Information
The sparsity of content in short text brings new challenges to topic modeling [23]. Conventional topic models assume that a document is a mixture of topics, where each topic is seen as conveying a certain semantic meaning through a set of correlated words. They then utilize statistical techniques to learn the topic components and mixing coefficients of each document. In essence, conventional topic models reveal the topics in a corpus by implicitly capturing document-level word co-occurrence patterns. Therefore, applying these models directly to short text suffers from severe data sparsity. On the one hand, words occur less frequently in a single short text than in a long text, so it is difficult to infer which words in each document are more relevant. On the other hand, because a short text contains only a few words, it cannot express rich topic information through global co-occurrence.
The purpose of this paper is to enrich the semantic information of short text through topic information, and the LDA model exploits word co-occurrence patterns to reveal the latent semantic structure of a corpus in an implicit way by modeling the generation of words in each document. In order to solve the problem of the sparse and weakened topic representation of short text, we draw on the idea of BTM [23] to extend the topic information without changing the words of the original documents. With this idea, we use a novel method to extend the latent topic components of short text by directly extending the word sequences. The detailed process of generating topic information with the LDA model (https://doi.org/10.1007/s44196-021-00055-4, accessed on 26 January 2022) can be found in reference [24].
Before we detail the word-extension method, we first introduce the “biterm” in the BTM model. A biterm denotes an unordered word pair co-occurring in a short context; in such a case, any two distinct words in a document construct a biterm. For example, a document with three distinct words $(w_{1}, w_{2}, w_{3})$ will generate three biterms: $(w_{1}, w_{2})$, $(w_{1}, w_{3})$, and $(w_{2}, w_{3})$.
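As a small illustration (the words are arbitrary and not from the paper's data), biterms can be enumerated with Python's itertools:

```python
from itertools import combinations

def biterms(doc_tokens):
    """All unordered pairs of distinct words co-occurring in one short text."""
    return list(combinations(sorted(set(doc_tokens)), 2))

print(biterms(["visit", "apple", "store"]))
# [('apple', 'store'), ('apple', 'visit'), ('store', 'visit')]  -> three biterms
```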
In BTM, biterms are unordered, and the whole corpus is turned into a biterm set. Although this method greatly enriches the topic information, the resulting topic information is not suitable for representing the semantic information of short text: the method does not consider the word order of the source document (which may confuse the semantic information of the short text) and does not retain the original words (losing part of the original topic information). Therefore, we adopt the following method for adaptive transformation, and the other parts of the ETI are consistent with traditional LDA.
After the extension, we can use the LDA model to generate the topic representation. The approach above greatly alleviates the weakness of the LDA model on short text. Therefore, we can obtain the topic information representation of each text.
3.4. Fusion Method
Through the above steps, we obtain two distributed representations of each short text: the WWE representation and the ETI topic representation. Although they represent semantic information and topic information, respectively, to a certain extent, their proportions should differ for each short text. Therefore, we propose two fusion strategies:
Static linear fusion: This is achieved by assigning a static weight parameter to the WWE representation and the ETI representation to adjust the proportions of semantic information and topic information, as shown in Equation (8). The weight takes an empirical value in the range 0 to 1.
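Since Equation (8) is not reproduced here, the sketch below shows only one plausible form of the static fusion consistent with a single weight in [0, 1]; the weighted concatenation and the example dimensions are assumptions, not the paper's exact formula.

```python
import numpy as np

def static_fusion(v_wwe: np.ndarray, v_eti: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Combine the semantic (WWE) and topic (ETI) representations with one static
    weight alpha in [0, 1]; weighted concatenation is used here so the two vectors
    need not share a dimension."""
    return np.concatenate([alpha * v_wwe, (1.0 - alpha) * v_eti])

# Usage with illustrative dimensions: a 50-d WWE vector and a 20-topic ETI vector.
fused = static_fusion(np.random.rand(50), np.random.rand(20), alpha=0.7)
print(fused.shape)   # (70,)
```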
Dynamic fusion: The WWE gives a different weight to each word vector and obtains the distributed representation of a short text by accumulating the weighted word vectors. Because texts differ in length, this accumulation leads to differences between the semantic representation and the topic representation: the more words a text contains, the greater the noise, and the more easily the semantic information in the WWE representation is obscured. In contrast, the ETI representation carries more topic information. Therefore, we propose a dynamic fusion model, as shown in Equation (9):
The research objects of this paper are short texts of fewer than 512 words, such as paper abstracts, instant messages, and website reviews. Short texts longer than 512 words are truncated, and only the first 512 words are kept. In Equation (9), Len denotes the number of effective words contained in a short text, so its maximum value is 512 in this paper; δ is a hyperparameter used to adjust the behavior of the formula, and it can be any value greater than max(Len), where max(Len) = 512 in this paper. In addition, Equation (9) shows that δ adjusts the relative proportions of the semantic and topic representations: the smaller δ is, the larger the adjustment; conversely, the larger δ is, the smaller the adjustment.
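For intuition only, the following sketch realizes the length-dependent behavior described for Equation (9); the mixing function Len/δ is an assumption consistent with the text (the weight moves toward topic information as Len grows, and a smaller δ gives a larger adjustment), not the authors' actual formula.

```python
import numpy as np

def dynamic_fusion(v_wwe: np.ndarray, v_eti: np.ndarray,
                   length: int, delta: float = 1024.0) -> np.ndarray:
    """Length-aware fusion: weight shifts from the WWE (semantic) part toward the
    ETI (topic) part as the text gets longer. delta must exceed max(Len) (512 here),
    so the mixing coefficient stays in (0, 1); a smaller delta strengthens the
    length-driven adjustment."""
    length = min(length, 512)        # texts are truncated to 512 words
    lam = length / delta             # grows with text length, shrinks with delta
    return np.concatenate([(1.0 - lam) * v_wwe, lam * v_eti])

# Usage with illustrative vectors and a 30-word text.
print(dynamic_fusion(np.random.rand(50), np.random.rand(20), length=30).shape)  # (70,)
```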