1. Introduction
Natural language is the mechanism that a human being uses to communicate and transmit an idea, opinion, or feeling [
1]. Understanding natural language is a complex task and requires time because millions of connections between neurons are necessary to learn it. However, a computer needs structure and logic to understand a programming or natural language. Therefore, a mathematical formula or a predefined pattern will be necessary for the computer to learn the required knowledge [
1]. For a computer to recognize the data it receives, it must generate an adequate numerical representation.
Natural language processing (NLP) is an area in constant development; it seeks to generate efficient algorithms that allow a computer to understand the spontaneous language of a human being. Some characteristics of natural language involve strict rules, which facilitate computerized analysis [1]. Hence,
NLP encompasses techniques and tools for developing systems that can interpret and utilize
natural language to perform desired tasks, such as news classification or spam identification [
1]. Mechanisms, such as the
extraction of semantic relationships, named entity recognition, topic discovery, and
word embedding, are essential to provide a computer with the knowledge necessary to process information and produce the expected results. In particular, topic discovery is defined as the task of finding the topics present in a set of input documents. This task can be applied to any text [
2]. The purpose is to identify, without needing a dictionary, the main themes that implicitly exist within a collection of texts [
2]. For a computer to understand
natural language, it is necessary to create vectors of numbers. The embedding vectors or patterns may be subject to operations, such as addition, subtraction, and distance measurements. The literature shows that some word-embedding models are based on neural networks or context matrices [
3]. The advancement of technology has made it possible to streamline processes, such as:
Searching for the subject of a document.
Searching for a specific document.
Generating a summary or extracting key phrases from a text.
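The vector operations mentioned earlier (addition, subtraction, and distance measurements) can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and do not come from any trained model; cosine similarity is assumed as the distance measure.

```python
import math

# Toy 3-dimensional embeddings; the values are invented for illustration only.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.8, 0.1, 0.6],
    "man":   [0.2, 0.7, 0.1],
    "woman": [0.2, 0.2, 0.6],
    "apple": [0.1, 0.1, 0.1],
}

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vector, exclude=()):
    """Word whose embedding is most similar to `vector`, skipping `exclude`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# Combine vectors with subtraction and addition, then look up the nearest word.
analogy = add(subtract(embeddings["king"], embeddings["man"]), embeddings["woman"])
closest = nearest(analogy, exclude=("king", "man", "woman"))
```

With these toy values, the combined vector lands closest to the entry for "queen"; real embedding models have hundreds of dimensions and only approximate such relations.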
In the literature, some word embedding models are word2vec, GloVe, fastText, and BERT. Word embedding models came to prominence in 2013, when Tomas Mikolov and his team at Google developed the word2vec model. The model has the following sub-models:
Continuous bag-of-words (
CBOW [
4]): receives a context and predicts a target word [
4].
Skip-gram [5]: receives a target word and predicts the words in its context. In its subword extension, each word is represented as a bag of character n-grams [6].
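As a rough sketch of how these sub-models frame their training data, the snippet below builds (context, target) pairs in the CBOW style and (target, context word) pairs in the conventional skip-gram style, and splits a word into character n-grams. The function names, the window size, and the `<` and `>` boundary markers are assumptions of this example, not details taken from the cited works.

```python
def training_pairs(tokens, window=2):
    """Build CBOW-style and skip-gram-style training pairs from a token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBOW: the whole context is the input, the target word is the output.
        cbow.append((context, target))
        # Skip-gram: the target word is the input, each context word an output.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

tokens = "the cat sat on the mat".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=1)
ngrams = char_ngrams("where", n=3)
```

Here `cbow_pairs[1]` pairs the context `["the", "sat"]` with the target `"cat"`, while `skipgram_pairs` contains `("cat", "the")` and `("cat", "sat")` as separate examples.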
The
GloVe embedding model was developed in 2014 by Jeffrey Pennington [
7]. This model combines the advantages of the two main family models in the literature:
global matrix factorization and
local context window.
GloVe works with non-zero elements in a word–word co-occurrence matrix rather than the entire sparse matrix or separate context windows in a large corpus [
7].
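The idea of working only with the non-zero entries of a word–word co-occurrence matrix can be sketched with a dictionary keyed by word pairs. This is an illustrative counting step only, not the GloVe training objective; the window size and the 1/distance weighting are assumptions of this sketch.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Sparse, symmetric word-word co-occurrence counts within a context window.

    Only pairs that actually co-occur are stored, so the full
    vocabulary-by-vocabulary matrix is never materialized.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer words contribute more (assumed)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

sentences = [["ice", "is", "cold"], ["steam", "is", "hot"]]
counts = cooccurrence_counts(sentences, window=2)
vocabulary = {w for s in sentences for w in s}
dense_cells = len(vocabulary) ** 2  # size of the full matrix
```

For this toy input, the dictionary holds 12 non-zero entries, whereas the full matrix over the five-word vocabulary would have 25 cells.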
On the other hand, in 2015,
Facebook researchers created the embedding model called
fastText. The
fastText model has pre-trained models for 294 languages. The authors relied on the
skip-gram [
8] model. In 2018,
BERT (bidirectional encoder representations from transformers) was designed to pre-train deep bidirectional representations from the unlabeled text by jointly conditioning left and right contexts in all layers [
9]. In [
3], the authors applied GloVe and
fastText for the text classification of two text corpora, and compared the results that were obtained with a semantics embedding model.
Hence,
text classification involves processing data without a person’s intervention. The computer must have access to the knowledge necessary to carry out tasks such as medical diagnoses, analysis of social networks, or the search for fake news. Therefore, a system that works with many documents requires algorithms or methods to provide the computer with the necessary knowledge to generate the results expected by a user [
10].
In addition,
deep learning is a process carried out with a neural network, for example, a convolutional neural network. It has been adopted for text classification tasks, generating better results than a traditional classification task [
11].
A
CNN is a multi-layer network or hierarchical network and is a high-level feature-based method.
A CNN is built by stacking multiple layers of features. One characteristic of a CNN is the presence of a subsampling or pooling layer [12]. Pooling reduces the size of the data passed between layers, which lowers the computational cost while still allowing different features to be recognized [3].
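The data-size reduction performed by a pooling layer can be shown with a small one-dimensional max-pooling sketch. The pool size, stride, and feature values are hypothetical and chosen only to make the effect visible.

```python
def max_pool_1d(values, pool_size=2, stride=2):
    """Keep the maximum of each window of `pool_size` values."""
    return [
        max(values[start:start + pool_size])
        for start in range(0, len(values) - pool_size + 1, stride)
    ]

# Hypothetical output of a convolutional filter over six positions.
feature_map = [0.1, 0.9, 0.3, 0.4, 0.8, 0.2]
pooled = max_pool_1d(feature_map)
```

The six input values are reduced to three (`[0.9, 0.4, 0.8]`), keeping the strongest response in each window.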
Currently, computational approaches need to model knowledge to generate accurate results without the intervention of a person. Text classification involves ordering large amounts of documents in short periods. Topic discovery, in turn, involves finding the main ideas that recur across large amounts of textual data. Its objective is to extract the central ideas of text documents automatically, imitating human capacity without human intervention. The recurring topics it identifies provide an overview of the text. A topic discovery model that receives text, with or without preprocessing, generates general topics, since the input data include many unclassified texts.
However, discovering topics from previously classified text allows us to learn specific topics. Hence, the top words that characterize the topics are linked to each other since they come from classified text [
13].
This paper presents a text classification process integrated into identified topics with semantic embedding models. This incorporation provides specific topics with specific significance instead of general topics from the complete set of unclassified text since the topics are extracted from each identified class. The input texts are composed of two news domain corpora, previously classified with a convolutional neural network, using three semantic embedding models [
3] as semantic features. The proposed topic discovery process aims to obtain specific topics for each class with semantic relationships between their top words. A quality assessment of the identified topics was performed with the normalized topic coherence metric. Therefore, the identified topics in each class provide latent and specific topics depicted by top words with high coherence from each obtained class. Based on the results obtained by integrating text classification with topic discovery, it was concluded that discovering topics in previously classified text generates specific topics from each class.
The rest of the paper is organized as follows.
Section 2 explores works related to this research.
Section 3 shows the proposed approach to incorporate
text classification in the
topic discovery process. The experimental results are presented in
Section 4. The conclusions and future work are presented in
Section 5.
2. Related Works
This section presents works related to the same field. Some works incorporated additional algorithms into their approaches to discover topics such as text classification through deep learning models. They also applied sentiment analysis and clustering algorithms for the same purpose.
In the literature, some authors created different models for discovering topics. The authors in [
14,
15,
16,
17,
18,
19] used the normalized topic coherence metric to evaluate the results obtained. In [
14], a topic model based on min-hashing was proposed to find sets of matching words, which were subsequently grouped to produce the existing topics in the analyzed text.
On the other hand, in [
17,
20,
21,
22,
23], the authors applied different models, like encoder, LSTM, and matrix factorization in their research. In [
17], the authors combined contextualized representations with topic models via neural networks. The combination extends the neural topic model ProdLDA. On the other hand, ref. [18] proposed a neural topic model (NTM) based on a variational autoencoder (VAE). The model reconstructs the sentence and the word counts of the document by combining bag-of-words and word embedding representations.
Furthermore, in [
24], the authors presented the pseudo-document-based topic model (PTM), which introduces the concept of a pseudo-document to implicitly aggregate short texts and counter data sparsity. They also proposed a word embedding PTM (WE-PTM). On the other hand, in [21], hierarchical latent tree analysis (HLTA) was proposed for hierarchical topic modeling, with the extraction and selection of collocations as a preprocessing step. Selected collocations were replaced with unique tokens in the bag-of-words model before running HLTA. In [
22], the authors presented the automated extraction of discussions related to COVID-19 and applied an LSTM recurrent neural network for sentiment classification. In [
23], the authors presented two mixed counting neural models for discovering dispersed topics: the negative binomial-neural topic model (NB-NTM) and the gamma negative binomial-neural topic model (GNB-NTM). In [
20], the authors presented two approaches, NTM-R and NTM-F, which were based on regularization and factorization constraints. The objective was to incorporate knowledge about topic coherence in formulating topic models.
The authors of [
19,
25,
26,
27,
28,
29,
30] applied word embedding to discover topics in long and short texts, or they applied clustering algorithms. Hence, in [
25,
26], the authors showed models based on word embeddings. In [
25], the authors presented a model named Word2Vec2Graph, based on the Word2Vec model. The authors applied the model to analyze long documents, obtain unexpected word associations, and discover topics in the papers. On the other hand, in [
26], the authors presented an approach to topic discovery and extracted text representations of tweets using a word embedding model. They then grouped them into semantically similar groups using the HDBSCAN algorithm, with each representing a topic.
In addition, in [
19], the authors presented a hierarchical topic modeling algorithm. The algorithm is based on community mining of word co-occurrence networks, taking advantage of the natural network structure. However, a Bayesian generative model was shown in [
27]. The model describes thematic hierarchies organized into taxonomies. The experiments show that the proposed model integrates prior knowledge and improves both the hierarchical discovery of topics and the representation of documents. In [
28], the authors presented the use of the kernel principal component analysis (KernelPCA) and
K-means clustering in BERTopic architecture. In [
29], the authors presented a new method, combining a pre-trained BERT model and a
K-clustering algorithm, applying similarity between documents and topics. Furthermore, ref. [
30] proposed a polymerization topic sentiment model (PTSM) to conduct textual analysis for online reviews.
The authors of [
31] combined topic discovery with a long short-term memory (LSTM) model to extract patterns from the comments analyzed in crowdfunding campaigns. The proposed model is trained with latent Dirichlet allocation (LDA) and word embeddings. In [
32], the authors presented an analysis of comments about COVID-19 to detect feelings related to the disease. They used the VADER lexicon, which associates a sentiment rating to each word, followed by TextBlob. The discovery of the topics was carried out using the LDA algorithm.
A dependency SCOR-topic sentiment (DSTS) model was offered in [
33]. The authors used online tea sales data as empirical evidence to test the proposed model. The results show that the DSTS model is generally superior to the LDA and PLSA models. In addition, in [
15], each document was interpreted as word embeddings and a two-way model for discovering multi-level topic structures. In each layer, it learns a set of topical embeddings. On the other hand, in [
16], the authors presented an approach that introduced hyperbolic embeddings to represent words and topics. In 2022, the authors of [
34,
35,
36] presented their contributions in this field with text classification tasks and topic discovery.
TextNetTopics [
34] is an approach that applies feature selection by considering the bag-of-topics (BOT) approach rather than the traditional bag-of-words (BOW) approach. This paper suggested scoring topics to select the top topics for training the classifier, hence reducing dimensionality and preserving the semantic descriptions of documents. On the other hand, in [
35], the authors proposed considerations for selecting a suitable topic model based on the predictive performance and interpretability measures for text classification. Using clinical notes, they compared 17 different topic models regarding interpretability and predictive performance in an inpatient violence prediction task. Finally, ref. [
36] presented a comprehensive survey of algorithms for short text topic discovery, and performance was evaluated using text classification.
On the other hand, in [
37,
38], the authors presented their contributions in the same field. In [
37], the authors presented a model combining the advantages of unsupervised topic modeling with supervised string kernels for text classification tasks. The top words in the identified topics reduced the document corpus to a topic–word sequence. This reduction was used for text classification with string kernels, significantly improving accuracy and reducing training time. For [
38], an approach to discovering topics through cosine similarity brought great results. First, the authors extracted synonyms from a semantic network, and in this way, relevant topics from datasets such as Yahoo and BBC News were identified. They used text classification models, such as support vector machines, decision trees, and random forests to carry out the text classification task.
The authors of [
39] considered open information extraction techniques to integrate into text classification tasks, considering semantic aspects in different languages. Hence, the authors presented an approach to enrich the open information extraction paradigm by exploiting syntactic and semantic analysis and semantic relations from an ontology. The authors used the English Wikipedia as a dataset. On the other hand, in [
40], the authors presented an approach based on patterns and ontologies for information extraction and its integration with other tasks; they experimented with building an open information extraction (OIE) system for text in the Italian language. The proposed approach relied on linguistic structures and a set of verbal patterns, combining theoretical linguistic knowledge and corpus-based statistical information. Also, ref. [
41] presented an approach to perform open information extraction (OIE) for the Italian language; it was based on linguistic structures to analyze sentences and a set of verbal behavior patterns to extract information from them. The patterns combined a linguistic theoretical framework (such as lexicon-grammar (LG)) and distributional profiles extracted from a contemporary. In addition, the authors of [
42] presented a multi-
OIE, which performed open information extraction (OIE) by using multilingual BERT. The model is a sequence-labeling system with an extraction method. On the other hand, in [
43], the authors presented an overview of the current situation of neural information extraction models, focusing on the advantages, disadvantages, and future of work in the field. In [
44,
45], the authors applied machine learning and sentiment analysis techniques to the COVID-19 domain. The authors of [
44] proposed a methodology based on natural language processing (NLP) and sentiment analysis to obtain insight into opinions on COVID-19 vaccination in Italy. Ref. [45] presented a scoping review of AI in COVID-19 research. In [
46], the authors presented DEFIE, an approach to information extraction (IE) based on the syntactic-semantic analysis of textual definitions and techniques involving semantic aspects; they harvested instances of semantic relations from a corpus of textual descriptions. The aim was to extract as much information as possible by unifying syntactic analysis with state-of-the-art disambiguation and entity linking. An extensive knowledge base was produced against state-of-the-art OIE systems based on much larger corpora. In this analysis, only in reference [
19] was text classification and topic discovery used to obtain the sentiment associated with the text. The rest of the works used topic discovery with other techniques or algorithms (but not text classification). Therefore, this work seeks to provide a methodology that classifies text by analyzing the words that make up a document, to decide which class it identifies. This results in discovering specific topics for each class with a higher semantic relationship between their primary words.
This paper proposes integrating a text classification process with semantic embedding models to discover specific topics. Specific topics are identified in each class identified in each corpus. The 20-Newsgroups corpus has twenty classes, and the Reuters Corpus has ninety classes. The results are evaluated with the normalized topic coherence metric to assess the performance of the proposed model.
3. Proposed Approach
This section presents the proposed approach for text classification and integrates it into topic discovery with semantic embedding models and three algorithms.
The proposed approach for integrating text classification in the topic discovery process incorporates the following process:
Pre-processing: The text sets are pre-processed; this consists of removing punctuation marks, converting to lowercase, and removing URL marks.
The semantic relationship extraction: Extracting semantic relationships from the English Wikipedia corpus is vital for constructing the proposed embedding models. It is necessary to extract the relations of synonymy, hyponymy, and hyperonymy using lexical–syntactic patterns extracted from the literature for these semantic relationships [
3].
Development of embedding models: Each word pair of identified relationships is assigned a unique identifier for constructing semantic relationship embeddings [
3].
Text classification from
20-Newsgroups and
Reuters Corpus with a convolutional neural network (CNN) [
3].
Topic discovery for each class with the latent Dirichlet allocation (LDA), probabilistic latent semantic analysis (PLSA), and latent semantic analysis (LSA) algorithms. An evaluation process based on normalized topic coherence using the top words is performed.
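The pre-processing step listed above (removing punctuation marks, converting to lowercase, and removing URLs) can be sketched as follows. The regular expression, the function name, and the optional stop-word argument are assumptions of this example rather than the exact procedure used in the experiments.

```python
import re
import string

# Assumed URL pattern: anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text, stop_words=frozenset()):
    """Return a list of cleaned, lowercased tokens."""
    text = URL_PATTERN.sub(" ", text)    # remove URLs first, while "://" is intact
    text = text.lower()                  # convert to lowercase
    text = text.translate(PUNCT_TABLE)   # remove punctuation marks
    return [token for token in text.split() if token not in stop_words]

tokens = preprocess("Visit http://example.com NOW, it's great!", stop_words={"its"})
```

The order matters in this sketch: stripping punctuation before URL removal would break the URL pattern, leaving fragments such as `httpexamplecom` in the token list.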
The LDA, PLSA, and LSA algorithms were chosen because they are the most widely used in the literature. This choice makes it possible to contrast works that discover topics from input data with only traditional preprocessing against works that perform additional processing, such as text classification and community detection algorithms.
Figure 1 shows the proposed approach. Two previously preprocessed corpora were classified, yielding 20 classes for the 20-Newsgroups corpus and 90 for the Reuters Corpus. The classes are the input data set for the LDA, PLSA, and LSA algorithms for topic discovery. The 20-Newsgroups corpus comprises 20,000 documents organized into 20 classes, from which 20, 50, and 100 topics are identified. The Reuters Corpus, on the other hand, has fewer documents spread over 90 classes, so extracting 100 topics per class is not feasible given the corpus size.
3.1. Text Classification
The text was preprocessed via text cleaning, removing stop words, and converting to lowercase. The text classification process was carried out as proposed in [
3]. The method, which includes CNN, was used to assess the performance of three embedding models of semantic relations and to produce a set of classes for each corpus used. The classification task generated the corresponding classes in each corpus used. For the
20-Newsgroups corpus, 20 classes were obtained, and for the
Reuters Corpus, 90 classes were obtained. The classes are the basis for topic discovery since they are the input data set to each topic algorithm. In addition, classification was carried out to have an ordered corpus with semantic relationships between the texts. The hypothesis focuses on how finding topics in a classified corpus will improve the coherence of the retrieved topics. Therefore, integrating the classification of two corpora with semantic embedding models for topic discovery is the main contribution of this paper.
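The way classes feed topic discovery can be sketched as a grouping step followed by one topic-model run per class. In the sketch below, the labels stand for the classifier's output, and `top_words_placeholder` is a hypothetical stand-in that returns the most frequent words; it is not LDA, PLSA, or LSA and only marks where such an algorithm would be called.

```python
from collections import Counter, defaultdict

def group_by_class(documents, labels):
    """Collect tokenized documents under their assigned class label."""
    groups = defaultdict(list)
    for tokens, label in zip(documents, labels):
        groups[label].append(tokens)
    return dict(groups)

def top_words_placeholder(class_documents, k=3):
    """Stand-in for a topic model: the k most frequent words of one class."""
    counts = Counter(word for tokens in class_documents for word in tokens)
    return [word for word, _ in counts.most_common(k)]

# Hypothetical tokenized documents and the classes assigned to them.
documents = [["engine", "car", "car"], ["goal", "team", "goal"], ["car", "road"]]
labels = ["autos", "sport", "autos"]

groups = group_by_class(documents, labels)
# One topic-discovery run per class, instead of one run over the whole corpus.
topics_per_class = {label: top_words_placeholder(docs, k=1) for label, docs in groups.items()}
```

Because each run only sees documents of a single class, the words it returns are specific to that class rather than to the corpus as a whole.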
Semantic Embedding Model
In this paper, the semantic embedding model used in text classification was developed in [
3], which involves semantic relationships of synonymy, hyponymy, and hyperonymy.
In [
3], the authors presented a novel approach based on relationships extracted from Wikipedia to create embedding models. The creation of embedding models is conditional on the available semantic relations in the text. The process focuses on extracting semantic relationships from an English corpus from Wikipedia. Synonymy, hyponymy, and hyperonymy relationships are extracted with a set of lexical–syntactic patterns from the literature. The relationships are embedded using the procedure proposed by [
11] based on matrix factorization. A text classification using CNN was carried out to compare the performance of the relationship-based embeddings and the word-based models, such as
fastText, GloVe, and the WordNet-based model presented in [
11]. Therefore, the performance of the semantic embedding model based on three semantic relationships was the best result obtained by [
3]. For that reason, in this paper, the semantic embedding model incorporated in text classification for topic discovery is shaped by these three semantic relations.
3.2. Topic Discovery
Topic discovery breaks down a large corpus of text into a small set of interpretable topics, allowing a domain expert to explore and analyze a corpus efficiently [
47]. In the literature, algorithms for topic discovery include latent Dirichlet allocation (LDA), latent semantic analysis (LSA), and probabilistic latent semantic analysis (PLSA). However, some authors have incorporated additional procedures into their approaches, such as classifying text before topic discovery. On the other hand, topic discovery has been an essential part of different NLP tasks, for example, sentiment analysis and decision-making.
Latent Dirichlet allocation is an algorithm used for topic discovery. The model is based on the hypothesis that each text contains words or terms from different topics. The model must be given, a priori, the text and the number of topics to find. It rests on the premise that the topic distributions of the texts and the word distributions of the topics follow Dirichlet priors [
48].
Figure 2 presents a graphical representation of the LDA model.
Where:
M denotes the number of documents.
N denotes the number of words in a given document (document i has $N_i$ words).
$\alpha$ denotes the parameter of the Dirichlet prior on the per-document topic distributions.
$\beta$ denotes the parameter of the Dirichlet prior on the per-topic word distribution.
$\theta_i$ denotes the topic distribution for document i.
$\varphi_k$ denotes the word distribution for topic k.
$z_{ij}$ denotes the topic for the j-th word in document i.
$w_{ij}$ denotes the specific word.
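The generative story behind the graphical model can be sketched in a few lines: a word distribution is drawn for each topic, a topic distribution is drawn for each document, and every word is produced by first picking a topic and then a word from that topic. The variable names follow the standard LDA notation (alpha, beta, theta, phi, z, w); the vocabulary, sizes, and hyperparameter values are assumptions of this example, and the sketch generates data rather than inferring topics.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, words_per_doc, vocabulary, num_topics,
                    alpha, beta, seed=0):
    """Generate documents following the LDA generative process."""
    rng = random.Random(seed)
    # phi[k]: word distribution for topic k, drawn from a Dirichlet(beta) prior.
    phi = [sample_dirichlet([beta] * len(vocabulary), rng) for _ in range(num_topics)]
    corpus, thetas = [], []
    for _ in range(num_docs):
        # theta: topic distribution for this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        document = []
        for _ in range(words_per_doc):
            z = rng.choices(range(num_topics), weights=theta)[0]  # topic of this word
            w = rng.choices(vocabulary, weights=phi[z])[0]        # the word itself
            document.append(w)
        corpus.append(document)
        thetas.append(theta)
    return corpus, thetas, phi

vocabulary = ["car", "engine", "goal", "team"]
corpus, thetas, phi = generate_corpus(
    num_docs=3, words_per_doc=5, vocabulary=vocabulary,
    num_topics=2, alpha=0.5, beta=0.5,
)
```

Topic discovery with LDA runs this process in reverse: given only the documents and the number of topics, it estimates the distributions that most plausibly generated them.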
On the other hand, latent semantic analysis (LSA) is another algorithm used for topic discovery. LSA is a dimensionality reduction procedure based on the singular value decomposition (SVD). LSA, also known as latent semantic indexing (LSI), is an automatic indexing technique that projects terms and texts into a space of reduced dimensions. Reducing the attributes or dimensions of the text helps recover the semantics of the original text: the reduced dimensions capture the important terms or topics [
50].
Figure 3 shows the matrix generated by the LSA model.
Singular value decomposition (SVD) is applied to the resulting matrix A, factoring it into the product of three matrices, $A = U E V^{T}$. The matrices resulting from applying SVD are as follows:
Orthogonal matrix (U): its columns are the left singular vectors of A; they are orthonormal and relate the rows of A to the latent dimensions.
Transposed matrix ($V^{T}$): the transpose of an orthogonal matrix V whose columns are the right singular vectors of A; it relates the columns of A to the latent dimensions.
Diagonal matrix (E): contains the singular values of A on its diagonal; all the elements that do not belong to the diagonal are equal to zero.
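The factorization of A into U, E, and the transpose of V, and the dimensionality reduction LSA obtains from it, can be checked numerically. The sketch below assumes NumPy is available and uses a small invented count matrix with rows as terms and columns as documents; keeping k = 2 dimensions is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical term-document count matrix A (rows: terms, columns: documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

# Reduced SVD: A = U @ E @ Vt, with singular values sorted in descending order.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
E = np.diag(singular_values)

# Keep only the k largest singular values to obtain a rank-k approximation.
k = 2
A_k = U[:, :k] @ E[:k, :k] @ Vt[:k, :]

# Coordinates of each document in the reduced k-dimensional space.
doc_coords = (E[:k, :k] @ Vt[:k, :]).T
```

Each of the k retained dimensions can be read as a latent topic: the columns of `U[:, :k]` weight the terms, and `doc_coords` places every document in the reduced space.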
Finally, probabilistic latent semantic analysis (PLSA) extends LSA with a probabilistic formulation. In PLSA, words are attributed to latent topics or concepts based on the weighted frequency of terms, and these frequencies are interpreted as probabilities. PLSA is a descriptive statistical technique. The probability that a term belongs to the set of terms of a topic or concept depends on the parameters obtained. These parameters are estimated from the frequencies recorded in a matrix, using a multinomial probability model. The objective of PLSA is to estimate the multinomial probability distribution of the words in each topic.
Figure 4 shows the PLSA model graphically. In Figure 4, $P(w, d)$ models the joint probability of seeing a word w and a document (text) d as a mixture of conditionally independent multinomial distributions.
Where:
M denotes the number of texts.
N denotes the number of words in a given text.
d indicates a text.
z denotes the latent or hidden variable (topic).
w denotes a specific word.
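The mixture behind the joint probability of a word and a document can be written out for a toy case. The sketch assumes the common formulation in which a document d selects a topic z, and the topic z then emits a word w, so that P(w, d) = P(d) multiplied by the sum over z of P(z | d) P(w | z). All probability values are invented for illustration.

```python
# Hypothetical PLSA parameters for two documents, two topics, and two words.
p_d = {"d1": 0.5, "d2": 0.5}
p_z_given_d = {
    "d1": {"z1": 0.9, "z2": 0.1},
    "d2": {"z1": 0.2, "z2": 0.8},
}
p_w_given_z = {
    "z1": {"car": 0.7, "goal": 0.3},
    "z2": {"car": 0.1, "goal": 0.9},
}

def joint_probability(word, doc):
    """P(w, d) = P(d) * sum over z of P(z | d) * P(w | z)."""
    mixture = sum(
        p_z_given_d[doc][z] * p_w_given_z[z][word]
        for z in p_z_given_d[doc]
    )
    return p_d[doc] * mixture

p_car_d1 = joint_probability("car", "d1")
total = sum(joint_probability(w, d) for d in p_d for w in ("car", "goal"))
```

With these values, `p_car_d1` is 0.5 × (0.9 × 0.7 + 0.1 × 0.1) = 0.32, and the joint probabilities over all word and document pairs sum to one.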
The evaluation of topic discovery is performed with the normalized topic coherence metric (NPMI) described in Equation (1).
The normalized topic coherence consists of obtaining the normalized coherence of each topic $t$. It measures the semantic relevance of the most important words of a topic, which is computed by the normalized pointwise mutual information (NPMI) over the selected words of each topic; for a pair of words $w_i$ and $w_j$, this is described below:
$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)} \qquad (1)$$
NPMI scores were then computed from the top-k words for each topic, and lexical probabilities $P(w_i)$, $P(w_j)$, and $P(w_i, w_j)$ were calculated by sampling word counts within a sliding context window over an external corpus, in this case, the English Wikipedia.
Normalized coherence is based on obtaining the normalized pointwise mutual information (NPMI) of each pair of words belonging to the k top words representing each topic. The metric is based on calculating the probabilities that the k top words co-occur within the same paragraph of the external text collection, in this case, the English Wikipedia [
53].
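A minimal sketch of the coherence computation follows. It estimates word and word-pair probabilities as the fraction of reference windows that contain them, computes NPMI for each pair of top words, and averages over the pairs. Averaging (rather than summing) the pair scores, returning -1 for pairs that never co-occur, and the toy reference windows are all assumptions of this example.

```python
import math
from itertools import combinations

def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information of two words, in [-1, 1]."""
    if p_ij == 0.0:
        return -1.0  # assumed convention: the words never co-occur
    if p_ij >= 1.0:
        return 1.0   # both words appear in every window
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)

def topic_coherence(top_words, windows):
    """Average NPMI over all pairs of top words.

    `windows` is a list of word sets, e.g. paragraphs or sliding windows
    taken from a reference corpus.
    """
    n = len(windows)

    def prob(*words):
        return sum(1 for window in windows if all(w in window for w in words)) / n

    scores = [
        npmi(prob(a), prob(b), prob(a, b))
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

# Hypothetical reference windows standing in for the external corpus.
windows = [{"car", "engine"}, {"car", "engine"}, {"car", "road"}, {"goal", "team"}]
coherence = topic_coherence(["car", "engine"], windows)
```

In this toy case, P(car) = 0.75, P(engine) = 0.5, and P(car, engine) = 0.5, giving log(4/3) / log(2), about 0.415. Values near 1 indicate words that almost always appear together, and values near -1 indicate words that never do.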
4. Results and Discussion
This section presents the results of integrating text classification for topic discovery with semantic embedding models. In addition, the results obtained are compared with those found in the literature and between the datasets used in this work.
4.1. Datasets
A corpus in English from Wikipedia was used as a reference corpus to evaluate the topics.
Table 1 shows the text and token numbers for each dataset, i.e., Wikipedia for the evaluation of topics, and
Reuters (
https://trec.nist.gov/data/reuters/reuters.html, accessed on 1 May 2020) and
20-Newsgroups (
http://qwone.com/~jason/20-Newsgroups/, accessed on 1 May 2020) for topic discovery. The Wikipedia corpus was chosen due to its diverse range of topics, leading to relationships between some words. The
Reuters and
20-Newsgroups corpora were chosen because most authors in the literature use these datasets. In addition, the 20-Newsgroups and Reuters corpora are for general purposes.
4.2. Experimental Results
The proposed approach was evaluated with the normalized topic coherence metric described in the Equation (
1). The corpora used were the classes of
Reuters and
20-Newsgroups corpora.
Table 2 shows some classes recovered in each corpus.
For the topic discovery process, different configurations of parameters and sizes were used. Twenty, fifty, and one hundred topics with the ten top words were identified for the 20-Newsgroups corpus. For the Reuters Corpus, only twenty and fifty topics with the ten top words were identified.
The mean and standard deviations were extracted from the results obtained. The objective was to identify trends in the 20-Newsgroups and Reuters corpora. In this way, it was possible to analyze the results of each corpus.
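The summary statistics can be sketched with the standard library. The coherence scores below are invented placeholders, and the use of the population standard deviation (`pstdev`) instead of the sample standard deviation is an assumption, since the text does not state which one was used.

```python
from statistics import mean, pstdev

# Hypothetical normalized coherence values for the topics of two classes.
coherence_by_class = {
    "class_a": [0.21, 0.15, 0.18],
    "class_b": [0.10, 0.14, 0.12],
}

# Mean and standard deviation of the topic coherence within each class.
summary = {
    label: (mean(scores), pstdev(scores))
    for label, scores in coherence_by_class.items()
}

# Mean and standard deviation across all topics of all classes.
all_scores = [s for scores in coherence_by_class.values() for s in scores]
overall = (mean(all_scores), pstdev(all_scores))
```

A low standard deviation indicates that the topics of a class have similar coherence, which makes the mean a fair summary of that class.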
The number of topics (
n), mean (
), and standard deviation (
) of the normalized coherence of the topics extracted from each identified class from the
20-Newsgroups and
Reuters corpora are presented in
Table 3,
Table 4 and
Table 5.
Table 3 presents the results of the topics extracted from each class from the
20-Newsgroups and
Reuters corpora with the LDA algorithm. The average and standard deviations of the normalized coherence of each identified topic in each class were obtained by extracting 20, 50, and 100 topics from the
20-Newsgroups corpus. The LDA algorithm achieved its highest normalized coherence when extracting 20 topics with 10 top words from the corpus classes. On the other hand, for the
Reuters Corpus, only 20 and 50 topics with the ten top words were identified. Here too, the LDA algorithm obtained its highest normalized coherence when extracting 20 topics with 10 top words from the corpus classes. However, for each class belonging to the
20-Newsgroups corpus, when 50 and 100 topics were identified with the LDA algorithm, the results did not exceed those obtained when discovering 20 topics with the same algorithm. The same situation occurs with the
Reuters corpus in each class when 50 topics were identified.
Table 4 presents the results obtained when discovering twenty, fifty, and one hundred topics for each class in the
20-Newsgroups corpus. On the other hand,
Table 5 shows the results obtained from the twenty and fifty topics identified for each class in the
Reuters corpus. The results in both cases were obtained by applying the LSA and PLSA algorithms, respectively. However, they are lower than the results obtained with the LDA algorithm.
In total, 10,200 topics were obtained with the LDA, LSA, and PLSA algorithms across the twenty classes of the 20-Newsgroups corpus. For the Reuters corpus, 12,600 topics were obtained from the ninety classes with the LDA, LSA, and PLSA algorithms.
Table 6 presents classes of the
Reuters Corpus, and
Table 7 shows classes of the
20-Newsgroups corpus. They only present the three top words obtained with the
LDA, LSA, and
PLSA algorithms, respectively.
The results obtained with our proposed approach offer insight into integrating classes as input data for each topic discovery algorithm. Although the coherence values are somewhat low, we hypothesize that they depend on additional parameters, such as the number of epochs in text classification, the number of topics, and the embedding model previously applied. The approach provides consistent topics relevant to the language and domain used. The approach is applied to the LDA, LSA, and PLSA techniques, varying the number of topics (twenty, fifty, and one hundred) to gauge the approach’s behavior relative to the number of topics identified.
The results obtained were compared with those existing in the literature; the conclusion is that, in this paper, better results were obtained when considering twenty identified topics with the ten top words.
Table 8 presents the results of different authors in the literature. This work identified topics from each class from which they were extracted. However, even when the authors listed in
Table 8 did not discover topics by class, the results obtained by applying the topic coherence metric are higher than those disclosed in this work.
The authors used the normalized topic coherence evaluation metric and the
20-Newsgroups and/or
Reuters corpora. However, not all authors applied document classification or clustering algorithms before performing topic discovery. The
Based in column shows the methods or algorithms applied by the authors in their papers. The results obtained in this paper are higher than those obtained by the authors of [
14,
15,
16,
17,
18,
20]. On the other hand, some authors, such as [
18], obtained higher coherence values for the
20-Newsgroups corpus. For the Reuters corpus, the authors of [19,27] obtained higher coherence values than those obtained in this work. In [18,27], the results exceed those obtained in this paper because the authors applied additional algorithms and methods, such as variational autoencoders, community detection, and community mining. We consider that these methods benefited the authors’ results. In this paper, the objective was to integrate text classification into topic discovery; hence, no additional method was contemplated.
The proposed approach obtained a coherence of 0.1723 for the 20-Newsgroups corpus using the LDA algorithm with 20 topics, and 0.1441 for the Reuters Corpus with the LDA algorithm with 20 topics.
The best topic coherence was obtained with 20 extracted topics. The 20-Newsgroups corpus has 20,000 documents divided into 20 classes, which allowed experiments extracting 20, 50, and 100 topics and resulted in specific, coherent topics. The Reuters Corpus, on the other hand, has fewer documents divided into 90 classes; this dispersion affects coherence because repeated words are added to each topic. In addition, the LDA algorithm weighs words better by assigning a higher value to a word; therefore, this algorithm has better results.