1. Introduction
Since social media-based data or online media data is composed of natural language, it has a much larger and more complex structure than existing transaction data [1,2]. Recently, the media has been distributing news articles online in order to deliver news to consumers quickly, and online news articles can be used to identify current social trends and the behavioral patterns of members of society [3]. Social trend analysis technology for content published in online media has the advantage of being less expensive and faster than analysis by existing expert groups. Therefore, research to detect and monitor current major issues by analyzing unstructured text information from social media or online news posts and extracting useful knowledge is being actively conducted.
For social trend analysis, it is important to identify event sentences from text documents such as social media or online news articles [4]. An event sentence is a sentence that expresses specific content about a specific topic, i.e., who, where, when, what, etc. For example, the temporal and spatial information included in news articles is used to detect the early onset of disease and to determine the time and location of disease outbreaks [5]. The temporal and spatial information presented in online news articles therefore plays a decisive role in understanding social trends.
Existing research on detecting spatial and temporal information in text focuses on how accurately all temporal and spatial information contained within a document is extracted [6,7,8]. A document can contain many pieces of information about time and space. In this study, among the various spatial and temporal information included in a document, the temporal and spatial information describing the core topic of the document is defined as ‘representative spatio-temporal information’. A document that includes representative spatio-temporal information is defined as a ‘representative spatio-temporal document’. If a large amount of general spatio-temporal information is extracted from a document along with the representative spatio-temporal information, the accuracy of core event analysis based on spatio-temporal information can be lowered. In order to increase the accuracy of core event analysis through artificial intelligence, it is necessary to remove unnecessary spatio-temporal information from a document and extract only the representative spatio-temporal information that accurately describes the core event in the document. Since extracting representative spatio-temporal information from a single document is a high-cost task, it is difficult to treat all documents in big data such as social media-based data or online news articles as analysis targets. Therefore, in order to analyze core events efficiently through representative spatio-temporal information, it is important to first select the documents from which representative spatio-temporal information is to be extracted.
Research applying machine learning (Naïve Bayes [9,10], SVM [11,12], Random Forest [13,14], etc.) to automatic document classification problems has been conducted for some time. Recently, as the deep learning-based Convolutional Neural Network (CNN) has been used for document classification, the performance of automatic document classification has greatly improved [15]. CNNs first attracted attention in the field of artificial intelligence for their excellent performance in image classification and object detection [16,17,18]. Classification technology using CNNs has since expanded its field of application from images to text [19]. Recently, CNN-based document classification has been applied to documents of specific domains (patent documents [20], contracts [21], infectious disease documents [22], etc.).
In this paper, we propose a character-level CNN-based representative spatio-temporal document classification model. First, we built 7400 learning data items from online news articles provided by the National Institute of the Korean Language [23]. We then developed a character-level CNN-based document classifier, named RepSTDoc_ConvNet, that can classify representative spatio-temporal documents. RepSTDoc_ConvNet has deeper convolutional and fully connected layers than the existing CNN-based document classification model. In order to evaluate the performance of the proposed CNN model, we compared RepSTDoc_ConvNet with three baseline machine learning classifiers (Gaussian Naïve Bayes, linear SVM, and random forest) and three deep learning-based models (ConvNet, DocClass_ConvNet [22] and DocClass_ConvNet_Mod).
The final goal of our study is to extract representative spatio-temporal information from a large number of documents. In order to extract representative spatio-temporal information, it is first necessary to identify, within a large number of documents, the representative spatio-temporal documents that contain it. This paper corresponds to that classification stage. The extracted representative spatio-temporal information can then be used for natural disaster detection and for the analysis of factors (events such as urban planning, building construction, traffic control, and store opening) that influence business district analysis.
Our main contributions are summarized as follows.
We defined a novel problem of classifying representative spatio-temporal documents containing spatio-temporal information describing the core topic of a document.
We built 7400 learning data items for representative spatio-temporal documents.
We proposed a character-level CNN-based document classifier to classify representative spatio-temporal documents.
The proposed RepSTDoc_ConvNet outperforms traditional machine learning classifiers, achieving an F1-score of 61.2%.
The rest of the paper is organized as follows. Section 2 presents the literature review. In Section 3, we define the research problem. Section 4 presents the proposed CNN-based document classifier model. In Section 5, we provide the experimental results and discuss their implications. Section 6 presents the conclusion.
3. Problem
In this section, we first define several concepts as well as the problem of representative spatio-temporal documents.
Subject of the document. Let D = {d1, …, dn} be a set of documents. Each document has a core subject, which is the message the author wants to convey to the reader. For example, consider a news article reporting the damage of a typhoon that occurred on Jeju Island, South Korea on September 7. di.subject = {‘typhoon damage’} denotes that the subject of di is the damage caused by the typhoon that occurred on Jeju Island on September 7.
Spatio-temporal word. di = {s1, …, sm} is a sequence of sentences and si = {w1, …, wl} is a sequence of words. Among the words contained in a document, there are words for a specific time and place where an event occurred. wi.time = {‘September 7’} denotes that an event occurred on September 7. wj.place = {‘Jeju Island’} denotes that the place where an event occurred is Jeju Island.
Representativeness of spatio-temporal word. Several spatio-temporal words can exist in one document. Some of the spatio-temporal words are related to the subject of the document, and some are not. Among the spatio-temporal words, we consider the words most relevant to the subject of a document to be ‘representative spatio-temporal words’. We denote a representative spatio-temporal word by wi.representativeness = true.
Representative spatio-temporal document. We define a representative spatio-temporal document as a document whose words include both a representative spatial word and a representative temporal word.
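The definitions above can be sketched as a small data model. This is only an illustration under stated assumptions, not the implementation used in this study: the `Word` class, its field names, and the function `is_representative_st_document` are hypothetical names introduced here, and the example words are made up around the typhoon article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One word of a document (hypothetical structure for illustration)."""
    text: str
    time: Optional[str] = None        # e.g. 'September 7' if the word is temporal
    place: Optional[str] = None       # e.g. 'Jeju Island' if the word is spatial
    representativeness: bool = False  # True if most relevant to the subject


def is_representative_st_document(sentences: List[List[Word]]) -> bool:
    """True if the document holds both a representative temporal word
    and a representative spatial word."""
    words = [w for sentence in sentences for w in sentence]
    has_time = any(w.time is not None and w.representativeness for w in words)
    has_place = any(w.place is not None and w.representativeness for w in words)
    return has_time and has_place


# A document is a sequence of sentences; a sentence is a sequence of words.
typhoon_article = [
    [Word("typhoon"), Word("Jeju Island", place="Jeju Island", representativeness=True)],
    [Word("September 7", time="September 7", representativeness=True)],
    [Word("Seoul", place="Seoul")],  # mentioned, but not tied to the subject
]
background_only = [[Word("Seoul", place="Seoul"), Word("last year", time="last year")]]
```

Under these assumptions, `typhoon_article` counts as a representative spatio-temporal document, while `background_only` does not, because its spatio-temporal words are not representative.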
5. Result and Discussion
In this section, we present comprehensive experimental results of the deep learning model. The purpose of this paper is to develop a classifier for representative spatio-temporal documents based on deep learning. To evaluate the performance of a proposed deep learning-based classifier, we first evaluated the performance of three traditional machine learning algorithms: Gaussian Naïve Bayes, Linear SVM, and Random Forest. For performance comparison with our CNN model (RepSTDoc_ConvNet), we also evaluated the performance of DocClass_ConvNet, an existing CNN-based document binary classifier, and DocClass_ConvNet_Mod, which adjusted hyper-parameters in the DocClass_ConvNet model to fit our dataset.
To confirm that our CNN model works properly, we pre-tested the performance of binary classification using the benchmark spam dataset from the UCI Repository [31]. The spam dataset contained 5572 messages in English. This spam dataset was fed to our proposed CNN model and the experimental results were as follows: accuracy (0.982), precision (0.962), recall (0.916), and F1-score (0.938). This result is not significantly different from that of the recently published CNN model [32].
All experiments were conducted on a GeForce RTX 2080 Ti 11 GB GPU and an Intel(R) Xeon CPU with 64 GB of memory.
5.1. Performance Evaluation
For the experiment, we divided the collected data into training (60%), validation (20%), and test data (20%), as shown in Table 2. The target class (representative spatio-temporal documents) accounts for about 25.23% of each split. The training data was used to train the model, the validation data was used to select the best-performing model during training, and the test data was used to evaluate the performance of the finally selected model.
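A split that keeps the share of the target class roughly equal across the three parts can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the function name `stratified_split`, the seed, and the toy label list (1000 documents, 25% positive) are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], ratios=(6, 2, 2), seed: int = 0
                     ) -> Tuple[List[int], List[int], List[int]]:
    """Split document indices into train/validation/test, class by class,
    so that each part keeps about the same class proportions."""
    rng = random.Random(seed)
    total = sum(ratios)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = len(idx) * ratios[0] // total
        n_val = len(idx) * ratios[1] // total
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]  # remainder goes to the test part
    return train, val, test


# Toy labels: 1 = representative spatio-temporal document, 0 = other.
labels = [1] * 250 + [0] * 750
train_idx, val_idx, test_idx = stratified_split(labels)
positive_rate = [sum(labels[i] for i in part) / len(part)
                 for part in (train_idx, val_idx, test_idx)]
```

With the toy labels, the three parts hold 600, 200, and 200 documents, and each has the same positive rate as the full set.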
5.2. Hyper-Parameter Tuning
A CNN has several hyper-parameters, such as kernel size, batch size, dropout rate, learning rate, pooling window size, pooling type, activation function, number of neurons in a dense layer, and optimization function. We found the most suitable parameter values for the proposed model by manually adjusting the value of each parameter, using the learning curves for the accuracy and loss of the training data and validation data in every experiment. The parameters used in the experiments are summarized in Table 3, and the parameter values with the highest performance are shown in bold. During the training process, we trained our CNN model for up to 1000 epochs with early stopping patience = 220.
Overfitting makes it difficult to trust a deep learning model's predictive performance on new data. Therefore, training should be stopped when the loss on the validation data is no longer decreasing during the training phase. Early stopping is one of the regularization techniques that help neural networks avoid overfitting [33]. We can use the EarlyStopping callback to terminate training early when the performance index of the model does not improve for a set number of epochs. Through a combination of the EarlyStopping and ModelCheckpoint callbacks, it is possible to trigger an early shutdown for non-improving training and to resume training by reloading the best model saved by ModelCheckpoint. Both training loss and validation loss decrease until overfitting occurs, but once overfitting occurs, training loss continues to decrease while validation loss increases. Thus, we set the monitor option of the EarlyStopping callback so that training stops when the validation loss increases.
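The patience rule described above can be sketched in plain Python. This is a simplified illustration of the logic only, not the callback code used in this study: the function `early_stopping_epoch` and the loss values are hypothetical, and a patience of 3 is used instead of 220 to keep the example short.

```python
from typing import List, Tuple


def early_stopping_epoch(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. best_epoch is the epoch
    whose weights a checkpoint of the best model would hold.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improving first, then rising as overfitting sets in.
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.61, 0.64, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

Here the best validation loss is reached at epoch 3, and training stops at epoch 6 after three epochs without improvement; the model from epoch 3 is the one that would be kept.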
5.3. Experimental Results
We compared RepSTDoc_ConvNet with three baseline machine learning classifiers (Gaussian Naïve Bayes, linear SVM, and random forest) and three deep learning models (ConvNet, DocClass_ConvNet, and DocClass_ConvNet_Mod). DocClass_ConvNet is a model whose CNN layers and hyper-parameters are identical to those presented in the original study. DocClass_ConvNet_Mod is a model that keeps the CNN layers of DocClass_ConvNet but optimizes the hyper-parameter values for the experimental data. Deep learning involves randomly initializing weight values during model training. Therefore, to compensate for this randomness, each experiment was performed 10 times and the average performance was measured. The experimental results are presented in Table 4.
The accuracy of the machine learning algorithms in classifying representative spatio-temporal documents ranged from a minimum of 0.74 to a maximum of 0.79. This accuracy is far below the performance of machine learning on general document classification problems. In contrast, the CNN layers used in this paper achieve relatively high performance on the spam classification problem. From these results, it can be seen that classifying representative spatio-temporal documents is a difficult problem.
Random Forest showed the highest precision with 0.729 and DocClass_ConvNet_Mod showed the highest accuracy with 0.794. RepSTDoc_ConvNet showed the highest recall and F1-score with 0.673 and 0.612, respectively. In terms of accuracy, DocClass_ConvNet_Mod seems to have the highest performance with 0.794. However, considering the confusion matrix, it does not seem appropriate to evaluate the performance of machine learning by accuracy alone in the problem of classifying representative spatio-temporal documents. Figure 3 shows the confusion matrices of Linear SVM, Random Forest, and RepSTDoc_ConvNet.
In the validation data used to evaluate the proposed CNN model, the proportion of representative spatio-temporal documents (RepSTDoc) is only 25.20%. Therefore, even a model that is not trained at all and predicts every document as negative reaches an accuracy of 74.80%. In this case, high accuracy is maintained even if the number of documents the model predicts as RepSTDoc is small. In Figure 3a, Linear SVM classified 123 documents (46 false positives, 77 true positives) as RepSTDoc. Even though the model is not trained properly, the high true negative value (471) results in high accuracy. Random forest, which has the second-highest accuracy, behaves similarly to Linear SVM. Its accuracy is 0.770 even though it classified few documents as RepSTDoc (48), because the model is hardly trained. That the number of documents predicted as RepSTDoc is small because the model is not trained can be confirmed by the small recall value (0.191). In Figure 3c, RepSTDoc_ConvNet classified 257 documents (123 false positives, 134 true positives) as RepSTDoc. In RepSTDoc_ConvNet, as the number of true positives increased, the number of false positives also increased. That the model classified many documents as RepSTDoc can be seen from the high value of recall (0.609). This phenomenon occurs because the numbers of positive and negative documents in the data are imbalanced. Therefore, in order to evaluate the performance of the model accurately, the F1-score, which considers both precision and recall, should be used as the measure. In terms of the F1-score, RepSTDoc_ConvNet yields the highest performance with 0.609.
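The argument that accuracy is misleading on imbalanced data can be checked with a short calculation. The sketch below uses the standard definitions of accuracy, precision, recall, and F1-score; the function name `classification_metrics` and the set of 1000 documents with 252 positives (25.2%) are hypothetical and chosen only to mirror the class proportion quoted above.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 1000 documents, 252 of them RepSTDoc (25.2%).
# A model that never predicts RepSTDoc gets every negative right
# and every positive wrong.
always_negative = classification_metrics(tp=0, fp=0, fn=252, tn=748)
```

The always-negative model reaches an accuracy of 0.748 while its recall and F1-score are both 0, which is why the F1-score is the more informative measure here.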
We measured the classification accuracy of human workers on 1400 learning data items to verify how challenging the representative spatio-temporal document classification problem is. The 1400 items consist of 359 representative spatio-temporal documents and 1041 non-representative spatio-temporal documents. Four workers who had participated in building the learning data classified each of the 1400 items. For each item, we counted the number of workers who judged an actual representative spatio-temporal document as a representative spatio-temporal document (True Positive: TP) and the number of workers who judged an actual non-representative spatio-temporal document incorrectly (False Negative: FN).
For each actual representative spatio-temporal document, we counted how many of the four workers judged it as TP. Table 5 reports the ratio of documents judged as TP by all four workers, by three or more, by two or more, and by one or more. Of the 359 representative spatio-temporal documents, 189 (52.64%) were judged as TP by all four workers, 251 (69.92%) by three or more, 310 (86.35%) by two or more, and 332 (92.48%) by one or more.
For each actual non-representative spatio-temporal document, we likewise counted how many of the four workers judged it as FN. Table 6 reports the ratio of documents judged as FN by all four workers, by three or more, by two or more, and by one or more. Of the 1041 non-representative spatio-temporal documents, 5 (0.48%) were judged as FN by all four workers, 24 (2.31%) by three or more, 6.34% by two or more, and 135 (12.97%) by one or more.
We assess the difficulty of the representative spatio-temporal document classification problem through the ratio of actual representative spatio-temporal documents that at least three workers, i.e., more than half of the judges, judged as representative. Three or more workers judged the actual representative spatio-temporal document as TP for only about 70% of the documents, and all four workers did so for only about 53%, confirming that it is difficult even for humans to identify representative spatio-temporal documents in a large set of documents.
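The "k or more workers" ratios can be recomputed from per-document vote counts. The sketch below is an illustration only: the function `at_least_k_ratios` is a hypothetical name, and the vote list is rebuilt from the cumulative counts reported above for the 359 representative documents (189, 251, 310, and 332), not from the original annotation records.

```python
from typing import Dict, List


def at_least_k_ratios(votes_per_doc: List[int], n_workers: int = 4) -> Dict[int, float]:
    """For each k, the share of documents that at least k workers judged as TP."""
    n_docs = len(votes_per_doc)
    return {
        k: sum(1 for v in votes_per_doc if v >= k) / n_docs
        for k in range(n_workers, 0, -1)
    }


# Vote counts rebuilt from the cumulative figures in the text:
# 189 documents with 4 votes, 251 with >= 3, 310 with >= 2, 332 with >= 1.
votes = (
    [4] * 189
    + [3] * (251 - 189)
    + [2] * (310 - 251)
    + [1] * (332 - 310)
    + [0] * (359 - 332)
)
ratios = at_least_k_ratios(votes)
```

Each entry of `ratios` is the corresponding count divided by 359, so the majority criterion (three or more workers) corresponds to `ratios[3]`.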
5.4. Effect of Learning Rate
The learning rate refers to the amount by which the weights are updated during model training and determines how quickly the model adapts to the problem. A learning rate that is too large can cause the model to converge quickly to a suboptimal solution, while a learning rate that is too small can cause training to stall. The learning rate is therefore one of the important hyper-parameters that must be selected appropriately when training a deep neural network model. We experimented with the effect of the learning rate [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001] on performance.
Figure 4 shows the effect of the learning rate for ConvNet, DocClass_ConvNet_Mod, and RepSTDoc_ConvNet. Learning rates at which a model failed to train (0.1, 0.01, and 0.000001) are not shown on the graph. In the range where the models are trained, the F1-score tends to increase as the learning rate decreases. There is a large difference in performance according to the learning rate in each model. On the representative spatio-temporal learning data used in this study, the highest performance is obtained at a learning rate of 0.00001.
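The general sensitivity to the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which is an assumption made purely for illustration; it does not reproduce the CNN experiments, and the learning rates are chosen to suit this toy function rather than taken from Table 3.

```python
def gradient_descent(lr: float, steps: int = 50, w0: float = 1.0) -> float:
    """Minimise f(w) = w**2 by gradient descent and return |w| after `steps` updates."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad   # weight update scaled by the learning rate
    return abs(w)


too_large = gradient_descent(lr=1.1)     # each step overshoots, |w| grows
moderate = gradient_descent(lr=0.1)      # |w| shrinks towards the minimum at 0
too_small = gradient_descent(lr=0.0001)  # |w| barely moves away from 1
```

With a rate that is too large the iterate diverges, and with a rate that is too small it has hardly moved after 50 steps, which corresponds to the cases where a model fails to train.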
5.5. Effect of Batch Size
Most training of deep learning models is based on mini-batch stochastic gradient descent (SGD). The batch size is therefore one of the important hyper-parameters when training a model in practice. Various studies are being conducted on the effect of the batch size on model training. Although the effect has not yet been clearly identified, several studies have observed experimentally that a small batch size has a positive effect on generalization performance. We experimented with the effect of the batch size [16, 32, 64, 128, and 256] on performance.
Figure 5 shows the effect of batch size for ConvNet, DocClass_ConvNet_Mod, and RepSTDoc_ConvNet. In the representative spatio-temporal learning data used in this study, there was no consistent performance variability across models. RepSTDoc_ConvNet shows a tendency to improve performance as the batch size increases in the model training section [32, 64, 128, and 256]. However, in DocClass_ConvNet_Mod, the variation of performance according to the batch size was not consistent. Although this result cannot be generalized, the batch size may not affect the performance of the model depending on the complexity of the CNN layer and the characteristics of the data.
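One concrete thing the batch size changes in mini-batch SGD is the number of weight updates per epoch. The short calculation below illustrates this under the assumption that the training split holds 60% of the 7400 learning data; the exact split sizes are given in Table 2 and may differ slightly.

```python
import math

# Assumption: 60% of the 7400 learning data are used for training.
n_train = 7400 * 60 // 100  # 4440 documents

batch_sizes = [16, 32, 64, 128, 256]

# One weight update per mini-batch; the last batch may be smaller.
updates_per_epoch = {b: math.ceil(n_train / b) for b in batch_sizes}
```

Under this assumption, a batch size of 16 gives 278 updates per epoch and a batch size of 256 gives only 18, so larger batches mean fewer but less noisy updates for the same number of epochs.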
5.6. Time Efficiency
The numbers of weights are 1,410,609, 1,446,261, and 5,083,129 in DocClass_ConvNet_Mod, ConvNet and RepSTDoc_ConvNet respectively. The overall algorithm time is affected by the complexity of the neural network. This is because the amount of computation increases as the number of weights in the network increases.
Table 7 shows the time efficiencies for the three algorithms.
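The link between network depth and the number of weights can be made concrete with the usual parameter-count formulas for one-dimensional convolutional and fully connected layers. The sketch below is illustrative only: the helper names and the layer sizes (70 input channels, 256 filters, kernel size 7, a 1024-unit dense input) are hypothetical and are not the configuration of any of the three models.

```python
def conv1d_params(in_channels: int, filters: int, kernel_size: int) -> int:
    """Weights of a 1D convolution: one kernel plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(in_features: int, out_features: int) -> int:
    """Weights of a fully connected layer: weight matrix plus biases."""
    return (in_features + 1) * out_features


# Hypothetical layer sizes, for illustration only.
first_conv = conv1d_params(in_channels=70, filters=256, kernel_size=7)
output_layer = dense_params(in_features=1024, out_features=1)
```

Every additional convolutional or fully connected layer adds a term of this kind to the total, which is how a deeper network ends up with several times as many weights.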
5.7. Data Distribution Rate
We also investigated the performance difference according to the distribution ratio of training, validation, and test data. The ratio of training data was varied while keeping the ratios of validation data and test data equal to each other. The distribution ratios of training, validation, and test data used in the experiment are 4:3:3, 6:2:2, and 8:1:1.
Figure 6 shows the highest performance with a 6:2:2 distribution ratio. There is not much difference in the performance of each model according to the distribution ratio.
5.8. Receiver Operating Characteristic
The Receiver Operating Characteristic (ROC) curve shows the performance of the binary classifier for various thresholds.
Figure 7 shows the corresponding ROC curves when using ConvNet, DocClass_ConvNet_Mod, and RepSTDoc_ConvNet. ConvNet outperformed the other models in the lower-left corner. However, in the section where the false positive rate is greater than 0.2, RepSTDoc_ConvNet was superior to other models. RepSTDoc_ConvNet was found to have the best performance for classifying representative spatiotemporal documents.
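An ROC curve is built by sweeping the decision threshold over the predicted scores and recording the false positive rate and true positive rate at each threshold. The sketch below shows this construction in plain Python; the function names and the six scores and labels are hypothetical toy values, not outputs of the models compared here.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs, one per distinct threshold, from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores (predicted probability of RepSTDoc) and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

A curve that lies higher in a given range of false positive rates, as described for RepSTDoc_ConvNet above 0.2, recovers more true positives at the same false positive cost in that range.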
6. Conclusions
The purpose of this paper is to develop a CNN-based representative spatio-temporal document classification model. Because the representative spatio-temporal document is a novel concept, we defined a representative spatio-temporal document as a document containing spatio-temporal information that describes the core topic of the document. We built 7400 learning data items to train a CNN-based representative spatio-temporal document classifier and developed a character-level CNN-based document classifier to classify representative spatio-temporal documents. To evaluate the performance of RepSTDoc_ConvNet, we evaluated the performance of three traditional machine learning algorithms: Gaussian Naïve Bayes, Linear SVM, and Random Forest. For performance comparison with our RepSTDoc_ConvNet, we also evaluated the performance of ConvNet, DocClass_ConvNet, and DocClass_ConvNet_Mod. The experimental results show that RepSTDoc_ConvNet outperforms traditional machine learning classifiers and existing CNN-based classifiers.
A limitation of this work is that RepSTDoc_ConvNet still has lower performance than general document classifiers. Because the results show that classifying representative spatio-temporal documents is a difficult problem, it is necessary to diversify the features of the input data. In order to further improve the performance of the representative spatio-temporal document classifier, it is necessary to find a way to lower the number of false positives by identifying the characteristics that distinguish a general spatio-temporal document from a representative spatio-temporal document.