1. Introduction
Topic Detection and Tracking (TDT) is an information processing technique for monitoring the information flow in news media [1]. It can detect the appearance of new topics and track their reappearance and evolution [2], helping people cope with the problem of internet information explosion [3]. Topic detection is a sub-task of TDT that helps decision makers find meaningful topics or events in a timely manner [4]; it has attracted a great deal of attention in many application areas, such as public opinion monitoring, emergency management, decision-making support systems, and online reputation monitoring [5,6,7,8]. In the context of news, topic detection and event detection can be viewed as the same concept [9]. Food safety event detection is very important for governments and for society. In recent years, food safety events have occurred frequently, making rapid food safety event detection an urgent problem to be solved. Food safety events include food poisoning, food-borne diseases, food contamination, etc. Examples include the horsemeat scandal in Europe [10], the rat meat found in famous snacks in Korea [11], and the melamine, Sudan red egg, and gutter oil scandals in China [12,13,14]. These events not only caused huge economic losses and brought anxiety to the public, but also seriously undermined the reputation of the relevant governments.
Several approaches have been proposed for event detection, including: (1) document clustering based on news feature extraction and representation [15,16,17], wherein most researchers use Term Frequency-Inverse Document Frequency (TF-IDF) [18] to extract keywords and the Vector Space Model (VSM) [19] to represent news, and then clustering algorithms such as single-pass [20] or k-means [21] are used to cluster the news (news items describing the same event are clustered together to form events); (2) methods based on topic models [22], where Latent Dirichlet Allocation (LDA) [23], Probabilistic Latent Semantic Analysis (PLSA) [24], and their various extensions are used to explore the latent semantic knowledge of documents, i.e., each document is treated as a probability distribution over topics, and the news is then represented by this distribution and clustered accordingly; (3) methods based on neural networks [25], which use deep neural network models such as Doc2vec [26] and Sentence2vec [27] to obtain document vectors and then cluster the document vectors to generate events; and (4) community partitioning methods based on complex networks [28], which take co-occurring words as nodes in the network to build a topic graph and detect topics by community partitioning, using algorithms such as the Kernighan–Lin algorithm [29] and the Blondel algorithm [30]. There are currently only a few studies on food safety event detection [31,32]. These studies are based on LDA and use lower data dimensions, achieving better results than methods based on TF-IDF [31,32].
Nevertheless, the document representation ability of the above approaches is still limited by the low amount of semantic information they capture, and the inference algorithms used in the models can be too complex [4]. In addition, such methods need manual labeling of events and a predefined number of clusters, and have difficulty detecting new events [33], which is not conducive to large-scale data modeling and affects the precision of event detection.
In this paper, TF-IDF is used to calculate the weights of all words in the news, and a fixed number of words are selected as the features of each news document. McMinn et al. [34] proposed a real-time entity-based event detection method for Twitter, which showed that named entities play a crucial role in describing an event. Their entity-based method is able to detect more events than previous approaches while also providing improved precision and retaining low computational complexity. Therefore, we use named entities as part of the features to represent news in this paper and combine them with the feature words obtained by TF-IDF to form the joint features of documents. This significantly reduces the data dimension and computation overhead while effectively retaining the important information of the news. Furthermore, the headlines of food safety news can effectively summarize the news content; therefore, this paper uses the semantics of headlines to update the weights of the joint features, so that words with high similarity to the headline have greater weights. In this paper, we propose the concept of a news “fusion feature”, which fuses multiple features together, including the TF-IDF features, the named entity features, and the headline features. In this way, key information is made more prominent and the document semantics are covered more fully, so a more accurate representation of the news can be obtained to improve the event detection results.
The main contributions of this paper are: (1) the combination of TF-IDF features and named entity features to form the joint features of news; (2) a method for updating the weights of feature words based on the semantics of headlines, which highlights the key information and yields the fusion feature of the news. The multi-feature fusion method proposed in this paper is used for the document representation of food safety news; it enhances the detection results of food safety events and can help regulatory authorities detect food safety events more accurately. In order to verify the effectiveness of the proposed method, experiments were carried out on real food news data, and the experimental results of TF-IDF, LDA, and multi-feature fusion are compared.
2. Methods
This paper proposes a food safety event detection method based on multi-feature fusion. The process is as follows: (1) the news data are preprocessed; (2) TF-IDF is used to calculate the weight of each word in a news document, and the M words with the largest weights in each news document are selected to form a feature word set $F$; (3) the named entities in the news document are recognized using the Bi-LSTM-CNN-CRF framework [35] to form the named entity set $E$, and the joint feature set $K$ is obtained by combining $E$ with the feature set $F$; (4) word2vec is used to obtain the vectors of all words in the news dataset, establishing the dictionary D and the corresponding word vector set $V$; (5) the headline vector $h$ of each headline, preprocessed as in (1), is established; the similarity between each word in the joint feature set $K$ and the headline vector is calculated, the weights of the feature words are updated according to the similarity values, and the VSM is then used to represent the news document; and (6) the single-pass algorithm is used to cluster the news documents and generate events (a minimal sketch of this clustering step is given below). The process is outlined in Figure 1 and described in detail in the following subsections.
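As a concrete reference for step (6), the following is a minimal sketch of single-pass clustering over document vectors, assuming cosine similarity against cluster centroids and a fixed threshold T; all function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def single_pass_cluster(doc_vectors, threshold=0.25):
    """Assign each document to the most similar existing cluster
    (by similarity to the cluster centroid); if no cluster exceeds
    the threshold, start a new cluster, i.e., a new candidate event."""
    centroids, clusters = [], []          # running centroids and member indices
    for idx, vec in enumerate(doc_vectors):
        if centroids:
            sims = [cosine_similarity(vec, c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                clusters[best].append(idx)
                n = len(clusters[best])   # update centroid incrementally
                centroids[best] = centroids[best] * (n - 1) / n + vec / n
                continue
        clusters.append([idx])            # no match: new event cluster
        centroids.append(np.asarray(vec, dtype=float).copy())
    return clusters
```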
2.1. Preprocessing
The data preprocessing includes filtering noise, removing meaningless symbols such as spaces and links, word segmentation, and removing stop words. The news dataset S contains a large number of news documents. After preprocessing, each news document is represented as a bag of words and recorded as a set, which serves as the input to the subsequent components, as shown in Formula (1):

$d_i = \{t_{i1}, t_{i2}, \dots, t_{in}\}, \quad i = 1, 2, \dots, m \qquad (1)$

where m is the number of news documents in the news dataset S, and n is the number of words in each news document.
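A minimal preprocessing sketch, assuming the jieba segmenter for Chinese word segmentation and a user-supplied stop-word list; the file path and regular expressions below are illustrative choices, not details given in the paper.

```python
import re
import jieba  # Chinese word segmentation

def load_stopwords(path="stopwords.txt"):            # illustrative path
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def preprocess(text, stopwords):
    """Remove links and meaningless symbols, segment into words,
    and drop stop words, returning the word bag of one document."""
    text = re.sub(r"https?://\S+", " ", text)                 # strip links
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", " ", text)    # keep Chinese/alphanumeric
    words = jieba.lcut(text)
    return [w for w in words if w.strip() and w not in stopwords]
```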
2.2. TF-IDF Feature Extraction
TF-IDF is a feature extraction algorithm, where TF denotes word frequency, that is, the frequency with which a word appears in a document, and IDF denotes the inverse document frequency. The main idea is that if a word or phrase appears frequently in one document and rarely in other documents, it is considered to have good representation ability for that document. In general, the words or phrases with higher TF-IDF values are more important in the documents. The tf of word $t_i$ appearing in document $d_j$ is calculated by Formula (2):

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (2)$

where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$, and $\sum_{k} n_{k,j}$ is the sum of the occurrence counts of all words in document $d_j$, which performs the normalization. idf is the inverse document frequency and is calculated by Formula (3):

$idf_i = \log \dfrac{|S|}{1 + |\{j : t_i \in d_j\}|} \qquad (3)$

where $|S|$ is the total number of documents in the dataset S, and $|\{j : t_i \in d_j\}|$ denotes the number of documents containing the word $t_i$. In general, $1 + |\{j : t_i \in d_j\}|$ is used as the denominator to avoid it being zero. TF-IDF is calculated by Formula (4):

$tfidf_{i,j} = tf_{i,j} \times idf_i \qquad (4)$

The weights of all words are calculated by TF-IDF; for each news document, the M words with the largest weights are selected to form the feature set $F = \{f_1, f_2, \dots, f_M\}$, with the corresponding weight set $W_F = \{w_1, w_2, \dots, w_M\}$.
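A minimal sketch of this TF-IDF computation and top-M feature selection, operating on the preprocessed word bags; the function and variable names are illustrative.

```python
import math
from collections import Counter

def tfidf_top_m(documents, M=30):
    """documents: list of word bags (lists of tokens), one per news document.
    Returns, per document, a dict of the M highest-weighted words and their weights."""
    N = len(documents)
    # document frequency of each word across the dataset
    df = Counter(w for doc in documents for w in set(doc))
    features = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())                       # normalization term of Formula (2)
        weights = {
            w: (c / total) * math.log(N / (1 + df[w]))     # tf * idf, Formulas (2)-(4)
            for w, c in counts.items()
        }
        top = sorted(weights.items(), key=lambda x: x[1], reverse=True)[:M]
        features.append(dict(top))
    return features
```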
2.3. Named Entity Feature Extraction
Named entities include person names, place names, organization names, and proper nouns. In this paper, named entities are regarded as one type of feature of food safety news. The Bi-LSTM-CNN-CRF framework is used to recognize named entities in the food safety news dataset; the framework is based on Bi-directional Long Short-Term Memory (Bi-LSTM) [36], Convolutional Neural Networks (CNN) [37], and Conditional Random Fields (CRF) [38]. The steps are as follows: first, word embedding is used to obtain the vectors of words; then, CNN is used to encode the character-level information of each word into its character-level representation; the character-level and word-level representations are then fed into the Bi-LSTM to model the context information of each word; finally, a sequential CRF is used to jointly decode the labels for the whole sentence. For each news document, the set of extracted named entities can be expressed as $E = \{e_1, e_2, \dots, e_q\}$.
The TF-IDF feature set $F$ is combined with the named entity feature set $E$ to obtain the joint feature set of the news, as shown in Formula (5):

$K = F \cup E \qquad (5)$

where $K = \{k_1, k_2, \dots, k_{|K|}\}$ with $|K| \le M + q$, and the weight set of the joint feature set $K$ is $W = \{w_1, w_2, \dots, w_{|K|}\}$.
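A sketch of forming the joint feature set by merging the top-M TF-IDF words of a document with its extracted named entities. The initial weight assigned to an entity that is not among the top-M words (a fixed default below) is an assumption, since the paper does not state how entity weights are initialized.

```python
def joint_features(tfidf_features, entities, default_entity_weight=1.0):
    """tfidf_features: dict word -> TF-IDF weight for one document.
    entities: iterable of named entities extracted from the same document.
    Returns the joint feature set K as a dict word -> weight."""
    K = dict(tfidf_features)
    for e in entities:
        # assumption: entities absent from the top-M words get a default weight
        K.setdefault(e, default_entity_weight)
    return K
```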
2.4. Feature Fusion Based on the Semantics of the Headline
In general, the headline of a food safety news item summarizes the news content, as it contains the keywords of the food safety event.
Figure 2 shows a news document about a food safety event. In this Figure, (a) shows the original news in Chinese and (b) shows the English translation of the news in (a). Through the keywords “苏州”(Suzhou), “喜茶”(Heytea), “苍蝇”(flies) in the headline, we can understand what happened in this food safety event.
In this paper, a dictionary D of the food news dataset is constructed, and the vectors of all words in D are obtained by word2vec [39] to form the set $V = \{v_1, v_2, \dots, v_z\}$, where z is the size of D and each vector has 256 dimensions. The preprocessing of the headlines involves removing punctuation marks and spaces, Chinese word segmentation, and removing stop words. Doc2vec is then used to map the headlines into vectors with a fixed dimension, so that the headline vector $h$ of each news document d is obtained; its dimension is also 256. The vectors of words and sentences encode their semantic meaning, and the relationships between words and sentences can be calculated from these vectors.
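A sketch of obtaining the 256-dimensional word vectors and headline vectors with gensim's Word2Vec and Doc2Vec, assuming gensim 4.x; hyperparameters other than the vector dimensionality are illustrative and not taken from the paper.

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_embeddings(token_lists, headline_token_lists, dim=256):
    """token_lists: preprocessed word bags of the news bodies.
    headline_token_lists: preprocessed headlines, one per news document."""
    w2v = Word2Vec(sentences=token_lists, vector_size=dim,
                   window=5, min_count=1, workers=4)
    tagged = [TaggedDocument(words=toks, tags=[i])
              for i, toks in enumerate(headline_token_lists)]
    d2v = Doc2Vec(tagged, vector_size=dim, min_count=1, workers=4)
    word_vectors = w2v.wv                              # dictionary D -> word vector set V
    headline_vectors = [d2v.dv[i] for i in range(len(tagged))]  # headline vector h per document
    return word_vectors, headline_vectors
```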
Each news document is represented by its joint feature set K. Words with high similarity to the headline better represent the key information of a news document and should be given greater weight. For each word $k_i$ in K, the distance between its word vector $v_i$ and the headline vector $h$ is measured by the cosine similarity $sim_i$, which is calculated by Formula (6):

$sim_i = \cos(v_i, h) = \dfrac{v_i \cdot h}{\lVert v_i \rVert \, \lVert h \rVert} \qquad (6)$

Thus, the similarity between every word in the joint feature set K and the headline is obtained, and the similarity set is expressed as $SIM = \{sim_1, sim_2, \dots, sim_{|K|}\}$. The updated weights $w_i'$ are obtained by Formula (7), where $w_i$ is the original weight, $sim_i$ is the similarity of word $i$ to the headline, and $w_i'$ is the updated weight of word $i$. Please note that the coefficient applied to the similarity value in Formula (7) was determined by our preliminary experiment.
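A sketch of the weight update: each joint-feature word's similarity to the headline vector is computed as in Formula (6), and its weight is then increased in proportion to that similarity. The additive update form and the coefficient value below are assumptions, since the paper only states that the coefficient was chosen in a preliminary experiment.

```python
import numpy as np

def update_weights(K, word_vectors, headline_vec, coeff=1.0):
    """K: dict word -> original weight (joint feature set).
    word_vectors: mapping word -> 256-d vector; headline_vec: 256-d headline vector.
    coeff: assumed coefficient for the similarity term (value not given in the paper)."""
    h = np.asarray(headline_vec, dtype=float)
    updated = {}
    for word, w in K.items():
        if word in word_vectors:
            v = np.asarray(word_vectors[word], dtype=float)
            denom = np.linalg.norm(v) * np.linalg.norm(h)
            sim = float(v @ h) / denom if denom > 0 else 0.0   # Formula (6)
        else:
            sim = 0.0                        # word without a vector: no boost
        updated[word] = w + coeff * sim      # assumed additive update
    return updated
```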
2.5. News Representation Based on Multi-Feature Fusion Using the Vector Space Model
In this paper, the news representation based on multi-feature fusion is modeled by the VSM (vector space model). The VSM is one of the most popular methods for text modeling, as it regards a news document as a set of unordered words. The features of the VSM are given by the joint feature set $K$ from Section 2.3, while the calculation of the word weights is described in Section 2.1, Section 2.2, Section 2.3, and Section 2.4. Thus, the vector space model of a news document can be expressed by the weights of its unordered words, as shown in Formula (8):

$d_j = (w_1, w_2, \dots, w_{|D|}) \qquad (8)$

where $w_i$ is the weight of the i-th word, and $|D|$ is the number of words in the dataset after preprocessing, TF-IDF feature extraction, and named entity recognition. The representation of all news in the dataset can then be obtained as $\{d_1, d_2, \dots, d_N\}$, in which N is the number of news documents in the dataset. This news representation combines the TF-IDF features and the named entity features and fuses the headline information, thereby forming the news representation model based on multi-feature fusion. In this way, the similarity between different news documents can be calculated and cluster analysis can be performed.
2.6. Experiment
2.6.1. Data Preparation
The food news data in our experiment were gathered from several popular news websites in China, such as Headlines Today (https://www.toutiao.com/), Sina News (https://news.sina.com.cn/), and Sohu News (http://news.sohu.com/); these websites provide valuable, real-time information. The news data were used to evaluate the performance and robustness of our approach. The dataset, named “Food Safety News”, contains human-annotated facts. It comprises 1255 Chinese news documents and the corresponding headlines from 10 events, where each event contains a variable number of news items ranging from 68 to 180. The vocabulary contains 84,198 unique terms after preprocessing. The total time span of the 10 events is from 1 January 2017 to 30 June 2019 (Table 1).
The People’s Daily annotated corpus [40], which contains a large number of annotated person names, place names, organization names, and other proper nouns, was combined with the “Food Safety News” dataset to train the named entity recognition model.
2.6.2. Evaluation Metrics
The TDT evaluation [41] proposed several metrics for topic detection, including Precision P, Recall R, and the F1 score. In addition, the Miss Rate ($P_{Miss}$) and False Alarm Rate ($P_{FA}$) are also important metrics of system performance. The evaluation outcomes for event detection are shown in Table 2, where a is the number of detected news stories related to an event, b is the number of detected news stories irrelevant to the event, c is the number of undetected news stories related to the event, and d is the number of undetected news stories irrelevant to the event. Following the notation in Table 2, the evaluation metrics of TDT are calculated as shown in Formula (9):

$P = \dfrac{a}{a+b}, \quad R = \dfrac{a}{a+c}, \quad F1 = \dfrac{2PR}{P+R}, \quad P_{Miss} = \dfrac{c}{a+c}, \quad P_{FA} = \dfrac{b}{b+d} \qquad (9)$
In this paper, the detection cost function $C_{Det}$ is used to evaluate the system performance [41]. It is a metric that combines the miss rate $P_{Miss}$ and the false alarm rate $P_{FA}$, was proposed in TDT2004 [42], and is calculated by Formula (10):

$C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot P_{non\text{-}target} \qquad (10)$

where $C_{Miss}$, $P_{target}$, $C_{FA}$, and $P_{non\text{-}target}$ are predefined parameters (TDT2004 set these parameters to 1.0, 0.02, 0.1, and 0.98, respectively). $C_{Det}$ is usually normalized by Formula (11):

$(C_{Det})_{Norm} = \dfrac{C_{Det}}{\min(C_{Miss} \cdot P_{target}, \; C_{FA} \cdot P_{non\text{-}target})} \qquad (11)$

The lower the $(C_{Det})_{Norm}$ value, the better the system performs [42]. In the experiment, the evaluation metrics of each event were first calculated, and then their average values were calculated to determine the overall system performance.
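A sketch of the per-event metrics of Formula (9) and the normalized detection cost of Formulas (10) and (11), using the TDT2004 parameter values quoted above; the counts a, b, c, d follow Table 2, and the function name is illustrative.

```python
def event_metrics(a, b, c, d,
                  c_miss=1.0, p_target=0.02, c_fa=0.1, p_non_target=0.98):
    """a: detected and relevant, b: detected and irrelevant,
    c: undetected and relevant, d: undetected and irrelevant (Table 2)."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    p_miss = c / (a + c) if a + c else 0.0
    p_fa = b / (b + d) if b + d else 0.0
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target    # Formula (10)
    c_det_norm = c_det / min(c_miss * p_target, c_fa * p_non_target)   # Formula (11)
    return {"P": precision, "R": recall, "F1": f1,
            "P_Miss": p_miss, "P_FA": p_fa, "C_Det_Norm": c_det_norm}
```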
3. Results
In this paper, experiments were designed to compare different methods and verify the advantages of the proposed method. The experiments consist of two parts: (1) exploring the influence of the number of TF-IDF features M and the clustering threshold T on system performance, and determining the optimal values of M and T; (2) comparing the P, R, and F1 scores of the proposed method and other methods under the same feature number M and threshold T.
3.1. Parameter Selection
For the first experiment, we set M = 5, 10, 15, 20, …, 50 and the clustering threshold T = 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, and 0.40 to test the system performance. The $(C_{Det})_{Norm}$ values of the system under different M and T are shown in Figure 3.
From Figure 3, we can see that the $(C_{Det})_{Norm}$ value differs under different M and T. When M < 30, the $(C_{Det})_{Norm}$ value decreases gradually as M increases; when M = 30, the $(C_{Det})_{Norm}$ value reaches its minimum under each threshold; when M > 30, the $(C_{Det})_{Norm}$ value increases as M increases. When the number of features M = 30, the clustering result and the performance of the system are the best, i.e., $(C_{Det})_{Norm}$ is the lowest.
Comparing the different values of T in Figure 3, we found that the dotted line (threshold T = 0.25) was the lowest regardless of M. Therefore, in the following experiments, we used M = 30 as the number of news features and T = 0.25 as the clustering threshold.
3.2. Food Safety Event Detection Results
The food safety event detection method based on multi-feature fusion combines the TF-IDF features with the named entity features to form the joint features, and then fuses the headline features to make the news representation more accurate. In this paper, the single-pass clustering algorithm was used to cluster the news and generate food safety events. The experiment compared the event detection results under the following news representation methods: (1) TF-IDF, (2) LDA, (3) TF-IDF and named entity (TI-NE), and (4) multi-feature fusion.
The Precision P, Recall R, and F1 score of food safety event detection under the different news representation methods, with the number of TF-IDF features M = 30 and the clustering threshold T = 0.25, are shown in Table 3.
The experimental results show that the Precision P, Recall R, and F1 score of event detection based on LDA are 0.79, 0.81, and 0.79, which are 3%, 6%, and 4% higher than those of the method based on TF-IDF; that is, the LDA-based method outperforms the TF-IDF-based method. After combining TF-IDF with the named entity features (TI-NE), the Precision P, Recall R, and F1 score are 0.86, 0.84, and 0.84, which are 7%, 3%, and 5% higher than those of the LDA-based method, showing that the named entity features are important for representing the news documents. Compared with the TI-NE method, which only combines the named entity features with TF-IDF, our multi-feature fusion method is 8%, 9%, and 9% higher on the three metrics. Compared with the LDA-based method, our method is 15%, 12%, and 14% higher, and compared with TF-IDF, it is 18%, 18%, and 18% higher on the Precision P, Recall R, and F1 score. This demonstrates that the proposed multi-feature fusion method is effective and outperforms the baseline methods based on TF-IDF and LDA.
When using the method based on TF-IDF, the data dimension equals the size of the dictionary constructed from all words in the news dataset; this dictionary contains 84,198 words and yields a high-dimensional, sparse matrix, leading to low computational efficiency. In the multi-feature fusion method proposed in this paper, the dictionary size is reduced to 4562 after preprocessing, TF-IDF feature extraction, and named entity recognition, which means that the dimension of the news representation based on the multi-feature fusion method is only 4562. Compared with the traditional TF-IDF method, the dimension of the news representation is greatly reduced, so the computational efficiency is greatly improved.
4. Discussion
A food safety event detection method based on multi-feature fusion is proposed in this paper. The method combines the TF-IDF features and named entity features of food news, then fuses the headline features to obtain a more accurate news representation. Finally, the news is clustered based on this representation and events are obtained. The proposed method alleviates the problems of excessive data dimensionality and low computational efficiency of traditional TF-IDF [43], as well as the need for manual data annotation and the inability to identify new events that arise when using the LDA method [33].
In this paper, we designed experiments on a real food safety news dataset. The experimental results show that the value of the normalized detection cost function $(C_{Det})_{Norm}$ varies with the number of TF-IDF features M. When M = 30, $(C_{Det})_{Norm}$ reaches its smallest value. This is because when the number of TF-IDF features is smaller than 30, the fewer features there are, the less semantic information is contained, and the content of a news report cannot be represented well; when the number of features is too large, unimportant information is introduced, which makes the clustering results worse. Therefore, the number of features affects the performance of event detection, and an appropriate number of features is important for system performance. The experimental results show that the system performance is at its best when M = 30. The clustering results show that the threshold also affects the clustering results; when the threshold T = 0.25, the $(C_{Det})_{Norm}$ value is the smallest, which indicates that the system performance is best when T = 0.25.
Comparing the event detection results based on the different news representation methods, when the TF-IDF features are combined with the named entity features, the Precision P, Recall R, and F1 score are better than those of LDA. This is because named entities carry an important part of the information in a news report, so the joint feature contains richer information and the news representation is better. When the headline semantic information is fused, the results are higher than those obtained from the other methods, because the headline is a summary of the news content. By calculating the similarity between the feature words and the headline vector, the weights of the feature words are updated and the key information in the food safety news representation becomes more prominent, which improves the event detection results compared with the methods based on TF-IDF and LDA.
The method proposed in this paper also has some limitations: since the data used in this paper are derived from events that have already occurred, the real-time performance of event detection cannot be guaranteed. Nevertheless, the proposed method still reduces the data dimension, enhances the results, and solves the problem of food safety event detection more effectively.
In the future, we will focus on combining a variety of data sources to construct a versatile event detection method, reducing the computation overhead, and addressing the problems of dealing with real-time news feeds.