1. Introduction
Topic modelling, as unsupervised learning, has become a prevalent text mining tool for discovering hidden semantic structures in a text corpus. Given a collection of documents, most existing topic models perform a full analysis to discover all topics occurring in the corpus. However, it was recently noticed [1] that in many situations users are interested only in focused topics related to specific aspects. For example, given a set of Amazon product reviews, a user might be interested only in bedding products. A conventional topic model performing full analysis will identify all topics from the entire corpus, such as "furniture", "food" and "clothing". Although the topic of "furniture" is related to the user-interested aspect of "bedding products", it is too coarse, as the user might be more interested in fine-grained topics like "bed frames" and "mattresses". As a result, targeted (or focused) analysis was proposed by Wang et al. [1] to discover topics relevant to targeted aspects only. In particular, given a corpus of documents from a broad domain and a set of user-provided keywords representing the aspects of interest, targeted analysis aims to discover topics related to the queried aspects only.
Methods for targeted analysis can be generally categorised into two groups: (1) conventional topic models incorporating filtering strategies and (2) specialised topic models. However, methods of both categories suffer from problems such as topic loss and topic suppression, because of the limitations of their respective assumptions and strategies.
For algorithms in the first group, both pre-filtering and post-filtering strategies can be adopted to empower full-analysis topic models to find topics related to the queried aspects. Basically, the pre-filtering strategy retains only documents containing the query keywords and extracts topics from the retained "partial data". The quality of the discovered topics thus heavily depends on the user-supplied query keywords: if the keywords are not appropriate or comprehensive enough, many relevant documents will be filtered out, which incurs significant topic loss. For example, if a user provides "bath" as a query keyword, documents without the keyword but containing synonyms like "shower" or similar words like "bathtub" will be filtered out although such documents are actually relevant. Consequently, there is a high possibility of losing topics when modelling from the retained partial data. A post-filtering strategy applies conventional topic models to first identify all topics in the corpus and then filters out the topics that do not contain the query keywords. However, as analysed in [1], such a strategy may result in topic suppression when the query keywords are infrequent in the corpus: topics related to the user-interested aspect are suppressed by general topics.
For algorithms in the second group, TTM [1] is the first and the state-of-the-art. TTM is a sparse topic model designed to directly mine focused topics based on user-provided query keywords. TTM maintains two types of topic-word distributions: one for relevant topics and one for the irrelevant topic. It considers documents at the sentence level and introduces a variable r to indicate the status of a sentence (i.e., relevant or irrelevant). Words are then sampled from the relevant or the irrelevant topic-word distribution according to the sentence status. Although TTM can accomplish targeted analysis to a certain extent, its effectiveness is handicapped by its scheme of processing at the sentence level and its assumption that each sentence focuses on only one topic. By considering sentences individually and separately, topic information between consecutive sentences may be lost, which results in inferior topic quality and possible topic loss. By assuming that each sentence is related to only one aspect, TTM is very likely to mistakenly assign relevance statuses to sentences related to multiple topics, which is often the case for long sentences. Wrong assignments of sentence statuses will in turn lead to possible loss of meaningful topics.
A common challenge faced by algorithms of both categories is computational efficiency. While full analysis of topics is largely performed offline, targeted analysis is more likely an online module that is expected to respond to user queries as quickly as possible. However, existing algorithms for targeted analysis, especially the post-filtering strategy and the specialised topic models, are not devised to address this issue. The pre-filtering strategy may gain efficiency by modelling topics from a reduced set of "partial data", but it achieves this at the cost of losing important topics.
To address the aforementioned issues, we propose a novel Core BiTerm-based Topic Model (BiTTM) for targeted analysis, which directly models fine-grained topics related to the queried aspect from a set of core biterms. A biterm, proposed in BTM [2], is a word pair consisting of two distinct words that appear together in a fixed-size window, representing co-occurrence information. Building on biterms, we introduce core biterms as a set of selected biterms that have strong connections with the query keywords. By modelling topics from the set of core biterms, BiTTM is expected to achieve better performance than existing specialised topic models in the following aspects:
1. The existing specialised topic models for targeted analysis (i.e., TTM and APSUM [3]) process at either the sentence level or the word level, so the semantic information between consecutive sentences is lost. In contrast, since a biterm may consist of two words coming from two successive sentences, information across the whole document can be captured by BiTTM to alleviate the issue of losing topics.
2. The TTM model samples relevance statuses at the sentence level, which may be too coarse: when a sentence is related to multiple topics, it is difficult to infer the relevance status of the sentence as a binary value. In contrast, the APSUM model [3] samples relevance statuses for individual words, which may be too specific, because it cannot handle phrases that only make sense when multiple words are considered together. Biterms, as a scheme in-between sentences and words, are expected to achieve more accurate inference of relevance statuses.
3. Existing specialised topic models offer no mechanism to accelerate the computation without significant loss of semantic information. Instead, BiTTM introduces a heuristic preprocessing step based on core biterms that speeds up topic modelling while alleviating information loss, making it a more pragmatic solution for targeted analysis of user queries.
To comprehensively evaluate the performance of BiTTM, extensive experiments have been conducted on real-world datasets including short texts, medium texts and long texts. Moreover, we select a large number of targets with different word and document frequencies to explore the adaptability of BiTTM to various types of queries. The experimental results show that (1) BiTTM improves the quality of topics, alleviates topic loss, and outperforms the baselines especially for query keywords of low frequencies; and (2) the time cost of BiTTM is the lowest and the most stable among the compared methods, which demonstrates the high applicability of BiTTM to datasets with different characteristics.
The remainder of this paper is organised as follows. Prior research and related works are reviewed in Section 2. We provide the technical details of BiTTM in Section 3 and discuss the experimental results in Section 4. Finally, Section 5 closes the paper with some concluding remarks.
3. BiTTM
In this section, we describe BiTTM for efficient topic analysis of targeted aspects. In Section 3.1, we introduce the concept of core biterms and the process to generate them. Section 3.2 and Section 3.3 discuss the generative process and the inference of BiTTM, respectively.
3.1. Core Biterms
Considering that the user-specified aspect usually involves only part of the data, we believe data preprocessing is an indispensable step for efficient targeted analysis. However, existing specialised topic models operate directly on the entire dataset, ignoring the efficiency issue. Existing methods incorporating pre-filtering strategies, as discussed before, achieve certain efficiency by modelling topics from a reduced data set; nevertheless, the reduced data set may lose relevant documents if the targets are not expressed appropriately or comprehensively. For example, Table 1 enumerates three situations where query keywords may easily be incomplete, resulting in possible loss of relevant documents and topics.
Synonyms. For example, if the supplied query keyword is “bath”, relevant documents containing words representing similar semantics, such as “shower”, may be missed.
Words referring to the same targeted aspect in a particular domain. For example, when the domain is confined to Amazon reviews of baby products, the keywords “crib” and “bed” represent the same aspect, although they are not exactly synonyms.
Words describing the same event. Users often use diverse words to refer to the same event, especially in social networks. For example, considering the Twitter dataset of Oscars, both “mistake” and “oscarsfail” are used to describe the event of a wrong envelope for the Best Picture Award.
To address the aforementioned issues, we propose an efficient data preprocessing method based on core biterms.
As introduced in BTM [2], a biterm consists of any two distinct words in a fixed-length window, so it captures co-occurrence information in the document. As the window may span two or more sentences, the semantic information between consecutive sentences can be captured. Compared with TTM and APSUM, processing at the level of biterms avoids the potential loss of information between successive sentences. Therefore, we take biterms as the base unit of our preprocessing.
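For illustration, biterm extraction over a sliding window can be sketched as follows. This is a minimal sketch under assumptions not fixed by the text: the window size, the tokenisation, and the canonical ordering of each word pair are all illustrative choices.

```python
from itertools import combinations

def extract_biterms(words, window=3):
    """Extract biterms: unordered pairs of distinct words that co-occur
    within a fixed-size sliding window over the document."""
    biterms = []
    for start in range(len(words)):
        window_words = words[start:start + window]
        for w1, w2 in combinations(window_words, 2):
            if w1 != w2:
                # store each pair in a canonical (sorted) order,
                # since biterms are order-insensitive
                biterms.append(tuple(sorted((w1, w2))))
    return biterms

biterms = extract_biterms("bath toys rinse well".split(), window=2)
```

Because consecutive windows overlap, a frequently co-occurring pair is emitted multiple times, which is exactly what the frequency-based preprocessing in Section 3.1 relies on.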
To handle the situations exemplified in Table 1, we use "core words" to complement the query keywords so that relevant documents that do not explicitly contain the query keywords can still be considered. Intuitively, if core words represent the same aspect indicated by the query keywords, they should co-occur with the query keywords very often. Hence, we first extract "core words" that frequently co-occur with the query keywords from biterms, and then extract frequent biterms containing core words as "core biterms". The procedure is illustrated in Algorithm 1, which can be summarised in three steps as follows:
Step 1: Calculate the desired size of the core-word set and rank all biterms in descending order of frequency (Lines 1–2).
Step 2: Acquire core words from the top frequent biterms containing the target, and then calculate the average frequency of the biterms containing core words (Lines 3–15).
Step 3: Select core biterms according to two conditions: the biterm contains at least one core word, and the frequency of the biterm is greater than the average frequency computed in Step 2 (Lines 16–20).
We will then model targeted topics from the generated core biterms, which yields a threefold benefit as follows: (1) the context information between neighbouring sentences is preserved; (2) sampling relevance status based on biterms is more accurate; and (3) modelling topics from core biterms is more efficient.
Algorithm 1: Preprocessing based on biterms
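The three steps above can be sketched as follows. This is a heuristic sketch rather than the paper's exact Algorithm 1: in particular, the way the core-word set size is fixed (the `core_size` parameter) and the stopping rule for collecting core words are assumptions made for illustration.

```python
from collections import Counter

def select_core_biterms(biterms, targets, core_size=20):
    """Sketch of the three-step core-biterm selection.
    `biterms` is a list of (w1, w2) tuples; `targets` are query keywords."""
    freq = Counter(biterms)
    ranked = [b for b, _ in freq.most_common()]   # Step 1: rank by frequency

    # Step 2: collect core words from top-ranked biterms containing a target,
    # up to an (assumed) size budget
    core_words = set(targets)
    for w1, w2 in ranked:
        if len(core_words) >= core_size + len(targets):
            break
        if w1 in targets or w2 in targets:
            core_words.update((w1, w2))

    with_core = [b for b in freq if b[0] in core_words or b[1] in core_words]
    f_avg = sum(freq[b] for b in with_core) / max(len(with_core), 1)

    # Step 3: keep biterms that contain a core word AND exceed the
    # average frequency of core-word biterms
    return [b for b in with_core if freq[b] > f_avg]
```

The average-frequency threshold prunes rare, likely noisy pairs, which is what makes the subsequent topic modelling both focused and fast.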
3.2. Model Description & Generative Process
In this subsection, we describe the model and the generative process of BiTTM.
Table 2 lists the notations used in this paper.
The generative process is as follows:
Draw the corpus-level topic distribution theta from its Dirichlet prior alpha.
Draw the irrelevant topic-word distribution phi_irr from its Dirichlet prior.
For each target-relevant topic k
- (a)
For each word w in the vocabulary, draw the word selector a_{k,w}.
- (b)
Draw the topic-word distribution phi_k according to the word selectors, using the smoothing prior beta and the weak smoothing prior beta-bar.
For each biterm b
- (a)
Draw the target indicator x_b.
- (b)
Compute the status r_b based on x_b and the keyword indicator d_b.
- (c)
If b is relevant to the target
Draw topics z_{b,1} and z_{b,2} from theta, and
Draw words w_{b,1} from phi_{z_{b,1}} and w_{b,2} from phi_{z_{b,2}}.
- (d)
If b is irrelevant
Draw both words w_{b,1} and w_{b,2} from phi_irr.
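The biterm-level part of this generative story can be illustrated with a toy simulation. The code below is a sketch, not the model itself: the toy sizes, the symmetric priors, and the flat `p_relevant` probability (which stands in for the x/d/r machinery and the dual-sparsity word selectors) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 6                   # number of relevant topics, vocabulary size (toy)
alpha, beta = 1.0, 0.1        # symmetric Dirichlet priors (illustrative)

theta = rng.dirichlet([alpha] * K)         # corpus-level topic distribution
phi = rng.dirichlet([beta] * V, size=K)    # per-topic word distributions
phi_irr = rng.dirichlet([beta] * V)        # single irrelevant topic

def generate_biterm(p_relevant=0.5):
    """Draw one biterm following the story above: a relevant biterm draws a
    topic *per word*; an irrelevant biterm draws both words from phi_irr."""
    if rng.random() < p_relevant:          # stands in for computing r_b
        z1, z2 = rng.choice(K, p=theta), rng.choice(K, p=theta)
        return rng.choice(V, p=phi[z1]), rng.choice(V, p=phi[z2]), True
    return rng.choice(V, p=phi_irr), rng.choice(V, p=phi_irr), False

w1, w2, relevant = generate_biterm()
```

Note that the two topic draws are independent, matching the observation below that the two words of a relevant biterm may land in different topics.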
The graphical representation of BiTTM is shown in Figure 3. Following the above procedure, the generative process can be summarised in three parts. Firstly, we draw two global parameters: the topic distribution theta and the irrelevant topic-word distribution phi_irr. The former models topics on the entire corpus instead of a single document, and the latter is the topic-word distribution of the irrelevant topic; in other words, the two words of an irrelevant biterm are drawn from a single irrelevant topic. Secondly, a topic-word distribution phi_k is drawn for each target-relevant topic k. Please note that two smoothing parameters, the smoothing prior beta and the weak smoothing prior beta-bar, are used for dual-sparsity [25]. Thirdly, the status r of each biterm is determined by both the target indicator x and the keyword indicator d. According to the two types of status, relevant or irrelevant, each word is drawn from phi_k or phi_irr.
Different from the generative process of BTM, BiTTM draws two topics for a relevant biterm, so each word in the biterm may be assigned a different topic. We choose this strategy because it is inappropriate for targeted analysis to assume that the two words in a biterm share the same topic. For BTM, in contrast, drawing one topic per biterm is probably sufficient, as it is a full-analysis model aiming to mine coarse-grained topics.
Here is an example to elaborate the difference between full analysis and targeted analysis. When dealing with the biterms (battery, larger) and (lens, larger), BTM is prone to assign the same topic to the two biterms because of the shared word "larger". This allocation might be fine for full analysis, since it does not pursue fine-grained topics and thus need not distinguish between "battery" and "lens". However, for targeted analysis, "battery" and "lens" represent two different aspects and should be recognised distinctively. Note that, although the sampling processes of the two words in a biterm are independent of each other, their combined effects determine the status (i.e., relevant or irrelevant) of the biterm.
3.3. Inference
Following BTM [2] and TTM [1], we choose Gibbs sampling [34] to infer the model parameters. All notations used in this section are shown in Table 2.
We first sample the status of every biterm. Intuitively, if a biterm contains a query keyword, it is relevant to the target aspect. Let d_b be a binary variable where d_b = 1 indicates that biterm b contains a keyword provided by the user. We define the probability that a biterm is relevant to be 1 if d_b = 1; otherwise, the probability is given by Equation (1).
Next, we sample the word selector for every word under each relevant topic. Applying Gibbs sampling similar to TTM [1], we obtain Equation (2).
For a biterm b, the probability of sampling k as the topic for one of its words can be computed as Equation (3). As mentioned before, the two words of a biterm may be assigned different topics if the biterm is relevant, so the topic of each word is sampled separately. If a biterm b is irrelevant, we directly sample both of its words from the irrelevant topic-word distribution, and obtain the corresponding conditional probability. At last, we sample a word selector for each topic k.
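The keyword-based seeding of relevance statuses described above can be sketched as follows. This is a sketch of only the deterministic part: biterms whose status is not fixed by a keyword (d_b = 0) would have their status resampled during Gibbs sampling via the model's Equation (1), which is not reproduced here.

```python
def init_relevance(biterms, keywords):
    """Seed the relevance status r_b of each biterm: a biterm containing
    a query keyword (d_b = 1) is fixed as relevant with probability 1;
    the rest start unknown (None) and are sampled during Gibbs sampling."""
    return [True if (w1 in keywords or w2 in keywords) else None
            for w1, w2 in biterms]
```

For instance, `init_relevance([("bath", "toy"), ("dog", "cat")], {"bath"})` fixes the first biterm as relevant and leaves the second to the sampler.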
4. Experimental Results
4.1. Baselines and Metrics
Baselines. Three methods are chosen to be compared with BiTTM, including Targeted Topic Model (TTM), Biterm Topic Model-Partial Data (BTM-PD), and Biterm Topic Model with a post-filtering strategy (BTM).
TTM. The Targeted Topic Model is the first method for focused analysis, extracting related topics according to a target keyword provided by users. We select TTM rather than APSUM as the baseline among specialised topic models for targeted analysis because TTM outperforms APSUM in terms of topic coherence when the number of topics is less than 50 [3]. For targeted analysis of fine-grained topics, we believe the number of topics in a given corpus is usually less than 50. Moreover, TTM serves as the most valuable comparison because APSUM is not specifically designed for targeted analysis.
BTM. As our model is developed based on biterms, we also compare with two variations of BTM adapted for targeted analysis. BTM is a state-of-the-art topic model for short texts that also applies to long texts [2]. As a typical full-analysis model, BTM aims to find all topics (or all aspects) in the entire corpus; we then use a post-filtering strategy to eliminate topics that do not contain the target keywords. This approach is referred to as BTM for simplicity.
BTM-PD. This is another variation of BTM, which applies the pre-filtering strategy to perform focused analysis: we use only the subset of documents containing the target keywords to model topics. As discussed before, the pre-filtering strategy is handicapped by the variability of target keywords: relevant documents may be filtered out, so topics may be missed.
Metrics. We adopt two techniques to evaluate the quality of topics: topic coherence [35] and precision@n [1] (p@n for short). The former is a popular method for evaluating the quality of discovered topics [36,37,38,39,40]. As an automated evaluation metric, topic coherence mainly measures the interpretability of topics rather than target relevance: it measures the document-level mutual information of keywords in topics, but does not reflect the relationship between topics and targets. To evaluate whether topics are target-relevant, we employ the metric p@n, also used by TTM [1], which is based on human judgement and assesses the relevance between the target and the topics.
Considering the M most probable words in topic k, the topic coherence of k is defined as Equation (7):

C(k) = sum_{m=2..M} sum_{l=1..m-1} log[ (D(v_m, v_l) + 1) / D(v_l) ]

where D(v_m, v_l) is the number of documents containing both v_m and v_l, D(v_l) is the number of documents containing v_l, and v_l is the l-th most probable word in topic k. Basically, for the m-th probable word, the measure considers its co-occurrence with the more probable words. A smoothing count of 1 is added to avoid taking the logarithm of zero. The closer the measure is to zero, the more coherent the discovered topics are.
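This document-level co-occurrence measure can be computed directly from tokenised documents. A straightforward sketch (assuming every scored word appears in at least one document, so that D(v_l) is never zero):

```python
import math

def topic_coherence(top_words, docs):
    """Topic coherence as in Equation (7): for the M most probable words of
    a topic, sum log((D(v_m, v_l) + 1) / D(v_l)) over all pairs with l < m,
    where D counts documents. Values closer to zero mean more coherence."""
    doc_sets = [set(d) for d in docs]

    def D(*words):
        # number of documents containing all given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):          # m-th word, 0-indexed
        for l in range(m):                      # all more probable words
            score += math.log((D(top_words[m], top_words[l]) + 1)
                              / D(top_words[l]))
    return score
```

For example, on documents `[["a","b"], ["a","b"], ["a","c"]]` the topic `["a","b"]` scores exactly 0.0 (perfect co-occurrence after smoothing), while `["a","c"]` scores below zero.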
Given the set of topics discovered by all models, suppose S topics have been verified by users to be related to the target aspect. Moreover, from all topics discovered by a particular model m, suppose there are S_m topics related to the target. Then the precision of model m at rank position n is defined as:

p@n(m) = (1 / S_m) * sum_{z=1..S_m} rel_n(z) / n

where rel_n(z) is the number of words, among the top n words of topic z, that are relevant to the target. (Note that, if a discovered topic is potentially related to multiple semantic topics, the best-matching semantic topic based on the top 20 words is adopted.)
Therefore, the two evaluation methods have different merits and objectives: topic coherence is an automated metric reflecting the interpretability of topics, whereas p@n demands human judgement and assesses the relevance between the discovered topics and the queried target. For the sake of fairness, we use p@n to evaluate all comparison models (i.e., BiTTM, TTM, BTM-PD and BTM) to examine their effectiveness in performing focused analysis. However, we only compare BiTTM and TTM in terms of topic coherence to evaluate topic quality, since the other two models are variations of BTM, which is essentially designed for full analysis of topics.
4.2. Data sets & Experimental Settings
Data sets. To comprehensively evaluate the performance of our proposed model, we conduct experiments on different types of text: short, medium and long documents. For each type, we select three data sets. The nine datasets used in our experiments are described in Table 3; all of them are publicly available at the URLs listed at the bottom of Table 3.
Experimental Settings. In our experiments, we use various words as target queries to analyse the influence of diverse targets on performance. For parameter settings, we follow the hyper-parameter settings of TTM, including the Dirichlet priors and the two smoothing priors; the other baselines follow the parameter settings in their respective papers.
4.3. Quantitative Evaluation
In this subsection, we analyse the quality of the discovered topics from two aspects: topic coherence (representing topic interpretability, or semantic coherence) and p@n (indicating topic relevance).
Analysing the results of topic coherence: The average topic coherence achieved by BiTTM and TTM is shown in Table 4; the closer the score is to zero, the more coherent the discovered topics are. As we can see from the table, BiTTM does not match TTM on short texts in terms of topic coherence. However, as document length increases, BiTTM starts to outperform TTM. The reason BiTTM generally works better than TTM on medium and long texts is that TTM is a sentence-based model for which the information between consecutive sentences is lost. In contrast, by considering core biterms that may span neighbouring sentences, our BiTTM model captures semantics crossing sentence boundaries, so more interpretable topics can be generated. Since a short text document quite often contains only one sentence, the limitation of sentence-based TTM is not exposed there. Generally, by beating TTM on non-short documents, BiTTM has broader applicability in text data analysis.
To evaluate model performance with respect to different queries, we randomly sample query keywords from the documents according to word frequency distributions. We plot the comparative results of BiTTM and TTM in Figure 4, where the horizontal axis represents the word frequency of the target keyword and the vertical axis indicates the percentage of documents containing the target. There are three types of symbols in the figure: red dots, green squares and blue triangles. Each symbol corresponds to a comparison between the topics discovered by BiTTM and TTM for one query. In particular, a green square means BiTTM obtains better topic coherence than TTM for the query, while a blue triangle implies the opposite. A red dot indicates that TTM fails to discover the specified number of topics, or enough words under some topics, for that query. For example, we set the number of topics to 5 for the experiments in Figure 4 and consider the top 10 words of each topic; TTM discovers fewer than 5 topics, or fewer than 10 words for a topic, when handling the queries corresponding to red dots. Note that this situation never happens for BiTTM.
The most obvious trend observable in Figure 4 is that the red dots usually appear in the lower left corner, the blue triangles gather in the upper right corner, and the green squares fall in between. The red dots in the lower left corner imply that TTM is prone to miss topics when dealing with infrequent targets. The blue triangles in the upper right corner suggest that TTM performs better when the targets appear very frequently in many documents; however, the number of such target keywords may be limited. On the contrary, BiTTM achieves satisfactory performance for a diverse range of targets even if they are infrequent in the corpus. This also verifies that using core words to enrich the semantic information in the context of the target keyword (the BiTTM strategy) is more effective than taking words in the same sentence as bridges to connect potentially target-related words (the TTM strategy).
Analysing the results of p@n: To calculate p@n, similarly to TTM, three human labelers familiar with the data sets are engaged to label the results. The p@n values at the rank positions of 5, 10 and 20 are reported in Table 5, from which several interesting outcomes can be observed. Firstly, the performance of the two variations of BTM (i.e., BTM-PD and BTM) is generally worse than that of the two specialised topic models (i.e., BiTTM and TTM), which demonstrates that full-analysis topic models with filtering strategies are not suitable for targeted analysis because they are prone to detect general topics instead of fine-grained target-related topics. In addition, comparing the two BTM variations, BTM-PD is better than BTM in most cases, which shows that the pre-filtering strategy is more effective in removing irrelevant words than the post-filtering strategy. Secondly, the average p@n of BiTTM achieves a gain of more than 10% compared with TTM, and more than 26% compared with BTM-PD, over all queries in the table and all settings of n. Moreover, the performance difference among the three types of documents is not significant, whereas different target queries do influence the p@n results, as explained later using concrete examples. Thirdly, TTM is the second best model overall. For larger rank positions, however, TTM achieves the best performance among all models for some queries, which suggests a tendency of TTM to put target-related words in lower-ranked positions.
To explore the influence of different queries, let us take a closer look at two specific targets: "ashtray" (in the short-text data set "cigar") and "rinses" (in the medium-text data set "baby"). As shown in Figure 4, both targets are infrequent words (appearing in the lower left corner) in their respective datasets. However, the p@n scores of BiTTM and TTM for the two queries, as shown in Table 5, are remarkably different. Both models perform well with respect to "ashtray" but not with respect to "rinses", especially TTM. The p@n score of TTM for "rinses" is unsatisfactory, and several inexplicable words, such as "attention" and "entertain", appear in the discovered topics, making them hard to interpret. By examining the datasets, we find that documents containing "ashtray" consistently describe the appearance of ashtrays, such as colours and materials. That is, the documents are quite clean and relevant, which explains why both BiTTM and TTM handle the query well. Nevertheless, the documents containing "rinses" are mostly composed of short sentences, such as "It rinses out well and dries quickly." and "Rinses/Washes easy.", where the meaningful descriptions are hidden in the context around the sentences containing "rinses". TTM cannot handle this situation since it is a sentence-based model. These two examples explain why performance varies with respect to query keywords.
Comparing the performance in terms of topic coherence and p@n, we notice that BiTTM is more capable of acquiring topics related to the target (i.e., high p@n scores) than of generating semantically coherent topics (i.e., better topic coherence values), especially for short text documents. This is because words related to the target do not necessarily have high co-occurrence, which is what topic coherence is computed from. For instance, "Oktoberfest" is an appropriate word related to the target "place" in the dataset cigar, because a type of cigar named Quesada Oktoberfest is released in October to celebrate the famous German beer festival. However, "Oktoberfest", as a low-frequency word, cannot provide enough mutual information, which directly causes poor topic coherence. Conversely, the high-frequency word "rolled" contributes to a high topic coherence score, but it is not selected by BiTTM since it is too general to describe the target "place".
4.4. Time Efficiency Analysis
As mentioned before, it is ideal for targeted analysis to provide responses to user queries as soon as possible. Therefore, in this experiment, we analyse the time efficiency of the comparative models.
The average time cost of the four methods on each dataset over 40 random queries is shown in Table 6. It can be observed that, generally, BiTTM has the best time efficiency, followed by BTM-PD. TTM is significantly slower, having no preprocessing strategy, and BTM is the most inefficient model since it performs full analysis on the complete dataset.
To clearly demonstrate the impact of data size on time efficiency, we plot the results in Figure 5, where the grey bars denote the sizes of the datasets and the polylines in different colours indicate the time cost of the different methods. Note that, since the time consumption of BTM is not comparable to the others, only three models (i.e., BiTTM, BTM-PD and TTM) are displayed in the figure. It can be observed that the time cost of all methods generally increases with data size. However, data size has a greater impact on TTM than on the other two methods, which shows that TTM is not suitable for processing large data sets. In contrast, BiTTM and BTM-PD adapt better to large data sets: for these two methods, differences in data size do not dramatically change the time consumption, since both employ preprocessing strategies that focus on only the portion of data related to the query targets. The difference between them is that BiTTM is faster than BTM-PD, especially as document length increases. The reason is that the pre-filtering strategy adopted by BTM-PD is a simple and rough processing step: it selects documents as long as they contain the query keywords, so irrelevant information in such documents is included and processed as well, which harms the time efficiency of BTM-PD.
To illustrate the impact of document length on time efficiency, the percentage histogram of the time cost of BiTTM, TTM and BTM-PD is plotted in Figure 6, where the average document length increases from left to right. It can be observed that the time efficiency of TTM is worst on short texts. Recall that the topic coherence of TTM on short texts is better than that of BiTTM; this experiment shows that TTM achieves this by significantly sacrificing time efficiency, while the topic quality of TTM on short texts in terms of p@n is still worse than that of BiTTM. Moreover, the efficiency of BTM-PD is worse on long texts than on short texts. This is because BTM-PD is a biterm-based topic model and long texts generally have more biterms than short texts. Although BiTTM is also biterm-based, the strategy of selecting "core biterms" removes many irrelevant biterms, so the performance of BiTTM on long texts remains promising.
Therefore, Figure 5 and Figure 6 demonstrate that BiTTM can be widely applied to various types of text data: thanks to the core biterm-based preprocessing strategy, neither data size nor document length has a great impact on its time efficiency.
4.5. Qualitative Evaluation
We present a qualitative analysis of the topics generated by the comparison models in this subsection, focusing on two aspects: the ability to discover as many fine-grained relevant topics as possible, and the ability to deal with semantically approximate targets. For the example queries discussed in the following, their word frequencies and document frequencies are shown in Figure 4.
4.5.1. Discovering Relevant Topics
We take the query "disease" in the dataset food as an example. Table 7 shows the topics discovered by the four comparison models, together with the top 10 words of each topic. The third row of Table 7 shows the labels we manually assign to summarise the semantics of each topic, where SFA is the abbreviation for Saturated Fatty Acid. Words that do not semantically align with the topics are displayed in red.
Compared with the topics discovered by BiTTM, the other three methods all fail to identify the topic prevention, which is clearly a relevant topic of "disease". Moreover, the two BTM variations (i.e., BTM-PD and BTM) miss the topic risk. Taking a closer look, we find this is because the two BTM models cannot distinguish between the subtly different topics risk and research; in other words, they discover a single topic combining research and risk. This is understandable because BTM, as a full-analysis topic model, discovers general topics. TTM succeeds in discovering both research and risk, but its topic quality is poorer than that of BiTTM (e.g., there are more words displayed in red in the two topics discovered by TTM, meaning more inconsistent words in its results). Therefore, BiTTM discovers more relevant and fine-grained topics than the other models for this example query.
Consider the topic SFA, which is discovered by all four models. The results of BiTTM clearly indicate that saturated fatty acids affect blood sugar and carcinogenesis, but the results of the other methods are less satisfactory. For example, TTM tends to identify which foods (e.g., tart, chip and sweetener) contain unsaturated fatty acids, while BTM-PD and BTM focus on food ingredients (e.g., palm oil and protein). These results are not related to the target “disease” queried by users. Hence, the topic quality of BiTTM is also better than that of the other models in this example.
4.5.2. Handling Semantically Approximate Targets
When the targets supplied by users are semantically approximate, a set of similar relevant topics should be discovered. We further examine how the comparative models handle semantically approximate targets. In particular, we analyse the two types of semantically approximate queries mentioned in
Section 3.1: synonyms and diverse descriptions of the same event.
An example of the first type is shown in
Table 8. We query the dataset “baby” with two targets, “bath” and “shower”, which share similar semantics in the dataset of Amazon reviews of baby products. A successful model should therefore return similar topics for both queries. As shown in the table, BiTTM is the only model that obtains the same set of four meaningful topics for both queries, while the other methods either miss topics or generate vague topic content.
For instance, all methods except BiTTM fail to identify the topic blanket with respect to the query “bath”, although TTM and BTM retrieve it with respect to the target “shower”. From the results of BiTTM, we find that blanket is an important aspect of bath/shower, because most people wrap their babies in a blanket after a bath/shower; the topic blanket is thus an aspect in which users are interested. Missing such an important topic hinders downstream analysis and applications such as high-quality personalised services and commodity recommendation systems. As another example, two topics are discovered by BiTTM only: sentiment and protection. Examining the content of topic sentiment, we learn that users tend to include emotional expressions (e.g., “have a nice time with daughter/son”) when commenting on shower/bath products. This topic thus reflects users’ sentiment polarity towards products, which is important for applications such as user profiling, recommendation and public-opinion monitoring. The topic protection describes safety products that can be installed in tubs or on faucets. Safety is an important concern, especially for baby products, and it is undesirable that the other three methods miss this topic.
Moreover, we find that BTM-PD extracts only two topics for both queries, and the content of these topics is too vague to understand (e.g., we were unable to assign semantic labels to them). The two sets of top 10 words share six identical words, which makes it very hard to distinguish the semantics of the two topics. A similar situation occurs with BTM: it produces two similar topics about “spout”, and, for the query “bath”, the two topics share eight identical words among their top 10. The content of these topics may be correct, but the information they express is redundant; generating near-identical topics adds nothing useful and only increases the difficulty of further analysis.
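The degree of redundancy we refer to here can be checked with a simple top-word-overlap count: two topics that share most of their top 10 words convey largely the same information. A minimal sketch of this check in Python (the word lists below are hypothetical illustrations, not the actual model output):

```python
def top_word_overlap(topic_a, topic_b):
    """Count how many words two topics share among their top-word lists."""
    return len(set(topic_a) & set(topic_b))

# Hypothetical top-10 word lists for two near-duplicate "spout" topics.
spout_1 = ["spout", "cover", "faucet", "tub", "bath", "fit", "whale",
           "soft", "cute", "easy"]
spout_2 = ["spout", "cover", "faucet", "tub", "bath", "fit", "water",
           "head", "rubber", "easy"]

# A large overlap (here 7 of 10 words) signals redundant topics.
print(top_word_overlap(spout_1, spout_2))  # prints 7
```

The same count can also be used in the opposite direction, e.g., to verify that the topic sets returned for two synonymous queries such as “bath” and “shower” are consistently similar.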
Table 9 shows an example of the second type. In the dataset
Oscars, both “mistake” and “oscarsfail” refer to the same event, in which the Best Picture award, which belonged to
Moonlight, was wrongly presented to
La La Land because of a wrong envelope. As we can see from the table, BiTTM acquires three fine-grained relevant topics that trace the development of the event. At the beginning, two presenters announced the Best Picture as
La La Land, and no one was aware of the mistake.
La La Land, and no one was aware of the mistake. Many tweets emerge to talk about
La La Land and express congratulations to the actors and the producer, which can be seen from the content of the topic
beginning. Next, the error is corrected and the real winner is another movie
Moonlight. Topic
correction is a perfect interpretation of this stage. Note that, the top 10 words of this topic with respect to the query “mistake” contains the word “oscarsfail”, which demonstrates the usefulness of the core biterms strategy used by BiTTM. The third topic
discussion covers the discussion of the actors’ reaction after this mistake has happened.
In contrast, TTM retrieves only the topic discussion, and its quality is unsatisfactory: some irrelevant words, such as Moana (another movie), appear in the topic. BTM-PD and BTM also discover only the topic discussion with respect to the target “oscarsfail”, and the quality is low; for example, the word “documentary”, which is not related to the two movies, appears in the results. Although the quality improves with respect to the target “mistake”, the two topics discovered by BTM are too similar, sharing six identical words among their top 10.