It is a very intuitive idea that topics within a text corpus show up as patterns in the overall word usage across the corpus documents. Graphs, as the mathematical structure for representing networks of entities (so-called nodes or vertices) which are linked (through so-called edges), are an obvious choice for formalizing this idea. What is less obvious is which of the many possible ways of transforming the corpus into a graph is the most effective one for the present purpose, and how exactly topic-forming patterns can be identified.
In this section, we describe our particular choices, which we found to be successful in the analyses of many text corpora, and we give reasons for some of them. However, for several parameters we heuristically tried only a few values and chose the one which worked best on some manually assessed samples; we do not claim to have tested all or even a large fraction of the conceivable alternatives. In principle, these parameters could be tuned further in the future for better performance.
3.1. Setting up the Corpus Network
Based on a corpus $\mathcal{D}$ of $N$ text documents, we define the corpus graph as a weighted undirected graph $G_p = (V_p, E_p, w_p)$, consisting of a vertex set, an edge set, and an edge weight function. In this subsection we describe the various steps of the graph construction, which are schematically depicted in Procedure 1 and Procedure 1a.
Procedure 1 SetUpCorpusGraph |
Procedure 1a ExtractDocumentTerms |
Input: Text document $d$ as part of a corpus $\mathcal{D}$ of documents |
Output: Sequence of document terms $t_1, \ldots, t_{n_d}$ with rank values $r_d(t_i)$, |
sorted by decreasing rank value: $r_d(t_i) \ge r_d(t_j)$ if $i < j$ |
1 Recognize named entities in d. |
2 Tokenize d while treating compound named entities as single tokens. |
3 Lemmatize tokens. |
4 Remove stop words. |
5 Remove 1- and 2-character tokens, digits, exotic characters. |
6 Keep only nouns, proper nouns, adjectives. |
7 $n_d \leftarrow$ number of unique tokens left |
8 $V_d \leftarrow$ set of unique tokens left |
9 Derive Markov chain on $V_d$ with transition probabilities $p_{s,t}$ between terms as described in Section 3.1, based on term positions and term neighborhoods within d as well as term statistics in $\mathcal{D}$. |
10 Compute the Markov chain’s stationary distribution $\pi_d$ and set $r_d(t) = \pi_d(t)$ for all $t \in V_d$. |
The vertex set $V_p$ is a certain subset of the unique and normalized words or word combinations appearing in the corpus (which hereafter we subsume under the name terms), where $p$ with $0 < p \le 100$ is a parameter that controls the number of vertices. It is constructed by looping over all documents of the corpus (lines 4 to 6 in Procedure 1).
More specifically, for $p = 100$ the terms that form the vertex set result from a fairly standard text preparation pipeline (lines 2 to 8 in Procedure 1a), consisting of tokenization, lemmatization (in order to consolidate several inflected forms of the same word), and the removal of stop words (frequent words with little meaning), short tokens (fewer than three characters), exotic characters, and tokens consisting mainly of digits. We also retain only nouns, adjectives, and proper nouns, as is usual in NLP tasks that focus on factual content.
Applying the procedures listed in the previous paragraph to the original documents leads to single-word term vertices only. Yet, retaining compound terms that consist of several words as units in the corpus graph is a desirable enhancement of the method because it prevents the loss of meaning that is a consequence of splitting up a compound term into its components. Technically, we can include compound terms without changing the pipeline described above using a formal trick: before we put the documents into that pipeline, we connect the individual words of compound terms by underscores (e.g., “department of homeland security” becomes “department_of_homeland_security”). This renders the compound term a single token which survives the pipeline as a unit. However, identifying compound terms in documents is not an easy task. We experimented with various statistical and linguistic approaches. While it is possible to identify many meaningful combinations in that way, one also produces several nonsensical combinations which can create serious confusion in the results. Thus, we decided not to search for general compound terms but only for those which show up in named entity recognition (line 1 in Procedure 1a). Concretely, we incorporate entities of the types events, facilities, geographical and political entities, languages, laws, locations, nationalities or religious or political groups, organizations, persons, products, and works of art which consist of 2, 3, or 4 words.
For all these document preparation steps, we use the Python library spaCy [
62]. Working with the small language model en_core_web_sm turned out to be sufficient for our purposes; using larger language models did not lead to significant changes in the results.
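As an illustration of steps 1 to 6 of Procedure 1a, a minimal preprocessing sketch with spaCy could look as follows; the function name, the exact token filters, and the way compound entities are joined with underscores are illustrative assumptions, not a verbatim excerpt of our implementation.

```python
import spacy

# Small English model; larger models did not change our results significantly.
nlp = spacy.load("en_core_web_sm")

# Entity types whose 2- to 4-word mentions are kept as compound terms.
ENTITY_TYPES = {"EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
                "NORP", "ORG", "PERSON", "PRODUCT", "WORK_OF_ART"}
KEPT_POS = {"NOUN", "PROPN", "ADJ"}

def extract_document_terms(text: str) -> list[str]:
    """Steps 1-6 of Procedure 1a: NER, compound joining, token filtering."""
    doc = nlp(text)

    # Steps 1-2: join multi-word named entities with underscores so that
    # they survive tokenization as single tokens (replacement has equal length,
    # so character offsets of the remaining entities stay valid).
    for ent in reversed(doc.ents):
        if ent.label_ in ENTITY_TYPES and 2 <= len(ent.text.split()) <= 4:
            text = text[:ent.start_char] + ent.text.replace(" ", "_") + text[ent.end_char:]
    doc = nlp(text)

    terms = []
    for tok in doc:
        lemma = tok.lemma_                                   # step 3: lemmatization
        if tok.is_stop or tok.is_punct:                      # step 4: stop words
            continue
        if len(lemma) <= 2 or tok.like_num:                  # step 5: short tokens, digits
            continue
        if "_" not in lemma and tok.pos_ not in KEPT_POS:    # step 6: POS filter
            continue
        terms.append(lemma if tok.pos_ == "PROPN" else lemma.lower())
    return terms
```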
After these preparations every document $d$ has been stripped down to a collection of terms which still carry the main subject content of the document. $V_{100}$ is the set of all unique terms remaining in the corpus. However, usually quite a few of these terms are of a general nature and not important for the main message of the document. Having those terms in the corpus graph blurs its ability to represent thematic links. Working with a smaller subset of $V_{100}$, which we denote as $V_p$, where $p$ is the percentage of terms retained, can prevent that effect, as we will discuss in more detail in Section 4.
In order to judge which terms to drop from the corpus graphs, a function for document term ranking is needed which produces a rank order of terms depending on their significance for the message of the document (lines 9 and 10 in Procedure 1a). This is a well-known task in the context of key term extraction. The naïve solution would be to rank a term $t$ in document $d$ by its frequency $\mathrm{tf}_d(t)$ in the document, but it is well known that this unjustly favors terms that tend to occur frequently, independent of the specific content of the document. The long-standing solution to this problem is to counterbalance the effect of overall frequent words by the inverse document frequency,
$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},$$
where $\mathrm{df}(t)$ is the number of corpus documents that contain the term $t$, and to use $\mathrm{tfidf}_d(t) = \mathrm{tf}_d(t) \cdot \mathrm{idf}(t)$ for term ranking in the document $d$.
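As a minimal illustration of this classical ranking, the following sketch computes tf-idf scores from preprocessed documents; function and variable names are illustrative.

```python
import math
from collections import Counter

def tfidf_ranking(doc_terms, corpus_term_sets):
    """Rank the terms of one document by tf-idf.

    doc_terms: list of preprocessed terms of the document, with repetitions.
    corpus_term_sets: list of term sets, one per corpus document.
    """
    N = len(corpus_term_sets)
    tf = Counter(doc_terms)                       # term frequencies tf_d(t)
    df = Counter()                                # document frequencies df(t)
    for terms in corpus_term_sets:
        df.update(set(terms))
    scores = {t: tf[t] * math.log(N / max(df[t], 1)) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)
```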
However, this typical bag-of-word approach—where all words of a document are treated equally independent of their position—neglects the observation that important words of a document are usually not distributed evenly over the whole document. Rather, they tend to appear in groups, and for many document types it is also common that authors place especially many words that characterize the document content at the top of the document, notably in title, subtitle, and abstract. A term ranking method that takes these two observations into consideration is PositionRank [
63], which is a modification of the TextRank method introduced in [
64] in analogy to the PageRank [
65] method for ranking within a network of linked web pages.
In order to combine the advantages of the frequency arguments and the positional arguments for term ranking, we devised our own ranking method, posIdfRank [66], which works as follows: for a corpus document $d$, define a graph $G_d = (V_d, E_d)$ (not to be confused with the corpus-wide graph $G_p$) which has the set of unique terms of $d$ as vertex set $V_d$, and an edge $\{s, t\} \in E_d$ between two vertices $s$ and $t$ iff the terms $s$ and $t$ co-occur in a common window of size $w$ (which is an adjustable parameter of the method), i.e., if there are at most $w - 2$ words between $s$ and $t$.
We now consider a random walk on $G_d$ in which the transition probability between two terms $s$ and $t$ is given by
$$p_{s,t} = \alpha \, \frac{c_w(s,t)\,\mathrm{idf}(t)}{\sum_{t' \in V_d} c_w(s,t')\,\mathrm{idf}(t')} \;+\; (1-\alpha)\, \frac{\beta^{\,\mathrm{first}(t)}\,\mathrm{idf}(t)}{\sum_{t' \in V_d} \beta^{\,\mathrm{first}(t')}\,\mathrm{idf}(t')},$$
where $c_w(s,t)$ counts how often in $d$ the terms $s$ and $t$ co-occur in a window of size $w$, and $\mathrm{first}(t)$ is the position within the document at which the term $t$ appears first; $\alpha$ (with $0 \le \alpha \le 1$) and $\beta$ (with $0 < \beta < 1$) are two more parameters of the method. This process mimics a reader randomly scanning the document for important terms: from term $s$ he moves with probability $\alpha$ to another term $t$ in the vicinity (neighborhood of size $w$), more likely reaching terms which commonly stand close to $s$ and which do not appear in many documents of the corpus. However, with probability $(1-\alpha)$ he jumps to some word which can be far away from $s$, then more likely to terms at the beginning of the document (where the preference for terms at the beginning is stronger the smaller one chooses $\beta$) and again to terms which do not appear in many documents of the corpus.
The long-term behavior of this random walk is characterized by its stationary distribution $\pi_d$, a probability distribution on $V_d$. We regard this distribution, $r_d(t) = \pi_d(t)$, as a useful document term ranking function: it has high values at terms that are frequently visited during the random walk.
For the calculations in this paper, we fix the three parameters $w$, $\alpha$, and $\beta$ at the values which gave the best fit with the manually assigned keywords of the dataset of journal abstracts from [67]. This fit also showed that the ranking is only weakly sensitive to moderate changes of the parameter values, so it is not critical that other training datasets would result in slightly different optimal parameter values.
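A possible implementation of the posIdfRank random walk is sketched below; the parameter defaults, the handling of terms without co-occurrence neighbors, and the helper names are illustrative assumptions rather than a definitive implementation.

```python
import math
import numpy as np
from itertools import combinations

def pos_idf_rank(doc_terms, df, N, w=6, alpha=0.85, beta=0.9, n_iter=200):
    """Stationary distribution of a posIdfRank-style random walk.

    doc_terms: document as an ordered list of preprocessed terms.
    df: dict term -> document frequency in the corpus; N: corpus size.
    w, alpha, beta: window size, move probability, position decay (illustrative values).
    """
    vocab = sorted(set(doc_terms))
    idx = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    idf = np.array([math.log(N / df.get(t, 1)) for t in vocab])

    # c_w(s, t): co-occurrence counts within sliding windows of size w
    cooc = np.zeros((n, n))
    for start in range(len(doc_terms) - w + 1):
        for s, t in combinations(set(doc_terms[start:start + w]), 2):
            cooc[idx[s], idx[t]] += 1
            cooc[idx[t], idx[s]] += 1

    # first(t): first position of each term in the document (1-based)
    first = np.zeros(n)
    for pos, t in enumerate(doc_terms):
        if first[idx[t]] == 0:
            first[idx[t]] = pos + 1
    jump = beta ** first * idf            # teleport weights
    jump /= jump.sum()

    move = cooc * idf                     # row s: weights ~ c_w(s,t) * idf(t)
    row_sums = move.sum(axis=1, keepdims=True)
    # rows without neighbors fall back to the teleport distribution
    move = np.divide(move, row_sums, out=np.tile(jump, (n, 1)), where=row_sums > 0)

    P = alpha * move + (1 - alpha) * jump  # row-stochastic transition matrix
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):                # power iteration
        pi = pi @ P
    return dict(zip(vocab, pi))
```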
We illustrate the considerations on document term ranking with the following example document from the BBC corpus:
India widens access to telecoms |
India has raised the limit for foreign direct investment in telecoms companies from 49% to 74%. |
Communications Minister Dayanidhi Maran said that there is a need to fund the fast-growing mobile market. The government hopes to increase the number of mobile users from 95 million to between 200 and 250 million by 2007. “We need at least $20bn (£10.6bn) in investment and part of this has to come as foreign direct investment,” said Mr Maran. The decision to raise the limit for foreign investors faced considerable opposition from the communist parties, which give crucial support to the coalition headed by Prime Minister Manmohan Singh. Potential foreign investors will however need government approval before they increase their stake beyond 49%, Mr Maran said. Key positions, such as those of chief executive, chief technology officer and chief financial officer are to be held by Indians, he added. |
Analysts and investors have welcomed the government decision. “It is a positive development for carriers and the investment community, looking to take a longer-term view of the huge growth in the Indian telecoms market,” said Gartner’s principal analyst Kobita Desai. “The FDI relaxation coupled with rapid local market growth could really ignite interest in the Indian telecommunication industry,” added Ernst and Young’s Sanjay Mehta. Investment bank Morgan Stanley has forecast that India’s mobile market is likely to grow by about 40% a year until 2007. The Indian mobile market is currently dominated by four companies, Bharti Televentures which has allied itself with Singapore Telecom, Essar which is linked with Hong Kong-based Hutchison Whampoa, the Sterling group and the Tata group. |
If one simply went by the frequency of terms in the document, the 10 highest-ranked terms would be:
investment, market, foreign, mobile, India, telecom, government, investor, chief, indian |
This list obviously does contain useful key terms of the document, but also very unspecific terms like “market” and “chief”. In contrast, the top ranking according to $\mathrm{tfidf}_d$ would be:
Maran, investment, telecom, mobile, indian, foreign, India, investor, market, direct |
Here, the unspecific terms disappear or get shifted to lower positions. The very specific person name Maran, in contrast, appears at the top of the list.
Finally, the ranking according to posIdfRank results in the following top terms:
India, telecom, investment, foreign, limit, Maran, direct, investor, mobile, Sanjay_Mehta |
This list now favors important terms of the title and subtitle, and also brings up person names which stand close to other key terms in the document. Among the three ordered term lists, this is the one which condenses the content of the document best.
Now that we have a document term ranking function, we can define the corpus vertex set $V_p$: it is the result of keeping only the top $p$ percent (in the sense of the document term ranking function posIdfRank) of unique terms of each document (lines 5 and 6 of Procedure 1).
In order to demonstrate the effect of this document reduction, we show what remains from the above example document if we keep only terms from $V_p$:
India access telecom India limit foreign direct investment telecom Communications Minister mobile market government mobile investment foreign direct investment Maran decision limit foreign investor considerable opposition communist crucial coalition Minister Manmohan_Singh foreign investor government Maran investor government decision investment indian telecom market Gartner principal Kobita_Desai relaxation market indian telecommunication Ernst Young Sanjay_Mehta investment bank Morgan_Stanley India mobile market indian mobile market bharti_televenture Essar hutchison_whampoa Sterling |
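A minimal sketch of how the vertex set $V_p$ can be assembled from the per-document rankings (lines 4 to 6 of Procedure 1); names are illustrative.

```python
def build_vertex_set(ranked_doc_terms, p):
    """Union of the top p percent of ranked terms of each document.

    ranked_doc_terms: list of per-document term lists, each sorted by
    decreasing posIdfRank value r_d(t).
    """
    V_p = set()
    for terms in ranked_doc_terms:
        k = max(1, round(len(terms) * p / 100))
        V_p.update(terms[:k])
    return V_p
```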
Having established the vertex set, we now have to specify the edge set $E_p$ and the edge weight function $w_p$ (lines 7 to 14 in Procedure 1). The edges are supposed to connect terms that are likely to help in identifying thematic relationships, and the edge weight function should measure how significant this connection is. Among all conceivable options, the simplest already turns out to be very effective: two terms are connected if they appear together in at least one document, and the weight of that connection is the number of documents in which both terms co-occur:
$$\{s,t\} \in E_p \iff \mathrm{df}(s,t) \ge 1, \qquad w_p(\{s,t\}) = \mathrm{df}(s,t),$$
where $\mathrm{df}(s,t)$ denotes the number of corpus documents that contain both $s$ and $t$ among their retained terms.
Several other authors who use network-based topic detection work with more restrictive and complicated edge definitions and weights, involving thresholds and idf values, in order to avoid noise produced by accidental or meaningless co-occurrences. We did not find this beneficial in our case, as we already avoid this type of noise by reducing the vertex set with percentages $p$ well below 100.
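A minimal sketch of the edge construction (lines 7 to 14 of Procedure 1), assuming every document has already been reduced to its set of retained terms; names are illustrative.

```python
from collections import Counter
from itertools import combinations
import igraph as ig

def build_corpus_graph(reduced_docs):
    """reduced_docs: list of sets of retained terms, one set per document."""
    weights = Counter()
    for terms in reduced_docs:
        for s, t in combinations(sorted(terms), 2):
            weights[(s, t)] += 1              # number of documents containing both s and t

    vertices = sorted({t for doc in reduced_docs for t in doc})
    index = {t: i for i, t in enumerate(vertices)}
    g = ig.Graph(n=len(vertices),
                 edges=[(index[s], index[t]) for s, t in weights])
    g.vs["name"] = vertices
    g.es["weight"] = [weights[e] for e in weights]   # same order as the edge list
    return g
```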
We will go into more detail concerning the influence of the value of $p$ in Section 4. In the present section we work with one fixed value of $p$ well below 100.
Following Procedures 1 and 1a, the BBC corpus of 2225 documents leads to a graph that has 23859 vertices and 1896807 edges. The edge weights vary between 1 and 86 (with highest weight on the link between the terms election and Labour, meaning that 86 documents of the corpus contained both these terms).
3.2. Detecting Topics as Communities
In this subsection we describe how we detect topics in the term co-occurrence graph. The scheme for Procedure 2 outlines this stage of the workflow.
We first look into the details of its fundamental building block: community detection (line 2 of Procedure 2). The aim of setting up the graph $G_p$ was that in it the topics of the corpus would show up as communities of terms, where we refer to the network-theoretical concept of communities, which are, loosely speaking, groups of nodes that are densely connected within each group and sparsely connected with other groups. While this vague idea is quite intuitive, there are various non-equivalent options for turning it into a quantifiable criterion for identifying communities, some of which we have mentioned in
Section 2. It is a priori not clear which criterion fits best for the language related purpose of establishing dense thematic connections. Comparative work in [
4] indicates that modularity maximization [
33] is very well suited for achieving the goal of topic detection.
Procedure 2 DetectTopics |
Given an undirected weighted graph $G = (V, E, w)$, modularity is a function which maps a partition $\mathcal{C} = \{C_1, \ldots, C_k\}$ of $V$ into vertex groups $C_i$ (with $C_1 \cup \ldots \cup C_k = V$ and $C_i \cap C_j = \emptyset$ for $i \neq j$) to a real number which measures how well the groups $C_i$ can be considered to be communities in the sense of high intra-group connectedness. It can be thought of as consisting of two parts. Denoting for a vertex $s \in V$ the group of $\mathcal{C}$ to which $s$ belongs by $C(s)$, and defining the total edge weight
$$m = \frac{1}{2} \sum_{s,t \in V} w(s,t),$$
the first part,
$$\frac{1}{2m} \sum_{s,t \in V} w(s,t)\, \delta\big(C(s), C(t)\big),$$
represents the fraction of all intra-group edge weights compared to the total edge weight of $G$, where we have used Kronecker's $\delta$ notation ($\delta(x,y) = 1$ for $x = y$ and $\delta(x,y) = 0$ for $x \neq y$) in order to express that $s$ and $t$ should be in the same group. The second part is the same fraction, computed not for the graph at hand, $G$, but for what one would expect from a random graph that has the same degree distribution (i.e., edge weight sum connected to each vertex) as $G$:
$$\frac{1}{2m} \sum_{s,t \in V} \frac{k_s k_t}{2m}\, \delta\big(C(s), C(t)\big),$$
with $k_s = \sum_{t \in V} w(s,t)$, the edge weight sum (degree) at vertex $s$.
Modularity now is the difference between these two terms:
$$Q(\mathcal{C}) = \frac{1}{2m} \sum_{s,t \in V} \left( w(s,t) - \frac{k_s k_t}{2m} \right) \delta\big(C(s), C(t)\big).$$
Maximizing the modularity therefore means finding a partition of the vertex set such that within the groups of this partition the fraction of intra-group edges is as big as possible when compared to the fraction of intra-group edges which one would expect in a similar random graph. At least intuitively, this translates well into what we are looking for: communities of terms that appear more often together in common documents than one would expect if the terms were randomly distributed.
The above definition of modularity can be generalized to include a parameter $\gamma > 0$ as follows [49]:
$$Q_\gamma(\mathcal{C}) = \frac{1}{2m} \sum_{s,t \in V} \left( w(s,t) - \gamma\, \frac{k_s k_t}{2m} \right) \delta\big(C(s), C(t)\big).$$
Maximizing $Q_\gamma$ leads to coarser or finer communities depending on the value of $\gamma$: in the extreme situation that $\gamma = 0$, the objective function is just the intra-group edge weight fraction, and its maximum is obviously reached at the trivial solution $\mathcal{C} = \{V\}$, i.e., when the whole graph is considered to be one big community. In the other extreme, $\gamma \to \infty$, the objective is to minimize the null-model term $\sum_{s,t \in V} \frac{k_s k_t}{2m}\, \delta(C(s), C(t))$, and this obviously happens when no two distinct vertices share a group, i.e., when each single vertex forms its own community such that the number of communities is equal to the number of vertices.
Varying $\gamma$ in the generalized modularity makes it possible to find communities of variable granularity, and therefore $\gamma$ is called the resolution parameter.
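To make the formula concrete, the following small function evaluates $Q_\gamma$ for a weighted undirected igraph graph and a given community assignment; it is a sketch for illustration, and for $\gamma = 1$ it should agree with igraph's built-in Graph.modularity(membership, weights="weight").

```python
import numpy as np
import igraph as ig

def generalized_modularity(g: ig.Graph, membership, gamma=1.0):
    """Q_gamma for a weighted undirected graph and a community assignment."""
    w = np.array(g.es["weight"], dtype=float)
    two_m = 2.0 * w.sum()
    k = np.array(g.strength(weights="weight"))    # weighted degrees k_s
    q = 0.0
    for edge, weight in zip(g.es, w):
        s, t = edge.tuple
        if membership[s] == membership[t]:
            q += 2.0 * weight                     # each undirected edge contributes (s,t) and (t,s)
    member = np.array(membership)
    for c in set(membership):                     # null-model term per community
        k_c = k[member == c].sum()
        q -= gamma * k_c ** 2 / two_m
    return q / two_m
```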
Computationally, maximizing the (generalized) modularity is known to be an NP-hard problem, which means that no efficient algorithm is known that is guaranteed to find an optimal solution. However, several efficient heuristic algorithms are known which produce potentially suboptimal but useful solutions.
Here, we use the recently published Leiden algorithm [
68], which is an improvement of the very popular Louvain algorithm [
34] for maximizing modularity. Starting from the extreme partition
where each vertex forms its own community, the Leiden algorithm first visits the nodes in a random order and tries greedily to shift nodes to other communities in a way that offers the biggest modularity increases. In a second step (which is the main improvement compared to the Louvain algorithm), the community partition is refined in a way that produces well-connected communities. After that, an aggregated graph is formed which contains the refined communities of the original graph as vertices. The whole procedure is repeated for the aggregated graph, and this is iterated until no further modularity increase can be achieved.
For our calculations we use the implementation of the Leiden algorithm in the Python version of the library igraph [
69].
Applying the Leiden algorithm with the standard resolution parameter ($\gamma = 1$) to the example graph results in 8 to 14 term communities, depending on the run. Each run may produce a different number of communities because the algorithm follows a non-deterministic heuristic. Therefore, not only the number but also the terms of the communities, which are only approximations to the optimal community partitioning, may change in each run. However, closer inspection shows that the differences between the runs are not large. In particular, if a run produces more than 8 communities, the additional communities are strikingly smaller than the biggest 8 communities. This is an indication that in some runs the greedy algorithm fails to assign some terms to any of the dominant communities and leaves them in some small residual communities.
While these odd cases are reasonably easy to detect, it is nevertheless better to remove such undecided terms completely from the picture. Therefore, for the actual topic detection we repeat the community detection step several times (lines 1 to 3 in Procedure 2) and consider only those sets of terms which end up together in the same Leiden community in a sufficiently large fraction of the runs (lines 10 to 15 in Procedure 2). Finally, of those sets we retain only the ones that contain a minimum number of terms (lines 16 to 19 in Procedure 2). We found that the remaining sets of terms form stable topics if we choose the number of runs and the agreement threshold suitably and set the minimum size to 10% of the terms that one would expect in a topic if all topics were of equal size.
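The following sketch illustrates this consensus scheme around igraph's Leiden implementation. For brevity it uses the stricter criterion that terms must be clustered together in every run; the run count and minimum community size shown are illustrative placeholders, and the keyword for the resolution value differs between igraph versions (resolution vs. resolution_parameter).

```python
from collections import defaultdict
import igraph as ig

def detect_stable_topics(g: ig.Graph, n_runs=20, min_size=100, resolution=1.0):
    """Simplified Procedure 2: repeat Leiden community detection and keep
    groups of terms that are clustered together in every run."""
    memberships = []
    for _ in range(n_runs):
        part = g.community_leiden(objective_function="modularity",
                                  weights="weight", resolution=resolution)
        memberships.append(part.membership)

    # Terms with an identical sequence of community labels over all runs
    # were never separated by the algorithm.
    groups = defaultdict(list)
    for v in range(g.vcount()):
        groups[tuple(m[v] for m in memberships)].append(g.vs[v]["name"])

    return [terms for terms in groups.values() if len(terms) >= min_size]
```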
Following Procedure 2, 22426 terms among the 23859 vertices of $G_p$ (or 94%) can be assigned to one of 8 stable topics. The biggest of these topics contains 4657 terms, the smallest 214 terms.
Based on the assignment of terms to topics, we can also determine which document of the corpus is concerned with which topics. The connection is made by counting the relative number of topic terms within a document: if $n_{d,i}$ is the number of terms in document $d$ that belong to the topic cluster $i$, then the share $n_{d,i}/n_d$ is a good indicator of the importance of topic $i$ for that document. While one can argue that here a probabilistic topic model would offer a sounder method for calculating the topic share per document, we found that a simple count of topic terms works very well, as we will show when we use the term community topics for document classification in
Section 4.
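A sketch of this per-document topic indicator, assuming each detected topic is given as a set of terms (names are illustrative):

```python
def topic_shares(doc_terms, topics):
    """Fraction of a document's terms that belong to each topic term set."""
    n_d = len(doc_terms) or 1
    return [sum(1 for t in doc_terms if t in topic) / n_d for topic in topics]
```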
Like with other topic detection approaches, the method results in a list of terms that characterize a topic, but the actual interpretation of what that topic is about is left to human evaluation. It is certainly a difficult task to look at hundreds or thousands of terms and make sense of what topic might be encoded in them. Probabilistic topic models that produce topics as probability distributions on terms have an advantage here at first glance: the topic terms can be sorted by their probability, and one usually looks only at the 10 to 30 most probable terms.
In the next subsection we explain how we suggest looking at large sets of topic terms in a way that facilitates interpretation in the absence of a probability distribution on the terms.
3.3. Topic Presentation
In this subsection we explain how we present the topic terms in an informative way that eases topic interpretation. The relevant steps are schematically outlined in Procedure 3 and Procedure 3a.
Procedure 3 PresentTopic |
Procedure 3a RankCorpusTerms |
Looking in a random order at the hundreds or thousands of terms which constitute a topic is certainly not helpful for grasping its meaning. It would be best to look at the most characteristic terms first; what we need is a term ranking. In the context of network preparation we have already established a document term ranking $r_d$. However, what is required now is a corpus term ranking. The former one can only decide which terms are the most characteristic ones for a certain document $d$, but now we need a more global ranking function $R$, independent of $d$. For a term $t$, the document term ranking function $r_d(t)$ is only defined for those documents $d$ which contain $t$. Since the document frequency $\mathrm{df}(t)$ varies a lot with $t$, it would not be fair to simply take the arithmetic average of the existing values of $r_d(t)$. This resembles star ratings in recommender systems, where typically some items have been rated many times and some items have only one rating. There, one uses a Bayesian average for ranking [71].
In order to transfer that solution to the problem of corpus term ranking, we first introduce a discretized version $\bar{r}_d$ of the document term ranking $r_d$ (lines 1 to 5 in Procedure 3a). This is because we do not claim that the document term ranking is sufficiently exact to measure the importance of terms continuously, but rather that it is a good basis for grouping terms into sets of more or less important terms. Let $t_1, \ldots, t_{n_d}$ be the ordered sequence of the terms in document $d$, with $r_d(t_i) \ge r_d(t_j)$ for $i < j$. Then we divide this sequence into $A$ parts $P_1, \ldots, P_A$ of equal length $n_d / A$. We also introduce a cut-off value $K \le A$; $A$ and $K$ are parameters which can be adjusted so as to result in a ranking that works well. We set
$$\bar{r}_d(t) = K - a + 1 \quad \text{if } t \in P_a \text{ with } a \le K,$$
and $\bar{r}_d(t) = 0$ otherwise. After some experimentation we fixed $A = 20$ and $K = 3$; this means that the top 5% of terms in a document get the discretized rating 3, the next 5% the rating 2, and the third 5% the rating 1.
Based on $\bar{r}_d$, we calculate the corpus term ranking function $R$ as the following Bayesian average (lines 6 to 8 in Procedure 3a):
$$R(t) = \frac{\overline{\mathrm{df}} \cdot \bar{r}_{\mathrm{avg}} + \sum_{d \ni t} \bar{r}_d(t)}{\overline{\mathrm{df}} + \mathrm{df}(t)},$$
with $\overline{\mathrm{df}}$, which is the document frequency averaged over all terms, and $\bar{r}_{\mathrm{avg}}$, which is the mean $\bar{r}_d$ ranking of all terms.
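The following sketch implements this two-step corpus ranking (discretization followed by the Bayesian average), assuming the per-document rankings are already available; function and variable names are illustrative.

```python
from collections import defaultdict

def corpus_term_ranking(ranked_doc_terms, A=20, K=3):
    """Procedure 3a sketch: discretize per-document rankings, then combine
    them into a Bayesian-average corpus ranking R(t).

    ranked_doc_terms: list of per-document term lists, each sorted by
    decreasing posIdfRank value r_d(t).
    """
    ratings = defaultdict(list)              # term -> discretized ratings over documents
    for terms in ranked_doc_terms:
        part_len = max(1, len(terms) // A)
        for i, t in enumerate(terms):
            a = i // part_len + 1            # index of the part P_a containing the term
            ratings[t].append(max(K - a + 1, 0))

    df_avg = sum(len(v) for v in ratings.values()) / len(ratings)
    r_avg = (sum(sum(v) for v in ratings.values())
             / sum(len(v) for v in ratings.values()))

    # Bayesian average: prior weight df_avg pulls rarely seen terms toward r_avg
    return {t: (df_avg * r_avg + sum(v)) / (df_avg + len(v))
            for t, v in ratings.items()}
```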
With the function $R$ we can sort all terms of the corpus. The ten terms ranked highest, i.e., the most specific terms for the corpus, are:
Yukos, Holmes, UKIP, Fiat, Blunkett, howard, Yugansk, Kenteris, Parmalat, Wenger |
The ten terms ranked lowest—the least specific ones—are:
year, place, time, month, spokesman, recent, week, Tuesday, long, second |
Now we can present the terms that form a topic in the order of decreasing values of $R$. The examples in Table 1 show how advantageous this is. Both columns show terms from the same topic found by community detection in $G_p$ with a fixed value of the resolution parameter. The topic comprises a total of 4080 terms.
The left column of the table shows just 18 randomly picked topic terms. Guessing the topic’s theme from these mostly general terms, like “contribution” or “flexible”, and less familiar person names or book titles is difficult. More generally, the chance that random samples include the most characteristic terms of the topic is low. In contrast, the right column shows the 18 topic terms with the highest values of $R$. Most of these terms are prominent politicians (e.g., “Blair”, “Blunkett”), political parties (e.g., “UKIP”, “lib_dem”) and other well-known entities (e.g., “Speaker”, “migrant”) from mainly, though not only, British politics. This makes the topic’s main theme immediately evident to domain experts.
However, looking at only one or two dozen of 4080 identified topic terms wastes a lot of information, and we suggest looking at many more terms when interpreting topics in order to achieve a proper assessment. In order to facilitate an overview of many topic terms, we add, next to the specificity ranking $R$, another structuring criterion to the set of topic terms by grouping them into clusters of semantically related terms. Here, we make use of pretrained fastText embeddings [56]. The fastText approach belongs to the semantic word embedding methods, through which words can be mapped to a moderately low-dimensional Euclidean vector space in such a way that semantic closeness translates into small metric distances. The pretrained fastText models are shallow neural nets with output vectors of dimension 300 that were trained on huge text collections from Wikipedia and Common Crawl, using the so-called CBOW task of predicting a word from its surrounding words. In distinction to its predecessor Word2Vec, fastText internally does not work on the level of words but on the level of their constituent character n-grams. In the present context this offers two advantages: first, the mapping works even for words which do not appear in the Wikipedia and Common Crawl collections; second, word variations due to spelling mistakes or imperfect lemmatization usually end up close to each other in the vector space representation.
If we now take the vector representations of the topic terms, we can use any metric-based clustering method for finding groups of semantically related words, or, more precisely, of words whose components have been seen frequently together in huge text collections. After some experimentation, we decided to use hierarchical clustering in its scikit-learn [
70] implementation AgglomerativeClustering with distance threshold 1 (lines 1 and 2 in Procedure 3).
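A sketch of this clustering step; the fastText model file name is an assumption, the distance threshold of 1 is taken from the text, and the linkage and metric defaults of AgglomerativeClustering used here are illustrative.

```python
import numpy as np
import fasttext
from sklearn.cluster import AgglomerativeClustering

# Pre-trained English fastText vectors (file name is an assumption).
ft_model = fasttext.load_model("cc.en.300.bin")

def semantic_strata(topic_terms, distance_threshold=1.0):
    """Group topic terms into clusters of semantically related terms."""
    vectors = np.array([ft_model.get_word_vector(t) for t in topic_terms])
    clustering = AgglomerativeClustering(n_clusters=None,
                                         distance_threshold=distance_threshold)
    labels = clustering.fit_predict(vectors)
    strata = {}
    for term, label in zip(topic_terms, labels):
        strata.setdefault(label, []).append(term)
    return list(strata.values())
```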
We show these semantic groups of terms, which we call semantic strata, rather than single terms, when we present the topics. We order the terms within each stratum by their $R$ value (lines 3 to 5 in Procedure 3), and the strata by the highest value of $R(t)$ among the terms $t$ of each stratum (lines 6 to 12 in Procedure 3). As a result, we have a two-dimensional order in the topic terms: one dimension ranking the specificity and one dimension depicting semantic relations.
For topic evaluation we produce large sheets with the topic terms structured in the stratified way described above. In
Table 2, we show only the beginning of such a sheet for better comprehension. The rows depict the strata, in which the $R$-sorted top terms of the topic are accompanied by semantically related terms, or, in the case of person names, by persons who usually appear in a common context.