1. Introduction
In order to develop natural language processing systems, it is necessary to construct language resources that capture as much as possible of the linguistic diversity present in a natural language. These resources, called corpora, are usually provided with metadata containing information about the tokens and the documents that form the corpus [1]. The addition of metadata to a corpus is called annotation or labelling: the process of adding information to plain textual data. Annotations can be applied to a document as a whole, to its sentences, or to its terms and words, and can be performed manually or automatically [2]. Annotations facilitate the development of various types of applications related to natural language understanding, ranging from information extractors to automatic language translators. The reason is that annotations (syntactic and/or semantic) provide additional information that helps establish the context of the statement in which a lexical item occurs and helps eliminate ambiguities.
Corpus annotation may be applied at various levels of linguistic structure. Annotations can thus express the grammatical class of the annotated elements (part of speech), their morphology, correlation phenomena, phonetic aspects, and so on [2]. Annotation may also cover other aspects related to the structure of the annotated text and its content as a whole. Many aspects of a lexical item can be annotated, but they can basically be divided into syntactic and semantic ones. Syntactic annotation aims to add information related to the form of the lexical item, such as its part-of-speech tag or its dictionary form (lemma). Semantic annotation, on the other hand, is the process of attaching to terms meaningful references that express their meaning; the annotation task attempts to capture the essence of the meaning of the tagged object [3]. In general, the main goal of semantic annotation is to make texts understandable by machines. In semantic annotation, a set of labels whose meaning has already been formally defined or is well known is selected, and such labels are assigned to the annotated terms according to their meaning. This annotation category can apply to words, sentences, paragraphs, or full texts and can incorporate one or more domains [4]. Semantic annotation has many advantages, such as allowing context comprehension, assisting search tools, establishing correlations, and, above all, attaching meaning to a set of words [5]. Semantic annotation establishes a network of concepts, allowing the context of the annotated content to be inferred. Once semantic annotation has been performed on a document, the document can be easily interpreted by machines, enabling a multitude of applications. The main contribution of semantic annotation is to eliminate, for computational devices, ambiguities regarding the meaning of words. Establishing the meaning of a lexical item is still a challenging task due to its polysemic character [6]. For example, the word “bank”, according to WordNet [7], has ten meanings as a noun and eight meanings as a verb. Some of these meanings are listed below:
Noun
S: (n) bank (sloping land (especially the slope beside a body of water)) “they pulled the canoe up on the bank”;
S: (n) depository financial institution, bank, banking concern, banking company (a financial institution that accepts deposits and channels the money into lending activities) “he cashed a check at the bank”;
S: (n) bank (a supply or stock held in reserve for future use (especially in emergencies)).
Verb
S: (v) bank (do business with a bank or keep an account at a bank) “Where do you bank in this town?”
S: (v) bank (cover with ashes so to control the rate of burning) “bank a fire”
S: (v) count, bet, depend, swear, rely, bank, look, calculate, reckon (have faith or confidence in) “you can count on me to help you any time”.
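Sense inventories such as WordNet can also be queried programmatically. The snippet below is a minimal sketch (our own illustration, assuming NLTK's WordNet interface, which was not used in the paper) that lists every sense of “bank” with its part of speech and gloss:

```python
# Minimal sketch: enumerate the WordNet senses of "bank" using NLTK.
# Assumes the WordNet data has been fetched via nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank'):
    # Prints e.g. "bank.n.01 (n): sloping land (especially the slope beside a body of water)"
    print(f"{synset.name()} ({synset.pos()}): {synset.definition()}")
```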
Polysemy is a complex area of study within linguistics, with divergent theories formulated by classical and cognitive linguistics [8,9]. Regardless of the underlying theory, in order to establish the correct meaning of a lexical item it is necessary to know the context in which it occurs, that is, the words that occur in its neighborhood. According to [10], as cited in [11], “You shall know a word by the company it keeps.” In the sentence “he cashed a check at the bank”, it is possible to infer that the word “bank” refers to a banking institution because of the co-occurrence of the words “cashed” and “check”.
Semantic annotation, which is the focus of this article, can be further subdivided. Pustejovsky and Stubbs [12] divide semantic annotation into the annotation of semantic roles and the annotation of semantic types. They say, “we can distinguish two kinds of annotation for semantic content within a sentence: what something is, and what role something plays.” In Semantic Typing annotation, a language structure is labeled with a type identifier, from a reserved vocabulary or ontology, indicating what it denotes, whereas in Semantic Role Labeling a language structure is identified as playing a specific semantic role relative to a role assigner, such as a verb [12]. For instance, in “Mary bought a car”, typing marks “Mary” as a person and “car” as a product, while role labeling marks “Mary” as the agent of the buying event.
As one of the branches of philosophy, the term ontology is related to the study of the things that exist in the world, and its main function is to organize what exists into a set of categories. In the words of [13], an ontology, from a Computer Science perspective, is a specification of a conceptualization. In this sense, an ontology tries to describe the concepts existing in a domain and to relate them according to their characteristics [14]. As a Computer Science object, the categories, concepts, and definitions of an ontology need to be constructed under a formal specification, representing an abstract real-world model capable of being read by machines. Since ontologies are formed by the concepts, properties, and relations of the domain they propose to specify, they may adequately be used to define part of the meaning of a lexical item.
When ontologies are used in the annotation process, the ontological classes become the labels and their contents specify which objects should be annotated [15]. The annotation task is closely related to the ontology domain. There are several types of ontology, classified according to their function and scope. Generally, when the semantic annotation task is aimed at annotating texts in a specific domain, classes from an ontology corresponding to that domain should be used. On the other hand, if the annotation comprehends broader concepts, a more generic ontology that can incorporate a vast number of domains should be used. General domain ontologies, also called top-level ontologies, are extremely extensive and can define concepts applicable to any domain; they describe broad and abstract concepts regardless of a particular problem or domain [16]. Due to their expressiveness and the large number of classes that compose them, commonly only fragments of such ontologies, usually the top level, are selected to form the label set with which the annotation task is performed. Annotating text with the concepts of a top-level ontology can be the starting point for deepening semantic annotation at more specific levels of a general ontology or of a domain ontology.
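As a concrete, hypothetical illustration of such a label set (the set actually adopted in this work is described in Section 3), the top level of an ontology like Schema.org can be flattened into a small tag vocabulary, with a catch-all tag O for tokens that fall outside every category:

```python
# Hypothetical label set derived from a top-level ontology fragment.
# The eight Schema.org top-level types are assumed here; "O" marks
# tokens outside all ontological categories.
SCHEMA_TOP_LEVEL = [
    "Action", "CreativeWork", "Event", "Intangible",
    "Organization", "Person", "Place", "Product",
]
LABELS = [t.upper() for t in SCHEMA_TOP_LEVEL] + ["O"]  # nine tags in total
```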
In addition to producing semantically annotated corpora, semantic annotation based on top-level ontologies can also be useful for the enrichment of Web content. One of the main applications of semantic annotation in the Internet sphere is the contribution it can offer to the Semantic Web [5]. The Semantic Web follows the principle that all information made available on the Internet should be labeled in such a way that computers are able to understand its content [17]. The ultimate goal of the Semantic Web is to enable machines to perform more useful tasks by developing a network of connected data through standard vocabularies and definitions that carry semantic meaning [18]. Attaining this goal would allow, for instance, search engines to respond to users with what is most relevant to their needs. Ontology vocabularies are an important element and a valuable tool for organizing the data of a domain and enriching it by adding meaning. In this sense, semantic annotation based on ontologies plays a fundamental role in the process of semantic enrichment of Web content to support the Semantic Web [19].
Moreover, semantic annotation, particularly annotation based on ontologies, can help improve the results of other applications that currently rely mostly on syntactic information and word relationships. This is the case of the applications addressed by [20,21]. Both papers deal with the classification of documents based on information extracted from the style of writing. In the first work the author tries to find out whether a scientific article was written by an automatic text generator or not, while the second seeks to identify the authorship of a text. Both studies produced expressive results, with accuracy greater than 88%, but it would be interesting to see whether semantically annotated texts could improve these numbers. Going beyond sentiment analysis, Preoţiuc-Pietro et al. [22] presented the results of research that aimed to predict the political ideology of Twitter users from the analysis of their posts. To carry out the predictions they used as language features unigrams, Linguistic Inquiry and Word Count (LIWC) categories, Word2Vec clusters, political terms, and words associated with six emotions. The results showed that Word2Vec clusters obtain the highest predictive accuracy for political leaning, while for political engagement, political terms and Word2Vec clusters obtain similar predictive accuracy. In works such as this, joining words with their ontological types before building the model has the potential to yield more accurate results. Liu et al. [23] investigated the feasibility of career path prediction from social network data. They proposed a multi-source learning framework with a fused lasso penalty (MSLFL), in which the predictions from individual sources should be the same or similar, with divergences being penalized. As the model fuses information distributed over multiple social networks to characterize users from multiple views, it could benefit from semantically annotated information to make a more appropriate merging. Estival et al. [24] developed a project in which ontologies are part of the reasoning process used for information management and for the presentation of information. According to the authors, “users access to information and the presentation of information to users are both mediated via natural language, and the ontologies used in the reasoning component are coupled with the lexicon used in the natural language component”. We believe that such a system could be even more efficient if the information base had previously been annotated with ontological tags.
Semantic annotations are valuable and help many types of NLP applications; however, according to [25], semantic annotation is an extremely time- and resource-consuming task. When annotation is performed through human work, factors related to time, cost, and the heterogeneity of language itself still prevent the task from being performed optimally. Automating the annotation routine with computational tools could provide a solution [12]. Thus, in order to optimize the task and decrease its complexity, researchers in the NLP area use methods that learn from previously annotated corpora through machine learning techniques. After being trained on a previously annotated corpus, learning algorithms are able to annotate new text documents. Nonetheless, making use of automatic annotation techniques via learning algorithms requires training material, and there is a shortage of annotated corpora for this task [15]. Another difficulty is finding top-level ontologies capable of specifying appropriate domains to guide the semantic annotation process.
These factors that hinder the development of semantic annotation serve as motivation for research in the area. The main objective of this work is to use a machine learning method to perform semantic annotation, based on a top-level ontology, of an American English corpus. Specifically, we constructed a model capable of classifying the selected top-level ontology types with a satisfactory prediction rate and applied it to the semantic annotation task. This paper is organized as follows. The next section gives an overview of the works related to this research, presenting their advances and highlighting the points that can be improved. Section 3 is divided into three subsections: the Schema.org ontology subsection presents the process of ontology selection as well as its characterization and its definition as the top-level ontology responsible for generating the classification labels; the Corpus subsection introduces more details about the adopted corpus; and the CRF approach subsection describes the chosen classification model, the preprocessing stages, and the classification process. The results achieved in the classification stage, and a discussion about them, are presented in Section 4, and finally the conclusions are presented in Section 5.
2. Related Works
Automatic semantic annotation based on top-level ontologies is a recent research area, made feasible by novel advances in hardware architectures, and therefore few papers are available in the literature for comparison. In this section, we present an overview of related work concerning the semantic annotation of texts, even though some of it does not specifically address ontological knowledge. The following papers can be divided, according to the type of annotation, into three groups: semantic role annotation, named entity recognition, and ontology-based annotation. Although the first two groups have different definitions and applicability, they were considered because they use similar techniques and share similar challenges. The third group has the same goals as the work presented in this paper.
Semantic role labeling (SRL) is the task of identifying the semantic arguments of a predicate and labeling them with their semantic roles [26]. One can distinguish annotation based on macro-roles, such as agent and patient, from annotation based on micro-roles, such as those defined by frame semantics theory [27]. FitzGerald et al. [26] proposed a method for semantic role annotation in which arguments and semantic roles are jointly embedded in a shared vector space for a given predicate. The embedding is produced by a neural network model. Training their model jointly on both FrameNet and PropBank data, they achieved the best result to date on the FrameNet test set.
Named Entity Recognition (NER) holds a certain relation to ontological annotation, since the tags used in NER, such as PEOPLE, ORGANIZATION, and REGION, are a subset of the ontological categories of some top-level ontologies. Hence, NER can be considered a particular case of ontological annotation. The work presented by [28] describes the use of the Conditional Random Field algorithm for named entity recognition. Their work comes close to ours since, in addition to recognizing entities, they add semantic features before performing the annotation. This approach differs from the usual one for NER and brings better results by augmenting the semantic information supplied to the model. The recognition was performed by combining standard training features with semantic information gathered from the Cogito linguistic analysis engine (http://www.expertsystem.com/). Cogito semantic analysis creates a network associating words that are related to each other via semantic links. The experiments were applied to the CoNLL 2003 NER corpus (https://www.clips.uantwerpen.be/conll2003/ner/), which was manually annotated using five categories: PER, LOC, ORG, MISC, and O. Throughout the experiments the authors compared results obtained with and without the use of semantics. They also employed corpora of different sizes to analyze performance. According to the authors, the results were considerably better when compared to the usual approach. Without semantics, they obtained an average of 0.8507 for precision, 0.8188 for recall, and 0.8336 for F1-measure. Adding the semantic information, they obtained an average of 0.8629 for precision, 0.8392 for recall, and 0.8505 for F1-measure. Hence, the research showed that combining semantic information with training features had a positive effect on the outcome of the NER task.
Skeppstedt et al. [29] proposed the use of machine learning techniques to recognize and annotate disorders, findings, pharmaceuticals, and body structures in clinical text. Although the research differs in having no ontological background, the annotation procedure the authors performed is quite similar to ours. The procedure aimed to recognize clinical entities in medical texts written in Swedish. It is common to annotate health records with those classes in order to assist patient analysis and the construction of medical hypotheses. The contribution of the proposal is to aid medical knowledge extraction in a language other than English. To that end, a comparative study was carried out to figure out how well clinical entities previously annotated in English are recognized in a Swedish clinical corpus. The main reason this research was selected as related work is its automatic annotation approach, performed with the same machine learning algorithm we use. After corpus selection and the distribution of training and test sets, the CRF algorithm was applied to annotate the four selected categories. Using the best features and settings, and measuring the ability to generalize to held-out data, the algorithm achieved an F-score of 0.81 for Disorder, 0.69 for Finding, 0.88 for Pharmaceutical Drug, 0.85 for Body Structure, and 0.78 for the combined category Disorder + Finding.
The work proposed by [30] focused on ontological annotation. The study came up with a self-adaptive system for automatic ontology-based annotation of unstructured documents in the context of digital libraries. Unlike our proposal, this work annotates an entire document from an ontological perspective rather than its terms, the use of ontologies being what correlates the two works. The authors aim to create a system capable of automating the ontology-based annotation of texts from digital libraries. The work is based on STOLE [31], an ontology-based digital library created from documents about the history of public administration in Italy in the 19th and 20th centuries. For annotation purposes, they considered five classes from STOLE: Article, Event, Institution, Legal System, and Person. To execute the task, 20 manually annotated documents were selected. A preprocessing phase was applied to the corpus, providing the information necessary to build the features, such as sentence boundaries, part-of-speech tags, and named entities. The system used its own algorithm, capable of annotating automatically from features extracted from the document, with a self-adaptive approach. After all tests, it was noticed that the application is sensitive to the entry order of the documents, producing different results for each ordering. The best results achieved had a precision of 0.80, recall of 0.53, and F-measure of 0.63. Although the results are considerably low and the system does not use any machine learning approach, the study is important for introducing ontology-based annotation as a new field of study for both specific and general domains.
Another work that used an ontology to annotate terms related to a domain is described in the article by [32]. In their work, the authors used ontology-based semi-supervised conditional random fields for automated information extraction from bridge inspection reports. The research focused on extracting information about bridge maintenance and deficiencies, naming entities related to the theme. As objects of analysis, eleven bridge inspection reports were used, which the authors considered to have a sufficient amount of content to carry out the research. These reports yielded a total of 1866 sentences on different aspects of bridge maintenance, complexity, condition, and age, rendering a corpus appropriate for the proposed technique. The authors carried out a preprocessing phase, making the documents readable by the application. Subsequently, the feature extraction phase was accomplished, taking into account aspects related to part of speech, stems, and semantic characteristics. Finally, the last step extracted information through the semi-supervised conditional random fields. In the evaluation phase, the classification of eleven classes on the test set reached an average precision, recall, and F1-measure of 94.1%, 87.7%, and 90.7%, respectively. The ontological aspect of the research is related to the definition of the tags in the extraction process, so the ontology assisted in analyzing the context based on a specific domain. Again, the research differs from our work because it performs domain-specific annotation, but it is similar in being based on ontologies to annotate text and in using CRF as the learning model.
4. Results and Discussion
This section presents the results obtained in the annotation phase using the CRF classifier. As mentioned earlier, the focus of the annotation phase was to tag lexemes, using features, over the nine tag classes. However, the results described here take into account only the eight classes that actually assign semantic value to the annotated words. The class O, standing for lexemes classified as OTHER, does not add semantic value to the words, so it is not relevant to the purpose of this research. Another important point to consider is that the corpus used as training and test set does not have a balanced distribution between classes, which can lead to a biased weight distribution. Finally, we carried out two types of test: the first used a simple CRF configuration, and the second used hyper-parameter optimization and cross-validation. Recall that the corpus was divided into subsets of gradually increasing size, so that the behavior of the model could be analyzed throughout the experiments.
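The paper's exact tooling is not reproduced here, but the first, plain-CRF configuration can be sketched as follows, assuming the sklearn-crfsuite library (an assumption on our part) and toy stand-ins for the real feature and label sequences:

```python
# A minimal sketch of the plain-CRF configuration, assuming sklearn-crfsuite;
# the tool and settings actually used in the experiments may differ.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Toy stand-in for the real data: each sentence is a list of per-token
# feature dicts, paired with a sequence of tags.
X_train = [[{'word.lower()': 'kansas', 'postag': 'NNP'},
            {'word.lower()': 'city', 'postag': 'NNP'}]]
y_train = [['PLACE', 'PLACE']]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',             # L-BFGS training with elastic-net regularization
    c1=0.1, c2=0.1,                # illustrative L1/L2 coefficients
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)

# Score only the semantic classes, excluding the catch-all O tag.
labels = [l for l in crf.classes_ if l != 'O']
y_pred = crf.predict(X_train)
print(metrics.flat_f1_score(y_train, y_pred, average='weighted', labels=labels))
```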
Table 1 presents the precision, recall, and F1-measure values for the eight annotated classes. The results relate to the execution performed with 100% of the corpus using the standard CRF model. The training phase comprised a total of 6997 documents, encompassing the various literary genres stored in the corpus. After the training stage, the test set was submitted for evaluation, producing the described results. All classes showed satisfactory results in the annotation process, obtaining a score higher than 85% in the F1-measure. The class that presented the best results was ACTION, probably due to the fact that this class comprises nouns that have some distinguishing features, such as having the suffix “ing” or being preceded by the verb “to be”. Although the EVENT class presented the lowest number of words for evaluation, it obtained results similar to those of the other classes. The class that presented the lowest scores was ORGANIZATION, with an F1-measure of 0.875, even though it has a relatively high number of occurrences in the corpus. At the end of the tests, the model reached an overall average of 0.940 for precision and 0.929 for recall. These values yielded an F1-measure of 0.935, an impressive result for the task of automatic semantic annotation.
After the execution, the tool used to run the CRF model outputs a list containing the most prominent features detected by the learning process, ranked according to their weight in the classification of each class. Table 2 presents the features that are most relevant to the prediction of each of the top-level classes of Schema.org; the features listed relate to the processing of the whole dataset. The elements shown in Table 2 can be interpreted as follows: “affix” refers to a token suffix or prefix; “postag” refers to the part of speech of the token, whose value follows the nomenclature used in the Penn Treebank Project (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html); the “−” and “+” signs denote the position relative to the current token, with “−” indicating a previous position in the sequence and “+” a posterior one; and the number indicates the distance, in positions, from the current token. There are many other features, but these were selected to depict the behavior of the model during the prediction phase.
Thus, in the sentence “top things to do in Kansas City”, the word “Kansas” is preceded by the preposition “in”, which matches the “−1:word.lower():in” feature, and is followed by the noun “city”, which matches the “+1:word.lower():city” feature. These two features reinforce the likelihood of the word “Kansas” falling into the category PLACE.
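The sketch below shows how such per-token features could be extracted; the function is our own illustration, consistent with the naming scheme of Table 2 but not the paper's verbatim feature set:

```python
# Illustrative per-token feature extraction following the naming scheme of
# Table 2 (affixes, POS tags, and a +/-1 context window). `sent` is a list
# of (word, postag) pairs; this is not the paper's exact feature set.
def word2features(sent, i):
    word, postag = sent[i]
    features = {
        'word.lower()': word.lower(),
        'affix': word[-3:],            # suffix; a prefix feature is analogous
        'postag': postag,              # Penn Treebank tag
    }
    if i > 0:
        prev_word, prev_postag = sent[i - 1]
        features['-1:word.lower()'] = prev_word.lower()
        features['-1:postag'] = prev_postag
    else:
        features['BOS'] = True         # token starts the sentence
    if i < len(sent) - 1:
        next_word, next_postag = sent[i + 1]
        features['+1:word.lower()'] = next_word.lower()
        features['+1:postag'] = next_postag
    else:
        features['EOS'] = True         # token ends the sentence
    return features
```

For “Kansas” in the example sentence above, this extractor yields exactly the two features just cited, “−1:word.lower():in” and “+1:word.lower():city”.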
Table 3 summarizes the hyper-parameters used in training, with and without the hyper-parameter optimizer.
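The optimized configuration could be reproduced along the following lines, again assuming sklearn-crfsuite together with scikit-learn's randomized search; the parameter distributions, fold count, and tag names are illustrative, not the values reported in Table 3:

```python
# Illustrative hyper-parameter optimization with cross-validation.
# X_train and y_train are the same hypothetical objects as in the earlier
# sketch; the tag names below are assumed.
import scipy.stats
import sklearn_crfsuite
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import metrics

labels = ['ACTION', 'CREATIVEWORK', 'EVENT', 'INTANGIBLE',
          'ORGANIZATION', 'PERSON', 'PLACE', 'PRODUCT']

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100,
                           all_possible_transitions=True)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),   # L1 coefficient, sampled randomly
    'c2': scipy.stats.expon(scale=0.05),  # L2 coefficient, sampled randomly
}
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels)
search = RandomizedSearchCV(crf, params_space, cv=3, n_iter=50,
                            scoring=f1_scorer, n_jobs=-1)
search.fit(X_train, y_train)
print('best params:', search.best_params_, 'best CV score:', search.best_score_)
```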
Figure 2 shows the F1-measure scores for each class and for each dataset size. From Figure 2 it is possible to note that the scores increase as the dataset grows. All classes have a relatively low F1-measure for datasets with sizes of 1% and 10% of the whole corpus. However, as the size of the dataset increases, the F1-measure increases significantly up to 50% of the whole corpus. Beyond 50%, the F1-measure continues to grow, but more slowly. This demonstrates that, although a bigger corpus produces better results, a dataset with 50% of the total corpus size is enough to evaluate the model.
Due to restrictions on the computational resources available to execute the second approach of the CRF model on the whole corpus, the results presented in Table 4 are partial: the tests performed in the second approach comprised 50% of the corpus. As mentioned earlier, the second phase of the model corresponds to the use of hyper-parameter optimization and cross-validation in order to improve the performance of the annotation process. These techniques require additional computational power to train the model; therefore, since the corpus used is large, the available resources were only able to handle half of the set. Regarding the results obtained, it is important to note that the performance of the model using this approach is similar to the results obtained using the whole corpus. The tests obtained a value of 0.927 for precision, 0.918 for recall, and 0.923 for F1-measure, using 116,530 tokens tagged over the eight classes. On average, the results achieved using hyper-parameter optimization and cross-validation were higher than those of the simple configuration on a corpus of the same size, suggesting that the use of the tool's optimization module is beneficial.
5. Conclusions
The meaning of words has many facets and levels of detail. One of these facets is the information captured by ontologies, which can provide some insight into the nature of the concept denoted by a lexeme. Nonetheless, there are several levels of detail and ontological aspects that can be explored. Top-level ontology categories capture general concepts that attempt to present what exists in the world; in this sense, using these categories for annotating texts helps to categorize lexemes broadly. Semantic annotation can be performed either manually or automatically, but doing it manually is expensive, as it demands skilled labor and time. Aiming to mitigate this problem, this research proposed an automatic approach to semantic annotation based on top-level ontologies through the use of a supervised machine learning model.
Semantic annotation based on top-level ontologies using a supervised machine learning approach may contribute considerably to the successful execution of the task. For the execution of this research, it was necessary to define a corpus; in this case, the OANC corpus was selected. It was also necessary to correct some errors in the annotated words and to format the corpus into an appropriate pattern. To supply the categories for the annotation process, Schema.org, an ontology aimed at organizing the most common types on the Web, was selected. From Schema.org, the top-level categories were chosen: eight types that became the classification tags used by the machine learning model. This work focused on the use of the CRF model due to its ability to relate features that occur far apart in a sequence, which is the case in natural language statements. It is a system that performs well, provided that proper feature engineering is carried out, and, once trained, it can perform text annotation in real time. Finally, the classification was performed, and the results for the different versions of the training and test sets were analyzed.
The results obtained were encouraging, although they are difficult to compare with other studies, since there is a lack of related work in the area of ontology-based semantic annotation. In general, the CRF model presented excellent results when annotating the corpus with the eight selected classes, achieving an F1-measure above 85% for each class and an average of 93.5% over all classes. Comparing these results with the state of the art reported in the related works section, we can see that the proposal produces equivalent or superior results. That said, the use of CRF has some advantages over other currently suggested techniques. Compared with the maximum-entropy Markov model (MEMM), CRF does not suffer from the “label bias problem”, in which states with low-entropy transition distributions ignore their observations; however, CRF takes considerably longer to train. Compared with deep learning techniques, CRF has the advantage of not behaving like a black box while still presenting competitive results, although it has the drawback of needing a feature engineering phase. Another significant outcome of the research is the set of results obtained for the classes PERSON, PLACE, and ORGANIZATION, commonly used in named entity classification; in these classes we achieved results comparable to the state of the art. Although computational resources prevented the use of the entire database with hyper-parameter optimization and cross-validation, the results were positive enough to justify the approach. To conclude, after analyzing all the results obtained, it is possible to state that, although automatic text annotation based on top-level ontologies is still a new approach, the results of this research were quite promising, suggesting the continuation of the research in this direction.
For future work, we plan to use more powerful computational resources able to deal with the whole corpus. We also plan to apply hyper-parameter optimization techniques and cross-validation to the full corpus in order to improve the results and the predictive power of the model. Another avenue to be explored is the use of other machine learning techniques to compare with the results obtained; one such technique that has become popular is deep learning. Finally, intending to deepen the process of ontology-based semantic annotation, one proposal to be analyzed is the use of lower-level categories of Schema.org in order to assign meaning to lexemes at a greater level of detail.