1. Introduction
An important part of humanity’s cultural heritage resides in its literature [1], a rich body of interconnected works revealing the history and workings of human civilization across the eras. Major novelists have produced their works by engaging with the spirit of their time [2] and capturing the essence of society, human thought and accomplishment.
Cultural Heritage (CH) in its entirety constitutes a “cultural capital” for contemporary societies because it contributes to the constant valorization of cultures and identities. It is also an important tool for the transmission of expertise, skills and knowledge across generations and is closely related to the promotion of cultural diversity, creativity and innovation [3]. For this reason, proper management of the development potential of CH requires a sustainability-oriented approach, i.e., one that ensures both the preservation of the heritage from loss and its connection to the present and the future. Proper management of literary cultural heritage, therefore, requires extensive digitization of collections and procedures that allow for the automatic extraction of semantic information and metadata to ensure the organization of past collections and their linkage with present and future documents.
Until recently, engaging with a large body of literature and discovering insights and links between storytellers and cultures was a painstaking process which relied mainly on close reading [4]. Nowadays, however, the large-scale digitization of texts, together with developments in Artificial Intelligence (AI) and Natural Language Processing (NLP), makes it possible to explore the richness of our written heritage at an unprecedented scale and with methods that were not possible before, while also facilitating the management and preservation of texts [5].
One of the opportunities afforded by digitization is relation extraction (RE): the automatic discovery of relations between entities in a document. This task plays a central role in NLP because relations can be used to populate knowledge bases (KB), to index corpora in search engines, to answer questions related to the text, to assist in the comparative analysis of texts and to understand/analyze the narration of a story. In this paper, we present a novel deep-learning model for RE that enables applications in all of the above domains by automatically identifying relations between entities in 19th century Greek literary texts. Although there are several RE approaches in the literature, the particular texts we are interested in (fiction), the language and the specific period all present significant challenges. We will have more to say about these shortly.
Most RE methods follow a supervised approach; thus, the large amount of labeled training data they require constitutes perhaps the greatest barrier to real-world applications. In order to overcome this challenge, RE research has adopted distantly supervised approaches that are based upon automatically constructed datasets. Towards that end, Reference [6] proposed to use distant supervision (DS) from a KB, assuming that if two entities in a KB exhibit a relation, then all sentences mentioning those entities express that relation. This assumption inevitably results in false positives, i.e., remotely generated records containing incorrect labels. In order to mitigate the problem of wrong labeling, Reference [7] relaxed the assumption so that it does not apply to all instances and, together with [8,9], proposed multi-instance learning. In that setting, classification shifts from instance-level to bag-level, with current state-of-the-art RE methods focusing on reducing the effect of noisy instances.
At the same time, extracting relations from literary texts has been undertaken only in the broader context of people in dialogue [10,11,12,13], people in the same place [14] and event extraction [14,15] and not, thus far, in the context of predefined relations among named entities other than person and place. We also emphasize the fact that state-of-the-art RE approaches are evaluated mostly on news corpora. The reason is that literary texts put emphasis on the narrative craft and exhibit characteristics that go beyond journalistic, academic, technical or more structured forms of writing. Moreover, literary texts are characterized by creative writing peculiarities that can vary significantly from author to author and from period to period. In addition, as most works of literature have been digitized through OCR systems, the digitized versions can also suffer from character or word misspellings. All of these factors make it extremely challenging to discover entity relations in literary texts.
In order to address these challenges, we propose REDSandT_Lit (Relation Extraction with Distant Supervision and Transformers for Literature), a novel distantly supervised transformer-based RE model that can efficiently identify six distinct relationships in Greek literary texts of the 19th century, the period that “contains” the largest part of digitized Modern Greek literature. Since no related dataset exists, we undertook the construction of a new dataset including 3649 samples annotated through distant supervision with seven semantic labels: the six relationships plus ’NoRel’ for instances with no labelled relation. Our dataset is in the Katharevousa variant of Greek, an older, more formal and more complex form of the Modern Greek language in which a great part of Modern Greek literature is written. In order to capture the semantic and syntactic characteristics of the language, we exploited the state-of-the-art transformer-based Language Model (LM) for Modern Greek (GREEK-BERT [16]), which we fine-tuned on our specific task and language. In order to handle the problem of noisy instances, as well as the long sentences that are typical in literary writing, we guided REDSandT_Lit to focus solely on a compressed form of the sentence that includes only the text surrounding the entity pair together with their entity types. Our model then encodes sentences by concatenating the entity-pair type embeddings, with relation extraction occurring at bag level over a weighted sum of the bag’s sentence representations. Regarding the selected transformer-based model, the reasons for choosing BERT [17] are twofold: (i) BERT is the only transformer-based model pre-trained on Modern Greek corpora [16], and (ii) BERT considers bidirectionality during training, with [18] showing that BERT captures a wider set of relations than GPT [19] under a DS setting. Extensive experimentation and comparison of our model to several existing models for RE reveals REDSandT_Lit’s superiority. Our model captures all relations with high precision (75–100% P), including the infrequent ones that other models failed to capture. Moreover, we observe that fine-tuning a transformer-based model under a DS setting and incorporating entity-type side information substantially boosts RE performance, especially for the relations in the long tail of the distribution. Finally, REDSandT_Lit manages to find additional relations that were missed during annotation.
Our proposed model is the first to extract semantic relationships from 19th century Greek literary texts and the first, to our knowledge, to extract relationships between entities other than person and place; thus, we provide a broader and more diverse set of semantic information on literary texts. More precisely, we expand the boundaries of current research from narration understanding to extended metadata extraction. Even though online repositories provide several metadata that accompany digitized books to facilitate search and indexing, digitized literary texts contain rich semantic and cultural information that often goes unused. The six relationships identified by our model can further enrich the books’ metadata, preserve more information and facilitate search and comparisons. Moreover, having access to a broader set of relations can boost downstream tasks, such as recommending similar books based on hidden relations. Finally, distant reading [4] goes one step further, helping readers and storytellers understand the setting of a story more quickly and easily.
The remainder of this paper is organized as follows: Section 2 contains a brief literature review, and Section 3 discusses our dataset and proposed methodology. Section 4 and Section 5 contain our results and discussion, respectively. Finally, Section 6 concludes the paper and discusses future work.
3. Materials and Methods
As discussed in the Introduction, extracting cultural information from literary texts demands either a plethora of annotations or robust augmentation techniques that can capture a representative sample of annotations and boost machine learning techniques. At the same time, automatically augmented datasets are inevitably accompanied by noise, and the characteristics of creative writing pose an extra challenge.
In this section, we present a new dataset for Greek literary fiction from the 19th century. The dataset was created by aligning entity pair-relation triplets to a representative sample of Greek 19th century books. Even though we efficiently manage to augment the training samples, these inevitably suffer from noise and include imbalanced labels. Moreover, the special nature of 19th century Greek poses an extra challenge.
We then present our model: a distantly supervised transformer-based RE method, based on [18], that has been shown to efficiently suppress DS noise by using multi-instance learning and exploiting a pre-trained transformer-based LM. Our model adopts a simpler configuration for the final sentence embedding, which manages to capture a larger number of relations by using entity-type information and the GREEK-BERT [16] pre-trained model.
3.1. Benchmark Dataset
Preserving semantic information from cultural artifacts requires either extensive annotation, which is rarely available, or automatically augmented datasets that sufficiently capture context. In the case of literary texts, no suitable dataset exists on which to train our models. Taking into account that the greatest part of digitized Modern Greek literature dates to the 19th century, we construct our dataset by aligning relation triples from [41] to twenty-six (26) Greek literary books of the 19th century (see Table A1). Namely, we use the provided relation triplets (i.e., head-tail-relationship triplets) as an external knowledge base (KB) to automatically extract sentences that include the entity pairs, assuming that these sentences also express the corresponding relationship (distant supervision).
The dataset’s six specific relations and their statistics can be found in Table 1. The train, validation and test sets follow an 80%-10%-10% split. We assume that a relationship can occur within a window of three consecutive sentences and only between two named entities. Sentences that include at least two named entities of different types but do not constitute a valid entity pair are annotated with a “NoRel” relation. These can either reflect sentences with no actual underlying relation or sentences for which the annotation is missing. The dataset also includes the named entity types of each sentence’s entity pair. The following five entity types are utilized: person (PER), place (GPE), organization (ORG), date (DATE) and book title (TITLE). We have made this dataset publicly available (data available at: https://github.com/intelligence-csd-auth-gr/extracting-semantic-relationships-from-greek-literary-texts, accessed on 3 August 2021) to encourage further research on 19th century Greek literary fiction.
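To make the alignment procedure concrete, the following minimal Python sketch shows how KB triples could be matched against windows of consecutive sentences. The function name, data structures and the rule of only pairing entities of different types are illustrative assumptions, not the exact pipeline used to build the dataset.

from itertools import combinations

def align_triples(book_sentences, entity_mentions, kb_triples, window_size=3):
    """Illustrative distant-supervision alignment (a sketch, not the actual pipeline).

    book_sentences:  list of sentence strings
    entity_mentions: parallel list; entity_mentions[i] holds (text, type) tuples found in sentence i
    kb_triples:      dict mapping (head_text, tail_text) -> relation label
    """
    samples = []
    for start in range(len(book_sentences)):
        window = book_sentences[start:start + window_size]
        mentions = {m for sent in entity_mentions[start:start + window_size] for m in sent}
        for (h_text, h_type), (t_text, t_type) in combinations(sorted(mentions), 2):
            if h_type == t_type:
                continue  # assumption: pairs are formed between named entities of different types
            # If the KB contains the pair, assume the window expresses that relation (DS);
            # otherwise the pair is annotated with the artificial "NoRel" class.
            relation = kb_triples.get((h_text, t_text), "NoRel")
            samples.append({
                "text": " ".join(window),
                "head": (h_text, h_type),
                "tail": (t_text, t_type),
                "relation": relation,
            })
    return samples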
The challenges of this dataset are threefold. First, similar to all datasets created via distant supervision, ours suffers from noisy labels (false positives) and is imbalanced, including relations with a varying number of samples. Second, the dataset includes misspellings stemming from the books’ digitization through OCR systems. Lastly, the documents use a conservative form of the Modern Greek language, Katharevousa, which was in use between the late 18th century and 1976. Katharevousa, in which a significant part of Modern Greek literature is written, is more complex than Modern Greek, including additional cases, compound words and other grammatical features that pose an extra challenge for the algorithm.
3.2. The Proposed Model Architecture
In this section, we present our approach towards extracting semantic relationships from literary texts. The specific challenges that we have to address are as follows: DS noise, imbalanced relations, character misspellings due to OCR, the Katharevousa form of the Greek language and creative writing peculiarities. Inspired by [18,19], who showed that DS and pre-trained models can suppress noise and capture a wider set of relations, we propose an approach that efficiently handles the aforementioned challenges by using multi-instance learning, exploiting a pre-trained transformer-based language model and incorporating entity-type side information.
In particular, given a bag of sentences that concern a specific entity pair, our model generates a probability distribution over the set of possible relations. The model utilizes the GREEK-BERT pre-trained LM [16] to capture the semantic and syntactic features of sentences by transferring pre-trained common-sense knowledge. In order to capture the specific patterns of our corpus, we fine-tuned the model using multi-instance learning; namely, we trained our model to extract the entity pair’s underlying relation given its associated sentences.
During fine-tuning, we employ a structured, RE-specific input representation to minimize architectural changes to the model [42]. Each sentence is transformed into a structured format that includes a compressed form of the sentence along with the entity pair and their entity types. We transform the input into a sub-word level distributed representation using byte-pair encoding (BPE) and positional embeddings from GREEK-BERT fine-tuned on our corpus. Lastly, we concatenate the head and tail entities’ type embeddings, as shaped by BERT’s last layer, to form the final sentence representation that we use to classify the bag’s relation.
The proposed model can be summarized in three components: the sentence encoder, the bag encoder and model training. These components are described in the following sections, with the overall architecture shown in Figure 1 and Figure 2.
3.2.1. Sentence Encoder
Our model encodes sentences into a distributed representation by concatenating the head (h) and tail (t) entity type embeddings. The overall sentence encoding is depicted in Figure 1, while the following sections briefly examine the parts of the sentence encoder in a bottom-up manner.
In order to capture the relation hidden between an entity pair and its surrounding context, RE requires structured input. To this end, we encode sentences as a sequence of tokens. At the very bottom of Figure 1 is this representation, which starts with the head entity type and token(s) followed by the delimiter [H-SEP], continues with the tail entity type and token(s) followed by the delimiter [T-SEP] and ends with the token sequence of a compressed form of the sentence. The whole input starts and ends with the special delimiters [CLS] and [SEP], respectively, which are typically used in transformer models. In BERT, for example, [CLS] acts as a pooling token representing the whole sequence for downstream tasks such as RE; we do not follow that convention here. Furthermore, tokens refer to the sub-word tokens of each word, where each word is also lower-cased and normalized in terms of accents and other diacritics; for example, the word “Aρσάκειο” (Arsakeio) is split into the “αρ” (“ar”), “##σα” (“##sa”) and “##κειο” (“##keio”) sub-word tokens.
Input Representation
As discussed in Section 3.1, samples including a relation can span up to three sentences; thus, the samples, generally referred to as sentences within this document, can entail information that is not directly related to the underlying relation. Moreover, creative writing’s focus on narration results in long secondary clauses that further disrupt the content linking the two entities. In order to focus on the tokens important to the relation, we adopt two distinct compression techniques, namely the following:
trim_text_1: Given a sentence, it preserves the text starting from the three words preceding the head entity to the three words following the tail entity;
trim_text_2: Given a sentence, it preserves only the surrounding text of the head and tail entities, with surrounding text referring to the three words preceding and following each entity.
Our selection is based on the fact that the context closer to the entities holds the most important relational information. We experimented with two compressed versions of the text, one that keeps all text between the two entities (trim_text_1) and one that keeps only the very close context (trim_text_2), assuming that the in-between text, if long enough, typically constitutes a secondary clause irrelevant to the underlying relation. Our assumption is confirmed by our experiments (see Section 4 and Section 5). A sketch of the two heuristics is given below.
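A minimal sketch of the two compression heuristics, assuming whitespace-tokenized text and single-token entities at word positions head_idx and tail_idx (multi-word entities and edge cases are ignored for brevity):

def trim_text_1(words, head_idx, tail_idx, k=3):
    """Keep everything from k words before the head entity to k words after the tail entity."""
    start = max(min(head_idx, tail_idx) - k, 0)
    end = max(head_idx, tail_idx) + k + 1
    return words[start:end]


def trim_text_2(words, head_idx, tail_idx, k=3):
    """Keep only the k words before/after each entity, dropping the text in between."""
    def context(idx):
        return words[max(idx - k, 0):idx + k + 1]
    return context(head_idx) + context(tail_idx)


# Example (hypothetical): entities at positions 2 and 12 of a 20-word sentence.
# trim_text_1 keeps words 0..15, while trim_text_2 keeps words 0..5 plus words 9..15.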
After compressing the sentences to a more compact form, we also prepend the head and tail entities’ text and types to the structured input in order to bias the LM towards the features that are important for the entity pair. Extensive experimentation reveals that the extracted entity type embeddings hold the most significant information for extracting the underlying relation between two entities. Entity types are considered known and are provided in the dataset.
Input Embeddings
The input embeddings to GREEK-BERT are presented in Figure 1. Each token’s embedding results from summing the positional and byte-pair embeddings of that token in the structured input.
Position embedding is an essential part of BERT’s attention mechanism, while byte-pair embedding is an efficient method for encoding sub-words in order to account for vocabulary variability and possible new words at inference time.
To make use of sub-word information, the input is tokenized using byte-pair encoding (BPE). We use the tokenizer of the pre-trained model (35,000 BPEs), to which we added seven task-specific tokens ([H-SEP], [T-SEP] and five entity type tokens). We forced the model not to decompose the added tokens into sub-words because of their special meaning in the input representation.
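The sketch below shows how this step could look with the Hugging Face transformers library; the checkpoint id, the spelling of the entity-type tokens and the example sentence are assumptions for illustration.

from transformers import AutoModel, AutoTokenizer

# GREEK-BERT checkpoint (assumed Hugging Face id of bert-base-greek-uncased-v1).
MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME)

# Seven task-specific tokens: two delimiters plus five entity-type tokens (spelling assumed).
# Registering them as special tokens keeps the BPE tokenizer from splitting them into sub-words.
extra_tokens = ["[H-SEP]", "[T-SEP]", "[PER]", "[GPE]", "[ORG]", "[DATE]", "[TITLE]"]
tokenizer.add_tokens(extra_tokens, special_tokens=True)
bert.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new tokens

# Structured RE input: head type and tokens, [H-SEP], tail type and tokens, [T-SEP],
# then the compressed sentence; [CLS] and [SEP] are added automatically by the tokenizer.
encoded = tokenizer("[PER] παπαδιαμαντης [H-SEP] [GPE] σκιαθος [T-SEP] εγεννηθη εις την σκιαθον",
                    return_tensors="pt")
outputs = bert(**encoded)  # outputs.last_hidden_state has shape (1, seq_len, hidden_size)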
Sentence Representation
The input sequence is transformed into feature vectors using GREEK-BERT’s pre-trained language model fine-tuned on our task. Each sub-word token’s feature vector is the result of BERT’s attention mechanism over all tokens. Intuitively, the feature vectors of specific tokens are more informative and contribute more to identifying the underlying relationship.
To the extent that each relation constrains the types of the entities involved and vice versa [30,43], we represent each sentence by concatenating the head and tail entities’ type embeddings:

s = [h_type ; t_type],

where h_type and t_type denote the last-layer embeddings of the head and tail entity-type tokens and [· ; ·] denotes concatenation.
While it is typical to encode sentences using the vector of the [CLS] token [11], our experiments show that representing a sentence as a function of the examined entity pair’s types reduces noise, improves precision and helps in capturing the infrequent relations.
Several other representation techniques were tested; i.e., we tested also concatenating the [CLS] vector to embed the overall sentence’s information, as well as the sentence representation from [18], which includes relation embeddings and further attention mechanisms, with the presented method outperforming both. Our intuition is that the LM was not able to efficiently capture patterns in Katharevousa, since manual inspection revealed that most words were split into many sub-words. This occurs because Katharevousa differs from Modern Greek, while some words/characters were also misspelled in the OCR process.
3.3. Bag Encoder
Bag encoding, i.e., the aggregation of the sentence representations in a bag, aims to reduce the noise generated by the erroneously annotated relations that accompany DS.
Assuming that not all sentences contribute equally to the bag’s representation, we use selective attention [24] to highlight the sentences that better express the underlying relation:

B = Σ_i α_i s_i,

i.e., selective attention represents each bag as a weighted sum over its individual sentences s_i. The attention weight α_i is calculated by comparing each sentence representation against a learned representation r:

α_i = exp(s_i · r) / Σ_j exp(s_j · r).

Finally, the bag representation B is fed to a softmax classifier in order to obtain the probability distribution over the relations:

p(r | B) = softmax(W_r B + b_r),

where W_r is the relation weight matrix and b_r is the bias vector.
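A minimal PyTorch sketch of the bag encoder described above, following the selective-attention formulation; the module and parameter names (BagEncoder, sent_dim) are ours.

import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    """Selective attention over a bag's sentence representations plus a softmax relation classifier."""

    def __init__(self, sent_dim, num_relations):
        super().__init__()
        self.r = nn.Parameter(torch.randn(sent_dim))          # learned representation r
        self.classifier = nn.Linear(sent_dim, num_relations)  # W_r and b_r

    def forward(self, sentence_reprs):
        # sentence_reprs: (num_sentences_in_bag, sent_dim)
        scores = sentence_reprs @ self.r                          # compare each s_i against r
        alpha = torch.softmax(scores, dim=0)                      # attention weights alpha_i
        bag = (alpha.unsqueeze(-1) * sentence_reprs).sum(dim=0)   # B = sum_i alpha_i * s_i
        logits = self.classifier(bag)                             # W_r B + b_r
        return logits  # a softmax over these logits gives p(r | B)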
3.4. Training
Our model utilizes a transformer model, namely GREEK-BERT, which we fine-tune on our specific setup to capture the semantic features of relational sentences. Below, we briefly present the overall process.
Pre-training
For our experiments, we use the pre-trained bert-base-greek-uncased-v1 language model [16], which consists of 12 layers, 12 attention heads and 110M parameters, where each layer is a bidirectional Transformer encoder [31]. The model is trained on uncased Modern Greek texts from Wikipedia, the European Parliament Proceedings Parallel Corpus (Europarl) and OSCAR (the clean part of Common Crawl), with a total of 3.04B tokens. GREEK-BERT is pre-trained using two unsupervised tasks, masked LM and next sentence prediction, with masked LM being its core novelty as it allows the previously impossible bidirectional training.
Fine-tuning
We initialize our model’s weights with the pre-trained GREEK-BERT model and fine-tune it under the multi-instance learning setting presented in Figure 2, using the structured input shown in Figure 1. After experimentation, only the last four layers are fine-tuned.
During fine-tuning, we optimize the following objective:

L(θ) = Σ_i log p(r_i | B_i; θ),

i.e., for all entity-pair bags B_i in the dataset, we maximize the probability of correctly predicting the bag’s relation r_i given its sentences’ representations and the model parameters θ.
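The objective above is equivalent to minimizing a cross-entropy loss over bags. The sketch below is a hedged illustration of the fine-tuning loop: it builds on the bert and BagEncoder objects from the previous sketches, while encode_sentences and train_bags are hypothetical helpers standing in for the actual pipeline, and optimizer/scheduler are assumed to be set up as in the sketch of Section 3.5.1.

import torch.nn as nn

# Freeze all GREEK-BERT parameters except the last four encoder layers;
# the separate BagEncoder module remains fully trainable.
for name, param in bert.named_parameters():
    param.requires_grad = any(f"encoder.layer.{i}." in name for i in (8, 9, 10, 11))

bag_encoder = BagEncoder(sent_dim=2 * bert.config.hidden_size, num_relations=7)  # 6 relations + NoRel
criterion = nn.CrossEntropyLoss()  # maximizing log p(r_i | B_i; theta) <=> minimizing cross-entropy

for epoch in range(3):  # fine-tuning for three epochs, as reported in Section 3.5.1
    for bag_inputs, relation_label in train_bags:          # one entity-pair bag per step
        sent_reprs = encode_sentences(bert, bag_inputs)     # (num_sentences, 2 * hidden_size)
        logits = bag_encoder(sent_reprs)                    # (num_relations,)
        loss = criterion(logits.unsqueeze(0), relation_label.view(1))
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()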
3.5. Experimental Setup
3.5.1. Hyper-Parameter Settings
In our experiments, we utilize the bert-base-greek-uncased-v1 model with a hidden layer dimension of 768, and we fine-tune the model with a fixed maximum sequence length (max_seq_length). We use the Adam optimization scheme [44] with a cosine learning rate decay schedule and warm-up over 0.1% of the training updates. We minimize the loss using the cross-entropy criterion.
Regarding the dataset-specific hyper-parameters of the REDSandT_Lit model, we tune them automatically on the validation set based on the F1-score. Table 2 shows the applied search space and the selected values for the dataset-specific hyper-parameters.
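For reference, a sketch of the optimizer and learning-rate schedule described above, using the transformers scheduler helper and the objects from the previous sketches; the learning rate shown is a placeholder, not the tuned value from Table 2.

from torch.optim import Adam
from transformers import get_cosine_schedule_with_warmup

trainable_params = [p for p in bert.parameters() if p.requires_grad] + list(bag_encoder.parameters())
optimizer = Adam(trainable_params, lr=2e-5)  # placeholder learning rate

total_steps = 3 * len(train_bags)            # three epochs over the training bags
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.001 * total_steps),  # warm-up over 0.1% of training updates
    num_training_steps=total_steps,
)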
Experiments are conducted in Python 3.6 on a PC with 32 GB of RAM, an Intel i7-7800X CPU @ 3.5 GHz and an NVIDIA GeForce GTX 1080 with 8 GB of memory. Fine-tuning takes about 5 min for the three epochs. The implementation of our method is based on the following code:
https://github.com/DespinaChristou/REDSandT (accessed on 18 May 2021).
3.5.2. Baseline Models
In order to show the proposed method’s effectiveness, we compare it against three families of strong baselines on our dataset. More precisely, we compare REDSandT_Lit to the standard feature-based [45] and NN-based [46] approaches used in the literature, as well as to the Greek version of BERT [16]. All models were tested on both sentence compression formats presented in Section 3.2.1, indicated with the respective (1, 2) superscripts. For the Bi-LSTM approach, we also experimented with both full-word and BPE tokenization, with full-word tokenization indicated by a (★) superscript.
Feature-based Methods
SVM¹: A Support Vector Machine classifier. Sentences are encoded using the first-presented compression format.
SVM²: A Support Vector Machine classifier. Sentences are encoded using the second-presented compression format.
NN-based Methods
Bi-LSTM★¹: A Bidirectional Recurrent Neural Network (RNN) classifier. Sentences are encoded using the first-presented compression format, while full-word tokenization is used.
Bi-LSTM¹: A Bidirectional RNN classifier. Sentences are encoded using the first-presented compression format, while BPE tokenization is used.
Bi-LSTM★²: A Bidirectional RNN classifier. Sentences are encoded using the second-presented compression format, while full-word tokenization is used.
Bi-LSTM²: A Bidirectional RNN classifier. Sentences are encoded using the second-presented compression format, while BPE tokenization is used.
Transformer-based Methods
GREEK-BERT: BERT pre-trained on Modern Greek corpora (bert-base-greek-uncased-v1), which we fine-tune on our specific dataset and task.
REDSandT²: The default REDSandT approach for distantly supervised RE. We use GREEK-BERT as the base model and fine-tune it on our corpus and specific task. Sentences are encoded using the second-presented compression format.
REDSandT_Lit¹: The proposed variant of REDSandT fine-tuned on our corpus and specific task. Sentences are encoded using the first-presented compression format.
REDSandT_Lit²: The proposed variant of REDSandT fine-tuned on our corpus and specific task. Sentences are encoded using the second-presented compression format.
3.5.3. Evaluation Criteria
In order to evaluate our model against the baselines, we report accuracy, macro-averaged Precision, Recall and F1 (macro-P/R/F1) and their weighted counterparts (weighted-P/R/F1) for all models. For a more in-depth analysis of each model’s performance on each relation, we also report Precision, Recall and F1-score for all models and relations. Moreover, we conduct Friedman’s statistical significance test to compare all presented models on our dataset, following [47,48].
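For illustration, Friedman’s test can be run with SciPy over the per-relation F1 scores of the compared models; the arrays below are invented placeholders, not the actual results reported in this paper.

from scipy.stats import friedmanchisquare

# Invented placeholder F1 scores (one value per relation) for three of the compared models.
f1_svm      = [0.61, 0.55, 0.40, 0.70, 0.52, 0.48]
f1_bilstm   = [0.65, 0.58, 0.45, 0.72, 0.55, 0.50]
f1_redsandt = [0.78, 0.75, 0.70, 0.83, 0.76, 0.74]

statistic, p_value = friedmanchisquare(f1_svm, f1_bilstm, f1_redsandt)
print(f"Friedman chi-square = {statistic:.3f}, p = {p_value:.4f}")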
6. Conclusions and Future Work
We proposed REDSandT_Lit, a novel distantly supervised transformer-based relation extraction model that can automate metadata extraction from literary texts, thus helping to sustain important cultural insights that could otherwise be lost in unindexed raw texts. More precisely, our model efficiently captures semantic relationships from Greek literary texts of the 19th century. We constructed the first dataset for this language and period, including 3649 samples annotated through distant supervision with six semantic relationships. The dataset is in the Katharevousa variant of Greek, in which a great part of Modern Greek literature is written. In order to capture the semantic and syntactic characteristics of the language, we exploited GREEK-BERT, a language model pre-trained on Modern Greek, which we fine-tuned on our specific task and language. To handle the problem of noisy instances, as well as the long sentences that are typical in literary writing, we guided REDSandT_Lit to focus solely on a compressed form of the sentence and the entity types of the entity pair. Extensive experiments and comparisons with existing models on our dataset revealed that REDSandT_Lit has superior performance, manages to capture infrequent relations and can correct mislabelled sentences.
Extensions of this work could focus on augmenting our dataset to facilitate direct BERT pre-training on the Katharevousa form of the Greek language. Even though we achieve high accuracy with a model pre-trained on Modern Greek and fine-tuned on the Katharevousa variant, this language mismatch suggests that augmenting the studied data and building a model specific to them could further improve results. Moreover, we would like to further investigate the effect of additional side information, such as POS tags and entity descriptions, as well as an end-to-end model that extracts both entities and relations in a single pass without relying on pre-recognized entities. Finally, although there is extensive research on ancient Greek philosophy, literature and culture, as well as research on Modern Greek Natural Language Processing (NLP) tools, the Katharevousa form of the Greek language, which is very important from a cultural, literary and linguistic point of view, has not been studied in terms of automatic NLP tools. Creating automated tools specific to this form is thus a step towards revealing important cultural insights into the early years of the modern Greek state.