Article

IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents

1 Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan
2 Ministry of National Development Planning/BAPPENAS, Jakarta 10310, Indonesia
3 College of Information Science and Engineering, Ritsumeikan University, Ibaraki 567-8570, Osaka, Japan
4 Center for Democracy Studies Aarau (ZDA), University of Zurich, 8006 Zurich, Switzerland
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2024, 8(11), 153; https://doi.org/10.3390/bdcc8110153
Submission received: 3 October 2024 / Revised: 29 October 2024 / Accepted: 7 November 2024 / Published: 9 November 2024
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)

Abstract

Achieving the Sustainable Development Goals (SDGs) requires collaboration among various stakeholders, particularly governments and non-state actors (NSAs). This collaboration both produces and depends on a continually growing volume of documents that need to be analyzed and processed systematically by government officials. Artificial Intelligence and Natural Language Processing (NLP) could thus offer valuable support for progressing towards SDG targets, including automating government budget tagging, classifying NSA requests and initiatives, and helping uncover possibilities for matching these two categories of activities. Many non-English-speaking countries, including Indonesia, however, face limited NLP resources, such as domain-specific pre-trained language models (PTLMs). This circumstance makes it difficult to automate document processing and improve the efficacy of SDG-related government efforts. The presented study introduces IndoGovBERT, a Bidirectional Encoder Representations from Transformers (BERT)-based PTLM built with domain-specific corpora, leveraging the Indonesian government’s public and internal documents. The model is intended to automate various laborious tasks of SDG document processing by the Indonesian government. Different approaches to PTLM development known from the literature are examined in the context of typical government settings. The methodology that proves most effective in terms of resultant model performance, and also most efficient in terms of required computational resources, is then deployed to develop the IndoGovBERT model. The developed model is scrutinized in several text classification and similarity assessment experiments, where it is compared with four Indonesian general-purpose language models, a non-transformer Multilabel Topic Model (MLTM) approach, and a Multilingual BERT model. Results obtained in all experiments highlight the superior capability of the IndoGovBERT model for Indonesian government SDG document processing. This suggests that the proposed PTLM development methodology could be adopted to build high-performance specialized PTLMs for governments around the globe which face SDG document processing and other NLP challenges similar to the ones dealt with in the presented study.

1. Introduction

Many countries still face challenges in achieving the Sustainable Development Goals (SDGs), a set of global objectives to be met by 2030, formulated by the United Nations in 2015 with the overall goal of promoting global sustainability [1]. These challenges include the impact of climate change [2] and the effects of unforeseen and unprecedented events, such as the COVID-19 pandemic [3]. To achieve all 17 SDGs and their 169 targets, governments and various non-state actors (NSAs), such as civil society organizations, religious groups, and private companies, must coordinate their efforts and promptly react to social, economic, and environmental changes. Government spending can be adjusted to finance vital areas for impactful SDG results [4], while NSA SDG initiatives, such as Corporate Social Responsibility [5], can foster positive change and promote community engagement [6]. Furthermore, collaboration among all parties, achieved by aligning their activities, can help solve complex problems more effectively than any individual effort [7]. Despite arrangements made by many governments to define each goal within their budgets and to create opportunities for NSAs to voluntarily list their SDG-related activities in SDG action plans, effective coordination of government spending is often hampered by the large amount of manual document processing. A known approach to overcoming this obstruction is the broad introduction of digital governance [8], which leverages technologies such as Natural Language Processing (NLP). These technologies largely rely on artificial intelligence with its vast array of machine learning methods and models [9].
The presented research thus aims to develop a machine learning approach to facilitate the identification of sustainability goals in government and NSA documents in Indonesia and to help discover possible means to achieve these goals. The study introduces a set of corpora, a language model development methodology, as well as software tools and methods that would not only allow the Indonesian government and NSAs to expedite document processing and enhance their decision-making processes but also to promote collaboration among all actors involved in the pursuit of the SDG agenda. While this study specifically focuses on Indonesian government activities, its results could be adopted and adapted to help other governments in SDG document processing as well.
The Indonesian language, “Bahasa Indonesia”, is the country’s only official language; it symbolizes national unity and is spoken by over 270 million people [10]. At the same time, it is recognized as one of the so-called “low-resource languages” in the NLP landscape [11,12,13]. A few attempts have been made to address this issue, including the development of pre-trained language models (e.g., IndoBERT [11,12]), the introduction of annotated datasets (e.g., NusaCrowd [14]), and the creation of benchmark tasks (e.g., IndoLEM [11], IndoNLU [12], and IndoNLG [15]). Most of the work on pre-trained models was, however, directed toward the development of general-purpose, generic, or “universal” models, with little attention given to domain-specific language models. To the best of the authors’ knowledge, only one Indonesian domain-specific language model has been built, namely IndoBERTweet [13].
Pre-trained language models (PTLMs) specialized for a particular domain are known to outperform general-purpose language models when dealing with domain-specific downstream tasks [16,17]. Such models are developed through either further pre-training of generic models or training language models from scratch, using a large amount of domain-specific data. Among the available domain-specific models that have been developed for the English language, there are notable examples used in the medical domain (BioBERT [17], BlueBERT [18], PathologyBERT [19]), biodiversity domain (BiodivBERT [20]), science domain (SciBERT [16]), legal domain (Legal-BERT [21]), financial domain (FinBERT [22]), agricultural domain (AgriBERT [23]), conflict domain (ConfliBERT [24]), and architecture, engineering, and construction domain (ARCBERT [25]). For the governmental domain, there exist a few non-English PTLMs, including ones for Chinese (GovAlbert [26]) and Swedish (KB-BERT [27]). However, no domain-specific models have been proposed for solving Indonesian government tasks.
In this paper, the authors seek to bridge this research gap by building a PTLM for solving Indonesian government domain-specific tasks. Since Bidirectional Encoder Representations from Transformers (BERT) models have proven to be highly effective in addressing various NLP challenges [28] and are known to be superior to other machine learning approaches (e.g., see [29,30,31,32]), a BERT-based architecture was chosen to develop the PTLMs in the presented research. This research, therefore, puts forward a BERT-based domain-specific language model pre-trained with governmental texts to effectively process SDG-related documents in the course of Indonesian government decision-making. An effective methodology for the development of high-performance PTLMs for solving downstream NLP tasks is also proposed. All resources necessary to implement the methodology (source code, text corpora, etc.) have been made publicly available at https://github.com/Just108/IndoGovBERT (accessed on 30 September 2024).
The original contributions of the presented study are, therefore, as follows:
  • A methodology for the development of high-performance PTLMs for solving document-processing downstream tasks in the governmental context.
  • An open-data corpus of Indonesian government documents for NLP research.
  • A domain-specific pre-trained Indonesian language model, IndoGovBERT, for SDG document processing by the Indonesian government.
The developed IndoGovBERT model was thoroughly examined through experiments and found to be efficient and effective in real-world scenarios of multi-label classification of government budget items and NSA activities. The model was also deployed to facilitate the matchmaking of SDG-related activities by the government and NSAs to uncover possible collaboration opportunities for the parties. In all experiments conducted, the proposed model outperformed the other models that could be used in government settings and were, for that reason, tested in this study.
The rest of the paper is organized as follows. Section 2 surveys the related literature. Section 3 provides an overview of the approach and resources used. Section 4 then details the model development methodology and introduces the IndoGovBERT model. Section 5 presents application examples of the developed model in SDG document processing contexts. Finally, Section 6 provides an overall discussion and concludes the study.

2. Related Work

This section provides an overview of recent studies at the intersection of SDG problem-solving and NLP, domain-specific PTLMs, and corpora used to develop PTLMs.

2.1. NLP and SDGs

Classification can be seen as one of the fundamental, if not the most important, tasks of NLP [33]. It typically involves the categorization of textual data by predefined labels, either single or multiple. Numerous machine learning efforts have been directed to classifying textual documents by sustainability goals. These efforts, however, varied considerably in the classification approaches chosen.
Nugroho et al. [34] used a Naïve Bayes model as a binary classifier to categorize Indonesian online newspaper articles, depending on whether they are related to SDGs or not. Angin et al. [35] proposed an approach with multiple binary classifiers, each focusing on a specific SDG. The authors deployed the RoBERTa model (also, see [36]) to distribute company sustainability reports into the SDG categories. Guariso et al. [37] discussed SDG classification as a multi-class single-label problem. Their research focused on assigning SDG labels to government budget programs in three countries—Mexico, Colombia, and Uruguay. The authors utilized several classifiers, including Naïve Bayes, Support Vector Machine, and Random Forest, and compared their performances for solving the task.
When dealing with SDG activity descriptions, one typically has to perform multi-label multi-class document classification, as each document may describe more than one activity aligning with SDGs, while the same activity may pursue more than one goal. Few attempts have been made, however, to address SDG classification as a multi-class multi-label problem. Morales-Hernandez et al. [38] categorized peer-reviewed journals by SDG goals, taking the latter perspective. The authors experimented with several classifiers and concluded that the Support Vector Machine is the best performer. Matsui et al. [39] employed a Japanese BERT model for classifying Japanese official documents into SDG categories. Using the model, the authors also computed cosine similarity to facilitate the matchmaking process between documents describing SDG challenges faced by Japanese municipalities and those with potential solutions to the challenges from the private sector.
In summary, while previous research has mainly focused on the downstream tasks of classification and text similarity assessment of SDG documents, the upstream task of developing a domain-specific pre-trained language model has received little attention. The application of such a domain-specific model could help improve the performance and context awareness of various NLP tools for government applications [26,27].

2.2. Domain-Specific PTLMs

PTLMs, such as the BERT [28] and RoBERTa [36] models, underwent pre-training with extensive collections of unlabeled textual data. Afterwards, they could be fine-tuned or adapted using labeled data specific to the target assignment, enabling the models to perform effectively on various downstream NLP tasks. Pre-training a language model is generally a computationally costly process and, therefore, it is usually undertaken by organizations with abundant computational resources. In contrast, fine-tuning to downstream tasks is computationally far less intensive and has demonstrated its efficacy by delivering state-of-the-art results across a wide range of NLP applications [40].
Achieving effective fine-tuning for solving downstream tasks relies on the availability of both adequate training data and relevant pre-trained models. As many contemporary language models have been trained with generic document corpora, they often happen to be far from acceptable when used in specialized domains, owing to differences in language usage and vocabulary [41]. The development of domain-specific PTLMs is, therefore, an important research problem.
In a real-world application scenario, once-developed PTLMs would, from time to time, have to deal with the emergence of new data very different from what the models were originally trained on. An outdated or out-of-domain PTLM could have a detrimental effect on the problem-solving performance [42]. This circumstance highlights the importance of continually updating PTLMs to adapt to the changing nature of data in order to maintain their problem-solving capabilities.
When developing or updating a domain-specific PTLM, there are two major approaches (see Figure 1): (i) training a new language model from scratch with in-domain data or (ii) further pre-training an existing (e.g., general-purpose) language model using in-domain data (domain-adaptive pre-training, DAPT). Gururangan et al. [41] proposed an additional option, where PTLMs are further pre-trained with unlabeled data specific to the target task (task-adaptive pre-training, TAPT) rather than with far more general domain-wide corpora. The authors, however, noted that TAPT can have negative effects when the model is intended for several different downstream tasks. On the other hand, Abnar et al. [43] argued that solely focusing on the performance of one task should be avoided, as the resultant model would then be too specialized. The authors advocated design choices that would enhance model capabilities across a diverse range of downstream tasks.
In terms of model performance, domain-specific PTLMs have been shown to be superior when compared to generic models (e.g., see [16]), but this was not always the case. For instance, Zhu et al. [44] found that further pre-training on task-oriented dialogues does not necessarily lead to improved performance for the corresponding downstream tasks. The authors observed that although further pre-training can be advantageous in low-resource settings, its benefits diminish as the training data size grows. Arslan et al. [45] examined various PTLMs for multi-class classification in the financial domain. The authors discovered that although the FinBERT model [22] was pre-trained with financial documents, it could only reach the same level of performance as the generic models tested, even when the vocabulary was adjusted. One possible explanation here would be that there can still be significant differences between documents used for pre-training and those relevant to the task at hand. The authors also suggested that merely adjusting the vocabulary may not be sufficient, and training completely from scratch would be a better option. On the other hand, while training a language model from scratch with domain-specific corpora could naturally yield better-tailored word embeddings and improve the model performance, Boukkouri et al. [46] argued that further pre-training generic BERT models with domain-specific corpora would result in similar performance. Considering the resource costs associated with each method, the authors concluded that further pre-training generic models with a specialized corpus, i.e., domain adaptation, is the preferable approach.
Acknowledging the different findings related to the PTLM development, all approaches discussed above will be re-examined in this study with a focus on the Indonesian government domain. As emphasized by Naveed et al. [47], the quality of data used for pre-training plays a critical role in the model development process. In the next subsection, various corpora utilized by domain-specific PTLMs will, therefore, be surveyed.

2.3. Corpora Used in Domain-Specific PTLMs

In addition to the computational resource challenges experienced by PTLM developers [36], the incorporation of domain-specific vocabulary presents another significant problem [48]. PTLMs for a specialized context are typically built with the available domain corpora (see Table 1). For instance, domain-specific models, such as AgriBERT, BioBERT, BioDivBERT, and SciBERT, were all trained on academic publications in the corresponding branches of science. On the other hand, models such as ConfliBERT, FinBERT, LegalBERT, PathologyBERT, and ARCBERT were developed, using more general documents, such as business reports, news articles, regulatory documents, encyclopedia texts, and Wikipedia pages.
The Indonesian government, just as many other modern-era civic institutions, continually generates a large number of documents. Some of these documents are released to the public through the official channels, including government websites. Other documents are given restricted access and are only meant for entitled individuals or groups. Public documents usually comprise regulatory texts and civic action programs and reports, whereas restricted documents record government internal interactions and confidential communications with stakeholders. For the development of a domain-specific PTLM intended to facilitate solving the government’s NLP tasks, both types of documents appear important.

3. Methodology and Resources

3.1. Approach Overview

To fulfill the study objectives formulated in Section 1, the following procedure is undertaken (also, see Figure 2).

3.1.1. Domain Corpora Development

In this initial phase, domain-specific corpora tailored to the Indonesian government context are constructed to serve as the foundation for pre-training language models. The development of the domain-specific corpora follows a two-step approach: document collection and preprocessing. Section 4.1 details the corpora development process.

3.1.2. PTLM Development

The development of an Indonesian Government BERT-based PTLM (referred to in this paper as IndoGovBERT) involves leveraging the government’s domain corpora to capture specific linguistic patterns and nuances relevant to the domain. Several versions of the IndoGovBERT model are developed, fine-tuned, and compared. This is followed by the selection of the best-performing model for further experiments. Section 4.2 explicates the PTLM development process.

3.1.3. PTLM Application on Downstream Tasks

The selected best-performing IndoGovBERT model is used to solve various SDG-related downstream NLP tasks. The model deployment aims at contributing to the decision-making processes within governmental contexts, such as multi-class multi-label document classification and document matchmaking. Section 5 presents results obtained in these experiments.

3.2. Baseline Models Used

In the past few years, several Indonesian monolingual BERT-based PTLMs have been developed (see Table 2). Koto et al. introduced the IndoBERT model trained on a dataset sourced from Wikipedia, news articles, and the Indonesian Web corpus [11]. Wilie et al. released a pre-trained model with the same name, IndoBERT, which used unlabeled text data from the Indo4B project [12]. There are also two publicly available pre-trained language models stored on Hugging Face (http://huggingface.co/models accessed on 30 September 2024): the IndoBERT model by Lintang that was trained on the OSCAR corpus [49] and the Indonesian BERT base model by Wirawan that was trained on Wikipedia data [50]. All these pre-trained models have previously been utilized for solving various NLP tasks by the Indonesian research community (e.g., see [51]).
As indicated in Table 2, the four general-purpose language models all have approximately the same vocabulary size, are uncased, and were trained with the Masked Language Model (MLM) objective (for training Wilie et al.’s model, Next Sentence Prediction, or NSP objective was also used). Considering the size of the training data, Wilie et al.’s model tops the list with 4 billion words, followed by Lintang’s, Wirawan’s, and Koto et al.’s models, in that order. To avoid confusion with model names, the PTLMs of Table 2 will be referred to in this paper as Koto’s, Wilie’s, Lintang’s, and Wirawan’s models.
In contrast to the situation with general-purpose models, the availability of domain-specific PTLMs for the Indonesian language is quite limited. Currently, the IndoBERTweet model [13] is, in fact, the only publicly available instance. The IndoBERTweet PTLM was developed through continual domain-adaptive pre-training of Koto’s model, using a dataset of 409 million word tokens extracted from Indonesian tweets. The latter dataset is twice as large as the corpora used to pre-train the general-purpose IndoBERT model. Since the IndoBERTweet PTLM was developed for a domain unrelated to the government and SDGs, only the four general-purpose models of Table 2 will be used as baselines for benchmarking purposes in this study.

3.3. SDG Data

To examine the performance of PTLMs, real-world data were obtained from the Indonesian government with a particular focus on classification tasks concerning SDGs. The SDGs encompass 17 global goals that, in government settings, typically necessitate topic-based document classification by 17 labels defined as follows: “No Poverty” (Goal 1), “Zero Hunger” (Goal 2), “Good Health and Well-being” (Goal 3), “Quality Education” (Goal 4), “Gender Equality” (Goal 5), “Clean Water and Sanitation” (Goal 6), “Affordable and Clean Energy” (Goal 7), “Decent Work and Economic Growth” (Goal 8), “Industry, Innovation, and Infrastructure” (Goal 9), “Reduced Inequality” (Goal 10), “Sustainable Cities and Communities” (Goal 11), “Responsible Consumption and Production” (Goal 12), “Climate Action” (Goal 13), “Life Below Water” (Goal 14), “Life on Land” (Goal 15), “Peace, Justice, and Strong Institutions” (Goal 16), and “Partnerships for the Goals” (Goal 17). Data for this study were primarily sourced from the Indonesian national SDG action plan (Rencana Aksi Nasional, the RAN document collection), covering the period from 2021 to 2024. The plan was developed and is monitored by the Indonesian SDG secretariat within the Ministry of National Development Planning (BAPPENAS). RAN acknowledges the vital role of stakeholders in advancing SDGs through an attachment list that details SDG-related activities in three text collections. Specifically, the collections include descriptions of (i) budget programs and activities initiated by the central government, (ii) non-government programs and activities involving civil society, philanthropic, and academic organizations, and (iii) non-government programs and activities of various business actors. In the presented study, SDG documents from only (i) and (ii) are used, while collection (iii) is excluded, owing to the structural similarity between documents (ii) and (iii).
In the RAN action plan, stakeholders voluntarily documented their proposals and manually annotated them with SDG labels to indicate projected contributions to achieving the corresponding goals. One of the objectives of the presented study is, therefore, to facilitate classification of SDG documents created by the stakeholders. Models, methods, and tools developed in this study would be deployed to establish a matching process between NSA activities and government spending, as described in the documents.

3.3.1. SDG Budget Tagging Data

Apart from being listed in the RAN document, SDG-related programs and initiatives by the central government are also filed in the “Kolaborasi Perencanaan dan Informasi Kinerja Anggaran” (KRISNA) system, which has been used by the government for planning and budgeting purposes since 2017. In KRISNA, expenditures are manually tagged with various budget labels, including the SDG labels. As SDG budget tagging started in 2021, KRISNA data utilized in this study cover the period of 2021 through 2023.
The budget data collected were subjected to preprocessing, as suggested by Riyadi et al. [52]. To reduce irrelevant and noisy information, the data were also manually cleaned by BAPPENAS specialists. Missing values and formatting inconsistencies were handled to standardize the dataset. Feature selection was conducted to retain significant attributes by assessing the variable permutation importance of a Random Forest classifier [53]. Features with an importance score exceeding 0.05 were considered important, reducing the number of input features from 12 to 6 for further analysis (see Figure 3). The importance score threshold was set based on human judgment. Table 3 exemplifies the six remaining features. Since all these features are textual, the corresponding texts were concatenated to form a single string input for the classifiers.
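A minimal sketch of this screening step is given below, assuming the textual features have already been given some numeric encoding (the exact encoding is not specified here); the synthetic data, feature names, and Random Forest settings are placeholders, while the 0.05 threshold matches the one reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12 encoded budget features and their labels.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 12)), columns=[f"feat_{i}" for i in range(12)])
y = rng.integers(0, 2, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Shuffle each feature column on held-out data and measure the drop in score.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)
selected = [name for name, score in zip(X.columns, result.importances_mean)
            if score > 0.05]  # importance threshold used in the paper
print(selected)
```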
The preprocessing steps yielded 4875 unique data entries with a label distribution, depicted in Figure 4 (the total sum over all goals is different due to the presence of multi-label assignments). The average document length is 34.5 words, and the maximum length is 64 words.

3.3.2. SDG NSA Activity Data

The NSA data comprise documents describing non-government programs and activities that involve civil society and philanthropic organizations, academic institutions, and other entities of the RAN collection. These data were subjected to preprocessing in the same way as the budget data. Results of the feature selection, based on permutation importance, are shown in Figure 5. The selection threshold was set at 0.16, thus reducing the number of features from seven to four. The remaining features are listed in Table 4, together with specific examples. The same text-concatenation approach as in the case of budget data was applied to form the classifier input.
The preprocessing returned 2445 entries. The average document length is 26.1 words, and the maximum length is 109 words. Figure 6 depicts the SDG labeling of the data (the total sum over the goals is different due to the presence of multi-label assignments). As one can see from the figure, the class distribution is unbalanced, with one class, “Goal 7” (“Affordable and Clean Energy”), having only 12 documents. The same pattern of the lowest data count (150 documents) falling on “Goal 7” occurred in the budget data (see Figure 4). Several other classes in the NSA data received fewer than 100 data entries, including “Goal 9”, “Goal 10”, “Goal 2”, “Goal 17”, and “Goal 6” with 35, 66, 75, 84, and 85 documents, respectively, listed in ascending order of the number of entries.
Unlike the budget data, which are monolingual in principle, the NSA data contain both Indonesian and English texts. Through an analysis of the NSA documents with the langdetect tool (https://pypi.org/project/langdetect/ accessed on 30 September 2024) (the language probability parameter was set to 70%), it was found that 155 NSA entries are English documents, which account for approximately 6% of the whole dataset. An example of this language mixing is given in Table 4, where the features are formulated in English and Indonesian. Thus, the NSA dataset is both unbalanced and multilingual (more specifically, bilingual), which poses an additional challenge for the development of NLP solutions in the Indonesian government domain.
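The language check can be reproduced along the following lines; the example texts are placeholders, while the 70% cut-off matches the setting reported above.

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is probabilistic; fixing the seed makes runs repeatable

nsa_texts = [
    "Pembangunan fasilitas air bersih di daerah tertinggal",    # Indonesian
    "Community-based clean water program for rural districts",  # English
]

english_docs = []
for text in nsa_texts:
    top = detect_langs(text)[0]                 # most probable language and its probability
    if top.lang == "en" and top.prob >= 0.70:   # probability threshold from the paper
        english_docs.append(text)

print(f"{len(english_docs)} of {len(nsa_texts)} documents detected as English")
```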

4. IndoGovBERT Model Development

This section describes the key phases of the IndoGovBERT development. As illustrated in Figure 7, the process begins with the corpora construction. Two types of candidate models are then built: models trained from scratch with domain-specific data (SC PTLMs) and models created through further pre-training of the relevant generic models with domain-specific data (FT PTLMs). All models are evaluated after fine-tuning them on downstream task-specific data, and the best-performing model is selected for solving government SDG-related NLP tasks. The process is detailed in the following subsections.

4.1. Domain Corpora

4.1.1. Government Document Collection

Static documents from the government domain, both publicly available and circulated exclusively within the government, have been compiled. Web scraping was also used to collect data from the government’s public online sources. Through these activities, two document collections have been created, C1 and C2, as specified in Table 5.
The internal documents of C1 were obtained from the mailing system used at the deputy ministerial level within BAPPENAS. The system has been in operation since 2009, and the collected documents cover the period from 2009 to 2022. C1 consists primarily of PDF files with texts related to the internal affairs of specific government units that may include sensitive information with restricted access.
The public documents of C2 were compiled by scraping static content from several government websites (see Table 6). These documents are made publicly available and present official information published by the government. The collection includes laws and regulations, as well as government reports and presidential and ministerial speeches, for the period from 1945 to 2022. The web scraping was implemented using BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/ accessed on 30 September 2024), a Python library for extracting data from HTML and XML files. Ethical considerations were upheld, ensuring compliance with the government websites’ terms of service.
As indicated in Table 5, a majority of the collected documents are in PDF, a popular format for document storage in various organizations worldwide, including the Indonesian Government. While PDF provides a consistent visual appearance across different applications and platforms, both on screen and in print, extracting text from the files can be challenging, especially when dealing with historical documents [54]. Details of the preprocessing procedures of text extraction, filtration, and deduplication of the PDF files are explained in the next subsection.

4.1.2. Preprocessing

All collected documents were subjected to preprocessing and cleaning. The first step of preprocessing is to extract text from the PDF files (see Figure 8). In this study, text extraction was performed using the Tesseract engine (https://github.com/tesseract-ocr/tesseract accessed on 30 September 2024). The content extraction process was agnostic to the layout, i.e., all content elements were extracted. This simplification accelerates the whole procedure but has limitations in terms of accuracy and completeness of the extracted text. To improve the accuracy, spellchecking was performed using the autocorrect tool (https://github.com/filyp/autocorrect accessed on 30 September 2024). The tool originally had no support for the Indonesian language but provides functionality for adding new languages. An Indonesian vocabulary was therefore created from a custom list of Indonesian words from Wikipedia (https://dumps.wikimedia.org/idwiki/latest/ accessed on 30 September 2024), with the prerequisite that a word must appear at least 30 times in the Wikipedia dumps to be selected. Spelling errors detected were then corrected with the tool. All the texts obtained also underwent various cleaning procedures to streamline the data, including hyperlink, email address, and special character removal. Camel-case words were split, all letters were converted to lowercase, digit-only words were removed, and words with fewer than three characters were filtered out. Extra spaces were also eliminated to standardize the text structure.
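A compact sketch of these cleaning steps is shown below; the regular expressions are illustrative assumptions rather than the authors’ exact rules.

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # remove e-mail addresses
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)     # split camel-case words
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # remove special characters
    tokens = [t for t in text.lower().split()            # lowercase and tokenize
              if not t.isdigit() and len(t) >= 3]        # drop digit-only and short words
    return " ".join(tokens)                              # collapses extra spaces

print(clean_text("Laporan KinerjaAnggaran 2022: lihat https://bappenas.go.id"))
# -> "laporan kinerja anggaran lihat"
```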
Subsequently, a filtering process at the document level was conducted to improve the quality of the training data. Documents smaller than 500 bytes in size and those with an average word length of less than four characters (which usually indicates text shattered by conversion errors) were excluded from the collection at this stage.
Next, deduplication was performed to minimize redundant information. As demonstrated by Lee et al. [55], existing language model datasets often contain numerous near-duplicate data and long repetitive substrings. Duplicated data occurring at different levels, i.e., in sentences, documents, and corpora, can hinder model performance and result in “data memorization” [47]. Kandpal et al. [56] argued that data deduplication is also an effective technique for mitigating privacy risks associated with language models. In this study, document and substring deduplication was executed.
First, the MinHash (https://ekzhu.com/datasketch/minhash.html accessed on 30 September 2024) tool with the Jaccard similarity threshold set to 0.70 for 5-grams was deployed to identify and eliminate near-duplicate documents. Output files from this step were combined into a single file. Suffix array deduplication was then implemented with a minimum matching substring length of k = 50 tokens, following the approach proposed in [55]. The repetitive substring deduplication was run until no more duplicates were found.
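The document-level step can be sketched as follows with the datasketch library; the corpus dictionary and the number of permutations are placeholders, the shingles are assumed to be word-level 5-grams, and the 0.70 Jaccard threshold follows the description above.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 5])                 # word 5-gram shingles (assumption)
                for i in range(max(1, len(words) - 4))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf8"))
    return m

docs = {"doc1": "...", "doc2": "..."}                    # id -> extracted text (placeholder)
lsh = MinHashLSH(threshold=0.70, num_perm=128)           # Jaccard threshold from the paper

kept = []
for doc_id, text in docs.items():
    m = minhash_of(text)
    if not lsh.query(m):                                 # no near-duplicate kept so far
        lsh.insert(doc_id, m)
        kept.append(doc_id)
print(kept)
```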
As indicated in Table 5, the preprocessing steps reduced the size of the C1 corpus from 1.9 GB to 413.7 MB and that of the C2 corpus from 4.3 GB to 1.5 GB. Combining all corpora resulted in a dataset of 1.9 GB (C3 = C1 + C2) that comprises approximately 255 million words.

4.2. PTLM Development

4.2.1. Architecture and Tokenizer

The standard BERT model architecture was used. As introduced by Devlin et al. [28], it assumes setting both MLM and NSP as training objectives. In a more recent study, however, Liu et al. [36] demonstrated that the application of MLM alone is sufficient. To develop the IndoGovBERT model, the MLM-only approach was therefore adopted.
Generally, the PTLM development process begins either with the creation of a tokenizer based on the collected data (as in the pre-training from scratch approach) or with the deployment of the tokenizer from the base general-purpose language model (as in the further pre-training approach). The IndoGovBERT tokenizer was constructed following the methodology proposed by the authors of NorBERT [57], a Norwegian-language BERT model. A sub-word tokenizer was built by applying the WordPiece algorithm [58] and using Hugging Face’s tokenization library (https://github.com/huggingface/tokenizers accessed on 30 September 2024). The vocabulary size was set to 50,000 for each of the three corpora, C1, C2, and C3. For models that had already been pre-trained, their original tokenizers were utilized.
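Training such a tokenizer with the Hugging Face tokenizers library can be sketched as follows; the corpus file name and output directory are placeholders, while the 50,000-token vocabulary matches the setting above.

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece sub-word tokenizer on one of the governmental corpora.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus_c1.txt"],                   # plain-text corpus (placeholder path)
    vocab_size=50_000,                         # vocabulary size used in the paper
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("indogovbert-tokenizer")  # writes vocab.txt to this directory
```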

4.2.2. Pre-Training and Fine-Tuning

Pre-training and fine-tuning of the IndoGovBERT model were performed on a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. Fine-tuning was conducted after the model pre-training was finished. Multi-label SDG classification of budget tagging data was used as the fine-tuning task.
Two different pre-training approaches were exercised (see Figure 7): pre-training from scratch (resulting in SC PTLMs) and further pre-training of an existing PTLM (resulting in FT PTLMs; also see the “Domain adaptive” track of Figure 1). All three prepared corpora were used as inputs with an 80–20 split of training and testing data. The key hyperparameter settings were as follows: 25 training epochs, a warm-up ratio of 0.1, a weight decay of 0.1, a learning rate of 2 × 10−5, and an MLM probability of 0.15. Setting the warm-up ratio to 0.1 stabilized the training process by gradually increasing the learning rate from the start. Setting the weight decay to 0.1 helped prevent overfitting and improved model generalization to unseen data. The learning rate of 2 × 10−5 and the MLM probability of 0.15 are typical choices when pre-training BERT models (e.g., see [12,19]). All BERT models pre-trained in this study have 12 layers, a hidden size of 768, and 12 attention heads, as recommended in [28].
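The setup can be sketched with the Hugging Face Transformers Trainer as follows; the corpus path and batch size are assumptions not reported here, while the remaining hyperparameters match those listed above.

```python
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("indogovbert-tokenizer")  # from the sketch above
config = BertConfig(vocab_size=tokenizer.vocab_size, num_hidden_layers=12,
                    hidden_size=768, num_attention_heads=12)
model = BertForMaskedLM(config)                          # pre-training from scratch

raw = load_dataset("text", data_files={"train": "corpus_c1.txt"})  # placeholder corpus
dataset = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])
split = dataset.train_test_split(test_size=0.2)          # 80-20 train/test split

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="indogovbert-sc-c1",
                         num_train_epochs=25, learning_rate=2e-5,
                         warmup_ratio=0.1, weight_decay=0.1,
                         per_device_train_batch_size=16)  # batch size: assumption
trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=split["train"], eval_dataset=split["test"])
trainer.train()
```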
The decision to use the relatively small number of epochs (25 vs. 40 epochs in the original BERT model [28]) as the termination criterion was due to organizational constraints on computational resources and time. These constraints mirror practical considerations faced by the government units. On the other hand, Komatsuzaki [59] argued that a single epoch is sufficient for pre-training with large volumes of unlabeled data, thus reducing the cost of training. The latter aligns with the findings of Zhao et al. [60], who concluded that increasing the number of epochs during pre-training does not translate into performance improvement.
To assess the performance of different versions of the IndoGovBERT model during pre-training, metrics such as computation time, training loss, and word-level perplexity (on the test set) were used. The perplexity metric reflects the model’s ability to predict tokens in a sequence [61], and a lower perplexity usually signals better model performance [62].
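Continuing the pre-training sketch above, the test-set perplexity can be obtained as the exponential of the mean masked-LM loss; a minimal snippet, assuming the trainer object from the previous sketch:

```python
import math

eval_loss = trainer.evaluate()["eval_loss"]  # mean cross-entropy on the held-out split
perplexity = math.exp(eval_loss)             # lower perplexity = less "surprised" model
print(f"perplexity: {perplexity:.2f}")
```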
Pre-training from scratch produced three different IndoGovBERT PTLM versions, one for each of the three corpora used. The performance of all obtained models was compared against the four baseline monolingual Indonesian BERT models after fine-tuning with labeled data of the selected downstream task. The seven models tested were, hence, fine-tuned with the budget data for two epochs, as recommended by Devlin et al. [28]. The learning rate was set to 5 × 10−5 and the weight decay to 0.01. These parameter values prevented overfitting while fine-tuning the models for SDG multi-label multi-class classification of the budget data. The resultant model performance was evaluated using the F1 macro and micro scores, as well as the accuracy score, averaged through five-fold cross-validation. The dataset proportions used in the experiments are as follows: 80% (3900 data entries) for training and 20% (975 data entries) for testing in each fold.
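The fine-tuning step can be sketched as follows; the dataset objects are placeholders, the 0.5 sigmoid decision threshold is an assumption at this point (it is made explicit in Section 5.1), and the remaining settings match those reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# A sigmoid-per-label head over the 17 SDG classes; BCE loss is selected automatically.
model = BertForSequenceClassification.from_pretrained(
    "indogovbert-sc-c1", num_labels=17,          # placeholder checkpoint path
    problem_type="multi_label_classification")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits))            # sigmoid over the logits
    preds = (probs >= 0.5).astype(int)           # assumed decision threshold
    return {"f1_micro": f1_score(labels, preds, average="micro"),
            "f1_macro": f1_score(labels, preds, average="macro"),
            "accuracy": accuracy_score(labels, preds)}

args = TrainingArguments(output_dir="budget-classifier", num_train_epochs=2,
                         learning_rate=5e-5, weight_decay=0.01)
trainer = Trainer(model=model, args=args, compute_metrics=compute_metrics,
                  train_dataset=train_ds, eval_dataset=test_ds)  # tokenized folds (placeholders)
trainer.train()
```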
Of the seven models examined, the three best-performing models were selected for further pre-training experiments. The same hyperparameter settings were used as in the pre-training from scratch approach. The three governmental corpora, C1, C2, and C3, were utilized. In the case of the IndoGovBERT models, the training data were adjusted to only include documents which were not used in the initial pre-training.

4.2.3. SC PTLM Performance

Results of the pre-training from scratch are presented in Table 7. The SC-C1 version has the worst perplexity and training loss but required a much shorter training time. With the corpus size increase, the training time also increased, but both perplexity and training loss improved, as observed for the SC-C2 and SC-C3 models.
The SC PTLMs and the baseline models were then tested on the SDG budget tagging downstream task. Figure 9 portrays the performance of the models assessed in terms of the F1 macro score averaged through five-fold cross-validation. Table 8 gives a summary of the metrics monitored in the experiment. As one can see from the illustrations, Wirawan’s model outperformed the other baselines despite having a smaller training corpus size than in the cases of Wilie’s and Lintang’s models. The PTLMs trained from scratch on the governmental corpora C1 and C2 excelled, and the SC-C1 demonstrated the best performance overall despite its higher perplexity and smaller training data size. Merging the C1 and C2 corpora for the training purposes has not, however, led to performance improvements with SC-C3.

4.2.4. FT PTLM Performance

The top-three performing PTLMs—SC-C1, SC-C2, and Wirawan’s—were selected for further pre-training, as outlined in Section 4.2.2. Experiments were conducted to investigate how SC-C1 would perform when further pre-trained with the C2 corpus, how SC-C2 would perform when further pre-trained with C1, and what would be Wirawan’s model performance when it is further pre-trained with the C1, C2, and C3 corpora. Table 9 presents results of these experiments.
As can be seen from Table 9, the best results in terms of perplexity and training loss were obtained when Wirawan’s model was further pre-trained with the C2 corpus. In contrast, the worst perplexity was observed when SC-C2 was further pre-trained with the C1 corpus. In all cases, further pre-training with C1 resulted in higher perplexity values when compared to further pre-training with C2. On the other hand, the computational cost (assessed in terms of training time) was expectedly lower for the C1 corpus, owing to its smaller size.
Results of testing the further pre-trained models on the budget tagging task are given in Table 10. It is evident from the table that the further pre-trained models could not, generally, surpass the performances of their original base models (for comparison, see Table 8). Notably, however, SC-C2-FT-C1 and SC-C1-FT-C2 showed better results than the SC-C3 model trained from scratch with the merged corpora. This suggests that continual model pre-training could be more advantageous than the direct corpora-merging method. In the case of Wirawan’s model, an interesting observation is that the general-purpose PTLM performance gradually improved as the training data size grew.

5. IndoGovBERT Model Application Examples

This section presents results of testing the best-performing version of the IndoGovBERT model (SC-C1) in real-world settings of the Indonesian government. Two types of NLP downstream tasks are considered:
  • Multi-label multi-class document classification. To discover possible connections between Indonesian public spending items or NSA activities and SDGs, a multi-label SDG classification task needs to be solved. For this purpose, two classifiers are created using the IndoGovBERT model: one for the government data and another for the NSA activity documents. Both classifiers are compared with a Multi-Label Topic Model (MLTM)-powered classifier [63], a non-transformer-based generative model frequently used as a baseline in studies dealing with multi-label classification (e.g., see [64,65]). As the NSA dataset differs from the budget spending data of Section 4, the IndoGovBERT model is also compared with the best baseline model of Table 8, i.e., with Wirawan’s model. In addition, since the NSA dataset contains a sizeable portion of English texts (e.g., SDG documents of the World Bank, Asian Development Bank, etc.), the IndoGovBERT model is compared with a multilingual BERT model [28] as well. Classifier performance is assessed using three commonly recognized metrics for multi-label prediction: Hamming loss (H-loss), Accuracy (A), and F1-score (F1) [66]. The classifiers are also evaluated in terms of the Hamming score (H-score) and the Average Area Under the Precision–Recall Curve (abbreviated below as AUC). These are all metrics typically used in multi-label classification studies [67,68,69] (a minimal computation sketch follows this list).
  • Document matchmaking. The IndoGovBERT model is employed to discover similarities between the government’s SDG budget items and NSA initiatives. This is to facilitate document screening performed by the government, aimed at identifying NSA activities that align with the available resources. The developed model is used to generate BERT word embeddings for each pair of NSA documents and government SDG program descriptions. This is followed by cosine similarity calculation for the embedding vectors obtained. Two baseline models are deployed in this experiment, and all results obtained are compared and discussed in the practical context of preliminary or initial document screening by the government.
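As referenced in the first item above, the evaluation metrics can be computed with scikit-learn roughly as follows; the toy label matrices are placeholders, and treating the Hamming score as one minus the Hamming loss and AUC as macro-averaged average precision are assumptions about the exact definitions used.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, hamming_loss)

y_true = np.array([[1, 0, 1], [0, 1, 0]])               # toy 3-label ground truth
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])   # predicted label probabilities
y_pred = (y_prob >= 0.5).astype(int)                    # 0.5 in-class threshold

print("H-loss :", hamming_loss(y_true, y_pred))         # fraction of wrong labels
print("H-score:", 1 - hamming_loss(y_true, y_pred))     # assumed H-score definition
print("A      :", accuracy_score(y_true, y_pred))       # exact-match (subset) accuracy
print("F1     :", f1_score(y_true, y_pred, average="micro"))
print("AUC-PR :", average_precision_score(y_true, y_prob, average="macro"))
```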
It should be noted that these two tasks selected for model evaluation are quite typical for government document processing. While the specific downstream tasks considered in this paper are managed by BAPPENAS, all other units of the Indonesian government have to deal with laborious document classification and screening on a daily basis as well.

5.1. SDG Multi-Label Classification of the Government Budget Data

The developed SC-C1 model was fine-tuned with the annotated budget data, using the same parameters as in the fine-tuning procedure described in Section 4.2.2. The number of epochs was increased to 30 to improve the classifier performance. The dataset proportions for model training, validation, and testing followed the experimental setup of MLTM [63] with 65% (3168 documents), 10% (170 documents), and 25% (1537 documents) of the whole data, respectively.

5.1.1. Results

Table 11 provides a performance comparison of MLTM and the proposed model on the government budget data. A probability threshold of 0.5 was used to decide whether a document should be categorized as in-class or not. Figure 10 depicts the average (over the classes) micro precision–recall curves for the IndoGovBERT-based and the MLTM classifiers. Figure 11 then details the class distribution of F1 macro scores obtained in the experiment.

5.1.2. Discussion

Results of Table 11 indicate that the proposed model outperformed MLTM in terms of all metrics used but micro precision. The SC-C1 model achieved a lower Hamming loss and higher values of Hamming score, accuracy, and micro and macro F1-scores. On the other hand, the application of MLTM resulted in relatively high scores in both micro and macro precision but the low recall dramatically affected the F1 metrics observed. The better performance of the proposed model over MLTM is also illustrated by the precision–recall curves of Figure 10.
As demonstrated in Figure 11, the SC-C1 model consistently outperformed MLTM in terms of F1 macro score. The proposed model achieved higher scores for all classes with only 6 of the total 17 being below 0.9 (but all above 0.7). In contrast, MLTM scored below 0.5 for more than half of the classes. The overall performance tendencies observed for both models are somewhat similar, with the exception for the Goal 13, 15, 16, and 17 classes, where the corresponding curve patterns diverge.

5.2. SDG Multi-Label Classification of the NSA Data

The NSA activity dataset was used both to fine-tune the SC-C1 model and to train MLTM. The data proportions were set as follows: 65% (1587 documents) for training, 10% (85 documents) for validation, and 25% (771 documents) for testing. The experimental design and model training parameters were identical to those of Section 5.1.

5.2.1. Results

Table 12 compares the performance of MLTM and the developed model on the NSA activity data. The same in-class probability threshold of 0.5 was applied as in the case of budget data. Average (over the classes) micro precision–recall curves of the two models are displayed in Figure 12, while Figure 13 details the models’ per-class performance in terms of F1 macro score.
The SC-C1 model was then compared with Wirawan’s model and the multilingual BERT model by Devlin et al. [28] (https://huggingface.co/bert-base-multilingual-uncased, accessed on 13 October 2023). The latter model was trained on Wikipedia dumps in 102 languages, including the Indonesian language. All tested models were fine-tuned using the same settings as described in Section 4.2.2, except for the number of epochs, which was set to 10. The reduction in epochs was implemented to facilitate a more practice-oriented assessment of the training capabilities of the three BERT-based models. During training, the F1 micro and macro metrics were monitored across the epochs, as depicted in Figure 14.

5.2.2. Discussion

It is evident from Table 12 and Figure 12 that MLTM underperformed compared to the IndoGovBERT model on the NSA data in terms of all metrics but micro precision. The proposed model consistently demonstrated much better classification results for all labels. It achieved high F1 macro scores surpassing 0.9 for labels “Goal 2”, “Goal 3”, “Goal 5”, “Goal 14”, and “Goal 15”. The overall performance tendencies observed in the curve patterns of both models are similar with remarkably good results obtained for the “Goal 5”, “Goal 14”, and “Goal 15” classes.
In the language-mixture experiment involving the three BERT-based models, the IndoGovBERT model outperformed Wirawan’s and the multilingual BERT models in terms of both F1 micro and macro scores (Figure 14). Wirawan’s and the multilingual models notably struggled to achieve significant improvements in the initial epochs. While training the multilingual model, the F1 micro and macro scores started to rise from zero only in the fourth epoch compared to the proposed model that produced scores well above zero already in the second epoch. Although Wirawan’s model also began to improve in the second epoch, its scores initially rose just marginally. The IndoGovBERT model’s performance jumped considerably beginning from the third epoch.
The observed curve tendencies suggest that the tested models’ performance might still improve with the increase in the number of epochs. Nevertheless, considering the longer time and computational resources that would be required to implement additional model training, the IndoGovBERT model comes up as an obviously preferable choice for this classification task in government settings.
An interesting observation is that the performance of the multilingual, Wirawan’s, and IndoGovBERT models assessed during the model testing phase was consistent with that demonstrated by the models after fine-tuning. In testing, the multilingual BERT yielded F1 micro and macro scores of 0.69 and 0.41, respectively. Wirawan’s model performed better, producing F1 micro and macro scores of 0.84 and 0.74, while the model developed in this study achieved even better results of 0.84 and 0.76, respectively. This fact highlights the limitations of the general-purpose multilingual BERT in handling Indonesian NSA documents and underscores the advantages of using models tailored specifically for the Indonesian language.

5.3. SDG Matchmaking

To emulate preliminary document screening by the government, a dataset was prepared with 100 documents selected randomly from the NSA document pool. For the screening targets, five Indonesian SDG-related program descriptions encompassing the diverse domains of road construction, waste management, vocational training, fish hatcheries, and water cleaning were chosen. The NSA and government documents thus formed 500 pairs which were manually assigned “Relevant” or “Irrelevant” labels by one BAPPENAS specialist based on subjective similarity assessment of the NSA activities and government programs as they are presented in the corresponding texts. The annotation process resulted in a heavily imbalanced dataset with only approximately 9% of the document pairs (44 of the total 500) labeled as “Relevant”. The latter closely replicates real-world situations when governments, through preliminary screening, select relatively few documents to be considered for funding programs and initiatives from a much larger pool of applications, reports, proposals, etc. It should be noted that in practice, labeling a document “Relevant” (i.e., selecting a document pair) at this stage would not automatically imply that the described activities would be considered for the corresponding funding program or initiative. Rather, it would signify that there is a meaningful similarity between descriptions in both documents, and that the selected NSA document needs further (manual) examination and vetting by the authorities in charge of the corresponding affairs.
To assess text similarity, three models were deployed: IndoGovBERT, Wirawan’s, and Multilingual BERT PTLMs. All documents forming the 500 pairs were tokenized using each model’s tokenizer, with padding (maximum length of 109 tokens) and truncation if the input exceeds the maximum length. The last layer of each model was leveraged to produce informative embeddings from the tokenized inputs, which were averaged to obtain single vector representations. Cosine similarity for the government and NSA document vectors was calculated, resulting in three sets (one set per model) of 500 values each. Beeswarm plots [70] in Figure 15 provide a summary of the results obtained. In the figure, document pairs labeled as “Relevant” are presented with red dots.
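The embedding and scoring pipeline can be sketched as follows; the model path and the two example texts are placeholders, and a stricter variant would exclude padding tokens from the average via the attention mask.

```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indogovbert-sc-c1")  # placeholder path
model = AutoModel.from_pretrained("indogovbert-sc-c1")
model.eval()

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, padding="max_length", truncation=True,
                       max_length=109, return_tensors="pt")     # max length from the paper
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state         # shape (1, 109, 768)
    return last_hidden.mean(dim=1)                              # simple token average

gov_vec = embed("Program pembangunan jalan nasional")           # government program text
nsa_vec = embed("Inisiatif perbaikan akses jalan desa")         # NSA activity text
print(float(cosine_similarity(gov_vec, nsa_vec)))
```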

5.3.1. Results

The mean values and variances of the “Relevant” vs. “Irrelevant” subsets of the cosine similarity data are as follows: 0.43 and 0.0066 vs. 0.35 and 0.0038 for the IndoGovBERT, 0.72 and 0.0040 vs. 0.67 and 0.0046 for Wirawan’s, and 0.83 and 0.0023 vs. 0.81 and 0.0035 for the Multilingual BERT models, respectively. A non-parametric Anderson–Darling k-sample test [71] suggested that all three models provide for a statistically significant separation in terms of average cosine similarity in the “Relevant” vs. “Irrelevant” data comparison (p-values of 0.001, 0.001, and 0.003 for the IndoGovBERT, Wirawan’s, and the Multilingual BERT models, respectively). The size of the minimum set containing all “Relevant” pairs, computed by lowering the cosine similarity selection threshold until all “Relevant” document pairs are selected, is 213 (a 57% reduction of the original set) for the proposed model, 340 (a 32% reduction) for Wirawan’s model, and 388 (a 22% reduction) for the Multilingual BERT model. A Brown–Forsythe test [72] indicated that, in terms of variances, there is a significant difference (p < 0.003) between the “Relevant” and “Irrelevant” sets of the IndoGovBERT model. In contrast, the variability of the “Relevant” cosine similarity data is on a par with that of the “Irrelevant” data in the cases of Wirawan’s model (p < 0.42) and the Multilingual BERT model (p < 0.12).
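Both tests are available in SciPy; a minimal sketch with synthetic stand-ins for the two similarity subsets (the Brown–Forsythe test is SciPy’s Levene test with the median as center):

```python
import numpy as np
from scipy.stats import anderson_ksamp, levene

rng = np.random.default_rng(1)                     # toy stand-ins for the 44/456 pairs
rel = rng.normal(0.43, 0.0066 ** 0.5, size=44)     # "Relevant" cosine similarities
irr = rng.normal(0.35, 0.0038 ** 0.5, size=456)    # "Irrelevant" cosine similarities

ad = anderson_ksamp([rel, irr])                    # location difference between subsets
bf_stat, bf_p = levene(rel, irr, center="median")  # Brown-Forsythe: variance difference
print(ad.significance_level, bf_p)
```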

5.3.2. Discussion

Given the shapes of the “swarms” in Figure 15, one could infer that the IndoGovBERT model is a better performer than the two baseline models for filtering out “Irrelevant” documents. The “root-down tree” shape of the “swarm” suggests that the corresponding model produced fewer document pairs with a relatively high cosine similarity. On the contrary, the “root-up tree” shapes obtained with the baseline models reveal that there are fewer document pairs with a relatively low similarity in both cases. The developed model would, accordingly, allow for setting a “cut-off” threshold with reference to cosine similarity, thereby reducing the number of documents that would require manual processing. The IndoGovBERT model also provided for a more compact grouping of the “Irrelevant” data than the “Relevant” data, as confirmed by the Brown–Forsythe test results. On the other hand, all three models allowed for statistically separating the “Relevant” and “Irrelevant” subsets. The latter would, however, have little practical importance, as even a small number of false negatives (i.e., relevant documents left unselected) would undermine the credibility of the screening procedure.

6. Overall Discussion, Conclusions, Limitations, and Future Work

6.1. Domain-Specific PTLM Development

Several observations can be made analyzing experimental results obtained with the pre-trained models in this study. First, it became evident that relying solely on the perplexity metric while monitoring the PTLM development can be insufficient, if not misleading. The developed SC-C1 PTLM, by far the superior model for solving the Indonesian government SDG tasks, actually has the worst perplexity among all models tested. This revelation echoes the concerns of Tay et al. [73], who argued that relying on upstream perplexity as an evaluation metric can be deceptive when assessing PTLM downstream task performance.
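As a reminder of what the cautioned-against metric measures, upstream perplexity is conventionally computed as the exponential of the mean masked-language-modeling cross-entropy loss on a held-out split. A one-function sketch (the loss value shown is illustrative only):

```python
import math

def perplexity(mean_mlm_eval_loss: float) -> float:
    """Conventional upstream perplexity from the mean MLM cross-entropy."""
    return math.exp(mean_mlm_eval_loss)

print(f"{perplexity(2.0):.2f}")  # a mean eval loss of 2.0 gives ppl 7.39
```

Because this quantity reflects only how well the model predicts masked tokens in its own evaluation split, two models with very different downstream utility can rank in either order on it, which is precisely the pitfall observed here.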
Another important observation is that the PTLMs developed from scratch outperformed both the fine-tuned general-purpose language models and the further pre-trained domain-specific models in solving the tested downstream tasks. The from-scratch models demonstrated better performance even when they were trained on comparatively small datasets. This advantage, however, comes at the expense of increased computational costs compared with the downstream fine-tuning approach. Therefore, balancing performance and environmental impacts should be considered for establishing sustainable practices for domain-specific PTLM development.
Regarding the various corpora utilized by the PTLMs scrutinized in this study, one should not overlook the crucial role of relevant document selection. It has been argued previously that the incorporation of historical data into training corpora can often have a positive effect on the trained model performance (e.g., see [74,75]). In this study, however, the model trained on the C2 corpus of governmental documents, covering the whole history of the Indonesian state since its establishment in 1945, could not reach the same level of performance as the more specific IndoGovBERT SC-C1 trained on documents created in 2019 through 2022. This can be attributed, at least in part, to the spelling rule changes implemented in Indonesia in 1972 [10]. Quite naturally, the writing style of older documents has become obsolete, which may make the C2 data excessively noisy with regard to processing contemporary documents. Therefore, when building a domain-specific PTLM, one needs to consider and experiment with the “expiration date” of the training data.

6.2. Automatic Multi-Label Classification

The reliability analysis of the IndoGovBERT SC-C1 model presented in Section 5.1 and Section 5.2 confirmed that the model developed in this study consistently outperformed the MLTM classification approach across various evaluation metrics for both the government budget tagging and NSA activity classification tasks. On the NSA mixed-language document classification task, the proposed model also outperformed both the Multilingual BERT model and Wirawan’s model, although the margin over the latter was smaller. These findings highlight the capability of the IndoGovBERT model to address the NLP challenges of SDG multi-label classification in the Indonesian government domain.
The relatively low performance scores observed in the experiments with MLTM can be attributed to several factors, including class (i.e., document per topic) imbalances within the datasets. The considerably better performance of the proposed model, on the other hand, is likely due to the domain semantics incorporated into the model through pre-training on a highly relevant yet compact and focused domain corpus, the effectiveness of the fine-tuning process, and the high relevance of the document features selected for solving the specific task.
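For concreteness, the sketch below shows a typical multi-label fine-tuning configuration for a BERT encoder in the Hugging Face transformers library. The checkpoint name, the 17-label SDG scheme, the example text, and the 0.5 decision threshold are assumptions for demonstration only, not the authors’ exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_SDG_LABELS = 17  # one label per SDG; an illustrative assumption

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=NUM_SDG_LABELS,
    problem_type="multi_label_classification",  # BCE-with-logits loss
)

texts = ["Program perlindungan sosial"]   # example budget line text
labels = torch.zeros((1, NUM_SDG_LABELS))
labels[0, 0] = 1.0                        # e.g., the line is tagged SDG 1

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print("loss:", outputs.loss.item())

probs = torch.sigmoid(outputs.logits)     # independent per-label probabilities
predicted = (probs > 0.5).int()           # a document may receive several SDGs
```

The sigmoid-per-label head is what distinguishes this setup from single-label classification: each SDG is scored independently, so one budget line can be tagged with several goals at once.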

6.3. Matchmaking

Cosine similarity threshold-based matchmaking has often been discussed in various contexts of NLP, including document screening (e.g., see [76,77]). Practical considerations dictate, however, that fully automatic document screening (e.g., based on machine learning) may hardly be achievable in actual government settings [37]. Rather, there is a need for AI-powered decision support tools that would alleviate the burden of manually processing large volumes of textual data. Addressing this need, Matsui et al. [39] demonstrated how cosine similarity could be used to assist SDG document selection. The efficiency of the proposed matchmaking procedure, as well as the performance of the PTLM deployed, received little attention from the authors, however. These two issues remain generally poorly understood, as they have seldom been discussed for tasks other than document classification.
The experiments of Section 5.3 have not only confirmed the potential of the IndoGovBERT model for discovering closely related government and NSA SDG documents but also put forward a possible evaluation and application framework for PTLMs to support the matchmaking process. The beeswarm plot offers a simple yet powerful tool for qualitative assessment of a model used to compute document similarity on a benchmark set. A root-down tree shape of the “similarity swarm”, with positively labeled documents concentrated in its upper part, signals that the tested model is suitable for supporting the document screening task. The size of the minimum set containing all “Relevant” data and the within-class variances would then offer quantitative insights into the model performance.
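A qualitative “similarity swarm” check of the kind proposed here can be produced with seaborn’s swarmplot, one possible beeswarm implementation. The scores below are synthetic stand-ins with roughly the group sizes used in the experiment, not the actual data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.43, 0.08, 44),     # "Relevant" stand-ins
                         rng.normal(0.35, 0.06, 456)])   # "Irrelevant" stand-ins
df = pd.DataFrame({"cosine similarity": scores,
                   "label": ["Relevant"] * 44 + ["Irrelevant"] * 456})

# One swarm per candidate model, with positively labeled pairs in red
ax = sns.swarmplot(data=df, y="cosine similarity", hue="label", size=3,
                   palette={"Relevant": "red", "Irrelevant": "grey"})
ax.set_title("Similarity swarm for a candidate PTLM")
plt.show()
```

A model suitable for screening would show the red points gathered near the top of the swarm, with the bulk of the grey mass tapering off below a usable cut-off threshold.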

6.4. Study Conclusions, Limitations, and Future Work

In the presented work, a specialized language model, IndoGovBERT, intended to process Indonesian Government SDG documents has been developed. The upstream approach undertaken by the authors complements the existing downstream efforts of the Indonesian NLP community. This work also lays a foundation for the development of more accurate and impactful NLP applications for various government tasks, especially in the context of the SDGs.
Different versions of the IndoGovBERT model were built and examined through experiments using two different pre-training methodologies with internal and public governmental domain corpora, which were also compiled by the authors. The subsequent evaluation of the best-performing version of the IndoGovBERT model against carefully selected baseline models on two real-world multi-label document classification tasks confirmed its superior capabilities. In the practical document screening scenario tested, the IndoGovBERT model also proved a better choice than the baseline general-purpose and multilingual models for matching SDG-related activities described in government budget and NSA documents.
The PTLM development and application framework presented in this study appears quite universal, especially in the context of government SDG and other similar activities. It could be used to quickly build specialized language models for other organizations which face document processing challenges similar to the ones addressed in this study. The latter, however, is subject to the availability of adequate training corpora.
It is understood that the presented study has limitations. The first stems from the text extraction method used in data preprocessing. The deployed method is agnostic to document layout, which caused many symbol recognition errors in the generated texts. While most misspelled (i.e., misrecognized) words were corrected in the subsequent preprocessing steps, other errors and inaccuracies were left unchecked and “noised” the training data semantics. The latter would negatively affect the pre-trained model’s performance. Future studies should, therefore, focus on refining the text extraction process to make it more layout-aware and, ultimately, more accurate.
It is also important to note that the proposed IndoGovBERT model was pre-trained only with internal documents collected at BAPPENAS, a unit of the Indonesian Government involved in development planning. The corresponding corpus is, therefore, much closer semantically to the domain of government budgeting (and to the downstream tasks considered) than the corpora of the other models tested in this study. Further experiments should be conducted to establish whether the IndoGovBERT SC-C2 version would overtake (or, at least, catch up with) the SC-C1 model in governmental subdomains other than budgeting. This is, however, naturally beyond the scope of this particular study.
The simplistic approach to enlarging the training data by concatenating domain corpora has proven to be a failure. This may be due to possible redundancy, obsolescence, writing style and domain mismatches, and other “noise” in the data with regard to the downstream tasks at hand. As merging the corpora in the experiments resulted in higher training costs but no performance gains, continuous pre-training [42] with incremental domain (e.g., [78]) or chronological (e.g., [79]) approaches might be a better alternative. This will be examined in future studies.
The results of the document matchmaking experiments are preliminary in character. To discuss the prospects of unsupervised or even semi-supervised document screening based on PTLMs, additional experiments are required with much larger and more diverse benchmark document sets. The latter is outside the scope of this paper and is also left for future work.
Future studies could, therefore, improve the IndoGovBERT model through experiments with other governmental data. Future work should also shed light on which model architecture would be the best choice in governmental settings, as this study focused exclusively on BERT-based models. All these efforts would be directed towards ensuring the continual evolution and adaptability of the model for diverse applications within the government domain.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, writing—original draft preparation, visualization, A.R.; writing—review and editing, M.K., U.S. and V.K.; supervision, V.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported, in part, by the Japan International Cooperation Agency (JICA), the SDG Global Leadership Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In line with the principles of open science, and to ensure the replicability of the results reported in this paper, all data (with the exception of non-public government documents), models, and code resulting from the study are made available online under an open-source license at https://github.com/Just108/IndoGovBERT accessed on 30 September 2024.

Acknowledgments

Access to documents which are not publicly available was facilitated by the first author’s government employment at BAPPENAS. Sincere thanks are extended to BAPPENAS for all the support received.

Conflicts of Interest

The first author has a concurrent position at the Indonesian Ministry of National Development Planning (BAPPENAS), while the others do not.

References

  1. UN. Transforming Our World: The 2030 Agenda for Sustainable Development; UN: New York, NY, USA, 2015. [Google Scholar]
  2. Sheriffdeen, M.; Nurrochmat, D.R.; Perdinan; Aliyu Abubakar, H.K. Effectiveness of emerging mechanisms for financing national climate actions; example of the Indonesia Climate Change Trust Fund. Clim. Dev. 2023, 15, 81–92. [Google Scholar] [CrossRef]
  3. Safitri, Y.; Ningsih, R.D.; Agustianingsih, D.P.; Sukhwani, V. COVID-19 Impact on SDGs and the Fiscal Measures: Case of Indonesia. Int. J. Environ. Res. Public Health 2021, 18, 2911. [Google Scholar] [CrossRef] [PubMed]
  4. Mutiarani, N.D.; Siswantoro, D. The impact of local government characteristics on the accomplishment of Sustainable Development Goals (SDGs). Cogent Bus. Manag. 2020, 7, 1847751. [Google Scholar] [CrossRef]
  5. Banerjee, A.; Murphy, E.; Walsh, P.P. Perceptions of Multistakeholder Partnerships for the Sustainable Development Goals: A Case Study of Irish Non-State Actors. Sustainability 2020, 12, 8872. [Google Scholar] [CrossRef]
  6. Setiawan, I.K.A.; Larasati, P.A.; Sugiarto, I. CSR Contextualization for Achieving the SDGs in Indonesia. J. Judic. Rev. 2021, 23, 183. [Google Scholar] [CrossRef]
  7. Choi, G.; Jin, T.; Jeong, Y.; Lee, S.K. Evolution of Partnerships for Sustainable Development: The Case of P4G. Sustainability 2020, 12, 6485. [Google Scholar] [CrossRef]
  8. Janowski, T. Implementing Sustainable Development Goals with Digital Government—Aspiration-capacity gap. Gov. Inf. Q. 2016, 33, 603–613. [Google Scholar] [CrossRef]
  9. Vinuesa, R.; Azizpour, H.; Leite, I.; Balaam, M.; Dignum, V.; Domisch, S.; Felländer, A.; Langhans, S.D.; Tegmark, M.; Fuso Nerini, F. The role of artificial intelligence in achieving the Sustainable Development Goals. Nat. Commun. 2020, 11, 1–10. [Google Scholar] [CrossRef]
  10. Sneddon, J. The Indonesian Language; University of New South Wales Press Ltd.: Randwick, Australia, 2003. [Google Scholar]
  11. Koto, F.; Rahimi, A.; Lau, J.H.; Baldwin, T. IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 757–770. [Google Scholar] [CrossRef]
  12. Wilie, B.; Vincentio, K.; Winata, G.I.; Cahyawijaya, S.; Li, X.; Lim, Z.Y.; Soleman, S.; Mahendra, R.; Fung, P.; Bahar, S.; et al. IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; pp. 843–857. [Google Scholar]
  13. Koto, F.; Lau, J.H.; Baldwin, T. INDOBERTWEET: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the EMNLP 2021—2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 10660–10668. [Google Scholar] [CrossRef]
  14. Cahyawijaya, S.; Aji, A.F.; Lovenia, H.; Winata, G.I.; Wilie, B.; Mahendra, R.; Koto, F.; Moeljadi, D.; Vincentio, K.; Romadhony, A.; et al. NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages. arXiv 2022, arXiv:2207.10524. [Google Scholar]
  15. Cahyawijaya, S.; Winata, G.I.; Wilie, B.; Vincentio, K.; Li, X.; Kuncoro, A.; Ruder, S.; Lim, Z.Y.; Bahar, S.; Khodra, M.; et al. IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8875–8898. [Google Scholar] [CrossRef]
  16. Beltagy, I.; Cohan, A.; Lo, K. SciBERT: Pretrained Contextualized Embeddings for Scientific Text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  17. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
  18. Peng, Y.; Yan, S.; Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the BioNLP 2019—SIGBioMed Workshop on Biomedical Natural Language Processing, 18th BioNLP Workshop and Shared Task, Florence, Italy, 1 August 2019; Number iv. pp. 58–65. [Google Scholar] [CrossRef]
  19. Santos, T.; Tariq, A.; Das, S.; Vayalpati, K.; Smith, G.H.; Trivedi, H.; Banerjee, I. PathologyBERT—Pre-trained vs. a New Transformer Language Model for Pathology Domain. arXiv 2022, arXiv:2205.06885. [Google Scholar]
  20. Abdelmageed, N.; Löffler, F.; König-Ries, B. BiodivBERT: A Pre-Trained Language Model for the Biodiversity Domain. In Proceedings of the 14th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS 2023), Basel, Switzerland, 13–16 February 2023; Yamaguchi, A., Splendiani, A., Marshall, M.S., Baker, C., Bolleman, J.T., Burger, A., Castro, L.J., Eigenbrod, O., Österle, S., Romacker, M., et al., Eds.; CEUR Workshop Proceedings. CEUR-WS.org: Basel, Switzerland, 2023; Volume 3415, pp. 62–71. [Google Scholar]
  21. Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The Muppets straight out of Law School. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Stroudsburg, PA, USA, 16–20 November 2020; Number i. pp. 2898–2904. [Google Scholar] [CrossRef]
  22. Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv 2019, arXiv:1908.10063. [Google Scholar]
  23. Rezayi, S.; Liu, Z.; Wu, Z.; Dhakal, C.; Ge, B.; Zhen, C.; Liu, T.; Li, S. AgriBERT: Knowledge-Infused Agricultural Language Models for Matching Food and Nutrition. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; Raedt, L.D., Ed.; AI for Good. International Joint Conferences on Artificial Intelligence Organization: Vienna, Austria, 2022; pp. 5150–5156. [Google Scholar] [CrossRef]
  24. Hu, Y.; Hosseini, M.S.; Parolin, E.S.; Osorio, J.; Khan, L.; Brandt, P.T.; D’Orazio, V.J. ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence. In Proceedings of the NAACL 2022—2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 10–15 July 2022; pp. 5469–5482. [Google Scholar] [CrossRef]
  25. Zheng, Z.; Lu, X.Z.; Chen, K.Y.; Zhou, Y.C.; Lin, J.R. Pretrained domain-specific language model for natural language processing tasks in the AEC domain. Comput. Ind. 2022, 142, 103733. [Google Scholar] [CrossRef]
  26. Xiong, Z.; Kong, D.; Xia, Z.; Xue, Y.; Song, Z.; Wang, P. Chinese government official document named entity recognition based on Albert. In Proceedings of the 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 24–26 April 2021; pp. 350–354. [Google Scholar] [CrossRef]
  27. Wallerö, E. Automatic Classification of Conditions for Grants in Appropriation Directions of Government Agencies. Master’s Thesis, Uppsala University, Uppsala, Sweden, 2022. Available online: https://www.diva-portal.org/smash/get/diva2:1679811/FULLTEXT01.pdf (accessed on 10 February 2024).
  28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  29. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Liu, Q., Schlangen, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 38–45. Available online: https://2020.emnlp.org/ (accessed on 10 February 2024). [CrossRef]
  30. Garrido-Merchan, E.C.; Gozalo-Brizuela, R.; Gonzalez-Carvajal, S. Comparing BERT against traditional machine learning models in text classification. J. Comput. Cogn. Eng. 2023, 2, 352–356. [Google Scholar] [CrossRef]
  31. García-Barragán, Á.; González Calatayud, A.; Solarte-Pabón, O.; Provencio, M.; Menasalvas, E.; Robles, V. GPT for medical entity recognition in Spanish. Multimed. Tools Appl. 2024, 1–20. [Google Scholar] [CrossRef]
  32. Yang, J.; Liu, C.; Deng, W.; Wu, D.; Weng, C.; Zhou, Y.; Wang, K. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns 2024, 5, 100887. [Google Scholar] [CrossRef]
  33. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–41. [Google Scholar] [CrossRef]
  34. Nugroho, A.; Widyawan; Kusumawardani, S.S. Distributed Classifier for SDGs Topics in Online News using RabbitMQ Message Broker. J. Phys. Conf. Ser. 2020, 1577, 012026. [Google Scholar] [CrossRef]
  35. Angin, M.; Taşdemir, B.; Yılmaz, C.A.; Demiralp, G.; Atay, M.; Angin, P.; Dikmener, G. A RoBERTa Approach for Automated Processing of Sustainability Reports. Sustainability 2022, 14, 16139. [Google Scholar] [CrossRef]
  36. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  37. Guariso, D.; Guerrero, O.A.; Castaneda, G. Automatic SDG Budget Tagging: Building Public Financial Management Capacity through Natural Language Processing. Data Policy 2023, 5, e31. [Google Scholar] [CrossRef]
  38. Morales-Hernández, R.C.; Juagüey, J.G.; Becerra-Alonso, D. A Comparison of Multi-Label Text Classification Models in Research Articles Labeled with Sustainable Development Goals. IEEE Access 2022, 10, 123534–123548. [Google Scholar] [CrossRef]
  39. Matsui, T.; Suzuki, K.; Ando, K.; Kitai, Y.; Haga, C.; Masuhara, N.; Kawakubo, S. A natural language processing model for supporting sustainable development goals: Translating semantics, visualizing nexus, and connecting stakeholders. Sustain. Sci. 2022, 17, 969–985. [Google Scholar] [CrossRef]
  40. Radiya-Dixit, E. How fine can fine-tuning be? Learning efficient language models. Proc. Mach. Learn. Res. 2020, 108, 2435–2443. [Google Scholar]
  41. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8342–8360. [Google Scholar] [CrossRef]
  42. Jin, X.; Zhang, D.; Zhu, H.; Xiao, W.; Li, S.W.; Wei, X.; Arnold, A.; Ren, X. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora. In Proceedings of the NAACL 2022—2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 10–15 July 2022; pp. 4764–4780. [Google Scholar] [CrossRef]
  43. Abnar, S.; Dehghani, M.; Neyshabur, B.; Sedghi, H. Exploring the Limits of Large Scale Pre-training. arXiv 2021, arXiv:2110.02095. [Google Scholar]
  44. Zhu, Q.; Gu, Y.; Luo, L.; Li, B.; Li, C.; Peng, W.; Huang, M.; Zhu, X. When does Further Pre-training MLM Help? An Empirical Study on Task-Oriented Dialog Pre-training. In Proceedings of the Second Workshop on Insights from Negative Results in NLP, Punta Cana, Dominican Republic, 10 November 2021; pp. 54–61. [Google Scholar] [CrossRef]
  45. Arslan, Y.; Allix, K.; Veiber, L.; Lothritz, C.; Bissyandé, T.F.; Klein, J.; Goujon, A. A Comparison of Pre-Trained Language Models for Multi-Class Text Classification in the Financial Domain. In Proceedings of the WWW ’21: Companion Proceedings of the Web Conference 2021, New York, NY, USA, 19–23 April 2021; pp. 260–268. [Google Scholar] [CrossRef]
  46. El Boukkouri, H.; Ferret, O.; Lavergne, T.; Zweigenbaum, P. Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 2626–2633. [Google Scholar]
  47. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2023, arXiv:2307.06435. [Google Scholar]
  48. Tai, W.; Kung, H.T.; Dong, X.; Comiter, M.; Kuo, C.F. exBERT: Extending pre-trained models with domain-specific vocabulary under constrained training resources. In Proceedings of the Findings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020, Online, 16–20 November 2020; pp. 1433–1439. [Google Scholar] [CrossRef]
  49. Lintang, S. IndoBERT (Indonesian BERT Model). 2020. Available online: https://huggingface.co/sarahlintang/IndoBERT (accessed on 18 July 2023).
  50. Wirawan, C. Indonesian BERT Base Model (Uncased). 2020. Available online: https://huggingface.co/cahya/bert-base-indonesian-522M (accessed on 18 July 2023).
  51. Rahmawati, A.; Alamsyah, A.; Romadhony, A. Hoax News Detection Analysis using IndoBERT Deep Learning Methodology. In Proceedings of the 2022 10th International Conference on Information and Communication Technology (ICoICT), Online, 2–3 August 2022; pp. 368–373. [Google Scholar] [CrossRef]
  52. Riyadi, A.; Kovacs, M.; Serdult, U.; Kryssanov, V. A Machine Learning Approach to Government Business Process Re-engineering. In Proceedings of the 2023 IEEE International Conference on Big Data and Smart Computing, BigComp 2023, Jeju, Republic of Korea, 13–16 February 2023; pp. 26–33. [Google Scholar] [CrossRef]
  53. Gregorutti, B.; Michel, B.; Saint-Pierre, P. Correlation and variable importance in random forests. Stat. Comput. 2017, 27, 659–678. [Google Scholar] [CrossRef]
  54. Nguyen, T.T.H.; Jatowt, A.; Coustaty, M.; Doucet, A. Survey of Post-OCR Processing Approaches. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
  55. Lee, K.; Ippolito, D.; Nystrom, A.; Zhang, C.; Eck, D.; Callison-Burch, C.; Carlini, N. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Dublin, Ireland, 2022; Volume 1: Long Papers, pp. 8424–8445. [Google Scholar] [CrossRef]
  56. Kandpal, N.; Wallace, E.; Raffel, C. Deduplicating Training Data Mitigates Privacy Risks in Language Models. Proc. Mach. Learn. Res. 2022, 162, 10697–10707. [Google Scholar]
  57. Samuel, D.; Kutuzov, A.; Touileb, S.; Velldal, E.; Øvrelid, L.; Rønningstad, E.; Sigdel, E.; Palatkina, A. NorBench—A Benchmark for Norwegian Language Models. arXiv 2023, arXiv:2305.03880. [Google Scholar]
  58. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  59. Komatsuzaki, A. One epoch is all you need. arXiv 2019, arXiv:1906.06669. [Google Scholar]
  60. Zhao, Z.; Zhang, Z.; Hopfgartner, F. A Comparative Study of Using Pre-Trained Language Models for Toxic Comment Classification. In Proceedings of the Companion Proceedings of the Web Conference 2021, New York, NY, USA, 19–23 April 2021; pp. 500–507. [Google Scholar] [CrossRef]
  61. Wang, C.; Li, M.; Smola, A.J. Language Models with Transformers. arXiv 2019, arXiv:1904.09408. [Google Scholar]
  62. Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the limits of language modeling. arXiv 2016, arXiv:1602.02410. [Google Scholar]
  63. Soleimani, H.; Miller, D.J. Semisupervised, Multilabel, Multi-Instance Learning for Structured Data. Neural Comput. 2017, 29, 1053–1102. [Google Scholar] [CrossRef]
  64. Hananto, V.R.; Serdült, U.; Kryssanov, V. A Text Segmentation Approach for Automated Annotation of Online Customer Reviews, Based on Topic Modeling. Appl. Sci. 2022, 12, 3412. [Google Scholar] [CrossRef]
  65. Zha, D.; Li, C. Multi-label dataless text classification with topic modeling. Knowl. Inf. Syst. 2019, 61, 137–160. [Google Scholar] [CrossRef]
  66. Spolaôr, N.; Cherman, E.A.; Metz, J.; Monard, M.C. A Systematic Review on Experimental Multi-Label Learning. ICMC Technical Report. 2013. Available online: https://repositorio.usp.br/directbitstream/d6b6a713-8e86-419c-8ee8-5b77a0ebf613/Relat%C3%B3rios+T%C3%A9cnicos_392_2013.pdf (accessed on 10 February 2023).
  67. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi-label classification. Mach. Learn. 2011, 85, 333–359. [Google Scholar] [CrossRef]
  68. Zhang, W.; Liu, F.; Luo, L.; Zhang, J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinform. 2015, 16, 1–11. [Google Scholar] [CrossRef] [PubMed]
  69. Schindler, D.; Spors, S.; Demiray, B.; Krüger, F. Automatic Behavior Assessment from Uncontrolled Everyday Audio Recordings by Deep Learning. Sensors 2022, 22, 8617. [Google Scholar] [CrossRef] [PubMed]
  70. Wilkinson, L. Dot plots. Am. Stat. 1999, 53, 276–281. [Google Scholar] [CrossRef]
  71. Scholz, F.W.; Stephens, M.A. K-sample Anderson–Darling tests. J. Am. Stat. Assoc. 1987, 82, 918–924. [Google Scholar] [CrossRef]
  72. Brown, M.B.; Forsythe, A.B. Robust Tests for the Equality of Variances. J. Am. Stat. Assoc. 1974, 69, 364–367. [Google Scholar] [CrossRef]
  73. Tay, Y.; Dehghani, M.; Rao, J.; Fedus, W.; Abnar, S.; Chung, H.W.; Narang, S.; Yogatama, D.; Vaswani, A.; Metzler, D. Scale Efficiently: Insights from Pretraining and Finetuning Transformers. arXiv 2022, arXiv:2109.10686. [Google Scholar]
  74. Yang, Z.; Yan, S.; Lad, A.; Liu, X.; Guo, W. Cascaded Deep Neural Ranking Models in LinkedIn People Search. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 4312–4320. [Google Scholar] [CrossRef]
  75. Rastas, I.; Ciarán Ryan, Y.; Tiihonen, I.; Qaraei, M.; Repo, L.; Babbar, R.; Mäkelä, E.; Tolonen, M.; Ginter, F. Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, Dublin, Ireland, 26–27 May 2022; Tahmasebi, N., Montariol, S., Kutuzov, A., Hengchen, S., Dubossarsky, H., Borin, L., Eds.; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2022; pp. 68–77. [Google Scholar] [CrossRef]
  76. Davis, C.; Aid, G. Machine learning-assisted industrial symbiosis: Testing the ability of word vectors to estimate similarity for material substitutions. J. Ind. Ecol. 2022, 26, 27–43. [Google Scholar] [CrossRef]
  77. Karabulut, E.; Sofia, R.C. An Analysis of Machine Learning-Based Semantic Matchmaking. IEEE Access 2023, 11, 27829–27842. [Google Scholar] [CrossRef]
  78. Qin, Y.; Zhang, J.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; Zhou, J. ELLE: Efficient lifelong pre-training for emerging data. arXiv 2022, arXiv:2203.06311. [Google Scholar]
  79. Loureiro, D.; Barbieri, F.; Neves, L.; Anke, L.E.; Camacho-Collados, J. Timelms: Diachronic language models from twitter. arXiv 2022, arXiv:2202.03829. [Google Scholar]
Figure 1. PTLM development scenarios for solving domain-specific NLP tasks.
Figure 2. An overview of the presented study.
Figure 3. Permutation importance of the budget data features.
Figure 4. Label distribution of the budget data.
Figure 5. Permutation importance of the NSA data features.
Figure 6. Label distribution of the NSA data.
Figure 7. An overview of the specialized PTLM development methodology. Model development phases of the best-performing version of the IndoGovBERT PTLM are shown with the thick blue arrows.
Figure 8. Document preprocessing.
Figure 9. F1 macro scores for the budget tagging task by the different models.
Figure 10. Micro precision–recall averaged over classes and the corresponding AUC values for the budget data classification task.
Figure 11. F1 macro score by class labels for the budget data classification task.
Figure 12. Micro precision–recall averaged over classes and the corresponding AUC values for the NSA data classification task.
Figure 13. F1 macro score by class labels for the NSA activity classification task.
Figure 14. Performance monitoring of the IndoGovBERT SC-C1, Wirawan’s, and Multilingual BERT models while fine-tuning on the NSA data.
Figure 15. Cosine similarity score distribution. In the figure, m stands for the mean value and v denotes variance.
Table 1. Examples of domain-specific corpora utilized to build PTLMs.

| PTLM | Corpora |
|---|---|
| AgriBERT [23] | Agricultural literature |
| ARCBERT [25] | AEC (Architecture, Engineering, Construction) regulatory texts, encyclopedia AEC-related texts, Wikipedia AEC-related pages |
| BioBERT [17] | Biomedical academic papers |
| BioDivBERT [20] | Biodiversity academic papers |
| ConfliBERT [24] | Conflict-related reports and news |
| FinBERT [22] | Financial texts (news, reports, announcements) |
| LegalBERT [21] | Legal texts (laws, court pleadings, contracts) |
| PathologyBERT [19] | Pathology reports |
| SciBERT [16] | Scientific academic papers |
Table 2. Monolingual Indonesian BERT-based PTLMs.

| Monolingual BERT | Koto et al. [11] | Wilie et al. [12] | Lintang [49] | Wirawan [50] |
|---|---|---|---|---|
| Model name | IndoBERT | IndoBERT | IndoBERT | Indonesian BERT base |
| Data source (Indonesian) | Wikipedia, News, Web Corpus | Indo4B | OSCAR corpus | Wikipedia |
| Data size | (220 M words) | 32 GB (4 B words) | 16 GB (2 B words) | (522 M words) |
| Vocabulary size | 32,000 | 30,522 | 32,000 | 32,000 |
| Case type | uncased | uncased | uncased | uncased |
| Training objective | MLM | MLM, NSP | MLM | MLM |
Table 3. Selected features of the budget data with examples *.

| Feature | Description | Example |
|---|---|---|
| national_priority | National priority name, regulated by BAPPENAS | Meningkatkan sumber daya manusia berkualitas dan berdaya saing (improving the quality and competitiveness of human resources) |
| program | Program name by ministries | Program perlindungan sosial (social protection program) |
| ministry | Ministry name | Kementerian Sosial (Ministry of Social Affairs) |
| priority_program | Priority program name, regulated by BAPPENAS | Pengendalian penduduk dan penguatan tata kelola kependudukan (population control and strengthening the management of population) |
| activity | Expenditures | Pengelolaan data terpadu kesejahteraan sosial (integration of the social welfare data management) |
| detail_output | Details | Data terpadu kesejahteraan sosial (unified social welfare data) |

* English translations are given in the brackets.
Table 4. Selected features of the NSA data with examples *.

| Feature | Description | Example |
|---|---|---|
| output | Output of the activity | Analisis finansial dan model bisnis implementasi bus listrik di Medan dan Bandung (financial analysis and a business model for implementing electric buses in Medan and Bandung) |
| activity | Examples | Asistensi teknis perencanaan implementasi bus listrik di Medan dan Bandung (technical assistance for planning the deployment of electric buses in Medan and Bandung) |
| funding | Funding source | World Bank |
| program | Program name | e-mobility adoption roadmap for the Indonesian mass transit program |

* English translations are given in the brackets.
Table 5. Constructed corpora.

| Item | Internal Documents (C1) | Public Documents (C2) | Combined (C3 = C1 + C2) |
|---|---|---|---|
| Content | Letters | Law and regulatory texts, government reports, presidential/ministerial speeches | Letters, law and regulatory texts, government reports, presidential/ministerial speeches |
| Period | 2009–2022 | 1945–2022 | 1945–2022 |
| Count (original), documents | 97,783 (pdf) + 7 (MS Word) | 138,089 (pdf) + 922 (MS Word) | 235,872 (pdf) + 929 (MS Word) |
| Count (retained), documents | 97,385 | 133,251 | 230,636 |
| Size (retained), GB | 1.9 | 4.3 | 6.2 |
| Size (deduplicated), GB | 0.4 (59 million words) | 1.5 (196 million words) | 1.9 (255 million words) |
Table 6. Websites scraped.

| Content | URL * |
|---|---|
| Law and regulatory texts | https://peraturan.bpk.go.id/, https://peraturan.go.id/, https://jdihn.go.id/ |
| Government reports, presidential and ministerial speeches | https://perpustakaan.bappenas.go.id/e-library/data/dokumen-bappenas |

* accessed on 10 November 2022.
Table 7. SC PTLM pre-training results.

| Corpus | Time (Hours) | Training Loss | Perplexity |
|---|---|---|---|
| C1 | 10:08 | 6.85 | 798.18 |
| C2 | 32:31 | 5.55 | 154.57 |
| C3 | 42:19 | 3.36 | 15.91 |
Table 8. Model performance in the SDG budget tagging experiments *.

| Model | Accuracy ** | F1 Macro ** | F1 Micro ** |
|---|---|---|---|
| Koto’s | 0.368 | 0.352 | 0.556 |
| Lintang’s | 0.500 | 0.646 | 0.698 |
| Wirawan’s | 0.543 | 0.646 | 0.729 |
| Wilie’s | 0.523 | 0.627 | 0.715 |
| SC-C1 | **0.575** | **0.683** | **0.744** |
| SC-C2 | 0.532 | 0.650 | 0.717 |
| SC-C3 | 0.519 | 0.627 | 0.705 |

* best-performing model results are shown in bold; ** averaged through a 5-fold cross-validation.
Table 9. PTLM further pre-training results.

| Base Model | Corpus | Time (Hours) | Training Loss | Perplexity |
|---|---|---|---|---|
| Wirawan’s | C1 | 12:12 | 3.35 | 20.25 |
| Wirawan’s | C2 | 34:44 | 2.19 | 6.49 |
| Wirawan’s | C3 | 46:13 | 2.44 | 8.35 |
| SC-C2 | C1 | 10:45 | 4.79 | 62.34 |
| SC-C1 | C2 | 34:15 | 4.99 | 24.74 |
Table 10. FT PTLM performance in the SDG budget tagging experiments *.

| Model | Accuracy ** | F1 Macro ** | F1 Micro ** |
|---|---|---|---|
| Wirawan’s-FT-C1 | 0.507 | 0.609 | 0.697 |
| Wirawan’s-FT-C2 | 0.538 | 0.637 | 0.719 |
| Wirawan’s-FT-C3 | **0.543** | **0.646** | **0.727** |
| SC-C2-FT-C1 | 0.532 | 0.641 | 0.716 |
| SC-C1-FT-C2 | 0.524 | 0.638 | 0.709 |

* best-performing model results are shown in bold; ** averaged through a 5-fold cross-validation.
Table 11. MLTM vs. SC-C1 performance comparison on the SDG multi-label classification task using the budget data *.

| Method | H-Loss | H-Score | Accuracy | Micro Precision | Micro Recall | Micro F1 Score | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| MLTM | 0.04 | 0.49 | 0.43 | **0.90** | 0.46 | 0.61 | 0.84 | 0.47 | 0.57 |
| SC-C1 | **0.02** | **0.85** | **0.78** | 0.88 | **0.85** | **0.86** | **0.89** | **0.85** | **0.87** |

* best results are shown in bold.
Table 12. MLTM vs. SC-C1 performance comparison for the SDG multi-label classification task using the NSA data *.

| Method | H-Loss | H-Score | Accuracy | Micro Precision | Micro Recall | Micro F1 Score | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| MLTM | 0.04 | 0.35 | 0.35 | **0.90** | 0.34 | 0.49 | 0.64 | 0.27 | 0.36 |
| SC-C1 | **0.02** | **0.82** | **0.82** | 0.89 | **0.82** | **0.85** | **0.88** | **0.75** | **0.80** |

* best results are shown in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
