1. Introduction
Many countries still face challenges in achieving the Sustainable Development Goals (SDGs), a set of global objectives to be met by 2030, formulated by the United Nations in 2015 with the overall aim of promoting global sustainability [1]. These challenges include the impact of climate change [2] and the effects of unforeseen and unprecedented events, such as the COVID-19 pandemic [3]. To achieve all 17 SDGs and their 169 targets, governments and various non-state actors (NSAs), such as civil society organizations, religious groups, and private companies, must coordinate their efforts and react promptly to social, economic, and environmental changes. Government spending can be adjusted to finance areas vital for impactful SDG results [4], while NSA SDG initiatives, such as Corporate Social Responsibility [5], can foster positive change and promote community engagement [6]. Collaboration between all parties, achieved by aligning their activities, can furthermore help solve complex problems more effectively than any individual effort could [7]. Although many governments have made arrangements to define each goal within their budgets and to create opportunities for NSAs to voluntarily list their SDG-related activities in SDG action plans, effective coordination of government spending is often hampered by the large amount of manual document processing involved. A known approach to overcoming this obstacle is the broad introduction of digital governance [8], which leverages technologies such as Natural Language Processing (NLP). These technologies largely rely on artificial intelligence with its vast array of machine learning methods and models [9].
The presented research thus aims to develop a machine learning approach to facilitate the identification of sustainability goals in government and NSA documents in Indonesia and to help discover possible means to achieve these goals. The study introduces a set of corpora, a language model development methodology, and software tools and methods that would allow the Indonesian government and NSAs not only to expedite document processing and enhance their decision-making processes but also to promote collaboration among all actors involved in the pursuit of the SDG agenda. While this study focuses specifically on Indonesian government activities, its results could be adopted and adapted to help other governments with SDG document processing as well.
The Indonesian language, “Bahasa Indonesia”, is the country’s only official language; it symbolizes national unity and is spoken by over 270 million people [10]. At the same time, it is recognized as one of the so-called “low-resource languages” in the NLP landscape [11,12,13]. A few attempts have been made to address this issue, including the development of pre-trained language models (e.g., IndoBERT [11,12]), the introduction of annotated datasets (e.g., NusaCrowd [14]), and the creation of benchmark tasks (e.g., IndoLEM [11], IndoNLU [12], and IndoNLG [15]). Most of the work on pre-trained models, however, has been directed at the development of general-purpose, generic, or “universal” models, with little attention given to domain-specific language models. To the authors’ best knowledge, only one Indonesian domain-specific language model has been built so far: IndoBERTweet [13].
Pre-trained language models (PTLMs) specialized for a particular domain are known to outperform general-purpose language models on domain-specific downstream tasks [16,17]. Such models are developed either by further pre-training generic models or by training language models from scratch on a large amount of domain-specific data. Notable domain-specific models developed for the English language include examples in the medical domain (BioBERT [17], BlueBERT [18], PathologyBERT [19]), the biodiversity domain (BiodivBERT [20]), the science domain (SciBERT [16]), the legal domain (Legal-BERT [21]), the financial domain (FinBERT [22]), the agricultural domain (AgriBERT [23]), the conflict domain (ConfliBERT [24]), and the architecture, engineering, and construction domain (ARCBERT [25]). For the governmental domain, a few non-English PTLMs exist, including ones for Chinese (GovAlbert [26]) and Swedish (KB-BERT [27]). No domain-specific models have been proposed for solving Indonesian government tasks, however.
In this paper, the authors seek to bridge this research gap: the absence of PTLMs built for solving Indonesian government domain-specific tasks. Since Bidirectional Encoder Representations from Transformers (BERT) models have proven highly effective in addressing various NLP challenges [28] and are known to be superior to other machine learning approaches (e.g., see [29,30,31,32]), a BERT-based architecture was chosen for developing the PTLMs in the presented research. This research therefore puts forward a BERT-based domain-specific language model, pre-trained on governmental texts, to effectively process SDG-related documents in the course of Indonesian government decision-making. An effective methodology for the development of high-performance PTLMs for solving downstream NLP tasks is also proposed. All resources necessary to implement the methodology (source code, text corpora, etc.) have been made publicly available at https://github.com/Just108/IndoGovBERT (accessed on 30 September 2024).
The original contributions of the presented study are, therefore, as follows:
A methodology for the development of high-performance PTLMs for solving document-processing downstream tasks in the governmental context.
An open-data corpus of Indonesian government documents for NLP research.
A domain-specific pre-trained Indonesian language model, IndoGovBERT, for SDG document processing by the Indonesian government.
The developed IndoGovBERT model was thoroughly examined through experiments and found to be efficient and effective in real-world scenarios of multi-label classification of government budget items and NSA activities. The model was also deployed to facilitate the matchmaking of SDG-related activities by the government and NSAs, in order to uncover possible collaboration opportunities for the parties. In all experiments conducted, the proposed model outperformed the other models that could plausibly be used in government settings and were, for that reason, tested in this study.
The rest of the paper is organized as follows. Section 2 surveys the related literature. Section 3 provides an overview of the approach and resources used. Section 4 then details the model development methodology and introduces the IndoGovBERT model. Section 5 presents application examples of the developed model in SDG document processing contexts. Finally, Section 6 provides an overall discussion and concludes the study.
3. Methodology and Resources
3.1. Approach Overview
To fulfill the study objectives formulated in Section 1, the following procedure is undertaken (see also Figure 2).
3.1.1. Domain Corpora Development
In this initial phase, domain-specific corpora tailored to the Indonesian government context are constructed to serve as the foundation for pre-training language models. The development of the domain-specific corpora follows a two-step approach: document collection and preprocessing. Section 4.1 details the corpora development process.
3.1.2. PTLM Development
The development of an Indonesian Government BERT-based PTLM (referred to in this paper as IndoGovBERT) involves leveraging the government’s domain corpora to capture specific linguistic patterns and nuances relevant to the domain. Several versions of the IndoGovBERT model are developed, fine-tuned, and compared, followed by the selection of the best-performing model for further experiments. Section 4.2 explicates the PTLM development process.
3.1.3. PTLM Application on Downstream Tasks
The selected best-performing IndoGovBERT model is used to solve various SDG-related downstream NLP tasks, such as multi-class, multi-label document classification and document matchmaking, with the aim of contributing to decision-making processes within governmental contexts. Section 5 presents the results obtained in these experiments.
3.2. Baseline Models Used
In the past few years, several Indonesian monolingual BERT-based PTLMs have been developed (see Table 2). Koto et al. introduced the IndoBERT model trained on a dataset sourced from Wikipedia, news articles, and the Indonesian Web corpus [11]. Wilie et al. released a pre-trained model with the same name, IndoBERT, which used unlabeled text data from the Indo4B project [12]. There are also two publicly available pre-trained language models hosted on Hugging Face (http://huggingface.co/models, accessed on 30 September 2024): the IndoBERT model by Lintang, trained on the OSCAR corpus [49], and the Indonesian BERT base model by Wirawan, trained on Wikipedia data [50]. All these pre-trained models have previously been utilized by the Indonesian research community for solving various NLP tasks (e.g., see [51]).
As indicated in Table 2, the four general-purpose language models all have approximately the same vocabulary size, are uncased, and were trained with the Masked Language Modeling (MLM) objective (Wilie et al.’s model was additionally trained with the Next Sentence Prediction, or NSP, objective). In terms of training data size, Wilie et al.’s model tops the list with 4 billion words, followed by Lintang’s, Wirawan’s, and Koto et al.’s models, in that order. To avoid confusion between the identically named models, the PTLMs of Table 2 will be referred to in this paper as Koto’s, Wilie’s, Lintang’s, and Wirawan’s models.
In contrast to the situation with general-purpose models, the availability of domain-specific PTLMs for the Indonesian language is quite limited. Currently, the IndoBERTweet model [13] is, in fact, the only publicly available instance. The IndoBERTweet PTLM was developed through continual domain-adaptive pre-training of Koto’s model, using a dataset of 409 million word tokens extracted from Indonesian tweets; this dataset is twice as large as the corpora used to pre-train the general-purpose IndoBERT model. Since the IndoBERTweet PTLM was developed for a domain unrelated to the government and the SDGs, only the four general-purpose models of Table 2 will be used as baselines for benchmarking purposes in this study.
3.3. SDG Data
To examine the performance of the PTLMs, real-world data were obtained from the Indonesian government, with a particular focus on classification tasks concerning the SDGs. The SDGs encompass 17 global goals which, in government settings, typically necessitate topic-based document classification with 17 labels, defined as follows: “No Poverty” (Goal 1), “Zero Hunger” (Goal 2), “Good Health and Well-being” (Goal 3), “Quality Education” (Goal 4), “Gender Equality” (Goal 5), “Clean Water and Sanitation” (Goal 6), “Affordable and Clean Energy” (Goal 7), “Decent Work and Economic Growth” (Goal 8), “Industry, Innovation, and Infrastructure” (Goal 9), “Reduced Inequality” (Goal 10), “Sustainable Cities and Communities” (Goal 11), “Responsible Consumption and Production” (Goal 12), “Climate Action” (Goal 13), “Life Below Water” (Goal 14), “Life on Land” (Goal 15), “Peace, Justice, and Strong Institutions” (Goal 16), and “Partnerships for the Goals” (Goal 17). Data for this study were primarily sourced from the Indonesian national SDG action plan (Rencana Aksi Nasional, the RAN document collection), covering the period from 2021 to 2024. The plan was developed and is monitored by the Indonesian SDG secretariat within the Ministry of National Development Planning (BAPPENAS). RAN acknowledges the vital role of stakeholders in advancing the SDGs through an attachment list that details SDG-related activities in three text collections. Specifically, the collections include descriptions of (i) budget programs and activities initiated by the central government, (ii) non-government programs and activities involving civil society, philanthropic, and academic organizations, and (iii) non-government programs and activities of various business actors. In the presented study, only SDG documents from (i) and (ii) are used; collection (iii) is excluded owing to the structural similarity between documents (ii) and (iii).
In the RAN action plan, stakeholders voluntarily documented their proposals and manually annotated them with SDG labels to indicate the projected contributions to achieving the corresponding goals. One of the objectives of the presented study is, therefore, to facilitate the classification of SDG documents created by the stakeholders. The models, methods, and tools developed in this study are to be deployed to establish a matching process between NSA activities and government spending, as described in the documents.
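To make the classification target concrete, SDG annotations of this kind are conventionally represented as 17-dimensional multi-hot vectors, one position per goal. The following minimal Python sketch illustrates such an encoding; it reflects the multi-label setup described above, but the function name and the exact encoding used in the study’s pipeline are assumptions for illustration.

```python
# Minimal sketch: encoding assigned SDG goals as a multi-hot target vector.
# Illustrative only; not necessarily the exact encoding used in the study.
import numpy as np

NUM_GOALS = 17

def to_multi_hot(goal_ids):
    """Convert a list of assigned SDG goal numbers (1..17) into a multi-hot vector."""
    y = np.zeros(NUM_GOALS, dtype=np.float32)
    for g in goal_ids:
        y[g - 1] = 1.0
    return y

# Example: a document annotated with "Affordable and Clean Energy" (Goal 7)
# and "Climate Action" (Goal 13)
print(to_multi_hot([7, 13]))
```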
3.3.1. SDG Budget Tagging Data
Apart from being listed in the RAN document, SDG-related programs and initiatives by the central government are also filed in the “Kolaborasi Perencanaan dan Informasi Kinerja Anggaran” (KRISNA, roughly “Collaboration for Planning and Budget Performance Information”) system, which has been used by the government for planning and budgeting purposes since 2017. In KRISNA, expenditures are manually tagged with various budget labels, including the SDG labels. As SDG budget tagging started in 2021, the KRISNA data utilized in this study cover the period of 2021 through 2023.
The collected budget data were subjected to preprocessing, as suggested by Riyadi et al. [52]. To reduce irrelevant and noisy information, the data were also manually cleaned by BAPPENAS specialists. Missing values and formatting inconsistencies were handled to standardize the dataset. Feature selection was conducted to retain significant attributes by assessing the permutation importance of variables in a Random Forest classifier [53]. Features with an importance score exceeding 0.05 were considered important, reducing the number of input features from 12 to 6 for further analysis (see Figure 3); the importance score threshold was set based on human judgment. Table 3 exemplifies the six remaining features. Since all these features are textual, the corresponding texts were concatenated to form a single string input for the classifiers.
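The sketch below illustrates this feature-selection step with scikit-learn, using the permutation importance of a Random Forest classifier and the 0.05 threshold reported above. The placeholder data, the numeric per-feature encoding, and all variable names are assumptions made for illustration; the actual pipeline of Riyadi et al. [52] may differ.

```python
# Hedged sketch of the feature-selection step: permutation importance of a
# Random Forest classifier with a 0.05 importance threshold (as in the paper).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data: one numeric encoding per candidate feature (12 columns for
# the budget dataset); the target is simplified to a single label per document.
X, y = np.random.rand(500, 12), np.random.randint(0, 17, 500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=42)

IMPORTANCE_THRESHOLD = 0.05  # value set by human judgment, as noted in the text
selected = [i for i, score in enumerate(result.importances_mean)
            if score > IMPORTANCE_THRESHOLD]
print("Retained feature indices:", selected)

# The texts of the retained features are then concatenated into one string,
# e.g., " ".join(doc[f] for f in retained_features), to form the classifier input.
```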
The preprocessing steps yielded 4875 unique data entries with the label distribution depicted in Figure 4 (the total over all goals differs from the number of entries owing to multi-label assignments). The average document length is 34.5 words, and the maximum length is 64 words.
3.3.2. SDG NSA Activity Data
The NSA data comprise documents describing non-government programs and activities that involve civil society and philanthropic organizations, academic institutions, and other entities of the RAN collection. These data were preprocessed in the same way as the budget data. Results of the permutation-importance-based feature selection are shown in Figure 5. The selection threshold was set at 0.16, reducing the number of features from seven to four. The remaining features are listed in Table 4, together with specific examples. The same text-concatenation approach as for the budget data was applied to form the classifier input.
The preprocessing returned 2445 entries. The average document length is 26.1 words, and the maximum length is 109 words. Figure 6 depicts the SDG labeling of the data (the total over the goals differs from the number of entries owing to multi-label assignments). As can be seen from the figure, the class distribution is unbalanced, with the “Goal 7” class (“Affordable and Clean Energy”) having only 12 documents. The same pattern of “Goal 7” having the lowest data count (150 documents) also occurred in the budget data (see Figure 4). Several other classes in the NSA data received fewer than 100 entries: “Goal 9”, “Goal 10”, “Goal 2”, “Goal 17”, and “Goal 6”, with 35, 66, 75, 84, and 85 documents, respectively, sorted in ascending order by the number of entries.
Unlike the budget data, which are in principle monolingual, the NSA data contain both Indonesian and English texts. An analysis of the NSA documents with the langdetect tool (https://pypi.org/project/langdetect/, accessed on 30 September 2024), with the language probability parameter set to 70%, found that 155 NSA entries are English documents, accounting for approximately 6% of the whole dataset. An example of this language mixing is given in Table 4, where the features are formulated in both English and Indonesian. Thus, the NSA dataset is both unbalanced and multilingual (more specifically, bilingual), which poses an additional challenge for the development of NLP solutions in the Indonesian government domain.
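The language check described above can be sketched as follows, using the langdetect package with the 70% probability setting reported in the text; the helper function and the deterministic seed are illustrative choices rather than the study’s published code.

```python
# Minimal sketch of the English-document check with langdetect (pip install langdetect).
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect's probabilistic output deterministic

def is_english(text, threshold=0.70):
    """True if langdetect assigns English a probability of at least `threshold`."""
    try:
        return any(l.lang == "en" and l.prob >= threshold for l in detect_langs(text))
    except Exception:
        return False  # langdetect raises on empty or undetectable input

print(is_english("Improving access to affordable and clean energy in rural areas"))
```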
6. Overall Discussion, Conclusions, Limitations, and Future Work
6.1. Domain-Specific PTLM Development
Several observations can be made from the experimental results obtained with the pre-trained models in this study. First, it became evident that relying solely on the perplexity metric while monitoring PTLM development can be insufficient, if not misleading. The developed SC-C1 PTLM, by far the superior model for solving the Indonesian government SDG tasks, actually has the worst perplexity among all models tested. This finding echoes the concerns of Tay et al. [73], who argued that relying on upstream perplexity as an evaluation metric can be deceptive when assessing PTLM downstream task performance.
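For reference, the perplexity monitored during MLM pre-training is conventionally the exponential of the mean masked-token cross-entropy loss, as the minimal sketch below shows (the loss value is illustrative only). A low perplexity thus certifies only good masked-token prediction on the held-out corpus, which, as the SC-C1 results demonstrate, need not translate into good downstream performance.

```python
# Perplexity from a held-out MLM evaluation loss (illustrative value).
import math

eval_loss = 2.31  # mean masked-token cross-entropy, in nats
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")  # ~10.07
```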
Another important observation is that the PTLMs developed from scratch outperformed both the fine-tuned general-purpose language models and the further pre-trained domain-specific models in solving the tested downstream tasks. The from-scratch models demonstrated better performance even when trained on comparatively small datasets. This advantage, however, comes at the expense of increased computational costs compared with the downstream fine-tuning approach. Balancing performance gains against computational and environmental costs should therefore be part of establishing sustainable practices for domain-specific PTLM development.
Regarding the various corpora utilized by the PTLMs scrutinized in this study, one should not overlook the crucial role of relevant document selection. It has been argued previously that incorporating historical data into training corpora often has a positive effect on trained model performance (e.g., see [74,75]). In this study, however, the model trained on the C2 corpus of governmental documents, which covers the whole history of the Indonesian state since its establishment in 1945, could not reach the same level of performance as the more specific IndoGovBERT SC-C1, trained on documents created in 2019 through 2022. This can be attributed, at least in part, to the spelling rule changes implemented in Indonesia in 1972 [10]. Quite naturally, the writing style of older documents has become obsolete, which may make the C2 data excessively noisy with regard to processing contemporary documents. Therefore, when building a domain-specific PTLM, one needs to consider and experiment with the “expiration date” of the training data.
6.2. Automatic Multi-Label Classification
The reliability analysis of the IndoGovBERT SC-C1 model presented in Section 5.1 and Section 5.2 confirmed that the model developed in this study consistently outperformed the MLTM classification approach across various evaluation metrics for both the government budget tagging and NSA activity tasks. The proposed model also demonstrated better performance than the multilingual BERT model on the NSA mixed-language document classification task, for which it likewise outperformed Wirawan’s model, although by a smaller margin than in the case of the multilingual model. These findings highlight the capability of the IndoGovBERT model to address the NLP challenges of SDG multi-label classification in the Indonesian government domain.
The relatively low performance scores observed in the experiments with MLTM may be attributed to several factors, including class (i.e., documents per topic) imbalances within the datasets. The considerably better performance of the proposed model, in turn, is likely due to the domain semantics incorporated into the model through pre-training on a highly relevant yet compact and focused domain corpus, the effectiveness of the fine-tuning process, and the high relevance of the document features selected for solving the specific task.
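As an illustration of this multi-label setup, the sketch below loads a BERT-based checkpoint with one sigmoid output per SDG goal and applies it to a single document, using the Hugging Face transformers API. The checkpoint path, the example text, and the 0.5 decision threshold are assumptions for illustration, not the study’s published configuration.

```python
# Hedged sketch: multi-label SDG classification with a BERT-based checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/indogovbert"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=17,
    problem_type="multi_label_classification",  # BCE-with-logits loss when fine-tuning
)

text = "Program peningkatan akses energi bersih di daerah tertinggal"
inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits).squeeze(0)  # one probability per goal

predicted_goals = [i + 1 for i, p in enumerate(probs.tolist()) if p >= 0.5]
print("Predicted SDG goals:", predicted_goals)
```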
6.3. Matchmaking
Cosine similarity threshold-based matchmaking has often been discussed in various NLP contexts, including document screening (e.g., see [76,77]). Practical considerations dictate, however, that fully automatic document screening (e.g., based on machine learning) is hardly achievable in actual government settings [37]. Rather, there is a need for AI-powered decision support tools that would alleviate the burden of manually processing large volumes of textual data. Addressing this need, Matsui et al. [39] demonstrated how cosine similarity can be used to assist SDG document selection. The efficiency of the proposed matchmaking procedure, as well as the performance of the deployed PTLM, received little attention from those authors, however. These two issues remain generally poorly understood, as they have seldom been discussed for tasks other than document classification.
The experiments of Section 5.3 have not only confirmed the potential of the IndoGovBERT model for discovering closely related government and NSA SDG documents but also put forward a possible evaluation and application framework for PTLMs to support the matchmaking process. The beeswarm plot offers a simple yet powerful tool for qualitatively assessing a model used to compute document similarity on a benchmark set. A root-down tree shape of the “similarity swarm”, with positively labeled documents concentrated in its upper part, signals that the tested model is suitable for supporting the document screening task. The size of the minimum set containing all “Relevant” data and the within-class variances then offer quantitative insights into model performance.
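To make the matchmaking procedure concrete, the following sketch computes the cosine similarity between a government budget item and an NSA activity description from mean-pooled BERT embeddings. The pooling strategy, the checkpoint path, and the example texts are assumptions for illustration; the study’s exact embedding extraction may differ.

```python
# Hedged sketch: cosine similarity between two documents from BERT embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "path/to/indogovbert"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Mean-pooled last-hidden-state embeddings, one vector per document."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

gov = embed(["Budget item: clean water and sanitation infrastructure program"])
nsa = embed(["NGO activity: community sanitation and safe water access"])
similarity = torch.nn.functional.cosine_similarity(gov, nsa).item()
print(f"cosine similarity = {similarity:.3f}")  # a candidate match if above a threshold
```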
6.4. Study Conclusions, Limitations, and Future Work
In the presented work, a specialized language model, IndoGovBERT, intended to process Indonesian government SDG documents, has been developed. The upstream approach undertaken by the authors complements the existing downstream efforts of the Indonesian NLP community. This work also lays a foundation for the development of more accurate and impactful NLP applications for various government tasks, especially in the context of the SDGs.
Different versions of the IndoGovBERT model were built and examined through experiments using two different pre-training methodologies with internal and public governmental domain corpora, which were also compiled by the authors. The subsequent evaluation of the best-performing version of the IndoGovBERT model, in comparison with carefully selected baseline models on two real-world multi-label document classification tasks, confirmed its superior capabilities. The IndoGovBERT model has also proven to be a better choice than the baseline general-purpose and multilingual models in the practical document screening scenario tested, namely, matching SDG-related activities described in the government budget and NSA documents.
The PTLM development and application framework presented in this study appears quite universal, especially in the context of government SDG and other similar activities. It could be used to quickly build specialized language models for other organizations facing document processing challenges similar to those addressed in this study, subject, however, to the availability of adequate training corpora.
It is understood that the presented study has limitations. The first concerns the text extraction method used in data preprocessing. The deployed method is agnostic to document layout, which caused many symbol recognition errors in the generated texts. While most misspelled (i.e., misrecognized) words were corrected in the subsequent preprocessing steps, other errors and inaccuracies were left unchecked and “noised” the training data semantics, which would negatively affect the pre-trained model’s performance. Future studies should, therefore, focus on refining the text extraction process to make it more layout-aware and, ultimately, more accurate.
It is also important to note that the proposed IndoGovBERT model was pre-trained only on internal documents collected at BAPPENAS, a unit of the Indonesian government involved in development planning. The corresponding corpus is, therefore, semantically much closer to the domain of government budgeting (and to the downstream tasks considered) than the corpora of the other models tested in this study. Further experiments should be conducted to establish whether the IndoGovBERT SC-C2 version would overtake (or at least catch up with) the SC-C1 model in governmental subdomains other than budgeting; this is, however, beyond the scope of this particular study.
The simplistic approach to enlarging the training data by concatenating domain corpora has proven unsuccessful. This may be due to redundancy, obsolescence, writing style and domain mismatches, and other “noise” in the data with regard to the downstream tasks at hand. As merging the corpora resulted in higher training costs with no performance gains, continual pre-training [42] with incremental domain (e.g., [78]) or chronological (e.g., [79]) approaches might be a better alternative. This will be examined in future studies.
The results of the document matchmaking experiments are preliminary in character. To discuss the prospects of unsupervised or even semi-supervised document screening based on PTLMs, additional experiments are required with much larger and more diverse benchmark document sets. This, too, is outside the scope of this paper and is left for future work.
There remains, therefore, room to improve the IndoGovBERT model in future studies through experiments with other governmental data. Future work should also shed light on which model architecture would be the best choice in governmental settings, as this study focused exclusively on BERT-based models. All these efforts would be directed towards ensuring the continual evolution and adaptability of the model for diverse applications within the government domain.