1. Introduction
Named entity recognition (NER) is a crucial task in natural language processing (NLP). It involves identifying and categorizing named entities in text, such as people, organizations, locations, dates, and other specific terms [1]. NER is valuable in applications such as information retrieval, data analysis, and decision support systems. For example, in healthcare, NER can be used to extract medical terms, diseases, and drug names from clinical texts, facilitating clinical text analysis, electronic health record management, and medical research [2]. In information retrieval, it helps improve search results by accurately identifying and categorizing the entities mentioned in documents [3].
The availability of a standard Malay language corpus and machine learning algorithms can catalyze a new wave of Malay NLP research, particularly in ongoing research on NER, semantic analysis, information retrieval, sentiment analysis, and translations. These resources would enable researchers to develop more accurate and effective NER models specific to the Malay language, improving the overall quality of Malay NLP applications.
Currently, domain-specific NLP applications are largely confined to their original context and often do not extend to other languages with diverse morphological and syntactic structures [4]. Therefore, the development of a standard Malay language corpus and machine learning algorithms tailored to the language is essential. This would enable the expansion of NLP applications to a wider range of domains and promote cross-linguistic research and development. By investing in a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, researchers can unlock the full potential of NER and other NLP tasks in the Malay language [5]. This will not only contribute to the growth of the field but also facilitate the development of innovative applications that cater to the unique linguistic characteristics of Malay.
The growth of health-related information in the Malay language necessitates the development of NLP tools and resources tailored to the Malay-speaking community. However, existing NER tools primarily focus on basic entity types, such as person, organization, and location, and often do not support the Malay language. Moreover, tools and resources for syntactic and semantic analysis are far less abundant for Malay than for English [6]. This scarcity poses a significant challenge to performing NER accurately on Malay health documents, and it highlights the need for specialized NER models and resources tailored to the Malay language and the health domain.
To address this challenge, we consider parallel corpora, which consist of aligned texts in English and Malay, to be the most suitable solution. With parallel corpora, the existing tools and resources for English NER can be adapted to the Malay language, facilitating the identification of named entities in Malay health documents. This approach makes the most of the available resources and enables the development of robust NER models specifically tailored to the Malay language.
The primary objective of this research is to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain. By creating a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, this research aims to unlock the full potential of NER and other NLP tasks in the Malay language, ultimately improving information extraction, analysis, and understanding in the healthcare sector.
In this paper, we present our research on building an annotated corpus for Malay health documents, focusing on named entity recognition (NER). The remainder of the paper is organized as follows. Section 2 covers the background and related work, providing an overview of the current state of NER research in the Malay language and discussing the limitations of existing resources and approaches. Section 3 describes the methodology employed to create the Malay Health Document Annotated Corpus, including data collection, preprocessing, annotation of both English and Malay text, and the process of combining annotated documents to create the corpus. Section 4 presents the primary result of the study, namely the creation of the Malay Health Document Annotated Corpus. Section 5 discusses the challenges in building the corpus; it explains the importance of the corpus as a tool for training and testing NER models in the Malay language and describes the wide range of biomedical concepts that have been identified and labeled within it. The final section summarizes the main findings of the research and discusses potential future directions for this work.
2. Background and Related Work
The development of named entity recognition (NER) systems for the Malay language has been the focus of numerous researchers. In this section, we will explore the current state of Malay NER research, including the domains that have been predominantly studied and the methodologies and results of various studies. Additionally, we will discuss the existing resources available for the Malay language and highlight the need for further research and resource development in diverse domains beyond crime and news.
Existing datasets for the health domain are readily available in several languages, such as English [7,8,9,10], Chinese [11,12], and Indonesian [13,14]. Herwando et al. [13] discussed the identification of medical entities using the conditional random field (CRF) approach, which aims to extract health information, such as diseases, symptoms, treatments, and medicines, from online health forum discussions in Indonesian. However, although Indonesian and Malay have much in common, they are two different languages. For Malay, the available NER resources focus primarily on domains such as crime and news rather than health. NER systems designed for the public domain may not perform optimally on health documents because of domain-specific terminology and language variations. The specialized vocabulary and linguistic nuances of the health domain require tailored NER models and resources to accurately extract and classify named entities in Malay health documents.
Malay, being an agglutinative language, poses challenges for named entity recognition due to its unique morphological structures [15]. For example, the formation of words via affixation and compounding can result in variations in word forms and make it difficult to accurately identify named entities. Additionally, linguistic nuances such as honorifics add to the complexity of recognizing named entities in Malay. These challenges highlight the need for specialized NER models and resources that can effectively handle the intricacies of the Malay language in the health domain.
Efforts should be made to curate and annotate large-scale health corpora in Malay, encompassing a wide range of health-related topics and sources. This would enable the development and evaluation of specialized NER models tailored to the unique characteristics of the Malay language and the specific domain of health. By expanding the available resources for Malay NER in the health domain and promoting research on health literacy, we can enhance the accuracy and applicability of NER systems, ultimately improving information extraction, analysis, and understanding in the healthcare sector. This, in turn, can contribute to advancements in healthcare services, research, and the overall well-being of Malay speakers.
Numerous researchers have worked on NER systems specifically for the Malay language. However, Malay NER research is still predominantly limited to the crime and news domains. Below, we review several studies on Malay NER, examining their methodologies, outcomes, and the resources currently available for the language. By exploring these studies, we aim to gain insights into the progress, challenges, and opportunities in the field and to highlight the need for further research and resource development in domains beyond crime and news.
Several studies have been conducted on Malay NER, each employing different methodologies and achieving varying levels of success. For example, Saad et al. [16] created a crime news corpus and manually annotated entities, achieving a recall of 78.67%, a precision of 71.11%, and an F-measure of 74.7%. Nadia et al. [17] proposed a rule-based Malay NER system that achieved a recall of 92.13%, a precision of 90.23%, and an F-score of 91.05%. Salleh et al. [18] combined the fuzzy c-means and K-nearest neighbors algorithms, resulting in an overall success rate of 95.24% for entity recognition in Malay crime data. Ulanganathan et al. [19] developed the Mi-NER system using a probabilistic approach with a linear-chain CRF machine learning technique. Finally, Sazali et al. [20] extracted nouns from Malay classical documents with a 77.61% chance of identifying a noun, while Alfred et al. [21] employed a rule-based approach with manually created dictionaries and achieved a recall of 94.44%, a precision of 85%, and an F-score of 89.47%.
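The F-measure values quoted above combine precision and recall. As a minimal sketch, assuming the studies report the standard balanced F-measure (the harmonic mean of precision and recall), the relationship can be checked as follows. The function name `f_measure` is our own, and individual studies may average or round their scores differently.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean of P and R."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall (in percent) as quoted above for two of the studies.
saad = f_measure(precision=71.11, recall=78.67)
alfred = f_measure(precision=85.0, recall=94.44)
print(f"{saad:.2f} {alfred:.2f}")  # prints "74.70 89.47"
```

Both values agree with the F-measures reported for Saad et al. and Alfred et al.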
Despite these studies, the resources available for the Malay language remain limited. For example, the lack of an annotated corpus remains a challenge for developing a reliable Malay NER system. Additionally, complete Malay noun lists or dictionaries are scarce, and those that exist require manual review by language experts. These limitations highlight the need for further research and resource development in the field of Malay NER.
While the current landscape of Malay NER research has primarily focused on crime and news resources, there is a need for research in diverse domains. By expanding the scope of research, we can develop more comprehensive Malay NER systems that can be applied to a wide range of applications and industries. This will require the development of resources specific to these domains and the exploration of new methodologies and techniques.
Developing Malay NER systems specifically for the health domain is crucial for improving information extraction, analysis, and understanding in the healthcare sector. Accurately extracting and classifying named entities in Malay health documents can enhance the accuracy of NER systems and ultimately contribute to advancements in healthcare services and research. By developing comprehensive and domain-specific resources for Malay NER in the health domain, we can ensure that NER systems are tailored to the unique characteristics of the Malay language and the specific terminology and nuances present in the health domain.
The lack of annotated text corpora for Malay named entity recognition (NER) is a significant challenge in developing supervised learning algorithms. This lack of resources reflects the broader issue of limited Malay natural language processing (NLP) resources, including the absence of credible Malay NER systems [22]. A comprehensive NER corpus is crucial for training and evaluating NER models, enabling researchers and practitioners to develop more effective algorithms and gain a deeper understanding of named entities in the Malay language [23].
The scarcity of annotated text in Malay NER hinders progress in developing accurate and robust NER systems for the language. The lack of credible Malay NER systems further emphasizes the need for comprehensive NLP resources tailored to the Malay language. A high-quality NER corpus would facilitate the development of more effective algorithms and enable researchers and practitioners to gain deeper insights into the characteristics and patterns of named entities in Malay [24].
Future research should prioritize expanding the range of domains covered by Malay NER systems and developing more comprehensive and established resources for the Malay language. Addressing these challenges is crucial to making NLP technology accessible and beneficial to the Malay-speaking community. Advancements in Malay NER can significantly improve the language processing capabilities in Malay and empower researchers, practitioners, and users to leverage the potential of NLP technology.
Furthermore, the development of domain-specific Malay NER systems, such as in finance, law, or education, would broaden the applicability and impact of NLP in the Malay language. These domain-specific systems can cater to the specific needs and requirements of different sectors, enabling more targeted and accurate information extraction and analysis.
In addition to expanding the domains covered, future research should also prioritize the development and establishment of robust and comprehensive resources for Malay NER. This includes the creation of high-quality annotated corpora, lexicons, and rule-based systems that capture the unique linguistic characteristics and entities in the Malay language. By addressing these challenges and investing in the development of Malay NER systems and resources, we can unlock the full potential of NLP technology for the Malay-speaking community. This will not only improve language processing capabilities but also open doors for various applications, such as information retrieval, text summarization, and knowledge extraction, ultimately benefiting both researchers and users in the advancement and utilization of NLP in Malay.
3. Methodology
The main function of a research methodology is to ensure that the research is conducted systematically, consistently, and objectively. The creation of the Annotated Malay Health Document Corpus consists of several stages, as illustrated in Figure 1. These stages are data collection via web scraping, annotation of the English text, annotation of the Malay text, and finally the creation of the corpus. Each of these stages contributes to the overall process of developing a valuable resource for Malay health document analysis and research.
3.1. Data Collection
Health information is widely available from various sources, including online articles and social media. These sources differ in writing style, and their information varies in availability and reliability. The unstructured text used as our data source comes from web pages on health-themed websites, which we collected by web scraping. Our health text data were mainly sourced from the MyHealth portal, an online platform active in 2022 that is maintained by the Malaysian Ministry of Health.
This methodology allowed for the collection of substantial data from unorganized textual sources, thereby facilitating subsequent examination and annotation. The MyHealth portal plays a pivotal role in the healthcare system of Malaysia, with the objective of facilitating its transition toward a more comprehensive, interconnected, and digitally enabled service. It aims to offer healthcare information that is comprehensive, easily understandable, and of superior quality. By using data from the MyHealth portal, our study benefits from the wealth of health-related information available on this platform.
For this study, we selected about 100 articles and documents from the MyHealth portal, as shown in Table 1. These were analyzed and annotated to create a substantial corpus. After collection, the data were prepared for further analysis and annotation. Irrelevant information, such as advertisements, unrelated images, author biographies, reference or support group information, and final reviews, was carefully removed so that only content directly related to health topics was retained. Additionally, formatting inconsistencies in the original documents, such as variations in font size, style, and line spacing, were addressed to ensure uniformity across documents and make the data easier to analyze and process.
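The cleaning step described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the tag names in `SKIP_TAGS` are hypothetical markers of boilerplate (the actual markup of the portal is not described here), the sample page is invented, and the parser assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

# Hypothetical boilerplate containers; the real portal's markup is not given.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}
# Tags without a closing tag, ignored when tracking nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ArticleText(HTMLParser):
    """Collect visible text while skipping everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def clean_html(html: str) -> str:
    """Return the retained text with whitespace collapsed to single spaces."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


page = ("<html><body><nav>Menu</nav><p>Demam  denggi\n ialah penyakit.</p>"
        "<aside><p>Iklan</p></aside><footer>Hak cipta</footer></body></html>")
print(clean_html(page))  # prints "Demam denggi ialah penyakit."
```

Collapsing whitespace after extraction plays the role of the formatting normalization: differences in spacing and line breaks disappear, while the health-related sentence itself is kept unchanged.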
In this research project, we gathered a corpus consisting of approximately 3952 health-related sentences in the Malay language and roughly 3728 corresponding sentences in English. A corpus of this size provides a diverse range of data for analysis and annotation and allows more reliable conclusions to be drawn from our study. Examples of the sentences in both Malay and English can be viewed in Table 2, which illustrates the variety and complexity of the sentences included in our data collection effort.
English was selected as the reference language for our dataset because it is used extensively in health-related NLP research, which has produced a robust framework of tools and methods. Using English as the standard keeps our comparison and analysis consistent and precise, and it allows us to apply research and methodologies already established for English in our cross-linguistic investigation.
The Malay and English collections contain the same number of documents but differ in the number of sentences: the Malay corpus contains more sentences than the English corpus. The main reason for this disparity lies in the structural and linguistic differences between the two languages. Often, a single English sentence expands into multiple sentences in Malay to convey the same meaning. Because Malay has its own syntactic and semantic properties, it often requires more sentences to capture the information contained in a single English sentence. This phenomenon is illustrated in the first two rows of Table 2. It underscores the challenges involved in cross-lingual studies, particularly when developing natural language processing algorithms that must capture the subtleties of different languages, and it highlights the importance of methodologies that take into account the specific linguistic features and structures of the target language.
3.2. Annotation of English Text
Existing entity recognition tools, such as the Stanford CoreNLP tools [25], predominantly classify basic entity types like person, organization, and location. These established tools, while effective in their own right, lack comprehensive support for the Malay language. This poses a significant challenge for our project, since the primary objective is to develop a customized named entity recognition (NER) and relation extraction system tailored to Malay.
Considering this, we decided to create a tailored annotation schema that caters to the needs of the Malay language. This approach ensures that our annotated text corpus can serve as a training and evaluation resource for custom NER and relation extraction algorithms.
For the English texts within our corpus, we employed biomedical NER tools such as BioYODIE NER. This tool enabled us to efficiently identify named entities such as disease, symptoms, care, and others [25]. This identification process is critical, as it maps each text's entity landscape and provides valuable data for subsequent processing and analysis (Table 3).
To enhance the breadth of our entity identification, we additionally employed the Stanza i2b2 and NCBI-Disease tools [26]. These resources were instrumental in identifying other biomedical entities, including categories like problem, treatment, test, and disease. The inclusion of these tools in our entity recognition process ensures broader coverage, enabling us to capture a more diverse set of entities within the corpus (Table 4 and Table 5).
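When several tools annotate the same English text, their outputs have to be reconciled. The sketch below shows one possible policy, which is our assumption rather than the procedure used in this study: take the union of all spans and, where spans overlap, keep the longest one. The `Span` type, the `merge_spans` function, and the hand-written tool outputs are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str
    source: str  # which tool produced the span


def merge_spans(*tool_outputs: list[Span]) -> list[Span]:
    """Union the spans of several tools; on overlap keep the longest span."""
    candidates = sorted(
        (s for spans in tool_outputs for s in spans),
        key=lambda s: (-(s.end - s.start), s.start),
    )
    kept: list[Span] = []
    for span in candidates:
        # Accept a span only if it does not overlap one that was already kept.
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "Paracetamol relieves fever caused by dengue fever."
# Hand-written example outputs, not real tool results.
i2b2_like = [Span(0, 11, "treatment", "i2b2"),
             Span(21, 26, "problem", "i2b2"),
             Span(44, 49, "problem", "i2b2")]
ncbi_like = [Span(37, 49, "disease", "ncbi_disease")]

merged = merge_spans(i2b2_like, ncbi_like)
print([(text[s.start:s.end], s.label) for s in merged])
```

Here the shorter "fever" span inside "dengue fever" is dropped in favor of the longer disease mention. Other policies, such as a fixed tool priority, would be equally easy to plug in.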
By combining these tools, we were able to create a comprehensive annotated corpus that encompasses a wide range of entity categories. This enriched corpus serves as a valuable resource for training and evaluating our custom NER and relation extraction algorithms, bringing us one step closer to achieving our project objectives. By tailoring our approach to the linguistic context of Malay, we aim to drive significant advancements in the field of Malay language processing.
3.3. Annotation of Malay Text
In the process of creating annotated Malay texts and documents, we leverage reference annotations derived from English texts. More specifically, these are annotated English texts that have been processed using the BioYODIE tools [26], which provide entity annotations for diseases, symptoms, and care. In addition, we draw upon annotated English texts that have been processed using the Stanza and NCBI tools [26], which specialize in providing entity annotations for diseases.
To ensure the accurate identification and labeling of biomedical named entities within our corpus, we consult additional resources such as the Malay Wikipedia [27] and the dictionary from Dewan Bahasa [28]. These additional sources provide valuable insights into the specific linguistic and terminological nuances of the Malay language.
The primary aim of annotating Malay texts and documents is to identify named entities such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These annotations are an invaluable asset for training and evaluating natural language processing (NLP) models tailored to the Malay language. The annotation process yields a comprehensive corpus of annotated Malay texts, representative examples of which can be found in Table 6. This rigorous process is intended to ensure the accurate identification and classification of biomedical named entities in Malay, paving the way for the development of effective NLP models designed specifically for the language.
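For training sequence-labeling models, entity annotations of this kind are commonly encoded as token-level BIO tags. The following sketch is an assumption about how the penyakit, simptom, and rawatan labels could be encoded and is not a description of the annotation procedure used here. It applies a greedy longest-match lookup against a small, hypothetical term lexicon; `bio_tag` and the lexicon entries are illustrative only.

```python
def bio_tag(tokens: list[str], lexicon: dict[tuple[str, ...], str]) -> list[str]:
    """Greedy longest-match lookup of lexicon terms, emitted as BIO tags."""
    max_len = max((len(term) for term in lexicon), default=0)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lexicon.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = f"B-{label}"
                tags[i + 1:i + n] = [f"I-{label}"] * (n - 1)
                i += n
                break
        else:
            i += 1  # no term starts here; leave the token tagged "O"
    return tags


# Hypothetical lexicon keyed by lowercased token tuples.
lexicon = {
    ("kanser",): "penyakit",
    ("sakit", "dada"): "simptom",
    ("kemoterapi",): "rawatan",
}
tokens = "Pesakit kanser mengalami sakit dada selepas kemoterapi".split()
tags = bio_tag(tokens, lexicon)
print(list(zip(tokens, tags)))
```

In this example, "kanser" is tagged `B-penyakit`, "sakit dada" becomes `B-simptom` followed by `I-simptom`, and "kemoterapi" is tagged `B-rawatan`. A pure lexicon lookup cannot resolve context-dependent cases, which is why manual review remains necessary.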
4. Malay Health Document Annotated Corpus
The Malay Health Document Annotated Corpus, a detailed collection of annotated health documents, is a crucial asset for researchers and practitioners focusing on the Malay language. It facilitates the training and evaluation of named entity recognition (NER) models specifically crafted for Malay. Such models can extract pertinent information from Malay health documents, benefiting medical research, clinical text analysis, and electronic health record management.
Moreover, this corpus can support the advancement of various natural language processing (NLP) technologies in healthcare, such as natural language understanding, sentiment analysis, and text classification. It covers a wide array of health-related entities, including diseases, symptoms, and treatments, and thus represents the medical, pharmaceutical, and clinical research areas of the healthcare sector. Using this corpus not only enhances the effectiveness of NER models in identifying and retrieving valuable data from Malay-language health documents but also helps expand the scope and efficiency of NLP technologies within the healthcare field, increasing their applicability in scenarios like medical research and clinical text analysis (see Figure 2).
The primary result of this research is the creation of the Malay Health Document Annotated Corpus, which is derived from both English and Malay health documents. The corpus contains a diverse set of labeled health named entities, such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These entities can be seen in Table 7, which provides descriptions and examples for each entity type.
The development of the Malay Health Document Annotated Corpus contributes to the growing body of NLP resources for the Malay language. By providing a comprehensive annotated corpus, it enables researchers to develop and evaluate NER models that can accurately analyze Malay health documents, which can ultimately support better health outcomes for Malay speakers. Furthermore, the annotated corpus serves as a starting point for future research in Malay NLP, particularly in the health domain.
By enabling the development of more accurate NER models for Malay health documents, the Malay Health Document Annotated Corpus can contribute to the creation of innovative healthcare technologies. These technologies can automate the analysis and interpretation of health information, leading to faster diagnosis, more personalized treatment plans, and improved patient outcomes.
Unlike existing NLP resources for the Malay language that focus on general text or news articles, the Malay Health Document Annotated Corpus specifically targets the healthcare domain. This makes it a specialized resource that captures the unique vocabulary, terminology, and entities found in health documents. By focusing on this specific domain, the corpus provides researchers and practitioners with a more accurate and tailored resource for developing healthcare-related NLP technologies.
5. Discussion (Challenges)
Several challenges emerged in building the Malay Health Document Annotated Corpus: synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and polysemous terms.
5.1. Synonyms in Malay Annotations
The Malay language contains many synonyms, that is, terms that represent the same concept but have different lexical realizations. For example, “barah” and “kanser” both refer to the concept of “cancer” in Malay. Capturing these synonyms in the annotated corpus requires careful attention to ensure that their identical meaning is retained.
This task is not trivial, as it directly influences the efficacy of the subsequent training and evaluation of NLP models. Machine learning models rely on a clear, consistent representation of the data to learn effectively. If a model perceives “barah” and “kanser” as distinct entities, it may fail to generalize appropriately, leading to potential misclassifications in unseen data or new contexts.
This can have significant consequences in NLP tasks such as sentiment analysis, text classification, or information retrieval, where accurate representation and understanding of the data are crucial for reliable results. Misclassifications can lead to incorrect interpretations, biased predictions, or inaccurate information retrieval, undermining the effectiveness and trustworthiness of NLP models.
Additionally, the intricacy of handling synonyms extends beyond mere identification. The model must also consider the context and co-occurrence of these terms within the textual data. It is important to note that even though synonyms refer to the same concept, their usage might differ based on the context. For example, one term may be more prevalent in formal writing, while the other is commonly used in daily conversations or specific regions.
Moreover, it is essential to acknowledge the cultural and linguistic nuances associated with these synonyms. Some terms might carry different connotations or emotional valences despite referring to the same concept, which further emphasizes the need for nuanced understanding and handling of these terms during the annotation and model training process [29].
To address these challenges, advanced NLP techniques, such as word embeddings or contextual models, might be deployed. These techniques can capture the semantic similarity between different words and help the model understand that “barah” and “kanser” refer to the same concept. Furthermore, domain expertise and a careful annotation process play a crucial role in ensuring the consistency and accuracy of the data representation.
Overall, synonyms make both annotation and model training more difficult. However, these difficulties can be mitigated through careful annotation and advanced NLP techniques, which help produce NLP models that are robust and context-aware.
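A simple complement to embedding-based methods is to normalize synonyms to a shared concept identifier at annotation time. The sketch below is a lexicon-based illustration, not the embedding approach mentioned above. The concept ID `CANCER` and the `normalize` function are hypothetical; only the pair “barah” and “kanser” comes from the text.

```python
# Hypothetical concept table: each concept ID lists its known surface forms.
CONCEPTS = {
    "CANCER": {"barah", "kanser"},
}

# Invert the table so each surface form points to its concept ID.
SURFACE_TO_CONCEPT = {
    form: concept_id
    for concept_id, forms in CONCEPTS.items()
    for form in forms
}


def normalize(term: str) -> str:
    """Return the concept ID for a known synonym, else the term unchanged."""
    return SURFACE_TO_CONCEPT.get(term.strip().lower(), term)


print(normalize("Barah"), normalize("kanser"))  # prints "CANCER CANCER"
```

With such a mapping, a model or an evaluation script can treat both surface forms as the same entity while the original wording is preserved in the text.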
5.2. Ambiguous Entity Categorization
There are certain words or phrases that can serve as entities for multiple category types, presenting a complex issue in named entity recognition (NER). For instance, consider the phrase “sakit dada” (chest pain), which could be perceived as an entity within either the disease or symptom categories. This duality generates a demand for context-specific interpretation by the NER system. If “sakit dada” appears within a disease diagnostic context, the NER should classify it within the disease entity category. Alternatively, if the phrase is cited in the description of symptoms, the NER should allocate it to the symptom entity category.
In many scenarios, the NER system needs to analyze the broader context, taking into account related words in the sentence or document, to determine the most appropriate entity category. This draws mainly on the principles of word sense disambiguation, and in some cases co-reference resolution, to clarify semantic relationships and meanings.
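The idea of context-dependent categorization can be illustrated with a deliberately simple cue-word heuristic. This is a sketch under strong assumptions: the cue lists are illustrative guesses rather than a validated lexicon, the window size is arbitrary, and a practical system would more likely use a trained contextual model.

```python
# Illustrative cue words (assumptions, not a validated lexicon).
DISEASE_CUES = {"didiagnosis", "penyakit", "menghidap"}
SYMPTOM_CUES = {"simptom", "tanda", "mengalami"}


def categorize(tokens: list[str], start: int, end: int, window: int = 4) -> str:
    """Label the mention tokens[start:end] from cue words in nearby tokens."""
    nearby = tokens[max(0, start - window):start] + tokens[end:end + window]
    context = {t.lower() for t in nearby}
    disease_votes = len(context & DISEASE_CUES)
    symptom_votes = len(context & SYMPTOM_CUES)
    if disease_votes > symptom_votes:
        return "penyakit"
    if symptom_votes > disease_votes:
        return "simptom"
    return "ambiguous"  # defer to a human annotator or a stronger model


diagnostic = "Pesakit didiagnosis dengan sakit dada".split()
descriptive = "Simptom biasa termasuk sakit dada dan batuk".split()
print(categorize(diagnostic, 3, 5))   # prints "penyakit"
print(categorize(descriptive, 3, 5))  # prints "simptom"
```

The same mention, “sakit dada”, receives a different label in each sentence, and mentions without any cue are flagged as ambiguous instead of being forced into a category.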
This context-sensitive entity categorization presents significant challenges in developing an accurate and reliable NER system. The complexity is amplified when dealing with the medical domain, given the vast range of terminologies and their potential overlap between categories. Furthermore, the NER system must also factor in the linguistic and cultural nuances that can influence the interpretation of certain words or phrases. Consequently, handling such ambiguities requires sophisticated models with robust context-understanding capabilities, well-crafted feature sets, and effective training methods. These requirements underscore the need for high-quality, annotated training data like the Malay Health Document Annotated Corpus.
However, even with these resources, achieving a high level of accuracy in ambiguous entity categorization remains a demanding task. This is a significant area of research focus, with potential solutions exploring advanced techniques like deep learning and complex NLP models, as well as inter-disciplinary approaches integrating linguistics, medical knowledge, and computational methodologies.
5.3. Co-Reference in Translation
The next challenge lies in the use of co-reference during the translation process from English to Malay. Co-reference refers to the use of words or phrases that point to the same concept or entity within a sentence or text. For example, in the sentence shown in Table 8, the pronoun “it” could be used later in the text to refer back to an entity introduced in that sentence. The use of co-reference is crucial in the translation process as it aids in maintaining consistency and clarity.
Co-references can become significantly complex, especially within lengthy and nuanced texts. For instance, a document might initially mention “sakit perut” and subsequently use pronouns like “ia” in other parts of the text to refer to “sakit perut”. In such scenarios, the NER system must be adept enough to recognize that “ia” is indeed referring to the initial mention of “sakit perut”.
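The “sakit perut” and “ia” case can be illustrated with a deliberately simple baseline that links each pronoun to the nearest preceding annotated entity. This is only a sketch under stated assumptions: the pronoun list, the span format (token indices with an exclusive end), and the example sentence are hypothetical, and a nearest-antecedent rule will fail on many real documents.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: only the pronoun from the example above is handled.
PRONOUNS = {"ia"}


def resolve_pronouns(
    tokens: List[str], mentions: List[Tuple[int, int]]
) -> Dict[int, Optional[str]]:
    """Link each pronoun to the closest entity mention that ends before it.

    `mentions` holds (start, end) token spans, end exclusive, sorted by position.
    Returns a mapping from pronoun index to antecedent text (None if none precedes it).
    """
    links: Dict[int, Optional[str]] = {}
    for i, tok in enumerate(tokens):
        if tok.lower() not in PRONOUNS:
            continue
        antecedent: Optional[str] = None
        for start, end in mentions:
            if end <= i:
                # Later spans overwrite earlier ones, so the nearest one wins.
                antecedent = " ".join(tokens[start:end])
        links[i] = antecedent
    return links


# Hypothetical sentence pair, invented for illustration only.
tokens = "Pesakit mengalami sakit perut . Ia bertambah teruk selepas makan .".split()
mentions = [(2, 4)]  # the span covering "sakit perut"
links = resolve_pronouns(tokens, mentions)
```

Here `links` maps the position of “Ia” to the string “sakit perut”. A practical system would also need to check agreement, distance, and entity type before accepting such a link.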
Effectively leveraging co-reference in the translation process necessitates a deep understanding of the structure and semantics of both languages. The system must recognize and maintain co-reference throughout the translation process while ensuring that the final translation remains accurate and comprehensible [30]. This requires advanced techniques in natural language processing and machine learning, as well as a good understanding of both languages’ cultural and social contexts.
Furthermore, in many instances, the source and target languages might have different co-reference rules and conventions. For example, Malay might have different ways of referring to entities or concepts compared to English. Thus, the system must be capable of adapting the co-reference from the source language to the target language in a natural and accurate manner. This is often challenging and necessitates ongoing research and development.
Addressing these challenges requires careful consideration and the development of methodologies that account for synonyms, resolve entity categorization ambiguities, and accurately handle co-reference during translation. By addressing these challenges, the Malay Health Document Annotated Corpus can be further refined and serve as a valuable resource for training and evaluating NLP models in the Malay language.
5.4. Polysemous Terms
In some cases, the challenge lies in what are known as “multiple translations” or “polysemous terms”. This refers to situations where a single word or phrase can hold multiple meanings or translations in another language, particularly within specialized contexts like medicine or technology. For instance, “shortness of breath” and “breathlessness” are two English medical terms that both denote difficulty breathing, and both share the same translation in Malay, “sesak nafas”.
This can complicate the selection of the appropriate translation, especially when context sensitivity is a requirement. Context plays a vital role in determining the best translation for such polysemous terms, and this challenge increases when the context is intricate or subject to individual interpretation. This is a common issue in machine translation, and solutions usually employ deep learning models that can consider broader contextual information to better understand and determine an accurate translation.
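One way to surface such cases during annotation is to invert a bilingual glossary and list every Malay term that is reached from more than one English term. The sketch below assumes a small hand-built dictionary; the glossary name and helper functions are hypothetical, and the entries are limited to term pairs already mentioned in this paper.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative English-to-Malay glossary built from examples in the text.
EN_TO_MS: Dict[str, str] = {
    "shortness of breath": "sesak nafas",
    "breathlessness": "sesak nafas",
    "chest pain": "sakit dada",
}


def invert(glossary: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each Malay term to the sorted list of English terms that translate to it."""
    inverse: Dict[str, List[str]] = defaultdict(list)
    for en, ms in glossary.items():
        inverse[ms].append(en)
    return {ms: sorted(ens) for ms, ens in inverse.items()}


def shared_translations(glossary: Dict[str, str]) -> Dict[str, List[str]]:
    """Keep only Malay terms that have more than one English source term."""
    return {ms: ens for ms, ens in invert(glossary).items() if len(ens) > 1}


flagged = shared_translations(EN_TO_MS)
```

With the glossary above, `flagged` contains only “sesak nafas”, paired with both English source terms. Entries flagged this way are candidates for manual review, since a lookup table alone cannot decide which source term fits a given context.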
Additionally, such polysemous terms pose a significant challenge to NER systems, since the same word or phrase might be classified under different categories depending on its meaning. This calls for advanced models that can effectively discern the semantic boundaries of such terms within the given context.
Moreover, this issue further accentuates the importance of domain-specific knowledge. In the example of “shortness of breath” and “breathlessness”, having knowledge about medical terminologies can guide the translation process more accurately. It highlights the requirement for a multidisciplinary approach, incorporating subject matter expertise in conjunction with computational methodologies, to effectively handle multiple translations and polysemous terms [31].
Lastly, this challenge also calls attention to the value of extensive, high-quality, and well-annotated corpora, like the Malay Health Document Annotated Corpus. They serve as critical resources for training machine translation and NER systems, enabling them to better understand and handle the complexities of multiple translations and polysemous terms.
6. Conclusions and Future Work
This research has developed the Malay Health Document Annotated Corpus, a crucial resource for training and evaluating named entity recognition (NER) models for the Malay language. By meticulously identifying and labeling biomedical named entities, this corpus significantly enhances the suite of NLP resources available for Malay. It has the potential to improve health outcomes for Malay speakers by enabling the development and evaluation of NER models that can efficiently analyze Malay health documents.
The research conducted using the Malay Health Document Annotated Corpus has shown promising results in the development of an NER model for the Malay language. The model, trained using supervised machine learning techniques like the conditional random field algorithm, has demonstrated the ability to accurately identify and extract biomedical entities from Malay health documents. The evaluation of the model using standard measures such as precision, recall, and the F1-score has provided insights into its effectiveness. These findings highlight the potential of NLP technologies in the health sector for the Malay language.
Several challenges such as synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and the handling of polysemous terms have been identified as key areas for future research. Addressing these issues will not only enhance the quality of the corpus but also significantly contribute to the advancement of natural language processing technologies. Focusing on these areas promises to improve the accuracy and utility of NLP models, particularly in the context of the Malay language, thereby elevating the overall effectiveness of language processing applications.
Furthermore, future research could delve into the development of domain-specific NER models customized for other sectors such as finance, law, or education. This would substantially broaden the spectrum of NLP resources available for the Malay language. Researchers could also investigate the use of different machine learning algorithms, advanced deep learning techniques that can learn from large amounts of data, or methods that leverage knowledge from related tasks to enhance the performance of NER models. These avenues have the potential to augment the performance of NER models tailored to the Malay language, thereby expanding the reach and potential of NLP within the Malay-speaking world.