Article

Understanding the Research Challenges in Low-Resource Language and Linking Bilingual News Articles in Multilingual News Archive

1
Department of Computer and Software Technology, University of Swat, Mingora 19130, Pakistan
2
College of Computer Science & Engineering, University of Ha’il, Ha’il 81451, Saudi Arabia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8566; https://doi.org/10.3390/app13158566
Submission received: 9 June 2023 / Revised: 9 July 2023 / Accepted: 20 July 2023 / Published: 25 July 2023

Abstract

The developed world has focused more on Web preservation than the developing world, especially news preservation for future generations. However, the news published online is volatile because of constant changes in the technologies used to disseminate information and the formats used for publication. News preservation became more complicated and challenging when the archive began to contain articles from low-resource and morphologically complex languages like Urdu and Arabic, along with English news articles. The digital news story preservation (DNSP) framework is enriched with eighteen news sources in Urdu, Arabic, and English. This study presents challenges in low-resource languages (LRLs), research challenges, and details of how the framework is enhanced. In this paper, we introduce a multilingual news archive and discuss the digital news story extractor, which addresses major issues in implementing low-resource languages and facilitates normalized format migration. The extraction results are presented in detail for high-resource languages (HRLs), i.e., English, and low-resource languages, i.e., Urdu and Arabic. LRLs encountered a higher error rate during preservation than HRLs, corresponding to 10% and 3%, respectively. The extraction results show that a few news sources are not regularly updated and release few new news stories online. LRLs require more detailed study for accurate news content extraction and archiving for future access. LRLs and HRLs enrich the DNSP framework. The Digital News Stories Archive (DNSA) preserves a huge number of news articles from multiple news sources in LRLs and HRLs. This paper presents research challenges encountered during the preservation of Urdu and Arabic-language news articles to create a multilingual news archive.
The second part of the paper compares two bilingual linking mechanisms for Urdu-to-English-language news articles in the DNSA: the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW) with the cosine similarity measure (CSM) baseline technique. The experimental results show that the SMTW is more effective than the CRMDL and CSM for linking Urdu-to-English news articles. The precision improved from 46% and 50% to 60%, and the recall improved from 64% and 67% to 82% for CSM, CRMDL, and SMTW, respectively, with improved impact of common terms as well.

1. Introduction

The Internet serves as the primary and extensive hub of information, encompassing diverse sources that cover every aspect of human existence. It offers a wide range of data, including weather forecasts, travel deals, local and global events, and much more. This vast pool of information is accessible through the World Wide Web and various Web services [1]. In fact, the amount of information generated on the Web is growing exponentially and is predicted to surpass the collective cognitive capacity of humanity by 2025. To put it into perspective, Web information is currently measured on the scale of 10^18 bytes (exabytes) to 10^21 bytes (zettabytes) [2,3].
Despite its rapid growth, the World Wide Web (WWW) is inherently fragile, which poses a significant challenge. The fragility of information on the Web leads to the unfortunate disappearance and inaccessibility of valuable scholarly, cultural, and scientific resources for future generations. Consequently, it becomes imperative to prioritize the preservation of this diverse and valuable information, which exists in various forms.
For hundreds of years, newspapers have served as the primary source of information, covering a wide range of topics that encompass various aspects of human life. They provide valuable insights into local and global events, including parliamentary activities, politically significant occurrences, court proceedings, births, deaths, marriages, sports, science, technology, and more. Newspapers reflect societal life, capturing social dynamics, behaviors, and cultural values, thus serving as essential scholarly information for individuals and communities. Given the significance of preserving such information for future generations, efforts have been made to ensure its availability. For example, historical manuscripts now hold immense value, as do addresses made by prime ministers after election victories or announcements related to imminent foreign invasions. The UNESCO Declaration on Archives emphasizes the crucial role played by archives in societal development by safeguarding the contributions of individuals and communities [4]. The preservation and accessibility of published information are essential for safeguarding valuable resources. Various initiatives have been implemented to achieve this goal, leading to the creation of numerous newspaper archives. Curators and their organizations play a significant role in preserving newspapers and maintaining the digital collections of these publications. Typically, newspapers are digitized either internally, i.e., in-house, or with the assistance of external vendors. Additionally, some newspapers are preserved as born-digital content obtained directly from publishers or by harvesting from the Web [5].
A literature review of newspaper archives reveals that diverse approaches are employed for the preservation of newspapers, with the majority being digitized as a single digital record. Typically, curated digitized records are created by scanning microfilm, which is a compact photograph that can be stored and enlarged for reading, and then saving the scans in formats such as PDF, GIF, JPG, or other graphical formats. Newspaper archives can be categorized into two main types: old and newer archives. Old newspaper archives pose challenges for optical character recognition (OCR) technology in indexing them into a full-text corpus. As a result, these archives are primarily available in graphical format, requiring visual inspection to access the content. Conversely, the newer newspaper archives have been extensively indexed, allowing for efficient full-text searching mechanisms to retrieve specific information.
The digital news story preservation (DNSP) framework was introduced to establish a digital archive of news articles interconnected based on specific criteria with the purpose of future utilization [6]. Recently, this framework was enhanced to include a multilingual and multisource digital news stories archive aimed at preserving digital news articles for the long term and for the benefit of future generations. The framework now incorporates two low-resource languages, namely Urdu and Arabic. However, there are several challenges associated with including these languages in the Digital News Stories Archive (DNSA), mainly due to their low-resource nature. This study identifies various challenges related to different aspects of low-resource languages, which make it difficult to incorporate them into the digital news stories archive. These challenges encompass issues related to volume, variety, and velocity during archival information packaging; technical difficulties encountered during the creation of the archive; and challenges with the dissemination of archived content. One of the major obstacles is the scarcity of resources available for low-resource languages. For example, the tokenization of Urdu script is less reliable than that of high-resource languages because spaces are used inconsistently, and comprehensive dictionaries, which are fundamental linguistic resources, are not available for either Arabic or Urdu. As a result, extensive preprocessing becomes necessary during the preservation process to compensate for the lack of these resources.
Establishing linking mechanisms and metadata is crucial during the preservation process to ensure the efficient dissemination of archived news articles sourced from multiple languages and diverse sources [7]. The second part of this paper introduces bilingual news linking mechanisms, specifically the “Common Ratio Measure for Dual Language (CRMDL)”, which is based on the ratio of common terms, and the “Similarity Measure based on Transliteration Words (SMTW)”, which relies on words transliterated from the English language in Urdu news articles. These mechanisms are implemented to facilitate the accessibility of news articles that were extracted and archived from various sources during the preservation phase. This paper compares these linking algorithms and discusses the effectiveness of the results. By incorporating different linking mechanisms, the digital news story preservation (DNSP) framework is enriched and enhanced to ensure future accessibility of the archived content.
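The cosine similarity measure (CSM) used as the baseline can be illustrated with a minimal sketch. The whitespace tokenizer, the raw term-frequency weighting, and the function name `cosine_similarity` are assumptions made for illustration; the preprocessing and weighting actually used in the DNSA are not specified in this section.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between raw term-frequency vectors of two texts."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only shared terms contribute to the dot product; all others multiply by zero.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

For a bilingual pair, both articles would first have to be mapped into a shared vocabulary (for example through translation or transliteration) before such a measure is meaningful.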
Section 2 and its subsections differentiate low-resource languages from high-resource languages and introduce challenges in LRLs, with a brief overview of the Urdu and Arabic languages. Section 3 presents details about the digital news story preservation framework initiative, the importance of preservation, research challenges, DNSP framework enhancement, the multilingual archive and its structure, and major issues encountered in enhancing the extraction tool. In Section 4, extraction quantification is comprehensively discussed. Section 6.3.2 compares the results of bilingual linking mechanisms, and in Section 7, the findings are summarized.

2. Low-Resource Languages

Natural languages are classified into two broad categories, i.e., low-resource languages (LRLs) and high-resource languages (HRLs). For high-resource languages, e.g., English, many data resources exist that enable machines to learn and understand the language. English is well resourced compared to other spoken languages. Many western European languages are also well covered by resources, as are Chinese, Japanese, and Russian. In contrast, low-resource languages have very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages [8,9]. Many languages are difficult to preserve because they are mostly oral, and very few written resources exist in physical form, with none available in electronic format. There are different types of resources for natural language processing and the development of language-based systems:
  • Collection of text in various forms, such as research papers, books, email collections, social media content collections, etc.;
  • Lexical, syntactic, and semantic resources, such as a bag of words, dictionaries, semantic databases (e.g., WordNet), organized dependency tree corpora, etc.;
  • Task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, named entity recognition resources, etc.
Many language resources are costly to produce, which is why the economic inequalities between countries/languages are reflected in the amount (or absence) of language resources.

2.1. Challenges in Low-Resource Language Processing

Natural language processing (NLP) tools changed drastically in the 1990s, shifting from rule-based techniques to statistical approaches, and a new era of artificial intelligence began. Since then, the focus has mainly been on English as an international language, and only about 20 of the world's roughly 7000 languages have been considered [10].
Languages that need a lot of research are often referred to as low-resource languages and face many challenges, as briefly discussed below:
  • Alignment or the projection technique (three levels of alignment, i.e., word, sentence, and document) is a common technique for annotation. It is difficult to adopt the projection technique from HRLs to LRLs because of a lack of resources and different structures of target and source languages [8];
  • Creating a bag of words, dataset, and raw text collection for LRLs is difficult but necessary for any natural language processing (NLP) task or mapping technique [8];
  • The most important resource for any language is the lexicon of that language; many NLP tasks heavily depend on the textual material available, which is lacking in LRLs, making it a challenging task to produce an efficient lexicon;
  • The morphology of LRLs is constantly evolving, with vocabulary easily extended. Developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots [11];
  • The major applications of NLP, such as question–answer systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition-based systems, are very difficult to implement in low-resource languages;
  • Basic NLP tasks such as stop-word identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, stemming, etc., are also difficult in low-resource languages;
  • The NLP systems of LRLs are time-consuming and comparatively less efficient as a result of a lack of resources, increasing the difficulty of developing a machine learning system [10];
  • Many languages are mostly oral, for which very few written resources exist (physical and digital formats). For some, there are written documents but not even a basic resource like a dictionary;
  • Integrated and customized systems are always a huge challenge for multilingual systems.
Dealing with all the challenges faced by low-resource languages requires extensive research in different dimensions. Urdu and Arabic are two widely spoken languages that warrant considerably more research attention.
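To make the tokenization difficulty concrete, the following minimal sketch splits Arabic-script text by Unicode character category instead of ASCII punctuation rules and drops optional diacritics. The function name `tokenize_arabic_script` and the decision to discard all combining marks are assumptions for illustration, not the preprocessing used in the DNSA. The sketch also does not address the inconsistent use of spaces in Urdu, which needs language-specific resources.

```python
import unicodedata
from typing import List


def tokenize_arabic_script(text: str) -> List[str]:
    """Split text into tokens using Unicode categories instead of ASCII rules."""
    tokens, current = [], []
    for ch in unicodedata.normalize("NFC", text):
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # drop combining marks such as Arabic short-vowel diacritics
        if category.startswith(("L", "N")):
            current.append(ch)  # letters and digits extend the current token
        elif current:
            tokens.append("".join(current))  # any other character ends the token
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

Because the rule relies on categories rather than a fixed punctuation list, the Urdu full stop (U+06D4) and the Arabic comma (U+060C) are treated as separators without being named explicitly.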

2.2. Urdu Language

Urdu, a prominent South Asian language, boasts approximately 70 million native speakers and over 164 million speakers worldwide [12]. It serves as the official literary language of Pakistan and is also spoken and understood in other countries such as India and Bangladesh. Urdu shares close linguistic ties with Hindi. The preservation of Urdu periodicals holds immense value for researchers and future generations, as they encompass a wide array of significant topics concerning South Asia throughout the nineteenth and twentieth centuries.

2.3. Arabic Language

Arabic is the third most widely spoken language globally, trailing behind English and French. Approximately 292 million individuals use Arabic as their primary and official language across twenty-seven countries, with a significant number of people also capable of understanding it as a second language. Alongside English, French, Spanish, Russian, and Chinese, Arabic holds the distinction of being one of the official languages of the United Nations. Notably, Arabic is gaining popularity as a language to learn in the Western world, and numerous other languages have borrowed words from Arabic due to its historical significance. The intricacies of Arabic grammar can pose a challenge, both for native speakers of Indo-European languages and for machines attempting to accurately interpret and comprehend the Arabic language [13,14].

3. Digital News Story Preservation Initiative

The “Digital News Story Preservation (DNSP) Framework” was initiated in 2015 [6]. The term “digital preservation” is broadly understood as the “series of managed activities necessary to ensure continued access to digital materials for as long as necessary”. An initiative was undertaken to establish the “Digital News Stories Archive (DNSA)”, which comprises interconnected news articles in both the English and Urdu languages. The archive incorporates two types of linkages: English-to-English linkage and Urdu-to-English news article linkage. These linkages are established through preprocessing and by applying content-based techniques.

3.1. Importance of Digital News Story Preservation

Preserving access to historical records is crucial for writers conducting research; however, these preserved documents may be at risk of disappearing. While digital news content is widely available today, it is more vulnerable than print and can be scattered across different media and storage systems [15].
The importance of digital news story preservation is summarized as follows:
  • Preservation and data backup are distinct concepts and should not be conflated. Digital media are susceptible to various failures, including file corruption, viruses, malware, damage, overwritten backups, server issues, and even natural disasters like earthquakes and tornadoes. These risks highlight the need for proper digital preservation, which encompasses a set of processes and activities specifically designed to ensure the long-term, sustained storage, access to, and interpretation of digital information. Preservation goes beyond mere data backup and focuses on maintaining the integrity and accessibility of digital content over time;
  • According to a survey conducted by Educopia in 2012, out of 60 newspaper companies, fewer than half keep their data and content for more than five years [16]. Their news content is also dispersed and distributed over multiple servers;
  • The preservation of news is vital for the advancement of society, as it enables citizens to stay well-informed and up-to-date on a wide range of events and news through journalism. By preserving news content, individuals are empowered with knowledge, ensuring they are equipped to make informed decisions and actively participate in their communities;
  • Preserving news is beneficial for writers and researchers, as it empowers them to craft relevant and contextualized stories. News preservation holds immense value as a historical record for society on a large scale, offering significant benefits;
  • Researchers require information such as birth, death, marital status, business data, announcements, legal documents, property transaction papers, etc., with respect to the genealogical status of a person or even a community as a whole, which can be obtained from news or published document archives;
  • Developed nations are actively engaged in archiving significant documents and newspapers and preserving news content to ensure future accessibility. When considering our social heritage, digital preservation becomes more important, especially during times when society relies on journalists to thoroughly investigate stories and produce impactful news;
  • Urdu news preservation: To uphold the rich heritage of the Urdu language, it is crucial for Urdu speakers to have a sincere commitment to preserving its essence. Urdu periodicals contain a diverse array of literary works encompassing significant topics from South Asia during the nineteenth and twentieth centuries, making their preservation highly valuable for researchers interested in the language. The preservation of Urdu news stories should be particularly significant for the people of Pakistan, as it is a country where Urdu is widely spoken and holds a unique position in the world [17];
  • Arabic news preservation: Arabic is gaining popularity as a language of interest in the Western world, attracting an increasing number of learners. Throughout history, other languages have borrowed words from Arabic due to its significant contributions. However, the grammar of Arabic poses a challenge for native speakers of Indo-European languages and even for machines attempting to accurately interpret and comprehend the language [4]. Arabic encompasses multiple versions of its script, including standard Arabic, classical Arabic, literary Arabic, and modernized Arabic. During the early Middle Ages, Arabic played a central role as a primary source for science, mathematics, culture, and philosophy. Preserving Arabic scripts is crucial, as they capture various grammatical changes that have occurred, reflecting the nuances found in colloquial variants.

3.2. Research Challenges

The news encompasses a wide range of events that are directly or indirectly connected to our social lives. These events include parliamentary actions, significant political occurrences for countries, court proceedings, government announcements, deaths, births, marriages, sports, etc. In the coming years, the responsibility of preserving these comprehensive journalistic records primarily lies with news outlets and newspaper organizations, ensuring their availability for future generations. Online news publications are generated and updated instantly, following a non-linear format, which means that they can disappear and become inaccessible. Based on existing data, it has been observed that approximately 80% of web pages become inaccessible within a year, and around 13% of links, particularly web references in scholarly articles, cease to function after approximately 27 months [18,19]. Consequently, the need to preserve online digital news for an extended duration has become imperative to ensure its safeguarding for future generations.
Even if a newspaper is backed up or archived by national archives and libraries, accessing specific information from multiple sources about a particular event may be challenging in the future. This challenge becomes even more complex when attempting to follow a story through an archive that comprises a vast collection generated from numerous news sources, each requiring different technologies to access the archived contents.
News archives are of two types, i.e., graphical-format archives and partially indexed archives. Both make it difficult to access particular news about an event because of challenges such as:
  • Vast archive collections: an archive created from many sources;
  • Various sources on different platforms;
  • Multilingual archive: an archive created from multiple languages, i.e., Arabic, Urdu, and English;
  • Low-resource language: Access becomes more complicated when searching for news articles in low-resource languages, such as Urdu.
There are many difficulties in digital news preservation, such as:
  • Extraction of news from diverse sources and different technological platforms;
  • Extraction of explicit and implicit metadata;
  • Computing similarity values between news articles;
  • Conversion of news articles to a specific standard format for future integration and access, etc.
There are many challenges in accessing preserved digital news stories in archives, such as:
  • Locating and discovering a digital resource among a huge collection, such as a catalog or archive [20];
  • The effectiveness of search mechanisms depends directly on how these objects are organized. Digital library management helps by providing support for identifying, describing, and locating resources;
  • Interoperability is the ability of different systems to exchange and use information together without losing content and functionality, representing a huge challenge in archive management [21];
  • Providing mechanisms for digital objects to hold the data that prove their reliability, integrity, authenticity, and provenance [22];
  • Storing information about the physical characteristics and documenting behavior so that it can be emulated in future technologies [21]. For example, “the original XML instance of imported data is maintained to preserve all mappings and to be able to roundtrip the original” [23];
  • During the object development phase, multiple versions of the same object may be created for preservation and dissemination. Thus, the same object may be present in multiple versions; metadata tracks all the information regarding different versions and changes in the object over time;
  • When seeking to utilize data collected for a different project in their own work, individuals aim to locate and utilize data while placing a greater emphasis on trust and comprehension. Reusing data typically necessitates meticulous preservation and documentation of both the data content and the accompanying metadata.
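The integrity requirement in the list above is commonly supported by recording a checksum when an object is archived and re-verifying it later. The following minimal sketch uses SHA-256; the function names and the use of in-memory bytes are assumptions for illustration, and this section does not state which fixity mechanism, if any, the DNSA applies.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest to record when an object is archived."""
    return hashlib.sha256(content).hexdigest()


def is_intact(content: bytes, recorded_digest: str) -> bool:
    """Re-compute the digest later and compare it with the recorded value."""
    return fixity_digest(content) == recorded_digest
```

Storing the digest in the object's metadata lets a later audit detect silent corruption, although a checksum alone does not establish authenticity or provenance.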

3.3. DNSP Framework Enhancement

The primary purpose of the DNSP framework is to create a multilingual, multisource digital news stories archive to preserve digital news articles for the long term and future generations. The framework is enriched with two low-resource languages, i.e., Urdu and Arabic. The challenges presented in previous sections regarding low-resource languages make it hard to include these sources directly. The absence of efficient tokenizers, dictionaries, and other basic resources necessitates heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced version of the DNSP framework are presented in Figure 1.
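Before language-specific preprocessing can run, an archive that mixes English, Urdu, and Arabic has to decide which language an article is in. The sketch below is a rough heuristic under stated assumptions: it counts Arabic-block versus Latin characters and treats a handful of letters used in Urdu but not in standard Arabic as Urdu markers. The function name `guess_language` and the marker set are illustrative choices, not the libraries the DNSE deploys.

```python
# Assumption: these letters (tteh, ddal, rreh, noon ghunna, heh goal, yeh barree)
# occur in Urdu text but not in standard Arabic orthography.
URDU_MARKERS = set("\u0679\u0688\u0691\u06BA\u06C1\u06D2")


def guess_language(text: str) -> str:
    """Return 'en', 'ur', 'ar', or 'unknown' using character-level evidence only."""
    arabic_script = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic_script == 0 and latin == 0:
        return "unknown"
    if latin >= arabic_script:
        return "en"
    if any(ch in URDU_MARKERS for ch in text):
        return "ur"
    return "ar"
```

A heuristic like this cannot separate Urdu from other languages written in the same script, so a production system would more likely rely on source-level metadata or a trained language identifier.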

3.4. Multilingual News Archive

The following section provides a brief introduction to the Digital News Stories Archive (DNSA). The primary concept behind the digital news story preservation (DNSP) framework was introduced at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) [6]. The following contributions were made to the framework:
  • After analyzing 120 news archives worldwide, a comprehensive and generic systematic approach was proposed as a model for Web preservation. This approach entails a step-by-step procedure to be followed in web preservation projects [24,25];
  • A multisource web archive known as the Digital News Stories Archives (DNSA) was designed and developed to preserve online news articles originating from multiple sources [1];
  • A digital news story extractor (DNSE) tool was specifically designed to extract news articles from diverse sources and compile them to form the DNSA. Its primary function is to collect and gather news articles from various online sources, ensuring their preservation within the DNSA [26];
  • In the DNSA, we use content-based methods to link news articles during the preservation process. These methods rely on text features, such as the ratio of common terms based on their frequency [27], named entities [28], the position of terms, terms in the headline, the credibility of information, the distance between similar terms, etc. [1];
  • Similarity measures were studied in the most relevant field of news recommender systems. A comprehensive study was performed on recommendation systems that can enhance the DNSP framework in different dimensions and improve its utility (a few of them are discussed in future work) [29];
  • The CRMS technique was modified for news headings to reduce extra computation for the terms appearing in the news body for linking of English news articles during preservation [27];
  • The CRMS technique was updated for linking of Urdu-language news articles with English-language news articles in the DNSA [30];
  • A heading-based technique was introduced for linking of news articles in the DNSA during the preservation process in the DNSP framework [31].
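The idea of linking articles by a ratio of common terms, mentioned in the list above, can be sketched as follows. The formula here (shared term occurrences divided by the length of the shorter article) is an assumption chosen for illustration; it is not the published CRMS definition, which is given in the cited work rather than in this section.

```python
from collections import Counter


def common_term_ratio(terms_a, terms_b):
    """Share of the shorter article's term occurrences that also occur in the other."""
    tf_a, tf_b = Counter(terms_a), Counter(terms_b)
    if not tf_a or not tf_b:
        return 0.0
    # Counter intersection keeps each shared term with the smaller of its two counts.
    shared = sum((tf_a & tf_b).values())
    return shared / min(sum(tf_a.values()), sum(tf_b.values()))
```

Two articles would then be linked when the ratio exceeds a threshold, whose value would have to be tuned on labeled article pairs.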
The Digital News Stories Archive (DNSA) is a news archive created locally from multiple online sources that provide news in three languages, i.e., English, Urdu, and Arabic. Currently, the DNSA is archiving news articles from seven online newspapers and three local news television networks [26] in English, five Urdu news sources, and three Arabic online news sources, for a total of eighteen sources. The archive (created locally) preserves more than one thousand news stories in each extraction after removing duplicate URLs and duplicate news stories.
The high-level system architecture is illustrated in Figure 2. The figure depicts the process flow, starting from the ingest phase, where two mediators are employed to extract and incorporate metadata into the news story archive. Once the metadata are added, the news stories are archived and safeguarded for future use by generating an archival information package (AIP), as depicted in Figure 3. Subsequently, the preserved contents can be accessed using the dissemination information package (DIP).
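As a rough illustration of packaging a news story with its metadata in a normalized XML form, consider the sketch below. All element and attribute names (`newsStory`, `metadata`, `source`, `url`, `published`, `headline`, `body`, `lang`) are hypothetical; the actual DNSA schema is not given in this section.

```python
import xml.etree.ElementTree as ET


def story_to_xml(headline, body, source, language, url, published):
    """Serialize one news story and its metadata into a single XML string."""
    story = ET.Element("newsStory", attrib={"lang": language})
    metadata = ET.SubElement(story, "metadata")
    for name, value in (("source", source), ("url", url), ("published", published)):
        ET.SubElement(metadata, name).text = value
    ET.SubElement(story, "headline").text = headline
    ET.SubElement(story, "body").text = body
    # encoding="unicode" returns a str; special characters such as & are escaped.
    return ET.tostring(story, encoding="unicode")
```

Keeping the language code on the root element lets later stages choose language-specific processing without inspecting the text itself.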
To help readers in accessing relevant news articles and enhance their understanding of various topics, the DNSA (Digital News Story Archive) requires an effective mechanism for linking digital stories and recommending them to readers. This mechanism, as discussed in previous studies [32,33], aims to provide readers with a broader perspective and diverse viewpoints by comparing similar news articles from multiple sources. By establishing linkages between English-to-English news articles and English-to-Urdu news articles, the DNSA enables readers to browse through its extensive collection effortlessly. This linkage mechanism not only assists in authenticating information but also helps readers explore a wide range of news articles related to a particular topic.
Without an efficient search functionality, a news archive would essentially amount to a mere collection of news articles, lacking the ability to serve as a truly valuable information repository. To transform it into an effective repository, it is essential to implement a robust search functionality, which necessitates the use of indexing approaches and the establishment of a clearly defined set of metadata elements.
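One common indexing approach is an inverted index, which maps each term to the articles that contain it. The sketch below is a minimal in-memory version assuming lowercase whitespace tokenization; the function names are illustrative, and the indexing approach used for the DNSA is not specified here.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lowercase term to the set of article ids that contain it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for term in text.lower().split():
            index[term].add(article_id)
    return index


def search(index, query):
    """Return ids of articles containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For Urdu and Arabic articles, the quality of such an index depends directly on the tokenizer, which is where the resource scarcity discussed earlier becomes a practical limitation.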

3.5. Enhancing “Digital News Story Extractor (DNSE)”

The digital news story extractor (DNSE) is a Java-based tool for extracting digital news stories from online news websites using the jsoup and Apache POI libraries. Initially, the DNSE was developed for English news sources [26]; it was then enhanced for Urdu news articles and has since been further enhanced to include Arabic news sources and some features for quantification. The DNSE extracts news stories from online sources, extracts their metadata, and normalizes news content and related metadata into XML format for preservation in the DNSA. However, these enhancements have encountered the following problems:
  • Non-uniform Web structure: There are many platforms and technologies for developing Web-based applications. Front-end technologies include HTML, CSS, Java, and JavaScript and its frameworks, and back-end logic creation technologies include PHP, ASP.NET, and XML, among many others. Due to the use of different technologies, the Web structure varies; hence, extracting the desired information is challenging.
  • Recency or maintenance of fresh content: The Web contents of dynamic Web applications, such as blogs and news websites, update instantly and frequently. Keeping the archived news content current is very important and must be done efficiently, considering access frequency and network traffic issues.
  • Rise of anti-scraping tools: The biggest challenge in extracting news content is the rise of anti-scraping tools, e.g., CAPTCHA, which differentiates between bots and humans. The extractor stalls when a website deploys such tools.
  • Unknown host issue: An unreliable Internet connection leads to an unknown host issue; the extraction of news can be restarted, but the interruption is time-consuming.
  • Socket timeout: Most websites temporarily block or suspend their services when their contents are accessed too frequently within a specific period during preservation. The website treats the extractor as a bot that sends unnecessary requests and overloads the server, and it starts blocking access.
  • Garbage collection: The inconsistency in development approaches leads to erroneous extraction by collecting unwanted data, such as in-text links, tags, or other code, during news extraction.
  • Identifying and preprocessing of low-resource languages: The DNSE tool deploys different libraries for the identification and preprocessing of low-resource languages, and the preprocessing is computationally expensive.
  • Firewall blocking: A few online news sources are protected from extraction by a firewall.
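The garbage collection problem in the list above can be illustrated with a minimal sketch. The DNSE itself is Java-based; the Python below is only an illustration that uses the standard-library `html.parser` module. The class name `StoryTextExtractor`, the function `clean_story_html`, and the set of skipped tags are assumptions made for this example, not part of the DNSE.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collects visible text and ignores the content of non-story tags."""

    # Assumed set of tags whose content counts as "garbage" for a news story.
    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def clean_story_html(html):
    """Return the story text with tags, scripts, and styles removed."""
    parser = StoryTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

In-text links are kept as plain text (the anchor text survives, the markup does not), which matches the goal of removing tags and code while preserving the story content.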
The developed extractor partially manages problems such as firewall blocking, the rise of anti-scraping applications, preprocessing, garbage collection, and non-uniform Web structure. The extractor handles the maintenance of fresh news articles, unknown host issues, and socket timeout issues using different techniques and Web APIs. Extraction is important for any digital archive and is challenging when preserving low-resource contents. The enhanced DNSE is able to deal with the above challenges.
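One common way to cope with unknown host and socket timeout errors is to retry a request with an increasing delay. The sketch below is a generic illustration of that idea and is not taken from the DNSE; the function name `fetch_with_retry`, its parameters, and the default delays are assumptions. The `fetch` and `sleep` arguments are injectable so the logic can be exercised without network access.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Fetch a URL, retrying with exponential backoff on network errors."""
    if fetch is None:
        def fetch(target):
            # Default fetcher: plain HTTP GET with a 10-second timeout.
            with urllib.request.urlopen(target, timeout=10) as response:
                return response.read()

    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except (socket.timeout, socket.gaierror, urllib.error.URLError) as exc:
            # socket.gaierror covers unknown-host failures,
            # socket.timeout covers socket timeouts.
            last_error = exc
            if attempt < max_attempts - 1:
                # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Spacing the retries further apart each time also lowers the request rate seen by the news website, which is relevant to the temporary blocking described under the socket timeout item.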

4. News Extraction Results

The DNSA is enriched with five sources that provide Urdu news articles and three online sources that publish news in the Arabic language. The details of the included news articles from all three languages are summarized in Table 1.
The progress of research is hindered by the slow development of the DNSP framework, along with limited resources and insufficient financial support. Initially, three local English newspapers, namely Dawn News, The Tribune, and The News were chosen as the test subjects for the DNSE tool. A total of 86,545 URLs (with an average of 2791 URLs) were extracted, including duplicate URLs for news stories. Among these extracted URLs, there were 23,843 unique URLs representing individual news stories (with an average of 769 unique URLs). The extraction results from the newspaper websites of Dawn News, The Tribune, and The News accounted for 6457 news stories (with an average of 208), 4914 news stories (with an average of 158), and 5713 news stories (with an average of 184), respectively [26].
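The distinction between extracted URLs (including duplicates) and unique URLs can be sketched as follows. This is an illustration only: the function name `unique_story_urls` and the normalization rules (dropping the fragment and a trailing slash) are assumptions, and the DNSE may identify duplicates differently.

```python
from urllib.parse import urldefrag


def unique_story_urls(urls):
    """Return the URLs in first-seen order with duplicates removed."""
    seen = set()
    unique = []
    for url in urls:
        # Assumed normalization: ignore "#fragment" and a trailing slash.
        key = urldefrag(url.strip()).url.rstrip("/")
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```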
The new extraction/crawling results after the DNSE was enriched with two low-resource languages, i.e., Urdu and Arabic, are keenly analyzed for shortcomings of the DNSP framework and DNSE tool. Table 2 presents the detailed extraction of the DNSE for both the HRL and LRLs.
The extraction results are visualized in Figure 4 for all ten sources of the high-resource language, i.e., English. The results show that a few news sources do not update their news frequently and could be replaced by other sources to make more efficient use of the DNSP framework.
Evaluating the frequency of new story extraction is crucial due to the continuous and non-periodic nature of the news stream, unlike printed media.
The extraction process was carried out daily or after waiting for a few days. Figure 5 presents the average count of extracted URLs and unique URLs. The figure illustrates that the number of newly extracted news URLs closely aligns with the count of new news stories obtained from online newspapers and various news channels.
The processing of low-resource languages is expensive in terms of time, complexity, and accuracy. The main problems with implementing the DNSE, including LRLs, are non-uniform web structure, unknown host issues, and garbage collection. Figure 6 and Figure 7 present the average extraction of new news articles and unique URLs, respectively.
Table 3 presents the error rate of URL and story extraction during preservation for both high- and low-resource languages. The LRLs are associated with a larger error rate because of non-uniform web structure, unknown host issues, maintenance of fresh content, anti-scraping tools, and garbage collection. It is observed that low-resource-language news sources are not as well maintained as high-resource-language news sources.

5. Content-Based Bilingual Linking Techniques

The number of online news articles necessitates the use of recommender system techniques to establish linkages between digital news stories during preservation. These techniques can be broadly categorized into collaborative filtering techniques and content-based techniques. Collaborative filtering techniques encounter several challenges, as they heavily rely on user opinions, demographics, and feedback to establish similarity [1,34]. The dynamic nature of both users and news articles complicates the accurate modeling of user preferences based on their past readings [35,36,37]. User preferences and interests can change over time due to current events and the popularity of news articles [38]. Furthermore, users tend to be hesitant in actively clicking or recommending news articles while browsing specific topics [39]. On the other hand, content-based approaches recommend new items to users by calculating the similarity between the features of previously selected items. However, content-based approaches encounter difficulties in determining the similarity between news articles that cover diverse topics and in accounting for potentially hidden factors that influence user choices [40]. While most studies have focused on runtime similarity between recent articles, the approach adopted by the DNSA involves linking stories during preservation to ensure long-term accessibility and to benefit future generations.
The common ratio measure for stories (CRMS) is a content-based similarity measure that has proven to be effective in linking digital news articles during preservation. The Digital News Stories Archive (DNSA) ensures the future accessibility of related news articles by preserving and formatting linked news articles sourced from a vast corpus of news articles extracted from multiple sources. To determine the similarity among related news articles in the archive, a threshold value for CT/TT (the ratio of common terms to total terms, as defined in the study) was established, representing the best indicator of similarity [27]. The CRMS is evaluated extensively in comparison to the baseline similarity measure, namely the cosine similarity measure (CSM). Additionally, the CRMS is analyzed against human judgment through user-based evaluation, providing empirical insights. The CRMS emerges as the most suitable measure for linking news articles, particularly facilitating the linkage of dual-language news articles gathered from diverse sources during preservation and the creation of archives. The results of this evaluation and analysis are derived from a dataset of 5.3K news articles extracted from ten different news sources over ten days.
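To make the two kinds of measure concrete, the following sketch computes a common-terms-to-total-terms ratio next to the cosine similarity baseline. The exact CRMS formula and its threshold are defined in [27] and are not reproduced here; the sketch assumes that CT is the number of distinct shared terms and TT the number of distinct terms in both articles combined (a Jaccard-style ratio). The whitespace tokenizer and all function names are likewise assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def common_term_ratio(doc_a, doc_b):
    """CT/TT under the stated assumption: shared terms over all distinct terms."""
    terms_a, terms_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    total_terms = terms_a | terms_b
    if not total_terms:
        return 0.0
    return len(terms_a & terms_b) / len(total_terms)


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of raw term-frequency vectors (the CSM baseline)."""
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For the two headlines "pakistan wins cricket match" and "pakistan loses cricket match", three of five distinct terms are shared, so the ratio is 0.6, while the cosine similarity of the term-frequency vectors is 0.75.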
Similarly, another content-based similarity measure is introduced based on the use of transliteration words known as the similarity measure based on transliteration words (SMTW). It is observed that the use of English transliteration words is common in Urdu manuscripts. This practice is anticipated to play a vital role in linking Urdu news articles with English news articles. These linking mechanisms link formatted news articles to ensure the future accessibility of related news articles from an enormous corpus of news articles extracted from multiple sources in the DNSA [41]. A comprehensive and empirical evaluation was conducted to assess the effectiveness of the SMTW. The results demonstrate that the SMTW is the most viable metric for linking news articles. It strongly promotes the linking of dual-language news articles obtained from various sources during the process of preservation and archive creation.
“Transliteration is a process of using the text of one script in another script or the process of converting text from one language to another”. In the field of linguistics, the act of incorporating a word or a group of words from one language into another language’s writing scripts is known as borrowing. These borrowed words are commonly referred to as loan words [42]. When it comes to transliteration, the process of representing a phrase or word from one language to another that uses a different writing system, challenges may arise, especially if the languages in question have distinct sounds and writing scripts [43,44].
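The role of transliterated words in bilingual linking can be sketched with a small lookup. This is not the published SMTW algorithm; it only illustrates how English words written in Urdu script can be mapped back to English and then counted as common terms. The three-entry lexicon, the whitespace tokenization, and the function names are assumptions made for the example.

```python
# Hypothetical mini-lexicon: English words written in Urdu script.
TRANSLIT_LEXICON = {
    "کرکٹ": "cricket",
    "ٹیم": "team",
    "میچ": "match",
}


def transliterated_terms(urdu_text, lexicon=TRANSLIT_LEXICON):
    """Return the English forms of transliterated words found in Urdu text."""
    return {lexicon[token] for token in urdu_text.split() if token in lexicon}


def shared_transliterated_terms(urdu_text, english_text, lexicon=TRANSLIT_LEXICON):
    """Return transliterated words that also occur in the English article."""
    english_terms = set(english_text.lower().split())
    return transliterated_terms(urdu_text, lexicon) & english_terms
```

A real lexicon would need to cover spelling variants, since the same English word can be written in more than one way in Urdu script.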

6. Complexity Analysis

Complexity analysis of an algorithm is a technique used to analyze or predict the resources that are required to solve a problem of a given size [45]. The analysis of an algorithm covers:
  • Time complexity;
  • Space complexity;
  • Accuracy.

6.1. Time Complexity

Time complexity (or time efficiency) is the measure of the amount of time taken by an algorithm to execute or run as a function of the input size. It is a way to describe how efficient an algorithm is in terms of the amount of time it takes to solve a problem. There are two known approaches:
  • The empirical approach is used to measure time complexity experimentally, which has several limitations; for example, it depends on hardware resources, the software environment, and the implementation design;
  • The analytical approach overcomes the limitations of the empirical approach and is independent of the computing hardware, the programming language, and the implementation details of the algorithm. The execution time is estimated by counting the primitive operations of each statement as a function of the input size.
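The difference between the two approaches can be sketched on a simple nested loop. The loop is a stand-in chosen for illustration and is not CRMDL or SMTW; the function names are assumptions.

```python
import time


def count_pairwise_comparisons(n):
    """Analytical view: count the primitive operations of a nested loop."""
    operations = 0
    for _ in range(n):
        for _ in range(n):
            operations += 1  # one primitive operation per (i, j) pair
    return operations


def time_pairwise_comparisons(n):
    """Empirical view: wall-clock time of the same loop (machine dependent)."""
    start = time.perf_counter()
    count_pairwise_comparisons(n)
    return time.perf_counter() - start
```

The operation count is exactly n² on every machine, so doubling n always quadruples it, whereas the measured time changes with the hardware, the interpreter, and the system load.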
In this study, we analyze both linking algorithms, i.e., CRMDL (Algorithm 1) and SMTW (Algorithm 2), using the analytical approach. Table 4 presents the time complexity of CRMDL; the time complexity of SMTW is presented subsequently.
Algorithm 1: CRMDL Algorithm Pseudo-Code
  Input: Urdu News Article (UNA) and archived English News Article (ENA) ∈ DNSA
  Output: Similarity Score of UNA against ENA
  [Algorithm 1 body: provided as an image (Applsci 13 08566 i009) in the published article]
Algorithm 2: SMTW Algorithm Pseudo-Code
  Input: New News Article (NNA) and Archived News Articles (ANA) ∈ DNSA
  Output: Similarity Score of NNA with ANA
  [Algorithm 2 body: provided as an image (Applsci 13 08566 i010) in the published article]
The total time complexity of input size n for CRMDL is:
$$T(n) = n + n + k + (n + 1) + n + 1 + n + 1 + k + (n + 1) + n + k + n\sum_{i=1}^{n}(m_i) + n + 1 + n + n + n + \sum_{i=1}^{cw}(tf_1 + tf_2)\,W_i + \sum_{j=1}^{uw}(tf_1\,tf_2)\,W_j + 1 + 1$$
where k lies between 0 and n, and uw = n. For average-case complexity, the value of k = n/2. Therefore,
$$T(n) = n + n + n/2 + (n + 1) + n + 1 + n + 1 + n/2 + (n + 1) + n + n/2 + n\sum_{i=1}^{n}(m_i) + n + 1 + n + n + n + \sum_{i=1}^{n}(tf_1 + tf_2)\,W_i + \sum_{j=1}^{n}(tf_1\,tf_2)\,W_j + 1 + 1$$
The average time complexity is simplified to the order-of-growth function that interests the researchers by ignoring the low-order terms and multiplicative constants.
The result is $T(n) = \Theta(n^2)$, which places CRMDL in the polynomial class; such algorithms behave well on the high-performance computing devices currently available. Similarly, SMTW is also a polynomial-class algorithm with an average time complexity of $T(n) = \Theta(n^2)$.
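The simplification step can be made explicit with a short worked bound. This is a sketch under assumptions that the text does not state: each $m_i$ satisfies $1 \le m_i \le c$ for some constant $c$, each summand of the two weighted sums is non-negative and bounded by a constant, and $a$ and $b$ are constants that collect the coefficients of the linear and constant terms.

```latex
\begin{align*}
T(n) &= (a\,n + b)
      + n\sum_{i=1}^{n} m_i
      + \sum_{i=1}^{n}(tf_1 + tf_2)\,W_i
      + \sum_{j=1}^{n}(tf_1\,tf_2)\,W_j \\
n^2 &\le n\sum_{i=1}^{n} m_i \le c\,n^2, \qquad
\sum_{i=1}^{n}(tf_1 + tf_2)\,W_i = O(n), \qquad
\sum_{j=1}^{n}(tf_1\,tf_2)\,W_j = O(n) \\
\Rightarrow\quad T(n) &= \Theta(n^2)
\end{align*}
```

Under these assumptions the term $n\sum_{i=1}^{n} m_i$ dominates, and all remaining terms are of lower order.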

6.2. Space Complexity

Space complexity refers to the amount of memory space required by an algorithm or program to solve a specific problem. It is a measure of how much memory is needed to execute an algorithm or program, and it is typically expressed in terms of the number of bytes or bits of memory required [45]. The memory required for CRMDL or SMTW is simple and straightforward and can be divided into two categories:
  • Determining the space as a function of array size, which depends on the nature of data and is normally represented by n;
  • Space required by the instruction or statements of the algorithm, which is constant and represented by 1.
The total space complexity of input size n for CRMDL or SMTW is:
$S(n) = \Theta(n)$, ignoring the low-order terms and multiplicative constants.

6.3. Accuracy

The accuracy of an algorithm refers to how well it solves a particular problem or produces the correct output for a given input. In other words, it measures how closely the algorithm’s output matches the desired or correct output. The accuracy of an algorithm can be measured quantitatively, such as through the use of performance metrics like precision, recall, and F1 score. The proposed content-based algorithms, CRMDL and SMTW, are extensively analyzed in [30,41].
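For reference, precision, recall, and F1 score can be computed as in the sketch below. The function name and the representation of results as collections of article identifiers are assumptions made for the example; the evaluation procedure actually used is described in [30,41].

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for retrieved vs. relevant article ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of the retrieved articles that are relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant articles that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, if five articles are linked and two of the three truly related articles are among them, precision is 2/5, recall is 2/3, and F1 is 0.5.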

6.3.1. Datasets

The DNSA encounters rapid growth in both high- and low-resource languages due to its continuous extraction of news articles from multiple sources. As an example, the DNSE collects approximately 400 Urdu news articles from five different sources, 180 Arabic news articles from three sources, and 700 English news articles from ten online sources on a daily basis.
To evaluate the proposed similarity measures, datasets are selected from the DNSA based on the heading or title of news articles, focusing on currently hot topics from a general pool. Table 5 provides a summarized overview of the datasets used in the evaluation process, and Table 6 is used to analyze the similarity based on human observation.
To evaluate the comprehensive overall impact of the proposed similarity measures, a dataset consisting of 282 news articles is utilized. These news articles are sourced from two online television broadcasters, namely Geo and Samaa News, and are available in both English and Urdu languages. The dataset includes 152 Urdu news articles and 130 English news articles that were selected from a general pool. For further details on the news articles used in the empirical evaluation, please refer to Table 7.

6.3.2. Comparison of Content-Based Measures

This section focuses on how SMTW outperformed CRMDL in linking Urdu, a low-resource language, with English, a high-resource language, using content-based techniques. The improvement achieved by SMTW is discussed in detail, and the comparison is based on three evaluation parameters, which are:
  • Result improvement
    The outcomes of both CRMDL and SMTW techniques are compared, emphasizing the improved results achieved by SMTW and ranking them accordingly. The term “improvement” indicates that the result now includes all the relevant news articles within the top-five ranking or that the ranking of the relevant articles has been enhanced, bringing the most relevant news to the top. On the other hand, if a similar news article that was previously within the top five has been displaced, it is denoted as “dropped”. The term “none” is used when both techniques yield the same results or when the new technique has no noticeable effect.
  • Transliteration word impact
    Given that English transliterated words are commonly used in Urdu scripts, it is expected that they would influence the frequency of shared terms. Therefore, an analysis is conducted to examine the impact of transliteration words on the results. This analysis specifically focuses on showcasing the effects of linking Urdu and English news articles and how the presence of transliterated words affects the outcomes.
  • Result accuracy (precision and recall)
    To evaluate the effectiveness of the proposed similarity measure, the accuracy of the results is measured by precision and recall and compared with CSM for dual-lingual news articles to assess the overall feasibility.
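The three labels used for result improvement can be expressed as a small comparison of two top-five rankings. This sketch follows the definitions given in the list above; the function names and the use of summed rank positions to decide whether the ranking has been enhanced are assumptions. A case in which the same relevant articles stay in the top five at a worse position is labeled "none" here, because the text does not define it.

```python
def relevant_ranks(ranking, relevant, k=5):
    """Positions (0-based) of relevant articles within the top-k results."""
    return [pos for pos, doc in enumerate(ranking[:k]) if doc in relevant]


def compare_rankings(baseline, candidate, relevant, k=5):
    """Label the candidate ranking as improvement, dropped, or none."""
    base = relevant_ranks(baseline, relevant, k)
    new = relevant_ranks(candidate, relevant, k)
    if len(new) < len(base):
        return "dropped"      # a relevant article left the top k
    if len(new) > len(base) or sum(new) < sum(base):
        return "improvement"  # more relevant articles, or ranked higher
    return "none"
```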
Table 8 highlights the superiority and improved performance of SMTW over CRMDL and CSM in linking Urdu news articles with relevant English news articles during the preservation and development of the DNSA. Transliterated words play a crucial role in calculating similarity values among relevant news in multilingual archived news articles. The SMTW demonstrates a significant improvement in similarity, with a 22% increase (5 out of 23) compared to CRMDL. Within this improvement, the ranking of relevant news articles improves by 13%, and the overall results show a 9% enhancement. In 74% of the cases, the results remain unchanged, indicating consistency between the techniques. However, for the Urdu news article “ur6”, there is a drop of 4% in the computed similarity value.
Similarly, the impact of transliteration words on the count of common terms and subsequent computation is substantial. The number of common terms is directly influenced by the length of Urdu news articles, and it is observed that there are five transliterated words present in these articles. As a result, the inclusion of these transliterated words leads to a significant improvement of 22% in the results. This improvement is attributed to a 75% increase in the count of common terms.
SMTW is more effective than CRMDL in linking news articles in two languages in the DNSA. Table 9 shows that SMTW works well on large datasets. The study also finds that sports news in Urdu contains more English words and yields better results, whereas Urdu news in general is not considerably influenced by English words. The results improved by 20% (6 out of 30), worsened by 4%, and stayed the same for 76% of stories. The share of English words in Urdu news articles varies from 20% to 30%, depending on the type and length of the news articles.
Figure 8 and Figure 9 depict the precision and recall results for all datasets of news articles. These figures demonstrate that the proposed similarity measure, SMTW, achieves higher accuracy and comprehensiveness compared to CRMDL and CSM in linking dual-language news articles within the DNSA. These results further emphasize the superiority of SMTW in effectively linking and aligning news articles across multiple languages.
In the preservation process, employing the similarity measure based on transliteration words (SMTW) appears to be a viable approach for calculating content-based similarity and linking Urdu news articles to English news articles. The SMTW measure is more effective with lengthy news articles than with shorter ones, and it is proven to be particularly suitable for sports news. By utilizing the SMTW measure, the Digital News Stories Archive ensures the preservation of linked and properly formatted news articles. This approach guarantees the future accessibility of related news articles from a vast corpus of news articles extracted from multiple sources. If transliteration words appear frequently in a given type of manuscript or in a language script, the SMTW performs better.

7. Conclusions and Future Work

The preservation of news stories is of great significance for multiple reasons. These stories offer in-depth information about events that encompass our culture and heritage, making them invaluable resources, and preserving news stories ensures their availability for long-term research purposes. However, the news stories published online are in danger of being lost because of constant changes in the technologies used by online publishing sources and the formats used by platforms. The preservation of news and the creation of news archives are challenging. They become even more complicated when an archive contains articles from a low-resourced and morphologically complex language like Urdu or Arabic. This study introduces a multilingual news archive for Urdu, Arabic, and English news articles published online on eighteen news publishing platforms. The digital news story extractor is enhanced to address major issues in implementing low-resource languages and to facilitate normalized format migration. The extraction results are presented in detail for a high-resource language, i.e., English, and low-resource languages, i.e., Urdu and Arabic. The LRLs encountered a higher error rate during preservation than the HRL: 10% and 3%, respectively. The extraction results show that two of the news sources are not regularly updated and release very few new news stories online. The Digital News Stories Archive framework successfully preserved an average of 879 news articles from ten high-resource-language (HRL) sources and 553 news articles from eight low-resource-language (LRL) news sources. In the context of the DNSA, we compared two bilingual linking mechanisms, namely the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW), for linking Urdu news articles to English news articles. The SMTW demonstrated superior results compared to the CRMDL technique and CSM.
It was observed that approximately 78% of Urdu news articles contained transliterated words. The precision improved from 46% and 55% to 60%, while the recall improved from 64% and 67% to 82%. The impact of common terms also exhibited improvement. Notably, the SMTW was proven effective and feasible for sports news.
This study highlights the challenges encountered in dealing with low-resource languages (LRLs) and outlines the associated research challenges. It also presents the enhancements made to the framework and emphasizes the need for a more comprehensive investigation to ensure accurate extraction and archiving of news content for future access. The framework holds potential for further expansion in various aspects, such as:
  • Thorough analysis of the Arabic script, which is necessary to facilitate multilingual linking;
  • Implementation of a standardized user interface to provide access to the archived contents of the DNSA;
  • Development of the DNSE tool to professional standards;
  • Expansion of meta attributes to accommodate multilingual archives and include languages like Urdu, Arabic, Pashto, and others;
  • Addition of implicit meta elements to the proposed set after comprehensively reviewing individual sources;
  • Ongoing enhancement of the structure of the Urdu-to-English lexicon and optimization of the bag of Urdu words to improve processing efficiency;
  • Development of advanced content-based similarity measures utilizing various features, such as weighted terms, named entities, term position, and contextual information from news articles;
  • Incorporation of crosslingual techniques in the DNSA for linking multilingual archived news;
  • Proposal of metadata elements for the digital news story preservation framework for efficient archive management and information dissemination;
  • A more comprehensive set of generic elements for well-structured and well-populated online sources.

Author Contributions

M.K.: conceptualization, methodology, experimentation, development, data collection, and manuscript writing; K.U.: conceptualization, methodology, experimentation, manuscript writing, and proofreading; Y.A.: conceptualization, methodology, proofreading, and supervision; A.A. (Ali Alferaidi): conceptualization, proofreading, and supervision; T.S.A.: conceptualization, methodology, proofreading, and supervision; K.Y.: conceptualization, methodology, and proofreading; N.A.: conceptualization, methodology, and proofreading; A.A. (Akash Ahmad): conceptualization, methodology, and proofreading. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Deanship at the University of Ha’il—Saudi Arabia through project number RG-21 090.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

This article does not involve humans or animals.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SMTW      Similarity measure based on transliteration words
CRMDL     Common ratio measure for dual languages
CSM       Cosine similarity measure
WWW       World Wide Web
DNSA      Digital News Stories Archive
DNSP      Digital news story preservation
DNSE      Digital news story extractor
CT        Common terms
TT        Total terms
UT        Uncommon terms
UrNews    Urdu news
EngN      English news
Ur        Urdu
Eng       English
ICADL     International Conference on Asian Digital Libraries
AI        Artificial intelligence
API       Application programming interface

References

  1. Khan, M. Using Text Processing Techniques for Linking News Stories for Digital Preservation. Ph.D. Thesis, Faculty of Computer Science, Islamabad Campus, Preston University Kohat, Kohat, Pakistan, 2018.
  2. WWW Size The Size of the World Wide Web (The Internet). 2021. Available online: https://www.worldwidewebsize.com/ (accessed on 4 August 2021).
  3. Emani, C.K.; Cullot, N.; Nicolle, C. Understandable big data: A survey. Comput. Sci. Rev. 2015, 17, 70–81.
  4. UNESCO. UNESCO Universal Declaration on Archives. 2010. Available online: https://www.ica.org/en/universal-declaration-archives (accessed on 19 July 2023).
  5. Skinner, K.; Schultz, M. Guidelines for Digital Newspaper Preservation Readiness; Educopia Institute: Atlanta, GA, USA, 2014.
  6. Khan, M.; Rahman, A.U. Digital News Story Preservation Framework. In Digital Libraries: Providing Quality Information, Proceedings of the 17th International Conference on Asia-Pacific Digital Libraries, ICADL 2015, Seoul, Republic of Korea, 9–12 December 2015; Springer: Cham, Switzerland, 2015; Volume 9469, p. 350.
  7. Khan, M.; Alharbi, Y.; Alferaidi, A.; Alharbi, T.S.; Yadav, K. Metadata for Efficient Management of Digital News Articles in Multilingual News Archives. SAGE Open 2023, 13, 1–17.
  8. Magueresse, A.; Carles, V.; Heetderks, E. Low-resource languages: A review of past work and future challenges. arXiv 2020, arXiv:2006.07264.
  9. Cieri, C.; Maxwell, M.; Strassel, S.; Tracey, J. Selection criteria for low resource language programs. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz, Slovenia, 23–28 May 2016; pp. 4543–4549.
  10. Guellil, I.; Saâdane, H.; Azouaou, F.; Gueni, B.; Nouvel, D. Arabic natural language processing: An overview. J. King Saud Univ. Comput. Inf. Sci. 2021, 33, 497–507.
  11. Elkateb, S.; Black, W.J.; Vossen, P.; Farwell, D.; Rodríguez, H.; Pease, A.; Alkhalifa, M.; Fellbaum, C. Arabic WordNet and the challenges of Arabic. In Proceedings of the International Conference on the Challenge of Arabic for NLP/MT, London, UK, 23 October 2006.
  12. Rehman, Z.; Anwar, W.; Bajwa, U.I. Challenges in Urdu text tokenization and sentence boundary disambiguation. In Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP), Chiang Mai, Thailand, 8 November 2011; pp. 40–45.
  13. Kamusella, T. The Arabic language: A Latin of modernity? J. Natl. Mem. Lang. Politics 2017, 11, 117–145.
  14. Unesco Official. UNESCO World Arabic Language Day. 2016. Available online: https://en.unesco.org/node/267866 (accessed on 25 January 2022).
  15. Semple, N. The digital preservation coalition. Alexandria 2007, 19, 47–55.
  16. DeRidder, J.L. Digital Preservation: Why Is This Important to Me? 2016. Available online: https://www.rjionline.org/stories/digital-preservation-why-is-this-importantto-me (accessed on 21 December 2021).
  17. Naeem, W. Preserving Languages: Urdu’s Importance Discussed at Dictionary Launch. 2013. Available online: http://tribune.com.pk/story/622605/preserving-languages-urdus-importance-discussed-at-dictionary-launch/ (accessed on 2 April 2022).
  18. Lavoie, B.F. The open archival information system reference model: Introductory guide. Microform Digit. Rev. 2004, 33, 68–81.
  19. Ntoulas, A.; Cho, J.; Olston, C. What’s new on the Web? The evolution of the Web from a search engine perspective. In Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA, 17–20 May 2004; pp. 1–12.
  20. Greenberg, J. Dublin Core: History, Key Concepts, and Evolving Context (Part One). In Proceedings of the 2010 International Conference on Dublin Core and Metadata Applications, Pittsburgh, PA, USA, 20–22 October 2010.
  21. Riley, J. Understanding Metadata. Official of NISO, 2017. Available online: https://www.niso.org/publications/understanding-metadata (accessed on 11 May 2021).
  22. Harran, M.; Farrelly, W.; Curran, K. A method for verifying integrity & authenticating digital media. Appl. Comput. Inform. 2018, 14, 145–158.
  23. McClelland, M.; McArthur, D.; Giersch, S.; Geisler, G. Challenges for service providers when importing metadata in digital libraries. D-Lib Mag. 2002, 8, 1082–9873.
  24. Khan, M.; Rahman, A.U. A Systematic Approach Towards Web Preservation. Inf. Technol. Libr. 2019, 38, 71–90.
  25. Khan, M.; Rahman, A.U.; Awan, M.D. Exploring the Digital World of Newspaper Archives. Sci. Technol. J. 2017, 32, 140–164.
  26. Khan, M.; Rahman, A.U.; Awan, M.D.; Alam, S.M. Normalizing digital news-stories for preservation. In Proceedings of the Eleventh International Conference on Digital Information Management (ICDIM), Porto, Portugal, 19–21 September 2016; IEEE: New York, NY, USA, 2016; pp. 85–90.
  27. Khan, M.; Rahman, A.U.; Awan, M.D. Term-Based Approach for Linking Digital News Stories. In Digital Libraries and Multimedia Archives, Proceedings of the 14th Italian Research Conference on Digital Libraries, IRCDL 2018, Udine, Italy, 25–26 January 2018; Springer: Cham, Switzerland, 2018; pp. 127–138.
  28. Khan, M.; Rahman, A.U.; Ullah, M.; Naseem, R. The Role of Named Entities in Linking News Articles During Preservation. In Proceedings of the International Conference on the Sciences of Electronics, Technologies of Information and Telecommunications, Maghreb, Tunisia, 18–20 December 2018; Springer: Cham, Switzerland, 2018; pp. 50–58.
  29. Feng, C.; Khan, M.; Rahman, A.U.; Ahmad, A. News Recommendation Systems-Accomplishments, Challenges & Future Directions. IEEE Access 2020, 8, 16702–16725.
  30. Khan, M.; Rahman, A.U.; Ahmad, A.; Khan, S.S. A content-based technique for linking dual language news articles in an archive. J. Inf. Sci. 2022, 48, 57–70.
  31. Khan, M.; Khan, S.S.; Ahmad, A.; Rahman, A.U. The role of news title for linking during preservation process in digital archives. Libr. Hi Tech 2022, 40, 1359–1383.
  32. Haq, I.U.; Khan, Z.Y.; Ahmad, A.; Hayat, B.; Khan, A.; Lee, Y.E.; Kim, K.I. Evaluating and Enhancing the Robustness of Sustainable Neural Relationship Classifiers Using Query-Efficient Black-Box Adversarial Attacks. Sustainability 2021, 13, 5892.
  33. Khan, Z.Y.; Niu, Z.; Nyamawe, A.S.; ul Haq, I. A Deep Hybrid Model for Recommendation by jointly leveraging ratings, reviews and metadata information. Eng. Appl. Artif. Intell. 2021, 97, 104066.
  34. Doychev, D.; Lawlor, A.; Rafter, R.; Smyth, B. An Analysis of Recommender Algorithms for Online News. In Proceedings of the 5th International Conference of the CLEF Initiative, Sheffield, UK, 15–18 September 2014; pp. 825–836.
  35. Agarwal, D.; Chen, B.C.; Elango, P.; Wang, X. Personalized click shaping through lagrangian duality for online recommendation. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012; ACM: New York, NY, USA, 2012; pp. 485–494.
  36. Fortuna, B.; Fortuna, C.; Mladenić, D. Real-time news recommender system. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Barcelona, Spain, 19–23 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 583–586.
  37. Li, L.; Wang, D.D.; Zhu, S.Z.; Li, T. Personalized news recommendation: A review and an experimental investigation. J. Comput. Sci. Technol. 2011, 26, 754–766.
  38. Li, L.; Zheng, L.; Yang, F.; Li, T. Modeling and broadening temporal user interest in personalized news recommendation. Expert Syst. Appl. 2014, 41, 3168–3177.
  39. Said, A.; Bellogín, A.; Lin, J.; de Vries, A. Do recommendations matter? News recommendation in real life. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, Baltimore, MD, USA, 15–19 February 2014; ACM: New York, NY, USA, 2014; pp. 237–240.
  40. Li, L.; Li, T. News recommendation via hypergraph learning: Encapsulation of user behavior and news content. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, Rome, Italy, 4–8 February 2013; ACM: New York, NY, USA, 2013; pp. 305–314.
  41. Khan, M.; Khan, S.S.; Alharbi, Y.; Alferaidi, A.; Alharbi, T.S.; Yadav, K. The Role of Transliterated Words in Linking Bilingual News Articles in an Archive. Appl. Sci. 2023, 13, 4435.
  42. Borrow Language Definition. Available online: https://www.thoughtco.com/what-is-borrowing-language-1689176 (accessed on 5 July 2017).
  43. Accredited Language Services. Available online: https://www.accreditedlanguage.com/2016/09/09/what-is-transliteration/ (accessed on 5 July 2017).
  44. Al-Onaizan, Y.; Knight, K. Machine transliteration of names in Arabic text. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA, USA, 11 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–13. [Google Scholar]
  45. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
Figure 1. Enhanced digital news story preservation framework.
Figure 2. High-Level System Architecture.
Figure 3. Archival information package (AIP) of DNSA [1].
Figure 4. Average new news story extraction for high-resource language “English” from different sources.
Figure 5. Average URL extraction and unique URL extraction for HRLs.
Figure 6. Unique URL extraction and new URL extraction comparison.
Figure 7. Comparison of news URL extraction from online newspapers.
Figure 8. Comparison of the precision of SMTW, CRMDL, and CSM.
Figure 9. Comparison of the recall of SMTW, CRMDL, and CSM.
Table 1. News sources in the DNSA.
| No. | News Source | Abbreviation | Language |
|---|---|---|---|
| 01 | DAWN News | DN | English |
| 02 | The Tribune | TT | English |
| 03 | The News | TN | English |
| 04 | Geo News | GN | English |
| 05 | Pakistan Observer | PO | English |
| 06 | Pakistan Today | PT | English |
| 07 | ARY News | AN | English |
| 08 | Samaa News | SN | English |
| 09 | Voice of Journalist | VJ | English |
| 10 | Time of Pakistan | TP | English |
| 11 | Express | Ex | Urdu |
| 12 | Daily Pakistan | DP | Urdu |
| 13 | Samaa Urdu | SU | Urdu |
| 14 | Geo Urdu | GU | Urdu |
| 15 | Dawn News | DU | Urdu |
| 16 | Al-Jazirah Online | AO | Arabic |
| 17 | Al-Riaz | AR | Arabic |
| 18 | Okaz | OK | Arabic |
Table 2. Average of six days of DNSE extraction results for the HRL and the LRLs, for all sources.
| No. | News Source | Extracted URLs | Unique URLs | New URLs |
|---|---|---|---|---|
| English News Sources | | | | |
| 01 | DN | 1277 | 304 | 166 |
| 02 | TT | 711 | 243 | 111 |
| 03 | TN | 816 | 230 | 131 |
| 04 | GN | 306 | 95 | 66 |
| 05 | PO | 405 | 136 | 108 |
| 06 | PT | 490 | 170 | 151 |
| 07 | AN | 223 | 86 | 62 |
| 08 | SN | 178 | 65 | 47 |
| 09 | VJ | 159 | 39 | 19 |
| 10 | TP | 127 | 49 | 18 |
| Urdu News Sources | | | | |
| 11 | Ex | 295 | 173 | 99 |
| 12 | DP | 202 | 123 | 75 |
| 13 | SU | 270 | 144 | 91 |
| 14 | GU | 197 | 99 | 55 |
| 15 | DU | 175 | 93 | 45 |
| Arabic News Sources | | | | |
| 16 | AO | 211 | 101 | 63 |
| 17 | AR | 154 | 83 | 50 |
| 18 | OK | 192 | 110 | 75 |
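The three count columns of Table 2 (extracted, unique, and new URLs) suggest a simple de-duplication pipeline. The following is a minimal Python sketch of how such counts could be derived, not the DNSE implementation: the function name `summarize_extraction`, the `archived_urls` set, and the example.com URLs are all hypothetical, and it assumes that "unique" means distinct within one extraction run and "new" means not yet present in the archive.

```python
def summarize_extraction(extracted_urls, archived_urls):
    """Count extracted, unique, and new URLs for one extraction run.

    Assumption (not stated in the source): 'unique' = distinct URLs within
    the run; 'new' = unique URLs that are not already in the archive.
    """
    # dict.fromkeys removes duplicates while preserving first-seen order
    unique_urls = list(dict.fromkeys(extracted_urls))
    new_urls = [url for url in unique_urls if url not in archived_urls]
    return {
        "extracted": len(extracted_urls),
        "unique": len(unique_urls),
        "new": len(new_urls),
    }


# Hypothetical sample run: four links found, one duplicate, one already archived
sample_extracted = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/c",
]
sample_archived = {"https://example.com/b"}
sample_summary = summarize_extraction(sample_extracted, sample_archived)
print(sample_summary)
```

Under these assumptions the counts can only shrink from column to column (extracted ≥ unique ≥ new), which matches every row of Table 2.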
Table 3. Error rate in both HRL (English) and LRLs (Urdu and Arabic) during extraction.
| Day | HRL Sources | Error Rate | Percentage | LRLs Sources | Error Rate | Percentage |
|---|---|---|---|---|---|---|
| 01 | 1572 | 81 | 5% | 948 | 122 | 13% |
| 02 | 712 | 26 | 4% | 469 | 55 | 12% |
| 03 | 781 | 31 | 4% | 472 | 49 | 10% |
| 04 | 746 | 25 | 3% | 480 | 64 | 13% |
| 05 | 716 | 19 | 3% | 457 | 46 | 10% |
| 06 | 745 | 21 | 3% | 493 | 42 | 9% |
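The percentages in Table 3 can be checked directly from the counts. The short Python sketch below assumes that the "Error Rate" column holds the number of failed extractions and that each percentage is that count divided by the number of sources processed, rounded to the nearest integer; the names `TABLE_3` and `error_percentage` are illustrative only.

```python
# Rows as read from Table 3:
# (day, HRL items, HRL errors, HRL %, LRL items, LRL errors, LRL %)
TABLE_3 = [
    (1, 1572, 81, 5, 948, 122, 13),
    (2, 712, 26, 4, 469, 55, 12),
    (3, 781, 31, 4, 472, 49, 10),
    (4, 746, 25, 3, 480, 64, 13),
    (5, 716, 19, 3, 457, 46, 10),
    (6, 745, 21, 3, 493, 42, 9),
]


def error_percentage(errors, total):
    """Return errors as a percentage of total (assumed definition)."""
    return 100.0 * errors / total


for day, hrl_n, hrl_err, hrl_pct, lrl_n, lrl_err, lrl_pct in TABLE_3:
    hrl = error_percentage(hrl_err, hrl_n)
    lrl = error_percentage(lrl_err, lrl_n)
    print(f"Day {day:02d}: HRL {hrl:.2f}% (table: {hrl_pct}%), "
          f"LRL {lrl:.2f}% (table: {lrl_pct}%)")
```

With this definition, every recomputed value rounds to the percentage printed in the table.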
Table 4. Time complexity of CRMDL.
| Statement | Unit Cost | Total Cost |
|---|---|---|
| UNA Pre-processing | $n$ | $n$ |
| $T_u = \{t_1, t_2, t_3, \ldots, t_n\}$ | $n$ | $n$ |
| Remove stopwords (if any) | $k$ | $k$ |
| for $t_i \in T_u$ up to $n$ do | $n + 1$ | $n + 1$ |
| Find tf for each term t from T | $n$ | $n$ |
| Update Map(UNA) | $1$ | $1$ |
| Map(UNA) $= \{(tf_1, w_1), (tf_2, w_2), (tf_3, w_3), \ldots, (tf_i, w_i)\}$ | $n$ | $n$ |
| Identify the English meaning of each Urdu term in the dictionary | $1$ | $1$ |
| Identify multiple meanings for each Urdu word (if any) | $k$ | $k$ |
| for ENA $\in$ DNSA do | $n + 1$ | $n + 1$ |
| $T_e = \{t_1, t_2, t_3, \ldots, t_m\}$ | $n$ | $n$ |
| Remove stopwords (if any) | $k$ | $k$ |
| for $t_i \in T_e$ up to $m$ do | $\sum_{i=1}^{n} (m_i)$ | $n \ast \sum_{i=1}^{n} (m_i)$ |
| Find tf for each term t from T | $n$ | $n$ |
| Update Map(ENA) | $1$ | $1$ |
| Map(ENA) $= \{(tf_1, w_1), (tf_2, w_2), (tf_3, w_3), \ldots, (tf_i, w_i)\}$ | $n$ | $n$ |
| Map(ENA) $= \{(tf_1, w_1), (tf_2, w_2), (tf_3, w_3), \ldots, (tf_j, w_j)\}$ | $n$ | $n$ |
| Map(UNA) $= \{(tf_1, w_1), (tf_2, w_2), (tf_3, w_3), \ldots, (tf_i, w_i)\}$ | $n$ | $n$ |
| $CT = (tf_1 + tf_2)W_1 + (tf_1 + tf_2)W_2 + \ldots + (tf_1 + tf_2)W_{cw}$ | $\sum_{i=1}^{cw} (tf_1 + tf_2)W_i$ | $\sum_{i=1}^{cw} (tf_1 + tf_2)W_i$ |
| $UT = (tf_1 \vee tf_2)W_1 + (tf_1 \vee tf_2)W_2 + \ldots + (tf_1 \vee tf_2)W_{uw}$ | $\sum_{j=1}^{uw} (tf_1 \vee tf_2)W_j$ | $\sum_{j=1}^{uw} (tf_1 \vee tf_2)W_j$ |
| $TT = UT + CT$ | $1$ | $1$ |
| $CRMDL = CT/TT$ | $1$ | $1$ |
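The last four rows of Table 4 define the score itself: CT, UT, TT = UT + CT, and CRMDL = CT/TT. A minimal Python sketch of that final step follows. It is an illustration under stated assumptions, not the authors' implementation: it assumes the Urdu article's terms have already been mapped to English by the dictionary step, that CT sums both term frequencies over terms common to the two articles, and that the ∨ in UT means "the frequency from whichever article contains the term". The function name `crmdl` and the sample terms are hypothetical.

```python
from collections import Counter


def crmdl(una_terms, ena_terms):
    """Ratio of common-term weight (CT) to total-term weight (TT = UT + CT).

    una_terms: terms of the Urdu article, assumed already translated to English
    ena_terms: terms of the English article, stopwords assumed removed
    """
    tf_u = Counter(una_terms)  # Map(UNA): term -> frequency
    tf_e = Counter(ena_terms)  # Map(ENA): term -> frequency

    common = tf_u.keys() & tf_e.keys()    # terms present in both articles
    uncommon = tf_u.keys() ^ tf_e.keys()  # terms present in exactly one

    # CT: sum of (tf1 + tf2) over the common terms
    ct = sum(tf_u[w] + tf_e[w] for w in common)
    # UT: sum of (tf1 or tf2) over the uncommon terms; Counter returns 0
    # for a missing term, so `or` picks the non-zero frequency
    ut = sum(tf_u[w] or tf_e[w] for w in uncommon)

    tt = ut + ct
    return ct / tt if tt else 0.0


# Hypothetical example terms
sample_una = ["budget", "salary", "salary", "pension"]
sample_ena = ["budget", "salary", "government"]
sample_score = crmdl(sample_una, sample_ena)
print(f"CRMDL = {sample_score:.3f}")
```

In this example CT = (1 + 1) + (2 + 1) = 5 and UT = 1 + 1 = 2, so the score is 5/7. Under these assumptions the score lies between 0 (no shared terms) and 1 (identical vocabularies).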
Table 5. Dataset overview for linking of bilingual news articles.
| No. | News Articles/Sets | Sets | Urdu Articles | English Articles | Sources | Observed Similarity: During Selection | Observed Similarity: Proposed Measures | Observed Similarity: Observed Results |
|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 3 | 1 | 3 | 3 | Yes | Yes | Yes |
| 2 | 10 | 2 | 5 | 5 | 5 | Yes | Yes | Yes |
| 3 | 20 | 1 | 10 | 10 | 5 | Yes | Yes | Yes |
| 4 | 282 (One-Day) | 2 | 152 | 130 | 4 | No | Yes | Yes |
Table 6. Overview: dataset of 20 news articles.
| Type of News | No. of News Articles | News Articles | About |
|---|---|---|---|
| Sports News | 3 | 1, 6, 10 | PSL, Cricket |
| Sports News | 2 | 7, 9 | WI tour, teams announcement |
| Sports News | 1 | 5 | ICC president resigns |
| General News | 3 | 2, 6, 8 | COAS, army |
| General News | 1 | 3 | Trump travel ban |
| General News | 1 | 4 | MQM leader |
Table 7. News articles analyzed for similarity.
| Urdu Article | English Translation | Description | Stats |
|---|---|---|---|
| [Urdu headline, image i001] | Budget 2017–2018: Government employees were made happy | No exact match; many similar news articles; general news; average length | Six relevant news articles out of 55; no exact match |
| [Urdu headline, image i002] | The Ramadan moon sighted, the first fast will be tomorrow | No exact match; many similar news articles; general news; short length | Nine relevant news articles out of 55; no exact match |
| [Urdu headline, image i003] | Budget, 10% raise in salaries and pension | One exact match; many similar news articles; general news; average length | Eight relevant news articles out of 74; one exact match |
| [Urdu headline, image i004] | Younis Khan’s all-time test captain is Imran Khan | One exact match; many similar news articles; sports news; average length | Seven relevant news articles out of 74; one exact match |
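Table 8. Improved results of SMTW approach vs. CRMDL approach for 20 news article sets.
| Urdu News | CRMDL Rank | SMTW Rank | SMTW CT Impact |
|---|---|---|---|
| | eng1 | eng1 | |
| | eng10 | eng6 | - |
| urNews1 | eng6 | eng10 | |
| | eng7 | eng7 | |
| | eng4 | eng4 | |
| | eng2 | eng2 | |
| | eng6 | eng6 | - |
| urNews2 | eng8 | eng8 | |
| | eng5 | eng5 | - |
| | eng3 | eng3 | - |
| urNews3 | eng3 | eng3 | - |
| | eng4 | eng4 | - |
| urNews4 | eng4 | eng4 | |
| | eng7 | eng7 | - |
| urNews5 | eng5 | eng5 | |
| | eng1 | eng1 | |
| urNews6 | eng6 | eng6 | |
| | eng2 | eng2 | |
| | eng1 | eng8 | |
| | eng7 | eng1 | |
| | eng10 | eng7 | |
| | eng8 | eng10 | |
| urNews7 | eng7 | eng7 | |
| | eng1 | eng10 | |
| | eng10 | eng6 | |
| | eng3 | eng1 | |
| | eng6 | eng9 | |
| urNews8 | eng8 | eng8 | |
| | eng3 | eng6 | |
| | eng4 | eng2 | |
| | eng1 | eng4 | |
| | eng2 | eng3 | |
| urNews9 | eng7 | eng9 | |
| | eng9 | eng7 | |
| | eng5 | eng6 | |
| urNews10 | eng1 | eng10 | |
| | eng10 | eng6 | |
| | eng6 | eng1 | |
| | eng7 | eng7 | |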
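Table 9. Improved results of SMTW approach vs. CRMDL approach for one-day news article sets. ⯅ indicates improved results, and “-” represents “no change or no impact”.

Urdu article [headline image i005]
| Eng News | CRMDL CT | SMTW CT | CT Impact |
|---|---|---|---|
| Eng1 | 49 | 75 | ⯅ |
| Eng2 | 22 | 31 | ⯅ |
| Eng3 | 22 | 31 | ⯅ |
| Eng4 | 25 | 36 | ⯅ |
| Eng5 | 26 | 34 | ⯅ |
| Eng6 | 12 | 20 | ⯅ |

Urdu article [headline image i006]
| Eng News | CRMDL CT | SMTW CT | CT Impact |
|---|---|---|---|
| Eng1 | 14 | 18 | ⯅ |
| Eng2 | 12 | 14 | ⯅ |
| Eng3 | 06 | 06 | - |
| Eng4 | 04 | 04 | - |
| Eng5 | 07 | 07 | - |
| Eng6 | 06 | 06 | - |
| Eng7 | 06 | 06 | - |
| Eng8 | 11 | 11 | - |
| Eng9 | 02 | 02 | - |

Urdu article [headline image i007]
| Eng News | CRMDL CT | SMTW CT | CT Impact |
|---|---|---|---|
| Eng1 | 82 | 121 | ⯅ |
| Eng2 | 83 | 115 | ⯅ |
| Eng3 | 162 | 219 | ⯅ |
| Eng4 | 87 | 106 | ⯅ |
| Eng5 | 66 | 97 | ⯅ |
| Eng6 | 55 | 83 | ⯅ |
| Eng7 | 56 | 86 | ⯅ |
| Eng8 | 42 | 71 | ⯅ |

Urdu article [headline image i008]
| Eng News | CRMDL CT | SMTW CT | CT Impact |
|---|---|---|---|
| Eng1 | 53 | 122 | ⯅ |
| Eng2 | 65 | 176 | ⯅ |
| Eng3 | 37 | 81 | ⯅ |
| Eng4 | 27 | 69 | ⯅ |
| Eng5 | 24 | 103 | ⯅ |
| Eng6 | 19 | 50 | ⯅ |
| Eng7 | 13 | 71 | ⯅ |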
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


Khan, M.; Ullah, K.; Alharbi, Y.; Alferaidi, A.; Alharbi, T.S.; Yadav, K.; Alsharabi, N.; Ahmad, A. Understanding the Research Challenges in Low-Resource Language and Linking Bilingual News Articles in Multilingual News Archive. Appl. Sci. 2023, 13, 8566. https://doi.org/10.3390/app13158566
