1. Introduction
The Internet serves as the primary and extensive hub of information, encompassing diverse sources that cover every aspect of human existence. It offers a wide range of data, including weather forecasts, travel deals, local and global events, and much more. This vast pool of information is accessible through the World Wide Web and various Web services [
1]. In fact, the amount of information generated on the Web is growing exponentially and is predicted to surpass the collective cognitive capacity of humanity by 2025. To put it into perspective, the volume of Web information is currently measured on the order of exabytes (10^18 bytes) and zettabytes (10^21 bytes) [
2,
3].
Despite its rapid growth, the World Wide Web (WWW) is inherently fragile, which poses a significant challenge. The fragility of information on the Web leads to the unfortunate disappearance and inaccessibility of valuable scholarly, cultural, and scientific resources for future generations. Consequently, it becomes imperative to prioritize the preservation of this diverse and valuable information, which exists in various forms.
For hundreds of years, newspapers have served as the primary source of information, covering a wide range of topics that encompass various aspects of human life. They provide valuable insights into local and global events, including parliamentary activities, politically significant occurrences, court proceedings, births, deaths, marriages, sports, science, technology, and more. Newspapers reflect societal life, capturing social dynamics, behaviors, and cultural values, thus serving as essential scholarly information for individuals and communities. Given the significance of preserving such information for future generations, efforts have been made to ensure its availability. For example, historical manuscripts now hold immense value, just as addresses made by prime ministers after election victories or announcements related to imminent foreign invasions. The UNESCO Declaration on Archives emphasizes the crucial role played by archives in societal development by safeguarding the contributions of individuals and communities [
4]. The preservation and accessibility of published information are essential for safeguarding valuable resources. Various initiatives have been implemented to achieve this goal, leading to the creation of numerous newspaper archives. Curators and their organizations play a significant role in preserving newspapers and maintaining the digital collections of these publications. Typically, newspapers are digitized either internally, i.e., in-house, or with the assistance of external vendors. Additionally, some newspapers are preserved as born-digital content obtained directly from publishers or by harvesting from the web [
5].
A literature review of newspaper archives reveals that diverse approaches are employed for the preservation of newspapers, with the majority digitized as a single digital record. Typically, curated digitized records are created by scanning microfilm, a compact photographic medium that can be stored and enlarged for reading, and are then saved in PDF, GIF, JPEG, or other graphical formats. Newspaper archives can be categorized into two main types: older and newer archives. Older newspaper archives pose challenges for optical character recognition (OCR) technology in indexing them into a full-text corpus. As a result, these archives are primarily available in graphical format, requiring visual inspection to access the content. Conversely, newer newspaper archives have been extensively indexed, allowing for efficient full-text searching to retrieve specific information.
The digital news story preservation (DNSP) framework was introduced to establish a digital archive of news articles interconnected based on specific criteria with the purpose of future utilization [
6]. Recently, this framework was enhanced to include a multilingual and multisource digital news stories archive aimed at preserving digital news articles for the long term and for the benefit of future generations. The framework now incorporates two low-resource languages, namely Urdu and Arabic. However, there are several challenges associated with including these languages in the Digital News Stories Archive (DNSA), mainly due to their low-resource nature. This study identifies various challenges related to different aspects of low-resource languages that make it difficult to incorporate them into the archive. These challenges encompass issues of volume, variety, and velocity during archival information packaging; technical difficulties encountered during the creation of the archive; and challenges with the dissemination of archived content. One of the major obstacles is the scarcity of resources available for low-resource languages. For example, tokenization of Urdu script is less reliable than for high-resource languages because spacing between words is inconsistent, and comprehensive dictionaries, which are fundamental linguistic resources, are not available for either Arabic or Urdu. As a result, extensive preprocessing becomes necessary during the preservation process to compensate for the lack of these resources.
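The space-omission problem in Urdu tokenization can be illustrated with a minimal sketch. The DNSP tooling is Java-based; this Python fragment, with an invented `tokenize` helper handling a single normalization rule, is purely illustrative of one such inconsistency:

```python
# Urdu text sometimes joins adjacent words with a zero-width non-joiner
# (U+200C) instead of a visible space, so a plain whitespace split
# undercounts tokens. A real Urdu tokenizer must handle many more cases.
def tokenize(text):
    # Treat the zero-width non-joiner as a word boundary before splitting.
    return text.replace("\u200c", " ").split()

urdu = "یہ\u200cایک مثال ہے"   # "this is an example", first two words joined
print(len(urdu.split()))       # a naive split sees only 3 tokens
print(len(tokenize(urdu)))     # normalization recovers all 4 tokens
```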
Establishing linking mechanisms and metadata is crucial during the preservation process to ensure the efficient dissemination of archived news articles sourced from multiple languages and diverse sources [
7]. The second part of this paper introduces bilingual news linking mechanisms, specifically the “Common Ratio Measure for Dual Language (CRMDL)”, which is based on the ratio of common terms, and the “Similarity Measure based on Transliteration Words (SMTW)”, which relies on English words transliterated in Urdu news articles. These mechanisms facilitate access to news articles extracted and archived from various sources during the preservation phase. This paper compares these linking algorithms and discusses the effectiveness of the results. By incorporating different linking mechanisms, the digital news story preservation (DNSP) framework is enriched and enhanced to ensure future accessibility of the archived content.
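To make the SMTW idea concrete, the following hedged Python sketch counts Urdu terms that are transliterations of English words shared with a candidate English article. The tiny transliteration lexicon and the function name are illustrative assumptions, not the actual DNSP implementation:

```python
# Illustrative transliteration lexicon (an assumption for this sketch):
# maps Urdu spellings of borrowed English words back to their English forms.
TRANSLIT = {"کرکٹ": "cricket", "پارلیمنٹ": "parliament", "پولیس": "police"}

def translit_overlap(urdu_tokens, english_tokens):
    # Map Urdu tokens through the lexicon, then count how many of the
    # recovered English words also occur in the English article.
    english = set(english_tokens)
    mapped = {TRANSLIT[t] for t in urdu_tokens if t in TRANSLIT}
    return len(mapped & english)

print(translit_overlap(["کرکٹ", "میچ", "پولیس"],
                       ["cricket", "match", "police"]))  # 2 shared loanwords
```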
Section 2 and its subsections differentiate low-resource languages from high-resource languages and introduce the challenges of LRLs, with a brief overview of the Urdu and Arabic languages.
Section 3 presents details about the digital news story preservation framework initiative, the importance of preservation, research challenges, DNSP framework enhancement, the multilingual archive and its structure, and major issues encountered in enhancing the extraction tool. In
Section 4, extraction quantification is comprehensively discussed.
Section 6.3.2 compares the results of bilingual linking mechanisms, and in
Section 7, the findings are summarized.
2. Low-Resource Languages
Natural languages are classified into two broad categories, i.e., low-resource languages (LRLs) and high-resource languages (HRLs). For high-resource languages, many data resources exist that help machines learn and understand natural language. English, for example, is a well-resourced language compared to most other spoken languages, as are many Western European languages; Chinese, Japanese, and Russian are also high-resource languages. In contrast, low-resource languages have very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages [
8,
9]. Many languages are difficult to preserve because they are mostly oral, and very few written resources exist in physical form, with none available in electronic format. There are different types of resources for natural language processing and the development of language-based systems:
Collection of text in various forms, such as research papers, books, email collections, social media content collections, etc.;
Lexical, syntactic, and semantic resources, such as a bag of words, dictionaries, semantic databases (e.g., WordNet), organized dependency tree corpora, etc.;
Task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, named entity recognition resources, etc.
Many language resources are costly to produce, which is why the economic inequalities between countries/languages are reflected in the amount (or absence) of language resources.
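As a concrete instance of the first two resource types, a bag-of-words model reduces a text to term frequencies. The sketch below is a minimal illustration, not a production lexical resource:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; real pipelines would also strip
    # punctuation and normalize the script.
    return Counter(text.lower().split())

bow = bag_of_words("The news archive preserves the news")
print(bow["news"])  # 2
print(bow["the"])   # 2
```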
2.1. Challenges in Low-Resource Language Processing
Natural language processing (NLP) tools underwent a drastic change in the 1990s, shifting from rule-based techniques to statistical approaches and ushering in a new era of artificial intelligence. Since then, the focus has mainly been on English as an international language, and only about 20 of the world's roughly 7000 languages have received substantial attention [
10].
Such under-researched languages are often referred to as low-resource languages and face many challenges, as briefly discussed below:
Alignment or the projection technique (three levels of alignment, i.e., word, sentence, and document) is a common technique for annotation. It is difficult to adopt the projection technique from HRLs to LRLs because of a lack of resources and different structures of target and source languages [
8];
Creating a bag of words, dataset, and raw text collection for LRLs is difficult but necessary for any natural language processing (NLP) task or mapping technique [
8];
The most important resource for any language is the lexicon of that language; many NLP tasks heavily depend on the textual material available, which is lacking in LRLs, making it a challenging task to produce an efficient lexicon;
The morphology of LRLs is constantly evolving, with vocabulary easily extended. Developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots [
11];
The major applications of NLP, such as question–answer systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition-based systems, are very difficult to implement in low-resource languages;
Basic NLP tasks such as stop-word identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, stemming, etc., are also difficult in low-resource languages;
The NLP systems of LRLs are time-consuming and comparatively less efficient as a result of a lack of resources, increasing the difficulty of developing a machine learning system [
10];
Many languages are mostly oral, for which very few written resources exist (physical and digital formats). For some, there are written documents but not even a basic resource like a dictionary;
Integrated and customized systems are always a huge challenge for multilingual systems.
Dealing with all the challenges faced by low-resource languages requires extensive research in different dimensions. Urdu and Arabic are two widely spoken languages that require far more research attention.
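The gap in basic NLP tasks can be seen even for stop-word removal: the operation itself is trivial once a curated stop list exists, which is exactly the resource many LRLs lack. A minimal sketch, in which the four-word English stop list is an illustrative subset:

```python
# Removing stop words is a one-liner given a stop list; for languages such
# as Urdu, no comparably standard list may be available.
EN_STOPS = {"the", "is", "a", "of"}  # tiny illustrative subset

def remove_stops(tokens, stops):
    return [t for t in tokens if t not in stops]

print(remove_stops("the archive is a record of news".split(), EN_STOPS))
```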
2.2. Urdu Language
Urdu, a prominent South Asian language, boasts approximately 70 million native speakers and over 164 million speakers worldwide [
12]. It serves as the official literary language of Pakistan and is also spoken and understood in other countries such as India and Bangladesh. Urdu shares close linguistic ties with Hindi. The preservation of Urdu periodicals holds immense value for researchers and future generations, as they encompass a wide array of significant topics concerning South Asia throughout the nineteenth and twentieth centuries.
2.3. Arabic Language
Arabic is the third most widely spoken language globally, trailing behind English and French. Approximately 292 million individuals use Arabic as their primary and official language across twenty-seven countries, and a significant number of people also understand it as a second language. Alongside English, French, Spanish, Russian, and Chinese, Arabic holds the distinction of being one of the official languages of the United Nations. Notably, Arabic is gaining popularity as a language to learn in the Western world, and numerous other languages have borrowed words from Arabic due to its historical significance. The intricacies of Arabic grammar can pose a challenge, both for native speakers of Indo-European languages and for machines attempting to accurately interpret and comprehend the Arabic language [
13,
14].
3. Digital News Story Preservation Initiative
The “Digital News Story Preservation (DNSP) Framework” was initiated in 2015 [
6]. The term “digital preservation” is broadly understood as “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary”. An initiative was undertaken to establish the “Digital News Stories Archive (DNSA)”, which comprises interconnected news articles in both the English and Urdu languages. The archive incorporates two types of linkages: English-to-English linkage and Urdu-to-English news article linkage. These linkages are established through preprocessing and by applying content-based techniques.
3.1. Importance of Digital News Story Preservation
Preserving access to historical records is crucial for writers conducting research; however, these preserved documents may be at risk of disappearing. While digital news content is widely available today, it is more vulnerable than print and can be scattered across different media and storage systems [
15].
The importance of digital news story preservation is summarized as follows:
Preservation and data backup are distinct concepts and should not be conflated. Digital media are susceptible to various failures, including file corruption, viruses, malware, damage, overwritten backups, server issues, and even natural disasters like earthquakes and tornadoes. These risks highlight the need for proper digital preservation, which encompasses a set of processes and activities specifically designed to ensure the long-term, sustained storage, access to, and interpretation of digital information. Preservation goes beyond mere data backup and focuses on maintaining the integrity and accessibility of digital content over time;
According to a survey by Educopia conducted in 2012, out of 60 newspaper companies, less than half keep their data and content for more than five years [
16]. Their news contents are also dispersed and distributed over multiple servers;
The preservation of news is vital for the advancement of society, as it enables citizens to stay well-informed and up-to-date on a wide range of events and news through journalism. By preserving news content, individuals are empowered with knowledge, ensuring they are equipped to make informed decisions and actively participate in their communities;
Preserving news is beneficial for writers and researchers, as it empowers them to craft relevant and contextualized stories. News preservation holds immense value as a historical record for society on a large scale, offering significant benefits;
Researchers require information such as births, deaths, marital status, business data, announcements, legal documents, property transaction papers, etc., with respect to the genealogical status of a person or even a community as a whole, which can be obtained from news or published-document archives;
Developed nations are actively engaged in archiving significant documents and newspapers and preserving news content to ensure future accessibility. When considering our social heritage, digital preservation becomes more important, especially during times when society relies on journalists to thoroughly investigate stories and produce impactful news;
Urdu news preservation: To uphold the rich heritage of the Urdu language, it is crucial for Urdu speakers to have a sincere commitment to preserving its essence. Urdu periodicals contain a diverse array of literary works encompassing significant topics from South Asia during the nineteenth and twentieth centuries, making their preservation highly valuable for researchers interested in the language. The preservation of Urdu news stories should be particularly significant for the people of Pakistan, as it is a country where Urdu is widely spoken and holds a unique position in the world [
17];
Arabic news preservation: Arabic is gaining popularity as a language of interest in the Western world, attracting an increasing number of learners. Throughout history, other languages have borrowed words from Arabic due to their significant contributions. However, the grammar of Arabic poses a challenge for native speakers of Indo-European languages and even for machines attempting to accurately interpret and comprehend the language [
4]. Arabic encompasses multiple varieties, including Standard Arabic, Classical Arabic, literary Arabic, and modernized Arabic. During the early Middle Ages, Arabic played a central role as a primary source for science, mathematics, culture, and philosophy. Preserving Arabic scripts is crucial, as they capture various grammatical changes that have occurred, reflecting the nuances found in colloquial variants.
3.2. Research Challenges
The news encompasses a wide range of events that are directly or indirectly connected to our social lives. These events include parliamentary actions, significant political occurrences for countries, court proceedings, government announcements, deaths, births, marriages, sports, etc. In the coming years, the responsibility of preserving these comprehensive journalistic records primarily lies with news outlets and newspaper organizations, ensuring their availability for future generations. Online news publications are generated and updated instantly, following a non-linear format, which means that they can disappear and become inaccessible. Based on existing data, it has been observed that approximately 80% of web pages become inaccessible within a year, and around 13% of links, particularly web references in scholarly articles, cease to function after approximately 27 months [
18,
19]. Consequently, the need to preserve online digital news for an extended duration has become imperative to ensure its safeguarding for future generations.
Even if a newspaper is backed up or archived by national archives and libraries, accessing specific information from multiple sources about a particular event may be challenging in the future. This challenge becomes even more complex when attempting to follow a story through an archive that comprises a vast collection generated from numerous news sources, each requiring different technologies to access the archived contents.
News archives are of two types, i.e., graphical-format archives and partially indexed archives, which makes it difficult to access particular news about an event because many challenges are encountered, such as:
Vast archive collections: an archive created from many sources;
Various sources on different platforms;
Multilingual archive: an archive created from multiple languages, i.e., Arabic, Urdu, and English;
Low-resource languages: access becomes more complicated when searching for news articles in low-resource languages, such as Urdu.
There are many difficulties in digital news preservation, such as:
Extraction of news from diverse sources and different technological platforms;
Extraction of explicit and implicit metadata;
Computing similarity values between news articles;
Conversion of news articles to a specific standard format for future integration and access, etc.
There are many challenges in accessing preserved digital news stories in archives, such as:
Locating and discovering a digital resource among a huge collection, such as a catalog or archive [
20];
The effectiveness of search mechanisms depends directly on how these objects are organized. Digital library management helps by providing support for identifying, describing, and locating resources;
Interoperability is the ability of different systems to exchange and use information together without losing content and functionality, representing a huge challenge in archive management [
21];
Providing mechanisms for digital objects to hold the data that prove their reliability, integrity, authenticity, and provenance [
22];
Storing information about the physical characteristics and documenting behavior so that it can be emulated in future technologies [
21]. For example, “the original XML instance of imported data is maintained to preserve all mappings and to be able to roundtrip the original” [
23];
During the object development phase, multiple versions of the same object may be created for preservation and dissemination. Thus, the same object may be present in multiple versions; metadata tracks all the information regarding different versions and changes in the object over time;
When seeking to reuse data collected for a different project, individuals must be able to locate and understand the data, placing a greater emphasis on trust and comprehension. Reusing data typically necessitates meticulous preservation and documentation of both the data content and the accompanying metadata.
3.3. DNSP Framework Enhancement
The primary purpose of the DNSP framework is to create a multilingual, multisource digital news stories archive to preserve digital news articles for the long term and for future generations. The framework is enriched with two low-resource languages, i.e., Urdu and Arabic. The challenges presented in previous sections regarding low-resource languages make it difficult to include these sources directly. The absence of efficient tokenizers, dictionaries, and other basic resources necessitates heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced DNSP framework are presented in
Figure 1.
3.4. Multilingual News Archive
The following section provides a brief introduction to the Digital News Stories Archive (DNSA). The primary concept behind the digital news story preservation (DNSP) framework was introduced at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) [
6]. The following contributions were made to the framework:
After analyzing 120 news archives worldwide, a comprehensive and generic systematic approach was proposed as a model for Web preservation. This approach entails a step-by-step procedure to be followed in web preservation projects [
24,
25];
A multisource web archive known as the Digital News Stories Archives (DNSA) was designed and developed to preserve online news articles originating from multiple sources [
1];
A digital news story extractor (DNSE) tool was specifically designed to extract news articles from diverse sources and compile them to form the DNSA. Its primary function is to collect and gather news articles from various online sources, ensuring their preservation within the DNSA [
26];
In the DNSA, we use content-based methods to link news articles during the preservation process. These methods rely on text features, such as the ratio of common terms based on their frequency [
27], named entities [
28], the position of terms, terms in the headline, the credibility of information, the distance between similar terms, etc., [
1];
Similarity measures were studied in the most relevant field of news recommender systems. A comprehensive study was performed on recommendation systems that can enhance the DNSP framework in different dimensions and improve its utility (a few of them are discussed in future work) [
29];
The CRMS technique was modified to operate on news headlines, reducing the extra computation over terms appearing in the news body when linking English news articles during preservation [
27];
The CRMS technique was updated for linking of Urdu-language news articles with English-language news articles in the DNSA [
30];
A heading-based technique was introduced for linking of news articles in the DNSA during the preservation process in the DNSP framework [
31].
The Digital News Stories Archive (DNSA) is a news archive created locally from multiple online sources that provide news in three languages, i.e., English, Urdu, and Arabic. Currently, the DNSA is archiving news articles from seven online newspapers and three local news television networks [
26] in English, five Urdu news sources, and four Arabic online news sources. The locally created archive preserves more than one thousand news stories per extraction run after removing duplicate URLs and news items.
The high-level system architecture is illustrated in
Figure 2. The figure depicts the process flow, starting from the ingest phase, where two mediators are employed to extract and incorporate metadata into the news story archive. Once the metadata are added, the news stories are archived and safeguarded for future use by generating an archival information package (AIP), as depicted in
Figure 3. Subsequently, the preserved contents can be accessed using the information dissemination package.
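The normalization step behind such a package can be sketched as follows; the element names below are assumptions for illustration, not the DNSA's actual archival schema:

```python
import xml.etree.ElementTree as ET

def to_xml(story):
    # Wrap a story and its metadata in a single XML record for archiving.
    root = ET.Element("newsStory")
    for key in ("source", "language", "date", "headline"):
        ET.SubElement(root, key).text = story[key]
    ET.SubElement(root, "body").text = story["body"]
    return ET.tostring(root, encoding="unicode")

record = to_xml({"source": "Dawn", "language": "en", "date": "2015-06-01",
                 "headline": "Example headline", "body": "Story text."})
print(record)
```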
To help readers access relevant news articles and enhance their understanding of various topics, the Digital News Stories Archive (DNSA) requires an effective mechanism for linking digital stories and recommending them to readers. This mechanism, as discussed in previous studies [
32,
33], aims to provide readers with a broader perspective and diverse viewpoints by comparing similar news articles from multiple sources. By establishing linkages between English-to-English news articles and English-to-Urdu news articles, the DNSA enables readers to browse through its extensive collection effortlessly. This linkage mechanism not only assists in authenticating information but also helps readers explore a wide range of news articles related to a particular topic.
Without an efficient search functionality, a news archive would essentially amount to a mere collection of news articles, lacking the ability to serve as a truly valuable information repository. To transform it into an effective repository, it is essential to implement a robust search functionality, which necessitates the use of indexing approaches and the establishment of a clearly defined set of metadata elements.
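The indexing approach mentioned above can be as simple as an inverted index mapping each term to the stories that contain it. A minimal sketch follows; a real archive search would add normalization, ranking, and metadata fields:

```python
from collections import defaultdict

def build_index(stories):
    # stories: mapping of story ID -> full text.
    index = defaultdict(set)
    for sid, text in stories.items():
        for term in text.lower().split():
            index[term].add(sid)
    return index

idx = build_index({1: "Flood hits Karachi", 2: "Election results in Karachi"})
print(sorted(idx["karachi"]))  # both stories mention Karachi
```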
3.5. Enhancing “Digital News Story Extractor (DNSE)”
The digital news story extractor (DNSE) is a Java-based tool for extracting digital news stories from online news websites using the Jsoup and Apache POI libraries. Initially, the DNSE was developed for English news sources [
26], then enhanced for Urdu news articles, and has since been further enhanced to include Arabic news sources and some features for quantification. The DNSE extracts news stories from online sources, extracts meta-information (i.e., metadata), and normalizes news content and related metadata into XML format for preservation in the DNSA. However, these enhancements encountered the following problems:
Non-uniform Web structure: There are many platforms and technologies for developing Web-based applications. Front-end technologies include HTML, CSS, and JavaScript and its frameworks, and back-end technologies include PHP, ASP.NET, and XML, among many others. Because different technologies are used, the Web structure varies; hence, extracting the desired information is challenging.
Recency or maintenance of fresh content: The Web contents of dynamic Web applications, such as blogs and news websites, update instantly and frequently. The recency of news content is very important to maintain efficiently, considering access frequency and network traffic issues.
Rise of anti-scraping tools: The biggest challenge in extracting news content is the rise of anti-scraping tools, e.g., CAPTCHA, which differentiate between bots and humans. The extractor stalls when anti-scraping tools are deployed.
Unknown host issue: An unreliable Internet connection leads to an unknown host issue; the extraction of news can be restarted, but the interruption is time-consuming.
Socket timeout: Most websites temporarily block or suspend their services when their contents are accessed too frequently within a specific period during preservation. Such websites treat the requests as bot traffic that needlessly overloads the server and begin blocking access.
Garbage collection: The inconsistency in development approaches leads to erroneous extraction by collecting unwanted data, such as in-text links, tags, or other code, during news extraction.
Identifying and preprocessing of low-resource languages: The DNSE tool deploys different libraries for the identification and preprocessing of low-resource languages, and the preprocessing is computationally expensive.
Firewall blocking: Few online news sources are protected from extraction using a firewall.
The developed extractor partially manages problems such as firewall blocking, the rise of anti-scraping applications, preprocessing, garbage collection, and non-uniform Web structure. It handles the maintenance of fresh news articles, unknown-host issues, and socket-timeout issues using different techniques and Web APIs. Extraction is important for any digital archive and especially challenging when preserving low-resource content. The enhanced DNSE can deal with the above challenges efficiently.
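The unknown-host and socket-timeout handling described above amounts to retrying failed fetches with a delay. The DNSE itself is Java-based; the following Python sketch of retry with exponential backoff is illustrative, and `fetch_with_retry` is an invented name, not the DNSE API:

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    # Retry transient failures (unknown host, socket timeout) with
    # exponentially growing delays between attempts.
    for attempt in range(retries):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky source: times out twice, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("socket timeout")
    return "<html>story</html>"

page = fetch_with_retry(flaky, "https://example.com/news", base_delay=0.01)
print(page, calls["n"])  # succeeds on the third attempt
```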
4. News Extraction Results
The “DNSA” is enriched with five sources that provide Urdu news articles and three online sources that publish news in the Arabic language. The details of the included news articles from all three languages are summarized in
Table 1.
The progress of research is hindered by the slow development of the DNSP framework, along with limited resources and insufficient financial support. Initially, three local English newspapers, namely Dawn News, The Tribune, and The News, were chosen as the test subjects for the DNSE tool. A total of 86,545 URLs (with an average of 2791 URLs) were extracted, including duplicate URLs for news stories. Among these extracted URLs, there were 23,843 unique URLs representing individual news stories (with an average of 769 unique URLs). The extraction results from the newspaper websites of Dawn News, The Tribune, and The News accounted for 6457 news stories (with an average of 208), 4914 news stories (with an average of 158), and 5713 news stories (with an average of 184), respectively [
26].
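The duplicate-URL removal reported above reduces crawled URL counts to unique news stories; a minimal order-preserving sketch:

```python
def unique_urls(urls):
    # Keep the first occurrence of each URL, preserving crawl order.
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

crawl = ["dawn.com/s1", "tribune.com/s2", "dawn.com/s1", "thenews.com/s3"]
print(unique_urls(crawl))  # 4 crawled URLs, 3 unique stories
```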
The new extraction/crawling results obtained after the DNSE was enriched with two low-resource languages, i.e., Urdu and Arabic, were carefully analyzed to identify shortcomings of the DNSP framework and the DNSE tool.
Table 2 presents the detailed extraction of the DNSE for both the HRL and LRLs.
The extraction results are visualized in
Figure 4 for all ten sources of the high-resource language, i.e., English. The results show that a few news sources do not update their news online frequently and could be replaced by other sources for more efficient utilization of the DNSP framework.
Evaluating the frequency of new story extraction is crucial due to the continuous and non-periodic nature of the news stream, unlike printed media.
The extraction process was carried out daily or at intervals of a few days.
Figure 5 presents the average count of extracted URLs and unique URLs. The figure illustrates that the number of newly extracted news URLs closely aligns with the count of new news stories obtained from online newspapers and various news channels.
The processing of low-resource languages is expensive in terms of time, complexity, and accuracy. The main problems with implementing the DNSE, including LRLs, are non-uniform web structure, unknown host issues, and garbage collection.
Figure 6 and
Figure 7 present the average extraction of new news articles and unique URLs, respectively.
Table 3 presents the error rate of URL and story extraction during preservation for both high- and low-resource languages. The LRLs are associated with a large error rate because of non-uniform web structure, unknown-host issues, maintenance of fresh content, anti-scraping tools, and garbage collection. It is observed that low-resource-language news sources are not as well maintained as high-resource-language news sources.
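The error rates in Table 3 follow the usual definition of failed extractions over attempted extractions; the figures below are illustrative, not taken from the table:

```python
def error_rate(failed, attempted):
    # Fraction of extraction attempts that failed; zero when nothing tried.
    return failed / attempted if attempted else 0.0

print(f"{error_rate(120, 800):.1%}")  # 15.0% of attempts failed
```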
5. Content-Based Bilingual Linking Techniques
The sheer number of online news articles necessitates the use of recommender system techniques to establish linkages between digital news stories during preservation. These techniques can be broadly categorized into collaborative filtering techniques and content-based techniques. Collaborative filtering techniques encounter several challenges, as they rely heavily on user opinions, demographics, and feedback to establish similarity [
1,
34]. The dynamic nature of both users and news articles complicates the accurate modeling of user preferences based on their past readings [
35,
36,
37]. User preferences and interests can change over time due to current events and the popularity of news articles [
38]. Furthermore, users tend to be hesitant in actively clicking or recommending news articles while browsing specific topics [
39]. On the other hand, content-based approaches recommend new items to users by calculating the similarity between the features of previously selected items. However, they encounter difficulties in determining the similarity between news articles that cover diverse topics and in accounting for potentially hidden factors that influence user choices [
40]. While most studies have focused on runtime similarity between recent articles, the approach adopted by the DNSA involves linking stories during preservation to ensure long-term accessibility and to benefit future generations.
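Since the cosine similarity measure (CSM) serves as the baseline in the comparisons that follow, a minimal term-frequency version can be sketched as follows. This is the standard formulation, not necessarily the exact tokenization or preprocessing used in the study:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Baseline cosine similarity (CSM) over raw term-frequency vectors:
    dot product of the two vectors divided by the product of their norms."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[t] * tb[t] for t in set(ta) & set(tb))
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Identical articles score 1.0 and articles sharing no terms score 0.0, which is why a tuned threshold is needed to decide when two stories should actually be linked.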
The common ratio measure for stories (CRMS) is a content-based similarity measure that has proven effective in linking digital news articles during preservation. The Digital News Story Archive (DNSA) ensures the future accessibility of related news articles by preserving and formatting linked news articles sourced from a vast corpus extracted from multiple sources. To determine the similarity among related news articles in the archive, a threshold value for CT/TT (a measure defined in the study) was established, representing the best indicator of similarity [
27]. The CRMS was evaluated extensively against the baseline similarity measure, namely the cosine similarity measure (CSM). Additionally, the CRMS was analyzed against human judgment through a user-based evaluation, providing empirical insights. The CRMS emerged as the most suitable measure for linking news articles, particularly dual-language news articles gathered from diverse sources during preservation and archive creation. The results of this evaluation and analysis are derived from a dataset of 5.3K news articles extracted from ten different news sources over ten days.
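The exact CRMS formulation and threshold are defined in the cited study; one plausible reading of the CT/TT ratio, assuming CT denotes the common terms between two stories and TT the total distinct terms, can be sketched as follows (the 0.3 cut-off is illustrative only):

```python
def crms(story_a, story_b, threshold=0.3):
    """Sketch of a common-ratio style measure: CT (common terms) over
    TT (total distinct terms). Returns the ratio and whether it clears
    the linking threshold. The real CRMS definition and its tuned
    threshold are given in the cited study; this is an approximation."""
    terms_a = set(story_a.lower().split())
    terms_b = set(story_b.lower().split())
    ct = len(terms_a & terms_b)   # common terms
    tt = len(terms_a | terms_b)   # total distinct terms
    score = ct / tt if tt else 0.0
    return score, score >= threshold
```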
Similarly, another content-based similarity measure, the similarity measure based on transliteration words (SMTW), is introduced based on the use of transliterated words. English transliteration words are commonly used in Urdu text, and this practice is anticipated to play a vital role in linking Urdu news articles with English news articles. These linking mechanisms link formatted news articles to ensure the future accessibility of related news articles from an enormous corpus extracted from multiple sources in the DNSA [
41]. A comprehensive empirical evaluation was conducted to assess the effectiveness of the SMTW. The results demonstrate that the SMTW is the most viable metric for linking news articles, strongly supporting the linking of dual-language news articles obtained from various sources during preservation and archive creation.
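The intuition behind the SMTW can be illustrated with a toy sketch. The mini-lexicon below is hypothetical and far smaller than any real transliteration resource, and the scoring is a simplification of the published measure, not its actual definition:

```python
# Hypothetical mini-lexicon mapping Urdu transliterations of English
# loan words back to English (e.g. کرکٹ -> "cricket").
TRANSLIT_LEXICON = {
    "کرکٹ": "cricket",
    "فٹبال": "football",
    "پارلیمنٹ": "parliament",
}

def smtw_score(urdu_tokens, english_tokens):
    """Sketch of the idea behind the SMTW: map the transliterated (loan)
    words found in an Urdu story to English, then measure their overlap
    with the English story's vocabulary."""
    mapped = {TRANSLIT_LEXICON[t] for t in urdu_tokens if t in TRANSLIT_LEXICON}
    english = {t.lower() for t in english_tokens}
    if not mapped:
        return 0.0
    return len(mapped & english) / len(mapped)
```

Because sports coverage is rich in such loan words (team names, game terminology), an overlap measure of this kind is especially discriminative there, consistent with the finding that the SMTW works well for sports news.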
“Transliteration is a process of using the text of one script in another script or the process of converting text from one language to another”. In the field of linguistics, the act of incorporating a word or a group of words from one language into another language’s writing scripts is known as borrowing. These borrowed words are commonly referred to as loan words [
42]. When it comes to transliteration, the process of representing a phrase or word from one language to another that uses a different writing system, challenges may arise, especially if the languages in question have distinct sounds and writing scripts [
43,
44].
7. Conclusions and Future Work
The preservation of news stories is of great significance for multiple reasons. These stories offer in-depth information about events that encompass our culture and heritage, making them invaluable resources, and preserving them ensures their availability for long-term research. However, news stories published online are in danger of being lost because of constant changes in the technologies used by online publishing sources and the formats used by platforms. The preservation of news and the creation of news archives is challenging, and it becomes even more complicated when an archive contains articles from a low-resource, morphologically complex language like Urdu or Arabic. This study introduces a multilingual news archive for Urdu, Arabic, and English news sources published online on eighteen news publishing platforms. The digital news stories extractor was enhanced to address major issues in implementing low-resource languages and to facilitate normalized format migration. The extraction results are presented in detail for a high-resource language, i.e., English, and low-resource languages, i.e., Urdu and Arabic. The LRLs encountered a higher error rate during preservation than the HRL: 10% versus 3%, respectively. The extraction results also show that two of the news sources are not regularly updated and release very few new stories online. The Digital News Stories Archive framework successfully preserved an average of 879 news articles from ten high-resource-language (HRL) sources and 553 news articles from eight low-resource-language (LRL) sources. In the context of the DNSA, we compared two bilingual linking mechanisms, namely the common ratio measure for dual language (CRMDL) and the similarity measure based on transliteration words (SMTW), for linking Urdu news articles to English news articles. The SMTW demonstrated superior results compared to the CRMDL and the CSM.
It was observed that approximately 78% of Urdu news articles contained transliterated words. Precision improved from 46% and 55% to 60%, while recall improved from 64% and 67% to 82%. The impact of common terms also improved. Notably, the SMTW proved effective and feasible for sports news.
This research highlights the challenges encountered with low-resource languages (LRLs) and explores the associated research challenges. It also presents the improvements made to the framework and emphasizes the necessity of a comprehensive investigation to ensure precise extraction and archiving of news content for future retrieval. Furthermore, the framework can be extended in various directions, such as:
Thorough analysis of the Arabic script to facilitate multilingual linking;
Implementation of a standardized user interface to provide access to the archived contents of the DNSA;
Development of the DNSE tool to professional standards;
Expansion of meta attributes to accommodate multilingual archives, including languages such as Urdu, Arabic, and Pashto;
Addition of implicit meta elements to the proposed set after a comprehensive review of individual sources;
Enhancement of the structure of the Urdu-to-English lexicon and optimization of the bag of Urdu words to improve processing efficiency;
Development of advanced content-based similarity measures utilizing features such as weighted terms, named entities, term position, and contextual information from news articles;
Incorporation of crosslingual techniques in the DNSA for linking multilingual archived news;
Proposal of metadata elements for the digital news story preservation framework for efficient archive management and information dissemination;
Definition of a more comprehensive set of generic elements for well-structured and well-populated online sources.