1. Introduction
The Semantic Web is a visionary extension of the traditional World Wide Web, aimed at making data more accessible and interpretable by machines. As a transformative approach, the Semantic Web embeds structured, machine-readable information into web content, enabling both artificial and human agents to understand and navigate content effectively [1,2]. This capability is particularly impactful in fields such as e-learning and news web portals, where structured data improve article indexing and presentation, helping platforms like Google News enhance content discoverability and relevance [3,4].
One key advancement in this area is rich results, which enable search engines to present enhanced, visually informative search results that may include images, star ratings, and other contextual information. These rich results rely on structured data annotations embedded directly within the HTML of web pages, typically using formats like JavaScript Object Notation for Linked Data (JSON-LD). JSON-LD is a lightweight linked data format that allows website developers to add structured metadata to web content in a way that is both machine-readable and straightforward to implement. By embedding JSON-LD annotations, web developers can provide essential details to search engines without altering visible content, ensuring better indexing and enhancing search visibility [5,6].
Such structured data significantly improve a website’s search engine optimization (SEO), a collection of techniques aimed at enhancing a website’s visibility and ranking on search engine results pages (SERPs). SERPs are the pages displayed by search engines in response to a user’s query, and they are instrumental in attracting online traffic by presenting a ranked list of relevant links, summaries, and often enhanced features like images or videos. By optimizing elements like keywords, content, and links, SEO strategies help elevate a website’s SERP position, which is critical for online visibility and user engagement [7]. JSON-LD’s compatibility with SEO efforts allows it to boost site visibility by making important data accessible for indexing [8]. However, many legacy websites lack these structured data annotations, making it challenging for search engines to accurately index their content, thereby negatively impacting SEO performance, visibility, and user engagement [9,10,11].
Legacy news websites often lack the tools and resources to implement structured data, resulting in poor visibility and limited engagement on search platforms. This creates a significant barrier to information accessibility and discoverability, as valuable content becomes harder to find for users relying on modern search engines. By addressing this gap, our research seeks to transform how legacy news content is indexed and presented, making essential information more accessible to the public.
Motivated by this challenge, our research addresses the gap in structured data for legacy news websites, which are often excluded from modern content management system (CMS) tools that facilitate structured data generation. This paper introduces a comprehensive solution for automatically generating JSON-LD scripts tailored to news websites, bridging the gap between unstructured and structured web data. Our approach is driven by two primary innovations. The first is our use of advanced pattern mining techniques coupled with modern natural language processing (NLP) models, specifically OpenAI’s GPT-3. GPT-3 not only helps accurately extract relevant article features such as titles, dates, and content structure, but also enables content refinement by removing irrelevant text and advertisements. This ensures that the generated annotations are clean, focused, and of high quality, making the structured data both accurate and SEO-optimized.
Second, we introduce a robust validation framework to ensure that the generated structured data align with search engine standards. Unlike previous approaches, which often lack comprehensive validation steps, our framework systematically removes pre-existing JSON-LD scripts, regenerates them using our algorithm, and rigorously verifies their quality and compliance through both syntactic and semantic checks. Specifically, we represent the article data as a feature matrix of size n × m, where n is the number of articles and m is the number of extracted features (e.g., headline, author, date, and content). Our goal is to transform this feature matrix into a structured data matrix of size n × k, where k denotes the number of JSON-LD elements required for structured data compliance. This transformation is optimized to minimize discrepancies between the generated JSON-LD and the original content data, ensuring that the annotations are both accurate and compliant with search engine requirements.
To assess the effectiveness of our approach, we additionally define a matching function that quantitatively measures the alignment between the original article features and the generated JSON-LD annotations. This metric provides a comprehensive assessment of the automated annotation process and supports our claim that the generated data align closely with the original content.
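One possible formalization of this matching function is sketched below; the symbols F (the feature matrix), S (the generated structured data matrix), the alignment map φ, and the word-overlap similarity are illustrative assumptions rather than the exact metric used in our evaluation.

```latex
% Illustrative matching function: average word overlap over the k aligned fields
% (headline, article body, etc.) of the n articles.
M(F, S) = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k}
          \operatorname{sim}\bigl(F_{i,\varphi(j)},\, S_{i,j}\bigr),
\qquad
\operatorname{sim}(a, b) = \frac{\lvert \operatorname{words}(a) \cap \operatorname{words}(b) \rvert}
                                {\lvert \operatorname{words}(a) \rvert}
% where \varphi maps each JSON-LD element j to its corresponding extracted feature.
```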
Our work contributes to the Semantic Web ecosystem by enhancing the accessibility, discoverability, and indexing of news articles on the web. Unlike previous solutions, which often focus on structured data for modern websites, our approach specifically targets legacy news websites, making it possible to bring these sites up to current standards. By enabling the automatic generation and validation of JSON-LD annotations, our framework has the potential to significantly increase the visibility and accessibility of news content on search engines. Moreover, this innovation has broader implications for the web as a whole, as it can be adapted for various types of unstructured pages, paving the way for a more structured and accessible Internet.
The broader implications of our solution extend beyond news articles. By creating an adaptable and generalizable framework, this work sets a foundation for structured data generation across various unstructured content types, including blogs, forums, and other legacy content platforms. This could lead to a more interconnected and easily searchable web ecosystem, ultimately facilitating a more inclusive and information-rich web environment.
Structure of the Paper
This paper is organized into several sections that collectively present our approach to enhancing news web pages for integration into the Semantic Web.
Section 2, Literature Review, surveys prior work on metadata generation and automated content extraction, highlighting both historical developments and contemporary approaches and situating our contribution within the broader context of Semantic Web research.
Section 3, Methodology, outlines our novel approach, illustrated by a flowchart. Here, we detail how our methodology infers linked data from web page content without prior knowledge of the HTML structure, utilizing articles from Google News as the primary data source.
Section 4 presents the implementation, where we elaborate on the practical application of our methodology. In Section 5, Experiment and Results, we discuss the outcomes of our approach and provide critical insights based on our findings. Following this, Section 6 is dedicated to Impact Analysis, assessing the technological, social, and business implications of our solution. Finally, Section 7 concludes this paper by summarizing our findings and emphasizing the significance of our contributions to the field of Semantic Web research.
2. Literature Review
This section provides a comprehensive review of the existing literature, organized into two main threads: (i) the generation of metadata and linked data from raw text and (ii) methods for the automated extraction and organization of data to be used in metadata generation. The review underscores both historical developments and current approaches, contextualizing our contribution within this body of knowledge.
2.1. Metadata Generation and Linked Data Representation
The vision of the Semantic Web, first articulated over two decades ago by Tim Berners-Lee and his team, has been fundamental in reshaping how web content is perceived by both humans and machines. Berners-Lee’s goal was to create a web where information is not only accessible by humans but also structured in a way that machines can understand and process autonomously. This vision laid the groundwork for the development of standards and technologies aimed at transforming the web into a more intelligent system.
To this end, the World Wide Web Consortium (W3C) has defined key standards for representing structured data on the web. Two of the most significant are the Resource Description Framework (RDF) and the Web Ontology Language (OWL) [12]. RDF provides a syntax for encoding relationships between resources, while OWL allows for the creation of complex ontologies, i.e., formal vocabularies that define concepts within specific domains. These frameworks have seen increasing adoption across academia and industry [13]. However, for these technologies to reach their full potential, it is essential to develop domain-specific vocabularies (ontologies). Many of these vocabularies have been crafted manually for fields such as healthcare, law, and geography, but advances in machine learning are enabling the automatic or semi-automatic generation of ontologies from large text corpora [14]. This progress reduces the time and expertise required to build and maintain these knowledge representations, making the Semantic Web more accessible and scalable.
Although the foundations of the Semantic Web are well established, recent studies highlight further advancements that improve the practical scalability of metadata generation and linked data integration. For instance, modern language models, such as GPT-based models, are being applied to support automated metadata generation, event analysis, and sentiment detection, which enhances structured information retrieval and data interconnectivity in real-time applications. Despite these advances, the need remains for streamlined methodologies to support automatic metadata and structured data generation across diverse domains.
Table 1 provides a comparative summary of key related works, focusing on modern techniques and highlighting the limitations in existing methodologies, such as limited generalizability or high resource demands for model training and deployment.
By focusing on the capabilities and limitations of these methodologies, this review underscores the research gap addressed by our approach: a scalable, efficient, and semi-automated solution that integrates both structured metadata generation and linked data representation, enabling adaptive, domain-agnostic content extraction.
2.2. Automated Data Acquisition and Content Extraction
Extracting the main content from unstructured web pages has been a long-standing challenge in web data mining. Early approaches focused on parsing the Document Object Model (DOM) structure of web pages. For instance, researchers at Columbia University proposed a method to detect the largest text body within a web page, classifying it by counting the words in each HTML element [20]. This approach, which is the basis of our method for identifying the body of an article, proves effective for detecting main content sections but struggles with title identification due to the variability in web page layouts.
More sophisticated content extraction methods have since been proposed. For example, ref. [21] introduced an approach combining structural and contextual analysis to improve the accuracy of content extraction. While this method has shown promise, it has yet to be fully implemented, suggesting that further refinement and experimentation are needed.
The literature reveals two dominant approaches in the field of web data extraction.
2.2.1. DOM-Based Approaches
DOM-based approaches leverage the structural properties of HTML documents to identify and extract relevant content. These methods rely on traversing the DOM tree to locate elements such as text bodies and headings based on properties such as size, position, and relative relationships between elements [22,23,24]. While effective in well-structured environments, the major limitation of DOM-based methods is their sensitivity to changes in web page layouts. As web pages evolve, their structural composition can change, rendering static DOM-based techniques less effective over time.
2.2.2. AI-Driven Approaches
Artificial intelligence (AI)-based methods represent a more dynamic approach to content extraction. These techniques typically involve training machine learning models on large, labeled datasets of web pages. The models learn to recognize patterns and features within the page that correspond to different types of content, such as article titles, bodies, and advertisements [25,26,27]. AI-based methods have demonstrated strong performance, particularly in handling diverse web layouts. However, they require significant computational resources and large amounts of labeled data for training. Moreover, the generalization of these models to new, unseen web layouts can be challenging, especially when the underlying page structure differs significantly from the training data.
2.3. Challenges and Opportunities
The comparison of DOM-based and AI-driven approaches underscores distinct trade-offs. AI-based methods, although flexible and adaptive, necessitate significant infrastructure and are data-intensive, with performance reliant on labeled training data. Conversely, DOM-based methods, though lightweight and efficient for structured content, are sensitive to web page architecture changes, necessitating frequent updates.
Our research adopts an enhanced DOM-based approach for article data extraction from web pages, combining DOM structure traversal with pattern mining techniques for greater accuracy and adaptability. This method allows for structured data extraction (e.g., titles and article bodies) without requiring intensive computational resources, aligning well with the regularity found in news articles. By reintegrating extracted data back into web pages as linked data, our approach not only supports metadata generation but also aligns with linked data principles, fostering more intelligent and connected web resources [28,29,30,31,32,33].
This refined DOM-based methodology offers a scalable alternative to AI-driven solutions, addressing the challenges of both traditional DOM-based and AI-based methods. By synthesizing structured pattern recognition with linked data re-injection, our approach advances the efficiency and resilience of content extraction in web data mining.
3. Methodology
Our methodology, depicted in the flowchart in Figure 1, enables the inference of linked data from the entire web page content without requiring prior knowledge of the structure of the source HTML code. The results produced are comparable to those generated by web page authors.
In Figure 1, the flowchart illustrates the workflow used to process news articles for integration into the Semantic Web. The process begins with exporting top news articles from Google News, followed by a check to verify whether each article includes a JSON-LD object. Articles lacking JSON-LD are marked as “rejected”, while those with JSON-LD continue to the next step as “Approved Articles”. Subsequently, these approved articles are evaluated for their “Rich Text” quality, meaning they have sufficient structure for further processing. Articles failing this check are rejected, while those passing it are finalized as “100% Rich Text articles”. Additional steps are applied to add or remove JSON-LD where necessary, ensuring the structured readability and integration of news content.
To perform the analysis, we utilized Google News as our source of articles, each of which already has a linked data JSON object (JSON-LD) attached to its page, hereafter referred to as the original JSON-LD. Our approach employs a general data extraction method that applies pattern mining to news sites, as published in [34]. The pattern miner scrapes the title and body of the news article to extract the plain text content. The text with the largest font size on the page is identified as the title, while the largest div element (by character length) is designated as the body, as illustrated in Figure 2.
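As a rough illustration of this pattern-mining step, the sketch below uses the requests and Beautiful Soup libraries mentioned in Section 4; the inline font-size heuristic, the og:image fallback, and the function name extract_article are simplifying assumptions of ours and are not taken verbatim from the implementation in [34].

```python
import re

import requests
from bs4 import BeautifulSoup


def extract_article(url: str) -> dict:
    """Heuristic extraction: largest-font element as title, largest <div> as body."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Title: prefer the element with the largest inline font-size, fall back to <h1>.
    sized = []
    for el in soup.find_all(style=re.compile(r"font-size")):
        match = re.search(r"font-size:\s*(\d+(?:\.\d+)?)", el["style"])
        if match:
            sized.append((float(match.group(1)), el.get_text(strip=True)))
    if sized:
        title = max(sized, key=lambda pair: pair[0])[1]
    else:
        h1 = soup.find("h1")
        title = h1.get_text(strip=True) if h1 else ""

    # Body: the <div> whose text content is the longest (by character length).
    body = max((d.get_text(" ", strip=True) for d in soup.find_all("div")),
               key=len, default="")

    # Additional properties: lead image (og:image if present) and the page URL.
    image = soup.find("meta", property="og:image")
    return {
        "headline": title,
        "articleBody": body,
        "image": image.get("content") if image else None,
        "url": url,
    }
```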
Furthermore, we extended our method from [34] to clean the title and article body after scraping, removing unrelated words, links, images, and advertisements using OpenAI’s GPT-3 API [35]. Specifically, GPT-3 was employed to identify and filter out irrelevant elements. This includes extraneous words or phrases unrelated to the core news content, promotional links and advertisements, references to images, and other non-essential elements. This process ensures that the structured data focus exclusively on the relevant article information, facilitating seamless integration into the Semantic Web.
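A minimal sketch of this cleaning step is given below, assuming the current OpenAI Python client; the prompt wording, the helper name clean_article_text, and the chat model standing in for the GPT-3 endpoint used in our experiments are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLEANING_PROMPT = (
    "You will receive the raw text of a news article. Remove advertisements, "
    "promotional links, image captions, and any text unrelated to the article "
    "itself. Return only the cleaned article text."
)


def clean_article_text(raw_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the language model to strip boilerplate and return the clean article text."""
    response = client.chat.completions.create(
        model=model,  # stand-in model name; the experiments used a GPT-3 endpoint
        messages=[
            {"role": "system", "content": CLEANING_PROMPT},
            {"role": "user", "content": raw_text},
        ],
        temperature=0,  # deterministic cleaning, no creative rewriting
    )
    return response.choices[0].message.content.strip()
```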
Utilizing this methodology, we extract the most important properties of the JSON-LD object, such as the title, article body, images, and URLs, thus creating a new JSON object, the generated JSON-LD. To verify our results, we compare the generated JSON-LD with the original using two distinct methods: the Google Rich Results Checker and a word-by-word content comparison. While the Rich Results Checker serves as the primary validation tool, the word-by-word verification ensures content accuracy, as the Rich Results validator only checks the structural integrity of the object.
To quantify the content similarity, we utilize the Jensen–Shannon Divergence [37,38,39], which measures the similarity between two probability distributions P and Q. This divergence is defined as follows:

$\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \parallel M)$,

where $M = \tfrac{1}{2}(P + Q)$ and $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence, defined as

$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \log \dfrac{P(x)}{Q(x)}$.

This allows us to quantify the differences in content distribution between the original JSON-LD and the generated JSON-LD.
Additionally, we can measure the entropy $H$ of the generated content using the following equation:

$H = -\sum_{i} p_i \log p_i$,

where $p_i$ represents the probability of occurrence of each unique word $w_i$ in the content.
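For concreteness, the following sketch computes these quantities over word-frequency distributions of two article bodies; the whitespace tokenization and base-2 logarithms are our own choices, not prescribed by the methodology.

```python
import math
from collections import Counter


def word_distribution(text: str, vocab: list[str]) -> list[float]:
    """Relative word frequencies of `text` over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence D_KL(P || Q); zero-probability terms are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between distributions P and Q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def entropy(p: list[float]) -> float:
    """Shannon entropy H of a word distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


# Usage: compare the article bodies of the original and generated JSON-LD objects.
original_body = "the president visited the city on monday"
generated_body = "the president visited the city"
vocab = sorted(set(original_body.split()) | set(generated_body.split()))
p = word_distribution(original_body, vocab)
q = word_distribution(generated_body, vocab)
print(js_divergence(p, q), entropy(q))
```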
The methodology can be succinctly represented in five steps:
Extraction of News Articles: gather news web pages containing the original JSON-LD object.
Analyze/Remove JSON-LD: remove the original JSON-LD object and save it externally for comparison.
Generate and Inject JSON-LD: use pattern mining to generate a new JSON-LD object and inject it into the original page to replace the original.
Check JSON-LD: validate the new page with the injected JSON-LD using the Rich Results Checker.
Check the JSON-LD Content: compare the content word by word to compute a similarity score between properties.
The following pseudocode outlines the algorithms utilized in our methodology for extracting and generating JSON-LD annotations.
3.1. Extraction of News Articles
Algorithm 1 describes the process of extracting news articles from web pages.
Algorithm 1 ExtractNewsArticles
procedure ExtractNewsArticles(URL)
    Fetch the web page content from the URL
    Parse the HTML content
    Identify the title:
        title ← FindElementByLargestFontSize()
    Identify the body:
        body ← FindElementByLargestDiv()
    Identify additional properties:
        img ← FindImageElements()
        url ← FindPageURL()
    return (title, body, img, url)
end procedure
3.2. Cleaning the Article Body
Algorithm 2 cleans the extracted article body by removing unrelated content, such as stop words and images.
Algorithm 2 CleanArticleBody
procedure CleanArticleBody(RawArticleBody)
    Initialize cleanedBody as an empty string
    for each word in RawArticleBody do
        if word is a stop word, a link, or an image reference then
            continue
        else
            cleanedBody ← cleanedBody + word
        end if
    end for
    return cleanedBody
end procedure
3.3. Generation of JSON-LD
Algorithm 3 generates the new JSON-LD object based on the cleaned title and body, including other relevant properties.
Algorithm 3 GenerateJSONLD
Input: Title, Cleaned Body, Image, URL
Output: Generated JSON-LD object
Initialize an empty JSON object jsonLD as {}
Set jsonLD[“title”] ← Title
Set jsonLD[“body”] ← Cleaned Body
Set jsonLD[“image”] ← Image
Set jsonLD[“url”] ← URL
Add any additional metadata to jsonLD
return jsonLD
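A hedged Python counterpart of Algorithm 3 is shown below; the schema.org NewsArticle type with headline, articleBody, image, and url properties follows the schema.org vocabulary, while the wrapping helper generate_json_ld and the exact property set are illustrative assumptions rather than our full implementation.

```python
import json


def generate_json_ld(title: str, cleaned_body: str, image: str | None, url: str) -> str:
    """Build a schema.org NewsArticle JSON-LD script tag from the extracted fields."""
    json_ld = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": title,
        "articleBody": cleaned_body,
        "image": [image] if image else [],
        "url": url,
    }
    # ensure_ascii=False keeps Arabic article text readable in the annotation.
    script = json.dumps(json_ld, ensure_ascii=False, indent=2)
    # Wrap in the script tag that is injected back into the page's <head>.
    return f'<script type="application/ld+json">\n{script}\n</script>'
```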
3.4. Validation of JSON-LD
Algorithm 4 validates the generated JSON-LD using the Google Rich Results Checker and performs a content comparison.
Algorithm 4 ValidateJSONLD
Input: Original JSON-LD, Generated JSON-LD
Output: Validation result
Validate the JSON structure using the Google Rich Results Checker:
    validationResult ← CheckRichResults(Generated JSON-LD)
Perform a word-by-word comparison between the original and generated JSON-LD:
    similarityScore ← CompareWordByWord(Original JSON-LD, Generated JSON-LD)
return (validationResult, similarityScore)
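A simplified sketch of the word-by-word verifier in Algorithm 4 might look as follows; the overlap-based similarity and the field names are assumptions for illustration, and the structural check itself is performed externally through the Google Rich Results Test rather than in code.

```python
from collections import Counter


def word_by_word_similarity(original: str, generated: str) -> float:
    """Percentage of the original field's words that reappear in the generated field."""
    orig = Counter(original.lower().split())
    gen = Counter(generated.lower().split())
    shared = sum((orig & gen).values())  # word occurrences present in both fields
    return 100.0 * shared / max(sum(orig.values()), 1)


def validate_json_ld(original_ld: dict, generated_ld: dict) -> dict:
    """Compare the title and body fields of the original and generated JSON-LD objects."""
    return {
        "title_similarity": word_by_word_similarity(
            original_ld.get("headline", ""), generated_ld.get("headline", "")
        ),
        "body_similarity": word_by_word_similarity(
            original_ld.get("articleBody", ""), generated_ld.get("articleBody", "")
        ),
    }
```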
4. Implementation
This section presents a detailed review of the implemented solution, fulfilling all the requirements outlined earlier. We begin with an overview of the framework architecture, followed by an in-depth description of the individual components, as depicted in Figure 3.
Figure 3 illustrates the architecture overview of our system, which is organized into three main layers: Extraction, Modification, and Verification. The Extraction Layer is responsible for scraping news content and extracting JSON-LD metadata properties from the target web pages. This is followed by the Modification Layer, where the system can either remove existing JSON-LD elements or inject new JSON-LD code. Finally, the Verification Layer ensures the accuracy and completeness of the extracted metadata using two verification methods: a Rich Text Verifier and a Word-by-Word Verifier. These layers work together to streamline and validate the extraction and modification processes, forming the core structure of our methodology.
Figure 4 shows the Google Rich Results Test website, which is used to validate the structured data of a URL. After submitting a URL, the tool analyzes the page’s structured data and provides a report on the number of valid items detected, the types of rich results found, and any issues or errors in the data implementation. This step is essential for ensuring that the structured data on the news articles are properly formatted and eligible for enhanced presentation in Google search results.
4.1. Data Extraction Layer
The Data Extraction Layer is responsible for collecting raw data from news sources.
Web Scraping Module: This module employs the Beautiful Soup library (Python) for HTML parsing, extracting the main content sections of Google News articles. The ‘requests’ library (Python) is used to make HTTP requests and retrieve HTML content. These libraries were chosen for their efficiency in handling and parsing HTML structures commonly found in news websites.
Data Storage: The extracted data are stored in a NoSQL database, specifically MongoDB, chosen for its flexibility in handling semi-structured data and compatibility with JSON format. MongoDB also enables efficient querying and manipulation of data, which is essential during the analysis and transformation stages.
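A minimal storage sketch using the pymongo driver is shown below; the database and collection names, and the one-document-per-URL layout, are illustrative assumptions rather than the exact schema used in our deployment.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # local MongoDB instance assumed
collection = client["news_semantics"]["articles"]   # illustrative database/collection names


def store_article(article: dict, generated_json_ld: dict) -> None:
    """Persist the scraped fields together with the generated JSON-LD document."""
    collection.update_one(
        {"url": article["url"]},                                      # one document per URL
        {"$set": {"article": article, "json_ld": generated_json_ld}},
        upsert=True,
    )
```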
4.2. Transformation Layer
The Transformation Layer is tasked with preparing the extracted data for further processing.
Pattern Mining Algorithm: The pattern mining algorithm is implemented using Python’s native capabilities for text processing and rule-based pattern matching. By utilizing Python’s regular expressions and HTML parsing techniques, this algorithm identifies the title and body of news articles based on attributes like font size and HTML element type.
Data Cleaning Module: To remove irrelevant content such as unrelated words, links, images, and advertisements, we employ the OpenAI API (specifically GPT-3), which uses advanced natural language processing (NLP) to distinguish relevant content from noise. This API was chosen due to its high accuracy in understanding and processing textual data, ensuring only relevant information remains in the title and body sections of the articles.
4.3. Loading Layer
The Loading Layer prepares the cleaned data for linked data integration and storage.
JSON-LD Generation: We use a custom Python module to convert the cleaned and structured data into JSON-LD format, adhering to schema.org standards for linked data representation. JSON-LD was selected as it is widely supported by search engines, improving the interoperability and discoverability of content.
Database Integration: The generated JSON-LD data are stored back in MongoDB, allowing for seamless retrieval and future validation of the data. MongoDB’s ability to store JSON-like documents allows us to store linked data efficiently alongside metadata.
4.4. Validation Layer
The Validation Layer verifies the quality and accuracy of the generated JSON-LD. It combines the two verification methods introduced above: the Rich Text Verifier, which submits the annotated pages to the Google Rich Results Test to check structural validity, and the Word-by-Word Verifier, which compares the generated JSON-LD content against the original article text.
The implementation described above provides a comprehensive framework using open-source libraries, a powerful NLP API, and widely supported data storage solutions to achieve the efficient extraction, transformation, generation, and validation of linked data from news articles. This framework facilitates enhanced data accessibility and interoperability on the web.
5. Experiment and Results
The dataset used for the experiment consists of 1100 web pages from 18 reputable news sources, written in both English and Arabic. The original and generated JSON-LD objects for each article are compared using word-by-word matching and validated using the Google Rich Results Test.
Figure 5 and Figure 6 illustrate examples of the original and generated JSON-LD objects, respectively. A quantitative comparison of the performance across various websites is presented in Table 2, including the rich results eligibility status (Pass/Fail) and the percentage of matches between the generated JSON-LD files and the ground truth data with respect to the title and article body fields.
The dataset evaluation is summarized in Table 2, indicating the pass rates and similarity scores for the title and article body, broken down by website.
To quantify the similarities between the original and generated JSON-LD objects, we used a word-by-word comparison tool.
Figure 7 and Figure 8 display the similarity percentages for the article body and title, respectively, as histograms overlaid with density plots.
In Figure 7, the histogram shows the distribution of similarity scores between the original and generated JSON-LD objects for the article body across multiple articles from independent.co.uk (accessed on 10 October 2024). The x-axis represents the percentage similarity, while the y-axis shows the frequency of articles falling within each similarity range. The density plot superimposed over the histogram provides a smooth curve, highlighting the general distribution pattern and concentration of scores, with most values clustering above 90%.
Similarly, Figure 8 presents the similarity percentages for the article title, specifically from articles on Skynewsarabia.com. The histogram reflects the frequency of articles with specific similarity scores, while the density plot provides a smoothed view of the data distribution. Like the article body, the title similarity scores are predominantly above 90%, indicating that the generated JSON-LD objects closely match the original titles in most cases.
These visualizations support our quantitative analysis, demonstrating that the generated JSON-LD objects maintain high similarity to the original data, validating the effectiveness of our framework in accurately replicating article information.
The implemented framework successfully scraped and processed news articles from diverse sources, generating accurate JSON-LD objects with high similarity to the original data. Validation through Google Rich Results indicated high pass rates, and content similarity consistently exceeded 90%. Future work could involve enhancing the data cleaning process and expanding the dataset to include more websites in multiple languages.
6. Impact Analysis
The implications of our proposed solution for automatically generating JSON-LD annotations extend across several dimensions, including technological advancements, business benefits, and social impacts. This multifaceted impact underscores our work’s significance in the Semantic Web context and broader data accessibility initiatives.
6.1. Technological Impact
From a technological standpoint, our work significantly contributes to the evolution of the Semantic Web by enhancing the machine-readability of web content. The automated generation of structured data through advanced algorithms and language models, such as GPT-3, facilitates the integration of artificial intelligence into web development processes. One of the key benefits of this approach is its scalability; it allows for the rapid scaling of structured data across millions of web pages, accommodating the growing demand for data-driven insights and services. Furthermore, by adhering to JSON-LD standards, the generated structured data enhance interoperability between different systems and platforms, enabling better data exchange and collaboration among various stakeholders in the digital ecosystem. This also leads to enhanced SEO, as the implementation of structured data improves visibility in search engine results, providing a competitive advantage to websites that adopt our solution [40].
6.2. Business Impact
The business ramifications of our solution are equally substantial. Automating structured data generation, such as JSON-LD, not only reduces the manual effort needed for web maintenance and SEO, but also translates into significant cost savings for organizations. This efficiency allows resources to be reallocated to more strategic areas, enhancing overall operational productivity. Moreover, the implementation of structured data has been proven to positively influence engagement metrics, as shown in a recent study on SEO’s effect on the business performance of a private university in Sarajevo. This case study demonstrated that increasing search engine visibility through structured data led to a notable rise in site visits and user engagement, and ultimately boosted revenue through increased student enrollments [41].
Businesses that adopt structured data generation solutions like ours gain a competitive edge by offering enhanced user experiences and better-targeted content. With improved access to structured data, organizations can leverage advanced analytics for data-driven decision-making, optimizing content strategies to better engage their target audiences. The demonstrated effectiveness of SEO in this case, along with the wider adoption of structured data for improved business performance, highlights the transformative potential of our solution for diverse organizations.
6.3. Social Impact
On a societal level, our work addresses critical issues related to information accessibility and digital equity. By improving the indexing of news content, our solution ensures that users have better access to timely and relevant information, fostering a more informed society. Moreover, smaller news organizations that may lack the resources for sophisticated SEO strategies can leverage our automated solution to enhance their online presence, thus leveling the playing field in the digital landscape. This increased visibility not only empowers these organizations but also promotes a more diverse media landscape by ensuring that a wider range of perspectives and narratives are accessible to the public. Consequently, our work contributes to a more democratic society by facilitating the dissemination of varied viewpoints. Research has shown that businesses using digital marketing strategies like SEO and social media see substantial improvements in visibility and engagement, which similarly apply to organizations seeking a greater online presence through optimized content and multi-channel approaches [42].
In conclusion, the technological, business, and social impacts of our work are profound and far-reaching. By enabling automated structured data generation, we advance the capabilities of web technologies while empowering organizations and individuals to harness the full potential of information in the digital age. The comprehensive nature of these impacts highlights the relevance and necessity of our research in fostering an inclusive and innovative information ecosystem.
7. Conclusions
In this research, we have proposed a comprehensive solution for embedding linked data objects into news articles sourced from Google News. Our methodology employs pattern mining techniques to extract key content from news articles, including titles and bodies, which are then processed to create structured JSON-LD annotations. To enhance the accuracy and relevance of the extracted content, we used OpenAI’s GPT-3 for cleaning, ensuring the removal of irrelevant elements such as advertisements, links, and extraneous words. This approach allows for the generation of high-quality linked data, which facilitates the seamless integration of news content into the Semantic Web.
The dataset used in this study comprised 1100 web pages from 18 distinct news websites, written in both English and Arabic. The results demonstrated a substantial similarity rate exceeding 93% for news titles and over 90% for the article body within the generated linked data JSON objects. We validated the accuracy of our generated JSON-LD by comparing it to the original using both the Google Rich Results Checker and a word-by-word comparison, achieving strong content fidelity.
Our approach has the potential to be integrated as a plugin within various content management systems (CMSs), automating the injection of structured data annotations and enhancing the discoverability and interoperability of news articles on the web. Additionally, the versatility of the technique suggests broader applicability for embedding linked data in a variety of web page types, benefiting both developers and end users by improving web content functionality and searchability.
However, there are limitations in our current approach, including the reliance on predefined rules for pattern mining and text extraction based on HTML attributes. These rules may not perform optimally on websites with complex or dynamic layouts. While GPT-3 has proven effective in cleaning content, there are occasional instances of relevant information being erroneously filtered out. These challenges indicate the need for more adaptive algorithms capable of handling diverse and evolving web page structures.
In future work, we aim to address these limitations by exploring machine learning-based adaptive extraction techniques, enhancing the model’s ability to generalize across different web structures. Expanding our method to support multilingual content, multimedia (including images and videos), and fact-checking mechanisms will further improve the reliability and versatility of the solution. Additionally, we plan to incorporate real-time user feedback to refine the content cleaning process and ensure compliance with emerging standards for linked data on the web.
Looking forward, we will continue to expand our dataset and refine the accuracy of our algorithm. We believe our research has significant potential to advance the adoption of linked data practices across the web, fostering improved information accessibility, integration, and semantic understanding.