Article

Enhancing News Articles: Automatic SEO Linked Data Injection for Semantic Web Integration

1 Department of Computer Science and Engineering, Innopolis University, Innopolis 420500, Russia
2 Q Deep, Innopolis 420500, Russia
3 Scientific Center for Information Technologies and Artificial Intelligence, Sirius University of Science and Technology, Sirius Federal Territory, Sochi 354340, Russia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1262; https://doi.org/10.3390/app15031262
Submission received: 8 October 2024 / Revised: 6 January 2025 / Accepted: 10 January 2025 / Published: 26 January 2025

Abstract:
This paper presents a novel solution aimed at enhancing news web pages for seamless integration into the Semantic Web. By utilizing advanced pattern mining techniques alongside OpenAI’s GPT-3, we rewrite news articles to improve their readability and accessibility for Google News aggregators. Our approach is characterized by its methodological rigour and is evaluated through quantitative metrics, validated using Google’s Rich Results Test API to confirm adherence to Google’s structured data guidelines. In this process, a “Pass” in the Rich Results Test is taken as an indication of eligibility for rich results, demonstrating the effectiveness of our generated structured data. The impact of our work is threefold: it advances the technological integration of a substantial segment of the web into the Semantic Web, promotes the adoption of Semantic Web technologies within the news sector, and significantly enhances the discoverability of news articles in aggregator platforms. Furthermore, our solution facilitates the broader dissemination of news content to diverse audiences. This submission introduces an innovative solution substantiated by empirical evidence of its impact and methodological soundness, thereby making a significant contribution to the field of Semantic Web research, particularly in the context of news and media articles.

1. Introduction

The Semantic Web is a visionary extension of the traditional World Wide Web, aimed at making data more accessible and interpretable by machines. As a transformative approach, the Semantic Web embeds structured, machine-readable information into web content, enabling both artificial and human agents to understand and navigate content effectively [1,2]. This capability is particularly impactful in fields like e-learning, as well as news web portals, where structured data improve article indexing and presentation, facilitating platforms like Google News in enhancing content discoverability and relevance through structured data integration [3,4].
One key advancement in this area is rich results, which enable search engines to present enhanced, visually informative search results that may include images, star ratings, and other contextual information. These rich results rely on structured data annotations embedded directly within the HTML of web pages, typically using formats like JavaScript Object Notation for Linked Data (JSON-LD). JSON-LD is a lightweight linked data format that allows website developers to add structured metadata to web content in a way that is both machine-readable and straightforward to implement. By embedding JSON-LD annotations, web developers can provide essential details to search engines without altering visible content, ensuring better indexing and enhancing search visibility [5,6].
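For illustration, the short Python sketch below assembles a minimal schema.org NewsArticle annotation and wraps it in the script tag that carries JSON-LD inside a page's HTML. All field values are invented placeholders, and the property set is a small subset of what schema.org defines:

import json

# Minimal schema.org NewsArticle annotation (placeholder values).
annotation = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example Headline",
    "image": ["https://example.com/photo.jpg"],
    "datePublished": "2024-10-08T09:00:00+00:00",
    "author": [{"@type": "Person", "name": "Jane Doe"}],
}

# The annotation is embedded in the page without altering visible content.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(annotation, indent=2)
    + "</script>"
)
print(script_tag)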
Such structured data significantly improve a website’s search engine optimization (SEO), a collection of techniques aimed at enhancing a website’s visibility and ranking on search engine results pages (SERPs). SERPs are the pages displayed by search engines in response to a user’s query, and they are instrumental in attracting online traffic by presenting a ranked list of relevant links, summaries, and often enhanced features like images or videos. By optimizing elements like keywords, content, and links, SEO strategies help elevate a website’s SERP position, which is critical for online visibility and user engagement [7]. JSON-LD’s compatibility with SEO efforts allows it to boost site visibility by making important data accessible for indexing [8]. However, many legacy websites lack these structured data annotations, making it challenging for search engines to accurately index their content, thereby negatively impacting SEO performance, visibility, and user engagement [9,10,11].
Legacy news websites often lack the tools and resources to implement structured data, resulting in poor visibility and limited engagement on search platforms. This creates a significant barrier to information accessibility and discoverability, as valuable content becomes harder to access for users relying on modern search engines. By addressing this gap, our research seeks to transform how legacy news content is indexed and presented, making essential information more accessible to the public.
Motivated by this challenge, our research seeks to address the gap in structured data for legacy news websites, which are often excluded from modern CMS tools that facilitate structured data generation. This paper introduces a comprehensive solution to automatically generate JSON-LD scripts tailored for news websites, bridging the gap between unstructured and structured web data. Our approach is driven by two primary innovations. The first innovation is our use of advanced pattern mining techniques coupled with cutting-edge NLP models, specifically OpenAI’s GPT-3. GPT-3 not only helps in accurately extracting relevant article features such as titles, dates, and content structure, but also enables content refinement by removing irrelevant text and advertisements. This ensures the annotations generated are clean, focused, and high-quality, making the structured data both accurate and SEO-optimized.
Second, we introduce a robust validation framework to ensure that the generated structured data align with search engine standards. Unlike previous approaches, which often lack comprehensive validation steps, our framework systematically removes pre-existing JSON-LD scripts, regenerates them using our algorithm, and rigorously verifies their quality and compliance through both syntactic and semantic checks. Specifically, we represent the article data as a matrix $A \in \mathbb{R}^{n \times m}$, where $n$ is the number of articles and $m$ is the number of extracted features (e.g., headline, author, date, and content). Our goal is to transform this feature matrix into a structured data matrix $S \in \mathbb{R}^{n \times k}$, where $k$ denotes the number of JSON-LD elements required for structured data compliance. This transformation is optimized to minimize discrepancies between the generated JSON-LD and the original content data, ensuring that the annotations are both accurate and compliant with search engine requirements.
To assess the effectiveness of this transformation, we define a matching function $M(A, S)$ that quantitatively measures the alignment between the original article features and the generated JSON-LD annotations. This metric provides a comprehensive assessment of the automated annotation process and supports our claim that the generated data align closely with the original content.
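The paper treats $M(A, S)$ abstractly; purely as a hedged illustration, one simple instantiation averages per-field word overlap between the extracted features and the generated annotation (the field names and the overlap measure below are our assumptions):

def field_overlap(original: str, generated: str) -> float:
    """Fraction of the original field's words recovered in the generated field."""
    orig_words = original.lower().split()
    gen_words = set(generated.lower().split())
    if not orig_words:
        return 1.0
    return sum(w in gen_words for w in orig_words) / len(orig_words)

def matching_score(article: dict, jsonld: dict,
                   fields=("headline", "author", "datePublished", "articleBody")) -> float:
    """Average per-field alignment between article features and generated JSON-LD."""
    scores = [field_overlap(str(article.get(f, "")), str(jsonld.get(f, "")))
              for f in fields]
    return sum(scores) / len(scores)

A score of 1.0 under this reading would mean every word of every extracted feature reappears in the corresponding JSON-LD property.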
Our work contributes to the Semantic Web ecosystem by enhancing the accessibility, discoverability, and indexing of news articles on the web. Unlike previous solutions, which often focus on structured data for modern websites, our approach specifically targets legacy news websites, making it possible to bring these sites up to current standards. By enabling the automatic generation and validation of JSON-LD annotations, our framework has the potential to significantly increase the visibility and accessibility of news content on search engines. Moreover, this innovation has broader implications for the web as a whole, as it can be adapted for various types of unstructured pages, paving the way for a more structured and accessible Internet.
The broader implications of our solution extend beyond news articles. By creating an adaptable and generalizable framework, this work sets a foundation for structured data generation across various unstructured content types, including blogs, forums, and other legacy content platforms. This could lead to a more interconnected and easily searchable web ecosystem, ultimately facilitating a more inclusive and information-rich web environment.

Structure of the Paper

This paper is organized into several sections that collectively present our approach to enhancing news web pages for integration into the Semantic Web. Section 2, Literature Review, highlights both historical developments and contemporary approaches, situating our contribution within the broader context of Semantic Web research. Section 3, Methodology, outlines our novel approach, illustrated by a flowchart. Here, we detail how our methodology infers linked data from web page content without prior knowledge of the HTML structure, utilizing articles from Google News as the primary data source. Section 4 presents the implementation, where we elaborate on the practical application of our methodology. In Section 5, Experiment and Results, we discuss the outcomes of our approach and provide critical insights based on our findings. Following this, Section 6 is dedicated to Impact Analysis, assessing the technological, social, and business implications of our solution. Finally, Section 7 concludes this paper by summarizing our findings and emphasizing the significance of our contributions to the field of Semantic Web research.

2. Literature Review

This section provides a comprehensive review of the existing literature, organized into two main threads: (i) the generation of metadata and linked data from raw text and (ii) methods for the automated extraction and organization of data to be used in metadata generation. The review underscores both historical developments and current approaches, contextualizing our contribution within this body of knowledge.

2.1. Metadata Generation and Linked Data Representation

The vision of the Semantic Web, first articulated over two decades ago by Tim Berners-Lee and his team, has been fundamental in reshaping how web content is perceived by both humans and machines. Berners-Lee’s goal was to create a web where information is not only accessible by humans but also structured in a way that machines can understand and process autonomously. This vision laid the groundwork for the development of standards and technologies aimed at transforming the web into a more intelligent system.
To this end, the World Wide Web Consortium (W3C) has defined key standards for representing structured data on the web. Two of the most significant standards include the Resource Description Framework (RDF) and the Web Ontology Language (OWL) [12]. The RDF provides a syntax for encoding relationships between resources, while OWL allows for the creation of complex ontologies—formal vocabularies that define concepts within specific domains. These frameworks have seen increasing adoption across academia and industry [13]. However, for these technologies to reach their full potential, it is essential to develop domain-specific vocabularies (ontologies). Many of these vocabularies have been crafted manually for fields such as healthcare, law, and geography, but advancements in machine learning are enabling the automatic or semi-automatic generation of these ontologies from large text corpora [14]. This progress reduces the time and expertise required to build and maintain these knowledge representations, making the Semantic Web more accessible and scalable.
Although the foundations of the Semantic Web are well established, recent studies highlight further advancements that improve the practical scalability of metadata generation and linked data integration. For instance, modern language models, such as GPT-based models, are being applied to support automated metadata generation, event analysis, and sentiment detection, which enhances structured information retrieval and data interconnectivity in real-time applications. Despite these advances, the need remains for streamlined methodologies to support automatic metadata and structured data generation across diverse domains. Table 1 provides a comparative summary of key related works, focusing on modern techniques and highlighting the limitations in existing methodologies, such as limited generalizability or high resource demands for model training and deployment.
By focusing on the capabilities and limitations of these methodologies, this review underscores the research gap addressed by our approach: a scalable, efficient, and semi-automated solution that integrates both structured metadata generation and linked data representation, enabling adaptive, domain-agnostic content extraction.

2.2. Automated Data Acquisition and Content Extraction

Extracting the main content from unstructured web pages has been a long-standing challenge in web data mining. Early approaches focused on parsing the Document Object Model (DOM) structure of web pages. For instance, researchers at Columbia University proposed a method to detect the largest text body within a web page, classifying it by counting the words in each HTML element [20]. This approach, which is the basis of our method for identifying the body of an article, proves effective for detecting main content sections but struggles with title identification due to the variability in web page layouts.
More sophisticated content extraction methods have since been proposed. For example, ref. [21] introduced an approach combining structural and contextual analysis to improve the accuracy of content extraction. While this method has shown promise, it has yet to be fully implemented, suggesting that further refinement and experimentation are needed.
The literature reveals two dominant approaches in the field of web data extraction.

2.2.1. DOM-Based Approaches

DOM-based approaches leverage the structural properties of HTML documents to identify and extract relevant content. These methods rely on traversing the DOM tree to locate elements such as text bodies and headings based on properties such as size, position, and relative relationships between elements [22,23,24]. While effective in well-structured environments, the major limitation of DOM-based methods is their sensitivity to changes in web page layouts. As web pages evolve, their structural composition can change, rendering static DOM-based techniques less effective over time.

2.2.2. AI-Driven Approaches

Artificial intelligence (AI)-based methods represent a more dynamic approach to content extraction. These techniques typically involve training machine learning models on large, labeled datasets of web pages. The models learn to recognize patterns and features within the page that correspond to different types of content, such as article titles, bodies, and advertisements [25,26,27]. AI-based methods have demonstrated strong performance, particularly in handling diverse web layouts. However, they require significant computational resources and large amounts of labeled data for training. Moreover, the generalization of these models to new, unseen web layouts can be challenging, especially when the underlying page structure differs significantly from the training data.

2.3. Challenges and Opportunities

The comparison of DOM-based and AI-driven approaches underscores distinct trade-offs. AI-based methods, although flexible and adaptive, necessitate significant infrastructure and are data-intensive, with performance reliant on labeled training data. Conversely, DOM-based methods, though lightweight and efficient for structured content, are sensitive to web page architecture changes, necessitating frequent updates.
Our research adopts an enhanced DOM-based approach for article data extraction from web pages, combining DOM structure traversal with pattern mining techniques for greater accuracy and adaptability. This method allows for structured data extraction (e.g., titles and article bodies) without requiring intensive computational resources, aligning well with the regularity found in news articles. By reintegrating extracted data back into web pages as linked data, our approach not only supports metadata generation but also aligns with linked data principles, fostering more intelligent and connected web resources [28,29,30,31,32,33].
This refined DOM-based methodology offers a scalable alternative to AI-driven solutions, addressing the challenges of both traditional DOM-based and AI-based methods. By synthesizing structured pattern recognition with linked data re-injection, our approach advances the efficiency and resilience of content extraction in web data mining.

3. Methodology

Our methodology, depicted in the flowchart in Figure 1, enables the inference of linked data from the entire web page content without requiring prior knowledge of the structure of the source HTML code. The results produced are comparable to those generated by web page authors.
In Figure 1, the flowchart illustrates the workflow used to process news articles for integration into the Semantic Web. The process begins with exporting top news articles from Google News, followed by a check to verify if each article includes a JSON-LD (linked data JSON) object. Articles lacking JSON-LD are marked as “rejected”, while those with JSON-LD continue to the next step as “Approved Articles”. Subsequently, these approved articles are evaluated for their “Rich Text” quality, meaning they have sufficient structure for further processing. Articles failing this check are rejected, while those passing it are finalized as “100% Rich Text articles”. Additional steps are applied to add or remove JSON-LD where necessary, ensuring the structured readability and integration of news content.
To perform the analysis, we utilized Google News as our source of articles; each sourced page already carries a linked data JSON object (JSON-LD), hereafter referred to as the original JSON-LD. Our approach employs a general data extraction method that applies pattern mining to news sites, as published in [34]. The pattern miner scrapes the title and body of the news article to extract the plain text content. The text with the largest font size on the page is identified as the title, while the largest div element (by character length) is designated as the body, as illustrated in Figure 2.
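A simplified sketch of this heuristic is shown below, assuming for illustration that font sizes appear as inline style attributes; the full pattern miner in [34] handles more cases:

import re
from bs4 import BeautifulSoup

def mine_title_and_body(html: str):
    """Largest inline font size -> title; longest <div> by text length -> body."""
    soup = BeautifulSoup(html, "html.parser")

    # Title heuristic: scan elements carrying an inline font-size declaration.
    best_size, title = 0.0, None
    for el in soup.find_all(style=re.compile(r"font-size")):
        match = re.search(r"font-size:\s*([\d.]+)", el["style"])
        if match and float(match.group(1)) > best_size:
            best_size, title = float(match.group(1)), el.get_text(strip=True)

    # Fallback: many pages mark the headline as <h1>.
    if title is None and soup.h1:
        title = soup.h1.get_text(strip=True)

    # Body heuristic: the <div> with the longest text content.
    body = max((d.get_text(" ", strip=True) for d in soup.find_all("div")),
               key=len, default="")
    return title, body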
Furthermore, we extended our method from [34] to clean the title and article body after scraping, removing unrelated words, links, images, and advertisements using OpenAI’s GPT-3 API [35]. Specifically, GPT-3 was employed to identify and filter out irrelevant elements. This includes extraneous words or phrases unrelated to the core news content, promotional links and advertisements, references to images, and other non-essential elements. This process ensures that the structured data focus exclusively on the relevant article information, facilitating seamless integration into the Semantic Web.
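A sketch of such a cleaning call is given below, using the legacy (pre-1.0) openai Python client; the model name, prompt wording, and token limit are our assumptions rather than the exact configuration used:

import openai  # legacy (<1.0) client; openai.api_key must be set beforehand

CLEANING_PROMPT = (
    "The following text was scraped from a news page. Remove advertisements, "
    "navigation links, image captions, and any words unrelated to the article "
    "itself. Return only the cleaned article text.\n\n"
)

def clean_with_gpt3(raw_text: str) -> str:
    """Ask a GPT-3 model to strip boilerplate from scraped article text."""
    response = openai.Completion.create(
        model="text-davinci-003",  # assumed GPT-3 completion model
        prompt=CLEANING_PROMPT + raw_text,
        max_tokens=2048,
        temperature=0,             # deterministic cleaning
    )
    return response.choices[0].text.strip()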
Utilizing this methodology, we extract the most important properties of the JSON-LD object, such as the title, article body, images, and URLs, thus creating a new JSON object—the generated JSON-LD. To verify our results, we compare the generated JSON-LD with the original using two distinct methods:
  • Validate the results on the Google Rich Results Checker [36].
  • Perform a word-by-word comparison of the original and generated texts.
While the Google Rich Results Checker serves as the primary validation tool, we also employ a word-by-word verification method to ensure content accuracy, as the rich results validator only checks the structural integrity of the object.
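A hedged sketch of such a word-by-word check follows; the paper reports the result as a per-property similarity percentage, and the positional matching below is our illustrative reading:

def word_by_word_similarity(original: str, generated: str) -> float:
    """Percentage of positions at which the two texts carry the same word."""
    orig, gen = original.split(), generated.split()
    if not orig and not gen:
        return 100.0
    matches = sum(a == b for a, b in zip(orig, gen))
    return 100.0 * matches / max(len(orig), len(gen))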
To quantify the content similarity, we utilize Jensen–Shannon Divergence [37,38,39], which measures the similarity between two probability distributions P and Q. This divergence is defined as follows:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M)$$
where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback–Leibler divergence, defined as
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
This allows us to quantify the differences in content distribution between the original JSON-LD and the generated JSON-LD.
Additionally, we can measure the entropy H of the generated content using the following equation:
$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$
where $p(x_i)$ represents the probability of the occurrence of each unique word $x_i$ in the content.
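Both quantities can be computed directly from empirical word-frequency distributions. The sketch below is a minimal NumPy implementation of the formulas above; whitespace tokenization and lower-casing are our simplifying assumptions:

import numpy as np
from collections import Counter

def word_distributions(text_a: str, text_b: str):
    """Empirical word distributions over the union vocabulary of two texts."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    p = np.array([ca[w] for w in vocab], dtype=float)
    q = np.array([cb[w] for w in vocab], dtype=float)
    return p / p.sum(), q / q.sum()

def kl_divergence(p, q):
    """D_KL(P || Q), summed over the support of P."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """D_JS(P || Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), M = (P + Q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def entropy(p):
    """H(X) = -sum_i p(x_i) log p(x_i)."""
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))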
The methodology can be succinctly represented in five steps:
  • Extraction of News Articles: gather news web pages containing the original JSON-LD object.
  • Analyze/Remove JSON-LD: remove the original JSON-LD object and save it externally for comparison.
  • Generate and Inject JSON-LD: use pattern mining to generate a new JSON-LD object and inject it into the original page to replace the original.
  • Check JSON-LD: validate the new page with the injected JSON-LD using the Rich Results Checker.
  • Check the JSON-LD Content: compare the content word by word to compute a similarity score between properties.
The following pseudocode outlines the algorithms utilized in our methodology for extracting and generating JSON-LD annotations.

3.1. Extraction of News Articles

Algorithm 1 describes the process of extracting news articles from web pages.
Algorithm 1 ExtractNewsArticles
procedure ExtractNewsArticles(URL)
    Fetch the web page content from the URL
    Parse the HTML content
    Identify the title:
        title ← FindElementByLargestFontSize()
    Identify the body:
        body ← FindElementByLargestDiv()
    Identify additional properties:
        img ← FindImageElements()
        url ← FindPageURL()
    return (title, body, img, url)
end procedure

3.2. Cleaning the Article Body

Algorithm 2 cleans the extracted article body by removing unrelated content, such as stop words and images.
Algorithm 2 CleanArticleBody
procedure CleanArticleBody(RawArticleBody)
    Initialize cleanedBody as an empty string
    for each word in RawArticleBody do
        if word is a link or an image reference then
            continue
        else if word is not in the stop words list then
            cleanedBody ← cleanedBody + word
        end if
    end for
    return cleanedBody
end procedure
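For concreteness, a runnable Python counterpart of Algorithm 2 might look as follows; the stop-word list and the link/image patterns are illustrative assumptions, not the exact rules used in our implementation:

import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}  # illustrative subset only
LINK_OR_IMAGE = re.compile(r"^(https?://\S+|\S+\.(?:png|jpe?g|gif))$", re.IGNORECASE)

def clean_article_body(raw_body: str) -> str:
    """Drop links/images and stop words, keeping the remaining words in order."""
    kept = []
    for word in raw_body.split():
        if LINK_OR_IMAGE.match(word):
            continue  # skip embedded links and image references
        if word.lower() not in STOP_WORDS:
            kept.append(word)
    return " ".join(kept)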

3.3. Generation of JSON-LD

Algorithm 3 generates the new JSON-LD object based on the cleaned title and body, including other relevant properties.
Algorithm 3 GenerateJSONLD
Input: Title, Cleaned Body, Image, URL
Output: Generated JSON-LD object
Initialize an empty JSON object jsonLD as {}
Set jsonLD[“title”] ← Title
Set jsonLD[“body”] ← Cleaned Body
Set jsonLD[“image”] ← Image
Set jsonLD[“url”] ← URL
Add any additional metadata to jsonLD
return jsonLD
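A minimal Python counterpart of Algorithm 3 is sketched below. Since the implementation in Section 4 adheres to schema.org standards, we map the algorithm's generic fields onto the schema.org NewsArticle property names (headline, articleBody); the exact set of additional metadata fields is implementation-dependent:

import json

def generate_jsonld(title: str, cleaned_body: str, image: str, url: str) -> str:
    """Assemble a schema.org NewsArticle JSON-LD object from extracted fields."""
    jsonld = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": title,
        "articleBody": cleaned_body,
        "image": [image],
        "url": url,
        # Additional metadata (author, datePublished, publisher) would go here.
    }
    return json.dumps(jsonld, ensure_ascii=False, indent=2)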

3.4. Validation of JSON-LD

Algorithm 4 validates the generated JSON-LD using the Google Rich Results Checker and performs a content comparison.
Algorithm 4 ValidateJSONLD
Input: Original JSON-LD, Generated JSON-LD
Output: Validation result
Validate the JSON structure using the Google Rich Results Checker:
    validationResult ← CheckRichResults(Generated JSON-LD)
Perform a word-by-word comparison between the original and generated JSON-LD:
    similarityScore ← CompareWordByWord(Original JSON-LD, Generated JSON-LD)
return (validationResult, similarityScore)

4. Implementation

This section presents a detailed review of the implemented solution, fulfilling all the requirements outlined earlier. We begin with an overview of the framework architecture, followed by an in-depth description of the individual components, as depicted in Figure 3. Figure 4 shows the Google Rich Results Test website, which is used to validate the structured data of a URL. After submitting a URL, the tool analyzes the page’s structured data and provides a report on the number of valid items detected, the types of rich results found, and any issues or errors in the data implementation. This step is essential for ensuring that the structured data on the news articles are properly formatted and eligible for enhanced presentation in Google search results.
Figure 3 illustrates the architecture overview of our system, which is organized into three main layers: Extraction, Modification, and Verification. The Extraction Layer is responsible for scraping news content and extracting JSON-LD metadata properties from the target web pages. This is followed by the Modification Layer, where the system can either remove existing JSON-LD elements or inject new JSON-LD code. Finally, the Verification Layer ensures the accuracy and completeness of the extracted metadata using two verification methods: a Rich Text Verifier and a Word-by-Word Verifier. These layers work together to streamline and validate the extraction and modification processes, forming the core structure of our methodology.

4.1. Data Extraction Layer

The Data Extraction Layer is responsible for collecting raw data from news sources.
  • Web Scraping Module: This module employs the Beautiful Soup library (Python) for HTML parsing, extracting the main content sections of Google News articles. The ‘requests’ library (Python) is used to make HTTP requests and retrieve HTML content. These libraries were chosen for their efficiency in handling and parsing HTML structures commonly found in news websites.
  • Data Storage: The extracted data are stored in a NoSQL database, specifically MongoDB, chosen for its flexibility in handling semi-structured data and compatibility with JSON format. MongoDB also enables efficient querying and manipulation of data, which is essential during the analysis and transformation stages.
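As a minimal sketch of how these two modules fit together (the URL handling, database name, and collection name below are placeholders, not our production configuration):

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

def scrape_and_store(url: str, mongo_uri: str = "mongodb://localhost:27017"):
    """Fetch a news page, parse it, and persist the raw extraction in MongoDB."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    record = {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "html": html,  # kept so the pattern miner can run over it later
    }

    client = MongoClient(mongo_uri)
    client["news"]["articles"].insert_one(record)  # semi-structured storage
    return record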

4.2. Transformation Layer

The Transformation Layer is tasked with preparing the extracted data for further processing.
  • Pattern Mining Algorithm: The pattern mining algorithm is implemented using Python’s native capabilities for text processing and rule-based pattern matching. By utilizing Python’s regular expressions and HTML parsing techniques, this algorithm identifies the title and body of news articles based on attributes like font size and HTML element type.
  • Data Cleaning Module: To remove irrelevant content such as unrelated words, links, images, and advertisements, we employ the OpenAI API (specifically GPT-3), which uses advanced natural language processing (NLP) to distinguish relevant content from noise. This API was chosen due to its high accuracy in understanding and processing textual data, ensuring only relevant information remains in the title and body sections of the articles.

4.3. Loading Layer

The Loading Layer prepares the cleaned data for linked data integration and storage.
  • JSON-LD Generation: We use a custom Python module to convert the cleaned and structured data into JSON-LD format, adhering to schema.org standards for linked data representation. JSON-LD was selected as it is widely supported by search engines, improving the interoperability and discoverability of content.
  • Database Integration: The generated JSON-LD data are stored back in MongoDB, allowing for seamless retrieval and future validation of the data. MongoDB’s ability to store JSON-like documents allows us to store linked data efficiently alongside metadata.

4.4. Validation Layer

The Validation Layer verifies the quality and accuracy of the generated JSON-LD.
  • Google Rich Results Checker Integration: We use the Google Rich Results Test API to validate the generated JSON-LD against Google’s structural guidelines. This integration ensures that the structured data align with Google’s requirements, enhancing visibility on Google platforms.
  • Similarity Comparison Module: Implemented with the SciPy library (Python), this module uses the Jensen–Shannon Divergence metric to quantitatively compare the original and generated JSON-LD content [37,38], assessing how closely the generated linked data represent the original article. SciPy was chosen for its robust mathematical and statistical functions, making it ideal for similarity measurement tasks.
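For reference, SciPy exposes the Jensen–Shannon distance, which is the square root of the divergence defined in Section 3, so the comparison module squares it to recover $D_{JS}$; a small sketch:

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two word distributions.

    SciPy's jensenshannon returns the JS distance (the square root of the
    divergence), so we square the result to recover D_JS."""
    return float(jensenshannon(p, q) ** 2)

# Toy example over a shared three-word vocabulary:
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(jsd(p, q))  # 0.0 would indicate identical content distributions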
The implementation described above provides a comprehensive framework using open-source libraries, a powerful NLP API, and widely supported data storage solutions to achieve the efficient extraction, transformation, generation, and validation of linked data from news articles. This framework facilitates enhanced data accessibility and interoperability on the web.

5. Experiment and Results

The dataset used for the experiment consists of 1100 web pages from 18 reputable news sources, written in both English and Arabic. The original and generated JSON-LD objects for each article are compared using word-by-word matching and validated using Google Rich Results.
Figure 5 and Figure 6 illustrate examples of the original and generated JSON-LD objects, respectively. A quantitative comparison of the performance across various websites is summarized in Table 2, which reports, per website, the rich results eligibility status (Pass/Fail) and the percentage match between the generated JSON-LD files and the ground truth data for the title and article body fields.
To quantify the similarities between the original and generated JSON-LD objects, we used a word-by-word comparison tool. Figure 7 and Figure 8 display the similarity percentages for the article body and title, respectively, as histograms overlaid with density plots.
In Figure 7, the histogram shows the distribution of similarity scores between the original and generated JSON-LD objects for the article body across multiple articles from independent.co.uk, accessed on 10 October 2024. The x-axis represents the percentage similarity, while the y-axis shows the frequency of articles falling within each similarity range. The density plot superimposed over the histogram provides a smooth curve, highlighting the general distribution pattern and concentration of scores, with most values clustering above 90%.
Similarly, Figure 8 presents the similarity percentages for the article title, specifically from articles on Skynewsarabia.com. The histogram reflects the frequency of articles with specific similarity scores, while the density plot provides a smoothed view of the data distribution. Like the article body, the title similarity scores are also predominantly above 90%, indicating that the generated JSON-LD objects closely match the original titles in most cases.
These visualizations support our quantitative analysis, demonstrating that the generated JSON-LD objects maintain high similarity to the original data, validating the effectiveness of our framework in accurately replicating article information.
The implemented framework successfully scraped and processed news articles from diverse sources, generating accurate JSON-LD objects with high similarity to the original data. Validation through Google Rich Results indicated high pass rates, and content similarity consistently exceeded 90%. Future work could involve enhancing the data cleaning process and expanding the dataset to include more websites in multiple languages.

6. Impact Analysis

The implications of our proposed solution for automatically generating JSON-LD annotations extend across several dimensions, including technological advancements, business benefits, and social impacts. This multifaceted impact underscores our work’s significance in the Semantic Web context and broader data accessibility initiatives.

6.1. Technological Impact

From a technological standpoint, our work significantly contributes to the evolution of the Semantic Web by enhancing the machine-readability of web content. The automated generation of structured data through advanced algorithms and language models, such as GPT-3, facilitates the integration of artificial intelligence into web development processes. One of the key benefits of this approach is its scalability; it allows for the rapid scaling of structured data across millions of web pages, accommodating the growing demand for data-driven insights and services. Furthermore, by adhering to JSON-LD standards, the generated structured data enhance interoperability between different systems and platforms, enabling better data exchange and collaboration among various stakeholders in the digital ecosystem. This also leads to enhanced SEO, as the implementation of structured data improves visibility in search engine results, providing a competitive advantage to websites that adopt our solution [40].

6.2. Business Impact

The business ramifications of our solution are equally substantial. Automating structured data generation, such as JSON-LD, not only reduces the manual effort needed for web maintenance and SEO, but also translates into significant cost savings for organizations. This efficiency allows resources to be reallocated to more strategic areas, enhancing overall operational productivity. Moreover, the implementation of structured data has been proven to positively influence engagement metrics, as shown in a recent study on SEO’s effect on the business performance of a private university in Sarajevo. This case study demonstrated that increasing search engine visibility through structured data led to a notable rise in site visits, user engagement, and ultimately boosted revenue through increased student enrollments [41].
Businesses that adopt structured data generation solutions like ours gain a competitive edge by offering enhanced user experiences and better-targeted content. With improved access to structured data, organizations can leverage advanced analytics for data-driven decision-making, optimizing content strategies to better engage their target audiences. The demonstrated effectiveness of SEO in this case, along with the wider adoption of structured data for improved business performance, highlights the transformative potential of our solution for diverse organizations.

6.3. Social Impact

On a societal level, our work addresses critical issues related to information accessibility and digital equity. By improving the indexing of news content, our solution ensures that users have better access to timely and relevant information, fostering a more informed society. Moreover, smaller news organizations that may lack the resources for sophisticated SEO strategies can leverage our automated solution to enhance their online presence, thus leveling the playing field in the digital landscape. This increased visibility not only empowers these organizations but also promotes a more diverse media landscape by ensuring that a wider range of perspectives and narratives are accessible to the public. Consequently, our work contributes to a more democratic society by facilitating the dissemination of varied viewpoints. Research has shown that businesses using digital marketing strategies like SEO and social media see substantial improvements in visibility and engagement, which similarly apply to organizations seeking a greater online presence through optimized content and multi-channel approaches [42].
In conclusion, the technological, business, and social impacts of our work are profound and far-reaching. By enabling automated structured data generation, we advance the capabilities of web technologies while empowering organizations and individuals to harness the full potential of information in the digital age. The comprehensive nature of these impacts highlights the relevance and necessity of our research in fostering an inclusive and innovative information ecosystem.

7. Conclusions

In this research, we have proposed a comprehensive solution for embedding linked data objects into news articles sourced from Google News. Our methodology employs pattern mining techniques to extract key content from news articles, including titles and bodies, which are then processed to create structured JSON-LD annotations. To enhance the accuracy and relevance of the extracted content, we used OpenAI’s GPT-3 for cleaning, ensuring the removal of irrelevant elements such as advertisements, links, and extraneous words. This approach allows for the generation of high-quality linked data, which facilitates the seamless integration of news content into the Semantic Web.
The dataset used in this study comprised 1100 web pages from 18 distinct news websites, written in both English and Arabic. The results demonstrated a substantial similarity rate exceeding 93% for news titles and over 90% for the article body within the generated linked data JSON objects. We validated the accuracy of our generated JSON-LD by comparing it to the original using both the Google Rich Results Checker and a word-by-word comparison, achieving strong content fidelity.
Our approach has the potential to be integrated as a plugin within various content management systems (CMSs), automating the injection of structured data annotations and enhancing the discoverability and interoperability of news articles on the web. Additionally, the versatility of the technique suggests broader applicability for embedding linked data in a variety of web page types, benefiting both developers and end users by improving web content functionality and searchability.
However, there are limitations in our current approach, including the reliance on predefined rules for pattern mining and text extraction based on HTML attributes. These rules may not perform optimally on websites with complex or dynamic layouts. While GPT-3 has proven effective in cleaning content, there are occasional instances of relevant information being erroneously filtered out. These challenges indicate the need for more adaptive algorithms capable of handling diverse and evolving web page structures.
In future work, we aim to address these limitations by exploring machine learning-based adaptive extraction techniques, enhancing the model’s ability to generalize across different web structures. Expanding our method to support multilingual content, multimedia (including images and videos), and fact-checking mechanisms will further improve the reliability and versatility of the solution. Additionally, we plan to incorporate real-time user feedback to refine the content cleaning process and ensure compliance with emerging standards for linked data on the web.
Looking forward, we will continue to expand our dataset and refine the accuracy of our algorithm. We believe our research has significant potential to advance the adoption of linked data practices across the web, fostering improved information accessibility, integration, and semantic understanding.

Author Contributions

Conceptualization, H.S. (Hamza Salem); Methodology, H.S. (Hamza Salem), H.S. (Hadi Salloum) and Kamil Sabbagh; Software, H.S. (Hamza Salem); Validation, H.S. (Hamza Salem); Data curation, O.O.; Writing—original draft, H.S. (Hamza Salem), H.S. (Hadi Salloum) and M.M.; Writing—review & editing, H.S. (Hadi Salloum), K.S. and M.M.; Supervision, H.S. (Hadi Salloum) and M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bashir, F.; Warraich, N.F. Systematic literature review of Semantic Web for distance learning. Interact. Learn. Environ. 2020, 31, 527–543. [Google Scholar] [CrossRef]
  2. Breit, A.; Waltersdorfer, L.; Ekaputra, F.J.; Sabou, M.; Ekelhart, A.; Iana, A.; Paulheim, H.; Portisch, J.; Revenko, A.; Teije, A.T.; et al. Combining Machine Learning and Semantic Web: A Systematic Mapping Study. ACM Comput. Surv. 2023, 55, 313. [Google Scholar] [CrossRef]
  3. Yu, L. Introduction to the Semantic Web and Semantic Web Services; Chapman and Hall/CRC: Boca Raton, FL, USA, 2007. [Google Scholar]
  4. Wang, Q. Normalization and Differentiation in Google News: A Multi-Method Analysis of the World’s Largest News Aggregator. Thesis, School of Graduate Studies, Rutgers The State University of New Jersey, New Brunswick, NJ, USA, 2020. [Google Scholar]
  5. Sporny, M.; Longley, D.; Kellogg, G.; Lanthaler, M.; Lindström, N. JSON-LD 1.1. W3c Recomm. 2020. Available online: https://hal.science/hal-02141614v1/file/mozilla.pdf (accessed on 8 October 2024).
  6. Sporny, M.; Longley, D.; Kellogg, G.; Lanthaler, M.; Lindström, N. JSON-LD 1.0. W3c Recomm. 2014, 16, 41. [Google Scholar]
  7. Lew, O.D.; Kammerer, Y. Factors influencing viewing behaviour on search engine results pages: A review of eye-tracking research. Behav. Inf. Technol. 2020, 40, 1485–1515. [Google Scholar] [CrossRef]
  8. Web Payments Working Group. JSON for Linking Data. 2022. Available online: https://json-ld.org/ (accessed on 8 October 2024).
  9. Bizer, C.; Meusel, R.; Primpeli, A.; Brinkmann, A. Web Data Commons—RDFa, Microdata Microformat Data Sets—Section 3.2 Extraction Results from the October 2022 Common Crawl Corpus. 2023. Available online: http://webdatacommons.org/structureddata/index.html#toc4 (accessed on 29 January 2023).
  10. Iqbal, M.; Khalid, M.N.; Manzoor, A.A.; Malik, M.; Shaikh, N.A. Search Engine Optimization (SEO): A Study of important key factors in achieving a better Search Engine Result Page (SERP) Position. Sukkur Iba J. Comput. Math. Sci. 2022, 6, 1–15. Available online: https://api.semanticscholar.org/CorpusID:250972011 (accessed on 8 October 2024). [CrossRef]
  11. Alfiana, F.; Khofifah, N.; Ramadhan, T.; Septiani, N.; Wahyuningsih, W.; Azizah, N.N.; Ramadhona, N. Apply the Search Engine Optimization (SEO) Method to determine Website Ranking on Search Engines. Int. J. Cyber Serv. Manag. 2023, 3, 65–73. [Google Scholar] [CrossRef]
  12. Berners-Lee, T.; Hendler, J.; Lassila, O. The Semantic Web: A New Form of Web Content that is Meaningful to Computers will Unleash a Revolution of New Possibilities. In Linking the World’s Information: Essays on Tim Berners-Lee’s Invention of the World Wide Web, 1st ed.; Association for Computing Machinery: New York, NY, USA, 2023; pp. 91–103. [Google Scholar]
  13. Adida, B.; Birbeck, M.; McCarron, S.; Pemberton, S. RDFa in XHTML: Syntax and processing. Recomm. W3C 2008, 7, 14. [Google Scholar]
  14. Chandrasekaran, B.; Josephson, J.R.; Benjamins, V.R. What are ontologies, and why do we need them? IEEE Intell. Syst. Their Appl. 1999, 14, 20–26. [Google Scholar] [CrossRef]
  15. Sufi, F. Advanced Computational Methods for News Classification: A Study in Neural Networks and CNN integrated with GPT. J. Econ. Technol. 2024, in press. [CrossRef]
  16. Shushkevich, E.; Alexandrov, M.; Cardiff, J. Improving multiclass classification of fake news using Bert-based models and CHATGPT-augmented data. Inventions 2023, 8, 112. [Google Scholar]
  17. Sufi, F.K. AI-GlobalEvents: A Software for analyzing, identifying and explaining global events with Artificial Intelligence. Softw. Impact. 2022, 11, 100218. [Google Scholar] [CrossRef]
  18. Sufi, F.K.; Alsulami, M. Automated multidimensional analysis of global events with entity detection, sentiment analysis and anomaly detection. IEEE Access 2021, 9, 152449–152460. [Google Scholar] [CrossRef]
  19. Medhat, W.; Hassan, A.; Korashy, H. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 2014, 5, 1093–1113. [Google Scholar] [CrossRef]
  20. McKeown, K.; Hatzivassiloglou, V.; Barzilay, R.; Schiffman, B.; Evans, D.; Teufel, S. Columbia Multi-Document Summarization: Approach and Evaluation. Doc. Underst. Conf. 2001, 1, 1–21. [Google Scholar]
  21. Rahman, A.F.R.; Alam, H.; Hartono, R. Content Extraction from HTML Documents. In Proceedings of the 1st International Workshop on Web Document Analysis (WDA2001), Seattle, WA, USA, 8 September 2001. [Google Scholar]
  22. Fumarola, F.; Weninger, T.; Barber, R.; Malerba, D.; Han, J. Extracting General Lists from Web Documents: A Hybrid Approach. In Modern Approaches in Applied Intelligence, Proceedings of the 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems Conference on Modern Approaches in Applied Intelligence—Volume Part I. IEA/AIE’11, Syracuse, NY, USA, 28 June–1 July 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 285–294. ISBN 978-3-642-21821-7. [Google Scholar]
  23. Hong, J.L.; Siew, E.-G.; Egerton, S. Information Extraction for Search Engines Using Fast Heuristic Techniques. Data Knowl. Eng. 2010, 69, 169–196. [Google Scholar] [CrossRef]
  24. Safi, W.; Maurel, F.; Routoure, J.M.; Beust, P.; Dias, G. A Hybrid Segmentation of Web Pages for Vibro-Tactile Access on Touch-Screen Devices. In Proceedings of the 3rd Workshop on Vision and Language (VL 2014) Associated to 25th International Conference on Computational Linguistics (COLING 2014), Dublin, Ireland, 23–29 August 2014; pp. 95–102. [Google Scholar]
  25. Lima, R.; Espinasse, B.; Oliveira, H.; Pentagrossa, L.; Freitas, F. Information Extraction from the Web: An Ontology-Based Method Using Inductive Logic Programming. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 951–958. [Google Scholar] [CrossRef]
  26. Zheng, S.; Song, R.; Wen, J.-R. Template-Independent News Extraction Based on Visual Consistency. In Proceedings of the 22nd National Conference on Artificial Intelligence—Volume 2. AAAI’07; Vancouver, BC, Canada, 22–26 July 2007; AAAI Press: Washington, DC, USA, 2007; pp. 1507–1512, ISBN 9781579953232. [Google Scholar]
  27. Zhu, W.; Dai, S.; Song, Y.; Lu, Z. Extracting news content with visual unit of web pages. In Proceedings of the 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Takamatsu, Japan, 1–3 June 2015; pp. 1–5. [Google Scholar]
  28. Gupta, S.; Kaiser, G.; Neistadt, D.; Grimm, P. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 20–24 May 2003; pp. 207–214. [Google Scholar]
  29. Mirzaaghaei, M.; Mesbah, A. DOM-based test adequacy criteria for web applications. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, San Jose, CA, USA, 21–26 July 2014; pp. 71–81. [Google Scholar]
  30. Behr, J.; Eschler, P.; Jung, Y.; Zöllner, M. X3DOM: A DOM-based HTML5/X3D integration model. In Proceedings of the 14th International Conference on 3D Web Technology, Darmstadt, Germany, 16–17 June 2009; pp. 127–135. [Google Scholar]
  31. Fard, A.M.; Mirzaaghaei, M.; Mesbah, A. Leveraging existing tests in automated test generation for web applications. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, Essen, Germany, 3–7 September 2014; pp. 67–78. [Google Scholar]
  32. Xia, J.; Xie, F.; Zhang, Y.; Caulfield, C. Artificial intelligence and data mining: Algorithms and applications. Abstr. Appl. Anal. 2013, 2013, 524720. [Google Scholar] [CrossRef]
  33. Menaga, D.; Saravanan, S. Application of artificial intelligence in the perspective of data mining. In Artificial Intelligence in Data Mining; Elsevier: Amsterdam, The Netherlands, 2021; pp. 133–154. [Google Scholar]
  34. Salem, H.; Mazzara, M. Pattern Matching-based scraping of news websites. J. Phys. Conf. Ser. 2020, 1694, 012011. [Google Scholar] [CrossRef]
  35. OpenAI. ChatGPT: Optimizing Language Models for Dialogue; OpenAI: San Francisco, CA, USA, 2022. [Google Scholar]
  36. Rich Results Test. Available online: https://search.Google.com/test/rich-results (accessed on 8 October 2024).
  37. Corander, J.; Remes, U.; Koski, T. On the Jensen-Shannon divergence and the variation distance for categorical probability distributions. Kybernetika 2021, 57, 879–907. [Google Scholar] [CrossRef]
  38. Nielsen, F. Jensen-Shannon divergence and diversity index: Origins and some extensions. Preprint 2021. [Google Scholar]
  39. Menéndez, M.L.; Pardo, J.A.; Pardo, L.; Pardo, M.C. The Jensen-Shannon divergence. J. Frankl. Inst. 1997, 334, 307–318. [Google Scholar] [CrossRef]
  40. Shadbolt, N.; Berners-Lee, T.; Hall, W. The Semantic Web Revisited. IEEE Intell. Syst. 2006, 21, 96–101. [Google Scholar] [CrossRef]
  41. Poturak, M.; Keco, D.; Tutnic, E. Influence of search engine optimization (SEO) on business performance: Case study of private university in Sarajevo. Int. J. Res. Bus. Soc. Sci. 2022, 11, 59–68. [Google Scholar] [CrossRef]
  42. Mbonigaba, C.; Sujatha, S.; Kumar, A.D.; Vasuki, M. Leveraging Digital Channels for Customer Engagement and Sales: Evaluating SEO, Content Marketing, and Social Media for Brand Growth. Int. J. Eng. Res. Mod. Educ. 2024, 9, 32–40. [Google Scholar]
Figure 1. Article processing workflow.
Figure 2. News title and body detection.
Figure 3. Architecture overview.
Figure 4. Google Rich Results Checker.
Figure 5. Original JSON-LD data.
Figure 6. Generated JSON-LD data.
Figure 7. Histogram of similarity scores for article body (independent.co.uk).
Figure 8. Histogram of similarity scores for article title (Skynewsarabia.com).
Table 1. Comparative analysis of related works on metadata and content extraction.

Methodology | Key Features | Limitations
GPT-based News Classification [15,16] | High accuracy in classifying news topics | High computational cost
Global Event Analysis with AI [17] | Real-time tracking with entity detection | Limited to news domains
Multi-dimensional Event Analysis [18] | Structured metadata for events | Requires domain-specific data
Sentiment Analysis with Feature Engineering [19] | Adds relevance via structured sentiment data | Resource-intensive; domain-specific
Table 2. Dataset evaluation and results.

Website | Rich Results | Title Similarity | Article Body Similarity | Lang.
skynewsarabia.com | Pass | >93% | >90% | AR
arabic.cnn.com | Pass | >93% | >90% | AR
youm7.com | Pass | >93% | >90% | AR
bbc.com | Pass | >93% | >90% | AR
cnn.com | Pass | >95% | >90% | EN
reuters.com | Pass | >95% | >90% | EN