1. Introduction
Journalism was one of the first fields to make the transition from the physical realm to the online digital space, starting with the appearance of the Wall Street Journal on Bulletin Board Systems in the 1980s [1]. As soon as the World Wide Web started becoming popular, newspapers also started being published online, with the Palo Alto Weekly being available on the Web as early as January 1994 [1]. In the beginning, printed content was reproduced identically on the Web, but after a short period, some publications started being produced specifically for the Web, which, according to Karlsson and Holt, dramatically changed the way media outlets produced and disseminated their content [2]. By 1999, more than 20% of American online newspaper content was Web-original, as claimed by Deuze's research of the same year [3]. Ever since then, online media outlets have been capitalizing on the Web's power to provide journalistic content with traits that only it can offer, namely interactivity, immediacy, hypertextuality, and multimodality [2].
At the turn of the millennium, Tim Berners-Lee proposed the Semantic Web, an expansion of the World Wide Web to include content that could be retrieved and comprehended by machines, introducing the idea of a machine-readable Web [4]. In the field of journalism, as Fernandez et al. point out, the use of metadata became prevalent in order to cover customers' needs for information freshness and relevance [5]. Moreover, the use of additional Semantic Web technologies, as proposed by Fernandez et al., was set to increase both productivity and media outlet revenues [5]. Heravi and McGinnis proposed the use of Semantic Web technologies in tandem with Social Media technologies to produce a new Social Semantic Journalism framework, combining technologies that could collaborate with each other in order to identify newsworthy user-generated journalistic content [6].
However, the evolution of the Web is not limited to content diffusion and machine-readability but also extends to the realms of aesthetics and usability. As Wu and Han point out, both aesthetics and usability display a strong relationship with the satisfaction of potential users [7]. King et al. [8] claim that a significant relationship exists between the visual complexity of a website and its influence on user first impressions. This is especially important with regard to media outlets, since King's research specifically links increased visual complexity with the user's perception of informativeness and engagement cues [8]. This perceived informativeness is an important quality when associated with a news website. Besides complexity, usability and compatibility with multiple devices have also evolved through the progression of website layout techniques over time, as studied by Stoeva [9]. The way information is presented on a Web page is under constant change.
In addition to complexity and layout, color also plays an important role in influencing user impressions. On many occasions, researchers have established that the colors used on a website can elicit emotional reactions and feelings that affect a website's perceived trustworthiness and appeal, or even a visitor's overall satisfaction [7,10,11,12]. Talei proposes that these emotional responses are a result of humans' natural reactions to colors as encountered in natural life [12]. Beyond individual colors eliciting an emotional reaction from users, White proposes that color schemes can have a similar effect and proceeds to study schemes using complementary colors [13], concluding that specific complementary colors lead to an increase in user pleasure.
In order to monitor how the websites of media outlets evolve alongside the evolution of Web technologies and aesthetics, looking at contemporary websites alone is not enough. Instead, what is needed is a comprehensive overview of each website's journey throughout the past decades. Brügger coined the term "website history" as a combination of media history and Internet history, where the individual website, rather than the medium, is considered the object of historical analysis [14]. The website, playing the part of a historical document, is to be archived and preserved, and subsequently delivered as historical material [14]. This type of historical material is the means through which the aesthetic trends and Semantic Web adoption of media outlets can be identified through archival data extraction.
The study presented in this article attempts to answer the following research questions:
RQ1. How has the integration of Semantic Web technologies (SWT) progressed in the last decades? When and to what extent were various technologies implemented?
RQ2. What are the trends in website aesthetics that can be identified concerning the complexity of Web pages, the usage of graphics, and the usage of fluid or responsive designs?
RQ3. What basic colors and coloring schemes are prevalent in website homepages? Did they change over the years and are there consistent trends that can be inferred by such changes?
In order to investigate these questions, large amounts of quantitative data were collected from actual public media outlets on the World Wide Web, based on their popularity in Greece. The past versions of these websites were retrieved through the use of a Web service offering archival information on websites. With that data in hand, a comprehensive understanding of the landscape of SWT adoption and general aesthetic trends can be attained. The method of collecting and analyzing that information will be presented in the following section.
2. Methodology
The research presented was conducted in four stages:
Stage 1: Media outlet websites were identified and selected based on their popularity in Greece.
Stage 2: Current and archival information from these websites was collected through the use of a website archive service. This information included the HyperText Markup Language (HTML) code of a website’s homepage as well as a screenshot of that homepage.
Stage 3: Using a Web data extraction algorithm, information regarding the usage of SWTs, website complexity, graphic usage, and website responsiveness or fluidity was recorded.
Stage 4: Using an image analysis algorithm, information regarding the colors used was extracted from the websites’ screenshots.
The methods and decision process behind each stage will be further detailed in this section. The quantitative data collected will be further presented in the results section.
2.1. Identifying Websites for Information Extraction
In order to reach safe conclusions regarding the evolution of media outlet websites through time, a large number of websites must be used, as well as multiple instances of each website over the course of time. A large data set can lead to reliable results and an impression that accurately represents reality. For that purpose, the archival Web service selected as the main provider of data concerning these websites was the Internet Archive's Wayback Machine. As seen in the work of Gomes et al. [15], most Web archiving initiatives are national or regional. Out of the few international ones, the Internet Archive is both the largest and the oldest, dating back to 1996. It boasts over 625 billion Web pages [16], which it provides to interested parties through its Wayback Machine. Using the Wayback Machine was considered the best way to collect a variety of instances for each studied website, spanning a representative period of time.
Another consideration, besides the number of instances, was which specific websites were to be targeted. A reliable metric of a media outlet's impact and visibility is its popularity based on digital traffic. Additionally, this popularity can ensure the existence of multiple instances of archived website data in Web archives. Based on that, a sample of the 1000 most popular websites in the category of "News & Media Publishers" in Greece was obtained from the SimilarWeb digital intelligence provider. SimilarWeb is a private company aiming to provide a comprehensive and detailed view of the digital world [17]. Information about each website's online market share, global rank, and more was collected manually in the form of text files; using an algorithm scripted in PHP, this information was then parsed and imported into a relational database powered by the MariaDB database management engine. This process is visually presented in Figure 1.
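As a simple illustration of this import step, the PHP sketch below parses a hypothetical tab-separated export (rank, domain, traffic share per line) and inserts it into a MariaDB table via PDO; the file layout, table schema, and credentials are placeholders, not the study's actual format.

```php
<?php
// Minimal sketch: importing the manually collected SimilarWeb rankings into MariaDB.
// The tab-separated layout (rank, domain, traffic share) and the table schema are
// hypothetical placeholders, not the study's actual export format.
$pdo = new PDO('mysql:host=localhost;dbname=media_outlets;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO websites (rank_gr, domain, traffic_share) VALUES (?, ?, ?)');

foreach (file('similarweb_news_greece.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode("\t", $line);
    if (count($parts) < 3) {
        continue;                                // skip malformed lines
    }
    [$rank, $domain, $share] = $parts;
    $stmt->execute([(int) $rank, trim($domain), (float) rtrim($share, '%')]);
}
```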
Both international websites with a popular presence in Greece and popular Greek media outlets were included in the final list of websites to be investigated. Overall, the websites presented a varied mix, including popular international online media outlets (e.g., Yahoo, MSN, BBC, the NYTimes, etc.), popular Greek online media outlets (e.g., protothema.gr, iefimerida.gr, newsbomb.gr, etc.), a series of local news outlets with a popular online presence (e.g., typosthes.gr, thebest.gr, larissanet.gr, etc.), and more.
2.2. Collecting HTML Data and Screenshots of Each Relevant Website
Having established a good dataset of relevant websites, the next stage of this research was to collect HTML data and screenshots of each website for multiple instances over the past few decades. An algorithm was developed in the PHP scripting language that queried the Internet Archive's Wayback Machine for each of the websites collected in the previous stage, in order to obtain the available instances for that specific website.
These queries were performed using the Wayback CDX server Application Programming Interface (API). The CDX API is a tool that allows advanced queries, which can be used to filter out densely repeated captures in order to obtain instances at specific intervals. By using the API's ability to skip results in which a specific field is repeated, instance recovery was accomplished faster and more efficiently. For each instance of a website discovered in the Internet Archive's database, the API provides information on the domain name, the exact timestamp of the snapshot, the snapshot's year and month, the original Uniform Resource Locator (URL), the mime type of the data provided by the service, and the current URL of the archived website on the Wayback Machine. This process of collecting instances is visually presented in Figure 2.
Koehler, in their research, discovered that the half-life of a Web page is approximately two years [18]. Structural or large-scale changes, such as the ones investigated in this research, are especially unlikely to happen often. With that in mind, for the purposes of this study, it was decided that one website instance per year was enough to record any significant changes. In order to accomplish this sampling, the timestamp field returned by the API was utilized. This field has 14 digits corresponding to the year, month, day, hour, minutes, and seconds at which the instance was created. By instructing the API to exclude results that have the same first four digits in this field, the system returns exactly one snapshot per year, as intended (if available). Out of the 1000 websites identified in stage one, 905 were found in the Internet Archive's databases, yielding a grand total of 10,084 instances.
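As a hedged illustration of this sampling, the following PHP sketch queries the public Wayback CDX Server API with its collapse parameter set to the first four timestamp digits, which yields at most one capture per year; the exact fields and filters used in the study may differ.

```php
<?php
// Minimal sketch: at most one capture per year for a domain via the Wayback CDX Server API.
// collapse=timestamp:4 skips rows whose first four timestamp digits (the year) repeat.
function yearlyCaptures(string $domain): array
{
    $query = http_build_query([
        'url'      => $domain,
        'output'   => 'json',
        'fl'       => 'timestamp,original,mimetype',
        'filter'   => 'statuscode:200',
        'collapse' => 'timestamp:4',
    ]);
    $rows = json_decode(file_get_contents("https://web.archive.org/cdx/search/cdx?$query"), true);
    if (!$rows) {
        return [];
    }
    array_shift($rows);                          // the first row holds the field names
    return array_map(fn ($r) => [
        'timestamp' => $r[0],
        'original'  => $r[1],
        'mimetype'  => $r[2],
        'wayback'   => "https://web.archive.org/web/{$r[0]}/{$r[1]}",
    ], $rows);
}
```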
In order to acquire the HTML source code for each instance, an algorithm was developed in the PHP scripting language. This algorithm made use of the Wayback URL field that was collected during the instance information-gathering process to access the archival version of the website on the Wayback Machine. After accessing the instance, the algorithm proceeded to extract the source code and store it in an HTML file. The files were stored in a separate folder for each domain and their filenames represented the year and month of the instance. Before storing the source code into the HTML file, the application used string manipulation PHP functions to remove any part that belonged to the Wayback Machine's Web interface, in order to ensure that the end result was exclusively the original website's source code. This process is visually presented in Figure 3.
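A minimal sketch of this step is given below, assuming the Wayback toolbar is delimited by the HTML comments the Wayback Machine typically injects; the study's own string manipulation may rely on different markers.

```php
<?php
// Minimal sketch: download an archived homepage and strip the Wayback Machine's
// injected toolbar so that only the original source code is stored. The toolbar
// delimiters below are an assumption, not a guaranteed contract.
function saveOriginalHtml(string $waybackUrl, string $targetFile): void
{
    $html  = file_get_contents($waybackUrl);
    $start = strpos($html, '<!-- BEGIN WAYBACK TOOLBAR INSERT -->');
    $end   = strpos($html, '<!-- END WAYBACK TOOLBAR INSERT -->');

    if ($start !== false && $end !== false) {
        $end  += strlen('<!-- END WAYBACK TOOLBAR INSERT -->');
        $html  = substr($html, 0, $start) . substr($html, $end);
    }
    file_put_contents($targetFile, $html);
}
```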
The second important piece of information collected in this stage of the research, besides the HTML source code, was a screenshot of each website instance's homepage. The collected screenshots were used to infer the color palettes of each instance and derive information from there. The UI.Vision RPA plug-in for the Chrome browser was used to acquire these screenshots. This plugin is a tool that allows the automation of various browser operations. The instructions for the automated process are provided to the plugin using JSON syntax, which made it possible to generate a vast series of instructions using an algorithm in the PHP scripting language. These instructions guide the plugin to open a website, pause for the time required for the website to load, capture a screenshot of the website, and then store the captured screenshot in a PNG image file with a filename indicating the year, month, and domain of the instance. This process is visually presented in Figure 4.
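The following PHP sketch illustrates how such a JSON macro could be generated; the command names follow UI.Vision's Selenium-IDE-style syntax, but the exact commands, pause duration, and filename pattern are assumptions rather than the study's actual macro.

```php
<?php
// Minimal sketch: generating a UI.Vision RPA macro (JSON) that opens each archived
// homepage, waits for it to load, and saves a screenshot. Command names are assumed
// from the plugin's Selenium-IDE-style syntax and should be checked against its docs.
function buildScreenshotMacro(array $instances): string
{
    $commands = [];
    foreach ($instances as $i) {   // each $i: ['wayback' => ..., 'domain' => ..., 'year' => ..., 'month' => ...]
        $commands[] = ['Command' => 'open', 'Target' => $i['wayback'], 'Value' => ''];
        $commands[] = ['Command' => 'pause', 'Target' => '8000', 'Value' => ''];   // wait ~8 s for rendering
        $commands[] = [
            'Command' => 'captureEntirePageScreenshot',
            'Target'  => sprintf('%s_%s_%s', $i['year'], $i['month'], $i['domain']),
            'Value'   => '',
        ];
    }
    return json_encode(['Name' => 'archive_screenshots', 'Commands' => $commands], JSON_PRETTY_PRINT);
}
```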
This overall process of gathering archival data related to website aesthetics can be extended for use in other fields and with different objectives that can be accomplished through knowledge of the HTML source code and a screenshot of a website instance; it was presented in greater detail by Lamprogeorgos et al. in 2022 [19]. The complete process is visually presented in Figure 5. It should be noted that collecting screenshots is much more resource- and time-intensive than collecting HTML documents; for this reason, the analysis of screenshots was based on a random sample of 5402 website instance screenshots out of the 10,084 total website instances. The screenshot sample was considered still large enough to lead to safe conclusions.
2.3. Collecting SWT and Aesthetics Data from the HTML Source Code
With the HTML files containing the source code of each website instance collected, the next step was the extraction of data from these files. This was accomplished with the use of an algorithm developed in the PHP scripting language. The algorithm parsed each HTML file into a Document Object Model (DOM) document through the use of PHP's DOMDocument class. It then proceeded to collect information based on the various HTML elements and their attributes. This information was recorded into variables that can be divided into three categories: variables concerning Semantic Web technology adoption, variables concerning the homepage's complexity, and variables concerning the user interface's layout.
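To make this parse-and-count pattern concrete, the following is a minimal PHP sketch (not the study's actual code) that loads one stored homepage into a DOMDocument and counts occurrences of selected elements; the same pattern underlies the element-count variables described in Sections 2.3.1 and 2.3.2. The file path and tag list are illustrative.

```php
<?php
// Minimal sketch of the parse-and-count pattern: load one archived homepage into a DOM
// tree and count occurrences of selected elements. Old, non-validating markup is common,
// so libxml warnings are suppressed.
function countElements(string $htmlFile, array $tags): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($htmlFile));

    $counts = [];
    foreach ($tags as $tag) {
        $counts[$tag] = $doc->getElementsByTagName($tag)->length;
    }
    return $counts;
}

// e.g., $complexity = countElements('bbc.com/2016-01.html', ['div', 'a', 'img', 'table']);
```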
2.3.1. Semantic Web Technologies Adoption Variables
With the coming of HTML5 in 2008, a series of new structural elements were added [20] with the intention of providing not only structural insight, as traditional HTML elements do, but also contextual insight into what the content inside these elements represents. Fulanovic et al. indicate that the usage of these elements is mainly intended for browsers and accessibility devices, and that it is up to the content creators to select the proper element to convey the contents of each part of their website [21]. These elements are <article>, <aside>, <details>, <figcaption>, <figure>, <footer>, <header>, <main>, <mark>, <nav>, <section>, <summary>, and <time>. The data extraction algorithm traverses the DOM of each website, identifies the use of any of these elements, and records it in the variable html_var.
The second variable concerning SWT adoption was og, which recorded whether a website made use of the Open Graph protocol to present itself in the form of a rich object. The protocol's intention is to make it possible for websites to be presented in a social graph, and this is accomplished through a method compatible with the W3C's Resource Description Framework in Attributes (RDFa) recommendation [22].
Another RDFa-compatible system, specifically designed for Twitter, is called "Twitter Cards" [23]; whether it existed in a website instance was recorded in the twitter variable. Both Open Graph and Twitter Cards create meta tag attributes that include information describing the Web page, such as a title, a short description, and a related image. Essentially, both Open Graph and Twitter Cards comprise Semantic Web applications that stem from the realm of Social Media, as Infante-Moro et al. explain [24], and this connection with Social Media has influenced their popularity and their importance to websites' Semantic Web integration.
Although technically on the fence between Web 2.0 and Web 3.0, RDF Site Summary (RSS) feeds represent one of the earliest attempts at Web syndication [25] and have hence been a long-standing means of presenting Web pages and their content in a machine-readable manner. The variable rss records the existence of such feeds in a website instance.
Finally, the last SWT-related variable is sch, which records the existence of schema.org data structures in the website instance. The data structure schemas of the schema.org community, which is supported by major names in Web technologies such as Microsoft, Google, Yahoo, and Yandex, aim to make it easier for developers to integrate sections of machine-readable information into their creations [26]. Their usage provides the flexibility of choosing between three formats: RDFa (as used by Open Graph and Twitter Cards), Microdata, and JSON-LD.
Table 1 presents all SWT-related variables with a short description.
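As an illustration only, the following PHP sketch shows one simplified way the SWT variables could be derived from a parsed homepage, using XPath queries for Open Graph and Twitter Cards meta tags, RSS link alternates, and schema.org Microdata or JSON-LD blocks; the actual detection logic used in the study may be more elaborate.

```php
<?php
// Minimal sketch, under simplifying assumptions, of deriving the SWT variables from a
// parsed homepage: Open Graph / Twitter Cards via meta-tag attribute prefixes, RSS via
// <link> type, schema.org via Microdata itemtype attributes or JSON-LD script blocks.
function detectSwt(DOMDocument $doc): array
{
    $xpath = new DOMXPath($doc);
    return [
        'og'      => $xpath->query('//meta[starts-with(@property, "og:")]')->length > 0,
        'twitter' => $xpath->query('//meta[starts-with(@name, "twitter:") or starts-with(@property, "twitter:")]')->length > 0,
        'rss'     => $xpath->query('//link[@type="application/rss+xml"]')->length > 0,
        'sch'     => $xpath->query('//*[@itemtype] | //script[@type="application/ld+json"]')->length > 0,
    ];
}
```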
2.3.2. Aesthetics and Interface Variables
Visual complexity is a factor that plays an important role in the aesthetics of a website, as discussed by Harper et al. [27], King et al. [8], and Chassy et al. [28]. Harper et al., in their work, argued that complexity as perceived by users is influenced by structural complexity and presented a paradigm that related the complexity of an HTML document's DOM to how users subjectively judged complexity [27]. In a similar manner, the present study collected information regarding specific DOM elements, including both structural and graphical elements, in order to draw conclusions regarding the aesthetics of a website instance and how they evolved through time with regard to visual and structural complexity. In Figure 6, a screenshot of the homepage of the popular European media outlet euronews.com (accessed on 1 January 2018), which displays a high amount of visual and structural complexity, is presented as an example.
The number of <div> elements was recorded in the div_tags variable, while all hyperlinks were identified through the use of anchor elements (<a>) and recorded in the a_tags variable. Similarly, the various graphical components were measured using the img_tags variable to collect <img> elements, the svg_tags variable to collect scalable vector graphics elements (<svg>), the map_tags variable to collect image map elements (<map>), the figure_tags variable to collect the semantic <figure> element, the picture_tags variable to collect the art-direction- and responsive-design-oriented <picture> element, and finally the video_tags variable to collect <video> elements.
The <img> tag is used to embed an image file in an HTML page. The image file can be of any Web-supported filetype, such as compressed JPG files, animated GIF files, transparent PNG files, and even SVG files. An SVG element (<svg>) contains a graphic in a two-dimensional vector format, describing the image in XML-based text. An image map consists of an image with clickable areas, where the user can click on a region of the image and open the provided destination. The <map> tag can contain more than one <area> element, each defining the coordinates and type of an area, so that any part of the image can be linked to other documents without dividing the image. The <figure> tag is used to mark up self-contained content on a Web page; although the <img> tag is already available in HTML to display pictures, the <figure> tag is used to group diagrams, photos, code listings, etc., together with embedded content such as a caption. The most common use of the <picture> element is in responsive designs, where, instead of one image that is scaled up or down based on the viewport width, multiple images can be provided to more nicely fill the browser viewport.
Besides visual complexity, modern website aesthetics and interfaces are heavily influenced by the need to be presentable and easily usable on many different devices, operating at various screen resolutions and aspect ratios. This has been achieved through the fluidity offered by using table elements to contain a website's structure and through the use of responsive design practices and frameworks. In order to study the trends in this area over time, the number of table elements (<table>) was recorded in the table_tags variable for each website instance. Additionally, the viewport meta element was investigated for each website instance as an indicator that the website is undertaking an effort to support multiple screen resolutions, and the results were recorded in the mobile_scale variable. Finally, two very popular responsive design frameworks were investigated. These were Bootstrap, an open source CSS framework developed by the Bootstrap team and operating under the Massachusetts Institute of Technology (MIT) license [29], and Foundation, a similar CSS framework, also operating under the MIT license, developed by ZURB [30]. In order to identify the frameworks, the algorithm tried to detect div elements with the grid "row" class and then proceeded to look for grid column elements through the various "col-" classes for Bootstrap and the "columns" and "large-" or "small-" classes for Foundation. Whenever the use of these frameworks was discovered, it was recorded in the bootstrap and foundation variables, respectively.
Table 2 presents all visual complexity and layout structure-related variables with a short description.
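The following is a minimal, hedged PHP sketch of the class-based heuristic described above; the contains() checks are deliberately loose approximations, and the study's detector may apply stricter matching of the grid classes.

```php
<?php
// Minimal sketch of the class-based framework heuristic: look for grid "row" containers
// and then for framework-specific column classes. The checks are intentionally loose.
function detectFrameworks(DOMDocument $doc): array
{
    $xpath   = new DOMXPath($doc);
    $hasRows = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " row ")]')->length > 0;

    $bootstrapCols  = $xpath->query('//div[contains(@class, "col-")]')->length;
    $foundationCols = $xpath->query('//div[contains(@class, "columns") and (contains(@class, "large-") or contains(@class, "small-"))]')->length;

    return [
        'bootstrap'  => $hasRows && $bootstrapCols > 0,
        'foundation' => $hasRows && $foundationCols > 0,
    ];
}
```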
2.4. Collecting Color Data from the Homepage Screenshot
Having amassed a large number of website instance screenshots, we proceeded to use them to gain a better understanding of how news websites evolved over the last decades in terms of empty space use and colors. Empty space (also called white space or negative space) is the unused space around the content and elements on a website, which designers use to balance the design of the website, organize the content and elements, and improve the visual experience for the user. Figure 7 presents an example of empty space in a homepage screenshot from the popular American media outlet nytimes.com (accessed on 1 January 2014), where all the empty space has been marked in orange. Figure 8 displays an example of the evolution of the homepage of the international media outlet hellomagazine.com throughout the last two decades. This collection of homepage screenshots exemplifies the visible evolution of structural and graphical complexity, as well as color and empty space usage, which comprise the metrics collected by our algorithms from each website instance, as detailed in Section 2.3.2 and in the current section.
An algorithm was created that used the PHP scripting language and its native image handling capabilities to extract information regarding the use of color from the screenshots. First, the algorithm used image scaling and the imagecolorat function to identify and extract colors from a screenshot into the hexadecimal color codes used by HTML5 and CSS3. This work was based on the ImageSampler class developed by The Art of the Web [31]. All colors that occupied less than 3% of the screenshot were excluded from further analysis. In order to better study the remaining extracted colors, they were grouped based on their proximity to a primary, secondary, or tertiary color of the red-yellow-blue (RYB) color model.
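For illustration, the following PHP sketch shows the sampling step in simplified form: the screenshot is downscaled, every remaining pixel is read with imagecolorat, channels are quantized so that near-identical pixels group together (a simplification of the ImageSampler approach), and only colors covering at least 3% of the image are kept.

```php
<?php
// Minimal sketch of the color sampling step (simplified; not the ImageSampler code).
function dominantColors(string $pngFile, float $minShare = 0.03): array
{
    $img = imagecreatefrompng($pngFile);
    $img = imagescale($img, 200);               // downscale to keep per-pixel sampling cheap
    imagepalettetotruecolor($img);              // ensure imagecolorat() returns packed RGB

    [$w, $h] = [imagesx($img), imagesy($img)];
    $counts  = [];
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            $rgb = imagecolorat($img, $x, $y);
            // Quantize each channel to web-safe steps so near-identical pixels group
            // together before the 3% threshold is applied (a simplifying assumption).
            $r = (int) (round((($rgb >> 16) & 0xFF) / 51) * 51);
            $g = (int) (round((($rgb >> 8) & 0xFF) / 51) * 51);
            $b = (int) (round(($rgb & 0xFF) / 51) * 51);
            $hex = sprintf('#%02x%02x%02x', $r, $g, $b);
            $counts[$hex] = ($counts[$hex] ?? 0) + 1;
        }
    }
    arsort($counts);
    $total = $w * $h;
    return array_filter($counts, fn ($c) => $c / $total >= $minShare);
}
```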
As established by Gage in his work in the 1990s [32], the RYB color model incorporates subtractive color mixing and is one of the most popular color models, especially in design. By extension, it has become very useful in digital art and, of course, Web design, since it can be used to identify colors that go well together. A major reason it was decided to convert the red-green-blue (RGB) based HTML hexadecimal colors to RYB was to better study design schemes based on color relationships, as detailed below. The three primary colors of the RYB color wheel are red, yellow, and blue. Pairwise combinations of these create the secondary colors, which are orange, green, and purple. The tertiary colors are established through the combination of primary and secondary colors, and they are red-orange, yellow-orange, yellow-green, blue-green, blue-purple, and red-purple. Additionally, black is achieved by combining all three primary colors and white through the absence of all of them.
The algorithm in this research used saturation to determine whether a color is white: any color with less than 16% saturation was considered white. In a similar manner, brightness was used to identify black: any color with less than 16% brightness was considered black. Considering that websites, as a medium, are presented across many different types of screens and display technologies, colors this close to black or white will almost certainly be perceived as such by the average user. Additionally, the most used color on each website instance was considered to be the empty space color, meaning the color upon which all visual elements of the page appear.
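These thresholds can be expressed as a small classification routine. The sketch below applies them to a single RGB color and otherwise buckets the hue into one of 12 wheel positions; mapping HSV hue sectors directly onto the RYB wheel is a simplification of the study's conversion, and checking black before white is an ordering choice not specified in the text.

```php
<?php
// Minimal sketch applying the stated thresholds to a single RGB color: value (brightness)
// below 16% -> black, saturation below 16% -> white, otherwise the hue is bucketed into
// one of 12 wheel positions. Black is checked first so that very dark, unsaturated pixels
// are not misread as white (an ordering choice, not stated in the text).
function classifyColor(int $r, int $g, int $b): string|int
{
    [$rf, $gf, $bf] = [$r / 255, $g / 255, $b / 255];
    $max = max($rf, $gf, $bf);                   // HSV value (brightness)
    $min = min($rf, $gf, $bf);
    $sat = $max > 0 ? ($max - $min) / $max : 0;  // HSV saturation

    if ($max < 0.16) return 'black';
    if ($sat < 0.16) return 'white';

    // Standard RGB-to-HSV hue in degrees, then split into 12 sectors of 30 degrees.
    $delta = $max - $min;
    if ($max === $rf) {
        $hue = 60 * fmod(($gf - $bf) / $delta, 6);
    } elseif ($max === $gf) {
        $hue = 60 * ((($bf - $rf) / $delta) + 2);
    } else {
        $hue = 60 * ((($rf - $gf) / $delta) + 4);
    }
    if ($hue < 0) $hue += 360;

    return (int) floor($hue / 30) % 12;          // wheel position 0-11
}
```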
In order to identify whether a color scheme (or color combination) is used in each website instance that uses colors besides black and white, an additional algorithm was developed in the PHP scripting language. This algorithm was designed to identify five major methods of color combination based on the RYB color wheel, as presented in Figure 9:
Monochromatic: shades, tones, and tints of one base color
Complementary: colors that are on opposite sides of the color wheel
Analogous: colors that are side by side on the color wheel
Triadic: three colors that are evenly spaced on the color wheel
Tetradic: four colors that are evenly spaced on the color wheel
The algorithm measured the minimum and maximum distance between the colors on the color wheel. Based on the number of colors and these two distances, conclusions can be drawn regarding the use of a harmonic color combination, as presented in Figure 10.
If the number of colors used is one, then the color scheme used is monochromatic. If the number of colors used is two, and if the maximum distance is lower than two, the analogous scheme is used, but if the maximum distance is greater than five, the complementary scheme is used. Similar conclusions can be drawn from the usage of three or four colors. If three colors are used and the minimum distance is greater than three, then the triadic color scheme is used. Similarly, if four colors are used and the minimum distance is greater than two, then the tetradic color scheme is implemented. The algorithm rejects any other situation and classifies it as a non-harmonic color combination.
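These decision rules can be re-expressed as a short PHP function over the wheel positions (0-11) of the colors found on a homepage; the sketch below mirrors the stated thresholds and is not the study's exact implementation.

```php
<?php
// Minimal sketch re-expressing the decision rules above. Input: the distinct wheel
// positions (0-11) of the non-black, non-white colors found on a homepage.
function colorScheme(array $positions): string
{
    $positions = array_values(array_unique($positions));
    $n = count($positions);
    if ($n === 0) return 'non-harmonic';
    if ($n === 1) return 'monochromatic';

    // Circular distances between every pair of positions on the 12-step wheel.
    $distances = [];
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $d = abs($positions[$i] - $positions[$j]);
            $distances[] = min($d, 12 - $d);
        }
    }
    $minD = min($distances);
    $maxD = max($distances);

    if ($n === 2 && $maxD < 2) return 'analogous';
    if ($n === 2 && $maxD > 5) return 'complementary';
    if ($n === 3 && $minD > 3) return 'triadic';
    if ($n === 4 && $minD > 2) return 'tetradic';
    return 'non-harmonic';
}
```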
Having obtained all relevant information through the steps described above, we proceeded to study the following:
How many colors, besides black and white, appear in the website instances on average by year?
How much is each of the 14 basic colors of the RYB model used in the website instances on average by year?
How popular was the use of white, black, or colored empty space through the years?
How popular were the different types of harmonic color combination schemes through the years?
The answers to these questions, alongside all other information collected throughout the stages of this research as presented in this section, are available in the results section below.
5. Conclusions
In this research, an innovative method was used to collect information from the HTML source code and homepage screenshots of a large number of websites, over a period of two decades, using data extraction techniques on archival data. The websites investigated were the top 1000 online media outlets based on Web traffic in Greece and included both international media outlets and Greek national and local media outlets. The main goal of the study was to observe the course of these websites throughout the past decades with regard to the adoption of popular Semantic Web technologies and the aesthetic evolution of their interfaces, which included aspects of DOM structure and visual complexity, fluid and responsive layout design techniques, and color usage and schemes.
The introduction of SWT to the websites was fast and extensive, with the main motivation behind it being the greater diffusion of media content. Structural and visual complexity displayed a steady but significant positive trend, aiming to achieve better first impressions while still maintaining performance across a plethora of devices. The rise of the mobile Internet guided the investigated websites towards the adoption of responsive Web design principles. An increase in visual complexity was also noted in the usage of colors, accompanied not only by an effort to better abide by the principles of accessibility, as evidenced by the use of black as an empty space color, but also by an effort to more closely adhere to color harmony through the use of color combinations.
The study’s sample is large but does present limitations, in the sense that the criteria for selection were popularity on the Greek Web. Focusing on websites popular in a different country might have presented different results due to cultural or other factors. That being said, many of the studied websites were international media outlets, which would be popular in most of the world. An additional limitation of the research can be found in its focus on websites with high traffic, which might be inclined to adopt current technologies and trends more rapidly. Finding a more varied sample of media outlets that would include low traffic or niche outlets could provide an interesting contrast. In the future, this research can be expanded to different fields of online activity, beyond news and media, and attempt to find comparable results. Additionally, focusing on regions with a large cultural distance to Greece could lead to conclusions regarding the connection between cultural identity and aesthetic trends. Moving forward, we will focus our future work on collecting information regarding a vast array of websites from different fields, beyond news outlets, while simultaneously adapting our metrics to better identify regional aesthetic trends, in order to contrast their development to global trends.
The World Wide Web is a constantly evolving entity that is influenced both by the rise and fall of technologies and by the continuous evolution of human nature through cultural trends, global events, and globalization in general. Studies of the Web's past and its course through time can provide valuable knowledge, pertaining not only to the present but hopefully preparing us for the future. The advancements of the Semantic Web and the aesthetic evolution of user interfaces can be useful tools at the disposal of every online media outlet, both established and new, and can lead to the overall betterment of the undeniably valuable services they provide.