4.1. Statistics and Analysis
Table 4 depicts a sample of the first ten entries of our dataset. The first column shows the websites and the second column presents the ranking of each website based on its popularity according to Alexa Internet (_alexa_rank). Columns 3 to 7 contain scores for RSS feeds and for each of the four major Semantic Web technologies, derived by dividing the number of pages that used each technology, as described in Section 3.2 (variables _rss_feeds, _html, _og, _twitter, _schema_org), by the total number of pages crawled (variable _pages_crawled), thus creating the new variables (_rss_score, _html_score, _og_score, _twitter_score, _schema_score). Columns 8 to 10 contain the variables _html_variety, _og_variety and _twitter_variety. The last column contains the rating for each site based on the SWTI rating system detailed in Section 3.3 (variable _swti). The variables _microformats, _microformats_variety and _other, which identified the usage of microformats or other JSON data in each web page, were omitted from further statistical analysis because the percentage of websites with findings in these metrics was below 1%.
Table 5 depicts the descriptive statistics for each variable and Table 6 depicts the frequency-related statistics. In Figure 4, the histogram and boxplot of the _alexa_rank variable are presented, depicting the distribution and dispersion of the variable. The histograms and boxplots depicting the distribution and dispersion of the remaining variables are presented in Appendix A.
Table 7 depicts the descriptive statistics of the _swti variable for websites that were evaluated as Yellow or Green during the expert screening process described in Section 3.1.2 (variables _swti_yellow, _swti_green). Frequency-related statistics for these variables are presented in Table 8 and their histograms and boxplots in Figure 5 and Figure 6.
In order to analyze the interrelation between the independent variables _html_score, _og_score, _twitter_score and _schema_score, Pearson's r correlation coefficient was computed [35]. The results are shown in Table 9, where the correlations between the variables are depicted. All the correlations have a reported significance level of 0.000, confirming that they are statistically significant. The Pearson's r coefficients range from 0.316 (the weakest positive correlation, between _html_score and _twitter_score) to 0.482 (the strongest positive correlation, between _og_score and _twitter_score).
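For illustration, a correlation matrix of this kind can be computed directly in R; the data frame name site_scores and the way it is indexed below are placeholders standing in for our dataset, not part of the published analysis code.

# Placeholder data frame holding the per-site score variables
score_cols <- c("_html_score", "_og_score", "_twitter_score", "_schema_score")
scores <- site_scores[, score_cols]

# Pearson correlation matrix among the four score variables
round(cor(scores, method = "pearson"), 3)

# Coefficient and p-value for a single pair of variables
cor.test(scores[["_og_score"]], scores[["_twitter_score"]], method = "pearson")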
In an effort to examine the interrelationship between the collected Semantic Web metrics and a website's popularity, the websites were ranked according to their measured SWTI rating (variable _swti_rank) and Spearman's rank correlation coefficient was calculated. The results are presented in Table 10. The results of the Spearman correlation indicated that there is a statistically significant, very small positive relationship between _swti_rank and _alexa_rank (r(3630) = 0.0683, p < 0.001). This correlation, despite being very small, prompted the researchers to further investigate the interrelationship between SW integration and popularity using a gradient boosting analysis which included every metric collected by the crawling algorithm.
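The corresponding Spearman rank correlation can be obtained in the same way; the column names _swti_rank and _alexa_rank in the placeholder data frame are again assumptions made only for this example.

# Spearman's rank correlation between the SWTI-based ranking and the Alexa ranking
cor.test(site_scores[["_swti_rank"]],
         site_scores[["_alexa_rank"]],
         method = "spearman")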
4.2. Gradient Boosting Analysis Using XGBoost
After the samples have been collected, the XGBoost models are built using a grid search over the parameter space. XGBoost (eXtreme Gradient Boosting) is a fast implementation of gradient boosting [36]. It is a scalable end-to-end tree boosting system that has been widely used and achieves state-of-the-art classification and regression performance [37]. It improves on earlier implementations by reducing overfitting, parallelizing tree construction, and accelerating execution. It is an ensemble of regression trees, known as CART [38].
The prediction score is calculated by adding all of the trees together, as indicated in the following equation,

$$\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \qquad (1)$$

where $M$ is the number of trees and $f_m$ is the $m$-th independent CART tree. In contrast to Friedman's [39] original gradient boosting architecture, XGBoost adds a regularized objective to the loss function. The regularized objective for the $m$-th iteration optimization is provided by

$$\mathcal{L}^{(m)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(m)}\right) + \Omega(f_m), \qquad (2)$$

where $n$ denotes the number of samples, $l$ denotes the differentiable loss function that quantifies the difference between the predicted $\hat{y}_i^{(m)}$ and the target $y_i$, and $\Omega$ denotes the regularization term

$$\Omega(f_m) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2, \qquad (3)$$

where $T$ is the number of nodes and $w$ denotes each node's weight. The regularization degree is controlled by two constants, $\gamma$ and $\lambda$. Furthermore, taking into account that for the $m$-th iteration the following relation holds,

$$\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + f_m(x_i), \qquad (4)$$

we can recast Equation (2) as

$$\mathcal{L}^{(m)} \simeq \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(m-1)}\right) + g_i f_m(x_i) + \frac{1}{2} h_i f_m^2(x_i) \right] + \Omega(f_m), \qquad (5)$$

where we introduced the operators $g_i = \partial_{\hat{y}^{(m-1)}} l\left(y_i, \hat{y}^{(m-1)}\right)$ and $h_i = \partial^2_{\hat{y}^{(m-1)}} l\left(y_i, \hat{y}^{(m-1)}\right)$, which are the loss function's first- and second-order derivatives, respectively.
XGBoost makes the gradient converge more quickly and more accurately than existing gradient boosting frameworks by using the second-order Taylor expansion of the loss function [36]. It also unifies the computation of the loss function's derivatives. Furthermore, adding the regularization term to the target function allows XGBoost to balance the decrease of the target function, reduce the model's complexity, and effectively resolve overfitting [36].
Furthermore, XGBoost can use the weight to determine the importance of a feature. The number of times a feature is utilized to partition the data across all trees is the weight in XGBoost [36], and is given by the equation

$$W_k = \sum_{m=1}^{M} \sum_{j \in Q_m} I(\beta_j = k), \qquad (6)$$

with the boundary conditions $1 \le j \le N_m$. $M$ is the number of trees or iterations, $N_m$ denotes the number of nodes in the $m$-th tree, $Q_m$ denotes the tree's non-leaf nodes, $\beta_j$ stands for the feature corresponding to node $j$, and $I(\cdot)$ denotes the indicator function.
The Alexa ranking of the websites under investigation is used as the outcome of the fitted model. The features collected by the crawling mechanism are used as the predictor variables. Since the main point of the analysis is to identify the most important features related to Semantic Web technologies with respect to the ranking of a website, we perform a grid search over the parameter space of XGBoost. The Alexa ranking is used to extract four classes based on its quartiles. This transforms the regression analysis into a multiclass classification problem with four classes. The first class is for the top 25% of the websites in ranking, and the other three classes are for the intervals [0%, 25%), [25%, 50%) and [50%, 75%] of the remaining websites.
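As a minimal sketch of how the four classes can be derived from the Alexa ranking in R (the data frame sites and the columns alexa_rank and rank_class are hypothetical names used only for this example):

# Cut the Alexa ranking at its quartiles to obtain four classes;
# lower rank values correspond to more popular websites, so the first
# interval holds the top 25% of the websites.
breaks <- quantile(sites$alexa_rank, probs = c(0, 0.25, 0.5, 0.75, 1))
sites$rank_class <- cut(sites$alexa_rank,
                        breaks = breaks,
                        include.lowest = TRUE,
                        labels = c("top25", "q2", "q3", "q4"))
table(sites$rank_class)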
The measure logLoss, or logarithmic loss, penalizes a model’s inaccurate classifications. This is particularly useful for multiclass classification, in which the approach assigns a probability to each of the classes for all observations (see, e.g., [
40]). As we are not expecting a binary response, the logLoss function was chosen over traditional accuracy measures. The logLoss function is given by

$$\mathrm{logLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log\left(p_{ij}\right), \qquad (7)$$

where $M$ is the number of classes, $N$ is the number of observations, $y_{ij}$ indicates whether observation $i$ belongs to class $j$, and $p_{ij}$ is the respective predicted probability.
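As a plain-R illustration of this formula (not the implementation used by the caret package), the multiclass logLoss can be written as follows, assuming y is an N x M matrix of 0/1 class indicators and p the matching matrix of predicted probabilities:

# Multiclass logarithmic loss
# y: N x M matrix of 0/1 class indicators; p: N x M matrix of predicted probabilities
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip probabilities away from 0 and 1
  -sum(y * log(p)) / nrow(y)
}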
The number of pages crawled is used to scale the related extracted features, namely all the page-count features for the respective Semantic Web technologies and the feature extracted for the RSS feeds. This transformation "scales out" the number of pages crawled, isolating the effect and the importance of the measured Semantic Web features on ranking. In particular, the variables "_html", "_og", "_twitter", "_rss_feeds", "_schema_org", "_other" and "_microformats" are transformed by dividing them by the number of pages crawled ("_pages_crawled").
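A minimal sketch of this scaling step in R, assuming the crawled counts are stored in a data frame called sites with the column names listed above:

# Divide each page-count feature by the number of pages crawled per site
count_cols <- c("_html", "_og", "_twitter", "_rss_feeds",
                "_schema_org", "_other", "_microformats")
for (col in count_cols) {
  sites[[col]] <- sites[[col]] / sites[["_pages_crawled"]]
}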
The parameters of machine learning models have a significant impact on model performance. As a result, in order to create an appropriate XGBoost model, the XGBoost parameters must be tuned. XGBoost has seven key parameters: eta, max depth, min child weight, subsample, colsample bytree, gamma, and lambda. Eta is the learning rate, which shrinks the contribution of each new tree at every boosting iteration. The greatest depth to which a tree can grow is represented by max depth. A larger max depth indicates a higher degree of fitting, but it also indicates a higher risk of overfitting. The minimum sum of instance weight required in a child is called min child weight. The algorithm will be more conservative if min child weight is set to a large value. The subsample ratio of the training instances is referred to as subsample. Overfitting can be avoided if this option is set correctly. When constructing each tree, colsample bytree refers to the subsample ratio of features. The minimum loss reduction necessary to make a further partition on a tree leaf node is referred to as gamma. The higher the gamma, the more conservative the algorithm is. Lambda represents the L2 regularization term on the weights. Additionally, increasing this value causes the model to become more conservative. We perform a grid search using the facilities of the caret R package [41]. We search the parameter space with the "grid" method, using 10-fold cross-validation and a tuneLength of 30, which specifies the total number of unique parameter combinations, through the trainControl and train functions of the caret package [41]. The optimal values identified are
{eta = 0.3, gamma = 0, min child weight = 5, max depth = 6, subsample = 0.5, colsample_bytree = 0.5, lambda = 0.5}.
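A sketch of how such a search can be set up with the caret package is given below; the data frame swti_data and its outcome column rank_class are placeholder names, and the call is an illustration of the approach rather than a verbatim reproduction of the analysis code.

library(caret)

# 10-fold cross-validation with a grid search, optimizing the multiclass logLoss;
# classProbs = TRUE requires the class labels to be valid R variable names.
ctrl <- trainControl(method = "cv",
                     number = 10,
                     search = "grid",
                     classProbs = TRUE,
                     summaryFunction = mnLogLoss)

set.seed(42)  # arbitrary seed, only to make the folds reproducible
fit <- train(rank_class ~ .,
             data = swti_data,
             method = "xgbTree",
             metric = "logLoss",
             maximize = FALSE,
             trControl = ctrl,
             tuneLength = 30)

fit$bestTune  # the parameter combination selected by the search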
The overall statistics are presented in
Table 11 and the statistics by class in
Table 12.
Figure 7 presents the sorted accuracy for each model fit and
Figure 8 displays the various variables and their importance.
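For reference, an importance ranking like the one shown in Figure 8 could be extracted from a fitted caret/xgbTree model such as the hypothetical fit object sketched earlier; xgb.importance reports, among other measures, the frequency (weight) importance discussed above, while varImp gives caret's scaled view of variable importance.

library(xgboost)

# Frequency (weight), gain and cover importance from the underlying booster
imp <- xgb.importance(model = fit$finalModel)
head(imp)

# caret's scaled variable importance for the same fitted model
plot(varImp(fit))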