4.2. Model Assessment
The PLS-SEM path analysis algorithm calculates standardized partial regression coefficients within the structural model after approximating the parameters of the measurement model [73]. Consequently, a two-stage assessment of the psychometric properties of the proposed conceptual model was conducted. The quality of the measurement model was assessed by evaluating four aspects: the reliability of indicators, internal consistency, convergent validity, and discriminant validity.
To assess indicator reliability, the standardized loadings of items on their corresponding constructs were explored. Hulland’s purification guidelines [74] suggest retaining items in the measurement model only if their standardized loadings are equal to or exceed 0.708. Items GMH4, LRN3, LRN5, UIS2, UIS5, ENG3, and BEH5 were eliminated from the measurement model and subsequent analysis since their loadings were below the recommended threshold. The confirmatory factor analysis (CFA) results shown in Table 1 indicate that the standardized loadings of all the remaining items in the measurement model are above the acceptable cut-off value. The standardized loadings of items comprising the measurement model ranged from 0.714 to 0.883, signifying that constructs explained between 50.98% and 77.97% of their items’ variance.
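Since an item’s communality equals the square of its standardized loading, the 0.708 cut-off corresponds to a construct explaining at least 50% of an item’s variance (0.708² ≈ 0.501). The following minimal Python sketch, using hypothetical loadings rather than the study’s data, illustrates the retention rule:

```python
# Hypothetical standardized loadings; the study's actual values appear in Table 1.
loadings = {"GMH1": 0.81, "GMH4": 0.63, "LRN3": 0.69}

for item, lam in loadings.items():
    explained = lam ** 2  # communality: item variance explained by its construct
    verdict = "retain" if lam >= 0.708 else "eliminate"
    print(f"{item}: loading = {lam:.3f}, variance explained = {explained:.2%} -> {verdict}")
```

Under this rule, the reported loading range of 0.714 to 0.883 translates into communalities of 0.714² ≈ 0.5098 and 0.883² ≈ 0.7797, matching the 50.98–77.97% figures above.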
The internal consistency of constructs was assessed using three indices: Cronbach’s alpha, composite reliability (rho_C), and the consistent reliability coefficient (rho_A). Cronbach’s alpha [75] serves as a lower-bound estimate of construct reliability, assuming an equal weighting of items. Composite reliability [76], which considers the actual item loadings, provides a more accurate internal consistency estimate than Cronbach’s alpha. Dijkstra and Henseler’s consistent reliability coefficient [77] is an approximately exact measure of construct reliability, acting as a middle ground between Cronbach’s alpha and composite reliability [78]. For these indices, values ranging from 0.60 to 0.70 are deemed satisfactory in exploratory research, while values between 0.70 and 0.95 indicate good internal consistency. However, values exceeding 0.95 suggest item redundancy that can negatively impact content validity [79]. The exclusion of the inadequately phrased item ENG3 from the measurement model yielded acceptable values for all three internal consistency indices of the player engagement construct, and the same holds for item BEH5 and its corresponding construct. As shown in Table 2, the calculated values of all three indices ranged between 0.672 and 0.916, indicating satisfactory to good internal consistency for all eight constructs in the research framework.
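For readers who wish to reproduce these indices, the two bounding measures can be computed directly from the standardized loadings and the raw item scores; the sketch below is illustrative only (rho_A additionally requires the PLS weight estimates and is omitted here):

```python
import numpy as np

def composite_reliability(loadings):
    """rho_C for standardized items: (sum of loadings)^2 divided by
    (sum of loadings)^2 plus the summed error variances (1 - loading^2)."""
    lam = np.asarray(loadings, dtype=float)
    s = lam.sum() ** 2
    return s / (s + (1.0 - lam ** 2).sum())

def cronbach_alpha(scores):
    """Cronbach's alpha from an (observations x items) score matrix,
    assuming equal item weighting."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    return k / (k - 1) * (1.0 - X.var(axis=0, ddof=1).sum()
                          / X.sum(axis=1).var(ddof=1))

# Hypothetical three-item construct
print(composite_reliability([0.81, 0.77, 0.74]))  # ~0.82
```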
Convergent validity was examined with the Average Variance Extracted (AVE) criterion, as suggested by Hair et al. [78]. An AVE value of 0.50 or higher is deemed satisfactory because it indicates that the variance a construct shares with its items surpasses the variance due to measurement error. As presented in the last column of Table 2, all constructs constituting the research framework met this criterion, signifying the robust convergent validity of the measurement model.
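AVE is simply the mean of the squared standardized loadings of a construct’s items; a short sketch under the same hypothetical loadings as above:

```python
import numpy as np

def average_variance_extracted(loadings):
    """Mean squared standardized loading; >= 0.50 means the construct
    captures more of its items' variance than measurement error does."""
    lam = np.asarray(loadings, dtype=float)
    return float((lam ** 2).mean())

print(average_variance_extracted([0.81, 0.77, 0.74]))  # ~0.60 -> satisfactory
```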
Discriminant validity, which represents the degree to which a specific construct differs from the others in the model, was scrutinized using the Heterotrait–Monotrait (HTMT) ratio of correlations introduced by Henseler et al. [80]. This ratio is computed by dividing the average correlation of indicators measuring different constructs by the geometric mean of the average correlations of indicators measuring the same construct. For conceptually related constructs, discriminant validity is deemed absent if the HTMT value exceeds the 0.90 threshold, whereas for conceptually distinct constructs the cut-off value is reduced to 0.85 [79]. The study findings reported in Table 3 demonstrate that the HTMT values of all the constructs in the research framework are below the respective thresholds, thereby meeting the requirement of the discriminant validity criterion.
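A sketch of the HTMT computation for one pair of constructs, assuming a full indicator correlation matrix R and index lists identifying each construct’s indicators (both hypothetical inputs here):

```python
import numpy as np

def htmt(R, idx_a, idx_b):
    """HTMT: mean correlation between indicators of different constructs,
    divided by the geometric mean of the average within-construct
    indicator correlations (off-diagonal entries only)."""
    R = np.asarray(R, dtype=float)
    hetero = R[np.ix_(idx_a, idx_b)].mean()

    def monotrait(idx):
        block = R[np.ix_(idx, idx)]
        return block[~np.eye(len(idx), dtype=bool)].mean()

    return hetero / np.sqrt(monotrait(idx_a) * monotrait(idx_b))
```

The returned values would then be checked against the 0.85/0.90 thresholds discussed above.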
After confirming that the quality of the measurement model was satisfactory, the appropriateness of the structural model was evaluated. This assessment involved analyzing collinearity, the significance of paths, the explanatory power of the research model, the effect size of exogenous constructs, the predictive power of the research model, and the predictive relevance of exogenous constructs.
Evaluating the structural model requires estimating numerous regression equations that depict the relationships between constructs. If two or more constructs within the structural model represent similar concepts, there is a risk of excessive collinearity, which could result in biased estimates of the partial regression coefficients. The Variance Inflation Factor (VIF) is a widely used metric for detecting collinearity among predictor constructs in the structural model. While VIF values of 5 or higher indicate collinearity problems among exogenous constructs, issues may arise even with VIF values of 3 [78]. As a result, VIF values should ideally be close to or below 3. Table 4 shows that the VIF values of the predictor constructs range from 1.000 to 2.033, confirming the absence of collinearity in the structural model.
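The VIF of each predictor construct can be obtained by regressing its scores on the remaining predictors; a minimal sketch over a hypothetical matrix of latent variable scores:

```python
import numpy as np

def variance_inflation_factors(scores):
    """VIF per predictor: 1 / (1 - R^2) of the auxiliary regression of
    each construct's scores on all remaining predictor constructs."""
    X = np.asarray(scores, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r2 = 1.0 - ((y - Z @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs
```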
The explanatory power of the model is assessed using the coefficient of determination (R²), which illustrates the proportion of variance in endogenous constructs explained by their predictors. Acceptable R² values depend on the specific research discipline and study in question [81]. Orehovački [82] proposes that, in empirical research focused on software quality evaluation, R² values of 0.15, 0.34, and 0.46 signify weak, moderate, and substantial explanatory capacities of exogenous constructs within the research model, respectively. The adjusted R² is commonly interpreted instead of R² because it accounts for the size of the model [79]. The study results shown in
Table 5 reveal that 67.7% of the variance in behavioral intention was accounted for by player enjoyment, player engagement, and gameplay mechanics; player enjoyment and gameplay mechanics explained 49.6% of the variance in player engagement; 30.4% of the variance in player enjoyment was accounted for by gameplay mechanics and visual elements; 27.8% of the variance in gameplay mechanics was explained by audio elements and visual elements; the user interface sensibility and gameplay mechanics accounted for 29% of the variance in learnability; 29.4% of the variance in user interface sensibility was explained by the visual elements; while 28.3% of the variance in the visual elements was accounted for by the audio elements.
The reported findings indicate that the determinants of behavioral intention and player engagement have substantial explanatory power, while the predictors of player enjoyment, gameplay mechanics, learnability, user interface sensibility, and visual elements demonstrate weak explanatory power.
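For reference, the adjusted R² interpreted above penalizes the raw R² for the number of predictors; a sketch with a hypothetical sample size, since n is reported elsewhere in the paper:

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination: n is the sample size and
    k the number of predictor constructs of the endogenous construct."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: raw R^2 = 0.677 for behavioral intention, 3 predictors, n = 200
print(adjusted_r2(0.677, 200, 3))  # ~0.672
```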
The hypothesized interplay among constructs in the research framework was examined by evaluating the magnitude and significance of the path coefficients, employing a bootstrapping resampling procedure with asymptotic two-tailed t-statistics. The number of cases equaled the sample size, while the number of bootstrap samples was set to 5000.
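A conceptual sketch of this procedure follows; estimate_path stands in for the full PLS-SEM estimation of a single path coefficient and is an assumption of the sketch, not part of the original analysis. Resampling cases with replacement yields a standard error, and the resulting t-statistic is compared against two-tailed critical values (e.g., 1.96 for p < 0.05):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def bootstrap_t(estimate_path, data, n_boot=5000):
    """Bootstrap a path coefficient: resample cases with replacement,
    re-estimate, and form t = original estimate / bootstrap std. error."""
    data = np.asarray(data)
    n = len(data)
    beta_hat = estimate_path(data)
    betas = np.array([estimate_path(data[rng.integers(0, n, size=n)])
                      for _ in range(n_boot)])
    return beta_hat, beta_hat / betas.std(ddof=1)
```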
Table 6 presents the outcomes of hypothesis testing. The findings revealed that gameplay mechanics (β = 0.140, p < 0.05), player enjoyment (β = 0.556, p < 0.001), and player engagement (β = 0.242, p < 0.005) significantly influenced behavioral intention, thus corroborating hypotheses H10, H11, and H13, respectively. The data analysis also determined that audio elements (β = 0.256, p < 0.01) and the user interface sensibility (β = 0.403, p < 0.001) substantially impacted gameplay mechanics, hence supporting hypotheses H2 and H5. Furthermore, gameplay mechanics (β = 0.215, p < 0.05) and player enjoyment (β = 0.576, p < 0.001) contributed significantly to player engagement, thus confirming hypotheses H7 and H12. Additionally, the user interface sensibility (β = 0.233, p < 0.05) and gameplay mechanics (β = 0.399, p < 0.001) exhibited a notable effect on learnability, hence substantiating hypotheses H6 and H9, respectively. The visual elements of the video game (β = 0.274, p < 0.01) and its gameplay mechanics (β = 0.367, p < 0.001) considerably influenced player enjoyment, thus lending support to hypotheses H4 and H8. Finally, the study findings revealed that audio elements (β = 0.539, p < 0.001) are a significant determinant of visual elements, which in turn (β = 0.549, p < 0.001) serve as a significant antecedent of the user interface sensibility, thereby providing support for hypotheses H1 and H3, respectively.
The effect size (f²) represents the magnitude of the influence of an exogenous construct on an endogenous construct. An f² value of 0.02, 0.15, or 0.35 signifies a small, medium, or large effect, respectively [83]. Based on the f² values presented in Table 7, we can interpret the strength of the relationships between the constructs for the given hypotheses. Audio elements considerably impact visual elements (f² = 0.409) while having only a minimal influence on gameplay mechanics (f² = 0.083). The visual aspects greatly contribute to the user interface sensibility (f² = 0.431) and marginally affect player enjoyment (f² = 0.078). The user interface sensibility moderately influences gameplay mechanics (f² = 0.207) and has a minor impact on learnability (f² = 0.060). Gameplay mechanics exert a weak effect on player engagement (f² = 0.069), a moderate influence on learnability (f² = 0.176), a negligible impact on behavioral intention (f² = 0.043), and a mild contribution to player enjoyment (f² = 0.140). Player enjoyment plays a crucial role in shaping both behavioral intention (f² = 0.486) and player engagement (f² = 0.495). Finally, player engagement has a minor, yet noticeable, effect on behavioral intention (f² = 0.093).
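For completeness, the f² values in Table 7 follow the standard definition based on the change in explained variance when the exogenous construct in question is omitted from the model [83]:

$$ f^2 = \frac{R^2_{\text{included}} - R^2_{\text{excluded}}}{1 - R^2_{\text{included}}} $$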
The nonparametric cross-validated redundancy measure Q² by Stone [84] and Geisser [85], which utilizes the blindfolding reuse technique to predict the items of endogenous constructs, has frequently been used in the literature to evaluate the predictive validity of exogenous constructs. However, since Q² integrates aspects of both out-of-sample forecasting and in-sample explanatory strength [86], it does not serve as a true measure of out-of-sample prediction [78]. To address this issue, Shmueli et al. [86,87] developed the PLSpredict algorithm as an alternative method for evaluating a model’s predictive relevance.
PLSpredict uses k-fold cross-validation, in which a fold is a subset of the total sample and k represents the number of subsets. This method determines whether the model performs better than the most basic benchmark, defined as the indicator means from the analysis sample [78,79,86]. PLS path models with Q²_predict values above 0 exhibit lower prediction errors than this naïve benchmark. Since Q²_predict can be understood in a similar manner as R², values surpassing 0, 0.25, and 0.5 indicate a small, medium, and large predictive relevance of the PLS path model, respectively [78].
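A minimal sketch of the Q²_predict statistic for a single indicator, assuming y_pred holds the out-of-fold predictions produced by the PLS model and y_train_mean is the indicator mean from the analysis (training) sample:

```python
import numpy as np

def q2_predict(y_true, y_pred, y_train_mean):
    """1 minus the ratio of the model's squared prediction errors to those
    of the naive benchmark (the training-sample indicator mean)."""
    y_true = np.asarray(y_true, dtype=float)
    sse_model = ((y_true - np.asarray(y_pred, dtype=float)) ** 2).sum()
    sse_naive = ((y_true - y_train_mean) ** 2).sum()
    return 1.0 - sse_model / sse_naive
```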
The predictive strength of a model is usually evaluated using the root mean squared error (RMSE). However, when the distribution of prediction errors is notably non-symmetric, the mean absolute error (MAE) represents a suitable alternative [87]. This evaluation procedure involves comparing the RMSE (or MAE) values with those of a simple linear regression model (LM) benchmark. The outcomes of this comparison [87] can be as follows: (a) if the RMSE (or MAE) values surpass those of the simple LM benchmark across all items, the model lacks predictive strength; (b) if the majority of items in the endogenous construct exhibit larger prediction errors than the LM benchmark, the model has low predictive strength; (c) if a minority or equal number of construct items show higher prediction errors compared with the LM benchmark, the model has medium predictive strength; and (d) if none of the items demonstrate higher RMSE (or MAE) values than the LM benchmark, the model has high predictive strength.
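This four-level scheme can be expressed compactly; the sketch below takes per-item MAE values of the PLS model and the LM benchmark (hypothetical inputs) and returns the corresponding verdict:

```python
def predictive_power(mae_pls, mae_lm):
    """Apply the four outcomes of Shmueli et al. [87]: count the items for
    which the PLS model predicts worse (higher MAE) than the LM benchmark."""
    worse = sum(p > l for p, l in zip(mae_pls, mae_lm))
    k = len(mae_pls)
    if worse == k:
        return "none"    # (a) all items worse than the LM benchmark
    if worse > k / 2:
        return "low"     # (b) a majority of items worse
    if worse > 0:
        return "medium"  # (c) a minority (or half) of items worse
    return "high"        # (d) no item worse
```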
Upon visually inspecting the error histograms, it was revealed that the distribution of prediction errors is highly non-symmetric. Consequently, the evaluation of predictive power was based on the MAE. As displayed in the fourth column of Table 8, only a minority of endogenous construct items exhibit higher PLS-SEM_MAE values than the naïve LM_MAE benchmark, indicating that the proposed model has medium predictive power.
Changes in Q² represent the relative influence (q²) of exogenous constructs on predicting the observed measures of endogenous constructs within the structural model. According to Henseler et al. [67], q² values of 0.02, 0.15, or 0.35 indicate that a specific exogenous construct has weak, moderate, or substantial relevance in predicting an endogenous construct, respectively. The q² values are calculated as follows [83]:

$$ q^2 = \frac{Q^2_{\text{included}} - Q^2_{\text{excluded}}}{1 - Q^2_{\text{included}}} $$

where Q²_included represents the value of Q² for an endogenous construct when the related exogenous construct is factored into the model calculation, while Q²_excluded signifies the value of Q² for the same endogenous construct when the associated exogenous construct is not considered in the model estimation.
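To illustrate the formula with hypothetical values rather than the study’s results: if including an exogenous construct yields Q²_included = 0.40 and omitting it yields Q²_excluded = 0.34, then

$$ q^2 = \frac{0.40 - 0.34}{1 - 0.40} = 0.10, $$

which would sit between the 0.02 and 0.15 benchmarks, i.e., weak-to-moderate predictive relevance.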
The findings presented in Table 9 indicate that player enjoyment emerges as a robust predictor and player engagement serves as a moderate antecedent, while gameplay mechanics appear to be a weak determinant of a player’s intention to continue engaging with the game.
While player enjoyment exhibits a moderate level of importance in predicting player engagement, gameplay mechanics yield very poor predictive relevance for the same construct. Nevertheless, gameplay mechanics display moderate significance in forecasting player enjoyment, akin to visual elements. Audio elements represent a weak predictor of gameplay mechanics, while the user interface sensibility emerges as a moderate predictor in the same respect. Gameplay mechanics exhibit a moderate degree of relevance concerning learnability, whereas the user interface sensibility is identified as a very weak predictor of the same construct. Lastly, visual elements demonstrate a moderate level of importance in predicting the user interface sensibility, while audio elements prove to have moderate significance in forecasting the quality of visual elements in the realm of platform video games.