1. Introduction
Digital competence (DC), commonly understood as the ability to use digital technologies effectively, safely and responsibly, is a key factor in avoiding social and labor exclusion, as it enables individuals to navigate and utilize technology effectively in various aspects of life [1]. Over the last few years, interest in the assessment of DC has grown significantly, and such assessments have been used to accredit individuals' levels of DC. Although several reviews have summarized the progress and shortcomings in this area, some issues are yet to be explored, particularly instrument validity and reliability and the identification of important gaps in the sources of evidence needed to ensure the quality of available tools [2,3,4]. These limitations directly impact stakeholders in the following ways: individuals may receive inaccurate assessments of their DC, hindering their employability and social integration; educators struggle to design curricula that accurately reflect students' competencies; and policymakers face challenges in developing informed strategies to enhance digital literacy across populations. Consequently, the inadequacies of current DC assessment tools not only undermine the effectiveness of digital literacy initiatives, but also perpetuate the risks of social exclusion and technological disparity. By addressing these gaps, we can ensure more reliable and valid assessments, thereby supporting individuals in achieving DC and fostering a more inclusive digital society.
A key challenge in assessing DC lies in the varied and context-dependent definitions of the construct. While experts have sought to identify essential DCs for all citizens [5], differing definitions [4,6] complicate the development of consistent assessment methods. The absence of a universally accepted definition of DC results in tools that may assess certain skills but overlook others [7], limiting comparability across studies and settings. Additionally, the context-specific nature of current assessment tools hinders their general applicability, as tools validated in one setting may not be reliable in another [6]. To address this, the European Commission introduced DigComp (the Digital Competence Framework for Citizens) [7], which defines DC in terms of the knowledge, skills, and attitudes required to be digitally competent. While DigComp represents progress in standardizing DC, further research is needed to validate its use across diverse contexts and populations, and to address its limitations in capturing the evolving nature of DC. Continued efforts to refine and expand the framework will enable the development of more accurate, adaptable, and comprehensive assessment tools for DC.
Another significant issue is that, despite the availability of numerous DC assessment systems using various approaches, most rely on self-assessment, which inadequately measures practical skills and higher-level cognitive abilities [2,8]. Moreover, as noted by Law et al., many existing DC frameworks are developed by commercial enterprises [3]. This reliance on proprietary software, such as Microsoft Office or Windows, can shape curricular DCs around specific tools rather than broader competencies. While the introduction of DigComp has enabled more tailored implementations [2,9], current systems remain largely based on self-reporting, often using multiple-choice questions and Likert scales. This approach minimally assesses practical skills, and individuals often overestimate their competence [10], rendering such assessments unreliable and invalid for accurate competence accreditation.
Against this fragmented and restrictive background, we have identified several lines of action that could address these shortcomings.
The use of Technology-Enhanced Assessment (TEA) presents both opportunities and challenges in the evaluation of DC. TEA offers interactive and adaptive testing environments that better simulate real-world digital tasks, providing a more accurate measure of complex problem-solving and higher-order thinking skills [3,4,11]. It also enables the assessment of the skills dimension of DC, including high-level cognitive skills, through innovative item formats [12].
However, technical limitations, such as the digital divide and varying familiarity with platforms, may introduce biases that affect the validity of results [13]. Additionally, the design of assessment items is crucial, as performance outcomes are strongly influenced by item format [14,15]. Poorly designed items can misrepresent examinees' DC levels, especially when assessing complex cognitive skills. While innovative approaches like performance-based tasks (i.e., assessments where individuals demonstrate their skills or knowledge by completing a specific, real-world task rather than answering questions or selecting answers) offer a more authentic measure of DC [16,17], traditional self-report tools often lead to overestimations due to social desirability bias and the Dunning–Kruger effect [4].
Despite its potential, the high development costs and technical complexity of TEA limit its broader adoption and scalability [18]. Moreover, it is challenging to infer the response processes (RPs) examinees use to complete tasks, making it difficult to confirm if they approached problems as intended or employed alternative strategies. Thus, while TEA represents progress, its effectiveness depends on addressing these limitations and ensuring fair access and user competence.
Another key approach involves utilizing TEA to collect detailed data on examinee behavior and performance, such as log data, response time (RT), and click streams [19,20]. Although some researchers have used these data to enhance score interpretation and validate assessments [21], such studies are rare, particularly in the assessment of DC. Analyzing RP data from assessments can help validate item design by showing that tasks elicit the intended knowledge and skills. However, evaluating whether examinees engage with relevant RPs is often overlooked when considering test validity, and evidence of this is lacking in most DC evaluation tools [4,22].
The integration of eye-tracking (ET) data offers new opportunities to analyze cognitive processes during assessments [23]. ET data provide fine-grained insights into examinee behavior at the item level, revealing attention patterns, decision-making processes, and problem-solving strategies [21]. For example, in image-based questions, gaze data can identify the areas participants focus on most, offering a unique view into their problem-solving strategies and pinpointing differences in performance. Such data can help validate DC assessments by providing direct evidence of cognitive engagement and resource allocation [24].
However, the implementation of ET data presents challenges, such as the need for sophisticated technology, complex data interpretation, and potential intrusiveness, which could affect performance [25]. Additionally, the large-scale use of ET remains costly and logistically difficult, making it hard to scale [26]. While innovative item designs and ET data have the potential to improve the accuracy of DC assessments, practical and financial barriers must be overcome to reach widespread application.
In short, it is essential to design items that prompt more complex RPs from test-takers, while ensuring that these items trigger the intended knowledge and skills. Relying solely on RT data and item scores is insufficient. This study aims to use gaze data as additional evidence to support RP analysis and enhance the validity of the assessments. This paper builds on our previous research by incorporating gaze data into DC assessment to validate item design and differentiate participants based on their DC levels [27]. We focused on analyzing variations among participants with different DC levels using fixation-based metrics supported by visualization techniques.
The purpose of this exploratory study was to design items suitable for the assessment of DC focused on higher-order thinking skills and to gather validation evidence in a custom TEA implementation based on DigComp. In alignment with the United Nations Educational, Scientific and Cultural Organization (UNESCO) [3] and the World Bank [28], we adopted the DigComp framework as our reference model, owing to its significant advantages, as follows: (1) it was developed following an in-depth examination of existing DC frameworks, (2) it underwent a rigorous process of consultation and refinement by specialists in the field of DC, and (3) consequently, it offers a holistic perspective grounded in DC and its related areas. We utilized RP data, including ET data, to aid in validating these items and, consequently, the test itself. We seek to further the understanding of participants' task-solving behaviors by examining their scan-paths, which may reveal important information about differences not captured by the final responses, and to explore whether evidence from this analysis can contribute to the validity argument for inferences based on scores. Scan-path and fixation-based measures are well-suited for this study because they provide detailed insights into how participants interact with assessment tasks, offering a nuanced understanding of their cognitive processes and problem-solving strategies. Empirical research has demonstrated that fixation metrics, such as duration and sequence, correlate with information processing and cognitive load, making them valuable for assessing DC [24,29]. To date, we are not aware of a similar study having been carried out on the design of DC assessments.
Cronbach asserts that validation aims to support an interpretation by rigorously testing for potential errors [30]. A claim is only trustworthy if it has withstood serious attempts to disprove it. This perspective on validity is relevant to studies like this one, which do not offer direct evidence that the cognitive processes involved in test item responses mirror those used in real-world tasks; however, such studies can help demonstrate where these processes differ [31]. For example, if ET data show that test-takers completed simulation tasks without thoroughly reviewing instructions, it would challenge whether these tasks accurately reflect real-world cognitive processes. While ET data do not directly confirm cognitive processes, they provide valuable inferences [32]. We therefore used ET data to validate the inferences made between claimed and observed behavior. We sought to respond to the following research questions:
- (1) Is it possible to identify an alternative interpretation of how participants processed the item in terms of the areas of interest (AOIs) examined? (AOIs are usually defined by researchers to study eye movements within specific limited areas.)
- (2) Is it possible to identify an alternative interpretation of how participants processed the item in terms of the AOIs examined and the order in which the different AOIs were processed?
This paper begins by discussing the background of RP validation. Despite its importance in test validation, RP evidence has been largely underutilized. Given the benefits of RP evidence and the capabilities of TEA, we chose to explore RP data for DC assessment validation, specifically using ET data.
Section 3 introduces ET technology and its ability to provide valuable insights into test-takers’ performance, emphasizing its potential for assessment validation, particularly in DC evaluation.
Section 5 outlines the experimental methodology, focusing on validating alternative interpretations of how participants processed selected items based on AOIs and the order in which they were examined. Metrics and visualization methods from the literature were used to analyze information acquisition and cognitive processing.
Section 6 presents the results, followed by the conclusion and a discussion of future directions.
5. Materials and Methods
5.1. Methodology
We carried out an exploratory study that used participants' eye movements to provide insights into their performance in a custom TEA implementation based on DigComp. The details of the assessment tool used, such as the number and type of items administered and the level of validity and reliability, can be found in the previous work by the authors Bartolomé and Garaizar, and this information is available at www.evaluatucompetenciadigital.com (accessed on 23 January 2025) [93]. This tool provided four different tests targeting the following DCs according to DigComp version 2.1 [91]: browsing, searching, and filtering data, information, and digital content; evaluating data, information, and digital content; managing data, information, and digital content; and netiquette. Each test assesses one DC independently, without considering the possible overlaps between the different DCs as stated in DigComp. This tool is also part of BAIT, the Basque Government's evaluation system, accessible at http://www.bait.eus (accessed on 23 January 2025).
In our study, several optimization measures were implemented to enhance the accuracy and reliability of ET data. First, an advanced equipment calibration protocol was followed to ensure the precise alignment of the ET system with the participant’s gaze. This involved multi-point calibration, repeated at regular intervals throughout the data collection process to account for potential drift in tracking accuracy. Additionally, to mitigate interference factors, the experimental environment was carefully controlled, e.g., ambient lighting was standardized, reflective surfaces were minimized, and participants were instructed to maintain a consistent head position. These measures collectively aimed to minimize extraneous variability, thereby enhancing the validity of the ET data.
The interactions of participants with the tests were gathered to assess whether the designed evaluation tasks elicited the targeted knowledge and skills. Additionally, these interactions were used to classify individuals into different expertise levels based on the assessment criteria established for each task. In doing so, we evaluated the credibility of an alternative interpretation of the test results, namely that the test assesses participants' skills through a strategy that is only loosely related to the proficiency the items are intended to measure. We examined the scan-paths of the participants' RPs during the resolution of the image/simulation-based items to evaluate this alternative interpretation of the test scores. To do so, we created the scan-paths in terms of the visual elements included.
Image/Simulation-based questions are well-suited for assessing intermediate and advanced levels of DC. In these tasks, participants evaluate scenarios presented in visual formats, selecting the correct option, similar to multiple-choice questions. This approach assesses various DCs, such as information literacy, practical abilities, and critical thinking. By interpreting visual data, such as infographics or multimedia, participants apply their knowledge to analyze and make informed decisions, reflecting real-world digital interactions.
Incorporating these questions into assessments offers a strong measure of one's ability to navigate digital content, aligning with the DigComp framework's focus on practical application and critical engagement. Considering that in the interactive simulations the scan-path followed by participants can already be identified from the click log, we focused on the image/simulation-based format, for which gaze data are the only source available for describing participants' strategies. This method is also cost-effective and applicable across various DCs.
In order to answer the first research question, we carried out a scan-path analysis evaluating alternative interpretations based on the AOIs examined, taking the following steps (a code sketch illustrating these analyses follows the list):
We examined whether there were any differences between the scores and the type of question (whether or not it required a systematic approach to solve, coded as 1 or 0). With this aim, we calculated the Pearson coefficient to analyze the relationships between the scores, the type of question, the time taken to solve the question, the scan-path length, and the number of AOIs to be checked out of the total number of AOIs defined. Pearson's correlation coefficient is a widely used measure of the strength and direction of the linear relationship between two variables. One important assumption for its proper application is that the data should ideally come from a bivariate normal distribution. However, Pearson's correlation is robust to deviations from normality, especially when sample sizes are sufficiently large. This robustness makes it applicable and reliable in many practical situations where the normality assumption may not be perfectly met. We therefore opted to use Pearson's correlation to examine the relationships between variables in this study, as it provides a straightforward and interpretable measure of linear association that is resilient to moderate departures from normality;
Additionally, we analyzed the influence of each AOI in order to detect the areas most relevant to answering each question correctly. To do so, we performed a feature correlation analysis based on the mutual information between each of the visited AOIs and the result of the interaction. We first carried out a minimal study of the data variance to discard invariant features that would be impractical for the analytical procedure, and then performed the feature correlation analysis between each of the remaining AOIs and the task result. Considering the categorical nature of both features and target, we used the estimated mutual information for discrete variables as the correlation metric;
Finally, besides the purely analytical treatment of the results, we decided to perform a categorical classification of the participants' behavior with a double purpose. We wanted to examine whether specific patterns of AOIs visited to solve the task led to higher success rates in a way that was inconsistent with the expectations for each item. First, this kind of analysis allows us to identify the flexibility of a task, providing direct insight into whether the task is somehow guided, with most participants following a single route of resolution, or whether it is a highly free task, with no obvious pattern for resolving it. Secondly, it is possible to link those patterns with the obtained results to verify whether any of them is more effective than the others in solving the problem. We therefore used a clustering classification to identify the different behavioral patterns during the experiment, using the visits of each participant to each AOI as classification features. We employed the OPTICS algorithm [94] to determine an appropriate clustering of the participants' behavior. This algorithm offers two significant advantages over traditional algorithms such as k-means. First, it is compatible with Boolean-friendly distances, such as the Hamming distance; considering that all the features used during classification are Boolean, this was a key factor in selecting this approach. Second, unlike k-means, this algorithm detects the optimum number of clusters based on the data distribution itself, which allowed us to avoid predefining the number of clusters to discover during the process. In order to check whether any of the identified behavioral patterns was more efficient in solving the task, we assigned each participant to the corresponding cluster and then performed a Kruskal–Wallis test to check for significant differences between the mean values of both the result (the correct answer for the specific task) and the overall performance (the global score considering the tasks related to the same competence as the task in question). Additionally, we investigated the sensitivity of each item to the visitation of associated AOIs, aiming to determine whether some items were more predictable than others based solely on AOI visits.
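To make the preceding steps concrete, the following Python sketch shows one way this pipeline could be implemented with SciPy and scikit-learn. The DataFrame layout, column names, and the min_samples value are illustrative assumptions rather than the exact configuration used in the study.

```python
# Illustrative sketch of the per-item analysis pipeline described above
# (correlation screening, AOI relevance via mutual information, OPTICS
# clustering of AOI-visit patterns, and a Kruskal-Wallis check).
# Column and variable names are hypothetical; the actual data layout may differ.
import pandas as pd
from scipy.stats import pearsonr, kruskal
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.cluster import OPTICS

def analyse_item(df: pd.DataFrame, aoi_cols: list[str]) -> None:
    """df holds one row per participant for a single item, with columns:
    'score' (0/1), 'systematic' (0/1), 'solve_time', 'scanpath_len',
    'aois_required', plus one Boolean column per AOI (visited or not)."""

    # 1. Pearson correlations between the score and the scalar descriptors.
    for var in ["systematic", "solve_time", "scanpath_len", "aois_required"]:
        r, p = pearsonr(df["score"], df[var])
        print(f"score vs {var}: r={r:.2f}, p={p:.3f}")

    # 2. Discard invariant AOI features, then rank the remaining AOIs by
    #    mutual information with the item result (both are discrete).
    X = df[aoi_cols].astype(int)
    keep = VarianceThreshold(threshold=0.0).fit(X).get_support()
    informative = [c for c, k in zip(aoi_cols, keep) if k]
    mi = mutual_info_classif(df[informative], df["score"], discrete_features=True)
    print(sorted(zip(informative, mi), key=lambda t: -t[1]))

    # 3. Cluster the Boolean AOI-visit patterns with OPTICS and a Hamming
    #    distance; the number of clusters is not fixed in advance.
    labels = OPTICS(metric="hamming", min_samples=3).fit_predict(df[informative])
    df = df.assign(cluster=labels)

    # 4. Kruskal-Wallis test: do the clusters differ in item score?
    groups = [g["score"].to_numpy() for _, g in df[df.cluster >= 0].groupby("cluster")]
    if len(groups) > 1:
        h, p = kruskal(*groups)
        print(f"Kruskal-Wallis across clusters: H={h:.2f}, p={p:.3f}")
```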
To answer the second research question, we evaluated alternative interpretations based on the AOIs examined and the processing order of the AOIs, by taking the following steps:
To conduct our analysis, we first created the scan-paths in terms of the visual elements included in the images. For instance, if a participant fixated on elements A, B, and C, in that order, their scan-path was generated as ABC while keeping the fixation durations (a minimal sketch of this encoding is shown below). We also decided to remove the data from participant 28 in Item24, and participants 5 and 31 in Item5, since we had problems recording their interactions and their information was inconsistent.
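The encoding described above can be illustrated with the following minimal sketch; the AOI rectangles, labels, and fixation record format are hypothetical examples rather than the study's actual data.

```python
# Minimal sketch of how a fixation sequence can be encoded as a scan-path
# string over AOIs. AOI rectangles, labels, and the fixation record format
# are hypothetical examples, not the study's exact data.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float          # gaze position in screen pixels
    y: float
    duration_ms: float

# Each AOI is a labelled rectangle: (label, x_min, y_min, x_max, y_max).
AOIS = [("A", 0, 0, 400, 300), ("B", 400, 0, 800, 300), ("C", 0, 300, 800, 600)]

def aoi_label(fix: Fixation) -> str | None:
    """Return the label of the AOI containing the fixation, if any."""
    for label, x0, y0, x1, y1 in AOIS:
        if x0 <= fix.x < x1 and y0 <= fix.y < y1:
            return label
    return None

def scanpath(fixations: list[Fixation]) -> tuple[str, list[float]]:
    """Encode consecutive fixations as an AOI string (e.g., 'ABC'),
    keeping the duration associated with each retained fixation."""
    labels, durations = [], []
    for fix in fixations:
        label = aoi_label(fix)
        if label is None:
            continue                      # fixation outside all AOIs
        labels.append(label)
        durations.append(fix.duration_ms)
    return "".join(labels), durations

# Example: three fixations landing in AOIs A, B and C produce the string "ABC".
path, durs = scanpath([Fixation(100, 100, 230), Fixation(500, 120, 310), Fixation(200, 400, 180)])
```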
5.2. Ethics Statement
This study was non-interventional observational research considered to pose minimal risk to the participants. For this reason, we sought approval from the Data Protection Officer of Tecnalia, who ensured that our study protocol was compliant with the GDPR, the H2020 ethics standards, and the principles stated in the Declaration of Helsinki. The participants' right to refuse or withdraw from participation was fully maintained, and the information provided by each participant was kept strictly confidential. To start participating in the study, participants had to read and sign the informed consent form, which explained the study's objective and conditions. After confirming participants' willingness and obtaining their written consent, the required data were collected during the study.
5.3. Participants
Our lab-based study involved 30 participants (15 male, 15 female) from Tecnalia Research & Innovation, aged between 30 and 50, with varying levels of DC. All were native Spanish speakers, and none encountered comprehension issues during the task. The majority were university graduates, with many holding master’s degrees and four participants having doctoral qualifications.
While research in the literature has not extensively explored the optimal sample size for eye-movement studies [78], similar research has demonstrated that comparable sample sizes are adequate for identifying significant gaze patterns [77,97,98,99]. The sample size was determined based on methodological and practical considerations to ensure the identification of meaningful visual behavior patterns while maintaining resource efficiency. Our focus was to identify dominant patterns among high-performing participants, recognizing that a larger sample would be necessary to explore multiple patterns in depth.
In future research, we would recommend increasing the sample size following a stratified sampling, selecting a minimum sample for each of the 3 possible levels of DC according to DigComp (basic, medium, and advanced), i.e., at least 30 × 3 = 90 participants. In this way, we could explore multiple patterns in depth, not only of high-performing participants, but also of participants with a basic and medium level of DC.
Gender was not a selection criterion, and the age range (25 to 54) was chosen to reflect over 90% of BAIT service users. Participants self-reported their DC levels, with most rating themselves as “advanced” or “intermediate”. However, in the netiquette domain, fewer participants identified as “advanced”, and more placed themselves in the “basic” category, likely due to unfamiliarity with the term. Lastly, vision differences were not considered a significant source of variability in the study.
5.4. Materials
All participants individually completed the tests available on the online assessment tool. The 4 DCs selected were: (1) netiquette; (2) browsing, searching, and filtering data, information, and digital content; (3) evaluating data, information, and digital content; and (4) managing data, information, and digital content.
The details about the assessment tool, the dimensions selected for each DC, and the type of items included in each test can be found in the previous article [27]. In order to measure not only lower-order cognitive skills, the TEA provided different item formats to assess higher-order skills according to the medium and advanced levels defined in DigComp: multiple-choice questions, interactive simulations, image/simulation-based questions, and open tasks. Furthermore, items were presented on one screen, necessitating a single-step response without the need for scrolling. The graphics consistently occupied the right half of the screen. Instructions and question statements were positioned in the upper right, with answer choices below. The "respond" button, situated in the lower-left corner of each item, facilitated saving results and progressing to the next question. Our aim was to minimize extraneous eye movements through a consistent screen layout. All items were displayed solely in Spanish and had to be responded to within the test application, in a controlled setting, without exiting the main interface. The participants received the items in a fixed sequence, and they were not permitted to delay addressing them or alter the order. This methodology was chosen due to the potential interrelation between certain questions. All interactions and attempts were monitored through the platform, with results automatically computed. Additionally, data on time spent per question and total test duration were recorded.
After gathering the data from the 30 participants, we opted to conduct a brief follow-up session, focusing on a selection of the most representative items to analyze the ET metrics. Among the various formats utilized in the tests, we chose the image/simulation-based tasks, wherein participants were asked to review and assess an illustration or simulation and then select the appropriate response. This format appears to be well-suited for assessing higher-order cognitive skills at the intermediate and advanced levels, as it challenges participants to critically evaluate the presented scenarios. In addition, using an eye-tracker could aid in discerning whether users, in their responses, accurately assessed the predetermined areas outlined in the assessment criteria for each item. The items selected for this study and the assessment criteria defined for each item are shown in Table 1. Item4 was slightly different, as we asked users to click on the correct area after examining the image, rather than selecting the correct option from the list of possible choices. Figure 1, Figure 2 and Figure 3 show the design of the 3 items selected (Item24, Item32 and Item4) and the AOIs defined.
5.5. Experimental Equipment
The data were collected on a laptop in a Tecnalia meeting room throughout January 2020. The laptop, an MSI GS75 Stealth i7, was located far enough away to avoid distractions caused by actions taken by the researcher. For the study, we employed Tobii Pro Lab (TPL) software version 1.130.24185, utilizing the integrated browser within the software to present the tests on a DELL E2310 23-inch monitor connected to the laptop. Eye movements of the participants were recorded with the Tobii X2-30 Eye Tracker, a discreet and standalone device designed for in-depth research on natural behavior. The eye-tracker, operating at a frequency of 30 Hz, was positioned at the base of a separate 17-inch monitor, which was used to monitor the participants' eye and head positions.
During tracking, the eye-tracker uses infrared illuminators to generate reflection patterns on the corneas of the participant’s eyes. These reflection patterns, together with other visual data about the participant, are collected by image sensors. Sophisticated image processing algorithms identify relevant features, including the eyes and the corneal reflection patterns. Complex mathematics is used to calculate the 3D position of each eyeball and, finally, the gaze point (in other words, where the participant is looking). The timing and placement of mouse clicks were documented as part of the data collection.
Six distinct projects were created within the TPL framework for the purposes of this research. Each project commenced with a self-evaluation, during which participants indicated their perceived level of competence. Each stimulus was linked to an item previously selected in the tests. All tests and items were displayed in the same order. We selected two types of stimuli according to the item formats included, as follows: (1) Web stimulus, to display webpages to participants during a recording. TPL opens the website’s URL in the built-in Lab browser, and automatically registers all mouse clicks, keystrokes, and webpages accessed. (2) Screen recording stimulus, to register all mouse clicks, keystrokes, programs, and webpages accessed, i.e., all activity displayed on the screen from the beginning to the end of the stimulus. In particular, the 3 items selected for this study were recorded using a Web stimulus.
For each stimulus, distinct AOIs and times of interest (TOIs) were established. The AOI is a concept used in TPL that allows the researcher to calculate quantitative eye-movement measures. We therefore drew a boundary encompassing the elements of the ET stimulus relevant to our study, creating AOIs in all the areas of the questions that we thought might attract participants' attention. TPL then calculates the metrics within the boundary over the time interval of interest. The TOI is another TPL concept that provides a degree of analytical flexibility, allowing researchers to organize the recording data according to intervals of time during which meaningful behaviors and events take place. In the simulation/image-based questions, we selected the intervals during which users viewed the image.
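As an illustration of the kind of per-AOI metrics computed within a TOI, the following sketch aggregates dwell time and visit counts from raw fixations; the field names and AOI representation are assumptions, since TPL performs these calculations internally.

```python
# Illustrative sketch of per-AOI metrics of the kind TPL computes: total
# fixation duration and visit (entry) count inside each AOI, restricted to a
# time of interest. Field names and the record format are assumptions.
from collections import defaultdict

def aoi_metrics(fixations, aois, toi_start_ms, toi_end_ms):
    """fixations: iterable of dicts with 'x', 'y', 'start_ms', 'duration_ms'.
    aois: dict mapping label -> (x_min, y_min, x_max, y_max).
    Returns per-AOI total dwell time and number of visits within the TOI."""
    dwell = defaultdict(float)
    visits = defaultdict(int)
    previous_label = None
    for fix in fixations:
        # Keep only fixations that start inside the time of interest.
        if not (toi_start_ms <= fix["start_ms"] < toi_end_ms):
            continue
        label = None
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= fix["x"] < x1 and y0 <= fix["y"] < y1:
                label = name
                break
        if label is not None:
            dwell[label] += fix["duration_ms"]
            if label != previous_label:   # a new entry into the AOI counts as a visit
                visits[label] += 1
        previous_label = label
    return dict(dwell), dict(visits)
```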
In summary, the main goal of the scan-path analysis was to understand how participants interacted with different stimuli and tasks, and how their attention was distributed across different areas and times. This was achieved through the use of ET and the TPL analysis tool.
5.6. Procedure
The participants were seated in front of the monitor for data collection, after which the ET system was individually calibrated, a process that took approximately three minutes. Subsequently, the items were presented in a consistent sequence for all participants. After each task was completed, the researcher activated the next stimulus, advancing to the following item and ensuring that participants could not return to previous items. The eye-tracker recorded gaze data and click streams, which were supplemented by the information captured by the TEA. The I-VT (Fixation) filter was selected for exporting the data from TPL, with the default velocity threshold of 30 degrees/s. The I-VT (Fixation) filter identifies fixations using a velocity threshold (VT) criterion: fixations are defined as periods during which gaze velocity falls below a specified threshold, indicating stable fixation on a particular location. This provides a robust and flexible approach to detecting fixations in ET data, enabling researchers to gain deeper insights into visual behavior.
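For illustration, the sketch below shows the principle behind an I-VT classifier with the stated 30 degrees/s threshold; the 60 ms minimum fixation duration and the input format are assumptions made for the example, and the actual filtering was performed by TPL.

```python
# Minimal sketch of a velocity-threshold (I-VT) fixation classifier, assuming
# gaze samples already expressed as visual angle in degrees with timestamps.
# This is a simplified illustration of the principle, not Tobii's implementation.
import numpy as np

def ivt_fixations(t_ms, x_deg, y_deg, velocity_threshold=30.0, min_duration_ms=60):
    """Label samples as fixation when angular velocity < threshold (deg/s),
    then merge consecutive fixation samples into fixation events."""
    t = np.asarray(t_ms, dtype=float)
    x = np.asarray(x_deg, dtype=float)
    y = np.asarray(y_deg, dtype=float)
    dt = np.diff(t) / 1000.0                          # seconds between samples
    velocity = np.hypot(np.diff(x), np.diff(y)) / dt  # angular velocity, deg/s
    is_fix = np.concatenate([[False], velocity < velocity_threshold])

    fixations = []
    start = None
    for i, flag in enumerate(is_fix):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            duration = t[i - 1] - t[start]
            if duration >= min_duration_ms:           # discard very short fixations
                fixations.append((t[start], duration, x[start:i].mean(), y[start:i].mean()))
            start = None
    if start is not None and t[-1] - t[start] >= min_duration_ms:
        fixations.append((t[start], t[-1] - t[start], x[start:].mean(), y[start:].mean()))
    return fixations  # list of (onset_ms, duration_ms, mean_x_deg, mean_y_deg)
```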
7. Conclusions
Recent studies have sparked a lively debate regarding the roles that various types of process data play in evaluating examinees' performance and test validity [34,35]. Previous studies, such as those by D'Mello et al. and Azevedo [38,100], have emphasized the need to bring together different sources, such as log files, ET, emotion recognition, and think-aloud protocols, to better understand RPs. This multi-source approach is crucial for enhancing the robustness of validity arguments. For example, Ercikan and Pellegrino highlighted the importance of integrating process data to validate complex assessments [34], while Azevedo demonstrated how ET and log data can provide complementary insights into cognitive and emotional states during test-taking [38]. Despite its relevance, studies that have incorporated the analysis of RPs remain scarce in assessment domains focused on validating the design of the information structure, rather than the content being assessed, and are practically non-existent in DC assessment. By integrating findings from these studies, our work extends the understanding of how process data like log data, ET data, scores, and item response times can contribute to validating inferences made between claims and observed behaviors. The primary objective of our exploratory study was to illustrate specific applications of ET data in enhancing the validity of inferences drawn from scores. Our focus on evaluating participant interactions with assessed content aimed to determine whether observed behaviors aligned with predefined assessment criteria for each item. This approach builds on the works of Greiff et al., who explored how response time analysis can reveal engagement and problem-solving strategies [47], and of Kroehne and Goldhammer, who demonstrated the benefits of combining response time with other process data to understand test-taker behavior [48]. Moreover, our findings extend the insights of Van Gog and Scheiter, who discussed the potential of ET to uncover cognitive processes in problem-solving tasks, by applying similar methodologies to the domain of DC assessment [101]. This comparative analysis underscores the value of ET data in providing a nuanced understanding of test-taker behaviors, complementing traditional measures of assessment.
We explored specific patterns of participants’ eye movement to make detailed observations of item RPs in an evaluation of DC. Specifically, we used ET observations to fill the ‘explanatory gap’ by providing data on the variation in item RPs that are not captured in traditional TEAs. We focused on generating and testing inferences about the RPs performed by the participants, evaluating an alternative interpretation of the test scores in terms of AOIs examined and the order followed in examining the different AOIs.
In line with observations by other researchers, such as Cronbach's statement that "a proposition deserves some degree of trust only when it has survived serious attempts to falsify it", we focused on evaluating an alternative interpretation of the test scores [30]. We tried to falsify the proposition that the items triggered the knowledge and skills required for solving the image/simulation-based items by testing an alternative interpretation: that test-takers did not pay attention to the key areas, or that they followed a meaningless order of fixations while still answering the items correctly; that is, that they solved the tasks in a way inconsistent with the expectations for each item according to the previously defined assessment criteria.
To evaluate this alternative interpretation of the test scores, we used ET technology to examine the scan-paths. We investigated how participants processed the different AOIs defined in order to evaluate the situation and choose the correct answer. We tried to answer two questions, as follows:
First, we asked which AOIs within the items were examined, and whether specific patterns of visited AOIs might undermine the claim that the items elicited the same cognitive processes required in real-world tasks, according to the previously defined assessment criteria. We could not identify alternative patterns in terms of AOI visit rates that might undermine the defined assessment criteria. All the AOIs were examined with different visit rates, but we could not find any response pattern with an unexpected visit rate that was difficult to explain, which would have raised doubts as to whether the question was generating the expected RPs. Additionally, we carried out a clustering of the responses in terms of AOIs visited, applying the unsupervised OPTICS classification algorithm to examine whether specific patterns of AOIs visited to solve the task predicted higher success rates in the overall performance (the global score considering the tasks related to the same DC as the task in question) in a way that would be inconsistent with the expectations for each item. We could identify clusters with higher success rates; nevertheless, the results were far from significant, and thus it was not possible to conclude that any of the behavioral patterns was more efficient than the rest. It would be very interesting to repeat this analysis with a larger number of participants.
Secondly, we examined the scan-paths of the items in more depth, considering the order in which the different AOIs were processed when solving them. We employed the Levenshtein method, facilitated by the ScanGraph tool, to ascertain that the unsuccessful group exhibited greater variance than the successful group. The significant variability and lack of clear patterns in participant behavior, as suggested by the clustering analysis and the within-group variability in the Levenshtein analysis, might reflect the complex nature of DC assessment. In order to explain this variability, it would be interesting to incorporate additional variables not accounted for in this analysis (e.g., individual differences in cognitive strategies). We then went deeper into cognitive processes and problem-solving strategies, calculating the common scan-paths of only the successful participants, whose variance was smaller. Despite the limitations of the position-based weighted model applied, we obtained interesting insights into the performances of the successful participants, which could not have been obtained in traditional TEAs. As a result, we acquired a visual representation enabling us to easily verify that participants who answered the question correctly did not exhibit any unexpected behavior that would invalidate the item. We consider this information helpful to examinees who responded incorrectly: if they wanted to review the question to find the correct answer, we could show them the common scan-path as added value. To our knowledge, current DC assessments do not offer this functionality for such items during review. However, the meta-analysis performed by Xie et al. pointed out that the use of eye-movement modeling examples was beneficial to learners' performance when non-procedural tasks were used instead of procedural tasks [102]. It should therefore be further investigated to what extent the use of modeling examples in the review process of a DC assessment might be appropriate, depending on the type of question. Finally, we also examined the fixation durations and the places where longer fixations occurred. We could confirm that successful participants made their longest fixations on the expected areas, and we could not identify an alternative interpretation that would undermine the defined assessment criteria.
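As a simple illustration of the string-based comparison underlying this step, the sketch below computes pairwise Levenshtein distances between AOI-encoded scan-paths and a mean within-group distance as a rough variance indicator; the actual analysis was performed with the ScanGraph tool, and the example scan-path strings are hypothetical.

```python
# Minimal sketch: pairwise Levenshtein distances between AOI-encoded
# scan-path strings, and the mean within-group distance as a rough
# indicator of group variance. The scan-path strings here are made up.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def mean_pairwise_distance(paths: list[str]) -> float:
    """Average Levenshtein distance over all pairs of scan-paths in a group."""
    pairs = [(i, j) for i in range(len(paths)) for j in range(i + 1, len(paths))]
    return sum(levenshtein(paths[i], paths[j]) for i, j in pairs) / len(pairs)

successful = ["ABCD", "ABCCD", "ABD"]      # hypothetical scan-paths
unsuccessful = ["CABDA", "BBDAC", "DCA"]
print(mean_pairwise_distance(successful), mean_pairwise_distance(unsuccessful))
```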
Additionally, the distribution of fixation durations validated the item design, revealing two distinct approaches: systematic and non-systematic. While the systematic approach focuses on specific areas, the non-systematic approach examines all areas uniformly, as indicated by a homogeneous distribution of fixation durations.
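One possible way to quantify this distinction, sketched below under our own assumptions (it is not the metric reported in the study), is the normalized entropy of the per-AOI share of total fixation duration: values near 1 indicate a uniform, non-systematic examination, while low values indicate concentration on specific areas.

```python
# Sketch: normalized entropy of per-AOI fixation-duration shares as a rough
# systematic vs. non-systematic indicator. The metric and the example values
# are illustrative assumptions, not the analysis reported in the paper.
import math

def duration_homogeneity(dwell_ms_per_aoi: dict[str, float]) -> float:
    """Return normalized Shannon entropy in [0, 1]; 1 = perfectly uniform."""
    total = sum(dwell_ms_per_aoi.values())
    shares = [d / total for d in dwell_ms_per_aoi.values() if d > 0]
    if len(shares) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(dwell_ms_per_aoi))

# Hypothetical participants: one concentrates on AOI 'B', one spreads evenly.
systematic = {"A": 200.0, "B": 2600.0, "C": 150.0, "D": 50.0}
non_systematic = {"A": 800.0, "B": 750.0, "C": 820.0, "D": 780.0}
print(duration_homogeneity(systematic), duration_homogeneity(non_systematic))
```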
Previous studies have used ET data in fields related to the selected DCs. For example, fixation patterns are relevant in the rapid analysis and recognition of information [64,67,103], and are helpful for understanding user behavior in search engines [68] or for showing the differences in the strategies that participants used to answer multiple-choice questions in code comprehension and recall tasks [104]. These studies have yielded valuable insights into the cognitive mechanisms underlying rapid information analysis and recognition, highlighting the importance of context, task demands, and individual differences in shaping fixation behavior. Individual differences in eye-movement behavior, such as age, expertise, and cognitive abilities, can influence fixation patterns, complicating the generalization of findings across diverse populations. Furthermore, some of these studies also mentioned methodological limitations, such as the spatial and temporal resolution of ET devices, which can constrain the accuracy and precision of fixation data, potentially leading to misinterpretation or oversimplification of cognitive processes.
As far as we know, only Yaneva et al. have used ET data for validation purposes, examining how the distribution of the options in multiple-choice questions influenced the way examinees responded to the questions [39]. Nevertheless, to our knowledge, prior research has not addressed the evaluation of DCs for validating RPs. We showed different ways of using data collected on fixation areas and density, which could be used to support validation practice and test development. Additionally, we have shown that metrics from ET data are of great value to test designers in terms of understanding examinees' interactions with the tests and helping them to adjust the level of difficulty.
When interpreting the findings of this study, it is important to acknowledge certain limitations, including the relatively small participant sample and the limited number of items used. Future research would benefit from examining whether these outcomes hold with a larger sample that also includes individuals with a basic level of DC, or by employing additional items with a similar structure. Furthermore, subsequent studies should explore items with varied response formats, which might elicit more intricate RPs. Zumbo et al. recently highlighted the need to be more rigorous when validating the way that RPs and their data are processed as measurement opportunities [105]. ET studies can assist the process of constructing a validity argument; however, RP data can more readily challenge an interpretation than directly support it [31]. Further research is therefore recommended. For example, with a larger participant sample, we may discover that a single common scan-path is inadequate and that multiple common scan-paths prove valid. This could indicate different participant cohorts, such as those with advanced levels of DC.
We are also aware that, while scan-paths and fixation durations offer valuable insights into cognitive processing, they may not capture the full complexity of examinees' cognitive processes. Several factors can influence these metrics and affect their interpretation, such as task complexity, examinee characteristics, environmental distractions, or test anxiety. Although we tried to minimize these factors during the study, for example by reducing distractions, we should remain aware of their existence and influence.
Furthermore, additional research is recommended to better understand the efficacy of displaying the common scan-path of successful participants during the review. Implementing this suggestion poses challenges, such as determining which common scan-path to display if multiple valid paths are identified for an item.
While our study is exploratory in nature and primarily focuses on utilizing ET data to enhance the validity of inferences drawn from test scores, it is essential to situate our findings within the broader context of existing research on DC assessment and the use of process data. Previous studies, such as those by Law et al. and Siddiq et al. [3,4], have highlighted significant gaps in the validity and reliability of DC assessment tools, particularly those relying heavily on self-reported measures. Our findings extend this body of work by demonstrating that ET data can provide detailed insights into the RPs of examinees, thereby addressing some of these validity concerns. Furthermore, our results align with the conclusions of Papamitsiou and Economides, who emphasized the need for more granular analyses of assessment and learning design strategies [106]. By integrating ET data, we contribute to a more nuanced understanding of how examinees interact with assessment items, corroborating the findings of Scherer et al. on the benefits of TEA environments [12]. Additionally, our study contrasts with the work of Rienties and Toetenel, who primarily focused on log data and response times [15]; we show that ET data can uncover variations in response strategies that are not captured by these traditional metrics. Overall, our research underscores the potential of ET as a complementary tool in DC assessment, providing richer and more actionable data than conventional methods, and paving the way for future studies to further explore this promising avenue.
To effectively integrate and analyze ET data with other process data, such as log data and RTs, it would be advisable to use a multi-modal data fusion framework that leverages machine learning algorithms and statistical models. The integration process would begin with data synchronization, where timestamps from different data sources are aligned to ensure temporal consistency. ET data would be processed alongside log data capturing user interactions and RTs indicating cognitive load and decision-making speed. For the analysis, approaches such as Dynamic Time Warping (DTW) for aligning temporal sequences and Principal Component Analysis (PCA) for reducing dimensionality while preserving key patterns in the data could be employed. To explore relationships and predictive capabilities, machine learning models such as Random Forests and Support Vector Machines (SVM) could be utilized for classification and regression tasks, respectively. Additionally, Hidden Markov Models (HMM) could be used to identify latent states in user behavior. The integration of these data streams could provide a comprehensive view of user behavior, improving the understanding of cognitive processes and enhancing predictive analytics in human–computer interaction studies.
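As a rough sketch of how such a fusion pipeline might look, the example below merges hypothetical per-item features extracted from the ET, log, and RT streams, reduces them with PCA, and fits a Random Forest classifier; all feature names, data layouts, and parameters are assumptions rather than a tested design.

```python
# Rough sketch of a multi-modal fusion pipeline: merge per-item ET, log, and
# response-time features on participant/item keys, then PCA + Random Forest.
# All column names and parameters are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fuse_and_classify(et: pd.DataFrame, logs: pd.DataFrame, rt: pd.DataFrame):
    """Each frame has one row per (participant, item); 'correct' in rt is the label.
    et: dwell times and fixation counts; logs: click counts; rt: response time."""
    # 1. Synchronize/merge the streams on shared keys (timestamp alignment is
    #    assumed to have been done upstream when the features were extracted).
    data = et.merge(logs, on=["participant", "item"]).merge(rt, on=["participant", "item"])
    y = data.pop("correct")
    X = data.drop(columns=["participant", "item"])

    # 2. Standardize, reduce dimensionality with PCA, then classify.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=0.95),       # keep 95% of the variance
                          RandomForestClassifier(n_estimators=200, random_state=0))
    scores = cross_val_score(model, X, y, cv=5)
    return model.fit(X, y), scores.mean()
```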
Our study's findings can be further situated within established theoretical frameworks of cognitive processing and test validity. According to theories of cognitive processing, such as those posited by Ercikan and Pellegrino and by Kane and Mislevy [31,34], ET data provide a robust method for inferring the cognitive strategies and processes involved in test-taking. These theories suggest that RPs are crucial for validating the cognitive constructs that assessments aim to measure. The alignment of fixation patterns and scan-paths with the hypothesized cognitive processes supports the validity of the test items and the inferences drawn from test scores. Additionally, the use of ET in our study aligns with Cronbach's validation theory [30], which emphasizes the necessity of rigorous validation methods to substantiate the intended interpretations of test scores. By examining detailed eye-movement data, we have demonstrated that participants engage with test items in ways that reveal underlying cognitive strategies, thus reinforcing the construct validity of our DC assessment. This integration of ET data into the validation framework not only provides empirical support for the intended constructs, but also extends the applicability of cognitive processing theories to the context of digital assessments.
Although we think that the results presented are useful and can make an important contribution to the process of constructing a validity argument in a DC assessment, they are limited, because ET studies are time-consuming and, for this reason, participant samples tend to be small. Moreover, there are additional metrics available that should be investigated and which might contribute to furthering the understanding of participants' task-solving behaviors. The results of this research could be considered when thinking about how to use ET data to validate the design of items that measure complex cognitive constructs such as DC. Studies like this can form part of a complete validity argument.