In the development research methodology, evaluation is the aggregating stage in the process and should be performed several times during the empirical testing stage or whenever a sub-product is “concluded”.
After the first tests with users similar to end users and refinement of the products, we were able to conduct a larger study with end users in a real classroom environment.
The final evaluation study was conducted during two different school years (2022–2023 and 2023–2024) using three different study groups from a Portuguese school in real ICT and robotics classes.
This study was conducted in the ICT and robotics classes, for 50 minutes per week, from the end of January until the end of May in both school years. Initially, the basic concepts and notation of the manual and the robot were explained to all of the groups. At the beginning of each section of the manual, a theoretical explanation was given and the expected results were detailed to make sure every student understood what they were asked to do. Group 1 was already familiar with the mBlock environment, so only an explanation of the newly developed framework was needed. Groups 2 and 3 needed extra introductory classes on how to work with mBlock and the newly developed framework for Stemie, as they had never programmed robots before. However, as they were already familiar with Scratch, these classes were straightforward for them because of mBlock’s similarities to it. It is important to note that every student in this study had their own robot to take care of and to take home after every class. After this, computational thinking was developed through hands-on problem-solving exercises [20] involving sequences, loops, parallelism, events, conditionals, operators, and data. The evaluations sent by the classes’ teachers for each of the students involved in the study were gathered. These data were sorted and classified by computational concepts and practices, and quantitative data analysis was performed to obtain comparable results. We also took into account the notes the teachers took about the students and their working methods during this period. As in evaluation stage 1, to evaluate the students’ problem-solving skills, the responsible teacher used rubrics for “problem identification”, “planning”, and “execution”. The specific objectives for each of the activities were also used, and all of the teachers’ records were documented using a three-point Likert scale: “not yet”, “more or less”, and “yes, completely”. Video recording was the primary means of keeping track of the activities. The classification of optimal and non-optimal solutions was performed using the same set of rules as in evaluation stage 1.
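To make this aggregation step concrete, the following is a minimal R sketch of how rubric records of this kind could be tallied per computational concept. The data frame `records`, its columns, and the rule treating “not yet” as unsolved are illustrative assumptions, not the study’s actual coding scheme.

```r
# Hypothetical sketch of the sorting/classification step: cross-tabulate the
# teachers' rubric scores ("not yet" / "more or less" / "yes, completely")
# by computational concept to obtain comparable counts.
# The data frame `records` and its columns are illustrative only.
with(records, table(concept, rubric_score))

# Completion rate per concept, treating "not yet" as unsolved
# (an assumption made here purely for illustration).
records$solved <- records$rubric_score != "not yet"
aggregate(solved ~ concept, data = records, FUN = mean)
```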
7.1. Results
The results of 20 different problems were evaluated for each of the 329 students involved in the study. In this phase of the study, we did not have data for one of the problems (Problem 19), and thus it was taken out of the analysis. As previously mentioned, not all of the 329 students tried to solve all 20 problems. Even so, it is important to remember that a total of 3990 problems were taken into consideration in this analysis. The overall results are summarized in Table 4.
In this global analysis, the problems were correctly solved at a rate of 91.13%, and 56.99% of those were solved using an optimal solution for the specific problem, which is a positive result. However, we also found that some of the problems (8.87%) were not solved by the students. According to the teachers’ notes, this was mainly due to students not having the work material in all of the classes or, in some particular cases, due to hardware malfunctions.
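For concreteness, assuming the optimal-solution percentage is taken over the correctly solved problems, these rates correspond to roughly 0.9113 × 3990 ≈ 3636 solved problems, of which about 0.5699 × 3636 ≈ 2072 used an optimal solution, leaving roughly 354 problems unsolved.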
As in the test with users similar to end users, it was important to determine if there were statistically significant differences in the scores across the three study groups. Because the ANOVA assumes balanced data, the missing values we had (not all groups had completed the same number of exercises) would create an unbalanced design and compromise the validity of the results. To overcome this problem, before performing the analysis, we decided to use multiple imputation, one of the most reliable techniques for handling missing data due to partial or incomplete responses from a portion of the sample [36]. Using the R programming language [37] and the RGui editor, we applied multiple imputation with predictive mean matching (pmm) to our data. This procedure, executed with the mice package, allowed us to generate five different sets of plausible values for the missing entries. By integrating the imputed data into the original dataset, we obtained a new dataset with most of the missing values replaced by statistically appropriate estimates. After the multiple imputation process, we were left with 17 exercises with complete data for the three different groups, which allowed us to perform further analysis.
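A minimal sketch of this imputation step with the mice package is shown below; the data frame `scores` (one row per student, one column per exercise, plus a group factor) and the seed are illustrative assumptions, not the study’s actual data or settings.

```r
library(mice)

# Five imputed versions of the dataset using predictive mean matching (pmm),
# mirroring the procedure described above. `scores` is a hypothetical data
# frame with one row per student and one column per exercise.
imp <- mice(scores, m = 5, method = "pmm", seed = 123)

# Extract one completed dataset, with missing entries replaced by the
# statistically plausible estimates generated by pmm.
scores_complete <- complete(imp, action = 1)
```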
Using Jamovi, we executed a one-way ANOVA with Fisher’s method to determine if there were significant differences in performance between the groups of students, categorized by their different backgrounds in robotics, for each of the exercises. The one-way ANOVA results (see Table 5) showed p values > 0.05 for most of the exercises, meaning that there were no significant differences between the groups. On the other hand, for exercises 8, 10, 14, and 17, the results revealed p values < 0.05, suggesting significant group differences. The most significant differences occurred in exercise 8 (F = 25.34, p < 0.001), exercise 14 (F = 13.11, p < 0.001), and exercise 17 (F = 26.02, p < 0.001).
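The equivalent test can also be reproduced in R on the imputed data; a minimal sketch, reusing the hypothetical `scores_complete` data frame from above (the exercise column name is illustrative):

```r
# Classic one-way ANOVA (Fisher's method, i.e., assuming equal variances),
# run for one exercise at a time; Jamovi performs the same test.
fit <- aov(exercise8 ~ group, data = scores_complete)
summary(fit)  # reports the F statistic and p value for the group effect

# The same test via oneway.test, with var.equal = TRUE selecting
# Fisher's method rather than Welch's correction.
oneway.test(exercise8 ~ group, data = scores_complete, var.equal = TRUE)
```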
For the exercises with the most significant differences, we conducted a follow-up Tukey’s post hoc test to identify which groups differed from one another. For exercise 8, the test showed significant differences between each pair of groups, with p ≤ 0.005. For exercise 10, significant differences (p = 0.017) were only found between Group 1 and Group 2. Exercise 14 revealed significant differences between Group 1 and Group 2 as well as between Group 2 and Group 3, but not between Group 1 and Group 3. For exercise 17, the test returned p < 0.001 for all group comparisons, showing that all groups differed significantly from each other. Generally speaking, Group 1 performed best in the most significant exercises (8, 14, and 17), and Group 2 performed worst in exercises 14 and 17.
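In R, the post hoc comparisons follow directly from the fitted model in the sketch above:

```r
# Tukey's HSD post hoc test on the fitted ANOVA model; reports each pairwise
# group difference with its confidence interval and adjusted p value.
TukeyHSD(fit)
```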
The results we obtained were somewhat expected. Group 1 had the older students, with previous knowledge of both robotics and the mBlock programming environment, so it was not a surprise that they generally outperformed the other two groups. On the other hand, the worse results for Group 2 in exercises 14 and 17 were somewhat surprising. Although both Group 2 and Group 3 started the study with no previous robotics experience, when those exercises were performed, Group 2’s students were 1 year older and had one more school year of experience in robotics compared with Group 3. We did not find any specific reason for those results; it is possible that a slight lack of motivation affected them. Group 3 performed quite well, especially considering that its students were younger than those of the other two groups when they entered the study.
When we analyzed the results according to the three dimensions of Brennan and Resnick’s framework [3] (computational concepts, practices, and perspectives), we also obtained quite interesting results.
7.1.1. Computational Concepts
Although all computational concepts were explored in this study, due to the different starting and ending points for each group, not every group was able to experience and test every one of them.
Table 6 indicates the number of tasks proposed to students in which a specific concept was approached; note that the same problem may have addressed more than one concept. There was also a disparity in the number of problems per concept because complex topics such as events, parallelism, and data were only covered in a few of the book’s final tasks. The concept of sequences was present in all problems and was evaluated with a completion rate of 91.13%.
Group 1, which had previous experience in robotics, was the one that tried to solve the largest number of problems involving events, conditionals, and operators. This may explain why, globally speaking, these were the skills students had the least difficulty acquiring, with a completion rate of 96.55%; among the solved problems, 62.05% reached the optimal solution, and only 3.45% of the problems were not solved.
As previously mentioned in the description of this study, each of the groups, due to their school year and previous experience, solved a different set of exercises. However, most of the concepts were included in every group’s set, making it worthwhile to compare the results between groups.
It is possible to observe from Figure 10 that Group 1 covered all of the concepts, although they only solved some of the more complex problems. Group 2 and Group 3 only worked with some of the concepts but, on the other hand, solved more exercises. For every common concept, Group 1 had the stronger results. Despite this difference, it is possible to observe that every concept each group worked on was successfully developed, with results between 80% and 100%.
7.1.2. Computational Practices
Despite the different numbers and types of problems each group solved, every group was able to experiment with all computational practices explored in this study, as can be seen in the group comparison chart in Figure 11. As before, Group 1 obtained the best results in all computational practices; their previous experience and age may have been the differentiating factors. However, when comparing Groups 2 and 3, we can observe that Group 3 performed better than Group 2, although its students were younger and had less experience. Through the analysis of this chart, we can see that all of the groups successfully developed every computational practice.
As with the computational concepts, it is important to note that the number of problems indicated in Table 7 refers to the problems proposed to students in which a specific practice was approached, and the same problem often addressed more than one practice.
Figure 12 gives a clearer picture of the completion rates with optimal and non-optimal solutions to the problems, grouped by computational practice.
Analyzing the results grouped by computational practices, we found that being incremental and iterative was the most frequently addressed practice throughout the proposed problems. It was also the practice in which the students showed the least difficulty, with a completion rate of 91.13%. Reusing and remixing was also a practice students were comfortable with, successfully solving 89.77% of the problems that involved it. Although the results were still very positive, the problems involving abstracting and modularizing were those that the most students were unable to solve, with 12.07% of them not finished in time. From the teachers’ notes, it was not possible to determine whether this was due to abstraction difficulties or whether, given the slow pace of some students, there was simply no time to solve them.
7.1.3. Computational Perspectives
The three computational perspectives (express, collaborate, and question) were cross-cutting in all of the developed exercises, although they were not objectively measured. Expression was implicit, as the students solved the problem-solving tasks while following the guidelines but with the freedom to create something new and to personalize the existing elements by including personal preferences in the task scenarios. Collaboration was also a constant: although the tasks were performed mostly individually, as every student had their own robot, as soon as a student finished, they asked to help colleagues who were behind, engaging in peer work. Also, curiosity about the processes, the similarities with some real-life situations, and the different problem-solving methods led the students to question the technology; some even suggested new developments for the existing challenges.