EFAR-MMLA: An Evaluation Framework to Assess and Report Generalizability of Machine Learning Models in MMLA
Abstract
1. Introduction
- The development step (also known as model training) fits the ML models to the available data and, optionally, searches for suitable hyper-parameters. Hyper-parameters are the tuning parameters of ML models [10] (e.g., the number of hidden layers in a neural network) that need to be configured before model training [11].
- After model training, the evaluation step assesses the models’ performance. Model evaluation serves different purposes depending on the goal of the ML model in MMLA. For instance, when we develop a model to identify learning indicators (e.g., data features with high predictive power), evaluation is used to find the model that best fits the data. In contrast, when building a predictive ML model, the evaluation step assesses the performance on unseen data [9]. A model’s ability to perform well on unseen data is also referred to as generalizability [10,12] (see the hold-out sketch after this list).
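To make this concrete, the following minimal sketch (not from the paper; the dataset and model choice are illustrative) shows a hold-out evaluation in which generalizability is approximated by the error on data the model never saw during training:

```python
# Minimal hold-out evaluation sketch: train on one split, measure error on the other.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))  # error on unseen data
print(f"Hold-out RMSE: {rmse:.2f}")
```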
2. Model Evaluation in Machine Learning and Multimodal Learning Analytics
2.1. Model Evaluation in ML
2.2. Model Evaluation in MMLA
- The widely used cross-validation methods are limited in their ability to assess generalizability [23]. These methods are recommended neither for model comparison nor for model selection purposes [30,31,32]. Moreover, a performance measure obtained from the same cross-validation step that is also used for hyper-parameter tuning has been found to be significantly biased [30].
- The model evaluation methods from ML do not assess generalizability at the levels relevant to the MMLA field. For example, k-fold cross-validation assesses the model’s generalizability across folds containing random samples from the dataset. This kind of evaluation cannot offer information on how the model will perform on data from different students or classrooms (see the grouped cross-validation sketch after this list).
- Heterogeneous performance-reporting practices hinder the community from accumulating knowledge about the maturity of ML in the field. MMLA researchers employ different baselines and often report a model’s average performance without any uncertainty measure [13,28,29]. A baseline, moreover, offers only a lower bound of performance, which is not sufficient to understand the practical value of the model (a reporting sketch also follows this list).
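For illustration, the following sketch (synthetic data; not the paper’s code) contrasts random k-fold with a grouped splitter in scikit-learn: only the latter guarantees that every test fold contains entirely unseen students.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 5)), rng.normal(size=120)
students = np.repeat(np.arange(12), 10)  # 12 hypothetical students, 10 windows each

# Random folds: each test fold still contains data from already-seen students.
random_cv = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=6, shuffle=True, random_state=0))
# Grouped folds: each test fold contains only students absent from training.
student_cv = cross_val_score(Ridge(), X, y, cv=GroupKFold(n_splits=6), groups=students)
```

Only the second estimate says anything about generalizing to new students.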
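Likewise, a minimal sketch of the reporting practice argued for above, pairing the mean with a spread measure and a no-information baseline (model, data and scorer are placeholders):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)

scoring = "neg_root_mean_squared_error"
model = cross_val_score(Ridge(), X, y, cv=5, scoring=scoring)
baseline = cross_val_score(DummyRegressor(strategy="mean"), X, y, cv=5, scoring=scoring)

# Report the mean plus an uncertainty measure, next to the no-information baseline.
print(f"model RMSE:    {-model.mean():.2f} (std {model.std():.2f})")
print(f"baseline RMSE: {-baseline.mean():.2f} (std {baseline.std():.2f})")
```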
3. EFAR-MMLA
3.1. Model Evaluation at Different Generalizability Levels
3.1.1. Instance Generalizability
3.1.2. Group Generalizability
3.1.3. Context Generalizability
3.2. Performance Reporting
3.2.1. Performance Variation Measure
3.2.2. Frames of Reference
3.3. Current State of MMLA Research from EFAR-MMLA Point of View
4. Illustrative Case Study
4.1. Motivation and Context of the MMLA Project
4.2. Research Problem
4.3. Methods
4.3.1. Data Gathering
4.3.2. Data Processing
- Simple Features: Weinberger and Fischer [54] highlighted the amount of participation as one of the key quantitative measures in collaborative learning, and it is considered a useful indicator of collaborative behavior [24]. In our case study, we computed the amount of participation in physical and digital spaces in terms of speaking time, turn-taking and writing activity in Etherpad (please refer to Table 6). To extract speaking time and turn-taking, the direction of the audio captured with the microphone array was mapped to each student according to their sitting position around the prototype. This mapping provided us with the sequence of speaking turns taken by students. We counted the total number of turns taken by each student in a group for each 30 s window, as well as their total speaking time (in 200 ms increments, the granularity of the audio direction detection algorithm). From the Etherpad logs, we obtained the number of characters added or deleted by each student. These features were first collected at the individual level; we then used PCA (Principal Component Analysis)-based fusion to obtain group-level features from the individual ones [55] (see the fusion sketch after this list). PCA is a dimensionality reduction technique that reduces the number of attributes in a dataset while preserving most of the variance in the data. Our preliminary analyses showed PCA to be a better-performing fusion method than average- and entropy-based methods for fusing individual student data (see [56]).
- Acoustic Features: We also extracted acoustic features (e.g., pitch, fundamental frequency, energy) from the group audio of all the collaborating students. This decision was based on previous collaboration modeling research [13,27,57], in which acoustic features achieved higher classification accuracy than other feature types in laboratory settings [13]. We used the openSMILE toolkit (https://www.audeering.com/opensmile/, accessed on 12 August 2020) to extract 1584 different acoustic features (please refer to Appendix A Table A1 for a full list). Given the high dimensionality of this feature set (i.e., more features than the total number of data points), we applied several dimensionality reduction strategies (see the reduction sketch after this list). First, removing highly correlated features (correlation > 0.90) left us with 803 features; we then applied PCA for further reduction, resulting in 156 features explaining 90% of the variance in the data.
- Linguistic Features: We used a speech-to-text service (Otter.ai: https://otter.ai/, accessed on 16 May 2020) to obtain transcripts of the recorded audio automatically. We chose this approach over manual transcription because it integrates more easily into an automated pipeline for estimating collaboration quality. From each group’s transcript, we extracted linguistic features (e.g., frequency of “we”, “you”, “our”). These features were based on previous research that found differences between collaborative and non-collaborative behaviors in a group’s usage of first/second-person singular pronouns (I, you) and first-person plural pronouns (we, us) [58], and in the numbers of words and sentences [25]. We counted the occurrences of these words in every 30 s window, in addition to the total number of words and the number of “wh” words (e.g., what, why, where); a counting sketch follows this list.
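Regarding the simple features above, the following sketch shows one plausible reading of the PCA-based fusion [55,56] (not the authors’ exact code): per 30 s window, the individual students’ values of a feature are projected onto the first principal component to yield a single group-level value.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows = 30 s windows, columns = individual students' speaking time (made-up values).
speaking_time = np.array([
    [12.4, 3.2, 8.0, 5.6],
    [10.0, 4.8, 7.2, 6.4],
    [9.6, 2.4, 9.8, 4.0],
    [7.2, 6.6, 5.0, 8.8],
])

# The first principal component summarizes the four students into one group feature.
group_speaking = PCA(n_components=1).fit_transform(speaking_time).ravel()
```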
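For the acoustic features, the two-step dimensionality reduction described above could look like this sketch (the feature matrix is synthetic; the 0.90 thresholds follow the text):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
acoustic = pd.DataFrame(rng.normal(size=(150, 60)))  # stand-in for the openSMILE features

# Step 1: drop one feature from every pair with correlation > 0.90.
corr = acoustic.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
acoustic = acoustic.drop(columns=[c for c in upper.columns if (upper[c] > 0.90).any()])

# Step 2: keep the principal components explaining 90% of the variance.
reduced = PCA(n_components=0.90).fit_transform(acoustic)
```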
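For the linguistic features, a sketch of the per-window counts (the exact pronoun and “wh” word lists are our assumption, extrapolated from the examples given above):

```python
import re
from collections import Counter

PRONOUNS = {"i", "you", "we", "us", "our"}
WH_WORDS = {"what", "why", "where", "when", "who", "how"}

def window_features(transcript: str) -> dict:
    """Count pronouns, wh-words, total words and sentences in one 30 s window."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(words)
    feats = {w: counts[w] for w in sorted(PRONOUNS | WH_WORDS)}
    feats["n_words"] = len(words)
    feats["n_sentences"] = len(re.findall(r"[.!?]+", transcript))
    return feats

print(window_features("Why don't we start? I think you and I should split our work."))
```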
4.3.3. Data Annotation
4.4. Analysis
4.5. Results
5. Discussion
- Assessing models at different generalizability levels: Evaluating ML models at different generalizability levels helps researchers see how their model’s performance varies when moving towards stricter generalizability levels. In our case study, the ML regression models performed better than the lower (no-information) frame of reference in estimating overall collaboration quality and some of its sub-dimensions (e.g., ITO, ARG) when evaluated with train-and-test evaluation or stratified k-fold, which correspond to instance generalizability. However, performance degraded substantially when the models were evaluated under the more stringent assessments in the EFAR-MMLA (a cross-validation sketch for the group and context levels follows this list). The most likely reasons for this degradation are the small dataset size and the small number of contexts/groups from which data were gathered. The assessment at the group and context levels made clear that the model does not yet generalize well enough to be useful in practice.
- Understanding the rationale for performance variation: The EFAR-MMLA can help us understand the reasons for performance variation by systematically evaluating ML models at different generalizability levels. In our case study, we found significant variation in the performance of our ML models at the instance and group generalizability levels. This variation led us to explore the underlying reason by looking into the individual cross-validation units (e.g., student groups). We identified one group that was actively participating (in terms of speaking time and characters added or deleted) but mostly engaged in off-topic discussion. Human raters therefore scored this group lower in collaboration quality, but our models could not detect this (given the types of features included in the modeling), leading to poor model performance: the models were unable to generalize to that group’s behavior. This finding also prompted us to consider including additional (e.g., content-based) features in a future version of our ML models to mitigate the identified issue.
- Offering better comprehensibility regarding the model’s performance: Another benefit of the proposed framework is a clearer understanding of the performance report, both for the research team writing it and for its readers. Although the mean performance of our model at the instance generalizability level (k-fold evaluation) was better than the no-information lower bound, its high variation suggested that the performance was not stable (i.e., it was likely to fail on future data). The inclusion of a performance upper bound allowed us to see the extent to which the model still deviated from that reference (see the frames-of-reference sketch after this list).
- Bringing another perspective on bias identification: Generalizability is a highly sought-after characteristic across domains (e.g., clinical research, computer vision). However, this emphasis is not needed in every scenario, and it has been criticized in other fields (e.g., clinical research) [62]. Consider an example similar to [62], but in an educational context: a researcher develops an ML model to estimate classroom engagement of primary school students, validates it in other primary school classrooms, and finds its performance to be moderately stable. If the researcher now tries to extend the model to other student populations (e.g., secondary, higher secondary), making it generalizable may significantly reduce its performance even for primary students. If the model generalizes well to other students, that is certainly positive; however, a model that does not generalize beyond primary students but performs fairly for them is still useful, and its lack of broader generalizability does not necessarily undermine its value. In this direction, besides supporting researchers in evaluating their models for generalizability at the group and context levels, EFAR-MMLA brings another perspective: the identification of biases in the models when applied to different groups or contexts. Such bias information can help researchers and the community identify the scenarios in which a model is useful, and where and by how much its performance degrades when group and contextual characteristics change. This perspective can help the MMLA community take a further step towards developing practically relevant models for real-world educational settings.
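As a concrete illustration of the stricter generalizability levels discussed above, group- and context-level evaluation can be expressed with scikit-learn’s LeaveOneGroupOut, using group or context labels as the grouping variable (all data and labels below are invented):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(2)
X, y = rng.normal(size=(100, 6)), rng.normal(size=100)
group_ids = np.repeat(np.arange(5), 20)    # 5 student groups, 20 windows each
context_ids = np.repeat(np.arange(2), 50)  # 2 classrooms/contexts

# Group generalizability: every test fold is one entirely unseen group.
logo = cross_val_score(Ridge(), X, y, groups=group_ids, cv=LeaveOneGroupOut())
# Context generalizability: every test fold is one entirely unseen context.
loco = cross_val_score(Ridge(), X, y, groups=context_ids, cv=LeaveOneGroupOut())
```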
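And the two frames of reference mentioned above can be computed as in this sketch: a lower bound from a predictor that always outputs the theoretical average of the rating scale, and an upper bound from the disagreement between two human raters (reusing the −2 to 2 annotation example from Section 4.3.3; the model RMSE is hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

ground_truth = np.array([1, 2, -2, 1, 1, 1, 1, 2, 2, -2], dtype=float)  # final ratings
second_rater = np.array([1, 1, -2, 0, -2, 0, 1, 2, 1, -2], dtype=float)

lower_bound = rmse(ground_truth, np.zeros_like(ground_truth))  # theoretical average = 0
upper_bound = rmse(ground_truth, second_rater)                 # human-level error
model_rmse = 1.1                                               # hypothetical model result
# The model is promising only if upper_bound <= model_rmse < lower_bound.
```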
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ML | Machine learning |
MMLA | Multimodal learning analytics |
EFAR-MMLA | Evaluation framework to assess and report generalizability of ML models in MMLA |
CL | Collaborative learning |
CV | Cross-validation |
CQ | Collaboration quality |
ARG | Argumentation |
KE | Knowledge exchange |
CO | Cooperative orientation |
SPST | Structuring problem-solving process and time management |
SMU | Sustaining mutual understanding |
ITO | Individual task orientation |
CF | Collaboration flow |
GP | Gaussian process |
SVM | Support vector machine |
RF | Random forest |
NN | Neural network |
RNN | Recurrent neural network |
LSTM | Long short-term memory |
Ada | AdaBoost |
LR | Logistic regression |
DT | Decision tree |
STD | Standard deviation |
HP | Hyper-parameter |
IQR | Inter-quartile range |
CI | Confidence interval |
NB | Naive Bayes |
Appendix A
Following openSMILE’s usual structure of base low-level descriptors (Fe), their delta coefficients (D) and statistical functionals (F), each feature count below equals (Fe + D) × F.

Feature Name | Number of Features |
---|---|
MFCC | 630 (15 Fe, 15 D, 21 F) |
Mel frequency | 336 (8 Fe, 8 D, 21 F) |
Linear spectral coefficient | 336 (8 Fe, 8 D, 21 F) |
Loudness | 42 (1 Fe, 1 D, 21 F) |
Voicing | 42 (1 Fe, 1 D, 21 F) |
Fundamental frequency envelope | 42 (1 Fe, 1 D, 21 F) |
Jitter | 38 (1 Fe, 1 D, 19 F) |
Jitter(DP) | 38 (1 Fe, 1 D, 19 F) |
Shimmer | 38 (1 Fe, 1 D, 19 F) |
Pitch onsets | 38 (1 Fe, 1 D, 19 F) |
Duration | 38 (1 Fe, 1 D, 19 F) |
Appendix B
In the following tables, SKF = stratified k-fold, KF = k-fold, LOGO = leave-one-group-out and LOCO = leave-one-context-out; each cell reports the error obtained with that strategy and feature set, with its variation in parentheses (lower is better).

Strategy | Features | CQ | ARG | CF | CO | ITO | KE | SMU | SPST |
---|---|---|---|---|---|---|---|---|---|
RHO | Basic | 4.88 (0.10) | 0.83 (0.00) | 1.23 (0.01) | 0.98 (0.00) | 0.87 (0.01) | 1.06 (0.01) | 1.19 (0.01) | 1.51 (0.01) |
RHO | Acoustic | 5.81 (0.21) | 0.90 (0.00) | 1.42 (0.01) | 0.99 (0.00) | 0.96 (0.01) | 1.23 (0.01) | 1.34 (0.01) | 1.60 (0.01) |
RHO | Linguistic | 5.29 (0.09) | 0.82 (0.00) | 1.34 (0.01) | 0.98 (0.00) | 0.98 (0.00) | 1.11 (0.01) | 1.32 (0.01) | 1.60 (0.01) |
RHO | All | 5.65 (0.25) | 0.90 (0.01) | 1.42 (0.01) | 0.97 (0.00) | 0.96 (0.01) | 1.17 (0.01) | 1.32 (0.01) | 1.59 (0.01) |
SKF | Basic | 4.94 (0.09) | 0.80 (0.00) | 1.20 (0.00) | 0.99 (0.00) | 0.86 (0.01) | 1.03 (0.01) | 1.22 (0.01) | 1.49 (0.00) |
SKF | Acoustic | 5.82 (0.07) | 0.90 (0.00) | 1.44 (0.00) | 0.99 (0.00) | 0.99 (0.00) | 1.21 (0.01) | 1.33 (0.01) | 1.61 (0.01) |
SKF | Linguistic | 5.23 (0.19) | 0.79 (0.00) | 1.37 (0.01) | 0.99 (0.00) | 0.95 (0.00) | 1.09 (0.01) | 1.29 (0.01) | 1.62 (0.00) |
SKF | All | 5.83 (0.07) | 0.91 (0.00) | 1.41 (0.01) | 0.96 (0.00) | 0.95 (0.00) | 1.17 (0.00) | 1.34 (0.00) | 1.58 (0.00) |
KF | Basic | 5.42 (3.51) | 0.85 (0.06) | 1.36 (0.12) | 0.99 (0.07) | 0.96 (0.17) | 1.13 (0.15) | 1.31 (0.17) | 1.53 (0.16) |
KF | Acoustic | 5.34 (6.02) | 0.90 (0.04) | 1.34 (0.25) | 0.95 (0.07) | 0.91 (0.19) | 1.15 (0.26) | 1.27 (0.22) | 1.68 (0.08) |
KF | Linguistic | 5.42 (4.00) | 0.85 (0.03) | 1.40 (0.09) | 0.98 (0.07) | 0.99 (0.12) | 1.09 (0.16) | 1.34 (0.12) | 1.63 (0.07) |
KF | All | 5.25 (6.53) | 0.88 (0.06) | 1.35 (0.27) | 0.93 (0.08) | 0.90 (0.20) | 1.09 (0.25) | 1.28 (0.23) | 1.60 (0.10) |
LOGO | Basic | 6.38 (6.88) | 0.84 (0.03) | 1.53 (0.23) | 1.05 (0.10) | 1.15 (0.27) | 1.37 (0.30) | 1.55 (0.20) | 1.58 (0.08) |
LOGO | Acoustic | 5.60 (6.99) | 0.87 (0.02) | 1.38 (0.33) | 0.97 (0.09) | 0.99 (0.21) | 1.21 (0.27) | 1.33 (0.25) | 1.69 (0.04) |
LOGO | Linguistic | 5.70 (4.13) | 0.84 (0.02) | 1.42 (0.17) | 0.99 (0.07) | 1.07 (0.09) | 1.17 (0.14) | 1.42 (0.18) | 1.67 (0.10) |
LOGO | All | 5.68 (7.85) | 0.84 (0.03) | 1.40 (0.39) | 0.97 (0.11) | 0.98 (0.22) | 1.19 (0.27) | 1.36 (0.26) | 1.63 (0.05) |
LOCO | Basic | 6.70 (0.36) | 0.88 (0.00) | 1.75 (0.00) | 1.23 (0.01) | 1.19 (0.03) | 1.53 (0.00) | 1.59 (0.01) | 1.84 (0.05) |
LOCO | Acoustic | 6.08 (3.24) | 0.88 (0.01) | 1.49 (0.16) | 1.05 (0.07) | 1.06 (0.05) | 1.31 (0.16) | 1.38 (0.12) | 1.94 (0.00) |
LOCO | Linguistic | 6.35 (0.52) | 0.89 (0.00) | 1.55 (0.02) | 1.06 (0.02) | 1.12 (0.01) | 1.29 (0.01) | 1.48 (0.01) | 1.86 (0.03) |
LOCO | All | 6.35 (2.85) | 0.88 (0.00) | 1.54 (0.22) | 1.05 (0.08) | 1.10 (0.05) | 1.34 (0.14) | 1.45 (0.11) | 1.94 (0.00) |
Strategy | Features | CQ | ARG | CF | CO | ITO | KE | SMU | SPST |
---|---|---|---|---|---|---|---|---|---|
RHO | Basic | 4.62 (0.10) | 0.79 (0.00) | 1.14 (0.00) | 0.91 (0.00) | 0.84 (0.01) | 1.02 (0.01) | 1.09 (0.01) | 1.42 (0.01) |
RHO | Acoustic | 4.61 (0.10) | 0.77 (0.00) | 1.15 (0.01) | 0.93 (0.00) | 0.81 (0.01) | 1.00 (0.00) | 1.14 (0.01) | 1.46 (0.01) |
RHO | Speech | 4.96 (0.10) | 0.75 (0.00) | 1.26 (0.01) | 0.89 (0.00) | 0.94 (0.00) | 1.03 (0.00) | 1.23 (0.01) | 1.49 (0.01) |
RHO | All | 4.50 (0.11) | 0.74 (0.00) | 1.12 (0.01) | 0.88 (0.00) | 0.81 (0.00) | 0.96 (0.00) | 1.10 (0.01) | 1.45 (0.01) |
SKF | Basic | 4.68 (0.02) | 0.76 (0.00) | 1.13 (0.00) | 0.90 (0.00) | 0.84 (0.00) | 1.01 (0.00) | 1.11 (0.00) | 1.41 (0.01) |
SKF | Acoustic | 4.60 (0.22) | 0.78 (0.00) | 1.14 (0.00) | 0.93 (0.00) | 0.80 (0.00) | 0.99 (0.01) | 1.10 (0.00) | 1.44 (0.00) |
SKF | Speech | 4.82 (0.08) | 0.74 (0.00) | 1.26 (0.00) | 0.90 (0.00) | 0.92 (0.00) | 1.01 (0.00) | 1.20 (0.00) | 1.47 (0.01) |
SKF | All | 4.48 (0.04) | 0.75 (0.00) | 1.12 (0.01) | 0.89 (0.00) | 0.80 (0.00) | 0.94 (0.00) | 1.08 (0.00) | 1.44 (0.00) |
KF | Basic | 5.11 (3.67) | 0.80 (0.05) | 1.24 (0.13) | 0.95 (0.06) | 0.91 (0.18) | 1.09 (0.16) | 1.19 (0.10) | 1.43 (0.11) |
KF | Acoustic | 4.63 (3.16) | 0.78 (0.05) | 1.19 (0.10) | 0.93 (0.07) | 0.84 (0.08) | 0.98 (0.23) | 1.16 (0.16) | 1.48 (0.11) |
KF | Speech | 4.87 (3.36) | 0.75 (0.04) | 1.24 (0.11) | 0.90 (0.05) | 0.90 (0.15) | 1.01 (0.16) | 1.23 (0.12) | 1.47 (0.11) |
KF | All | 4.65 (3.31) | 0.73 (0.04) | 1.19 (0.10) | 0.86 (0.05) | 0.84 (0.09) | 0.97 (0.16) | 1.17 (0.17) | 1.45 (0.13) |
LOGO | Basic | 6.21 (8.16) | 0.77 (0.03) | 1.42 (0.23) | 1.02 (0.07) | 1.11 (0.26) | 1.38 (0.25) | 1.41 (0.19) | 1.55 (0.10) |
LOGO | Acoustic | 5.85 (8.06) | 0.79 (0.03) | 1.36 (0.29) | 1.00 (0.09) | 1.10 (0.20) | 1.20 (0.39) | 1.35 (0.31) | 1.50 (0.12) |
LOGO | Speech | 5.28 (3.60) | 0.74 (0.03) | 1.33 (0.16) | 0.92 (0.06) | 1.01 (0.12) | 1.09 (0.13) | 1.31 (0.11) | 1.57 (0.12) |
LOGO | All | 5.79 (6.88) | 0.74 (0.04) | 1.39 (0.26) | 0.91 (0.08) | 1.18 (0.22) | 1.20 (0.32) | 1.37 (0.29) | 1.51 (0.11) |
LOCO | Basic | 6.01 (2.98) | 0.83 (0.00) | 1.57 (0.00) | 1.07 (0.04) | 1.07 (0.08) | 1.40 (0.03) | 1.51 (0.01) | 1.78 (0.03) |
LOCO | Acoustic | 6.63 (1.39) | 0.84 (0.00) | 1.54 (0.06) | 1.07 (0.04) | 1.07 (0.02) | 1.41 (0.07) | 1.48 (0.05) | 1.72 (0.07) |
LOCO | Speech | 5.78 (1.85) | 0.78 (0.00) | 1.48 (0.06) | 0.99 (0.03) | 1.09 (0.04) | 1.24 (0.01) | 1.39 (0.03) | 1.83 (0.05) |
LOCO | All | 6.50 (0.81) | 0.80 (0.00) | 1.52 (0.04) | 1.01 (0.04) | 1.11 (0.04) | 1.37 (0.04) | 1.47 (0.05) | 1.72 (0.06) |
Strategy | Features | CQ | ARG | CF | CO | ITO | KE | SMU | SPST |
---|---|---|---|---|---|---|---|---|---|
RHO | Basic | 4.68 (0.12) | 0.77 (0.00) | 1.14 (0.01) | 0.89 (0.00) | 0.85 (0.01) | 1.03 (0.01) | 1.09 (0.01) | 1.43 (0.01) |
RHO | Acoustic | 4.70 (0.13) | 0.77 (0.00) | 1.19 (0.01) | 0.94 (0.01) | 0.84 (0.01) | 1.03 (0.01) | 1.17 (0.01) | 1.50 (0.01) |
RHO | Speech | 4.79 (0.10) | 0.74 (0.00) | 1.22 (0.01) | 0.87 (0.00) | 0.91 (0.00) | 1.00 (0.00) | 1.19 (0.01) | 1.47 (0.01) |
RHO | All | 4.52 (0.14) | 0.74 (0.00) | 1.16 (0.01) | 0.89 (0.00) | 0.85 (0.01) | 0.99 (0.01) | 1.14 (0.01) | 1.53 (0.01) |
SKF | Basic | 4.73 (0.01) | 0.75 (0.00) | 1.12 (0.00) | 0.88 (0.00) | 0.86 (0.00) | 1.02 (0.00) | 1.11 (0.00) | 1.44 (0.01) |
SKF | Acoustic | 4.70 (0.32) | 0.76 (0.00) | 1.15 (0.01) | 0.95 (0.00) | 0.83 (0.00) | 1.02 (0.01) | 1.13 (0.00) | 1.49 (0.00) |
SKF | Speech | 4.79 (0.09) | 0.74 (0.00) | 1.22 (0.00) | 0.88 (0.00) | 0.89 (0.00) | 0.99 (0.00) | 1.17 (0.00) | 1.46 (0.01) |
SKF | All | 4.50 (0.07) | 0.73 (0.00) | 1.13 (0.01) | 0.89 (0.00) | 0.83 (0.00) | 0.95 (0.00) | 1.09 (0.01) | 1.48 (0.00) |
KF | Basic | 5.00 (4.39) | 0.77 (0.05) | 1.24 (0.19) | 0.90 (0.06) | 0.88 (0.21) | 1.09 (0.21) | 1.20 (0.15) | 1.42 (0.18) |
KF | Acoustic | 4.77 (3.71) | 0.77 (0.04) | 1.20 (0.14) | 0.92 (0.10) | 0.84 (0.13) | 0.99 (0.24) | 1.20 (0.22) | 1.43 (0.25) |
KF | Speech | 4.67 (3.79) | 0.73 (0.04) | 1.21 (0.13) | 0.87 (0.05) | 0.86 (0.17) | 0.98 (0.16) | 1.17 (0.14) | 1.45 (0.12) |
KF | All | 4.66 (3.69) | 0.73 (0.04) | 1.21 (0.18) | 0.86 (0.07) | 0.85 (0.14) | 0.96 (0.18) | 1.20 (0.24) | 1.43 (0.26) |
LOGO | Basic | 6.02 (8.40) | 0.76 (0.03) | 1.42 (0.30) | 0.98 (0.09) | 1.05 (0.32) | 1.35 (0.29) | 1.41 (0.26) | 1.48 (0.20) |
LOGO | Acoustic | 5.89 (8.64) | 0.79 (0.03) | 1.45 (0.40) | 0.97 (0.13) | 1.07 (0.28) | 1.17 (0.44) | 1.40 (0.40) | 1.47 (0.21) |
LOGO | Speech | 4.98 (3.89) | 0.72 (0.03) | 1.29 (0.18) | 0.90 (0.06) | 0.95 (0.14) | 1.05 (0.14) | 1.24 (0.14) | 1.53 (0.18) |
LOGO | All | 5.70 (8.37) | 0.73 (0.03) | 1.45 (0.37) | 0.91 (0.13) | 1.13 (0.26) | 1.19 (0.42) | 1.40 (0.38) | 1.48 (0.24) |
LOCO | Basic | 6.11 (2.83) | 0.82 (0.00) | 1.69 (0.00) | 1.05 (0.05) | 1.09 (0.05) | 1.46 (0.03) | 1.55 (0.02) | 1.86 (0.09) |
LOCO | Acoustic | 6.88 (1.55) | 0.85 (0.00) | 1.68 (0.02) | 1.07 (0.06) | 1.14 (0.01) | 1.45 (0.12) | 1.61 (0.04) | 1.79 (0.12) |
LOCO | Speech | 5.47 (1.69) | 0.75 (0.00) | 1.47 (0.05) | 0.96 (0.04) | 1.05 (0.03) | 1.23 (0.00) | 1.33 (0.03) | 1.81 (0.08) |
LOCO | All | 6.85 (1.03) | 0.80 (0.00) | 1.71 (0.00) | 1.04 (0.07) | 1.15 (0.02) | 1.42 (0.13) | 1.61 (0.03) | 1.87 (0.07) |
Strategy | Features | CQ | ARG | CF | CO | ITO | KE | SMU | SPST |
---|---|---|---|---|---|---|---|---|---|
RHO | Basic | 4.87 (0.14) | 0.84 (0.01) | 1.36 (0.02) | 1.04 (0.01) | 0.88 (0.01) | 1.09 (0.01) | 1.25 (0.01) | 1.63 (0.02) |
RHO | Acoustic | 4.84 (0.13) | 0.84 (0.00) | 1.35 (0.01) | 1.04 (0.01) | 0.85 (0.01) | 1.06 (0.01) | 1.22 (0.02) | 1.66 (0.02) |
RHO | Speech | 4.92 (0.13) | 0.78 (0.00) | 1.29 (0.01) | 0.92 (0.01) | 0.94 (0.01) | 1.09 (0.01) | 1.26 (0.01) | 1.69 (0.02) |
RHO | All | 4.69 (0.14) | 0.82 (0.01) | 1.25 (0.03) | 1.05 (0.01) | 0.86 (0.01) | 1.01 (0.01) | 1.16 (0.01) | 1.66 (0.02) |
SKF | Basic | 4.96 (0.10) | 0.83 (0.01) | 1.39 (0.00) | 1.05 (0.00) | 0.88 (0.00) | 1.12 (0.00) | 1.21 (0.00) | 1.67 (0.00) |
SKF | Acoustic | 4.75 (0.10) | 0.85 (0.00) | 1.35 (0.00) | 1.05 (0.00) | 0.85 (0.00) | 1.09 (0.00) | 1.21 (0.01) | 1.68 (0.00) |
SKF | Speech | 4.94 (0.02) | 0.82 (0.00) | 1.38 (0.00) | 0.94 (0.00) | 0.93 (0.00) | 1.11 (0.00) | 1.26 (0.00) | 1.69 (0.00) |
SKF | All | 4.66 (0.07) | 0.83 (0.00) | 1.33 (0.00) | 1.05 (0.00) | 0.85 (0.00) | 1.01 (0.00) | 1.18 (0.01) | 1.67 (0.00) |
KF | Basic | 5.22 (5.86) | 0.82 (0.07) | 1.31 (0.32) | 0.98 (0.14) | 0.89 (0.22) | 1.06 (0.20) | 1.19 (0.22) | 1.55 (0.39) |
KF | Acoustic | 4.77 (4.81) | 0.86 (0.06) | 1.23 (0.36) | 0.98 (0.14) | 0.83 (0.17) | 1.05 (0.23) | 1.22 (0.21) | 1.55 (0.45) |
KF | Speech | 4.68 (4.36) | 0.80 (0.04) | 1.25 (0.27) | 0.91 (0.09) | 0.88 (0.21) | 1.06 (0.22) | 1.18 (0.23) | 1.58 (0.39) |
KF | All | 4.60 (5.02) | 0.79 (0.04) | 1.29 (0.33) | 0.98 (0.14) | 0.82 (0.18) | 1.01 (0.21) | 1.21 (0.21) | 1.55 (0.43) |
LOGO | Basic | 5.67 (7.47) | 0.89 (0.04) | 1.31 (0.38) | 1.00 (0.16) | 0.99 (0.27) | 1.27 (0.60) | 1.24 (0.24) | 1.57 (0.34) |
LOGO | Acoustic | 5.73 (8.20) | 0.89 (0.03) | 1.30 (0.37) | 1.00 (0.16) | 1.02 (0.21) | 1.27 (0.63) | 1.32 (0.28) | 1.60 (0.35) |
LOGO | Speech | 5.15 (5.05) | 0.87 (0.05) | 1.28 (0.33) | 0.90 (0.11) | 0.93 (0.23) | 1.24 (0.63) | 1.23 (0.23) | 1.59 (0.37) |
LOGO | All | 5.45 (6.90) | 0.81 (0.04) | 1.48 (0.34) | 1.00 (0.16) | 1.07 (0.22) | 1.24 (0.64) | 1.27 (0.26) | 1.58 (0.34) |
LOCO | Basic | 5.94 (4.00) | 0.88 (0.00) | 1.75 (0.02) | 1.11 (0.09) | 1.01 (0.11) | 1.32 (0.21) | 1.39 (0.09) | 1.88 (0.21) |
LOCO | Acoustic | 6.15 (3.61) | 0.85 (0.00) | 1.66 (0.06) | 1.11 (0.09) | 1.07 (0.05) | 1.36 (0.15) | 1.41 (0.09) | 1.83 (0.26) |
LOCO | Speech | 5.51 (3.07) | 0.80 (0.00) | 1.50 (0.13) | 0.97 (0.07) | 1.03 (0.06) | 1.16 (0.12) | 1.32 (0.12) | 1.87 (0.20) |
LOCO | All | 5.83 (2.72) | 0.84 (0.00) | 1.67 (0.06) | 1.11 (0.09) | 1.06 (0.06) | 1.26 (0.13) | 1.37 (0.10) | 1.82 (0.26) |
Strategy | Features | CQ | ARG | CF | CO | ITO | KE | SMU | SPST |
---|---|---|---|---|---|---|---|---|---|
RHO | Basic | 4.84 (0.11) | 0.81 (0.01) | 1.21 (0.01) | 0.94 (0.01) | 0.91 (0.01) | 1.06 (0.01) | 1.16 (0.01) | 1.43 (0.01) |
RHO | Acoustic | 5.72 (0.15) | 0.92 (0.01) | 1.45 (0.01) | 1.17 (0.01) | 1.16 (0.01) | 1.32 (0.02) | 1.48 (0.01) | 1.66 (0.02) |
RHO | Speech | 4.86 (0.09) | 0.76 (0.00) | 1.22 (0.00) | 0.90 (0.01) | 0.91 (0.00) | 1.04 (0.00) | 1.20 (0.01) | 1.47 (0.01) |
RHO | All | 5.34 (0.13) | 0.92 (0.01) | 1.46 (0.01) | 1.08 (0.01) | 1.11 (0.01) | 1.33 (0.01) | 1.37 (0.01) | 1.69 (0.03) |
SKF | Basic | 5.16 (0.07) | 0.78 (0.00) | 1.20 (0.01) | 0.92 (0.00) | 0.91 (0.00) | 1.07 (0.01) | 1.17 (0.00) | 1.41 (0.00) |
SKF | Acoustic | 6.19 (0.17) | 0.90 (0.01) | 1.52 (0.04) | 1.15 (0.00) | 1.17 (0.01) | 1.37 (0.02) | 1.40 (0.04) | 1.77 (0.01) |
SKF | Speech | 4.89 (0.04) | 0.74 (0.00) | 1.20 (0.00) | 0.88 (0.00) | 0.91 (0.00) | 1.02 (0.00) | 1.19 (0.00) | 1.50 (0.00) |
SKF | All | 5.31 (0.03) | 1.00 (0.00) | 1.46 (0.01) | 1.12 (0.00) | 1.08 (0.00) | 1.37 (0.00) | 1.45 (0.04) | 1.89 (0.04) |
KF | Basic | 5.27 (3.84) | 0.80 (0.03) | 1.35 (0.21) | 0.98 (0.06) | 0.91 (0.15) | 1.11 (0.17) | 1.30 (0.17) | 1.49 (0.10) |
KF | Acoustic | 6.21 (4.76) | 0.97 (0.04) | 1.49 (0.13) | 1.19 (0.14) | 1.10 (0.12) | 1.51 (0.12) | 1.50 (0.21) | 1.83 (0.07) |
KF | Speech | 4.71 (3.69) | 0.74 (0.04) | 1.21 (0.12) | 0.89 (0.06) | 0.88 (0.15) | 1.02 (0.16) | 1.18 (0.14) | 1.56 (0.13) |
KF | All | 5.82 (3.08) | 0.97 (0.02) | 1.53 (0.12) | 1.15 (0.05) | 1.10 (0.10) | 1.24 (0.07) | 1.43 (0.10) | 1.73 (0.08) |
LOGO | Basic | 6.29 (9.21) | 0.82 (0.03) | 1.39 (0.19) | 1.05 (0.09) | 1.15 (0.22) | 1.29 (0.17) | 1.44 (0.16) | 1.56 (0.07) |
LOGO | Acoustic | 6.61 (3.99) | 1.02 (0.00) | 1.49 (0.23) | 1.16 (0.04) | 1.18 (0.10) | 1.42 (0.12) | 1.60 (0.13) | 1.83 (0.03) |
LOGO | Speech | 4.95 (3.51) | 0.76 (0.02) | 1.32 (0.15) | 0.91 (0.07) | 0.96 (0.13) | 1.10 (0.15) | 1.31 (0.16) | 1.49 (0.13) |
LOGO | All | 6.47 (2.41) | 0.94 (0.04) | 1.62 (0.19) | 1.13 (0.13) | 1.17 (0.16) | 1.46 (0.14) | 1.64 (0.14) | 1.84 (0.05) |
LOCO | Basic | 5.91 (2.11) | 1.02 (0.00) | 1.98 (0.27) | 1.06 (0.05) | 1.26 (0.03) | 1.37 (0.04) | 1.57 (0.00) | 1.64 (0.04) |
LOCO | Acoustic | 6.42 (2.77) | 0.97 (0.00) | 1.74 (0.00) | 1.40 (0.01) | 1.25 (0.27) | 1.51 (0.02) | 1.67 (0.01) | 1.95 (0.06) |
LOCO | Speech | 5.50 (2.29) | 0.78 (0.00) | 1.46 (0.04) | 1.05 (0.02) | 1.07 (0.05) | 1.19 (0.02) | 1.37 (0.05) | 1.72 (0.06) |
LOCO | All | 6.39 (0.65) | 0.97 (0.02) | 1.53 (0.00) | 1.37 (0.01) | 1.25 (0.14) | 1.69 (0.15) | 1.57 (0.04) | 1.88 (0.00) |
References
- Blikstein, P.; Worsley, M. Multimodal Learning Analytics and Education Data Mining: Using computational technologies to measure complex learning tasks. J. Learn. Anal. 2016, 3, 220–238. [Google Scholar] [CrossRef] [Green Version]
- Ochoa, X.; Worsley, M. Augmenting Learning Analytics with Multimodal Sensory Data. J. Learn. Anal. 2016, 3, 213–219. [Google Scholar] [CrossRef]
- Worsley, M.; Abrahamson, D.; Blikstein, P.; Grover, S.; Schneider, B.; Tissenbaum, M. Situating multimodal learning analytics. In 12th International Conference of the Learning Sciences (ICLS 2016); Looi, C.K., Polman, J., Reimann, P., Cress, U., Eds.; International Society of the Learning Sciences (ISLS): Singapore, 2016; Volume 2, pp. 1346–1349. [Google Scholar]
- Di Mitri, D.; Schneider, J.; Specht, M.; Drachsler, H. From signals to knowledge: A conceptual model for multimodal learning analytics. J. Comput. Assist. Learn. 2018, 34, 338–349. [Google Scholar] [CrossRef] [Green Version]
- Sharma, K.; Niforatos, E.; Giannakos, M.; Kostakos, V. Assessing Cognitive Performance Using Physiological and Facial Features: Generalizing across Contexts. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4. [Google Scholar] [CrossRef]
- Schneider, J.; Börner, D.; Van Rosmalen, P.; Specht, M. Augmenting the Senses: A Review on Sensor-Based Learning Support. Sensors 2015, 15, 4097–4133. [Google Scholar] [CrossRef] [Green Version]
- Mu, S.; Cui, M.; Huang, X. Multimodal Data Fusion in Learning Analytics: A Systematic Review. Sensors 2020, 20, 6856. [Google Scholar] [CrossRef] [PubMed]
- Spikol, D.; Ruffaldi, E.; Landolfi, L.; Cukurova, M. Estimation of Success in Collaborative Learning Based on Multimodal Learning Analytics Features. In Proceedings of the 17th IEEE International Conference on Advanced Learning Technologies (ICALT 2017), Timisoara, Romania, 3–7 July 2017; Chang, M., Chen, N., Huang, R., Kinshuk, Sampson, D.G., Vasiu, R., Eds.; IEEE Computer Society: Washington, DC, USA, 2017; pp. 269–273. [Google Scholar]
- Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Raschka, S. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv 2018, arXiv:1811.12808. [Google Scholar]
- Yu, T.; Zhu, H. Hyper-Parameter Optimization: A Review of Algorithms and Applications. arXiv 2020, arXiv:2003.05689. [Google Scholar]
- Roelofs, R. Measuring Generalization and Overfitting in Machine Learning. Ph.D. Thesis, UC Berkeley, Berkeley, CA, USA, 2019. [Google Scholar]
- Viswanathan, S.A.; VanLehn, K. Using the Tablet Gestures and Speech of Pairs of Students to Classify Their Collaboration. IEEE Trans. Learn. Technol. 2018, 11, 230–242. [Google Scholar] [CrossRef] [Green Version]
- Martinez, R.; Kay, J.; Wallace, J.R.; Yacef, K. Modelling Symmetry of Activity as an Indicator of Collocated Group Collaboration. In User Modeling, Adaption and Personalization; Konstan, J.A., Conejo, R., Marzo, J.L., Oliver, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 207–218. [Google Scholar]
- Geisser, S. The predictive sample reuse method with applications. J. Am. Stat. Assoc. 1975, 70, 320–328. [Google Scholar] [CrossRef]
- Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
- Prieto, L.; Sharma, K.; Kidzinski, L.; Rodríguez-Triana, M.; Dillenbourg, P. Multimodal teaching analytics: Automated extraction of orchestration graphs from wearable sensor data. J. Comput. Assist. Learn. 2018, 34, 193–203. [Google Scholar] [CrossRef] [PubMed]
- Giannakos, M.N.; Sharma, K.; Pappas, I.O.; Kostakos, V.; Velloso, E. Multimodal data as a means to understand the learning experience. Int. J. Inf. Manag. 2019, 48, 108–119. [Google Scholar] [CrossRef]
- Martinez-Maldonado, R.; Dimitriadis, Y.; Martinez-Monés, A.; Kay, J.; Yacef, K. Capturing and analyzing verbal and physical collaborative learning interactions at an enriched interactive tabletop. Int. J. Comput.-Support. Collab. 2013, 8, 455–485. [Google Scholar] [CrossRef]
- Spikol, D.; Ruffaldi, E.; Dabisias, G.; Cukurova, M. Supervised machine learning in multimodal learning analytics for estimating success in project-based learning. J. Comput. Assist. Learn. 2018, 34, 366–377. [Google Scholar] [CrossRef]
- Ezen-Can, A.; Grafsgaard, J.F.; Lester, J.C.; Boyer, K.E. Classifying Student Dialogue Acts with Multimodal Learning Analytics. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge (LAK ’15), Poughkeepsie, NY, USA, 16–20 March 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 280–289. [Google Scholar] [CrossRef]
- Grover, S.; Bienkowski, M.; Tamrakar, A.; Siddiquie, B.; Salter, D.; Divakaran, A. Multimodal Analytics to Study Collaborative Problem Solving in Pair Programming. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK ’16), Edinburgh, UK, 25–29 April 2016; ACM: New York, NY, USA, 2016; pp. 516–517. [Google Scholar]
- Mosier, C.I. The need and means of cross validation. I. Problems and designs of cross-validation. Educ. Psychol. Meas. 1951, 11, 5–11. [Google Scholar] [CrossRef]
- Martinez, R.; Wallace, J.R.; Kay, J.; Yacef, K. Modelling and Identifying Collaborative Situations in a Collocated Multi-display Groupware Setting. In Artificial Intelligence in Education; Biswas, G., Bull, S., Kay, J., Mitrovic, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 196–204. [Google Scholar]
- Reilly, J.M.; Schneider, B. Predicting the Quality of Collaborative Problem Solving Through Linguistic Analysis of Discourse. In Proceedings of the 12th International Conference on Educational Data Mining (EDM 2019), Montréal, QC, Canada, 2–5 July 2019; Desmarais, M.C., Lynch, C.F., Merceron, A., Nkambou, R., Eds.; International Educational Data Mining Society (IEDMS): Worcester, MA, USA, 2019. [Google Scholar]
- Smith, J.; Bratt, H.; Richey, C.; Bassiou, N.; Shriberg, E.; Tsiartas, A.; D’Angelo, C.; Alozie, N. Spoken interaction modeling for automatic assessment of collaborative learning. In Proceedings of the International Conference on Speech Prosody, Boston, MA, USA, 31 May–4 June 2016; pp. 277–281. [Google Scholar] [CrossRef] [Green Version]
- Bassiou, N.; Tsiartas, A.; Smith, J.; Bratt, H.; Richey, C.; Shriberg, E.; D’Angelo, C.; Alozie, N. Privacy-preserving speech analytics for automatic assessment of student collaboration. In Proceedings of the Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016; pp. 888–892. [Google Scholar]
- Echeverría, V.; Avendaño, A.; Chiluiza, K.; Vásquez, A.; Ochoa, X. Presentation Skills Estimation Based on Video and Kinect Data Analysis. In Proceedings of the 2014 ACM Workshop on Multimodal Learning Analytics Workshop and Grand Challenge (MLA ’14), Istanbul, Turkey, 12 November 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 53–60. [Google Scholar] [CrossRef]
- Ponce-López, V.; Escalera, S.; Baró, X. Multi-Modal Social Signal Analysis for Predicting Agreement in Conversation Settings. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction (ICMI ’13), Sydney, Australia, 9–13 December 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 495–502. [Google Scholar] [CrossRef]
- Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006, 7, 1–8. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Busemeyer, J.R.; Wang, Y.M. Model Comparisons and Model Selections Based on Generalization Criterion Methodology. J. Math. Psychol. 2000, 44, 171–189. [Google Scholar] [CrossRef] [Green Version]
- Forster, M.R. Key Concepts in Model Selection: Performance and Generalizability. J. Math. Psychol. 2000, 44, 205–231. [Google Scholar] [CrossRef] [Green Version]
- Justice, A.C.; Covinsky, K.E.; Berlin, J.A. Assessing the generalizability of prognostic information. Ann. Intern. Med. 1999, 130, 515–524. [Google Scholar] [CrossRef]
- Cronbach, L.J.; Linn, R.L.; Brennan, R.L.; Haertel, E.H. Generalizability analysis for performance assessments of student achievement or school effectiveness. Educ. Psychol. Meas. 1997, 57, 373–399. [Google Scholar] [CrossRef]
- Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI’95), Montreal, QC, Canada, 20–25 August 1995; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; Volume 2, pp. 1137–1143. [Google Scholar]
- Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; Friedler, S.A., Wilson, C., Eds.; Proceedings of Machine Learning Research. PMLR: New York, NY, USA, 2018; Volume 81, pp. 77–91. [Google Scholar]
- Gardner, J.; Brooks, C.; Baker, R. Evaluating the Fairness of Predictive Student Models Through Slicing Analysis. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge (LAK19), Tempe, AZ, USA, 4–8 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 225–234. [Google Scholar] [CrossRef]
- Kaur, H.; Pannu, H.S.; Malhi, A.K. A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions. ACM Comput. Surv. 2019, 52. [Google Scholar] [CrossRef] [Green Version]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Fitzpatrick, T.B. The validity and practicality of sun-reactive skin types I through VI. Arch. Dermatol. 1988, 124, 869–871. [Google Scholar] [CrossRef]
- Bauer, G.R.; Lizotte, D.J. Artificial Intelligence, Intersectionality, and the Future of Public Health. Am. J. Public Health 2021, 111, 98–100. [Google Scholar] [CrossRef]
- West, M.; Kraut, R.; Chew, H.E. I’d Blush if I Could: Closing Gender Divides in Digital Skills through Education; UNESCO/EQUALS, 2019. Available online: https://unesdoc.unesco.org/ark:/48223/pf0000367416.page=1 (accessed on 17 April 2021).
- UNESCO. Artificial Intelligence and Gender Equality: Key Findings of UNESCO’s Global Dialogue; UNESCO, 2020. Available online: https://unesdoc.unesco.org/ark:/48223/pf0000374174 (accessed on 17 April 2021).
- Ciston, S. Intersectional AI is essential: Polyvocal, multimodal, experimental methods to save artificial intelligence. J. Sci. Technol. Arts 2019, 11, 3–8. [Google Scholar] [CrossRef] [Green Version]
- Browne, M.W. Cross-Validation Methods. J. Math. Psychol. 2000, 44, 108–132. [Google Scholar] [CrossRef] [Green Version]
- Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19), Atlanta, GA, USA, 29–31 January 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 220–229. [Google Scholar] [CrossRef] [Green Version]
- Dodge, J.; Gururangan, S.; Card, D.; Schwartz, R.; Smith, N.A. Show Your Work: Improved Reporting of Experimental Results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 2185–2194. [Google Scholar] [CrossRef]
- Laal, M.; Ghodsi, S.M. Benefits of collaborative learning. Procedia Soc. Behav. Sci. 2012, 31, 486–490. [Google Scholar] [CrossRef] [Green Version]
- Martínez Maldonado, R.; Kay, J.; Shum, S.B.; Yacef, K. Collocated Collaboration Analytics: Principles and Dilemmas for Mining Multimodal Interaction Data. Hum. Comput. Interact. 2019, 34, 1–50. [Google Scholar] [CrossRef]
- Martinez-Maldonado, R. A handheld classroom dashboard: Teachers’ perspectives on the use of real-time collaborative learning analytics. Int. J. Comput.-Support. Collab. 2019, 14, 383–411. [Google Scholar] [CrossRef]
- Rummel, N.; Deiglmayr, A.; Spada, H.; Kahrimanis, G.; Avouris, N. Analyzing Collaborative Interactions Across Domains and Settings: An Adaptable Rating Scheme. In Analyzing Interactions in CSCL: Methods, Approaches and Issues; Puntambekar, S., Erkens, G., Hmelo-Silver, C., Eds.; Springer: Boston, MA, USA, 2011; pp. 367–390. [Google Scholar] [CrossRef]
- Chejara, P.; Kasepalu, R.; Shankar, S.K.; Prieto, L.P.; Rodríguez-Triana, M.J.; Ruiz-Calleja, A. MMLA Approach to Track Participation Behavior in Collaboration in Collocated Blended Settings. In Proceedings of CrossMMLA in Practice: Collecting, Annotating and Analyzing Multimodal Data Across Spaces, Co-Located with the 10th International Conference on Learning Analytics and Knowledge (LAK 2020), 24 March 2020; Giannakos, M.N., Spikol, D., Molenaar, I., Mitri, D.D., Sharma, K., Ochoa, X., Hammad, R., Eds.; Volume 2610, pp. 11–16. Available online: http://ceur-ws.org/Vol-2610/ (accessed on 17 April 2021).
- OASIS Standard. MQTT Version 3.1.1. Available online: http://docs.oasis-open.org/mqtt/mqtt/v3 (accessed on 17 April 2021).
- Weinberger, A.; Fischer, F. A framework to analyze argumentative knowledge construction in computer-supported collaborative learning. Comput. Educ. 2006, 46, 71–95. [Google Scholar] [CrossRef] [Green Version]
- Sharma, K.; Papamitsiou, Z.; Giannakos, M. Building pipelines for educational data using AI and multimodal analytics: A “grey-box” approach. Br. J. Educ. Technol. 2019, 50, 3004–3031. [Google Scholar] [CrossRef] [Green Version]
- Chejara, P.; Prieto, L.P.; Ruiz-Calleja, A.; Rodríguez-Triana, M.J.; Shankar, S.K.; Kasepalu, R. Quantifying Collaboration Quality in Face-to-Face Classroom Settings Using MMLA. In Collaboration Technologies and Social Computing; Nolte, A., Alvarez, C., Hishiyama, R., Chounta, I.A., Rodríguez-Triana, M.J., Inoue, T., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 159–166. [Google Scholar]
- Lubold, N.; Pon-Barry, H. Acoustic-Prosodic Entrainment and Rapport in Collaborative Learning Dialogues. In Proceedings of the 2014 ACM Workshop on Multimodal Learning Analytics Workshop and Grand Challenge (MLA ’14), Istanbul, Turkey, 12 November 2014; ACM: New York, NY, USA, 2014; pp. 5–12. [Google Scholar]
- Storch, N. How collaborative is pair work? ESL tertiary students composing in pairs. Lang. Teach. Res. 2001, 5, 29–53. [Google Scholar] [CrossRef]
- Meier, A.; Spada, H.; Rummel, N. A rating scheme for assessing the quality of computer-supported collaboration processes. Int. J. Comput.-Support. Collab. 2007, 2, 63–86. [Google Scholar] [CrossRef] [Green Version]
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef] [Green Version]
- Futoma, J.; Simons, M.; Panch, T.; Doshi-Velez, F.; Celi, L.A. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit. Health 2020, 2, e489–e492. [Google Scholar] [CrossRef]
- Shankar, S.K.; Rodríguez-Triana, M.J.; Ruiz-Calleja, A.; Prieto, L.P.; Chejara, P.; Martínez-Monés, A. Multimodal Data Value Chain (M-DVC): A Conceptual Tool to Support the Development of Multimodal Learning Analytics Solutions. IEEE Rev. Iberoam. Tecnol. Aprendiz. 2020, 15, 113–122. [Google Scholar] [CrossRef]
- Shankar, S.K.; Calleja, A.R.; Iglesias, S.S.; Arranz, A.O.; Topali, P.; Monés, A.M. A data value chain to model the processing of multimodal evidence in authentic learning scenarios. In Proceedings of the Learning Analytics Summer Institute, Vigo, Spain, 27–28 June 2019; pp. 71–83. Available online: http://ceur-ws.org/Vol-2415/ (accessed on 17 April 2021).
Article | Learning Construct | Problem Type | Model | Dataset Size | Evaluation Method |
---|---|---|---|---|---|
[5] | Cognitive performance | Regression | SVM, GP | 1724 | Leave-one study out |
[25] | Collaboration quality in CL | Classification | RF, SVM, NB | 40 | 5-fold cv |
[18] | Skill acquisition | Regression | RF | 252 | Hold-out |
[13] | Collaboration in CL | Classification | RF | 325 | 10-fold cv |
[20] | Artefact quality | Regression | NN | 18 | 4-fold cv |
[17] | Teaching activity, social level | Classification | SVM, LSTM, RNN | 5561 | Leave-one session out |
[22] | Collaboration level | Classification | SVM | 117 | 10-fold cv |
[26] | Collaboration quality | Classification | Ada | 1623 | 5-fold cv |
[27] | Collaboration quality | Classification | SVM, RF | 2942 | Hold-out |
[21] | Type of dialogue in group | Classification | K-means | 1443 | Leave-one student out |
[28] | Presentation skill | Classification | LR | 448 | 10-fold cv |
[29] | Agreement | Classification | Ada, SVM, NN | 28 | Leave-one out |
[24] | Collaboration levels in CL | Classification | NB, DT | 700 | Leave-one group out; 10-fold cv |
Frame No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
A | 1 | 2 | −2 | 2 | 1 | 1 | 1 | 2 | 2 | −2 |
B | 1 | 1 | −2 | 0 | −2 | 0 | 1 | 2 | 1 | −2 |
Frame No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Final | 1 | 2 | −2 | 1 | 1 | 1 | 1 | 2 | 2 | −2 |
Theoretical | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Article | Generalizability Levels | Performance Variation | Frame of Reference | HP Tuning |
---|---|---|---|---|
[5] | Context | Std | Random | Not reported |
[25] | Instance | Not reported | Random | Grid |
[18] | Instance | 95% CI | None | Not reported |
[13] | Instance | Not reported | None | Not reported |
[20] | Instance | Variance | None | Not reported |
[17] | Context | IQR | None | Manual and grid |
[22] | Instance | Not reported | Random | Not reported |
[26] | Instance | Not reported | Majority | Not reported |
[27] | Instance | Not reported | Majority | Manual |
[21] | Instance | Not reported | Majority | Manual |
[28] | Instance | Not reported | None | Not reported |
[29] | Instance | Not reported | Random | Not reported |
[24] | Group, context | Std | Proportion | Not reported |
Dataset | Number of Groups | Group Size | Data Sources | Problem Topic |
---|---|---|---|---|
Problem_A | 2 | 4 | Audio and log | Cell respiration |
Problem_B | 3 | 3–4 | Audio and log | Ethical codes on growing GMO |
Feature | Description |
---|---|
Speaking time | Speaking time in seconds for each student |
Turn-taking | Number of speaking turns taken by each student |
Char-add | Number of characters added in Etherpad by each student |
Char-del | Number of characters deleted in Etherpad by each student |
SMU | CF | KE | ARG | SPST | CO | ITO-1 | ITO-2 | ITO-3 | ITO-4 |
---|---|---|---|---|---|---|---|---|---|
0.71 | 0.91 | 0.74 | 0.80 | 0.65 | 0.68 | 0.72 | 0.76 | 0.75 | 0.78 |
Frame of Reference | CQ | SMU | CF | KE | ARG | SPST | CO | ITO |
---|---|---|---|---|---|---|---|---|
Predictor using theoretical average (lower bound) | 7.00 | 1.53 | 1.26 | 1.61 | 1.15 | 1.79 | 1.15 | 1.74 |
Human performance (upper bound) | 2.06 | 0.59 | 0.33 | 0.43 | 0.41 | 1.20 | 0.58 | 0.32 |