1. Introduction
Games, as one of the most common forms of entertainment, have a large audience. Difficulty is a fundamental concept in games [1,2]. Excessive difficulty may cause users anxiety, while insufficient difficulty may lead to boredom [3]. To attract players of different levels, game developers usually provide various difficulty settings for players to choose from, or design algorithms that automatically adjust the game difficulty to match the player’s level. Both approaches require assessing the user’s perceived game difficulty in advance [4]. Furthermore, game difficulty affects player motivation [5,6], self-confidence [7], performance [8], and engagement [9]. Therefore, predicting game difficulty accurately is both necessary and meaningful.
Most current studies use success rates, life points, and completion times to assess game difficulty [10,11,12]. Although these measures can predict game difficulty to some extent, they hardly take user data into account. Other methods collect physiological signals while users play and employ machine learning to classify the difficulty [13,14]. Although physiological signals are more objective, acquiring them usually requires sensor devices attached to the user, which can be invasive and degrade the game experience. Processing physiological data is also laborious. These factors can lower prediction accuracy.
This paper presents a method for predicting game difficulty based on multiple facial cues and game performance. Unlike those cumbersome physiological signal-based methods, our method is non-invasive and requires only one camera to record user data. Specifically, we use various computer vision methods to obtain multimodal facial features during gameplay, including facial expressions, gaze directions, and head poses. Multimodal features are collected because they can compensate for the shortcomings of a single modality. Then, we build a game difficulty dataset by combining these facial cues with the user’s game performance as input and labeling it with the user’s subjective difficulty scores. Subsequently, we train several machine learning classifiers and compare them on two tasks: two-class classification and three-class classification. The experimental results show that MLP, a forward-structured neural network that contains several fully connected layers, outperforms other machine learning classifiers on these two tasks. In addition, we compare the effects of different inputs on prediction performance, and the results further demonstrate the effectiveness of combining multimodal features.
The main objective of this paper is to provide a method for automatically recognizing user-perceived game difficulty, which can help designers modify games based on the difference between predicted difficulty and designed difficulty. In addition, the game level can be dynamically adjusted according to the estimated difficulty. For example, when users feel that the game is difficult or are challenged when playing, the system could reduce the game difficulty accordingly. Additionally, our method can be introduced not only to game difficulty prediction but also to any research related to user perception.
In summary, the main contributions of this paper are as follows:
We created a game difficulty dataset that uses facial expressions, gaze directions, head poses, the number of failures, and game scores as inputs, with subjective game difficulty ratings as labels.
We compared the performance of different machine learning algorithms on the collected dataset and performed ablation experiments to compare the effects of various inputs on prediction accuracy. The results showed that MLP achieved the best results on all evaluation metrics, and combining multiple features yielded the highest performance.
The remainder of the paper is organized as follows: Section 2 reviews studies on difficulty prediction and multimodal fusion. Section 3 introduces the experimental procedure and dataset collection. Section 4 compares the performance of different machine learning methods on two classification tasks. Section 5 discusses the experimental results and points out the limitations. Finally, Section 6 concludes the study.
2. Related Work
2.1. Difficulty Prediction
Some studies use heuristic functions to measure the difficulty faced by the player. These methods typically use success rates, life points, and completion times to calculate game scores and assess difficulty [15]. For example, Van Kreveld et al. [10] presented a method for automatically evaluating the difficulty of puzzle games that combines and weights features from several aspects, including level size, number of balls, and minimum number of moves, to generate a difficulty function; the feature weights are determined by optimizing the function on a set of levels with known difficulties. Ashlock et al. [16] used the average solution time of an evolutionary algorithm and the number of failures to grade the difficulty of Sokoban boards. Kristensen et al. [12] introduced factorization machines (FMs) to predict game difficulty from the attempts observed on early levels and on levels played by other players. FMs, which have been widely used in recommendation systems [17,18], are factorization models that can serve as predictors for tasks such as classification and regression. The experimental results in that study showed the superiority of FMs over random forest (RF). Although these methods can predict game difficulty to some extent, none of them take into account the objective user data generated during interaction, which limits prediction accuracy.
Several studies have begun to utilize user data to measure game difficulty. Blom et al. [19] used an RF classifier to categorize recognized user facial expressions, predicted the difficulty of each game chunk, and adjusted the difficulty of each chunk accordingly. Compared to heuristic methods, this approach is smoother and more efficient (requiring fewer iterations). However, a single facial expression may not accurately reflect the user’s genuine perception of difficulty, resulting in limited enhancement of the game experience. In addition to facial expressions, physiological signals are an alternative chosen by many researchers to evaluate game difficulty, since they objectively respond to changes in users’ physiological conditions. Naumann et al. [13] measured game difficulty using electroencephalogram (EEG) signals: after collecting and processing the EEG signals, the authors trained a linear regression model to predict game difficulty, obtaining high Pearson and Spearman correlations between predicted and actual difficulties and a relatively low RMSE. Girouard et al. [14] collected functional near-infrared spectroscopy (fNIRS) data from users and used sequence classification methods [20] to predict game difficulty. Unlike EEG, fNIRS is non-invasive, portable, and unaffected by user movements; however, it did not perform well in classifying game difficulty (accuracy: 61.1%). Furthermore, Darzi et al. [21] achieved higher accuracy in game difficulty prediction by combining physiological signals, game performance, and personal characteristics.
Some studies predict the difficulty of other tasks. For example, Sakamoto et al. [22] used eye-tracking features to estimate users’ perceived difficulty when reading educational comics. Specifically, the authors first used a head-mounted device to collect multiple types of eye-movement information, including fixations, blinks, and heat maps. A support vector machine (SVM) classification model trained on these features achieved F1 scores of 0.721 and 0.742 in user-dependent and user-independent settings, respectively. The effectiveness of eye-tracking features has also been validated in text difficulty prediction [23,24].
2.2. Multimodal Fusion
Using single-modal data may be inadequate for obtaining promising prediction performance, and some studies suggest combining multimodal data. For example, Kawamura et al. [25] used facial information, head movement, and upper limb pressure to predict learners’ wakefulness levels during a video course. The authors classified learner states into three categories (awake, drowsy, and asleep) and achieved high classification performance with the CatBoost classifier [26]. However, that study used only one classification method and did not compare it with others. Peng et al. [27] combined three modalities (facial, heart rate, and acoustic features) and trained multiple machine learning classifiers (SVM, RF, and MLP) to predict students’ mental states. The results showed that RF outperformed the other methods; moreover, acoustic and heart rate features predicted concentration and confusion more accurately, while combining all three modalities better classified frustration and boredom. Zhou et al. [28] measured the cognitive load of surgeons during surgery by recording multiple physiological signals, including heart rate variability, galvanic skin response, and electroencephalography. They compared different classifiers, modalities, and fusion schemes, and the results again showed that combining multiple modalities can outperform a single modality, with SVM achieving the best performance among the compared methods. Some studies also merge various modalities to achieve high accuracy in emotion recognition [29,30,31].
From the above studies, it is clear that combining multimodal data can compensate for the shortcomings of single-modal data. Although these studies are not directly related to game difficulty prediction, we can apply their multimodal approach to our research. Inspired by them, our study also uses multimodal data to predict user-perceived game difficulty. Unlike them, we use multiple facial cues and game performance instead of physiological signals or eye-tracking data, for the following reasons: (1) it is more convenient to use a camera to collect facial data without disturbing the user; (2) the economic cost is relatively low, since a camera is the only device required; and (3) multiple salient features (i.e., facial expressions, gaze directions, and head poses) can be obtained from a single facial image. Among them, facial expressions reflect one’s inner emotions, while gaze directions and head poses reflect attention. Since game performance is an essential indicator of the player’s game state, we integrate it into the features.
In addition, the above studies have shown the effectiveness of machine learning algorithms in a wide variety of classification tasks. Therefore, this study also employs different machine learning classification models to predict game difficulty. The study sought to answer the following research questions: (1) Which machine learning classifier performs best at game difficulty prediction? (2) Can multimodal facial features combined with game performance be more effective than single-modal features?
3. Methods
3.1. Participants
Twenty-nine participants (13 males, 16 females) aged 22 to 35 years (M = 25.9, SD = 3.3) were recruited online for this research. All of them were right-handed and had normal or corrected-to-normal vision. They were graduate students at Hunan University. At the beginning of the experiment, all subjects were informed of the experimental procedure and asked to fill out a basic information questionnaire covering gender, age, education, average daily game hours, and whether they had played Dino Run or a similar parkour game. On average, they played games for 1.38 h per day (SD = 0.97). All had played Dino Run or similar games; however, none had played the game used in this study, as it was a redesigned and re-developed version of the original. Before the experiment, they voluntarily signed an informed consent form. Upon finishing the experiment, they received a gift in appreciation of their participation.
3.2. Experimental Procedure
Based on previous research [13,21], predicting game difficulty with machine learning models generally requires constructing a dataset with objective features (e.g., physiological signals or facial expressions) as inputs and subjective ratings as labels. Once a classification model is trained on the collected dataset, it can be applied to recognize game difficulty. Therefore, in this study, we first needed to create such a dataset.
When selecting the game for data collection, we mainly considered the following aspects: First, the controls should be simple to avoid a high learning burden. Games used in previous studies, including Tetris [13], Pacman [14,19], and Pong [21], all meet this requirement. Second, the game should contain various modes to elicit different physiological states and facial cues. Third, the game should be able to record and process facial data while the participant plays, which significantly reduces the data-processing workload.
Based on these three aspects, we ultimately chose the popular game Dino Run and modified it to offer multiple modes. When determining the game mode settings, we referred to [21], which used a Pong game to study difficulty prediction and generated nine game modes based on ball speed and paddle size. This study likewise established game modes along two dimensions: dinosaur speed and cactus size. To create multiple difficulty levels, we developed nine modes (I–IX) by combining dinosaur speed (slow, medium, and fast) with cactus size (small, medium, and large), as shown in Table 1. From the table, it can be observed that difficulty increases progressively along each row and column, so different game modes were expected to elicit different facial cues. Subjects were required to play each mode, and their facial cues, game performance, and difficulty ratings for each mode were automatically collected to create the dataset. We then trained several machine learning classifiers on this dataset and compared their performance to obtain the best model.
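As a minimal illustration (our own sketch, not part of the actual game implementation), the nine modes of Table 1 can be enumerated as the Cartesian product of the two difficulty dimensions:

```python
from itertools import product

# Illustrative sketch: enumerate the nine modes of Table 1 as the
# Cartesian product of dinosaur speed and cactus size.
SPEEDS = ["slow", "medium", "fast"]
SIZES = ["small", "medium", "large"]
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]

MODES = {
    roman: {"dinosaur_speed": speed, "cactus_size": size}
    for roman, (speed, size) in zip(ROMAN, product(SPEEDS, SIZES))
}
# MODES["V"] == {"dinosaur_speed": "medium", "cactus_size": "medium"}
```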
The left picture in Figure 1 shows the interface at the start of the game; users can press the space key or click the start button to enter the game. The right picture in Figure 1 shows a screenshot during gameplay.
After the participants arrived at the lab at the appointed time, we introduced the game’s content, rules, and duration. They then voluntarily signed the consent form. Next, participants played one or two random modes to familiarize themselves with the game. All users indicated that they understood the game controls and experimental requirements after this pretest. Subsequently, the formal experiment started.
Figure 2 shows the experimental procedure.
To mitigate order effects, we randomized the order of the game modes before the formal experiment started. The procedure is as follows: participants enter the game interface after clicking the start button or pressing the space key. During gameplay, they tap the space key to jump over cacti. If the user fails to avoid a cactus, the number of failures increases by one and the game continues. Each game lasts 30 s. After completing one mode, the game returns to the initial interface, and users rate the difficulty on a scale from 1 to 9, where 1 means extremely easy, 5 means normal (neither easy nor hard), and 9 means very hard. They then take a break of about 1 to 2 min to prevent fatigue, after which they enter the next mode by pressing the start button again. The entire experiment took about 15 min, and we recorded users’ facial data only while they were playing. We used the number of failures (touching a cactus) and the score as indicators of game performance: the lower the number of failures, the more smoothly the user played that mode.
3.3. Dataset Collection
We use facial expressions, gaze directions, head poses, and game performance to predict game difficulty. Among them, facial expressions can mirror the user’s inner emotions to a certain extent, gaze directions and head poses are related to the user’s attention, and game performance reflects proficiency. Therefore, by combining these data from various modalities, we expect to predict the perceived game difficulty experienced by users during interaction.
We use OpenFace [32], a prominent facial behavior analysis toolkit that has been employed in many studies [11,33,34], to obtain gaze directions and head poses. The gaze directions include pitch and yaw, and the head poses include pitch, yaw, and roll. ResMaskingNET [35], which has achieved promising performance on several publicly available facial expression datasets, is used to detect facial expressions across seven categories: happy, sad, angry, surprise, fear, disgust, and neutral.
Figure 3 shows the facial cues obtained via OpenFace 2.0 (available at https://github.com/TadasBaltrusaitis/OpenFace, last accessed 10 September 2024) and ResMaskingNET (available at https://github.com/phamquiluan/ResidualMaskingNetwork, last accessed 10 September 2024). Participants’ facial expressions, gaze directions, and head poses in each frame during gameplay were saved to a local Excel file for subsequent processing.
After completing each game mode, we recorded the player’s game performance, including the score and the number of failures. Then, we combined the averages of their facial cues and game performance as the final input features. Thus, we obtained 14-dimensional features (facial expression: 7, gaze direction: 2, head pose: 3, game performance: 2). Finally, we built a game difficulty prediction dataset with these features as inputs and the users’ subjective difficulty rating scores as labels.
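To make the feature construction concrete, the following is a minimal sketch of how one dataset sample could be assembled from the per-frame records; the column names are our own placeholders, not the exact field names produced by OpenFace or ResMaskingNET.

```python
import numpy as np
import pandas as pd

# Placeholder column names for the per-frame facial cues.
EXPRESSIONS = ["happy", "sad", "angry", "surprise", "fear", "disgust", "neutral"]
GAZE = ["gaze_pitch", "gaze_yaw"]
HEAD = ["head_pitch", "head_yaw", "head_roll"]

def build_sample(frames: pd.DataFrame, score: float, failures: int) -> np.ndarray:
    """Average the per-frame facial cues over one 30 s game mode and append
    the two game-performance values, giving the 14-dimensional input
    (7 expressions + 2 gaze + 3 head pose + 2 performance)."""
    facial = frames[EXPRESSIONS + GAZE + HEAD].mean().to_numpy()
    return np.concatenate([facial, [score, failures]])
```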
Figure 4 shows an example of generating a dataset sample from a test video containing n frames. The first column shows the facial cues computed in each frame, the second column displays the averages of the facial features and the game performance, and the last column shows the subjective difficulty rating and the corresponding difficulty category. The mapping rules between subjective difficulty ratings and difficulty categories are described in detail in Section 3.5.
Since we integrated the facial data processing module into the game, the acquired facial cues do not suffer from the desynchronization problems that affect physiological signal-based methods.
3.4. Subjective Difficulty Ratings
Figure 5 shows the subjective difficulty rating scores for each game mode. When the dinosaur speed is slow, the larger the cactus size, the more difficult users perceive the game to be. When the dinosaur speed is medium, there is no significant difference in subjective scores between medium and large cactus sizes. When the dinosaur speed is fast, the subjective difficulty ratings in all three modes exceed six and are almost identical. These results indicate that when the dinosaur speed is low, cactus size has a significant effect on game difficulty, but as the dinosaur speed increases, the effect of cactus size diminishes.
Subsequently, we used SPSS Statistics 24.0 (IBM Corp., Armonk, NY, USA) to analyze the inter-rater reliability of the subjective difficulty ratings. Kendall’s W test was statistically significant (p < 0.001), indicating that the subjective ratings were consistent, and Kendall’s coefficient of concordance reached W = 0.628, further revealing that the data were reliable with a relatively strong level of agreement.
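Kendall’s W can also be computed directly from the ratings matrix; a sketch, assuming a subjects × modes array of scores, via the Friedman statistic and the identity W = χ²_F / (n(k − 1)) for n raters and k items:

```python
import numpy as np
from scipy import stats

def kendalls_w(ratings: np.ndarray) -> tuple[float, float]:
    """ratings: (n_subjects, n_modes) array of 1-9 difficulty scores,
    e.g., 29 x 9 in this study. Returns (W, p-value)."""
    n, k = ratings.shape
    # Friedman test across the k game modes (one sample per column).
    chi2, p = stats.friedmanchisquare(*ratings.T)
    # Kendall's coefficient of concordance from the Friedman statistic.
    w = chi2 / (n * (k - 1))
    return w, p
```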
3.5. Classification Tasks
Based on previous research [13,14,21,22], game difficulty is generally divided into two or three categories. We therefore conduct two tasks to compare the performance of the various machine learning methods. Table 2 and Table 3 define the classes for the two-class and three-class classification tasks; the tasks and the rating ranges in each class are described below.
Two-class classification: The input data were classified as “simple” or “hard” according to perceived game difficulty. The rating ranges for each class in Table 2 were defined manually based on the distribution of all subjects’ difficulty ratings, ensuring that the sample sizes of the two classes were approximately equal; Table 2 confirms that the numbers of samples in the two classes are nearly identical.
Three-class classification: The input data were classified as “simple”, “normal”, or “hard” according to perceived game difficulty. In this task, we divided the rating scores into three equal intervals. Table 3 presents the number of samples in each class; the “normal” class contains the most samples.
3.6. Evaluation Metrics
We measure the classification performance of these machine learning methods using widely used metrics: F1-score, accuracy, precision, and recall. These metrics range from 0 to 1, with larger values (close to 1) indicating better performance. In addition, the confusion matrix is used to visualize the classification results.
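All of these metrics are available in scikit-learn; a sketch of the evaluation step, assuming macro-averaging for the multi-class case (the paper does not state the averaging scheme):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    """Compute the four metrics plus the confusion matrix."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro")
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy_score(y_true, y_pred),
            "confusion_matrix": confusion_matrix(y_true, y_pred)}
```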
4. Experimental Results
We divided the collected dataset into a training set and a validation set at a ratio of 4:1. We then trained several machine learning classifiers on the training set: support vector machine (SVM), k-nearest neighbors (KNN), Naïve Bayes, decision tree (DT), random forest (RF), gradient boosting (GB), AdaBoost (AB), and multilayer perceptron (MLP). These methods were implemented using scikit-learn [36] and trained with default parameters. The following sections present the comparison results of these methods on the validation set.
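A sketch of this training setup is shown below. The split seed, stratification, and the Gaussian variant of Naïve Bayes are our assumptions; the paper specifies only the 4:1 ratio and default parameters.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neural_network import MLPClassifier

CLASSIFIERS = {
    "SVM": SVC(), "KNN": KNeighborsClassifier(), "Bayes": GaussianNB(),
    "DT": DecisionTreeClassifier(), "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(), "AB": AdaBoostClassifier(),
    "MLP": MLPClassifier(),
}

def compare_classifiers(X, y, seed=42):
    """Train each classifier with default parameters on a 4:1 split
    and return the validation accuracy per method."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    return {name: clf.fit(X_tr, y_tr).score(X_va, y_va)
            for name, clf in CLASSIFIERS.items()}
```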
4.1. Comparison of Performance on the Two-Class Classification Task
Table 4 reports the performance of these methods on the two-class classification task. Since the discrepancies among the four evaluation metrics within each classifier are relatively small, we do not present the results graphically. From the table, we observe that MLP achieved the highest results on all evaluation metrics, with all values over 0.85. KNN and GB performed slightly worse than MLP, ranking second and third, respectively. AB achieved the poorest performance, with all metrics below 0.7. We further use confusion matrices to analyze the per-class performance of these methods.
Figure 6 shows the confusion matrices of the top three methods: MLP, KNN, and GB. We observe that MLP performs well in predicting the “simple” class but is mediocre on the “hard” class, whereas KNN and GB achieve higher accuracy on the “hard” class. MLP’s weaker performance on the “hard” class may be due to the following: 1. Since we used the default parameters for the MLP classifier, the model structure might be too complicated for the two-class task, leading to overfitting on the dataset. 2. MLP is sensitive to the choice of initial weights and biases, which may cause the model to become stuck in local optima or diverge during training [37,38]. In future studies, we will further explore the effects of different training parameters, the number of neurons, and the number of hidden layers on classification performance.
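One way to carry out that exploration is a grid search over the MLP’s structure and training parameters; a sketch with illustrative grid values (not values used in this study):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50), (100, 50)],
    "alpha": [1e-4, 1e-3, 1e-2],        # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=42),
                      param_grid, cv=5, scoring="f1_macro")
# search.fit(X_train, y_train); search.best_params_ then gives the
# configuration to compare against the default MLP.
```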
4.2. Comparison of Performance on the Three-Class Classification Task
Figure 7 reports the performance of these methods on the three-class classification task. MLP achieves the highest results on all four evaluation metrics, with values close to 0.8, while those of the other methods are below 0.7. This indicates that MLP is more suitable than the other methods for predicting game difficulty in the three-class task. Surprisingly, KNN, which achieved satisfactory performance on the two-class task, does not perform well on the three-class task, while AB, which performed worst on the two-class task, ranks second. These results imply that the same method can perform differently on different tasks.
Figure 8 presents the confusion matrices of the top three methods: MLP, AB, and GB. The accuracy of all three methods in predicting the “simple” class is inferior to that of the other two classes. This disparity could be due to the small sample size of that class, as an imbalanced dataset significantly restricts a model’s estimation performance [39]. MLP achieves high prediction accuracy (exceeding 80%) on both the “normal” and “hard” categories, and although its accuracy on the “simple” class is only 0.714, it still outperforms AB and GB by a large margin. These results reveal the advantage of MLP in handling complex tasks.
4.3. Comparison with Different Data Modalities on the Three-Class Task
From the above results on the two-class and three-class tasks, we observe that the latter task is more difficult. Therefore, we used the three-class task to conduct an extra experiment examining how different input data modalities affect classification performance. Since MLP demonstrated better performance on these tasks, we used it as the classifier.
First, we compare four types of single-modality input: FE, GD, HP, and GP, where FE denotes the intensities of facial expressions, GD denotes gaze directions, HP denotes head poses, and GP denotes game performance. We adjusted only the number of neurons in the input layer and kept the other parameters unchanged. Rows 1–4 in Figure 9 show the results of MLP with different single-modality inputs; GP is more accurate than the other modalities. Therefore, in the next step, we use GP as the baseline feature and integrate it with the other modalities, comparing three fused feature sets: GP + FE, GP + FE + GD, and GP + FE + GD + HP. Rows 5–7 in Figure 9 report the results of MLP with these fused inputs. We observe that prediction accuracy increases as the feature dimension grows, and fusing all four modalities achieves the best results on all four metrics. These results demonstrate the effectiveness of combining multiple data modalities to predict game difficulty.
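A sketch of this ablation, assuming the 14-dimensional feature layout from Section 3.3 (the column indices are our own) and reusing the split described above:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Assumed column indices of the 14-dim feature vector.
FE = list(range(0, 7))   # facial expression intensities
GD = [7, 8]              # gaze pitch, yaw
HP = [9, 10, 11]         # head pitch, yaw, roll
GP = [12, 13]            # score, number of failures

ABLATIONS = {"FE": FE, "GD": GD, "HP": HP, "GP": GP,
             "GP+FE": GP + FE, "GP+FE+GD": GP + FE + GD,
             "GP+FE+GD+HP": GP + FE + GD + HP}

def run_ablation(X, y, seed=42):
    """Retrain the MLP on each feature subset (X is a NumPy array);
    only the input width varies between runs."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    return {name: MLPClassifier(random_state=seed)
                  .fit(X_tr[:, cols], y_tr)
                  .score(X_va[:, cols], y_va)
            for name, cols in ABLATIONS.items()}
```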
5. Discussion
Machine learning methods such as MLP, SVM, and DT have been applied in many fields, such as predicting usage intentions for websites [40], stress recognition [41], and user satisfaction prediction [42]. This paper uses machine learning classifiers to predict user-perceived game difficulty by learning representations from different user facial cues and game performance. The experimental results show that MLP achieved better performance than the other classifiers on both the two-class and three-class classification tasks. Unlike the traditional classifiers, MLP contains multiple hidden layers and nonlinear activation functions, which enable it to model complicated nonlinear interactions between inputs and outputs and handle more complex tasks. This explains its superior performance in our study, especially on the three-class classification task with an imbalanced dataset. Classifiers such as SVM and DT, although they have achieved promising results in previous studies, did not perform well on the three-class task. These results indicate that different methods suit different classification tasks.
Blom et al. [19] also used a camera-based approach to capture user facial expressions and predict the game difficulty perceived by the user. However, using only facial expressions may not be sufficient: we found that some participants’ facial expression changes were barely visible during the experiment, perhaps because the game is relatively simple and does not evoke strong emotions, or because they deliberately controlled their expressions. Multimodal fusion gathers information from multiple dimensions and thus complements unimodal data. The classification performance of MLP improved when combining multimodal data, which is consistent with previous studies [21,25,27,28]. Unlike those studies, which required different hardware devices to collect user data (e.g., physiological signals, eye movements) in a restricted laboratory setting, our method acquires multiple modalities from facial images recorded by a single camera, allowing data collection anywhere without affecting the user’s gaming experience. Furthermore, although physiological signal methods are also objective, their data acquisition and synchronization processes are more complicated than ours.
Our study is based on a customized single-player parkour game. It is worth pointing out that the proposed framework could also apply to other game types, such as puzzle or action games. Capturing facial cues during gameplay is easy, since most users’ computers are equipped with front-facing cameras. However, it would be inappropriate and inaccurate to directly apply the model trained in this study to another game unless the two games are similar in operation, mechanics, game settings, etc. User experience differs across games, and users can easily be influenced by various aspects, including game content, graphics, and other players, so their facial cues and subjective feelings would differ from those in our study. For other games, the dataset should be recollected and the machine learning models retrained as we did in this study. In the future, we will consider using multiple games to build a broader dataset and improve the generalization ability of the prediction model. Fusing the strengths of various models to develop new machine learning models is also a direction worth pursuing.
Several limitations of this paper should be acknowledged. First, the participants in this study were graduate students and game enthusiasts from our university, which may introduce bias and limit the validity of the results. In the future, we will recruit participants with diverse gaming backgrounds, ages, and cultural experiences to improve the representativeness and robustness of our method. Second, the game we used is simple to operate, requiring users only to press the space bar. Some users suggested that adding game components such as gold coins or skills might make the game more engaging. Third, we set and adjusted the game difficulty along only two dimensions: dinosaur speed and cactus size. Realistic difficulty settings in computer games involve more dimensions, such as new obstacles, power-ups, and skill-based mechanics. In a future study, we will consider adding these elements to construct a more realistic representation of difficulty and validate the generalization ability of the machine learning classifiers in more games. Lastly, using a camera to collect facial data may limit the motion range of the user’s head. The camera is typically fixed to the computer and has a limited field of view, so users must keep their heads within range to prevent the loss of facial data; however, during the experiment, we found that some users moved out of range unconsciously. In future experiments, we could add an additional camera to capture the user’s face from multiple directions.
6. Conclusions
In this paper, we propose a non-intrusive, low-cost, and efficient method for predicting game difficulty. The method requires only one camera to capture the user’s facial data during gameplay and utilizes different computer vision methods to simultaneously detect facial expressions, gaze directions, and head poses. We built a dataset by combining these three modalities with game performance as input features and the user’s subjective difficulty ratings as labels. Several machine learning models were trained and compared using this dataset. The experimental results show that MLP performs better than other methods in predicting game difficulty. In addition, we explored the effect of different inputs on classification performance, and the results indicate that the combination of multimodal features can effectively improve accuracy.
Author Contributions
Conceptualization, L.Y. and H.Z.; methodology, L.Y., H.Z. and R.H.; software, L.Y. and H.Z.; validation, L.Y. and H.Z.; formal analysis, L.Y. and H.Z.; investigation, L.Y. and H.Z.; resources, L.Y., H.Z. and R.H.; data curation, L.Y. and H.Z.; writing—original draft preparation, L.Y.; writing—review and editing, L.Y. and H.Z.; supervision, R.H.; funding acquisition, L.Y., H.Z. and R.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the China Scholarship Council (CSC, No. 202306130013 and 202306130012).
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the School of Design, Hunan University (No. 2003006, 1 September 2023).
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the study’s design; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Tekinbas, K.S.; Zimmerman, E. Rules of Play: Game Design Fundamentals; MIT Press: Cambridge, MA, USA, 2003.
- Li, J.; Lu, H.; Wang, C.; Ma, W.; Zhang, M.; Zhao, X.; Qi, W.; Liu, Y.; Ma, S. A Difficulty-Aware Framework for Churn Prediction and Intervention in Games. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 943–952.
- Gallego-Durán, F.J.; Molina-Carmona, R.; Llorens-Largo, F. Measuring the difficulty of activities for adaptive learning. Univers. Access Inf. Soc. 2018, 17, 335–348.
- Constant, T.; Levieux, G.; Buendia, A.; Natkin, S. From objective to subjective difficulty evaluation in video games. In Proceedings of the Human-Computer Interaction—INTERACT 2017: 16th IFIP TC 13 International Conference, Mumbai, India, 25–29 September 2017; Proceedings, Part II 16. pp. 107–127.
- Rao Fernandes, W.; Levieux, G. Difficulty Pacing Impact on Player Motivation. In Proceedings of the International Conference on Entertainment Computing, Bremen, Germany, 1–3 November 2022; pp. 140–153.
- Allart, T.; Levieux, G.; Pierfitte, M.; Guilloux, A.; Natkin, S. Difficulty influence on motivation over time in video games using survival analysis. In Proceedings of the 12th International Conference on the Foundations of Digital Games, New York, NY, USA, 14–17 August 2017; pp. 1–6.
- Constant, T.; Levieux, G. Dynamic difficulty adjustment impact on players’ confidence. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 4–9 May 2019; pp. 1–12.
- Caroux, L.; Mouginé, A. Influence of visual background complexity and task difficulty on action video game players’ performance. Entertain. Comput. 2022, 41, 100471.
- Ermi, L.; Mäyrä, F. Fundamental components of the gameplay experience: Analysing immersion. In Proceedings of the DiGRA Conference, Vancouver, BC, Canada, 16–20 June 2005.
- Van Kreveld, M.; Löffler, M.; Mutser, P. Automated puzzle difficulty estimation. In Proceedings of the 2015 IEEE Conference on Computational Intelligence and Games (CIG), Tainan, Taiwan, 31 August–2 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 415–422.
- Santoso, K.; Kusuma, G.P. Face recognition using modified OpenFace. Procedia Comput. Sci. 2018, 135, 510–517.
- Kristensen, J.T.; Guckelsberger, C.; Burelli, P.; Hämäläinen, P. Personalized Game Difficulty Prediction Using Factorization Machines. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, Bend, OR, USA, 29 October–2 November 2022; pp. 1–13.
- Naumann, L.; Schultze-Kraft, M.; Dähne, S.; Blankertz, B. Prediction of difficulty levels in video games from ongoing EEG. In Proceedings of the Symbiotic Interaction: 5th International Workshop, Padua, Italy, 29–30 September 2016; pp. 125–136.
- Girouard, A.; Solovey, E.T.; Hirshfield, L.M.; Chauncey, K.; Sassaroli, A.; Fantini, S.; Jacob, R.J. Distinguishing difficulty levels with non-invasive brain activity measurements. In Proceedings of the Human-Computer Interaction—INTERACT 2009: 12th IFIP TC 13 International Conference, Uppsala, Sweden, 24–28 August 2009; Proceedings, Part I 12. pp. 440–452.
- Zohaib, M. Dynamic difficulty adjustment (DDA) in computer games: A review. Adv. Hum. Comput. Interact. 2018, 2018, 1–12.
- Ashlock, D.; Schonfeld, J. Evolution for automatic assessment of the difficulty of sokoban boards. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
- Hong, F.; Huang, D.; Chen, G. Interaction-aware factorization machines for recommender systems. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3804–3811.
- Rendle, S. Factorization machines with libFM. ACM Trans. Intell. Syst. Technol. 2012, 3, 1–22.
- Blom, P.M.; Bakkes, S.; Spronck, P. Modeling and adjusting in-game difficulty based on facial expression analysis. Entertain. Comput. 2019, 31, 100307.
- Dietterich, T.G. Machine learning for sequential data: A review. In Proceedings of the Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, ON, Canada, 6–9 August 2002; pp. 15–30.
- Darzi, A.; Wondra, T.; McCrea, S.; Novak, D. Classification of multiple psychological dimensions in computer game players using physiology, performance, and personality characteristics. Front. Neurosci. 2019, 13, 1278.
- Sakamoto, K.; Shirai, S.; Takemura, N.; Orlosky, J.; Nagataki, H.; Ueda, M.; Uranishi, Y.; Takemura, H. Subjective Difficulty Estimation of Educational Comics Using Gaze Features. IEICE Trans. Inf. Syst. 2023, 106, 1038–1048.
- Lima Sanches, C.; Augereau, O.; Kise, K. Estimation of reading subjective understanding based on eye gaze analysis. PLoS ONE 2018, 13, e0206213.
- Parikh, S.S.; Kalva, H. Feature weighted linguistics classifier for predicting learning difficulty using eye tracking. ACM Trans. Appl. Percept. 2020, 17, 1–25.
- Kawamura, R.; Shirai, S.; Aizadeh, M.; Takemura, N.; Nagahara, H. Estimation of wakefulness in video-based lectures based on multimodal data fusion. In Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2020 ACM International Symposium on Wearable Computers, Virtual, 12–17 September 2020; pp. 50–53.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. arXiv 2017, arXiv:1706.09516.
- Peng, S.; Nagao, K. Recognition of students’ mental states in discussion based on multimodal data and its application to educational support. IEEE Access 2021, 9, 18235–18250.
- Zhou, T.; Cha, J.S.; Gonzalez, G.; Wachs, J.P.; Sundaram, C.P.; Yu, D. Multimodal physiological signals for workload prediction in robot-assisted surgery. ACM Trans. Hum.-Robot. Interact. 2020, 9, 1–26.
- Jia, N.; Zheng, C.; Sun, W. A multimodal emotion recognition model integrating speech, video and MoCAP. Multimed. Tools Appl. 2022, 81, 32265–32286.
- Zheng, C.; Wang, C.; Jia, N. Emotion recognition model based on multimodal decision fusion. J. Phys. Conf. Ser. 2021, 1873, 012092.
- Zhao, Z.; Wang, Y.; Xu, Y.; Zhang, J. TDFNet: Transformer-based Deep-scale Fusion Network for Multimodal Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3771–3782.
- Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.P. OpenFace 2.0: Facial behavior analysis toolkit. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition, Xi’an, China, 15–19 May 2018; pp. 59–66.
- Araluce, J.; Bergasa, L.M.; Ocaña, M.; López-Guillén, E.; Revenga, P.A.; Arango, J.F.; Pérez, O. Gaze focalization system for driving applications using OpenFace 2.0 toolkit with NARMAX algorithm in accidental scenarios. Sensors 2021, 21, 6262.
- Akram, A.; Khan, N. SARGAN: Spatial Attention-based Residuals for Facial Expression Manipulation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5433–5443.
- Pham, L.; Vu, T.H.; Tran, T.A. Facial expression recognition using residual masking network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4513–4519.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Duchesnay, É. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Let a biogeography-based optimizer train your multi-layer perceptron. Inf. Sci. 2014, 269, 188–209.
- Ojha, V.K.; Abraham, A.; Snášel, V. Metaheuristic design of feedforward neural networks: A review of two decades of research. Eng. Appl. Artif. Intell. 2017, 60, 97–116.
- Li, Q.; Zhao, C.; He, X.; Chen, K.; Wang, R. The impact of partial balance of imbalanced dataset on classification performance. Electronics 2022, 11, 1322.
- Cao, Y.; Ding, Y.; Proctor, R.W.; Duffy, V.G.; Liu, Y.; Zhang, X. Detecting users’ usage intentions for websites employing deep learning on eye-tracking data. Inf. Technol. Manag. 2021, 22, 281–292.
- Liapis, A.; Katsanos, C.; Karousos, N.; Xenos, M.; Orphanoudakis, T. User experience evaluation: A validation study of a tool-based approach for automatic stress detection using physiological signals. Int. J. Hum.-Comput. Interact. 2021, 37, 470–483.
- Koonsanit, K.; Hiruma, D.; Yem, V.; Nishiuchi, N. Using Random Ordering in User Experience Testing to Predict Final User Satisfaction. Informatics 2022, 9, 85.
Figure 1. Game interface.
Figure 2. Experimental procedure.
Figure 3. Facial cues obtained via OpenFace and ResMaskingNET. Here, happiness indicates the estimated expression category, the blue cube is the head pose direction, the red dots are facial landmarks, and the green lines denote the gaze directions.
Figure 4. Facial cue processing and dataset construction. The values in the figure are only used to explain the process and assist understanding. The check mark indicates the corresponding difficulty category of the subjective rating in the two-class or three-class classification tasks.
Figure 5. Users’ subjective difficulty ratings for each game mode.
Figure 6. Confusion matrices of MLP, KNN, and GB on the two-class classification task.
Figure 7. Performance comparison on the three-class classification task.
Figure 8. Confusion matrices of MLP, AB, and GB on the three-class classification task.
Figure 9. Comparison of MLP with different input data modalities. Here, FE represents facial expression, GD is gaze direction, HP indicates head pose, and GP is game performance. Bold represents the best.
Table 1. Game mode settings.

| Dinosaur Speed \ Cactus Size | Small | Medium | Large |
|---|---|---|---|
| Slow | I | II | III |
| Medium | IV | V | VI |
| Fast | VII | VIII | IX |
Table 2. The definition of game difficulty classes for the two-class classification task.

| Class | Rating Range | Samples |
|---|---|---|
| Simple | 1–5 | 123 |
| Hard | 6–9 | 137 |
Table 3. The definition of game difficulty classes for the three-class classification task.

| Class | Rating Range | Samples |
|---|---|---|
| Simple | 1–3 | 57 |
| Normal | 4–6 | 120 |
| Hard | 7–9 | 83 |
Table 4. Performance comparison on the two-class classification task. Bold represents the best.

| Classifier | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|
| SVM | 0.833 | 0.830 | 0.831 | 0.833 |
| KNN | 0.861 | 0.851 | 0.854 | 0.857 |
| Bayes | 0.714 | 0.716 | 0.714 | 0.714 |
| DT | 0.693 | 0.695 | 0.690 | 0.690 |
| RF | 0.811 | 0.803 | 0.806 | 0.810 |
| GB | 0.833 | 0.830 | 0.831 | 0.833 |
| AB | 0.688 | 0.685 | 0.686 | 0.690 |
| MLP | 0.865 | 0.865 | 0.857 | 0.857 |