1. Introduction
Chinese characters carry Chinese history and culture and embody a rich cultural heritage. Standard stroke order is the most basic part of the writing process, yet it is also the part most easily overlooked. Mastering the standard stroke order not only helps correct poor writing habits but also suits the physiological movement of the wrist, making written characters better proportioned, more balanced, and more attractive, and improving writing speed to some extent [1]. According to statistics, among the more than 260,000 contestants in the competition Brush and Ink in China, sponsored by the Ministry of Education of China in 2020 and 2021, 79.24% had problems with stroke order. However, because Chinese characters are typical morphemic characters [2], their total number is very large, and it is very difficult for learners to learn and correct stroke order character by character. Therefore, how to learn the standard stroke order effectively and correct wrong stroke order is an important research topic in sustainable education.
Research has shown that computer technology used in the teaching process can provide students with new learning experiences and more effective learning, which can lead to sustainable education. For example, Dillenbourg P et al. [3] investigated how MOOC study groups watch videos together under different configurations. The results show that watching MOOCs in groups provides a highly satisfying learning experience because learners feel connected and can interact with one another, which suggests that collaborative learning supported by computer technology can increase students’ sense of participation and improve learning efficiency. Dillenbourg P et al. [4] also captured students’ behavioral patterns by analyzing sequential interaction logs, which enabled more effective and personalized support during the learning process. Troussas C et al. [5] presented a fully operational and evaluated adaptive and intelligent e-learning system for second language acquisition, which provided each student with a unique educational experience. Furthermore, the inference system used the knowledge inference relationships between learning objects to create a personalized learning environment for each student.
It can be inferred from these studies that the successful experiences of similar learners can be shared to shorten the learning path of new learners. In addition, by mining the correlations between knowledge points, discrete knowledge points can be merged into integrated ones, which raises knowledge density and reduces the learning load. These two points help to establish an effective and personalized learning system. Following this idea, we take the learning of Chinese character stroke order as a case study of a personalized e-learning system and use artificial intelligence technology to improve the efficiency of learning the standard stroke order and correcting wrong stroke order.
Chinese character forms consist of a limited number of basic strokes and components, so the writing of different characters is necessarily related. Such relationships can be obtained through association rule mining. The experience of Chinese character writers can be shared and learned from, and the writing experience of a group can be passed on through collaborative filtering. Association rule mining is a typical data mining technique that has been widely used in computer-assisted education. Ding Jihong et al. [6] achieved accurate recommendation of learning resources based on association rules in a big data environment, enhancing the online learning experience. Zhang L et al. [7] applied association rule mining to teaching information management in universities, showing that data mining is helpful for teaching management. However, there is a paucity of research on incorporating association rules into the teaching of Chinese character writing. Collaborative filtering is one of the core techniques used in recommendation systems; it works mainly by calculating preference information among similar users and then predicting what other users might be interested in [8].
As a typical case study, this paper addresses the need for efficient correction of Chinese character stroke order. It forms an error-prone character set by mining the correlations between error-prone Chinese characters and recommends exercises through an optimized collaborative filtering algorithm, yielding a systematic stroke order correction method and system. To verify the effectiveness of the personalized stroke order correction algorithm, we conducted dedicated tests and effect evaluations on two different groups of learners. The experiments show that our method can effectively achieve personalized correction of Chinese character stroke order and is effective in teaching standard stroke order to different groups, which can advance sustainable education.
2. Personalized Chinese Character Stroke Order Correction Algorithm
Based on the massive writing records of learners and on data mining technology, this paper constructs a personalized Chinese character stroke order correction algorithm. As shown in Figure 1, the algorithm is divided into two stages. The first stage constructs an error-prone Chinese character set library using the Apriori algorithm with the lift measure, which provides the basis for the error-prone character recommendation in the second stage. The second stage applies learner-based collaborative filtering with inverse item frequency to the error-prone Chinese character set library and recommends effective experience to learners by calculating their similarity. In short, association rules are used to explore the potential relationships between Chinese characters, and the error-prone Chinese character set library is summarized on that basis. Then, a user-based personalized recommendation model for error-prone Chinese characters is built on the improved collaborative filtering algorithm, with learners and the error-prone character set at its core, completing the personalized Chinese character stroke order correction algorithm.
2.1. Apriori Algorithm with Lift Measure
As a data mining algorithm based on association rules [9], the Apriori algorithm can extract valuable information from the writing records of different learners and reflect the writing situation of most learners. It is an important means of summarizing the types of error-prone Chinese characters and then generating the error-prone Chinese character set table.
Assume that the error-prone Chinese character data set D contains all the incorrectly written characters in the database. A non-empty item set Q represents one learner's writing record, that is, an item set composed of several Chinese characters [10]. Let X and Y be two error-prone Chinese character sets in a learner writing record Q, with $X \subset Q$ and $Y \subset Q$. If $X \neq \emptyset$, $Y \neq \emptyset$, and $X \cap Y = \emptyset$, then $X \Rightarrow Y$ constitutes an error-prone Chinese character association rule in D. The effectiveness of association rules in the Apriori algorithm is usually measured by support and confidence [11]. Support is the proportion of records in the pre-processed writing record dataset C in which the characters X and Y appear together, denoted support($X \Rightarrow Y$), as shown in Equation (1). Confidence is the ratio of the number of records containing both X and Y to the number of records containing X in the pre-processed dataset C, denoted confidence($X \Rightarrow Y$), as shown in Equation (2):

$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) = \frac{\sigma(X \cup Y)}{|C|} \qquad (1)$$

$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\sigma(X \cup Y)}{\sigma(X)} \qquad (2)$$

where $\sigma(X \cup Y)$ is the number of records in which the characters X and Y occur together, $\sigma(X)$ is the number of records in dataset C that contain the characters X, and $|C|$ is the total number of records in C.
The traditional Apriori algorithm filters rules with two evaluation indexes, support and confidence, and many of the association rules it mines are invalid. To address the shortcomings of the support–confidence framework, we introduce lift [12,13] to further filter the mined association rules. Lift is the ratio of the probability that character Y occurs given that character X occurs to the overall probability that character Y occurs, and it reflects the correlation between X and Y, as shown in Equation (3):

$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{P(Y)} \qquad (3)$$

where $P(Y)$ is the proportion of records containing character Y in the whole data set D. The value range of lift is $[0, +\infty)$. When the lift is greater than 1, the appearance of character X promotes the appearance of character Y, and the rule is called a positive correlation rule. When the lift is equal to 1, the occurrences of characters X and Y are independent events, and the rule is called an irrelevant rule. When the lift is less than 1, the occurrence of character X reduces the probability that character Y occurs, and the rule is called a negative correlation rule.
Therefore, the Apriori algorithm with the lift measure can extract meaningful association rules from the massive Chinese writing records and summarize them into the error-prone Chinese character set table, providing support for the subsequent recommendation of error-prone Chinese characters.
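The three measures in Section 2.1 can be illustrated with a minimal sketch. It is not the paper's implementation: it mines only single-character rules $X \Rightarrow Y$ rather than running the full level-wise Apriori search, and the function name `mine_pair_rules`, the thresholds, and the toy records are all assumptions chosen for illustration.

```python
from itertools import combinations


def mine_pair_rules(records, min_support=0.4, min_confidence=0.5, min_lift=1.0):
    """Return (x, y, support, confidence, lift) for single-character rules x => y.

    Simplification (assumption): only pairs of characters are considered,
    whereas Apriori proper grows frequent item sets level by level.
    """
    n = len(records)
    item_count = {}  # number of records containing each character
    pair_count = {}  # number of records containing each unordered pair
    for record in records:
        chars = set(record)
        for ch in chars:
            item_count[ch] = item_count.get(ch, 0) + 1
        for pair in combinations(sorted(chars), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / item_count[x]
            lift = confidence / (item_count[y] / n)
            # Keep only positively correlated rules (lift > 1 by default).
            if confidence >= min_confidence and lift > min_lift:
                rules.append((x, y, support, confidence, lift))
    # Strongest correlation first.
    return sorted(rules, key=lambda r: -r[4])


# Toy writing records (hypothetical): each list holds one learner's miswritten characters.
records = [
    ["怀", "情", "龙"],
    ["怀", "情"],
    ["龙", "为"],
    ["怀", "情", "为"],
    ["龙", "为", "之"],
]
rules = mine_pair_rules(records)
for x, y, s, c, l in rules:
    print(f"{x} => {y}: support={s:.2f}, confidence={c:.2f}, lift={l:.2f}")
```

On these toy records, the rules linking “怀” and “情” rank first because the two characters always occur together, while the lift filter discards pairs whose co-occurrence is no more frequent than chance.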
2.2. Learner-Based Collaborative Filtering-Inverse Item Frequency
By calculating the similarity between learners, the characters that similar learners wrote incorrectly can be recommended to the current learner; thus, the experience of other learners can be used effectively to enhance learning efficiency.
First, the number of errors each learner makes on each Chinese character is counted. In general, the more often a learner miswrites a character, the more error-prone that character is for the learner. On this basis, the scoring matrix of learners for Chinese characters is established. By calculating the similarity between learners, the nearest-neighbor set of the current learner is built, and the characters are ranked by how error-prone they are for the learners in that set, which yields the error-prone characters recommended to the current learner. Jaccard similarity [14] and cosine similarity [15] can be used to calculate the similarity between learners, as shown in Equations (4) and (5):

$$w_{uv} = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} \qquad (4)$$

$$w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}} \qquad (5)$$

where $N(u)$ is the set of Chinese characters written incorrectly by learner u, and $N(v)$ is the set of Chinese characters written incorrectly by learner v.
However, neither of the two similarity measures above can avoid the influence of handwriting errors on high-frequency Chinese characters. That is, many learners share errors on the same common error-prone characters, which inflates their apparent similarity. Therefore, the similarity measure needs to be improved. When two learners make the same writing errors on certain low-frequency characters, this is more indicative of similarity between them. We therefore introduce user-based Collaborative Filtering-Inverse Item Frequency (UserCF-IIF) into the similarity calculation [14], penalizing the effect that common error-prone characters shared by learners have on similarity. The improved Jaccard similarity and cosine similarity formulas are shown in Equations (6) and (7):

$$w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log\left(1 + |N(i)|\right)}}{|N(u) \cup N(v)|} \qquad (6)$$

$$w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log\left(1 + |N(i)|\right)}}{\sqrt{|N(u)|\,|N(v)|}} \qquad (7)$$

where $N(u)$ and $N(v)$ are the sets of Chinese characters written incorrectly by learners u and v, and $|N(i)|$ indicates the number of learners who wrote the error-prone Chinese character i incorrectly. Finally, learners are ranked by similarity, and the relevant error-prone Chinese characters are recommended accordingly.
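The similarity and recommendation steps of Section 2.2 can be sketched as follows. This is a hedged illustration rather than the authors' code: the penalty $1/\log(1 + |N(i)|)$ is the form commonly used for UserCF-IIF and is assumed here, and the function names, the neighbor count `k`, the list length `n`, and the toy error sets are hypothetical.

```python
import math
from collections import Counter


def char_popularity(errors):
    """Count how many learners miswrote each character (|N(i)| in the text)."""
    return Counter(ch for chars in errors.values() for ch in chars)


def iif_similarity(errors, popularity, u, v, kind="jaccard"):
    """Jaccard or cosine similarity with the inverse-item-frequency penalty."""
    nu, nv = errors[u], errors[v]
    # Shared errors on rare characters contribute more than shared common ones.
    weight = sum(1.0 / math.log(1 + popularity[ch]) for ch in nu & nv)
    if kind == "jaccard":
        denom = len(nu | nv)
    else:  # cosine
        denom = math.sqrt(len(nu) * len(nv))
    return weight / denom if denom else 0.0


def recommend(errors, u, k=2, n=3):
    """Recommend up to n characters miswritten by the k learners most similar to u."""
    popularity = char_popularity(errors)
    sims = {v: iif_similarity(errors, popularity, u, v) for v in errors if v != u}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    scores = Counter()
    for v in neighbours:
        for ch in errors[v] - errors[u]:  # only characters u has not yet miswritten
            scores[ch] += sims[v]
    return [ch for ch, _ in scores.most_common(n)]


# Hypothetical learner -> set of miswritten characters.
errors = {
    "u1": {"之", "龙", "怀"},
    "u2": {"之", "龙", "拢"},
    "u3": {"之", "情"},
    "u4": {"之", "怀", "情"},
}
pop = char_popularity(errors)
print(iif_similarity(errors, pop, "u1", "u2"), iif_similarity(errors, pop, "u1", "u3"))
print(recommend(errors, "u1"))
```

In this toy data every learner miswrites “之”, so sharing it says little about similarity; learners u1 and u2 are judged closer than u1 and u3 mainly because they also share the rarer “龙”.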
3. Experimental Results and Analysis of Association Rules and Collaborative Filtering
This section takes the writing records of the competition Brush and Ink in China as the data source. For example, one participant’s record of writing errors is “连, 迈, 莲, 房, and 剪”. The data source and pre-processing methods used to extract more effective data are introduced in detail in Section 3.1. In Section 3.2, the improved association rule algorithm is used to mine and analyze typical Chinese characters, the correlation strength between different Chinese characters and error types is calculated, and the error-prone Chinese character set library is summarized. In Section 3.3, a comparison with several traditional collaborative filtering algorithms demonstrates the effectiveness of the improved collaborative filtering algorithm in recommending the Chinese characters that learners are likely to write incorrectly.
3.1. Data Pre-Processing
The data were drawn from the standardized writing questions of the 2020 and 2021 competition Brush and Ink in China, which was open to teachers, students in schools and colleges, and members of the community across the country. One hundred and thirty-seven error-prone Chinese characters were selected as the question bank for standard writing. We randomly sampled the writing records of 20,000 participants from 2020 and 2021 as the research data. The specific data pre-processing steps are described below.
Data Cleaning: Delete missing data to complete data cleaning. Since some participants left several questions empty without answering them directly, resulting in vacant answer data and wrong judging data, these records need to be deleted. In addition, the purpose of the research is to mine information about Chinese characters, so redundant data such as participants’ cell phone numbers and titles were deleted.
Data Integration: Multiple sub-data were integrated into one data file and duplicate records were removed to resolve data redundancy. Since there may be multiple submissions by a participant resulting in the data records being saved multiple times, these duplicate data need to be removed to avoid data redundancy.
Data Conversion: The form of data is subject to the requirements of the algorithm, and the data used for mining needs to be processed by data conversion. Since character data is generally not directly used as input to the algorithm, it is necessary to encode the character data into digital data to make it meet the requirements of the algorithm.
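The three pre-processing steps above can be sketched in a few lines. The field names (`participant_id`, `phone`, `wrong_chars`) and the sample rows are assumptions made for illustration; the actual competition data schema is not given in the text.

```python
def preprocess(raw_rows):
    """Clean, de-duplicate, and integer-encode raw submission rows.

    Each row is assumed to be a dict with hypothetical keys
    'participant_id', 'phone', and 'wrong_chars'.
    """
    seen = set()
    records = []
    for row in raw_rows:
        answers = row.get("wrong_chars")
        # Data cleaning: drop rows whose answers are missing or empty.
        if not answers:
            continue
        # Redundant fields such as 'phone' are simply not carried forward.
        key = (row["participant_id"], tuple(sorted(set(answers))))
        # Data integration: skip repeated submissions by the same participant.
        if key in seen:
            continue
        seen.add(key)
        records.append(sorted(set(answers)))

    # Data conversion: map every character to an integer code.
    vocab = {ch: i for i, ch in enumerate(sorted({c for r in records for c in r}))}
    encoded = [[vocab[c] for c in r] for r in records]
    return encoded, vocab


raw_rows = [
    {"participant_id": 1, "phone": "000", "wrong_chars": ["连", "迈"]},
    {"participant_id": 1, "phone": "000", "wrong_chars": ["迈", "连"]},  # duplicate
    {"participant_id": 2, "phone": "111", "wrong_chars": None},          # unanswered
    {"participant_id": 3, "phone": "222", "wrong_chars": ["莲", "连"]},
]
encoded, vocab = preprocess(raw_rows)
print(encoded, vocab)
```

The encoded records can then be passed directly to an association rule miner, and the `vocab` mapping is kept so that mined rules can be translated back into characters.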
3.2. The Error-Prone Character Set Table of Stroke Order Based on Association Rules
The Apriori algorithm with the lift measure was applied to the pre-processed contest data to mine the relationships between error-prone Chinese characters and error types (incorrect stroke sequence or incorrect number of strokes), as well as the relationships among error-prone Chinese characters themselves. Some of the mined results are shown in Table 1.
It can be seen from Table 1 that error-prone Chinese characters are often significantly associated with specific error types. Taking “之” as an example, the confidence of the rule linking “之” to the error type “Wrong number of strokes” is 0.99. This indicates that the rule is highly reliable and that learners are very likely to make this type of error when practicing this character; it is also consistent with the fact that the two strokes “horizontal-break” and “right-falling” are easily written as one folding stroke. Therefore, writing correction should pay attention to the strokes of this character. Error-prone Chinese characters like this can be grouped into the set of characters with incorrect hyphenated strokes. In addition, the character “怀”, corresponding to no. 6 in Table 1, has a high correlation with the character “情”. When learners make writing errors on “怀”, they may also make errors on “情”. Both characters contain the component “忄”, whose stroke order is easily written incorrectly, which indicates a relatively intuitive correlation between Chinese characters that share a component. This kind of error-prone Chinese character can be grouped into the set of characters with the same error-prone components. However, there are also Chinese characters with no intuitive correlation. Through the Apriori algorithm, we found that the characters “龙” and “为” are correlated. In terms of structure, what they have in common is an independent dot stroke, which suggests that the stroke order rules of Chinese characters with independent dot strokes can be summarized and that such characters can be grouped into sets with the same error-prone features. Other characters have none of the above features but are still easily written incorrectly because of their complex structure; they can be grouped into the set of characters with complex structures that are hard to write correctly.
By mining the correlations between error-prone Chinese characters, several categories of error-prone Chinese characters were obtained. We then constructed the basic error-prone Chinese character set table by expanding the set of characters within each category (Table 2). Each category contains dozens of Chinese characters with a common error-prone feature. Correcting that shared error in one character carries over to every character in the category, so the efficiency of learning Chinese character stroke order can be improved by tens of times.
Thus far, we have described a method for generating the error-prone Chinese character set library based on the improved Apriori algorithm. By calling the Chinese characters in this library, personalized recommendations can be made according to user information and the character library. To verify the effectiveness of the error-prone Chinese character set library, we imported it into an applet that we developed for internet users to practice with. The writing records of each user were extracted and analyzed for the types of stroke and stroke order errors they contain; some of the exercise data are shown in Table 3.
We can see that the characters written incorrectly by a learner are related to one another, and that some share an explicit common part. For example, the characters “龙” and “拢” written by learner no. 1 have the same component “龙”. There are also implicit common-part correlations, such as the characters “丑” and “再” written by learner no. 5, which both contain a “土” structure. We can conclude that when “土” is integrated into a character, the stroke order rule is “vertical first and then two horizontal” [16]. Following this idea, 38 different error types and their character sets were summarized through data mining and analysis to form the error-prone Chinese character set table, in which the correlations between characters were confirmed in the learners’ writing records.
3.3. Recommendation of Error-Prone Chinese Characters based on Collaborative Filtering
The learner-based collaborative filtering algorithm for error-prone Chinese character recommendation uses as experimental samples the learners and their Chinese character writing records from the test system built on the error-prone Chinese character set table. The experimental analysis is carried out through the intelligent recommendation of error-prone Chinese characters based on UserCF-IIF. The experiment focuses on the quality of the recommendations obtained when error-prone Chinese characters are selected using the improved similarity measures.
The purpose of the algorithm is to recommend to each learner the Chinese characters that he or she is most likely to write incorrectly, so the top-N recommendation strategy was used [17]. To evaluate the recommendation results objectively, we adopted evaluation indexes commonly used for recommendation systems, namely precision, recall, and coverage. Precision is the proportion of the error-prone Chinese characters recommended to learners that are true error characters. Recall is the proportion of learners’ true error characters in the test set that appear in the recommended set of most likely error-prone characters. Coverage is the ratio of all the recommended error-prone Chinese characters to the whole error-prone Chinese character set table. The formulas are shown in Equations (8)–(10):

$$\mathrm{Precision} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|} \qquad (8)$$

$$\mathrm{Recall} = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|} \qquad (9)$$

$$\mathrm{Coverage} = \frac{\left|\bigcup_{u} R(u)\right|}{|I|} \qquad (10)$$

where $R(u)$ represents the error-prone Chinese character set recommended for learner $u$, $T(u)$ represents the true error Chinese characters of learner $u$ in the test set, $I$ represents the whole error-prone Chinese character set table, $p(i)$ represents the prevalence of Chinese character $i$, and $N$ represents the length of the recommended list of error-prone Chinese characters $R(u)$.
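The three evaluation indexes can be computed with a short sketch, given here under stated assumptions: recommendations and true errors are held as Python sets keyed by learner, and the names `evaluate`, `recommended`, `truth`, and `all_chars` as well as the toy data are hypothetical.

```python
def evaluate(recommended, truth, all_chars):
    """Compute precision, recall, and coverage over all learners.

    recommended: learner -> set of recommended characters
    truth:       learner -> set of characters the learner really miswrote (test set)
    all_chars:   every character in the error-prone character set table
    """
    hits = sum(len(recommended[u] & truth.get(u, set())) for u in recommended)
    n_recommended = sum(len(chars) for chars in recommended.values())
    n_true = sum(len(truth.get(u, set())) for u in recommended)
    covered = set().union(*recommended.values()) if recommended else set()
    return {
        "precision": hits / n_recommended if n_recommended else 0.0,
        "recall": hits / n_true if n_true else 0.0,
        "coverage": len(covered) / len(all_chars) if all_chars else 0.0,
    }


# Hypothetical recommendations and ground truth for two learners.
recommended = {"u1": {"之", "龙"}, "u2": {"之", "怀"}}
truth = {"u1": {"之"}, "u2": {"怀", "情", "为"}}
all_chars = {"之", "龙", "怀", "情", "为"}
metrics = evaluate(recommended, truth, all_chars)
print(metrics)
```

Here two of the four recommended characters are true errors and two of the four true errors are recovered, so precision and recall are both 0.5, while the recommendations touch three of the five characters in the table.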
Three algorithms were adopted for comparison in this paper: UserCF (learner-based collaborative filtering), MostPopularCF (popularity-based filtering), and RandomCF (random filtering) [18]. All algorithms were tested separately using Jaccard similarity and cosine similarity for comparison. The length of the recommendation list in the experiment was 10, and the number of similar learners (neighbors) ranged from 5 to 30.
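For orientation, the two non-personalized baselines can be sketched as below. The text does not specify how MostPopularCF and RandomCF were implemented, so this is only one plausible reading: both are assumed to skip characters the learner is already known to miswrite, and the function names, the fixed random seed, and the toy data are hypothetical.

```python
import random
from collections import Counter


def most_popular(errors, u, n=10):
    """Recommend the n characters most often miswritten by the other learners."""
    counts = Counter(ch for v, chars in errors.items() if v != u for ch in chars)
    # Assumption: characters already in u's own error set are not recommended again.
    return [ch for ch, _ in counts.most_common() if ch not in errors[u]][:n]


def random_recommend(errors, u, all_chars, n=10, seed=0):
    """Recommend n characters drawn uniformly at random from the set table."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pool = sorted(set(all_chars) - errors[u])
    return rng.sample(pool, min(n, len(pool)))


# Hypothetical learner -> set of miswritten characters.
errors = {
    "u1": {"之", "龙", "怀"},
    "u2": {"之", "龙", "拢"},
    "u3": {"之", "情"},
    "u4": {"之", "怀", "情"},
}
all_chars = set().union(*errors.values())
popular = most_popular(errors, "u1", n=2)
randoms = random_recommend(errors, "u1", all_chars, n=2)
print(popular, randoms)
```

Because the random baseline draws from the whole table regardless of the learner, it tends to spread its recommendations widely, which is consistent with a high coverage but low precision and recall.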
As can be seen from Table 4, the collaborative filtering algorithm based on improved similarity can improve the accuracy of error-prone Chinese character recommendation. Among the models compared, Jaccard-UserCF-IIF has the best accuracy; RandomCF is the best in coverage, with Jaccard-UserCF-IIF second. RandomCF leads in coverage because it recommends at random, but its accuracy and recall are poor. Overall, Jaccard-UserCF-IIF shows the best comprehensive performance among all models.
We further analyze the model with the best comprehensive performance, namely Jaccard-UserCF-IIF. Figure 2 reports the influence of the number of neighbors on the recommendation performance of Jaccard-UserCF-IIF. As the number of neighbors increases, the recommendation performance improves gradually. When the number of neighbors is 25, the accuracy and recall rate reach their maximum values. The coverage rate decreases as the number of neighbors increases, while the prevalence rate increases steadily.
5. Conclusions
As an educational sustainability case study, a personalized Chinese stroke order correction algorithm was successfully developed to correct irregular writing habits. In this algorithm, the Apriori algorithm improved with the lift measure was first used to construct the error-prone Chinese character set table, and the improved collaborative filtering algorithm was then used to develop a learner-based personalized recommendation model for error-prone Chinese characters. Empirical testing of the personalized stroke order correction algorithm in two experiments showed that the testers’ performance improved significantly after training. The overall results illustrate the effectiveness of the proposed algorithms. However, the association rules were not sufficiently strong because the massive competition data are sparse, which deserves further in-depth investigation. In future studies, we will further optimize the error-prone Chinese character set table and introduce more dimensions of learner information to improve the performance of the proposed algorithms. The methods in this study can also be extended to relevance mining in other subjects and to the design of teaching strategies, because knowledge points are interrelated and groups of similar learners exist in every domain.