Article

Combining Multi-Scale Fusion and Attentional Mechanisms for Assessing Writing Accuracy

College of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1204; https://doi.org/10.3390/app15031204
Submission received: 10 December 2024 / Revised: 18 January 2025 / Accepted: 21 January 2025 / Published: 24 January 2025
(This article belongs to the Special Issue Intelligent Systems and Tools for Education)


Featured Application

This paper focuses on intelligent support for Shanghai’s writing grade examination, aiming to achieve efficient recognition of handwritten Chinese character images in the promotion examination for primary and secondary school students.

Abstract

Traditional methods of assessing handwritten characters are often subjective, inefficient, and slow to provide feedback, which makes it difficult for educators to achieve fully objective writing assessments and for writers to receive timely suggestions for improvement. In this paper, we propose a convolutional neural network (CNN) architecture that combines an attention mechanism with multi-scale feature fusion. Specifically, features are weighted by a bottleneck layer that incorporates the Squeeze-and-Excitation (SE) attention mechanism to highlight important information, and a multi-scale feature fusion method enables the network to capture both the global structure and the local details of Chinese characters. Finally, a high-quality dataset containing 26,800 images of handwritten Chinese characters is constructed based on the application scenario of the writing grade test, covering the Chinese characters commonly seen in the exam. The experimental results show that the proposed method achieves 98.6% accuracy on the writing grade exam dataset and 97.05% on the ICDAR-2013 public dataset, significantly improving recognition accuracy. The constructed dataset and improved model are suitable for application scenarios such as writing grade exams, helping to improve marking efficiency and accuracy.

1. Introduction

In recent years, the state has attached great importance to the standardized use of Chinese characters and to calligraphy education, strengthening calligraphy instruction in schools by issuing the Guidelines for Calligraphy Education in Primary and Secondary Schools, which incorporate calligraphy education into primary and secondary school curricula. In Shanghai, for example, primary and secondary school students are required to take a writing test, the results of which are recorded in the Shanghai Student Growth Record Book; this is particularly important for students about to be promoted to junior high school. The current writing grade examination requires extensive manual participation, is inefficient, and is susceptible to subjective factors. Handwriting recognition technology based on artificial intelligence and deep learning can process writing samples automatically, which not only improves grading efficiency but also reduces subjective bias.
The application of Handwritten Chinese Character Recognition (HCCR) to writing grade examinations has an important research background and practical significance. With the development of information technology and the gradual digitization of handwritten documents, HCCR has become a research hotspot in both academia and industry. It faces multiple challenges, chiefly the complexity of the Chinese character system: the characters are numerous, structurally complex, and written in diverse styles. There are thousands of commonly used characters and tens of thousands of less common ones, and many similar characters differ only slightly, which further increases the difficulty of recognition [1]. Improving the accuracy of handwritten Chinese character recognition has therefore become a research priority. In particular, convolutional neural networks (CNNs) have made remarkable advances in image recognition, for example by capturing multi-scale features through multi-scale convolutional kernels (e.g., Inception modules [2]) or feature pyramid networks (FPN [3]). Additionally, attention mechanisms such as Squeeze-and-Excitation (SE) [4] and the Convolutional Block Attention Module (CBAM) [5] have significantly enhanced models’ ability to focus on key features by adaptively adjusting feature weights.
The rapid development of CNNs has significantly advanced the field of HCCR. Compared to traditional methods, CNNs eliminate the reliance on subjective feature comparisons (e.g., stroke and radical similarity [6]), greatly improving recognition efficiency. In the context of handwriting grading examinations, CNNs can effectively reduce misjudgments caused by the subjective biases of human evaluators. Moreover, CNNs adapt better to diverse writing styles and capture subtle features more readily, thereby achieving higher recognition accuracy. In recent years, significant progress has been made in improving the accuracy of Chinese character recognition models. The ResNet–Centerloss model [7] achieved an accuracy of 97.03%, setting a new benchmark; the improved SqueezeNet [8] reached 96.32% in 2021 while reducing the model size; and the CUE model, which combines radical self-information [9], achieved 96.96% in 2023. In 2024, an improved VGG-net combined with a self-constructed dataset achieved 88.3% accuracy in style recognition and 95.6% in character recognition [10]. In the same year, a CNN-based Chinese character recognition method combining HOG feature extraction, Euclidean distance analysis, and the GoogLeNet Inception-v3 model achieved a 93.12% recognition rate [11]. Although deep learning-based HCCR techniques have made significant progress in accuracy, several challenges remain. First, Chinese characters have complex glyph structures and many similar characters whose differences are usually very small [1]. Second, simple characters predominate in writing grade tests, and their feature scales differ significantly from those of complex characters. The size of the convolution kernel therefore needs to be chosen to accommodate the feature extraction needs of different types of Chinese characters: a larger kernel enlarges the receptive field and better captures the global structure of complex, similar-shaped characters [12], while a smaller kernel suits the detailed features of simple characters, reducing unnecessary computation and avoiding both over-computation on simple characters and under-extraction of features from complex ones. In addition, in deep convolutional networks the later layers mainly extract deep abstract features but combine insufficiently with the low-level features of the earlier layers, weakening the capture of certain detailed features [13]. Introducing an attention mechanism [14] is therefore expected to enhance the network’s ability to jointly capture global and local information, better supporting complex recognition tasks [15]. Significant advances have also been made in large-character-set recognition in other countries. For instance, Szymkowski proposed a solution for offline recognition of Japanese characters using minutiae and other features extracted from handwriting images [16].
In Korea, to address the structural complexity of Hangul, Snowberger conducted a comprehensive study using three neural network architectures—Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN)—on a self-constructed Handwritten Hangul dataset [17]. In the digitization of ancient Egyptian Hieroglyphs, researchers created a hand-drawn hieroglyph dataset and validated its efficiency for recognition using CNN models like ResNet and VGG [18]. To address the challenges in Urdu handwritten character recognition, researchers developed a Nasta’liq-style dataset and introduced an optimized CNN architecture. The proposed model demonstrated superior performance, achieving higher accuracy compared to existing state-of-the-art systems [19].
In this context, this paper proposes an improved CNN-based network structure aimed at better handling multi-scale feature extraction in Chinese character recognition. By introducing multi-scale convolutional kernels and an attention mechanism, the network adapts to the small shape differences between similar Chinese characters while still recognizing simple characters efficiently. Meanwhile, we invited 240 students to write Chinese characters and constructed a high-quality dataset of 26,800 handwritten character images for the needs of the writing grade examination, and we propose a CNN architecture combining an attention mechanism with multi-scale feature fusion to enhance the extraction of character structure and holistic features. The proposed method achieves an accuracy of 98.6% on the writing grade examination dataset, and tests on the public ICDAR-2013 dataset [20] show an accuracy of 97.05%, a significant improvement in recognition accuracy. The constructed dataset and improved model are suitable for scenarios such as writing grade examinations, helping to improve the efficiency and accuracy of marking. Future work will continue to optimize the model to better suit actual teaching and examination needs.

2. Multi-Scale-SE Recognition Methods

2.1. Multi-Scale-SE Recognition Models

To solve the problem of losing details of Chinese characters due to deep convolution and pooling operations, this paper designs an attention-based bottleneck layer structure and incorporates multi-scale fusion to better capture the overall features of Chinese characters. A handwritten Chinese character recognition model combining the Squeeze-and-Excitation (SE) attention mechanism and multi-scale fusion is finally constructed (Multi-scale SE, MSSE), and its structure is shown in Figure 1.
The MSSE model consists of two convolutional layers, four bottleneck layers, four pooling layers, and one regression layer. First, the Chinese character images in the dataset are preprocessed, and the input grayscale image size is adjusted to 96 × 96. The convolution kernels of both convolutional layers are set to 3 × 3 because 3 × 3 kernels capture local spatial information at a lower computational cost: compared to a single 5 × 5 kernel, a similar receptive field can be achieved by stacking two 3 × 3 kernels, progressively capturing a wider range of spatial information while reducing the number of parameters.
Batch normalization [21] has been widely used in the field of HCCR, which can significantly improve the training efficiency and stability of the model. Therefore, a batch normalization layer is added after each convolutional layer of this model, and a ReLU activation function [22] is added to increase the non-linearity. In order to further improve the feature extraction capability of the model, MSSE designs a bottleneck layer that combines the attention mechanism and multi-scale feature fusion to replace the traditional convolutional operation to obtain the detailed information and global features of Chinese characters more comprehensively. The specific structure of the network is shown in Table 1.
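As an illustration, the stem described above (two 3 × 3 convolutions, each followed by batch normalization and ReLU, on a 96 × 96 grayscale input) could be sketched in PyTorch as follows. The channel counts follow Table 1, but the code is an illustrative reconstruction, not the authors’ implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the MSSE stem: two 3x3 conv layers, each followed by
# batch normalization and ReLU. Channel sizes (1 -> 64 -> 64) follow Table 1.
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1),   # 96x96x1  -> 96x96x64
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 96x96x64 -> 96x96x64
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 1, 96, 96)  # one preprocessed grayscale character image
print(stem(x).shape)           # torch.Size([1, 64, 96, 96])
```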

2.2. The Bottleneck Layer Structure of the Fusion Attention Mechanism

In convolutional neural network (CNN) design, the bottleneck layer structured as ‘1 × 1–3 × 3–1 × 1’ is a classic arrangement for optimizing network efficiency and feature extraction capability, widely used in deep networks such as ResNet [23]. Through a ‘compression–transformation–recovery’ strategy, this structure significantly improves performance in computer vision tasks by reducing computation while preserving feature integrity [24]. Specifically, the bottleneck consists of three convolutions: the first 1 × 1 convolution reduces the number of channels, compressing the features and significantly cutting computation and parameters while retaining the key information; the middle 3 × 3 convolution extracts local features in the lower-dimensional space, capturing spatial information at low computational cost; and the second 1 × 1 convolution restores the channel count so that the bottleneck’s output dimensions match the input of the subsequent network layer.
In the handwritten Chinese character recognition task, the complex structure of and subtle differences between Chinese characters require accurate multi-level feature extraction. The bottleneck design strengthens the model’s feature expression: the 3 × 3 convolution captures details and local features, while the 1 × 1 convolutions reduce the computational burden through dimensionality reduction and restoration, improving training efficiency and inference speed. In addition, the bottleneck layer reduces feature redundancy, which allows a relatively shallow network to approach the performance of a deep one, improves the capture of subtle features such as complex strokes and structures, and lets deep neural networks strike a good balance between accuracy and computational efficiency by lowering computational cost.
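A minimal PyTorch sketch of this ‘1 × 1–3 × 3–1 × 1’ bottleneck (without the SE and multi-scale additions introduced later) might look as follows; the channel sizes are illustrative assumptions:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of the classic 'compression-transformation-recovery' bottleneck."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),            # compress channels
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),  # transform locally
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1),            # recover channels
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```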
The Squeeze-and-Excitation [4] (SE) module is a structure that significantly enhances the feature representation of a neural network, aiming to improve the model’s ability to capture key features and thus improve recognition accuracy. The SE module enables the network to focus on important information while suppressing irrelevant or noisy information by dynamically adjusting the weights of channel features. This is particularly helpful for handwritten Chinese character recognition, as the complex strokes and structures of Chinese characters make recognition challenging. With the introduction of the SE module, the model is able to focus on discriminative features more effectively, which significantly improves the robustness. The structure of the SE module is shown in Figure 2.
The SE module consists of two main phases: Squeeze and Excitation. In the Squeeze phase, the SE module aggregates the global spatial information of each channel of the input feature map through global average pooling. Let the input feature map be $U \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels of the feature map, respectively. Global average pooling aggregates the spatial information of each channel into a scalar, generating a $C$-dimensional channel description vector, computed as:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_{i,j,c}, \quad c = 1, 2, \ldots, C$$

where $z_c$ denotes the global average pooling result for the $c$-th channel.
In the Excitation phase, the SE module selectively enhances or suppresses channels by generating a weight for each channel through an adaptive weight generation mechanism. Specifically, two Fully Connected (FC) layers and nonlinear activation functions (usually ReLU and Sigmoid) construct the dependencies between channels. The channel description vector $z$ is first mapped by the first FC layer into an intermediate representation space of reduced dimension $C/r$, where $r$ is a scaling factor that limits the number of parameters; the channel weight vector is then computed as:

$$s = \sigma(W_2 \, \delta(W_1 z))$$

where $\delta$ denotes the ReLU activation function, $\sigma$ denotes the Sigmoid activation function, and $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are the weight matrices of the two fully connected layers.
In the final feature recalibration stage, the SE module multiplies each channel of the original feature map by the corresponding weight, achieving adaptive channel-wise weighting. The output feature map is computed as:

$$\tilde{X}_{i,j,c} = s_c \cdot U_{i,j,c}, \quad c = 1, 2, \ldots, C$$
The SE bottleneck layer module shows remarkable effectiveness in the handwritten Chinese character recognition task, thanks to its excellent ability to cope with the complexity of Chinese character strokes and structures. The challenge of handwritten Chinese character recognition lies in the need to accurately capture highly discriminative features, and the SE bottleneck layer significantly improves the robustness and accuracy of the model by enabling the network to focus more on key feature regions through adaptive weight adjustment.
In the specific implementation, the adaptive weight generation mechanism of the SE bottleneck layer comprises the following steps:
Global Information Compression: First, global average pooling (GAP) is applied to compress each channel’s feature map, aggregating spatial dimensions (H × W) into a channel descriptor. This step extracts global contextual information while significantly reducing the number of parameters.
Channel Weight Learning: Next, two fully connected (FC) layers are used to learn inter-channel dependencies. The first FC layer reduces the number of channels by a factor of 1/r (where r is the compression ratio, typically set to 16) to lower computational complexity, while the second FC layer restores the original number of channels. A ReLU activation function is applied between the two FC layers to introduce non-linearity.
Weight Adjustment and Feature Recalibration: Finally, the learned channel weights are normalized to the range [0, 1] using a Sigmoid function, generating adaptive weights. These weights are then multiplied channel-wise with the original feature maps, enhancing important features and suppressing less significant ones.
This mechanism not only effectively reduces the number of parameters and computational overhead of the model, but also significantly enhances the sensitivity and selectivity of the network to important features. Through global information aggregation and dynamic weight adjustment, the SE bottleneck layer enables the network to focus on high-contribution features, further improving the quality of feature extraction. Its design takes full consideration of the balance between lightweight and performance optimization, and it demonstrates a strong capability in handwritten Chinese character recognition, capturing the fine-grained features of the image, thus better adapting to the complex Chinese character recognition task.
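Putting the three steps together, a standard SE block (following Hu et al. [4], with the compression ratio r = 16 mentioned above) can be sketched in PyTorch as follows; this is a generic reconstruction rather than the paper’s exact code:

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation block: squeeze, excitation, recalibration."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # global average pooling (GAP)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),  # reduce channels by 1/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # restore channel count
            nn.Sigmoid(),                        # weights normalized to [0, 1]
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)           # z_c: channel descriptors
        s = self.excite(z).view(b, c, 1, 1)      # s_c: adaptive channel weights
        return u * s                             # channel-wise recalibration

se = SEModule(64)
print(se(torch.randn(2, 64, 48, 48)).shape)      # torch.Size([2, 64, 48, 48])
```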

2.3. Multi-Scale Fusion Feature Extraction

In a convolutional layer, taking 2D image data as an example, suppose the input feature map has height H, width W, and Cin channels. After the convolution operation, the generated feature map has a new height H′, a new width W′, and Cout channels [25]. Since a single convolutional kernel can only capture features of a specific pattern, multiple convolutional kernels are commonly introduced to enhance the expressive power of the network and extract more diversified features. Multi-scale information fusion has long been a focal point in image processing and computer vision [26].
For the input feature map X, different sizes of convolution kernels (e.g., 3 × 3 and 5 × 5) can be used to extract multi-scale information. These convolutional kernels are able to capture different patterns of the input features and combine them into richer representations, helping to improve the network’s adaptability to complex tasks. In this way, multi-scale features are fused into the feature representation of the network, thus enhancing the overall performance.
The process is as follows: a convolution with a kernel of size $K_1$ is applied to the input feature map $X$ to obtain the feature map

$$Y_1 = \mathrm{Conv}_{K_1}(X) + b_1$$

and a convolution with a kernel of size $K_2$ yields the feature map

$$Y_2 = \mathrm{Conv}_{K_2}(X) + b_2$$

Next, the feature maps are dimensionally aligned, usually through an upsampling or downsampling operation, so that the outputs share a uniform size:

$$Y_1^{\mathrm{aligned}} = \mathrm{Upsample}(Y_1), \qquad Y_2^{\mathrm{aligned}} = \mathrm{Upsample}(Y_2)$$

After alignment, the two feature maps are concatenated along the channel dimension to obtain the fused feature map $Y$:

$$Y = \mathrm{Concat}(Y_1^{\mathrm{aligned}}, Y_2^{\mathrm{aligned}})$$
Through the above multi-scale convolution and feature fusion, the network can more fully acquire information at different scales and improve the representation of features. In the handwritten Chinese character recognition task, the traditional 3 × 3 convolutional kernel design has been proven to have efficient feature extraction capability but still has some limitations when facing the complex, multi-scale visual information extraction of handwritten Chinese character images. In particular, simple characters occupy a large proportion of the writing grade test, but the feature scales of these simple characters differ significantly from those of complex-shaped characters. This difference in feature scales poses a special challenge to the design of the convolution kernel of the model: on the one hand, a larger convolution kernel can effectively increase the receptive field to more adequately capture the structural features and shape differences in complex characters; on the other hand, a smaller convolution kernel reduces the amount of computation in extracting the detailed features of the simple characters, which helps to avoid the problems of over-computation of the simple characters as well as the under-exploitation of the features of the complex characters.
To address this problem, this paper introduces a combination of parallel 3 × 3 and 5 × 5 convolutional kernels in the bottleneck layer to enhance feature extraction. This structure lets the model process multi-scale features in parallel at the same level, capturing local details through the 3 × 3 kernels while the 5 × 5 kernels cover a larger receptive field and acquire rich contextual information. Figure 3 shows a diagram of the multi-scale feature extraction.
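A sketch of such a parallel branch in PyTorch is given below. Note that with ‘same’ padding the 3 × 3 and 5 × 5 branches already produce equal spatial sizes, so the alignment step above reduces to a channel-wise concatenation; the branch widths are assumed values:

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Parallel 3x3 / 5x5 branches fused by channel-wise concatenation."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        y1 = self.branch3(x)                # local stroke details
        y2 = self.branch5(x)                # wider receptive field / structure
        return torch.cat([y1, y2], dim=1)   # Y = Concat(Y1, Y2), 2*branch_ch channels

fusion = MultiScaleFusion(64, 48)
print(fusion(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 96, 48, 48])
```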

3. Results

3.1. Dataset

There is no publicly available handwritten Chinese character dataset that meets the standard of the writing grade exam. Therefore, based on the needs of the exam, we invited 240 students who would soon take the Chinese character level exam to write characters, collected the handwritten images, and constructed a high-quality dataset containing 26,800 Chinese character images. We used professional annotation tools to finely annotate the samples, generating XML files containing character encodings and other information, and divided and processed the dataset with Python 3.6 code; additional samples were produced through data augmentation such as rotation and translation. All character images were then preprocessed: non-local means denoising suppresses noise effectively while retaining as much of the original image’s detail as possible, and Otsu binarization addresses the foreground/background gray-level problem by reducing the influence of background gray-value changes on feature extraction. Figure 4 shows an example of a Chinese character before and after preprocessing in the self-built dataset.
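For reference, the described preprocessing could be reproduced with standard OpenCV routines roughly as follows; the file name and the non-local means parameters are illustrative assumptions, since the paper does not report them:

```python
import cv2

# Placeholder path to one collected handwritten character image.
img = cv2.imread("char.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (96, 96))

# Non-local means denoising suppresses noise while preserving stroke detail;
# h (filter strength) and the window sizes are assumed values.
denoised = cv2.fastNlMeansDenoising(img, h=10,
                                    templateWindowSize=7, searchWindowSize=21)

# Otsu's method chooses a global threshold automatically, reducing the
# effect of background gray-level variation on feature extraction.
_, binary = cv2.threshold(denoised, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```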
To verify the effectiveness and generalization of the proposed model on offline handwritten Chinese character recognition, the experiments also use the publicly available handwritten Chinese character datasets (CASIA-HWDB1.0-1.1) of the Institute of Automation, Chinese Academy of Sciences, as shown in Table 2. These datasets cover the 3755 GB2312-80 [27] level-1 commonly used Chinese characters and constitute a large-scale pattern recognition sample set. All images are uniformly scaled to 96 × 96 for the recognition experiments, and the ratio of training set to test set is 4:1.

3.2. Experimental Parameters

In this paper, the network model is implemented in PyTorch 1.6.0; the code is written and run on a Linux CentOS 7.7 system, and the experimental hardware environment is an NVIDIA A100 Tensor Core GPU (80 GB). The Adam (Adaptive Moment Estimation) optimizer is used for training, with the learning rate set to 0.001, the input size to 96 × 96, and the batch size to 128.
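The reported training configuration corresponds to a loop of the following form; the tiny stand-in model and random tensors replace the MSSE network and the real data loader, so this is a schematic sketch only:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the MSSE network and the real dataset (3755 classes).
model = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 3755))
loader = DataLoader(TensorDataset(torch.randn(256, 1, 96, 96),
                                  torch.randint(0, 3755, (256,))),
                    batch_size=128, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # reported settings

for images, labels in loader:          # one epoch over batches of 128
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```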

3.3. Experimental Results and Ablation Experiments

Figure 5 shows the model’s training loss curve. The loss decreases rapidly at the beginning of training, indicating that the model is optimized quickly in the initial stage. After about five epochs, the rate of decrease gradually slows and stabilizes, showing that the model is approaching convergence. Finally, after 139 epochs of training, performance is stable and the accuracy converges to 98.6%.
Additionally, to further validate the reliability of the results, we incorporated the F1 score, Recall, and the Kappa coefficient for classification performance evaluation. The F1 score is the harmonic mean of precision and recall, calculated as:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here, precision represents the proportion of true positives among the samples predicted as positive, while recall measures the proportion of true positives correctly identified by the model, defined as:
$$\mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
Given the large number of categories and imbalanced sample distribution in our dataset, we introduced the Kappa coefficient to assess classifier agreement. The Kappa coefficient compares observed agreement with random agreement, eliminating the influence of chance, and is calculated as:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the observed agreement and $P_e$ is the expected agreement by chance.
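In practice these three metrics can be computed with scikit-learn as sketched below; the label arrays are placeholders, and macro averaging is our assumption since the paper does not state its averaging mode:

```python
from sklearn.metrics import f1_score, recall_score, cohen_kappa_score

# Placeholder true and predicted class labels for illustration.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(f1_score(y_true, y_pred, average="macro"))      # harmonic mean of P and R
print(recall_score(y_true, y_pred, average="macro"))  # TP / (TP + FN) per class
print(cohen_kappa_score(y_true, y_pred))              # chance-corrected agreement
```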
From Table 3, the MSSE model attains 98.6% accuracy and 94.3% Kappa, reflecting strong consistency. Its 96.9% recall and 95.7% F1 score confirm robust minority class detection and precision–recall balance, proving its efficacy in assessing writing accuracy.
To enhance the reliability of our experimental results, we conducted a set of Monte Carlo experiments on our self-constructed dataset, focusing on the performance of the Assessing Writing Accuracy task. These experiments involved randomly splitting the dataset into training and testing sets multiple times, calculating the mean and standard deviation across iterations to evaluate the model’s stability and generalization capability in the writing accuracy task.
Furthermore, we performed statistical analysis on the Monte Carlo results using 95% confidence intervals and p-values as evaluation metrics. Confidence intervals estimate the plausible range of parameters, reflecting the reproducibility and stability of the experimental results. The 95% confidence level was chosen because it is widely accepted in statistics, striking a balance between confidence level and interval width, ensuring high reliability without overly broad intervals that could obscure meaningful information. p-values test the statistical significance of the results, determining whether observed differences are due to random chance. Typically, a p-value less than 0.05 indicates a significant finding.
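The reported statistics can be reproduced in form (though not in value) with NumPy and SciPy as below; the per-run accuracies are placeholders, and the one-sample t-test against a 97% baseline is an illustrative choice, as the paper does not state its null hypothesis:

```python
import numpy as np
from scipy import stats

# Placeholder accuracies from ten Monte Carlo train/test splits.
accs = np.array([98.4, 97.9, 98.2, 98.5, 97.8, 98.3, 98.0, 98.4, 98.1, 98.3])

mean, sd = accs.mean(), accs.std(ddof=1)              # Mean +/- SD
ci = stats.t.interval(0.95, df=len(accs) - 1,         # 95% confidence interval
                      loc=mean, scale=stats.sem(accs))
t_stat, p_value = stats.ttest_1samp(accs, popmean=97.0)  # assumed baseline

print(f"{mean:.2f} +/- {sd:.3f}, 95% CI {ci}, p = {p_value:.3f}")
```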
After conducting ten Monte Carlo experiments, the calculated statistical data, including the mean ± standard deviation (Mean ± SD), 95% confidence interval (95% CI), and p-value, are presented in Table 4.
In the task of assessing writing accuracy, the proposed MSSE model demonstrated exceptional performance in Monte Carlo experiments, achieving an average accuracy of 98.19% (±0.316). The 95% confidence interval of [97.964, 98.416] indicates that the model’s accuracy is highly reliable, with minimal variability across experiments. Additionally, the statistically significant p-value of 0.002 confirms that the high accuracy is not due to chance but reflects the model’s robust performance. These results highlight the consistency and reliability of the MSSE model in handling the complexity and diversity of handwritten Chinese characters within the context of handwriting proficiency examinations.
To verify the contribution of each improvement to the overall model, the following ablation experiments are conducted on the self-built dataset. Method ① is a 13-layer basic network (Basic); method ② adds parallel convolution kernels for multi-scale information fusion (Multi-Basic); method ③ introduces an SE module on top of the added bottleneck layer (SE-Basic); and method ④ introduces the SE module on top of the bottleneck layer and uses parallel convolution kernels to enhance the feature extraction and fusion parts of the network (MSSE). The comparison of the methods is shown in Figure 6; the accuracy of the model improves considerably after combining the attention module with the multi-scale fusion method. The results of the ablation experiments are shown in Table 5.

3.4. Generalization Experiments and Comparisons

To further illustrate the superiority of the proposed network, we retrained the model on the CASIA-HWDB1.0-1.1 [28] dataset and compared its performance with other mainstream Chinese character recognition models on the same ICDAR-2013 test set, as shown in Table 6. DFE+DLQDF [29] completes recognition through Discriminative Feature Extraction (DFE) and the Discriminative Learning Quadratic Discriminant Function (DLQDF). Fujitsu-CNN is the champion model of the ICDAR-2013 Chinese character recognition competition; the MCDNN model [30] incorporates multi-scale fusion; and ATR-CNN [31] uses a relaxed convolutional neural network, altering the traditional strategy of sharing convolution kernels within one feature map. HCCR-Gabor-GoogLeNet [2] is based on GoogLeNet and combines Gabor features; DirectMap-ConvNet-Adaptation [32] maps handwritten Chinese character features into a convolutional neural network to accommodate different input feature distributions; HCCR-CNN12layer-GSLRE 4X [33] combines global and local feature extraction (GSLRE) to improve recognition; Ensemble-CNN-voting [34] improves accuracy by integrating multiple CNNs with a voting mechanism; Improved SqueezeNet [8] combines SqueezeNet with a dynamic network surgery algorithm; and CUE [9] combines radical self-information.
In the Shanghai Writing Grade Examination, a paper with more than two misspelled characters is directly rated as failed. Four expert teachers were invited to judge such failed papers, and we analyze four Chinese characters that all four experts judged to be misspelled but that the present model recognized successfully (Figure 7), comparing the output feature heatmaps and prediction results of these four characters in the MSSE model.
In Figure 7a, because strokes are missing in the lower-left corner of the structure, it is difficult for the human eye to identify the character from the strokes in the other parts; all four experts could not recognize it and unanimously judged it misspelled, but the heatmap shows that the model treats the lower-left strokes as important features and identifies the character successfully, demonstrating its grasp of fine-grained information. In Figure 7b, the handwriting is scribbled and the strokes are concentrated in the middle of the character, making the overlapping strokes hard to distinguish; the four experts mistook it for the similar character ‘放’, but the heatmap shows that the model focuses on the peripheral strokes and the overall structure and identifies it correctly, demonstrating its ability to differentiate feature weights after introducing the SE attention module. In Figure 7c, the four experts fail to recognize the character and all judge it misspelled, while the model accurately grasps the key strokes in the right half of the character and recognizes it successfully. Similarly, in Figure 7d the model’s result differs from the experts’ because it focuses on the subtle stroke features in the upper-left corner. The experts’ judgments (subjective) and the model’s identifications (objective) are shown in Table 7.
The experimental results show that, compared with the traditional DFE+DLQDF method, accuracy is significantly improved; compared with human recognition, the model not only eliminates subjective human factors but also captures subtle features that the human eye easily overlooks. Relative to MCDNN and HCCR-Gabor-GoogLeNet, which focus on multi-scale feature extraction, and to DirectMap-ConvNet-Adaptation, HCCR-CNN12layer-GSLRE 4X, and Skew Correction Based on ResNet, the proposed model achieves a clear improvement. As the comparison in Table 6 shows, the method proposed in this paper, combining the attention mechanism with multi-scale feature fusion, exceeds the mainstream models in recognition accuracy.

4. Conclusions and Further Research

The MSSE model proposed in this study integrates an attention bottleneck layer with a multi-scale fusion module, enabling robust feature extraction and adaptive feature weighting for assessing writing accuracy. Experimental results demonstrate that the model achieves outstanding performance on a self-constructed dataset comprising 26,800 samples from 240 writers, with an accuracy of 98.6%, an F1 score of 95.7%, a recall of 96.9%, and a Kappa coefficient of 94.3%. These metrics highlight the model’s strong learning capability and generalization potential. Furthermore, Monte Carlo experiments reveal a mean accuracy of 98.19% (±0.316%) and a 95% confidence interval of [97.964%, 98.416%], with a statistically significant p-value of 0.002, confirming the model’s stability and reliability.
The attention bottleneck layer enhances network depth and feature discriminability, ensuring high accuracy even in complex scenarios. Meanwhile, the multi-scale fusion module adapts to varying feature scales, making it particularly effective for both simple and complex characters in writing grade examinations. Ablation studies further validate the contributions of each component, with the MSSE model achieving a 2.06% improvement in accuracy over the basic version.
In addition to its performance on the self-constructed dataset, the MSSE model demonstrates excellent generalization on the CASIA-HWDB dataset, achieving an accuracy of 97.05%, which surpasses several state-of-the-art methods. This indicates its potential for practical applications in real-world scenarios. Moreover, the model’s architecture allows for further compression and optimization, making it suitable for deployment in resource-constrained environments.
In summary, the MSSE model not only achieves high recognition accuracy but also exhibits strong robustness and generalization capabilities, making it a promising solution for HCCR tasks in both academic and practical settings.
While the MSSE model demonstrates strong performance on the Shanghai Writing Grade Examination dataset, its limitations must be acknowledged. The dataset, comprising 26,800 samples from 240 fifth-grade students, is sufficient for initial validation but lacks diversity in age groups, educational backgrounds, and regional variation. This restricts the model’s generalizability to broader writing-accuracy assessment applications, such as adult calligraphy recognition or historical document analysis. Future work should expand the dataset to include more diverse populations and handwriting scenarios. Additionally, the model’s computational requirements, driven by its multi-scale fusion module and attention bottleneck layer, may hinder deployment in resource-constrained environments; techniques such as quantization, pruning, or knowledge distillation should be explored to optimize the model’s size and inference speed.

Author Contributions

Conceptualization, Y.S.; methodology, R.L.; software, R.L.; validation, R.L.; formal analysis, R.L.; investigation, R.L. and Y.S.; resources, Y.S.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, R.L. and Y.S.; visualization, R.L.; supervision, Y.S., X.L. and X.T.; project administration, Y.S., X.L. and X.T.; funding acquisition, Y.S., X.L. and X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by University–Industry Co-operation Project, Intelligent Sensing and Real-time Monitoring System for Urban Spatial Light Environment, Shanghai, China, grant number (24) DZ-007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the use of external datasets as the input data for creating an extended dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jin, L.-W.; Zhong, Z.-Y.; Yang, Z.; Yang, W.-X.; Xie, Z.-C.; Sun, J. Applications of deep learning for handwritten Chinese character recognition: A review. Acta Autom. Sin. 2016, 42, 1125–1141. [Google Scholar]
  2. Zhong, Z.; Jin, L.; Xie, Z. High performance offline handwritten chinese character recognition using googlenet and directional feature maps. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 846–850. [Google Scholar]
  3. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  4. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  5. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  6. Wu, Y.T.; Fujiwara, E.; Suzuki, C.K. Image-Based Radical Identification in Chinese Characters. Appl. Sci. 2023, 13, 2163. [Google Scholar] [CrossRef]
  7. Huang, Z.; Zhang, Q. Skew correction of handwritten Chinese character based on ResNet. In Proceedings of the 2019 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), Shenzhen, China, 9–11 May 2019; pp. 223–227. [Google Scholar]
  8. Zhou, Y.C.; Tan, Q.H.; Xi, C.L. Offline Handwritten Chinese Character Recognition of SqueezeNet and Dynamic Network Surgery. J. Chin. Comput. Syst. 2021, 42, 556–560. [Google Scholar]
  9. Luo, G.-F.; Wang, D.-H.; Du, X.; Yin, H.-Y.; Zhang, X.-Y.; Zhu, S. Self-information of radicals: A new clue for zero-shot Chinese character recognition. Pattern Recognit. 2023, 140, 109598. [Google Scholar] [CrossRef]
  10. Wong, A.; So, J.; Ng, Z.T.B. Developing a web application for Chinese calligraphy learners using convolutional neural network and scale invariant feature transform. Comput. Educ. Artif. Intell. 2024, 6, 100200. [Google Scholar] [CrossRef]
  11. Si, H. Analysis of calligraphy Chinese character recognition technology based on deep learning and computer-aided technology. Soft Comput. 2024, 28, 721–736. [Google Scholar] [CrossRef]
  12. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  13. Chen, Z.; Wang, C.; Wu, G. Pupil Refinement Recognition Method Based on Deep Residual Network and Attention Mechanism. Appl. Sci. 2024, 14, 10971. [Google Scholar] [CrossRef]
  14. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems, Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NeurIPS: San Diego, CA, USA, 2017. [Google Scholar]
  15. Kim, H.-J.; Eesaar, H.; Chong, K.T. Transformer-Enhanced Retinal Vessel Segmentation for Diabetic Retinopathy Detection Using Attention Mechanisms and Multi-Scale Fusion. Appl. Sci. 2024, 14, 10658. [Google Scholar] [CrossRef]
  16. Szymkowski, P.; Saeed, K.; Szymkowski, Ł.; Nishiuchi, N. Classification of Japanese Handwritten Characters Using Biometrics Approach. Appl. Sci. 2024, 14, 225. [Google Scholar] [CrossRef]
  17. Snowberger, A.D.; Lee, C.H. Handwritten Hangul Graphemes Classification Using Three Artificial Neural Networks. J. Inf. Commun. Converg. Eng. 2023, 21, 167–173. [Google Scholar] [CrossRef]
  18. Aneesh, N.; Somasundaram, A.; Ameen, A.; Garimella, G.S.; Jayashree, R. Exploring Hieroglyph Recognition: A Deep Learning Approach. In Proceedings of the 2024 2nd International Conference on Computer, Communication and Control (IC4), Indore, India, 8–10 February 2024; pp. 1–5. [Google Scholar]
  19. Mushtaq, F.; Misgar, M.M.; Kumar, M.; Khurana, S.S. UrduDeepNet: Offline handwritten Urdu character recognition using deep neural network. Neural Comput. Appl. 2021, 33, 15229–15252. [Google Scholar] [CrossRef]
  20. Yin, F.; Wang, Q.-F.; Zhang, X.-Y.; Liu, C.-L. ICDAR 2013 Chinese handwriting recognition competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1464–1470. [Google Scholar]
  21. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  22. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Li, Z.-H.; Cui, J.-X.; Lu, H.-P.; Zhou, F.; Diao, Y.-L.; Li, Z.-X. Prediction model of measurement errors in current transformers based on deep learning. Rev. Sci. Instrum. 2024, 95, 044704. [Google Scholar] [CrossRef] [PubMed]
  25. Li, Z.; Cui, J.; Chen, H.; Lu, H.; Zhou, F.; Rocha, P.R.; Yang, C. Research Progress of All-Fiber Optic Current Transformers in Novel Power Systems: A Review. Microw. Opt. Technol. Lett. 2025, 67, e70061. [Google Scholar] [CrossRef]
  26. Pereira, L.M.; Salazar, A.; Vergara, L. A comparative analysis of early and late fusion for the multimodal two-class problem. IEEE Access 2023, 11, 84283–84300. [Google Scholar] [CrossRef]
  27. GB2312-80; National Standard of the People’s Republic of China: Chinese Coded Character Set for Information Interchange—Primary Set. China State Bureau of Technical Supervision: Beijing, China, 1981.
  28. Liu, C.-L.; Yin, F.; Wang, D.-H.; Wang, Q.-F. CASIA online and offline Chinese handwriting databases. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 37–41. [Google Scholar]
  29. Long, T.; Jin, L. Building compact MQDF classifier for large character set recognition by subspace distribution sharing. Pattern Recognit. 2008, 41, 2916–2925. [Google Scholar] [CrossRef]
  30. Cireşan, D.; Meier, U. Multi-column deep neural networks for offline handwritten Chinese character classification. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–6. [Google Scholar]
  31. Zhang, X.-Y.; Yin, F.; Zhang, Y.-M.; Liu, C.-L.; Bengio, Y. Drawing and recognizing chinese characters with recurrent neural network. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 849–862. [Google Scholar] [CrossRef] [PubMed]
  32. Orlando, J.I.; Prokofyeva, E.; Blaschko, M.B. A discriminatively trained fully connected conditional random field model for blood vessel segmentation in fundus images. IEEE Trans. Biomed. Eng. 2016, 64, 16–27. [Google Scholar] [CrossRef] [PubMed]
  33. Zhou, L.; Yu, Q.; Xu, X.; Gu, Y.; Yang, J. Improving dense conditional random field for retinal vessel segmentation by discriminative feature learning and thin-vessel enhancement. Comput. Methods Programs Biomed. 2017, 148, 13–25. [Google Scholar] [CrossRef]
  34. Chen, L.; Wang, S.; Fan, W.; Sun, J.; Naoi, S. Beyond human recognition: A CNN-based framework for handwritten character recognition. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 695–699. [Google Scholar]
Figure 1. MSSE model.
Figure 2. SE bottleneck layer structure.
Figure 3. Multi-scale feature extraction.
Figure 4. Self-built dataset image preprocessing.
Figure 5. Loss curve.
Figure 6. Comparison of ablation experiments.
Figure 7. Heatmaps of some typical images: successful recognition despite missing strokes (a); scribbled but successfully recognized (b); heavily connected strokes but successfully recognized (c); only one stroke in the upper part but successfully recognized (d); with their respective labels and recognition results.
Table 1. MSSE network-specific architecture design.

Layer Name | MSSE Model | Output Shape
Input | 96 × 96 image | 96 × 96 × 1
Conv1 | 3 × 3 conv | 96 × 96 × 64
Conv2 | 3 × 3 conv | 96 × 96 × 64
AvgPool | 3 × 3 avgpool | 48 × 48 × 64
MSSEBottleneck1 | 1 × 1 conv, 3 × 3 conv / 5 × 5 conv, 1 × 1 conv | 48 × 48 × 96
AvgPool | 3 × 3 avgpool | 24 × 24 × 96
MSSEBottleneck2 | 1 × 1 conv, 3 × 3 conv / 5 × 5 conv, 1 × 1 conv | 24 × 24 × 128
AvgPool | 3 × 3 avgpool | 12 × 12 × 128
MSSEBottleneck3 | 1 × 1 conv, 3 × 3 conv / 5 × 5 conv, 1 × 1 conv | 12 × 12 × 256
AvgPool | 3 × 3 avgpool | 6 × 6 × 256
MSSEBottleneck4 | 1 × 1 conv, 3 × 3 conv / 5 × 5 conv, 1 × 1 conv | 6 × 6 × 448
GAP | GAP, dropout | 1 × 1 × 448
Output | Softmax | 3755
Table 2. Dataset.

Dataset | Writers | Samples
Writing Grade Examination | 240 | 268,000
CASIA-HWDB1.0 | 420 | 1,680,258
CASIA-HWDB1.1 | 300 | 1,172,907
ICDAR-2013 | 60 | 224,419
Table 3. Classification indicators.

Model | Accuracy | Recall | F1 | Kappa
MSSE | 98.6% | 96.9% | 95.7% | 94.3%
Table 4. Statistical significance analysis.

Experiment | Mean ± SD | 95% CI | p-Value
Monte Carlo | 98.19 ± 0.316 | [97.964, 98.416] | 0.002
Table 5. Ablation experiments.

Method | Accuracy
Basic | 96.54%
Multi-Basic | 97.38%
SE-Basic | 97.65%
MSSE | 98.60%
Table 6. Result comparisons.

Method | Accuracy
Human Recognition | 96.13%
DFE+DLQDF | 92.72%
MCDNN | 95.79%
ATR-CNN | 95.04%
HCCR-Gabor-GoogLeNet | 96.35%
DirectMap-ConvNet-Adaptation | 96.55%
HCCR-CNN12layer-GSLRE 4X | 96.73%
Ensemble-CNN-voting | 96.79%
ResNet–Centerloss | 97.03%
Skew Correction Based on ResNet | 95.50%
SqueezeNet | 96.32%
CUE | 96.96%
Ours (MSSE) | 97.05%
Table 7. Subjective and objective comparisons.

Sample | Ours | Expert A | Expert B | Expert C | Expert D
Figure 7a | ‘鲜’ | ‘蓟’ | null | null | null
Figure 7b | ‘饺’ | ‘放’ | ‘放’ | ‘放’ | ‘放’
Figure 7c | ‘隙’ | null | ‘陈’ | null | ‘琼’
Figure 7d | ‘首’ | ‘月’ | ‘月’ | ‘月’ | ‘月’