1. Introduction
Optical character recognition (OCR) is a field of research in pattern recognition (PR). The goal of an OCR system is to automatically read text from a scanned document and convert it into a digital format that can be read and edited using electronic applications [1].
Today, digital technologies are used worldwide, and almost all essential processes are performed electronically. The Arabic language is the official language of 23 countries and is spoken by more than 400 million people worldwide [2]. This raises the need for an efficient Arabic text recognizer, which can be helpful in many institutions, such as educational, governmental, and economic organizations. For example, some institutions need to convert old and new documents with handwritten scripts into digital Arabic text. An OCR system helps complete office tasks professionally while saving time and effort. Moreover, recognizing Arabic handwritten text is helpful for the automatic reading of bank checks [3].
Recognizing handwritten Arabic text is challenging because of the cursive nature of Arabic script. Arabic letters take different shapes depending on their position in the word, and some letters carry special marks such as ‘Hamza’ and ‘Madda’. Moreover, many Arabic letters share the same base shape and are differentiated only by dots, which can be one, two, or three dots placed either above or below the character [4].
Arabic script is written from right to left, so it is essential to recognize the words in the same direction. Because of these challenges and the characteristics that distinguish Arabic writing from other scripts, techniques developed for recognizing other languages are difficult to apply to Arabic. Therefore, we implemented a new model to recognize offline Arabic handwritten text.
Arabic text recognition systems can either be based on segmenting the word (analytical approach) or operate without segmentation. Most current systems are segmentation-based and require segmenting the word into individual characters [5]. After the segmentation process, each character is recognized. However, due to the cursive nature of Arabic handwritten text, it is challenging to segment words into characters [6]. On the other hand, segmentation-free models (holistic approach) recognize words as whole-word images without any segmentation process. The holistic approach is preferred for data with small vocabulary sizes, such as the recognition of bank checks, while the analytical approach is better suited to recognition systems with large vocabularies [7,8].
Traditional approaches to developing Arabic handwritten recognition systems are based on shallow learning techniques. These techniques struggle with the challenges of recognizing Arabic handwritten words because they extract only simple features of the word image. Deep learning approaches perform better in many systems since they can extract more complex features from the word image [9,10]. Thus, deep learning approaches are helpful for handling the challenges of recognizing Arabic handwritten words [11,12].
The recognition of Arabic characters poses persistent challenges due to several factors, and research is ongoing to enhance the performance of current systems. Several methodologies are constrained to proprietary datasets or to the recognition of individual words or paragraphs, which complicates the evaluation of their effectiveness on authentic Arabic documents [13].
Within the field of deep learning, several architectures have become effective instruments for different purposes, including image and text recognition. ResNet is a convolutional neural network (CNN) architecture that introduced the idea of residual learning to solve the vanishing gradient problem, thereby enabling the training of very deep networks [14]. BiLSTM networks, a type of recurrent neural network, acquire contextual information and long-term dependencies from sequences by processing data both forward and backward. Tasks such as text recognition and natural language processing benefit greatly from this property of BiLSTMs [15,16].
When dealing with sequence-to-sequence problems in which the alignment of the input and output sequences is unknown, the CTC technique is a useful complement to these architectures [17]. CTC can predict sequences of varying lengths without requiring pre-segmented input, which is why it finds widespread use in speech and handwriting recognition. The combination of CNNs’ feature extraction capabilities, RNNs’ sequence modeling capabilities, and CTC’s adaptive sequence alignment provides a solid basis for tackling difficult recognition tasks [18].
This work implements a segmentation-based model using deep learning techniques to achieve a high accuracy rate. The system is evaluated on the King Fahd University of Petroleum and Minerals (KFUPM) Handwritten Arabic Text database (KHATT) [19] and the Arabic Handwritten Text Images Database written by Multiple Writers (AHTID/MW) [20]. These are challenging datasets of text line images that cover the writing styles of many different writers; the system recognizes Arabic handwritten text and words from a text line image.
The contributions of the study are as follows:
We implemented a ResNet model to extract the features of every individual character in the text image. A BiLSTM with CTC was employed for sequence modeling. Finally, a language model (LM) was employed in the post-processing phase to improve the predicted output of the classification phase.
We tested the model on two distinct datasets, KHATT and AHTID/MW. The use of several datasets underscores the model’s capacity to generalize across diverse forms of Arabic handwriting.
The subsequent sections of the article are organized as follows: Section 2 presents the literature review of OCR systems, while Section 3 provides an outline of the study approach. The findings are presented and examined in Section 4. Section 5 concludes the study, highlighting specific limitations and prospective areas for further research.
2. Literature Review
Many optical character recognition systems have been designed to recognize Arabic handwritten characters and words. CNNs have been widely employed in handwriting recognition due to their capacity to autonomously learn hierarchical features from raw pixel input [21,22]. A combination of two classifiers, a CNN and a Support Vector Machine (SVM), with a dropout regularization technique was used to recognize offline Arabic handwritten characters [23]. The authors use the SVM as the trainable classifier of the CNN, applying dropout before supplying the output to the SVM classifier. They tested their design on a character dataset, the handwritten Arabic characters database (HACDB), and a word dataset, IFN/ENIT. The experimental results showed a 5.83% classification error rate on the HACDB dataset and a 7.05% classification error rate on the IFN/ENIT dataset.
An Arabic handwriting recognition system based on multiple BiLSTM-CTC combinations was proposed in [24]. Two different feature extraction techniques were used: segment-based features and Distribution-Concavity (DC)-based features, trained on different levels of the BiLSTM-CTC combination. The combination levels were low-level, mid-level, and high-level fusion. The experiments were performed on the KHATT dataset, and the results showed that high-level fusion yielded a better recognition rate than the other combination levels, with a 29.13% word error rate (WER) and a 16.27% character error rate (CER).
BenZeghiba, Louradour, and Kermorvant used a hybrid Hidden Markov Model (HMM) and Artificial Neural Network (ANN) framework to recognize Arabic handwritten text [25]. The ANN used in their system is a Multi-Dimensional Long Short-Term Memory network (MDLSTM). The hybrid model extracts the pixel values of text line images by scanning them in four directions, and CTC is used during training. The Viterbi algorithm [26], a decoding strategy, is used to generate the best hypothesis of a character sequence. They added a hybrid language model that consists of words and Parts of Arabic Words (PAWs). The KHATT dataset was used to evaluate the system, and the result was a 33% WER.
A recognition system for Arabic handwritten text was proposed by Stahlberg and Vogel [27]. A sliding window is used to extract features from text line images; the window’s width is 3 pixels with an overlap of 2 pixels. The features are extracted using two different strategies: pixel-based features extracted from raw grayscale pixel values, and segment-based features consisting of centroid and height features. The Kaldi toolkit, which is used in speech recognition systems and is based on deep neural networks, is used for classification [28]. The best result was obtained with the pixel-based features, with a 30.5% WER on the KHATT corpus.
Wigington et al. introduced two data augmentation and normalization methods: a novel profile normalization strategy for both word and line images, and an augmentation of existing text images using random perturbations on a regular grid [29]. These techniques were used with a CNN-LSTM architecture to enhance handwritten text recognition. Children’s handwriting has distinctive characteristics that differ from those of adults; therefore, the model in the study by Altwaijry et al. [30] was trained on children’s handwriting.
The work by Khayati et al. [31] examines the design and efficiency of several CNN architectures in tackling the distinct difficulties presented by Arabic script, such as cursive writing and the presence of diacritical marks. Lamia et al. [11] developed a CNN-graph theory method for Arabic handwritten character segmentation, addressing the difficulty of segmenting connected and overlapping cursive Arabic letters, a major obstacle to effective character recognition. In a study by AlShehri [32], a deep neural network for Arabic handwritten recognition (DeepAHR) improves feature extraction and recognition with a sophisticated network design. The model handles Arabic script character segmentation and contextual shape changes well, and DeepAHR outperformed prior models in accuracy and processing speed.
In another study, by Alghyaline [33], Arabic handwritten recognition was implemented using different pretrained CNN models, such as Visual Geometry Group (VGG), ResNet, and Inception, on three different datasets: Hijja, the Arabic Handwritten Character Dataset (AHCD), and the AlexU Isolated Alphabet (AIA9K). The VGG model achieved accuracies of 93.05%, 98.30%, and 96.88% on these datasets, respectively. The transformer transducer and the typical transformer design that makes use of cross-attention were the two end-to-end architectures explored in the work by Momeni and BabaAli [34]. They employed the KHATT dataset and obtained a CER of 18.45%.
Table 1 summarizes recent works on OCR.
3. Materials and Methods
The Arabic handwritten text recognition system should have multiple stages to convert a handwritten text image into a digital format. This system consists of four consecutive processing stages: preprocessing, feature extraction, classification, and post-processing, as shown in Figure 1. The output of each stage is used as the input to the stage that follows.
First, preprocessing techniques are applied to the scanned image to improve the readability of the text. The ResNet model is then used to extract the features of each character in the text image; these features are the input to the classification stage. The BiLSTM-CTC network converts the visual features into contextual features and predicts the sequence of characters with the help of the predefined classes in the database. Finally, an LM is used in the post-processing stage to enhance the predicted result from the classification stage.
Figure 2 shows the workflow of our model’s architecture. Each stage will be discussed in detail in the following subsections.
3.1. Datasets Description
The number of available Arabic handwritten text databases is limited. In our model, we used two different datasets, the KHATT and AHTID/MW datasets, to train and test our model. These datasets contain all the Arabic characters written in different writing styles by different writers.
3.1.1. KHATT
KHATT is one of the most challenging Arabic handwritten text databases, published by KFUPM [19]. It is an offline Arabic handwritten text database consisting of text line images extracted from handwritten forms filled out by 1000 different writers. The writers vary in region, gender, age, left/right-handedness, and educational background. The database consists of 300 Dots Per Inch (DPI) grayscale images of 2000 unique text paragraphs (randomly selected) and 2000 fixed text paragraphs (similar text). The database also contains 300 DPI binary text line images extracted from the paragraphs [35].
Figure 3 shows some samples from the KHATT dataset.
3.1.2. AHTID/MW Database
The AHTID/MW dataset, developed by [20], includes 3710 handwritten Arabic text lines and 22,896 words written by 53 native Arabic writers of different ages and educational levels. The dataset consists of grayscale text line images with 300 DPI resolution and contains a variety of Arabic handwriting styles, as shown in Figure 4.
With samples that vary in writing style and sufficient data for training and assessment, both the KHATT and AHTID/MW datasets provide invaluable resources for Arabic OCR research. The main difficulties related to these datasets are handling the cursive character of Arabic script, the existence of diacritics, and the variation in individual writing styles. Confronting these problems is essential for creating efficient OCR systems that achieve accurate results across various handwriting samples.
Table 2 shows the statistics of the dataset used for the study.
3.2. Preprocessing
The preprocessing step for scanned text images is critical, as it improves the accuracy of handwritten text recognition systems. First, image binarization is applied to the grayscale images. Each pixel takes a value ranging from 0 to 255; binarization converts the image to black and white, represented by 1 and 0, respectively.
Arabic text line image datasets have high skew and extra white space. We removed the extra white regions by scanning the image from top to bottom to locate the position of the highest black pixel and the position of the lowest black pixel. After detecting the highest and lowest black pixels, the text line images are cropped [36]. The same process is repeated from left to right.
Figure 5 shows a sample image from the KHATT dataset after removing the white spaces.
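For illustration, the following minimal sketch shows how binarization and white-space cropping might be implemented with OpenCV and NumPy. The use of Otsu thresholding and the function name are our assumptions, since the text does not specify the binarization method.

```python
import cv2
import numpy as np

def binarize_and_crop(path):
    """Binarize a grayscale text-line image and crop surrounding white space."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu thresholding (assumed); THRESH_BINARY_INV makes ink (black) pixels 1.
    _, binary = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Rows/columns that contain at least one foreground (ink) pixel.
    rows = np.where(binary.any(axis=1))[0]
    cols = np.where(binary.any(axis=0))[0]
    if rows.size == 0:  # blank image: nothing to crop
        return binary
    # Crop from the first to the last ink pixel in both directions.
    return binary[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```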
Handwritten text is freestyle; therefore, noise or unwanted marks such as random lines and dots may exist in the text image. Some images in the dataset contain a straight horizontal line above the text. In this work, we removed the horizontal line by computing an image difference in the horizontal direction; from this difference, we can find the horizontal line by searching for a continuous run of difference values. Doing so, we can isolate the horizontal line by setting a threshold value that determines whether a horizontal region can be considered a line. If the length of a detected horizontal line is greater than the image width divided by 10, the line is removed from the image.
Figure 6 shows a sample image from the KHATT dataset where the upper horizontal line is removed from the text line image.
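A simplified sketch of this idea is shown below: differences along each row locate runs of consecutive ink pixels, and runs longer than one tenth of the image width are erased. The run-detection details are our interpretation of the procedure described above.

```python
import numpy as np

def remove_horizontal_lines(binary):
    """Remove long horizontal strokes (e.g., ruled lines above the text).

    `binary` is a 2-D array with ink pixels equal to 1. A run of consecutive
    ink pixels in a row is treated as a line if it is longer than one tenth
    of the image width, following the threshold described in the text.
    """
    height, width = binary.shape
    min_len = width // 10
    cleaned = binary.copy()
    for y in range(height):
        # Differences of the zero-padded row mark run starts (+1) and ends (-1).
        padded = np.concatenate(([0], cleaned[y], [0]))
        diff = np.diff(padded)
        starts = np.where(diff == 1)[0]
        ends = np.where(diff == -1)[0]
        for s, e in zip(starts, ends):
            if e - s > min_len:      # long enough to be a ruled line
                cleaned[y, s:e] = 0  # erase the run
    return cleaned
```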
Noise filtering techniques are applied to remove noise from the text line images. Min and Max filters, also known as erosion (minimum) and dilation (maximum) filters, are used in the preprocessing stage to remove noise from the text images efficiently [37]. These filters are morphological transformation filters defined over a neighborhood around each pixel; erosion and dilation are the two basic morphological operators [38].
First, erosion is applied to the text image to erode the foreground object. This makes it smaller; that is, it removes small pixels (noise) near the boundaries of the foreground object (the characters). Then, dilation is used to increase the size of the foreground pixels (the characters). We use erosion followed by dilation because erosion removes the noise in the text image but also shrinks the characters; therefore, after the noise is removed, we dilate the image. The dilation process enhances the distinctness of the characters and helps join broken parts of the text image. An example of the Max and Min filters applied to a sample image from the KHATT dataset is shown in Figure 7.
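This erosion-then-dilation sequence is the classical morphological opening. A minimal sketch with OpenCV is shown below; the 3 × 3 kernel size is our assumption, as the text does not specify the neighborhood size.

```python
import cv2
import numpy as np

def denoise(binary):
    """Morphological opening: erosion removes small specks, then dilation
    restores the stroke thickness of the surviving characters."""
    kernel = np.ones((3, 3), np.uint8)  # assumed neighborhood size
    eroded = cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(eroded, kernel, iterations=1)
```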
Image normalization is then performed, which helps reduce the text image’s skew and facilitates the visual learning of character features. Arabic text has two baselines, the upper and lower baselines, as shown in Figure 8. The two baselines identify the core zone; the upper region contains the ascenders, and the lower region contains the descenders. The core zone typically holds a significant fraction of the foreground pixels.
We used a method proposed by Stahlberg and Vogel [39] for baseline estimation, which finds stripes with a dense foreground in an image. First, we detect the baseline for the whole image and rotate the image so that the detected baseline is horizontal. After that, we split the image vertically into smaller slices, detect the baseline for each slice separately, and rotate each slice so that its baseline becomes horizontal. Finally, we concatenate all the slices, now sharing a straight horizontal baseline, into a single image.
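The cited method fits baselines directly; the following simplified sketch only approximates it by searching, per slice, for the rotation that maximizes the density of the darkest horizontal stripe. The angle range, slice count, and projection criterion are all our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-10, 10.5, 0.5)):
    """Pick the rotation angle whose horizontal projection has the densest
    single stripe, i.e., the sharpest baseline."""
    best_angle, best_peak = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary, angle, reshape=False, order=0)
        peak = rotated.sum(axis=1).max()  # ink count of the densest row
        if peak > best_peak:
            best_peak, best_angle = peak, angle
    return best_angle

def straighten_slices(binary, num_slices=8):
    """Deskew each vertical slice independently, then re-concatenate."""
    slices = np.array_split(binary, num_slices, axis=1)
    fixed = [rotate(s, estimate_skew(s), reshape=False, order=0) for s in slices]
    return np.concatenate(fixed, axis=1)
```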
3.3. Feature Extraction
Feature extraction is the second phase after the data have been preprocessed. Features are the core around which the whole system is built: they are the target of all previous stages and the input to the classification phase. Different feature extraction methods have been applied in Arabic text recognition systems. Some approaches used handcrafted feature extraction techniques based on statistical [40] or structural [41] features, while others computed both statistical and structural features [42,43,44].
Recently, the trend has shifted from handcrafted feature extraction methods towards machine learning techniques for feature extraction and text recognition. Deep networks, among the most advanced machine learning techniques, simulate human brain activity and automatically extract features from text images. However, deep models for Arabic handwritten text recognition are rare compared to other languages due to the complexity and cursive style of Arabic writing [45]. A convolutional neural network (CNN) is an artificial neural network used in the pattern recognition field for image processing and recognition. CNNs have proven effective at understanding image content, providing state-of-the-art image recognition and detection [46]. Since CNNs have shown the ability to learn interpretable and powerful features from an image [47], we adopted a CNN architecture in our model to extract features from the text image.
In our system, we used a ResNet-based model [48], a robust CNN architecture, to extract features from the text images. State-of-the-art CNN architectures have grown deeper each year since Krizhevsky et al. [49] presented AlexNet in 2012. While AlexNet consisted of only five convolutional layers, the VGG (Visual Geometry Group) network had 16–19 convolutional layers [50], and GoogLeNet consisted of 22 convolutional layers [51].
However, enabling the model to learn more and better features by increasing network depth is not as simple as stacking more layers together. Deep networks are hard to train because of the vanishing/exploding gradient problem: as the gradient is backpropagated to earlier layers in the network, repeated multiplication can make the gradient vanishingly small or explosively large [52,53].
The vanishing/exploding gradient problem makes it hard to learn and tune the parameters of the earlier layers in the network, which impedes convergence from the start. As a result, models with many layers may fail to learn on a given dataset, and network performance becomes saturated or even degrades rapidly as depth increases.
The ResNet model was introduced to overcome these issues. The main idea behind ResNet models is to use residual blocks to improve the accuracy of the models. Residual blocks are based on the concept of an “identity shortcut connection” that skips/bypasses one or more layers. The input of the residual block is denoted by x, and the output is H(x), the desired underlying mapping. The difference, or residual, between them is shown in Equation (1):

$$F(x) = H(x) - x \quad (1)$$

where F(x) is the mapping of the stacked nonlinear layers. The original mapping is rearranged into H(x) = F(x) + x; the added x acts as the residual, hence “residual block”. Therefore, ResNet solves degradation by adding the input of a layer to its output. As a result, ResNet improves the efficiency of deep neural networks with more layers and avoids poor accuracy as the model becomes deeper.
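As an illustration of the residual idea, a minimal PyTorch residual block might look as follows. The layer sizes are placeholders, not the exact configuration listed in Table 3.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked conv layers F(x) plus the identity shortcut: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                       # identity shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual               # H(x) = F(x) + x
        return self.relu(out)
```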
In our model, we built a 32-layer ResNet-based model to extract character features [54]. The details of the network are illustrated in Table 3. The convolutional layers in the table are shown in the following format: (kernel size, stride (width) × stride (height), pad (width) × pad (height), channels). The max-pooling layers are shown in the following format: (kernel size, stride (width) × stride (height), pad (width) × pad (height)). The residual blocks in the ResNet model are shown in Table 3 with a gray background in the following format: [kernel size, channels]. Each convolution layer in the residual blocks has stride 1 and zero padding.
The ResNet model is trained from scratch. The input of this stage is the normalized image. The output is a visual feature map containing each character’s characteristic features in the image, as shown in Figure 9 and Figure 10.
3.4. Classification
In this stage, after the features are extracted and the feature map, which encodes the qualities and characteristics of a sequence of characters, is produced, a classifier is used to generate characters with the help of predefined classes. Since we are dealing with long sentences that contain sequences of characters and words, we used Bidirectional Long Short-Term Memory (BiLSTM) networks to capture the contextual information of the sentence.
The BiLSTM model was proposed by Graves and Schmidhuber [55] and is robust as a classifier and in sequence recognition models across different natural language processing (NLP) tasks, such as speech recognition [56], natural language understanding [57], machine translation [58], and sentiment analysis [59].
The BiLSTM model consists of two LSTMs that process sequence information in two directions: one takes the sequence of inputs in the forward direction (past to future), and the other takes it in the backward direction (future to past). BiLSTM therefore efficiently extracts full-text context information, since it has access to both the previous and the following context. Thus, we used BiLSTM in our Arabic handwritten text recognition system.
Moreover, multiple BiLSTMs can be stacked to form a deep BiLSTM model, which allows a higher level of abstraction over the data than a shallow model. Deep BiLSTM models have improved the performance of speech recognition systems [60].
After the last BiLSTM layer, each column of contextual features is mapped to an output label. The Connectionist Temporal Classification (CTC) output layer, proposed by Graves et al. [61], is adopted to predict the probability of an output label sequence. The CTC layer has several character outputs and one additional output known as ‘blank’. The ‘blank’ output helps the network avoid making decisions in uncertain zones, that is, in low-context areas, instead of being forced to constantly predict a character.
The Arabic language has 28 letters, and each letter has one to four forms. We added the 28 Arabic letters with all their forms, along with numbers and punctuation marks, and ended up with a class size of 135.
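A minimal PyTorch sketch of this sequence-modeling head is shown below, using the 512 hidden units and 135 output units of our configuration; the class/variable names are illustrative, and the dropout placement follows the description in Section 4.1.

```python
import torch
import torch.nn as nn

class SequenceHead(nn.Module):
    """Stacked BiLSTM over the CNN feature columns, followed by a
    per-column projection to the 135 CTC output units."""
    def __init__(self, feat_dim, hidden=512, num_layers=3, num_classes=135):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.2)                 # dropout after the BiLSTM stack
        self.fc = nn.Linear(2 * hidden, num_classes)   # 2x: forward + backward

    def forward(self, feats):
        # feats: (batch, seq_len, feat_dim) -- columns of the visual feature map
        context, _ = self.bilstm(feats)
        logits = self.fc(self.dropout(context))
        # CTC expects log-probabilities shaped (seq_len, batch, classes)
        return logits.log_softmax(dim=2).permute(1, 0, 2)
```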
The probability given by CTC is defined as follows. For an input sequence Y = (y_1, ..., y_T), where T is the length of the sequence, the probability of a label path π is defined in Equation (2) as:

$$p(\pi \mid Y) = \prod_{t=1}^{T} p(\pi_t \mid Y) \quad (2)$$

where p(π_t | Y) is the probability of having character π_t at time step t [62]. A sequence-to-sequence mapping function M is defined on the sequence π. The mapping function M maps π onto the final prediction output l by first removing the repeated characters and then removing the blanks. For example, M maps “--مم-ث--اا--للل--” onto “مثال”, where “-” represents blank. The conditional probability of l is defined as the total sum of the probabilities of all π that are mapped by M onto l, as shown in Equation (3):

$$p(l \mid Y) = \sum_{\pi : M(\pi) = l} p(\pi \mid Y) \quad (3)$$
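A sketch of greedy (best-path) decoding, which applies the mapping M to the most probable character at each time step, is shown below; the alphabet argument and blank index are illustrative assumptions.

```python
import torch

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Best-path CTC decoding: take the arg-max class per time step, then
    apply M (collapse repeats, drop blanks) to obtain the final string.

    log_probs: (seq_len, num_classes) log-probabilities for one image.
    alphabet:  string whose i-th character corresponds to class index i.
    """
    path = log_probs.argmax(dim=1).tolist()  # most likely class per step
    decoded, prev = [], None
    for idx in path:
        if idx != prev and idx != blank:     # collapse repeats, skip blanks
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)
```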
3.5. Language Model
To improve recognition accuracy, LMs are used in many NLP systems, such as handwritten text recognition and speech recognition. In our system, we used an n-gram language model, a statistical language modeling technique. The n-gram language model is a probabilistic model that predicts the probability of a sequence of words in a text.
N-gram language models are simple in structure, make it easy to calculate word occurrence probabilities, and perform best when trained on large amounts of data. In this work, a 3-gram language model was trained on the training corpora of the KHATT and AHTID/MW datasets. The KenLM language model toolkit [63] was used to build the 3-gram language model. The KenLM toolkit is faster and uses less memory than other existing toolkits such as SRI Language Modeling [64] and IRST Language Modeling [65], improving system runtime performance.
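As a sketch of how such a model can be built and queried: KenLM’s `lmplz` tool estimates the model, and the `kenlm` Python module scores sentences. The file names below are placeholders.

```python
# Build the 3-gram model from the training transcriptions (shell command):
#   lmplz -o 3 < train_transcriptions.txt > arabic_3gram.arpa

import kenlm

lm = kenlm.Model("arabic_3gram.arpa")

def rescore(candidates):
    """Return the candidate sentence with the highest LM log-probability."""
    return max(candidates, key=lambda sent: lm.score(sent, bos=True, eos=True))
```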
4. Results and Discussions
4.1. System Settings and Parameters
We used Python 3 with PyTorch tools and libraries to implement our model. The code was run on Amazon Web Services, using an Amazon Elastic Compute Cloud instance with a 16 GB NVIDIA V100 GPU.
The network configuration of our model is shown in Table 3. We used the ResNet architecture to construct 32 trainable layers: a combination of convolutional layers with ReLU (Rectified Linear Unit) activation functions, max-pooling layers with 2 × 2 filters, and batch normalization layers. Adding batch normalization is beneficial for training our deep neural network, as it stabilizes the learning process and accelerates training. Our system applies a dropout layer after the ResNet model with a dropout ratio of 0.2, and a second dropout layer after the BiLSTM layers, also with a dropout ratio of 0.2.
Dropout, a stochastic regularization technique, is applied in our neural network. The dropout technique helps prevent overfitting and reduce interdependent learning amongst the neurons in neural networks by dropping out units (i.e., neurons) from the neural network during the training process.
The output of the ResNet model, which contains the sequence of visual features extracted from the normalized text line images, is fed into the BiLSTM model with 512 hidden units to generate the contextual sequence. Different depths of BiLSTM layers were used to compare the performance of our model as bidirectional LSTM layers are added: the first experiment used 2 BiLSTM layers, and the second used 3 BiLSTM layers. The BiLSTM network is followed by the CTC decoder, which translates the contextual feature sequence into the character sequence. The CTC decoder has 135 output units to generate characters and predict words.
Finally, we added a 3-gram language model to our system to improve recognition accuracy, built with the fast and memory-efficient KenLM toolkit. The system compares the weights assigned by the CTC decoder and the LM, and the predicted word is replaced by the candidate with the highest weight.
For optimization, we adopted the Adadelta optimizer, a robust learning rate method that does not require manually setting a learning rate. We set the training batch size to 24, and all images were scaled to 1048 × 64 in both training and testing. The data were split into 80% for training, 10% for validation, and the remaining 10% for testing. The parameters used for training are listed in Table 4. Overfitting is monitored using the EarlyStopping method based on the validation loss value.
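A condensed sketch of this training configuration in PyTorch is shown below; the `evaluate` helper and the early-stopping patience are placeholders we introduce, not values from Table 4.

```python
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, evaluate,
          max_epochs=300, patience=10):
    """Train with CTC loss, Adadelta, and early stopping on validation loss.
    `evaluate` is an assumed helper returning the mean validation loss;
    `patience` is an assumed early-stopping setting."""
    criterion = nn.CTCLoss(blank=0, zero_infinity=True)
    optimizer = optim.Adadelta(model.parameters())  # no manual learning rate
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, targets, target_lengths in train_loader:  # batch size 24
            optimizer.zero_grad()
            log_probs = model(images)        # shape: (seq_len, batch, classes)
            input_lengths = torch.full((log_probs.size(1),), log_probs.size(0),
                                       dtype=torch.long)
            loss = criterion(log_probs, targets, input_lengths, target_lengths)
            loss.backward()
            optimizer.step()
        val_loss = evaluate(model, val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:       # early stopping on validation loss
                break
```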
4.2. Performance Evaluation
The performance of handwriting recognition systems is evaluated in terms of WER and CER, and we used these two metrics to assess our system. Both are based on the Levenshtein edit distance, the minimum number of edit operations (substitutions, insertions, and deletions) required to transform the output text into the ground-truth text. The WER is calculated as follows:

$$\mathrm{WER} = \frac{S + I + D}{N}$$

where S is the total number of substituted words, I is the total number of inserted words, D is the total number of deleted words, and N is the total number of words in the evaluation set.
The CER is calculated as follows:

$$\mathrm{CER} = \frac{S + I + D}{N}$$

where S is the total number of substituted characters, I is the total number of inserted characters, D is the total number of deleted characters, and N is the total number of characters in the evaluation set.
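A small sketch of how the CER can be computed from the Levenshtein distance is shown below; the WER is identical with word lists in place of character strings.

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `hyp` into `ref` (dynamic programming)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```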
4.3. Experimental Results
The last stage in developing a handwriting recognition system is testing. This process used the scaled images as input to the Arabic handwritten text recognition system. Two different datasets of Arabic handwritten text, KHATT and AHTID/MW, were used to cover all forms of Arabic text; therefore, characters and words with different forms and widths were used in our experiments.
The scaled images were passed through the ResNet model, followed by the BiLSTM-CTC layers and the language model post-processing stage. Table 5 shows the results of our system on the KHATT and AHTID/MW datasets with different numbers of BiLSTM layers.
The recognition rates improve on both datasets when using three BiLSTM layers: the WER is reduced by 4.29% for the KHATT dataset and by 5.37% for the AHTID/MW dataset. Therefore, the proposed model performs better with 3 layers of the BiLSTM network.
Additional experiments were performed on the AHTID/MW dataset to test the proposed system’s performance. As seen below, the best performance is obtained using 3 BiLSTM layers, which results in a 17.42% WER and a 6.6% CER.
Figure 11 and Figure 12 show the relation of the CER and WER, respectively, to the epoch number. As shown in both figures, the CER and WER decrease as the epoch number increases during training, up to epoch 300. The results of our proposed system on the KHATT and AHTID/MW datasets confirm its robustness.
To validate our system’s performance, we compared our results with the most recent works on Arabic handwriting recognition systems. Table 6 shows the results of recent works on the test sets of the KHATT and AHTID/MW datasets. The experimental results showed that our system achieved an impressive recognition accuracy, with a WER of 27.31% and a CER of 13.2% on the test set of the KHATT corpus. ResNet is a resilient CNN architecture well suited to extracting information from textual images; its residual blocks, built on the “identity shortcut connection” that skips or bypasses one or more layers, enhance the precision of deep models.
5. Conclusions
We proposed a model for recognizing Arabic handwritten text. The system aims to identify Arabic handwritten text accurately using machine learning approaches that imitate how the human brain recognizes text. The ResNet model was used for feature extraction, and BiLSTM-CTC sequence modeling was used for classification. These machine learning techniques overcome the limits of traditional methods based on shallow learning and hand-engineered features, and they help address the challenges of recognizing Arabic handwritten text. A 3-gram language model, built with the KenLM toolkit, was used in our system to improve the recognition accuracy of the handwritten text.
Our proposed model was evaluated on the KHATT and AHTID/MW datasets. The experimental results showed that our system had an impressive recognition accuracy, with a 27.31% WER and a 13.2% CER for the KHATT dataset and a 17.42% WER and a 6.6% CER for the AHTID/MW dataset.
Although our proposed methodology for Arabic OCR has been successful, there are still certain limitations. The proposed study employs only a CNN-based model; evaluating other transfer learning or transformer models is a future enhancement. Also, in future work, different datasets can be combined to mitigate the generalization problem.