1. Introduction
Error detection may help both native and non-native speakers produce correctly written text. Moreover, error detection has a wide range of applications, such as automatic essay scoring [
1], style and grammar suggestion [
2,
3], and language learning systems [
4]. Additionally, specific types of errors in a text may reflect the writer’s age, proficiency, dialect, and native language. Therefore, the output of error detection models could be used for creating a profile of a certain writer [
5]. Such profiles can be used for author identification, native language identification, or even estimating the writer’s level of education.
Using neural network models has shown significant results for different Natural Language Processing (NLP) tasks [
6]. There exist a variety of neural network architectures, ranging from Recurrent Neural Networks (RNNs) [
7] to more advanced Long Short-Term Memory networks (LSTMs) [
8] and other varieties, such as bidirectional LSTMs (BiLSTMs) [
9]. RNNs and their variants are very appealing for NLP as they specialize in learning from sequential data, including text. Hence, they are commonly used for sequence classification tasks and have also recently found success at automating grammatical error detection [
1] and correction [
10] through language modeling [
11] and machine translation [
12].
Although such approaches have been used to detect errors in English text, error detection based on neural networks has not yet been explored for Arabic. Essentially, earlier state-of-the-art systems for Arabic depend on combinations of rule-based and statistical approaches [
3]. However, taking into account the number and complexity of Arabic grammar rules, there is a need for more advanced approaches that can cope with these challenges.
In this paper, we treat error detection as an independent task and present the first experiments using neural network models for error detection in Arabic text. We mainly investigate and report the results of comparing three neural network architectures: SimpleRNN, LSTM, and BiLSTM. We also explore the effect of training datasets of different sizes on overall performance by providing additional training data to each of the models. In addition, we present the details of the dataset, which we created and later augmented specifically for the task of error detection in Arabic. The types of errors included in the dataset are syntax errors, morphological errors, semantic errors, linguistic errors, stylistic errors, spelling errors, and the use of informal as well as borrowed words. Erroneous words that appeared within some sentences in the corpus are presented in
Table 1, showing a sample for each type of error and the reason it is considered an error.
The rest of the paper is organized as follows: in
Section 2, we give a background and overview of the related work; the details of our corpus are presented in
Section 3; implementation and experiments are described in
Section 4; results from our experimentation are illustrated in
Section 5;
Section 6 provides a discussion and analysis of the results;
Section 7 concludes the paper along with limitations and future work.
4. Implementation and Experiments
In this section, we present the details of our implementation and experiments. First, we present the toolkit we used as well as the corpus preparation process. Then, we describe the neural network architectures we used for the experiments. Next, we describe the hyperparameters we tuned and the measures we used for evaluation. Lastly, we present our baseline.
4.1. Toolkit and Corpus Preparation
Keras with a TensorFlow backend is used for all the experiments; specifically, Keras version 2.2.4 and TensorFlow version 1.12.0. For the use of neural networks in Keras, input sequences need to be of equal length for modeling. Therefore, we pad our sentences to a uniform maximum length, which is set to 17, since this is the length of the longest sentence in the corpus. Additionally, label sequences were padded with zeros to the maximum length. The model learns that the padding values carry no information.
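For illustration, the following is a minimal sketch of this preparation step in Keras; the toy sentences, the variable names, and the convention that label 1 marks an incorrect word are assumptions for the example only.

```python
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 17  # length of the longest sentence in the corpus

# Toy integer-encoded sentences and their token-level labels
# (here, 1 = incorrect word, 0 = correct word; an illustrative convention).
sentences = [[12, 5, 40, 7], [3, 18, 9]]
labels = [[0, 1, 0, 0], [0, 0, 1]]

# Pad both the input sequences and the label sequences with zeros
# to the uniform maximum length.
X = pad_sequences(sentences, maxlen=MAX_LEN, padding='post', value=0)
y = pad_sequences(labels, maxlen=MAX_LEN, padding='post', value=0)
```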
4.2. Neural Network Architectures
Neural networks have been used for various sequence-labeling NLP tasks, such as Part-of-Speech (POS) tagging and Named-Entity Recognition (NER). Similarly, we investigated treating error detection for Arabic text as a sequence-labeling task by utilizing token-level labeled Arabic sentences. For the sequence-labeling model, we built a neural network that takes a sequence of tokens (words) as input and outputs, for each word in the sentence, the probability of it being erroneous. We explored a few types of neural networks, including SimpleRNN, LSTM, and BiLSTM.
We used a similar sequence of layers for every experiment while varying the hyperparameters and RNN type (i.e., SimpleRNN, LSTM, or BiLSTM) for comparison. A SimpleRNN layer is the plain vanilla RNN in Keras, where the output is fed back to the input.
Models are basically composed of an input layer, a single embedding layer, a dropout layer, a SimpleRNN layer (or 1–2 LSTM layers or 1–2 BiLSTM layers) with recurrent unit dropout, and a TimeDistributed-wrapped Dense output layer with sigmoid activation. The sigmoid function outputs a probability for each class as a value between 0 and 1.
For our task, we trained word embeddings using the Embedding layer in Keras [
34]. The Embedding layer receives the input sequence and encodes it into a sequence of vectors, thereby learning an embedding for all the words in the training dataset. The advantage of this approach over using pre-trained embeddings is that the learned vectors are targeted to the specific corpus as well as to the error detection task.
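As a concrete illustration, the following is a minimal Keras sketch of this layer stack in its BiLSTM form; the vocabulary size, embedding size, number of units, and dropout rates are placeholder values, not the tuned settings reported in Table 6.

```python
from keras.models import Sequential
from keras.layers import Embedding, Dropout, LSTM, Bidirectional, TimeDistributed, Dense

VOCAB_SIZE = 5000  # placeholder vocabulary size
EMBED_DIM = 100    # placeholder embedding size
UNITS = 25         # placeholder number of recurrent units
MAX_LEN = 17       # maximum sentence length

model = Sequential()
# The Embedding layer learns task-specific word vectors from the training corpus.
model.add(Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN))
model.add(Dropout(0.5))
# Swap Bidirectional(LSTM(...)) for SimpleRNN(...) or LSTM(...) to obtain the other
# variants, optionally stacking a second recurrent layer with return_sequences=True.
model.add(Bidirectional(LSTM(UNITS, return_sequences=True, recurrent_dropout=0.2)))
# TimeDistributed Dense with sigmoid activation outputs a per-token probability.
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.summary()
```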
4.3. Hyperparameters
Empirically chosen values are used as candidates for the model hyperparameters. A manual random search method is used to find the best hyperparameter combinations, taking into consideration the small corpus size and potential overfitting. We observed the effects of altering these values on performance in order to develop the best possible model given the available corpus size. For example, parameters such as embedding size, batch size, and number of units were varied. Moreover, we explored multi-layer variants of each of the neural network models investigated.
Table 4 provides a summary of the hyperparameters used and their associated values.
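The sketch below illustrates how such a manual random search over candidate values might look; the candidate values shown here are purely illustrative and are not the actual candidates listed in Table 4.

```python
import random

# Illustrative candidate values only; see Table 4 for the actual candidates.
search_space = {
    'embedding_size': [50, 100, 200],
    'batch_size': [8, 16, 32],
    'units': [25, 50, 100],
    'recurrent_layers': [1, 2],
}

random.seed(42)
# Sample a handful of combinations, then train and compare models manually.
for _ in range(5):
    config = {name: random.choice(values) for name, values in search_space.items()}
    print(config)  # a model would be built, trained, and evaluated with this configuration
```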
The models are optimized using Nadam, which is an Adam optimizer [
35] variant with Nesterov momentum [
36]. We left the parameters of this optimizer at their default values as recommended by Keras documentation [
37]. Training is stopped when the validation loss stops decreasing over two epochs.
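A minimal sketch of this optimization and early-stopping setup, assuming the model object from the architecture sketch above; the plain binary cross-entropy loss here is only a placeholder for the weighted variant described next, and the epoch and batch-size values are illustrative.

```python
from keras.optimizers import Nadam
from keras.callbacks import EarlyStopping

# Nadam with its Keras default parameters, as recommended by the documentation.
model.compile(optimizer=Nadam(), loss='binary_crossentropy', metrics=['accuracy'])

# Stop training once the validation loss has not decreased for two epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=2)

# Targets need a trailing axis to match the TimeDistributed Dense(1) output shape:
# model.fit(X_train, y_train[..., None], validation_data=(X_val, y_val[..., None]),
#           epochs=50, batch_size=16, callbacks=[early_stop])
```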
Moreover, we optimized the model by minimizing cross-entropy between the predicted label distributions and the annotated labels. However, our dataset is imbalanced because there are far more correct words than incorrect words. Therefore, we defined and used a weighted binary cross-entropy function [
38]. This function allowed trading off between recall and precision by adjusting the weight of the positive (error) class relative to the negative class.
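A minimal sketch of such a weighted binary cross-entropy loss, written with the Keras backend; the weighting scheme and the pos_weight value are illustrative and may differ from the exact formulation of [38].

```python
import keras.backend as K

def weighted_binary_crossentropy(pos_weight):
    """Binary cross-entropy that scales the positive (error) class by pos_weight."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        # Standard BCE terms, with the positive-class term multiplied by pos_weight.
        bce = -(pos_weight * y_true * K.log(y_pred)
                + (1.0 - y_true) * K.log(1.0 - y_pred))
        return K.mean(bce)
    return loss

# Example: up-weight the error class, trading some precision for recall.
# model.compile(optimizer='nadam', loss=weighted_binary_crossentropy(pos_weight=5.0))
```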
4.4. Evaluation
To train a model, data can be split into training and testing sets. However, doing so on small datasets would cause testing scores to vary depending on which parts of the data were used for training and which were used for testing. Therefore, using this approach on small data in particular would not provide reliable results [
39].
Therefore, to make better use of our corpus, we used 10-fold cross-validation, and the precision and recall averaged over all folds of a single model were calculated and collected for analysis. Various research papers apply k-fold cross-validation when dealing with smaller datasets [
40,
41,
42,
43,
44,
45,
46].
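A minimal sketch of this cross-validation loop; build_model() and score_fold() are hypothetical helpers standing in for model construction/training and per-token precision/recall computation, and X and y are the padded arrays from the preparation step above.

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
precisions, recalls = [], []

for train_idx, test_idx in kf.split(X):
    model = build_model()  # hypothetical helper: builds and compiles one configuration
    model.fit(X[train_idx], y[train_idx][..., None], epochs=50, batch_size=16, verbose=0)
    p, r = score_fold(model, X[test_idx], y[test_idx])  # hypothetical per-fold scoring
    precisions.append(p)
    recalls.append(r)

# Averages over the 10 folds are collected for analysis.
print('avg precision: %.3f, avg recall: %.3f' % (np.mean(precisions), np.mean(recalls)))
```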
Additionally, we calculated and observed F0.5 because, for the task of error detection, feedback accuracy (i.e., precision) has a higher priority than coverage (i.e., recall). Other evaluation measures used in the literature, such as the M2 scorer, were not directly applicable to error detection because they require that a system suggest a correction.
Additional code was written to evaluate precision, recall, and F0.5 by comparing true labels to predicted labels and computing true positives, true negatives, false positives, and false negatives.
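A minimal sketch of such an evaluation routine, assuming flattened arrays of gold and predicted token labels with padded positions already removed, and treating label 1 (incorrect word) as the positive class:

```python
import numpy as np

def precision_recall_f05(y_true, y_pred):
    """Token-level precision, recall, and F0.5 with label 1 as the positive (error) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # errors correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))  # correct words flagged as errors
    fn = np.sum((y_pred == 0) & (y_true == 1))  # errors missed by the model
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    beta2 = 0.5 ** 2
    f05 = ((1 + beta2) * precision * recall / (beta2 * precision + recall)
           if (precision + recall) else 0.0)
    return precision, recall, f05

print(precision_recall_f05([0, 1, 0, 1, 0], [0, 1, 1, 0, 0]))  # (0.5, 0.5, 0.5)
```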
4.5. Baseline
For our baseline, we tested the corpus sentences using a commercially available rule-based Arabic grammar checker (Microsoft Word 2007). We compared the errors detected by the grammar checker with the golden labels of the corpus to obtain the results. Consequently, precision for the baseline was 75.6%, while recall was 91.96% and F0.5 was equal to 78.39%.
5. Results
In this section, we present our experimental results using the corpora we created. The first part describes the results obtained from training and testing the models using 494 sentences. The second part shows the results when the best-performing model configurations are used to train and test on the augmented corpus of 620 sentences. Lastly, the third part reports model performance when tested on 50 sentences from QALB.
5.1. Results of the Models
Table 5 presents the best results from each type of neural network we investigated using our un-augmented corpus and compares them with other works. The BiLSTM model performed the best of all the systems we developed in terms of precision and F0.5, and SimpleRNN performed better than LSTM in terms of precision and F0.5. LSTM outperformed the other models in terms of recall. Unlike LSTM, SimpleRNN is not well suited to processing long sequences, such as text [39]. Therefore, we had expected LSTM to perform better in our experiments. However, one reason that SimpleRNN actually performed better than LSTM might be that most of the sentences in the training set are short, the longest being only 17 words. Additionally, because the task is relatively simple (i.e., sequence labeling) and the training set was small, a simpler network was more suitable. LSTM would most probably perform better on more complex tasks and datasets [39].
Additionally, LSTM outperformed the others in recall because it was better than the other models at not flagging correct words as mistakes; that is, it was better at recognizing that correct words were in fact correct.
LSTM, BiLSTM, and SimpleRNN all outperformed the baseline in precision and F0.5. However, the baseline had higher recall than SimpleRNN and BiLSTM.
Table 6 shows the hyperparameter settings for each of the models in
Table 5. We noticed that BiLSTM needed only one layer of size 25, the smallest model of the three, to perform the best, implying that BiLSTM was more powerful in selecting the most important structure in the input data to model while remaining small in size.
5.2. Increasing Corpus Size
Since we were dealing with quite a small training corpus, it seemed that the size of the corpus would be a limiting factor. Therefore, we decided to manually augment our corpus (explained in
Section 3.2) and use it to examine the effect of increasing the amount of data. For training, we used the best configuration of each model (as in
Table 6) in order to see the impact on performance when more data are used.
Table 7 presents the results of this experiment. In comparison to
Table 5, all three models actually performed significantly better when trained on more data, even though the augmented corpus was only 25.5% larger than the original. Additionally, with more data, LSTM outperformed SimpleRNN over all measures, indicating that LSTM handles larger data better. While BiLSTM achieved better recall when trained on the augmented corpus, it had the lowest recall of the three, meaning it produced more false negatives. Moreover, all three neural networks outperformed the baseline in recall and F
0.5.
5.3. Testing on QALB
In order to further validate our model, we used a portion of the Qatar Arabic Language Bank (QALB) corpus [
51] for testing. We did not initially use it to train our models, mainly because the annotation in QALB was targeted at error correction, which was not suitable for our task (error detection). Additionally, we wanted to support efforts produced by language experts, add to the existing number of Arabic corpora, and investigate sequence labeling for detecting errors in Arabic on a smaller scale and with a simpler model.
QALB is, to the best of our knowledge, the only corpus for Arabic grammar checking. We took 50 MSA sentences from the testing data of QALB 2014 and used only the first 10–17 words of each sentence to mimic the lengths of the sentences in the training corpus and to speed up the annotation process. This resulted in 516 words and 781 tokens for the QALB test corpus. Punctuation marks were eliminated, and punctuation edits were not taken into consideration. Furthermore, labeling was done manually, as for our training corpus. In order to label this corpus, we depended on the original annotation in QALB to guide the labeling process. For example, an edited word in a QALB sentence was labeled as “i” in the token-level labeled test corpus.
We ran each model with its best configuration (as in
Table 6) in order to predict the labels of the words in our QALB test set.
Table 8 shows the results of this experiment. While LSTM had the lowest performance in terms of precision and F
0.5 on our training set, it performed the best when tested on QALB. On the contrary, BiLSTM had the lowest performance on QALB over all measures. This might be a sign that the BiLSTM model lacked regularization or further tuning.
6. Discussion and Analysis
For error detection in Arabic, Tomeh et al. [
26] approached the task as a sequence labeling problem by training an SVM classifier using morphological and lexical features, such as lemmas and POS tags. The model labels the incorrect words in the text with a tag before passing them on for correction. The recall of the error detection model was 80%, whereas its precision was 34.5%. Both scores are considered low in comparison to the results of the neural network models in
Table 5.
On the other hand, the first notable application of neural sequence labeling to the task of error detection was done by Rei and Yannakoudakis [
1]. They performed neural sequence labeling on English corpora and reported the results using precision, recall, and F
0.5. They achieved the best result on BiLSTM with F
0.5 = 41.1%, which was improved to F
0.5 = 41.88% when integrating the character-level component in [
24]. Similarly, Yannakoudakis et al. [
47] treated error detection as a neural sequence labeling task with the character-level component, while using a small set of features that were easily computed and needed no linguistic input. Their detection model achieved an F
0.5 score of 51.14%. On the other hand, Liu et al. [
48] utilized Conditional Random Fields for sequence labeling to automate Chinese grammatical error detection without the use of any external features. Yuan et al. [
49], however, did incorporate some external features into their sequence labeling model for English error detection and achieved 82.52% in F
0.5. Additionally, Bell et al. [
50] used BiLSTM sequence labeling for detecting errors, but the model used a forward language model to predict the following token and a backward language model to predict the preceding token, achieving F
0.5 = 63.57%.
All of these works utilized multiple large corpora and more complex models for training and testing. However, we believe that ours is the first application of neural sequence labeling to Arabic text. Considering the small size of the training corpus, we had expected that precision would be easier to achieve than recall. On the contrary, the results revealed a noticeable trend throughout all experiments: recall was mostly easier to achieve than precision. This is most probably due to the shortness of the sentences in the training set.
In addition, we found that smaller batches performed better than larger batches on our small dataset, which agrees with the work in [
52]. Additionally, configurations using smaller batches were able to generalize better on new data than those using larger batches. This was apparent in our results for SimpleRNN and LSTM.
Table 9 presents four examples and their labels in five cases: gold standard and the baseline commercial Arabic grammar checker, as well as the testing results of SimpleRNN, LSTM, and BiLSTM models. Additionally, the specific type of error for each example is shown, as was mentioned in [
32].
Overall, when observing the predictions on the test sets, BiLSTM had the best performance; it was better at learning sentence structure in general, since LSTMs are capable of handling long-term dependencies by applying a linear update to the internal cell representation, unlike SimpleRNN. More importantly, BiLSTMs are able to incorporate context from both sides of every token.
SimpleRNN and LSTM also performed well; they were able to correctly learn and identify that correct words were correct. This is due to the fact that there were far more correct words in the training corpus than incorrect words. This might be an indication that a more balanced corpus would have enabled the models to learn both correct and incorrect words equally well. For the same reason, BiLSTM predicted most words as correct, as can be seen in the first two examples in
Table 9. However, the difference here is that BiLSTMs learn the sentences forwards as well as backwards, which enabled the model to learn more correct word usage. At the same time, it was more accurate than SimpleRNN and LSTM when predicting that incorrect words were incorrect, as can be seen in the third example in
Table 9. Although the examples in the table show similar results for the baseline and BiLSTM, the commercial checker actually missed many erroneous words, considering them correct, as shown in the fourth example. This indicates that a rule-based checker may not be sufficient for detecting particular mistakes, because it is difficult to construct fully exhaustive rules, especially when writing mistakes in a language are not consistent.
7. Conclusions, Limitations and Future Work
In this paper we focused on the detection of errors in MSA text using neural sequence labeling. We presented a background and related work on Arabic language and approaches to Automated Grammatical Error Detection and Correction for Arabic and other languages. We also described the corpora we created for this task and presented our experimental setting. Finally, we showed our results and discussed them.
One of the main limitations of this project was the lack of data, which in turn called for using cross-validation. Although we were satisfied with the results, the models were simplified in order to accommodate the small dataset. However, we believe that larger data and larger networks would be able to generalize better when presented with new data. Additionally, the sentences in our corpora were short (a maximum of 17 words); however, we believe that the model would have benefited from longer sentences in order to learn from longer sequences of words.
Furthermore, the fact that neural networks are stochastic requires fixing the seeds of the random number generators. It has been shown that results may change when different seed values are used. Therefore, every set of hyperparameter configurations needs to be evaluated a number of times, once per seed value, such that the results are averaged and compared. Between 5 and 30 randomly chosen seeds could be used. This process enables drawing conclusions and making fair comparisons between different models [
52]. In our case, we fixed a single seed value (42) for all experiments because, due to the lack of resources, we could not run every configuration multiple times with different seeds. Nevertheless, this does not change the fact that our experiments are reproducible.
The results shown in this paper can be extended and improved in a number of ways. We suggest using a larger training set with longer sentences, conducting more experimentation with more hyperparameters, using deeper networks with larger embedding sizes, and running configurations multiple times with a variety of seed values and reporting the averaged results. Additionally, evaluating the performance of the different proposed neural network architectures per error type could be useful and may provide a vehicle for comparing systems. Moreover, incorporating POS tags in the neural network could further improve error detection.
As far as we know, we are presenting the first results of applying neural sequence labeling on Arabic text. Additionally, these are the first results of using the A7′ta corpus [
33] in an experiment. We believe that the configurations we reported would show good performance on any other comparable Arabic corpus. The code for the experiments is available at
https://gist.github.com/iwan-rg/4e7f522a53e664607c2a3e664f4c076a.