1. Introduction
Speech is a powerful tool that shapes our world, serving as a conduit for conveying ideas and thoughts through vocal sounds, word formation, and varied rhythmic expression. Speech processing is the study of speech signals and of the methods used to convert, manipulate, recognize, and process those signals for tasks such as speech recognition, translation, and emotion identification. An automatic speech recognition (ASR) model recognizes or transcribes a sequence of voice or acoustic data into word segments [
1]. Despite rapid technological advancements, voice recognition remains a challenging process with numerous criteria, making the ideal speech-to-text conversion a distant goal [
2]. In many commercial applications, the traditional human–machine interface has been replaced by modern ASR systems, which translate spoken words into text by drawing on advances in technology and linguistics. Arabic pronunciation varies across the globe depending on phonemes and diacritical marks, and speaking patterns are also significantly affected by dialectal differences. With over 380 million native speakers, Arabic is one of the most widely spoken languages worldwide. There are three primary varieties of Arabic, making it a very diverse language. The written and formally spoken registers are based on classical Arabic, the language of the Quran. Derived from classical Arabic, Modern Standard Arabic (MSA) is used in government, education, and the media. Conversely, the numerous dialects spoken in various Arabic-speaking countries are referred to as colloquial Arabic. These dialects can differ greatly, which frequently makes it difficult for speakers of different dialects to communicate with one another. Colloquial Arabic is the language of daily living and social contact, even though MSA is the official language in many Arab nations.
The foundation of comprehending and producing natural language is the capacity to anticipate the word or character that will appear next in a sequence. It is a basic task in natural language processing (NLP), with applications ranging from virtual assistants and chatbots to text generation and machine translation. Accurately predicting the next word or character is essential for many of these applications. LSTM networks are well known for their capacity to represent long-term dependencies in sequences, making them well suited to modeling language’s sequential structure. However, adding domain-specific knowledge can improve their performance further, and this is where ARABERT becomes useful. Through extensive pre-training on a large Arabic text corpus, ARABERT embodies a deep semantic comprehension of the language, and combining LSTM networks with ARABERT embeddings can therefore yield higher prediction accuracy. Next-word prediction uses deep learning techniques such as RNNs, transformer networks, and language models to forecast the most likely word in a sentence. As natural language processing and deep learning evolve, next-word prediction plays a critical role in creating intelligent systems that can comprehend and produce natural language. The rich morphology, intricate script, and variety of dialects of Arabic are the main causes of the gap in the state of the art of next-word and -character prediction for Arabic text. A pre-trained BERT model for the Arabic language aims to achieve state-of-the-art performance on most Arabic natural language processing (NLP) tasks [
3]. Using Natural Language Generation (NLG) and Natural Language Understanding (NLU) architectures, paper [
4] suggests an extractive Arabic text summarizer that preserves the elements of Arabic manuscripts by analyzing and extracting key phrases. The ROUGE metric and human assessment are used to compare the effectiveness of the suggested solution and to identify the most effective method for summarizing Arabic texts. Moreover, in robotics, speech recognition must process the human voice rapidly and intuitively and respond dynamically to spoken interaction. Generally, unraveling the complexities of words occurring in various contexts and integrating contextual information in Arabic remains a challenge, and more sophisticated models are needed to enhance the user experience in personalized recommendation systems, question answering, content creation, and other NLP applications.
2. Literature Review
In document [
5], the objective of automatic speech recognition is to give computers the best possible ability to recognize and understand human speech. Speech recognition models can be implemented using a variety of methods; one of the newest is supported by deep neural networks. Arabic receives comparatively little attention in speech recognition technology, despite being one of the most widely spoken languages. Paper [
6] provides an overview of the research on Arabic voice recognition. It also provides some insight into the resources and toolkits available for the development of Arabic voice recognition systems. Many products that effectively use automatic speech recognition to facilitate human–machine communication have been developed. Speech recognition applications perform worse in the presence of background noise or reverberation [
7]. Both audio and text transcriptions are used throughout the process of training automatic speech recognition (ASR) neural network systems. This work assesses representation quality in a variety of classification tasks by comparing phonemes and graphemes in addition to various articulatory properties. The article shows how consistently different features are represented across deep neural network systems by analyzing three datasets in two languages, Arabic and English [
8].
Researchers sought to develop highly effective recognizers for two drastically dissimilar languages, English and Mandarin. They investigated several network topologies and identified useful methods, such as batch normalization and SortaGrad to improve numerical optimization, and lookahead convolution for unidirectional models. This research was made possible by a highly efficient training setup [
9]. The ability of the Deep Speech network to identify distinct Bengali speech samples is examined in that work. The network models internal phoneme representations using recurrent LSTM layers as its foundation, with convolutional layers added at the bottom, eliminating the need for any assumptions about internal phoneme alignment. A connectionist temporal classification (CTC) loss was used to train the model, and a beam search decoder was used to generate the transcript. The developed method yielded lower word and character error rates on the Bengali real-number speech dataset [
10]. An innovative end-to-end speech recognition technique is provided by reference [
11], which makes use of a hybrid CTC–attention architecture within a multitask learning framework to improve convergence and robustness and lessen alignment problems. Its superiority over the CTC and attention-based encoder–decoder baselines is demonstrated in experiments on the WSJ and CHiME-4 tasks, yielding 5.4–14.6% relative improvements in character error rate (CER). For speech-to-text, work [
12] uses a shared task on SwissText/KONVENS. End-to-end training of a neural network is based on Mozilla Deep Speech. Postprocessing, data augmentation, and transfer learning from standard German and English were applied. The system produces a WER of 58.9%.
One of natural language processing’s (NLP) most crucial tasks is next-word prediction, which is also called language modeling. A recurrent neural network (RNN) model is being developed with TensorFlow to predict the top 10 words from a 40-letter text provided by a client. The objective is to accurately and quickly predict ten or more words [
13]. Study [
14] offers a novel approach that can be used for a variety of NLG tasks to predict the next word in a Hindi sentence using Long Short-Term Memory (LSTM) and Bi-LSTM deep learning techniques. The approach achieves 59.46% and 81.07% accuracy, respectively. Study [
15] explores the use of sub-word units (character and factored morphological decomposition) in neural language modeling. Using Czech as a case study, it was found that character-based embedding significantly improved model performance, even with unstable training. Reducing the output look-up table size also improved the model’s performance. In research study [
16], a hybrid language model called RNN-LM is suggested to enhance Japanese text prediction for smartphone users. To predict subsequent words, this model combines an n-gram model with a recurrent neural network (RNN). Long Short-Term Memory networks (LSTMs) are used to connect the input, output, and hidden layers of the model, which works best in a client–server architecture. The model achieves 10% lower perplexity than traditional models and is compact. It has been integrated into the IME Flick, which, in a Japanese text input experiment, beats Mozc by 16% in time and 34% in keystrokes.
In article [
17], many language tools are available for Ukrainian, and next-word prediction is essential for users. Unlike T9, LSTM and Markov chains are selected for this task because of their sequential nature and their capacity to produce multiple words when combined in a hybrid model. The goal of work [
18], on the Long Short-Term Memory (LSTM) network model for instant messaging, is to predict the word or words that will come after a given set of current words. Research study [
19] uses a transcript of the Assamese language based on the International Phonetic Association (IPA) chart; the model achieves an accuracy of 72.10% for a phonetic transcript of the Assamese language and 88.20% for Assamese text, especially during phonetic typing. Next-word prediction is crucial for typing subsystems and digital document writing. Researchers are exploring ways to improve this capability using natural language processing (NLP) techniques like word embedding and sequential contextual modeling. This paper compares embedding methods for Bengali next-word prediction, including word2vec skip-gram, word2vec CBOW, fastText skip-gram, and fastText CBOW. The results provide insights into contextual and sequential information gathering, enabling the implementation of a context-based Bengali next-word prediction system. Next-word prediction, also known as language modeling, is a machine learning application that uses Long Short-Term Memory (LSTM) and other models to predict the next word. LSTMs have a built-in gating mechanism that mitigates the vanishing gradient problem and use activation functions like ReLU and softmax to introduce non-linearity. Combining techniques like pre-training, advanced architectures, and large datasets can improve performance. The paper discusses different approaches to achieve the best results.
In reference [
20], a large-scale language model (LM) study on 50 different languages is presented, with an emphasis on sub-word-level data. In order to incorporate sub-word-level information into semantic word vectors and train neural language models, the researchers offer a novel approach. The study offers new benchmarks for LM and demonstrates significant perplexity reductions across all 50 languages, particularly morphologically rich ones. The datasets and code are openly accessible.
This paper discusses recurrent neural networks, introducing LSTM as an effective model and recommending a language model based on LSTM for character prediction. Comparing LSTM with standard RNN models reveals significant potential in character prediction. Paper [
21] discusses the development of recurrent neural networks (RNNs) and introduces LSTM, a more effective model. It recommends a language model based on RNNs for character prediction. Paper [
22] compares LSTM and standard RNN models, highlighting their potential in character prediction. The program runs on TensorFlow, demonstrating RNNs’ sequential processing ability. The goal of another project is to use the n-gram language model, a statistical machine learning technique, to create a Ge’ez word prediction model. The model predicts likely stem/root words and morphological features using affixes and morphological datasets. Keystroke savings are used as a metric to assess the model. Three experiments were carried out as part of the study, using smaller and larger datasets in experiments 1, 2, and 3. Experiment 2, which demonstrated 35.7% keystroke savings for a hybrid of n-gram models with back-off smoothing, produced the best results. Future research and other natural language processing (NLP) applications, like texting on a mobile device and helping individuals with disabilities, can benefit from this technology.
Work [
23] suggests that this technology can support natural language processing (NLP) applications, like helping individuals with disabilities and texting on a mobile device, and can be used for future research. In the field of language modeling, this study investigates the use of deep learning methods, particularly temporal convolutional networks and recurrent neural networks, to predict the next word. The researchers obtained an accuracy of 71.51% for the RNN model and 65.20% for the TCN model using three databases: Coursera Swiftkey, the book Writings of Friedrich Nietzsche: Volume 1 by Friedrich Nietzsche, and the News category from the Brown corpus in the NLTK library. The findings imply that the TCN and RNN architectures are competitive with each other in language modeling. They advise looking into more recent deep learning mobile platforms. The study does not examine NWP for Arabic datasets due to resource constraints. Another study evaluates current autoregressive language models (LMs) for their accuracy in modeling human incremental processing and reading times. The authors propose Cloze Distillation, a method for distilling linguistic information into pre-trained LMs, which improves reading time prediction and generalizes to held-out cloze data [
24].
The purpose of paper [
25] is to examine the utility of deep learning models for next-word prediction and suggestion in Urdu. It suggests two word prediction models: BERT, which is specially made for natural language modeling, and LSTM, which is used for neural language modeling. The study trained these models on a corpus of 1.1 million sentences written in Urdu. The BERT model performed better than the other two pre-trained models, whereas the LSTM model had an accuracy of 52.4%. Additionally, the study demonstrates how changing the input context’s window size can impact performance. Next-word prediction technology can make typing easier by suggesting the next word. However, the absence of n-grams and a Kurdish text corpus presents difficulties for that language. Paper [
26] presents a novel investigative work on next-word prediction for Kurdish Sorani and Kurmanji, achieving an accuracy rate of 96.3% and a reduction in typing time using an n-gram model. The aim of research study [
27] is to create a new word predictor for Tigrigna text entry that will cut down on keystrokes by 55.06%. A probabilistic language model that combines unigram, bigram, trigram, and quadgram predictors is used by the predictor model. This helps people with limited vocabulary avoid spelling errors and saves over 50% of the time spent writing a document in Tigrigna, according to the study. Still, there are difficulties in Tigrigna word prediction, like figuring out affixes that might be either words or letters. The study intends to support applications such as texting on a mobile phone or PDA, handwriting recognition, and helping individuals with disabilities.
Deep learning methods are presented in work [
28]. Long Short-Term Memory and Bidirectional Long Short-Term Memory are the core neural network architectures used in this work. The model developed in this work had the highest accuracy among neural network-based methods and outperformed previous approaches using the IITB English–Hindi parallel corpus. A comparison of next-word prediction with other algorithms is presented in [
29]. The paper aims to achieve state-of-the-art performance on most Arabic natural language processing (NLP) tasks by presenting a pre-trained BERT model for the Arabic language. The model achieved state-of-the-art results when compared to other approaches and Google’s multilingual BERT. To promote Arabic natural language processing research and applications, the pre-trained ARABERT models are made publicly available [
3]. As discussed in article [
30], language modeling, another name for next-word prediction, is a branch of machine learning-based natural language processing. The researchers used Google Colab for coding, a dataset of 180 Indonesian destinations, tools such as TensorFlow, Keras, NumPy, and Matplotlib, and a Long Short-Term Memory (LSTM) model trained for 200 epochs. The model achieved a 75% accuracy rate at 8 ms per step. A Bidirectional Long Short-Term Memory–Gated Recurrent Unit (BLST-GRU) network model for Amharic word prediction is presented in paper [
31]. When compared against state-of-the-art models such as LSTM, GRU, and BLSTM, it demonstrated promising results and obtained 78.6% accuracy on 63,300 Amharic phrases. Arabic voice recognition has been shown to be successful with Mozilla’s DeepSpeech framework, which is based on Baidu’s Deep Speech. The system converts spoken language into written text using recurrent neural networks (RNNs). A sizable dataset of speech and accompanying text is used to train the RNNs. For feature extraction and preprocessing, the system makes use of spectrograms, PyCharm, and a virtual environment. High-quality Arabic voice recognition with lower loss, word error, and character error rates is possible with the Deep Speech technique. Future advancements in Arabic voice recognition may result from this study [
32].
Research Gap
Based on our survey, we have observed limited research on next-word recommendation in Arabic linguistics. However, publications in other languages, such as Urdu, English, Ukrainian, Assamese, Hindi, Indonesian, Amharic, etc., exist in the literature. As the Arabic language has not been explored extensively, and natural language processing for Arabic is required for modernization, we decided to proceed with a detailed study of the prediction of the next word and characters in Arabic. Arabic next-word prediction still faces challenges because of the limited availability of large-scale annotated corpora, handling dialectal variations, tackling domain-specific challenges, and developing task-specific resources. Arabic next-word prediction is a useful tool to assist language learning and natural language processing tasks. BERT and ARABERT are the pre-trained models used extensively for English and Arabic corpora, respectively.
3. Deep Learning Models and Methods
MEL frequency coefficients are extracted from a raw Arabic audio dataset. The computed MEL spectrogram represents the audio on a perceptual frequency scale similar to human hearing. The fundamental element of Deep Speech is its neural network architecture, which comprises several layers of recurrent neural networks (RNNs) trained on a sizable text and voice dataset. It aims to learn the complex relationships and patterns that exist between spoken sounds and their textual representations. Arabic textual transcripts and Arabic audio data were collected from 10 male and 10 female speakers for training and testing our model. Deep Speech offers superior Arabic speech recognition quality with lower loss, word error rate, and character error rate, and, hence, Baidu’s Deep Speech framework is used to convert the audio data into text [
32]. Additionally, we have designed deep learning models such as LSTM, CNNs, Markov models, and ARABERT for next-word and next-character prediction from the Arabic text corpus obtained from the audio dataset. The Arabic Bidirectional Encoder Representations from Transformers (ARABERT) preprocessing method is preferred over the Natural Language Toolkit (NLTK) method for comparison. The capability of a language model to predict the subsequent word in a sequence can be gauged using perplexity. Perplexity is the exponential of the average negative log probability assigned to each word in a sentence. A low perplexity score indicates the predictions are accurate. This is a useful metric for comparing various language models and assessing how well they predict text. It is particularly helpful for tasks where precisely predicting the next word is essential, such as text generation and machine translation.
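In the standard formulation, the perplexity of a model over a sentence of words $w_1, \ldots, w_N$ is the exponential of the average negative log probability it assigns to each word given its history:
$$\mathrm{PPL}(w_1, \ldots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right).$$
A lower value means the model assigns higher probability to the observed text.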
3.1. Dataset Overview
The dataset used in this paper consists of text data extracted from 4071 Arabic audio clips. The audio data collected relate to sports, technology, education, health, economy, security, and justice. Each category of collected data consists of different levels of data: some categories contain only two levels, while others contain three. In the sports category, the dataset consists of sports in general and data related to boxing, a specific sport. In the technology category, the data are subdivided into mobile phone technology and technology based on electronic devices. The education category includes data about education in general, literacy levels, public education, higher education, compulsory education, kindergarten, universities and colleges, intermediate education, primary education, and research. The health category covers health insurance, disease, and nutritional disease. The economy category contains data about economic growth, industrialized countries, Islamic economics, and the economy of the Gulf Cooperation Council. The security and justice groups contain data regarding drugs, prisons, execution, embezzlement, law firms, Homeland Security courts, and crimes. In the continuous speech corpus, we have 2.63 h of audio data (10 male and 10 female speakers [
2]). In our work, we dealt with the following types of dialects and vernaculars in the Arabic language:
- A (Arabic—Arab nations’ official language)
- EGY (Egyptian Arabic)
- GLF (Gulf Arabic or Khaleeji—spoken in the Eastern Arabian region)
- LAV and LF (Levantine Arabic—spoken in Syria, Lebanon, Jordan, etc.)
- MSA (Modern Standard Arabic)
- NOR (North African Arabic)
- SA (Saudi Arabic)
The average sampling rate of the audio is 16,000 Hertz, and the average encoding of the WAV files is 2 bytes per sample. We have considered 1321 training audio files with an average text length of 93.0; the number of spontaneous speech files is 733, and the number of read speech files is 588.
3.2. Model Architecture of Next-Word Prediction
3.2.1. Methodology
The first step involves gathering raw Arabic audio data from several Middle Eastern nations covering a range of Arabic accents. MEL frequency coefficients are extracted from the audio dataset, which is further organized into pronunciation dictionary, linguistic, and acoustic data models. After an n-gram language model is created, this information is fed into Baidu’s Deep Speech framework to convert the audio into text. The resulting text corpus is then provided as input to the next-word and next-character prediction architecture. To anticipate the next word and character from the Arabic corpus, LSTM, CNN, and Markov chain models are developed, and ARABERT preprocessing methods are used to prepare the Arabic text corpus for training.
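As an illustration of this feature-extraction step, the following sketch computes a MEL spectrogram and MEL frequency cepstral coefficients for a single 16 kHz WAV file using the librosa library; the file name and parameter values are placeholders rather than the exact settings of our pipeline.

```python
import librosa

# Load one Arabic utterance at the corpus sampling rate (16 kHz).
audio, sr = librosa.load("clip_0001.wav", sr=16000)  # hypothetical file name

# MEL spectrogram: a power spectrogram mapped onto the perceptual MEL scale.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

# MEL frequency cepstral coefficients derived from the log-MEL energies.
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames) and (n_mfcc, frames)
```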
The raw Arabic corpus obtained from Baidu’s Deep Speech framework consists of sentences containing commas, full stops, quotation marks, URLs, HTML tags, and other punctuation. The extraneous symbols, punctuation, and spaces were removed during the preprocessing stage. The filtered dataset was cleansed and tokenized to produce a collection of list files containing the number of characters and the vocabulary size. These tokenized list files were grouped into a sequence of data files, and the input files were regrouped into sequences of 100 characters in sequential order. The sequential output data files were combined with the Arabic alphabet and fed into the deep learning models.
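A minimal sketch of this character-level preparation is shown below, assuming the cleaned corpus is held in one UTF-8 text file; the file name, the character filter, and the variable names are illustrative rather than the exact ones used in our pipeline.

```python
import re

corpus = open("arabic_corpus.txt", encoding="utf-8").read()  # hypothetical file

# Remove URLs, HTML remnants, punctuation, and extra whitespace, keeping Arabic letters.
corpus = re.sub(r"https?://\S+|<[^>]+>", " ", corpus)
corpus = re.sub(r"[^\u0600-\u06FF\s]", " ", corpus)
corpus = re.sub(r"\s+", " ", corpus).strip()

# Character vocabulary and integer mapping.
chars = sorted(set(corpus))
char2idx = {c: i for i, c in enumerate(chars)}

# Regroup the text into input sequences of 100 characters, each with the next character as target.
seq_len = 100
inputs, targets = [], []
for i in range(len(corpus) - seq_len):
    inputs.append([char2idx[c] for c in corpus[i:i + seq_len]])
    targets.append(char2idx[corpus[i + seq_len]])

print(len(chars), len(inputs))  # vocabulary size and number of training sequences
```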
3.2.2. Architecture
The block diagram shown in
Figure 1 consists of Baidu’s Deep Speech architecture [
32], which converts raw Arabic audio signals into text data.
The text output is applied to the text prediction structure shown in
Figure 2. The model architecture represents the design flow of next-word prediction.
The prediction model consists of three layers: an embedding layer, a recurrent layer, and a dense layer. The key element of the next-word prediction model is the embedding layer, which turns words into a numerical representation that includes syntactic and semantic information. The recurrent layer processes the data from the embedding layer as sequential data. The output of the recurrent layer is passed to a dense layer to find the probability of each word in the vocabulary. These layers form the foundation of language understanding in next-word prediction. The relationships between the words are maintained, which makes it possible for the model to predict the most probable next word. To handle the vanishing gradient problem, capture word dependencies, and process sequential data, we use Long Short-Term Memory (LSTM).
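A minimal Keras sketch of this three-layer design is given below, assuming a word-level vocabulary of vocab_size tokens and input sequences of seq_len tokens; the layer sizes are illustrative and not the exact hyperparameters used in this work.

```python
import tensorflow as tf

vocab_size, seq_len, embed_dim = 10000, 50, 128  # illustrative values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    # Embedding layer: maps each token ID to a dense vector carrying syntactic/semantic information.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    # Recurrent layer: an LSTM processes the embedded sequence and keeps word dependencies.
    tf.keras.layers.LSTM(256),
    # Dense layer: softmax over the vocabulary gives the probability of each candidate next word.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```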
3.3. Long Short-Term Memory
LSTM extends the memory of the recurrent neural network. An LSTM consists of two different states, the cell state and the hidden state, as shown in
Figure 3. The length of both states is given as n. The cell state is analogous to computer memory: long-term memory is saved in the cell state, and information can be written to, read from, and erased in this state. Three main gates are used for these operations, and the corresponding equations are shown. For writing information, the input gate is used; for reading information, the output gate is normally used; and for deleting information, the forget gate is used. All three gates use a sigmoid function.
Here, $f_t$ is the forget gate, which selects the information that needs to be deleted from the cell state; $i_t$, the input gate, determines the new data to be stored in the cell state; $o_t$ is the output gate, which chooses the data from the cell state to output as the hidden state; and $h_t$ is the hidden state. Equation (1) is employed to remove extraneous data from the cell state.
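For reference, the standard LSTM formulation that this description follows is given below, where $x_t$ is the input at time $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication; Equation (1) corresponds to the forget-gate expression.
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \qquad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \qquad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$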
3.4. Markov Model
Markov models are statistical techniques that forecast the next element in a sequence based on earlier occurrences. Markov models predict the next word by assuming that the likelihood of the word relies only on the preceding few words and not on the complete historical text. The next state is governed only by the current state. This can be modeled using Markov chains, which are an ideal method for predicting a sequence of events. A Markov chain is a mathematical model of a series of trials in which the probability of an individual event depends on the state produced by the prior event. This model consists of states, transitions, and a transition matrix. For word prediction in a Markov chain, every word in the passage is a state.
Transition probabilities show the chance of changing from one word to another. The probabilities are calculated from the frequency with which one word follows another in the provided text. The number of times each word comes after another is counted, and the counts are normalized by the total number of occurrences of the preceding word to determine the transition probabilities. To predict the word that will come after a given word, the Markov chain chooses the word with the highest transition probability from the current state. The following steps are used in finding the next word using a Markov model: preprocessing the data, which includes removing punctuation, building the dictionary, and listing words with probabilities; training the model; normalizing the distributions; obtaining the predicted word from the model; and calculating the testing and training accuracy.
The Markov property is expressed by the equation
$$P(X_n = i_n \mid X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_n = i_n \mid X_{n-1} = i_{n-1}),$$
where $P(\cdot)$ denotes probability, $X_n$ is the random variable at time step $n$, $i_n$ is the specific value or state of the random variable at time step $n$, and $\mid$ denotes conditioning. Since $X_{n-1} = i_{n-1}$ is the conditioning event, the probability is computed with respect to the prior state $i_{n-1}$. The equation states that the probability of the next state $X_n$ is determined only by the current state $X_{n-1}$ and not by the previous states.
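A minimal sketch of the counting-and-normalization procedure described above, for a first-order word-level Markov chain, is shown below; the toy corpus and function names are illustrative only.

```python
from collections import defaultdict, Counter

def train_markov(tokens):
    # Count how often each word follows the previous word.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize the counts into transition probabilities P(next | current).
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

def predict_next(model, word):
    # Choose the word with the highest transition probability from the current state.
    return max(model[word], key=model[word].get) if word in model else None

tokens = "ذهب الولد الى المدرسة ثم ذهب الولد الى البيت".split()  # toy corpus
model = train_markov(tokens)
print(predict_next(model, "ذهب"))
```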
Features of the Markov Model
Markov chain models generate Arabic text statistically by examining a sizable text collection to determine the likelihood of word sequences. The Markov technique generates coherent sentences by predicting the most likely next word based on the preceding words. The accuracy and fluency of the model can be increased by using more sophisticated methods like smoothing and higher-order n-grams. However, long-range dependencies are difficult for Markov chain models to capture.
3.5. Data Preprocessing
ARABERT is a transformer-based pre-trained Arabic language model that can be used in different NLP tasks; the model itself is a stacked bidirectional transformer encoder. The ARABERT preprocessor first segments words into stems, prefixes, and suffixes. It also performs tokenization, in which the text is broken into words; normalization, in which diacritics are normalized and the text is converted into a consistent format; removal of punctuation and stop words; and text cleaning, padding, and truncation.
3.5.1. ARABERT-BERT-Base-arabertv02
A pre-trained language model created especially for Arabic is called BERT-base-arabertv02. It is built on the robust BERT architecture, which makes use of a transformer-based embedding method to comprehend the meaning of words in a phrase. Self-attention, positional encoding, and a multi-layer encoder are important components of this embedding method. The transformer encoder processes input sequences using a feed-forward neural network, positional encoding, multi-head attention, and a self-attention mechanism. The comprehension of sentence-level links is enriched using non-linear transformations and pre-training objectives such as Next Sentence Prediction (NSP) and Masked Language Modeling (MLM). This method makes it possible to capture intricate syntactic and semantic links between words in a phrase, producing representations that are more accurate and pertinent to the context.
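As a small illustration of the MLM objective with this checkpoint (assuming the Hugging Face transformers library and the aubmindlab/bert-base-arabertv02 model are available; the example sentence and its completions are ours, not taken from this paper's experiments):

```python
from transformers import pipeline

# Fill-mask pipeline over the pre-trained AraBERT checkpoint.
fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv02")

# "The capital of Lebanon is [MASK]" — the model ranks candidate tokens for the masked slot.
for prediction in fill_mask("عاصمة لبنان هي [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 3))
```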
There are other methods that can also be used, for example, TF-IDF. Although mostly a text-based method, it may be modified for Arabic by employing suitable tokenization and preprocessing strategies to manage morphological variances and diacritical marks. To capture the syntactic and semantic links between Arabic words, word2vec and GloVe are word embedding approaches that can be taught on sizable Arabic text corpora. FastText is another method; because it can handle sub-word information, which is essential for languages with complex morphology, this method works especially well for Arabic [
33]. After ARABERT preprocessing, the sample text is provided as input to the encoder.
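As an illustrative sketch only (not the sample output from our experiments), ARABERT preprocessing and tokenization can be invoked roughly as follows, assuming the arabert package and the aubmindlab/bert-base-arabertv02 checkpoint:

```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv02"

# Normalization, cleaning, and (for the pre-segmented ARABERT variants) morphological segmentation.
arabert_prep = ArabertPreprocessor(model_name=model_name)
clean_text = arabert_prep.preprocess("ذهب الولد إلى المدرسة صباحا")

# WordPiece tokenization of the preprocessed text for the encoder.
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(clean_text)
print(tokenizer.tokenize(clean_text))
```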
3.5.2. NLTK
NLTK offers text processing capabilities such as tokenization, stemming, lemmatization, and part-of-speech tagging, and supports text analysis methods including sentiment analysis, text classification, and topic modeling. Furthermore, NLTK provides a set of corpora and lexical resources for a variety of NLP applications, as well as language modeling capabilities such as n-gram models.
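A small example of these NLTK capabilities (tokenization and n-gram extraction), assuming the punkt tokenizer data has been downloaded; the sentence is an arbitrary illustration:

```python
import nltk
from nltk.util import ngrams

nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK versions may also need "punkt_tab")

text = "ذهب الولد الى المدرسة صباحا"
tokens = nltk.word_tokenize(text)   # tokenization
trigrams = list(ngrams(tokens, 3))  # n-grams usable for simple language modeling
print(tokens)
print(trigrams)
```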