1. Introduction
Speech is a powerful tool that shapes our world, serving as a conduit for conveying ideas and thoughts through vocal sounds, word formation, and varied rhythmic expression. Speech processing is the study of speech signals and of the methods used to convert, manipulate, recognize, and process those signals for tasks such as speech recognition, translation, and emotion identification. An automatic speech recognition (ASR) model recognizes or transcribes a sequence of voice or acoustic data into word segments [
1]. Despite rapid technological advancements, voice recognition remains a challenging process with numerous criteria, making the ideal speech-to-text conversion a distant goal [
2]. In many commercial applications, the traditional human–machine interface has been replaced by modern ASR systems, which translate spoken words into text by drawing on advances in technology and linguistics. Arabic pronunciation varies across the globe depending on phonemes and diacritical marks, and speaking patterns are also significantly affected by dialectal differences. With over 380 million native speakers, Arabic is one of the most widely spoken languages worldwide. There are three primary varieties of Arabic, making it a very diverse language. The written and formally spoken registers are based on classical Arabic, the language of the Quran. Derived from classical Arabic, Modern Standard Arabic (MSA) is used in government, education, and the media. Conversely, the numerous dialects spoken in various Arabic-speaking countries are referred to as colloquial Arabic. These dialects can differ greatly, which frequently makes it difficult for speakers of different dialects to communicate with one another. Colloquial Arabic is the language of daily living and social contact, even though MSA is the official language in many Arab nations.
The foundation of comprehending and producing natural language is the capacity to anticipate the word or character that will appear next in a sequence. It is a basic task in natural language processing (NLP), with applications ranging from virtual assistants and chatbots to text generation and machine translation. Accurately predicting the next word or character is essential for many of these applications. LSTM networks are well known for their capacity to represent long-term dependencies in sequences, making them well suited to modeling language’s sequential structure. However, adding domain-specific knowledge can improve their performance further, and this is where ARABERT becomes useful. Through extensive pre-training on a large Arabic text corpus, ARABERT embodies a deep semantic comprehension of the language, and combining LSTM networks with ARABERT embeddings can therefore yield higher prediction accuracy. Next-word prediction uses deep learning techniques such as RNNs, transformer networks, and language models to forecast the most likely word in a sentence. As natural language processing and deep learning evolve, next-word prediction plays a critical role in creating intelligent systems that can comprehend and produce natural language. The rich morphology, intricate script, and variety of dialects of Arabic are the main causes of the gap in the state of the art of next-word and -character prediction for Arabic text. A pre-trained BERT model for the Arabic language aims to achieve state-of-the-art performance on most Arabic natural language processing (NLP) tasks [
3]. Using Natural Language Generation (NLG) and Natural Language Understanding (NLU) architectures, paper [
4] suggests an extractive Arabic text summarizer that preserves the elements of Arabic manuscripts by analyzing and extracting key phrases. The ROUGE metric and human assessment are used to compare the effectiveness of the suggested solution and to identify the most effective method for summarizing Arabic texts. Moreover, in robotics, speech recognition must process the human voice rapidly and intuitively and respond dynamically to spoken interaction. Generally, unraveling the complexities of words occurring in various contexts and integrating contextual information in Arabic remains a challenge, and more sophisticated models are needed to enhance the user experience in personalized recommendation systems, question answering, content creation, and other NLP applications.
2. Literature Review
In document [
5], the objective of automatic speech recognition is to give computers the best possible ability to recognize and understand human speech. Speech recognition models can be implemented using a variety of methods; one of the newest is supported by deep neural networks. Arabic receives comparatively little attention in speech recognition technology, despite being one of the most widely spoken languages. Paper [
6] provides an overview of the research on Arabic voice recognition. It also provides some insight into the resources and toolkits available for the development of Arabic voice recognition systems. Many products that effectively use automatic speech recognition to facilitate human–machine communication have been developed. Speech recognition applications perform worse in the presence of background noise or reverberation [
7]. Both audio and text transcriptions are used throughout the process of training automatic speech recognition (ASR) neural network systems. This work assesses representation quality in a variety of classification tasks by comparing phonemes and graphemes in addition to various articulatory properties. The article shows how consistently different features are represented across deep neural network systems by analyzing three datasets in two languages, Arabic and English [
8].
Researchers sought to develop highly effective recognizers for two drastically dissimilar languages, English and Mandarin. They investigated several network topologies and identified useful methods, such as batch normalization and SortaGrad to improve numerical optimization, and lookahead convolution for unidirectional models. This research was made possible by a highly efficient training setup [
9]. The ability of the Deep Speech network to identify distinct Bengali speech samples is examined in that work. The network models internal phoneme representations using recurrent LSTM layers as its foundation, with convolutional layers added at the bottom, eliminating the need for any assumptions about internal phoneme alignment. A connectionist temporal classification (CTC) loss was used to train the model, and a beam search decoder was used to generate the transcript. The developed method yielded lower word and character error rates on the Bengali real-number speech dataset [
10]. An innovative end-to-end speech recognition technique is provided by reference [
11], which makes use of a hybrid CTC–attention architecture within a multitask learning framework to improve convergence and robustness and lessen alignment problems. Its superiority over the CTC and attention-based encoder–decoder baselines is demonstrated in experiments on the WSJ and CHiME-4 tasks, yielding 5.4–14.6% relative improvements in character error rate (CER). For speech-to-text, work [
12] uses a shared task on SwissText/KONVENS. End-to-end training of a neural network is based on Mozilla Deep Speech. Postprocessing, data augmentation, and transfer learning from standard German and English were applied. The system produces a WER of 58.9%.
One of natural language processing’s (NLP) most crucial tasks is next-word prediction, which is also called language modeling. A recurrent neural network (RNN) model is being developed with TensorFlow to predict the top 10 words from a 40-letter text provided by a client. The objective is to accurately and quickly predict ten or more words [
13]. Study [
14] offers a novel approach that can be used for a variety of NLG tasks to predict the next word in a Hindi sentence using Long Short-Term Memory (LSTM) and Bi-LSTM deep learning techniques. The approach achieves 59.46% and 81.07% accuracy, respectively. Study [
15] explores the use of sub-word units (character and factored morphological decomposition) in neural language modeling. Using Czech as a case study, it was found that character-based embedding significantly improved model performance, even with unstable training. Reducing the output look-up table size also improved the model’s performance. In research study [
16], a hybrid language model called RNN-LM is suggested to enhance Japanese text prediction for smartphone users. To predict subsequent words, this model combines an n-gram model with a recurrent neural network (RNN). Long Short-Term Memory networks (LSTMs) are used to connect the input, output, and hidden layers of the model, which works best in a client–server architecture. The model achieves 10% lower perplexity than traditional models and is compact. It has been integrated into the IME Flick, which, in a Japanese text input experiment, beats Mozc by 16% in time and 34% in keystrokes.
In article [
17], many language tools are available for Ukrainian, and next-word prediction is essential for users. Unlike T9, LSTM and Markov chains are selected for this task because of their sequential nature and their capacity to produce multiple words when combined in a hybrid model. The goal of work [
18], on the Long Short-Term Memory (LSTM) network model for instant messaging, is to predict the word or words that will come after a given set of current words. Research study [
19] uses a transcript of the Assamese language based on the International Phonetic Association (IPA) chart; the model achieves an accuracy of 72.10% for a phonetic transcript of the Assamese language and 88.20% for Assamese text, especially during phonetic typing. Next-word prediction is crucial for typing subsystems and digital document writing. Researchers are exploring ways to improve this capability using natural language processing (NLP) techniques like word embedding and sequential contextual modeling. This paper compares embedding methods for Bengali next-word prediction, including word2vec skip-gram, word2vec CBOW, fastText skip-gram, and fastText CBOW. The results provide insights into contextual and sequential information gathering, enabling the implementation of a context-based Bengali next-word prediction system. Next-word prediction, also known as language modeling, is a machine learning application that uses Long Short-Term Memory (LSTM) and other models to predict the next word. LSTMs have a built-in gating mechanism that mitigates the vanishing gradient problem and use activation functions like ReLU and softmax to introduce non-linearity. Combining techniques like pre-training, advanced architectures, and large datasets can improve performance. The paper discusses different approaches to achieve the best results.
In reference [
20], a large-scale language model (LM) study on 50 different languages is presented, with an emphasis on sub-word-level data. In order to incorporate sub-word-level information into semantic word vectors and train neural language models, the researchers offer a novel approach. The study offers new benchmarks for LM and demonstrates significant perplexity reductions across all 50 languages, particularly morphologically rich ones. The datasets and code are openly accessible.
This paper discusses recurrent neural networks, introducing LSTM as an effective model and recommending a language model based on LSTM for character prediction. Comparing LSTM with standard RNN models reveals significant potential in character prediction. Paper [
21] discusses the development of recurrent neural networks (RNNs) and introduces LSTM, a more effective model. It recommends a language model based on RNNs for character prediction. Paper [
22] compares LSTM and standard RNN models, highlighting their potential in character prediction. The program runs on TensorFlow, demonstrating RNNs’ sequential processing ability. The goal of another project is to use the n-gram language model, a statistical machine learning technique, to create a Ge’ez word prediction model. The model predicts likely stem/root words and morphological features using affixes and morphological datasets. Keystroke savings are used as a metric to assess the model. Three experiments were carried out as part of the study, using smaller and larger datasets in experiments 1, 2, and 3. Experiment 2, which demonstrated 35.7% keystroke savings for a hybrid of n-gram models with back-off smoothing, produced the best results. Future research and other natural language processing (NLP) applications, like texting on a mobile device and helping individuals with disabilities, can benefit from this technology.
Work [
23] suggests that this technology can support natural language processing (NLP) applications, like helping individuals with disabilities and texting on a mobile device, and can be used for future research. In the field of language modeling, this study investigates the use of deep learning methods, particularly temporal convolutional networks and recurrent neural networks, to predict the next word. The researchers obtained an accuracy of 71.51% for the RNN model and 65.20% for the TCN model using three databases: Coursera Swiftkey, the book Writings of Friedrich Nietzsche: Volume 1 by Friedrich Nietzsche, and the News category from the Brown corpus in the NLTK library. The findings imply that the TCN and RNN architectures are competitive with each other in language modeling. They advise looking into more recent deep learning mobile platforms. The study does not examine NWP for Arabic datasets due to resource constraints. Another study evaluates current autoregressive language models (LMs) for their accuracy in modeling human incremental processing and reading times. The authors propose Cloze Distillation, a method for distilling linguistic information into pre-trained LMs, which improves reading time prediction and generalizes to held-out cloze data [
24].
The purpose of paper [
25] is to examine the utility of deep learning models for next-word prediction and suggestion in Urdu. It suggests two word prediction models: BERT, which is specially made for natural language modeling, and LSTM, which is used for neural language modeling. The study trained these models on a corpus of 1.1 million sentences written in Urdu. The BERT model performed better than the other two pre-trained models, whereas the LSTM model had an accuracy of 52.4%. Additionally, the study demonstrates how changing the input context’s window size can impact performance. Next-word prediction technology can make typing easier by suggesting the next word. However, the absence of n-grams and a Kurdish text corpus presents difficulties for that language. Paper [
26] presents a novel investigative work on next-word prediction for Kurdish Sorani and Kurmanji, achieving an accuracy rate of 96.3% and a reduction in typing time using an n-gram model. The aim of research study [
27] is to create a new word predictor for Tigrigna text entry that will cut down on keystrokes by 55.06%. A probabilistic language model that combines unigram, bigram, trigram, and quadgram predictors is used by the predictor model. This helps people with limited vocabulary avoid spelling errors and saves over 50% of the time spent writing a document in Tigrigna, according to the study. Still, there are difficulties in Tigrigna word prediction, like figuring out affixes that might be either words or letters. The study intends to support applications such as texting on a mobile phone or PDA, handwriting recognition, and helping individuals with disabilities.
Deep learning methods are presented in work [
28]. Long Short-Term Memory and Bidirectional Long Short-Term Memory are the core neural network architectures used in this work. The model developed in this work had the highest accuracy among neural network-based methods and outperformed previous approaches using the IITB English–Hindi parallel corpus. A comparison of next-word prediction with other algorithms is presented in [
29]. The paper aims to achieve state-of-the-art performance on most Arabic natural language processing (NLP) tasks by presenting a pre-trained BERT model for the Arabic language. The model achieved state-of-the-art results when compared to other approaches and Google’s multilingual BERT. To promote Arabic natural language processing research and applications, the pre-trained ARABERT models are made publicly available [
3]. As discussed in article [
30], language modeling, another name for next-word prediction, is a branch of machine learning-based natural language processing. The researchers used Google Colab for coding, a dataset of 180 Indonesian destinations, tools such as TensorFlow, Keras, NumPy, and Matplotlib, and a Long Short-Term Memory (LSTM) model trained for 200 epochs. The model achieved a 75% accuracy rate at 8 ms per step. A Bidirectional Long Short-Term Memory–Gated Recurrent Unit (BLST-GRU) network model for Amharic word prediction is presented in paper [
31]. When compared against state-of-the-art models such as LSTM, GRU, and BLSTM, it demonstrated promising results and obtained 78.6% accuracy on 63,300 Amharic phrases. Arabic voice recognition has been shown to be successful with Mozilla’s DeepSpeech framework, which is based on Baidu’s Deep Speech. The system converts spoken language into written text using recurrent neural networks (RNNs). A sizable dataset of speech and accompanying text is used to train the RNNs. For feature extraction and preprocessing, the system makes use of spectrograms, PyCharm, and a virtual environment. High-quality Arabic voice recognition with lower loss, word error, and character error rates is possible with the Deep Speech technique. Future advancements in Arabic voice recognition may result from this study [
32].
Research Gap
Based on our survey, we have observed limited research on next-word recommendation in Arabic linguistics. However, publications in other languages, such as Urdu, English, Ukrainian, Assamese, Hindi, Indonesian, Amharic, etc., exist in the literature. As the Arabic language has not been explored extensively, and natural language processing for Arabic is required for modernization, we decided to proceed with a detailed study of the prediction of the next word and characters in Arabic. Arabic next-word prediction still faces challenges because of the limited availability of large-scale annotated corpora, handling dialectal variations, tackling domain-specific challenges, and developing task-specific resources. Arabic next-word prediction is a useful tool to assist language learning and natural language processing tasks. BERT and ARABERT are the pre-trained models used extensively for English and Arabic corpora, respectively.
3. Deep Learning Models and Methods
MEL frequency coefficients are extracted from a raw Arabic audio dataset. The computed MEL spectrogram represents the audio on a perceptual frequency scale similar to human hearing. The fundamental element of Deep Speech is its neural network architecture, which comprises several layers of recurrent neural networks (RNNs) trained on a sizable text and voice dataset. It aims to learn the complex relationships and patterns that exist between spoken sounds and their textual representations. Arabic textual transcripts and Arabic audio data were collected from 10 male and 10 female speakers for training and testing our model. Deep Speech offers superior Arabic speech recognition quality with lower loss, word error rate, and character error rate, and, hence, Baidu’s Deep Speech framework is used to convert the audio data into text [
32]. Additionally, we have designed deep learning models such as LSTM, CNNs, Markov models, and ARABERT for next-word and next-character prediction from the Arabic text corpus obtained from the audio dataset. The Arabic Bidirectional Encoder Representations from Transformers (ARABERT) preprocessing method is preferred over the Natural Language Toolkit (NLTK) method for comparison. The capability of a language model to predict the subsequent word in a sequence can be gauged using perplexity. Perplexity is the exponential of the average negative log probability assigned to each word in a sentence. A low perplexity score indicates the predictions are accurate. This is a useful metric for comparing various language models and assessing how well they predict text. It is particularly helpful for tasks where precisely predicting the next word is essential, such as text generation and machine translation.
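In the standard formulation, the perplexity of a model over a sentence of words $w_1, \ldots, w_N$ is the exponential of the average negative log probability it assigns to each word given its history:
$$\mathrm{PPL}(w_1, \ldots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right).$$
A lower value means the model assigns higher probability to the observed text.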
3.1. Dataset Overview
The dataset used in this paper consists of text data extracted from 4071 Arabic audio clips. The audio data collected relate to sports, technology, education, health, economy, security, and justice. Each category of collected data consists of different levels of data: some categories contain only two levels, while others contain three. In the sports category, the dataset consists of sports in general and data related to boxing, a specific sport. In the technology category, the data are subdivided into mobile phone technology and technology based on electronic devices. The education category includes data about education in general, literacy levels, public education, higher education, compulsory education, kindergarten, universities and colleges, intermediate education, primary education, and research. The health category covers health insurance, disease, and nutritional disease. The economy category contains data about economic growth, industrialized countries, Islamic economics, and the economy of the Gulf Cooperation Council. The security and justice groups contain data regarding drugs, prisons, execution, embezzlement, law firms, Homeland Security courts, and crimes. In the continuous speech corpus, we have 2.63 h of audio data (10 male and 10 female speakers [
2]). In our work, we dealt with the following types of dialects and vernaculars in the Arabic language:
- A (Arabic—Arab nations’ official language)
- EGY (Egyptian Arabic)
- GLF (Gulf Arabic or Khaleeji—spoken in the Eastern Arabian region)
- LAV and LF (Levantine Arabic—spoken in Syria, Lebanon, Jordan, etc.)
- MSA (Modern Standard Arabic)
- NOR (North African Arabic)
- SA (Saudi Arabic)
The average sampling rate of the audio is 16,000 Hertz, and the average encoding of the WAV files is 2 bytes per sample. We have considered 1321 training audio files with an average text length of 93.0; the number of spontaneous speech files is 733, and the number of read speech files is 588.
3.2. Model Architecture of Next-Word Prediction
3.2.1. Methodology
The first step involves gathering raw Arabic audio data from several Middle Eastern nations covering a range of Arabic accents. MEL frequency coefficients are extracted from the audio dataset, which is further organized into pronunciation dictionary, linguistic, and acoustic data models. After an n-gram language model is created, this information is fed into Baidu’s Deep Speech framework to convert the audio into text. The resulting text corpus is then provided as input to the next-word and next-character prediction architecture. To anticipate the next word and character from the Arabic corpus, LSTM, CNN, and Markov chain models are developed, and ARABERT preprocessing methods are used to prepare the Arabic text corpus for training.
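As an illustration of this feature-extraction step, the following sketch computes a MEL spectrogram and MEL frequency cepstral coefficients for a single 16 kHz WAV file using the librosa library; the file name and parameter values are placeholders rather than the exact settings of our pipeline.

```python
import librosa

# Load one Arabic utterance at the corpus sampling rate (16 kHz).
audio, sr = librosa.load("clip_0001.wav", sr=16000)  # hypothetical file name

# MEL spectrogram: a power spectrogram mapped onto the perceptual MEL scale.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

# MEL frequency cepstral coefficients derived from the log-MEL energies.
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (n_mels, frames) and (n_mfcc, frames)
```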
The raw Arabic corpus obtained from Baidu’s Deep Speech framework consists of sentences containing commas, full stops, quotation marks, URLs, HTML tags, and other punctuation. The extraneous symbols, punctuation, and spaces were removed during the preprocessing stage. The filtered dataset was cleansed and tokenized to produce a collection of list files containing the number of characters and the vocabulary size. These tokenized list files were grouped into a sequence of data files, and the input files were regrouped into sequences of 100 characters in sequential order. The sequential output data files were combined with the Arabic alphabet and fed into the deep learning models.
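A minimal sketch of this character-level preparation is shown below, assuming the cleaned corpus is held in one UTF-8 text file; the file name, the character filter, and the variable names are illustrative rather than the exact ones used in our pipeline.

```python
import re

corpus = open("arabic_corpus.txt", encoding="utf-8").read()  # hypothetical file

# Remove URLs, HTML remnants, punctuation, and extra whitespace, keeping Arabic letters.
corpus = re.sub(r"https?://\S+|<[^>]+>", " ", corpus)
corpus = re.sub(r"[^\u0600-\u06FF\s]", " ", corpus)
corpus = re.sub(r"\s+", " ", corpus).strip()

# Character vocabulary and integer mapping.
chars = sorted(set(corpus))
char2idx = {c: i for i, c in enumerate(chars)}

# Regroup the text into input sequences of 100 characters, each with the next character as target.
seq_len = 100
inputs, targets = [], []
for i in range(len(corpus) - seq_len):
    inputs.append([char2idx[c] for c in corpus[i:i + seq_len]])
    targets.append(char2idx[corpus[i + seq_len]])

print(len(chars), len(inputs))  # vocabulary size and number of training sequences
```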
3.2.2. Architecture
The block diagram shown in
Figure 1 consists of Baidu’s Deep Speech architecture [
32], which converts raw Arabic audio signals into text data.
The text output is applied to the text prediction structure shown in
Figure 2. The model architecture represents the design flow of next-word prediction.
The prediction model consists of three layers: an embedding layer, a recurrent layer, and a dense layer. The key element of the next-word prediction model is the embedding layer, which turns words into a numerical representation that includes syntactic and semantic information. The recurrent layer processes the data from the embedding layer as sequential data. The output of the recurrent layer is passed to a dense layer to find the probability of each word in the vocabulary. These layers form the foundation of language understanding in next-word prediction. The relationships between the words are maintained, which makes it possible for the model to predict the most probable next word. To handle the vanishing gradient problem, capture word dependencies, and process sequential data, we use Long Short-Term Memory (LSTM).
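A minimal Keras sketch of this three-layer design is given below, assuming a word-level vocabulary of vocab_size tokens and input sequences of seq_len tokens; the layer sizes are illustrative and not the exact hyperparameters used in this work.

```python
import tensorflow as tf

vocab_size, seq_len, embed_dim = 10000, 50, 128  # illustrative values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    # Embedding layer: maps each token ID to a dense vector carrying syntactic/semantic information.
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    # Recurrent layer: an LSTM processes the embedded sequence and keeps word dependencies.
    tf.keras.layers.LSTM(256),
    # Dense layer: softmax over the vocabulary gives the probability of each candidate next word.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```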
3.3. Long Short-Term Memory
LSTM extends the memory of the recurrent neural network. An LSTM consists of two different states, the cell state and the hidden state, as shown in
Figure 3. The length of both states is given as n. The cell state is analogous to computer memory: long-term memory is saved in the cell state, and information can be written to, read from, and erased in this state. Three main gates are used for these operations, and the corresponding equations are shown. For writing information, the input gate is used; for reading information, the output gate is normally used; and for deleting information, the forget gate is used. All three gates use a sigmoid function.
Here, $f_t$ is the forget gate, which selects the information that needs to be deleted from the cell state; $i_t$, the input gate, determines the new data to be stored in the cell state; $o_t$ is the output gate, which chooses the data from the cell state to output as the hidden state; and $h_t$ is the hidden state. Equation (1) is employed to remove extraneous data from the cell state.
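For reference, the standard LSTM formulation that this description follows is given below, where $x_t$ is the input at time $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication; Equation (1) corresponds to the forget-gate expression.
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \qquad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \qquad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$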
3.4. Markov Model
Markov models are statistical techniques that forecast the next element in a sequence based on earlier occurrences. Markov models predict the next word by assuming that the likelihood of the word relies only on the preceding few words and not on the complete historical text. The next state is governed only by the current state. This can be modeled using Markov chains, which are an ideal method for predicting a sequence of events. A Markov chain is a mathematical model of a series of trials in which the probability of an individual event depends on the state produced by the prior event. This model consists of states, transitions, and a transition matrix. For word prediction in a Markov chain, every word in the passage is a state.
Transition probabilities show the chance of changing from one word to another. The probabilities are calculated from the frequency with which one word follows another in the provided text. The number of times each word comes after another is counted, and the counts are normalized by the total number of occurrences of the preceding word to determine the transition probabilities. To predict the word that will come after a given word, the Markov chain chooses the word with the highest transition probability from the current state. The following steps are used in finding the next word using a Markov model: preprocessing the data, which includes removing punctuation, building the dictionary, and listing words with probabilities; training the model; normalizing the distributions; obtaining the predicted word from the model; and calculating the testing and training accuracy.
The Markov property is expressed by the equation
$$P(X_n = i_n \mid X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_n = i_n \mid X_{n-1} = i_{n-1}),$$
where $P(\cdot)$ denotes probability, $X_n$ is the random variable at time step $n$, $i_n$ is the specific value or state of the random variable at time step $n$, and $\mid$ denotes conditioning. Since $X_{n-1} = i_{n-1}$ is the conditioning event, the probability is computed with respect to the prior state $i_{n-1}$. The equation states that the probability of the next state $X_n$ is determined only by the current state $X_{n-1}$ and not by the previous states.
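A minimal sketch of the counting-and-normalization procedure described above, for a first-order word-level Markov chain, is shown below; the toy corpus and function names are illustrative only.

```python
from collections import defaultdict, Counter

def train_markov(tokens):
    # Count how often each word follows the previous word.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize the counts into transition probabilities P(next | current).
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

def predict_next(model, word):
    # Choose the word with the highest transition probability from the current state.
    return max(model[word], key=model[word].get) if word in model else None

tokens = "ذهب الولد الى المدرسة ثم ذهب الولد الى البيت".split()  # toy corpus
model = train_markov(tokens)
print(predict_next(model, "ذهب"))
```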
Features of the Markov Model
Markov chain models generate Arabic text statistically by examining a sizable text collection to determine the likelihood of word sequences. The Markov technique generates coherent sentences by predicting the most likely next word based on the preceding words. The accuracy and fluency of the model can be increased by using more sophisticated methods like smoothing and higher-order n-grams. However, long-range dependencies are difficult for Markov chain models to capture.
3.5. Data Preprocessing
ARABERT is a transformer-based pre-trained Arabic language model that can be used in different NLP tasks; the model itself is a stacked bidirectional transformer encoder. The ARABERT preprocessor first segments words into stems, prefixes, and suffixes. It also performs tokenization, in which the text is broken into words; normalization, in which diacritics are normalized and the text is converted into a consistent format; removal of punctuation and stop words; and text cleaning, padding, and truncation.
3.5.1. ARABERT-BERT-Base-arabertv02
A pre-trained language model created especially for Arabic is called BERT-base-arabertv02. It is built on the robust BERT architecture, which makes use of a transformer-based embedding method to comprehend the meaning of words in a phrase. Self-attention, positional encoding, and a multi-layer encoder are important components of this embedding method. The transformer encoder processes input sequences using a feed-forward neural network, positional encoding, multi-head attention, and a self-attention mechanism. The comprehension of sentence-level links is enriched using non-linear transformations and pre-training objectives such as Next Sentence Prediction (NSP) and Masked Language Modeling (MLM). This method makes it possible to capture intricate syntactic and semantic links between words in a phrase, producing representations that are more accurate and pertinent to the context.
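As a small illustration of the MLM objective with this checkpoint (assuming the Hugging Face transformers library and the aubmindlab/bert-base-arabertv02 model are available; the example sentence and its completions are ours, not taken from this paper's experiments):

```python
from transformers import pipeline

# Fill-mask pipeline over the pre-trained AraBERT checkpoint.
fill_mask = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv02")

# "The capital of Lebanon is [MASK]" — the model ranks candidate tokens for the masked slot.
for prediction in fill_mask("عاصمة لبنان هي [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 3))
```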
There are other methods that can also be used, for example, TF-IDF. Although mostly a text-based method, it may be modified for Arabic by employing suitable tokenization and preprocessing strategies to manage morphological variances and diacritical marks. To capture the syntactic and semantic links between Arabic words, word2vec and GloVe are word embedding approaches that can be taught on sizable Arabic text corpora. FastText is another method; because it can handle sub-word information, which is essential for languages with complex morphology, this method works especially well for Arabic [
33]. After ARABERT preprocessing, the sample text is provided as input to the encoder.
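As an illustrative sketch only (not the sample output from our experiments), ARABERT preprocessing and tokenization can be invoked roughly as follows, assuming the arabert package and the aubmindlab/bert-base-arabertv02 checkpoint:

```python
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv02"

# Normalization, cleaning, and (for the pre-segmented ARABERT variants) morphological segmentation.
arabert_prep = ArabertPreprocessor(model_name=model_name)
clean_text = arabert_prep.preprocess("ذهب الولد إلى المدرسة صباحا")

# WordPiece tokenization of the preprocessed text for the encoder.
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(clean_text)
print(tokenizer.tokenize(clean_text))
```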
3.5.2. NLTK
NLTK offers text processing capabilities such as tokenization, stemming, lemmatization, and part-of-speech tagging, and supports text analysis methods including sentiment analysis, text classification, and topic modeling. Furthermore, NLTK provides a set of corpora and lexical resources for a variety of NLP applications, as well as language modeling capabilities such as n-gram models.
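A small example of these NLTK capabilities (tokenization and n-gram extraction), assuming the punkt tokenizer data has been downloaded; the sentence is an arbitrary illustration:

```python
import nltk
from nltk.util import ngrams

nltk.download("punkt", quiet=True)  # tokenizer models (newer NLTK versions may also need "punkt_tab")

text = "ذهب الولد الى المدرسة صباحا"
tokens = nltk.word_tokenize(text)   # tokenization
trigrams = list(ngrams(tokens, 3))  # n-grams usable for simple language modeling
print(tokens)
print(trigrams)
```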