1. Introduction
Arabic is not only a major world language, spoken natively by approximately three hundred million people primarily in the Middle East and North Africa (MENA), but also the liturgical language of two billion Muslims globally. It features one of the most widely used writing systems in the world. This script transcends its native speakers, extending throughout the Islamic world, as it is employed to write the Qur'an, the holy book of Muslims [1].
In the contemporary Arab region, there has been a marked escalation in the use of Arabic dialects for informal written communication and interactions on social media platforms [2,3,4,5,6]. This trend has led to exponential growth in the volume of Dialectal Arabic content across these digital venues, as evidenced by recent studies [7,8,9,10]. Consequently, this surge has catalyzed considerable interest among researchers in Arabic Natural Language Processing (NLP), and a robust initiative is underway to cultivate annotated linguistic resources specifically designed for these dialects. The primary objectives of this initiative are to enrich our understanding of the linguistic intricacies of Dialectal Arabic (DA) [11] and to accelerate the development of specialized tools and applications for its processing [12,13].
Despite the broad spectrum of tools and resources dedicated to Arabic NLP, the primary emphasis remains on Modern Standard Arabic (MSA), the lingua franca used throughout the Arab world, which is deeply rooted in Classical Arabic [14,15,16,17,18,19,20,21,22,23]. However, adapting these NLP resources to DA introduces significant challenges, stemming from the substantial linguistic divergences between the various dialects and MSA [24,25]. This situation highlights a critical gap in the applicability of existing tools to the linguistic realities of the Arab region, underlining the need for targeted research and development efforts to bridge these disparities.
Arabic dialects diverge significantly from MSA in several linguistic domains, including morphology, phonology, and syntax [4,26,27]. Furthermore, the lexicon varies considerably across Arabic dialects, and most dialects do not employ standardized orthographies [6,28]. While the development of resources for NLP tasks in DA remains nascent in comparison to MSA [29,30], there are ongoing initiatives to construct dialect-specific tools and resources, including annotated corpora and morphological analyzers tailored to particular dialects [31]. Notably, the development of NLP tools for the Egyptian and Levantine dialects has progressed more substantially [32,33,34]. Despite the extensive presence of dialectal content online, Gulf Arabic still suffers from a significant deficiency in NLP tools and resources, highlighting a critical area for further linguistic research [4,6,35,36].
Arabic dialects are predominantly classified into several principal groups based on geographical region, including Gulf, Egyptian, Levantine, North African (Maghrebi), Iraqi, Yemeni, and Sudanese dialects [37,38,39,40]. The Gulf dialect comprises the linguistic variants spoken in countries adjacent to the Arabian Gulf, such as Saudi Arabia, Kuwait, Qatar, the United Arab Emirates (UAE), Bahrain, and Oman; see Figure 1a. The Egyptian dialect encompasses the varieties found primarily in Egypt and select areas of Sudan. The Levantine dialect is spoken mainly in the Levant region, including Palestine, Syria, Lebanon, and Jordan. The Maghrebi dialect includes a range of dialects spoken across North Africa, excluding Egypt, covering countries such as Morocco, Algeria, Tunisia, and Libya. The Iraqi dialect refers to the variety prevalent in Iraq [41], and the Yemeni dialect is used in Yemen. In Saudi Arabia, a country characterized by substantial geographical diversity, there is significant variation in dialects across regions; see Figure 1b [4,42]. Historically, the Saudi dialect has tended not to be recognized as a distinct linguistic entity, typically being categorized within the broader Gulf dialects [43].
The profound social and political transformations occurring in the Gulf region, particularly in Saudi Arabia, underscore the need for a dedicated Saudi Dialect Corpus. Saudi Arabia, characterized by its conservative and autocratic nature, is home to a predominantly young population, with about 50% under the age of 25. Transformative government-sponsored reforms have sparked significant debates within society, prominently featured on social media platforms such as Twitter due to restrictions on public debate and protests [10].
The need for a dedicated corpus is exemplified by the intense discussions on Twitter that followed the Saudi government's decision to lift the driving ban for women in September 2017. This landmark policy shift ignited widespread debate and garnered support, as documented in [44]. Researchers gathered and analyzed tweets and hashtags in the local Saudi dialect from the initial days after the ban was lifted, providing valuable insights into public sentiment on this crucial issue. This work underscores the necessity of a specialized corpus that accurately captures the unique linguistic nuances of the Saudi dialect, which is vital for precise sentiment analysis and in-depth cultural research.
A proper Saudi Dialect Corpus would thus not only facilitate a deeper understanding of the public discourse and sentiment in the region but also enhance the development of tailored NLP tools. These tools are essential for accurately interpreting and responding to the nuanced language used in social media, which is often imbued with cultural and regional specificities not covered by standard Arabic corpora. This makes the development and availability of a dedicated Saudi Dialect Corpus crucial for researchers working on Arabic NLP, especially in applications involving sentiment analysis, social monitoring, and cultural studies.
In Saudi Arabia, the dialectal landscape is distinguished by the prevalence of specific regional dialects: Hijazi in the western region, Najdi in the central region, the Southern dialect in the southern territories, the Northern dialect in the northern areas, and the Eastern dialect in the eastern part of the Arabian Peninsula. This research paper outlines our efforts in constructing an extensively annotated corpus of Gulf dialectal Arabic. We utilized data sourced from X, formerly known as Twitter (throughout this paper, we will continue to use the terms "Twitter" and "tweets", as there are no widely accepted substitutes for the latter). The data extraction process incorporated advanced techniques for textual analysis, notably feature extraction, and leveraged geographical metadata embedded within user profiles to ensure accurate representation of the diverse dialectal variations across the specified regions. This methodology enhances the precision and robustness of our data processing and annotation efforts, contributing significantly to the depth and quality of the resulting linguistic analysis, and supports AI-driven frameworks more broadly by improving data utilization and the reliability of downstream applications.
This study introduces two pivotal resources: the Gulf Arabic Corpus (GAC-6) and the Saudi Dialect Corpus (SDC-5). The GAC-6 is an expansive corpus comprising approximately 1.7 million Arabic tweets that reflect the dialectal nuances of the Gulf region, encompassing Saudi Arabia, Bahrain, Kuwait, Oman, Qatar, and the Emirates. Concurrently, the SDC-5 is a comprehensive annotated corpus dedicated to Saudi dialects, encapsulating linguistic variations across five key Saudi dialects: Hijazi, Najdi, Southern, Northern, and Eastern. Our methodology includes the meticulous manual annotation of a data subset with specific dialect labels, establishing a gold standard for dialect identification. The resources elucidated in this paper hold significant potential not only for advancing NLP applications such as machine translation but also for facilitating in-depth linguistic analysis of Gulf and Saudi dialects. This contributes to a broader understanding of regional linguistic diversity, aiding both academic research and practical applications in computational linguistics.
The objective of this research is to augment the domain of language resources by developing comprehensive linguistic datasets for both Gulf and Saudi dialects. This initiative leverages the substantial repository of Arabic textual content available on social media platforms, with a specific focus on Twitter. The contributions of this paper are manifold and include the following:
The development of extensive corpora of Gulf and Saudi dialects, sourced from Twitter and automatically tagged with dialect labels. These corpora are designed to support and enable a wide range of future Arabic NLP research endeavors.
The employment of native speakers for the manual annotation of segments of the datasets, thereby validating the accuracy of the automatically assigned dialect labels. This step ensures that the linguistic characteristics of each dialect are accurately captured.
The evaluation of the corpora’s quality through the measurement of inter-annotator agreement, quantified using the Kappa statistic. This assessment ensures the reliability and consistency of the annotation process, confirming that the data are robust for academic and practical applications in computational linguistics.
The structure of this paper is organized as follows: Section 2 reviews the relevant literature and related studies. Section 3 details the methodology used in compiling, preprocessing, and annotating the corpora. Section 4 provides an in-depth discussion of the compiled corpora, including general statistics, and assesses the quality of the annotations. Finally, Section 5 offers concluding remarks and outlines potential directions for future research.
2. Related Work
This section provides an overview of the literature pertaining to Arabic dialectal corpora, emphasizing the significant contributions and key findings within this research domain.
The study of Arabic dialects has garnered considerable interest in recent years, and a myriad of Arabic dialectal corpora has been developed, serving as invaluable resources for the linguistic analysis of these dialects. There has also been a growing effort toward aggregating datasets and building comprehensive corpora from a variety of sources, thereby facilitating in-depth investigations into Arabic dialectology.
A seminal piece of research in the realm of corpus development was conducted by [45], who introduced a pioneering dataset dedicated to Dialectal Arabic, known as the Arabic Online Commentary (AOC) Dataset. The AOC Dataset represents the inaugural dialectal corpus made available to the academic community, comprising approximately 52 million words derived from the comment sections of online Arabic news platforms.
Several studies, such as those documented by [46], have utilized datasets that were manually annotated through crowdsourcing efforts. In particular, Alsarsour et al. [46] introduced the Dialectal Arabic Tweets (DART) dataset, a collection of approximately 25,000 Arabic tweets that were manually annotated via crowdsourcing. This dataset is characterized by its balanced representation across five major Arabic dialect groups: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi. Furthermore, Zaghouani and Charfi [47] developed a multidialectal corpus, also annotated manually through crowdsourcing, which includes around 2.4 million tweets originating from 11 Arab regions: North Levant, South Levant, Egypt, Gulf, Morocco, Tunisia, Algeria, Yemen, Iraq, Libya, and Sudan. Sadat et al. [48] contributed a sentence-level manually annotated dataset containing about 62,000 sentences sourced from online blogs across various Arab nations.
Khalifa et al. [33] compiled a substantial Gulf Arabic corpus consisting of 1200 forum novels, with annotations at the document level derived from the novels' titles and authors' names. Alshutayri and Atwell [49] outlined a methodology for constructing an Arabic dialect text corpus from social media platforms such as Twitter and Facebook, as well as comments from newspapers, with each entry receiving a dialect tag through crowdsourcing. Some scholars have also created parallel corpora, which consist of sentences translated into Arabic dialects from other datasets. An exemplar of such a corpus is presented in [50], where the authors unveiled the Multidialectal Parallel Corpus of Arabic (MPCA), a compilation of approximately 2000 sentences manually translated from Egyptian Arabic into various other dialects.
Moreover, Bouamor et al. [2] introduced the Multi Arabic Dialect Applications and Resources (MADAR) corpus, a parallel corpus encompassing 25 dialects from Arabic-speaking cities across 15 Arab countries. This corpus was developed through the manual translation of selected sentences from the Basic Travel Expression Corpus (BTEC) [51]. Notably, the corpus lacks representation of certain Gulf Arabic dialects, specifically those from Bahrain, Kuwait, and the United Arab Emirates.
Other studies have employed a semi-automatic approach to corpus annotation. Mubarak and Darwish [52] developed a multidialectal Arabic Twitter corpus consisting of 6.5 million tweets, annotated based on distinct dialects and the geographical locations of users. Similarly, Abu Kwaik et al. [24] introduced the Shami Dialects Corpus (SDC), which encompasses four dialects (Palestinian, Jordanian, Lebanese, and Syrian) and contains approximately 118,000 sentences. This corpus was compiled from Arabic tweets, with automatic annotations derived from the Twitter API's geographical location feature, supplemented by manual annotations for web content.
Furthermore, some datasets have been collected and labeled entirely through automated processes. Abdul-Mageed et al. [3] unveiled a vast corpus of tweets representing city-level dialects from 29 Arab cities across 10 Arab countries, with diverse dialectal features; the annotation for this corpus was conducted automatically using a Python geocoding library. Additionally, Abdelali et al. [25] created a balanced, non-genre-specific, country-level Arabic dialectal tweet corpus. This corpus was generated using a series of filters: user accounts were selected based on country-specific keywords, and tweets were further filtered to exclude users predominantly using MSA. The final corpus comprises 540,000 tweets from 2525 Twitter users, along with a test set of 182 tweets per country that were manually classified by native Arabic speakers.
Conversely, certain dialectal corpora have concentrated on specific dialects, with a focus on morphological annotation. An exemplar is the work by [53], who introduced the Saudi corpus for NLP Applications and Resources (SUAR), targeting the Saudi dialect. This corpus encompasses 104,000 words sourced from various online social media platforms, and it underwent morphological annotation via the MADAMIRA tool [54]. This initial automated annotation was subsequently subjected to manual review to ensure accuracy and validate the analysis.
Alowisheq et al. [55] unveiled the Multi-domain Arabic Resources for Sentiment Analysis (MARSA), a sentiment-annotated corpus specific to the Gulf dialect. This corpus comprises 61,000 tweets, each manually annotated with sentiment labels by two independent annotators, ensuring the reliability of the sentiment assessment.
Further, Elgibreen et al. [56] introduced the King Saud University Saudi Corpus (KSUSC), a comprehensive new corpus containing over 161 million sentences harvested from a variety of sources. While this corpus is extensive and spans multiple domains, it lacks annotations, presenting a vast but unstructured resource for linguistic analysis. The study also reviewed existing Arabic corpora, highlighting the need for more in-depth research into corpora representing the Saudi dialect.
Moreover, Alruily [57] developed a dialectal Saudi Twitter corpus and analyzed its linguistic peculiarities, such as compounding, abbreviation, spelling discrepancies, and the emergence of neologisms, shedding light on the unique challenges associated with processing this dialect. Lastly, Al-Ghadir and Azmi [58] capitalized on the vibrant social media environment of Saudi Arabia to investigate the posting patterns of local users, delineating these behaviors by gender and educational attainment. Concentrating on author profiling in this milieu, their study provides an understanding of demographic trends, thereby enhancing the comprehension of dialectal subtleties as manifested on social media platforms.
3. Our Methodology
Social media platforms and microblogging sites, notably Twitter, have emerged as significant repositories of natural language textual data, offering a rich vein of content for research purposes [59,60]. Twitter, in particular, is integrated into the daily routines of millions, positioning it as one of the foremost social media networks currently in operation [61]. The platform facilitates the aggregation of a substantial volume of text, contributed by a diverse array of Arab speakers who often express their thoughts and opinions in their respective Arabic dialects [62].
Tweets are typically short and informal, frequently composed in the users' own dialects, and reveal a plethora of spoken-language features. This makes them an invaluable resource for the study of Arabic dialects, offering insights into the linguistic nuances and vernacular expressions prevalent within the Arab-speaking community [63,64].
Our goal is to develop two distinct corpora: the first, GAC-6, targets the dialects commonly spoken in the Arabian Gulf, encompassing Saudi Arabia, Bahrain, Kuwait, Oman, Qatar, and the UAE. The second, SDC-5, focuses on the five principal dialects within Saudi Arabia—Hijazi, Najdi, Southern, Northern, and Eastern. Both corpora are compiled from data collected from Twitter.
In the development of the GAC-6 and the SDC-5, we maintained stringent ethical standards and emphasized privacy protection throughout the data collection phase. All data used in this study were derived from publicly available Twitter posts, adhering to Twitter’s data usage policies. We collected no private information, such as direct messages or personal identifiers, beyond what is publicly visible on user profiles. To enhance privacy protection, all usernames and profile details were anonymized and omitted from the dataset. Additionally, the data were aggregated to eliminate the possibility of tracing back to individual users. We also took precautions to ensure that no sensitive data, which could potentially disclose the identities of individuals or communities, were gathered.
Next, we will explore the details of the corpora development process. This includes discussions on compilation and preprocessing techniques, annotation methodologies, and the protocols used for corpus validation, ensuring a comprehensive understanding of the corpora’s foundational integrity.
3.1. Compilation and Preprocessing of the Corpora
Twitter provides an excellent Application Programming Interface (API) and development platform, featuring a comprehensive suite of tools and meticulously crafted documentation. This framework enables researchers to tap into the vast reservoir of social media content and extract additional metadata, such as geographical information [65,66]. Consequently, Twitter was chosen as the primary data source for the construction of the envisaged corpora. Data collection was facilitated by the Twitter REST APIs, which are accessible through Twitter user credentials via Open Authentication (OAuth) [67]. Leveraging "Tweepy", a Python library dedicated to tweet retrieval via Twitter's streaming API, we amassed thousands of tweets characterized by the use of dialectal expressions typical of Gulf Arabic and Saudi dialect speakers [68,69]. Our methodology was predicated on the extraction of dialect-specific tweets through the application of filters based on seed words pertinent to each dialect. This approach circumvents the need to sift through an extensive volume of Arabic tweets and subsequently undertake a labor-intensive annotation process. Algorithm 1 outlines our approach for the automated construction of an annotated dialectal corpus.
The initial step in our corpus construction involved the identification of seed words for each dialect. Seed words are defined as terms that are frequently and predominantly used within a particular dialect and not found in others [37,46]. We focused on unique expressions characteristic of each dialect, used exclusively by its native speakers, to compile the corpus. These dialect-specific terms were employed as query parameters, coupled with the "lang = ar" filter to confine the search to Arabic tweets, facilitating the collection of a real-time tweet stream from speakers of the targeted dialects [49]. Each tweet was then annotated with the user's name and location.
To enhance the reliability of the dialect attribution, we incorporated geographical information from the Twitter profiles associated with each tweet, ensuring that the tweets originated from within the designated dialectal regions. The corpus was further expanded by aggregating additional tweets from each user’s profile. We compiled a list of user profiles pertinent to each of the six Gulf dialect regions and the five Saudi dialects by identifying tweets containing specific seed words and dialectal expressions exclusive to a single dialect. Only users with a minimum of 1000 tweets were considered, from which we downloaded up to 500 tweets per account. During the collection process, duplicate tweets and retweets were excluded to maintain the corpus’s uniqueness.
Algorithm 1: A high-level algorithm for the automatic construction of the proposed corpus. During construction, we rely on four components of a tweet: the text, author, timestamp, and location.
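To make the construction flow concrete, the following is a minimal Python sketch of the steps that Algorithm 1 describes, written against Tweepy's cursor-based interface for the Twitter v1.1 API. The credentials, seed lists, region names, and helper names here are illustrative assumptions rather than our production code; the actual seed lexicons come from the sources described below.

```python
import tweepy  # Python library for tweet retrieval

# Placeholder OAuth credentials, as required by the Twitter REST APIs.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Illustrative seed lexicon; e.g., "watermelon" is جح in Saudi and رقي in Kuwaiti.
SEEDS = {"Saudi": ["جح"], "Kuwaiti": ["رقي"]}              # one list per dialect
REGIONS = {"Saudi": ["saudi", "riyadh"], "Kuwaiti": ["kuwait"]}

corpus, seen = [], set()

def in_region(profile_location, dialect):
    """Match the user's self-reported profile location against the dialect region."""
    loc = (profile_location or "").lower()
    return any(name in loc for name in REGIONS[dialect])

for dialect, seeds in SEEDS.items():
    for seed in seeds:
        # Step 1: search for Arabic tweets ("lang = ar") containing a seed word.
        for hit in tweepy.Cursor(api.search_tweets, q=seed, lang="ar",
                                 tweet_mode="extended").items(200):
            user = hit.user
            # Step 2: require the profile location to fall in the dialect's region,
            # and consider only active accounts with at least 1000 tweets.
            if not in_region(user.location, dialect) or user.statuses_count < 1000:
                continue
            # Step 3: expand the corpus with up to 500 tweets from the user's timeline.
            for t in tweepy.Cursor(api.user_timeline, user_id=user.id,
                                   tweet_mode="extended").items(500):
                text = t.full_text
                if text.startswith("RT @") or text in seen:  # skip retweets/duplicates
                    continue
                seen.add(text)
                # Keep the four tweet components the algorithm relies on.
                corpus.append({"text": text, "author": user.id,
                               "timestamp": t.created_at, "dialect": dialect})
```

The dialect label attached in the last step is the automatic annotation described above: it is justified jointly by the seed word that surfaced the account and by the matching profile location.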
The dialect classification methodology was twofold: initially, it relied on the presence of dialect-specific terms within the tweets; subsequently, it required a match between the user's profile location and the designated dialect region. The general architecture of our corpus construction is presented in Figure 2.
Our methodology was fundamentally anchored in the utilization of dialect-specific terminology. For instance, the term "watermelon" is rendered as جح in the Saudi dialect and as رقي in the Kuwaiti dialect. Employing this strategy enabled us to refine our tweet selection, ensuring a focus on the specific dialects of the Gulf region and Saudi Arabia. The seed words and dialectal expressions for each Saudi dialect were compiled from https://lahajat.blogspot.com/p/blog-page_7.html (accessed on 9 June 2023). The seed words for the Gulf dialects, encompassing Kuwait, Bahrain, Qatar, the United Arab Emirates, Saudi Arabia, and Oman, were derived from Mo3jam (https://en.mo3jam.com/, a dictionary of colloquial Arabic/Arabic slang, accessed on 24 October 2023). A selection of these seed words from each Gulf dialect, accompanied by representative tweet examples, is listed in Table 1. Furthermore, Table 2 presents illustrative tweets in the Saudi dialect, demonstrating the application of dialect-specific terms in authentic social media discourse.
Given the noise and extraneous information typical of data sourced from social media, data cleaning and preprocessing are crucial [53,70]. The primary objective of preprocessing the extracted data was to cleanse it of noise and irrelevant content, thereby enhancing the dataset's quality for more precise dialect identification tasks [46].
The initial phase of preprocessing involved manually discarding tweets we identified as advertisements. Following this manual filtering, the tweets were subjected to a systematic preprocessing regimen, as shown in Algorithm 2 [71]. This algorithmic approach refined the dataset into a polished corpus: the Gulf Arabic Corpus was reduced to approximately 1.7 million tweets, while the Saudi Dialect Corpus was cut to about 790,000 tweets, ensuring a cleaner, more focused dataset for subsequent analysis.
Algorithm 2: Preprocessing the tweets.
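To illustrate the kind of cleaning this stage performs, the sketch below chains conventional Arabic tweet-preprocessing steps: removal of URLs, mentions, hashtags, and non-Arabic characters; stripping of diacritics; normalization of common letter variants; and deduplication. The specific regular expressions and the three-word minimum length are assumptions chosen to be consistent with the corpus statistics reported in Section 4, not a verbatim transcription of Algorithm 2.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")          # Arabic tashkeel marks
NOISE = re.compile(r"https?://\S+|@\w+|#\w+|\bRT\b")       # URLs, mentions, hashtags, RT

def normalize(text: str) -> str:
    """Collapse common Arabic orthographic variants (alef, teh marbuta, yeh)."""
    text = re.sub("[إأآا]", "ا", text)
    return text.replace("ة", "ه").replace("ى", "ي")

def preprocess(tweets):
    cleaned, seen = [], set()
    for text in tweets:
        text = NOISE.sub(" ", text)                      # drop links, mentions, hashtags
        text = DIACRITICS.sub("", text)                  # drop diacritical marks
        text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # keep Arabic script only
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # squeeze elongations (e.g., ههههه)
        text = normalize(text)
        text = re.sub(r"\s+", " ", text).strip()
        if len(text.split()) < 3 or text in seen:        # drop very short tweets, duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```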
3.2. Annotating the Corpus
In this section, we describe the methodology adopted for the annotation of the corpus, detailing the procedural steps and tools employed for annotating each tweet within the corpus, followed by the strategies implemented to assess the quality of the annotation task.
The genesis of a corpus extends beyond mere data accumulation; it encompasses a rigorous process of data verification and validation to ensure the corpus's reliability and applicability [72,73,74]. The primary goal of this endeavor is to forge a dialectal corpus of tweets, distinguished by high-quality annotations, to serve as a resource for scholars engaged in the study of Arabic dialects within the domain of Arabic NLP. This includes, but is not limited to, investigations pertaining to Arabic dialect identification.
The success of dialect identification is intrinsically linked to the precision of the annotation outcomes [75]. To evaluate the quality of our corpus, constructed and annotated via the proposed algorithm, we conducted a manual annotation exercise. From each region represented in the Arabian Gulf corpus, we randomly selected 2000 tweets, resulting in a total of 12,000 tweets. Similarly, for the Saudi corpus, a total of 10,000 tweets were chosen for manual annotation. For this task, we utilized Label Studio (https://labelstud.io/, accessed on 28 September 2023), an open-source data labeling platform renowned for its versatility in annotating, labeling, and preparing diverse datasets.
Within the Label Studio environment, we established a project dedicated to the annotation of tweets. Annotators were presented with tweets and instructed to assign a dialect label to each one. For the GAC-6, the tweets were categorized into one of six dialect labels corresponding to the Gulf regions: Saudi Arabia, Bahrain, Kuwait, Oman, Qatar, and UAE. In the case of the SDC-5, annotators classified each tweet under one of the five designated Saudi labels: Hijazi, Najdi, Southern, Northern, and Eastern. This meticulous process of manual annotation serves as a cornerstone for ensuring the integrity and utility of the corpus for research in Arabic dialect identification and other NLP applications.
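For reference, such a project can also be created programmatically. The snippet below is a minimal sketch using the Label Studio Python SDK with a single-choice labeling configuration for the SDC-5 labels; the server URL, API key, and imported task are placeholders, and the exact SDK calls may differ across SDK versions.

```python
from label_studio_sdk import Client  # pip install label-studio-sdk

# Placeholder connection details for a Label Studio instance.
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")

# Single-choice configuration: each tweet receives exactly one dialect label.
SDC5_CONFIG = """
<View>
  <Text name="tweet" value="$text"/>
  <Choices name="dialect" toName="tweet" choice="single-radio">
    <Choice value="Hijazi"/>
    <Choice value="Najdi"/>
    <Choice value="Southern"/>
    <Choice value="Northern"/>
    <Choice value="Eastern"/>
  </Choices>
</View>
"""

project = ls.start_project(title="SDC-5 dialect annotation",
                           label_config=SDC5_CONFIG)
project.import_tasks([{"text": "..."}])  # replace with the sampled tweets
```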
Almuzaini and Azmi [76] employed crowdsourcing to annotate a large volume of data in MSA. However, they encountered quality issues and lapses due to incompetent or dishonest annotators, particularly when the tasks were poorly defined or required specialized knowledge. Considering the anonymity of crowdsourcers, we chose to interact directly with the annotators to minimize these risks.
For the annotation process, we employed two primary annotators, supported by a third annotator responsible for resolving any discrepancies or ambiguities that may arise during the initial annotation phase. To ensure high-quality and consistent output, all annotators underwent comprehensive training. This training encompassed detailed annotation guidelines, which included definitions, examples, and specific linguistic features characteristic of each dialect. Additionally, the annotators participated in sessions to familiarize themselves with the Label Studio annotation platform, and they practiced annotating sample data, receiving feedback to fine-tune their understanding and application of the guidelines.
To guarantee the accuracy of the annotations, the third annotator reviewed instances where the initial annotations differed. The criteria for annotation focused on identifying distinct lexical items, phonological variations, and syntactic constructions unique to each dialect. Annotators were trained to recognize explicit markers—specific words or phrases characteristic of a dialect—as well as contextual clues like cultural references or idiomatic expressions, ensuring a thorough and nuanced analysis.
Figure 3 illustrates sample instances of the annotation task.
Throughout this process, annotators were instructed to identify the dialect of the text presented to them and mark their selection using the provided checkboxes corresponding to each dialect. It was imperative that the annotators possess native-level proficiency in Arabic to ensure the accuracy of the dialect labeling.
All annotators involved were native Arabic speakers, each with a collegiate level of education. Specifically, for the annotation of the Saudi Dialect Corpus (SDC-5), annotators were recruited from within Saudi Arabia to ensure an intrinsic understanding of the regional dialects. Similarly, the annotation of the Gulf Arabic Corpus (GAC-6) was entrusted to native speakers from the Gulf countries, ensuring an authentic representation of the dialectal nuances.
4. Results and Discussion
4.1. Overview of Our Compiled Corpora
In this study, we have constructed a large-scale corpus consisting of Gulf dialect Arabic text, derived from data sourced from Twitter. For the purposes of this research, the following two corpora have been developed:
The Gulf Arabic Corpus (GAC-6), which encompasses the dialects prevalent in the Gulf region, specifically Saudi Arabia, Bahrain, Kuwait, Oman, Qatar, and the Emirates (UAE). This corpus contains approximately 1.7 million Arabic tweets, offering a broad representation of the linguistic diversity within the Gulf countries.
The Saudi Dialect Corpus (SDC-5), comprising around 790,000 tweets in Saudi Arabic, representing the five main dialects found within Saudi Arabia: Hijazi, Najdi, Southern, Northern, and Eastern. This corpus provides a focused insight into the linguistic variations across different regions of Saudi Arabia.
In the construction of the GAC-6, our initial collection comprised 2.6 million tweets. Following a meticulous cleaning process, which included the removal of redundant tweets, the corpus was reduced to 1.7 million tweets. The final corpus features a diverse distribution of tweets across the Gulf dialects: 330,408 Saudi Arabian, 169,977 Bahraini, 426,771 Kuwaiti, 273,920 Omani, 256,377 Qatari, and 242,590 Emirati tweets. The distribution of tweets by dialect within the Gulf Arabic Corpus is shown in Figure 4a. Notably, the Saudi and Kuwaiti dialects are represented by a larger volume of tweets than the Omani, Qatari, and Bahraini dialects; this discrepancy may be attributed to the higher popularity of Twitter in Saudi Arabia and Kuwait.
For the development of the SDC-5, our objective was to encompass all five dialects prevalent within Saudi Arabia. Prior corpora focusing on Saudi dialects [42] have often omitted the Southern and Northern dialects, attributing this exclusion to their relative scarcity and limited usage in comparison to the Najdi, Hijazi, and Eastern dialects. Contrary to these precedents, our corpus inclusively represents all five Saudi dialects: Hijazi, Najdi, Southern, Northern, and Eastern. As shown in Figure 4b, the Saudi Dialect Corpus comprises a total of about 790,000 tweets, distributed as follows: 116,117 Hijazi, 393,342 Najdi, 183,487 Southern, 53,883 Northern, and 44,029 Eastern tweets. This distribution underscores our commitment to providing a comprehensive representation of the linguistic diversity within Saudi Arabia.
A comprehensive statistical overview of both the GAC-6 and the SDC-5 is shown in Table 3. This summary delineates the number of tweets and the extracted word count for each dialect within both corpora. Additionally, it presents an analysis of tweet lengths across the dialects, specifying the minimum, maximum, and average number of words per tweet.
Within the GAC-6, the average tweet length stands at 11.96 words. The longest tweet in the corpus, containing 48 words, is in the Kuwaiti dialect, while the shortest tweets, found within both the Kuwaiti and Saudi dialects, comprise merely three words. In the SDC-5, the Hijazi dialect contains the longest tweet, encompassing 47 words, while the shortest tweets, each consisting of just three words, are observed in both the Najdi and Hijazi dialects, highlighting the variance in expression and conciseness across the dialects represented in the corpora.
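Given per-tweet text and dialect labels, the per-dialect statistics of Table 3 are straightforward to reproduce; the following is a small pandas sketch, assuming a CSV dump with text and dialect columns (the file name and column names are illustrative).

```python
import pandas as pd

# Hypothetical corpus dump: one row per tweet with its dialect label.
df = pd.read_csv("sdc5.csv")  # columns: text, dialect

df["n_words"] = df["text"].str.split().str.len()
stats = df.groupby("dialect").agg(
    tweets=("text", "size"),      # number of tweets per dialect
    words=("n_words", "sum"),     # total extracted words
    min_len=("n_words", "min"),   # shortest tweet (words)
    max_len=("n_words", "max"),   # longest tweet (words)
    avg_len=("n_words", "mean"),  # average tweet length
)
print(stats.round(2))
```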
While the development of GAC-6 and SDC-5 contributes valuable resources to Arabic NLP, it is essential to acknowledge several limitations of this study. Data collection was limited to publicly available Twitter posts, which might not capture the complete spectrum of dialectal variations due to user demographics and the informal nature of Twitter content, such as the brevity of tweets. Additionally, the representation of specific dialects, particularly the Southern and Northern Saudi dialects, is somewhat underrepresented in the corpora, potentially limiting the generalizability of our findings to all dialects. Lastly, despite the use of both manual and automated annotation techniques aimed at ensuring high-quality data, variability in inter-annotator agreement for some dialects persists, suggesting a need for further refinement of the annotation guidelines and processes.
4.2. Evaluating the Quality of Corpus Annotations
Ensuring the quality of annotations is crucial for the reliability of the corpus and its subsequent utility in developing precise dialect identification models. To assess the annotation quality within our study, we employed inter-annotator agreement (IAA) measures on the dialect annotations of the tweets. IAA measures provide insights into the consistency of annotator choices regarding dialect annotations, underpinning the validity of the annotated data.
The premise is that if annotators exhibit discrepancies in their annotations, it could be indicative of potential challenges for a dialect identification model in accurately classifying those instances [77]. In our research, we quantified inter-annotator agreement using Cohen's Kappa coefficient ($\kappa$), a widely recognized statistical measure designed to evaluate the level of agreement between two annotators beyond chance in classification tasks. Cohen's Kappa is calculated using the following equation, which accounts for the observed agreement and the agreement expected by chance, thereby offering a normalized measure of annotator concordance:
$$\kappa = \frac{P_o - P_e}{1 - P_e},$$
where $P_o$ is the observed agreement among annotators, and $P_e$ is the expected agreement obtained by the random assignment of labels by the annotators during the annotation process. The expected agreement $P_e$ is given by
$$P_e = \frac{1}{N^2}\sum_{T} n_T^{1}\, n_T^{2},$$
where $n_T^{1}$ and $n_T^{2}$ are the numbers of tokens labeled with tag $T$ by annotator 1 and annotator 2, respectively, and $N$ is the total number of annotated tokens.
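As a concrete check of this formula, the snippet below computes $\kappa$ from two annotators' parallel label sequences, both directly from the definitions above and via scikit-learn's cohen_kappa_score; the toy label lists are illustrative only.

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def kappa(labels1, labels2):
    """Cohen's kappa for two parallel label sequences."""
    n = len(labels1)
    # Observed agreement P_o: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels1, labels2)) / n
    # Expected agreement P_e = (1/N^2) * sum_T n_T^1 * n_T^2.
    c1, c2 = Counter(labels1), Counter(labels2)
    p_e = sum(c1[t] * c2[t] for t in c1) / n**2
    return (p_o - p_e) / (1 - p_e)

# Toy annotations of six tweets by two annotators.
a1 = ["Najdi", "Hijazi", "Najdi", "Eastern", "Najdi", "Hijazi"]
a2 = ["Najdi", "Hijazi", "Hijazi", "Eastern", "Najdi", "Najdi"]
print(kappa(a1, a2))              # ~0.4545, from the definition
print(cohen_kappa_score(a1, a2))  # identical result from scikit-learn
```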
The analysis of annotation quality through Cohen’s Kappa revealed a commendable degree of agreement among the annotators for both corpora under study. For the GAC-6, the average Cohen’s Kappa was approximately 78%, indicating a robust level of concurrence. The SDC-5 exhibited an even higher level of annotator consensus, with an average Kappa value of 90%.
The detailed breakdown of inter-annotator agreement for each region within both corpora, as presented in Table 4, showcases varying degrees of agreement across the dialects. For the GAC-6, the Bahraini, Omani, and Qatari dialect annotations fell within the "satisfactory" range, with Kappa values between 0.6 and 0.8. In contrast, the Saudi, Kuwaiti, and Emirati dialects demonstrated "really good" agreement, with Kappa values between 0.8 and 1, underscoring the high reliability of these annotations.
In the context of the SDC-5, the Hijazi dialect annotations achieved a notably high Kappa value of 92%, reflecting near-perfect annotator alignment. The Najdi and Southern dialects also showcased excellent agreement, with Kappa values around 91%, while the Northern and Eastern dialects exhibited very good agreement, with Kappa values at 87% and 89%, respectively. These results underscore the high quality of the annotation process, bolstering the reliability of the corpora for dialect identification research and applications.
5. Conclusions
In this study, we have outlined the methodologies utilized in compiling and developing two substantial linguistic resources for the Arabic language: the Gulf Arabic Corpus (GAC-6) and the Saudi Dialect Corpus (SDC-5). These datasets, richly annotated and multi-dialectal, encapsulate a broad spectrum of linguistic nuances from the Gulf and Saudi regions. GAC-6 comprises approximately 1.7 million labeled tweets from six Gulf dialects, while SDC-5 features around 790,000 tweets that reflect the predominant dialects within Saudi Arabia: Hijazi, Najdi, Southern, Northern, and Eastern.
The annotation strategy combined manual and automated methods, leveraging user location data and dialect-specific seed words to achieve precise dialect identification. A portion of the corpus was subjected to thorough manual annotation, with inter-annotator agreement metrics confirming the quality of the annotations and the overall reliability of the datasets. These resources are set to greatly advance Arabic Natural Language Processing, supporting sophisticated inquiries into dialect identification, sentiment analysis, author profiling, machine translation, and morphological analysis.
Looking ahead, we plan to expand these corpora by incorporating diverse textual data from additional online platforms, thus broadening the scope of dialectal representation. We aim to further refine our annotation methods and enhance the robustness of our techniques. Moreover, by making these comprehensive resources accessible to the global research community, we contribute to the fields of Arabic and computational linguistics. This initiative aligns with the movement toward employing sophisticated data-driven AI technologies to dissect and understand the intricate linguistic framework of the Arabic language, thereby facilitating the development of next-generation AI applications.