1. Introduction
Uyghur and Kazakh are morphologically rich agglutinative languages in which words are formed by a root (stem) followed by suffixes; the vocabulary size of these languages is therefore huge. Around 10 million Uyghur people and 1.6 million Kazakh people live in northwestern China. Officially, Arabic scripts are used for both languages, while Latin scripts are also widely used on the Internet, especially in mobile communications and on social networks. The grammar and lexical structure of Uyghur and Kazakh are basically the same. Words in these languages are naturally separated in sentences and are relatively long, as productive suffixes extend the stems (roots) semantically and syntactically, as shown in the example below:
Uyghur Arabic script form.
Uyghur Latin script form. musabiqidA musabiqiniN vaHirqi musabiqA numurini velip, tallanma musabiqidin GAlbilik vOtti.
Kazakh Arabic script form.
Kazakh Latin script form. jaresta jaresneN soNGe jares nomeren alep, taNdaw jarestan jENespEn votte.
As stems are independent word units that carry lexical meaning, while affixes provide grammatical functions in Uyghur and Kazakh, morpheme segmentation enables us to separate stems, remove syntactic suffixes as stop words, and thereby reduce the noise and feature dimensionality of Uyghur and Kazakh texts in classification tasks. After morpheme segmentation, the example sentences above take the following form:
Uyghur morpheme segmentation. musabiqA + dA musabiqA + niN vaHir + qi musabiqA numur + i + ni val + ip, talla + an + ma musabiqA + din GAlbA + lik vOt + ty.
Kazakh morpheme segmentation. jares + ta jares + neN soNGe jares nomer + en al + ep, taNdaw jares + tan jENespEn vot + te.
There are 10 words in each of the above Uyghur and Kazakh sentences, and the stems of four of those words are /musabiqA/ (match) and /jares/ (match), respectively. After morpheme segmentation and stem extraction, only one stem is extracted from these four words, so the number of features is greatly reduced, as shown in Table 1.
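To make this feature reduction concrete, below is a minimal sketch of lexicon-based suffix stripping applied to the four Latin-transcribed /musabiqA/ word forms above. The toy suffix list, the one-entry stem lexicon, and the single vowel-restoration rule (stem-final A surfacing as i before a suffix) are illustrative assumptions, not the actual morpheme segmentation method used in this paper.

```python
# A minimal sketch of lexicon-based suffix stripping for the Latin-transcribed
# Uyghur example above. The suffix list, stem lexicon, and the vowel-restoration
# rule are illustrative assumptions, not the paper's morphological analyzer.

SUFFIXES = ["dA", "niN", "din", "ni"]   # toy case suffixes
STEM_LEXICON = {"musabiqA"}             # known stems

def restore_vowel(candidate: str) -> str:
    # Toy vowel-weakening rule: stem-final A surfaces as i before a suffix.
    return candidate[:-1] + "A" if candidate.endswith("i") else candidate

def extract_stem(word: str) -> str:
    if word in STEM_LEXICON:
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = restore_vowel(word[: -len(suffix)])
            if candidate in STEM_LEXICON:
                return candidate
    return word  # fall back to the surface form

words = ["musabiqidA", "musabiqiniN", "musabiqA", "musabiqidin"]
stems = [extract_stem(w) for w in words]
print(stems)            # ['musabiqA', 'musabiqA', 'musabiqA', 'musabiqA']
print(len(set(stems)))  # 1 -- four surface forms collapse to one feature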
The Uyghur and Kazakh Arabic letters corresponding to the Latin letters are shown in Table 2, where La, Uy, and Ka in the table headings stand for Latin, Uyghur, and Kazakh, respectively.
Both Uyghur and Kazakh are written as they are pronounced, which leads to personalized spellings of words, especially of less frequent words and terms. The main problems in natural language processing (NLP) tasks for Uyghur and Kazakh are the scarcity of resources and the derivational morphology of their language structure. Data collected from the Internet are noisy and uncertain in terms of coding and spelling [1]. Generally, Internet data for these low-resource languages suffer from highly uncertain writing forms, due to the deep influence of the major languages, Chinese and English [2]. This influence is greatly aggravated by the rapid development of information technology, which triggers a broad spectrum of cross-lingual and cross-cultural interaction, leading to the unceasing coining of new words and new concepts. Most of these new items are borrowed from Chinese and English, and their integration takes forms that are full of noise caused by differing spelling habits [3]. Dialects and uncertainty in spelling and coding pose a big challenge for reliably extracting features from, and classifying, short and noisy text data.
Previous works [4] on stem extraction for Uyghur and Kazakh texts are mostly based on simple suffix-stripping stemming methods and hand-crafted rules, which suffer from ambiguity, particularly on short texts. Stem extraction methods that exploit sentence-level or longer context can extract stems and terms accurately in noisy Uyghur and Kazakh texts, and thus reduce ambiguity in a noisy text environment.
Text classification approaches based on convolutional neural networks (CNN) have been extensively studied on major languages such as English and Chinese. P. Wang et al. [5] proposed a semantic clustering and CNN-based short text classification method: a fast clustering approach finds semantic cliques by clustering word embeddings, the semantic units that meet a preset threshold are used to build semantic matrices, and these are then fed into a CNN. In their experiments, more than 97% classification accuracy was obtained on the TREC (Text REtrieval Conference) question data set when using the GloVe (Global Vectors for Word Representation) model to pretrain word embeddings. R. Johnson and T. Zhang [6] proposed a text classification method based on a CNN with multiple convolution layers. The CNN was applied directly to high-dimensional text data and learned local features of small text regions for classification, with a bag-of-words (BoW) conversion performed in the convolution layer [6]; an excellent error rate of 9.33% was obtained in a topic classification experiment on the RCV1 (Reuters Corpus Volume 1) data set with 103 topics. M. Zhu and X.D. Yang [7] proposed a Chinese text classification method based on CNN. They added an attention mechanism after the pooling layer to improve the CNN model proposed by Kim [8], and achieved 95.15% classification accuracy on Sohu news data with nine classes.
Some works on Uyghur and Kazakh text classification have been reported in [4,9,10]. Tuerxun et al. [4] conducted text classification on Uyghur text using KNN (K-Nearest Neighbor) as the classifier and the TFIDF (Term Frequency-Inverse Document Frequency) algorithm to calculate feature weights. Imam et al. [9] used the TextRank algorithm to select features in a sentiment classification experiment on Uyghur text, with SVM (Support Vector Machine) as the classifier. Yergesh et al. [10] performed sentiment classification on Kazakh text based on linguistic rules of Kazakh. The methods above all follow the traditional classification framework, in which the machine learning process is shallow and either ignores the contextual relationships between words in the text or relies on simple rules, so they are problematic for noisy text.
Automatic text classification (ATC) is a supervised learning process that classifies large amounts of unstructured text data into specific categories according to a given classification system and the content of the text [11,12,13]. It is widely used in sentiment classification [14], spam filtering [15], and Web search [16]. Compared with traditional classification models, a CNN requires relatively little preprocessing, because the network can learn the filters that are hand-crafted in a traditional classification framework; a CNN can thus be used both to learn features and to classify data.
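To illustrate this point, the following is a minimal Kim-style text CNN [8] sketched in Keras. The vocabulary size, sequence length, filter configuration, and number of classes are assumed values for illustration, not the settings used in the works cited above.

```python
# A minimal Kim-style text CNN sketch in Keras: convolution filters over
# windows of word embeddings are learned jointly with the classifier,
# replacing hand-crafted features. All sizes below are assumptions.

import tensorflow as tf

VOCAB_SIZE, SEQ_LEN, EMB_DIM, NUM_CLASSES = 20000, 100, 128, 9

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Parallel convolutions with window sizes 3, 4, 5: each filter acts as a
# learned n-gram detector; max-pooling keeps its strongest response.
pooled = []
for window in (3, 4, 5):
    c = tf.keras.layers.Conv1D(100, window, activation="relu")(x)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(c))

x = tf.keras.layers.Concatenate()(pooled)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Each bank of 100 filters is learned from data rather than designed by hand, which is what allows the CNN to serve as both the feature extractor and the classifier.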
Frequently used text feature representation methods include BoW [17], TFIDF [18], and LDA (Latent Dirichlet Allocation) [19]. In this paper, we propose a subword and stem vector-based short text classification method for Uyghur and Kazakh. We use a word (stem) embedding method to extract text features, weight the feature vectors with the TFIDF algorithm to better represent Uyghur and Kazakh texts, and then use a CNN as the feature selection and text classification algorithm to obtain a Uyghur and Kazakh text classification model, which we evaluate in a classification experiment on corpora collected from the Internet.
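As a rough sketch of the proposed representation step, the snippet below weights stand-in stem embeddings with TFIDF scores to form document vectors. The toy corpus, the 4-dimensional random embeddings, and the weighted-average scheme are illustrative assumptions, not the exact pipeline used in this paper.

```python
# A minimal sketch of TFIDF-weighted stem embeddings: each document is
# represented by the TFIDF-weighted average of its stem vectors. The toy
# corpus and random embeddings are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "musabiqA numur GAlbA",   # stems after morpheme segmentation
    "jares nomer jENes",
    "musabiqA GAlbA jENes",
]

vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(corpus).toarray()  # shape: (docs, vocab)
vocab = vectorizer.get_feature_names_out()

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))       # stand-in stem vectors

# Document vector = TFIDF-weighted sum of stem embeddings, normalized by
# the total TFIDF mass of the document.
doc_vectors = tfidf @ embeddings / tfidf.sum(axis=1, keepdims=True)
print(doc_vectors.shape)                            # (3, 4)
```

In the full method, the random vectors above would be replaced by pretrained word (stem) embeddings, and the resulting weighted representations would be fed to the CNN classifier.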