1. Introduction
Morphemes are the smallest independent meaningful units in language, and the process of segmenting words into morphemes is called morphological segmentation [1]. Due to the derivational characteristics of agglutinative languages, morphological segmentation has become a fundamental task in agglutinative language processing. Agglutinative languages are languages in which words are typically formed by stringing together a sequence of morphemes; according to the different meanings they express, the morphemes of an agglutinative word can be divided into a stem and affixes [2]. The stem expresses the core meaning of the word, whereas the affixes express grammatical information. Uyghur and Kazakh are typical agglutinative languages with extremely rich morphology [3] and a rich set of affixes expressing derivation or inflection. The lexical and grammatical structures of the two languages are very similar [4], and their word formation rule can be represented as stem + suffix1 + suffix2 + … + suffixN [5]. Stemming, as an extension of morphological segmentation, aims to extract the morpheme that expresses the meaning of a word and remove the morphemes that express grammatical meanings [5].
Figure 1 shows examples of morphological segmentation and stemming for Uyghur and Kazakh. In Figure 1, the first line contains the words, and the second to fifth lines present the analysis results following the Leipzig Glossing Rules. The fifth line specifically marks each word's stem and the original forms of the morphemes before phonological changes. From Figure 1, it can be observed that morphological segmentation and stemming are closely related: since these agglutinative languages have no prefixes, the first morpheme of a word is its stem. Furthermore, when morphemes in Uyghur and Kazakh are concatenated, phonological harmony may cause phenomena such as deletion, insertion, and weakening of characters at the morpheme boundary [6]. Figure 1 shows three different phonological changes. The differing spellings of morphemes after harmony can increase the number of out-of-vocabulary words, impacting a model's generalization ability. This characteristic distinguishes the morphological segmentation task in agglutinative languages from the word segmentation or morphological segmentation tasks in other languages, such as Chinese or English. Therefore, morphological segmentation models designed for other languages may not perform well when applied to morphologically rich languages like Uyghur or Kazakh.
Due to the agglutinative nature of Uyghur and Kazakh, theoretically, an infinite vocabulary can be generated [7]. As a result, data sparsity in agglutinative languages poses a challenge for downstream NLP tasks, as even small datasets lead to a large vocabulary [5]. However, morphological segmentation, which divides words into their smallest semantic units while maintaining semantic information, effectively alleviates the data sparsity issue caused by rich morphology. Therefore, morphological segmentation and stemming are widely used in various downstream natural language processing tasks such as named entity recognition [8], keyword extraction [4], question answering [9], speech recognition [10], machine translation [11,12], and language modeling [3].
The sequence labeling task is a fundamental problem in NLP, which involves assigning a label to each element of an input sequence. In Uyghur and Kazakh, morphological segmentation and stemming are often treated as character-level sequence labeling problems, in which models predict a label for each character; therefore, character-level evaluation methods are commonly used. However, character-level evaluation cannot reflect the overall performance of models on agglutinative-language morphological segmentation and has some shortcomings. As shown in Table 1, the correct segmentation of the word “سانائەتنىڭ” is “سانائەت نىڭ”, whose true label sequence is “BMMMMEBME”, consisting of two morphemes. In these labels, “B” represents the starting character of a morpheme, “M” represents a middle character of a morpheme, and “E” represents the ending character of a morpheme. The model, however, incorrectly predicted the label of the eighth character as E; only this one label was wrong, while the rest were correct. Under character-based evaluation, this incorrect label appears to affect only a single character rather than the morpheme to which it belongs. However, when the characters are merged into morphemes, this one labeling error produces the two incorrect morphemes “نى” and “ڭ”. Although current state-of-the-art (SOTA) models have achieved around 97% accuracy under character-level evaluation, such evaluation only focuses on individual characters (single points) and does not take into account the morphological context of the characters (the horizontal relationships between points), i.e., the effect of prediction errors on the resulting segmentation.
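To make the gap between the two evaluation granularities concrete, the following minimal Python sketch merges BMES labels into morpheme spans and scores them at the morpheme level. It is an illustration of the idea rather than the exact evaluation script used in our experiments, and the example word is replaced by its label strings only:

def labels_to_spans(labels):
    """Merge BMES character labels into (start, end) morpheme spans."""
    spans, start = [], 0
    for i, tag in enumerate(labels):
        if tag in ("E", "S"):          # a morpheme ends at position i
            spans.append((start, i))
            start = i + 1
    if start < len(labels):            # trailing characters without a closing E/S
        spans.append((start, len(labels) - 1))
    return spans

gold = labels_to_spans("BMMMMEBME")    # [(0, 5), (6, 8)] -> two morphemes
pred = labels_to_spans("BMMMMEBEE")    # [(0, 5), (6, 7), (8, 8)] -> one label error, three morphemes

tp = len(set(gold) & set(pred))                                    # 1 correctly recovered morpheme
precision, recall = tp / len(pred), tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)
char_acc = sum(g == p for g, p in zip("BMMMMEBME", "BMMMMEBEE")) / 9
print(f"character accuracy = {char_acc:.2f}, morpheme F1 = {f1:.2f}")  # 0.89 vs. 0.40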
To enhance the performance of stemming and morphological segmentation models for Uyghur and Kazakh, this paper redefines morpheme-based evaluation metrics (F1-score and accuracy) for morphological segmentation and stemming. In addition, we propose two benchmark models based on different training methods: (1) a supervised model, the Feature-Enhanced Morphological Segmentation model (FEMSeg), for morphological segmentation and stemming; and (2) an unsupervised morphological segmentation model, Masked Morphological Segmentation (MMSeg). In the supervised model, character-level and contextual features are learned through CNN and BiLSTM networks and combined into the input representation. This representation is then fed into the encoder of the Transformer model, where linear transformations and the multi-head attention mechanism learn the relationships between character features, contextual features, and morphological boundaries in different subspaces. In the unsupervised model, character embeddings pre-trained with the word2vec model are fed into an encoder–decoder structure composed of an LSTM network with n-gram correlations and masked multi-head attention, which reduces the interference of characters outside the n-gram when determining morphological boundaries. Finally, the paper further analyzes and compares the performance of recent stemming and morphological segmentation models on Uyghur and Kazakh from several perspectives. The contributions of this paper can be summarized as follows:
This paper redefines evaluation metrics in morphological segmentation and stemming tasks from a morphological perspective. Then, a comparison is made between recently proposed stemming and morphological segmentation models across various criteria, providing a comprehensive performance analysis.
This paper proposes two models employing different training approaches, supervised and unsupervised. Both models utilize character features, contextual features, and the correlations between them to improve generalization ability in complex scenarios (such as phonological harmony).
The two models proposed in this paper achieve SOTA results in morphological segmentation and stemming for Uyghur and Kazakh, updating the benchmark models and evaluation metrics.
2. Related Work
As fundamental tasks in NLP, stemming and morphological segmentation have produced numerous representative research achievements in both high-resource and low-resource languages. When applied to downstream tasks, these achievements can effectively mitigate the problem of data sparsity. As in other NLP tasks, the research methods for stemming and morphological segmentation have evolved from dictionary-based or finite-state-automaton methods [13,14] to statistical learning methods based on manual feature extraction [15,16], and then to methods based on deep learning [17,18]. Rule-based methods rely on grammatical rules and require linguistic experts to construct a rule base, and their segmentation results are not ideal when rule conflicts or ambiguities occur. Supervised statistical machine learning models perform better than rule-based methods, as they do not require the construction of large dictionaries and complex grammatical rules; however, to enhance performance, they rely on manual feature engineering. Commonly used statistical machine learning algorithms include Conditional Random Fields [19], perceptrons [20], graph-based models [21], and Morfessor [22].
In supervised learning, stemming and morphological segmentation are usually regarded as sequence labeling tasks: under different labeling schemes, the characters in morphemes are labeled according to their positions [23]. Commonly used labeling schemes include BIO, BMES, BIOES, etc., where B represents the starting character of a morpheme, M and I represent middle characters of a morpheme, E represents the ending character of a morpheme, S represents a single-character morpheme, and O represents a character that does not belong to any morpheme. Qiu et al. [24] proposed a multi-task learning model for multi-criteria Chinese word segmentation. The model consists of a Transformer encoder and a CRF layer, with a set of embeddings (criterion, bigram, and position embeddings) added to the input to indicate the target criterion, and it exhibits excellent transfer capability. Huang et al. [25] addressed the explosive growth of model parameters under multiple criteria and trained a joint multi-criteria Chinese word segmentation model with shared parameters on multiple benchmark datasets using pre-trained language models. Pre-trained models have shown strong competitiveness in word segmentation tasks, but they tend to learn word segmentation knowledge from in-vocabulary words rather than from context. Therefore, Lin et al. [17] proposed a context-aware Chinese word segmentation model. This method introduces an unsupervised sentence representation learning auxiliary task into the multi-criteria training framework, enabling the model to better understand the entire context. Specifically, the framework incorporates unsupervised sentence representation learning with different dropout masks, and contrastive learning minimizes the differences between representations of the same sentence under different masks. Chinese word segmentation methods based on pre-trained models have reached state-of-the-art levels, but they pose certain challenges for deployment. To improve model efficiency and generality, Li et al. [18] proposed a method that enhances pre-trained models for Chinese word segmentation through cohort training and versatile decoding strategies. Numerous word segmentation models have thus been proposed for resource-rich languages like Chinese, achieving near-human annotation quality.
Unlike supervised models, unsupervised models may not provide high-quality segmentation results, but they have certain advantages in open domains or specific applications. Downey et al. [26] proposed the Masked Segmental Language Model (MSLM), which produces unsupervised subword segmentation by training a masked neural language model. MSLM is based on a bidirectional Transformer architecture with span masking, utilizing context and attention to increase the model's scalability. To improve word segmentation performance in open domains, Pan et al. [27] proposed a model called TopWORDS-Seg based on Bayesian inference. This model combines the TopWORDS and PKUSEG tools, enabling both word segmentation and the discovery of new words; a series of experimental studies demonstrated its robustness and interpretability in open-domain Chinese word segmentation. To address the lack of labeled data resources, Yan et al. [28] proposed the concept of word influence. They argued that the influence between words can be divided into strong and weak influence and assumed that it follows a Gaussian distribution. By calculating the mutual influence between words with a pre-trained language model, they proposed a new loss function that separates the distributions of strong and weak influence as much as possible. Morfessor [29] is a classic unsupervised morphological segmentation tool. Rouhe et al. [30] investigated the possibility of using Morfessor under supervised conditions: they used Morfessor to segment words and enrich the model's input features, and then used a seq2seq model to determine whether Morfessor's segmentation results were correct. Song et al. [11] proposed a self-supervised subword segmentation model that optimizes the word generation probability of partially masked character sequences and uses dynamic programming to generate the segmentation with the maximum posterior probability.
In statistics-based stemming and morphological segmentation for Uyghur and Kazakh, features such as syllables [31], part-of-speech, context [19,32,33], phonetic classes, the presence of sound-change phenomena, and phonetic features [34] are often selected and added to the model to improve its performance. In deep-learning-based models, (Bi)RNNs [35], BiLSTM-CRF [36], CNN-BiLSTM-CRF [7], pointer networks [37], and attention mechanisms [7,37,38] have been used to learn the labels of the input sequence and distinguish morpheme boundaries. The studies mentioned above introduced labeling schemes, but the labels are not independent, which can easily lead to model overfitting. Therefore, Yang et al. [37] modeled only the segmentation points and proposed a morphological segmentation model based on a pointer network with a fused attention mechanism, whose segmentation performance is superior to that of the BiGRU model [35]. Abudukelimu et al. [7] applied the CNN-BiLSTM-CRF model to the morphological segmentation task and compared it with the pointer network [37], improving the F1-score by 0.33% and comprehensively analyzing typical error types; the model improved the recognition of out-of-vocabulary words and low-frequency morphemes. Gvzelnur et al. [38] and Imin et al. [36] introduced an attention mechanism on top of BiLSTM-CRF, considering contextual sentence information and capturing the boundaries of stems and affixes through global features. On word-level datasets, this approach underperforms BiGRU [35]; however, on sentence-level datasets, the F1-score reached 96.07%. Zhang et al. [39] proposed an unsupervised morphological segmentation model for Uyghur based on meta-learning, realizing morphological segmentation in a few-shot setting and alleviating overfitting.
In summary, although supervised models for Uyghur and Kazakh lexical analysis have made considerable research progress [6,40,41,42], their evaluation metrics are character-level. The performance of, and differences between, these models have not been explored using morpheme-level evaluation metrics. In addition, the correlations between different features have not yet been investigated. Based on the above analysis, this paper redefines the evaluation metrics and proposes two feature-enhanced models for Uyghur and Kazakh morphological segmentation and stemming, and then comprehensively analyzes the experimental results from different dimensions.
3. Method
3.1. Task Definition
This paper utilizes unsupervised and supervised models to learn Uyghur and Kazakh stemming and morphological segmentation tasks. Specifically, the unsupervised model is used to learn morphological segmentation, and the supervised model is used to learn both stemming and morphological segmentation. Assume a word W of arbitrary length, consisting of x characters or y morphemes, i.e., $W = c_1 c_2 \cdots c_x$ or $W = m_1 m_2 \cdots m_y$, where $y \le x$. In supervised morphological segmentation and stemming, each character is labeled, and the model learns to predict the corresponding tags. In unsupervised morphological segmentation, the boundaries of morphemes are determined through the correlations between characters.
3.2. Feature-Enhanced Morphological Segmentation Model
This paper proposes two Feature-Enhanced Morphological Segmentation models, FEMSeg and FEMSeg-CRF, which build on the advantages of sequence labeling models. Each model integrates a CNN character-level representation layer, a BiLSTM context-level representation layer, a Transformer encoder layer, a linear layer, and a softmax or CRF layer. The model structure is shown in Figure 2. Specifically, the model learns character-level embeddings from the input sequence. It then captures character-level features of the word and contextual features between characters (referred to as character-level context in this paper) through a CNN convolutional layer and a BiLSTM network layer, respectively. To learn the correlations between character features and contextual features and further determine morphological boundaries, the character representation and contextual representation are concatenated and fed into a Transformer encoder layer. Finally, a CRF layer or softmax layer is used to predict the labels of the input sequence.
After the input characters are embedded, they are fed into the CNN and BiLSTM layers to learn character-level and context-level representations. In stemming and morphological segmentation, character-level feature extraction is particularly important, and in the literature on stemming and morpheme segmentation of low-resource languages it relies primarily on manual feature engineering. Therefore, to reduce manual feature extraction, this paper uses a character-level CNN network to learn character features. Each input character is represented by a d-dimensional embedding, i.e., $e_i \in \mathbb{R}^d$, and a word of x characters by $E = (e_1, e_2, \ldots, e_x)$. Assuming a convolution kernel $k_j$ with window size w, the convolution operation on the input yields the result $O_j$, as shown in Equation (1):

$O_j = f(k_j * E + b_j)$ (1)

where $*$ denotes the convolution operation, $b_j$ is a bias term, and $f$ is the activation function. To obtain more features, m convolution kernels are set up, and their outputs are concatenated together to represent the output of the convolutional layer, as shown in Equation (2):

$O = [O_1; O_2; \ldots; O_m]$ (2)
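As a rough Python illustration of Equations (1) and (2) (a sketch only; the kernel widths, channel counts, and dimensions here are assumptions rather than the configuration used in our experiments):

import torch
import torch.nn as nn

d, x = 64, 12                                  # assumed embedding dimension and word length
E = torch.randn(1, d, x)                       # one word as a (batch, channels, length) tensor
convs = nn.ModuleList(
    [nn.Conv1d(d, 32, kernel_size=w, padding=w // 2) for w in (2, 3, 4)]
)
# Equation (1): convolve with each kernel; Equation (2): concatenate the kernel outputs
O = torch.cat([torch.relu(conv(E))[:, :, :x] for conv in convs], dim=1)
print(O.shape)                                 # torch.Size([1, 96, 12])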
For sequence labeling tasks, RNN models can effectively learn temporal information, and bidirectional RNNs can exploit not only past but also future information. However, for sequences with long-span dependencies, RNNs suffer from vanishing or exploding gradients. To solve the long-distance dependency issue, Hochreiter and Schmidhuber [43] proposed the LSTM, a variant of the RNN that controls information flow and forgetting through a gate mechanism. The BiLSTM model overcomes the shortcoming of the unidirectional LSTM, which only records previous context without considering future context: the BiLSTM layer obtains two hidden-layer outputs and concatenates them as the final output. Specifically, the hidden representation at time step t is $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, where $\overrightarrow{h}_t$ represents the forward hidden state at time step t, and $\overleftarrow{h}_t$ represents the backward hidden state at time step t.
After obtaining the character-level and context-level representations, they are concatenated into the final feature representation, i.e., $F = [O; H]$, where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. The process of multi-feature extraction is summarized in Algorithm 1.
Algorithm 1 Multi-feature extraction.
Input: word W, kernel sizes K, parameters P
Output: feature F
1: E ← Embedding(W)
2: function CNN_Feature(E, K, P)
3:   for k in K do
4:     O_k ← Conv(E, k, P)
5:   end for
6:   O ← [O_1; O_2; …; O_|K|]   ▹ concatenate the O_k into O
7:   return O
8: end function
9: function BiLSTM_Feature(E, P)
10:   H ← BiLSTM(E, P)
11:   return H
12: end function
13: O ← CNN_Feature(E, K, P)
14: H ← BiLSTM_Feature(E, P)
15: F ← [O; H]   ▹ concatenate O and H
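A minimal PyTorch sketch of Algorithm 1 is given below; the vocabulary size, embedding dimension, kernel widths, and hidden size are illustrative assumptions, not the settings used in our experiments.

import torch
import torch.nn as nn

class MultiFeature(nn.Module):
    """CNN character features and BiLSTM context features, concatenated as in Algorithm 1."""
    def __init__(self, vocab_size, d=64, conv_channels=32, kernels=(2, 3, 4), hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, conv_channels, k, padding=k // 2) for k in kernels]
        )
        self.bilstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)

    def forward(self, chars):                       # chars: (batch, seq_len) character ids
        e = self.embed(chars)                       # (batch, seq_len, d)
        seq_len = e.size(1)
        conv_in = e.transpose(1, 2)                 # (batch, d, seq_len) for Conv1d
        O = torch.cat([torch.relu(c(conv_in))[:, :, :seq_len] for c in self.convs], dim=1)
        O = O.transpose(1, 2)                       # (batch, seq_len, len(kernels) * conv_channels)
        H, _ = self.bilstm(e)                       # (batch, seq_len, 2 * hidden)
        return torch.cat([O, H], dim=-1)            # F = [O; H]

F = MultiFeature(vocab_size=40)(torch.randint(0, 40, (2, 9)))
print(F.shape)                                      # torch.Size([2, 9, 352])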
The multi-head attention mechanism can learn correlations between the elements of an input sequence from different dimensions. Typically, it processes inputs of a single type: if the input sequence is a sentence, it learns the relationships between words; if it is a word, it learns the relationships between characters. To learn characters, context, and the correlations between them, we concatenate the two representations and feed the result F into the encoder of the Transformer model. The encoder uses a multi-head attention mechanism, residual normalization layers, and a feed-forward layer to capture the dependencies and semantic information in the character-level input sequence, enhancing the model's awareness of morphological boundaries and of the semantic dependencies between characters and their context. The calculation steps are shown in the following equations:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$ (3)

$\mathrm{head}_i = \mathrm{Attention}(F W_i^{Q}, F W_i^{K}, F W_i^{V})$ (4)

$\mathrm{MultiHead}(F) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n) W^{O}$ (5)

$Z = \mathrm{LayerNorm}(F + \mathrm{MultiHead}(F))$ (6)

$\mathrm{FFN}(Z) = \max(0, Z W_1 + b_1) W_2 + b_2$ (7)

$T = \mathrm{LayerNorm}(Z + \mathrm{FFN}(Z))$ (8)

where n is the number of heads. The concatenation of the n attention heads followed by a linear transformation gives the final multi-head attention output, and $\mathrm{head}_i$ represents the ith attention head. After the multi-dimensional span dependencies are obtained through the multi-head attention mechanism, the result is fed into a residual normalization layer to prevent model degradation; there are two residual normalization layers in the Transformer encoder. The output of the first residual normalization layer is fed into the feed-forward layer, which consists of two fully connected layers, with a ReLU activation in the first layer and no activation in the second. The output T of the Transformer encoder is then fed into a fully connected layer in order to unify the dimensions of the vectors passed to the CRF and softmax layers, as shown in Equation (9):

$P = T W_p + b_p$ (9)
The last layer of a sequence labeling model is generally a softmax classification layer or a CRF layer, and this paper follows the same design. When the last layer is a softmax function, it transforms the feature vector into a probability distribution over the labels in the range [0, 1], predicting the probability that the feature embedding belongs to each label. When the last layer is a CRF, given an input sequence $X = (x_1, x_2, \ldots, x_n)$, the label sequence predicted by the CRF is $y = (y_1, y_2, \ldots, y_n)$, and the score of the sequence is defined as in Equation (10):

$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$ (10)

where $y_0$ and $y_{n+1}$ denote the start and end tags. In this paper, P is the output of the linear layer in Equation (9), where $P_{i,j}$ represents the score of the jth label for the ith character in the sequence; A is the matrix of transition scores, which is position-independent, and $A_{i,j}$ represents the score of a transition from label i to label j.
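The following PyTorch sketch shows how the concatenated feature F flows through the Transformer encoder and the output layer (Equations (3)–(9)), using the softmax variant; a CRF layer could replace the final classifier for FEMSeg-CRF. The dimensions and the four-label (BMES) tag set are illustrative assumptions.

import torch
import torch.nn as nn

class FEMSegHead(nn.Module):
    """Sketch: Transformer encoder + linear projection + softmax over BMES labels."""
    def __init__(self, feat_dim=352, n_heads=4, n_labels=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=2 * feat_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.linear = nn.Linear(feat_dim, n_labels)     # Equation (9): label scores P

    def forward(self, F):                               # F: (batch, seq_len, feat_dim), e.g., from Algorithm 1
        T_enc = self.encoder(F)                         # multi-head attention + Add&Norm + FFN
        P = self.linear(T_enc)
        return torch.log_softmax(P, dim=-1)             # softmax variant; replace with a CRF for FEMSeg-CRF

F = torch.randn(2, 9, 352)
labels = FEMSegHead()(F).argmax(-1)                     # predicted BMES label ids per character
print(labels.shape)                                     # torch.Size([2, 9])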
3.3. Masked Morphological Segmentation Model
In unsupervised morphological or word segmentation models, it is common to learn correlations between characters at the sentence level, as in BPE, Morfessor, WordPiece, BBPE, etc. These methods perform strongly in downstream tasks, but their results under morphological segmentation and word segmentation evaluation are less satisfactory. Therefore, this paper takes words as units and learns the n-gram correlations within words, using a masked self-attention mechanism to avoid the influence of characters outside the n-gram on the determination of morpheme boundaries. The model structure is shown in Figure 3.
The Masked Morphological Segmentation (MMSeg) model consists of an input layer, an encoding layer, a decoding layer, and an output layer, which we introduce in turn below:
Input Layer: The input sequence is represented as a fixed-size vector. Given an input sequence $X = (x_1, x_2, \ldots, x_T)$, it is represented as $E = (e_1, e_2, \ldots, e_T)$, where $e_i \in \mathbb{R}^d$ and d is the embedding dimension. To incorporate prior knowledge into the model, this paper initializes the embedding of the input sequence with pre-trained character embeddings obtained by training a word2vec model; during unsupervised learning, the embeddings are continuously updated.
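Such character embeddings can be pre-trained, for example, with gensim by treating each word as a "sentence" of characters. The snippet below is a sketch; the example words (in Latin transliteration), vector size, and window are assumptions.

from gensim.models import Word2Vec

# Each training "sentence" is the character sequence of one word.
words = ["sanaetning", "oqughuchilar", "kitablar"]      # hypothetical transliterated examples
char_sentences = [list(w) for w in words]

w2v = Word2Vec(char_sentences, vector_size=64, window=3,
               min_count=1, sg=1, epochs=50)            # skip-gram over characters
print(w2v.wv["a"].shape)                                # (64,) embedding for the character 'a'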
Encoding Layer and Masked Attention Mechanism: After the input sequence is vectorized, it is fed into the encoding layer. The encoder consists of a single-layer unidirectional LSTM network, whose output at the ith time step is given by Equation (11):

$h_i = \mathrm{LSTM}(h_{i-1}, e_i)$ (11)

where $h_{i-1}$ represents the hidden state at time step i−1, and $e_i$ is the input at time step i.
After the input sequence passes through the encoding layer, the hidden states are truncated to a predefined maximum character length and fed into the masked attention mechanism, which is a multi-head self-attention mechanism with four heads; its calculation formulas were introduced in Section 3.2. Unlike other attention mechanisms, an upper triangular matrix is used as the masking matrix, ensuring that the attention calculation emphasizes features close in time to the tth time step: the masking values of nearby positions are set to 1, whereas the masking values of all other positions are set to 0. Algorithm 2 summarizes the masked attention mechanism.
Decoding Layer and Objective Function: The decoding layer of the model consists of a single-layer unidirectional LSTM. The output A of the masked attention mechanism is fed into this LSTM layer for decoding, and the probability of a segment is computed as shown in Equation (12):

$p(y_i \mid y_{<i}) = \prod_{j=1}^{|y_i|} p(y_{i,j} \mid y_{i,<j}, y_{<i})$ (12)

where $y_{i,j}$ represents the jth character in the ith segment, and $y_{<i}$ represents the segments from the first up to the (i−1)th. Concatenating all the segments $y_1, y_2, \ldots, y_m$ reconstructs the entire word Y; $|y_i|$ denotes the length of the ith segment $y_i$, and T is the length of the sequence Y. The model feeds the embeddings from the attention layer into the single-layer unidirectional LSTM, so the ith segment depends on the previous i−1 segments: the initial hidden state of the ith segment is initialized using the results of the previous segments $y_{<i}$. The model achieves unsupervised segmentation by learning the joint probability of the segmented character sequences, as shown in Equation (13):

$p(y_1, y_2, \ldots, y_m) = \prod_{i=1}^{m} p(y_i \mid y_{<i})$ (13)

where each character $y_{i,j}$ is represented by its embedding from the input layer. We train the model by maximizing the log-likelihood, as shown in Equation (14):

$\mathcal{L} = \sum_{i=1}^{m} \log p(y_i \mid y_{<i})$ (14)
Algorithm 2 Masked attention mechanism.
Input: sequence length L, the output of the LSTM H, maximum length L_max, parameters P, number of heads n
Output: attention feature A
1: Initialize the projection matrices W^Q, W^K, W^V, W^O
2: function Get_Attention_Mask(L, L_max)
3:   M ← upper triangular 0/1 matrix (1 for nearby positions, 0 elsewhere)
4:   return M
5: end function
6: function Multi_Head_Attention(H, M, P)
7:   for i in 1 … n do
8:     Q_i, K_i, V_i ← H W_i^Q, H W_i^K, H W_i^V
9:     head_i ← Attention(Q_i, K_i, V_i, M)
10:   end for
11:   A ← [head_1; …; head_n] W^O   ▹ concatenate the heads and project
12:   return A
13: end function
14: M ← Get_Attention_Mask(L, L_max)
15: A ← Multi_Head_Attention(H, M, P)
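A PyTorch sketch of this masked multi-head self-attention is shown below. The mask here keeps, for each time step, only the positions within an assumed n-gram window; the window size, dimensions, and the inline creation of the projection layers are simplifications for illustration.

import torch
import torch.nn as nn

def masked_self_attention(H, n_heads=4, window=4):
    """Band-limited masked multi-head self-attention over LSTM outputs H (batch, T, d)."""
    B, T, d = H.shape
    d_k = d // n_heads
    W_q, W_k, W_v, W_o = (nn.Linear(d, d) for _ in range(4))   # projections (created inline for brevity)
    Q = W_q(H).view(B, T, n_heads, d_k).transpose(1, 2)        # (B, heads, T, d_k)
    K = W_k(H).view(B, T, n_heads, d_k).transpose(1, 2)
    V = W_v(H).view(B, T, n_heads, d_k).transpose(1, 2)
    idx = torch.arange(T)
    # position s is visible from step t only if it lies within the last `window` steps up to t
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))       # mask out characters outside the n-gram
    A = torch.softmax(scores, dim=-1) @ V                      # (B, heads, T, d_k)
    return W_o(A.transpose(1, 2).reshape(B, T, d))             # concatenate heads and project

H = torch.randn(2, 10, 64)                                     # e.g., LSTM encoder outputs
print(masked_self_attention(H).shape)                          # torch.Size([2, 10, 64])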
When the initial condition satisfies $\alpha_0 = 1$, the loss function can be calculated using dynamic programming, as shown in Equation (15):

$\alpha_t = \sum_{k=1}^{K} \alpha_{t-k} \, p(y = Y_{t-k+1:t} \mid Y_{<t-k+1})$ (15)

where $\alpha_t$ is the joint probability of all segmentations of the first t characters, $p(y = Y_{t-k+1:t} \mid Y_{<t-k+1})$ is the probability of a segment of length k ending at position t, and K is the maximum length of a segment.
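A small Python sketch of the dynamic program in Equation (15), working in log space for numerical stability; the segment scores here are a dummy tensor, whereas in MMSeg they would come from the LSTM decoder:

import torch

def marginal_log_likelihood(seg_logp, K):
    """Forward DP: log of the total probability of all segmentations (Equation (15))."""
    T = seg_logp.size(0)
    alpha = torch.full((T + 1,), float("-inf"))
    alpha[0] = 0.0                                        # initial condition: alpha_0 = 1
    for t in range(1, T + 1):
        # sum over segments of length 1..K that end at character t
        terms = [alpha[s] + seg_logp[s, t - 1] for s in range(max(0, t - K), t)]
        alpha[t] = torch.logsumexp(torch.stack(terms), dim=0)
    return alpha[T]

T, K = 6, 3
seg_logp = torch.log(torch.rand(T, T))                    # seg_logp[s, e]: log prob of a segment spanning characters s..e
loss = -marginal_log_likelihood(seg_logp, K)              # maximizing the likelihood = minimizing this loss
print(loss.item())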