3.1. Phase One: GujiBERT-GCN-LSTM Model for Classical Chinese Text Classification
Ancient texts differ markedly from modern texts in their linguistic features, structures, and contexts, including shifts in word meaning, differences in syntactic structure, and distinctive rhetorical techniques; as a result, traditional text classification models perform poorly on ancient texts. To address these problems, we propose a GujiBERT-GCN-LSTM text classification model for ancient texts, which handles their complex structure and semantics by tightly integrating a series of modules. The structure of the GujiBERT-GCN-LSTM model is visualized in Figure 1.
First, the model takes the ancient text as input; after word segmentation and encoding, the text is fed into the GujiBERT model. GujiBERT is based on the Transformer architecture, which has powerful bidirectional encoding capability and captures the semantic information of each word in context through the self-attention mechanism. It first transforms the text into word embeddings and positional embeddings, then analyses the contextual relationships between words through a multi-layer Transformer encoder, outputting a contextual embedding vector for each word. These embedding vectors encode each word's meaning in its context and form the basis for subsequent processing.
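As a rough illustration, the sketch below shows how per-word contextual embeddings could be obtained from a BERT-style encoder through the Hugging Face transformers library; the checkpoint path is a placeholder rather than the exact GujiBERT release used in this work.

```python
# Minimal sketch: contextual embeddings for a classical Chinese sentence
# from a BERT-style encoder. The checkpoint name is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "path/to/gujibert"  # placeholder: substitute the actual GujiBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

text = "学而时习之，不亦说乎"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual embedding vector per token: (batch, seq_len, hidden_size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```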
The GCN module captures the nonlinear word relationships that are characteristic of ancient texts. Although GujiBERT captures contextual dependencies, word relationships in ancient texts often go beyond linear order, as in inversions or metaphors. The GCN constructs a graph structure by treating each word as a node, with edges representing the dependencies between words. The representation of a node is updated by aggregating the features of its neighboring nodes, and this process is repeated through multiple convolution layers so that each node not only retains its own information but also integrates the features of its neighbors. The GCN performs the feature transformation with the following formula:
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right) \quad (1)$$

In Equation (1), $H^{(l)}$ is the node feature matrix of layer $l$; $W^{(l)}$ is the learnable weight matrix of layer $l$; $\hat{A}$ is the normalized adjacency matrix of the word graph; and $\sigma$ is the ReLU activation function.
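A minimal PyTorch sketch of the propagation rule in Equation (1) might look as follows; the class name, dimensions, and the identity adjacency matrix are illustrative assumptions, and in practice the normalized adjacency matrix $\hat{A}$ would be built from the word dependency graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution layer: H^(l+1) = ReLU(A_hat @ H^(l) @ W^(l))."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        # adj_norm: normalized adjacency matrix A_hat (nodes x nodes)
        # h: node feature matrix H^(l) (nodes x in_dim)
        return F.relu(self.weight(adj_norm @ h))

# Example: 5 word nodes with 768-dim contextual features, projected to 256 dims
h0 = torch.randn(5, 768)
adj = torch.eye(5)          # placeholder adjacency; real edges come from word dependencies
layer = GCNLayer(768, 256)
h1 = layer(h0, adj)         # H^(1), shape (5, 256)
```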
After GujiBERT’s contextual understanding, GCN’s complex relationship capture, and LSTM’s long-distance dependency processing, the model generates a final representation vector for each word. To classify the text, the model generates the overall text representation by average pooling and then classifies the text by fully connected layers and the Softmax function. Ultimately, the error between the predictions and the true labels is calculated using the cross-entropy loss function, and the model is optimized using backpropagation, constantly updating the parameters in GujiBERT, GCN, and LSTM. In this way, the model is able to comprehensively understand ancient texts from multiple levels of semantics, structure, and dependencies, thus achieving more accurate text classification.
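The classification stage described above could be sketched roughly as follows; the hidden dimension, number of classes, and tensor shapes are illustrative, and PyTorch's CrossEntropyLoss applies the Softmax internally.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average-pool the per-word vectors, then classify with a fully connected layer."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, hidden_dim) per-word representations
        pooled = word_vectors.mean(dim=1)   # average pooling over words
        return self.fc(pooled)              # class logits; Softmax is folded into the loss

head = ClassificationHead(hidden_dim=256, num_classes=4)
logits = head(torch.randn(2, 10, 256))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))
loss.backward()  # backpropagation; in the full pipeline this also updates GujiBERT, GCN, and LSTM
```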
3.2. Phase Two: Improved Entropy-SkipBERT Model for Classical Chinese
Dependency analysis extracts the dependency relationships between the words in a sentence. Each relationship is a directed, unequal relation between a head word and its dependents: the head governs its dependents, and the dependents depend on the head. The open-source Natural Language Processing (NLP) library SpaCy (https://spacy.io (accessed on 7 November 2024)) provides dependency-parsing capabilities, enabling a syntactic analysis of sentences.
Using SpaCy to perform dependency syntactic analysis of ancient texts and generate dependency trees is complicated by the fact that there are 22 common relation types, and too many relation types may lead to model overfitting. Therefore, after annotation was completed, the dependency relations were filtered, and only eight types were retained: subject–predicate, verb–object, definite–medium, parallel, punctuation, modifier, dependency marker, and compound structure.
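A possible extraction-and-filtering step with SpaCy is sketched below; the Chinese pipeline name and the mapping of the parser's label set to the eight retained relation types are assumptions for illustration only.

```python
# Sketch of dependency extraction and filtering with spaCy, assuming a Chinese
# pipeline (e.g. zh_core_web_sm) stands in for the parser used on the ancient texts.
# The label set is illustrative; actual labels depend on the parser's tag scheme.
import spacy

nlp = spacy.load("zh_core_web_sm")

RETAINED = {"nsubj", "dobj", "amod", "conj", "punct", "advmod", "mark", "compound"}

def extract_dependencies(sentence: str):
    """Return (head, relation, dependent) triples, keeping only retained relation types."""
    doc = nlp(sentence)
    return [
        (token.head.text, token.dep_, token.text)
        for token in doc
        if token.dep_ in RETAINED
    ]

print(extract_dependencies("学而时习之，不亦说乎"))
```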
SkipGram is one of the two Word2Vec architectures for generating word embeddings from text. Compared with CBOW [15], the other Word2Vec architecture, SkipGram handles rare words better. Whereas CBOW predicts the central word from its context, SkipGram selects a central word and predicts its surrounding context words, learning the word vectors from the conditional probabilities of those context words. The conditional probability of a target (context) word $w_O$ given the central (input) word $w_I$ is computed as shown in Formula (2):

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \quad (2)$$

where $v_{w_I}$ is the vector representation of the input word, $v'_{w_O}$ is the vector representation of the output word, and $V$ is the total number of words in the vocabulary.
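The softmax in Formula (2) can be illustrated numerically with a toy vocabulary; the vectors below are random and purely illustrative.

```python
# Tiny numeric illustration of Formula (2): softmax over the vocabulary of the
# dot products between the central word vector and each output vector.
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding dimension
v_in = rng.normal(size=(V, d))   # input-side vectors v_w
v_out = rng.normal(size=(V, d))  # output-side vectors v'_w

def skipgram_prob(center_id: int) -> np.ndarray:
    """P(w_O | w_I) for every w_O in the vocabulary, given central word w_I."""
    scores = v_out @ v_in[center_id]            # v'_w^T v_{w_I} for all w
    exp_scores = np.exp(scores - scores.max())  # numerically stabilized softmax
    return exp_scores / exp_scores.sum()

probs = skipgram_prob(center_id=2)
print(probs, probs.sum())   # probabilities over the vocabulary, summing to 1
```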
The SkipGram model trains word vectors by maximizing this conditional probability, which typically yields more accurate and richer word embeddings. Because SkipGram processes each target–context pair separately, it captures complex lexical relationships well. The structure of SkipGram is shown in Figure 2. In Figure 2, a sentence of eight words is taken as an example, with the words denoted $w_1, w_2, \ldots, w_8$. A word $w_i$ is selected as the center word and serves as the input to the SkipGram model. The center word is mapped to a word vector through the projection layer, embedding the semantic information of the word. Using the word vector of the center word, the model predicts its context words, namely $w_{i-2}$, $w_{i-1}$, $w_{i+1}$, and $w_{i+2}$. The SkipGram model optimizes the word vectors by maximizing the conditional probability of the context words given the center word, enabling the vectors to better represent the relationships between words.
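The pairing of a center word with its surrounding context words in Figure 2 corresponds to generating (center, context) training pairs over a window of two words on each side, as in the following sketch (word labels are placeholders).

```python
# Sketch of generating (center, context) training pairs with a context window
# of two words on each side, matching the example in Figure 2.
def skipgram_pairs(words, window=2):
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sentence = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]
# e.g. for center w3 the contexts are w1, w2, w4, w5
print([p for p in skipgram_pairs(sentence) if p[0] == "w3"])
```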
As a neural network trains, its weights are continually adjusted. The weight matrices of the SkipGram model are therefore expensive to update during training, consuming considerable computational resources and slowing down training. To address this problem, the negative sampling technique [16] is used to optimize the training process. Negative sampling improves training efficiency and reduces computational complexity by updating only a subset of the weights: a small number of negative samples are drawn alongside the positive samples, and only the weights for these selected samples are updated rather than those for the entire vocabulary.
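For reference, SkipGram with negative sampling can be trained with gensim roughly as follows; the corpus and hyperparameters are illustrative only.

```python
# SkipGram with negative sampling via gensim: sg=1 selects SkipGram and
# negative=5 draws five negative samples per positive pair, so only a small
# subset of output weights is updated at each step.
from gensim.models import Word2Vec

corpus = [["学", "而", "时", "习", "之"], ["温", "故", "而", "知", "新"]]
model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=2,          # context window on each side
    sg=1,              # SkipGram (rather than CBOW)
    negative=5,        # number of negative samples per positive pair
    min_count=1,
)
print(model.wv["而"].shape)
```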
In this paper, we enhance the SkipGram model by incorporating dependency weights and distance factors to refine the probability distribution of context words. The dependency relationships are processed using SpaCy, and we implement an entropy weighting method to assign these weights based on the identified dependencies. This entropy weighting approach allocates weights objectively by analyzing the information associated with each indicator. Specifically, a low information entropy for an indicator suggests more significant variability across different contexts, warranting a relatively high weight. Conversely, a high information entropy indicates less variability, leading to a lower weight assignment.
Initially, we conduct a dependency analysis of the ancient texts and calculate the frequency of each dependency type. These frequencies are normalized so that they sum to one. The frequency of each dependency is calculated as follows:

$$p_i = \frac{n_i}{N}$$

where $n_i$ is the number of occurrences of dependency $i$ in the dataset and $N$ is the total number of all dependencies in the dataset.
The uncertainty of the dependencies is measured by calculating the entropy value as follows:

$$E_i = -p_i \ln p_i$$

where $p_i$ is the frequency of the $i$th dependency.
After obtaining the entropy value, the weight of each dependency is calculated. The higher the weight, the more important the dependency is in the overall sentence structure. The weight is calculated by the formula

$$w_i = 1 - E_i$$
The weights obtained are then normalized to yield the final dependency weights:

$$\tilde{w}_i = \frac{w_i}{\sum_{j} w_j}$$
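Putting the four steps together, a small sketch of the entropy-weighting computation (frequency, per-relation entropy, weight, and normalization, matching the formulas above) is given below; the relation labels and counts are invented for illustration.

```python
# Sketch of the entropy-weighting steps for dependency relations.
import math

def entropy_weights(counts: dict[str, int]) -> dict[str, float]:
    total = sum(counts.values())
    freq = {rel: n / total for rel, n in counts.items()}          # p_i = n_i / N
    entropy = {rel: -p * math.log(p) for rel, p in freq.items()}  # E_i = -p_i ln p_i
    raw = {rel: 1 - e for rel, e in entropy.items()}              # w_i = 1 - E_i
    norm = sum(raw.values())
    return {rel: w / norm for rel, w in raw.items()}              # normalized weights

counts = {"nsubj": 1200, "dobj": 950, "amod": 700, "punct": 1500}  # illustrative counts
print(entropy_weights(counts))
```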
Dependency weights were calculated for historical texts, as shown in
Table 1, and for non-historical texts, as shown in
Table 2.
Dependency weights are introduced into the SkipGram model so that context words with a stronger dependency on the central word are assigned a higher probability of being generated. The Entropy-SkipBERT conditional probability formula is as follows:

$$P(w_O \mid w_I) = \frac{\exp\left(\alpha_{w_O} \, {v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w \in V} \exp\left(\alpha_{w} \, {v'_{w}}^{\top} v_{w_I}\right)}$$

where $w_I$ is the central word; $w_O$ is the context word; $v_{w_I}$ and $v'_{w_O}$ are the word vectors of the central word and the context word, respectively; $V$ is the vocabulary; and $\alpha_{w_O}$ is the weight of the dependency between the context word and the central word.
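A toy illustration of this dependency-weighted softmax is sketched below, assuming the dependency weight scales the dot-product score before normalization; the vectors and weights are random placeholders.

```python
# Toy illustration of the dependency-weighted softmax: each candidate context
# word's score is scaled by its dependency weight before normalization.
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
v_in = rng.normal(size=(V, d))    # central-word vectors
v_out = rng.normal(size=(V, d))   # context-word vectors
alpha = np.array([1.0, 0.8, 1.3, 0.9, 1.1, 0.7])  # illustrative dependency weights

def weighted_prob(center_id: int) -> np.ndarray:
    scores = alpha * (v_out @ v_in[center_id])   # alpha_w * v'_w^T v_{w_I}
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

print(weighted_prob(center_id=2))
```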
In this paper, the SkipGram model enhanced with the entropy weighting method is referred to as Entropy-SkipGram. Entropy-SkipGram and GujiBERT are two different word vector training models. GujiBERT is a language model based on the Transformer architecture; it is trained at a large scale in an unsupervised manner through the self-attention mechanism and can capture more complex semantic and contextual dependencies. The word vectors generated by GujiBERT usually have a higher dimensionality and can therefore cover deeper semantic information. After Entropy-SkipGram and GujiBERT have been trained separately, their outputs can be directly concatenated into a new word vector: the Entropy-SkipGram vector and the GujiBERT vector of the same word are concatenated along the vector dimension to form a higher-dimensional word vector. We refer to the overall module as Entropy-SkipBERT. This direct concatenation retains the features and semantic information of both models simultaneously, forming a richer word vector representation. The newly generated word vector is used as the input embedding layer. When receiving this fused word vector, the machine translation model can exploit both the local contextual information captured by Entropy-SkipGram and the global semantic representation generated by GujiBERT, thereby improving translation quality. This combination of multi-source information makes the model more comprehensive and precise when dealing with word meanings, context dependencies, and sentence structure.
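The concatenation described above could be implemented roughly as follows; the embedding dimensions are illustrative assumptions.

```python
# Sketch of fusing the two representations by concatenating, per word, the
# Entropy-SkipGram vector with the GujiBERT vector along the feature dimension.
# Dimensions are illustrative (e.g. 100-dim SkipGram + 768-dim GujiBERT = 868-dim).
import torch

def fuse_word_vectors(skipgram_vec: torch.Tensor, gujibert_vec: torch.Tensor) -> torch.Tensor:
    """Concatenate the two embeddings of the same word into one higher-dimensional vector."""
    return torch.cat([skipgram_vec, gujibert_vec], dim=-1)

fused = fuse_word_vectors(torch.randn(100), torch.randn(768))
print(fused.shape)   # torch.Size([868]) -> used as the translation model's input embedding
```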