1. Introduction
Natural language processing (NLP) is a critical research area in computer science and Artificial Intelligence (AI). The language model is a fundamental component of NLP and plays a crucial role in predicting words or characters within a text sequence. Information theory has profoundly influenced this field, providing a mathematical framework dedicated to studying the quantification, transmission, encoding, and processing of information. This theory underpins the construction of language models and supplies essential concepts and methodologies for comprehending and manipulating textual data and encoding processes, among other linguistic operations.
Pronouns can cause ambiguity in semantic understanding. Evidence suggests that language ambiguity is one of the most critical factors in how natural human language is represented by deep learning models [
1]. Coreference resolution (coref) is the task of identifying mentions in a text that refer to the same entity or concept [
2]. An example of a coref task is shown in
Figure 1.
Figure 1a illustrates the input for a coref task. When multiple entities are present in a sentence, the specific direction of the pronoun affects the machine’s understanding of the sentence.
Figure 1b displays the output of the coref task, which assigns the pronouns in the sentence to the corresponding referential clusters and establishes associations with the correct entities. These associations can be utilized in downstream tasks to accurately understand the meanings of the corresponding pronouns.
This fundamental NLP task can benefit various applications, such as Information Extraction [
3,
4], Question Answering [
5,
6], Machine Translation [
7,
8], and Summarization [
9,
10], which are of great research value.
Coref requires document-level encoding. Evaluating the similarity of long-span texts presents a significant challenge in coref [
11]. Neural network models are widely used in the field of computing. These models utilize word embeddings to capture word similarity, thereby effectively improving the accuracy of coref models. As a subtask of NLP, the coref algorithm based on deep learning faces the following challenges:
(1) Long-Span Problem: The issue at hand involves the span distance between pronouns and the entities to which they refer. In the traditional modeling process of deep learning language models, text lines of varying spans are often used as inputs. However, traditional sequence models, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTM), often perform poorly and need improvement in capturing long-range dependencies. This deficiency occurs because their recursive calculation of long-distance information leads to information attenuation or loss;
(2) Ambiguous Reference: Ambiguous reference is a phenomenon in natural language processing where a pronoun or indicator may correspond to multiple potential entities within a single input. When multiple entities precede contemporary words, referential errors may occur. Resolving ambiguous references typically requires a detailed understanding of semantics and context, utilizing richer contextual, semantic and global information for recognition.
In recent years, various language models based on Transformers [
12], such as BERT and GPT, have become the mainstream research focus to address the aforementioned issues. BERT employs a self-attention mechanism that effectively captures dependencies between distant words in a sentence. The multi-head attention mechanisms of the Transformer model, combined with BERT’s bidirectional encoding and context-sensitive embeddings, empower the model to capture dependency relationships and contextual information between words on a global scale. These methodologies significantly enhance the model’s capacity to understand contextual dependencies, thereby improving overall performance.
However, although BERT can capture rich contextual information within sentences, it still requires a more comprehensive global understanding for coref tasks. Therefore, challenges persist in using the BERT model to address issues related to long-span referencing.
The reasons for the above problems include the nature of the original feature maps in the BERT model, which exhibit low-dimensional features and excel at preserving many fine-grained details. However, a common challenge in coreference resolution tasks relates to long-span issues, where pronouns and their referents are far apart. It has been demonstrated that the limited range of low-dimensional feature maps in BERT does not provide adequate global information and spatial features, thereby hindering the model’s effectiveness in coref tasks.
Given the characteristics of the BERT model, several issues remain in improving and fine-tuning the model:
(1) BERT requires attention mechanism operations for each position within the original feature maps, which impedes the model’s ability to capture global information effectively under the constraints of large-scale feature maps;
(2) The original feature map is information-dense, and the size of the feature maps is large, which can obscure feature weights during model fine-tuning. This significantly affects performance and complicates the optimization process.
This phenomenon has led to traditional methods used in deep learning to acquire global information, such as sub-sampling and global average pooling, to potentially cause feature confusion. Consequently, this contributes to a decline in accuracy and presents challenges in further improving BERT’s capability to acquire global information.
Convolutional kernels of varying scales can capture different receptive fields, enabling the model to acquire more contextual and global information across multiple scales. This approach effectively addresses issues related to varying text spans and enriches the network’s contextual information [
13]. However, Convolutional Neural Networks (CNNs) perform downsampling during the feature extraction stage to obtain more global features. This downsampling can obscure the features in BERT’s feature maps and negatively affect prediction results [
14]. To leverage CNNs while overcoming the limitations of integrating them with BERT for coreference resolution tasks, this paper enhances the BERT model as follows:
(1) A new multi-scale convolution module is designed to process the feature maps and map the features to higher-dimensional spaces using convolution operations with different-scale convolution kernels. This module is then added to the BERT base model to obtain contextual information at different scales and to improve the sparsity of the features, thereby adapting to the problem of referring to different text spans;
(2) Given the substantial number of parameters in BERT, and because the convolution operation with a large-scale convolution kernel would significantly increase the number of parameters, this paper employs depth-separable convolution instead of regular convolution operations to reduce the computational load of the new module;
(3) The improved BERT model replaces the original BERT model among span-BERT and c2f-BERT configurations, and its performance is validated on the Ontonotes dataset [
15], achieving better results than the original model.
3. Model
The core module of the BERT model is the Transformer [
31]. In 2017, Google introduced the Transformer model in the paper ‘Attention is All You Need’ [
32]. The Transformer encoding module comprises a feedforward layer (FFNN) and multi-head attention. The multiple attention mechanisms consist of various groups of self-attention, with each self-attention mechanism responsible for establishing a separate feature matrix. The feedforward layer primarily integrates the feature matrices obtained from multi-head attention.
The self-attention mechanism is less dependent on external information and excels at capturing the internal relevance of data or features by computing the relationships between words, thereby overcoming long-range dependency issues. The relationships among the input tensors are extracted to obtain three tensors:
Q (query),
K (key), and
V (value). Finally, the results are combined, and the self-attention is computed as shown in Formula (7) below:
where
Q,
K, and
V denote the weight matrices of the query, key, and value, respectively.
The value matrix V acts as an information carrier within the self-attention mechanism, encapsulating the actual content of the input sequence. During the attention calculation, the similarity between the query Q and the key K determines the weights applied to the value matrix V. This process identifies which parts of the information require more focus and extraction. The attention mechanism generates an aggregated output that represents the most pertinent information from the input sequence by performing a weighted summation of the value matrix.
Each self-attention computation produces the output of a distinct attention head. The outputs of all attention heads are concatenated and then subjected to a linear transformation to produce the final multi-head attention output, as illustrated in Formula (8) below:
where
represent multiple self-attention output results,
represents the linear-layer weight matrix, and Concat represents the concatenation of results by dimension.
The self-attention mechanism reduces the model’s dependency on sequence length. The use of BERT provides an excellent solution to the problem of non-uniform natural language spans, and the span between the pronoun position and the subject to which it points cannot be precisely determined. Since BERT lacks a downsampling module, the original feature map is large-scale and dense, making global information less accessible. Conventional fine-tuning methods are also prone to feature weight confusion, leading to degradation in network performance.
CNNs are notably proficient in extracting local information but face challenges in acquiring global information. In the initial stages of research, the conventional method for acquiring global information involved a combination of subsampling and global average pooling. This method entails downsizing feature maps to capture global information. However, this approach is not suitable for enhancing the Transformer module, as the feature maps produced by the Transformer module are typically of a relatively larger scale. Downsizing these feature maps may lead to information loss and confusion.
In the efforts of other researchers, the pursuit of global information while preserving the integrity of local details is often addressed through the use of methods such as multi-scale convolutional sampling or spatial pyramid pooling. Multi-scale convolution leverages receptive fields of varying scales to effectively capture features across different spatial dimensions in the input data. This approach aids in the recognition and comprehension of both local and global information, thus enhancing the network’s robustness and generalization capabilities. It also helps mitigate overfitting issues in the network, improving its generalization capacity. Although this technique is less common in NLP, it is widely used in fields such as image segmentation.
To enhance the performance of BERT, this paper proposes a novel multi-scale convolutional model based on the downsampling module in Feature Pyramid Attention (FPA) [
33,
34]. This model convolves the input vector with kernels of various scales to capture richer contextual information. Given the downsampling operations in CNNs, which can obscure the feature weights of the extracted feature maps, this paper adopts a parallel operation approach from Inception [
35,
36] to simultaneously extract multi-scale feature information, thereby counteracting the adverse effects of downsampling. Additionally, considering the numerous hidden layers in BERT, parallel operations help prevent the gradient vanishing problem associated with excessive network depth. The architecture of the multi-scale feature extraction module is depicted in
Figure 3.
Firstly, the feature dimension is increased through a convolution operation on the input features. After processing with convolution kernels of different scales, the dimension is reduced back to that of the input feature through dimension-independent splicing. This is followed by adding the input feature to the output. The convolution operation with kernels of various scales allows the capture of different scales of receptive fields, which provide diverse scales of feature information for the model and enhance the capture of global contextual associations. The dimension-raising operation, conducted during convolution, maps the feature map to a high-dimensional space, improving the sparsity and, to some extent, alleviating the issue of overly dense feature maps extracted by the BERT model. This is followed by a dimensional splicing operation, then a dimensional reduction to restore the original dimensions and a channel shuffle to select the appropriate scale features while reducing redundancy. Finally, the obtained multi-scale features are combined with the original features for output.
Other commonly used design schemes to enhance the network’s capability to obtain global information are illustrated in
Figure 4 and
Figure 5.
Figure 4 depicts a downsampling operation scheme, and
Figure 5 illustrates a scheme incorporating global-average pooling information. These two design schemes are used for comparative experiments to assess the impact of improving global-average pooling on Transformer performance.
Given BERT’s already large number of parameters, and considering that convolution operations with large convolution kernels can significantly increase the model parameters, this paper employs depth-separable convolution and light-weighting techniques to mitigate the increase in parameter count.
The Conv Stage flow is shown in
Figure 6 below.
This paper applies the aforementioned multi-scale feature extraction module to the multi-head attention process. Given the distinct contents processed in the Q, K, and V feature vectors within the self-attention mechanism, only V is processed in this manner, as detailed in Formula (9) below:
where
is the convolution operation,
is batch normalization, and
is the activation function used in BERT. Since
V is in the form of a four-dimensional tensor, the convolution operation of
K is used here for only one dimension.
h represents the activation layer and normalization layer that have been passed through once.
The specific usage is shown in
Figure 7 below.
Basic BERT employs an improved multi-head attention mechanism. Cross-entropy loss was used as the primary loss function in constructing the final loss function. The multi-scale feature extraction module was validated on the Ontonotes dataset and achieved notable enhancement effects.
4. Experiment
4.1. Datasets and Parameters
This paper utilizes the document-level dataset Ontonotes (English) from the CoNLL-2012 [
37] shared task of coreference resolution. The dataset comprises 2802 documents, including 343 training documents and 348 test documents, and contains approximately 1 million words spanning newswire, broadcast news, broadcast conversations, and web data. The primary evaluation employed the official CoNLL-2012 evaluation script to test three metrics from the test set: average F1-MUC, B3, and CEAF
4.
Figure 8 below presents the original annotation document for coref tasks in Ontonotes 5.0.
Figure 8a displays an example of the input text, and
Figure 8b shows an annotated document. During the annotation process, it is essential to identify the beginning and end positions of entities or pronouns and annotate their refers cluster IDs.
Before performing the coref task on the original Ontonotes 5.0 data, it is necessary to preprocess the data using the relevant code provided by Conll-2012. The preprocessing results are shown in
Figure 9, where
Figure 9 demonstrates the conversion of the Ontonotes 5.0 original annotation to the Conll-2012 standard annotation format. Each column in the annotation format represents the file name, sentence number, word index, part of speech, sentence structure information, speaker, referential cluster, and other sequential information.
When using the above annotation file as input for the coref model, it is also necessary to convert it to the JSONLINE format shown in
Figure 10. Each word is treated as a token, and the annotation file is reorganized into the annotation file based on the set maximum span (ensuring the integrity of the sentence). Each span is then used as input.
The input document is encoded and trained by the model to obtain the model. In the test set, based on the trained model and annotated reference cluster labels, it determines whether each pronoun is assigned to the correct reference cluster and finally outputs the total evaluation parameter results for all test files.
The server has two 12th Gen Intel(R) Core (TM) i9-12900K processors, each with a clock frequency of 3.20 GHz, was manufactured by Intel Corporation in Oregon, USA, and a total memory capacity of 64 GB RAM. Additionally, the server is configured with two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of RAM. The NVIDIA GeForce RTX 3090 GPUs are manufactured by NVIDIA Corporation. The company’s headquarters are located in Santa Clara, CA, USA.
The experiment was configured with a dropout rate of 0.3, learning rates of and , and a linear decay rate of 0.1. A maximum of the first 50 antecedents were selected for analysis (max_top_antecedents = 50). The maximum number of sentences used during each training session was 5, with the first 40% of the span selected for analysis (top_span_ratio = 0.4), and a maximum of 20 different speakers were considered (max_num_speakers). The hidden layer size of the feedforrward neural network was set to 1000, with two hidden layers (ffnn_size = 1000, ffnn_depth = 2). The training was conducted over 30 epochs (num_epochs = 30), with a feature dimension of 20 (feature_size = 20) and a maximum reference span of 30 (max_span_width = 30). The Adam optimizer was used, with a decay rate of adam_eps at . Additionally, testing experiments on the BERT base and span model were conducted under the maximum segmentation lengths of 128 and 256 segments, respectively.
4.2. Results and Analysis
This paper initially evaluates the impact of global-average pooling, downsampling, and multi-scale convolution on the coref task based on the BERT model. These techniques are commonly employed to enhance semantic information. However, as indicated by the results presented in
Table 1, applying global-average pooling and downsampling to BERT models for coref tasks can introduce perturbations in feature vectors, resulting in a degradation of network performance. Therefore, these techniques are deemed unsuitable for referent resolution tasks.
Specifically, the F1 score of the BERT-base model is 73.9%, while the models utilizing global-average pooling and downsampling have F1 scores of 55.8% and 64.6%, respectively, which are significantly lower than the base model. This indicates that the global-average pooling and downsampling methods may underperform in this task, possibly due to the loss of critical semantic and contextual information during feature extraction.
Conversely, employing multi-scale convolution to amalgamate local information from different receptive fields enables the network to acquire additional contextual information, thereby facilitating a more comprehensive understanding of the input data. The model with multi-scale convolution achieves an F1 score of 74.4%, slightly higher than the base BERT model. This suggests that, by integrating information from multiple scales, the model can capture semantic and structural features at various levels, thus enhancing its ability to understand the input data.
In further ablation experiments, this paper investigated whether dimension enhancement affects network performance, specifically the impact of changes in the dimension of intermediate feature maps on model performance during convolution operations. As depicted in
Table 2, the results indicate that, without dimensionality augmentation, the simple convolutional operations on the dense feature maps of the BERT model necessitate weighted computation of adjacent feature maps. This process results in a decline in the linear separability of the model, leading to a reduction in the F1 score to 73.9%.
On the other hand, dimensionality expansion enhances the linear separability of feature maps while preserving more original feature information, resulting in an improved F1 score of 74.4%. This improvement can be attributed to dimensionality expansion, which allows the model to handle the feature maps more effectively by capturing and preserving essential features.
These findings suggest that dimensionality augmentation enhances the performance of BERT models. By improving the linear separability of the feature maps and preserving more original feature information, dimensionality expansion allows CNNs within the BERT model framework to process the dense feature maps more efficiently, thereby alleviating the challenges associated with straightforward convolutional operations on unaltered dimensionality.
The results in
Table 3 indicate that the module developed in this paper increases the average F1 score by approximately 0.5% for the original BERT-based models and by 0.2% for the span-BERT models. This improvement signifies a positive impact of the module on BERT models, particularly in enhancing precision.
The multi-scale convolutional kernel’s operation enables the model to access contextual information at various scales, which is beneficial for feature screening. This enhanced feature screening improves the model’s precision, albeit with a slight decrease in recall, which accounts for the modest overall improvement in F1 score.
For example, the BERT-based + ours (IND) model slightly raises the average F1 score from 73.9% to 74.2%, while the BERT-based + ours (Ovlp) model shows an increase from 73.9% to 74.4%. Similarly, the span-BERT + ours model records an improvement from 79.6% to 79.8%. These results confirm that the designed module effectively manages the feature maps and captures essential information, thus improving model performance, especially in terms of precision.
This paper also conducted experiments using BERT (independent, IND), BERT-based (overlapping, OVLP), and span-BERT as baselines, testing the efficacy of the improved model under different input-text segmentation lengths. Two commonly used segmentation lengths, 128 and 256, were selected for testing. As detailed in
Table 4, the proposed model demonstrates an improvement in F1 score by 0.2% at a segmentation length of 128 and by 0.5% at a segmentation length of 256. This suggests that the model performs better with longer segmentation lengths, providing a broader context for feature extraction, which leads to improved overall model performance.
The observed improvement may be attributed to the fact that, with longer segmentation lengths, the span of the subject to which the word points becomes more uncertain. In such cases, the multi-scale receptive field can more effectively capture the necessary features to make accurate judgments. The ability of the multi-scale convolutional kernel to acquire contextual information at various scales proves beneficial in these scenarios, as it enhances the model’s capability to handle longer and more complex sequences.
For example, at a segmentation length of 256, the proposed model’s F1 score increases from 73.9% to 74.4%, indicating a significant performance enhancement. This suggests that the multi-scale approach is particularly effective in scenarios where the segmentation length is substantial, likely because it provides a broader context for feature extraction, leading to better overall model performance.
As can be seen from
Table 5, the module proposed in this paper can effectively improve the efficiency of the BERT-based model and has achieved better results in the related BERT-based model improvement results.
5. Conclusions
Leveraging its robust encoding capabilities, BERT has achieved groundbreaking successes across various domains of NLP. Models such as BERT-coref have effectively applied BERT in semantic disambiguation, demonstrating notable efficacy. However, challenges persist in BERT’s application to coreference resolution, particularly due to discrepancies in pronoun lengths and subject spans within discourse. Enhancing the contextual and global information acquisition capabilities of encoding modules holds promise for addressing these issues, suggesting a direction for future enhancements to further improve BERT’s performance in complex NLP tasks.
Based on the BERT model, this paper evaluates the impact of downsampling, global-average pooling, and multi-scale convolution on the coreference resolution task. Multi-scale convolution, in particular, proves especially advantageous in enhancing network performance. Building upon this observation, we have devised a convolutional operation module equipped with multi-scale receptive fields. This module enables the model to extract contextual information independently of text length, facilitating refined text-span selection. Additionally, projecting features into higher-dimensional spaces during processing aids in acquiring sparser feature representations, which are beneficial for the model’s performance. We also employ cross-entropy loss as the training objective. Experimental findings demonstrate the augmented model’s superiority over its precursor, although improvements are less conspicuous when text truncation lengths are excessively short.