Efficient Headline Generation with Hybrid Attention for Long Texts
Abstract
1. Introduction
- (1) We propose a hybrid attention mechanism that combines sliding-window and global attention to capture both local and global semantic dependencies, and design two ways to implement it; a rough mask-construction sketch follows this list.
- (2) In local semantic modeling, various sliding-window sizes are compared for their effectiveness on the HG task, and the optimal window size for local semantic representation is determined.
- (3) We conduct comparative experiments and in-depth analyses to verify the effectiveness of the proposed HG model with the hybrid attention mechanism, in terms of training time, memory overhead, and the accuracy and readability of the final generated headlines.
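Section 3.3 details the two implementations of the hybrid attention. Purely as an illustration of the underlying idea (not the authors' exact formulation), the sketch below builds a combined attention mask in which every token attends to a local sliding window while a few designated positions attend globally; the window size and the choice of global positions are assumed values for illustration.

```python
import numpy as np

def hybrid_attention_mask(seq_len, window=64, global_positions=(0,)):
    """Boolean mask: entry [i, j] is True if token i may attend to token j.

    Combines a sliding window around each position (local context) with a
    small set of global positions that attend to, and are attended by,
    every token (global context).
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Sliding-window (local) attention: each token sees +/- window//2 neighbours.
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True

    # Global attention: chosen positions attend everywhere and are visible to all.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True

    return mask

# Example: a 512-token document, 64-token window, first token treated as global.
m = hybrid_attention_mask(512, window=64, global_positions=(0,))
print(m.shape, int(m.sum()))  # attended pairs grow roughly linearly with length
```

Because the number of attended pairs grows linearly rather than quadratically with sequence length, such a mask is what makes long-text headline generation tractable in memory and time.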
2. Related Work
2.1. Rule-Based and Statistical Methods
2.2. Deep Neural Network Methods
2.3. Transformer-Based Models
2.4. Attention Mechanisms
3. Headline Generation Model with Hybrid Attention
3.1. Motivation for the Hybrid Attention Mechanism
3.2. Transformer-Based Headline Generation Model
3.3. Hybrid Attention
4. Experiments, Results and Discussion
4.1. Dataset
4.2. Performance Indicators
4.2.1. Syntactic Similarity Indicators
4.2.2. Semantic Similarity Indicators
4.2.3. Human Evaluation Indicators
- (1) Readability. Is the headline easy for readers to understand and read?
- (2) Informativeness. Does the headline contain rich and useful information?
- (3) Coherence. Are the content and logic of this headline coherent?
- (4) Conciseness. Is this headline concise in its content while conveying an effective message?
4.3. Comparison of Local Attention Mechanisms
4.4. Comparison of Hybrid Attention Mechanisms
4.5. Comparison of the Training and Prediction Costs
4.6. Comparison of Syntactic Similarity between the Generated and the Reference Headlines
4.7. Comparison of Semantic Similarity between the Generated and the Reference Headlines
4.8. Comparison of Human Evaluation Results between the Generated and the Reference Headlines
4.9. Comparison of the Generated Headline Instances
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Lee, S.-H.; Choi, S.-W.; Lee, E.-B. A Question-Answering Model Based on Knowledge Graphs for the General Provisions of Equipment Purchase Orders for Steel Plants Maintenance. Electronics 2023, 12, 2504. [Google Scholar] [CrossRef]
- Ahmad, P.N.; Liu, Y.; Khan, K.; Jiang, T.; Burhan, U. BIR: Biomedical Information Retrieval System for Cancer Treatment in Electronic Health Record Using Transformers. Sensors 2023, 23, 9355. [Google Scholar] [CrossRef] [PubMed]
- Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5755–5772. [Google Scholar]
- Peng, M.; Gao, B.; Zhu, J.; Huang, J.; Yuan, M.; Li, F. High Quality Information Extraction and Query-Oriented Summarization for Automatic Query-Reply in Social Network. Expert Syst. Appl. 2016, 44, 92. [Google Scholar] [CrossRef]
- Sakurai, T.; Utsumi, A. Query-Based Multidocument Summarization for Information Retrieval. In Proceedings of the NTCIR-4; National Institute of Informatics: Tokyo, Japan, 2004. [Google Scholar]
- Deutsch, D.; Roth, D. Incorporating Question Answering-Based Signals into Abstractive Summarization via Salient Span Selection. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 575–588. [Google Scholar]
- Panthaplackel, S.; Benton, A.; Dredze, M. Updated Headline Generation: Creating Updated Summaries for Evolving News Stories. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6438–6461. [Google Scholar]
- Akash, A.U.; Nayeem, M.T.; Shohan, F.T.; Islam, T. Shironaam: Bengali News Headline Generation Using Auxiliary Information. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 52–67. [Google Scholar]
- Liu, H.; Guo, W.; Chen, Y.; Li, X. Contrastive Learning Enhanced Author-Style Headline Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Industry Track, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5063–5072. [Google Scholar]
- Hayashi, Y.; Yanagimoto, H. Headline Generation with Recurrent Neural Network; Matsuo, T., Mine, T., Hirokawa, S., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 81–96. [Google Scholar]
- Thu, Y.; Pa, W.P. Myanmar News Headline Generation with Sequence-to-Sequence Model. In Proceedings of the 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), Yangon, Myanmar, 5–7 November 2020; pp. 117–122. [Google Scholar]
- Zhuoran, S.; Mingyuan, Z.; Haiyu, Z.; Shuai, Y.; Hongsheng, L. Efficient Attention: Attention with Linear Complexities. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 3530–3538. [Google Scholar]
- Fan, A.; Grave, E.; Joulin, A. Reducing Transformer Depth on Demand with Structured Dropout. arXiv 2019, arXiv:1909.11556. [Google Scholar]
- Yang, K.; Ackermann, J.; He, Z.; Feng, G.; Zhang, B.; Feng, Y.; Ye, Q.; He, D.; Wang, L. Do Efficient Transformers Really Save Computation? arXiv 2024, arXiv:2402.13934. [Google Scholar]
- Dorr, B.; Zajic, D.; Schwartz, R. Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation. In Proceedings of the HLT-NAACL 03 on Text Summarization Workshop; Association for Computational Linguistics: Stroudsburg, PA, USA, 2003; pp. 1–8. [Google Scholar]
- Banko, M.; Mittal, V.O.; Witbrock, M.J. Headline Generation Based on Statistical Translation. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong, China, 3–6 October 2000; Association for Computational Linguistics: Hong Kong, China, 2000; pp. 318–325. [Google Scholar]
- Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735. [Google Scholar] [CrossRef] [PubMed]
- Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; Bengio, S. Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, 11–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 10–21. [Google Scholar]
- Lopyrev, K. Generating News Headlines with Recurrent Neural Networks. arXiv 2015, arXiv:1512.01712. [Google Scholar]
- Bengio, Y.; Simard, P.; Frasconi, P. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Trans. Neural Netw. 1994, 5, 157. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Mohamed, A.; Okhonko, D.; Zettlemoyer, L. Transformers with Convolutional Context for ASR. arXiv 2020, arXiv:1904.11660. [Google Scholar]
- Zhang, S.; Chen, H.; Yang, H.; Sun, X.; Yu, P.S.; Xu, G. Graph Masked Autoencoders with Transformers. arXiv 2022, arXiv:2202.08391. [Google Scholar]
- Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. In Proceedings of the Thirty-seventh International Conference on Machine Learning, Online, 13–18 July 2020; pp. 11328–11339. [Google Scholar]
- Li, Z.; Wu, J.; Miao, J.; Yu, X. News Headline Generation Based on Improved Decoder from Transformer. Sci. Rep. 2022, 12, 11648. [Google Scholar] [CrossRef] [PubMed]
- Yamada, K.; Hitomi, Y.; Tamori, H.; Sasano, R.; Okazaki, N.; Inui, K.; Takeda, K. Transformer-Based Lexically Constrained Headline Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4085–4090. [Google Scholar]
- Bukhtiyarov, A.; Gusev, I. Advances of Transformer-Based Models for News Headline Generation. In Proceedings of the Ninth Conference on Artificial Intelligence and Natural Language, Helsinki, Finland, 7–9 October 2020; pp. 54–61. [Google Scholar]
- Tikhonova, M.; Shavrina, T.; Pisarevskaya, D.; Shliazhko, O. Using Generative Pretrained Transformer-3 Models for Russian News Clustering and Title Generation Tasks. In Proceedings of the Conference on Computational Linguistics and Intellectual Technologies, Lviv, Ukraine, 22–23 April 2021; pp. 1214–1223. [Google Scholar]
- Wang, Y.; Zhang, Z.; Zhao, Y.; Zhang, M.; Li, X. Design and Implementation of Automatic Generation System for Chinese Scientific and Technical Paper Titles. Data Anal. Knowl. Discov. 2023, 5, 61–71. [Google Scholar]
- Zhang, X.; Jiang, Y.; Shang, Y.; Cheng, Z.; Zhang, C.; Fan, X.; Xiao, Y.; Long, B. DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-Commerce Title and Review Summarization. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 11–15 July 2021; pp. 2146–2150. [Google Scholar]
- Meng, Q.; Liu, B.; Sun, X.; Yan, H.; Liang, C.; Cao, J.; Lee, R.K.-W.; Bao, X. Attention-Fused Deep Relevancy Matching Network for Clickbait Detection. IEEE Trans. Comput. Soc. Syst. 2023, 10, 3120. [Google Scholar] [CrossRef]
- Cui, Z.; Sun, X.; Pan, L.; Liu, S.; Xu, G. Event-Based Incremental Recommendation via Factors Mixed Hawkes Process. Inf. Sci. 2023, 639, 119007. [Google Scholar] [CrossRef]
- Ma, T.; Pan, Q.; Rong, H.; Qian, Y.; Tian, Y.; Al-Nabhan, N. T-BERTSum: Topic-Aware Text Summarization Based on BERT. IEEE Trans. Comput. Soc. Syst. 2022, 9, 879. [Google Scholar] [CrossRef]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H. Unified Language Model Pre-Training for Natural Language Understanding and Generation. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13042–13054. [Google Scholar]
- Hutchins, D.; Schlag, I.; Wu, Y.; Dyer, E.; Neyshabur, B. Block-Recurrent Transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 33248. [Google Scholar]
- Liang, X.; Tang, Z.; Li, J.; Zhang, M. Open-Ended Long Text Generation via Masked Language Modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 223–241. [Google Scholar]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. [Google Scholar]
- Ng, J.-P.; Abrecht, V. Better Summarization Evaluation with Word Embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1925–1930. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 2020 International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
- Gu, J.; Lu, Z.; Li, H.; Li, V.O.K. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1631–1640. [Google Scholar]
- Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2692–2700. [Google Scholar]
- Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; Zhao, T. Neural Document Summarization by Jointly Learning to Score and Select Sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 654–663. [Google Scholar]
Indicators | Train | Val | Test |
---|---|---|---|
Average number of characters in text | 1232.8 | 1236.2 | 1226.4 |
Average number of sentences in text | 28.2 | 28.1 | 28.1 |
Average number of characters in sentences | 24.5 | 24.6 | 24.5 |
Average number of characters in headline | 20.3 | 20.4 | 20.3 |
Average compression ratio | 60.6 | 60.8 | 60.6 |
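For reference, the compression ratio in the last row is consistent with average text length divided by average headline length (e.g., 1232.8 / 20.3 ≈ 60.7). A minimal sketch of how such corpus statistics could be computed, assuming that definition and a simple punctuation-based sentence split (both assumptions, not the authors' exact preprocessing):

```python
def corpus_stats(pairs):
    """Table-style statistics for a list of (text, headline) string pairs."""
    n = len(pairs)
    texts, heads = zip(*pairs)
    # Assumed sentence boundary: Chinese full stop, exclamation or question mark.
    n_sents = [max(1, sum(t.count(p) for p in "。！？")) for t in texts]
    return {
        "avg_chars_text": sum(len(t) for t in texts) / n,
        "avg_sentences_text": sum(n_sents) / n,
        "avg_chars_sentence": sum(len(t) / s for t, s in zip(texts, n_sents)) / n,
        "avg_chars_headline": sum(len(h) for h in heads) / n,
        "avg_compression_ratio": sum(len(t) / max(len(h), 1) for t, h in pairs) / n,
    }
```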
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Score | Improvement (%) |
---|---|---|---|---|---|
L8 | 0.3705 | 0.2684 | 0.3455 | 0.9845 | 0 |
L16 | 0.3762 | 0.2732 | 0.3507 | 1.0002 | 1.5 |
L32 | 0.3856 | 0.2826 | 0.3598 | 1.0281 | 4.4 |
L64 | 0.3904 | 0.2862 | 0.3644 | 1.0412 | 5.7 |
L128 | 0.3897 | 0.2847 | 0.3626 | 1.0372 | 5.3 |
L256 | 0.3903 | 0.2864 | 0.3647 | 1.0414 | 5.7 |
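In both comparison tables, the Score column matches the sum of ROUGE-1, ROUGE-2 and ROUGE-L (e.g., 0.3705 + 0.2684 + 0.3455 ≈ 0.9845), and the improvement column matches the relative gain of that sum over the baseline row (the L8 window here, the plain L64 window in the next table). Under that reading, the two derived columns can be reproduced as follows:

```python
def composite_score(rouge_1, rouge_2, rouge_l):
    # Sum of the three ROUGE metrics, matching the Score column
    # (0.3705 + 0.2684 + 0.3455 = 0.9844, reported as 0.9845).
    return rouge_1 + rouge_2 + rouge_l

def improvement_pct(score, baseline_score):
    # Relative gain over the baseline row, in percent.
    return 100.0 * (score / baseline_score - 1.0)

baseline = composite_score(0.3705, 0.2684, 0.3455)  # L8 window
l64 = composite_score(0.3904, 0.2862, 0.3644)       # L64 window
print(round(improvement_pct(l64, baseline), 1))      # ~5.8 (reported as 5.7)
```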
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Score | Improvement (%) |
---|---|---|---|---|---|
FTGA+L64 | 0.3930 | 0.2881 | 0.3668 | 1.0479 | 0.6 |
SGA+L64 | 0.3874 | 0.2830 | 0.3614 | 1.0320 | −0.8 |
Transformer | 0.3398 | 0.2409 | 0.3172 | 0.8979 | −13.8 |
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Score |
---|---|---|---|---|
CopyNetwork [42] | 0.24 | 0.13 | 0.23 | 0.60 |
PointerNetwork [43] | 0.25 | 0.16 | 0.22 | 0.63 |
NeuSum [44] | 0.28 | 0.19 | 0.24 | 0.71 |
Transformer [22] | 0.34 | 0.24 | 0.32 | 0.90 |
FTGA+L64 (ours) | 0.39 | 0.28 | 0.36 | 1.03 |
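The ROUGE scores above are computed with the standard ROUGE toolkit (Lin, 2004). Purely as an illustration of what the ROUGE-L column measures, the sketch below computes a character-level, F1-style ROUGE-L from the longest common subsequence; the character-level granularity and the unweighted F1 are assumptions, not the paper's exact configuration.

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """Character-level ROUGE-L (plain F1) between two headlines."""
    c, r = list(candidate), list(reference)
    if not c or not r:
        return 0.0
    # Longest common subsequence via dynamic programming.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ci in enumerate(c):
        for j, rj in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ci == rj else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```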
Model | ROUGE-WE | BERTScore | Score * |
---|---|---|---|
CopyNetwork | 0.14 | 0.64 | 0.78 |
PointerNetwork | 0.27 | 0.69 | 0.96 |
NeuSum | 0.29 | 0.70 | 0.99 |
Transformer | 0.33 | 0.72 | 1.05 |
FTGA+L64 (ours) | 0.36 | 0.74 | 1.10 |
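Unlike ROUGE, BERTScore matches candidate and reference headlines in contextual embedding space rather than by surface n-gram overlap. A minimal usage sketch with the open-source bert-score package, applied to the translated example headlines from Section 4.9; the language setting and baseline rescaling are assumptions, since the paper's exact BERTScore configuration is not reproduced here:

```python
# pip install bert-score
from bert_score import score

candidates = ["The contest for the most beautiful hardcore Redline fan is now open!"]
references = ["Finding the best hardcore fans of the Redline platform"]

# lang selects a default pretrained model for that language; rescale_with_baseline
# rescales raw cosine similarities into a more interpretable range.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(F1.mean().item())
```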
Model | Readability | Informativeness | Coherence | Conciseness |
---|---|---|---|---|
CopyNetwork | 0.37 | 0.33 | 0.31 | 0.59 |
PointerNetwork | 1.49 | 1.31 | 1.42 | 0.56 |
NeuSum | 1.53 | 1.32 | 1.47 | 0.64 |
Transformer | 1.73 | 1.65 | 1.74 | 1.78 |
FTGA+L64 (ours) | 1.85 | 1.70 | 1.88 | 1.84 |
Original text | It’s the fourth anniversary of the Redline Team. In the past four years, we have held hands with more than 400 couples, and more than 100 couples have entered the marriage hall. Here we have witnessed the most beautiful love and the purest trust. For four years, from the initial attempt to the present, it has always been you who have persevered. Some people say that the best love is companionship. Here at our Redline platform, we have also walked all the way with everyone’s company. Thank you for being with us along the way!...... (628 words) |
Reference headline | Finding the best hardcore fans of the Redline platform |
Model | Generated Title |
CopyNetwork | It’s the fourth anniversary of the Redline team, and four years of holding hands in the halls of love here with us, whether you’re a newcomer or not! |
PointerNetwork | From now on the most beautiful Redline hardcore fan selection activity is about to kick off, no matter you are a newcomer or an old fan, no matter you are a pro-military girl, or a soldier brother, as long as you are enthusiastic enough, sincere enough can become our most beautiful hardcore fans! |
NeuSum | From now on the most beautiful Redline hardcore fan selection activity is about to kick off, no matter you are a newcomer or an old fan, no matter you are a pro-military girl, or a soldier brother, as long as you are enthusiastic enough, sincere enough can become our most beautiful hardcore fans! |
Transformer | The contest for hardcore Redline fan is now open! |
FTGA+L64 (ours) | The contest for the most beautiful hardcore Redline fan is now open! |
Share and Cite
Wan, W.; Zhang, C.; Huang, L. Efficient Headline Generation with Hybrid Attention for Long Texts. Electronics 2024, 13, 3558. https://doi.org/10.3390/electronics13173558