Article

Diversifying Multi-Head Attention in the Transformer Model

by Nicholas Ampazis *,† and Flora Sakketou †,‡
Department of Financial and Management Engineering, University of the Aegean, 82100 Chios, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
‡ Current address: Lufthansa Group Digital Hangar, De-Saint-Exupéry Straße 8, 60549 Frankfurt am Main, Germany.
Mach. Learn. Knowl. Extr. 2024, 6(4), 2618-2638; https://doi.org/10.3390/make6040126
Submission received: 20 August 2024 / Revised: 3 November 2024 / Accepted: 8 November 2024 / Published: 12 November 2024
(This article belongs to the Section Data)

Abstract:
Recent studies have shown that, due to redundancy, some heads of the Transformer model can be pruned without significantly diminishing the performance of the model. In this paper, we propose a constrained optimization algorithm based on Hebbian learning, which trains specific layers in the Transformer architecture in order to enforce diversification between the different heads in the multi-head attention module. The diversification of the heads is achieved through a single-layer feed-forward neural network that is added to the Transformer architecture and is trained with the proposed algorithm. We utilize the algorithm in three different architectural variations of the baseline Transformer model. In addition to the diversification of the heads, the proposed methodology can be used to prune the heads that capture redundant information. Experiments on diverse NLP tasks, including machine translation, text summarization, question answering and large language modeling, show that our proposed approach consistently improves the performance of baseline Transformer models.

1. Introduction

The Transformer architecture utilizes attention mechanisms in order to capture the global relations between inputs and outputs [1]. The key component of the Transformer is the multi-head attention layer, where the attention is applied in parallel and each attention “head” is potentially focused on different parts of the input sequence. Recent studies have focused on the information that the attention heads actually capture, where it has been observed that different multi-head attention layers mark specific dependency relations. More specifically, lower layers tend to learn more about syntax, while higher layers tend to encode more semantics [2]. Other studies have focused on making the Transformer lighter. Pruning usually improves the processing speed and memory efficiency but at the cost of decreased accuracy. However, some studies have proven that a large percentage of the attention heads can be removed during inference without significantly impacting the performance [3]. In [4], it was shown that the most important and confident heads play consistent and often linguistically interpretable roles; therefore, these specialized heads are the last to be pruned. In [5], it was even proposed to simplify the encoder self-attention of the Transformer by replacing all but one attention head in each encoder layer with simple fixed, non-trainable attentive positional patterns. In [6], through extensive experiments, it was validated that the attention weights that are learned from token–token (query–key) interactions are not significantly important because random alignment matrices can perform quite competitively.
Instead of pruning the attention heads, the model proposed by [7] learns to activate different heads on different inputs by “reallocating” them. They show that multi-head attention can be seen as a uniform, input-agnostic mixture of experts and propose the Mixture of Attentive Experts (MAE) model, which, instead of uniformly weighting the experts, learns to weight the experts depending on the input [8]. This model allows the heads that are more important to the given input to contribute more by assigning greater “responsibility” to them.
In this paper, we are driven by the assumption that, during training, the information captured by the different heads might be quite similar. This is due to standard weight initialization schemes and/or the weight update rule of the optimizer [9]. In order to eliminate heads that capture redundant information, we propose the integration of principal components analysis (PCA) via the Generalized Hebbian Algorithm (GHA) as external knowledge into a constrained optimization algorithm called DEACON (DivErsifying Attention with CONstraints) in order to enforce diversification between different heads. The algorithm enforces the heads to capture distinct information by projecting them into a space of maximum variance using PCA. By forcing the heads to learn different aspects of the input data, DEACON aims to reduce redundancy and improve the model’s ability to capture a wider range of linguistic patterns and relationships. We introduce three different architectures that are incorporated into the Transformer model and utilize the proposed training algorithm. We also present the results of experiments on machine translation, text summarization, question answering and large language modeling, which validate the efficiency of the proposed methodology.

2. Motivation and Proposed Methodology

In the original Transformer architecture, a sequence of n d-dimensional embeddings $X \in \mathbb{R}^{n \times d}$ is given as input to each encoder block, and each attention head $Z_i \in \mathbb{R}^{n \times d_k}$, $i = 1, \ldots, h$, is computed as follows:
$Z_i = \mathrm{Attention}(Q_i, K_i, V_i)$   (1)
where
$Q_i = X W_i^Q$   (2)
$K_i = X W_i^K$   (3)
$V_i = X W_i^V$   (4)
For each attention head $Z_i$, the query matrix $Q_i \in \mathbb{R}^{n \times d_k}$ represents the queries that the model is trying to attend to, the key matrix $K_i \in \mathbb{R}^{n \times d_k}$ represents the keys that the model is using to attend to the queries, and the value matrix $V_i \in \mathbb{R}^{n \times d_k}$ represents the values that the model is attending to. The right side of Equation (1) is the scaled dot-product attention given by
$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$   (5)
where $K_i^{\top}$ denotes the transpose of matrix $K_i$.
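As a concrete illustration of Equations (1)-(5), the following minimal sketch computes one attention head in NumPy; the dimensions and the random projection matrices are toy values chosen for the example, not the settings used in the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (5): softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, n) token-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (n, d_k)

# Toy sizes: n tokens, model width d, per-head width d_k.
n, d, d_k = 4, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                               # input embeddings
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
Z_i = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z_i.shape)  # (4, 8)
```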
Next, the heads are concatenated to form a matrix $M \in \mathbb{R}^{n \times h \cdot d_k}$ as follows:
$M = \mathrm{Concat}(Z_1, Z_2, \ldots, Z_h)$   (6)
Then, the heads are projected back to the input's dimensions by multiplying $M$ with $W^O \in \mathbb{R}^{h \cdot d_k \times d}$:
$Z_{out} = M W^O$   (7)
A different perspective is provided in [7] by expressing Equation (7) as a sum of block matrices, as shown below:
$Z_{out} = M W^O = [Z_1; Z_2; \ldots; Z_h]\, W^O = \sum_{i=1}^{h} Z_i W_i^O$   (8)
where $W_i^O \in \mathbb{R}^{d_k \times d}$ is a block submatrix of
$W^O = \begin{bmatrix} W_1^O \\ W_2^O \\ \vdots \\ W_h^O \end{bmatrix}$   (9)
This means that each head $Z_i$ is projected by its own matrix $W_i^O$ to a different vector space and then the projected heads are summed to form the output. Within this conceptualization, the output is a linear combination of the projected heads; thus, in essence, the Transformer performs local linear regression in order to learn the weights $W_i^O$. In this case, the heads $Z_i$ can be viewed as features. As these features may be correlated with each other, we need to perform feature selection or extraction in order to utilize uncorrelated features relevant to the problem. However, this is not a standalone linear regression problem, since Equation (8) is deeply embedded into the Transformer architecture, and thus we cannot perform, for example, ridge or lasso regression. Alternatively, we could utilize principal components analysis (PCA) for feature selection (and dimensionality reduction) in order to project the heads into a space that maximizes their variance. This means that the newly projected heads will be uncorrelated and thus able to capture diverse information. By forcing the attention heads to learn diverse information, they “attend” to different parts of the input, which could potentially improve the performance of the baseline Transformer. After performing PCA, we can even select only the projected heads that capture the most variance in the data and discard the rest, which will reduce the model's parameters without hindering its performance. This operation has an effect that is equivalent to pruning, but, instead of having to learn all the heads beforehand and prune them afterwards, we simply select the heads that correspond to the maximum variance of the data.
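The equivalence between the concatenate-then-project view of Equation (7) and the sum-of-projected-heads view of Equation (8) is easy to verify numerically; the sketch below uses random matrices and toy dimensions purely for illustration.

```python
import numpy as np

n, d, d_k, h = 4, 16, 8, 2
rng = np.random.default_rng(1)
heads = [rng.normal(size=(n, d_k)) for _ in range(h)]      # Z_1, ..., Z_h
W_O = rng.normal(size=(h * d_k, d))                        # full output projection

# Equation (7): concatenate the heads, then project with W^O.
M = np.concatenate(heads, axis=1)                          # (n, h*d_k)
Z_out_concat = M @ W_O

# Equation (8): project each head with its own block W_i^O and sum.
blocks = np.split(W_O, h, axis=0)                          # W_1^O, ..., W_h^O, each (d_k, d)
Z_out_sum = sum(Z @ W_i for Z, W_i in zip(heads, blocks))

print(np.allclose(Z_out_concat, Z_out_sum))                # True: the two views coincide
```

Seen this way, the block matrices $W_i^O$ play the role of regression coefficients for the "features" $Z_i$, which is exactly the viewpoint exploited in the remainder of this section.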

2.1. Generalized Hebbian Algorithm

PCA can be implemented as a feed-forward neural network by utilizing the Generalized Hebbian Algorithm (GHA), also known in the literature as Sanger’s rule [10]. It is an algorithm for the training of single-layer feed-forward neural networks in order to find the eigenvectors of the covariance matrix of the input distribution. The GHA is a combination of Oja’s learning rule [11] for a simplified model of a neuron and the Gram–Schmidt orthogonalization process. We should note that the GHA requires the inputs to be centered at zero.
If $\mathbf{x}(t) \in \mathbb{R}^n$ is an n-dimensional input to a single-output one-layer feed-forward network at iteration t, the output of the network is given by $y(t) = \mathbf{w}(t)^{\top} \mathbf{x}(t)$, where $\mathbf{w} \in \mathbb{R}^n$ is the weight vector of the network. The Oja learning rule for the weight vector is a modification of Hebb's rule, which states, in conceptual terms, that neurons that fire together wire together. Hebb's rule is written as
$d\mathbf{w} = \eta\, y(t)\, \mathbf{x}(t)$   (10)
where η is the learning rate. However, Hebb’s rule allows the weights to approach infinity with a positive learning rate. Oja’s modification addresses this stability problem by normalizing the weights to have a unit length and is given by
$d\mathbf{w} = \mathbf{w}(t+1) - \mathbf{w}(t) = \eta\, y(t) \left( \mathbf{x}(t) - \mathbf{w}(t)\, y(t) \right)$   (11)
Oja showed that this equation forces the weight vector $\mathbf{w}$ to converge to the normalized principal eigenvector of the covariance matrix $C = \mathbf{x} \mathbf{x}^{\top}$.
Sanger extended Oja's rule to an m-dimensional output and showed that, in order to extract a particular number m of principal components from the covariance matrix, the weights need to be updated as follows:
$dW = \eta(t) \left( \mathbf{y}(t)\, \mathbf{x}(t)^{\top} - \mathrm{LT}\!\left[ \mathbf{y}(t)\, \mathbf{y}(t)^{\top} \right] W(t) \right)$   (12)
where $W \in \mathbb{R}^{n \times m}$ is the weight matrix of the single-layer network, t is the iteration number, $\mathrm{LT}[\cdot]$ is a function that sets all matrix elements above the diagonal equal to 0 and $\mathbf{y} \in \mathbb{R}^m$ is the output. Note that each output of the trained network represents the response to one eigenvector, and the outputs are ordered in terms of decreasing eigenvalues.
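For concreteness, a minimal NumPy sketch of the GHA is shown below. For convenience it stores one principal-direction estimate per row of W (the transpose of the orientation stated above); the data, learning rate and number of passes are illustrative assumptions only.

```python
import numpy as np

def gha_step(W, x, eta):
    """One GHA / Sanger update (cf. Equation (12)).

    W : (m, n) matrix with one principal-direction estimate per row.
    x : (n,) zero-centered input vector.
    """
    y = W @ x                                     # (m,) outputs, one per component
    lt = np.tril(np.outer(y, y))                  # LT[y y^T]: zero out entries above the diagonal
    return W + eta * (np.outer(y, x) - lt @ W)    # Hebbian term minus orthogonalizing decay

rng = np.random.default_rng(2)
data = rng.normal(size=(5000, 8)) @ rng.normal(size=(8, 8))   # correlated 8-d inputs
data -= data.mean(axis=0)                                     # the GHA requires zero-centered data
W = 0.01 * rng.normal(size=(3, 8))                            # extract m = 3 principal components
for _ in range(20):
    for x in data:
        W = gha_step(W, x, eta=1e-3)

# The rows of W should align with the leading eigenvectors of the covariance matrix.
_, eigvecs = np.linalg.eigh(np.cov(data.T))
top3 = eigvecs[:, ::-1][:, :3]                                # top-3 eigenvectors as columns
print(np.round(np.abs(W @ top3), 2))                          # approximately the identity (up to sign)
```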
A network trained by the GHA allows the linear reconstruction of the original input with a minimal mean-squared error [10]. Since the weights of the GHA form principal components, there are two main attributes that can be exploited: (1) the principal components are orthogonal to each other, and (2) the principal components that correspond to greater eigenvalues capture the greatest variance in the data.

2.2. The DEACON Algorithm

The GHA can be integrated into the multi-head attention module of the Transformer as an additional single-layer feed-forward network (which we refer to as the PCA layer), in order to extract the principal components of the original heads. The projected uncorrelated heads will therefore be the new features of the local linear regression of Equation (8). However, the GHA is an unsupervised learning algorithm; thus, integrating such a layer into the multi-head attention module could potentially have catastrophic effects on the supervised training of the Transformer. Therefore, at the same time, we need to ensure that the model loss is reduced at every iteration.
One way to tackle this problem is to incorporate the GHA as external knowledge into a constrained optimization algorithm that enforces both loss minimization and diversification between different heads. Thus, the objective is to reach the minimum error cost function with respect to the weights and to simultaneously maximize a quantity $\Phi(t)$, the derivative of which expresses the alignment of the PCA layer weight updates at each iteration with the GHA weight update rule, without compromising the need for a decrease in the cost function. The strategy that we adopt for the solution of this constrained optimization problem closely follows the methodology for the incorporation of additional knowledge in the form of constraints in neural network training, as originally proposed in [12].
This strategy yields the following weight update rule for the PCA layer, which we refer to as DEACON (DivErsifying Attention with CONstraints):
$dW(t) = -\frac{\lambda_1}{2 \lambda_2}\, G(t) + \frac{1}{2 \lambda_2}\, F(t)$   (13)
where $G(t)$ is the gradient of the loss of the Transformer model with respect to the weights W of the PCA layer, and $F(t)$ is Sanger's weight update rule given by
$F(t) = \mathbf{y}(t)\, \mathbf{x}(t)^{\top} - \mathrm{LT}\!\left[ \mathbf{y}(t)\, \mathbf{y}(t)^{\top} \right] W(t)$   (14)
Within this formalism, $\lambda_1$ and $\lambda_2$ can be evaluated in terms of known quantities as follows:
$\lambda_1 = \dfrac{I_{GF} - 2 \lambda_2\, \delta Q(t)}{I_{GG}}$   (15)
with $I_{GG}$ and $I_{GF}$ given by
$I_{GG} = \|G(t)\|^2, \qquad I_{GF} = G(t)^{\top} F(t)$   (16)
$\lambda_2 = \dfrac{1}{2} \left[ \dfrac{I_{FF}\, I_{GG} - I_{GF}^2}{I_{GG}\, (\delta P)^2 - (\delta Q(t))^2} \right]^{1/2}$   (17)
where $I_{FF}$ is given by
$I_{FF} = F(t)^{\top} F(t)$   (18)
In Equation (17), $\delta P$ prevents the uncontrollable growth of the weights W of the PCA layer by enforcing the weight updates to have constant moduli equal to $\delta P$, while maximizing the alignment of the updates $dW(t)$ with the direction of the GHA's updates. At the same time, the minimization of the loss function at each iteration is ensured by demanding that the loss is decremented by a quantity $\delta Q(t)$ so that, at the end of learning, it is rendered as small as possible.
The choice of $\delta Q(t)$ is dictated by the demand that the quantity under the square root in Equation (17) should be positive. It is easy to show that the term $I_{FF} I_{GG} - I_{GF}^2$ is always positive by the Cauchy–Schwarz inequality. Now, since $I_{GG} = \|G(t)\|^2 \geq 0$, it follows that care must be taken to ensure that $I_{GG} (\delta P)^2 > (\delta Q(t))^2$. The simplest way to achieve this is to select $\delta Q(t)$ adaptively by setting
$\delta Q(t) = -\xi\, \delta P \sqrt{I_{GG}}$   (19)
with $0 < \xi < 1$. Consequently, the weight update rule for the PCA layer has two free parameters, namely $\delta P$ and $\xi$.
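Since the multipliers have closed forms, the constrained update is cheap to compute once $G(t)$ and $F(t)$ are available. The sketch below is our reading of Equations (13)-(19), with the weight matrices flattened so that the inner products become ordinary dot products; the sign conventions follow the constrained-learning formulation of [12], and the function name and toy inputs are illustrative, not the released implementation.

```python
import numpy as np

def deacon_update(G, F, delta_p=0.2, xi=0.8):
    """Constrained weight update for the PCA layer (cf. Equations (13)-(19)).

    G : gradient of the Transformer loss w.r.t. the PCA-layer weights.
    F : Sanger (GHA) update direction for the same weights.
    """
    g, f = G.ravel(), F.ravel()
    I_GG = g @ g                                            # ||G(t)||^2
    I_GF = g @ f
    I_FF = f @ f
    dQ = -xi * delta_p * np.sqrt(I_GG)                      # adaptive loss decrement, Equation (19)
    lam2 = 0.5 * np.sqrt((I_FF * I_GG - I_GF**2) /
                         (I_GG * delta_p**2 - dQ**2))       # Equation (17)
    lam1 = (I_GF - 2.0 * lam2 * dQ) / I_GG                  # Equation (15)
    dW = (-lam1 * g + f) / (2.0 * lam2)                     # Equation (13)
    return dW.reshape(G.shape)

rng = np.random.default_rng(3)
G = rng.normal(size=(8, 8))          # toy loss gradient
F = rng.normal(size=(8, 8))          # toy GHA direction
dW = deacon_update(G, F)
print(np.linalg.norm(dW))            # = delta_p by construction: the step has constant modulus
```

With these formulas, the step modulus equals $\delta P$ exactly and the projected loss change equals $\delta Q(t) < 0$, which is precisely the pair of constraints described above.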
Recall that each output of the trained PCA layer represents the response to one eigenvector, that the outputs are ordered in terms of decreasing eigenvalues, and that the resulting principal components are orthogonal to each other, with those corresponding to greater eigenvalues capturing the greatest variance in the data. In the next section, we introduce three different architectures that utilize the PCA layer trained with the proposed algorithm. Note that only the weights of the PCA layers are updated with the above learning rule, while the rest of the network can be trained with any optimizer.

3. The Proposed Architectures

In order to incorporate the PCA layer into the Transformer architecture, we first need to reshape the concatenated multi-head attention matrix M into $M_r \in \mathbb{R}^{n \cdot d_k \times h}$. This operation is illustrated in Figure 1. From Equation (6), we find that each row of the matrix M is defined as follows:
$M[i,:] = \mathrm{Concat}\left( [Z_1[i,1], \ldots, Z_1[i,d_k]];\ [Z_2[i,1], \ldots, Z_2[i,d_k]];\ \ldots;\ [Z_h[i,1], \ldots, Z_h[i,d_k]] \right)$   (20)
where $i = 1, \ldots, n$ corresponds to the word's position in the sequence. Each row of the reshaped matrix $M_r = \mathrm{Reshape}(M)$ is defined as
$M_r[i \cdot j, :] = [Z_1[i,j], Z_2[i,j], \ldots, Z_h[i,j]]$   (21)
where $j = 1, \ldots, d_k$ corresponds to the $j$th dimension of each head's vector.
As we can see, after the reshaping operation is performed, each row $M_r[i \cdot j, :]$ is an h-dimensional vector that contains the $j$th elements of every head that correspond to the $i$th word in the sequence. Each row of matrix $M_r$ is given as input to the PCA layer; each row $M_r[i \cdot j, :]$, with its h elements, is viewed as an individual data point, so that each pattern is represented as a coordinate in h-dimensional space. In the following subsections, we describe in detail three different possible architectures for the incorporation of the PCA layer that utilizes matrix $M_r$ into the Transformer module.
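The reshaping of Equations (20) and (21) amounts to regrouping the concatenated heads so that each row of $M_r$ (the row denoted $i \cdot j$ above) holds the h values of one (word, dimension) pair; a NumPy sketch with the toy sizes of Figure 1 is shown below.

```python
import numpy as np

n, d_k, h = 2, 3, 4                                        # toy sizes, as in Figure 1
rng = np.random.default_rng(4)
heads = [rng.normal(size=(n, d_k)) for _ in range(h)]      # Z_1, ..., Z_h

M = np.concatenate(heads, axis=1)                          # (n, h*d_k), Equation (6)

# M_r has shape (n*d_k, h): row i*d_k + j holds [Z_1[i, j], ..., Z_h[i, j]].
M_r = M.reshape(n, h, d_k).transpose(0, 2, 1).reshape(n * d_k, h)

i, j = 1, 2
print(np.allclose(M_r[i * d_k + j], [Z[i, j] for Z in heads]))   # True
```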

3.1. Direct Architecture

The direct architecture directly applies PCA to the reshaped matrix $M_r$ from Equation (21). However, since the GHA requires the input vector of the PCA layer to be centered at zero, the matrix is first normalized. The normalized data are fed into the PCA layer, which projects the heads into a new space where their variance is maximized. This projection results in diversified attention heads that are less correlated with each other. The outputs of the PCA layer are reshaped back to the original format and then projected to the input dimension using a linear transformation, similar to the baseline Transformer.
More specifically, the matrix $M_r$ is passed through a batch normalization (BN) layer [13], which normalizes the layer inputs in order to follow the standard normal distribution with zero mean and unit variance. The output of this BN layer is the matrix $\tilde{M}_r$.
Following this, the rows of $\tilde{M}_r$ are given as input to the PCA layer; therefore,
$\mathbf{z}_l = \tilde{M}_r[l, :]$   (22)
where $\mathbf{z}_l \in \mathbb{R}^h$. The outputs of the PCA layer are given by
$\mathbf{y} = M_r^{\prime} = \tilde{M}_r \cdot w$   (23)
where $w \in \mathbb{R}^{h \times m}$ is the weight matrix of the PCA layer, $M_r^{\prime} \in \mathbb{R}^{n \cdot d_k \times m}$ and m can be chosen to be any number between 1 and h. By selecting $m = h$, the algorithm simply diversifies the input; however, since the outputs are ordered in terms of decreasing eigenvalues, when $m < h$, the heads that correspond to smaller eigenvalues are discarded. This inherent feature of the algorithm is particularly appealing because it avoids the usual two-step procedure of first training the whole model and then pruning the heads that contain redundant information. The retained m vectors are those that correspond to the first m principal directions in the input space and account for as much of the data's variance as possible. Subsequently, the heads are transformed into an m-dimensional space without having lost essential intrinsic information. At the same time, the proposed algorithm ensures that this projection will not harm the training process, since the loss is constrained to be decremented in each epoch. Thus,
$\mathbf{y}_l = M_r^{\prime}[l, :]$   (24)
is the diversified version of $\tilde{M}_r[l, :]$. Figure 2i shows the PCA layer for $m = h$.
Next, the output matrix $M_r^{\prime}$ is reshaped back into $M^{\prime} \in \mathbb{R}^{n \times m \cdot d_k}$ in order to be projected into the dimension space of the encoder layer's input, as with the original Transformer model. This is achieved by performing the following multiplication:
$Z_{out} = M^{\prime} W^O$   (25)
where $W^O \in \mathbb{R}^{m \cdot d_k \times d}$ and $Z_{out} \in \mathbb{R}^{n \times d}$, as depicted in Figure 2ii. Note that, if $m = h$, then $W^O$ will have the same dimensionality as in the baseline architecture; however, in the case of $m < h$, the proposed model will have fewer parameters and therefore it will be lighter than the baseline model.
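Putting the pieces together, the sketch below traces the direct architecture's forward pass under two simplifying assumptions: the BN layer is replaced by plain per-column standardization, and the PCA-layer weights w are taken as given (in practice they are trained with DEACON).

```python
import numpy as np

def direct_forward(heads, w, W_O):
    """Direct architecture, Equations (22)-(25): normalize, project with the
    PCA layer, reshape back and rescale. `heads` is a list of h (n, d_k) arrays."""
    n, d_k = heads[0].shape
    h, m = w.shape
    M = np.concatenate(heads, axis=1)                               # (n, h*d_k)
    M_r = M.reshape(n, h, d_k).transpose(0, 2, 1).reshape(-1, h)    # (n*d_k, h)
    # Stand-in for the BN layer: zero mean / unit variance per column.
    M_tilde = (M_r - M_r.mean(axis=0)) / (M_r.std(axis=0) + 1e-6)
    M_prime = M_tilde @ w                                           # (n*d_k, m) diversified heads
    M_back = M_prime.reshape(n, d_k, m).transpose(0, 2, 1).reshape(n, m * d_k)
    return M_back @ W_O                                             # (n, d), Equation (25)

n, d, d_k, h, m = 4, 16, 8, 8, 3                                    # m < h: 5 of 8 heads pruned
rng = np.random.default_rng(5)
heads = [rng.normal(size=(n, d_k)) for _ in range(h)]
w = rng.normal(size=(h, m))           # PCA-layer weights (trained with DEACON in practice)
W_O = rng.normal(size=(m * d_k, d))   # smaller output projection than the baseline's (h*d_k, d)
print(direct_forward(heads, w, W_O).shape)   # (4, 16)
```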

3.2. Average Architecture

The average architecture aims to reduce the dimensionality of the data before applying PCA by averaging the values within each attention head for each word in the sequence. For each word, the values across the dimensions of each attention head are averaged, resulting in a single scalar value representing a given head for a given word. The averaging operation and the potential for head pruning can result in a model with fewer parameters compared to the baseline Transformer. The averaged values are fed into the PCA layer, which diversifies the attention heads, and finally the outputs of the PCA layer are projected to the input dimension.
While it is seemingly paradoxical that the average architecture, despite adding an additional PCA layer, can result in a model with fewer parameters than the baseline Transformer, the parameter reduction is achieved through two key mechanisms.
  • Dimensionality Reduction through Averaging: Before the PCA layer, the average architecture performs an averaging operation across the dimensions ($d_k$) of each attention head for each word in the sequence. This step condenses the information represented by each head into a single scalar value for each word. This significantly reduces the dimensionality of the data that are fed into the PCA layer. For example, for 8 attention heads, each with a dimensionality of 32 ($d_k = 32$), the baseline Transformer would have $8 \times 32 = 256$ values representing the attention heads for each word. The average architecture, after averaging, reduces this to only 8 values per word. This dimensionality reduction has a direct impact on the number of parameters in subsequent layers.
  • Head Pruning through PCA: The PCA layer itself contributes to parameter reduction through its inherent ability to identify and retain only the most informative features, effectively pruning less important ones. The PCA layer in the average architecture projects the averaged attention head values onto a lower-dimensional space defined by the principal components. By selecting only the outputs corresponding to the largest principal components (m outputs), the architecture effectively discards the less important heads that likely capture redundant information. For instance, with 8 attention heads initially but choosing to retain only the top 3 principal components ($m = 3$), we can effectively prune 5 heads. This reduction in the number of heads translates into fewer parameters in the final linear transformation that projects the outputs back to the input dimension.
In the direct architecture, each column $M_r[:, k]$ is the $k$th attention head of the entire sequence. There are two issues with this approach: (i) the PCA layer takes as input the rows of $M_r$ one by one and therefore there is no distinction between rows that belong to different words in the sequence, and (ii) each of the $d_k$ dimensions of the attention heads within the same word is treated as a new input, whereas, in reality, these dimensions correspond to the same word.
In the average architecture, we decrease the dimensionality of each word's attention head $[Z_k[i,1], \ldots, Z_k[i,d_k]]$ by calculating its normalized average. Thus, each word's normalized head $[\tilde{Z}_k[i,1], \ldots, \tilde{Z}_k[i,d_k]]$ is represented by a single element $S_k[i]$ as follows:
$S_k[i] = \dfrac{1}{d_k} \sum_{j=1}^{d_k} \tilde{Z}_k[i,j]$   (26)
where
$i = 1, \ldots, n, \qquad j = 1, \ldots, d_k, \qquad k = 1, \ldots, h$   (27)
$\tilde{Z}_k[i,j]$ is the output of the BN layer and $S_k[i]$ is the $[i,k]$ element of matrix $S \in \mathbb{R}^{n \times h}$. Therefore, $S_k$ is a new n-dimensional representation of the $k$th attention head, and each row $S[i]$ corresponds to a different word in the sentence. This operation is also illustrated in Figure 3i.
Following this, the rows of S are given as input to the PCA layer; therefore,
$\mathbf{z}_l = S[l, :]$   (28)
where $\mathbf{z}_l \in \mathbb{R}^h$, and the outputs of the PCA layer are given by
$\mathbf{y} = S^{\prime} = S \cdot w$   (29)
where $w \in \mathbb{R}^{h \times m}$ is the weight matrix of the PCA layer, $S^{\prime} \in \mathbb{R}^{n \times m}$ and m can be chosen to be any number between 1 and h. As was the case with the direct architecture, by selecting $m = h$, the algorithm simply diversifies the input; however, when selecting $m < h$, the heads that contain unnecessary information are discarded. The new heads are as uncorrelated as possible, constrained by the fact that the diversification does not diminish the model's performance. Thus,
$\mathbf{y}_l = S^{\prime}[l, :]$   (30)
is the diversified version of $S[l, :]$. Figure 3ii shows the PCA layer for $m = h$. Next, the output matrix $S^{\prime}$ is projected back into the dimension space of the encoder layer's input by performing the following operation:
$Z_{out} = S^{\prime} \tilde{W}^O$   (31)
where $\tilde{W}^O \in \mathbb{R}^{m \times d}$ and $Z_{out} \in \mathbb{R}^{n \times d}$, as depicted in Figure 3iii. As mentioned above, a benefit of this architecture is that $\tilde{W}^O$ has fewer rows, which results in a lighter model with fewer parameters. Moreover, if $m < h$, the model can be downsized even further.
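A corresponding sketch for the average architecture (Equations (26)-(31)) is given below, under the same simplifying assumptions as before: per-head standardization stands in for the BN layer, and w and $\tilde{W}^O$ are placeholders for trained weights.

```python
import numpy as np

def average_forward(heads, w, W_O_tilde):
    """Average architecture: compress each head to one scalar per word
    (Equation (26)), diversify with the PCA layer and rescale (Equation (31))."""
    S_cols = []
    for Z in heads:                                                # Z: (n, d_k)
        Z_tilde = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-6)    # BN stand-in
        S_cols.append(Z_tilde.mean(axis=1))                        # (n,) one scalar per word
    S = np.stack(S_cols, axis=1)                                   # (n, h)
    S_prime = S @ w                                                # (n, m) diversified heads
    return S_prime @ W_O_tilde                                     # (n, d)

n, d, d_k, h, m = 4, 16, 8, 8, 3
rng = np.random.default_rng(6)
heads = [rng.normal(size=(n, d_k)) for _ in range(h)]
w = rng.normal(size=(h, m))             # PCA-layer weights
W_O_tilde = rng.normal(size=(m, d))     # rescaling matrix with only m rows
print(average_forward(heads, w, W_O_tilde).shape)   # (4, 16)
```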

3.3. Non-Linear Architecture

The non-linear architecture aims to capture potential non-linear relationships between the attention heads by introducing polynomial features before applying PCA. The concatenated attention head outputs are augmented with polynomial features, including their squares as well as the product combinations of the attention heads, before applying the PCA layer. This expands the feature space and introduces non-linear relationships between the heads, potentially allowing the model to capture more complex relationships between the attention heads. The augmented data are normalized using batch normalization, and the normalized data are fed into the PCA layer for diversification. The outputs of the PCA layer are then reshaped and projected to the input dimension. Note that while polynomial expansion increases the initial feature space, the PCA layer, combined with head pruning, can ultimately reduce the model’s parameter count.
As mentioned in Section 2, we can express the rescaling operation as a linear combination of the heads (Equation (8)). In this context, it is equivalent to say that the Transformer performs local linear regression in order to learn the weights $W_i^O$, where the heads $Z_i$ are viewed as features. Therefore, we make the assumption that there is a linear relationship between the heads $Z_i$ and the output of the rescaling operation $Z_{out}$. Furthermore, since most of the operations involved in the Transformer are linear, possible non-linear relationships within the model remain unexplored. To this end, we introduce polynomial features so as to exploit non-linear relationships that could potentially exist between the heads and the output of the rescaling operation. By doing this, we can identify more complex interactions between the input and the output, such as curves, without actually having to deal with complicated non-linear models, since the output of the rescaling operation will still be a linear combination of the new features. In consideration of this, we create polynomial features from the existing heads by raising them to an exponent. In addition, we add new features that represent the multiplication between two different heads. After this step, we project these new features into a space that maximizes their variance so that they can capture diverse information, by giving them as input to the PCA layer. This also allows us to identify and retain only the projections corresponding to the m largest eigenvalues and to automatically discard the features that contain redundant information and thus have an insignificant contribution to the model.
To this end, the reshaped matrix $M_r$ from Equation (21) is augmented to form a new matrix that includes the combinations with repetition of h heads taken r at a time; it therefore has $\binom{r+h-1}{r} = \frac{(r+h-1)!}{r!\,(h-1)!}$ extra columns. We set $r = 2$ because polynomials of a higher order often become overly flexible and can take on unusual shapes. In addition, the number of new features explodes as the degree of the polynomial increases [14]. This results in $h(h+1)/2$ additional features.
Suppose that we have $h = 4$ heads, $Z_0, Z_1, Z_2, Z_3$. The new polynomial features are composed of their squares $Z_0^2, Z_1^2, Z_2^2, Z_3^2$ and their combinations $Z_0 Z_1, Z_0 Z_2, Z_0 Z_3, Z_1 Z_2, Z_1 Z_3, Z_2 Z_3$. This results in a total of $h(h+1)/2 + h = 14$ features, as illustrated in Figure 4i.
Therefore, the general form of the new augmented matrix $C \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$ is as follows:
$C[i \cdot j, :] = [Z_1[i,j], Z_2[i,j], \ldots, Z_h[i,j], Z_1[i,j]^2, Z_2[i,j]^2, \ldots, Z_h[i,j]^2, Z_1[i,j] Z_2[i,j], Z_1[i,j] Z_3[i,j], \ldots, Z_{h-1}[i,j] Z_h[i,j]]$   (32)
After this step, C is passed through a BN layer (as with the rest of the architectures), which outputs $\tilde{C} \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$.
Following this, the rows of $\tilde{C}$ are given as input to the PCA layer; therefore,
$\mathbf{z}_l = \tilde{C}[l, :]$   (33)
where $\mathbf{z}_l \in \mathbb{R}^{h(h+1)/2 + h}$. The outputs of the PCA layer are given by
$\mathbf{y} = C^{\prime} = \tilde{C} \cdot w$   (34)
where $w \in \mathbb{R}^{(h(h+1)/2 + h) \times m}$ is the weight matrix of the PCA layer, $C^{\prime} \in \mathbb{R}^{n \cdot d_k \times m}$ and m can be chosen to be any number between 1 and $h(h+1)/2 + h$.
Since the augmented input $\tilde{C}$ of the PCA layer contains more information, the PCA layer will have a wider variety of elements to “choose” from. However, not all of the columns contain useful information; therefore, setting $m \ll h(h+1)/2 + h$ is a more sensible choice because the vectors that correspond to the smaller eigenvalues and contain excess information will be discarded. Note that the PCA layer will not necessarily converge to the exact principal components because its weight updates are constrained by the fact that the cost must also be decremented in every epoch. Therefore, the layer will converge to a solution that best satisfies both diversification and cost minimization. Thus,
$\mathbf{y}_l = C^{\prime}[l, :]$   (35)
are the projected heads that contain the m elements that capture the greatest possible variance in the data. Figure 4ii shows the PCA layer for $m = h(h+1)/2 + h$.
Following this, matrix $C^{\prime}$ is reshaped into $C_r^{\prime} \in \mathbb{R}^{n \times m \cdot d_k}$, as shown in Figure 5i, in order to be projected into the dimension space of the encoder layer's input. This is achieved by multiplying $C_r^{\prime}$ with $\tilde{W}^O \in \mathbb{R}^{m \cdot d_k \times d}$ as follows:
$Z_{out} = C_r^{\prime} \tilde{W}^O$   (36)
where $Z_{out} \in \mathbb{R}^{n \times d}$. The dimensionality of $\tilde{W}^O$ depends directly on the selection of m; therefore, the size of the model may be either larger (if $m > h$) or smaller (if $m < h$) than the baseline Transformer architecture. The rescaling operation is illustrated in Figure 5ii.
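The polynomial augmentation can be sketched as follows: each row of $M_r$ is extended with the h squares and the $h(h-1)/2$ pairwise products, giving $h(h+1)/2 + h$ columns in total (44 for $h = 8$) before normalization and the PCA layer. The function below is an illustrative implementation of the augmentation in Equation (32), not the paper's code.

```python
import numpy as np
from itertools import combinations

def polynomial_augment(M_r):
    """Augment each row [Z_1, ..., Z_h] with squares and pairwise products
    (cf. Equation (32)). Input: (n*d_k, h); output: (n*d_k, h(h+1)/2 + h)."""
    h = M_r.shape[1]
    squares = M_r ** 2                                             # Z_k^2 terms
    products = np.stack([M_r[:, a] * M_r[:, b]
                         for a, b in combinations(range(h), 2)], axis=1)
    return np.concatenate([M_r, squares, products], axis=1)

rng = np.random.default_rng(7)
h = 8
M_r = rng.normal(size=(24, h))            # toy reshaped heads (n*d_k = 24 rows)
C = polynomial_augment(M_r)
print(C.shape[1], h * (h + 1) // 2 + h)   # 44 44
```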
The three proposed architectures are summarized in Table 1, highlighting their key characteristics, advantages and potential trade-offs.

4. Datasets

The performance of our proposed methods was first evaluated on the WMT-16 dataset [15], which contains English descriptions of images that are translated into German. The dataset was split into 29,000 training, 1014 validation and 1000 test sample sentence pairs. After preprocessing, the vocabulary consisted of all of the training source words and all of the training target words that appeared more than five times in the corpus. By choosing the minimum frequency to be more than five, we avoided an excessive vocabulary size and, at the same time, we minimized the number of unknown tokens. The performance of the methods was evaluated regarding the accuracy of the translation.
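The vocabulary filtering described above can be reproduced in a few lines of Python; the toy corpus and the special tokens below are illustrative assumptions, not the exact preprocessing pipeline used for WMT-16.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=5):
    """Keep only tokens appearing more than `min_freq` times; everything else
    is mapped to <unk> at lookup time."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    specials = ["<pad>", "<unk>", "<sos>", "<eos>"]
    kept = [tok for tok, c in counts.most_common() if c > min_freq]
    return {tok: i for i, tok in enumerate(specials + kept)}

corpus = [["a", "man", "rides", "a", "bike"], ["a", "dog", "runs"]] * 6
vocab = build_vocab(corpus)
unk = vocab["<unk>"]
ids = [vocab.get(tok, unk) for tok in ["a", "zebra"]]   # unseen words map to <unk>
print(len(vocab), ids)
```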
In order to demonstrate the versatility of our approach, we further tested our methods on two related but different NLP tasks, namely text summarization and question answering. To this end, we utilized the XSum (Extreme Summarization) dataset [16] and the Stanford Question Answering Dataset (SQuAD) v1.1 [17]. XSum contains BBC news articles paired with their single-sentence summaries. The dataset consisted of 226,711 article–summary pairs, split into 204,045 training, 11,332 validation and 11,334 test samples. SQuAD v1.1 contains question–answer pairs based on Wikipedia articles. The dataset consisted of 107,785 question–answer pairs spread over 536 articles, split into 87,599 training samples, 10,570 validation samples and a hidden test set. For both datasets, we followed the same preprocessing steps as in the WMT-16 dataset, where we retained all words that appeared more than five times in the corpus. For XSum, the performance of the methods was evaluated in terms of the ROUGE-1, ROUGE-2 and ROUGE-L scores of the generated summaries, whereas, for SQuAD v1.1, the performance was evaluated regarding the Exact Match (EM), which measures the percentage of predictions that exactly match any of the ground truth answers, and the F1 score, which measures the average overlap between the prediction and ground truth answer.
Finally, in order to demonstrate the scalability of our proposed methods, we utilized the SlimPajama 627B Dataset [18], the largest extensively deduplicated, multi-corpora, open dataset for large language model (LLM) training. SlimPajama provides high-quality data through the curation of sources, consisting of text from Commoncrawl (52.20%), C4 (26.70%), GitHub (5.20%), Books (4.20%), ArXiv (4.60%), Wikipedia (3.80%) and StackExchange (3.30%). For our experiments, we shuffled and randomly sampled the SlimPajama dataset to form a ∼300B token training dataset and ∼320M validation token dataset, as shown in Table 2. We did not use replay [19] and utilized the Byte-Pair Encoding (BPE)-based tokenizer described in [20].

5. Experimental Results

In this section, we present the experimental results on the datasets and tasks mentioned above, namely on the WMT-16 dataset for machine translation, the XSum dataset for text summarization, the SQuAD v1.1 dataset for question answering and the large-scale SlimPajama dataset for language modeling.

5.1. WMT-16 Dataset (Machine Translation)

We compare the performance of the DEACON algorithm, utilized within the architectures described in the previous sections, to the performance of the baseline Transformer. Within each of the proposed architectures, for each batch during training, we perform an inner loop that iteratively updates the quantity $F(t)$ alone from Equation (14). This ensures that, after a predetermined number of iterations i, $F(i)$ will converge to the principal components of each batch and eventually to the principal components of the heads corresponding to the whole dataset. In our experiments, we set $i = 500$. Regarding DEACON's parameters, in all experiments, we set $\delta P = 0.2$ and $\xi = 0.8$, but we should note that similar performance was recorded with $0.05 < \delta P < 0.5$ and $0.8 < \xi < 0.99$, indicating that the results were not notably sensitive to the exact values of these parameters. The code for implementation has been made available in a public GitHub repository: https://github.com/flo3003/DEACON (accessed on 1 April 2024).
In all our experiments (except for that on the SlimPajama dataset; see below), for the baseline Transformer, we set the number of heads to $h = 8$, and, for the proposed architectures, we set the output of the PCA layer to be either $m = h$ or $m = k$, corresponding to the k largest eigenvalues. This had an effect equivalent to pruning, but we avoided learning the heads beforehand. In our experiments, we set $k = 3$. In every architecture, each of the encoder and decoder blocks comprises two layers. All sub-layers in the model, as well as the embedding layer, produce outputs of dimension $d = 256$. A common practice is to set $d_k = d/h$; therefore, the dimensionality of the queries, keys and values is $d_k = 32$. For fairness in the comparison, all models were trained for the same number of epochs, equal to 30, and each run was initialized from a common random seed. For the rest of the discussion, the proposed methods will be addressed as “Name of architecture–Number of retained heads”, e.g., the direct architecture with $m = 3$ will be referred to as “Direct-3”.
For the WMT-16 dataset, Table 3 shows, for each model, the number of PCA-projected heads, the number of trainable parameters and their performance in terms of the training loss, validation accuracy and BLEU score. We can see that the direct architecture with eight PCA heads (Direct-8) performs the best across all metrics as it achieves the highest BLEU score (28.9) and the lowest training loss, which is 2.19% lower than the training loss achieved by the baseline Transformer. This is also reflected in the validation accuracy results, where it achieves the highest accuracy (69.75%). This result is 1.12% higher than the validation accuracy achieved by the baseline Transformer, which shows that the diversification of the heads is beneficial to the model. The Direct-3 model achieves 16.95% higher training loss than the baseline Transformer, which can be explained by the fact that it has around 4% fewer parameters (see Section 3.1, Figure 2). Interestingly, however, the model exhibits strong generalization properties since it achieves essentially the same validation accuracy and BLEU score as the baseline Transformer, despite having significantly higher training loss. This shows that selecting only around 37% of the heads (three out of eight), which has the same effect as pruning five out of eight heads, does not significantly diminish the performance of the model, since the heads that are retained capture the majority of the data’s variance. Thus, our approach discards redundant heads by design, without actually having to prune them after training, which is a significant benefit of the proposed methodology.
Average-8 achieves 9.07% higher training loss than the baseline Transformer, but only 0.85% lower validation accuracy and ∼99% of the baseline’s BLEU score. Regarding this result, one should also consider that the Average-8 model has roughly 7% fewer parameters than the baseline model, since, for each word, the information that is encapsulated in each head is compressed into a scalar. This also makes the input to the PCA layer considerably correlated, which means that it would be more difficult for the algorithm to converge to the principal components. Due to the reduced number of parameters, Average-3 achieves the highest training loss, which is roughly 37% higher than the score achieved by Average-8. This is also reflected in the validation accuracy, where it achieves the lowest result, which is 4.08% lower than the validation accuracy achieved by Average-8. Similarly, its BLEU score is 4.62% worse than that of Average-8. This means that decreasing the dimensionality of the already compressed heads results in a significant decline in the performance, which is not unexpected. For the Non-Linear-8 architecture, we should first mention that, with h = 8 heads, the combination of the non-linear features as described in Section 3.3 results in 44 features as inputs to the PCA layer. Thus, by selecting m = 8 , meaning that we retain only eight projected heads (corresponding to the first eight principal components out of 44), we also simultaneously perform dimensionality reduction. Non-Linear-8 performs 17.19% worse than the baseline Transformer in terms of the training loss, but only 0.61% worse in terms of the validation accuracy and 0.7% in terms of the BLEU score.
Thus, despite the great difference in the training loss metric, the results for the other two metrics reveal that Non-Linear-8 generalizes quite well, achieving comparable performance to the baseline Transformer despite having discarded more than 82% of its heads (36 out of 44). This also indicates that DEACON in the Non-Linear-8 setting possibly did not reach its optimal performance at just 30 epochs; however, for fairness in the comparison, we report the results for this number. Non-Linear-3 discards even more heads by retaining only 7% of the projected features, and thus it achieves 12.3% higher training loss compared to Non-Linear-8 due to its reduced number of parameters. Interestingly, however, it achieves almost the same validation accuracy and BLEU score as Non-Linear-8. This means that the majority of the data variance is contained in the first three to eight heads and that the rest of the heads contain redundant information.
Another way to validate the diversification of the heads is to compute the correlation between the weights of the PCA layer and to show that the correlation matrix of the PCA layer’s weights resembles the identity matrix. As we can see from Figure 6i, the off-diagonal elements of the correlation matrix in the Direct-8 architecture are close to zero and the norm of the weights of the PCA layer is equal to one. Note that the off-diagonal elements are not exactly equal to zero because the weight updates are constrained by the fact that the cost must also be decremented in every epoch. Therefore, the new weights of the PCA layer are not the exact principal components, but they are aligned to them as much as possible.
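This diagnostic is straightforward to reproduce: compute the correlation (here, the cosine-similarity) matrix of the PCA-layer weight columns and compare it to the identity. The sketch below builds a synthetic, nearly orthogonal weight matrix to show what the check looks like; in practice the trained DEACON weights would be used instead.

```python
import numpy as np

def weight_correlation(w):
    """Cosine-similarity matrix of the PCA-layer weight columns.

    If the columns are aligned with principal components, the off-diagonal
    entries are close to zero and the matrix resembles the identity."""
    w = w / np.linalg.norm(w, axis=0, keepdims=True)   # unit-norm columns
    return w.T @ w

rng = np.random.default_rng(8)
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))           # exactly orthogonal columns
w_trained = q + 0.05 * rng.normal(size=(8, 8))         # small deviation from orthogonality
print(np.round(np.abs(weight_correlation(w_trained)), 2))   # near-identity matrix
```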
Figure 6ii shows the correlation matrix of the PCA layer’s weights for the Average-8 architecture. Recall that, for each word, the inputs of the PCA layer in this architecture are averaged over the d k dimensions of the attention heads, which means that they could become significantly correlated with each other. Therefore, finding the principal components is even more challenging. At the same time, the algorithm is constrained to simultaneously decrease the loss in every epoch, which makes it even more difficult to converge to the principal components. This is clearly reflected in the figure, where some of the off-diagonal elements are small but not close to zero.
Finally, for this dataset, Figure 6iii shows the correlation matrix of the PCA layer's weights for the Non-Linear-8 architecture. From this figure, we can see that almost all of the off-diagonal elements are very close to zero; therefore, the weights of the PCA layer have been almost perfectly aligned with the principal components. The Non-Linear-8 architecture had initially increased feature dimensionality compared to the other architectures. However, by retaining only 8 out of 44 heads, we performed simultaneous dimensionality reduction. By doing this, we automatically discarded the heads that were correlated with each other, which contained redundant information. As a result, the projected heads were completely uncorrelated. However, achieving this maximum alignment between the weights and the principal components hindered the minimization of the cost function, but it did not diminish the validation accuracy or the BLEU score, as shown in Table 3.

5.2. XSum Dataset (Text Summarization)

Table 4 shows how the proposed methodology performs in the text summarization task. The performance is evaluated regarding the training loss and three ROUGE metrics: ROUGE-1, ROUGE-2 and ROUGE-L. From the table, we can see that the Direct-8 architecture exhibits the best performance across all metrics. It achieves the lowest training loss of 3.187 (1.79% lower than the baseline Transformer) and the highest ROUGE scores: 33.12% for ROUGE-1 (2.06% improvement), 12.05% for ROUGE-2 (2.29% improvement) and 26.21% for ROUGE-L (2.10% improvement). These consistent improvements across all metrics demonstrate the ability of DEACON to capture a wider range of linguistic patterns and relationships for text summarization. When we reduce the direct architecture to three heads (Direct-3), decreasing the parameter count, the performance remains close to the baseline. It shows a slight decrease in the ROUGE scores (32.38%, 11.72%, 25.59% for ROUGE-1, ROUGE-2 and ROUGE-L, respectively) but demonstrates that the DEACON algorithm can effectively prune the attention heads while retaining most of the performance.
Average-8, with ∼7% fewer parameters, performs slightly below the baseline, with ROUGE-1, ROUGE-2 and ROUGE-L scores equal to 32.01%, 11.56% and 25.33%, respectively. The three-head version, further reduced to 4,984,738 parameters, sees a more significant drop in the corresponding ROUGE scores at 30.87%, 11.02% and 24.45%, respectively, with higher training loss (3.512). This suggests that the averaging approach in DEACON may lead to the loss of some nuanced information for the summarization task or important information when the pruning is too aggressive.
Non-Linear-8 achieves ROUGE-1, ROUGE-2 and ROUGE-L scores equal to 32.23%, 11.69% and 25.48%, respectively, performing slightly worse than the baseline Transformer and better than the average Transformer but not quite reaching the level attained by the direct architecture. This is expected since, similarly to the WMT-16 case, it discards more than 82% of its heads (36 out of 44). Notably, Non-Linear-3, which retains just 7% of the heads, achieved similar performance to Non-Linear-8, with ROUGE scores of 32.09%, 11.63% and 25.37%, respectively. This suggests that the non-linear approach in DEACON allows for effective pruning in the summarization task and is effective in preserving important information, even with very few heads.

5.3. SQuAD v1.1 Dataset (Question Answering)

Table 5 shows DEACON’s performance on the question answering task with the SQuAD v1.1 dataset. For each variant, we report the training loss, Exact Match (EM) and F1 score metrics. As shown in the table, Direct-8 achieves the best performance across all metrics. It reduces the training loss to 2.812 and improves the EM to 68.73% and the F1 score to 78.21%. These improvements, while not dramatic compared to the baseline Transformer, still demonstrate that the diversified attention heads can collectively represent a broader spectrum of linguistic information for question answering. When pruned to three heads, the Direct-3 Transformer maintains performance close to the baseline, with ∼4.5% fewer parameters. It achieves an EM of 67.65% and an F1 score of 77.18%, showing only a slight decrease despite the sizable reduction in the number of heads.
Average-8, using 4,992,688 parameters, performs slightly below the baseline, with an EM of 67.42% and an F1 score of 76.95%. However, Average-3, which is further reduced to 4,984,738 parameters, shows a more significant drop to 65.78% for the EM and 75.32% for the F1 score. This suggests that averaging potentially discards important information when pruned too aggressively for question answering tasks.
Non-Linear-8 performs slightly better than the baseline, achieving an EM of 67.98% and an F1 score of 77.56%, even though it maintains ∼18% of its heads (8 out of 44). Non-Linear-3 achieves similar performance, with an EM of 67.54% and an F1 score of 77.09%, despite having discarded 93% of its heads. This indicates that the data variance is contained in the first few heads and that the non-linear approach in DEACON effectively preserves the question answering capabilities.

5.4. SlimPajama Dataset (Language Modeling)

Due to the size of the SlimPajama dataset, we expanded all of the model architectures and configurations. For the baseline Transformer, we set the number of heads to $h = 16$, while the DEACON variants were configured with either $m = 16$ or $m = 8$ outputs from the PCA layer, with the latter corresponding to the eight largest eigenvalues. This approach effectively pruned the model without pre-training redundant heads. Each encoder and decoder section in these architectures was set to consist of 24 distinct layers. The embedding dimension of the models was set to $d = 2048$. Adhering to the standard practice, we set the dimensionality of the queries, keys and values to $d_k = d/h$, resulting in $d_k = 128$. The intermediate layer in the position-wise feed-forward networks was expanded to 8192 units. To ensure a balanced assessment, all models underwent training for an identical number of epochs, with each instance initialized using a shared random seed. For consistency with the previous experiments, throughout our analysis, we again denote our proposed methods as ‘name of architecture–number of retained heads’; for example, the direct architecture preserving 16 heads is labeled as “Direct-16”. These enhanced models, each comprising about 354 million parameters, enabled us to evaluate DEACON's performance on a more extensive language modeling challenge. For the SlimPajama dataset experiments, we utilized a GPU server equipped with eight H100 GPUs in order to handle the increased computational demands.
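For reference, the scaled-up configuration described above can be collected in a small dictionary; the field names are ours and simply summarize the reported hyperparameters.

```python
# Hyperparameters of the scaled-up SlimPajama models, as reported in the text.
slimpajama_config = {
    "num_heads": 16,          # h for the baseline; DEACON variants use m = 16 or m = 8
    "num_layers": 24,         # encoder and decoder layers each
    "d_model": 2048,          # embedding dimension d
    "d_head": 2048 // 16,     # d_k = d / h = 128
    "ffn_dim": 8192,          # position-wise feed-forward width
    "approx_params": 354e6,   # roughly 354 million parameters per model
    "train_tokens": 300e9,    # ~300B training tokens sampled from SlimPajama
    "val_tokens": 320e6,      # ~320M validation tokens
}
print(slimpajama_config["d_head"])   # 128
```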
Table 6 shows the experimental results on the subsampled SlimPajama dataset described in Table 2. The models are evaluated across multiple dimensions: perplexity, training time, inference speed and convergence rate. The direct architecture with 16 heads (Direct-16) demonstrates superior performance across all key metrics. Despite a minimal 0.009% increase in parameters (354,850,816) compared to the baseline Transformer, it achieves validation perplexity of 9.41, marking a significant 4.6% improvement over the baseline. This substantial perplexity reduction suggests that DEACON’s head diversification strategy effectively captures more nuanced language patterns. Moreover, it converges 14.8% faster, at 2.3 epochs, indicating more efficient learning. However, this comes at the cost of a 3.5% increase in training time (8.8 days) and 2.4% slower inference (42 ms), highlighting a trade-off between the performance and computational demands.
Decreasing the direct architecture to eight heads (Direct-8) demonstrates a balance between efficiency and performance. With a 0.22% reduction in parameters (354,023,424), it achieves perplexity of 9.57, only 1.7% higher than its 16-head counterpart and still 2.9% better than the baseline. Notably, it reduces the training time by 5.9% (8.0 days) and the inference time by 9.8% (37 ms) compared to the baseline, demonstrating DEACON’s capability to create more streamlined models with minimal performance degradation.
The average architecture variants present a perspective on the efficiency-performance spectrum. Average-16, with a 0.06% parameter reduction (354,588,672), performs comparably to the baseline in most metrics. However, the eight-head version, despite a 0.30% parameter reduction (353,761,280), showcases the fastest training and inference times of 7.8 days (8.2% faster than baseline) and 36 ms (12.2% faster), respectively. This comes at the cost of 1.2% higher perplexity (9.98) compared to the baseline, illustrating a case where the computational efficiency is prioritized over the raw performance.
The non-linear architecture also presents some interesting results. Non-Linear-16, with a slight 0.023% increase in parameters (354,899,968), improves the perplexity by 3.4% (9.52) over the baseline. However, this improvement comes with a 5.9% increase in the training time (9.0 days) and a 7.3% slower inference time (44 ms), underscoring the computational cost of its more complex architecture. Non-Linear-8 seems to offer a middle ground, with a 0.21% parameter reduction (354,072,576) and perplexity of 9.61, which is still 2.5% better than the baseline. It reduces the training and inference times compared to its 16-head counterpart, but remains slightly slower than the baseline.

6. Conclusions

In this paper, we introduce DEACON, a novel algorithm that integrates the diversification of the Transformer heads directly into the training process. Unlike previous work that analyzed or modified the attention heads after training was complete, our method actively shapes multi-head attention diversification while the model learns its primary task. DEACON achieves this by projecting the heads into a space that maximizes their variance, utilizing a novel constrained optimization approach that incorporates the Generalized Hebbian Algorithm (GHA) in its learning rule. This approach trains a specific principal components analysis (PCA) layer integrated into the Transformer architecture, providing new insights into attention mechanisms and their optimization.
We propose three variations of this PCA layer—direct, average and non-linear—each offering different trade-offs between performance and efficiency. These variations can be seamlessly integrated into the baseline Transformer and trained using DEACON. The framework’s mathematical foundations enable a systematic approach to understanding and optimizing attention head interactions, while providing flexibility in architectural choices.
Our experiments on machine translation using the WMT-16 dataset demonstrated DEACON’s effectiveness, with the direct architecture showing consistent improvements over the baseline Transformer. We also evaluated DEACON across a diverse range of NLP tasks. In text summarization using the XSum dataset, we observed improvements across all ROUGE metrics with the direct Transformer. For question answering on SQuAD, DEACON enhanced both the Exact Match and F1 scores, demonstrating its capability in tasks requiring precise information extraction. Our large-scale language modeling experiments on the SlimPajama dataset revealed interesting patterns about the architectures’ effectiveness across different scales. Using eight H100 GPUs to handle the increased computational demands, we tested DEACON on models with hundreds of millions of parameters. The direct architecture achieved improved perplexity and enhanced convergence speeds compared to the baseline. In addition, the average architecture variants demonstrated particular strengths in computational efficiency at larger scales, showcasing the flexibility of our approach in adapting to different computational environments.
These comprehensive results, spanning from smaller models trained on single GPUs to large-scale experiments requiring multiple GPUs, demonstrate DEACON's contribution to both the theoretical understanding and practical improvement of Transformer models. By providing a mathematical framework for attention head diversity, DEACON enables the models to capture a wider range of linguistic patterns and relationships. In future work, we plan to further evaluate DEACON's performance on an even wider range of tasks and larger LLMs. Additionally, investigating the interpretability of the diversified attention heads could provide valuable insights into the specific linguistic phenomena captured by each head. More specifically, we intend to explore the following:
  • Theoretical foundations—developing a formal framework to characterize the relationship between the attention head diversity and model performance across different NLP tasks;
  • Efficiency optimization—investigating techniques to reduce the computational overhead of the PCA layer while maintaining its diversification benefits;
  • Attention interpretability—analyzing how diversified attention heads capture different linguistic phenomena to better understand their complementary roles.
In conclusion, DEACON represents a step towards advancements in optimizing Transformer architectures. Its performance improvements across diverse tasks, coupled with the potential for model efficiency through head pruning, establish a foundation for future research in attention mechanism design and optimization in Transformer-based models.

Author Contributions

Conceptualization, N.A. and F.S.; Methodology, N.A. and F.S.; Software, N.A. and F.S.; Validation, F.S.; Visualization, F.S.; Supervision, N.A.; Writing—original draft, N.A. and F.S.; Writing—review & editing, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

F.S. was supported by a PhD scholarship from the State Scholarships Foundation (IKY), Greece.

Data Availability Statement

The data presented in this study are openly available at https://github.com/flo3003/DEACON, accessed on 20 August 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  2. Raganato, A.; Tiedemann, J. An Analysis of Encoder Representations in Transformer-Based Machine Translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 287–297. [Google Scholar] [CrossRef]
  3. Michel, P.; Levy, O.; Neubig, G. Are Sixteen Heads Really Better than One? arXiv 2019, arXiv:1905.10650. [Google Scholar]
  4. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar]
  5. Raganato, A.; Scherrer, Y.; Tiedemann, J. Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation. arXiv 2020, arXiv:2002.10260. [Google Scholar]
  6. Tay, Y.; Bahri, D.; Metzler, D.; Juan, D.C.; Zhao, Z.; Zheng, C. Synthesizer: Rethinking Self-Attention in Transformer Models. arXiv 2020, arXiv:2005.00743. [Google Scholar]
  7. Peng, H.; Schwartz, R.; Li, D.; Smith, N.A. A Mixture of h - 1 Heads is Better than h Heads. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6566–6577. [Google Scholar] [CrossRef]
  8. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
  9. Ampazis, N.; Perantonis, S.; Taylor, J. Dynamics of Multilayer Networks in the Vicinity of Temporary Minima. Neural Netw. 1999, 12, 43–58. [Google Scholar] [CrossRef] [PubMed]
  10. Sanger, T.D. Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network. Neural Netw. 1989, 2, 459–473. [Google Scholar] [CrossRef]
  11. Oja, E. Principal Components, Minor Components, and Linear Neural Networks. Neural Netw. 1992, 5, 927–935. [Google Scholar] [CrossRef]
  12. Karras, D.; Perantonis, S. An Efficient Constrained Training Algorithm for Feedforward Networks. IEEE Trans. Neural Netw. 1995, 6, 1420–1434. [Google Scholar] [CrossRef] [PubMed]
  13. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  14. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: New York, NY, USA, 2014. [Google Scholar]
  15. Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. arXiv 2016, arXiv:1605.00459. [Google Scholar]
  16. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
  17. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Su, J., Duh, K., Carreras, X., Eds.; pp. 2383–2392. [Google Scholar] [CrossRef]
  18. Soboleva, D.; Al-Khateeb, F.; Myers, R.; Steeves, J.R.; Hestness, J.; Dey, N. SlimPajama: A 627B Token Cleaned and Deduplicated Version of RedPajama. 2023. Available online: https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama (accessed on 9 June 2023).
  19. Ostapenko, O.; Lesort, T.; Rodriguez, P.; Arefin, M.R.; Douillard, A.; Rish, I.; Charlin, L. Continual learning with foundation models: An empirical study of latent replay. In Proceedings of the Conference on Lifelong Learning Agents, PMLR, Montréal, QC, Canada, 22–24 August 2022; pp. 60–91. [Google Scholar]
  20. Black, S.; Biderman, S.; Hallahan, E.; Anthony, Q.; Gao, L.; Golding, L.; He, H.; Leahy, C.; McDonell, K.; Phang, J.; et al. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv 2022, arXiv:2204.06745. [Google Scholar]
Figure 1. The reshaping operation. This figure illustrates the reshaping operation of the concatenated multi-head attention output $M \in \mathbb{R}^{n \times h \cdot d_k}$ (on the left). Each head $Z_i \in \mathbb{R}^{n \times d_k}$, $i = 1, \dots, h$, is represented by an $n \times d_k$ matrix, where $n = 2$, $d_k = 3$ and $h = 4$ for illustrative purposes. The output of the reshaping operation is the matrix $M_r \in \mathbb{R}^{n \cdot d_k \times h}$ (on the right).
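For readers who prefer code to notation, the following minimal PyTorch sketch reproduces the reshaping of Figure 1 under the assumption that the i-th column of $M_r$ stacks the entries of head $Z_i$ row by row; the tensor names and the exact ordering inside each column are illustrative and may differ from the authors’ implementation.
```python
import torch

n, h, d_k = 2, 4, 3                      # illustrative sizes from Figure 1
M = torch.randn(n, h * d_k)              # concatenated multi-head output [Z_1 | ... | Z_h]

# Split M into its h heads (each n x d_k), flatten every head into a column
# of length n*d_k, and stack those columns to obtain M_r in R^{(n*d_k) x h}.
heads = M.reshape(n, h, d_k)             # heads[:, i, :] recovers head Z_{i+1}
M_r = torch.stack([heads[:, i, :].reshape(-1) for i in range(h)], dim=1)
print(M_r.shape)                         # torch.Size([6, 4]) == (n*d_k, h)
```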
Figure 2. The direct architecture. These figures illustrate the direct architecture. (i) shows the operations involved in the PCA layer (Equation (23)), where the normalized matrix $\tilde{M}_r \in \mathbb{R}^{n \cdot d_k \times h}$ is multiplied by $P \in \mathbb{R}^{h \times h}$ to obtain $M'_r \in \mathbb{R}^{n \cdot d_k \times h}$, with $n = 2$, $d_k = 3$ and $h = 4$ for illustrative purposes. $M'_r$ is then reshaped into $M' \in \mathbb{R}^{n \times h \cdot d_k}$, which is multiplied by $W^O \in \mathbb{R}^{h \cdot d_k \times d}$ in order to be rescaled to $Z_{\mathrm{out}} \in \mathbb{R}^{n \times d}$ with $d = 5$. (ii) shows the rescaling operation (Equation (25)).
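The direct architecture of Figure 2 can be sketched at the shape level in a few lines of PyTorch. The normalization step and the random weights below are placeholders (in the paper, $P$ is trained with the proposed constrained Hebbian algorithm and the normalization follows the referenced equations), so this is an illustration rather than the authors’ implementation.
```python
import torch

n, h, d_k, d = 2, 4, 3, 5
M_r = torch.randn(n * d_k, h)            # reshaped heads from Figure 1

# Placeholder column-wise normalization standing in for the paper's normalized matrix.
M_tilde = (M_r - M_r.mean(dim=0)) / (M_r.std(dim=0) + 1e-6)

P = torch.randn(h, h)                    # PCA-layer weights (trained with the Hebbian rule in the paper)
W_O = torch.randn(h * d_k, d)            # output projection

M_prime_r = M_tilde @ P                  # (n*d_k, h): diversified heads
# Undo the reshaping: column i of M_prime_r becomes head i again, then re-concatenate.
heads = M_prime_r.T.reshape(h, n, d_k)   # (h, n, d_k)
M_prime = torch.cat(list(heads), dim=1)  # (n, h*d_k)
Z_out = M_prime @ W_O                    # (n, d): rescaled output
print(Z_out.shape)                       # torch.Size([2, 5])
```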
Figure 3. The average architecture. These figures show the average architecture. (i) shows the operation that calculates the average of each head across the $d_k$ dimension, as defined in Equation (26), in order to obtain the matrix $S \in \mathbb{R}^{n \times h}$, where $n = 2$, $d_k = 3$ and $h = 4$ for illustrative purposes. (ii) shows the operations involved in the PCA layer, where $S$ is multiplied by $P \in \mathbb{R}^{h \times h}$ to obtain $S' \in \mathbb{R}^{n \times h}$ (Equation (28)). (iii) shows the rescaling operation (Equation (30)), where $S'$ is multiplied by $W^O \in \mathbb{R}^{h \times d}$ in order to be rescaled to $Z_{\mathrm{out}} \in \mathbb{R}^{n \times d}$ with $d = 5$.
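A corresponding shape-level sketch of the average architecture in Figure 3, again with placeholder random weights in place of the trained PCA layer:
```python
import torch

n, h, d_k, d = 2, 4, 3, 5
M = torch.randn(n, h * d_k)              # concatenated multi-head output

# Average each head over its d_k features: one scalar per word per head.
S = M.reshape(n, h, d_k).mean(dim=2)     # (n, h)

P = torch.randn(h, h)                    # PCA-layer weights (placeholder)
W_O = torch.randn(h, d)                  # much smaller output projection (h x d)

S_prime = S @ P                          # (n, h): diversified, compressed heads
Z_out = S_prime @ W_O                    # (n, d)
print(S.shape, Z_out.shape)              # torch.Size([2, 4]) torch.Size([2, 5])
```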
Figure 4. The non-linear architecture. These figures illustrate the non-linear architecture. (i) shows how the augmented matrix $C \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$ is obtained from the matrix $M_r \in \mathbb{R}^{n \cdot d_k \times h}$, where $n = 2$, $d_k = 3$ and $h = 4$ for illustrative purposes. (ii) shows the operations involved in the PCA layer (Equation (33)), where the normalized augmented matrix $\tilde{C} \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$ is multiplied by $P \in \mathbb{R}^{(h(h+1)/2 + h) \times (h(h+1)/2 + h)}$ to obtain $C' \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$. Following this, Figure 5i shows the reshaping operation that takes as input the matrix $C'$ and outputs $C'_r \in \mathbb{R}^{n \times (h(h+1)/2 + h) \cdot d_k}$. Finally, Figure 5ii shows the rescaling operation (Equation (35)), where $C'_r$ is multiplied by $W^O \in \mathbb{R}^{h \cdot d_k \times d}$ in order to be rescaled to $Z_{\mathrm{out}} \in \mathbb{R}^{n \times d}$ with $d = 5$.
Figure 5. The non-linear architecture (cont’d). Continuing from Figure 4 (see the previous page), which shows the construction of the augmented matrix $C$ and the PCA layer, (i) shows the reshaping operation that takes as input the matrix $C' \in \mathbb{R}^{n \cdot d_k \times (h(h+1)/2 + h)}$ and outputs $C'_r \in \mathbb{R}^{n \times (h(h+1)/2 + h) \cdot d_k}$. Finally, (ii) shows the rescaling operation (Equation (35)), where $C'_r$ is multiplied by $W^O \in \mathbb{R}^{h \cdot d_k \times d}$ in order to be rescaled to $Z_{\mathrm{out}} \in \mathbb{R}^{n \times d}$ with $d = 5$.
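The feature augmentation step of the non-linear architecture (Figures 4 and 5) can likewise be sketched as below. The sketch assumes that the augmented columns are all degree-two products of head columns with $i \le j$ (which yields the $h(h+1)/2 + h$ columns stated in the captions) and uses a placeholder normalization and random PCA-layer weights; the subsequent reshaping and rescaling mirror the direct architecture.
```python
import torch

n, h, d_k = 2, 4, 3
M_r = torch.randn(n * d_k, h)            # reshaped heads

# Degree-2 polynomial augmentation: all products of head columns with i <= j
# (including the squared terms), giving h + h*(h+1)/2 columns in total.
poly = [M_r[:, i] * M_r[:, j] for i in range(h) for j in range(i, h)]
C = torch.cat([M_r, torch.stack(poly, dim=1)], dim=1)   # (n*d_k, h + h*(h+1)/2)

p = h + h * (h + 1) // 2                 # augmented width (14 when h = 4)
C_tilde = (C - C.mean(dim=0)) / (C.std(dim=0) + 1e-6)   # placeholder normalization
P = torch.randn(p, p)                    # PCA-layer weights on the augmented space
C_prime = C_tilde @ P                    # (n*d_k, p): diversified augmented features
print(C.shape, C_prime.shape)            # torch.Size([6, 14]) for both
```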
Figure 6. Correlation matrix heat map. This figure shows the heat map of the correlation matrix of the PCA layer’s weights. Each plot corresponds to a different architecture.
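A heat map like the one in Figure 6 can be produced from a trained model’s PCA-layer weights. The snippet below uses random weights in place of the trained matrix (the attribute name `model.pca_layer.weight` in the comment is hypothetical) and assumes the correlations are computed between the columns of $P$, which may differ from the authors’ exact plotting choice.
```python
import torch
import matplotlib.pyplot as plt

h = 8
P = torch.randn(h, h)                    # stand-in for trained PCA-layer weights,
                                         # e.g. model.pca_layer.weight.detach() (hypothetical name)

corr = torch.corrcoef(P.T)               # correlation between the columns of P

plt.imshow(corr.numpy(), cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.title("Correlation of PCA-layer weight columns")
plt.xlabel("component")
plt.ylabel("component")
plt.show()
```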
Table 1. Summary of the proposed DEACON architectures.
Architecture | Description and Key Features
Direct
  • Directly applies PCA to reshaped attention head matrix
  • Maintains original dimensionality of heads before diversification
  • Allows for head pruning by selecting m < h principal components
  • Preserves full information content of attention heads
  • Suitable for cases where maintaining detailed head information is crucial
Average
  • Reduces dimensionality through averaging before PCA
  • Compresses each head into a single scalar per word
  • Significantly reduces parameter count
  • Faster processing due to reduced dimensions
  • Best suited for applications where computational efficiency is priority
  • May lose some fine-grained information due to averaging
Non-linear
  • Introduces polynomial features before PCA
  • Captures potential non-linear relationships between heads
  • Expands feature space through squared terms and head combinations
  • Automatically prunes redundant polynomial features
  • Effective in preserving complex patterns with fewer heads
  • Slightly higher computational overhead during feature generation
Note: For all architectures, h represents the original number of attention heads and m represents the number of retained principal components. The choice of m allows for flexible trade-offs between model complexity and performance.
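As a concrete illustration of the m < h trade-off mentioned in the note above, the sketch below keeps only the first m columns of the PCA layer, under the assumption that these correspond to the leading (highest-variance) components learned by the constrained Hebbian rule; the values are placeholders.
```python
import torch

n, h, d_k, m = 2, 8, 3, 3                # retain m < h principal components
M_tilde = torch.randn(n * d_k, h)        # normalized, reshaped heads (placeholder values)
P = torch.randn(h, h)                    # trained PCA-layer weights (placeholder)

# Head pruning: project onto the first m components only.
M_pruned = M_tilde @ P[:, :m]            # (n*d_k, m)
print(M_pruned.shape)                    # torch.Size([6, 3])
```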
Table 2. Composition of the subsampled SlimPajama dataset.
Data Source | Sampling (%) | Train Tokens | Validation Tokens | Train (% of Total)
CommonCrawl | 67.0 | 200.82B | 214.72M | 66.99
C4 | 15.0 | 44.98B | 48.06M | 15.00
GitHub | 4.5 | 13.51B | 14.42M | 4.51
Books | 4.5 | 13.49B | 14.39M | 4.50
Wikipedia | 4.5 | 13.50B | 14.41M | 4.50
ArXiv | 2.5 | 7.49B | 8.01M | 2.50
StackExchange | 2.0 | 5.99B | 6.41M | 2.00
Total | 100.0 | 299.78B | 320.42M | 100.00
Table 3. Experimental results on the WMT-16 dataset (machine translation).
Architecture | PCA Heads | Number of Parameters | Training Loss | Validation Accuracy (%) | BLEU Score | Params (% of Base)
Baseline Transformer | 8 | 5,370,112 | 0.8211 | 68.63 | 28.4 | 100.00
Direct Architecture | 8 | 5,373,616 | 0.8031 | 69.75 | 28.9 | 100.07
Direct Architecture | 3 | 5,127,586 | 0.9887 | 68.30 | 28.3 | 95.48
Average Architecture | 8 | 4,992,688 | 0.8255 | 67.78 | 28.1 | 92.97
Average Architecture | 3 | 4,984,738 | 1.1382 | 63.70 | 26.8 | 92.82
Non-Linear Architecture | 8 | 5,372,800 | 0.9916 | 68.02 | 28.2 | 100.05
Non-Linear Architecture | 3 | 5,125,690 | 0.8689 | 67.68 | 28.1 | 95.45
For each model, we report the number of heads utilized in the final rescaling operation, the number of trainable parameters and their performance in terms of the training loss, validation accuracy and BLEU score. The ‘Params’ column shows the percentage of parameters relative to the baseline Transformer model.
Table 4. Experimental results on the XSum dataset (text summarization).
Architecture | PCA Heads | Number of Parameters | Training Loss | ROUGE-1 (%) | ROUGE-2 (%) | ROUGE-L (%)
Baseline Transformer | 8 | 5,370,112 | 3.245 | 32.45 | 11.78 | 25.67
Direct Architecture | 8 | 5,373,616 | 3.187 | 33.12 | 12.05 | 26.21
Direct Architecture | 3 | 5,127,586 | 3.356 | 32.38 | 11.72 | 25.59
Average Architecture | 8 | 4,992,688 | 3.278 | 32.01 | 11.56 | 25.33
Average Architecture | 3 | 4,984,738 | 3.512 | 30.87 | 11.02 | 24.45
Non-Linear Architecture | 8 | 5,372,800 | 3.298 | 32.23 | 11.69 | 25.48
Non-Linear Architecture | 3 | 5,125,690 | 3.321 | 32.09 | 11.63 | 25.37
Table 5. Experimental results on the SQuAD v1.1 dataset (question answering).
Architecture | PCA Heads | Number of Parameters | Training Loss | Exact Match (%) | F1 Score (%) | Params (% of Base)
Baseline Transformer | 8 | 5,370,112 | 2.876 | 67.89 | 77.45 | 100.00
Direct Architecture | 8 | 5,373,616 | 2.812 | 68.73 | 78.21 | 100.07
Direct Architecture | 3 | 5,127,586 | 2.934 | 67.65 | 77.18 | 95.48
Average Architecture | 8 | 4,992,688 | 2.901 | 67.42 | 76.95 | 92.97
Average Architecture | 3 | 4,984,738 | 3.087 | 65.78 | 75.32 | 92.82
Non-Linear Architecture | 8 | 5,372,800 | 2.889 | 67.98 | 77.56 | 100.05
Non-Linear Architecture | 3 | 5,125,690 | 2.923 | 67.54 | 77.09 | 95.45
Table 6. Experimental results on the subsampled SlimPajama dataset (language modeling) using 8 H100 GPUs.
Architecture | PCA Heads | Number of Parameters | Perplexity (Valid) | Training Time (days) | Inference Time (ms) | Convergence Epoch
Baseline Transformer | 16 | 354,818,048 | 9.86 | 8.5 | 41 | 2.7
Direct Architecture | 16 | 354,850,816 | 9.41 | 8.8 | 42 | 2.3
Direct Architecture | 8 | 354,023,424 | 9.57 | 8.0 | 37 | 2.5
Average Architecture | 16 | 354,588,672 | 9.63 | 8.6 | 40 | 2.6
Average Architecture | 8 | 353,761,280 | 9.98 | 7.8 | 36 | 2.9
Non-Linear Architecture | 16 | 354,899,968 | 9.52 | 9.0 | 44 | 2.4
Non-Linear Architecture | 8 | 354,072,576 | 9.61 | 8.2 | 39 | 2.6