1. Introduction
Transformer-based language models have revolutionized natural language processing, demonstrating unprecedented performance across various tasks [1,2]. These models, characterized by their multi-layered architecture and self-attention mechanisms, have become the cornerstone of modern NLP applications. However, despite their widespread adoption and empirical success, the internal mechanisms that drive their performance remain inadequately understood [3,4]. This gap in our understanding presents a significant obstacle to further advances in the field and to the development of more efficient architectures.
Previous research has made significant strides in analyzing various aspects of transformer models [5]. Voita et al. [6] examined the evolution of representations across transformer layers, revealing that different attention heads specialize in distinct linguistic tasks. Tenney et al. [7] employed probing tasks to analyze BERT’s layer-wise linguistic capabilities, proposing that BERT’s layers mirror the traditional NLP pipeline. However, these studies, while invaluable, have primarily focused on specific components or linguistic features, lacking a comprehensive mathematical framework to describe the overall semantic convergence process across layers [8,9].
The primary objective of this paper is to establish a mathematical framework for analyzing the semantic convergence of token embeddings across transformer layers. Let $M$ be a transformer model with $L$ layers, and let $e_t^{(i)} \in \mathbb{R}^d$ denote the embedding of token $t$ at layer $i$, where $d$ is the embedding dimension. We define the semantic alignment $S(i)$ of layer $i$ with respect to the final layer $L$ as $S(i) = \frac{1}{|V|} \sum_{t \in V} \cos\!\left(e_t^{(i)}, e_t^{(L)}\right)$, where $V$ is the vocabulary and $\cos(\cdot, \cdot)$ denotes the cosine similarity. We aim to elucidate the fundamental processes underlying these models’ ability to capture and manipulate semantic information by quantifying how these representations evolve through the network. This understanding is theoretically significant and has profound implications for model design and optimization.
The principal challenge in this endeavor lies in formulating a generalizable theory that holds across diverse model architectures and input contexts [10]. The inherent complexity and high dimensionality of transformer models render it difficult to derive universal principles that accurately describe their behavior. Let $f_i: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ denote the transformation applied by the $i$-th layer of a transformer model, where $n$ is the sequence length. The challenge is to characterize the properties of the composition $f_L \circ f_{L-1} \circ \cdots \circ f_1$ in terms of semantic convergence.
To address these challenges, we introduce and prove a novel theorem characterizing the gradient of embedding similarity across layers in transformer models. The proposed theorem provides a quantitative measure of how token representations converge towards their final form as they progress through the network, offering a lower bound on the rate of semantic alignment increase between consecutive layers. The implications of this work extend beyond theoretical interest. By providing a mathematical framework for analyzing the layer-wise behavior of transformer models, we offer a new tool for comparing and optimizing model architectures.
In the following sections, we present our theoretical framework, prove our main theorem, and provide empirical validation through comprehensive experiments on BERT and DistilBERT models. Our analysis covers various aspects of semantic convergence, including the impact of sentence complexity and word frequency, offering insights into how different linguistic features are processed through the layers of transformer models. We examine the properties of the lower-bound function $g(i)$ and its relationship to model architecture, laying the groundwork for a deeper understanding of the mathematical principles underlying transformer-based language models.
Beyond theoretical interest, our findings have significant practical implications for NLP practitioners. By establishing a mathematical framework that describes how semantic information is progressively refined across transformer layers, we provide valuable insights that can inform the design and optimization of transformer-based architectures. Understanding the semantic convergence process can guide practitioners in tasks such as model pruning, layer selection for fine-tuning, and developing more efficient architectures that retain performance while reducing computational costs. This framework also offers a diagnostic tool for analyzing model behavior and identifying layers that contribute most to semantic understanding, thereby aiding in model interpretability and transparency.
2. Background
2.1. Transformer Architecture
Transformer models have significantly advanced the field of natural language processing by efficiently handling sequential data. Introduced by Vaswani et al. [11], transformers utilize a mechanism called self-attention, which allows the model to focus on different parts of the input sequence when generating a representation for each word. This means that the model can capture relationships between words regardless of their position in the sequence, enabling it to understand long-range dependencies in language.
The self-attention mechanism works by assigning weights to each word in the input sequence, indicating its importance relative to other words. This approach has been extensively studied and has proven effective in capturing the overall meaning of sentences [12,13]. Researchers have explored various aspects of self-attention, such as how it can learn to focus on important features in the data [13], and how to simplify it for improved computational efficiency [14].
Transformers are composed of multiple layers, and each layer includes two main components: the self-attention mechanism and a feed-forward neural network. The self-attention component allows the model to weigh the significance of different words when processing the input, while the feed-forward network further transforms these representations to capture higher-level features. By stacking these layers, transformers can build increasingly complex representations of the input data.
Some studies have compared transformers to recurrent neural networks (RNNs), noting that transformers can be viewed as a type of RNN that processes all positions in the sequence simultaneously, rather than sequentially [15]. This parallel processing capability makes transformers more efficient and better suited for handling long sequences than traditional RNNs.
Definition 1 (Transformer layer). A transformer layer $f_i: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ consists of two sublayers applied sequentially to the input sequence $X^{(i-1)} \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and $d$ is the embedding dimension:
Self-Attention Sublayer with Residual Connection: $Z^{(i)} = X^{(i-1)} + \mathrm{MHA}\!\left(\mathrm{LN}\!\left(X^{(i-1)}\right)\right)$.
Feed-Forward Sublayer with Residual Connection: $X^{(i)} = Z^{(i)} + \mathrm{FFN}\!\left(\mathrm{LN}\!\left(Z^{(i)}\right)\right)$.
In this formulation, $X^{(i)}$ is the output sequence of the transformer layer, $\mathrm{MHA}(\cdot)$ denotes the multi-head self-attention mechanism, $\mathrm{LN}(\cdot)$ is layer normalization, and $\mathrm{FFN}(\cdot)$ is a position-wise feed-forward network. The residual connections are added after each sublayer’s output, and layer normalization is applied before each sublayer’s computation. This structure aligns with the Pre-LayerNorm transformer architecture commonly used in practice.
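The sublayer structure in Definition 1 can be written compactly in code. The following is a minimal sketch of a Pre-LayerNorm layer in PyTorch; the dimensions (`d_model`, `n_heads`, `d_ff`) are illustrative defaults, not values prescribed by this paper.

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sublayer with residual: Z = X + MHA(LN(X))
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        z = x + attn_out
        # Feed-forward sublayer with residual: X' = Z + FFN(LN(Z))
        return z + self.ffn(self.ln2(z))
```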
2.2. Token Embeddings and Semantic Representation
Token embeddings are dense vector representations of discrete tokens (e.g., words or subwords) in a continuous vector space. In transformer models, these embeddings evolve through the layers, gradually capturing more complex and abstract semantic information.
Theorem 1 (Embedding transformation).
Let $e_t^{(i)} \in \mathbb{R}^d$ be the embedding of token $t$ at layer $i$. The transformation of this embedding through a transformer layer can be expressed as
$e_t^{(i+1)} = \mathrm{FFN}\!\left(e_t^{(i)} + \sum_{j=1}^{n} \alpha_{tj}\, e_j^{(i)}\right),$
where $\alpha_{tj}$ are attention weights and the sum represents the contextual information from other tokens.
2.3. Cosine Similarity as a Measure of Semantic Alignment
Cosine similarity serves as a fundamental metric for quantifying semantic alignment between vector representations in natural language processing. For two vectors $u, v \in \mathbb{R}^d$, the cosine similarity is defined as follows.
Definition 2 (Cosine similarity).
The cosine similarity between two non-zero vectors $u, v \in \mathbb{R}^d$ is defined as
$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}.$
This measure captures the angular similarity between vectors, providing a normalized metric invariant to the magnitude of the vectors. The cosine similarity has several important properties that make it particularly suitable for analyzing semantic alignment in transformer models.
In our analysis, we use the cosine similarity as defined in Definition 2 without taking the absolute value. This decision is based on the significance of vector directionality in semantic embedding spaces. A cosine similarity of 1 indicates that the vectors are perfectly aligned and convey similar semantic content, while a cosine similarity of $-1$ implies that the vectors are diametrically opposed, representing contrasting meanings. By retaining the sign of the cosine similarity, we preserve critical information about the relationship between token embeddings. Taking the absolute value would obscure this distinction, treating opposing meanings as similar, which is undesirable in the context of semantic analysis. Therefore, using the cosine similarity without the absolute value allows us to accurately capture the nuances of semantic alignment and dissimilarity between token embeddings.
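As a concrete illustration, the signed similarity can be computed directly; the minimal sketch below uses PyTorch's built-in `cosine_similarity` and is only meant to show that the sign is retained.

```python
import torch
import torch.nn.functional as F

def signed_cosine(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # No absolute value is taken, so anti-aligned vectors yield -1 rather than +1.
    return F.cosine_similarity(u, v, dim=-1)

u = torch.tensor([1.0, 0.0])
v = torch.tensor([-1.0, 0.0])
print(signed_cosine(u, v))  # tensor(-1.): opposing directions are preserved
```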
Theorem 2 (Properties of cosine similarity). For any two non-zero vectors $u, v \in \mathbb{R}^d$, the cosine similarity satisfies the following properties:
Boundedness: $-1 \le \cos(u, v) \le 1$.
Symmetry: $\cos(u, v) = \cos(v, u)$.
Invariance to scalar multiplication: $\cos(c\,u, v) = \cos(u, v)$ for any scalar $c > 0$.
Cosine distance: $d_{\cos}(u, v) = \arccos\!\left(\cos(u, v)\right)$ is a proper distance metric on the unit sphere.
Proof. The proof of these properties follows from the definition of cosine similarity and basic vector algebra:
Boundedness: this follows from the Cauchy–Schwarz inequality, $|u \cdot v| \le \|u\|\,\|v\|$. Dividing both sides by $\|u\|\,\|v\|$ yields the result.
Symmetry: this is evident from the commutativity of the dot product.
Invariance to scalar multiplication: this follows from the properties of the dot product and vector norms, $\cos(c\,u, v) = \frac{(c\,u) \cdot v}{\|c\,u\|\,\|v\|} = \frac{c\,(u \cdot v)}{|c|\,\|u\|\,\|v\|} = \cos(u, v)$ for $c > 0$.
Cosine distance: the proof that $d_{\cos}(u, v) = \arccos\!\left(\cos(u, v)\right)$ is a proper distance metric involves verifying the four metric axioms (non-negativity, identity of indiscernibles, symmetry, and triangle inequality) for vectors on the unit sphere.
□
2.4. Fractal Mathematics and Self-Similarity
Fractal mathematics deals with structures that exhibit self-similarity across different scales, meaning that the structure appears similar regardless of the level of magnification [16]. A classic example is the Mandelbrot set, where zooming into the boundary reveals infinitely many repetitions of the overall shape. In mathematical terms, a fractal is a set that displays self-similarity under some scaling transformation.
In the context of transformer models, the repeated application of similar layer transformations to token embeddings can be seen as analogous to the iterative processes in fractal generation. Each layer refines the representations in a manner that preserves certain structural properties while introducing new details, much like how fractals evolve under iteration. This self-similarity suggests that concepts from fractal mathematics may provide a suitable framework for analyzing the convergence properties of these models.
By viewing the semantic convergence through the lens of fractal self-similarity, we can employ mathematical tools from fractal analysis to quantify the rate and nature of convergence. For instance, the concept of fractal dimension could inspire metrics for measuring the complexity of embedding spaces across layers. This perspective enriches our theoretical framework and aligns with the observed patterns of progressive refinement in transformer models.
The semantic convergence process in transformer models exhibits intriguing parallels with concepts from fractal mathematics [17]. As we further explore the layers of these models, we observe a self-similar pattern in the way token representations refine and converge, akin to the recursive and iterative processes found in fractal structures. In fractal geometry, complex patterns are generated by repeating a simple process at different scales, resulting in structures that are self-similar across scales [16]. Similarly, transformer models apply the same layer transformations repeatedly, leading to representations that exhibit patterns of refinement that are consistent across layers. This fractal-inspired perspective is appropriate for our analysis because it captures the essence of hierarchical processing and recursive refinement inherent in transformer architectures [18]. By leveraging mathematical concepts from fractal geometry, we can quantify the progressive convergence of token embeddings and provide a framework that mirrors the self-similar nature of the models’ internal dynamics. This approach not only enhances our theoretical understanding, but also offers practical tools for analyzing and optimizing transformer models.
3. Related Work
3.1. Analysis of Transformer Layer Dynamics
The study of transformer layer dynamics has been a focal point in understanding the internal mechanisms of these models. Voita et al. [6] pioneered this field by examining the evolution of representations across transformer layers. Their work revealed that different attention heads specialize in distinct linguistic tasks, a finding that contrasts with our approach of analyzing the overall semantic convergence. While Voita et al. focused on individual attention heads, our method provides a holistic view of semantic evolution across layers.
Tenney et al. [7] extended this line of inquiry by employing probing tasks to analyze BERT’s layer-wise linguistic capabilities. They proposed that BERT’s layers mirror the traditional NLP pipeline, a hypothesis that aligns with our notion of gradual semantic refinement. However, our work differs in its mathematical formalization of this process, offering a quantitative measure of semantic convergence through the gradient of embedding similarity. The findings of Tenney et al. were later challenged by Niu et al. [19], who argued for a more nuanced interpretation of BERT’s layer-wise behavior. This debate underscores the complexity of transformer dynamics and motivates our approach of seeking a universal principle of semantic convergence that holds across different architectures.
Recent studies have further diversified the analysis of transformer layers. Kakouros et al. [20] investigated BERT’s encoding of prosodic information, finding a concentration in middle layers. Similarly, Zheng et al. [21] examined syntactic knowledge in Chinese BERT. These works highlight the multifaceted nature of information processing in transformers, which our gradient analysis aims to capture in a unified framework.
3.2. Semantic Representation in Language Models
The analysis of semantic representations in language models has been a topic of significant interest in recent years. Peters et al. [22] introduced ELMo, demonstrating the power of deep contextualized word representations. Their work laid the foundation for understanding how neural language models capture semantic information across different layers. The ELMo representation can be formalized as
$\mathrm{ELMo}_k = \gamma \sum_{j=0}^{L} s_j\, \mathbf{h}_{k,j},$
where $\mathbf{h}_{k,j}$ is the contextualized representation of the $k$-th token at the $j$-th layer, $s_j$ are softmax-normalized weights, and $\gamma$ is a scalar parameter.
This line of inquiry was extended by investigating how contextualized representations encode sentence structure across various syntactic and semantic phenomena [23], introducing scalar mixing weights to combine information from different layers, $\mathbf{h}_k = \gamma \sum_{j=0}^{L} \mathrm{softmax}(a)_j\, \mathbf{h}_{k,j}$. This approach allows for a fine-grained analysis of how different linguistic features are captured across layers, providing insights into the hierarchical nature of semantic representations in transformer models.
Ethayarajh [24] comprehensively compared the geometry of contextualized word representations in models such as BERT, ELMo, and GPT-2. Their work introduced the concept of self-similarity to quantify how much a word’s representations vary across contexts at each layer,
$\mathrm{SelfSim}_i(w) = \frac{1}{|C_w|\,(|C_w| - 1)} \sum_{c \in C_w} \sum_{c' \in C_w,\, c' \neq c} \cos\!\left(e_i(w, c),\, e_i(w, c')\right),$
where $e_i(w, c)$ is the embedding of word $w$ in context $c$ at layer $i$, and $C_w$ is the set of contexts in which $w$ appears. This measure provides insights into the contextual nature of word representations and how they evolve through the network.
Our work builds upon these foundations by introducing a novel measure of semantic convergence that captures the global behavior of token representations across all layers simultaneously. We extend the notion of self-similarity to consider the alignment of representations with the final layer, providing a more comprehensive view of semantic evolution in transformer models.
3.3. Gradient-Based Analysis of Neural Networks
Gradient-based analysis of neural networks has been a fruitful area of research for understanding the internal dynamics of deep learning models. Raghu et al. [25] introduced Singular Vector Canonical Correlation Analysis (SVCCA) as a powerful tool for analyzing the representations learned by neural networks across layers. SVCCA computes the similarity between two sets of neurons by combining singular value decomposition with canonical correlation analysis.
While SVCCA provides valuable insights into the similarity of representations across layers, our approach differs by focusing specifically on the semantic alignment of token representations with respect to the final layer. This allows us to capture the progressive refinement of semantic information through the network, which is particularly relevant for understanding transformer-based language models.
4. Method
4.1. Formalization of Semantic Convergence
In this section, we formalize the concept of semantic convergence in transformer-based language models. Our approach extends the notion of token embeddings introduced in the Background section to capture the evolution of semantic representations across layers.
We begin by introducing a measure of semantic alignment between token embeddings at different model layers. This measure serves as the foundation for our analysis of semantic convergence.
Definition 3 (Layer-wise semantic alignment).
Let $M$ be a transformer-based language model with $L$ layers. For any input sentence $s \in \mathcal{D}$, let $e_t^{(i)}(s) \in \mathbb{R}^d$ denote the embedding of the token at position $t$ in sentence $s$ at layer $i$, where $t$ ranges over the positions in $s$. The embeddings $e_t^{(i)}(s)$ are context-dependent due to the self-attention mechanism in the transformer architecture. The semantic alignment $S(i)$ of layer $i$ with respect to the final layer $L$ is defined as
$S(i) = \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} \frac{1}{|s|} \sum_{t=1}^{|s|} \cos\!\left(e_t^{(i)}(s),\, e_t^{(L)}(s)\right),$
where $\mathcal{D}$ is the set of input sentences and $\cos(\cdot, \cdot)$ denotes the cosine similarity between two vectors. The semantic alignment measure $S(i)$ quantifies the average similarity between token representations at layer $i$ and their final representations at the output layer $L$. This measure provides a principled way to assess the progression of semantic information through the network.
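Definition 3 maps directly onto a short computation over a model's hidden states. The sketch below is a minimal illustration for a single sentence, assuming `hidden_states` is a list of tensors of shape `(num_tokens, d)`, one per layer with the final layer last; the function name is ours.

```python
import torch
import torch.nn.functional as F

def semantic_alignment(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """S(i): mean cosine similarity of layer-i token embeddings to the final layer."""
    final = hidden_states[-1]  # e_t^(L) for all token positions t
    return torch.stack([
        F.cosine_similarity(h, final, dim=-1).mean()  # average over token positions
        for h in hidden_states
    ])
```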
Assumption 1 (Semantic refinement in transformer layers).
For any input sentence $s \in \mathcal{D}$, any token position $t$ in $s$, and layers $1 \le i < j \le L$ in a well-trained transformer model, the transformation applied by the layers between $i$ and $j$ brings the embedding closer to $e_t^{(L)}(s)$ in terms of cosine similarity, i.e.,
$\cos\!\left(e_t^{(j)}(s),\, e_t^{(L)}(s)\right) \ge \cos\!\left(e_t^{(i)}(s),\, e_t^{(L)}(s)\right).$
Assumption 1 posits that, in a well-trained transformer model, the cosine similarity between a token’s embedding at layer $i$ and its final-layer embedding $e_t^{(L)}(s)$ increases monotonically with the layer index $i$. This assumption is generally justified due to the nature of transformer architectures and their training objectives.
In transformer models, each layer applies a series of transformations designed to refine token embeddings by integrating contextual information from other tokens in the sequence. Specifically, the self-attention mechanism and feed-forward networks in each transformer block are trained to progressively capture increasingly abstract and high-level representations of the input data. Moreover, during training, the model’s parameters are optimized to minimize a loss function that depends on the outputs at the final layer L. As a result, the transformations applied in intermediate layers are indirectly guided to produce embeddings that facilitate accurate predictions at layer L. This optimization process encourages embeddings at each layer to become progressively more aligned with the final representations. Additionally, the presence of residual connections in transformer architectures allows later layers to easily preserve or refine relevant information from earlier layers, facilitating this progressive refinement. This architectural design further supports the gradual improvement of embeddings in terms of their alignment with the final layer representations.
Empirical evidence from previous studies [3,7] has demonstrated that representations in deeper layers of transformer models capture more semantic and syntactic information than those in earlier layers. Our experimental results (Section 6) also corroborate this behavior, showing a consistent increase in cosine similarity between embeddings at layer $i$ and the final layer as $i$ increases.
Therefore, while a formal mathematical proof of Assumption 1 for all possible models and inputs may be intractable due to the complexity of neural networks, both theoretical considerations and empirical observations support the general validity of this assumption in well-trained transformer models. The self-attention mechanism and feed-forward layers in each transformer block are designed to progressively refine token representations, capturing increasingly complex and task-relevant features as information flows through the network. During training, the model learns to optimize these transformations to minimize the overall loss function, which typically involves predicting correct outputs at the final layer. This optimization process naturally encourages each layer to produce representations that are more aligned with the final layer’s requirements, leading to the observed pattern of increasing semantic similarity.
Moreover, the residual connections present in transformer architectures allow later layers to easily preserve or refine relevant information from earlier layers, facilitating this progressive refinement. However, it is important to note that while this assumption generally holds, it may fail for specific inputs or in poorly trained models.
We now establish several key properties of the semantic alignment measure.
Lemma 1 (Properties of semantic alignment). For a transformer model $M$ with $L$ layers:
$S(L) = 1$.
$-1 \le S(i) \le 1$ for all $1 \le i \le L$.
Under Assumption 1, $S(i) \le S(j)$ for all $1 \le i \le j \le L$.
The proof of this lemma is provided in Appendix A.
4.2. Gradient of Embedding Similarity
To characterize the rate at which semantic alignment increases across layers, we introduce the notion of the gradient of embedding similarity.
Definition 4 (Embedding similarity gradient).
The embedding similarity gradient $\Delta S(i)$ for layer $i$ is defined as the difference in semantic alignment between consecutive layers,
$\Delta S(i) = S(i+1) - S(i),$
for $1 \le i \le L - 1$. The embedding similarity gradient provides a measure of how much the semantic representation of tokens changes between adjacent layers. A positive gradient indicates an increase in alignment with the final layer representation, while a negative gradient would suggest a divergence.
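In code, the gradient is simply a first difference over the per-layer alignments; a minimal sketch, reusing the `semantic_alignment` helper and `hidden_states` list assumed in the earlier sketch:

```python
# Delta S(i) = S(i+1) - S(i), computed for all consecutive layer pairs at once.
S = semantic_alignment(hidden_states)
delta_S = S[1:] - S[:-1]
```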
Lemma 2 (Positivity of similarity gradient).
For a well-trained transformer model $M$ satisfying Assumption 1, the embedding similarity gradient $\Delta S(i)$ is non-negative for all layers: $\Delta S(i) \ge 0$ for $1 \le i \le L - 1$. The proof of this lemma is provided in Appendix B.
4.3. Proof of the Main Theorem
We now present and prove our main theorem, which characterizes the gradient of embedding similarity across transformer layers.
Theorem 3 (Gradient of embedding similarity).
For a given language model $M$ with $L$ layers, there exists a monotonically increasing function $g: \{1, \ldots, L-1\} \to \mathbb{R}$ such that
$S(i+1) - S(i) \ge g(i)$
for all $1 \le i \le L - 1$, where $S(i)$ is the semantic alignment of layer $i$, as defined in Definition 3. The proof of this theorem is provided in Appendix C.
This theorem establishes a fundamental property of transformer-based language models: the semantic alignment of token representations consistently increases as we move through the layers of the network. The monotonically increasing function $g(i)$ provides a lower bound on this rate of increase, capturing the minimum improvement in semantic alignment that we can expect at each layer.
5. Experimental Setup
5.1. Dataset Generation and Preprocessing
To evaluate the gradient of embedding similarity across transformer layers, we constructed a large and diverse corpus of input sentences designed to probe various aspects of the models’ behavior. We generated $N$ input sentences to ensure sufficient statistical power when analyzing large pre-trained models such as BERT. We stratified our input set into two categories: simple sentences $\mathcal{D}_{\mathrm{simple}}$ and complex sentences $\mathcal{D}_{\mathrm{complex}}$, such that $\mathcal{D} = \mathcal{D}_{\mathrm{simple}} \cup \mathcal{D}_{\mathrm{complex}}$ and $\mathcal{D}_{\mathrm{simple}} \cap \mathcal{D}_{\mathrm{complex}} = \emptyset$.
The simple sentences were generated by sampling random sequences of words from the vocabulary, with lengths varying up to the maximum token length of the models. Specifically, for each $s \in \mathcal{D}_{\mathrm{simple}}$, we have $s = (w_1, w_2, \ldots, w_k)$ with each $w_j$ drawn from the vocabulary, where $k$ is uniformly sampled from the integer range $\{1, \ldots, 512\}$, allowing for sentences of varying lengths up to the models’ maximum token capacity.
The complex sentences were generated using a context-free grammar $G$, designed to produce sentences with intricate syntactic structures and lengths up to 512 tokens. This approach ensures that our dataset includes sentences of sufficient length to fully engage the models’ capacities.
To quantify the complexity and length distribution of the sentences, we adjusted the parameters of the generation processes to produce a balanced mix of sentence lengths, including a significant proportion of long sentences (e.g., over 256 tokens). This allows us to analyze the models’ behavior across a wide range of input lengths, ensuring that our findings are robust and generalizable.
To preprocess the generated sentences, we applied a series of transformations $T_1, T_2, \ldots, T_m$, where each $T_j$ represents a specific preprocessing step (e.g., tokenization, lowercasing, special character removal). The composition of these transformations yields our final preprocessed dataset, $\mathcal{D}' = (T_m \circ T_{m-1} \circ \cdots \circ T_1)(\mathcal{D})$. This approach to dataset generation and preprocessing ensures that our experimental setup is well-suited for analyzing the gradient of embedding similarity across a diverse range of linguistic contexts.
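To make the sampling and preprocessing composition concrete, here is a minimal sketch; the specific transformations and the random-sequence sampler are illustrative stand-ins for the steps described above, not the exact pipeline used in this work.

```python
import random

def sample_simple_sentence(vocab: list[str], max_len: int = 512) -> str:
    # s = (w_1, ..., w_k) with k sampled uniformly up to the models' token limit.
    k = random.randint(1, max_len)
    return " ".join(random.choices(vocab, k=k))

# Preprocessing as a composition T_m ∘ ... ∘ T_1 of simple transformations.
def lowercase(s: str) -> str:
    return s.lower()

def strip_special(s: str) -> str:
    return "".join(ch for ch in s if ch.isalnum() or ch.isspace())

def preprocess(s: str, transforms=(lowercase, strip_special)) -> str:
    for t in transforms:
        s = t(s)
    return s
```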
5.2. Model Architecture and Hyperparameters
Our experiments utilized pre-trained BERT [11] and DistilBERT [6] models, chosen for their widespread use and to compare the behavior of a full-scale transformer with its distilled counterpart. Let $M_{\mathrm{BERT}}$ and $M_{\mathrm{DistilBERT}}$ denote these models, respectively. Both models operate on input sequences of maximum length $n = 512$ tokens, with an embedding dimension of $d = 768$. The key architectural difference lies in the number of layers: $L_{\mathrm{BERT}} = 12$, whereas $L_{\mathrm{DistilBERT}} = 6$.
For each model $M \in \{M_{\mathrm{BERT}}, M_{\mathrm{DistilBERT}}\}$ and each input sentence $s \in \mathcal{D}'$, we extracted contextualized token embeddings from every layer of the model. Let $e_t^{(i)}(s)$ denote the embedding of the token at position $t$ in sentence $s$ at layer $i$. It is important to note that $e_t^{(i)}(s)$ inherently depends on the entire input sentence $s$ due to the self-attention mechanism in the transformer architecture, which allows each token to attend to all other tokens in the sequence.
The extraction process can be formalized as
$E^{(i)}(s) = \phi_i(s) \in \mathbb{R}^{|s| \times d}, \quad i = 0, 1, \ldots, L,$
where $\phi_i$ represents the function that maps an input sentence $s$ to its sequence of contextualized token embeddings at layer $i$, and $L$ is the total number of layers in the model. The $t$-th row of $E^{(i)}(s)$ corresponds to the embedding $e_t^{(i)}(s)$ of the token at position $t$ in the context of sentence $s$ at layer $i$. Since these embeddings are context-dependent, they capture both the lexical information of the tokens and the syntactic and semantic relationships within the sentence.
We emphasize that while we refer to $e_t^{(i)}(s)$ as the embedding of a token, it is more accurately the embedding of a token in context, reflecting the influence of the entire input sentence on the token’s representation at layer $i$. This context-aware representation is crucial for our analysis of semantic convergence across layers.
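The extraction step can be reproduced with the Hugging Face Transformers API; the following sketch loads BERT and collects the hidden states of every layer for one sentence, following the library's standard conventions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentence = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (L + 1) tensors of shape (1, n, d):
# the embedding-layer output followed by the output of each transformer layer.
hidden_states = [h.squeeze(0) for h in outputs.hidden_states]
```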
5.3. Embedding Similarity Analysis
To quantify the semantic convergence across layers, we computed the cosine similarity between each layer’s embeddings and the final layer’s embeddings for every token, $\cos\!\left(e_t^{(i)}(s),\, e_t^{(L)}(s)\right)$. The layer-wise average similarity $S(i)$ was then calculated as
$S(i) = \frac{1}{|\mathcal{D}'|} \sum_{s \in \mathcal{D}'} \frac{1}{|s|} \sum_{t=1}^{|s|} \cos\!\left(e_t^{(i)}(s),\, e_t^{(L)}(s)\right).$
To estimate the gradient of embedding similarity, we then computed the differences between consecutive layers, $\Delta S(i) = S(i+1) - S(i)$.
5.4. Statistical Analysis and Validation
To analyze the impact of word frequency on the similarity gradient, we categorized tokens into frequent and rare words based on their occurrence in the input sentences. Let $\mathrm{freq}(t)$ denote the frequency of token $t$ in $\mathcal{D}'$. We defined the set of frequent words $F = \{t : \mathrm{freq}(t) \ge \tau\}$ for a frequency threshold $\tau$, treated the remaining tokens as rare, and computed separate similarity gradients for frequent and rare words by restricting the average in $S(i)$ to each category.
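As a minimal illustration of this split (the `Counter`-based frequency count and the threshold value are our own assumptions, not specified in the paper):

```python
from collections import Counter

def split_by_frequency(token_freq: Counter, threshold: int) -> tuple[set, set]:
    # Frequent words: freq(t) >= threshold; everything else is treated as rare.
    frequent = {t for t, c in token_freq.items() if c >= threshold}
    rare = set(token_freq) - frequent
    return frequent, rare
```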
To assess the statistical significance of our results, we employed a bootstrap resampling technique. Let $B$ be the number of bootstrap samples. For each sample $b \in \{1, \ldots, B\}$, we randomly selected $N$ sentences with replacement from $\mathcal{D}'$ and recomputed $S(i)$, $\Delta S(i)$, and the frequency-specific similarity gradients. All experiments were implemented using PyTorch and the Hugging Face Transformers library. Computations were performed on a single NVIDIA RTX A6000 GPU.
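A minimal sketch of the bootstrap step, assuming `curves` holds the per-sentence alignment curves $S(i)$ computed as above; the helper name and the 95% interval are illustrative choices.

```python
import numpy as np

def bootstrap_alignment(curves: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """curves: array of shape (N_sentences, L + 1) with S(i) computed per sentence.
    Returns the bootstrap mean and a 95% confidence interval per layer."""
    rng = np.random.default_rng(seed)
    n = curves.shape[0]
    boot_means = np.stack([
        curves[rng.integers(0, n, size=n)].mean(axis=0)  # resample sentences with replacement
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [2.5, 97.5], axis=0)
    return boot_means.mean(axis=0), (lo, hi)
```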
6. Results
6.1. Gradient of Embedding Similarity
Our experimental results provide strong support for Theorem 3, demonstrating a consistent pattern of semantic convergence across both BERT and DistilBERT models.
Figure 1a illustrates the average similarity gradient for both models, clearly showing the monotonic increase in semantic alignment across layers.
For BERT, we observed a gradual increase in similarity from 0.0886 at the first layer to 1.0 at the final layer. DistilBERT, despite having fewer layers, achieved a similar level of final-layer similarity, starting from 0.0717 and reaching 1.0. This finding suggests that DistilBERT’s compression technique effectively preserves the semantic convergence property of the full BERT model. The fitted monotonic functions $g(i)$ for both models provide empirical evidence for the lower bound on the rate of semantic convergence predicted by our theorem: BERT’s fitted function shows a constant value of 0.0738 for most layers, with a slight increase in the final layers, while DistilBERT’s function maintains a constant value of 0.1547 throughout all layers.
6.2. Impact of Sentence Complexity
To assess the robustness of our findings across different input types, we analyzed the semantic convergence patterns for simple and complex sentences separately.
Figure 1b presents the results of this analysis.
For BERT, we observed that complex sentences initially showed slightly higher similarities (0.0886 vs. 0.0885 at layer 0) compared to simple sentences. This difference remained minimal throughout the layers, with both types converging to similar values (0.9001 vs. 0.9003 at layer 11). DistilBERT exhibited a similar pattern, with complex sentences starting at 0.0720 (vs. 0.0714 for simple sentences) and converging to 0.8843 (vs. 0.8844) at the penultimate layer. These findings suggest that both BERT and DistilBERT effectively capture the semantic content of simple and complex sentences similarly across all layers.
6.3. Word Frequency Analysis
To investigate the impact of word frequency on semantic convergence, we conducted a detailed analysis of similarity gradients for frequent and rare words. Let $V$ be the vocabulary and $\mathrm{freq}: V \to \mathbb{N}$ be the frequency function that maps each word to its occurrence count in the corpus. We define the set of frequent words $V_{\mathrm{freq}} = \{w \in V : \mathrm{freq}(w) \ge \tau\}$ for a frequency threshold $\tau$, and the set of rare words $V_{\mathrm{rare}} = V \setminus V_{\mathrm{freq}}$.
For each layer $i$ and word category $c \in \{\mathrm{freq}, \mathrm{rare}\}$, we compute the average similarity $S_c(i)$ as
$S_c(i) = \frac{1}{|V_c|} \sum_{w \in V_c} \cos\!\left(e_w^{(i)},\, e_w^{(L)}\right),$
where $e_w^{(i)}$ is the embedding of word $w$ at layer $i$, and $L$ is the final layer.
Table 1 presents the average similarities for frequent and rare words at key layers of both BERT and DistilBERT models.
Our analysis reveals that both frequent and rare words follow similar convergence patterns, with rare words consistently showing slightly higher similarities across all layers for both models. For BERT, we observe an average similarity of 0.6205 for frequent words and 0.6372 for rare words. For DistilBERT, the average similarities are 0.7483 for frequent words and 0.7603 for rare words. To quantify this difference, we define the similarity gap $\Delta_{\mathrm{gap}}(i)$ at layer $i$ as $\Delta_{\mathrm{gap}}(i) = S_{\mathrm{rare}}(i) - S_{\mathrm{freq}}(i)$. We observe that $\Delta_{\mathrm{gap}}(i) > 0$ for all layers $i$, with an average gap of 0.0167 for BERT and 0.0120 for DistilBERT across all layers. To test the statistical significance of this difference, we performed a paired t-test for each layer, comparing the similarities of frequent and rare words. The null hypothesis $H_0: S_{\mathrm{freq}}(i) = S_{\mathrm{rare}}(i)$ was tested against the alternative hypothesis $H_1: S_{\mathrm{freq}}(i) \neq S_{\mathrm{rare}}(i)$ for each layer $i$. The null hypothesis was rejected for all layers in both models, with p-values < 0.001, confirming that the observed difference is statistically significant. The degrees of freedom for each test were $n - 1$, where $n$ is the number of word pairs compared in each layer.
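The per-layer significance test can be reproduced with SciPy; the sketch below assumes `sim_freq` and `sim_rare` are paired arrays of per-word similarities at a given layer, and the function name and significance threshold are illustrative.

```python
from scipy import stats

def frequency_gap_test(sim_freq, sim_rare, alpha: float = 0.05):
    # Paired t-test of H0: S_freq(i) = S_rare(i) against H1: S_freq(i) != S_rare(i).
    t_stat, p_value = stats.ttest_rel(sim_freq, sim_rare)
    return t_stat, p_value, p_value < alpha
```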
These findings have important implications for understanding how transformer models process words of varying frequencies and could inform strategies for improving model performance on tasks involving rare words or domain-specific vocabulary.
6.4. Layer-Wise Similarity Differences
To provide a more granular view of the semantic convergence process, we analyzed the layer-wise differences in similarity.
Figure 2 illustrates these differences for both BERT and DistilBERT models.
For BERT, we observed that the largest increase in similarity occurs between the first and second layers (0.2576), followed by a gradual decrease in the magnitude of change. The final layer transition shows a moderate increase (0.0998), possibly due to the model’s final refinement of semantic representations. DistilBERT exhibits a similar pattern, with the largest increase occurring between the first and second layers (0.4143), but with larger magnitudes of change due to its compressed architecture. These findings provide strong empirical support for the lower bound $g(i)$ proposed in Theorem 3.
To quantify the rate of semantic convergence, we computed the average rate of change in similarity across layers, $\bar{\Delta} = \frac{1}{L-1} \sum_{i=1}^{L-1} \Delta S(i)$. DistilBERT’s average rate of change is roughly double that of BERT, indicating that DistilBERT achieves a faster rate of semantic convergence, likely due to its compressed architecture.
In conclusion, our experimental results provide strong empirical support for Theorem 3, demonstrating the existence of a monotonic function describing the increase in semantic alignment across transformer layers. This property holds consistently across different model architectures, sentence complexities, and word frequencies, highlighting its fundamental nature in transformer-based language models. The layer-wise analysis further reinforces this finding, providing a detailed view of the semantic convergence process and its variations across different model architectures.
7. Discussion
7.1. Implications for Theory and Practice
The findings of this study have significant implications for both theoretical understanding and practical applications of transformer-based language models. Theoretically, our work provides a mathematical framework that formalizes the semantic convergence process in transformers. By establishing the existence of a monotonically increasing function $g(i)$ that lower-bounds the layer-to-layer increase in semantic alignment, we contribute to the fundamental knowledge of how deep neural networks process and refine semantic information. This framework bridges the gap between empirical observations and theoretical analysis, offering a solid foundation for future research on model interpretability and the internal dynamics of transformer architectures.
Practically, the insights gained from our analysis can inform the design and optimization of transformer models in several ways. Understanding the rate of semantic convergence across layers allows practitioners to make informed decisions about model depth and layer configurations. For instance, models can be designed with the optimal number of layers necessary to achieve desired levels of semantic alignment, potentially reducing computational costs without compromising performance. Additionally, the identification of layers that contribute most significantly to semantic refinement can guide pruning strategies and the development of more efficient architectures tailored to specific tasks. This is particularly relevant in resource-constrained environments where model efficiency is critical.
Furthermore, our findings on the differential treatment of frequent and rare words suggest that transformer models inherently capture nuances related to word frequency. This has practical implications for tasks involving domain-specific vocabulary or low-resource languages, where rare words are more prevalent. By leveraging this understanding, models can be fine-tuned or augmented to improve performance on tasks that require sensitivity to rare or specialized terms.
An example of practical application of our theoretical framework is in the field of graph representation learning. Recent work by Zhang et al. [26] introduced Graph Masked Autoencoders with Transformers, leveraging transformer architectures to learn representations of graph structures. Our findings on semantic convergence could inform the design of such models by providing insights into how transformer layers refine embeddings in graph contexts. Understanding the layer-wise semantic alignment can help optimize the depth and configuration of graph transformers, potentially improving their ability to capture complex structural information in graphs. This demonstrates the broader applicability of our theoretical contributions beyond natural language processing to other domains where transformer models are employed.
7.2. Limitations
While our study provides valuable insights into the semantic convergence of transformer models, it is important to acknowledge its limitations. Firstly, our theoretical framework relies on Assumption 1, which posits that embeddings become progressively more aligned with the final layer representations. While this assumption is supported by empirical evidence and the design of transformer architectures, it may not hold universally for all models or under all training conditions. Models trained on different objectives or with alternative architectures may exhibit different convergence behaviors.
Additionally, our experimental validation is limited to pre-trained BERT and DistilBERT models. Although these models are widely used and representative of transformer architectures, the generalizability of our findings to other models, such as GPT series or newer architectures like Transformer-XL, remains to be tested. Future work should explore a broader range of models to confirm the universality of the observed semantic convergence patterns. Also, the analysis focuses on average behaviors across tokens and layers, which may obscure important nuances at the individual token level or in specific linguistic contexts. Semantic convergence may vary for different types of linguistic phenomena, such as idiomatic expressions, named entities, or syntactic ambiguities. A more fine-grained analysis could reveal additional complexities in how transformers process and represent language.
Lastly, our study does not consider the impact of training data characteristics, such as domain diversity or language complexity, on the semantic convergence process. Different training datasets may influence how models learn and refine semantic representations, which could affect the applicability of our theoretical framework in various contexts.
8. Conclusions and Future Work
This paper presents a mathematical analysis of the semantic convergence of token embeddings across layers in transformer-based language models, inspired by the concept of fractal self-similarity. We introduced and proved a novel theorem characterizing the gradient of embedding similarity, providing a quantitative measure of how token representations evolve through the network. Our main result, Theorem 3, establishes the existence of a monotonically increasing function $g(i)$ that describes the consistent increase in semantic alignment across layers, exhibiting a pattern analogous to fractal self-similarity.
Our experimental results, conducted on pre-trained BERT and DistilBERT models, strongly support the theoretical predictions. We observed a consistent pattern of semantic convergence across both model architectures, with variations based on token frequency and model depth. Quantitatively, DistilBERT’s average rate of semantic convergence is roughly double that of BERT, indicating faster convergence in the compressed model.
The analysis of simple versus complex sentences revealed that complex sentences show slightly higher similarities in early layers, and both BERT and DistilBERT effectively capture their semantic content by the final layers. The initial similarity difference between complex and simple sentences (0.0001 for BERT and 0.0006 for DistilBERT) remains minimal throughout the layers, with both sentence types converging to nearly identical values in the late layers (0.9001 vs. 0.9003 for BERT and 0.8843 vs. 0.8844 for DistilBERT). The word frequency analysis showed that both frequent and rare words follow similar convergence patterns, with rare words showing slightly higher similarities across all layers for both models. We quantified this difference and found that rare words consistently exhibited higher similarities, with an average gap of 0.0167 for BERT and 0.0120 for DistilBERT across all layers.
Our layer-wise analysis of similarity differences provided strong empirical support for the lower bound proposed in Theorem 3. We found that the most significant increase in similarity occurs between the first and second layers for both BERT (0.2576) and DistilBERT (0.4143), with DistilBERT exhibiting larger magnitudes of change due to its compressed architecture. This detailed view of the semantic convergence process offers insights into the differential treatment of various linguistic features by the models.
The implications of this work extend beyond theoretical interest. By providing a mathematical framework for analyzing the layer-wise behavior of transformer models, we offer a new tool for comparing and optimizing model architectures. The gradient of embedding similarity introduced here can serve as a metric for assessing the efficiency of different layer configurations in capturing semantic information, potentially guiding the development of more effective and computationally efficient models.
Future work could explore the application of this framework to other transformer architectures to test the generalizability of our findings. A natural extension would be to investigate whether the function $g(i)$ in Theorem 3 can be characterized more precisely for different model architectures. Additionally, investigating the impact of fine-tuning on the similarity gradients could provide insights into how task-specific training affects the semantic alignment across layers. This could be formalized as studying the change in $g(i)$ before and after fine-tuning, $\Delta g(i) = g_{\text{fine-tuned}}(i) - g_{\text{pre-trained}}(i)$.
Another promising direction is to analyze the relationship between the similarity gradient and model performance on specific NLP tasks. This could be approached by defining a performance metric $P(M)$ for a model $M$ on a given task and studying the correlation between $P(M)$ and the characteristics of $g_M(i)$ for that model, $\rho = \mathrm{corr}\!\left(P(M), \Psi(g_M)\right)$, where $\Psi(g_M)$ could include properties such as the area under the curve of $g_M(i)$ or its rate of change across layers.
For NLP practitioners, the gradient of embedding similarity can inform concrete design decisions, such as determining the optimal number of layers or identifying layers that can be pruned without significant loss of semantic alignment. Understanding the semantic convergence process can also aid in developing more efficient training strategies and improving model interpretability, which is crucial for deploying NLP models in real-world applications where transparency and efficiency are important.
While our experiments focused on pre-trained BERT and DistilBERT models, investigating models with different configurations, such as varying numbers of layers, attention heads, and embedding dimensions, would provide deeper insights into the impact of architectural choices on semantic convergence. Additionally, expanding the range of input contexts to include more diverse and multilingual datasets could further validate the universality of the semantic convergence property across languages and domains.
In conclusion, this work advances our understanding of the internal mechanisms of transformer models and provides a mathematical framework for comparing and optimizing model architectures. The gradient of embedding similarity offers a new perspective on the layer-wise behavior of these models, opening up avenues for future research in model interpretation and design. The theoretical foundation laid by Theorem 3 and the empirical validation across different model architectures and input types provide a solid basis for further exploration of the semantic convergence properties in transformer-based language models.