1. Introduction
A variational autoencoder (VAE) [1] has been applied to numerous NLP tasks, including language modeling [2] and semi-supervised text classification [3]. The most prominent component of a VAE in language modeling is its statistical use of a latent representation, which aims to capture holistic and informative features of texts such as styles, topics, and semantic features. With this latent representation, samples from the prior distribution can generate diverse and well-formed sentences [2].
Due to the inherently auto-regressive nature of text, auto-regressive decoders such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks [4] are also widely used, and several models integrating RNNs with VAEs have been proposed [2]. Previous VAE-based language modeling approaches use encoders and decoders in the form of RNNs [2] as well as fusions of RNNs and Transformer [5,6].
However, the mere use of such auto-regressive decoders tends to sideline the informative latent encoding and induces a so-called posterior collapse that makes latent representations useless [2,7]. To mitigate this problem, several techniques have been proposed in the literature, such as updating the inference and generative networks in imbalanced ways [7], refactoring the loss function [8], changing the loss function [9], and adopting Kullback–Leibler (KL) annealing [2,10].
Transformer has produced state-of-the-art results and has become a default choice in various natural language tasks, including generative language modeling [11] and discriminative language understanding [12]. Its excellent performance is due to a self-attention mechanism that captures contextual information from the entire sequence. Our view is therefore that integrating Transformer, which attends over the entire sequence, with a variational autoencoder, which provides statistical inference, will be beneficial for various natural language tasks.
To the best of our knowledge, no model that internally couples a Transformer architecture with a VAE to build a new language modeling approach has been proposed and tested. In addition, from a computational viewpoint, Transformer has several benefits over RNNs, such as mitigating vanishing-gradient issues, reducing sequential operations, and enabling parallelization [13]. Thus, building a VAE language representation model solely with Transformer may capture the holistic features of sentences while allowing parallel computing. Furthermore, a careful adoption of Transformer in our model may yield sufficient performance on various NLP tasks without severe adaptation of the model.
To sum up, this paper makes the following contributions: (1) We provide a novel Transformer model inherently coupled with a variational autoencoder, which we call the variational autoencoder Transformer (VAE-Transformer), for language modeling; (2) We implement the VAE-Transformer model with KL annealing techniques and perform experiments on real-life datasets with different sentence lengths. The results verify that the model, which produces more informative embedding representations, outperforms previous RNN-based models in reconstruction and representation learning.
2. Preliminaries
In this section, we briefly review variational autoencoders and Transformer before proposing a new model that integrates the two.
Variational Autoencoder. A VAE is a deep generative model with latent variables, as shown in Figure 1 [1]. VAE models assume that observable data derive from latent (hidden) variables that follow a simple distribution, usually a Gaussian.
Suppose that an observed data point $x$ is a high-dimensional random vector and that the latent variable $z$ lies in a relatively low-dimensional space. The log-likelihood of the data $x$ can be expressed as follows:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right], \qquad (1)$$

where $p_\theta$ and $q_\phi$ are the probability density functions of the observed data $x$ and the latent variable $z$, respectively, and the Kullback–Leibler divergence is $\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\!\left[\log q - \log p\right]$. The right-hand side of the equation above is a lower bound on the log-likelihood of the data, called the evidence lower bound (ELBO). Noticeably, by including the inference model $q_\phi(z \mid x)$ and the generative model $p_\theta(x \mid z)$, the ELBO can be expressed as the sum of two terms given by the encoder and the decoder, as shown in Equation (2) and Figure 1. Typically, we assume that the prior $p(z)$ is Gaussian, $\mathcal{N}(0, I)$, with dimension $k$, and that $q_\phi(z \mid x)$ is also Gaussian, $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, with dimension $k$. Under this assumption, we can rewrite the ELBO as follows:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \qquad (2)$$
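For illustration, the following minimal sketch (ours, not the authors' implementation) computes the two terms of Equation (2) for a Gaussian posterior and a standard normal prior, using the usual closed form of the Gaussian KL divergence; the reconstruction log-probability is a placeholder that would come from a decoder.

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the k latent dims."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

def elbo(recon_log_prob: torch.Tensor, mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)), per example, as in Equation (2)."""
    return recon_log_prob - gaussian_kl(mu, logvar)

# Toy usage: batch of 4 examples, latent dimension k = 64.
mu = torch.randn(4, 64)
logvar = torch.zeros(4, 64)              # sigma^2 = 1
recon_log_prob = torch.randn(4)          # stand-in for log p(x|z) from a decoder
print(elbo(recon_log_prob, mu, logvar))  # one ELBO value per example
```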
Using neural networks with parameters $\phi$ and $\theta$, we estimate the parameters by maximizing the ELBO, which amounts to maximizing the log-likelihood. The VAE model can be applied to language modeling for text generation and text embeddings. To generate a sentence $x = (x_1, \ldots, x_T)$ of length $T$, a language model generates each token $x_t$ conditioned on the previous tokens:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t}), \qquad (3)$$
where $x_t$ denotes the word (or unit) token at position $t$, and $x_{<t}$ denotes the word tokens before position $t$. In other words, $x_{<t}$ means $(x_1, \ldots, x_{t-1})$ for $t > 1$ and a given fixed token, <BOS>, for $t = 1$. Using the auto-regressive factorization of Equation (3), we express the ELBO as follows:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \qquad (4)$$
In a variational autoencoder for language modeling, both the encoder and the decoder use auto-regressive models such as LSTM [4] and GRU [14]. As depicted in Figure 2, the decoder is conditioned on latent variables designed to capture global features such as style, topic, and high-level syntactic features [2].
However, a variational autoencoder with an auto-regressive decoder easily falls into a local optimum known as posterior collapse or KL vanishing [2,7]. This phenomenon occurs because training to maximize the ELBO often drives the KL divergence term to zero at an early step, so that the latent variable $z$ fails to contain meaningful information about the input sequence. When reconstructing $x$ with a latent variable $z$ obtained from such a training run, the model relies only on the previous tokens $x_{<t}$, as depicted by the red line in Figure 3 [7].
To alleviate this phenomenon, numerous techniques have been proposed, such as updating the inference and generative networks in imbalanced ways [7], refactoring the loss function [8], changing the loss function [9], and KL annealing [2,10]. Because of its fast convergence and low hardware requirements, we adopt KL annealing techniques in the experiments of this work [2,10].
Transformer. Transformer is a sequence-to-sequence model that removes recurrence and adopts self-attention [13]. Transformer architectures have become the state of the art in various NLP tasks such as translation, summarization, and classification. Moreover, the architecture is used in pre-trained language models such as BERT [12] and GPT [11], which have achieved top performance in numerous NLP tasks.
Self-attention and its multi-head version are crucial in Transformer. When a sentence is injected into the Transformer architecture, it is passed to the embedding layer, and the result is summed with a positional encoding that supplies position information. Next, the output is projected into multiple sets of queries, keys, and values.
Attention maps a query (Q), together with a key (K) and a value (V), to an output computed as a weighted sum of values, in which the weight, also called the attention score, is the similarity between query and key. The similarity is the dot product between query and key, scaled by the square root of the key dimension, $\sqrt{d_k}$, as in Equation (5), where $n_q$ and $n_k$ are the lengths of the query and key (or value) sequences, respectively, and $d_q$, $d_k$, and $d_v$ are the dimensions of query, key, and value, respectively:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. \qquad (5)$$
As an extension of attention, self-attention is an attention mechanism in which the query, key, and value come from the same source.
In addition, Transformer adopts multi-head attention, a mechanism that projects the query, key, and value into multiple different spaces and applies the attention function in each, as shown in Figure 4. Equation (6) shows this process as a concatenation of multiple attention outputs, in which the linear projections are $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_q}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$, where $d_{\mathrm{model}}$ is the dimension of the model and $h$ is the number of heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}). \qquad (6)$$

Multi-head attention provides various views of the query, key, and value. The total number of heads, which is a hyperparameter, is in practice often set by an associated performance measure.
The architecture includes an encoder composed of two sub-layers: a multi-head self-attention layer and a feed-forward neural network layer. It also has a decoder composed of three sub-layers: a masked multi-head self-attention layer, an encoder-decoder attention layer, and a feed-forward neural network layer. The masked multi-head self-attention layer uses masks to prevent the current position from attending to future tokens: when predicting the next token, the model can see only the current and previous tokens.
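The masking just described can be sketched as a lower-triangular matrix (our own helper); such a mask would be passed to an attention function like the one above so that position $t$ attends only to positions $\le t$.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: entry (t, s) is True iff position t may attend to s (s <= t)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```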
Transformer is also equipped with an encoder-decoder attention structure in the decoder part, as shown in Figure 5. The encoder-decoder attention layer computes attention scores between the encoder output and the decoder input, where the query is the output of the masked multi-head self-attention layer and both the key and the value are the output of the encoder. We note that this encoder-decoder attention is essentially the connection that links the encoder part to the decoder part of Transformer, which we will relate to the use of a VAE.
The Transformer model is superior to existing RNN-based models in two respects: preventing posterior collapse and enabling parallel computing. The long-term dependence issue occurs when the back-propagated gradients of recurrent models gradually decrease toward zero, causing the latent variable $z$ to be rarely updated during training, which results in posterior collapse with a useless latent variable $z$ [7]. On the contrary, by attending to all input elements equally, the Transformer model is able to capture long-term dependence and prevent posterior collapse, which yields latent variables that are useful for generating sentences. In addition, the Transformer model supports parallel computing. Unlike Transformer, RNN-based models predict tokens recurrently, so parallel computing is impossible; Transformer, by treating all input elements simultaneously in matrix form, enables parallel processing.
To the best of our knowledge, previous variational autoencoder models with Transformer for language modeling use a mixed version of Transformer and RNNs [5,6], in which Transformer serves as the encoder and an RNN as the decoder. With this structure, because of the auto-regressive property of RNNs, the models can hardly take full advantage of Transformer. Thus, we tightly couple the VAE and Transformer, which prevents posterior collapse and enables parallel computing. With these ideas, we propose a VAE plugged into Transformer, which we call the VAE-Transformer model, so that the VAE connects the output of Transformer's encoder with the input of Transformer's decoder while providing statistical inference inside Transformer. Indeed, some Transformer-based VAE models exist in the literature. Conditional generation models with VAE and Transformer [15,16] focus on predicting the next tokens of the decoder, as in machine translation, where the inputs of the encoder differ from those of the decoder; the primary use of those models excludes sentence representation, which our model focuses on. Moreover, the pre-trained models coupled with VAE in [5,17] differ from our model in that they adopt pre-trained models internally, and they compare only against other pre-trained models rather than RNN-based models. Arroyo et al. [18] proposed a VAE-based Transformer model for layout generation. These previous models use the variational autoencoder only structurally and do not alleviate posterior collapse. Our model's novelty is its use of a VAE with Transformer in language modeling, which is also applicable to conditional generation and pre-trained models. In addition, our model alleviates posterior collapse by creating a strong connection between the encoder and the decoder. We explain our model in the following section.
3. The Proposed Model, VAE-Transformer
The proposed VAE-Transformer model combines Transformer with a variational autoencoder for both the encoder and the decoder, as shown in Figure 6. The model remains mostly the same as the original Transformer; however, it differs in that a variational autoencoder is plugged in to connect the encoder output with the encoder-decoder structure so as to apply and adjust variational effects in the decoding process. It is therefore worth comparing the VAE-based Transformer with the original Transformer and with VAE-equipped RNNs. In addition, we believe that the embeddings produced by the VAE-based Transformer will be diverse while maintaining an effective representation of the input texts, so that the proposed model may be useful in further NLP tasks such as summarization and machine translation, among others.
The encoder remains the same as in Transformer, except that its output is connected to the decoder through the VAE. In order to create a latent variable $z$, we attach an additional neural network after the last layer of the encoder. This network produces $\mu$ and $\sigma^2$, the distribution parameters of the latent variable $z$, in the same way as in [1], shown as the blue box in Figure 6. In addition, we revise the encoder-decoder attention in the Transformer decoder so that the latent variable $z$, instead of the output of the encoder, feeds the attention module. The rationale of this structure is that it stays close to the encoder-decoder attention in Transformer, allowing us to retain the original Transformer decoder structure. Thus, by replacing the previous encoder-decoder attention, we direct more attention to the latent variable and the output of the decoder. We denote this structure, followed by the feed-forward network, as latent-decoder attention, displayed as the red box in Figure 6.
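As a sketch of the latent-decoder attention just described (our reading of the architecture, with assumed shapes and an assumed projection from the latent dimension to the model dimension), the latent variable $z$ replaces the encoder output as the key/value source of the decoder's cross-attention, while the query still comes from the masked self-attention output.

```python
import torch
import torch.nn as nn

class LatentDecoderAttention(nn.Module):
    """Cross-attention in which the latent variable z (not the encoder output)
    provides the keys and values; the decoder states provide the queries.
    Shapes and the z -> d_model projection are our assumptions, not the paper's code."""

    def __init__(self, d_model: int = 128, d_latent: int = 64, n_heads: int = 2):
        super().__init__()
        self.z_proj = nn.Linear(d_latent, d_model)   # lift z into the model dimension
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, dec_states: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # dec_states: (batch, T, d_model) output of the masked self-attention sub-layer
        # z:          (batch, d_latent) one latent vector per sentence
        kv = self.z_proj(z).unsqueeze(1)             # (batch, 1, d_model) key/value "sequence"
        out, _ = self.attn(query=dec_states, key=kv, value=kv)
        return out

# Toy usage
layer = LatentDecoderAttention()
out = layer(torch.randn(2, 10, 128), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 10, 128])
```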
We define the abbreviations for the tokens in use: the token <BOS> marks the beginning of a sentence, and the token <EOS> marks the end of a sentence. Using the encoder, we compute the output of the input sequence X = [<BOS>, x_1, ..., x_T, <EOS>], in which the output at <BOS> represents a summary of X, similar to BERT [12]. Using a neural network, shown as the blue box in Figure 6, we convert the encoder output at <BOS> into its stochastically sampled representation $z$, drawn from a Gaussian distribution with the mean $\mu$ and variance $\sigma^2$ of the latent variable $z$. Because the sampling process is not differentiable, we use the reparametrization trick, which outsources the sampling to an additional variable $\epsilon$ whose distribution is $\mathcal{N}(0, I)$:

$$z = \mu + \sigma \odot \epsilon,$$

where ⊙ is the element-wise product [1]. Using the decoder, we generate the embeddings of the decoder input sequence Y = [<BOS>, x_1, ..., x_T] and process them in the masked self-attention layer. We then process the samples of the latent variable $z$ and the output of the masked self-attention layer in the latent-decoder attention layer, similar to the encoder-decoder attention layer of the original Transformer decoder. Lastly, we predict each token based on $z$ and the previous tokens, and we obtain the loss over the entire output sequence by fitting it to [x_1, ..., x_T, <EOS>].
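Putting the pieces together, here is a hedged sketch of one training step as we understand the description above: encode X, take the <BOS> state, produce $\mu$ and $\log \sigma^2$, sample $z$ with the reparameterization trick, decode Y, and combine token cross-entropy with the KL term. All module names are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vae_transformer_step(encoder, to_mu, to_logvar, decoder, X, Y, targets, beta, pad_id=0):
    """One illustrative training step for a VAE-Transformer-style model.

    encoder(X)        -> (batch, T_x, d_model) encoder states; position 0 is <BOS>
    to_mu / to_logvar -> linear layers mapping the <BOS> state to the latent parameters
    decoder(Y, z)     -> (batch, T_y, vocab) logits, using latent-decoder attention on z
    targets           -> (batch, T_y) gold ids [x_1, ..., x_T, <EOS>]
    beta              -> KL annealing weight, as in Equation (8)
    """
    h_bos = encoder(X)[:, 0, :]                   # summary of X at the <BOS> position
    mu, logvar = to_mu(h_bos), to_logvar(h_bos)
    eps = torch.randn_like(mu)                    # eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * eps        # reparameterization: z = mu + sigma ⊙ eps
    logits = decoder(Y, z)
    recon = F.cross_entropy(logits.transpose(1, 2), targets,
                            ignore_index=pad_id, reduction="mean")
    kl = 0.5 * torch.mean(torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1))
    return recon + beta * kl                      # minimize the negative annealed ELBO
```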
4. Experiments
First, we compare the proposed method with a few existing methods in language modeling. We consider three baseline models: (1) an LSTM with a variational autoencoder [2], denoted VAE-LSTM, in which both the encoder and the decoder are LSTMs; (2) an LSTM language model, denoted LSTM, which uses an LSTM to predict the next token from the previous tokens; (3) Transformer language modeling, denoted Transformer-Decoder, which uses only the Transformer decoder without an encoder-decoder attention layer, in the same way as [11]. We denote the proposed method as VAE-Transformer. We also include a variational autoencoder with the pre-trained model GPT2 [16] to see how a pre-trained model behaves inside a variational autoencoder for language modeling; its encoder and decoder are a Transformer encoder and the pre-trained GPT2, respectively. This model was proposed for conditional story generation, and we give the same inputs to its encoder and decoder as those used in our model. We denote this model as VAE-GPT. We show the parameter settings of the tested models in Table 1.
For the comparison, we use four datasets that are publicly available and widely adopted as benchmarks in language modeling: Penn Treebank (PTB) [19], Stanford Sentiment Treebank (SST2) [20], Twitter US Airline Sentiment (AIR), and WikiText2. In short, PTB is a dataset for part-of-speech tagging in NLP, selected from the Wall Street Journal. SST2 is a collection of movie reviews with two labels (positive, negative). AIR consists of tweets about US airlines with three labels (positive, neutral, negative); however, we use only two labels (positive, negative) in the following experiments. Lastly, WikiText2 is a subset of Wikipedia commonly used for language modeling.
To evaluate the performance of language modeling, we adopt two evaluation approaches: intrinsic and extrinsic. The intrinsic evaluation considers two aspects of the generated sentences, which are the output of the decoder. The first is reconstruction, i.e., how well the generated sentences reproduce the input sentences. We quantify reconstruction with the perplexity measure [2]. Perplexity (PPL), shown in Equation (7), measures how well the probability distribution of a language model predicts a text sequence and is defined as the exponential of the negative average log-probability of the words. Thus, the smaller the perplexity of a language model, the better the model:

$$\mathrm{PPL}(x) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right). \qquad (7)$$

The second aspect considered by the intrinsic evaluation is how well representation learning is accomplished, on the basis of a KL divergence measure [2]. The Kullback–Leibler (KL) divergence in Equation (2), being a statistical distance, measures how one probability distribution differs from a reference distribution. In particular, if a language model produces a low reconstruction error and a high KL divergence, the latent variable $z$ represents the sentence data quite well while reflecting informative features such as topics or styles of the input sentences. Furthermore, to assess how well the latent variables of a language model represent the data, we also include mutual information (MI) [21]. Mutual information measures how strongly two random variables are related; we calculate it between the latent variable $z$ and the input variable $x$.
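For concreteness, a small sketch (ours) of the perplexity in Equation (7) computed from per-token log-probabilities; the KL term comes from the closed-form Gaussian expression shown earlier, and MI would additionally require an aggregate-posterior estimate as in [21], which we omit here.

```python
import math

def perplexity(token_log_probs):
    """Equation (7): PPL = exp( -(1/T) * sum_t log p(x_t | x_<t) ).
    token_log_probs: list of natural-log probabilities, one per predicted token."""
    T = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / T)

# Toy usage: three tokens each predicted with probability 0.25
print(perplexity([math.log(0.25)] * 3))   # 4.0
```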
When KL divergence and mutual information are close to zero, the latent variable $z$ fails to contain valuable information from the input sentences, and the model degenerates into conventional language modeling. Encoders that produce a latent variable $z$ with useful information from the input sentences yield non-zero KL divergence and mutual information [2]. In addition, we include the extrinsic evaluation as an extension of the intrinsic evaluation, especially for representation learning. If representation learning works well, the latent variable $z$ contains an ample amount of useful information about the input sentences. To assess the usefulness of $z$ in representation learning, we use a sentiment classification task with $z$ on the SST2 and AIR datasets. As classification in previous language models [11,12] proceeds with pre-training and fine-tuning after training a model for language modeling on the input data, we similarly fine-tune the model for classification with the same input data, following the idea of transfer learning. For a fair comparison, we update only the linear classifier module parameters in the classification model, excluding the VAE encoder parameters, and use the latent variable $z$ as the input to the classifier. Figure 7 illustrates the adopted procedure of the extrinsic evaluation in classification.
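A hedged sketch of this extrinsic protocol as we read it: freeze the trained VAE encoder, take the latent variable $z$ (here its mean $\mu$) as the sentence representation, and train only a linear classifier on top. The module names are placeholders.

```python
import torch
import torch.nn as nn

def build_latent_classifier(encoder: nn.Module, to_mu: nn.Module,
                            d_latent: int = 64, n_classes: int = 2):
    """Freeze the trained VAE encoder and attach a trainable linear head on z."""
    for p in encoder.parameters():
        p.requires_grad = False            # encoder parameters are not updated
    for p in to_mu.parameters():
        p.requires_grad = False
    head = nn.Linear(d_latent, n_classes)  # only these parameters are optimized

    def classify(X: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = to_mu(encoder(X)[:, 0, :])  # use mu of q(z|x) as the sentence embedding
        return head(z)                      # logits for the sentiment classes

    return classify, head
```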
For pre-processing the input texts in the experiments, we use a sentence-piece tokenizer with a vocabulary size of 30,522, similar to the procedure used for BERT [22]. For VAE-GPT, however, we use the byte-pair encoding tokenizer used in GPT2. We use a fixed sequence length (including the <BOS> and <EOS> tokens): when an input sentence is longer than the fixed length, it is truncated; when it is shorter, it is zero-padded. We use a fixed sequence length of 64 for the PTB, SST2, and AIR datasets and 256 for the WikiText2 dataset.
The parameters of the tested models are shown in Table 1. For fair testing, we make the total numbers of parameters of the VAE-Transformer and VAE-LSTM models as close as possible, and likewise for the Transformer-Decoder and LSTM. We note that the Transformer-Decoder and LSTM, which have no VAE, have fewer parameters than the VAE-Transformer and VAE-LSTM. As a pre-trained model, VAE-GPT is the largest of all, so a direct comparison with it is not entirely fair; however, it lets us check the effect of a pre-trained model inside a variational autoencoder.
The model dimension of the Transformer encoder and decoder is 128, and the hidden dimension is 512. Furthermore, we use two heads in the multi-head attention, the GeLU activation function [23] in each layer, and a dimension of 64 for the latent variables. We set the encoder of the VAE-GPT model to be the same as the VAE-Transformer encoder and the decoder of the VAE-GPT model to be the same as in [16].
We set the encoder of the VAE-LSTM model to be bidirectional and the decoder to be unidirectional, with a hidden dimension of 128 and two hidden layers for both the encoder and the decoder. The LSTM language model also uses a hidden dimension of 128 and two hidden layers, the same as the VAE-LSTM. Moreover, we use the same settings for the Transformer-Decoder as for the VAE-Transformer.
We optimize the models with the AdamW optimizer [24] and cosine annealing [25] for a varying learning rate. We use an initial learning rate of $1 \times 10^{-3}$ for the PTB and WikiText2 datasets and $1 \times 10^{-4}$ for the SST2 and AIR datasets. The final learning rate is $1 \times 10^{-7}$, with restarts every 10 epochs for all datasets. We set the momentum parameters to 0.9 and 0.999 and the weight decay to 0.1. The batch size is 32, and each model is trained for 50 epochs.
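In PyTorch terms, this optimization setup corresponds roughly to the following sketch under the stated hyperparameters; `model` and the per-dataset initial learning rate are placeholders.

```python
import torch

def make_optimizer(model: torch.nn.Module, initial_lr: float = 1e-3,
                   epochs_per_restart: int = 10, final_lr: float = 1e-7):
    """AdamW with cosine annealing and warm restarts every 10 epochs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=initial_lr,
                                  betas=(0.9, 0.999), weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=epochs_per_restart, eta_min=final_lr)  # restart every 10 epochs
    return optimizer, scheduler
```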
To prevent KL vanishing, we use linear annealing [2] and cyclical annealing [10]. With these annealing techniques, the objective of the model (ELBO) is rewritten as in Equation (8), where $\beta_t$ is an annealing weight on the KL term that varies with iteration $t$:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta_t \, \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \qquad (8)$$

For linear annealing, $\beta_t$ starts at 0 and reaches 1 at a certain step, after which it is fixed at 1. For cyclical annealing, $\beta_t$ behaves as in linear annealing, except that after reaching 1 it resets to 0, and this cycle repeats several times, as shown in Equation (9). Figure 8 compares the linear and cyclical annealing schedules as functions of iteration $t$:

$$\beta_t = \begin{cases} f(\tau), & \tau \le R \\ 1, & \tau > R \end{cases}, \qquad \tau = \frac{\mathrm{mod}(t-1, \lceil T/M \rceil)}{T/M}, \qquad (9)$$

where $T$ is the total number of training steps, $f$ is a monotonically increasing function, $M$ is the number of cycles, and $R$ is the proportion of a cycle used to increase $\beta_t$. We set $M$ to 5 for the 50 training epochs and set $R$ in the same way as [10].
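The two schedules can be sketched as follows (our own helpers implementing Equations (8)-(9) with a linear increasing function $f$; the warm-up length and the value of R here are illustrative, not the paper's exact settings).

```python
import math

def linear_beta(t: int, warmup_steps: int) -> float:
    """Linear annealing: beta rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, t / warmup_steps)

def cyclical_beta(t: int, total_steps: int, M: int = 5, R: float = 0.5) -> float:
    """Cyclical annealing (Equation (9)) with a linear f: tau is the position within
    the current cycle; beta ramps up during the first fraction R of the cycle and is
    held at 1 for the rest, then resets at the start of the next cycle."""
    cycle_len = math.ceil(total_steps / M)
    tau = ((t - 1) % cycle_len) / (total_steps / M)
    return tau / R if tau <= R else 1.0

# Example: 1000 total steps, 5 cycles
print([round(cyclical_beta(t, 1000), 2) for t in (1, 50, 100, 150, 200, 201)])
# -> [0.0, 0.49, 0.99, 1.0, 1.0, 0.0]
```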
4.1. Results
4.1.1. Intrinsic Evaluation
As described in the previous section, we compared our model, the VAE-Transformer, with VAE-LSTM, the LSTM language model, the Transformer-Decoder, and VAE-GPT using intrinsic evaluation. Table 2 shows that the VAE-Transformer outperforms the LSTM-based variational autoencoder, the LSTM language model, and the Transformer-Decoder in terms of reconstruction and representation learning. The proposed VAE-Transformer is better than VAE-LSTM regardless of the KL annealing strategy.
Table 2 also shows that using a pre-trained model with a VAE improves the performance of a language model, owing to knowledge transfer and the size of the pre-trained model. To illustrate the training behavior, Figure 9 shows the learning curves of cross entropy and KL divergence over epochs for the VAE-Transformer and VAE-LSTM during training on PTB. In terms of cross entropy, the VAE-Transformer is much lower than the VAE-LSTM and converges faster. Under each KL annealing strategy, the KL divergence of the VAE-Transformer is larger than that of the VAE-LSTM.
The experimental results indicate that the latent variables of our model contain more useful information about the input sentences than those of VAE-LSTM. Furthermore, in the reconstruction measure (PPL), the proposed model predicts the next word better than LSTM-LM and the Transformer-Decoder, which indicates that the representation z of the proposed model is informative. On PTB and WikiText2, VAE-GPT has the best reconstruction (PPL), while its KL divergence and MI are low. This implies that posterior collapse occurs naturally in those cases because of the complexity and power of the pre-trained model: taking advantage of pre-trained knowledge, it can reproduce outputs without relying on the latent variables. In contrast, the results on SST2 and AIR, which contain much less data, are quite different, and there the decoder of the variational autoencoder has to rely on the latent variables to reproduce the sentences.
4.1.2. Extrinsic Evaluation
On the SST2 dataset, the VAE-Transformer with linear annealing yielded the best result, but for consistency with the results of [2,10], we used the cyclical annealing models for the extrinsic evaluation. Similarly, for the pre-trained models, we fine-tuned just the encoder part of the variational autoencoder in the pre-trained language model and chose the best model using the validation data. The models were optimized with the AdamW optimizer [24] and cosine annealing [25] of the learning rate, restarting every 10 epochs for all datasets. We set the momentum parameters to 0.9 and 0.999 and the weight decay to 0.1. The batch size was 32, and we trained the classification model for the extrinsic evaluation for 20 epochs. The results are shown in Table 3. For all datasets, the proposed model achieves lower cross entropy than VAE-LSTM, which demonstrates that the latent variable of the proposed model is more informative and better suited to classification tasks than that of VAE-LSTM. In addition, VAE-GPT has the largest cross entropy, which implies that, with its powerful and complex decoder, its latent variables have low sentence representation capability because of posterior collapse.
Under both intrinsic and extrinsic evaluation, our model surpassed the other tested models. Overall, the experimental results show that the Transformer-based models are better than the RNN-based models in reconstruction and representation learning and are better at avoiding posterior collapse.
4.1.3. Ablation Study
We also checked the consistency of the proposed model by performing intrinsic-evaluation experiments with different latent dimensions. Table 4 shows the results for latent dimension 32, which are similar to those for latent dimension 64 in language modeling and representation learning. The absolute scores differ because of the model sizes; however, both sets of results show that the Transformer-based variational autoencoder achieves the best results on every dataset. These results indicate that our proposed model is superior to the other models in language modeling and representation learning.
Moreover, we examine the effect of the LSTM in the model by replacing it with a GRU. We show the results in Table 5 and Table 6, where the GRU-based variational autoencoder is denoted VAE-GRU. The results show that a VAE without annealing produces scores of almost zero in KL divergence and mutual information; this is due to posterior collapse, after which the latent variable becomes a useless representation. The GRU-based model achieves the best performance on the PTB and WikiText2 datasets, while the Transformer-based model is the best on the SST2 and AIR datasets in the language modeling task.
As for computation time, we measured the time required for forward passes during training, shown in Table 7. We conducted these experiments on PTB and WikiText2, whose volumes are substantially large. To check the parallelization of the models, we scaled up the models in terms of parameters. As shown in Table 7, when the model dimension is 128, the differences between VAE-Transformer, VAE-LSTM, and VAE-GRU are indiscernible, even though VAE-GRU is the fastest. However, when the dimension is 1024, the differences become discernible: even though VAE-Transformer has the largest number of parameters, as shown in Table 8, it is the fastest. This shows that Transformer-based variational autoencoders parallelize far better than RNN-based variational autoencoders.
In the classification task, the GRU-based model achieved the best results in terms of computation time. One should be wary of concluding from this that the proposed model is inferior to the GRU-based model: the GRU-based model is naturally smaller than the Transformer- and LSTM-based models. For example, the GRU-based model, which converges faster than the other models, has 12,526,010 parameters, while the Transformer-based model and the LSTM-based model have 12,763,706 and 12,756,922 parameters, respectively. We recommend that future research study LSTM variants of similar size and compare them on datasets they adequately fit.