1. Introduction
Feature engineering is the process of transforming raw data into a representation or feature vector suitable for training a machine learning model on a prediction problem. For decades, the traditional approach was to build feature extractors by hand, which required careful engineering and considerable domain expertise; the resulting code was problem-specific and had to be adjusted for each new dataset. Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [1] (p. 436). Deep learning improves the feature engineering process by automatically extracting useful and interpretable features; it eliminates the need for domain expertise and hard-coded feature extraction by learning high-level features from the data in a hierarchical manner. The main building blocks in the deep learning literature are restricted Boltzmann machines [2,3], autoencoders [4,5], convolutional neural networks [6,7], and recurrent neural networks [8].
Most of these architectures are used to learn a relationship between a single input source and the corresponding output. However, there are many domains where the representation to be learned is the correspondence between more than one input source and an output [9]. For instance, in many vision tasks the relevant information lies in the relationship between observations rather than in the content of a single observation.
These deep learning building blocks can be extended with gated connections, which allow them to learn relationships between at least two sources of input and at least one output. The defining feature of gated networks is precisely these gating connections: unlike networks whose layer-to-layer connections are linear, gated networks introduce higher-order interactions, in which the connection between two neurons x and y is modulated by the activity of a third neuron h.
Figure 1 illustrates two different roles for this three-neuron relationship: controlling the flow of information in the network or modeling multiplicative interactions between several inputs. In the first type of connection, the neuron h acts as a switch or gate that stops or allows the flow of information between x and y. In the second type, the connection implements a multiplicative relationship between x and h, whose values are multiplied before being projected to the output y by the synaptic connection. In this multiplicative interaction, we can say that the neuron h modulates the signal between x and y.
Despite the growing interest, the literature on gated networks is still sparse [9] (p. 2). The focus of this paper is the specific family of neural networks implementing a multiplicative gating relationship that is built on an RBM architecture. The concept of a gated restricted Boltzmann machine was first introduced in [10]. The basic idea of the gated model is to use the binary hidden units to learn the conditional distribution of one image (the output) given another image (the input). In [11], the authors revisit the problem and present a factorized alternative to the gated RBM. A gated RBM can also be considered a higher-order Boltzmann machine: as noted in [10] (p. 1474), Boltzmann machines that contain multiplicative interactions between more than two units are known in general as higher-order Boltzmann machines [12].
The remainder of this paper is organized as follows. In the following section, the RBM model and its popular training method, Contrastive Divergence (CD), are presented. In Section 3, the standard gated RBM model is presented, and a mechanism to reduce the number of its weights by projecting onto factor layers is discussed. Section 4 gives a brief overview of tensors and of the factorization known as Tucker decomposition. In Section 5, we introduce a multimodal tensor-based Tucker decomposition of the three-way parameter tensor in the gated RBM; we show that Tucker decomposition allows us to use fewer than the cubically many parameters implied by the three-way weight tensor, and we introduce a Contrastive Divergence-based training procedure for the gated RBM that reduces the number of model parameters and efficiently parameterizes its bilinear interactions. Experimental results and the corresponding discussion are presented in Section 6. Finally, conclusions are drawn in Section 7.
2. Restricted Boltzmann Machines
A restricted Boltzmann machine is a type of graphical model in which the nodes form a symmetric bipartite graph with binary observed variables v (visible nodes) and binary latent variables h (hidden nodes). Each visible unit (node) is connected to each hidden unit, but there are no visible-to-visible or hidden-to-hidden connections. Importantly, RBMs model the joint probability distribution of the visible and hidden units, which enables them to generate samples similar to the training data onto the visible layer. Models with this property are called generative models. A classic RBM model is illustrated in Figure 2.
An RBM is governed by an energy function. The energy of a joint configuration (v, h) of the visible layer and the hidden layer is given by
E(v, h) = − Σ_i b_i v_i − Σ_j c_j h_j − Σ_{i,j} v_i w_{ij} h_j,
where θ = {W, b, c} are the parameters of the model: W is the connection weight matrix between the visible layer and the hidden layer, and b and c are the biases of the visible layer and the hidden layer, respectively. E(v, h) is called the energy of the state (v, h). The joint probability distribution under the model is given by the Boltzmann distribution:
p(v, h) = (1/Z) exp(−E(v, h)/T),
where Z is a normalizing constant and T is the thermodynamic temperature (often set to 1). Z is called the partition function and is defined by summing over all possible visible and hidden configurations; it is therefore extremely hard to compute when the number of units is large. The partition function is given by:
Z = Σ_{v,h} exp(−E(v, h)/T).
Since there are no connections between two variables of the same layer, it is possible to derive an expression for p(v_i = 1 | h), the probability of a particular visible unit being on given a hidden configuration:
p(v_i = 1 | h) = σ(b_i + Σ_j w_{ij} h_j).
This is also true for a particular hidden unit given a visible configuration:
p(h_j = 1 | v) = σ(c_j + Σ_i w_{ij} v_i),
where σ(z) = 1/(1 + e^{−z}) is the logistic sigmoid function. This leads to block Gibbs sampling dynamics, used universally for sampling from RBMs.
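To make the sampling dynamics concrete, the following minimal PyTorch sketch implements the two conditionals above as one block Gibbs step. The tensor names (W, b, c) and shapes are assumptions chosen to match the notation of this section, not code from the cited works.

```python
import torch

def sample_h_given_v(v, W, c):
    """p(h_j = 1 | v) = sigma(c_j + sum_i v_i w_ij); returns probabilities and a binary sample."""
    p_h = torch.sigmoid(c + v @ W)        # v: (batch, I), W: (I, J), c: (J,)
    return p_h, torch.bernoulli(p_h)

def sample_v_given_h(h, W, b):
    """p(v_i = 1 | h) = sigma(b_i + sum_j w_ij h_j)."""
    p_v = torch.sigmoid(b + h @ W.t())    # h: (batch, J), b: (I,)
    return p_v, torch.bernoulli(p_v)

def gibbs_step(v, W, b, c):
    """One block Gibbs sweep: sample the whole hidden layer given v, then the whole visible layer given h."""
    _, h = sample_h_given_v(v, W, c)
    _, v_new = sample_v_given_h(h, W, b)
    return v_new, h
```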
RBM Training
Carreira-Perpinan and Hinton [13] showed that the derivative of the log-likelihood of the data under the RBM with respect to its parameters is:
∂ log p(v) / ∂ w_{ij} = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model,
where ⟨·⟩_data denotes the expectation over the data distribution and ⟨·⟩_model denotes the expectation over the model distribution. However, directly calculating the sums that run over all values of v and h in the second term of (4) leads to a computational cost that is, in general, exponential in the number of variables.
The model expectation can be approximated with samples from the model distribution. These samples can be obtained via Gibbs sampling, iteratively sampling all units of one layer at once given the other layer, using (2) and (3) alternately. However, this requires running the Markov chain for an infinite time to ensure convergence to a stationary state, making it an unfeasible solution.
Obtaining an unbiased sample of ⟨v_i h_j⟩_model is extremely difficult. Hinton [3] approximates the second term of the true gradient in (4) using an approximation of the derivative called Contrastive Divergence: the goal of CD is to replace the model average with samples obtained after running k steps of Gibbs sampling starting from each data sample. This is illustrated in Figure 3. A typical value used in the literature is k = 1. This way of updating the parameters has become the standard way of training RBMs. Although it has proven to work well in practice, CD does not yield the best approximation of the log-likelihood gradient [
13,
14]. There has been much research dedicated to better understanding this approach and the reasoning behind its success [
13,
14,
15], leading to many variations being proposed from the perspective of improving the Markov chain Monte Carlo approximation to the gradient, namely, Persistent CD [
16], Fast Persistent CD [
17], and Parallel Tempering [
18].
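The CD-k update described above can be sketched in a few lines by reusing the sampling helpers from the previous listing. The learning rate, the mini-batch averaging, and the use of probabilities rather than samples for the sufficient statistics are common conventions assumed here for illustration.

```python
def cd_k_update(v0, W, b, c, k=1, lr=1e-2):
    """Approximate the log-likelihood gradient with CD-k and apply one in-place update."""
    p_h0, h = sample_h_given_v(v0, W, c)          # positive phase statistics
    vk = v0
    for _ in range(k):                            # k steps of block Gibbs sampling
        _, vk = sample_v_given_h(h, W, b)
        p_hk, h = sample_h_given_v(vk, W, c)
    batch = v0.shape[0]
    W += lr * (v0.t() @ p_h0 - vk.t() @ p_hk) / batch   # <v h>_data - <v h>_model
    b += lr * (v0 - vk).mean(dim=0)
    c += lr * (p_h0 - p_hk).mean(dim=0)
```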
3. Gated RBM
Gated RBMs are a natural extension of RBMs in which the gating idea is applied via the units of a hidden layer connecting (gating) the neurons of two other layers (input and output). As in the RBM model, the gated RBM is governed by an energy function. Memisevic and Hinton [10] propose the following three-way energy function, which captures all possible correlations among the components of the x (input), y (output), and h (hidden) layers:
E(y, h; x) = − Σ_{i,j,k} w_{ijk} x_i y_j h_k,
where i, j, and k index the units in the input, output, and hidden layers, respectively; x_i is the binary state of input pixel i; y_j is the binary state of output pixel j; and h_k is the binary state of hidden unit k.
Figure 4 shows a fully connected gated RBM. The components w_{ijk} of its three-way interaction tensor connect units x_i, y_j, and h_k and learn to weight the importance of the possible correlations given some training data. This type of multiplicative interaction among the input, output, and hidden units leads to a type of higher-order Boltzmann machine that retains the computational benefits of RBMs, such as being amenable to Contrastive Divergence training and allowing for efficient inference schemes that use alternating Gibbs sampling [11] (p. 1474).
In practice, to be able to model affine and not just linear dependencies, it is useful to add biases to the output and hidden units, which amounts to adding the terms −Σ_k c_k h_k and −Σ_j b_j y_j to (5), where c_k and b_j model the base rates of activity of the hidden and output units, respectively. In general, a higher-order Boltzmann machine can also contain bias terms for the input units, but following [11], we do not use these.
The negative energy −E(y, h; x) captures the compatibility between the input, output, and hidden units. As in the RBM model, we can use this energy function to define the joint distribution p(y, h | x) over output and hidden variables by exponentiating and normalizing:
p(y, h | x) = (1/Z(x)) exp(−E(y, h; x)),
where Z(x) is a normalizing constant that depends on the input image x. To obtain the distribution over output images given the input, we marginalize over the hidden units and get p(y | x). This marginalization over the hidden units is known in the literature as the free energy. Note that neither Z(x) nor p(y | x) can be computed exactly, since both contain sums over an exponentially large number of instances of the hidden units and, in the case of Z(x), of the output units. However, we do not actually need to compute any of these quantities to perform either inference or learning, as we shall see in the next section.
It is important to note that the normalization step in (8) is performed over y and h only; thus, it defines the conditional distribution p(y, h | x) rather than the joint p(x, y, h). This is done deliberately to free the model from many of the independence assumptions that a fully generative model would need to make, hence simplifying inference and learning [10].
Inference then consists of guessing the transformation, or equivalently its encoding h, from a given pair of observed images x and y. Since the energy function does not contain interactions between any pairs of output units or pairs of hidden units, it is possible to derive a closed-form expression for p(h_k = 1 | x, y), the probability of a particular hidden unit being on when an input–output image pair is given:
p(h_k = 1 | x, y) = σ(Σ_{i,j} w_{ijk} x_i y_j + c_k).
This is also true for the output units when input and hidden units are given:
p(y_j = 1 | x, h) = σ(Σ_{i,k} w_{ijk} x_i h_k + b_j),
where σ(z) = 1/(1 + e^{−z}). Note that, in practice, the σ function can be replaced by some other non-linear activation function. Regardless of the activation function, these models are called bilinear because, if one input is held fixed, the output is linear in the other input.
Consider now the task of predicting the hidden layer h given the input x and output y units; in such a multiplicative network, this consists of computing all the values h_k of h using (10), which gives the layer-wise expression in (12). Alternatively, one may compute y given the input x and the hidden units h using (11), which gives (13).
Memisevic and Hinton [10] point out that this type of three-way model can be interpreted as a mixture of experts. Note from (5) that, in the way the energy is defined, the importance that each hidden unit h_k attributes to the correlatedness (or anti-correlatedness) of a particular pair x_i, y_j is determined by w_{ijk}.
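To make the inference step concrete, the sketch below computes the two layer-wise conditionals of a fully parameterized gated RBM with a dense three-way tensor W of shape (I, J, K). The variable names and the bias terms c and b are assumptions for illustration, not the authors' implementation.

```python
import torch

def gated_p_h_given_xy(x, y, W, c):
    """p(h_k = 1 | x, y) = sigma(sum_ij w_ijk x_i y_j + c_k) for a batch of pairs."""
    act = torch.einsum('bi,bj,ijk->bk', x, y, W)    # batched three-way contraction
    return torch.sigmoid(act + c)

def gated_p_y_given_xh(x, h, W, b):
    """p(y_j = 1 | x, h) = sigma(sum_ik w_ijk x_i h_k + b_j)."""
    act = torch.einsum('bi,bk,ijk->bj', x, h, W)
    return torch.sigmoid(act + b)
```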
To train the probabilistic model, we can use the same principle as Contrastive Divergence in the RBM model: we maximize the average conditional log-likelihood for a set of training pairs (x, y). The derivative of the (negative) log probability with respect to the weight parameter w_{ijk} is given by the difference of two expectations (Equation (14)), where ⟨·⟩_z denotes the expectation with regard to the variable z. The expectation in the first term of (14) is over the posterior distribution over the hidden units, and it can be computed efficiently using (12). The expectation in the second term of (14) is over all possible output/hidden instantiations and is intractable. However, because of the conditional independence of h given x and y and of y given x and h, we can easily sample from the conditional distributions p(h | x, y) and p(y | x, h). Using Gibbs sampling with Equations (12) and (13), respectively, for the hidden and output layers, we can approximate the intractable term.
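As an illustration of the gradient estimate, a single alternating Gibbs step (CD-1) can be used to form the two three-way statistics whose difference drives the update in (14). This is a schematic rendering under the assumptions of the previous sketch, with x held fixed as in the conditional model.

```python
def gated_cd1_gradient(x, y, W, b, c):
    """CD-1 estimate of <x y h>_data - <x y h>_model, the statistic used to update W."""
    ph0 = gated_p_h_given_xy(x, y, W, c)                     # positive phase: posterior over h
    h0 = torch.bernoulli(ph0)
    y1 = torch.bernoulli(gated_p_y_given_xh(x, h0, W, b))    # negative phase: reconstruct y ...
    ph1 = gated_p_h_given_xy(x, y1, W, c)                    # ... and re-infer h
    pos = torch.einsum('bi,bj,bk->ijk', x, y, ph0)           # <x_i y_j h_k> under the data
    neg = torch.einsum('bi,bj,bk->ijk', x, y1, ph1)          # <x_i y_j h_k> under the model sample
    return (pos - neg) / x.shape[0]
```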
Factorized Gated RBM
Memisevic and Hinton [11] propose a way of reducing the number of weights that consists of projecting the x, y, and h layers onto smaller layers before performing the product between them. Given their multiplicative role, these smaller layers are called factor layers.
The three-way tensor is constrained to use these projections through three factor layers of the same size F, as illustrated in Figure 5. Moreover, the weights w_{ijk} are restricted to follow a specific form, namely a sum over F factors of products of three pairwise weights, one per layer:
w_{ijk} = Σ_f w^x_{if} w^y_{jf} w^h_{kf}.
With this constraint, the three factor matrices are of respective size I × F, J × F, and K × F (where I, J, and K are the numbers of input, output, and hidden units); thus, the total number of weights is F(I + J + K), which is quadratic instead of cubic in the size of the input and factor layers.
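To illustrate the counting argument, the snippet below reconstructs the constrained tensor from three hypothetical factor matrices and compares the number of parameters; the sizes are arbitrary examples, not values from the paper.

```python
import torch

I, J, K, F = 169, 169, 100, 200                 # example layer and factor sizes (assumed)
Wx, Wy, Wh = torch.randn(I, F), torch.randn(J, F), torch.randn(K, F)

# w_ijk = sum_f Wx[i, f] * Wy[j, f] * Wh[k, f]
W_full = torch.einsum('if,jf,kf->ijk', Wx, Wy, Wh)

unconstrained = I * J * K                        # cubic number of weights
factorized = F * (I + J + K)                     # quadratic number of weights
print(W_full.shape, unconstrained, factorized)   # (169, 169, 100), 2856100, 87600
```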
4. Tucker Decomposition
Since tensors, specifically three-way tensors, and their corresponding Tucker decomposition are at the core of this study, a brief overview of the subject is presented here. First introduced by Tucker [19] and refined in subsequent articles by Levin [20] and Tucker et al. [21], the Tucker decomposition is a form of higher-order Principal Component Analysis. It factorizes a tensor into a (usually smaller) core tensor and a set of factor matrices, one factor matrix along each mode. In the three-way case where X is of size I × J × K, we have
X ≈ G ×_1 A ×_2 B ×_3 C,
where the operator ×_n denotes the mode-n multiplication of a tensor by a matrix in mode n. A, B, and C are known as the factor matrices and can be thought of as the principal components in each mode. The tensor G is called the core tensor, and its entries show the level of interaction between the different components. The Tucker decomposition of X is usually summarized as:
X ≈ [[G; A, B, C]].
A comprehensive discussion of Tucker decomposition and tensor analysis is available in Kolda [22]. If G is the same size as X, the Tucker decomposition is simply a change of basis. More often, we are interested in using a change of basis to compress X: if the core dimensions P, Q, and R are smaller than I, J, and K, the core tensor G can be thought of as a compressed version of X.
For some computations presented in this document, it is important to be able to rearrange the indices of a tensor so that it can be represented as a matrix, and vice versa. Matricization, also known as unfolding or flattening, is the process of reordering the elements of a tensor (an N-way array) into a matrix [23]. For instance, an I × J × K tensor can be rearranged as an I × (JK) matrix or an (IJ) × K matrix, and so on. The matricized forms (one per mode) of (16) are:
X_(1) ≈ A G_(1) (C ⊗ B)^⊤,  X_(2) ≈ B G_(2) (C ⊗ A)^⊤,  X_(3) ≈ C G_(3) (B ⊗ A)^⊤,
where ⊗ denotes the Kronecker product and X_(n) denotes the mode-n unfolding of X.
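For reference, the Tucker reconstruction in (16) and a matricized check can be written in a few lines of PyTorch; all names here are illustrative, and note that PyTorch's row-major reshape orders the unfolding columns with the last mode varying fastest, so the Kronecker factors appear in the opposite order to the convention used above.

```python
import torch

I, J, K = 12, 10, 8
P, Q, R = 4, 3, 2
G = torch.randn(P, Q, R)                                    # core tensor
A, B, C = torch.randn(I, P), torch.randn(J, Q), torch.randn(K, R)

# X = G x_1 A x_2 B x_3 C, elementwise: x_ijk = sum_pqr g_pqr a_ip b_jq c_kr
X = torch.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# mode-1 unfolding check (row-major column ordering)
X1 = A @ G.reshape(P, Q * R) @ torch.kron(B, C).t()
print(torch.allclose(X.reshape(I, J * K), X1, atol=1e-5))   # True
```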
5. Materials and Methods
In this section, we propose a strategy for reducing the number of parameters in a gated RBM. First, we refactor the gated RBM model by applying a multimodal tensor-based Tucker decomposition to its three-way weight tensor. Then, we show that, by using Tucker decomposition, we can use fewer than the cubically many parameters implied by the model. Finally, we introduce a Contrastive Divergence-based training procedure for the Tucker-decomposed gated RBM, which efficiently parameterizes its bilinear interactions.
5.1. Decomposing the Three-Way Tensor in a Gated RBM
The central idea of this research is to represent the required three-way interaction tensor in the gated RBM model with far fewer parameters through its Tucker decomposition. The energy function in a gated RBM (Equation (5)) captures all possible correlations among the components of the x (input), y (output), and h (hidden) layers. In this function, the parameter tensor defines a three-way interaction that learns the importance of the correlations between these layers. However, despite its appealing modeling power, a fully parametrized gated RBM suffers from an explosion in the number of parameters and quickly becomes intractable, because the size of the full tensor is prohibitive for the dimensions commonly used for textual, visual, or output spaces.
As we will see, it is possible to use far fewer parameters by factorizing the multi-way interaction tensor via Tucker decomposition. We can plug the Tucker decomposition of Equation (16) into the energy function of the gated RBM, Equation (5). Then, the energy of a joint configuration of the visible (input/output) and hidden units is defined as:
Using the distributive law, this can be rewritten as:
We can drop subindices for clarity and get:
It is possible to simplify the notation in (
21) if we define:
Then the energy function in Equation (
21) is given by:
5.2. Interpretation of the Refactored Model: Dimensionality Reduction
Let us consider the three-way tensor of shape I × J × K and its corresponding Tucker decomposition presented in Figure 6. As we parametrize the weights of the three-way tensor with its Tucker decomposition, we are now able to separate it into four components, each having a specific role in the gated RBM model. The factor matrices A and B project the input (x) and output (y) images into spaces of respective dimension P and Q. The core tensor G, whose shape is P × Q × R, is used to model the interactions between the input and output image projections. Finally, the factor matrix C projects the scores of the pair embedding into a space of dimension R.
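A sketch of how the refactored energy can be evaluated without ever forming the full three-way tensor: each layer is projected through its factor matrix, and the three projections are contracted with the core tensor. Names and shapes are assumptions consistent with the notation of Section 4, and bias terms are omitted.

```python
import torch

def tucker_negative_energy(x, y, h, G, A, B, C):
    """-E(y, h; x), up to bias terms, for a batch of configurations.

    x: (batch, I), y: (batch, J), h: (batch, K)
    A: (I, P), B: (J, Q), C: (K, R), G: (P, Q, R)
    """
    xt = x @ A        # (batch, P) projection of the input
    yt = y @ B        # (batch, Q) projection of the output
    ht = h @ C        # (batch, R) projection of the hidden layer
    return torch.einsum('bp,bq,br,pqr->b', xt, yt, ht, G)
```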
Moreover, if the core tensor has the same shape as the original tensor, the Tucker decomposition is simply a change of basis. However, in our case, we are interested in using a change of basis to compress the weight tensor. If P, Q, and R are smaller than I, J, and K, the core tensor can be thought of as a compressed version of the original tensor. Note that the dimensions of the factor matrices A, B, and C result from the n-mode products between the original tensor and the core tensor: the factor matrix A has dimensions I × P (from the i-mode product), the factor matrix B has dimensions J × Q (from the j-mode product), and the factor matrix C has dimensions K × R (from the k-mode product). By constraining P, Q, and R to be smaller than I, J, and K, we use a lower number of components for each of the three modes while at the same time linking these components to each other by means of the three-way core tensor.
Again, consider the three-way tensor in Figure 6, whose cardinality in each mode is given by I, J, and K. For the layer sizes considered in this example, the number of free parameters in the full tensor is ∼8.39 × 10^6. It is easy to see that having such a number of free parameters is a problem both for memory and for computing costs. In contrast, if we apply Tucker decomposition to this three-way tensor using a suitably smaller core tensor, the number of free parameters drops to ∼1.05 × 10^6, given by the sum of the parameters of the core tensor and the three factor matrices. By applying Tucker decomposition, we therefore reduce the dimensionality of the model. Note that the amount of compression is determined by the ranks of the core tensor.
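The counting argument can be reproduced directly; the layer sizes and core ranks below are hypothetical values chosen only to show the order of magnitude, not the exact configuration of Figure 6.

```python
I = J = 256; K = 128          # hypothetical layer sizes
P = Q = R = 100               # hypothetical core tensor ranks

full_params = I * J * K                                   # 8,388,608 free parameters
tucker_params = P * Q * R + I * P + J * Q + K * R         # 1,064,000 free parameters
print(full_params, tucker_params, round(full_params / tucker_params, 1))
```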
Dimensionality reduction has long been an important technique for data representation. It reduces the space complexity of the underlying model, so that the model is more stable when fitted, requires fewer parameters, and consequently becomes easier to interpret.
5.3. Training the Refactored Gated RBM
To train the refactored gated RBM, we can maximize its average conditional log-likelihood for a set of training pairs (x, y). By substituting Equations (7) and (9), the derivative of the negative log probability with regard to any element of the parameter tensor is given by the difference of two expectations (Equation (23)), where ⟨·⟩_z denotes the average with regard to the variable z. By substituting the reparametrized energy function from (21) into (23), we obtain Equation (24), which gives the derivative of the (negative) log probability with respect to any parameter of the refactored weight tensor: the core tensor G and the three factor matrices A, B, and C.
As in the unfactored gated RBM model presented in Section 3, the derivative of the (negative) log probability with respect to any parameter of the Tucker-refactored gated RBM is given by the difference of two expectations. The first expectation in (24) is over the posterior distribution over the hidden units. The second expectation in (24) is over all possible output/hidden instantiations and is intractable.
Note that the first term in (24) amounts to inferring the transformation (encoding) h from a given pair of observed inputs x and y, as considered in Equation (12). It is possible to plug the Tucker-refactored energy function from (22) into (12) (bias term dropped for clarity), which becomes:
In an analogous way, we may consider the task of computing y from an input image x and a given fixed transformation h, as considered in Equation (13). By plugging the Tucker-decomposed energy function from (22) into (13) (bias term dropped for clarity), we get, when the input and hidden units are given:
Let us now focus again on Equation (24). The second term, also known as the model expectation, is an expectation over all possible instances of the output/mapping units and is intractable. However, similar to the bipartite structure of an RBM, the tripartite structure of a gated RBM facilitates Gibbs sampling. With this in mind, we also consider the task of computing the input units x given the output y and the hidden units h.
Then, Gibbs sampling suggests itself as a way to approximate the intractable term in (24). Because of the conditional independence of h given x and y, of y given x and h, and of x given y and h, we can easily sample from the conditional distributions p(h | x, y), p(y | x, h), and p(x | y, h) using (25)–(27), respectively.
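The three conditionals used in the alternating Gibbs sweep can be written directly against the Tucker parameterization (core tensor G and factor matrices A, B, C), with bias terms omitted as in the text. This is a reconstruction under those assumptions rather than the authors' code.

```python
import torch

def p_h_given_xy_tucker(x, y, G, A, B, C):
    """Contract out the input and output modes, then map to hidden units through C."""
    act = torch.einsum('bp,bq,pqr,kr->bk', x @ A, y @ B, G, C)
    return torch.sigmoid(act)

def p_y_given_xh_tucker(x, h, G, A, B, C):
    """Contract out the input and hidden modes, then map to output units through B."""
    act = torch.einsum('bp,br,pqr,jq->bj', x @ A, h @ C, G, B)
    return torch.sigmoid(act)

def p_x_given_yh_tucker(y, h, G, A, B, C):
    """Contract out the output and hidden modes, then map to input units through A."""
    act = torch.einsum('bq,br,pqr,ip->bi', y @ B, h @ C, G, A)
    return torch.sigmoid(act)
```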
Given the tripartite structure of the gated RBM, it is possible to perform three-way alternating Gibbs sampling. This scheme of optimizing an undirected graphical model is known as Contrastive Divergence. In this research, we perform a single Gibbs iteration when approximating the negative phase.
Using this Contrastive Divergence approach with Equations (
25)–(
27), we can make use of a machine learning library that supports reverse-mode automatic differentiation such as PyTorch.
PyTorch provides two high-level features: tensor computation with strong GPU acceleration and automatic differentiation for all operations on tensors. Conceptually, this method of automatic differentiation records a graph of all the operations that created the data as the operations are executed. The leaves of the graph are the input tensors, and the roots are the output tensors. PyTorch traces this graph from roots to leaves, automatically computing the gradients using the chain rule. From a computational point of view, training a model consists of two phases: a forward pass to compute the value of the loss function and a backward pass to compute the gradients of the learnable parameters. With this in mind, we use the Contrastive Divergence approach to build the forward pass and generate the graph. This process is summarized in Algorithm 1.
Algorithm 1: Forward pass
Input: a training pair (x, y); k, the number of steps for CD learning.
Output: the input vector and the output vector obtained once CD-k is applied.
1. Calculate the positive phase by performing three-way Gibbs sampling: initialize the input and output layers with the training pair, calculate p(h | x, y) using Equation (25), and sample the hidden states.
2. Calculate the negative phase. For each of the k steps:
(a) Calculate p(y | x, h) using Equation (26). Sample the states of the output layer.
(b) Calculate p(x | y, h) using Equation (27). Sample the states of the input layer.
(c) Calculate p(h | x, y) using Equation (25). Sample the states of the hidden layer.
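A possible PyTorch rendering of Algorithm 1, built on the factorized conditionals sketched above; the variable names, the use of Bernoulli samples, and the returned quantities are illustrative assumptions.

```python
import torch

def forward_cd(x0, y0, G, A, B, C, k=1):
    """Positive phase followed by k steps of three-way Gibbs sampling (negative phase)."""
    h = torch.bernoulli(p_h_given_xy_tucker(x0, y0, G, A, B, C))       # positive phase
    xk, yk = x0, y0
    for _ in range(k):                                                 # negative phase, CD-k
        yk = torch.bernoulli(p_y_given_xh_tucker(xk, h, G, A, B, C))   # step (a), Eq. (26)
        xk = torch.bernoulli(p_x_given_yh_tucker(yk, h, G, A, B, C))   # step (b), Eq. (27)
        h = torch.bernoulli(p_h_given_xy_tucker(xk, yk, G, A, B, C))   # step (c), Eq. (25)
    return xk, yk
```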
When computing the forward pass, PyTorch simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient. Once the forward pass is completed, this graph is evaluated in the backward pass to compute the gradients. To build the backward pass, we use the concept of free energy as presented in (9). Under the gated RBM, the probability of observing a configuration of the output units y given the input units x can be obtained by marginalizing out the hidden units. This computation is called the free energy and is given by:
Generally speaking, RBMs and gated RBMs belong to the more general class of energy-based models (EBMs) [24]. EBMs have probabilistic equations of the following form:
p(x) = (1/Z) exp(−G(x)),
where x denotes the observed variables, Z is the normalization term, and G is the energy function. Since (28) has the same form as the equation describing an EBM (29), we can minimize the free energy function to obtain maximum (log-)likelihood.
Unfortunately, the free energy in (28) is intractable to compute, since Z(x) involves a sum over all possible settings of the output and hidden units. However, the free energy can be computed up to a constant, which is useful for scoring observations under a fixed model. First, we notice that Equation (28) involves a sum over all possible configurations of the hidden units; however, the hidden units are binary, so we only need to consider two possible states for each unit. This observation leads to a tractable form of the free energy, given in Equation (30).
For scoring observations under a model, we can ignore the partition function Z. Finally, the backward pass can be computed using the free energy function in its scoring version (without the partition function Z), as shown in Algorithm 2.
Algorithm 2: Backward pass
Input: the input and output vectors at step 0 and the input and output vectors obtained once CD-k is applied.
1. Calculate the free energy of the step-0 (data) pair using Equation (30).
2. Calculate the free energy of the CD-k (model) pair using Equation (30).
3. Calculate the difference between the two free energies.
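A sketch of the scoring version of the free energy and the loss used in Algorithm 2. The softplus form follows from summing the Boltzmann factors over the two states of each binary hidden unit; bias terms are omitted and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def free_energy_tucker(x, y, G, A, B, C):
    """Free energy of (x, y) up to the partition function, with bias terms omitted."""
    act = torch.einsum('bp,bq,pqr,kr->bk', x @ A, y @ B, G, C)   # hidden pre-activations
    return -F.softplus(act).sum(dim=1)                           # -sum_k log(1 + exp(act_k))

def cd_loss(x0, y0, xk, yk, G, A, B, C):
    """Backward-pass loss of Algorithm 2: free energy of the data pair minus that of the CD-k pair."""
    return (free_energy_tucker(x0, y0, G, A, B, C)
            - free_energy_tucker(xk, yk, G, A, B, C)).mean()
```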
6. Results and Discussion
To illustrate the performance and viability of the model, we conducted experiments on pairs of shifted random binary images using the dataset provided on the accompanying website of Memisevic and Hinton [
11].
We trained the model on pairs of transformed image patches, forcing the hidden variables to encode the transformations. The goal of this experiment is to investigate what form the model weights take on when trained on affine transformations. The dataset consists of 10,000 binary image patch pairs of 13 × 13 pixels each, where the output image in each pair is a transformed version of the input image; the input image is shifted by one pixel in a random direction in each sample. As stated in Memisevic and Hinton [11] (p. 1482), there is no structure at all in the images themselves, which are composed of random pixels. The only source of structure in the data comes from the way the images transform.
Figure 7 shows three different samples of the binary images in the upper row; the lower row shows the shifted binary samples. This dataset was generated from a set of initial images in which each pixel is turned on randomly with probability 0.1. These initial images are used as input for the input layer x of the gated RBM model. Then, a random direction is chosen from the set {up, down, left, right, up-left, up-right, down-left, down-right, no shift}, and each initial image is shifted by one pixel to create the output images. The newly appearing edges are filled randomly and independently, as before, with probability 0.1. The shifted images are used as input for the output layer y of the gated RBM model.
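The shifted-pairs data described above can be generated in a few lines; the patch size, shift set, and 0.1 activation probability follow the description in the text, while the function name and implementation details are our own.

```python
import torch

SHIFTS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1), (0, 0)]

def make_shift_pair(size=13, p_on=0.1):
    """Return one training pair: a random binary patch and a copy shifted by one pixel."""
    x = (torch.rand(size, size) < p_on).float()
    dr, dc = SHIFTS[torch.randint(len(SHIFTS), (1,)).item()]
    y = torch.roll(x, shifts=(dr, dc), dims=(0, 1))
    # newly exposed edges are refilled randomly with the same probability
    if dr > 0: y[:dr, :] = (torch.rand(dr, size) < p_on).float()
    if dr < 0: y[dr:, :] = (torch.rand(-dr, size) < p_on).float()
    if dc > 0: y[:, :dc] = (torch.rand(size, dc) < p_on).float()
    if dc < 0: y[:, dc:] = (torch.rand(size, -dc) < p_on).float()
    return x.flatten(), y.flatten()   # 169-dimensional input and output vectors
```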
For this task, we trained a gated RBM with the proposed training algorithm from Section 5.3. We performed parameter exploration via grid search over the following parameters: the core tensor ranks, the number of units in the hidden layer, and the learning rate. The number of hidden units ranged from 64 to 144. We did not identify significant changes in the learned filters as a function of the number of hidden units or the core tensor ranks; the reason is that much of the interaction between the input and output layers is captured by the core tensor. Several learning rates and three core tensor rank configurations were also evaluated.
In Figure 8, we display the filters learned at different stages by the input, output, and hidden layers as a qualitative assessment of the trained gated RBM model. The filters displayed in Figure 8 correspond to a model with 144 units in the hidden layer and one particular choice of core tensor ranks and learning rate. In the figure, each column of a factor matrix is rearranged and displayed as a square; for example, if the factor matrix A is of shape 169 × 120, then 120 squares of 13 × 13 pixels are displayed. From iteration 0 in Figure 8, we observe that the weights of each layer are randomly initialized. With each iteration, the weights become more structured without any supervision in the model. Note that the filters presented at iteration 211 in Figure 8 do not fully resemble the filters found in [11], although the same dataset of random transformations is used. The reason is that the factorization of the three-way tensor proposed in this research is different from the one presented in [11]. While Memisevic and Hinton [11] project each mode of the three-way tensor in the gated RBM onto smaller layers, the current factorization involves a core tensor that models the interactions between the input and output image projections, modulated via the gated connections of the hidden layer.
In fact, the core tensor also learns filters, as presented in Figure 9. Moreover, the filters learned by each layer of the model show a correspondence with the filters in the three-way core tensor. In other words, each unit in the model is connected to the core tensor with variable strengths, which allows the units to detect specific changes in frequency, orientation, and phase shift in the data. In Figure 9, we show the filters learned by a model with a smaller core tensor, displaying each frontal slice of the core tensor. For simplicity, we present the frontal slices for this smaller tensor, but the same behavior was observed regardless of the tensor shape.
We also confirmed that the learning rate affects convergence. The learning rate controls how much the model parameters are adjusted with respect to the gradient of the energy loss. Figure 10 shows the energy loss during training for three different learning rate configurations. We limit the number of epochs to 10; the number of units in the hidden layer is 144, and the core tensor ranks are kept fixed. As can be observed in the figure, the largest learning rate provides the fastest convergence, in less than one epoch. On the other hand, the smallest learning rate decreases the loss steadily in each epoch but does not reach convergence. In each case, we observed that the corresponding model learns filters at different stages. In general, a smaller learning rate requires more training epochs, given the smaller changes made to the model parameters in each update; however, if the learning rate is too small, the learning process can get stuck and fail to converge. On the other hand, a larger learning rate results in rapid changes and requires fewer training epochs, but too large a learning rate can cause the model to converge to a sub-optimal solution. The rate at which learning takes place is an important hyper-parameter and will vary depending on the application of the model.
To gain insight into the way the core tensor affects the learning speed, we also trained the gated RBM model on the random-shifts dataset under various configurations of the core tensor while holding the learning rate and the number of hidden units constant. The gated RBM evaluated has 169 units in its input and output layers and 169 units in its hidden layer. With this configuration, the original weight tensor has a shape of 169 × 169 × 169. The core tensor was tested with three rank configurations, the last of which provides no compression since it has the same shape as the original tensor. For simplicity, we only considered cubical core tensors. The models were trained with the same learning rate for 10 epochs and were not stopped via early stopping; the idea is to isolate the effect of the core tensor configurations.
Figure 11 shows the energy loss when training under the three core tensor configurations. Note that, in each case, the energy loss decreases steadily, although with different gradients. We should remember that, in a gated RBM as well as in an RBM, the energy function plays a role analogous to that of a cost function. In fact, the configuration with the smallest core tensor shows the largest energy decrease, while the core tensor configuration with exactly the shape of the original tensor shows the smallest energy decrease; the reason is that, in the latter case, the core tensor does not provide any compression but only functions as a change of basis. In addition, note that different configurations of the core tensor explore different configurations of the energy loss, which is explained by the fact that the core tensor modulates the level of interaction between the components of each layer, resulting in different energy configurations.
In
Figure 12, we show the training time (in seconds) for the same core tensor configurations. As was expected, this figure confirms that the training speed is a function of the core tensor dimensionality. Although it could be inferred from
Figure 11 and
Figure 12 that the best model configuration is the one with the smallest core tensor, it should be noted that the core tensor ranks should be calibrated according to the specifics of the problem to solve and evaluated according to the desired performance of the model.
As a last evaluation, we compared the Tucker-refactored gated RBM proposed in this research against the unfactored model presented in [10], which uses a full three-way tensor. The filters learned by the input, output, and hidden layers, in combination with the filters learned by the core tensor, are highly structured and are in fact very similar to the ones applied to the output images in [10], as shown in Figure 13b; the latter produce a canvas-like effect on the image. In both cases, the filters are learned in an unsupervised fashion. We should underline that there is no code available to accompany the research in [10], so we implemented their unfactored model in PyTorch using the update rules provided in the paper.
Although the filters produced by the Tucker-refactored and unfactored models may be similar, the two training approaches have different performance characteristics. To better understand this, we trained both models on the same random-shifts dataset while holding all other variables constant. The input and output images have a size of 13 × 13 pixels each, which sets the number of neurons in the input and output layers to 169 in both models. The number of hidden units (121) and the learning rate are kept the same in both implementations. Figure 14 shows the energy loss for the Tucker-refactored model, while Figure 15 shows the energy loss for the unfactored model.
The only difference between the two models is that the unfactored gated RBM uses the full tensor, whereas the Tucker-refactored model factorizes the full tensor into a core tensor G, the input and output factor matrices A and B, and the hidden factor matrix C. Note that the unfactored gated RBM starts with a much higher energy configuration, whereas the Tucker-refactored model starts at a much smaller energy configuration as a result of the core tensor modulating the interactions of the three layers. The relationship between the core tensor configuration and the resulting energy configuration explored is also observable in
Figure 14. From
Figure 14 and
Figure 15, we see that the Tucker refactored gated RBM reaches convergence in fewer iterations than the unfactored gated RBM while having the same learning rate and hidden units.
Finally, in
Figure 16, we show a comparison of the training time (in seconds) for both the Tucker-refactored and the unfactored gated RBMs. It is easy to see that the unfactored model, which uses the full tensor, takes much longer to train than the Tucker-decomposed model proposed in this research. This is because the core tensor reduces the number of free parameters in the model while maintaining its learning capacity.
7. Conclusions
The multimodal tensor-based Tucker decomposition presented in this research has the useful property that it keeps the conditional independence structure of the gated RBM model intact. We take advantage of this structure by developing a Contrastive Divergence-based training procedure used for inference and learning.
In this paper, we combined Tucker decomposition with a gated RBM and showed how the resulting model allows us to obtain highly structured image filter pairs when trained on transformations of images. There is some resemblance between our approach and the bilinear model for the Visual Question Answering (VQA) task proposed in [
25]; however, the problems solved and the learning methods presented are quite different. Despite its appealing modeling power, the literature on gated RBMs is still scarce. The literature review on gated RBMs revealed that almost all the publications on the subject present only different applications for the model, namely texture modeling [
26], classification [
27], and rotation representations [
28]. In that sense, this research contributes an alternative method for training the gated RBM.
As we have seen, Tucker decomposition is a powerful tool for dimensionality reduction of multidimensional data. Implementing it in the gated RBM adds a new hyperparameter to the model: the multimodal shape of the core tensor. In fact, the resulting model allows us to explicitly control the model complexity and to choose an accurate and interpretable factorization of the learnable parameters. One important property of tensor decomposition is that the number of parameters in the core tensor and factor matrices is usually much smaller than the number of parameters in the original tensor. Using the compression ratio (the number of elements before and after tensor decomposition) and the approximation error caused by the decomposition, we can evaluate which decomposition should be performed.
The time complexity of one-step inference in the Tucker-decomposed gated RBM is governed by the core tensor ranks P, Q, and R, which correspond to the dimensionality of each mode of the core tensor and directly determine the modeling complexity allowed for each modality. Note that the core tensor can be fixed to a small rank regardless of the dimensionality of each layer in the weight tensor. This is in contrast to the time complexity of inference in a fully parameterized gated RBM, which is governed by the dimensionality of the input, output, and hidden layers and is not efficient in terms of memory for large inputs.
Strictly speaking, we could also select the dimensions P, Q, and R of each modality to be equal to or greater than I, J, and K; this would lead, respectively, to a mere change of basis or to an explosion in the number of free parameters. There are several interesting directions for future work. In this research, we used a fixed shape for the core tensor; fine-tuning this new hyperparameter needs to be addressed in future research. A possible idea is to measure the approximation error and select the smallest multimode dimensionality that meets a selected threshold. Another idea is to use the tripartite structure of the gated RBM to model discriminative tasks in which an image is provided on layer x and its corresponding target on layer y.