1. Introduction
In numerous practical situations, we encounter probability distributions that are difficult or impossible to compute, especially when the distribution involves hidden (latent) variables. In such cases, we must resort to methods that estimate or approximate these distributions. Variational inference (VI) is a compelling technique for approximating posterior distributions in latent variable models [1]. It can handle intractable and possibly high-dimensional posteriors, and it makes Bayesian inference computationally efficient and scalable to large datasets. To this end, VI defines a simple family of distributions, called the variational family, and then finds the member of this family that is closest to the true posterior distribution. This turns posterior inference into an optimization problem over the variational distribution.
One of the most successful applications of VI in the deep neural network realm is the Variational Autoencoder (VAE) [2], a deep generative model that implements a probabilistic model and variational Bayesian inference. Many techniques have been suggested to improve the accuracy and efficiency of variational methods (cf. [3,4,5,6,7]). Recent trends in variational inference have focused on the following aspects:
Scalability: includes stochastic approximations.
Generalization: extends the applicability of VI to a large class of otherwise intractable models, such as non-conjugate models.
Accuracy: includes variational models beyond the mean field approximation.
Amortization: implements the inference over local latent variables with inference networks.
Robustness: generates reliable representations of particular data types in the encoded space, even when the training data are corrupted, and detects anomalies.
There are other methods for improving the approximation, such as Monte Carlo methods for VI and black-box methods [8].
In this work, we focus on the accuracy of VAE models. An essential aspect of the VI methodology is the choice of divergence measure, which quantifies how closely the simpler variational distribution approximates the true posterior. Consequently, this choice can have a notable impact on the accuracy of the approximation. Furthermore, using the selected divergence measure, one can devise lower and upper bounds and estimate the true posterior.
Accordingly, we propose a new upper bound for the evidence, termed the Variational Rényi Log Upper bound (VRLU), based on the Variational Rényi (VR) bound suggested by Li and Turner [3]. Further, we devise a (sandwiched) upper–lower bound variational inference method, termed VRS, to jointly optimize the Rényi upper and lower bounds. The VRS loss function combines the VR lower bound and our new upper bound, thus providing a tighter estimate for the log evidence.
Next, we will demonstrate the practical effectiveness of VRS by applying it to the domain adaptation problem. Through this application, we aim to showcase the tangible benefits and practical relevance of our approach.
Domain adaptation is the scenario that arises when we aim to learn, from a source data distribution, a model that performs well on a different (but related) target data distribution. A real-world example of domain adaptation is the common spam filtering problem, which consists of adapting a model from one user (the source distribution) to a new user who receives significantly different emails (the target distribution).
In the context of domain adaptation, the terms “source” and “target” domains are used to refer to the training and test sets, respectively. These sets can have distinct feature spaces, which can occur when the statistical properties of a domain change over time or when new samples are collected from various sources, resulting in domain shifts. Multiple-Source Adaptation (MSA) addresses scenarios where there are multiple source domains and one target domain. The central question is whether the learner can effectively combine relatively accurate predictors from each source domain to create an accurate predictor for any new target domain that may consist of a mixture of these sources.
In contrast to the majority of machine learning research, where models are trained and tested on data drawn from the same distribution, domain adaptation involves using data from different distributions for training and testing. When the train and test sets share the same distribution, the uniform convergence theory ensures that a model’s empirical training error closely approximates its true error. This assumption is not guaranteed in the MSA problem.
In this work, we have focused on two main ideas: devising a tighter (sandwiched) bound for the log evidence via the VRLU and VRS methods, and applying the resulting density estimates to the multiple-source domain adaptation problem.
The rest of the paper is organized as follows: Section 2 provides a review of variational inference for probabilistic modeling, and discusses different divergence measures, such as the KL divergence, the Rényi divergence, and the $\chi^2$ divergence, for bounding the log evidence. In Section 3, we present our novel approach, called Variational Rényi Log Upper bound (VRLU), which offers an improved bound for the log evidence. Additionally, we introduce an optimized technique, referred to as the Variational Rényi Sandwich (VRS), that leverages both upper and lower bounds. Section 4 offers a comprehensive overview of the domain adaptation problem and illustrates the application of the approximated distributions in calculating its loss function. Finally, in Section 5, we present a series of experiments conducted to evaluate the effectiveness of our proposed methods, VRLU and VRS, in the context of both log evidence estimation and domain adaptation.
2. Divergence Methods in Variational Inference for Probabilistic Modeling
In probabilistic modeling, we aim to devise a probabilistic model, $p_\theta(x)$, that best explains the data. This is commonly done by maximizing the log-likelihood of the data (also known as the log evidence) with respect to the model's parameters $\theta$, i.e., Maximum Likelihood Estimation (MLE). For a latent model, where we assume that the observed data, $x$, depend on a latent variable $z$, the MLE takes the following form:
$$\theta^* = \arg\max_\theta \log p_\theta(x) = \arg\max_\theta \log \int p_\theta(x, z)\, dz$$
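To see why this integral becomes a computational bottleneck, consider a one-dimensional toy latent model. The following sketch (a minimal illustration of our own, not taken from the paper; the model and sample sizes are arbitrary choices) estimates the log evidence by naive Monte Carlo over the prior and compares it against the closed-form answer available in this conjugate case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent model: z ~ N(0, 1), x | z ~ N(z, 0.5^2).
# log p(x) = log ∫ p(x|z) p(z) dz, estimated by naive Monte Carlo
# over samples from the prior.
def log_evidence_mc(x, n_samples=100_000):
    z = rng.standard_normal(n_samples)          # z_k ~ p(z)
    log_px_given_z = (-0.5 * ((x - z) / 0.5) ** 2
                      - np.log(0.5 * np.sqrt(2 * np.pi)))
    # log (1/N) sum_k p(x | z_k), computed stably in log space
    return np.logaddexp.reduce(log_px_given_z) - np.log(n_samples)

# For this conjugate toy model, the evidence has a closed form:
# p(x) = N(0, 1 + 0.5^2), so the estimate can be checked directly.
x = 1.3
exact = -0.5 * x**2 / 1.25 - 0.5 * np.log(2 * np.pi * 1.25)
print(log_evidence_mc(x), exact)   # the two values should be close
```

For a high-dimensional latent space, this naive estimator degrades rapidly, which is precisely the regime where variational inference is needed.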
For many latent models, the log evidence integral is unavailable in closed form or it is too complex to compute. A leading approach to handle such intractable cases is variational inference (VI). One of the most successful applications of VI in the deep neural network realm is the Variational Autoencoder (VAE).
2.1. Variational Autoencoder and the Kullback–Leibler Divergence
A Variational Autoencoder is a deep generative model that implements a probabilistic model and variational Bayesian inference. Introduced by Kingma and Welling [2], a VAE model is an autoencoder designed to stochastically encode the input data into a constrained multivariate latent space (encoding), and then to reconstruct the data as accurately as possible (decoding). To turn the intractable posterior inference into a solvable problem, we use a parametric inference model $q_\phi(z \mid x)$, also called an encoder. We optimize the variational parameters $\phi$ such that $q_\phi(z \mid x) \approx p_\theta(z \mid x)$. The VAE loss function is composed of a "reconstruction term" (to ensure the decoded data are close to the original data) and a "regularization term". The goal of the regularization term is to ensure that the distributions returned by the encoder are close to a standard normal distribution. This is expressed as the Kullback–Leibler divergence between the returned distribution and a standard Gaussian.
Definition 1. Kullback–Leibler (KL) divergence [10,11]. For discrete probability distributions p and q, defined on the same probability space, the KL divergence from q to p is defined to be:
$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
Since the true posterior $p_\theta(z \mid x)$ is intractable, we aim to approximate it with a Gaussian distribution $q_\phi(z \mid x)$, in the KL divergence sense. It follows that:
$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$$
Definition 2. Evidence Lower Bound (ELBO):
$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big]$$
We note that the KL divergence is non-negative; thus, maximizing the ELBO results in the minimization of the KL divergence between $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$.
ELBO optimization is a well-known method that has been studied in depth, and is applicable in many models, especially in VAE [12]. Nevertheless, using the ELBO can give rise to some drawbacks. First, the ELBO is not always very tight, and maximizing the bound instead of the actual likelihood can lead to bias; typically, this leads to a simpler model $q_\phi(z \mid x)$, which approximates the real posterior. Second, the KL divergence does not always lead to the best results: it tends to favor approximate distributions $q$ that underestimate the entropy of the true posterior ("zero-forcing"). Namely, $D_{KL}(q \,\|\, p)$ is infinite when $p(x) = 0$ and $q(x) > 0$. Therefore, the optimal variational distribution $q$ will be 0 when $p = 0$. This "zero-forcing" behavior leads to degenerate solutions during optimization.
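The following small numeric example (our own, with arbitrary distributions) makes the asymmetry concrete: the variational direction $D_{KL}(q \,\|\, p)$ blows up wherever $q$ places mass that $p$ forbids, while the reverse direction does not:

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log 0 = 0
    mask = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5, 0.0])   # "true" posterior: no mass on state 3
q = np.array([0.4, 0.4, 0.2])   # approximation: some mass on state 3

print(kl(q, p))   # inf: q places mass where p has none
print(kl(p, q))   # ~0.22: the reverse direction is only mildly penalized
```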
2.2. Rényi Divergence
One of the core parts of probabilistic modeling is the selection of the method for estimating the approximation of the distribution. In the previous section, we introduced the Kullback–Leibler (KL) divergence. In this section, we present the Rényi divergence (also known as the $\alpha$-divergence), which measures the difference between two distributions $p$ and $q$, and is defined by:
$$D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1} \log \sum_x p(x)^\alpha\, q(x)^{1 - \alpha}$$
Rényi divergence was initially defined for $\alpha > 0$, $\alpha \neq 1$. The definition was extended to the values $0$, $1$, and $\infty$ by continuity. There are certain $\alpha$ values for which Rényi divergence has a wider application than others. Of particular interest are the values $0$, $\frac{1}{2}$, $1$, $2$, and $\infty$, presented in Table 1. We note that for $\alpha \to 1$, $D_\alpha(p \,\|\, q) \to D_{KL}(p \,\|\, q)$; that is, the KL divergence is recovered.
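A direct implementation makes these special cases easy to check numerically. The sketch below (our own, assuming strictly positive discrete distributions) computes $D_\alpha$ and falls back to the KL divergence for $\alpha \approx 1$:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    # D_alpha(p||q) = log( sum_x p^alpha q^(1-alpha) ) / (alpha - 1)
    if np.isclose(alpha, 1.0):          # alpha -> 1 recovers the KL divergence
        return np.sum(p * np.log(p / q))
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])

for alpha in [0.5, 0.9, 0.999, 1.0, 2.0]:
    print(alpha, renyi_divergence(p, q, alpha))
# D_alpha is non-decreasing in alpha, and D_0.999 is close to the KL value
```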
2.2.1. Selected Properties of Rényi Divergence
Theorem 1. (Positivity): For any order $\alpha \in [0, \infty]$: $D_\alpha(p \,\|\, q) \ge 0$, and $D_\alpha(p \,\|\, q) = 0$ if and only if $p = q$.
Theorem 2. (Convexity): For any order $\alpha \in [0, 1]$, Rényi divergence is jointly convex in its arguments. That is, for any two pairs of probability distributions $(p_0, q_0)$ and $(p_1, q_1)$, and any $\lambda \in [0, 1]$:
$$D_\alpha\big(\lambda p_0 + (1 - \lambda) p_1 \,\|\, \lambda q_0 + (1 - \lambda) q_1\big) \le \lambda D_\alpha(p_0 \,\|\, q_0) + (1 - \lambda) D_\alpha(p_1 \,\|\, q_1)$$
For any order $\alpha \in [0, \infty]$, Rényi divergence is convex in its second argument. That is, for any probability distributions $p$, $q_0$, and $q_1$, and any $\lambda \in [0, 1]$:
$$D_\alpha\big(p \,\|\, \lambda q_0 + (1 - \lambda) q_1\big) \le \lambda D_\alpha(p \,\|\, q_0) + (1 - \lambda) D_\alpha(p \,\|\, q_1)$$
Theorem 3. (Continuity in the Order): The Rényi divergence is continuous in $\alpha$ on $[0, \infty]$.
The definition of Rényi divergence was extended to negative orders $\alpha < 0$ as well. However, not all properties are preserved, and some are inverted. For example, Rényi divergence for negative orders is non-positive and concave in its first argument (cf. Figure 1). The extended definition of Rényi divergence to all $\alpha \in \mathbb{R}$ has some interesting properties:
Theorem 4. (Monotonicity) [3]: Rényi divergence, extended to negative $\alpha$, is continuous and non-decreasing in $\alpha$ on $(-\infty, \infty)$.
Lemma 1. The Skew Symmetry property: For any $\alpha \notin \{0, 1\}$:
$$D_\alpha(p \,\|\, q) = \frac{\alpha}{1 - \alpha}\, D_{1 - \alpha}(q \,\|\, p)$$
Definition 3. We will denote by $\widehat{D}_\alpha$ the exponential of the Rényi divergence:
$$\widehat{D}_\alpha(p \,\|\, q) = e^{D_\alpha(p \,\|\, q)}$$
Figure 1 illustrates $D_\alpha$ and $\widehat{D}_\alpha$ as functions of $\alpha$. One can see that $\widehat{D}_\alpha$ achieves high values very quickly. Both $D_\alpha$ and $\widehat{D}_\alpha$ are non-decreasing as functions of $\alpha$.
Many other properties are described in [3,13].
2.2.2. Rényi Divergence Variational Inference
To estimate the evidence $\log p_\theta(x)$, we employ a minimization approach using the Rényi divergence between the variational distribution $q_\phi(z \mid x)$ and the true posterior distribution $p_\theta(z \mid x)$, where $\alpha$ is a selected positive value. Expanding the posterior and using Bayes' theorem, we obtain:
$$D_\alpha\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = \frac{1}{\alpha - 1} \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(z \mid x)}{q_\phi(z \mid x)}\right)^{1 - \alpha}\right] = \log p_\theta(x) + \frac{1}{\alpha - 1} \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{1 - \alpha}\right]$$
It follows that:
$$\log p_\theta(x) = D_\alpha\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + \frac{1}{1 - \alpha} \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{1 - \alpha}\right]$$
Definition 4. Variational Rényi (VR) bound [3]:
$$\mathcal{L}_\alpha(q; x) = \frac{1}{1 - \alpha} \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{1 - \alpha}\right]$$
The variational Rényi (VR) bound can be extended to negative $\alpha$ values as well. Since $D_\alpha \ge 0$ for $\alpha > 0$ and $D_\alpha \le 0$ for $\alpha < 0$ (see Figure 1), then, for $\alpha > 0$, $\mathcal{L}_\alpha$ is a lower bound for $\log p_\theta(x)$, and for $\alpha < 0$, $\mathcal{L}_\alpha$ is an upper bound for $\log p_\theta(x)$.
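To make the VR bound concrete, the following sketch (our own; the same toy conjugate model as before, with a deliberately crude variational distribution) estimates $\mathcal{L}_\alpha$ by Monte Carlo for a positive and a negative $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model from earlier (z ~ N(0,1), x|z ~ N(z, 0.5^2)), with a
# deliberately crude variational distribution q(z|x) = N(0.5, 1).
def vr_bound(x, alpha, n_samples=200_000):
    z = rng.normal(0.5, 1.0, n_samples)                    # z ~ q(z|x)
    log_joint = (-0.5 * ((x - z) / 0.5)**2 - np.log(0.5 * np.sqrt(2 * np.pi))
                 - 0.5 * z**2 - 0.5 * np.log(2 * np.pi))   # log p(x,z)
    log_q = -0.5 * (z - 0.5)**2 - 0.5 * np.log(2 * np.pi)  # log q(z|x)
    log_w = (1 - alpha) * (log_joint - log_q)
    # L_alpha = log E_q[(p(x,z)/q(z|x))^(1-alpha)] / (1 - alpha)
    return (np.logaddexp.reduce(log_w) - np.log(n_samples)) / (1 - alpha)

x = 1.3
exact = -0.5 * x**2 / 1.25 - 0.5 * np.log(2 * np.pi * 1.25)  # log p(x)
print(vr_bound(x, 0.5), exact, vr_bound(x, -1.0))
# lower bound (alpha = 0.5)  <=  log p(x)  <=  upper bound (alpha = -1)
```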
2.3. $\chi$ Divergence
Similarly to the KL divergence and the Rényi divergence, one can use the $\chi^2$-divergence (or, in general, the $\chi^n$-divergence) and develop a bound for the log evidence [14]:
$$\chi^n(p \,\|\, q) = \mathbb{E}_{z \sim q}\left[\left(\frac{p(z)}{q(z)}\right)^{n}\right] - 1$$
Now, our objective is to approximate the evidence $p_\theta(x)$ by using the $\chi^n$-divergence between the true posterior $p_\theta(z \mid x)$ and $q_\phi(z \mid x)$:
$$1 + \chi^n\big(p_\theta(z \mid x) \,\|\, q_\phi(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(z \mid x)}{q_\phi(z \mid x)}\right)^{n}\right] = \frac{1}{p_\theta(x)^n}\, \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{n}\right]$$
After rearranging the equation, we obtain:
$$p_\theta(x)^n \Big(1 + \chi^n\big(p_\theta(z \mid x) \,\|\, q_\phi(z \mid x)\big)\Big) = \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{n}\right]$$
Taking logarithms on both sides:
$$n \log p_\theta(x) + \log\Big(1 + \chi^n\big(p_\theta(z \mid x) \,\|\, q_\phi(z \mid x)\big)\Big) = \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{n}\right]$$
By the monotonicity of the logarithm and the non-negativity of the $\chi^n$-divergence, the following quantity is an upper bound of the log evidence:
$$\log p_\theta(x) \le \frac{1}{n} \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{n}\right] \equiv \mathrm{CUBO}_n$$
Using the $\chi^n$-divergence for general $n$ provides a family of bounds. We note the strong connection between the $\mathrm{CUBO}_n$ and the Rényi bound $\mathcal{L}_\alpha$: when $\alpha = 1 - n$, the VR bound is recovered, i.e., $\mathrm{CUBO}_n = \mathcal{L}_{1-n}$.
Theorem 5. (Sandwich Theorem [14]) For $n \ge 1$, the following holds:
- 1. $\mathcal{L}_{\mathrm{ELBO}} \le \log p_\theta(x) \le \mathrm{CUBO}_n$.
- 2. $\mathrm{CUBO}_n$ is a non-decreasing function of the order $n$ of the $\chi^n$-divergence.
- 3. $\lim_{n \to 0} \mathrm{CUBO}_n = \log p_\theta(x)$.
Using Theorem 5, one can estimate $\log p_\theta(x)$ with both upper and lower bounds, which may provide a better approximation for the log evidence.
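Reusing `vr_bound` from the previous sketch and the identity $\mathrm{CUBO}_n = \mathcal{L}_{1-n}$, a sandwich estimate for the toy model (our construction) looks as follows:

```python
# Sandwich estimate: alpha close to 1 approximates the ELBO, while
# n = 2 (i.e., alpha = 1 - n = -1) gives the chi^2 upper bound CUBO_2.
elbo = vr_bound(x, 0.999)     # ~ ELBO (VR bound with alpha close to 1)
cubo2 = vr_bound(x, -1.0)     # CUBO_2 (VR bound with alpha = 1 - n, n = 2)
print(elbo, exact, cubo2)     # elbo <= log p(x) <= cubo2
```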
The $\chi$ upper bound has many advantages: it is a black-box inference algorithm, in that it does not need model-specific derivations and is easy to apply to a wide class of models. In addition, it is useful when the KL divergence is not a good objective, and it is guaranteed to converge [14].
2.4. Monte Carlo Approximation
So far, we have discussed the KL divergence, the Rényi divergence, and the $\chi^n$ divergence, and have demonstrated how each of these measures can be used to construct a bound for the log evidence. However, calculating these bounds is computationally intractable, due to the stochastic nature of the latent space and the exponential number of random variables. In real-world situations, where datasets are typically limited and contain a finite number of data points, empirical estimations become necessary. A popular method for estimating these bounds is the Monte Carlo (MC) approximation [15,16]. Typically, the MC method involves random sampling from certain probability distributions.
The Monte Carlo (MC) approximation of the Kullback–Leibler (KL) divergence is unbiased, guaranteeing the convergence of the optimization process for the Evidence Lower Bound (ELBO). However, the MC approximation for the Rényi bound introduces bias, leading to an underestimation of the true expectation. In the case of positive values of $\alpha$, this implies a relatively looser bound, but it should still be effective. On the other hand, for negative values of $\alpha$, this becomes a significant issue, as it underestimates an upper bound. More precisely, the MC approximation for the Rényi bound is:
$$\widehat{\mathcal{L}}_\alpha(q; x) = \frac{1}{1 - \alpha} \log \frac{1}{K} \sum_{k=1}^{K} \left(\frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right)^{1 - \alpha}, \qquad z_k \sim q_\phi(z \mid x)$$
For this to be unbiased, the expectation should be equal to the true value, $\mathbb{E}\big[\widehat{\mathcal{L}}_\alpha\big] = \mathcal{L}_\alpha$. By Jensen's inequality:
$$\mathbb{E}\left[\log \frac{1}{K} \sum_{k=1}^{K} \left(\frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right)^{1 - \alpha}\right] \le \log \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{1 - \alpha}\right]$$
Thus, the approximation is actually an underestimate of the true bound. This characteristic was also discussed in [3], where the authors suggested improving the approximation quality by using more samples, and using negative $\alpha$ values to improve the accuracy at the cost of losing the upper-bound guarantee.
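The bias is easy to observe empirically. Continuing the toy example above (our construction), averaging many few-sample estimates of the negative-$\alpha$ bound falls below the large-sample value:

```python
# With few samples per estimate, the averaged "upper bound" drifts
# downward and may no longer sit above log p(x).
alpha = -1.0
small = np.mean([vr_bound(x, alpha, n_samples=10) for _ in range(2000)])
large = vr_bound(x, alpha, n_samples=500_000)
print(small, large, exact)    # small-sample average < large-sample bound
```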
Other papers have suggested different approaches to keep the upper-bounding property intact [8,14,17]. Of particular interest is the generic $\chi^n$ upper bound, $\mathrm{CUBO}_n$, which suffers from the same problem of biased estimation under MC approximation. In [14], the authors suggested an approach to avoid the biased approximation, by exponentiation:
$$\mathbf{L} = \exp\{n \cdot \mathrm{CUBO}_n\} = \mathbb{E}_{z \sim q_\phi}\left[\left(\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right)^{n}\right]$$
Applying the MC approximation to $\mathbf{L}$ provides an unbiased upper bound. However, this change affects the variance of the gradients, which may damage the quality of the approximation result. It may produce high-variance estimates and requires a large number of samples in order to serve as a reliable upper bound [18].
4. Multiple-Source Adaptation (MSA)
In statistical learning, there are numerous settings that require an accurate estimation of the data distribution to find effective solutions. One such task is known as domain adaptation. In the preceding section, we introduced VRS as an enhanced method to obtain accurate approximations of the data distribution. In this section, we will apply these estimated distributions to the domain adaptation objective, thus demonstrating the effectiveness and practicality of the VRS method to yield accurate solutions.
Domain adaptation is a scenario where we aim to train a classifier on one dataset (referred to as the source domain) for which labels or annotations are available and achieve good performance on another dataset (referred to as the target domain) for which labels or annotations are not available. A common example of a domain adaptation application is spam filtering, where a model trained on one user’s emails (the source domain) is adapted and used to filter spam for a different user who receives distinct emails (the target domain).
In this work, our focus is on the Multiple-Source Adaptation (MSA) problem, where multiple source domains are available alongside a single target domain. The target domain can be considered as either an exact mixture of the source domains, or it might be well approximated by such a mixture. The goal is to leverage the information provided by the source domains to improve performance on the target domain, where annotations or labels are not available.
In many real-world scenarios, the learner may not have access to all of the source data at once, due to privacy or storage constraints. Therefore, the learner cannot simply combine all of the source data together to train a predictor. A possible solution to this problem is the Mixture of Experts (MOE) approach. MOE is an ensemble learning technique that involves training multiple experts on different sub-tasks of a predictive modeling problem. Each expert concentrates on a specific part of the modeling problem space. A gating network then combines the outputs of the various experts. In the domain adaptation problem, this concept can be applied by modeling the domain relationship with an MOE approach.
The MSA problem was theoretically analyzed by Mansour, Mohri, and Rostamizadeh in [19]. In their paper, the authors presented the domain adaptation problem setup and proved that for any target domain, there exists a hypothesis, referred to as the distribution-weighted combining rule, which is capable of achieving a low error rate with respect to the target domain. However, it should be noted that the authors did not provide a method for determining or learning the aforementioned hypothesis.
In the paper by Hoffman, Mohri, and Zhang [9], the authors extended the definition of the distribution-weighted combining rule to probabilistic models as well, using the cross-entropy loss. Additionally, the authors introduced an iterative algorithm, based on Difference of Convex (DC) programming, that constructs the weighted combination rule. Nonetheless, the algorithm proposed in the paper assumes either prior knowledge of the probabilities associated with the data samples or relies on accurate estimates of these probabilities. The authors evaluated the performance of their model by employing the Rényi divergence, which quantifies the discrepancy between the true distribution and the approximated distribution. As a result, the effectiveness of their model is contingent upon the accuracy of the probability approximations as well.
In order to circumvent the need for good estimates of the data distribution, Cortes et al. [20] proposed a discriminative technique using an estimate of the conditional probabilities $p(i \mid x)$ for each source domain $i$ (that is, the probability that an instance $x$ belongs to source $i$). To this end, they had to modify the DC algorithm proposed in [9], in order to adapt it to their new distribution calculation.
In this study, we build upon the algorithm introduced by Hoffman, Mohri, and Zhang [9], and enhance it with a refined approximation of the source distributions via variational inference.
4.1. MSA Problem Setup
We refer to a probabilistic model where there is a distribution $p$ over the input space $X$. Each data point $x \in X$ has a corresponding label $y \in Y$, where $Y$ denotes the space of labels. Our objective function $f: X \to Y$ describes the correspondence between a data point $x$ and its label $f(x)$. We will focus on the adaptation problem with $k$ source domains and a single target domain. For each domain $i \in \{1, \ldots, k\}$, we have a source distribution $p_i$ and a corresponding hypothesis $h_i$. More precisely, $h_i(x, y)$ returns the probability that $f(x) = y$.
Definition 8. Let $L: Y \times Y \to \mathbb{R}_+$ be a loss function penalizing errors with respect to $f$. The loss of hypothesis $h$ with respect to the objective function $f$ and a distribution $p$ is denoted by $\mathcal{L}(p, h, f)$ and defined as:
$$\mathcal{L}(p, h, f) = \mathbb{E}_{x \sim p}\big[L(h(x), f(x))\big] = \sum_{x \in X} p(x)\, L(h(x), f(x))$$
For simplicity, we will denote $\mathcal{L}(p, h, f)$ as $\mathcal{L}(p, h)$ throughout this paper. We will assume that the following properties hold for the loss function $L$:
L is non-negative: $L(y, y') \ge 0$ for all $y, y' \in Y$.
L is convex.
L is bounded: there exists $M > 0$ such that $L(y, y') \le M$ for all $y, y' \in Y$.
L is continuous in both arguments.
L is symmetric.
Proposition 1. For each domain $i$, the hypothesis $h_i$ is a relatively accurate predictor for domain $i$ with the distribution $p_i$; i.e., there exists $\epsilon > 0$ such that:
$$\mathcal{L}(p_i, h_i) \le \epsilon$$
Proposition 2. We will denote the simplex $\Delta = \{\lambda \in \mathbb{R}^k : \lambda_i \ge 0, \sum_{i=1}^{k} \lambda_i = 1\}$. The distribution of the target domain is assumed to be a mixture of the $k$ source distributions $p_1, \ldots, p_k$; that is, there exists $\lambda \in \Delta$ such that:
$$p_T(x) = \sum_{i=1}^{k} \lambda_i\, p_i(x)$$
4.2. Existence of a Good Hypothesis
The goal of solving the MSA problem is to establish a good predictor for the target domain (i.e., one that attains a small error with respect to the target domain), given the source domains' predictors. A common assumption is that there exists some relationship between the target domain and the distributions of the source domains (see Proposition 2). It can be demonstrated that conventional convex combinations of source predictors may yield suboptimal results in certain scenarios. In particular, studies have indicated that even if the source predictors possess zero loss, no convex combination can attain a loss lower than a specific constant for a uniform mixture of the source distributions.
Alternatively, Mansour, Mohri, and Rostamizadeh [19] proposed a distribution-weighted solution and defined the distribution-weighted combination hypothesis for a regression model. Hoffman, Mohri, and Zhang [9] extended the distribution-weighted combination hypothesis to a probabilistic model, as follows:
Definition 9. Distribution-weighted combination hypothesis.
For any $\eta > 0$ and $w \in \Delta$:
$$h_w^\eta(x, y) = \sum_{i=1}^{k} \frac{w_i\, p_i(x) + \frac{\eta}{k}\, U(x)}{\sum_{j=1}^{k} w_j\, p_j(x) + \eta\, U(x)}\; h_i(x, y)$$
where $U$ is the uniform distribution over $X$. In the probabilistic model case, we will use $L$ as the binary cross-entropy loss:
$$L(h(x, y), f(x)) = -\log h(x, f(x))$$
which maintains all of the required properties stated in Section 4.1.
Theorem 6. For any target function $f$ and for any $\epsilon > 0$, there exist $\eta > 0$ and $w \in \Delta$ such that $\mathcal{L}(p_\lambda, h_w^\eta) \le \epsilon$ for any mixture parameter $\lambda$ (where $p_\lambda = \sum_{i=1}^{k} \lambda_i p_i$).
The proof of Theorem 6 is detailed in [19]. From this theorem, it can be inferred that for any fixed target function $f$, the distribution-weighted combination hypothesis is a good hypothesis for the target domain.
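For concreteness, a direct implementation of Definition 9 might look as follows (a sketch with illustrative names; `densities`, `hypotheses`, and `u` are hypothetical callables standing in for the per-domain density estimates $p_i$, the per-domain predictors $h_i$, and the uniform density $U$):

```python
import numpy as np

# Sketch of the distribution-weighted combination rule of Definition 9.
def h_w_eta(x, y, w, densities, hypotheses, u, eta=1e-3):
    k = len(densities)
    p = np.array([d(x) for d in densities])           # p_i(x) per source
    numer = w * p + eta * u(x) / k                    # w_i p_i(x) + (eta/k) U(x)
    coeff = numer / (np.dot(w, p) + eta * u(x))       # normalized mixture weights
    h = np.array([h_i(x, y) for h_i in hypotheses])   # h_i(x, y) per source
    return np.dot(coeff, h)                           # weighted combination
```

Note that the per-source coefficients sum to 1, so the rule returns a proper convex combination of the source predictions, with the weight of source $i$ growing where its (smoothed) density dominates.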
4.3. A Good Hypothesis with Estimated Probabilities
On closer inspection of Definition 9, it is evident that constructing $h_w^\eta$ requires access to the distributions of all domains, $p_1, \ldots, p_k$. Yet, in practical settings, the true distributions may not be directly available to the learner. Instead, the learner relies on estimates derived from the available data. Thus, addressing the application of domain adaptation becomes essential for real-world scenarios where the true distributions remain unknown.
Our objective is to minimize the value of $\mathcal{L}(p_T, h_w^\eta)$. To accomplish this, we will develop an upper bound for this loss function (similar to previous research [9,21]). By doing so, we can examine the impact of utilizing estimated distributions $\hat{p}_i$ on the efficacy of our model and gain insights into the application of domain adaptation in real-world scenarios. First, let us recall Hölder's inequality:
Theorem 7. Hölder's inequality: For any $s$ and $t$ in the open interval $(1, \infty)$ with $\frac{1}{s} + \frac{1}{t} = 1$, and for $(a_i)$ and $(b_i)$ sets of real numbers, we have:
$$\sum_i |a_i b_i| \le \left(\sum_i |a_i|^s\right)^{1/s} \left(\sum_i |b_i|^t\right)^{1/t}$$
Corollary 1. Let $\hat{p}_i$ be an estimation of the original domain distribution $p_i$. The following inequality holds for any $\alpha > 1$:
$$\mathcal{L}(p_i, h) \le \left[e^{D_\alpha(p_i \,\|\, \hat{p}_i)}\, \mathcal{L}(\hat{p}_i, h)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
Proof of Corollary 1. For any hypothesis $h$ and any distributions $p$, $q$, and for any $\alpha > 1$, the following holds (the proof is based on a similar corollary proven in [9]):
$$\mathcal{L}(p, h) = \sum_x \frac{p(x)}{q(x)^{\frac{\alpha - 1}{\alpha}}}\; q(x)^{\frac{\alpha - 1}{\alpha}}\, L(h(x), f(x)) \le \left(\sum_x \frac{p(x)^\alpha}{q(x)^{\alpha - 1}}\right)^{\frac{1}{\alpha}} \left(\sum_x q(x)\, L(h(x), f(x))^{\frac{\alpha}{\alpha - 1}}\right)^{\frac{\alpha - 1}{\alpha}} \le \left[e^{D_\alpha(p \,\|\, q)}\, \mathcal{L}(q, h)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
For each $i$, by setting $p = p_i$ and $q = \hat{p}_i$, we will find that:
$$\mathcal{L}(p_i, h) \le \left[e^{D_\alpha(p_i \,\|\, \hat{p}_i)}\, \mathcal{L}(\hat{p}_i, h)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
□
Corollary 1 provides us with an upper bound on the loss using the estimated distributions $\hat{p}_i$. When $\hat{p}_i = p_i$, $D_\alpha(p_i \,\|\, \hat{p}_i) = 0$, and we are left with $\mathcal{L}(\hat{p}_i, h)^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$. We will set $M = 1$, since we use the cross-entropy loss (log-loss) as the loss function. Thus, when $\hat{p}_i = p_i$ and $M = 1$, the bound reduces to $\mathcal{L}(\hat{p}_i, h)^{\frac{\alpha - 1}{\alpha}}$.
By performing the aforementioned calculation with $\alpha < 1$, it is possible to derive a lower bound for $\mathcal{L}(p_i, h)$. This lower bound serves as a confirmation that the utilization of approximated probabilities does not lead to significant errors. For instance, if the lower bound exhibits a considerably large value, it indicates that our approximation is inadequate. Conversely, if the lower bound demonstrates a small value, it signifies the effectiveness of our approximation. Moreover, by employing both upper and lower bounds, we can obtain a more precise estimation of the loss.
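The upper bound is straightforward to verify numerically. The following sketch (ours, using Corollary 1 as reconstructed above, with random toy distributions) checks the inequality for several orders:

```python
import numpy as np

# Numeric sanity check of the Corollary 1 bound on random discrete
# distributions: the true loss never exceeds the bound built from the
# estimated distribution, for any alpha > 1.
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(10))        # "true" domain distribution p_i
q = rng.dirichlet(np.ones(10))        # its estimate \hat{p}_i
loss = rng.uniform(0, 1, size=10)     # per-point losses L(h(x), f(x))
M = 1.0                               # bound on the loss values

for alpha in [1.5, 2.0, 4.0]:
    d_alpha = np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)
    lhs = np.dot(p, loss)             # L(p, h)
    rhs = (np.exp(d_alpha) * np.dot(q, loss)) ** ((alpha - 1) / alpha) \
          * M ** (1 / alpha)
    print(alpha, lhs <= rhs)          # True for every alpha
```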
Theorem 8. Generalization of Hölder's inequality [22]: Let $s, t, u \in (1, \infty)$ with $\frac{1}{s} + \frac{1}{t} + \frac{1}{u} = 1$, and let $(a_i)$, $(b_i)$, and $(c_i)$ be sets of real numbers. Then:
$$\sum_i |a_i b_i c_i| \le \left(\sum_i |a_i|^s\right)^{1/s} \left(\sum_i |b_i|^t\right)^{1/t} \left(\sum_i |c_i|^u\right)^{1/u}$$
Corollary 2. Let $\hat{p}_i$ be an estimation of the original domain distribution $p_i$. The following inequality holds for any $s, t, u > 1$ with $\frac{1}{s} + \frac{1}{t} + \frac{1}{u} = 1$:
$$\mathcal{L}(p_i, h) \le e^{\frac{s - 1}{s} D_s(p_i \,\|\, \hat{p}_i)}\; \mathcal{L}(\hat{p}_i, h)^{\frac{1}{t}}\; Z$$
where
$$Z = M^{\frac{t - 1}{t}} \left(\sum_x \hat{p}_i(x)\right)^{\frac{1}{u}}$$
Proof of Corollary 2. First, we will prove the claim for general distributions $p$ and $q$, and then apply it to $p_i$ and $\hat{p}_i$. Let us set
$$a(x) = \frac{p(x)}{q(x)^{\frac{1}{t} + \frac{1}{u}}}, \qquad b(x) = q(x)^{\frac{1}{t}}\, L(h(x), f(x)), \qquad c(x) = q(x)^{\frac{1}{u}}$$
For any hypothesis $h$ and any distributions $p$, $q$, the generalized Hölder inequality gives:
$$\mathcal{L}(p, h) = \sum_x a(x)\, b(x)\, c(x) \le \left(\sum_x \frac{p(x)^s}{q(x)^{s - 1}}\right)^{\frac{1}{s}} \left(\sum_x q(x)\, L(h(x), f(x))^{t}\right)^{\frac{1}{t}} \left(\sum_x q(x)\right)^{\frac{1}{u}}$$
Next, notice that $\frac{1}{t} + \frac{1}{u} = 1 - \frac{1}{s} = \frac{s - 1}{s}$, so that $s\left(\frac{1}{t} + \frac{1}{u}\right) = s - 1$, and that $\sum_x p(x)^s\, q(x)^{1 - s} = e^{(s - 1) D_s(p \,\|\, q)}$. Using the boundedness of $L$, $\sum_x q(x)\, L^t \le M^{t - 1} \sum_x q(x)\, L = M^{t - 1}\, \mathcal{L}(q, h)$.
For any hypothesis $h$ and any distributions $p$, $q$, the following therefore holds:
$$\mathcal{L}(p, h) \le e^{\frac{s - 1}{s} D_s(p \,\|\, q)}\; \mathcal{L}(q, h)^{\frac{1}{t}}\; M^{\frac{t - 1}{t}} \left(\sum_x q(x)\right)^{\frac{1}{u}}$$
For each $i$, by setting $p = p_i$ and $q = \hat{p}_i$, we will find that:
$$\mathcal{L}(p_i, h) \le e^{\frac{s - 1}{s} D_s(p_i \,\|\, \hat{p}_i)}\; \mathcal{L}(\hat{p}_i, h)^{\frac{1}{t}}\; Z$$
□
We contend that the value of $Z$ can be disregarded when examining the loss bound. As previously mentioned, we assume that $L \le M$, where we have set $M = 1$. Consequently, we are left with $Z = \left(\sum_x \hat{p}_i(x)\right)^{1/u}$. Since $\hat{p}_i$ is a distribution, the sum equals 1.
As an illustration, we present the different bound values, calculated with a constant distribution $p$ and an estimated distribution $\hat{p}_c$, parameterized by a constant $c$, such that $\hat{p}_c = p$ when $c = 3$. The results are shown in Figure 5.
As we can observe, as the estimated distribution $\hat{p}_c$ approaches the true distribution $p$ (i.e., as $c$ approaches 3), the bounds on the loss function become increasingly similar. We can also see that the value of the lower bounds is not significantly large, which means that we can consider using the probability approximation to solve the MSA problem. It is also worth noting that when the ratio between the estimated and the true distribution deviates significantly from 1, the bounds move away from the actual value.
Theorem 9. Let $p_T$ be an arbitrary target distribution. For any $\epsilon > 0$, there exist $\eta > 0$ and $w \in \Delta$, such that the following inequality holds for any $\alpha > 1$ and any mixture parameter $\lambda$:
$$\mathcal{L}(p_T, h_w^\eta) \le \left[e^{D_\alpha(p_T \,\|\, p_\lambda)}\; \epsilon\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
Proof of Theorem 9. Let $p_\lambda = \sum_{i=1}^{k} \lambda_i p_i$. In the proof of Corollary 1, we showed that for any hypothesis $h$ and any distributions $p$, $q$, and for any $\alpha > 1$, the following holds:
$$\mathcal{L}(p, h) \le \left[e^{D_\alpha(p \,\|\, q)}\, \mathcal{L}(q, h)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
Hence, for $p = p_T$, $q = p_\lambda$, and $h = h_w^\eta$, we will find that:
$$\mathcal{L}(p_T, h_w^\eta) \le \left[e^{D_\alpha(p_T \,\|\, p_\lambda)}\, \mathcal{L}(p_\lambda, h_w^\eta)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
By Theorem 6, given $\epsilon > 0$, there exist $\eta > 0$ and $w \in \Delta$ such that $\mathcal{L}(p_\lambda, h_w^\eta) \le \epsilon$ for any mixture parameter $\lambda$. Therefore:
$$\mathcal{L}(p_T, h_w^\eta) \le \left[e^{D_\alpha(p_T \,\|\, p_\lambda)}\; \epsilon\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
□
Corollary 3. Let $p_T$ be an arbitrary target distribution. For any $\epsilon > 0$, there exist $\eta > 0$ and $w \in \Delta$, such that the following inequality holds for any $\alpha > 1$ and any mixture parameter $\lambda$:
$$\mathcal{L}(p_T, \hat{h}_w^\eta) \le \left[e^{D_\alpha(p_T \,\|\, \hat{p}_\lambda)}\; \hat{\epsilon}\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$$
where $\hat{p}_\lambda = \sum_{i=1}^{k} \lambda_i \hat{p}_i$, and $\hat{h}_w^\eta$ is our good hypothesis from Definition 9, but calculated with the estimated probabilities $\hat{p}_i$. Proof of Corollary 3. By Corollary 1, and for any $\alpha > 1$: $\mathcal{L}(p_i, h_i) \le \left[e^{D_\alpha(p_i \,\|\, \hat{p}_i)}\, \mathcal{L}(\hat{p}_i, h_i)\right]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$. Let us set $\hat{\epsilon}$ such that $\mathcal{L}(\hat{p}_i, h_i) \le \hat{\epsilon}$ for every $i$. Overall, we obtained the following:
For every $i$: the hypothesis $h_i$ is a relatively accurate predictor with respect to the estimated distribution $\hat{p}_i$.
The target distribution is (approximately) a mixture of the estimated source distributions $\hat{p}_1, \ldots, \hat{p}_k$.
We can repeat the proof of Theorem 9 with $\hat{p}_i$ instead of $p_i$, $\hat{h}_w^\eta$ instead of $h_w^\eta$, and $\hat{\epsilon}$ instead of $\epsilon$. □
In summary, we demonstrated that it is possible to use approximate distributions to calculate a good distribution-weighted combining rule. We have established that the error introduced by using estimated distributions is bounded. Thus, we can address the Multi-Source Adaptation (MSA) problem in real-world applications.
4.4. MSA Algorithm
Alongside the unknown probabilities, another crucial aspect is determining an appropriate vector of weights, denoted as $w$, to fully establish the distribution-weighted combining rule. The paper by Hoffman, Mohri, and Zhang [9] presents an algorithm for determining the distribution-weighted combination solution for the cross-entropy loss and other losses, based on Difference of Convex (DC) programming.
Lemma 2. For any target function $f$ and any $\eta > 0$, there exists $w \in \Delta$, with $w_i > 0$ for all $i \in \{1, \ldots, k\}$, such that the following holds for every domain $i$:
$$\mathcal{L}(\hat{p}_i, h_w^\eta) \le \epsilon_w + \eta M$$
where:
$$\epsilon_w = \sum_{i=1}^{k} w_i\, \epsilon_i, \qquad \epsilon_i = \mathcal{L}(\hat{p}_i, h_i)$$
The proof of Lemma 2 is detailed in [19].
Corollary 4. For any target function $f$ and any $\eta > 0$, there exists $w \in \Delta$, with $w_i > 0$ for all $i$, such that the following holds:
$$\max_{i \in \{1, \ldots, k\}} \mathcal{L}(\hat{p}_i, h_w^\eta) \le \epsilon_w + \eta M$$
Corollary 4 provides a single upper bound for the loss with respect to every domain $i$. Thus, our problem consists of finding a parameter $w$ verifying this property. This, in turn, can be formulated as the following optimization problem:
$$\min_{w \in \Delta}\; \max_{i \in \{1, \ldots, k\}}\; \big[\mathcal{L}(\hat{p}_i, h_w^\eta) - \epsilon_w\big]$$
Definition 10. DC Function [23]: Let $C$ be a convex subset of $\mathbb{R}^n$. A real-valued function $f: C \to \mathbb{R}$ is called DC on $C$ if there exist two convex functions $g, h: C \to \mathbb{R}$ such that $f$ can be expressed in the form:
$$f(x) = g(x) - h(x)$$
DC programming problems are optimization problems dealing with DC functions. An important class of DC problems is the following:
$$(P): \quad \inf\,\{\, f(x) = g(x) - h(x) : x \in X \,\}$$
where $g$ and $h$ are two convex functions on $\mathbb{R}^n$, and $X$ is a closed convex subset of $\mathbb{R}^n$.
Proposition 3. Assume that the problem $(P)$ is solvable. Then, a point $x^*$ is an optimal solution to $(P)$ if and only if there exists $t^* \in \mathbb{R}$ satisfying the optimality condition given in [23].
Horst and Thoai [23] developed an algorithm for solving DC programming problems such as $(P)$, based on the above optimality condition. The assumptions in Proposition 3 apply to the MSA problem, since we know there is an optimal solution. The key lies in identifying two convex functions whose difference coincides with the objective of the MSA problem. Let us define, for each domain $i$, the function:
$$u_i(w) = \mathcal{L}(\hat{p}_i, h_w^\eta) - \epsilon_w$$
Note that our optimization problem is exactly $\min_{w \in \Delta} \max_i u_i(w)$.
Let us define convex functions $g_i$ and $h_i$ such that $u_i = g_i - h_i$. The function $g_i$ is convex, since each exponential term in it is convex as a composition of the convex function $e^x$ with an affine function of the variables. Similarly, $h_i$ is convex, which shows that the second term in the expression of $u_i$ is a convex function. The first term can be written in terms of the unnormalized relative entropy (the unnormalized relative entropy of $P$ and $Q$ is defined by:
$$\overline{D}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} - \sum_x P(x) + \sum_x Q(x)$$
). It can be shown that the unnormalized relative entropy is jointly convex using the so-called log-sum inequality (based on the explanation in [9]).
Let us recall our loss function, the cross-entropy loss: $L(h(x, y), f(x)) = -\log h(x, f(x))$.
Proposition 4. Let $L$ be the cross-entropy loss. Then, for every domain $i$, the function $u_i$ is a DC function.
Using the proof above, our optimization problem is a DC programming problem, since its objective is the difference between two convex functions. In light of all of the above, our optimization problem can be cast in the following variational form of a DC programming problem: let us set $(w_t)_{t \ge 0}$ to be the sequence defined by repeatedly solving the following convex optimization problem:
$$w_{t+1} \in \arg\min_{w \in \Delta}\; \max_{i}\; \Big[ g_i(w) - h_i(w_t) - \nabla h_i(w_t)^{\top} (w - w_t) \Big]$$
where $w_0 \in \Delta$ is an arbitrary starting value. Then, $(w_t)_{t \ge 0}$ is guaranteed to converge to a local minimum of the optimization problem [9].
Given the fact that an optimal hypothesis exists, we converted the MSA problem into an optimization problem and cast it to a DC programming form in order to find a local optimum. This way, we are able to find the parameter w which is used in the distribution-weighted combination rule.
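A schematic implementation of this iteration is sketched below (our own; `g_list` and `h_list` are hypothetical callables, each returning a (value, gradient) pair for the convex parts $g_i$ and $h_i$, and a generic constrained solver stands in for the convex subproblem used in [9]):

```python
import numpy as np
from scipy.optimize import minimize

# Schematic DC (convex-concave) iteration for min_w max_i [g_i(w) - h_i(w)]
# over the simplex.
def dc_iterate(g_list, h_list, k, steps=50):
    w = np.ones(k) / k                 # arbitrary start: uniform weights
    simplex = ({"type": "eq", "fun": lambda v: np.sum(v) - 1.0},)
    bounds = [(0.0, 1.0)] * k

    for _ in range(steps):
        w_t = w.copy()

        def surrogate(v):
            # Linearize each concave part -h_i around w_t, yielding a
            # convex majorant of max_i [g_i - h_i] that is tight at w_t.
            vals = []
            for g, h in zip(g_list, h_list):
                g_val, _ = g(v)
                h_val, h_grad = h(w_t)
                vals.append(g_val - h_val - h_grad @ (v - w_t))
            return max(vals)

        w = minimize(surrogate, w_t, bounds=bounds, constraints=simplex).x
    return w
```

Each iteration solves a convex surrogate whose value upper-bounds the true objective and matches it at $w_t$, so the objective value is non-increasing along the sequence; in practice, a dedicated convex solver (or an epigraph reformulation of the max) may be preferable to the generic routine used here.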
6. Summary
In this study, we reviewed and analyzed methods to estimate data probabilities where traditional computation methods fail. Specifically, we examined variational inference (VI) models, such as the Variational Autoencoder (VAE) [27], which we aimed to improve using different divergence methods. We examined the properties of the Kullback–Leibler divergence, the Rényi divergence (which is essentially a family of divergences, parameterized by $\alpha$), and the $\chi^n$ divergence. We derived the ELBO, the VR, and the CUBO bounds for the log evidence, and presented a new upper bound, termed VRLU, whose MC approximation remains an upper bound. We used VRLU to devise a new (sandwiched) upper–lower bound variational inference method (VRS). The VRS loss function combines the VR lower bound (with positive $\alpha$) and the new VRLU upper bound (with negative $\alpha$), thus providing a tighter estimate for the log evidence.
We performed several experiments designed to test the performance of the new VRS model. We compared the VAE, VR, VRLU, and VRS models over the digits and PIE datasets, using different positive and negative values of $\alpha$. In all cases, the VRS algorithm presented good results, many of which are the best performances compared to the other methods. We note, in passing, that the selection of the $\alpha$ value may depend on the data, an observation that was indicated in previous studies as well [3,14].
In addition, we demonstrated the usage of VRS in MSA applications. We combined the DC-programming algorithm (suggested in [9]) with our VRS model, to obtain more accurate density estimates and improve the accuracy of the hypothesis for the target domain. We performed experiments to compare the accuracy of the resulting hypothesis on two MSA datasets: the digits and Office31 datasets. We compared our new model, using VAE, VR, and VRS, to the previous models, GMSA and DMSA, presented in [20].
Our empirical evaluation revealed that the proposed VRS-MSA model demonstrated competitive performance, and in certain instances even surpassed the performance of models reported in previous studies. Additionally, among the VI models tested, the VRS model achieved the highest overall score, which supports the conclusion that accurate probability estimates are necessary for the success of the weighted combination hypothesis $h_w^\eta$.
Nonetheless, it is important to note that the VRS-MSA model achieved lower scores on certain individual test sets, where the weight parameter $w$ was assigned a low value for that particular domain. When the weight parameter is low, it is important to take into account both the estimated probability $\hat{p}_i(x)$ and the domain-specific hypothesis $h_i$. For example, if the image $x$ is from the SVHN domain, the probabilities estimated for the other digits domains should be relatively low in comparison to the probability estimated for SVHN, such that the value of the SVHN hypothesis is the most prominent in the weighted combination hypothesis. Our VRS-MSA model operates by training a VRS model for each domain, which learns its latent space vectors based on a Gaussian distribution, and outputs the probability in relation to these latent vectors $z$. Consequently, for each domain, the Gaussian distribution may have slight variations in variance, which can influence the log evidence value output from the VRS model. Therefore, the DC programming model, which takes into account the probabilities from all domains simultaneously, may be affected by the different scales of the probability measurements across the domains.
Looking forward, further work is required to disentangle the complexities of the aforementioned VRS-MSA. Specifically, in this work, we have not formed a connection between the latent variables of the VRS models of the different domains. It will be interesting to see how such a connection (normalization or scaling of the probability measurements, or latent space alignment) will affect the compatibility of the probabilities. In addition, some researchers suggest using a common latent feature space in the autoencoder models [28]. Building such a network using our VRS loss might improve the results of the VRS-MSA model. However, it is worth noting that such a common model would lack the separation and privacy of domains that we have achieved using distinct VRS models.
We would also like to extend our experiments on the VRS model. First, it will be interesting to examine different negative and positive values of $\alpha$ and search for the best combination of the lower-bound and upper-bound orders. Second, since the best $\alpha$ may be data-dependent, it will be interesting to explore the possibility of making $\alpha$ a trainable parameter; it could also be used to adjust the degree of relative risk aversion. These directions are left for future research efforts.