First we introduce the proposed general framework for disparity fusion, and then the new loss functions for the supervised and semi-supervised methods. These loss functions make adversarial training simpler and the refined disparity more accurate and robust. Finally, the end-to-end refiner (Figure 2) and discriminator (Figure 3) network structures are presented.
3.2. Objective Function
To let the refiner produce a more accurate refined disparity map, the objective function is designed as follows:
(1) To encourage the disparity value of each pixel to approximate the ground truth and to avoid blur at scene edges (such as occurs with the Monodepth method [7]), a gradient-based $L_1$ distance training loss is used, which applies a larger weight to the disparity values at the scene edges:

$$\mathcal{L}_{L_1}(R) = \mathbb{E}_{x \sim p_{\mathrm{real}}(x),\, \tilde{x} \sim p_{\mathrm{fake}}(\tilde{x})} \Big[ \big\| (x - \tilde{x}) \odot \big( 1 + \lambda_{\mathrm{grad}} \, |\nabla I| \big) \big\|_1 \Big] \qquad (1)$$

where $R$ represents the refiner network, $x$ is the ground truth, and $\tilde{x}$ is the refined disparity map from the refiner. $p_{\mathrm{real}}$ and $p_{\mathrm{fake}}$ represent the real disparity distribution from the ground truth and the fake disparity distribution produced by the refiner, respectively. $\nabla I$ is the gradient of the left intensity image in the scene, because all the inputs and the refined disparity map are from the left view. $\lambda_{\mathrm{grad}}$ weights the gradient, and $\| \cdot \|_1$ is the $L_1$ distance. The goal is to encourage disparity estimates near image edges (larger gradients) to get closer to the ground truth.
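For concreteness, the following is a minimal PyTorch sketch of one way such a gradient-weighted $L_1$ term could be implemented, assuming a single-channel left intensity image; the function and parameter names (e.g., `lambda_grad`) are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def image_gradient_magnitude(img):
    # Forward differences in x and y, padded back to the input size.
    dx = F.pad(img[:, :, :, 1:] - img[:, :, :, :-1], (0, 1, 0, 0))
    dy = F.pad(img[:, :, 1:, :] - img[:, :, :-1, :], (0, 0, 0, 1))
    return dx.abs() + dy.abs()

def gradient_weighted_l1(refined, ground_truth, left_image, lambda_grad=1.0):
    """L1 distance between refined and ground-truth disparity, up-weighted
    where the left intensity image has strong gradients (cf. Equation (1))."""
    weight = 1.0 + lambda_grad * image_gradient_magnitude(left_image)
    return (weight * (refined - ground_truth).abs()).mean()
```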
(2) A gradient-based smoothness term is added to propagate more reliable disparity values from image edges to the other areas in the image, under the assumption that the disparity of neighbouring pixels should be similar if their intensities are similar:

$$\mathcal{L}_{\mathrm{sm}}(R) = \mathbb{E}_{\tilde{x} \sim p_{\mathrm{fake}}(\tilde{x})} \Big[ \sum_{u} \sum_{v \in N(u)} \big| \tilde{x}_u - \tilde{x}_v \big| \, e^{-\lambda_{\mathrm{sm}} |\nabla I_{u \to v}|} \Big] \qquad (2)$$

where $\tilde{x}_u$ is the disparity value of a pixel $u$ in the refined disparity map $\tilde{x}$ from the refiner, and $\tilde{x}_v$ is the disparity value of a pixel $v$ in the neighbourhood $N(u)$ of pixel $u$. $\nabla I_{u \to v}$ is the gradient from pixel $u$ to $v$ in the left intensity image (the refined disparity map is produced on the left view). $\lambda_{\mathrm{sm}}$ controls how close the disparities must be when the intensities in the neighbourhood are similar.
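A matching PyTorch sketch of such an edge-aware smoothness term, restricted to horizontal and vertical neighbours and again assuming a single-channel intensity image, is shown below; the exponential discounting and the parameter name `lambda_sm` are illustrative.

```python
import torch

def edge_aware_smoothness(refined, left_image, lambda_sm=10.0):
    """Penalise disparity differences between horizontal and vertical neighbours,
    discounted where the left intensity image itself changes sharply (cf. Equation (2))."""
    d_dx = (refined[:, :, :, 1:] - refined[:, :, :, :-1]).abs()
    d_dy = (refined[:, :, 1:, :] - refined[:, :, :-1, :]).abs()
    i_dx = (left_image[:, :, :, 1:] - left_image[:, :, :, :-1]).abs()
    i_dy = (left_image[:, :, 1:, :] - left_image[:, :, :-1, :]).abs()
    # Strong image gradient -> small weight -> smoothness is relaxed at scene edges.
    return (d_dx * torch.exp(-lambda_sm * i_dx)).mean() + \
           (d_dy * torch.exp(-lambda_sm * i_dy)).mean()
```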
(3) The underlying assumption in $\mathcal{L}_{L_1}$ is that the disparity values of different pixels are independent, and the neighbourhood relationship in $\mathcal{L}_{\mathrm{sm}}$ is too simple to describe the real disparity relationship among neighbours. To help the refiner produce a disparity map whose disparity Markov Random Field is closer to the real distribution, the proposed method inputs disparity maps from the refiner and the ground truth into the discriminator, which outputs the probability that the input samples are from the same distribution as the ground truth. This probability is then used to update the refiner through its loss function. Instead of defining a global discriminator that classifies the whole disparity map, we define it to classify all local disparity patches separately, because any local disparity patch sampled from the refined disparity map should have similar statistics to a real disparity patch. Thus, by making the discriminator output probabilities at different receptive fields or scales ($D_1, D_2, \ldots, D_M$ in Figure 3), the refiner is forced to make the disparity distribution in the refined disparity map close to the real one. In Equations (3) and (4) below, $D_i(\cdot)$ is the probability at the $i$th scale that the input patch to the discriminator is from the real distribution:

$$\mathcal{L}_{\mathrm{gan}}(R, D_i) = \mathbb{E}_{x \sim p_{\mathrm{real}}(x)} \big[ \log D_i(x) \big] + \mathbb{E}_{\tilde{x} \sim p_{\mathrm{fake}}(\tilde{x})} \big[ \log \big( 1 - D_i(\tilde{x}) \big) \big] \qquad (3)$$
To avoid mode collapse during training and to alleviate other training difficulties, we have also investigated replacing $\mathcal{L}_{\mathrm{gan}}$ with the improved WGAN loss function [18], where $\lambda_{\mathrm{gp}}$ is the penalty coefficient (we set it to 0.0001 for all the experiments in this paper) and $\hat{x}$ are the random samples (for more details, please read [18]):

$$\mathcal{L}_{\mathrm{wgan}}(R, D_i) = \mathbb{E}_{x \sim p_{\mathrm{real}}(x)} \big[ D_i(x) \big] - \mathbb{E}_{\tilde{x} \sim p_{\mathrm{fake}}(\tilde{x})} \big[ D_i(\tilde{x}) \big] - \lambda_{\mathrm{gp}} \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}} \Big[ \big( \| \nabla_{\hat{x}} D_i(\hat{x}) \|_2 - 1 \big)^2 \Big] \qquad (4)$$
The experiments explore the difference in performance between these two GAN loss functions; in what follows we let $\mathcal{L}_{\mathrm{adv}}$ denote either $\mathcal{L}_{\mathrm{gan}}$ or $\mathcal{L}_{\mathrm{wgan}}$. The difference in performance between a single scale and multiple scales will also be explored.
(4) By feeding only the refined disparity map and its corresponding ground truth into the discriminator simultaneously at each training step, the discriminator is trained in a fully supervised manner, judging whether the two input disparity maps agree. In semi-supervised mode, we still feed the refined disparity map and its corresponding ground truth into the discriminator for the labeled data, but for the unlabeled data we feed its refined disparity map together with random samples from a small ground-truth dataset. In this way, the discriminator is taught to classify the input samples based on the disparity Markov Random Field and, in turn, the refiner is trained to produce a disparity Markov Random Field in the refined disparity map that is closer to the real case.
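As an illustration, the following hypothetical helper shows how the (real, fake) pairs fed to the discriminator could be assembled in the two modes; `refiner`, `gt_pool`, and the other names are assumptions, not our exact data pipeline.

```python
import torch

def discriminator_pairs(refiner, labeled_inputs, labeled_gt,
                        unlabeled_inputs=None, gt_pool=None):
    """Assemble (real, fake) disparity pairs for the discriminator.
    Supervised mode: fake = refined labeled disparity, real = its ground truth.
    Semi-supervised mode: additionally pair the refined unlabeled disparity
    with a random sample drawn from a small ground-truth pool."""
    pairs = [(labeled_gt, refiner(labeled_inputs))]
    if unlabeled_inputs is not None and gt_pool is not None:
        fake_u = refiner(unlabeled_inputs)
        idx = torch.randint(0, gt_pool.size(0), (fake_u.size(0),))
        pairs.append((gt_pool[idx], fake_u))
    return pairs
```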
(5) The combined loss function in the fully supervised learning approach is:

$$\min_{R} \max_{D_1, \ldots, D_M} \; \theta_1 \mathcal{L}_{L_1}(R) + \theta_2 \mathcal{L}_{\mathrm{sm}}(R) + \theta_3 \sum_{i=1}^{M} \mathcal{L}_{\mathrm{adv}}(R, D_i) \qquad (5)$$

where $M$ is the number of scales and $\theta_1$, $\theta_2$, $\theta_3$ are the weights for the different loss terms. In the fully supervised learning approach (see Equation (5)), we feed only the labeled data (denoted by $s_{\mathrm{label}}$). In the semi-supervised learning approach (see Equation (6)), in each iteration we feed one batch of labeled data (denoted by $s_{\mathrm{label}}$) and one batch of unlabeled data (denoted by $s_{\mathrm{unlabel}}$) simultaneously. For the labeled data $s_{\mathrm{label}}$, we calculate its $L_1$ loss (denoted by $\mathcal{L}_{L_1}^{\mathrm{label}}$), smoothness loss (denoted by $\mathcal{L}_{\mathrm{sm}}^{\mathrm{label}}$), and GAN loss (denoted by $\mathcal{L}_{\mathrm{adv}}^{\mathrm{label}}$). The input to the discriminator is the refined disparity map (denoted by $\tilde{x}_{\mathrm{label}}$) and the corresponding ground truth (denoted by $x_{\mathrm{label}}$); thus, the GAN loss for the labeled data is calculated using $\tilde{x}_{\mathrm{label}}$ and $x_{\mathrm{label}}$. For the unlabeled data $s_{\mathrm{unlabel}}$, we calculate only its GAN loss ($\mathcal{L}_{\mathrm{adv}}^{\mathrm{unlabel}}$) and neglect the other loss terms. The unlabeled data gets its refined disparity map (denoted by $\tilde{x}_{\mathrm{unlabel}}$) from the refiner; we then feed $\tilde{x}_{\mathrm{unlabel}}$ and random ground-truth samples into the discriminator to obtain the GAN loss for the unlabeled data. As our experimental results show, this approach allows the use of much less (expensive) labeled data in the semi-supervised method (Equation (6)) to achieve performance similar to the fully supervised method (Equation (5)), or better performance than the supervised method when using the same amount of labeled data plus additional unlabeled data. The combined loss function in the semi-supervised method is:

$$\min_{R} \max_{D_1, \ldots, D_M} \; \theta_1 \mathcal{L}_{L_1}^{\mathrm{label}}(R) + \theta_2 \mathcal{L}_{\mathrm{sm}}^{\mathrm{label}}(R) + \theta_3 \sum_{i=1}^{M} \Big[ \mathcal{L}_{\mathrm{adv}}^{\mathrm{label}}(R, D_i) + \mathcal{L}_{\mathrm{adv}}^{\mathrm{unlabel}}(R, D_i) \Big] \qquad (6)$$
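Reusing the per-term helpers sketched above, the refiner-side objectives of Equations (5) and (6) could be combined as below; the function names and the weight tuple `theta` are illustrative, and the per-scale adversarial terms are assumed to be precomputed.

```python
def supervised_refiner_loss(l1, smooth, adv_per_scale, theta=(1.0, 1.0, 1.0)):
    """Cf. Equation (5): weighted sum of the L1, smoothness, and multi-scale
    adversarial terms, computed on labeled data only."""
    t1, t2, t3 = theta
    return t1 * l1 + t2 * smooth + t3 * sum(adv_per_scale)

def semi_supervised_refiner_loss(l1_label, smooth_label, adv_label_per_scale,
                                 adv_unlabel_per_scale, theta=(1.0, 1.0, 1.0)):
    """Cf. Equation (6): the labeled terms of Equation (5) plus the adversarial
    term computed on the refined disparity of the unlabeled batch."""
    t1, t2, t3 = theta
    return (t1 * l1_label + t2 * smooth_label
            + t3 * (sum(adv_label_per_scale) + sum(adv_unlabel_per_scale)))
```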
3.3. Network Architectures
We adopt a fully convolutional neural network [23], and partial architectures from [16,22,24] are adapted here for the refiner and discriminator. The refiner and discriminator use dense blocks to increase local non-linearity, and transition layers change the size of the feature maps to reduce the time and space complexity [16]. In each dense block and transition layer, modules of the form ReLU-BatchNorm-convolution are used. We use two modules per dense block in the refiner and four modules per dense block in the discriminator, where the filter size is 3 × 3 and the stride is 1. The growth rate $k$ of each dense block is dynamic (unlike [16]). In each transition layer, we use only one module, where the filter size is 4 × 4 and the stride is 2 (except that in Tran.3 of the discriminator the stride is 1).
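As a sketch, a dense block and a transition layer with the ReLU-BatchNorm-convolution module order described above could look as follows in PyTorch; the class names, padding choices, and growth-rate handling are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch, kernel=3, stride=1):
    """One ReLU-BatchNorm-convolution module, the building unit of the
    dense blocks and transition layers."""
    return nn.Sequential(
        nn.ReLU(),
        nn.BatchNorm2d(in_ch),
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=(kernel - 1) // 2),
    )

class DenseBlock(nn.Module):
    """Dense block with n_modules 3x3 modules; each module sees the
    concatenation of the block input and all previous module outputs."""
    def __init__(self, in_ch, growth, n_modules):
        super().__init__()
        self.mods = nn.ModuleList(
            [conv_module(in_ch + i * growth, growth) for i in range(n_modules)])

    def forward(self, x):
        feats = [x]
        for m in self.mods:
            feats.append(m(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

def transition_layer(in_ch, out_ch, stride=2):
    """Single 4x4 module that changes the feature-map size between blocks."""
    return conv_module(in_ch, out_ch, kernel=4, stride=stride)
```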
Figure 2 shows the main architecture of the refiner, where $n_d$ initial disparity inputs (the experiments below use $n_d = 2$ disparity maps) and $n_i$ pieces of additional information (the experiments below use $n_i = 2$: the left intensity image and its gradient image) are concatenated as input to the generator. The batch size is $b$ and the input image resolution is $m \times n$ ($m$, $n$ are integers). $c$ is the number of feature-map channels after the first convolution. To reduce the computational complexity and increase the ability to extract local details, each dense block contains only 2 internal layers (i.e., the modules above). Additionally, skip connections [15] from earlier layers to later layers preserve local details so that information is not lost after the network bottleneck. During training, dropout is applied to the layers of the refiner after the bottleneck to avoid overfitting; dropout is disabled during testing to produce a deterministic result.
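This training/testing dropout behaviour can be illustrated with a toy post-bottleneck module in PyTorch, relying on the standard train/eval dropout semantics; the layer sizes and names are placeholders.

```python
import torch
import torch.nn as nn

class RefinerTail(nn.Module):
    """Toy stand-in for the post-bottleneck part of the refiner: dropout is
    active only in training mode, so inference is deterministic."""
    def __init__(self, channels=32, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Dropout2d(p_drop),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

tail = RefinerTail()
x = torch.randn(2, 32, 64, 64)
tail.train()                       # dropout enabled: stochastic outputs during training
tail.eval()                        # dropout disabled: deterministic disparity at test time
assert torch.equal(tail(x), tail(x))
```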
Figure 3 shows the discriminator. The discriminator is used only during training and is discarded during testing, so its architecture only influences the computational cost of training. The initial raw disparity maps, the additional information, and the real or refined disparity maps are concatenated and fed into the discriminator. Each dense block contains 4 internal layers (i.e., the modules above). The sigmoid function outputs the probability maps ($D_1, D_2, \ldots, D_M$) that a local disparity patch is real or fake at different scales, forcing the Markov Random Field of the refined disparity map to get closer to the real distribution at different receptive-field sizes.
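A toy PyTorch sketch of a trunk that emits sigmoid probability maps at several scales, in the spirit of $D_1, \ldots, D_M$ in Figure 3, is given below; the channel counts, strides, and class name are illustrative assumptions rather than the exact discriminator used here.

```python
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """Toy discriminator trunk that emits a sigmoid probability map at each
    scale; every map states, per local patch, how likely the patch is real."""
    def __init__(self, in_ch=4, base=16, n_scales=3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.heads = nn.ModuleList()
        ch = in_ch
        for _ in range(n_scales):
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, base, 4, stride=2, padding=1), nn.ReLU()))
            self.heads.append(nn.Sequential(
                nn.Conv2d(base, 1, 3, padding=1), nn.Sigmoid()))
            ch = base

    def forward(self, x):
        probs = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            probs.append(head(x))  # D_i: patch-level probability of being real
        return probs
```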