Multimodal Tucker Decomposition for Gated RBM Inference
Round 1
Reviewer 1 Report
In general, with some issues commented on below, the theoretical part of the paper seems technically sound. However, the experimental part of the paper has important problems in the design of the experiments, the clarity of their presentation, and the conclusions supported by them. Because of that, the recommendation is to reconsider it after major revision.
These are the main issues about the theoretical part:
- Sections 2, 3 and 4 have eight pages in total, which is a third of the paper, but they are just state of the art, which can be found in books. They should be shortened, particularly Sections 2 and 3.
- In Subsection 5.3, the entire discussion that includes expressions (29)-(40), around two pages, finishes with the conclusion that “with these results we cannot form update rules for the model’s parameters”. The inclusion of all of this mathematical apparatus is not justified. The next part of 5.3 and the next Subsection 5.4 do not seem to have any relationship.
However, the main problems of the proposed manuscript arise from the experimental part:
- The experiments included in Section 6 are designed only to show the convergence of the parameters of the network when it has to learn some artificial random transformation applied to random images. Convergence is an issue, but it is necessary to show whether the final parameters correspond to the actual transformations, i.e., whether the final values are useful for this task.
- Anyway, the results are not appropriately explained. It even seems there are inaccuracies. The sentence “For example, if factor matrix A is of shape 120 x 69, then 120 squares of shape 13 x 13 are displayed” is a clear example.
- There is no comparison with the results of state-of-the-art methods, which is essential. There is only one figure (Fig. 8) comparing the filter matrices of the proposed algorithm with a filter matrix from [10], but the discussion is vague and qualitative; a quantitative comparison is needed.
- Figure 12 is not clear. It shows the dependence of the loss energy on the batch number, both for training and test. What is the meaning of batch number for the test set? Even the acronym GBM is not defined.
- Figure 13 is not clear either. It is described and discussed too briefly, in just six lines. Why is the curve corresponding to smaller models higher and shorter?
Author Response
In general, with some issues commented on below, the theoretical part of the paper seems technically sound. However, the experimental part of the paper has important problems in the design of the experiments, the clarity of their presentation, and the conclusions supported by them. Because of that, the recommendation is to reconsider it after major revision.
These are the main issues about the theoretical part:
Sections 2, 3 and 4 have eight pages in total, which is a third of the paper, but they are just state of the art, which can be found in books. They should be shortened, particularly Sections 2 and 3.
Answer: Sections 2 and 3 have been shortened.
In Subsection 5.3, the entire discussion that includes expressions (29)-(40), around two pages, finishes with the conclusion that “with these results we cannot form update rules for the model’s parameters”. The inclusion of all of this mathematical apparatus is not justified.
Answer: Subsection removed from the paper.
The next part of 5.3 and the next Subsection 5.4 do not seem to have any relationship.
Answer: Section 5 has been updated and reordered.
However, the main problems of the proposed manuscript arise from the experimental part:
The experiments included in Section 6 are designed only to show the convergence of the parameters of the network when it has to learn some artificial random transformation applied to random images. Convergence is an issue, but it is necessary to show whether the final parameters correspond to the actual transformations, i.e., whether the final values are useful for this task.
Answer: Section 6 has been changed and updated.
Anyway, the results are not appropriately explained. It even seems there are inaccuracies. The sentence “For example, if factor matrix A is of shape 120 x 69, then 120 squares of shape 13 x 13 are displayed” is a clear example.
Answer: Text changed from “120 x 69” to “120 x 169”.
There is no comparison with the results of state-of-the-art methods, which is essential. There is only one figure (Fig. 8) comparing the filter matrices of the proposed algorithm with a filter matrix from [10], but the discussion is vague and qualitative; a quantitative comparison is needed.
Answer: Section 6 has been changed. Starting at line 348 (page 17), we provide a qualitative and quantitative comparison of the factored vs. the unfactored model.
Figure 12 is not clear. It shows the dependence of the loss energy on the batch number, both for training and test. What is the meaning of batch number for the test set?
Answer: The image and its explanation have been updated.
Even the acronym GBM is not defined.
Answer: GBM was a proposed acronym for gated RBM, but it was decided to use gated RBM instead. The text was changed.
Figure 13 is not clear either. It is described and discussed too briefly, in just six lines. Why is the curve corresponding to smaller models higher and shorter?
Answer: Explanation of the experiment and the image was added.
Author Response File: Author Response.pdf
Reviewer 2 Report
This paper describes a method to reduce the computational complexity of a variant of restricted Boltzmann machine, based on Tucker decomposition.
The training algorithm is described as well.
The relation of Sections 3.2-3.3 with the Tucker decomposition is not clear to me. Is it another way to reduce complexity? Is it a known approach or a newly developed one? The first phrase of Section 3.2 (“...we follow the same principle...”) makes me think that it is a new technique developed by the authors. Or not? Please explain.
Perhaps there is too much preliminary material before the proposed strategy is described, starting from Section 5!
It is hard for me to see how much the complexity is reduced. In other terms, what are the typical values of n_i, n_j, n_k? In Section 5.1 the authors consider values around 2000, while in Section 5.2, row 236, they consider values around 256. In the latter case the number of free parameters decreases from 10^6 to 10^5, but what is the reduction in the first case?
Experimental results, Figs. 7-11, are not compared with other approaches, not even with Memisevic and Hinton, who furnish the data.
Figure 13, which is very important, is not adequately discussed.
Minor comments
Row 61: typo in “Boltzmann”.
Section 5.3: in (29) and (30) the derivatives are indicated as \partial L / \theta; perhaps you should write \partial L / \partial \theta?
Section 5.4: '....belongs to the more general class of energy-based model EBM...' --> model should be models. Moreover, after EBM it is better to put a reference.
row 366: what is GBM?
Figures 12 and 13 should be redrawn with thicker lines. Moreover, the legends should be clearer.
Throughout the paper, the authors refer to equations with a bare number, for example in row 101: 'Using 1 and 4'. Personally, I find this quite disturbing. Why don't you write (1) and (4)?
Author Response
This paper describes a method to reduce the computational complexity of a variant of restricted Boltzmann machine, based on Tucker decomposition.
The training algorithm is described as well.
The relation of Sections 3.2-3.3 with the Tucker decomposition is not clear to me. Is it another way to reduce complexity? Is it a known approach or a newly developed one? The first phrase of Section 3.2 (“...we follow the same principle...”) makes me think that it is a new technique developed by the authors. Or not? Please explain.
Answer: This is not a new technique, but a new model using the Tucker decomposition. We are using a Contrastive Divergence technique for the refactored model. Section 5 has been updated and refactored to make the explanation of the developed technique easier.
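For readers of this response, a minimal sketch of what one such CD-1 step looks like under a Tucker factorization W_ijk = \sum_abc G_abc A_ia B_jb C_kc; the function and variable names are our shorthand, and the paper's actual implementation may differ:

    import torch

    def cd1_step(x, y, A, B, C, G, lr=1e-3):
        # One Contrastive Divergence (CD-1) step for a gated RBM whose
        # three-way weight tensor is Tucker-factored as
        # W_ijk = sum_abc G_abc A_ia B_jb C_kc.
        fx, fy = A.t() @ x, B.t() @ y  # project the two visible layers

        # Positive phase: hidden probabilities given the data pair (x, y).
        h0 = torch.sigmoid(C @ torch.einsum('abc,a,b->c', G, fx, fy))

        # Negative phase: sample h, reconstruct y, re-infer h.
        hs = torch.bernoulli(h0)
        y1 = torch.sigmoid(B @ torch.einsum('abc,a,c->b', G, fx, C.t() @ hs))
        h1 = torch.sigmoid(C @ torch.einsum('abc,a,b->c', G, fx, B.t() @ y1))

        # Gradient for the core tensor; A, B and C update analogously.
        pos = torch.einsum('a,b,c->abc', fx, fy, C.t() @ h0)
        neg = torch.einsum('a,b,c->abc', fx, B.t() @ y1, C.t() @ h1)
        G += lr * (pos - neg)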
Perhaps there is too much preliminary material before the proposed strategy is described, starting from Section 5!
Answer: Sections 2 and 3 have been shortened.
It is hard for me to see how much the complexity is reduced. In other terms, what are the typical values of n_i, n_j, n_k? In Section 5.1 the authors consider values around 2000, while in Section 5.2, row 236, they consider values around 256. In the latter case the number of free parameters decreases from 10^6 to 10^5, but what is the reduction in the first case?
Answer: There are no typical values for n_i, n_j, n_k; they depend on the application.
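As a back-of-the-envelope illustration of how the application-dependent dimensions drive the saving (the ranks below are purely hypothetical, not values from the paper):

    def tucker_param_counts(n_i, n_j, n_k, r_a, r_b, r_c):
        full = n_i * n_j * n_k                          # unfactored tensor W
        tucker = (r_a * r_b * r_c                       # core tensor G
                  + n_i * r_a + n_j * r_b + n_k * r_c)  # factors A, B, C
        return full, tucker

    # With dimensions around 2000 (Section 5.1) and illustrative ranks of 50,
    # the count drops from 8e9 to about 4.25e5:
    print(tucker_param_counts(2000, 2000, 2000, 50, 50, 50))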
Experimental results, Figs. 7-11, are not compared with other approaches, not even with Memisevic and Hinton, who furnish the data.
Answer: Section 6 has been changed. Starting at line 348 (page 17), we provide a qualitative and quantitative comparison of the factored vs. the unfactored model.
Figure 13, which is very important, is not adequately discussed.
Answer: Text was changed to address this.
Minor comments
Row 61: typo in “Boltzmann”.
Answer: Text changed, thank you!
Section 5.3: in (29) and (30) the derivatives are indicated as \partial L / \theta; perhaps you should write \partial L / \partial \theta?
Answer: The formula was changed, thank you.
Section 5.4: '....belongs to the more general class of energy-based model EBM...' --> model should be models. Moreover, after EBM it is better to put a reference.
Answer: Text was changed, thank you.
row 366: what is GBM?
Answer: GBM was a proposed acronym for gated RBM, but it was decided to use gated RBM instead. The text was changed.
Figures 12 and 13 should be redrawn with thicker lines. Moreover, the legends should be clearer.
Answer: Images were changed.
Throughout the paper, the authors refer to equations with a bare number, for example in row 101: 'Using 1 and 4'. Personally, I find this quite disturbing. Why don't you write (1) and (4)?
Answer: As stated in https://www.mdpi.com/journal/applsci/instructions:
“Supplementary Materials: Describe any supplementary material published online alongside the manuscript (figure, tables, video, spreadsheets, etc.). Please indicate the name and title of each element as follows Figure S1: title, Table S1: title, etc.”
To be consistent with Figures and Tables, the Equations are not written inside parentheses.
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper looks at Restricted Boltzmann Machines (RBMs), proposing gated connections and looking at situations where tensors decompose into products of smaller tensors, hence reducing the number of parameters (i.e., in the layer interconnect).
First of all, the topic is very timely.
The paper is quite long, and while it covers the introduction to RBMs, it does so pretty well. There is, however, a minus sign missing in the equation in line 89 (it should be Z(\theta) = \sum \exp(-E(v,h)), for \tau = 1).
Note that while I have read pretty much everything, I have not checked the maths in great depth. To do so is impossible in the short week we're given to review the papers.
The paper relies a lot on a couple of main references, like Hinton's for RBMs and Kolda's for the Tucker decomposition. However, as the former is quite widely available even in textbooks, and the latter is also available, there is no problem with this.
There are some minor English issues here and there, but nothing that hinders understanding in any way.
The line numbers don't seem to apply consistently; for example, half of page 14 does not have line numbers.
Some general comments:
In Fig. 12, the ordinate (“y axis”) spans a very narrow range: from 0.00285 to 0.00245 is a decrease of only 14% (4 out of 28). Fig. 13 also seems to ask more questions than it answers. The energy of the smaller tensor (green line) converges to a higher value than the others. In contrast, the largest one is said to take longer to train, but this is not at all obvious, since it starts pretty much at its minimum; you could have stopped it sooner than the small one.
The examples do not seem terribly interesting: shifted random images (6.1) may be good in theory, but the paper proposes two things, gated connections in RBMs and (mainly) the Tucker decomposition, so this is what we expect to be demonstrated, perhaps compared with non-gated/full-tensor implementations.
A more serious problem, though, is that there is no reference to the code. The authors mention using PyTorch but if their code is not published, how is the reader supposed to replicate the work? We are not in the 20th century any more; code is part of publications in computational sciences. If the code were made available, it would be easier for the reader to explore the work and evaluate the results. While I would like to see section 6 strengthened, it would be more important to publish the code.
Author Response
This paper looks at Restricted Boltzmann Machines (RBMs), proposing gated connections and looking at situations where tensors decompose into products of smaller tensors, hence reducing the number of parameters (i.e., in the layer interconnect).
First of all, the topic is very timely.
The paper is quite long, and while it covers the introduction to RBMs, it does so pretty well. There is, however, a minus sign missing in the equation in line 89 (it should be Z(\theta) = \sum \exp(-E(v,h)), for \tau = 1).
Answer: Fixed. Thank you!
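For the record, the corrected normalization in the reviewer's notation, with the sum assumed to run over all visible and hidden configurations:

    Z(\theta) = \sum_{v,h} \exp(-E(v,h)/\tau),

which reduces to \sum_{v,h} \exp(-E(v,h)) for \tau = 1.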
Note that while I have read pretty much everything, I have not checked the maths in great depth. To do so is impossible in the short week we're given to review the papers.
The paper relies a lot on a couple of main references, like Hinton's for RBMs and Kolda's for the Tucker decomposition. However, as the former is quite widely available even in textbooks, and the latter is also available, there is no problem with this.
There are some minor English issues here and there, but nothing that hinders understanding in any way.
The line numbers don't seem to apply consistently; for example, half of page 14 does not have line numbers.
Answer: The MDPI template was used (https://www.mdpi.com/authors/latex), but apparently some line numbers are not generated when using formulas. We need to check the final version.
Some general comments:
In Fig. 12, the ordinate (“y axis”) spans a very narrow range: from 0.00285 to 0.00245 is a decrease of only 14% (4 out of 28).
Answer: Figure was changed.
Fig. 13 also seems to ask more questions than it answers. The energy of the smaller tensor (green line) converges to a higher value than the others. In contrast, the largest one is said to take longer to train, but this is not at all obvious, since it starts pretty much at its minimum; you could have stopped it sooner than the small one.
Answer: The experiment and the image were updated with their explanation.
The examples do not seem terribly interesting: shifted random images (6.1) may be good in theory, but the paper proposes two things, gated connections in RBMs and (mainly) the Tucker decomposition, so this is what we expect to be demonstrated, perhaps compared with non-gated/full-tensor implementations.
Answer: Section 6 has been changed. Starting at line 348 (page 17), we provide a qualitative and quantitative comparison of the factored vs. the unfactored model.
A more serious problem, though, is that there is no reference to the code. The authors mention using PyTorch but if their code is not published, how is the reader supposed to replicate the work? We are not in the 20th century any more; code is part of publications in computational sciences. If the code were made available, it would be easier for the reader to explore the work and evaluate the results. While I would like to see section 6 strengthened, it would be more important to publish the code.
Answer: Code available at https://github.com/macma80/tucker_gatedrbm
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Although the theoretical part of the paper has been clearly improved, there are still some important problems in the experimental part:
- There are no comparisons with other state-of-the-art methods.
- Figure 10 shows that the loss energy does not decrease with batch number in testing. Of course, after training the energy loss always remains stable; batch number does not have any meaning for testing.
- Figure 14 is very confusing. The authors say that “at first sight the energy loss of the factored model looks like no learning is taking place”, but this is not clearly seen in the figure. On the other hand, why is the energy loss constant for the unfactored model? Why is the number of batches much smaller than for the unfactored model?
Author Response
Although the theoretical part of the paper has been clearly improved, there are still some important problems in the experimental part:
There are no comparisons with other state-of-the-art methods.
Answer: From lines 360 to 397, we present a qualitative and quantitative comparison with the work presented in [1], which is indeed the state-of-the-art method. The literature on gated RBMs is not very extensive: the idea was first introduced in [1] in 2007. The same authors present a different factorization for the gated RBM in [2].
[1] Memisevic, R., & Hinton, G. (2007, June). Unsupervised learning of image transformations. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
[2] Memisevic, R., & Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural computation, 22(6), 1473-1492.
Other publications we found only present different applications of the model, for texture modeling [3], classification [4], and rotation representations [5]:
[3] Hao, T., Raiko, T., Ilin, A., & Karhunen, J. (2012, September). Gated Boltzmann machine in texture modeling. In International Conference on Artificial Neural Networks (pp. 124-131). Springer, Berlin, Heidelberg.
[4] Sorokin, I. (2015). Classification factored gated restricted Boltzmann machine. Proceedings of AALTD 2015, 131.
[5] Giuffrida, M. V., & Tsaftaris, S. A. (2016). Theta-rbm: Unfactored gated restricted boltzmann machine for rotation-invariant representations. arXiv preprint arXiv:1606.08805.
Figure 10 shows that the loss energy does not decrease with batch number in testing. Of course, after training the energy loss always remains stable; batch number does not have any meaning for testing.
Answer: The figure shows the energy loss for training decreasing in each iteration of 32 samples (batch). The training process is stopped when the energy loss stops decreasing and does not change significantly. We set this threshold at 0.00001, meaning that there is no extra value in further training. The fact that the energy loss remains stable on the testing datasets confirms that the model was trained and stopped correctly. The text analyzing Figure 10 was changed and made more explicit.
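A minimal sketch of the stopping rule just described (the helper name is ours; the 0.00001 threshold is the one quoted above):

    def should_stop(batch_losses, threshold=1e-5):
        # Stop once the energy loss no longer changes significantly
        # between consecutive 32-sample batches.
        return (len(batch_losses) >= 2
                and abs(batch_losses[-1] - batch_losses[-2]) < threshold)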
Figure 14 is very confusing. The authors say that “at first sight the energy loss of the factored model looks like no learning is taking place”, but this is not clearly seen in the figure. On the other hand, why is the energy loss constant for the unfactored model? Why is the number of batches much smaller than for the unfactored model?
Answer: Figure 14 and its accompanying text have been changed. The energy loss is not constant in the refactored model; we split the figure to show both cases side by side. In both cases we used early stopping when training. The Tucker-refactored model converges faster (in fewer iterations) than the full, unfactored model.
Author Response File: Author Response.pdf
Round 3
Reviewer 1 Report
The new version does not address the previous concerns:
- Comparison with state-of-the-art methods is not convincing. It is not clear whether the method of Memisevic et al. corresponds to the unfactored model. The caption of the figure which illustrates this model refers to parts (a) and (b), but they are not seen in the figure. In any case, the results shown in the figure are not convincing, and they are not discussed. Finally, the figure which compares the energy loss of the refactored and unfactored methods is not convincing either. It is not clear that the refactored model converges faster.
- Batch number does not have any meaning for testing; the study of convergence in testing makes no sense.
- Figure 14 remains confusing. The convergence of the refactored model is not clear.
Author Response
Comparison with state-of-the-art methods is not convincing. It is not clear whether the method of Memisevic et al. corresponds to the unfactored model. The caption of the figure which illustrates this model refers to parts (a) and (b), but they are not seen in the figure. In any case, the results shown in the figure are not convincing, and they are not discussed.
Answer: Figure 13 was changed and it is discussed in lines 374 to 383.
Finally, the figure which compares the energy loss of the refactored and unfactored methods is not convincing either. It is not clear that the refactored model converges faster.
Answer: The figure was divided into Figures 14 and 15. They are discussed in lines 384 to 392 and in lines 393 to 402.
Batch number does not have any meaning for testing; the study of convergence in testing makes no sense.
Answer: The figures were changed to show loss vs. epochs; their corresponding discussions were updated.
Figure 14 remains confusing. The convergence of the refactored model is not clear.
Answer: The figure was divided into Figures 14 and 15. They are discussed in lines 384 to 392 and in lines 393 to 402.
Author Response File: Author Response.pdf