Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient
Round 1
Reviewer 1 Report
The paper looks much better. The comments have been taken into account.
Author Response
We have made some improvements.
Reviewer 2 Report
This paper presents an analysis of the cost of building a translation model in terms of its carbon output. The paper makes a clear point and has some strength as a position paper. However, the experiments presented in the article are not very complete and do not present much novelty. One major issue is that the teacher model is not significantly better than the baselines, so there is no real analysis of the trade-off between CO2 cost and model accuracy. As such, the results in this paper tell a story about performing hyperparameter tuning to find good settings that avoid inefficient models, rather than about the benefits of using smaller models. Given the significant effort and cost that seem to have been expended on this paper, it would probably be best for this paper (and the planet) if the authors did small-scale experiments to better validate the effectiveness of reducing model sizes in MT. Still, the paper raises an important point and the results are mostly sound, so I could see value in publishing the manuscript as it is.
In Table 8, the authors observe that using 2 GPUs is actually faster than 4 GPUs. I wonder whether a better explanation can be given for this result, as it seems very anomalous. I would also question whether this gives an accurate estimation of CO2 for this case, as it seems likely that a concurrency issue would have left one or more GPUs idle for a significant part of the "wall-time" computation.
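For concreteness, emissions estimates of this kind are usually computed as wall-clock time multiplied by the number of GPUs, an assumed per-GPU power draw, a data-centre PUE, and a grid carbon intensity. The sketch below uses placeholder values (none taken from the paper) and illustrates how one effectively idle GPU would shift the estimate.

# Illustrative sketch only; all constants are assumptions, not values from the paper.
def co2_kg(wall_time_h, n_gpus_busy, gpu_power_kw=0.25, pue=1.58, grid_kg_per_kwh=0.475):
    """Rough CO2-eq estimate (kg) from wall-clock training time."""
    energy_kwh = wall_time_h * n_gpus_busy * gpu_power_kw * pue
    return energy_kwh * grid_kg_per_kwh

print(co2_kg(wall_time_h=10, n_gpus_busy=4))  # all four GPUs assumed fully busy
print(co2_kg(wall_time_h=10, n_gpus_busy=3))  # one GPU effectively idle due to a concurrency issue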
I don't think "Eq (2)" is the correct style. Lose the parentheses.
l150 "IN addition"
l273, "In this work,"
Author Response
Please see the attached document
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The paper has been improved by the addition of more experiments. Overall I think this is an improvement, but I have a few minor comments. Firstly, the way Tables 6 and 7 have been included means that you have to jump between these tables and Table 5 to understand the results, so the presentation could be clearer. While it is nice to see the results of multiple runs with the same hyperparameters, I wonder whether it would not make more sense to present only average results with error bars.
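As a minimal illustration of the suggested reporting style (the scores below are placeholders, not results from the paper), the per-run values could be collapsed into a mean with a standard deviation:

import statistics

runs = {
    "baseline": [27.1, 27.4, 26.9],   # hypothetical BLEU scores over three runs
    "student":  [26.8, 27.0, 26.7],
}

for system, scores in runs.items():
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)
    print(f"{system}: {mean:.2f} +/- {spread:.2f} BLEU over {len(scores)} runs")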
The authors' explanation for the difference in performance between the 2-GPU and 4-GPU settings in the response letter is very interesting, but it does not seem to have made it into the paper! I wonder: if saving the parameters is such a cost, could models be more efficient (and thus output less CO2) if they avoided this step? I would like the authors to comment on this in more detail in the manuscript.
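To make the point concrete, a sketch of what "avoiding this step" could look like in a generic PyTorch training loop is given below (this is not the authors' code; the model, interval, and file names are hypothetical): saving a checkpoint only every few epochs rather than after every epoch reduces the I/O overhead attributed to parameter saving.

import torch
import torch.nn as nn

model = nn.Linear(512, 512)                 # stand-in for the actual NMT model
optimizer = torch.optim.Adam(model.parameters())
SAVE_EVERY = 5                              # hypothetical checkpoint interval (epochs)

for epoch in range(1, 21):
    # ... one training epoch would run here ...
    if epoch % SAVE_EVERY == 0:             # skip the per-epoch save
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"checkpoint_epoch{epoch}.pt")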
Author Response
Please see document attached
Author Response File: Author Response.pdf
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
This paper describes a comparison between the original transformer and a smaller distilled model. The results show that the smaller model can be faster and more energy efficient. However, I found the paper less interesting because most of its content could be easily derived from previous research. I think the paper would be improved if it contained a more thorough comparison between different KD approaches and models. More interesting and fresh ideas are expected as well.
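To indicate the kind of comparison meant here, the sketch below (not the authors' code; the tensors and temperature are placeholders) contrasts word-level KD, which matches the teacher's per-token distribution, with sequence-level KD, which simply trains the student on translations decoded by the teacher:

import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Sequence-level KD, by contrast, needs no special loss:
# 1. decode the training source sentences with the teacher (e.g. beam search),
# 2. train the student with ordinary cross-entropy on those teacher translations.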
It is quite strange that the model with 2 GPUs translates faster than the model with 4 GPUs. This result should be investigated and verified.
The paper is easier to understand, but its writing could be improved. For example:
I suggest clearly stating the differences between the models and their parameters, instead of listing the script itself.
It is not very clear how the baseline models differ from the teacher models. Around line 116, it is mentioned that the training of the baseline is limited to 20 epochs, but I am not sure whether this is the only difference. (Judging from the description of the student models, it seems that the baseline models are smaller than the teacher model.)
Around line 94, it should be stated clearly how the models will be evaluated, in terms of training time, number of GPUs, number of parameters, cost, CO2 emissions, etc.
Minors:
line 77: inorder -> in order
line 94: of all of all -> of all
line 95: namel y -> namely
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors analyze sequence-level knowledge distillation for model size reduction. Different baseline models, teacher models, and student models are evaluated. The Impact section is presented very well.
I recommend taking into account the following suggestions:
Fig. 2 should be transformed into a table.
This is not understandable: what do "baseline", "teacher model", and "student model" mean? What are the parameters and hyperparameters of these models? Can you present the models in more detail?
It would be great to add the limitations of the current work.
Please add the authors' contributions to the Introduction section.
It would be great to add a comparison between the model proposed in Currey et al. and the authors' approach. What is the difference between these approaches?
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors did make some improvements in response to the previous reviews. However, there are no clear technical contributions or interesting new ideas presented. Thus, I still lean toward rejecting this paper.
Reviewer 2 Report
The authors took the comments into account. The paper is presented in a much better form.