MemBox: Shared Memory Device for Memory-Centric Computing Applicable to Deep Learning Problems
Round 1
Reviewer 1 Report
In this manuscript, the authors propose a shared memory device called MemBox and evaluate an FPGA-based prototype of it. The background is described comprehensively, and the device is well implemented and explained.
The largest weakness of the manuscript is that the difference from the Parameter Server mechanism of TensorFlow is not properly described. The authors argue that "(snip) parameter server is centralized in the distributed system. But, as node number is increased, bottleneck can occur in the parameter server" (Lines 47-49). However, the reviewer cannot find any evidence that this does not also apply to the MemBox device. As a result, the experiment in Section 6.2 makes the MemBox device look like simply "a faster Parameter Server," rather than something scalable in principle. (The reviewer thinks the novelty of the manuscript would not be lost by viewing the proposed system as a more efficient platform for the Parameter Server mechanism, though.)
The reviewer's suggestion is EITHER of the following.
- If the proposed system is scalable in principle, in other words, if it has a mechanism to avoid centralization, this should be properly explained.
- Otherwise, because the word "scalability" in Section 6.2 might mislead readers, it should be replaced with a more appropriate word (performance, training time, or something similar).
In addition, Table 1 needs a proper caption. It seems that the caption from the template was never replaced.
Author Response
The largest weakness of the manuscript is that the difference from the Parameter Server mechanism of TensorFlow is not properly described. The authors argue that "(snip) parameter server is centralized in the distributed system. But, as node number is increased, bottleneck can occur in the parameter server" (Lines 47-49). However, the reviewer cannot find any evidence that this does not also apply to the MemBox device. As a result, the experiment in Section 6.2 makes the MemBox device look like simply "a faster Parameter Server," rather than something scalable in principle. (The reviewer thinks the novelty of the manuscript would not be lost by viewing the proposed system as a more efficient platform for the Parameter Server mechanism, though.)
We accept your comment. The MemBox system logically acts as a parameter server: we redesigned the software-based parameter server as a hardware-based shared memory system. We have reflected your comment in lines 36-60.
The reviewer's suggestion is EITHER of the following.
- If the proposed system is scalable in principle, in other words, if it has a mechanism to avoid centralization, this should be properly explained.
- Otherwise, because the word "scalability" in Section 6.2 might mislead readers, it should be replaced with a more appropriate word (performance, training time, or something similar).
We chose option 2 and removed the word "scalability" from Section 6.2.
In addition, Table 1 needs a proper caption. It seems that the caption from the template was never replaced.
The caption has been fixed.
Author Response File: Author Response.pdf
Reviewer 2 Report
MemBox: Shared Memory Device for Memory Centric Computing applicable to Deep Learning Problem
From reading the article, it is not clear whether it intends to introduce MemBox for the first time, or whether MemBox was already designed and this article only reviews it and then runs an experiment. The sections of the manuscript describing the hardware and architecture, in particular Section 5, read like a datasheet rather than a research paper. There is not enough discussion and analysis explaining why these design choices were made.
Section 6 assesses the performance of MemBox by comparing the time it takes to run a training application in different memory access scenarios. A comparison with Parameter Server is then given. This single case study is not enough for an assessment. The authors need to extend the case studies to cover various possible scenarios and provide in-depth analysis, comparison, and justification of why this system is preferable to the others.
Author Response
From reading the article, it is not clear whether it intends to introduce MemBox for the first time, or whether MemBox was already designed and this article only reviews it and then runs an experiment.
We introduce MemBox for the first time, and we now state this in lines 51-53. Logically, MemBox is a hardware-based implementation of Google's Parameter Server concept, but it has advantages over a software-based Parameter Server: no software stack such as a server daemon is needed, and it can be accessed by simple memory access or DMA.
The sections of the manuscript describing the hardware and architecture, in particular Section 5, read like a datasheet rather than a research paper. There is not enough discussion and analysis explaining why these design choices were made.
We reorganized Section 4 and Section 5. Section 4 now focuses on the architecture and concepts, and Section 5 focuses on detailed implementation considerations under the design constraints (FPGA or development board limits). The discussion is included in Section 4.
Section 6 assesses the performance of MemBox by comparing the time it takes to run a training application in different memory access scenarios. A comparison with Parameter Server is then given. This single case study is not enough for an assessment. The authors need to extend the case studies to cover various possible scenarios and provide in-depth analysis, comparison, and justification of why this system is preferable to the others.
We cannot run further tests because we no longer have access to the servers (the project funding has ended and the project team, that is, the authors, has been disbanded). We can only run two-node tests, and those can only be latency tests.
However, after submitting our paper, we ran another test that expanded the MemBox connection to up to 8 nodes. We designed and implemented a repeater, tested 8 nodes with it, and obtained measurement results. We tested multi-node MNIST, running 100 epochs on each node, and obtained good results. We have included the results with the repeater in Section 6.3 and propose design considerations for the repeater.
Reviewer 3 Report
This paper describes the development of MemBox shared memory targeting distributed systems.
The paper is interesting. However, the writing style should be improved. As an example, in the first two lines of the Abstract, the word “problems” is repeated three times. Moreover, there are contracted verb forms and sentences that should be rewritten (41-43, 45-48, 204-206, 219-220, …). The authors refer to their work as a “thesis” on line 66.
Concerning the paper organization, the description in Section 4 seems disconnected from the implementation in Section 5. The authors should explain how the concepts described in Section 4 are reflected in the implementation of Section 5.
The experimental results should include more tests! The authors describe only two tests. The first one is not a relevant test since it includes only two nodes. The second test is more interesting since it considers four nodes. However, only a single deep learning architecture is tested! How does the MemBox system perform with other (and bigger) networks? The authors are encouraged to test their system with other deep learning methods.
Author Response
The paper is interesting. However, the writing style should be improved. As an example, in the first two lines of the Abstract, the word “problems” is repeated three times. Moreover, there are contracted verb forms and sentences that should be rewritten (41-43, 45-48, 204-206, 219-220, …). The authors refer to their work as a “thesis” on line 66.
The writing has been revised throughout.
Concerning the paper organization, the description in Section 4 seems disconnected from the implementation in Section 5. The authors should explain how the concepts described in Section 4 are reflected in the implementation of Section 5.
We reorganized Section 4 and Section 5. Section 4 now focuses on the architecture and concepts, and Section 5 focuses on detailed implementation considerations under the design constraints (FPGA or development board limits). The discussion of the concepts is included in Section 4.
The experimental results should include more tests! The authors describe only two tests. The first one is not a relevant test since it includes only two nodes. The second test is more interesting since it considers four nodes. However, only a single deep learning architecture is tested! How does the MemBox system perform with other (and bigger) networks? The authors are encouraged to test their system with other deep learning methods.
We cannot test any additional DNN networks because we no longer have access to the servers (the project funding has ended and the project team, that is, the authors, has been disbanded). We can only run two-node tests, and those can only be latency tests.
However, after submitting our paper, we ran another test that expanded the MemBox connection to up to 8 nodes. We designed and implemented a repeater, tested 8 nodes with it, and obtained measurement results. We tested multi-node MNIST, running 100 epochs on each node, and obtained good results. We have included the results with the repeater in Section 6.3 and propose design considerations for the repeater.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The changes made are satisfactory. I recommend that the authors thoroughly edit their manuscript and improve it.
Reviewer 3 Report
The authors addressed all my comments.