An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
Round 1
Reviewer 1 Report
This is a very interesting paper on the design of specialty hardware to improve the power efficiency of convolutional neural networks. I have a broad background in CNNs, but not in hardware. Obviously a great deal of work went into this research. Overall, I think the paper could be improved by reducing its scope, focusing on a couple of your key findings, and describing those findings in clear, reproducible detail.
In the background, I think you need to discuss GPUs and the Google TPU. These are the standard CNN processors, and their designers have worked over the years to reduce power.
In the three-bullet section on your contributions, I found your description unclear. I don't know what you mean by "coarse-grain task partitioning". Is this for a chip design?
Section 2.1 needs more detail. You need more than a paragraph to describe a CNN. The design of a CNN is highly dependent on the data you are using. It seems that you are oversimplifying this in your assumptions.
The accuracy of low bit-width CNNs is highly dependent on the data you are using. If your data is simple, this can work very well, but if your data is complex, your model will probably not converge with this type of approach.
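The reviewer's concern can be made concrete with a minimal uniform-quantization sketch (the `quantize` function and bit widths below are illustrative assumptions, not the quantization scheme used in the paper): as the bit width shrinks, the representation error grows sharply, and on complex data this error can keep a model from converging.

```python
import random

def quantize(values, bits):
    """Uniform symmetric quantization to the given bit width (illustrative)."""
    levels = 2 ** (bits - 1) - 1   # positive levels: 1 for 2-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
# Quantization error grows monotonically as the bit width shrinks.
errors = {bits: mse(weights, quantize(weights, bits)) for bits in (8, 4, 2)}
```

On Gaussian-distributed weights like these, dropping from 8 to 2 bits increases the mean squared error by several orders of magnitude, which is the accuracy risk the reviewer is pointing at.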
You make a huge jump in section 3. Your four points look like they belong in the conclusion, not in the description of the architecture. The description of the architecture is very hard to follow. I think you should focus on just the novelty of your approach, not a point-by-point description of the architecture.
3.2.1 needs much more description; I don't understand what you are trying to show in this section.
I am not sure what Table 2 means; maybe a figure would be better.
Sections 4.1.1 and 4.1.2 are very confusing. These look like general formulas, not computational complexity (big-O notation) or performance measures.
4.2 You need to describe how you derived your model; it should be reproducible.
4.2.2 The typical measure of a CNN is accuracy on a given dataset. The accuracy needs to be measured as well as the power. Low power with poor accuracy does not help the problem.
Author Response
Thank you sincerely for your comments and suggestions. I have provided a point-by-point response and uploaded it as a PDF file.
Author Response File: Author Response.pdf
Reviewer 2 Report
The paper presents an architecture for low-power, high-throughput CNNs. While the results are impressive in terms of efficiency, there are a few concerns with the results.
-Is the design fabricated or only simulated? If fabricated, the authors should include a micrograph of the die.
If it is simulated, they should include a synthesized layout detailing the area and components, showing how they align with Figure 3. Furthermore, if the results are based on simulations, it is unclear which parts of the die were included in the simulations. Is the DRAM part of their chip, or is it external? Is the interface/controller to the DRAM part of the design as well? Do the presented results assume the DRAM operates at full capacity? Finally, if simulated, the authors should clearly state how the simulations were performed (i.e., on the full layout with all parasitics, on sub-blocks of the layout, etc.).
-The key feature of the paper is that the architecture is designed around low bit widths. In their discussion, the authors should address the tradeoff between low bit widths and accuracy, especially when comparing to the YodaNN, BCNN, and QUEST devices. Does their design achieve higher throughput through reduced accuracy? Is there a programmable tradeoff in their work? Or is the accuracy the same but the throughput higher due to the design? They only briefly mention, on p. 17, a Top-1 accuracy of 0.498 and an FPS of 162; however, they then mention that FPS is not a good metric.
-The English needs work in a number of places (lines 61, 126-128, 177, 194, 239-240, 268)
Author Response
Thank you sincerely for your comments and suggestions. I have provided a point-by-point response and uploaded it as a PDF file.
Author Response File: Author Response.pdf
Reviewer 3 Report
Well-written paper, with good experimental results, especially in terms of energy efficiency.
Some comments and questions:
When mentioning OIDF and DIOF for the first time on page 3, please spell them out.
AOF unit -> AQF unit
What is the "internal covariate shift" issue, and how can it be alleviated? Please give some intuition.
Please plot the three activation functions, and briefly explain why three and why those three were chosen.
The top of page 6 is very messy...
How did you decide the relative numbers of LPEs for the CONV and LFC units? Is the ratio between the computational loads of the two pretty much the same across all the CNNs you considered? If not, how did you resolve the trade-off?
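The trade-off this question raises can be estimated by counting multiply-accumulates (MACs) per layer type; here is a minimal sketch using hypothetical AlexNet-like shapes (the layer shapes are illustrative assumptions, not figures from the paper under review):

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs for one conv layer: each of the h_out*w_out*c_out outputs
    needs k*k*c_in multiply-accumulates."""
    return h_out * w_out * c_out * k * k * c_in

def fc_macs(n_in, n_out):
    """MACs for one fully connected layer: one per weight."""
    return n_in * n_out

# Hypothetical AlexNet-like shapes (illustrative only):
conv_load = conv_macs(55, 55, 3, 96, 11)   # a first conv layer
fc_load = fc_macs(4096, 4096)              # a large FC layer
ratio = conv_load / fc_load                # conv work per layer dominates here
```

Because this ratio varies from network to network, a fixed hardware split of LPEs between CONV and LFC units embeds an assumption about the workload mix, which is exactly what the question asks the authors to justify.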
Is it true that you consider a single-port SRAM? It seems from Fig. 3 that you actually have two separate memory groups, one for weights and one for input features, and each of them has several ports. Please clarify (and explain how many ports you have for each group).
At the end of page 7, can you please expand on the difference between your dataflow mechanisms and other schemes (e.g., input-stationary or weight-stationary)?
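The distinction behind this question can be sketched by counting weight fetches under two classic loop orders for a 1-D convolution (the counters and function names below are illustrative, not a model of the paper's architecture): both orderings compute the same outputs, but a weight-stationary order fetches each weight once, while an output-stationary order re-fetches the whole kernel for every output position.

```python
def conv1d_weight_stationary(x, w):
    """Weight-stationary loop order: each weight is fetched once and
    reused across all output positions."""
    out = [0.0] * (len(x) - len(w) + 1)
    weight_fetches = 0
    for k, wk in enumerate(w):            # outer loop keeps the weight resident
        weight_fetches += 1
        for i in range(len(out)):
            out[i] += wk * x[i + k]
    return out, weight_fetches

def conv1d_output_stationary(x, w):
    """Output-stationary loop order: each output accumulates fully before
    moving on, so the kernel is re-fetched for every output position."""
    out = []
    weight_fetches = 0
    for i in range(len(x) - len(w) + 1):
        acc = 0.0
        for k, wk in enumerate(w):        # kernel re-read per output
            weight_fetches += 1
            acc += wk * x[i + k]
        out.append(acc)
    return out, weight_fetches

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [0.5, -1.0, 0.25]
ws_out, ws_fetches = conv1d_weight_stationary(x, w)
os_out, os_fetches = conv1d_output_stationary(x, w)
# Same math, very different memory traffic: 3 weight fetches vs 18 here.
```

Which order wins depends on what is cheapest to keep resident on chip, which is why the comparison against input- and weight-stationary schemes is worth spelling out in the paper.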
At the top of page 9, please explain better how the architectural terms used here (groups, CUs, ...) map to Fig. 3. I assume that the "low parallelism" of the FCN layer is actually "low computational intensity with respect to memory bandwidth", right? Again, how did you choose the "magic" numbers 2 for CONV and 1 for LFC, and their respective amounts of LPEs?
The explanation of the algorithm is not very clear. What is the goal of your assignment? To keep the two loads as balanced as possible? If so, just say so...
Ping-Pang -> Ping-Pong
Stragedy -> Strategy
From table 6, it frankly looks like the different partitioning approaches do not make much of a difference here :-) The partitions are the same, except for S-net. Do you know why, by the way?
Is the utilization in Eqn. (20), and in Fig. 11, defined only for the CONV layers, or does it also include LFC? Can you please report both separately? It could help answer 5.a and 7.
Figure 11 (a) and (b) seem to carry exactly the same information... Throughput is directly proportional to utilization, right?
Author Response
Thank you sincerely for your comments and suggestions. I have provided a point-by-point response and uploaded it as a PDF file.
Author Response File: Author Response.pdf
Reviewer 4 Report
Line spacing in the two "Algorithms" on pp. 9 and 13 appears crowded and might be improved.
Overall an excellent paper with very good support detail.
I guess the only thing I felt was missing was some experiments showing whether implementing a network on this hardware gives the same outputs as implementing the full network on a GPU. For example, do you obtain the same object-detection window with your hardware across a sample set of images? (Typically there will be variations, but these should be small.)
In a sense, this is probably separate work, so it is a minor point, but if you did run some tests it would be nice to see some image examples comparing output results (and it would prove your hardware works for practical applications!).
Author Response
Thank you sincerely for your comments and suggestions. I have provided a point-by-point response and uploaded it as a PDF file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Thank you for your thorough revision of the paper. I think it is ready to be published.