ACIMS: Analog CIM Simulator for DNN Resilience
Round 1
Reviewer 1 Report
This paper describes a method for retraining a simple MLP neural network to account for the error introduced by computing using an ACIM device. This is achieved by implementing a model of the fault-profile of the ACIM device, and training with this process in the loop. The presented results seem to indicate that this produces an improvement in accuracy, at least in the case where the ACIM fault profile matches the model closely.
The ideas in the paper are interesting, but the presentation still needs quite a lot of work. The English is rough in many places, with too many issues to sensibly list here; I recommend professional proofreading. The language issues prevented me from being fully clear about some of the ideas presented, and therefore it's hard to fully judge the soundness of the approach.
It's mentioned that the ACIM fault injection can be static, deterministic or probabilistic in some way, but this topic needs to be expanded further. What's the exact nature of the fault injection used in the examples?
Is real ACIM hardware used to test the models after the fault-aware-training? This wasn't totally clear to me from the text.
Does the nature of the ACIM faults vary widely between individual hardware devices? Or is it mainly dependent on the design? This seems crucial to determine whether this approach is practically applicable.
Some types of neural network are much more sensitive to fault injection than others. Could the authors envision their method extending to e.g. RNNs?
Author Response
Dear reviewer,
Thank you very much for your useful recommendations for improvement.
Thank you for your positive assessment of the idea behind this manuscript.
Following your advice, we have made some improvements.
Point 1: It's mentioned that the ACIM fault injection can be static, deterministic or probabilistic in some way, but this topic needs to be expanded further. What's the exact nature of the fault injection used in the examples?
Response 1: We have added subsection 3.4 to describe how we generate ACIM faults in the simulator. We also provide a verification of this fault generation.
Point 2: Is real ACIM hardware used to test the models after the fault-aware-training? This wasn't totally clear to me from the text.
Response 2: We are in continuous communication with our collaborator, who designs the Sandwich-Ram hardware. Testing on real hardware still requires more debugging and integration work between the algorithm and the hardware. Our work, the ACIMS framework, is a DNN training procedure, and the procedure does not change when it is moved to a hardware environment. In theory, therefore, it should also be effective on hardware. We have added a more detailed discussion in the conclusion section.
Point 3: Does the nature of the ACIM faults vary widely between individual hardware devices? Or is it mainly dependent on the design? This seems crucial to determine whether this approach is practically applicable.
Response 3: Yes, the nature of ACIM faults varies widely between individual hardware devices. We have redescribed this feature in subsection 3.4. We do not retrain for one individual chip and then expect the errors of all chips to be resolved. Instead, we address each chip's error based on its own individual fault. After targeted retraining, the parameters of one common DNN model are adjusted differently for different hardware devices.
Point 4: Some types of neural network are much more sensitive to fault injection than others. Could the authors envision their method extending to e.g. RNNs?
Response 4: Regarding the RNN structure, we consulted our collaborator. At present, the hardware they design targets DNNs. However, this is an interesting research direction, and exploring it on a software platform should be easier. We will pursue this research in the future.
Reviewer 2 Report
This paper presents an Analog Computing In Memory (ACIM) simulator, which is a hot topic these days. I think any contribution in CIM is beneficial to the community. The major contribution of this paper is that it combines an analog simulator and a CIM simulator in one platform. Also, users can inject faults and check their impact. Therefore, users can estimate in advance the DNN accuracy of the architecture on a specific dataset such as MNIST.
Overall, the topic of this paper is interesting and the contribution of the proposed simulator is clear. I would like to see the full paper in this journal because the community is eager to see more CIM papers these days.
Author Response
Dear reviewer,
Thank you very much for your appreciation of our manuscript.
We have modified parts of our introduction and conclusion to make the logic flow better.
We have added subsection 3.4 to describe how we generate ACIM faults in the simulator. We also provide a verification of this fault generation.
Thanks again for your review.
Reviewer 3 Report
The paper is devoted to an actively developing area of state-of-the-art computing: analog computing and CIM. Complex DNN applications require highly efficient hardware, which can be realized in classical CPU or GPU architectures or implemented as accelerators with specific architectures. DNN accelerator design and reliability awareness are topical areas at the intersection of computer science, applied mathematics, computer-aided design and microelectronics design automation. The authors have proposed a fault injection and fault-aware training framework and a corresponding simulator, ACIMS. The novelty of the paper lies in an attempt to automate the design and training of DNNs for an Analog Computing In Memory (ACIM) architecture, taking into account the influence of tolerances and process deviations in the microelectronic components used.
The theoretical and experimental sections are presented in the paper. Meanwhile, there are the following comments and questions:
- Regarding the first contribution, “We are the first to establish a mathematical model of an ACIM architecture”: there is no ACIM model description in the text. The DNN fitting function is described in subsection 2.1. It’s important to pay attention to the upper indexes at line 2 of Eq. 1. Next, the statements of the different errors for Sandwich-Ram are presented in subsections 3.1-3.3. So, the contribution specified on page 2 (line 50) is not reflected in the paper.
- The models of the different faults for Sandwich-Ram are merely described in subsections 3.1-3.3, without any proof of their adequacy and accuracy. The coefficients Pre and N(mu, sigma^2) are mentioned, but are not specified later in the experiments.
- The Fault Injection Method (subsection 4.2) defines the general aspects of the three considered kinds of fault injection without describing how the real deviations of the circuit, which behave randomly, are taken into account. The adequacy of a one-time fault injection for the subsequent retraining is also not considered. Is a single fault injection enough for an exhaustive correction?
- There is no mention of the real circuit deviations and environmental conditions that were used in the experiments. As a result, we cannot see the effect of the proposed solution on the improvement of training or on the capability of the DNN on the ACIM.
- The Conclusions and Abstract do not fully correspond to each other in terms of aims and scope.
- The text requires substantial improvement in English. There are many misprints and errors in the paper.
The paper should be revised from the methodological point of view, with better presentation and proof of the mathematical statements, a description of the method details, the organization and description of realistic experiments, as well as improved English.
Author Response
Dear reviewer,
Thank you very much for your useful recommendations for improvement.
And thank you for your recognition of our work.
Point 1: Regarding the first contribution, “We are the first to establish a mathematical model of an ACIM architecture”: there is no ACIM model description in the text.
Response 1: We have modified parts of our introduction and now describe the ACIM architecture in a more logical flow. ACIM is an architecture that combines computing in memory with analog computing, so it inherits the advantages and the disadvantages of both. It is a frontier technology, and we are among the earlier explorers in this domain. We have changed the "first" claim.
Point 2: The DNN fitting function is described in subsection 2.1. It’s important to pay attention to the upper indexes at line 2 of Eq. 1.
Response 2: Thank you for pointing out this error and we have made a correction.
Point 3: Next, the statements of the different errors for Sandwich-Ram are presented in subsections 3.1-3.3. So, the contribution specified on page 2 (line 50) is not reflected in the paper.
Response 3: We regard the Sandwich-Ram error as a combined error, so we analyzed it from three angles. The error in 3.1 can indeed be regarded as a characteristic feature of Sandwich-Ram. Hardware process deviation is very important in analog circuits, and Sandwich-Ram uses a number of refactored analog computing elements. This means that the unique bias matrix can serve as the identity of an ACIM chip.
Point 4: The models of the different faults for Sandwich-Ram are merely described in subsections 3.1-3.3, without any proof of their adequacy and accuracy. The coefficients Pre and N(mu, sigma^2) are mentioned, but are not specified later in the experiments.
Response 4: We have added subsection 3.4 to describe how we generate ACIM faults in the simulator, and we also verify this at the RTL level. The relevant parameters of the normal distribution are also given.
Point 5: The Fault Injection Method (subsection 4.2) defines the general aspects of the three considered kinds of fault injection without describing how the real deviations of the circuit, which behave randomly, are taken into account. The adequacy of a one-time fault injection for the subsequent retraining is also not considered. Is a single fault injection enough for an exhaustive correction?
Response 5: We have redescribed this feature in subsection 3.4, where we describe fault generation. We do not retrain for one individual chip and then expect the errors of all chips to be resolved. Instead, we address each chip's error based on its own individual fault. After targeted retraining, the parameters of one common DNN model are adjusted differently for different hardware devices.
We generate faults in two phases. The hardware deviation matrix is generated only once, since it is relatively stable. The environmental fault is generated anew at every layer operation. The experiments show that this approach can handle faults composed of multiple levels: ACIMS specifically corrects the hardware deviation and tolerates the secondary error, the environmental fault.
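The two-phase generation described in this response can be illustrated with a minimal sketch. It is not the authors' code: the function names, the additive Gaussian form of both fault terms, and all parameter values are assumptions made only for illustration. The point it shows is the split between a deviation matrix drawn once per chip and an environmental term redrawn at every layer operation.

```python
import random


def make_deviation_matrix(rows, cols, sigma, seed):
    """Phase 1 (hypothetical): draw the per-chip deviation once and keep it fixed.

    The Gaussian form and the per-chip seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(cols)] for _ in range(rows)]


def faulty_layer(x, weights, deviation, env_sigma, rng):
    """Phase 2 (hypothetical): one layer operation under both fault sources.

    The static deviation is added to the weights; fresh environmental noise
    is drawn for every output on every call.
    """
    outputs = []
    for w_row, d_row in zip(weights, deviation):
        acc = sum((w + d) * xi for w, d, xi in zip(w_row, d_row, x))
        outputs.append(acc + rng.gauss(0.0, env_sigma))
    return outputs


# Usage: the deviation is fixed for a chip, the environmental term is not.
chip_deviation = make_deviation_matrix(2, 3, sigma=0.05, seed=1)
weights = [[1.0, -1.0, 1.0], [-1.0, 1.0, 1.0]]
inputs = [0.5, 0.25, 1.0]
env_rng = random.Random(0)
first = faulty_layer(inputs, weights, chip_deviation, 0.1, env_rng)
second = faulty_layer(inputs, weights, chip_deviation, 0.1, env_rng)
```

Under these assumptions, retraining against `chip_deviation` targets the stable part of the fault, while the per-call noise is something the retrained model can only learn to tolerate.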
Point 6: There is no mention of the real circuit deviations and environmental conditions that were used in the experiments. As a result, we cannot see the effect of the proposed solution on the improvement of training or on the capability of the DNN on the ACIM.
Response 6: We are in continuous communication with our collaborator, who designs the Sandwich-Ram hardware. Testing on real hardware still requires more debugging and integration work between the algorithm and the hardware. Our work, the ACIMS framework, is a DNN training procedure, and the procedure does not change when it is moved to a hardware environment. In theory, therefore, it should also be effective on hardware. We have added a more detailed discussion in the conclusion section.
Point 7: The Conclusions and Abstract do not fully correspond to each other in terms of aims and scope.
Response 7: We have modified parts of our introduction and conclusion to make the logic flow better.
Point 8: The text requires substantial improvement in English. There are many misprints and errors in the paper.
Response 8: Thank you for your advice on our submission. I will continue to improve my English writing. It is somewhat hard for a non-native speaker to write such a long article for the first time.
Round 2
Reviewer 1 Report
Thank you to the authors for the addition of the extra section describing the error injection process in greater detail. This is a valuable addition, and I think improves the paper.
The scientific content is now ready for publication, but the quality of the English is still not really good enough. I would suggest getting a professional proof-reader to take a look at the text.
Author Response
Dear reviewer,
Thank you for your appreciation.
We have had the manuscript proofread.
Your valuable advice does make our article better.
Thank you again.
Reviewer 3 Report
The authors did a great job and in the present version took into account many of the comments and suggestions made. However, a number of points remain open:
Point 1 (was): Regarding the first contribution, “We are the first to establish a mathematical model of an ACIM architecture”: there is no ACIM model description in the text.
Response 1: We have modified parts of our introduction and now describe the ACIM architecture in a more logical flow. ACIM is an architecture that combines computing in memory with analog computing, so it inherits the advantages and the disadvantages of both. It is a frontier technology, and we are among the earlier explorers in this domain. We have changed the "first" claim.
Point 1 (continue): The authors have changed the first contribution to the following: “We take Sandwich-Ram as an example to study the fault-pattern of ACIM architecture. We are the earlier to establish a mathematical model of an ACIM architecture. Through verification, this model can fit Sandwich-Ram’s fault-pattern well.” Unfortunately, the current version of the paper does not contain the mentioned mathematical model, so it cannot be considered a contribution. According to the title of the paper, the main proposed object is an Analog CIM Simulator. It would be natural to make this object the main focus of the paper and to define contribution 1 correctly.
Point 2 (new): The abbreviation BWN is used on page 2 (line 41) without being expanded.
Point 3 (new): The fault model represented in equation 5 still does not define the function N and the coefficients mu and sigma^2.
Point 4 (new): The authors have included the new subsection 3.4 describing how they generate an ACIM fault in the simulator. This subsection makes the proposed fault simulation method easier to understand. However, the following designations are used here without definitions: ff70, ss0, tt25. The coefficient Pre is assigned the value HHHH. What does it mean, and how can it be used in the models in equations (4), (5) and (6)?
Point 8 (was): The text requires substantial improvement in English. There are many misprints and errors in the paper.
Response 8: Thank you for your advice on our submission. I will continue to improve my English writing. It is somewhat hard for a non-native speaker to write such a long article for the first time.
Point 8 (continue): I understand the difficulty of writing a long text in a non-native language, but the paper should be presented correctly so that prospective readers can understand it, especially when the authors submit the paper to a high-ranking journal such as Electronics. Unfortunately, the current version contains many misprints and errors. The authors need to carry out a thorough spelling check of the text at the very least.
Indeed, the paper has become better structured. After a final improvement it can be accepted for publication.
Author Response
Dear reviewer,
Thank you very much for your useful recommendations for improvement.
Following your advice, we have made some improvements.
Point 1 (continue): The authors have changed the first contribution to the following: “We take Sandwich-Ram as an example to study the fault-pattern of ACIM architecture. We are the earlier to establish a mathematical model of an ACIM architecture. Through verification, this model can fit Sandwich-Ram’s fault-pattern well.” Unfortunately, the current version of the paper does not contain the mentioned mathematical model, so it cannot be considered a contribution. According to the title of the paper, the main proposed object is an Analog CIM Simulator. It would be natural to make this object the main focus of the paper and to define contribution 1 correctly.
Response 1 (continue): We have modified this statement. We have de-emphasized the mathematical model and now focus on the Analog CIM Simulator.
Point 2 (new): The abbreviation BWN is used on page 2 (line 41) without being expanded.
Response 2 (new): We have replaced BWN with 'binary weight network'.
Point 3 (new): The fault model represented in equation 5 still does not define the function N and the coefficients mu and sigma^2.
Response 3 (new): We have added an introduction to the Gaussian distribution N(mu, sigma^2).
Point 4 (new): The authors have included the new subsection 3.4 describing how they generate an ACIM fault in the simulator. This subsection makes the proposed fault simulation method easier to understand. However, the following designations are used here without definitions: ff70, ss0, tt25. The coefficient Pre is assigned the value HHHH. What does it mean, and how can it be used in the models in equations (4), (5) and (6)?
Response 4 (new): ff70, ss0 and tt25 are common process corners in chip manufacturing. We have added a description of them. Pre = 7140 = 255*28; this value prevents negative overflow during calculation. 'HHHH' was a placeholder label that I forgot to replace.
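One possible reading of the arithmetic in this response, offered only as an illustrative assumption: if each accumulation sums 28 products of an 8-bit input (0 to 255) with a binary weight of +1 or -1, the most negative sum is -255*28 = -7140, and adding Pre = 7140 keeps the result non-negative. The input range, the term count of 28, and the function name below are hypothetical; the response itself states only the value and its purpose.

```python
MAX_INPUT = 255   # assumed: largest 8-bit input value
N_TERMS = 28      # assumed: number of products per accumulation
PRE = MAX_INPUT * N_TERMS  # 7140, the offset quoted in the response


def offset_accumulate(inputs, signs):
    """Hypothetical accumulation with binary (+1/-1) weights plus the offset.

    Adding PRE shifts the worst-case negative sum up to zero, so the result
    can be held in an unsigned value without negative overflow.
    """
    if len(inputs) != N_TERMS or len(signs) != N_TERMS:
        raise ValueError("expected exactly N_TERMS inputs and signs")
    acc = sum(x * s for x, s in zip(inputs, signs))
    return acc + PRE


# Worst case: every input at its maximum and every weight equal to -1.
worst_case = offset_accumulate([MAX_INPUT] * N_TERMS, [-1] * N_TERMS)
```

Under these assumptions the shifted result always lies between 0 and 2*PRE.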
Point 8 (continue): I understand the difficulty of writing a long text in a non-native language, but the paper should be presented correctly so that prospective readers can understand it, especially when the authors submit the paper to a high-ranking journal such as Electronics. Unfortunately, the current version contains many misprints and errors. The authors need to carry out a thorough spelling check of the text at the very least.
Response 8 (continue): We have had the manuscript proofread. I expect the language to read more naturally this time.
I sincerely appreciate your sharp advice. It pushed me to think twice about my article and methodology and made my paper better.