Research on High-Performance Fourier Transform Algorithms Based on the NPU
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Manuscript No.: applsci-2741282
Title: The Research on High-Performance Fourier Transform Algorithms Based on NPU
Authors: Qing Li, Decheng Zuo, Yi Feng and Dongxin Wen
The paper investigates two NPU-accelerated Fourier transform algorithms that leverage the unique hardware structure of the NPU, and provides three implementations of those algorithms, namely MM-2DFT, MV-2FFTm, and MV-2FFTv. As the main contribution, two scenarios were tested and analyzed for energy efficiency. The results indicate that the cube units of the NPU are more energy efficient.
After carefully reviewing this paper, I recommend that it:
- Throughout the paper, many details are common and well known and do not represent current, innovative, or interesting information. In this sense, it is recommended to reevaluate the current state of the paper's topic (chapters 1 and 2).
- Some data in Figures 4 and 6 are unclear.
Author Response
Comment 1:
Throughout the paper, many details are common and well known and do not represent current, innovative or interesting information. In this sense, it is recommended to reevaluate the current state of topic of the paper (chapters 1 and 2).
Response 1:
We appreciate this helpful comment. Following your suggestion, we have rewritten chapters 1 and 2 to present a clearer picture of the current state of the research topic:
In the rewritten chapter 1, we explained the motivation of this research and the necessity of implementing Fourier transform algorithms on NPUs through the following steps. (1) First, we explained the importance of backpack computers and their limitations in computing power and energy consumption. (2) Then, we clarified that CPU+NPU-based SoC chips are the preferred computing devices for backpack computers. (3) After that, we highlighted the significance of the Fourier transform by showing its extensive use as a data preprocessing algorithm in numerous wearable devices and applications. (4) Finally, we explained the urgency of developing Fourier transforms on NPUs, as the current research on NPU-based preprocessing is insufficient.
In the rewritten chapter 2, we evaluated the current status of our research area by reviewing works on three aspects: (1) the applications of the Fourier transform, (2) research on accelerating the Fourier transform, and (3) current research on utilizing NPUs on edge devices. Despite these works, few studies focus on data preprocessing using the NPUs embedded in edge devices, which underlines the novelty of our research.
Comment 2:
Some data in Figures 4 and 6 are unclear.
Response 2:
Thank you for pointing out this issue. Following your suggestion, we redrew Figures 4 and 6. The new figures enlarge the unclear parts and label the data. We hope this revision clarifies the data in those figures and makes them easier to read.
Reviewer 2 Report
Comments and Suggestions for Authors
This article focuses on a study dedicated to developing accelerated Fourier transform algorithms tailored to the specific hardware architecture of NPUs. This endeavor aims to optimize the data preprocessing capabilities of NPU-based backpack computers, ultimately facilitating quicker execution of traditional engineering algorithms. This enhancement plays a pivotal role in bolstering computational services, contributing to more efficient and responsive scenarios.
However, among the main suggestions to be highlighted are:
1. Justify why the Huawei Atlas 200I DK A2 was used
2. Improve the explanation of Figure 2. Furthermore, the figure is positioned three pages after it is first mentioned, on line 304.
3. Explain the purpose of using the high-pass filter on the image. The explanation is quite simplified in line 352.
4. In Table 1, how the time consumption of a single process (ms) was obtained needs to be clarified. Please make this part more straightforward.
5. we can have the same computational and energy gains as other algorithms.
Author Response
Comment 1:
Justify why the Huawei Atlas 200I DK A2 was used
Response 1:
Thank you for your comment.
The Ascend 310 is one of the most popular NPUs in edge intelligence, and its DaVinci architecture is representative among NPUs. The Huawei Atlas 200I DK A2 is a simplified development kit for the Ascend 310. Therefore, our development and validation on the Huawei Atlas 200I DK A2 are conducive to the application and extension of our algorithms.
Comment 2:
Improve the explanation of Figure 2. Furthermore, the figure is positioned three pages after it is first mentioned, on line 304.
Response 2:
We have redrawn Figure 2; the new figure compares the traditional algorithms with our solutions. The detailed description is in lines 314-326, immediately above Figure 2.
Comment 3:
Explain the purpose of using the high-pass filter on the image. The explanation is quite simplified in line 352.
Response 3:
We apologize for the insufficient explanation.
Combining high-pass filtering with the Fourier transform is common practice. Specifically, the software edge filter uses a high-pass filter to extract contour images. The extraction result can be visualized through the inverse Fourier transform, which facilitates validating the correctness of the algorithms. The software execution stream of the edge filter is shown in Figure 3.
Following your valuable advice, we revised the article and further explained the purpose of using the high-pass filter in lines 359-363.
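The execution stream described above (forward transform, high-pass filtering, inverse transform) can be sketched as follows. This is an illustrative reconstruction, not the filter used in the paper; the `cutoff` parameter and the square low-frequency mask are our assumptions.

```python
import numpy as np

def edge_filter(image: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """High-pass filter an image in the frequency domain to extract contours."""
    # Forward 2D Fourier transform, with the zero frequency shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Zero out a (2*cutoff) x (2*cutoff) low-frequency block around the center
    # (an assumed mask shape), keeping only the high frequencies that carry edges.
    h, w = image.shape
    cy, cx = h // 2, w // 2
    spectrum[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0

    # Inverse transform back to the spatial domain; the magnitude is the contour image.
    return np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# A constant image has no high-frequency content, so the filtered result is ~0.
flat = np.ones((64, 64))
print(edge_filter(flat).max())
```

Because the inverse transform round-trips the retained spectrum, visual inspection of the output directly validates the forward transform's correctness, which is the role the edge filter plays in the paper's validation workflow.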
Comment 4:
In Table 1, how the time consumption of a single process (ms) was obtained needs to be clarified. Please make this part more straightforward.
Response 4:
We apologize for the confusion. The "time consumption of a single process" is the execution time of a single edge-filter process as shown in Figure 3. The data in that table were obtained by executing the edge filter 100 times and calculating the average time consumption. Following your suggestion, we have clarified this issue in lines 403-404.
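The averaging methodology described above can be sketched as follows; the helper name and the FFT round-trip used as a stand-in workload are our assumptions, not the authors' measurement harness.

```python
import time
import numpy as np

def mean_time_ms(process, runs: int = 100) -> float:
    """Execute `process` `runs` times and return the average wall time in ms."""
    start = time.perf_counter()
    for _ in range(runs):
        process()
    return (time.perf_counter() - start) / runs * 1000.0

# Example: average a small FFT round-trip standing in for one edge-filter pass.
img = np.random.rand(64, 64)
avg_ms = mean_time_ms(lambda: np.fft.ifft2(np.fft.fft2(img)))
print(f"{avg_ms:.3f} ms per run")
```

Averaging over many runs smooths out scheduler jitter and warm-up effects, which is why a single-run timing would be less representative.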
Comment 5:
we can have the same computational and energy gains as other algorithms.
Response 5:
Thank you for your comment. We must admit that we are uncertain about the intended meaning of "we can have the same computational and energy gains as other algorithms".
However, in terms of calculation speed, we benchmarked our algorithms against (1) the Cooley-Tukey algorithm based on OpenCL running on the NVIDIA Tegra X2 built-in GPU, and (2) the Cooley-Tukey algorithm based on Ascend 310 built-in CPU.
In terms of power efficiency, due to interference from the hardware design, we only compared the energy consumption of the CPU and NPU resources on the development board.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper is technically good with solid experimental results. Here are some suggestions for improvement.
- Providing more background on the applications of Fourier transforms in edge computing could further strengthen the motivation and introduction for the readers.
- Additional diagrams to visually compare the new algorithms vs traditional approaches might aid reader comprehension.
- Highlighting the key performance metrics, energy usage values, or speedup factors in the results using formatting like bold text may help draw attention to important takeaways.
- The future work discussion could cover more complex workflows, auto-tuning, or analyses of other algorithms that could add value.
Author Response
Comment 1:
Providing more background on the applications of Fourier transforms in edge computing could further strengthen the motivation and introduction for the readers.
Response 1:
Thank you for this valuable advice. Following your advice, we have added more applications of the Fourier transform in Chapter 1, lines 50-63. These applications include (1) intelligent wearable device applications such as heart rate estimation and human activity recognition, and (2) potential edge intelligence applications such as road object detection and travel detection. All these AI applications rely on a Fourier transform preprocessing stage before inference. We believe they strengthen the motivation of our article.
Comment 2:
Additional diagrams to visually compare the new algorithms vs traditional approaches might aid reader comprehension.
Response 2:
Following your valuable advice, we have replaced Figure 2 with flowcharts of the MV-2FFT algorithm and the Cooley-Tukey algorithm to compare their execution streams. Additionally, we illustrate the differences between the two algorithms in detail in lines 314-329.
Comment 3:
Highlighting the key performance metrics, energy usage values, or speedup factors in the results using formatting like bold text may help draw attention to important takeaways.
Response 3:
Thank you for your helpful suggestion. We have bolded the key energy consumption data in all tables of this article. We hope this revision effectively draws readers' attention to those takeaways.
Comment 4:
The future work discussion could cover more complex workflows, auto-tuning, or analyses of other algorithms that could add value.
Response 4:
Thank you for your constructive advice. In future work, we plan to optimize our algorithms in terms of energy efficiency. Moreover, we intend to broaden our research by (1) accelerating other preprocessing algorithms on NPUs, and (2) improving the energy efficiency of NPUs in sparse matrix multiplication scenarios by developing sparse matrix multiplication operators.
Reviewer 4 Report
Comments and Suggestions for Authors
Please find my comments and suggestions in the attached document.
Comments for author File: Comments.pdf
Comments on the Quality of English Language
The quality of the English language in the submitted article is generally good. The text is well-structured and clear, with coherent sentences and appropriate technical terminology. However, there are a few minor grammatical and phrasing issues that could be improved for even greater clarity and readability.
Author Response
Response:
Thank you for your valuable suggestion. Following your advice, we have rechecked the entire article to fix grammar and phrasing issues. We hope this revision improves the clarity and readability of the article.
Reviewer 5 Report
Comments and Suggestions for Authors
The paper titled "The Research on High-Performance Fourier Transform Algorithms Based on NPU" focuses on developing and implementing Fourier transform algorithms accelerated by Neural Processing Units (NPUs). It proposes two algorithms for this purpose: the direct matrix multiplication-based DFT algorithm (MM-2DFT) and the matrix-vector iterative operation-based FFT algorithm (MV-2FFT). These algorithms leverage the unique hardware structure of NPUs and aim to enhance the efficiency of preprocessing tasks in AI edge processing, particularly for IoT applications. The authors compare their algorithms with the traditional Cooley-Tukey algorithm on CPU and GPU platforms, focusing on aspects like speed and energy efficiency.
The paper claims to introduce the NPU architecture in line 172, which raises concerns about the novelty of the approach. Given the significant amount of prior research in the field of heterogeneous acceleration of Fourier Transform algorithms, primarily focusing on hardware like GPUs and FPGAs, the introduction of the NPU architecture alone does not establish novelty. It is recommended that the authors provide a more detailed literature review that specifically addresses previous works involving NPUs in the context of Fourier Transform acceleration. This would help to clearly delineate where this paper's contributions lie in relation to existing research.
The methodology presented in Section 4 does not appear to offer significant novel aspects. While the application of Fourier Transform algorithms to NPU architecture might have some unique elements, these are not adequately highlighted or justified in the paper. A more thorough comparison with existing methodologies, including a detailed explanation of how this work differs from previous studies, would be beneficial. This could involve emphasizing any unique aspects of the implementation or demonstrating how the application to NPU architecture provides significant improvements or new capabilities over existing solutions.
Section 5 of the paper is well-written and provides a clear exposition of the engineering aspects of the work. It offers valuable insights into the practical application of the proposed algorithms in real-world scenarios.
While the paper addresses an important topic in the field of AI and computing efficiency, there are significant concerns regarding the novelty of the approach and methodology. The introduction of NPU architecture, while potentially valuable, is not sufficiently distinguished from existing works in the literature. The paper would benefit greatly from a more comprehensive literature review and a clearer demonstration of how its contributions advance the field beyond current state-of-the-art methodologies.
Comments on the Quality of English Language
Needs improvement
Author Response
Comment 1:
The paper claims to introduce the NPU architecture in line 172, which raises concerns about the novelty of the approach. Given the significant amount of prior research in the field of heterogeneous acceleration of Fourier Transform algorithms, primarily focusing on hardware like GPUs and FPGAs, the introduction of the NPU architecture alone does not establish novelty. It is recommended that the authors provide a more detailed literature review that specifically addresses previous works involving NPUs in the context of Fourier Transform acceleration. This would help to clearly delineate where this paper's contributions lie in relation to existing research.
Response 1:
We appreciate your insightful comment. The introduction of the NPU architecture shown in Figure 1 aims to illustrate the principle of the NPU's efficient operation on matrices and vectors, which shows that the NPU works differently from other processors.
The NPU accelerates matrix calculations using one or more matrix multiplication units and vector computation units that operate on data blocks. In contrast, GPUs and FPGAs accelerate matrix computation through the parallel processing of many computing units. This difference means that traditional algorithms designed for CPUs, GPUs, and FPGAs may not be suitable for NPUs.
To delineate our contributions in the context of existing research, we have revised Chapter 1 and added literature illustrating (1) the acceleration principle and applications of NPUs, together with (2) the applications of GPUs and FPGAs in Fourier transform acceleration.
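As background for the direct matrix-multiplication formulation that MM-2DFT maps onto the NPU's matrix units, the textbook identity can be sketched in NumPy. This is our reconstruction of the standard 2D DFT identity, not the authors' implementation, and it assumes a square input for brevity.

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """The n x n DFT matrix F, with F[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def mm_2dft(x: np.ndarray) -> np.ndarray:
    """2D DFT of a square matrix as two dense products: F @ X @ F.
    F is symmetric, so transforming rows then columns reduces to this form.
    Dense matrix products are exactly what an NPU's cube units execute fastest."""
    f = dft_matrix(x.shape[0])
    return f @ x @ f

# Agreement with the reference FFT on a small random input.
x = np.random.rand(8, 8)
assert np.allclose(mm_2dft(x), np.fft.fft2(x))
```

Although this costs O(N^3) arithmetic versus O(N^2 log N) for an FFT, the work is a single dense matrix product, which is why hardware with fast matrix units can still win at small input orders, as the paper's Figure 4 experiment reports.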
Comment 2:
The methodology presented in Section 4 does not appear to offer significant novel aspects. While the application of Fourier Transform algorithms to NPU architecture might have some unique elements, these are not adequately highlighted or justified in the paper. A more thorough comparison with existing methodologies, including a detailed explanation of how this work differs from previous studies, would be beneficial. This could involve emphasizing any unique aspects of the implementation or demonstrating how the application to NPU architecture provides significant improvements or new capabilities over existing solutions.
Response 2:
Thank you for your feedback. We understand your concern regarding the novelty and have made several revisions.
Regarding the novelty of the methodology presented in Section 4, we redrew Figure 2 to compare the process of MV-2FFT with that of Cooley-Tukey and illustrated their differences in detail in lines 314-329. Compared to the Cooley-Tukey algorithm, MV-2FFT is optimized for the characteristics of the NPU: it simplifies the calculation process and fully utilizes the computing power of the matrix units in the NPU.
Regarding the improvements and significance of our work, we have emphasized in lines 265-268 that acceleration using NPU matrix units is more significant than on other hardware. This argument is justified by our experiment on the time consumption of different algorithms at different input scales (shown in Figure 4). The experiment shows that MM-2DFT has a strong advantage over other approaches when the order of the input matrix is low, and that the direct application of NPU matrix units for acceleration makes sense in engineering.
Comment 3:
While the paper addresses an important topic in the field of AI and computing efficiency, there are significant concerns regarding the novelty of the approach and methodology. The introduction of NPU architecture, while potentially valuable, is not sufficiently distinguished from existing works in the literature. The paper would benefit greatly from a more comprehensive literature review and a clearer demonstration of how its contributions advance the field beyond current state-of-the-art methodologies.
Response 3:
Thank you for your valuable advice. We have conducted a more comprehensive literature review and made major revisions to Chapter 2. By analyzing the literature, we presented (1) the applications of the Fourier transform, (2) studies on Fourier transform acceleration, and (3) the applications of NPUs in edge computing, to give a global view of the current research. Considering that ongoing research and development efforts on NPUs mainly focus on accelerating neural networks, support for typical engineering algorithm computation remains insufficient. Therefore, employing NPUs to accelerate typical engineering algorithms, especially the Fourier transform, is significant.
Current heterogeneous acceleration methods are designed for GPUs and FPGAs and may not be suitable for NPUs due to their distinct architecture. Hence, in Chapter 4 we proposed two implementations of FFT acceleration on the NPU based on the characteristics of NPU matrix multiplication, namely the direct matrix multiplication method (MM-2DFT) and the improved matrix divide-and-conquer method (MV-2FFT). Specifically, in lines 265-268, we explained the engineering significance of direct matrix multiplication. In lines 314-329, we explained the differences between MV-2FFT, based on the matrix divide-and-conquer method, and the classic Cooley-Tukey fast algorithm. Our algorithms adapt to the internal computing units of the NPU and unleash the full potential of its computing power. At the same time, these algorithms also provide a research basis for an in-depth discussion of the advantages and disadvantages of the NPU architecture.
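To illustrate the divide-and-conquer structure that MV-2FFT builds on, an iterative radix-2 FFT can be written so that each stage is one batched vector butterfly rather than many scalar operations. This is our sketch of the generic technique, not the operator described in the paper; the staging and names are assumptions.

```python
import numpy as np

def fft_stages(x: np.ndarray) -> np.ndarray:
    """Iterative radix-2 Cooley-Tukey FFT in which each of the log2(n) stages
    is a single vectorized butterfly over the whole array -- the style of bulk
    vector/matrix operation that NPU computing units are built to execute."""
    n = x.shape[0]
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Bit-reversal permutation puts the input in stage-0 butterfly order.
    rev = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    y = x[rev].astype(complex)
    half = 1
    while half < n:
        # Twiddle factors for this stage, applied to every block at once.
        w = np.exp(-2j * np.pi * np.arange(half) / (2 * half))
        y = y.reshape(-1, 2 * half)
        even, odd = y[:, :half], y[:, half:] * w  # one batched butterfly
        y = np.concatenate([even + odd, even - odd], axis=1).reshape(-1)
        half *= 2
    return y

# Agreement with the reference FFT.
x = np.random.rand(16)
assert np.allclose(fft_stages(x), np.fft.fft(x))
```

Expressing each stage as one bulk operation over contiguous blocks is what allows the recursion of the classic Cooley-Tukey algorithm to be recast as a short sequence of matrix-vector style steps, which is the general idea behind mapping an FFT onto matrix/vector hardware.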
Round 2
Reviewer 5 Report
Comments and Suggestions for Authors
Thank you for your response. I would highly suggest that the authors revise Algorithms 1 and 2 into a proper algorithmic format.
The title and headings/subheadings also need revision.
Comments on the Quality of English Language
English needs proper revision.