Hardware Implementation Study of Particle Tracking Algorithm on FPGAs
Round 1
Reviewer 1 Report
2021 0908 MDPI electronics-1380638
Title : Hardware implementation study of particle tracking algorithm on FPGAs
Contribution: reconfigurable hardware implementation of the Hough transform algorithm, a mathematical spatial transformation to facilitate the recognition of trajectories of ionizing particles passing through the tracker apparatus within high-energy physics detectors.
Somewhat clear and readable. Terminology not so accurate (using terms such as “is evident”). No state of the art. Only simulation? The implementation is not tested even with an input signal generator? No implementation results (measured size, area, etc.). The implementation description is not clear (verbose at times). A pseudo-algorithm could help to understand the operations performed. The block diagram is not clear at all. The sliding window is not clearly described (and moreover, this technique is not needed; explain why!). Why just 8 input channels? How are the values on the channels organized over time? How are the values computed?
Remarks:
The measured parameters and units must be indicated at the beginning.
What is a bin?
Not clear: “these numbers report the minimum number of different input channels that have generated straight lines in the Accumulator, crossing each other up to a given number of times (8 times at most for the 8 inputs).”
Not clear: “by incrementing all the internal elements by "1" if that element was not crossed by any line generated by that input channel”
Tower Finder: a pseudo-algorithm could be better.
“Sliding Windows: As the Accumulator becomes easily too big to be processed within one clock period”: really? By using a lookup table per line, at any line write operation you can update the position of the maximum value (Road).
“the task is developed in a pipeline, so to process a portion of operations at a time”: not clear. Where?
“it is evident how the use of SLRs affects the system performance”: the term “evident” is not appropriate; it must be proved/demonstrated instead.
“Failed Nets”: explain.
“by varying the Accumulator binning in terms of (Aq/Pt, phi0).” Not clear.
Author Response
Response to Reviewer 1 Comments
I thank the referee for the comments, which I address in detail below.
General Comments
Somewhat clear and readable. Terminology not so accurate (using terms such as “is evident”). No state of the art. Only simulation? The implementation is not tested even with an input signal generator? No implementation results (measured size, area, etc.). The implementation description is not clear (verbose at times). A pseudo-algorithm could help to understand the operations performed. The block diagram is not clear at all. The sliding window is not clearly described (and moreover, this technique is not needed; explain why!). Why just 8 input channels? How are the values on the channels organized over time? How are the values computed?
Response to General Comments: “Is evident” has now been rephrased, mainly as “is shown”. Regarding physical tests with a generator, this is a matter for wider lab tests within the collaborations at CERN. On the other hand, we ran post-layout simulations, not only behavioral ones, and this is now mentioned in the paper. The sliding window technique is the way we found to avoid excessive resource consumption due to the parallel execution of the algorithm. Using windows, the Accumulator can instead be swept in successive steps, and we can find the best compromise between the number of windows and the increase in latency. This is now better described. Eight channels is the maximum number of inputs that can be processed at a time, even using the largest FPGA available. This concept is also described.
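For illustration, a minimal sketch (in Python, not the actual VHDL) of the window-sweeping idea: instead of scanning the whole Accumulator in parallel within one clock period, it is swept window by window, trading resources for latency. All sizes and data below are assumptions, not the design values.

```python
import numpy as np

# Illustrative sizes only; not the actual Accumulator dimensions of the design.
N_QPT, N_PHI0 = 100, 1200
WINDOW = 150                      # phi0 bins processed per step (assumption)

accumulator = np.random.randint(0, 9, size=(N_QPT, N_PHI0))

candidates = []
for start in range(0, N_PHI0, WINDOW):
    # Each window is handled in one step of the pipeline instead of
    # processing the whole Accumulator at once.
    window = accumulator[:, start:start + WINDOW]
    q, p = np.unravel_index(np.argmax(window), window.shape)
    candidates.append((int(window[q, p]), q, start + p))

# Fewer, larger windows: more parallel logic, lower latency.
# More, smaller windows: fewer resources, higher latency.
print(max(candidates))
```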
Point 1: The measured parameters and units must be indicated at the beginning. What is a bin?
Response 1: Any histogram representation uses bins on the horizontal axis; this is the basic representation of a statistical distribution. Units are described as angles for the horizontal axis, divided into 1200 bins, and momenta, which use common physical units. To better describe the point, these units have been reported as radians horizontally and momenta vertically, measured in kg·m/s, with charge in coulombs. The two axes have then been divided into bins to plot a two-dimensional histogram.
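Purely for illustration, a minimal Python sketch of what a bin means here: a cell of a two-dimensional histogram over (phi0, q/pT). The 1200 phi0 bins come from the response above; the q/pT range and bin count are illustrative assumptions.

```python
import numpy as np

N_PHI0 = 1200                    # phi0 bins (radians), as stated above
N_QPT  = 100                     # q/pT bins: an illustrative assumption
phi0_edges = np.linspace(0.0, 2 * np.pi, N_PHI0 + 1)
qpt_edges  = np.linspace(-1.0, 1.0, N_QPT + 1)

# Every (phi0, q/pT) pair produced by the transform falls into exactly one
# bin of this 2D histogram; the bin content is the number of entries.
phi0_samples = np.random.uniform(0.0, 2 * np.pi, 5000)
qpt_samples  = np.random.uniform(-1.0, 1.0, 5000)
hist, _, _ = np.histogram2d(phi0_samples, qpt_samples,
                            bins=[phi0_edges, qpt_edges])
print(hist.shape)                # (1200, 100): one counter per bin
```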
Point 2: Not clear: “these numbers report the minimum number of different input channels that have generated straight lines in the Accumulator, crossing each other up to a given number of times (8 times at most for the 8 inputs).” Not clear: “by incrementing all the internal elements by "1" if that element was not crossed by any line generated by that input channel”.
Response 2: The phrase has been changed to “In more detail, these 6, 7 or 8 numbers indicate the minimum number of different input channels associated with the straight lines in the Accumulator. These lines cross several times (8 times at most for the 8 inputs).”
Point 3: Tower Finder: a pseudo-algorithm could be better.
Response 3: I am not sure I understand the hint. The Tower Finder basically scans the Accumulator searching for the bins whose content reaches a height of 8.
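For illustration, a minimal pseudo-algorithm sketch (in Python, not the VHDL implementation) of the behaviour just described: scan the Accumulator and report the bins whose count reaches the threshold. The data layout is an assumption; the threshold of 8 corresponds to all 8 input channels.

```python
def tower_finder(accumulator, threshold=8):
    """Return the Accumulator bins whose count reaches the threshold."""
    towers = []
    for i, row in enumerate(accumulator):          # q/pT bins
        for j, count in enumerate(row):            # phi0 bins
            if count >= threshold:
                towers.append((i, j, count))       # candidate road ("tower")
    return towers

# Toy usage with a 3x4 Accumulator slice:
print(tower_finder([[0, 2, 8, 1],
                    [7, 8, 3, 0],
                    [1, 0, 0, 8]]))
```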
Point 4: “Failed Nets”: explain.
Response 4: A sentence has now been added: “Failed nets in Table 1 indicate the nets that, after the place and route task, do not meet the Setup Time. Even when the Setup Time is violated the circuit can still work, as long as the clock speed is reduced or these nets are routed again.”
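As a worked illustration of this statement (with illustrative numbers, not values taken from Table 1): a net that misses the Setup Time by some amount still meets timing if the clock period is stretched by at least that amount.

```python
target_period_ns = 4.0            # 250 MHz target clock
setup_violation_ns = 0.25         # assumed worst setup violation on a failed net

required_period_ns = target_period_ns + setup_violation_ns
achievable_freq_mhz = 1000.0 / required_period_ns
print(f"{achievable_freq_mhz:.1f} MHz")   # ~235.3 MHz: the design still works at a reduced clock
```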
Point 5: “by varying the Accumulator binning in terms of …” Not clear.
Response 5: The sentence has now been rephrased as: “by reducing the Accumulator binning in terms of (Aq/Pt, phi0), the overall synthesis and Place and Route task can better adapt to match the constraints.”
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors implement a particle tracking algorithm on FPGA.
The abstract should include the main outcomes of the project, including quantitative features.
Equations should have numbers. There are several previous implementations of the Hough transform algorithm, which were not mentioned in a previous work section.
The algorithm should be described concisely in some kind of pseudocode. Then, the hardware design should be explained following the algorithm. Each hardware block should be clearly explained. For example, the event memory bank depends on the number of input channels. But how? The subsections 3.x should explain the hardware but they are still explaining the algorithm. A diagram of each block should also be given.
Section 4.1 explains a strategy to reduce the hardware size. However, the hardware design was already proposed previously. The sections should be reordered.
Instead of a performance design goal based on frequency, you should consider the required throughput, latency, delay, etc. Does the system fulfil the requirements?
The total FPGA resources must be described (LUTs, DSP, BRAM, etc.). This allows me to determine which FPGA I need to run the design.
How does the FPGA receive and send data? What kind of interfaces are you using? What is the required data rate? The equations of ITT and OTT on page 10 should consider the true frequency.
Please discuss the scalability of the system. Can the system be scaled to achieve a higher throughput?
In general, the paper is just the hardware implementation of a tracking algorithm. The algorithm is known and the implementation has no new concepts. Therefore, I would recommend submitting the paper to a conference.
Author Response
Response to Reviewer 2 Comments
I thank the referee for the comments, which I address in detail below. The paper has also been fully revised according to the other reviewers' comments.
Point 1: The abstract should include the main outcomes of the project, including quantitative features.
Response 1: This sentence has now been added at the end of the Abstract: “In particular the Xilinx Alveo U250 device has been investigated. A target frequency of 250 MHz and a total latency of 30 clock periods have been achieved using only 17-53% of the LUTs, 8-12% of the DSPs, 1-3% of the Block RAMs and a flip-flop occupancy in the range of 9-28%.”
Point 2: Equations should have numbers. There are several previous implementations of the Hough transform algorithm, which were not mentioned in a previous work section.
Response 2: We have now also numbered formulas 6, 7 and 8 for the maximum frequency and the throughputs. Surely there are other implementations; references 8 and 9 have now been added, even though the performance they describe is not at all comparable with the presented work. A sentence has been added at the beginning of the Introduction and at the beginning of Section 2.
Point 3: The algorithm should be described concisely in some kind of pseudocore. Then, the hardware design should be explained following the algorithm. Each hardware block should be clearly explained. For example, the event memory bank depends on the number of input channels. But how? The subsections 3.x should explain the hardware but they are still explaining the algorithm. A diagram of each block should also be given.
Response 3: I understand the comment, but diagrams of the inner blocks are just RTL VHDL descriptions, and I do not think that explaining the code would be of much benefit. However, we have rephrased the entire Section 3 to give more precise details about the implementation.
Point 4: Section 4.1 explains a strategy to reduce the hardware size. However, the hardware design was already proposed previously. The sections should be reordered.
Response 4: Right; the subsection has now been reordered, and “Accumulator Sectors” has been moved up as a subsection of the Accumulator.
Point 5: Instead of a performance design goal based on frequency, you should consider the required throughput, latency, delay, etc. Does the system fulfil the requirements?
Response 5: The paper describes the specifications not only in terms of clock frequency but also with estimates of the overall latency and of the maximum input and output throughput. These estimates are now given via formulas 6, 7 and 8. Yes, the system fulfils the specifications for the HEP experiments.
Point 6: The total FPGA resources must be described (LUTs, DSP, BRAM, etc.). This allows me to determine which FPGA I need to run the design.
Response 6: Yes, right; Table 2 has now been added to summarize the resource utilization, and a brief summary has been added in the Abstract.
Point 7: How does the FPGA receive and send data? What kind of interfaces are you using? What is the required data rate? The equations of ITT and OTT on page 10 should consider the true frequency.
Response 7: The FPGA uses internal transceivers for the input and output ports. As these transceivers are programmable for different rates, we have used the intermediate rate of 16 Gbps for the estimation. The final rate will depend on the external interfaces, which have not yet been fully specified by the experiments. Formulas 7 and 8 show the input and output throughput using 8 and 16 channels/transceivers, respectively.
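As an illustrative cross-check, assuming the aggregate rate is simply the number of transceivers times the 16 Gbps line rate (the exact formulas 7 and 8 are given in the paper):

```python
line_rate_gbps = 16               # intermediate transceiver rate used for the estimate
input_transceivers = 8
output_transceivers = 16

itt_gbps = input_transceivers * line_rate_gbps    # 128 Gbps aggregate input
ott_gbps = output_transceivers * line_rate_gbps   # 256 Gbps aggregate output
print(itt_gbps, ott_gbps)
```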
Point 8: Please discuss the scalability of the system. Can the system be scaled to achieve a higher throughput?
Response 8: At the moment the system can only be scaled down, as we have targeted the largest family of FPGAs available on the market. Of course, in the coming years even larger components will become available, but considering the effective cost of the individual FPGAs, a real physical implementation on prototype boards is not viable. Only studies on demo boards, with limited features, can be used today as proofs of concept, and this is what we have done.
Point 9: In general, the paper is just the hardware implementation of a tracking algorithm. The algorithm is known and the implementation has no new concepts. Therefore, I would recommend to submit the paper to a conference.
Response 9: The new concepts are:
- The overall fixed latency can be achieved only by using programmable devices through FW. SW would not guarantee a fixed latency, which is one of the required specifications here.
- The low level of abstraction has led us to implement a huge system that carries out thousands of operations in parallel without exceeding the FPGA resources. Using high-level synthesis or SW-based architectures, the same design would have expanded beyond the available resources. Many other studies have not succeeded for this reason, and this is why we have investigated this use case.
- The way the HT is implemented, with what we have called backward computation, is different from implementations targeted at smaller FPGAs and smaller systems.
These are the reasons why, at CERN, none of the experiments has ever implemented the HT in this way on a hardware system. This is mentioned in the paper.
Reviewer 3 Report
The paper presents an interesting new application for FPGAs.
However, the presentation needs a few improvements to make the paper more readable for the electronics community.
- The FPGA configuration bitstream cannot be called Firmware (FW). Firmware is reserved for the resident software in an embedded processor. While reading the paper, I spent a few moments looking for the embedded processor until I realized that FW actually means the hardware design and not the software design. I recommend you simply use the word Design instead of FW.
- Most systems these days do not use only hardware blocks. Part of the algorithm runs on an embedded processor that exists in the FPGA fabric or is external. This speeds up development as SW is easier to implement than HW. The processor controls a few IP core blocks (HW) that run the compute-intensive parts of the algorithm. So from the paper text, we can't know exactly the time that each of the blocks is spending, and if they are worthy of the silicon area they occupy. I understand this may be the first version but normally the detailed *time profiling* of the application is presented first before deciding it can all be implemented in firmware (the real firmware), hardware or a combination of the two. Clearly, an FW-only solution is not viable. However, the reader can't know if a mixed FW/HW solution would be best.
- A table showing the resources consumed by the design is missing and the paper can't be accepted without it. Namely, we need to know: #LUTs, #FFs (LUT connected flip-flops) #DSPs, #BRAMs, #PINs (external IO pins). This table is very important so that the implementation can be compared to future implementations.
- The paper should make clear what the FPGA is used for. It looks like the FPGA is used in a desktop setup to post-process data acquired somehow. In fact, discrete FPGA chips can be used to build special boards that process the data closer to its source, avoiding transmission and other complex operations with large volumes of data.
Author Response
Response to Reviewer 3 Comments
I thank the referee for the comments, which I address in detail below. The paper has also been fully revised according to the other reviewers' comments.
Point 1: The FPGA configuration bitstream cannot be called Firmware (FW). Firmware is reserved for the resident software in an embedded processor. While reading the paper, I spent a few moments looking for the embedded processor until I realized that FW actually means the hardware design and not the software design. I recommend you simply use the word Design instead of FW.
Response 1: Actually, the FW is called FW because it is a completely different thing w.r.t. SW. SW for embedded processors resides in a memory and is executed as a program. The bitstream of a FW is also saved into an FPGA memory, but it is only used to open and close hard connections among internal physical nodes. Once loaded, the FW makes the FPGA work like an ASIC, so it is fully hardware with hardwired connections. It has nothing to do with SW. I use the word “FW” because what we do cannot be implemented using SW, as it would not fit in the FPGA or other programmable devices with the same performance.
To address this point I have added this sentence at the beginning of section 4:
“In this project we have used FPGAs and we have designed a firmware (FW) starting from a low abstraction level of description, without using high-level synthesizers or other compilers targeted at other types of programmable devices, because such a huge system would not fit or converge with the same performance.”
Point 2: Most systems these days do not use only hardware blocks. Part of the algorithm runs on an embedded processor that exists in the FPGA fabric or is external. This speeds up development as SW is easier to implement than HW. The processor controls a few IP core blocks (HW) that run the compute-intensive parts of the algorithm. So from the paper text, we can't know exactly the time that each of the blocks is spending, and if they are worthy of the silicon area they occupy. I understand this may be the first version but normally the detailed *time profiling* of the application is presented first before deciding it can all be implemented in firmware (the real firmware), hardware or a combination of the two. Clearly, an FW-only solution is not viable. However, the reader can't know if a mixed FW/HW solution would be best.
Response 2: This point relates to what was already addressed in Point 1. When you say “This speeds up development as SW is easier to implement than HW”, I agree, if you succeed. But what do you do if the SW runs tens of times slower than what you need? And if you need a fixed latency, how can you achieve it using SW? This is why we need to implement FW on a HW device. To better address this point I have added the following sentence, again at the beginning of Section 4:
“In addition, it should be noted that using other programmable devices such as GPUs or CPUs instead of FPGAs, we would not be able to guarantee a fixed latency, as estimated below, and this is a requirement for HEP experiments.”
Point 3: A table showing the resources consumed by the design is missing and the paper can't be accepted without it. Namely, we need to know: #LUTs, #FFs (LUT connected flip-flops) #DSPs, #BRAMs, #PINs (external IO pins). This table is very important so that the implementation can be compared to future implementations.
Response 3: Yes, this is true, so we have added Table 2.
Point 4: The paper should make clear how the FPGA is used for. Looks like the FPGA is used on a desktop setup to post-process data acquired somehow. In fact, discrete FPGA chips can be used to build special boards that can process the data closer to its source avoiding transmission and other complex operations with large volumes of data.
Response 4: OK, right; I have added the following sentence in the conclusion: “The use of FPGAs in this context is driven by the need to reduce data on the fly during a data acquisition process. For this, FPGA components, compared to other programmable devices, can provide at the same time a low and fixed latency, low power consumption, and high throughput and data rate. In this way HEP experiments can reduce the acquired data flow by one to two orders of magnitude.”
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The Hough transform is used in several other contexts. There are several implementations of this transform in FPGA, from low-cost implementations to parallel implementations. Some of them are scalable and therefore can be scaled to larger FPGAs. So, any of these solutions could be adopted for this particular application.
You mention a clock frequency of 250 MHz, but according to Table 1, the maximum achievable frequency was between 214 and 235 MHz. Why do you still consider 250 MHz in the results section? Does this frequency include the whole system, including interfaces, or is it only the hardware accelerator?
The implementation requires a more detailed description. Your VHDL designs are not direct descriptions of the blocks illustrated in Figure 5. For example, the HT computation runs 9600 computations. How do you parallelize the operations? Are you working with floating-point or fixed-point? How do you implement the sliding windows block? There is a block for which you mention 1024 adders and comparators, but the expression has multiplications and divisions. These are just a few examples of why a detailed design of each block is important.
Considering scalability, independently of the FPGA size it is important to discuss how the performance changes with the available resources.
Why does using HLS invalidate the success of the design? Using VHDL could potentially lead to better results, but HLS directives let you control the possible explosion of resources.
Have you considered comparing the FPGA to a GPU solution? Should we adopt a GPU instead of an FPGA in this case? For this reason, the data format representation is important, since FPGAs are more efficient on fixed-point than on floating-point, compared to GPU.
I still stress that the paper is just the hardware implementation of a tracking algorithm. The algorithm is known, and the implementation has no new concepts. As far as I understand, the system was just simulated, and the results are estimates that do not take into account the real external access to data.
Author Response
Response to Reviewer 2 – second round Comments
I thank the referee for the comments, which I address in detail below.
Point 1: The Hough transform is used in several other contexts. There are several implementations of this transform in FPGA, from low-cost implementations to parallel implementations. Some of them are scalable and therefore can be scaled to larger FPGAs. So, any of these solutions could be adopted for this particular application.
Response 1: Correct, there are many implementations of the Hough Transform on FPGAs; however, as far as I have found, our system is not comparable in terms of the fixed latency that we need for the triggers in high-energy physics applications. This specification alone rules out all the implementations of the Hough Transform via GPUs or other software-based programmable devices. We need to know in advance how many clock periods the system based on the Hough Transform will take before sending the first valid data to the output. Our tests show a total fixed latency, which in any case can be slightly adjusted by working on the internal pipelines, of the order of 30 clock periods. Using a 250 MHz frequency this leads to a total latency around 120 ns, which matches our specifications.
Regarding the scalability of the system, I have seen implementations and examples that can surely be scaled to larger FPGAs, but all the works that I found are very far from having 8 inputs synchronized at 250 MHz and processing a data set composed of 8 × 500 vectors of 28 bits each (12+16), i.e. 112 kbit, within a 120 ns latency.
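A quick arithmetic check of the figures quoted above, using only the numbers already stated in this response:

```python
channels = 8
vectors_per_channel = 500
bits_per_vector = 12 + 16                       # 28 bits each
event_bits = channels * vectors_per_channel * bits_per_vector
print(event_bits / 1000)                        # 112.0 kbit, as stated

clock_mhz = 250
latency_clock_periods = 30
print(latency_clock_periods / clock_mhz * 1e3)  # 120.0 ns total fixed latency
```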
Point 2: You mention a clock frequency of 250 MHz, but according to table 1, the maximum achievable frequency was from 214 to 235. Why do you still consider the 250 MHz in the results section? Does this frequency include the whole system, including interfaces or it is only the hardware accelerator?
Response 2: OK. For the physical demonstrator, a clock frequency of 235 MHz is more than a credible starting point for a future implementation. The complete set of specifications is not closed yet, but we had to prove the validity of the system; this is why we did not push the synthesis process to reach the 250 MHz target. Just to give you some numbers: we ran the Synthesis and Place and Route process on a 12-core high-end Intel CPU at 4 GHz with 128 GB of RAM, and it took something like 10 hours to process the VHDL code.
Yes, we included the transceivers in the system: the entire set of 24 available transceivers on the Alveo U250. We use them for an 8-input, 16-output implementation.
Point 3: The implementation requires a more detailed description. Your VHDL designs are not direct descriptions of the blocks illustrated in figure 5. For example, the HT computation runs 9600 computations. How do you parallelize the operations? Are working with floating-point or fixed-point? How do you implement the sliding windows block? There is a block that you mention 1024 adders and comparators, but the expression has multiplications and divisions. These are just a few examples of why a detailed design of each block is important.
Response 3: OK. Regarding the floating-point/integer question, we are in any case using integers (std_logic_vectors); no floating-point descriptions are available at the low abstraction level that we use, and using floating-point data would not help, because the synthesizer will translate it into bits sooner or later. Parallel implementation of the internal circuits is done by synthesizing one “VHDL process” alone and then putting it into a loop with a fixed number of iterations; if this is compatible with the FPGA resources, it can provide good results. The expressions with multiplications and divisions are partially simplified using the precalculated fixed steps described by formula 4. In this way the internal DSPs are used for repetitive similar calculations, and we have balanced the FPGA resources (DSPs) against the number of iterations and the windows into which the process is split. This is very system dependent, but in the end we used a subset of the available resources in terms of LUTs, DSPs and RAMs, and a huge amount of routing capability; this is shown in the layout of Figure 8. In addition, the large FPGAs use super-logic regions, which are physically separate silicon dies, and we also split the clock trees into different branches so that they eventually fitted the different regions accordingly. This was a huge amount of work, done after realizing that high-level synthesis tools did not converge with the same specifications.
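For illustration, a minimal sketch (in Python, not the VHDL) of the precalculated-fixed-steps idea: the q/pT value for each successive phi0 bin is obtained by adding a precomputed constant, so no multiplication or division is needed per bin. The linear relation, ranges and bin counts are assumptions, not formula 4 itself.

```python
import math

N_PHI0, N_QPT = 1200, 100
PHI0_STEP = 2 * math.pi / N_PHI0

def qpt_bin(value, lo=-1.0, hi=1.0, nbins=N_QPT):
    # Map a q/pT value to a bin index, clipping at the edges.
    i = int((value - lo) / (hi - lo) * nbins)
    return min(max(i, 0), nbins - 1)

def fill_line(accumulator, hit_phi, hit_slope):
    # Start value and per-bin increment computed once per hit ...
    qpt = (0.0 - hit_phi) * hit_slope       # assumed linearized Hough relation
    step = PHI0_STEP * hit_slope
    for j in range(N_PHI0):                 # ... then only additions per bin
        accumulator[qpt_bin(qpt)][j] += 1
        qpt += step

acc = [[0] * N_PHI0 for _ in range(N_QPT)]
fill_line(acc, hit_phi=0.3, hit_slope=0.5)
```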
Point 4: Considering the scalability, independently of the FPGA size, it is important to discuss the scalability, that is, how does the performance change with the available resources.
Response 4: Scalability here is only available downward, by reducing the requirements in terms of input channels and event size and by relaxing the fixed latency. In the end we fitted the largest configuration we could and arrived at a system compatible with the initial specifications. The only way to scale the system up is to split it across parallel FPGAs, each one working on a subset of the Accumulator, for example. This is how many systems have been implemented in the electronics of the experiments at CERN.
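For illustration, a minimal sketch of this scale-up path, assuming the phi0 range of the Accumulator is split evenly across several devices; the sizes are assumptions.

```python
N_PHI0 = 1200                      # total phi0 bins (illustrative)
N_FPGAS = 4                        # number of parallel devices (assumption)

def phi0_slice(fpga_index, n_fpgas=N_FPGAS, n_bins=N_PHI0):
    # Each FPGA builds and searches only its own contiguous slice of bins.
    width = n_bins // n_fpgas
    lo = fpga_index * width
    hi = n_bins if fpga_index == n_fpgas - 1 else lo + width
    return lo, hi

print([phi0_slice(k) for k in range(N_FPGAS)])   # [(0, 300), (300, 600), ...]
```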
Point 5: Why does using HLS invalidate the success of the design? Using VHDL could potentially lead to better results, but HLS directives let you control the possible explosion of resources.
Response 5: No, HLS would not help; we tried it at the beginning, and other groups have also investigated this solution, but in any case only smaller implementations were achieved.
Point 6: Have you considered comparing the FPGA to a GPU solution? Should we adopt a GPU instead of an FPGA in this case? For this reason, the data format representation is important, since FPGAs are more efficient on fixed-point than on floating-point, compared to GPU.
Response 6: Not yet, because of the required fixed latency, which a software-based system cannot guarantee. The option might become viable if the latency specification is removed; this is possible, but it has not been the case so far.
Point 7: I still stress that the paper is just the hardware implementation of a tracking algorithm. The algorithm is known, and the implementation has no new concepts. As far as I understand, the system was just simulated, and the results are estimates that do not take into account the real external access to data.
Response 7: The system has not been tested in a real environment because this is not a viable option for the CERN experiment apparatuses, which are unique systems. The only thing we can do in advance is to build these demonstrators as proofs of concept. The novelty is that, so far, this has never been implemented, or even proved to work, in any of the CERN experiments, even though many others have used software-based tools.
Round 3
Reviewer 2 Report
No comments.