1. Introduction
High-performance computing (HPC) systems are essential in order to run a variety of large-scale applications in multiple domains such as bio-informatics, astronomical analysis, nuclear simulations, financial services, etc., as well as large machine learning models applied to numerous use cases. As the backbone of HPC systems, interconnection networks are responsible for connecting up to hundreds or thousands of compute nodes (a compute node may, in turn, consist of hundreds to thousands of processing cores) [
1] by providing fast and low-cost communication in the order of microseconds [
2]. A primary factor in dictating the performance of interconnection networks is topology, which specifies the structure that is used to connect compute nodes. To achieve high interconnected performance, it is critical to design network topologies that have small diameters and low hop counts.
Prior works have proposed a number of regular and irregular topologies including rings, meshes, tori, hypercubes, fat trees, Clos, Butterfly, and Dragonfly. Many of them have been deployed in practical HPC systems. Interestingly, despite the seemingly mature development of topologies, recent work has demonstrated the large potential of a very different class of topologies, which we refer to as
shortcut-augmented topologies. Such a topology starts with a base topology (e.g., a ring) and adds a series of shortcuts on top of that. Surprisingly, with careful selection, shortcut-augmented topologies are able to reduce both network diameter and average shortest hop count by multiple folds, compared with existing widely used regular and irregular topologies [
3].
While this is promising, a major roadblock of further improving this class of topologies is the enormous design space that is formed by the combinations of possible shortcuts. As an example, for a 64-node network, there are around possible edges. If 128 edges are selected, there are over combinations! To make things worse, this design space grows super-exponentially as the network size increases, thus greatly exceeding the capabilities of exhaustive methods. Conventional methods of design space searching, such as simulated annealing (SA), would be extremely slow. This calls for novel, efficient heuristic approaches that exploit unique characteristics of the shortcut selection problem.
A straightforward heuristic is to add shortcuts randomly (subject to available free ports in the switches). With a large number of repetitions, a good shortcut-augmented topology may be found for a small network size. However, given the rapid increase in design space, this method quickly becomes insufficient for larger networks. Several papers have pointed out that adding random shortcuts can enhance the performance of their proposed network topologies [
4,
5,
6,
7], but these works do not directly propose approaches that can generate shortcuts more effectively. The state-of-the-art heuristic [
3] is to consider the usefulness of shortcuts when adding shortcuts for a given vertex. Specifically, for each vertex
v, the method examines a set of random nodes that are connected to
v, and selects the top
y (say 3) nodes that have the longest shortest paths from
v. It then adds
y shortcuts from
v—one to each of the
y nodes. We refer to this method as
vertex-based random shortcut (VRS) approach. While this improves the quality of the generated topologies, our analysis reveals that VRS often adds shortcuts that are locally useful at a vertex but are not much use globally. This is because the
y longest shortest paths at a vertex may not necessarily be considered as long paths in the entire network. Consequently, VRS requires more shortcuts to be added, which leads to additional costs of links and higher-degree switches.
To address this issue, this paper proposes EdgeCut, an effective heuristic approach for generating high-quality shortcut-augmented topologies. The main novelty is to identify more globally useful shortcuts by thinking from the perspective of edges, rather than from the perspective of vertices in prior work. We propose two variations of EdgeCut. EdgeCut-Full considers the performance impact on all node pairs after a shortcut is added. It produces the best topologies but needs to update all pairs’ shortest paths after each addition, thus resulting in longer search time. EdgeCut-Lite approximates this function, leading to reduction over EdgeCut-Full in search time while still generating satisfying topologies. Evaluation results show that, compared with simulated annealing, the proposed EdgeCut reduces the search time by and generating better or equivalent diameter in 94.9% of the test network topologies and sizes. Compared with VRS, EdgeCut reduces the network diameter by 55.1% while being slightly faster. These results highlight the effectiveness of the proposed approach.
The rest of the paper is organized as follows.
Section 2 provides more background on HPC interconnection networks and shortcut-augmented topologies.
Section 3 describes details of the proposed EdgeCut approach for identifying high-quality shortcuts.
Section 4 presents evaluation methodology and results, and
Section 5 includes further discussions. Finally,
Section 6 concludes the paper.
3. Proposed Approach
In this section, we describe the details of the proposed EdgeCut approach (implementation and instructions on running the proposed approach are available at
https://github.com/OSU-STARLAB/EdgeCut (accessed on 1 September 2022)). We start by introducing some notations and definitions, and then present two versions of EdgeCut, namely EdgeCut-Full and EdgeCut-Lite, that offer different trade-offs between efficiency and effectiveness.
3.1. Definitions, Notations, and Assumptions
An interconnection network topology for
N compute nodes can be abstracted as a graph with
N vertices. The hop count between two nodes is the length of the path between the corresponding two vertices in the graph. Without additional information, a typical way to estimate the hop count is to use the shortest path length (SPL) between two vertices. From this, we can define an
hop count matrix, where each entry
corresponds to the SPL of two nodes
i and
j. As illustrated in the example in
Figure 1, the shortest path between node 0 and node 1 is
one hop, and the shortest path between node 0 and node 2 is
two hops. As a result, entry (0,1) in the hop count matrix has a value of one and entry (0,2) has a value of two. Entries in the hop-count matrix are initialized to
. Every switch has a degree cap of
d. We assume that
d is large enough to enable at least one valid topology that fully connects the
N nodes.
3.2. EdgeCut-Full (ECF)
The problem with vertex-based random shortcut (VRS) is that it strictly requires every vertex to add a fixed number of shortcuts. Consequently, even if the shortcuts are optimal locally at a vertex, they may not be optimal globally. This also indicates that VRS uses more shortcuts to achieve a better topology than is necessary. To address this issue fundamentally, the proposed EdgeCut considers the pool of all possible shortcut candidates directly, and allows flexibility at a specific node as long as the degree cap is not violated.
The first instantiation, EdgeCut-Full (ECF), evaluates the impact of adding a potential shortcut that is more comprehensive by calculating the SPL between all node pairs if that shortcut is added. The rationale behind all-pair shortest path calculation is that adding a single shortcut may potentially affect all nodes. Algorithm 1 shows the pseudo code of ECF. Similar to other shortcut-augmented topologies, ECF starts with a base topology (such as a ring, mesh, torus, or any existing topology) and gradually adds random shortcuts based on their usefulness on reducing the network for low latency. From the base topology node connections, the hop count matrix and degree of each nodes are initialized, and the current all-pair average shortest path length (ASPL) is calculated. Then, the algorithm generates
t random edges from the pool of shortcut candidates. For each edge, it re-calculates ASPL assuming that the edge is added, and updates the current best edge if ASPL is reduced.
Algorithm 1 EdgeCut-Full (ECF) for adding shortcuts |
- 1:
Initialize , , from base network topology - 2:
while ( < ) do - 3:
Randomly generate t edges (under degree constraint) - 4:
= - 5:
for each edge in t edges do - 6:
Calculate - 7:
if ( < ) then - 8:
= - 9:
= - 10:
end if - 11:
end for - 12:
Use to update , , and - 13:
+= 1 - 14:
end while
|
Once all the t edges are considered, the best edge with the lowest ASPL is added to the network. This shortcut addition is continued until the cap of allowed shortcuts is reached.
In terms of computation complexity, the while loop runs for c iterations assuming that up to c shortcuts can be added. To add an edge, the algorithm searches for t random edges, and each requires the recalculation of ASPL (Line 6). Since only one edge is added, the update can be done by enumerating any two points u and v with the update , which takes . Similarly, updating on Line 12 also takes . Putting things together, the time complexity of ECF is which is . Note that the hop count matrix of the base topology is known beforehand, so Line 1 includes only the time to copy the hop count matrix locally, not recalculating all-pair shortest paths for the base topology.
3.3. EdgeCut-Lite (ECL)
EdgeCut-Full estimates the impact of adding a candidate shortcut accurately but is also computationally expensive. The dominating component comes from updating ASPL in the inner loop, which takes
each time. To alleviate the complexity issue of ECF, we propose EdgeCut-Lite (ECL) that simplifies this estimation. To select a shortcut to add, we still generate and evaluate among
t random edges. For each random edge
, we focus on how much hop count can be reduced between node
x and node
y if the edge is added. Since the edge directly connects
x and
y to have a hop count of one, the reduced hop count is equivalent to the previous hop count between
x and
y (minus one). This hop count value can be retrieved from the hop count matrix with just
time. The random edge with the greatest hop count reduction is selected. The pseudo code for ECL is given in Algorithm 2. Lines 4–10 correspond to the above process. As an example, consider three candidate random edges for node pairs
,
, and
where the corresponding nodes have an SPL of 4, 8, and 6 respectively. In this case, ECL chooses the edge for
as it has the largest SPL; an additional edge between node 3 and 8 would reduce their shortest distance to 1, resulting in a saving of 7 hops. At the end of the
loop, the hop count matrix is updated with the selected edge.
Algorithm 2 EdgeCut-Light (ECL) for adding shortcuts |
- 1:
Initialize , , from base network topology - 2:
while ( < ) do - 3:
Randomly generate t edges (under degree constraint) - 4:
= 0 - 5:
for each edge in t edges do - 6:
if ( > ) then - 7:
= - 8:
= - 9:
end if - 10:
end for - 11:
Use to update , , and - 12:
+= 1 - 13:
end while
|
Following a similar complexity analysis to ECF, the complexity of each loop in ECL consists of the loop which is now simply and the hop count update which is . Overall, the time complexity of ECL is .
4. Result and Analysis
4.1. Methodology
The proposed EdgeCut-Full (ECF) and EdgeCut-Lite (ECL) are evaluated quantitatively against simulated annealing (SA) and virtex-based random shortcut (VRS) algorithms. Multiple topologies including ring, mesh, and torus are used as the base topology to assess the efficacy of the algorithms in adding shortcuts. All the algorithms are implemented in Python 3 and Numpy and executed on the same CPU to ensure fair comparison. In our experiments, in addition to N that denotes the number of nodes in the network, we sometimes also use to better reveal scalability trends. Different network sizes are evaluated including N = 16, 25, 36, ..., 256 nodes, which corresponds to n of 4, 5, 6, ..., 16. We use a switch degree cap of 10 and the limit of allowable shortcuts for the main evaluation in this section. More sensitivity studies on different degree cap values and shortcut limits are presented in the Discussion section.
4.2. Computation Time
Computation time (i.e., search time) required for SA, VRS, ECF, and ECL to reach a stable solution is measured by the running time on an Intel(R) Core(TM) i9 9900 CPU @3.1 GHz.
Table 1 compares the search time for different network sizes under the four schemes. For clarity, only the results for using torus as the base topology are shown, as the ring/mesh have very similar numbers. As can be seen, SA takes the longest time for each size, and the search time increases rapidly as the network size increases. This is not scalable and can be costly to repeat for design space exploration if any of the network parameters change. VRS is much faster due to its simplicity in heuristic but has poor performance, as shown shortly. In comparison, the proposed ECF and ECL are
and
faster than SA, respectively (and with better generated topologies as shown later). ECL is
faster than ECF and is also slighter faster VRS on average.
4.3. Reduction in Diameter
Diameter measures the longest shortest path in a network and is an important metric that relates to a number of considerations such as latency, cost, reliability [
23].
Table 2 compares the impact on reducing diameter when shortcuts are added for the four schemes.
As can be seen from the table, the proposed ECL, while being the fastest among the four, also achieves very low diameter. Specifically, ECL produces better or equal diameter in 37 out of 39 test sizes on three base topologies (94.9%) compared with SA. This is significant given that SA is designed to achieve near-optimal diameter at the cost of long search time. The proposed ECF optimizes the network diameter further than SA (and with much faster search than SA). For example, with a ring base topology of nodes, the diameter of 19 in ECF is better than the 22 in ECL, and is also lower than the 24 in SA. In contrast, the existing vertex-based heuristic VRS has the largest diameter, e.g., 49 in the above example, reducing only about 61% from the base diameter, whereas ECF reduces 85% from the base. Comparing between VRS and ECL, ECL reduces the diameter by 55.1% relatively. Across the three base topologies, improvement of different schemes depends on the ratio of the number of added shortcuts to the number of base edges. For the given shortcuts, ring with base edges benefits more than mesh with , and torus with base edges benefits the least.
4.4. Average Shortest Path Length
While diameter captures boundary cases, the average shortest path length (ASPL) provides insights on the average cases for network latency.
Figure 2 plots the ASPL for different sizes of ring, mesh, and torus networks when shortcuts are added. The proposed EdgeCut-Full has the lowest ASPL in all the evaluated sizes and topologies. The proposed EdgeCut-Lite has slightly higher ASPL than EdgeCut-Full, but is still very close to SA. Meanwhile, VRS performs significantly worse, with a wider gap for larger network sizes, indicating that the heuristic in VRS does not explore the full potential of shortcuts.
4.5. Randomness
All the compared algorithms exhibit randomness, as part of the algorithms involves randomly generated edges. To examine this aspect in more detail, we plot the ASPL over 100 independent runs of the algorithms for adding shortcuts to a 64-node ring network.
Figure 3 demonstrates that SA, ECF, and ECL not only generate lower ASPL than VRS but also produce stable results constantly. In comparison, VRS generates good results only 25% of the time.
5. Discussion
5.1. Effect on Degree Limit
A per-switch degree cap of 10 is used in the above experiments. To gain more insight on how the four schemes may perform under different degree limits, we have evaluated a 64-node mesh network when the degree limit is varied. The results are plotted in
Figure 4. Although the specific improvement varies, it is evident that the proposed ECF and ECL have lower ASPL than SA and VRS for all degree limits. ECF is slightly better than ECL, as expected, due to the more comprehensive consideration of the impact of adding shortcuts.
5.2. Effect of Shortcuts Limit
Adding shortcuts reduces latency but increases the cost of cables. Moreover, switches have finite ports to connect to compute nodes and other switches, which also limits the number of shortcuts that can be added. The evaluation in
Section 4 assumes a shortcut limit of
where
. Here, we conduct additional experiment that varies the shortcut limit for a 64-node network with ring as the base topology. The results are presented in
Figure 5. Note that the previous
corresponds to 16 here, and the number of added shortcuts ranges from 2 to 64. As expected, the ASPL reduces in all four schemes as more shortcuts are added. The proposed ECF and ECL consistently perform better than the other two schemes. Although the figure shows a trend of diminishing return with small differences when a large number of shortcuts are added, this is prohibitively expensive. Therefore, for practical systems, the proposed EdgeCut has substantial advantage.
5.3. Routing and Multi-Level Networks
Irregular networks including shortcut-augmented topologies usually requires topology-agnostic adaptive routing [
3]. Research on this aspect is well established, and routing scheme much as [
24] with up*/down* routing as the escape path is able to remove routing from the performance bottleneck, thus reflecting the benefits of the underlying topology.
Shortcut-augmented topologies, such as the ones that are generated from our proposed EdgeCut, can be employed in various networks. Large networks are often constructed in multiple levels, e.g., an intra-group network that connects compute nodes within a rack, and an inter-cabinet network that connects racks. EdgeCut can be used to generate topologies at each of those levels.
5.4. EdgeCut over Deterministic Shortcuts
In addition to comparing with VRS and SA that generates shortcuts randomly, we have also compared the proposed EdgeCut with two topology designs that add shortcuts deterministically.
The first topology is hierarchical rings that connect a set of rings hierarchically to achieve better scalability than regular ring networks [
25]. Our evaluation shows that, for a 64-node network, hierarchical ring topology has an ASPL of 5.84 and a diameter of 12. In comparison, the proposed ECL with
shortcuts has an ASPL of 4.88 and a diameter of 10, and ECF has an ASPL of 4.67 and a diameter of 9. Therefore, EdgeCut is considerably better than hierarchical rings in both metrics.
The second topology is flattened butterfly (FB) that adds shortcuts between nodes on the same row/column in a mesh network [
26]. For an
network, we follow the original paper to construct a flattened butter from four
sub-networks to keep the bisectional bandwidth reasonable. Evaluation shows that FB has an ASPL of 3.39 and a diameter of 7; whereas ECL and ECF have ASPL of 3.39 and 3.3, respectively, and diameter of 7 and 6, respectively, so again better than FB in both metrics. Furthermore, ECL and ECF use only 128 total edges in contrast to 204 edges required in FB, thus making EdgeCut a more cost-effective approach.
5.5. Additional Considerations
The evaluation in this paper focuses on average hop count, diameter, and number of edges, all of which are well-established metrics to assess topologies. For a given system, however, the choice of topologies also depends on several other considerations, such as traffic patterns, queuing effects, cost of links (e.g., electrical vs. optical), cost of switches, etc. Additional evaluation is needed to take these factors into account, such as simulating on a cycle-accurate interconnection network simulator.