We use linear programming instances from NETLIB [26]. There are a total of 138 linear programming problems of varying sizes: the smallest has 22 non-zeros and the largest has 1,408,073 non-zeros in the coefficient matrix. Thirty of the instances are infeasible, thirty-nine are not of full rank, and for four further test problems the initialization scheme [27] does not produce an initial feasible solution. We do not use these 73 instances in our experimentation. We define the absolute and relative gap as follows.
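With $c^{T}x$ denoting the primal objective value and $b^{T}y$ the dual objective value, a standard formulation is (the normalization by $1 + \lvert c^{T}x \rvert$ in the relative gap is one common convention, stated here for concreteness):

\[
\text{absolute gap} = \lvert c^{T}x - b^{T}y \rvert, \qquad
\text{relative gap} = \frac{\lvert c^{T}x - b^{T}y \rvert}{1 + \lvert c^{T}x \rvert}.
\]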
For all experiments, the stopping criterion is a relative gap below a fixed tolerance or a maximum of 700 iterations. Although the absolute gap is generally more stringent than the relative gap, achieving it for problems with large objective function values may require many iterations. In IPMs, significant reductions occur during the initial iterations, while only small reductions happen in the final iterations. Therefore, in this article, we use the relative gap in our experiments.
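For concreteness, a minimal sketch of this stopping rule in Python (the tolerance value and function name below are placeholders, not the exact settings of our implementation):

REL_GAP_TOL = 1e-8   # placeholder tolerance; the experiments use a fixed relative-gap threshold
MAX_ITERS = 700      # iteration cap used as the second stopping condition

def should_stop(primal_obj: float, dual_obj: float, iteration: int) -> bool:
    # Relative gap with a standard normalization (stated for concreteness).
    rel_gap = abs(primal_obj - dual_obj) / (1.0 + abs(primal_obj))
    return rel_gap <= REL_GAP_TOL or iteration >= MAX_ITERS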
3.1. Experiments on CPU
We demonstrate the effectiveness of the new approach by executing Algorithm 1 and comparing the results to the classical interior-point algorithm (the primal-dual logarithmic barrier method, Chapter 7 of [8]). Note that, instead of calculating the search direction using Newton's method as in the classical algorithm [8], Algorithm 1 finds the search direction using the modified Newton approach; this is the primary difference between the two. Specifically, in each iteration, Algorithm 1 first updates the auxiliary point and then uses the auxiliary point, together with the principal point, to find the search direction. In contrast, the classical algorithm relies solely on information from the principal point to determine the search direction. We coded the three algorithms (Algorithm 1, Algorithm 2, and the classical approach) in Python 3.10, using BLAS routines for the equation-solving step in all of them. Of the remaining 65 instances, the three algorithms solved 50 within a total time limit of 24 h on the Alliance Canada cluster Cedar (2 × Intel E5-2683 v4 Broadwell CPUs, 2.1 GHz, 125 GB RAM; https://alliancecan.ca/en, accessed on 27 August 2023) [28]. The remaining fifteen large instances were not completed within the time limit by at least one of the algorithms tested.
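To make this structural difference concrete, the following is a minimal, generic sketch of a two-step Newton iteration for a nonlinear system F(z) = 0, in which an auxiliary point is computed first and the search direction is then corrected using information from that point; the update rule shown is illustrative and does not reproduce the exact step formulas of Algorithm 1.

import numpy as np

def two_step_newton_iteration(F, J, z):
    # Jacobian at the principal (current) point; in an IPM this is the
    # Jacobian of the perturbed KKT system.
    Jz = J(z)
    # Stage 1: auxiliary point from a plain Newton step.
    z_aux = z - np.linalg.solve(Jz, F(z))
    # Stage 2: corrected search direction, re-evaluating the residual at the
    # auxiliary point while reusing the Jacobian from the principal point.
    direction = -np.linalg.solve(Jz, F(z_aux))
    return z_aux, z + direction

In such a scheme the Jacobian from the principal point (and hence its factorization) is reused for both solves, so the per-iteration cost remains close to that of a single Newton step.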
To demonstrate the efficacy of the new method, we applied Algorithms 1 and 2 to a test problem and computed the objective function value at each step of the two-step method. Figure 1 illustrates, for each iteration, the difference between the objective function values at the auxiliary point and at the principal point. The plot shows that the objective function value in the second stage is consistently lower than in the first stage, as indicated by the positive values of this difference. This validates the proposed method, demonstrating its potential to reduce the number of iterations required. Additionally, it indicates that the auxiliary point effectively guides the algorithm in correcting its search direction, leading to more efficient convergence.
Next, for the same instance, lp_scsd8, we examine (i) the rate at which the absolute gap decreases and (ii) the decrease in the relative gap between the primal and dual solution values as a function of the iteration count. Figure 2 shows the two plots. Algorithms 1 and 2 achieve a faster reduction in both the relative and the absolute gap compared to the classical approach, i.e., fewer iterations are needed overall. Also note the sharp reduction in the later iterations for Algorithms 1 and 2 (this phenomenon is not universal; for some instances the sharp reduction occurs in the initial iterations instead).
To quantify the effectiveness of the two-step method, we observe two quantities: (i) the speedup and (ii) the relative reduction in the number of iterations achieved by Algorithms 1 and 2 compared to the classical algorithm. If Algorithm $i$ takes time $T_{i,j}$ on instance $j$ and the classical algorithm $C$ takes time $T_{C,j}$ on the same instance, then the speedup of Algorithm $i$ on instance $j$ is defined as $T_{C,j}/T_{i,j}$. The relative reduction in the number of iterations is defined analogously: if Algorithm $i$ takes $N_{i,j}$ iterations on instance $j$ and the classical algorithm takes $N_{C,j}$ iterations, the relative reduction of Algorithm $i$ is $N_{C,j}/N_{i,j}$.
Figure 3 and Figure 4 show the relative reduction in the number of iterations and the speedup as a function of the number of instances. The instances are ordered by size in an array, and the average speedup and relative reduction are computed for each suffix of this array: if the x-coordinate is n, then the y-coordinate is the average speedup (respectively, relative reduction) over the n largest instances (a short sketch of this computation follows below). The average speedup over all 50 instances is between 1.3 and 1.4 for both Algorithms 1 and 2, and the average relative reduction is between 1.20 and 1.25 for both. If we focus only on the 5 largest instances, the speedup and relative reduction are comparatively higher, which indicates that the method scales well. Algorithm 1 has a greater relative reduction, but this does not translate into a greater speedup; Algorithm 2 has the better speedup. The instances indexed 40–45 in the plots show anomalous behaviour; these appear to be a specific type of instance (staircase instances) on which the performance of Algorithms 1 and 2 differs from the average. One of the prominent features of the proposed algorithm is its ability to find a more appropriate search direction than traditional interior-point algorithms. As demonstrated in [20], the modified Newton method exhibits a faster convergence rate, and the new algorithm leverages this property to reach the optimal solution in fewer iterations, a particularly valuable advantage when dealing with high-dimensional test problems. In scenarios where matrix multiplication and inversion are computationally intensive, the reduced number of iterations significantly decreases the execution time and enhances overall performance.
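For illustration, a minimal sketch of this computation (the data is a placeholder; the instances are assumed to be already sorted from smallest to largest):

import numpy as np

# Placeholder per-instance CPU times, sorted by instance size (smallest to largest).
t_classical = np.array([12.0, 30.0, 95.0, 240.0])
t_new = np.array([9.0, 22.0, 60.0, 150.0])

# Per-instance speedup of the new algorithm; the relative reduction in
# iterations is computed the same way from iteration counts.
speedup = t_classical / t_new

# Suffix averages: the k-th entry is the average speedup over the k largest instances.
suffix_avg = np.cumsum(speedup[::-1]) / np.arange(1, speedup.size + 1)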
Table 1 presents the average number of iterations and CPU time (in seconds) for the three methods: the Classical Algorithm, Algorithm 1, and Algorithm 2. The Classical Algorithm required an average of 97.82 iterations and 166.71 s of CPU time. In contrast, Algorithm 1 demonstrated improved efficiency with an average of 71.34 iterations and 136.29 s, and Algorithm 2 also showed promising results with 72.52 iterations and 135.46 s. These findings indicate that both Algorithms 1 and 2 are more effective than the Classical Algorithm in reducing the number of iterations and the CPU time; notably, Algorithm 1 achieved the lowest number of iterations, while Algorithm 2 achieved the lowest CPU time. Moreover, the fourth column reports the standard deviation of the number of iterations for the three algorithms, indicating that Algorithm 1 has the lowest variability, followed by Algorithm 2, with the Classical Algorithm having the highest variability.
The details of the problem instances can be found in [29]. The raw data used to build these figures is in Appendix A and is self-explanatory. The data in Appendix B is divided into blocks of three rows each: the first row in each block is the data for the classical algorithm, and the other two rows are the data for Algorithms 1 and 2. Table A2 lists the relative gap and the norms, showing the convergence of the three algorithms to an optimal solution on several instances from NETLIB.
3.2. Experiments on GPU
A study by Świrydowicz et al. [30] investigated the use of GPU-accelerated linear solvers for optimizing large-scale power grid problems. Motivated by their study, we evaluate the performance of our algorithms on massively parallel hardware (a GPU with tensor cores), using library routines to perform the factoring.
We utilized the PyTorch library to develop a GPU-enabled version of our program; the library requires all tensors involved in a computation to reside on the same device. During our experiments, we used an NVIDIA V100 Volta GPU (on the GRAHAM Compute Canada cluster), which imposes a maximum problem size due to its 16 GB of HBM2 memory. This card features tensor cores that perform certain tensor operations very efficiently in hardware. GPU arithmetic is typically conducted in low precision, ranging from 8 to 32 bits, which accounts for much of the speed. Unfortunately, the algorithms failed to converge to an optimal solution when we employed low-precision arithmetic on the V100, so we opted for double precision (64 bits), even though this substantially increased the run time.
To utilize GPU acceleration, we convert all vectors and matrices into tensors and use the PyTorch library instead of NumPy for computations; when running on a CPU, we represent these structures as regular vectors and matrices. Essential operations such as matrix multiplication, matrix inversion, and step-size computation are performed in tensor mode, which improves the efficiency of the calculations and lets the algorithm reach the optimal point faster.
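A minimal sketch of this setup (the data below is random and purely illustrative; the actual program applies the same pattern to all matrices and vectors of the algorithm):

import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative data standing in for the matrices and vectors of the LP.
A_np = np.random.rand(2000, 2000)
b_np = np.random.rand(2000)

# Move to the GPU in double precision; low-precision arithmetic did not
# converge in our experiments, so float64 is used throughout.
A = torch.from_numpy(A_np).to(device=device, dtype=torch.float64)
b = torch.from_numpy(b_np).to(device=device, dtype=torch.float64)

# Both operands live on the same device, so the solve runs on the GPU.
x = torch.linalg.solve(A, b)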
The computationally expensive steps in the algorithm proposed here (as in any IPM) are (i) the determination of the Jacobian and (ii) the computation of its inverse to determine the step direction. The first step has quadratic complexity, $O(n^{2})$, when implemented naturally; the second step is computationally more costly, $O(n^{3})$ for dense matrices. In the approach here, gains are realized asymptotically in the second step, and we therefore focus on the time the GPU takes to perform matrix inversions. The raw data is shown in Appendix C: the second and third columns are the iterations needed by the classical algorithm and by the algorithm proposed here, and the last two columns are the times (in seconds) needed by the GPU to perform the matrix inversions.
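A minimal sketch of how such inversion times can be measured with PyTorch (the matrix size is arbitrary; explicit synchronization is needed because GPU kernels are launched asynchronously):

import time
import torch

device = torch.device("cuda")
A = torch.rand(4096, 4096, dtype=torch.float64, device=device)

torch.cuda.synchronize()              # ensure previously queued GPU work is done
start = time.perf_counter()
A_inv = torch.linalg.inv(A)           # dense matrix inversion on the GPU
torch.cuda.synchronize()              # wait for the inversion kernel to finish
elapsed = time.perf_counter() - start
print(f"GPU inversion time: {elapsed:.3f} s")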
Results for GPU
Table A3 presents the results for the Classical Algorithm and Algorithm 1 on the GPU. We report the number of iterations and GPU times for 41 test problems. Note that, under the stopping conditions, the Classical Algorithm could not reach the optimal solution for 5 test problems and Algorithm 1 could not find the optimal solution for 1 test problem. We remove the results for these 6 test problems and calculate the relative reduction and speedup on the remaining instances.
Figure 5 presents the relative reduction and speedup for these 35 test problems. Moreover, Table 2 reports the average number of iterations and GPU time over the 35 test problems. Based on this table, we conclude that the proposed method reduces both the number of iterations and the GPU time. Moreover, the standard deviation of the number of iterations indicates that Algorithm 1 has the lowest variability, at 60.19.