1. Introduction
The advent of modern circular hadron accelerators based on superconducting magnets has significantly enhanced performance, especially in terms of energy reach. A major drawback, however, is the inevitable presence of field errors inherent to superconducting magnets, which generate the field by means of current distributions (the field of normal-conducting magnets is instead controlled through the accurate shaping of their poles, which allows for finer control than a current distribution). These errors render the beam dynamics strongly nonlinear. The nonlinearity can excite resonances, leading to beam losses and to the growth of the transverse emittances. These detrimental phenomena adversely affect accelerator performance and call for mitigation strategies both during the design phase and later during machine operation. In this context, it is vital to identify an indicator that measures the impact of magnetic errors on the beam dynamics. Parenthetically, we note that nonlinear effects also play a role in the dynamics of circular lepton accelerators; the interested reader may refer to the selected references [1,2,3,4,5,6,7] and references therein. Among several possible options, the most effective indicator is the dynamic aperture (DA), which is defined as the extent of the connected phase-space region within which single-particle dynamics remain bounded (see, e.g., [8] and references therein). A key aspect of this definition is the specification of the time interval over which the dynamics should remain bounded; this parameter is determined by physical considerations, in particular the length of the accelerator's operational cycle. The DA offers essential insights into the nonlinear beam dynamics of non-interacting particles, as well as into the resonance mechanisms that provoke beam losses and reduce the beam lifetime. Despite its somewhat abstract nature, a direct link can be made to the beam losses resulting from nonlinear beam dynamics [9], a relationship that underpins the recent method for measuring the DA in circular rings [10].
Mastering the DA is crucial for optimising the performance of current circular particle accelerators, such as the CERN Large Hadron Collider (LHC) [11] and its luminosity upgrade (HL-LHC) [12], and it represents a fundamental figure of merit for the design of future accelerators, such as the CERN Future Circular Collider (FCC) in its hadron–hadron version (FCC-hh) [13] (see Ref. [14] for an overview of the most recent baseline layout).
The numerical determination of the DA involves tracking multiple initial conditions in phase space over numerous revolutions, the precise number of which depends on the application or on the specific physical processes involved. For example, in lepton rings, energy damping introduces a natural time scale that is relatively short, i.e., 10² to 10⁴ turns, for DA calculations. In contrast, in hadron rings, where energy damping is minimal or absent, the time scale is defined by the application; for the LHC or HL-LHC, it is usual to consider 10⁵ (or 10⁶) turns. Therefore, the methods presented in this paper focus primarily on the DA of hadron rings, where the absence of significant damping effects makes the long-term dynamics especially important and DA calculation a highly CPU-demanding activity. It should be noted that the standard number of turns used in LHC DA calculations corresponds to only about 9 s (or 90 s) of storage time, compared to the 10 to 15 h of a typical fill duration: there is thus a significant discrepancy between what can be computed and what is needed to model the actual LHC performance, and its treatment is crucial to the advancement of the field. The computation of the DA is very demanding, particularly for large hadron accelerator rings like the LHC. The computational challenge is two-fold: firstly, a large number of initial conditions is needed to accurately explore the phase space; secondly, a substantial number of turns is required to assess the stability of the dynamics. The first challenge can be mitigated, since the initial conditions are non-interacting, by parallelising the DA computation algorithms [15] or by exploiting the performance advantages provided by GPUs. The second challenge lacks a straightforward solution; the most effective approach has been the derivation of scaling laws for the DA as a function of the turn number. This approach relies on advanced theorems from dynamical systems theory, which yield the long-sought scaling laws [16,17].
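As an illustration of how such a scaling law can be used in practice, the following minimal sketch fits DA values at a few turn numbers with a model of the type proposed in Refs. [16,17], D(N) = D_∞ + b/(ln N)^κ. The sample data and initial guesses are purely illustrative, not values from this study.

```python
# Minimal sketch: fitting a DA-vs-turns scaling law of the form
# D(N) = D_inf + b / (ln N)^kappa, of the type discussed in Refs. [16,17].
import numpy as np
from scipy.optimize import curve_fit

def da_model(N, d_inf, b, kappa):
    """Asymptotic DA plus a term decaying with the logarithm of the turn number."""
    return d_inf + b / np.log(N) ** kappa

# Hypothetical DA values (in units of the rms beam size) at increasing turn numbers.
turns = np.array([1e3, 1e4, 1e5, 1e6])
da = np.array([11.2, 10.1, 9.4, 8.9])

popt, _ = curve_fit(da_model, turns, da, p0=[8.0, 20.0, 1.0])
d_inf, b, kappa = popt
print(f"D_inf = {d_inf:.2f} sigma, b = {b:.2f}, kappa = {kappa:.2f}")
# Once fitted, the model extrapolates the DA beyond the tracked range.
print(f"Extrapolated DA at 1e7 turns: {da_model(1e7, *popt):.2f} sigma")
```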
Conducting precise numerical calculations of the DA requires significant CPU resources. Nevertheless, a single DA value is rarely of practical use in the design and analysis of circular accelerators: meaningful outcomes require examining several machine configurations. There are at least two reasons for this. First, the accelerator model may be known only with limited precision. In such cases, a Monte Carlo method is typically employed, in which certain parameters describing the accelerator model are varied across a high-dimensional space. For the LHC, this technique applies to the errors affecting the superconducting magnets. Although an extensive measurement campaign was conducted to qualify all produced magnets, only a small subset was tested under cold conditions, i.e., the conditions of normal operation. Magnetic errors measured under these conditions can be used directly in DA simulations. For most magnets, however, the errors are known only under warm conditions, and a transformation is required to deduce the errors under cold conditions [18,19,20,21]. It is customary to neglect the impact of the misalignment errors of the various magnets on the field quality experienced by the charged beams. This methodology introduces some degree of uncertainty regarding the actual errors in these magnets. As a result, it is typical to produce realisations of the magnetic errors measured under warm conditions based on the uncertainty in the measurements. For the LHC, it is standard to consider sixty realisations (also known as seeds) of the ring lattice, each incorporating a different set of magnetic errors: overall, the true DA value should fall within the range of DA values obtained for the collection of realisations. The second reason for assessing multiple DA values is that optimising the ring configuration is a crucial aspect of circular accelerator design and operation. This is usually achieved by evaluating numerous machine configurations and examining the hyper-surface that represents the DA as a function of the machine parameters, identifying regions that maximise the DA and produce minimal DA variation close to the peak (see, e.g., Refs. [22,23]).
Observing the current trend in accelerator physics, it is clear that the construction of surrogate DA models has grown into a highly appealing research area. Modern computational technologies, such as machine learning (ML), offer an attractive answer to this demand. Recently, ML methodologies have been investigated to create efficient surrogate models for the DA (see, e.g., [24]) or to better reconstruct its time dependence [25]. Furthermore, deep learning (DL) techniques have been applied to quickly and precisely forecast the DA for unseen machine configurations [26]. By training a multilayer perceptron (MLP) on a comprehensive data set of simulated initial conditions, this method effectively captures the intricate relationships between accelerator parameters and the corresponding DA values. Moreover, this model has shown the ability to accurately represent even new optical configurations of the accelerator with minimal new data, thanks to its ability to extract universal physical features from common accelerator variables that influence beam dynamics. This property holds significant promise for the design and optimisation of future accelerators, underscoring the crucial role of surrogate models.
Expanding on our initial MLP, we conducted additional research, evaluations, and comparisons involving various cutting-edge DL architectures. Each architecture has been tested on the same data set to assess its potential to improve the accuracy of the forecasts. It is important to note that, while numerical accuracy is crucial, it should be weighed against the inference time, which needs to be sufficiently short to make these advanced architectures a viable substitute for traditional numerical computations. These points are thoroughly examined and discussed throughout this paper.
The layout of this paper is as follows: Section 2 describes the computation of the DA through direct numerical simulations and addresses the data preparation needed to implement the DL models. Section 3 examines the proposed DA regressor, providing a comprehensive review of the various architectures developed and evaluated in this research. The analysis of the developed DL architectures is divided into three sections: Section 4 focuses on the training and accuracy of the different architectures; Section 5 discusses the inference performance; and Section 6 looks at the surrogate model's performance when the DA is considered as a function of the number of turns. Lastly, conclusions are presented in Section 7, and the appendices provide details on some aspects of this study.
2. Simulated Samples for Deep Learning Models of the DA
2.1. Simulating the DA
The definition of the DA requires some care, and the details are given in Appendix A. The stability of the orbits should be checked by performing an appropriate sampling of the phase space, which is a very time-consuming task. For this reason, some simplifications can be applied so that only a subspace of the entire phase space is sampled. It is customary to consider a scan over initial conditions with vanishing transverse momenta, i.e., of the form (x, p_x, y, p_y) = (x, 0, y, 0), which reduces the computation of the DA to a 2D scan. Furthermore, in the rest of the paper, the concept of angular DA is used: it is simply defined as the stability radius at a fixed polar angle in the x–y plane, using the same notation as in Equation (A14).
An essential point to cover is the criterion for defining bounded orbits, which necessitates setting a threshold amplitude. Orbits whose amplitudes remain below this threshold throughout the tracking simulation are classified as bounded (or stable). If an orbit exceeds this threshold, it is deemed unbounded (or unstable), and the number of turns reached is recorded to define the orbit's stability time. This assessment is performed turn by turn, with a default threshold value of 1 m. In the LHC, in particular, a collimation system [11] is used to safeguard the cold aperture by absorbing particles at higher amplitudes. The jaws of this system serve as absorbers and set a physical limit on the orbit amplitudes. Therefore, in our DA simulations, we also considered defining the threshold amplitude based on the aperture of the collimator jaws.
From a computational perspective, we evaluated the DA by tracking initial conditions in phase space using XSuite [27,28]. These initial conditions are set in polar coordinates, allowing us to establish the stability limit for each angle, which is known as the angular DA. To speed up the tracking process, an initial coarse scan was performed using evenly spaced initial conditions over eight polar angles in the x–y plane and 33 radial amplitude values. (It is customary to express amplitudes in terms of the transverse rms beam size, which is computed starting from the value of the normalised emittance ε_n. In the case of the LHC [11], the nominal value of ε_n equals 3.5 μm in both the x and y planes.) This approach identifies the angular DA value for each angle and pinpoints the smallest amplitude at which the initial conditions do not remain bounded beyond 10³ turns, defining the boundary of the fast-loss region. Subsequently, a finer scan is performed within an amplitude window extending from inside the stable region to outside the fast-loss boundary, tracking for 10⁵ turns. This finer scan employs radial increments of 0.06 σ and examines 44 polar angles. Overall, this method provides a detailed phase-space scan, focusing on the areas near the stability boundary. This technique not only improves the accuracy of the angular DA computation but also triples the tracking efficiency by limiting the scanned space, as empirically shown [29].
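To make the two-stage procedure concrete, the sketch below reproduces the coarse-then-fine scan logic. Here `survived_turns` is a toy stand-in for the XSuite tracking call, and the angular range, scan limits, and the global fine-scan window are illustrative assumptions rather than the values used in the study (which applies the window per angle).

```python
# Minimal sketch of the coarse-then-fine polar scan, with amplitudes in
# units of the rms beam size.
import numpy as np

def survived_turns(r, theta, n_turns):
    # Toy model standing in for tracking: orbits below an angle-dependent
    # boundary are stable; faster losses at larger amplitudes.
    boundary = 10.0 + 2.0 * np.cos(2.0 * theta)
    return n_turns if r < boundary else int(n_turns * boundary / r)

def coarse_boundary(theta, r_values, n_turns=1_000):
    """Smallest amplitude lost before n_turns: the fast-loss boundary."""
    for r in r_values:
        if survived_turns(r, theta, n_turns) < n_turns:
            return r
    return r_values[-1]

def angular_da(theta, r_lo, r_hi, n_turns=100_000, dr=0.06):
    """Largest amplitude whose orbit stays bounded over n_turns."""
    da = r_lo
    for r in np.arange(r_lo, r_hi, dr):
        if survived_turns(r, theta, n_turns) >= n_turns:
            da = r
        else:
            break
    return da

coarse_r = np.linspace(0.5, 20.0, 33)            # 33 radial amplitudes
coarse_angles = np.linspace(0.01, np.pi / 2, 8)  # 8 polar angles (assumed range)
fast_loss = [coarse_boundary(t, coarse_r) for t in coarse_angles]

# Fine scan around the fast-loss boundary: 44 angles, 0.06 sigma steps.
fine_angles = np.linspace(0.01, np.pi / 2, 44)
da_curve = [angular_da(t, min(fast_loss) - 2.0, max(fast_loss) + 2.0)
            for t in fine_angles]
```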
An example of the results of these calculations, performed in the x–y space, is shown in Figure 1 for one of the LHC configurations that are part of the data set used to construct the DL surrogate models. We note that the horizontal and vertical axes in reality represent the values of the horizontal and vertical invariants. Therefore, for the case of hadron accelerators, for which the dynamics is symplectic, only positive values of the initial coordinates are considered.
2.2. Generating Accelerator Configurations and DA Samples
The data set used to build the DL models is made up of accelerator configurations generated using MAD-X [30,31,32,33]. The accelerator considered is the LHC, in the configuration used during the 2024 physics run at injection energy (450 GeV). This base configuration has been used to generate many variants, labelled by the values of some key accelerator parameters, namely the betatron tunes, the linear chromaticities, the strength of the Landau octupole magnets (these magnets are used in operation to passively mitigate collective instabilities; their strength is the key parameter and, in our studies, it has been labelled by the current that flows in the octupoles), and the realisations of the magnetic field errors (also called seeds) that have been assigned to the various magnet families. Furthermore, in these studies, the ring configurations for Beam 1 (clockwise beam) and Beam 2 (counter-clockwise beam) have been considered independently. Note that the two magnetic channels of Beam 1 and Beam 2 have only a relatively small number of single-bore magnets in common, namely 36 superconducting magnets and 30 normal-conducting magnets, while the two-bore magnets number several thousand. For the two-bore magnets, the magnetic field errors are different for the two apertures, thus justifying the approach of considering both beams as independent configurations.
An initial data set comprising 5 × 10³ LHC configurations was generated by performing a random uniform grid search over the following parameters: the horizontal and vertical betatron tunes, with steps of size 5 × 10⁻³; the linear chromaticity, with steps of size 2 (high chromaticity values were incorporated into the scan since they may actually be employed to mitigate collective instabilities, and it is important to integrate this aspect into our DA model; also, note that the same value is used for the horizontal and vertical planes); and the current of the Landau octupoles, scanned with a fixed step. For each of the 1000 configurations, five different realisations of the magnetic errors (seeds), randomly selected among the 60 available, have been considered over the two beams, resulting in a total of 5 × 10³ configurations.
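A sketch of how such a scan could be assembled is shown below. The parameter ranges are hypothetical placeholders; only the step sizes (5 × 10⁻³ for the tunes, 2 for the chromaticity) and the seed bookkeeping (five seeds out of 60 per base configuration, with the two beams treated as independent) follow the text.

```python
# Illustrative sketch of the random grid search over machine parameters.
import numpy as np

rng = np.random.default_rng(seed=42)

tunes = np.arange(0.26, 0.32, 5e-3)        # fractional betatron tunes (assumed range)
chromas = np.arange(0, 31, 2)              # linear chromaticity, same value in x and y
octu_currents = np.linspace(-40, 40, 9)    # Landau octupole current in A (assumed)

configs = []
for _ in range(1000):
    cfg = {
        "qx": float(rng.choice(tunes)),
        "qy": float(rng.choice(tunes)),
        "chroma": int(rng.choice(chromas)),
        "i_oct": float(rng.choice(octu_currents)),
    }
    # Five magnetic-error realisations (seeds) out of the 60 available;
    # each entry is assigned to one of the two independent beams.
    for seed in rng.choice(60, size=5, replace=False):
        configs.append({**cfg, "beam": int(rng.integers(1, 3)), "seed": int(seed)})

assert len(configs) == 5_000
```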
The control of the ring chromaticity is performed by means of two families of sextupole magnets that are installed in regular arc cells, close to the focusing and defocusing quadrupoles, as is customary for a FODO lattice. The Landau octupoles are also located close to focusing and defocusing cell quadrupoles, but only in a subset of the regular cells in the ring arcs.
Moreover, 5655 additional accelerator configurations (giving a data set with a total of 10,655 configurations) were generated using the active learning (AL) framework developed by our team [29]. This framework allows the smart sampling of accelerator configurations through a clever selection of the configurations with higher error estimates. As demonstrated in our previous studies, by prioritising configurations with larger error magnitudes, the AL framework enables a more efficient exploration of the machine configurations for which the surrogate model has not yet fully captured the underlying physics features.
We also considered six values of the threshold amplitude used to identify bounded orbits, namely four cases using different collimator apertures and one using the default aperture of 1 m.
The relevance of the evolution of the DA over time has already been mentioned above. Therefore, the possibility of reconstructing a model for this dependence has been explored by monitoring the stability time of the initial conditions. For this purpose, the stability times have been binned using 19 distinct intervals whose boundaries are expressed in thousands of turns.
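As an illustration, the binning can be implemented with a simple digitisation. The bin edges below (every five thousand turns up to 10⁵) are an assumed choice consistent with 19 intervals, not necessarily the edges used in the study.

```python
# Sketch of binning stability times into 19 intervals (assumed edges).
import numpy as np

edges = np.arange(5_000, 100_001, 5_000)   # 20 edges -> 19 intervals
stability_times = np.array([1_200, 48_000, 100_000, 73_500])  # example data

# Index 0 corresponds to losses before the first boundary.
bin_index = np.digitize(stability_times, edges)
```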
Concerning the number of turns performed in the tracking simulations, 3805 LHC configurations were tracked for 5 × 10⁵ turns, while for the remaining 6850 configurations, tracking was limited to 1 × 10⁵ turns. This approach was used to assess the capacity of the model to predict the angular DA over a larger number of turns even with a smaller sample size. Further discussion of this methodology is provided in Section 6.
The quantity of samples employed in our model is determined by the product of three factors: the number of accelerator configurations in the data set, the 44 angles used to examine the phase space, and the 19 bins used to observe the evolution of the stability time. This results in a considerable 836-fold increase (44 × 19) in the size of the data set of accelerator configurations, substantially improving the chances of model training convergence while allowing us to track the contours of the stability limits and their evolution.
As beam–beam effects are not included, the value of the beam emittance represents a simple scale factor. Therefore, it is possible to augment the angular DA data set, computed for the nominal emittance value, by using different values of the normalised beam emittances ε_x and ε_y. This strategy does not require any additional CPU-intensive angular DA computation, but rather only a simple rescaling of the existing results: since amplitudes are expressed in units of the rms beam size, the angular DA for emittances ε_x and ε_y is obtained from the nominal one through rescaling factors of the form √(ε_n/ε_x) and √(ε_n/ε_y), where ε_n is the nominal normalised emittance value used to compute the angular DA. In our studies, we considered emittance values obtained by perturbing the nominal one with a Gaussian-distributed random variable with zero mean and a small sigma, which corresponds to having always almost equal horizontal and vertical emittances. This approach is designed to achieve two objectives: it enables the surrogate model to grasp an initial understanding of how the DA is influenced by the beam emittance value and, more significantly, it introduces a method to even out the distribution of the angular DA. The latter is essential to provide an unbiased training set for the DA regressor, which is achieved by randomly sampling the inverse distribution of the angular DA values after augmentation. The distribution of the angular DA is illustrated in Figure 2, both before (red) and after (blue) the augmentation and unbiasing steps. Initially, the higher DA values were not sufficiently represented; following these adjustments, the distribution becomes more uniform, resulting in a better-balanced representation of DA values.
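A minimal sketch of these two steps is given below, assuming the √(ε_n/ε) rescaling stated above for DA values expressed in units of the rms beam size; the sigma of the Gaussian perturbation, the placeholder DA values, and the histogram bin count are assumptions.

```python
# Sketch of the emittance-based augmentation and inverse-distribution unbiasing.
import numpy as np

rng = np.random.default_rng(0)
eps_n = 3.5e-6  # nominal normalised emittance [m rad]

def augment(angular_da, eps):
    """Rescale angular DA (in sigma units) from the nominal emittance to eps."""
    return angular_da * np.sqrt(eps_n / eps)

# Nearly equal horizontal/vertical emittances around the nominal value.
delta = rng.normal(0.0, 0.05, size=10_000)        # sigma of delta is assumed
eps_values = eps_n * (1.0 + delta)

def unbias(samples, n_bins=50):
    """Resample with weights inversely proportional to the DA histogram."""
    counts, edges = np.histogram(samples, bins=n_bins)
    idx = np.clip(np.digitize(samples, edges[1:-1]), 0, n_bins - 1)
    weights = 1.0 / counts[idx]
    weights /= weights.sum()
    pick = rng.choice(len(samples), size=len(samples), p=weights, replace=True)
    return samples[pick]

da_samples = rng.uniform(4.0, 16.0, size=10_000)  # placeholder angular DA values
balanced = unbias(augment(da_samples, eps_values))
```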
The total number of samples after the unbiasing step is about 50 million. Of these, 10% were used for validation and 10% to test the performance of the model, while the rest was used for training.
2.3. Pre-Processing of DA Data Set
The labels used to denote machine configurations might be inadequate for characterising the beam dynamics. Consequently, to provide a more comprehensive description of the phenomena that affect the DA, we incorporated several variables calculated using MAD-X and PTC [34,35,36]. For a better characterisation of the optical configuration of the LHC ring, we included the maximum values of the Twiss β functions in the two transverse planes, along with the phase advance between the high-luminosity insertions of the ATLAS and CMS experiments, which are located at IP1 and IP5, respectively. This approach aims to better capture the interrelationship between the optical parameters and the angular DA. Additionally, to encapsulate observables related to nonlinear beam dynamics, we accounted for seven anharmonicity coefficients, specifically the amplitude-detuning terms up to second order [37].
A crucial step is the standardisation of the input variables that take real values; note, however, that the beam and seed variables take discrete values and are excluded from this step. The process entails normalising the distributions using their mean and standard deviation. By ensuring a uniform scaling of all input features, this step helps achieve quicker model convergence and improves model stability [38]. Consequently, it improves the performance and interpretability of the model.
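A short sketch of this pre-processing step is given below; the file name and column names are illustrative placeholders.

```python
# Sketch of the standardisation step: continuous inputs are normalised
# with their mean and standard deviation, while the discrete beam and
# seed labels (and the regression target) are left unchanged.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_parquet("da_dataset.parquet")   # hypothetical input file
discrete_cols = ["beam", "seed"]
target_col = "angular_da"
continuous_cols = [c for c in df.columns
                   if c not in discrete_cols + [target_col]]

scaler = StandardScaler()
df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
```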
In contrast to [26], we did not cap the DA values, as our augmentation and unbiasing steps result in a balanced distribution of DA values, allowing the model to learn from the full range of DA values without introducing bias.
3. DA Regressor
The DA regressor is a neural network designed to take as input the accelerator parameters, the polar angle, and the number of tracked turns, and to predict the corresponding angular DA value. We explored several network architectures using the TensorFlow library [39] and used the rectified linear unit (ReLU) activation function [40] to enhance the nonlinear learning capabilities. Furthermore, the hyperparameters were fine-tuned using a random search with the Keras Tuner framework [41] to improve the model performance. The subsequent subsections offer a comprehensive description of each tested network.
The network architectures discussed in this paper showcase the top-performing setups discovered during our investigation, encompassing various layer configurations and types. Our experiments revealed that attention mechanisms and residual connections were especially effective in representing the numerical variables in this data set. Although these configurations delivered optimal results, it is possible that they are tailored to this specific data set, suggesting that exploring a wider range of hyperparameters might provide additional insight. Nonetheless, we consider the effectiveness of attention and residual mechanisms in this scenario to be noteworthy, potentially pointing to their broader applicability in similar data sets.
3.1. Multilayer Perceptron
The MLP consists of a fully connected neural network; its advantage over more complex architectures lies in its simplicity, requiring fewer computational resources and less data for training. The implemented network structure closely follows that of our previous study [29]. It includes four hidden layers with 2048, 1024, 512, and 128 nodes, respectively. Batch normalisation is applied to each hidden layer to stabilise and speed up training by normalising the layer outputs. To avoid overfitting, three dropout layers between the hidden dense layers with a rate of 1% were applied in [29]; however, after performing hyperparameter tuning for this new data set, the optimal dropout rate was found to be 50%. The structure of the MLP network used is shown in Figure 3.
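A Keras sketch of this topology is given below; the input width is a placeholder, and the dropout placement (one dropout per hidden layer) is a simplification of the layout described above.

```python
# Sketch of the MLP: four hidden layers (2048, 1024, 512, 128 nodes),
# batch normalisation, dropout at rate 0.5, and a linear output node
# for the angular DA.
import tensorflow as tf
from tensorflow.keras import layers

n_features = 32  # placeholder for the number of input variables

model = tf.keras.Sequential()
for width in (2048, 1024, 512, 128):
    model.add(layers.Dense(width, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(1))  # angular DA regression output
model.build(input_shape=(None, n_features))
```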
3.2. Bidirectional Encoder Representations from Transformers
The Bidirectional Encoder Representations from Transformers (BERT) neural network [42] was initially designed to improve natural-language understanding tasks thanks to its unique architecture. BERT is characterised by its use of self-attention mechanisms in multiple Transformer blocks [43], which allows it to capture complex, bidirectional dependencies within the input data. The embedding in the network converts the input features into dense vectors, which effectively represent the patterns in the data. These vectors are then processed through self-attention layers, where each token in the input sequence can consider and weigh its relationship to all the other tokens in the sequence. This dynamic attention mechanism adjusts its focus based on the relevance of the tokens, allowing the model to capture subtle contextual relationships across the entire sequence, enhancing its ability to learn intricate relationships and improving its predictive accuracy.

Our BERT-based network is designed for numerical data processing and consists of 12 Transformer encoders. Each Transformer encoder comprises a multi-head self-attention layer with eight attention heads, allowing the model to learn from different representation subspaces simultaneously. This is followed by a single feed-forward neural layer (FFN) with a hidden size of 512. Layer normalisation and dropout layers with a rate of 0.5 are applied before and after each FFN to stabilise the training and prevent overfitting. A global average pooling layer is included after the Transformer blocks to reduce the sequence dimension and aggregate the information into a fixed-size vector, ensuring efficient downstream processing. The structure of the BERT network used is shown in Figure 4.
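The sketch below shows how one such encoder block can be written in Keras with the hyperparameters quoted above (eight heads, FFN size 512, dropout 0.5); the embedding dimension is an assumed value, and the exact ordering of normalisation and dropout may differ from the implemented network.

```python
# Sketch of a Transformer encoder block and the 12-block stack.
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(x, d_model=64, n_heads=8, ffn_units=512, rate=0.5):
    # Multi-head self-attention with a residual connection.
    attn = layers.MultiHeadAttention(num_heads=n_heads, key_dim=d_model)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(rate)(attn))
    # Feed-forward layer, projected back to d_model for the residual sum.
    ffn = layers.Dense(ffn_units, activation="relu")(x)
    ffn = layers.Dense(d_model)(ffn)
    return layers.LayerNormalization()(x + layers.Dropout(rate)(ffn))

inputs = layers.Input(shape=(None, 64))  # sequence of embedded features
x = inputs
for _ in range(12):
    x = transformer_encoder(x)
x = layers.GlobalAveragePooling1D()(x)   # aggregate into a fixed-size vector
outputs = layers.Dense(1)(x)             # angular DA regression head
model = tf.keras.Model(inputs, outputs)
```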
3.3. Densely Connected Convolutional Networks
The Densely Connected Convolutional Network (DenseNet) introduces a novel connectivity pattern in which each layer is directly connected to every other layer in a feed-forward fashion [44]. This dense connectivity results in a significant reduction in the number of parameters compared to other deep network configurations of similar depth, leading to more compact models with improved computational efficiency. The structure of the DenseNet network used is shown in Figure 5.
Our implementation of DenseNet consists of 121 layers, including dense blocks, transition layers, and fully connected layers. The network begins with an initial convolutional layer, followed by max pooling to reduce the spatial dimensions. Then, a series of four dense blocks, each separated by a transition layer, is applied. Within each dense block, multiple layers are connected directly to every other layer, promoting gradient flow and alleviating the vanishing-gradient problem. A global average pooling layer is used before the fully connected layers to aggregate the features, and dropout with a rate of 0.5 is used to prevent overfitting, enhancing the generalisability of the model.
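The essence of the dense connectivity is captured by the following 1D sketch: each layer receives the concatenation of all preceding feature maps. The growth rate, block depth, and compression factor are assumed values, not the DenseNet-121 hyperparameters.

```python
# Sketch of a 1D dense block and transition layer in the DenseNet spirit.
from tensorflow.keras import layers

def dense_block(x, n_layers=4, growth_rate=32):
    for _ in range(n_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv1D(growth_rate, kernel_size=3, padding="same")(y)
        x = layers.Concatenate()([x, y])  # dense connectivity
    return x

def transition_layer(x, compression=0.5):
    # Compress the channel count and halve the sequence length.
    n_filters = int(x.shape[-1] * compression)
    x = layers.Conv1D(n_filters, kernel_size=1)(x)
    return layers.AveragePooling1D(pool_size=2)(x)
```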
3.4. Residual Networks
The Residual Network (ResNet) is a deep convolutional neural network (CNN) known to address the challenges of training very deep networks through the introduction of residual learning [45]. It leverages residual connections, or skip connections, that allow the network to bypass one or more layers. Instead of learning the full output transformation at each layer, the network learns the residual (or difference) between the input and output of the layer. This mechanism helps mitigate issues such as the vanishing-gradient problem and enables the construction of much deeper networks without performance degradation. The structure of the ResNet network used is shown in Figure 6.
We used a relatively shallow variant of this topology, based on ResNet-18, making it computationally efficient and suitable for tasks where faster training and inference are needed without sacrificing performance. It starts with an initial 1D convolutional layer followed by eight residual blocks, each containing two convolutional layers with batch normalisation. The network ends with a global average pooling layer to aggregate the features, a dense layer with 1024 units, and a dropout layer with a rate of 0.5 to mitigate overfitting.
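A sketch of one such 1D residual block is shown below; the projection shortcut for mismatched channel counts is a standard choice assumed here.

```python
# Sketch of a 1D residual block: two convolutional layers with batch
# normalisation and a skip connection, as in the ResNet-18 variant.
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:        # match channel count if needed
        shortcut = layers.Conv1D(filters, 1)(shortcut)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))
```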
3.5. Visual Geometry Group
The Visual Geometry Group (VGG)-16 model developed for our application features a deep architecture with a series of convolutional blocks designed for effective feature extraction [46]. Each VGG block consists of multiple 1D convolutional layers followed by max pooling, which reduces the spatial dimensions of the feature maps by selecting the maximum value in each region, allowing the network to focus on the most important information while downsampling. The structure of the VGG-16 network used is shown in Figure 7.
After the convolutional layers, the features are flattened and passed through two fully connected dense layers with 4096 units each, including a dropout layer with a rate of 0.5 for regularisation.
3.6. Hybrid
Finally, we developed a hybrid network that integrates three key components: a Transformer encoder, a residual block, and a dense network. The input features are converted into dense vectors through an embedding layer. Then, the Transformer encoder, featuring a multi-head attention layer with a feed-forward layer, captures complex dependencies and contextual information. This is followed by a residual block that enhances feature learning through skip connections. A dense block builds upon the residual connections, incorporating dropout and concatenation to improve robustness and performance. The dense network processes the extracted features with dense layers and an additional dropout layer (with a 0.5 rate) for regularisation. Combining these elements ensures that the network maintains high performance while being computationally efficient compared to traditional DenseNet and BERT models. The structure of the Hybrid network used is shown in Figure 8.
4. Training and Precision of DA Models
For the training of all the networks tested, we used about 1.1 × 10⁴ accelerator configurations of the LHC 2024 optics at injection energy, comprising data tracked for 1 × 10⁵ turns and 5 × 10⁵ turns, totalling approximately 9.1 million samples. The batch size used for training is 1024 samples. The loss function used for the regressor is the mean absolute error (MAE), and the networks are trained with the NADAM optimiser [47]. The initial learning rate is 5 × 10⁻⁵ and is halved after every five sequential epochs without improvement of the validation loss, to enhance the model accuracy. Early stopping with a patience of 20 epochs was used to prevent overfitting.
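This setup maps directly onto standard Keras callbacks, as in the sketch below; a defined `model` and the arrays `x_train`, `y_train`, `x_val`, `y_val` are assumed, and `restore_best_weights` is our addition, a common companion to early stopping.

```python
# Sketch of the training setup: MAE loss, NADAM optimiser, initial
# learning rate 5e-5 halved after 5 epochs without validation-loss
# improvement, and early stopping with a patience of 20 epochs.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Nadam(learning_rate=5e-5),
    loss="mae",
)
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=20, restore_best_weights=True),
]
model.fit(x_train, y_train, batch_size=1024, epochs=500,
          validation_data=(x_val, y_val), callbacks=callbacks)
```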
Figure 9 illustrates the performance improvements throughout the training process and the adjustments to the learning rate.
The MAE, the mean absolute percentage error (MAPE), and the root mean squared error (RMSE) for the test (train) data set are presented in Table 1, while Table 2 shows the training time spent. The BERT model achieves the lowest MAE on the test set; however, it required 37 h of training, reflecting a high computational cost. DenseNet-121 exhibits a reasonable test MAE but required the longest training time (138 h), indicating a significant computational cost to achieve its performance. ResNet-18 shows a good balance between prediction accuracy and training time, taking only 39 min per epoch. In contrast, the MLP (baseline), with moderate error metrics, completed the training in only 3 h. Moreover, there are no signs of overfitting, as evidenced by the similar error metrics between the training and test sets in Table 1. This is further supported by the training progress graph shown in Figure 9, which demonstrates consistent performance across both data sets.
Figure 10 illustrates the 2D histogram of the predicted versus the computed angular DA for all evaluated networks. This visualisation shows that most models perform well for the majority of the data points, with a tight cluster along the diagonal line, indicative of accurate predictions. Additionally, the Pearson correlation coefficient is presented, showing values approaching unity in all cases, albeit marginally lower for the Hybrid and VGG-16 architectures.

Figure 11 presents the comparison of the distributions of the computed and predicted angular DA, demonstrating their compatibility and further validating the predictive accuracy of the models. In addition to the significant 16% improvement in MAE achieved by the BERT architecture compared to our former MLP baseline model, BERT also excels in angular DA coverage. As illustrated in Figure 11, all models typically underestimate the final angular DA bin. This is a familiar problem in regression tasks, as regressors generally exhibit increased errors near boundary values. However, it should be noted that the BERT model delivers markedly more precise predictions, particularly at the higher angular DA values where the MLP falters. This is crucial, since predicting values close to the extremities of a distribution can be challenging. The attention mechanism of BERT enables it to focus on the most pertinent input features, thus improving its accuracy at the boundaries, whereas the uniform weighting of the MLP often degrades performance in these areas. This underscores the superior predictive power and robustness of BERT. In closing, it is noteworthy that the rise in the predicted DA observed in one of the angular DA bins remains unexplained, as it is absent in the numerical data.
In addition, we assessed the models using the Kolmogorov–Smirnov (KS) test, which measures the maximum distance between the empirical distribution functions of the predicted and computed angular DA. The KS test values further highlight the effectiveness of BERT, which achieves the best KS value, followed by DenseNet, ResNet-18, and the MLP baseline. Models such as Hybrid and VGG-16 exhibit poorer performance, with larger KS values. In this regard, we note that the KS values might be high because these models tend to prioritise the reduction of the errors in the central parts of the distribution, where most of the data are concentrated. This occurs at the cost of precision in the tails; the KS test is sensitive to differences at the extremes of the distribution, which can result in a less precise performance of the model in these areas. These results support the superior distributional alignment achieved by BERT, underlining its robustness over varying values of the angular DA.
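The KS statistic can be computed directly with SciPy, as in the short sketch below; the file names are hypothetical placeholders.

```python
# Sketch of the Kolmogorov-Smirnov comparison between predicted and
# computed angular DA values, using the two-sample test from SciPy.
import numpy as np
from scipy.stats import ks_2samp

da_computed = np.load("da_test_computed.npy")    # hypothetical files
da_predicted = np.load("da_test_predicted.npy")

ks_stat, p_value = ks_2samp(da_computed, da_predicted)
print(f"KS statistic = {ks_stat:.4f} (smaller means closer distributions)")
```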
Although the Hybrid network developed for the regression of the angular DA exhibited faster inference times than DenseNet, BERT, and ResNet, its prediction accuracy lagged behind. This could be attributed to the network's insufficient depth, preventing it from fully exploiting the benefits of residual connections, attention mechanisms, and dense blocks.
5. Inference Performance of Models
The inference time of DL networks is influenced by several aspects, including the number of samples and the number of model parameters. These elements can lead to a significant increase in computational costs, affecting both the duration of the computation and the use of memory. To address this issue, we assessed the performance of two hardware architectures designed for accelerated operation with our DA regressor: an Apple M3 MAX [48] and an NVIDIA V100 [49]. A brief overview of the hardware setup is provided in Appendix B.
Table 3 displays the timing performance of the various network architectures when predicting a single angular DA value and a complete accelerator configuration on the two hardware platforms. Examples of complete machine configurations are illustrated in Figure 12. The findings indicate that the NVIDIA V100 generally surpasses the M3 MAX in inference speed for most architectures, even though the latter features quicker I/O speeds. In addition, architectures with deeper layers and more intricate operations, such as BERT, exhibit longer inference durations due to higher computational demands. In contrast, simpler models or those with fewer parameters, such as ResNet-18 and the MLP (baseline), generally exhibit faster inference times.
While the conventional numerical tracking approach using XSuite and the HTCondor batch system [50] can manage around 10³ jobs simultaneously and takes approximately 30 h to track particles across 10³ different configurations, equating to about 107 s per accelerator configuration, BERT, even as the most intricate network, delivers a significant performance boost. On the M3 MAX, BERT achieves a speed-up factor of roughly 675, and on the NVIDIA V100, it delivers an even higher speed-up factor of about 2 × 10³. Naturally, it is important to emphasise that achieving such a significant speed-up is only feasible after constructing an adequately large data set, which allows for the computation of DL surrogate models. This data set is created through conventional numerical tracking simulations, which are computationally very expensive. Consequently, the benefit of the surrogate model lies in the context of exploring a wide array of ring configurations, such as for design studies or optimisation purposes.
6. Prediction of DA for Higher Number of Turns
In previous studies [26,29], we limited the tracking of particles to 1 × 10⁵ turns. Certainly, it is valuable to predict the DA for a larger number of turns, but expanding the tracking is computationally demanding. For this reason, as mentioned in Section 2, only one third of the ring configurations in our data set were tracked up to 5 × 10⁵ turns. This introduces the possibility of testing the ability to capture the evolution of the DA when tracking particles for an even longer duration, even if only a fraction of the samples contains extended tracking information.
Figure 13 shows the absolute-error distribution for each of the bins used to distribute the stability time when the BERT network is used. The performance on the test data set demonstrates that the model accurately predicts the angular DA, even with limited information on extended turn counts. This indicates that the model captures the evolution of the DA as a function of the turn number, even though the training of the surrogate model has been performed with only a reduced sample containing simulation results up to 5 × 10⁵ turns. It should be noted that there were slight discrepancies in the MAE values between the machine configurations tracked up to 1 × 10⁵ turns and those tracked up to 5 × 10⁵ turns, as indicated in Figure 13, which may be due to differences in the statistical characteristics of the training data. In particular, the input variables may exhibit different distributions between the samples tracked up to 1 × 10⁵ turns and those tracked up to 5 × 10⁵ turns, potentially impacting the model's predictive accuracy across these subsets. These distribution shifts could lead to the observed performance discrepancies.
7. Conclusions
This research investigates several advanced deep-learning architectures to improve DA prediction in circular particle accelerators, using the 2024 LHC optics at injection as a benchmark. We experimented with models including BERT, DenseNet, and ResNet, finding substantial enhancements in prediction accuracy compared to our previous MLP model. Specifically, the BERT model achieved the highest precision, although it required more computational resources. ResNet-18, however, provided a balanced trade-off between performance and computational efficiency with respect to the inference time, with a training time only about 30% greater than that of BERT. All evaluated networks showed remarkable speed increases in inference times over traditional tracking methods, achieving predictions that were at least 675 times faster on advanced hardware. Furthermore, these models effectively predicted the DA over a greater number of turns, demonstrating their robustness even with limited training data on long-duration tracking. This progress presents new opportunities for optimising and designing future accelerators by providing fast and accurate DA predictions, reducing the dependency on time-consuming simulations. Implementing these deep-learning models in beam-dynamics research represents a substantial advancement in the development of efficient surrogate models.
Future studies could extend this work by comparing the performance of the models with measured beam data from the LHC to further validate their effectiveness. Furthermore, incorporating variables that describe the collision process, such as the beam separation and the crossing angle, could extend the ability of the models to predict the angular DA under more complex operational conditions. An alternative research direction involves crafting physics-informed models that incorporate the DA scaling law as a function of the turn number. This approach could significantly improve our ability to predict accelerator performance over time durations that are relevant for standard operations.