1. Introduction
Gaussian processes (GPs) have gained significant popularity in the fields of machine learning and statistics, owing to their remarkable adaptability and capacity to capture intricate, non-linear associations. These non-parametric Bayesian models empower the inference of complex, non-linear functions from data while furnishing valuable uncertainty assessments. Essentially, GPs embody a stochastic process that posits a probability distribution over functions, ensuring that any finite collection of function values conforms to a joint Gaussian distribution.
In essence, a GP lays down a prior distribution over functions, subsequently refined through Bayesian inference as more data become available. Notably, GPs exhibit the unique capability of accommodating irregularly spaced and noisy data, as well as capturing inherent correlation structures within the data [1,2,3].
The utility of GPs spans an array of tasks, encompassing regression [4,5,6], classification [7,8], and optimization [9,10]. These versatile models have found successful applications in diverse domains, including support for bike-sharing systems [11], computer vision [12], and even diesel engine optimization [13]. A prominent feature of GPs is their ability to yield probabilistic predictions and effectively account for data noise, as previously noted.
However, the predominant drawback of employing GPs lies in their computational demands [3]. Considerable efforts have been dedicated to enhancing the computational efficiency of GPs, particularly when dealing with large datasets [14,15]. Prominent techniques in this regard include sparse approximation methods and scalable algorithms, which aim to make GPs more accessible and computationally tractable [16,17].
In recent years, the integrated nested Laplace approximation (INLA) framework has emerged as a powerful and efficient approach for modeling GPs in various statistical and machine learning applications. INLA combines elements of Bayesian modeling with numerical approximations, offering a computationally tractable alternative to traditional Markov chain Monte Carlo (MCMC) methods for estimating GP parameters and making predictions [18]. INLA excels in providing accurate and rapid inference for GPs, particularly when dealing with large datasets or complex models [19]. It leverages a combination of deterministic numerical integration techniques and Laplace approximations to approximate the posterior distribution of GP parameters, such as the length scale and amplitude hyperparameters. This approach significantly reduces the computational burden associated with GP modeling, making it feasible to work with extensive datasets and intricate models [20].
Less attention has been given to improving the efficiency of using GPs as generative models. In general, GPs are utilized as generative models within a larger algorithmic framework. For instance, the authors of [21] demonstrate a technique for embedding unlabeled categorical data into a continuous space using a GP. As a probabilistic generative model, it is capable of estimating densities. They present a generative classification model that operates as a supervised approach for labeled categorical data. Moreover, the authors of [22] developed the Gaussian process density sampler, an exchangeable model for nonparametric Bayesian density estimation. This model generates independent samples from a transformed function drawn from a GP prior. Their approach facilitates the inference of an unknown density from data through MCMC methods, which yields samples from both the posterior distribution over density functions and the predictive distribution in the data space. A good representation of a GP generative model optimized through approximation is presented in [23], where the authors address the problem of Koopman mode decomposition. Their generative GP model allows for the concurrent estimation of specific quantities and latent variables governed by an unidentified dynamical system. Moreover, they propose an effective method for parameter estimation within the model, which utilizes low-rank approximations of covariance matrices. The optimization of the generative model in that case relies primarily on reducing the dimension of the covariance matrix, a commonly employed method for reducing computation time. An underexplored aspect of such optimizations involves decreasing the dimensions of this matrix in a manner that additionally provides further information to the generative model. Our proposed approach for selecting computation points for GPs is based on the premise of not only reducing computation time but also enhancing the generative model's performance through output in the form of a functional series.
In this paper, we explore the feasibility and efficiency of leveraging the properties of Chebyshev nodes and series for efficient GP calculations. Our main contributions are as follows:
Selecting computational points for the GP by using the properties of Chebyshev nodes. The proposed approach seeks to mitigate the computational challenges associated with GPs by reducing the number of computation points and adopting a specific method of gathering information for generating functions with a series representation.
An algorithm output in the form of a functional series, achieving low computational complexity with minimal loss of accuracy. Normally, it is difficult to ensure a series representation, but our approach addresses this challenge with the use of fast Fourier transform (FFT) formulas that convert function values at Chebyshev points into coefficients of the Chebyshev series. This can be done with minimal loss of accuracy (one bit) while facilitating effective interpolation of the series.
An algorithm that is especially amenable to computationally heavy methods. Alleviating the computational burden by restricting the covariance matrix to the Chebyshev node calculations can enable the use of high-cost computational methods, such as MCMC sampling, on a wider scale.
Improving the usability of GP generative models. Ensuring function-series output via FFT conversion, at low cost and with little accuracy loss, can lead to overall improved generative model capabilities, for example, in terms of classification.
The rest of the paper is organized as follows. We begin by presenting the fundamental theory behind GPs, generative models, and Chebyshev interpolation that we use in our solution as well as a description of computational methods used in the experiment section (MCMC and INLA framework). Subsequently, we detail the experiments conducted and the application of the proposed methodology to GP computations. Finally, we conclude with a discussion and summary of our findings.
2. Materials and Methods
This section of the article is dedicated to explaining the fundamental theory behind Gaussian processes (GPs), including their covariance functions and practical applications. Furthermore, we describe the theory and use of generative models to highlight the application context of the method. Moreover, we focus on the components of Chebyshev interpolation that we employ in our proposed process, starting with a description of Chebyshev nodes, continuing through the convergence of the functional series, and detailing the algorithm for converting values at Chebyshev nodes into Chebyshev series coefficients using the FFT. We also describe the computational methods used in the experimental section, namely Markov chain Monte Carlo (MCMC) methods and the integrated nested Laplace approximation (INLA) framework, which are employed for efficient Bayesian inference. Additionally, this section provides an overview of the algorithm based on the aforementioned theory, explaining how these elements are combined to address the problem.
2.1. Gaussian Process
We need to provide a concise definition of GPs and explore their fundamental properties. A widely accepted and succinct definition of a GP comes from [2] and can be summarized as follows:
A Gaussian process is a collection of random variables, any finite number of which jointly follow a Gaussian distribution.
To define a GP mathematically, let us introduce the mean function $m(x)$ and the covariance function $k(x, x')$ for a real process $f(x)$ as follows:
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))\,(f(x') - m(x'))\big].$$
With these definitions, we can formally express a GP as follows:
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big).$$
The mean and covariance functions provide a complete characterization of the GP prior, describing the expected behavior of the function, f.
To define a GP in another way, we can consider the following set of observations: $X = \{x_1, \dots, x_n\}$. The prior distribution of $\mathbf{f} = \big(f(x_1), \dots, f(x_n)\big)^{\top}$ then follows a multivariate Gaussian distribution, $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$, where $\mathbf{0}$ is the zero mean vector and $K$ is the covariance matrix with elements $K_{ij} = k(x_i, x_j)$. The joint distribution of $\mathbf{f}$ and a new variable $f_* = f(x_*)$, assuming a zero mean, is expressed as follows:
$$\begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0}, \begin{bmatrix} K & \mathbf{k}_* \\ \mathbf{k}_*^{\top} & k(x_*, x_*) \end{bmatrix}\right),$$
where $\mathbf{k}_*$ represents the covariance between $X$ and $x_*$ (creating a covariance vector), and $k(x_*, x_*)$ is the prior variance of $f_*$ [2].
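The joint formulation above translates directly into code. The following minimal Python sketch (using numpy; the squared-exponential kernel and all parameter values are illustrative assumptions rather than the settings used later in the experiments) assembles the block covariance of $(\mathbf{f}, f_*)$ and draws one joint sample from the zero-mean prior.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0):
    """Illustrative squared-exponential covariance k(x, x')."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)       # training inputs
x_star = np.array([0.25])            # a new input x_*

K = sq_exp_kernel(x, x)              # K with elements k(x_i, x_j)
k_star = sq_exp_kernel(x, x_star)    # covariances between X and x_*
k_ss = sq_exp_kernel(x_star, x_star) # prior variance of f_*

# Assemble the block covariance of the joint Gaussian [f, f_*].
joint_cov = np.block([[K, k_star], [k_star.T, k_ss]])

# Draw one joint sample from the zero-mean prior (small jitter for numerical stability).
jitter = 1e-9 * np.eye(joint_cov.shape[0])
sample = rng.multivariate_normal(np.zeros(joint_cov.shape[0]), joint_cov + jitter)
f, f_star = sample[:-1], sample[-1]
print(f_star)
```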
In a broader context, GPs are stochastic processes commonly used for modeling data observed over time, space, or both [1]. They can be viewed as generalizations of normal probability distributions, where each random variable (scalar or vector in the case of multivariate distributions) is described by a GP. Formally, a GP on a set $T$ is a collection of random variables indexed by a continuous variable, denoted as $\{f(t) : t \in T\}$, with the property that any finite subset of these random variables follows a multivariate Gaussian distribution [24,25].
To fully specify a GP, one needs to define the mean (often set to zero for simplicity) and covariance functions. However, variations in the mean function can be considered for interpretability or prior specification purposes [1,2,26]. The covariance function, also known as the kernel function, quantifies the similarity between data points and is typically chosen from a set of predefined functions [27]. While choosing the appropriate covariance function is crucial, any function generating a positive definite covariance matrix can be used [28]. Nevertheless, creating new covariance functions that are both mathematically and practically useful can be challenging.
In this research, our attention is primarily directed toward exploring the intricacies and applications of two distinct kernels: the radial basis function (RBF) and the Matérn kernel. These kernels were chosen for their unique properties and relevance in the field of study, offering us the opportunity to delve deeper into their theoretical foundations and practical implementations. The RBF kernel is defined as follows:
$$k_{\mathrm{RBF}}(x, x') = \exp\!\left(-\frac{r^2}{2 l^2}\right),$$
where $l$ represents the characteristic length scale and $r = \lVert x - x' \rVert$ is the Euclidean distance. The RBF kernel's key property is that it depends primarily on the distance from the specified point. In the case of a twice differentiable function, we use a kernel from the Matérn family. The Matérn kernel is a generalization of the RBF function. The general formula for Matérn covariance functions is given by:
$$k_{\mathrm{Mat\acute{e}rn}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\!\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{l}\right),$$
where $K_{\nu}$ is a modified Bessel function, $l$ is a length-scale parameter, and $\nu$ is a parameter that controls the smoothness of the function, giving the ability to flexibly match the kernel to the function we want to model. Of particular note are the frequently used values of $\nu$: $3/2$ for learning functions that are at least once differentiable, and $5/2$ for functions that are at least twice differentiable [2].
The kernel choice plays a pivotal role in determining the regularity of the function drawn from the GP. Specifically, the Matérn kernel exhibits a notable characteristic: the resulting process is $k$ times (mean-square) differentiable whenever $\nu > k$. This differentiability property is integral in defining the smoothness of the function modeled by the Gaussian process. A higher degree of differentiability implies a smoother function, which can be crucial in various applications where the smoothness of the function has significant implications on the accuracy and reliability of the model's predictions or interpretations [2].
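For reference, both kernels can be written compactly in Python. The sketch below is a plain numpy/scipy implementation with unit signal variance (an assumption on our part, not the exact parameterization used later with Stan or R-INLA); the general Matérn form uses the modified Bessel function $K_\nu$, with the value at $r = 0$ set to 1 by continuity.

```python
import numpy as np
from scipy.special import gamma, kv

def rbf_kernel(r, length_scale=1.0):
    """RBF / squared-exponential kernel as a function of the distance r."""
    return np.exp(-0.5 * (r / length_scale) ** 2)

def matern_kernel(r, length_scale=1.0, nu=2.5):
    """General Matern kernel: 2^(1-nu)/Gamma(nu) * (sqrt(2*nu)*r/l)^nu * K_nu(sqrt(2*nu)*r/l)."""
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2.0 * nu) * r / length_scale
    k = np.ones_like(scaled)                 # k(0) = 1 by continuity
    nz = scaled > 0.0
    k[nz] = (2.0 ** (1.0 - nu) / gamma(nu)) * scaled[nz] ** nu * kv(nu, scaled[nz])
    return k

r = np.linspace(0.0, 3.0, 7)
print(rbf_kernel(r))
print(matern_kernel(r, nu=2.5))   # nu = 5/2: at least twice-differentiable sample paths
print(matern_kernel(r, nu=1.5))   # nu = 3/2: at least once-differentiable sample paths
```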
GPs play a significant role in the advancement of modern technologies and sciences, where they are used for prediction, optimization, or classification [3]. In practice, they are useful, for instance, in industrial anomaly detection [29], which is crucial for ensuring production quality and continuity. GPs predict battery state parameters [30], which are relevant in electric vehicles and energy management [4]. In bike-sharing systems, they optimize bike allocation, enhancing efficiency [31]. Furthermore, GPs are applied in tracking and localization, especially where GPS fails [32]. In optimization, they aid in resource management, for example in data center design [9]. These processes are key in data analysis and statistical modeling, used across various fields, from image processing [12] to biomedical data analysis [33], demonstrating their versatility and importance in contemporary applications.
2.2. Generative Models
A generative model is a type of statistical algorithm designed to generate new data similar to the data it was trained on, but not identical. These models learn the probability distribution of the training data, allowing them to create new instances that retain the statistical characteristics of the training set. Generative models are widely used in various fields such as natural language processing, image generation, speech synthesis, and others, where they can create realistic text, images, sounds, or other types of data [34].
In the context of GPs, we utilize function sampling from posterior distributions, enabling statistical inference based on available data. This method relies on data influencing the distribution of hyperparameters, allowing the generation of new data samples and creating a generative model [2]. An example of a generative model can be found in [29], which describes diagnosing malfunctions in industrial processes. In this case, the GP model is trained with measurement data from both normal, healthy operation of equipment and from moments when faults occur. The model then becomes capable of generating more time series data, which consequently leads to the ability to classify new situations based on data depth. This approach exemplifies the application of GPs in predictive maintenance, where they assist in interpreting complex data patterns to identify potential issues before they become critical failures, enhancing both the efficiency and safety of industrial operations. Another example can be found in [35], where generative models within a Gaussian mixture are used to classify the faulty or healthy states of induction motor startups.
GPs provide a deeper understanding of data relationships, which is particularly useful in estimating hyperparameters and calculating posterior distributions [2]. In function modeling, it is crucial to focus not only on values at measurement points but also on the overall function trajectory. An important aspect of GPs is that we sample a set of points, not the entire function. The problems we are dealing with involve how to choose specific points at which to calculate the GP and how to generate functions from the model. A simple rule of thumb states that the more data points we have information on, the easier and better the interpolation will be, regardless of the method chosen. However, in the case of GPs, significantly increasing the number of computational points (random variables) can lead to computational burdens [3]. Therefore, we must strive for a robust method of interpolating functions while maintaining as few computational points as possible for the GP.
2.3. Chebyshev Interpolation
Our proposed solution is to generate functions through a GP model in the form of a functional series, an example of which is the Chebyshev series. The GP generates points of a function, and we want to transform those points into a Chebyshev series. Fundamentally, obtaining its coefficients is not a simple task, but each polynomial can be uniquely expressed as a finite Chebyshev series. When the polynomial is defined by its values at Chebyshev points, there is a unique linear relationship between these values and the coefficients of the Chebyshev expansion. This relationship allows for an efficient mapping, executable in $O(n \log n)$ operations using the FFT. This approach offers a computationally effective way of handling polynomial expressions and transformations [36].
In the context of generative modeling, where new functions are generated, the application of the Chebyshev series becomes particularly relevant. In such a case, it is very useful that Chebyshev series are known for their convergence properties, particularly toward the differentiable functions that are generated by the model. Moreover, the degree of approximation plays a crucial role in determining the nature and rate of this convergence [36]. The convergence of the Chebyshev series to the functions generated with the GP model can be represented by the following bound: for an integer $\nu \geq 1$, let $f$ and its derivatives up to $f^{(\nu-1)}$ be absolutely continuous and let the $\nu$th derivative $f^{(\nu)}$ be of bounded variation $V$; then, for any $n > \nu$, the truncated Chebyshev series $f_n$ satisfies the following:
$$\lVert f - f_n \rVert_{\infty} \leq \frac{2V}{\pi \nu (n - \nu)^{\nu}}.$$
Unfortunately, computing the values of the Chebyshev series can be an exceptionally demanding task. In such instances, we might resort to utilizing the Chebyshev interpolant as an alternative to the series itself [36]. The Chebyshev series and interpolants are quite similar in nature, and the convergence of the latter, for the interpolant $p_n$ under the same assumptions, can be articulated through the following expression:
$$\lVert f - p_n \rVert_{\infty} \leq \frac{4V}{\pi \nu (n - \nu)^{\nu}}.$$
As we can observe in the convergence formulas, the Chebyshev interpolant differs little in accuracy from the series, by only a factor of two, which means that in computer calculations we lose only one bit of accuracy. This minimal loss of precision is compensated by efficiency and numerical stability, which are crucial in many engineering and scientific applications. Thus, we can speak of a significant consistency between the representation of the series and its interpolation using Chebyshev polynomials in terms of our problem. The optimal interpolation nodes are known as Fekete points, which for interpolation on an interval are the Legendre nodes; finding them requires solving a large-dimensional eigenvalue problem. Chebyshev nodes have low complexity but are approximately two times worse than Legendre nodes, again equating to one bit of accuracy [36].
The next problem we must confront is how to obtain the Chebyshev coefficients. This aspect is crucial because the correct determination of these coefficients is fundamental to the effectiveness of the entire method. Thanks to the work of Ahmed and Fisher in 1970 [37], we understand that applying interpolation formulas with the FFT to values at Chebyshev points leads to the accurate determination of Chebyshev series coefficients. This process results in a minimal loss of accuracy, on the order of one bit [36], due to Chebyshev interpolation. This implies that, while efficiently computing the coefficients of a Chebyshev series through the FFT, we maintain a high degree of precision, making this method highly valuable for accurate interpolation in various computational applications. Thus, our solution to the problem of function generativity is the use of the Chebyshev series form, where we can quickly calculate the series coefficients by applying the FFT to values at the Chebyshev nodes. An important aspect is selecting the appropriate number of Chebyshev points, ensuring minimal loss of information about the original function while avoiding overburdening the GP model with heavy calculations. We can apply a solution similar to that promoted in the modern approach to interpolation [36], where the degree of the interpolating polynomial is increased until machine precision is achieved. For differentiable functions, such as those from the Matérn family, the Chebyshev coefficients gradually decrease to a certain level, which allows them to be used to determine the required precision. An example of such behavior can be seen in Figure 1, where we observe that the Chebyshev coefficients decrease as the degree of the Chebyshev series increases until machine precision is reached. In our case, since we may not achieve machine-precision accuracy, at which subsequent values cease to be significant, we can instead use a threshold related to the noise variance, which manifests as oscillations of the series coefficients around a single value after a significant initial drop.
The assumptions of our approach involve utilizing knowledge about where to place specific points for constructing Chebyshev polynomials for the purpose of interpolating the Chebyshev series. This entails strategically selecting points that optimize the polynomial's ability to model the underlying function. Chebyshev polynomials are a sequence of polynomials that satisfy a specific orthogonality condition with respect to a weight function; on the interval $[-1, 1]$, the weight function for Chebyshev polynomials is $w(x) = 1/\sqrt{1 - x^2}$. The $k$th Chebyshev polynomial can be expressed as the real part of the function $z^k$ on the unit circle [36,37]:
$$T_k(x) = \operatorname{Re}\big(z^k\big) = \tfrac{1}{2}\big(z^k + z^{-k}\big), \qquad x = \operatorname{Re}(z) = \tfrac{1}{2}\big(z + z^{-1}\big).$$
One notable property of Chebyshev polynomials is their connection to equally spaced points on the unit circle. The projections of these points onto the interval are known as Chebyshev nodes or Chebyshev points and can be expressed as the real parts of the points $z_j$ [36]:
$$x_j = \operatorname{Re}(z_j) = \tfrac{1}{2}\big(z_j + z_j^{-1}\big), \qquad z_j = e^{\,i j \pi / n}, \quad 0 \leq j \leq n.$$
An alternative way to define the Chebyshev points is in terms of the original angles [36]:
$$x_j = \cos\!\left(\frac{j \pi}{n}\right), \qquad 0 \leq j \leq n.$$
Chebyshev polynomials are valuable for interpolation, particularly due to their effectiveness compared to polynomial interpolants created through equally spaced points. The clustering of points near the ends of the interval plays a significant role. Chebyshev points offer the advantage of having each point at a similar average distance from the others, as measured by the geometric mean. The resulting polynomial interpolant using Chebyshev points is referred to as the Chebyshev interpolant [36].
Furthermore, every polynomial can be uniquely represented in a finite Chebyshev series format. This means that the functions $T_0, T_1, \dots, T_n$ form a basis for the polynomials of degree at most $n$. The relationship between values at Chebyshev points and Chebyshev expansion coefficients is straightforward and can be computed using the FFT in $O(n \log n)$ operations. The Chebyshev series can be defined as follows:
$$f(x) = \sum_{k=0}^{\infty} a_k T_k(x),$$
where $T_k$ represents the $k$th Chebyshev polynomial, and $a_k$ is the associated coefficient. The Chebyshev series exhibits several desirable properties, such as uniform convergence over the interval $[-1, 1]$ for any Lipschitz continuous function and rapid convergence for functions that are analytic or possess smooth derivatives [36,37].
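The two equivalent definitions of the Chebyshev points and the evaluation of a truncated Chebyshev series are short enough to check numerically. The sketch below uses plain numpy; the coefficient vector at the end is an arbitrary example, not one produced by the GP model.

```python
import numpy as np

def chebyshev_points(n):
    """Chebyshev points x_j = cos(j*pi/n), j = 0..n, on [-1, 1]."""
    j = np.arange(n + 1)
    return np.cos(j * np.pi / n)

def chebyshev_series(coeffs, x):
    """Evaluate the truncated series sum_k a_k T_k(x) for a coefficient vector."""
    return np.polynomial.chebyshev.chebval(x, coeffs)

n = 16
x_cos = chebyshev_points(n)

# Equivalent definition: real parts of equispaced points on the unit circle.
z = np.exp(1j * np.pi * np.arange(n + 1) / n)
x_circle = z.real
assert np.allclose(x_cos, x_circle)

# Evaluate a truncated Chebyshev series with illustrative coefficients a_0..a_3.
a = np.array([1.0, 0.5, 0.25, 0.125])
print(chebyshev_series(a, x_cos))
```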
2.4. Overview of Proposed Algorithm
The structure of our proposed algorithm, based on the gathered information, comprises several key components:
Data feeding and fitting the model: Involves using relevant data to train and adjust the Gaussian process (GP) model, ensuring an accurate representation of the system or phenomenon.
Generative Gaussian process model: The core of the algorithm, utilizing statistical methods for generating predictions or simulations from input data.
Generating values at Chebyshev points: Calculating GP outputs at the selected Chebyshev points for the subsequent transformation process.
FFT transformation to Chebyshev series: Applying fast Fourier transform to convert values from Chebyshev points into a Chebyshev series for efficient computational processing.
Series degree optimization with noise variance condition: Optimizing the Chebyshev series degree while considering noise variance (oscillations), balancing accuracy and computational load.
To help understand the algorithm's workflow and the interconnections between the different components, Figure 2 presents a flow chart illustrating the general assumptions of the algorithm. Each step in the flow chart represents a specific part of the algorithm.
2.5. Computational Methods
The current state of GP modeling reflects both its robust potential in statistical learning and the computational challenges inherent in its implementation. GPs are flexible non-parametric models that offer a principled way of encoding prior beliefs about functions, incorporating uncertainty, and handling various types of data. However, the computational demands of GPs, particularly with large datasets, pose significant challenges.
One of the primary computational challenges in GP modeling arises from the need to compute the inverse and determinant of the covariance matrix, a process whose cost is typically cubic in the number of data points, $O(n^3)$ [2]. This complexity limits the scalability of standard GP methods to large datasets. Moreover, the difficulty of direct, analytical calculation of hierarchical Gaussian process models has led to the exploration of alternative solution methods [3]. To address these challenges, several numerical methods and approximations have been developed.
One such method is Markov chain Monte Carlo (MCMC) sampling. MCMC is a powerful and widely used computational technique in statistics, machine learning, and various scientific fields. It is particularly valuable when dealing with complex probability distributions and high-dimensional spaces where traditional numerical methods become impractical or inefficient. MCMC provides a way to generate a sequence of random samples from a target probability distribution, allowing us to approximate various characteristics of that distribution [38].
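As a minimal illustration of the idea (not of the Hamiltonian Monte Carlo sampler that Stan actually uses, which is discussed next), the following sketch runs a random-walk Metropolis chain targeting a one-dimensional density known only up to a normalizing constant; the target and step size are arbitrary choices for the example.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density of an illustrative target (equal mixture of two Gaussians)."""
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def random_walk_metropolis(n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = 0.0
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[i] = x
    return samples

draws = random_walk_metropolis(5000)
print(draws.mean(), draws.std())
```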
In our research, we use a Stan model as a probabilistic programming framework that leverages MCMC methods for Bayesian inference. Stan allows us to specify statistical models using a flexible modeling language and then employs the MCMC technique of Hamiltonian Monte Carlo (HMC) to sample from the posterior distribution of model parameters. This combination of Stan's modeling language and MCMC algorithms enables sophisticated and robust Bayesian inference, making it a powerful tool for a wide range of applications. For instance, in our case, it is particularly effective for the hierarchical modeling of GPs [39].
The implementation of the GP generative model in Stan involves dealing with multivariate Gaussian processes and generative algorithms. This includes specifying a series of inputs and outputs and defining the probability of outputs conditioned on their inputs. Stan's approach to GPs is characterized by its focus on statistical efficiency and the handling of model conditioning and curvature. This is particularly relevant in Bayesian modeling contexts, where inference about the data-generating process is made using posterior distributions derived from a priori assumptions and observed data. Stan allows the direct use of MCMC within its framework and the generation of new data [39].
Building a Gaussian process (GP) model in Stan involves several key steps, from specifying the model to executing the simulation. Here is a general overview of the process we used:
Initially, we define both the input and output data for the model. Given that the model is generative in nature, we declare how many samples are required. In selecting priors, a crucial aspect is the choice of the covariance function, which is employed in constructing the covariance matrix. A common practice, also adopted in our approach, is the use of the Cholesky decomposition. The likelihood is modeled by declaring a normal distribution, complete with appropriate variance and potential noise. Priors are placed over the hyperparameters of our covariance function, which are then inferred from the data; the estimation of these parameters is carried out through MCMC sampling. Once the entire model is defined, posterior inference becomes possible. This allows us to obtain the required parameters along with their uncertainties, enabling the use of the model to generate new responses and their associated uncertainties.
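The generative step produced by this workflow can be mirrored outside Stan with the standard GP conditioning formulas. The sketch below uses numpy only and fixes the hyperparameters by hand instead of sampling them by MCMC, which is a substantial simplification of the actual model; the training function is a placeholder that only serves to produce noisy data for the illustration.

```python
import numpy as np

def rbf(x1, x2, alpha=1.0, rho=0.3):
    d = x1[:, None] - x2[None, :]
    return alpha ** 2 * np.exp(-0.5 * (d / rho) ** 2)

def gp_posterior_draws(x_train, y_train, x_new, sigma=0.1, n_draws=100, seed=0):
    """Draw functions at x_new from the GP posterior with fixed, illustrative hyperparameters."""
    rng = np.random.default_rng(seed)
    K = rbf(x_train, x_train) + sigma ** 2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)                       # Cholesky decomposition of the covariance
    alpha_vec = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_s = rbf(x_train, x_new)
    mean = K_s.T @ alpha_vec
    v = np.linalg.solve(L, K_s)
    cov = rbf(x_new, x_new) - v.T @ v + 1e-9 * np.eye(len(x_new))
    return rng.multivariate_normal(mean, cov, size=n_draws)

# Noisy training data and Chebyshev evaluation points on [-1, 1].
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1.0, 1.0, 20))
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=20)   # placeholder test function
n = 32
x_cheb = np.cos(np.pi * np.arange(n + 1) / n)
draws = gp_posterior_draws(x_train, y_train, x_cheb)            # shape (100, n + 1)
```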
MCMC methods within the Stan framework are known for their high precision and computational complexity. These methods, while detailed and accurate, demand considerable computational resources. In contrast, there are alternative approaches that offer faster computational performance with a trade-off in the amount of information extracted from the model. One prominent alternative is the INLA method. The tool employed in our second modeling framework is the R-INLA package, which provides comprehensive implementations for solving problems using sparse matrices, as described in the INLA methodology. INLA, or the integrated nested Laplace approximation, was introduced by Rue, Martino, and Chopin in 2009 as an alternative to traditional MCMC methods for approximate Bayesian inference [40].
INLA is specifically tailored to models that can be expressed as latent Gaussian Markov random fields (GMRFs) due to their advantageous computational properties. Without going into detail, GPs (especially Matérn ones) are solutions to certain stochastic partial differential equations, and those solutions, obtained with finite element methods, can be considered GMRFs [41]. In the INLA framework, the observed variables $y = (y_1, \dots, y_n)$ are modeled using an exponential family distribution. The mean $\mu_i$ (for observation $y_i$) is linked to the linear predictor $\eta_i$ through an appropriate link function. The linear predictor encompasses terms related to covariates and various types of random effects, while the distribution of the observed variables $y$ depends on certain hyperparameters $\theta_1$.
The distribution of the latent effects $x$ is assumed to follow a GMRF with a zero mean and a precision matrix $Q(\theta_2)$, which is dependent on the hyperparameter vector $\theta_2$ [42]. Writing $\theta = (\theta_1, \theta_2)$, the likelihood of the observations, given the vector of latent effects and the hyperparameters, can be expressed as a product of likelihoods for individual observations:
$$\pi(y \mid x, \theta) = \prod_{i \in I} \pi(y_i \mid \eta_i, \theta).$$
In this context, the latent linear predictor is denoted as $\eta_i$, which is one of the components of the vector $x$ containing all latent effects. The set $I$ includes the indices of all the observed values of $y$, although some of these values may not have been observed.
The primary goal of the INLA approach is to estimate the posterior marginal distributions of the model's effects and hyperparameters. Model effects are the components of the model that capture the underlying structure and relationships within the data. This includes fixed effects, which are the deterministic part of the model related to the covariates, and random effects, which account for spatial or temporal correlations, incorporating variation produced by variables in the model that are not the primary focus of the study. Hyperparameters are the parameters of the kernel used in the model. This goal is achieved by leveraging the computational advantages of GMRFs and utilizing the Laplace approximation for multidimensional integration. The joint posterior distribution of the effects and hyperparameters can be expressed as [42]:
$$\pi(x, \theta \mid y) \propto \pi(\theta)\, \lvert Q(\theta) \rvert^{1/2} \exp\!\left\{ -\frac{1}{2} x^{\top} Q(\theta)\, x + \sum_{i \in I} \log \pi(y_i \mid \eta_i, \theta) \right\}.$$
Here, the precision matrix of the latent effects is denoted as $Q(\theta)$, and its determinant is represented by $\lvert Q(\theta) \rvert$. Additionally, $\pi(y_i \mid \eta_i, \theta)$ is the likelihood of observation $y_i$ for values of $i$ that are in the set $I$ [42].
The computation of the marginal distributions for the latent effects and hyperparameters involves integrating over the hyperparameter space, requiring a reliable approximation of the joint posterior distribution of the hyperparameters [40]. This is accomplished by approximating $\pi(\theta \mid y)$ as $\tilde{\pi}(\theta \mid y)$ and using it to approximate the posterior marginal distribution of the latent parameter $x_i$:
$$\tilde{\pi}(x_i \mid y) = \sum_{k} \tilde{\pi}(x_i \mid \theta_k, y)\, \tilde{\pi}(\theta_k \mid y)\, \Delta_k.$$
The weights $\Delta_k$ correspond to a vector of hyperparameter values $\theta_k$ in a grid, and the specific method for calculating the approximation $\tilde{\pi}(x_i \mid \theta_k, y)$ can vary depending on the chosen approach [42].
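The last formula has a simple numerical reading: the posterior marginal of a latent component is a weighted mixture of the conditional approximations over a grid of hyperparameter values. The sketch below illustrates this mixture with made-up Gaussian conditionals and weights on a small grid; it is a toy illustration of the formula, not the R-INLA implementation.

```python
import numpy as np
from scipy.stats import norm

# Illustrative grid of hyperparameter values with weights (already including Delta_k).
theta_grid = np.array([0.5, 1.0, 2.0])
post_theta = np.array([0.2, 0.5, 0.3])        # stands in for pi~(theta_k | y) * Delta_k

# Made-up conditional approximations pi~(x_i | theta_k, y): Gaussians whose spread grows with theta.
cond_means = np.array([0.1, 0.0, -0.2])
cond_sds = 0.5 * theta_grid

x = np.linspace(-3, 3, 601)
# Posterior marginal of x_i as a weighted mixture over the hyperparameter grid.
marginal = sum(w * norm.pdf(x, m, s) for w, m, s in zip(post_theta, cond_means, cond_sds))
print(marginal.sum() * (x[1] - x[0]))          # integrates to approximately 1
```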
The primary distinction for Gaussian process (GP) generative models lies in the nature of the output from the modeling approach. In the case of an MCMC-based sampling model in Stan, complete information is obtained in the form of samples at specific points, which are then used to determine the underlying distribution. This approach in Stan allows for a direct and comprehensive understanding of the data distribution, as it provides specific instances (samples) of the model's output [39].
On the other hand, INLA provides information primarily about the distribution of the random variables. Instead of generating samples directly from the distribution values, INLA approximates the posterior distribution of both the latent field and hyperparameters using Laplace approximation methods. This approximation leads to a more efficient computational process but offers a less direct insight into the specific instances of the model's output [41].
3. Results
Numerical experiments were conducted using the efficient GP computation algorithm we developed. The process for each scenario consists of the following steps:
Selecting and defining the tested function.
Generating a small subset of samples with specified noise from the function as data input.
Creating a Gaussian process model.
Fitting of the Gaussian process model.
Generating samples (distributions) at Chebyshev points from the GP model.
Converting samples (distributions) into Chebyshev series coefficients.
Optimizing the degree of the Chebyshev series.
Approaches vary between the MCMC (HMC) sampling method and the INLA framework. With MCMC, we obtain 4000 samples for each random variable, from which we gather information about the distribution at each point, while with INLA, we only obtain information about the distribution at certain points (for this reason, INLA can run more efficiently). A detailed chart of the algorithm used in the numerical experiments can be seen in Figure 3.
In the initial stage, two different functions are defined for testing purposes. The first function is a simple—at least twice differentiable, i.e., used for Matérn kernels—function and is defined as follows:
The second one is an infinitely differentiable function (used for RBF kernels) and defined as follows:
For these defined functions, a small number of points (20 in our experiments) was generated and then combined with random values drawn from a zero-mean Gaussian distribution with a fixed standard deviation to introduce noise. Figure 4 and Figure 5 illustrate the defined functions and the noisy samples generated for them, used in the presented experiment. To simplify subsequent operations with Chebyshev nodes, the domain was set to the range $[-1, 1]$.
The next stage involves defining and creating a Gaussian process model. At first, we use the GP model created in Stan, which allows us to create an accurate model based on defined priors and to generate samples using MCMC tailored to the specifics of the problem. In our case, the GP model is characterized by a mean of 0. We also use two different covariance functions, one for each test function. For the case of the infinitely differentiable function, we use the radial basis function (RBF), also known as the squared exponential (SE) kernel. For the twice differentiable function, we use the Matérn kernel with $\nu = 5/2$. Both of these kernels are the most common in terms of GP usage. Our model in Stan was set up based on the general guidelines available in [39]. Our priors were set as normal distributions, we used the Cholesky decomposition for our covariance matrix, and we generated results for the random variables (computation points) in the sample-generating block.
In the case of the INLA framework, since we use the stochastic partial differential equation (SPDE) approach, we can only set covariance parameters within the Matérn family. However, as the RBF kernel is the limiting case of the Matérn family, we can set the parameter $\nu$ to high values to approximate RBF behavior. In the R-INLA implementation, we first generate a one-dimensional mesh, based on which we generate the matrix A (the operation matrix on the mesh). As a prior for the similarity matrix, we choose the identity matrix (for the random effects). Priors for the hyperparameters, as in the case of Stan, are also normal distributions. All created objects are enclosed in what is called a stack, and then a model is created from which we infer the distributions. More on this can be found in [41].
The models were fitted using a small subset of generated noisy data from the original functions. The standard deviation of added noise was . It is important to note that the number of Chebyshev nodes used for sample generation directly impacts the degree of the Chebyshev series obtained later, which is closely related to the optimization of this degree, as discussed below. To obtain appropriate preliminary conclusions, the number of Chebyshev nodes we used varied from 10 to 200. Such a discrepancy in values allowed us to examine not only the optimized degree of the Chebyshev series but also, for example, the computational efficiency.
Once the function values were calculated at the Chebyshev nodes, we utilized the FFT to convert them into a Chebyshev series by calculating its coefficients, and a script was developed for this purpose. The $k$th Chebyshev polynomial on the interval can be interpreted as the real part of the monomial $z^k$ on the unit disc in the complex plane, given by $T_k(x) = \operatorname{Re}(z^k)$ with $x = \operatorname{Re}(z)$. Thus, when adding $n+1$ Chebyshev polynomials, a truncated Laurent series in the variable $z$ is obtained, with the same coefficients for the $z^k$ and $z^{-k}$ terms. Similarly, the Chebyshev points on the interval can be seen as the real parts of equispaced nodes on the unit circle.
This connection between the interval and the unit circle allowed us to utilize Fourier analysis [37]. A truncated Laurent series with equal coefficients for the $z^k$ and $z^{-k}$ terms is equivalent to a Fourier series in the variable $s$, where $z = e^{is}$. Therefore, the Fourier and Laurent coefficients are identical, and the coefficient vector is symmetric, since the same factor multiplies the $z^k$ and $z^{-k}$ terms. In this context, the Chebyshev coefficients correspond to the first $n+1$ terms of this vector, with the first and last coefficients divided by 2.
The script, based on information introduced in [36,43], was implemented and performs the following steps:
We gathered function values from the model results into vectors.
We extended the vectors of function values to represent equispaced values on the unit circle, progressing anticlockwise, starting from an x value of 1.
We performed FFT to obtain Fourier/Laurent coefficients. Since the Chebyshev coefficients are real-valued, the real part of this vector was taken to eliminate any spurious imaginary components due to rounding errors.
We extracted the first $n+1$ values of the vector, rescaling by the degree $n$ as required to obtain the Chebyshev coefficients from the FFT routine.
We divided the first and last entries by 2.
The script can also be seen in Figure 3 as a part of the whole algorithm.
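A compact Python version of these steps, following the standard values-to-coefficients construction for Chebyshev points (our own numpy re-implementation, not the exact script used in the experiments), is given below. The final line checks the conversion on $T_2(x) = 2x^2 - 1$, whose only nonzero coefficient is $a_2 = 1$.

```python
import numpy as np

def cheb_vals_to_coeffs(values):
    """Convert values at the n+1 Chebyshev points x_j = cos(j*pi/n) into
    the coefficients a_0..a_n of the interpolating Chebyshev series."""
    values = np.asarray(values, dtype=float)
    n = len(values) - 1
    # Extend to equispaced values on the unit circle (anticlockwise, starting at x = 1).
    extended = np.concatenate([values, values[-2:0:-1]])
    # FFT; the coefficients are real, so any imaginary residue is rounding error.
    fourier = np.real(np.fft.fft(extended)) / n
    coeffs = fourier[: n + 1]
    coeffs[0] /= 2.0
    coeffs[-1] /= 2.0
    return coeffs

# Quick check: values of T_2(x) = 2x^2 - 1 at the Chebyshev points recover a = (0, 0, 1).
n = 2
x = np.cos(np.pi * np.arange(n + 1) / n)
print(cheb_vals_to_coeffs(2 * x ** 2 - 1))   # approximately [0., 0., 1.]
```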
With the calculated Chebyshev series coefficients for the generated values (4000 per node in the case of MCMC) from the model, we obtained the marginal distribution of the coefficients. In the case of the INLA framework, our conversion script was based on the mean values of the distribution at each node, resulting in single coefficient values from the FFT, in contrast to the MCMC distributions. The previously mentioned challenge was selecting the optimal degree of the Chebyshev series, which subsequently determines the number of Chebyshev nodes at which the GP is calculated. The method for estimating the optimal degree involved evaluating the behavior of the coefficient values as a function of the degree. Consequently, we assessed the significance of the coefficients to determine a cutoff threshold beyond which the remaining coefficients could be considered negligible. This threshold is reached when the values approach machine precision or oscillate around a fixed value. In our case, we set the threshold based on the noise variance. This step reduces the Chebyshev series degree. Since we were dealing with distributions, we used the expected values of the coefficients to optimize the series order needed to accurately represent the original function.
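Assuming the coefficient draws are arranged as an array of shape (number of samples, number of coefficients), the degree selection described above can be sketched as follows; the threshold value and the synthetic draws are placeholders, since the threshold used in the experiments is tied to the noise variance of the data.

```python
import numpy as np

def optimal_cheb_degree(coeff_samples, threshold):
    """Select the Chebyshev series degree: the last coefficient whose expected
    magnitude still exceeds the threshold; higher-order terms are treated as noise."""
    mean_coeffs = np.mean(coeff_samples, axis=0)        # expected value of each coefficient
    significant = np.abs(mean_coeffs) >= threshold
    if not significant.any():
        return 0
    return int(np.max(np.nonzero(significant)[0]))

# Usage with hypothetical MCMC output: 4000 draws of a degree-200 coefficient vector.
rng = np.random.default_rng(0)
fake_coeffs = rng.normal(scale=1e-3, size=(4000, 201))       # placeholder draws
fake_coeffs[:, :15] += 2.0 ** -np.arange(15)                 # decaying low-order structure
degree = optimal_cheb_degree(fake_coeffs, threshold=1e-2)
print(degree)                                                # about 6 with these numbers
```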
The results of the tests conducted for different orders, ranging from 10 to 200, can be seen below. We gathered the plots of results from the models (mean values with confidence intervals) together with the coefficient values. In Figure 6 and Figure 7, we can see the results for the MCMC sampling approach using the Matérn kernel with the at least twice differentiable function. Figure 8 and Figure 9 show the results of the same model but for the RBF kernel and the infinitely differentiable function. The results from the INLA framework can be seen in Figure 10 and Figure 11 for the at least twice differentiable function and in Figure 12 and Figure 13 for the infinitely differentiable function. We also gathered information about the average computing time for each experiment in different combinations of kernels, frameworks, and Chebyshev node counts; this can be seen in Table 1.
The overall conclusion regarding the MCMC approaches is that there is a clear barrier for the optimal degree of the Chebyshev series. A clear cut-off line at the threshold of useful information can be used, but the cut-off can also be identified by catching the oscillations of the series coefficients around the same value. For the case of the Matérn kernel, the value of the oscillations is slightly below the threshold, and for the RBF kernel, it is slightly above the threshold value. Both approaches provide us with the number of nodes we should use; in both cases, the resulting orders were around 10-30. It is important to optimize the order, because any decrease in these values directly impacts the size of the covariance matrix and, thus, the computation time, while losing very little information.
In the case of the R-INLA framework approach, we can conclude that despite achieving high Chebyshev degree values for the coefficients and observing a decreasing tendency, we do not reach machine precision, and the values do not oscillate around any value. In such a situation, we can use the selected threshold, which was useful in the previous experiment with MCMC, where the optimal cut-off values of the necessary coefficients were identified. Moreover, it is important to understand that the INLA framework is optimized to calculate Gaussian random fields (multivariate GPs), so decreasing the number of nodes does not significantly impact execution time.
This difference in results comes from a different calculation method. MCMC uses sampling at each point, so we receive more information, such as the values of each generated sample and chain, from which probability distributions are later calculated. Due to this possibility, MCMC takes into account entire value vectors, not just distribution information, for calculations related to the Chebyshev series. This allows us to optimize the degree of the series with FFT conversion. In the case of R-INLA, the SPDE approach only provides information about the distribution at given points. Therefore, we can only use data on the expected value and standard deviation when calculating values for the Chebyshev series. It is also important to consider the execution times for model fitting and data generation. MCMCs are known for their long and complex calculations but for cases involving 10 to 50 random variables, the execution time did not exceed several dozen seconds. However, for 200 variables, it took almost 15 min on a moderate computer. We did not encounter a similar problem with R-INLA, as it is optimized for calculations on sparse matrices; the execution times for all attempts did not exceed several seconds.
By employing the selected Chebyshev nodes, we obtained the expected results of the GP model in the form of Chebyshev series coefficients, rather than computing the GP at every point. It is also crucial to emphasize that, regardless of the chosen framework or the degree of the results, all are presented in the form of coefficients. This form allows us to encode information in the form of the Chebyshev series itself, which serves as an approximation of the original function and can help in generating more data for generative models. Thus, it carries additional information about the original function, unlike other frequently used methods. It is also important to note that, unlike MCMC, R-INLA is already an approximation method, so the use of subsequent optimizations for simple one-dimensional cases did not bring much improvement. However, in the case of the MCMC approach, our method can facilitate its use in scenarios where it was previously infeasible, especially where generative models are needed.
4. Discussion
In the field of computational process optimization for GPs, one of the more significant and recognizable solutions is sparse GP. This approach involves strategically selecting a subset of informative points, known as inducing points, to represent the underlying data. The main advantage lies in addressing the computational challenges associated with traditional GPs, particularly when dealing with large datasets. By employing sparse GPs, the model's scalability is greatly improved, allowing for efficient processing of extensive datasets while maintaining the flexibility and power of GPs in capturing complex relationships [16,17].
One way to enhance the performance of sparse GP models in representing underlying functions is through rectangularization. In situations where data are sparse, avoiding overfitting and optimizing Gaussian process regression (GPR) hyperparameters becomes challenging, especially in high-dimensional spaces where data sparsity cannot be practically resolved by adding more data. The success or failure of GPR applications hinges on the optimal selection of hyperparameters. The authors propose a method that facilitates parameter optimization through the rectangularization of the GPR-defining equation. Using a 15-dimensional molecular potential energy surface as an example, they illustrate the effectiveness of this approach in achieving hyperparameter tuning even with very sparse data [44].
Other works focus on optimizing the performance of online updating for sparse GPs. Research presented in [45] delves into the relationship between a class of sparse inducing-point GP regression methods and Bayesian recursive estimation, enabling Kalman-filter-like updates for online learning. Unlike prior work focused on the batch setting, this study concentrates on mini-batch training. Leveraging the Kalman filter formulation, the proposed approach recursively propagates analytical gradients of the posterior over mini-batches, resulting in faster convergence and superior performance. The approach presented in [46] was developed for both the fully independent training conditional (FITC) and the partially independent training conditional (PITC) approximations, enabling the inclusion of a new measurement point in time that depends on the size m of the set of inducing inputs. The online nature of the algorithms enables the forgetting of earlier measurement data, resulting in a memory space requirement that likewise depends on m for both FITC and PITC.
Works that do not rely on sparse GPs can also be found. In [12], the authors propose a way to optimize GPs for time series gap-filling in satellite image processing by replacing the per-pixel optimization step with cropland-based pre-calculations of the GPR hyperparameters.
In [47], the authors present a novel algorithm for GP optimization: BKB (budgeted kernelized bandit). It is designed for optimization under bandit feedback. BKB achieves near-optimal regret and convergence rates with nearly constant per-iteration complexity. Notably, it makes no assumptions about the input space or GP covariance. The algorithm combines a kernelized linear bandit approach (GP-UCB) with randomized matrix sketching using leverage score sampling. The authors demonstrate that randomly sampling inducing points based on posterior variance accurately provides a low-rank approximation of the GP, maintaining reliable variance estimates and confidence intervals.
In the last example, the authors explore an alternative global optimization framework called optimistic optimization, which is computationally more efficient. This approach leverages prior knowledge of the search space's geometry through a dissimilarity function. The study aims to integrate the conceptual advantages of Bayesian optimization with the computational efficiency of optimistic optimization. This is achieved by mapping the kernel to a dissimilarity, resulting in the development of an optimistic optimization algorithm [48].
The practical application of the proposed solution can be easily illustrated through existing examples, such as the previously mentioned problem of anomaly detection in industrial processes [29]. This problem involves several steps: data acquisition, generating new samples using Gaussian processes (GPs), and classification through data depth. While the other aspects of the task would remain unchanged, our presented solution could be applied to the generative model. In the original problem, the authors used a GP implementation based on the ML-II solution. By applying our algorithm, MCMC sampling could be used in such a problem, potentially providing more information about the generated functions. Another example is the previously discussed issue of using a Gaussian mixture in diagnosing the state of an engine based on its startup characteristics [35]. Here, the authors employ several models aimed at generating new samples for classification. As in the previous case, it would be valuable to explore the application of our proposal, which relies on optimally selecting the degree of the Chebyshev series and the Chebyshev points as random variables, thereby enabling the use of MCMC sampling.
As presented, by employing various methods, including sparse GP variations, we can address the issue of time-consuming computations in GP modeling. In contrast to the standard sparse approach, our proposal relies on an innovative method characterized by two key features: a reduction in computation time and the ability to reconstruct the original function. The first feature is ensured through the imposition of domain constraints in the computations, where we model the GP using a limited set of random variables at selected Chebyshev nodes. The second feature arises from the form of the Chebyshev series, achieved by transforming function values at Chebyshev nodes using the FFT, allowing us to obtain the coefficients of the Chebyshev series. This representation provides information about the original function. Additionally, by analyzing coefficients of different degrees, we can guarantee a minimal, meaningful number of Chebyshev nodes. Our experiments confirm that our algorithm significantly reduces computation time, especially in the context of sampling-based methods, while maintaining unique properties not encountered in standard sparse approaches.
5. Conclusions
The incorporation of the Chebyshev property to enhance the computation of the GP is a topic frequently overlooked. Over the years, one of the primary challenges associated with GPs has been their computational complexity, particularly when dealing with a large number of points. This complexity is exacerbated when employing the MCMC method for sample generation, often resulting in impractical computation times.
In response to this issue, our proposed approach aims to mitigate the computational burden by reducing the calculations of the random variables to an appropriate number of Chebyshev nodes. Moreover, our study focuses on leveraging the values at the Chebyshev nodes (and their marginal distributions) gathered from the generative model to construct the Chebyshev series. Chebyshev series are recognized for their strong convergence characteristics, especially in relation to differentiable functions produced by the model. Computing the values of the Chebyshev series can be a highly challenging task, so we use the Chebyshev interpolant as a substitute for the series itself, with just a one-bit loss of accuracy. With a conversion based on the FFT, we can transform the distributions (values) of the random variables at the Chebyshev nodes into Chebyshev series coefficients. The critical challenge addressed was the judicious selection of the optimal degree for the Chebyshev series, a parameter that influences the computation of the GP within an appropriate number of Chebyshev nodes. To determine the optimal degree, we employed a method that involved assessing the behavior of the coefficient values as a function of the degree, subsequently establishing a cutoff threshold based on the significance of the coefficients. This threshold was identified when values approached machine precision or exhibited oscillations around a similar value, leading to a reduction in the Chebyshev series degree.
Our proposed solution is also characterized by the feature of retaining more information about the original function. By transforming the function values at the Chebyshev nodes using the FFT, we can reduce the number of Chebyshev points and obtain the Chebyshev interpolation of the generated function. This form of Chebyshev representation can help us generate more variations from the model, while respecting the uncertainty from the GP model. This approach works very well for generative models, which are designed to produce new data samples based on the information collected from the original function. Using MCMC sampling in such cases is useful, and our approach allows its application in a reasonable time. Moreover, thanks to the use of the aforementioned Chebyshev interpolation, we have the possibility of transferring additional information about the function through the appropriate coefficients.
From the numerical experiments, we can conclude that there are significant differences between the results of the MCMC and R-INLA frameworks. These differences arise from their distinct calculation methods. MCMC, using sampling at each point, considers entire value vectors, providing more information for the Chebyshev series calculations. This enables the optimization of the series degree. On the other hand, R-INLA, utilizing an SPDE approach, offers information solely about the distribution at given points, limiting the utilization of data to expected values and standard deviations in the context of calculating values for the Chebyshev series.
Furthermore, considering the execution times of the algorithm is crucial. MCMC, known for lengthy and intricate calculations, exhibited significant variations. For 10 random variables, the execution time remained within several seconds, whereas for 200, it extended to about 11 min. In contrast, R-INLA, designed for optimized and approximated calculations on sparse matrices, consistently demonstrated shorter execution times, not exceeding one minute in all attempts. The utilization of the selected Chebyshev nodes in our script facilitated the acquisition of expected results for the GP generative model in the form of Chebyshev series coefficients, avoiding the need for computing the GP at every individual point. Preliminary results indicate a substantial reduction in calculation time in the case of MCMC-based calculations. In both cases, it is possible to utilize a reduced number of nodes for computations without significant loss in the structures of the reconstructed functions from the models. This promising outcome underscores the potential efficiency gains achievable through our proposed method, especially in terms of an MCMC-based approach. Our solution allows us to use the MCMC sampling in cases where it was not possible before due to the problem of computation time. For the R-INLA case, the reduction in computational time is not as spectacular as in the MCMC case, but that outcome should be expected from the already approximate calculations; therefore, the use of the method may seem unnecessary.
It is worth noting that our suggested approach shows promise but necessitates further testing and validation on real case scenarios. Future considerations should delve into exploring alternative, less frequently used covariance functions for GPs, adjusting priors and model settings to better align with specific problem nuances, and implementing scalability measures to ensure the versatility and applicability of the proposed solution across diverse domains. As our research progresses, it is imperative to validate and refine the proposed approach to contribute meaningfully to the optimization of GP computations.