Article

Interpretable Mixture of Experts for Decomposition Network on Server Performance Metrics Forecasting

1 State Grid Corporation of China Big Data Center, Beijing 100033, China
2 State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 4116; https://doi.org/10.3390/electronics13204116
Submission received: 11 September 2024 / Revised: 8 October 2024 / Accepted: 17 October 2024 / Published: 18 October 2024
(This article belongs to the Special Issue Trustworthy Deep Learning in Practice)

Abstract
The accurate forecasting of server performance metrics, such as CPU utilization, memory usage, and network bandwidth, is critical for optimizing resource allocation and ensuring system reliability in large-scale computing environments. In this paper, we introduce the Mixture of Experts for Decomposition Kolmogorov–Arnold Network (MOE-KAN), a novel approach designed to improve both the accuracy and interpretability of server performance prediction. The MOE-KAN framework employs a decomposition strategy that breaks down complex, nonlinear server performance patterns into simpler, more interpretable components, facilitating a clearer understanding of how predictions are made. By leveraging a Mixture of Experts (MOE) model, trend and residual components are learned by specialized experts, whose outputs are transparently combined to form the final prediction. The Kolmogorov–Arnold Network further enhances the model’s ability to capture intricate input–output relationships while maintaining transparency in its decision-making process. Experimental results on real-world server performance datasets demonstrate that MOE-KAN not only outperforms traditional models in terms of accuracy but also provides a more trustworthy and interpretable forecasting framework. This makes it particularly suitable for real-time server management and capacity planning, offering both reliability and interpretability in predictive models.

1. Introduction

A server is a computer or system that provides resources, data, services, or programs to other computers, known as clients, over a network. In recent years, the global server industry has experienced significant growth to meet the increasing demand for computing and storage. The number and scale of data centers have been expanding, driving rapid development in the server market. By 2024, it is estimated that approximately 80 million servers will be in operation worldwide, mainly driven by digital transformation and cloud computing services. Server shipments in 2024 are expected to range between 14 million and 20.7 million units, with the global server market revenue projected to be between EUR 80 billion and 140 billion [1].
As the number of servers and clusters increases, server outages are also rising steadily [2]. For example, in early 2024, outages among cloud service providers (CSPs) increased significantly compared to those among internet service providers (ISPs): the proportion of CSP to ISP outages rose from 17% in 2023 to 27% in 2024, highlighting the increased vulnerability of large-scale server clusters. This trend emphasizes the need for enhanced visibility and monitoring of server work environments to manage risks and reduce outages effectively [3].
Server performance metrics, such as CPU utilization, memory consumption, disk I/O, and network bandwidth, play a pivotal role in identifying system bottlenecks, optimizing resource allocation and enhancing overall operational efficiency [4,5,6,7,8,9,10,11]. The importance of server performance metrics forecasting lies in its ability to predict and mitigate potential issues before they occur. By analyzing historical performance data, forecasting tools can predict trends in resource utilization, workload fluctuations, and potential hardware failures. This proactive approach enables cloud service providers to dynamically allocate resources to avoid both over-provisioning (where resources are underutilized) and under-provisioning (where demand exceeds capacity). Efficient forecasting not only improves system reliability but also reduces operational costs by optimizing resource utilization.
One notable example of the impact of forecasting is the potential prevention of the 2008 AWS S3 outage [12], which lasted eight hours and resulted in widespread business disruptions. Predictive monitoring could have helped AWS identify the root cause before it escalated, allowing for preemptive mitigation measures to avoid downtime. The accurate prediction of these metrics not only aids in capacity planning but also improves service level agreements (SLAs) [13,14,15,16]. Traditional reactive approaches to server management are insufficient in modern computing environments. Instead, proactive methods, driven by predictive analytics and machine learning models, allow system administrators to forecast performance trends and maintain optimal server health.
In this paper, we present a new approach that integrates the Mixture of Experts (MOE) framework with the Kolmogorov–Arnold Network (KAN) [17] to tackle the complexities of server performance forecasting. This model leverages KAN’s strong nonlinear fitting capabilities while maintaining the interpretability of the decomposition process. By applying the MOE framework, the model provides interpretable predictions by breaking down the contributions of various experts. This allows users to understand the model’s decision-making process, leading to more trustworthy forecasting.
In summary, the key contributions of our work are as follows:
Mixture of Experts Decomposition Framework: We propose a Mixture of Experts (MOE) framework that decomposes server performance metrics into trend and residual components using a multi-scale moving average module. This mechanism allows for a more effective decomposition of time series data, retaining both long-term trends and high-frequency details.
Kolmogorov–Arnold Network Integration: We extend the traditional linear model by incorporating the Kolmogorov–Arnold Network (KAN) into the temporal layer. KAN’s nonlinear fitting capabilities enable the model to effectively capture complex temporal dependencies in server performance metrics, improving forecasting accuracy over traditional linear methods.
Comprehensive Performance Evaluation: We conduct extensive experiments using the Server Machine Dataset (SMD) and demonstrate that our MOE-DKAN model consistently outperforms traditional baseline models, such as Linear, DLinear, and FEDformer, across multiple forecast horizons for both univariate and multivariate server performance metrics. The model achieves significant improvements in terms of Mean-Squared Error (MSE) and Mean Absolute Error (MAE).

2. Related Works

2.1. Long-Term Time Series Forecasting

Long-term time series forecasting (LTSF) has been a widely researched area due to its crucial role in fields such as energy management, finance, and healthcare [18,19,20,21]. In recent years, multiple methods, ranging from traditional statistical models to deep learning approaches, have been proposed to tackle the challenges inherent in predicting long-term patterns.
The advent of deep learning significantly changed the landscape of time series forecasting. Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) [22] and Gated Recurrent Units (GRUs) [23], were designed to model sequential data, making them highly effective for time series forecasting. LSTM and GRU models have demonstrated success in capturing long-term dependencies by using memory cells to retain information over extended periods. Several hybrid approaches combining RNNs with other techniques have emerged to overcome limitations in capturing complex temporal relationships [24]. Attention mechanisms were integrated into RNNs to focus on the most relevant parts of the input sequence, allowing for better handling of long-range dependencies. For instance, DA-RNN [25] leveraged a dual-stage attention mechanism to enhance the feature extraction process, proving effective in long-term forecasting by emphasizing critical temporal and spatial relationships. Another approach, ESLSTM [26], combined exponential smoothing with LSTMs to decompose and capture both linear and nonlinear trends, providing a robust method for multivariate time series prediction.
Additionally, the introduction of encoder–decoder architectures enabled RNNs to excel in multi-step forecasting tasks by creating a more flexible framework for handling sequential data of varying lengths. Models like MQRNN [27] applied multi-horizon prediction techniques, demonstrating the ability to simultaneously forecast multiple future timesteps, which was particularly useful in applications requiring long-term strategic planning. Despite their effectiveness, these models encounter difficulties when the prediction horizon extends significantly, often suffering from vanishing gradient problems and computational inefficiencies [28,29]. Furthermore, their recursive nature results in longer training and inference times, limiting their practical use in real-time LTSF tasks.
Transformer models, initially introduced for natural language processing, have recently been adapted for time series forecasting and have gained significant attention due to their capability to capture long-term dependencies and handle large-scale data sequences. They leverage self-attention mechanisms to model relationships across entire sequences, which is particularly beneficial for time series data, where capturing both short-term and long-term patterns is essential. Unlike RNN-based models, Transformers are not limited by gradient vanishing or explosion, making them more effective at handling longer input sequences.
The vanilla Transformer architecture was adapted for time series forecasting by researchers aiming to exploit its parallelism and attention mechanism [30]. However, due to the quadratic complexity of self-attention, which makes Transformers computationally expensive for long sequences, several modifications have been introduced to improve their efficiency. For example, Informer [31] uses a ProbSparse self-attention mechanism to reduce the computational burden by selecting only the most informative queries, making it suitable for long-sequence forecasting tasks. Similarly, LogTrans and Reformer employ sparse attention techniques to decrease the number of operations required, thus making the Transformer architecture more feasible for long-term time series predictions [32].
Another notable advancement is the Autoformer [33], which introduces a decomposition block to capture both seasonal and trend components of time series data. This approach helps to mitigate the challenges posed by non-stationarity in time series, allowing for a more robust forecasting performance. FEDformer [34] extends this concept by incorporating frequency domain representations, such as Fourier and wavelet transforms, to enhance feature extraction and further reduce time complexity, providing effective solutions for forecasting tasks in various domains, including finance and meteorology.
Hybrid models combining Transformer components with other neural network architectures have also emerged [35]. For example, Pyraformer employs a pyramidal attention mechanism to capture multi-resolution temporal features, allowing the model to focus on different time scales simultaneously. Additionally, the TCCT model [36] integrates CNN layers with a Transformer to extract spatial features before applying self-attention, thus leveraging the strengths of both convolutional and attention mechanisms. These models have outperformed traditional RNN-based models in both short- and long-term forecasting tasks, proving the potential of attention mechanisms in LTSF.
Another promising avenue in LTSF is the application of decomposition techniques to separate time series data into trend, seasonal, and residual components. Decomposition-based models, such as the Neural Basis Expansion Analysis (N-BEATS) [37], have shown remarkable performance by treating forecasting as a signal decomposition problem. The separation of trend and seasonal patterns allows the model to learn simpler representations, which are easier to predict over long horizons. Furthermore, combining decomposition with machine learning methods, such as Mixture of Experts (MOE) models, has improved the model’s ability to generalize across domains.
One of the major challenges in LTSF is the accumulation of prediction errors over longer forecasting horizons. Recursive multi-step forecasting, where the model’s predictions are fed back as inputs for subsequent steps, often leads to the propagation of errors, reducing accuracy. This has led to the development of direct forecasting strategies, where models predict all future steps simultaneously. Additionally, scalability remains a concern, particularly when dealing with large datasets and extended horizons. There is also a growing interest in incorporating external data sources (e.g., weather, socio-economic factors) into LTSF models to improve robustness and accuracy [38].
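To make the contrast concrete, the toy PyTorch sketch below compares the two strategies with simple linear predictors; the lookback and horizon lengths are arbitrary and the models are illustrative placeholders rather than any specific method discussed above.

```python
# Illustrative sketch (not from the paper): recursive multi-step forecasting,
# where predictions are fed back as inputs, versus direct forecasting, where
# all horizon steps are predicted at once.
import torch
import torch.nn as nn

lookback, horizon = 96, 24
one_step = nn.Linear(lookback, 1)        # recursive: predicts a single step
direct = nn.Linear(lookback, horizon)    # direct: predicts the whole horizon

x = torch.randn(8, lookback)             # batch of historical windows

# Recursive strategy: errors in early steps propagate into later inputs.
window = x.clone()
recursive_preds = []
for _ in range(horizon):
    step = one_step(window)              # (8, 1)
    recursive_preds.append(step)
    window = torch.cat([window[:, 1:], step], dim=1)
recursive_preds = torch.cat(recursive_preds, dim=1)    # (8, horizon)

# Direct strategy: one forward pass, no error feedback between steps.
direct_preds = direct(x)                               # (8, horizon)
```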

2.2. Linear Model for Time Series Forecasting

The long-term time series forecasting linear (LTSF-Linear) [39] model is a recent approach introduced to tackle long-term time series forecasting (LTSF) by using simple linear regression models, challenging the dominance of Transformer-based models in this domain. Despite the popularity of Transformers due to their success in various fields like natural language processing (NLP) and computer vision (CV) [40], their effectiveness in time series forecasting has been questioned. The self-attention mechanism in Transformers, which excels in capturing semantic correlations, struggles to model temporal relations because it is inherently permutation-invariant, leading to temporal information loss. This is a critical shortcoming in time series data, where the order of data points is crucial for understanding trends and patterns.

LTSF-Linear Structure and Principles

LTSF-Linear employs an embarrassingly simple structure: a one-layer linear model that regresses historical time series data to predict future values. The model’s core idea revolves around directly modeling temporal dynamics through weighted summation along the time axis. LTSF-Linear offers two key variants:
(1) DLinear: Incorporates a decomposition scheme to split the input into trend and seasonal components [41], as shown in Figure 1. Separate linear layers are applied to each component, and the results are summed to produce the final forecast.
(2) NLinear: Adjusts for potential distribution shifts by subtracting the last value of the input sequence before passing it through a linear layer, then adding it back after prediction. This normalization step mitigates bias caused by distribution changes between training and testing data.
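A minimal PyTorch sketch of these two variants is shown below; the lookback/horizon dimensions and the moving-average kernel size are placeholder assumptions, not the settings of the original work.

```python
# Minimal sketch of the two LTSF-Linear variants described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLinear(nn.Module):
    def __init__(self, lookback: int, horizon: int, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.trend_layer = nn.Linear(lookback, horizon)
        self.seasonal_layer = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, lookback)
        # Moving-average trend; pad both ends so the length is unchanged.
        pad = (self.kernel_size - 1) // 2
        padded = F.pad(x.unsqueeze(1), (pad, self.kernel_size - 1 - pad), mode="replicate")
        trend = F.avg_pool1d(padded, self.kernel_size, stride=1).squeeze(1)
        seasonal = x - trend
        return self.trend_layer(trend) + self.seasonal_layer(seasonal)

class NLinear(nn.Module):
    def __init__(self, lookback: int, horizon: int):
        super().__init__()
        self.layer = nn.Linear(lookback, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, lookback)
        last = x[:, -1:]          # subtract the last value to absorb distribution shift
        return self.layer(x - last) + last
```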
Transformer-based models, such as Informer, Autoformer, and FEDformer, use the multi-head self-attention mechanism to capture long-range dependencies in time series data. While these models have introduced innovations like sparse attention, decomposition techniques, and multi-step forecasting strategies to enhance their performance, they suffer from significant limitations, which are elucidated below.
The self-attention mechanism in Transformer models does not inherently preserve sequence order, which is crucial for modeling time series data, and it suffers from computational inefficiency and noise overfitting as sequence lengths increase. Conversely, LTSF-Linear models outperform Transformer-based models across various real-world datasets such as traffic, energy, and weather forecasting. These linear models offer better accuracy with significantly lower computational demands.
LTSF-Linear requires fewer parameters and less memory than Transformer models, making it more suitable for large-scale time series forecasting tasks. Its linear structure also allows for easy interpretation of the learned weights, providing insights into the temporal dynamics of the data. In contrast, despite their theoretical ability to handle long sequences, Transformer models often fail to leverage larger input windows effectively, leading to performance degradation or stagnation. This is particularly evident for datasets that exhibit strong temporal patterns, such as traffic and electricity data, where LTSF-Linear captures daily and weekly periodicities more naturally than Transformer models.
In summary, LTSF-Linear offers a compelling alternative to Transformer-based LTSF solutions, providing both superior accuracy and efficiency, especially in scenarios with clear trends and periodicities. Its simplicity and interpretability make it a strong baseline for future research in time series forecasting.

2.3. Kolmogorov–Arnold Network

The Kolmogorov–Arnold Network (KAN) is an innovative neural network architecture that differs from the traditional multilayer perceptron (MLP) models. It achieves outstanding performance in scientific fields with fewer parameters and enhanced interpretability. KAN is poised to be a key direction in the future development of deep learning models.
KAN’s design is based on the Kolmogorov–Arnold representation theorem, which posits that any multivariate continuous function can be decomposed into a finite combination of univariate continuous functions and addition. Unlike conventional neural networks that use fixed activation functions at each neuron, KAN places learnable activation functions on the edges in place of fixed weights. This allows the network to more effectively approximate complex relationships in the input data, in line with the Kolmogorov–Arnold representation theorem [42].
As per this theorem, any multivariable continuous function f can be expressed as a composition of a finite set of univariable continuous functions.
f(\mathbf{x}) = f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
where $\phi_{q,p}: [0,1] \to \mathbb{R}$ and $\Phi_q: \mathbb{R} \to \mathbb{R}$.
Unlike MLPs, each connection in a KAN layer is defined by a single 1D function that maps one component of the input to one component of the output. This architecture replaces matrix multiplication with a set of learnable function mappings.
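For illustration, the sketch below implements a KAN-style layer in PyTorch with a simplified edge-function parameterization (a learnable mix of fixed Gaussian bases plus a SiLU base path) rather than the B-spline functions of the original KAN [17]; the dimensions and number of basis functions are arbitrary choices.

```python
# Simplified sketch of a KAN-style layer: every input-output edge carries its
# own learnable 1D function, here a weighted sum of fixed Gaussian bases.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleKANLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8):
        super().__init__()
        # Fixed basis centers; one learnable coefficient per edge per basis.
        self.register_buffer("centers", torch.linspace(-1.0, 1.0, num_basis))
        self.coeffs = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)
        self.base_weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, in_dim)
        # Evaluate the Gaussian bases for every input coordinate.
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.1)  # (batch, in_dim, num_basis)
        # phi_{q,p}(x_p): learnable edge functions, summed over inputs per output.
        edge = torch.einsum("bik,oik->bo", basis, self.coeffs)
        # Residual "base" path, akin to the skip term used in practical KANs.
        base = F.silu(x) @ self.base_weight.t()
        return edge + base
```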
Recent research into Kolmogorov–Arnold Networks (KANs) has highlighted their potential as an innovative model for time series forecasting. This approach is particularly advantageous for time series forecasting, as it enables KAN to dynamically learn activation patterns, making it more efficient than conventional multilayer perceptrons (MLPs) in capturing complex temporal dependencies.
KAN’s application to time series forecasting has led to various adaptations, such as Temporal Kolmogorov–Arnold Networks (T-KANs) and Multivariate Temporal Kolmogorov–Arnold Networks (MT-KANs) [43]. A T-KAN detects concept drift and interprets nonlinear relationships in univariate time series, while an MT-KAN captures complex relationships in multivariate series [44]. Another adaptation, Reversible Mixture of KAN Experts (RMoK), uses a mixture-of-experts structure to assign different KAN variants to specific data segments, enhancing predictive accuracy [45].
Despite their promising advancements, KAN and its variants face several limitations in time series forecasting [46]. One key drawback is that they have not been effectively integrated with sequence decomposition methods, which makes it challenging to capture complex global temporal information. Additionally, the high computational cost associated with spline-based activation functions can be resource-intensive compared to more lightweight linear models.

3. Methods

The methodology shown in Figure 2 is divided into three parts: data preprocessing, a mixture of experts for performance metrics decomposition, and a KAN-based temporal layer.

3.1. Mixture of Experts for Performance Metrics Decomposition

Breaking down a time series into its constituent components, such as trend, seasonality, and residuals, allows a model to capture the underlying patterns in the data more effectively. Trend decomposition reduces the complexity of the original time series and supports more informed predictions, ultimately improving the model’s ability to generalize across different data scenarios. It also provides greater interpretability, allowing researchers to analyze the individual effects of trends, seasonalities, and residuals on future outcomes.
To mitigate the challenges that fixed-window average pooling faces when handling complex periodic patterns and trends in real-world data, we developed a Mixture of Experts Decomposition, shown in Figure 3. The aim of this module is to decompose time series data into trend and residual components by leveraging multiple moving averages with different kernel sizes, and then combining them using learned weights to produce the final decomposition results.
The network structure consists of the following key components:
Multi-scale Moving Average Module: This module applies multiple moving average operations, each with a different kernel size, to the input time series. For each moving average operation, a 1D average pooling layer is used, where the kernel size controls the smoothing window, and the stride is fixed at 1. This multi-scale approach enables the network to capture trends over varying time windows. To ensure that the input length remains unchanged, padding is applied to both ends of the sequence before the moving average operation.
The mathematical formulation for this process is as follows:
X_{trend\_init} = F_{moving\_ave}(X)
where $X \in \mathbb{R}^{B \times L}$, $B$ is the batch size, and $L$ is the length of the performance metric sequence. $F_{moving\_ave}$ denotes $n$ sliding-average functions with different window sizes, and $X_{trend\_init}$ is the initial sequence decomposition result.
Weight Learning Module: After obtaining trend components from different kernel sizes, the network uses a linear layer to map the input series to weights. Specifically, the linear layer outputs weights corresponding to each of the trend components. These weights are normalized using a Softmax function to form a probability distribution, ensuring that each trend component is appropriately weighted based on its importance.
The mathematical formulation for this process is as follows:
w_{learnable} = \mathrm{Softmax}(L(X))
where $L$ is a linear layer that transforms the input data into dimension $B \times n$, and $w_{learnable}$ denotes the resulting learnable weights.
Weighted Fusion and Residual Calculation: Once the weights are learned, the trend components from different scales are combined via a weighted sum to produce the final trend. The residual is then calculated as the difference between the input time series and the final trend component. The residual reflects the fast-varying part of the series, while the trend component captures the long-term behavior of the time series.
The mathematical formulation for this process is as follows:
X_{trend} = w_{learnable} \cdot X_{trend\_init}
In this way, we can decompose the performance metric series into two parts: a trend component $X_{trend}$ and a residual component $X_{remainder}$, where
X_{remainder} = X - X_{trend}
The overall process of the MOE decomposition is illustrated in Figure 3; a minimal sketch of the module follows.
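The PyTorch sketch below combines the multi-scale moving average, the weight learning step, and the weighted fusion and residual calculation described above; the kernel sizes (5, 25, 49) are illustrative placeholders, as the exact window sizes are not reproduced here.

```python
# Sketch of the Mixture of Experts decomposition: each expert is a moving
# average of a different scale, and a Softmax over a linear layer produces
# the data-dependent fusion weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MOEDecomposition(nn.Module):
    def __init__(self, seq_len: int, kernel_sizes=(5, 25, 49)):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        # Linear layer mapping the input series to one weight per expert.
        self.weight_layer = nn.Linear(seq_len, len(kernel_sizes))

    @staticmethod
    def _moving_average(x: torch.Tensor, k: int) -> torch.Tensor:
        # Pad both ends so the sequence length is preserved, then average-pool.
        pad_left = (k - 1) // 2
        pad_right = k - 1 - pad_left
        padded = F.pad(x.unsqueeze(1), (pad_left, pad_right), mode="replicate")
        return F.avg_pool1d(padded, kernel_size=k, stride=1).squeeze(1)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len) performance metric series.
        trends = torch.stack(
            [self._moving_average(x, k) for k in self.kernel_sizes], dim=1)  # (B, n, L)
        weights = torch.softmax(self.weight_layer(x), dim=-1)                # (B, n)
        trend = (weights.unsqueeze(-1) * trends).sum(dim=1)                  # weighted fusion
        remainder = x - trend                                                # residual
        return trend, remainder
```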

3.2. KAN-Based Temporal Layer

In the original DLinear model, the decomposed components of the time series are processed through a linear layer, where historical data are used to perform regression, and a weighted sum is calculated to produce the final forecasted values. While this approach works well for modeling linear relationships, it may fall short when dealing with complex, nonlinear patterns in the data.
To address this limitation, we propose an extension to the DLinear model by replacing the linear layer with a Kolmogorov–Arnold Network (KAN). KAN is known for its exceptional nonlinear fitting capabilities, which make it particularly suitable for capturing more intricate relationships in time series data. By incorporating KAN, we aim to enhance the model’s ability to capture the underlying temporal dynamics, thus improving the accuracy of the predictions for the decomposed components (Figure 4).
This modification allows the model to effectively handle both linear and nonlinear trends, resulting in more precise forecasting, especially in scenarios where the time series exhibits complex patterns that cannot be easily captured by linear models. The enhanced ability of KAN to fit nonlinearity makes it a powerful tool in refining the prediction process for time series decomposition, leading to better overall performance. The mathematical expression is
y_{trend} = f_{k1} \circ f_{k2}(X_{trend})
y_{remainder} = f_{k1} \circ f_{k2}(X_{remainder})
where $f_{k1}$ is the first layer in the KAN, a combination of a finite number of univariate continuous functions, and $f_{k2}$ is the second layer. The outputs $y_{trend}$ and $y_{remainder}$ represent the trend and residual components for the next $N$ timesteps predicted by the MOE-DKAN model. The final output $y$ is as follows:
y = y_{trend} + y_{remainder}
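The sketch below assembles this forward pass, assuming the MOEDecomposition and SimpleKANLayer sketches given earlier are in scope; the hidden dimension is an arbitrary illustrative choice rather than the paper’s setting.

```python
# Sketch of the overall MOE-DKAN forward pass: decompose the input, apply a
# two-layer KAN head to the trend and residual components, and sum the outputs.
import torch
import torch.nn as nn

class MOEDKAN(nn.Module):
    def __init__(self, seq_len: int, pred_len: int, hidden_dim: int = 64):
        super().__init__()
        self.decomposition = MOEDecomposition(seq_len)
        self.trend_kan = nn.Sequential(SimpleKANLayer(seq_len, hidden_dim),
                                       SimpleKANLayer(hidden_dim, pred_len))
        self.remainder_kan = nn.Sequential(SimpleKANLayer(seq_len, hidden_dim),
                                           SimpleKANLayer(hidden_dim, pred_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq_len)
        trend, remainder = self.decomposition(x)
        y_trend = self.trend_kan(trend)            # KAN head for the trend component
        y_remainder = self.remainder_kan(remainder)  # KAN head for the residual component
        return y_trend + y_remainder               # final forecast
```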

4. Experiment

4.1. Experiment Setup

4.1.1. Dataset Description

We conducted extensive experiments using the Server Machine Dataset (SMD), a dataset collected over 5 weeks from a major internet company, as outlined in Table 1. The SMD is organized into three groups of entities, each identified by labels of the form machine-<group_index>-<index>, encompassing data from twenty-eight distinct machines. For our experiments, each subset of data was processed independently and split into two equal halves; the first half was used for training and the second for testing. Although the dataset includes anomaly labels, our focus on server performance metric forecasting, a self-supervised learning task, meant that we used only the time series data for both training and testing.
In preparing the multivariate time series data from the SMD for our analysis, the initial critical step was standardization, which addresses the varying scales across different features. Such multivariate time series often involve multiple variables, each scaled differently, leading to potential imbalances during model training if not normalized. We applied the min–max normalization to each server performance metric series within the dataset to ensure uniformity. The normalization is defined by the following equation:
X = \frac{x - x_{min}}{x_{max} - x_{min}}
Here, $x$ represents the original value within the server performance metric series, while $x_{min}$ and $x_{max}$ are the minimum and maximum values of the metric across the dataset, respectively; $X$ denotes the normalized value.
By normalizing the server performance metrics in this way, we prevented any single feature with a larger numerical range from overpowering others during model training. This normalization not only ensured an equal contribution of all input features to the learning process but also improved the efficiency of data processing by subsequent model components, such as the MOE decomposition module and the KAN-based temporal layer. As a result, the models can train more effectively on the standardized data, leading to more accurate forecasts of server performance metrics.
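As an illustration, the normalization above can be implemented as follows; computing the statistics on the training split and reusing them at test time is an assumption on our part, since that detail is not stated here.

```python
# Minimal sketch of per-metric min-max normalization.
import numpy as np

def min_max_normalize(series: np.ndarray, x_min=None, x_max=None):
    """Scale a performance-metric series to [0, 1] and return the statistics."""
    x_min = series.min() if x_min is None else x_min
    x_max = series.max() if x_max is None else x_max
    scale = (x_max - x_min) if x_max > x_min else 1.0   # guard against constant series
    return (series - x_min) / scale, x_min, x_max

# Fit the statistics on the training half and reuse them on the test half.
train = np.array([0.2, 0.4, 0.9, 0.5])
test = np.array([0.3, 0.7])
train_norm, lo, hi = min_max_normalize(train)
test_norm, _, _ = min_max_normalize(test, lo, hi)
```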

4.1.2. Experimental Environment

The experimental hardware environment is shown in Table 2. All models were implemented in PyTorch 1.9.

4.2. Training Details

In the training process of the proposed network, we employed a supervised learning approach to optimize the model parameters. The training data consisted of time series samples, each of which was decomposed into a trend component and a residual component. The loss function was designed to minimize the error between the predicted trend and residual components and their corresponding ground truth values.
Specifically, the following key details were applied during training:
Loss Function: The loss function used in this model was calculated as the Mean-Squared Error (MSE) between the predicted output values and the true values of the time series.
Optimization: The model was trained using the Adam optimizer with an initial learning rate of η = 10 4 .
Batch Size and Epochs: The model was trained with a batch size of 64, and the total number of training epochs was set to 100. Early stopping was implemented based on the validation loss, with a patience of 10 epochs to avoid overfitting and unnecessary computation.
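The sketch below reflects this training configuration (MSE loss, Adam with a learning rate of 1e-4, batch size 64, up to 100 epochs, early stopping with a patience of 10); the model and the train/validation data loaders are assumed to exist and are not part of the description above.

```python
# Sketch of the training loop with early stopping on the validation loss.
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda"):
    model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0

    for epoch in range(100):
        model.train()
        for x, y in train_loader:                  # batches of size 64
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                    # keep the best checkpoint
            best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # early stopping
                break

    model.load_state_dict(best_state)
    return model
```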
The training loss curves of the model on selected subsets of the dataset are shown in Figure 5.

4.3. Evaluation Metrics

MSE and MAE are commonly used in time series forecasting tasks because they effectively measure prediction accuracy, with MSE emphasizing larger errors and MAE providing a straightforward average of absolute errors.
The Mean Absolute Error (MAE) is a commonly used metric in regression analysis to measure the average magnitude of errors between predicted and actual values. It is the average of the absolute differences between the predicted values y ^ i and the actual values y i . The formula for MAE is
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
where $n$ is the total number of data points.
The Mean-Squared Error (MSE) is another popular regression metric that calculates the average of the squared differences between predicted and actual values. MSE gives higher weight to larger errors due to the squaring of the residuals, making it more sensitive to outliers. The formula for MSE is
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
Both MAE and MSE provide insight into the accuracy of a model’s forecasting, with MSE penalizing larger errors more heavily than MAE.
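For completeness, both metrics can be computed directly from the definitions above, for example:

```python
# Computing the two evaluation metrics from their definitions.
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))
```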

4.4. Experimental Result

In this experiment on server performance metric forecasting, we proposed the MOE-DKAN model and compared its performance against baseline models, including Linear, DLinear, and FEDformer. FEDformer was selected as the representative Transformer model due to its ability to efficiently capture long-range dependencies and its state-of-the-art performance in long-term time series forecasting. The experiments were conducted across four forecast horizons: 96, 192, 336, and 720 points. The error estimation metrics used were the Mean-Squared Error (MSE) and Mean Absolute Error (MAE). The experiment was divided into two categories: univariate server performance metrics forecasting and multivariate server performance metrics forecasting. The results of the univariate time series forecasting experiments are presented in Table 3, showcasing the performance of different models on single-variable data. In contrast, the multivariate time series forecasting results, which involve multiple variables interacting over time, are shown in Table 4. These two experimental setups allow us to evaluate the effectiveness of the proposed MOE-DKAN model in handling both simple and more complex time series structures, providing a comprehensive assessment of its performance across different data scenarios.

5. Discussion

The loss curves in Figure 5 show that, compared to DLinear, MOE-KAN converges more slowly during the early training epochs. However, after epoch 20, MOE-KAN converges with greater robustness than DLinear, demonstrating its superior capability in capturing complex temporal features.
Based on the experimental results presented in Table 3 and Table 4, it is evident that the MOE-KAN model consistently outperforms the baseline models (Linear, DLinear, and FEDformer) in both univariate and multivariate server performance forecasting tasks. This demonstrates the superior capability of MOE-KAN in capturing both linear and nonlinear patterns in server performance data.
The MOE-KAN model’s superior performance in both univariate and multivariate server performance forecasting tasks can be attributed to its ability to decompose server performance data into meaningful components and apply specialized experts for each component. This approach allows the model to better capture the intricate relationships within server performance metrics, especially in scenarios where complex patterns emerge over time. Additionally, MOE-KAN’s nonlinear fitting capabilities enable it to adapt to the varying nature of server performance data, where the temporal dynamics can fluctuate between linear and nonlinear behaviors.
Furthermore, the ability of MOE-KAN to consistently outperform other models across different horizons highlights its robustness in handling both short-term and long-term forecasting tasks. While models like Linear and DLinear perform well in shorter forecast horizons, their performance degrades as the forecast horizon lengthens, particularly in multivariate scenarios. In contrast, MOE-KAN maintains stable performance, showing that it generalizes well even in more complex multivariate tasks.

6. Conclusions

In conclusion, we present a novel approach, the Mixture of Experts for Decomposition Kolmogorov–Arnold Network (MOE-DKAN), aimed at improving server performance metrics forecasting. By leveraging the Kolmogorov–Arnold Network’s (KAN) powerful nonlinear fitting abilities and the Mixture of Experts (MOE) framework, the MOE-DKAN model effectively decomposes server performance data into trend and residual components. Through the use of multiple average filters and data-dependent weighting, the model enhances prediction accuracy across various forecast horizons. Our extensive experiments on both univariate and multivariate server performance metrics demonstrate that MOE-DKAN consistently outperforms traditional baseline models, such as Linear, DLinear, and FEDformer, achieving significant improvements in Mean-Squared Error (MSE) and Mean Absolute Error (MAE).
To further advance the capabilities of the MOE-KAN model, future research should explore applying it to other domains, such as financial modeling, healthcare diagnostics, and environmental monitoring. These fields often involve complex, high-dimensional time series data, which could benefit from the interpretability and adaptive forecasting capabilities of MOE-KAN. Additionally, integrating advanced sequence decomposition techniques could help capture intricate temporal patterns more effectively.
Another key area for future research is improving the MOE-KAN model’s performance by enhancing the MOE mechanism to more accurately assign data segments to the most suitable experts. Optimizing the training process to reduce computational overhead while maintaining predictive accuracy is also crucial. Incorporating transfer learning techniques could enable the MOE-KAN model to adapt quickly to new datasets with limited labeled data, broadening its practical applicability. Lastly, expanding the interpretability of MOE-KAN through symbolic regression and visualization techniques would provide deeper insights into its decision-making process, making the model more transparent for critical applications.

Author Contributions

Conceptualization, F.P.; methodology, X.J.; data curation, L.Z.; writing—original draft preparation, K.Z.; writing—review and editing, J.W. and W.W.; visualization, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Grid Corporation of China Big Data Center, grant number SGSJ0000YWJS2400019.

Data Availability Statement

Access to the experimental data presented in this article can be obtained by contacting the corresponding author.

Acknowledgments

We are grateful for the valuable resources offered by the Server Machine Dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LTSF   Long-term Time Series Forecasting
MOE    Mixture of Experts
KAN    Kolmogorov–Arnold Network
MAE    Mean Absolute Error
MSE    Mean-Squared Error

References

1. Hintemann, R.; Hinterholzer, S.; Konrat, F. Server Stock Data—A Basis for Determining the Energy and Resource Requirements of Data Centres. In Proceedings of the 2024 Electronics Goes Green 2024+ (EGG), Berlin, Germany, 18–20 June 2024; pp. 1–5.
2. Li, Z.; Liang, M.; O’Brien, L.; Zhang, H. The cloud’s cloudy moment: A systematic survey of public cloud service outage. arXiv 2013, arXiv:1312.6485.
3. Fraunhofer, I.Z.M.; European Commission; Deloitte; Directorate-General for Internal Market, Industry, Entrepreneurship and SMEs. Ecodesign Preparatory Study on Enterprise Servers and Data Equipment; European Union: Brussels, Belgium, 2016.
4. Ismail, L.; Materwala, H. Computing server power modeling in a data center: Survey, taxonomy, and performance evaluation. ACM Comput. Surv. (CSUR) 2020, 53, 1–34.
5. Xu, F.; Liu, F.; Jin, H.; Vasilakos, A.V. Managing performance overhead of virtual machines in cloud computing: A survey, state of the art, and future directions. Proc. IEEE 2013, 102, 11–31.
6. Shuja, J.; Bilal, K.; Madani, S.A.; Othman, M.; Ranjan, R.; Balaji, P.; Khan, S.U. Survey of techniques and architectures for designing energy-efficient data centers. IEEE Syst. J. 2014, 10, 507–519.
7. Binkert, N.L.; Hsu, L.R.; Saidi, A.G.; Dreslinski, R.G.; Schultz, A.L.; Reinhardt, S.K. Performance analysis of system overheads in TCP/IP workloads. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05), St. Louis, MO, USA, 17–21 September 2005; pp. 218–228.
8. Balen, J.; Vajak, D.; Salah, K. Comparative performance evaluation of popular virtual private servers. J. Internet Technol. 2020, 21, 343–356.
9. Zia, A.; Khan, M.N.A. Identifying key challenges in performance issues in cloud computing. Int. J. Mod. Educ. Comput. Sci. 2012, 4, 59.
10. Rao, V.V.; Rao, M.V. A survey on performance metrics in server virtualization with cloud environment. J. Cloud Comput. 2015, 2015, 291109.
11. Katal, A.; Dahiya, S.; Choudhury, T. Energy efficiency in cloud computing data centers: A survey on software technologies. Clust. Comput. 2023, 26, 1845–1875.
12. Kalbarczyk, Z.T.; Nakka, N.M. Classical Dependability Techniques. In Dependable Computing: Design and Assessment; Wiley: Hoboken, NJ, USA, 2024.
13. Tuli, S.; Ilager, S.; Ramamohanarao, K.; Buyya, R. Dynamic scheduling for stochastic edge-cloud computing environments using A3C learning and residual recurrent neural networks. IEEE Trans. Mob. Comput. 2020, 21, 940–954.
14. Abouelyazid, M. Machine Learning Algorithms for Dynamic Resource Allocation in Cloud Computing: Optimization Techniques and Real-World Applications. J. AI-Assist. Sci. Discov. 2021, 1, 1–58.
15. Deng, S.; Xiang, Z.; Zhao, P.; Taheri, J.; Gao, H.; Yin, J.; Zomaya, A.Y. Dynamical resource allocation in edge for trustable internet-of-things systems: A reinforcement learning method. IEEE Trans. Ind. Inform. 2020, 16, 6103–6113.
16. Sabireen, H.; Neelanarayanan, V. A review on fog computing: Architecture, fog with IoT, algorithms and research challenges. ICT Express 2021, 7, 162–176.
17. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold networks. arXiv 2024, arXiv:2404.19756.
18. Liu, Y.; Gong, C.; Yang, L.; Chen, Y. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 2020, 143, 113082.
19. Lindemann, B.; Müller, T.; Vietz, H.; Jazdi, N.; Weyrich, M. A survey on long short-term memory networks for time series prediction. Procedia CIRP 2021, 99, 650–655.
20. Torres, J.F.; Hadjout, D.; Sebaa, A.; Martínez-Álvarez, F.; Troncoso, A. Deep learning for time series forecasting: A survey. Big Data 2021, 9, 3–21.
21. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A 2021, 379, 20200209.
22. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306.
23. Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1597–1600.
24. Liu, X.; Lin, Z. Impact of COVID-19 pandemic on electricity demand in the UK based on multivariate time series forecasting with Bidirectional Long Short Term Memory. Energy 2021, 227, 120455.
25. Tayal, A.R.; Tayal, M.A. DARNN: Discourse Analysis for Natural languages using RNN and LSTM. Int. J. Next-Gener. Comput. 2021, 12, 762.
26. Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85.
27. Hamfelt, T. Forecasting the Regulating Price in the Finnish Energy Market Using the Multi-Horizon Quantile Recurrent Neural Network. Master’s Thesis, Lund University, Lund, Sweden, 2020.
28. Jing, X.; Luo, J.; Zuo, G.; Yang, X. Interpreting runoff forecasting of long short-term memory network: An investigation using the integrated gradient method on runoff data from the Han River basin. J. Hydrol. Reg. Stud. 2023, 50, 101549.
29. Kag, A.; Saligrama, V. Time adaptive recurrent neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15149–15158.
30. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764.
31. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115.
32. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted transformers are effective for time series forecasting. arXiv 2023, arXiv:2310.06625.
33. Chen, M.; Peng, H.; Fu, J.; Ling, H. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12270–12280.
34. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286.
35. Liang, X.; Yang, E.; Deng, C.; Yang, Y. CrossFormer: Cross-modal Representation Learning via Heterogeneous Graph Transformer. ACM Trans. Multimed. Comput. Commun. Appl. 2024.
36. Shen, L.; Wang, Y. TCCT: Tightly-coupled convolutional transformer on time series forecasting. Neurocomputing 2022, 480, 131–145.
37. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437.
38. Wen, Q.; Sun, L.; Yang, F.; Song, X.; Gao, J.; Wang, X.; Xu, H. Time series data augmentation for deep learning: A survey. arXiv 2020, arXiv:2002.12478.
39. Kim, G.; Yoo, H.; Kim, C.; Kim, R.; Kim, S. LTScoder: Long-term time series forecasting based on a linear autoencoder architecture. IEEE Access 2024, 12, 98623–98633.
40. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
41. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128.
42. Schmidt-Hieber, J. The Kolmogorov–Arnold representation theorem revisited. Neural Netw. 2021, 137, 119–126.
43. Dong, C.; Zheng, L.; Chen, W. Kolmogorov-Arnold Networks (KAN) for Time Series Classification and Robust Analysis. arXiv 2024, arXiv:2408.07314.
44. Xu, K.; Chen, L.; Wang, S. Kolmogorov-Arnold Networks for Time Series: Bridging Predictive Power and Interpretability. arXiv 2024, arXiv:2406.02496.
45. Han, X.; Zhang, X.; Wu, Y.; Zhang, Z.; Wu, Z. KAN4TSF: Are KAN and KAN-based models Effective for Time Series Forecasting? arXiv 2024, arXiv:2408.11306.
46. Vaca-Rubio, C.J.; Blanco, L.; Pereira, R.; Caus, M. Kolmogorov-Arnold networks (KANs) for time series analysis. arXiv 2024, arXiv:2405.08790.
Figure 1. Illustration of the decomposition linear model. The left part of the figure shows the overall structure of DLinear. The raw historical data (the red curve) are decomposed into trend (the orange curve) and residual components (the blue curve) using moving averages. These two parts are then transformed by two linear layers of the same size, and the results are summed to obtain the final forecast output (the green curve). The right part of the figure illustrates the basic structure of Linear, which maps historical input values to future forecast values.
Figure 2. The MOE-KAN’s overall structure diagram. The server performance metrics (the light blue curves) are decomposed into trend and residual components (the dark blue curve in the middle) using average filters of varying sizes. Expert outputs, the final trend of raw data (the orange curve), are combined through data-dependent weighting, with trends predicted by aggregating both components (the green curves). The Kolmogorov–Arnold Network (KAN) models nonlinear relationships, and the final output (the dark blue curve on the right) is obtained by adding the KAN-predicted trend to the residual.
Figure 3. MOE flowchart. We propose a Mixture of Experts Decomposition, utilizing multiple average filters of varying sizes to capture trend components, which are then combined into a final trend through data-dependent learnable weights.
Figure 4. A diagram of the KAN-based temporal linear model illustrates its powerful nonlinear fitting capability, which effectively overcomes the limitation of traditional linear layers in capturing nonlinear components.
Figure 5. The training loss comparison between the MOE-KAN and DLinear models on certain subsets.
Table 1. The statistics of the Server Machine Dataset.
Groups | Machines | Variates | Timesteps | Granularity
3 | 28 | 38 | 50,400 | 1 min
Table 2. Experimental environment.
Resource | Specification
CPU | AMD Epyc 9654 Processor, 96 Core, 2.4 GHz
RAM | 128 GB
GPU | NVIDIA RTX A6000
OS | Ubuntu 22.04.3
Table 3. Univariate server performance metrics forecasting MSE and MAE values for various models and forecast horizons. The best performance metrics are highlighted in bold to indicate superior results.
Horizon | MOE-DKAN (MSE / MAE) | Linear (MSE / MAE) | DLinear (MSE / MAE) | FEDformer (MSE / MAE)
96 | 0.053 / 0.177 | 0.178 / 0.359 | 0.062 / 0.195 | 0.084 / 0.251
192 | 0.0785 / 0.199 | 0.097 / 0.245 | 0.093 / 0.244 | 0.119 / 0.245
336 | 0.093 / 0.235 | 0.104 / 0.268 | 0.119 / 0.290 | 0.126 / 0.292
720 | 0.108 / 0.241 | 0.176 / 0.350 | 0.193 / 0.401 | 0.146 / 0.310
Table 4. The forecasting results of multivariate server performance metrics. The experimental parameter settings and evaluation metrics are the same as those in Table 3.
Horizon | MOE-DKAN (MSE / MAE) | Linear (MSE / MAE) | DLinear (MSE / MAE) | FEDformer (MSE / MAE)
96 | 0.152 / 0.278 | 0.181 / 0.293 | 0.153 / 0.281 | 0.194 / 0.312
192 | 0.169 / 0.283 | 0.172 / 0.280 | 0.169 / 0.277 | 0.219 / 0.352
336 | 0.173 / 0.288 | 0.186 / 0.285 | 0.173 / 0.285 | 0.215 / 0.403
720 | 0.212 / 0.325 | 0.223 / 0.318 | 0.212 / 0.325 | 0.273 / 0.386