1. Introduction
Energy transition is a critical trend in the development of the energy industry, aiming primarily to enhance energy utilization efficiency [
1]. Traditional energy systems, such as electricity, natural gas, and heating systems, tend to operate independently and are managed by different energy companies, which results in insufficiently tight coupling between energy systems [
2]. As an important aspect of integrated energy system (IES) demand-side energy forecasting, IES load forecasting has become a primary prerequisite for the collaborative IES operation, dispatch, and planning [
3]. Accurate load forecasting can better control the balance between production and demand and increase energy use efficiency.
Extensive studies have been carried out in the field, developing various statistical methods [
4]. Before the advent of machine learning, linear regression techniques were the dominant approach in forecasting [
5]. The autoregressive integrated moving average (ARIMA) method is particularly notable as an advanced autoregressive technique [
6]. However, a drawback of linear regression approaches is that they tend to ignore extrinsic elements, like temperature and humidity, in favor of concentrating exclusively on linear connections within load data. A nonlinear function-to-function model proposed in the literature [
7], which captures complex time series patterns through a multilayer neural network structure, shows higher prediction accuracy in practical applications. Power demand forecasting now uses machine learning approaches thanks to the development of artificial intelligence. Techniques such as random forest (RF) [
8], artificial neural networks (ANNs) [
9,
10,
11,
12], and support vector machines (SVMs) [
13,
14,
15,
16] have become increasingly popular. In SVM applications, selecting parameters related to the hyperplane and kernel function is crucial. It can be difficult to reduce the problem to a quadratic programming (QP) problem since so many variables are involved, particularly when working with large amounts of data. The advent of hybrid forecasting models, which provide more optimal answers, results from deep learning’s growth [
17]. Recurrent neural networks (RNNs), specifically, the long short-term memory (LSTM) variant, show significant potential in power load forecasting. LSTM, which inherits RNNs’ capability for long-term memory, introduces a “forget gate” to address issues related to long-term dependencies. Currently, numerous studies are focused on enhancing LSTM networks. For instance, a combined multivariate linear regression and LSTM approach has been suggested [
18], showing promise on huge datasets spanning western China, Uzbekistan, and portions of the United States. It was confirmed that LSTM could reliably extract load characteristics more accurately than support vector regression (SVR) using the power load of Estonia as a case study [
19].
With the rapid progress of big data and artificial intelligence technologies, the input features of forecasting models have become increasingly diversified, multidimensional, and complex. In practical applications, multivariate time series data often contain a large number of redundant and irrelevant feature variables, which may mask key feature information and negatively affect the accuracy of time series forecasting models. Therefore, there is an urgent need for a good feature extraction method to filter the input features. Ref. [
20] integrated the dynamic time warping (DTW) algorithm with the LSTM model to address data sparsity issues in short-term load forecasting, demonstrating DTW’s effectiveness in compensating for missing future data. The significance of the model’s capacity for generalization and training speed is rapidly becoming more apparent as a result of the ongoing advancements in deep learning and machine learning technologies, and the convergence process of the model can be effectively accelerated by introducing suitable optimization algorithms, avoiding falling into local optimal solutions, and enhancing the model’s generalization ability. The prediction accuracy significantly improved in Ref. [
21]’s study based on Bayesian optimization algorithms to improve the traditional LSTM, convolutional neural networks (CNNs), least squares support vector machine, and additional models. At the same time, compared with other models, the LSTM and Bayesian optimization algorithms have better fitness. Ref. [
22] proposed a two-stage adaptive superposition prediction of multivariate loads, significantly improving the model’s generative memorization capability and stability through the parallel approach of initial prediction and prediction correction.
Accompanied by the accelerated evolution of integrated energy systems, single-load forecasting has increasingly given way to multivariate load forecasting in the study of load forecasting. At present, multivariate load forecasting mainly focuses on deterministic forecasting, which can be categorized into two types according to the number of model research objects. The first category is the separate prediction of multivariate loads, which uses single-task-learning (STL) models to establish independent prediction models for multiple loads, like electricity, heating, and cooling, respectively, and the models are unrelated. Ref. [
23] decomposed the data by aggregated modal decomposition and then used a temporal convolutional network (TCN) for prediction. Ref. [
24] incorporated a dual attention mechanism into the Seq2Seq model, improving the algorithm’s ability to learn temporal and input features. The methods above improve the prediction accuracy of STL models by introducing the attention mechanism or modal decomposition. However, since the STL models are independent, they cannot fully consider the coupling characteristics between multivariate loads. This independence restricts the model from capturing and synthesizing the interactions between loads, which may affect the overall forecasting effect.
The second method is called joint prediction of multiple loads, which may simultaneously produce prediction results for various loads, like heating, cooling, and electricity. It achieves this by predicting multiple loads as a whole using a multi-task learning model. Ref. [
25] proposed an explanatory load forecasting framework that combines multi-task learning (MTL) and LSTM modeling and constructs the input variables through a coupled feature extraction strategy, which makes it possible for the model to effectively capture the strong coupling interactions between the demands of heating, cooling, and electricity. A complex neural network (ComNN) was proposed in the literature of Ref. [
26] for multi-task learning. It extracts the salient features of the shared layer using an attention mechanism that distinguishes between subtasks, and it uses the MTL-ComNN network structure for prediction. The methods above predict multivariate loads through MTL, which can fully consider the common data features among loads and exploit the coupling relationship among them, and their prediction accuracy is usually higher than that of STL models. However, there are differences in the sensitivity of meteorological factors to different loads, i.e., the same meteorological factor affects different loads to different degrees. When forecasting with the MTL model, using only the meteorological features strongly correlated with all the loads as inputs may lead to omitting critical meteorological information. In contrast, using meteorological features strongly correlated with any load as inputs may introduce too much noise at the input level. In addition, there are differences in the coupling characteristics among different loads [
27], and it is difficult for the MTL model to adequately account for these differences when the coupling strength is weak. These factors limit the prediction accuracy of the MTL model [
28]. A single MTL model cannot use future timing information in prediction to improve prediction accuracy. Future time-series information refers to multivariate loads other than the target load at the forecasting moment. For example, due to the dynamic coupling characteristics of multiple loads, the electric, cooling, and heating loads are coupled and related to each other at time
. Therefore, the cooling load and heat load at time
are used in the MTL model. Thus, using the cooling and heating loads at time
as input features for training, the model can be trained to predict the electrical loads at time
based on the values of these loads, thus improving the prediction accuracy. However, for a single MTL model, the multivariate load values at the time, which are the target of prediction, are unknown. Thus, it is impossible to utilize this time-series information to improve the prediction accuracy further.
A multivariate load short-term forecasting model that utilizes multi-task learning and the DTW algorithm is proposed to address the aforementioned issues. The extraction of multivariate load coupling information was considered, concentrating on the problem of limited prediction accuracy due to the inability to fully use the load information at future moments. To effectively utilize the future time-series information, the multi-task learning model based on bidirectional long short-term memory (BiLSTM) first extracts the coupling information between loads and performs a preliminary prediction; second, the DTW algorithm is used to identify the sequences that are most similar to the prediction target, and these similar sequences are added to the training set; third, BiLSTM is applied to establish the multi-task learning shared layer, fully exploiting the characteristics of coupling between loads of heat, electricity, and cold, and the attention mechanism is used to achieve the differential extraction of important features in the shared layer by sub-tasks to enhance the impact of key information; and finally, the established DTW-BiLSTM-MTL model is used to make predictions. Practical examples are employed to evaluate the effectiveness of the presented model.
2. Methodology
2.1. Introduction to the DTW Algorithm Principle
The DTW algorithm is designed to measure similarity between time series by non-linearly aligning sequences, allowing for a more accurate reflection of the intrinsic nature of the data, especially in the presence of temporal delays. Due to the cyclical characteristics expressed in time delays caused by cold, heat, and power load, such a matching method can mitigate the adverse impact of short-term dissimilarities between sequences, thereby capturing meaningful correlative information.
The following are the algorithm’s precise steps to evaluate two time series of length, and . Firstly, an integer sequence path needs to be defined, where , and it must satisfy the following conditions:
- (1)
Monotonicity: Each point on the path must vary monotonically over time, hence and must satisfy , .
- (2)
Continuity: For the path, any point and the next point must satisfy and .
- (3)
Boundary Conditions: The boundary conditions start with and end with .
Under these conditions, the optimization goal of the DTW algorithm is defined as follows:
where
is the path of the match. The selection criteria for the DTW algorithm are given by the following:
is the value taken from the accumulated distance matrix at point
.
represents the index position of sequence
;
represents the index position of sequence
.
According to similarity, the relationship between source features and target features is established, forming feature pairs.
Figure 1 is a schematic of DTW.
2.2. Introduction to the BiLSTM Algorithmic Mechanism
LSTM is an improvement of recurrent neural networks (RNNs), aiming to solve the problem of gradient explosion or gradient disappearance of RNNs due to the large span and the influence of the activation function when dealing with long time series data.
Figure 2a displays the architecture of the LSTM model. LSTM improves long-term dependent data handling by introducing three gating units—the input gate, forget gate, and output gate—based on RNNs and using sigmoid or tanh activation functions to control the information flow. The input gate is responsible for selectively recording new input information into the memory cell of the previous step; the forgetting gate is responsible for only forgetting and retaining the information in the memory cell state passed during the last step; and the output gate passes the updated cell state to the next time step, as elaborated in Equation (3).
Conversely, BiLSTM expands upon the traditional LSTM by utilizing two distinct LSTM networks at every time step: a forward-moving network and a backward-moving network [
29]. This approach enables the model to incorporate both preceding and succeeding contexts within the sequence. Although each part functions like a typical LSTM, the input sequence is processed in reverse order by the backward LSTM, and the BiLSTM output is ascertained using the method outlined in Equation (4).
where
,
, and
represent the features of the forgetting, output, and input gates at time
, respectively, and the weights that need to be updated and learned are represented by
,
,
,
,
, and
. The output layer’s activation function is indicated by
,
represents the input at moment
, and
denotes the output from the prior time step. The weight matrices required to generate the outputs are
and
. The output bias matrix is denoted by
. The forward and backward LSTM outputs at time step
are represented by
and
, respectively.
2.3. Improved Bayesian Optimization Algorithm
The Bayesian optimization algorithm works by minimizing or maximizing a black-box objective function. It is based on Bayes’ theorem and probabilistic models, such as the Gaussian process (GP). It gradually converges to the global optimal solution by selecting sampling points sequentially in the search space and estimating the optimal value of the function based on the existing sampling data. The expression is shown in Equation (5).
In this context,
represents the prior distribution model,
represents the optimal parameter value under the constraint of
, and
represents the candidate set. Compared to grid search and random search, the Bayesian optimization algorithm can achieve satisfactory optimization results with fewer iterations [
30].
The Gaussian process is a probabilistic surrogate model that works well when optimizing continuous hyperparameters, like the learning rate, dropout rate, etc. However, it struggles with discrete hyperparameters, like batch size, epoch count, BiLSTM hidden units, and other discrete hyperparameters, like layer count, pooling layer count, and BiLSTM network layer count. It is statistically difficult to perform well in complicated network models if the activation function type, convolution kernel size, or pooling window size cannot be adjusted via the Bayesian optimization procedure.
The topic of this research is the optimization of discrete and continuous hyperparameters in neural network architectures. It makes use of the tree-structured Parzen estimator (TPE) model, which addresses the drawbacks of the conventional GP-based Bayesian optimization models and enables more effective optimization of the neural network architecture and hyperparameters.
The TPE algorithm uses a Gaussian mixture model (GMM) to model the search space. First, according to the Bayesian framework, the conditional probability distribution is decomposed into the product of the likelihood and the prior probability , where is updated based on new observations. Therefore, the key lies in obtaining .
The TPE algorithm employs different strategies based on the search space, replacing the continuous uniform distribution with a truncated Gaussian mixture and the stepwise uniform distribution (discrete) with a truncated categorical mixture, where the weights are adjusted for each category. For different levels of observation
, the algorithm uses different replacements and generates different densities in the configuration space. These two densities are then used to redefine the probability density function
, obtained by the following Equation (6).
In the equation, represents the density function formed by the observations when the evaluation value is less than the threshold , while represents the density function formed by the observations when the evaluation value is greater than the threshold .
Therefore, the TPE algorithm constructs two different distributions from the observations, which can be regarded as a distribution of better hyperparameter values and a distribution of worse hyperparameter values.
The choice of the threshold value is made by setting a parameter that satisfies the condition . This parameter corresponds to a quantile of . Based on general experience, is usually set to 0.25.
Meanwhile, the EI acquisition function with better acquisition performance is used to guide the next set of iterative evaluation points. Therefore, the acquisition function expression is obtained, as shown in Equation (7).
From the equation above, it follows that , i.e., the value of EI is positively related to the derivative of the denominator of the expression of the acquisition function. When determined, the size of the denominator depends only on the ratio of the two probabilities of . The physical significance of this ratio lies in the fact that this assessment point is the ratio of the probability of the more effective hyperparameter to the probability of the less effective hyperparameter in the current observed assessment. Therefore, the problem of finding the hyperparameters that maximize the acquisition function translates into finding the hyperparameters that maximize this ratio, and the TPE algorithm is based on this principle to find the next set of hyperparameters.
In summary, the TPE algorithm optimizes the acquisition function by analyzing the ratio of the probability distribution of the hyperparameters, thus effectively determining the optimal combination of hyperparameters.
The improved Bayesian optimization algorithm logically determines which types of hyperparameters different types of hyperparameters belong to and then implements the establishment of probabilistic agent models and corresponding acquisition functions to comprehensively update and fit the hyperparameter combinations with the best diagnostic performance of the model.
2.4. Self-Attention Mechanism
By mimicking how attention resources are distributed in the human brain, the attention mechanism (AM) increases the impact of important information on the model output outcomes [
31]. The input information is very complex because load forecasting needs to extract the time series information and characteristic information of uncertain quantities, such as cooling, heating, and electric loads. The self-attention mechanism can adaptably concentrate on various sections of the incoming data and assign greater weights to the important information. Therefore, this paper adopts a neural network centered on the self-attention mechanism to extract the characteristic information of each uncertain quantity and its related variables so as to ensure that the model can focus on the most important characteristic details of each uncertain quantity. The self-attention mechanism is responsible for calculating the attention score, which is used to determine the relevance of the data in question in relation to other data points in the sequence. The configuration of the AM is depicted in
Figure 2. In this paper, the dot product is used to calculate the attention score. The specific steps are as follows:
First, the input data
is multiplied with the matrices
,
, and
to obtain the sequence of query vectors
, the sequence of key vectors
, and the sequence of value vectors
, respectively:
The value weight coefficients corresponding to each key are then obtained by scaling the dot product, which yields the correlation between the query vector sequence
and all the keys in the key vector sequence
. Then, applying the
function to normalize the weight coefficients, the value vector sequence
is weighted and summed to produce the attention score sequence
:
where
is the computational mechanism of the attention score;
is the vector sequence’s dimension.
The attention mechanism offers two clear benefits: firstly, it allows the model to learn the correlation between distant inputs by lowering the maximum distance between any two points in the input sequence to 1; secondly, it allows the input data to be processed in parallel, increasing computational efficiency. These two characteristics can be relied upon to ensure the efficacy and rapidity of the prediction model in feature extraction.
2.5. Multi-Task Learning Theory
The multi-task learning approach as a solution to the problem of inadequate training for multidimensional and multivariate data [
32]. Let
be the number of target tasks, and let
be the distribution space from which all the data in each task originate. As training samples, each task has
data points, i.e.,
, sampled from a distribution
in the space
, where each task has a different
, but they are interrelated. Therefore, the learning of
functions
is the aim of multi-task learning, such that
The knowledge contained in various tasks can be thought of as distinct tasks within multi-task learning during the process. It is important to think about learning many functions simultaneously to balance the limitations and correlations across activities. IES multivariable load forecasting is not simply the sum of predictions for each load category; independent modeling and separate forecasting methods are not applicable. More attention should be paid to the coupling mechanism between multiple loads, exploring more implicit information within, which gives multivariable load forecasting practical significance. Multi-task learning is a method that enhances model representation and generalization capabilities by simultaneously learning multiple tasks, and the regularization of parameters induced by optimizing the learning across tasks demonstrates better generalization performance compared to ordinary training models.
2.6. Shapley Additive Interpretation of Load Forecasting
Like numerous other neural network models, the load forecasting model is opaque. It is necessary to assess the model to ascertain its efficacy and explore the mechanisms by which the characteristics function. Shapley additive interpretation (SHAP) is employed to visually represent the genuine influence of different metrics linked to load on the DTW-BiLSTM-MTL model’s output in this work [
33]. Effective predictors are connected to the model interpretation by applying Shapley values in the SHAP approach. The Shapley value is the primary parameter of the SHAP method. Since many factors affect power load, the Shapley value can be employed to quantify the contribution of each team member to the overall benefit, whether it rises or reduces. This is feasible if every feature connected to power is thought of as a team member and the overall benefit is taken into account when estimating load.
For a sample
from past information
with
samples, the Shapley value
of the
feature
can be defined as follows:
In the formula,
represents the subset of the values of the j-th feature
after being selected;
represents the amount of features in subset
; and
is the feature function, indicating the degree to which the features in
influence the result of the negative load prediction through the ‘cooperative’ process. Formula (14) illustrates how to compute its value using the output value of the negative prediction model that makes use of the features that are not part of subset S.
Equation (14) requires a sum of integrated operation for each load-dependent feature not included in the sample
, so the SHAP equation can be readily derived from Equation (15).
where
represents the mean amount of the anticipated negative load. The Shapley value of the initial feature within the sample related to the negative load that predicts the negative load outcome
is indicated by
. If
, it indicates that the initial feature positively influences the predicted negative load result
, thereby increasing the relative value of
compared to
; conversely, it decreases it.
SHAP will be employed more effectively to comprehend the direction and significance of each load-related indicator after training the DTW-BiLSTM-MTL model.
3. Framework of Multi-Task Learning-Based Joint Prediction Model for Multiple Loads
The sharing mechanism of MTL can account for the coupling relationship between multivariate loads by sharing the learning layer so tasks can interact. In cases with a high correlation between prediction targets, the shared learning layer does not cause much loss and enhances parameter sharing between models. Models with multiple targets can be trained jointly, reducing the parameter size of the model and preventing model overfitting. For IES, the complex coupling relationship between multiple loads with many uncertainties, the large number of parameters, and the close connection between the sub-tasks of cooling, heating, and electricity make it more suitable to use MTL to construct the model, which has a better fitting effect and better generalization ability. Therefore, the IES multivariate load forecasting model is constructed using MTL in this paper.
In addition, the process of selecting features is a pivotal aspect of model construction, and good or undesirable features play a decisive role in the prediction effect. Traditional methods usually rely on manual experience in feature engineering construction. A combination of manual experience and correlation analysis is used for feature selection in this study. Specifically, the features are scored using Spearman correlation analysis and Pearson correlation analysis. Pearson correlation analysis is mainly used to measure the linear correlation between two variables, while Spearman correlation analysis measures the ranked connection between two variables. By combining these two methods, the importance of each feature can be effectively assessed to help validate and assist decision-making, allowing the modeling process to retain important features and remove redundant features, thus improving the predictive capacity of the model.
The training process for the multivariate load forecasting method based on multi-task learning is as follows:
- (1)
Data preprocessing. Abnormal data are eliminated, and missing data for cold, heat, and electricity loads and related influencing factors are filled in.
- (2)
Correlation analysis. Comprehensive analysis of the relevant influencing factors of multivariate loads is conducted by selecting Pearson and Spearman correlation coefficients, deleting the weakly correlated features therein, reducing the impact of the weakly correlated features on the model’s prediction accuracy and training speed, and assisting in feature selection.
- (3)
Construction of an input layer. Preliminary prediction of target data sequences is carried out by a BiLSTM-based MTL model, then historical sequences similar to the target sequences are selected through DTW, and similar data are spliced with the source data to optimize the model’s input attributes to enhance the model’s capacity for generalization and training.
- (4)
Shared layer construction. The MTL shared layer is constructed using BiLSTM to extract the coupling information between the loads of electricity, heat, and cold, and the best choice of BiLSTM hyperparameters is achieved through the application of the Bayesian optimization technique.
- (5)
Output layer construction. To realize the differential selection of significant features in the shared layer by various subtasks, the attention layer is added before the output layer of each task, which will help the model make better use of the data. The attention mechanism prioritizes the critical characteristics pertaining to heat, cold, and electricity loads while assigning varying probability weights to the BiLSTM’s hidden states.
Figure 3 depicts the DTW-BiLSTM-MTL model that is suggested in this paper.
4. Performance Evaluation
This case utilizes multivariate load data from the University of Arizona, Tempe Campus Energy Information System. The dataset contains cooling, heating, and electric load data between January 2023 and December 2023, and the sampling rate of the data is 1 h, i.e., it contains 24 sets of data per day. The meteorological data were obtained from the National Oceanic and Atmospheric Administration (NOAA), and the meteorological stations corresponded to the geographic location of the energy station. At the same 1 h time interval, the meteorological data contained six influencing factors: mean dew point, mean temperature, mean barometric pressure, mean wind speed, maximal wind speed, and precipitation.
4.1. Data Preprocessing
The distance
from the data points to the overall mean was first calculated using the Z-Score formula, as follows:
where
is the value of the data point,
is the mean of the overall data, and
is the standard deviation of the overall data. The points where
is too large are excluded, and the missing data in the dataset are filled in using linear interpolation. Meanwhile, to help the model capture the periodicity of load changes, the original date and time information is converted into a feature vector
, which represents whether the day is a working day (yes is 1, no is 0) and whether the current time point is the
hour of the day.
Considering the influence of the magnitude problem on other input features, loads of cooling, electricity, and heat, in addition to all the climate data, are normalized to facilitate the input. The calculation formula is as follows:
where
is the data before normalization of each input eigenvalue;
and
are the minimum and maximum values corresponding to
, respectively.
In order to guarantee that the predicted values accurately reflect the actual physical significance of each type of load in the IES, the predicted values are ultimately inverted and normalized in order to restore their magnitudes. The formula for the back-normalization is as follows:
4.2. Analysis of the Input Feature Contribution
Considering that multivariate load forecasting involves a complex relationship between several variables, including the supply of different energy sources, weather factors, and so on, taking all the factors as feature inputs will increase noise interference, making the model complex and the training process too long, and overfitting may occur. Through correlation analysis, the degree of interaction between these variables can be determined, helping to identify important influencing factors, thus optimizing the prediction model and improving prediction accuracy.
Since the factors that influence the integrated energy system are complex, there are complicated distribution relationships between different features. The two sets of variables have a linear connection, which is necessary for the Pearson correlation coefficient to be calculated. In contrast, although it can deal with the nonlinear relationship between the variables, the Spearman correlation coefficient is not sensitive to outliers, so the Pearson and Spearman correlation coefficients were chosen simultaneously for the correlation analysis.
The formula for the Pearson correlation coefficient is as follows:
where
is the Pearson correlation coefficient between variables
and
, where the observed values in the sample data are denoted by
and
. The sample size is denoted by
, and the sample means of
and
are, respectively,
and
.
The calculation formula for the Spearman correlation coefficient is as follows:
where
is the Spearman rank correlation coefficient,
is the difference between the two variables in the ranking, and
represents the sample size.
The correlation analyses of Pearson and Spearman are utilized to obtain the contribution of cold, heat, and electricity loads and the six meteorological conditions to each other.
- (1)
Observing the values of each correlation coefficient in
Figure 4, it can be observed that the data values of cold, heat, and electricity loads show a significant correlation. The absolute value of Pearson’s coefficient is above 0.4. The absolute value of Spearman’s coefficient is above 0.7, confirming a close coupling relationship among cold, heat, and electricity loads in this system.
- (2)
The Pearson’s coefficient value of the average temperature is above 0.7, the Spearman’s coefficient value is above 0.4, the absolute value of the average air pressure is above 0.6, the absolute value of the Spearman’s coefficient value fluctuates above and below 0.5, and the correlation with the load data value is strong. The Pearson’s coefficient value between the average dew point and the cold and hot loads is above 0.5, the Spearman’s coefficient value is above 0.65, the Pearson’s correlation coefficient value with the electric loads is below 0.1, and the Spearman’s correlation coefficient value is 0.6, which indicates that the average dew point is strongly correlated with the cold and hot loads and weakly correlated with the electric loads.
- (3)
The correlation between the average wind speed and maximum wind speed and the load data values is small. The absolute value of Pearson’s correlation coefficient is below 0.4; the Spearman’s correlation coefficient’s absolute value varies from 0.24 to 0.41, which is weakly correlated with the load data values; and the correlation coefficient between the precipitation and the load data values has an absolute value below 0.2, which is very weakly correlated with the load data values.
The correlation is strong since the correlation coefficient values of mean temperature, mean air pressure and mean dew point with load data values are large. Therefore, these three factors are considered the main factors influencing the load data values to be input into the model in this paper.
4.3. Evaluation Criteria
A few commonly used evaluation criteria were applied to the IES joint prediction model. Three indexes are used to evaluate computational statistics: weighted mean accuracy (WMA), root mean square error (RMSE), and mean absolute percentage error (MAPE), which are calculated by the following formulas:
- (1)
MAPE is the mean square error between the actual output value and the predicted output value when the integrated energy system makes a joint prediction of each type of energy, and is calculated by the following formula:
where
is the number of samples,
is the actual output value at time
, and
is the forecast output value at time
.
is the average absolute error of the load forecast value in the training set in any 24 h day.
- (2)
When predicting different types of energy simultaneously, the RMSE is the arithmetic square root of the mean square error between the actual and expected output integrative energy system values, calculated by the following formula:
In the formula, the interpretation of the remaining variables is the same for
;
is the standard deviation of the expected output value at time
.
- (3)
The integrated energy system’s WMA reflects the distribution of errors of different energy subsystems.
The calculation formula is as follows:
where
and
represent the actual and predicted values of the
sample.
represents the weight of this sample, which is based on its influence or priority corresponding to the subsystem.
4.4. Comparative Analysis of Multivariate Load Forecasting Results
All the experiments involved in this study are based on building multivariate load prediction models under a Python 3.7 compilation environment with TensorFlow and Keras frameworks. The model hyperparameters involved in the experiments are shown in
Table 1. The data were partitioned for the deep modeling experiments, with 63% of the data designated as the training set, 30% designated as the test set, and the remaining 7% as the validation set. The time step was set to 24 h. To validate the superiority of the DTW-BiLSTM-MTL model multivariate load prediction proposed in this paper, the prediction effects of the model’s extreme gradient boosting and multi-task learning (XGBOOST-MTL), CNN-BiLSTM-MTL, and the model proposed in this study were compared and analyzed, respectively. Meanwhile, to test the enhancement effect of DTW and MTL on BiLSTM prediction models, the DTW-BiLSTM-STL and BiLSTM-MTL models were set up for comparison as ablation experiments.
In order to guarantee the impartiality of the experiment, each model employed a Bayesian optimization algorithm to identify the optimal hyperparameters. The prediction results of each model for a particular week are shown in
Figure 5, and the evaluation metrics on the test set are displayed in
Table 2.
Figure 5 shows that the prediction error on weekdays is smaller than that on holidays because the energy demand on holidays has greater uncertainty and a lack of regularity data. Regarding prediction accuracy, it can be seen from
Table 2 and
Figure 6 that when different models are used to predict the loads for heating, cooling, and electricity, the DTW-BiLSTM-STL model has the highest RMSE, MAPE, and WMA, the worst prediction effect, and the longest training time. As an example, when using the DTW-BiLSTM-MTL model for electrical load prediction, the RMSE and MAPE were reduced by 75.75%, 57.25%, and 62.08% compared to the BiLSTM-MTL model, and the RMSE, MAPE, and WMA were decreased by 82.65%, 56.84%, and 91.25% compared to the DTW-BiLSTM-STL model, verifying the improvement in prediction accuracy of the BiLSTM model through multi-task learning and the dynamic time warping algorithm.
Compared with the other two comparison models, the DTW-BiLSTM-MTL model has the lowest RMSE, MAPE, and WMA, which indicates that its overall prediction effect is the most accurate. The DTW-BiLSTM-MTL model horizontally realizes the information sharing among different types of loads by establishing the MTL sharing layer based on BiLSTM, learning the characteristics of the coupling between the heating, cooling, and electric load data, and horizontally realizing the information sharing. It can also effectively utilize the auxiliary coupling characteristics to reduce the prediction error when a single load fluctuates greatly. DTW clustering, on the other hand, splices the input data by measuring the similarity of time series and vertically realizes the extraction of future time series information, which leads to a greater enhancement of the prediction effect.
4.5. Model Interpretation
The plot of SHAP of the DTW-BiLSTM-MTL model is shown in
Figure 7.
Figure 7 illustrates the top 20 metrics with the largest impacts on multivariate loads between 1 January 2023 and 31 December 2023. The heat load metrics have the greatest impact, with most metrics impacting both directions. There is a significant impact between different load indicators, which verifies the impact of the tight coupling between different load indicators on the forecast accuracy. Meanwhile, comparing the rankings in
Figure 7 to the relevant rankings in
Figure 4 can prove the indicator’s validity. In addition, when analyzing the metrics’ effects at various moments in time, the visualization of multiple time points during the forecasting period is presented in
Figure 7, both in the morning and evening. The load significantly affects the forecast accuracy, and the users’ electricity consumption patterns during the morning and evening peaks and the lunch breaks tend to increase the load. In contrast, the users’ related behaviors during working hours tend to decrease the load.
6. Conclusions
The method proposed in this paper fully considers the connection of coupling between multivariate loads and the effective use of future time-series information, uses the DTW algorithm as an optimization method for the input features of prediction models, calculates the distance matrix between different load data sequences, and applies hierarchical clustering methods according to the distance matrix to cluster and splicing, which effectively solves the problem of the traditional prediction model, which relies on historical data and cannot fully utilize the future time-series information. The MTL shared layer is simultaneously established using a BiLSTM neural network, fully exploiting the properties of coupling between electrical, heat, and cold loads and applying the Bayesian optimization approach to determine the prediction model’s hyperparameters’ global optimal solution.
After finding the optimal parameters, the method has better adaptability and prediction accuracy in load forecasting applications compared to traditional single-task learning and machine learning models. Taking the actual load data of Arizona State University, Tempe Campus as an example and comparing the prediction results of various algorithms under three types of loads, namely, cold, heat, and electricity, it is proven that the DTW-BiLSTM-MTL load forecasting method proposed in this paper can achieve very high prediction accuracy and can adapt to various types of load profiles. The model’s training time is short, which allows it to meet the needs of the actual operation of the power system.