1. Introduction
The importance of electricity load forecasting is increasingly recognized in modern power systems, especially in smart grids and household energy management systems [1]. On one hand, with the proliferation of smart meters and the advancement of the Internet of Things (IoT) [2], more households have introduced smart devices that enable the real-time monitoring and transmission of electricity consumption data, making the collection and analysis of load data more accessible and diversified. On the other hand, the global shift towards the energy transition and a low-carbon economy has promoted sustainable development. Further, the growing electrification of households, coupled with the widespread use of electrical and renewable energy devices, has made household energy demand more complex and volatile.
As a result, electricity load forecasting has become imperative for household energy usage optimization, cost reduction and energy efficiency improvement. Accurate forecasting not only ensures the stable operation of power systems, but also enhances energy utilization and reduces consumption as well as carbon emissions. Ultimately, it can promote sustainable development at the societal level [3].
In general, load forecasting can be classified into the following four types according to the prediction horizon:
Very-short-term load forecasting (VSTLF): The prediction ranges from a few minutes to one hour, and is principally used for real-time dispatch and rapid response [4,5,6,7,8,9].
Short-term load forecasting (STLF): The prediction spans from one hour to one week, serving the daily operation and short-term planning of power systems [10,11,12,13,14,15,16].
Medium-term load forecasting (MTLF): It ranges from one week to one year, supporting seasonal load adjustments and medium-term planning [17,18].
Long-term load forecasting (LTLF): It extends beyond one year, primarily guiding strategic planning and long-term decision-making [19].
Among these types, VSTLF plays a key role in instantaneous dispatch and in responding to sudden load changes. It predicts the power demand over the next few minutes to several hours, assisting electric power companies in making timely adjustments in a dynamic environment and maintaining grid stability by preventing the power imbalance caused by load fluctuations. With the advancement of smart meters and IoT techniques, the acquisition of real-time household electricity data has become practical, thereby providing abundant data for VSTLF. Furthermore, in household scenarios, VSTLF facilitates the scheduling of resident electricity usage away from peak periods, thereby reducing electricity costs. Therefore, it is essential for alleviating power pressure, optimizing resource allocation, and promoting energy conservation and emission reduction.
Existing methods for VSTLF can be divided into traditional statistical methods and machine learning-based ones. Traditional methods, such as the auto-regressive integrated moving average (ARIMA) model [7,8] and exponential smoothing, rely on historical time series data. With the development of machine learning, more advanced algorithms have emerged, e.g., support vector regression (SVR) [9], convolutional neural networks (CNNs) [17], graph neural networks (GNNs) [11] and long short-term memory (LSTM) networks [12,13,14,15,16,17,18]. Compared to statistical methods, learning algorithms can better capture nonlinear features and intricate temporal dependencies, thereby improving forecasting accuracy.
Despite recent advances, VSTLF in the household scenario still faces many challenges. For instance, characteristics of the data (e.g., high frequency, noise and non-stationarity) complicate the data processing and model training procedures, and the heterogeneity in household consumption patterns poses challenges for building models with generalization capability. Further, the real-time data processing requirement calls for both high forecasting accuracy and improved computational efficiency.
In response to the above challenges, we propose a diffusion-and-attention-enhanced temporal model (DATE-TM) with multi-feature fusion, tailored for VSTLF in households. The main contributions of this work can be summarized as follows:
We propose an attention-enhanced decoder module capable of dynamically selecting key encoder outputs during decoding, thus improving the model's attention on significant features. It alleviates the limitation of traditional gated recurrent unit (GRU) models in capturing long-term dependencies, thereby enhancing the forecasting accuracy.
We incorporate a diffusion model for uncertainty modeling into the proposed DATE-TM. By introducing noise to the encoder's final states and performing reverse sampling, the model's ability to handle randomness and uncertainty in the data is improved.
The proposed model supports various features, such as time, weather, temperature, humidity, individual device loads and user behavior, and adapts to variable input lengths. This allows for generalization across diverse household scenarios by dynamically adjusting the input length and feature weights.
The rest of this paper is organized as follows.
Section 2 summarizes present load forecasting approaches, concentrating on machine learning- and deep learning-based ones.
Section 3 depicts the proposed forecasting model, including the functionalities of GRUs, attention mechanisms and diffusion models.
Section 4 presents the experiment setup and used datasets, followed by the performance evaluation and analysis.
Section 5 concludes the work and suggests possible future directions.
2. Related Work
By accurately predicting a household's electricity consumption, energy efficiency can be improved, carbon emissions can be reduced, and grid stability can be ensured, thus preventing potential power failures. Furthermore, load forecasting supports the smooth operation of smart household systems, offers personalized energy management solutions, and reduces electricity bills for residents. It also helps to balance the intermittency of renewable energy, optimize the use of energy storage systems, and increase the utilization rate of renewables.
Hsiao et al. [4] proposed a household load modeling method based on context information and daily schedule analysis. By analyzing daily electricity consumption time series, behavioral patterns are identified and a prediction rule is built. However, the reliance on existing behavior patterns makes the method inadequate for new patterns, possibly incurring forecasting errors once behavior changes. He et al. [10] addressed this issue by proposing a model-agnostic meta-learning method for household STLF, which allows multiple households to cooperatively train a generalized network model. Although neural networks are effective in handling heterogeneous data, challenges remain when there are significant differences in electricity usage behavior across households. Wu et al. [11] further improved model generalization by developing a transfer learning framework that dynamically assigns weights to different GNN models and then integrates the source with target domain data; however, the framework still encounters the alignment issue between source and target domain data. Mansoor et al. [12] introduced the past vector similarity (PVS) technique to forecast the next hour at the individual household level, simplifying the input by using only load information, without relying on additional attributes such as calendar or weather data. Nevertheless, this type of simplification might reduce robustness in handling sudden behavioral changes or anomalies.
To improve forecasting robustness and accuracy, Masood et al. [13] proposed a quantile LSTM model with clustering-based probabilistic forecasting, which reduces data heterogeneity through clustering and enhances the ability to handle outliers through a probabilistic approach. Yet, despite its effectiveness for large-scale forecasting, the method cannot properly process small-sample household data. Tiwari et al. [14] introduced a dual-stage attention-based LSTM (DSA-LSTM) model to enhance accuracy by capturing key temporal features; building upon their study, Li et al. [15] designed an interpretable memristive LSTM (IM-LSTM) network that embeds a hybrid attention mechanism into LSTM units to represent the importance of both variables and time steps. Although DSA-LSTM and IM-LSTM improve interpretability and feature-capturing ability, they require intensive computational resources, limiting their practical implementation.
All of the aforementioned works demonstrated advances in feature extraction, model generalization and heterogeneous data handling through techniques such as similarity analysis, contextual information, neural networks, transfer learning, clustering and attention mechanisms. Some challenges, nevertheless, still remain. For instance, behavior pattern-dependent models are ill-suited to sudden behavioral changes; generalization across households with different consumption patterns is limited; uncertainty and randomness in the load data remain insufficiently addressed (even with attention mechanisms); and fixed-length inputs limit the generalizability of models across different scenarios.
3. Methods
3.1. Problem Definition
VSTLF can be formulated as a multivariate time series forecasting problem. As shown in Figure 1, the goal is to forecast the total household electricity consumption over a future period based on historical data, including the household load data, the meteorological data (e.g., wind speed, temperature and humidity) for the region where the residence is located, and additional feature data collected from the household context. In particular, the problem is formulated as

$$\hat{y}_{t+1:t+H} = f\left(y_{t-L+1:t},\ m_{t-L+1:t},\ a_{t-L+1:t}\right),$$

where $H$ is the forecasting window length; $L$ is the look-back window length; $\hat{y}$ is the predicted load; and $y$, $m$ and $a$ denote the historical load, meteorological information and additional features, respectively. The function $f(\cdot)$ represents the nonlinear mapping learned through the neural network to forecast future loads based on past observations.
3.2. GRU Model
As illustrated in Figure 2, the GRU is adopted to process the time series information [20]; it overcomes the vanishing gradient issue of RNNs by retaining useful information from the past while discarding irrelevant data. Two key gates are involved: the reset gate, which decides how much past information should be combined with the new input, and the update gate, which regulates the amount of past information to be retained at the current time step. In particular, the GRU update is as follows:

$$r_t = \sigma\left(W_r \left[h_{t-1}, x_t\right]\right), \quad z_t = \sigma\left(W_z \left[h_{t-1}, x_t\right]\right),$$

$$\tilde{h}_t = \tanh\left(W_h \left[r_t \otimes h_{t-1}, x_t\right]\right),$$

and

$$h_t = \left(1 - z_t\right) \otimes h_{t-1} + z_t \otimes \tilde{h}_t,$$

where $r_t$ and $z_t$ are the outputs of the reset gate and update gate, respectively, $W_r$, $W_z$ and $W_h$ denote the weight matrices, $\sigma$ denotes the sigmoid activation, and $\otimes$ expresses element-wise multiplication.
In time series forecasting and sequence modeling, the GRU, LSTM and BiLSTM are common RNN variants, each of which alleviates the gradient vanishing and explosion issue in different ways. The GRU offers a key advantage over LSTM and BiLSTM due to its simpler structure, high computational efficiency and suitability for specific tasks [21,22]. In particular, the GRU uses only two gates (the reset and update gates), compared to LSTM's three gates and cell state, leading to fewer parameters, reduced computational cost and faster training. This makes the GRU especially advantageous for smaller datasets or tasks with limited resources, achieving performance similar to LSTM with less training time. Furthermore, the GRU is more stable during training and excels at tasks with short-term dependencies due to its flexible update gate. While LSTM is better for long-term dependencies, the GRU typically outperforms it in tasks dominated by short-term dependencies, and is thus better suited to resource-constrained or time-sensitive applications.
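For concreteness, the GRU update above can be expressed in a few lines of PyTorch. This is a minimal illustrative cell (in practice, torch.nn.GRU provides an optimized implementation); the class and variable names are ours, not from the paper's code.

```python
import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    """Illustrative GRU cell implementing the reset/update-gate equations above."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_r = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate
        self.W_z = nn.Linear(input_size + hidden_size, hidden_size)  # update gate
        self.W_h = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x_t, h_prev], dim=-1)
        r_t = torch.sigmoid(self.W_r(xh))             # how much past to blend in
        z_t = torch.sigmoid(self.W_z(xh))             # how much past to retain
        h_cand = torch.tanh(self.W_h(torch.cat([x_t, r_t * h_prev], dim=-1)))
        return (1.0 - z_t) * h_prev + z_t * h_cand    # element-wise interpolation
```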
3.3. Attention-Enhanced GRU
To augment the network's capability in capturing global dependencies within the input sequence, we integrate the scaled dot-product attention [23] into the GRU architecture. As shown in Figure 3, the attention mechanism calculates a context vector for each time step, which is then fed into the GRU. Given the output sequence $\{h_1, \ldots, h_T\}$ from the GRU encoder, in which each $h_t$ represents the hidden state at time $t$, the attention mechanism is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$ and $V$ denote the query, key and value matrices, respectively, and $d_k$ is the dimension of the key vector. The attention weights are computed by scaling the dot-product between queries and keys, followed by a Softmax operation to obtain normalized weights. These weights are then used to aggregate the values, yielding the context vector. At the current time step $t$, let $y_{t-1}$ be the input label, $\hat{y}_{t-1}$ be the predicted value and $s_{t-1}$ be the previous hidden state, respectively. The GRU update with attention can be expressed as

$$s_t = \mathrm{GRU}\left(\left[\hat{y}_{t-1}; c_t\right], s_{t-1}\right),$$

where $c_t$ is the context vector from the attention mechanism. Finally, the hidden state $s_t$ is passed through a feedforward network to generate the final prediction, as

$$\hat{y}_t = \mathrm{FFN}\left(s_t\right),$$

which is repeated iteratively to generate the entire forecast sequence. The fully connected feedforward neural network (FFN) follows the GRU and plays a crucial role in generating the final predictions. The GRU first captures the temporal features from the input sequence, while the context vector obtained through the attention mechanism refines the GRU's hidden state at each time step. The refined hidden state is then processed by the FFN, which progressively reduces the data dimension through multiple layers, ultimately outputting the final prediction for each time step. The FFN is essential for mapping the complex temporal information encoded by the GRU to the target prediction values, allowing the model to capture intricate nonlinear relationships that the GRU alone would struggle to model.
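To illustrate one decoding step, the snippet below sketches the scaled dot-product attention and the GRU update in PyTorch, with assumed shapes (a batch of 4, 240 encoder steps and hidden dimension 64, matching Table 1). It is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Softmax(QK^T / sqrt(d_k)) V, as in the equation above."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (batch, 1, T)
    weights = F.softmax(scores, dim=-1)               # normalized attention weights
    return weights @ v                                # context vector (batch, 1, d)

# One illustrative decoding step: query with the previous decoder state,
# attend over the encoder outputs, then feed [y_prev; context] to a GRU cell.
batch, T, d = 4, 240, 64
enc_out = torch.randn(batch, T, d)                    # encoder hidden states h_1..h_T
s_prev = torch.randn(batch, d)                        # previous decoder hidden state
y_prev = torch.randn(batch, 1)                        # previous prediction (or label)

context = scaled_dot_product_attention(s_prev.unsqueeze(1), enc_out, enc_out)
gru_cell = torch.nn.GRUCell(input_size=1 + d, hidden_size=d)
s_t = gru_cell(torch.cat([y_prev, context.squeeze(1)], dim=-1), s_prev)
y_t = torch.nn.Linear(d, 1)(s_t)                      # FFN head producing the prediction
```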
3.4. Diffusion Model
Diffusion models are widely used in generative tasks, especially in scenarios that require modeling complex data distributions through iterative sampling from prior knowledge or assumptions. Note that directly modeling the target distribution can pose challenges, especially when capturing the diversity and complexity of the data [24,25]. Therefore, a method is preferred that transforms the target distribution into a standard Gaussian distribution, and then reverses the process to retrieve the target distribution from the Gaussian. Similar to other generative models, such as variational autoencoders (VAEs) and normalizing flows, the principle of diffusion models is to start with a known simple distribution (e.g., a Gaussian) and gradually transform it into the target distribution. Yet, unlike these methods, the transformation process in diffusion models is implemented in multiple steps using a Markov process. Noise is gradually added to the samples at each step, allowing the final generated samples to better capture the characteristics of the target distribution. As such, complex distribution characteristics are learned, thereby improving the quality and diversity of the generated samples. In particular, the diffusion model consists of two procedures, as follows:
The forward process, which gradually adds noise to the data until it becomes pure noise.
The reverse process, which gradually removes the added noise to recover the original data.
In this work, the diffusion model is applied to hidden states of the encoder instead of the raw input data. It helps to focus more on the uncertainty and randomness of hidden states, which is critical for modeling the high-frequency and non-stationary household load data.
3.4.1. Forward Process
Let $h_0$ be the hidden state output from the encoder at the initial step. To gradually transform $h_0$ into white noise, the forward process introduces Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ at each step $n$ as

$$h_n = \sqrt{1 - \beta_n}\, h_{n-1} + \sqrt{\beta_n}\, \epsilon,$$

where $\beta_n$ controls the weight between $h_{n-1}$ and the added noise. As the step progresses, the hidden state becomes increasingly dominated by noise, and the stability and additivity properties of Gaussian distributions allow us to express the noisy hidden state at step $n$ as

$$h_n = \sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \quad \alpha_n = 1 - \beta_n, \quad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i.$$

In particular, the forward process ensures that, after sufficient steps, the hidden state well approximates a standard normal distribution as

$$q(h_T) \approx \mathcal{N}(0, \mathbf{I}),$$

where $T$ is the total number of steps. This transformation provides a simple and tractable prior distribution from which the reverse process can begin reconstruction.
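A minimal PyTorch sketch of this closed-form forward sampling follows, assuming a linear beta schedule and the 5 diffusion steps listed in Table 1; all names and the schedule are illustrative.

```python
import torch

def forward_diffuse(h0: torch.Tensor, n: int, betas: torch.Tensor):
    """Sample h_n directly from h_0 via the closed form above.
    betas is the noise schedule (beta_1..beta_T); alphas_bar is the
    cumulative product of (1 - beta)."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    eps = torch.randn_like(h0)                        # epsilon ~ N(0, I)
    a_bar = alphas_bar[n - 1]                         # \bar{alpha}_n (1-indexed step)
    return a_bar.sqrt() * h0 + (1.0 - a_bar).sqrt() * eps, eps

# Example: 5 diffusion steps (as in Table 1) with an assumed linear schedule.
betas = torch.linspace(1e-4, 0.2, steps=5)
h0 = torch.randn(4, 64)                               # a batch of encoder hidden states
h_n, eps = forward_diffuse(h0, n=3, betas=betas)
```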
3.4.2. Reverse Process
Next, the reverse process is used to recover the original hidden state $h_0$ by gradually removing the noise, modeling the reverse transition as

$$p_\theta\left(h_{n-1} \mid h_n\right) = \mathcal{N}\left(h_{n-1};\ \mu_\theta(h_n, n),\ \Sigma_\theta(h_n, n)\right),$$

where $\mu_\theta$ and $\Sigma_\theta$ are the predicted mean and variance, respectively, which can be approximated by a neural network. Since the true posterior distribution is unknown, the model minimizes the mean squared error (MSE) loss as follows:

$$\mathcal{L} = \mathbb{E}_{n, h_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\left(h_n, n\right) \right\|^2\right],$$

where $\epsilon_\theta(h_n, n)$ is the predicted noise at step $n$. By minimizing this loss, the model ensures that the predicted noise aligns with the true noise, facilitating the accurate reconstruction of the original hidden states.
3.4.3. Reparameterization and Optimization
To facilitate model training, a reparameterization trick is used, and the noisy state at step $n$ can be rewritten as

$$h_n = \sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$

which separates the deterministic part of the hidden state from the stochastic noise, allowing for more stable training. The loss function then becomes

$$\mathcal{L}(\theta) = \mathbb{E}_{n, h_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon,\ n\right) \right\|^2\right],$$

where $\epsilon_\theta(\cdot)$ is the predicted noise term. Through reparameterization, the gradient is obtained as

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{n, h_0, \epsilon}\left[\nabla_\theta \left\| \epsilon - \epsilon_\theta\left(h_n, n\right) \right\|^2\right],$$

which ensures that the model can more effectively utilize the information in samples during each update, reducing the uncertainty introduced by noise and thereby improving stability.
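Putting the pieces together, a training objective based on the reparameterized loss might be sketched as follows; the denoiser architecture and its step embedding are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Small denoiser eps_theta(h_n, n); the architecture is an assumption."""
    def __init__(self, dim: int, n_steps: int):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, dim)    # embeds the step index n
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h_n, n_idx):
        return self.net(h_n + self.step_emb(n_idx))

def diffusion_loss(model, h0, betas):
    """MSE between true and predicted noise, with h_n built by reparameterization."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    n_idx = torch.randint(0, len(betas), (h0.size(0),))   # random step per sample
    a_bar = alphas_bar[n_idx].unsqueeze(-1)
    eps = torch.randn_like(h0)
    h_n = a_bar.sqrt() * h0 + (1.0 - a_bar).sqrt() * eps  # deterministic + noise parts
    return nn.functional.mse_loss(model(h_n, n_idx), eps)
```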
In particular, the reparameterization trick offers several benefits during practical training. First, it transforms the randomness in the training process into a controllable procedure, allowing the model to sample effectively in high dimensions during learning. Second, reparameterization enables the model to generate samples flexibly, enhancing its robustness when facing high-dimensional data. Additionally, with the reparameterization trick, the model's inference process becomes more efficient: sample generation no longer relies on complex latent variable structures but is achieved through simple deterministic functions and standard normal noise. This simplification not only accelerates computation but also improves the model's adaptability in real-time prediction. Finally, the reparameterization trick ensures that the error between predicted and true noise is minimized. This optimization goal improves training, ensuring robust predictive performance in high-noise environments and, in turn, demonstrating the significant advantages of diffusion models in applications such as household load forecasting.
In the proposed framework, the diffusion model captures the distribution of the encoder's hidden states to better manage the high-noise and non-stationary characteristics of household electricity data. Rather than directly modeling the raw load data, the diffusion model exhibits significant advantages by focusing on the hidden states. First, it effectively smooths high-frequency noise, thereby enhancing model robustness and alleviating the impact of random fluctuations in the data. Furthermore, through a multi-step sampling mechanism, the model enhances its adaptability to various household scenarios, improving its generalization capability. Further, the diffusion model supports the probabilistic modeling of prediction results, allowing for more accurate quantification and reflection of uncertainty. All of the above advantages enable the diffusion model to perform well in processing complex load data, improving both prediction accuracy and generalization capability.
3.4.4. Implementation in DATE-TM
In the proposed DATE-TM model, the diffusion model mainly operates on the encoder's hidden states, specifically addressing the volatility and randomness in real-world data, and striving to maintain predictive accuracy when facing complex and dynamically changing load patterns. By leveraging the reverse process, the model effectively manages load variations in high-noise environments.
Note that the proposed DATE-TM model not only enhances generalization across households with varying consumption behaviors, but also improves stability and adaptability. Particularly, when dealing with households exhibiting different consumption patterns, the DATE-TM model can flexibly adjust its parameters to adapt to various data characteristics. Thus, by incorporating the diffusion-enhanced hidden state modeling, the DATE-TM model achieves high forecasting accuracy while ensuring computational efficiency. Furthermore, this allows the model to run quickly on larger datasets without sacrificing accuracy, thereby adapting even better to practical applications.
3.5. DATE-TM Architecture
As depicted in Figure 4, the DATE-TM model integrates GRUs, attention mechanisms and the diffusion model. First, the encoder uses multi-layer GRUs to extract temporal dependencies, addressing the vanishing gradient problem. Then, the decoder combines GRUs with attention to dynamically select relevant encoder outputs, thereby capturing long-term dependencies. Finally, the diffusion model serves as a bridge connecting the encoder and decoder, enhancing the network's robustness by modeling uncertainty and randomness in the data. Furthermore, DATE-TM supports multi-feature input (including time, weather, temperature, humidity, individual device loads and user behavior) and adapts to variable input lengths. This flexibility enables the model to dynamically adjust input sequences and feature weights in different household scenarios. In particular, the teacher forcing technique is used during training to accelerate convergence by taking the true target outputs as inputs for each step in the decoder, as sketched below. The specific hyperparameter settings for the model are summarized in Table 1.
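The sketch below unrolls the attention-GRU decoder with teacher forcing. It reuses the scaled_dot_product_attention helper from the Section 3.3 sketch, and decoder_cell/head stand in for the GRU cell and FFN output layer; all names are illustrative, not from the released code.

```python
import torch
import torch.nn as nn

def decode_with_teacher_forcing(decoder_cell, head, enc_out, s0, y_last, y_true):
    """Unroll the decoder for H steps; during training, the ground-truth load of
    the previous step is fed back as input (teacher forcing)."""
    s_t, preds = s0, []
    y_prev = y_last                                   # last observed load value
    for t in range(y_true.size(1)):
        context = scaled_dot_product_attention(s_t.unsqueeze(1), enc_out, enc_out)
        s_t = decoder_cell(torch.cat([y_prev, context.squeeze(1)], dim=-1), s_t)
        preds.append(head(s_t))                       # FFN head: hidden state -> load
        y_prev = y_true[:, t].unsqueeze(-1)           # teacher forcing step
    return torch.cat(preds, dim=1)                    # (batch, H)

# Hypothetical usage with the dimensions from Table 1:
# decoder_cell = nn.GRUCell(1 + 64, 64); head = nn.Linear(64, 1)
# y_hat = decode_with_teacher_forcing(decoder_cell, head, enc_out, s0, y_last, y_true)
```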
Compared to traditional models such as BiLSTM and DeepAR, DATE-TM specializes in handling data uncertainty and long-term dependencies while offering high prediction accuracy. This adaptability allows it to process diverse input features and fit various household settings, ensuring robust and reliable load forecasting.
3.6. Data Preprocessing
The original dataset may contain missing values that need to be filled in to ensure proper model training and forecasting. The presence of missing values not only affects the learning process, but may also lead to inaccurate prediction results. Here, the K-nearest neighbor (KNN) algorithm, which uses the similarity between samples to infer missing values, is adopted for imputation. Given two samples $X$ and $Y$, the Euclidean distance is

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} \left(x_i - y_i\right)^2},$$

where $x_i$ and $y_i$ are the values of the $i$-th feature, and $n$ expresses the overall number of features. The weight for each neighbor is determined by the inverse of its distance, i.e., closer neighbors receive heavier weights, as follows:

$$w_j = \frac{1 / d_j}{\sum_{k=1}^{K} 1 / d_k}.$$

Finally, the imputed value for the missing feature is the weighted average over the $K$ neighbors, i.e.,

$$\hat{v} = \sum_{j=1}^{K} w_j v_j,$$

where $v_j$ is the corresponding feature value of the $j$-th neighbor.
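In practice, this imputation scheme is available off the shelf; for example, scikit-learn's KNNImputer with weights="distance" applies this inverse-distance weighting (using a NaN-aware Euclidean distance). A toy sketch with fabricated values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries; the three columns (say, voltage, current,
# humidity) and all values are fabricated purely for illustration.
X = np.array([[230.0, 1.2, 55.0],
              [231.0, np.nan, 54.0],
              [228.0, 1.1, np.nan],
              [229.5, 1.3, 56.0]])

# weights="distance" gives closer neighbours heavier weights, matching the
# weighting scheme above.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
```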
When dealing with multi-dimensional data, significant differences may exist in the magnitudes of features. For instance, temperature is measured in degrees Celsius while electricity consumption is measured in kilowatt-hours. If these features are not normalized, larger-magnitude feature values would dominate, resulting in the underutilization of other features. Furthermore, inconsistent magnitudes of input features can cause numerical instability, which in turn affects the convergence speed. To prevent numerical instability and ensure balanced training, min–max normalization is applied as

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

in which $x_{\max}$ and $x_{\min}$ denote the maximum and minimum values of each feature vector, respectively. The normalization ensures that all features are restricted to the $[0, 1]$ range, thus facilitating the comparison across different features.
We next use the sliding window technique to extract input sequences from normalized data. In particular, the sliding window allows us to generate multiple input sequences from successive data based on a fixed time step. These input sequences not only retain the temporal relevance of time series data but also capture potential trends and seasonal variations, providing rich information for the subsequent model training. Furthermore, the data are split into training, validation and test sets to assess the model’s generalization ability and mitigate overfitting. In particular, 80% of the dataset is used for training, while the remaining 20% serves as the test set, thereby providing an independent evaluation of the model’s performance on unseen data. More specifically, the training set is further divided, with 80% for training and 20% for validation. The validation set enables the real-time monitoring of the model’s performance during the training, facilitating hyperparameter adjustment to prevent the overfitting.
Note that the chronological division of time series ensures the consistency between training and testing environments, making the model adapt to real-world scenarios. Furthermore, it also averts possible biases arising from random partitioning, thereby assuring reliability. Thus, the combination of both sliding window and chronological splitting can fully leverage the data’s temporal characteristics.
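To make the preprocessing pipeline concrete, the following is a minimal NumPy sketch of the normalization, sliding window extraction and chronological split described above; the array shapes and the assumption that the load is in column 0 are illustrative.

```python
import numpy as np

def make_windows(series: np.ndarray, lookback: int = 240, horizon: int = 24):
    """Min-max normalize each feature, then slide a fixed window over the series.
    In practice the min/max should be computed on the training portion only,
    to avoid leaking test-set statistics."""
    mn, mx = series.min(axis=0), series.max(axis=0)
    norm = (series - mn) / (mx - mn + 1e-9)               # all features into [0, 1]
    X, y = [], []
    for i in range(len(norm) - lookback - horizon + 1):
        X.append(norm[i:i + lookback])                    # historical window
        y.append(norm[i + lookback:i + lookback + horizon, 0])  # future load (col 0)
    return np.array(X), np.array(y)

data = np.random.rand(5000, 16)                           # placeholder multivariate series
X, y = make_windows(data)
split = int(0.8 * len(X))                                 # chronological 80/20 split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```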
3.7. Assessment Metrics
The indicators below were adopted to comprehensively assess the model's capability, where $y_i$, $\hat{y}_i$, $\bar{y}$ and $N$ denote the true value, the predicted value, the mean of the true values and the number of samples, respectively:

Symmetric mean absolute percentage error (SMAPE): $\mathrm{SMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{2\left|y_i - \hat{y}_i\right|}{\left|y_i\right| + \left|\hat{y}_i\right|}$.

Mean absolute percentage error (MAPE): $\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$.

Mean absolute error (MAE): $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|y_i - \hat{y}_i\right|$.

Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}$.

Relative root squared error (RRSE): $\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}}$.

Coefficient of determination ($R^2$): $R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}$.

Note that SMAPE and MAPE are widely used for measuring prediction errors, yet SMAPE offers a more stable evaluation by addressing the unbounded error issue of MAPE when true values approach zero. RMSE provides an overall measure of prediction accuracy and is especially sensitive to large errors, while RRSE normalizes RMSE by comparing the model's performance with a baseline model. Furthermore, $R^2$ quantifies the proportion of variance explained by the model, offering insights into its goodness of fit.
By incorporating these metrics, we aim to provide a comprehensive and balanced evaluation of the model’s accuracy, robustness and generalization capability, providing a more nuanced understanding of its performance across different data scenarios.
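For reference, these metrics can be computed directly in NumPy; the sketch below uses the fractional (not percentage) form and assumes nonzero true values wherever MAPE is evaluated.

```python
import numpy as np

def evaluate(y: np.ndarray, y_hat: np.ndarray) -> dict:
    """Evaluation metrics as defined above."""
    err = y - y_hat
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": np.mean(np.abs(err / y)),                     # unbounded near y = 0
        "SMAPE": np.mean(2 * np.abs(err) / (np.abs(y) + np.abs(y_hat))),
        "RRSE": np.sqrt(ss_res / ss_tot),
        "R2": 1.0 - ss_res / ss_tot,
    }
```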
4. Experiments and Results
4.1. Dataset Description
In this part, we present the two real-world household load datasets used for the very-short-term load forecasting experiments: a Mexican household dataset [26] and a smart household dataset.
The two household datasets not only contain energy consumption data, but also records of the weather data for their respective areas. The detailed variables (factors) are listed in Table 2 and Table 3. Additionally, summary statistics, including the minimum and maximum values of each variable, are provided in Table 4 and Table 5, offering a clearer characterization of the datasets.
Residential power usage is characterized by its irregularity compared to commercial usage. To illustrate the volatility and uncertainty of household load, three days of data are sampled from the two households for comparison, as shown in Figure 5. The comparison highlights the variability between households due to external factors (e.g., the economic environment and electrical infrastructure), as well as internal factors (e.g., members' consumption habits). Furthermore, intra-household consumption patterns differ significantly across days, making load forecasting (especially very-short-term forecasting) extremely challenging. Thus, capturing the various impacting factors is essential for developing a model that adapts to diverse household scenarios.
Figure 6 and Figure 7 present the heatmaps for the two households, highlighting notable differences in their correlation patterns. The differences can be attributed to various factors, including the appliance configuration, the specific living environment, and the behavior of household members. For instance, one household may exhibit a strong correlation between high load and specific appliance usage, indicating that certain devices are heavily relied upon during particular periods of the day. In contrast, the other household may display more varied appliance usage, bringing about a more balanced load distribution throughout the day.
4.2. Performance Comparison
The proposed model is compared against several benchmarks; the considered models include DATE-TM, DATE-TM without (w/o) the diffusion module, DeepAR and BiLSTM. For each dataset, two sets of experiments were performed. In the first set, we used 1 min data intervals, with a historical window of 240 time steps (240 min) and a prediction window of 24 time steps (24 min). In the second set, we aggregated the data into 15 min intervals, where the historical window covers 240 steps (60 h) and the prediction window covers 24 steps (6 h). The intermediate values of 240 consecutive prediction sequences were assembled into complete sequences for the prediction comparison. Figure 8 and Figure 9 illustrate the prediction results across the two datasets.
As shown in Figure 8 and Figure 9, the DATE-TM model achieved superior accuracy, as it captures more of the load fluctuations than the other models. Furthermore, removing the diffusion module lowered the performance, especially in the presence of abnormal fluctuations. While DeepAR provided stable results, it performed worse than DATE-TM. Further, BiLSTM struggled to handle long sequences, incurring higher prediction errors.
It can be observed that the prediction accuracy at the 1 min interval level exceeded that at the 15 min interval level. The finer granularity of the 1 min interval data preserves more detailed temporal features, allowing the model to capture the short-term dependencies and dynamic patterns within the shorter prediction horizon. In contrast, the 15 min interval data, with larger intervals, smooth out fluctuations and reduce feature granularity, thereby increasing the prediction difficulty. Furthermore, aggregating the 1 min interval data into 15 min intervals incurs the loss of critical weather-related and load-specific features, which are essential for accurate forecasting. As such, the aggregated data often fail to capture transient load fluctuations and rapid weather changes, and to identify fine-grained temporal dynamics.
Next, a quantitative comparison is shown in Table 6 and Table 7, where the smallest (best) values are highlighted in bold. "M" and "S" refer to the datasets from Mexico and the smart household, respectively.
The comparison of the different models on the household load data demonstrates that the DATE-TM model holds an evident advantage in handling household data. This is because the proposed architecture (with multi-feature fusion, a diffusion module for uncertainty modeling and an attention mechanism) enables the model to capture abnormal and irregular load variations more accurately, achieving superior performance across the various metrics. Although the DATE-TM model without the diffusion module showed a slight degradation in prediction accuracy, it still outperformed both the DeepAR and BiLSTM models.
5. Conclusions and Future Work
We have proposed a novel multi-feature temporal prediction architecture, DATE-TM, for the VSTLF in households. The model integrates a GRU, attention mechanism and diffusion model to address and alleviate the unique challenges faced by household energy usage forecasting. By incorporating versatile features, including historical load data, meteorological variables and user-specific contextual factors, DATE-TM exhibits superior performance in handling data uncertainty and capturing long-term dependencies.
In particular, DATE-TM involves an attention-enhanced decoder module to dynamically select relevant encoder outputs during the decoding process, not only boosting the model's ability to focus on significant features but also compensating for the traditional GRU's limitation in capturing long-term dependencies, thus improving the forecasting accuracy. More specifically, the integration of the diffusion model enhances the model's robustness by introducing noise to the encoder's final states and performing reverse sampling, allowing DATE-TM to effectively manage the randomness and fluctuations inherent in household load data, and significantly improving its reliability in predicting future load profiles. Furthermore, the model is highly adaptable to versatile features such as time, weather, temperature, humidity, individual device loads and user behavior, with variable input lengths, thereby ensuring generalization across diverse household scenarios. This adaptability renders DATE-TM well suited for diverse forecasting tasks while maintaining high performance across various input configurations. Evaluation results using multiple metrics (including MAPE, SMAPE, MAE, RMSE and $R^2$) demonstrate that DATE-TM consistently outperforms traditional models, such as BiLSTM and DeepAR. Notably, the inclusion of the diffusion module significantly strengthens the model's capacity to handle uncertainties and random fluctuations in the load data. Even without the diffusion module, DATE-TM still outperforms the benchmark models, highlighting the effectiveness of its multi-feature fusion and enhanced attention mechanism.
Despite the promising results, there is still room for further improvement. Future work will focus on optimizing feature selection to enhance model performance, simplifying the architecture to improve computational efficiency, acquiring real-time data to enable continuous learning, and extending the model's applicability to other domains, such as industrial load forecasting and urban energy management. These directions will further enhance the versatility and capability of DATE-TM, providing robust support for smart grid systems and intelligent energy management.
Author Contributions
Conceptualization, Y.Z.; methodology, Y.Z. and J.L.; software, J.L.; validation, Y.Z. and J.L.; formal analysis, Y.Z.; investigation, J.L. and C.C.; resources, Y.Z. and C.C.; data curation, J.L.; writing—original draft preparation, Y.Z.; writing—review and editing, J.L., C.C. and Q.G.; visualization, Y.Z.; supervision, C.C. and Q.G.; project administration, C.C.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Conflicts of Interest
Authors Yitao Zhao and Jiahao Li were employed by the Yunnan Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Zhang, X.; Han, J. Automatic machine learning participation in power load forecasting under the background of big data. In Proceedings of the International Conference on Energy, Power and Electrical Technology (ICEPET), Chengdu, China, 17–19 May 2024; pp. 1311–1315.
- Zhang, J.; Lu, W.; Xing, C.; Zhao, N.; Al-Dhahir, N.; Karagiannidis, G.K.; Yang, X. Intelligent integrated sensing and communication: A survey. Sci. China Inf. Sci. 2025, 68, 1–42.
- Yu, B.; Qin, J.; Deng, Z.; Guo, X.; Si, J. Analysis of power load characteristics and research of load forecasting model in a certain area under the background of new power reform. In Proceedings of the International Conference on Smart Power Internet Energy Systems (SPIES), Shenyang, China, 16–18 June 2023; pp. 100–104.
- Hsiao, Y.-H. Household electricity demand forecast based on context information and user daily schedule analysis from meter data. IEEE Trans. Ind. Inform. 2015, 11, 33–43.
- Gonzalez, R.; Ahmed, S.; Alamaniotis, M. Deep neural network based methodology for very-short-term residential load forecasting. In Proceedings of the International Conference on Information, Intelligence, Systems Applications (IISA), Corfu, Greece, 18–20 July 2022; pp. 1–6.
- Alamaniotis, M.; Ikonomopoulos, A.; Tsoukalas, L.H. Evolutionary multiobjective optimization of kernel-based very-short-term load forecasting. IEEE Trans. Power Syst. 2012, 27, 1477–1484.
- Lu, J.-C.; Zhang, X.; Sun, W. A real-time adaptive forecasting algorithm for electric power load. In Proceedings of the IEEE/PES Transmission Distribution Conference Exposition: Asia and Pacific, Dalian, China, 15–18 August 2005; pp. 1–5.
- de Andrade, L.C.M.; da Silva, I.N. Using intelligent system approach for very short-term load forecasting purposes. In Proceedings of the IEEE International Energy Conference, Manama, Bahrain, 18–22 December 2010; pp. 694–699.
- Setiawan, A.; Koprinska, I.; Agelidis, V.G. Very short-term electricity load demand forecasting using support vector regression. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Atlanta, GA, USA, 14–19 June 2009; pp. 2888–2894.
- He, Y.; Luo, F.; Ranzi, G. Transferrable model-agnostic meta-learning for short-term household load forecasting with limited training data. IEEE Trans. Power Syst. 2022, 37, 3177–3180.
- Wu, D.; Lin, W. Efficient residential electric load forecasting via transfer learning and graph neural networks. IEEE Trans. Smart Grid 2023, 14, 2423–2431.
- Mansoor, H.; Rauf, H.; Mubashar, M.; Khalid, M.; Arshad, N. Past vector similarity for short term electrical load forecasting at the individual household level. IEEE Access 2021, 9, 42771–42785.
- Masood, Z.; Gantassi, R.; Choi, Y. Enhancing short-term electric load forecasting for households using quantile LSTM and clustering-based probabilistic approach. IEEE Access 2024, 12, 77257–77268.
- Tiwari, P.; Mahanta, P.; Trivedi, G. A dual-stage attention based RNN model for short term load forecasting of individual household. In Proceedings of the International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 9–10 December 2021; pp. 1–6.
- Li, C.; Dong, Z.; Ding, L.; Petersen, H.; Qiu, Z.; Chen, G. Interpretable memristive LSTM network design for probabilistic residential load forecasting. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2297–2310.
- Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans. Smart Grid 2019, 10, 841–851.
- Han, L.; Peng, Y.; Li, Y.; Yong, B.; Zhou, Q.; Shu, L. Enhanced deep networks for short-term and medium-term load forecasting. IEEE Access 2019, 7, 4045–4055.
- Li, J.; Wei, S.; Dai, W. Combination of manifold learning and deep learning algorithms for mid-term electrical load forecasting. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2584–2593.
- Xie, J.; Hong, T.; Stroud, J. Long-term retail energy forecasting with consideration of residential customer attrition. IEEE Trans. Smart Grid 2015, 6, 2245–2252.
- Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
- Pirani, M.; Thakkar, P.; Jivrani, P.; Bohara, M.H.; Garg, D. A comparative analysis of ARIMA, GRU, LSTM and BiLSTM on financial time series forecasting. In Proceedings of the 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballari, India, 23–24 April 2022; pp. 1–6.
- Mateus, B.C.; Mendes, M.; Farinha, J.T.; Assis, R.; Cardoso, A.M. Comparing LSTM and GRU models to predict the condition of a pulp paper press. Energies 2021, 14, 6958.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Wang, Z.; Wen, Q.; Zhang, C.; Sun, L.; Wang, Y. DiffLoad: Uncertainty quantification in electrical load forecasting with the diffusion model. IEEE Trans. Power Syst. 2024; early access.
- Aguirre-Fraire, B.; Beltrán, J.; Soto-Mendoza, V. A comprehensive dataset integrating household energy consumption and weather conditions in a north-eastern Mexican urban city. Data Brief 2024, 54, 110452.
Figure 1.
Forecasting model for household load prediction. The green and blue lines represent the estimated and true data, respectively. The forecasting model uses historical data to produce predictions for future time steps.
Figure 2.
Structure of the GRU, which uses an update gate and a reset gate to manage the influence of the input $x_t$ and previous state $h_{t-1}$ to generate the new state $h_t$.
Figure 3.
Architecture integrating the GRU with dot-product attention. The attention mechanism processes the input queries $Q$, keys $K$ and values $V$, followed by the concatenation with the previous output $\hat{y}_{t-1}$, and the result is fed into the GRU to generate the next prediction $\hat{y}_t$. The structure captures both temporal dependencies and contextual relevance in the input sequence.
Figure 4.
Architecture of the DATE-TM model. It integrates GRUs, attention mechanisms and a diffusion model to capture both temporal dependencies and model uncertainty.
Figure 5.
Random three-day load profiles for two households illustrating the variability in energy consumption patterns. (a) Load profile of a Mexican household. (b) Load profile of a smart household.
Figure 6.
Weather heatmaps for the locations of two households, reflecting the differences that may arise from geographical and environmental factors. (a) Weather heatmaps of the Mexican household. (b) Weather heatmaps of the smart household.
Figure 7.
Correlation heatmaps for the load data from two households. The color indicates the strength of relationships, with brighter ones representing stronger correlations, highlighting the differences arising from the appliance configuration and usage patterns. (a) Correlation heatmap of load data for the Mexican household. (b) Correlation heatmap of load data for the smart household.
Figure 8.
Prediction results for the Mexican household. The considered models include DATE-TM, DATE-TM without the diffusion module, DeepAR and BiLSTM. The historical window spans 240 time steps, and the prediction window consists of 24 time steps. (a) Prediction results of the Mexican household data with 1 min intervals. (b) Prediction results of the Mexican household data with 15 min intervals.
Figure 9.
Prediction results for the smart household. The considered models include DATE-TM, DATE-TM without the diffusion module, DeepAR and BiLSTM. The historical window spans 240 time steps, and the prediction window consists of 24 time steps. (a) Prediction results of the smart household data with 1 min intervals. (b) Prediction results of the smart household data with 15 min intervals.
Table 1.
Hyperparameter configuration for the model.
Hyperparameter | Value |
---|---|
Historical window length | 240 |
Forecasting window length | 24 |
Batch size | 256 |
Learning rate | 0.0001 |
Epochs | 40 |
Hidden dimension | 64 |
Diffusion steps | 5 |
GRU layers | 2 |
Denoising layers | 5 |
Table 2.
Description of variables (factors) in the Mexican household dataset.
Variable | Description |
---|---|
current | Current (A), represents the strength of the household current |
voltage | Voltage (V), records the voltage level in the household power system |
reactive_power | Reactive power (W), describes the energy generated by inductive loads in the household power system |
apparent_power | Apparent power (VA), a combination of active and reactive power |
power_factor | Power factor, indicates the efficiency of energy utilization; a value closer to 1 indicates better utilization |
temp | Temperature (°C), real-time temperature at the household location |
feels_like | Feels-like temperature (°C), the perceived temperature combining temperature and humidity |
temp_min | Minimum temperature of the day (°C) |
temp_max | Maximum temperature of the day (°C) |
pressure | Atmospheric pressure (hPa), describes the air pressure changes in the environment |
humidity | Humidity (%), represents the moisture content in the air |
speed | Wind speed (m/s), records the speed of the wind |
deg | Wind direction (°), indicates the direction of the wind |
temp_t+1 | Forecasted temperature for the next day (°C) |
feels_like_t+1 | Forecasted feels-like temperature for the next day (°C) |
load | Total power consumed by household appliances (W) |
Table 3.
Description of variables (factors) in the smart household dataset.
Variable | Description |
---|---|
Dishwasher | Power consumption of the dishwasher (kW) |
Home_office | Power consumption of the home office (kW) |
Fridge | Power consumption of the fridge (kW) |
Wine_cellar | Power consumption of the wine cellar (kW) |
Garage_door | Power consumption of the garage door (kW) |
Barn | Power consumption of the barn (kW) |
Well | Power consumption of the well (kW) |
Microwave | Power consumption of the microwave (kW) |
Living_room | Power consumption in the living room (kW) |
Furnace | Power consumption of the furnace (kW) |
Kitchen | Power consumption in the kitchen (kW) |
Solar | Power generated by the solar system (kW) |
temperature | Temperature (°C) at the household location |
humidity | Humidity (%), represents the moisture content in the air |
visibility | Visibility (km), describes how far one can see under current weather conditions |
apparentTemperature | Apparent temperature (°C), the perceived temperature, considering humidity and wind |
pressure | Atmospheric pressure (hPa), describing air pressure in the environment |
windSpeed | Wind speed (m/s), records the speed of the wind |
cloudCover | Cloud cover (%), describes the fraction of the sky covered by clouds |
windBearing | Wind direction (°), indicates the direction from which the wind is coming |
precipIntensity | Precipitation intensity (in/h), indicates the rate of precipitation |
dewPoint | Dew point temperature (°C), the temperature at which air becomes saturated with moisture |
precipProbability | Probability of precipitation (%), indicates the likelihood of precipitation occurring |
load | Total power consumption (kW) from all appliances combined |
Table 4.
Summary statistics of variables in the Mexican household dataset.
Variable | Min | Max |
---|---|---|
current | 0.30 | 24.41 |
voltage | 107.60 | 135.50 |
reactive_power | 4.73 | 1293.58 |
apparent_power | 37.14 | 2931.64 |
power_factor | 0.20 | 1.00 |
temp | −5.56 | 39.37 |
feels_like | −6.13 | 36.70 |
temp_min | −5.56 | 37.59 |
temp_max | −5.56 | 39.44 |
pressure | 996.00 | 1035.00 |
humidity | 1.00 | 100.00 |
speed | 0.00 | 10.29 |
deg | 0.00 | 360.00 |
temp_t+1 | −5.56 | 39.37 |
feels_like_t+1 | −6.13 | 36.70 |
load | 24.40 | 2900.00 |
Table 5.
Summary statistics of variables in the smart household dataset.
Variable | Min | Max |
---|---|---|
Dishwasher | 0.00 | 1.40 |
Home_office | 0.00 | 0.97 |
Fridge | 0.00 | 0.85 |
Wine_cellar | 0.00 | 1.27 |
Garage_door | 0.00 | 1.09 |
Barn | 0.00 | 7.03 |
Well | 0.00 | 1.63 |
Microwave | 0.00 | 1.93 |
Living_room | 0.00 | 0.47 |
Furnace | 0.00 | 2.47 |
Kitchen | 0.00 | 2.27 |
Solar | 0.00 | 0.61 |
temperature | −24.80 | 34.30 |
humidity | 13.00 | 98.00 |
visibility | 0.27 | 10.00 |
apparentTemperature | −35.60 | 38.40 |
pressure | 986.40 | 1042.46 |
windSpeed | 0.00 | 22.91 |
cloudCover | 0.00 | 100.00 |
windBearing | 0.00 | 359.00 |
precipIntensity | 0.00 | 0.19 |
dewPoint | −32.90 | 23.00 |
precipProbability | 0.00 | 84.00 |
load | 0.00 | 14.71 |
Table 6.
Quantitative comparison of load prediction models at the 1 min granularity.
Model | MAPE (M) | MAPE (S) | SMAPE (M) | SMAPE (S) | MAE (M) | MAE (S) | RMSE (M) | RMSE (S) | RRSE (M) | RRSE (S) | R² (M) | R² (S) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ours | 0.121 | 0.357 | 0.107 | 0.227 | 28.754 | 0.201 | 82.488 | 0.400 | 0.601 | 0.566 | 0.878 | 0.863 |
Ours w/o Diff | 0.148 | 0.381 | 0.130 | 0.243 | 28.031 | 0.229 | 82.972 | 0.478 | 0.606 | 0.676 | 0.851 | 0.839 |
DeepAR | 0.160 | 0.483 | 0.142 | 0.278 | 37.341 | 0.258 | 83.953 | 0.462 | 0.828 | 0.654 | 0.756 | 0.737 |
BiLSTM | 0.340 | 1.056 | 0.270 | 0.465 | 66.207 | 0.432 | 113.230 | 0.632 | 0.613 | 0.896 | 0.704 | 0.683 |
Table 7.
Quantitative comparison of load prediction models at the 15 min granularity.
Model | MAPE (M) | MAPE (S) | SMAPE (M) | SMAPE (S) | MAE (M) | MAE (S) | RMSE (M) | RMSE (S) | RRSE (M) | RRSE (S) | R² (M) | R² (S) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ours | 0.411 | 0.407 | 0.337 | 0.288 | 1031.504 | 3.324 | 1409.729 | 5.475 | 0.746 | 0.678 | 0.771 | 0.758 |
Ours w/o Diff | 0.434 | 0.432 | 0.354 | 0.304 | 1051.771 | 3.546 | 1443.636 | 5.918 | 0.787 | 0.721 | 0.756 | 0.742 |
DeepAR | 0.454 | 0.575 | 0.366 | 0.351 | 1080.459 | 3.951 | 1459.828 | 5.928 | 0.796 | 0.722 | 0.710 | 0.698 |
BiLSTM | 0.750 | 0.575 | 0.458 | 0.452 | 1396.170 | 5.320 | 1749.662 | 7.525 | 0.613 | 0.919 | 0.683 | 0.671 |