1. Introduction
The importance of electricity load forecasting is increasingly recognized in modern power systems, especially in smart grids and household energy management systems [1]. On one hand, with the proliferation of smart meters and the advancement of the Internet of Things (IoT) [2], more households have introduced smart devices that enable the real-time monitoring and transmission of electricity consumption data, making the collection and analysis of load data more accessible and diversified. On the other hand, the global shift towards the energy transition and a low-carbon economy has promoted sustainable development. Further, the growing electrification of households, coupled with the widespread use of electrical and renewable energy devices, has made household energy demand more complex and volatile.
As a result, electricity load forecasting has become imperative for household energy usage optimization, cost reduction and energy efficiency improvement. Accurate forecasting not only ensures the stable operation of power systems, but also enhances energy utilization and reduces consumption as well as carbon emissions. Ultimately, it can promote sustainable development at the societal level [3].
In general, load forecasting can be classified into the following four types according to the prediction horizon:
Very-short-term load forecasting (VSTLF): The prediction ranges from a few minutes to one hour, and is principally used for real-time dispatch and rapid response [4,5,6,7,8,9].
Short-term load forecasting (STLF): The prediction spans from one hour to one week, serving the daily operation and short-term planning of power systems [10,11,12,13,14,15,16].
Medium-term load forecasting (MTLF): It ranges from one week to one year, supporting seasonal load adjustments and medium-term planning [17,18].
Long-term load forecasting (LTLF): It extends beyond one year, primarily guiding strategic planning and long-term decision-making [19].
Among these types, VSTLF plays a key role in instantaneous dispatch and in responding to sudden load changes. It predicts the power demand over the next few minutes to several hours, assisting electric power companies in making timely adjustments in a dynamic environment and maintaining grid stability by preventing the power imbalance caused by load fluctuations. With the advancement of smart meters and IoT techniques, the acquisition of real-time household electricity data has become practical, thereby providing abundant data for VSTLF. Furthermore, in household scenarios, VSTLF facilitates the scheduling of resident electricity usage away from peak periods, thereby reducing electricity costs. Therefore, it is essential for alleviating power pressure, optimizing resource allocation, and promoting energy conservation and emission reduction.
Existing methods for VSTLF can be divided into traditional statistical methods and machine learning-based ones. Traditional methods, such as the auto-regressive integrated moving average (ARIMA) model [7,8] and exponential smoothing, rely on historical time series data. With the development of machine learning, more advanced algorithms have emerged, e.g., support vector regression (SVR) [9], convolutional neural networks (CNNs) [17], graph neural networks (GNNs) [11] and long short-term memory (LSTM) networks [12,13,14,15,16,17,18]. Compared to statistical methods, learning algorithms can better capture nonlinear features and intricate temporal dependencies, thereby improving forecasting accuracy.
Despite recent advances, VSTLF in the household scenario still faces many challenges. For instance, characteristics of the data (e.g., high frequency, noise and non-stationarity) complicate the data processing and model training procedures, and the heterogeneity in household consumption patterns poses challenges for building models with generalization capability. Further, the real-time data processing requirement calls for both high forecasting accuracy and improved computational efficiency.
In response to the above challenges, we propose a diffusion-and-attention-enhanced temporal model (DATE-TM) with multi-feature fusion, tailored for VSTLF in households. The main contributions of this work can be summarized as follows:
We propose an attention-enhanced decoder module capable of dynamically selecting key encoder outputs during decoding, thus improving the model's attention on significant features. It alleviates the limitation of traditional gated recurrent unit (GRU) models in capturing long-term dependencies, thereby enhancing the forecasting accuracy.
We incorporate a diffusion model for uncertainty modeling into the proposed DATE-TM. By introducing noise to the encoder's final states and performing reverse sampling, the model's ability to handle randomness and uncertainty in the data is improved.
The proposed model supports various features, such as time, weather, temperature, humidity, individual device loads and user behavior, and adapts to variable input lengths. This allows for generalization across diverse household scenarios by dynamically adjusting the input length and feature weights.
The rest of this paper is organized as follows.
Section 2 summarizes present load forecasting approaches, concentrating on machine learning- and deep learning-based ones.
Section 3 depicts the proposed forecasting model, including the functionalities of GRUs, attention mechanisms and diffusion models.
Section 4 presents the experiment setup and used datasets, followed by the performance evaluation and analysis.
Section 5 concludes the work and suggests possible future directions.
2. Related Work
By accurately predicting a household's electricity consumption, energy efficiency can be improved, carbon emissions can be reduced, and grid stability can be ensured, thus preventing potential power failures. Furthermore, load forecasting supports the smooth operation of smart household systems, offers personalized energy management solutions, and reduces electricity bills for residents. It also helps to balance the intermittency of renewable energy, optimize the use of energy storage systems, and increase the utilization rate of renewables.
Hsiao et al. [4] proposed a household load modeling method based on context information and daily schedule analysis. By analyzing daily electricity consumption time series, behavioral patterns are identified and a prediction rule is built. However, the reliance on existing behavior patterns makes the method inadequate for new patterns, possibly incurring forecasting errors once behavior changes. He et al. [10] addressed this issue by proposing a model-agnostic meta-learning method for household STLF, which allows multiple households to cooperatively train a generalized network model. Although neural networks are effective in handling heterogeneous data, challenges remain when there are significant differences in electricity usage behavior across households. Wu et al. [11] further improved model generalization by developing a transfer learning framework that dynamically assigns weights to different GNN models and then integrates the source with target domain data; however, the framework still encounters the alignment issue between source and target domain data. Mansoor et al. [12] introduced the past vector similarity (PVS) technique to forecast the next hour at the individual household level, simplifying the input by using only load information, without relying on additional attributes such as calendar or weather data. Nevertheless, this type of simplification might reduce robustness in handling sudden behavioral changes or anomalies.
To improve forecasting robustness and accuracy, Masood et al. [13] proposed a quantile LSTM model with clustering-based probabilistic forecasting, which reduces data heterogeneity through clustering and enhances the ability to handle outliers through a probabilistic approach. Yet, despite its effectiveness for large-scale forecasting, the method cannot properly process small-sample household data. Tiwari et al. [14] introduced a dual-stage attention-based LSTM (DSA-LSTM) model to enhance accuracy by capturing key temporal features; building upon their study, Li et al. [15] designed an interpretable memristive LSTM (IM-LSTM) network that embeds a hybrid attention mechanism into LSTM units to represent the importance of both variables and time steps. Although DSA-LSTM and IM-LSTM improve interpretability and feature-capturing ability, they require intensive computational resources, limiting their practical implementation.
All of the aforementioned works demonstrated advances in feature extraction, model generalization and heterogeneous data handling through techniques such as similarity analysis, contextual information, neural networks, transfer learning, clustering and attention mechanisms. Some challenges, nevertheless, still remain. For instance, behavior pattern-dependent models are ill-suited to sudden behavioral changes; generalization across households with different consumption patterns is limited; uncertainty and randomness in the load data remain insufficiently addressed (even with attention mechanisms); and fixed-length inputs limit the generalizability of models across different scenarios.
3. Methods
3.1. Problem Definition
VSTLF can be formulated as a multivariate time series forecasting problem. As shown in Figure 1, the goal is to forecast the total household electricity consumption over a future period based on historical data, including the household load data, the meteorological data (e.g., wind speed, temperature and humidity) for the region where the residence is located, and additional feature data collected from the household context. In particular, the problem is formulated as

$$\hat{y}_{t+1:t+H} = f\left(y_{t-L+1:t},\ m_{t-L+1:t},\ a_{t-L+1:t}\right),$$

where $H$ is the forecasting window length; $L$ is the look-back window length; $\hat{y}$ is the predicted load; and $y$, $m$ and $a$ denote the historical load, meteorological information and additional features, respectively. The function $f(\cdot)$ represents the nonlinear mapping learned through the neural network to forecast future loads based on past observations.
3.2. GRU Model
As illustrated in Figure 2, the GRU is adopted to process the time series information [20]; it overcomes the vanishing gradient issue of RNNs by retaining useful information from the past while discarding irrelevant data. Two key gates are involved: the reset gate, which decides how much past information should be combined with the new input, and the update gate, which regulates the amount of past information to be retained at the current time step. In particular, the GRU update is as follows:

$$r_t = \sigma\left(W_r \left[h_{t-1}, x_t\right]\right), \quad z_t = \sigma\left(W_z \left[h_{t-1}, x_t\right]\right),$$

$$\tilde{h}_t = \tanh\left(W_h \left[r_t \otimes h_{t-1}, x_t\right]\right),$$

and

$$h_t = \left(1 - z_t\right) \otimes h_{t-1} + z_t \otimes \tilde{h}_t,$$

where $r_t$ and $z_t$ are the outputs of the reset gate and update gate, respectively, $W_r$, $W_z$ and $W_h$ denote the weight matrices, $\sigma$ denotes the sigmoid activation, and $\otimes$ expresses element-wise multiplication.
In time series forecasting and sequence modeling, the GRU, LSTM and BiLSTM are common RNN variants, each of which alleviates the gradient vanishing and explosion issue in different ways. The GRU offers a key advantage over LSTM and BiLSTM due to its simpler structure, high computational efficiency and suitability for specific tasks [21,22]. In particular, the GRU uses only two gates (the reset and update gates), compared to LSTM's three gates and cell state, leading to fewer parameters, reduced computational cost and faster training. This makes the GRU especially advantageous for smaller datasets or tasks with limited resources, achieving performance similar to LSTM with less training time. Furthermore, the GRU is more stable during training and excels at tasks with short-term dependencies due to its flexible update gate. While LSTM is better for long-term dependencies, the GRU typically outperforms it in tasks dominated by short-term dependencies, and is thus better suited to resource-constrained or time-sensitive applications.
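For concreteness, the GRU update above can be expressed in a few lines of PyTorch. This is a minimal illustrative cell (in practice, torch.nn.GRU provides an optimized implementation); the class and variable names are ours, not from the paper's code.

```python
import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    """Illustrative GRU cell implementing the reset/update-gate equations above."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_r = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate
        self.W_z = nn.Linear(input_size + hidden_size, hidden_size)  # update gate
        self.W_h = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        xh = torch.cat([x_t, h_prev], dim=-1)
        r_t = torch.sigmoid(self.W_r(xh))             # how much past to blend in
        z_t = torch.sigmoid(self.W_z(xh))             # how much past to retain
        h_cand = torch.tanh(self.W_h(torch.cat([x_t, r_t * h_prev], dim=-1)))
        return (1.0 - z_t) * h_prev + z_t * h_cand    # element-wise interpolation
```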
3.3. Attention-Enhanced GRU
To augment the network's capability in capturing global dependencies within the input sequence, we integrate the scaled dot-product attention [23] into the GRU architecture. As shown in Figure 3, the attention mechanism calculates a context vector for each time step, which is then fed into the GRU. Given the output sequence $\{h_1, \ldots, h_T\}$ from the GRU encoder, in which each $h_t$ represents the hidden state at time $t$, the attention mechanism is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$ and $V$ denote the query, key and value matrices, respectively, and $d_k$ is the dimension of the key vector. The attention weights are computed by scaling the dot-product between queries and keys, followed by a Softmax operation to obtain normalized weights. These weights are then used to aggregate the values, yielding the context vector. At the current time step $t$, let $y_{t-1}$ be the input label, $\hat{y}_{t-1}$ be the predicted value and $s_{t-1}$ be the previous hidden state, respectively. The GRU update with attention can be expressed as

$$s_t = \mathrm{GRU}\left(\left[\hat{y}_{t-1}; c_t\right], s_{t-1}\right),$$

where $c_t$ is the context vector from the attention mechanism. Finally, the hidden state $s_t$ is passed through a feedforward network to generate the final prediction, as

$$\hat{y}_t = \mathrm{FFN}\left(s_t\right),$$

which is repeated iteratively to generate the entire forecast sequence. The fully connected feedforward neural network (FFN) follows the GRU and plays a crucial role in generating the final predictions. The GRU first captures the temporal features from the input sequence, while the context vector obtained through the attention mechanism refines the GRU's hidden state at each time step. The refined hidden state is then processed by the FFN, which progressively reduces the data dimension through multiple layers, ultimately outputting the final prediction for each time step. The FFN is essential for mapping the complex temporal information encoded by the GRU to the target prediction values, allowing the model to capture intricate nonlinear relationships that the GRU alone would struggle to model.
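To illustrate one decoding step, the snippet below sketches the scaled dot-product attention and the GRU update in PyTorch, with assumed shapes (a batch of 4, 240 encoder steps and hidden dimension 64, matching Table 1). It is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Softmax(QK^T / sqrt(d_k)) V, as in the equation above."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (batch, 1, T)
    weights = F.softmax(scores, dim=-1)               # normalized attention weights
    return weights @ v                                # context vector (batch, 1, d)

# One illustrative decoding step: query with the previous decoder state,
# attend over the encoder outputs, then feed [y_prev; context] to a GRU cell.
batch, T, d = 4, 240, 64
enc_out = torch.randn(batch, T, d)                    # encoder hidden states h_1..h_T
s_prev = torch.randn(batch, d)                        # previous decoder hidden state
y_prev = torch.randn(batch, 1)                        # previous prediction (or label)

context = scaled_dot_product_attention(s_prev.unsqueeze(1), enc_out, enc_out)
gru_cell = torch.nn.GRUCell(input_size=1 + d, hidden_size=d)
s_t = gru_cell(torch.cat([y_prev, context.squeeze(1)], dim=-1), s_prev)
y_t = torch.nn.Linear(d, 1)(s_t)                      # FFN head producing the prediction
```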
3.4. Diffusion Model
Diffusion models are widely used in generative tasks, especially in scenarios that require modeling complex data distributions through iterative sampling from prior knowledge or assumptions. Note that directly modeling the target distribution can pose challenges, especially when capturing the diversity and complexity of the data [24,25]. Therefore, a method is preferred that transforms the target distribution into a standard Gaussian distribution, and then reverses the process to retrieve the target distribution from the Gaussian. Similar to other generative models, such as variational autoencoders (VAEs) and normalizing flows, the principle of diffusion models is to start with a known simple distribution (e.g., a Gaussian) and gradually transform it into the target distribution. Yet, unlike these methods, the transformation process in diffusion models is implemented in multiple steps using a Markov process. Noise is gradually added to the samples at each step, allowing the final generated samples to better capture the characteristics of the target distribution. As such, complex distribution characteristics are learned, thereby improving the quality and diversity of the generated samples. In particular, the diffusion model consists of two procedures, as follows:
The forward process, which gradually adds noise to the data until it becomes pure noise.
The reverse process, which gradually removes the added noise to recover the original data.
In this work, the diffusion model is applied to hidden states of the encoder instead of the raw input data. It helps to focus more on the uncertainty and randomness of hidden states, which is critical for modeling the high-frequency and non-stationary household load data.
3.4.1. Forward Process
Let $h_0$ be the hidden state output from the encoder at the initial step. To gradually transform $h_0$ into white noise, the forward process introduces Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ at each step $n$ as

$$h_n = \sqrt{1 - \beta_n}\, h_{n-1} + \sqrt{\beta_n}\, \epsilon,$$

where $\beta_n$ controls the weight between $h_{n-1}$ and the added noise. As the step progresses, the hidden state becomes increasingly dominated by noise, and the stability and additivity properties of Gaussian distributions allow us to express the noisy hidden state at step $n$ as

$$h_n = \sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \quad \alpha_n = 1 - \beta_n, \quad \bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i.$$

In particular, the forward process ensures that, after sufficient steps, the hidden state well approximates a standard normal distribution as

$$q(h_T) \approx \mathcal{N}(0, \mathbf{I}),$$

where $T$ is the total number of steps. This transformation provides a simple and tractable prior distribution from which the reverse process can begin reconstruction.
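A minimal PyTorch sketch of this closed-form forward sampling follows, assuming a linear beta schedule and the 5 diffusion steps listed in Table 1; all names and the schedule are illustrative.

```python
import torch

def forward_diffuse(h0: torch.Tensor, n: int, betas: torch.Tensor):
    """Sample h_n directly from h_0 via the closed form above.
    betas is the noise schedule (beta_1..beta_T); alphas_bar is the
    cumulative product of (1 - beta)."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    eps = torch.randn_like(h0)                        # epsilon ~ N(0, I)
    a_bar = alphas_bar[n - 1]                         # \bar{alpha}_n (1-indexed step)
    return a_bar.sqrt() * h0 + (1.0 - a_bar).sqrt() * eps, eps

# Example: 5 diffusion steps (as in Table 1) with an assumed linear schedule.
betas = torch.linspace(1e-4, 0.2, steps=5)
h0 = torch.randn(4, 64)                               # a batch of encoder hidden states
h_n, eps = forward_diffuse(h0, n=3, betas=betas)
```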
3.4.2. Reverse Process
Next, the reverse process is used to recover the original hidden state $h_0$ by gradually removing the noise, modeling the reverse transition as

$$p_\theta\left(h_{n-1} \mid h_n\right) = \mathcal{N}\left(h_{n-1};\ \mu_\theta(h_n, n),\ \Sigma_\theta(h_n, n)\right),$$

where $\mu_\theta$ and $\Sigma_\theta$ are the predicted mean and variance, respectively, which can be approximated by a neural network. Since the true posterior distribution is unknown, the model minimizes the mean squared error (MSE) loss as follows:

$$\mathcal{L} = \mathbb{E}_{n, h_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\left(h_n, n\right) \right\|^2\right],$$

where $\epsilon_\theta(h_n, n)$ is the predicted noise at step $n$. By minimizing this loss, the model ensures that the predicted noise aligns with the true noise, facilitating the accurate reconstruction of the original hidden states.
3.4.3. Reparameterization and Optimization
To facilitate model training, a reparameterization trick is used, and the noisy state at step $n$ can be rewritten as

$$h_n = \sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$

which separates the deterministic part of the hidden state from the stochastic noise, allowing for more stable training. The loss function then becomes

$$\mathcal{L}(\theta) = \mathbb{E}_{n, h_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_n}\, h_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon,\ n\right) \right\|^2\right],$$

where $\epsilon_\theta(\cdot)$ is the predicted noise term. Through reparameterization, the gradient is obtained as

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{n, h_0, \epsilon}\left[\nabla_\theta \left\| \epsilon - \epsilon_\theta\left(h_n, n\right) \right\|^2\right],$$

which ensures that the model can more effectively utilize the information in samples during each update, reducing the uncertainty introduced by noise and thereby improving stability.
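Putting the pieces together, a training objective based on the reparameterized loss might be sketched as follows; the denoiser architecture and its step embedding are assumptions for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Small denoiser eps_theta(h_n, n); the architecture is an assumption."""
    def __init__(self, dim: int, n_steps: int):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, dim)    # embeds the step index n
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h_n, n_idx):
        return self.net(h_n + self.step_emb(n_idx))

def diffusion_loss(model, h0, betas):
    """MSE between true and predicted noise, with h_n built by reparameterization."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    n_idx = torch.randint(0, len(betas), (h0.size(0),))   # random step per sample
    a_bar = alphas_bar[n_idx].unsqueeze(-1)
    eps = torch.randn_like(h0)
    h_n = a_bar.sqrt() * h0 + (1.0 - a_bar).sqrt() * eps  # deterministic + noise parts
    return nn.functional.mse_loss(model(h_n, n_idx), eps)
```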
In particular, the reparameterization trick offers several benefits during practical training. First, it transforms the randomness in the training process into a controllable procedure, allowing the model to sample effectively in high dimensions during learning. Second, reparameterization enables the model to generate samples flexibly, enhancing its robustness when facing high-dimensional data. Additionally, with the reparameterization trick, the model's inference process becomes more efficient: sample generation no longer relies on complex latent variable structures but is achieved through simple deterministic functions and standard normal noise. This simplification not only accelerates computation but also improves the model's adaptability in real-time prediction. Finally, the reparameterization trick ensures that the error between predicted and true noise is minimized. This optimization goal improves training, ensuring robust predictive performance in high-noise environments and, in turn, demonstrating the significant advantages of diffusion models in applications such as household load forecasting.
In the proposed framework, the diffusion model captures the distribution of the encoder's hidden states to better manage the high-noise and non-stationary characteristics of household electricity data. Rather than directly modeling the raw load data, the diffusion model exhibits significant advantages by focusing on the hidden states. First, it effectively smooths high-frequency noise, thereby enhancing model robustness and alleviating the impact of random fluctuations in the data. Furthermore, through a multi-step sampling mechanism, the model enhances its adaptability to various household scenarios, improving its generalization capability. Further, the diffusion model supports the probabilistic modeling of prediction results, allowing for more accurate quantification and reflection of uncertainty. All of the above advantages enable the diffusion model to perform well in processing complex load data, improving both prediction accuracy and generalization capability.
3.4.4. Implementation in DATE-TM
In the proposed DATE-TM model, the diffusion model mainly operates on the encoder's hidden states, specifically addressing the volatility and randomness in real-world data, and striving to maintain predictive accuracy when facing complex and dynamically changing load patterns. By leveraging the reverse process, the model effectively manages load variations in high-noise environments.
Note that the proposed DATE-TM model not only enhances generalization across households with varying consumption behaviors, but also improves stability and adaptability. Particularly, when dealing with households exhibiting different consumption patterns, the DATE-TM model can flexibly adjust its parameters to adapt to various data characteristics. Thus, by incorporating the diffusion-enhanced hidden state modeling, the DATE-TM model achieves high forecasting accuracy while ensuring computational efficiency. Furthermore, this allows the model to run quickly on larger datasets without sacrificing accuracy, thereby adapting even better to practical applications.
3.5. DATE-TM Architecture
As depicted in Figure 4, the DATE-TM model integrates GRUs, attention mechanisms and the diffusion model. First, the encoder uses multi-layer GRUs to extract temporal dependencies, addressing the vanishing gradient problem. Then, the decoder combines GRUs with attention to dynamically select relevant encoder outputs, thereby capturing long-term dependencies. Finally, the diffusion model serves as a bridge connecting the encoder and decoder, enhancing the network's robustness by modeling uncertainty and randomness in the data. Furthermore, DATE-TM supports multi-feature input (including time, weather, temperature, humidity, individual device loads and user behavior) and adapts to variable input lengths. This flexibility enables the model to dynamically adjust input sequences and feature weights in different household scenarios. In particular, the teacher forcing technique is used during training to accelerate convergence by taking the true target outputs as inputs for each step in the decoder, as sketched below. The specific hyperparameter settings for the model are summarized in Table 1.
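The sketch below unrolls the attention-GRU decoder with teacher forcing. It reuses the scaled_dot_product_attention helper from the Section 3.3 sketch, and decoder_cell/head stand in for the GRU cell and FFN output layer; all names are illustrative, not from the released code.

```python
import torch
import torch.nn as nn

def decode_with_teacher_forcing(decoder_cell, head, enc_out, s0, y_last, y_true):
    """Unroll the decoder for H steps; during training, the ground-truth load of
    the previous step is fed back as input (teacher forcing)."""
    s_t, preds = s0, []
    y_prev = y_last                                   # last observed load value
    for t in range(y_true.size(1)):
        context = scaled_dot_product_attention(s_t.unsqueeze(1), enc_out, enc_out)
        s_t = decoder_cell(torch.cat([y_prev, context.squeeze(1)], dim=-1), s_t)
        preds.append(head(s_t))                       # FFN head: hidden state -> load
        y_prev = y_true[:, t].unsqueeze(-1)           # teacher forcing step
    return torch.cat(preds, dim=1)                    # (batch, H)

# Hypothetical usage with the dimensions from Table 1:
# decoder_cell = nn.GRUCell(1 + 64, 64); head = nn.Linear(64, 1)
# y_hat = decode_with_teacher_forcing(decoder_cell, head, enc_out, s0, y_last, y_true)
```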
Compared to traditional models such as BiLSTM and DeepAR, DATE-TM specializes in handling data uncertainty and long-term dependencies while offering high prediction accuracy. This adaptability allows it to process diverse input features and fit various household settings, ensuring robust and reliable load forecasting.
3.6. Data Preprocessing
The original dataset may contain missing values that need to be filled in to ensure proper model training and forecasting. The presence of missing values not only affects the learning process, but may also lead to inaccurate prediction results. Here, the K-nearest neighbor (KNN) algorithm, which uses the similarity between samples to infer missing values, is adopted for imputation. Given two samples $X$ and $Y$, the Euclidean distance is

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} \left(x_i - y_i\right)^2},$$

where $x_i$ and $y_i$ are the values of the $i$-th feature, and $n$ expresses the overall number of features. The weight for each neighbor is determined by the inverse of its distance, i.e., closer neighbors receive heavier weights, as follows:

$$w_j = \frac{1 / d_j}{\sum_{k=1}^{K} 1 / d_k}.$$

Finally, the imputed value for the missing feature is the weighted average over the $K$ neighbors, i.e.,

$$\hat{v} = \sum_{j=1}^{K} w_j v_j,$$

where $v_j$ is the corresponding feature value of the $j$-th neighbor.
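In practice, this imputation scheme is available off the shelf; for example, scikit-learn's KNNImputer with weights="distance" applies this inverse-distance weighting (using a NaN-aware Euclidean distance). A toy sketch with fabricated values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries; the three columns (say, voltage, current,
# humidity) and all values are fabricated purely for illustration.
X = np.array([[230.0, 1.2, 55.0],
              [231.0, np.nan, 54.0],
              [228.0, 1.1, np.nan],
              [229.5, 1.3, 56.0]])

# weights="distance" gives closer neighbours heavier weights, matching the
# weighting scheme above.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
```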
When dealing with multi-dimensional data, significant differences may exist in the magnitudes of features. For instance, temperature is measured in degrees Celsius while electricity consumption is measured in kilowatt-hours. If these features are not normalized, larger-magnitude feature values would dominate, resulting in the underutilization of other features. Furthermore, inconsistent magnitudes of input features can cause numerical instability, which in turn affects the convergence speed. To prevent numerical instability and ensure balanced training, min–max normalization is applied as

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

in which $x_{\max}$ and $x_{\min}$ denote the maximum and minimum values of each feature vector, respectively. The normalization ensures that all features are restricted to the $[0, 1]$ range, thus facilitating the comparison across different features.
We next use the sliding window technique to extract input sequences from normalized data. In particular, the sliding window allows us to generate multiple input sequences from successive data based on a fixed time step. These input sequences not only retain the temporal relevance of time series data but also capture potential trends and seasonal variations, providing rich information for the subsequent model training. Furthermore, the data are split into training, validation and test sets to assess the model’s generalization ability and mitigate overfitting. In particular, 80% of the dataset is used for training, while the remaining 20% serves as the test set, thereby providing an independent evaluation of the model’s performance on unseen data. More specifically, the training set is further divided, with 80% for training and 20% for validation. The validation set enables the real-time monitoring of the model’s performance during the training, facilitating hyperparameter adjustment to prevent the overfitting.
Note that the chronological division of time series ensures the consistency between training and testing environments, making the model adapt to real-world scenarios. Furthermore, it also averts possible biases arising from random partitioning, thereby assuring reliability. Thus, the combination of both sliding window and chronological splitting can fully leverage the data’s temporal characteristics.
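To make the preprocessing pipeline concrete, the following is a minimal NumPy sketch of the normalization, sliding window extraction and chronological split described above; the array shapes and the assumption that the load is in column 0 are illustrative.

```python
import numpy as np

def make_windows(series: np.ndarray, lookback: int = 240, horizon: int = 24):
    """Min-max normalize each feature, then slide a fixed window over the series.
    In practice the min/max should be computed on the training portion only,
    to avoid leaking test-set statistics."""
    mn, mx = series.min(axis=0), series.max(axis=0)
    norm = (series - mn) / (mx - mn + 1e-9)               # all features into [0, 1]
    X, y = [], []
    for i in range(len(norm) - lookback - horizon + 1):
        X.append(norm[i:i + lookback])                    # historical window
        y.append(norm[i + lookback:i + lookback + horizon, 0])  # future load (col 0)
    return np.array(X), np.array(y)

data = np.random.rand(5000, 16)                           # placeholder multivariate series
X, y = make_windows(data)
split = int(0.8 * len(X))                                 # chronological 80/20 split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```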
3.7. Assessment Metrics
The indicators below were adopted to comprehensively assess the model's capability, where $y_i$, $\hat{y}_i$, $\bar{y}$ and $N$ denote the true value, the predicted value, the mean of the true values and the number of samples, respectively:

Symmetric mean absolute percentage error (SMAPE): $\mathrm{SMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{2\left|y_i - \hat{y}_i\right|}{\left|y_i\right| + \left|\hat{y}_i\right|}$.

Mean absolute percentage error (MAPE): $\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$.

Mean absolute error (MAE): $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|y_i - \hat{y}_i\right|$.

Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}$.

Relative root squared error (RRSE): $\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}}$.

Coefficient of determination ($R^2$): $R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}$.

Note that SMAPE and MAPE are widely used for measuring prediction errors, yet SMAPE offers a more stable evaluation by addressing the unbounded error issue of MAPE when true values approach zero. RMSE provides an overall measure of prediction accuracy and is especially sensitive to large errors, while RRSE normalizes RMSE by comparing the model's performance with a baseline model. Furthermore, $R^2$ quantifies the proportion of variance explained by the model, offering insights into its goodness of fit.
By incorporating these metrics, we aim to provide a comprehensive and balanced evaluation of the model’s accuracy, robustness and generalization capability, providing a more nuanced understanding of its performance across different data scenarios.
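For reference, these metrics can be computed directly in NumPy; the sketch below uses the fractional (not percentage) form and assumes nonzero true values wherever MAPE is evaluated.

```python
import numpy as np

def evaluate(y: np.ndarray, y_hat: np.ndarray) -> dict:
    """Evaluation metrics as defined above."""
    err = y - y_hat
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": np.mean(np.abs(err / y)),                     # unbounded near y = 0
        "SMAPE": np.mean(2 * np.abs(err) / (np.abs(y) + np.abs(y_hat))),
        "RRSE": np.sqrt(ss_res / ss_tot),
        "R2": 1.0 - ss_res / ss_tot,
    }
```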
4. Experiments and Results
4.1. Dataset Description
In this part, we present the two real-world household load datasets used for the very-short-term load forecasting experiments: a Mexican household dataset [26] and a smart household dataset.
The two household datasets not only contain energy consumption data, but also records of the weather data for their respective areas. The detailed variables (factors) are listed in Table 2 and Table 3. Additionally, summary statistics, including the minimum and maximum values of each variable, are provided in Table 4 and Table 5, offering a clearer characterization of the datasets.
Residential power usage is characterized by its irregularity compared to commercial usage. To illustrate the volatility and uncertainty of household load, three days of data are sampled from the two households for comparison, as shown in Figure 5. The comparison highlights the variability between households due to external factors (e.g., the economic environment and electrical infrastructure), as well as internal factors (e.g., members' consumption habits). Furthermore, intra-household consumption patterns differ significantly across days, making load forecasting (especially very-short-term forecasting) extremely challenging. Thus, capturing the various impacting factors is essential for developing a model that adapts to diverse household scenarios.
Figure 6 and Figure 7 present the heatmaps for the two households, highlighting notable differences in their correlation patterns. The differences can be attributed to various factors, including the appliance configuration, the specific living environment, and the behavior of household members. For instance, one household may exhibit a strong correlation between high load and specific appliance usage, indicating that certain devices are heavily relied upon during particular periods of the day. In contrast, the other household may display more varied appliance usage, bringing about a more balanced load distribution throughout the day.
4.2. Performance Comparison
The proposed model is compared against several benchmarks; the considered models include DATE-TM, DATE-TM without (w/o) the diffusion module, DeepAR and BiLSTM. For each dataset, two sets of experiments were performed. In the first set, we used 1 min data intervals, with a historical window of 240 time steps (240 min) and a prediction window of 24 time steps (24 min). In the second set, we aggregated the data into 15 min intervals, where the historical window covers 240 steps (60 h) and the prediction window covers 24 steps (6 h). The intermediate values of 240 consecutive prediction sequences were assembled into complete sequences for the prediction comparison. Figure 8 and Figure 9 illustrate the prediction results across the two datasets.
As shown in Figure 8 and Figure 9, the DATE-TM model achieved superior accuracy, as it captures more of the load fluctuations than the other models. Furthermore, removing the diffusion module lowered the performance, especially in the presence of abnormal fluctuations. While DeepAR provided stable results, it performed worse than DATE-TM. Further, BiLSTM struggled to handle long sequences, incurring higher prediction errors.
It can be observed that the prediction accuracy at the 1 min interval level exceeded that at the 15 min interval level. The finer granularity of the 1 min interval data preserves more detailed temporal features, allowing the model to capture the short-term dependencies and dynamic patterns within the shorter prediction horizon. In contrast, the 15 min interval data, with larger intervals, smooth out fluctuations and reduce feature granularity, thereby increasing the prediction difficulty. Furthermore, aggregating the 1 min interval data into 15 min intervals incurs the loss of critical weather-related and load-specific features, which are essential for accurate forecasting. As such, the aggregated data often fail to capture transient load fluctuations and rapid weather changes, and to identify fine-grained temporal dynamics.
Next, a quantitative comparison is shown in Table 6 and Table 7, where the smallest (best) values are highlighted in bold. "M" and "S" refer to the datasets from Mexico and the smart household, respectively.
The comparison of the different models on the household load data demonstrates that the DATE-TM model holds an evident advantage in handling household data. This is because the proposed architecture (with multi-feature fusion, a diffusion module for uncertainty modeling and an attention mechanism) enables the model to capture abnormal and irregular load variations more accurately, achieving superior performance across the various metrics. Although the DATE-TM model without the diffusion module showed a slight degradation in prediction accuracy, it still outperformed both the DeepAR and BiLSTM models.
5. Conclusions and Future Work
We have proposed a novel multi-feature temporal prediction architecture, DATE-TM, for the VSTLF in households. The model integrates a GRU, attention mechanism and diffusion model to address and alleviate the unique challenges faced by household energy usage forecasting. By incorporating versatile features, including historical load data, meteorological variables and user-specific contextual factors, DATE-TM exhibits superior performance in handling data uncertainty and capturing long-term dependencies.
In particular, DATE-TM involves an attention-enhanced decoder module to dynamically select relevant encoder outputs during the decoding process, not only boosting the model's ability to focus on significant features but also compensating for the traditional GRU's limitation in capturing long-term dependencies, thus improving the forecasting accuracy. More specifically, the integration of the diffusion model enhances the model's robustness by introducing noise to the encoder's final states and performing reverse sampling, allowing DATE-TM to effectively manage the randomness and fluctuations inherent in household load data, and significantly improving its reliability in predicting future load profiles. Furthermore, the model is highly adaptable to versatile features such as time, weather, temperature, humidity, individual device loads and user behavior, with variable input lengths, thereby ensuring generalization across diverse household scenarios. This adaptability renders DATE-TM well suited for diverse forecasting tasks while maintaining high performance across various input configurations. Evaluation results using multiple metrics (including MAPE, SMAPE, MAE, RMSE and $R^2$) demonstrate that DATE-TM consistently outperforms traditional models, such as BiLSTM and DeepAR. Notably, the inclusion of the diffusion module significantly strengthens the model's capacity to handle uncertainties and random fluctuations in the load data. Even without the diffusion module, DATE-TM still outperforms the benchmark models, highlighting the effectiveness of its multi-feature fusion and enhanced attention mechanism.
Despite the promising results, there is still room for further improvement. Future work will focus on optimizing feature selection to enhance model performance, simplifying the architecture to improve computational efficiency, acquiring real-time data to enable continuous learning, and extending the model's applicability to other domains, such as industrial load forecasting and urban energy management. These directions will further enhance the versatility and capability of DATE-TM, providing robust support for smart grid systems and intelligent energy management.
Author Contributions
Conceptualization, Y.Z.; methodology, Y.Z. and J.L.; software, J.L.; validation, Y.Z. and J.L.; formal analysis, Y.Z.; investigation, J.L. and C.C.; resources, Y.Z. and C.C.; data curation, J.L.; writing—original draft preparation, Y.Z.; writing—review and editing, J.L., C.C. and Q.G.; visualization, Y.Z.; supervision, C.C. and Q.G.; project administration, C.C.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Conflicts of Interest
Authors Yitao Zhao and Jiahao Li were employed by the Yunnan Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Zhang, X.; Han, J. Automatic machine learning participation in power load forecasting under the background of big data. In Proceedings of the International Conference on Energy, Power and Electrical Technology (ICEPET), Chengdu, China, 17–19 May 2024; pp. 1311–1315.
- Zhang, J.; Lu, W.; Xing, C.; Zhao, N.; Al-Dhahir, N.; Karagiannidis, G.K.; Yang, X. Intelligent integrated sensing and communication: A survey. Sci. China Inf. Sci. 2025, 68, 1–42.
- Yu, B.; Qin, J.; Deng, Z.; Guo, X.; Si, J. Analysis of power load characteristics and research of load forecasting model in a certain area under the background of new power reform. In Proceedings of the International Conference on Smart Power Internet Energy Systems (SPIES), Shenyang, China, 16–18 June 2023; pp. 100–104.
- Hsiao, Y.-H. Household electricity demand forecast based on context information and user daily schedule analysis from meter data. IEEE Trans. Ind. Inform. 2015, 11, 33–43.
- Gonzalez, R.; Ahmed, S.; Alamaniotis, M. Deep neural network based methodology for very-short-term residential load forecasting. In Proceedings of the International Conference on Information, Intelligence, Systems Applications (IISA), Corfu, Greece, 18–20 July 2022; pp. 1–6.
- Alamaniotis, M.; Ikonomopoulos, A.; Tsoukalas, L.H. Evolutionary multiobjective optimization of kernel-based very-short-term load forecasting. IEEE Trans. Power Syst. 2012, 27, 1477–1484.
- Lu, J.-C.; Zhang, X.; Sun, W. A real-time adaptive forecasting algorithm for electric power load. In Proceedings of the IEEE/PES Transmission Distribution Conference Exposition: Asia and Pacific, Dalian, China, 15–18 August 2005; pp. 1–5.
- de Andrade, L.C.M.; da Silva, I.N. Using intelligent system approach for very short-term load forecasting purposes. In Proceedings of the IEEE International Energy Conference, Manama, Bahrain, 18–22 December 2010; pp. 694–699.
- Setiawan, A.; Koprinska, I.; Agelidis, V.G. Very short-term electricity load demand forecasting using support vector regression. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Atlanta, GA, USA, 14–19 June 2009; pp. 2888–2894.
- He, Y.; Luo, F.; Ranzi, G. Transferrable model-agnostic meta-learning for short-term household load forecasting with limited training data. IEEE Trans. Power Syst. 2022, 37, 3177–3180.
- Wu, D.; Lin, W. Efficient residential electric load forecasting via transfer learning and graph neural networks. IEEE Trans. Smart Grid 2023, 14, 2423–2431.
- Mansoor, H.; Rauf, H.; Mubashar, M.; Khalid, M.; Arshad, N. Past vector similarity for short term electrical load forecasting at the individual household level. IEEE Access 2021, 9, 42771–42785.
- Masood, Z.; Gantassi, R.; Choi, Y. Enhancing short-term electric load forecasting for households using quantile LSTM and clustering-based probabilistic approach. IEEE Access 2024, 12, 77257–77268.
- Tiwari, P.; Mahanta, P.; Trivedi, G. A dual-stage attention based RNN model for short term load forecasting of individual household. In Proceedings of the International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 9–10 December 2021; pp. 1–6.
- Li, C.; Dong, Z.; Ding, L.; Petersen, H.; Qiu, Z.; Chen, G. Interpretable memristive LSTM network design for probabilistic residential load forecasting. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2297–2310.
- Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans. Smart Grid 2019, 10, 841–851.
- Han, L.; Peng, Y.; Li, Y.; Yong, B.; Zhou, Q.; Shu, L. Enhanced deep networks for short-term and medium-term load forecasting. IEEE Access 2019, 7, 4045–4055.
- Li, J.; Wei, S.; Dai, W. Combination of manifold learning and deep learning algorithms for mid-term electrical load forecasting. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2584–2593.
- Xie, J.; Hong, T.; Stroud, J. Long-term retail energy forecasting with consideration of residential customer attrition. IEEE Trans. Smart Grid 2015, 6, 2245–2252.
- Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
- Pirani, M.; Thakkar, P.; Jivrani, P.; Bohara, M.H.; Garg, D. A comparative analysis of ARIMA, GRU, LSTM and BiLSTM on financial time series forecasting. In Proceedings of the 2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballari, India, 23–24 April 2022; pp. 1–6.
- Mateus, B.C.; Mendes, M.; Farinha, J.T.; Assis, R.; Cardoso, A.M. Comparing LSTM and GRU models to predict the condition of a pulp paper press. Energies 2021, 14, 6958.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Wang, Z.; Wen, Q.; Zhang, C.; Sun, L.; Wang, Y. DiffLoad: Uncertainty quantification in electrical load forecasting with the diffusion model. IEEE Trans. Power Syst. 2024; early access.
- Aguirre-Fraire, B.; Beltrán, J.; Soto-Mendoza, V. A comprehensive dataset integrating household energy consumption and weather conditions in a north-eastern Mexican urban city. Data Brief 2024, 54, 110452.
Figure 1.
Forecasting model for household load prediction. The green and blue lines represent the estimated and true data, respectively. The forecasting model uses historical data to produce predictions for future time steps.
Figure 2.
Structure of the GRU, which uses an update gate and a reset gate to manage the influence of the input $x_t$ and previous state $h_{t-1}$ to generate the new state $h_t$.
Figure 3.
Architecture integrating the GRU with dot-product attention. The attention mechanism processes the input queries $Q$, keys $K$ and values $V$, followed by the concatenation with the previous output $\hat{y}_{t-1}$, and the result is fed into the GRU to generate the next prediction $\hat{y}_t$. The structure captures both temporal dependencies and contextual relevance in the input sequence.
Figure 4.
Architecture of the DATE-TM model. It integrates GRUs, attention mechanisms and a diffusion model to capture both temporal dependencies and model uncertainty.
Figure 5.
Random three-day load profiles for two households illustrating the variability in energy consumption patterns. (a) Load profile of a Mexican household. (b) Load profile of a smart household.
Figure 6.
Weather heatmaps for the locations of two households, reflecting the differences that may arise from geographical and environmental factors. (a) Weather heatmaps of the Mexican household. (b) Weather heatmaps of the smart household.
Figure 7.
Correlation heatmaps for the load data from two households. The color indicates the strength of relationships, with brighter ones representing stronger correlations, highlighting the differences arising from the appliance configuration and usage patterns. (a) Correlation heatmap of load data for the Mexican household. (b) Correlation heatmap of load data for the smart household.
Figure 8.
Prediction results for the Mexican household. The considered models include DATE-TM, DATE-TM without the diffusion module, DeepAR and BiLSTM. The historical window spans 240 time steps, and the prediction window consists of 24 time steps. (a) Prediction results of the Mexican household data with 1 min intervals. (b) Prediction results of the Mexican household data with 15 min intervals.
Figure 9.
Prediction results for the smart household. The considered models include DATE-TM, DATE-TM without the diffusion module, DeepAR and BiLSTM. The historical window spans 240 time steps, and the prediction window consists of 24 time steps. (a) Prediction results of the smart household data with 1 min intervals. (b) Prediction results of the smart household data with 15 min intervals.
Table 1.
Hyperparameter configuration for the model.
Hyperparameter | Value |
---|---|
Historical window length | 240 |
Forecasting window length | 24 |
Batch size | 256 |
Learning rate | 0.0001 |
Epochs | 40 |
Hidden dimension | 64 |
Diffusion steps | 5 |
GRU layers | 2 |
Denoising layers | 5 |
Table 2.
Description of variables (factors) in the Mexican household dataset.
Variable | Description |
---|---|
current | Current (A), represents the strength of the household current |
voltage | Voltage (V), records the voltage level in the household power system |
reactive_power | Reactive power (W), describes the energy generated by inductive loads in the household power system |
apparent_power | Apparent power (VA), a combination of active and reactive power |
power_factor | Power factor, indicates the efficiency of energy utilization; a value closer to 1 indicates better utilization |
temp | Temperature (°C), real-time temperature at the household location |
feels_like | Feels-like temperature (°C), the perceived temperature combining temperature and humidity |
temp_min | Minimum temperature of the day (°C) |
temp_max | Maximum temperature of the day (°C) |
pressure | Atmospheric pressure (hPa), describes the air pressure changes in the environment |
humidity | Humidity (%), represents the moisture content in the air |
speed | Wind speed (m/s), records the speed of the wind |
deg | Wind direction (°), indicates the direction of the wind |
temp_t+1 | Forecasted temperature for the next day (°C) |
feels_like_t+1 | Forecasted feels-like temperature for the next day (°C) |
load | Total power consumed by household appliances (W) |
Table 3.
Description of variables (factors) in the smart household dataset.
Variable | Description |
---|---|
Dishwasher | Power consumption of the dishwasher (kW) |
Home_office | Power consumption of the home office (kW) |
Fridge | Power consumption of the fridge (kW) |
Wine_cellar | Power consumption of the wine cellar (kW) |
Garage_door | Power consumption of the garage door (kW) |
Barn | Power consumption of the barn (kW) |
Well | Power consumption of the well (kW) |
Microwave | Power consumption of the microwave (kW) |
Living_room | Power consumption in the living room (kW) |
Furnace | Power consumption of the furnace (kW) |
Kitchen | Power consumption in the kitchen (kW) |
Solar | Power generated by the solar system (kW) |
temperature | Temperature (°C) at the household location |
humidity | Humidity (%), represents the moisture content in the air |
visibility | Visibility (km), describes how far one can see under current weather conditions |
apparentTemperature | Apparent temperature (°C), the perceived temperature, considering humidity and wind |
pressure | Atmospheric pressure (hPa), describing air pressure in the environment |
windSpeed | Wind speed (m/s), records the speed of the wind |
cloudCover | Cloud cover (%), describes the fraction of the sky covered by clouds |
windBearing | Wind direction (°), indicates the direction from which the wind is coming |
precipIntensity | Precipitation intensity (in/h), indicates the rate of precipitation |
dewPoint | Dew point temperature (°C), the temperature at which air becomes saturated with moisture |
precipProbability | Probability of precipitation (%), indicates the likelihood of precipitation occurring |
load | Total power consumption (kW) from all appliances combined |
Table 4.
Summary statistics of variables in the Mexican household dataset.
Variable | Min | Max |
---|---|---|
current | 0.30 | 24.41 |
voltage | 107.60 | 135.50 |
reactive_power | 4.73 | 1293.58 |
apparent_power | 37.14 | 2931.64 |
power_factor | 0.20 | 1.00 |
temp | −5.56 | 39.37 |
feels_like | −6.13 | 36.70 |
temp_min | −5.56 | 37.59 |
temp_max | −5.56 | 39.44 |
pressure | 996.00 | 1035.00 |
humidity | 1.00 | 100.00 |
speed | 0.00 | 10.29 |
deg | 0.00 | 360.00 |
temp_t+1 | −5.56 | 39.37 |
feels_like_t+1 | −6.13 | 36.70 |
load | 24.40 | 2900.00 |
Table 5.
Summary statistics of variables in the smart household dataset.
Variable | Min | Max |
---|---|---|
Dishwasher | 0.00 | 1.40 |
Home_office | 0.00 | 0.97 |
Fridge | 0.00 | 0.85 |
Wine_cellar | 0.00 | 1.27 |
Garage_door | 0.00 | 1.09 |
Barn | 0.00 | 7.03 |
Well | 0.00 | 1.63 |
Microwave | 0.00 | 1.93 |
Living_room | 0.00 | 0.47 |
Furnace | 0.00 | 2.47 |
Kitchen | 0.00 | 2.27 |
Solar | 0.00 | 0.61 |
temperature | −24.80 | 34.30 |
humidity | 13.00 | 98.00 |
visibility | 0.27 | 10.00 |
apparentTemperature | −35.60 | 38.40 |
pressure | 986.40 | 1042.46 |
windSpeed | 0.00 | 22.91 |
cloudCover | 0.00 | 100.00 |
windBearing | 0.00 | 359.00 |
precipIntensity | 0.00 | 0.19 |
dewPoint | −32.90 | 23.00 |
precipProbability | 0.00 | 84.00 |
load | 0.00 | 14.71 |
Table 6.
Quantitative comparison of load prediction models at the 1 min granularity.
Model | MAPE (M) | MAPE (S) | SMAPE (M) | SMAPE (S) | MAE (M) | MAE (S) | RMSE (M) | RMSE (S) | RRSE (M) | RRSE (S) | R² (M) | R² (S) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ours | 0.121 | 0.357 | 0.107 | 0.227 | 28.754 | 0.201 | 82.488 | 0.400 | 0.601 | 0.566 | 0.878 | 0.863 |
Ours w/o Diff | 0.148 | 0.381 | 0.130 | 0.243 | 28.031 | 0.229 | 82.972 | 0.478 | 0.606 | 0.676 | 0.851 | 0.839 |
DeepAR | 0.160 | 0.483 | 0.142 | 0.278 | 37.341 | 0.258 | 83.953 | 0.462 | 0.828 | 0.654 | 0.756 | 0.737 |
BiLSTM | 0.340 | 1.056 | 0.270 | 0.465 | 66.207 | 0.432 | 113.230 | 0.632 | 0.613 | 0.896 | 0.704 | 0.683 |
Table 7.
Quantitative comparison of load prediction models at the 15 min granularity.
Model | MAPE (M) | MAPE (S) | SMAPE (M) | SMAPE (S) | MAE (M) | MAE (S) | RMSE (M) | RMSE (S) | RRSE (M) | RRSE (S) | R² (M) | R² (S) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ours | 0.411 | 0.407 | 0.337 | 0.288 | 1031.504 | 3.324 | 1409.729 | 5.475 | 0.746 | 0.678 | 0.771 | 0.758 |
Ours w/o Diff | 0.434 | 0.432 | 0.354 | 0.304 | 1051.771 | 3.546 | 1443.636 | 5.918 | 0.787 | 0.721 | 0.756 | 0.742 |
DeepAR | 0.454 | 0.575 | 0.366 | 0.351 | 1080.459 | 3.951 | 1459.828 | 5.928 | 0.796 | 0.722 | 0.710 | 0.698 |
BiLSTM | 0.750 | 0.575 | 0.458 | 0.452 | 1396.170 | 5.320 | 1749.662 | 7.525 | 0.613 | 0.919 | 0.683 | 0.671 |