Article

TransTLA: A Transfer Learning Approach with TCN-LSTM-Attention for Household Appliance Sales Forecasting in Small Towns

Department of Information and Communication Engineering, Tongji University, Shanghai 201804, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6611; https://doi.org/10.3390/app14156611
Submission received: 26 June 2024 / Revised: 19 July 2024 / Accepted: 27 July 2024 / Published: 28 July 2024
(This article belongs to the Special Issue Big Data: Analysis, Mining and Applications)

Abstract:
Deep learning (DL) has been widely applied to forecast the sales volume of household appliances with high accuracy. Unfortunately, in small towns, due to the limited amount of historical sales data, it is difficult to forecast household appliance sales accurately. To overcome the above-mentioned challenge, we propose a novel household appliance sales forecasting algorithm based on transfer learning, temporal convolutional network (TCN), long short-term memory (LSTM), and attention mechanism (called “TransTLA”). Firstly, we combine TCN and LSTM to exploit the spatiotemporal correlation of sales data. Secondly, we utilize the attention mechanism to make full use of the features of sales data. Finally, in order to mitigate the impact of data scarcity and regional differences, a transfer learning technique is used to improve the predictive performance in small towns, with the help of the learning experience from the megacity. The experimental outcomes reveal that the proposed TransTLA model significantly outperforms traditional forecasting methods in predicting small town household appliance sales volumes. Specifically, TransTLA achieves an average mean absolute error (MAE) improvement of 27.60% over LSTM, 9.23% over convolutional neural networks (CNN), and 11.00% over the CNN-LSTM-Attention model across one to four step-ahead predictions. This study addresses the data scarcity problem in small town sales forecasting, helping businesses improve inventory management, enhance customer satisfaction, and contribute to a more efficient supply chain, benefiting the overall economy.

1. Introduction

With the development of the global economy and the advancement of technology, people's living standards have continued to improve, and consumer demand for household appliances has been increasing. However, the rapid pace of appliance innovation and replacement has shortened the effective lifespan of these appliances, resulting in a continuous rise in the number of discarded units. At the same time, the substantial amount of electronic waste generated has brought about a series of resource and environmental issues [1]. Therefore, it is necessary to promote green recycling, reduce resource waste, and achieve the recycling of products.
One of the promising techniques to reduce the pollution caused by waste household appliances is the dual network integration technique. Dual network integration refers to the combination of the product sales networks and the recycling and reuse networks. Such integration not only improves the efficiency of resource utilization but also promotes the development of a green economy, thus bringing sustainable growth opportunities for the household appliance industry [2]. Within such a dual network, the accurate prediction of household appliance sales is crucial for optimizing inventory management and improving the overall supply chain efficiency.
In recent years, household appliance sales forecasting has been widely studied [3,4,5]. The widely used forecasting methods can be classified into three types: (1) statistical probability models [6], (2) machine learning (ML) models [7,8], and (3) deep learning (DL) models [9,10]. All three categories can achieve excellent forecasting performance when sufficient sales data are available, and such data can easily be obtained in large cities, especially in megacities. However, in small towns, the sales and recycling networks may not yet be established, and there are not sufficient historical sales data to train complex predictive models. In addition, since the market demand in small towns may differ significantly from that of large cities, the prediction task becomes even more complex. Therefore, it is necessary to propose a novel household appliance sales forecasting algorithm based on small sample mining for small towns.
Transfer learning (TL) is a powerful strategy in ML that is extensively employed for small sample learning. It leverages knowledge learned from one task to enhance the performance of another, related task [11]. This technique is especially effective when the source and target tasks share some correlations, as it can speed up the learning process and reduce the need for extensive data and computational resources.
In order to address the challenge of limited data in small-town home appliance sales, we propose a novel TL approach that combines the temporal convolutional network (TCN), long short-term memory (LSTM), and attention mechanism (called "TransTLA"). Firstly, we integrate the TCN and LSTM networks to capture both spatial and temporal correlations within sales data and extract meaningful features. Then, we employ attention mechanisms to effectively utilize the extracted features, enhancing the model's understanding of the sales data. Finally, we use TL techniques to address data scarcity and regional disparities. After each round of training, the model incorporates the feature discrepancy between the source domain and the target domain into a composite loss function, which underpins its strength in domain adaptation. Our TransTLA model can adapt to data-limited scenarios and facilitate the transfer of knowledge from the source domain's sales network to the target domain.
The primary contributions of this paper can be summarized as follows:
(1)
A novel forecasting model integrating TL, TCN, LSTM, and attention mechanisms is proposed to forecast the sales volume of household appliances in small towns with limited data availability, significantly enhancing the predictive performance, as confirmed by ablation and comparative experiments.
(2)
A composite loss function is designed to facilitate effective knowledge transfer from the source domain to the target domain, utilizing a weighted superposition of regression losses from both domains and the domain adaptation loss.
The rest of this paper is organized as follows: In Section 2, we briefly review related works in household appliance sales forecasting. In Section 3, we provide the formal definition and system model for the household appliance forecasting problem. In Section 4, we describe the proposed TransTLA model in detail. In Section 5, the simulation results of household appliance sales forecasting are shown. Conclusions are drawn in Section 6.

2. Related Works

In this section, we provide a concise overview of the existing literature on forecasting sales of household appliances. ML and DL technologies have proven to be effective in addressing related forecasting challenges.
As for ML models, Schmidt et al. [12] implemented three different ML models to achieve precise household appliance sales forecasting. In [13], the authors proposed a novel clustering-based forecasting model that combines a clustering approach with machine learning methods for retail sales forecasting. In this model, K-means, the self-organizing map, and the growing hierarchical self-organizing map were used as clustering techniques, while the extreme learning machine (ELM) and support vector regression (SVR) were used as ML techniques. In [14], the authors used efficient ML techniques for household appliance sales prediction. All the above studies can be accelerated by utilizing big data as a tool for predictive analysis in household appliance sales forecasting.
In recent years, some works have employed DL models for sales forecasting. In [10], the authors used convolutional neural networks (CNNs) to automatically learn effective features from historical sales data and predict household appliance sales. In order to exploit the temporal correlation among the historical sales data, the authors in [15] utilized LSTM for sales prediction in the e-commerce realm; this algorithm focuses on household products with fluctuating demand over short periods. To further improve the predictive accuracy of sales forecasting, He et al. [16] designed a novel algorithm that combines LSTM with particle swarm optimization (PSO). In [17], the authors proposed a novel hybrid prediction model based on the autoregressive integrated moving average (ARIMA) model and the recurrent neural network (RNN) model. Kaunchi et al. [18] cascaded CNN and LSTM networks to predict future market sales. In [19], a CNN-LSTM-Attention model was introduced to address the limitations of traditional LSTM architectures. Table 1 summarizes studies related to the topic of sales forecasting in this paper. For time series forecasting, Heureux et al. [20] considered a Transformer architecture with multiple input features. Staffini et al. [21] proposed a CNN model combined with a bidirectional LSTM. Rathore et al. [22] proposed a WaveNet-based graph convolutional neural network for wind speed forecasting.
In summary, most of the previous works [10,12,13,14,15,16,17,18,19] assumed that there are sufficient household appliance sales data for predicting model training. Unlike these existing works, we consider the scenario of forecasting household appliance sales in small towns, which lack sufficient sales data. We will employ TL technology to transfer the sales knowledge learned from big cities to small towns and make a precise prediction for household appliance sales.

3. Problem Formulation and System Model

In this paper, the home appliance sales forecasting problem is formalized as an end-to-end time series prediction task, with the objective of predicting future sales within a specific time frame based on the sequence of historical sales volume data. The core goal is to identify and learn patterns within the historical sales data and apply these patterns to forecast future sales trends. Therefore, we establish a functional model that is capable of learning and representing the patterns in sales data.
Mathematically, this process can be defined as a mapping function $f$, which takes the historical sales data $X = \{x_1, x_2, \ldots, x_{t-1}, x_t\}$ as the input and outputs the predicted sales volume $Y = \{y_1, y_2, \ldots, y_k\} = \{x_{t+1}, x_{t+2}, \ldots, x_{t+k}\}$ for a given time length $k$. This process is shown as follows:

$$Y = f(X; \theta),$$

where $\theta$ represents the parameters of the model, which are learned through the training process.
We introduce the TransTLA model, a novel forecasting approach for home appliance sales volumes, leveraging DL and TL technologies. The model integrates TCN, LSTM, and attention mechanisms, along with a custom loss function that incorporates TL mechanisms. The TCN-LSTM structure extracts the features from both source and target domains, while the attention mechanism refines these features to improve the predictive performance. After a round of training, the TL mechanism incorporates the distance between the features of both domains into a composite loss function. Once the model is trained, it can be used for forecasting the sales volume of home appliance products.
The TransTLA model is shown in Figure 1, which contains the input layer, TCN layer, LSTM layer, attention layer, and output layer. Firstly, the TCN layer extracts local features by using two stacked residual blocks. Secondly, the LSTM layer captures the temporal features of sequence data with two stacked LSTM units. Then, the attention mechanism layer refines these features by assigning weights based on relevance, and thus, the model’s predictive focus can be improved. Finally, two cascaded fully connected layers are used for the dimensionality reduction of high-dimensional features, and the output layer provides the prediction results. The first fully connected layer serves to halve the size of the feature vector from the hidden layer, and the second fully connected layer further maps high-level features to the final one-dimensional prediction result. This architecture not only mitigates overfitting but also enhances the model’s generalization capabilities.

4. The Proposed TransTLA Method

The TransTLA model consists of four components: The first part is a TCN, which uses convolutional methods to capture the spatial local attributes from the original dataset. The second part is an LSTM-driven temporal feature extraction network, which processes the temporal dynamics inherent within the data. The third part is the feature optimization network, composed of the attention mechanism; this network allocates attention weights to the extracted features, thereby optimizing the final predictive outcome. The last part is a TL mechanism implemented by a specially designed loss function, which combines weighted losses from both the source and target domains, along with a domain adaptation loss, to enhance the model's predictive capability for the target domain. In this section, we introduce these four parts in turn.

4.1. Temporal Convolutional Network Layer

To extract the spatial local feature from source and target domain datasets, TCN is employed to utilize the information in the time domain to identify and extract local patterns through convolution operations. TCN provides a powerful feature representation for time series forecasting tasks. The key features of TCN include dilated causal convolution and residual connections [23].

4.1.1. Dilated Causal Convolution

Causal convolutions in TCN maintain temporal causality by using only past and present data, preventing information leakage and effectively enhancing feature extraction. Traditional CNNs face challenges with increased parameters and overfitting risks as the network depth grows. TCN overcomes these issues by introducing dilated convolutions, which incorporate "dilation" into the convolutional kernel by placing gaps between its elements. Therefore, dilated convolutions can expand the receptive field without increasing the number of parameters. By combining causal convolution with dilated convolution, TCN constructs dilated causal convolution, significantly enhancing the network's ability to understand and model long-term dependencies in time series data. The network structure is shown in Figure 2.
Mathematically, for a one-dimensional time series input $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^n$, the dilated causal convolution operation $F$ at element $s$ of the sequence with a convolutional filter $f: \{0, 1, \ldots, K-1\} \rightarrow \mathbb{R}$ can be represented as

$$F(s) = (X *_d f)(s) = \sum_{k=0}^{K-1} f(k) \, x_{s - d \cdot k},$$

where $*_d$ denotes the dilated convolution operator, $d$ refers to the dilation factor, $K$ is the kernel size, and the index $s - d \cdot k$ ensures that only past data are used. Notably, when $d$ is set to 1, the dilated convolution degenerates into a regular convolution. The dilation factor controls kernel sparsity, allowing the network to integrate information over a wider range. In practice, the dilation factor typically grows exponentially by a factor of 2 with each additional layer in the network.
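To make this operation concrete, the following is a minimal PyTorch sketch of a dilated causal convolution layer (the module name and sizes are illustrative, not taken from the paper's implementation): the input is padded on the left by $(K-1) \cdot d$ so that the output at position $s$ depends only on samples at positions $\leq s$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """1D convolution that is causal: the output at time s uses only inputs <= s."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Left padding keeps the output length equal to the input length while
        # preventing any "future" samples from leaking into the convolution.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad only on the left (past) side
        return self.conv(x)

# Example: a length-30 univariate series, kernel size 3, dilation factor 2
x = torch.randn(8, 1, 30)                      # (batch=8, channels=1, time=30)
layer = DilatedCausalConv1d(1, 16, kernel_size=3, dilation=2)
print(layer(x).shape)                          # torch.Size([8, 16, 30])
```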

4.1.2. Residual Connections

Simply increasing the dilation rate without adding network layers may not maximize the network’s learning capabilities. Adding depth can expand the receptive field but may also cause vanishing or exploding gradients. To address these challenges, TCN utilizes residual connections, which integrate skip connections to facilitate a gradient flow directly from earlier layers, mitigating gradient problems and maintaining training stability even with the increased network depth. The network structure of the residual connections is illustrated in Figure 3. The mathematical representation of the residual connections is given as follows:
$$y_t = F(x_t, \theta_t) + x_t,$$

where $y_t$ is the output of the residual connection, $x_t$ is the original signal input to the residual block, and $F(x_t, \theta_t)$ is the transformation within the residual block with parameters $\theta_t$.
By integrating residual connections, TCN enhances the training efficiency and stability while expanding the receptive field, which is essential for capturing long-term dependencies in time series data. In practice, TCNs are often constructed by stacking multiple residual blocks, each incorporating dilated causal convolution layers to broaden the receptive field and preserve the time series causality.
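As an illustration of such a residual block, a minimal PyTorch sketch is given below (the layer names, the choice of ReLU and dropout, and the 1×1 convolution on the skip path are our assumptions; the paper stacks two such blocks with exponentially growing dilation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNResidualBlock(nn.Module):
    """Two dilated causal convolutions with a skip connection, as in a TCN."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # causal left padding
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.dropout = nn.Dropout(dropout)
        # 1x1 convolution so the skip path matches the channel count.
        self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def _causal_conv(self, conv, x):
        return conv(F.pad(x, (self.pad, 0)))

    def forward(self, x):                               # x: (batch, in_ch, time)
        out = self.dropout(F.relu(self._causal_conv(self.conv1, x)))
        out = self.dropout(F.relu(self._causal_conv(self.conv2, out)))
        return F.relu(out + self.downsample(x))         # y_t = F(x_t) + x_t

# Two stacked blocks with exponentially growing dilation (1, then 2)
tcn = nn.Sequential(TCNResidualBlock(1, 32, dilation=1),
                    TCNResidualBlock(32, 32, dilation=2))
print(tcn(torch.randn(8, 1, 7)).shape)                  # torch.Size([8, 32, 7])
```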
TCN’s combination of dilated causal convolution and residual connections makes it a robust framework for time series analysis. This makes TCN well suited for forecasting tasks, providing precise predictions and strong generalization in home appliance sales forecasting. With an optimized architecture and training approach, TCN effectively extracts valuable features from time series data, serving as a powerful tool for subsequent feature extraction and TL. Despite these advantages, TCNs can be computationally expensive and memory-intensive, especially when dealing with very deep networks or long input sequences.

4.2. Long Short-Term Memory Network Layer

To extract temporal features from both the source and target domain datasets, we employ LSTM networks. The core innovation of LSTM lies in its internal gating units: the input gate, the forget gate, and the output gate. These gates work in concert to dynamically regulate the flow of information, allowing the network to learn both long-term and short-term dependencies. The input gate determines how much new input information to incorporate into the neuron’s state at each time step, the forget gate decides which outdated information to discard from the state, and finally, the output gate determines how much of the current neuron’s state to include in the next time step’s output. This design makes LSTM particularly effective in sequence prediction tasks, especially in scenarios that require remembering long-term information. The structure of the LSTM network unit is shown in Figure 4, where C t 1 is the state of the neuron from the previous moment, h t 1 is the output of the neuron from the previous moment, and x t is the current moment’s time series data input. The gating mechanism provides the network with powerful memory and forgetting capabilities, which is crucial for the temporal feature extraction of the source and target domain datasets. However, LSTM can be more resource-intensive and slower to train, particularly when dealing with large batches or extended sequences.
The LSTM network, fed by the TCN's feature outputs, operates with a set of parameters that govern its state transitions. Let $X_{\mathrm{LSTM}} = \{x_1, x_2, \ldots, x_n\}$ denote the input features, $h_t$ the hidden state, $C_t$ the cell state, and $Y_{\mathrm{LSTM}} = \{y_1, y_2, \ldots, y_n\}$ the output sequence. The state updates of the LSTM network can be represented as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$
$$y_t = \sigma(V h_t + b_y)$$

where $\sigma$ is the sigmoid function; $\tanh$ is the hyperbolic tangent function; $W_f$, $W_i$, $W_C$, $W_o$, and $V$ are the weight matrices for the forget gate, input gate, candidate memory cell, output gate, and sequence prediction output, respectively; $b_f$, $b_i$, $b_C$, $b_o$, and $b_y$ are the corresponding bias terms; and $\odot$ denotes the Hadamard product. By stacking two layers of LSTM networks, we extract deeper temporal features from the dataset.
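In practice, these update equations are encapsulated by the LSTM modules of common DL frameworks, so the two stacked layers reduce to a few lines. The following PyTorch sketch (with illustrative sizes) shows how the TCN output can be fed into them.

```python
import torch
import torch.nn as nn

# TCN output: (batch, channels, time) -> LSTM expects (batch, time, features)
tcn_features = torch.randn(8, 32, 7).transpose(1, 2)     # (8, 7, 32)

# Two stacked LSTM layers, as in the TransTLA architecture (hidden size is illustrative)
lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)

outputs, (h_n, c_n) = lstm(tcn_features)
print(outputs.shape)   # torch.Size([8, 7, 64]) -- hidden state at every time step
print(h_n.shape)       # torch.Size([2, 8, 64]) -- final hidden state of each layer
# `outputs` serves as the keys/values and h_n[-1] as the query of the attention layer.
```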

4.3. Attention Mechanism Layer

In the task of household appliance sales forecasting, it is crucial to recognize that not all extracted features possess equal importance, and existing features may not be sufficient to capture critical and subtle information. Thus, optimizing feature representation is essential. To address this, we incorporate the attention mechanism to enhance the model’s capacity to capture pivotal information within sales data, thereby improving the predictive performance. The attention mechanism is inspired by the human visual attention system, which automatically focuses on the most critical parts when processing a large amount of information [24].
The core of the attention mechanism lies in computing a weighted sum based on the relevance of each element in the input sequence to a central query. Define an attention pooling function $f$, as shown in Equation (11). It accepts a query vector $\mathbf{q} \in \mathbb{R}^q$ and a set of $m$ "key–value" pairs, where each key is $\mathbf{k}_i \in \mathbb{R}^k$ and each value is $\mathbf{v}_i \in \mathbb{R}^v$.

$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^{m} \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v$$

The attention weight $\alpha(\mathbf{q}, \mathbf{k}_i)$ is a scalar derived from the attention scoring function $a$ and normalized through the softmax function to ensure that the weights sum to 1, as shown in Equation (12). The function $a$ maps the query vector $\mathbf{q}$ and key vector $\mathbf{k}_i$ to a scalar.

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^{m} \exp(a(\mathbf{q}, \mathbf{k}_j))}$$
In the proposed TransTLA, we utilize the scaled dot-product attention mechanism to compute attention weights, enabling the model to more precisely predict home appliance sales while maintaining computational efficiency. This mechanism divides the dot product of the query and key by the square root of the key vector dimension $d_k$:

$$a(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q}^{\top} \mathbf{k}_i}{\sqrt{d_k}}.$$
Compared to the traditional dot-product attention, this scaling operation stabilizes the gradients and mitigates numerical stability issues associated with high-dimensional key vectors, thereby enhancing the robustness of the attention mechanism and its efficacy in processing inputs of diverse dimensions. However, attention mechanisms can be computationally expensive, particularly with very long sequences.
The specific implementation of the attention mechanism is illustrated in Figure 5. Specifically, the hidden state of the final time step from the LSTM network serves as the query, while the entire sequence of hidden states output by the LSTM network is fed into the attention mechanism layer as the key and value. This design enables the model to assess the significance of each time step in the sequence while considering long-term dependencies. By focusing on the critical components of the LSTM output, the model can more accurately capture both long-term patterns and short-term dynamics within the sales sequence data, which is essential for precise sales forecasting.
The scaled dot-product attention mechanism involves a key tensor $K$ derived from a nonlinear transformation using the tanh activation function. This key tensor interacts with the query tensor $Q$ to compute attention scores that reflect the similarity between each time step and the last hidden state. The softmax function is then applied to these scores to yield the attention weights $\alpha$. These weights are used to perform a weighted sum with the value tensor $V$, producing a context vector that aggregates the features of each time step according to their importance, as depicted in Equation (14).

$$\mathbf{c} = \sum_{i=1}^{t} \alpha_i V_i$$
Instead of directly utilizing the weighted context vector, we concatenate it with the last hidden state. This concatenated result is then passed through a fully connected layer and processed with the tanh activation function to produce the final attention vector, as shown in Equation (15). This strategy not only helps to prevent the loss of long-term dependency information but also strengthens the model’s capability to capture features from recent time steps.
$$\mathbf{o} = \tanh(W_o [\mathbf{c}; h_t] + b_o)$$

where $[\,;\,]$ represents the concatenation operation, $W_o$ is a learnable weight matrix, and $b_o$ is the bias term.
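Combining Equations (12)–(15), a minimal PyTorch sketch of this attention layer could look as follows (the module name and the use of a linear map before the tanh key transformation are our assumptions).

```python
import math
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Scaled dot-product attention over LSTM outputs, queried by the last hidden state."""
    def __init__(self, hidden_size):
        super().__init__()
        self.key_proj = nn.Linear(hidden_size, hidden_size)    # nonlinear key transform
        self.out_proj = nn.Linear(2 * hidden_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, lstm_outputs, last_hidden):
        # lstm_outputs: (batch, time, hidden), last_hidden: (batch, hidden)
        K = torch.tanh(self.key_proj(lstm_outputs))             # keys
        V = lstm_outputs                                        # values
        q = last_hidden.unsqueeze(1)                            # (batch, 1, hidden)
        scores = torch.bmm(q, K.transpose(1, 2)) / self.scale   # Eq. (13)
        alpha = torch.softmax(scores, dim=-1)                   # Eq. (12)
        context = torch.bmm(alpha, V).squeeze(1)                # Eq. (14)
        # Concatenate context with the last hidden state, then project, Eq. (15)
        return torch.tanh(self.out_proj(torch.cat([context, last_hidden], dim=-1)))

attn = AttentionLayer(hidden_size=64)
outputs = torch.randn(8, 7, 64)                  # LSTM hidden states for 7 time steps
vec = attn(outputs, outputs[:, -1, :])           # query = last time step
print(vec.shape)                                 # torch.Size([8, 64])
```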

4.4. Transfer Learning Implementation

In the field of TL, Domain-Adversarial Training (DAT) and Model-Agnostic Meta-Learning (MAML) are two representative methods. DAT introduces domain discriminators and adversarial training strategies to learn domain invariant feature representations by minimizing the distribution difference between the source and target domains [25]. On the other hand, MAML leverages the concept of meta-learning to train models on a multitude of tasks, enabling them to quickly adapt to new target tasks [26]. TL boosts the efficiency and performance of a target task by transferring knowledge from a source domain. In our study, TL is employed in home appliance sales forecasting, with the aim of leveraging experience from data-rich areas (such as large cities) to enhance the predictive capabilities of the model in data-scarce areas (such as small towns).
Instead of the traditional two-step process of training on the source domain followed by fine-tuning on the target domain, a composite loss function is utilized during model training. This function comprises the source domain regression loss L source , target domain regression loss L target , and domain adaptation loss L adapt . By leveraging data from both domains, our method progressively adjusts the model for better adaptation to the target domain’s predictive tasks. This integrated approach aims to enhance the model’s ability to generalize across domains without the need for separate adversarial training or extensive meta-learning phases.
The calculation of L source and L target focuses on optimizing the prediction performance through mean squared error (MSE) between the model’s outputs and actual data in both domains, i.e.,
$$L_{\mathrm{source}} = \mathrm{MSE}\left(y^{\mathrm{source}}, \hat{y}^{\mathrm{source}}\right) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i^{\mathrm{source}} - \hat{y}_i^{\mathrm{source}} \right)^2$$
$$L_{\mathrm{target}} = \mathrm{MSE}\left(y^{\mathrm{target}}, \hat{y}^{\mathrm{target}}\right) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i^{\mathrm{target}} - \hat{y}_i^{\mathrm{target}} \right)^2$$

where $y^{\mathrm{source}}$ and $\hat{y}^{\mathrm{source}}$ are the actual data and predicted results from the source domain training set, respectively, and $y^{\mathrm{target}}$ and $\hat{y}^{\mathrm{target}}$ are the actual data and predicted results from the target domain training set, respectively.
The domain adaptation loss L adapt in home appliance sales forecasting aims to bridge the knowledge gap between source and target domains, facilitating the effective transfer of the model from the source to the target domain by quantifying the similarity of their feature distributions. Maximum Mean Discrepancy (MMD), a widely used measure of distribution similarity based on kernel methods, enables the comparison of distribution differences in high-dimensional spaces [11]. Specifically, MMD is defined as the difference between the means of two probability distributions, P and Q , given by Equation (18), with a value close to zero indicating alignment and a higher value signifying divergence. Minimizing MMD facilitates effective knowledge transfer, thereby enhancing the model’s performance in the target domain.
$$\mathrm{MMD}^2(P, Q) = \left\| \int \phi(x) \, \mathrm{d}P(x) - \int \phi(x) \, \mathrm{d}Q(x) \right\|^2$$

In Equation (18), $\phi(\cdot)$ is a mapping function from the input space to some high-dimensional feature space, and $\|\cdot\|$ denotes the norm in that feature space.
In our study, a linear variant of MMD is used as the domain adaptation loss. This specialized MMD variant leverages a linear mapping function, simplifying the computation by using the dot product in the original feature space as the kernel. The domain adaptation loss can be formulated as Equation (19), which takes the feature tensors from the source and target domains, as derived from the attention mechanism layer, as input sample sets: the squared Euclidean norm of the difference between the mean vectors of these two sample sets guides the model's domain adaptation.
$$L_{\mathrm{adapt}} = \left\| \mathbb{E}\left(Y^{\mathrm{source}}\right) - \mathbb{E}\left(Y^{\mathrm{target}}\right) \right\|^2 = \left\| \frac{1}{N_{\mathrm{source}}} \sum_{i=1}^{N_{\mathrm{source}}} Y_i^{\mathrm{source}} - \frac{1}{N_{\mathrm{target}}} \sum_{j=1}^{N_{\mathrm{target}}} Y_j^{\mathrm{target}} \right\|^2$$
Combining Equations (16), (17) and (19), the overall loss function can be expressed as
$$L = \alpha L_{\mathrm{source}} + (1 - \alpha) L_{\mathrm{target}} + \beta L_{\mathrm{adapt}},$$

where $\alpha$ and $\beta$ are the balancing weights.
By integrating the custom loss function, the model enhances the predictive performance across both source and target domains while facilitating effective knowledge transfer via domain adaptation loss, refining feature representations with each training epoch. Over successive training rounds, the model learns to minimize discrepancies between source and target domain feature tensors, mitigating predictive errors from data distribution inconsistencies.
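A minimal PyTorch sketch of this composite loss is shown below (the function names are ours; the linear-MMD term operates on the attention-layer feature tensors of the two domains, and $\alpha$ and $\beta$ are the weights of Equation (20)).

```python
import torch
import torch.nn.functional as F

def linear_mmd(source_feat, target_feat):
    """Linear MMD: squared Euclidean distance between the domain feature means, Eq. (19)."""
    return (source_feat.mean(dim=0) - target_feat.mean(dim=0)).pow(2).sum()

def composite_loss(pred_src, y_src, pred_tgt, y_tgt,
                   feat_src, feat_tgt, alpha=0.5, beta=0.1):
    """Weighted sum of both regression losses and the domain adaptation loss, Eq. (20)."""
    loss_source = F.mse_loss(pred_src, y_src)       # Eq. (16)
    loss_target = F.mse_loss(pred_tgt, y_tgt)       # Eq. (17)
    loss_adapt = linear_mmd(feat_src, feat_tgt)     # Eq. (19)
    return alpha * loss_source + (1 - alpha) * loss_target + beta * loss_adapt

# Toy example with random predictions and 64-dimensional attention features
loss = composite_loss(torch.randn(32, 1), torch.randn(32, 1),
                      torch.randn(32, 1), torch.randn(32, 1),
                      torch.randn(32, 64), torch.randn(32, 64))
print(loss.item())
```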
Compared to traditional TL methods, including DAT and MAML, our proposed approach offers several key advantages: (1) Unlike DAT's adversarial training and MAML's meta-learning, our method eliminates complex model transfer and fine-tuning steps, simplifying the training process. (2) By simultaneously optimizing the regression losses of both domains and the domain adaptation loss, it enables faster learning of common features. (3) This approach mitigates overfitting, especially in data-scarce target domains, enhancing generalization by learning features shared across domains. (4) Furthermore, the model progressively adapts to the target domain's predictive tasks during training, improving the predictive accuracy; this continuous adaptation strategy is distinct from DAT's adversarial steps and MAML's episodic learning.
The training process framework of the home appliance sales forecasting model based on TransTLA is depicted in Figure 6. Initially, source and target domain data are collected and preprocessed. The training data from both domains are then fed into the model network for training. During each training epoch, the model leverages data from both domains to calculate the source domain regression loss, target domain regression loss, and domain adaptation loss. These are weighted and summed to form the overall composite loss function, which is then used for backpropagation to update the model parameters for the next round of training. It is important to note that, despite facing two distinct data domains, we train the same model. The model adjusts its parameters according to the guidance of the composite loss function in each round to achieve TL, thereby improving the performance of home appliance sales forecasting in the target domain.
After each epoch, the model’s performance is evaluated using the target domain’s validation set. If the current model outperforms the previous best model on the validation set, it is updated and saved as the new best model. This process is repeated until a preset number of training epochs is reached. Once training is over, the saved best model is loaded to output and evaluate predictions on the target domain’s test set.
During the training process, the Adam optimizer is utilized for efficient model parameter tuning, while a dropout strategy is implemented to mitigate overfitting and enhance the model’s generalization capability. Furthermore, we employ a dynamic learning rate adjustment method. This approach, as described by Equation (21), gradually decreases the learning rate as the training advances, enabling rapid convergence in the early stages and facilitating precise parameter tuning in the later stages.
$$LR_i = \frac{LR_0}{(1 + \gamma \, i)^{\delta}}$$

where $LR_i$ represents the learning rate at the $i$-th iteration, $LR_0$ is the initial learning rate, $\gamma$ is the coefficient for learning rate growth, and $\delta$ is the coefficient for learning rate decay.
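For illustration, such a schedule can be realized with PyTorch's LambdaLR by expressing Equation (21) as a multiplicative factor of the initial learning rate; the coefficient values below are placeholders rather than the ones used in our experiments.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(7, 1)                    # stand-in model
optimizer = Adam(model.parameters(), lr=1e-3)    # LR_0 = 1e-3

gamma, delta = 10.0, 0.75                        # illustrative coefficients
# LR_i = LR_0 / (1 + gamma * i)^delta, expressed as a factor of LR_0
scheduler = LambdaLR(optimizer, lr_lambda=lambda i: (1 + gamma * i) ** (-delta))

for epoch in range(5):
    optimizer.step()                             # (actual training step omitted)
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```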
The pseudocode implementation of the entire training and evaluation process of the TransTLA model algorithm is shown in Algorithm 1.
Algorithm 1. Process of the TransTLA Model Algorithm
Input: Time series data of home appliance sales from both source and target domains
Output: Predicted sales volume result in the target domain under a given forecast time step
1: Data Preprocessing: Clean, normalize, split the datasets, and process with sliding windows for both source and target domain data.
2: Train the Model with the Training Set:
3: for epoch in range(epochs) do     // for each training epoch
4:    Extract spatial local features through the TCN layer.
5:    Extract temporal features through the LSTM layer.
6:    Q = Hidden state output from the last time step by LSTM.
7:    K = Sequence of hidden states output by LSTM.
8:    V = Sequence of hidden states output by LSTM.
9:    Optimize features through the attention mechanism layer: Attention(Q, K, V).
10:    Calculate domain adaptation loss L_adapt.
11:    Get prediction output through fully connected layers.
12:    Calculate the regression loss of source domain L_source and target domain L_target.
13:    Calculate composite loss L = α × L_source + (1 − α) × L_target + β × L_adapt
14:    Backpropagate and optimize model parameters with the Adam optimizer based on the loss function.
15:    Perform dynamic adjustment of the learning rate.
16:    Evaluate performance with the target domain validation set, update and save the best model.
17: end for
18: Load the best model.
19: Model Evaluation: Test on the target domain test set and calculate evaluation metrics.
20: Return: The predicted sales results in the target domain.
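To make the control flow of Algorithm 1 concrete, the following is a highly simplified, self-contained PyTorch sketch of one possible epoch loop; the stand-in network, random data tensors, and hyperparameter values are illustrative only, and batching, validation, and best-model saving are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyForecaster(nn.Module):
    """Stand-in for the TransTLA network: returns (features, prediction)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden // 2), nn.ReLU(),
                                  nn.Linear(hidden // 2, 1))

    def forward(self, x):                          # x: (batch, window, 1)
        out, _ = self.lstm(x)
        feat = out[:, -1, :]                       # shared feature for domain adaptation
        return feat, self.head(feat)

def linear_mmd(a, b):
    return (a.mean(0) - b.mean(0)).pow(2).sum()

# Random stand-ins for the preprocessed, windowed source/target training sets
x_src, y_src = torch.randn(128, 7, 1), torch.randn(128, 1)
x_tgt, y_tgt = torch.randn(48, 7, 1), torch.randn(48, 1)

model = TinyForecaster()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha, beta = 0.5, 0.1

for epoch in range(10):
    feat_s, pred_s = model(x_src)                  # source domain forward pass
    feat_t, pred_t = model(x_tgt)                  # target domain forward pass (same model)
    loss = (alpha * F.mse_loss(pred_s, y_src)
            + (1 - alpha) * F.mse_loss(pred_t, y_tgt)
            + beta * linear_mmd(feat_s, feat_t))   # composite loss of Eq. (20)
    optim.zero_grad()
    loss.backward()
    optim.step()
    # validation on the target-domain validation set and best-model saving omitted
```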

5. Experiments

In this section, we will evaluate the performance of the TransTLA model in home appliance sales forecasting from various perspectives. First, we perform parameter selection experiments to identify the optimal values for key model parameters. Then, we implement ablation experiments to analyze the contribution of each model component to the overall performance. Finally, we compare the model with other forecasting methods.

5.1. Data Description and Preprocessing

Our study aims to refine the precision of home appliance sales forecasting via TL. In the experiment, we employ real sales data of home appliance products from the Chinese market. Rich historical sales data from outlets in large cities are selected as the source domain, while small towns or newly established outlets with lower market maturity and scarce data are chosen as the target domain. Specifically, we select two geographically close locations in the Yangtze River Delta region, City A and Town B, as the subjects of our study to ensure a certain degree of similarity in features between the source and target domains, such as geographical location, climate, and weather conditions. This similarity helps the model better adapt to the characteristics of the target domain.
The source domain dataset covers 1260 days of air conditioner sales data from City A from 2020 to 2023, while the target domain dataset includes 405 days of air conditioner sales data from Town B from 2022 to 2023. These data provide a solid foundation for subsequent experiments, enabling us to conduct in-depth analysis and forecasting of home appliance sales patterns.
During the data preprocessing phase, the collected source and target domain datasets are first cleaned and filtered to remove outliers and noise. Missing values in the datasets are smoothed through mean imputation, which helps maintain data integrity and reduces the potential impact of missing data on model training.
We apply min–max normalization to standardize the data across variables, as shown below:
$$x_j' = \frac{x_j - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_j$ and $x_j'$ represent the original and normalized data, respectively, with $x_{\min}$ and $x_{\max}$ being the minimum and maximum values in the original sequence.
To visually present the experimental datasets and to analyze the sales changes of the two domains over time more accurately, we normalize and align the sales data of the source and target domains on the time axis, as shown in Figure 7. It can be seen that the sales trends of both domains exhibit a degree of similarity, suggesting that, despite variations in data volume and market maturity, there exist shared patterns in sales fluctuations. This commonality offers robust empirical evidence that is instrumental for the domain adaptation training of the TransTLA model.
The dataset is partitioned into three distinct subsets: training (70%), validation (10%), and testing (20%). This division is strategically designed to address the limited availability of the target domain samples by assigning a larger training dataset, ensuring appropriate data support for the model’s training, optimization, and assessment. To model the sequences, a sliding time window technique is employed, with a fixed window size, such as 7 days, to capture consecutive time intervals. This approach allows for overlapping windows, enabling the model to effectively handle the sequential and evolving nature of sales data.
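A minimal sketch of this windowing and splitting step, assuming a normalized one-dimensional array of daily sales and our own helper name, is given below.

```python
import numpy as np

def make_windows(series, window=7, horizon=1):
    """Slide a fixed-size window over the series to build (input, target) pairs."""
    X, Y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])                     # past `window` days
        Y.append(series[i + window:i + window + horizon])  # next `horizon` day(s)
    return np.array(X), np.array(Y)

sales = np.random.rand(405)            # stand-in for the normalized target-domain series
X, Y = make_windows(sales, window=7, horizon=1)

# 70% / 10% / 20% chronological split into training, validation, and test sets
n = len(X)
tr, va = int(0.7 * n), int(0.8 * n)
X_train, Y_train = X[:tr], Y[:tr]
X_val, Y_val = X[tr:va], Y[tr:va]
X_test, Y_test = X[va:], Y[va:]
print(X_train.shape, X_val.shape, X_test.shape)
```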

5.2. Evaluation Metrics

To assess the accuracy of the algorithm’s predictive results, we employ a combination of evaluation metrics, including the coefficient of determination (R-Square, R2), mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). These metrics enable a quantitative assessment of the model’s predictive performance and error levels, providing an objective judgment of the quality of the forecast results. The calculation methods for these metrics are shown as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

where $y_i$ is the true value of the $i$-th sample, $\hat{y}_i$ is the predicted value of the $i$-th sample, and $\bar{y}$ is the average of the true values.
R2 measures how well the model fits the data; the closer the value is to 1, the stronger the model’s explanatory power. However, it can sometimes be less than 0, indicating that the model’s predictive performance is not as good as simply using the mean. MAE calculates the average of the absolute differences between predicted and actual values; the smaller the MAE, the smaller the model’s average predictive error. MAPE assesses the proportion of the error relative to the actual value; the lower the MAPE, the smaller the relative predictive error. RMSE computes the square root of the mean of the squared differences between the predicted results and the actual data; the smaller the RMSE, the better the predictive capability.
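These four metrics can be computed directly from the prediction and ground-truth arrays, as in the short NumPy sketch below (the sample values are illustrative).

```python
import numpy as np

def evaluate(y_true, y_pred):
    """R2, MAE, MAPE, and RMSE as defined above."""
    err = y_true - y_pred
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true))     # assumes no zero sales values
    rmse = np.sqrt(np.mean(err ** 2))
    return {"R2": r2, "MAE": mae, "MAPE": mape, "RMSE": rmse}

y_true = np.array([12.0, 15.0, 9.0, 20.0])
y_pred = np.array([11.0, 16.0, 10.0, 18.0])
print(evaluate(y_true, y_pred))
```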

5.3. Experimental Setup

The open-source DL framework PyTorch [27] is utilized to construct the DL network model. All experiments are conducted on a server with an AMD EPYC 7763 2.45 GHz CPU, 512 GB of memory, an NVIDIA A800 80 GB GPU, and the CentOS 8 operating system. The software stack includes CUDA Toolkit 12.2, Python 3.11, and PyTorch 2.2. The parameter settings used in model training and testing are shown in Table 2.

5.4. Parameter Selection Experiments and Results

In the parameter selection experiment, we aim to determine the optimal parameter settings for the TransTLA model to optimize performance on home appliance sales forecasting. We focus on the impact of weight parameters in the composite loss function and the effect of the varying historical window sizes.
The composite loss function (see Equation (20)) includes source domain loss, target domain loss, and domain adaptation loss, with hyperparameters α and β balancing their contributions.
α is a critical hyperparameter that balances the weights between the source domain regression loss and the target domain regression loss. We explore α values from 0 to 1 with steps of 0.1 to determine its effect. Specifically, when α equals 0, the influence of the source domain regression loss is not considered; only the target domain regression loss and domain adaptation loss are used in the backpropagation. When α is set to 1, it indicates that the total loss consists entirely of the source domain regression loss and domain adaptation loss. For each value of α, the model is trained with a consistent structure and parameters as described in Section 5.3. The historical backtracking window length is set to 7 days, and β is set to 0.1. We use a sliding window to predict the sales volume for the next day. The evaluation metrics for sales volume prediction under different proportions of source domain regression loss are presented in Table 3, with comparative curves for the MAE and RMSE indicators illustrated in Figure 8.
Both Table 3 and Figure 8 indicate that the model performs well when the value of α is in the middle range, specifically at 0.5 or 0.6, with a small difference between them. As the value of α moves away from the middle, the predictive performance declines, suggesting that equally considering the regression losses of both the source and target domains leads to a more balanced effect, which helps the model maintain a better performance when transferred to the target domain. Therefore, 0.5 is chosen as the optimal parameter setting for α.
It can be observed that, when α = 0 or α = 1, the model’s performance is not ideal, indicating that relying solely on target domain data or source domain data is not the best strategy. A moderate amount of source domain information helps improve the predictive performance in the target domain task, but overreliance on source domain information may introduce noise or inadaptability, thereby reducing the model’s performance.
Next, we explore the impact of the hyperparameter β on the performance of the TransTLA model. β regulates the impact of the domain adaptation loss, which is crucial for minimizing the differences in feature distributions across source and target domains, thereby enhancing the model’s performance on the target domain. Experiments vary β from 0 to 1 in 0.1 increments, with other parameters held constant. At β = 0, the model disregards domain adaptation loss, focusing solely on regression losses. With α set at 0.5 and a 7-day backtracking window, the model forecasts the next day’s sales. The results under different domain adaptation loss ratios are detailed in Table 4, with the MAE and RMSE curves compared in Figure 9.
The analysis of the data and comparison curves from Table 4 and Figure 9 reveals that an initial increase in β from 0 to 0.1 enhances the model performance. However, beyond this point, further increments in β result in performance fluctuations rather than a steady improvement. A β that is too high could cause the model to prioritize reducing domain discrepancies over optimizing the regression task, potentially degrading the overall performance. This suggests that there is a sweet spot for balancing regression loss and domain adaptation loss, which is essential for model tuning. Notably, when β is set to 0.1, the model achieves the best results in forecasting home appliance sales. Consequently, this paper adopts β = 0.1 for all subsequent experiments.
The size of the historical backtracking window determines the length of historical information considered by the model when making predictions. Choosing the appropriate window size is crucial for the model to capture long-term dependencies in time series and to improve the predictive performance. To quantitatively analyze the impact of different backtracking window sizes, the size is varied from 4 to 12 days, generating corresponding datasets for training and evaluation. With α = 0.5 and β = 0.1, the model predicts the sales volume for the next day. The predictive efficacy is evaluated, with the results and comparison curves presented in Table 5 and Figure 10, respectively.
The experimental results from Table 5 and Figure 10 indicate that the model’s predictive performance follows an increasing and then decreasing trend with the expansion of the historical backtrack window size. Specifically, when the window size is less than 7 days, the model’s predictive accuracy improves with the enlargement of the window, likely due to the small window’s inability to capture long-term dependencies in the time series, thus limiting its predictive power. However, when the window size exceeds 7 days, the performance begins to decline, possibly because excessive historical information introduces noise and irrelevant data, which reduces its relevance to current sales trends and interferes with the model’s prediction of current sales. Additionally, considering the limited number of samples in the target domain of this paper, a larger historical backtrack window could theoretically provide more data but actually lead to a further reduction in available training samples. This not only decreases the model’s training efficiency but also, due to data scarcity, may prevent the model from being adequately trained, ultimately affecting its performance and generalization capability. When the backtrack window size is set to 7 days, the model achieves the best predictive performance, outperforming other window sizes across all evaluation metrics. In summary, a 7-day historical backtrack period is used for subsequent experiments.
In the real-world home appliance market, accurate sales forecasting is crucial for strategic planning, inventory optimization, and flexible sales strategies. Consequently, it is necessary to further evaluate the performance of the proposed TransTLA model in multi-step forecasting. We employ a 7-day historical sliding window to generate predictions for the next 1 to 4 days, which are then compared to actual sales data. For instance, to predict the next 3 days, the model uses real data from the previous 7 days to forecast the sales for the next 3 days at once. Subsequently, the sliding window moves forward by 3 days, and the process is repeated.
Figure 11 presents the home appliance sales forecast results of the TransTLA model on the test set under different prediction steps (1 to 4 days ahead). The blue lines represent the actual sales volumes, while the orange lines indicate the predicted values. The model demonstrates high accuracy and sensitivity for short-term forecasts of 1 and 2 days, closely tracking actual sales. As the prediction horizon extends, slight discrepancies emerge between the forecasted and actual sales volumes. This is expected due to the increased uncertainty associated with longer-term predictions. However, the model still captures the overall long-term sales trends effectively, especially during significant upward or downward trends, showing strong trend detection and robustness.

5.5. Ablation Experiments and Results

To comprehensively evaluate the contribution of each component in the TransTLA model to the overall performance of home appliance sales forecasting, this paper conducts a series of ablation experiments. The essence of ablation experiments is to remove or modify key components of the model gradually, compare the performance of the complete model with the ablated models in forecasting tasks, and thus deeply understand and quantify the specific impact of these components on the overall model performance.
The experiments focus on four key parts, i.e., TCN layers, LSTM layers, attention mechanism layers, and TL mechanisms, and various combinations of these parts, respectively. Notably, when TL is not used, the model is trained and predicted using only the target domain dataset. Models with the prefix “Trans” indicate the use of TL strategies, which means training with both source and target domain data and evaluating the testing set of the target domain. The ablation experiments in this study cover a time span of 1 to 4 days in the future to fully analyze the impact of each component on the model’s forecasting ability. Through these ablation experiments, we can not only identify the model components that are crucial for the forecasting performance but also understand whether the role of these components differs significantly under different prediction time spans. To address the inherent stochasticity in DL models due to random weight initialization and optimization processes, we conducted 10 separate evaluations for each model. Table 6 lists the average evaluation metrics of the models under different ablation configurations, with the standard deviation indicated in brackets. Figure 12 visually displays how the MAE metric of different models changes with the time span. Additionally, to statistically confirm the significant differences in the performance of the models, we performed independent t-tests, considering a significance level of p < 0.05.
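For reference, such a comparison between the repeated evaluation results of two models can be carried out with an independent two-sample t-test, e.g., via SciPy; the MAE values in the sketch below are purely illustrative.

```python
import numpy as np
from scipy import stats

# MAE values from 10 independent runs of two models (illustrative numbers only)
mae_model_a = np.array([4.21, 4.35, 4.18, 4.40, 4.27, 4.31, 4.22, 4.29, 4.36, 4.25])
mae_model_b = np.array([4.02, 3.95, 4.10, 3.98, 4.05, 4.01, 3.93, 4.08, 4.00, 3.97])

t_stat, p_value = stats.ttest_ind(mae_model_a, mae_model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in MAE is statistically significant at the 5% level.")
```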
From Figure 12, one can see that the predictive accuracy of all models declines as the forecast horizon increases, highlighting the complexity of long-term forecasting. This decline is due to escalating uncertainty and the diminishing relevance of historical data to future sales over time.
In single model structures, TCN and LSTM perform relatively well for a forecast step of one but deteriorate rapidly as the steps increase. Particularly, LSTM’s R2 becomes negative at step 4, indicating its predictions are less effective than a simple mean forecast and have very low relevance to the actual sales data. Thus, LSTM cannot effectively capture future trends. In contrast, TCN shows some predictive power at step 4.
Furthermore, the TCN-LSTM model shows some advantages at steps 1 and 2 but performs poorly at steps 3 and 4, indicating limitations in handling long-term forecasting tasks. However, introducing the attention mechanism improves the performance of the TCN-Attention, LSTM-Attention, and TCN-LSTM-Attention models, which achieve better results across single and multiple forecast steps. This demonstrates that the attention mechanism helps the model focus on key information in the time series and confirms the advantage of hybrid model structures in capturing complex time series features.
The introduction of TL mechanisms enhances the model's predictive performance. Such improvement can be attributed to the fact that leveraging the source domain data effectively enhances the model's understanding of the target domain. Comparisons clearly show that the proposed TransTLA model outperforms the other ablation models at most forecast steps, with the percentage MAE improvement displayed in Table 7. At a one-day forecast step, the TransTLA model reduces the MAE by 6.66% compared to the TCN model and by 8.55% compared to the LSTM model, with a 5.25% improvement over the non-TL TCN-LSTM-Attention model. When predicting four days ahead, the TransTLA model reduces the MAE by 6.13% compared to the TCN model, by 36.31% compared to the LSTM model, and by 10.64% compared to the TCN-LSTM-Attention model. This indicates that combining attention and TL mechanisms with TCN and LSTM can better leverage their respective advantages for more accurate sales forecasting. The TransTLA model effectively captures both short-term and long-term dependencies in time series data, focuses on key time steps through the attention mechanism, and learns useful knowledge from the source domain through TL, thereby improving the forecast accuracy.
In summary, the ablation study reveals the key role of the TCN layers, LSTM layers, the attention mechanism, and TL in improving the accuracy of home appliance sales forecasting. The TransTLA model integrates these components to achieve optimal performance across different forecast time spans.

5.6. Comparative Experiments and Results

To further validate the effectiveness and superiority of the TransTLA model for home appliance sales forecasting, comparative experiments are conducted with several time series forecasting models to comprehensively assess the performance of the TransTLA model. In addition to the proposed TransTLA model, the comparison includes the following models:
ARIMA: Autoregressive Integrated Moving Average, a classic and commonly used statistical method that predicts by capturing linear dependencies in time series data.
XGBoost: Extreme Gradient Boosting, an advanced and scalable ML method that enhances model performance by combining the predictions of multiple weak learners and using gradient descent to minimize errors.
SVR: Support Vector Regression, a ML method that extends Support Vector Machines (SVMs) to regression tasks, aiming to find a function within a margin of error while handling nonlinear relationships using kernel functions.
RNN: A type of neural network designed for sequence data, characterized by maintaining states in the hidden layers to influence subsequent outputs with previous information. In the experiment, a two-layer RNN stack processes the input, followed by two fully connected layers to produce the forecast.
CNN: A DL model initially used in image recognition but also effective in extracting local features of time series through one-dimensional convolution. The CNN structure employs one-dimensional convolution, average pooling, and ReLU activation function, stacked twice, followed by two fully connected layers for the output.
CNN-LSTM [18]: A combined model that first extracts local features with CNN layers and then captures temporal dependencies with LSTM layers to enhance the forecasting performance.
CNN-LSTM-Attention: A model that further combines the advantages of CNN, LSTM, and attention mechanisms to capture complex patterns and dependencies in time series data through layered abstraction and precise information filtering.
TransCNN-LSTM-Attention: A model that incorporates TL strategies into CNN-LSTM-Attention to explore the impact of TL on sales forecasting performance.
Experiments to predict sales volume for 1, 2, 3, and 4 days in advance are designed for each model. The results of various evaluation metrics in the comparative experiment are shown in Table 8, and Figure 13 displays the MAE comparison of different models across various forecast horizons.
The experimental results from Table 8 and Figure 13 demonstrate that, as the forecast horizon lengthens, the predictive performance of all models declines due to the increasing uncertainty associated with making longer-term predictions. Overall, ML- and DL-based models outperform the traditional statistical ARIMA model. Across all evaluation metrics and forecast steps, RNN, CNN, and their variants consistently perform significantly better than ARIMA. In particular, at the forecast steps of 3 and 4 days, ARIMA's R2 values drop to −0.292 and −0.723, respectively, indicating that its predictions are less effective than a simple mean forecast. This suggests that traditional statistical models struggle to effectively capture the nonlinear trends and complex patterns in the time series data of home appliance sales.
For ML-based methods, XGBoost and SVR show nuanced differences in their predictive performances over 1–4 day forecast horizons. XGBoost starts strong with an R2 of 0.652 on 1 day but sees a significant drop by 4 days (R2 of 0.105). Conversely, SVR maintains a more consistent performance, demonstrating its stronger capacity to handle extended forecasts. While both models’ accuracy declines over longer horizons, SVR consistently outperforms XGBoost across most metrics and steps, indicating better accuracy and robustness in handling nonlinear time series data complexities. Nevertheless, these ML models still underperform compared to DL models, particularly for longer forecasts.
Further analysis of DL models reveals that combined models perform better than individual RNN and CNN base models. RNN shows better performance in short-term forecasts (e.g., 1-day step) but a significant drop in performance for long-term forecasts (e.g., 3 and 4 days). This may be due to challenges in capturing long-term dependencies in time series data and susceptibility to the vanishing gradient problem. CNN exhibits a relatively stable performance across various forecast steps but falls short of the combined models in overall predictive accuracy, indicating that convolutional operations alone may not be sufficient to fully explore the temporal characteristics of time series data.
Combined models like CNN-LSTM and CNN-LSTM-Attention achieve superior predictive performance, with CNN-LSTM-Attention showing a high R2 value (0.797) at a 1-day step and maintaining a relatively stable performance even at a 4-day step. This indicates that integrating convolutional operations with LSTM or attention mechanisms can better model time series data, capturing both local and global features, thereby enhancing the predictive accuracy.
The TransCNN-LSTM-Attention model, which incorporates TL, demonstrates exceptional performance in both short-term and long-term forecasting tasks, outperforming CNN-LSTM-Attention in most evaluation metrics and forecast horizons and ranking just below the TransTLA model overall. This shows that TL can further enhance sales forecasting performance by leveraging knowledge from related domains.
Notably, the TransTLA model proposed in this paper exhibits the best overall predictive performance across all forecast horizons. Table 9 presents the percentage MAE improvement of the TransTLA model over selected comparative models. When predicting 1-day sales, its MAE is reduced by 9.83% compared to the CNN-LSTM-Attention model and by 9.29% compared to the second-best TransCNN-LSTM-Attention model. Especially in multi-step forecasting, its evaluation metrics are better than those of the other comparative models. For instance, at a 3-day forecast step, the MAE is reduced by 12.12% compared to the CNN-LSTM-Attention model and by 10.53% compared to the TransCNN-LSTM-Attention model. This indicates that the components of the TransTLA model, such as TCN and TL, can more effectively mine intrinsic patterns and trends from time series data, further validating its high efficiency and robustness in multi-step forecasting tasks.
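The percentages in Table 9 follow directly from the MAE values in Table 8; a minimal check for the 1-day step is shown below (the helper name mae_improvement is our own).

```python
def mae_improvement(mae_baseline, mae_transtla):
    """Percentage MAE reduction of TransTLA relative to a baseline (as in Table 9)."""
    return 100.0 * (mae_baseline - mae_transtla) / mae_baseline

# 1-day-ahead MAEs taken from Table 8:
print(round(mae_improvement(10.420, 9.396), 2))  # vs. CNN-LSTM-Attention      -> 9.83
print(round(mae_improvement(10.358, 9.396), 2))  # vs. TransCNN-LSTM-Attention -> 9.29
```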
In addition to the superior predictive performance, it is also crucial to consider the computational complexity of these models. Table 10 presents the number of floating-point operations (FLOPs), multiply–accumulate operations (MACs), and parameters (Params) for different models used in the experiments, based on the settings described in Section 5.3.
From Table 10, we can observe that the TransTLA model has higher FLOPs and MACs than most other models, indicating greater computational complexity. However, this complexity is justified by its significantly improved predictive performance. Compared with simpler models such as ARIMA, XGBoost, and SVR, the DL models, including TransTLA, are considerably more complex. Traditional models such as ARIMA and SVR generally have lower computational complexity due to their simpler structure and fewer parameters. XGBoost, while more complex than ARIMA and SVR, still does not reach the level of complexity seen in DL models. Nevertheless, these traditional models often fall short in capturing the intricate patterns and dependencies in time series data, especially for multi-step forecasting tasks, where the TransTLA model excels.
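The complexity figures in Table 10 can be reproduced in spirit with standard tooling: parameter counts come directly from PyTorch, while MACs/FLOPs require a profiler. The sketch below uses a toy LSTM and the third-party thop package as one possible profiler; this is an assumption about tooling and is not necessarily how Table 10 was generated.

```python
import torch
import torch.nn as nn

# Any PyTorch model works here; a toy LSTM stands in for the real forecasting models.
model = nn.LSTM(input_size=1, hidden_size=64, num_layers=2, batch_first=True)

# Parameter count comes directly from PyTorch.
n_params = sum(p.numel() for p in model.parameters())
print(f"Params: {n_params / 1e3:.3f} k")

# MACs/FLOPs need a profiler; `thop` is one common choice (not necessarily the
# tool used to produce Table 10). FLOPs are often approximated as 2 x MACs.
try:
    from thop import profile
    macs, _ = profile(model, inputs=(torch.zeros(1, 7, 1),))
    print(f"MACs: {macs / 1e3:.2f} k, FLOPs ~= {2 * macs / 1e3:.2f} k")
except ImportError:
    pass
```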
It is worth noting that adding TL to a model leaves its number of parameters unchanged but roughly doubles the computational cost, because batches from both the source and target domains must pass through the model during training.
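A schematic training step makes this doubling explicit: each update forwards one batch from the source domain and one from the target domain through the same network and combines the target regression loss, the source regression loss weighted by α, and the domain adaptation loss weighted by β (0.5 and 0.1 below, the best-performing ratios in Tables 3 and 4). The model interface returning (features, prediction), the MSE losses, and the generic domain_loss_fn argument are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
alpha, beta = 0.5, 0.1   # best-performing loss ratios from Tables 3 and 4

def training_step(model, src_x, src_y, tgt_x, tgt_y, optimizer, domain_loss_fn):
    """One TL update: both domains pass through the model, hence roughly 2x the compute."""
    optimizer.zero_grad()
    src_feat, src_pred = model(src_x)                     # forward pass, source-domain batch
    tgt_feat, tgt_pred = model(tgt_x)                     # forward pass, target-domain batch
    loss = (mse(tgt_pred, tgt_y)                          # target regression loss
            + alpha * mse(src_pred, src_y)                # source regression loss
            + beta * domain_loss_fn(src_feat, tgt_feat))  # domain adaptation loss
    loss.backward()
    optimizer.step()
    return loss.item()
```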
In summary, the comparative experimental results clearly show the performance differences of various model structures in the task of home appliance sales forecasting. The TransTLA model’s overall performance is significantly better than other classic and combined models, demonstrating excellent time series forecasting capabilities. This not only validates the effectiveness of its design but also provides a powerful tool for sales forecasting in the home appliance industry.

6. Conclusions

In this paper, we have proposed a novel TL approach with TCN-LSTM-Attention for household appliance sales forecasting in small towns, addressing the challenge of limited data availability. First, we cascaded the TCN and LSTM networks to capture the spatiotemporal relationships within historical sales data. Subsequently, based on these extracted features, we utilized the attention mechanism to assign different weights by calculating the relevance between query, key, and value, dynamically focusing on different points in the sequence and further sharpening the model's focus on key information. Finally, by introducing a redefined composite loss function, we applied the TL technique between the source domain and the target domain to address the challenges posed by limited data availability and regional variations. By transferring the knowledge gained from modeling sales data in large cities, we bolstered the predictive performance in small towns. The experimental results show that the proposed TransTLA approach outperforms traditional and combined methods such as RNN, LSTM, and CNN-LSTM-Attention in the context of small town household appliance sales forecasting.
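For readers who wish to reproduce the overall architecture, the following PyTorch sketch cascades a two-block dilated causal TCN, a two-layer LSTM, and a simple attention pooling followed by a linear head, using the hyperparameters of Table 2 (64 channels, kernel size 3, dilations 1 and 2, hidden size 64, dropout 0.2). It is a simplified illustration under our own naming, not the exact released implementation, and it returns the (features, prediction) pair assumed by the transfer learning training-step sketch above.

```python
import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    """One TCN residual block: dilated causal conv -> ReLU -> dropout, plus a skip path."""
    def __init__(self, c_in, c_out, k=3, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (k - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, k, padding=self.pad, dilation=dilation)
        self.relu, self.drop = nn.ReLU(), nn.Dropout(dropout)
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):                       # x: (batch, channels, time)
        out = self.conv(x)[..., :x.size(-1)]    # trim right padding -> causal output
        out = self.drop(self.relu(out))
        return self.relu(out + self.skip(x))

class TLASketch(nn.Module):
    """TCN -> LSTM -> attention pooling -> linear head (illustrative only)."""
    def __init__(self, n_feat=1, channels=64, hidden=64):
        super().__init__()
        self.tcn = nn.Sequential(CausalBlock(n_feat, channels, dilation=1),
                                 CausalBlock(channels, channels, dilation=2))
        self.lstm = nn.LSTM(channels, hidden, num_layers=2, batch_first=True, dropout=0.2)
        self.score = nn.Linear(hidden, 1)        # attention scores over time steps
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, n_feat)
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, channels)
        h, _ = self.lstm(h)                               # (batch, time, hidden)
        w = torch.softmax(self.score(h), dim=1)           # attention weights
        feat = (w * h).sum(dim=1)                         # weighted context vector
        return feat, self.head(feat)                      # features + sales forecast

model = TLASketch()
feat, pred = model(torch.zeros(8, 7, 1))         # 8 windows of 7 days, 1 feature each
```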
The proposed methodology has potential applications beyond household appliances, extending to other product categories such as electronics, automotive, and furniture. Accurate sales forecasting can help businesses optimize inventory management, improve supply chain efficiency, and enhance customer satisfaction. For governments, such forecasts can assist in economic planning, policy making, and resource allocation, thereby contributing to economic stability and growth.
However, the current study has some limitations. The model’s reliance on historical data presupposes the continuation of past patterns, which may not always be the case. Future research will investigate the adaptability of this model to diverse product categories and various market conditions, thereby broadening its applicability. Additionally, we intend to examine the sensitivity of the TL component in TransTLA to domain variations and explore strategies to mitigate the potential presence of negative transfer, particularly in scenarios where the source and target domains differ significantly, further improving the robustness and effectiveness of our approach.

Author Contributions

Conceptualization, Z.H.; methodology, Z.H.; software, Z.H.; validation, Z.H.; formal analysis, Z.H.; investigation, Z.H. and J.L.; writing—original draft preparation, Z.H.; writing—review and editing, Z.H. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, H.; Li, B. Optimization of WEEE Recycling Network for E-Wastes Based on Discrete Event Simulation. Procedia CIRP 2020, 90, 705–711. [Google Scholar] [CrossRef]
  2. Han, H.; Fan, X.; Zhang, Q.; Du, Y. Situation and Development Trend for Recycling of Electrical and Electronic Equipment. Comput. Integr. Manuf. Syst. 2022, 28, 2005. [Google Scholar] [CrossRef]
  3. Culcay, L.; Bustillos, F.; Vallejo-Huanga, D. Home Appliance Demand Forecasting: A Comparative Approach Using Traditional and Machine Learning Algorithms. Intell. Syst. Appl. 2024, 824, 457–473. [Google Scholar]
  4. Akdağ, T. Application of Sales Forecasting Techniques for A Company in The Home Appliances. Master’s Thesis, Marmara University, Istanbul, Türkiye, 2019. [Google Scholar]
  5. Jiang, H.; Ruan, J.; Sun, J. Application of Machine Learning Model and Hybrid Model in Retail Sales Forecast. In Proceedings of the 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), Xiamen, China, 5–8 March 2021; pp. 69–75. [Google Scholar]
  6. Hyndman, R.J.; Khandakar, Y. Automatic Time Series Forecasting: The Forecast Package for R. J. Stat. Softw. 2008, 27, 1–22. [Google Scholar] [CrossRef]
  7. Wang, J. A Hybrid Machine Learning Model for Sales Prediction. In Proceedings of the 2020 International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI), Sanya, China, 4–6 December 2020; pp. 363–366. [Google Scholar]
  8. Saravana Kumar, N.M.; Hariprasath, K.; Kaviyavarshini, N.; Kavinya, A. A Study on Forecasting Bigmart Sales Using Optimized Machine Learning Techniques. SITech 2020, 1, 52–59. [Google Scholar] [CrossRef]
  9. Loureiro, A.L.D.; Miguéis, V.L.; da Silva, L.F.M. Exploring the Use of Deep Neural Networks for Sales Forecasting in Fashion Retail. Decis. Support. Syst. 2018, 114, 81–93. [Google Scholar] [CrossRef]
  10. Li, X.; Du, J.; Wang, Y.; Cao, Y. Automatic Sales Forecasting System Based On LSTM Network. In Proceedings of the 2020 International Conference on Computer Science and Management Technology (ICCSMT), Shanghai, China, 20–22 November 2020; pp. 393–396. [Google Scholar] [CrossRef]
  11. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2021, 109, 43–76. [Google Scholar] [CrossRef]
  12. Schmidt, A.; Kabir, M.W.U.; Hoque, M.T. Machine Learning Based Restaurant Sales Forecasting. Mach. Learn. Knowl. Extr. 2022, 4, 105–130. [Google Scholar] [CrossRef]
  13. Chen, I.-F.; Lu, C.-J. Sales Forecasting by Combining Clustering and Machine-Learning Techniques for Computer Retailing. Neural Comput. Appl. 2017, 28, 2633–2647. [Google Scholar] [CrossRef]
  14. Cheriyan, S.; Ibrahim, S.; Mohanan, S.; Treesa, S. Intelligent Sales Prediction Using Machine Learning Techniques. In Proceedings of the 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE), Southend, UK, 16–17 August 2018; pp. 53–58. [Google Scholar]
  15. Shih, Y.-S.; Lin, M.-H. A LSTM Approach for Sales Forecasting of Goods with Short-Term Demands in E-Commerce. In Intelligent Information and Database Systems, Proceedings of the 11th Asian Conference, ACIIDS 2019, Yogyakarta, Indonesia, 8–11 April 2019; Nguyen, N.T., Gaol, F.L., Hong, T.-P., Trawiński, B., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 244–256. [Google Scholar]
  16. He, Q.-Q.; Wu, C.; Si, Y.-W. LSTM with Particle Swarm Optimization for Sales Forecasting. Electron. Commer. Res. Appl. 2022, 51, 101118. [Google Scholar] [CrossRef]
  17. Hiranya Pemathilake, R.G.; Karunathilake, S.P.; Achira Jeewaka Shamal, J.L.; Ganegoda, G.U. Sales Forecasting Based on AutoRegressive Integrated Moving Average and Recurrent Neural Network Hybrid Model. In Proceedings of the 2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), Huangshan, China, 28–30 July 2018; pp. 27–33. [Google Scholar]
  18. Kaunchi, P.; Jadhav, T.; Dandawate, Y.; Marathe, P. Future Sales Prediction For Indian Products Using Convolutional Neural Network-Long Short Term Memory. In Proceedings of the 2021 2nd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 1–3 October 2021; pp. 1–5. [Google Scholar]
  19. Liu, Y. Deep Learning for Enterprise Sales Prediction: Harnessing CNN-LSTM-Attention Model. In Proceedings of the 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 27–29 February 2024; pp. 746–750. [Google Scholar]
  20. L’Heureux, A.; Grolinger, K.; Capretz, M.A.M. Transformer-Based Model for Electrical Load Forecasting. Energies 2022, 15, 4993. [Google Scholar] [CrossRef]
  21. Staffini, A. A CNN–BiLSTM Architecture for Macroeconomic Time Series Forecasting. Eng. Proc. 2023, 39, 33. [Google Scholar] [CrossRef]
  22. Rathore, N.; Rathore, P.; Basak, A.; Nistala, S.H.; Runkana, V. Multi Scale Graph Wavenet for Wind Speed Forecasting. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 4047–4053. [Google Scholar]
  23. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  24. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473. [Google Scholar]
  25. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. In Domain Adaptation in Computer Vision Applications; Advances in Computer Vision and Pattern Recognition; Csurka, G., Ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 189–209. ISBN 978-3-319-58346-4. [Google Scholar]
  26. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  27. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Figure 1. The structure of the TransTLA model.
Figure 2. The structure of the dilated causal convolutional network.
Figure 3. The structure of the residual module.
Figure 4. The structure of the LSTM network.
Figure 5. The structure of the attention mechanism.
Figure 6. The training process for the TransTLA model.
Figure 7. Visualization of the temporally aligned sales dataset.
Figure 8. Comparison curves of the evaluation metrics under different ratios of source domain regression loss.
Figure 9. Comparison curves of evaluation metrics under different domain adaptation loss ratios.
Figure 10. Comparison curves of the evaluation metrics under different backtrack window sizes.
Figure 11. Home appliance sales forecast results under different prediction steps.
Figure 12. MAE comparison of the ablation experiments.
Figure 13. MAE comparison of comparative experiments.
Table 1. Related studies on sales forecasting.
Reference | Methods | Contributions | Drawbacks
[13] | Clustering + ML | Improving the accuracy of household appliance sales forecasting | Increased complexity, computationally intensive
[10] | CNN | Learning effective features from the historical sales data automatically and predicting household appliance sales | Low prediction accuracy, high data requirements
[15] | LSTM | Exploiting the temporal correlation among the historical sales data | Low prediction accuracy, longer training time
[16] | LSTM + PSO | Enhanced performance of sales forecasting with hyperparameter optimization | Risk of overfitting, potential local optima
[17] | ARIMA + RNN | Capturing both linear and nonlinear relationships of time series data | Complex implementation
[18] | CNN + LSTM | Using spatiotemporal information to improve prediction accuracy | High computational cost
[19] | CNN + LSTM + Attention | Addressing the limitations of traditional LSTM, particularly in scenarios influenced by small weights | Higher complexity and computationally expensive
Table 2. Model parameter settings in the experiment.
Parameter | Setting Value
Iterations | 150
Batch Size | 32
TCN Channels | 64
Kernel Size | 3
Kernel Stride | 1
Kernel Padding | 1
Residual Blocks | 2
Dilation Factor | 1, 2
LSTM Layers | 2
Hidden Layer Size | 64
Initial Learning Rate | 0.001
Learning Rate Growth | 0.0003
Learning Rate Decay | 0.75
Dropout Rate | 0.2
Table 3. Experimental results under different ratios of source domain regression loss.
α | R2 | MAE | RMSE | MAPE
0 | 0.483 | 17.677 | 27.229 | 0.438
0.1 | 0.816 | 10.804 | 16.361 | 0.229
0.2 | 0.815 | 10.415 | 16.389 | 0.210
0.3 | 0.812 | 10.327 | 16.511 | 0.212
0.4 | 0.826 | 9.721 | 15.923 | 0.180
0.5 | 0.844 | 9.295 | 15.048 | 0.171
0.6 | 0.845 | 9.786 | 15.024 | 0.179
0.7 | 0.816 | 10.657 | 16.349 | 0.229
0.8 | 0.800 | 10.736 | 17.018 | 0.225
0.9 | 0.802 | 10.742 | 16.920 | 0.233
1 | 0.766 | 12.128 | 18.123 | 0.279
Table note: The optimal results are highlighted in bold, and the suboptimal results are underlined. The same applies to subsequent tables and will not be repeated.
Table 4. Experimental results under different ratios of domain adaptation loss.
β | R2 | MAE | RMSE | MAPE
0 | 0.825 | 9.933 | 15.964 | 0.213
0.1 | 0.844 | 9.295 | 15.048 | 0.171
0.2 | 0.827 | 10.124 | 15.866 | 0.200
0.3 | 0.815 | 10.206 | 16.380 | 0.211
0.4 | 0.806 | 10.534 | 16.750 | 0.220
0.5 | 0.814 | 10.329 | 16.441 | 0.212
0.6 | 0.798 | 10.808 | 17.121 | 0.228
0.7 | 0.815 | 10.431 | 16.375 | 0.227
0.8 | 0.807 | 10.679 | 16.737 | 0.227
0.9 | 0.802 | 10.445 | 16.967 | 0.220
1 | 0.807 | 10.497 | 16.763 | 0.223
Table 5. Experimental results for different backtrack window sizes.
Backtrack Window Size | R2 | MAE | RMSE | MAPE
4 | 0.813 | 10.866 | 16.551 | 0.246
5 | 0.822 | 9.936 | 16.158 | 0.214
6 | 0.819 | 10.080 | 16.261 | 0.205
7 | 0.844 | 9.295 | 15.048 | 0.171
8 | 0.821 | 10.560 | 16.078 | 0.220
9 | 0.793 | 10.836 | 17.230 | 0.213
10 | 0.805 | 10.608 | 16.655 | 0.204
11 | 0.788 | 10.892 | 17.344 | 0.210
12 | 0.790 | 11.005 | 17.212 | 0.218
Table 6. Evaluation results of the ablation experiments.
Model | Metric | Step Size 1 | Step Size 2 | Step Size 3 | Step Size 4
TCN | R2 | 0.814 (0.011) | 0.660 (0.034) | 0.394 (0.048) | 0.419 (0.046)
TCN | MAE | 10.066 (0.251) | 15.017 (0.857) | 17.780 (0.839) | 20.827 (0.901)
TCN | RMSE | 16.453 (0.496) | 22.198 (1.103) | 29.604 (1.172) | 29.006 (1.133)
TCN | MAPE | 0.218 (0.009) | 0.435 (0.047) | 0.498 (0.044) | 0.620 (0.043)
LSTM | R2 | 0.802 (0.009) | 0.342 (0.024) | 0.220 (0.080) | −0.010 (0.021)
LSTM | MAE | 10.275 (0.240) | 20.534 (0.821) | 24.498 (2.131) | 30.697 (0.219)
LSTM | RMSE | 16.954 (0.380) | 30.931 (0.571) | 33.579 (1.631) | 38.256 (0.399)
LSTM | MAPE | 0.216 (0.015) | 0.594 (0.064) | 0.724 (0.049) | 0.857 (0.024)
TCN-LSTM | R2 | 0.820 (0.005) | 0.388 (0.072) | 0.185 (0.070) | −0.002 (0.075)
TCN-LSTM | MAE | 9.753 (0.287) | 19.215 (0.953) | 23.445 (0.915) | 30.325 (1.269)
TCN-LSTM | RMSE | 16.188 (0.212) | 29.787 (1.723) | 34.338 (1.460) | 38.082 (1.463)
TCN-LSTM | MAPE | 0.189 (0.014) | 0.514 (0.037) | 0.692 (0.038) | 0.839 (0.028)
TCN-Attention | R2 | 0.818 (0.009) | 0.654 (0.052) | 0.430 (0.031) | 0.404 (0.055)
TCN-Attention | MAE | 9.631 (0.433) | 14.234 (0.994) | 17.524 (1.694) | 19.745 (0.699)
TCN-Attention | RMSE | 16.256 (0.407) | 22.370 (1.653) | 28.737 (0.789) | 29.352 (1.333)
TCN-Attention | MAPE | 0.189 (0.014) | 0.354 (0.029) | 0.479 (0.071) | 0.607 (0.059)
LSTM-Attention | R2 | 0.803 (0.011) | 0.660 (0.024) | 0.412 (0.036) | 0.418 (0.023)
LSTM-Attention | MAE | 10.376 (0.427) | 14.228 (1.154) | 18.434 (0.649) | 20.662 (0.524)
LSTM-Attention | RMSE | 16.919 (0.452) | 22.214 (0.764) | 29.184 (0.888) | 29.034 (0.570)
LSTM-Attention | MAPE | 0.222 (0.021) | 0.333 (0.024) | 0.480 (0.038) | 0.579 (0.044)
TCN-LSTM-Attention | R2 | 0.813 (0.013) | 0.677 (0.027) | 0.424 (0.020) | 0.363 (0.061)
TCN-LSTM-Attention | MAE | 9.917 (0.494) | 13.629 (0.566) | 17.675 (0.768) | 21.880 (1.397)
TCN-LSTM-Attention | RMSE | 16.469 (0.579) | 21.660 (0.909) | 28.893 (0.503) | 30.356 (1.432)
TCN-LSTM-Attention | MAPE | 0.191 (0.017) | 0.348 (0.019) | 0.429 (0.036) | 0.582 (0.033)
TransTCN | R2 | 0.810 (0.013) | 0.706 (0.020) | 0.402 (0.029) | 0.415 (0.024)
TransTCN | MAE | 10.206 (0.609) | 13.679 (0.542) | 18.108 (0.695) | 21.568 (0.775)
TransTCN | RMSE | 16.605 (0.574) | 20.664 (0.674) | 29.430 (0.728) | 29.118 (0.588)
TransTCN | MAPE | 0.209 (0.022) | 0.351 (0.022) | 0.455 (0.024) | 0.654 (0.040)
TransLSTM | R2 | 0.806 (0.007) | 0.431 (0.012) | 0.201 (0.037) | −0.066 (0.028)
TransLSTM | MAE | 10.741 (0.282) | 19.337 (0.407) | 24.855 (0.807) | 31.135 (0.245)
TransLSTM | RMSE | 16.788 (0.300) | 28.762 (0.309) | 34.013 (0.792) | 39.296 (0.522)
TransLSTM | MAPE | 0.241 (0.015) | 0.506 (0.018) | 0.664 (0.026) | 0.814 (0.031)
TransTCN-LSTM | R2 | 0.820 (0.006) | 0.382 (0.071) | 0.207 (0.068) | −0.074 (0.045)
TransTCN-LSTM | MAE | 9.684 (0.152) | 18.820 (0.835) | 23.606 (1.367) | 31.024 (0.282)
TransTCN-LSTM | RMSE | 16.176 (0.266) | 29.937 (1.704) | 33.873 (1.460) | 39.441 (0.819)
TransTCN-LSTM | MAPE | 0.189 (0.012) | 0.479 (0.026) | 0.580 (0.016) | 0.789 (0.043)
TransTCN-Attention | R2 | 0.804 (0.009) | 0.682 (0.032) | 0.368 (0.052) | 0.424 (0.012)
TransTCN-Attention | MAE | 9.737 (0.441) | 13.383 (1.465) | 17.726 (0.762) | 19.868 (0.349)
TransTCN-Attention | RMSE | 16.865 (0.371) | 21.465 (1.055) | 30.241 (1.236) | 28.899 (0.305)
TransTCN-Attention | MAPE | 0.181 (0.021) | 0.297 (0.043) | 0.429 (0.034) | 0.602 (0.027)
TransLSTM-Attention | R2 | 0.808 (0.012) | 0.671 (0.024) | 0.475 (0.025) | 0.402 (0.023)
TransLSTM-Attention | MAE | 10.664 (0.422) | 13.712 (1.075) | 18.087 (0.693) | 20.745 (0.737)
TransLSTM-Attention | RMSE | 16.685 (0.503) | 21.864 (0.784) | 27.564 (0.644) | 29.431 (0.552)
TransLSTM-Attention | MAPE | 0.232 (0.018) | 0.336 (0.019) | 0.458 (0.037) | 0.583 (0.028)
TransTLA | R2 | 0.826 (0.009) | 0.689 (0.025) | 0.433 (0.029) | 0.423 (0.017)
TransTLA | MAE | 9.396 (0.197) | 13.339 (0.605) | 17.025 (0.528) | 19.551 (0.347)
TransTLA | RMSE | 15.890 (0.402) | 21.231 (0.860) | 28.649 (0.734) | 28.924 (0.416)
TransTLA | MAPE | 0.172 (0.004) | 0.293 (0.020) | 0.396 (0.025) | 0.587 (0.024)
Table note: Values are presented as “mean (standard deviation)” based on 10 independent runs.
Table 7. The MAE performance improvement of the TransTLA model over the ablation models (percentage).
Compared Model | Step Size 1 | Step Size 2 | Step Size 3 | Step Size 4 | Average
TCN | 6.66 | 11.17 | 4.25 | 6.13 | 7.05
LSTM | 8.55 | 35.04 | 30.50 | 36.31 | 27.60
TCN-LSTM-Attention | 5.25 | 2.13 | 3.68 | 10.64 | 5.43
Table 8. Evaluation results of comparative experiments.
Model | Metric | Step Size 1 | Step Size 2 | Step Size 3 | Step Size 4
ARIMA | R2 | 0.621 | 0.128 | −0.292 | −0.723
ARIMA | MAE | 15.312 | 23.685 | 26.785 | 35.012
ARIMA | RMSE | 23.403 | 35.503 | 43.206 | 49.908
ARIMA | MAPE | 0.378 | 0.645 | 0.670 | 0.842
XGBoost | R2 | 0.652 | 0.423 | 0.598 | 0.105
XGBoost | MAE | 12.944 | 18.499 | 18.336 | 23.398
XGBoost | RMSE | 22.588 | 29.060 | 24.316 | 36.279
XGBoost | MAPE | 0.258 | 0.419 | 0.446 | 0.561
SVR | R2 | 0.660 | 0.563 | 0.550 | 0.278
SVR | MAE | 12.043 | 14.216 | 17.279 | 20.656
SVR | RMSE | 22.312 | 25.287 | 25.713 | 32.587
SVR | MAPE | 0.229 | 0.298 | 0.398 | 0.445
RNN | R2 | 0.812 (0.008) | 0.425 (0.053) | 0.281 (0.042) | −0.038 (0.044)
RNN | MAE | 11.022 (0.246) | 19.158 (0.894) | 22.812 (0.761) | 30.894 (0.416)
RNN | RMSE | 16.545 (0.369) | 28.870 (1.342) | 32.258 (0.960) | 38.783 (0.816)
RNN | MAPE | 0.260 (0.019) | 0.538 (0.050) | 0.690 (0.044) | 0.840 (0.044)
CNN | R2 | 0.801 (0.013) | 0.644 (0.027) | 0.304 (0.071) | 0.372 (0.091)
CNN | MAE | 10.565 (0.399) | 14.552 (0.698) | 19.414 (1.013) | 20.621 (0.986)
CNN | RMSE | 17.006 (0.570) | 22.726 (0.884) | 31.723 (1.643) | 30.093 (2.131)
CNN | MAPE | 0.221 (0.026) | 0.366 (0.033) | 0.522 (0.064) | 0.632 (0.056)
CNN-LSTM | R2 | 0.816 (0.032) | 0.644 (0.026) | 0.436 (0.044) | 0.392 (0.044)
CNN-LSTM | MAE | 10.670 (1.130) | 14.688 (0.952) | 19.153 (0.733) | 20.293 (0.752)
CNN-LSTM | RMSE | 16.284 (1.338) | 22.694 (0.842) | 28.557 (1.093) | 29.662 (1.059)
CNN-LSTM | MAPE | 0.212 (0.043) | 0.336 (0.032) | 0.520 (0.053) | 0.633 (0.023)
CNN-LSTM-Attention | R2 | 0.797 (0.024) | 0.588 (0.058) | 0.365 (0.047) | 0.348 (0.033)
CNN-LSTM-Attention | MAE | 10.420 (0.490) | 15.392 (0.640) | 19.373 (0.698) | 21.419 (0.899)
CNN-LSTM-Attention | RMSE | 17.170 (1.027) | 24.436 (1.671) | 30.323 (1.147) | 30.738 (0.775)
CNN-LSTM-Attention | MAPE | 0.201 (0.015) | 0.353 (0.018) | 0.473 (0.038) | 0.622 (0.020)
TransCNN-LSTM-Attention | R2 | 0.813 (0.024) | 0.645 (0.033) | 0.374 (0.055) | 0.360 (0.037)
TransCNN-LSTM-Attention | MAE | 10.358 (0.567) | 14.498 (0.801) | 19.029 (1.022) | 20.829 (0.491)
TransCNN-LSTM-Attention | RMSE | 16.457 (1.040) | 22.688 (1.079) | 30.320 (1.409) | 30.434 (0.878)
TransCNN-LSTM-Attention | MAPE | 0.198 (0.009) | 0.301 (0.030) | 0.446 (0.030) | 0.583 (0.031)
TransTLA | R2 | 0.826 (0.009) | 0.689 (0.025) | 0.433 (0.029) | 0.423 (0.017)
TransTLA | MAE | 9.396 (0.197) | 13.339 (0.605) | 17.025 (0.528) | 19.551 (0.347)
TransTLA | RMSE | 15.890 (0.402) | 21.231 (0.860) | 28.649 (0.734) | 28.924 (0.416)
TransTLA | MAPE | 0.172 (0.004) | 0.293 (0.020) | 0.396 (0.025) | 0.587 (0.024)
Table note: Values are presented as “mean (standard deviation)” based on 10 independent runs.
Table 9. The MAE performance improvement of the TransTLA model over comparative models (percentage).
Compared Model | Step Size 1 | Step Size 2 | Step Size 3 | Step Size 4 | Average
CNN | 11.06 | 8.34 | 12.31 | 5.19 | 9.23
CNN-LSTM | 11.94 | 9.18 | 11.11 | 3.66 | 8.97
CNN-LSTM-Attention | 9.83 | 13.34 | 12.12 | 8.72 | 11.00
TransCNN-LSTM-Attention | 9.29 | 7.99 | 10.53 | 6.14 | 8.49
Table 10. Computational complexity of various models.
Model | FLOPs (M) | MACs | Params (k)
TCN | 25.4536 | 12.7191 M | 51.906
LSTM | 22.5556 | 66.56 k | 52.61
TCN-LSTM | 54.314 | 12.3259 M | 106.178
TCN-Attention | 27.0267 | 13.5055 M | 51.969
LSTM-Attention | 24.9152 | 1.2462 M | 65.026
TCN-LSTM-Attention | 56.6735 | 13.5055 M | 118.594
TransTCN | 50.9071 | 25.4382 M | 51.906
TransLSTM | 45.1113 | 133.12 k | 52.61
TransTCN-LSTM | 108.628 | 24.6518 M | 106.178
TransTCN-Attention | 54.0533 | 27.0111 M | 51.969
TransLSTM-Attention | 49.8303 | 2.4924 M | 65.026
RNN | 5.6965 | 66.56 k | 14.786
CNN | 21.3268 | 10.5083 M | 399.681
CNN-LSTM | 284.126 | 6.3795 M | 75.074
CNN-LSTM-Attention | 301.43 | 15.0303 M | 87.49
TransCNN-LSTM-Attention | 602.86 | 30.0605 M | 87.49
TransTLA | 113.347 | 27.0111 M | 118.594