1. Introduction
With the development of society, people’s demands for a comfortable living and working environment are becoming increasingly high, leading to a gradual increase in the share of building energy consumption in global energy use. To lower the environmental and economic burden caused by the increasing building energy demand, improving the energy efficiency of buildings would be an effective solution [
1]. Building energy prediction, also known as building energy estimating or forecasting, is critical for efficient energy use in buildings because it can help develop an energy-efficient building design, automate the functions of the building and its energy systems, and plan its energy distributions [
2]. With the progress and development of science and technology, researchers have made significant progress toward improving building energy prediction. Various methods, including physical models and data-driven models, have been proposed and verified. Among these forecasting methods, data-driven models have recently received great attention due to their convenient modeling methods and accurate forecasting capabilities [
3,
4,
5].
China is the most populated country in the world, and its building energy consumption accounts for a large global share. In 2021, there were around 4.82 million teachers and college students in China, accounting for 3.41% of the Chinese population [
6]. However, most of the university buildings were built in the last century, some of which are historical buildings symbolizing the cultural characteristics of the campus. This has an important negative impact on the effective collection of building-related information. Based on limited historical datasets of poor quality, it is a significant challenge to obtain reasonable and reliable campus building energy consumption prediction results.
Data quality usually determines the accuracy of a prediction task. To address data quality problems, data preprocessing can be carried out to filter useful data and remove outliers. Clustering, a data preprocessing method, is used to classify the data in the preprocessing stage to improve the accuracy of the prediction model, and it has been proven to be an effective method [
7]. Juan Sala used the clustering method to classify the historically similar days of household electricity consumption and used logistic regression and a random forest (RF) algorithm to predict electricity consumption [
8]. Yang used the k-shape algorithm to classify buildings according to their energy consumed per hour and per week and used the classification results for Support Vector Regression (SVR) model prediction. The experiment showed that this preprocessing significantly improves the accuracy of the prediction model [
9].
General data-driven models can be divided into two categories: regression models and machine learning algorithms. The typical regression models include Linear Regression (LR), Auto-Regressive Moving Average (ARMA), and Auto-Regressive Integrated Moving Average (ARIMA). Widely used machine learning models include Long Short-term Memory (LSTM), Regression Tree (RT), and SVR [
10,
11,
12]. Most of the existing research has improved the accuracy of building energy consumption prediction models by studying the optimization and update of traditional algorithms. For example, Karijadi used complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) to transform data into several components and then used RF and LSTM to predict a building’s energy [
13]. Wang Ran improved a new integration model (stacking model) to solve complex multifactor engineering tasks [
14]. Luo proposed an adaptive LSTM neural network that is better than the existing feedforward neural network and LSTM-based prediction models in accuracy and robustness [
15]. Jin proposed a novel hybrid AI-empowered forecasting model that combines singular spectrum analysis (SSA) and parallel long short-term memory (PLSTM) neural networks [
16]. Ding divided campus electricity consumption into two categories—“basic” and “variable”—and established a two-part building electricity forecasting model based on human behavior [
11].
In terms of the timescale of the forecast, electricity consumption forecasting can be divided into four categories: long-term (a year or more) prediction, medium-term (between a week and a year) prediction, short-term (hours to a week) prediction, and very short-term (from minutes to an hour) prediction [
17,
18,
19]. Long-term and mid-term forecasting, which need extensive historical data, can provide useful reference values for strategic planning. In contrast, short-term and very short-term forecasting support day-to-day energy management and need only the energy consumption data from the previous few days [
20]. The focus of this study is to use small amounts of historical data to predict the electricity consumption of campus buildings on the following day. It has important reference value for campus building electricity management personnel.
For the short-term prediction of campus buildings, Luo found that the genetic algorithm–deep neural network (GA-DFNN) model can obtain the predicted electricity consumption in an hour or a week by inputting weather conditions, historical data, and time indicators. Compared with the DFNN model, the GA-DFNN model was proved to have better prediction performance due to the optimization ability of GA [
21]. Reddy, A. performed short-term electricity consumption prediction using ensemble learning with a consideration of historical energy consumption values to forecast the energy consumption in the next 4 h [
22]. However, most colleges and universities do not have long-term effective historical electricity consumption data, and comprehensive climate data can only be obtained from local climate stations.
Based on the above review, researchers have studied building energy prediction extensively and have made considerable progress in refining prediction models. However, the energy management systems of many Chinese universities are relatively underdeveloped, and it is difficult to collect relevant data such as building parameters, indoor human behavior, and weather. The clustering method can be used to improve prediction accuracy by clustering the existing historical data and then optimizing the prediction model according to the clustering results. Therefore, in this study, the K-means algorithm is used to cluster the daily electricity consumption of buildings, and a new combination forecasting model for short-term forecasting is proposed based on its clustering results, which is practically operable for logistics groups and provides useful guidance for energy system managers in advance.
2. Methodology
2.1. Overall Flowchart of the Research
To analyze the electricity usage patterns of different campus buildings, the studied buildings are first classified into different categories so that targeted energy management policies can be made. Moreover, the accurate prediction of electricity consumption is the foundation of energy management. In searching for the optimal prediction model, it was found that the ARMA and LR models respond in clearly opposite ways to fluctuations in the time series. By combining the two models, the prediction error can be reduced despite time series fluctuations. Therefore, a combined forecasting model based on the clustering results of the daily electricity consumption of campus buildings is proposed.
The overall framework of this study is presented in
Figure 1. It mainly consists of three steps, i.e., data processing, time series prediction, and model evaluation:
Step 1: Filtering out abnormal data from the raw data and dividing 29 buildings into three categories using the K-means clustering method.
Step 2: Comparing the prediction results of ARMA and LR to propose an integrated model (ARMA-LR).
Step 3: Comparing the ARMA-LR model with mainstream prediction algorithms and evaluating the prediction accuracy of these models based on the MAE, RMSE, and MAPE metrics.
This study collected the daily power consumption data of 29 buildings in a university in Wuhan from 1 January 2020 to 31 December 2021, and a total of 21,141 daily power consumption records were obtained. Due to the influence of COVID-19, students were studying at home from 1 January 2020 to 31 August 2020; therefore, this part of the data was excluded, and outliers and abnormal data were removed.
After this data elimination, the first 100 days were used to divide the buildings into three categories with the clustering method. The initial 15 days of data served as training data to start the prediction, and the last 150 days served as validation data to test the reliability of the prediction model.
2.2. K-Means Clustering
Since the changing pattern of daily power energy consumption varies across building types, the relationship between power energy usage and building types needs to be clarified. Thus, appropriate energy management strategies can be formulated in terms of building categories. This paper adopts the K-means clustering method to classify 29 campus buildings into 3 categories according to daily electricity consumption, i.e., high-energy consumption buildings, medium-energy consumption buildings, and low-energy consumption buildings.
The K-means algorithm starts from initial centroids chosen in advance and calculates the Euclidean distance between each point and each centroid. Each point is assigned to its nearest centroid, the centroids are then recalculated, and the assignment is repeated until the classification result no longer changes [
23,
24,
25,
26].
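The iteration described above can be illustrated with a minimal sketch. It assumes each building is summarized by a single scalar (its mean daily consumption), which is a simplification, and the six sample values and the initial centroids are hypothetical rather than taken from the study.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Cluster scalar values with the K-means iteration, starting from given centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid
        # (in one dimension the Euclidean distance is the absolute difference).
        new_labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical mean daily consumption (kWh) of six buildings, not the study's data.
daily_mean = [95.0, 120.0, 480.0, 560.0, 1500.0, 2400.0]
labels, centroids = kmeans_1d(daily_mean, centroids=[100.0, 500.0, 1400.0])
print(labels, centroids)
```

With these sample values the assignments stabilize after the second pass, giving one low, one medium, and one high cluster of two buildings each.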
As shown in
Table 1, 29 buildings were clustered into three categories: high-energy consumption buildings, medium-energy consumption buildings, and low-energy consumption buildings. The three initial centroids are 1000, 300, and 500.
After iterative calculation, the three types of results of the three semester periods are different, as shown in
Table 1. The median electricity consumption of the three building types is approximately 100, 500, and 1400 for the low-, medium-, and high-energy consumption categories, respectively. The buildings are distributed almost evenly across the categories: 10 high-energy, 10 medium-energy, and 9 low-energy consumption buildings. High-energy-consuming buildings include 5 high-rise offices, 3 high-rise laboratories, and 2 high-rise teaching buildings, with a large number of administrative personnel, students, and teachers. Medium-energy-consuming buildings mainly consist of 7 mid-level office buildings and 3 mid-level teaching buildings. Low-energy-consuming buildings consist of 9 small laboratories.
2.3. Fluctuation
After eliminating the outliers in the energy consumption time series of buildings, their time series still fluctuates frequently. Generally, the coefficient of variation (Cv) can be used to evaluate the fluctuation level of a time series. We propose a new indicator, average percentage fluctuation (APF), which is a dimensionless indicator similar to the Cv, and both indicators can represent the degree of the volatility of a sequence. Compared to Cv, APF is more relevant to time series because the calculation of Cv does not consider the sequence order, which is included in APF. Therefore, APF can better describe the volatility of time series compared to Cv.
where $x_i$ is the daily electricity consumption of one day, $x_{i+1}$ is the daily electricity consumption of the next day, and $\bar{x}$ is the average value of the daily electricity consumption of the whole time series.
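To illustrate why an order-aware indicator behaves differently from Cv, the sketch below assumes one plausible form of APF: the mean absolute day-to-day change divided by the series mean. This form is an assumption inferred from the quantities defined above and is not confirmed as the exact definition used in the study. Cv is computed with the population standard deviation, which is also an assumption, and both sample series are hypothetical.

```python
import math


def coefficient_of_variation(series):
    """Cv: population standard deviation divided by the mean (order-independent)."""
    mean = sum(series) / len(series)
    variance = sum((x - mean) ** 2 for x in series) / len(series)
    return math.sqrt(variance) / mean


def average_percentage_fluctuation(series):
    """Assumed APF: mean absolute day-to-day change divided by the series mean."""
    mean = sum(series) / len(series)
    steps = [abs(nxt - cur) for cur, nxt in zip(series, series[1:])]
    return sum(steps) / len(steps) / mean


# Two hypothetical series holding the same values in a different order.
steady = [100.0, 110.0, 120.0, 130.0]
zigzag = [100.0, 130.0, 110.0, 120.0]

print(coefficient_of_variation(steady), coefficient_of_variation(zigzag))
print(average_percentage_fluctuation(steady), average_percentage_fluctuation(zigzag))
```

Because both series contain the same values, their Cv is identical, while the assumed APF is twice as large for the reordered series. This is the order sensitivity that the text attributes to APF.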
As shown in
Figure 2, the APF of high-energy consumption buildings, medium-energy consumption buildings, and low-energy consumption buildings is 7.95%, 11.7%, and 12.74%, respectively. Meanwhile, the Cv of the three building categories is 12.80%, 14.06%, and 14.12%, respectively. APF and Cv describe the same trend across the three building categories: both decrease as building energy consumption increases. This also indicates that the higher the energy consumption of a building, the smaller its fluctuation.
2.4. Combined Forecasting Model
The ARMA model and the LR model are both linear time series models, and both are feasible for the data patterns considered here. It was found in the experiment that the ARMA model is more sensitive to the fluctuation of the original time series and amplifies the effect of data fluctuation on the prediction results, which yields predictions that are too large or too small. Meanwhile, the regression model is less sensitive to the fluctuation of the original data; it responds slowly to changes and weakens the impact of fluctuation on the prediction results, which yields predictions that are slightly too small or too large. Therefore, these two models can complement each other in time series prediction. In this study, the two models are combined into the ARMA-LR model, as shown in Equation (3):

$$\hat{y}_{\mathrm{ARMA\text{-}LR}} = w_{1}\,\hat{y}_{\mathrm{ARMA}} + w_{2}\,\hat{y}_{\mathrm{LR}} \qquad (3)$$

where $w_{1}$ is the weight of the ARMA model, $\hat{y}_{\mathrm{ARMA}}$ is the prediction result of the ARMA model, $w_{2}$ is the weight of the LR model, and $\hat{y}_{\mathrm{LR}}$ is the prediction result of the LR model. Because the energy consumption patterns differ between building types, both $w_{1}$ and $w_{2}$ change with the building type. In the ARMA-LR model, the weights of the two models are fitted on the training set, using MAE as the evaluation indicator. In this study, the first 100 days of data for the three building types were used for training, and the weights of the model ($w_{1}$ and $w_{2}$) were determined through this training. The values of $w_{1}$ and $w_{2}$ are presented in Table 2.
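One way such MAE-based weights could be obtained is a simple grid search, sketched below. The constraint that the two weights sum to 1, the grid resolution, and the function names are assumptions made for illustration, and the sample forecasts are hypothetical values constructed so that the errors of the two models have opposite signs.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def combine(pred_arma, pred_lr, w_arma, w_lr):
    """Weighted sum of the two model forecasts, day by day."""
    return [w_arma * a + w_lr * l for a, l in zip(pred_arma, pred_lr)]


def fit_weights(actual, pred_arma, pred_lr, steps=100):
    """Grid-search the ARMA weight in [0, 1]; the LR weight is assumed to be 1 - w_arma."""
    best = None
    for i in range(steps + 1):
        w_arma = i / steps
        w_lr = 1.0 - w_arma
        score = mae(actual, combine(pred_arma, pred_lr, w_arma, w_lr))
        if best is None or score < best[0]:
            best = (score, w_arma, w_lr)
    return best


# Hypothetical training data: the ARMA forecast exaggerates each swing,
# while the LR forecast lags behind it by the same amount.
actual = [100.0, 120.0, 90.0, 110.0]
pred_arma = [110.0, 135.0, 75.0, 120.0]
pred_lr = [90.0, 105.0, 105.0, 100.0]

score, w_arma, w_lr = fit_weights(actual, pred_arma, pred_lr)
print(score, w_arma, w_lr)
```

In this constructed case the errors cancel at equal weights, so the combined MAE falls below that of either model alone. Real data would give a nonzero residual and building-type-specific weights.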
3. Results and Evaluation
3.1. Prediction Accuracy Evaluation Index
Mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) are the most commonly used indicators to evaluate the accuracy of prediction models. MAE and RMSE are usually used to compare the effectiveness of prediction models. MAPE is used to evaluate the accuracy of the prediction model. The smaller the MAPE, the higher the prediction accuracy of the evaluated model. The mathematical expressions of MAE, RMSE, and MAPE are shown in Equations (4)–(6), respectively:

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|F_t - A_t\right| \qquad (4)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(F_t - A_t\right)^{2}} \qquad (5)$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{F_t - A_t}{A_t}\right| \qquad (6)$$

where $t$ represents the time, $F_t$ represents the forecast made for period $t$, $A_t$ represents the actual observation at time $t$, and $n$ is the number of forecast periods.
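As a small worked example of the three indicators, the sketch below computes them with their standard definitions for a hypothetical three-day series. The function names and the sample values are illustrative only.

```python
import math


def mae(actual, forecast):
    """Mean absolute error, in the unit of the data."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error, in the unit of the data."""
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((f - a) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical daily consumption (kWh) and forecasts for three days.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

print(mae(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

Here the absolute errors are 10, 20, and 0 kWh, so MAE is 10 kWh, while MAPE is about 6.67% because the first two days are each off by 10% of their actual value. MAE and RMSE scale with consumption, whereas MAPE does not.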
3.2. Comparison of Prediction Results among Three Models
The prediction results of two representative buildings in each building cluster are selected. In the figures, the black, red, green, and blue lines represent the real daily electricity data and the prediction results of the ARMA model, LR model, and ARMA-LR model, respectively.
As presented in
Figure 3, the prediction results of the three prediction models of the first typical building within the high-energy consumption buildings category are presented. The average daily energy consumption of the building is 2482.19 kWh. Compared with real data, the prediction accuracy of the ARMA model is the highest among the three models during the first 50 days, with a long peak-to-trough span. The prediction accuracy of the LR model is the highest from 100 days to 150 days, with a short peak-to-trough span. Overall, the ARMA-LR model’s prediction result is not the best for each day, but its accuracy is the best over 150 days, and it reduces the impact of excessive errors in some of the forecast results of the ARMA and LR models.
As shown in
Figure 4, the prediction results of the second typical building of the high-energy consumption buildings are presented. The average daily energy consumption of the building is 1782.49 kWh. Unlike the first typical building, the real data of this building do not show long-lasting seasonal trends such as those in the initial 50 days of the first typical building. The peak-to-trough difference from 50 days to 100 days is larger than in the first 50 days and in the period from 100 days to 150 days, and the LR model performs better than the ARMA model. The peak-to-trough span is stable over 150 days.
As illustrated in
Figure 5, the prediction results of the first typical building of the medium-energy consumption buildings are presented. The average daily energy consumption of the building is 558.41 kWh. Compared with the other buildings, the prediction results of the three models are similar. Although the fluctuation amplitude is smaller than in other buildings, the fluctuation frequency is higher.
As shown in
Figure 6, the prediction results of the second typical building of the medium-energy consumption buildings are presented. The average daily energy consumption of the building is 727.24 kWh. Compared with the data from 50 days to 150 days, the real data of the first 50 days are stable, and the peak-to-trough span is smaller. Thus, the LR model performs better than the ARMA model in the initial days. It was found that, from 50 days to 150 days, the ARMA model performs better than the LR model in detecting seasonal trends. The ARMA-LR model can balance the advantages and disadvantages of the two models and obtain a prediction result with a relatively stable error.
As indicated in
Figure 7, the prediction results of the first typical building of the low-energy consumption buildings are presented. The average daily energy consumption of the building is 120.192 kWh. Compared with the high-energy consumption buildings and medium-energy consumption buildings, the peak-to-trough span of this building is consistently small. It was found that the ARMA model performs better than the LR model when the peak-to-trough difference is small, whereas the LR model performs better on the other days. Therefore, the ARMA-LR model performs better than either of the two models.
As shown in
Figure 8, the prediction results of the second typical building of the low-energy consumption buildings are presented. The average daily energy consumption of the building is 108.59 kWh. Compared with the high-energy consumption buildings and medium-energy consumption buildings, both typical buildings of the low-energy consumption category have an obvious seasonal trend. Moreover, while some of the peak-to-trough spans are large, others are small. The reason for this is that the factors influencing building energy consumption have the largest influence on low-energy consumption buildings compared to others. Therefore, the ARMA-LR model can mitigate the limitations of the ARMA model and the LR model.
3.3. Evaluation of the Combined Forecasting Model
The indicators of the above three models are shown in
Table 3,
Table 4 and
Table 5. It can be found that the overall MAPE of the high-energy consumption buildings is the lowest, but their overall MAE and RMSE are the highest. Low-energy consumption buildings have the highest overall MAPE, but their MAE and RMSE are the lowest. These results show that, when the relative error (MAPE) is taken as the evaluation standard, the prediction result of high-energy consumption buildings is the best, and the prediction result of low-energy consumption buildings is the worst. If the absolute deviation (MAE and RMSE) is used as the evaluation standard, the prediction result of high-energy consumption buildings is the worst.
Comparing the overall indicators of the three models indicates that the ARMA-LR model obtains the best prediction result and outperforms the ARMA model and the LR model. In terms of high-energy consumption buildings, the ARMA model is better than the LR model, and, in terms of medium-energy consumption buildings and low-energy consumption buildings, the LR model is better than the ARMA model. This conclusion is consistent with the previous conclusion from the prediction results and further confirms that the ARMA-LR model integrates the advantages of the two models. To verify the superiority of the ARMA-LR model, this paper also selects two classical prediction algorithms, SVR and LSTM, for comparison. SVR performs well, but LSTM performs poorly, which suggests that the deep learning algorithm performs poorly when little data is available. Overall, the ARMA-LR model proposed in this paper is superior to the other four algorithms in predicting the three different building types.
3.4. The Relationship between APF and MAPE
Figure 9,
Figure 10 and
Figure 11 present the comparison results of the APF and MAPE of the ARMA-LR prediction model for the three building clusters. The horizontal axis of each graph represents the index of each building among the 29 buildings. It is interesting to find that the APF result is very close to that of the MAPE. Within medium-energy consumption buildings and low-energy consumption buildings, the APF is a little larger than MAPE. For high-energy consumption buildings, the APF is generally smaller than MAPE.
The relationship between MAPE and APF for different types of buildings is obtained by using a univariate LR. Within three building clusters, the relationship between APF and MAPE is expressed in Equations (7)–(9), respectively.
$R^2$ represents the goodness of fit of the LR equation (Equation (10)). The $R^2$ of the high-energy buildings is 0.8671, the $R^2$ of the medium-energy buildings is 0.9318, and the $R^2$ of the low-energy buildings is 0.7527.
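The univariate fit and its goodness of fit can be sketched as follows, using ordinary least squares and the standard definition of the coefficient of determination. The paired APF and MAPE values are hypothetical and do not reproduce the coefficients reported in the study.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical per-building APF and MAPE values (percent), not the study's data.
apf = [6.0, 8.0, 10.0, 12.0]
mape = [7.0, 8.5, 11.0, 12.5]

slope, intercept = fit_line(apf, mape)
r2 = r_squared(apf, mape, slope, intercept)
print(slope, intercept, r2)
```

For these sample values the fitted line is approximately MAPE = 0.95 × APF + 1.2, with a coefficient of determination close to 0.99.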
The LR results are shown in
Figure 12,
Figure 13 and
Figure 14, and it can be seen in the results that the best result of the LR model in three categories is in the medium-energy consumption buildings. This is because the APF of medium-energy consumption buildings is high. The reason why the results of the LR model are not the best for low-energy consumption buildings is that the APF is more than 10%, and these buildings affect the result of LR.
Based on the relationship between MAPE and APF, it can be found that, as building energy consumption increases, the proportion of APF increases. It can be seen in
Section 2.3 that the higher the building energy consumption, the smaller the APF. Therefore, the smaller the fluctuation of the time series of the building itself, the higher the prediction accuracy.
Based on the above results, we can conclude that the relationship between APF and MAPE is linear. The ARMA model and the regression model—being, in essence, linear time series prediction methods—can represent the trend of the original time series. APF represents the fluctuation change of the original time series, that is, the deviation of its time series trend. MAPE represents the difference between the predicted time series trend and the true value. In principle, these two indices are equal. But due to the noise of the original data, there are some discrepancies between the two values. Therefore, APF can be used as an indicator to evaluate the short-term linear prediction potential of building energy consumption, which can be used during the training phase to evaluate the prediction potential of each building and reduce the workload of building energy management personnel.
4. Discussion
Some universities in China lack efficient building energy management systems, resulting in limited building energy consumption information. Most of them only collect energy consumption through electricity meters and have not established a comprehensive building energy management system. At present, research on energy consumption in university buildings is mainly based on more comprehensive datasets with sufficient building information [
11,
13,
14,
15,
16]. Therefore, based on the current situation of inadequate building energy management systems in Chinese universities, this study used the daily electricity consumption of campus buildings to achieve the short-term forecasting of building energy consumption. Twenty-nine university buildings measured using electricity meters in a certain university were selected as the sample, covering large, medium, and small office buildings, teaching buildings, experimental buildings, and other small, specialized buildings, which can represent all types of buildings in a university. Therefore, this study can help building managers in Chinese universities establish short-term prediction models for university buildings, use prediction data to conduct building energy consumption warnings, and formulate building energy consumption management policies.
When predicting building energy consumption, data preprocessing is essential to attaining final prediction accuracy. In data preprocessing, it is necessary to analyze the features of raw data, which can determine the necessity of subsequent predictions. The fluctuation of the time series of building energy consumption can reflect the potential of building energy consumption prediction. In general, it is often difficult to obtain accurate predictions for buildings with large fluctuations in time series. However, the traditional indices of standard deviation (SD) and Cv for describing sequence fluctuations cannot incorporate the temporal nature of the time series in the calculation process. To compensate for the limitation of SD and Cv, this study proposes a new indicator, APF, that includes the temporal sequence of building energy consumption. Compared with the traditional indicator Cv, APF includes a time series of building energy consumption, which can more significantly demonstrate the volatility of daily energy consumption changes in buildings. At the same time, when faced with the linear prediction of the time series of energy consumption in university buildings, this indicator has a direct linear relationship with the MAPE. Therefore, it can be used to evaluate the accuracy of short-term linear prediction models for the daily electricity consumption of buildings, which greatly increases the reliability of a building energy consumption prediction potential assessment.