1. Introduction
In recent years, global resources have become scarce and the environment has been deteriorating. To improve the global environment, many countries have committed to energy conservation and emissions reduction. The United States, Canada, the European Union, Japan, and other countries have promised to achieve carbon neutrality by 2050, and China has proposed achieving carbon neutrality by 2060. To achieve carbon neutrality, energy conservation in building operations is imperative. According to statistics from the International Energy Agency, the buildings and building construction sectors account for more than one-third of total global final energy consumption, and direct and indirect carbon emissions from electricity and commercial heat used in buildings have risen to 10 Gt, the highest level ever recorded. Building energy consumption increased from 118 EJ in 2010 to 128 EJ in 2019 [1]. In 2019, carbon emissions from the operation of buildings in the European Union reached 980 million tons [2], and carbon emissions from residential and commercial sources in the United States were 1.856 billion tCO2 [3], accounting for 36% of the total carbon emissions of the United States. International Energy Agency statistics indicate that residential energy consumption in various countries is at a high level. In 2019, the residential energy consumption was approximately 1.5 EJ in Canada, 1.8 EJ in Japan, and 1.6 EJ in the UK. In the same year, the total commodity energy consumption of China’s building operation was 1.02 billion tCO2, accounting for 21% of China’s total energy consumption in that year [4]. Accurate prediction of a building’s cooling and heating load, energy consumption, and solar energy availability for the next few days is an important way to achieve building energy savings. Because meteorological conditions are the main factors influencing building load, they are important inputs to load prediction models. Air temperature, relative humidity, wind speed, wind direction, cloud cover, and weather types for the next few days can be obtained from meteorological forecasts. However, hourly global solar radiation is not provided by weather forecasts. Hourly global solar radiation is a main factor affecting building load and an important input for building load prediction. Accurate prediction of hourly global solar radiation can also improve solar energy utilization. Therefore, predicting hourly global solar radiation accurately and simply is a problem worth discussing.
At present, some scholars have conducted relevant research on the prediction of hourly global solar radiation, and the data processing methods, model infrastructure, input parameters, and original data scale vary. With the rapid development of machine learning, its application in the field of hourly global solar radiation prediction has gradually increased. Machine learning models can automatically learn the relationship between the input and target parameters, and many studies have used these models to establish hourly global solar radiation prediction models.
Wang et al. [5] decomposed the daily average solar radiation intensity into intrinsic mode functions as inputs to daily solar radiation models. The hybrid model combining ensemble empirical mode decomposition (EEMD) and a regression model (RE) had the best performance, with a root mean square error (RMSE) of 1.135 when predicting daily solar radiation. Bou-Rabee et al. [6] developed a bidirectional long short-term memory (BiLSTM) model to predict the daily solar radiation. This model took historical time-series data as the input variable, and the RMSE values under sunny and cloudy conditions were 4.24 W/m2 and 20.95 W/m2, respectively. These studies confirmed the feasibility of machine learning for solar radiation prediction. However, the above methods are suitable only for predicting daily, rather than hourly, global radiation.
Jiménez-Pérez et al. [7] used a clustering algorithm to divide the types of days into four categories based on the solar global horizontal irradiance received in a period. According to the different types of days, support vector machines (SVM) and artificial neural networks (ANNs) were applied to establish hourly global solar radiation prediction models. The model parameters included air temperature, relative humidity, and atmospheric pressure. The results showed that the SVM model exhibited the best performance. When the input variables were the values of the meteorological parameters for the previous day, the RMSE was 147 W/m2. When the input variables were the forecasts of the meteorological parameters for the same day, the RMSE was 119 W/m2. Lan et al. [8] used the discrete Fourier transform to extract the frequency features of historical solar radiation data from five locations: Dalian, Weihai, Qingdao, Dafeng, and Shanghai. Principal component analysis was applied to identify the crucial frequency features, which were input into an Elman-based neural network to predict Qingdao’s solar radiation in the subsequent 24 h. The minimum RMSE value of the model appeared in autumn at 72.95 W/m2, while the maximum appeared in spring at 191.33 W/m2. Zhou et al. [9] proposed an attention-based transformer model to predict the global solar radiation for the next 10 h, where the input parameters were the global solar radiation of the previous 70 h. The RMSE varied from 63.54 to 81.28 W/m2. The above studies used historical solar radiation as the input parameter of the model, so the flexibility of these methods is limited by the need to acquire historical solar radiation data.
Kuk et al. [10] used the K-means clustering algorithm to cluster meteorological data and divided the weather data into three classes: sunny days, partially cloudy days, and cloudy or rainy days. Then, they established a scheme for the hourly prediction of solar irradiance based on weather classification and the SVM model. The input parameters used in the model were sunshine duration, cloud cover, cloud type, sunshine, relative humidity, precipitation, air temperature, and wind speed. The RMSE of the model under the three weather types was 49.26, 62.57, and 57.87 W/m2, respectively. Li et al. [11] conducted a sensitivity analysis to evaluate the contribution of each input parameter, and the five most significant climatic variables were selected as inputs of various multivariate adaptive regression spline (MARS) models. Hourly global solar radiation prediction models were established based on horizontal extraterrestrial solar radiation, sunshine duration, visibility, amount of cloud cover, and wind speed. The lowest RMSE of the models was 76.1 W/m2. Wang et al. [12] used correlation analysis to screen six parameters that were significantly related to the actual radiation level: atmospheric pressure, air temperature, relative humidity, precipitation, actual sunshine duration, and solar altitude angle. Using data from 2009 to 2019 from Haikou, an Elman neural network model was trained to predict the hourly solar radiation. The lowest RMSE of the model was 44.44 W/m2. All of these studies used the actual sunshine duration, for which forecasts are unavailable. The actual sunshine duration can only be measured; a predicted value cannot be obtained. Therefore, the methods proposed in these studies have some limitations in practical applications for predicting the hourly global solar radiation.
Meenu et al. [13] established a convolutional long short-term memory fusion network (CNN-LSTM) to predict solar radiation 15–150 s in advance. LSTM was applied to extract the time-series features of 10 past time steps of the solar radiation values, and the CNN was used to extract features from the cloud cover satellite images. The highest accuracy rate of the model was 99.23%. Francisco et al. [14] used satellite images, cloud data, direct solar radiation, and diffuse solar radiation data to predict the solar radiation level under different cloud types 90 min in advance. The model based on satellite data had the highest accuracy in predicting the radiation value under cumulus clouds, with an RMSE of approximately 100 W/m2. These methods require the collection and analysis of satellite images, and processing such images is relatively complicated. Therefore, this type of model is difficult to apply in general engineering practice.
In summary, the limitations of most existing machine learning models for hourly solar radiation prediction are as follows. (1) Some methods that use historical solar radiation as the input parameter cannot eliminate the dependence on historical data. Historical solar radiation must be obtained before the model can run. (2) Some hourly radiation prediction models require the actual sunshine duration as an input parameter, which refers to the time in a day when direct sunlight reaches the ground and can only be recorded by meteorological equipment. Because this parameter cannot be forecast, it limits the predictive use of these solar radiation models. (3) Some prediction models must acquire and analyze satellite images, which are difficult to obtain and analyze. Complex image processing limits the efficiency of radiation prediction models.
To solve these problems, this study proposes a simplified method to predict the hourly global solar radiation, whose input parameters are extraterrestrial solar radiation, weather types, cloud cover, air temperature, relative humidity, and time. The remainder of this paper is organized as follows. Section 2 introduces the method of the prediction model, including the input parameters of the model, the algorithm of the model, the pre-processing of the original data, and the evaluation indicators of the method. Section 3 presents a case study and the performance of each model. Section 4 analyzes the importance of the model input parameters for the prediction results. Finally, Section 5 summarizes the study.
2. Methods
Extraterrestrial solar radiation varies with geographic location, and the level of extraterrestrial radiation at the same location does not change significantly from year to year [15]. Extraterrestrial solar radiation reaches the surface of the Earth after being attenuated by the atmosphere; therefore, clouds in the atmosphere will cause the attenuation of extraterrestrial solar radiation through direct reflection or shortwave radiation [16]. Some studies have indicated that the presence of clouds and water vapor affects the level of global solar radiation [17,18,19]. The weather type and relative humidity can characterize, to a certain extent, the cloud amount and water vapor condition, respectively. Forecast values of weather type and relative humidity are also easily obtained. The air temperature changes after the ground receives solar radiation. Relevant studies have also shown a relationship between air temperature and solar radiation [18,20]. According to the law of relative motion between the Sun and the Earth, global solar radiation varies over the course of a day and over the course of a year, so time is also one of the factors affecting global solar radiation. In summary, combined with the research summarized in Section 2.2 and the input parameters of the solar radiation prediction model summarized in the related literature review [21], the input parameters preliminarily selected in this study are extraterrestrial solar radiation, weather types, cloud cover, air temperature, relative humidity, date, and hour.
Three typical algorithms were selected to establish hourly global solar radiation prediction models to demonstrate the feasibility of the simplified method. RMSE and relative error (RE) were applied as model evaluation indicators, and the algorithm structure with the highest prediction accuracy was selected. Then, the Shapley additive explanations (SHAP) model was used to analyze the influence of the input parameters on the output value in order to further simplify the model. Figure 1 shows the flowchart for constructing the hourly global solar radiation prediction model.
2.1. Calculation Method of Extraterrestrial Solar Radiation
Extraterrestrial solar radiation is affected only by the relative position of the Sun and Earth. Therefore, the extraterrestrial solar radiation at a given place throughout a year can be calculated from known parameters such as longitude, latitude, date, and hour. The calculation formula is as follows [22]:
$$I_0 = I_{sc}\,E_0\,(\sin\varphi\,\sin\delta + \cos\varphi\,\cos\delta\,\cos\omega) \qquad (1)$$
where $I_0$ is the extraterrestrial solar radiation on a horizontal surface, W/m2; $I_{sc}$ is the solar constant, taken as 1367 W/m2; φ is the geographical latitude of the site; $E_0$ is the eccentricity correction factor of Earth’s orbit [23]; δ is the solar declination [24]; ω is the hour angle [25].
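The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: it assumes the standard horizontal-surface form (solar constant times eccentricity factor times the cosine of the zenith angle), and it substitutes two common textbook approximations for the eccentricity correction factor and the solar declination, which may differ from the expressions in the cited references. The function name and arguments are hypothetical.

```python
import math

SOLAR_CONSTANT = 1367.0  # W/m^2, the value quoted in the text


def extraterrestrial_radiation(day_of_year, solar_hour, latitude_deg):
    """Instantaneous extraterrestrial radiation on a horizontal surface, W/m^2."""
    # Assumed approximation for the eccentricity correction factor.
    e0 = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    # Assumed approximation (Cooper's equation) for the solar declination, in radians.
    delta = math.radians(23.45) * math.sin(2.0 * math.pi * (284 + day_of_year) / 365.0)
    # Hour angle: 15 degrees per hour away from solar noon.
    omega = math.radians(15.0 * (solar_hour - 12.0))
    phi = math.radians(latitude_deg)
    cos_zenith = (math.sin(phi) * math.sin(delta)
                  + math.cos(phi) * math.cos(delta) * math.cos(omega))
    # A negative cosine means the Sun is below the horizon, so clamp to zero.
    return max(0.0, SOLAR_CONSTANT * e0 * cos_zenith)


# Example: solar noon on 21 June at roughly the latitude of Lanzhou (about 36 N).
print(round(extraterrestrial_radiation(172, 12.0, 36.05), 1))
```

Because every input is known in advance for any site and time, this feature can be precomputed for the whole forecast horizon, which is what makes it attractive as a model input.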
2.2. Machine Learning Algorithms
A review by Cyril et al. [26] showed that artificial neural networks (ANNs) have been the most popular algorithms used in solar radiation prediction in recent years. SVM and K-means algorithms have gradually been applied in this field, whereas methods such as boosting and regression trees are rarely used. Therefore, a back propagation (BP) network, SVM, and light gradient boosting machine (LightGBM) were used to establish models for estimating the hourly global solar radiation to test whether the simplified prediction method proposed in this study is feasible. These three models are briefly described below.
2.2.1. BP Network
The main structure of the BP network includes an input layer, a hidden layer, and an output layer. The output of each neuron is determined by the input value, activation function, and threshold [27]. The learning process of the BP network consists of the forward propagation of the signal and backward propagation of the error [28]. The BP network has strong self-adaptation and good fault tolerance. However, it trains slowly and easily falls into local minima.
Theoretically, the three-layer BP network (with one hidden layer) shown in Figure 2 can realize any nonlinear mapping with a sufficient number of neurons [29]. The number of neurons in the hidden layer (m) can be determined according to Formula (2) [30,31]. The parameters of the BP network in this study are shown in Table 1.
$$m = \sqrt{n + l} + k \qquad (2)$$
where n is the number of neurons in the input layer; l is the number of neurons in the output layer; k is a constant between 1 and 10.
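As a quick illustration of how the hidden-layer size rule narrows the search, the sketch below assumes the common empirical form m = sqrt(n + l) + k with k running from 1 to 10, and enumerates the resulting candidates. The function name, the rounding to the nearest integer, and the 7-input, 1-output configuration are assumptions made for this example, not values taken from Table 1.

```python
import math


def hidden_neuron_candidates(n_inputs, n_outputs):
    """Candidate hidden-layer sizes m = sqrt(n + l) + k for k = 1..10."""
    base = math.sqrt(n_inputs + n_outputs)
    # Rounding to the nearest integer is an assumption; the rule itself
    # only gives the range of the constant k.
    return [round(base + k) for k in range(1, 11)]


# Hypothetical configuration: 7 input neurons and 1 output neuron.
print(hidden_neuron_candidates(7, 1))
```

With these assumed sizes the rule yields candidates from 4 to 13 neurons; the final value would normally be picked by comparing validation error across the candidates.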
2.2.2. SVM
SVM is a prediction algorithm based on statistical learning principles. The basic principles of the SVM algorithm are structural risk minimization and the Vapnik–Chervonenkis dimension, and it can handle both linear and nonlinear data. The mechanism of SVM is to find an optimal classification hyperplane ($w \cdot x + b = 0$) that can not only guarantee classification accuracy, but also maximize the distance between the closest vectors and the hyperplane [32]. In Figure 3, the sample points lying on parallel lines L1 and L2 are the support vectors. The SVM model can avoid overlearning and has a strong generalization ability and fast classification speed. It has significant advantages in solving nonlinear and high-dimensional pattern recognition problems; however, it is more sensitive to missing values.
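To make the role of the closest samples concrete, the following sketch computes the geometric distance from each sample to a hyperplane written in the standard form w·x + b = 0 and reports the smallest one, i.e. the margin. In the standard SVM formulation the preferred hyperplane is the one for which this margin is largest. The sample points and the two candidate hyperplanes are made-up toy values, and the function names are hypothetical.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def margin(w, b, points):
    """Smallest distance from any sample to the hyperplane."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


# Toy, made-up samples from two classes.
samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

# Two candidate separating hyperplanes; the first leaves a wider margin.
print(margin((1.0, 1.0), -3.0, samples))
print(margin((1.0, 0.0), -1.5, samples))
```

The samples that attain the minimum distance are the ones that play the role of support vectors in this toy setting.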
2.2.3. LightGBM
LightGBM is based on the structure of the decision tree and the boosting algorithm. A decision tree is a single classification algorithm, whose principle is to approximate discrete function values. The boosting algorithm is a commonly used ensemble learning algorithm that converts weak classifiers into strong ones through iterations [33]. LightGBM supports efficient parallel training and improves the calculation speed mainly through histogram optimization and the depth-limited leaf-wise tree growth strategy. As shown in Figure 4, the histogram algorithm divides the floating-point values into K ranges and constructs a histogram of width K. When traversing the data, only the discretized values need to be indexed, so the number of calculations needed to search for the optimal split point is reduced and the calculation speed is improved. As shown in Figure 5, the depth-limited leaf-wise tree growth strategy finds the leaf that has the greatest splitting gain and then splits it. LightGBM supports parallel training with high accuracy and can handle large amounts of data. Nevertheless, it is sensitive to noise, and when searching for the optimal solution it relies on the optimal split variable rather than considering all characteristics of the data.
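The idea behind the histogram optimization can be illustrated with a deliberately simplified sketch: continuous feature values are discretized into K equal-width bins, per-bin statistics are accumulated in one pass, and only the K - 1 bin boundaries are evaluated as split candidates. This is not LightGBM's actual implementation; the equal-width binning, the use of raw target sums with a squared-error gain, and the function name are all assumptions made for illustration.

```python
def best_histogram_split(values, targets, k):
    """Return (threshold, gain) of the best split found on K equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return None  # all feature values identical, nothing to split
    counts = [0] * k
    sums = [0.0] * k
    # One pass over the data: each value only contributes to its bin.
    for v, t in zip(values, targets):
        b = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        counts[b] += 1
        sums[b] += t
    total_n, total_s = len(values), sum(sums)
    best = None
    left_n, left_s = 0, 0.0
    for b in range(k - 1):  # candidate split placed after bin b
        left_n += counts[b]
        left_s += sums[b]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Variance-reduction gain for a squared-error objective.
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - total_s ** 2 / total_n
        if best is None or gain > best[1]:
            best = (lo + (b + 1) * width, gain)
    return best


# Toy, made-up data with an obvious jump between 3 and 10.
print(best_histogram_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 5, 5, 5], 4))
```

The split search costs a scan over K bins rather than over every distinct feature value, which is the source of the speed-up described above.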
2.3. Model Evaluation Indicators
The RMSE and relative error (RE) were used to evaluate the performance of the models. RMSE is the square root of the mean of the squared deviations between the predicted and actual values, which measures the error between the real and predicted values. The RE is the ratio of the absolute error to the real value, which reflects the reliability of the prediction. In this study, the RMSE was applied to evaluate the overall prediction results of the model, and the RE was used to evaluate the coincidence between the predicted and actual values. The smaller the RMSE and RE values, the higher the model accuracy. The calculation formulae are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad (3)$$
$$\mathrm{RE} = \frac{\left|\hat{y}_i - y_i\right|}{y_i} \qquad (4)$$
where $\hat{y}_i$ is the output value of the prediction model, W/m2; $y_i$ is the actual solar radiation value, W/m2; n is the total number of data points.
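The two indicators can be computed as in the following sketch, which follows the verbal definitions given above (root of the mean squared deviation, and absolute error divided by the actual value). The function names and the sample numbers are hypothetical, and skipping hours whose actual radiation is zero is an assumption made here to avoid division by zero, not something stated in the text.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def relative_errors(predicted, actual):
    """Per-sample relative error |predicted - actual| / actual."""
    # Samples with zero actual radiation (e.g. night hours) are skipped;
    # this handling is an assumption of the sketch.
    return [abs(p - a) / a for p, a in zip(predicted, actual) if a != 0]


# Made-up values in W/m^2, for illustration only.
predicted = [110.0, 190.0]
actual = [100.0, 200.0]
print(rmse(predicted, actual))
print(relative_errors(predicted, actual))
```

RMSE gives one number for the whole test set, while the list of RE values can be turned into a cumulative distribution to show how often the prediction stays within a given tolerance.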
4. Analysis of Input Variables
Through the case study, it can be observed that the LightGBM model had the best prediction performance among the three models. Therefore, based on this algorithm, the feature importance of each input parameter in the prediction model was analyzed. Since the LightGBM model is a black-box model, it is impossible to directly know the internal calculation process of the model and to intuitively display the influence of each input parameter on the model operation and prediction results. Therefore, a suitable method is required to explain it, and the SHAP model proposed by Lundberg and Lee can meet the needs of explaining the black-box model. The SHAP model is an additive explanatory model based on cooperative game theory. Its core is to calculate the SHAP values of each feature of the model and summarize the contribution of the feature to the predictive ability of the model. Traditional feature importance ranking cannot judge the relationship between the feature and the output result, whereas the SHAP model can intuitively reflect the impact of each feature on the predicted value and the positive or negative impact. Therefore, the SHAP model was applied in this study to explain the hourly global solar radiation prediction model based on the LightGBM algorithm.
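The cooperative game theory idea underlying SHAP can be shown with a brute-force Shapley value computation on a toy model. This is a minimal sketch, not the SHAP implementation used in the study: it enumerates every coalition of features, which is only feasible for a handful of inputs, and it represents an "absent" feature by a fixed baseline value, which is a simplifying assumption. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features outside the coalition are replaced by their baseline value.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(v):
    # Hypothetical model with two additive terms and one interaction term.
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


print(shapley_values(toy_model, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

The contributions add up to the difference between the model output at the sample and at the baseline, which is the additive property that lets SHAP values be read as signed contributions to a single prediction.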
4.1. Significance Analysis
Figure 9 summarizes the feature importance with a density scatter plot. The ordinate is the name of each input parameter, arranged in descending order of the mean absolute SHAP value. In the figure, the weather types are encoded as w_i (i from 0 to 5), as listed in Table 2. The abscissa is the SHAP value, and each point in the figure represents one sample. The redder the color, the larger the feature value; the bluer the color, the smaller the feature value. Figure 10 shows the ranking of the features by their mean absolute SHAP values. The ranking of the feature importance in this figure is the same as that in Figure 9. Figures 9 and 10 show that the importance of the input parameters in the LightGBM model, from high to low, is as follows: extraterrestrial solar radiation, hour, relative humidity, cloud cover, date, air temperature, and weather types.
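The ranking criterion used in these figures, the mean absolute SHAP value per feature, is simple to reproduce once a matrix of SHAP values (one row per sample, one column per feature) is available. The sketch below uses made-up numbers and hypothetical names purely to show the computation; it does not reproduce the values behind Figures 9 and 10.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Sort features by the mean absolute SHAP value over all samples."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up SHAP values for three samples and three features.
names = ["relative_humidity", "extraterrestrial_radiation", "hour"]
rows = [
    [-5.0, 120.0, 30.0],
    [8.0, -90.0, -45.0],
    [-2.0, 60.0, 15.0],
]
for name, score in rank_by_mean_abs_shap(rows, names):
    print(name, score)
```

Taking the absolute value before averaging matters: a feature whose contributions are large but change sign across samples would otherwise appear unimportant.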
The average SHAP value of extraterrestrial solar radiation was the largest, indicating that this parameter had the greatest impact on the output value of the model. The red sample points were distributed on the side with positive SHAP values, suggesting that this parameter positively affects the output value of the model. Hour is the second most significant factor affecting the predicted value of the solar radiation. When the data point was red, most of the SHAP values were less than 0, and when the data point was blue, the SHAP value was greater than 0. This indicates that the solar radiation reached its maximum after noon and then gradually weakened. This phenomenon shows that the solar radiation first increased and then decreased with time.
Relative humidity and cloud cover were negatively correlated with the SHAP value; that is, lower relative humidity and cloud cover correspond to higher predicted hourly global solar radiation. For the date, the red and blue sample points were uniformly distributed and were concentrated in the area where the SHAP value was zero, showing no obvious trend. The reason for this phenomenon may be that the actual solar radiation is highest in the middle of the year and decreases in the early and late parts of the year. The air temperature positively affected the predicted value of solar radiation, and the solar radiation increased with an increase in temperature.
Weather type had no remarkable effect on the model. Sunny and light snow were relatively significant weather types that affected the level of global solar radiation. When the weather was sunny, the SHAP value was positive, indicating that sunny weather had a positive effect on global solar radiation. When the weather was light snow, the SHAP value was negative. This indicates that the global solar radiation level decreased when snow was present. However, the SHAP values for overcast, cloudy, light rain, and moderate rain were mainly concentrated at 0, which did not significantly affect the prediction results of the LightGBM model.
4.2. Simplification of the Model
To simplify the model, we attempted to eliminate the characteristic parameters of weather types, establish and test the LightGBM model again, and compare the test results of the model before and after the simplification.
Table 6 shows the running results of the LightGBM model after excluding the weather types.
From the above analysis, weather types were the least important of the input parameters, which is why they were chosen for elimination. After eliminating the weather types, the RMSE of the LightGBM model was 135.2 W/m2, which was 9.1 W/m2 higher than that with weather types included.
Figure 11 shows the cumulative probability distribution curves of the RE before and after excluding the weather types. It can be observed that the RE distributions of the two models were not significantly different.
The above results confirm the analysis results of the SHAP model; that is, the weather types were the least significant to the LightGBM model. There was no significant change in the performance of the LightGBM model with and without weather types. Therefore, when using the LightGBM model to predict the hourly global solar radiation, the input parameters can be simplified to six: extraterrestrial solar radiation, cloud cover, air temperature, relative humidity, date, and hour.
5. Conclusions
In summary, the methods for predicting hourly global solar radiation proposed in this study can achieve a satisfactory performance. The RMSE of the BP network, SVM, and LightGBM were 138.7 W/m2, 135.5 W/m2, and 126.1 W/m2, respectively, showing that the LightGBM model exhibited the best performance.
Based on the SHAP analysis results of the LightGBM model, weather types were not the main factors that affected the prediction result of the model. The accuracy of the model did not change significantly after excluding the weather types. Therefore, the input parameters of the LightGBM model were simplified to extraterrestrial solar radiation, cloud cover, air temperature, relative humidity, date, and hour.
In conclusion, this method is applicable to engineering applications that need to predict the hourly global solar radiation and provides a convenient and effective method for research and engineering in building load calculation, energy consumption prediction, solar energy utilization, etc. Unfortunately, owing to the limitation of data sources, the model was validated only in Lanzhou and still needs to be extended to and verified in other regions.