1. Introduction
Solar thermal or electrical systems require high quality solar radiation measurement instruments in order to accurately measure solar energy received on the plant. Poor quality data or too short data series can generate errors in plant design, performance, and production forecasting, negatively impacting return on investment.
Unfortunately, measures of solar radiation are sparse and inaccurate over the world; there are still large areas without any solar radiation observations [
1]. Investments and maintenance costs for each measurement site are not negligible and even in industrialized countries, the national network often consists in a relatively small number of solar radiation stations [
2]; and the measurement quality varies from a network to another, often by lack of maintenance and calibration. The measuring devices’ price is an important part of the process cost of collecting solar data, especially for non-profit institutions, such as schools or universities [
2].
The amount of meteorological stations measuring solar irradiance through the world is difficult to count because various sources give different information [
3]. Even so, only 1000 continental stations around the world are measuring solar radiation [
4].
In these conditions, it is interesting to look for some relations between sparse solar radiation data and more measured meteorological data as temperature, humidity, and wind speed. Satellite observations are used for determining the solar irradiation on the ground, with a good accuracy, but the time step of the estimated data is relatively large (minimum hourly).
A bibliographical study showed [
5] that artificial neural networks (ANN) were developed between meteorological parameters and solar irradiations, but generally only for averaged values of solar irradiations (on a monthly or annual basis); today, data with a short time granularity (minute, 5-min) must be known for interesting applications in solar energy. In the first part of this paper, 10 meteorological inputs (measured or calculated) are available and then 2
10 − 1 = 1023 combinations of input data are possible, from 10 ANN with only one input to one ANN with 10 inputs. All these combinations are tested and the best configurations are discussed.
The solar panels are rarely fixed horizontally and the global tilted irradiation (GTI) is rarely measured and must be estimated from global horizontal irradiation (GHI), more often available; using pyranometers with various inclinations is costly and their maintenance is constraining. Thus, it is useful to develop accurate methods for determining GTI from only GHI. This objective is difficult to reach with conventional physical relations because the sky anisotropy makes the modeling of the sky diffuse radiation difficult [
6,
7,
8]. ANN methods generally realize the same conversion with an improved adequacy (generally, at an hourly scale), and they outperform [
9,
10,
11,
12,
13] the traditional methods due to the inherent non-linearity in solar radiation data. In this work, an ANN model is developed and optimized for “tilting” 5-min time step solar data.
The stochastic and intermittent behavior of the solar resources poses numerous problems for the electricity grid operator [
5] and limits the future development of the phovoltaic (PV) and concentrated solar power (CSP) plants. To improve the integration of such systems, the solution consists in introducing energy storages and developing smart grids as well as implementing production and consumption forecasting.
The forecasting of the output power of solar energy systems is required for good operation of the power grid and for optimal management of electrical flows [
3]. It is essential to estimate the energy reserves, to schedule the power systems, to optimally manage the storage, and to trade in the electricity market [
14,
15,
16]. Thus, predicted and anticipated events are easier to manage. Electricity must be produced by CSP and/or PV plants; the first ones convert the direct normal irradiation (DNI) into heat through focusing receivers and PV ones enable direct conversion of GHI into electricity through semiconductor devices [
3,
17]. The literature shows that the most efficient methods for a forecast at a short time horizon from one hour (h+1) to six hours (h+6) are time series analysis and artificial intelligence methods [
18]. If a large literature exists about the GHI forecasting [
3,
15,
17,
18,
19], this literature is poorer concerning DNI being more difficult to predict because its variations are deeper and more frequent [
20,
21]. ANN predictive models are implemented to forecast hourly GHI and DNI from h+1 to h+6.
Section 2 presents the data used in this paper (Bouzareah, Algeria for estimation purposes and Odeillo, France for forecasting purposes), the preprocessing used on this data, and the calculated error metrics to estimate the accuracy of ANN models.
Section 3 gives some information on the ANN implementation for estimation and forecasting.
In
Section 4 and
Section 5, the main results on the estimation of 5 min-GHI from other meteorological parameters and of the 5 min-GTI from GHI measurements are presented.
Section 6 shows the results of the ANN forecasting of hourly GHI and DNI for a time horizon from h+1 to h+6.
2. Meteorological Stations and Data
2.1. The Meteorological Stations
The two first studies were realized using meteorological data measured in the meteorological station belonging to the Renewable Energies Development Centre (CDER) located in Bouzareah near Algiers (latitude: 36.8° N; longitude: 3.17° E) at an altitude of 347 m. The site is characterized by a Mediterranean climate with dry and hot summers and damp and cool winters. The data were measured each second and stored each 5 min from April 2011 to April 2013 (24 months of 5-min data). The measured data and the calculated astronomical data (horizontal extraterrestrial irradiation (EHI), solar declination, δ, and zenith angle,
[
22]), are presented in
Table 1.
The tilted solar data were measured for a 36.8° tilt angle equal to the latitude of Bouzareah (optimal angle for a maximum annual irradiation). The data basis contains 12 5-min parameters, 9 measured and 3 calculated. Each data was previously verified in order to extract outliers.
The forecasting work was realized from GHI and DNI data provided by the PROMES laboratory (CNRS UPR 8521) located in the south of France in Odeillo (Pyrénées Orientales, France, 42°29 N, 2°01 E, 1550 m asl), the station is located in the mountains, at about 100 km from the Mediterranean sea and presents an often high nebulosity. The solar data are measured and stored with a 1 min time granularity. This meteorological station is in altitude, the climate is very perturbed, the rainfall continues to be present during the driest months, the variability of solar radiation is high, and thus its forecasting is more difficult to realize. Two years of hourly data were available i.e., 17 520 data, for both GHI and DNI.
2.2. Cleaning and Preprocessing
For Bouzareah, each 5-min data were first verified to extract outliers or missing data. Then, the data during which the sun rises or sets were deleted because the mask effect of the environment and the no-reliable response of pyranometers at a high zenith angle (cosine effect) introduced some errors. Thus, over the 2 years, 75674 validated 5-min data were available for each parameter.
For the Odeillo’s data, an automatic quality control used in the frame of the GEOSS project (Group on Earth Observation System of System) [
23] was applied. Before introducing the solar data into the machine learning process, the data were cleaned and filtered.
For forecasting purposes, it is common to filter out the data to remove night hours and to conserve them only between sunrise and sunset. As for Bouzareah, the data near sunset and sunrise are sources of errors and a pre-processing operation was applied based on the solar elevation: Solar radiation data for which the solar elevation is lower than 10° were removed [
15,
24]. Two years of hourly data were used in this study. After cleaning and filtering, the total number of hourly data for each solar component (GHI and DNI) was 10559 (about 60% of the data were not used (2% for outliers’ data and 58% for sun elevation less than 10°)). These solar data were then transformed into stationary data by a method described in
Section 6.
2.3. Statistical Index for Accuracy Evaluation
There are no well-defined error metrics standards, which makes the forecasting and estimation methods difficult to compare [
25]. A benchmarking exercise was realized within the framework of the European Actions Weather Intelligence for Renewable Energies (WIRE) [
26], with the objective to evaluate the state of the art concerning models’ performances for short term renewable energy forecasting. They concluded that: “More work using more test cases, data and models needs to be performed in order to achieve a global overview of all possible situations. Test cases located all over Europe, the US and other relevant countries should be considered, trying to represent most of the possible meteorological conditions”.
In this paper, these five error metrics were used:
- -
The mean absolute error (MAE) defined by:
being the forecasted outputs (or predicted values),
the observed data, and N the number of observations.
- -
The root mean square error (RMSE), more sensitive to important forecast errors, and hence suitable for applications where small errors are more tolerable than larger ones, as in utility applications. It is probably the reliability factor that is the most widely used:
- -
The mean bias error (MBE), mainly used to estimate the bias of the model:
These errors were then normalized, and the mean value of irradiation is generally used as the reference:
with
the average value of X calculated on the N data.
4. Estimation of GHI from Other Meteorological Data
4.1. Method
In this study, the data used were measured in Bouzareah (see
Section 2.1).
Figure 2 shows the ANN structure with all the available inputs.
The sunshine duration is the time expressed in minutes during which the solar irradiance exceeds 120 W/m2. It is strongly linked with the solar radiation (Angstrom relation for example).
The number of inputs, 10 (7 measured, 3 calculated), makes the optimization of the MLP structure long and arduous: With 10 inputs, 210 − 1 = 1023 combinations of input data are possible.
The choice of the best inputs combination is a prerequisite stage because the parsimony is a basic principle in ANN elaboration, essential for its generalization. Some of the variables bring little information, sometimes no information at all, some of them are redundant, even worse they reduce the model performance. Moreover, an increase of the input number is accompanied by an increase of the hidden neurons and of the calculation time.
The Pearson’s correlation coefficient between each input parameter and the output is determined before using an exhaustive selection (testing 1023 architectures).
4.2. Relationship between Input and Output Data
The Pearson’s correlation coefficient (R) between input variables and GHI and between inputs variables themselves were calculated. When the absolute value of R is near 1, there is a high degree of linear correlation between the two variables; if R = 0, there is no linear correlation, but other relation types can exist.
Computing R between input variables allows an estimation of whether the inputs are redundant and interdependent. The first objective is to rank the statistical linear dependences between the inputs and output; for a large sample of data, the R threshold from which there is a significant link between parameters is very low.
Table 2 shows the values of the Pearson correlation coefficient between meteorological variables.
We mainly see:
The only high correlation is between GHI and S (69%), the other are just weak correlations;
The ranking of inputs from the R point of view (excepted S) is H (18%), EHI (13%), (13%), and p (10%); and
A high value of R between inputs data, , δ, and EHI (between 34% and 95%); it was obvious because and δ are used in the calculation of EHI.
This preliminary study allows to have an idea about the link between variables, but only linear ones. The results were not significant enough to avoid an exhaustive study for all combinations; the Pearson coefficient allows estimation of only the linear dependency between the data while the MLP is a non-linear model (sigmoid activation function), thus it is not surprising that the analysis of the Pearson coefficient is not sufficient to customize an MLP.
4.3. Results
Figure 3 shows the average, minimum, and maximum values of nRMSE and the standard deviation versus the number of inputs.
The minimum nRMSE was obtained for the 10 inputs model with a value of 18.65% compared to 73.91% for the worst configuration (2 inputs: WD and WS).
Table 3 presents the two best configurations for the same number of inputs; as an example, for 9 inputs, 10 combinations were possible, and only the two best ones (from an nRMSE point of view) are reported in
Table 3. The models were classified in descending order of performance (ranking).
Some models with a lower number of inputs can be better than models with a higher number of inputs. The declination, δ, appeared very rarely, WD and Pr never appeared, S was always present, and T, H, EHI, and p were often present; these 5 inputs had a relatively good R with GHI (but only 6% for temperature).
Without S (strongly linked with GHI), the best nRMSE dropped to 32.07% compared with 18.65% for the best configuration with 10 inputs.
Table 4 shows the results for the best configurations without S input.
Figure 4 shows the average, minimum, and maximum nRMSE values and the standard deviation versus the number of input data without S input.
The ANN reliability can be considered as correct, particularly when S was an input. Pr and WD had a low influence on GHI estimation (low correlation with GHI). The nRMSE of the 6-inputs model (T, H, p, WS, S, EHI) was 19.35% compared with the nRMSE of 10-inputs model with 18.65%; this combination had a good performance with a minimum of inputs.
Estimated 5-min GHI is plotted versus the measured 5-min GHI for four architectures in
Figure 5.
- -
ANN structure with 10 inputs;
- -
ANN structure with six inputs;
- -
ANN structure with nine inputs without sunshine duration; and
- -
best ANN structure with five inputs (without sunshine duration).
Few differences appeared between the ANN structure with 10 and six inputs in term of reliability and data dispersion. Without S, it appears clearly a more important spread of data compared with the results obtained with ANN structures with S in the inputs set. The performances of the best ANN structures without S for 5-min data were correct with an nMAE between 28.5% and 31% and an nRMSE between 32% and 35%.
The presence of some meteorological inputs in the “best” configurations seems sometimes surprising as WD and WS; it is difficult to understand the physical relations between GHI and other meteorological parameters. One of the major criticisms that could be levelled at the ANN model is that it is a black box model, allowing it to find some relations between data as often difficult to interpret, and ANN is a data driven method.
Even if some estimated data are far away from the real values, we can consider that the performance of this model is satisfying because determining GHI with a time granularity of 5-min from other meteorological data is a very complex task (high variability phenomenon and anisotropic aspect); keeping in mind that such a method is generally applied only for daily average values [
5].
A bibliographical study [
5] was realized on ANN methods used for such an estimation of GHI from exogenous meteorological data and this study showed that:
- -
For the estimation of monthly mean values of daily GHI, the nRMSE was between 4.07% and 9.4%, but the process, on average, monthly, allows smoothing of the anisotropic effects and sometimes linear relationships are sufficient to link GHI with other parameters.
- -
For the estimation of the daily GHI, nRMSE around 6% and nMAE around 5% were found. The time granularity was much higher than in our work.
Note that the determination coefficient (R
2) between measured and estimated data was between 0.86 and 0.95 for the four graphs in
Figure 5.
The mean bias error (MBE) was also computed for
Figure 5 and was equal to −0.08 Wh/m
2, for 10 and six inputs, −1.14 Wh/m
2 for nine inputs without S, and −1.87 Wh/m
2 for five inputs; thus, all the ANN models slightly underestimated GHI.
5. Estimation of Tilted Global Irradiation (GTI) from Horizontal Global Irradiation (GHI)
In sizing or simulation software for solar systems, the solar collector inclination is introduced as an input and the horizontal solar data (generally hourly) collected from several meteorological stations are “tilted”. The accuracy and quality of GTI used as an input in these software have an impact on the reliability of the results.
It is difficult to develop a simple model for converting GHI into GTI [
6] because the radiation received by a tilted plane includes the radiation reflected by the ground and scattered by the sky; this last component is difficult to estimate; when the collector is inclined, it sees only a part of the sky; moreover, the sky diffuse radiation depends on the inclination or orientation of the collector, on the elevation and azimuth of the sun, but also on the sky state with complex anisotropic effects [
7,
8]. The larger the time-step is, the more this anisotropy decreases (time-averaging and compensating effect) and tends towards an isotropic distribution; the shorter the time-step is, the more it is difficult to realize this conversion with good accuracy. The conversion of GHI to GTI is a complex issue often dealt with in the scientific literature [
7,
42,
43,
44,
45,
46].
5.1. Method
As in
Section 4, the data used here were measured in Bouzareah.
Four data, among them GHI, were used as input:
- -
The declination representing the position of the Earth from the Sun depending on the day number;
- -
the zenith angle characterizing the sun position, which influences the quantity and the quality of the sun radiation; when the sun is high in the sky (low zenith angle), the solar radiation is maximal (in clear skies). Moreover, as the optical path is minimal, the incident radiation is less absorbed;
- -
the extraterrestrial irradiation, EHI, used as a reference; depending on sky conditions, several values of GTI correspond to the same GHI. In diffuse radiation models, the clearness or diffuse index are often used to characterize the sky. When the clearness index is high, then the sky is clear and GHI is mainly composed of BNI.
According to the rules described in
Section 3.2, the number of hidden neurons in only one hidden layer will vary from one to eight. The ANN structure is shown in
Figure 6.
5.2. Results
The five error metrics are presented in
Table 5 (calculated on the basis of eight runs (each run corresponds to a different random weight initializing). The first column contains the number of neurons in the hidden layer.
The nRMSE mean values and its corresponding standard deviation are presented in
Figure 7 as an error-bar graph. An improvement appears until reaching four hidden neurons, then the nRMSE becomes almost constant and no improvement was observed. The dash points define the 95% confidence interval of the prediction errors (calculated based on eight runs per architecture), the triangles and squares are the minimum and maximum observed errors, respectively. We observed the same trend for the variation of the nMAE. The best configuration is encircled in red.
We conclude that an ANN with one hidden layer of four neurons is the best model. Moreover, it appears that the use of the azimuth does not provide any improvement. Thus, we will retain an ANN with four inputs and one hidden layer of four neurons, which have an average nRMSE of 8.81%, the best simulation with this ANN structure conduced to an nRMSE of 8.27%.
To illustrate the good reliability of this optimized MLP, a period of seven days unknown to the network was plotted with measured and calculated data in
Figure 8.
A good relationship is observed between the modelled and measured data whatever the state of the sky is (clear, partially cloudy, and cloudy) because the nRMSE was under 10%, which is a good value for an nRMSE for such a short time step (5-min).
6. Forecasting of Hourly Direct Normal (DNI) and Horizontal Global (GHI) Irradiation for a Time Horizon from 1 h to 6 h
This forecasting work was realized from global horizontal (GHI) and normal beam (BNI) data measured at Odeillo, Pyrénées Orientales, located in the south of France (42°29 N, 2°01 E, 1550 m asl).
6.1. Stationnarization of Solar Data
Machine learning methods are efficient tools for forecasting time series with a stationary behavior. An MLP is a stationary model, which must use stationary data as input. To make solar irradiation time series stationary and to separate the climatic effects and the seasonal effects, the solar data are generally transformed in unitless variables called “clearness index”, and denoted kt; kt is the ratio of the solar radiation on the earth, GHI, to that outside the atmosphere, EHI, and defined by Equation (6) [
35]:
It is the clearness index series, kt, that induces randomness, caused by the diversity of atmospheric components (dusts, aerosols, clouds motion, and humidity) on the solar irradiation measured at the Earth‘s surface.
Numerous studies have showed that EHI can be efficiently replaced by the clear sky solar irradiation [
22] taking into account the climatic conditions of the meteorological site; thus the clearness index is replaced by the clear sky index, k
g,cs, defined by:
with
the global horizontal solar irradiation in clear sky conditions.
For DNI, a similar index, k
d,CS [
47,
48], is defined by:
Various models of clear sky solar irradiations are available in the literature, which differ from each other mainly in the inputs needed by each model [
49]. Solar irradiance models by clear sky denoted in the following clear sky models used meteorological variables (as ozone layer thickness, precipitable water, optical aerosol depth, etc.) and used solar geometry (solar elevation and air mass), using radiative transfer models to consider the absorption and diffusion effects of solar radiation into the atmosphere [
50,
51]. The most widely used clear sky model is the Solis model developed by Mueller et al. [
52] and simplified by Ineichen [
53], the European Solar Radiation Atlas (ESRA) model [
54], and the Reference Evaluation on Solar Transmittance 2 (REST2) model [
55].
Thus, the simplified Solis clear sky model [
53] was used here. It allowed calculations of the GHI
CS, and DNI
CS. This clear sky model was validated for each month by comparison with experimental solar radiation data measured in clear sky conditions. For illustration purposes, experimental and modelled solar irradiances by clear sky are plotted in
Figure 9 for one day in April and in September.
6.2. Choice of the Number of Input Data
The purpose of
Section 6 is to predict the future hourly solar irradiation (at different time horizons) based on the past observed data, i.e., mathematically:
A variable, X, with the symbol,
, represents a forecasted data; without this symbol, X is a measured data. The solar data at future time step (t+h),
is forecasted from the observed data X measured at the times (
t,
t − 1…,
t −
n); thus, the first objective consists of determining the value of
n, i.e., the dimension of the input matrix; to do it, an auto mutual information method [
56,
57,
58] was used. The auto mutual information is a property of the time series, it depends on each dataset and is characteristic of the degree of statistical dependence between
and
with 0 ≤
i ≤
n. It is a dimensionless quantity with units of bits. High mutual information indicates a large reduction in uncertainty about one random variable,
, given knowledge of another
The auto-mutual information method showed that the number of inputs (value of n in Equation (9)) for predicting GHI is six and for DNI, it is seven.
6.3. Methods
A first forecasting method, a naïve model, easy to implement and requiring no training step, i.e., no historical data set, was used as a reference model to compare it with more sophisticated models in terms of accuracy. It allowed us to see the improvement due to the use of the ANN forecaster.
The persistence model, the simplest forecasting model, assumes that the future value is identical to the previous one (Equation (10)). The persistence forecast accuracy decreases significantly with the forecasting horizon [
59].
The smart persistence (SP) is an improved version of the persistence one taking into account the diurnal solar cycle: The clear sky solar radiation profile over the day was used [
41]:
is GHI or DNI for a clear sky condition calculated at time t. This smart persistence model was applied in this paper and used mainly as a reference model.
For the ANN model, as explained in
Section 6.1, the clear sky index will be forecasted because it is a stationary series. Once the clear sky index is forecasted, the value of the forecasted solar irradiation, (
is obtained by multiplying it by the calculated clear sky irradiation (
6.4. Results for GHI
Table 6 gives the values of the error metrics calculated on the test data set (RMSE, MAE, and MBE are given in Wh.m
−2).
As the ranking of the model is always identical from a RMSE point of view or a MAE point of view, we only present in
Figure 10 the results in terms of RMSE and nRMSE expressed in percentage.
The smart persistence, a naive model, was used as a reference. This model has a good RMSE and MAE for a time horizon, h+1, but its performances decrease rapidly with the time horizon. The gap in term of performances between ANN and SP increases with the time horizon.
6.5. Results for DNI
Table 7 gives the values of the performance metrics computed on the test data set (RMSE, MAE, and MBE are given in Wh.m
−2) for DNI.
The results in terms of RMSE and nRMSE are presented in
Figure 11 for BNI.
The DNI forecasting is more difficult and the models’ performances were less satisfying than with GHI particularly because DNI was more sensitive to meteorological conditions and because its variation was more rapid and of a greater magnitude than for GHI.
Some differences are noted in terms of ranking according to the metric used (nRMSE or nMAE), the nRMSE gives more importance to large gaps between predicted and measured data and generally the forecasting models were better compared in term of nRMSE than nMAE.
6.6. Comparison between GHI and DNI Forecasts
It is impossible to compare the performances of the models according to the solar component in terms of the absolute value of RMSE (or MAE) because the daily curve of GHI and DNI are very different in term of amplitude and form. Thus, we plotted in
Figure 12 a comparison of the performances in term of nRMSE because it is the most common error metric used in the solar radiation prediction; in the case of
Figure 11, as we compared two different solar radiations (GHI and DNI) with a different scale, the normalized value of RMSE seems to be the most adapted metric.
As previously underlined, GHI was forecasted with a better accuracy compared with DNI. It is probably due to the fact that in GHI, the two components, diffuse one and beam one, have compensating effects (when diffuse increases, beam decreases) and the variation rate of GHI is less rapid than for DNI. With SP and ANN methods, DNI is predicted with an nRMSE nearly twice as high than for GHI, but this difference was reduced when the forecast horizon increased and for (h+6), the accuracy for DNI prediction was the same than for GHI prediction.
Antonanzas et al. [
50] reviewed the intra-day ahead forecast performances for PV production (using GHI as renewable resource) using different numerical prediction models. Various error metrics were used and calculated according to different definitions, moreover, the forecasting methods were applied in different meteorological stations; thus, it is very difficult to make a comparison of our results with the literature in these conditions.
A short bibliographical study [
60] on DNI forecasting concludes that the DNI forecasting is obtained with a lower accuracy than for GHI forecasting and that only a small number of articles are written on the DNI forecasting at short time horizons as confirmed by Law et al. [
20].