1. Introduction
Linear regression is an approach that uses the least squares method to model the relationship between a scalar dependent variable and one or more explanatory variables; geometrically, it fits points in the plane with a straight line, or points in a high-dimensional space with a hyperplane. This method is very sensitive to predictors that are in a configuration of near-collinearity. Ridge regression is a variant of linear regression whose goal is to circumvent the problem of predictor collinearity. The ridge regression model is a powerful machine learning technique introduced by Hoerl [1] and Hastie et al. [2]; it is a method from classical statistics that implements a regularized form of least squares regression [3]. Ridge regression is thus an alternative method for learning a function, based on a regularized extension of least squares techniques [4].
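To make the near-collinearity issue concrete, the following minimal sketch (Python with NumPy; the synthetic data, the penalty value, and the bias-free λ-penalty form are assumptions of this illustration) compares an ordinary least squares fit with a ridge fit on two almost-collinear predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors: x2 is x1 plus a tiny perturbation.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 1.0 * x2 + 0.1 * rng.normal(size=n)

# Ordinary least squares: (X^T X)^{-1} X^T y -- unstable when X^T X is near-singular.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: (X^T X + lam * I)^{-1} X^T y -- the penalty lam stabilises the solution.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS coefficients:  ", w_ols)    # often far from (3, 1) because of the collinearity
print("Ridge coefficients:", w_ridge)  # moderate, stable values
```

The ridge penalty shrinks the two coefficients toward a stable compromise, which is exactly the behaviour exploited in the objective functions introduced below.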
Given the data-set
$$T=\{(x_1,y_1),\ldots,(x_N,y_N)\},\qquad x_i\in\mathbb{R}^{n},\ y_i\in\mathbb{R},\ i=1,\ldots,N,\qquad(1)$$
a multiple linear regression function is $f(x)=(w\cdot x)+b$. Here $\mathbb{R}$ represents the real number set, $\mathbb{R}^{n}$ is the $n$-dimensional Euclidean space, $N$ is the number of sample points, and the superscript $T$ denotes the matrix transpose. Linear regression and ridge regression determine the parameter vector $(w,b)$ by minimizing the following objective functions, respectively:
$$\min_{w,b}\ \sum_{i=1}^{N}\bigl(y_i-(w\cdot x_i)-b\bigr)^{2},\qquad(2)$$
$$\min_{w,b}\ \frac{1}{2}\,\|w\|^{2}+\frac{C}{2}\sum_{i=1}^{N}\bigl(y_i-(w\cdot x_i)-b\bigr)^{2}.\qquad(3)$$
The objective function used in ridge regression implements a form of Tikhonov [5] regularization of a sum-of-squares error metric, where the parameter $C>0$ in (3) is a regularization parameter controlling the bias-variance trade-off [6]. This corresponds to penalized maximum likelihood estimation of $f$, under the assumption that the targets have been corrupted by independent and identically distributed (i.i.d.) samples from a Gaussian noise process with zero mean and variance $\sigma^{2}$, i.e., $y_i=f(x_i)+\xi_i$ with $\xi_i\sim N(0,\sigma^{2})$.
The kernel ridge regression model based on the Gaussian-noise characteristic was derived by Saunders et al. [7].
Ridge regression [1,3,5] can also be used to find the hidden nonlinear structure in raw data, where the nonlinear mapping is approximated by means of kernel techniques [7,8,9,10,11]. Therefore, a linear ridge regression model $f(x)=(\varpi\cdot\Phi(x))+b$ is constructed in a feature space $H$ (via a nonlinear map $\Phi:\mathbb{R}^{n}\to H$, with $\varpi\in H$), induced by a nonlinear kernel function defining the inner product $K(x,x')=(\Phi(x)\cdot\Phi(x'))$. The kernel function $K(\cdot,\cdot)$ may be any positive definite Mercer kernel. Therefore, the objective function of the kernel ridge regression model based on the Gaussian noise can be written as
$$\min_{\varpi,b}\ \frac{1}{2}\,\|\varpi\|^{2}+\frac{C}{2}\sum_{i=1}^{N}\bigl(y_i-(\varpi\cdot\Phi(x_i))-b\bigr)^{2}.\qquad(4)$$
Suppose the noise is Gaussian; then the kernel ridge regression model based on the Gaussian-noise characteristic may meet the requirements. However, the noise in wind speed and wind power forecasting does not obey the Gaussian distribution but rather the Beta distribution, so the classic regression techniques are not applicable in this case. The uncertainty of wind power predictions was investigated in [12], where the statistics of the wind power forecasting error were found not to be Gaussian. The work in [13] also found that the output of wind turbine systems is limited between zero and the maximum power and that the error statistics do not follow a normal distribution; it also showed, using chi-squared tests, that the use of the Beta function is justifiable for wind power prediction. In [14], the standard deviation of the data set was a function of the normalized predicted power, i.e., the predicted power divided by the installed wind power capacity. Fabbri et al. [14] pointed out that the normalized production power lies within the interval [0,1] and that the Beta function is more suitable than the standard normal distribution. The literature [15] exhibited the advantages of using the Beta probability distribution function (pdf) instead of the Gaussian pdf for approximating the forecasting error. Based on the above literature [12,13,14,15,16], this work studies the Beta-distributed error $\xi$ between the predicted values and the measured values in wind speed forecasting. The pdf of $\xi$ is
$$h\,\xi^{\,p-1}(1-\xi)^{\,q-1},\qquad \xi\in(0,1),$$
plotted in Figure 1, where $p>0$ and $q>0$ are the shape parameters, $h$ is a normalization factor, and the parameters $p$ and $q$ may be determined from given values of the mean and standard deviation [17].
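As an illustration of how the shape parameters can be obtained from a given mean and standard deviation, the following sketch uses the standard method-of-moments relations for a Beta distribution on (0, 1); the numerical values are hypothetical, and SciPy handles the normalization factor $h$ internally.

```python
from scipy import stats

def beta_params_from_moments(mean, std):
    """Method-of-moments estimates of the Beta shape parameters p, q on (0, 1)."""
    var = std ** 2
    common = mean * (1.0 - mean) / var - 1.0   # must be positive for a valid Beta fit
    p = mean * common
    q = (1.0 - mean) * common
    return p, q

# Hypothetical normalized forecast-error statistics.
p, q = beta_params_from_moments(mean=0.4, std=0.15)
print(p, q)                       # fitted shape parameters
print(stats.beta.pdf(0.5, p, q))  # pdf value at xi = 0.5 (normalization factor h handled by SciPy)
```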
It is not suitable to apply techniques based on the Gaussian-noise model to fit functions from a data-set with Beta noise. In order to solve this problem, this work draws on optimization theory and a Beta-noise loss function to derive a kernel ridge regression method based on the Beta-noise characteristic. It also yields a forecasting technique that can deal with high dimensionality and nonlinearity simultaneously.
This paper is organized as follows. In Section 2, we derive the Beta-noise empirical risk loss using the Bayesian principle. Section 3 describes the proposed kernel ridge regression model based on the Beta noise. Section 4 gives the solution and the algorithm design, based on a Genetic Algorithm, for the model with the Beta-noise characteristic. Numerical experiments applying the proposed model to short-term wind speed and wind power prediction are reported in Section 5. Finally, the conclusions and future work are given in Section 6.
2. Bayesian Principle to Beta-Noise Empirical Risk Loss
Learning to fit data with noise is an important problem in many real-world data mining applications. Given the training set $T$ of (1), suppose the noise is additive:
$$y_i=f(x_i)+\xi_i,\qquad i=1,\ldots,N,\qquad(5)$$
where the $\xi_i$ are i.i.d. random noise variables with mean $\mu$ and standard deviation $\sigma$.
The objective is to find a regressor $f$ minimizing the expected risk [18,19]
$$R[f]=\int c\bigl(y-f(x)\bigr)\,dP(x,y)\qquad(6)$$
based on the empirical data $T$, where $c(\cdot)$ is an empirical risk loss function determining how estimation errors are penalized. Since the distribution $P(x,y)$ is unknown, we can only use the data-set $T$ to estimate a regressor $f$ and minimize $R[f]$. A possible approximation consists of replacing the integration by its empirical estimate, which gives the empirical risk
$$R_{emp}[f]=\frac{1}{N}\sum_{i=1}^{N}c\bigl(y_i-f(x_i)\bigr).\qquad(7)$$
In general, a capacity control term should be added to (7), which leads to the regularized risk functional [18,20]
$$R_{reg}[f]=R_{emp}[f]+\frac{\lambda}{2}\,\|\varpi\|^{2},\qquad(8)$$
where $\lambda>0$ is a regularization constant and $R_{emp}[f]$ is the empirical risk. It is well known that the squared loss $c(\xi)=\frac{1}{2}\xi^{2}$ is the empirical risk loss of the Gaussian-noise characteristic underlying the objective functions (2), (3), and (4). However, what is the empirical risk loss of the model when the noise follows a Beta distribution? The Beta-noise empirical risk loss is derived by the use of the Bayesian principle as follows.
The regressor $f$ is unknown; the objective is to estimate it from the data-set $T$. According to the literature [20,21,22], the optimal empirical risk loss obtained from maximum likelihood is
$$c(\xi)=-\ln \rho(\xi),\qquad(9)$$
where $\rho(\cdot)$ denotes the density of the noise. The maximum likelihood estimation maximizes
$$\rho\bigl(\xi_1,\ldots,\xi_N\mid f\bigr)=\prod_{i=1}^{N}\rho\bigl(y_i-f(x_i)\bigr).\qquad(10)$$
Maximizing (10) is equivalent to minimizing $-\sum_{i=1}^{N}\ln \rho\bigl(y_i-f(x_i)\bigr)$. Using Equation (7), we have
$$R_{emp}[f]=-\frac{1}{N}\sum_{i=1}^{N}\ln \rho\bigl(y_i-f(x_i)\bigr).$$
Suppose the noise in Equation (5) adheres to a Beta distribution with mean $\mu$ and variance $\sigma^{2}$; its density is $\rho(\xi)=h\,\xi^{\,p-1}(1-\xi)^{\,q-1}$ on $(0,1)$ with
$$p=\mu\Bigl(\frac{\mu(1-\mu)}{\sigma^{2}}-1\Bigr),\qquad q=(1-\mu)\Bigl(\frac{\mu(1-\mu)}{\sigma^{2}}-1\Bigr)$$
[13,14], where $h$ is the normalization factor. By Equations (9) and (10), the Beta-noise empirical risk loss is
$$c(\xi)=-\ln h-(p-1)\ln\xi-(q-1)\ln(1-\xi),\qquad \xi\in(0,1).\qquad(11)$$
The empirical risk losses of the Gaussian noise and of the Beta noise with different parameters are shown in Figure 2.
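The sketch below (NumPy; the shape parameters are hypothetical) evaluates the Beta-noise loss of Equation (11), up to its additive constant, alongside the squared loss of the Gaussian case, mirroring the comparison in Figure 2.

```python
import numpy as np

def beta_noise_loss(xi, p, q):
    """Beta-noise loss of Equation (11), up to the additive constant -ln h; defined on (0, 1)."""
    xi = np.asarray(xi, dtype=float)
    return -(p - 1.0) * np.log(xi) - (q - 1.0) * np.log(1.0 - xi)

def gauss_noise_loss(xi):
    """Squared loss corresponding to the Gaussian-noise characteristic."""
    return 0.5 * np.asarray(xi, dtype=float) ** 2

xi = np.linspace(0.01, 0.99, 99)
print(beta_noise_loss(xi, p=4.0, q=6.0)[:3])  # asymmetric; grows without bound near 0 and 1
print(gauss_noise_loss(xi)[:3])               # symmetric around zero
```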
3. Kernel Ridge Regression Model Based on Beta-Noise
It is not appropriate to apply the kernel ridge regression model based on the Gaussian-noise characteristic to tasks in which the noise follows a Beta distribution. Consequently, we use the Beta-noise loss function obtained above by the maximum likelihood method as the optimal loss function and propose a new kernel ridge regression model based on the Beta-noise characteristic.
First, consider constructing the regressor $f(x)=(\varpi\cdot\Phi(x))+b$, where $\Phi:\mathbb{R}^{n}\to H$, $\varpi\in H$, and $b\in\mathbb{R}$. We use kernel techniques and construct the kernel function $K(x,x')=(\Phi(x)\cdot\Phi(x'))$, where $x,x'\in\mathbb{R}^{n}$, $H$ is a Hilbert space, and $(\cdot\,,\cdot)$ is the inner product of $H$. Then we extend these kernel techniques to the ridge regression model based on the Beta-noise characteristic.
Let the set of inputs be $\{x_i\}_{i=1}^{N}$, where $i$ is the index of the $i$-th sample in the data-set $T$. For the general Beta-noise characteristic, the Beta-noise loss function at the sample point $(x_i,y_i)$ of $T$ is given by Formula (11). Owing to the fact that the ridge regression and kernel techniques with the Gaussian-noise characteristic are not suitable for the Beta-noise distribution in time series problems, Formula (11) is selected as the Beta empirical risk loss to overcome this shortcoming. The primal problem of the kernel ridge regression model with the Beta noise can be described as follows:
$$\min_{\varpi,b,\xi}\ \frac{1}{2}\,\|\varpi\|^{2}+C\sum_{i=1}^{N}\bigl[-(p-1)\ln\xi_i-(q-1)\ln(1-\xi_i)\bigr]\qquad(12)$$
$$\text{s.t.}\quad y_i-(\varpi\cdot\Phi(x_i))-b=\xi_i,\quad 0<\xi_i<1,\quad i=1,\ldots,N,$$
where $C>0$ is a penalty parameter and $p,q$ are the shape parameters of the Beta noise.
Theorem 1. The solution of the primal Problem (12) with respect to $\varpi$ exists and is unique.
Proof. The existence of a solution is trivial. The uniqueness is shown as follows. Suppose that $(\varpi_1,b_1,\xi^{(1)})$ and $(\varpi_2,b_2,\xi^{(2)})$ are both optimal solutions of Problem (12) with the common optimal value $F^{*}$. For any $t\in(0,1)$, the convex combination $\bigl(t\varpi_1+(1-t)\varpi_2,\ t b_1+(1-t)b_2,\ t\xi^{(1)}+(1-t)\xi^{(2)}\bigr)$ is again feasible, because the constraints of (12) are affine and the interval $(0,1)$ is convex. For shape parameters $p,q\ge 1$, the Beta loss (11) is convex on $(0,1)$, while $\frac{1}{2}\|\varpi\|^{2}$ is strictly convex in $\varpi$; hence the objective value of the convex combination is at most $tF^{*}+(1-t)F^{*}=F^{*}$, with strict inequality whenever $\varpi_1\neq\varpi_2$. A value strictly smaller than $F^{*}$ would contradict optimality, so $\varpi_1=\varpi_2$. In conclusion, the solution of Problem (12) with respect to $\varpi$ exists and is unique. □
Theorem 2. The dual problem of the primal Problem (12) of the kernel ridge regression model with the Beta noise is the optimization Problem (18) in the Lagrange multipliers $\alpha_i$ ($i=1,\ldots,N$), where $K(x_i,x_j)=(\Phi(x_i)\cdot\Phi(x_j))$ and $C>0$ is a constant.
Proof. Introduce the Lagrange functional of Problem (12),
$$L=\frac{1}{2}\,\|\varpi\|^{2}+C\sum_{i=1}^{N}c(\xi_i)+\sum_{i=1}^{N}\alpha_i\bigl(y_i-(\varpi\cdot\Phi(x_i))-b-\xi_i\bigr),$$
where $c(\cdot)$ is the Beta-noise loss (11). To obtain the minimum of $L$ with respect to the primal variables, take the partial derivatives with respect to $\varpi$, $b$, and $\xi_i$, respectively. From the Karush-Kuhn-Tucker (KKT) conditions we obtain
$$\varpi=\sum_{i=1}^{N}\alpha_i\Phi(x_i),\qquad \sum_{i=1}^{N}\alpha_i=0,\qquad C\,c'(\xi_i)=\alpha_i,\quad i=1,\ldots,N.$$
Substituting these extreme conditions back into the Lagrange functional and maximizing over $\alpha=(\alpha_1,\ldots,\alpha_N)^{T}$, we derive the dual Problem (18) of Problem (12). □
From the KKT condition $C\,c'(\xi_i)=\alpha_i$ we can solve for $\xi_i$; because $\xi_i$ must lie in $(0,1)$, the root outside this interval is rejected and the admissible value is kept. The bias $b$ is then obtained from the equality constraints of Problem (12). Hence the decision-making function of the kernel ridge regression model based on the Beta-noise characteristic is
$$f(x)=\sum_{i=1}^{N}\alpha_i K(x_i,x)+b.$$
Note: The kernel ridge regression of the Gaussian-noise characteristic was discussed in [9,10,11]. The Gaussian empirical risk loss at the sample point $(x_i,y_i)$ is $c(\xi_i)=\frac{1}{2}\xi_i^{2}$. With this loss, the same derivation gives the dual problem of the ridge regression model based on the Gaussian-noise characteristic,
$$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j (x_i\cdot x_j)-\frac{1}{2C}\sum_{i=1}^{N}\alpha_i^{2}+\sum_{i=1}^{N}\alpha_i y_i,\qquad \text{s.t.}\ \sum_{i=1}^{N}\alpha_i=0,$$
and the dual problem of the kernel ridge regression model based on the Gaussian-noise characteristic,
$$\max_{\alpha}\ -\frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j K(x_i,x_j)-\frac{1}{2C}\sum_{i=1}^{N}\alpha_i^{2}+\sum_{i=1}^{N}\alpha_i y_i,\qquad \text{s.t.}\ \sum_{i=1}^{N}\alpha_i=0.$$
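A minimal sketch of how the kernel ridge regression model with the Gaussian-noise characteristic can be trained from the dual above: its stationarity and constraint conditions reduce to a single linear system in $(\alpha,b)$, solved below with NumPy. The Gaussian kernel width, the penalty value, and the synthetic data are hypothetical.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_krr_gauss(X, y, C=10.0, sigma=1.0):
    """Solve the KKT system of the Gaussian-noise dual: [[K + I/C, 1], [1^T, 0]] [alpha; b] = [y; 0]."""
    N = len(y)
    K = gauss_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = K + np.eye(N) / C
    A[:N, N] = 1.0
    A[N, :N] = 1.0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    alpha, b = sol[:N], sol[N]
    return alpha, b, (X, sigma)

def predict(alpha, b, model, X_new):
    """Decision function f(x) = sum_i alpha_i K(x_i, x) + b."""
    X_train, sigma = model
    return gauss_kernel(X_new, X_train, sigma) @ alpha + b

# Tiny synthetic check.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.normal(size=40)
alpha, b, model = fit_krr_gauss(X, y)
print(predict(alpha, b, model, X[:3]), y[:3])
```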
4. Solution Based on Genetic Algorithm
The solution and the algorithm design of the kernel ridge regression model based on the Beta-noise characteristic are given as follows.
- (1) Input the training samples $T=\{(x_1,y_1),\ldots,(x_N,y_N)\}$, where $x_i\in\mathbb{R}^{n}$ and $y_i\in\mathbb{R}$ ($i=1,\ldots,N$).
- (2) Select an appropriate positive penalty parameter $C$ and a suitable kernel function $K(x,x')$.
- (3) Solve the optimization Problem (18) and obtain the optimal solution $\alpha=(\alpha_1,\ldots,\alpha_N)^{T}$.
- (4) Construct the decision-making function $f(x)=\sum_{i=1}^{N}\alpha_i K(x_i,x)+b$ and compute $b$ from the KKT conditions (a numerical sketch is given after this list).
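Since the dual Problem (18) with the Beta loss has no simple closed form, one way to realize steps (1)-(4) numerically is to parameterize the regressor through the kernel expansion $f(x)=\sum_i c_i K(x_i,x)+b$ and minimize the regularized Beta-noise objective directly. The sketch below is only an illustration under assumptions: SciPy's L-BFGS-B optimizer stands in for whichever solver is applied to Problem (18), and the Gaussian kernel, its width, the penalty C, the shape parameters, the clipping safeguard, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def beta_loss(r, p_shape, q_shape, eps=1e-6):
    """Beta-noise loss (11); residuals are clipped into (0, 1) purely as a numerical safeguard."""
    r = np.clip(r, eps, 1.0 - eps)
    return -(p_shape - 1.0) * np.log(r) - (q_shape - 1.0) * np.log(1.0 - r)

def fit_krr_beta(X, y, C=10.0, sigma=1.0, p_shape=4.0, q_shape=6.0):
    N = len(y)
    K = gauss_kernel(X, X, sigma)

    def objective(theta):
        c, b = theta[:N], theta[N]
        f = K @ c + b
        reg = 0.5 * c @ K @ c          # (1/2)||w||^2 under the expansion w = sum_i c_i Phi(x_i)
        return reg + C * beta_loss(y - f, p_shape, q_shape).sum()

    theta0 = np.zeros(N + 1)
    theta0[N] = y.mean() - 0.5         # start with residuals near the middle of (0, 1)
    res = minimize(objective, theta0, method="L-BFGS-B")
    return res.x[:N], res.x[N]

# Hypothetical data: a smooth signal plus Beta-distributed noise in (0, 1).
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(60, 3))
y = X @ np.array([0.5, -0.2, 0.3]) + rng.beta(4.0, 6.0, size=60)
c, b = fit_krr_beta(X, y)
print((gauss_kernel(X, X) @ c + b)[:3], y[:3])
```

In the algorithm above, step (3) instead solves the dual problem; this primal sketch is only meant to show the roles of the Beta loss and of the kernel expansion.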
The determination of the unknown parameters of the model is a complicated process, and an appropriate parameter combination can enhance the regression accuracy of the kernel ridge regression based on the Beta noise. The Genetic Algorithm (GA) [23,24,25] is a search heuristic that mimics the process of natural evolution and is routinely used to generate useful solutions to optimization and search problems. In a GA, the evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation, the fitness of every individual in the population is evaluated; multiple individuals are stochastically selected from the current population and modified to form a new population, which is then used in the next iteration of the algorithm. Commonly, the algorithm terminates either when a maximum number of generations has been produced or when a satisfactory fitness level has been reached for the population. If the algorithm terminates because the maximum number of generations is reached, a satisfactory solution may or may not have been found.
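To make the parameter search concrete, the sketch below tunes the penalty $C$ and the Gaussian kernel width $\sigma$ by minimizing a validation RMSE with SciPy's differential evolution, an evolutionary algorithm used here only as a stand-in for the GA described above; the Gaussian-noise closed form is used inside the fitness function to keep the example short, and the search ranges and data are hypothetical.

```python
import numpy as np
from scipy.optimize import differential_evolution

def gauss_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_predict(X_tr, y_tr, X_va, C, sigma):
    """Closed-form Gaussian-noise kernel ridge fit, then prediction on validation inputs."""
    N = len(y_tr)
    K = gauss_kernel(X_tr, X_tr, sigma)
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = K + np.eye(N) / C
    A[:N, N] = 1.0
    A[N, :N] = 1.0
    sol = np.linalg.solve(A, np.append(y_tr, 0.0))
    alpha, b = sol[:N], sol[N]
    return gauss_kernel(X_va, X_tr, sigma) @ alpha + b

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(120, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

def fitness(params):
    """Validation RMSE as the (to-be-minimised) fitness of an individual (log10 C, log10 sigma)."""
    C, sigma = 10.0 ** params[0], 10.0 ** params[1]
    pred = fit_predict(X_tr, y_tr, X_va, C, sigma)
    return np.sqrt(np.mean((y_va - pred) ** 2))

# Evolutionary search over hypothetical ranges: C in [1e-1, 1e3], sigma in [1e-2, 1e1].
result = differential_evolution(fitness, bounds=[(-1, 3), (-2, 1)], maxiter=30, seed=0)
print("best log10(C), log10(sigma):", result.x, "validation RMSE:", result.fun)
```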
GA is considered one of the modern optimization algorithms for solving combinatorial optimization problems, and it is used here to determine the parameters of the kernel ridge regression model based on the Beta-noise characteristic. Based on the survival and reproduction of the fittest, GA is continually applied to obtain new and better solutions without any pre-assumptions, such as continuity or unimodality [26,27,28]. The proposed model has been implemented in the Matlab 7.8 programming language. The experiments are made on a personal computer with a 3.60 GHz Core(TM) i7-4790 CPU and 8.0 GB of memory under Microsoft Windows XP Professional. The initial parameters of the GA, such as the population size and the crossover and mutation probabilities, are set in advance. Many practical applications show that polynomial and Gaussian kernels perform well under general smoothness assumptions [29]. In this work, polynomial and Gaussian kernels are used as the kernels for the three models:
$$K(x,x')=\bigl((x\cdot x')+1\bigr)^{d},\qquad K(x,x')=\exp\Bigl(-\frac{\|x-x'\|^{2}}{2\sigma^{2}}\Bigr),$$
where $d$ is a positive integer (taken as $d=2$ or $3$) and the kernel width $\sigma$ is positive.
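A minimal sketch of the two kernels just defined (the degree and width values are only examples):

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """Polynomial kernel ((x . z) + 1)^d with a positive integer degree d."""
    return (np.dot(x, z) + 1.0) ** d

def gaussian_kernel(x, z, sigma=0.5):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2)) with width sigma > 0."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([0.3, 0.7]), np.array([0.1, 0.9])
print(polynomial_kernel(x, z, d=3), gaussian_kernel(x, z, sigma=0.5))
```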
As is well known, no prediction model forecasts perfectly. Several criteria, namely the mean absolute error (MAE), the root mean square error (RMSE), the mean absolute percentage error (MAPE), and the standard error of prediction (SEP), are used to evaluate the predictive performance of the three models. The four criteria are defined as follows:
$$\mathrm{MAE}=\frac{1}{l}\sum_{i=1}^{l}\bigl|y_i-\hat{y}_i\bigr|,\qquad
\mathrm{RMSE}=\sqrt{\frac{1}{l}\sum_{i=1}^{l}\bigl(y_i-\hat{y}_i\bigr)^{2}},$$
$$\mathrm{MAPE}=\frac{1}{l}\sum_{i=1}^{l}\Bigl|\frac{y_i-\hat{y}_i}{y_i}\Bigr|\times 100\%,\qquad
\mathrm{SEP}=\frac{\mathrm{RMSE}}{\bar{y}}\times 100\%,$$
where $l$ is the size of the selected sample set, $y_i$ is the measured value of data point $i$, $\hat{y}_i$ is the predicted value of data point $i$ ($i=1,\ldots,l$), and $\bar{y}$ is the mean of the measured values [14,15,16].
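The four criteria can be computed directly from the measured and predicted series; a short sketch follows (the SEP convention used here, RMSE divided by the mean of the measured values, matches the definition given above).

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return MAE, RMSE, MAPE (%), and SEP (%) for measured and predicted series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = 100.0 * np.mean(np.abs(err / y_true))   # assumes no zero measured values
    sep = 100.0 * rmse / np.mean(y_true)
    return mae, rmse, mape, sep

print(forecast_errors([5.1, 6.3, 4.8, 7.0], [5.4, 6.0, 5.1, 6.5]))
```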
5. Short-Term Wind Speed and Wind Power Forecasting with Real Data-Set
The proposed kernel ridge regression model based on the Beta noise is applied to an actual multi-factor data-set for wind speed sequence prediction from Jilin Province. The wind speed data contain more than a year of samples collected at ten-minute intervals, and the number of wind speed records is 62,466. The column attributes are the mean, variance, minimum, and maximum, respectively. The short-term wind speed forecast is studied as follows.
Suppose the number of training samples is 2160 (points 1 to 2160, covering 15 days) and the number of test samples is 720 (points 2161 to 2880, covering 5 days). The input vector consists of the wind speeds at the preceding time points, and the output value is the wind speed at the forecasting point. This pattern is used to forecast the wind speed at intervals of 10 and 30 min at each point, respectively [30,31].
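One common way to realize this input-output pattern is to build each input vector from the $m$ most recent wind speed values and use the value $h$ steps ahead as the output; in the sketch below the lag order $m$, the horizon $h$, and the synthetic series are hypothetical, since the exact settings are not recoverable from the extracted text, while the 2160/720 train/test split follows the description above.

```python
import numpy as np

def make_lagged_samples(series, m=6, h=1):
    """Inputs are the m previous values; the output is the value h steps ahead."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(m, len(series) - h + 1):
        X.append(series[t - m:t])
        y.append(series[t + h - 1])
    return np.array(X), np.array(y)

# Hypothetical 10-min wind speed series; the real data-set has 62,466 records.
rng = np.random.default_rng(4)
speed = 6.0 + np.cumsum(0.1 * rng.normal(size=3000))

X, y = make_lagged_samples(speed, m=6, h=1)     # h=3 would give a 30-min-ahead pattern
X_train, y_train = X[:2160], y[:2160]           # first 2160 samples (15 days)
X_test, y_test = X[2160:2880], y[2160:2880]     # next 720 samples (5 days)
print(X_train.shape, X_test.shape)
```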
1. Forecasting the wind speed at intervals of 10 min
The short-term wind speed sequence forecast results at intervals of 10 min given by the two models based on the Gaussian-noise characteristic ([7,8,32] and [33,34], respectively) and by the proposed kernel ridge regression model based on the Beta-noise characteristic are illustrated in Figure 3. The penalty parameter and the kernel parameter of each model were selected before forecasting.
The MAE, MAPE, RMSE, and SEP indicators are used to evaluate the prediction results of the three models at intervals of 10 min, as shown in Table 1.
2. Forecasting the wind speed at intervals of 30 min
The short-term wind speed sequence forecast results at intervals of 30 min given by the two models based on the Gaussian-noise characteristic and by the proposed kernel ridge regression model based on the Beta-noise characteristic are illustrated in Figure 4. The penalty parameter and the kernel parameter of each model were again selected before forecasting.
The MAE, MAPE, RMSE, and SEP indicators are used to evaluate the prediction results of the three models at intervals of 30 min, as shown in Table 2.
The results of the wind speed forecasting experiments indicate that the proposed model based on the Beta-noise characteristic performs better than the two models based on the Gaussian-noise characteristic in both 10-min and 30-min short-term wind speed forecasting.
Having predicted the short-term wind speed for the Jilin Province wind farm, we can calculate the wind power according to Formula (25):
$$P(v)=\begin{cases}0, & v<v_{ci}\ \text{or}\ v\ge v_{co},\\[2pt] P_{r}\,\dfrac{v-v_{ci}}{v_{r}-v_{ci}}, & v_{ci}\le v<v_{r},\\[2pt] P_{r}, & v_{r}\le v<v_{co},\end{cases}\qquad(25)$$
where $v_{ci}$ and $v_{co}$ represent the cut-in and cut-out wind speeds of the wind turbine, respectively, and $v_{r}$ and $P_{r}$ represent the rated wind speed and rated power of the wind turbine, respectively. Substituting the predicted wind speed into Formula (25), we obtain the predicted wind power.
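A minimal sketch of the power-curve conversion of Formula (25) as reconstructed above, with a linear ramp between the cut-in and rated speeds; the turbine constants are hypothetical.

```python
import numpy as np

def wind_power(v, v_ci=3.0, v_r=12.0, v_co=25.0, p_r=2000.0):
    """Piecewise power curve: zero outside [v_ci, v_co), linear ramp up to rated power p_r (kW)."""
    v = np.asarray(v, dtype=float)
    power = np.zeros_like(v)
    ramp = (v >= v_ci) & (v < v_r)
    power[ramp] = p_r * (v[ramp] - v_ci) / (v_r - v_ci)
    power[(v >= v_r) & (v < v_co)] = p_r
    return power

predicted_speed = np.array([2.5, 5.0, 9.0, 13.0, 26.0])   # e.g., forecast wind speeds (m/s)
print(wind_power(predicted_speed))                        # corresponding predicted wind power
```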
6. Conclusions and Future Work
In this work, we propose a new version of the kernel ridge regression model, based on the Beta noise, to predict systems whose forecasting uncertainty follows a Beta distribution. Novel results have been obtained with this model, which applies the Bayesian principle to obtain the Beta-noise empirical risk loss and improves the prediction accuracy. The numerical experiments are carried out on real-world data (short-term wind speed). Comparing the proposed model with the two models based on the Gaussian-noise characteristic using the MAE, RMSE, MAPE, and SEP criteria verifies the validity and feasibility of the proposed model. Further, the forecasting results indicate that the proposed technique achieves good performance on short-term wind speed forecasting.
In practical regression problems, data uncertainty is inevitable. The observed data are often described at linguistic levels or with ambiguous metrics, such as weather forecasts expressed as dry or wet, sunny or cloudy, and so on. In future work, we should consider developing fuzzy kernel ridge regression algorithms with different noise models.