2.2. Machine Learning Methods
In this section, the techniques and algorithms covered within the paper are introduced and discussed. The statistical models covered are KNN, linear regression, LSGDR, elastic net, PLS, ridge, kernel ridge, SVR, NuSVR, decision tree, random forest, ensemble bagging, MLP, LSTM, and GRU. Because the dataset used in this research is a time series, the regression variants of the models are used rather than the classification variants.
K-Nearest Neighbors. The KNN algorithm utilizes a type of instance-based learning based on the differences between features. The algorithm uses a distance function to determine the set of samples, whose size is dictated by the value of $k$, that are closest to the target variable [32]. The algorithm stores the entire training dataset during the training phase and then creates a set of $k$ instances that most closely map to the target. The prediction of the model is created based on the similarity that new observations have with the set formed during training. Each new instance is compared with every instance within the training set, and the prediction is derived from the average of the response variable. In regression-based KNN, the response variable is the mean of the output variable [32]. In greater detail, the KNN algorithm computes the prediction $\hat{Y}$ for each instance $x$ by averaging the targets of the nearest $k$ instances from the set, as described in Equation (1):

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (1)$$

where, in this simplified example, $(x_i, y_i)$ represents the training examples, and $N_k(x)$ is the set of the $k$ nearest points [32]. It can be difficult to determine the optimal value of $k$, as there is an inverse relationship between $k$ and the error on the training set but a direct relationship with the error on the test set. The distance function, used to calculate the Euclidean distance $d$ between the variables $x$ and $y$, is used in the KNN algorithm as described in Equation (2):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$
Linear Regression. This algorithm is one of the simplest models that can be tried when conducting regression on a dataset. As shown in Equation (3), the correlation between the independent variables $X$ and the dependent variable $Y$ is bridged with a coefficient for each independent variable and an intercept [33]. As the complexity of a dataset increases, the likelihood of linear regression producing accurate projections decreases; however, it is still beneficial to include the model to serve as a baseline:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \qquad (3)$$

where $y$ is the output or dependent variable, $x_1, \dots, x_p$ are the independent features, $\beta_1, \dots, \beta_p$ are the coefficients of the linear model, and $\beta_0$ is the intercept term.
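For the single-feature case, the coefficient and intercept of Equation (3) have a simple closed form (slope = covariance/variance); a minimal sketch with illustrative data:

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for one feature: y = b0 + b1 * x (Equation (3))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance(x, y) / variance(x); intercept recovered from the means.
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = fit_simple_linear([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 1 + 2x
print(b0, b1)  # -> 1.0 2.0
```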
Linear Stochastic Gradient Descent Regressor. This is a linear regression model that uses stochastic gradient descent as its optimizer. The model iteratively updates the model weights using a small, randomized subset of the training data instead of the entire dataset, making it computationally efficient for larger datasets [34]. The linear function used to predict the target variable is described in Equation (3). The objective of this regressor is to determine the values of the weights $w$ and the intercept $b$ such that the loss function, defined over the predicted and actual values of the target variable, is minimized [34]. In this case, the loss function is the squared error, and the penalty function is the elastic net.
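A minimal illustration of the per-sample update rule; this is a sketch, not the paper's configuration (plain squared-error loss without the elastic net penalty, with a hypothetical learning rate and toy data):

```python
import random

def sgd_linear(data, lr=0.05, epochs=2000, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared-error loss:
    each step uses one randomly drawn sample rather than the full dataset."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)    # random subset of size 1
        err = (w * x + b) - y      # prediction error on that sample
        w -= lr * err * x          # gradient of 0.5*err^2 with respect to w
        b -= lr * err              # gradient with respect to b
    return w, b

w, b = sgd_linear([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
print(round(w, 2), round(b, 2))  # approaches the true line y = 2x + 1
```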
Partial Least Squares. PLS is an efficient regression model, based on covariance, that is often used in circumstances where there are many independent variables, particularly when those variables are correlated. PLS reduces the number of variables to a smaller set of predictors [35]. This smaller set is then used to perform the regression analysis. There are two types of PLS for regression, PLS 1 and PLS 2; the difference between them is whether there is one dependent variable or multiple, respectively. Given that there was only one dependent variable in this research, PLS 1 was used [35]. The formula for PLS regression is described in Equation (4):

$$Y = XB + F, \qquad B = W (P^{T} W)^{-1} C^{T} \qquad (4)$$

where $Y$ is the matrix of dependent variables, $X$ the matrix of independent variables, and $B$ the matrix of regression coefficients generated by PLS of $Y$ on $X$ with $h$ components. Meanwhile, $T$, $U$, $P$, $W$, and $C$ are the score, loading, and weight matrices generated by the algorithm, and $F$ is the residual from the algorithm.
Ridge Regression. Ridge regression estimates the coefficients of multiple regression models where the independent variables are highly correlated. The ability of ridge regression to handle this multicollinearity separates it from partial least squares regression. As a result, ridge regression is most often used in applications with many independent variables. Typically, when using ridge regression, it can be assumed that the independent and dependent variables have been centered [36]. In ridge regression, $\ell_2$ regularization is used, such that the penalized sum of squares is minimized to yield the ridge coefficients.
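In the one-feature case with centered data, the penalized sum of squares has the closed-form minimizer $b = \sum x_i y_i / (\sum x_i^2 + \lambda)$, which makes the shrinkage effect of the penalty visible; a sketch with illustrative data:

```python
def ridge_1d(xs, ys, lam):
    """Ridge coefficient for centered data with one feature: minimizes
    sum((y - b*x)^2) + lam * b^2, giving b = Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [-1.0, 0.0, 1.0]        # already centered
ys = [-2.0, 0.0, 2.0]
print(ridge_1d(xs, ys, 0.0))  # lam = 0 reduces to ordinary least squares: 2.0
print(ridge_1d(xs, ys, 2.0))  # the l2 penalty shrinks the coefficient: 1.0
```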
Kernel Ridge. Kernel ridge combines the linear least squares and $\ell_2$-norm penalty of ridge regression with the kernel trick. Therefore, a linear function is learned in the space induced by the respective kernel and data [37]. In this study, the model uses a polynomial kernel function with a degree of 10; therefore, a nonlinear function is learned with respect to the original space. The resulting kernel ridge regression model differs from SVR in the loss function that is used [37]: kernel ridge uses squared error loss, while SVR typically uses epsilon-insensitive loss.
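The degree-10 polynomial kernel mentioned above can be sketched as follows; the offset `coef0` is an assumed default, not a parameter reported by the paper:

```python
def poly_kernel(x, z, degree=10, coef0=1.0):
    """Polynomial kernel K(x, z) = (x . z + coef0)^degree. With degree=10 this
    matches the degree described above; coef0 is an assumed offset."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

# Degree 2 keeps the arithmetic easy to follow by hand.
print(poly_kernel((1.0, 0.0), (0.5, 0.5), degree=2))  # (0.5 + 1)^2 = 2.25
```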
Elastic Net. Elastic net is a sparse learning regressor that addresses the limitations of lasso and ridge regression while maintaining both as special cases. It uses a weighted combination of the $\ell_1$- and $\ell_2$-norms, the regularization methods used by lasso and ridge, respectively [38]. In lasso regression, the independent variables are shrunk toward a central value. Elastic net is able to generate reduced models by creating zero-valued coefficients [39]. The algorithm is often preferred because it can apply the optimal regularization technique based on the nature of the data. As a result, elastic net is considered a parent model to lasso and ridge regression [38]. Elastic net is described further in Equation (5):

$$\min_{\beta_0, \beta} \left( \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{T} \beta \right)^2 + \lambda P_\alpha(\beta) \right), \qquad P_\alpha(\beta) = \sum_{j=1}^{p} \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right) \qquad (5)$$

where $N$ is the number of observations, $y_i$ is the response at observation $i$, $x_i$ is the data as a vector of $p$ values at observation $i$, $\lambda$ is a positive regularization parameter, $P_\alpha(\beta)$ is the penalty term, $\alpha$ is a scalar that ranges between zero and one, $\beta_0$ is a scalar, and $\beta$ is a vector of $p$ coefficients [39]. When $\alpha$ equals one, the elastic net applies the $\ell_1$-norm and functions like lasso regression; as $\alpha$ approaches zero, the elastic net approaches the $\ell_2$-norm, therefore functioning comparably to ridge regression. If the elastic net is operating similarly to ridge regression, the algorithm uses gradient descent to generate the projections; if it is either completely or partially configured to operate as lasso regression, subgradient descent or coordinate descent is used. When $\alpha$ is strictly between zero and one, both the $\ell_1$- and $\ell_2$-norms are used by the algorithm.
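The penalty term $P_\alpha(\beta)$ of Equation (5), and its collapse to the lasso and ridge special cases at the two extremes of $\alpha$, can be sketched directly (toy coefficients, illustrative only):

```python
def elastic_net_penalty(beta, alpha):
    """Elastic net penalty P_alpha(beta) from Equation (5): a weighted blend of
    the l2 (ridge) and l1 (lasso) norms of the coefficient vector."""
    return sum((1 - alpha) / 2 * b * b + alpha * abs(b) for b in beta)

beta = [1.0, -2.0]
print(elastic_net_penalty(beta, 1.0))  # alpha = 1: pure l1 -> |1| + |-2| = 3.0
print(elastic_net_penalty(beta, 0.0))  # alpha = 0: pure l2/2 -> (1 + 4)/2 = 2.5
```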
Decision Tree. Tree-based regression models benefit from a simple structure and efficiency on datasets with large domains. This is a result of the fast divide-and-conquer behavior of the model, based on a greedy algorithm wherein the larger dataset is split recursively into smaller partitions [40]. These tree-based algorithms are effective for large datasets yet have shortcomings, such as instability on smaller datasets. This instability can arise when a small change during the training phase leads to different nodes being created, causing inconsistent results. A decision tree is composed of the potential decisions and their corresponding repercussions, constructed in a flowchart-like tree structure [32]. The outcome of a node is represented by its branches or edges. Each node is either a decision node, a chance node, or an end node; a boolean condition is represented by the branches or edges, and the decision tree weighs the three aforementioned node types.
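The greedy split at the heart of a regression tree can be sketched for a single feature: pick the threshold that minimizes the summed squared error of the two resulting partitions (the helper name and toy data are illustrative):

```python
def best_split(xs, ys):
    """Greedy split for a regression tree on one feature: choose the threshold
    that minimizes the summed squared error of the two partitions."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best_score, best_t = float("inf"), None
    for t in sorted(set(xs))[1:]:          # candidate thresholds between points
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = sse(left) + sse(right)
        if score < best_score:
            best_score, best_t = score, t
    return best_t

# Targets jump between x=2 and x=10, so the greedy split lands at x=10.
print(best_split([0, 1, 2, 10, 11], [1.0, 1.1, 0.9, 5.0, 5.2]))  # -> 10
```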
Random Forest. Random forest is a type of supervised learning algorithm that effectively uses ensemble bagging to tackle regression- or classification-based problems. During the training phase, the algorithm creates multiple decision trees and then outputs the mean prediction of the trees [
41]. The benefit of having multiple trees, instead of just one, is that the collection of trees protects against the errors of its individual members. The random forest model acts as an aggregator, averaging the projections of all the decision trees constructed. In this study, both the random forest and the decision tree algorithms use squared error as the loss function.
Ensemble Bagging. The basic principle behind ensemble methods is to combine a group of baseline models, typically considered weak learners, into a more robust model [14]. The more robust a model is, the more capable it is of adapting to changes in the dataset, thereby providing more accurate and reliable projections. Three types of ensemble methods are typically used: bagging, boosting, and stacking. In this study, a version of ensemble bagging composed of random forest models is utilized. In bagging, whose name is derived from bootstrap aggregation, multiple baseline models are trained in parallel on subsets of the training data. During the training phase, bootstrapping occurs, in which the original dataset is randomly sampled with replacement. Sampling with replacement means that every time a sample is collected by a model, it is then replaced [42]. This ensures that each round of sampling is independent and does not interfere with the next round. The final prediction of the algorithm is then obtained from a voting aggregation of the predictions of the baseline models [14]. Because random sampling with replacement is used within ensemble methods, the variance of the projections is reduced without altering the biases of the models.
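The two ingredients described above, bootstrap sampling with replacement and averaging the base models' predictions, can be sketched as follows (the helpers and the lambda base models are illustrative placeholders, not fitted trees):

```python
import random

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement: each draw is put back, so
    rounds of sampling are independent of one another."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Aggregate base-model predictions by averaging (regression voting)."""
    return sum(m(x) for m in models) / len(models)

rng = random.Random(0)
data = [1, 2, 3, 4, 5]
resampled = bootstrap_sample(data, rng)
print(len(resampled))  # same size as the original dataset: 5
print(bagged_predict([lambda x: x + 1, lambda x: x - 1], 10))  # mean -> 10.0
```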
Support Vector Regression. Support vector regression is a generalized version of support vector machines. SVRs are well suited to time-series predictions, the condition that governs forecasts of PV power generation from irradiance. In a more general sense, the SVR derives a function that maps the input patterns to the outputs. This is done based on a given set of training data, with the aim of minimizing error by tuning the hyperparameters. The input features are mapped, using a nonlinear mapping process, to a high-dimensional space [15]. The nature of the SVR is described in Equation (6):

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + b \qquad (6)$$

where $\{x_i\}$ is the training set and many of the $\alpha_i$'s are equal to zero. However, there are some limitations to the SVR algorithm: it lacks a probabilistic interpretation, there is difficulty in selecting the optimal regularization parameter $C$, and the algorithm is restricted to using positive semidefinite kernels [43].
The projections of NuSVR are also compared in this study. Nu is a parameter that controls the number of support vectors and replaces the parameter epsilon of epsilon-SVR [44]. In this case, a nu value of 0.35 was used. For both SVR and NuSVR, the radial basis function was used as the kernel.
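A sketch of the prediction in Equation (6) with the RBF kernel used by both SVR and NuSVR; the support vectors, alpha coefficients, and bias below are toy values for illustration, not fitted parameters:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(support_vectors, alphas, b, x, gamma=1.0):
    """Equation (6): f(x) = sum_i alpha_i * K(x_i, x) + b. Only instances with
    nonzero alpha (the support vectors) contribute to the prediction."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for sv, a in zip(support_vectors, alphas)) + b

# Toy coefficients: the support vector at the query point contributes alpha * 1.
print(svr_predict([(0.0,), (1.0,)], [0.5, -0.25], 0.1, (0.0,)))
```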
Multilayer Perceptron. MLP is a type of feed-forward supervised learning algorithm. It is composed of three layers: an input layer, a hidden layer, and an output layer. Each layer is multidimensional and can handle nonlinear calculations; in the case of this study, however, the output has just one dimension. Each neuron in the hidden layer transforms the outputs of the previous layer using a weighted linear summation, followed by a nonlinear activation function [45]. Backpropagation is used, with no activation function in the output layer, effectively using the identity function as the output activation. In this study, the rectified linear unit activation function is used for forward propagation. Additionally, the Adam optimizer, an extended version of stochastic gradient descent, and the squared error loss function are implemented in the MLP. The MLP algorithm is beneficial in that it is capable of learning nonlinear models and of learning in real time. However, the hidden layers have a nonconvex loss function, which means multiple local minima can exist [45]. Therefore, differences in the random weight initialization can cause differences in validation accuracy. Additionally, the MLP is sensitive to feature scaling.
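The forward pass described above (weighted linear summation, ReLU activation in the hidden layer, identity activation on the output) can be sketched for a tiny network; the weights below are arbitrary illustrative values, not trained parameters:

```python
def relu(v):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, x) for x in v]

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer MLP: hidden = relu(W1 x + b1), i.e. a weighted linear
    summation followed by the nonlinear activation; the output layer uses the
    identity activation, as in regression."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) + b
                   for row, b in zip(W1, b1)])
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Tiny illustrative network: 2 inputs, 2 hidden units, 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
w2 = [1.0, 2.0]
print(mlp_forward([3.0, 1.0], W1, b1, w2, 0.5))  # hidden [2, 2] -> 2 + 4 + 0.5 = 6.5
```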
Long Short-Term Memory. LSTM networks are composed of a few types of gates that contain information about the previous state. The information of the LSTM is either written, stored, read, or eliminated in the cells that serve as the memory of the model [21]. These four operations are accomplished through the opening or closing of the gates. The cells act on the signals they receive and, based on the strength of a signal, either transmit or block information. The LSTM model is composed of three different states: the input, hidden, and output states. Within each unit of the LSTM, there exist a cell state, $c_t$; an input gate, $i_t$; an output gate, $o_t$; and a forget gate, $f_t$, displayed in Figure 2. The forget gate is tasked with determining which information is kept or eliminated from the cell state [21]. This decision is determined by the logistic function $\sigma$, as described in Equation (7), which outputs a value between zero and one, where values near zero discard the information and values near one retain it:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (7)$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (8)$$

where, in Equations (7) and (8), $\sigma$ is the activation function, $W_f$ is the weight of the forget gate, $b_f$ is the bias of the forget gate, $x_t$ is the input at time $t$, $h_{t-1}$ is the hidden layer at time $t-1$, $W_c$ is the weight of the cell, and $b_c$ is the bias of the cell. The input gate, forget gate, cell state, and output gate are shown in Figure 2, the LSTM cell. The input gate, $i_t$, and the cell state, $c_t$, are described in Equation (8). The input gate determines which input values are updated by the blocks of the LSTM.
The output gate determines which segment of the cell state is permitted to output. The formula for the output state, as described in Equation (9), includes a tanh that is multiplied by a logistic function whose output is scaled similarly to the forget gate:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(c_t) \qquad (9)$$

where $\sigma$ is the activation function, $W_o$ is the weight of the output gate, and $b_o$ is the bias of the output gate. The input data to the LSTM are composed of a three-dimensional array: the first dimension is the number of samples, the second is the number of time steps, and the third is the number of features in one input sequence [21]. In order for the LSTM to properly handle the dataset, a sliding window was created so that data could be input into the algorithm; the sliding window is discussed in greater detail in the preprocessing section. The resulting size of the three-dimensional array input into the LSTM was 11 by 3 by 20. The version of the LSTM in this study uses a batch size of 64, a hidden size of 64, three dropout layers, and an MSE loss function, given the regression nature of the dataset. Additionally, the Adadelta optimizer was determined, by trial and error, to yield the best performance. The Adadelta optimizer is a more robust version of the Adagrad optimizer: it adapts the learning rates based on a moving window of gradient updates, so it is not necessary to set an initial learning rate [46].
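One step of the LSTM cell, following the gate structure of Equations (7)-(9), can be sketched in the scalar toy case; the weights and biases below are arbitrary illustrative values, not the study's trained parameters:

```python
import math

def sigmoid(v):
    """Logistic function: squashes its input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM cell update. Each gate is a sigmoid of a weighted
    combination of the previous hidden state and the current input, plus a bias."""
    f = sigmoid(p["Wf_h"] * h_prev + p["Wf_x"] * x_t + p["bf"])        # forget gate
    i = sigmoid(p["Wi_h"] * h_prev + p["Wi_x"] * x_t + p["bi"])        # input gate
    c_tilde = math.tanh(p["Wc_h"] * h_prev + p["Wc_x"] * x_t + p["bc"])  # candidate
    c = f * c_prev + i * c_tilde                                       # new cell state
    o = sigmoid(p["Wo_h"] * h_prev + p["Wo_x"] * x_t + p["bo"])        # output gate
    h = o * math.tanh(c)                                               # new hidden state
    return h, c

params = {k: 0.5 for k in ["Wf_h", "Wf_x", "bf", "Wi_h", "Wi_x", "bi",
                           "Wc_h", "Wc_x", "bc", "Wo_h", "Wo_x", "bo"]}
h, c = lstm_step(1.0, 0.0, 0.0, params)
print(round(h, 4), round(c, 4))
```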
Gated Recurrent Unit. The GRU is a type of RNN and was introduced in 2014. It was designed to solve the vanishing gradient problem that affects standard RNNs [47]. Like the LSTM, the GRU is able to handle sequential data, such as time series, speech, and text, and it uses gating mechanisms to selectively update the hidden state, subsequently updating the output. In particular, the GRU's gating mechanism comprises an update gate and a reset gate. Unlike the LSTM, however, the GRU does not contain an internal cell state. In the GRU model, the reset gate determines how much of the previous information in the hidden state should be forgotten; it is analogous to the combined input and forget gates of the LSTM [48]. Meanwhile, the update gate determines how much of the previous information should update the hidden state and subsequently be passed to future units of the algorithm; it is comparable to the output gate of the LSTM. The current memory gate is a subset of the reset gate. This gate introduces nonlinearity into the input data. Another benefit of the current memory gate being a subset of the reset gate is that it reduces the impact of previous information on the current information that will be transmitted to future units [48]. The final output of the GRU model is calculated based on the hidden state and is described in Equation (10):

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]), \quad z_t = \sigma(W_z \cdot [h_{t-1}, x_t]), \quad \tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (10)$$

where $r_t$ is the reset gate, $z_t$ is the update gate, $\tilde{h}_t$ is the candidate hidden state, $h_t$ is the hidden state, $h_{t-1}$ is the prior hidden state, $W_r$ and $W_z$ are the learnable weight matrices, and $x_t$ is the input at time step $t$. The sigmoid function is applied to scale the result between zero and one. The GRU model is able to solve the vanishing gradient problem by storing the relevant information from one time step of the network to the next [47]. The GRU used in this research shared the dimensionality of the input data with the LSTM, because the same sliding window was employed. Additionally, the model utilized an averaged stochastic gradient descent optimizer with an MSE loss function.
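One step of the GRU cell, following Equation (10), can be sketched in the scalar toy case; the weights below are arbitrary illustrative values, not the study's trained parameters:

```python
import math

def sigmoid(v):
    """Logistic function: squashes its input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x_t, h_prev, p):
    """One scalar GRU cell update: reset gate r, update gate z, candidate
    hidden state h_tilde; there is no separate internal cell state."""
    r = sigmoid(p["Wr_h"] * h_prev + p["Wr_x"] * x_t)                    # reset gate
    z = sigmoid(p["Wz_h"] * h_prev + p["Wz_x"] * x_t)                    # update gate
    h_tilde = math.tanh(p["Wh_h"] * (r * h_prev) + p["Wh_x"] * x_t)      # candidate
    return (1.0 - z) * h_prev + z * h_tilde                              # new hidden state

params = {"Wr_h": 0.5, "Wr_x": 0.5, "Wz_h": 0.5,
          "Wz_x": 0.5, "Wh_h": 0.5, "Wh_x": 0.5}
h = gru_step(1.0, 0.0, params)
print(round(h, 4))
```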