1. Introduction
In theoretical research, datasets are usually assumed to be balanced. In real applications, however, imbalanced datasets are common, which poses challenges for data analysis in almost every research field [1]. If the problem of data imbalance is ignored, it is unreasonable to predict directly from the original dataset. The defining characteristic of imbalanced data is that the sample size of one category is much smaller than that of the others: most samples fall in the normal range, while the minority samples fall in the abnormal range. As a result, the prediction model tends to predict minority samples into the majority sample intervals, which reduces prediction accuracy for the minority samples [2]. However, the information extracted from the minority samples is usually more valuable than the information extracted from the majority. Correctly handling data imbalance is therefore vital for improving the performance of prediction models and has become an important topic in current research. Many solutions to the data imbalance problem have been proposed; they fall into two main categories: solutions for classification problems and solutions for regression problems [3].
The imbalanced classification problem refers to an imbalance in the number of samples across categories in a classification task [4]. Many studies have been carried out to address imbalanced classification data [5]. Existing solutions include resampling [6,7,8,9,10], ensemble learning [11,12,13,14,15], sample generation [16,17,18,19,20,21,22], and so on. Among them, resampling and ensemble learning both start from the local neighborhood of sample points, without considering the overall distribution of the original dataset. Sampling from the data distribution to increase the number of minority samples is an ideal way to handle imbalanced data, and this is where the idea of using a generator to produce minority samples applies. The most typical sample generation algorithms are the variational auto-encoder (VAE) and the generative adversarial network (GAN). To learn the distribution of the original samples, Kingma et al. [16] proposed the VAE algorithm. The algorithm comprises an encoder, which learns the distribution of the original samples, and a decoder, which generates samples conforming to that distribution. However, the VAE model generalizes poorly. Goodfellow et al. [17] proposed the GAN algorithm. A GAN consists of two networks: a generator and a discriminator. The generator aims to produce samples as similar to the real samples as possible in order to deceive the discriminator, while the discriminator aims to distinguish generated samples from real ones. When the Nash equilibrium is reached, both networks perform optimally and the generator can produce high-quality samples. However, GAN training is unstable owing to vanishing gradients and mode collapse [18]. To improve stability, a large number of GAN variants have been developed: some change the model structure (internal or external), some add input conditions, and some change the loss function. For internal structural changes, the deep convolutional generative adversarial network (DCGAN) [19] uses deconvolutional and convolutional neural networks to construct the generator and the discriminator, respectively, and provides experimental guidance on how to build a stable GAN. For conditional settings, the conditional generative adversarial network (CGAN) [20] adds conditional variables to both the generator and the discriminator so that sample generation is conditioned on those variables. For loss function changes, the Wasserstein generative adversarial network (WGAN) [21] replaces the JS divergence with the Wasserstein distance to estimate the distance between the real and generated sample distributions, making adversarial learning more stable. Bao et al. [22] proposed the CVAE-GAN model, which builds on the CVAE model and uses CGANs to optimize the generator so that the generated samples are both realistic and diverse. Although a large number of studies have been carried out, they focus mainly on categorical data with discrete target variables.
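To make the adversarial objective concrete, the following is a minimal PyTorch sketch of the original GAN training step described above; the network sizes, learning rates, and stand-in data are illustrative assumptions, not details of any model discussed here.

```python
import torch
import torch.nn as nn

# Minimal generator and discriminator for tabular data (sizes are illustrative).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
D = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 8)   # stand-in for a batch of real samples
z = torch.randn(32, 16)     # random noise fed to the generator

# Discriminator step: label real samples 1 and generated samples 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator call generated samples real.
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

At the Nash equilibrium described above, the discriminator outputs roughly 0.5 for both real and generated inputs and can no longer tell them apart.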
The imbalanced regression problem occurs when some target values in a regression dataset appear with extremely low frequency; the model easily ignores these rare target values, resulting in poor prediction performance on the corresponding samples [23]. To address regression data imbalance, methods designed for imbalanced classification have mainly been applied directly to the imbalanced regression task [24]. Torgo et al. [25] applied the SMOTE algorithm, which generates classification samples, to the regression problem, but the method works in the same way as the SMOTE algorithm for imbalanced classification data. Because the SMOTE algorithm cannot learn the data distribution of an imbalanced dataset, it easily produces the problem of distribution marginalization. Branco et al. [26] proposed the REBAGG algorithm, an ensemble method based on bagging combined with data preprocessing, to address imbalanced data in regression tasks. Despite their effectiveness, these methods do not take the regression characteristics of the samples into account. In a regression problem, the model predicts a value rather than a category, so the imbalanced regression problem requires a more careful study of the data distribution to predict values accurately; it is more complex than the imbalanced classification problem. By analyzing the relationship between the target value distribution of regression samples and the test error of the prediction model, Yang et al. [27] proposed the concept of deep imbalanced regression and, according to the characteristics of imbalanced regression data, introduced label distribution smoothing (LDS) and feature distribution smoothing (FDS). However, this method still relies on data interpolation when generating minority regression samples, which is prone to overfitting. Ren et al. [28] proposed a new loss function, balanced mean squared error (BMSE), for imbalanced regression; the BMSE loss uses weighting to assign different degrees of importance to different samples. Rahul et al. [29] proposed the spatial-SMOTE algorithm, which oversamples rare events while preserving the spatial distribution of the data. Nevertheless, these two algorithms are only applicable to specific datasets. Analysis of the existing research shows that work on imbalanced regression focuses mainly on model ensembles and data interpolation. Ensemble approaches apply solutions for imbalanced classification directly to imbalanced regression without considering the continuity of the sample target values, while interpolation approaches generate minority regression samples without considering the regression characteristics of the original samples. In addition, the existing methods suffer from slow convergence and limited applicability.
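To make the interpolation idea concrete, here is a minimal NumPy sketch of SMOTE-style generation for regression, assuming the rare samples have already been identified; the neighbor count and the linear interpolation of the target are simplifying assumptions, not the exact procedure of [25].

```python
import numpy as np

def smote_regression(X_rare, y_rare, k=5, n_new=100, seed=0):
    """Interpolate new (x, y) pairs between rare samples and their nearest rare neighbors."""
    rng = np.random.default_rng(seed)
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.integers(len(X_rare))
        d = np.linalg.norm(X_rare - X_rare[i], axis=1)   # distances to all rare samples
        j = rng.choice(np.argsort(d)[1:k + 1])           # a random one of the k nearest
        t = rng.random()                                 # interpolation weight in [0, 1]
        X_new.append(X_rare[i] + t * (X_rare[j] - X_rare[i]))
        y_new.append(y_rare[i] + t * (y_rare[j] - y_rare[i]))  # target is interpolated too
    return np.array(X_new), np.array(y_new)
```

Because every new point lies on a segment between existing samples, such interpolation cannot reach beyond the observed data, which relates to the marginalization problem noted above.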
Considering the defects of existing methods for imbalanced regression, this paper proposes the IRGAN algorithm. The algorithm addresses two tasks: (1) for the problem that the number of original regression samples varies greatly across target value intervals, it combines optimization and adversarial ideas to generate regression samples; (2) it combines the generated samples and the original samples into new samples for regression prediction. The algorithm comprises four modules: generation, correction, discriminant, and regression. Because the data distribution is imbalanced, the generation module is designed first, taking the contextual information of the data into account. Combined with the characteristics of the regression problem, a composite loss function is designed to guide sample generation and further ensure the quality of the generated samples. In the early stages of training, the gap between the generated and real samples is large, which easily prevents the training process from converging. The correction module is therefore introduced: it applies the optimization idea to decision making and, combined with a deep neural network, learns the internal relationship between the state action pairs and the subsequent state reward pairs of the real samples, guiding the generation module and improving both the quality of the generated samples and the convergence speed. Then, based on the adversarial idea, the generation and discriminant modules are optimized continuously until they reach the Nash equilibrium, at which point the generation module can generate high-quality samples to balance the original samples. Finally, regression prediction is performed by the regression module. In short, we propose the IRGAN algorithm to solve the imbalanced regression problem; its effectiveness and feasibility are demonstrated by experiments in the fields of aerospace, biology, physics, and chemistry. The main contributions of this paper are as follows:
(1) For the imbalanced regression problem, this study takes into account the continuity of the target variable and the correlation within the data, and generates regression samples with a method that combines optimization and adversarial ideas.
(2) According to the continuous and imbalanced characteristics of the target variable of the original regression data, a generation module is designed to generate samples closer to the original samples.
(3) Focusing on the large gap between the generated and real samples and the resulting non-convergence in the initial stages of training, a correction module is designed to guide the generation module.
2. Algorithm Design
In traditional regression problems, the predicted numerical variable usually has a balanced distribution; that is, each value occurs in the dataset with roughly equal frequency. The algorithm can therefore optimize the model by minimizing criteria such as the average error or the mean square error. In imbalanced regression problems, however, the distribution of the predicted numerical variable is imbalanced: some values occur far more frequently than others. Owing to this imbalance in the training data, a traditional regression model tends to predict toward the intervals with many samples, which leads to poor predictions for minority samples [30]. To improve regression prediction accuracy on imbalanced data, we designed the IRGAN algorithm. The algorithm tackles two tasks. First, for the problem that the number of original regression samples varies greatly across target value intervals, high-quality minority samples are generated to balance the original samples. Then, the generated samples and the original samples are combined into new samples for regression prediction. The algorithm maintains four sample pools: (1) the original data pool $D$, which stores all the original samples; (2) the real sample pool $D_1$, which collects real samples obtained while the agent interacts with the environment; (3) the fake sample pool $D_2$, which collects samples produced by the generation module; the real samples in $D_1$ and the fake samples in $D_2$ together provide training samples for the agent; and (4) the balanced data pool $D'$, which stores the balanced samples, i.e., the original samples plus the samples produced by the trained generation module; the data in this pool are finally used for regression prediction. The algorithm consists of four modules. (1) The generation module generates minority samples. (2) The correction module comprises an agent and a correction network. Similar to the human brain, the agent perceives environmental information and makes optimal decisions. The correction network learns the relationship between the state action pair $(s, a)$ and the subsequent state reward pair $(s', r)$ and guides the generation module, improving performance and accelerating the convergence of model training. (3) The discriminant module determines whether an input sample is real or fake, feeds the discriminant information back to the generation module, and thereby optimizes the generation module to improve the quality of the generated samples. (4) The regression module performs regression prediction on the balanced samples.
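As a reading aid, the four sample pools can be pictured as a simple container structure; the following Python sketch uses the names from the text, with placeholder contents.

```python
from dataclasses import dataclass, field

@dataclass
class SamplePools:
    """The four sample pools of IRGAN; contents here are placeholders."""
    D: list = field(default_factory=list)      # original data pool: all original samples
    D1: list = field(default_factory=list)     # real sample pool: agent-environment interactions
    D2: list = field(default_factory=list)     # fake sample pool: generation module outputs
    D_bal: list = field(default_factory=list)  # balanced pool D': originals + generated samples

pools = SamplePools(D=[(0.1, 1.0), (0.3, 2.0)])  # placeholder (feature, target) records
pools.D_bal = pools.D + [(0.2, 1.5)]             # originals plus a generated minority sample
```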
The overall research framework is shown in Figure 1a, and the model structure of the IRGAN algorithm is shown in Figure 1b. The following sections introduce the four modules in turn: the generation module, the discriminant module, the correction module, and the regression module.
2.1. Generation Module
The function of the generation module is to generate regression data. Firstly, random Gaussian noise $z$ is introduced into the generation module as the input signal. The purpose of inputting random noise is to increase the diversity of the generation module and improve its exploration ability so as to better simulate the data distribution. After processing by the neural network, the noise is output as the generated sample $G(z)$. In the standard GAN, the loss function $L_G$ of the generation module minimizes the JS divergence between the generated distribution and the real distribution. However, when the two distributions do not overlap, or the overlap is extremely small, the JS divergence is constant: the loss function stays at a constant value, the gradient disappears, and the generation module cannot be updated. Focusing on this problem of gradient disappearance during training, this paper designs the loss function of the generation module to use the Mahalanobis distance between the generated data and the real data, which is more suitable for generating regression data and effectively alleviates gradient disappearance during training.
The loss function of the generation module is divided into two parts, written for the generated data and for the distance between the generated data and the real data, respectively. The overall loss function is shown in (1):

$$L_G = L_{G_1} + \lambda L_{G_2} \qquad (1)$$

(1) The first part of the generation module loss function concerns the generated data, as shown in (2):

$$L_{G_1} = -\log D\big(G(z)\big) \qquad (2)$$

where $G(z)$ is the generated fake sample, which starts from random noise $z$ and is produced by the generation module. The goal of the generation module is to generate samples close enough to the real samples that the discriminant module judges them to be real; that is, as $D(G(z))$ approaches 1, $L_{G_1}$ approaches 0, which minimizes the loss function of the generation module. The smaller the loss of the generation module, the more realistic the generated samples, and the discriminant module can no longer identify them.
(2) Considering the correlation between the features of regression data, the second part of the generation module loss function adds the Mahalanobis distance to measure the distance between the real and generated samples. The Mahalanobis distance is a common distance index in metric learning. Like the Euclidean, Manhattan, and Hamming distances, it serves as a similarity index between data, but it can handle non-independent and non-identically distributed dimensions in high-dimensional data. Adding this term to the loss function of the generation module gives:

$$L_{G_2} = \sqrt{\big(x - G(z)\big)^{\top}\,\Sigma^{-1}\,\big(x - G(z)\big)} \qquad (3)$$

where $\Sigma$ is the covariance matrix and $\Sigma^{-1}$ is its inverse. The Mahalanobis distance accounts for the correlation between features and is invariant to all nonsingular linear transformations. It is therefore unaffected by the choice of feature dimension and considers the influence of dimension on sample distance, measuring the distance between the real and generated samples more scientifically. After the loss value of the generation module is calculated, back propagation continuously reduces the distance between the real and generated samples, bringing the generated samples closer to the real ones and thereby improving their quality. Adding this term improves the convergence speed of the model and is also more conducive to regression prediction. $\lambda$ is a hyperparameter that adjusts the weight of the Mahalanobis distance in the generation module loss function; its influence on the model is discussed in the subsequent experiments. In addition, the Mahalanobis distance behaves more smoothly than the JS divergence: even when the overlap between the two distributions is extremely small, it still reflects the distance between them. Therefore, after this term is added, the loss function of the generation module does not become constant during training and the gradient can be updated continuously, which effectively alleviates gradient disappearance.
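A minimal PyTorch sketch of the composite loss (1)-(3) under our assumptions: the adversarial term uses the non-saturating form implied by (2), real and generated batches are matched element-wise for the distance term, and the covariance is estimated from the real batch with a small regularizer so it is invertible.

```python
import torch

def mahalanobis(real, fake, eps=1e-6):
    """Mean Mahalanobis distance between matched real and generated batches, Eq. (3)."""
    diff = real - fake                                        # (batch, features)
    cov = torch.cov(real.T) + eps * torch.eye(real.shape[1])  # regularized covariance Sigma
    m = torch.einsum('bi,ij,bj->b', diff, torch.linalg.inv(cov), diff)
    return torch.sqrt(torch.clamp(m, min=1e-12)).mean()

def generator_loss(d_fake, real, fake, lam=0.1):
    """Eq. (1): adversarial term of Eq. (2) plus the lambda-weighted distance of Eq. (3)."""
    adv = -torch.log(d_fake + 1e-8).mean()   # -> 0 as D(G(z)) -> 1
    return adv + lam * mahalanobis(real, fake)
```

Pairing real and generated samples one-to-one within a batch is our simplification; any matching scheme or batch-statistics variant would fit the same loss shape.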
2.2. Correction Module
At the beginning of training, the samples produced by the generation module are only theoretically feasible and differ greatly from the real samples, which easily destabilizes the training process; they therefore need to be corrected by the correction module. The function of the correction module is to guide the generation module. The correction module consists of an agent and a correction network. Specifically, training samples are drawn from both the real and the generated samples and provided to the agent for training the action value function network, so as to find the optimal strategy. The correction network learns the relationship among the state, action, and reward.
A real sample in the learning process is defined as a pair of states and actions together with the subsequent state and reward. The state at the previous moment and the corresponding action form the state action pair $(s, a)$; the state at the next moment and the reward form the subsequent state reward pair $(s', r)$. Therefore, a real sample $x$ can be divided into two parts:

$$x = \big[(s, a),\ (s', r)\big] \qquad (4)$$

where $(s, a)$ denotes the state action pair and $(s', r)$ denotes the subsequent state reward pair. With $(s, a)$ as input and $(s', r)$ as output, the correction network is trained to capture the internal relationship between $(s, a)$ and $(s', r)$. Consistent with the real samples, a sample $\tilde{x}$ generated by the generation module can also be divided into two parts:

$$\tilde{x} = \big[(\tilde{s}, \tilde{a}),\ (\tilde{s}', \tilde{r})\big] \qquad (5)$$

where $(\tilde{s}, \tilde{a})$ represents the generated state action pair and $(\tilde{s}', \tilde{r})$ represents the generated subsequent state reward pair. To improve the quality of the generated samples, given the generated $(\tilde{s}, \tilde{a})$, the relationship between $(\tilde{s}, \tilde{a})$ and $(\tilde{s}', \tilde{r})$ should be consistent with the relationship in the real samples. Therefore, the generated $(\tilde{s}, \tilde{a})$ is input into the correction module, and its output is taken as the constructed subsequent state reward pair $(\hat{s}', \hat{r})$. The goal is for the generated subsequent state reward pair $(\tilde{s}', \tilde{r})$ and the constructed subsequent state reward pair $(\hat{s}', \hat{r})$ to be highly similar, which drives the generation module to produce more high-quality samples and accelerates the convergence of model training. The loss function of the correction module is:

$$L_C = \big\| (\tilde{s}', \tilde{r}) - (\hat{s}', \hat{r}) \big\|^2 \qquad (6)$$

where $(\tilde{s}', \tilde{r})$ denotes the generated subsequent state reward pair and $(\hat{s}', \hat{r})$ denotes the constructed subsequent state reward pair.
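The following PyTorch sketch, assuming illustrative dimensions and mean squared error as the similarity measure in (6), shows the two roles of the correction network: fitting the $(s, a) \to (s', r)$ relationship on real samples and scoring generated samples against their constructed counterparts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sa_dim, sr_dim = 6, 5   # illustrative sizes of (s, a) and (s', r)

# Correction network: maps a state action pair to a subsequent state reward pair.
C = nn.Sequential(nn.Linear(sa_dim, 64), nn.ReLU(), nn.Linear(64, sr_dim))
opt_c = torch.optim.Adam(C.parameters(), lr=1e-3)

# (a) Fit C on real samples so it captures the (s, a) -> (s', r) relationship.
sa_real, sr_real = torch.randn(32, sa_dim), torch.randn(32, sr_dim)  # stand-ins
loss_fit = F.mse_loss(C(sa_real), sr_real)
opt_c.zero_grad(); loss_fit.backward(); opt_c.step()

# (b) Score generated samples: compare the generated (s', r) against the
# constructed pair C(generated (s, a)); this realizes Eq. (6). In training,
# this loss would update the generation module rather than C.
sa_gen, sr_gen = torch.randn(32, sa_dim), torch.randn(32, sr_dim)    # from the generator
loss_corr = F.mse_loss(sr_gen, C(sa_gen))
```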
2.3. Discriminant Module
The goal of the discriminant module is to extract the features of the input data and distinguish real samples from fake ones as accurately as possible. The real and fake samples are fed into the discriminant module, and the loss value of the module is calculated. The loss comprises a real loss and a fake loss. The loss function of the discriminant module is shown in (7):

$$L_D = L_{D_1} + L_{D_2} \qquad (7)$$

(1) The first part of the discriminant module loss function takes the real samples as input, as shown in (8):

$$L_{D_1} = -\log D(x) \qquad (8)$$

where $x$ is a real sample. The goal of the discriminant module is to judge a real sample as real: the closer the discriminant value $D(x)$ of the real sample is to 1, the closer $L_{D_1}$ is to 0, which minimizes the loss of the discriminant module.

(2) The second part of the discriminant module loss function takes the generated samples as input, as shown in (9):

$$L_{D_2} = -\log\big(1 - D(G(z))\big) \qquad (9)$$

where $G(z)$ is a generated sample. The goal of the discriminant module is to judge a generated sample as fake: the closer the discriminant value $D(G(z))$ of the generated sample is to 0, the closer $L_{D_2}$ is to 0, which minimizes the loss of discriminating fakes.
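In code, (7)-(9) are the standard binary cross-entropy terms of a GAN discriminator; a minimal sketch follows, where the small constant guards the logarithm.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Eq. (7) = Eq. (8) + Eq. (9): push D(x) toward 1 and D(G(z)) toward 0."""
    loss_real = -torch.log(d_real + eps).mean()       # Eq. (8): -> 0 as D(x) -> 1
    loss_fake = -torch.log(1 - d_fake + eps).mean()   # Eq. (9): -> 0 as D(G(z)) -> 0
    return loss_real + loss_fake
```

This is equivalent to the `BCELoss` form used in the introductory sketch, with targets 1 for real and 0 for fake inputs.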
2.4. Regression Module
The regression module is the last module of the algorithm and is mainly responsible for the regression prediction of the balanced samples. When a neural network is used for regression, a mapping from the input features to continuous output values must be established:

$$y = f(Wx + b) \qquad (10)$$

where $W$ is the weight matrix, $b$ is the bias, and $f$ usually represents the activation function. However, the scope of $f$ is broader than this: it can represent various types of functions, such as a regularization function or a normalization function. The regression process usually involves multi-layer nonlinear mappings. Suppose the processed data matrix has size $n \times m$, where $n$ is the number of samples and $m$ is the number of features. The input matrix $X$ and output matrix $Y$ of the regression prediction in the neural network are defined as follows:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad (11)$$
Before constructing the model, we first divide the input dataset into a training set and a test set, splitting the data in each target value interval in a 7:3 ratio. A total of 70% of the data is used to train the model so that it grasps the underlying rules of the data, and the remaining 30% serves as a test set to evaluate the model's performance in new situations. This segmentation strategy helps the model make accurate predictions on unknown data and enhances its generalization ability. In the model training phase, the loss function measures the size of the prediction error and is a key indicator of model performance. The loss function is shown in (12):

$$L_R = \frac{1}{n}\sum_{i=1}^{n}\big| y_i - \hat{y}_i \big| \qquad (12)$$

where $y_i$ is the real value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples. Compared with other loss functions, this loss function is more robust to outliers and is very useful in feature selection and model interpretation; it helps identify the most important features, simplifies the model, and improves generalization. The loss function averages the prediction error over all samples, reflecting the accuracy of the model's predictions. By optimizing this loss function, the model parameters are adjusted to bring the predicted values as close as possible to the real values, thereby improving the prediction ability of the model.
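A minimal PyTorch sketch of the regression module under our assumptions: stand-in balanced data, the 7:3 split described above, a small multi-layer network, and the mean absolute error of (12) as the training criterion (the network depth and training schedule are placeholders).

```python
import torch
import torch.nn as nn

# Stand-in balanced data (pool D'): 200 samples, 8 features, a noisy linear target.
X = torch.randn(200, 8)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)

# 7:3 train/test split, as described above.
idx = torch.randperm(len(X))
n_train = int(0.7 * len(X))
tr, te = idx[:n_train], idx[n_train:]

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mae = nn.L1Loss()   # Eq. (12): mean absolute error

for epoch in range(200):
    opt.zero_grad()
    loss = mae(model(X[tr]), y[tr])
    loss.backward()
    opt.step()

print('test MAE:', mae(model(X[te]), y[te]).item())
```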