1. Introduction
Renewable, pollution-free energy sources such as wind and solar power have difficulty generating electricity continuously and steadily. Energy storage is therefore crucial for renewable energy to become a fully reliable primary energy source [1]. With their high energy density and high power density, lithium-ion batteries are widely used in energy storage systems. The state of charge (SOC) is an important parameter for measuring the performance of Li-ion batteries, while the state of health (SOH) is a measure of a battery's lifetime [2]. Developing online estimation of the SOC and SOH of lithium-ion batteries is essential to extend battery cycle life and to reduce the potential for accidents.
Currently, there are three main methods for SOC estimation: the time integration method, the open-circuit voltage (OCV) method, and the data-driven method. The time integration method discretely sums the current flowing through the battery over time and obtains the SOC by dividing the accumulated charge by the battery capacity. Its advantage is simplicity and directness, without the need for additional sensors. However, due to measurement errors and integration drift, it may accumulate errors in the SOC estimate. The OCV method measures the open-circuit voltage of the battery and obtains the SOC from the corresponding relationship between the OCV and the SOC. Its advantage is that it is non-invasive and does not require additional measuring equipment. Reference [3] proposed a fast and accurate method to measure the OCV after comparing three conversion methods of differential equations, thus improving the accuracy of SOC prediction. Reference [4] proposed a novel constant-current/constant-voltage charging control strategy that adjusts the battery charging current based on an estimate of the open-circuit voltage parameter. Reference [5] used the open-circuit voltage method to obtain an SOC estimate based on the average value calculated from the random forest output OCV-SOC curve to reduce the hysteresis effect. Wang et al. [6] proposed a new method for calculating model parameters and estimating the SOC of lithium-ion batteries based on the parameter-estimated OCV under multi-temperature conditions. Although its accuracy is relatively high, the OCV method requires a long resting time to reach the equilibrium state in practical tests, and the resting time is affected by the environmental conditions and monitoring equipment, so it is usually used in laboratories or as an auxiliary calibration technique.
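As an illustration of the time integration (Coulomb counting) idea described above, the following minimal Python sketch accumulates sampled current and divides by the capacity. The function name, the sign convention (positive current means discharge), and the 2 Ah example cell are assumptions made for illustration and are not taken from the cited works.

```python
def coulomb_count(soc0, currents_a, dt_s, capacity_ah):
    """Estimate SOC by discretely summing current samples.

    soc0        -- initial SOC as a fraction in [0, 1]
    currents_a  -- current samples in amperes (assumed: positive = discharge)
    dt_s        -- sampling interval in seconds
    capacity_ah -- usable capacity in ampere-hours
    """
    capacity_as = capacity_ah * 3600.0  # convert Ah to ampere-seconds
    soc = soc0
    trace = []
    for i_a in currents_a:
        soc -= i_a * dt_s / capacity_as  # charge removed in this step
        soc = min(1.0, max(0.0, soc))    # keep SOC inside its physical range
        trace.append(soc)
    return trace


# One hour of 1 A discharge from a full 2 Ah cell, sampled once per second.
ideal = coulomb_count(1.0, [1.0] * 3600, 1.0, 2.0)
# The same discharge measured with a constant 10 mA sensor offset.
biased = coulomb_count(1.0, [1.01] * 3600, 1.0, 2.0)
```

Because the sensor offset is integrated along with the true current, the biased estimate drifts steadily away from the ideal one, which is the cumulative error mentioned above. The capacity in the denominator is also where a change in SOH enters the SOC estimate.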
The data-driven method only needs to extract features from physical quantities measured during battery charging and discharging, and then uses these features to train a model that maps battery data features to the SOC. Reference [7] proposed a low-dimensional classification model based on machine learning and an equivalent circuit model, which can estimate the SOC with an accuracy of more than 93%. Reference [8] used 18 machine learning algorithms to predict the SOC and applied different filters to improve the estimator. The Bagging and ExtraTree algorithms were found to significantly outperform other ML methods for SOC estimation, and the Rloess filter was found to perform well. Reference [9] processed historical capacity data using a broad learning system (BLS) and generated feature nodes as input layers in a neural network. The method does not need an in-depth study of the battery aging mechanism, but requires at least 25% of the historical capacity data. Reference [10] constructed a random forest regression model for SOC estimation, which effectively avoids the overfitting problem, improves the estimation accuracy, and provides a reference for future research on estimation models. Data-driven methods can more accurately capture the nonlinear characteristics of battery behavior, but require a large amount of training data and computational resources.
SOH estimation methods can be categorized into two main groups: model-based methods and data-driven methods. The commonly used models are of two kinds: electrochemical models and equivalent circuit models. In electrochemical model-based methods, first-principle equations are first established based on the internal electrochemical processes of the battery, and then the exact state is calculated. Togasaki et al. [11] proposed electrochemical impedance spectroscopy (EIS) to predict severe capacity degradation of lithium-ion batteries due to overcharging. Zhang et al. [12] used the phase resistance between the solid electrolyte and the thickness of the deposited layer as a proxy for aging and developed a battery aging model using the transfer function versus input current. Hou et al. [13] combined Maxwell–Cattaneo–Vernotte theory with Marcus–Hush–Chidsey kinetics to establish an electrochemical–thermal model for fast and accurate diagnoses of lithium-ion batteries. Gao Yizhao et al. proposed an SOH estimation method for lithium batteries based on an enhanced degradation electrochemical model and a dual nonlinear filter [14].
The following studies use equivalent circuit model (ECM) methods. Amirs et al. from the University of Management Sciences, Lahore, Pakistan, proposed a method for estimating battery SOH based on a dynamic equivalent circuit model. The proposed 2-RC model has reduced computational complexity compared to the 1-RC model and outperforms the N-RC model [15]. Based on a simplified second-order RL network ECM, Yang Jufeng et al. proposed an SOH estimation method based on the decoupled dynamic characteristics of constant-current charging currents. Compared with the traditional nonlinear least squares method, the dynamic decoupling method proposed in that paper has lower computational effort and higher parameter identification accuracy [16]. Chen Mang et al. proposed a comprehensive SOH estimation method based on a multi-factor ECM, which has an estimation error of about 1% for the same model of battery [17]. Zhang et al. [18] analyzed the impedance characteristics by means of a pseudo two-dimensional (P2D) model based on the variation of battery impedance characteristics. Based on this, the original model was corrected and compared with the EIS model, which reduced the prediction error by half compared with the original model. With its improved reliability, the corrected model is more suitable for SOH estimation under real operating conditions. The model-based approach uses physical models to describe the decay process of batteries, such as capacity decay and internal resistance increase. The accuracy of these methods is influenced by the accuracy of the model parameters and the limitations of the model assumptions.
There are differences in effectiveness between model-based and data-driven methods. Model-based methods can provide better interpretability, but for complex battery systems, more prior knowledge and parameter adjustments may be required. Data-driven methods can better adapt to uncertainty and nonlinear features, but may lack interpretability and generalization ability. The data-driven approach estimates the SOH by analyzing battery operating data. Key factors include data quality, feature extraction, and algorithm selection. High-quality data can provide more accurate estimation results, while effective feature extraction can capture key features of battery health status. The method in reference [19] is based on incremental capacity (IC) analysis and battery operating characteristics combined with a regression model to correct for the bias caused by individual batteries. The method was validated on laboratory and EV datasets, with average absolute percentage errors of 0.29% and 3.20%, respectively. Reference [20] proposed an aging feature extraction method based on an electrochemical model (EM) to explain the degradation mechanism of batteries. A data-driven SOH estimation model based on health characteristics was constructed by a machine learning algorithm. Experimental data show that the proposed method can effectively improve the accuracy of SOH estimation in different application scenarios and battery charging and discharging modes. The SOH estimation based on GMO-BRNN proposed in reference [20] achieves an estimation evaluation index of less than 1%, which is conducive to the development of EV battery prediction and health management systems.
The above related studies were carried out based on single-parameter estimation. However, the SOC and the SOH are coupled. For example, when estimating the SOC, the variation in the maximum capacity of the battery needs to be taken into account, and at the same time, inaccurate SOC estimation will also affect SOH correction. It follows that the estimation steps for these two parameters partly overlap. SOH estimation using charge state data cannot achieve online estimation. Therefore, a joint estimation of the SOC and SOH can save certain computational steps and has high practical significance. For both SOC estimation and SOH estimation, the data-driven method relies heavily on the choice of algorithm. However, a single data-driven algorithm is susceptible to the influence of the dataset itself, which leads to a reduced generalization ability. The ensemble learning approach is particularly suitable for large datasets and nonlinear data, and is applicable to the study of the health state of lithium-ion batteries. Compared with a single model, the stacking algorithm can improve the prediction performance by integrating the advantages of multiple models. It can reduce the bias and variance of individual models and provide more stable and reliable estimation results. In addition, stacking algorithms can also improve robustness to uncertainty and noise through the diversity of models. However, the ensemble learning approach tends to consume a lot of computational resources and time to build high-precision models; combining Bayesian optimization with ensemble learning during training can greatly reduce the training time.
In summary, to predict the SOC and SOH more accurately and to reduce the loss of model accuracy, temperature and end voltage were selected as training features in this study after analyzing the discharge data of NASA's batteries. The LR, ENR, DTR, ETR, GBR, SVR, KNNR, LASSO, and XGBoost algorithms were then trained and tested on the dataset to compare their prediction errors. With LR as the meta-learner, DTR/ENR/ETR/KNNR are selected as the base learners to build the stacking ensemble learning model, which is trained on the B0005 battery data, with its parameters tuned by a Bayesian optimization algorithm, and used to predict the B0007 and B0018 batteries. Simulation analysis shows that stacking exhibits better estimation stability and accuracy than a single model. Second, this study examines the running time of the algorithm. The simulation analysis shows that, although stacking is an ensemble learning approach, the trained model does not consume much time and still has excellent computational speed in predicting the SOH. Finally, a comparison with the estimation errors reported in other papers demonstrates the effectiveness of the stacking model.
4. Results and Discussion
All training in this study was carried out on the same device, whose CPU was an Intel(R) Core(TM) i7-6700HQ @ 2.60 GHz. To verify the above selection of individual learners and meta-learners, this study first trains different machine learning algorithms on the B0005 battery dataset and evaluates their predictions.
The main algorithms commonly used for regression problems are KNeighbors Regressor (KNNR), Decision Tree Regressor (DTR), Elastic Net (ENR), Gradient Boosting Regressor (GBR), XGB Regressor (XGBR), Lasso, Extra Tree Regressor (ETR), SVR, and Linear Regression (LR).
Table 1 lists the advantages and disadvantages of these mainstream algorithms as well as their scope of application. A box plot is used to reflect the center position and spread of a continuous data distribution. The results of the overall health status estimation of B0005 are represented by the box plots in Figure 7, with the median represented by a short red line, the two horizontal lines above and below the box representing the upper and lower boundaries of the data (the upper edge value is not necessarily the maximum value in the data, and the lower edge value is not necessarily the minimum value), and the red dots representing the outliers beyond the upper and lower boundaries. On the far right is the raw SOH data for cell B0005. As shown in the figure, its data spread is basically uniform, and the red line is closer to the lower quartile, indicating that the original data are in a slightly left-skewed state.
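The box-plot quantities described above can be computed directly. The following Python sketch assumes the common convention that the boundaries are the most extreme data points within 1.5 times the interquartile range of the box; the plotting settings actually used for Figure 7 are not stated, and the sample values are invented for illustration only.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Return the quartiles, edge values, and outliers of a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    outliers = [v for v in values if v < lo_fence or v > hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_edge": min(inside),  # lowest point that is not an outlier
        "upper_edge": max(inside),  # highest point that is not an outlier
        "outliers": outliers,
    }


# Invented SOH-like values with one low outlier (not data from the study).
sample = [0.70, 0.72, 0.74, 0.75, 0.78, 0.82, 0.88, 0.95, 0.40]
stats = box_stats(sample)
```

In this example the lower edge (0.70) is not the minimum of the data (0.40), because the minimum lies beyond the lower boundary and is reported as an outlier instead.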
By observing the box plots of the other algorithms, it can be seen that the data predicted by each algorithm are skewed to different degrees. Among them, the predictions of DTR/ETR/GBR/KNNR/SVR/XGBR have a distribution similar to the original data; DTR/ENR/ETR/LR/XGBR/SVR are basically left-skewed, and KNNR/LASSO are right-skewed. The DTR/LASSO/LR/SVR algorithms show outliers, all of them below the lower boundary, indicating that the errors of these algorithms mostly arise from predicted values being smaller than the actual values. For the LR and LASSO algorithms, the median line is shifted too far and the box is too narrow, showing that the predicted values of these two algorithms are concentrated in a narrow interval and that these algorithms cannot be used directly for prediction.
Figure 8 shows the R² scores and RMSE values of the algorithms. Even the best-performing model has an R² score below 0.8, indicating that the predictive ability of a single model needs to be further improved.
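For reference, the two evaluation metrics used here follow their standard definitions and can be written in a few lines of Python. This is a generic sketch; the function names and the small example vectors are illustrative and are not values from this study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
demo_true = [1.0, 2.0, 3.0, 4.0]
demo_pred = [1.0, 2.0, 3.0, 5.0]
demo_rmse = rmse(demo_true, demo_pred)
demo_r2 = r2_score(demo_true, demo_pred)
```

An R² score of 1 corresponds to a perfect fit, while a score of 0 means the model does no better than always predicting the mean of the true values.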
Considering the characteristics of stacking algorithms, appropriate individual learners and meta-learners are selected by experiment. The nine machine learning algorithms are first used as base learners/meta-learners, the stacking model is trained and tested on the B0005 dataset, and the data are split into training and test sets at a ratio of 9:1. The results are shown in Figure 9.
Figure 9a shows the training time of the models with different algorithms as the meta-learner. Overall model training is time-consuming when XGBoost is used as the meta-learner; the other algorithms, except GBR, have similar training times; and the shortest training time, 1795.4763 s, is obtained when LR is used as the meta-learner. Figure 9b shows the R² scores of the models with different algorithms as the meta-learner, and Figure 9c shows their training errors. Combining Figure 9b,c, it is found that LR as the meta-learner has the highest R² score of 0.972 and the lowest RMSE of 0.05272498 among the algorithms. LR was therefore chosen as the meta-learner for modeling.
After determining the meta-learner, the number and choice of base learners still need to be determined. To do so, base learners were added one at a time. A total of nine different machine learning algorithms were selected for this study. First, only DTR was used as the base learner to train and test on the dataset. Second, the ENR algorithm was added to build a model with two base learners, and in this way the nine algorithms were added in turn as base learners. The results of the experiment are shown in Figure 10.
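The one-by-one procedure can be sketched as a simple loop. In the Python sketch below, `evaluate` is a hypothetical callback that would train a stacking model with the given base learners and return its R² score and training time. The tolerance rule for choosing the final count is an assumption standing in for the manual balancing used in this study, and the scores and the learner order after ENR are made up purely for illustration.

```python
def pick_learner_count(candidates, evaluate, tolerance=0.005):
    """Add candidates one at a time; return the smallest adequate prefix.

    evaluate(names) -> (r2, train_seconds) is a hypothetical callback.
    A prefix is adequate if its R² is within `tolerance` of the best R².
    """
    history = []
    for k in range(1, len(candidates) + 1):
        r2, seconds = evaluate(candidates[:k])
        history.append((k, r2, seconds))
    best_r2 = max(r2 for _, r2, _ in history)
    for k, r2, _ in history:
        if r2 >= best_r2 - tolerance:
            return candidates[:k], history


# Made-up order and scores for illustration; NOT results from the study.
order = ["DTR", "ENR", "ETR", "KNNR", "LR", "SVR"]
fake_r2 = [0.90, 0.93, 0.95, 0.97, 0.96, 0.968]


def fake_evaluate(names):
    k = len(names)
    return fake_r2[k - 1], 10.0 * k  # pretend training time grows with k


chosen, history = pick_learner_count(order, fake_evaluate)
```

Returning the full history alongside the chosen prefix keeps the training time of every configuration available, so accuracy can still be weighed against cost by hand.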
Figure 10a shows the training time of the model with different numbers of base learners. The training time of the whole model grows increasingly quickly once the SVR algorithm is added as the seventh base learner. Figure 10b shows the R² scores of the models with different numbers of base learners. Up to four base learners, the R² score gradually increases with the number of learners; it then remains basically stable, except for decreases after the LR and SVR algorithms are added. Figure 10c shows the training errors of the models with different numbers of base learners. The error decreases as the number of base learners increases, most noticeably up to four base learners. In summary, the construction of the stacking model should balance three aspects: reducing the training time, improving the accuracy, and decreasing the error. After balancing these, the final choice is a stacking model with four base learners: DTR, ENR, ETR, and KNNR.
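A stacking model with this structure, four base learners and LR as the meta-learner, could be assembled as in the following sketch, which assumes scikit-learn is the implementation library. The synthetic data, the hyperparameter values, and the use of `ExtraTreesRegressor` for ETR are assumptions for illustration; the study trains on the NASA B0005 data and tunes parameters with a Bayesian algorithm, neither of which is reproduced here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the NASA dataset): two features and a
# smooth nonlinear target with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 2))
y = (0.7 + 0.2 * np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 1] ** 2
     + rng.normal(0.0, 0.005, 300))

# 9:1 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Base learners; hyperparameters are placeholders, not tuned values.
base_learners = [
    ("dtr", DecisionTreeRegressor(max_depth=6, random_state=0)),
    ("enr", ElasticNet(alpha=0.001)),
    ("etr", ExtraTreesRegressor(n_estimators=50, random_state=0)),
    ("knnr", KNeighborsRegressor(n_neighbors=5)),
]

# The meta-learner is fitted on cross-validated base-learner predictions.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
y_hat = stack.predict(X_test)
r2 = stack.score(X_test, y_test)  # R² on the held-out test data
```

Fitting the meta-learner on out-of-fold predictions (`cv=5`) is a design choice that keeps it from simply rewarding base learners that memorize the training data.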
After constructing the model, the B0005 battery is used as the training data to predict the SOC. The battery state of charge is evaluated sequentially and substituted into the established joint SOC–SOH estimation model. Nine SOC interval segments are selected to reflect the estimation effect of the model. The SOH estimation effect is shown in Figure 11, from which it can be seen that the constructed stacking model has a stable overall estimation effect. The performance of the model was evaluated using the root mean square error (RMSE), and the results are shown in Table 2. The R² score stays around 0.997 and the RMSE results are basically unchanged, indicating that the model performs stably and accurately in predicting the SOH.
To verify the generalization ability of the model, the discharge current and terminal voltage are again used as features to predict the health status of lithium-ion batteries B0007 and B0018. The results are shown in Figure 11a,b, where the raw data are in blue. The prediction for B0007 takes 0.48 ms, and the prediction for B0018 takes 0.32 ms. As can be seen from Table 2, Table 3, and Table 4 combined with Figure 11a,b, the stacking algorithm in this study shows no overfitting and exhibits a strong generalization ability.
5. Conclusions
Most data-driven methods can provide an accurate estimate of the health status of lithium batteries, effectively reducing the risks and losses caused by failures during use. However, a single data-driven algorithm is susceptible to the influence of the dataset itself, resulting in lower accuracy. In addition, since the relationships between variables in the lithium-ion battery dataset are mostly nonlinear, it is very difficult to establish an accurate SOH fitting relationship on the discharge dataset using a model. Meanwhile, most related studies on battery health estimation perform offline estimation, and the inability to estimate online is also a problem to be solved. In view of these problems, this study proposes a joint machine learning SOC–SOH estimation method based on a stacking algorithm, which enables online detection and estimation for battery management systems.
Firstly, this study used the publicly available battery data provided by NASA as the simulation experimental data, explored how the SOH changes with different characteristics by plotting the end-voltage curve, the discharge current curve, the SOC-time curve, etc., and finally chose the end voltage and the temperature as the input features.
Secondly, starting from single-model algorithms, this study analyzed the prediction ability of individual algorithms such as Decision Tree, GBR, SVR, KNeighbors Regressor, Extra Tree Regressor, and XGBoost, and chose the stacking ensemble learning method, with LR as the meta-learner and four other algorithms (DTR, ENR, ETR, and KNNR) as the base learners.
Finally, this study used the B0005 battery as the training set, used the Bayesian algorithm for parameter optimization, and used the trained model for the SOH prediction of the B0007 and B0018 batteries. After a comparative analysis, the developed models were found to have a strong generalization ability, and the running time for the prediction of the full dataset was less than 0.2 ms, which indicates great potential for practical online estimation.