1. Introduction
Because of technical or economic constraints, online hardware sensors are still often insufficient for monitoring complex bioprocesses with regard to their decisive biological key parameters. Soft(ware) sensors can be used to close this gap. To predict the target variables, a combination of mathematical models and existing hardware sensors is applied [1].
The partial least squares regression (PLSR) method is a popular way to build a data-driven soft sensor model. Using this method, a linear model is calibrated with an additional dimensionality reduction based on the relationships between hardware sensor readings and one or more target variables. This technique has been used to successfully develop data-driven soft sensor models in bioprocesses [2]. When combined with process knowledge, such as a carbon mass balance as input, these models performed particularly well as hybrid soft sensor models [3]. Nonetheless, a soft sensor is typically created manually for each bioprocess, which is time-consuming, and soft sensor concepts are rarely applied automatically across bioprocesses. Furthermore, the prediction performance of soft sensors often degrades significantly due to changing raw materials, modified process strategies, and biological variability. These issues are a significant barrier to their use in industry [4]. In particular, manual recalibration of soft sensors is often not carried out due to a lack of qualified personnel. An automatic generalist recalibration approach can provide a solution.
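To make the PLSR idea concrete, the following is a minimal sketch of a single-target PLS calibration using the standard NIPALS algorithm. It is an illustration under stated assumptions, not the implementation used in this study: the function names `pls1_fit` and `pls1_predict` and the synthetic demo data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Calibrate a single-target PLS model with NIPALS (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                        # weight: covariance of inputs with target
        norm = np.linalg.norm(w)
        if norm < 1e-12:                     # nothing left to explain
            break
        w = w / norm
        t = Xr @ w                           # latent variable (score)
        tt = t @ t
        p = Xr.T @ t / tt                    # input loading
        q_a = (yr @ t) / tt                  # target loading
        Xr = Xr - np.outer(t, p)             # deflate inputs
        yr = yr - q_a * t                    # deflate target
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in input space
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    """Predict the target from new sensor readings."""
    coef, x_mean, y_mean = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Hypothetical demo data: three "sensor" inputs and an exactly linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + 5.0
model = pls1_fit(X_demo, y_demo, n_components=3)
y_fit = pls1_predict(model, X_demo)
```

With as many latent variables as inputs, the model reduces to ordinary least squares; the dimensionality reduction comes from choosing fewer components than input variables.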
Regular recalibration of the soft sensor model is a common method for adapting the soft sensor. Previously, this was mostly performed manually from time to time; in more recent studies, it is partially automated and also known as just-in-time learning [5,6,7,8,9]. Approaches to automatic recalibration differ primarily in how historical data are selected. On the one hand, continuous recalibration on temporally matching sections from chronologically preceding data is possible [6]. The recalibration can, therefore, be performed within the current process in a moving time window based on chronologically corresponding data sets, and previous recalibrations can be gradually removed from the prediction model by forgetting factors [8]. On the other hand, historical data can be selected based on similarity criteria. This approach is suitable not only for slow and constant changes, but also for sudden changes, such as new raw materials. Historical data in which these or similar changes occurred earlier are then selected for recalibration [10]. Consequently, the selection is based on the similarity of the online process variables between the historical data and the current process [6,9]. As long as the correlations between the variables are constant, selecting historical data at the level of entire process data sets leads to better prediction performance than selecting individual reference points [7]. Multiway principal component analysis (MPCA), a similarity criterion, and clustering can be used to implement the selection. The MPCA technique unfolds the data pool into a two-dimensional matrix and then applies a principal component analysis to concentrate the data’s information into higher-level variables [11]. Based on the MPCA, data sets that match the current process can be selected from the data pool using a similarity criterion. One method for achieving this selection is to compute the Euclidean distances between historical data sets and the current process, followed by identifying nearest neighbors [6]. Other measures, such as the Mahalanobis distance, which additionally considers covariances of the process variables, can also be used to determine similarity. Saptoro [10] gave a good overview of these methods.
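The selection chain described above (unfolding, principal component analysis, Euclidean distance, nearest neighbors) can be sketched as follows. This is a simplified illustration, assuming all data sets have equal length and sampling times; the function names, the number of components, and the synthetic profiles are assumptions made for the example.

```python
import numpy as np

def unfold(batches):
    """Batch-wise unfolding: (n_batches, n_times, n_vars) -> (n_batches, n_times * n_vars)."""
    batches = np.asarray(batches, dtype=float)
    return batches.reshape(batches.shape[0], -1)

def select_similar(history, current, n_components=2, k=4):
    """Return indices of the k historical data sets closest to the current process."""
    X = unfold(history)
    x_new = np.asarray(current, dtype=float).reshape(1, -1)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # constant columns carry no information
    Z = (X - mean) / std                      # autoscale the unfolded data pool
    z_new = (x_new - mean) / std
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[:n_components].T            # principal component directions
    scores = Z @ loadings                     # historical data sets in score space
    score_new = z_new @ loadings              # current process in score space
    dist = np.linalg.norm(scores - score_new, axis=1)
    return np.argsort(dist)[:k], dist

# Hypothetical data pool: two groups of profiles with different slopes.
t = np.linspace(0.0, 1.0, 10)
slopes = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
history = np.array([np.column_stack([a * t, 1.0 - a * t]) for a in slopes])
current = np.column_stack([1.05 * t, 1.0 - 1.05 * t])
selected, distances = select_similar(history, current, n_components=2, k=3)
```

In this toy pool, the three data sets with slopes close to that of the current process are selected. Replacing the Euclidean norm with a Mahalanobis distance would additionally weight the score directions by their covariance.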
Bioprocesses are usually multiphase processes, which means that correlations between variables typically change phase by phase throughout the process [3]. For the selection of historical data, a generalist soft sensor concept must consider these phase changes. This necessitates the inclusion of phase detection in the generalist soft sensor concept. Yao and Gao [12] gathered various phase detection methods and classified them into two groups. The first is based on expert knowledge, and the second is based solely on data-driven approaches. As examples of knowledge-driven methods, they cite multivariate rules [13] and the definition of landmarks in indicator variables [14,15]. The data-driven methods described include the analysis of local correlations [16] and approaches based on the explained variance of principal components [17]. The use of data-driven phase detection methods in particular promises good transferability to multiple bioprocesses without the need for process knowledge. However, temporally faulty phase detections, also known as burrs, can occur, especially in data-driven approaches. Wang et al. [18] built a phase detection algorithm on such a data-driven concept and enhanced it with burr compensation.
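To illustrate what burr compensation means in practice, the sketch below merges very short phase segments into their neighbors. It is a deliberately simple rule chosen for illustration and is not the method of Wang et al. [18]; the function name `remove_burrs` and the minimum length `min_len` are hypothetical.

```python
def remove_burrs(labels, min_len=3):
    """Merge phase runs shorter than min_len into the preceding run.

    A short first run cannot be merged backwards and is merged into
    the following run instead.
    """
    # Run-length encode the label sequence as [label, length] pairs.
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])

    cleaned = []
    for lab, n in runs:
        if n < min_len and cleaned:
            cleaned[-1][1] += n          # absorb the burr into the previous run
        elif cleaned and cleaned[-1][0] == lab:
            cleaned[-1][1] += n          # re-join a run that a burr had split
        else:
            cleaned.append([lab, n])

    if len(cleaned) > 1 and cleaned[0][1] < min_len:
        cleaned[1][1] += cleaned[0][1]   # short first run: merge forwards
        cleaned.pop(0)

    return [lab for lab, n in cleaned for _ in range(n)]

# Hypothetical raw phase labels with two single-point burrs.
raw = [1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 3, 2, 2, 2]
smoothed = remove_burrs(raw, min_len=3)
```

The output has the same length as the input, so the cleaned labels can be aligned directly with the original time stamps.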
Soft sensor concepts with automatic recalibration have traditionally been used primarily in the chemical and petroleum processing industries. Their application in biotechnological and pharmaceutical processes is currently limited [10] due to the more challenging processes involved. This technology, particularly when combined with phase detection algorithms, has only been described in a few publications, for example, to determine the penicillin concentration in a simulated bioprocess [19]. The broader application of a generalist concept to different bioprocesses in the biotechnology industry, as well as the implementation of robust data-driven phase detection methods that exclude burrs, is still an open issue.
A generalist soft sensor concept for multiphase bioprocesses is presented in this study. This novel concept provides soft sensors that automatically predict assigned target variables in various bioprocesses. MPCA, Euclidean distance, and a k-nearest neighbor algorithm are used to select historical data for automatic soft sensor recalibration. Furthermore, the selected historical data sets are divided using a phase detection algorithm with burr compensation inspired by Wang et al. [18]. A soft sensor model was then calibrated for the current process phase. For this purpose, additional input variables were calculated from hardware sensor readings, including the carbon dioxide evolution rate (CER), the oxygen uptake rate (OUR), and the cumulative CER and OUR. Finally, the automatic recalibration of the generalist soft sensor was evaluated using two different bioprocesses: first, the biomass prediction of Pichia pastoris bioprocesses, and second, the biomass and protein prediction of Bacillus subtilis processes with variable process characteristics, such as cultivation media.
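As an illustration of how such additional input variables can be derived, the sketch below computes OUR and CER from off-gas mole fractions using the common inert gas balance and integrates them over time with the trapezoidal rule. The exact equations, units, and variable names used in this study are not given in this section, so everything here (a molar inlet gas flow in mol h^-1, a liquid volume in L, mole fractions between 0 and 1) is an assumption.

```python
def gas_rates(f_in, volume, y_o2_in, y_co2_in, y_o2_out, y_co2_out):
    """Return (OUR, CER) in mol L^-1 h^-1 from an inert gas balance.

    f_in: molar inlet gas flow (mol h^-1), volume: liquid volume (L),
    y_*: mole fractions of O2 and CO2 in the inlet and outlet gas.
    """
    # The inert fraction is conserved, which corrects the outlet flow.
    inert_ratio = (1.0 - y_o2_in - y_co2_in) / (1.0 - y_o2_out - y_co2_out)
    our = f_in / volume * (y_o2_in - y_o2_out * inert_ratio)
    cer = f_in / volume * (y_co2_out * inert_ratio - y_co2_in)
    return our, cer

def cumulative(times, rates):
    """Cumulative integral of a rate over time (trapezoidal rule)."""
    total, out = 0.0, [0.0]
    for i in range(1, len(times)):
        total += 0.5 * (rates[i] + rates[i - 1]) * (times[i] - times[i - 1])
        out.append(total)
    return out

# Hypothetical readings: 1 percentage point of O2 consumed and of CO2 produced.
our_demo, cer_demo = gas_rates(1.0, 1.0, 0.21, 0.0, 0.20, 0.01)
cum_demo = cumulative([0.0, 1.0, 2.0], [0.0, 2.0, 2.0])
```

The cumulative values grow monotonically for non-negative rates, which is one reason they can track accumulating quantities such as biomass better than the instantaneous rates alone.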
3. Results and Discussion
The evaluation of the generalist soft sensor concept is structured as follows. First, the algorithm’s function was validated on the P. pastoris process. The temporal change in the Euclidean distances of the historical data pool to the current process is shown and discussed, as is the course of the predictions with detected phases of the example process. Following that, the validation of the B. subtilis process is demonstrated, particularly the differentiation of the different media in the automated selection of data sets. On an example data set, the profile of the predictions and phase detection is also shown. Finally, the relative errors of the various applications of the generalist soft sensor concept are compared.
Figure 3 depicts an overview of the various process characteristics of the bioprocesses.
3.1. Evaluation of the Generalist Soft Sensor on the P. pastoris Process
Figure 4 depicts the selection of historical data sets from the P. pastoris data pool (19 data sets) for the P. pastoris example process. Two distinct time points were chosen: 22 h for the start of the growth batch phase and 48 h for the start of the fed-batch phase. The data pool’s spatial distribution shifts over time. Thus, the generalist soft sensor concept selected data sets 4, 6, 7, and 12 at 22 h and data sets 1, 6, 9, and 14 at 48 h as most similar to the validation data set. This demonstrates that even within a process, the most similar data sets can change: as the process progresses, more and more information on the current process becomes available, allowing a more appropriate selection of data sets. In the generalist soft sensor concept described in this study, the current process is first compared with historical data sets in a growing time window (from process start to the current process time). Following that, phases are determined in this growing time window, and a flexible phase-dependent recalibration time window is calculated, in which the data points of the selected historical data sets are used to calibrate the currently valid soft sensor model.
Figure 5 depicts the temporal evolution of biomass concentration with the prediction from the generalist soft sensor concept, as well as reference values. A high prediction performance can be seen. Only in the last process phase does the prediction performance deteriorate. One reason could be that several of the selected calibration data sets terminated early. As a result, fewer calibration data points were available for the soft sensor model in this phase, which could lead to a decrease in the prediction performance. Quality parameters such as RMSEP (2.6 g L−1) and relative error (4.1%) were within acceptable limits (relative error < 10%).
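For orientation, the two quality parameters can be computed as in the following sketch. The RMSEP is the standard root mean square error of prediction; the normalization of the relative error is not stated in this section, so dividing by the maximum reference value is an assumption made only for this example, and the numbers are hypothetical.

```python
import math

def rmsep(reference, predicted):
    """Root mean square error of prediction, in the unit of the target variable."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def relative_error(reference, predicted):
    """RMSEP as a percentage of the maximum reference value (assumed normalization)."""
    return 100.0 * rmsep(reference, predicted) / max(reference)

# Hypothetical biomass reference values and predictions in g L^-1.
ref_demo = [10.0, 20.0, 30.0, 40.0]
pred_demo = [12.0, 18.0, 33.0, 39.0]
rmsep_demo = rmsep(ref_demo, pred_demo)
rel_demo = relative_error(ref_demo, pred_demo)
```

Normalizing by the maximum reference value makes errors comparable between target variables with different magnitudes, such as biomass and protein concentration.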
Six distinct phases were detected through automatic phase detection. The phases can be classified as follows: Phase 1 is the lag phase; Phase 2 is the start of the growth batch phase (pO2 = 100%); Phase 3 is the beginning of the stronger growth batch phase (significant decrease in pO2); Phase 4 is the main batch phase and transition phase (pO2 controlled at 40% until substrate reaches 0 g L−1); Phase 5 is the adaptation to a new substrate and the start of the fed-batch phase; and Phase 6 is the completion of adaptation to the new medium and the second part of the fed-batch phase. The detected phases can, in theory, be justified both technically and biologically. The division of the batch phase into three phases can be attributed primarily to differences in oxygen saturation in the medium, as well as the start of its control. Because the historical data sets had a high sampling frequency (2 h interval), these shorter phases did not pose challenges. If the historical data sets had a lower sampling frequency, the number of data sets to be selected would have to be increased to have enough reference points available for short phases.
3.2. Evaluation of the Generalist Soft Sensor Concept on a B. subtilis Process with Changing Process Characteristics
The generalist soft sensor concept was then put to the test with an example data set from the B. subtilis process. Figure 6 depicts the outcomes of the selection of similar historical data sets from the B. subtilis data pool. Because the sampling frequency for B. subtilis was lower and the data pool was larger than that for P. pastoris, five similar data sets were always selected instead of four. This ensures that enough reference points for calibration are available even during short phases. The data pool included data sets with both CLA medium and FB medium. Visually, the separability of the various process characteristics can be confirmed. This implies that the existing online hardware sensors used as input into the generalist soft sensor concept provided enough information about the process to reflect differences in media compositions and their impact on process progress.
The predictions of the B. subtilis example data set were achieved with relative errors of 20.4% (biomass prediction) and 7.2% (protein prediction) using the generalist soft sensor concept. At the last reference point of the biomass concentration, an atypically high CFU was measured, which indicates an outlier. The relative error of the biomass prediction without this outlier is 13.2%. Figure 7 shows a visual confirmation of the high prediction performance. The algorithm identified three distinct phases, which are as follows: Phase 1: Batch phase (oxygen saturation drops to 0%); Phase 2: Start of fed-batch phase (oxygen saturation rises again due to substrate limitation); Phase 3: Second part of fed-batch phase (oxygen saturation returns to a stable, high level). Thus, the detected phases can be technically and biologically assigned and are, therefore, valid. The validation example here was only for predictions in CLA medium, but the applications of the generalist soft sensor concept for the B. subtilis process in FB medium are discussed in the following section. In general, the B. subtilis process has already confirmed the successful use of the generalist soft sensor concept for the more industrially relevant target variable product concentration.
3.3. Overall Evaluation of the Generalist Soft Sensor Concept
Finally, the generalist soft sensor concept’s prediction performance for all use cases was validated. Five random validation data sets were chosen for each use case, and the average relative error with standard deviation was calculated and summarized in Table 1. For comparison, a concept with fixed time windows was used as a reference. For this purpose, the algorithm of the generalist concept was modified with a fixed window size of 20 h (ensuring a sufficient number of calibration points for all use cases) instead of dynamic phase-dependent windows.
Comparing the predictions of the generalist concept with the reference concept, a significantly better prediction performance can be observed for all use cases. This demonstrates that, especially for multiphase bioprocesses, an automated phase detection and a subsequent dynamic adaptation of the windows to the phases are essential for a suitable prediction performance. Particularly when there are fewer reference points for the target variable in the calibration data, phase-dependent allocation is important, as can be seen when comparing the prediction performance of the P. pastoris and B. subtilis process models with the fixed window concept.
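The difference between the two window strategies compared above can be sketched as follows. This is a simplified illustration, assuming time stamps sorted in ascending order and one phase label per time stamp; the function names and the example profile are hypothetical.

```python
def fixed_window(times, t_now, width=20.0):
    """Indices of all points within the last `width` hours before t_now."""
    return [i for i, t in enumerate(times) if t_now - width <= t <= t_now]

def phase_window(times, phases, t_now):
    """Indices from the start of the currently valid phase up to t_now."""
    past = [i for i, t in enumerate(times) if t <= t_now]
    if not past:
        return []
    end = past[-1]
    current_phase = phases[end]
    start = end
    # Walk backwards until the phase label changes.
    while start > 0 and phases[start - 1] == current_phase:
        start -= 1
    return list(range(start, end + 1))

# Hypothetical process: one point every 2 h, three detected phases.
times_demo = [2.0 * i for i in range(16)]
phases_demo = [1] * 5 + [2] * 6 + [3] * 5
fixed_idx = fixed_window(times_demo, t_now=24.0, width=20.0)
phase_idx = phase_window(times_demo, phases_demo, t_now=24.0)
```

In this example, the fixed window mixes points from all three phases, whereas the phase-dependent window contains only points from the current phase, at the cost of fewer calibration points shortly after a phase change.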
The generalist concept has a similar prediction performance for biomass prediction in the P. pastoris process and product prediction in the B. subtilis process. Comparing the biomass prediction results for the P. pastoris process of the generalist soft sensor with the hybrid soft sensor model of Brunner et al. [3], a model without similarity analysis and selection of similar data sets, but with knowledge-based phase detection and process knowledge in terms of a carbon balance, an approximately comparable prediction performance could be achieved. The prediction performance for biomass concentration of the B. subtilis process is lower than that for the other target variables. However, the primary reason for the lower prediction performance is not the generalist soft sensor concept itself, but the higher measurement error of the biomass reference measurement during the B. subtilis process compared with the protein reference measurement and the biomass reference measurement during the P. pastoris process. Particularly in the CFU measurement, the necessary high dilutions of the samples led to an absolute error that increases with the level of dilution. Comparing the prediction performance for the same target variable in different media in the B. subtilis process, similarly high relative errors could be observed. Thus, it can be confirmed that the differences in the prediction performance between the different targets can be predominantly attributed to the different measurement errors of the reference measurements.
Consequently, it was possible to demonstrate that the generalist soft sensor concept is suitable for predicting different scenarios, even when process characteristics such as media, strains, and target variables are varied.
4. Conclusions
This study revealed that a generalist soft sensor concept could reliably predict target variables in bioprocesses with varying process characteristics. The biomass prediction for the P. pastoris process and the biomass and product prediction for the B. subtilis process were utilized to evaluate this concept.
Since the generalist concept is real-time capable, it can be used for process monitoring as well as for process control. For process monitoring, the predicted variables are used, and expected process corridors can additionally be created for them so that the quality of the process can be assessed directly. For process control, the predictions can be implemented directly in a control concept. However, it is recommended to add a smoother phase transition in the generalist concept for this application. In addition to biomass and product concentration, further target variables such as substrate concentration could be predicted with the generalist concept, enabling further control strategies.
The major challenge in applying the generalist soft sensor concept is gathering enough process information to digitally map the process, for example using hardware sensors. The concept is designed in such a way that hardware sensors, actuators, and additionally calculated variables other than those used in this study can also be used as input variables. Additionally, if non-information-bearing input variables are present, the generalist concept will automatically give them very little or no influence on the prediction model. However, if the existing online input variables generally do not contain enough information about the process, reliable predictions cannot be made, even with the generalist concept. Furthermore, as large a data pool as possible should be provided because relevant prediction models can only be trained if current process variations have already been recorded in similar historical data sets.
The following topics can be considered for future application and further development of the proposed generalist soft sensor concept. One optimization possibility is automatic data pool maintenance. For example, previous data sets may be similar to the current process based on their online process variables but have faulty reference values. This can occur due to incorrect sampling or measurement issues with the samples. To overcome this, an automated concept that removes outliers during data pool preprocessing can be implemented. One implementation approach is to group similar data sets based on their online variables, as presented in this study, and then to compare the correlations between the reference and online variables of the historical data sets. Individual data sets with significantly differing correlations can be removed. Another optimization possibility is the addition of a synchronization method [23] to prepare data sets with varying lengths for MSPC-based selection, because currently only previous data sets whose online variables show temporal profiles similar to the current process are chosen for automatic recalibration. This neglects the fact that data sets may be adequate for recalibration despite temporal differences.
This concept can be tested on other multiphase bioprocesses in the future to overcome isolated solutions in soft sensor applications and proceed toward soft sensor concepts comprising various bioprocesses.