4.2. Feature Engineering
Feature engineering requires domain knowledge and comprises feature construction and feature selection. In feature construction, new variables are built from existing ones to describe the physics of the problem more accurately. Specifically, the aft and fore draft measurements are combined to calculate the mean draft and trim of the vessel, according to Equations (2) and (3). In addition, the sea current speed is calculated as the difference between speed over ground (SOG) and speed through water (STW), as stated in Equation (4). The relative wind speed and direction measurements are combined to determine the longitudinal and transverse components of the relative wind speed, according to Equations (5) and (6). Similarly, the relative wave angle is calculated in Equation (7) and used to determine the longitudinal and transverse components of the significant wave height, according to Equations (8) and (9). To account for the strong time dependence of fouling growth, the Days feature was added, equal to the number of days elapsed since the last hull and propeller maintenance event; for each data point recorded on a given date, it was calculated according to Equation (10). Thus, for each dataset mentioned in Table 3, the Days feature was created separately, based on the respective starting date.
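As an illustration, the constructed features can be derived with pandas along the following lines. The column names are hypothetical, and the exact component and sign conventions of Equations (2)–(10) are assumed here to follow the usual definitions (e.g., trim as the aft minus fore draft), which may differ from the paper's conventions:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, maintenance_date: str) -> pd.DataFrame:
    """Construct the derived features described above from raw signals."""
    out = df.copy()

    # Mean draft and trim from aft and fore draft measurements.
    out["mean_draft"] = (out["draft_aft"] + out["draft_fore"]) / 2.0
    out["trim"] = out["draft_aft"] - out["draft_fore"]

    # Sea current speed as the difference between SOG and STW.
    out["current_speed"] = out["sog"] - out["stw"]

    # Longitudinal and transverse components of the relative wind speed.
    wind_angle = np.deg2rad(out["rel_wind_dir"])
    out["wind_long"] = out["rel_wind_speed"] * np.cos(wind_angle)
    out["wind_trans"] = out["rel_wind_speed"] * np.sin(wind_angle)

    # Longitudinal and transverse components of the significant wave
    # height, based on the relative wave angle.
    wave_angle = np.deg2rad(out["rel_wave_dir"])
    out["wave_long"] = out["sig_wave_height"] * np.cos(wave_angle)
    out["wave_trans"] = out["sig_wave_height"] * np.sin(wave_angle)

    # Days elapsed since the last hull/propeller maintenance event.
    t0 = pd.Timestamp(maintenance_date)
    out["days"] = (out["timestamp"] - t0).dt.days

    return out
```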
In feature selection, the most important input variables are determined with the help of appropriate statistical tools, while irrelevant or redundant input variables are eliminated to simplify the model and prevent overfitting. Two criteria are used to determine the necessary features: comprehension of the physical problem and correlation analysis. Based on domain knowledge, a few of the available variables were shortlisted as feature candidates for SHP prediction. Correlation analysis was then performed on those candidates using the distance correlation method, with the results presented in Table 4. Distance correlation was introduced in [26] and constitutes a measure of statistical dependence between two random variables that captures both linear and nonlinear relationships.
While traditional correlation measures such as Pearson’s correlation coefficient are effective at capturing linear relationships, they often fail to detect more complex dependencies in the data. Distance correlation overcomes this limitation by considering the distances between observations rather than assuming a specific functional form, and can therefore reveal intricate relationships missed by traditional methods. The concept is based on the notions of distance covariance and distance variance: the distance covariance quantifies the joint variability of paired observations in terms of their mutual distances, while the distance variance measures the dispersion of the distances within a single variable. Normalizing the distance covariance by the square root of the product of the distance variances yields the distance correlation, which ranges between 0 and 1, with 0 indicating independence and 1 indicating perfect dependence. This makes distance correlation a valuable tool for feature selection, as it captures a wide range of relationships and provides a comprehensive understanding of the data. Specifically, the sample distance covariance between two random variables $X$ and $Y$ is defined as follows:

$$\mathrm{dCov}(X,Y)=\sqrt{\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}A_{ij}B_{ij}}$$

where $A_{ij}=a_{ij}-\bar{a}_{i\cdot}-\bar{a}_{\cdot j}+\bar{a}_{\cdot\cdot}$ and $B_{ij}=b_{ij}-\bar{b}_{i\cdot}-\bar{b}_{\cdot j}+\bar{b}_{\cdot\cdot}$ are the double-centered counterparts of the pairwise Euclidean distances $a_{ij}$ between observations $X_i$ and $X_j$, and $b_{ij}$ between observations $Y_i$ and $Y_j$, respectively, and $n$ is the sample size. The distance variances of $X$ and $Y$ measure the dispersion of the distances between paired observations and are obtained as special cases of the distance covariance:

$$\mathrm{dVar}(X)=\mathrm{dCov}(X,X),\qquad \mathrm{dVar}(Y)=\mathrm{dCov}(Y,Y)$$

The distance correlation between $X$ and $Y$ is then obtained by normalizing the distance covariance $\mathrm{dCov}(X,Y)$ with the square root of the product of the distance variances, as follows:

$$\mathrm{dCor}(X,Y)=\frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}$$

provided that $\mathrm{dVar}(X)\,\mathrm{dVar}(Y)>0$; otherwise, $\mathrm{dCor}(X,Y)=0$.
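For concreteness, the sample distance correlation can be computed directly with NumPy; the sketch below follows the definitions above (the function name and test data are illustrative, and optimized implementations exist in libraries such as the dcor package):

```python
import numpy as np

def distance_correlation(x, y) -> float:
    """Sample distance correlation (dCor) between two 1-D variables."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    # Pairwise Euclidean distance matrices a_ij and b_ij.
    a = np.abs(x - x.T)
    b = np.abs(y - y.T)

    # Double-centering: subtract row/column means, add back the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()

    # Distance covariance and distance variances (clipped at 0 for
    # numerical safety before taking square roots).
    dcov = np.sqrt(max((A * B).mean(), 0.0))
    dvar_x = np.sqrt(max((A * A).mean(), 0.0))
    dvar_y = np.sqrt(max((B * B).mean(), 0.0))

    denom = np.sqrt(dvar_x * dvar_y)
    return dcov / denom if denom > 0 else 0.0

# A nonlinear dependence that Pearson's r misses but dCor detects:
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])     # near 0 (no linear relationship)
print(distance_correlation(x, y))  # clearly positive
```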
The correlation analysis confirmed what was expected: RPM and STW were the candidate features with the highest correlation with SHP, highlighting their importance as predictors. However, RPM was not included as a feature because, although it would increase predictive accuracy, it would limit the applicability of the model as an emulator. It is worth mentioning that the correlation coefficients of the longitudinal components of the relative wind speed and the significant wave height were higher than those of both the transverse components and the original variables. In addition, although the correlation coefficient of the rate of turn (RoT) variable was small, it was useful to include a variable related to maneuvering. Current speed, computed as the difference between SOG and STW, does not directly affect the hydrodynamics and speed–power performance of the ship. However, according to [27], wave–current interaction can affect the wave height, as the wave field changes significantly when it interacts with following and opposing currents. It may therefore be assumed that current speed indirectly affects the ship’s resistance and propulsion power requirements. Indeed, the correlation analysis revealed a small but non-negligible relationship between current speed and SHP, with a correlation coefficient similar in magnitude to those of other environmental factors such as wave height and wind speed. This suggests that current speed, as a proxy for wave–current interaction, may provide useful information for predicting the ship’s performance.
Therefore, feature selection was based on a combination of domain knowledge and correlation analysis using the distance correlation method. Eventually, two sets of features were selected, one including the Days feature and the other excluding it, as listed in Table 5. The models derived from each feature set will be used in different ways, as described next, to monitor the hull and propeller condition.
4.3. Data Preparation
Before feeding the data to an ML algorithm, suitable preparation is required to ensure their quality. In the present study, data preparation was implemented in three steps: (i) data smoothing, (ii) voyage extraction, and (iii) threshold value filtering. These processes are described hereafter.
Data smoothing aims to remove signal noise while preserving the main trend. To this end, a simple moving average (SMA) filter is first applied to each signal (features and output variables). The signal is denoted as $X(i)$, where $i$ indexes the data sample at a certain time. The SMA at time $i$, with a window size of $k$, is calculated as follows:

$$\mathrm{SMA}(i)=\frac{1}{k}\sum_{j=0}^{k-1}X(i-j)$$
Then, each SMA-smoothed signal is decimated by a factor of $k$, i.e., only one sample out of $k$ is retained, where $k$ equals the window size used for the moving average. In Figure 7, window sizes of 5, 10, and 15 are compared for two variables to show the effect of $k$. The selected sizes are commonly used in the literature; for example, [11,12,17,24] chose a 5 min window, while [28,29] chose 10 and 15 min windows, respectively. Clearly, the wider the time window, the stronger the smoothing effect. In the current study, we visually compared the effect of these window sizes by inspecting the time series of the original and smoothed signals. Based on our observations, a window size of 5 min effectively reduces the sharpest spikes while maintaining sufficient resolution for accurate model training. Since it is the smallest of the three windows, it also preserves a higher data resolution, providing more data points for the machine learning models.
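As a minimal pandas sketch, assuming the signals are stored in a time-indexed DataFrame sampled once per minute (so that k = 5 corresponds to a 5 min window), smoothing and decimation can be combined as follows:

```python
import pandas as pd

def smooth_and_decimate(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Apply a k-sample simple moving average, then keep 1 sample out of k."""
    smoothed = df.rolling(window=k).mean()  # SMA(i) over the last k samples
    return smoothed.iloc[k - 1 :: k]        # decimate by a factor of k
```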
Voyage extraction: An assessment of the vessel’s hydrodynamic performance makes sense only in sailing conditions, so it is important to identify the time intervals in the data records during which the ship was traveling. Voyage extraction identifies the time intervals in the dataset that correspond to actual ship voyages, thereby excluding port calls. To this end, a list of the start and end dates/times of the voyages during the reporting period is derived from the noon reports, and a dataset containing only voyages is subsequently extracted.
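A sketch of this step, assuming the records are indexed by timestamp and the voyage start/end times extracted from the noon reports are held in a small table (column names are illustrative):

```python
import pandas as pd

def extract_voyages(df: pd.DataFrame, voyages: pd.DataFrame) -> pd.DataFrame:
    """Keep only samples recorded between each voyage's start and end times.

    `voyages` holds one row per voyage, with 'start' and 'end' timestamp
    columns derived from the noon reports.
    """
    mask = pd.Series(False, index=df.index)
    for _, v in voyages.iterrows():
        mask |= (df.index >= v["start"]) & (df.index <= v["end"])
    return df[mask]
```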
Threshold value filtering involves the rejection of data points based on threshold values, so that only values inside the chosen operational range are kept, ensuring that the data analysis focuses on conditions corresponding to open-sea operation. Accordingly, criteria defined by the lower and upper limits in Table 6 are applied to the dataset. To define these values, the histogram of each variable was examined to identify the tails of the distributions, as these represent unrepresentative conditions occurring over short periods ([30]). The chosen thresholds therefore ensure that only data points within the operational range of interest are retained.
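This step amounts to one boolean mask per variable; a minimal sketch follows, where the limits shown are placeholders and the actual values are those of Table 6:

```python
import pandas as pd

def apply_thresholds(df: pd.DataFrame,
                     limits: dict[str, tuple[float, float]]) -> pd.DataFrame:
    """Reject rows falling outside the [lower, upper] range of any variable."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in limits.items():
        mask &= df[col].between(lo, hi)
    return df[mask]

# Placeholder limits for illustration only (see Table 6 for actual values).
limits = {"stw": (8.0, 16.0), "mean_draft": (6.0, 13.0)}
```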
These data preparation steps caused a progressive reduction in the number of data points with respect to the raw data, as shown in Figure 8, where the number of points after each step is indicated; the greatest reduction occurred during the smoothing process. This reduction was a side effect deemed necessary to eliminate spikes and reduce the noise present in the raw data.
4.4. ML Model Calculations
The main idea behind the development of Model-1 was to employ it as a reference performance baseline, representing the vessel in clean condition, against which the fouled conditions represented by Datasets B and D2 could be evaluated. Model-1, based on the feature set without the Days feature, was trained on data corresponding to the few months after hull cleaning and propeller polishing (Dataset D1), so that it emulates the behavior of the vessel when clean. A basic assumption in this approach was that the fouling status of the vessel remained unchanged during this period. Another issue to consider was the relatively short duration of Dataset D1 used for model training: this limited training period may not have captured all the operational conditions encountered by the vessel during Datasets B and D2.
Figure 9 and Figure 10 present the relative frequency distributions of STW and mean draft, two key operational parameters for the problem studied, while Table 7 shows the basic statistics for STW. The STW values did not differ significantly among the examined datasets, even though Dataset B presented higher STW values. However, the under-representation of mean drafts between 8 m and 11 m in Dataset D1 may have led to inaccurate extrapolated predictions for these drafts. Therefore, a second approach was also examined, as discussed next.
Dataset C was another candidate for the reference performance baseline. However, as shown in Table 3, idle time, referred to as time in port, comprised 42.8% of Dataset C, in contrast to 25.6% in the case of Dataset D1. Because idle time facilitates biofouling growth (see, for example, [9]), significantly more fouling accumulation was expected during Dataset C than during Dataset D1. Furthermore, Dataset C contained one less ballast loading case than Dataset D1, as shown in Figure 5. Hence, Dataset C was not used as the reference performance baseline for the vessel in clean condition.
The second approach used Model-2, which was employed to monitor the advance of biofouling in the period corresponding to Dataset D, spanning over a year after maintenance. The distinctive feature of Model-2 was the Days feature, which enabled the emulation of scenarios referring to different points in time. To achieve this, synthetic datasets were created by assigning specific values to the model features, as sketched below. Details of these models are given in Table 8, while Figure 11 illustrates the two approaches for the assessment of hull and propeller fouling using Model-1 and Model-2, respectively.
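The text does not detail how the synthetic datasets were constructed, but the idea can be illustrated as follows: representative operating conditions are held fixed while the Days feature is set to the point in time to be emulated (all names here are hypothetical):

```python
import pandas as pd

def synthetic_scenario(base: pd.DataFrame, days: int) -> pd.DataFrame:
    """Emulate the vessel at a given point in time by fixing the Days feature.

    `base` holds representative operating conditions (STW, drafts, weather);
    the trained Model-2 is then evaluated on each scenario to trace how the
    predicted SHP evolves with fouling.
    """
    scenario = base.copy()
    scenario["days"] = days
    return scenario

# e.g., compare predicted power at 30, 180, and 365 days after maintenance:
# predictions = {d: model2.predict(synthetic_scenario(base, d))
#                for d in (30, 180, 365)}
```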
The procedure for the development of both models is as follows. First, the dataset is randomly shuffled and split into a training set and a test set using a ratio of 80% to 20%, respectively. Then, various ML algorithms suitable for regression tasks are evaluated on the training set. The ML algorithms considered are decision trees ([31]) and ensemble methods based on decision trees, i.e., random forest ([32]), extremely randomized trees (extra trees; [33]), and gradient tree boosting ([34]). This choice was motivated by the superior performance exhibited by tree-based algorithms compared to other algorithms, such as artificial neural networks (ANN), as demonstrated in relevant studies ([22,35]).
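Under the assumption that the models were built with scikit-learn (the library is not named in the text), the split and the four candidate algorithms can be set up as follows, with X and y standing in for the prepared feature matrix and the SHP target:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)

# Placeholders standing in for the prepared features and SHP target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = rng.normal(size=1000)

# 80/20 shuffled split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# The four candidate algorithms considered in the study.
candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(random_state=42),
    "extra_trees": ExtraTreesRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
```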
In summary, decision trees can capture complex interactions and patterns, so they can handle a wide range of regression problems, including those with nonlinear or non-monotonic relationships. However, decision trees are prone to overfitting, which occurs when the model becomes too specific to the training data and performs poorly on unseen data, i.e., instances that were not part of the training dataset. Ensemble methods combine multiple decision trees, each trained on a different subset of the data, and aggregate their predictions; combining the information from these multiple predictions improves the overall accuracy and the generalization of the model. Ensemble methods also alleviate overfitting through techniques like random subspace sampling (random forests) or gradient-based optimization (gradient boosting), which introduce randomness and regularization, leading to better generalization and improved performance on unseen data.
Moreover, the averaging employed in ensemble methods mitigates the influence of individual noisy predictions: because the predictions of multiple models are combined into a final prediction, the impact of individual predictions that contain noise or errors is reduced, resulting in more reliable and stable predictions across the ensemble. More details are presented in Appendix A.
Given that the most suitable algorithm for any given problem cannot be known a priori, a comparative analysis is required. In addition, the performance of each algorithm depends heavily on the configuration of its hyperparameters. To determine a set of well-performing hyperparameters, a search across the hyperparameter space is conducted. An exhaustive exploration of the search space is infeasible, so only a few selected combinations are tested, using either a grid search or a randomized search. A grid search evaluates all possible combinations of the specified hyperparameter values, whereas a randomized search selects a random value for each hyperparameter in each iteration, so that a set number of random combinations is evaluated. Here, a randomized search was performed over a wide range of hyperparameter values, allowing a broader exploration of the search space while maintaining computational efficiency. The search was coupled with 5-fold cross-validation to account for the intrinsic randomness in training and to help mitigate the risk of overfitting: evaluating the model on multiple subsets of the data provides a more reliable estimate of its generalization capacity.
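Continuing the scikit-learn sketch above, such a search for the extra trees algorithm could look as follows; the search space shown is illustrative, the actual tuned values being those reported in Table 9:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space (placeholder ranges, not the paper's).
param_distributions = {
    "n_estimators": np.arange(100, 1001, 100),
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt", 0.5],
}

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,                              # random combinations tried
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # RMSE-based model selection
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```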
The hyperparameter optimization was conducted separately for the two models. Because Model-2 plays the more critical role in this study, its best configuration is presented in Table 9.
After hyperparameter tuning, a comparative evaluation of the four ML algorithms was performed using 10-fold cross-validation on the test set, which enabled a robust assessment of model performance. The goal was to compare the RMSE values across the 10 folds for each algorithm. The respective mean cross-validated RMSE values are reported in Table 10. To determine whether the observed differences in performance were statistically significant, we applied the Friedman test ([36]), followed by post-hoc pairwise Wilcoxon signed-rank tests ([37]).
The Friedman test indicated a statistically significant difference between the RMSE values of the four algorithms for both Model-1 and Model-2.
Since the Friedman test showed significant differences, the post-hoc pairwise Wilcoxon signed-rank test was conducted to assess the differences between individual algorithms. In Model-1, pairwise comparisons showed that decision tree was significantly different from random forest, extra trees, and gradient boosting (p = 0.002 for all comparisons). Extra trees also significantly outperformed random forest (p = 0.004) and gradient tree boosting (p = 0.002), while random forest and gradient tree boosting were marginally different (p = 0.049).
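Both tests are available in SciPy; a minimal sketch follows, where rmse_dt, rmse_rf, rmse_et, and rmse_gb stand in for the arrays of the 10 per-fold RMSE values of each algorithm (placeholders here, obtained from cross-validation in the study):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Per-fold RMSE values of each algorithm across the same 10 folds
# (random placeholders for illustration).
rng = np.random.default_rng(0)
rmse_dt, rmse_rf, rmse_et, rmse_gb = (rng.random(10) for _ in range(4))

# Friedman test across the four paired samples.
stat, p = friedmanchisquare(rmse_dt, rmse_rf, rmse_et, rmse_gb)

if p < 0.05:
    # Post-hoc pairwise Wilcoxon signed-rank tests on the paired folds.
    for name, rmse in [("dt", rmse_dt), ("rf", rmse_rf), ("gb", rmse_gb)]:
        _, p_pair = wilcoxon(rmse, rmse_et)
        print(f"extra trees vs {name}: p = {p_pair:.3f}")
```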
The results suggested that, in both models, the extra trees algorithm significantly outperformed the other methods, while decision tree consistently exhibited worse performance compared to all ensemble methods. The pairwise comparisons also revealed that random forest and gradient boosting performed similarly, with only marginal differences between them in certain cases. The statistical analysis reinforced the conclusion that extra trees was the most suitable algorithm for this regression task, as it consistently showed the lowest RMSE and significantly better performance across the folds compared to the other methods. Hence, the extra trees algorithm was selected.
The model development process was completed by evaluating the selected, top-performing ML algorithm (the extra trees ensemble method) on the test set, with the objective of assessing the model’s generalization capacity. Because the test set had not been involved in either training or validation, it provided a reliable final measure of how well the model performs on unseen data. Error metrics suitable for regression tasks were therefore used, and the results are reported in Table 11. Let $\hat{y}_i$ be the value predicted by the model for the $i$-th sample, $y_i$ the respective actual value, and $e_i = y_i - \hat{y}_i$ the corresponding error. The following error metrics, suitable for regression analysis ([17]), are then defined:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|e_i\right|,\qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}e_i^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}e_i^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}},\qquad\text{where}\ \ \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i$$

$$\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|e_i\right|}{\max\left(\varepsilon,\left|y_i\right|\right)}$$

where ε is an arbitrarily small positive number to avoid zeroing the denominator.
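Assuming the metric set above (MAE, RMSE, R², and MAPE), the test-set evaluation reduces to a few scikit-learn calls, continuing the sketches from earlier; note that scikit-learn's MAPE implementation uses exactly such an ε in the denominator:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

# `best_model` from the randomized search; X_test, y_test from the split.
y_pred = best_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # eps-guarded denominator
```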
Moreover, in Figure 12, the SHP values predicted by the model on the test set are plotted against the actual SHP values, and the points are clustered around the 45-degree line. Similarly, in Figure 13, the residual SHP values (actual minus predicted) on the test set are plotted against the predicted SHP values, and the points are clustered around the horizontal line through zero. A comparison of the results of Model-1 and Model-2 in both the test set and cross-validation revealed a slightly better performance of the model with the Days feature, owing to the additional information provided by the temporal feature.