The final two stages of the CRISP-DM methodology are presented in this section. First, the models built in the previous phase are assessed. Then, a set of recommendations for preventive strategies based on the predictive modelling results is discussed.
4.1. Evaluation Results
Within the CRISP-DM methodology, the evaluation phase holds significant importance, as it allows for a comprehensive assessment of the quality and effectiveness of the developed machine learning models. As already mentioned, to address the challenge of predicting pallet collapse, this study employed the Decision Tree and Random Forest algorithms due to their proven effectiveness in classification tasks, particularly in scenarios with class imbalance. The Decision Tree was chosen for its simplicity and interpretability, which make the decision-making process easier to understand. The Random Forest, an ensemble method, was selected for its ability to reduce overfitting and improve predictive accuracy by averaging the results of multiple trees.
During this phase, the primary objective was to assess the models’ generalization capabilities, that is, their accuracy in predicting unseen data. To achieve this, 20% of the dataset was reserved during the initial split as a dedicated test set, used only to evaluate the models after training. Employing this separate test dataset ensured an unbiased assessment of each model’s predictive performance and of its ability to generalize beyond the training data.
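As an illustration of this step, a minimal sketch of an 80/20 hold-out split is shown below, assuming a Python/scikit-learn workflow; the synthetic dataframe, column names, and random seed are illustrative assumptions rather than the code actually used in the study.

```python
# Illustrative 80/20 hold-out split (assumed scikit-learn workflow; the data
# below are synthetic stand-ins for the preprocessed shipment table).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200
shipments_df = pd.DataFrame({
    "Container_Height": rng.normal(250, 10, n),           # hypothetical values
    "Container_Diameter": rng.normal(80, 3, n),            # hypothetical values
    "Temperature_Delivery_Date": rng.uniform(0, 35, n),    # hypothetical values
    "Complaint (Y/N)": rng.binomial(1, 0.25, n),           # 1 = pallet collapse complaint
})

X = shipments_df.drop(columns=["Complaint (Y/N)"])
y = shipments_df["Complaint (Y/N)"]

# 20% of the examples are reserved as the dedicated test set; stratification
# (an assumption, not stated in the text) keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```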
The evaluation of the developed models, which addressed a binary classification problem, involved constructing a confusion matrix for each model. The confusion matrix provides a detailed view of a model’s accuracy, allowing for the calculation of various performance metrics. Given the objective of identifying pallets at risk of collapsing during transportation, minimizing false negatives is critical. At the same time, it is important to avoid excessive false positives to prevent unnecessary alerts. The F1-score was chosen as the primary metric for optimization during the hyperparameter tuning process. The F1-score, which represents the harmonic mean of Precision and Recall, offers a balanced evaluation of the model’s performance:

\[
\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]

where Precision and Recall are calculated as follows:

\[
\text{Precision} = \frac{TP}{TP + FP}
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]

where TP stands for true positives, TN for true negatives, FP for false positives, and FN for false negatives.
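For concreteness, the short sketch below shows how these metrics can be derived from the confusion-matrix counts with scikit-learn; the labels and predictions are illustrative values only, not the study’s results.

```python
# Computing Precision, Recall and the F1-score from confusion-matrix counts,
# matching the formulas above (illustrative values, not the study's results).
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # 1 = collapsed pallet ("Yes")
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                 # TP / (TP + FP)
recall = tp / (tp + fn)                    # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# The same values via scikit-learn's built-in metrics:
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```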
To facilitate a comparison among the developed models, Table 11 was built, presenting a comprehensive overview of each model’s performance. This table allows for an analysis of each model’s strengths and weaknesses based on the Precision, Recall, and F1-score values.
To aid in interpreting
Table 11, it is important to note the naming convention used: for example, “DecisionTrees7525” refers to the Decision Tree model with a class imbalance, where 75% indicates the prevalence of the majority class (“No”) and 25% represents the minority class (“Yes”). This convention helps identify each model’s specific configuration and clarifies the class distribution used during training.
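As an illustration of how such class ratios might be obtained, the sketch below undersamples the majority class to a target share; this is one possible approach under stated assumptions, not necessarily the resampling procedure used in the study.

```python
# One possible way to build a training set with a fixed class ratio (e.g. 60/40)
# by undersampling the majority class ("No"); illustrative only.
import pandas as pd

def resample_to_ratio(df: pd.DataFrame, target_col: str, majority_share: float,
                      random_state: int = 42) -> pd.DataFrame:
    """Undersample the majority class so it makes up `majority_share` of the result."""
    counts = df[target_col].value_counts()
    majority_label, minority_label = counts.index[0], counts.index[1]
    n_minority = counts[minority_label]
    # Majority size needed so that majority / (majority + minority) == majority_share.
    n_majority = int(round(n_minority * majority_share / (1.0 - majority_share)))
    majority_sample = (df[df[target_col] == majority_label]
                       .sample(n=n_majority, random_state=random_state))
    minority_all = df[df[target_col] == minority_label]
    return pd.concat([majority_sample, minority_all]).sample(frac=1.0, random_state=random_state)

# Tiny demo frame: 10 "No" (0) and 2 "Yes" (1) examples.
demo = pd.DataFrame({"Complaint (Y/N)": [0] * 10 + [1] * 2, "feature": range(12)})
resampled = resample_to_ratio(demo, "Complaint (Y/N)", majority_share=0.60)
print(resampled["Complaint (Y/N)"].value_counts())   # 3 "No" vs 2 "Yes", i.e. 60/40
```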
The confusion matrix for the model with the best performance, RandomForests6040, based on the F1-score results, is presented in Table 12. This matrix shows the results for the 1206 examples in the test set, facilitating the calculation of the metrics previously discussed.
Table 13 and Table 14 present the optimal values of the hyperparameters that resulted in the maximum F1-score values during the tuning process for the Decision Tree and Random Forest models, respectively. Hyperparameter tuning involved an exhaustive grid search in which the key hyperparameters of each model were adjusted to optimize the F1-score. This tuning process ensured that each model was fine-tuned to strike the right balance between sensitivity (Recall) and Precision.
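A minimal sketch of such a grid search, assuming a scikit-learn implementation and an illustrative parameter grid (not the exact ranges searched in the study), is given below; the scoring argument is set to the F1-score so that the search optimizes the metric of interest.

```python
# Sketch of an exhaustive grid search optimizing the F1-score (assumed
# scikit-learn implementation; the grid values are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the 60/40 training configuration.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {                      # hypothetical grid, not the authors' exact values
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",                   # F1-score as the optimization target
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```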
The analysis of the results clearly shows that the Random Forest model outperformed the Decision Tree model based on the F1-score, which was the primary metric of interest. The Random Forest model proved to be more effective at capturing the complexities of the classification problem, resulting in more accurate predictions than the Decision Tree model. This is in line with previous studies, such as Panchapakesan et al. [17], that demonstrate the superiority of Random Forest over other models such as the Decision Tree and Neural Network algorithms.
Several steps were taken to refine the dataset, including the normalization of continuous variables, the encoding of categorical variables, and the handling of missing data. These preprocessing steps improved the models’ performance; in particular, encoding the categorical variables allowed the tree-based models to make full use of those features. The superior performance of the Random Forest model can be attributed to its ensemble nature, which aggregates the predictions of multiple decision trees to reduce variance and improve generalization. Unlike the single Decision Tree, the Random Forest model was able to capture complex interactions between features, leading to higher accuracy and a better balance between Precision and Recall.
Identifying the most informative features is essential for addressing the underlying classification problem. In this context, the concept of Impurity Decrease was used to evaluate each feature’s contribution to reducing data impurity at the decision splits of the individual trees. This approach was applied to the RandomForests6040 model, which demonstrated superior performance.
The relative importance of features was assessed by calculating the average Impurity Decrease across all 103 trees in the model. The analysis identified “Destination_Country” as the most influential variable in the prediction task, followed closely by the “Container_Height”, “Container_Diameter”, and “Temperature_Delivery_Date” features. These findings are visually represented in
Figure 5. The high importance of features like “Destination_Country” and “Container_Height” can be linked to the inherent logistics and handling challenges in different regions and the physical dimensions of the cargo, respectively. This insight is critical for logistics companies to understand where to focus their efforts in securing cargo during transit.
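The sketch below illustrates this ranking using scikit-learn’s feature_importances_ attribute, which corresponds to the impurity decrease averaged over the trees of the forest; the data are synthetic stand-ins and the mapping to the study’s variable names is assumed for illustration only.

```python
# Ranking features by mean impurity decrease, as exposed by scikit-learn's
# feature_importances_ (synthetic data; variable names used only as labels).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Destination_Country", "Container_Height",
                 "Container_Diameter", "Temperature_Delivery_Date"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.6, 0.4], random_state=0)

# 103 trees, matching the number of trees reported for the tuned model.
model = RandomForestClassifier(n_estimators=103, random_state=0).fit(X, y)

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```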
Comparing the results of a study with those reported in the existing literature is crucial in research, as it helps validate and contextualize the findings within a broader knowledge framework. Evaluating whether a new study’s results align with or diverge from prior work allows for an assessment of their consistency, reliability, and generalizability.
The findings of this study align with those of Panchapakesan et al. [
17], as both identify the Random Forest model as the most effective approach for addressing the problem under study. However, a discrepancy arises in identifying the most influential feature for predicting complaints about shipping containers. While Panchapakesan et al. [
17] identified the duration of container storage as the most informative variable, the current study found that the corresponding variable, “Warehousing_Time”, did not show a statistically significant correlation with the target variable, “Complaint (Y/N)”. This divergence in findings may be attributed to differences in the contexts of the two studies and the varying influence of business type, which could lead to different outcomes.
The research by Wu et al. [
15] offered valuable insights into predicting cargo loss severity during transportation. Their study highlighted Transit Types, Product Categories, and Shipping Destinations as key features. While the dataset used in the current study lacked sufficient information to directly verify the findings related to Transit Types and Product Categories, the identification of Shipping Destinations as a significant feature aligns with their results. Consistent with Wu et al. [
15], the current study found “Destination_Country” to be the most critical factor in predicting pallet collapses during transport. This agreement on the importance of Shipping Destinations strengthens the validity of the current study’s conclusions.
This study makes a novel contribution to the existing literature by identifying new variables related to the geometry of the shipped product, specifically “Container_Height” and “Container_Diameter”, as well as the average temperature recorded at the delivery location on the delivery date, denoted as “Temperature_Delivery_Date”. These variables are found to be highly informative for predicting pallet collapse during transportation. This is the first study to recognize the significance of these variables in addressing the issue. Moreover, this research breaks new ground as the first to specifically focus on predicting pallet collapse during transportation. It also addresses the unique product-related challenges encountered in the glass industry, adding to its innovative approach.
4.2. Preventive and Mitigation Strategies
The predictive analysis conducted in this study reveals key insights into the factors contributing to pallet collapses during transportation. Based on the identified influential features, several preventive strategies can be implemented to reduce cargo loss.
Firstly, optimizing packaging specifications is crucial. The study identified both “Container_Height” and “Container_Diameter” as significant predictors of pallet collapse. To address this, companies should standardize packaging dimensions to ensure stability. By adhering to optimal container sizes and avoiding excessive height or diameter, the likelihood of imbalances that could lead to collapses is reduced. Additionally, reinforcing packaging through the use of double-walled or specially designed containers for high-risk shipments can further enhance stability. Testing packaging designs for their ability to withstand various transport conditions is also recommended.
Temperature control is another critical factor. The “Temperature_Delivery_Date” feature was found to impact the risk of pallet collapse significantly. To mitigate this risk, it is essential to implement measures for maintaining optimal temperature conditions during transport. This could involve utilizing temperature-controlled trucks or containers and ensuring regular monitoring to adhere to specified temperature ranges. Integrating temperature logging systems into the shipping process can provide real-time data, allowing for prompt corrective actions if temperature deviations occur.
Reviewing and adjusting shipping routes based on the analysis of the “Destination_Country” feature can also help reduce risks. It is important to analyse shipping routes and destination-specific factors that may contribute to increased risk, such as exposure to extreme weather conditions or rough terrains. Developing a risk assessment framework based on historical data for different destination countries can guide adjustments in shipping strategies and packaging solutions according to the risk profile of each destination.
Improving handling procedures is another vital strategy. Enhancing training programs for handling and loading procedures will ensure that personnel are well informed about best practices for securing and managing loads. This can significantly reduce the chances of pallet collapses. Additionally, investing in high-quality handling equipment and performing regular maintenance will prevent equipment failures during loading and unloading.
Leveraging predictive analytics for proactive measures is also recommended. Developing early warning systems that utilize predictive models to flag high-risk shipments based on identified features can trigger preventive actions, such as additional packaging or routing adjustments, before shipment dispatch. Continuous monitoring of predictive models as new data become available allows for real-time adjustments to strategies, improving overall risk management.
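One possible, simplified form of such an early-warning rule is sketched below: shipments whose predicted collapse probability exceeds a configurable threshold are flagged for preventive action before dispatch. The model, data, and threshold are illustrative assumptions rather than a deployed system.

```python
# Illustrative early-warning rule: flag shipments whose predicted collapse
# probability exceeds a configurable threshold, so that preventive actions
# (extra packaging, routing changes) can be triggered before dispatch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, weights=[0.6, 0.4], random_state=0)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

RISK_THRESHOLD = 0.5  # hypothetical cut-off; in practice it would be tuned against the F1-score
collapse_probability = model.predict_proba(X_new)[:, 1]   # estimated P(pallet collapse)
high_risk = np.where(collapse_probability >= RISK_THRESHOLD)[0]
print(f"{len(high_risk)} of {len(X_new)} planned shipments flagged for preventive action")
```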
Finally, collaborating with industry partners can enhance these efforts. Sharing best practices and learning from others’ experiences can lead to improved standards and innovative solutions. Engaging with stakeholders and partners can foster collaborative efforts to address common challenges in transportation and logistics.
By implementing the aforementioned strategies, businesses can proactively address the factors contributing to pallet collapses and effectively reduce cargo loss during transportation. The insights derived from predictive modelling will support informed decision making and enhance the efficiency and safety of the supply chain.