2. Materials and Methods
Maintaining automotive engines effectively ensures vehicle reliability, safety, and durability. Recent advancements in technology, particularly in Artificial Intelligence (AI), sensors, and data analytics, have significantly enhanced the methods and systems used for automotive maintenance. This section provides an overview of various aspects of automotive engine maintenance systems, illustrating the progress and challenges in this domain.
AI technologies are increasingly being used to predict engine failures before they occur, and extensive research has been conducted in this field. Paper [2] describes how machine learning (ML) algorithms can analyse sensor data to predict engine component failures. ML can analyse large volumes of data from vehicle sensors; the resulting models learn from historical data to predict future events, such as potential failures. Paper [3] provides an overview of the application of neural networks to predict the lifespan of engine components based on operational data and past maintenance records. Deep learning, a subset of ML, is also helpful in identifying anomalies in sensor data that may indicate impending failures. In paper [4], Convolutional Neural Networks (CNNs) were used to detect unusual patterns in engine vibration data, a common precursor to mechanical issues. Reinforcement learning can be applied to optimise maintenance schedules based on the performance feedback of automotive engines [5]. Natural language processing (NLP) techniques can analyse maintenance records and service logs [6].
Integrating advanced sensors into automotive engines provides real-time data crucial for monitoring engine health. The Internet of Things (IoT) refers to a network of interconnected devices that can communicate with each other and exchange data over the Internet. The use of IoT-based onboard sensors for nitrogen oxide emissions estimation is presented in [7]. The review in paper [8] discusses how IoT devices can send real-time diagnostics and performance data to cloud-based systems for analysis, facilitating proactive maintenance strategies.
Predictive maintenance is a significant strategy in automotive applications. IoT-based maintenance [9] and lifecycle management of engine components can significantly extend the operational life of automotive engines. Paper [10] examines the methodologies used to assess the lifecycle of various engine components based on usage patterns and environmental factors. The economic impact of engine maintenance is also an important topic in the literature: papers [11,12] analyse the cost-effectiveness of different maintenance strategies in the automotive industry, including reactive, preventive, and predictive maintenance.
Emerging technologies such as AI, machine learning, and IoT will redefine the scope and effectiveness of maintenance systems. A future-oriented study [13] presents the next generation of maintenance systems powered by intelligent algorithms capable of self-diagnosis and automated repair functions. The concept of connected cars is reviewed in paper [14], presenting the possibilities and capabilities of hardware and software.
2.1. Research Goal
This paper analyses an Engine Failures Dataset [15]. A structured approach is used to extract meaningful relationships and to develop an AI-based predictive model.
Figure 1 describes the necessary steps:
Data collection and semantic analysis involve obtaining and exploring the dataset, reviewing documentation, and performing initial data exploration to identify data types, missing values, and basic statistics. Data cleaning and preprocessing include handling missing values through imputation, dealing with outliers by trimming, normalising or standardising the data if required, and conducting feature engineering to create new variables. Exploratory data analysis uncovers patterns, trends, and relationships within the data using visualisations such as histograms, box plots, and scatter plots. Model development involves selecting promising machine learning algorithms based on the data and the prediction task, building predictive models with them, and using cross-validation to tune the models and avoid overfitting. Model evaluation assesses the models’ performance using accuracy, precision, recall, and F1 score for classification problems. Presentation compiles the results into a comprehensive report, outlining the methodology, findings, and recommendations.
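As an illustration of the cleaning and preprocessing step, the following minimal sketch shows median imputation and standardisation using only the Python standard library. The helper names (`impute_median`, `standardise`) and the sample readings are hypothetical; they are not taken from the dataset or from the study's own code.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def standardise(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical sensor column with one missing reading.
raw = [521.7, None, 522.4, 521.9, 522.0]
filled = impute_median(raw)    # the gap is filled with the median of the rest
scaled = standardise(filled)   # z-scores of the filled column
```

Median imputation is only one possible choice; the dataset analysed here is reported to have no missing values, so this step would be a no-op for it.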
2.2. Description of the Dataset
This paper investigates an engine time-to-failure dataset. Neither the type of engine nor the units of the data columns are specified. The dataset describes a study in which 100 different engines are continuously monitored from the start of use until they fail. Each engine is run under certain conditions (possibly varying by factors such as operational intensity, environmental conditions, or maintenance schedules) until it experiences a breakdown. The dataset consists of the following columns:
ID: This column represents the unique identifier for each monitored engine or unit.
TTF: This stands for “time to failure”, which is the primary variable of interest, indicating the remaining operational time before an engine failure occurs.
s12, s14, s17: These columns represent various signals or sensor readings related to engine health. Each signal provides specific data points collected during engine operation.
Table 1 provides a brief overview of the first few rows of the dataset:
TTF shows a strong positive correlation with s12 (0.67) and a strong negative correlation with s17 (–0.61). The third signal, s14, has a moderate negative correlation with TTF (–0.31) and a weak positive correlation with s17 (0.25). A strong negative correlation (–0.70) exists between s12 and s17.
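Pairwise coefficients of this kind are typically Pearson correlations. The sketch below shows how such a correlation matrix could be computed; it assumes the reported values are Pearson coefficients, which the text does not state explicitly, and the six sample rows are hypothetical rather than taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical readings for one engine approaching failure (TTF counts down).
columns = {
    "TTF": [5, 4, 3, 2, 1, 0],
    "s12": [522.3, 522.1, 521.9, 521.8, 521.5, 521.2],
    "s17": [391, 392, 392, 393, 394, 395],
}

# Correlation matrix as a dictionary keyed by column-name pairs.
corr = {
    (a, b): pearson(columns[a], columns[b])
    for a in columns
    for b in columns
}
```

In this toy sample, s12 falls and s17 rises as TTF shrinks, so the signs of the coefficients match those reported above; the magnitudes are not comparable.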
Figure 2 shows the correlation heatmap. The descriptive statistics for each column in the dataset are shown in Table 2:
2.3. Key Observations on the Data
The following observations can be made. Every column contains 20,631 entries, indicating a complete dataset with no missing values. The mean and median values for signals s12, s14, and s17 are close, which suggests a roughly symmetric distribution around the centre for these measurements. The standard deviation for each sensor signal (s12, s14, s17) is relatively small, indicating that most values are close to the mean and that the sensor readings are consistent. The relatively narrow interquartile range for the sensors indicates that the middle half of the data points lies close to the median, which provides a reasonable basis for predictive modelling and reliability analysis.
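The statistics referred to in these observations (count, mean, standard deviation, median, and interquartile range) can be computed as in the sketch below. The `describe` helper and the sample readings are hypothetical, and the "inclusive" quantile method is an assumption chosen because it uses linear interpolation between data points; the study does not say which quantile definition was used.

```python
import statistics


def describe(values):
    """Summary statistics for one column of numeric readings."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "iqr": q3 - q1,                       # interquartile range
    }


# Hypothetical sensor readings, not taken from the dataset.
summary = describe([521.0, 522.0, 523.0, 524.0, 525.0])
```

A mean close to the median, together with a small `std` and `iqr` relative to the mean, corresponds to the pattern described above.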
2.4. Histograms
The histogram of s12 (Figure 3, Blue) illustrates the distribution of sensor s12 readings, which appear to be approximately normally distributed, with a slight asymmetry. The histogram of s14 (Figure 3, Red) also resembles a normal distribution but covers a narrower range than s12, indicating less variability in sensor s14 readings. The histogram of s17 (Figure 3, Green) shows a roughly normal distribution with a slight left skew, suggesting a concentration of values at the higher end of the range. Finally, the histogram of time to failure (Figure 3, Purple) depicts the distribution of TTF across the dataset and displays a right skew: most recorded TTF values are comparatively short, and progressively fewer observations have much longer remaining times.
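The direction of skew can also be checked numerically rather than read from a histogram. The sketch below computes the moment-based skewness coefficient, which is positive for a right-skewed sample (long tail towards high values) and negative for a left-skewed one. The function name and the sample data are hypothetical.

```python
def skewness(values):
    """Moment-based (Fisher-Pearson) skewness coefficient of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples illustrating the sign convention.
right_tailed = [1, 1, 2, 2, 3, 10]    # bulk at low values, long tail to the right
left_tailed = [-v for v in right_tailed]
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_tailed)   # positive
skew_left = skewness(left_tailed)     # negative
skew_sym = skewness(symmetric)        # approximately zero
```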
2.5. AdaBoost Classification Model
In this paper, an adaptive boosting (AdaBoost) algorithm was used. AdaBoost is a learning algorithm designed to improve the performance of weak classifiers by combining them into a strong classifier. It starts by assigning equal weights to all training samples. The algorithm then iteratively trains weak learners and adjusts the weights of the training samples based on their classification results: misclassified samples have their weights increased, while correctly classified samples have their weights decreased. This process ensures that subsequent learners focus more on hard-to-classify cases. Each weak learner’s contribution to the final model is determined by its error rate, with a lower error rate resulting in a higher weight (alpha). The final model is a weighted sum of the predictions from all weak learners.
In the preprocessing step, the dataset was loaded, and a threshold for classification was set using the median of the TTF values. Binary labels were created: samples with TTF values less than the median were labelled as 1 (high risk), and those with values greater than or equal to the median were labelled as 0 (low risk). The features used for classification were all three available sensor readings (s12, s14, s17).
The AdaBoost model was implemented in Python using the AdaBoostClassifier class from the scikit-learn library. The model was configured with a DecisionTreeClassifier as the base estimator (max_depth = 3), using 100 estimators (n_estimators = 100), a learning rate of 1.0 (learning_rate = 1.0), and a random state of 42 (random_state = 42) to ensure reproducibility. The experiment was conducted on a personal computer equipped with an Intel i7 processor and 16 GB of RAM. The data was split into training and testing sets, with 70% used for training and 30% used for testing. After training, the model was evaluated on the test set.
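The reweighting procedure described in this subsection can be sketched from scratch as follows. This is not the scikit-learn implementation used in the study: it is a minimal discrete AdaBoost that assumes single-threshold decision stumps as weak learners (instead of depth-3 trees), labels encoded as +1 and -1 (instead of 1 and 0), and a hypothetical one-dimensional toy dataset. All function names are illustrative.

```python
import math


def fit_stump(X, y, w):
    """Return the (error, feature, threshold, sign) rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] < thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    """Predict +1 or -1 for one sample with a fitted stump."""
    _, j, thr, sign = stump
    return sign if row[j] < thr else -sign


def adaboost_fit(X, y, n_rounds):
    """Train up to n_rounds stumps; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                        # equal initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-10)           # guard against log of zero
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # lower error -> larger alpha
        ensemble.append((alpha, stump))
        # Misclassified samples (t * prediction = -1) gain weight, others lose it.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]         # renormalise to sum to 1
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can separate.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost_fit(X, y, n_rounds=3)
train_errors = sum(1 for row, t in zip(X, y) if adaboost_predict(model, row) != t)
```

On this toy data the first stump alone misclassifies three samples, whereas the weighted combination of three stumps classifies all ten training samples correctly, which illustrates how reweighting steers later learners towards the cases that earlier ones got wrong.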
Predictions were generated for the test set, and various performance metrics were calculated, including the confusion matrix, accuracy, precision, recall, and F1 score.
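The median-based labelling rule and the listed metrics can be sketched as follows, treating class 1 (high risk) as the positive class. The function names, TTF values, and predictions are hypothetical and serve only to show how the quantities relate; they are not the study's data or code.

```python
import statistics


def label_by_median(ttf_values):
    """Label 1 (high risk) if TTF is below the median, else 0 (low risk)."""
    threshold = statistics.median(ttf_values)
    return [1 if t < threshold else 0 for t in ttf_values]


def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics, with class 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical TTF values and model predictions.
ttf = [10, 50, 20, 80, 30, 60]
y_true = label_by_median(ttf)        # median is 40, so TTF < 40 is high risk
y_pred = [1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Because the threshold is the median, the two classes are roughly balanced by construction, which makes accuracy a more meaningful metric than it would be on a heavily imbalanced label.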
4. Conclusions
One of the main advantages of data-driven approaches, such as the one employed in this study, is that they do not require extensive prior knowledge of automotive engine mechanics. This allows the analysis to be conducted purely based on the data, irrespective of the specific details or understanding of what each data point represents. However, the success of these approaches relies on the reliability and accuracy of the sensor data used for model training and testing.
This paper has presented a theoretical approach. An AdaBoost classification model was applied to classify automotive engines as high or low risk, using a median time-to-failure threshold and sensor readings. The model achieved an accuracy of approximately 80.08%, which is a reasonably high level of performance for an AdaBoost model. With a precision of 83.02%, most of the instances that the model flags as high risk are indeed high risk. The recall of 74.15% indicates that the model identifies the majority of “high risk” (class 1) instances. While this is a strong performance, it also shows room for improvement in catching more true positives. The F1 score, which balances precision and recall, is 78.33%. This suggests a good overall performance, balancing the identification of true positives and the minimisation of false positives.
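As a consistency check on the reported figures, the F1 score can be recomputed as the harmonic mean of the reported precision and recall. The short sketch below uses only the two percentages stated above; the variable names are illustrative.

```python
# Reported precision and recall, expressed as fractions.
precision = 0.8302
recall = 0.7415

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
f1_percent = round(100 * f1, 2)
```

The recomputed value agrees with the reported F1 score of 78.33% to two decimal places.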
It is important to note that this study is hypothetical. To validate the approach as an engineering solution, reliable data—including precise sensor accuracy, appropriate sampling rates, and detailed information on the interaction between the measured dimensions and the engine’s operational conditions—must be used. Without validation against real-world operational conditions and sensor data accuracy, the results remain a theoretical exercise.
The model’s predictions, if validated with reliable data, have the potential to enhance preventive maintenance schedules, reduce unexpected engine failures, and improve operational efficiency. By accurately predicting engine failures, resources can be better allocated to engines at a higher risk of failure, optimising maintenance efforts and costs.