1. Introduction
Modern vehicles include electronic modules that control the vehicle’s subsystems. These modules are called Electronic Control Units (ECUs). The number of ECUs in some vehicles can reach up to 70 [1]. Many vehicle networks have been developed to allow vehicle ECUs to communicate with each other. Controller Area Network (CAN) was introduced as an automotive communication network protocol by Robert Bosch LLC in 1994 [2]. FlexRay is another vehicle network protocol that provides more bandwidth than CAN [3]. Ethernet is also used as a network protocol in the automotive industry [4]. The data transferred within the vehicle network are called in-vehicle data.
In recent years, new vehicle communication concepts have been introduced, such as Vehicle to Vehicle (V2V) and Vehicle to Infrastructure (V2I). In these approaches, a vehicle can exchange data with other road elements, such as other vehicles, pedestrians and infrastructure cloud systems. The data transferred between the vehicle and other road elements are called connected vehicles data. More information about V2V and V2I communication can be found in [5,6].
Many new features and applications have been introduced that use vehicle data (both in-vehicle and connected vehicles data) to improve vehicle and road safety. Ziebinski et al. [7] reviewed the latest Advanced Driver Assistance Systems (ADAS), which use in-vehicle data to provide safety features such as lane detection, road object detection and traffic sign recognition. These systems require dedicated sensors such as cameras, radars and ultrasonic sensors to collect road information. Park et al. [8] proposed a forward collision warning system using a mono camera. A frontal object detection system based on sensor fusion of radar and a mono vision camera was proposed by Hsu et al. [9]. A literature review of connected vehicles data and the Internet of Things (IoT) for implementing the smart cities approach can be found in [10].
Connected vehicles data systems require wireless devices to transfer the data. Moreover, the volume of transferred data is large, and the data require advanced storage and processing systems. The ML approaches required to deal with connected vehicles data are also more complex than those required for in-vehicle data. Therefore, in-vehicle data systems are usually less expensive and more readily available than connected vehicles data-based systems.
The main goal of this research is to enhance vehicle and road safety using a low-cost ML system that uses readily available in-vehicle data. Two design considerations were taken into account to reduce the ML system cost. The first consideration is that the ML system requires only basic in-vehicle CAN data. No special sensors, such as cameras and radars, are required by the ML system. Engine rpm, engine coolant temperature, manifold pressure, vehicle acceleration and fuel consumption are examples of the data used by the ML system. These data are already available on the CAN for the main vehicle functionalities, and the proposed ML system reuses these existing data to predict road conditions. This significantly lowers the cost of the data required by the ML system.
The second consideration is to use traditional ML algorithms, such as decision trees, random forests and SVM. These algorithms can achieve acceptable accuracy scores and allow real-time implementation at low cost. Deep learning algorithms may provide more accurate predictions than the traditional ML algorithms, but they also require much more expensive systems for real-time implementation.
The proposed ML system handles three classification problems: road surface conditions, road traffic conditions and driving style. Road surface is characterized by three classes: full of holes, smooth or even. Road traffic is characterized as high, normal or low, and driving style is characterized as aggressive or normal.
In this paper, Section 2 explores work related to our research. Section 3 provides an overview of the proposed system architecture. Section 4 explains the dataset we used for training and testing the algorithms. Section 5 briefly explains the implementation of the ML algorithms. Section 6 defines the evaluation metrics. Section 7 presents the detection results. Section 8 discusses the system results, system limitations and future enhancements. Finally, conclusions are provided in Section 9.
2. Related Work
Since our proposed system uses in-vehicle data, this section explores related work on in-vehicle data and ML applications. Lattanzi et al. [11] used two ML approaches and in-vehicle sensor data to identify unsafe driving behavior. They used SVM and neural network algorithms for classification. The input features to the ML system were the vehicle speed, engine speed, engine load, throttle position, steering wheel angle and brake pedal pressure. The classification results of this study showed an average accuracy above 90% for both classifiers.
Alvarez-Coello et al. [12] proposed a model for detecting dangerous driving events using in-vehicle data. Random forests and a Recurrent Neural Network were used to classify the data. The authors used features such as acceleration, brake pedal position, accelerator pedal position, engine RPM and torque. The danger level was classified as normal, moderate or aggressive. Wang et al. [13] proposed a k-means clustering-based support vector machine (kMC-SVM) method to classify drivers into two types: aggressive and moderate. Vehicle speed and throttle opening were treated as the feature parameters to reflect the driving styles.
Osman et al. [14] introduced a machine learning model for near-crash prediction from observed vehicle kinematics data. Vehicle kinematics data, such as speed, longitudinal acceleration, lateral acceleration, yaw rate and pedal position, were used as input features for multiple ML systems. The authors used several machine learning algorithms, such as K nearest neighbor (KNN), random forests, support vector machine (SVM) and adaptive boosting (AdaBoost), to predict near-crash situations. The AdaBoost algorithm showed better recall and F-score than the other algorithms. A system that can identify the driver trip using historical trip-based data collected from in-vehicle data was proposed by Moreira-Matias et al. [15]. Decision trees obtained an accuracy between 75% and 100%.
Ghadge et al. [16] proposed a model to detect road potholes using vehicle accelerometer information and GPS data. The k-means clustering algorithm was applied to the training data to build the model. A random forests classifier was then used to evaluate this model on the test data for better prediction. Dhiman et al. [17] proposed a computer vision approach to detect potholes using a stereo vision camera and a deep learning algorithm. Kim et al. [18] reviewed pothole detection techniques that use machine learning. The paper summarized the different approaches for pothole detection using vibration sensors, accelerometers, 3D reconstruction and 2D images.
Bernas et al. [19] provided a survey of low-cost techniques to detect road traffic using in-vehicle sensors. The techniques include applications of infrared and visible light sensors, wireless transmission, accelerometers, magnetometers, ultrasonic and microwave radars as well as acoustic sensing.
There are many other applications that use in-vehicle data with ML. For example, a vehicle theft prevention and driver identification system was proposed by Martinelli et al. [20,21]. A system to predict the driver’s drowsiness based on the air quality in the car cabin was proposed by Goh et al. [22]. Bai et al. [23] proposed a system to address the problem of detecting traffic signals from a set of vehicle speed profiles.
The significance of our research is in proposing a low-cost prediction system for road conditions and driving style. To reduce the system cost, general CAN data were used as input features to the ML system. No additional cost for special sensors is required by the system. Furthermore, the chosen ML algorithms are inexpensive to implement and do not require a complex computing system.
3. System Overview
This section describes how ML and vehicle network data can be used together to implement a full prediction system. It also explains how the predictions can be used in safety applications. The safety application can be implemented in the vehicle and in the infrastructure system by transferring the prediction results to the infrastructure. The proposed system block diagram is summarized in Figure 1. The proposed system includes the following components:
Vehicle network: The in-vehicle data to be fed to the ML system are collected through a vehicle network.
Data logging system: The data logging system collects data from the vehicle network.
Machine learning system: The machine learning system receives the data from the logging system and then classifies them. A training dataset is required to train the ML algorithms. The training dataset should be labeled correctly to the required classes.
Vehicle to Infrastructure communication (V2I): This network is used to transfer the result of the ML predictions to the infrastructure system.
Vehicle application system: This system uses the ML results to provide in-vehicle safety functions for the driver. For example, if the road traffic is classified as high, then a warning is issued to drive carefully.
Infrastructure application system: This system uses the ML classification result to provide functions at the infrastructure level. For example, a road maintenance request is issued if the road surface is detected as being full of holes.
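As a rough illustration of how the two application systems might consume the ML output, the sketch below maps predicted labels to actions. The function names, label strings and message texts are assumptions made for this example; only the two rules (a warning when traffic is high, a maintenance request when the road is full of holes) come from the component descriptions above.

```python
from typing import Optional


def vehicle_application(traffic_label: str) -> Optional[str]:
    """In-vehicle function: warn the driver when traffic is classified as high."""
    if traffic_label == "high":
        return "Warning: heavy traffic, drive carefully."
    return None  # no action for "normal" or "low" traffic


def infrastructure_application(surface_label: str) -> Optional[str]:
    """Infrastructure function: request maintenance for a road full of holes."""
    if surface_label == "full of holes":
        return "Road maintenance request issued."
    return None  # no action for "smooth" or "even" surfaces


# Hypothetical ML predictions for one time window.
predictions = {"traffic": "high", "surface": "smooth", "style": "normal"}
print(vehicle_application(predictions["traffic"]))
print(infrastructure_application(predictions["surface"]))
```

In a deployed system the infrastructure function would run on the receiving side of the V2I link rather than in the vehicle; both are shown in one process here only to keep the sketch short.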
Three algorithms were implemented for in-vehicle data classification: decision trees, random forests and Support Vector Machine (SVM). A labeled dataset collected from the CAN was used to train and test these algorithms. The classification results were analyzed with respect to accuracy, precision, recall and F-score.
4. The Dataset
The dataset used in this work was obtained from the Kaggle website under the title Traffic, Driving Style and Road Surface Condition [24]. Two cars were used to collect the dataset, a Peugeot 207 1.4 HDi and an Opel Corsa 1.3 HDi. The dataset was collected from the vehicles’ On-Board Diagnostics (OBD) port using an OBD device that can be paired with a smartphone. Ruta et al. [25] used this dataset to propose machine learning models for the Internet of Things (IoT).
The dataset includes 14 input features. They are summarized as follows:
Altitude change, calculated over 10 s.
Instantaneous vehicle speed (current speed value).
Average vehicle speed over the last 60 s.
Speed variance in the last 60 s.
Speed variation for every second of detection.
Longitudinal acceleration.
Engine load, expressed as a percentage.
Engine coolant temperature in degrees Celsius.
Manifold Absolute Pressure (MAP), a parameter used by the internal combustion engine to compute the optimal air/fuel ratio.
Revolutions Per Minute (RPM) of the engine.
Mass Air Flow (MAF) Rate measured in g/s. This reading is used by the engine to set fuel delivery and spark timing.
Intake Air Temperature (IAT) at the engine entrance.
Vertical acceleration.
Average fuel consumption, calculated as liters per 100 km.
The dataset was labeled for three classification sub-problems, i.e., road surface conditions, road traffic conditions and driving style. The road surface condition was labeled as smooth, even or full of holes. The road traffic condition was labeled as low, normal or high, and the driving style was labeled as normal or aggressive. The dataset includes 24,957 data points.
Table 1 summarizes the input features, the categories and the labels for each category of the dataset.
The number of samples per label differs within each category. Smooth roads, normal traffic conditions and normal driving style represent the majority of the labels in their respective categories. This is due to the nature of the roads used for data collection.
Figure 2 shows the distribution of the labels for the classification categories we considered in this study.
5. ML Algorithms Implementation
The ML algorithms used in this work are common and widely used in classification problems. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test and each leaf node (terminal node) holds a class label. A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a treelike shape. More information about decision trees can be found in [26].
Random forests is an ML algorithm constructed from many decision trees. The random forests algorithm establishes the outcome based on the predictions of the individual trees. For classification, it aggregates the tree outputs by majority vote or by averaging their predicted class probabilities; for regression, it takes the mean of the tree outputs. More information about random forests can be found in [27].
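To make the aggregation step concrete, the following minimal sketch combines the outputs of several trees by majority vote. The vote list is hypothetical and the helper name is an assumption; note that scikit-learn's random forest classifier averages the trees' predicted class probabilities rather than counting hard votes, so this is an illustration of the idea, not of that library's internals.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical outputs of five trees for one road-surface sample.
votes = ["smooth", "even", "smooth", "full of holes", "smooth"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```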
The support vector machine is an algorithm that tries to find a hyperplane that separates the data according to their classes. SVM chooses the boundary that maximizes the margin, i.e., the distance between the boundary and the closest data points of each class (the support vectors). More information about SVM can be found in [28].
As mentioned in the previous sections, three ML algorithms were implemented to classify the CAN data. The implementation was done in Python using the scikit-learn, pandas, SciPy, NumPy and Matplotlib packages [29,30,31,32]. The packages were used for ML implementation, results analysis and visualization. The dataset was divided into 80% for training and 20% for testing. This yields 19,965 samples for training and 4992 for testing.
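The stated split sizes can be reproduced with a short calculation. The sketch below assumes the test portion is rounded up and the remainder goes to training, which is how scikit-learn's train_test_split treats a fractional test_size; whether that function was used here is an assumption.

```python
import math

n_samples = 24957           # dataset size stated in Section 4
test_fraction_percent = 20  # 20% held out for testing

# Assumption: the test size is rounded up and the remainder is used for training.
n_test = math.ceil(n_samples * test_fraction_percent / 100)
n_train = n_samples - n_test

print(n_train, n_test)
```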
In the decision trees implementation, the minimum number of samples required to split an internal node was set to 2, while the minimum number of samples per leaf was set to 1. In total, 200 trees were used in the random forests implementation.
SVM was implemented with a radial basis function (RBF) kernel because the dataset is highly nonlinear and the kernel is needed to create accurate decision boundaries. The kernel scaling parameter (γ) and the cost parameter (C) were adjusted to achieve the best classification accuracy.
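The configuration described above could be set up in scikit-learn roughly as follows. This is a sketch under stated assumptions: the data are synthetic stand-ins generated with make_classification (the real CAN dataset is not loaded), the random seeds are arbitrary, and the SVM values for gamma and C are placeholders because the tuned values are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CAN dataset: 14 features, 3 classes (assumption).
X, y = make_classification(
    n_samples=1000, n_features=14, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    # Split and leaf settings as stated in the text.
    "decision_tree": DecisionTreeClassifier(
        min_samples_split=2, min_samples_leaf=1, random_state=0
    ),
    # 200 trees as stated in the text.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # RBF kernel; gamma and C are placeholders, not the tuned values.
    "svm_rbf": SVC(kernel="rbf", gamma="scale", C=1.0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```

The accuracies printed here describe the synthetic data only and say nothing about the results reported in Section 7.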
6. Evaluation Metrics
The results of the detection are analyzed by showing the classification confusion matrix for each algorithm. The confusion matrix shows the true positives, false positives, true negatives and false negatives for each of the classification problems in this study. From the confusion matrix, accuracy, recall, precision and F-score are calculated.
Accuracy is the ratio of the number of correct predictions to the total number of predictions. Precision is the fraction of samples predicted as positive that are actually positive. Recall is the fraction of actually positive samples that are predicted as positive. F-score is the harmonic mean of precision and recall.
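The above measures are given as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.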
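For a multi-class problem, these measures can be computed per class directly from the confusion matrix by treating one class as positive and all others as negative. The sketch below does this in plain Python; the matrix values are hypothetical and are not results from this study, and the code assumes every class appears at least once among both the true and the predicted labels so that no denominator is zero.

```python
def per_class_metrics(cm, k):
    """Precision, recall and F-score for class k of confusion matrix cm.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually other
    fn = sum(cm[k]) - tp                             # actually k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Hypothetical 3-class confusion matrix (not results from this study).
cm = [
    [50, 5, 5],
    [10, 80, 10],
    [0, 5, 35],
]
print(f"accuracy = {accuracy(cm):.3f}")
for k in range(3):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```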
The permutation feature importance approach was implemented to show how much each feature contributes to detection accuracy. It works by shuffling the values of a single feature at a time, which destroys the information in that feature while leaving the remaining features unchanged. If the quality of prediction drops sharply, the feature is very important to the predictor. Feature ranking helps in understanding how the ML algorithms work and which data matter most to them. André et al. [33] provide more information about permutation importance and its implementation.
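The shuffling procedure can be sketched in a few lines of plain Python. Everything in the example is illustrative: the toy data, the stand-in model (a fixed rule rather than a trained classifier), the number of repeats and the seed are all assumptions chosen so that the expected behaviour is easy to see.

```python
import random


def accuracy(model, X, y):
    """Fraction of samples for which model(row) equals the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the label depends only on feature 0; feature 1 is irrelevant.
X = [[i / 200, (i * 37 % 200) / 200] for i in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    """Stand-in for a trained classifier (assumption for this sketch)."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling feature 0 should reduce accuracy substantially, while shuffling feature 1 leaves it unchanged because the stand-in model never reads it. scikit-learn offers a comparable utility, sklearn.inspection.permutation_importance; the text does not state whether it was used here.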
7. Results
7.1. Road Surface Conditions Classification Results
Table 2 shows the accuracy, precision, recall and F-score of the road surface conditions classification, and Table 3 shows the confusion matrix of the predictions.
Table 4 shows the top seven important features for road surface detection for the three algorithms. As shown in the results, engine coolant temperature is the most important feature for the three algorithms (decision trees, random forests and SVM). SVM was the only approach to have the longitudinal acceleration as one of the top seven important features for classification.
7.2. Road Traffic Conditions Classification Results
Table 5 shows the accuracy, precision, recall and F-score of the road traffic conditions classification, while Table 6 shows the confusion matrix of the predictions.
Table 7 shows the feature importance for road traffic classification using the permutation feature importance. SVM relied on vehicle instant speed for classification, while decision trees and random forests relied more on the average speed. Fuel consumption ranked as the third important feature for decision trees and random forests, while it was not in the top seven important features for SVM. Manifold absolute pressure was more important to SVM than the other two algorithms.
7.3. Driving Style Classification Results
Table 8 shows the accuracy, precision, recall and F-score of the driving style classification, and Table 9 shows the confusion matrix of the predictions.
Table 10 shows the feature importance for driving style classification. Fuel consumption was more important for decision trees and random forests, while manifold absolute pressure was more important for SVM.
8. Discussion
In this work, in-vehicle data were used to make predictions for road conditions and driving style using supervised machine learning algorithms. The dataset was collected from the vehicle CAN. It includes 14 features, such as vehicle speed, longitudinal acceleration, fuel consumption and engine rpm, as shown in Table 1. The data were labeled for three categories. The first is road surface conditions, which classifies the road as full of holes, smooth or even. The second is road traffic conditions, which classifies the traffic as low, normal or high. The third is driving style, which is classified as aggressive or normal.
A detailed overview of the system architecture is shown in Figure 1. The model includes a data logging system for in-vehicle data. A machine learning system was used for classification and prediction.
Three ML algorithms were implemented, i.e., decision trees, random forests and SVM. The detection results showed that random forests provided the best performance among the three algorithms. Decision trees came second, while the lowest performance algorithm was SVM.
Figure 3, Figure 4 and Figure 5 show the detection results as charts for the three classification problems covered in this work.
Due to the nature of the roads where the dataset was gathered, 61% of the road surface samples are smooth and only 13% are full of holes. Moreover, 75% of the traffic samples are normal and only 12% are high traffic. Normal driving style accounts for 89% of the data and the rest is aggressive. This imbalanced data distribution can bias the ML models toward the majority classes. Although the recall results in this study were good, which means that detection was accurate for the positive (low-sample) classes, it is always better to have balanced data. Future work may address this issue by collecting more training data to obtain a more balanced dataset. Another solution is to use oversampling techniques to increase the number of positive-class samples.
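One simple oversampling technique is random oversampling, which duplicates randomly chosen minority-class samples until every class is as large as the majority class. The sketch below is a minimal version of that idea; the function name, seed and toy data are assumptions, and the study itself did not apply oversampling.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy training set mirroring the 89% / 11% driving-style imbalance.
labels = ["normal"] * 89 + ["aggressive"] * 11
samples = [[float(i)] for i in range(len(labels))]  # placeholder feature vectors

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only; duplicating samples before the train/test split would let copies of the same sample appear on both sides and inflate the test scores.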
The permutation feature importance technique was used to rank the input features by their impact on the detection results. Decision trees and random forests produced almost the same feature ranking, while SVM produced a different one. This matters if a voting system is developed to choose among the outputs of several ML detectors: such a system benefits from algorithms that build different classification models and therefore behave differently. Feature ranking also showed that some features had little impact on the detection results. For example, engine load and manifold absolute pressure have a very low impact on driving style detection. Altitude variation and vehicle speed variation have a very low impact on driving style detection in SVM. Therefore, eliminating low-ranked features can improve the ML system performance and help avoid model overfitting; it also helps in the practical implementation of the system.
More work can be added in the future to this research. Collecting more data from other vehicle systems, such as suspension, brake and gear, can help improve the results. Extracting some statistics from the data, such as mean, standard deviation and median, can add more value to the input features. Ranking the features and eliminating the low impact features is also a good practice to reduce system complexity.
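As an illustration of the statistical features mentioned above, the following sketch computes the mean, standard deviation and median of a signal over consecutive non-overlapping windows. The window length, the use of the population standard deviation and the sample readings are assumptions made for the example.

```python
import statistics


def window_features(values, window):
    """Mean, standard deviation and median over consecutive non-overlapping windows."""
    features = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        features.append({
            "mean": statistics.mean(chunk),
            "std": statistics.pstdev(chunk),  # population standard deviation
            "median": statistics.median(chunk),
        })
    return features


# Hypothetical engine speed readings (rpm), one per second.
rpm = [800, 820, 810, 1500, 1600, 1550, 2100, 2000, 2050]
for row in window_features(rpm, window=3):
    print(row)
```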
Fusing data from many sources provides a better understanding of the area surrounding the vehicle and thus yields a better prediction system. Therefore, fusing in-vehicle data with connected vehicle data should boost the performance of the ML system. Adding data from sensors such as cameras, radar and lidar is also expected to improve the detection results.
Deep learning algorithms, such as neural networks, are suggested as future work. Neural networks may perform better than conventional ML algorithms such as random forests and SVM. However, deep learning techniques require more computation and therefore a more expensive system. Choosing between deep learning and the traditional ML algorithms is thus a trade-off between system accuracy and system cost.
9. Conclusions
In this study, an ML system is proposed to solve three classification problems: road surface conditions, road traffic conditions and driving style. Decision trees, random forests and SVM were implemented in Python. In-vehicle CAN data were used to train and test the algorithms.
Random forests showed the best accuracy, precision, recall and F-score for all the classifications. The nature of the features and the size of the training dataset are what give one algorithm the advantage over another. From the results, we can conclude that, for this dataset, random forests is the best of the three algorithms for predicting road surface conditions, road traffic conditions and driving style.
The feature importance of the algorithms was analyzed using the permutation feature importance technique. Decision trees and random forests have almost the same feature importance ranking, while SVM showed a different ranking. For example, SVM ranked longitudinal acceleration highly for road surface detection, while decision trees and random forests ranked this feature low. Feature ranking can help eliminate the low-ranked features to reduce system complexity while maintaining the ML system performance.
Finally, this work shows that the vehicle network carries rich information that can be analyzed and classified using ML to provide useful applications. In-vehicle data combined with traditional ML algorithms can provide a system with high accuracy and an inexpensive implementation compared to more complex ML systems.