4.1. Driving Simulation Environment
We developed a PC-based driving simulator to acquire driving data, using a Logitech G920 steering wheel to control the vehicle. The simulator runs Euro Truck Simulator 2 (ETS2), which includes a wide variety of closely modeled European roads.
For data collection, we used a vehicle built through the software's extension interface, called a modification (MOD). The vehicle was modeled on the specifications of a Hyundai Elantra: size, acceleration, maximum speed, and turning radius.
Figure 7 shows a participant collecting data in the developed simulator.
Additionally, we designed the route using the roads included in the software. The driving route for data acquisition is shown in Figure 8a. It included urban roads and highways, as shown in Figure 8b,c, and comprised both straight and curved sections, as shown in Figure 8d. Road structures such as toll gates, traffic lights, and interchanges, shown in Figure 8e, were also included. Sensor data such as the SWA, accelerator pedal pressure, and brake pedal pressure were acquired during driving at a sampling rate of 30 Hz. The driver’s face was recorded simultaneously to provide the ground truth.
To induce drowsy driving during data collection, we ran the simulations between 2 a.m. and 4 a.m. in a dark environment with outside light blocked. In addition, we advised participants to limit sleep-disrupting behaviors such as naps and caffeine intake before data collection. Preliminary data collection was conducted with 54 participants (21 men and 33 women) in their 20s and 30s, each with more than 1 year of driving experience. We excluded participants whose data were unsuitable because of events such as deviation, speeding, and forward collision; those who were not sleepy at all; and those with mild motion sickness. As a result, 17 participants (11 men and 6 women) were selected and their data were collected. The driving route was driven 20 times, and a total of 198.3 h of simulation data were collected.
Three experts labeled the drowsy state (in seconds) based on the facial movements and expressions of the driver [38,39] in the recorded video. The detailed labeling criteria are shown in Table 3.
4.2. Experimental Environment
We ran the performance evaluation on a PC with an Intel i7-7700 CPU and two GeForce RTX 2080 Ti GPUs. We used TensorFlow r1.15 and the Keras package on Ubuntu 16.04 to implement the proposed ensemble CNN. The application programming interface (API) of the Keras package was used to train, validate, and test the models in the experiments.
To evaluate the performance of the proposed method in detecting drowsy driving, we ran a hold-out validation study. The collected dataset was randomly separated into training, validation, and testing datasets with a ratio of 6:2:2. In other words, the data from all participants were pooled and then divided into the three subsets.
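The 6:2:2 hold-out split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `holdout_split`, the fixed seed, and the integer segment ids are assumptions introduced for the example.

```python
import random


def holdout_split(samples, ratios=(6, 2, 2), seed=0):
    """Shuffle the samples and cut them into train/validation/test parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed only for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train_part = items[:n_train]
    val_part = items[n_train:n_train + n_val]
    test_part = items[n_train + n_val:]  # remainder goes to the test set
    return train_part, val_part, test_part


# Hypothetical pool of 1000 windowed segments drawn from all participants.
segments = list(range(1000))
train_set, val_set, test_set = holdout_split(segments)
print(len(train_set), len(val_set), len(test_set))  # 600 200 200
```

Because the segments are pooled before shuffling, windows from the same participant can appear in all three subsets, which matches the split across all participants described in the text.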
The performance of the proposed network was evaluated in terms of accuracy, precision, recall, and F1-score. The accuracy was defined as $(TP + TN)/(TP + TN + FP + FN)$, and the precision was defined as $TP/(TP + FP)$. The recall was $TP/(TP + FN)$, and the F1-score was calculated as the harmonic mean of the precision and recall.
The definitions of TP, TN, FP, and FN are as follows:
True Positive (TP): Actual state is drowsy driving and inferred as drowsy driving.
True Negative (TN): Actual state is normal driving and inferred as normal driving.
False Positive (FP): Actual state is normal driving but inferred as drowsy driving.
False Negative (FN): Actual state is drowsy driving but inferred as normal driving.
Accuracy is the percentage of correct detections over both the drowsy driving and normal driving cases and reflects the overall correctness of the network's predictions. Precision is the fraction of cases detected as drowsy driving that were actually drowsy driving; a high precision means that the detection result is reliable. Recall is the fraction of actual drowsy driving cases that were detected as drowsy driving; a high recall means that the network rarely misses drowsy driving. In general, recall and precision trade off against each other, and the F1-score considers both simultaneously. The parameters used in the experiment are shown in Table 4.
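As a worked illustration of the four measures, the sketch below computes them from confusion-matrix counts using the standard definitions. The counts are hypothetical and are not results from this study, and the helper name `binary_metrics` is an assumption made for the example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one positive prediction
    and at least one actual positive).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts: drowsy driving is the positive class.
acc, prec, rec, f1 = binary_metrics(tp=90, tn=85, fp=15, fn=10)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# accuracy=0.8750 precision=0.8571 recall=0.9000 f1=0.8780
```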
4.6. Experiment Result for the Proposed Ensemble CNN
The performance was compared according to the number of subnetworks in the proposed network.
Table 7 and Table 8 show the normalized confusion matrices for each proposed network architecture. The network with
and
showed the worst performance, with an 86.90% accuracy and an 87.07% F1-score. In the case of
and
, the accuracy was 90.81% and the F1-score was 90.67%. When
and
, the accuracy was 93.38% and the F1-score was 93.34%. As the number of subnetworks that were included in the network increased, the accuracy and F1-score tended to increase.
This trend was maintained even as the number of subnetworks further increased. In the case of and , the accuracy was 93.97%, and the F1-score was 93.95%. In the case of and , the accuracy was 94.07%, and the F1-score was 94.05%. When and , the best performance was achieved with an accuracy of 94.20% and an F1-score of 94.18%. However, we confirmed that, after and , the rate of increase in the performance decreased, and the performance did not increase significantly even when the parameters became larger than and .
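The diminishing returns can be made concrete by differencing the six reported scores for the symmetric configurations. The short sketch below uses only the accuracy and F1-score values quoted in the text, in the order in which they are reported (increasing number of subnetworks); the variable names are illustrative.

```python
# Reported scores (%) for the six symmetric configurations, in reported order.
accuracy = [86.90, 90.81, 93.38, 93.97, 94.07, 94.20]
f1_score = [87.07, 90.67, 93.34, 93.95, 94.05, 94.18]


def step_gains(values):
    """Gain in percentage points from each configuration to the next."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


accuracy_gains = step_gains(accuracy)
f1_gains = step_gains(f1_score)
print(accuracy_gains)  # [3.91, 2.57, 0.59, 0.1, 0.13]
print(f1_gains)        # [3.6, 2.67, 0.61, 0.1, 0.13]
```

The first two steps each add more than 2.5 percentage points, whereas the last two add roughly a tenth of a point each.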
As the number of subnetworks increased, both the training time and the inference time increased. When and , it took about 18.6 h to train the model, and when and , it took about 37.8 h. Similarly, when and , the inference time excluding feature calculation was 228 ms, and when and , it was about 459 ms.
Table 8 shows the normalized confusion matrices for the cases in which the model structure parameters are asymmetric. Even in the asymmetric case, the performance increased with the number of subnetworks, just as it did when the subnetworks specialized for LD detection and those specialized for SD detection were increased to the same number. When
and
,
, and
, accuracies of 89.59%, 91.04%, and 91.76% were recorded, respectively. Conversely, when
was fixed and
,
, and
, accuracies of 88.28%, 91.81%, and 92.15% were recorded, respectively.
We confirmed that, when L = 2 and S = 6, the accuracy was 92.36% and the F1-score was 92.35%. In the case of L = 3 and S = 5, the accuracy was 92.92% and the F1-score was 92.83%. When the parameter values are swapped, the performance changes only slightly. When L = 6 and S = 2, the accuracy was 92.69% and the F1-score was 92.71%. When L = 5 and S = 3, the accuracy was 92.71% and the F1-score was 92.95%.
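To quantify how small the change is, the sketch below compares each asymmetric configuration with its mirror image, using only the four accuracy and F1-score pairs quoted above. The dictionary layout and variable names are illustrative.

```python
# (L, S) -> (accuracy %, F1-score %), as reported in the text.
results = {
    (2, 6): (92.36, 92.35),
    (3, 5): (92.92, 92.83),
    (6, 2): (92.69, 92.71),
    (5, 3): (92.71, 92.95),
}

deltas = {}
for l, s in [(2, 6), (3, 5)]:
    acc_a, f1_a = results[(l, s)]
    acc_b, f1_b = results[(s, l)]  # mirrored configuration
    deltas[(l, s)] = (round(acc_b - acc_a, 2), round(f1_b - f1_a, 2))
    print(f"L={l},S={s} -> L={s},S={l}: "
          f"accuracy {deltas[(l, s)][0]:+.2f}, F1 {deltas[(l, s)][1]:+.2f}")
# L=2,S=6 -> L=6,S=2: accuracy +0.33, F1 +0.36
# L=3,S=5 -> L=5,S=3: accuracy -0.21, F1 +0.12
```

All four differences are below 0.4 percentage points, and their signs are not uniform across the two pairs.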
Figure 10 is a graph that summarizes the changes in accuracy and F1-score according to the changes in the model structure parameters
L and
S mentioned above. In each graph, the size of the dot represents the size of the model and the label represents the parameter value. A square dot means a case where
, and a triangular dot means a case where
. Finally, the circular dot represents the case where
.
When the number of subnetworks in the model structure parameter is less than 8, the performance increases rapidly as the number of subnetworks increases. In the symmetric case (L = S), the performance gain was larger than in the cases with an asymmetric number of subnetworks.
In contrast, when the number of subnetworks is 8 or more, the performance changes little as more subnetworks are added; we assume that, by that point, the ensemble technique has sufficiently reduced the variance of the network.
Even when the number of subnetworks is asymmetric, it has been shown that the performance increases as the number of subnetworks increases. It was confirmed that the case of
has a better performance than the opposite case. According to the analysis results in Section 2, the collected dataset has more LD data than SD data, and both were segmented using the same window size (w). Therefore, we presume that detecting LD well is more advantageous than detecting SD well.
Finally, the performances of the traditional machine learning techniques, other deep learning networks, and the previous studies were compared.
Table 9 shows the results of the comparison between the machine learning techniques and the other networks. We used
as the architecture parameters of the proposed network. The deep neural network (DNN) used for the performance comparison stacked five fully connected (FC) layers, with output sizes of 128, 256, 256, 64, and 2. The last layer used a softmax activation function to produce the detection result.
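To give a sense of the size of this DNN baseline, the sketch below counts the trainable parameters of the five FC layers and shows the softmax applied at the output. The layer sizes come from the text; the input dimension of 30 is a hypothetical value chosen only for illustration, since the feature dimension is not stated in this section.

```python
import math

LAYER_SIZES = [128, 256, 256, 64, 2]  # output sizes given in the text
INPUT_DIM = 30                        # hypothetical feature dimension


def dense_params(input_dim, layer_sizes):
    """Weights plus biases for each fully connected layer."""
    counts = []
    fan_in = input_dim
    for units in layer_sizes:
        counts.append(fan_in * units + units)
        fan_in = units
    return counts


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


params = dense_params(INPUT_DIM, LAYER_SIZES)
print(params, sum(params))  # [3968, 33024, 65792, 16448, 130] 119362

# Two output logits (normal, drowsy) mapped to class probabilities.
probs = softmax([1.0, 3.0])
print([round(p, 4) for p in probs])  # [0.1192, 0.8808]
```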
The one-dimensional CNN (1D-CNN) consists of two 1D convolution layers and two FC layers. The convolution layers use 64 and 128 filters, respectively, with a kernel size of 2 and a stride of 1. The two FC layers have output sizes of 64 and 2, and the last layer uses the softmax activation function. We designed the long short-term memory (LSTM) network with three consecutive LSTM layers followed by two FC layers. Each LSTM layer had 300 units, and the FC layers had output sizes of 64 and 2. The random forest model, a traditional machine learning technique, used 100 estimators.
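The layer settings above determine the output lengths and parameter counts of the 1D-CNN and LSTM baselines once the input shape is fixed. The sketch below works through that arithmetic under assumptions that are not stated in the text: a window of 300 samples, 3 input channels, no padding in the convolutions, and the standard LSTM formulation with bias terms.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding (assumed here)."""
    return (length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Standard LSTM layer: four weight sets (input, forget, and output
    gates plus the cell candidate), each with input weights, recurrent
    weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


WINDOW = 300   # hypothetical number of samples per window
CHANNELS = 3   # hypothetical number of sensor channels

# 1D-CNN: filters 64 and 128, kernel size 2, stride 1 (from the text).
length, in_ch = WINDOW, CHANNELS
cnn_layers = []
for filters in (64, 128):
    length = conv1d_output_length(length, kernel_size=2, stride=1)
    cnn_layers.append((length, conv1d_params(in_ch, filters, kernel_size=2)))
    in_ch = filters
print(cnn_layers)  # [(299, 448), (298, 16512)]

# LSTM: three stacked layers of 300 units each (from the text).
lstm_layers = []
in_dim = CHANNELS
for _ in range(3):
    lstm_layers.append(lstm_params(in_dim, 300))
    in_dim = 300
print(lstm_layers)  # [364800, 721200, 721200]
```

With these assumed inputs, the three LSTM layers alone hold roughly 1.8 million parameters, about two orders of magnitude more than the two convolution layers.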
The comparison showed that the proposed network outperformed the other networks on all the performance measures. Notably, the LSTM network, whose architecture is well suited to time-series data such as the sensor data used in this study, outperformed the DNN, CNN, and random forest models; however, it still performed worse than the proposed network.
We also performed comparisons with previous studies. Previous studies [27,29,31,40] used traditional machine learning-based methods, whereas Arefnezhad et al. [33] proposed a CNN–LSTM network to detect drowsy driving.
As shown in Table 10, SWA data are used in many studies, and pedal pressure data are also used to detect drowsy driving [29,31,33,40]. Although not used in this paper, in-vehicle sensor data such as lateral acceleration [33] and vehicle speed [27] are also used. In addition, the deviation of the vehicle position in a lane [27,33,40] or PERCLOS [27] is used to improve the detection performance.
McDonald et al. [31] collected approximately 108 h of data from 72 participants. The data in McDonald et al. [31] were segmented into 60 s windows and down-sampled to 1 Hz. Zhang et al. [27] collected about 27 h of data from 27 participants on a circular highway track. The collected data in Zhang et al. [27] were segmented into 600 s non-overlapping windows. Arefnezhad et al. [33] collected data from 13 participants in a driving simulation of a monotonous highway at a sampling rate of 100 Hz; the data were used as input to the network without any preprocessing. Krajewski et al. [40] collected about 11 h of data from 12 participants through simulations and segmented the collected data into segments of 240 s. Unlike other studies, Li et al. [29] collected about 14 h of data from real vehicles driven by 6 participants. The collected data in Krajewski et al. [40] were sampled at a 60 s interval.
In Li et al. [29], an approximate entropy feature was created from SWA data, and in Zhang et al. [27], the average and standard deviation were calculated from vehicle speed and lateral position. Similarly, we extracted the approximate entropy feature from brake pedal pressure for SD detection.
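For readers unfamiliar with approximate entropy (ApEn), the sketch below implements the standard definition in plain Python. It is an illustration, not the feature extraction code used in this study: the embedding dimension m = 2 and the tolerance r = 0.2 times the standard deviation are common default choices assumed here, and the two test signals are synthetic stand-ins for a pedal pressure window.

```python
import math
import random
import statistics


def approximate_entropy(series, m=2, r=None):
    """Approximate entropy ApEn(m, r) of a 1D sequence.

    Template vectors of length m (and m + 1) are compared with the
    Chebyshev distance; self-matches are counted, as in the standard
    definition, so the logarithm is always defined.
    """
    n = len(series)
    if r is None:
        r = 0.2 * statistics.pstdev(series)  # assumed default tolerance

    def phi(dim):
        count = n - dim + 1
        vectors = [series[i:i + dim] for i in range(count)]
        total = 0.0
        for a in vectors:
            matches = sum(
                1 for b in vectors
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


# Synthetic signals: a perfectly regular pattern and Gaussian noise.
regular = [0.0, 1.0] * 50
rng = random.Random(0)
noisy = [rng.gauss(0.0, 1.0) for _ in range(200)]

apen_regular = approximate_entropy(regular)
apen_noisy = approximate_entropy(noisy)
print(apen_regular < apen_noisy)  # True
```

A regular, predictable signal yields an ApEn close to zero, whereas an irregular one yields a clearly larger value, which is what makes the measure usable as a feature for distinguishing driving states.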
We also compared the accuracy with the reported results of previous studies [27,29,31,40]. Additionally, the network proposed in Arefnezhad et al. [33] was implemented, and its performance was evaluated on the collected dataset. According to Table 10, the proposed method showed superior performance compared to the previous studies. This suggests that the characteristics for the duration of drowsy driving used in this study are very effective. For example, in [40], a variety of features were extracted from several domains and used; however, the performance was relatively low, with an accuracy of 86.10%.
In Arefnezhad et al. [33], the achieved accuracy was 95.05%, but when the network was applied to our collected dataset, it was 84.95%. We speculate that this gap is due to the various road conditions and the implementation constraints included in the simulation. As mentioned above, the dataset in Arefnezhad et al. [33] was collected on a monotonous highway at a sampling rate of 100 Hz, whereas the dataset in our study was collected in more complex road environments at a sampling rate of 30 Hz. In addition, the lateral deviation data used in Arefnezhad et al. [33] did not exist in our dataset; thus, only the SWA, SWV, yaw rate, and lateral acceleration data were used.