1. Introduction
The occurrence of wildfires often results in significant fatalities. In California, wildfires killed more than 30 people, destroyed more than 8500 structures, and torched 4 million acres of land in the period between January 2020 and mid-October 2020 [1]. The ability to identify wildfires at their early stage is important, as wildfires are notorious for their high speed of spread [2]. However, the task of early detection of wildfires remains a challenging research question [3].
Satellites and both occupied and unoccupied aerial vehicles equipped with infrared sensors or high-resolution cameras have been used to identify wildfires. Previously, Lin et al. [4] proposed a Kalman filter based model running on a system of UAVs to monitor the heat generated by wildfire. Even though not explicitly stated, such a method would require a temperature sensor to measure the heat generated by the wildfire. However, infrared sensors and optical cameras are both susceptible to physical obstacles in the direct line of sight. Wildfires can be classified into three types according to the combustion materials: underground fire, surface fire, and crown fire [5]. An underground fire may not be detectable by cameras at its early stage, whereas an initial surface fire might be obstructed by the leaves and branches of trees. Attempts have therefore been made to identify wildfire according to its acoustic characteristics. An in-depth acoustic spectral analysis of different types of fire has been conducted by Zhang et al. [5].
Previously, Internet of Things (IoT) devices that transmit collected acoustic signals over wireless sensor networks to a centralized data center for wildfire detection have been proposed [5,6]. Specifically, Zhang et al. [5] proposed using LoRa to transmit the collected acoustic signals. Such approaches rely on transmitting the original acoustic signals efficiently and correctly. However, LoRa's transmission rate is bounded between 0.3 and 27 kbps depending on the environment and the transmission protocol [7]. Such limitations could be a significant drawback when transmitting acoustic signals from the forest to a remote computing center in a timely manner. Furthermore, continuously streaming data from the forest to a remote server would result in continuously high power consumption, which is undesirable in remote locations where access to electricity is limited.
In this paper, a machine learning data pipeline derived from the framework proposed by Salamon and Bello [8] is implemented to distinguish wildfire sound from generic forest sounds. In addition, an edge computing device (Raspberry Pi 4) is utilized to run the data pipeline to demonstrate that the proposed pipeline can be deployed on embedded systems in remote locations. The main idea of the original framework is to perform feature engineering with spherical k-means [9], followed by classifying the resulting feature vectors with a classification algorithm. Different from the original framework, an additional PCA decorrelation step is added to prevent the downstream classifier from overfitting. In this work, the support vector machine [10] is the chosen classification algorithm. Even though deep neural networks are considered general function approximators and have been demonstrated to perform well on time-series classification tasks, their expensive computational and memory requirements during inference make them difficult to fit on an embedded system [11,12].
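The classification stage described above (PCA decorrelation followed by an SVM) can be sketched with Scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the pooled spherical k-means feature vectors are replaced by random placeholders, and the sample count and dimensionality are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Placeholder for pooled spherical k-means feature vectors:
# 200 samples, 128-dimensional (both numbers are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = rng.integers(0, 2, size=200)  # 0 = forest, 1 = wildfire

# PCA decorrelation retaining 99% of the variance, followed by an
# RBF-kernel SVM, mirroring the pipeline structure described in the text.
clf = Pipeline([
    ("pca", PCA(n_components=0.99)),
    ("svm", SVC(kernel="rbf", gamma="scale")),
])
clf.fit(X, y)
print(clf.score(X, y))
```

In deployment, the placeholder features would be replaced by the pooled outputs of the spherical k-means feature-engineering stage.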
The main contributions of this work are twofold. First, to the knowledge of the authors, this is the first edge-computing implementation of an audio-based wildfire detection system. Second, the added PCA decorrelation step is demonstrated to enhance the generalization capability of the trained data pipeline. Moreover, the proposed method could be generalized to recognize underwater species, classify bird species, or detect unusual activities in cities [8,13,14,15].
4. Experimental Setup
Sound waves are transformed into log-scaled Mel spectrograms with 40 components covering the audible frequency range (0–22,050 Hz), with a window size and hop size of 5.8 ms (256 samples at 44.1 kHz). The shingle size is set to 16 frames. The explained variance associated with spherical k-means is set to 85%. The optimal number of clusters k is first roughly estimated with the knee point algorithm [18] applied to the distortion score, with candidate k values swept over a range with a step size of 5. (Intuitively, the distortion score is the mean sum of squared distances between each shingled vector and its assigned cluster centroid. However, the PCA instance in spherical k-means transforms the shingled vectors into a latent space in which the cluster centroids are found. Therefore, the distortion score is technically the mean sum of squared distances between each "latent" shingled vector and its assigned cluster centroid.) The actual k values are then selected manually around the estimated ones based on the trained pipeline's performance; for simplicity, the selected k values are subject to an additional constraint. The explained variance of the PCA between pooled vectors and classification is set to 99%. In this work, the support vector machine implemented in Scikit-learn [19] is the chosen classification algorithm. For the RBF kernel, the parameter gamma is set to "scale", which is the default of Scikit-learn at the time of this work. (According to the Scikit-learn documentation, gamma is then calculated from the formula 1/(n_features*X.var()).) In this case, n_features is the dimensionality of the decorrelated vectors and X is the matrix of all the decorrelated vectors in the training dataset [19]. The trained models are serialized into the Open Neural Network Exchange (ONNX) format [20] and are deployed on a Raspberry Pi 4 Model B (2 GB) running Ubuntu Mate 20.04.
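As a quick illustration of the parenthetical note above, the value that gamma="scale" resolves to can be recomputed directly. The feature matrix below is a random placeholder with an illustrative shape, not the actual decorrelated training vectors.

```python
import numpy as np

# Hypothetical decorrelated feature matrix (shape and values illustrative):
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 32))  # 100 training vectors, 32 PCA components

# gamma="scale" in Scikit-learn resolves to 1 / (n_features * X.var()),
# where X.var() is the variance taken over all entries of the matrix.
n_features = X.shape[1]
gamma = 1.0 / (n_features * X.var())
print(gamma)
```

For roughly unit-variance features, as here, this yields a gamma near 1/n_features.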
As suggested by Zhu et al. [21], a smaller latent mixture model trained with a "clean" training dataset could outperform a bigger model trained with a noisy training dataset. In this work, the spherical k-means instance in the data pipeline serves as the latent mixture model and is trained on sound waves in the dataset without data augmentation. However, sound waves streamed from the forest might contain different levels of noise, and small datasets, such as the one used in this work, are notorious for their poor generalization capability. Hence, each sound wave in the training dataset is augmented with white noise with a fixed probability. If the audio is selected for augmentation, the amount of noise to be injected is measured by the signal-to-noise ratio (SNR) and is determined by a sample from a uniform distribution within a prespecified range. The SNR formulation adopted in this work is the standard decibel ratio of signal power to noise power:

SNR_dB = 10 · log10(P_signal / P_noise)
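The augmentation step can be sketched as follows. The augmentation probability and SNR range below are hypothetical placeholders (their exact values are not given in this excerpt), and the SNR is interpreted as the decibel ratio of signal power to noise power.

```python
import numpy as np

def add_white_noise(signal: np.ndarray, snr_db: float, rng) -> np.ndarray:
    """Inject white Gaussian noise so the result has the requested SNR (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)  # 1 s of a 440 Hz tone

# Augment with probability p; SNR drawn uniformly from a prespecified range.
# p, snr_low, and snr_high are illustrative values, not the paper's.
p, snr_low, snr_high = 0.5, 5.0, 30.0
if rng.random() < p:
    wave = add_white_noise(wave, rng.uniform(snr_low, snr_high), rng)
```

Over a one-second waveform, the empirical SNR of the augmented signal closely matches the requested value.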
When deployed in real time at remote locations, the collected sound wave would be a combination of both wildfire and forest sounds. Therefore, the testing wildfire sound is played back with different levels of forest sound in the background to test the pipeline's ability to identify wildfire in this scenario. The loudness of the background forest sound is determined by a mixing factor α. Since the testing wildfire sound is longer than the testing forest sound, the testing wildfire sound is truncated to the same duration. Let w_t denote the amplitude of the wildfire sound at time step t and f_t denote the amplitude of the forest sound at time step t; the amplitude of the mixed sound wave at time step t, denoted m_t, is calculated as follows:

m_t = w_t + α · f_t
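Under an additive reading of the mixing step above (mixed amplitude equal to the wildfire amplitude plus the forest amplitude scaled by the mixing factor), the test-signal construction can be sketched as:

```python
import numpy as np

def mix(wildfire: np.ndarray, forest: np.ndarray, alpha: float) -> np.ndarray:
    """Overlay forest sound, scaled by mixing factor alpha, onto wildfire sound."""
    n = min(len(wildfire), len(forest))  # truncate to the shorter duration
    return wildfire[:n] + alpha * forest[:n]

# Illustrative usage with placeholder waveforms:
rng = np.random.default_rng(0)
wildfire = rng.normal(size=44100 * 2)  # 2 s of stand-in wildfire sound
forest = rng.normal(size=44100)        # 1 s of stand-in forest sound
mixed = mix(wildfire, forest, alpha=0.5)
```

Larger values of alpha correspond to louder background forest sound relative to the wildfire signal.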
5. Results
The average estimated optimal k identified by the knee point algorithm across all 10 validation folds has sample standard deviations of 19.178 and 29.957 for the two experimental settings considered. The elbow plot for cross-validation set 1 is presented in Figure 2. Selected spectrogram features learned by spherical k-means are presented in Figure 3. Note that the estimated optimal k for cross-validation set 1 lies outside one sample standard deviation of the average.
Based on the average estimated optimal k, the proposed data pipeline is trained with 4 different fixed values of k spaced 50 apart around that estimate. The large range is selected due to the high standard deviation observed above. All the experimented classifiers reached 99% training and validation accuracy. However, the best-performing data pipeline can only reach 90% accuracy on the test dataset. The complete classification performance on the test dataset is presented in Table 1. It is observed that a data pipeline combining spherical k-means with an RBF-kernel SVM has the best performance among the 16 experiments executed. The confusion matrix of the best-performing data pipeline on the test dataset is presented in Figure 4. The proportion of time the pipeline identifies that wildfire occurs vs. the mixing factor is presented in Figure 5.
The trained data pipeline is deployed to a Raspberry Pi 4 with 2 GB of RAM running Ubuntu Mate 20.04 with a graphical user interface (GUI). A monitor and a keyboard are connected, while both Wi-Fi and Bluetooth are disabled. The memory profiler reports that the proposed data pipeline steadily consumes 275 MB while streaming and classifying an input sound wave. Across 326 experiments, processing one second of input audio takes 66 milliseconds on average, with a sample standard deviation of 24 milliseconds and a maximum of 169 milliseconds. When the program is not being executed, the Raspberry Pi consumes 2.107 Watts on average. While the program is running, the power consumption of the Raspberry Pi is bounded between 2.91 and 3.59 Watts.
6. Discussion
From the elbow plot shown in Figure 2, it can be observed that the distortion score decays exponentially with respect to the number of clusters k when k is less than the estimated optimal k, and decays linearly when k is larger than the estimated optimal k. Based on this observation, it is believed that the range of k values used to run the knee point algorithm is sufficient. If the range of k values were too small, the experiment might return an elbow plot without such a distinctive pattern or even fail to converge. There are no practical benefits, only increased experiment time, if a larger range of k values is used.
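The elbow-based selection discussed above can be sketched as follows. This is an illustration only: Scikit-learn's KMeans stands in for the spherical k-means instance, a simple farthest-point-from-the-chord rule stands in for the knee point algorithm of [18], and the data and candidate range are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (latent) shingled vectors.
X, _ = make_blobs(n_samples=500, centers=20, n_features=8, random_state=0)

ks = np.arange(5, 55, 5)  # candidate k values with a step size of 5
distortions = np.array([
    KMeans(n_clusters=k, n_init=4, random_state=0).fit(X).inertia_ / len(X)
    for k in ks
])

# Normalize both axes, then take the k farthest below the straight line
# joining the curve's endpoints -- a crude approximation of the knee point.
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (distortions - distortions[0]) / (distortions[-1] - distortions[0])
knee_k = int(ks[np.argmax(np.abs(y - x))])
print(knee_k)
```

On this synthetic data the detected knee lands near the true number of clusters, reproducing the exponential-then-linear decay pattern described above.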
The learned spectrogram features shown in Figure 3 could correspond to a human's experience. In Figure 3a, the vertical stripes might correspond to the sound of branches snapping while being torched by wildfire, whereas the bright horizontal peak at low frequency in Figure 3b might correspond to birds singing in the forest. It is worth noting that patterns such as the vertical stripes become less obvious as the explained variance associated with spherical k-means decreases toward 80%.
Comparing the test performance shown in Table 1 and Table 2, it can be observed that data augmentation makes a significant contribution to the generalization capability of the trained data pipeline. It is also observed that an SVM with an RBF kernel outperforms an SVM with a linear kernel. In particular, the best-performing data pipeline (an SVM with an RBF kernel) demonstrated 99% precision and 85% recall. These metrics have two implications. First, whenever the data pipeline reports that a wildfire occurs, the fire agency can believe there really is a fire with 99% certainty; in other words, the best-performing data pipeline is unlikely to trigger a false alarm. Second, when a wildfire is known to be happening and the data pipeline is deployed, the pipeline is expected to recognize the wildfire 85% of the time, which means the best-performing data pipeline is effective at detecting wildfire.
From the result shown in Figure 5, it is observed that the data pipeline's ability to identify wildfire decays exponentially as louder background forest sound is injected. This observation is expected, since the sound of wildfire with forest sound playing in the background is not included in the training dataset. Future research is needed to address the issue of low-volume wildfire sounds.
The 10% gap between training/validation accuracy and test accuracy suggests that the dataset currently used might not be a comprehensive representation of real-world wildfire sounds. In addition, the current dataset assumes that the trained data pipeline is deployed in a remote area where the input is limited to either wildfire or natural forest sound. The behavior of the proposed data pipeline given, for example, rock music as input is undetermined, not investigated, and beyond the scope of this investigation.
The Raspberry Pi 4 was selected since it was convenient for our preliminary experiment. The theoretical lifespan of the system running on a three-cell, 11.1 Volt, 5000 mAh Li-Po battery (55.5 Wh) can be calculated from the measured power consumption, assuming the microphone consumes no energy and power losses to heat due to voltage conversion are negligible. It is estimated that the lifespan of the system is between 15 h 28 min and 19 h 5 min. To enable prolonged deployment, a solar-based battery-recharging system would need to be developed. Assuming 12 h of sunlight and no more than 50% radiation loss due to foliage or atmospheric ash, a 10 Watt solar panel would be needed.
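The lifespan estimate follows directly from the battery's 55.5 Wh capacity and the measured 2.91–3.59 W power envelope; the sketch below reproduces the stated 15 h 28 min to 19 h 5 min window up to rounding.

```python
# Back-of-the-envelope lifespan estimate under the assumptions stated in the
# text: the microphone draws no power and voltage-conversion losses are zero.
voltage_v = 11.1     # three-cell Li-Po nominal voltage
capacity_ah = 5.0    # 5000 mAh
energy_wh = voltage_v * capacity_ah  # 55.5 Wh of stored energy

# Measured power envelope of the Raspberry Pi 4 while the pipeline runs.
for power_w in (3.59, 2.91):
    hours = energy_wh / power_w
    print(f"{power_w:.2f} W -> {hours:.2f} h")
```

The higher power bound gives roughly 15.5 h of operation and the lower bound roughly 19.1 h.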
Even under optimistic assumptions, it is admitted that the energy requirements of the system might be too strenuous for deployment in forests with dense foliage or continuous ash cover. However, based on the profiled memory consumption, the authors believe the short battery lifespan can be improved by deploying the pipeline on a more compact edge computing device, such as the Raspberry Pi Zero 2 W, which consumes less power. In this way, the size of the solar panel required for sustained operation could be reduced.