1. Introduction
Environmental sound recognition is a widely used technique for identifying various sound events in surveillance or monitoring systems based on the acoustic environment. Several investigations have been carried out with different techniques in the context of forest monitoring systems designed to protect forest reserves. For example, prior studies have experimented with different sound classification approaches for the recognition of various species and possible forest threats such as illegal logging, poaching, and wildfire [1,2,3,4,5]. In such systems, environmental sounds are captured, processed using a modelling algorithm, and classified into different sound classes.
With technical advancements, sound classification approaches have evolved from machine learning (ML) models such as K-nearest neighbours (KNN) [3,6,7], XGBoost [8,9], Gaussian mixture modelling (GMM) [5,10], and support vector machines (SVM) [6,11,12] to deep learning (DL). Deep neural networks (DNNs) such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) require a larger quantity of labelled data than ML models to produce promising results. Hence, when using DL-based approaches, a well-balanced and rich dataset of relatively large size is essential, as performance keeps improving with dataset quality.
Although several studies have been carried out in the forest acoustic monitoring context, a standard benchmark dataset specific to forest sounds is still unavailable. Therefore, most of the existing studies have utilized publicly available environmental sound datasets such as ESC-50 [4,13,14,15,16,17], UrbanSound8K (U8K) [14,18,19,20,21], FSD50K [22,23], and SONYC-UST [24,25]. These datasets contain a large quantity of audio data categorized into several groups covering a broad range of sound events. However, only a limited number of their classes can be used for forest environment sound classification, and most of the data are irrelevant for this domain. Since significant resources need to be spent to extract data from these datasets and to annotate the data points according to a suitable taxonomy, the direct use of public datasets for the classification model is ineffective.
Additionally, some studies have utilized datasets such as BIRDZ [26,27] and the xeno-canto archive [28,29,30], which contain only bird sounds. The xeno-canto archive is an open audio collection dedicated to sharing bird sounds, and BIRDZ is a control audio dataset derived from the xeno-canto archive that contains a subset of 11 bird species. As such datasets contain audio data specific to a single sound category, several of them would need to be combined in a forest sound classification system. Moreover, several researchers have experimented with private datasets due to the unavailability of forest-specific sound datasets. For instance, in such studies, sound sensors were deployed in a forest environment and the recorded sound events were used to create a dataset according to the study requirements [6,31,32]. In contrast, some studies have created datasets using audio clips collected from online sound repositories such as Freesound [3,5,11]. A closer look at the literature shows that the forest acoustic monitoring domain suffers from certain shortcomings, including the lack of a standard taxonomy and the unavailability of a public benchmark dataset. These limitations motivated us to introduce a new dataset for the domain. Accordingly, the novelty of this paper is to present a standard dataset for forest sound classification and to provide a comprehensive overview of the procedure for creating and validating the dataset. Addressing the current research gaps, we introduce FSC22 [33], a novel benchmark dataset for the acoustic-based forest monitoring domain. It contains 2025 audio clips, each 5 s long, originating from the online audio database Freesound. All sound events are categorized into 6 major classes, which are further divided into 34 subclasses. For the initial phase of dataset composition, 27 subclasses were selected, and 75 audio samples were collected per class. Each audio clip was manually annotated and verified to ensure the quality of the dataset. The key contributions of this paper can be summarized as follows.
Introduces a novel public benchmark dataset consisting of forest environmental sounds, which can be utilized for acoustic-based forest monitoring.
Presents a comprehensive description of the methodology used for dataset creation, including data acquisition from Freesound, filtering, validation, and normalization.
Explains the baseline models used for the sound classification and the selection criteria for those models.
Provides a detailed evaluation of the dataset using human classification, ML-based, and DL-based classification.
Presents a comprehensive discussion of the results obtained with the proposed FSC22 dataset and compares them with the publicly available datasets.
We have created the FSC22 dataset and made it freely available to support and motivate future researchers in this domain [33]. We expect that this dataset will help research communities to better understand forest acoustic surveillance and experiment within the domain. The rest of the paper is structured as follows.
Section 2 explores the related datasets used in previous research. Section 3 provides an overview of the taxonomy of the proposed dataset. Section 4 introduces the FSC22 dataset, including the data collection methodology and its importance to the acoustic domain. Section 5 provides a comprehensive description of the baseline-model-based dataset evaluation approach. Section 6 describes the experiments conducted on the dataset, namely human classification and baseline-model-based classification, with the results and observations. Finally, Section 7 concludes the paper.
2. Related Work
Seminal contributions have been made to the ESC context in recent years. Among those, several instances of research carried out for forest acoustic monitoring can be identified. Forest acoustic monitoring is crucial as it provides a firm basis of evidence for decisions on conserving forest coverage and species. However, due to the unavailability of a comprehensive forest-specific sound dataset, most of the previous research on forest monitoring was conducted using common environmental sound datasets or private datasets created according to the study requirements. This section provides an overview of the publicly available environmental sound datasets and other datasets utilized by previous researchers in this domain. To the best of our knowledge, there is no forest-specific sound dataset in the literature.
Among the available datasets, ESC-50 [34] is a frequently used environmental sound dataset for forest acoustic monitoring. For instance, Andreadis et al. [4] utilized ESC-50 to detect illegal tree-cutting and identify animal species. ESC-50 is a dataset consisting of 2000 environmental audio clips under 50 classes of common sound events. It contains 40 recordings, each 5 s long, per class, extracted from Freesound.
Figure 1 shows a section of the ESC-50 dataset taxonomy emphasizing forest-specific sounds. Moreover, U8K [35] is another popular dataset used in many studies on audio-based monitoring systems [18,36]. U8K is a subset of the main Urban Sound dataset and contains 8732 labelled sound clips of urban sounds from 10 classes. The classes of this dataset were drawn from the urban sound taxonomy [37], and all the recordings were extracted from Freesound.
Figure 2 includes the part of the U8K dataset taxonomy most relevant to the forest environment sound domain. FSD50K [38] is an open dataset of human-labelled sound events. It consists of over 51K audio clips totalling over 100 h of audio, manually labelled using 200 classes drawn from the AudioSet ontology [39]. The three above-mentioned datasets were created using audio extracted from the Freesound project, a public audio repository that contains more than 500,000 audio clips.
Moreover, SONYC-UST [40] is another quality dataset, in which data are grouped into 8 main classes and further divided into 23 fine-grained classes. It can be considered a more realistic dataset, as it was created using audio data acquired from acoustic sensors deployed in New York City.
Figure 3 shows a part of the SONYC-UST dataset taxonomy highlighting the audio classes specific to forest monitoring and surveillance. AudioSet [41] is another audio event dataset, including over 2M tracks from YouTube videos. Each 10 s clip is annotated using over 500 sound classes derived from the AudioSet ontology [39]. The main concern with AudioSet is that it cannot be considered an open dataset due to copyright issues and the terms-of-service constraints of YouTube. In addition, as the clips were collected from YouTube, some may be of poor quality and can disappear after a certain time due to privacy issues or copyright claims.
Table 1 presents a summary of the existing environmental sound datasets.
Additionally, several other domain-specific datasets have been reported in prior studies on environmental sound observatory systems. For bird sound identification studies, the xeno-canto archive [43], a bird-sound-sharing portal, was used to acquire the audio data essential for the experiments [28,30,44]. The BIRDZ dataset, a real-world audio dataset built from the xeno-canto archive, was also used in the related literature [45,46]. Similarly, the BirdCLEF dataset, which consists of 62,902 audio files and is publicly available on Kaggle, has been used in prior studies [47]. As all these datasets are specific to a certain sound class, a combination of several such datasets is required when developing a complete forest monitoring system.
Many researchers have experimented with private datasets created according to their requirements, due to the scarcity of forest-specific sound datasets. Such datasets are generated using audio data acquired from online sound repositories, audio recorded by acoustic sensors, or a combination of both. Mporas et al. [3] created a chainsaw sound dataset, including background noises such as rain and wind, using sounds acquired from freely available sound repositories. Ying et al. [11] experimented with an animal sound recognition system for which the required animal sounds were acquired from Freesound. In contrast, Assoukpou et al. [6] combined chainsaw sounds recorded by acoustic sensors deployed in three different forest areas with other sounds acquired from online websites to create a dataset for identifying chainsaw sounds.
Accordingly, many environmental sound classification studies have utilized the datasets mentioned above with different sound classification approaches. In most of the studies, CNN models were adopted as a firm basis for prominent audio classification models [20,36,48]. Moreover, there are instances where ML algorithms were utilized for audio classification [49]. One of the key distinctions when choosing between DL and ML is the availability of high volumes of well-labelled data: DL algorithms scale with the data, continually increasing in performance, whereas ML plateaus at a certain level of performance when more data are added.
Table 2 shows an overview of DL and ML approaches deployed for sound classification using the ESC-50 and U8K datasets.
3. FSC22 Taxonomy
Prominent research efforts in the forest acoustic classification domain have been based either on subsets of already established public datasets such as ESC-50 and U8K or on small self-made datasets. Thus, the requirement for a well-defined dataset dedicated to forest acoustics can be identified. As the first step of creating a benchmark dataset, a standard taxonomy that can showcase and capture all the different acoustic scenarios present in forest ecosystems needs to be established.
At the parent level of the proposed taxonomy, all the acoustic scenarios were classified into six classes: mechanical sounds, animal sounds, environmental sounds, vehicle sounds, forest threat sounds, and human sounds. Further, each class was divided into subclasses that capture specific sounds falling under the main category. For example, under the main class mechanical sounds, four subclasses were identified, namely axe, chainsaw, handsaw, and generator. This subdivision aimed to introduce specific class labels and prevent the usage of generalized labels such as tree cutting, animal roar, etc.
Figure 4 presents the complete forest sound taxonomy developed as a base for the creation of the FSC22 dataset. Further, it showcases the complete subdivision of the 6 main classes into 34 subclasses. We selected only 27 subclasses for the FSC22 dataset, ignoring the 7 subclasses shown in blue, due to the unavailability of a sufficient number of sound clips in Freesound. Though all the left-out classes had more than 200 search results on the Freesound platform, most of the audio clips were artificially generated or included unnecessary noise, making them unsuitable for inclusion in the FSC22 dataset.
The proposed taxonomy aimed to cover two main objectives. The first objective was to completely cover fundamental acoustic scenarios such as chainsaw sounds, tree felling, and wildfire, which are extensively used in research works. The second objective was to provide high-quality, normalized audio under unambiguous class labels. To fulfil the first objective, we extensively analysed the related literature that utilized forest acoustics and identified the most essential and frequent types of acoustic phenomena that should be available in a benchmark dataset, as explained in Section 4. It should be noted that the proposed taxonomy is not fixed; over time, more related acoustic classes under forest acoustics should be added while refining the taxonomy to achieve saturation.
5. Methods and Technical Implementation
For ESC, both ML and DL approaches have been used extensively in the related literature. Therefore, we provide classification experiments covering both paradigms: an extreme gradient boosting (XGBoost)-based experiment for the ML approach and a CNN-based experiment for the DL approach. These models were used as the baseline models.
5.1. Feature Engineering
Feature engineering is a principal requirement for a successful ML pipeline. Studies in the audio classification domain emphasize the need for advanced feature-engineering techniques, such as the use of spectrograms to represent audio samples in the time and frequency domains [4,6,10,17,52] and audio augmentation techniques to prevent overfitting of the prediction algorithm [13,14,46,53,54], to obtain state-of-the-art classification performance. This section provides an overview of the feature-engineering techniques followed in the proposed experiments, as shown in Figure 8.
5.1.1. Considered Datasets
As described in Section 4.3, quality audio data are scarce in the forest acoustics domain; thus, a benchmark dataset that could be used for comparison against the proposed FSC22 dataset could not be identified in the related literature. The ESC50 dataset, a benchmark dataset in the ESC domain, was therefore used to compare the performance of the FSC22 dataset. For the study, 2000 audio recordings, each of 5 s duration, distributed into 50 unique classes were selected from the ESC50 dataset, and 2025 audio recordings, each of 5 s duration, distributed into 27 unique classes were selected from the FSC22 dataset.
5.1.2. Data Augmentation Technique
Data augmentation is an important step in the feature-engineering phase, artificially expanding the data samples available for training and testing ML and DL algorithms. Especially with DL approaches, models suffer from overfitting when the quantity of available training data is small [55]. For the proposed experiments, positive pitch shifting and negative pitch shifting, in which the pitch of audio recordings is increased and decreased by two steps, respectively, were utilised [56]. The pitch shifting was implemented with the pitch_shift function provided by the librosa.effects module for Python.
As a result, from each original audio sample, two new augmented audio samples were created, increasing the quantity of available data. In summary, due to the augmentation with pitch shift, the number of audio samples from ESC50 increased to 6000, while the FSC22 dataset increased to 6075 audio samples. For both datasets, 80% of the audio samples were used for training the model, while 20% were used for validating the performance of the trained model, following the Pareto principle, which states that in most general cases, 80% of the effects come from 20% of the causes.
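A minimal sketch of this augmentation step, assuming 5 s clips loaded at 22,050 Hz and the two-step pitch shifts described above; the helper name augment_clip and the file-handling details are illustrative, not part of the original implementation.

```python
import librosa

def augment_clip(path, sr=22050, n_steps=2):
    """Load one clip and return the original plus two pitch-shifted copies."""
    y, sr = librosa.load(path, sr=sr)                                # waveform at 22,050 Hz
    up = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)      # pitch raised by two steps
    down = librosa.effects.pitch_shift(y, sr=sr, n_steps=-n_steps)   # pitch lowered by two steps
    return [y, up, down]                                             # one sample becomes three
```

Applying this to every clip triples the number of samples, after which the 80/20 train/validation split described above is performed.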
5.1.3. Feature Extraction
In the audio classification domain, the general practice is to use spectrograms, which represent an audio signal in both the time and frequency domains, as the feature extraction mechanism. The mel spectrogram (MEL) [20,57] and the mel frequency cepstral coefficients (MFCC) [3,10], the two most utilized spectrogram representations, were used to extract the features for this study. The spectrograms were extracted from the raw audio data using the melspectrogram and mfcc functions provided by the librosa.feature module. With both functions, each audio file was sampled into overlapping frames, and for each frame, the mel coefficients or mel frequency cepstral coefficients were calculated. The calculated coefficients were returned as a two-dimensional array of shape (number of coefficients × number of frames). As a further improvement, the coefficients of the mel spectrograms were converted from the power scale to the decibel scale.
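A minimal sketch of this extraction step using librosa, assuming the 22,050 Hz sample rate and the frame settings mentioned later in this section; the number of mel bands (128) and MFCC coefficients (40) are illustrative assumptions, not values reported in the study.

```python
import librosa
import numpy as np

def extract_features(path, sr=22050, n_fft=2048):
    """Return a decibel-scaled mel spectrogram and an MFCC matrix for one clip."""
    y, _ = librosa.load(path, sr=sr)

    # Mel spectrogram: shape (n_mels x frames), converted from power to decibel scale.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: shape (n_mfcc x frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft, n_mfcc=40)
    return mel_db, mfcc
```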
As shown in Figure 8, ML-based classification techniques generally utilize one-dimensional features. Therefore, the dimensionality of the created spectrograms had to be reduced before they were used with the XGBoost model. This was achieved by aggregating the feature vectors extracted for the overlapping frames into a single vector by taking their mean value. For the DL-based classification, an image-like representation of the features resembling the RGB format was required. Hence, for each audio sample, three spectrograms were created by changing the length of the window used for framing. The created spectrograms had windowing lengths of 93 ms, 46 ms, and 23 ms, achieved by keeping the sample rate at 22,050 Hz and setting the n_fft parameter to 2048, 1024, and 512, respectively.
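The two feature layouts can be sketched as follows; the trimming of the three spectrograms to a common frame count before stacking is an assumption made here for illustration, since the exact image construction is not detailed in the text.

```python
import librosa
import numpy as np

def ml_feature_vector(spec):
    """Collapse a (coefficients x frames) spectrogram into one vector for XGBoost."""
    return spec.mean(axis=1)                       # mean over the overlapping frames

def dl_feature_image(y, sr=22050):
    """Stack three mel spectrograms (n_fft = 2048, 1024, 512) into an RGB-like array."""
    channels = []
    for n_fft in (2048, 1024, 512):                # ~93 ms, ~46 ms, ~23 ms windows at 22,050 Hz
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft)
        channels.append(librosa.power_to_db(mel, ref=np.max))
    # Assumed step: trim to a common frame count before stacking as three channels.
    frames = min(c.shape[1] for c in channels)
    return np.stack([c[:, :frames] for c in channels], axis=-1)
```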
5.2. Machine-Learning-Based Classification
The related literature exploring the automated classification of acoustic phenomena that are abundant in forest ecosystems has utilized different ML algorithms to carry out the classification task. Among such efforts, ML algorithms such as KNN, SVM, and random forests can commonly be identified. Due to the superiority of the extreme gradient boosting (XGBoost) algorithm over such traditional ML algorithms, this study explored the usability of XGBoost for the proper classification of forest acoustics.
XGBoost is capable of handling nonlinear relationships in the features. This is important in sound classification, as there are many nonlinear relationships between the sound features and the class labels. Moreover, XGBoost has the ability to learn from the errors made by previous trees. Additionally, XGBoost uses L1 and L2 regularization, which is important for reducing overfitting.
The XGBoost library available for Python was used to conduct the tests, and the model parameters were used to fine-tune the performance of the implemented model. In the final set of parameters, num_class was set to 27, the multiclass classification error rate was used as the eval_metric, subsample, colsample_bytree, and min_child_weight were set to 1, max_depth was 6, learning_rate was 0.3, and 100 estimators were used. Further, to improve the memory efficiency and training speed of the XGBoost model, both the training and validation datasets were converted to the internal data structure (DMatrix) used by the model, which is optimized for both memory efficiency and training speed. The configured model was then trained with 80% of the considered dataset, and the evaluation was completed with the remaining 20% of the data using the trained XGBoost model.
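A minimal sketch of this configuration using the xgboost Python package; the placeholder feature arrays stand in for the mean-aggregated spectrogram vectors, and the multi:softmax objective is an assumed choice, since the text only specifies the multiclass error rate as the evaluation metric.

```python
import numpy as np
import xgboost as xgb

# Placeholder arrays standing in for the mean-aggregated spectrogram features.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1620, 128)), rng.integers(0, 27, 1620)   # 80% split
X_valid, y_valid = rng.normal(size=(405, 128)), rng.integers(0, 27, 405)     # 20% split

# Parameters reported in the text; the softmax objective is an assumption.
params = {
    "objective": "multi:softmax",
    "num_class": 27,
    "eval_metric": "merror",        # multiclass classification error rate
    "max_depth": 6,
    "learning_rate": 0.3,
    "subsample": 1,
    "colsample_bytree": 1,
    "min_child_weight": 1,
}

# DMatrix is XGBoost's internal structure, optimized for memory and training speed.
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

booster = xgb.train(params, dtrain, num_boost_round=100,   # 100 estimators
                    evals=[(dvalid, "validation")])
predictions = booster.predict(dvalid)
```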
5.3. CNN-Based Classification
Although a substantial number of studies have used ML-based algorithms to classify unstructured data such as audio and images, DL-based models can outperform traditional ML models by considerable margins due to their ability to extract features from raw data [58]. For this study, a convolutional-neural-network-based [14,59] model consisting of 9 layers was utilized, based on the work of the authors of [36].
As in the ML-based approach, 80% of the data were used to train and fine-tune the CNN model, while the remaining 20% were used for the validation procedure. The model was configured to run for 50 epochs; however, an early stopping callback was used to stop the model from overfitting to the training data. The implementation of the model was completed using the Keras library provided by TensorFlow [60]. Figure 9 presents the architecture of the model accompanied by the parameters used to implement it with the Keras library.
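Since the exact nine-layer configuration is given in Figure 9, the following is only an illustrative Keras sketch of the training setup described above (50 epochs with early stopping on the validation loss); the layer sizes, patience value, optimizer, and input shape are assumptions, not the authors' settings.

```python
import tensorflow as tf

def build_model(input_shape, num_classes=27):
    """Illustrative convolutional stack; the actual nine-layer design is shown in Figure 9."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model(input_shape=(128, 216, 3))    # assumed shape of the stacked spectrogram image
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping guards against overfitting within the 50-epoch budget.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)

# model.fit(train_images, train_labels, validation_data=(valid_images, valid_labels),
#           epochs=50, callbacks=[early_stop])
```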
8. Conclusions
Environmental sound classification (ESC) using artificial intelligence is a prominent research area in audio recognition. Under ESC, forest sound classification (FSC), which focuses on identifying artificial and natural phenomena observable in forest ecosystems, has received considerable research interest. The recognition of forest sounds enables highly valuable use cases in scenarios such as illegal logging, poaching, and wildfires. FSC suffers from the unavailability of a standard sound taxonomy and of a sufficiently large public benchmark dataset. With the intention of resolving both issues, this study presented the FSC22 taxonomy and the first version of the FSC22 dataset, which consists of 2025 human-annotated, 5 s long audio recordings equally distributed into 27 unique classes. The authors intend to expand the first version of the FSC22 dataset in the future, capturing more acoustic classes according to the FSC22 taxonomy. Further, the study presented CNN-based and XGBoost-based classification experiments using the FSC22 dataset. The CNN-based approach achieved a maximum classification accuracy of 92.59%, while the XGBoost model achieved a maximum accuracy of 62.71%. A survey in which 25 human participants identified sounds from the classes listed in the FSC22 dataset was also conducted to establish a baseline accuracy score. Finally, the authors believe that the proposed FSC22 taxonomy, the created FSC22 V1.0 dataset, the experiments conducted, and the discussions provided through this study will support future research work in the FSC domain.