In the search for effective sound event detection, acoustic modeling and feature adjustment emerge as essential elements to guarantee the reliability and precision of the proposed algorithm. These fundamental steps make it possible to generate data resembling real-world conditions and facilitate a thorough evaluation of the algorithm’s performance.
3.3.3. Performance Evaluation
In the detailed performance evaluation of the sound event detection algorithm, several aspects are analyzed to understand the effectiveness and limitations of this work. The confusion matrix of
Table 6 evaluates the performance of the sound event detection algorithm on Data Set 1. Each row represents the actual category of sound events, while each column represents the category predicted by the algorithm. The categories are “Screams”, “Voices”, “Traffic”, “Bell”, “Whistle”, and “No Event”. For “Screams”, 850 events were correctly detected as “Screams”. However, there were some classification errors: 20 “Screams” events were misclassified as “Voices”, 5 as “Traffic”, 5 as “Bell”, 10 as “Whistle”, and 25 as “No Event”. Similarly, the algorithm achieved high accuracy in detecting “Voices” events, with 880 correctly classified as “Voices”. However, there were errors: 10 “Voices” events were misclassified as “Screams”, 15 as “Traffic”, 20 as “Bell”, 5 as “Whistle”, and 30 as “No Event”.
In the “Traffic” category, the algorithm correctly identified 750 “Traffic” events. However, it made mistakes, including 5 “Traffic” events incorrectly classified as “Screams”, 10 as “Voices”, 5 as “Bell”, 30 as “Whistle”, and 20 as “No Event”. For the “Bell” category, the algorithm correctly detected 900 events of this type. However, errors were observed, such as 5 “Bell” events misclassified as “Screams”, 15 as “Voices”, 5 as “Traffic”, 10 as “Whistle”, and 20 as “No Event”. In the “Whistle” category, the algorithm accurately detected 850 such events, with a few errors: 8 “Whistle” events were incorrectly classified as “Screams”, 5 as “Voices”, 12 as “Traffic”, 6 as “Bell”, and 15 as “No Event”.
The algorithm also correctly identified the absence of sound events in the “No Event” category, with 950 cases correctly classified as such. It did make mistakes, though, including 30 “No Event” cases misclassified as “Screams”, 25 as “Voices”, 10 as “Traffic”, 15 as “Bell”, and 20 as “Whistle”.
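As a sanity check on these figures, the per-class precision and recall implied by the Table 6 counts can be computed directly from the matrix. The sketch below simply transcribes the counts reported above (rows = actual class, columns = predicted class); it is illustrative rather than part of the original evaluation code.

```python
import numpy as np

# Confusion matrix for Data Set 1, transcribed from the text
# (rows = actual class, columns = predicted class).
labels = ["Screams", "Voices", "Traffic", "Bell", "Whistle", "No Event"]
cm = np.array([
    [850,  20,   5,   5,  10,  25],  # Screams
    [ 10, 880,  15,  20,   5,  30],  # Voices
    [  5,  10, 750,   5,  30,  20],  # Traffic
    [  5,  15,   5, 900,  10,  20],  # Bell
    [  8,   5,  12,   6, 850,  15],  # Whistle
    [ 30,  25,  10,  15,  20, 950],  # No Event
])

recall = np.diag(cm) / cm.sum(axis=1)     # per-class recall (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision (column-wise)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:9s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

For instance, “Screams” yields a recall of 850/915 and a precision of 850/908, both above 0.92, consistent with the error counts discussed above.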
In the confusion matrix of
Table 7, the algorithm’s performance on Data Set 2 is evaluated. As in the previous matrix, each row represents the actual category of sound events, and each column represents the category predicted by the algorithm. The algorithm correctly detected 920 “Screams” events as “Screams”. There were errors, such as 15 “Screams” events misclassified as “Voices”, 10 as “Traffic”, 8 as “Bell”, 5 as “Whistle”, and 20 as “No Event”. The algorithm achieved high accuracy in the “Voices” category, correctly identifying 930 “Voices” events. There were a few errors, though, such as 5 “Voices” events being misclassified as “Screams”, 20 as “Traffic”, 10 as “Bell”, 15 as “Whistle”, and 25 as “No Event”. The algorithm correctly detected 900 “Traffic” events as “Traffic”. However, errors were observed, such as 8 “Traffic” events incorrectly classified as “Screams”, 12 as “Voices”, 5 as “Bell”, 20 as “Whistle”, and 15 as “No Event”.
For the “Bell” category, the algorithm correctly detected 950 such events. However, errors were observed, such as 5 “Bell” events misclassified as “Screams”, 8 as “Voices”, 5 as “Traffic”, 8 as “Whistle”, and 10 as “No Event”. In the “Whistle” category, the algorithm accurately detected 920 such events, with a few errors: 12 “Whistle” events were incorrectly classified as “Screams”, 5 as “Voices”, 10 as “Traffic”, 5 as “Bell”, and 10 as “No Event”. Finally, the algorithm also correctly identified the absence of sound events in the “No Event” category, with 960 cases correctly classified as such. However, it made errors, including 35 “No Event” cases misclassified as “Screams”, 30 as “Voices”, 10 as “Traffic”, 12 as “Bell”, and 15 as “Whistle”.
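The two matrices can be compared at a glance through their overall accuracy, i.e., the fraction of all events falling on the diagonal. The sketch below transcribes the counts reported above for both data sets (assuming, as the surrounding discussion indicates, that Table 6 corresponds to Data Set 1 and Table 7 to Data Set 2):

```python
import numpy as np

# Confusion matrices transcribed from the text (rows = actual, cols = predicted),
# class order: Screams, Voices, Traffic, Bell, Whistle, No Event.
cm1 = np.array([
    [850,  20,   5,   5,  10,  25],
    [ 10, 880,  15,  20,   5,  30],
    [  5,  10, 750,   5,  30,  20],
    [  5,  15,   5, 900,  10,  20],
    [  8,   5,  12,   6, 850,  15],
    [ 30,  25,  10,  15,  20, 950],
])
cm2 = np.array([
    [920,  15,  10,   8,   5,  20],
    [  5, 930,  20,  10,  15,  25],
    [  8,  12, 900,   5,  20,  15],
    [  5,   8,   5, 950,   8,  10],
    [ 12,   5,  10,   5, 920,  10],
    [ 35,  30,  10,  12,  15, 960],
])

def overall_accuracy(cm):
    """Fraction of events whose predicted class matches the actual class."""
    return np.trace(cm) / cm.sum()

acc1 = overall_accuracy(cm1)  # Data Set 1
acc2 = overall_accuracy(cm2)  # Data Set 2
```

The computation shows Data Set 2 edges out Data Set 1 in overall accuracy (roughly 0.94 versus 0.93), matching the lower error counts reported for the second matrix.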
In false positive/negative analysis, a detailed analysis of the false positives and false negatives identified during the sound event detection algorithm tests was performed. These qualitative analyses provide a deeper understanding of the specific categories that may present challenges for accurate detection and offer insights into potential improvement strategies.
False positives:
Category: Screams
- False Positives: 20 in Data Set 1, 15 in Data Set 2.
- Description: the algorithm tends to classify some “Voices” and “Bell” events as “Screams”. This can be attributed to the similarity in acoustic characteristics between these sounds, such as high pitches and rapid changes in amplitude.
- Improvement Strategies: including more specific spectral characteristics could be considered to distinguish “Screams” from similar categories. Adjusting decision thresholds based on these features could also reduce false positives.
Category: Traffic
- False Positives: 5 in Data Set 1, 8 in Data Set 2.
- Description: the algorithm sometimes confuses “Whistle” and “Bell” events with “Traffic”. This could be due to high-frequency components in these sounds that resemble traffic noise.
- Improvement Strategies: incorporating contextual information, such as sound duration and energy, could help better distinguish “Traffic” from other categories. Adjusting the parameters for detecting frequency peaks could also reduce these false positives.
False Negatives:
Category: Voices
- False Negatives: 10 in Data Set 1, 5 in Data Set 2.
- Description: “Voices” events are sometimes misclassified as “Screams” or “Bell”, resulting in false negatives for the “Voices” category.
- Improvement Strategies: the differentiation between “Voices” and similar categories could be improved by considering prosodic and intonation characteristics. Tuning the voice segmentation algorithms could also help reduce these false negatives.
Category: Whistle
- False Negatives: 12 in Data Set 1, 10 in Data Set 2.
- Description: the algorithm sometimes has difficulty distinguishing “Whistle” from other high-frequency categories such as “Bell” and “Screams”.
- Improvement Strategies: including features that capture the characteristic modulations of a “Whistle” and adjusting the event detection parameters could improve the accuracy in this category.
In the learning curve of Data Set 1, the following is observed:
Precision: as the size of the training set increases, the algorithm’s precision tends to increase gradually and then level off around 92%.
Recall: recall increases as larger training sets are used, reaching around 88%.
F1-score: the F1-score increases markedly as the training set grows and then stabilizes around 90%, consistent with the precision and recall values above.
In the learning curve of Data Set 2, the following is observed:
Precision: as in the first data set, precision improves steadily as the training set grows, reaching around 91%.
Recall: recall gradually increases as larger training sets are used, reaching approximately 85%.
F1-score: the F1-score in Data Set 2 follows a similar trend to Data Set 1, rising and stabilizing around 88%.
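Learning curves of this kind are typically produced with a cross-validated sweep over training-set sizes. The sketch below uses scikit-learn’s `learning_curve` with synthetic features and a generic classifier as stand-ins, since the paper’s actual features and model are not restated here; it illustrates the methodology, not the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-ins for the (unavailable) feature matrix and labels:
# 1000 examples, 20 acoustic features, 6 event classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 6, size=1000)

# Sweep five training-set sizes with 5-fold cross-validation, scoring
# macro F1, mirroring the precision/recall/F1 curves discussed above.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="f1_macro",
)
train_mean = train_scores.mean(axis=1)  # average over folds per size
val_mean = val_scores.mean(axis=1)
```

Plotting `train_mean` and `val_mean` against `sizes` yields curves of the shape described: rising with training-set size and then flattening.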
Algorithm execution time is a crucial aspect, especially in applications requiring fast, real-time processing. Therefore, the average time the algorithm takes to process an audio signal and generate results was evaluated on the different data sets, and these times were compared against real-time requirements. In Data Set 1, extensive tests were performed to measure the execution time of the algorithm on a variety of audio signals. The average processing time per signal was approximately 12 milliseconds, including feature extraction, classification, and output generation.
In Data Set 2, similar tests were carried out to assess execution time. In this case, the average processing time per audio signal was around 10 milliseconds, even though this data set contained a more complex mix of sound events.
The runtime results show that the algorithm meets real-time requirements in many environments. For example, in real-time monitoring applications where timely detection of sound events is essential, such as security systems or anomaly detection, the recorded processing times are low enough to allow effective deployment. The algorithm’s applicability in a given environment ultimately depends on application-specific runtime constraints. Although the algorithm performs satisfactorily in terms of time in most cases, extremely time-sensitive environments, such as high-speed real-time response systems, may require additional optimizations.
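A minimal sketch of this timing methodology follows. Here `process_signal` is a placeholder for the full pipeline (feature extraction, classification, and output generation), and the 50 ms latency budget is a hypothetical value; the acceptable budget is application-specific.

```python
import time

# Placeholder for the full detection pipeline; the real pipeline
# (feature extraction + classification) is not reproduced here.
def process_signal(signal):
    return sum(signal) / len(signal)  # stand-in workload

# Twenty 1-second clips at an assumed 16 kHz sampling rate.
signals = [[0.0] * 16000 for _ in range(20)]

start = time.perf_counter()
for s in signals:
    process_signal(s)
elapsed_ms = (time.perf_counter() - start) * 1000 / len(signals)

# Hypothetical per-signal latency budget for a real-time application.
REALTIME_BUDGET_MS = 50
meets_realtime = elapsed_ms < REALTIME_BUDGET_MS
```

Averaging wall-clock time over many signals, as above, is what yields per-signal figures such as the 12 ms and 10 ms reported in the text.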
The learning curve graph presented in
Figure 3 offers insight into the relationship between the size of the training set and the accuracy of the sound event detection algorithm. As the size of the training set increases, both the training and validation accuracy show an increasing trend. Initially, with a smaller training set (around 100 samples), the training accuracy is relatively low, suggesting the model has difficulty capturing the complexities of the sound events. However, as the training set expands, the training and validation accuracy gradually increase, indicating an improvement in the algorithm’s generalizability.
Notably, the gap between the training and validation accuracy curves tends to narrow as the training set grows, suggesting the model achieves a better balance between fitting the training data and generalizing to new data. Beyond a certain point, however, around 800 training samples, increasing the training set has only a marginal impact on accuracy. This could indicate that the model has already captured most of the relevant features of the sound events in the data, so additional data does not yield significant accuracy gains. In summary, the learning curve plot shows how the size of the training set influences the performance of the sound event detection algorithm, guiding informed decisions on the trade-off between computational cost and accuracy.
Several specialized techniques were applied during the training and validation phases to improve the accuracy of the algorithm and address the specific challenges of the data set:
Reweighting of Classes: given the imbalance observed in the number of sound events per class, a class-reweighting scheme was introduced. For example, emergency alarms, which represented only 2% of the total data set, were reweighted with a factor of 5, while general conversations, which represented 50% of the data set, were reweighted with a factor of 0.8. This ensured that classes with more samples did not disproportionately influence the model.
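Such a scheme can be expressed as a per-class weight table applied to the training loss. The sketch below is illustrative: the class names and the `weighted_loss` helper are hypothetical, with only the factors (5 and 0.8) taken from the text.

```python
# Per-class weights as described above: the rare class is boosted,
# the dominant class is down-weighted, others keep the default.
class_weights = {
    "emergency_alarm": 5.0,       # only 2% of the data set
    "general_conversation": 0.8,  # 50% of the data set
}

def weighted_loss(per_sample_losses, labels, weights, default=1.0):
    """Scale each sample's loss by its class weight before averaging."""
    scaled = [loss * weights.get(lbl, default)
              for loss, lbl in zip(per_sample_losses, labels)]
    return sum(scaled) / len(scaled)

# Usage: two samples with unit loss, one from each reweighted class.
avg = weighted_loss([1.0, 1.0],
                    ["emergency_alarm", "general_conversation"],
                    class_weights)
```

Most classification libraries accept an equivalent weight table directly (e.g., a class-weight parameter), so the helper above is only a transparent illustration of the effect.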
Additionally, data augmentation techniques were used to improve the robustness of the model:
Speed variation: recordings were adjusted to 90% and 110% of their original speed, resulting in a 20% increase in the data set.
Pitch shift: the recordings were altered by ±2 semitones, adding another 20% to the set.
Background noise: variants of recordings with low, medium, and high ambient noise levels were introduced, increasing the set by 15%.
These techniques expanded the data set by 55%, allowing the model to train with more examples.
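Two of the augmentations above, speed variation and background noise, can be sketched with NumPy alone; pitch shifting by ±2 semitones would typically rely on a dedicated audio library. This is a minimal illustration under those assumptions, not the authors’ implementation:

```python
import numpy as np

def change_speed(signal, factor):
    """Resample to `factor` x original speed via linear interpolation
    (0.9 and 1.1 are the factors used in the text)."""
    n_out = int(len(signal) / factor)
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

def add_noise(signal, snr_db, rng=np.random.default_rng(0)):
    """Add white noise at a given signal-to-noise ratio in dB,
    standing in for the low/medium/high ambient-noise variants."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), len(signal))

# 1 s of a 440 Hz tone at an assumed 16 kHz sampling rate.
x = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
fast = change_speed(x, 1.1)   # ~10% shorter recording
slow = change_speed(x, 0.9)   # ~11% longer recording
noisy = add_noise(x, 10)      # medium noise level (10 dB SNR)
```

Applying each variant to every recording is what expands the data set by the percentages quoted above.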
Detection Threshold Adjustment: initially, a detection threshold of 0.5 was established. However, after initial testing, it was observed that a threshold of 0.65 maximized the balance between precision and recall, reducing false positives by 10% while maintaining adequate recall.
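This tuning step can be reproduced as a simple sweep that keeps the candidate threshold maximizing F1. The sketch below uses synthetic scores and labels; only the idea of moving the operating point from 0.5 toward a better precision/recall balance is taken from the text.

```python
import numpy as np

def f1_at_threshold(scores, labels, threshold):
    """Binary F1 when events are declared at score >= threshold."""
    preds = scores >= threshold
    tp = np.sum(preds & labels)
    fp = np.sum(preds & ~labels)
    fn = np.sum(~preds & labels)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic detector outputs: ~30% positive events, positives scored higher.
rng = np.random.default_rng(0)
labels = rng.random(500) < 0.3
scores = np.clip(labels * 0.5 + rng.random(500) * 0.6, 0, 1)

# Sweep candidate thresholds (0.05 steps) and keep the F1-maximizing one.
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_at_threshold(scores, labels, t))
```

On real validation scores, the same sweep would surface an operating point such as the 0.65 reported above whenever it beats the default 0.5 on F1.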