4.2. Proposed CNN Pipeline Results
We augmented the FSC22 dataset using two combinations of data augmentation techniques, as described in Section 3.3. The first combination involved time stretch and pitch shift only, while the second added GWN to pitch shift and time stretch. Data augmentation was followed by feature extraction, for which we employed the three feature extraction techniques described in Section 3.3.
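For illustration, a minimal sketch of the three augmentation operations using librosa is shown below; the stretch rate, pitch-shift steps, and noise level are illustrative values for the example, not the exact settings of Section 3.3.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, rate: float = 1.1, n_steps: int = 2,
            noise_std: float = 0.005) -> list[np.ndarray]:
    """Produce augmented variants of a waveform: time stretch, pitch shift,
    and additive Gaussian white noise (GWN). Parameter values are illustrative."""
    stretched = librosa.effects.time_stretch(y, rate=rate)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    noisy = y + np.random.normal(0.0, noise_std, size=y.shape).astype(y.dtype)
    return [stretched, shifted, noisy]
```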
Table 3 shows the performance of each CNN with each combination of preprocessing techniques. For every CNN except MobileNet-v3-small, the best-performing model was obtained with time-stretch and pitch-shift augmentation combined with the mixed-spectrogram approach. The best accuracies with mixed spectrograms were 97.47%, 99.22%, 98.88%, 96.33%, and 98.65% for AlexNet, DenseNet-121, Inception-v3, EfficientNet-v2-B0, and ResNet-50, respectively. MobileNet-v3-small achieved its best accuracy of 98.27% with Mel-spectrogram feature extraction.
Figure 6 shows the accuracies of the different CNNs for the two augmentation methods, in each case using the feature extraction technique that achieved the maximum accuracy for that augmentation. Augmentation with only pitch shift and time stretch outperformed augmentation with pitch shift, time stretch, and Gaussian noise addition in all models. We attribute this to excessive augmentation amplifying the peculiarities of the dataset and to the heavy distortion introduced into the audio by the added noise.
Table 4 presents the attributes of these selected models. AlexNet had the lowest inference time at 2.668 ms, while DenseNet-121, the most accurate of the selected best-performing models, had the highest. Notably, the model size of MobileNet-v3-small was considerably smaller than that of the other models.
Although MobileNet-v3-small favored a different feature extraction technique than the rest of the models, the accuracy gap between its best-performing model and its model trained with mixed spectrograms was small. To compare compression approaches under identical training conditions and without bias, and following the majority of the models, we selected the models trained with mixed spectrograms for the downstream tasks in the experiment.
Figure 7 graphically represents this process of selecting the best augmentation and feature extraction techniques and proceeding to the subsequent steps of the workflow.
Subsequently, these models obtained from mixed spectrograms were subjected to compression following the compression pipeline described in Section 3.3. We applied three compression techniques to the best-performing models, namely 8-bit quantization, weight pruning, and filter pruning, as shown in Figure 7.
Table 5 shows the model details when the models are compressed with 8-bit quantization. MobileNet-v3-small reached the best trade-off, with a model size of 1.2 MB and an accuracy of 95.28%; Inception-v3 obtained the highest accuracy of 96.41%, but at a vastly larger parameter count.
The number of parameters does not change when a model is compressed using 8-bit quantization, as quantization only rescales the model weights and biases. The accuracies of AlexNet, ResNet-50, DenseNet-121, and EfficientNet-v2-B0 were significantly reduced, owing to the high connectivity between layers in these architectures and the large dense layers present in AlexNet. These filters and layers are highly affected because they tend to be noisy when no weight pruning or filter pruning has been applied. The model sizes are approximately four times smaller than those of the base models, as expected, since 8-bit integers require a quarter of the storage of 32-bit floats.
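As a concrete illustration of why the parameter count stays fixed while storage shrinks roughly fourfold, the following minimal sketch shows symmetric per-tensor 8-bit quantization in NumPy; it is illustrative only and not the exact quantization scheme of the pipeline.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    """Map a float32 weight tensor onto the int8 grid [-127, 127] with a
    single per-tensor scale. The number of parameters is unchanged; only
    the storage per parameter drops from 32 bits to 8 bits."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for inference."""
    return q.astype(np.float32) * scale

# Example: a layer's weights occupy one quarter of the original storage.
w = np.random.randn(512, 256).astype(np.float32)   # 512 KiB at 32 bits
q, scale = quantize_symmetric_int8(w)               # 128 KiB at 8 bits
print(np.abs(w - dequantize(q, scale)).mean())       # quantization error
```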
Simultaneously, weight pruning was applied to the selected models, and Table 6 presents their performance at different pruning ratios. Weight pruning using norm-based magnitude criteria removed most of the insignificant learned parameters, sparsifying the weight matrices without significantly affecting the classification accuracy. However, model performance declines drastically when the pruning ratio is too high, without much benefit from other factors such as a reduction in inference time, as explained by [49]. Weight pruning did not reduce the number of parameters or the Floating-Point Operations (FLOPs) of a model, and since it introduces sparse matrices, the resulting models are not well suited for deployment on most edge devices. Here, the weight-pruning ratio, also known as the pruning sparsity, indicates the percentage of weights that are set to zero at the end of the weight-pruning process. We evaluated weight-pruning ratios starting at 80% and increased the ratio until a significant drop in model accuracy was observed.
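For illustration, a minimal sketch of magnitude-based unstructured weight pruning using PyTorch's pruning utilities is given below; the framework, layer selection, and default ratio are assumptions for the example rather than the exact implementation used here.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def weight_prune(model: nn.Module, ratio: float = 0.8) -> nn.Module:
    """Zero out the `ratio` fraction of smallest-magnitude weights in every
    Conv2d and Linear layer. Parameter count, FLOPs, and tensor shapes are
    unchanged; the weight matrices simply become sparse."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=ratio)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor
    return model
```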
When the best-performing models were subjected to filter pruning based on a magnitude-based norm filter-importance criterion, it was observed that different models behave differently, as shown in Table 7. This is largely due to the architectural differences, filter connectivity, and branching of channels in the models. Since AlexNet has two very large dense layers with 1024 channels each, these layers are heavily pruned, resulting in a massive accuracy reduction. Most of the parameters of MobileNet-v3-small are confined to several large Conv2D layers, and filter pruning these layers resulted in a significant accuracy drop. The parameter count and the number of FLOPs were reduced, as expected, by removing the unimportant filters; hence, the model sizes were reduced as well. However, due to the extremely sparse filter connectivity of the Inception-v3 architecture, very few layers were completely pruned, resulting in a minimal size reduction of the filter-pruned model compared to the base model. The inference times of the models that exhibit dense connectivity among layers, such as AlexNet, DenseNet-121, and MobileNet-v3-small, all decreased or remained relatively stable. The inference times of the models with sparse and residual connectivity, such as Inception-v3 and ResNet-50, increased. This is due to the removal of residual and branching connections between layers, which makes the resulting model essentially a densely connected network, heavily compromising and disrupting the intended model architecture and its inference capabilities. Furthermore, because the fused mobile inverted bottleneck (Fused-MBConv) is the main building block of EfficientNet-v2-B0, its inference times were adversely affected, reflecting the heavy impact of filter pruning on the residuals of the Fused-MBConv bottleneck layer and the squeeze-and-excite (SE) optimization [38].
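For illustration, the sketch below ranks the filters of a single standard (non-grouped) Conv2d layer by the magnitude of their weights and rebuilds the layer with only the top-ranked filters; propagating the kept indices to subsequent layers, handling branches, and the exact importance criterion are omitted and assumed for the example.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.7):
    """Rebuild a Conv2d with only its highest-magnitude filters (groups == 1
    assumed). Returns the smaller layer and the kept filter indices, which
    must be used to slice the input channels of the following layer."""
    n_keep = max(1, round(conv.out_channels * keep_ratio))
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))       # per-filter magnitude
    keep = torch.argsort(scores, descending=True)[:n_keep].sort().values
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         dilation=conv.dilation, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
    return new_conv, keep
```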
Figure 8 shows the effect of the pruning level on the accuracy and the parameter count. Evidently, models with a smaller number of parameters are adversely affected by channel pruning while gaining only minimal reductions in parameter count and model size.
After applying each of the compression approaches individually, we selected the best-performing weight-pruned model from each CNN architecture and applied filter pruning followed by quantization. The best weight-pruned models were obtained at a pruning ratio of 0.8 for DenseNet-121, Inception-v3, and MobileNet-v3-small, 0.9 for AlexNet and ResNet-50, and 0.95 for EfficientNet-v2-B0. This hybrid pruning approach yielded substantially better results than relying on weight pruning or filter pruning alone, as shown in Table 8. The prior application of weight pruning acted as a regularization technique: removing the unimportant weights produced a more generalized model, which allowed the filter pruning algorithm to execute efficiently without overfitting and to remove the filters with the least effect on the model while maintaining higher accuracies. The parameter count, FLOPs, and model size were reduced compared to both the base model and the previously pruned models. However, the inference times of the hybrid-pruned models remained roughly equal to those obtained with filter pruning alone; these observations can be explained by the same architectural factors discussed above for filter pruning of the base models. A filter pruning level of 0.7 produced the best model in every CNN architecture, considering the trade-off between accuracy and the other factors.
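A minimal sketch of this hybrid ordering, weight pruning followed by filter pruning, is shown below using PyTorch's pruning utilities. Note that these utilities only zero out the pruned entries, so physically removing filters (and thus reducing parameters and FLOPs) additionally requires rebuilding the affected layers as sketched earlier; the default ratios and the norm order are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def hybrid_prune(model: nn.Module, weight_ratio: float = 0.9,
                 filter_ratio: float = 0.7) -> nn.Module:
    """Apply unstructured weight pruning first, then structured (filter)
    pruning computed on the surviving weights of each Conv2d layer."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=weight_ratio)
            prune.ln_structured(module, name="weight", amount=filter_ratio,
                                n=1, dim=0)   # dim 0 = output filters
            prune.remove(module, "weight")    # merge both masks into the tensor
    return model
```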
The best-performing weight- and filter-pruned models, all of which were obtained at a filter-pruning ratio of 0.7, were then quantized with 8-bit quantization. Table 9 shows their performance after quantization.
The accuracy decline after 8-bit quantization of the pruned models was negligible, in contrast to the sharp drops observed when quantizing the base models. This is because filter pruning removes noisy filters from the model, narrowing the value ranges of the weights and activations and thereby reducing the total quantization error. These are the best models to deploy on edge devices for forest sound classification, considering their performance and resource requirements. MobileNet-v3-small is suitable for forest monitoring applications where real-time event detection is paramount and the flash memory of the edge device is limited, while ResNet-50 achieves excellent accuracy while keeping its model size and inference time relatively small compared to the other models.
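A toy numerical illustration of this effect is given below: a handful of hypothetical large-magnitude (noisy) weights inflate the quantization scale, so removing them yields a finer int8 grid and a smaller error on the remaining weights. The values are synthetic and not taken from the models.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=10_000).astype(np.float32)   # typical weights
outliers = np.array([1.5, -1.8], dtype=np.float32)                 # hypothetical noisy values

def int8_error(x):
    """Mean absolute error after symmetric 8-bit quantization."""
    scale = np.abs(x).max() / 127.0
    x_hat = np.clip(np.round(x / scale), -127, 127) * scale
    return float(np.abs(x - x_hat).mean())

print(int8_error(np.concatenate([weights, outliers])))  # coarse grid set by the outliers
print(int8_error(weights))                                # finer grid once they are pruned away
```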
Figure 9 displays the evolution of the accuracy, FLOPs, and sizes of the models through different stages of the compression pipeline.
When the performance of these CNNs, which were originally designed for images, is compared with that of ACDNet, which was designed for compression and edge deployment, it is evident that the efficacy of image-based CNNs relies heavily on their architecture. As depicted in Figure 10, CNN architectures with high compressibility achieve a smaller model size while maintaining commendable accuracy. Among the selected CNNs, ResNet-50 attains the highest accuracy at 97.28%, but its model size of 4.1 MB raises doubts regarding deployment on extremely resource-constrained edge devices. Conversely, MobileNet-v3-small emerges as the optimal choice, with an accuracy of 87.95% and a compact model size of 0.24 MB.
Correspondingly, the compressed ACDNet achieves an accuracy of 85.64% with a model size of 0.484 MB. Although MobileNet-v3-small outperforms the compressed ACDNet on these metrics, it is crucial to note that ACDNet processes raw audio and performs feature extraction within its convolutional layers using fewer parameters, whereas the additional cost of extracting mixed spectrograms for MobileNet-v3-small is not accounted for in this comparison. Hence, both models are suitable for edge deployment, each with its own minor trade-offs.