Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion
Abstract
1. Introduction
2. Related Works
3. Two-Stream CNN with Decision-Level Fusion
3.1. Feature Extraction and Combination
3.2. Structure of the MCNet and LMCNet
MCNet and LMCNet adopt the same layer structure, organized as follows:
- (1) The first layer uses 32 convolution kernels with a 3 × 3 receptive field and a stride of 1 × 1, and batch normalization is performed. The Rectified Linear Unit (ReLU) is used as the activation function.
- (2) The second layer uses the same settings as the first: 32 convolution kernels with a 3 × 3 receptive field and a 1 × 1 stride, followed by batch normalization and ReLU activation. The difference is that the second layer applies 2 × 2 max-pooling to reduce the dimensionality of the feature maps.
- (3) The third layer uses 64 convolution kernels with a 3 × 3 receptive field and a 1 × 1 stride, with batch normalization followed by ReLU activation.
- (4) The fourth layer uses 64 convolution kernels with a 3 × 3 receptive field and a 1 × 1 stride; batch normalization is performed and the activation function is ReLU.
- (5) The fifth layer is a fully connected layer with 1024 hidden units; the activation function is the sigmoid.
- (6) The output layer has ten units, matching the number of classes in the datasets, followed by the softmax activation function (a minimal code sketch of the full architecture is given below).
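To make the layer-by-layer description concrete, here is a minimal Keras sketch of one stream. MCNet and LMCNet share this structure and differ only in the aggregated input features they receive; the 60 × 41 × 1 input shape and the "same" padding are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the four-convolutional-layer stream (MCNet / LMCNet).
# The input shape is an illustrative assumption for a single-channel
# time-frequency feature patch.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_stream_cnn(input_shape=(60, 41, 1), num_classes=10):
    inputs = layers.Input(shape=input_shape)
    # (1) 32 kernels, 3x3 receptive field, 1x1 stride, batch norm, ReLU
    x = layers.Conv2D(32, 3, strides=1, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # (2) same settings, plus 2x2 max-pooling to shrink the feature maps
    x = layers.Conv2D(32, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    # (3) 64 kernels, 3x3, stride 1x1, batch norm, ReLU
    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # (4) 64 kernels, 3x3, stride 1x1, batch norm, ReLU
    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # (5) fully connected layer with 1024 hidden units, sigmoid activation
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="sigmoid")(x)
    # (6) ten output units with softmax, one per class
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_stream_cnn()
model.summary()
```

Calling `model.summary()` lets the reader compare parameter counts against the layer-wise comparison table later in this section, although the exact numbers depend on the true input shape.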
3.3. Dempster–Shafer Evidence Theory-Based Information Fusion
A basic probability assignment (mass function) $m$ over the frame of discernment $\Theta$ must satisfy two conditions (a minimal fusion sketch follows this list):
- (1) $\sum_{A \subseteq \Theta} m(A) = 1$, which means that the masses assigned to all subsets $A$ of $\Theta$ sum to 1.
- (2) $m(\emptyset) = 0$, which indicates that the mass function cannot allocate any value to the empty set. A mass function with this characteristic is called a normalized mass function.
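To illustrate how these conditions are used at the decision level, the sketch below applies Dempster's rule of combination to the softmax outputs of the two streams, treating each class as a singleton hypothesis whose mass is the network's predicted probability. Using softmax probabilities directly as basic probability assignments is a simplifying assumption for this sketch, and the function name is hypothetical.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two normalized mass vectors defined over singleton
    hypotheses (one mass per class) using Dempster's rule."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    # Only identical singletons have a non-empty intersection,
    # so the joint mass for class i is m1(i) * m2(i).
    agreement = m1 * m2
    # Conflict K: total mass assigned to incompatible class pairs.
    conflict = 1.0 - agreement.sum()
    if conflict >= 1.0:
        raise ValueError("Total conflict: the two sources are incompatible")
    # Normalize by (1 - K) so the fused masses sum to 1 again,
    # satisfying condition (1); the empty set keeps zero mass (condition 2).
    return agreement / (1.0 - conflict)

# Example: fuse the two streams' softmax outputs for one audio clip.
p_mcnet = np.array([0.70, 0.20, 0.10])
p_lmcnet = np.array([0.60, 0.30, 0.10])
fused = dempster_combine(p_mcnet, p_lmcnet)
print(fused, fused.argmax())  # fused masses sum to 1; argmax is the decision
```

Note how the rule sharpens agreement: the two streams both favor class 0, and the fused mass on class 0 (about 0.86) exceeds either input probability.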
4. Experiment and Analysis
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
| Layer | 4-Layer Param | 4-Layer Memory | 6-Layer Param | 6-Layer Memory | 8-Layer Param | 8-Layer Memory |
|---|---|---|---|---|---|---|
| input | 0 | 3.5 K | 0 | 3.5 K | 0 | 3.5 K |
| conv1 | 288 | 111.5 K | 288 | 111.5 K | 288 | 111.5 K |
| conv2 | 9.2 K | 111.5 K | 9.2 K | 111.5 K | 9.2 K | 111.5 K |
| conv3 | 18.4 K | 57.8 K | 18.4 K | 57.8 K | 18.4 K | 57.8 K |
| conv4 | 36.8 K | 57.8 K | 36.8 K | 57.8 K | 36.8 K | 57.8 K |
| conv5 | 0 | 0 | 73.7 K | 31 K | 73.7 K | 31 K |
| conv6 | 0 | 0 | 147.5 K | 31 K | 147.5 K | 31 K |
| conv7 | 0 | 0 | 0 | 0 | 294.9 K | 4.6 K |
| conv8 | 0 | 0 | 0 | 0 | 589.8 K | 4.6 K |
| FC-1024 | 15.9 M | 1024 | 8.7 M | 1024 | 4.7 M | 1024 |
| FC-10 | 10.2 K | 10 | 10.2 K | 10 | 10.2 K | 10 |
| Total | 15.9 M | 339.6 K | 8.9 M | 401.6 K | 5.9 M | 413.4 K |
Class abbreviations (UrbanSound8K): ac = air conditioner, ch = car horn, cp = children playing, db = dog bark, dr = drilling, ei = engine idling, gs = gun shot, jh = jackhammer, si = siren, sm = street music.

| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 98.6% | 99.9% | 99.2% | 99.9% |
| ch | 93.9% | 91.4% | 93.2% | 94.2% |
| cp | 97.3% | 93.9% | 96.1% | 97.5% |
| db | 92.6% | 90.4% | 94.2% | 95.3% |
| dr | 94.8% | 95.0% | 95.7% | 97.2% |
| ei | 98.9% | 99.6% | 98.5% | 99.6% |
| gs | 88.6% | 91.1% | 85.9% | 95.4% |
| jh | 93.2% | 95.9% | 91.1% | 97.1% |
| si | 98.6% | 98.3% | 98.5% | 98.9% |
| sm | 95.0% | 97.4% | 94.1% | 96.9% |
| Avg. | 95.2% | 95.3% | 94.6% | 97.2% |
| Model | Mean | N | Std. Deviation | Time Cost |
|---|---|---|---|---|
| LMCNet | 0.9515 | 10 | 0.03121 | 0.023 |
| MCNet | 0.9529 | 10 | 0.03352 | 0.024 |
| MLMC | 0.9465 | 10 | 0.03812 | 0.028 |
| TSCNN-DS | 0.9720 | 10 | 0.01788 | 0.077 |
| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 98.9% | 98.9% | 97.5% | 99.9% |
| ch | 90.2% | 69.4% | 87.9% | 89.2% |
| cp | 94.8% | 91.1% | 93.6% | 96.4% |
| db | 91.3% | 88.0% | 91.6% | 93.1% |
| dr | 93.8% | 90.9% | 91.5% | 95.5% |
| ei | 98.2% | 97.7% | 98.1% | 99.1% |
| gs | 77.2% | 77.2% | 81.7% | 85.1% |
| jh | 92.6% | 91.6% | 93.4% | 97.1% |
| si | 99.0% | 96.1% | 99.0% | 98.9% |
| sm | 94.3% | 92.1% | 92.9% | 94.7% |
| Avg. | 93.0% | 89.3% | 92.7% | 94.9% |
| Class | LMC (LMCNet) | MC (MCNet) | MLMC | TSCNN-DS |
|---|---|---|---|---|
| ac | 94.8% | 91.5% | 93.2% | 98.2% |
| ch | 76.1% | 47.3% | 88.1% | 69.9% |
| cp | 84.0% | 80.9% | 87.9% | 88.0% |
| db | 79.9% | 73.3% | 86.8% | 80.8% |
| dr | 87.8% | 87.4% | 87.0% | 91.6% |
| ei | 96.8% | 94.8% | 95.3% | 97.4% |
| gs | 57.2% | 63.4% | 45.4% | 67.8% |
| jh | 89.8% | 74.7% | 85.9% | 87.6% |
| si | 97.8% | 88.3% | 96.5% | 96.3% |
| sm | 85.3% | 71.8% | 90.3% | 80.3% |
| Avg. | 84.9% | 77.3% | 85.7% | 85.8% |
| Model | Accuracy |
|---|---|
| Stacked 4-layer CNN | 86.4% |
| Stacked 6-layer CNN | 79.8% |
| Stacked 8-layer CNN | 80.1% |