Exploring Channel Properties to Improve Singing Voice Detection with Convolutional Neural Networks
Abstract
1. Introduction
2. Methods
2.1. Channel Attention
2.1.1. Motivation
2.1.2. Related Work
- 1. The origin of attention mechanism
- 2. The attention mechanism in the RNNs
- 3. The attention mechanism in the CNNs
2.1.3. Channel Attention Based on a Scaled Dot-Product Module
2.1.4. Channel Attention Based on a Squeeze-and-Excitation Module
2.2. Multi-Scale Input Channels
2.2.1. Motivation
2.2.2. Related Work
2.2.3. Method
3. Proposed CNN Architecture
3.1. Baseline Architecture
3.2. The Improvements
3.3. Proposed CNN Architecture
4. Experiments and Results
4.1. Experimental Setup
4.2. Datasets
4.3. Channel Attention
4.3.1. The Attention Distribution on the Feature Maps
4.3.2. The Experimental Results
4.4. Multi-Scale Channels
4.5. Computation Cost
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Scale | 512 | 1024 | 1536 | 2048 | 2560 | 3072 | 3584 | 4096 |
---|---|---|---|---|---|---|---|---|
Δt (ms) | 23.2 | 46.4 | 69.7 | 92.9 | 116.1 | 139.3 | 162.5 | 185.8 |
Δf (Hz) | 43.1 | 21.5 | 14.4 | 10.8 | 8.6 | 7.2 | 6.2 | 5.4 |
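The Δt and Δf rows of the table above follow from the usual time-frequency trade-off of a short-time analysis window: Δt = N / fs and Δf = fs / N for a window of N samples. The sketch below reproduces the tabulated values under two assumptions that the table itself does not state: that "Scale" is the window length in samples, and that the sampling rate is 22,050 Hz (the value that matches the tabulated numbers). The function name `resolution` is hypothetical.

```python
# Assumption: 22,050 Hz is inferred from the tabulated values, not stated in the table.
SAMPLE_RATE_HZ = 22050


def resolution(window_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (delta_t in ms, delta_f in Hz) for a window of `window_samples` samples."""
    delta_t_ms = 1000.0 * window_samples / sample_rate_hz  # window duration
    delta_f_hz = sample_rate_hz / window_samples            # frequency bin spacing
    return delta_t_ms, delta_f_hz


# The eight scales listed in the table: multiples of 512 up to 4096.
SCALES = [512 * k for k in range(1, 9)]

for n in SCALES:
    dt, df = resolution(n)
    print(f"{n:5d}  dt = {dt:6.1f} ms  df = {df:5.1f} Hz")
```

Doubling the window length doubles Δt and halves Δf, which is why no single scale resolves both short vocal onsets and closely spaced harmonics.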
Dataset | Number of Songs | Total Length (min) | Voice to Non-Voice Ratio |
---|---|---|---|
JMD | 93 | 371 | 1.12 |
RWC | 100 | 407 | 1.55 |
Mir1k | 1000 | 133 | 4.37 |
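The voice to non-voice ratio in the table above can be turned into the fraction of voiced material, which is also the accuracy a trivial "always voice" classifier would reach. The sketch below assumes the ratio is voiced duration divided by non-voiced duration; the table does not define it explicitly, and the names used here are hypothetical.

```python
# Ratios copied from the dataset table (voice : non-voice).
VOICE_RATIO = {"JMD": 1.12, "RWC": 1.55, "Mir1k": 4.37}


def voiced_fraction(ratio):
    """Convert a voice/non-voice ratio r into the voiced share r / (1 + r)."""
    return ratio / (1.0 + ratio)


for name, ratio in VOICE_RATIO.items():
    print(f"{name}: {100 * voiced_fraction(ratio):.1f}% voiced")
```

Under this assumption Mir1k is heavily skewed toward voiced frames, so accuracy on that dataset should be read together with the false positive rate.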
Dataset | Method | Accuracy | F-Measure | Precision | Recall | FPR | FNR |
---|---|---|---|---|---|---|---|
JMD | CNN (µ ± σ, %) | 86.61 ± 1.01 | 86.35 ± 0.93 | 82.35 ± 2.37 | 90.9 ± 2.62 | 17.14 ± 3.1 | 9.1 ± 2.62 |
JMD | SDP-CNN (µ ± σ, %) | 88.78 ± 0.75 | 88.38 ± 0.68 | 85.55 ± 2.07 | 91.49 ± 1.99 | 13.58 ± 2.5 | 8.51 ± 1.99 |
JMD | SE-CNN (µ ± σ, %) | 88.82 ± 0.81 | 88.48 ± 0.72 | 85.21 ± 1.9 | 92.07 ± 1.56 | 14.02 ± 2.26 | 7.93 ± 1.56 |
RWC | CNN (µ ± σ, %) | 88.85 ± 0.68 | 90.62 ± 0.63 | 91.43 ± 1.35 | 89.87 ± 1.84 | 12.67 ± 2.39 | 10.13 ± 1.84 |
RWC | SDP-CNN (µ ± σ, %) | 90.26 ± 0.6 | 91.75 ± 0.54 | 93.17 ± 1.09 | 90.54 ± 1.41 | 9.96 ± 1.82 | 9.59 ± 1.42 |
RWC | SE-CNN (µ ± σ, %) | 91.13 ± 0.52 | 92.49 ± 0.5 | 93.88 ± 0.92 | 91.17 ± 1.45 | 8.92 ± 1.54 | 8.83 ± 1.45 |
Mir1k | CNN (µ ± σ, %) | 88.91 ± 0.23 | 93.73 ± 0.13 | 90.67 ± 0.43 | 97.01 ± 0.54 | 58.67 ± 3.31 | 2.99 ± 0.54 |
Mir1k | SDP-CNN (µ ± σ, %) | 89.42 ± 0.27 | 94 ± 0.16 | 91.14 ± 0.52 | 97.06 ± 0.64 | 55.43 ± 3.91 | 2.94 ± 0.64 |
Mir1k | SE-CNN (µ ± σ, %) | 89.67 ± 0.24 | 94.13 ± 0.14 | 91.47 ± 0.61 | 96.97 ± 0.72 | 53.17 ± 4.55 | 3.03 ± 0.72 |
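The six columns of the results table are standard binary classification metrics. As a minimal sketch, assuming "voice" is the positive class and that the metrics are computed from frame-level confusion counts (neither is stated in the table), they can be derived as follows. The function name `frame_metrics` and the counts in the usage example are hypothetical, not taken from the experiments.

```python
def frame_metrics(tp, fp, tn, fn):
    """Compute the six table metrics from confusion counts (voice = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),  # non-voice frames labeled as voice
        "fnr": fn / (fn + tp),  # voice frames that were missed
    }


# Hypothetical counts, for illustration only.
metrics = frame_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

By these definitions recall and FNR sum to 100% within a single run, which gives a quick sanity check when reading the Recall and FNR columns.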
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gui, W.; Li, Y.; Zang, X.; Zhang, J. Exploring Channel Properties to Improve Singing Voice Detection with Convolutional Neural Networks. Appl. Sci. 2021, 11, 11838. https://doi.org/10.3390/app112411838