
Applications of Machine Learning in Audio Classification and Acoustic Scene Characterization

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (30 April 2022) | Viewed by 36548

Special Issue Editor


Dr. Sławomir K. Zieliński
Guest Editor
Faculty of Computer Science, Bialystok University of Technology, Wiejska 45A, 15-351 Bialystok, Poland
Interests: machine learning; audio engineering; psychoacoustics; spatial audio

Special Issue Information

Dear Colleagues,

The goal of “audio classification” (AC) is to automatically identify the origin and attributes of individual sounds, while the aim of “acoustic scene characterization” (ASC) is to computationally describe more complex acoustic scenarios consisting of many simultaneously sound-emitting sources. A further difference between AC and ASC is that the latter also encompasses the identification and description of the acoustic environments in which the recordings took place. Hence, ASC portrays sonic events at a higher, more generic level than AC. Nevertheless, given the considerable overlap in applications and methodology between ASC and AC, we decided to cover both areas of research within the scope of this Special Issue.

An important but still under-researched aspect of ASC is the “spatial” characterization of sound scenes. Most AC and ASC systems developed so far are limited to identifying monaurally recorded audio sources or events, overlooking the importance of their spatial characteristics. Therefore, we are interested in research papers that include, but are not limited to, the following topics:

  • Spatial audio scene characterization;
  • Localization of sound sources within complex audio scenes;
  • Automatic indexing, search or retrieval of spatial audio recordings;
  • Acoustic scene characterization in music information retrieval;
  • Data-efficient augmentation for deep learning-based audio classification algorithms;
  • Intelligent audio surveillance systems;
  • Detection of anomalous or emergency-related sounds;
  • Acoustically-based systems for early detection and fault prevention in industrial settings.

Dr. Sławomir K. Zieliński
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audio classification
  • acoustic scene characterization
  • spatial audio
  • machine learning
  • deep learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (8 papers)


Research

23 pages, 3313 KiB  
Article
IoT System for Detecting the Condition of Rotating Machines Based on Acoustic Signals
by Milutin Radonjić, Sanja Vujnović, Aleksandra Krstić and Žarko Zečević
Appl. Sci. 2022, 12(9), 4385; https://doi.org/10.3390/app12094385 - 26 Apr 2022
Cited by 6 | Viewed by 3124
Abstract
Modern predictive maintenance techniques have been significantly improved with the development of Industrial Internet of Things solutions which have enabled easier collection and analysis of various data. Artificial intelligence-based algorithms in combination with modular interconnected architecture of sensors, devices and servers, have resulted in the development of intelligent maintenance systems which outperform most traditional machine maintenance approaches. In this paper, a novel acoustic-based IoT system for condition detection of rotating machines is proposed. The IoT device designed for this purpose is mobile and inexpensive and the algorithm developed for condition detection consists of a combination of discrete wavelet transform and neural networks, while a genetic algorithm is used to tune the necessary hyperparameters. The performance of this system has been tested in a real industrial setting, on different rotating machines, in an environment with strong acoustic pollution. The results show high accuracy of the algorithm, with an average F1 score of around 0.99 with tuned hyperparameters. Full article
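
The pipeline outlined in the abstract pairs discrete-wavelet-transform features with a neural-network classifier whose hyperparameters are tuned by a genetic algorithm. The sketch below illustrates only the DWT-features-to-classifier idea; the "db4" wavelet, the sub-band energy features, the scikit-learn MLP, and the synthetic data are assumptions, not the authors' implementation.

```python
# A minimal sketch of a DWT-features + neural-network condition detector.
# Wavelet family, decomposition depth, features, and network size are assumed.
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def dwt_features(signal, wavelet="db4", level=5):
    """Mean energy of each DWT sub-band as a compact feature vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.mean(c ** 2) for c in coeffs])

# Toy data standing in for acoustic recordings of healthy/faulty machines.
rng = np.random.default_rng(0)
X = np.stack([dwt_features(rng.standard_normal(4096)) for _ in range(64)])
y = rng.integers(0, 2, size=64)  # 0 = healthy, 1 = faulty (hypothetical labels)

# In the paper the hyperparameters are GA-tuned; fixed values are used here.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```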

16 pages, 1023 KiB  
Article
You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection
by Satvik Venkatesh, David Moffat and Eduardo Reck Miranda
Appl. Sci. 2022, 12(7), 3293; https://doi.org/10.3390/app12073293 - 24 Mar 2022
Cited by 17 | Viewed by 9224
Abstract
Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. It is useful for audio-content analysis, speech recognition, audio-indexing, and music information retrieval. In recent years, most research articles adopt segmentation-by-classification. This technique divides audio into small frames and individually performs classification on these frames. In this paper, we present a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in Computer Vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification. This is done by having separate output neurons to detect the presence of an audio class and predict its start and end points. The relative improvement for F-measure of YOHO, compared to the state-of-the-art Convolutional Recurrent Neural Network, ranged from 1% to 6% across multiple datasets for audio segmentation and sound event detection. As the output of YOHO is more end-to-end and has fewer neurons to predict, the speed of inference is at least 6 times faster than segmentation-by-classification. In addition, as this approach predicts acoustic boundaries directly, the post-processing and smoothing is about 7 times faster. Full article
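
YOHO replaces frame-wise classification with regression: for each audio class, dedicated output neurons flag the presence of the class and regress its start and end points. A minimal decoding sketch follows, assuming a (time bins × classes × 3) output layout, a 0.5 s bin duration, and example class names; it is not the published implementation.

```python
# A minimal sketch of decoding a YOHO-style output into (class, start, end)
# events. The output layout, bin duration, and class names are assumptions.
import numpy as np

def decode_yoho(output, bin_duration=0.5, threshold=0.5,
                class_names=("speech", "music")):
    """output shape: (num_bins, num_classes, 3) = [presence, start_frac, end_frac]."""
    events = []
    for b in range(output.shape[0]):
        for c in range(output.shape[1]):
            presence, start_frac, end_frac = output[b, c]
            if presence >= threshold:
                start = float((b + start_frac) * bin_duration)
                end = float((b + end_frac) * bin_duration)
                events.append((class_names[c], round(start, 2), round(end, 2)))
    return events

# Toy network output for 4 time bins and 2 classes.
demo = np.zeros((4, 2, 3))
demo[1, 0] = [0.9, 0.2, 1.0]  # speech detected in bin 1, starting 20% into it
print(decode_yoho(demo))      # -> one 'speech' event from 0.6 s to 1.0 s
```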

23 pages, 20456 KiB  
Article
Spatial Audio Scene Characterization (SASC): Automatic Localization of Front-, Back-, Up-, and Down-Positioned Music Ensembles in Binaural Recordings
by Sławomir K. Zieliński, Paweł Antoniuk and Hyunkook Lee
Appl. Sci. 2022, 12(3), 1569; https://doi.org/10.3390/app12031569 - 1 Feb 2022
Cited by 2 | Viewed by 2004
Abstract
The automatic localization of audio sources distributed symmetrically with respect to coronal or transverse planes using binaural signals still poses a challenging task, due to the front–back and up–down confusion effects. This paper demonstrates that the convolutional neural network (CNN) can be used to automatically localize music ensembles panned to the front, back, up, or down positions. The network was developed using the repository of the binaural excerpts obtained by the convolution of multi-track music recordings with the selected sets of head-related transfer functions (HRTFs). They were generated in such a way that a music ensemble (of circular shape in terms of its boundaries) was positioned in one of the following four locations with respect to the listener: front, back, up, and down. According to the obtained results, CNN identified the location of the ensembles with the average accuracy levels of 90.7% and 71.4% when tested under the HRTF-dependent and HRTF-independent conditions, respectively. For HRTF-dependent tests, the accuracy decreased monotonically with the increase in the ensemble size. A modified image occlusion sensitivity technique revealed selected frequency bands as being particularly important in terms of the localization process. These frequency bands are largely in accordance with the psychoacoustical literature. Full article
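
The abstract does not reproduce the network architecture, but the task it describes is a four-class decision (front, back, up, down) made from a two-channel binaural input. The PyTorch sketch below, with an assumed spectrogram input shape and layer sizes, shows the general shape of such a classifier; it is not the model published in the paper.

```python
# A minimal sketch of a CNN mapping a two-channel (left/right ear) spectrogram
# to four ensemble positions; architecture and input shape are assumptions.
import torch
import torch.nn as nn

class BinauralLocCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x):  # x: (batch, 2, freq_bins, time_frames)
        return self.head(self.features(x))

model = BinauralLocCNN()
dummy = torch.randn(1, 2, 128, 64)  # one binaural magnitude spectrogram
print(model(dummy).shape)           # torch.Size([1, 4]): front/back/up/down logits
```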

14 pages, 2411 KiB  
Article
Sound Source Separation Mechanisms of Different Deep Networks Explained from the Perspective of Auditory Perception
by Han Li, Kean Chen, Lei Wang, Jianben Liu, Baoquan Wan and Bing Zhou
Appl. Sci. 2022, 12(2), 832; https://doi.org/10.3390/app12020832 - 14 Jan 2022
Cited by 8 | Viewed by 3201
Abstract
Thanks to the development of deep learning, various sound source separation networks have been proposed and made significant progress. However, the study on the underlying separation mechanisms is still in its infancy. In this study, deep networks are explained from the perspective of auditory perception mechanisms. For separating two arbitrary sound sources from monaural recordings, three different networks with different parameters are trained and achieve excellent performances. The networks’ output can obtain an average scale-invariant signal-to-distortion ratio improvement (SI-SDRi) higher than 10 dB, comparable with the human performance to separate natural sources. More importantly, the most intuitive principle—proximity—is explored through simultaneous and sequential organization experiments. Results show that regardless of network structures and parameters, the proximity principle is learned spontaneously by all networks. If components are proximate in frequency or time, they are not easily separated by networks. Moreover, the frequency resolution at low frequencies is better than at high frequencies. These behavior characteristics of all three networks are highly consistent with those of the human auditory system, which implies that the learned proximity principle is not accidental, but the optimal strategy selected by networks and humans when facing the same task. The emergence of the auditory-like separation mechanisms provides the possibility to develop a universal system that can be adapted to all sources and scenes. Full article
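
The SI-SDRi figure quoted above is the scale-invariant signal-to-distortion ratio of the separated output minus that of the unprocessed mixture. The sketch below follows the commonly used definition of SI-SDR (zero-mean signals, estimate projected onto the reference); it is not code released by the authors, and the signals are synthetic.

```python
# A minimal sketch of SI-SDR and SI-SDR improvement (SI-SDRi) on synthetic data.
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (common definition)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference  # projection of the estimate onto the reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
mixture = source + 0.5 * rng.standard_normal(16000)    # unprocessed input
separated = source + 0.1 * rng.standard_normal(16000)  # hypothetical network output
print(si_sdr(separated, source) - si_sdr(mixture, source))  # SI-SDRi in dB
```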

20 pages, 7836 KiB  
Article
Characterization of Sonic Events Present in Natural-Urban Hybrid Habitats Using UMAP and SEDnet: The Case of the Urban Wetlands
by Víctor Poblete, Diego Espejo, Víctor Vargas, Felipe Otondo and Pablo Huijse
Appl. Sci. 2021, 11(17), 8175; https://doi.org/10.3390/app11178175 - 3 Sep 2021
Cited by 5 | Viewed by 3090
Abstract
We investigated whether the use of technological tools can effectively help in manipulating the increasing volume of audio data available through the use of long field recordings. We also explored whether we can address, by using these recordings and tools, audio data analysis, feature extraction and determine predominant patterns in the data. Similarly, we explored whether we can visualize feature clusters in the data and automatically detect sonic events. Our focus was primarily on enhancing the importance of natural-urban hybrid habitats within cities, which benefit communities in various ways, specifically through the natural soundscapes of these habitats that evoke memories and reinforce a sense of belonging for inhabitants. The loss of sonic heritage can be a precursor to the extinction of biodiversity within these habitats. By quantifying changes in the soundscape of these habitats over long periods of time, we can collect relevant information linked to this eventual loss. In this respect, we developed two approaches. The first was the comparison among habitats that progressively changed from natural to urban. The second was the optimization of the field recordings’ labeling process. This was performed with labels corresponding to the annotations of classes of sonic events and their respective start and end times, including events temporarily superimposed on one another. We compared three habitats over time by using their sonic characteristics collected in field conditions. Comparisons of sonic similarity or dissimilarity among patches were made based on the Jaccard coefficient and uniform manifold approximation and projection (UMAP). Our SEDnet model achieves a F1-score of 0.79 with error rate 0.377 and with the area under PSD-ROC curve of 71.0. In terms of computational efficiency, the model is able to detect sound events from an audio file in a time of 14.49 s. With these results, we confirm the usefulness of the methods used in this work for the process of labeling field recordings. Full article
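
One of the comparisons above rests on the Jaccard coefficient, i.e. the size of the intersection of two sets divided by the size of their union. A small illustration with invented sound-event labels follows; the habitat names and classes are not taken from the paper.

```python
# A minimal sketch of the Jaccard coefficient between the sound-event classes
# detected in two habitat patches; labels here are invented for illustration.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

wetland = {"bird_song", "frog_call", "wind", "insects"}
urban_edge = {"bird_song", "traffic", "wind", "dog_bark"}
print(jaccard(wetland, urban_edge))  # ~0.33: moderate sonic overlap
```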

14 pages, 1938 KiB  
Article
Modelling the Microphone-Related Timbral Brightness of Recorded Signals
by Andy Pearce, Tim Brookes and Russell Mason
Appl. Sci. 2021, 11(14), 6461; https://doi.org/10.3390/app11146461 - 13 Jul 2021
Cited by 1 | Viewed by 1883
Abstract
Brightness is one of the most common timbral descriptors used for searching audio databases, and is also the timbral attribute of recorded sound that is most affected by microphone choice, making a brightness prediction model desirable for automatic metadata generation. A model, sensitive to microphone-related as well as source-related brightness, was developed based on a novel combination of the spectral centroid and the ratio of the total magnitude of the signal above 500 Hz to that of the full signal. This model performed well on training data (r = 0.922). Validating it on new data showed a slight gradient error but good linear correlation across source types and overall (r = 0.955). On both training and validation data, the new model out-performed metrics previously used for brightness prediction. Full article
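
The two signal descriptors named in the abstract, the spectral centroid and the ratio of spectral magnitude above 500 Hz to the total magnitude, can be computed directly from an FFT, as in the sketch below; how the published model weights and combines them is not reproduced here, and the test signal is synthetic.

```python
# A minimal sketch of the two brightness-related features named in the abstract.
import numpy as np

def brightness_features(signal, sample_rate=44100):
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)         # spectral centroid (Hz)
    high_ratio = np.sum(spectrum[freqs > 500]) / np.sum(spectrum)  # magnitude above 500 Hz
    return centroid, high_ratio

# A 2 kHz tone with a little noise: high centroid, high >500 Hz ratio.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100
tone = np.sin(2 * np.pi * 2000 * t) + 0.1 * rng.standard_normal(44100)
print(brightness_features(tone))
```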

18 pages, 1634 KiB  
Article
An Ensemble of Convolutional Neural Networks for Audio Classification
by Loris Nanni, Gianluca Maguolo, Sheryl Brahnam and Michelangelo Paci
Appl. Sci. 2021, 11(13), 5796; https://doi.org/10.3390/app11135796 - 22 Jun 2021
Cited by 64 | Viewed by 7580
Abstract
Research in sound classification and recognition is rapidly advancing in the field of pattern recognition. One important area in this field is environmental sound recognition, whether it concerns the identification of endangered species in different habitats or the type of interfering noise in urban environments. Since environmental audio datasets are often limited in size, a robust model able to perform well across different datasets is of strong research interest. In this paper, ensembles of classifiers are combined that exploit six data augmentation techniques and four signal representations for retraining five pre-trained convolutional neural networks (CNNs); these ensembles are tested on three freely available environmental audio benchmark datasets: (i) bird calls, (ii) cat sounds, and (iii) the Environmental Sound Classification (ESC-50) database for identifying sources of noise in environments. To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification. The best-performing ensembles are compared and shown to either outperform or perform comparatively to the best methods reported in the literature on these datasets, including on the challenging ESC-50 dataset. We obtained a 97% accuracy on the bird dataset, 90.51% on the cat dataset, and 88.65% on ESC-50 using different approaches. In addition, the same ensemble model trained on the three datasets managed to reach the same results on the bird and cat datasets while losing only 0.1% on ESC-50. Thus, we have managed to create an off-the-shelf ensemble that can be trained on different datasets and reach performances competitive with the state of the art. Full article
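
The fusion step implied by the abstract, combining the predictions of several CNNs trained with different augmentations and signal representations, can be as simple as averaging class probabilities. The sketch below shows that averaging rule only, with random arrays standing in for the networks' softmax outputs.

```python
# A minimal sketch of ensemble fusion by averaging per-network class probabilities.
import numpy as np

def fuse(probabilities):
    """probabilities: list of (n_samples, n_classes) arrays, one per network."""
    return np.mean(np.stack(probabilities), axis=0).argmax(axis=1)

# Random arrays stand in for the softmax outputs of 5 CNNs on 8 clips, 50 classes.
rng = np.random.default_rng(0)
outputs = [rng.dirichlet(np.ones(50), size=8) for _ in range(5)]
print(fuse(outputs))  # fused class index for each of the 8 test clips
```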

20 pages, 7807 KiB  
Article
A Biologically Inspired Sound Localisation System Using a Silicon Cochlea Pair
by Ying Xu, Saeed Afshar, Runchun Wang, Gregory Cohen, Chetan Singh Thakur, Tara Julia Hamilton and André van Schaik
Appl. Sci. 2021, 11(4), 1519; https://doi.org/10.3390/app11041519 - 8 Feb 2021
Cited by 6 | Viewed by 2662
Abstract
We present a biologically inspired sound localisation system for reverberant environments using the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) cochlear model. The system exploits a CAR-FAC pair to pre-process binaural signals that travel through the inherent delay line of the cascade structures, as each filter acts as a delay unit. Following the filtering, each cochlear channel is cross-correlated with all the channels of the other cochlea using a quantised instantaneous correlation function to form a 2-D instantaneous correlation matrix (correlogram). The correlogram contains both interaural time difference and spectral information. The generated correlograms are analysed using a regression neural network for localisation. We investigate the effect of the CAR-FAC nonlinearity on the system performance by comparing it with a CAR only version. To verify that the CAR/CAR-FAC and the quantised instantaneous correlation provide a suitable basis with which to perform sound localisation tasks, a linear regression, an extreme learning machine, and a convolutional neural network are trained to learn the azimuthal angle of the sound source from the correlogram. The system is evaluated using speech data recorded in a reverberant environment. We compare the performance of the linear CAR and nonlinear CAR-FAC models with current sound localisation systems as well as with human performance. Full article
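
The correlogram described above pairs every channel of one cochlea with every channel of the other. The sketch below builds such a 2-D matrix with a plain normalized correlation and random arrays in place of CAR-FAC channel outputs; the quantised instantaneous correlation used in the paper is not reproduced.

```python
# A minimal sketch of a 2-D cross-channel correlation matrix ("correlogram");
# random arrays stand in for the CAR-FAC filterbank outputs of each ear.
import numpy as np

def correlogram(left_channels, right_channels):
    """Correlate every left cochlear channel with every right channel."""
    corr = np.zeros((len(left_channels), len(right_channels)))
    for i, l in enumerate(left_channels):
        for j, r in enumerate(right_channels):
            corr[i, j] = np.dot(l, r) / (np.linalg.norm(l) * np.linalg.norm(r) + 1e-12)
    return corr

rng = np.random.default_rng(0)
left = rng.standard_normal((16, 1024))   # 16 channels per ear (assumed)
right = rng.standard_normal((16, 1024))
print(correlogram(left, right).shape)    # (16, 16) matrix fed to the regressor
```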
