Machine Learning in Music/Audio Signal Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: closed (15 December 2023) | Viewed by 13748

Special Issue Editors

Dr. Jing Wang
Guest Editor
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Interests: speech/audio signal processing; multimedia communications; virtual reality

Prof. Dr. Maoshen Jia
Guest Editor
School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
Interests: speech and audio coding; multichannel audio signal processing; array signal processing

Special Issue Information

Dear Colleagues,

With the development of machine learning, especially deep learning and neural networks, the performance of music and audio signal processing has improved substantially in audio information understanding, extraction, generation and recovery. More efficient, data-driven audio signal processing and analysis techniques are needed so that users can experience better speech- and music-related products. Machine learning encompasses supervised, unsupervised, semi-supervised and reinforcement learning, all of which can be applied to prediction and classification problems in music/audio signal processing. Audio, and especially music, is more difficult to analyze and process than speech: its signal characteristics are more complex, and large datasets are harder to construct. These difficulties pose new challenges for research on machine learning techniques in music/audio signal processing, and there is still potential in combining machine learning with traditional signal processing methods when dealing with complex music and audio. In the future, machine learning will not only make music and audio more convenient to experience; high-level artificial intelligence methods will also lead to more applications in intelligent hardware, smart education, internet music, entertainment, AI composition and even broader metaverse audio scenarios.

This Special Issue aims to present improved machine learning solutions for music and audio signal processing, such as music information retrieval, audio classification, speech/music enhancement and music/sound synthesis, as well as general computational auditory scene analysis. Topics of interest include, but are not limited to, the following:

  • General data-driven methods in music and audio signal analysis and processing;
  • Machine learning methods for music/audio information retrieval, such as musical instrument classification, mood classification, music melody extraction, etc.;
  • Approaches for audio scene analysis, including audio tagging, audio classification, sound event detection and related signal processing in generalized acoustic scenes;
  • Deep learning methods for speech and audio processing such as speech/music enhancement, speech/audio bandwidth extension, speech/music separation and music/sound synthesis;
  • Necessary techniques for machine learning-based music/audio signal processing such as database collection, emotion analysis, sound visualization, music evaluation, etc.

Dr. Jing Wang
Prof. Dr. Maoshen Jia
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • deep learning
  • neural networks
  • audio signal processing
  • music information retrieval
  • audio scene analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)

Research

14 pages, 1030 KiB  
Article
Sound Event Detection with Perturbed Residual Recurrent Neural Network
by Shuang Yuan, Lidong Yang and Yong Guo
Electronics 2023, 12(18), 3836; https://doi.org/10.3390/electronics12183836 - 11 Sep 2023
Cited by 1 | Viewed by 1112
Abstract
Sound event detection (SED) is of great practical and research significance owing to its wide range of applications. However, due to the heavy reliance on dataset size for task performance, there is often a severe lack of data in real-world scenarios. In this study, an improved mean teacher model is utilized to carry out semi-supervised SED, and a perturbed residual recurrent neural network (P-RRNN) is proposed as the SED network. The residual structure is employed to alleviate the problem of network degradation, and pre-training the improved model on the ImageNet dataset enables it to learn information that is beneficial for event detection, thus improving the performance of SED. In the post-processing stage, a customized median filter group with a specific window length is designed to effectively smooth each type of event and minimize the impact of background noise on detection accuracy. Experimental results conducted on the publicly available Detection and Classification of Acoustic Scenes and Events 2019 Task 4 dataset demonstrate that the P-RRNN used for SED in this study can effectively enhance the detection capability of the model. The detection system achieves a Macro Event-based F1 score of 38.8% on the validation set and 40.5% on the evaluation set, indicating that the proposed method can adapt to complex and dynamic SED scenarios. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
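
The class-specific median filtering described in this abstract is easy to illustrate in isolation. The sketch below is not the authors' code: the event classes, window lengths, and the `smooth_predictions` helper are hypothetical, and only the general idea of per-class smoothing of frame-wise predictions is shown.

```python
import numpy as np
from scipy.ndimage import median_filter

# Hypothetical event classes and per-class median-filter window lengths (in frames).
# The paper tunes a specific window per event type; these values are illustrative only.
CLASS_WINDOWS = {"speech": 7, "dog": 15, "vacuum_cleaner": 41}

def smooth_predictions(frame_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize frame-wise probabilities and smooth each class with its own median filter.

    frame_probs: array of shape (n_frames, n_classes) with values in [0, 1].
    Returns a binary activity matrix of the same shape.
    """
    binary = (frame_probs > threshold).astype(float)
    smoothed = np.zeros_like(binary)
    for idx, win in enumerate(CLASS_WINDOWS.values()):
        # A 1-D median filter along time removes isolated spikes and fills short gaps.
        smoothed[:, idx] = median_filter(binary[:, idx], size=win)
    return smoothed

# Example: smooth random "predictions" for a 500-frame clip with three classes.
events = smooth_predictions(np.random.rand(500, len(CLASS_WINDOWS)))
```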

13 pages, 366 KiB  
Article
Speaker Recognition Based on the Joint Loss Function
by Tengteng Feng, Houbin Fan, Fengpei Ge, Shuxin Cao and Chunyan Liang
Electronics 2023, 12(16), 3447; https://doi.org/10.3390/electronics12163447 - 15 Aug 2023
Cited by 1 | Viewed by 1203
Abstract
The statistical pyramid dense time-delay neural network (SPD-TDNN) model has difficulty dealing with imbalanced training data, poses a high risk of overfitting, and has weak generalization ability. To solve these problems, we propose a method based on the joint loss function and improved statistical pyramid dense time-delay neural network (JLF-ISPD-TDNN), which improves on the SPD-TDNN model and uses the joint loss function method to combine the advantages of the cross-entropy loss function and the contrastive learning loss function. By minimizing the distance between speech embeddings from the same speaker and maximizing the distance between speech embeddings from different speakers, the model could achieve enhanced generalization performance and more robust speaker feature representation. We evaluated the proposed method’s performance using the evaluation indexes of the equal error rate (EER) and minimum cost function (minDCF). The experimental results show that the EER and minDCF on the Aishell-1 dataset reached 1.02% and 0.1221%, respectively. Therefore, using the joint loss function in the improved SPD-TDNN model can significantly enhance the model’s speaker recognition performance. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
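
A joint loss of this kind, a cross-entropy term over speaker identities plus a contrastive term that pulls same-speaker embeddings together and pushes different-speaker embeddings apart, can be sketched generically as follows. This is not the JLF-ISPD-TDNN implementation; the `joint_loss` function, the weighting factor `alpha`, and the margin are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, embeddings, labels, alpha=0.5, margin=0.3):
    """Weighted sum of a classification objective and a contrastive objective.

    logits:     (batch, n_speakers) classifier outputs
    embeddings: (batch, dim) speaker embeddings
    labels:     (batch,) integer speaker IDs
    """
    # Standard cross-entropy over speaker identities.
    ce = F.cross_entropy(logits, labels)

    # Pairwise cosine similarities between all embeddings in the batch.
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=labels.device)

    # Pull same-speaker pairs together, push different-speaker pairs beyond a margin.
    pos = ((1.0 - sim) * same * (1 - eye)).sum() / ((same * (1 - eye)).sum() + 1e-8)
    neg = (F.relu(sim - margin) * (1.0 - same)).sum() / ((1.0 - same).sum() + 1e-8)
    contrastive = pos + neg

    return ce + alpha * contrastive

# Usage with random tensors (8 utterances, 4 speakers, 192-dim embeddings).
logits = torch.randn(8, 4)
embeddings = torch.randn(8, 192)
labels = torch.randint(0, 4, (8,))
loss = joint_loss(logits, embeddings, labels)
```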

14 pages, 6266 KiB  
Article
Supervised Single Channel Speech Enhancement Method Using UNET
by Md. Nahid Hossain, Samiul Basir, Md. Shakhawat Hosen, A.O.M. Asaduzzaman, Md. Mojahidul Islam, Mohammad Alamgir Hossain and Md Shohidul Islam
Electronics 2023, 12(14), 3052; https://doi.org/10.3390/electronics12143052 - 12 Jul 2023
Cited by 3 | Viewed by 2805
Abstract
This paper proposes an innovative single-channel supervised speech enhancement (SE) method based on UNET, a convolutional neural network (CNN) architecture that expands on a few changes in the basic CNN architecture. In the training phase, short-time Fourier transform (STFT) is exploited on the noisy time domain signal to build a noisy time-frequency domain signal which is called a complex noisy matrix. We take the real and imaginary parts of the complex noisy matrix and concatenate both of them to form the noisy concatenated matrix. We apply UNET to the noisy concatenated matrix for extracting speech components and train the CNN model. In the testing phase, the same procedure is applied to the noisy time-domain signal as in the training phase in order to construct another noisy concatenated matrix that can be tested using a pre-trained or saved model in order to construct an enhanced concatenated matrix. Finally, from the enhanced concatenated matrix, we separate both the imaginary and real parts to form an enhanced complex matrix. Magnitude and phase are then extracted from the newly created enhanced complex matrix. By using that magnitude and phase, the inverse STFT (ISTFT) can generate the enhanced speech signal. Utilizing the IEEE databases and various types of noise, including stationary and non-stationary noise, the proposed method is evaluated. Comparing the exploratory results of the proposed algorithm to the other five methods of STFT, sparse non-negative matrix factorization (SNMF), dual-tree complex wavelet transform (DTCWT)-SNMF, DTCWT-STFT-SNMF, STFT-convolutional denoising auto encoder (CDAE) and causal multi-head attention mechanism (CMAM) for speech enhancement, we determine that the proposed algorithm generally improves speech quality and intelligibility at all considered signal-to-noise ratios (SNRs). The suggested approach performs better than the other five competing algorithms in every evaluation metric. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
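
The signal flow in this abstract (STFT, concatenation of real and imaginary parts, a network mapping the noisy matrix to an enhanced one, then ISTFT) can be outlined as below. The `enhance` function and the STFT parameters are assumptions, and a trivial identity module stands in for the trained UNET model.

```python
import torch

N_FFT, HOP = 512, 128  # assumed STFT settings, not taken from the paper

def enhance(noisy: torch.Tensor, enhancement_net) -> torch.Tensor:
    """Enhance a mono waveform (shape: (samples,)) following the STFT-concat-ISTFT flow."""
    window = torch.hann_window(N_FFT)

    # Complex noisy matrix via STFT, then concatenate real and imaginary parts
    # along the frequency axis to form the "noisy concatenated matrix".
    spec = torch.stft(noisy, N_FFT, hop_length=HOP, window=window, return_complex=True)
    concat = torch.cat([spec.real, spec.imag], dim=0)          # (2 * freq_bins, frames)

    # The trained network maps the noisy concatenated matrix to an enhanced one.
    enhanced_concat = enhancement_net(concat.unsqueeze(0)).squeeze(0)

    # Split back into real/imaginary halves and rebuild a complex matrix.
    freq_bins = spec.shape[0]
    enhanced = torch.complex(enhanced_concat[:freq_bins], enhanced_concat[freq_bins:])

    # ISTFT reconstructs the enhanced time-domain signal from magnitude and phase.
    return torch.istft(enhanced, N_FFT, hop_length=HOP, window=window,
                       length=noisy.shape[-1])

# Identity "network" as a stand-in for the UNET model described in the paper.
identity_net = torch.nn.Identity()
waveform = torch.randn(16000)          # one second of fake 16 kHz audio
restored = enhance(waveform, identity_net)
```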

22 pages, 10745 KiB  
Article
MKGCN: Multi-Modal Knowledge Graph Convolutional Network for Music Recommender Systems
by Xiaohui Cui, Xiaolong Qu, Dongmei Li, Yu Yang, Yuxun Li and Xiaoping Zhang
Electronics 2023, 12(12), 2688; https://doi.org/10.3390/electronics12122688 - 15 Jun 2023
Cited by 6 | Viewed by 3196
Abstract
With the emergence of online music platforms, music recommender systems are becoming increasingly crucial in music information retrieval. Knowledge graphs (KGs) are a rich source of semantic information for entities and relations, allowing for improved modeling and analysis of entity relations to enhance recommendations. Existing research has primarily focused on the modeling and analysis of structural triples, while largely ignoring the representation and information processing capabilities of multi-modal data such as music videos and lyrics, which has hindered the improvement and user experience of music recommender systems. To address these issues, we propose a Multi-modal Knowledge Graph Convolutional Network (MKGCN) to enhance music recommendation by leveraging the multi-modal knowledge of music items and their high-order structural and semantic information. Specifically, there are three aggregators in MKGCN: the multi-modal aggregator aggregates the text, image, audio, and sentiment features of each music item in a multi-modal knowledge graph (MMKG); the user aggregator and item aggregator use graph convolutional networks to aggregate multi-hop neighboring nodes on MMKGs to model high-order representations of user preferences and music items, respectively. Finally, we utilize the aggregated embedding representations for recommendation. In training MKGCN, we adopt the ratio negative sampling strategy to generate high-quality negative samples. We construct four different-sized music MMKGs using the public dataset Last-FM and conduct extensive experiments on them. The experimental results demonstrate that MKGCN achieves significant improvements and outperforms several state-of-the-art baselines. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
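
The multi-modal aggregator described here, which fuses text, image, audio, and sentiment features of a music item into one embedding, might look roughly like the following sketch. Projection followed by averaging is one plausible simplification, not the MKGCN definition, and the feature dimensions are assumed.

```python
import torch
import torch.nn as nn

class MultiModalAggregator(nn.Module):
    """Project each modality into a shared space and fuse by averaging.

    A simplified stand-in for the multi-modal aggregator described in the paper;
    the input dimensions (text 300, image 512, audio 128, sentiment 8) are assumed.
    """
    def __init__(self, out_dim: int = 64):
        super().__init__()
        dims = {"text": 300, "image": 512, "audio": 128, "sentiment": 8}
        self.proj = nn.ModuleDict({m: nn.Linear(d, out_dim) for m, d in dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        # Project each modality, then average into one item embedding.
        projected = [torch.relu(self.proj[m](x)) for m, x in feats.items()]
        return torch.stack(projected, dim=0).mean(dim=0)

# One music item with random features for each modality.
agg = MultiModalAggregator()
item = {"text": torch.randn(300), "image": torch.randn(512),
        "audio": torch.randn(128), "sentiment": torch.randn(8)}
embedding = agg(item)          # shape: (64,)
```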

25 pages, 7503 KiB  
Article
Automatic Assessment of Piano Performances Using Timbre and Pitch Features
by Varinya Phanichraksaphong and Wei-Ho Tsai
Electronics 2023, 12(8), 1791; https://doi.org/10.3390/electronics12081791 - 10 Apr 2023
Cited by 4 | Viewed by 1741
Abstract
To assist piano learners with the improvement of their skills, this study investigates techniques for automatically assessing piano performances based on timbre and pitch features. The assessment is formulated as a classification problem that classifies piano performances as “Good”, “Fair”, or “Poor”. For timbre-based approaches, we propose Timbre-based WaveNet, Timbre-based MLNet, Timbre-based CNN, and Timbre-based CNN Transformers. For pitch-based approaches, we propose Pitch-based CNN and Pitch-based CNN Transformers. Our experiments indicate that both Pitch-based CNN and Pitch-based CNN Transformers are superior to the timbre-based approaches, attaining classification accuracies of 96.87% and 97.5%, respectively. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
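
The underlying formulation, mapping pitch-based features of a recording to one of three grades, can be sketched with a toy classifier. The architecture and feature shape below are illustrative assumptions, not the models proposed in the paper.

```python
import torch
import torch.nn as nn

LABELS = ["Good", "Fair", "Poor"]

# A minimal pitch-feature classifier in the spirit of the formulation above:
# the input is assumed to be a (1, n_pitch_bins, n_frames) feature map, e.g. a
# chromagram or f0 contour image; the architecture is illustrative only.
classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, len(LABELS)),
)

features = torch.randn(1, 1, 12, 400)        # one clip: 12 pitch bins x 400 frames
grade = LABELS[classifier(features).argmax(dim=1).item()]
```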

17 pages, 2731 KiB  
Article
VAT-SNet: A Convolutional Music-Separation Network Based on Vocal and Accompaniment Time-Domain Features
by Xiaoman Qiao, Min Luo, Fengjing Shao, Yi Sui, Xiaowei Yin and Rencheng Sun
Electronics 2022, 11(24), 4078; https://doi.org/10.3390/electronics11244078 - 8 Dec 2022
Cited by 3 | Viewed by 1885
Abstract
The study of separating the vocal from the accompaniment in single-channel music is foundational and critical in the field of music information retrieval (MIR). Mainstream music-separation methods are usually based on the frequency-domain characteristics of music signals, and the phase information of the music is lost during time–frequency decomposition. In recent years, deep learning models based on speech time-domain signals, such as Conv-TasNet, have shown great potential. However, for the vocal and accompaniment separation problem, there is no suitable time-domain music-separation model. Since the vocal and the accompaniment in music have a higher synergy and similarity than the voices of two speakers in speech, separating the vocal and accompaniment using a speech-separation model is not ideal. Based on this, we propose VAT-SNet; this optimizes the network structure of Conv-TasNet, which sets sample-level convolution in the encoder and decoder to preserve deep acoustic features, and takes vocal embedding and accompaniment embedding generated by the auxiliary network as references to improve the purity of the separation of the vocal and accompaniment. The results from public music datasets show that the quality of the vocal and accompaniment separated by VAT-SNet is improved in GSNR, GSIR, and GSAR compared with Conv-TasNet and mainstream separation methods, such as U-Net, SH-4stack, etc. Full article
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)
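
The time-domain skeleton that VAT-SNet builds on (a learned 1-D convolutional encoder, a masking network, and a transposed-convolution decoder, as in Conv-TasNet) can be sketched minimally as follows. This is a generic mask-based separator, not VAT-SNet itself: the auxiliary embedding network is omitted and all dimensions are assumed.

```python
import torch
import torch.nn as nn

class TinyMaskSeparator(nn.Module):
    """Encoder -> per-source masks -> decoder: the Conv-TasNet-style skeleton
    that VAT-SNet extends. Kernel size, stride, and channel counts are assumed."""
    def __init__(self, channels: int = 128, kernel: int = 16, stride: int = 8,
                 n_sources: int = 2):
        super().__init__()
        self.n_sources = n_sources
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.mask_net = nn.Sequential(            # stand-in for the separation network
            nn.Conv1d(channels, channels * n_sources, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # mix: (batch, 1, samples) mono mixture of vocal + accompaniment.
        latent = torch.relu(self.encoder(mix))                       # (B, C, T')
        masks = self.mask_net(latent)                                # (B, C*S, T')
        masks = masks.view(mix.size(0), self.n_sources, -1, latent.size(-1))
        # Apply each source's mask to the shared latent and decode back to waveforms.
        sources = [self.decoder(latent * masks[:, s]) for s in range(self.n_sources)]
        return torch.stack(sources, dim=1)                           # (B, S, 1, samples)

model = TinyMaskSeparator()
mixture = torch.randn(1, 1, 32000)            # two seconds of fake 16 kHz music
vocal, accompaniment = model(mixture)[0]      # separated waveform estimates
```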
