Information-Theoretic Approaches in Speech Processing and Recognition

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (15 January 2024) | Viewed by 19674

Special Issue Editor


Dr. Deniz Gençağa
Guest Editor
Department of Electrical and Electronics Engineering, Antalya Bilim University, Antalya 07190, Turkey
Interests: Bayesian data analysis; statistical signal processing; machine learning (for big data); information theory; source separation; computational mathematics and statistics; autonomous and intelligent systems; data mining and knowledge discovery; remote sensing; climatology; astronomy; systems biology; smart grid

Special Issue Information

Dear Colleagues,

Information-theoretic quantities, such as entropy, mutual information, and transfer entropy, have been successfully utilized in many areas of science and engineering. Mutual information is frequently preferred as a key quantity for revealing statistical dependencies between random variables, especially in cases where widely used linear correlation analyses become insufficient. As an asymmetric quantity, transfer entropy has been applied to detect directional information flows, helping to better understand the cause-and-effect relationships between different variables. Despite many applications in signal processing and machine learning, information-theoretic quantities have seldom been used in the speech processing literature. Most applications involve the selection of different, more informative features from speech signals using mutual information and its variants. Similar quantities are also utilized to improve speech recognition quality. Another research area includes multimodal applications, such as the analysis of the coupled effects of visual lip movements and speech recognition. In another recent study, speech information is decomposed into four components, namely language content, timbre, pitch, and rhythm, via a triple information bottleneck. Deep learning applications are also common, in which convolutional bottleneck features are analyzed for speech recognition from an information-theoretic point of view. In addition, Bayesian inference and prediction of speech signals is another featured topic of interest within the scope of this Special Issue.
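
As a concrete, deliberately simplified illustration of the quantities mentioned above, the sketch below estimates mutual information and transfer entropy from sample data using plug-in histogram estimates. The signals, bin counts, and single-sample lags are illustrative assumptions, not a prescription for submissions.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a histogram of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X; Y) from jointly binned samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

def transfer_entropy(x, y, bins=8):
    """Plug-in estimate of TE(X -> Y) = I(Y_t ; X_{t-1} | Y_{t-1}) with one-sample lags."""
    yt, yp, xp = y[1:], y[:-1], x[:-1]
    joint, _ = np.histogramdd(np.column_stack([yt, yp, xp]), bins=bins)
    h_ytyp = entropy(joint.sum(axis=2).ravel())      # H(Y_t, Y_{t-1})
    h_ypxp = entropy(joint.sum(axis=0).ravel())      # H(Y_{t-1}, X_{t-1})
    h_yp = entropy(joint.sum(axis=(0, 2)).ravel())   # H(Y_{t-1})
    h_all = entropy(joint.ravel())                   # H(Y_t, Y_{t-1}, X_{t-1})
    return h_ytyp + h_ypxp - h_yp - h_all

# Toy data: x is an AR(1) process and y is driven by the past of x.
rng = np.random.default_rng(0)
n = 20000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()
y = 0.7 * np.roll(x, 1) + 0.5 * rng.normal(size=n)

print("I(X;Y)   =", round(mutual_information(x, y), 3))
print("TE(X->Y) =", round(transfer_entropy(x, y, bins=6), 3))   # clearly positive
print("TE(Y->X) =", round(transfer_entropy(y, x, bins=6), 3))   # near zero, up to estimator bias
```

The asymmetry of the two transfer-entropy estimates is what makes the quantity useful for directional analyses, in contrast with the symmetric mutual information.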

In this Special Issue, we would like to collect papers focusing on the theory and applications of information-theoretic approaches in speech processing and recognition. Application areas include hands-free computing, automatic emotion recognition, automatic translation, home automation, telematics, and robotics, but a broader range of topics in information theory and Bayesian statistics is encouraged. Of special interest are theoretical papers elucidating the state of the art in multimodal signal processing approaches.

Dr. Deniz Gençağa
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech processing
  • speech recognition
  • information theory
  • information-theoretic quantities
  • feature selection for speech
  • deep learning
  • information bottleneck
  • Bayesian learning
  • multimodal signal processing
  • machine learning

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (9 papers)


Research

17 pages, 1545 KiB  
Article
Confounding Factor Analysis for Vocal Fold Oscillations
by Deniz Gençağa
Entropy 2023, 25(12), 1577; https://doi.org/10.3390/e25121577 - 23 Nov 2023
Viewed by 904
Abstract
This paper provides a methodology to better understand the relationships between different aspects of vocal fold motion, which are used as features in machine learning-based approaches for detecting respiratory infections from voice recordings. The relationships are derived through a joint multivariate analysis of the vocal fold oscillations of speakers. Specifically, the multivariate setting explores the displacements and velocities of the left and right vocal folds derived from recordings of five extended vowel sounds for each speaker (/aa/, /iy/, /ey/, /uw/, and /ow/). In this multivariate setting, the differences between the bivariate and conditional interactions are analyzed by information-theoretic quantities based on transfer entropy. Incorporation of the conditional quantities reveals information regarding the confounding factors that can influence the statistical interactions among other pairs of variables. This is demonstrated on a vector autoregressive process where the analytical derivations can be carried out. As a proof of concept, the methodology is applied to a clinically curated dataset of COVID-19. The findings suggest that the interaction between the vocal fold oscillations can change according to individuals and the presence of any respiratory infection, such as COVID-19. The results are important in the sense that the proposed approach can be utilized to determine the selection of appropriate features as a supplementary or early detection tool in voice-based diagnostics in future studies.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
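
The role of conditioning in exposing confounders can be illustrated with a toy example. The sketch below is not the paper's estimator or dataset; it uses Gaussian (linear-regression) transfer entropy on a synthetic VAR-like process in which a common driver z induces an apparent x -> y information flow that largely disappears once z is conditioned on.

```python
import numpy as np

def residual_var(target, predictors):
    """Variance of target after linear regression on the given predictors."""
    X = np.column_stack([np.ones_like(target)] + predictors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ beta)

def te_gaussian(src, dst, cond=()):
    """Gaussian transfer entropy TE(src -> dst | cond) with one-sample lags."""
    dt, dp, sp = dst[1:], dst[:-1], src[:-1]
    cp = [c[:-1] for c in cond]
    v_without = residual_var(dt, [dp] + cp)
    v_with = residual_var(dt, [dp, sp] + cp)
    return 0.5 * np.log(v_without / v_with)

# Toy process: an autocorrelated driver z influences both x and y;
# x has no direct effect on y, so any apparent x -> y flow is confounded.
rng = np.random.default_rng(1)
n = 20000
z = np.zeros(n); x = np.zeros(n); y = np.zeros(n)
for t in range(1, n):
    z[t] = 0.9 * z[t - 1] + rng.normal()
    x[t] = 0.8 * z[t - 1] + 0.3 * rng.normal()
    y[t] = 0.8 * z[t - 1] + 0.3 * rng.normal()

print("TE(x -> y)     =", round(te_gaussian(x, y), 4))        # spurious, clearly positive
print("TE(x -> y | z) =", round(te_gaussian(x, y, (z,)), 4))  # close to zero once z is conditioned on
```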

39 pages, 2509 KiB  
Article
Deriving Vocal Fold Oscillation Information from Recorded Voice Signals Using Models of Phonation
by Wayne Zhao and Rita Singh
Entropy 2023, 25(7), 1039; https://doi.org/10.3390/e25071039 - 10 Jul 2023
Cited by 2 | Viewed by 1841
Abstract
During phonation, the vocal folds exhibit a self-sustained oscillatory motion, which is influenced by the physical properties of the speaker’s vocal folds and driven by the balance of bio-mechanical and aerodynamic forces across the glottis. Subtle changes in the speaker’s physical state can affect voice production and alter these oscillatory patterns. Measuring these can be valuable in developing computational tools that analyze voice to infer the speaker’s state. Traditionally, vocal fold oscillations (VFOs) are measured directly using physical devices in clinical settings. In this paper, we propose a novel analysis-by-synthesis approach that allows us to infer the VFOs directly from recorded speech signals on an individualized, speaker-by-speaker basis. The approach, called the ADLES-VFT algorithm, is proposed in the context of a joint model that combines a phonation model (with a glottal flow waveform as the output) and a vocal tract acoustic wave propagation model such that the output of the joint model is an estimated waveform. The ADLES-VFT algorithm is a forward-backward algorithm which minimizes the error between the recorded waveform and the output of this joint model to estimate its parameters. Once estimated, these parameter values are used in conjunction with a phonation model to obtain its solutions. Since the parameters correlate with the physical properties of the vocal folds of the speaker, model solutions obtained using them represent the individualized VFOs for each speaker. The approach is flexible and can be applied to various phonation models. In addition to presenting the methodology, we show how the VFOs can be quantified from a dynamical systems perspective for classification purposes. Mathematical derivations are provided in an appendix for better readability.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
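
The analysis-by-synthesis idea can be sketched in miniature: synthesize a waveform from a parametric source model and adjust the parameters to minimize the error against the recorded signal. The toy model, parameter values, and optimizer below are illustrative assumptions only; they are not the ADLES-VFT algorithm or the phonation models used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def synthesize(params, t):
    """Toy stand-in for a phonation/vocal-tract model: a damped sinusoid
    parameterized by fundamental frequency, amplitude, and decay rate."""
    f0, amp, decay = params
    return amp * np.exp(-decay * t) * np.sin(2 * np.pi * f0 * t)

fs = 8000
t = np.arange(0, 0.03, 1.0 / fs)

# Pretend this is the recorded waveform; its true parameters are unknown to the fitter.
true_params = (140.0, 0.9, 6.0)
recorded = synthesize(true_params, t) + 0.01 * np.random.default_rng(2).normal(size=t.size)

# Analysis-by-synthesis: adjust the model parameters to minimize the waveform error.
loss = lambda p: np.mean((recorded - synthesize(p, t)) ** 2)
result = minimize(loss, x0=np.array([130.0, 0.7, 3.0]), method="Nelder-Mead")
print("estimated (f0, amp, decay):", np.round(result.x, 2))  # should land near (140, 0.9, 6)
```

As in any analysis-by-synthesis scheme, a reasonable initialization matters; the estimated parameters, not the waveform fit itself, are what carry the speaker-specific information.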

33 pages, 1486 KiB  
Article
A Gene-Based Algorithm for Identifying Factors That May Affect a Speaker’s Voice
by Rita Singh
Entropy 2023, 25(6), 897; https://doi.org/10.3390/e25060897 - 2 Jun 2023
Cited by 1 | Viewed by 1825
Abstract
Over the past decades, many machine-learning- and artificial-intelligence-based technologies have been created to deduce biometric or bio-relevant parameters of speakers from their voice. These voice profiling technologies have targeted a wide range of parameters, from diseases to environmental factors, based largely on the fact that they are known to influence voice. Recently, some have also explored the prediction of parameters whose influence on voice is not easily observable through data-opportunistic biomarker discovery techniques. However, given the enormous range of factors that can possibly influence voice, more informed methods for selecting those that may be potentially deducible from voice are needed. To this end, this paper proposes a simple path-finding algorithm that attempts to find links between vocal characteristics and perturbing factors using cytogenetic and genomic data. The links represent reasonable selection criteria for use by computational profiling technologies only, and are not intended to establish any unknown biological facts. The proposed algorithm is validated using a simple example from the medical literature: the clinically observed effects of specific chromosomal microdeletion syndromes on the vocal characteristics of affected people. In this example, the algorithm attempts to link the genes involved in these syndromes to a single example gene (FOXP2) that is known to play a broad role in voice production. We show that in cases where strong links are exposed, vocal characteristics of the patients are indeed reported to be correspondingly affected. Validation experiments and subsequent analyses confirm that the methodology could be potentially useful in predicting the existence of vocal signatures in naïve cases where their existence has not been otherwise observed.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
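
The abstract does not specify the algorithm's internals, but the general idea of searching for links between genes in an interaction graph can be sketched with a plain breadth-first search. The graph below is entirely synthetic; only FOXP2 is a real gene identifier, and the edges are illustrative assumptions, not curated biology.

```python
from collections import deque

def shortest_link(graph, start, target):
    """Breadth-first search for the shortest path between two nodes
    in an undirected interaction graph given as an adjacency-list dict."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no link found

# Entirely synthetic interaction graph; the edges are illustrative, not curated biology.
graph = {
    "GENE_A": ["GENE_B", "GENE_C"],
    "GENE_B": ["GENE_A", "FOXP2"],
    "GENE_C": ["GENE_A"],
    "FOXP2": ["GENE_B"],
}
print(shortest_link(graph, "GENE_A", "FOXP2"))  # ['GENE_A', 'GENE_B', 'FOXP2']
```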

15 pages, 4114 KiB  
Article
LFDNN: A Novel Hybrid Recommendation Model Based on DeepFM and LightGBM
by Houchou Han, Yanchun Liang, Gábor Bella, Fausto Giunchiglia and Dalin Li
Entropy 2023, 25(4), 638; https://doi.org/10.3390/e25040638 - 10 Apr 2023
Cited by 4 | Viewed by 2178
Abstract
Hybrid recommendation algorithms perform well in improving the accuracy of recommendation systems. However, in specific applications, they still cannot meet the requirements of the recommendation target due to the gap between the design of the algorithms and the data characteristics. In this paper, in order to learn higher-order feature interactions more efficiently and to better distinguish the importance of different feature interactions for the prediction results of recommendation algorithms, we propose a light and FM deep neural network (LFDNN), a hybrid recommendation model comprising four modules. The LightGBM module applies gradient boosting decision trees for feature processing, which improves LFDNN’s ability to handle dense numerical features; the shallow model introduces the FM model for explicitly modeling finite-order feature crosses, which strengthens the expressive ability of the model; the deep neural network module uses a fully connected feedforward neural network to allow the model to obtain more high-order feature cross information and mine more data patterns in the features; finally, the fusion module allows the shallow model and the deep model to obtain a better fusion effect. The results of comparison, parameter influence, and ablation experiments on two real advertisement datasets show that LFDNN achieves better performance than representative recommendation models.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
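
As a hedged sketch of one ingredient of such a hybrid model, the code below computes the factorization-machine second-order cross term with the usual O(kd) identity and fuses it additively with a stand-in deep-branch score. It is not the LFDNN architecture itself; all sizes, weights, and the fusion rule are illustrative assumptions.

```python
import numpy as np

def fm_second_order(x, V):
    """Second-order factorization-machine term sum_{i<j} <v_i, v_j> x_i x_j,
    computed for one sample with the usual O(k*d) identity."""
    xv = x @ V                   # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)   # shape (k,)
    return 0.5 * np.sum(xv ** 2 - x2v2)

rng = np.random.default_rng(3)
d, k = 10, 4                   # d features, k latent factors per feature (illustrative sizes)
x = rng.normal(size=d)         # one dense feature vector
V = rng.normal(size=(d, k))    # FM embedding matrix
w0, w = 0.1, rng.normal(size=d)

shallow_score = w0 + w @ x + fm_second_order(x, V)           # explicit low-order crosses
deep_score = 0.7                                             # stand-in for the DNN branch output
fused = 1.0 / (1.0 + np.exp(-(shallow_score + deep_score)))  # naive additive fusion into a probability
print(round(float(fused), 4))
```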

24 pages, 821 KiB  
Article
Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms
by Chunyan Zeng, Shixiong Feng, Dongliang Zhu and Zhifeng Wang
Entropy 2023, 25(4), 626; https://doi.org/10.3390/e25040626 - 6 Apr 2023
Cited by 8 | Viewed by 1827
Abstract
Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-attention mechanisms to tackle this problem. In the deep feature extraction stage of recording devices, a two-branch network based on residual dense temporal convolution networks (RD-TCNs) and convolutional neural networks (CNNs) is constructed. The spatial probability distribution features of audio signals are employed as inputs to the branch of the CNN for spatial representation learning, and the temporal spectral features of audio signals are fed into the branch of the RD-TCN network for temporal representation learning. This achieves simultaneous learning of long-term and short-term features to obtain an accurate representation of device-related information. In the spatiotemporal feature fusion stage, three attention mechanisms—temporal, spatial, and branch attention mechanisms—are designed to capture spatiotemporal weights and achieve effective deep feature fusion. The proposed framework achieves state-of-the-art performance on the benchmark CCNU_Mobile dataset, reaching an accuracy of 97.6% for the identification of 45 recording devices, with a significant reduction in training time compared to other models.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
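
A minimal PyTorch sketch of branch-level attention fusion is shown below: two branch embeddings, standing in for the CNN and RD-TCN outputs, are combined with learned, softmax-normalized weights. The module, dimensions, and inputs are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class BranchAttentionFusion(nn.Module):
    """Fuse two branch embeddings with learned, softmax-normalized branch weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each branch embedding

    def forward(self, spatial_feat, temporal_feat):
        branches = torch.stack([spatial_feat, temporal_feat], dim=1)  # (batch, 2, dim)
        weights = torch.softmax(self.score(branches), dim=1)          # (batch, 2, 1)
        return (weights * branches).sum(dim=1)                        # (batch, dim)

# Stand-ins for the spatial (CNN) and temporal (RD-TCN) branch outputs.
spatial_feat = torch.randn(8, 128)
temporal_feat = torch.randn(8, 128)
fused = BranchAttentionFusion(128)(spatial_feat, temporal_feat)
print(fused.shape)  # torch.Size([8, 128])
```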

16 pages, 715 KiB  
Article
Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
by Laurent Benaroya, Nicolas Obin and Axel Roebel
Entropy 2023, 25(2), 375; https://doi.org/10.3390/e25020375 - 18 Feb 2023
Cited by 2 | Viewed by 1904
Abstract
Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by means of minimizing adversarial loss to make the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
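
The fader-style training signal can be sketched as follows: an adversary tries to predict the attribute from the latent code, while the encoder-decoder is trained both to reconstruct the signal and to fool the adversary. All layer shapes, the toy data, and the 0.1 weighting below are illustrative assumptions, not the paper's architecture or loss weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, latent, n_attrs = 80, 32, 2                # illustrative sizes (feature dim, latent dim, e.g. gender)
encoder = nn.Linear(dim, latent)                # stand-in for the speech encoder
decoder = nn.Linear(latent + n_attrs, dim)      # decoder conditioned on the attribute code
adversary = nn.Linear(latent, n_attrs)          # tries to recover the attribute from the latent

x = torch.randn(16, dim)                        # a batch of toy "speech" feature frames
attr = torch.randint(0, n_attrs, (16,))         # ground-truth attribute labels
attr_onehot = F.one_hot(attr, n_attrs).float()

z = encoder(x)
recon = decoder(torch.cat([z, attr_onehot], dim=1))

adversary_loss = F.cross_entropy(adversary(z.detach()), attr)  # adversary learns to predict the attribute
fool_loss = -F.cross_entropy(adversary(z), attr)               # encoder is pushed to hide the attribute
encoder_decoder_loss = F.mse_loss(recon, x) + 0.1 * fool_loss  # reconstruction + adversarial term

print(float(adversary_loss), float(encoder_decoder_loss))
```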

17 pages, 3892 KiB  
Article
Improved Transformer-Based Dual-Path Network with Amplitude and Complex Domain Feature Fusion for Speech Enhancement
by Moujia Ye and Hongjie Wan
Entropy 2023, 25(2), 228; https://doi.org/10.3390/e25020228 - 26 Jan 2023
Cited by 3 | Viewed by 2403
Abstract
Most previous speech enhancement methods only predict amplitude features, but more and more studies have proved that phase information is crucial for speech quality. Recently, there have also been some methods to choose complex features, but complex masks are difficult to estimate. Removing noise while maintaining good speech quality at low signal-to-noise ratios is still a problem. This study proposes a dual-path network structure for speech enhancement that can model complex spectra and amplitudes simultaneously, and introduces an attention-aware feature fusion module to fuse the two features to facilitate overall spectrum recovery. In addition, we improve a transformer-based feature extraction module that can efficiently extract local and global features. The proposed network achieves better performance than the baseline models in experiments on the Voice Bank + DEMAND dataset. We also conducted ablation experiments to verify the effectiveness of the dual-path structure, the improved transformer, and the fusion module, and investigated the effect of the input-mask multiplication strategy on the results.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
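
A minimal sketch of the two feature domains and of the input-mask multiplication strategy is given below using torch.stft. The random mask stands in for the network's prediction, and all sizes and signals are illustrative assumptions rather than the paper's model.

```python
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
noisy = torch.randn(1, 16000)  # 1 s of toy "noisy speech" at 16 kHz

spec = torch.stft(noisy, n_fft, hop_length=hop, window=window, return_complex=True)
magnitude = spec.abs()                                       # amplitude-domain features
complex_feat = torch.stack([spec.real, spec.imag], dim=1)    # complex-domain (real/imaginary) features

# Stand-in for the network's prediction: a magnitude mask in [0, 1].
# The input-mask multiplication strategy scales the noisy spectrum by this mask.
mask = torch.sigmoid(torch.randn_like(magnitude))
enhanced_spec = torch.polar(mask * magnitude, spec.angle())
enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)
print(magnitude.shape, complex_feat.shape, enhanced.shape)
```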

15 pages, 529 KiB  
Article
Audio Augmentation for Non-Native Children’s Speech Recognition through Discriminative Learning
by Kodali Radha and Mohan Bansal
Entropy 2022, 24(10), 1490; https://doi.org/10.3390/e24101490 - 19 Oct 2022
Cited by 15 | Viewed by 2474
Abstract
Automatic speech recognition (ASR) in children is a rapidly evolving field, as children become more accustomed to interacting with virtual assistants, such as Amazon Echo, Cortana, and other smart speakers, and it has advanced the human–computer interaction in recent generations. Furthermore, non-native children are observed to exhibit a diverse range of reading errors during second language (L2) acquisition, such as lexical disfluency, hesitations, intra-word switching, and word repetitions, which are not yet addressed, resulting in ASR’s struggle to recognize non-native children’s speech. The main objective of this study is to develop a non-native children’s speech recognition system on top of feature-space discriminative models, such as feature-space maximum mutual information (fMMI) and boosted feature-space maximum mutual information (fbMMI). Harnessing the collaborative power of speed perturbation-based data augmentation on the original children’s speech corpora yields an effective performance. The corpus focuses on different speaking styles of children, together with read speech and spontaneous speech, in order to investigate the impact of non-native children’s L2 speaking proficiency on speech recognition systems. The experiments revealed that feature-space MMI models with steadily increasing speed perturbation factors outperform traditional ASR baseline models.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
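
Speed perturbation itself is simple to sketch: resampling an utterance by factors such as 0.9, 1.0, and 1.1 changes its duration (and pitch), yielding additional training copies of the corpus. The linear-interpolation implementation below is a generic illustration, not the toolkit or factors used in the paper.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample a waveform so that it plays `factor` times faster
    (shorter and pitch-shifted for factor > 1), via linear interpolation."""
    n_out = int(round(len(waveform) / factor))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

rng = np.random.default_rng(4)
utterance = rng.normal(size=16000)  # 1 s of toy audio at 16 kHz
augmented = [speed_perturb(utterance, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])  # roughly [17778, 16000, 14545]
```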

12 pages, 1159 KiB  
Article
Multi-Task Transformer with Adaptive Cross-Entropy Loss for Multi-Dialect Speech Recognition
by Zhengjia Dan, Yue Zhao, Xiaojun Bi, Licheng Wu and Qiang Ji
Entropy 2022, 24(10), 1429; https://doi.org/10.3390/e24101429 - 8 Oct 2022
Cited by 11 | Viewed by 2751
Abstract
At present, most multi-dialect speech recognition models are based on a hard-parameter-sharing multi-task structure, which makes it difficult to reveal how one task contributes to others. In addition, in order to balance multi-task learning, the weights of the multi-task objective function need to be manually adjusted. This makes multi-task learning very difficult and costly because it requires constantly trying various combinations of weights to determine the optimal task weights. In this paper, we propose a multi-dialect acoustic model that combines soft-parameter-sharing multi-task learning with Transformer, and introduce several auxiliary cross-attentions to enable the auxiliary task (dialect ID recognition) to provide dialect information for the multi-dialect speech recognition task. Furthermore, we use the adaptive cross-entropy loss function as the multi-task objective function, which automatically balances the learning of the multi-task model according to the loss proportion of each task during the training process. Therefore, the optimal weight combination can be found without any manual intervention. Finally, for the two tasks of multi-dialect (including low-resource dialect) speech recognition and dialect ID recognition, the experimental results show that, compared with single-dialect Transformer, single-task multi-dialect Transformer, and multi-task Transformer with hard parameter sharing, our method significantly reduces the average syllable error rate of Tibetan multi-dialect speech recognition and the character error rate of Chinese multi-dialect speech recognition.
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)
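
The abstract does not give the exact weighting formula, but one plausible reading, weighting each task by its current share of the total loss, can be sketched as follows. This is a generic illustration of loss-proportion balancing, not necessarily the paper's adaptive cross-entropy loss.

```python
import torch

def proportion_weights(losses):
    """One generic way to derive task weights from current loss proportions:
    each task is weighted by its (detached) share of the total loss, so tasks
    whose loss is still large receive more emphasis at that step."""
    detached = torch.stack([loss.detach() for loss in losses])
    return detached / detached.sum()

# Stand-ins for the per-task losses at one training step
# (e.g., the multi-dialect ASR loss and the dialect ID recognition loss).
asr_loss = torch.tensor(2.4, requires_grad=True)
dialect_id_loss = torch.tensor(0.6, requires_grad=True)

w = proportion_weights([asr_loss, dialect_id_loss])
total = w[0] * asr_loss + w[1] * dialect_id_loss  # weighted multi-task objective
print(w.tolist(), float(total))
```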