IberSPEECH 2018: Speech and Language Technologies for Iberian Languages

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (14 July 2019) | Viewed by 46618

Special Issue Editors


Prof. Dr. Francesc Alías
Guest Editor
Grup de recerca en Tecnologies Mèdia (GTM), La Salle—Universitat Ramon Llull, 08022 Barcelona, Spain
Interests: speech processing; speech analysis and synthesis; voice production; expressive speech; human–computer interaction; acoustic event detection; acoustic signal processing; machine listening; real-time noise monitoring; impact of noise events; real-life acoustic datasets; wireless acoustic sensor networks

Dr. Jordi Luque
Guest Editor
Telefónica Research, Barcelona, Spain

Dr. Antonio Bonafonte
Guest Editor
Universitat Politècnica de Catalunya, Barcelona, Spain

Dr. António Teixeira
Guest Editor
Universidade de Aveiro, Portugal

Special Issue Information

Dear Colleagues,

Following previous editions, IberSPEECH 2018 will be held in Barcelona from 21 to 23 November 2018. The IberSPEECH event—the fourth of its kind under this name—brings together the X Jornadas en Tecnologías del Habla and the V Iberian SLTech Workshop. The conference provides a platform for scientific and industrial discussion and exchange around Iberian languages, with the following main topics of interest:

  1. Speech technology and applications
  2. Human speech production, perception, and communication
  3. Natural language processing and applications
  4. Speech, language and multimodality
  5. Resources, standardization, and evaluation

The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language processing, paying special attention to those focused on Iberian languages. We invite researchers interested in these fields to contribute to this issue, which covers all the fields of IberSPEECH2018 (http://iberspeech2018.talp.cat/index.php/call-for-papers/). Papers submitted from the conference will enjoy a 10% discount. (The Article Processing Charge (APC) for publication in Applied Sciences is 1260 CHF (Swiss Francs) for submissions before the end of December 2018 and 1350 CHF (Swiss Francs) after 1 January 2019.)

For those papers selected from the conference, authors are asked to follow the journal's instructions regarding “Preprints and Conference Papers”: https://www.mdpi.com/journal/applsci/instructions#preprints

Prof. Dr. Francesc Alías
Dr. Jordi Luque
Dr. Antonio Bonafonte
Dr. António Teixeira
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Speech technology and applications
  • Human speech production, perception, and communication
  • Natural language processing and applications
  • Speech, language and multimodality
  • Resources, standardization, and evaluation

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (14 papers)


Editorial


8 pages, 205 KiB  
Editorial
Editorial for Special Issue “IberSPEECH2018: Speech and Language Technologies for Iberian Languages”
by Francesc Alías, Antonio Bonafonte and António Teixeira
Appl. Sci. 2020, 10(1), 384; https://doi.org/10.3390/app10010384 - 4 Jan 2020
Viewed by 2039
Abstract
The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language technologies based on the works presented at the IberSPEECH edition held in Barcelona in 2018, paying special attention to those focused on Iberian languages. IberSPEECH is the international conference of the Special Interest Group on Iberian Languages (SIG-IL) of the International Speech Communication Association (ISCA) and of the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, or RTTH for short). Several researchers were invited to extend their contributions presented at IberSPEECH2018 due to their interest and quality. As a result, this Special Issue is composed of 13 papers that cover different topics of investigation related to perception, speech analysis and enhancement, speaker verification and identification, speech production and synthesis, natural language processing, together with several applications and evaluation challenges. Full article

Research


22 pages, 353 KiB  
Article
Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media
by Eduardo Lleida, Alfonso Ortega, Antonio Miguel, Virginia Bazán-Gil, Carmen Pérez, Manuel Gómez and Alberto de Prada
Appl. Sci. 2019, 9(24), 5412; https://doi.org/10.3390/app9245412 - 11 Dec 2019
Cited by 29 | Viewed by 3718
Abstract
The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla (RTTH)). This series was focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporación Radio Televisión Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available to scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, and specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained. Full article

12 pages, 2905 KiB  
Article
Glottal Source Contribution to Higher Order Modes in the Finite Element Synthesis of Vowels
by Marc Freixes, Marc Arnela, Joan Claudi Socoró, Francesc Alías and Oriol Guasch
Appl. Sci. 2019, 9(21), 4535; https://doi.org/10.3390/app9214535 - 25 Oct 2019
Cited by 11 | Viewed by 2471
Abstract
Articulatory speech synthesis has long been based on one-dimensional (1D) approaches. They assume plane wave propagation within the vocal tract and disregard higher order modes that typically appear above 5 kHz. However, such modes may be relevant in obtaining a more natural voice, especially for phonation types with significant high frequency energy (HFE) content. This work studies the contribution of the glottal source at high frequencies in the 3D numerical synthesis of vowels. The spoken vocal range is explored using an LF (Liljencrants–Fant) model enhanced with aspiration noise and controlled by the Rd glottal shape parameter. The vowels [ɑ], [i], and [u] are generated with a finite element method (FEM) using realistic 3D vocal tract geometries obtained from magnetic resonance imaging (MRI), as well as simplified straight vocal tracts of a circular cross-sectional area. The symmetry of the latter prevents the onset of higher order modes. Thus, the comparison between realistic and simplified geometries enables us to analyse the influence of such modes. The simulations indicate that higher order modes may be perceptually relevant, particularly for tense phonations (lower Rd values) and/or high fundamental frequency (F0) values. Conversely, vowels with a lax phonation and/or low F0s may result in inaudible HFE levels, especially if aspiration noise is not considered in the glottal source model. Full article
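
For readers who want to experiment, a glottal source of this kind can be sketched in a few lines. The snippet below uses a Rosenberg-style pulse with additive aspiration noise as a simplified stand-in for the paper's Rd-controlled LF model (all parameter values are illustrative assumptions, not those of the study):

```python
import numpy as np

def glottal_source(f0=120.0, fs=16000, dur=0.5, open_quotient=0.6,
                   noise_level=0.02):
    """Rosenberg-style glottal flow pulse train with additive aspiration
    noise. A simplified stand-in for the Rd-controlled LF model used in
    the paper; all parameter values here are illustrative assumptions."""
    t0 = int(fs / f0)                      # samples per glottal cycle
    n_open = int(open_quotient * t0)       # open-phase length
    pulse = np.zeros(t0)
    # rising branch (glottis opening) and falling branch (closing)
    n_rise = int(0.6 * n_open)
    pulse[:n_rise] = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    n_fall = n_open - n_rise
    pulse[n_rise:n_open] = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
    flow = np.tile(pulse, int(dur * fs / t0))
    # aspiration noise, stronger during the open phase (crude modulation)
    noise = noise_level * np.random.randn(flow.size) * (flow + 0.1)
    return flow + noise

source = glottal_source()
print(source.shape)
```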

14 pages, 358 KiB  
Article
Towards a Universal Semantic Dictionary
by Maria Jose Castro-Bleda, Eszter Iklódi, Gábor Recski and Gábor Borbély
Appl. Sci. 2019, 9(19), 4060; https://doi.org/10.3390/app9194060 - 28 Sep 2019
Cited by 3 | Viewed by 2334
Abstract
A novel method for finding linear mappings among word embeddings for several languages, taking as pivot a shared, multilingual embedding space, is proposed in this paper. Previous approaches learned translation matrices between two specific languages, while this method learns translation matrices between a given language and a shared, multilingual space. The system was first trained on bilingual, and later on multilingual corpora as well. In the first case, two different training data were applied: Dinu’s English–Italian benchmark data, and English–Italian translation pairs extracted from the PanLex database. In the second case, only the PanLex database was used. The system performs on English–Italian languages with the best setting significantly better than the baseline system given by Mikolov, and it provides a comparable performance with more sophisticated systems. Exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings among an arbitrary number of languages. Full article
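
At its core, learning such a translation matrix is a least-squares problem over aligned embedding pairs. Below is a minimal sketch of the Mikolov-style linear mapping, with toy random data standing in for real embeddings such as the PanLex-derived pairs used in the paper:

```python
import numpy as np

# Toy stand-ins for real word embeddings: rows are word vectors of the
# source language (X) and of the shared multilingual pivot space (Z),
# aligned so that row i of X translates to row i of Z.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))    # source-language embeddings
W_true = rng.normal(size=(300, 300))
Z = X @ W_true + 0.01 * rng.normal(size=(1000, 300))  # pivot-space targets

# Learn the translation matrix W minimising ||XW - Z||^2 (Mikolov-style).
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Map a source word into the pivot space and find its nearest
# neighbour among known pivot vectors (cosine similarity).
query = X[0] @ W
sims = (Z @ query) / (np.linalg.norm(Z, axis=1) * np.linalg.norm(query))
print("nearest pivot index:", int(np.argmax(sims)))   # expect 0
```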

13 pages, 360 KiB  
Article
Summarization of Spanish Talk Shows with Siamese Hierarchical Attention Networks
by J.-A. González, L.-F. Hurtado, E. Segarra, F. García-Granada and E. Sanchis
Appl. Sci. 2019, 9(18), 3836; https://doi.org/10.3390/app9183836 - 12 Sep 2019
Cited by 5 | Viewed by 2836
Abstract
In this paper, we present an approach to the summarization of Spanish talk shows. Our approach is based on the use of Siamese Neural Networks on the transcription of the show audios. Specifically, we propose to use Hierarchical Attention Networks to select the most relevant sentences for each speaker about a given topic in the show, in order to summarize their opinion about the topic. We train these networks in a siamese way to determine whether a summary is appropriate or not. A previous evaluation of this approach on a summarization task over English newspapers achieved performance similar to other state-of-the-art systems. In the absence of enough transcribed or recognized speech data to train our system for talk show summarization in Spanish, we acquired a large corpus of document–summary pairs from Spanish newspapers and used it to train our system. We chose the newspaper domain due to its high similarity with the topics addressed in talk shows. A preliminary evaluation of our summarization system on Spanish TV programs shows the adequacy of the proposal. Full article
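
As a rough illustration of the siamese scoring idea, the toy forward pass below applies one shared attention-pooling encoder to both the document and a candidate summary and scores them by cosine similarity. The actual system uses trained hierarchical word- and sentence-level encoders, so every weight here is an untrained stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                              # sentence-embedding dimension (assumed)
W = rng.normal(size=(D, D)) * 0.1   # attention projection (untrained)
v = rng.normal(size=D)              # attention context vector (untrained)

def encode(sentences):
    """Shared encoder: attention-weighted pooling of sentence vectors."""
    scores = np.tanh(sentences @ W) @ v          # one score per sentence
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                         # softmax attention weights
    return alpha @ sentences                     # pooled representation

def siamese_score(doc, summary):
    """Cosine similarity between document and summary encodings."""
    d, s = encode(doc), encode(summary)
    return float(d @ s / (np.linalg.norm(d) * np.linalg.norm(s)))

doc = rng.normal(size=(30, D))      # 30 sentence embeddings (toy data)
summary = doc[:3]                   # a candidate "summary"
print(siamese_score(doc, summary))
```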

19 pages, 511 KiB  
Article
An Analysis of the Short Utterance Problem for Speaker Characterization
by Ignacio Viñals, Alfonso Ortega, Antonio Miguel and Eduardo Lleida
Appl. Sci. 2019, 9(18), 3697; https://doi.org/10.3390/app9183697 - 5 Sep 2019
Cited by 6 | Viewed by 3028
Abstract
Speaker characterization has always been conditioned by the length of the evaluated utterances. Despite performing well with large amounts of audio, significant degradations in performance are obtained when short utterances are considered. In this work we present an analysis of the short utterance problem providing an alternative point of view. From our perspective, the performance in the evaluation of short utterances is highly influenced by the phonetic similarity between enrollment and test utterances. Both enrollment and test should contain similar phonemes to properly discriminate, performance being degraded otherwise. In this study we also interpret short utterances as incomplete long utterances where some acoustic units are either unbalanced or just missing. These missing units make the speaker representations unreliable. These unreliable representations are biased with respect to the reference counterparts obtained from long utterances. These undesired shifts increase the intra-speaker variability, causing a significant loss of performance. According to our experiments, short utterances (3–60 s) can perform as accurately as long utterances by just reassuring the phonetic distributions. This analysis is determined by the current embedding extraction approach, based on the accumulation of local short-time information, and is thus applicable to most state-of-the-art embeddings, including traditional i-vectors and Deep Neural Network (DNN) x-vectors. Full article

14 pages, 573 KiB  
Article
Exploring Efficient Neural Architectures for Linguistic–Acoustic Mapping in Text-To-Speech
by Santiago Pascual, Joan Serrà and Antonio Bonafonte
Appl. Sci. 2019, 9(16), 3391; https://doi.org/10.3390/app9163391 - 17 Aug 2019
Cited by 1 | Viewed by 3339
Abstract
Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU. Full article
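
The quasi-recurrent idea can be made concrete in a short sketch: the gates come from a causal convolution (pure feed-forward matrix products), leaving only a cheap elementwise recurrence. A minimal, untrained toy version (not the authors' implementation):

```python
import numpy as np

def qrnn_layer(x, Wz, Wf, k=2):
    """Minimal quasi-recurrent layer with f-pooling (toy sketch).
    Gates are computed by a width-k causal convolution, so only the
    cheap elementwise recurrence below remains sequential."""
    T, d = x.shape
    xp = np.vstack([np.zeros((k - 1, d)), x])             # causal padding
    windows = np.hstack([xp[t:t + T] for t in range(k)])  # (T, k*d)
    z = np.tanh(windows @ Wz)                             # candidate values
    f = 1.0 / (1.0 + np.exp(-(windows @ Wf)))             # forget gates
    h = np.zeros_like(z)
    for t in range(T):                                    # elementwise recurrence
        prev = h[t - 1] if t > 0 else 0.0
        h[t] = f[t] * prev + (1.0 - f[t]) * z[t]
    return h

rng = np.random.default_rng(0)
T, d, k, hdim = 100, 32, 2, 32
x = rng.normal(size=(T, d))
h = qrnn_layer(x, rng.normal(size=(k * d, hdim)) * 0.1,
               rng.normal(size=(k * d, hdim)) * 0.1)
print(h.shape)
```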

12 pages, 1115 KiB  
Article
Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification
by Victoria Mingote, Antonio Miguel, Alfonso Ortega and Eduardo Lleida
Appl. Sci. 2019, 9(16), 3295; https://doi.org/10.3390/app9163295 - 11 Aug 2019
Cited by 10 | Viewed by 3302
Abstract
In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from the global average pooling of the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model to keep the temporal structure of each phrase, since the phonetic information is relevant in the verification task. Moreover, we can apply a convolutional neural network as a front-end, and, thanks to the alignment process being differentiable, we can train the network to produce a supervector for each utterance that is discriminative with respect to the speaker and the phrase simultaneously. This choice has the advantage that the supervector encodes the phrase and speaker information, providing good performance in text-dependent speaker verification tasks. The verification process is performed using a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to similar-sized networks that make use of global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on RSR2015-Part II. To our knowledge, this system achieves the best published results obtained on this second part. Full article
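
The contrast between global average pooling and alignment-based pooling is easy to sketch: assign frames to phonetic states, average per state, and concatenate the state means. The toy version below assumes a hard alignment is already given (in the paper the alignment is learned and differentiable):

```python
import numpy as np

def global_avg_embedding(frames):
    """Baseline: average pooling over time discards temporal structure."""
    return frames.mean(axis=0)

def aligned_supervector(frames, alignment, n_states):
    """Alignment-based pooling (toy sketch): average the frames assigned
    to each phonetic state, then concatenate the per-state means so the
    phrase's temporal/phonetic structure is preserved."""
    d = frames.shape[1]
    sv = np.zeros((n_states, d))
    for s in range(n_states):
        mask = alignment == s
        if mask.any():
            sv[s] = frames[mask].mean(axis=0)
    return sv.ravel()                          # (n_states * d,) supervector

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 40))            # toy acoustic features
alignment = np.sort(rng.integers(0, 10, size=200))  # toy state labels
print(global_avg_embedding(frames).shape)           # (40,)
print(aligned_supervector(frames, alignment, 10).shape)  # (400,)
```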

16 pages, 311 KiB  
Article
Intelligibility and Listening Effort of Spanish Oesophageal Speech
by Sneha Raman, Luis Serrano, Axel Winneke, Eva Navas and Inma Hernaez
Appl. Sci. 2019, 9(16), 3233; https://doi.org/10.3390/app9163233 - 8 Aug 2019
Cited by 4 | Viewed by 3904
Abstract
Communication is a huge challenge for oesophageal speakers, be it for interactions with fellow humans or with digital voice assistants. We aim to quantify these communication challenges (both human–human and human–machine interactions) by measuring intelligibility and Listening Effort (LE) of Oesophageal Speech (OS) in comparison to Healthy Laryngeal Speech (HS). We conducted two listening tests (one web-based, the other in laboratory settings) to collect these measurements. Participants performed a sentence recognition and LE rating task in each test. Intelligibility, calculated as Word Error Rate, showed significant correlation with self-reported LE ratings. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. More LE was reported for OS compared to HS even when OS intelligibility was close to HS. Listeners familiar with OS reported less effort when listening to OS compared to nonfamiliar listeners. However, such advantage of familiarity was not observed for intelligibility. Automatic speech recognition scores were higher for OS compared to HS. Full article
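
Intelligibility reported as Word Error Rate reduces to a Levenshtein alignment over words; a self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[-1][-1] / len(ref)

print(word_error_rate("el perro come", "el gato come ya"))  # 2/3
```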

16 pages, 1229 KiB  
Article
Application of Pitch Derived Parameters to Speech and Monophonic Singing Classification
by Xabier Sarasola, Eva Navas, David Tavarez, Luis Serrano, Ibon Saratxaga and Inma Hernaez
Appl. Sci. 2019, 9(15), 3140; https://doi.org/10.3390/app9153140 - 2 Aug 2019
Cited by 4 | Viewed by 3175
Abstract
Speech and singing voice discrimination is an important task in the speech processing area, given that each type of voice requires different information retrieval and signal processing techniques. This discrimination task is hard even for humans, depending on the length of the voice segments. In this article, we present an automatic speech and singing voice classification method using pitch parameters derived from musical note information and f0 stability analysis. We applied our method to a database containing speech and a cappella singing and compared the results with other discrimination techniques based on information derived from pitch and spectral envelope. Our method obtains good results discriminating both voice types, is efficient, has good generalisation capabilities and is computationally fast. In the process, we have also created a note detection algorithm with parametric control of the characteristics of the notes it detects. We compared the agreement of this algorithm with a state-of-the-art note detection algorithm and performed an experiment that proves that speech and singing discrimination parameters can represent generic information about the music style of the singing voice. Full article
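
One way to picture an f0-stability feature is to measure how many voiced frames fall inside note-like runs of near-constant pitch, which tend to be more frequent in singing than in speech. The sketch below is an illustrative simplification with assumed thresholds, not the paper's note detection algorithm:

```python
import numpy as np

def stable_pitch_ratio(f0, max_dev_semitones=0.5, min_len=10):
    """Fraction of voiced frames lying in 'note-like' stable segments:
    voiced runs of at least `min_len` frames whose pitch stays within
    +/- max_dev_semitones of the run's median. Singing tends to score
    higher than speech. Thresholds here are illustrative assumptions."""
    voiced = f0 > 0
    semitones = np.full_like(f0, np.nan)
    semitones[voiced] = 12 * np.log2(f0[voiced] / 55.0)
    stable, i = 0, 0
    while i < len(f0):
        if not voiced[i]:
            i += 1
            continue
        j = i
        while j < len(f0) and voiced[j]:
            j += 1                       # [i, j) is one voiced run
        run = semitones[i:j]
        if (j - i >= min_len and
                np.abs(run - np.median(run)).max() <= max_dev_semitones):
            stable += j - i
        i = j
    return stable / max(voiced.sum(), 1)

f0 = np.concatenate([np.full(50, 220.0),                       # held note
                     np.zeros(5),                              # unvoiced gap
                     220 * 2 ** (np.linspace(0, 6, 50) / 12)])  # pitch glide
print(stable_pitch_ratio(f0))   # ~0.5: half the voiced frames are note-like
```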

17 pages, 703 KiB  
Article
Restricted Boltzmann Machine Vectors for Speaker Clustering and Tracking Tasks in TV Broadcast Shows
by Umair Khan, Pooyan Safari and Javier Hernando
Appl. Sci. 2019, 9(13), 2761; https://doi.org/10.3390/app9132761 - 9 Jul 2019
Cited by 4 | Viewed by 3576
Abstract
Restricted Boltzmann Machines (RBMs) have shown success in both the front-end and back-end of speaker verification systems. In this paper, we propose applying RBMs to the front-end for the tasks of speaker clustering and speaker tracking in TV broadcast shows. RBMs are trained to transform utterances into a vector-based representation. Because of the lack of data for a test speaker, we propose RBM adaptation to a global model. First, the global model—referred to as the universal RBM—is trained with all the available background data. Then an adapted RBM model is trained with the data of each test speaker. The visible-to-hidden weight matrices of the adapted models are concatenated along with the bias vectors and are whitened to generate the vector representation of speakers. These vectors, referred to as RBM vectors, were shown to preserve speaker-specific information and are used in the tasks of speaker clustering and speaker tracking. The evaluation was performed on the audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed speaker clustering system gained up to 12% relative improvement, in terms of Equal Impurity (EI), over the baseline system. On the other hand, in the task of speaker tracking, our system shows a relative improvement of 11% and 7% compared to the baseline system using cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring, respectively. Full article
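
The RBM-vector construction itself is straightforward once the adapted models exist: flatten the visible-to-hidden weights, append the biases, and whiten over a background set. A toy sketch (random matrices stand in for trained and adapted RBMs):

```python
import numpy as np

def rbm_vector(weights, vis_bias, hid_bias):
    """Concatenate an adapted RBM's visible-to-hidden weight matrix with
    its bias vectors into a single high-dimensional vector (toy sketch)."""
    return np.concatenate([weights.ravel(), vis_bias, hid_bias])

def whiten(vectors, eps=1e-5):
    """PCA whitening estimated on background RBM vectors."""
    mu = vectors.mean(axis=0)
    centered = vectors - mu
    cov = centered.T @ centered / len(vectors)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec / np.sqrt(eigval + eps)
    return centered @ W, mu, W

rng = np.random.default_rng(0)
# Toy adapted models for 20 background speakers (visible=30, hidden=10).
vecs = np.stack([rbm_vector(rng.normal(size=(30, 10)),
                            rng.normal(size=30), rng.normal(size=10))
                 for _ in range(20)])
white, mu, W = whiten(vecs)
print(white.shape)   # (20, 340): one whitened RBM vector per speaker
```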

21 pages, 316 KiB  
Article
Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation
by Juan M. Martín-Doñas, Antonio M. Peinado, Iván López-Espejo and Angel Gomez
Appl. Sci. 2019, 9(12), 2520; https://doi.org/10.3390/app9122520 - 20 Jun 2019
Cited by 5 | Viewed by 3587
Abstract
This paper deals with speech enhancement in dual-microphone smartphones using beamforming along with postfiltering techniques. The performance of these algorithms relies on a good estimation of the acoustic channel and speech and noise statistics. In this work we present a speech enhancement system that combines the estimation of the relative transfer function (RTF) between microphones using an extended Kalman filter framework with a novel speech presence probability estimator intended to track the noise statistics’ variability. The available dual-channel information is exploited to obtain more reliable estimates of clean speech statistics. Noise reduction is further improved by means of postfiltering techniques that take advantage of the speech presence estimation. Our proposal is evaluated in different reverberant and noisy environments when the smartphone is used in both close-talk and far-talk positions. The experimental results show that our system achieves improvements in terms of noise reduction, low speech distortion and better speech intelligibility compared to other state-of-the-art approaches. Full article
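
As a rough picture of Kalman-based RTF tracking, the sketch below runs a scalar, linear Kalman filter on one frequency bin with the observation model Y2 = a·Y1 + noise; the paper's extended Kalman filter framework is more general, and the noise variances here are assumed values:

```python
import numpy as np

def kalman_rtf(Y1, Y2, q=1e-4, r=1e-1):
    """Track the relative transfer function a between two microphone
    channels in one frequency bin, frame by frame, with a scalar Kalman
    filter (a linearised sketch of the paper's EKF; q and r are assumed
    process/observation noise variances). Model: Y2_t = a_t * Y1_t + v_t."""
    a = 0.0 + 0.0j           # RTF estimate
    p = 1.0                  # estimate variance
    track = []
    for y1, y2 in zip(Y1, Y2):
        p = p + q                                        # predict (random walk)
        k = p * np.conj(y1) / (np.abs(y1) ** 2 * p + r)  # Kalman gain
        a = a + k * (y2 - y1 * a)                        # innovation update
        p = (1 - k * y1).real * p
        track.append(a)
    return np.array(track)

rng = np.random.default_rng(0)
T = 500
a_true = 0.8 + 0.3j
Y1 = rng.normal(size=T) + 1j * rng.normal(size=T)
Y2 = a_true * Y1 + 0.1 * (rng.normal(size=T) + 1j * rng.normal(size=T))
est = kalman_rtf(Y1, Y2)
print(np.round(est[-1], 3))   # converges close to 0.8+0.3j
```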

14 pages, 526 KiB  
Article
Data Augmentation for Speaker Identification under Stress Conditions to Combat Gender-Based Violence
by Esther Rituerto-González, Alba Mínguez-Sánchez, Ascensión Gallardo-Antolín and Carmen Peláez-Moreno
Appl. Sci. 2019, 9(11), 2298; https://doi.org/10.3390/app9112298 - 4 Jun 2019
Cited by 18 | Viewed by 4495
Abstract
A speaker identification system for a personalized wearable device to combat gender-based violence is presented in this paper. Speaker recognition systems exhibit a decrease in performance when the user is under emotional or stress conditions; thus, the objective of this paper is to measure the effects of stress on speech and ultimately try to mitigate its consequences on a speaker identification task, by using data augmentation techniques specifically tailored for this purpose given the lack of data resources for this condition. Extensive experimentation has been carried out to assess the effectiveness of the proposed techniques. First, we conclude that the best performance is always obtained when naturally stressed samples are included in the training set; second, when these are not available, their substitution and augmentation with synthetically generated stress-like samples improves the performance of the system. Full article
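
As a rough illustration of stress-like augmentation, the sketch below perturbs an utterance along known correlates of stressed speech (raised pitch, faster rate, higher energy) via simple resampling and gain. This is an assumed, simplified surrogate, not the augmentation method of the paper:

```python
import numpy as np

def stress_like_augment(signal, speed=1.1, gain_db=3.0):
    """Generate a crude 'stress-like' variant of an utterance: resampling
    by `speed` raises pitch and speeds delivery, and a gain raises energy.
    These correlates (higher F0, faster rate, more energy) are typical of
    stressed speech; the paper's augmentation is more elaborate, so treat
    this as an illustrative assumption, not its method."""
    n_out = int(len(signal) / speed)
    t_out = np.arange(n_out) * speed            # fractional sample indices
    warped = np.interp(t_out, np.arange(len(signal)), signal)
    return warped * 10 ** (gain_db / 20)

fs = 16000
t = np.arange(fs) / fs
utterance = np.sin(2 * np.pi * 150 * t)         # toy 150 Hz "voice"
augmented = stress_like_augment(utterance)
print(len(utterance), len(augmented))           # augmented is ~9% shorter
```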

17 pages, 389 KiB  
Article
Automatic Assessment of Prosodic Quality in Down Syndrome: Analysis of the Impact of Speaker Heterogeneity
by Mario Corrales-Astorgano, Pastora Martínez-Castilla, David Escudero-Mancebo, Lourdes Aguilar, César González-Ferreras and Valentín Cardeñoso-Payo
Appl. Sci. 2019, 9(7), 1440; https://doi.org/10.3390/app9071440 - 5 Apr 2019
Cited by 13 | Viewed by 3757
Abstract
Prosody is a fundamental speech element responsible for communicative functions such as intonation, accent and phrasing, and prosodic impairments of individuals with intellectual disabilities reduce their communication skills. Yet, technological resources have paid little attention to prosody. This study aims to develop an automatic classifier to predict the prosodic quality of utterances produced by individuals with Down syndrome, and to analyse how inter-individual heterogeneity affects assessment results. A therapist and an expert in prosody judged the prosodic appropriateness of a corpus of utterances from speakers with Down syndrome, collected through a video game. The judgments of the expert were used to train an automatic classifier that predicts prosodic quality by using a set of fundamental frequency, duration and intensity features. The classifier accuracy was 79.3% and its true positive rate 89.9%. We analyzed how informative each of the features was for the assessment and studied relationships between participants' developmental level and results: inter-speaker variability conditioned the relative weight of prosodic features for automatic classification, and participants' developmental level was related to the prosodic quality of their productions. Therefore, since speaker variability is an intrinsic feature of individuals with Down syndrome, it should be considered to attain an effective automatic prosodic assessment system. Full article
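
A reduced version of such a classifier can be sketched with a handful of fundamental-frequency, intensity and duration descriptors feeding a standard classifier. The feature set and the synthetic toy data below are assumptions for illustration, far simpler than the corpus and features of the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prosodic_features(f0, energy, duration):
    """A reduced stand-in for the paper's feature set: fundamental-frequency,
    intensity and duration descriptors for one utterance (the features in
    the paper are richer; this selection is an assumption)."""
    voiced = f0 > 0
    f0v = f0[voiced] if voiced.any() else np.zeros(1)
    return np.array([f0v.mean(), f0v.std(), f0v.max() - f0v.min(),
                     energy.mean(), energy.std(), duration])

# Toy corpus: utterances labelled 1 (appropriate prosody) or 0.
rng = np.random.default_rng(0)
X, y = [], []
for label in (0, 1):
    for _ in range(50):
        f0 = np.abs(rng.normal(160 + 40 * label, 20 + 10 * label, size=100))
        energy = np.abs(rng.normal(0.5, 0.1 + 0.05 * label, size=100))
        X.append(prosodic_features(f0, energy, rng.uniform(1.0, 3.0)))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```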
