1. Introduction
Aphasia is a language disorder that causes impairments in dimensions such as speech, writing, interaction and communication. People with aphasia (PWA) mainly acquire this disorder after suffering a stroke, a traumatic brain injury, a brain tumor or any other condition affecting specific language-related areas of the brain. In particular, aphasia is more likely to develop when the affected areas are located in the left hemisphere [1]. Every year, millions of people worldwide acquire aphasia through one of these causes, and its prevalence in the general population ranges between 6 and 62 people per 100,000 inhabitants depending on the region and country [2,3,4]. These values may increase up to 30–60% among people who have survived a stroke, which is the second leading cause of death globally [4,5,6].
PWA may acquire communication impairments that affect their daily life to different degrees depending on the severity of the disorder [7]. These impairments are usually classified with the scale proposed by the Western Aphasia Battery (WAB) [8], ranging from mild to very severe depending on the performance on several tasks that include reading, speech and writing, among others [8]. On the other hand, aphasia disorders can also be distinguished by a combination of symptoms and the affected physical areas [7]. The most widespread classification uses the Wernicke–Lichtheim model, which associates communication capabilities with different brain regions [9,10], differentiating three main types of aphasia depending on the damaged area: Broca, Wernicke and Anomic. Nevertheless, language comprehension and production are not isolated in the specific brain areas considered by this model [11], and more modern and complete theories, e.g., the dual stream model [12], consider that language capabilities are organized as a distributed system across different cortical regions, emphasizing the connections between them [13,14,15]. However, the cortical damage that causes aphasic impairment has barely been mapped using these new theories; therefore, the Wernicke–Lichtheim model is still the most widely used in clinical assessment [11].
Intensive speech therapy conducted by interdisciplinary groups of clinical experts plays a fundamental role in recovering the communication abilities of PWA [16]. Over the last years, intense research in speech recognition technology has promised to support the work of these clinical experts by automating processes and improving access to therapy in isolated areas and/or less favored socioeconomic environments and collectives. In this sense, applications such as Constant Therapy [17], Lingraphica [18] and Tactus Therapy [19], whose usefulness has been recognized by the National Aphasia Association of the United States (https://www.aphasia.org/, accessed on 15 July 2021), provide exercises to practice speech, language and cognitive tasks tailored to each PWA's progress. These applications have been proven to reinforce therapy, achieving the established goals in less time [20], especially in rural areas [21]. Other technological applications focus on the adaptation of standard cognitive tests [22] or on the automatic quantitative analysis of aphasia severity through speech [23]. Taken together, these new techniques and solutions promise to enhance face-to-face therapy, to extend treatment to more patients and, therefore, to improve the quality of life of PWA.
Nonetheless, there are still challenges related to automatic speech recognition (ASR) that must be solved worldwide in order to extend these therapy applications, since they essentially depend on engines that properly recognize aphasic speech. ASR systems are usually trained with the voices of people without any speech pathology, and their performance degrades when they are applied to aphasic speech [23,24,25,26,27]. Furthermore, ASR systems are usually language-dependent and have to be trained with hundreds or thousands of hours of transcribed speech. In many cases, this requirement prevents extending their use to the thousands of languages currently spoken in the world and, particularly, to aphasic speech recognition, due to the lack of such large amounts of annotated data for training recognition models with traditional supervised learning methods.
In this work, we explore the application of novel semi-supervised end-to-end (E2E) learning methods to ASR in order to perform aphasic speech recognition in English and Spanish in a very challenging scenario with few annotated data. More specifically, we make use of the wav2vec2.0 architecture [28], building models adapted to aphasic speech for English and Spanish and comparing the results with previous fully supervised technological approaches presented in the literature. In particular, we achieved a relative Word Error Rate (WER) reduction of ∼25% on the English test set when comparing with previously published results. In addition, we demonstrate that this technological approach can be extended to perform aphasic speech recognition with few annotated data. To this end, we built the first Spanish E2E model adapted to aphasic speech recognition with less than one hour of data from PWA and report the first results in the literature for this language and domain.
The rest of the paper is organized as follows: Section 2 introduces previous work in aphasic speech recognition. Section 3 details the processing performed on the main corpora used for the experiments, in addition to the creation and composition of the train, validation and test partitions. In Section 4, the speech recognition architectures and their construction are explained, whilst the evaluation results obtained with different configurations of the systems are presented in Section 5 for English and Spanish. Finally, Section 6 concludes the paper and presents future work.
2. Related Work in Aphasic Speech Recognition
ASR is a technological field that has evolved remarkably over the last years, driven by new methods and architectures based on Deep Neural Networks (DNNs), which are approaching human-like performance in controlled acoustic environments [28,29,30,31,32]. These improvements have great potential to impact new clinical ASR applications and to develop new e-health solutions [33,34,35]. In particular, ASR technology applied to disordered voices brings the opportunity to implement new assisted and personalized therapies, to generate automatic cognitive tests or to develop adapted applications for people with impairments.
The first ASR systems for aphasic speech recognition found in the literature focused on recognizing isolated words within small vocabularies for English [36] and Portuguese [24]. More recently, thanks to the advances in deep learning speech recognition technologies, new studies achieved up to accuracy in assessing correct versus incorrect naming attempts in controlled utterance verification systems [37]. However, the biggest challenge in the field nowadays is to improve the performance of continuous recognition of aphasic speech over large vocabularies. To the best of our knowledge, the published works on aphasic continuous speech recognition with large vocabularies only consider English [23,38,39] and Cantonese [40] to date. The performance and results of these systems oscillate widely depending on the severity level of aphasia, with WERs ranging from 33 in the mildest cases to more than 60 in very severe cases. All these studies employ the same AphasiaBank database [41] as the main corpus for training and evaluation, but they usually differ in the train-test-validation partitions and in the evaluation metrics employed, given that some studies used the Phoneme Error Rate (PER) as their main metric and others employed the Character Error Rate (CER). This decision strongly depends on the configuration and the basic modeling unit used to train the systems (phonemes or characters). Hence, a fair and balanced comparison between systems and technological approaches cannot always be guaranteed. Nonetheless, in some cases, notable improvements can be appreciated between the PER in the moderate aphasia test group presented in [25] and the more recent PER reported in [39]. These results seem to be in line with the global Syllable Error Rate (SER) reported for the full test set in Cantonese [40], where more than 60% of the test set was composed of mild severity speech data.
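Since the comparisons above mix several error metrics, the following minimal sketch shows how the word- and character-level error rates are typically computed; it assumes the third-party jiwer library and illustrative example strings, and is not tied to any of the cited systems.

import jiwer

reference = "the dog runs in the park"
hypothesis = "the dog run in park"

# (substitutions + deletions + insertions) divided by the number of reference words
wer = jiwer.wer(reference, hypothesis)
# the same edit-distance ratio computed over characters instead of words
cer = jiwer.cer(reference, hypothesis)

print(f"WER = {wer:.2%}, CER = {cer:.2%}")

PER and SER follow the same edit-distance formulation, with phonemes or syllables as the basic unit, which is why results reported under different units cannot be compared directly.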
Regarding technological approaches, previous works focused on developing ASR technology for aphasic speech considering architectures based on hybrid Acoustic Models (AMs) such as Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) [25], Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNN) [23], and solutions based on Mixtures of Experts (MoEs) [39]. More specifically, in the work presented in [38], the authors established the first large-vocabulary continuous speech recognition baseline for English built on the AphasiaBank dataset, using a DNN-HMM hybrid AM trained on unseen train-validation-test partitions and distinguishing performances depending on aphasia severity. They reached PER values between for the mild severity test set and for the very severe test set, and reported that appending utterance-level fixed-length speaker identity vectors (i-vectors) to frame-level acoustic features resulted in PER reductions, especially for speakers with more severe levels of aphasia. These results were then improved by using an acoustic modeling method based on a BLSTM-RNN architecture enriched with a trigram language model (LM) estimated on the transcripts of the training audios [23]. In this case, the training of the AM was reinforced with transcribed data from healthy speakers, achieving an improved WER ranging from on the mild test set to on the very severe test set. In the work described in [39], an AM based on an MoE of DNN models was proposed, where each expert in the model was specialized in a specific aphasia severity. Additionally, a Speech Intelligibility Detector (SID) composed of two hidden layers and a final softmax function was trained to detect the Aphasia Quotient (AQ) severity level of a given speech frame by using the acoustic features and utterance-level speaker embeddings. At inference time, the contribution of each expert was decided by the SID module. Once again, the train-validation-test partitions were randomly generated, and they achieved PER values ranging from on the mild test set to on the severe test set.
Finally, the first ASR system for Cantonese continuous aphasic speech was described in the work presented in [40]. The authors used a Time Delay Neural Network (TDNN) combined with a BLSTM model as the main AM, which was trained with both in-domain and out-of-domain speech data and a syllable-based trigram LM. The performance of the system was evaluated at the syllable level using the SER metric. This work did not report distinctions between aphasia severities, yielding an overall SER of for aphasic speech and for the healthy speakers.
As can be concluded, over the last years, speech recognition of aphasic voices has benefited from the latest improvements in ASR based on fully supervised learning methods, gradually enhancing its performance and thus allowing its application in real clinical and therapeutic tools. In this work, we show that semi-supervised learning methods have great potential in this particular domain, reporting interesting WER improvements for English and competitive results for Spanish considering the scarcity of annotated PWA data (less than 1 h) for this language.
5. Evaluation Results and Discussion
In this section, the evaluation results for English and Spanish are reported, together with the results obtained by the reference ASR systems of the literature, which are shown in Table 4. All the evaluations were performed following the experimental setup, neural acoustic and language models and decoding strategies detailed in Section 3 and Section 4. In addition, a discussion of the results achieved is provided.
5.1. Semi-Supervised ASR Performance for English
The performances of the different ASR systems developed in this work for aphasic speech recognition in English are reported in Table 5 and Table 6 for the CER and WER metrics, respectively. The results are organized by the AM of the ASR system, the acoustic data used to fine-tune the pre-trained XLSR-53-wav2vec2.0 model, the decoding type, the external LM used for rescoring the initial lattices and the aphasia severity level.
As expected, audio content from more severe levels of PWA is more challenging to transcribe, whilst speech segments from mild severity cases are recognized with lower CER and WER values. The differences in performance between the groups that establish the degree of aphasia severity are quite significant, obtaining up to error on the most severe groups when compared with mild cases. These large differences between AQ level groups are in line with previous publications [23,38,39], whose PER and WER results are summarized in Table 4.
At the acoustic level, the best performance was obtained when fine-tuning the XLSR-53 pre-trained model with data from the Mixed acoustic set, which included audio content from both PWA and healthy controls. In this sense, we report CER and WER reductions of almost ∼5% when adding the healthy controls in comparison with using only audio from PWA for training. This implies that the impact of the scarcity of annotated aphasic speech can be partially reduced by incorporating speech from healthy speakers and other domains. This finding was explored and applied later on the Spanish dataset.
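A minimal sketch of this data-mixing step is shown below, assuming the HuggingFace datasets library and hypothetical CSV manifests (pwa_train.csv, healthy_controls_train.csv); it only illustrates the idea of pooling PWA and healthy-control utterances before fine-tuning.

from datasets import load_dataset, concatenate_datasets

# Hypothetical manifests with "audio_path" and "transcript" columns.
pwa = load_dataset("csv", data_files="pwa_train.csv", split="train")
controls = load_dataset("csv", data_files="healthy_controls_train.csv", split="train")

# The "Mixed" acoustic set: aphasic speech augmented with healthy-control recordings.
mixed = concatenate_datasets([pwa, controls]).shuffle(seed=42)
print(len(pwa), len(controls), len(mixed))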
Regarding the beam-search decoding using external LMs for rescoring the initial lattices, this strategy clearly improves the performance of the speech recognition systems, showing different results depending on the aphasia severity level and the type of LM employed. At this point, it is worth remarking that the Large LM does not enhance the overall results when compared with the other LMs, even though it includes more than 803 million extra words and the special symbols were ignored when computing the metrics. This suggests that, in this case, the texts from the Librispeech and CommonVoice datasets used for training this LM are too far from the in-domain sentences of the AphasiaBank dataset. In this manner, the best results are achieved using the Mixed LM model, reaching a WER on the mild severity level group, a WER over the moderate subset, a WER for severe PWA and a WER on very severe cases. Overall, this LM reported improvements of ∼2% in comparison with the In-domain LM and ∼7% when compared to greedy decoding.
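As an illustration of this decoding strategy, the sketch below performs CTC beam-search decoding with shallow fusion of an external n-gram LM using the pyctcdecode library; the character set, the KenLM file name (mixed_lm.arpa) and the alpha/beta weights are hypothetical and do not correspond to the exact decoding configuration of this work.

import numpy as np
from pyctcdecode import build_ctcdecoder

# Character vocabulary in the same order as the acoustic model's output columns;
# the empty string acts as the CTC blank token.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz' ")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="mixed_lm.arpa",  # hypothetical external n-gram LM for rescoring
    alpha=0.5,                         # LM weight (illustrative value)
    beta=1.0,                          # word-insertion bonus (illustrative value)
)

# Placeholder frame-level scores; real logits would come from the fine-tuned AM.
logits = np.random.randn(200, len(labels))
transcript = decoder.decode(logits, beam_width=100)
print(transcript)

The greedy decoding baseline mentioned above corresponds to simply taking the most likely label per frame and collapsing repetitions and blanks, without any LM fusion.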
The results obtained show that, despite the great differences in pronunciation quality between speakers from the mild to the very severe groups, the semi-supervised learning method applied in this work is able to generalize the learning of contextualized speech representations over very diverse types of speech, improving ASR performance in all cases. This strategy is demonstrated again in Section 5.2 for the Spanish language. Finally, although a fair and well-balanced comparison with the results published in previous studies cannot be fully established (see Table 4), given the differences in modeling units (character versus phoneme) and the possible mismatch in data partitions, the results provided in this work for the English language (Table 5 and Table 6) constitute a significant improvement in the quality of aphasic speech recognition systems tested to date on the AphasiaBank dataset.
5.2. Semi-Supervised ASR Performance for Spanish
The evaluation results achieved for the Spanish language are summarized in Table 7 at the CER and WER levels. Firstly, it is worth noting that, even using less than one hour of transcribed PWA speech, we were able to achieve performances of CER and of WER on the test set using the simplest greedy search decoding. These results were further improved by integrating audio from healthy control speakers and by using the Large LM, trained with millions of words, to rescore and enhance the initial recognition hypotheses. Considering the challenge of the task and the previous benchmarks of English and Cantonese ASR systems, which were trained with up to more hours of transcribed speech, these results can be considered very competitive and promising. Moreover, these results are, to the best of our knowledge, the first benchmark of aphasic speech recognition published for Spanish.
The best initial results with the Spanish AM trained on the PWA acoustic set were reached by fine-tuning the pre-trained model for 100 epochs, achieving a CER and a WER of . However, the previous results in English demonstrated that augmenting the training dataset with data from healthy controls improves the overall ASR performance. In this manner, the Spanish model trained with the Mixed acoustic set improved the WER performance by around 10% when fine-tuning the pre-trained model for 200 epochs. Once again, this approach showed that using semi-supervised methods in clinical domains with data scarcity, together with non-pathological data augmentation, is a very promising and interesting strategy.
Finally, the best performance for this language was achieved through beam search decoding with the external Large LM. Once again, the special symbols FLR, SPN, BRTH and LAU were discarded during the evaluation, since these symbols are not covered in the generic texts. Following this strategy, we achieved a CER of and a WER of on the test set. These results differ from the English setup, where the external Large LM did not improve the results at all. This may be due to the fact that the Spanish AM, fine-tuned with much less data, did not learn the special symbols properly; as a result, they could be removed during evaluation without a negative impact on the performance.
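A minimal sketch of this symbol-filtering step is given below; the symbol set follows the text, while the example strings and the use of jiwer for scoring are illustrative assumptions.

import re
import jiwer

SPECIAL_SYMBOLS = {"FLR", "SPN", "BRTH", "LAU"}


def strip_special_symbols(text: str) -> str:
    # Drop annotation-specific symbols before scoring against LM-rescored hypotheses.
    tokens = [t for t in text.split() if t.upper() not in SPECIAL_SYMBOLS]
    return re.sub(r"\s+", " ", " ".join(tokens)).strip()


reference = strip_special_symbols("el perro FLR corre BRTH en el parque")
hypothesis = strip_special_symbols("el perro corre en parque LAU")
print(jiwer.wer(reference, hypothesis))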
6. Conclusions and Future Work
In this work, we show that semi-supervised learning methods applied to ASR are a promising solution for improving the performance of aphasic speech recognition. Moreover, we set new benchmarks for the English AphasiaBank dataset and performed the first study for the Spanish language. The acoustic data for training were augmented using a mix of data from PWA and healthy controls, demonstrating that this strategy considerably improves the performance. This benefit was boosted in the case of Spanish, for which less than one hour of aphasic speech data was available. These results open the door to improving ASR systems for people with aphasia and other clinical speech pathologies, or even simply to making speech recognition engines available for languages with few annotated and available data.
As future work, it would be interesting to check whether the performance of the systems could be improved by considering other learning rate schedulers, by tuning the SpecAugment parameters or by considering other hyperparameter configurations. Moreover, it should be evaluated whether the results could be enhanced by fine-tuning specific models for each level of aphasia severity, as speakers in each group probably exhibit similar speech and acoustic patterns. Another strategy worth studying would be to train AMs by directly removing the special symbols and then rescoring with an external Large LM. In any case, this point should be considered depending on the application, since the special symbol information can be important for clinical practice but irrelevant for voice assistants. Furthermore, AMs may even be fine-tuned to the speech of individual patients by using Federated Learning approaches [54]. Finally, future studies should also focus on extending this semi-supervised learning method to other languages for which no benchmarks on aphasic speech recognition have been reported, probably due to the scarcity of annotated data. In addition, this technology should be tested in clinical practice, as well as in real medical environments and applications.