Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR
Round 1
Reviewer 1 Report
Dear authors
Your work introduces a semi-supervised method for extracting speech data, which is then utilized to generate a new dataset specifically tailored for the development of fully bilingual Automatic Speech Recognition systems for Basque and Spanish. This study holds great significance as it not only proves intriguing but also demonstrates potential applications in various other domains.
The work is very detailed, especially in the methodological part. Graphs, figures, and tables greatly assist the reader in understanding the study. Since the methods and technical details of the work are carefully expressed, I suggest improving some small aspects of the introduction. For example, it would be useful and interesting, especially for non-Spanish readers, to introduce a brief discussion of the Basque language at the beginning, highlighting its lexical and/or syntactic differences from Spanish. Additionally, it would be beneficial to provide a better-defined purpose of the study, perhaps in a separate paragraph, to bring more clarity to the initial part. It is also recommended to explicitly mention the acronym WER in the abstract, as was correctly done for ASR.
Author Response
REVIEWER 1
Comments and Suggestions for Authors
Dear authors
Your work introduces a semi-supervised method for extracting speech data, which is then utilized to generate a new dataset specifically tailored for the development of fully bilingual Automatic Speech Recognition systems for Basque and Spanish. This study holds great significance as it not only proves intriguing but also demonstrates potential applications in various other domains.
The work is very detailed, especially in the methodological part. Graphs, figures, and tables greatly assist the reader in understanding the study. Since the methods and technical details of the work are carefully expressed, I suggest improving some small aspects of the introduction. For example, it would be useful and interesting, especially for non-Spanish readers, to introduce a brief discussion of the Basque language at the beginning, highlighting its lexical and/or syntactic differences from Spanish.
We thank Reviewer 1 for her/his good and encouraging comments. We have added a paragraph in Section 1 (lines 39-52) introducing the main characteristics of the Basque language and the most important differences (at the acoustic-phonetic, lexical and syntactic levels) with regard to Spanish.
Additionally, it would be beneficial to provide a better-defined purpose of the study, perhaps in a separate paragraph, to bring more clarity to the initial part.
We have added a couple of lines at the end of Section 1 (lines 118-123) to summarize the main objectives of our study.
It is also recommended to explicitly mention the acronym WER in the abstract, as was correctly done for ASR.
This has been fixed in the revised paper. Thanks for pointing it out.
Author Response File: Author Response.pdf
Reviewer 2 Report
This paper presents a semisupervised iterative method based on bilingual ASR which can be useful to build a Speech Dataset from Basque parliament sessions, including speech in two different languages. Although the approach stems from similar semisupervised approaches, as pointed out by the authors, this proposal includes a different way to face ASR from a unique common phoneme recognizer for both languages plus a grapheme to phoneme converter. The methodology is clearly stated in general and the results are really promising, although some issues should be addressed before final acceptance of the paper:
1. The title might not be the most appropriate, since it seems the Speech dataset is only valid for ASR, which is not the case. It has been extracted from Basque parliament sessions using (or from) a Bilingual Basque-Spanish ASR but it is not 'for' it.
2. In lines 53 to 57, where the LM is presented, the authors assert a proposition without quantifying it or backing it with some reference to previous works. How was the number of switching points measured? What was the percentage over the length? Was this above what was expected, and for what reason? Why might it not generalize well? Please, discuss further.
3. In line 95, the amount of speech data is quantified as 'around 1000 hours', while in the abstract the specific amount of 998 is reported. Please correct to provide the same quantity.
4. The description of the acoustic units in section 2.1 and the selection criteria should be supported by reference phonetic studies and corpus statistics for Basque, if available. Even if they are commonplace, it deserves attention to include a well-informed background on this.
5. The G2P presented in section 2.2 (line 140) ... is it available for reproducibility in some repository? Is there any technical report documenting it? Since it is a key component of the ASR system, care should be given to providing more information and accessibility for it.
6. In lines 142 and 143: how many dictionaries of known words, with how many words and from which origin? Are the pronunciations given typical? According to which norm?
7. The description of the process to decide on the language of an OOV word in the text requires a better explanation. First, it does not seem most natural to choose in terms of the frequency of words of a given language within a given window, especially for the less frequent words. Also, the length of the window and its overlapping (if any) when sliding over the text should be specified. If possible, a discussion or some evaluation of the impact of the length of this window on the result should be included.
8. At the beginning of section 2.3 (lines 156, 157), it is described how the minutes and their translations are used to build the mixed LM. What translator was used?
9. The title of section 3 is not very fortunate and/or informative, since it hides the fact that the iterative set construction procedure is described inside, not the collection of the training data themselves, which is the result. Something like 'Iterative training data selection from the sessions dataset' or similar might be more appropriate.
10. The description of the segment selection and growing made in section 3 (lines 186 to 206) does not clearly motivate the need for a two-step approach to the segment selection problem.
11. Does the sentence in lines 223-224 have a solid (or at least plausible) mathematical or computational background? The iterative process might never end. Moreover, the term 'better enough' is definitely ambiguous and does not introduce the stopping control parameter which might be introduced in the algorithm.
12. Although the notation in Algorithm 1 is almost self-contained, it would be interesting to include a caption explaining the C and E sets. Also, in line 8, a delta is compared with a delta minimum and it is not clear whether it represents any, all, or a function of the deltas l, s and w.
13. When describing the development dataset, a reference is made to manual auditing of the points where a discrepancy was found between the recognition results and the minutes' phoneme sequence (lines 282-283). When was this discrepancy considered big enough? How was this auditing carried out? What was the action after auditing? Correcting? Discarding? Other?
14. Was the accuracy of the baseline phoneme recognizer's acoustic models (AM) validated against a well-known reference corpus (for Spanish and Basque) that was manually labelled with enough quality, even if not too big?
15. There is no phoneme for CH (UPS) as in the Spanish word 'chollo' in Table 1. This might not be an error, or it might be the result of a decision to apply a lower bound that cuts out phonemes without enough appearances, which, in that case, should be discussed and documented in the text.
16. Finally, a discussion of the reason for a bias in the percentage of WER improvement in Tables 5 and 6, depending on the language, would be highly necessary to further clarify the validity of the method and to discard possible complex effects influencing the results. The improvement is certainly impressive in all cases, but it is especially noticeable that it is bigger for Spanish and Bilingual, in relative terms, as if the approach might not be selecting words of both languages with the same fairness. A discussion of this effect would be valuable.
Author Response
REVIEWER 2
Comments and Suggestions for Authors
This paper presents a semisupervised iterative method based on bilingual ASR which can be useful to build a Speech Dataset from Basque parliament sessions, including speech in two different languages. Although the approach stems from similar semisupervised approaches, as pointed out by the authors, this proposal includes a different way to face ASR from a unique common phoneme recognizer for both languages plus a grapheme to phoneme converter.
We thank Reviewer 2 for his/her exhaustive review and for pointing out many problematic issues in our submission. We are sincerely grateful for the help in improving our paper by adding missing information and/or clarifying some points. We think all the issues have been suitably addressed (or at least, we have tried to address them) in the revised version of the paper. Below, we give details about how we addressed each of them.
The methodology is clearly stated in general and the results are really promising, although some issues should be addressed before final acceptance of the paper:
1. The title might not be the most appropriate, since it seems the Speech dataset is only valid for ASR, which is not the case. It has been extracted from Basque parliament sessions using (or from) a Bilingual Basque-Spanish ASR but it is not 'for' it.
Although our target application was bilingual ASR on the Basque Parliament sessions, this database does potentially have other applications. On the other hand, we do not use a full ASR system to extract the data, just the acoustic models to perform phone decoding; however, we understand that the reviewer considers this ASR too. Based on this, we have modified the title of the paper to describe in more literal terms the work presented:
Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque-Spanish ASR.
2. In lines 53 to 57, where the LM is presented, the authors assert a proposition without quantifying it or backing it with some reference to previous works. How was the number of switching points measured? What was the percentage over the length? Was this above what was expected, and for what reason? Why might it not generalize well? Please, discuss further.
The sentence in lines 54-57 of the original submission was unfortunate because it certainly speculated on quantitative issues. We have replaced it with a new, more qualitative assertion (now in lines 68-72), which better reflects what we want to say at this point:
Since the language model has been trained on sentences in both languages, some of them including code switchings, it can naturally accept any sequence of words in any language (the probability of such a sequence will always be nonzero), and this allows it to recognize sentences with code switchings, although the model has not been tuned for this.
We hope this solves the issues posed by the reviewer. Just to clarify it, we did not measure the number, percentage or types of code switchings, nor the words involved. And, of course, we did not compare those features with other works, because we were not interested in studying the code-switching phenomena, but just in building an ASR system able to deal with them and recognize a bilingual sequence of words. Below in this paper, we provide the number of training segments collected in each language, along with the number of bilingual segments (likely containing a code switching event). These figures could be used to estimate the frequency of code switchings in these materials. Also, comparing the word error rates on bilingual segments to those on pure Basque or Spanish segments could be used to evaluate the effect of code-switching on ASR performance.
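The nonzero-probability property quoted above can be illustrated with a toy model (our own minimal sketch with made-up vocabulary, not the authors' actual LM): any smoothed n-gram model trained on text in both languages assigns a finite score to every word sequence, including code-switched ones never seen in training.

```python
import math
from collections import Counter

def train_bigram(sentences):
    # Add-one (Laplace) smoothing guarantees every bigram, even an
    # unseen cross-language one, gets a nonzero probability.
    vocab = {w for s in sentences for w in s}
    bigrams = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
    unigrams = Counter(w for s in sentences for w in s)
    V = len(vocab)

    def logprob(seq):
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(seq, seq[1:]))

    return logprob

# Trained on one Spanish and one Basque sentence...
lm = train_bigram([["la", "casa", "es", "grande"], ["etxea", "handia", "da"]])
# ...a code-switched sequence still receives a finite log-probability.
assert lm(["la", "etxea", "es", "handia"]) > float("-inf")
```

Seen word pairs still score higher than unseen cross-language pairs, so the model prefers monolingual continuations while remaining able to accept a switch.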
3. In line 95, the amount of speech data is quantified as 'around 1000 hours', while in the abstract the specific amount of 998 is reported. Please correct to provide the same quantity.
Thanks for pointing it out. We have replaced those 1000 hours with 998 hours throughout the paper (where appropriate).
4. The description of the acoustic units in section 2.1 and the selection criteria should be supported by reference phonetic studies and corpus statistics for Basque, if available. Even if they are commonplace, it deserves attention to include a well-informed background on this.
We have added a reference in Section 2.1 (line 149) to a previous study on the same topic (i.e. defining a set of acoustic units for ASR in Basque), which also provides a number of references to classical linguistic studies on the phonetics and phonology of Basque. In Section 1 we already provided a pair of references (line 40) to support our considerations about Basque and its differences with regard to Spanish, and below in the paper, in Section 2.2 (line 167), we provide two additional references to studies on the phonetics and phonology of Basque and Spanish (on which our pronunciation rules are based). We hope that these references (not present in our original submission) will help the interested reader to go deeper into these issues.
5. The G2P presented in section 2.2 (line 140) ... is it available for reproducibility in some repository? Is there any technical report documenting it? Since it is a key component of the ASR system, care should be given to providing more information and accessibility for it.
Unfortunately, our grapheme-to-phoneme (G2P) converter is ongoing, unfinished work which we consider not reliable enough to make public. When it matures, we will make it available to the community through some repository (e.g. GitHub). There is no public documentation on it, only two papers that briefly describe the philosophy and structure of a first version that we created around 2012. We have added references to those works in Section 2.2 (line 161).
6. In lines 142 and 143: how many dictionaries of known words, with how many words and from which origin? Are the pronunciations given typical? According to which norm?
We use two dynamic dictionaries, for Basque and Spanish, originally created from some small acoustic-phonetic databases and then dynamically updated with new words found in Basque Parliament sessions. Each dictionary contains nominal pronunciations of words according to standard pronunciation rules, accounted for in two studies that we have referenced in the paper. Since these dictionaries grow dynamically, we cannot provide a fixed figure for their size. They contain hundreds of thousands of words, including verbal inflections, acronyms, numbers, etc.
We have extended the sentence in lines 142-143 of the original submission by adding these considerations. We have also included references to the two mentioned studies on the phonetics and phonology of Spanish and Basque (see lines 162-171 of the revised paper). We hope this will answer the questions posed by the reviewer and allow the interested reader to go deeper into these issues.
7. The description of the process to decide on the language of an OOV word in the text requires a better explanation. First, it does not seem most natural to choose in terms of the frequency of words of a given language within a given window, especially for the less frequent words. Also, the length of the window and its overlapping (if any) when sliding over the text should be specified. If possible, a discussion or some evaluation of the impact of the length of this window on the result should be included.
This issue has been addressed in lines 171-177 of the revised version of the paper.
The analysis of the context takes place only in two situations: (1) a known word is to be processed but it appears in both dictionaries; or (2) an unknown word is to be processed. For known words belonging to a single dictionary, the pronunciation stored in the corresponding dictionary is used.
Guessing the language (Basque or Spanish) can be done in different ways. We chose to use the context (choosing the language with more words in a context window) because it is quite unlikely that a word in Basque appears in the middle of a sentence in Spanish (or vice versa), and it is also unlikely that a code-switching event takes place just at that point. In the vast majority of cases, the words surrounding a given word belong to the same language. In any case, as we note at the end of Section 2.2, after processing each session, new words added to the dictionaries are supervised and validated by a human expert, so if an error happens because of a badly interpreted context, it will eventually be fixed.
With regard to the size of the window, it is not fixed but starts at 1 (one word at each side) and increases (2, 3, etc. up to the length of the sentence) until a reliable decision can be made (note that some words appear in both dictionaries). This strategy is found to be effective in practice, leading to very few errors.
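For concreteness, the expanding-window decision described above could be sketched as follows (our own illustration; the function, variable names and tie-breaking details are hypothetical, not taken from the authors' system):

```python
def guess_language(words, i, eu_vocab, es_vocab):
    # Decide the language of words[i] from its neighbours: grow the
    # context window one word per side until the counts of unambiguous
    # Basque vs. Spanish context words differ (hypothetical sketch,
    # not the authors' actual implementation).
    n = len(words)
    for radius in range(1, n + 1):
        context = words[max(0, i - radius):i] + words[i + 1:i + 1 + radius]
        eu = sum(1 for w in context if w in eu_vocab and w not in es_vocab)
        es = sum(1 for w in context if w in es_vocab and w not in eu_vocab)
        if eu != es:
            return "eu" if eu > es else "es"
    return None  # no reliable decision: defer to the human supervisor
```

Counting only words that appear in exactly one dictionary mirrors the point made above that ambiguous words (present in both dictionaries) cannot decide the language on their own.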
8. At the beginning of section 2.3 (lines 156, 157), it is described how the minutes and their translations are used to build the mixed LM. What translator was used?
We did not translate the minutes. The minutes already include translations into the other language: if the original speech was in Spanish, it is translated into Basque, and vice versa. Professional translators of the Basque Parliament are in charge of this task, so we considered those translations trustworthy.
This issue is addressed in the paper by adding information about the origin of translations, as follows (lines 189-192):
In the original BP minutes, Spanish is dominant over Basque, with a 2:1 relation, but the professionally-produced translations included in the minutes have exactly the opposite relation, so that the language model is trained with exactly the same amount of text in both languages.
9. The title of section 3 is not very fortunate and/or informative, since it hides the fact that the iterative set construction procedure is described inside, not the collection of the training data themselves, which is the result. Something like 'Iterative training data selection from the sessions dataset' or similar might be more appropriate.
Thanks for pointing out that the title of Section 3 could be misleading. We have changed it to be more descriptive of the procedure applied. The title of Section 3 is now as follows:
Iterative data collection through phonetic decoding and alignment
10. The description of the segment selection and growing made in section 3 (lines 186 to 206) does not clearly motivate the need for a two-step approach to the segment selection problem.
We assume that the reviewer actually refers to lines 197-206 of our original submission. We are not sure we understand what the reviewer means by the so-called two-step approach. Our segment selection procedure is applied to the whole dataset a first time. Then we train new acoustic models on the obtained segments, analyze the results obtained in ASR experiments and decide whether or not further iterations of the segment selection procedure (based on the new models) should be run. If we decide to run another iteration (we have actually run two iterations in this work), we repeat the analysis and decide again whether to continue or not. We honestly think that this approach is already explained in the paper.
11. Does the sentence in lines 223-224 have a solid (or at least plausible) mathematical or computational background? The iterative process might never end. Moreover, the term 'better enough' is definitely ambiguous and does not introduce the stopping control parameter which might be introduced in the algorithm.
Again, the sentence in lines 223-224 of our original submission is unfortunate because it suggests that the process is automatically controlled by some improvement threshold; in fact, as noted above, we ourselves decide after each iteration whether or not it is worth continuing. We have replaced it with a new sentence describing an issue that may prevent us from performing many iterations of the segment selection process, justifying human intervention after each iteration (lines 256-262 of the revised paper):
Recall that we aim to collect those segments that best match acoustically the provided transcripts. However, after each iteration the models will better adjust to the provided (possibly wrong) transcripts, so that after many iterations we may eventually get a PRR of 100% for all segments, with no way to distinguish truly good transcripts from bad transcripts to which our models have adapted. This will prevent us from running too many iterations and will force us to carefully set the threshold that separates good from bad segments after each iteration. We will come back to this issue in Section 5.
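The selection loop just described can be outlined as follows (an illustrative sketch under our own naming, not the authors' code; `decode_prr` stands in for phone decoding plus alignment of each segment against its transcript with the current models):

```python
def select_segments(segments, decode_prr, threshold, max_iters=2):
    # Keep only segments whose phone recognition rate (PRR) against the
    # provided transcript reaches the threshold. In the paper, acoustic
    # models are retrained on the survivors before the next pass, and a
    # human decides after each pass whether another iteration (and which
    # threshold) is worthwhile; max_iters stands in for that decision.
    selected = list(segments)
    for iteration in range(max_iters):
        scores = {seg: decode_prr(seg, iteration) for seg in selected}
        selected = [seg for seg, prr in scores.items() if prr >= threshold]
    return selected
```

Capping the number of passes externally, rather than looping until convergence, reflects the caveat above: with enough iterations the models adapt to bad transcripts and the PRR criterion loses its discriminative power.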
12. Although the notation in Algorithm 1 is almost self-contained, it would be interesting to include a caption explaining the C and E sets. Also, in line 8, a delta is compared with a delta minimum and it is not clear whether it represents any, all, or a function of the deltas l, s and w.
Thanks for noting these notational issues. We have slightly changed the notation in Algorithm 1 to address them; we have also added two comments explaining what the sets C and E represent.
13. When describing the development dataset, a reference is made to manual auditing of the points where a discrepancy was found between the recognition results and the minutes' phoneme sequence (lines 282-283). When was this discrepancy considered big enough? How was this auditing carried out? What was the action after auditing? Correcting? Discarding? Other?
Thanks again for pointing out these ambiguities. We have replaced the sentence in lines 282-283 of our original submission by the following three sentences (now in lines 320-326):
These segments were manually audited (the audio listened to and the transcripts fixed) only at sections where the recognized sequence of words did not match the text in the minutes. These sections were located automatically, and involved any number of substitutions, deletions and/or insertions. The transcript resulting after auditing could be either the recognized sequence of words, the text provided in the minutes, or a different sequence of words not matching either of them.
We hope to have answered all the questions posed by the reviewer.
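As an aside, the automatic location of such mismatching sections can be sketched with a standard sequence aligner (our illustration only; Python's `difflib` stands in for whatever alignment tool the authors actually used):

```python
import difflib

def discrepancy_regions(recognized, minutes):
    # Align the recognized word sequence against the minutes and return
    # every non-matching stretch (substitutions, deletions, insertions),
    # i.e. the sections a human auditor would listen to and fix.
    sm = difflib.SequenceMatcher(a=recognized, b=minutes, autojunk=False)
    return [(tag, recognized[i1:i2], minutes[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]
```

For example, aligning `"the cat sat down"` against `"the dog sat"` yields a substitution region (`cat` vs. `dog`) and a deletion region (`down`), which are exactly the stretches that would be flagged for auditing.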
14. Was the accuracy of the baseline phoneme recognizer's acoustic models (AM) validated against a well-known reference corpus (for Spanish and Basque) that was manually labelled with enough quality, even if not too big?
In fact, we had validated our initial acoustic models in phone decoding experiments on the dev and test sets of the bootstrap databases, but forgot to include this information in the paper. We have now added the following sentence (lines 270-271), which provides the PRR performance of the initial (bootstrap) models on the test sets of Aditu (Basque) and Albayzin (Spanish):
PRRs on the test sets of Aditu (Basque) and Albayzin (Spanish) were 4.6% and 6.9%, respectively.
15. There is no phoneme for CH (UPS) as in the Spanish word 'chollo' in Table 1. This might not be an error, or it might be the result of a decision to apply a lower bound that cuts out phonemes without enough appearances, which, in that case, should be discussed and documented in the text.
The phoneme /tʃ/ (as in the Spanish word mucho, or in the Basque word txikia) DOES appear in Table 1. We guess that it went unnoticed by the reviewer. To avoid similar confusions and to make it easier to locate sounds in the provided examples, we have put the orthographic counterparts of the phonetic units in bold (as in the Spanish word mucho, or in the Basque word txikia) in Table 1.
16. Finally, a discussion of the reason for a bias in the percentage of WER improvement in Tables 5 and 6, depending on the language, would be highly necessary to further clarify the validity of the method and to discard possible complex effects influencing the results. The improvement is certainly impressive in all cases, but it is especially noticeable that it is bigger for Spanish and Bilingual, in relative terms, as if the approach might not be selecting words of both languages with the same fairness. A discussion of this effect would be valuable.
Following the reviewer's suggestion, we have added a brief discussion (lines 382-391 of the revised paper) on the possible reasons that could explain the improvement observed from Table 5 to Table 6, and the better results (higher improvement in terms of WER) observed for Spanish, compared to Basque, in Table 6:
These improvements could generally be due to using a greater amount of training data and to these data being in-domain, that is, the same speakers and environment/channel conditions appear in both training and test datasets. On the other hand, WER figures are better for Spanish than for Basque. This could be explained by different factors: (1) the dominance of Spanish over Basque (with a 2:1 factor) in the training set; (2) a higher variability of accents/dialects in Basque (with not only different pronunciations, but also different vocabularies) compared to Spanish, which features a single accent/dialect in the dataset; and (3) the use of a reduced set of acoustic units might be hindering the discrimination ability of the acoustic models only for Basque, because the fused consonants do not exist in Spanish.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
I thank the authors for the thorough revision work they undertook. Also, I regret my lack of good expression in the 'two-step issue', which is, indeed, clearly stated in the paper. As for the phonemes, I regret to have mistaken the appearance of the phoneme; as they decided, I think the new editing will help to avoid similar confusions in readers.
Overall, I think the paper is now ready to be accepted for publication.