1. Introduction
In bilingual communities, speakers sometimes mix languages, jumping spontaneously from one language to another, sometimes just for one word or phrase, sometimes for longer stretches, and going back and forth several times [1]. This phenomenon, known as code switching, appears even in formal settings such as parliamentary sessions and raises interesting problems from the point of view of automatic speech recognition (ASR) systems [2,3,4]. Commonly, each language requires a specific ASR system with its own phonetic, phonological, lexical and syntactic constraints. This means that language detection and segmentation (that is, language diarization) must be performed on code-switched speech before applying an ASR system [5,6,7]. This language identification and segmentation process adds complexity and computational cost, and may introduce unrecoverable ASR errors when language detection fails. Current efforts are being devoted to integrating code-switching detection and ASR within end-to-end deep learning approaches [8,9,10]. In recent years, interest in code switching has increased for certain language pairs, especially Mandarin–English, with international evaluations being organized [11] and open datasets being released [12].
In this work, we deal with Basque and Spanish (the two official languages of the Basque Country). Basque is a language of unknown origin, spoken by around 900 thousand speakers in a small region of Spain and France [13,14]. Basque differs greatly from Spanish, especially at the lexical and syntactic levels. Only a relatively small number of words come from the Latin or Romance languages with which Basque has been in contact (Spanish, French and, of course, Latin itself). Spanish (like English) builds its structures using individual words with different grammatical functions, while Basque uses a set of cases to mark the grammatical relationships between the words in a sentence, each with specific syntactic functions, and words are built in an agglutinative way by adding suffixes to lexemes. Spanish uses a verb conjugation system based on person and number, while Basque includes markers for subject, object and indirect object within the verb itself. In declarative sentences, the most common order in Spanish is subject–verb–object, while in Basque it is subject–object–verb. At the phonetic level, however, Basque shares many of its sounds with Spanish (including its five vowels), with only some consonants, such as /ts/, /ts’/, /s’/ and a few other less frequent ones (see Table 1), not appearing in Spanish [15].
In fact, Basque and the variety of Spanish spoken in the Basque Country share a great deal of features at the acoustic level, which allows us to use a single set of models, able to process speech in both languages, so that a code-switched transcription is naturally produced. Our proposed ASR system includes a single set of acoustic models, a single vocabulary (including words in both languages, sometimes with the same spelling but different pronunciations, sometimes with different spellings but the same pronunciation) and a single (aggregated) language model, which accepts code switchings at any point.
A positive effect of this integrated approach is that sharing acoustic models can alleviate the lack of annotated spoken resources for the low-resource language (Basque, in this case) by taking advantage of the resources available for the other language. This will hopefully increase the robustness of the ASR system for the low-resource language, especially if the sets of acoustic units of the two languages are relatively close (as in the case of Basque and Spanish). On the negative side, having a single vocabulary may lead to a higher number of errors due to words being recognized in the wrong language (those pronounced identically or very similarly in the two languages). Since the language model has been trained on sentences in both languages, some of them including code switchings, it naturally accepts any sequence of words in any language (the probability of such a sequence is always nonzero), which allows it to recognize sentences with code switchings, although the model has not been specifically tuned for this.
Our ASR system targets the plenary sessions of the Basque Parliament (BP), with the final goal of obtaining high-quality automatic subtitles. BP members speak in both languages, Basque and Spanish, and code switchings are relatively abundant, so our fully bilingual approach fits the domain quite well. To achieve the best performance, using in-domain training data is key, so a critical part of our work involves collecting as much BP data (audio + minutes) as possible. An important issue with the BP minutes is that they are approximate and do not faithfully reflect the audio content of the plenary sessions, because false starts, repetitions, filled pauses, syntactic errors and even some words or expressions judged too colloquial have been filtered out by human auditors. In this way, the BP minutes read easily (being syntactically correct) and convey the intended meaning, but the correspondence with the audio is partially lost.
This forces us to use the BP minutes with caution, by applying a semi-supervised method to align the minutes with the audio, extract segments and discard those considered not reliable enough. Note that our method does not match the classical semi-supervised training methods that have been applied for more than two decades [16,17,18,19,20,21,22,23]: while those methods deal with completely untranscribed data, we do have some approximate transcriptions.
Classical semi-supervised approaches start from bootstrap acoustic models, typically trained on a relatively small amount of accurately transcribed non-target speech, which are used to build an initial ASR system. This system is applied to transcribe a much larger amount of untranscribed speech from the target domain. Typically, the most confident fragments of the automatically transcribed speech are selected (or other, more sophisticated criteria are applied to select the speech materials) to train a second round of acoustic models, which replace the bootstrap models. The same procedure is then applied iteratively until some convergence criterion is met.
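The classical scheme just described amounts to a simple self-training loop. The following sketch is a minimal illustration only; the train, transcribe_with_confidence and converged callables are hypothetical placeholders supplied by the caller, not components of any specific toolkit:

def semi_supervised_training(train, transcribe_with_confidence, converged,
                             labeled_data, untranscribed_audio,
                             confidence_threshold=0.9, max_iterations=10):
    """Illustrative sketch of classical semi-supervised (self-training) acoustic modeling.
    All three callables are hypothetical placeholders provided by the caller."""
    # 1. Bootstrap models from a small amount of accurately transcribed speech.
    models = train(labeled_data)
    for _ in range(max_iterations):
        # 2. Transcribe the (much larger) untranscribed target-domain speech.
        hypotheses = transcribe_with_confidence(models, untranscribed_audio)
        # 3. Keep only the most confident fragments as pseudo-labeled training data.
        pseudo_labeled = [(audio, text) for audio, text, conf in hypotheses
                          if conf >= confidence_threshold]
        # 4. Retrain the acoustic models on the original plus pseudo-labeled data.
        new_models = train(labeled_data + pseudo_labeled)
        # 5. Stop when some convergence criterion is met (e.g., WER stops improving).
        if converged(models, new_models):
            return new_models
        models = new_models
    return models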
In this work, we follow a similar approach, but instead of a full ASR system, we apply a phone recognizer and an in-house bilingual grapheme-to-phoneme (G2P) converter. Since nominal transcriptions are already available (the parliament minutes), we align the nominal and the recognized transcriptions at the phone level and select those segments that best match. In this way, a large fraction of BP sessions can be leveraged for training acoustic models. Besides increasing the amount of training material for our ASR system (which is initially trained on generic speech datasets in Basque and Spanish), adding BP segments to the training set helps to improve ASR performance specifically on BP sessions (due to an implicit adaptation to speakers, acoustic conditions, vocabulary, etc.), which is the main objective of this work. In [24], the authors also targeted BP plenary sessions, but adopted a different approach to leverage their speech contents, creating two separate datasets for Spanish and Basque on which two monolingual ASR systems were trained.
As a result of this work, we obtained a speech database specifically targeted at BP sessions. The database includes a large amount (998 h) of speech data for training acoustic models, and a development dataset (comprising more than 17 h of speech) used for tuning and evaluation under a cross-validation scheme. This latter dataset was extracted from a separate set of more recent BP sessions (not included in training) and then manually audited (its transcriptions being edited to match the audio contents). Finally, a bilingual (aggregated) trigram language model, estimated from the original minutes and translations of BP plenary sessions in Spanish and Basque, is also provided.
This paper is an extension of a previous work [25]. The primary purpose of that work was to collect speech data for Basque and Spanish (with particular emphasis on the former) using the Basque Parliament plenary sessions as the source. A second aim was to build a fully bilingual ASR system specifically targeted at BP sessions, so that its output could be reliably used as a starting point to produce the minutes (which still require human supervision). In this paper, we provide new results and more in-depth analyses. The new contributions of this work with regard to [25] are summarized as follows: (1) the semi-supervised method employed to extract, rank and select training segments from BP sessions is now applied iteratively until the observed improvement is small enough; (2) to increase the statistical significance of the performance results, the small (4 h long) development and test datasets used in [25] have been replaced by a larger (17 h long) development set, which is used under a cross-validation scheme, with 20 random 50/50 partitions, to perform hyperparameter tuning and then compute ASR performance; (3) the hyperparameter tuning procedure is described in detail; and (4) the results section is enriched with figures illustrating the convergence of the training process and the ranking of segment scores.
The rest of the paper is organized as follows. Section 2 and Section 3 describe the main components of our bilingual ASR system and the method used to extract, rank and select training segments, respectively. Section 4 provides the details of the experimental framework used to evaluate our fully bilingual ASR system on Basque Parliament data, while Section 5 presents and discusses the results obtained in cross-validation experiments on the new development dataset specifically created in this work. Finally, a summary of the paper, conclusions and further work are outlined in Section 6.
3. Iterative Data Collection through Phonetic Decoding and Alignment
For each Basque Parliament plenary session, an audio file and the corresponding minutes are available. For ease of processing, each audio file is manually split into two or three smaller chunks (each about 2 h long) and the minutes are split accordingly. As a starting point, a phone recognizer trained on generic datasets for Basque and Spanish (not including BP materials) is applied to the audio files (without any phonological restrictions) to obtain a long sequence of phonetic units with their corresponding timestamps. On the other hand, the minutes are passed through the above-mentioned G2P converter to obtain a reference (nominal) sequence of phonetic units. Finally, the recognized and reference sequences of phonetic units are aligned with one another under the criterion of maximizing the number of matching units (which is approximately the same as minimizing the number of deletions, insertions and substitutions), following the same text-and-speech alignment method that has been successfully applied in our group for the alignment of BP subtitles [28,34,35]. In this way, regions showing a high density of errors in the alignment correspond to parts of the minutes which do not match the audio contents.
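As an illustration, the alignment criterion (maximizing matches, essentially a minimum edit distance alignment) can be sketched with a standard dynamic programming routine. This is a minimal sketch, not the actual alignment tool used for the BP subtitles; the phone symbols and tie-breaking rule are simplified assumptions:

def align_counts(reference, recognized):
    """Levenshtein-style alignment between two phone sequences.
    Returns (matches, deletions, insertions, substitutions) for an alignment that
    maximizes matches (ties broken by fewer edits). Illustrative sketch only."""
    n, m = len(reference), len(recognized)
    def better(a, b):
        # prefer more matches, then fewer total edits
        return a if (a[0], -(a[1] + a[2] + a[3])) >= (b[0], -(b[1] + b[2] + b[3])) else b
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (0, i, 0, 0)            # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (0, 0, j, 0)            # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mt, d, ins, s = dp[i - 1][j - 1]
            if reference[i - 1] == recognized[j - 1]:
                best = (mt + 1, d, ins, s)              # match
            else:
                best = (mt, d, ins, s + 1)              # substitution
            mt, d, ins, s = dp[i - 1][j]
            best = better(best, (mt, d + 1, ins, s))    # deletion
            mt, d, ins, s = dp[i][j - 1]
            best = better(best, (mt, d, ins + 1, s))    # insertion
            dp[i][j] = best
    return dp[n][m]

# Example: reference phones from the G2P vs. phones output by the recognizer.
print(align_counts(list("etxera"), list("etxea")))  # -> (5, 1, 0, 0)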
The recognized phonetic sequence sometimes features gaps between two consecutive units, which represent silent pauses. Gaps longer than 0.5 s are defined as potential breaking points. Then, a slice is defined as an audio chunk between two consecutive breaking points, and a segment as an audio chunk comprising one or more consecutive slices. This means that a segment may contain one or more breaking points inside it. Data collection is performed by searching for the segment lasting between 3 and 10 s with the highest phone recognition rate (PRR), defined as:

PRR = 100 · m / (m + d + i + s)   (1)

where m, d, i and s are the number of matching units, deletions, insertions and substitutions yielded by the alignment for a given segment, respectively. When two or more segments attain the same (maximum) PRR, the longest segment is chosen. Thus, PRR and length are the primary and secondary selection criteria, respectively.
A single-pass search is performed (with linear time complexity) to maximize PRR and length over those segments meeting the duration constraints in an audio chunk U. Note that, because of the segment duration constraints, for each starting slice the method has to consider just a limited number of following slices (usually one or two). Once the optimal segment x* in an audio chunk U has been determined, the two audio sub-chunks at the left and right sides of x*, U_L and U_R, if not empty, are independently searched in two recursive calls (see Figure 1). Each call returns a list of segments, so we obtain two lists S_L and S_R, which are merged along with the optimal segment x* into a single list S_U. In this way, after searching all the audio files, we end up with a list of segments S that can be filtered according to PRR.
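The recursive extraction procedure can be sketched as follows. This is a minimal sketch under simplifying assumptions (each slice is given as a (start, end, m, d, i, s) tuple, and the notation x*, U_L, U_R follows the reconstruction above); it is not the production implementation:

MIN_DUR, MAX_DUR = 3.0, 10.0   # segment duration constraints, in seconds

def prr(m, d, i, s):
    """Phone recognition rate of a segment, as in Equation (1)."""
    total = m + d + i + s
    return 100.0 * m / total if total > 0 else 0.0

def extract_segments(slices):
    """Recursively extract non-overlapping segments from a list of slices.
    Each slice is a tuple (start, end, m, d, i, s); returns (start, end, PRR) segments."""
    if not slices:
        return []
    best = None          # (PRR, duration, first_slice_idx, last_slice_idx)
    for a in range(len(slices)):
        m = d = i = s = 0
        # because of MAX_DUR, only a few following slices are examined per starting slice
        for b in range(a, len(slices)):
            m += slices[b][2]; d += slices[b][3]; i += slices[b][4]; s += slices[b][5]
            duration = slices[b][1] - slices[a][0]
            if duration > MAX_DUR:
                break
            if duration < MIN_DUR:
                continue
            candidate = (prr(m, d, i, s), duration, a, b)
            if best is None or candidate[:2] > best[:2]:   # PRR first, then length
                best = candidate
    if best is None:
        return []
    _, _, a, b = best
    segment = (slices[a][0], slices[b][1], best[0])        # the optimal segment x*
    # recurse on the sub-chunks to the left (U_L) and right (U_R) of x*
    return extract_segments(slices[:a]) + [segment] + extract_segments(slices[b + 1:])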
The recurrence relation defining the time complexity of the search procedure for an audio chunk U with n slices would be:

T(n) = T(i) + T(j) + O(n)   (2)

where i and j (with i + j < n) are the number of slices in the audio chunks U_L and U_R, respectively. Note that i = 0 implies that U_L would be empty and the corresponding recursive call would not be carried out; on the other hand, if i = 1, U_L would consist of a single slice and the search would reduce to checking the duration constraints. The same stands for j and U_R. Recurrence (2) resembles that of the well-known quicksort algorithm, which, despite being O(n^2) in the worst case, has an average cost of O(n log n). Finally, if K audio files are to be processed, the time complexity of the segment extraction procedure will be in O(Σ_{k=1..K} n_k log n_k), with n_k being the number of slices in the k-th audio file.
Once the list of segments S is obtained for the whole set of BP sessions in the training set, a new set of acoustic models can be trained by using only those segments for which the provided transcription best matches the speech contents, either by requiring the PRR to be higher than a given threshold or by taking the top-ranking segments amounting to a given number of hours (e.g., 998 h). The resulting models can then be applied again to perform phone recognition, obtain new alignments and, hopefully, a better set of segments for training. Recall that we aim to collect those segments that acoustically best match the provided transcripts. However, after each iteration the models will adjust better to the provided (possibly wrong) transcripts, so that after many iterations we may eventually achieve a PRR of 100% for all segments, with no way to distinguish truly good transcripts from bad transcripts to which our models have adapted. This prevents us from running too many iterations and forces us to carefully set the threshold that separates good from bad segments after each iteration. We will come back to this issue in Section 5.
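The selection step (keeping either all segments above a PRR threshold or the top-ranking segments up to a target amount of hours) is straightforward; the sketch below illustrates both criteria, assuming segments are the (start, end, PRR) tuples produced by the extraction sketch above:

def select_by_threshold(segments, prr_threshold):
    """Keep segments whose PRR exceeds a given threshold."""
    return [seg for seg in segments if seg[2] > prr_threshold]

def select_by_hours(segments, target_hours):
    """Keep the top-ranking segments (by PRR) until a target amount of audio is reached."""
    selected, accumulated = [], 0.0
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        duration_h = (seg[1] - seg[0]) / 3600.0
        if accumulated + duration_h > target_hours:
            break
        selected.append(seg)
        accumulated += duration_h
    return selected

# e.g., keep roughly 998 h of the best-matching segments for the next training iteration:
# training_segments = select_by_hours(all_segments, 998)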
4. Experimental Setup
The acoustic models for the initial (bootstrap) phone recognizer have been trained on generic speech databases in Basque and Spanish: CommonVoice (cv-corpus-5.1-2020-06-22) [29], OpenSLR (SLR76) [36], Aditu [30] and Albayzin [31] (see
Table 2). The development and test sets of Aditu and Albayzin were used to validate and evaluate phone recognition performance. The training, development and test sets have durations of 332.21, 3.96 and 4.03 h, respectively. Note, however, that Spanish and Basque are highly imbalanced in the training set (with a 3:1 ratio). The phone error rates measured on the test sets of Aditu (Basque) and Albayzin (Spanish) were 4.6% and 6.9%, respectively.
To build the phone recognizer, an off-the-shelf, close-to-state-of-the-art, end-to-end neural-network-based ASR system is used: Facebook AI Research wav2letter++ (consolidated into Flashlight), applying the Gated ConvNet recipe presented in [37]. Note that the phone recognizer requires neither lexical models nor a language model. For the semi-supervised data collection step, all the BP plenary sessions from 2014 to 2021 (amounting to more than 1200 h) are used. Although we have access to BP sessions from 2010 onwards, the recordings prior to 2014 were stored using different formats and protocols, which prevented us from using them in this work.
The ASR system is also based on wav2letter++. In this case, besides the acoustic models, lexical and language models are also estimated, based on the minutes and translations of all BP plenary sessions from 2010 to 2021, which comprise more than 33 million words and around 279 thousand different entries. For each word in the vocabulary, a single pronunciation baseform is considered, as provided by our in-house G2P converter. A trigram language model is computed using KenLM [38] (without pruning), including close to 16 million trigrams.
4.1. Hyperparameter Tuning
Though this work was not oriented towards optimizing the wav2letter++ framework used to build our ASR systems, we found that three hyperparameters were critical for ASR performance: (1) lmweight: the language model weight, which is accumulated with the acoustic model score; (2) silscore: the silence score (penalty) added whenever a silence unit is appended to the output; and (3) wordscore: the score (penalty) added when appending a word to the output. To get the most out of the wav2letter++ framework, tuning these parameters makes a real difference. Therefore, a random walk search (see Algorithm 1) is performed to optimize ASR performance on a tuning dataset, and the optimal hyperparameters are then applied when processing a test set. Both the tuning and test datasets are independent of the training set (see Section 4.2 for details).
Algorithm 1 Random walk optimization

function RWOPT(D, M, N)    ▷ D: tuning data, M: ASR model, N: max iterations
    [initialize the hyperparameters (l, s, w), their exploration deltas and the best WER]
    [initialize the set E]    ▷ E: grid points already evaluated
    while [some delta is above its minimum] and [fewer than N evaluations performed] do
        [build the set C]    ▷ C: grid points to be evaluated at this iteration
        if [C is not empty] then
            [evaluate a randomly chosen point of C and add it to E]
            if [the evaluated point improves the best WER] then
                [move to that point and update the best WER]
            end if
        else
            [reduce the deltas]
        end if
    end while
    return [the best hyperparameter values found]
end function
The method sketched in Algorithm 1 includes the initial values of the hyperparameters (l: lmweight, s: silscore and w: wordscore), the initial values of the deltas used to explore the hyperparameter space, and the minimum values of those deltas, which mark an exit point when reached. All of them were heuristically adjusted in preliminary experiments. Note that the method also terminates when it reaches a maximum number of iterations N, which has been set to 500 in this work. Note also that two auxiliary functions are used: (1) a decoding function, which performs ASR on a dataset D using some pretrained models M and some hyperparameter values, and returns the attained Word Error Rate (WER); and (2) a sampling function, which returns a random element from a given set. Finally, note that the method involves some amount of randomness, which might produce convergence issues. To study the impact of randomness, we ran the method a number of times on different datasets and observed that the hyperparameters obtained on a given set may actually differ across runs (due to randomness), but the ASR performance attained was almost the same in all cases. This means that different hyperparameter values can be equally good and lead to the same (close to optimal) performance.
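Since the individual statements of Algorithm 1 are only partially preserved above, the following Python sketch illustrates one plausible realization of the random walk search just described. The decode_wer helper, the initial values and the neighborhood definition are illustrative assumptions, not the exact settings used here:

import random

def random_walk_opt(decode_wer, n_max=500,
                    init=(2.0, -1.0, 0.0),          # illustrative initial (lmweight, silscore, wordscore)
                    init_delta=(1.0, 1.0, 1.0),     # illustrative initial step sizes
                    min_delta=(0.05, 0.05, 0.05)):  # illustrative minimum step sizes (exit condition)
    """Plausible sketch of a random walk search over (lmweight, silscore, wordscore).
    decode_wer(l, s, w) is assumed to run ASR on the tuning set and return the WER."""
    best = tuple(init)
    delta = list(init_delta)
    best_wer = decode_wer(*best)
    evaluated = {best}                               # E: grid points already evaluated
    iterations = 1
    while any(d > dm for d, dm in zip(delta, min_delta)) and iterations < n_max:
        # C: neighboring grid points (one step along each axis) not yet evaluated
        candidates = []
        for axis in range(3):
            for sign in (-1.0, +1.0):
                point = list(best)
                point[axis] += sign * delta[axis]
                point = tuple(point)
                if point not in evaluated:
                    candidates.append(point)
        if candidates:
            point = random.choice(candidates)        # pick a random unexplored neighbor
            evaluated.add(point)
            wer = decode_wer(*point)
            iterations += 1
            if wer < best_wer:                       # move to the neighbor if it improves WER
                best, best_wer = point, wer
        else:
            delta = [d / 2.0 for d in delta]         # no unexplored neighbors: refine the grid
    return best, best_wer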
4.2. Development Dataset and Cross-Validation Procedure
A development dataset was collected and used to carry out cross-validation experiments, first to tune the wav2letter++ hyperparameters (using half of the dataset) and then to measure WER performance (using the other half), considering 20 random partitions and reporting the average WER. The development dataset comprises a set of segments extracted from the five BP sessions held in February 2022 (thus not overlapping with the training set). Segments were extracted in the same way as the training segments, meaning that they do not correspond to complete sentences but to pieces of one or two sentences. These segments were manually audited (the audio was listened to and the transcripts were fixed) only at those sections where the recognized sequence of words did not match the text in the minutes. These sections were located automatically and involved any number of substitutions, deletions and/or insertions. The transcript resulting after auditing could be either the recognized sequence of words, the text provided in the minutes or a different sequence of words matching neither of them. Finally, each segment was automatically classified as containing only Spanish, only Basque or being bilingual (probably with a code-switching event). This allowed us to disaggregate ASR performance by language. Details about this dataset are shown in Table 3.
The 9251 segments of this dataset are organized chronologically, in the same order as they were produced in the original BP sessions. To define each partition, an index k is first chosen at random between 0 and 9251, so that the half of the segments starting at position k (with indices taken modulo 9251) is assigned to the tuning set, while the remaining half is assigned to the test set. This guarantees temporal coherence within both subsets, which will likely contain different speakers and different topics, making the partition more realistic.
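This circular 50/50 split can be reproduced with a few lines of code. The sketch below is an illustration of the partitioning scheme only (the variable names are ours, not taken from the original work):

import random

def circular_split(segments, rng=random):
    """Split a chronologically ordered list into two contiguous halves (circularly).
    The half starting at a random index k goes to the tuning set, the rest to the test set."""
    n = len(segments)
    k = rng.randrange(n)                       # random starting index
    half = n // 2
    tuning = [segments[(k + offset) % n] for offset in range(half)]
    test = [segments[(k + half + offset) % n] for offset in range(n - half)]
    return tuning, test

# e.g., 20 random partitions for cross-validation:
# partitions = [circular_split(dev_segments) for _ in range(20)]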
For each partition, we compute WER performance on both the tuning and test sets, obtaining one global and three per-language WER figures (for the Basque, Spanish and bilingual subsets of segments). By averaging over the 20 partitions considered in the cross-validation experiments, we end up with eight WER results. Besides the averages, standard deviations and 95% confidence intervals for the averages (assuming normal distributions) are also computed.
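For reference, the summary statistics reported for each WER figure (mean, standard deviation and a normal-approximation 95% confidence interval over the 20 partitions) can be computed as in the short sketch below; this mirrors the description above rather than any released evaluation script:

import statistics

def summarize_wer(wer_values, z=1.96):
    """Mean, standard deviation and 95% confidence interval (normal approximation)
    of a list of per-partition WER values, e.g., one value per cross-validation partition."""
    n = len(wer_values)
    mean = statistics.mean(wer_values)
    stdev = statistics.stdev(wer_values)       # sample standard deviation
    margin = z * stdev / n ** 0.5              # half-width of the 95% CI for the mean
    return mean, stdev, (mean - margin, mean + margin)

# e.g., WERs of the Basque subset over the 20 test halves:
# mean, stdev, ci95 = summarize_wer(basque_test_wers)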