3. Proposed Fast Multichannel Parallel Acoustic Score Computation
With the baseline acoustic score computation, the number of concurrent clients is restricted by the frequent data transfer between the GPU and CPU and by the low parallelization of the GPU [39]. To support more concurrent clients in real time, this section proposes a fast multichannel parallel acoustic score computation method that increases GPU parallelization and reduces the transmission overhead.
As shown in Figure 4, the proposed fast multichannel parallel acoustic score computation uses one decoding thread per client and an additional worker thread, whereas the baseline method uses no worker thread. When an online ASR server is launched, the server creates a worker thread for GPU parallel decoding and initializes the maximum number of concurrent clients and the maximum number of GPU parallel decodings. Once the worker thread is initialized, its run-time memories are allocated, as shown in Figure 4a. These memories comprise (a) three types of CPU memory: one for input feature vectors, one for feature vectors to be decoded, and one for acoustic scores, and (b) three types of GPU memory: one for feature vectors to be decoded, one for a CSC-based matrix, and one for acoustic scores. As shown in Figure 4b, one additional run-time CPU memory, which holds the acoustic scores, is allocated when a decoder thread for a client is initialized. The sizes of the run-time memories are summarized in Table 3.
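To make the memory layout concrete, the following Python sketch allocates the worker thread's buffers under assumed names and dimensions; only the feature dimension and the output-layer size come from Section 5.1, and everything else (class name, constants, shapes) is a hypothetical illustration of the buffers summarized in Table 3, not their actual definitions.

```python
import numpy as np

# Minimal sketch of the worker thread's run-time buffers (hypothetical names
# and shapes; the actual sizes are those summarized in Table 3).
MAX_CLIENTS = 44        # maximum number of concurrent clients (assumption)
MAX_PARALLEL = 32       # maximum number of GPU parallel decodings (assumption)
FEAT_DIM = 600          # speech feature dimension (Section 5.1)
CHUNK_LEN = 20          # frames per decoding chunk (assumption)
NUM_STATES = 19901      # output-layer size of the AM (Section 5.1)

class WorkerMemories:
    def __init__(self):
        # (a) CPU memories: input features per client, merged decode features, scores per client.
        self.cpu_input_feats = np.zeros((MAX_CLIENTS, CHUNK_LEN, FEAT_DIM), np.float32)
        self.cpu_decode_feats = np.zeros((MAX_PARALLEL, CHUNK_LEN, FEAT_DIM), np.float32)
        self.cpu_scores = np.zeros((MAX_CLIENTS, CHUNK_LEN, NUM_STATES), np.float32)
        # (b) The GPU memories would mirror the decode-feature, CSC-matrix, and score
        # buffers on the device; they are allocated once and reused every decoding turn.
```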
Whenever a decoder thread is idle and the feature buffer of its client contains more speech features than the decoding length, the decoder thread obtains a speech feature vector of that length and stores it in the CPU input-feature memory of the worker thread. For the $i$-th client, the vector is stored at the client-specific offset index of the input-feature memory. Then, the decoder thread waits for the acoustic scores to be calculated by the worker thread. On the other hand, whenever the worker thread is idle and there are buffered speech feature vectors in the input-feature memory, the worker thread pops $k$ speech feature vectors from the memory in first-in, first-out (FIFO) order, where $k$ is the number of feature vectors to be decoded at a time and is bounded by the maximum number of GPU parallel decodings. The obtained $k$ vectors are merged in the CPU decode-feature memory and then transmitted from the CPU to the GPU. On the GPU, the transmitted vectors are reformed into CSC-based vectors, which are merged into a single CSC-based matrix in the cascaded form

$$ M = \left[\, M_1 \;\; M_2 \;\; \cdots \;\; M_k \,\right], $$

where $M_i$ indicates the CSC-based matrix of the $i$-th merged speech feature vector. Then, the matrix is normalized using an LDA-based transform [37], and the acoustic scores are calculated into the GPU score memory using the CSC BLSTM AM. The acoustic scores take the corresponding cascaded form

$$ S = \left[\, s_1 \;\; s_2 \;\; \cdots \;\; s_k \,\right], $$

where $s_i$ denotes the acoustic scores of the $i$-th merged speech feature vector. Next, the acoustic scores are transmitted from the GPU to the CPU: if $s_i$ belongs to the $m$-th client, it is stored in the CPU score memory at the offset index of the $m$-th client. When the waiting decoder thread detects the acoustic scores at the corresponding offset index of Equation (8), it copies them into its local memory and proceeds with the subsequent steps as in the baseline method. The described procedure is shown in Figure 4c.
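The decoder/worker interaction described above is essentially a producer-consumer pattern. The sketch below is a minimal Python illustration under assumed names: `feat_queue`, `score_slots`, `score_ready`, and `compute_scores_on_gpu` are hypothetical stand-ins, and the last one replaces the CSC reformatting, LDA normalization, and CSC BLSTM forward pass with a dummy computation.

```python
import queue
import threading
import numpy as np

MAX_PARALLEL = 32                      # upper bound on k (assumption)
feat_queue = queue.Queue()             # FIFO of (client_id, feature_chunk)
score_slots = {}                       # client_id -> acoustic scores
score_ready = {}                       # client_id -> threading.Event

def compute_scores_on_gpu(merged_feats):
    # Placeholder for the CSC-based reformatting, LDA normalization, and
    # BLSTM forward pass described above; returns dummy scores here.
    num_states = 19901
    return np.zeros((merged_feats.shape[0], merged_feats.shape[1], num_states), np.float32)

def decoder_thread_step(client_id, feature_chunk):
    """Called by a decoder thread once it has buffered enough features."""
    score_ready[client_id] = threading.Event()
    feat_queue.put((client_id, feature_chunk))   # push to the worker thread
    score_ready[client_id].wait()                # wait until the scores are written back
    return score_slots[client_id]                # continue with Viterbi decoding

def worker_thread_loop():
    while True:
        # Pop up to MAX_PARALLEL chunks in FIFO order (at least one, blocking).
        batch = [feat_queue.get()]
        while len(batch) < MAX_PARALLEL and not feat_queue.empty():
            batch.append(feat_queue.get())
        clients, chunks = zip(*batch)
        merged = np.stack(chunks)                # merge k vectors: one CPU->GPU transfer
        scores = compute_scores_on_gpu(merged)   # cascaded acoustic scores for the batch
        for m, client_id in enumerate(clients):  # one GPU->CPU transfer, then scatter
            score_slots[client_id] = scores[m]   # store at the client's slot
            score_ready[client_id].set()         # wake the waiting decoder thread
```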
As shown in Table 4, when the worker thread is ready and the input-feature memory contains k speech feature vectors, the transmissions consist of the k merged speech feature vectors from the CPU to the GPU and the corresponding acoustic scores from the GPU to the CPU, respectively. In addition, the transmission frequency varies according to the size of k.
Assuming that N is the number of concurrent clients and that k is the number of speech feature vectors decoded at a time by the proposed method, the main differences between the baseline and proposed acoustic score computation methods are as follows:
Decoding subject(s): In the baseline method, the decoder thread of each client calculates the acoustic scores, whereas in the proposed method the additional worker thread does so.
Transmission frequency: A CPU-GPU transmission occurs once per speech feature vector per client in the baseline method, whereas it occurs once per batch of k merged vectors in the proposed method. Therefore, the proposed method reduces the transfer frequency by a factor of k (a worked example follows this list).
Transmission size: For the transmission from the CPU to the GPU, the baseline method transmits the feature data of one vector for each client, whereas the proposed method transmits the k merged vectors for each decoding turn; the total transmission data size to the GPU is reduced by the proposed method. On the other hand, for the transmission from the GPU to the CPU, the baseline method transmits the acoustic scores of one vector for each client, whereas the proposed method transmits the scores of all k vectors for each decoding turn; the total transmission size to the CPU is equal.
Decoding size at a time: The baseline method decodes one speech feature vector at a time, whereas the proposed method decodes k vectors, which leads to higher GPU parallelization.
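The following minimal sketch makes the transfer-frequency argument concrete by counting CPU-GPU transfers for assumed values of N and k; none of the numbers are taken from the paper.

```python
# Illustrative transfer counting under assumed values (not from the paper).
num_clients = 40          # N: concurrent clients (assumption)
chunks_per_client = 100   # decoded speech feature vectors per client (assumption)
k = 20                    # vectors merged per decoding turn (assumption)

total_vectors = num_clients * chunks_per_client
baseline_transfers = 2 * total_vectors          # one CPU->GPU and one GPU->CPU transfer per vector
proposed_transfers = 2 * (total_vectors // k)   # one pair of transfers per k merged vectors

print(baseline_transfers, proposed_transfers)   # 8000 vs. 400: reduced by a factor of k
```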
4. Proposed DNN-Based VAD Method for Low Latency Decoding
Viterbi decoding involves two processes: one estimates the probabilities of the states on all possible paths, and the other finds an optimal path by backtracking the states with the highest probability. The ASR system yields results only after both processes are completed, usually at the end of an utterance.
In an online ASR system recognizing long utterances, the end point of an utterance is not known in advance, and deciding the backtracking point affects the user experience in terms of response time. If backtracking is performed infrequently, the user receives a delayed response; in the opposite case, the beam search may not find the optimal path that reflects the language model contexts.
In our system, VAD based on an acoustic model for ASR is used to detect short pauses in a continuous utterance, which trigger backtracking. In particular, the acoustic model is built with a deep neural network; hence, we call it DNN-VAD. Here, DNN includes not only a fully connected DNN but also all types of deep models, including LSTM and BLSTM. As explained in the previous sections, our ASR system uses a BLSTM to compute the posterior probability of each triphone state for each frame. By re-using these values, we can also estimate the probability of non-silence for a given frame at little additional computational cost.
Each output node of the DNN model can be mapped to a state of either a non-silence or a silence phone. Let the output of the DNN model at the $i$-th node for the frame at time $t$ be $p_i(t)$. Then, the speech and silence probabilities of the given frame are computed as

$$ P_{\mathrm{speech}}(t) = \sum_{i \in \mathcal{I}_{\mathrm{speech}}} p_i(t), \qquad P_{\mathrm{sil}}(t) = \sum_{i \in \mathcal{I}_{\mathrm{sil}}} p_i(t), $$

where $\mathcal{I}_{\mathrm{speech}}$ and $\mathcal{I}_{\mathrm{sil}}$ denote the sets of output nodes mapped to non-silence and silence states, respectively, and the log likelihood ratio is

$$ \mathrm{LLR}(t) = \log \frac{P_{\mathrm{speech}}(t)}{P_{\mathrm{sil}}(t)}. $$

The frame at time $t$ is decided to be a silence frame if $\mathrm{LLR}(t)$ is smaller than a predefined threshold. In addition, for smoothing purposes, the ratio of silence frames within a window of length $W$ ending at $t$ is computed and compared to a predefined threshold $\theta$.
All of the computations above are performed within each minibatch, and the frame at time $t$ is regarded as a short pause in the utterance when $\mathrm{LLR}(t)$ is smaller than its threshold and the silence ratio within the window exceeds $\theta$. As will be shown in Section 5.3, frequent backtracking reduces the response time but also degrades the recognition accuracy. Thus, a minimum interval between detected short pauses is set to control this trade-off.
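The frame-level decision can be sketched directly from the BLSTM posteriors that are already computed for ASR. In the following Python sketch, the silence-state index set, the thresholds, and the window length are illustrative assumptions, not values from the paper.

```python
import numpy as np

def dnn_vad_short_pause(posteriors, sil_idx, llr_thresh=-1.0, ratio_thresh=0.8, win=30):
    """Sketch of DNN-VAD over one minibatch of frames.

    posteriors: (num_frames, num_states) AM outputs already computed for ASR.
    sil_idx:    indices of output nodes mapped to silence states (assumption).
    The threshold and window values are illustrative, not from the paper.
    """
    sil_mask = np.zeros(posteriors.shape[1], dtype=bool)
    sil_mask[sil_idx] = True
    p_sil = posteriors[:, sil_mask].sum(axis=1)
    p_speech = posteriors[:, ~sil_mask].sum(axis=1)
    llr = np.log(p_speech + 1e-10) - np.log(p_sil + 1e-10)    # log likelihood ratio
    is_sil = llr < llr_thresh                                  # per-frame silence decision

    short_pause = np.zeros(len(llr), dtype=bool)
    for t in range(win - 1, len(llr)):
        ratio = is_sil[t - win + 1:t + 1].mean()               # smoothed silence ratio
        short_pause[t] = is_sil[t] and ratio > ratio_thresh    # candidate backtracking point
    return short_pause
```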
5. Experiment
We select Korean as the target language for the experiments on the proposed methods (although our experiments are based on Korean speech recognition, the proposed methods can be applied to CSC BLSTM-based speech recognition for any language). All experiments are performed on two Intel Xeon Silver 4214 CPUs @ 2.20 GHz and a single NVIDIA GeForce RTX 2080 Ti GPU.
Section 5.1 describes the corpus and baseline ASR system and compares the performance of the ASR systems employing different AMs. Next,
Section 5.2 and
Section 5.3 present the performances of the proposed parallel acoustic score computation method and the DNN-VAD method, respectively.
5.1. Corpus and Baseline Korean ASR
We use 3440 h of Korean speech and its transcriptions to train the baseline Korean ASR system. The speech data comprise approximately 19 million utterances recorded under various conditions in terms of speaker, noise environment, recording device, and recording script. Each utterance is sampled at 16 kHz, and no further augmentation methods are adopted. To evaluate the proposed methods, we prepare a test set recorded from documentary programs. The recordings include the voices of narrators and interviewees, with and without various background music and noises. The recordings are manually split into 69 segments, each 29.24 s long on average and 33.63 min in total.
Each utterance of the training speech data is converted into 600-dimensional speech features. With the extracted speech features, a CSC BLSTM AM is trained using the Kaldi toolkit [40], where the chunk and context sizes of the CSC are 20 and 40 ms, respectively. The AM comprises one input layer, five BLSTM layers, a fully connected layer, and a soft-max layer. Each BLSTM layer comprises 640 BLSTM cells and 128 projection units, while the output layer comprises 19901 units. For the language model, 38 GB of Korean text data are first preprocessed using text-normalization and word segmentation methods [41], and then the most frequent 540k sub-words are obtained from the text data (for Korean, a sub-word unit is commonly used as the basic unit of an ASR system [42,43]). Next, we train a back-off trigram of the 540k sub-words [41] using the SRILM toolkit [44,45].
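For reference, the acoustic model topology described above can be summarized in a small configuration sketch; the dictionary keys are our own labels rather than Kaldi options.

```python
# Summary of the CSC BLSTM AM described above (labels are ours, not Kaldi options).
csc_blstm_am = {
    "input_feature_dim": 600,
    "csc_chunk_ms": 20,
    "csc_context_ms": 40,
    "blstm_layers": 5,
    "blstm_cells_per_layer": 640,
    "projection_units": 128,
    "fully_connected_layers": 1,
    "softmax_output_units": 19901,
}
```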
During decoding, the minibatch size is set to 2 s. Although a larger minibatch size increases the decoding speed owing to the bulk computation on the GPU, the latency also increases. We settle on a 2-s minibatch size as a compromise between decoding speed and latency [14,26,36].
For comparison with the baseline ASR system, we additionally train two types of AMs: (a) a DNN-based AM and (b) an LSTM-based AM. The DNN-based AM comprises one input layer, eight fully connected hidden layers, and a soft-max layer; each hidden layer comprises 2048 units, and the output layer comprises 19901 units. The LSTM-based AM consists of one input layer, five LSTM layers, a fully connected layer, and a soft-max layer; each LSTM layer consists of 1024 LSTM cells and 128 projection units, and the output layer consists of 19901 units. The ASR accuracy is measured using the syllable error rate (SyllER), calculated as

$$ \mathrm{SyllER} = \frac{S + D + I}{N} \times 100\,(\%), $$

where $S$, $D$, $I$, and $N$ are the numbers of substituted syllables, deleted syllables, inserted syllables, and reference syllables, respectively. As shown in Table 5, the BLSTM AM achieves an error rate reduction (ERR) of 20.66% on the test set compared to the DNN-based AM, and an ERR of 11.56% compared to the LSTM-based AM. Therefore, we employ the BLSTM AM in our baseline ASR system to achieve better ASR accuracy (for comparison, Korean speech recognition experiments on the same data set using the Google Cloud API achieved an average SyllER of 14.69%).
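As a quick numeric illustration of these metrics, the sketch below computes SyllER from hypothetical counts and the relative error rate reduction between two error rates; the counts and the error-rate pair are illustrative, and the ERR definition (relative reduction with respect to the baseline) is an assumption consistent with common usage.

```python
def syll_er(s, d, i, n):
    """Syllable error rate (%) from substitution, deletion, insertion, and reference counts."""
    return (s + d + i) / n * 100.0

def err(baseline_er, new_er):
    """Relative error rate reduction (%) of new_er with respect to baseline_er (assumed definition)."""
    return (baseline_er - new_er) / baseline_er * 100.0

# Illustrative values only (not taken from Table 5).
print(syll_er(s=120, d=40, i=30, n=1500))   # about 12.67% SyllER
print(err(15.0, 12.0))                      # a 15% -> 12% change corresponds to a 20% ERR
```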
Next, we evaluate the multichannel performance of the ASR systems employing the three AMs by examining the maximum number of concurrent clients for which an ASR system can run in real time. That is, multiple clients are connected to an ASR server in parallel, and each client requests decoding of the test set. We then measure the real-time factor (RTF) for each client as

$$ \mathrm{RTF} = \frac{T_{\mathrm{dec}}}{T_{\mathrm{audio}}}, $$

where $T_{\mathrm{dec}}$ is the time taken to decode the audio and $T_{\mathrm{audio}}$ is its duration. A given number of concurrent clients is considered to be supported in real time if the average real-time factor over those clients is smaller than 1.0. As shown in Table 6, the BLSTM AM supports 22 concurrent clients for the test set, whereas the DNN- and LSTM-based AMs support more concurrent clients. Hereafter, experimental comparisons are performed only with the LSTM-based AM, as our ASR system is optimized for uni- and bidirectional LSTM-based AMs.
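A minimal sketch of the real-time check described above, assuming per-client decoding times have already been measured; the listed times are illustrative, and only the 33.63-min test-set duration comes from Section 5.1.

```python
def average_rtf(decode_times, audio_durations):
    """Average real-time factor over concurrent clients."""
    rtfs = [d / a for d, a in zip(decode_times, audio_durations)]
    return sum(rtfs) / len(rtfs)

# Illustrative measurements for three concurrent clients (seconds).
decode_times = [1650.0, 1710.0, 1580.0]
audio_durations = [2017.8] * 3          # 33.63 min of test audio per client
supported = average_rtf(decode_times, audio_durations) < 1.0
print(supported)                        # True: this client count runs in real time
```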
Moreover, we evaluate the CPU and GPU usages (%) of the ASR systems using the baseline acoustic score computation method of Section 2 for (a) the LSTM-based AM and (b) the BLSTM AM. The experiments are performed with the test set. As shown in Figure 5, the average CPU and GPU usages are 83.26% and 51.71%, respectively, when the LSTM-based AM is employed, and 27.34% and 60.27% when the BLSTM AM is employed. The low GPU usage can result from the frequent data transfer and the low GPU parallelization. Moreover, the low CPU usage observed for the BLSTM AM arises because the CPU spends a long time waiting for the acoustic score computation to complete.
5.2. Experiments on the Proposed Fast Parallel Acoustic Score Computation Method
The proposed fast parallel acoustic score computation method can be adopted by replacing the baseline acoustic score computation method, with no AM changes, in the baseline Korean ASR system. As shown in Table 7, the ASR system using the proposed method of Section 3 achieves the same SyllER as the system using the baseline method of Section 2.3. No performance degradation occurs because we only modify how the acoustic scores are calculated, by accelerating GPU parallelization. Moreover, the ASR system using the proposed method supports 22 more concurrent clients for the test set compared to the system using the baseline method. Therefore, we conclude that the proposed acoustic score computation method increases the number of concurrent clients with no performance degradation.
To analyze the effects of the proposed acoustic score computation method, we compare the CPU and GPU usages (%) of the ASR systems using the baseline and proposed methods on the test set. The average CPU and GPU usages are 78.58% and 68.17%, respectively, when the proposed method is used. Comparing Figure 5b and Figure 6a, the average CPU and GPU usages are improved by 51.24% and 7.90%, respectively. It can be concluded that the proposed method reduces the processing time of the GPU and the waiting time of the CPU by reducing the transfer overhead and increasing GPU parallelization. In addition, we examine the number (k) of parallel decoded feature vectors at each time stamp when using the proposed method, as shown in Figure 6b. The number of parallel decoded feature vectors varies from 2 to 26, depending on the subsequent step, the optimal path search of Viterbi decoding.
For further improvement in the multichannel performance, optimization methods such as beam pruning can be applied. In this study, we apply a simple frame skip method during token passing-based Viterbi decoding; that is, token propagation is performed only at the odd time stamps. In the experiments on the proposed acoustic score computation, the ASR system combined with the frame skip method supports up to 59 concurrent clients, although the SyllER is degraded by 4.94% relative for the test set compared to the ASR system without the frame skip method, as shown in Table 8.
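A minimal sketch of the frame skip idea, assuming a generic token-passing hook; the function signature and the toy propagation rule are hypothetical, not the actual decoder.

```python
import numpy as np

def viterbi_decode_with_frame_skip(acoustic_scores, propagate_tokens, tokens):
    """Token passing in which propagation runs only at odd time stamps (frame skip).

    acoustic_scores: (num_frames, num_states) array; propagate_tokens advances the
    active tokens by one frame given that frame's scores (hypothetical decoder hook).
    """
    for t in range(acoustic_scores.shape[0]):
        if t % 2 == 0:
            continue                      # skip token propagation at even time stamps
        tokens = propagate_tokens(tokens, acoustic_scores[t])
    return tokens

# Toy usage: count how many frames are actually propagated for 10 frames.
scores = np.zeros((10, 4), np.float32)
print(viterbi_decode_with_frame_skip(scores, lambda tok, frame: tok + 1, 0))   # 5
```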
Again, we measure the CPU and GPU usages (%) of the ASR systems employing the proposed method with and without the frame skip method on the test set, as shown in Figure 7a. The average CPU and GPU usages are measured as 47.24% and 71.87%, respectively. Note that the frame skip method unburdens the CPU load, and thus the GPU usage is accordingly improved. Moreover, Figure 7b compares the number of active hypotheses during Viterbi decoding.
In addition, we examine the number (k) of parallel decoded feature vectors at each time stamp when using the proposed method with the frame skip method, as shown in Figure 7c. The number of parallel decoded feature vectors varies from 22 to 35. Compared to Figure 6b, the number of parallel decoded vectors increases because of the reduced computation during Viterbi decoding when the frame skip method is combined.
5.3. Experiments on DNN-VAD
DNN-VAD is used to reduce the waiting time for a user to receive the ASR results of what they said by triggering backtracking at possible pauses within the user's utterance. However, frequent backtracking at improper times can degrade the recognition performance. Hence, in the experiments, the minimum interval between two consecutive backtracking points is set to various values.
Table 9 shows the lengths of the segments produced by DNN-VAD and the recognition accuracy with and without DNN-VAD for test set 1. For example, when the minimum interval is limited to 6 s, an utterance is split into 3.6 segments of 8.2 s each, on average, and the word error rate (WER) is 11.13, which is only slightly higher than the 11.02 obtained when VAD is not used.
As the minimum interval decreases, the number of segments increases and the length of each segment decreases, which means more frequent backtracking and smaller user-perceived latencies. The accuracy degrades only slightly, which indicates that the backtracking points are selected reasonably. An internal investigation confirms that the segments are split mostly at pauses between phrases.
To measure the waiting time from the viewpoint of users, the user-perceived latency suggested in Reference [19] is used. The user-perceived latency is measured for each uttered word and is estimated empirically as the difference between the time at which a transcribed word becomes available to the user and the time at which that word occurs in the original audio. The alignment information in the recognition result is used as the timestamp of a word.
The average user-perceived latency is 11.71 s for test set 1 without DNN-VAD, which is very large since all results are received only after the end of a segment is sent to the server. When DNN-VAD is applied, the average latency is reduced to 5.41 s with a minibatch of 200 frames and 3.09 s with a minibatch of 100 frames. For a detailed analysis, the histogram of the latency for each word is shown in
Figure 8.
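A minimal sketch of this latency measurement, assuming word times from the alignment and result-arrival times at the client are available; all values below are illustrative.

```python
def user_perceived_latency(word_audio_times, word_result_times):
    """Average per-word latency: result arrival time minus the word's time in the original audio.

    word_audio_times:  when each word occurs in the original audio (s), from the alignment.
    word_result_times: when the transcribed word becomes available to the user (s).
    """
    latencies = [r - a for a, r in zip(word_audio_times, word_result_times)]
    return sum(latencies) / len(latencies)

# Illustrative values only: three words whose results all arrive after one backtracking point.
print(user_perceived_latency([1.2, 2.0, 2.8], [4.5, 4.5, 4.5]))   # 2.5 s on average
```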
6. Conclusions
In this paper, we presented a server-client-based online ASR system employing a BLSTM AM, a state-of-the-art AM. Accordingly, we adopted a CSC-based training and decoding approach for the BLSTM AM and proposed the following: (a) a parallel acoustic score computation method to support more concurrent clients and (b) DNN-VAD to reduce the waiting time for a user to receive the recognition results. In the designed server-client-based ASR system, a client captures the audio signal from a user, sends the audio data to the ASR server, receives the decoded text from the server, and presents it to the user. The client can be deployed on various devices, from low to high performance. The server, on the other hand, performs speech recognition using high-performance resources. That is, the server manages a main thread and a decoder thread for each client, and an additional worker thread for the proposed parallel acoustic score computation method. The main thread communicates with the connected client, extracts speech features, and buffers them. The decoder thread performs speech recognition and sends the decoded text to the connected client. Speech recognition is performed in three main steps: acoustic score computation using the CSC BLSTM AM, DNN-VAD to detect short pauses in a long continuous utterance, and Viterbi decoding to search for the optimal text using an LM. To handle more concurrent clients in real time, we first proposed the parallel acoustic score computation method, which merges the speech feature vectors collected from multiple clients and calculates the acoustic scores with the merged data, thereby reducing the amount of data transferred between the CPU and GPU and increasing GPU parallelization. Second, we proposed DNN-VAD to detect short pauses in an utterance for a low-latency response to the user. The Korean ASR experiments conducted on broadcast audio data showed that the proposed acoustic score computation method increased the maximum number of concurrent clients from 22 to 44. Furthermore, by applying the frame skip method during Viterbi decoding, the maximum number of concurrent clients was increased to 59, although the SyllER was degraded from 11.94% to 12.53%. Moreover, the average user-perceived latency was reduced to 5.41 s and 3.09 s with minibatches of 200 and 100 frames, respectively, when the proposed DNN-VAD was used.