Article

Performance Evaluation of Offline Speech Recognition on Edge Devices

Santosh Gondi and Vineel Pratap
1 Facebook Inc., Menlo Park, CA 94025, USA
2 Facebook AI Research, Menlo Park, CA 94025, USA
* Author to whom correspondence should be addressed.
Electronics 2021, 10(21), 2697; https://doi.org/10.3390/electronics10212697
Submission received: 23 September 2021 / Revised: 30 October 2021 / Accepted: 1 November 2021 / Published: 4 November 2021
(This article belongs to the Special Issue Human Computer Interaction for Intelligent Systems)

Abstract

Deep learning–based speech recognition applications have made great strides in the past decade. Deep learning–based systems have evolved to achieve higher accuracy while using simpler end-to-end architectures, compared to their predecessor hybrid architectures. Most of these state-of-the-art systems run on backend servers with large amounts of memory and CPU/GPU resources. The major disadvantage of server-based speech recognition is the lack of privacy and security for user speech data. Additionally, because of network dependency, this server-based architecture cannot always be reliable, performant and available. In contrast, offline speech recognition on client devices overcomes these issues. However, resource constraints on smaller edge devices may pose challenges for achieving state-of-the-art speech recognition results. In this paper, we evaluate the performance and efficiency of transformer-based speech recognition systems on edge devices. We evaluate inference performance on two popular edge devices, Raspberry Pi and Nvidia Jetson Nano, running on CPU and GPU, respectively. We conclude that with PyTorch mobile optimization and quantization, the models can achieve real-time inference on the Raspberry Pi CPU with a small degradation in word error rate. On the Jetson Nano GPU, the inference latency is three to five times better than on Raspberry Pi. The word error rate on the edge is still higher, but not far behind that of server-side inference.

Graphical Abstract

1. Introduction

Automatic speech recognition (ASR) is a process of converting speech signals to text. It has a large number of real-world use cases, such as dictation, accessibility, voice assistants, AR/VR applications, captioning of videos, podcasts, searching audio recordings, and automated answering services, to name a few. On-device ASR makes more sense for many use cases where an internet connection is not available or cannot be used. Private and always-available on-device speech recognition can unblock many such applications in healthcare, automotive, legal and military fields, such as taking patient diagnosis notes, in-car voice command to initiate phone calls, real-time speech writing, etc.
Deep learning–based speech recognition has made great strides in the past decade [1]. It is a subfield of machine learning which essentially mimics the neural network structure of the human brain for pattern matching and classification. It typically consists of an input layer, an output layer and one or more hidden layers. The learning algorithm adjusts the weights between different layers, using gradient descent and backpropagation until the required accuracy is met [1,2]. The major reason for its popularity is that it does not need feature engineering. It autonomously extracts the features based on the patterns in the training dataset. The dramatic progress of deep learning in the past decade can be attributed to three main factors [3]: (1) large amounts of transcribed data sets; (2) rapid increase in GPU processing power; and (3) improvements in machine learning algorithms and architectures. Computer vision, object detection, speech recognition and other similar fields have advanced rapidly because of the progress of deep learning.
The majority of speech recognition systems run in backend servers. Since audio data need to be sent to the server for transcription, the privacy and security of the speech cannot be guaranteed. Additionally, because of the reliance on a network connection, the server-based ASR solution cannot always be reliable, fast and available.
On the other hand, on-device-based speech recognition inherently provides privacy and security for the user speech data. It is always available and improves the reliability and latency of the speech recognition by precluding the need for network connectivity [4]. Other non-obvious benefits of edge inference are energy and battery conservation for on-the-go products by avoiding Bluetooth/Wi-Fi/LTE connection establishments for data transfers.
Inferencing on the edge can be achieved either by running computations on the CPU or on hardware accelerators, such as a GPU, a DSP or a dedicated neural processing engine. The benefits of and demand for on-device ML are driving modern phones to include dedicated neural engines or tensor processing units. For example, Apple iOS 15 will support on-device speech recognition on iPhones with the Apple Neural Engine [5]. The Google Pixel 6 phone comes equipped with a tensor processing unit to handle on-device ML, including speech recognition [6]. Though dedicated neural hardware may become a general trend in the future, at least in the short term, the large majority of IoT, mobile and wearable devices will not have such dedicated hardware for on-device ML. Hence, training the models on the backend and then pre-optimizing them for CPU- or general-purpose GPU-based edge inferencing is a practical near-term solution for on-edge inference [4].
In this paper, we evaluate the performance of ASR on Raspberry Pi and Nvidia Jetson Nano. Since the CPU, GPU and memory specifications of these two devices are similar to those of typical edge devices, such as smart speakers and smart displays, the evaluation outcomes in this paper should be representative of results on a typical edge device. Related to our work, large vocabulary continuous speech recognition was previously evaluated on an embedded device using CMU SPHINX-II [7]. In [8], the authors evaluated on-device speech recognition performance with DeepSpeech [9], Kaldi [10] and Wav2Letter [11] models. Moreover, most on-the-edge evaluation papers focus on computer vision tasks using CNNs [12,13]. To the best of our knowledge, there have been no evaluations of any type of transformer-based speech recognition model on low-power edge devices using both CPU- and GPU-based inferencing. The major contributions of this paper are as follows:
  • We present the steps for preparing pre-trained PyTorch models and running CPU- and GPU-based inference on edge devices.
  • We measure and analyze the accuracy, latency and computational efficiency of ASR inference with transformer-based models on Raspberry Pi and Jetson Nano.
  • We also provide a comparative analysis of inference between CPU- and GPU-based processing on edge.
The rest of the paper is organized as follows: In the background section, we discuss ASR and transformers. In the experimental setup, we go through the steps for preparing the models and setting up both the devices for inferencing. We highlight some of the challenges we faced while setting up the devices. We go over the accuracy, performance and efficiency metrics in the results section. Finally, we conclude with the summary and outlook.

2. Background

ASR is the process of converting audio signals to text. In simple terms, the audio signal is divided into frames and passed through a fast Fourier transform to generate feature vectors. These go through an acoustic model to output the probability distribution of phonemes. Then, a decoder with a lexicon, vocabulary and language model is used to generate the word n-gram distributions. The hidden Markov model (HMM) [14] with a Gaussian mixture model (GMM) [15] was considered the mainstream ASR algorithm until a decade ago. Conventionally, the featurizer, acoustic model, pronunciation model and decoder were all built separately and composed together to create an ASR system. Hybrid HMM–DNN approaches replaced the GMM with deep neural networks, with significant performance gains [16]. Further advances used CNN- [17,18] and RNN-based [19] models to replace some or all components in the hybrid DNN [1,2] architecture. Over time, ASR model architectures have evolved to convert audio signals to text directly; these are called sequence-to-sequence models. These architectures have simplified the training and implementation of ASR models. The most successful end-to-end ASR approaches are based on connectionist temporal classification (CTC) [20], the recurrent neural network (RNN) transducer (RNN-T) [19], and attention-based encoder–decoder architectures [21].
The transformer is a sequence-to-sequence architecture originally proposed for machine translation [22]. When used for ASR, the input to the transformer is audio frames instead of the text input used in the translation use case. The transformer uses multi-head attention and positional embeddings. It learns sequential information through a self-attention mechanism instead of the recurrent connections used in RNNs. Since their introduction, transformers have increasingly become the model of choice for NLP problems. Powerful natural language processing (NLP) models, such as GPT-3 [23], BERT [24], and AlphaFold 2 [25], the model that predicts the structures of proteins from their genetic sequences, are all based on the transformer architecture. The major advantages of transformers over RNN/LSTM [26] models are that they process the whole sequence at once, enabling parallel computation and hence reducing training time, and that they do not suffer from long-range dependency issues, which makes them more accurate. Since the transformer processes the whole sequence at once, it is not directly suitable for streaming-based applications, such as continuous dictation. In addition, its decoding complexity is quadratic in the input sequence length because attention is computed pairwise for each input position. In this paper, we focus on the general viability and computational cost of transformer-based ASR on audio files. In the future, we plan to explore streaming-capable transformer architectures on the edge.

2.1. Wav2Vec 2.0 Model

Wav2Vec 2.0 is a transformer-based speech recognition model trained using a self-supervised method with contrastive training [27]. The raw audio is encoded using a multilayer convolutional network, the output of which is fed to the transformer network to build latent speech representations. Some of the input representations are masked during training. The model is then fine-tuned with a small set of labeled data, using the connectionist temporal classification (CTC) [20] loss function. The great advantage of Wav2Vec 2.0 is its ability to learn from unlabeled data, which is tremendously useful for training speech recognition for languages with very limited labeled audio. For the remainder of this paper, we refer to the Wav2Vec 2.0 model as Wav2Vec to reduce verbosity. In our evaluation, we use a pre-trained base Wav2Vec model, which was trained on 960 hr of unlabeled LibriSpeech audio, and we evaluate versions fine-tuned on 100 hr and 960 hr of labeled data.
Figure 1 shows the simplified flow of the ASR process with this model.

2.2. Speech2Text Model

The Speech2Text model is a transformer-based speech recognition model trained using a supervised method [28]. The transformer architecture is based on [22], with the addition of an input subsampler. The purpose of the subsampler is to downsample the audio sequence to match the input dimensions of the transformer encoder. The model is trained on the LibriSpeech 960 hr labeled training set. Unlike Wav2Vec, which takes raw audio samples as input, this model accepts 80-channel log Mel filter bank features extracted with a 25 ms window size and 10 ms shift. Additionally, utterance-level cepstral mean and variance normalization (CMVN) [29] is applied to the input frames before they are fed to the subsampler. The decoder uses a 10,000 unigram vocabulary.
Figure 2 shows the simplified flow of the ASR process with this model.

3. Experimental Setup

3.1. Model Preparation

We use PyTorch models for evaluation. PyTorch is an open-source machine learning framework based on the Torch library. Figure 3 shows the steps for preparing the models for inferencing on edge devices.
We first go through a few of the PyTorch tools and APIs used in our evaluation.

3.1.1. TorchScript

TorchScript is the means by which PyTorch models can be optimized, serialized and saved in intermediate representation (IR) format. torch.jit (https://pytorch.org/docs/stable/jit.html (accessed on 30 October 2021)) APIs are used for converting, saving and loading PyTorch models as ScriptModules. TorchScript itself is a subset of the Python language. As a result, sometimes, a model written in Python needs to be simplified to convert it into a script module. The TorchScript module can be created either using tracing or scripting methods. Tracing works by executing the model with sample inputs and capturing all computations, whereas scripting performs static inspection to go through the model recursively. The advantage of scripting over tracing is that it correctly handles the loops and control statements in the module. A saved script module can then be loaded either in a Python or C++ environment for inferencing purposes. For our evaluation, we generated ScriptModules for both Speech2Text and Wav2Vec models after applying any valid optimizations for specific devices.
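As a concrete illustration of this workflow, the following minimal sketch scripts a toy module and saves it for on-device use; the module, input shape and file name are placeholders rather than the actual ASR models evaluated here.

import torch

# Minimal sketch of the TorchScript workflow; the toy module below is a
# placeholder for the actual Speech2Text/Wav2Vec generator modules.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU()).eval()

# Scripting statically inspects the module and preserves loops/control flow.
scripted = torch.jit.script(model)

# Tracing records the operations executed for a sample input instead.
traced = torch.jit.trace(model, torch.randn(1, 80))

# Save the intermediate representation and load it back (on the edge device).
scripted.save("model_scripted.pt")
loaded = torch.jit.load("model_scripted.pt")
with torch.no_grad():
    output = loaded(torch.randn(1, 80))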

3.1.2. PyTorch Mobile Optimizations

PyTorch provides a set of APIs for optimizing models for mobile platforms. It uses module fusing, operator fusing and quantization, among other techniques, to optimize the models. We apply dynamic quantization to the models used in this experiment. With dynamic quantization, the weights are converted to a reduced-precision integer representation, while the scale factors for activations are determined dynamically based on the data range observed at runtime. This reduces the model size and allows the use of higher-throughput math operations on the CPU or GPU.
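The sketch below shows how these two steps can be combined; the toy module stands in for the actual ASR models, quantize_dynamic stores the linear-layer weights as int8, and optimize_for_mobile runs the fusing passes on the scripted module.

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hedged sketch of dynamic quantization plus mobile optimization; the toy
# module is a placeholder for the Wav2Vec or Speech2Text model.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 32)).eval()

# Weights of linear layers are stored as int8; activation scale factors are
# determined at runtime from the observed data range.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Script the quantized model and apply the mobile optimization passes
# (module/operator fusing, etc.) before saving it for the device.
scripted = torch.jit.script(quantized)
optimized = optimize_for_mobile(scripted)
optimized.save("model_quantized_mobile.pt")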

3.1.3. Models

We evaluated the Speech2Text and Wav2Vec transformer-based models on Raspberry Pi and Nvidia Jetson Nano. Inference on Raspberry Pi happens on CPU, while on Jetson Nano, it happens on GPU, using CUDA APIs. Given the limited RAM, CPU, and storage on these devices, we make use of Google Colab for importing, optimizing and saving the model as a TorchScript module. The saved modules are copied to Raspberry Pi and Jetson Nano for inferencing. On Raspberry Pi, which uses CPU-based inference, we evaluate both quantized and unquantized models. On Jetson Nano, we only evaluate unquantized models since CUDA only supports floating point operations.

Speech2Text Model

The Speech2Text pre-trained model is imported from fairseq (https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text (accessed on 30 October 2021)). Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for speech and text tasks. We needed to make minor syntactical changes, such as Python type hints, to export the generator model as a TorchScript module. We have used s2t_transformer_s small architecture for this evaluation. The decoding uses a beam search decoder with a beam size of 5 and a SentencePiece tokenizer.

Wav2Vec Model

Wav2Vec pre-trained models are imported from huggingface (https://huggingface.co/transformers/model_doc/wav2vec2.html (accessed on 30 October 2021)) using the Wav2Vec2ForCTC interface. We used the Wav2Vec2CTCTokenizer to decode the output indices into transcribed text.
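A minimal inference sketch with these interfaces is shown below; the checkpoint name is illustrative, and in our experiments the model was additionally scripted and optimized as described in Section 3.1.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer

# Hedged sketch of Wav2Vec CTC inference; "facebook/wav2vec2-base-960h" is an
# illustrative pre-trained checkpoint and the waveform is random noise here.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

waveform = torch.randn(1, 16000)            # 1 s of 16 kHz raw audio samples
with torch.no_grad():
    logits = model(waveform).logits         # shape: 1 x frames x vocab
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]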

3.2. Raspberry Pi Setup

Raspberry Pi 4 B is used in this evaluation. The device specs are provided in Table 1. The default Raspberry Pi OS is 32-bit, which is not compatible with PyTorch; hence, we installed a 64-bit OS.
The main Python package required for inferencing is PyTorch. The default prebuilt wheel files of this package mainly target Intel architectures and depend on Intel MKL (math kernel library) for math routines on the CPU. ARM-based architectures cannot use Intel MKL; they instead have to use the QNNPACK/XNNPACK backends with other BLAS (basic linear algebra subprograms) libraries. QNNPACK (https://github.com/pytorch/QNNPACK (accessed on 30 October 2021)) (quantized neural networks package) is a mobile-optimized library for low-precision, high-performance neural network inference. Similarly, XNNPACK (https://github.com/google/XNNPACK (accessed on 30 October 2021)) is a mobile-optimized library for higher-precision neural network inference. We built and installed the torch wheel file on Raspberry Pi from source with the XNNPACK and QNNPACK cmake configs. We also needed to set the quantized backend to QNNPACK during inference with torch.backends.quantized.engine = 'qnnpack'. Note that with the latest PyTorch release, 1.9.0, wheel files are available for ARM 64-bit architectures, so there is no longer any need to build torch from source.
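Putting this together, a rough sketch of the on-device inference setup on Raspberry Pi looks as follows; the model file name, thread count and dummy input are assumptions for illustration.

import torch

# Select the ARM quantized backend before running the quantized module.
torch.backends.quantized.engine = "qnnpack"
torch.set_num_threads(4)   # use all four Cortex-A72 cores (assumption)

model = torch.jit.load("model_quantized_mobile.pt")   # module copied to the Pi
model.eval()

dummy_input = torch.randn(1, 16000)   # placeholder input tensor
with torch.no_grad():
    output = model(dummy_input)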
The lessons learnt during setup are as follows:
  • Speech2Text transformer models expect Mel-frequency cepstral coefficients [30] as input features. However, we could not use the Torchaudio, PyKaldi, librosa or python_speech_features libraries for this because of dependency issues. Torchaudio depends on Intel MKL. Building PyKaldi on device was not feasible because of memory limitations. The librosa and python_speech_features packages produced different outputs for MFCC, which were unsuitable for the PyTorch models. Therefore, the features for the LibriSpeech data set were pre-generated on the server using fairseq audio_utils (https://github.com/pytorch/fairseq/blob/master/fairseq/data/audio/audio_utils.py (accessed on 30 October 2021)) and saved as NumPy files. These NumPy files were used as model input after applying CMVN transforms; a sketch of this pre-generation step is shown after this list.
  • Running pip install with or without sudo while installing packages can cause silent dependency issues. This is especially true when the same package is installed multiple times with and without sudo.
  • To experiment with huggingface transformer models, the datasets package is required, which in turn depends on PyArrow (the Apache Arrow Python library). The Arrow library needs to be built and installed from source to use PyArrow.
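The sketch below illustrates the server-side pre-generation step mentioned in the first item. We used fairseq audio_utils in practice; the torchaudio Kaldi-compliant front end shown here is an assumed, roughly equivalent way to produce the 80-channel log Mel filter bank features with utterance-level CMVN, and the file paths are placeholders.

import numpy as np
import torch
import torchaudio

# Illustrative server-side feature pre-generation (path is a placeholder).
waveform, sample_rate = torchaudio.load("sample.flac")
feats = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sample_rate)

# Utterance-level CMVN: zero mean and unit variance per feature dimension.
feats = (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)

# Saved as a NumPy file and later fed directly to the scripted model on the Pi.
np.save("sample_fbank.npy", feats.numpy())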

3.3. Nvidia Jetson Nano Setup

We configured Jetson Nano using the instructions on the Nvidia website. The Nano flash file comes with JetPack pre-installed, which includes all the CUDA libraries required for inferencing on GPU. The full specs of the device are provided in Table 2.
For Nano, we needed to build torch from source with the CUDA cmake option. Further, the Clang and LLVM compiler toolchain had to be upgraded to compile PyTorch with Clang.
The lessons learnt during setup are as follows:
  • A 5 V, 4 A barrel jack power supply is needed for Jetson Nano. The USB-C power supply does not provide sufficient power for continuous speech-to-text inferencing on CUDA.
  • cuDNN benchmarking needs to be switched on for Nano to pick up speed while executing. Nano takes a very long time to execute the initial few samples because cuDNN tries to find the best algorithm for the configured input. After that, the RTF improves significantly and execution is very quick.
  • Jetson Nano froze on long-duration audio while inferencing with the Wav2Vec model. Through trial and error, we figured out that by limiting the input audio duration to 8 s and batching the inputs to be of size 64 K samples (4 s of audio) or less, we can allow the inference to continue without hiccups; a sketch of this chunking is shown after this list.
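The chunking in the last item can be summarized by the sketch below, assuming 16 kHz audio so that 64 K samples correspond to 4 s; the transcribe call is a placeholder for the actual Wav2Vec inference and decoding.

import torch

MAX_SAMPLES = 64_000   # 4 s of audio at 16 kHz

def chunk_waveform(waveform, max_samples=MAX_SAMPLES):
    # waveform: 1 x T tensor of raw audio samples; yields chunks of <= 4 s.
    return [waveform[:, start:start + max_samples]
            for start in range(0, waveform.shape[1], max_samples)]

# Usage (transcribe is a placeholder for model inference + CTC decoding):
# texts = [transcribe(chunk) for chunk in chunk_waveform(waveform)]
# transcription = " ".join(texts)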

3.4. Evaluation Methodology

This section explains the methodologies used for collecting and presenting the metrics in this paper. The LibriSpeech [31] test and dev datasets were used to evaluate ASR performance on both Raspberry Pi and Jetson Nano. The test and dev datasets together contain 21 hr of audio. To save time, for these experiments we randomly sampled 300 (∼10%) of the audio files in each of the four data sets for inference. The same sample set was used for each configuration so that the results are comparable. Typically, ML practitioners only report the WER metric for server-based ASR, so we did not have a server-side reference for latency and efficiency metrics, such as memory, CPU or load times. Unlike backend servers, edge devices are constrained in terms of memory, CPU, disk and energy. To achieve on-device ML, the inferencing needs to be efficient enough to fit within the device's resource budgets. Hence, we measured these efficiency metrics along with the accuracy to assess the plausibility of meeting these budgets on typical edge devices.

3.4.1. Accuracy

Accuracy is measured using word error rate (WER), a standard metric for speech-to-text tasks. It is defined as in Equation (1):
WER = (S + I + D) / N    (1)
where S is the number of substitutions, D is the number of deletions, I is the number of insertions and N is the number of words in the reference.
WER for a dataset is computed as the total number of errors over the total number of reference words in the dataset. We compare the on-device WER on Raspberry Pi and Jetson Nano with the server-based WER reported in the Speech2Text [28] and Wav2Vec [27] papers. In both papers, the WER for all models was computed on the LibriSpeech test and dev data sets with a GPU in standalone mode. On the server, the Speech2Text model used a beam size of 5 and a vocabulary of 10,000 words for decoding, whereas the Wav2Vec model used a transformer-based language model for decoding. The pre-trained models used in this experiment have the same configuration as the server models.
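For reference, the following minimal sketch shows how WER is aggregated at the dataset level: per-utterance errors are the word-level Levenshtein distance (S + I + D), and the dataset WER divides the summed errors by the summed reference word counts.

def word_errors(reference, hypothesis):
    # Word-level Levenshtein distance = S + I + D for one utterance.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)], len(ref)

def dataset_wer(pairs):
    # pairs: iterable of (reference, hypothesis) transcripts.
    errors, words = 0, 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors, words = errors + e, words + n
    return errors / words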

3.4.2. Latency

The latency of ASR is measured using the real time factor (RTF), defined in Equation (2). In simple terms, with an RTF of 0.5, two seconds of audio will be transcribed by the system in one second.
RTF = (read time + inference time + decoding time) / total utterance duration    (2)
We compute the avg, mean, 75th percentile (P75) and 90th percentile (P90) RTF over all the audio samples in each data set. We also used the PyTorch profiler to visualize the CPU usage of various operators and functions inside the models.
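The measurement itself can be summarized by the sketch below; transcribe is a placeholder for the read + inference + decode pipeline, and utterances is an assumed list of (path, duration) pairs.

import time
import numpy as np

def measure_rtf(transcribe, utterances):
    # Per-utterance RTF as in Equation (2): wall-clock time / audio duration.
    rtfs = []
    for path, duration_sec in utterances:
        start = time.monotonic()
        transcribe(path)                 # read audio, run the model, decode
        rtfs.append((time.monotonic() - start) / duration_sec)
    return np.array(rtfs)

# Aggregation matching the statistics reported in Tables 5 and 6:
# rtfs = measure_rtf(transcribe, utterances)
# print(rtfs.mean(), np.percentile(rtfs, 75), np.percentile(rtfs, 90))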

3.4.3. Efficiency

We measure the CPU load and memory footprint during the entire data set evaluation, using the Linux top command. The top command is executed in the background every two minutes in order to avoid side effects on the main inference script.
The model load time is measured by collecting the torch.jit.load API latency to load the scripted model. We separately measured the load time by running 10 iterations and took an average. We ensured that the load time measurements were from a clean state, i.e., from the system boot, to discount any caching in the Linux OS layer for subsequent model loads.
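A simplified sketch of this measurement is shown below; in practice each measurement was started from a clean boot so that OS-level caching did not skew the numbers.

import time
import torch

def measure_load_time(path, iterations=10):
    # Average torch.jit.load latency over a number of iterations.
    total = 0.0
    for _ in range(iterations):
        start = time.monotonic()
        torch.jit.load(path)
        total += time.monotonic() - start
    return total / iterations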

4. Results

In this section, we present the accuracy, performance and efficiency metrics for Speech2Text and Wav2Vec model inference.

4.1. WER

Table 3 and Table 4 show the WER on Raspberry Pi and Jetson Nano, respectively.
The WER is slightly higher for the quantized models than for the unquantized ones, by an average of ∼0.5%. This is a small trade-off in accuracy for better RTF and more efficient inference. The test-other and dev-other data sets have a higher WER than the test-clean and dev-clean data sets. This is expected because the "other" data sets are noisier than the "clean" ones.
The on-device WER for unquantized models is generally higher than that reported on the server. We need to investigate further to understand this discrepancy. One plausible reason is the smaller sampled dataset used in our evaluation, whereas the server WER is calculated over the entire dataset.
The WER for Wav2Vec is higher because the input samples are batched at the 64 K (4 s audio) boundary. If a sample is longer than 4 s, we divide it into two batches (see Section 3.3 for the reasoning), so words at the 4 s boundary can be misinterpreted. We plan to investigate this batching problem in the future. We report the WER figures here for completeness.

4.2. RTF

In our experiments, the RTF is dominated by model inference time (>99%) compared to the other two factors in Equation (2). Table 5 and Table 6 show the RTF for Raspberry Pi and Jetson Nano, respectively. The RTF does not vary between data sets for the same model; hence, we show the RTF (avg, mean, P75 and P90) per model instead of per data set.
The RTF improves by ∼10% for quantized models compared to unquantized floating-point models, because the CPU has to load less data and can run tensor computations more efficiently in int8 than in floating point. Inference with the Speech2Text model is three times faster than with the Wav2Vec model, which can be explained by the fact that Wav2Vec has three times more parameters than Speech2Text (refer to Table 7). There is no noticeable difference in RTF between the 100 hr and 960 hr fine-tuned Wav2Vec models because the number of parameters does not change between them.
The RTF on Jetson Nano is three times better for the Speech2Text model and five times better for the Wav2Vec model, compared to Raspberry Pi. Nano is able to use its large number of CUDA cores for tensor computations. We do not evaluate quantized models on Nano because CUDA only supports floating-point computations.
The Wav2Vec RTF on Raspberry Pi is close to real time, whereas in every other case, the RTF is well below 1. This implies that on-device ASR can be used for real-time dictation, accessibility, voice-based app navigation, translation and other such tasks without much latency.

4.3. Efficiency

For both CPU and memory measurements over time, we use the Linux top command. The command is executed in a loop every 2 min so as not to interfere with the main processing.

4.3.1. CPU Load

Figure 4 and Figure 5 show the CPU load of all model inferences on Raspberry Pi and Jetson Nano, respectively. The CPU load on Nano for both the Speech2Text and Wav2Vec models is ∼85% in steady state; it mostly uses one of the four cores during operation. Most of the CPU processing on Nano is spent copying the input to memory for GPU processing and copying back the output. On Raspberry Pi, the CPU load is ∼380%; since all tensor computations happen on the CPU, all four cores are utilized fully during model inference. On Nano, the initial few minutes are spent loading and benchmarking the model, which is why the CPU is not busy during that period.

4.3.2. Memory Footprint

Figure 6 and Figure 7 show the memory footprint of all model inferences on Raspberry Pi and Jetson Nano, respectively. The memory values presented here are the RES (resident set size) values from the top command. On Raspberry Pi, the quantized Wav2Vec model consumes ∼50% less memory (from 1 GB down to 560 MB) than the unquantized model. Similarly, the quantized Speech2Text model consumes ∼40% less memory (from 480 MB down to 320 MB) than the unquantized model. On Nano, memory consumption is ∼1 GB for the Speech2Text model and ∼500 MB for the Wav2Vec model; on Nano, the same memory is shared between the GPU and CPU.

4.3.3. Model Load Time

Table 8 shows the model load times on Raspberry Pi and Jetson Nano. A load time of 1–2 s on Raspberry Pi seems reasonable for any practical application where the model is loaded once and the process then serves inference requests many times. The load time on Nano is 15–20 times longer than on Raspberry Pi; Nano's cuDNN has to allocate a certain amount of cache while loading the model, which takes time.

4.4. PyTorch Profiler

The PyTorch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html (accessed on 30 October 2021)) can be used to study the time and memory consumption of a model's operators. It is enabled through a context manager in Python. We use the profiler to understand the distribution of CPU time over model operations. Some of the profiler columns are omitted from the tables for simplicity.
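A sketch of how such profiles can be collected is shown below; the toy module and input are placeholders for the scripted ASR models, and ProfilerActivity.CUDA would be added to the activities on Jetson Nano.

import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder module and input; the real runs profiled the scripted ASR models.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU()).eval()
inputs = torch.randn(1, 80)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Operator-level summary similar to the tables below.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))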

4.4.1. Jetson Nano Profiles

Table 9 and Table 10 show the profiles of Wav2Vec and Speech2Text models on Jetson Nano.
For the Wav2Vec model, the majority of the CUDA time is spent in aten::cudnn_convolution for the input convolutions, followed by matrix multiplication (aten::mm). Additionally, the CPU and GPU spend a significant amount of time transferring data to each other (aten::to).
For the Speech2Text model, the majority of the CUDA time is spent in the decoder forward pass, followed by aten::mm for tensor multiplication operations.

4.4.2. Raspberry Pi profiles

Table 11, Table 12, Table 13 and Table 14 show the profiles of Wav2Vec and Speech2Text models on Raspberry Pi.
The CPU time is dominated by quantized::linear_dynamic for the linear layer computations, followed by aten::addmm_ for tensor add and multiply operations.
Compared to the quantized model, the non-quantized model spends about 5 s more in linear computations (prepacked::linear_clamp_run).
CPU percentages are dominated by forward function, linear layer computations and batched matrix multiplication in both quantized and unquantized models.
The unquantized linear layer processing is 40% higher than the quantized version.

5. Conclusions

We evaluated the ASR accuracy, performance and computational efficiency of transformer-based models on edge devices. By applying quantization and PyTorch mobile optimizations for CPU-based inferencing, we gain a ∼10% improvement in latency and a ∼50% reduction in the memory footprint at the cost of a ∼0.5% increase in WER, compared to the original model. Running the inference on the Jetson Nano GPU improves the latency by a factor of three to five. With 1–2 s load times, a ∼300 MB memory footprint and RTF < 1.0, the latest transformer models can be used on typical edge devices for private, secure, reliable and always-available ASR processing. For applications such as dictation, smart home control and accessibility, a small trade-off in WER for latency and efficiency gains is mostly acceptable, since small ASR errors will not hamper the overall task completion rate for voice commands such as turning off a lamp or opening an app on a device. By offloading inference to a general-purpose GPU, we can potentially gain 3–5× latency improvements.
In the future, we plan to explore other optimization techniques, such as pruning, sparsity, 4-bit quantization and different model architectures, to further analyze the WER vs. performance trade-offs. We also plan to measure the thermal and battery impact of various models on CPU and GPU platforms on mobile and wearable devices.

Author Contributions

Conceptualization—S.G. and V.P.; methodology—S.G. and V.P.; setup and experiments—S.G.; original draft preparation—S.G.; review and editing—S.G. and V.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available LibriSpeech datasets were used in this study. The data can be found here: https://www.openslr.org/12 (accessed on 30 October 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript.
DL: deep learning
CPU: central processing unit
GPU: graphics processing unit
ASR: automatic speech recognition
HMM: hidden Markov model
RNN: recurrent neural network
RNN-T: recurrent neural network transducer
CNN: convolutional neural network
LSTM: long short-term memory
Speech2Text: speech-to-text transformer model from fairseq
Wav2Vec: Wav2Vec 2.0 model
GMM: Gaussian mixture model
DNN: deep neural network
CTC: connectionist temporal classification
CMVN: cepstral mean and variance normalization
MFCC: Mel-frequency cepstral coefficients
CUDA: a parallel computing platform and application programming interface by Nvidia
WER: word error rate
RTF: real time factor
NLP: natural language processing

References

  1. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
  2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  3. Hannun, A. The History of Speech Recognition to the Year 2030. arXiv 2021, arXiv:2108.00084.
  4. Wu, C.J.; Brooks, D.; Chen, K.; Chen, D.; Choudhury, S.; Dukhan, M.; Hazelwood, K.; Isaac, E.; Jia, Y.; Jia, B.; et al. Machine Learning at Facebook: Understanding Inference at the Edge. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 331–344.
  5. Apple A12. Available online: https://en.wikipedia.org/wiki/Apple_A12 (accessed on 30 October 2021).
  6. Pixel 6. Available online: https://en.wikipedia.org/wiki/Pixel_6 (accessed on 30 October 2021).
  7. Huggins-Daines, D.; Kumar, M.; Chan, A.; Black, A.; Ravishankar, M.; Rudnicky, A.I. Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices. In Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Toulouse, France, 14–19 May 2006; Volume 1, p. I.
  8. Peinl, R.; Rizk, B.; Szabad, R. Open Source Speech Recognition on Edge Devices. In Proceedings of the 2020 10th International Conference on Advanced Computer Information Technologies (ACIT), Deggendorf, Germany, 13–15 May 2020; pp. 441–445.
  9. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567.
  10. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011.
  11. Pratap, V.; Hannun, A.; Xu, Q.; Cai, J.; Kahn, J.; Synnaeve, G.; Liptchinsky, V.; Collobert, R. Wav2Letter++: A Fast Open-source Speech Recognition System. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6460–6464.
  12. Lee, J.; Chirkov, N.; Ignasheva, E.; Pisarchyk, Y.; Shieh, M.; Riccardi, F.; Sarokin, R.; Kulik, A.; Grundmann, M. On-Device Neural Net Inference with Mobile GPUs. arXiv 2019, arXiv:1907.01989.
  13. Hadidi, R.; Cao, J.; Xie, Y.; Asgari, B.; Krishna, T.; Kim, H. Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices. In Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC), Orlando, FL, USA, 3–5 November 2019; pp. 35–48.
  14. Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
  15. Juang, B.H.; Levinson, S.; Sondhi, M. Maximum likelihood estimation for multivariate mixture observations of markov chains (Corresp.). IEEE Trans. Inf. Theory 1986, 32, 307–309.
  16. Mohamed, A.R.; Sainath, T.N.; Dahl, G.; Ramabhadran, B.; Hinton, G.E.; Picheny, M.A. Deep Belief Networks using discriminative features for phone recognition. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5060–5063.
  17. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, L.; Wang, G.; et al. Recent Advances in Convolutional Neural Networks. Pattern Recognit. 2017, 77, 354–377.
  18. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545.
  19. Graves, A. Sequence Transduction with Recurrent Neural Networks. arXiv 2012, arXiv:1211.3711.
  20. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 369–376.
  21. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473.
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  25. Senior, A.W.; Evans, R.; Jumper, J.; Kirkpatrick, J.; Sifre, L.; Green, T.; Qin, C.; Žídek, A.; Nelson, A.W.R.; Bridgland, A.; et al. Improved protein structure prediction using potentials from deep learning. Nature 2020, 577, 706–710.
  26. Greff, K.; Srivastava, R.K.; Koutnik, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2222–2232.
  27. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477.
  28. Wang, C.; Tang, Y.; Ma, X.; Wu, A.; Okhonko, D.; Pino, J. fairseq S2T: Fast Speech-to-Text Modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL), System Demonstrations, Suzhou, China, 4–7 December 2020.
  29. Droppo, J.; Acero, A. Environmental Robustness. Available online: http://ai.stanford.edu/~amaas/data/cmn_paper.pdf (accessed on 30 October 2021).
  30. Muda, L.; Begam, M.; Elamvazuthi, I. Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques. arXiv 2010, arXiv:1003.4083.
  31. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210.
Figure 1. Wav2Vec2 inference.
Figure 2. Speech2Text inference.
Figure 3. Model preparation steps.
Figure 4. CPU load on Raspberry Pi.
Figure 5. CPU load on Jetson Nano.
Figure 6. Memory footprint on Raspberry Pi.
Figure 7. Memory footprint on Jetson Nano.
Table 1. Raspberry Pi 4 B specs.

Name | Spec
Chip | BCM2711
CPU | Quad core Cortex-A72 (ARM v8) 64-bit SoC
Clock speed | 1.5 GHz
RAM | 4 GB SDRAM
Caches | 32 KB data + 48 KB instruction L1 cache per core; 1 MB L2 cache
Storage | 32 GB micro SD card
OS | 64-bit Raspberry Pi OS
Python version | 3.7
Power supply | 5 V DC via USB-C connector
Table 2. Jetson Nano specs.

Name | Spec
GPU | 128-core Maxwell
CPU | Quad-core ARM A57
Clock speed | 1.43 GHz
Memory | 4 GB 64-bit LPDDR4
Caches | 262,144 bytes L2 cache
Storage | 32 GB micro SD card
OS | Ubuntu 18.04.5 LTS
Python version | 3.6
CUDA | 10.2
nvidia-jetpack | 4.5.1-b17
Power supply | Barrel jack 5 V 4 A
Table 3. WER on Raspberry Pi.

Test datasets:
Dataset | Model | Edge WER | Server WER
test-clean | S2T_q | 4.7% | -
test-clean | S2T | 4.4% | 4.4%
test-clean | W2V100q | 7.3% | -
test-clean | W2V100 | 6.9% | 2.6%
test-clean | W2V960q | 4.1% | -
test-clean | W2V960 | 4.1% | 2.1%
test-other | S2T_q | 11.7% | -
test-other | S2T | 11.0% | 9.0%
test-other | W2V100q | 16.2% | -
test-other | W2V100 | 15.6% | 6.3%
test-other | W2V960q | 10.8% | -
test-other | W2V960 | 9.7% | 4.8%

Dev datasets:
Dataset | Model | Edge WER | Server WER
dev-clean | S2T_q | 4.3% | -
dev-clean | S2T | 3.9% | 3.8%
dev-clean | W2V100q | 7.9% | -
dev-clean | W2V100 | 7.8% | 2.2%
dev-clean | W2V960q | 4.3% | -
dev-clean | W2V960 | 3.6% | 1.8%
dev-other | S2T_q | 11.1% | -
dev-other | S2T | 10.6% | 8.9%
dev-other | W2V100q | 15.1% | -
dev-other | W2V100 | 14.9% | 6.3%
dev-other | W2V960q | 10.2% | -
dev-other | W2V960 | 9.8% | 4.7%
Table 4. WER on Jetson Nano.

Test datasets:
Dataset | Model | Edge WER | Server WER
test-clean | S2T | 4.4% | 4.4%
test-clean | W2V100 | 9.5% | 2.6%
test-clean | W2V960 | 6.4% | 2.1%
test-other | S2T | 8.6% | 9.0%
test-other | W2V100 | 20.5% | 6.3%
test-other | W2V960 | 13.1% | 4.8%

Dev datasets:
Dataset | Model | Edge WER | Server WER
dev-clean | S2T | 3.3% | 3.8%
dev-clean | W2V100 | 10.2% | 2.2%
dev-clean | W2V960 | 6.2% | 1.8%
dev-other | S2T | 9.8% | 8.9%
dev-other | W2V100 | 19.7% | 6.3%
dev-other | W2V960 | 13.0% | 4.7%
Table 5. RTF on Raspberry Pi.

Model | Avg | Mean | P75 | P90
Speech2Text | 0.33 | 0.33 | 0.38 | 0.45
Speech2Text quantized | 0.29 | 0.29 | 0.34 | 0.39
Wav2Vec 100 hr | 1.43 | 1.42 | 1.45 | 1.5
Wav2Vec 100 hr quantized | 1.00 | 0.97 | 1.03 | 1.11
Wav2Vec 960 hr | 1.49 | 1.48 | 1.54 | 1.58
Wav2Vec 960 hr quantized | 1.03 | 1.00 | 1.07 | 1.18
Table 6. RTF on Jetson Nano.

Model | Avg | Mean | P75 | P90
Speech2Text | 0.13 | 0.13 | 0.15 | 0.17
Wav2Vec 100 hr | 0.22 | 0.22 | 0.25 | 0.28
Wav2Vec 960 hr | 0.23 | 0.22 | 0.26 | 0.29
Table 7. Model size.

Model Name | Size | Parameters
Speech2Text quantized | 80 MB | 30 Million
Speech2Text | 125 MB | 30 Million
Wav2Vec quantized | 207 MB | 93 Million
Wav2Vec | 377 MB | 93 Million
Table 8. Model load times.

Raspberry Pi:
Model | Avg (sec)
Speech2Text | 1.4
Speech2Text quantized | 1.07
Wav2Vec | 1.9
Wav2Vec quantized | 1.9

Jetson Nano:
Model | Avg (sec)
Speech2Text | 24.2
Wav2Vec | 33.5
Table 9. Jetson Nano profile for the Wav2Vec model.

Name | Self CPU % | Self CUDA | Self CUDA % | # of Calls
forward | 0.70 | 5.373 ms | 0.51 | 1
aten::conv1d | 0.14 | 576.000 us | 0.05 | 8
aten::convolution | 0.10 | 228.000 us | 0.02 | 8
aten::_convolution | 0.11 | 459.000 us | 0.04 | 8
aten::cudnn_convolution | 0.32 | 527.416 ms | 50.32 | 8
<forward op> | 0.63 | 1.054 ms | 0.10 | 61
aten::matmul | 0.97 | 1.614 ms | 0.15 | 98
aten::linear | 10.48 | 1.279 ms | 0.12 | 74
aten::mm | 0.84 | 207.371 ms | 19.78 | 74
aten::to | 38.43 | 185.175 ms | 17.67 | 3
aten::bmm | 0.31 | 20.066 ms | 1.91 | 24
aten::gelu | 0.27 | 20.261 ms | 1.93 | 20
aten::group_norm | 0.03 | 4.000 us | 0.00 | 1
aten::native_group_norm | 0.03 | 19.968 ms | 1.90 | 1
aten::add_ | 0.8 | 16.373 ms | 1.56 | 75
Table 10. Jetson Nano profile for the Speech2Text model.

Name | Self CPU % | Self CUDA | Self CUDA % | # of Calls
forward | 6.21 | 307.304 ms | 14.28 | 1
aten::linear | 3.12 | 86.356 ms | 4.01 | 672
aten::matmul | 3.16 | 80.340 ms | 3.73 | 672
aten::mm | 4.72 | 265.171 ms | 12.32 | 672
aten::layer_norm | 1.01 | 20.434 ms | 0.95 | 253
aten::transpose | 5.46 | 106.751 ms | 4.96 | 1398
aten::native_layer_norm | 2.77 | 91.685 ms | 4.26 | 253
aten::t | 2.93 | 57.888 ms | 2.69 | 710
aten::view | 9.56 | 119.122 ms | 5.53 | 2724
aten::empty | 8.34 | 102.937 ms | 4.78 | 2417
aten::bmm | 2.26 | 83.940 ms | 3.90 | 312
aten::as_strided | 6.86 | 77.660 ms | 3.61 | 2156
aten::add_ | 4.07 | 67.874 ms | 3.15 | 675
aten::softmax | 0.66 | 16.265 ms | 0.76 | 156
aten::to | 2.38 | 41.231 ms | 1.92 | 433
Table 11. Raspberry Pi profile for the Wav2Vec quantized model.

Name | Self CPU % | Self CPU | CPU Total | # of Calls
forward | 0.49 | 45.452 ms | 9.334 s | 1
quantized::linear_dynamic | 30.77 | 2.872 s | 3.167 s | 74
aten::conv1d | 0.00 | 347.000 us | 2.875 s | 8
aten::convolution | 0.00 | 274.000 us | 2.875 s | 8
aten::_convolution | 0.02 | 1.472 ms | 2.875 s | 8
aten::_convolution_nogroup | 0.04 | 3.663 ms | 2.862 s | 23
aten::thnn_conv2d | 0.30 | 28.075 ms | 2.858 s | 23
aten::thnn_conv2d_forward | 5.46 | 509.250 ms | 2.830 s | 23
aten::addmm_ | 24.79 | 2.314 s | 2.314 s | 23
aten::matmul | 0.02 | 2.316 ms | 1.022 s | 24
aten::bmm | 10.66 | 994.810 ms | 1.016 s | 24
aten::gelu | 10.33 | 964.418 ms | 965.023 ms | 20
aten::softmax | 0.01 | 597.000 us | 719.717 ms | 12
aten::_softmax | 7.69 | 718.238 ms | 719.120 ms | 12
aten::mul | 2.78 | 259.586 ms | 260.482 ms | 12
Table 12. Raspberry Pi profile for the Wav2Vec non-quantized model.

Name | Self CPU % | Self CPU | CPU Total | # of Calls
forward | 0.41 | 58.280 ms | 14.227 s | 1
prepacked::linear_clamp_run | 54.85 | 7.804 s | 7.994 s | 74
aten::conv1d | 0.00 | 388.000 us | 2.865 s | 8
aten::convolution | 0.00 | 266.000 us | 2.865 s | 8
aten::_convolution | 0.01 | 1.790 ms | 2.865 s | 8
aten::_convolution_nogroup | 0.01 | 813.000 us | 2.855 s | 23
aten::thnn_conv2d | 0.20 | 28.328 ms | 2.854 s | 23
aten::thnn_conv2d_forward | 3.18 | 452.048 ms | 2.826 s | 23
aten::addmm_ | 16.63 | 2.366 s | 2.366 s | 23
aten::matmul | 0.02 | 2.350 ms | 1.118 s | 24
aten::bmm | 7.64 | 1.087 s | 1.113 s | 24
aten::gelu | 6.54 | 930.477 ms | 931.136 ms | 20
aten::softmax | 0.00 | 645.000 us | 637.379 ms | 12
aten::_softmax | 4.47 | 635.864 ms | 636.734 ms | 12
aten::mul | 2.43 | 345.998 ms | 346.924 ms | 12
Table 13. Raspberry Pi profile for the Speech2Text quantized model.

Name | Self CPU % | Self CPU | CPU Total | # of Calls
forward | 6.75 | 237.950 ms | 3.527 s | 1
quantized::linear_dynamic | 29.46 | 1.039 s | 1.634 s | 1995
aten::bmm | 14.56 | 513.414 ms | 654.848 ms | 960
aten::min | 7.41 | 261.352 ms | 282.381 ms | 1995
aten::max | 5.30 | 186.852 ms | 204.806 ms | 1996
aten::select | 3.11 | 109.748 ms | 158.923 ms | 12,591
aten::clamp_min | 2.18 | 76.946 ms | 150.032 ms | 492
aten::layer_norm | 0.45 | 15.811 ms | 122.822 ms | 766
aten::softmax | 0.18 | 6.385 ms | 114.797 ms | 480
aten::_softmax | 2.95 | 104.130 ms | 108.412 ms | 480
aten::native_layer_norm | 2.45 | 86.478 ms | 107.011 ms | 766
aten::add | 3.01 | 106.317 ms | 106.365 ms | 924
aten::relu | 0.11 | 3.752 ms | 82.349 ms | 246
aten::copy_ | 2.17 | 76.565 ms | 76.565 ms | 1073
aten::empty | 1.94 | 68.404 ms | 68.404 ms | 11,944
Table 14. Raspberry Pi profile for the Speech2Text non-quantized model.

Name | Self CPU % | Self CPU | CPU Total | # of Calls
forward | 7.93 | 287.466 ms | 3.623 s | 1
prepacked::linear_clamp_run | 38.51 | 1.395 s | 1.683 s | 1995
aten::bmm | 11.84 | 428.876 ms | 575.170 ms | 960
aten::copy_ | 10.07 | 364.827 ms | 364.827 ms | 3068
aten::select | 3.13 | 113.435 ms | 163.539 ms | 12,591
aten::clamp_min | 2.28 | 82.503 ms | 159.938 ms | 492
aten::layer_norm | 0.49 | 17.881 ms | 150.078 ms | 766
aten::native_layer_norm | 3.02 | 109.335 ms | 132.197 ms | 766
aten::softmax | 0.18 | 6.389 ms | 130.655 ms | 480
aten::_softmax | 3.32 | 120.186 ms | 124.266 ms | 480
aten::add | 2.81 | 101.642 ms | 101.693 ms | 924
aten::relu | 0.18 | 6.374 ms | 92.151 ms | 246
aten::masked_fill | 0.02 | 640.000 us | 79.648 ms | 12
aten::mul_ | 0.35 | 12.554 ms | 79.419 ms | 480
aten::mul | 1.38 | 50.079 ms | 73.879 ms | 560
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
