1. Introduction
Phoneme pronunciation is one of the most important basic skills in foreign language learning. Practicing pronunciation in a computer-assisted way is helpful in self-directed or distance-learning environments [1]. Computer-assisted pronunciation training (CAPT) programs record and analyze user speech acoustically, comparing the learner's pronunciation and prosody with a native speaker sample through visual feedback. Although users often require additional training to interpret the feedback, such programs can be used to improve their prosody and vowel pronunciation [2,3].
Recent research indicates that machine learning provides promising opportunities to improve CAPT systems. From the viewpoint of natural language processing, pronunciation diagnostic tasks are essentially acoustic pattern recognition problems [4], a field that has made great progress [5,6,7,8,9,10,11,12,13,14,15,16]. For example, Gulati et al. [12] achieved a 1.9% word error rate on clean test data by using more than 900 hours of labeled speech training data. Turan and Erzin [13] address the close-talk and throat microphone domain mismatch problem with a transfer learning approach based on stacked denoising auto-encoders, which improves the acoustic model by mapping the source-domain and target-domain representations into a common latent space. Sun and Tang [14] propose a method for supporting automatic communication error detection through the integrated use of speech recognition, text analysis, and formal modeling of airport operational processes; they hypothesize that it could form the basis for automating communication error detection and preventing loss of separation. Badrinath and Balakrishnan [15] present an automatic speech recognition model tailored to the air traffic control domain that transcribes air traffic control voice to text; the transcribed text is used to extract operational information such as call-signs and runway numbers, and the models build on recent improvements in machine learning techniques for speech recognition and natural language processing. Jiang et al. [16] applied recent state-of-the-art DNN-based training methods to an automatic language proficiency evaluation system that combines various non-native acoustic models with native ones; the reference-free rate is used as the machine score to estimate the second-language proficiency of English learners, and evaluations on the English-read-by-Japanese database demonstrate that it is an effective way to improve language proficiency assessment techniques.
Despite many significant theoretical achievements in terms of speech recognition algorithms, the practical value of today's CAPT modalities is limited by the hardware devices, especially in the aspects of portability, maintainability, and resource consumption. Usually, CAPT tools are developed and evaluated on general-purpose processors, which can hardly satisfy these requirements entirely; further efforts are therefore needed to prototype them on embedded platforms. Manor et al. [17] point out the possibility of efficiently running networks on a Field Programmable Gate Array (FPGA) using a microcontroller and a hardware accelerator. In the work of Silva et al. [18], a support vector machine multi-class classifier is implemented within the asynchronous paradigm in a 4-stage architecture; it is claimed that a reduced power consumption of 5.2 mW, a fast average response time of 0.61 μs, and the most area-efficient circuit of 1315 LUTs are obtained as a result. Chervyakov et al. [19] propose a speech-recognition-ready CNN architecture based on the Residue Number System (RNS) and the new Chinese Remainder Theorem with fractions; according to simulations on a Kintex7 xc7k70tfbg484-2 FPGA, the hardware cost is reduced by 32% compared to the traditional binary system. Paula et al. [20] apply a long short-term memory network to the task of spectral prediction and propose a module generator for an FPGA implementation; evaluations demonstrate that a prediction latency of 4.3 μs on a Xilinx XC7K410T Kintex-7 FPGA is achievable. By now, mature embedded machine learning toolkits such as OpenVINO have been developed and widely used in real-life research and development [21,22,23,24,25]; these successful cases have significantly improved products in different scenarios.
This work focuses on embedded French CAPT solutions, with the goals of high development productivity and efficient runtime performance. It is conducted with the Algorithm-Architecture Adequation (AAA) methodology, first introduced by the AOSTE team of INRIA (the French National Institute for Computer Science and Applied Mathematics) [26]. The key feature of AAA is the ability to rapidly prototype complex real-time embedded applications based on automatic code generation. The algorithm of interest and its hardware architecture are studied simultaneously within a Software/Hardware (SW/HW) co-design framework, which allows an embedded implementation optimized at both the algorithm and hardware levels.
Concerning the pronunciation diagnosis algorithm, a classifier is desired that balances high accuracy with low resource consumption. A recently proposed heterogeneous machine learning CAPT framework [27] is therefore selected. Because phoneme utterances are produced from the base vibrations of the vocal cords through resonance chambers (buccal, nasal, and pharyngeal cavities) [28,29], the predictors of the phoneme feature vectors are very likely collinear, resulting in a multicollinearity problem. Multicollinearity means that one of the predictor variables in a classification model can be linearly predicted from the others with a substantial degree of accuracy. Although it is usually difficult to derive a precise mathematical model that explains the fundamentals of a given pattern recognition problem, research indicates that suppressing the multicollinearity by a suitable method helps to improve pattern discriminability [30,31,32]. Yanjing et al. [27] estimate the condition indices of a French phoneme utterance spectrum set, and a considerable proportion of its elements exceeds 10, which means that the predictor dependencies start to affect the regression estimates [33]. The framework of this work first suppresses the multicollinearity among the predictors of the phoneme sample vectors by using the partial least squares (PLS) regression algorithm and then classifies them via soft-margin SVMs. Considering that the FPGA is one of the most commonly used embedded devices thanks to its benefits in terms of running cost, power consumption, and flexibility [34,35,36,37,38,39,40,41], our team prototyped the framework as a hardware core at the register-transfer level for FPGA-based solutions.
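For readers who wish to reproduce this kind of collinearity diagnosis, condition indices can be computed from the singular values of the column-scaled predictor matrix, which is the Belsley-style definition behind the rule-of-thumb threshold of 10 used above. The following NumPy sketch is our own minimal illustration; the function name and the toy data are ours and do not come from [27].

```python
import numpy as np

def condition_indices(X):
    """Belsley-style condition indices: ratio of the largest singular
    value of the column-scaled predictor matrix to each singular value."""
    # Scale each column to unit length so the indices are scale-invariant.
    norms = np.linalg.norm(X, axis=0)
    Xs = X / np.where(norms == 0.0, 1.0, norms)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s[0] / s

# Toy example: the third predictor is almost a copy of the first,
# so at least one condition index rises far above the threshold of 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 1e-3 * rng.normal(size=200)
print(condition_indices(X))  # last index >> 10 -> harmful collinearity
```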
The main challenge of this project is how to implement the desired algorithm behavior at the register-transfer level with acceptable running efficiency and resource cost. High-level synthesis techniques have been developed for the purpose of high development productivity and maintainability; the work of Manor [42] demonstrates that this approach is an important and effective solution for fast embedded prototyping with efficient performance. This work uses a recently proposed very high-level synthesis (VHLS) based SW/HW co-design flow [43,44] to facilitate the implementation process from Matlab down to RTL. Moreover, different interface and parallelization optimizations are made to accelerate the implementations. The evaluation experiment in this paper is conducted on a data set covering 35 French phonemes, with utterance samples collected from multiple speakers over several recording sessions. The experimental results show that the outputs of the final RTL implementation are exactly the same as those of its Matlab prototype, implying that the Matlab-to-RTL synthesis process of this work is reliable. Compared to the PLS regressor, SVMs, and deep neural network models, the proposed method achieves the lowest diagnostic error rate in the experiments of this paper. Additionally, the hardware performance evaluations of the RTL implementation indicate that the optimizations used in this paper achieve a substantial speedup relative to the CPU.
The main novelties of this work are summarized as follows:
- (a)
An FPGA-suitable CAPT framework is conceived and trained, in which the phoneme pronunciation diagnostic algorithm is based on the partial least squares regression method and an improved support vector machine, raising the accuracy of the framework by suppressing the collinearity problem among the predictors.
- (b)
The phoneme diagnostic core is implemented at the register-transfer level (RTL) via the recently proposed Matlab-to-RTL SW/HW co-design flow for the purpose of high development productivity and maintainability. The implementation is further accelerated at the instruction level, and a considerable speedup is achieved relative to its CPU implementation.
- (c)
The proposed RTL implementation of the CAPT framework is functionally verified and evaluated using a French phoneme utterance database, demonstrating its practical value.
The remainder of this paper is organized as follows: Section 2 describes the proposed embedded CAPT framework and explains how it is trained; Section 3 presents the implementation and optimization processes of the proposed CAPT framework; Section 4 analyzes the evaluation experiment results; and finally, Section 5 concludes this work.
2. Architecture of the CAPT Framework
The overall framework of the desired French phoneme utterance detectors is shown in Figure 1. The user utters the phoneme to be learned, and the recording serves as the input of the system. According to Figure 1a, the normalized frequency spectrum $\mathbf{x}$ of the utterance waveform is assigned to the detector as the training or testing predictor vector. Figure 1b zooms into the architecture of the detector unit, which is implemented as an Intellectual Property (IP) core in this paper. This architecture is a two-layer network whose output y can be mathematically described as

$$y = f_2\big(g_2\big(f_1\big(g_1(\mathbf{x})\big)\big)\big), \qquad (1)$$

where $f_1$ and $f_2$ are two activation function sets, and $g_1$ and $g_2$ are the propagation functions of the first and second layers, expressed as

$$g_1(\mathbf{x}) = W_1^{\mathsf T}\mathbf{x} \qquad (2)$$

and

$$g_2(\mathbf{h}) = W_2^{\mathsf T}\mathbf{h} + b. \qquad (3)$$

$\mathbf{x} \in \mathbb{R}^m$ (m is the vector size, set to 16,384 in this paper) is the input of the detector, to which the predictor vector is assigned directly. $W_1$ and $W_2$ are the coefficient matrices of the two layers, respectively; their sizes are m-by-n and n-by-1, where n is the number of phonemes of the French language. b is the bias value of the second layer. $\mathbf{h}$ is the output of the first activation function set $f_1$, whose element functions are rectified linear units (ReLU). For the second layer, the sigmoid function is applied to the output as the activation function in order to constrain y into a reasonable range from 0 to 1.

The decision of the system is made by comparing the output of the detector y, which is the diagnosis score corresponding to the utterance quality, with a threshold in order to feed back the diagnosis result. This work trains the detectors through a heterogeneous process presented in [27], which consists of partial least squares (PLS) regression and soft-margin support vector machines.
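To make the detector structure concrete, the following NumPy sketch evaluates (1)-(3) for a single predictor vector. It is a minimal illustration under the shapes given above ($W_1$ is m-by-n, $W_2$ is n-by-1); the random weights are placeholders of our own standing in for the trained PLS and SVM coefficients, and n = 35 follows the phoneme count of the evaluation data set.

```python
import numpy as np

m, n = 16_384, 35              # input size and phoneme count from the text
rng = np.random.default_rng(1)
W1 = rng.normal(size=(m, n))   # layer-1 coefficients (placeholder for trained PLS weights)
W2 = rng.normal(size=(n, 1))   # layer-2 coefficients (placeholder for trained SVM weights)
b = 0.1                        # layer-2 bias (placeholder)

def detect(x, threshold=0.5):
    """Two-layer detector of (1)-(3): ReLU feature layer, sigmoid scorer."""
    h = np.maximum(W1.T @ x, 0.0)        # f1(g1(x)) with ReLU units
    z = float(W2[:, 0] @ h) + b          # g2(h) = W2^T h + b
    y = 1.0 / (1.0 + np.exp(-z))         # f2 sigmoid keeps the score in (0, 1)
    return y, y >= threshold             # diagnosis score and thresholded decision

score, is_correct = detect(rng.normal(size=m))  # toy input spectrum
```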
2.1. Training Method of Layer 1
The diagnostic ability of the design is impacted by the multicollinearity problem among the utterance sample predictors; the partial least squares (PLS) regression method is therefore applied to train the feature extraction layer of the French phoneme utterance detectors. PLS is a common class of methods for modeling relations between sets of observed variables by means of latent variables. The underlying assumption is that the observed data is generated by a system or process that is driven by a small number of latent (not directly observed or measured) variables. Its goal is to maximize the covariance between the two parts of a paired data set, even though those two parts live in different spaces. This implies that PLS regression can overcome the multicollinearity problem by modeling the relationships between the predictors. In the case of this paper, we train the first layer of the detector to extract the PLS features of the samples in order to facilitate the classification task of the second layer.
As presented in [32], let $X$ and $Y$ be two matrices whose rows are the predictor vectors $\mathbf{x}_i$ and their responses $\mathbf{y}_i$ corresponding to the i-th sample. According to the nonlinear iterative partial least squares (NIPALS) algorithm [31,45], the optimization problem of PLS regression is to search for projection directions that maximize the covariance of the training and response matrices:

$$\max_{\|\mathbf{w}\| = \|\mathbf{c}\| = 1} \operatorname{cov}(X\mathbf{w},\, Y\mathbf{c}) = \max_{\|\mathbf{w}\| = \|\mathbf{c}\| = 1} \frac{1}{N}\,\mathbf{w}^{\mathsf T}X^{\mathsf T}Y\mathbf{c}, \qquad (4)$$

where N is the number of training samples and $\mathbf{w}$ and $\mathbf{c}$ are two unit vectors corresponding to the projection directions. The directions that solve (4) are the first pair of singular vectors of the singular value decomposition

$$X^{\mathsf T}Y = U\Sigma V^{\mathsf T}, \qquad (5)$$

where the value of the covariance is given by the corresponding (largest) singular value $\sigma_1$. In this paper we apply the same data-projection strategy repeatedly, through deflation, in order to obtain multiple projection directions.
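As a quick numerical check of this property, the sketch below (synthetic data and our own naming) verifies that the covariance attained by the leading singular-vector pair of $X^{\mathsf T}Y$ is at least that of a random pair of unit directions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = rng.normal(size=(N, 8))
Y = X @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(N, 3))  # linked responses
X -= X.mean(axis=0)            # center, so (Xw)^T (Yc) / N is a covariance
Y -= Y.mean(axis=0)

U, s, Vt = np.linalg.svd(X.T @ Y)
w, c = U[:, 0], Vt[0, :]       # leading singular-vector pair of X^T Y

cov_svd = (X @ w) @ (Y @ c) / N        # equals s[0] / N by (5)
r = rng.normal(size=8); r /= np.linalg.norm(r)
q = rng.normal(size=3); q /= np.linalg.norm(q)
cov_rand = (X @ r) @ (Y @ q) / N
assert cov_svd >= abs(cov_rand)        # (4) is maximized by the singular pair
```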
The PLS regression algorithm is described programmatically in Algorithm 1. The inner loop computes the first singular value iteratively, which results in $\mathbf{c}_j$ converging to the first right singular vector and $\mathbf{w}_j$ to the first left singular vector of $X_j^{\mathsf T}Y_j$. Next, the deflation of $X_j$ and $Y_j$ is computed. Finally, the regression coefficients $B$ are given by $B = X^{\mathsf T}U\,(T^{\mathsf T}XX^{\mathsf T}U)^{-1}T^{\mathsf T}Y$, where $U$ and $T$ are matrices whose columns are the score vectors $\mathbf{u}_j$ and $\mathbf{t}_j$, respectively [46].
Algorithm 1 Pseudocode of the PLS regression algorithm

Input: training matrix $X$, response variables $Y$, projection direction number $k$
Output: regression coefficients $B$

1: initialization: $X_1 \leftarrow X$, $Y_1 \leftarrow Y$
2: for $j = 1, \dots, k$ do
3:  $\mathbf{u}_j \leftarrow$ first column of $Y_j$
4:  $\mathbf{u}_j \leftarrow \mathbf{u}_j / \|\mathbf{u}_j\|$
5:  repeat
6:   $\mathbf{w}_j \leftarrow X_j^{\mathsf T}\mathbf{u}_j / \|X_j^{\mathsf T}\mathbf{u}_j\|$, $\quad \mathbf{t}_j \leftarrow X_j\mathbf{w}_j$
7:   $\mathbf{c}_j \leftarrow Y_j^{\mathsf T}\mathbf{t}_j / \|Y_j^{\mathsf T}\mathbf{t}_j\|$, $\quad \mathbf{u}_j \leftarrow Y_j\mathbf{c}_j$
8:  until convergence of $\mathbf{w}_j$
9:  $X_{j+1} \leftarrow X_j - \mathbf{t}_j\mathbf{t}_j^{\mathsf T}X_j / (\mathbf{t}_j^{\mathsf T}\mathbf{t}_j)$
10:  $Y_{j+1} \leftarrow Y_j - \mathbf{t}_j\mathbf{t}_j^{\mathsf T}Y_j / (\mathbf{t}_j^{\mathsf T}\mathbf{t}_j)$
11:  store $\mathbf{t}_j$ and $\mathbf{u}_j$ as the j-th columns of $T$ and $U$
12: end for
13: $B \leftarrow X^{\mathsf T}U\,(T^{\mathsf T}XX^{\mathsf T}U)^{-1}T^{\mathsf T}Y$
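The following NumPy transcription of Algorithm 1 is a sketch for illustration, not the authors' Matlab prototype; it assumes column-centered $X$ and $Y$ and implements the NIPALS iteration together with the coefficient formula $B = X^{\mathsf T}U(T^{\mathsf T}XX^{\mathsf T}U)^{-1}T^{\mathsf T}Y$ quoted above.

```python
import numpy as np

def pls_regression(X, Y, k, tol=1e-10, max_iter=500):
    """NIPALS-style PLS: returns regression coefficients B with Y ~ X @ B.
    X: (N, m) centered predictors, Y: (N, p) centered responses."""
    Xj, Yj = X.copy(), Y.copy()
    T, U = [], []
    for _ in range(k):
        u = Yj[:, [0]]                               # line 3: score from Y
        u /= np.linalg.norm(u)                       # line 4: normalize
        for _ in range(max_iter):                    # lines 5-8: inner loop
            w = Xj.T @ u; w /= np.linalg.norm(w)     # X weight direction
            t = Xj @ w                               # X score
            c = Yj.T @ t; c /= np.linalg.norm(c)     # Y loading direction
            u_new = Yj @ c                           # Y score
            if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
                u = u_new
                break
            u = u_new
        T.append(t.ravel()); U.append(u.ravel())     # line 11: store scores
        tt = float(t.T @ t)                          # lines 9-10: deflation
        Xj = Xj - t @ (t.T @ Xj) / tt
        Yj = Yj - t @ (t.T @ Yj) / tt
    T = np.column_stack(T); U = np.column_stack(U)
    # line 13: B = X^T U (T^T X X^T U)^{-1} T^T Y
    return X.T @ U @ np.linalg.solve(T.T @ X @ X.T @ U, T.T @ Y)
```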
2.2. Training Method of Layer 2
The second layer of the detector is trained by using soft-margin SVMs [47]. The SVM is a type of binary classifier that has been widely used in speech processing [10,47,48,49]. Classical SVMs build the classifier by searching for a hyperplane that maximizes the margin between the two target clusters (correct pronunciations or not). This method classifies the utterance samples with a "hard margin" determined by the support vectors, which may result in an over-fitting problem. To address this issue, based on the SVM model of this paper (see (3)), we propose to use soft-margin SVMs, building the classifier by searching for a hyperplane that maximizes the soft margin between the two target clusters:

$$\min_{W_2,\,b}\ \frac{1}{2}\|W_2\|^2 + C\sum_{i=1}^{N} \ell_\varepsilon\big(g_2(\mathbf{h}_i) - y_i\big), \qquad (6)$$

where $\mathbf{h}_i$ is the i-th predictor vector used to train the second layer and C is the regularization constant. $\ell_\varepsilon$ is the insensitive loss function:

$$\ell_\varepsilon(z) = \begin{cases} 0, & |z| \le \varepsilon, \\ |z| - \varepsilon, & \text{otherwise}, \end{cases} \qquad (7)$$

where $\varepsilon$ is the maximum tolerated error between the prediction results and the corresponding labels. The problem above can be solved by using the Lagrange multiplier method. We introduce two slack variables $\xi_i$ and $\hat{\xi}_i$ that correspond to the degree of dissatisfaction of the margin constraint, so that

$$\min_{W_2,\,b,\,\boldsymbol{\xi},\,\hat{\boldsymbol{\xi}}}\ \frac{1}{2}\|W_2\|^2 + C\sum_{i=1}^{N}(\xi_i + \hat{\xi}_i) \qquad (8)$$

with

$$g_2(\mathbf{h}_i) - y_i \le \varepsilon + \xi_i, \qquad y_i - g_2(\mathbf{h}_i) \le \varepsilon + \hat{\xi}_i, \qquad \xi_i \ge 0, \quad \hat{\xi}_i \ge 0.$$

The Lagrange function of (6) can therefore be written as

$$\begin{aligned} L(W_2, b, \boldsymbol{\xi}, \hat{\boldsymbol{\xi}}, \boldsymbol{\alpha}, \hat{\boldsymbol{\alpha}}, \boldsymbol{\mu}, \hat{\boldsymbol{\mu}}) ={}& \frac{1}{2}\|W_2\|^2 + C\sum_{i=1}^{N}(\xi_i + \hat{\xi}_i) - \sum_{i=1}^{N}(\mu_i\xi_i + \hat{\mu}_i\hat{\xi}_i) \\ &+ \sum_{i=1}^{N}\alpha_i\big(g_2(\mathbf{h}_i) - y_i - \varepsilon - \xi_i\big) + \sum_{i=1}^{N}\hat{\alpha}_i\big(y_i - g_2(\mathbf{h}_i) - \varepsilon - \hat{\xi}_i\big), \end{aligned} \qquad (9)$$

where $\boldsymbol{\xi}$ and $\hat{\boldsymbol{\xi}}$ are the slack variables. $\alpha_i$, $\hat{\alpha}_i$, $\mu_i$, and $\hat{\mu}_i$, which correspond to the elements of $\boldsymbol{\alpha}$, $\hat{\boldsymbol{\alpha}}$, $\boldsymbol{\mu}$, and $\hat{\boldsymbol{\mu}}$, are the Lagrange multipliers and can be solved for by building the dual problem of (8) with the Karush-Kuhn-Tucker constraints [27]. The desired coefficient matrix $W_2$ of the second layer is obtained by computing the partial derivatives of (9) with respect to $W_2$, b, and $\boldsymbol{\xi}$. The final bias b is

$$b = \frac{1}{|S|}\sum_{i \in S} b_i \qquad (10)$$

with

$$b_i = y_i + \varepsilon - W_2^{\mathsf T}\mathbf{h}_i, \qquad (11)$$

where S is the set of support vectors and $b_i$ is the bias value corresponding to the i-th support vector.
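To illustrate the layer-2 objective, the sketch below minimizes the primal form (6) directly by subgradient descent rather than solving the dual with the KKT conditions as done in the text; this substitution is our simplification for brevity, and all names are ours.

```python
import numpy as np

def train_layer2(H, y, C=1.0, eps=0.1, lr=1e-3, epochs=2000, seed=3):
    """Subgradient descent on the primal soft-margin objective (6), using the
    averaged epsilon-insensitive loss for stable step sizes:
        1/2 * ||W2||^2 + C * mean(l_eps(g2(h_i) - y_i)).
    H: (N, n) layer-1 feature vectors, y: (N,) target scores."""
    rng = np.random.default_rng(seed)
    N, n = H.shape
    w = rng.normal(scale=1e-3, size=n)   # W2 as a flat vector
    b = 0.0
    for _ in range(epochs):
        r = H @ w + b - y                               # g2(h_i) - y_i
        g = np.where(np.abs(r) > eps, np.sign(r), 0.0)  # d l_eps / d r
        w -= lr * (w + C * (H.T @ g) / N)               # regularizer + loss term
        b -= lr * C * g.mean()
    return w, b

# Toy usage with a known linear relation plus noise.
rng = np.random.default_rng(4)
H = rng.normal(size=(300, 5))
y = H @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.05 * rng.normal(size=300)
w, b = train_layer2(H, y)
```

In the actual framework, solving the dual yields the multipliers and the support vector set S, from which the bias follows via (10) and (11); the sketch instead learns b jointly by gradient steps.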
5. Discussions and Conclusions
This paper implements a newly developed phoneme pronunciation diagnostic framework for French CAPT modalities as a register-transfer-level core. Classical machine learning networks are impacted by the multicollinearity problem among the predictors of the utterance sample vectors; the PLS algorithm is therefore applied to the desired network as the feature extraction layer to suppress the collinearity. Next, the soft-margin SVM is used as the second network layer to enhance the classification ability of the network. Experimental results demonstrate that this method possesses better accuracy than the state-of-the-art. Nevertheless, we must note that the performance of the DNN implementations is constrained by the training data size, so the experiments of this paper cannot prove that the proposed algorithm inevitably leads to the best performance. Considering that a classical DNN model includes at least 5 layers (1 input, 3 hidden, and 1 output layer), whereas the proposed one has only 4 (1 input, 1 PLS feature extraction, 1 SVM classification, and 1 output layer), the latter is more suitable for systems on chips.
As for the register-transfer-level implementation of the design, we prototype it via a newly proposed VHLS SW/HW co-design flow in order to facilitate development and maintenance. During this work, it was found that directly synthesizing the behavior from Matlab down to RTL prevents the implementation from benefiting from the running-efficiency advantages of FPGAs; a series of optimizations was therefore made at the loop and instruction levels. The CUEB French Phoneme Database is used to evaluate the achievements of this work. The experimental results verify the basic function of the new implementation by comparing it with its Matlab and C++ implementations. The hardware evaluation experiments demonstrate that the prototype of this paper makes efficient use of the given hardware resources and achieves a considerable speedup. Despite the many benefits in development productivity and ease of maintenance, it should be noted that high-level synthesis seriously constrains the performance of FPGA implementations in terms of hardware cost and running efficiency compared to implementations at lower abstraction levels. If high performance is desired, further low-level optimizations are still required, especially when the constraints of the place-and-route cycle are taken into account.
In future research, we will further improve the methods of this paper. The PLS methods and the hardware implementation experience will be considered as a potential sparse-learning solution to data-hungry problems, which may also allow embedded CAPT applications to benefit from deep learning methods. Meanwhile, there still exist other hardware solutions worth trying, such as MicroBlaze, which may provide good performance if well optimized.