Article

A Mandarin Tone Recognition Algorithm Based on Random Forest and Feature Fusion †

1 School of Microelectronics, Shandong University, Jinan 250100, China
2 China Telecom Shandong Branch, Jinan 250098, China
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in Proceedings of the 7th International Conference on Control Engineering and Artificial Intelligence, CCEAI 2023, Sanya, China, 28–30 January 2023; pp. 168–172.
‡ These authors contributed equally to this work and are co-first authors.
Mathematics 2023, 11(8), 1879; https://doi.org/10.3390/math11081879
Submission received: 1 March 2023 / Revised: 24 March 2023 / Accepted: 13 April 2023 / Published: 15 April 2023

Abstract

In human–computer interaction (HCI) systems for Mandarin learning, tone recognition is of great importance. A novel tone recognition method based on random forest (RF) and feature fusion is proposed in this study. First, three fusion feature sets (FFSs) were created by applying different fusion methods to sound source features linked to Mandarin syllable tone. After constructing CART decision trees from the three FFSs, the corresponding RF tone classifiers were modeled and optimized. The method was tested and evaluated on the Syllable Corpus of Standard Chinese (SCSC), a speaker-independent Mandarin monosyllable corpus, and its performance was also assessed on small sample sets. The results show that the tone recognition algorithm achieves high accuracy, generalizes well, and classifies unbalanced data reliably. This indicates that the proposed approach is highly efficient and robust and is appropriate for mobile HCI learning systems.

1. Introduction

Different from English and other Western non-tonal languages, monosyllables with the same pinyin in Mandarin have four tones, such as mother (ma1), hemp (ma2), horse (ma3), and scold (ma4). For a learner whose mother tongue is non-tonal, mastering the tonal pronunciation of monosyllables is therefore both a difficulty and a key point in learning Mandarin [1]; even a Chinese child with severe to profound prelingual deafness has a similar problem after the implantation of a cochlear implant (CI) device [2]. Mandarin sentences are composed of many consecutive monosyllabic words, and because of vocal cord vibration inertia and the mutual influence of adjacent syllables during pronunciation, the tones of some syllables change according to tone sandhi rules. A portable Mandarin tone training system with a human–computer interaction (HCI) function is therefore of great significance for learners. However, such a system requires a tone recognition algorithm that is low in complexity, efficient, robust, and speaker independent.
The key to tone recognition lies in the extraction of feature parameters and the design of the tone classifier. A number of tone recognition approaches based on either machine learning or deep learning already exist. Fu et al. [3] extracted features related to fundamental frequency and energy in voiced segments and used support vector machines (SVMs) to automatically recognize the four Mandarin tones, achieving 93.52% accuracy. For Mizo, a language with four tones like Mandarin, an SVM-based classifier using six F0 features achieved 73.39% accuracy and a deep neural network (DNN)-based classifier achieved 74.11% [4]. However, the extracted sound source features in these two methods were not fully optimized and fused. Zheng et al. [5] proposed a tone recognition algorithm for Mandarin three-syllable words based on fundamental frequency and an improved back propagation neural network (BPNN), with accuracies of 87.50%, 70.83%, and 79.17% for the first, middle, and last word, respectively, though the BPNN model required 800,000 training epochs. Shen et al. [6] put forward a Chinese four-tone recognition method based on the fusion of prosodic and cepstral features; the tone classification accuracies of the Gaussian mixture model, BPNN, SVM, and convolutional neural network (CNN) were 84.55%, 86.28%, 85.50%, and 87.60%, respectively. Liu et al. [7] proposed a one-step continuous Mandarin tone recognition method based on pitch and spectral features using an MSD-HMM (multi-space distribution hidden Markov model), with a final tone accuracy of 88.80%. The performance of these two algorithms is still not ideal, and their tone feature parameters are not optimized for the tone recognition task. In [8], a speaker-independent four-tone recognition system for Mandarin digits was realized based only on the pitch contour of each syllable, achieving 90.20% recognition accuracy with a response time of about 1.64 s; although the system is very simple, both accuracy and response time need further improvement. A CNN-based Mandarin tone classifier [9] built on features pre-trained from Mel frequency cepstrum coefficient (MFCC) vectors with a denoising autoencoder achieved 95.53% accuracy, but it required 3600 syllables for training. The ToneNet model [10], based on a CNN and a multi-layer perceptron, classified Mandarin monosyllabic tones with 99.16% accuracy on the Syllable Corpus of Standard Chinese (SCSC); however, it used the Mel spectrogram image as the model input and required a large amount of training data and heavy computation.
Although some of the above methods achieve good results in Mandarin tone recognition, no existing method combines high accuracy with low complexity. Random forest (RF) is an efficient ensemble modeling approach: during model construction, the bootstrap aggregating (bagging) algorithm and a random feature selection strategy help it avoid falling into local optima [11,12]. Moreover, no Mandarin tone recognition method based on RF and feature fusion has been reported, and how to construct appropriate fusion features for an RF classifier, and what recognition performance it can reach, remain to be explored. Thus, in this paper, we introduce RF into tone recognition to establish a highly efficient, high-performance Mandarin tone recognition method suitable for mobile training systems, especially with small sample sets.
In this study, sound source features related to Mandarin tones are first comprehensively selected and fused, and fusion feature sets (FFSs) are produced using three fusion methods. Each FFS is used to build a corresponding RF model, and classification and comparison experiments are then conducted. Furthermore, the tone recognition performance of the different FFSs and RF models on small sample sets is evaluated. The results show that the RF classifier is highly effective and robust and is suitable for Mandarin tone recognition with small sample sets.
The contributions of this paper are summarized as follows:
  • RF is applied for the first time to speaker-independent Mandarin tone recognition.
  • Through a large number of experiments with three FFSs built from sound source features alone, we find that RF-based tone recognition is a high-stability, low-complexity approach.
  • The proposed method is shown to recognize tones well on small sample sets and to have strong generalization ability.

2. Materials and Methods

2.1. Data Description

Fundamental frequency, which varies greatly with speaker and gender, is an important feature parameter for tone recognition (the fundamental frequency range of male voices is 70~200 Hz, and that of female voices is 140~400 Hz [13]). An effective speaker-independent Mandarin tone training system can therefore be designed so that the learner first selects a gender and then starts pronunciation practice. Accordingly, we study the performance of the proposed algorithm for one gender; the method for the other gender can then be inferred.
In this work, the Syllable Corpus of Standard Chinese (SCSC) [14] was used to evaluate the effectiveness and robustness of the presented method. The corpus contains syllables used in daily Mandarin from fifteen male speakers, labeled m01, m02, …, m15 (ages not recorded). There are 1275 identical syllables per speaker, and the sound files are stored in high-quality mono WAV format with 16 kHz sampling and 16-bit resolution. To form the experimental speech dataset and balance the four tones, 40 monosyllables were selected from each speaker, giving 600 syllables in total: ten each of tone 1, tone 2, tone 3, and tone 4 per speaker (see Appendix A for details).

2.2. Preprocessing

In short-time frame processing, the frame length was set to 30 ms and the frame shift to 10 ms. The double-threshold method based on short-time zero-crossing rate and energy was then used to detect the voiced segments [15]. A Chebyshev low-pass filter with a cut-off frequency of 900 Hz was used to remove high-frequency vocal tract components, and the auto-correlation method was used to derive the fundamental frequency parameters [16].
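For illustration, the following is a minimal sketch of this preprocessing front end in Python with NumPy/SciPy (standing in for the paper's MATLAB implementation). The energy and zero-crossing thresholds, the filter order and ripple, and the 70~400 Hz pitch search range are assumed values not specified in the text.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

FS = 16000                  # SCSC sampling rate (16 kHz)
FRAME = int(0.030 * FS)     # 30 ms frame length
SHIFT = int(0.010 * FS)     # 10 ms frame shift

def split_frames(x):
    """Split a signal into overlapping 30 ms frames with a 10 ms shift."""
    n = 1 + max(0, (len(x) - FRAME) // SHIFT)
    return np.stack([x[i * SHIFT: i * SHIFT + FRAME] for i in range(n)])

def voiced_mask(x, e_frac=0.1, z_max=0.3):
    """Double-threshold voiced-segment detection from short-time energy and
    zero-crossing rate; both thresholds are illustrative and would be tuned
    on the corpus in practice."""
    f = split_frames(x)
    energy = (f ** 2).sum(axis=1)
    zcr = (np.abs(np.diff(np.sign(f), axis=1)) > 0).mean(axis=1)
    return (energy > e_frac * energy.max()) & (zcr < z_max)

def f0_contour(x, fmin=70, fmax=400):
    """Per-frame autocorrelation F0 after a 900 Hz Chebyshev low-pass
    filter (filter order and passband ripple are assumed values)."""
    b, a = cheby1(4, 0.5, 900, btype="low", fs=FS)
    y = filtfilt(b, a, x)
    lo, hi = FS // fmax, FS // fmin          # admissible pitch lags
    f0 = []
    for frame, voiced in zip(split_frames(y), voiced_mask(x)):
        if not voiced:
            f0.append(0.0)                   # unvoiced frame: no pitch
            continue
        r = np.correlate(frame, frame, mode="full")[FRAME - 1:]
        f0.append(FS / (lo + np.argmax(r[lo:hi])))
    return np.array(f0)
```

The F0 contour produced this way is the raw material for the sound source features fused in the next subsection.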

2.3. Feature Fusion

After a preliminary experiment (see Section 3.2.1 for details), we found that cepstrum features contribute little to tone recognition accuracy, so only sound source features were used for feature fusion.
In this paper, seven original feature sets were selected after reviewing the literature; they are shown in Table 1.
Efficient RF models rely on high-quality FFSs. It is currently unclear which of the above feature parameters matter most for the tone recognition task, so we used the three feature fusion methods shown in Figure 1 to explore this.
The specific details of these sets are as follows:
  • FFS SI was directly composed of all 94 feature parameters from S1 to S7.
  • The second method involved a BPNN, which has good performance and wide application and was selected as a fixed classifier model for optimizing the features of the tone recognition task. The number of nodes in the BPNN's hidden layer was set to 32. The process was as follows: the ReliefF algorithm [22] was used to calculate the weight of each feature in SI, and the features were ranked by weight from largest to smallest. We then fed the features into the BPNN in that order for tone recognition and stopped once recognition accuracy no longer rose (a code sketch of this selection loop follows the list). FFS SII was thus formed, containing fifteen features.
  • FFS SIII, which includes twelve features, was obtained using the third method. First, the top three feature sets among S1 to S7 were selected by the BPNN. Each feature from these top three sets was then ranked by ReliefF. Lastly, the twelve features were selected by a process similar to that of the second method.
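The ranking-and-selection loop shared by the second and third methods can be sketched as follows. This is a basic multiclass ReliefF combined with scikit-learn's MLPClassifier as a stand-in for the paper's 32-node BPNN; the neighbor count and cross-validation settings are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def relieff(X, y, n_neighbors=10):
    """Basic multiclass ReliefF: reward features that separate a sample
    from its nearest misses, penalize those separating it from its hits."""
    X = (X - X.min(0)) / (np.ptp(X, axis=0) + 1e-12)   # scale diffs to [0, 1]
    n, d = X.shape
    w = np.zeros(d)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                                # exclude the sample itself
        for c in classes:
            idx = np.where(y == c)[0]
            near = idx[np.argsort(dist[idx])][:n_neighbors]
            diff = np.abs(X[near] - X[i]).mean(axis=0)
            if c == y[i]:
                w -= diff / n                           # nearest hits
            else:                                       # nearest misses
                w += prior[c] / (1 - prior[y[i]]) * diff / n
    return w

def incremental_select(X, y):
    """Add features in descending ReliefF weight until cross-validated
    accuracy stops improving (the stopping rule described in the text)."""
    order = np.argsort(relieff(X, y))[::-1]
    best_acc, chosen = 0.0, []
    for j in order:
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
        acc = cross_val_score(clf, X[:, chosen + [int(j)]], y, cv=5).mean()
        if acc <= best_acc:
            break
        best_acc, chosen = acc, chosen + [int(j)]
    return chosen
```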

2.4. Classifier

The proposed method aims to handle small sample sets well, model quickly, and rely on simple calculations so that it can be deployed on small mobile terminals for Mandarin learners. The tone recognition classifier should therefore not be too complicated, making low-complexity machine learning models more suitable. Back propagation neural networks, support vector machines, the Naive Bayes model (NBM), AdaBoost, and random forest are commonly used machine learning classifiers [6,23,24], so these five classifiers were used for the preliminary experiment in Section 3.2.2. We found that the RF classifier has better learning ability; however, no previous study has applied RF to Mandarin tone recognition.

2.5. Tone Recognition Classifier Based on Random Forest

2.5.1. CART Decision Tree and Random Forest Modeling

A random forest is composed of T decision trees, where T is a hyperparameter. ID3, C4.5, and CART are commonly used decision tree algorithms [25], and the classification accuracy of CART has been shown to exceed that of the other algorithms [26]. Thus, in this paper, the CART algorithm was used to construct the decision trees. The samples for each decision tree in the random forest are chosen randomly (via bootstrap sampling of the training set during training), which effectively avoids overfitting and improves robustness.
In the training process, the training set has shape M1 × N, where M1 denotes the number of training samples (covering L tone types) and N denotes the number of features per sample; T is the number of decision trees.
The specific process is as follows:
  1. During the construction of each decision tree, M1 samples are drawn from the training set with replacement, producing a bootstrap sample set of size M1 in which some training samples appear multiple times and some not at all. This yields the M1 × N feature matrix F = {f_{i,j}, i = 1, …, M1; j = 1, …, N}.
  2. At the root node of each decision tree, the optimal feature, i.e., the one with the smallest Gini index, is selected from the N features, and its feature value becomes the decision at the root node.
In the CART algorithm, the Gini index is used to select the node feature and represents the impurity of the dataset; the smaller the Gini index, the lower the impurity. This is expressed by the following equation:
$$\mathrm{Gini}(D, F) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

$$\mathrm{Gini}(D_1) = 1 - \sum_{t=1}^{S}\left(\frac{|S_t|}{|D_1|}\right)^{2}$$

$$\mathrm{Gini}(D_2) = 1 - \sum_{t=1}^{S}\left(\frac{|S_t|}{|D_2|}\right)^{2}$$
where D denotes the sample set at a given node, containing S tone types and |D| samples in total, and |S_t| is the number of samples of the t-th tone type in the corresponding subset. D is divided into two subsets (D_1 and D_2) according to the value of the current node feature f_{i,j}, with D_1 containing the samples whose f_{i,j} value is lower than the node's feature value (a code sketch of this splitting rule is given after the list).
  3. Next, the M1 × N matrix is divided into two parts, M11 × N and M12 × N. The optimal feature with the smallest Gini index is selected from the M11 × N part, and its feature value becomes the decision at that branch node; the same step is performed on M12. M11 is divided in two again, and the process is repeated until all features are used or a tone type is output, which constitutes a leaf node.
  4. Steps 1–3 are repeated T times to construct T decision trees, which together form the random forest. Figure 2 shows one decision tree.
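As an illustration of the node-splitting rule defined by the Gini equations above, the following sketch scans candidate features and thresholds and keeps the split with the smallest weighted Gini index. It is a simplified, unoptimized rendering of the CART step, not the paper's implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_t (|S_t| / |D|)^2 of a label set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(F, y):
    """Return the (feature j, threshold v) pair minimizing Gini(D, F)."""
    best = (None, None, np.inf)
    for j in range(F.shape[1]):
        for v in np.unique(F[:, j]):
            mask = F[:, j] < v                    # D1: values below threshold
            if not mask.any() or mask.all():
                continue                          # split must be non-trivial
            g = (mask.mean() * gini(y[mask])            # |D1|/|D| * Gini(D1)
                 + (~mask).mean() * gini(y[~mask]))     # |D2|/|D| * Gini(D2)
            if g < best[2]:
                best = (j, v, g)
    return best[:2]
```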

2.5.2. Tone Recognition Based on the Random Forest Classifier

In the test process, the test set is an M2 × N matrix, where M2 denotes the number of test samples. The recognition steps are as follows:
  • The test set is fed into the pre-trained random forest classifier.
  • Starting from the root node of the current decision tree, the random forest classifier compares the feature parameters based on the value of the current node on each decision tree until the decision reaches the leaf node, which outputs the corresponding tone type.
  • Since each decision tree is independent in the recognition process of each test sample, the final recognition result of the test sample is obtained via a voting process involving the results of multiple decision trees.
The entire training and test process is shown in Figure 3.
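The voting step above, and an equivalent off-the-shelf formulation, can be sketched as follows. scikit-learn's RandomForestClassifier is a stand-in for the paper's MATLAB models, and the random matrices merely mimic the shape of an FFS such as SIII (12 features); they are not real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def majority_vote(tree_preds):
    """tree_preds: (T, M2) integer tone labels, one row per tree.
    Returns the per-sample majority label, as in the voting step above."""
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               axis=0, arr=tree_preds)

# Off-the-shelf equivalent (random data stands in for a real FFS matrix):
rng = np.random.default_rng(0)
X_train = rng.normal(size=(540, 12))     # M1 x N training matrix (SIII-sized)
y_train = rng.integers(1, 5, size=540)   # tone labels 1-4
X_test = rng.normal(size=(60, 12))       # M2 x N test matrix

rf = RandomForestClassifier(n_estimators=350, criterion="gini")
rf.fit(X_train, y_train)
tone_pred = rf.predict(X_test)           # internally aggregates the T tree votes
```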

3. Experiment and Result

All experiments in this paper were implemented and tested on MATLAB R2020a using a 64-bit computer (Intel Core i7-12700 CPU, 2.10 GHz; 16.0 GB RAM).

3.1. Optimization Experiment of RF Classifier’s Hyperparameter T

The number of decision trees T is an important hyperparameter of the RF classifier, and three RF classifiers were constructed based on SI, SII, and SIII. T was set to 100, 200, 300, 400, and 500 to locate the optimal value, and the search was then refined in steps of 50 around the best value. The evaluation metrics were recognition accuracy (ACC), the area under the receiver operating characteristic curve (AUROC), and the area under the precision–recall curve (AUPRC), which are commonly used to evaluate classifier performance [27].
As shown in Figure 4 and Table 2, Table 3 and Table 4, the RF classifiers based on SI, SII, and SIII achieve their best performance at T = 400 (ACC, AUROC, and AUPRC of 98.33%, 98.88%, and 98.32%, respectively), T = 350 (97.50% ACC, 98.35% AUROC, and 97.50% AUPRC), and T = 350 (98.00% ACC, 98.65% AUROC, and 97.97% AUPRC), respectively.
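A sketch of this T sweep with scikit-learn follows. Macro-averaged one-vs-rest AUROC and AUPRC are assumed, since the text does not specify the multiclass averaging scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             roc_auc_score)
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize

def evaluate_T(X, y, T_values=(100, 200, 300, 400, 500)):
    """Score ACC / AUROC / AUPRC for each candidate tree count T."""
    classes = np.unique(y)
    y_bin = label_binarize(y, classes=classes)
    results = {}
    for T in T_values:
        rf = RandomForestClassifier(n_estimators=T)
        # Out-of-fold class probabilities from 5-fold cross-validation.
        proba = cross_val_predict(rf, X, y, cv=5, method="predict_proba")
        results[T] = {
            "ACC": accuracy_score(y, classes[proba.argmax(axis=1)]),
            "AUROC": roc_auc_score(y, proba, multi_class="ovr",
                                   average="macro"),
            "AUPRC": average_precision_score(y_bin, proba, average="macro"),
        }
    return results
```

The same function can then be rerun with a finer grid (steps of 50) around whichever T scores best, matching the refinement described above.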

3.2. Preliminary Experiment

3.2.1. Analysis of the Role of Vocal Tract Features in Tone Recognition

The feature parameters of the original speech signal can be divided into vocal tract features and sound source features. The former are mainly spectrum envelope parameters such as MFCC, and the latter are mainly time domain features, such as duration, energy, and fundamental frequency. MFCC, as a typical vocal tract feature, is commonly used in acoustic analysis such as automatic speech recognition [28] and dialect and language recognition [29]. However, the role of this feature in tone recognition needs to be evaluated through analysis experiments.
To extract MFCCs, preprocessing (without the 900 Hz low-pass filtering), fast Fourier transform (FFT) calculation, spectral line energy, Mel filter-bank energy, and discrete cosine transform (DCT) cepstrum computation are needed. For each speech frame, twelve MFCCs and twelve ΔMFCCs were extracted, and these 24 parameters formed a one-dimensional feature vector. The ten frames in the central part of each syllable were selected, and the resulting 240 feature parameters were named the cepstrum feature set used in the tone recognition pre-experiment. With five-fold cross-validation and a three-layer BPNN with 64 hidden-layer nodes, tone recognition accuracy was 50.67%, so tone recognition using the cepstrum feature set alone is clearly not effective. Next, we used both sound source features and cepstrum features. We selected the fundamental frequency statistical features introduced in reference [6] as the sound source features. The fundamental frequency statistical features and the cepstrum features were first used separately for tone recognition experiments on the BPNN, and the two BPNN tone classifiers were then combined with weights α and 1 − α, respectively, to explore how tone recognition accuracy varies with the weight α. The specific formulas are as follows:
$$T_{pitch}^{*} = \mathrm{average}\{T_{pitch}(T_n X_i)\}$$

$$T_{MFCC}^{*} = \mathrm{average}\{T_{MFCC}(T_n X_i)\}$$

where $T_{pitch}^{*}$ and $T_{MFCC}^{*}$ are the recognition accuracies obtained using the fundamental frequency statistical features and the cepstrum features, respectively. n is the tone label, with values of 1, 2, 3, and 4. $T_{pitch}(T_n X_i)$ is the tone (T_n) accuracy on test sample set X_i (1 ≤ i ≤ N, where N is the total number of samples) when using the fundamental frequency statistical features, and $T_{MFCC}(T_n X_i)$ is the corresponding T_n accuracy when using the cepstrum features. The combined recognition accuracy of the two tone classifiers is defined as

$$T^{*} = \alpha \cdot T_{pitch}^{*} + (1 - \alpha) \cdot T_{MFCC}^{*}$$
We carried out five-fold cross-validation and varied α from 0 to 1 in steps of 0.1; the results are shown in Figure 5.
It can be seen that the accuracy of tone recognition is not high when the cepstrum features are used alone. When the fundamental frequency statistical features and cepstrum features are used jointly for tone recognition, the fundamental frequency statistical features play a major role in greatly improving classification accuracy. Cepstrum features have little effect on improving the accuracy of tone recognition but greatly increase the operational complexity of parameter extraction. Therefore, cepstrum features are not used for tone recognition in this paper.
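For concreteness, the weighted combination defined above can be sketched as follows; the per-tone accuracy values here are placeholders for illustration, not the measured results.

```python
import numpy as np

def combined_accuracy(t_pitch_tones, t_mfcc_tones, alpha):
    """T* = alpha * T*_pitch + (1 - alpha) * T*_MFCC, where each T* term
    is the mean of the per-tone accuracies T_n over n = 1..4."""
    return alpha * np.mean(t_pitch_tones) + (1 - alpha) * np.mean(t_mfcc_tones)

# Placeholder per-tone accuracies (illustrative, not the measured values):
t_pitch = [0.97, 0.93, 0.92, 0.96]   # BPNN on F0 statistical features
t_mfcc = [0.55, 0.47, 0.49, 0.52]    # BPNN on the cepstrum feature set

for alpha in np.arange(0.0, 1.05, 0.1):   # the paper's 0:0.1:1 sweep
    t_star = combined_accuracy(t_pitch, t_mfcc, alpha)
    print(f"alpha = {alpha:.1f}  ->  T* = {t_star:.4f}")
```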

3.2.2. Analysis of Classifiers in Tone Recognition

The five classifiers, each in its typical structure, were used for pre-experiments on S1 to S7, with the results shown in Figure 6. On S2, S5, S6, and S7, RF achieved the highest recognition accuracy, and on the remaining feature sets (S1, S3, and S4), the BPNN achieved the highest accuracy, which indicates that random forest has better learning ability. Further, the recognition accuracy of the five classifiers was averaged over the seven feature sets, and the average accuracy of the RF classifier was slightly higher than that of the others. The best NBM result reached 96.67% while the worst was only 50.67%, indicating that NBM classification is unstable and not robust; by contrast, the accuracy of RF and the BPNN always remained above 95%. Since the five classifiers were evaluated on the same seven feature sets and differ only in their underlying mathematical models, these results show how the choice of mathematical model affects classifier performance.
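A sketch of this pre-experiment using scikit-learn stand-ins for the five classifiers is given below; the default settings play the role of the "typical structure" and are assumptions rather than the paper's exact configurations.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# The five classifiers compared in Figure 6.
CLASSIFIERS = {
    "BPNN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000),
    "SVM": SVC(),
    "NBM": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(),
    "RF": RandomForestClassifier(),
}

def compare_classifiers(feature_sets, y):
    """feature_sets: dict mapping 'S1'..'S7' to (M, N_k) feature matrices."""
    for s_name, X in feature_sets.items():
        for c_name, clf in CLASSIFIERS.items():
            acc = cross_val_score(clf, X, y, cv=5).mean()
            print(f"{s_name:3s} {c_name:8s} ACC = {acc:.4f}")
```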

3.3. Comparative Experiment

In order to study the performance of the proposed method, comparative experiments were performed.

3.3.1. Comparative Experiments of Different Fusion Feature Sets

Four performance metrics were obtained by applying five-fold cross-validation to the three FFSs with their RF classifiers. The results are shown in Table 5.
The average processing time per sample (APTPS) is the average time from feature extraction to identification of a single syllable tone. This index shows that the RF methods achieve real-time identification together with high recognition accuracy on all three FFSs, demonstrating that the FFS-based RF tone recognition methods are highly efficient.
SI contains 94 features, the most of the three FFSs, and can therefore describe tone characteristics most comprehensively; accordingly, its ACC, AUROC, and AUPRC values are the highest and its classification effect the best. The experimental results also show that the classification effect of SII (15 features) is slightly worse than that of SIII (12 features), even though SII contains more features. This shows that simply increasing the number of features does not necessarily improve recognition accuracy and suggests that how the features are fused matters more.
In addition, feature optimization is also needed to decrease algorithm complexity. By comparing SII and SIII, it can be seen that the classification effect is better and the running speed is faster with SIII. In general, the feature fusion method can not only reduce the running time, but also maintain high recognition accuracy.

3.3.2. Comparative Experiments of Small Sample Sets

To analyze the performance of the RF classifier in tone recognition, we carried out comparative experiments on small training sets. The experimental speech database (i.e., 600 syllables) was divided into 10 parts, each containing the same number of samples of each of the four tones. The proportion of training samples was reduced from 90% to 10% in steps of 10%, with the percentage of test samples correspondingly increased from 10% to 90%.
Figure 7 shows the recognition results of the corresponding nine groups based on the three FFSs and RF classifiers. As the proportion of training samples decreases, tone recognition accuracy drops by only about four percentage points. Even when the training samples accounted for just ten percent of the database (i.e., 60 samples), accuracy remained above 93.57%, which demonstrates the powerful learning ability of random forest.
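A sketch of this shrinking-training-set experiment follows; stratified splitting keeps the four tones balanced in each part, as in the text, while the RF settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def small_sample_curve(X, y, T=350):
    """Shrink the stratified training share from 90% to 10% in 10% steps,
    as in Figure 7, and report test accuracy at each step."""
    for train_frac in np.arange(0.9, 0.05, -0.1):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=float(train_frac), stratify=y, random_state=0)
        rf = RandomForestClassifier(n_estimators=T).fit(X_tr, y_tr)
        print(f"train = {train_frac:.0%}  ACC = {rf.score(X_te, y_te):.4f}")
```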

4. Discussion

Using state-of-the-art methods (ToneNet [10] and CNN [9]) as benchmarks, we compared accuracy; the results are shown in Table 6. Although [10] used ToneNet to perform tone recognition on the SCSC database with an accuracy of 99.16%, it depends on a large amount of training data and heavy computation, which runs contrary to our aim of deploying a model on small mobile terminals. In [10], the authors also applied the CNN method of [9] to the SCSC, with results that were not as good as those obtained in this work. Considering accuracy, operation time, and computational complexity together, our method is therefore the most cost effective.
In addition, we carried out cross-dataset (i.e., beyond the original 600 syllables) testing experiments with balanced and extremely unbalanced samples. The balanced samples were 100 new syllables from the 15 original speakers, randomly selected from the SCSC with 25 samples of each of tone 1, tone 2, tone 3, and tone 4. The extremely unbalanced samples were another 400 syllables from the same 15 speakers, with tone 1:tone 2:tone 3:tone 4 ratios of 7:1:1:1, 1:7:1:1, 1:1:7:1, and 1:1:1:7, respectively. The balanced samples were fed into the three modeled FFS-based RF classifiers for tone recognition, with the results shown in Figure 8a. The extremely unbalanced samples were fed into the modeled RF classifier corresponding to SIII (the most cost-effective set), with the results shown in Figure 8b. We find that tone 2 and tone 3 are more difficult to recognize than tone 1 and tone 4, for both balanced and extremely unbalanced samples; this is consistent with previous research [30]. The results of the cross-dataset experiments show that the RF-based tone classification algorithms have strong generalization ability.

5. Conclusions

This study introduces a novel Mandarin tone recognition approach based on random forest and feature fusion. Feature fusion and optimization markedly reduce the complexity of the algorithm and make it more suitable for portable Mandarin tone recognition, and the random forest model is a robust, low-complexity tone classifier. Comparative experiments validate the performance of RF on three different FFSs, showing that RF modeling offers high recognition accuracy, simplicity, and strong learning capability; good recognition can be achieved with the simplified FFS SIII and a small number of training samples. The proposed algorithm has achieved the expected results, but only simulation verification has been completed so far; in future work we will verify the method on more databases and deploy it on a practical learning terminal.

Author Contributions

Conceptualization, J.Y., M.L. and L.T.; methodology, M.L. and X.W.; software, J.Y. and J.L.; validation, J.Y., X.W. and M.L.; investigation, J.Y.; writing—original draft preparation, J.Y. and M.L.; writing—review and editing, L.T., J.Y., Q.M., M.Z. and H.X.; visualization, J.L. and X.W.; supervision, L.T.; funding acquisition, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shandong Province, grant number ZR2021ZD40 and ZR2021MF065, and in part by the Research Project for Graduate Education and Teaching Reform, Shandong University, China, grant number XYJG2020108.

Data Availability Statement

The SCSC data can be obtained from the Laboratory of Phonetics and Speech Science, Institute of Linguistics, CASS at http://paslab.phonetics.org.cn/?p=1741 (accessed on 20 March 2023).

Acknowledgments

We gratefully acknowledge the support from the above funds.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. List of the 600 Syllables from Fifteen Speakers in the SCSC Database

Table A1. Speaker: m01.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | a1 | ai1 | ao1 | cheng1 | e1 | fang1 | feng1 | hou1 | ji1 | lao1
Tone 2 | a2 | ai2 | ao2 | cheng2 | e2 | fang2 | feng2 | hou2 | ji2 | lao2
Tone 3 | a3 | ai3 | ao3 | cheng3 | e3 | fang3 | feng3 | hou3 | ji3 | lao3
Tone 4 | a4 | ai4 | ao4 | cheng4 | e4 | fang4 | feng4 | hou4 | ji4 | lao4
Table A2. Speaker: m02.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | ang1 | en1 | eng1 | wei1 | wo1 | wu1 | yan1 | yang1 | yao1 | yi1
Tone 2 | a2 | ai2 | ang2 | ao2 | er2 | wang2 | wei2 | wen2 | tu2 | e2
Tone 3 | fa3 | lou3 | yi3 | pai3 | yuan3 | wen3 | wo3 | xiang3 | yan3 | yang3
Tone 4 | a4 | ang4 | er4 | gun4 | lian4 | lie4 | lun4 | ou4 | si4 | weng4
Table A3. Speaker: m03.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | fan1 | bie1 | chai1 | cun1 | en1 | mo1 | tuo1 | wu1 | xiong1 | yi1
Tone 2 | a2 | ai2 | fo2 | cun2 | fang2 | cu2 | ju2 | she2 | wang2 | wen2
Tone 3 | wu3 | yan3 | wei3 | fa3 | yao3 | ha3 | lou3 | wang3 | weng3 | xiang3
Tone 4 | lie4 | na4 | lun4 | gun4 | mie4 | mi4 | ou4 | si4 | tie4 | wen4
Table A4. Speaker: m04.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | a1 | bie1 | tuo1 | en1 | yue1 | wa1 | tou1 | you1 | yuan1 | mo1
Tone 2 | ai2 | wen2 | fo2 | fang2 | ju2 | lou2 | wang2 | wu2 | she2 | zhuo2
Tone 3 | shu3 | a3 | yang3 | yan3 | ti3 | yong3 | you3 | wo3 | wei3 | yu3
Tone 4 | ci4 | gun4 | lie4 | lian4 | lun4 | xian4 | hu4 | pao4 | ou4 | zuan4
Table A5. Speaker: m05.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | weng1 | en1 | eng1 | wo1 | wu1 | yao1 | yin1 | you1 | yuan1 | yue1
Tone 2 | a2 | ai2 | er2 | ban2 | wang2 | wen2 | wu2 | cu2 | yong2 | yuan2
Tone 3 | fa3 | er3 | wen3 | weng3 | xiang3 | yao3 | yong3 | you3 | yuan3 | yun3
Tone 4 | ou4 | wen4 | wu4 | yang4 | yao4 | ye4 | ying4 | you4 | yuan4 | yun4
Table A6. Speaker: m06.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | en1 | eng1 | weng1 | wo1 | wu1 | yao1 | yin1 | you1 | yuan1 | yue1
Tone 2 | a2 | ai2 | er2 | wan2 | wang2 | wen2 | wu2 | ang2 | yong2 | yuan2
Tone 3 | ai3 | er3 | wen3 | weng3 | wu3 | yao3 | yong3 | you3 | yuan3 | yun3
Tone 4 | ou4 | wen4 | wu4 | yang4 | yao4 | ye4 | ying4 | you4 | yuan4 | yun4
Table A7. Speaker: m07.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | an1 | ang1 | e1 | ou1 | wai1 | wan1 | wei1 | yan1 | yang1 | gou1
Tone 2 | ao2 | e2 | fang2 | zhuo2 | qu2 | ye2 | ying2 | yu2 | yun2 | chao2
Tone 3 | a3 | an3 | yao3 | wei3 | zhai3 | yang3 | yan3 | ye3 | yin3 | fa3
Tone 4 | a4 | an4 | ang4 | en4 | er4 | wa4 | wang4 | weng4 | ying4 | yong4
Table A8. Speaker: m08.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | gou1 | hao1 | ji1 | nie1 | luo1 | mang1 | nang1 | ao1 | yong1 | wen1
Tone 2 | chao2 | gen2 | tong2 | pu2 | nu2 | ping2 | yan2 | yi2 | cheng2 | zhi2
Tone 3 | ao3 | gan3 | gou3 | gen3 | wa3 | wai3 | wan3 | gu3 | yin3 | yu3
Tone 4 | bai4 | dong4 | he4 | hou4 | ze4 | nan4 | miu4 | wan4 | wei4 | zun4
Table A9. Speaker: m09.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | dun1 | fei1 | hai1 | hui1 | jia1 | qiu1 | sui1 | tan1 | xian1 | zha1
Tone 2 | chou2 | da2 | ji2 | jia2 | li2 | nang2 | niang2 | peng2 | ruan2 | gu2
Tone 3 | da3 | dang3 | dian3 | jia3 | lin3 | qian3 | rao3 | shi3 | ta3 | zen3
Tone 4 | cheng4 | cun4 | guan4 | jing4 | niu4 | qing4 | sui4 | xia4 | zhe4 | zuan4
Table A10. Speaker: m10.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | cui1 | fu1 | gai1 | gu1 | hei1 | jiao1 | liao1 | man1 | dun1 | zi1
Tone 2 | fen2 | hai2 | hao2 | tuan2 | kang2 | ruo2 | qin2 | nong2 | shao2 | gu2
Tone 3 | chang3 | dai3 | ga3 | nian3 | shui3 | tu3 | xing3 | xue3 | zhuang3 | lao3
Tone 4 | ba4 | bi4 | dao4 | duan4 | ju4 | nie4 | nong4 | shui4 | qu4 | kao4
Table A11. Speaker: m11.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | li1 | lao1 | tei1 | meng1 | shang1 | tong1 | pian1 | za1 | zhe1 | peng1
Tone 2 | kuang2 | chu2 | chuang2 | gang2 | hen2 | po2 | jie2 | lun2 | zha2 | zhuo2
Tone 3 | hen3 | mu3 | jiang3 | jue3 | liang3 | meng3 | niao3 | kou3 | tui3 | zhang3
Tone 4 | bao4 | chao4 | gang4 | gui4 | lue4 | nuo4 | qi4 | rui4 | shang4 | shuo4
Table A12. Speaker: m12.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | sheng1 | pan1 | dao1 | de1 | kou1 | sen1 | shai1 | shou1 | che1 | deng1
Tone 2 | cen2 | luo2 | mu2 | qie2 | lin2 | shou2 | hong2 | ta2 | tai2 | xu2
Tone 3 | shun3 | guang3 | kua3 | li3 | qiang3 | zun3 | gei3 | nang3 | zhe3 | sun3
Tone 4 | kuai4 | den4 | di4 | guang4 | kong4 | mu4 | sa4 | shen4 | shou4 | cuan4
Table A13. Speaker: m13.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | gan1 | gao1 | hong1 | hou1 | lu1 | pa1 | po1 | qi1 | zhen1 | zhi1
Tone 2 | biao2 | cong2 | hang2 | wa2 | luo2 | na2 | nan2 | nian2 | pi2 | qia2
Tone 3 | bang3 | dia3 | dou3 | gai3 | jie3 | ka3 | mai3 | qia3 | zhen3 | bie3
Tone 4 | cuan4 | duo4 | gai4 | jie4 | juan4 | mai4 | qia4 | she4 | shua4 | zhan4
Table A14. Speaker: m14.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | bu1 | chang1 | pin1 | dui1 | huang1 | kai1 | kua1 | lin1 | ba1 | zhao1
Tone 2 | bu2 | chui2 | eng2 | ge2 | hu2 | kui2 | min2 | pei2 | ting2 | tou2
Tone 3 | bie3 | chai3 | huang3 | na3 | re3 | bei3 | sheng3 | za3 | zhao3 | zu3
Tone 4 | biao4 | bie4 | cha4 | kun4 | pei4 | pie4 | pu4 | rao4 | tie4 | zong4
Table A15. Speaker: m15.

Tone | Syllable 1 | Syllable 2 | Syllable 3 | Syllable 4 | Syllable 5 | Syllable 6 | Syllable 7 | Syllable 8 | Syllable 9 | Syllable 10
Tone 1 | beng1 | dai1 | he1 | jie1 | jing1 | long1 | bin1 | han1 | qian1 | jiu1
Tone 2 | die2 | huan2 | ke2 | liao2 | nao2 | neng2 | nuo2 | pian2 | qiu2 | qu2
Tone 3 | beng3 | duan3 | pi3 | fou3 | mian3 | fu3 | niu3 | mou3 | tao3 | yao3
Tone 4 | chou4 | guai4 | heng4 | huang4 | jiang4 | mao4 | ba4 | ren4 | xiong4 | xiu4

References

  1. Pelzl, E. What makes second language perception of Mandarin tones hard? A non-technical review of evidence from psycholinguistic research. Chin. Second Lang. 2019, 54, 51–78.
  2. Peng, S.C.; Tomblin, J.B.; Cheung, H.; Lin, Y.S.; Wang, L.S. Perception and production of Mandarin tones in prelingually deaf children with cochlear implants. Ear Hear. 2004, 25, 251–264.
  3. Fu, D.; Li, S.; Wang, S. Tone recognition based on support vector machine in continuous Mandarin Chinese. Comput. Sci. 2010, 37, 228–230.
  4. Gogoi, P.; Dey, A.; Lalhminghlui, W.; Sarmah, P.; Prasanna, S.R.M. Lexical Tone Recognition in Mizo using Acoustic-Prosodic Features. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020.
  5. Zheng, Y. Phonetic Pitch Detection and Tone Recognition of the Continuous Chinese Three-Syllabic Words. Master's Thesis, Jilin University, Jilin, China, 2004.
  6. Shen, L.J.; Wang, W. Fusion Feature Based Automatic Mandarin Chinese Short Tone Classification. Technol. Acoust. 2018, 37, 167–174.
  7. Liu, C.; Ge, F.; Pan, F.; Dong, B.; Yan, Y. A One-Step Tone Recognition Approach Using MSD-HMM for Continuous Speech. In Proceedings of the Interspeech 2009, Brighton, UK, 6–10 September 2009.
  8. Chang, K.; Yang, C. A real-time pitch extraction and four-tone recognition system for Mandarin speech. J. Chin. Inst. Eng. 1986, 9, 37–49.
  9. Chen, C.; Bunescu, R.; Xu, L.; Liu, C. Tone Classification in Mandarin Chinese using Convolutional Neural Networks. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016.
  10. Gao, Q.; Sun, S.; Yang, Y. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019.
  11. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  12. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  13. Biemans, M. Gender Variation in Voice Quality. Ph.D. Thesis, Catholic University of Nijmegen, Nijmegen, The Netherlands, 2000.
  14. SCSC-Syllable Corpus of Standard Chinese | Laboratory of Phonetics and Speech Science, Institute of Linguistics, CASS. Available online: http://paslab.phonetics.org.cn/?p=1741 (accessed on 20 March 2023).
  15. He, R. Endpoint Detection Algorithm for Speech Signal in Low SNR Environment. Master's Thesis, Shandong University, Jinan, China, 2018.
  16. Li, M. Study on Multi-Feature Fusion Chinese Tone Recognition Algorithm Based on Machine Learning. Master's Thesis, Shandong University, Jinan, China, 2021.
  17. Zhang, W. Study on Acoustic Features and Tone Recognition of Speech Recognition. Master's Thesis, Shanghai Jiaotong University, Shanghai, China, 2003.
  18. Nie, K. Study on Speech Processing Strategy for Chinese-Spoken Cochlear Implants on the Basis of Characteristics of Chinese Language. Ph.D. Thesis, Tsinghua University, Beijing, China, 1999.
  19. Taylor, P. Analysis and synthesis of intonation using the Tilt model. J. Acoust. Soc. Am. 2000, 107, 1697–1714.
  20. Quang, V.M.; Besacier, L.; Castelli, E. Automatic question detection: Prosodic-lexical features and crosslingual experiments. In Proceedings of the Interspeech 2007, Antwerp, Belgium, 27–31 August 2007.
  21. Ma, M.; Evanini, K.; Loukina, A.; Wang, X.; Zechner, K. Using F0 Contours to Assess Nativeness in a Sentence Repeat Task. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015.
  22. Robnik-Sikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
  23. Onan, A.; Korukoglu, S.; Bulut, H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 2016, 57, 232–247.
  24. Yan, J.; Tian, L.; Wang, X.; Liu, J.; Li, M. A Mandarin Tone Recognition Algorithm Based on Random Forest and Features Fusion. In Proceedings of the 7th International Conference on Control Engineering and Artificial Intelligence, CCEAI 2023, Sanya, China, 28–30 January 2023.
  25. Bittencourt, H.R.; Clarke, R.T. Use of classification and regression trees (CART) to classify remotely-sensed digital images. In Proceedings of the IGARSS 2003, Toulouse, France, 21–25 July 2003.
  26. Javed Mehedi Shamrat, F.M.; Ranjan, R.; Hasib, K.M.; Yadav, A.; Siddique, A.H. Performance Evaluation Among ID3, C4.5, and CART Decision Tree Algorithm. In Proceedings of the ICPCSN 2021, Salem, India, 19–20 March 2021.
  27. Xie, X.; Liu, H.; Chen, D.; Shu, M.; Wang, Y. Multilabel 12-Lead ECG Classification Based on Leadwise Grouping Multibranch Network. IEEE Trans. Instrum. Meas. 2022, 71, 1–11.
  28. Paul, B.; Bera, S.; Paul, R.; Phadikar, S. Bengali Spoken Numerals Recognition by MFCC and GMM Technique. In Proceedings of the Advances in Electronics, Communication and Computing, Odisha, India, 5–6 March 2020.
  29. Koolagudi, S.G.; Rastogi, D.; Rao, K.S. Identification of Language using Mel-Frequency Cepstral Coefficients (MFCC). In Proceedings of the International Conference on Modelling Optimization and Computing, Kumarakoil, India, 10–11 April 2012.
  30. Hao, Y. Second language acquisition of Mandarin Chinese tones by tonal and non-tonal language speakers. J. Phon. 2012, 40, 269–279.
Figure 1. Optimization process of feature sets in tone recognition. S1 to S7 are the original feature sets selected. SI, SII, and SIII are fusion feature sets obtained using three different optimization methods.
Figure 2. Schematic diagram of one decision tree in the RF tone recognition process based on set SIII. xj indicates the jth feature of set SIII. The blue triangular box indicates the branch node, the blue line is the branch, the solid blue dot is the leaf node, and the related number is the tone prediction result.
Figure 3. Flow block of the RF training and test process. The blue section is the training process and the green section is the test process.
Figure 4. Results of optimizing hyperparameter T of the random forest classifier (i.e., the number of decision trees) with three FFSs. (a–c) show that the optimal value of T on SI, SII, and SIII is 400, 350, and 350, respectively.
Figure 5. Bar graph of tone recognition ACC analysis based on vocal tract features and fundamental frequency statistical features. α is the proportion of tone recognition results based on fundamental frequency statistical features. The bar's value when α = 0 indicates the ACC result based only on vocal tract features, and the bar's value when α = 1 indicates the ACC result based only on fundamental frequency statistical features.
Figure 6. Tone recognition results of five different classifiers for sets S1 to S7. On the right end, "Mean" represents the average accuracy of each classifier over the seven feature sets. The value in the red box is the highest recognition accuracy under each feature set.
Figure 7. Recognition results of comparative experiments with small sample sets. The 600 samples were divided into 10 parts. The x-coordinate at ninety percent means 9 parts were taken for training and 1 part was taken for the tone recognition test. For the smallest sample set, only 1 part was taken for training and the remaining 9 parts were taken for the test.
Figure 8. Test results of cross-dataset experiments with balanced samples and extremely unbalanced samples using the modeled RF classifiers. (a) Results of balanced samples based on SI, SII, and SIII. Different colors represent different tones. (b) Results of extremely unbalanced samples based on SIII. The ratio of 7:1:1:1 indicates that tone 1 accounts for seventy percent of the samples, with tone 2, tone 3, and tone 4 each accounting for ten percent; the ratio of 1:7:1:1 indicates that tone 2 accounts for seventy percent, with tone 1, tone 3, and tone 4 each accounting for ten percent. The other ratios follow the same pattern.
Table 1. The seven original feature sets.

Feature Set Name | Source | Number of Features
S1 | Reference [6] | 22
S2 | Reference [17] | 13
S3 | Reference [18] | 6
S4 | Reference [3] | 16
S5 | Reference [5] | 18
S6 | Reference [19] | 7
S7 | References [20,21] | 12
Table 2. Results of SI and RF.

Number of Decision Trees (T) | 100 | 200 | 300 | 350 | 400 | 450 | 500
ACC (%) | 98.17 | 98.00 | 98.00 | 98.00 | 98.33 | 98.00 | 98.00
AUROC (%) | 98.79 | 98.69 | 98.68 | 98.68 | 98.88 | 98.68 | 98.68
AUPRC (%) | 98.15 | 97.99 | 97.98 | 97.98 | 98.32 | 97.98 | 97.98
Table 3. Results of SII and RF.

Number of Decision Trees (T) | 100 | 200 | 250 | 300 | 350 | 400 | 500
ACC (%) | 97.00 | 97.17 | 97.33 | 97.33 | 97.50 | 97.17 | 97.33
AUROC (%) | 98.02 | 98.12 | 98.23 | 98.23 | 98.35 | 98.12 | 98.23
AUPRC (%) | 97.02 | 97.17 | 97.32 | 97.32 | 97.50 | 97.17 | 97.32
Table 4. Results of SIII and RF.

Number of Decision Trees (T) | 100 | 200 | 300 | 350 | 400 | 450 | 500
ACC (%) | 97.50 | 97.50 | 97.67 | 98.00 | 97.83 | 97.83 | 97.67
AUROC (%) | 98.33 | 98.31 | 98.42 | 98.65 | 98.52 | 98.55 | 98.42
AUPRC (%) | 97.47 | 97.48 | 97.63 | 97.97 | 97.80 | 97.81 | 97.63
Table 5. Recognition results from comparative experiments of different FFSs.

Set | SI | SII | SIII
ACC (%) | 98.33 | 97.50 | 98.00
AUROC (%) | 98.88 | 98.35 | 98.65
AUPRC (%) | 98.32 | 97.50 | 97.97
APTPS (s) | 0.0022 | 0.0011 | 0.0007
Table 6. Recognition performance of different methods on the SCSC database.

Method | ACC | Database | Suitable for Small Learning Terminal
ToneNet [10] | 99.16% | SCSC | No
CNN [9] | 94.45% | SCSC | No
The proposed | 98.33% | SCSC | Yes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
