Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis
Abstract
1. Introduction
2. F0 Detection and Refinement
2.1. Contf0: Baseline
2.2. Adaptive Kalman Filtering
2.3. Adaptive Time-Warping
2.4. Adaptive StoneMask
3. Continuous Speech Analysis/Synthesis System
3.1. Baseline Vocoder
3.2. Harmonic-to-Noise Ratio
3.3. Maximum Voiced Frequency Estimation
- (1) Consecutive frames of the input signal are obtained using a 3-period-long Hanning window.
- (2) An N-point fast Fourier transform (FFT) of every analysis frame is computed, where N is equal to or greater than 4 times the frame length.
- (3) Magnitude spectral peaks are detected in each frame, and each peak is assigned an SLM score through cross-correlation [56].
- (4) The error of the MVF position at each peak is then computed.
- (5) To obtain the final sequence of MVF estimates, a dynamic programming approach is used to eliminate spurious values and to minimize a cost function.
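The analysis front-end of the steps above can be sketched in Python. Everything here (the function name, the 5 ms frame shift, the power-of-two FFT size, and the simple local-maximum peak picker) is an illustrative assumption rather than the paper's implementation; the SLM scoring and the dynamic-programming stages are omitted:

```python
import numpy as np

def mvf_analysis_frames(x, fs, f0, nfft_factor=4):
    """Sketch of the MVF analysis front-end (illustrative names only).

    x  : speech samples; fs : sampling rate (Hz);
    f0 : one (continuous) F0 estimate per analysis frame.
    Returns, per frame, the frequencies (Hz) of the magnitude spectral peaks.
    """
    peaks = []
    hop = int(0.005 * fs)                      # assumed 5 ms frame shift
    for i, f in enumerate(f0):
        # (1) 3-period-long Hanning window for the current frame
        win_len = int(3 * fs / max(f, 50.0))   # floor at 50 Hz guards F0 ~ 0
        start = i * hop
        frame = x[start:start + win_len]
        if len(frame) < win_len:
            break
        frame = frame * np.hanning(win_len)
        # (2) N-point FFT with N >= 4 * frame length (next power of two)
        nfft = 2 ** int(np.ceil(np.log2(nfft_factor * win_len)))
        mag = np.abs(np.fft.rfft(frame, nfft))
        # (3) local maxima of the magnitude spectrum
        idx = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
        peaks.append(idx * fs / nfft)          # peak frequencies in Hz
    return peaks
```

For a 200 Hz sine at 16 kHz, the first frame's peak list contains a value within a few Hz of 200, reflecting the main spectral lobe of the windowed tone.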
4. Experimental Setup
4.1. Datasets
4.2. Error Measurement Metrics
- (1) Gross Pitch Error (GPE): the proportion of frames considered voiced by both the estimated and the reference F0 for which the relative pitch error is higher than a certain threshold (usually set to 20% for speech).
- (2) Mean Fine Pitch Error (MFPE): fine pitch errors are all pitch errors that are not classified as gross; MFPE can therefore be derived from Equation (25) when the relative pitch error does not exceed the threshold.
- (3) Standard Deviation (STD): the standard deviation of the fine pitch errors, reported alongside MFPE.
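A minimal sketch of how these metrics could be computed from an estimated and a reference F0 track (function and variable names are assumed; the 20% threshold follows the text, and the STD of the fine errors is included because the result tables report it alongside MFPE):

```python
import numpy as np

def pitch_error_metrics(f0_est, f0_ref, threshold=0.20):
    """Illustrative GPE / MFPE / STD computation (names assumed).

    GPE  : % of mutually voiced frames whose relative error exceeds threshold.
    MFPE : mean relative error (in %) over the remaining 'fine' frames.
    STD  : standard deviation of those fine errors (in %).
    """
    est, ref = np.asarray(f0_est, float), np.asarray(f0_ref, float)
    voiced = (est > 0) & (ref > 0)              # frames voiced in both tracks
    rel_err = np.abs(est[voiced] - ref[voiced]) / ref[voiced]
    gross = rel_err > threshold
    gpe = 100.0 * gross.sum() / max(voiced.sum(), 1)
    fine = 100.0 * rel_err[~gross]              # fine errors, in percent
    mfpe = fine.mean() if fine.size else 0.0
    std = fine.std() if fine.size else 0.0
    return gpe, mfpe, std
```

For example, with reference track [100, 100, 100, 0, 100] Hz and estimate [100, 150, 105, 0, 0] Hz, three frames are mutually voiced, one of them is gross (50% error), so GPE = 33.3% and the fine errors {0%, 5%} give MFPE = 2.5.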
5. Evaluation Results and Discussion
5.1. Objective Evaluation
5.1.1. Noise Robustness of F0 Estimation
5.1.2. Performance Comparison of F0 Estimation
5.1.3. Measuring Speech Quality after Analysis and Re-Synthesis
- One objective measure is the Weighted-Slope Spectral Distance (WSS) [67], which computes the weighted difference between the spectral slopes in each frequency band. The spectral slope is found as the difference between adjacent spectral magnitudes in decibels.
- As the speech production process can be modeled efficiently with Linear Predictive Coefficients (LPC), another objective measure is the Log-Likelihood Ratio (LLR) [65]. It is a distance measure that can be calculated directly from the LPC vectors of the clean and enhanced speech. The segmental LLR values were limited to the range [0, 1].
- We also adopt the frequency-weighted segmental SNR (fwSNRseg) as an error criterion for measuring speech quality, since it correlates much more strongly with subjective speech quality than classical SNR [68]. The fwSNRseg measure applies weights taken from the ANSI SII standard to each frequency band [69]. Instead of working on the entire signal, only frames with segmental SNR in the range of −10 to 35 dB were considered in the average.
- Moreover, Jensen and Taal introduced an effective objective measure, which they called the Extended Short-Time Objective Intelligibility (ESTOI) measure [69]. The ESTOI calculates the correlation between the temporal envelopes of clean and enhanced speech in short frame segments.
- The final objective measure used here is the Normalized Covariance Metric (NCM) [70], which is based on the covariance between the clean and processed Hilbert envelope signals.
5.1.4. Phase Distortion Deviation
5.2. Subjective Evaluation
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Huang, X.; Acero, A.; Hon, H. Spoken Language Processing; Prentice Hall PTR: Upper Saddle River, NJ, USA, 2001.
- Talkin, D. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis; Elsevier: Amsterdam, The Netherlands, 1995; pp. 495–518.
- Yu, K.; Young, S. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 1071–1079.
- Latorre, J.; Gales, M.J.F.; Buchholz, S.; Knill, K.; Tamura, M.; Ohtani, Y.; Akamine, M. Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification? In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011.
- Masuko, T.; Tokuda, K.; Miyazaki, N.; Kobayashi, T. Pitch pattern generation using multi-space probability distribution HMM. IEICE Trans. Inf. Syst. 2000, J85-D-II, 1600–1609.
- Tokuda, K.; Masuko, T.; Miyazaki, N.; Kobayashi, T. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst. 2002, E85-D, 455–464.
- Freij, G.J.; Fallside, F. Lexical stress recognition using hidden Markov models. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), New York, NY, USA, 11–14 April 1988; pp. 135–138.
- Jensen, U.; Moore, R.K.; Dalsgaard, P.; Lindberg, B. Modelling intonation contours at the phrase level using continuous density hidden Markov models. Comput. Speech Lang. 1994, 8, 247–260.
- Garner, P.N.; Cernak, M.; Motlicek, P. A simple continuous pitch estimation algorithm. IEEE Signal Process. Lett. 2013, 20, 102–105.
- Zhang, Q.; Soong, F.; Qian, Y.; Yan, Z.; Pan, J.; Yan, Y. Improved modeling for F0 generation and V/U decision in HMM-based TTS. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA, 15–19 March 2010.
- Nielsen, J.K.; Christensen, M.G.; Jensen, S.H. An approximate Bayesian fundamental frequency estimator. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012.
- Tóth, B.P.; Csapó, T.G. Continuous fundamental frequency prediction with deep neural networks. In Proceedings of the European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 28 August–2 September 2016.
- Csapó, T.G.; Németh, G.; Cernak, M. Residual-based excitation with continuous F0 modeling in HMM-based speech synthesis. In Proceedings of the 3rd International Conference on Statistical Language and Speech Processing (SLSP), Budapest, Hungary, 24–26 November 2015; pp. 27–38.
- Tsanas, A.; Zañartu, M.; Little, M.A.; Fox, C.; Ramig, L.O.; Clifford, G.D. Robust fundamental frequency estimation in sustained vowels: Detailed algorithmic comparisons and information fusion with adaptive Kalman filtering. J. Acoust. Soc. Am. 2014, 135, 2885–2901.
- Stöter, F.-R.; Werner, N.; Bayer, S.; Edler, B. Refining fundamental frequency estimates using time warping. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015.
- Dutoit, T. High-quality text-to-speech synthesis: An overview. J. Electr. Electron. Eng. Aust. 1997, 17, 25–36.
- Kobayashi, K.; Hayashi, T.; Tamamori, A.; Toda, T. Statistical voice conversion with WaveNet-based waveform generation. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1138–1142.
- Kenmochi, H. Singing synthesis as a new musical instrument. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5385–5388.
- Hu, Q.; Richmond, K.; Yamagishi, J.; Latorre, J. An experimental comparison of multiple vocoder types. In Proceedings of the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, 31 August–2 September 2013.
- Kawahara, H.; Masuda-Katsuse, I.; de Cheveigné, A. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Commun. 1999, 27, 187–207.
- McCree, A.V.; Barnwell, T.P. A mixed excitation LPC vocoder model for low bit rate speech coding. IEEE Trans. Speech Audio Process. 1995, 3, 242–250.
- Stylianou, Y.; Laroche, J.; Moulines, E. High-quality speech modification based on a harmonic + noise model. In Proceedings of the EuroSpeech, Madrid, Spain, 18–21 September 1995; pp. 451–454.
- Erro, D.; Sainz, I.; Navas, E.; Hernaez, I. Harmonics plus noise model based vocoder for statistical parametric speech synthesis. IEEE J. Sel. Top. Signal Process. 2014, 8, 184–194.
- Tamamori, A.; Hayashi, T.; Kobayashi, K.; Takeda, K.; Toda, T. Speaker-dependent WaveNet vocoder. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1118–1122.
- Wang, Y.; Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Yang, Z.; Xiao, Y.; Chen, Z.; Bengio, S.; et al. Tacotron: Towards end-to-end speech synthesis. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 4006–4010.
- Agiomyrgiannakis, Y. Vocaine the vocoder and applications in speech synthesis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 4230–4234.
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
- Arik, S.O.; Chrzanowski, M.; Coates, A.; Diamos, G.; Gibiansky, A.; Kang, Y.; Li, X.; Miller, J.; Ng, A.; Raiman, J.; et al. Deep Voice: Real-time neural text-to-speech. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 195–204.
- Ping, W.; Peng, K.; Chen, J. ClariNet: Parallel wave generation in end-to-end text-to-speech. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Drugman, T.; Stylianou, Y. Maximum voiced frequency estimation: Exploiting amplitude and phase spectra. IEEE Signal Process. Lett. 2014, 21, 1230–1234.
- Al-Radhi, M.S.; Csapó, T.G.; Németh, G. Time-domain envelope modulating the noise component of excitation in a continuous residual-based vocoder for statistical parametric speech synthesis. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 434–438.
- Degottex, G.; Lanchantin, P.; Gales, M. A log domain pulse model for parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 57–70.
- McKenna, J.; Isard, S. Tailoring Kalman filtering towards speaker characterisation. In Proceedings of the Eurospeech, Budapest, Hungary, 5–9 September 1999.
- Quillen, C. Kalman filter based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA, 15–19 March 2010.
- Vepa, J.; King, S. Kalman-filter based join cost for unit-selection speech synthesis. In Proceedings of the Interspeech, Geneva, Switzerland, 1–4 September 2003.
- Simon, D. Optimal State Estimation: Kalman, H Infinity, and Nonlinear Approaches; Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
- Li, Q.; Mark, R.G.; Clifford, G.D. Robust heart rate estimation from multiple asynchronous noisy sources using signal quality indices and a Kalman filter. Physiol. Meas. 2008, 29, 15–32.
- Nemati, S.; Malhotra, A.; Clifford, G.D. Data fusion for improved respiration rate estimation. EURASIP J. Adv. Signal Process. 2010, 2010, 1–10.
- Kumaresan, R.; Ramalingam, C.S. On separating voiced-speech into its components. In Proceedings of the 27th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 1–3 November 1993; pp. 1041–1046.
- Kawahara, H.; Katayose, H.; de Cheveigné, A.; Patterson, R.D. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity. In Proceedings of the EuroSpeech, Budapest, Hungary, 5–9 September 1999; pp. 2781–2784.
- Malyska, N.; Quatieri, T.F. A time-warping framework for speech turbulence-noise component estimation during aperiodic phonation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5404–5407.
- Abe, T.; Kobayashi, T.; Imai, S. The IF spectrogram: A new spectral representation. In Proceedings of the ASVA, Tokyo, Japan, 2–4 April 1997; pp. 423–430.
- Stone, S.; Steiner, P.; Birkholz, P. A time-warping pitch tracking algorithm considering fast F0 changes. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 419–423.
- Nuttall, A.H. Some windows with very good sidelobe behavior. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 84–91.
- Flanagan, J.L.; Golden, R.M. Phase vocoder. Bell Syst. Tech. J. 1966, 45, 1493–1509.
- Morise, M.; Yokomori, F.; Ozawa, K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 2016, E99-D, 1877–1884.
- Morise, M.; Kawahara, H.; Nishiura, T. Rapid F0 estimation for high-SNR speech based on fundamental component extraction. IEICE Trans. Inf. Syst. 2010, 93, 109–117.
- Drugman, T.; Dutoit, T. The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 968–981.
- Tokuda, K.; Kobayashi, T.; Masuko, T.; Imai, S. Mel-generalized cepstral analysis—A unified approach to speech spectral estimation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Yokohama, Japan, 18–22 September 1994; pp. 1043–1046.
- Imai, S.; Sumita, K.; Furuichi, C. Mel Log Spectrum Approximation (MLSA) filter for speech synthesis. Electron. Commun. Jpn. Part I Commun. 1983, 66, 10–18.
- Griffin, D.W. Multi-Band Excitation Vocoder. Ph.D. Thesis, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, March 1987.
- Hoene, C.; Wiethölter, S.; Wolisz, A. Calculation of speech quality by aggregating the impacts of individual frame losses. In Proceedings of the IWQoS, Lecture Notes in Computer Science, Passau, Germany, 21–23 June 2005; pp. 136–150.
- Severin, F.; Bozkurt, B.; Dutoit, T. HNR extraction in voiced speech, oriented towards voice quality analysis. In Proceedings of the EUSIPCO, Antalya, Turkey, 4–8 September 2005.
- Boersma, P. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the Institute of Phonetic Sciences; University of Amsterdam: Amsterdam, The Netherlands, 1993.
- Stylianou, Y. Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2001, 9, 21–29.
- Rodet, X. Musical sound signals analysis/synthesis: Sinusoidal + residual and elementary waveform models. In Proceedings of the IEEE Time-Frequency and Time-Scale Workshop (TFTS), Coventry, UK, 27–29 August 1997; pp. 131–141.
- Kominek, J.; Black, A.W. CMU ARCTIC Databases for Speech Synthesis; Carnegie Mellon University: Pittsburgh, PA, USA, 2003.
- Nakatani, T.; Irino, T. Robust and accurate fundamental frequency estimation based on dominant harmonic components. J. Acoust. Soc. Am. 2004, 116, 3690–3700.
- Boersma, P. Praat, a system for doing phonetics by computer. Glot Int. 2002, 5, 341–345.
- Kawahara, H.; Agiomyrgiannakis, Y.; Zen, H. Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis. In Proceedings of the ISCA Workshop on Speech Synthesis, Sunnyvale, CA, USA, 13–15 September 2016.
- Hua, K. Improving YANGsaf F0 estimator with adaptive Kalman filter. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017.
- Kawahara, H.; Morise, M. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana 2011, 36, 713–727.
- Chu, W.; Alwan, A. SAFE: A statistical approach to F0 estimation under clean and noisy conditions. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 933–944.
- Rabiner, L.R.; Cheng, M.J.; Rosenberg, A.E.; McGonegal, C.A. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 399–417.
- Quackenbush, S.; Barnwell, T.; Clements, M. Objective Measures of Speech Quality; Prentice-Hall: Englewood Cliffs, NJ, USA, 1988.
- Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238.
- Klatt, D. Prediction of perceived phonetic distance from critical band spectra: A first step. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Paris, France, 3–5 May 1982; pp. 1278–1281.
- Tribolet, J.; Noll, P.; McDermott, B.; Crochiere, R.E. A study of complexity and quality of speech waveform coders. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Tulsa, OK, USA, 10–12 April 1978.
- Jensen, J.; Taal, C.H. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 2009–2022.
- Ma, J.; Hu, Y.; Loizou, P. Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J. Acoust. Soc. Am. 2009, 125, 3387–3405.
- Degottex, G.; Erro, D. A uniform phase representation for the harmonic model in speech synthesis applications. EURASIP J. Audio Speech Music Process. 2014, 38, 1–16.
- Fisher, N.I. Statistical Analysis of Circular Data; Cambridge University Press: Cambridge, UK, 1995.
- International Telecommunications Union. Method for the Subjective Assessment of Intermediate Audio Quality; ITU-R Recommendation BS.1534; International Telecommunications Union: Geneva, Switzerland, 2001.
| Method | GPE % (BDL) | GPE % (JMK) | GPE % (SLT) | MFPE (BDL) | MFPE (JMK) | MFPE (SLT) | STD (BDL) | STD (JMK) | STD (SLT) |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 12.754 | 9.850 | 7.677 | 3.558 | 3.428 | 4.421 | 4.756 | 4.513 | 6.764 |
| contF0_AKF | 11.268 | 12.611 | 6.732 | 2.764 | 2.754 | 3.692 | 3.964 | 3.719 | 6.113 |
| contF0_TWRP | 8.294 | 8.777 | 7.827 | 2.764 | 3.024 | 3.656 | 3.873 | 4.188 | 5.788 |
| contF0_STMSK | 10.557 | 7.530 | 6.998 | 1.661 | 1.389 | 2.105 | 2.526 | 1.872 | 4.181 |
| YANGsaf | 4.231 | 2.049 | 4.592 | 1.658 | 1.452 | 2.142 | 2.239 | 1.575 | 4.160 |
| Method | GPE % (BDL) | GPE % (JMK) | GPE % (SLT) | MFPE (BDL) | MFPE (JMK) | MFPE (SLT) | STD (BDL) | STD (JMK) | STD (SLT) |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 33.170 | 40.057 | 27.502 | 4.050 | 3.901 | 3.512 | 4.393 | 4.293 | 3.912 |
| contF0_AKF | 31.728 | 40.865 | 26.122 | 3.211 | 3.241 | 2.898 | 3.465 | 3.627 | 3.448 |
| contF0_TWRP | 29.464 | 37.839 | 26.932 | 3.199 | 3.165 | 2.890 | 3.449 | 3.511 | 3.186 |
| contF0_STMSK | 31.418 | 37.052 | 26.352 | 2.128 | 1.896 | 2.067 | 2.103 | 1.658 | 2.058 |
| YANGsaf | 27.530 | 35.200 | 25.852 | 2.233 | 2.181 | 2.175 | 2.206 | 2.219 | 2.265 |
| Method | GPE % (BDL) | GPE % (JMK) | GPE % (SLT) | MFPE (BDL) | MFPE (JMK) | MFPE (SLT) | STD (BDL) | STD (JMK) | STD (SLT) |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 25.041 | 26.870 | 33.124 | 2.919 | 2.799 | 2.845 | 3.061 | 2.936 | 3.180 |
| contF0_AKF | 24.548 | 28.034 | 31.103 | 2.285 | 2.293 | 2.284 | 2.338 | 2.327 | 2.468 |
| contF0_TWRP | 21.512 | 22.329 | 29.893 | 2.256 | 2.482 | 2.472 | 2.253 | 2.702 | 2.787 |
| contF0_STMSK | 24.371 | 26.131 | 32.775 | 1.429 | 1.179 | 1.387 | 1.686 | 1.981 | 1.140 |
| YANGsaf | 15.401 | 12.509 | 22.186 | 1.419 | 1.307 | 1.393 | 2.282 | 2.732 | 2.022 |
| Method | Pitch Algorithm |
|---|---|
| Proposed #1 | adContF0-based adaptive Kalman filter |
| Proposed #2 | adContF0-based adaptive Time-warping |
| Proposed #3 | adContF0-based adaptive StoneMask |
| Metric | Speaker | Baseline | Proposed #1 | Proposed #2 | Proposed #3 | STRAIGHT |
|---|---|---|---|---|---|---|
| fwSNRseg | BDL | 8.083 | 11.812 | 11.807 | 13.033 | 15.062 |
| fwSNRseg | JMK | 6.816 | 9.505 | 9.784 | 10.621 | 13.094 |
| fwSNRseg | SLT | 7.605 | 9.906 | 9.736 | 11.079 | 15.295 |
| NCM | BDL | 0.650 | 0.850 | 0.854 | 0.913 | 0.992 |
| NCM | JMK | 0.620 | 0.847 | 0.860 | 0.906 | 0.963 |
| NCM | SLT | 0.673 | 0.850 | 0.854 | 0.910 | 0.991 |
| ESTOI | BDL | 0.642 | 0.856 | 0.861 | 0.892 | 0.923 |
| ESTOI | JMK | 0.620 | 0.831 | 0.847 | 0.873 | 0.895 |
| ESTOI | SLT | 0.679 | 0.848 | 0.846 | 0.894 | 0.945 |
| LLR | BDL | 0.820 | 0.457 | 0.456 | 0.453 | 0.219 |
| LLR | JMK | 0.814 | 0.635 | 0.631 | 0.628 | 0.391 |
| LLR | SLT | 0.744 | 0.639 | 0.640 | 0.636 | 0.194 |
| WSS | BDL | 48.569 | 32.875 | 32.559 | 24.013 | 22.144 |
| WSS | JMK | 51.788 | 36.236 | 32.175 | 26.238 | 29.748 |
| WSS | SLT | 58.043 | 42.789 | 45.254 | 26.906 | 23.614 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Al-Radhi, M.S.; Csapó, T.G.; Németh, G. Adaptive Refinements of Pitch Tracking and HNR Estimation within a Vocoder for Statistical Parametric Speech Synthesis. Appl. Sci. 2019, 9, 2460. https://doi.org/10.3390/app9122460