Article
Peer-Review Record

Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus

Appl. Sci. 2023, 13(7), 4237; https://doi.org/10.3390/app13074237
by Kailin Liang, Bin Liu, Yifan Hu, Rui Liu *, Feilong Bao and Guanglai Gao
Submission received: 2 March 2023 / Revised: 21 March 2023 / Accepted: 23 March 2023 / Published: 27 March 2023
(This article belongs to the Special Issue Audio, Speech and Language Processing)

Round 1

Reviewer 1 Report

Article "Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus"

Comment 1: Explain how Mongolian is a representative low-resource language and describe the lack of open-source data for it.

Comment 2: Add your numerical results to the abstract and remove the website link from the abstract.

Comment 3: In your contribution, you use MnTTS2. What is your model in this contribution?

Comment 4: The related work and literature review should be extended.

Comment 5: Figure 3b should be improved.

Comment 6: In Section 4.4, Robustness Analysis, the error types should be explained in more detail.

Author Response

We have revised the manuscript and responded to the questions and comments made by the reviewers in the document line by line. Please read the document to see the specific responses.

Author Response File: Author Response.docx

Reviewer 2 Report

The article introduces an extension of the authors' previous work on a Mongolian TTS corpus. The authors enriched MnTTS with two additional speakers and present two baseline TTS systems for the Mongolian language. Finally, they perceptually evaluated the quality of the synthetic speech, in terms of naturalness and speaker similarity, for the two Mongolian TTS systems.

An open-source dataset and TTS systems for a low-resource language deserve great appreciation, particularly when they are adapted to the state of the art. The article is well organized and easy to read, although some additional details would improve the content. Here are some questions and suggestions:


1. In Section 3.2.1, the authors explain how they prepared the textual content for recording. While they mention that they filtered out some unsuitable content, it would be useful to explain the term "our requirement". Text selection for a small TTS corpus (30 h) can affect its linguistic and acoustic diversity.
Another question concerning this step: were the speakers recorded on similar linguistic content, at least partially?

2. In Section 4.1.1, the speaker embedding layer needs to be explained. One of the main novelties of this article compared with the previous publication is the enrichment of the corpus with new speakers. Consequently, the speaker embedding part of the TTS system deserves more detail.
At the least, it should be stated whether the speaker embedding network is trained separately.

3. In Section 4.3, it is mentioned that "... the SS-MOS scores for F1, F2, and F3 were 4.58, 4.04, and 4.12, respectively, with large differences across speakers." I believe the numerical cross-speaker scores are needed to give a better view of the current scores. They would also serve the main contribution of the article.


Minor remarks:

- First paragraph: FastSpech2(s) [9] > FastSpeech2(s) [9]

- Figure 2: The y-axes could share the same range, which would help in comparing the different speakers. Otherwise, they can be presented as PDFs.

Author Response

We have revised the manuscript and responded to the questions and comments made by the reviewers in the document line by line. Please read the document to see the specific responses.

Round 2

Reviewer 1 Report

The authors' responses are satisfactory.
