Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
Round 1
Reviewer 1 Report
Article "Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus"
Comment 1: Explain in what sense Mongolian is a representative low-resource language and why open-source data are lacking.
Comment 2: Add your numerical results to the abstract and remove the website link from the abstract.
Comment 3: In your contributions, you use MnTTS2. What is your model in this contribution?
Comment 4: The related work and literature review should be extended.
Comment 5: Figure 3b should be improved.
Comment 6: In Section 4.4, Robustness Analysis, the error types should be explained in more detail.
Author Response
We have revised the manuscript and responded to the questions and comments made by the reviewers in the document line by line. Please read the document to see the specific responses.
Author Response File: Author Response.docx
Reviewer 2 Report
The article introduces an extension of the authors' previous work on a Mongolian TTS corpus. The authors enriched MnTTS with two additional speakers and present two baseline TTS systems for the Mongolian language. Finally, they perceptually evaluated the quality of the synthetic speech in terms of naturalness and speaker similarity on two Mongolian TTS systems.
An open-source dataset and TTS systems for a low-resource language deserve great appreciation, particularly when they are adapted to the state of the art. The article is well organized and easy to read, although some additional details would improve the content. Here are some questions and suggestions:
1. In Section 3.2.1, the authors explain how they prepared the textual content for recording. Although they mention that they filtered out some unsuitable content, it would be useful to explain the term "our requirement". Text selection for a small TTS corpus (30 h) can affect its linguistic and acoustic diversity.
Another question concerning this step: were the speakers recorded on the same linguistic content, at least partially?
2. In Section 4.1.1, the speaker embedding layer needs to be explained. One of the main novelties of this article compared with the previous publication is the enrichment of the corpus with new speakers. Consequently, the speaker embedding part of the TTS system deserves more detail.
At the very least, it should be stated whether the speaker embedding network is trained separately.
3. Section 4.3 states: " ... the SS-MOS scores for F1, F2, and F3 were 4.58, 4.04, and 4.12, respectively, with large differences across speakers." I believe the numerical cross-speaker scores are needed to put the current scores in perspective. They would also support the main contribution of the article.
Minor remarks:
- First paragraph: FastSpech2(s) [9] > FastSpeech2(s) [9]
- Figure 2: The y-axes could share the same range, which would help in comparing the different speakers. Otherwise, they could be presented as PDFs.
Author Response
We have revised the manuscript and responded to the questions and comments made by the reviewers in the document line by line. Please read the document to see the specific responses.
Round 2
Reviewer 1 Report
The authors' responses are satisfactory.