Semantic Features Based N-Best Rescoring Methods for Automatic Speech Recognition
Abstract
1. Introduction
For example, the following n-best hypotheses for one utterance differ only in an acoustically confusable region; semantic context favors the second (SURGED):

- 446c040q-1 THERE ARE INDICATIONS THAT SALES ARE SLOWING DOWN BUT CONSUMER CREDIT SEARCH UPWARD IN DECEMBER
- 446c040q-2 THERE ARE INDICATIONS THAT SALES ARE SLOWING DOWN BUT CONSUMER CREDIT SURGED UPWARD IN DECEMBER
- 446c040q-3 THERE ARE INDICATIONS THAT SALES ARE SLOWING DOWN BUT CONSUMER CREDIT SIR <UNK> UPWARD IN DECEMBER
2. Related Work
3. Topic Model and Word Embeddings
3.1. Latent Dirichlet Allocation for Topic Modelling
- Sample the length $N$ of the document from a Poisson distribution: $N \sim \mathrm{Poisson}(\xi)$.
- Sample a multinomial distribution over topics for document $i$ from a Dirichlet distribution parameterized by $\alpha$: $\theta_i \sim \mathrm{Dir}(\alpha)$.
- For the $j$-th word in the document, sample the topic of this word, $z_{ij}$, from $\mathrm{Multinomial}(\theta_i)$, and then sample the word from the unigram distribution given the topic: $w_{ij} \sim p(w \mid z_{ij}, \beta)$ (a minimal sampling sketch follows this list).
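As a concrete illustration, here is a minimal sketch of the three-step generative process above; the parameter names (`alpha`, `beta`, `xi`) and the toy vocabulary are our own illustrative notation, not the paper's.

```python
import numpy as np

def generate_document(alpha, beta, xi, seed=0):
    """Sample one synthetic document via the LDA generative process.

    alpha: Dirichlet prior over topics, shape (K,)
    beta:  per-topic word distributions, rows sum to 1, shape (K, V)
    xi:    Poisson rate for the document length
    """
    rng = np.random.default_rng(seed)
    n_words = rng.poisson(xi)                     # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                  # theta_i ~ Dir(alpha)
    doc = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # z_ij ~ Multinomial(theta_i)
        w = rng.choice(beta.shape[1], p=beta[z])  # w_ij ~ p(w | z_ij, beta)
        doc.append(w)
    return doc

# Toy example: 2 topics over a 4-word vocabulary, average length 10.
alpha = np.array([1.0, 1.0])
beta = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
print(generate_document(alpha, beta, xi=10))
```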
3.2. Word Embedding
- $f(0) = 0$. If $f$ is viewed as a continuous function, it should vanish as $x \to 0$ fast enough that $\lim_{x \to 0} f(x)\log^2 x$ is finite.
- $f(x)$ should be non-decreasing, so that rare co-occurrences are not overweighted.
- $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted (a standard choice of $f$ satisfying all three conditions is sketched after this list).
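The weighting function proposed in the original GloVe paper, $f(x) = (x/x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise (with $x_{\max} = 100$, $\alpha = 3/4$), satisfies all three conditions; a small sketch:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Clipped power weighting: f(x) = (x/x_max)^alpha for x < x_max, else 1.

    f(0) = 0, f is non-decreasing, and f saturates at 1, so neither rare
    nor frequent co-occurrence counts dominate the objective.
    """
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight([0, 1, 50, 100, 1000]))  # approx. [0, 0.032, 0.595, 1, 1]
```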
4. N-Best Rescoring with Coordination Scores
4.1. Sentence Coordination Scores
4.1.1. Sentence Probability Score Using LDA
4.1.2. Topic Similarity Score Using LDA
4.1.3. Word-Pair Probability Score Using Word Embedding
4.1.4. Word-Discourse Probability Score Using Word Embedding
4.2. Word Fallibility for ASR Hypotheses
4.3. N-Best Rescoring for ASR
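As a rough sketch of the rescoring pattern this section describes, assume each hypothesis carries a decoder log-probability and a sentence-level semantic coordination score, combined log-linearly; the tuple layout, the weight `lam`, and the example numbers below are illustrative assumptions rather than the authors' exact formulation.

```python
def rescore_nbest(nbest, lam=0.5):
    """Return the hypothesis maximizing decoder score + lam * semantic score.

    nbest: list of (text, decoder_logprob, semantic_logprob) tuples.
    lam:   semantic-score weight; in practice tuned on a development set
           (0.5 here is arbitrary).
    """
    text, _, _ = max(nbest, key=lambda h: h[1] + lam * h[2])
    return text

# A semantic score can flip an acoustically driven ranking (made-up numbers):
print(rescore_nbest([("CREDIT SEARCH UPWARD", -100.0, -16.1),
                     ("CREDIT SURGED UPWARD", -100.5, -14.7)]))
# -> CREDIT SURGED UPWARD
```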
5. Experiments
5.1. Experimental Setup
5.2. LDA-Based Rescoring
5.3. Word Embedding Based Rescoring
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
- Mikolov, T.; Kombrink, S.; Burget, L.; Černocký, J.; Khudanpur, S. Extensions of recurrent neural network language model. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5528–5531.
- Tam, Y.C.; Schultz, T. Unsupervised language model adaptation using latent semantic marginals. In Proceedings of the Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 17–21 September 2006.
- Mnih, A.; Hinton, G. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 641–648.
- Arora, S.; Li, Y.; Liang, Y.; Ma, T.; Risteski, A. Linear algebraic structure of word senses, with applications to polysemy. Trans. Assoc. Comput. Linguist. 2018, 6, 483–495.
- Chu, S.M.; Mangu, L. Improving Arabic broadcast transcription using automatic topic clustering. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4449–4452.
- Jin, W.; He, T.; Qian, Y.; Yu, K. Paragraph vector based topic model for language model adaptation. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Lau, J.H.; Baldwin, T.; Cohn, T. Topically driven neural language model. arXiv 2017, arXiv:1704.08012.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Mikolov, T.; Zweig, G. Context dependent recurrent neural network language model. In Proceedings of the 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, 2–5 December 2012.
- Tam, Y.C.; Schultz, T. Dynamic language model adaptation using variational Bayes inference. In Proceedings of INTERSPEECH 2005-Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
- Haidar, M.A.; O'Shaughnessy, D. Novel weighting scheme for unsupervised language model adaptation using latent Dirichlet allocation. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
- Haidar, M.A.; O'Shaughnessy, D. LDA-based LM adaptation using latent semantic marginals and minimum discriminant information. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 2040–2044.
- Ramabhadran, B.; Siohan, O.; Sethy, A. The IBM 2007 speech transcription system for European parliamentary speeches. In Proceedings of the 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), Kyoto, Japan, 9–13 December 2007; pp. 472–477.
- Heidel, A.; Lee, L.S. Robust topic inference for latent semantic language model adaptation. In Proceedings of the 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), Kyoto, Japan, 9–13 December 2007; pp. 177–182.
- Helmke, H.; Rataj, J.; Mühlhausen, T.; Ohneiser, O.; Ehr, H.; Kleinert, M.; Oualil, Y.; Schulder, M.; Klakow, D. Assistant-based speech recognition for ATM applications. In Proceedings of the 11th USA/Europe Air Traffic Management Research and Development Seminar (ATM2015), Lisbon, Portugal, 23–26 June 2015.
- Kleinert, M.; Helmke, H.; Ehr, H.; Kern, C.; Klakow, D.; Motlicek, P.; Singh, M.; Siol, G. Building blocks of assistant based speech recognition for air traffic management applications. In Proceedings of the SESAR Innovation Days 2018, Salzburg, Austria, 3–7 December 2018.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Arora, S.; Li, Y.; Liang, Y.; Ma, T.; Risteski, A. A latent variable model approach to PMI-based word embeddings. Trans. Assoc. Comput. Linguist. 2016, 4, 385–399.
- Hashimoto, T.B.; Alvarez-Melis, D.; Jaakkola, T.S. Word embeddings as metric recovery in semantic spaces. Trans. Assoc. Comput. Linguist. 2016, 4, 273–286.
- Audhkhasi, K.; Sethy, A.; Ramabhadran, B. Semantic word embedding neural network language models for automatic speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5995–5999.
- He, T.; Xiang, X.; Qian, Y.; Yu, K. Recurrent neural network language model with structured word embeddings for speech recognition. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5396–5400.
- Wagner, R.A.; Fischer, M.J. The string-to-string correction problem. J. ACM 1974, 21, 168–173.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011.
- Stolcke, A. SRILM - An extensible language modeling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002.
Average LDA-based scores for correct (c), substituted (s), and inserted (i) words on the Dev93 and Eval92 sets:

Score Type | Dev93-c | Dev93-s | Dev93-i | Eval92-c | Eval92-s | Eval92-i |
---|---|---|---|---|---|---|
word probability | −15.16 | −15.20 | −16.08 | −14.68 | −14.68 | −15.68 |
topic similarity | −0.35 | −0.40 | −0.36 | −0.34 | −0.40 | −0.36 |
WER (%) on Dev93 and Eval92 for LDA-based rescoring, with and without the word-fallibility weighting, against the baseline, a 2-gram rescoring LM, and a small RNN LM:

Score Type | # Topics | Fallibility | Dev93 WER (%) | Eval92 WER (%) |
---|---|---|---|---|
baseline | - | - | 9.58 | 6.98 |
2gram | - | - | 9.57 | 6.88 |
smallRNN | - | - | 9.44 | 6.57 |
LDAprob | 10 | no | 9.44 | 6.59 |
LDAprob | 10 | yes | 9.35 | 6.52 |
LDAprob | 20 | no | 9.38 | 6.65 |
LDAprob | 20 | yes | 9.35 | 6.52 |
LDAprob | 30 | no | 9.39 | 6.66 |
LDAprob | 30 | yes | 9.38 | 6.52 |
LDAprob | 40 | no | 9.44 | 6.66 |
LDAprob | 40 | yes | 9.36 | 6.54 |
LDAtopsim | 10 | no | 9.53 | 6.70 |
LDAtopsim | 10 | yes | 9.40 | 6.54 |
LDAtopsim | 20 | no | 9.45 | 6.73 |
LDAtopsim | 20 | yes | 9.38 | 6.61 |
LDAtopsim | 30 | no | 9.47 | 6.75 |
LDAtopsim | 30 | yes | 9.38 | 6.57 |
LDAtopsim | 40 | no | 9.50 | 6.75 |
LDAtopsim | 40 | yes | 9.34 | 6.54 |
Average word-embedding-based scores for correct (c), substituted (s), and inserted (i) words on the Dev93 and Eval92 sets:

Score Type | Dev93-c | Dev93-s | Dev93-i | Eval92-c | Eval92-s | Eval92-i |
---|---|---|---|---|---|---|
word-pair | −9.64 | −10.87 | −11.98 | −9.61 | −11.02 | −12.15 |
word-discourse | −7.45 | −7.94 | −9.39 | −7.49 | −8.02 | −9.47 |
WER (%) on Dev93 and Eval92 for word-embedding-based rescoring across embedding dimensions, with and without the word-fallibility weighting:

Score Type | Feature Dimension | Fallibility | Dev93 WER (%) | Eval92 WER (%) |
---|---|---|---|---|
baseline | - | - | 9.58 | 6.98 |
2gram | - | - | 9.57 | 6.88 |
smallRNN | - | - | 9.44 | 6.57 |
word-pair | 30 | no | 9.44 | 6.57 |
word-pair | 30 | yes | 9.33 | 6.45 |
word-pair | 50 | no | 9.38 | 6.54 |
word-pair | 50 | yes | 9.30 | 6.50 |
word-pair | 80 | no | 9.40 | 6.57 |
word-pair | 80 | yes | 9.33 | 6.52 |
word-pair | 100 | no | 9.44 | 6.63 |
word-pair | 100 | yes | 9.34 | 6.54 |
word-pair | 300 | no | 9.53 | 6.91 |
word-pair | 300 | yes | 9.39 | 6.68 |
word-discourse | 30 | no | 9.46 | 6.61 |
word-discourse | 30 | yes | 9.30 | 6.47 |
word-discourse | 50 | no | 9.34 | 6.66 |
word-discourse | 50 | yes | 9.29 | 6.47 |
word-discourse | 80 | no | 9.39 | 6.56 |
word-discourse | 80 | yes | 9.30 | 6.52 |
word-discourse | 100 | no | 9.34 | 6.70 |
word-discourse | 100 | yes | 9.32 | 6.54 |
word-discourse | 300 | no | 9.34 | 6.42 |
word-discourse | 300 | yes | 9.33 | 6.52 |
WER (%) when an LDA-based score is combined with a word-embedding-based score; the numeric suffix gives the number of topics (LDA) or the embedding dimension:

LDA Score Type | Word-Embedding Score Type | Fallibility | Dev93 WER (%) | Eval92 WER (%) |
---|---|---|---|---|
baseline | - | - | 9.58 | 6.98 |
LDAprob10 | word-pair30 | yes | 9.32 | 6.56 |
LDAprob10 | word-discourse50 | yes | 9.30 | 6.54 |
LDAtopsim40 | word-pair30 | yes | 9.32 | 6.52 |
LDAtopsim40 | word-discourse50 | yes | 9.32 | 6.42 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).