Fully Attentional Network for Low-Resource Academic Machine Translation and Post Editing
Abstract
1. Introduction
Problem Definitions and Motivation
- This study presents a translation system that helps researchers and graduate students obtain better translations while writing academic papers in English, minimizing spelling and grammar errors in the process.
- This study introduces a comprehensive parallel corpus built for the Turkish-English language pair.
- The shallow fusion method added on top of the translation architectures presented in this study may inspire further studies on different language pairs.
- The parallel corpus and the proposed model are released as open source for researchers working in machine translation: https://github.com/ilhamisel/tr-en_translate (accessed on 3 November 2022).
2. Related Work
2.1. Machine Translation
2.2. Turkish-English Machine Translation
3. Materials
3.1. Parallel Corpora Creation
3.2. Sentence Alignment
3.2.1. Hunalign
3.2.2. Vecalign
The alignment cost between a candidate source block and target block is

$$c(x, y) = \frac{\bigl(1 - \cos(x, y)\bigr)\,\mathrm{nSents}(x)\,\mathrm{nSents}(y)}{\sum_{s=1}^{S}\bigl(1 - \cos(x, y_s)\bigr) + \sum_{s=1}^{S}\bigl(1 - \cos(x_s, y)\bigr)}$$

where:

- $x, y$: source and target sentences (or blocks of consecutive sentences),
- $\cos(x, y)$: similarity between the embedding vectors of $x$ and $y$,
- $\mathrm{nSents}(x), \mathrm{nSents}(y)$: number of sentences in $x$ and $y$,
- $x_s, y_s$: randomly sampled example sentences from the given document, used to normalize the cost.
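As a concrete illustration, the minimal Python sketch below computes this cost, assuming sentence embeddings (e.g., LASER vectors) are already available; the function and variable names are illustrative, not Vecalign's actual API.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two sentence embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_cost(x, y, x_rand, y_rand, n_x=1, n_y=1):
    """Cost of aligning source block x with target block y.

    x, y           : embedding vectors of the candidate blocks
    x_rand, y_rand : embeddings of randomly sampled sentences from the two
                     documents, used to normalize the raw dissimilarity
    n_x, n_y       : number of sentences contained in each block
    """
    num = (1.0 - cos(x, y)) * n_x * n_y
    den = (sum(1.0 - cos(x, ys) for ys in y_rand)
           + sum(1.0 - cos(xs, y) for xs in x_rand))
    return num / den

rng = np.random.default_rng(0)
emb = lambda: rng.normal(size=128)                # stand-in embeddings
print(alignment_cost(emb(), emb(),
                     [emb() for _ in range(8)],   # sampled source sentences
                     [emb() for _ in range(8)]))  # sampled target sentences
```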
3.2.3. Text Pre-Processing Steps
- The data were saved to text files, classified according to their fields and publication years as listed on the CoHE's site.
- A corpus was created by reading each file in turn. In this step, theses published under more than one field tag were de-duplicated. The data were then split into sentences, giving approximately 3.5 M Turkish and English sentences at this stage.
- The text was converted to lowercase. Special characters and punctuation marks were removed so that the text consists of letters and numbers only.
- Sentences longer than 400 characters or shorter than 20 characters were discarded.
- Sentence pairs whose Turkish and English character counts differed by more than 150 characters were filtered out (a minimal sketch of these filters follows this list).
- The results of the Hunalign and Vecalign sentence alignment algorithms were combined to provide separate sentence alignments for each thesis.
- After all these processes, 1,217,300 sentence pairs were obtained and approximately 2.3 M sentences were discarded.
- 1000 sentence pairs randomly selected from the corpus were checked by a supervisor. Only 2 of them turned out to be mismatched, a tolerable error rate. While the matches were checked, translation quality itself was not assessed.
- The created corpora consist of approximately 30 k tokens for each language.
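The sketch below illustrates the normalization and filtering steps above with the thresholds stated in the list; it is an illustration of the procedure, not the exact script used to build the corpus, and the example pair is hypothetical.

```python
import re

MAX_LEN, MIN_LEN, MAX_DIFF = 400, 20, 150  # thresholds from the list above

def normalize(sentence: str) -> str:
    """Lowercase and keep only letters, digits and whitespace."""
    sentence = sentence.lower()
    # \w is Unicode-aware in Python 3, so Turkish characters are preserved.
    return re.sub(r"[^\w\s]|_", " ", sentence).strip()

def keep_pair(tr: str, en: str) -> bool:
    """Apply the length filters and the 150-character difference filter."""
    for s in (tr, en):
        if not MIN_LEN <= len(s) <= MAX_LEN:
            return False
    return abs(len(tr) - len(en)) <= MAX_DIFF

# Illustrative aligned pair (output of the Hunalign/Vecalign step):
aligned_pairs = [("Evrişimsel sinir ağı girişleri normalize edildi.",
                  "Convolutional neural network inputs were normalized.")]
corpus = [(normalize(tr), normalize(en))
          for tr, en in aligned_pairs
          if keep_pair(normalize(tr), normalize(en))]
print(corpus)
```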
4. Methods
4.1. Transformer Encoder-Decoder
4.2. Integration of the Pre-Trained Language Model into the Translation System
4.2.1. The Shallow Fusion with FFN
- $y_t$: output token,
- $\log p(y_t) = \log p_{TM}(y_t) + \beta \log p_{LM}(y_t)$: final output,
- $p_{TM}(y_t)$: output of the decoder network,
- $p_{LM}(y_t)$: output of the language model network.
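A minimal sketch of this fusion at a single decoding step, assuming PyTorch; the weight β, tensor shapes, and the stand-in logits are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def shallow_fusion_step(decoder_logits, lm_logits, beta=0.3):
    """Combine translation-model and language-model scores for one step.

    decoder_logits : (batch, vocab) scores from the translation decoder
    lm_logits      : (batch, vocab) scores from the pre-trained LM
    beta           : interpolation weight for the language model
    """
    log_p_tm = F.log_softmax(decoder_logits, dim=-1)  # log p_TM(y_t)
    log_p_lm = F.log_softmax(lm_logits, dim=-1)       # log p_LM(y_t)
    return log_p_tm + beta * log_p_lm                 # fused score

vocab_size = 8
fused = shallow_fusion_step(torch.randn(1, vocab_size),
                            torch.randn(1, vocab_size))
print(fused.argmax(dim=-1))  # greedy pick; beam search would rank these scores
```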
4.2.2. The Shallow Fusion with FAN
4.3. Byte Pair Encoding (BPE)
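As an illustration of how BPE merges are learned, the toy sketch below follows the reference algorithm of Sennrich et al.: the most frequent adjacent symbol pair is merged repeatedly until the desired number of merges is reached. The miniature vocabulary and merge count are illustrative only.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent-symbol-pair frequencies over a (word -> count) vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the given symbol pair everywhere it occurs."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, '</w>' marks the word end.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):  # the number of merges controls the subword vocabulary size
    stats = get_pair_stats(vocab)
    if not stats:
        break
    vocab = merge_pair(max(stats, key=stats.get), vocab)
print(vocab)  # frequent character sequences have become single subword units
```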
4.4. Beam Search
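A minimal, model-agnostic sketch of beam search at decoding time; `step_fn` stands in for the (fused) decoder's next-token distribution and is an assumption of this sketch.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=50):
    """Keep the beam_size best partial hypotheses by cumulative log-probability.

    step_fn(prefix) must return (token, probability) pairs for the next step.
    """
    beams = [([bos], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished hypotheses survive as-is
                candidates.append((seq, score))
                continue
            for token, prob in step_fn(seq):
                candidates.append((seq + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]  # best-scoring hypothesis

# Toy next-token distribution: after "A", prefer "B"; otherwise stop.
def toy_step(seq):
    return [("B", 0.6), ("<eos>", 0.4)] if seq[-1] == "A" else [("<eos>", 1.0)]

print(beam_search(toy_step, bos="A", eos="<eos>", beam_size=2))  # ['A', 'B', '<eos>']
```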
4.5. Evaluation
4.5.1. Perplexity
4.5.2. BLEU (Bilingual Evaluation Understudy)
4.5.3. METEOR (Metric for Evaluation of Translation with Explicit Ordering)
4.5.4. TER (Translation Error Rate)
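The sketch below shows how these four metrics can be computed with common open-source tools; the `sacrebleu` and `nltk` packages are assumptions of this sketch, and the example sentences and log-probabilities are illustrative, not drawn from the paper's test set.

```python
import math
import sacrebleu
from nltk.translate.meteor_score import meteor_score  # needs nltk 'wordnet' data

hyps = ["convolutional neural network entries were normalized"]
refs = ["convolutional neural network inputs were normalized"]

print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # BLEU: modified n-gram precision
print(sacrebleu.corpus_ter(hyps, [refs]).score)   # TER: edit operations / reference length
print(meteor_score([refs[0].split()], hyps[0].split()))  # METEOR: unigram match + ordering

# Perplexity is the exponentiated average negative log-likelihood that a
# language model assigns to the evaluated tokens (illustrative values):
token_log_probs = [-1.2, -0.4, -2.1]
print(math.exp(-sum(token_log_probs) / len(token_log_probs)))
```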
5. Results
5.1. Implementation Setup and Hyperparameters
5.2. Evaluating Translation Quality
6. Discussion and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Barrault, L.; Bojar, O.; Costa-Jussa, M.R.; Federmann, C.; Fishel, M.; Graham, Y.; Haddow, B.; Huck, M.; Koehn, P.; Malmasi, S.; et al. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, 1–2 August 2019; pp. 1–61.
- Li, F.; Zhu, J.; Yan, H.; Zhang, Z. Grammatically Derived Factual Relation Augmented Neural Machine Translation. Appl. Sci. 2022, 12, 6518.
- Nakazawa, T.; Yaguchi, M.; Uchimoto, K.; Utiyama, M.; Sumita, E.; Kurohashi, S.; Isahara, H. ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 2204–2208.
- Neves, M.; Yepes, A.J.; Névéol, A. The SciELO corpus: A parallel corpus of scientific publications for biomedicine. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 2942–2948.
- Stahlberg, F. Neural machine translation: A review. J. Artif. Intell. Res. 2020, 69, 343–418.
- Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Ranathunga, S.; Lee, E.S.A.; Skenduli, M.P.; Shekhar, R.; Alam, M.; Kaur, R. Neural machine translation for low-resource languages: A survey. arXiv 2021, arXiv:2106.15115.
- Wu, S.; Dredze, M. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 833–844.
- Wang, Z.; Mayhew, S.; Roth, D. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Chi, E.A.; Hewitt, J.; Manning, C.D. Finding Universal Grammatical Relations in Multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5564–5577.
- Guarasci, R.; Silvestri, S.; De Pietro, G.; Fujita, H.; Esposito, M. BERT syntactic transfer: A computational experiment on Italian, French and English languages. Comput. Speech Lang. 2022, 71, 101261.
- de Vries, W.; Bartelds, M.; Nissim, M.; Wieling, M. Adapting Monolingual Models: Data can be Scarce when Language Similarity is High. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 1–6 August 2021; pp. 4901–4907.
- Oflazer, K.; Durgar El-Kahlout, İ. Exploring Different Representational Units in English-to-Turkish Statistical Machine Translation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2007.
- Bisazza, A.; Federico, M. Morphological pre-processing for Turkish to English statistical machine translation. In Proceedings of the 6th International Workshop on Spoken Language Translation: Papers, Tokyo, Japan, 1–2 December 2009.
- Mermer, C.; Kaya, H.; Doğan, M.U. The TÜBİTAK-UEKAE statistical machine translation system for IWSLT 2010. In Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign, Paris, France, 2–3 December 2010.
- Yeniterzi, R.; Oflazer, K. Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 454–464.
- Yılmaz, E.; El-Kahlout, I.D.; Aydın, B.; Özil, Z.S.; Mermer, C. TÜBİTAK Turkish-English submissions for IWSLT 2013. In Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign, Heidelberg, Germany, 5–6 December 2013.
- Bakay, Ö.; Avar, B.; Yildiz, O.T. A tree-based approach for English-to-Turkish translation. Turk. J. Electr. Eng. Comput. Sci. 2019, 27, 437–452.
- Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Bengio, Y. On integrating a language model into neural machine translation. Comput. Speech Lang. 2017, 45, 137–148.
- Sennrich, R.; Haddow, B.; Birch, A. Improving neural machine translation models with monolingual data. arXiv 2015, arXiv:1511.06709.
- Currey, A.; Miceli-Barone, A.V.; Heafield, K. Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, 7–8 September 2017; pp. 148–156.
- Nguyen, T.Q.; Chiang, D. Transfer learning across low-resource, related languages for neural machine translation. arXiv 2017, arXiv:1708.09803.
- Firat, O.; Cho, K.; Sankaran, B.; Vural, F.T.Y.; Bengio, Y. Multi-way, multilingual neural machine translation. Comput. Speech Lang. 2017, 45, 236–252.
- Ataman, D.; Negri, M.; Turchi, M.; Federico, M. Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English. arXiv 2017, arXiv:1707.09879.
- Pan, Y.; Li, X.; Yang, Y.; Dong, R. Dual-Source Transformer Model for Neural Machine Translation with Linguistic Knowledge. Preprints 2020, 2020020273.
- Yıldız, O.T.; Solak, E.; Görgün, O.; Ehsani, R. Constructing a Turkish-English parallel treebank. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 112–117.
- Sel, İ.; Üzen, H.; Hanbay, D. Creating a Parallel Corpora for Turkish-English Academic Translations. Comput. Sci. 2021, 335–340.
- Soares, F.; Yamashita, G.H.; Anzanello, M.J. A parallel corpus of theses and dissertations abstracts. In Proceedings of the International Conference on Computational Processing of the Portuguese Language, Canela, Brazil, 24–26 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 345–352.
- Varga, D.; Halácsy, P.; Kornai, A.; Nagy, V.; Németh, L.; Trón, V. Parallel corpora for medium density languages. Amst. Stud. Theory Hist. Linguist. Sci. Ser. 4 2007, 292, 247.
- Thompson, B.; Koehn, P. Vecalign: Improved sentence alignment in linear time and space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1342–1348.
- Pavlick, E.; Post, M.; Irvine, A.; Kachaev, D.; Callison-Burch, C. The language demographics of Amazon Mechanical Turk. Trans. Assoc. Comput. Linguist. 2014, 2, 79–92.
- Artetxe, M.; Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610.
- de Santana Correia, A.; Colombini, E.L. Attention, please! A survey of neural attention models in deep learning. Artif. Intell. Rev. 2022, 1–88.
- Yan, R.; Li, J.; Su, X.; Wang, X.; Gao, G. Boosting the Transformer with the BERT Supervision in Low-Resource Machine Translation. Appl. Sci. 2022, 12, 7195.
- Mars, M. From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough. Appl. Sci. 2022, 12, 8805.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676.
- Skorokhodov, I.; Rykachevskiy, A.; Emelyanenko, D.; Slotin, S.; Ponkratov, A. Semi-supervised neural machine translation with language models. In Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018), Boston, MA, USA, 21 March 2018; pp. 37–44.
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909.
- Britz, D.; Goldie, A.; Luong, M.T.; Le, Q. Massive exploration of neural machine translation architectures. arXiv 2017, arXiv:1703.03906.
- Yin, X.; Gromann, D.; Rudolph, S. Neural machine translating from natural language to SPARQL. Future Gener. Comput. Syst. 2021, 117, 510–519.
- Dušek, O.; Novikova, J.; Rieser, V. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Comput. Speech Lang. 2020, 59, 123–156.
- Lavie, A.; Agarwal, A. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; pp. 228–231.
- Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 August 2006; pp. 223–231.
- Behnke, M.; Heafield, K. Losing heads in the lottery: Pruning transformer attention in neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 2664–2674.
- Pan, Y.; Li, X.; Yang, Y.; Dong, R. Morphological word segmentation on agglutinative languages for neural machine translation. arXiv 2020, arXiv:2001.01589.
Turkish in the Literature | Google Translate | In the Literature
---|---|---
evrişimsel sinir ağı girişleri | convolutional neural network entries | convolutional neural network inputs
sözel olmayan yakınlık becerileri | non-verbal intimacy skills | non-verbal immediacy skills
nitel araştırmalarda çeşitleme | diversification in qualitative research | triangulation in qualitative research
sınıf içi öğretmen davranışları | classroom teacher behavior | teacher behavior within classroom
belirsizlik hoşgörüsü seviyesi | level of uncertainty tolerance | ambiguity tolerance level
Fields | Percentage (%) |
---|---|
Education and Training | 7.01 |
Business | 6.04 |
Agriculture | 2.85 |
Economy | 2.63 |
Electrical and Electronics Engineering | 2.37 |
Mechanical Engineering | 2.35 |
History | 2.33 |
Chemistry | 2.20
Religion | 2.19 |
Law | 1.99 |
Data | Sentences |
---|---|
Train | 1,197,300 |
Validation | 10,000 |
Test | 10,000 |
Total | 1,217,300 |
Model | Layers | FFNN | dmodel | dQ, dK, dV
---|---|---|---|---
Conv | 20 | 1024 | - | - |
LSTM | 2 | 1024 | - | - |
Transformer | 6 | 2048 | 512 | 64 |
Transformer Big | 12 | 2048 | 512 | 64 |
FFN | 1 | 1024 | - | - |
FAN | 4 | 2048 | 512 | 64 |
Model | Parameter Size
---|---
Conv | 96,841,408
LSTM | 31,066,208
Transformer (6 layers) | 73,033,074
Transformer Big (12 layers) | 98,278,770
Transformer Big + FFN | 136,303,045
Transformer Big + FAN | 158,523,484