Korean Historical Documents Analysis with Improved Dynamic Word Embedding
Abstract
1. Introduction
- We propose an improved dynamic word embedding (DWE) that uses factorised embedding parameterization to identify how the meanings of words in historical documents change over time. After converting words to dense vectors with the improved DWE, we identified changes in the relationships between countries over time and in the taxation structure from one king's reign to another. The resulting embeddings reflected each period better than the DWE alone.
- We confirmed the effectiveness of the improved DWE by integrating it into a downstream task, named entity recognition (NER). The proposed method improved the F1-score by 3% to 7%. The improved DWE captures how entity-name information and word usage change from king to king, and integrating this information into the NER model enhanced its performance.
- We found that applying the parameters obtained from the NER model integrated with the improved DWE made the translation of historical documents more effective. The proposed method improved the BLEU score by 2% to 8%.
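As a minimal sketch of factorised embedding parameterization, assuming the ALBERT-style decomposition (Lan et al., listed in the references) in which a single vocabulary-by-hidden table is replaced by a small lookup table followed by a projection. The class name, the sizes, and the initialisation below are illustrative assumptions, not values from this article, and the sketch does not show how the factorisation is combined with the time slices of the DWE.

```python
from typing import Tuple

import numpy as np


def embedding_param_counts(vocab_size: int, hidden_size: int, embed_size: int) -> Tuple[int, int]:
    """Return (full, factorised) parameter counts for an embedding layer."""
    full = vocab_size * hidden_size
    factorised = vocab_size * embed_size + embed_size * hidden_size
    return full, factorised


class FactorisedEmbedding:
    """Hypothetical layer: lookup into a V x E table, then project E -> H."""

    def __init__(self, vocab_size: int, hidden_size: int, embed_size: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.lookup = rng.normal(scale=0.02, size=(vocab_size, embed_size))
        self.project = rng.normal(scale=0.02, size=(embed_size, hidden_size))

    def __call__(self, token_ids):
        # Shape: (number of tokens, hidden_size)
        return self.lookup[np.asarray(token_ids)] @ self.project


# Illustrative sizes only (not taken from the article).
full, factorised = embedding_param_counts(vocab_size=30000, hidden_size=768, embed_size=128)
emb = FactorisedEmbedding(vocab_size=1000, hidden_size=64, embed_size=16)
vectors = emb([3, 17, 42])
print(full, factorised, vectors.shape)
```

The saving comes from the vocabulary dimension being multiplied by the small embedding size rather than by the hidden size.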
2. Related Work
3. Methodology
3.1. Dynamic Word Embedding
3.2. Named Entity Recognition & Neural Machine Translation
4. Experiment
4.1. Dataset
4.2. Experimental Setup
4.3. Analysis of Dynamic Word Embedding
4.4. Results of Named Entity Recognition and Neural Machine Translation
4.5. Discussion
5. Conclusions & Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Yang, T.I.; Torget, A.; Mihalcea, R. Topic modeling on historical newspapers. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Portland, OR, USA, 24 June 2011; pp. 96–104. [Google Scholar]
- Zhao, H.; Wu, B.; Wang, H.; Shi, C. Sentiment analysis based on transfer learning for Chinese ancient literature. In Proceedings of the 2014 International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC2014), Shanghai, China, 30 October–1 November 2014; pp. 1–7. [Google Scholar]
- Bak, J.; Oh, A. Five centuries of monarchy in Korea: Mining the text of the annals of the Joseon dynasty. In Proceedings of the SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Beijing, China, 26–31 July 2015. [Google Scholar]
- Bak, J.; Oh, A. Conversational Decision-Making Model for Predicting the King’s Decision in the Annals of the Joseon Dynasty. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
- Storey, G.; Mimno, D. Like Two Pis in a Pod: Author Similarity Across Time in the Ancient Greek Corpus. J. Cult. Anal. 2020, 2371, 4549. [Google Scholar]
- Vellingiriraj, E.; Balamurugan, M.; Balasubramanie, P. Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning. In Proceedings of the 2016 International Conference on Asian Language Processing (IALP), Tainan, Taiwan, 21–23 November 2016; pp. 37–40. [Google Scholar]
- Sousa, T.; Gonçalo Oliveira, H.; Alves, A. Exploring Different Methods for Solving Analogies with Portuguese Word Embeddings. In Proceedings of the 9th Symposium on Languages, Applications and Technologies (SLATE 2020), Barcelos, Portugal, 13–14 July 2020. [Google Scholar]
- Kapočiūtė-Dzikienė, J.; Damaševičius, R. Intrinsic evaluation of Lithuanian word embeddings using WordNet. In Computer Science On-Line Conference; Springer: Berlin/Heidelberg, Germany, 2018; pp. 394–404. [Google Scholar]
- Barzokas, V.; Papagiannopoulou, E.; Tsoumakas, G. Studying the Evolution of Greek Words via Word Embeddings. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence, Athens, Greece, 2–4 September 2020; pp. 118–124. [Google Scholar]
- Jiang, Y.; Liu, Z.; Yang, L. The Dynamic Evolution of Common Address Terms in Chinese Based on Word Embedding. In Proceedings of the Workshop on Chinese Lexical Semantics, Chiayi, Taiwan, 26–28 May 2018; pp. 478–485. [Google Scholar]
- Yoo, C.; Park, M.; Kim, H.J.; Choi, J.; Sin, J.; Jun, C. Classification and evaluation of the documentary-recorded storm events in the Annals of the Choson Dynasty (1392–1910), Korea. J. Hydrol. 2015, 520, 387–396. [Google Scholar] [CrossRef]
- Hayakawa, H.; Iwahashi, K.; Ebihara, Y.; Tamazawa, H.; Shibata, K.; Knipp, D.J.; Kawamura, A.D.; Hattori, K.; Mase, K.; Nakanishi, I.; et al. Long-lasting Extreme Magnetic Storm Activities in 1770 Found in Historical Documents. Astrophys. J. 2017, 850, L31. [Google Scholar] [CrossRef] [Green Version]
- Lee, K.W.; Yang, H.J.; Park, M.G. Orbital elements of comet C/1490 Y1 and the Quadrantid shower. Mon. Not. R. Astron. Soc. 2009, 400, 1389–1393. [Google Scholar] [CrossRef] [Green Version]
- Jeong, H.Y.; Choi, K.H.; Lee, K.S.; Jo, B.M. Studies on conservation of the beeswax-treated Annals of Joseon Dynasty. J. Korea Tech. Assoc. Pulp Pap. Ind. 2012, 44, 70–78. [Google Scholar] [CrossRef]
- Ki, H.C.; Shin, E.K.; Woo, E.J.; Lee, E.; Hong, J.H.; Shin, D.H. Horse-riding accidents and injuries in historical records of Joseon Dynasty, Korea. Int. J. Paleopathol. 2018, 20, 20–25. [Google Scholar] [CrossRef]
- Kang, D.H.; Ko, D.W.; Gavart, M.; Song, J.M.; Cha, W.S. King Hyojong’s diseases and death records-through the Daily Records of Royal Secretariat of Joseon Dynasty Seungjeongwonilgi (承政院日記). J. Korean Med. Class. 2014, 27, 55–72. [Google Scholar] [CrossRef] [Green Version]
- Park, M.; Yoo, C.; Jun, C. Consideration of documentary records in the Annals of the Choson Dynasty for the frequency analysis of rainfall in Seoul, Korea. Meteorol. Appl. 2017, 24, 31–42. [Google Scholar] [CrossRef] [Green Version]
- Kang, K.; Choo, J.; Kim, Y. Whose opinion matters? analyzing relationships between bitcoin prices and user groups in online community. Soc. Sci. Comput. Rev. 2020, 38, 686–702. [Google Scholar] [CrossRef]
- Kim, Y.B.; Kang, K.; Choo, J.; Kang, S.J.; Kim, T.; Im, J.; Kim, J.H.; Kim, C.H. Predicting the currency market in online gaming via lexicon-based analysis on its online forum. Complexity 2017, 2017, 4152705. [Google Scholar] [CrossRef] [Green Version]
- Kim, Y.B.; Lee, J.; Park, N.; Choo, J.; Kim, J.H.; Kim, C.H. When Bitcoin encounters information in an online forum: Using text mining to analyse user opinions and predict value fluctuation. PLoS ONE 2017, 12, e0177630. [Google Scholar] [CrossRef]
- Christensen, K.; Nørskov, S.; Frederiksen, L.; Scholderer, J. In search of new product ideas: Identifying ideas in online communities by machine learning and text mining. Creat. Innov. Manag. 2017, 26, 17–30. [Google Scholar] [CrossRef]
- Chen, W.F.; Ku, L.W. Utcnn: A deep learning model of stance classification on social media text. arXiv 2016, arXiv:1611.03599. [Google Scholar]
- Poncelas, A.; Aboomar, M.; Buts, J.; Hadley, J.; Way, A. A Tool for Facilitating OCR Postediting in Historical Documents. arXiv 2020, arXiv:2004.11471. [Google Scholar]
- Can, Y.S.; Kabadayı, M.E. Automatic CNN-Based Arabic Numeral Spotting and Handwritten Digit Recognition by Using Deep Transfer Learning in Ottoman Population Registers. Appl. Sci. 2020, 10, 5430. [Google Scholar] [CrossRef]
- Chen, K.; Seuret, M.; Liwicki, M.; Hennebert, J.; Ingold, R. Page segmentation of historical document images with convolutional autoencoders. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1011–1015. [Google Scholar]
- Riddell, A.B. How to read 22,198 journal articles: Studying the history of German studies with topic models. In Distant Readings: Topologies of German Culture in the Long Nineteenth Century; Boydell & Brewer: London, UK, 2014; pp. 91–114. [Google Scholar]
- Jeon, J.; Noh, S.J.; Lee, D.H. Relationship between lightning and solar activity for recorded between CE 1392–1877 in Korea. J. Atmos. Sol. Terr. Phys. 2018, 172, 63–68. [Google Scholar] [CrossRef]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3111–3119. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Hamilton, W.L.; Leskovec, J.; Jurafsky, D. Diachronic word embeddings reveal statistical laws of semantic change. arXiv 2016, arXiv:1605.09096. [Google Scholar]
- Bamler, R.; Mandt, S. Dynamic word embeddings. arXiv 2017, arXiv:1702.08359. [Google Scholar]
- Yao, Z.; Sun, Y.; Ding, W.; Rao, N.; Xiong, H. Dynamic word embeddings for evolving semantic discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Los Angeles, CA, USA, 5–9 February 2018; pp. 673–681. [Google Scholar]
- Rudolph, M.; Blei, D. Dynamic Bernoulli embeddings for language evolution. arXiv 2017, arXiv:1703.08052. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Zhang, M.; Zhang, Y.; Che, W.; Liu, T. Character-level chinese dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 1326–1336. [Google Scholar]
- Li, H.; Zhang, Z.; Ju, Y.; Zhao, H. Neural character-level dependency parsing for Chinese. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Kudo, T.; Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
- Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. 2018. Available online: https://www.semanticscholar.org/paper/Fixing-Weight-Decay-Regularization-in-Adam-Loshchilov-Hutter/45dfef0cc1ed96558c1c650432ce39d6a1050b6a#featured-content (accessed on 9 November 2018).
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 h. arXiv 2017, arXiv:1706.02677. [Google Scholar]
- Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1310–1318. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Arnold, B.C.; Castillo, E.; Sarabia, J.M. Conditionally specified distributions: An introduction (with comments and a rejoinder by the authors). Stat. Sci. 2001, 16, 249–274. [Google Scholar]
- Jungshin, L. Koreans' Perception of the Liaodong Region During the Chosŏn Dynasty: Focus on Sejong sillok chiriji (Geographical Treatise in the Annals of King Sejong) and Tongguk yŏji sŭnglam (Augmented survey of the geography of Korea). Int. J. Korean Hist. 2016, 21, 47–85. [Google Scholar]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, 29 June 2005; pp. 65–72. [Google Scholar]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain, 25–26 July 2004. [Google Scholar]
- DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
- Jang, S.; Jin, K.; An, J.; Kim, Y. Regional Patch-Based Feature Interpolation Method for Effective Regularization. IEEE Access 2020, 8, 33658–33665. [Google Scholar] [CrossRef]
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
- Guo, H.; Mao, Y.; Zhang, R. Augmenting data with mixup for sentence classification: An empirical study. arXiv 2019, arXiv:1905.08941. [Google Scholar]
- Marivate, V.; Sefara, T. Improving short text classification through global augmentation methods. In Proceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Dublin, Ireland, 25–28 August 2020; pp. 385–399. [Google Scholar]

King | Lang | Target word: 倭 | Target word: 일본 |
---|---|---|---|
태조 (1st) | (o) | 罒, 儵, 猬, 㛐, 嬿 | 대내전, 유구국, 대마도, 소이전, 귀국 |
태조 (1st) | (e) | fishnet, accident, Wi *, sister-in-law, gorgeous | Daenaejeon, Ryukyu Kingdom *, Tsushima Island *, Soijeon, remigrate |
성종 (9th) | (o) | 罒, 塩, 猬, 儵, 㛐 | 유구국, 중국, 대마도, 명, 요동 |
성종 (9th) | (e) | fishnet, salt, Wi *, accident, sister-in-law | Ryukyu Kingdom *, China, Tsushima Island *, Ming Dynasty, Liaodong * |
선조 (18th) | (o) | 儵, 虜, 煑, 扮, 嗎 | 중국, 요동, 대마도, 귀국, 명 |
선조 (18th) | (e) | accident, capture, sea salt, grasp, scold | China, Liaodong *, Tsushima Island *, remigrate, Ming Dynasty |
숙종 (27th) | (o) | 儵, 虜, 券, 煑, 攎 | 영국, 여진, 총병관, 미국, 몽고 |
숙종 (27th) | (e) | accident, capture, weary, cook, oblige | England, Jurchen, admiral, USA, Mongolia |

King | Lang | Target word: 중국 | Target word: 백성 |
---|---|---|---|
태조 (1st) | (o) | 명, 일본, 본국, 청, 귀국 | 흉년, 왜적, 서울, 변장, 오랑캐 |
태조 (1st) | (e) | Ming Dynasty, Japan, home country, Qing Dynasty, remigrate | lean year, Japanese burglar, Seoul *, military attache, barbarian |
성종 (9th) | (o) | 명, 일본, 청, 귀국, 본국 | 흉년, 민중, 도민, 기전, 굶주려 |
성종 (9th) | (e) | Ming Dynasty, Japan, Qing Dynasty, remigrate, home country | lean year, people, residents, metropolitan area surrounding, starve |
선조 (18th) | (o) | 일본, 명, 귀국, 조선, 사신 | 민중, 가난한, 기전, 굶주려, 도민 |
선조 (18th) | (e) | Japan, Ming Dynasty, remigrate, Joseon, envoy | people, poor, metropolitan area surrounding, starve, residents |
숙종 (27th) | (o) | 청, 일본, 본국, 몽고, 조선 | 흉년, 민생, 토병, 소민, 민간 |
숙종 (27th) | (e) | Qing Dynasty, Japan, home country, Mongolia, Joseon | lean year, public welfare, native troops, plebeian, civil |

King | Lang | Target word: 광산 | Target word: 세금 |
---|---|---|---|
태조 (1st) | (o) | 남양, 배천, 단양, 여산, 풍덕 | 공물, 전결, 부세, 공전, 요역 |
태조 (1st) | (e) | Nam-yang *, Bae-cheon *, Dan-yang *, Yeo-san *, Pung-duck * | tribute, field tax, duty, national land, corvee |
성종 (9th) | (o) | 남양, 단양, 무안, 이천, 배천 | 전결, 공물, 부세, 요역, 잡역 |
성종 (9th) | (e) | Nam-yang *, Dan-yang *, Mu-an *, I-cheon *, Bae-cheon * | field tax, tribute, duty, corvee, chores |
선조 (18th) | (o) | 인천, 배천, 무안, 의령, 용인 | 전결, 잡역, 소출, 공납, 부세 |
선조 (18th) | (e) | In-cheon *, Bae-cheon *, Mu-an *, Ui-ryeong *, Yong-in * | field tax, chores, crops, local products payment, duty |
숙종 (27th) | (o) | 인천, 경성, 용인, 평양, 철산 | 잡역, 잡물, 미곡, 어염, 신역 |
숙종 (27th) | (e) | In-cheon *, Kyung-sung *, Yong-in *, Pyongyang *, Cheolsan * | chores, sundries, rice, fishery tax, physical labor |

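Neighbour lists like those in the tables above can be produced by ranking the vocabulary by cosine similarity to the target word within each period's embedding space. The following is a minimal sketch under that assumption. The two-dimensional vectors, the period labels, and the function name are invented for illustration and are not the article's embeddings; only the English glosses are borrowed from the tables.

```python
import numpy as np


def nearest_neighbours(target, vocab, vectors, k=2):
    """Rank words by cosine similarity to `target`, excluding the target itself."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(target)]
    order = np.argsort(-sims)  # highest similarity first
    return [vocab[i] for i in order if vocab[i] != target][:k]


vocab = ["tax", "tribute", "corvee", "rice"]

# Hypothetical toy vectors, one matrix per period (rows follow `vocab`).
period_vectors = {
    "early": np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]),
    "late": np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8], [0.1, 0.9]]),
}

neighbours = {
    period: nearest_neighbours("tax", vocab, vectors)
    for period, vectors in period_vectors.items()
}
for period, words in neighbours.items():
    print(period, words)
```

In this toy setup the nearest neighbour of "tax" moves from "tribute" to "rice" between the two periods, which is the kind of shift the neighbour tables are meant to expose.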
Method | Precision | Recall | F1-Score |
---|---|---|---|
W2V [28] | 0.581 | 0.637 | 0.607 |
DW2V [32] | 0.627 | 0.640 | 0.633 |
DW2V * [32] | 0.582 | 0.665 | 0.620 |
DBE [33] | 0.629 | 0.639 | 0.633 |
Ours | 0.650 | 0.646 | 0.648 |
Ours * | 0.679 | 0.690 | 0.684 |
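As a quick consistency check on the NER table above, the F1-score can be recomputed from the precision and recall columns, assuming the usual definition of F1 as their harmonic mean. The dictionary below simply transcribes the table. The tolerance is an assumption that allows for three-decimal rounding, and for the possibility that the reported F1 was averaged in a different way.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) transcribed from the table above.
ner_results = {
    "W2V": (0.581, 0.637, 0.607),
    "DW2V": (0.627, 0.640, 0.633),
    "DW2V *": (0.582, 0.665, 0.620),
    "DBE": (0.629, 0.639, 0.633),
    "Ours": (0.650, 0.646, 0.648),
    "Ours *": (0.679, 0.690, 0.684),
}

gaps = {
    method: abs(f1_score(p, r) - reported)
    for method, (p, r, reported) in ner_results.items()
}
max_gap = max(gaps.values())
for method, gap in gaps.items():
    print(f"{method}: recomputed F1 differs from the table by {gap:.4f}")
```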
Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGE-L |
---|---|---|---|---|---|---|
RNN Seq2seq (GRU) [47] | 0.450228 | 0.350884 | 0.281607 | 0.250351 | 0.298210 | 0.428376 |
Transformer (from scratch) [35] | 0.547702 | 0.438370 | 0.359203 | 0.314636 | 0.372762 | 0.585471 |
Ours | 0.542257 | 0.447176 | 0.379800 | 0.337939 | 0.394580 | 0.656273 |
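One way to relate the translation table above to the 2% to 8% BLEU improvement stated in the introduction is to compute the relative change of each metric against the Transformer baseline. That reading is an assumption made for this sketch; the article excerpt does not state which baseline or which kind of difference the percentage refers to. The values are transcribed from the table.

```python
METRICS = ["BLEU1", "BLEU2", "BLEU3", "BLEU4", "METEOR", "ROUGE-L"]
transformer = [0.547702, 0.438370, 0.359203, 0.314636, 0.372762, 0.585471]
ours = [0.542257, 0.447176, 0.379800, 0.337939, 0.394580, 0.656273]


def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` with respect to `base`."""
    return (new - base) / base


changes = {
    metric: relative_change(o, t)
    for metric, o, t in zip(METRICS, ours, transformer)
}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1%}")
```

Under this reading BLEU2 through BLEU4 improve by roughly 2% to 7%, while BLEU1 is marginally lower than the Transformer baseline.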
Original | 平安道 江界 渭原 寧邊 龜城等地雨雹. |
Predicted | 평안도의 강계·위원·영변·귀성 등 지방에 우박이 내렸다. |
Predicted (Eng.) | Hail fell in Gang-gye, Wi-won, Yong-byon, Gui-seong, and other areas of Pyeong-an Province. |
Original | 太白晝見. |
Predicted | 태백성이 낮에 나타났다. |
Predicted (Eng.) | Venus appeared during the day. |
Original | 臺諫啓李希雍等事, 不允. |
Predicted | 대간이 이희옹 등의 일을 아뢰었으나 윤허하지 않았다. |
Predicted (Eng.) | The Daegan reported on the matter of Lee Hee-ong and others, but the king did not grant approval. |
Original | 壬辰/詣宗廟景慕宮展拜, 王世子隨詣行禮. |
Predicted | 종묘와 경모궁에 나아가 전배하였는데, 왕세자가 따라가 예를 거행하였다. |
Predicted (Eng.) | The king went to Jongmyo and Gyeongmogung and paid his respects, and the crown prince followed and performed the rites. |
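Predictions such as the ones above are scored against reference translations with BLEU (Papineni et al., listed in the references). The following is a minimal single-reference, sentence-level sketch of the metric: clipped n-gram precision combined with a brevity penalty. It assumes that the BLEU1 to BLEU4 columns correspond to a maximum n-gram order of 1 to 4. The tokenisation by whitespace and the function names are choices made for this sketch, not the evaluation script used in the article.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Geometric mean of the clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Classic illustration of clipping: repeating a frequent word is not rewarded.
reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
p1 = modified_precision(candidate, reference, 1)
perfect = bleu(reference, reference, max_n=4)
print(p1, perfect)
```

Production evaluations normally use corpus-level statistics and smoothing, which this sketch leaves out.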
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Jin, K.; Wi, J.; Kang, K.; Kim, Y. Korean Historical Documents Analysis with Improved Dynamic Word Embedding. Appl. Sci. 2020, 10, 7939. https://doi.org/10.3390/app10217939