Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German
Abstract
1. Introduction
2. Data Description
2.1. Data Selection
2.2. Data Structure
3. Evaluation of Fold Distribution
4. Case Study: Vocabulary Growth
- no cleaning at all;
- exclusion of punctuation, names, and start and end symbols (all identified via their respective POS tags), as well as URLs and wordforms consisting only of numbers (both identified by regular expressions);
- exclusion of wordforms containing numbers;
- exclusion of wordforms that contain upper-case letters following lower-case letters. In this cleaning stage, we exclude wordforms with non-conventional capitalization (e.g., dEr) while keeping capitalized abbreviations (e.g., NATO). We apply this stage to the raw version only, because the lowered version contains no upper-case letters and would therefore be unaffected;
- exclusion of wordforms where the TreeTagger could not assign a lemma;
- selection of wordforms that are themselves, or whose associated lemma is, on a basic lemma list (BLL) of New High German standard language, in order to identify a set of conventionalized wordforms [21]. This means that, for example, the inflected wordform Weihnachtsmannes is included: it is not on the BLL itself, but its associated lemma Weihnachtsmann is. Another example is the wordform u., which is shorthand for the lemma und. For more information regarding this basic lemma list, please refer to Koplenig et al. [11].
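The cleaning stages listed above can be read as a chain of filters over (wordform, lemma, POS tag, frequency) records. The following is a minimal sketch, assuming that the stages are applied cumulatively in the order given. The names `Record`, `keeps`, and `clean`, the boundary tags `<s>` and `</s>`, the `<unknown>` lemma marker, the regular expressions, and the toy records and toy BLL are all illustrative assumptions and not the dataset's actual encoding or the authors' actual scripts.

```python
import re
from typing import NamedTuple

class Record(NamedTuple):
    wordform: str
    lemma: str
    pos: str
    freq: int

# Assumed tag set: "$,", "$.", "$(" (STTS punctuation), "NE" (STTS proper noun),
# and "<s>", "</s>" as placeholder tags for start and end symbols.
EXCLUDED_TAGS = {"$,", "$.", "$(", "NE", "<s>", "</s>"}
UNKNOWN_LEMMA = "<unknown>"  # assumed marker for "no lemma could be assigned"

# Illustrative patterns; the expressions used for the dataset are not given here.
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
ONLY_DIGITS_RE = re.compile(r"\d+")
HAS_DIGIT_RE = re.compile(r"\d")
# Assumption: "following" means anywhere after a lower-case letter.
LOWER_THEN_UPPER_RE = re.compile(r"[a-zäöüß].*[A-ZÄÖÜ]")

def keeps(record, stage, bll):
    """Return True if the record passes the filter of one single stage (0-5)."""
    w = record.wordform
    if stage == 0:   # no cleaning at all
        return True
    if stage == 1:   # punctuation, names, boundaries, URLs, pure numbers
        return not (record.pos in EXCLUDED_TAGS
                    or URL_RE.search(w)
                    or ONLY_DIGITS_RE.fullmatch(w))
    if stage == 2:   # wordforms containing numbers
        return not HAS_DIGIT_RE.search(w)
    if stage == 3:   # upper-case letters following lower-case letters
        return not LOWER_THEN_UPPER_RE.search(w)
    if stage == 4:   # no lemma assigned
        return record.lemma != UNKNOWN_LEMMA
    if stage == 5:   # wordform or its lemma must be on the BLL
        return w in bll or record.lemma in bll
    raise ValueError(f"unknown stage: {stage}")

def clean(records, stage, bll):
    """Apply stages 0..stage cumulatively and return the surviving records."""
    return [r for r in records
            if all(keeps(r, s, bll) for s in range(stage + 1))]

# Toy data (invented for illustration only).
toy_bll = {"die", "und", "Haus", "Weihnachtsmann"}
toy_records = [
    Record("der", "die", "ART", 100),
    Record(",", ",", "$,", 90),
    Record("Berlin", "Berlin", "NE", 50),
    Record("www.example.org", "<unknown>", "NN", 3),
    Record("2023", "2023", "CARD", 20),
    Record("A4", "A4", "NN", 5),
    Record("dEr", "die", "ART", 1),
    Record("NATO", "NATO", "NN", 8),
    Record("Blubbwort", "<unknown>", "NN", 1),
    Record("Weihnachtsmannes", "Weihnachtsmann", "NN", 2),
    Record("u.", "und", "KON", 4),
    Record("Haus", "Haus", "NN", 30),
]

# Number of wordform types that survive each cumulative stage.
types_per_stage = [len(clean(toy_records, s, toy_bll)) for s in range(6)]
print(types_per_stage)
```

In this toy example, dEr is dropped at the capitalization stage while NATO survives it, and Weihnachtsmannes and u. survive the final stage only because their lemmas are on the toy BLL.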
4.1. Number of Wordform Types
4.2. Number of Hapax Legomena
5. Summary
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Kupietz, M.; Lüngen, H.; Kamocki, P.; Witt, A. The German Reference Corpus DeReKo: New Developments—New Opportunities. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7 May 2018; Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., et al., Eds.; European Language Resources Association (ELRA): Miyazaki, Japan, 2018.
2. Marsden, E. Open Science and Transparency in Applied Linguistics Research. In The Encyclopedia of Applied Linguistics; Chapelle, C.A., Ed.; Wiley: Hoboken, NJ, USA, 2019; pp. 1–10. ISBN 978-1-4051-9473-0.
3. Michel, J.-B.; Shen, Y.K.; Aiden, A.P.; Verses, A.; Gray, M.K.; The Google Books Team; Pickett, J.P.; Hoiberg, D.; Clancy, D.; Norvig, P.; et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 2011, 331, 176–182.
4. Pechenick, E.A.; Danforth, C.M.; Dodds, P.S. Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE 2015, 10, e0137041.
5. Schmidt, B.; Piantadosi, S.T.; Mahowald, K. Uncontrolled Corpus Composition Drives an Apparent Surge in Cognitive Distortions. Proc. Natl. Acad. Sci. USA 2021, 118, e2115010118.
6. Jurafsky, D.; Martin, J.H. Speech and Language Processing, 3rd ed.; 2023; Available online: https://web.stanford.edu/~jurafsky/slp3/ (accessed on 7 November 2023).
7. Frisson, S.; Rayner, K.; Pickering, M.J. Effects of Contextual Predictability and Transitional Probability on Eye Movements During Reading. J. Exp. Psychol. Learn. Mem. Cogn. 2005, 31, 862–877.
8. Kliegl, R.; Grabner, E.; Rolfs, M.; Engbert, R. Length, Frequency, and Predictability Effects of Words on Eye Movements in Reading. Eur. J. Cogn. Psychol. 2004, 16, 262–284.
9. Hauk, O.; Pulvermüller, F. Effects of Word Length and Frequency on the Human Event-Related Potential. Clin. Neurophysiol. 2004, 115, 1090–1103.
10. Hendrix, P.; Bolger, P.; Baayen, H. Distinct ERP Signatures of Word Frequency, Phrase Frequency, and Prototypicality in Speech Production. J. Exp. Psychol. Learn. Mem. Cogn. 2017, 43, 128–149.
11. Koplenig, A.; Kupietz, M.; Wolfer, S. Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus. Cogn. Sci. 2022, 46, e13090.
12. Diewald, N.; Kupietz, M.; Lüngen, H. Tokenizing on Scale. Preprocessing Large Text Corpora on the Lexical and Sentence Level. In Dictionaries and Society, Proceedings of the XX EURALEX International Congress, Mannheim, Germany, 12–16 July 2022; Klosa-Kückelhaus, A., Engelberg, S., Möhrs, C., Storjohann, P., Eds.; IDS-Verlag: Mannheim, Germany, 2022; pp. 208–221.
13. Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 6–8 July 1994.
14. Brants, T.; Popat, A.C.; Xu, P.; Och, F.J.; Dean, J. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; Association for Computational Linguistics: Stroudsburg, PA, USA, 2007; pp. 858–867.
15. Aumasson, J.-P.; Meier, W.; Phan, R.C.-W.; Henzen, L. BLAKE2. In The Hash Function BLAKE; Springer: Berlin/Heidelberg, Germany, 2014; pp. 165–183.
16. Schiller, A.; Teufel, S.; Stöckert, C.; Thielen, C. Guidelines für das Tagging Deutscher Textcorpora mit STTS; Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart: Stuttgart, Germany, 1999.
17. Mandelbrot, B. An Informational Theory of the Statistical Structure of Language. In Communication Theory; Jackson, W., Ed.; Butterworths Scientific Publications: London, UK, 1953; pp. 468–502.
18. Zipf, G.K. The Psycho-Biology of Language; Houghton, Mifflin: Oxford, UK, 1935.
19. Evert, S.; Baroni, M. zipfR: Word Frequency Distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, Prague, Czech Republic, 25–27 June 2007; pp. 29–32.
20. Baayen, R.H. The Effects of Lexical Specialization on the Growth Curve of the Vocabulary. Comput. Linguist. 1996, 22, 455–480.
21. Stadler, H. Die Erstellung der Basislemmaliste der Neuhochdeutschen Standardsprache aus Mehrfach Linguistisch Annotierten Korpora; Blühdorn, H., Elstermann, M., Klosa, A., Eds.; Institut für Deutsche Sprache: Mannheim, Germany, 2014.
22. Brysbaert, M.; Stevens, M.; Mandera, P.; Keuleers, E. How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age. Front. Psychol. 2016, 7, 1116.
23. Herdan, G. Quantitative Linguistics; Butterworths: London, UK, 1964.
24. Heaps, H.S. Information Retrieval, Computational and Theoretical Aspects; Library and Information Science; Academic Press: New York, NY, USA, 1978; ISBN 978-0-12-335750-2.
25. Miller, D.; Biber, D. Evaluating Reliability in Quantitative Vocabulary Studies: The Influence of Corpus Design and Composition. Int. J. Corpus Linguist. 2015, 20, 30–53.
26. Baayen, R.H. Productivity in Language Production. Lang. Cogn. Process. 1994, 9, 447–469.
Form 1 | Lem. 1 | POS 1 | Form 2 | Lem. 2 | POS 2 | Freq. | Form 1 (Clear) | Lem. 1 (Clear) | Form 2 (Clear) | Lem. 2 (Clear) |
---|---|---|---|---|---|---|---|---|---|---|
9 | 12 | APPR | 2 | 0 | ART | 1,775,493 | mit | mit | der | die |
1 | 2 | $, | 6 | 4 | APPR | 1,762,655 | , | , | in | in |
15 | 17 | APPR | 7 | 0 | ART | 1,761,184 | für | für | den | die |
1 | 2 | $, | 50 | 39 | KOUS | 1,760,171 | , | , | wie | wie |
1 | 2 | $, | 58 | 41 | ADV | 1,721,500 | , | , | so | so |
Measure | Δ 1-Gram (Raw) | Δ 1-Gram (Low) | CV 1-Gram (Raw) | CV 1-Gram (Low) | Δ 2-Gram (Raw Only) | CV 2-Gram (Raw Only) |
---|---|---|---|---|---|---|
Number of wordforms/2-grams | 0.22 | 0.23 | 0.05 | 0.06 | 0.21 | 0.05 |
Number of hapax legomena | 0.26 | 0.29 | 0.07 | 0.07 | 0.23 | 0.06 |
Number of different wordforms tagged as nouns | 0.20 | 0.22 | 0.05 | 0.06 | - | - |
Number of different wordforms tagged as function words | 1.85 | 1.04 | 0.5 | 0.3 | - | - |
NE + NN tokens/fin. verb tokens | 0.10 | 0.10 | 0.02 | 0.02 | - | - |
Entropy | 0.01 | 0.01 | <0.01 | <0.01 | 0.01 | <0.01 |
Type-token ratio | 0.17 | 0.17 | 0.05 | 0.05 | 0.13 | 0.03 |
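Several of the measures in the table above can be derived directly from a per-fold frequency list. The following is a minimal sketch, assuming that each fold is available as a mapping from wordform to frequency, that entropy is the Shannon entropy of the unigram distribution in bits, and that CV is the sample standard deviation divided by the mean across folds. The function names and the two toy folds are hypothetical, and the scaling of the values reported in the table (for example, whether they are percentages) is not restated here, so the sketch returns plain ratios.

```python
import math
from statistics import mean, stdev

def fold_measures(freqs):
    """Compute basic vocabulary measures for one fold.

    `freqs` maps each wordform to its token frequency in the fold.
    """
    tokens = sum(freqs.values())
    types = len(freqs)
    hapax = sum(1 for f in freqs.values() if f == 1)
    # Shannon entropy in bits (assumed definition).
    entropy = -sum((f / tokens) * math.log2(f / tokens)
                   for f in freqs.values())
    return {
        "types": types,
        "hapax": hapax,
        "entropy": entropy,
        "ttr": types / tokens,  # type-token ratio
    }

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return stdev(values) / mean(values)

# Two toy folds (invented for illustration only).
fold_a = {"der": 4, "die": 3, "Haus": 1}
fold_b = {"der": 4, "die": 2, "Haus": 1, "Baum": 1}

per_fold = [fold_measures(f) for f in (fold_a, fold_b)]
cv_types = coefficient_of_variation([m["types"] for m in per_fold])
cv_ttr = coefficient_of_variation([m["ttr"] for m in per_fold])
print(per_fold)
print(cv_types, cv_ttr)
```

A small CV for a measure indicates that the folds are close to interchangeable with respect to that measure, which is how the low values in the table can be read.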
First n Wordforms | Spearman’s ρ_min |
---|---|
100 | 0.9999 |
1000 | 0.9999 |
10,000 | 0.9997 |
100,000 | 0.9975 |
1,000,000 | 0.9660 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wolfer, S.; Koplenig, A.; Kupietz, M.; Müller-Spitzer, C. Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German. Data 2023, 8, 170. https://doi.org/10.3390/data8110170