esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish
Abstract
1. Introduction
- It is cleaner and more thoroughly deduplicated than state-of-the-art corpora.
- It maintains both document and paragraph boundaries, allowing language models to process text much as humans read it and enabling Natural Language Generation systems to learn paragraph-level representations.
- The downloaded data preserve the traceability of each document's origin. This makes it possible to honour the right of withdrawal of individual website owners, or of individuals whose GDPR-protected data are cited on websites, and to systematically exclude blacklisted websites.
- It is a high-quality multilingual corpus that excels in content cleaning and deduplication. For some languages, most notably Spanish, it is the largest web corpus of this quality available for the development of large language models.
2. State-of-the-Art
3. esCorpius-m at a Glance
- Unlike most other corpora (with the exception of ParaCrawl [45]), our content is extracted directly from WARC files. This method is notably more reliable than sourcing from the frequently error-prone WET files.
- Language identification is a two-step process: the computationally cheaper cld2 is applied first, and candidate paragraphs are then confirmed with fastText, which is recognized for its superior quality (see https://modelpredict.com/language-identification-survey, accessed on 15 September 2023). Other corpora employ only part of this pipeline, missing the combined strengths of the two detectors.
- Main content identification uses a modified version of JusText, a high-quality boilerplate-removal tool (see the comparison at https://trafilatura.readthedocs.io/en/latest/evaluation.html, accessed on 15 September 2023), whereas the other corpora rely solely on the coarser WET heuristics.
- The deduplication process we implement is both robust and precise. Compared with ROOTS [39], we deduplicate not only at the document level but also at the paragraph level, carrying out both exact and soft deduplication.
- Our system provides enhanced traceability, going beyond the URL tracking found in other corpora: it also records the WARC segment location, allowing users to trace textual segments back to their origin in Common Crawl and to identify any undesired content.
4. Data Download and Cleaning Process
4.1. Common Crawl Repository
4.2. WARC Files and Archiving Standard
4.3. Common Crawl Subcorpus Selection and Cleaning
- Download a WARC from Common Crawl.
- Open a Gzip file reader.
- While reading the Gzip file, partially parse the WARC format.
- Parse the webpage and fix the encoding to UTF-8.
- Obtain the language used in the document (see Section 4.4). Proceed if the language is not English.
- Extract the valid textual content (main content identification; see Section 4.5).
- Store the record in the format described in Section 4.6.
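The steps above can be prototyped with off-the-shelf tooling. The following minimal sketch streams a single WARC from Common Crawl and iterates over its response records; it uses the warcio package rather than hand-parsing the gzip/WARC structure, and the WARC key is a placeholder that follows the naming scheme of Section 4.6, not a real segment.

```python
import requests
from warcio.archiveiterator import ArchiveIterator

# Placeholder WARC key following the Common Crawl naming scheme (not a real segment).
WARC_KEY = "crawl-data/CC-MAIN-<YYYY>-<MM>/segments/<id>/warc/CC-MAIN-<id>.warc.gz"

with requests.get("https://data.commoncrawl.org/" + WARC_KEY, stream=True) as resp:
    # ArchiveIterator transparently decompresses the gzip stream and yields WARC records.
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        page_url = record.rec_headers.get_header("WARC-Target-URI")
        html_bytes = record.content_stream().read()
        # Encoding normalisation, language detection (Section 4.4), main content
        # extraction (Section 4.5) and storage (Section 4.6) would follow here.
        print(page_url, len(html_bytes))
```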
4.4. Language Detection Process
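As a rough illustration of the two-step check described in Section 3, the sketch below runs the cheap cld2 pass first and confirms candidate paragraphs with fastText. It assumes the pycld2 and fasttext Python packages and a local copy of the lid.176.bin identification model; the agreement rule and the 0.5 confidence threshold are illustrative assumptions, not the criteria used to build the corpus.

```python
import pycld2
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # fastText language identification model

def detect_language(paragraph: str):
    """Cheap cld2 pass first; confirm promising paragraphs with fastText."""
    is_reliable, _, details = pycld2.detect(paragraph)
    if not is_reliable:
        return None
    cld2_code = details[0][1]  # ISO code of the top cld2 guess, e.g. "es"
    labels, probs = lid_model.predict(paragraph.replace("\n", " "))
    ft_code = labels[0].replace("__label__", "")
    # Keep the paragraph only when both detectors agree with enough confidence
    # (illustrative rule; the actual pipeline criteria may differ).
    if ft_code == cld2_code and probs[0] > 0.5:
        return ft_code
    return None

print(detect_language("Este es un párrafo de ejemplo escrito en español."))
```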
4.5. Main Content Identification
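The corpus pipeline relies on a modified version of JusText (Section 3). For orientation, the stock justext library can be used as follows; the URL is a placeholder and the stoplist choice is only an example.

```python
import requests
import justext

# Fetch a page (placeholder URL) and keep only paragraphs classified as main content.
html = requests.get("https://example.com/articulo").content
paragraphs = justext.justext(html, justext.get_stoplist("Spanish"))
main_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(main_text)
```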
4.6. Output Storage Format
- id: UUIDv4 document identifier (the complete specification of UUIDv4 can be found at https://www.ietf.org/rfc/rfc4122.txt, accessed on 15 September 2023), unique over the whole corpus.
- text: textual content of the paragraph.
- url_warc: identifier of the WARC file containing the web page from which the text was extracted, following the Common Crawl segment nomenclature (“s3://commoncrawl/crawl-data/CC-MAIN-<YYYY>-<MM>/segments/<id>/warc/CC-MAIN-<id>.warc.gz”, where YYYY is the four-digit year and MM the WARC archive month).
- url: URL address from which the text has been extracted.
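Assuming the records are serialized as JSON objects, an entry with the fields above might look like the following; every value here is illustrative, and the segment identifiers in the WARC path are left as placeholders.

```python
import json
import uuid

# Illustrative output record; all values are made up and the WARC path keeps
# the <YYYY>/<MM>/<id> placeholders of the Common Crawl naming scheme.
record = {
    "id": str(uuid.uuid4()),
    "text": "Texto del párrafo extraído de la página web.",
    "url_warc": "s3://commoncrawl/crawl-data/CC-MAIN-<YYYY>-<MM>/segments/<id>/warc/CC-MAIN-<id>.warc.gz",
    "url": "https://example.com/noticia",
}
print(json.dumps(record, ensure_ascii=False))
```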
4.7. Deduplication
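To make the distinction between exact and soft deduplication concrete, the sketch below shows the exact stage at paragraph level: paragraphs are normalised (one possible normalisation, not necessarily the one used for esCorpius-m), hashed, and only the first occurrence is kept. A sketch of soft (near-duplicate) detection is given in the deduplication infrastructure subsection of Section 5.

```python
import hashlib

def exact_paragraph_dedup(paragraphs):
    """Keep only the first occurrence of each normalised paragraph (exact deduplication)."""
    seen, unique = set(), []
    for p in paragraphs:
        # Normalise whitespace and case before hashing (illustrative normalisation).
        key = hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

print(exact_paragraph_dedup(["Hola  mundo.", "hola mundo.", "Otro párrafo distinto."]))
```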
5. Technical Environment Used for Corpus Generation
- Hadoop Distributed File System (HDFS), a distributed and scalable file system built on a subset of the cluster nodes (the core nodes), was used as an auxiliary repository for the tasks executed in EMR.
- Amazon Simple Storage Service (S3), accessed through EMRFS, an EMR implementation of the Hadoop file system interface that reads and writes directly to S3 object storage.
- Apache Spark, an open-source engine, was used to parallelise the data-cleaning tasks; the parallel WARC processor is based on PySpark (a minimal sketch is given after this list).
- Apache YARN was used as the resource manager for scheduling and monitoring the execution of tasks in the EMR cluster.
- Ganglia was used to monitor cluster status. Monitoring is especially important in the early stages, for visualising node load and choosing the number and type of nodes in the cluster.
- M5 instances (https://aws.amazon.com/ec2/instance-types/m5/, accessed on 15 September 2023): general-purpose instances powered by Intel Xeon Platinum 8000 series processors running at up to 3.1 GHz. They provide network bandwidth ranging from 10 to 25 Gbps depending on the selected size.
- R5 instances (https://aws.amazon.com/ec2/instance-types/r5/, accessed on 15 September 2023): memory-optimized instances, with the same type of processors as the M5 instances but with a vCPU:RAM ratio of 1:8, allowing more memory-demanding tasks to be executed.
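Since the parallel WARC processor is based on PySpark, its overall shape can be sketched as below. The S3 bucket names, manifest file, and output prefix are hypothetical, and process_warc stands in for the per-WARC steps of Section 4.3.

```python
import json

from pyspark.sql import SparkSession

# Distribute per-WARC processing over the EMR/Spark cluster (minimal sketch;
# bucket names and paths are hypothetical).
spark = SparkSession.builder.appName("warc-cleaning").getOrCreate()
sc = spark.sparkContext

warc_paths = sc.textFile("s3://example-bucket/warc-manifest.txt")

def process_warc(path):
    """Download, parse, language-filter and clean one WARC; return output records."""
    records = []
    # ... the per-WARC steps of Section 4.3 go here ...
    return records

(warc_paths
    .flatMap(process_warc)
    .map(json.dumps)
    .saveAsTextFile("s3://example-bucket/escorpius-output/"))
```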
Textual Content Deduplication Infrastructure
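One common way to implement the soft (near-duplicate) part of the deduplication is MinHash signatures combined with locality-sensitive hashing; the sketch below uses the datasketch package. The shingle size, permutation count and Jaccard threshold are illustrative assumptions, not the settings used to build esCorpius-m.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over 3-word shingles of the paragraph."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - 2):
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is illustrative
kept = []
for idx, paragraph in enumerate(["un párrafo de ejemplo", "otro párrafo distinto"]):
    sig = minhash(paragraph)
    if not lsh.query(sig):        # no near-duplicate stored yet
        lsh.insert(str(idx), sig)
        kept.append(paragraph)
print(kept)
```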
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
AI | Artificial Intelligence |
BERT | Bidirectional Encoder Representations from Transformers |
DL | Deep Learning |
GDPR | General Data Protection Regulation |
LLM | Large Language Model |
NLP | Natural Language Processing |
WARC | Web ARChive |
References
- Gutiérrez-Fandiño, A.; Pérez-Fernández, D.; Armengol-Estapé, J.; Griol, D.; Callejas, Z. esCorpius: A Massive Spanish Crawling Corpus. In Proceedings of the IberSPEECH 2022 Conference, Granada, Spain, 14–16 November 2022; pp. 126–130. [Google Scholar] [CrossRef]
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2022, arXiv:2108.07258. [Google Scholar]
- Khan, W.; Daud, A.; Khan, K.; Muhammad, S.; Haq, R. Exploring the frontiers of deep learning and natural language processing: A comprehensive overview of key challenges and emerging trends. Nat. Lang. Process. J. 2023, 4, 100026. [Google Scholar] [CrossRef]
- OECD. AI Language Models: Technological, Socio-Economic and Policy Considerations; OECD Publishing: Paris, France, 2023; pp. 20–28. [Google Scholar] [CrossRef]
- Rafiepour, M.; Sartakhti, J.S. CTRAN: CNN-Transformer-based network for natural language understanding. Eng. Appl. Artif. Intell. 2023, 126, 107013. [Google Scholar] [CrossRef]
- Li, B.; Weng, Y.; Xia, F.; Deng, H. Towards better Chinese-centric neural machine translation for low-resource languages. Comput. Speech Lang. 2023, 84, 101566. [Google Scholar] [CrossRef]
- Li, R.; Liu, C.; Jiang, D. Efficient dynamic feature adaptation for cross language sentiment analysis with biased adversarial training. Knowl.-Based Syst. 2023, 279, 110957. [Google Scholar] [CrossRef]
- Park, J.; Cho, S. Incorporation of company-related factual knowledge into pre-trained language models for stock-related spam tweet filtering. Expert Syst. Appl. 2023, 234, 121021. [Google Scholar] [CrossRef]
- López Espejel, J.; Yahaya Alassan, M.S.; Chouham, E.M.; Dahhane, W.; Ettifouri, E.H. A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text. Nat. Lang. Process. J. 2023, 3, 100013. [Google Scholar] [CrossRef]
- Goswamy, T.; Singh, I.; Barkati, A.; Modi, A. Adapting a Language Model for Controlled Affective Text Generation. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 2787–2801. [Google Scholar]
- Abro, W.A.; Aicher, A.; Rach, N.; Ultes, S.; Minker, W.; Qi, G. Natural language understanding for argumentative dialogue systems in the opinion building domain. Knowl.-Based Syst. 2022, 242, 108318. [Google Scholar] [CrossRef]
- McTear, M. Conversational AI. Dialogue Systems, Conversational Agents, and Chatbots; Morgan and Claypool Publishers: San Rafael, CA, USA, 2020. [Google Scholar] [CrossRef]
- Abdelfattah Saleh, A.; Weigang, L. TxLASM: A novel language agnostic summarization model for text documents. Expert Syst. Appl. 2024, 237, 121433. [Google Scholar] [CrossRef]
- Xie, Q.; Bishop, J.A.; Tiwari, P.; Ananiadou, S. Pre-trained language models with domain knowledge for biomedical extractive summarization. Knowl.-Based Syst. 2022, 252, 109460. [Google Scholar] [CrossRef]
- Bansal, S.; Gowda, K.; Kumar, N. Multilingual personalized hashtag recommendation for low resource Indic languages using graph-based deep neural network. Expert Syst. Appl. 2024, 236, 121188. [Google Scholar] [CrossRef]
- Franco, M.; Gaggi, O.; Palazzi, C.E. Analyzing the Use of Large Language Models for Content Moderation with ChatGPT Examples. In Proceedings of the 3rd International Workshop on Open Challenges in Online Social Networks (OASIS’23), Rome, Italy, 4–8 September 2023. [Google Scholar] [CrossRef]
- Habernal, I.; Konopík, M. SWSNL: Semantic Web Search Using Natural Language. Expert Syst. Appl. 2013, 40, 3649–3664. [Google Scholar] [CrossRef]
- Hao, S.; Tan, B.; Tang, K.; Ni, B.; Shao, X.; Zhang, H.; Xing, E.; Hu, Z. BertNet: Harvesting Knowledge Graphs with Arbitrary Relations from Pretrained Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 5000–5015. [Google Scholar]
- Wang, C.; Liu, X.; Song, D. Language Models are Open Knowledge Graphs. arXiv 2020, arXiv:2010.11967. [Google Scholar]
- Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Arnau-González, P.; Arevalillo-Herráez, M.; Luise, R.A.D.; Arnau, D. A methodological approach to enable natural language interaction in an Intelligent Tutoring System. Comput. Speech Lang. 2023, 81, 101516. [Google Scholar] [CrossRef]
- Xiao, D.; Meyers, P.; Upperman, J.S.; Robinson, J.R. Revolutionizing Healthcare with ChatGPT: An Early Exploration of an AI Language Model’s Impact on Medicine at Large and its Role in Pediatric Surgery. J. Pediatr. Surg. 2023, 58, 2410–2415. [Google Scholar] [CrossRef] [PubMed]
- Sukanya, G.; Priyadarshini, J. Modified Hierarchical-Attention Network model for legal judgment predictions. Data Knowl. Eng. 2023, 147, 102203. [Google Scholar] [CrossRef]
- Peña, A.; Morales, A.; Fierrez, J.; Serna, I.; Ortega-Garcia, J.; Puente, Í.; Córdova, J.; Córdova, G. Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs. In Proceedings of the Document Analysis and Recognition Conference—ICDAR 2023 Workshops, San Jose, CA, USA, 21–26 August 2023; pp. 20–33. [Google Scholar] [CrossRef]
- Jansen, B.J.; Gyo Jung, S.; Salminen, J. Employing large language models in survey research. Nat. Lang. Process. J. 2023, 4, 100020. [Google Scholar] [CrossRef]
- Suzuki, M.; Sakaji, H.; Hirano, M.; Izumi, K. Constructing and analyzing domain-specific language model for financial text mining. Inf. Process. Manag. 2023, 60, 103194. [Google Scholar] [CrossRef]
- Liu, S.; Peng, C.; Wang, C.; Chen, X.; Song, S. icsBERTs: Optimizing Pre-trained Language Models in Intelligent Customer Service. In Proceedings of the International Neural Network Society Workshop on Deep Learning Innovations and Applications (INNS DLIA’23), Gold Coast, Australia, 23 June 2023; Volume 222, pp. 127–136. [Google Scholar] [CrossRef]
- Kaddour, J.; Harris, J.; Mozes, M.; Bradley, H.; Raileanu, R.; McHardy, R. Challenges and Applications of Large Language Models. arXiv 2023, arXiv:2307.10169. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv 2022, arXiv:2112.11446. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Otter, D.W.; Medina, J.R.; Kalita, J.K. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624. [Google Scholar] [CrossRef] [PubMed]
- Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. (CSUR) 2021, 54, 115. [Google Scholar] [CrossRef]
- Wu, S.; Roberts, K.; Datta, S.; Du, J.; Ji, Z.; Si, Y.; Soni, S.; Wang, Q.; Wei, Q.; Xiang, Y.; et al. Deep learning in clinical natural language processing: A methodical review. J. Am. Med. Inform. Assoc. 2020, 27, 457–470. [Google Scholar] [CrossRef] [PubMed]
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual, 6–11 June 2021; pp. 483–498. [Google Scholar] [CrossRef]
- Sarti, G.; Nissim, M. IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation. arXiv 2022, arXiv:2203.03759. [Google Scholar]
- Scao, T.L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, A.; Gallé, M.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv 2023, arXiv:cs.CL/2211.05100. [Google Scholar]
- Laurençon, H.; Saulnier, L.; Wang, T.; Akiki, C.; del Moral, A.V.; Le Scao, T.; Von Werra, L.; Mou, C.; González Ponferrada, E.; Nguyen, H.; et al. The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. arXiv 2023, arXiv:2303.03915. [Google Scholar]
- Kreutzer, J.; Caswell, I.; Wang, L.; Wahab, A.; van Esch, D.; Ulzii-Orshikh, N.; Tapo, A.; Subramani, N.; Sokolov, A.; Sikasote, C.; et al. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Trans. Assoc. Comput. Linguist. 2022, 10, 50–72. [Google Scholar] [CrossRef]
- El-Kishky, A.; Chaudhary, V.; Guzmán, F.; Koehn, P. CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 16–20 November 2020; pp. 5960–5969. [Google Scholar] [CrossRef]
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, FAccT ’21, Virtual, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
- Abadji, J.; Ortiz Suarez, P.; Romary, L.; Sagot, B. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. In Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; European Language Resources Association: Paris, France, 2022; pp. 4344–4355. [Google Scholar]
- Wenzek, G.; Lachaux, M.A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4003–4012. [Google Scholar]
- Bañón, M.; Chen, P.; Haddow, B.; Heafield, K.; Hoang, H.; Esplà-Gomis, M.; Forcada, M.L.; Kamran, A.; Kirefu, F.; Koehn, P.; et al. ParaCrawl: Web-Scale Acquisition of Parallel Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 4555–4567. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- Lee, K.; Ippolito, D.; Nystrom, A.; Zhang, C.; Eck, D.; Callison-Burch, C.; Carlini, N. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022. [Google Scholar]
- Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv 2020, arXiv:2101.00027. [Google Scholar]
Language | Words (thousands) | % Words | Paragraphs | % Paragraphs | URLs | % URLs | Size (GB) |
---|---|---|---|---|---|---|---|
Afrikaans | 114,386 | 0.05% | 248,588 | 0.04% | 245,361 | 0.04% | 0.66 |
Arabic | 4,542,747 | 1.88% | 11,502,118 | 1.78% | 11,403,653 | 1.79% | 49.38 |
Bengali | 594,746 | 0.25% | 1,772,168 | 0.27% | 1,759,258 | 0.28% | 10.44 |
Catalan | 1,680,554 | 0.69% | 4,284,590 | 0.66% | 4,227,541 | 0.66% | 10.95 |
Czech | 5,764,676 | 2.38% | 16,978,203 | 2.63% | 16,667,666 | 2.62% | 44.12 |
Danish | 3,155,961 | 1.30% | 7,551,657 | 1.17% | 7,410,853 | 1.17% | 21.15 |
German | 38,899,553 | 16.06% | 109,705,164 | 16.99% | 107,381,398 | 16.88% | 297.11 |
Greek | 5,151,481 | 2.13% | 12,042,049 | 1.86% | 11,923,822 | 1.87% | 61.87 |
Spanish | 28,618,504 | 11.81% | 57,808,392 | 8.95% | 57,327,087 | 9.01% | 195.7 |
Basque | 116,918 | 0.05% | 420,725 | 0.07% | 414,677 | 0.07% | 0.99 |
Persian | 6,171,391 | 2.55% | 12,631,282 | 1.96% | 12,475,782 | 1.96% | 155.13 |
Finnish | 2,793,619 | 1.15% | 7,945,694 | 1.23% | 7,805,202 | 1.23% | 25.68 |
French | 40,739,063 | 16.82% | 94,895,680 | 14.69% | 92,999,095 | 14.62% | 298.24 |
Galician | 223,711 | 0.09% | 603,920 | 0.09% | 595,481 | 0.09% | 1.5 |
Hindi | 1,326,133 | 0.55% | 2,735,525 | 0.42% | 2,714,560 | 0.43% | 16.93 |
Croatian | 1,222,123 | 0.50% | 4,067,720 | 0.63% | 4,026,201 | 0.63% | 8.93 |
Italian | 21,879,938 | 9.03% | 50,844,721 | 7.87% | 50,094,971 | 7.88% | 149.39 |
Japanese | 913,584 | 0.38% | 23,856,552 | 3.69% | 23,593,379 | 3.71% | 111.23 |
Korean | 1,711,686 | 0.71% | 3,810,567 | 0.59% | 3,782,029 | 0.59% | 18.45 |
Maltese | 24,340 | 0.01% | 51,934 | 0.01% | 51,457 | 0.01% | 0.2 |
Dutch | 12,015,439 | 4.96% | 32,497,299 | 5.03% | 31,779,475 | 5.00% | 80.77 |
Norwegian | 2,361,749 | 0.97% | 5,725,188 | 0.89% | 5,646,580 | 0.89% | 15.59 |
Occitan | 7371 | 0.00% | 20,631 | 0.00% | 20,323 | 0.00% | 0.05 |
Punjabi | 54,496 | 0.02% | 104,737 | 0.02% | 103,879 | 0.02% | 0.7 |
Polish | 11,731,521 | 4.84% | 33,017,904 | 5.11% | 32,436,496 | 5.10% | 92.34 |
Portuguese | 22,577,860 | 9.32% | 50,055,128 | 7.75% | 49,535,350 | 7.79% | 149.34 |
Romanian | 5,351,974 | 2.21% | 11,995,288 | 1.86% | 11,851,961 | 1.86% | 36.29 |
Slovenian | 903,213 | 0.37% | 2,296,724 | 0.36% | 2,262,941 | 0.36% | 6.36 |
Serbian | 445,014 | 0.18% | 907,411 | 0.14% | 894,739 | 0.14% | 5.2 |
Swedish | 6,135,501 | 2.53% | 14,777,204 | 2.29% | 14,528,724 | 2.28% | 48.93 |
Turkish | 6,393,599 | 2.64% | 21,232,882 | 3.29% | 21,032,030 | 3.31% | 56.65 |
Ukrainian | 3,483,220 | 1.44% | 9,066,207 | 1.40% | 8,958,138 | 1.41% | 46.51 |
Urdu | 335,909 | 0.14% | 568,139 | 0.09% | 563,963 | 0.09% | 2.81 |
Chinese | 4,806,600 | 1.98% | 39,750,371 | 6.16% | 39,533,316 | 6.22% | 504.62 |
Total | 242,248,582 | 100.00% | 645,772,362 | 100.00% | 636,047,388 | 100.00% | 2524.24 |
Feature | OSCAR 22.01 [43] | mC4 [36] | CC-100 [44] | ROOTS [39] | ParaCrawl v9 [45] | esCorpius-m (Ours) |
---|---|---|---|---|---|---|
Size | 5.2 TB | 3.5 TB | 2.0 TB | 1.2 TB | 24 GB | 2.5 TB |
Docs | 560M | - | - | 521M | - | 1427M |
Words | 381B | 3567B | 284B | 189B | 8B | 242B |
WARC/WET | WET | WET | WET | WET | WARC | WARC |
Lang. identification | fastText | CLD3 | fastText | fastText | CLD2 | CLD2 + fastText |
Content identification | WET heuristic | WET heuristic | WET heuristic | WET heuristic | Sentence Alignment | JustText (modified) |
Elements | Document | Document | Document | Document | Sentence | Document and paragraph |
Parsing quality | Medium | Low | Medium | High | High | High |
Cleaning quality | Low | No cleaning | Low | High | High | High |
Deduplication | No | No | No | SimHash + SuffixArray | Bicleaner | dLHF |
Parallel corpus | - | - | - | - | ✓ | - |
Traceability | URL | URL | URL | URL | URL | URL + WARC |
Licence | CC BY 4.0 | ODC-BY 1.0 | Common Crawl | CC-BY-SA-4.0 | CC0 | CC-BY-NC-ND 4.0 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).