1. Introduction
In the current era of information proliferation, effective communication is paramount for comprehending and interacting with diverse, often intricate, content. Text simplification, defined as the process of converting complex texts into versions that are more accessible while preserving the original meaning and nuances, is crucial for improving readability across various demographics, including individuals with cognitive disabilities, non-native speakers, and people with lower literacy levels.
Text simplification encompasses multiple sub-tasks: identifying text complexity through labels, assessing readability via numerical or categorical levels, and generating simplified text from complex input [1]. Recent advancements in machine learning, particularly with deep learning- and transformer-based models, have substantially enhanced these sub-tasks. These advancements rely heavily on specialized datasets designed for specific languages, domains, and applications [2].
Despite significant progress in English text simplification, other languages, notably Greek, have seen limited development. Greek poses distinct challenges due to its extensive vocabulary, intricate grammatical structures, and flexible syntactic ordering. Addressing these challenges necessitates not only the development of tailored simplification techniques but also the creation of a comprehensive dataset that captures the unique characteristics of the language.
This paper presents the development of a Greek text simplification dataset, detailing its significance and the methodology utilized in its assembly. Our dataset aims to broaden accessibility for diverse audiences, including native speakers across various literacy levels, non-native speakers, individuals with cognitive impairments, and those requiring efficient information processing. Texts sourced from Greek Wikipedia—chosen for its broad subject coverage and structural parallels to Simple English Wikipedia—serve as the foundation of our dataset. This choice mirrors foundational efforts in English text simplification, leveraging the structural and content diversity of Wikipedia.
In constructing this dataset, meticulous attention was devoted to ethical and cultural considerations, ensuring that the simplified texts faithfully preserve original meanings and respect linguistic and cultural nuances. This initiative extends beyond enhancing accessibility for Greek speakers; it also supports the creation of text simplification algorithms specifically designed for Greek linguistic features and grammatical rules. Moreover, the dataset acts as a vital resource for researchers and practitioners to evaluate existing models and to innovate new techniques tailored for Greek texts.
The primary goal of this research is to establish the inaugural comprehensive text simplification dataset for Greek, catalyzing further research and innovation in this domain. By advancing Greek-specific text simplification techniques, this work contributes to a more inclusive society by diminishing language barriers, making information universally accessible, and enhancing effective communication. This endeavor not only facilitates the development of advanced language technologies for Greek but also sets a benchmark for similar initiatives in other under-represented languages. Furthermore, this paper discusses the broader implications of text simplification in an age of ubiquitous information. Simplifying text can expedite and enhance knowledge dissemination and bridge the digital divide, ensuring equitable information access for all, irrespective of linguistic or cognitive abilities. Therefore, the creation of a Greek text simplification dataset represents not just an academic venture but a step toward a more inclusive and well-informed global community.
The remainder of this paper is organized as follows:
Section 2 reviews related work, surveying prior initiatives in text simplification across languages and specifically highlighting resource gaps for the Greek language.
Section 3 details the methodology used to create the Greek text simplification dataset, including the sources of our texts, selection criteria, and annotation processes.
Section 4 describes the technical implementation of the dataset, outlining the software tools employed and the data processing techniques applied.
Section 5 presents the experimental evaluation of the dataset, demonstrating the application of machine learning techniques to validate the effectiveness of the text simplification process. Finally,
Section 6 concludes the paper, summarizing our findings and outlining future directions for research and development in Greek text simplification.
2. Related Work
Text simplification is a vital aspect of natural language processing (NLP) that seeks to make text more accessible while preserving its original intent and meaning. As an interdisciplinary field, it intersects with other areas of NLP such as text summarization, machine translation, and information extraction [3,4]. These intersections have fostered diverse approaches to simplification, ranging from rule-based to data-driven methodologies, each leveraging different technological advances and linguistic theories.
Substantial developments have been observed in the field of text simplification across natural languages, dataset development processes, learning schemas, and domains. Text simplification has been historically linked to other natural language processing tasks such as text summarization [5,6], machine translation [7,8,9,10,11], from which it has adopted training processes and evaluation metrics, and information extraction [12].
Regarding corpus development, text simplification has utilized both supervised and unsupervised learning techniques. Supervised approaches typically involve creating parallel corpora that include manually simplified versions of complex sentences [13]. In contrast, unsupervised methods leverage existing high-resource bilingual translation corpora to generate large-scale pseudo-parallel data for training models. This blend of methods underscores the dynamic and evolving nature of corpus development in simplification, aimed at refining model performance across diverse linguistic settings.
Prominent English datasets include the Newsela corpus [14], the Wikipedia Simple English corpus [15], and the One Stop English corpus [16]. The Newsela corpus, for instance, offers over half a million complex–simple sentence pairs and marks a significant milestone for professional applications. The Wikipedia dataset consists of 140k aligned complex–simple English sentence pairs initially evaluated for improving translation efficiency. The One Stop English corpus targets ESL learners, providing texts at three distinct reading levels. The D-Wikipedia dataset emphasizes document-level text simplification and demonstrates the effectiveness of large-scale datasets in simplifying texts [17].
For other languages, a comprehensive German corpus containing about 211,000 sentences was introduced by [18], expanding upon the work by [19] with more parallel and monolingual data, thus facilitating deeper analysis of text simplification and readability. Recent developments include a German news parallel corpus [20]. The PorSimples project [21] in Brazilian Portuguese and the Simplext Project [22] for Spanish are noteworthy efforts, the former including 4500 sentences from general news and popular science articles and the latter containing 1000 sentences. The first Italian corpus for text simplification was designed and annotated by [23], focusing on children’s literature and educational texts. The Alector corpus [24] includes manually simplified versions of French primary school texts, while advancements in the Swedish language have been achieved through the construction of a pseudo-parallel monolingual corpus for automatic text simplification by [25].
Recent research has expanded into multilingual code comment classification, moving beyond English to include languages like Serbian. Ref. [26] introduced a novel taxonomy and the first multilingual dataset of code comments from languages including C, Java, and Python, annotated for diverse classification tasks. It evaluated the effectiveness of monolingual and multilingual neural models, finding that language-specific models performed best for Serbian, while multilingual models were optimal for English. This approach highlights the potential of advanced language models in multilingual settings and underscores the importance of developing tailored classification tools for software documentation across different languages.
The domain-specific applications of text simplification are varied. Ref. [22] aimed to assist individuals with intellectual disabilities, whereas Ref. [27] developed datasets specifically for simplifying medical texts, indicating the expanding scope of text simplification into specialized fields.
Approaches to text simplification range from lexical methods [28], in which complex words are replaced with simpler synonyms appropriate to the context [29], to rule-based approaches that utilize syntactic information to identify structural changes [30,31,32,33], and data-driven methodologies. Hybrid approaches, combining data-driven and rule-based methods, have also been proposed [34].
Data-driven methodologies in text simplification vary widely, spanning from knowledge-rich approaches that use syntactically parsed alignments between simple and complex sentence pairs [35] to knowledge-poor methods that rely primarily on the availability of appropriate parallel data [36]. The most recent advancements involve neural simplification methods, which utilize encoder–decoder architectures, often augmented with long short-term memory (LSTM) layers [37], and employ word embeddings as input [11]. These embeddings can be pre-trained on large datasets or fine-tuned locally to better capture linguistic nuances [38]. Additionally, these neural models have been expanded to include higher-level semantic information through cognitive conceptual annotations [39], enhancing the ability to maintain semantic integrity during simplification. A reproducibility study provides further insights into the effectiveness and replicability of these sophisticated models, indicating a robust future for neural text simplification [40].
Our research introduces a novel Greek text simplification dataset, encompassing 7000 sentences, both complex and simplified, derived from Greek Wikipedia [41]. This choice reflects a strategic approach to capturing a broad spectrum of topics and discourse styles, essential for a comprehensive simplification tool. The dataset was developed with the collaboration of a diverse group of annotators from Ionian University, which contributed to a rich understanding of linguistic simplification across different demographics.
In conclusion, this section has not only highlighted the diverse and evolving landscape of text simplification research but also underscored our significant contribution through the development of a unique Greek text simplification dataset. By integrating insights from both historical and contemporary studies, our work addresses the notable under-representation of Greek in text simplification research and sets the stage for future advancements in creating more inclusive and accessible linguistic technologies. This endeavor not only enriches the academic field but also holds promise for real-world applications, potentially improving accessibility for Greek speakers worldwide.
3. Methodology
The methodology employed in the development of the Greek Wikipedia Simplification Dataset [42] is comprehensive, designed to ensure the creation of a robust and reliable resource for text simplification tasks. This section outlines the systematic approach taken from initial data collection through to the final stages of dataset refinement and annotation. Our processes are grounded in rigorous data science practices, combining advanced computational techniques with meticulous manual reviews to produce a dataset of high quality and broad applicability.
We begin by detailing the data collection process, utilizing sophisticated programming tools and APIs to extract a diverse array of text from Greek Wikipedia. Following this, we describe our quality control measures, which are essential to maintaining the integrity and usability of the dataset. The subsequent sections cover the technical implementation and the specific challenges encountered during the project, providing insight into the solutions devised to address these issues. The development of the dataset is then explained, highlighting the collaborative efforts and the strategic expansion of the dataset to include both original and simplified texts. Finally, the annotation guidelines are discussed, which were carefully crafted to ensure consistency and accuracy in the simplifications provided by various contributors.
Through this multi-faceted approach, we aim to deliver a dataset that not only supports current research in natural language processing but also sets a precedent for future work in the field, particularly in enhancing accessibility and comprehension of text in the Greek language.
3.1. Dataset Collection
The data collection for our Greek text simplification dataset was meticulously structured to ensure robustness, scalability, and broad library support, crucial for effectively handling large-scale data extraction tasks. Python, celebrated for its versatile ecosystem and extensive library support, was chosen as the primary programming language for this project. We specifically employed the ‘wikipedia’ and ‘wikipediaapi’ libraries, which offer superior handling of API requests and exceptional flexibility in accessing and parsing large volumes of data. These libraries are particularly well suited for interfacing with complex web resources, making them ideal for systematically extracting structured content from Greek Wikipedia.
The data collection was automated through a custom script, detailed in Algorithm 1. This script was engineered to efficiently fetch data while ensuring a diverse and representative dataset by accessing multiple Wikipedia pages across a variety of subjects. The automation process involved initializing API settings, fetching and processing text, and storing the results in a structured CSV file, which facilitates ease of further processing and analysis.
Algorithm 1 Data Collection from Greek Wikipedia
1: Import libraries for Wikipedia access and CSV file manipulation.
2: Set language for Wikipedia access to Greek (el).
3: Initialize Wikipedia API with language settings.
4: Set the number of pages to fetch (3000).
5: Set the number of sentences per page (5).
6: Open a new CSV file ‘wikiSentences.csv’ in UTF-8 encoding.
7: Write the header to the CSV file with columns “Page Title” and “Summary”.
8: Initialize a page counter to zero.
9: while page counter < number of pages do
10:     Fetch a random page title from Wikipedia.
11:     Retrieve the page object for the fetched title using the API.
12:     if page exists then
13:         Initialize summary variable and sentence counter.
14:         for each section in the page do
15:             Trim the section text and split into sentences.
16:             for each sentence in the section do
17:                 if sentence counter < number of sentences then
18:                     Append sentence to summary.
19:                     Increment sentence counter.
20:                 end if
21:                 if sentence counter == number of sentences then
22:                     Break from the loop.
23:                 end if
24:             end for
25:         end for
26:         if sentence counter == number of sentences then
27:             Write page title and summary to CSV file.
28:             Increment page counter.
29:         end if
30:     end if
31: end while
32: Close the CSV file.
33: Handle exceptions to ensure script stability.
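The control flow of Algorithm 1 can be sketched in Python. The fragment below is an illustrative sketch, not the exact collection script: the two fetch callables are injected stand-ins for the `wikipedia`/`wikipediaapi` calls (random-title lookup and page retrieval), so the core loop can be shown without network access, and the naive sentence splitter is an assumption for illustration only; a production pipeline would need a Greek-aware splitter (handling, for example, the Greek question mark “;”).

```python
import csv
import re

def first_sentences(text, n):
    # Naive split on terminal punctuation followed by whitespace;
    # an assumption for illustration, not the exact splitter used.
    sentences = [s.strip() for s in re.split(r"(?<=[.!;])\s+", text) if s.strip()]
    return sentences[:n]

def collect(fetch_random_title, fetch_page_text, n_pages, n_sentences, out_path):
    """Mirror Algorithm 1: keep fetching random pages until n_pages
    pages yielding at least n_sentences sentences have been written.

    fetch_random_title() -> str and fetch_page_text(title) -> str or None
    are injected stand-ins for the Wikipedia API calls.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Page Title", "Summary"])
        while written < n_pages:
            title = fetch_random_title()
            text = fetch_page_text(title)    # None models a missing page
            if text is None:
                continue
            sents = first_sentences(text, n_sentences)
            if len(sents) == n_sentences:    # discard pages with too little text
                writer.writerow([title, " ".join(sents)])
                written += 1
```

In the real pipeline, `fetch_random_title` would wrap the `wikipedia` library's random-page lookup with the language set to `el`, and `fetch_page_text` would return the section text of a `wikipediaapi` page object when the page exists.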
This algorithmic representation not only provides a clear and structured description of the data collection process but also underscores our systematic approach to maintaining the quality and diversity of the dataset. To ensure the integrity and reliability of the data collection process, we implemented robust error handling measures. These included exception handling mechanisms within our data extraction scripts to manage issues such as network interruptions, API limits, and data format errors. Our scripts featured retry logic to attempt data fetching multiple times before logging an error, thus enhancing resilience against transient network or API-related issues.
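The retry logic described above can be expressed as a small helper. This is an illustrative pattern rather than the project's actual code; the attempt count, backoff schedule, and the choice of `OSError` as the "transient failure" class are assumptions.

```python
import time

def fetch_with_retry(fetch, attempts=3, base_delay=1.0):
    """Call fetch() up to `attempts` times with exponential backoff.

    Returns the first successful result; re-raises the last error if
    every attempt fails, so the caller can log it (assumed policy).
    """
    last_error = None
    for i in range(attempts):
        try:
            return fetch()
        except OSError as err:          # network/API-style transient failures
            last_error = err
            if i < attempts - 1:        # no sleep after the final attempt
                time.sleep(base_delay * (2 ** i))
    raise last_error
```

A call such as `fetch_with_retry(lambda: wiki.page(title), attempts=3)` would then shield the collection loop from transient network interruptions while still surfacing persistent failures.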
In addition to error handling, we established rigorous data validation measures to uphold data quality. Automated scripts were employed to verify the correctness of data formats and consistency checks were routinely performed to ensure all retrieved data adhered to our specified criteria. This included validating data against predefined schemas, performing checksums to detect data corruption, and manually cross-referencing entries with secondary sources for accuracy and consistency. Our preprocessing steps further involved cleaning the data by removing duplicates, standardizing metadata, and correcting syntactic inconsistencies.
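A minimal sketch of the duplicate-removal and normalization step might look as follows; the row format and the exact cleaning rules are assumptions for illustration, not the project's actual validation scripts.

```python
import re

def clean_rows(rows):
    """Normalize whitespace and drop duplicate or empty (title, summary)
    rows, keeping first occurrences in their original order."""
    seen = set()
    cleaned = []
    for title, summary in rows:
        title = re.sub(r"\s+", " ", title).strip()
        summary = re.sub(r"\s+", " ", summary).strip()
        key = (title, summary)
        if summary and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned
```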
This meticulous attention to detail in the data collection phase is crucial for minimizing errors and inconsistencies, thereby enhancing the overall quality of the dataset. By integrating both robust error management and comprehensive data validation, we significantly improved the reliability and usability of our Greek text simplification dataset for further research and application development.
The strategy of random page selection and sentence extraction, implemented through a programmatically randomized algorithm, was specifically chosen to maximize the representativeness of the dataset across various topics and styles found in Greek Wikipedia. This method ensures that the dataset encapsulates a broad spectrum of the Greek language as used in diverse contexts, which is essential for developing a robust text simplification model that is effective across different domains and text types. Furthermore, this approach is pivotal in capturing the varied linguistic nuances and cultural contexts inherent in the Greek language, thereby contributing significantly to the creation of a comprehensive and effective text simplification tool.
5. Experimental Evaluation
In this section, we assess the effectiveness of various text simplification strategies applied to the Greek Wikipedia Simplification Dataset. Through a series of experiments, we evaluate the dataset’s structural and linguistic characteristics, employ advanced statistical analyses to explore word and sentence complexity, and implement machine learning models to classify sentences based on their complexity. This comprehensive evaluation not only demonstrates the practical applications of our methodologies but also highlights the challenges and achievements in automating text simplification processes. Each subsection is designed to provide a detailed insight into the dataset’s composition and our efforts to refine text simplification techniques, ensuring that the results are robust, interpretable, and actionable for future research and practical applications.
5.7. Using the Dataset for Identifying Complex Text
In an effort to automate the process of differentiating between ‘simple’ and ‘complicated’ Greek sentences, we utilized the RapidMiner software v9.0 to develop a model that classifies text based on its complexity. This endeavor is part of a broader initiative to apply machine learning techniques to enhance text simplification methodologies.
5.7.2. Evaluation Metrics
To thoroughly assess the effectiveness of each model, we relied on multiple metrics:
Accuracy: Measures the overall correctness of the model across both classes.
Precision: Indicates the accuracy of positive predictions, essential for determining the reliability of predictions for ‘complicated’ sentences.
Recall: Measures the model’s ability to identify all relevant instances, crucial for ensuring that all complex texts are correctly identified.
Area under the curve (AUC): Provides an aggregate measure of performance across all possible classification thresholds.
These metrics help elucidate the strengths and weaknesses of each model in handling the classification task.
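To make the metric definitions concrete, the following pure-Python sketch computes accuracy, precision, and recall from hard labels (the label names are illustrative). AUC is omitted here because it requires ranked classifier scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive="complicated"):
    """Accuracy, precision, and recall for the positive class.

    Precision = TP / (TP + FP); recall = TP / (TP + FN).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Read against the results below, a high recall with low precision (as for the SVM) means most complicated sentences are caught, but many simple sentences are also flagged as complicated.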
The results of this evaluation are detailed in Table 9, which compares the performance of the models across these metrics.
The KNN model shows modest accuracy at 52.12%, with a precision of 55.57% and a recall of 50.77%. Its performance indicates a balanced approach to both false positives and false negatives, suggesting a moderate ability to generalize across the dataset without significant bias towards either class. However, the precision breakdown between simple and complicated sentences (48.84%/55.57%) suggests a slightly better handling of complicated sentences over simple ones.
In contrast, the naive Bayes model exhibited the lowest performance among the models, with an accuracy of just 43.87%, a precision of 44.20%, and notably low recall at 20.39%. The poor recall indicates a significant number of false negatives, where many complex sentences are likely misclassified as simple. This model’s AUC of 0.48 nearly mirrors random guessing, emphasizing its limited capability in this specific classification context.
The SVM model, utilizing an RBF kernel, demonstrated the highest recall at an impressive 97.49%, suggesting it is highly effective at identifying complicated sentences. However, this comes at the cost of precision, particularly for simple sentences (24.63%), indicating a high rate of false positives, where simple sentences are incorrectly labeled as complicated. Its overall accuracy stands at 52.41%, and the AUC of 0.90 indicates excellent model performance in distinguishing between classes under ideal conditions. Yet, the class precision disparity suggests overfitting, particularly in recognizing simple sentences, which is a critical area for further adjustment.
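For readers reproducing this setup outside RapidMiner, a deliberately minimal k-nearest-neighbour sketch is shown below. The surface features (word count and mean word length) and the choice of k are assumptions for illustration, not the configuration evaluated above.

```python
from collections import Counter

def featurize(sentence):
    # Toy surface features: word count and mean word length.
    words = sentence.split()
    return (len(words), sum(len(w) for w in words) / len(words))

def knn_predict(train, sentence, k=3):
    """Classify a sentence by majority vote of its k nearest training
    sentences (Euclidean distance in the toy feature space)."""
    x = featurize(sentence)
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    neighbours = sorted(train, key=lambda item: dist(featurize(item[0]), x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Such shallow features are precisely what limit the models above: capturing Greek syntax and morphology would require richer representations, which motivates the deep learning directions discussed next.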
While the initial results of our models indicate only modest success in classifying text as ‘simple’ or ‘complicated’, these outcomes highlight several critical areas for future research. The performance limitations observed point to the need for more tailored approaches that consider the unique aspects of the Greek language. We are particularly interested in exploring more sophisticated machine learning techniques, such as deep learning models, that can potentially capture the nuances of Greek syntax and morphology more effectively. Additionally, expanding the dataset and incorporating more varied linguistic features are likely to improve the training process, allowing for more nuanced understanding and classification capabilities. These steps will form the basis of our ongoing efforts to enhance the models’ accuracy and reliability.
These outcomes highlight the complexities of applying machine learning to natural language processing, especially in distinguishing text complexities in a nuanced language like Greek. The variance in model success rates underscores the need for continued refinement of approaches, possibly integrating more sophisticated or tailored algorithms that can better handle the idiosyncrasies of language data. This evaluation not only directs future model improvements but also stresses the importance of choosing appropriate metrics to capture the true effectiveness of each model comprehensively.
6. Conclusions and Future Work
In conclusion, the creation of the Greek text simplification dataset marks a pivotal advancement in improving readability and accessibility for Greek-speaking populations. This dataset directly addresses the linguistic intricacies inherent to the Greek language, such as its elaborate morphological structure and flexible syntactic arrangements, which are essential for developing customized text simplification solutions that are finely attuned to the nuances of the Greek language.
The utility of this dataset extends beyond assisting individuals with limited literacy skills. It is equally beneficial for non-native speakers, people with cognitive disabilities, and anyone seeking to streamline their consumption of Greek language information. Consequently, the dataset promotes both accessibility and inclusivity, offering substantial resources for researchers and practitioners to assess and refine text simplification models, craft Greek-specific algorithms, and innovate new methods to enhance text accessibility.
Despite the strides made, this research has its limitations. The primary constraint lies in the dataset’s reliance on texts sourced exclusively from Greek Wikipedia, which may not fully represent the diversity of language used in various contexts like literature, legal texts, or informal communication. This reliance on a single source could affect the generalizability of the simplification models developed from this dataset. Furthermore, the current methodologies predominantly focus on syntactic simplification without extensively exploring semantic simplification, which is crucial for maintaining the meaning and context of more complex sentences. Lastly, the use of traditional machine learning models, rather than more advanced neural network architectures, might limit the potential accuracy and sophistication of the text simplification solutions [47]. Addressing these limitations in future work could significantly enhance the dataset’s utility and the effectiveness of the simplification tools derived from it.
Looking ahead, the field of Greek text simplification is ripe with opportunities for further research and development. Continued enhancement of simplification algorithms is crucial, leveraging user feedback and detailed performance metrics to improve the precision and efficacy of these tools. Such iterative refinements are essential to ensure that the simplification approaches remain effective and responsive to the specific needs of diverse user groups.
Broadening the scope of the dataset to encompass a wider array of text types and genres will significantly enhance the robustness and applicability of simplification techniques. A more comprehensive dataset facilitates the development of versatile tools that can operate effectively across various contexts, thereby broadening their potential impact.
Additionally, exploring advanced machine learning models, especially those based on transformer concepts, will be a priority. Integrating models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) could revolutionize our text simplification efforts by leveraging their superior contextual processing capabilities. This approach promises not only to enhance the accuracy of text classification but also to refine the understanding of complex linguistic structures in Greek, thereby improving the effectiveness of simplification algorithms.
In furtherance of this initiative, we aim to explore the capabilities of large language models like ChatGPT v.3.5 to assess and demonstrate the practical applications of our dataset. We plan to conduct comprehensive evaluations to test how well such models can handle the intricacies of Greek text simplification. Given the significant computational resources required for such studies, we are considering collaborations with other research institutions to facilitate this advanced research. These future endeavors will help us harness the full potential of our dataset and demonstrate its applicability across various machine learning contexts.
Establishing robust mechanisms for user feedback and validation is vital for aligning simplification techniques with the actual requirements of end-users. A user-centric design approach ensures that the tools developed genuinely benefit those with reading challenges, enhancing practical usability and impact.
Fostering collaboration across disciplines—such as linguistics, computer science, and cognitive psychology—can lead to deeper insights and more innovative solutions. These collaborative efforts can address complex linguistic challenges and enhance our understanding of text processing and comprehension, paving the way for sophisticated applications in text simplification.
By embracing these initiatives, the domain of Greek text simplification can progress towards creating more advanced tools that cater to a broader spectrum of linguistic needs and promote inclusivity within the digital information landscape. Such advancements will further the utility of text simplification, making information more accessible and comprehensible for all, particularly within the Greek-speaking community.
Author Contributions
Conceptualization, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Methodology, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Software, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Validation, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Data curation, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Writing—original draft, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Writing—review & editing, D.M., K.L.K. and A.K.; Supervision, D.M., K.L.K. and A.K.; Project administration, D.M., K.L.K. and A.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Santucci, V.; Santarelli, F.; Forti, L.; Spina, S. Automatic Classification of Text Complexity. Appl. Sci. 2020, 10, 7285. [Google Scholar] [CrossRef]
- Mouratidis, D.; Mathe, E.; Voutos, Y.; Stamou, K.; Kermanidis, K.L.; Mylonas, P.; Kanavos, A. Domain-Specific Term Extraction: A Case Study on Greek Maritime Legal Texts. In Proceedings of the 12th Hellenic Conference on Artificial Intelligence (SETN), Corfu, Greece, 7–9 September 2022; ACM: New York, NY, USA, 2022; pp. 1–6.
- Kanavos, A.; Theodoridis, E.; Tsakalidis, A.K. Extracting Knowledge from Web Search Engine Results. In Proceedings of the 24th International Conference on Tools with Artificial Intelligence (ICTAI), Athens, Greece, 7–9 November 2012; IEEE Computer Society: Washington, DC, USA, 2012; pp. 860–867.
- Vonitsanos, G.; Kanavos, A.; Mylonas, P. Decoding Gender on Social Networks: An In-depth Analysis of Language in Online Discussions Using Natural Language Processing and Machine Learning. In Proceedings of the IEEE International Conference on Big Data, Sorrento, Italy, 15–18 December 2023; pp. 4618–4625.
- Siddharthan, A.; Nenkova, A.; McKeown, K.R. Syntactic Simplification for Improving Content Selection in Multi-Document Summarization. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, 23–27 August 2004.
- Silveira, S.B.; Branco, A. Combining a Double Clustering Approach with Sentence Simplification to Produce Highly Informative Multi-Document Summaries. In Proceedings of the 13th International Conference on Information Reuse & Integration (IRI), Las Vegas, NV, USA, 8–10 August 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 482–489.
- Narayan, S.; Gardent, C. Hybrid Simplification using Deep Semantics and Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, USA, 23–24 June 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 435–445.
- Qiang, J.; Zhang, F.; Li, Y.; Yuan, Y.; Zhu, Y.; Wu, X. Unsupervised Statistical Text Simplification using Pre-trained Language Modeling for Initialization. Front. Comput. Sci. 2023, 17, 171303.
- Specia, L. Translating from Complex to Simplified Sentences. In Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language (PROPOR), Porto Alegre, Brazil, 27–30 April 2010; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6001, pp. 30–39.
- Wubben, S.; van den Bosch, A.; Krahmer, E. Sentence Simplification by Monolingual Machine Translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Republic of Korea, 8–14 July 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 1015–1024.
- Zhang, X.; Lapata, M. Sentence Simplification with Deep Reinforcement Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 584–594.
- Evans, R.J. Comparing Methods for the Syntactic Simplification of Sentences in Information Extraction. Lit. Linguist. Comput. 2011, 26, 371–388.
- Lu, X.; Qiang, J.; Li, Y.; Yuan, Y.; Zhu, Y. An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages. arXiv 2021, arXiv:2109.00165.
- Newsela Data. Available online: https://newsela.com/data (accessed on 30 July 2024).
- Coster, W.; Kauchak, D. Simple English Wikipedia: A New Text Simplification Task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 665–669.
- Vajjala, S.; Lucic, I. OneStopEnglish Corpus: A New Corpus for Automatic Readability Assessment and Text Simplification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications@NAACL-HLT, New Orleans, LA, USA, 5 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 297–304.
- Sun, R.; Jin, H.; Wan, X. Document-Level Text Simplification: Dataset, Criteria and Baseline. arXiv 2021, arXiv:2110.05071.
- Battisti, A.; Ebling, S. A Corpus for Automatic Readability Assessment and Text Simplification of German. arXiv 2019, arXiv:1909.09067.
- Klaper, D.; Ebling, S.; Volk, M. Building a German/Simple German Parallel Corpus for Automatic Text Simplification. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR@ACL), Sofia, Bulgaria, 8 August 2013; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 11–19.
- Rios, A.; Spring, N.; Kew, T.; Kostrzewa, M.; Säuberli, A.; Müller, M.; Ebling, S. A New Dataset and Efficient Baselines for Document-level Text Simplification in German. In Proceedings of the 3rd Workshop on New Frontiers in Summarization, Online, 10 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 152–161.
- Aluisio, S.; Specia, L.; Gasperin, C.; Scarton, C. Readability Assessment for Text Simplification. In Proceedings of the NAACL HLT 5th Workshop on Innovative Use of NLP for Building Educational Applications, Los Angeles, CA, USA, 5 June 2010; pp. 1–9.
- Saggion, H.; Stajner, S.; Bott, S.; Mille, S.; Rello, L.; Drndarevic, B. Making It Simplext: Implementation and Evaluation of a Text Simplification System for Spanish. ACM Trans. Access. Comput. 2015, 6, 1–36.
- Brunato, D.; Dell’Orletta, F.; Venturi, G.; Montemagni, S. Design and Annotation of the First Italian Corpus for Text Simplification. In Proceedings of the 9th Linguistic Annotation Workshop (LAW@NAACL-HLT), Denver, CO, USA, 5 June 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 31–41.
- Gala, N.; Tack, A.; Javourey-Drevet, L.; François, T.; Ziegler, J.C. Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, 11–16 May 2020; European Language Resources Association: Paris, France, 2020; pp. 1353–1361.
- Holmer, D.; Rennes, E. Constructing Pseudo-parallel Swedish Sentence Corpora for Automatic Text Simplification. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Tórshavn, Faroe Islands, 22–24 May 2023; University of Tartu Library: Tartu, Estonia, 2023; pp. 113–123.
- Kostic, M.; Batanovic, V.; Nikolic, B. Monolingual, Multilingual and Cross-lingual Code Comment Classification. Eng. Appl. Artif. Intell. 2023, 124, 106485.
- Den Bercken, L.V.; Sips, R.; Lofi, C. Evaluating Neural Text Simplification in the Medical Domain. In Proceedings of the World Wide Web Conference (WWW), San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 3286–3292.
- Shardlow, M. Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Paris, France, 2014; pp. 1583–1590.
- Bott, S.; Rello, L.; Drndarevic, B.; Saggion, H. Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India, 8–15 December 2012; Indian Institute of Technology Bombay: Mumbai, India, 2012; pp. 357–374.
- Biran, O.; Brody, S.; Elhadad, N. Putting it Simply: A Context-Aware Approach to Lexical Simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 496–501.
- Chandrasekar, R.; Srinivas, B. Automatic Induction of Rules for Text Simplification. Knowl. Based Syst. 1997, 10, 183–190.
- Qiang, J.; Li, Y.; Zhu, Y.; Yuan, Y.; Shi, Y.; Wu, X. LSBert: Lexical Simplification Based on BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3064–3076.
- Siddharthan, A. Text Simplification using Typed Dependencies: A Comparison of the Robustness of Different Generation Strategies. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG), Nancy, France, 28–30 September 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 2–11.
- Siddharthan, A.; Mandya, A. Hybrid Text Simplification using Synchronous Dependency Grammars with Hand-written and Automatically Harvested Rules. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, 26–30 April 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 722–731.
- Woodsend, K.; Lapata, M. Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, UK, 27–29 July 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 409–420.
- Garbacea, C.; Guo, M.; Carton, S.; Mei, Q. An Empirical Study on Explainable Prediction of Text Complexity: Preliminaries for Text Simplification. arXiv 2020, arXiv:2007.15823v1.
- Wang, T.; Chen, P.; Amaral, K.M.; Qiang, J. An Experimental Study of LSTM Encoder-Decoder Model for Text Simplification. arXiv 2016, arXiv:1609.03663.
- Nisioi, S.; Stajner, S.; Ponzetto, S.P.; Dinu, L.P. Exploring Neural Text Simplification Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 85–91.
- Sulem, E.; Abend, O.; Rappoport, A. Simple and Effective Text Simplification Using Semantic and Neural Methods. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 162–173.
- Arvan, M.; Pina, L.; Parde, N. Reproducibility of Exploring Neural Text Simplification Models: A Review. In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, Virtual, 18–22 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 62–70.
- Greek Wikipedia. Available online: https://en.wikipedia.org/wiki/Greek_Wikipedia (accessed on 30 July 2024).
- HiLab Greek Text Simplification Dataset. Available online: https://hilab.di.ionio.gr/wp-content/uploads/2024/07/HiLab_Greek_text_simplification_Wikipedia_Dataset.zip (accessed on 30 July 2024).
- Lee, R.S.T. Natural Language Processing—A Textbook with Python Implementation; Springer: Berlin/Heidelberg, Germany, 2024.
- Wagner, W. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit—O’Reilly Media. Lang. Resour. Eval. 2010, 44, 421–424.
- Honnibal, M.; Montani, I.; Landeghem, S.V.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python; Zenodo: Geneva, Switzerland, 2020.
- Al-Thanyyan, S.; Azmi, A.M. Automated Text Simplification: A Survey. ACM Comput. Surv. 2022, 54, 1–36.
- Mouratidis, D.; Kermanidis, K.; Kanavos, A. Comparative Study of Recurrent and Dense Neural Networks for Classifying Maritime Terms. In Proceedings of the 14th International Conference on Information, Intelligence, Systems & Applications (IISA), Volos, Greece, 10–12 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).