1. Introduction
The landscape of academic education has evolved rapidly in recent years. During the COVID-19 pandemic, many higher education institutions (HEIs) worldwide were forced to switch to technologies to support the educational process remotely, which greatly boosted the spread of academic distance education and e-learning [
1]. Under these circumstances, enhancing the quality of academic education services provided was a priority for many HEIs. To achieve this, student performance was evaluated through various methods, focusing on students’ grades and their interaction with learning management systems, their engagement with online resources, the frequency and quality of their participation in virtual discussions, and their ability to complete assignments and assessments within the platform [
2,
3,
4]. Among the indicators for assessing the quality of academic education are also student evaluations of teaching effectiveness, the quality of student attendance, the content and design of curricula, and so on. In this context, the systematic collection and analysis of student feedback have emerged as key practices in the operational and strategic decision-making processes of HEIs. Therefore, the emphasis on students’ evaluation of their studies not only contributes to curriculum improvement and faculty performance evaluation, but it is also an important indicator for assessing the overall educational experience.
Recent developments have made online surveys a widespread method for collecting educational evaluation data. This method provides many significant advantages over traditional pen-and-paper surveys, including faster data collection, increased accessibility via mobile devices, and the ability to reach a broader student sample. In addition, online survey platforms provide researchers with the flexibility to use a variety of question formats, enhancing the richness and depth of the data collected. However, despite these advantages, implementing online surveys often presents issues such as low response rates, potential biases in sample representation, and the complexity of analyzing qualitative responses in large datasets [
5]. These challenges underscore the need for continued examination of methodologies and practices for collecting and interpreting educational evaluation data to improve the educational process of HEIs.
The prevalence of closed-ended questions in survey designs has lent a degree of objectivity and convenience to quantitative analyses. However, this reliance brings limitations. Quantitative data often follow predetermined research patterns, potentially limiting the exploration of emerging themes and insights that could enrich the understanding of students’ experiences [6]. In contrast, open-ended questions offer an effective way to more systematically elicit students’ views and experiences in their own words. However, processing and analyzing qualitative data poses inherent difficulties, especially in large-scale studies, as it requires rigorous coding and thematic analysis, which can be particularly challenging and time-consuming with large samples [
7].
In recent years, significant developments in text mining (TM) and natural language processing (NLP) have greatly enhanced the structured analysis of data collected from large text corpora. A promising method that can contribute to the systematic analysis of qualitative data in education is automatic text summarization [
8]. The present study introduces a hybrid text summarization technique to address the challenges of processing large volumes of university students’ feedback. The technique combines the TextRank and Walktrap algorithms with the GPT-4o mini model. Student feedback was obtained from an online educational survey at the Hellenic Open University (HOU) in Greece, focusing on critical dimensions such as tutors’ role, the quality of educational content, the design of academic modules, and the quality of administrative and technical services. Evaluated using the G-Eval framework and DeepEval summarization metrics, the technique demonstrates its ability to provide relevant, coherent, and comprehensive summaries, offering a deeper understanding of student experiences. The findings revealed that, while students were generally satisfied with tutor–student interactions, there were negative comments regarding the problematic content of some modules and scheduling issues. In summary, the results generated by this approach offer valuable insights not only for the HOU but also for HEIs in general. By providing a clearer understanding of the factors that positively or negatively affect the student experience in academic distance learning, the technique supports informed decision-making and paves the way for ongoing improvements in educational processes and practices across the field. Using the proposed technique, HEIs can quickly extract and condense large volumes of qualitative data, making student feedback more accessible to deans, tutors, and decision-makers. This technique helps identify data trends by faculty or program, and when combined with quantitative evaluation results, it provides a deeper understanding of student experiences. Such an approach could lead to improvements in tutors’ role, educational materials, course design, administrative services, and other dimensions related to the quality assurance of the academic educational process.
The rest of this paper is organized as follows:
Section 2 provides an overview of the analysis of qualitative feedback based on student opinions collected through online educational evaluation surveys, discussing the benefits, challenges, and methods related to these evaluations.
Section 3 describes the context in which the study was conducted, the research questions, and the contribution of the proposed technique to academic education.
Section 4 outlines the research design, the participants, and the sample involved in the study, while
Section 5 and
Section 6 present in detail the design and implementation of the proposed hybrid summarization model, respectively. The findings from the application of the technique are analyzed in
Section 7, followed by a discussion of these findings in
Section 8. Finally, Section 9 discusses the study’s limitations, along with some ideas for future research directions, emphasizing the potential for improving the summarization processes used.
2. Analyzing University Student Feedback from Online Educational Surveys: Benefits, Challenges, and Methods
Student feedback in academic education is an important factor in investigating the quality of services provided by HEIs. Students’ opinions are mainly concerned with evaluations of faculty effectiveness and program quality [
9,
10,
11] as well as student satisfaction [
12]. These evaluations are routinely carried out using either internationally recognized and valid questionnaires or questionnaires constructed and adapted to the requirements of the institution concerned. HEIs attach particular importance to the student feedback they collect because it helps them to make decisions regarding the recruitment or promotion of academic staff, changes in course content, and improvements in the administrative and technical services provided.
In recent years, conducting evaluations in HEIs to collect feedback from students’ opinions has been greatly enhanced by the use of online surveys. Several online tools are commercially available, but many HEIs have also developed infrastructures that support their in-house online surveys. The major advantages of online surveys are the speed of data collection, the flexibility of location and time of completion, the ability to participate in the survey through mobile devices, the ability to reach a large sample of participants, etc. In addition, they provide researchers with ease of creating and structuring questionnaires through a range of question types and functionalities for data visualization and monitoring during the collection of the responses [
5,
13]. Their drawbacks include often-low response rates, arbitrary research design choices regarding sample representativeness, unclear questions or instructions, the collection of unwanted responses (e.g., insulting comments), and privacy assurance issues [
5,
14,
15].
In higher education, online evaluation questionnaires with closed-ended questions are by far the preferred instrument. These questions serve as indicators for measuring factors such as student–tutor interactions, program materials, course organization, and instructional design. Online questionnaires are effective because they allow for the rapid collection and analysis of data from almost the entire population and for comparing results over time [
16]. In addition, including rating items allows for easier interpretation of findings regarding tutor effectiveness and other dimensions of the educational process, and subsequent decision-making [
17].
However, it has been pointed out that quantitative methods are characterized by their inherently closed nature, meaning that if a topic or dimension is not anticipated in the survey design, it will not appear in the results. More generally, the tendency of researchers to rely on previous quantitative studies and to confirm existing research hypotheses often leads to the replication of a closed variable structure that is systematically reused in different contexts. This approach can simply recycle existing knowledge without any innovation, thereby limiting the possibility of generating new knowledge [6]. Although open-ended questions can provide valuable qualitative data on student feedback, they are often considered time-consuming and demanding to process. As a result, they may be unsuitable for many researchers conducting large-scale surveys [
11].
Including open-ended questions in online surveys seems to have many additional benefits compared to closed-ended responses. Open-ended questions provide valuable information and complement the quantitative findings in the survey. They are useful when exploring new topics or concept dimensions as they may reflect real-life experiences [
18]. Through open-ended questions, respondents are not limited by a predetermined set of closed options and they can express their thoughts in their own words, often producing rich information [
19]. In addition, respondents can create themes and sub-themes themselves, which may contradict or extend the theoretical assumptions of the researchers on the issue under investigation [
6,
20]. However, the completion rates for open-ended responses to electronic questionnaires are often lower than for closed-ended responses [
13]. Issues such as the design of the questionnaire, the available space for typing, the instructions for completion [
21,
22], the interest of the respondent, and the cognitive load required to answer can affect the length and quality of open-ended responses [
7,
19].
In large-scale online surveys, open-ended responses pose significant challenges to researchers. Processing qualitative data requires coding them into categories through the identification of recurring patterns and ideas [
23]. The categories of analysis are shaped by the relevant literature and the data itself, ensuring logical consistency. This process is time-consuming, requiring time to review the literature, pilot the coding scheme, and train raters in its use. Rater training is essential to ensure agreement in coding the data and thus the reliability of the survey [
24]. In small samples, coding can be carried out manually, with two or more people coding independently to assess inter-rater reliability. In large-scale qualitative surveys, coding often requires the ad hoc creation of broad categories, leaving a wealth of data unused [
25]. Furthermore, this type of survey involves dividing the labor among multiple teams with specific responsibilities, which may facilitate survey management but often creates problems in communicating and translating information when the teams do not interact effectively with each other [
25,
26]. In addition, processing and coding qualitative data from online surveys becomes particularly challenging when responses are unclear [
27].
The use of computer-assisted qualitative data analysis software (CAQDAS), such as Atlas.ti 8.0, NVivo 14.0, or MAXQDA 24.4.1, can help reduce the issues associated with analyzing qualitative data, but it requires careful oversight to ensure the accuracy of the classification. Although these tools help with text tagging, the coding process remains time-consuming and demanding for large samples. In addition, these tools cannot be fully trained to automatically identify the appropriate pieces of text for coding. Therefore, individual decisions by researchers become crucial in this case, and agreement among raters on an acceptable coding scheme remains a common strategy to minimize subjective errors and bias [
26].
Advances in NLP and TM have enhanced the structured analysis of data collected from large corpora. Ref. [
28] stresses the significant role of NLP and TM in enhancing online education. By analyzing large amounts of unstructured data from forums, essays, and assessments, these technologies provide valuable information for improving teaching, learning, and student engagement. Key areas of application include automated grading, feedback generation, plagiarism detection, and personalized learning recommendations. Additionally, TM aids in creating educational content, predicting student dropout risks, and enhancing information retrieval. Ref. [
29] lists the following as the most common goals cited in the literature for applying NLP and TM techniques to student feedback: (a) sentiment prediction, (b) category and rating prediction complemented with sentiment, (c) emotion prediction and analysis, (d) opinion mining, (e) lexicon construction, and (f) statistical or mathematical analysis. The literature distinguishes between supervised and unsupervised machine learning methods [30]. In supervised methods, human raters first code a subset of training data, which is then used to train an algorithm to predict categories for the remaining, uncategorized text responses. Although these methods require modeling and prediction expertise, they offer low cost and fast execution [
7]. Unsupervised methods extract information from text mainly at an exploratory level, searching for common underlying themes, and do not require manual effort. These themes lead to a better understanding of the conceptual structure of the corpus, help clean it of unwanted information, and support coding that highlights the key concepts in the analysis. For a brief presentation of these approaches, see also [
31].
In a review of advancements in NLP in opinion mining of customer reviews during the last decade, Ref. [
32] stresses the shift from traditional machine learning to deep learning models. Key areas of NLP implementation in customer reviews include sentiment analysis and opinion mining, fake review detection to identify fraudulent feedback, customer experience and satisfaction analysis, user profiling and recommendation systems, marketing, and brand management. Additionally, the emergence of technologies like transformers (e.g., BERT) enables deeper and more sophisticated analysis. According to [
33], sentiment analysis has been widely adopted in education to analyze student feedback. It helps evaluate student engagement, infrastructure limitations, course effectiveness, and policy decisions. Techniques like aspect-based and entity-level analysis allow for personalized learning by understanding individual student issues. Sentiment and emotional analysis categorize feedback into various levels, aiding in teacher evaluation and pedagogical improvements. Additionally, predictive models have been used to correlate sentiment with academic performance, helping institutions identify at-risk students early.
In addition to the above methods, an approach that has grown in recent years is automated text summarization. This is the process of condensing a large amount of text data into a shorter, more concise form, allowing the main points to be grasped without reading the entire text. This technique saves time when large volumes of text must be managed. The procedure of automated text summarization usually includes three stages: pre-processing, processing, and post-processing. In pre-processing, words, sentences, and structural elements of the text are identified as input units. In processing, the input text is converted into a summary using a summarization method. Finally, in post-processing, any problems in the generated summary are corrected [
34].
There are different types of automated text summaries depending on the criteria applied to produce them [
8]. We distinguish between single- and multi-document summaries depending on whether the summary results from one or more documents. We also speak of a language-specific summary if all documents and their summaries are written in the same language (e.g., Greek), or multilingual, if, for example, the documents are written in Greek and English but the summary must be in Greek, or cross-lingual, if all documents are written in Greek and the summary is in English. Summaries can also be flat (a single summary is produced with no intermediate summaries) or hierarchical (multiple levels of summaries are made, e.g., abstract and extended summary, allowing for zooming in and out of the text), general purpose or query-based (highlighting the parts of the text related to a query). Finally, summaries can be extractive or abstractive. In an extractive summary, the most important sentences from the original text are selected and prioritized into a shorter summary; in an abstractive summary, the most important text information is reworded to create the final summary [
35]. Although abstractive summarization alone often performs better than extractive [
36], it is usually difficult to implement. The emergence of Large Language Model (LLM) technology has dramatically advanced the field of abstractive summarization. However, the high hardware requirements and specifications, as well as the need for large amounts of training data, can act as a deterrent: the required data may not be readily available in various domains, or the volume of data involved may simply not be large enough to warrant such models. As [
34] states, due to the complexity required to produce an abstract summary, recent research in the field has focused on extractive techniques.
Hybrid summarization is a combination of extractive and abstractive techniques with the aim of potentially producing more coherent and contextually relevant summaries. However, the quality of the final summary heavily depends on the initial extractive step, which can lead to lower-quality abstractive summaries that fail to fully capture the original text’s depth compared to pure abstractive methods. Despite these drawbacks, hybrid summarization often yields more coherent results than pure abstractive approaches, as it builds upon extracted sentences that already contain essential information [
37]. A systematic 12-year literature review by [
28] reveals that text summarization is not yet a widely adopted text mining technique in education. This study is expected to contribute significantly to the development and utilization of text summarization, particularly of hybrid approaches, in the field of academic education.
3. Context and Research Questions of the Present Study
The research was conducted in Greece at the HOU. The HOU is the only HEI in the country that offers exclusively distance education at undergraduate and postgraduate levels. With approximately 40,000 active tuition-paying students, whose average age is 35 years, the HOU is currently the largest institution of its kind in Greece. Academic studies at the HOU promote autonomous and self-directed learning, which is facilitated by tutor–student, student–student, and student–content interactions [
38,
39] within a technologically supported environment. The HOU offers distance academic education in various subjects, with study programs structured in annual or semester modules. For each module, students have access to specific educational content (educational materials such as books and other digital educational resources, as well as educational activities such as written assignments, projects, and laboratory exercises). The learning path is defined by study guides and time schedules that accompany the educational content. It is the learning material that plays the primary instructional role in the student’s learning process and not the faculty members [
40]. Also, activities supported by adequate feedback play a vital role in helping students to learn [
41]. Faculty members, in this distance learning context, assume the role of tutors, who do not lecture but advise and guide students in their studies [
42]. More specifically, the role of tutors is to provide clear explanations to student queries, assist students with the comprehension of educational material, guide them through educational activities, offer constructive written feedback on their work, and maintain effective communication with them. Additionally, tutors should encourage and motivate students, while effectively utilizing the educational platforms to enhance the learning experience. Attendance at the HOU is accompanied by online tutor–student meetings during which tutors are able to guide students through the activities, resolve their questions, and encourage them to continue their studies [
43]. In addition to these online meetings, interaction with tutors and the educational content is supported asynchronously through educational platforms, which are customized versions of the Moodle 4.0 LMS.
In the aforementioned context, students are invited to anonymously evaluate their experience of the modules they attended through an online questionnaire, which includes four evaluation strands comprising criteria adopted after an extensive literature review on student satisfaction in academic distance education and e-learning settings: (a) the role of the tutor (evaluation criteria: clarity in resolving queries, assistance in understanding educational material, guidance in completing educational activities, constructive written feedback, communication initiatives during the academic year/semester, encouragement during studies, and utilization of the educational platform), (b) the educational content, i.e., educational material and activities (evaluation criteria: material alignment with learning outcomes, material’s contribution to understanding the subject matter, material’s contribution to completing educational activities, and contribution of activities to learning outcomes), (c) the module design (evaluation criteria: clarity of module objectives, feasibility of the study schedule, clarity of assessment and grading criteria, and contribution of tutor–student meetings to subject-matter comprehension), and (d) the administrative and technical services (evaluation criteria: satisfaction with student registry support, satisfaction with technical support (helpdesk), usability of the educational platforms, and usability of the HOU website). These criteria have been formulated as 20 closed-ended questions. In addition, the questionnaire includes three open-ended questions that encourage students to articulate the most important positive and negative aspects of the module, as well as propose suggestions for improvements. These questions aim to gather additional information about students’ learning experiences, which the closed-ended structure of the questionnaire may not anticipate.
The large volume of qualitative data resulting from student feedback, as well as its analysis, remains a challenge for the HOU. Various text mining techniques have previously been applied to either detect faculty member and student satisfaction with remote online examinations [
44,
45] or student satisfaction with master’s thesis supervision [
46]. However, a comprehensive approach in terms of methodology and scale regarding the processing and analysis of qualitative data collected from online surveys is lacking. The findings from such an approach will provide valuable feedback to both the academic community and the administration, not only at the HOU but also at other HEIs. This feedback can assist in making decisions regarding collaboration with tutors, updates of educational materials and activities, course redesign, and the improvement of administrative and technical services.
For the present study, the key beneficiaries of text summarization will be the HOU’s Study Program Directors (SPDs). An SPD is a faculty member whose main goal is to ensure objective assessment of student progress, promote scientific research, and develop technology and methodology in lifelong distance learning. The SPD assigns thesis projects, suggests other faculty members for the role of module coordinator and coordinator assistant, collaborates with administrative services, and contributes to the improvement of educational material. Additionally, the SPD is responsible for regular communication with module coordinators and tutors. They collect reports to inform the University Administration about the progress of the study program, highlighting issues and proposing improvements. The SPD also coordinates the study program certification processes according to HOU standards, ensuring the implementation of guidelines from the Quality Assurance Unit (QAU) and the Internal Evaluation Teams (IETs).
Considering the context of the HOU’s evaluation of the distance education process and the challenges posed by open-ended questions, this study aims to assess the satisfaction of students in an annual study program by analyzing their feedback from open-ended responses to an online evaluation questionnaire using hybrid text summarization. More specifically, the research questions posed are as follows:
- (1)
What are the students’ positive experiences of taking their modules?
- (2)
What aspects of module attendance do students find challenging or unsatisfactory?
- (3)
How can the modules be improved?
- (4)
How effective is the proposed summarization technique in capturing student feedback with respect to the abovementioned research questions?
Regarding research questions 1–3, these aim to highlight qualitative feedback related to student satisfaction with dimensions of the educational process, as reflected in the evaluation strands and criteria described above. The qualitative student feedback will be compared with the corresponding quantitative feedback from the evaluation questionnaire in order to identify convergences or divergences and to strengthen, where necessary, the relevant findings regarding student satisfaction with the modules. Regarding research question 4, the hybrid summarization technique will be examined with respect to its ability to condense large volumes of student feedback into coherent themes while ensuring that the analysis remains focused on the most important issues that positively or negatively affect the learning process.
6. Implementation
The implementation of the above flowchart took place within the HOU environment, aiming to host most of the models inside the university’s own infrastructure and to limit dependencies on third-party cloud models due to anonymization restrictions and the institution’s data policy. The future goal is to build a system that can be used to analyze all open-ended questions from all evaluations conducted at the university, regardless of the study program. However, the present study focused on one postgraduate study program.
The system consists of two core modules: the Response Preparation Module and the Survey Summarization Module. They were implemented in Python 3 as RESTful services using the Flask Python library [63]. A MySQL database management system was used to store all the required data. The administration and the production of summaries were provided through a web-based application built with PHP.
The first module (Response Preparation) moderates the data preparation steps: it takes as input the raw data of the students’ responses and returns a well-formed dataset containing the initial response, the sentences of each response, and the lemmas of each sentence. The dataset is filtered and cleaned appropriately. This module includes sentence preparation, anonymization, and lemmatization as internal components. The second module (Survey Summarization) is responsible for producing a summary of the responses to open-ended questions. It takes as input the dataset of clean, lemmatized sentences and computes the similarity matrix between the sentences to produce a graph. Next, it invokes the Sentence Ranking and Sentence Clustering internal components, which rank all the sentences and cluster them, respectively. In this way, the module can highlight the most important sentences and prepare a dataset of clustered sentences for abstractive summarization. Finally, it produces the final report with the summary of the responses. More specifically, the report presents an informative quantitative paragraph reporting the number of answers, the number of sentences extracted from them, and the number of clean sentences that were finally used in the summarization process; this is followed by an abstractive summary that briefly describes the trends found in the responses, and finally by a list of indicative and representative sentences selected from the students’ responses. When the summarization must be abstractive, the service invokes an external AI summarization service.
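To make the module boundaries concrete, the following is a minimal sketch of how the two modules could be exposed as RESTful Flask services; the endpoint names, payload fields, and the naive sentence splitter are illustrative assumptions rather than the actual HOU implementation:

```python
# Minimal sketch of the two core modules as RESTful Flask services; endpoint
# names, payload fields, and the naive sentence splitter are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def split_sentences(response: str) -> list[str]:
    """Placeholder sentence preparation; anonymization and lemmatization plug in here."""
    return [s.strip() for s in response.split(".") if s.strip()]

@app.route("/prepare-responses", methods=["POST"])
def prepare_responses():
    """Response Preparation Module: raw student responses -> cleaned sentence dataset."""
    raw = request.get_json()["responses"]
    dataset = [{"response": r, "sentences": split_sentences(r)} for r in raw]
    return jsonify(dataset)

@app.route("/summarize-survey", methods=["POST"])
def summarize_survey():
    """Survey Summarization Module: sentence dataset -> summary report skeleton."""
    dataset = request.get_json()["dataset"]
    sentences = [s for item in dataset for s in item["sentences"]]
    report = {
        "n_responses": len(dataset),
        "n_sentences": len(sentences),
        # The similarity graph, TextRank ranking, Walktrap clustering, and the
        # abstractive (GPT-4o mini) summary are produced by the components
        # described in the remainder of this section.
    }
    return jsonify(report)

if __name__ == "__main__":
    app.run(port=5000)
```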
The required input data for the whole process consist of the following datasets:
Students’ responses to an open-ended question;
Replacement rules in the form of ‘find and replace’ pairs;
Words to be excluded from NER analysis;
List of study programs, course modules, and tutors of each module (as an input for the custom Name Replacement algorithm);
Lemmas to be excluded from lemmatization;
Lemmas to be corrected before lemmatization;
Custom POS definitions for particular lemmas.
Only two of the above datasets are replaced each time a new question is analyzed (the students’ responses and, occasionally, the list of study programs, modules, and tutors), while the remaining datasets are domain-specific and rarely updated.
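For illustration, the configuration datasets listed above could be represented as simple structures of the following shape; all field names and values are hypothetical and do not reflect the actual HOU data:

```python
# Hypothetical shapes for the configuration datasets listed above; all names
# and values are illustrative, not the actual HOU configuration.
replacement_rules = [                      # 'find and replace' pairs
    {"find": "ΕΑΠ", "replace": "HOU"},
]
ner_excluded_words = ["Moodle", "Python"]  # never treated as person names
program_structure = {                      # study program -> modules -> tutors
    "Example Program": {"Module A": ["Tutor 1", "Tutor 2"]},
}
lemma_exclusions = ["email"]               # tokens to skip during lemmatization
lemma_corrections = {"μαθήμα": "μάθημα"}   # misspellings fixed before lemmatization
custom_pos = {"ΕΑΠ": "PROPN"}              # POS overrides for particular lemmas
```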
Some critical libraries and external services were also integrated into the system. First, concerning data cleaning, langdetect 1.0.9 [64] was used to detect responses written in languages other than modern Greek, the language the system operates in. This library is a direct port of Google’s language detection library from Java to Python. Concerning NER, a pre-trained Greek language model from the spaCy library [65,66] was used, configured to recognize only person names. Lemmatization in Greek was carried out with the spacy-udpipe library [67]. Both the TextRank algorithm and the Jaccard similarity computation were implemented in Python; however, the latter used the CountVectorizer class from the scikit-learn library [68] to optimize matrix production. Similarly, the Walktrap algorithm was implemented inside the module using the igraph library [69], a network analysis package. Regarding abstractive summarization, which is performed first to produce the cluster summaries and then the final summary across all clusters, the GPT-4o mini model was used via the OpenAI API.
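The following sketch shows, under simplifying assumptions, how these libraries could be wired together for the core analytic steps (language filtering, lemmatization, Jaccard similarity via CountVectorizer, PageRank-based sentence ranking and Walktrap clustering on the same graph, and a GPT-4o mini call through the OpenAI API); the threshold, walk length, prompt, and model settings are illustrative, not the exact configuration used in the study:

```python
# Illustrative wiring of the libraries mentioned above; the similarity
# threshold, walk length, prompt, and model settings are assumptions.
import numpy as np
import igraph as ig
import spacy_udpipe
from langdetect import detect
from sklearn.feature_extraction.text import CountVectorizer
from openai import OpenAI

# spacy_udpipe.download("el")  # one-off download of the Greek UDPipe model
nlp_el = spacy_udpipe.load("el")

def keep_greek(sentences):
    """Language filtering: keep only sentences detected as modern Greek."""
    return [s for s in sentences if detect(s) == "el"]

def lemmatize(sentence):
    """Greek lemmatization via spacy-udpipe."""
    return " ".join(tok.lemma_ for tok in nlp_el(sentence))

def jaccard_matrix(lemmatized_sentences):
    """Pairwise Jaccard similarity over binary bags of words (CountVectorizer)."""
    X = CountVectorizer(binary=True).fit_transform(lemmatized_sentences).toarray()
    inter = X @ X.T
    union = X.sum(axis=1)[:, None] + X.sum(axis=1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float), where=union > 0)

def rank_and_cluster(sim, walk_steps=4, threshold=0.1):
    """TextRank-style ranking (weighted PageRank) and Walktrap clustering on one graph."""
    n = sim.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] >= threshold]
    g = ig.Graph(n=n, edges=edges)
    g.es["weight"] = [float(sim[i, j]) for i, j in edges]
    scores = g.pagerank(weights="weight")                                   # sentence ranking
    communities = g.community_walktrap(weights="weight", steps=walk_steps).as_clustering()
    return scores, communities

def summarize_cluster(sentences, client=None, model="gpt-4o-mini"):
    """Abstractive summary of one sentence cluster via the OpenAI API."""
    client = client or OpenAI()
    prompt = "Summarize the following student comments in a short paragraph:\n" + "\n".join(sentences)
    resp = client.chat.completions.create(model=model,
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```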
It is worth mentioning that, due to possible limitations related to the cost and availability of the third-party services used for the summarization process, the system’s architecture supports the future integration of new models, should this be required or should such models better enhance the current functionality. This is achieved through a service-oriented architecture in which a third-party summarization service can be integrated into the system via a simple wrapper module that ensures compatibility with the input/output data model.
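One possible realization of this wrapper-based integration is a thin adapter interface such as the one sketched below; the class and method names are illustrative and not part of the existing system:

```python
# Sketch of a wrapper-based integration point; class and method names are
# illustrative assumptions, not part of the existing system.
from abc import ABC, abstractmethod

class AbstractiveSummarizer(ABC):
    """Common input/output contract for any external summarization service."""

    @abstractmethod
    def summarize(self, sentences: list[str]) -> str:
        ...

class OpenAISummarizer(AbstractiveSummarizer):
    """Wrapper around the OpenAI API (GPT-4o mini) used in the current setup."""

    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def summarize(self, sentences: list[str]) -> str:
        prompt = "Summarize these student comments:\n" + "\n".join(sentences)
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

# A future third-party or locally hosted model only needs another subclass
# implementing summarize(); the rest of the pipeline remains unchanged.
```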
In terms of end users, administrators and SPDs can use this prototype system. Administrators, who are domain experts, are responsible for populating and updating the system with the required datasets, executing each of the modules, and evaluating and, where necessary, modifying the results in order to produce the final report for each study program. An additional custom software tool will be implemented in the future to help administrators manage the entire summarization task efficiently. This tool will enable summary management by providing editing, execution, and evaluation functionality for each phase of our proposed approach and for each study program. SPDs can view the reports and access the raw data in case they want to focus on a specific program module or a specific finding. The prototype system is designed to integrate with an existing application that offers quantitative information to directors. In this way, directors will have an overall view of the results of students’ responses to both closed- and open-ended questions.
7. Results
7.1. Student Feedback Results
Regarding quantitative analysis, the mean values per evaluation dimension are presented in
Figure 2, where the maximum value is 5 and the minimum is 1. As can be seen, students tend to be satisfied with their modules, with mean values ranging above the middle of the five-point scale. However, they seem less satisfied with the educational content (materials and activities).
The final summaries of student feedback (i.e., the summaries resulting from the summary evaluation described in Section 7.2 below) are more illustrative of the students’ experiences:
Summary of positive comments: “The students expressed positive feedback about the modules they attended. They mentioned gaining new knowledge related to the subject matter, which they consider important for their field. They appreciated the experience and knowledge gained from participating in project assignments and the feedback they received. They were enthusiastic about the presence and teaching of the tutors, who were excellent and inspiring, with great knowledge of the subject and always available to help with understanding the material and answering questions. Additionally, they valued the communication and effort made by the tutors in making the material comprehensible and offering their support. The online tutor-student meetings were also very helpful, and tutors provided during them extra assistance to ensure understanding. Overall, the students were impressed by the quality of tutoring and the enthusiasm of the tutors. They found the modules interesting, educational, and essential for their professional development”.
Summary of negative comments: “The students expressed negative feedback about the modules they attended. Some of the issues included excessive time required for written assignments, lack of organization in the material and tasks, outdated content, lack of reminders about submission deadlines, delays in online preparatory tutorials relative to submission dates, and insufficient time for preparation. Many students mentioned that the assignments did not help in understanding the material, and the tutorials did not align with the study guide. Additionally, some students reported feeling time pressure and being unable to correct their mistakes due to a lack of guidance. Finally, the mid-term exam was a source of stress and difficulty for many students. Overall, the students expressed dissatisfaction with the quality and design of the modules”.
Summary of improvements: “The students have suggested several improvements for the modules they attended. Some requested more educational material and audiovisual aids, while others proposed limiting the syllabus to make it easier to practice. The need for more frequent tutor-student meetings and greater guidance from tutors was also noticed. Some students expressed a desire for an upgrade in educational materials and fewer assignments, but with more meetings. Finally, there were suggestions for improving the organization of written assignments and tutoring in various areas. In conclusion, the students emphasized that they need more support and better preparation to succeed in their exams”.
A point of convergence between quantitative findings and summary feedback is that both types of data show strong satisfaction with tutors. The high mean value of tutors’ role is mirrored in the positive comments where students praised the tutors for being supportive and experts on the subject matter. The mean value for the study program design indicates a fairly positive yet somewhat moderate level of satisfaction. This is reflected in the qualitative feedback, where students complained about time management issues and suggested improvements to the syllabuses and the number of meetings. Convergence between quantitative and qualitative data is also shown in the case of educational content, where a low mean value can be interpreted in accordance with students’ complaints of frustration with the educational material and the lack of alignment between assignments and the subject matter. What is not highlighted in the summaries is students’ comments on their satisfaction with the administrative and technical services. This probably suggests that comments on these aspects were not a priority for them, either because they considered them less important or because they were quite satisfied (
Figure 2) so they did not feel the need to provide extensive comments.
7.2. Summaries’ Evaluation
In order to check the quality of the summaries produced by the proposed method, two approaches were adopted: the G-Eval framework and DeepEval summarization. Both approaches were implemented using the DeepEval tool, an open-source evaluation framework for LLMs [
70]. G-Eval is an evaluation framework for assessing the quality of summaries generated by LLMs with the use of LLMs [
71]. Traditional metrics (e.g., ROUGE) focus on surface-level similarities (e.g., word overlap), which can miss the deeper semantic content and quality of a summary. LLMs, however, can assess summaries in a more human-like manner, and this leads to more accurate and context-aware evaluations. G-Eval utilizes LLMs to perform human-like evaluations of text summaries across the following dimensions: (a) Relevance: the inclusion of important information in the summary and the exclusion of redundancies. (b) Coherence: the logical flow and organization of the summary. (c) Consistency: summary’s alignment with the facts in the source document. (d) Fluency: readability and clarity of the summary.
Nevertheless, it has been noted that LLM summaries often suffer from hallucinations, i.e., arbitrariness and bias resulting in discrepancies and inaccuracies or contradictions between the input text and the output summary [
72]. For this reason, we decided to further implement in parallel the DeepEval summarization metric which exploits the Question–Answer Generation (QAG) benchmark for summary evaluation to address the abovementioned issues [
73]. The DeepEval summarization metric aims to enhance summarization evaluation by measuring the following aspects: (a) Coverage Score. This indicates whether the summary contains the necessary information from the original text. (b) Alignment Score. This indicates whether the summary contains hallucinated information or information that contradicts the original text. The summarization score is calculated by the following equation:
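\[ \text{Summarization Score} = \min\left(\text{Alignment Score}, \text{Coverage Score}\right) \]

as defined in the DeepEval documentation, where the overall score is taken as the minimum of the alignment and coverage scores.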
The G-Eval framework was implemented according to the following steps: (a) task prompts were written, which instructed GPT-4o mini to evaluate the summaries produced according to the criteria of coherence, fluency, consistency, and relevance; (b) chains of thought (CoT) were produced as intermediate reasoning steps to be followed for each evaluation task; (c) scores were calculated for each task; and (d) reasons were provided to explain the scores.
DeepEval summarization was also implemented according to the following steps: (a) a set of questions was generated based on each summary, (b) questions were answered using both the original comments and the summaries, (c) answers in turn were compared to calculate the final score, and (d) reasons were provided to explain the scores.
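A compact sketch of how both evaluations can be run with the DeepEval library is given below; the criteria wording, judge model, and assessment questions are illustrative assumptions rather than the exact prompts used in the study:

```python
# Illustrative use of DeepEval's G-Eval and summarization metrics; the criteria
# text, judge model, and assessment questions are assumptions, not the exact
# prompts used in the study.
from deepeval.metrics import GEval, SummarizationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

source_text = "...concatenated student comments..."
summary_text = "...generated summary..."
test_case = LLMTestCase(input=source_text, actual_output=summary_text)

coherence = GEval(
    name="Coherence",
    criteria="Evaluate the logical flow and organization of the summary.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",
)
coherence.measure(test_case)
print("G-Eval coherence:", coherence.score, coherence.reason)

summarization = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o-mini",
    assessment_questions=[
        "Do students praise the availability and expertise of their tutors?",
        "Are scheduling and workload problems mentioned?",
    ],
)
summarization.measure(test_case)
print("DeepEval summarization:", summarization.score, summarization.reason)
```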
A graphical example for both evaluation methods is shown in
Figure 3. The score results for the three summaries are presented in
Table 2. The reasons that were provided by the DeepEval framework based on the obtained scores are discussed below.
According to the G-Eval scores, the summary of the positive comments is estimated to effectively capture the main topic of students’ positive experiences, although it could include more specific details from the input. It aligns closely with the main ideas and structure of the original text, capturing the essence of student feedback without introducing conflicting information. The summary is mostly free of grammatical errors and maintains clarity, but some repetition and minor formatting issues affect its overall fluency. The DeepEval summarization metric score indicates no contradictions, and the overall content remains relevant, maintaining a high level of accuracy. However, it detects that the summary includes extra information not present in the original text, which may potentially mislead the reader.
Regarding the summary of the negative comments, the G-Eval scores indicate that it captures several key points from the original text; however, it lacks specific details that could improve its relevance and coherence. The summary aligns adequately with the facts in the original text but could be improved further. Minor formatting inconsistencies and a lack of specific examples from the original text slightly affect the overall clarity. The DeepEval summarization score shows that the summary includes some extra information not present in the original text, which can confuse readers and introduce inaccuracies. Additionally, it seems to miss certain questions that the original content could answer, affecting its completeness.
For the summary of students’ suggestions, the G-Eval scores indicate that it captures the main topics from the original text; however, it lacks specific details. The summary shows some inconsistencies with the original text regarding the extent of the issues raised and could benefit from more specific details to enhance clarity. The DeepEval summarization score indicates that the summary inaccurately reflects some student suggestions compared to the original text. Despite these discrepancies, it captures the general sentiment of student feedback, which justifies the relatively high score.
Following a systematic review of the DeepEval framework’s results, the summaries were re-examined and corrections were made to convey the meaning and content of the relevant comments more accurately. However, it should be noted that, during the review, discrepancies concerning the absence of details from the comments in the summaries were not taken into account. This is because the intention of the proposed methodological framework for hybrid summarization was, from the outset, to seek and highlight the main trends in the comments. Indeed, the use of the Walktrap algorithm as a technique for improving the TextRank results was oriented in this direction. Particular attention, however, was paid to cleaning the summaries of information not mentioned in the original comments.
7.3. Walktrap Performance
In the search for the most appropriate community detection technique for the needs of this research, some of the most popular community detection algorithms for non-fully connected graphs were benchmarked. In particular, the performance of Walktrap was compared against Louvain [
74], Infomap [
75], Fast Greedy [
76], and Label Propagation (LPA) [
77]. Some representative metrics were computed to assess each algorithm’s performance and are presented in
Table 3. Walktrap can be considered the best solution due to its relatively high modularity density [78], an improved measure that combines modularity with community density, and due to the reasonable number of communities it creates relative to the average community size. Despite its comparatively low coverage [79] and comparatively high conductance [80], the primary criteria for this study are the density inside each community, the creation of discrete groups, and the community size distribution. For instance, Louvain achieves high modularity because it explicitly maximizes modularity, and it runs faster, but it produces non-deterministic results and has problems with very small communities or graphs with overlapping communities [74]. This was evident in trials conducted with various sample sizes from the dataset. In contrast, Walktrap, thanks to its random walks, is highly effective at identifying communities in small to medium-sized graphs where detailed separability is crucial; this is especially important given the typical sample size per study program. It excels when consistency and the correct detection of relationships between communities are the key concerns. Clustering was necessary to further improve the performance of TextRank by providing a meaningful number of distinct, topic-based groups. Combined with TextRank in this way, clustering captures the overall trend while minimizing the loss of students’ opinion patterns. However, the “best” approach always depends on the dataset, its size, the type of object under study, and the role clustering is expected to play.
Another point worth mentioning is the length of Walktrap’s random walks. In theory, t must be neither too large (as t tends to infinity, all nodes merge into a single community) nor too small (otherwise the random walk cannot explore the network thoroughly enough to discover more nuanced community structures). Figure 4 depicts how modularity density changes for different values of t across the three types of responses in the dataset under study. The best values of t for achieving high modularity density are 1, 2, and 4 for positive responses; 1, 2, 4, and 5 for improvement suggestions; and 1, 2, 3, 4, and 7 for negative responses. A walk length of 4 was considered a suitable compromise value for this research, allowing the communities to be explored while yielding well-defined and coherent community structures.
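The algorithm comparison and the walk-length sweep can be reproduced in outline with python-igraph, as sketched below; the random graph is only a stand-in for the real sentence-similarity graph, and standard modularity is reported because modularity density is not built into igraph and would require a custom implementation:

```python
# Sketch of the community-detection benchmark (cf. Table 3) and the walk-length
# sweep (cf. Figure 4) using python-igraph. The random graph is a stand-in for
# the real sentence-similarity graph; only standard modularity is computed.
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=200, p=0.05)   # placeholder for the similarity graph

candidates = {
    "Walktrap": lambda: g.community_walktrap(steps=4).as_clustering(),
    "Louvain": lambda: g.community_multilevel(),
    "Infomap": lambda: g.community_infomap(),
    "Fast Greedy": lambda: g.community_fastgreedy().as_clustering(),
    "LPA": lambda: g.community_label_propagation(),
}
for name, detect in candidates.items():
    clustering = detect()
    print(f"{name:12s} communities={len(clustering):3d} "
          f"modularity={g.modularity(clustering.membership):.3f}")

# Sweep the Walktrap walk length t (cf. Figure 4).
for t in range(1, 8):
    clustering = g.community_walktrap(steps=t).as_clustering()
    print(f"t={t}: communities={len(clustering)}, "
          f"modularity={g.modularity(clustering.membership):.3f}")
```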
8. Discussion
The present study aimed to develop a hybrid text summarization technique for effectively analyzing qualitative feedback from student responses to online educational surveys at the Hellenic Open University. Participants were graduate students enrolled in an annual study program during the academic year 2023–2024. The study sample comprised 1197 students who submitted evaluations, representing approximately 45.72% of the 2618 students enrolled. The final dataset for analysis consisted of 1832 responses to open-ended questions, which were processed to produce summaries reflecting students’ positive experiences, challenges, and suggestions for improvement in the educational process. The quantitative findings showed that students are generally satisfied with the tutor’s role, the study program design, and the administrative and technical services, and to a lesser degree with the educational content. Furthermore, their open-ended responses provided a richer picture.
HOU students gave positive feedback about the modules they attended, highlighting the valuable knowledge they gained. They appreciated the experience of working on project assignments and the useful feedback they received. The tutors were regarded as experts on the subject matter and were always available to help clarify educational materials and answer questions. Students also valued the tutors’ efforts to make the subject matter clear and to provide support for their studies, especially through online meetings. These findings align with qualitative research in academic distance education and online learning settings regarding students’ positive experiences. Specifically, in relation to the valuable knowledge gained from their modules, the authors of [
81] emphasized in their research that students considered the academic courses they attended to provide valuable knowledge and practical skills that were beneficial. In [
82], university students expressed satisfaction with their tutors, describing them as “always available” and “helpful,” paralleling the support observed in our findings. The emphasis on online meetings and tutor support resonates also with [
82], where students valued interactions with instructors and appreciated the support they received. Similarly, Ref. [
83] found that instructors were always available for questions and provided timely feedback. They offered flexible office hours and responded to emails promptly, making them accessible for help.
Some HOU students expressed dissatisfaction with the modules, citing disorganized and outdated educational material and assignments, excessive time required for assignments, scheduling conflicts between meetings and the study guide, and difficulties with the mid-term exams. Many felt the assignments did not help them understand the material. HOU students also mentioned a lack of guidance in correcting mistakes on their assignments. This perception of assignments as excessively time-consuming and unhelpful is echoed in [
84], where students in some HOU modules expressed negative feedback about the overwhelming number of activities, demanding workload, and limited time for study due to professional obligations. Ref. [
82] also mentions that time management was a challenge for many university students, as the higher number of assignments made it difficult to complete them on time. Additionally, Ref. [
83] found that university students mentioned inadequate tutor support and ineffective feedback on their performance, leaving them feeling unsupported and confused about their learning progress. Ref. [
82] adds that some university students needed more guidance or found that their lecturers were not always available when they needed assistance.
HOU students suggested several improvements for the modules they attended as a counterbalance to the aforementioned negative comments. These included adding more educational materials and audiovisual aids, limiting the syllabus, increasing tutor–student meetings, and providing more guidance from tutors. Some requested upgraded materials and fewer assignments but with more meetings. Additionally, they recommended better organization of written assignments and tutoring in various areas.
Although the findings seem very interesting and helpful for the SPDs, both the process of producing them and their exploitation should be highly transparent so that ethical concerns are adequately addressed. In academic open education environments, it has already been pointed out that ethical issues arise, particularly in areas such as e-learning, dropout rates, and the development of methods for predicting student success. Ref. [85] highlighted that distance education institutions must begin to study how ethics can be applied to their decision-making processes. In our case, first, the HOU informs students about the aim of each evaluation task that handles student data. Second, it protects tutors’ anonymity by using anonymization techniques in the proposed system. Finally, during the internal evaluation process of the entire study program, SPDs are requested to record in a public document the decisions they took based on the results of the students’ responses.
The positive and negative feedback from HOU students, as well as their suggestions, reflect important quality requirements related to the following dimensions of the educational process provided by HEIs:
Tutor–student interaction. Student satisfaction largely depends on the quality of interaction with the tutor [
43,
86]. The desired competencies for tutors include frequent communication with students [
43], addressing their questions, and guiding and encouraging them in their studies [
87,
88]. When tutor–student interaction is effectively practiced, it can significantly enhance the understanding of course materials and encourage community experiences through discussions, group work, and collaborative projects [
76]. This aligns with Moore’s concept of minimizing transactional distance through engagement and communication [
89].
Assignment feedback. In distance and online educational environments, students highly value timely feedback [
90]. Feedback on assignments plays a crucial role in developing students’ academic skills, motivating them, offering encouragement, and identifying areas for improvement, which may impact their performance. However, feedback can become problematic due to misconceptions, lack of clarity, and dissatisfaction among students. Issues such as unclear comments affecting self-confidence, delayed feedback, and difficulty understanding academic language have been reported as barriers to effective feedback [
91].
Student–content interaction. As confirmed by the literature, the quality of student–content interaction (e.g., reading learning articles and textbooks, writing assignments and projects, and interacting with multimedia resources) is a critical factor influencing students’ learning experience and their overall satisfaction with their studies [
87]. In distance education, students spend most of their time reading and writing, systematically studying the educational content to understand the subject matter and further develop their cognitive skills. Therefore, educational material should promote student reflection, discussion, and information exchange with the aim of collaborative knowledge-building, creative problem-solving, and hypothesis formulation—elements that promote critical thinking [
92]. Poor quality and structure of the educational material and activities can significantly limit the development of students’ critical thinking skills [
93].
Module design. The quality of a study program’s design and structure can drastically affect student satisfaction. A program’s structure, including its objectives, teaching and learning methods, and assessment approaches, is crucial. Key factors influencing the design and structure of a study program include content presentation, learning paths, personalized guidance, and student motivation. Rigid study programs tend to limit communication and interaction, while more flexible programs can reduce transactional distance, fostering better communication and engagement [
87].
The evaluation of the generated summaries was conducted using the G-Eval framework and DeepEval metrics. G-Eval demonstrated that the summary of positive comments was highly relevant and coherent and effectively captured the students’ positive experiences, while the negative comment summary required improvement in specificity and detail. When assessed by DeepEval, the summaries displayed a high level of alignment with the input without introducing major inaccuracies, although some minor extraneous information was noted. In parallel, the Walktrap algorithm excelled in identifying thematic communities within the feedback, distinguishing trends and capturing significant insights about the students’ experiences. Its performance was measured against other community detection algorithms, revealing that Walktrap achieved a favorable balance between community density and modularity, proving it the most effective method for the context of this research. This dual approach of combining community detection with summarization metrics allows for the production of diverse, content-rich, and well-grounded summaries.
9. Limitations and Future Work
This study faces some limitations. One limitation is the response rate of 45.72%, as less than half of the target population participated. This may introduce a potential bias, as the sample may not fully represent the views of all students, particularly those who chose not to participate. Additionally, open-ended questions may have led to the underreporting of negative feedback, as students might have hesitated to share criticisms or felt restricted in expressing concerns due to fears of academic repercussions, despite assurances of anonymity. The cross-sectional design further limits this research, as it captures student feedback at only one point in time, preventing the assessment of changes over time. Moreover, although the qualitative findings reinforce the quantitative evaluation results, the analysis does not fully address the complexity and subjectivity of qualitative data, as it does not explore variations in feedback related to the study context and student characteristics (e.g., gender, attendance in compulsory or prerequisite modules, and grade performance in assignments). Additionally, the performance of the TextRank and Walktrap algorithms used for summarization and clustering may vary depending on the data’s characteristics, and the study did not explore alternative datasets to further assess the algorithms’ performance.
It is suggested that future research delve deeper into the creation of custom natural language processing models especially suited to the linguistic subtleties and contextual information present in Greek-language student feedback. This could help resolve the text pre-processing errors discovered when using generic tools such as UDPipe. Exploring the integration of sentiment analysis to assess the emotional tone of student comments could also be beneficial for gaining a more nuanced understanding of students’ experiences and satisfaction levels. Both UDPipe’s Greek lemmatization and the filtering of comments to detect noise in the data currently require manual correction, which is time-consuming. A more automated technique is needed to speed up the process while retaining all the important data and filtering out comments that do not contribute to the analysis. Another interesting prospect is processing English comments in parallel or, even better, analyzing them jointly with the Greek ones.
Advanced NLP techniques such as word embeddings can be very useful in addressing such issues [94]. Word embeddings can be used in extractive summarization instead of Jaccard similarity measures to capture semantic relationships. When combined with TextRank and Walktrap, they could yield more human-like summarization. Experiments show that this representation improves similarity detection and can be combined with TextRank [72] for general trend detection and with Walktrap for more effective community separation.
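A brief sketch of this direction is given below, assuming a multilingual sentence-embedding model (the model name and example sentences are illustrative): cosine similarities between sentence embeddings replace the Jaccard measure, and the resulting graph feeds the same PageRank-based ranking and Walktrap clustering as in the current pipeline:

```python
# Sketch: sentence embeddings in place of Jaccard similarity; the multilingual
# model name and the example sentences are illustrative assumptions.
import numpy as np
import igraph as ig
from sentence_transformers import SentenceTransformer

sentences = [
    "Οι καθηγητές ήταν πάντα διαθέσιμοι.",      # "The tutors were always available."
    "Το εκπαιδευτικό υλικό ήταν ξεπερασμένο.",  # "The educational material was outdated."
    "Χρειάζονται περισσότερες συναντήσεις.",    # "More meetings are needed."
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(sentences, normalize_embeddings=True)
sim = np.clip(emb @ emb.T, 0.0, None)           # cosine similarity, negatives clipped
np.fill_diagonal(sim, 0.0)

n = len(sentences)
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] > 0]
g = ig.Graph(n=n, edges=edges)
g.es["weight"] = [float(sim[i, j]) for i, j in edges]

ranking = g.pagerank(weights="weight")                                      # TextRank-style ranking
clusters = g.community_walktrap(weights="weight", steps=4).as_clustering()  # Walktrap clustering
print(sorted(zip(ranking, sentences), reverse=True))
print(clusters.membership)
```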