1. Introduction
The landscape of academic education has evolved rapidly in recent years. During the COVID-19 pandemic, many higher education institutions (HEIs) worldwide were forced to switch to technologies to support the educational process remotely, which greatly boosted the spread of academic distance education and e-learning [
1]. Under these circumstances, enhancing the quality of academic education services provided was a priority for many HEIs. To achieve this, student performance was evaluated through various methods, focusing on students’ grades and their interaction with learning management systems, their engagement with online resources, the frequency and quality of their participation in virtual discussions, and their ability to complete assignments and assessments within the platform [
2,
3,
4]. Among the indicators for assessing the quality of academic education are also student evaluations of teaching effectiveness, the quality of student attendance, the content and design of curricula, and so on. In this context, the systematic collection and analysis of student feedback have emerged as key practices in the operational and strategic decision-making processes of HEIs. Therefore, the emphasis on students’ evaluation of their studies not only contributes to curriculum improvement and faculty performance evaluation, but it is also an important indicator for assessing the overall educational experience.
Recent developments have made online surveys a widespread method for collecting educational evaluation data. This method provides many significant advantages over traditional pen-and-paper surveys, including faster data collection, increased accessibility via mobile devices, and the ability to reach a broader student sample. In addition, online survey platforms provide researchers with the flexibility to use a variety of question formats, enhancing the richness and depth of the data collected. However, despite these advantages, implementing online surveys often presents issues such as low response rates, potential biases in sample representation, and the complexity of analyzing qualitative responses in large datasets [
5]. These challenges underscore the need for continued examination of methodologies and practices for collecting and interpreting educational evaluation data to improve the educational process of HEIs.
The prevalence of closed-ended questions in survey designs has lent a degree of objectivity and convenience to quantitative analyses. However, this reliance brings limitations. Quantitative data often follow predetermined research patterns, potentially limiting the exploration of emerging themes and insights that could enrich the understanding of students’ experiences [6]. In contrast, open-ended questions offer an effective way to more systematically elicit students’ views and experiences in their own words. However, processing and analyzing qualitative data poses inherent difficulties, especially in large-scale studies, as it requires rigorous coding and thematic analysis, which can be particularly challenging and time-consuming with large samples [
7].
In recent years, significant developments in text mining (TM) and natural language processing (NLP) have greatly enhanced the structured analysis of data collected from large text corpora. A promising method that can contribute to the systematic analysis of qualitative data in education is automatic text summarization [
8]. The present study introduces a hybrid text summarization technique to address the challenges of processing large volumes of university students’ feedback. The technique combines the TextRank and Walktrap algorithms with the GPT-4o mini model. Student feedback was obtained from an online educational survey at the Hellenic Open University (HOU) in Greece, focusing on critical dimensions such as tutors’ role, the quality of educational content, the design of academic modules, and the quality of administrative and technical services. Evaluated using the G-Eval framework and DeepEval summarization metrics, the technique demonstrates its ability to provide relevant, coherent, and comprehensive summaries, offering a deeper understanding of student experiences. The findings revealed that, while students were generally satisfied with tutor–student interactions, there were negative comments regarding the problematic content of some modules and scheduling issues. In summary, the results generated by this approach offer valuable insights not only for the HOU but also for HEIs in general. By providing a clearer understanding of the factors that positively or negatively affect the student experience in academic distance learning, the technique supports informed decision-making and paves the way for ongoing improvements in educational processes and practices across the field. Using the proposed technique, HEIs can quickly extract and condense large volumes of qualitative data, making student feedback more accessible to deans, tutors, and decision-makers. This technique helps identify data trends by faculty or program, and when combined with quantitative evaluation results, it provides a deeper understanding of student experiences. Such an approach could lead to improvements in tutors’ role, educational materials, course design, administrative services, and other dimensions related to the quality assurance of the academic educational process.
The rest of this paper is organized as follows:
Section 2 provides an overview of the analysis of qualitative feedback based on student opinions collected through online educational evaluation surveys, discussing the benefits, challenges, and methods related to these evaluations.
Section 3 describes the context in which the study was conducted, the research questions, and the contribution of the proposed technique to academic education.
Section 4 outlines the research design, the participants, and the sample involved in the study, while
Section 5 and
Section 6 present in detail the design and implementation of the proposed hybrid summarization model, respectively. The findings from the application of the technique are analyzed in
Section 7, followed by a discussion of these findings in
Section 8. Finally, Section 9 discusses the study’s limitations, along with some ideas for future research directions, emphasizing the potential for improving the summarization processes used.
2. Analyzing University Student Feedback from Online Educational Surveys: Benefits, Challenges, and Methods
Student feedback in academic education is an important factor in investigating the quality of services provided by HEIs. Students’ opinions are mainly concerned with evaluations of faculty effectiveness and program quality [
9,
10,
11] as well as student satisfaction [
12]. These evaluations are routinely carried out using either internationally recognized and valid questionnaires or questionnaires constructed and adapted to the requirements of the institution concerned. HEIs attach particular importance to the student feedback they collect because it helps them to make decisions regarding the recruitment or promotion of academic staff, changes in course content, and improvements in the administrative and technical services provided.
In recent years, conducting evaluations in HEIs to collect feedback from students’ opinions has been greatly enhanced by the use of online surveys. Several online tools are commercially available, but many HEIs have also developed infrastructures that support their in-house online surveys. The major advantages of online surveys are the speed of data collection, the flexibility of location and time of completion, the ability to participate in the survey through mobile devices, the ability to reach a large sample of participants, etc. In addition, they provide researchers with ease of creating and structuring questionnaires through a range of question types and functionalities for data visualization and monitoring during the collection of the responses [
5,
13]. Their drawbacks include often-low response rates, arbitrary research design choices regarding sample representativeness, unclear questions or instructions, the collection of unwanted responses (e.g., insulting comments), and privacy assurance issues [
5,
14,
15].
In higher education, online evaluation questionnaires with closed-ended questions are by far the preferred instrument. These questions serve as indicators for measuring factors such as student–tutor interactions, program materials, course organization, and instructional design. Online questionnaires are effective because they allow for the rapid collection and analysis of data from almost the entire population and for comparing results over time [
16]. In addition, including rating items allows for easier interpretation of findings regarding tutor effectiveness and other dimensions of the educational process, and subsequent decision-making [
17].
However, it has been pointed out that quantitative methods are characterized by their inherently closed nature, meaning that if a topic or dimension is not anticipated in the survey design, it will not appear in the results. More generally, the tendency of researchers to rely on previous quantitative studies and to confirm existing research hypotheses often leads to the replication of a closed variable structure that is systematically reused in different contexts. This approach can simply recycle existing knowledge without any innovation, thereby limiting the possibility of generating new knowledge [6]. Although open-ended questions can provide valuable qualitative data on student feedback, they are often considered time-consuming and demanding to process. As a result, they may be unsuitable for many researchers conducting large-scale surveys [
11].
Including open-ended questions in online surveys seems to have many additional benefits compared to closed-ended responses. Open-ended questions provide valuable information and complement the quantitative findings in the survey. They are useful when exploring new topics or concept dimensions as they may reflect real-life experiences [
18]. Through open-ended questions, respondents are not limited by a predetermined set of closed options and they can express their thoughts in their own words, often producing rich information [
19]. In addition, respondents can create themes and sub-themes themselves, which may contradict or extend the theoretical assumptions of the researchers on the issue under investigation [
6,
20]. However, the completion rates for open-ended responses to electronic questionnaires are often lower than for closed-ended responses [
13]. Issues such as the design of the questionnaire, the available space for typing, the instructions for completion [
21,
22], the interest of the respondent, and the cognitive load required to answer can affect the length and quality of open-ended responses [
7,
19].
In large-scale online surveys, open-ended responses pose significant challenges to researchers. Processing qualitative data requires coding them into categories through the identification of recurring patterns and ideas [
23]. The categories of analysis are shaped by the relevant literature and the data itself, ensuring logical consistency. This process is time-consuming, requiring time to review the literature, pilot the coding scheme, and train raters in its use. Rater training is essential to ensure agreement in coding the data and thus the reliability of the survey [
24]. In small samples, coding can be carried out manually, with two or more people coding independently to assess inter-rater reliability. In large-scale qualitative surveys, coding often requires the ad hoc creation of broad categories, leaving a wealth of data unused [
25]. Furthermore, this type of survey involves dividing the labor among multiple teams with specific responsibilities, which may facilitate survey management but often creates problems in communicating and translating information when the teams do not interact effectively with each other [
25,
26]. In addition, processing and coding qualitative data from online surveys becomes particularly challenging when responses are unclear [
27].
The use of computer-assisted qualitative data analysis software (CAQDAS), such as Atlas.ti 8.0, NVivo 14.0, or MAXQDA 24.4.1, can help reduce the issues associated with analyzing qualitative data, but it requires careful oversight to ensure the accuracy of the classification. Although these tools help with text tagging, the coding process remains time-consuming and demanding for large samples. In addition, these tools cannot be fully trained to automatically identify the appropriate pieces of text for coding. Therefore, individual decisions by researchers become crucial in this case, and agreement among raters on an acceptable coding scheme remains a common strategy to minimize subjective errors and bias [
26].
Advances in NLP and TM have enhanced the structured analysis of data collected from large corpora. Ref. [
28] stresses the significant role of NLP and TM in enhancing online education. By analyzing large amounts of unstructured data from forums, essays, and assessments, these technologies provide valuable information for improving teaching, learning, and student engagement. Key areas of application include automated grading, feedback generation, plagiarism detection, and personalized learning recommendations. Additionally, TM aids in creating educational content, predicting student dropout risks, and enhancing information retrieval. Ref. [
29] lists the following as the most common goals cited in the literature for applying NLP and TM techniques to student feedback: (a) sentiment prediction, (b) category and rating prediction complemented with sentiment, (c) emotion prediction and analysis, (d) opinion mining, (e) lexicon construction, and (f) statistical or mathematical analysis. The literature distinguishes between supervised and unsupervised machine learning methods [30]. In supervised methods, human raters first code a subset of training data, which is then used to train an algorithm to predict categories for the remaining, uncategorized text responses. Although these methods require modeling and prediction expertise, they offer low cost and fast execution [
7]. Unsupervised methods extract information from text mainly at an exploratory level, searching for common underlying themes, and do not require manual effort. These themes lead to a better understanding of the conceptual structure of the corpus, help clean it of unwanted information, and support coding that highlights the key concepts in the analysis. For a brief presentation of these approaches, see also [
31].
In a review of advancements in NLP in opinion mining of customer reviews during the last decade, Ref. [
32] stresses the shift from traditional machine learning to deep learning models. Key areas of NLP implementation in customer reviews include sentiment analysis and opinion mining, fake review detection to identify fraudulent feedback, customer experience and satisfaction analysis, user profiling and recommendation systems, marketing, and brand management. Additionally, the emergence of technologies like transformers (e.g., BERT) enables deeper and more sophisticated analysis. According to [
33], sentiment analysis has been widely adopted in education to analyze student feedback. It helps evaluate student engagement, infrastructure limitations, course effectiveness, and policy decisions. Techniques like aspect-based and entity-level analysis allow for personalized learning by understanding individual student issues. Sentiment and emotional analysis categorize feedback into various levels, aiding in teacher evaluation and pedagogical improvements. Additionally, predictive models have been used to correlate sentiment with academic performance, helping institutions identify at-risk students early.
In addition to the above methods, an approach that has grown in recent years is automated text summarization. This is the process of condensing a large amount of text data into a shorter, more concise form, allowing the main points to be grasped without reading the entire text. This technique saves time when large volumes of text must be managed. The procedure of automated text summarization usually includes three stages: pre-processing, processing, and post-processing. In pre-processing, words, sentences, and structural elements of the text are identified as input units. In processing, the input text is converted into a summary using a summarization method. Finally, in post-processing, any problems in the generated summary are corrected [
34].
There are different types of automated text summaries depending on the criteria applied to produce them [
8]. We distinguish between single- and multi-document summaries depending on whether the summary results from one or more documents. We also speak of a language-specific summary if all documents and their summaries are written in the same language (e.g., Greek), or multilingual, if, for example, the documents are written in Greek and English but the summary must be in Greek, or cross-lingual, if all documents are written in Greek and the summary is in English. Summaries can also be flat (a single summary is produced with no intermediate summaries) or hierarchical (multiple levels of summaries are made, e.g., abstract and extended summary, allowing for zooming in and out of the text), general purpose or query-based (highlighting the parts of the text related to a query). Finally, summaries can be extractive or abstractive. In an extractive summary, the most important sentences from the original text are selected and prioritized into a shorter summary; in an abstractive summary, the most important text information is reworded to create the final summary [
35]. Although abstractive summarization alone often performs better than extractive [
36], it is usually difficult to implement. The emergence of Large Language Model (LLM) technology has dramatically advanced the field of abstractive summarization. However, the high hardware requirements and specifications, as well as the need for large amounts of training data, can act as a deterrent: the required data may not be readily available in various domains, or the volume of data involved may simply not be large enough to warrant such models. As [
34] states, due to the complexity required to produce an abstract summary, recent research in the field has focused on extractive techniques.
Hybrid summarization is a combination of extractive and abstractive techniques with the aim of potentially producing more coherent and contextually relevant summaries. However, the quality of the final summary heavily depends on the initial extractive step, which can lead to lower-quality abstractive summaries that fail to fully capture the original text’s depth compared to pure abstractive methods. Despite these drawbacks, hybrid summarization often yields more coherent results than pure abstractive approaches, as it builds upon extracted sentences that already contain essential information [
37]. A systematic 12-year literature review by [
28] reveals that text summarization is not yet a widely adopted text mining technique in education. This study is expected to contribute significantly to the development and utilization of text summarization, particularly of hybrid approaches, in the field of academic education.
3. Context and Research Questions of the Present Study
The research was conducted in Greece at the HOU. The HOU is the only HEI in the country that offers exclusively distance education at undergraduate and postgraduate levels. With approximately 40,000 active tuition-paying students, whose average age is 35 years, the HOU is currently the largest institution of its kind in Greece. Academic studies at the HOU promote autonomous and self-directed learning, which is facilitated by tutor–student, student–student, and student–content interactions [
38,
39] within a technologically supported environment. The HOU offers distance academic education in various subjects, with study programs structured in annual or semester modules. For each module, students have access to specific educational content (educational materials such as books and other digital educational resources, as well as educational activities such as written assignments, projects, and laboratory exercises). The learning path is defined by study guides and time schedules that accompany the educational content. It is the learning material that plays the primary instructional role in the student’s learning process and not the faculty members [
40]. Also, activities supported by adequate feedback play a vital role in helping students to learn [
41]. Faculty members, in this distance learning context, assume the role of tutors, who do not lecture but advise and guide students in their studies [
42]. More specifically, the role of tutors is to provide clear explanations to student queries, assist students with the comprehension of educational material, guide them through educational activities, offer constructive written feedback on their work, and maintain effective communication with them. Additionally, tutors should encourage and motivate students, while effectively utilizing the educational platforms to enhance the learning experience. Attendance at the HOU is accompanied by online tutor–student meetings during which tutors are able to guide students through the activities, resolve their questions, and encourage them to continue their studies [
43]. In addition to these online meetings, interaction with tutors and the educational content is supported asynchronously through educational platforms, which are customized versions of the Moodle 4.0 LMS.
In the aforementioned context, students are invited to anonymously evaluate their experience of the modules they attended through an online questionnaire, which includes four evaluation strands comprising criteria adopted after an extensive literature review on student satisfaction in academic distance education and e-learning settings: (a) the role of the tutor (evaluation criteria: clarity in resolving queries, assistance in understanding educational material, guidance in completing educational activities, constructive written feedback, communication initiatives during the academic year/semester, encouragement during studies, and utilization of the educational platform), (b) the educational content, i.e., educational material and activities (evaluation criteria: material alignment with learning outcomes, material’s contribution to understanding the subject matter, material’s contribution to completing educational activities, and contribution of activities to learning outcomes), (c) the module design (evaluation criteria: clarity of module objectives, feasibility of the study schedule, clarity of assessment and grading criteria, and contribution of tutor–student meetings to subject-matter comprehension), and (d) the administrative and technical services (evaluation criteria: satisfaction with student registry support, satisfaction with technical support (helpdesk), usability of the educational platforms, and usability of the HOU website). These criteria have been formulated as 20 closed-ended questions. In addition, the questionnaire includes three open-ended questions that encourage students to articulate the most important positive and negative aspects of the module, as well as propose suggestions for improvements. These questions aim to gather additional information about students’ learning experiences, which the closed-ended structure of the questionnaire may not anticipate.
The large volume of qualitative data resulting from student feedback, as well as its analysis, remains a challenge for the HOU. Various text mining techniques have previously been applied to either detect faculty member and student satisfaction with remote online examinations [
44,
45] or student satisfaction with master’s thesis supervision [
46]. However, a comprehensive approach in terms of methodology and scale regarding the processing and analysis of qualitative data collected from online surveys is lacking. The findings from such an approach will provide valuable feedback to both the academic community and the administration, not only at the HOU but also at other HEIs. This feedback can assist in making decisions regarding collaboration with tutors, updates of educational materials and activities, course redesign, and the improvement of administrative and technical services.
For the present study, the key beneficiaries of text summarization will be the HOU’s Study Program Directors (SPDs). An SPD is a faculty member whose main goal is to ensure objective assessment of student progress, promote scientific research, and develop technology and methodology in lifelong distance learning. The SPD assigns thesis projects, suggests other faculty members for the role of module coordinator and coordinator assistant, collaborates with administrative services, and contributes to the improvement of educational material. Additionally, the SPD is responsible for regular communication with module coordinators and tutors. They collect reports to inform the University Administration about the progress of the study program, highlighting issues and proposing improvements. The SPD also coordinates the study program certification processes according to HOU standards, ensuring the implementation of guidelines from the Quality Assurance Unit (QAU) and the Internal Evaluation Teams (IETs).
Considering the context of the HOU’s evaluation of the distance education process and the challenges posed by open-ended questions, this study aims to assess the satisfaction of students in an annual study program by analyzing their feedback from open-ended responses to an online evaluation questionnaire using hybrid text summarization. More specifically, the research questions posed are as follows:
- (1)
What are the students’ positive experiences of taking their modules?
- (2)
What aspects of module attendance do students find challenging or unsatisfactory?
- (3)
How can the modules be improved?
- (4)
How effective is the proposed summarization technique in capturing student feedback with respect to the abovementioned research questions?
Regarding research questions 1–3, these aim to highlight qualitative feedback related to student satisfaction with dimensions of the educational process, as reflected in the evaluation strands and criteria described above. The qualitative student feedback will be compared with the corresponding quantitative feedback from the evaluation questionnaire in order to identify convergences or divergences and to strengthen, where necessary, the relevant findings regarding student satisfaction with the modules. Regarding research question 4, the hybrid summarization technique will be examined with respect to its ability to condense large volumes of student feedback into coherent themes while ensuring that the analysis remains focused on the most important issues that positively or negatively affect the learning process.
6. Implementation
The implementation of the above flowchart took place within the HOU environment, aiming to host most of the models inside the university’s own infrastructure and to limit dependencies on third-party cloud models due to anonymization restrictions and the institution’s data policy. The future goal is to build a system that can be used to analyze all open-ended questions from all evaluations conducted at the university, regardless of the study program. However, the present study focused on one postgraduate study program.
The system consists of two core modules: the Response Preparation Module and the Survey Summarization Module. They were implemented in Python 3 as RESTful services using the Flask Python library [63]. A MySQL database management system was used to store all the required data. The administration and the production of summaries were provided through a web-based application built with PHP.
The first module (Response Preparation) moderates the data preparation steps: it takes as input the raw data of the students’ responses and returns a well-formed dataset containing the initial response, the sentences of each response, and the lemmas of each sentence. The dataset is filtered and cleaned appropriately. This module includes sentence preparation, anonymization, and lemmatization as internal components. The second module (Survey Summarization) is responsible for producing a summary of the responses to open-ended questions. It takes as input the dataset of clean, lemmatized sentences and computes the similarity matrix between the sentences to produce a graph. Next, it invokes the Sentence Ranking and Sentence Clustering internal components, which rank all the sentences and cluster them, respectively. In this way, the module can highlight the most important sentences and prepare a dataset of clustered sentences for abstractive summarization. Finally, it produces the final report with the summary of the responses. More specifically, the report presents an informative quantitative paragraph reporting the number of answers, the number of sentences extracted from them, and the number of clean sentences that were finally used in the summarization process; this is followed by an abstractive summary that briefly describes the trends found in the responses, and finally by a list of indicative and representative sentences selected from the students’ responses. When the summarization must be abstractive, the service invokes an external AI summarization service.
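To make the module boundaries concrete, the following is a minimal sketch of how the two modules could be exposed as RESTful Flask services; the endpoint names, payload fields, and the naive sentence splitter are illustrative assumptions rather than the actual HOU implementation:

```python
# Minimal sketch of the two core modules as RESTful Flask services; endpoint
# names, payload fields, and the naive sentence splitter are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def split_sentences(response: str) -> list[str]:
    """Placeholder sentence preparation; anonymization and lemmatization plug in here."""
    return [s.strip() for s in response.split(".") if s.strip()]

@app.route("/prepare-responses", methods=["POST"])
def prepare_responses():
    """Response Preparation Module: raw student responses -> cleaned sentence dataset."""
    raw = request.get_json()["responses"]
    dataset = [{"response": r, "sentences": split_sentences(r)} for r in raw]
    return jsonify(dataset)

@app.route("/summarize-survey", methods=["POST"])
def summarize_survey():
    """Survey Summarization Module: sentence dataset -> summary report skeleton."""
    dataset = request.get_json()["dataset"]
    sentences = [s for item in dataset for s in item["sentences"]]
    report = {
        "n_responses": len(dataset),
        "n_sentences": len(sentences),
        # The similarity graph, TextRank ranking, Walktrap clustering, and the
        # abstractive (GPT-4o mini) summary are produced by the components
        # described in the remainder of this section.
    }
    return jsonify(report)

if __name__ == "__main__":
    app.run(port=5000)
```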
The required input data for the whole process consist of the following datasets:
Students’ responses to an open-ended question;
Replacement rules in the form of ‘find and replace’ pairs;
Words to be excluded from NER analysis;
List of study programs, course modules, and tutors of each module (as an input for the custom Name Replacement algorithm);
Lemmas to be excluded from lemmatization;
Lemmas to be corrected before lemmatization;
Custom POS definitions for particular lemmas.
Only two of the above datasets are replaced each time a new question is analyzed (the students’ responses and, occasionally, the list of study programs, modules, and tutors), while the remaining datasets are domain-specific and rarely updated.
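For illustration, the configuration datasets listed above could be represented as simple structures of the following shape; all field names and values are hypothetical and do not reflect the actual HOU data:

```python
# Hypothetical shapes for the configuration datasets listed above; all names
# and values are illustrative, not the actual HOU configuration.
replacement_rules = [                      # 'find and replace' pairs
    {"find": "ΕΑΠ", "replace": "HOU"},
]
ner_excluded_words = ["Moodle", "Python"]  # never treated as person names
program_structure = {                      # study program -> modules -> tutors
    "Example Program": {"Module A": ["Tutor 1", "Tutor 2"]},
}
lemma_exclusions = ["email"]               # tokens to skip during lemmatization
lemma_corrections = {"μαθήμα": "μάθημα"}   # misspellings fixed before lemmatization
custom_pos = {"ΕΑΠ": "PROPN"}              # POS overrides for particular lemmas
```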
Some critical libraries and external services were also integrated into the system. First, concerning data cleaning, langdetect 1.0.9 [64] was used to detect responses written in languages other than modern Greek, the language the system operates in. This library is a direct port of Google’s language detection library from Java to Python. Concerning NER, a pre-trained Greek language model from the spaCy library [65,66] was used, configured to recognize only person names. Lemmatization in Greek was carried out with the spacy-udpipe library [67]. Both the TextRank algorithm and the Jaccard similarity computation were implemented in Python; however, the latter used the CountVectorizer class from the scikit-learn library [68] to optimize matrix production. Similarly, the Walktrap algorithm was implemented inside the module using the igraph library [69], a network analysis package. Regarding abstractive summarization, which is performed first to produce the cluster summaries and then the final summary across all clusters, the GPT-4o mini model was used via the OpenAI API.
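The following sketch shows, under simplifying assumptions, how these libraries could be wired together for the core analytic steps (language filtering, lemmatization, Jaccard similarity via CountVectorizer, PageRank-based sentence ranking and Walktrap clustering on the same graph, and a GPT-4o mini call through the OpenAI API); the threshold, walk length, prompt, and model settings are illustrative, not the exact configuration used in the study:

```python
# Illustrative wiring of the libraries mentioned above; the similarity
# threshold, walk length, prompt, and model settings are assumptions.
import numpy as np
import igraph as ig
import spacy_udpipe
from langdetect import detect
from sklearn.feature_extraction.text import CountVectorizer
from openai import OpenAI

# spacy_udpipe.download("el")  # one-off download of the Greek UDPipe model
nlp_el = spacy_udpipe.load("el")

def keep_greek(sentences):
    """Language filtering: keep only sentences detected as modern Greek."""
    return [s for s in sentences if detect(s) == "el"]

def lemmatize(sentence):
    """Greek lemmatization via spacy-udpipe."""
    return " ".join(tok.lemma_ for tok in nlp_el(sentence))

def jaccard_matrix(lemmatized_sentences):
    """Pairwise Jaccard similarity over binary bags of words (CountVectorizer)."""
    X = CountVectorizer(binary=True).fit_transform(lemmatized_sentences).toarray()
    inter = X @ X.T
    union = X.sum(axis=1)[:, None] + X.sum(axis=1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter, dtype=float), where=union > 0)

def rank_and_cluster(sim, walk_steps=4, threshold=0.1):
    """TextRank-style ranking (weighted PageRank) and Walktrap clustering on one graph."""
    n = sim.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] >= threshold]
    g = ig.Graph(n=n, edges=edges)
    g.es["weight"] = [float(sim[i, j]) for i, j in edges]
    scores = g.pagerank(weights="weight")                                   # sentence ranking
    communities = g.community_walktrap(weights="weight", steps=walk_steps).as_clustering()
    return scores, communities

def summarize_cluster(sentences, client=None, model="gpt-4o-mini"):
    """Abstractive summary of one sentence cluster via the OpenAI API."""
    client = client or OpenAI()
    prompt = "Summarize the following student comments in a short paragraph:\n" + "\n".join(sentences)
    resp = client.chat.completions.create(model=model,
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```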
It is worth mentioning that, due to possible limitations related to the cost and availability of the third-party services used for the summarization process, the system’s architecture supports the future integration of new models, should this be required or should such models better enhance the current functionality. This is achieved through a service-oriented architecture in which a third-party summarization service can be integrated into the system via a simple wrapper module that ensures compatibility with the input/output data model.
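One possible realization of this wrapper-based integration is a thin adapter interface such as the one sketched below; the class and method names are illustrative and not part of the existing system:

```python
# Sketch of a wrapper-based integration point; class and method names are
# illustrative assumptions, not part of the existing system.
from abc import ABC, abstractmethod

class AbstractiveSummarizer(ABC):
    """Common input/output contract for any external summarization service."""

    @abstractmethod
    def summarize(self, sentences: list[str]) -> str:
        ...

class OpenAISummarizer(AbstractiveSummarizer):
    """Wrapper around the OpenAI API (GPT-4o mini) used in the current setup."""

    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def summarize(self, sentences: list[str]) -> str:
        prompt = "Summarize these student comments:\n" + "\n".join(sentences)
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

# A future third-party or locally hosted model only needs another subclass
# implementing summarize(); the rest of the pipeline remains unchanged.
```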
In terms of end users, administrators and SPDs can use this prototype system. Administrators, who are domain experts, are responsible for populating and updating the system with the required datasets, executing each of the modules, and evaluating and, where necessary, modifying the results in order to produce the final report for each study program. An additional custom software tool will be implemented in the future to help administrators manage the entire summarization task efficiently. This tool will enable summary management by providing editing, execution, and evaluation functionality for each phase of our proposed approach and for each study program. SPDs can view the reports and access the raw data in case they want to focus on a specific program module or a specific finding. The prototype system is designed to integrate with an existing application that offers quantitative information to directors. In this way, directors will have an overall view of the results of students’ responses to both closed- and open-ended questions.
7. Results
7.1. Student Feedback Results
Regarding quantitative analysis, the mean values per evaluation dimension are presented in
Figure 2, where the maximum value is 5 and the minimum is 1. As can be seen, students tend to be satisfied with their modules, with mean values ranging above the middle of the five-point scale. However, they seem less satisfied with the educational content (materials and activities).
The final summaries of student feedback (i.e., the summaries resulting from the summary evaluation described in Section 7.2 below) are more illustrative of the students’ experiences:
Summary of positive comments: “The students expressed positive feedback about the modules they attended. They mentioned gaining new knowledge related to the subject matter, which they consider important for their field. They appreciated the experience and knowledge gained from participating in project assignments and the feedback they received. They were enthusiastic about the presence and teaching of the tutors, who were excellent and inspiring, with great knowledge of the subject and always available to help with understanding the material and answering questions. Additionally, they valued the communication and effort made by the tutors in making the material comprehensible and offering their support. The online tutor-student meetings were also very helpful, and tutors provided during them extra assistance to ensure understanding. Overall, the students were impressed by the quality of tutoring and the enthusiasm of the tutors. They found the modules interesting, educational, and essential for their professional development”.
Summary of negative comments: “The students expressed negative feedback about the modules they attended. Some of the issues included excessive time required for written assignments, lack of organization in the material and tasks, outdated content, lack of reminders about submission deadlines, delays in online preparatory tutorials relative to submission dates, and insufficient time for preparation. Many students mentioned that the assignments did not help in understanding the material, and the tutorials did not align with the study guide. Additionally, some students reported feeling time pressure and being unable to correct their mistakes due to a lack of guidance. Finally, the mid-term exam was a source of stress and difficulty for many students. Overall, the students expressed dissatisfaction with the quality and design of the modules”.
Summary of improvements: “The students have suggested several improvements for the modules they attended. Some requested more educational material and audiovisual aids, while others proposed limiting the syllabus to make it easier to practice. The need for more frequent tutor-student meetings and greater guidance from tutors was also noticed. Some students expressed a desire for an upgrade in educational materials and fewer assignments, but with more meetings. Finally, there were suggestions for improving the organization of written assignments and tutoring in various areas. In conclusion, the students emphasized that they need more support and better preparation to succeed in their exams”.
A point of convergence between quantitative findings and summary feedback is that both types of data show strong satisfaction with tutors. The high mean value of tutors’ role is mirrored in the positive comments where students praised the tutors for being supportive and experts on the subject matter. The mean value for the study program design indicates a fairly positive yet somewhat moderate level of satisfaction. This is reflected in the qualitative feedback, where students complained about time management issues and suggested improvements to the syllabuses and the number of meetings. Convergence between quantitative and qualitative data is also shown in the case of educational content, where a low mean value can be interpreted in accordance with students’ complaints of frustration with the educational material and the lack of alignment between assignments and the subject matter. What is not highlighted in the summaries is students’ comments on their satisfaction with the administrative and technical services. This probably suggests that comments on these aspects were not a priority for them, either because they considered them less important or because they were quite satisfied (
Figure 2) so they did not feel the need to provide extensive comments.
7.2. Summaries’ Evaluation
In order to check the quality of the summaries produced by the proposed method, two approaches were adopted: the G-Eval framework and DeepEval summarization. Both approaches were implemented using the DeepEval tool, an open-source evaluation framework for LLMs [
70]. G-Eval is an evaluation framework for assessing the quality of summaries generated by LLMs with the use of LLMs [
71]. Traditional metrics (e.g., ROUGE) focus on surface-level similarities (e.g., word overlap), which can miss the deeper semantic content and quality of a summary. LLMs, however, can assess summaries in a more human-like manner, and this leads to more accurate and context-aware evaluations. G-Eval utilizes LLMs to perform human-like evaluations of text summaries across the following dimensions: (a) Relevance: the inclusion of important information in the summary and the exclusion of redundancies. (b) Coherence: the logical flow and organization of the summary. (c) Consistency: summary’s alignment with the facts in the source document. (d) Fluency: readability and clarity of the summary.
Nevertheless, it has been noted that LLM summaries often suffer from hallucinations, i.e., arbitrariness and bias resulting in discrepancies and inaccuracies or contradictions between the input text and the output summary [
72]. For this reason, we decided to further implement in parallel the DeepEval summarization metric which exploits the Question–Answer Generation (QAG) benchmark for summary evaluation to address the abovementioned issues [
73]. The DeepEval summarization metric aims to enhance summarization evaluation by measuring the following aspects: (a) Coverage Score. This indicates whether the summary contains the necessary information from the original text. (b) Alignment Score. This indicates whether the summary contains hallucinated information or information that contradicts the original text. The summarization score is calculated by the following equation:
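\[ \text{Summarization Score} = \min\left(\text{Alignment Score}, \text{Coverage Score}\right) \]

as defined in the DeepEval documentation, where the overall score is taken as the minimum of the alignment and coverage scores.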
The G-Eval framework was implemented according to the following steps: (a) task prompts were written, which instructed GPT-4o mini to evaluate the summaries produced according to the criteria of coherence, fluency, consistency, and relevance; (b) chains of thought (CoT) were produced as intermediate reasoning steps to be followed for each evaluation task; (c) scores were calculated for each task; and (d) reasons were provided to explain the scores.
DeepEval summarization was also implemented according to the following steps: (a) a set of questions was generated based on each summary, (b) questions were answered using both the original comments and the summaries, (c) answers in turn were compared to calculate the final score, and (d) reasons were provided to explain the scores.
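A compact sketch of how both evaluations can be run with the DeepEval library is given below; the criteria wording, judge model, and assessment questions are illustrative assumptions rather than the exact prompts used in the study:

```python
# Illustrative use of DeepEval's G-Eval and summarization metrics; the criteria
# text, judge model, and assessment questions are assumptions, not the exact
# prompts used in the study.
from deepeval.metrics import GEval, SummarizationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

source_text = "...concatenated student comments..."
summary_text = "...generated summary..."
test_case = LLMTestCase(input=source_text, actual_output=summary_text)

coherence = GEval(
    name="Coherence",
    criteria="Evaluate the logical flow and organization of the summary.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",
)
coherence.measure(test_case)
print("G-Eval coherence:", coherence.score, coherence.reason)

summarization = SummarizationMetric(
    threshold=0.5,
    model="gpt-4o-mini",
    assessment_questions=[
        "Do students praise the availability and expertise of their tutors?",
        "Are scheduling and workload problems mentioned?",
    ],
)
summarization.measure(test_case)
print("DeepEval summarization:", summarization.score, summarization.reason)
```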
A graphical example for both evaluation methods is shown in
Figure 3. The score results for the three summaries are presented in
Table 2. The reasons that were provided by the DeepEval framework based on the obtained scores are discussed below.
According to the G-Eval scores, the summary of the positive comments is estimated to effectively capture the main topic of students’ positive experiences, although it could include more specific details from the input. It aligns closely with the main ideas and structure of the original text, capturing the essence of student feedback without introducing conflicting information. The summary is mostly free of grammatical errors and maintains clarity, but some repetition and minor formatting issues affect its overall fluency. The DeepEval summarization metric score indicates no contradictions, and the overall content remains relevant, maintaining a high level of accuracy. However, it detects that the summary includes extra information not present in the original text, which may potentially mislead the reader.
Regarding the summary of the negative comments, the G-Eval scores indicate that it captures several key points from the original text; however, it lacks specific details that could improve its relevance and coherence. The summary aligns adequately with the facts in the original text but could be improved further. Minor formatting inconsistencies and a lack of specific examples from the original text slightly affect the overall clarity. The DeepEval summarization score shows that the summary includes some extra information not present in the original text, which can confuse readers and introduce inaccuracies. Additionally, it seems to miss certain questions that the original content could answer, affecting its completeness.
For the summary of students’ suggestions, the G-Eval scores indicate that it captures the main topics from the original text; however, it lacks specific details. The summary shows some inconsistencies with the original text regarding the extent of the issues raised and could benefit from more specific details to enhance clarity. The DeepEval summarization score indicates that the summary inaccurately reflects some student suggestions compared to the original text. Despite these discrepancies, it captures the general sentiment of student feedback, which justifies the relatively high score.
Following a systematic review of the DeepEval framework’s results, the summaries were re-examined and corrections were made to convey the meaning and content of the relevant comments more accurately. However, it should be noted that, during the review, discrepancies concerning the absence of details from the comments in the summaries were not taken into account. This is because the intention of the proposed methodological framework for hybrid summarization was, from the outset, to seek and highlight the main trends in the comments. Indeed, the use of the Walktrap algorithm as a technique for improving the TextRank results was oriented in this direction. Particular attention, however, was paid to cleaning the summaries of information not mentioned in the original comments.
7.3. Walktrap Performance
In the search for the most appropriate community detection technique for the needs of this research, some of the most popular community detection algorithms for non-fully connected graphs were benchmarked. In particular, the performance of Walktrap was compared against Louvain [
74], Infomap [
75], Fast Greedy [
76], and Label Propagation (LPA) [
77]. Some representative metrics were computed to assess each algorithm’s performance and are presented in
Table 3. Walktrap can be considered the best solution due to its relatively high modularity density [78], an improved measure that combines modularity with community density, and due to the reasonable number of communities it creates relative to the average community size. Despite its comparatively low coverage [79] and comparatively high conductance [80], the primary criteria for this study are the density inside each community, the creation of discrete groups, and the community size distribution. For instance, Louvain achieves high modularity because it explicitly maximizes modularity, and it runs faster, but it produces non-deterministic results and has problems with very small communities or graphs with overlapping communities [74]. This was evident in trials conducted with various sample sizes from the dataset. In contrast, Walktrap, thanks to its random walks, is highly effective at identifying communities in small to medium-sized graphs where detailed separability is crucial; this is especially important given the typical sample size per study program. It excels when consistency and the correct detection of relationships between communities are the key concerns. Clustering was necessary to further improve the performance of TextRank by providing a meaningful number of distinct, topic-based groups. Combined with TextRank in this way, clustering captures the overall trend while minimizing the loss of students’ opinion patterns. However, the “best” approach always depends on the dataset, its size, the type of object under study, and the role clustering is expected to play.
Another point worth mentioning is the length of Walktrap’s random walks. In theory, t must be neither too large (as t tends to infinity, all nodes merge into a single community) nor too small (otherwise the random walk cannot explore the network thoroughly enough to discover more nuanced community structures). Figure 4 depicts how modularity density changes for different values of t across the three types of responses in the dataset under study. The best values of t for achieving high modularity density are 1, 2, and 4 for positive responses; 1, 2, 4, and 5 for improvement suggestions; and 1, 2, 3, 4, and 7 for negative responses. A walk length of 4 was considered a suitable compromise value for this research, allowing the communities to be explored while yielding well-defined and coherent community structures.
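The algorithm comparison and the walk-length sweep can be reproduced in outline with python-igraph, as sketched below; the random graph is only a stand-in for the real sentence-similarity graph, and standard modularity is reported because modularity density is not built into igraph and would require a custom implementation:

```python
# Sketch of the community-detection benchmark (cf. Table 3) and the walk-length
# sweep (cf. Figure 4) using python-igraph. The random graph is a stand-in for
# the real sentence-similarity graph; only standard modularity is computed.
import igraph as ig

g = ig.Graph.Erdos_Renyi(n=200, p=0.05)   # placeholder for the similarity graph

candidates = {
    "Walktrap": lambda: g.community_walktrap(steps=4).as_clustering(),
    "Louvain": lambda: g.community_multilevel(),
    "Infomap": lambda: g.community_infomap(),
    "Fast Greedy": lambda: g.community_fastgreedy().as_clustering(),
    "LPA": lambda: g.community_label_propagation(),
}
for name, detect in candidates.items():
    clustering = detect()
    print(f"{name:12s} communities={len(clustering):3d} "
          f"modularity={g.modularity(clustering.membership):.3f}")

# Sweep the Walktrap walk length t (cf. Figure 4).
for t in range(1, 8):
    clustering = g.community_walktrap(steps=t).as_clustering()
    print(f"t={t}: communities={len(clustering)}, "
          f"modularity={g.modularity(clustering.membership):.3f}")
```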
8. Discussion
The present study aimed to develop a hybrid text summarization technique for effectively analyzing qualitative feedback from student responses to online educational surveys at the Hellenic Open University. Participants were graduate students enrolled in an annual study program during the academic year 2023–2024. The study sample comprised 1197 students who submitted evaluations, representing approximately 45.72% of the 2618 students enrolled. The final dataset for analysis consisted of 1832 responses to open-ended questions, which were processed to produce summaries reflecting students’ positive experiences, challenges, and suggestions for improvement in the educational process. The quantitative findings showed that students are generally satisfied with the tutor’s role, the study program design, and the administrative and technical services, and to a lesser degree with the educational content. Furthermore, their open-ended responses provided a richer picture.
HOU students gave positive feedback about the modules they attended, highlighting the valuable knowledge they gained. They appreciated the experience of working on project assignments and the useful feedback they received. The tutors were regarded as experts on the subject matter and were always available to help clarify educational materials and answer questions. Students also valued the tutors’ efforts to make the subject matter clear and to provide support for their studies, especially through online meetings. These findings align with qualitative research in academic distance education and online learning settings regarding students’ positive experiences. Specifically, in relation to the valuable knowledge gained from their modules, the authors of [
81] emphasized in their research that students considered the academic courses they attended to provide valuable knowledge and practical skills that were beneficial. In [
82], university students expressed satisfaction with their tutors, describing them as “always available” and “helpful,” paralleling the support observed in our findings. The emphasis on online meetings and tutor support resonates also with [
82], where students valued interactions with instructors and appreciated the support they received. Similarly, Ref. [
83] found that instructors were always available for questions and provided timely feedback. They offered flexible office hours and responded to emails promptly, making them accessible for help.
Some HOU students expressed dissatisfaction with the modules, citing disorganized and outdated educational material and assignments, excessive time required for assignments, scheduling conflicts between meetings and the study guide, and difficulties with the mid-term exams. Many felt the assignments did not help them understand the material. HOU students also mentioned a lack of guidance in correcting mistakes on their assignments. This perception of assignments as excessively time-consuming and unhelpful is echoed in [
84], where students in some HOU modules expressed negative feedback about the overwhelming number of activities, demanding workload, and limited time for study due to professional obligations. Ref. [
82] also mentions that time management was a challenge for many university students, as the higher number of assignments made it difficult to complete them on time. Additionally, Ref. [
83] found that university students mentioned inadequate tutor support and ineffective feedback on their performance, leaving them feeling unsupported and confused about their learning progress. Ref. [
82] adds that some university students needed more guidance or found that their lecturers were not always available when they needed assistance.
HOU students suggested several improvements for the modules they attended as a counterbalance to the aforementioned negative comments. These included adding more educational materials and audiovisual aids, limiting the syllabus, increasing tutor–student meetings, and providing more guidance from tutors. Some requested upgraded materials and fewer assignments but with more meetings. Additionally, they recommended better organization of written assignments and tutoring in various areas.
Although the findings seem very interesting and helpful for the SPDs, both the process of producing them and their exploitation should be highly transparent so that ethical concerns are adequately addressed. In academic open education environments, it has already been pointed out that ethical issues arise, particularly in areas such as e-learning, dropout rates, and the development of methods for predicting student success. Ref. [85] highlighted that distance education institutions must begin to study how ethics can be applied to their decision-making processes. In our case, first, the HOU informs students about the aim of each evaluation task that handles student data. Second, it protects tutors’ anonymity by using anonymization techniques in the proposed system. Finally, during the internal evaluation process of the entire study program, SPDs are requested to record in a public document the decisions they took based on the results of the students’ responses.
The positive and negative feedback from HOU students, as well as their suggestions, reflect important quality requirements related to the following dimensions of the educational process provided by HEIs:
Tutor–student interaction. Student satisfaction largely depends on the quality of interaction with the tutor [
43,
86]. The desired competencies for tutors include frequent communication with students [
43], addressing their questions, and guiding and encouraging them in their studies [
87,
88]. When tutor–student interaction is effectively practiced, it can significantly enhance the understanding of course materials and encourage community experiences through discussions, group work, and collaborative projects [
76]. This aligns with Moore’s concept of minimizing transactional distance through engagement and communication [
89].
Assignment feedback. In distance and online educational environments, students highly value timely feedback [
90]. Feedback on assignments plays a crucial role in developing students’ academic skills, motivating them, offering encouragement, and identifying areas for improvement, which may impact their performance. However, feedback can become problematic due to misconceptions, lack of clarity, and dissatisfaction among students. Issues such as unclear comments affecting self-confidence, delayed feedback, and difficulty understanding academic language have been reported as barriers to effective feedback [
91].
Student–content interaction. As confirmed by the literature, the quality of student–content interaction (e.g., reading learning articles and textbooks, writing assignments and projects, and interacting with multimedia resources) is a critical factor influencing students’ learning experience and their overall satisfaction with their studies [
87]. In distance education, students spend most of their time reading and writing, systematically studying the educational content to understand the subject matter and further develop their cognitive skills. Therefore, educational material should promote student reflection, discussion, and information exchange with the aim of collaborative knowledge-building, creative problem-solving, and hypothesis formulation—elements that promote critical thinking [
92]. Poor quality and structure of the educational material and activities can significantly limit the development of students’ critical thinking skills [
93].
Module design. The quality of a study program’s design and structure can drastically affect student satisfaction. A program’s structure, including its objectives, teaching and learning methods, and assessment approaches, is crucial. Key factors influencing the design and structure of a study program include content presentation, learning paths, personalized guidance, and student motivation. Rigid study programs tend to limit communication and interaction, while more flexible programs can reduce transactional distance, fostering better communication and engagement [
87].
The evaluation of the generated summaries was conducted using the G-Eval framework and DeepEval metrics. G-Eval demonstrated that the summary of positive comments was highly relevant and coherent and effectively captured the students’ positive experiences, while the negative comment summary required improvement in specificity and detail. When assessed by DeepEval, the summaries displayed a high level of alignment with the input without introducing major inaccuracies, although some minor extraneous information was noted. In parallel, the Walktrap algorithm excelled in identifying thematic communities within the feedback, distinguishing trends and capturing significant insights about the students’ experiences. Its performance was measured against other community detection algorithms, revealing that Walktrap achieved a favorable balance between community density and modularity, proving it the most effective method for the context of this research. This dual approach of combining community detection with summarization metrics allows for the production of diverse, content-rich, and well-grounded summaries.
9. Limitations and Future Work
This study faces some limitations. One limitation is the response rate of 45.72%, as less than half of the target population participated. This may introduce a potential bias, as the sample may not fully represent the views of all students, particularly those who chose not to participate. Additionally, open-ended questions may have led to the underreporting of negative feedback, as students might have hesitated to share criticisms or felt restricted in expressing concerns due to fears of academic repercussions, despite assurances of anonymity. The cross-sectional design further limits this research, as it captures student feedback at only one point in time, preventing the assessment of changes over time. Moreover, although the qualitative findings reinforce the quantitative evaluation results, the analysis does not fully address the complexity and subjectivity of qualitative data, as it does not explore variations in feedback related to the study context and student characteristics (e.g., gender, attendance in compulsory or prerequisite modules, and grade performance in assignments). Additionally, the performance of the TextRank and Walktrap algorithms used for summarization and clustering may vary depending on the data’s characteristics, and the study did not explore alternative datasets to further assess the algorithms’ performance.
It is suggested that future research delve deeper into the creation of custom natural language processing models especially suited to the linguistic subtleties and contextual information present in Greek-language student feedback. This could help resolve the text pre-processing errors discovered when using generic tools such as UDPipe. Exploring the integration of sentiment analysis to assess the emotional tone of student comments could also be beneficial for gaining a more nuanced understanding of students’ experiences and satisfaction levels. Both UDPipe’s Greek lemmatization and the filtering of comments to detect noise in the data currently require manual correction, which is time-consuming. A more automated technique is needed to speed up the process while retaining all the important data and filtering out comments that do not contribute to the analysis. Another interesting prospect is processing English comments in parallel or, even better, analyzing them jointly with the Greek ones.
Advanced NLP techniques such as word embeddings can be very useful in addressing such issues [94]. Word embeddings can be used in extractive summarization instead of Jaccard similarity measures to capture semantic relationships. When combined with TextRank and Walktrap, they could yield more human-like summarization. Experiments show that this representation improves similarity detection and can be combined with TextRank [72] for general trend detection and with Walktrap for more effective community separation.
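A brief sketch of this direction is given below, assuming a multilingual sentence-embedding model (the model name and example sentences are illustrative): cosine similarities between sentence embeddings replace the Jaccard measure, and the resulting graph feeds the same PageRank-based ranking and Walktrap clustering as in the current pipeline:

```python
# Sketch: sentence embeddings in place of Jaccard similarity; the multilingual
# model name and the example sentences are illustrative assumptions.
import numpy as np
import igraph as ig
from sentence_transformers import SentenceTransformer

sentences = [
    "Οι καθηγητές ήταν πάντα διαθέσιμοι.",      # "The tutors were always available."
    "Το εκπαιδευτικό υλικό ήταν ξεπερασμένο.",  # "The educational material was outdated."
    "Χρειάζονται περισσότερες συναντήσεις.",    # "More meetings are needed."
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(sentences, normalize_embeddings=True)
sim = np.clip(emb @ emb.T, 0.0, None)           # cosine similarity, negatives clipped
np.fill_diagonal(sim, 0.0)

n = len(sentences)
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i, j] > 0]
g = ig.Graph(n=n, edges=edges)
g.es["weight"] = [float(sim[i, j]) for i, j in edges]

ranking = g.pagerank(weights="weight")                                      # TextRank-style ranking
clusters = g.community_walktrap(weights="weight", steps=4).as_clustering()  # Walktrap clustering
print(sorted(zip(ranking, sentences), reverse=True))
print(clusters.membership)
```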