1. Introduction
Text classification is a fundamental task in natural language processing (NLP) that involves automatically categorizing text into predefined categories. This process is crucial for organizing, managing, and retrieving information from vast amounts of unstructured textual data, making it a main component in various real-world applications [
1]. Text classification enables systems to understand and process information more effectively and its significance spans many domains and industries. In business, it is used for sentiment analysis to gauge customer opinions from reviews or social media posts, helping companies make data-driven decisions. In cybersecurity, spam detection systems rely on text classification to filter unwanted emails and detect phishing attempts. In the legal and academic sectors, topic labeling aids in categorizing large volumes of documents, improving searchability and content organization. Moreover, in language identification, text classification systems detect the language of a given text, which is crucial for multilingual platforms. Another notable use is in natural language inference, where the system determines logical relationships between pairs of sentences [
2].
Traditional approaches to text classification mostly leverage handcrafted features. Such schemes often require extensive domain knowledge and are not easily scalable. Machine learning, statistical modeling, and deep learning algorithms, on the other hand, have significantly advanced the field. However, the development of large language models and transformer-based schemes has led to an unprecedented success in performance in the vast majority of NLP tasks. Generally, transformer-based models display improved performance and better generalization [
3]. In addition, models trained on vast amounts of multilingual data have enabled language understanding and text processing in different languages, facilitating cross-linguistic applications [
4]. Multilingual transformers have achieved a high level of comprehension and performance across different languages, including Greek. This progress has opened up new avenues for handling the complexities of various languages in text categorization tasks. Thus, here, such models are investigated as to whether one can also exploit them to construct more accurate and efficient schemes for the Greek language.
Regarding text categorization with schemes dedicated to the Greek language, the literature seems relatively limited compared to other languages. Despite this, there have been some initiatives where Greek datasets and dictionaries have been developed to train models in various natural language processing tasks [
5,
6]. However, the Greek language poses unique challenges due to its inherent structure. Moreover, regarding text classification tasks, there is a wealth of additional issues to be considered within the modeling endeavors that are not as prevalent in the English language. These challenges stem from the specific linguistic characteristics and complexities inherent to Greek, such as its rich morphology and syntax [
7]. On this basis, a broad study of the relevant algorithmic schemes becomes necessary.
Currently, text classification involves several steps that typically remain consistent regardless of the specific classifier being employed. However, here, two different approaches can be incorporated. The first concerns the use of a pre-trained transformer model, with its usual fine-tuning to the specific data of the specific downstream task we are interested in. The second is to derive a framework of vector representations, usually by constructing embeddings, which is also done in the previous version, and subsequently fitting classifiers to the feature space consisting of the values of the multi-dimensional vectors of the embeddings. Thus, in any case, the text data must be encoded into numerical representations. Such encoding is essential for the machine learning models to process the text data.
In this paper, we start by first benchmarking transformer-based contextual embedding schemes and classic machine learning classifiers in the task of categorizing user questions regarding topics of the Greek School Network. On this basis, first, we examine several transformer models to extract vector representations to use as features for the aforementioned classification task on Greek text, and we identify the most efficient architecture in terms of a series of experiments containing ten different machine learning classifiers. From there, we show the most efficient architecture not only regarding the extraction of embeddings but also for the context of classifying the textual data contained in our newly presented dataset. These results are then used to investigate possible ensemble schemes that leverage the best-performing classifiers and transformer schemes. Specifically, we exploit the multilingual capabilities of the large version of a specific E5 model and construct a robust, interpretable, and efficient ensemble scheme that incorporates both worlds: state-of-the-art language representations and low-cost classic machine learning algorithms, under a soft voting combinatory formula.
Hence, rather than employing a transformer to directly categorize a large dataset, a task that demands substantial computational resources and a considerable amount of time, we utilized E5 solely to generate embeddings from the Greek dataset that we compiled. Bearing in mind that the product of the present investigation should be integrated into a wider helpdesk system for the GSN, at the beginning of which is placed the categorization of user requests, the use of such low-cost algorithms was first explored. Beyond that, however, and given the small extent of the relevant literature, our goal was to investigate transformer models in the Greek language whose performance was not presented as satisfactory in the context of relevant experiments in our dataset. Thus, such a restructuring of the experimental framework had to include the integration of classical classifiers along with contextual high-quality embeddings. In this manner, we applied a form of transfer learning to enhance the performance of low-cost classical algorithms. Given the above, the primary objective of this work is to investigate the effectiveness and behavior of various representation methods in the context of text classification using Greek data. We also aim to determine whether classical machine learning methods can adequately handle and respond to Greek data so that we can deploy them in a real-world system. However, our work extends beyond merely implementing and comparing different algorithms for Greek text categorization to the introduction of a hybrid ensemble scheme. Thus, a novel approach aimed at achieving improved results through the utilization of ensemble methods, with a particular focus on soft voting techniques, is presented. Soft voting, in broad terms, involves combining the predictions of multiple classifiers to produce a more accurate and robust outcome. By leveraging this approach, we aim to enhance the performance of text classification tasks in the Greek language, demonstrating that classical machine learning methods, when supplemented with modern techniques like transformer-based embeddings and ensemble learning, can be effective in handling Greek text while maintaining the costs low.
The contributions of the present work are as follows. First, we present a novel hybrid soft voting ensemble classifier incorporating contextual embeddings together with three classic machine learning classifiers. The model outperforms all of its constituent methods as well as a set of classic machine learning and possible ensemble schemes. The experiments on data regarding textual inquiries, questions, and statements made by users of the Greek School Network indicate the superior performance of the proposed scheme. Second, a series of results from a benchmarking of several multilingual transformer schemes for the extraction of contextual embeddings in Greek are listed, providing useful insights into the performance of embeddings in Greek language and their impact on the performance of transformer models. Last but not least, a comprehensive examination study illustrates the robustness and the weakness of various multilingual transformer models and thoroughly assesses their functionality in Greek text.
The remainder of the article is structured as follows.
Section 2 surveys the literature and analyzes recent related works.
Section 3 presents the proposed methodology and the experimental design.
Section 4 presents the results and discusses the main findings of our work. Finally,
Section 5 concludes the article and provides directions that future work will focus on.
2. Related Work
Text classification is attracting the increasing attention of researchers in natural language processing [
8]. In the literature, there is an increasing research interest and many studies have been conducted on the design of methods and the development of systems for automatically analyzing and classifying text. A detailed description of approaches and techniques can be found in [
9,
10].
Regarding text classification, the Greek language presents unique challenges, primarily due to its syntactic and morphological complexities [
11]. Researchers have made significant strides in this area by developing resources and datasets specific to the Greek language. For instance, researchers in [
12] focus on parsing Greek texts, a task complicated due to the relatively limited availability of annotated data. A suite of resources, including a manually developed lexicon, semi-supervised word embeddings, and comprehensive datasets designed for various NLP tasks such as sentiment analysis, emotion detection, and sarcasm identification, is introduced. The experimental results indicate a 24.9% improvement in F-score for sentiment analysis tasks across different domains when using these Greek-specific resources compared to traditional features like n-grams, highlighting the importance of language-specific tools in NLP. Moreover, in [
13], the task of hate speech classification, particularly targeting hateful, xenophobic, and racist discourse on social media platforms like Twitter, is investigated again concerning Greek. The study introduces a novel approach, combining computer vision techniques with NLP models to analyze multimodal data (text and images), thereby providing more nuanced modeling and identification of hate speech. The researchers use a fine-tuned version of BERT, alongside neural network models like ResNet, to classify messages. Their dataset, specifically curated for Greek hate speech detection, contributes significantly to the field, providing a valuable resource for future research. Furthermore, the model’s high accuracy and F1-score underscore its effectiveness in identifying and categorizing hateful content. Further advancing the field, the classification of legal texts in Greek is investigated in [
14], an area that requires precise and nuanced understanding due to the technical nature of legal language. A comprehensive dataset comprising over 47.000 official Greek legislative documents, categorized by legal topic, is introduced. The study compares various classification methods, including traditional machine learning approaches, recurrent neural networks (RNNs), and advanced transformer-based models. The findings reveal that incorporating domain-specific lexicons into the models significantly enhances performance, sometimes even surpassing state-of-the-art transformer-based schemes. This suggests that, for specialized fields like law, domain-specific enhancements can be as crucial as advanced model architectures. Given the above, one should note that our work consists of a specialized domain too.
Also, in the context of Greek language data, several research groups have focused on creating and refining embeddings to enhance language processing capabilities. In this context, a significant contribution involving training convolutional neural networks (CNNs) and long short-term memory (LSTM) networks on a massive corpus of Greek textual data is presented in [
15]. At the time of publication, the introduced corpus, compiled from approximately 20 million URLs, represented the largest collection of Greek-language text available at the time. The results show that corresponding embeddings play an instrumental role in improving the accuracy of the models in various NLP tasks, demonstrating the value of large-scale language-specific datasets in model training. An evaluation of Greek-specific word embeddings using standard benchmarks like Word2vec and the WordSim353 test collection is presented in [
16]. The authors adapt the above benchmarks to Greek, allowing for a more accurate assessment of word similarity, semantic “understanding”, and language comprehension. The study tests seven different embedding models, finding that meaningful and accurate representations could be achieved for Greek, which, again, has been found to have unique linguistic features that differ from more commonly studied languages like English. Furthering these efforts, the development of GreekBART, a sequence-to-sequence (Seq2Seq) model based on the BART architecture, marked a significant advancement in the field [
17]. GreekBART is a pre-trained transformer scheme fine-tuned on a large-scale corpus of Greek text and evaluated against a variety of models, including Greek-BERT and XLM-R. At the time of publication, its performance on tasks such as summarization and natural language generation demonstrates its superiority, particularly in tasks that require nuanced “comprehension” as well as the generation of Greek text. Again, the model’s apparent success underscores the importance of developing language-specific tools tailored to the linguistic and cultural nuances of the target language. A study exploring the detection of semantic shifts in the modern Greek language, specifically focusing on “dimotic” or demotic Greek, is presented in [
18]. The authors utilize NLP tools to analyze a corpus of Greek literature spanning several decades, providing insights into how the meanings of words have evolved over time. This work not only contributes to historical linguistics but also has practical implications for NLP applications, such as sentiment analysis and topic modeling, where understanding shifts in word usage and meaning is crucial.
It is commonplace that ensemble classification methods, particularly voting techniques, are widely used to improve the robustness and accuracy of predictive models. In recent studies, various innovative approaches have been explored to enhance ensemble learning methodologies. One such approach is the two-stage voting boosting (2SVB) concurrent ensemble learning method [
19]. The method is specifically applied to sentiment classification tasks on social networks. This technique integrates voting and boosting strategies, creating a concurrent ensemble framework that optimizes the utilization of erroneous data while reducing computation time. The researchers implement this method using a 3-fold cross-segmentation approach, followed by training on datasets enriched with identified erroneous data. This dual-stage process enhances the diversity and accuracy of the ensemble, as demonstrated by the high F1 score achieved on a coronavirus tweet sentiment dataset, which surpassed other competitive methods. A model for SMS spam detection is introduced in [
20], also leveraging the capabilities of pre-trained transformers and ensemble learning techniques. This model significantly improves detection performance by combining advanced text embeddings from the GPT-3 Transformer with a composite ensemble system comprising four machine learning models. The ensemble’s high performance, as evidenced by its high accuracy on the SMS spam collection dataset, further illustrates the effectiveness of integrating multiple learning models to enhance predictive accuracy. The comprehensive analysis of sentiment analysis techniques employed on datasets from platforms like TripAdvisor and Rotten Tomatoes presented in [
21] further highlights the efficacy of ensemble methods. The study compares various machine learning approaches, including Naïve Bayes (NB), Support Vector Machines (SVMs), Logistic Regression (LR), and K-Nearest Neighbour (KNN), integrated with ensemble strategies such as stacking and majority voting (MV). The results show that ensemble methods consistently outperform individual algorithms, with majority voting offering a slight edge over stacking in terms of performance. Lastly, a utilization of a weighted majority voting ensemble that incorporates models like Naïve Bayes, Logistic Regression, Stochastic Gradient Descent, Random Forest, Decision Tree, and SVM is given in [
22]. This approach optimizes classifier weights based on accuracy or F1-score. The results demonstrate superior performance over simple majority voting methods, particularly in tasks like those featured in the SemEval 2017 Task datasets.
3. Proposed Method and Experimental Setup
In this section, we proceed to the presentation of the proposed method and the outline of the experimental framework. First, we present our ensemble scheme and analyze its functionality. Next, we present the framework of the experimental procedure illustrating the workflow of the serial-structured process of extracting the proposed method.
3.1. Proposed Method
We designed and introduce a novel ensemble classifier specifically designed based on a dataset containing inquiries obtained from the Greek School Network helpdesk. The model leverages the semantic content present in the embeddings extracted with the incorporation of the large version of an E5 multilingual transformer scheme. Our method integrates three well-known, diverse, and effective classic machine learning algorithms under a simple yet interpretable soft voting combinatory scheme to improve the performance of our text categorization task. The three base learners used are an XGB classifier (XGBoost), the K-Nearest Neighbors classifier (KNN), and a logistic regression classification scheme (LR). Given that the selection outcome is result-driven, one can observe the diverse characteristics of the selected models and the complementary nature of their classification capabilities, which allow them to handle different facets of the text data effectively.
Figure 1 depicts an overview of the layout of the proposed scheme.
The ensemble scheme illustrated starts with the GSN dataset, which undergoes a transformation into embeddings via the E5 Multi Large model. This model leverages its 24 layers and 355 million parameters to extract comprehensive and meaningful features from the dataset, providing a robust representation for subsequent processing. These embeddings are then channeled into three distinct machine learning models. The XGBoost is used for its efficiency and accuracy in handling structured data and processes the embeddings to generate Prediction 1. K-nearest neighbors uses the embeddings to produce Prediction 2 and logistic regression analyzes the embeddings to create Prediction 3. Linear regression is valued for its interpretability and effectiveness in scenarios where the relationship between features and the target variable is linear. The significance of the proposed model lies in its ability to combine the strengths of multiple diverse models (XGBoost v2.03, KNN, and logistic regression) through the use of soft voting. Given that each base learner brings distinct strengths to the table, our scheme enhances prediction stability and accuracy.
Table 1 illustrates the main characteristic of the base models explaining their strengths and weaknesses.
Furthermore, by integrating E5 multilingual embeddings, we exploit the power of a state-of-the-art transformer model, which is known to extract rich semantic content from textual data, particularly in multilingual contexts such as ours. Such a combination enables our ensemble layout to handle both structured and semantic features efficiently. The latter is critical when working with complex, real-world datasets like those obtained from the Greek School Network helpdesk. Hence, in essence, our ensemble capitalizes on the diversity and complementary nature of the base learners and the strength of advanced NLP embeddings, offering generalizability and robustness over individual models while also maintaining interpretability through its voting mechanism. This makes it uniquely suited to multilingual and multi-class text classification tasks. Thus, combining transformers with classical machine learning models allows us to strike a balance between power, efficiency, interpretability, and robustness—factors that are crucial in real-world applications, such as the one that our scheme is going to be applied to. The main Pseudo Code of the ensemble Learning Architecture is illustrated in Algorithm 1.
Algorithm 1: Pseudo Code for Ensemble Learning Architecture |
# Step 1: Apply Embeddings Model to GSN Data Set begin # Initialize E5 Multi Large Embeddings Model embeddings_model = E5_Multi_Large_Embeddings() # Transform GSN Data Set to Embeddings embeddings = embeddings_model.transform(X) end # Step 1 Output: High-dimensional embeddings of the GSN Data Set
# Step 2: Train Classifier Models begin # Initialize Models xgboost_model = XGBoost() knn_model = KNearestNeighbors() lr_model = LogisticRegression() # Train the Models on Embeddings and Corresponding Labels xgboost_model.fit(embeddings, Y) # Train XGBoost knn_model.fit(embeddings, Y) # Train KNN lr_model.fit(embeddings, Y) # Train Logistic Regression end # Step 2 Output: Trained models - xgboost_model, knn_model, lr_model
# Step 3: Generate Predictions begin # Use the trained models to generate predictions on test data Prediction_1 = xgboost_model.predict(test_embeddings) # XGBoost Prediction Prediction_2 = knn_model.predict(test_embeddings) # KNN Prediction Prediction_3 = lr_model.predict(test_embeddings) # Logistic Regression Prediction end # Step 3 Output: Predictions - Prediction_1, Prediction_2, Prediction_3
# Step 4: Apply Soft Voting to the Predictions begin # Compute final ensemble prediction using soft voting Final_Prediction = soft_voting(Prediction_1, Prediction_2, Prediction_3) # If models provide hard labels: # Final_Prediction = mode([Prediction_1, Prediction_2, Prediction_3]) end # Final Output: Final prediction after ensemble soft voting |
3.1.1. Soft Voting
As mentioned, the method used to combine the outputs of these algorithms is soft voting. Soft voting is a technique used to combine multiple models to improve the accuracy of predictions. In a soft voting scheme, rather than making a hard decision as to the value of the output class, the predictions of the individual base learners are combined based on the probabilities they assign to the predictions of each class. This approach offers several advantages and differs from hard voting, where each classifier simply votes for the final class. The process of soft voting involves a few key steps. First, each base learner produces a set of probabilities that indicate the likelihood of each sample belonging to each possible class. These probabilities represent the model’s confidence in its predictions. Next, the probabilities from all base learners are aggregated, typically by averaging them. This means that if one classifier assigns a high probability to a certain class, this probability boosts the final ensemble probability for that class. Finally, the class with the highest combined probability is selected as the ensemble model’s final prediction. Soft voting offers several benefits. It often results in improved accuracy insofar as it combines the strengths of multiple models, with the mistakes of one model potentially being offset by the correct predictions of others. Moreover, by considering the probabilities, soft voting provides a more nuanced view of the system’s uncertainty, making the ensemble more stable in cases where the models are not entirely certain. This technique can also enhance the generalizability of the ensemble to new data as it integrates insights from multiple sources, thus reducing the likelihood of overfitting. The application of soft voting is widespread across various machine learning tasks, such as text categorization, sentiment analysis, and fraud detection. Typically, the ensemble combines models like XGBoost, K-nearest neighbors, and logistic regression, each of which has different strengths and weaknesses. For example, as we will see, XGBoost excels at handling non-linear relationships, KNN leverages the structure of the data, and logistic regression provides interpretable predictions. By combining diverse, efficient, and robust models, such an ensemble can achieve superior overall performance compared to individual schemes.
3.1.2. Multilingual E5 Large
In NLP, the process of generating efficient word embeddings can be quite demanding as it involves a variety of challenges. In recent years, this process has been continuously upgraded so that models can adequately represent the meaning of words and succeed in a multitude of tasks, such as text classification, sentiment analysis, word completion, machine translation, and text generation.
Embeddings are dense vector representations of words or tokens that constitute a crucial part of state-of-the-art language models, insofar as they seem to capture some of the semantic content contained in language. In transformers, the embeddings are produced in the encoder part of the architecture that encodes words in tokens that are represented numerically as dense, real-valued vectors of multi-dimensional space. The proposed ensemble schema incorporates embeddings extracted with the application of the large version of the Multilingual-E5 transformer model over our GSN Greek dataset. Multilingual-E5, introduced in mid-2023, is a sophisticated text integration model designed for the efficient management of multilingual text data. The Multilingual-E5 model training is performed in two stages. Initially, based on the English model, E5 is trained on 1 billion text pairs in different languages. Each pair consists of two texts, one in the source language and one in the translated language. The model learns to match the two texts, developing an “understanding” of grammatical structures, semantic relationships, and lexical correspondences between languages. This process involves the first stage of training, known as weakly supervised contrastive pre-training. In the second stage of the model training, that is, supervised fine-tuning, the model is tuned with data containing labeled texts with tags indicating their meaning or category. The model learns to classify new texts into the correct labels, improving its accuracy and ability to model details in each language. This ensures that the model can generalize effectively across languages. The initialization of the hyperparameters of the model was performed according to XLM-RoBERTa. Two evaluation criteria were used to demonstrate the multilingual capability of E5: MIRACL and Bitext [
23]. MIRACL measures the ability of the model to find the corresponding image for a given text description in 16 different languages, while Bitext refers to pairs of texts written in two different languages on the same topic. Typically, these texts are translations of each other, although not always exact ones.
3.1.3. Base Learners
Let us now move on to an outline of the specifics of each algorithm included in the ensemble, starting with XGBoost, introduced by Chen and Guestrin [
24]. XGBoost is a gradient-boosted decision tree model. This algorithm trains decision trees sequentially on the training data, adding a new tree at each iteration to enhance the value of the objective function. The objective function, which the algorithm aims to minimize, is composed of two terms: the loss term (
l) and the regularization term (Ω). The objective function for the t-th iteration (
) is defined in Equation (1), where
is the actual class label of instance
is the predicted class label of instance
i,
represents the function of the tree, n is the number of instances in the training set, and Ω is the regularization term.
The term
(2) penalizes model complexity to prevent overfitting. In this term,
γ and
λ are hyperparameters,
represents the number of leaves in the tree, and
ω denotes the weight of each leaf.
The second-order Taylor expansion is used to approximate the value of the loss function in (3). In this context,
denotes the first-order gradient statistics of the loss as shown in (4), and
denotes the second-order gradient statistics of the loss (5).
For a given tree structure, the optimal weight for each leaf node j and the corresponding optimal value of the objective function are determined by Equations (6) and (7), respectively. Here,
represents the set of instances in leaf j. Equation (7) serves as a scoring function to assess the quality of the tree structure q.
However, evaluating every possible tree structure to find the optimal one is not feasible due to the high computational cost. In practice, XGBoost uses a greedy algorithm to identify an optimal tree structure. XGBoost deals with missing data internally, so there is no need for imputation or deletion. When an instance has a missing feature, the method directs it along a default path at each node. This path can go either to the right or left node, and the default direction is chosen based on which path provides the maximum gain in the training set [
25].
Logistic regression (LR) is a statistical method used to determine the probability of a binary outcome based on several predictor variables. It assesses how the considered variables impact the dependent variable being studied. Conversely, if the explanatory variables include at least three unordered categories, multinomial logistic regression (MLR) is used. MLR is based on the same fundamental principles as binomial logistic regression. Therefore, it can be said that logistic regression is extended to handle multiple categories in MLR. In the research conducted by Le Cessie and Van Houwelingen [
26], a ridge value of
is recommended for calculating log probabilities. There are also modifications available for classification purposes. When there are
cases with
features and
classes, an
matrix is used to compute component B. The probability for class
, excluding the reference class, is given by Equation (8).
The last class has probability as shown in (9).
As a result, the negative multinomial log-likelihood is expressed as follows
A quasi-Newton method is used to find improved values of the
elements in order to determine matrix B where L is minimized. Before the optimization process, matrix B is transformed into an
vector [
27].
Nearest neighbor classification, commonly known as K-nearest neighbors (KNN), operates on the principle that the closest patterns to a target pattern
, for which we want to determine the label, provide valuable label information. KNN assigns the class label based on the majority label among the K-nearest patterns in the data space. To achieve this, a similarity measure in the data space must be defined. In
, it is common to use the Minkowski metric (
p-norm) for this purpose.
For
, this corresponds to the Euclidean distance. In other data spaces, appropriate distance functions must be selected, such as the Hamming distance in
. For binary classification, the label set
is used, and KNN is defined as follows:
with neighborhood size
and the set of indices
representing the K-nearest patterns.
The value of
determines the locality of KNN. For
, small neighborhoods form in regions where patterns from different classes are intermixed. In contrast, for larger neighborhood sizes, such as
, patterns with minority labels are often ignored [
28].
3.2. Experimental Setup
We now present the framework with which the proposed scheme is identified. The whole procedure consists of three phases.
First is the identification of the most efficient semantic representation. The latter, that is, embeddings, offer a comprehensive representation of the Greek textual content of our dataset, seemingly capturing semantic nuances and relationships critical for effective text categorization. Then, as mentioned, we proceed with the classification scheme benchmark. Actually, these two steps are interdependent and are produced simultaneously. This means that, here, to substantiate the best transformer, one needs the evaluation of the classifiers used, and vice versa, to export the best classifier, the embedding feature space must already be in play. The third step involves ensemble learning.
In short, the experimental process starts with the creation of context embeddings for each instance of the “short” and “long” instance description features in our dataset. Initially, the dataset is split into training and testing sets in an 80–20 ratio. From this split, a series of embeddings are extracted by the use of various transformer methods. These embeddings are then used as features in a three-class text categorization framework. Ten text classification algorithms are employed. After identifying the best-performing classifiers, we investigate a set of possible ensemble configurations. This result-driven approach leads to the identification of a hybrid multilingual E5 soft voting ensemble consisting of three classic machine learning base learners as the most efficient scheme.
Thus, the primary objective is to identify the best embeddings in terms of specific evaluation metrics. The investigation of the classification schemes comes next. Hence, two benchmarks are established: one regarding embeddings and the other regarding classifiers. The corresponding results are depicted in ranking tables. Using insight from the latter, and due to the form of the results, the best embedding representations and classification schemes are easily identified. The set of possible soft voting ensembles is constituted by up to three base learners. The most effective layout, be it individual or ensemble, incorporating the best performing embeddings, that is, the transformer’s embeddings that yield the best metric values, is, thus, exported in the last stage of this experimental process. Note that the computational costs of the individual classifiers examined here are comparable. Also, as mentioned, we are dealing with a dataset containing instances in Greek language. Thus, it is crucial to prioritize the identification of the optimal vector representation framework. This process, as we will see below, yields consistent results both for the embedding and the classification schemes. Hence, it always outputs a specific transformer embedding model, that is, a multilingual E5 large. This scheme consistently outperforms all others with respect to the results of each of the classic machine learning classifiers examined. Moreover, there are specific classification schemes that are always placed at the top of the rankings and, consequently, are used to form the potential ensemble schemes.
The above result-driven approach leads to the identification of the proposed ensemble classifier. Hence, using soft voting and leveraging the individual strengths and diversity of XGBoost, K-nearest neighbors, and logistic regression, while utilizing the E5 representational framework, we present a method that exhibits significant performance improvements. As we have seen, our framework not only aims to implement and evaluate various algorithms on the GSN dataset but also seeks to highlight the benefits of using an ensemble method for text categorization. The use of soft voting, combined with the high-quality E5 large embeddings, aims to deliver a more accurate and effective solution for categorizing the helpdesk queries from GSN. Ultimately, this contributes to the development of more sophisticated and reliable models for text classification in the Greek language.
3.3. Dataset
The dataset used in this study is a balanced GSN dataset, consisting of three classes. For feature extraction, embeddings from the E5MultiLarge model were utilized, applied to both the “Title“ (short-description) and “Description” (long-description) textual features of the dataset. This process resulted in a final dataset comprising 90.583 instances and 2.049 columns. The data were subsequently split into training and testing sets, with the training set containing 72.466 rows and 2.049 columns and the test set containing 18.117 rows and 2.049 columns. The queries contained in the data, submitted to the helpdesk of the Greek School Network, span a ten-year period. The GSN serves as Greece’s national network and internet service provider (ISP) for the Ministry of Education and Religious Affairs. It offers a variety of services, including e-learning, communication and collaboration tools, e-governance, and various support services.
The requests were selected to cover a specific range of categories that the helpdesk received from GSN users over the decade. They were extracted using SQL queries from the GSN database, and both the title and description, that is, the short and long descriptions of each instance, of each request, were preserved. Each request, i.e., query or question, was meticulously labeled with the appropriate category information. In the textual content, several text processing techniques were applied to clean and standardize the dataset. This preprocessing step also included, among others, the removal of personal information and all unwanted URLs, as well as standard procedures like lowercase conversion. Additionally, any data entries where the request text field contained fewer than four words were intentionally excluded from the dataset to ensure the quality and relevance of the representation of the data. This resulted in the inclusion of a total of 90,583 instances in the dataset, organized in three classes, as illustrated in
Table 2. The dataset provided a robust foundation for developing and evaluating both the embeddings and the text classification models.
The data were compiled as a whole by the helpdesk service of the Greek School Network. There, on the relevant platform, registered users can submit questions and inquiries regarding any issue that arises and is part of the service’s remit. Each request submitted via the Portal is recorded and the relevant ticket is created. All these submitted questions–tickets then compose a set of submitted queries which is archived by each year. Thus, the dataset is composed of the user questions and inquiries submitted to the helpdesk of the Greek School Network which have undergone the appropriate processing. In
Table 3, the main characteristics of the features of the dataset are presented.
The dataset consists of 91.306 instances, and the average length of the instances is 31.45 words with a std of 41.9. This indicates a substantial diversity in the length of the instances. The word count distribution of the dataset is illustrated in the
Figure 2 and indicative instances are presented in
Table 4.
Given the above, the aim of the present work is to leverage this ten-year dataset of GSN helpdesk queries to explore the effectiveness of various text classification methods towards the creation of a scheme that will eventually be deployed in a corresponding real-world application of the GSN. Thus, by employing text processing techniques and selecting this first subset of a balanced distribution of classes to be investigated, we set the stage for the initiation of a larger investigation into how well classical and modern machine learning methods can handle Greek text data in our multiclass classification context. This approach not only contributes to developing more effective and accurate text categorization models in our context but also helps in understanding the specific challenges posed by the Greek language.
3.4. Transformer Models
We discussed that the research framework aims to identify the most effective transformer-based embedding model in terms of the three evaluation metrics to be presented for the task of multi-class classification of the above GSN helpdesk data. The apparent requirement is that these models should work in the Greek language. On this basis, eleven state-of-the-art transformer models were tested. Two of these are specifically Greek models, while the remaining nine are multilingual. The models are all available on the Hugging Face platform [
3], as illustrated in
Table 5, which contains the respective abbreviations. It is important to note that for the extraction of the necessary embeddings, the transformer models were used without additional fine-tuning. This decision was made with the intent to refine the most promising architectures in subsequent studies following this case study.
GreekBert is a transformer model specifically designed for the Greek language, based on the BERT architecture. It has 12 layers and 110 million parameters, and reports quite good performance in various Greek NLP tasks, such as text classification, named entity recognition, and sentiment analysis. GreekBert is uncased, meaning that it does not differentiate between uppercase and lowercase letters, which simplifies text processing. The model leverages BERT’s deep bidirectional representations to understand and process Greek language intricacies, making it a robust choice for Greek-specific NLP applications. Its extensive training on Greek text corpora ensures high accuracy and relevance in understanding Greek language nuances.
GreekRoberta has 12 layers and 125 million parameters, and is tailored for legal texts in Greek, built upon the robust RoBERTa architecture. This model retains distinctions between uppercase and lowercase letters and is oriented to legal document processing. GreekRoberta enhances NLP tasks in the legal domain, such as legal document classification, case law analysis, and the extraction of legal entities. Its architecture facilitates context-aware interpretations of complex legal texts.
BertBaseMultiCased is a multilingual BERT model that supports multiple languages, including Greek. It has 12 layers and 110 million parameters, and retains letter case information, which is important for many languages. This model is widely used for tasks such as machine translation, multilingual text classification, and cross-lingual information retrieval. By leveraging shared multilingual representations, BertBaseMultiCased offers robust performance across different languages, making it a versatile tool for global NLP applications where maintaining case sensitivity is crucial for accurate language understanding.
BertBaseMultiUnCased is similar to BertBaseMultiCased but does not differentiate between uppercase and lowercase letters, making it suitable for languages where case information is less critical. With 12 layers and 110 million parameters, this model excels in various NLP tasks across multiple languages, such as sentiment analysis, named entity recognition, and question answering. Its uncased nature simplifies processing text from diverse linguistic backgrounds, providing versatile and reliable multilingual capabilities for a broad range of applications.
DistiluseBaseMultiCased is a distilled multilingual model optimized for sentence embeddings. It has six layers and 135 million parameters, and it is quite useful for tasks such as semantic search, sentence similarity, and multilingual embeddings. This model provides efficient and accurate sentence representations across multiple languages, facilitating improved performance in tasks requiring semantic understanding and comparison of sentences, making it an excellent choice for applications needing fast and reliable semantic processing.
E5MultiBase is a multilingual transformer model serving as the base variant in the E5 model series. It has 12 layers and 110 million parameters, and it is designed to handle multiple languages efficiently, providing robust performance in tasks such as text classification, language translation, and cross-lingual understanding. This model balances computational efficiency and performance, making it suitable for a wide range of multilingual NLP applications where both speed and accuracy are essential.
E5MultiLarge, the largest variant in the E5 model series, offers enhanced performance due to its increased parameters and layers. It has 24 layers and 355 million parameters, and reports quite good performance in complex NLP tasks that require deep language understanding.
E5MultiSmall is the smallest, most efficient variant in the E5 model series, designed for tasks requiring fast inference and low computational resources. It has 6 layers and 66 million parameters, and maintains strong performance across multiple languages, making it ideal for applications in mobile or embedded systems where resource efficiency is crucial without greatly compromising accuracy.
ParaXLMMultiV1 is a multilingual paraphrase identification model based on the XLM-R architecture. It has 24 layers and 270 million parameters to detect and generate paraphrases across different languages, enhancing tasks such as text generation, semantic search, and cross-lingual paraphrasing. This model performs well in applications that require understanding and generating semantically equivalent expressions in various languages.
MultiWikiNer is a multilingual named entity recognition (NER) model based on the WikiNeural dataset. It has 24 layers and 355 million parameters, and is designed to recognize and classify named entities across multiple languages with high accuracy. This model is particularly useful for applications in information extraction, multilingual content analysis, and cross-lingual entity recognition.
3.5. Classifiers
The classifiers used in the comparisons are presented in
Table 6. For the experiments, the PyCaret library was utilized, and all models were executed using their default initializations [
34].
Given the above, each of the selected machine learning schemes brings some unique strengths to the text classification task. Specifically, logistic regression offers simplicity and effectiveness for both binary and multi-class classification problems, while linear discriminant analysis is fitted for linearly separable classes and dimensionality reduction tasks. Moreover, extreme gradient boosting provides powerful performance and scalability, handling complex relationships efficiently. The K-nearest neighbors classifier can capture local patterns effectively, while the ridge classifier prevents overfitting in high-dimensional spaces. Then, we have quadratic discriminant analysis, a method that can handle non-linearly separable classes effectively, and, finally, AdaBoost, a method capable of improving accuracy and robustness by focusing on difficult cases. Hence, we have a set of quite diverse classification schemes that collectively enhance the robustness and versatility of our text classification experimental framework, making it both a comprehensive and effective approach.
3.6. Metrics
Finally, to evaluate the performance of the models, six key metrics were used. Each metric provides a different perspective on the model’s performance, ensuring a comprehensive evaluation.
3.6.1. Accuracy
It measures the proportion of correctly predicted instances out of the total number of instances. The formula for accuracy is:
3.6.2. Aggregated Recall (Macro-Averaged Recall)
Macro-averaged recall provides a balanced measure of recall across all classes by averaging the recall of each class equally. The formula for macro-averaged recall is:
where
It is particularly useful in scenarios where missing a positive instance (false negative) is more critical than incorrectly identifying a negative instance (false positive).
3.6.3. Aggregated Precision (Macro-Averaged Precision)
Macro-averaged precision calculates the average precision across all classes, treating each class equally. The formula for macro-averaged precision is the following:
where precision is calculated as
3.6.4. F1-Measure
where
and
are given by Equations (14) and (16), respectively.
3.6.5. Kappa Metric
The Kappa metric is calculated as follows:
where
k is Cohen’s Kappa,
is the observed agreement between the model predictions and the true labels, that is, the proportion of instances where the model output and the true labels agree on the class assignment, and, lastly, Expected Agreement is the agreement expected by chance, i.e., the agreement that would be expected purely by chance, given the marginal probabilities of each class.
3.6.6. Matthews Correlation Coefficient (MCC)
The Matthews correlation coefficient evaluates the quality of binary classifications. It considers all four categories of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The formula for MCC is
MCC takes into account the balance between the predicted and actual classifications, and thus provides a comprehensive overview of the model’s performance.
4. Experimental Results and Discussion
In this section, we present the main results of the experimental study.
Table 7 contains the results of the first two aforementioned steps of the experimental procedure, that is, the transformer and the classification schemes’ benchmarks. Specifically, this table depicts the top three based on the evaluation of all the metrics employed in our study. Hence, this table not only highlights the rankings of the language models but also identifies the top-performing classifiers for each language model, along with their respective training times. There, one can observe all the previously discussed conclusions regarding the embedding scheme selection. However, the results also reveal the consistent superiority of the XGBoost method, which ranks first across all metrics. This consistent superior performance of XGBoost underscores its effectiveness in the context of our analysis, making it a proper candidate among the classifiers evaluated to constitute a possible base learner for our ensemble layout. Apparently, KNN is another candidate too, always placed in the third position of the rankings in all metrics. It is with these models that the relevant subset of methods is defined, from which the ensemble search process starts. Finally, one can easily observe that, in addition to the rankings, the actual values of the metric valuations fluctuate at levels that can be evaluated at least as good. Moreover, note that this is the case in all metrics examined.
Now, given the results of the individual machine learning classifiers, the consequent ensemble investigation yielded a specific soft voting configuration as best-performing. This approach, utilizing, as already depicted above, a combination of three base learners and a multilingual E5 transformer variation, led to even higher efficiencies across all evaluation metrics compared to the performance of each individual classifier. These enhanced results display our proposed scheme surpassing not only its base learners but all classifiers examined, practically demonstrating the effectiveness of our soft voting ensemble approach in improving classification performance. Note also that the proposed configuration is the best ensemble scheme out of a set of possible soft-voting combinations of two and three base learners. In other words, the classifier in question that we propose is the best possible arrangement within the limits of our experimental framework. The corresponding results of our scheme with respect to the individual classifiers considered are listed in
Table 8.
Figure 3 depicts the confusion matrix of the proposed ensemble. One can exploit such an illustration to obtain insight into the results, towards a deeper understanding of the classifier’s performance across all three distinct classes in the dataset. The confusion matrix is a valuable tool for visualizing how the classifier performs in terms of its ability to correctly identify and differentiate between the various classes. For a clearer interpretation of the confusion matrix, let us define the classes as follows:
- i.
Class 1: SCHOOL UNIT ICT—This class represents queries related to Information and Communication Technology issues within school units.
- ii.
Class 2: GREEK SCHOOL NETWORK CENTRAL SERVICES—This class encompasses queries related to central services in the public sector digital domain.
- iii.
Class 3: MINISTRY OF EDUCATION CENTRAL SERVICES—This class includes queries pertaining to central services in the public sector digital management domain.
Thus, the confusion matrix below enables one to analyze how effectively the classifier distinguishes between the three classes of our dataset, revealing any potential issues with misclassification or confusion between the different categories. Such a detailed examination allows one to assess the strengths and weaknesses of the classifier’s performance in more depth.
Given the above, first, it is essential to point out that this work presents a comprehensive evaluation of transformer-based ensembles and classic machine learning classification schemes for our dataset. The consequent proposal of the specific ensemble configuration comes as a byproduct of the core of our investigation, which is based on the deployment of a robust, efficient, and low-cost scheme in a real-world scenario of classifying user requests posed in the GSN helpdesk. Nevertheless, the starting point of the investigation lies within the theoretical hypothesis that posits that the incorporation of low-cost methods can perform adequately well in domain-specific frameworks, given that an efficient feature space that represents the textual content has already been efficiently substantiated. The latter is crucial in our framework exactly because of the small volume of relevant endeavors regarding the Greek language. In this context, the choice of investigating and employing an ensemble model that utilizes contextual transformer-based embeddings is to further enhance the predictive capacities while improving all relevant metrics by leveraging the strengths of a set of diverse yet low-cost and robust base learners. The ensemble classifier’s effectiveness is significantly attributed to the diversity of the three models it incorporates: XGBoost, K-nearest neighbors (KNN), and logistic regression (LR). Each model contributes uniquely to the ensemble. XGBoost excels at handling complex relationships and interactions between features, making it robust in capturing complex patterns in the data. KNN, on the other hand, utilizes information from neighboring data points, which can be particularly useful for local decision-making processes. LR offers a straightforward and interpretable approach for the linear discrimination of categories, providing a solid baseline that is easy to understand and implement. This diversity in methodologies enables the ensemble classifier to generalize better across different data patterns and reduces the risk of overfitting to specific features within the training set. Moreover, the incorporation of a soft voting mechanism, which works by summing the predicted probabilities from the three base learners and selecting the class with the highest combined probability, contributes both simplicity and general performance. Such a combinatory scheme reduces the likelihood of incorrect predictions that might arise from relying on a single classifier, leading to more stable and accurate results [
15]. This approach is particularly advantageous when dealing with Greek texts, where the application of existing transformer models can be challenging and may not yield satisfactory results for relatively straightforward categorization tasks.
As we have seen, we initially utilized embeddings generated from Greek texts, applying them to classical machine learning algorithms. Concerning embeddings, E5 is the obvious choice. Regarding classifiers, the results seem promising too, with XGBoost emerging as the best performer among all models examined, achieving a score of 91.4% in metrics such as accuracy, recall, precision, and F1, and scoring 87.1% in Kappa and MCC. This success led to the consideration of whether an ensemble of these models could further enhance performance. Subsequent research and experimentation led to the development of the proposed ensemble classifier combining XGBoost, KNN, and LR. Our scheme improved in all metrics by over 1%, achieving 92.8% in accuracy, recall, precision, and F1, and 89.2% in Kappa and MCC, marking a significant enhancement. Specifically, accuracy, recall, precision, and F1 metrics improved by 1.4 percentage points, from 91.4% to 92.8%. This indicates that the proposed ensemble is able to recognize positive cases with greater accuracy and reliability, improving its overall performance. The F1 score, which combines precision and recall, suggests that the balance between these two metrics is also better. The Kappa and MCC metrics improved by 2.1 percentage points, from 87.1% to 89.2%. These metrics are more stringent and consider both positive and negative predictions, assessing the overall agreement and correlation of the predictions with the actual values. The improvement in these metrics indicates that the ensemble model offers better overall accuracy and stability across all categories.
In the direction of interpreting the results, the confusion matrix provided offers a detailed analysis of the ensemble model’s performance across three classes of a dataset, revealing both strengths and areas for improvement. The model demonstrates high accuracy, particularly in the classification of Class 1, where 5.737 instances are correctly identified. However, 322 instances are misclassified as Class 2, and 61 as Class 3, indicating some minor overlap between these classes. Similarly, Class 2 shows a strong correct classification with 5.548 instances, though 311 are mistakenly classified as Class 1, and 261 as Class 3. For Class 3, 5.532 instances are correctly predicted, with 60 misclassified as Class 1 and 285 as Class 2. The confusion between Classes 2 and 3 is more pronounced, suggesting a greater overlap or similarity between these categories.
This pattern of confusion, particularly between Classes 2 and 3, can be attributed to inherent challenges in the dataset, likely due to overlapping characteristics in the data points. That is, the higher rate of confusion between Class 2 and Class 3 suggests that these two classes share more similarities, making them harder for the model to distinguish accurately. Such an overlap can be challenging to address due to the nature of the data collection that came out from the responses of helpdesk staff, where subjectivity or human error can play a significant role. In our context, the services being categorized are quite often related, and this relationship can lead to confusion in the ways the information is classified. This is exactly what seems to be happening with Classes 2 and 3. For example, services categorized under Class 2 and Class 3 may share similarities that make them harder to distinguish, especially given the aforementioned depiction where human input plays a role in labeling the data. A technician may classify a fault differently depending on the situation or specific details of a case, leading to inconsistencies even for similar issues. This dynamic is exemplified by scenarios where faults such as internet issues and email service problems, while related, might be classified into different categories depending on the technician’s interpretation at that moment. For example, suppose that a user registers a fault with a problem described as follows: “I don’t have internet via GSN, and I don’t receive email from the GSN service”. Then, it is possible that the technician, depending on the weight that they momentarily decide to attribute to the query, classifies the query either in the second category or in the third. Thus, it is possible, as well as highly probable, that at different times, even the same technician may categorize similar cases differently.
The confusion matrix thus highlights that while the ensemble model is highly effective in classification, some persistent misclassifications remain. These errors, particularly between Classes 2 and 3, may result from the aforementioned overlapping that is inherent to the data passed to the probabilistic predictions of the base models used in the ensemble. Soft voting, employed to consolidate predictions, helps stabilize the overall output but does not entirely eliminate errors when base models struggle to differentiate between similar classes. In summary, the ensemble model demonstrates strong predictive abilities, as indicated by the high number of correct classifications in all categories. However, the above-mentioned confusion between Classes 2 and 3 is a potential area for improvement, which could be addressed through better practices in data formulation or the design of specific categorization strategies. Nevertheless, the high efficiencies observed in the result metrics support the argument in favor of the overall performance of the model, despite the fact that such overlaps suggest that further improvement is needed to minimize classification errors.
5. Conclusions
In this work, we investigated the incorporation of contextual transformer-based ensembles in conjunction with classic machine learning and soft voting ensembles for categorizing user questions, inquiries, and statements written in Greek and posed in the helpdesk of the Greek School Network. Given the domain-specific data, we investigated the suggestion that for specialized fields like ours, domain-specific enhancements and linguistic representations, together with classic low-cost yet robust machine learning classifiers, can be as efficient as advanced model architectures.
According to what has been presented above, we can say the following. First, we are dealing with a new dataset, the content and form of which constitute the paradigmatic framework for investigation and experimentation towards a real-world application. This work, however, is also a theoretical inquiry into the limits of using large language models for straightforward tasks that do not require particularly fine distinctions. This is simply because we have seen classic machine learning methods perform satisfactorily, with the proposed ensemble scheme constituting a credible alternative to the mere incorporation of transformers. Likewise, the investigation of the Greek vector representations and whether they can be characterized as adequate is also essential. Here, we saw that a multilingual model performs better than the Greek schemes. The latter, however, are subject to the existence of relatively more extensive state-of-the-art models, which, in the context of our problem, are not realistic alternatives, given the size and resources they require. In any case, we can reasonably conclude that the proposed architecture is an extremely reliable methodology in the context of our problem, in terms of performance, cost, and robustness.
There are some main directions for future research. First, a direction for future work will examine the performance of fine-tuned multilingual transformers and assess their performance with regard to classic machine learning baseline models. Another direction for future work concerns the execution of the experimental study in a setup where additional datasets will be used, including the unbalanced version of the current dataset used in our experiments as well as two other datasets on multiclass text categorization concerning Greek School Network textual data. Given the above extensions, another promising direction for future work is the exploration of explainability techniques within the context of Greek text classification. In this line, future research could also focus on incorporating explainable artificial intelligence methodologies, such as SHAP or LIME, to provide various insights into how the models of interest extract their corresponding predictions. These explainability techniques would increase trust in the system and also aid in identifying potential biases or weaknesses in the classification process, especially when applied to less-resourced languages like Greek. Last but not least, future work could also examine models such as spiking neural networks which are quite efficient in natural language processing tasks and examine their performance in Greek language textual data. This constitutes a core direction that our future work will focus on.