Article

A Random PRIM Based Algorithm for Interpretable Classification and Advanced Subgroup Discovery

Equipe AMIPS—Ecole Mohammadia d’Ingénieurs, Mohammed V University in Rabat, Avenue Ibn Sina, Agdal, Rabat BP765, Morocco
*
Author to whom correspondence should be addressed.
Algorithms 2024, 17(12), 565; https://doi.org/10.3390/a17120565
Submission received: 24 September 2024 / Revised: 1 December 2024 / Accepted: 2 December 2024 / Published: 10 December 2024
(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))

Abstract

Machine-learning algorithms have made significant strides, achieving high accuracy in many applications. However, traditional models often need large datasets, as they typically peel substantial portions of the data in each iteration, complicating the development of a classifier without sufficient data. In critical fields like healthcare, there is a growing need to identify and analyze small yet significant subgroups within data. To address these challenges, we introduce a novel classifier based on the patient rule-induction method (PRIM), a subgroup-discovery algorithm. PRIM finds rules by peeling minimal data at each iteration, enabling the discovery of highly relevant regions. Unlike traditional classifiers, PRIM requires experts to select input spaces manually. Our innovation transforms PRIM into an interpretable classifier by starting with random input space selections for each class, then pruning rules using metarules, and finally selecting definitive rules for the classifier. Tested against popular algorithms such as random forest, logistic regression, and XG-Boost, our random PRIM-based classifier (R-PRIM-Cl) demonstrates comparable robustness, superior interpretability, and the ability to handle categorical and numeric variables. It discovers more rules in certain datasets, making it especially valuable in fields where understanding the model’s decision-making process is as important as its predictive accuracy.

1. Introduction

Recently, there has been a significant breakthrough in the accuracy of machine-learning models, which has revolutionized various industries, including finance [1,2] and healthcare [3,4]. These models are very good at spotting patterns in large datasets and making predictions. However, in spite of their outstanding performance, interpretability still presents a big obstacle [5]. Deciphering the logic behind these models’ forecasts gets harder as they get more intricate. Particularly in areas where transparency is essential, this opacity restricts our ability to fully trust and use these models [6].
Determining rules that can yield practical insights is one area where interpretability is particularly important. Conventional machine-learning models frequently perform exceptionally well on classification tasks, but they struggle to identify and explain the underlying principles behind their predictions [7,8]. This gap shows the need for approaches that improve predictive accuracy while also providing comprehensible rules.
Among the various techniques employed in classification, such as tree-based and rule-based methods, one notable approach is known as bump hunting [9]. This method focuses on identifying significant deviations or bumps in the data that can reveal important subgroups or patterns, and it lies under the subgroup-discovery domain. Subgroup discovery is a data mining technique that looks for statistically significant and interpretable subgroups in a dataset where the target variable exhibits unusual behavior [10]. Localized and exploratory in nature, subgroup discovery differs from traditional techniques like logistic regression [11] and decision trees [12], which seek to develop a global predictive model for the complete dataset. In the data, it looks for particular patterns or areas that offer significant insights; these are frequently stated as straightforward rules, for example, “age > 40 AND income < $50,000”. While logistic regression focuses on modeling relationships between variables using a single equation, frequently at the expense of interpretability, decision trees prioritize overall classification accuracy and may overlook smaller meaningful subgroups due to their hierarchical splitting process [13,14]. Contrarily, subgroup discovery concentrates on identifying statistically significant and actionable patterns within particular data subsets [15]. This makes it especially useful for fields where it is crucial to comprehend and explain local patterns, such as marketing, policy analysis, and healthcare. In this context, the patient rule-induction method (PRIM) emerges as a promising tool. Developed by Friedman and Fisher [9], PRIM is a well-regarded algorithm for subgroup discovery, as stated in [16], known for its ability to identify and describe subgroups with distinct characteristics. These regions have rectangular shapes; hence, they are called “boxes”.
PRIM is highly valued for its ability to generate actionable insights through interpretable rules, making it a valuable tool in various fields, including medicine, finance, and marketing. However, while PRIM excels in revealing valuable subgroups, it is not inherently designed as a classifier. This limitation presents an opportunity to enhance PRIM’s utility by developing a classifier based on its principles.
In this paper, we propose a novel approach that leverages the strengths of PRIM to build a robust bump-hunting rule-based classifier. By integrating PRIM’s rule-discovery capabilities with classification techniques, we aim to create a model that not only provides accurate predictions but also offers interpretable and actionable insights into the data. This advancement promises to bridge the gap between high accuracy and interpretability, ultimately contributing to more effective and trustworthy machine-learning applications.
This work aims to provide a new approach to obtain a robust classification technique and, at the same time, a description of the dataset to find new patterns. Because PRIM has a transparent process, combining it with metarules gives transparency and interpretability for the predictive model.
In summary, the contributions of our paper are as follows:
-
The original PRIM searches only in the feature space chosen by the expert; in our work, the algorithm chooses multiple random feature spaces.
-
Once the rules of each class label are discovered, we innovate in pruning the rules using metarules: not only is the ruleset interpretable, but no rule is completely removed from it. Metarules are association rules whose items are the generated rules themselves.
-
To build the classifier, the selection of the final rules is essential. We draw on the literature, especially the original CART and CBA papers, to select the final rules that will be part of the final model. Having the algorithm choose only the optimal boxes in each peel makes the selection fall on the first rules with the most significant coverage, support, and confidence.
-
We test our random PRIM-based classifier (R-PRIM-Cl) on ten well-known datasets to validate the classifier and identify future challenges, and we compare the results to the three well-known and well-established classifiers: random forest, logistic regression, and XG-Boost. We use four metrics to evaluate the performances: accuracy, precision, recall, and F1-score.
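For reference, the four evaluation metrics can be computed as in the following minimal Python sketch (in practice, scikit-learn's accuracy_score, precision_score, recall_score, and f1_score serve the same purpose):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```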

2. Related Works

In this section, we first present an overview of the patient rule-induction method; second, some related works that motivated the research; and finally, the algorithms chosen for the final comparative experiment results.

2.1. The Patient Rule-Induction Method

2.1.1. Overview of PRIM

Introduced by Friedman and Fisher in 1999 [9], the patient rule-induction method is an algorithm used for identifying and interpreting patterns in data, relying on the procedure of bump hunting. PRIM is a flexible and interpretable method that aims to uncover regions in the feature space where the response variable is significantly different from the overall mean or median. PRIM is particularly useful for identifying segments or subgroups within data that exhibit particular characteristics or behaviors. It is often applied in domains such as medicine, finance, and marketing to find patterns that could help with decision-making and understanding of the data. PRIM is therefore considered a subgroup-discovery method, since it contributes to pattern discovery in datasets. This method relies on two main steps: top-down peeling and bottom-up pasting. In PRIM, the rules are shaped as rectangles, and hence, they are named boxes. Figure 1 shows the construction of the first box containing the red data. As observed, the construction is performed iteratively, and the refinement at each iteration does not peel a lot of data; hence the name patient given to the algorithm. This procedure ensures that no area of the data is missed.
To better understand the search in the input space according to the type of variables, numeric or categorical, here is a more formal statement of the problem:
Let p be the number of input variables, denoted x1, x2, …, xp, and let y be the target variable in the same dataset. If y is a multiclass variable, the problem has to be reformulated as binary subproblems. Let Sj denote the set of possible values of xj (j = 1, …, p). The input space is then:
S = S1 × S2 × … × Sp
The objective is to find subregions R ⊂ S for which:
ȳ_R ≫ ȳ
where ȳ is the global mean of y and ȳ_R is the mean of y in R.
Let sj ⊆ Sj; we define a box by:
B = s1 × s2 × … × sp
where x ∈ B ⇔ ∧_{j=1..p} (xj ∈ sj). When sj = Sj, we leave xj out of the box definition, since it may take any value in its domain. Figure 2 shows an example of a box defined on two numeric variables, where x ∈ B ⇔ x1 ∈ [a, b] ∧ x2 ∈ [c, d]. Figure 3 shows the definition of a box on categorical variables, where x ∈ B ⇔ x1 ∈ {a, b} ∧ x2 ∈ {c, d}. A box can also be defined by numeric and categorical variables at the same time.
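The box-membership test implied by these definitions can be sketched in Python as follows (a hypothetical helper, not part of the prim package, where a box maps each restricted variable to a numeric interval or a set of admitted categories; unrestricted variables are simply left out of the box):

```python
def in_box(point, box):
    """Return True if `point` (a dict of variable values) lies in `box`.

    `box` maps a variable name to either a (low, high) interval for a
    numeric variable or a set of admitted categories for a categorical
    one; variables absent from `box` are unrestricted (sj = Sj).
    """
    for var, constraint in box.items():
        value = point[var]
        if isinstance(constraint, set):
            if value not in constraint:      # categorical: xj must be in sj
                return False
        else:
            low, high = constraint
            if not (low <= value <= high):   # numeric: xj must be in [low, high]
                return False
    return True
```

For example, with box = {"x1": (2.0, 5.0), "x2": {"a", "b"}}, a point with x1 = 3.0 and x2 = "a" is inside the box regardless of its other variables.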
Following top-down peeling, we are left with a series of boxes, each created by peeling a portion of its predecessor. The search is greedy: at each step we select only the peel that leads to the best sub-box, so box boundaries are established without taking future peels into account. Thus, redrawing the final box’s boundaries can occasionally improve it. Using bottom-up pasting, we recursively enlarge the last box until the next paste causes the value of ȳ in the box to decrease.
PRIM is known as a hill-climbing algorithm because, at each search step, it peels off the single sub-box that improves the target mean the most. The peeling process ends when the current box’s support falls below a user-specified minimum support threshold β0.
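One hill-climbing peeling step can be sketched as follows (a simplified illustration assuming numeric features and a binary target stored under the key "y"; the actual prim package also handles categorical peels and bottom-up pasting):

```python
def peel_once(rows, features, alpha=0.05):
    """One hill-climbing step of PRIM's top-down peeling.

    Among the candidate peels that remove an alpha-fraction of the
    current box from either end of each feature, keep the sub-box with
    the highest mean of the target (the greedy choice).
    Returns (remaining_rows, winning_peel, new_target_mean).
    """
    k = max(1, int(alpha * len(rows)))            # points removed per peel
    best_mean, best_kept, best_peel = -1.0, rows, None
    for f in features:
        ordered = sorted(rows, key=lambda r: r[f])
        candidates = {("low", f): ordered[k:],    # peel k smallest values of f
                      ("high", f): ordered[:-k]}  # peel k largest values of f
        for peel, kept in candidates.items():
            if not kept:
                continue
            mean = sum(r["y"] for r in kept) / len(kept)
            if mean > best_mean:
                best_mean, best_kept, best_peel = mean, kept, peel
    return best_kept, best_peel, best_mean
```

Repeating this step until the support drops below β0 yields the sequence of nested boxes described above.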

2.1.2. PRIM’s Evaluation Metrics

When evaluating a box, there are four metrics to analyze: the support, the coverage, the density, and the dimension of the box.
The first statistic is the coverage. It is the proportion of the target cases (cases of interest) that are in the box compared to the total number of target cases in the dataset. The mathematical expression is as follows:
coverage = (number of target points in the box) / (total number of target points in the dataset)
The starting point of PRIM is the unrestricted full set of cases, which includes all outcomes of interest, and, therefore, the coverage starts at one and drops as the algorithm progresses. If the coverage of a box is high, it means that the box captures a larger fraction of the target cases, and, thus, it identifies the behavior of interest. If the coverage is low, it can lead to more precise and actionable insights, but it may also miss out on broader patterns or be applicable to only a limited set of cases. The coverage helps to evaluate the effectiveness of the box in isolating the desired patterns or responses within the dataset.
The second important statistic is the support. The support of a box is the fraction of points from the dataset that fall into the box. In the Python prim package, it is referred to as mass, and mathematically it is expressed as:
mass = (number of points in the box) / (total number of points in the dataset)
The “mass” aids in determining the size of the box. A higher mass denotes a greater amount of data contained in the box, which means it is less specific but more general, whereas a lower mass indicates a smaller amount of data captured, hence focusing more on a subset that meets the criteria defined by the box. This statistic is crucial for comprehending the trade-off between the box’s specificity and coverage when locating intriguing areas within the data space.
The third statistic to evaluate a box is the density of the box. The density is the equivalent of the confidence in PRIM. It measures how concentrated the target cases are within the box, providing insight into the purity of the box concerning the target cases. Since with PRIM we deal with rectangles or hyper-rectangles, the density makes more sense. The expression of the density is as follows:
density = (number of target points in the box) / (total number of points in the box)
If the density is high, it means that a larger proportion of the points in the box are target cases, suggesting that the box is well-targeted and specific to the desired behavior or outcome. Conversely, a lower density suggests that the box includes a significant number of non-target cases, making it less precise.
The last measurement is the restricted dimensions, or res dim, found in the Python prim package. The quantity of variables that define a box is indicated by this measure. It is helpful to evaluate the box’s complexity and comprehend the number of features that go into identifying the target cases inside the box, as this gives insight into the patterns that are produced and their interpretability and specificity.
The relationships between mass, density, and coverage in PRIM are critical to comprehending how well a box captures the target cases. The density shows the concentration of those target cases within the box, indicating how pure the box is, whereas the coverage measures the percentage of target cases that are included within the box relative to all target cases in the dataset. There is a trade-off between capturing many targets and retaining specificity; a box with high coverage captures many target cases, but its density may be low if it also includes many non-target cases. Conversely, the mass indicates the percentage of the total dataset that the box contains, including both target and non-target cases. The correlation between mass and coverage demonstrates that, since the mass includes all data points inside the box, not just the targets, a high mass does not always imply a high coverage. In order to effectively isolate the target cases and minimize the inclusion of non-target cases, a box with high coverage but low mass is preferred. This makes the box efficient and specific. When defining the boundaries of a box in PRIM, it is helpful to understand these relationships in order to assess the appropriate balance between specificity and generality.
To understand better, let us suppose, as an example, that we have a dataset with 1000 instances and a two-dimensional input space where each dimension ranges from 0 to 100, and we found two boxes, box 1 containing 200 points and having a support of 20% and a coverage of 36% and box 2 containing 50 points with a support of 5% and a coverage of 0.01%. These measures show that box 1 is more general since it has a higher coverage and a higher mass than box 2, which is more specific. Hence, the selection of the boxes depends on the domain knowledge and the search interest.
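Under these definitions, the three statistics can be computed directly from box-membership and target indicators, as in this minimal sketch (hypothetical helper and inputs):

```python
def box_metrics(is_in_box, is_target):
    """Coverage, mass (support), and density of a box.

    `is_in_box` and `is_target` are parallel boolean lists over the
    whole dataset, following the definitions given above.
    """
    n = len(is_in_box)
    points_in_box = sum(is_in_box)
    targets_total = sum(is_target)
    targets_in_box = sum(1 for b, t in zip(is_in_box, is_target) if b and t)
    coverage = targets_in_box / targets_total   # share of all targets captured
    mass = points_in_box / n                    # share of the dataset in the box
    density = targets_in_box / points_in_box    # purity of the box
    return coverage, mass, density
```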

2.2. Metarules

In 2007, Berrado and Runger [17] presented a process for handling the massive quantity of association rules, many of which are redundant and overlapping, that are produced from high-dimensional data. The notion of metarules was then presented as the rules that characterize connections among the identified association rules themselves. In order to improve their interpretability and utility for data analysts, this method seeks to restructure, group, and prune the association rules.
The problem of data sparseness in high-dimensional spaces, which results in the creation of numerous redundant rules, was highlighted by the authors. Traditional rule-summarization techniques often fail on sparse datasets because they are inefficient at recognizing and removing redundant rules. The metarule-based technique described in the original paper groups and simplifies the rule sets by finding relationships between the rules. By using this method, the number of rules is decreased, and they are arranged into more easily understood subgroups, as illustrated in Figure 4. Treating each association rule as an item within a transaction, the paper describes the generation of metarules using the same algorithm applied to the original data. Then, using these metarules, more specific and corresponding rules are found and grouped. The entire rule set can be made more readable and simpler by examining these metarules and removing any redundant or excessively specific rules. The efficiency of this method is illustrated by the notable reductions in rule complexity the authors report on both real manufacturing data and benchmark datasets. Compared to distance-metric clustering techniques, the metarule approach performs well, demonstrating that it can uncover the relationships between rules rather than just clustering them according to proximity.
The finding of a metarule is a process where rules are considered items. The first step is the establishment of a binary matrix with the examples and the rules. If the instance is in a rule, then it is a one; else, it is a zero. Once we obtain the transaction matrix, like in the basket analysis problem, we apply the APRIORI algorithm to find associations between the items, which are the rules in our case. This technique allows two things: the first is to find the association between the rules of the same class label, and the other one is to find if there is an association between the rules of different class labels. With this technique, we provide the expert with a new organization of the rules. Figure 5 illustrates the process of creating metarules.
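This construction can be sketched as follows. The illustrative snippet below builds the binary instance-by-rule matrix and, for brevity, mines only pairwise frequent itemsets; an actual implementation would run a full APRIORI (for example, mlxtend's apriori) on the transaction matrix. The function name and inputs are hypothetical:

```python
from itertools import combinations

def metarule_pairs(rule_members, n_instances, min_support=0.2):
    """Mine frequent rule pairs from the binary instance-by-rule matrix.

    `rule_members` maps a rule name to the set of instance indices it
    covers. Instance i "contains" rule r (a one in the matrix) iff i
    satisfies r; rule pairs whose co-occurrence support reaches
    `min_support` are returned as candidate metarules.
    """
    rules = sorted(rule_members)
    # Binary matrix: matrix[i][r] = 1 iff instance i falls in rule r's box.
    matrix = [[1 if i in rule_members[r] else 0 for r in rules]
              for i in range(n_instances)]
    pairs = {}
    for r1, r2 in combinations(range(len(rules)), 2):
        co = sum(1 for row in matrix if row[r1] and row[r2])
        support = co / n_instances
        if support >= min_support:
            pairs[(rules[r1], rules[r2])] = support
    return pairs
```

A pair of associated rules carrying different class labels flags an overlapping region, as discussed below.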
This method has been used as a pruning technique for other rules in machine learning than just the association rules. In [18,19,20,21], the authors apply the concept of the metarules to optimize and increase the interpretability of their models. They use the rules generated by random forest or R-CAR as the items in the dataset, and then they apply the APRIORI algorithm to find the associations between the rules.
Metarules, all things considered, offer an organized method of managing the numerous association rules produced by high-dimensional data, improving the results’ actionability and comprehension for analysts. Using domain expertise to inform future refinement, this approach prioritizes data-driven organization.
We propose to use metarules for two tasks: box pruning and overlapping-region detection. Metarules appear to be the best method for pruning: besides allowing us to present the general boxes as the classifier, they preserve the more specific rules for the analyst, enabling knowledge discovery that may aid in the detection of novel phenomena in the data. All of this occurs without removing any rules from the ruleset. Regarding the identification of overlapping regions, the association between rules is determined by the number of identical instances found in both. This means that, as Figure 4b illustrates, if two rules from two distinct class labels are associated, some population overlaps in both regions. The overlapping regions can then be dealt with to create a precise classifier.

2.3. Prior Works and Motivation

In recent research, Ghasemkhani et al. [22] presented a novel classification method called the multi-class multi-label logistic model tree, combining logistic regression and decision trees. According to the paper, the algorithm is robust in multi-label classification. Another work comparing well-established classification techniques to predict dementia can be found in [23]. The comparison included decision tree, random forest, ANN, SVM, naïve Bayes, and KNN. The findings show the promise of EEG-based machine learning for dementia diagnosis, but they also draw attention to the need for better techniques and bigger datasets to differentiate between subtypes of dementia.
Other notable recent works include [24], which explores the application of unsupervised machine-learning models to analyze electronic health records. It compares latent Dirichlet allocation with a novel approach called the Poisson–Dirichlet model for the discovery of meaningful subgroups within the data. Other works that highlight the importance of having a predictive method that can uncover patterns can be found in [25,26,27].
The PRIM-related works can be categorized into two sets: the improvements and the applications. The improvements include the use of principal component analysis (PCA) combined with PRIM [28]. Another improvement concerns the second step of PRIM, the bottom-up pasting, where the box is expanded one dimension at a time to optimize it by expanding it two dimensions at once [29]. Also, in [30], the researchers use a modified PRIM to try to statistically identify significant subgroups based on clinical and demographic characteristics. Other improvements or innovations involve optimizing industrial processes and the mean and standard deviation by modifying PRIM [31,32]. On the other hand, the applications of PRIM are the main source that motivated this research. Indeed, PRIM is mostly used in the healthcare field, where the discovery of subgroups is of high interest. For example, in [30], the purpose of the study was to find the groups of patients for which the trial was a failure, and in [33], PRIM was used with other algorithms to find interesting rules to define an unstable slope. Against M5’, group method of data handling (GMDH), and multivariate adaptive regression splines (MARS), PRIM was one of the most effective and practical algorithms.
More detailed literature reviews can be found in [34,35].
As detected in most of the related works, the redundant observations that motivated this research are the following:
-
PRIM requires too many interactions with the expert because of the choice of the feature search space.
-
In the case of a large number of rules, if the expert chooses multiple search spaces, the interpretability is lost, and so is the explainability, since if the rules are not interpretable, they cannot be evaluated according to the domain knowledge.
-
PRIM lacks a classifier to use the discovered rules as a predictive model for future predictions.

2.4. Selected Algorithms for the Comparison

To assess the effectiveness of our approach, we contrast our R-PRIM-Cl in Section 4 with three of the most well-known and reputable algorithms in the field of data classification. Although the comparison could be expanded to other algorithms, it is instructive to compare bagging (random forest, RF), boosting (XG-Boost, XGB), bump hunting (the random PRIM-based classifier), and a linear model (logistic regression, LR). Indeed, comparing a new classification algorithm with established algorithms like random forest, XG-Boost, and logistic regression is essential because these well-known models represent different strengths, weaknesses, and approaches to classification, making them valuable benchmarks.
The random forest [36] ensemble technique efficiently handles large high-dimensional datasets by combining multiple decision trees to increase the accuracy, robustness, and resistance to overfitting [37]. One way to evaluate how well a new algorithm captures complex patterns and interactions in the data is to compare it with random forest.
XG-Boost [38], with its exceptional accuracy, speed, and efficiency, excels in machine-learning comparisons and real-world applications. It is especially good at handling large datasets, missing values, and imbalanced data [39]. Assessing the new algorithm’s speed, scalability, and accuracy against one of the most competitive models out there is made possible by comparing it to XG-Boost.
In contrast, the widely used technique known as logistic regression [11] is easy to understand, straightforward, and a reliable baseline because it can effectively simulate linear relationships between inputs and outputs. It offers a simple point of comparison to assess whether the new algorithm’s added complexity represents a significant improvement over a conventional and uncomplicated method.
The range of machine-learning techniques covered by these algorithms, which include boosting, tree-based ensembles, and linear modeling, is quite extensive, ranging from simple to complex. Its performance across various data distributions, feature interactions, and problem complexity is demonstrated through benchmarking against them, which aids in establishing the new algorithm’s efficacy, adaptability, and competitive edge. Additionally, these comparisons show where R-PRIM-Cl sits on the interpretability versus predictive power spectrum: while random forest and XG-Boost offer high accuracy with less interpretability, logistic regression offers high interpretability. Thus, comparing R-PRIM-Cl to logistic regression, XG-Boost, and random forest offers a thorough understanding of its advantages, disadvantages, and possible uses while placing it in the context of classification models as a whole.

3. Proposed Methodology

In this section, we will first introduce the algorithm and then present an example to illustrate our approach.

3.1. Random PRIM Based Classifier

Based on the search strategy provided by PRIM, our algorithm includes additional steps to construct the classifier. The expert is still in charge of determining PRIM’s parameters, though; indeed, not setting any support threshold can be beneficial to subgroup discovery. These parameters are determined based on the particular field of work and the experimentation conducted by the analysts. The peeling criterion, minimum support, and pasting criterion in PRIM need to be adjusted domain-specifically to match the objectives of the analysis and the particulars of the data; they are not entirely automated. Important trade-offs in subgroup discovery, like purity versus coverage, are controlled by these parameters, which must strike a balance between statistical significance and usefulness. The peeling criterion, for instance, dictates how aggressively data points are eliminated, affecting the purity of subgroups, whereas the minimum support guarantees that subgroups are sizable enough to be meaningful without being so stringent that significant but uncommon patterns are eliminated. Refining subgroup boundaries through the pasting criterion requires careful adjustment to prevent overgeneralization. Appropriate settings vary with the dataset and task; for example, large, high-dimensional datasets benefit from more relaxed settings, while small, noisy datasets may require stricter thresholds. Although automating these parameters might standardize results, it might not reveal useful patterns that are essential for practical applications. By permitting experts to modify these criteria, PRIM guarantees adaptability, resilience, and congruence with the particular priorities of the problem at hand. It is worth noting that PRIM sets these parameters by default to 5% for the peeling and pasting thresholds. If the support is not set, PRIM continues searching for a region until it reaches an empty box as a stopping criterion.
The detailed steps of the random PRIM-based classifier (R-PRIM-Cl) are outlined below.
1. Initialization step
(a)
A set of N data instances
(b)
A set of P categorical or numeric variables as X = {x1, x2, …, xp}
(c)
A binary target variable Y ∈ {0, 1}
(d)
Define the minimum support, peeling, and pasting thresholds {s, α, β}
2.
Procedure
(a)
Random choice of the feature search spaces
(b)
Implementation of PRIM on each subspace for each class label
(c)
Find boxes with the metrics
(d)
Implement metarules to find the associations between rules
(e)
Execute cross-validation on 10 folds
(f)
Retain the rules or metarules based on cross-validation, the support, the density, and the coverage rates
(g)
Calculate the accuracy, recall, precision, and F1-Score to validate the model
3.
Output
(a)
The set of all boxes found to allow the discovery of new subgroups
(b)
The set of selected boxes from the metarules pruning step
(c)
The final box measures: coverage, density, support, and dimension
(d)
The final model metrics: accuracy, precision, recall, and F1-score
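Steps (a)–(c) of the procedure above can be illustrated with a simplified, self-contained sketch (numeric features and a binary target only; function names are hypothetical; the metarule pruning, cross-validation, and final selection of steps (d)–(g) are omitted):

```python
import random

def peel_box(rows, features, alpha=0.05, min_support=0.1):
    """Top-down peeling restricted to `features` (numeric only, for
    brevity). Repeatedly applies the peel that maximizes the mean of the
    binary target "y", stopping when the box's support would fall below
    `min_support` or when no peel improves the target mean."""
    n_total = len(rows)
    inside = list(rows)
    while True:
        k = max(1, int(alpha * len(inside)))        # points removed per peel
        if (len(inside) - k) / n_total < min_support:
            break
        current_mean = sum(r["y"] for r in inside) / len(inside)
        best_mean, best_kept = current_mean, None
        for f in features:
            ordered = sorted(inside, key=lambda r: r[f])
            for kept in (ordered[k:], ordered[:-k]):  # peel low / high end
                mean = sum(r["y"] for r in kept) / len(kept)
                if mean > best_mean:
                    best_mean, best_kept = mean, kept
        if best_kept is None:                        # hill-climbing stops
            break
        inside = best_kept
    box = {f: (min(r[f] for r in inside), max(r[f] for r in inside))
           for f in features}
    density = sum(r["y"] for r in inside) / len(inside)
    return box, density, len(inside) / n_total

def random_prim_rules(rows, all_features, n_spaces=4, dims=2, seed=0):
    """Steps (a)-(c): draw random feature subspaces and run the peeling
    on each, collecting (box, density, support) rules."""
    rng = random.Random(seed)
    return [peel_box(rows, rng.sample(all_features, dims))
            for _ in range(n_spaces)]
```

Because only mean-improving peels are accepted, every returned box has a density at least equal to the global target mean.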

3.2. Construction and Validation of the Classifier

As we intend to create a classifier based on the boxes generated for each class by PRIM, which falls among the subgroup-discovery algorithms, we have investigated the original Classification and Regression Trees (CART) book by Breiman et al. [12] to understand the process of validating a classifier.
Let x1, x2, …, xp be the p features in our dataset.
Let X be the vectorial space of all the measures induced by the features.
Let G be the number of object classes and C = {1, …, G} the set of classes in our dataset.
A classifier is then defined as a function f(x) defined on X so that for every x, f(x) is equal to one of the numbers 1, …, G.
Another way of looking at it is to define Sg as the subset of X on which f(x) = g, that is:
Sg = {x : f(x) = g}
S1, …, SG are disjoint and X = ∪g Sg, so S1, …, SG form a partition of X.
In that context, a classifier can also be defined as a partition of X into G disjoint subsets such that for every x ∈ Sg the predicted class is g.
The most important criterion for a good classification procedure is to not only produce accurate classifiers, within the limits of the data, but also to provide insight and understanding into the predictive structure of the data. In other words, to focus on interpretability and explainability as much as we focus on the accuracy and precision of the results.
According to the literature, one of the limits in constructing a good classifier is the dimensionality of the dataset. In fact, the larger the dimensionality, the sparser and more spread apart the data points are, a phenomenon Bellman termed the curse of dimensionality in 1961 [40]. One of the advantages of the bump-hunting procedure is its adaptation to high dimensionality: since it peels one dimension at a time, it is able to cover the whole space and spot subgroups according to the input space chosen.
In [12], Breiman et al. validated CART as a classifier by estimating the probability of misclassifying a sample. They set out three types of internal estimates to evaluate the performance of the classifier. The first is the resubstitution estimate, which consists of counting the misclassified cases and is known to be optimistically biased. The second is the test-sample estimate, where the dataset is divided into two independent samples: the classifier is built on the first, and the misclassification rate is measured on the second. Its main drawback is the reduction of the effective sample size, so a bigger dataset is needed to ensure that the test is reliable. The last procedure, and the best approach, is cross-validation, which consists of resampling the data into different portions to train the model and test the classifier.
To ensure that our classifier is valid, we have also investigated CBA, classification based on association [41]. Validating a CBA classifier involves three steps. The first is sorting the rules by support and confidence: rules with greater support and confidence are given precedence. The second is pruning the rules to prevent overfitting and redundancy, which entails removing rules that are superfluous or less important; for instance, if two rules have the same antecedent but different consequents, the rule with the higher confidence is kept. The last stage is to construct the classifier by applying the rules in priority order: when a new instance needs to be classified, the classifier scans the rules one after the other until it finds one whose antecedent matches the instance's attributes. This is done for every rule; the classifier's total number of errors is then computed, and the process ends when neither a rule nor any training data remain. Finally, the rules that do not increase the classifier's accuracy are removed.
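The CBA-style ordering and antecedent-level pruning described above can be sketched as follows; the dictionary rule representation and function names are illustrative choices of ours, not the CBA implementation of [41].

```python
def cba_order(rules):
    """Order class-association rules CBA-style: higher confidence first,
    ties broken by higher support (full CBA breaks remaining ties by
    earlier discovery order)."""
    return sorted(rules, key=lambda r: (-r["confidence"], -r["support"]))

def prune_redundant(rules):
    """Keep only the best-ranked rule for each antecedent, dropping
    lower-confidence rules that share it (e.g. same antecedent,
    different consequent)."""
    kept, seen = [], set()
    for r in cba_order(rules):
        key = frozenset(r["antecedent"])
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

A new instance is then classified by the first rule in the ordered, pruned list whose antecedent it satisfies.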
In our classifier, since the approach is a bump-hunting method, we have two types of boxes: general boxes and specific boxes. Our algorithm can classify instances and, at the same time, allow the discovery of subgroups. Thus, the classifier selects the general boxes as the final rules, but since we use metarules to prune the boxes, the majority of boxes are not removed, as illustrated by Figure 6; they are all organized using the associations between them, which makes reading easier. Because the coverage, the support, and the density are equally important in PRIM, the selection relies on all three instead of only on the support and the confidence. Overlapping rules are detected in the metarules step and can be handled by evaluating the coverage, since it represents the proportion of positive targets in the box relative to the number of positive targets in the whole dataset; if the coverage is high, the instance should be classified as positive. We are still investigating other techniques, such as Bayes' theorem, to ensure that rule conflict is well handled.

3.3. Illustrative Example

To illustrate the functioning of R-PRIM-Cl, we used the Iris dataset in Python. The low dimensionality, balanced classes, and ease of use of the Iris dataset make it a popular choice for quick experimentation and visualization of classifiers, and a convenient way to guide readers through the example. It offers a useful test of a classifier's capacity to discern between three distinct but somewhat overlapping classes, and its compact size enables rapid training and testing. Extensively documented and widely used in the machine-learning community, it is a trustworthy benchmark for evaluating new classifiers and a good starting point before working with more complicated datasets, as in Section 4.
We used the prim package in Python with additional functions that we developed to select random features and create matrices that contain the final results. For illustration, we set the prediction target to setosa (1) versus not setosa (0).
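The random feature-subspace step can be sketched as below; the helper name, the number of subspaces, and the seeding are our illustrative choices, not part of the prim package or of our released code.

```python
import random

def random_subspaces(features, n_spaces=5, seed=42):
    """Draw n_spaces distinct random feature subsets (size >= 2);
    PRIM is then run independently on each subset for each class label."""
    rng = random.Random(seed)
    spaces = []
    while len(spaces) < n_spaces:
        size = rng.randint(2, len(features))
        space = sorted(rng.sample(features, size))
        if space not in spaces:   # avoid duplicate subspaces
            spaces.append(space)
    return spaces

features = ["sepal length (cm)", "sepal width (cm)",
            "petal length (cm)", "petal width (cm)"]
```

With the four Iris features there are eleven possible subsets of size two or more, so drawing five distinct subspaces always terminates.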
The detailed steps for R-PRIM-Cl to predict whether a flower is a setosa are as follows:
1. Initialization step
(a) A set of 150 flowers from the Iris dataset
(b) A set of four numeric variables, X = {sepal length (cm), sepal width (cm), petal length (cm), petal width (cm)}
(c) Y being a three-class target, we coded setosa as 1 and versicolor and virginica as 0, hence Y ∈ {0, 1}
(d) Minimum support, peeling, and pasting thresholds: s = 10%, α = 5%, β = 5%
2. Procedure
(a) Random choice of the feature search spaces by the algorithm:
[[‘sepal length (cm)’, ‘sepal width (cm)’],
[‘sepal length (cm)’, ‘petal width (cm)’],
[‘sepal width (cm)’, ‘petal length (cm)’, ‘petal width (cm)’],
[‘sepal width (cm)’, ‘petal length (cm)’],
[‘petal length (cm)’, ‘petal width (cm)’]]
(b) Implementation of PRIM on each subspace for each class label
(c) Find the boxes, with the metrics displayed in Table 1, before the metarules and cross-validation steps
Table 1. The boxes generated by R-PRIM-Cl before pruning and selecting.

| Index | Box | Coverage | Density | Res Dim | Mass |
|---|---|---|---|---|---|
| R1 | 4.3 < sepal length (cm) < 5.35 AND 2.95 < sepal width (cm) < 4.4 | 0.78 | 1.00 | 2 | 0.26 |
| R2 | 4.3 < sepal length (cm) < 5.45 AND 2.15 < sepal width (cm) < 4.4 | 0.15 | 0.50 | 2 | 0.10 |
| R3 | 3.349 < sepal width (cm) < 4.4 | 0.08 | 0.38 | 1 | 0.07 |
| R4 | 4.3 < sepal length (cm) < 5.95 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R5 | 1.0 < petal length (cm) < 3.75 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R6 | 1.0 < petal length (cm) < 1.79 | 0.95 | 1.00 | 1 | 0.32 |
| R7 | 3.25 < sepal width (cm) < 4.4 AND 1.0 < petal length (cm) < 6.05 | 0.05 | 0.25 | 2 | 0.07 |
| R8 | 1.0 < petal length (cm) < 3.75 AND 0.1 < petal width (cm) < 0.8 | 1.00 | 1.00 | 2 | 0.33 |
| R9 | 4.3 < sepal length (cm) < 7.35 AND 0.1 < petal width (cm) < 2.25 | 0.90 | 0.35 | 2 | 0.86 |
| R10 | 3.05 < sepal width (cm) < 4.05 AND 1.35 < petal length (cm) < 5.85 | 0.45 | 0.49 | 2 | 0.31 |
| R11 | 4.3 < sepal length (cm) < 7.80 AND 3.05 < sepal width (cm) < 4.05 AND 1.35 < petal length (cm) < 6.9 AND 0.1 < petal width (cm) < 2.25 | 0.40 | 0.52 | 4 | 0.26 |
| R12 | 6.55 < sepal length (cm) < 7.80 AND 2.15 < petal width (cm) < 2.5 | 0.10 | 0.44 | 2 | 0.08 |
(d) Implement metarules to prune the boxes and detect overlapping
No overlapping detected.
Metarules detected for a density of 100%:
R2 => R1
R3 => R7
R4, R8 => R5 and R5, R8 => R4 and R5, R4 => R8
R6 => R8
(e) Execute cross-validation on 10 folds
(f) Retain the rules or metarules based on the cross-validation and on the support, density, and coverage rates, as displayed in Table 2
(g) Calculate the accuracy, recall, precision, and F1-score to validate the model:
Accuracy: 0.97; Precision: 0.83; Recall: 1.00; F1-score: 0.91
Table 2. The final boxes retained for the classifier.

| Index | Box | Coverage | Density | Res Dim | Mass |
|---|---|---|---|---|---|
| (R4, R5, R8) | 4.3 < sepal length (cm) < 5.95 AND 0.1 < petal width (cm) < 0.8 AND 1.0 < petal length (cm) < 3.75 | 1.00 | 1.00 | 3 | 0.33 |
| R6 | 1.0 < petal length (cm) < 1.79 | 0.95 | 1.00 | 1 | 0.32 |
| R1 | 4.3 < sepal length (cm) < 5.35 AND 2.95 < sepal width (cm) < 4.4 | 0.78 | 1.00 | 2 | 0.26 |
3. Output
(a) The set of all boxes found, to allow the discovery of new subgroups: (R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12)
(b) The set of rules selected in the metarules pruning step: (R1, R4, R5, R6, R8), with R4, R5, and R8 covering the same data and R6 included in R8
(c) The final box measures: coverage, density, support, and dimension. Ex: R1 (coverage = 78%, density = 100%, support = 26%, dimension = 2)
(d) The final model metrics: accuracy = 96.7%; precision = 95%; recall = 100%; F1-score = 97.4%
As illustrated by this simple example, R-PRIM-Cl offers at each step the possibility to evaluate the boxes and to reconsider them in light of expert and domain knowledge. The metarule (R4, R5, R8) means that rules R4, R5, and R8 cover the same data in the feature space, hence the containment between the three rules. The final rules retained by the classifier to predict whether a flower is a “setosa” (class = 1) are:
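A containment metarule such as R6 => R8 can be checked mechanically on the data: every instance covered by the first box must also be covered by the second. The box representation below (a dict of per-feature intervals) and the function names are an illustrative sketch, not the metarules implementation of [17].

```python
def matches(box, instance):
    """True when the instance falls inside every interval of the box."""
    return all(lo <= instance[f] <= hi for f, (lo, hi) in box.items())

def implies(a, b, data):
    """Metarule a => b: every instance covered by box a is also covered by b."""
    covered = [x for x in data if matches(a, x)]
    return bool(covered) and all(matches(b, x) for x in covered)

# Boxes R6 and R8 from Table 1; the three flowers are hypothetical measurements.
R6 = {"petal length (cm)": (1.0, 1.79)}
R8 = {"petal length (cm)": (1.0, 3.75), "petal width (cm)": (0.1, 0.8)}
data = [
    {"petal length (cm)": 1.4, "petal width (cm)": 0.2},
    {"petal length (cm)": 1.7, "petal width (cm)": 0.4},
    {"petal length (cm)": 2.5, "petal width (cm)": 0.5},
    {"petal length (cm)": 4.5, "petal width (cm)": 1.5},
]
```

Here `implies(R6, R8, data)` holds while `implies(R8, R6, data)` does not, reproducing the one-way containment R6 => R8.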
- If (4.3 < sepal length (cm) < 5.95 AND 0.1 < petal width (cm) < 0.8 AND 1.0 < petal length (cm) < 3.75) THEN flower = ‘setosa’
- If (1.0 < petal length (cm) < 1.79) THEN flower = ‘setosa’
- If (4.3 < sepal length (cm) < 5.35 AND 2.95 < sepal width (cm) < 4.4) THEN flower = ‘setosa’
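Taken together, the retained boxes act as a one-class rule list: a flower is predicted “setosa” when any box covers it. A minimal sketch, with the bounds copied from Table 2 (the function name is ours):

```python
# The three retained boxes, as dicts of feature -> (low, high) intervals.
RULES = [
    {"sepal length (cm)": (4.3, 5.95), "petal width (cm)": (0.1, 0.8),
     "petal length (cm)": (1.0, 3.75)},
    {"petal length (cm)": (1.0, 1.79)},
    {"sepal length (cm)": (4.3, 5.35), "sepal width (cm)": (2.95, 4.4)},
]

def predict_setosa(flower):
    """Predict 1 ('setosa') when any retained box covers the flower."""
    return int(any(
        all(lo <= flower[f] <= hi for f, (lo, hi) in rule.items())
        for rule in RULES))
```

Because all the retained rules predict the same class, overlaps are harmless here; conflict handling only matters when boxes from different class labels overlap, which is resolved through the coverage as discussed above.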
The other rules have smaller coverage and support and can serve as descriptions of smaller groups in the dataset, as shown by rule R7 (IF 3.25 < sepal width (cm) < 4.4 AND 1.0 < petal length (cm) < 6.05 THEN flower = ‘setosa’), which has a support of 7% and a coverage of 5%. These rules are more specific to the dataset and are presented to the experts as discovered patterns that can be explored according to the domain.
The model’s metrics show that it operates at a high level, making accurate and comprehensive predictions, with perfect recall and high precision in particular. This balance between recall and precision, evidenced by the high F1-score, makes the model promising for applications needing both sensitivity and accuracy.

3.4. Comparison Between R-PRIM-Cl and CART

Breiman et al. introduced classification and regression trees in 1984 [12]. This rule-based classifier divides the dataset recursively into subsets according to feature thresholds in order to create decision trees for classification or regression tasks. Within the resulting groups, each split seeks to maximize the target variable’s homogeneity.
In this section, we compare R-PRIM-Cl and CART on the same Iris dataset for predicting if a flower is a “setosa”.
To test if CART can detect subgroups of “setosa” in the input space with the four features, we left the parameters at their default values and did not limit the tree’s depth. Notably, the purpose of this comparison is to see if CART and R-PRIM-Cl are able to identify as many subgroups of the iris flower “setosa” as possible.
As seen in Section 3.3, R-PRIM-Cl detected twelve rules thanks to the search in different feature spaces. R-PRIM-Cl allowed the discovery of small groups: rule 7 with only 7% support and 5% coverage, rule 3 with 7% support and 8% coverage, and rule 2 with only 10% support and 15% coverage. These rules reveal new trends and patterns in the dataset for subject-matter experts to examine. On the other hand, CART detected only one rule (IF petal length <= 2.45 THEN flower = ‘setosa’) in the feature space.
When the goal is subgroup discovery, the original PRIM algorithm outperforms CART (classification and regression trees) because it is designed to identify and characterize subgroups in the data: PRIM prioritizes finding compact, interpretable regions (boxes) where the target outcome is enriched, whereas CART strives for global predictive accuracy by growing an extensive decision tree.
This targeted approach allows R-PRIM-Cl to isolate meaningful subgroups and, at the same time, build an accurate predictive model. In contrast, CART may fail to differentiate finer patterns.
Furthermore, R-PRIM-Cl offers more straightforward and implementable rules for every subgroup that has been identified, while CART’s output can become excessively complicated with deep trees that do not focus on particular regions.
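The difference in search behaviour can be illustrated in miniature with a CART-style exhaustive threshold search on a single feature; this toy Gini split finder is our simplification for illustration, not CART itself.

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """CART-style search: the threshold on x minimising the weighted
    Gini impurity of the two induced children."""
    order = sorted(zip(xs, ys))
    best_t, best_score = None, float("inf")
    for i in range(1, len(order)):
        t = (order[i - 1][0] + order[i][0]) / 2  # midpoint candidate
        left = [y for x, y in order if x <= t]
        right = [y for x, y in order if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

On the Iris petal lengths such a search recovers the single clean cut near 2.45 cm reported above; it has no mechanism for surfacing the low-support regions that R-PRIM-Cl reports as rules R2, R3, and R7.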

4. Results

In this section, the recently introduced random PRIM classifier is compared with the random forest, XG-Boost, and logistic regression algorithms. The four algorithms are applied to seven well-known datasets and evaluated on several scores. The datasets are described, the methods for tuning and estimating the different models are detailed, and the empirical evaluation results are presented.

4.1. Empirical Setting

Since PRIM works on categorical and numeric attributes, some of the benchmark datasets used in the comparison have only categorical attributes, and others only numeric. The seven datasets used in the test are:
Congressional Voting dataset (Vote)
Mushroom dataset (Mush)
Breast Cancer dataset (Cancer)
SPECT heart dataset (Heart)
Tic-Tac-Toe Endgame dataset (TicTac)
Pima Diabetes dataset (Diabetes)
German Credit Card dataset (Credit)
The seven datasets were obtained from the UC Irvine repository [42]. All the datasets have binary target variables that were coded as 0 or 1. It is interesting to limit the problem space to binary datasets since this allows for a clear assessment of the algorithm’s fundamental behavior and performance before moving on to more intricate multi-class tasks.
In Table 3, we find the characteristics of each dataset: the number of instances, the number of attributes, the class labels, and the class distribution. As seen below, the diversity of the datasets in class distribution and dimensionality is essential to draw an accurate picture of the performance of R-PRIM-Cl.
For the three chosen algorithms, Table 4 displays the tuning of the hyperparameters for their implementation. The best combinations of parameter values are chosen using 10-fold cross-validation. To select the best model for each algorithm, cross-validation was used following the usual machine-learning procedure: the data are split into 80% for the training set and 20% for the testing set, and the algorithm then splits the training set again into 80% to construct the model and 20% to validate it, ten times.
The data preprocessing step included coding the categorical feature into numeric ones for logistic regression, random forest, and XG-Boost, and cleaning missing values.
Accuracy, recall, precision, and F1-score are the four scores that are used to compare the final performances. Each dataset is randomly split into ten folds of roughly equal size to evaluate these scores using 10-fold cross-validation. Nine folds are used for model training, and one-fold is used for testing in each of the ten experiments. The average of the individual scores from the ten experiments is used to estimate the scores.
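The 10-fold protocol can be sketched in a few lines of pure Python; the function names and the shuffling seed are ours, and any scoring function (accuracy here) can be averaged the same way.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=10):
    """Average accuracy over k experiments: train on k-1 folds,
    score on the held-out fold, and average the k scores."""
    scores = []
    for test in kfold_indices(len(X), k):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k
```

Precision, recall, and F1-score are estimated the same way by swapping the per-fold scoring expression.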
The experimentation was conducted using the prim package in Python. The execution machine was a CORE i5 8th Gen with 16 GB of RAM, running Windows 11. Random forest, XG-Boost, and logistic regression were applied in Orange [43], an open-source data-mining, machine-learning, and visualization software created by the University of Ljubljana. Through widgets and/or Python scripts, it is used to create and evaluate machine-learning models and to carry out exploratory data analysis and visualization.

4.2. Results

Table 5 and Table 6 summarize the results of the empirical comparison; they display the datasets and four scores that are measured for the classifiers produced by random forest, XG-Boost, logistic regression, and R-PRIM-Cl. The precision tells us how many of the positive predictions are true, the recall gauges how well the model can identify every positive case, and the F1 score is the harmonic mean of recall and precision that balances the trade-off between the two. The three reflect the importance of classifying positive examples. The most popular metric in machine learning is accuracy, which compares the number of true positives and true negatives to the total number of predictions to determine how accurate the model is. In this section, Table 5 displays the recall and the precision, and Table 6 shows the accuracy and the F1-score.
The recall, precision, F1-score, and accuracy of the R-PRIM-Cl algorithm across multiple datasets in Table 5 and Table 6 show that it performs consistently well, frequently surpassing or closely matching other industry-leading algorithms such as random forest (RF), XG-Boost (XGB), and logistic regression (LG). R-PRIM-Cl performs well in terms of recall, especially on datasets like Cancer, Heart, and Credit, where it attains the highest recall, proving that it is efficient in detecting true positive cases without overlooking pertinent cases. According to the precision results, R-PRIM-Cl frequently has fewer false positives than other models and is very accurate in its positive predictions. This is especially true in datasets like Cancer, Diabetes, and Credit, in which the cost of false positives is high. The robust performance of R-PRIM-Cl is further highlighted by the F1-score, which provides a balance between recall and precision. It consistently manages false positives and false negatives with effectiveness, and it excels in the Cancer and Heart datasets in particular. Figure 7 visualizes the performances of the four algorithms on the seven datasets for each measure: Figure 7a for the accuracy, Figure 7b for the recall, Figure 7c for the precision, and Figure 7d for the F1-score.
Regarding the four primary performance indicators—accuracy, recall, precision, and F1-score—R-PRIM-Cl consistently demonstrates excellent performance, often securing the highest scores in datasets like Cancer, Heart, and Credit. This demonstrates its proficiency in handling intricate datasets while maintaining interpretability. XGB (XG-Boost) performs exceptionally well in precision and F1-score, specifically in simpler datasets like Mush and TicTac, where it consistently achieves scores close to or at 100%. Random forest (RF) consistently delivers satisfactory results across all metrics, but it often falls behind R-PRIM-Cl and XGB in more challenging datasets. Lastly, logistic regression (LG) competes well, particularly in datasets such as Cancer and Credit, but it typically trails behind ensemble methods like XGB and RF, as well as R-PRIM-Cl, in most of the metrics.
With the highest scores in accuracy (97.50%), recall (95.32%), and F1-score (97.00%), R-PRIM-Cl performs better than other algorithms in the Vote dataset. Nonetheless, XGB maintains its competitiveness by producing comparable outcomes in terms of precision (99.33%) and other metrics. The Mush dataset shows that XGB can handle simpler datasets with well-separated classes, as evidenced by its perfect precision (100.00%) and superior performance across most metrics. This dataset consistently shows R-PRIM-Cl as the second-best performer, with impressive precision (96.47%) and F1-score (97.06%).
The Cancer dataset demonstrates how well R-PRIM-Cl performs on increasingly difficult tasks, as it leads in every metric, including F1-score (92.15%), accuracy (90.04%), and precision (92.13%). LG does well in this dataset as well, outperforming RF and XGB by a wide margin across most metrics. Likewise, R-PRIM-Cl attains the highest F1-score (91.82%), precision (90.14%), and recall (93.57%) in the Heart dataset. With a robust F1-score (90.30%), XGB is still competitive, but it trails behind R-PRIM-Cl by a small margin in other metrics.
In the TicTac data, all methods show outstanding performance, highlighting the straightforward nature of the information. Nevertheless, R-PRIM-Cl continues to outperform in the F1-score category (97.43%), with XGB and RF not far behind, delivering near-flawless results in both precision and recall. As for the Diabetes dataset, R-PRIM-Cl takes the top spot in precision (94.36%) and F1-score (91.43%), proving its resilience in this field. LG also demonstrates impressive results, holding its own across all metrics.
To conclude, in the Credit dataset, R-PRIM-Cl outperforms every other algorithm, resulting in top-notch performance in terms of accuracy (92.42%), recall (90.03%), precision (95.24%), and F1-score (92.56%). LG performs exceptionally well, specifically in the areas of precision (91.78%) and F1-score (91.56%), making it a strong rival in this dataset.
In complex datasets like Cancer, Heart, and Credit, where its emphasis on interpretability and subgroup discovery proves invaluable, R-PRIM-Cl consistently performs exceptionally well across all metrics. In easier datasets like Mush and TicTac, XGB attains nearly flawless scores, demonstrating its capacity to manage well-separated data efficiently. In more complicated situations, RF frequently lags behind R-PRIM-Cl and XGB, but it consistently produces good results across all datasets. In datasets such as Cancer and Credit, LG exhibits competitive performance, proving its efficacy in structured and balanced data but lagging the ensemble methods in difficult datasets.
The overall efficacy of R-PRIM-Cl is validated by accuracy results, which show that it can make accurate classifications in a wide range of datasets, consistently obtaining top or near-top scores. Its performance is particularly impressive on complicated datasets like Cancer, Heart, and Credit, where it considerably outperforms other models, indicating that it is a good fit for difficult classification tasks. R-PRIM-Cl is a strong and adaptable classification model that can maintain performance across a variety of data types and guarantee balanced and accurate predictions. These high levels of recall, precision, F1-score, and accuracy highlight this strength. Its general stability in these metrics confirms its robustness and adaptability in a variety of scenarios, making it a great option for applications that need both sensitivity and precision.

5. Discussion

When R-PRIM-Cl is compared to well-known classification algorithms like random forest (RF), XG-Boost (XGB), and logistic regression (LG), important information about how well it performs on different datasets is revealed. According to the analysis, R-PRIM-Cl frequently outperforms or closely matches the other models in important metrics like recall, precision, F1-score, and accuracy. R-PRIM-Cl is a competitive option for classification tasks because of its robustness and versatility, which are highlighted by its consistent performance across a variety of complex and diverse datasets.
In terms of recall, R-PRIM-Cl frequently achieves the highest scores, especially in datasets like Cancer, Heart, and Credit. This indicates that the model excels in identifying all positive cases without missing any, which is particularly crucial in sensitive applications such as medical diagnosis or risk assessment, where missing true positives could have severe consequences. The model’s ability to maintain high recall while balancing precision is a noteworthy advantage over RF and XGB, which, although strong, often show slightly lower recall, particularly in more complex datasets. LG, while occasionally competitive, generally struggles with maintaining high recall, particularly in challenging datasets like Heart, further highlighting R-PRIM-Cl’s strength in sensitivity.
The findings on precision highlight the dependability of R-PRIM-Cl even more, especially for datasets where preventing false positives is essential. For instance, in the Cancer, Diabetes, and Credit datasets, R-PRIM-Cl achieves some of the highest precision scores, demonstrating its ability to make accurate positive predictions with minimal false positives. Applications like financial risk assessment and targeted medical treatments, where the cost of false positives is significant, benefit greatly from this feature. R-PRIM-Cl displays a more balanced performance with a stronger overall precision–recall trade-off when compared to RF and XGB, which may have slightly higher false positive rates.
Combining recall and precision, the F1-score demonstrates how well R-PRIM-Cl balances these metrics, particularly in large, complex datasets like Cancer and Heart. A well-rounded performance that skillfully handles false positives and false negatives is demonstrated by the model’s consistently high F1-score values. R-PRIM-Cl’s dependable F1 performance over a larger range of data complexities emphasizes its versatility and resilience, whereas RF and XGB are competitive and perform well in simpler datasets. R-PRIM-Cl performs better than LG, which indicates that even though LG occasionally displays strong F1-score values, it frequently falls short in situations where recall and precision must be balanced.
The general effectiveness of R-PRIM-Cl’s classification is also supported by accuracy results, especially when compared to other models on complex datasets like Cancer, Heart, and Credit. This implies that R-PRIM-Cl preserves a high rate of accurate classifications under a variety of conditions and performs exceptionally well in maintaining a balance between recall and precision. Even on easier datasets like Mush, where RF and XGB attain near-perfect accuracy, R-PRIM-Cl maintains a competitive edge, proving that it can handle both easy and difficult classification tasks with ease. While LG performs well most of the time, it sometimes falters in the more complicated datasets. This emphasizes R-PRIM-Cl’s advantage in terms of general performance maintenance.
All things considered, R-PRIM-Cl performs well on all assessed metrics, indicating that it is a very flexible and trustworthy model for classification. Particularly well-suited for applications where both sensitivity and specificity are crucial is its capacity to simultaneously maintain high recall and precision. The consistently high F1-score values and accuracy further affirm R-PRIM-Cl’s robustness, making it a valuable tool in diverse contexts, from healthcare to finance. These findings show that in scenarios requiring balanced, accurate, and dependable classification performance, R-PRIM-Cl is not only a competitive substitute for well-known algorithms like RF, XGB, and LG but also the smartest option. To fully realize the potential of R-PRIM-Cl, future research could investigate ways to improve its efficiency and interpretability through further optimization. Additionally, the technique could be applied to a wider range of high-dimensional datasets.

6. Conclusions

In this paper, we presented a new classifier based on the bump-hunting technique, namely the patient rule-induction method. The classifier randomly chooses numerous feature search spaces and finds the boxes subject to minimum support, peeling, and pasting thresholds. This search is conducted for each of the binary class labels. Once the search finishes, the algorithm prunes with the metarules to both organize the ruleset and detect overlapping regions. Finally, it performs a 10-fold cross-validation to keep the more general rules based on coverage, density, and support. The remaining boxes are then interpretable. The algorithm was tested against popular machine-learning algorithms: random forest, XG-Boost, and logistic regression. The overall performance of R-PRIM-Cl was satisfactory, even though it lacked precision in some cases.
This experiment opens new research directions to enhance the algorithm even further. The first is to establish a sensitivity analysis regarding the number of feature search spaces it selects: it would be fascinating to investigate whether choosing a high number of subspaces comes at the expense of the algorithm’s interpretability and accuracy or actually improves them. The second direction is to investigate other pruning techniques. Another is to study the rule conflict resolution of the algorithm. The last is to apply the approach to more datasets, binary and multiclass, and to compare it with other algorithms like SVM, KNN, neural network algorithms, and other SOTA techniques.

Author Contributions

Both authors have contributed equally to the research and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are available in the UC Irvine repository, https://archive.ics.uci.edu/datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dixon, M.F.; Halperin, I.; Bilokon, P. Machine Learning in Finance; Springer International Publishing: New York, NY, USA, 2020; Volume 1170. [Google Scholar]
  2. Ahmed, S.; Alshater, M.M.; El Ammari, A.; Hammami, H. Artificial intelligence and machine learning in finance: A bibliometric review. Res. Int. Bus. Financ. 2022, 61, 101646. [Google Scholar] [CrossRef]
  3. Nayyar, A.; Gadhavi, L.; Zaman, N. Machine learning in healthcare: Review, opportunities and challenges. In Machine Learning and the Internet of Medical Things in Healthcare; Elsevier: Amsterdam, The Netherlands, 2021; pp. 23–45. [Google Scholar]
  4. An, Q.; Rahman, S.; Zhou, J.; Kang, J.J. A comprehensive review on machine learning in healthcare industry: Classification, restrictions, opportunities and challenges. Sensors 2023, 23, 4178. [Google Scholar] [CrossRef] [PubMed]
  5. Gao, L.; Guan, L. Interpretability of machine learning: Recent advances and future prospects. IEEE MultiMedia 2023, 30, 105–118. [Google Scholar] [CrossRef]
  6. Nassih, R.; Berrado, A. State of the art of Fairness, Interpretability and Explainability in Machine Learning: Case of PRIM. In Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications (SITA’20), Rabat, Morocco, 23–24 September 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5. [Google Scholar] [CrossRef]
  7. Capponi, A.; Lehalle, C.-A. (Eds.) Black-Box Model Risk in Finance. In Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices; Cambridge University Press: Cambridge, UK, 2023; pp. 687–717. [Google Scholar]
  8. Imrie, F.; Davis, R.; van der Schaar, M. Multiple stakeholders drive diverse interpretability requirements for machine learning in healthcare. Nat. Mach. Intell. 2023, 5, 824–829. [Google Scholar] [CrossRef]
  9. Friedman, J.H.; Fisher, N.I. Bump hunting in high-dimensional data. Stat. Comput. 1999, 9, 123–143. [Google Scholar] [CrossRef]
  10. Wrobel, S. An algorithm for multi-relational discovery of subgroups. In Principles of Data Mining and Knowledge Discovery, Proceedings of the First European Symposium, PKDD’ 1997, Trondheim, Norway, 24–27 June 1997; Proceedings, LNCS; Komorowski, H.J., Zytkow, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 1997; Volume 1263, pp. 78–87. [Google Scholar] [CrossRef]
  11. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  12. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984. [Google Scholar]
  13. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed.; Amazon, Inc.: Bellevue, WA, USA, 2020. [Google Scholar]
  14. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  15. Herrera, F.; Carmona, C.J.; González, P.; del Jesus, M.J. An overview on subgroup discovery: Foundations and applications. Int. J. Comput. Intell. Syst. 2011, 12, 1602–1612. [Google Scholar] [CrossRef]
  16. Atzmueller, M. Subgroup discovery. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5, 35–49. [Google Scholar] [CrossRef]
  17. Berrado, A.; Runger, G.C. Using metarules to organize and group discovered association rules. Data Min. Knowl. Discov. 2007, 14, 409–431. [Google Scholar] [CrossRef]
  18. Azmi, M.; Runger, G.C.; Berrado, A. Interpretable regularized class association rules algorithm for classification in a categorical data space. Inf. Sci. 2019, 483, 313–331. [Google Scholar] [CrossRef]
  19. Maissae, H.; Abdelaziz, B. Forest-ORE: Mining Optimal Rule Ensemble to interpret Random Forest models. arXiv 2024, arXiv:2403.17588. [Google Scholar]
  20. Maissae, H.; Abdelaziz, B. A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization. Int. J. Data Sci. Anal. 2022, 14, 45–63. [Google Scholar] [CrossRef]
  21. Azmi, M.; Berrado, A. CARs-RP: Lasso-based class association rules pruning. Int. J. Bus. Intell. Data Min. 2021, 18, 197–217. [Google Scholar] [CrossRef]
  22. Ghasemkhani, B.; Balbal, K.F.; Birant, D. A New Predictive Method for Classification Tasks in Machine Learning: Multi-Class Multi-Label Logistic Model Tree (MMLMT). Mathematics 2024, 12, 2825. [Google Scholar] [CrossRef]
  23. Miltiadous, A.; Tzimourta, K.D.; Giannakeas, N.; Tsipouras, M.G.; Afrantou, T.; Ioannidis, P.; Tzallas, A.T. Alzheimer’s Disease and Frontotemporal Dementia: A Robust Classification Method of EEG Signals and a Comparison of Validation Methods. Diagnostics 2021, 11, 1437. [Google Scholar] [CrossRef]
  24. Wang, Y.; Zhao, Y.; Therneau, T.M.; Atkinson, E.J.; Tafti, A.P.; Zhang, N.; Amin, S.; Limper, A.H.; Khosla, S.; Liu, H. Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records. J. Biomed. Inform. 2020, 102, 103364. [Google Scholar] [CrossRef]
  25. Rabbani, N.; Kim, G.Y.; Suarez, C.J.; Chen, J.H. Applications of machine learning in routine laboratory medicine: Current state and future directions. Clin. Biochem. 2022, 103, 1–7. [Google Scholar] [CrossRef]
  26. Smets, J.; Shevroja, E.; Hügle, T.; Leslie, W.D.; Hans, D. Machine Learning Solutions for Osteoporosis—A Review. J. Bone Miner. Res. 2021, 36, 833–851. [Google Scholar] [CrossRef]
  27. Nagpal, C.; Wei, D.; Vinzamuri, B.; Shekhar, M.; Berger, S.E.; Das, S.; Varshney, K.R. Interpretable subgroup discovery in treatment effect estimation with application to opioid prescribing guidelines. In Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL’20), Toronto, ON, Canada, 2–4 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 19–29. [Google Scholar] [CrossRef]
  28. Dazard, J.E.; Rao, J.S. Local Sparse Bump Hunting. J. Comput. Graph. Stat. 2010, 19, 900–929. [Google Scholar] [CrossRef]
  29. Polonik, W.; Wang, Z. PRIM analysis. J. Multivar. Anal. 2010, 101, 525–540. [Google Scholar] [CrossRef]
  30. Dyson, G. An application of the Patient Rule-Induction Method to detect clinically meaningful subgroups from failed phase III clinical trials. Int. J. Clin. Biostat. Biom. 2021, 7, 38. [Google Scholar] [CrossRef]
  31. Yang, J.K.; Lee, D.H. Optimization of mean and standard deviation of multiple responses using patient rule induction method. Int. J. Data Warehous. Min. 2018, 14, 60–74. [Google Scholar] [CrossRef]
  32. Lee, D.H.; Yang, J.K.; Kim, K.J. Multiresponse optimization of a multistage manufacturing process using a patient rule induction method. Qual. Reliab. Eng. Int. 2020, 36, 1982–2002. [Google Scholar] [CrossRef]
  33. Kaveh, A.; Hamze-Ziabari, S.M.; Bakhshpoori, T. Soft computing-based slope stability assessment: A comparative study. Geomech. Eng. 2018, 14, 257–269. [Google Scholar] [CrossRef]
34. Nassih, R.; Berrado, A. Potential for PRIM based classification: A literature review. In Proceedings of the Third European International Conference on Industrial Engineering and Operations Management, Pilsen, Czech Republic, 23–26 July 2019; Volume 7. [Google Scholar]
  35. Nassih, R.; Berrado, A. Towards a patient rule induction method-based classifier. In Proceedings of the 2019 1st International Conference on Smart Systems and Data Science (ICSSD), Rabat, Morocco, 3–4 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  36. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  37. Biau, G.; Scornet, E. A random forest guided tour. TEST 2016, 25, 197–227. [Google Scholar] [CrossRef]
  38. Qiu, Y.; Zhou, J.; Khandelwal, M.; Yang, H.; Yang, P.; Li, C. Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Eng. Comput. 2022, 38 (Suppl. 5), 4145–4162. [Google Scholar] [CrossRef]
  39. Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. Ser. B Stat. Methodol. 1958, 20, 215–242. [Google Scholar] [CrossRef]
  40. Bellman, R.E. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, USA, 1961. [Google Scholar]
  41. Liu, B.; Hsu, W.; Ma, Y. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), New York, NY, USA, 27–31 August 1998; pp. 80–86. [Google Scholar]
  42. Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases; University of California, Department of Information and Computer Science: Irvine, CA, USA, 1998; Available online: https://archive.ics.uci.edu/datasets (accessed on 1 July 2024).
  43. Demšar, J.; Curk, T.; Erjavec, A.; Gorup, Č.; Hočevar, T.; Milutinović, M.; Možina, M.; Polajnar, M.; Toplak, M.; Starič, A.; et al. Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 2013, 14, 2349–2353. [Google Scholar]
Figure 1. Illustration of the top-down peeling phase to find one box.
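The top-down peeling phase illustrated in Figure 1 can be sketched compactly. The following is an illustrative reimplementation only, not the authors' code: it assumes numeric inputs, a peeling fraction `alpha`, and a minimum box support, and at each step removes the thin slice (low or high end of one attribute) whose removal most increases the target mean inside the remaining box.

```python
import numpy as np

def peel_box(X, y, alpha=0.05, min_support=0.1):
    """Top-down peeling of a single box: repeatedly trim the alpha-fraction
    slice whose removal maximizes the target mean inside the box."""
    n, d = X.shape
    box = [(-np.inf, np.inf)] * d          # per-attribute bounds of the box
    mask = np.ones(n, dtype=bool)          # points currently inside the box
    while mask.mean() > min_support:
        best = None
        for j in range(d):
            lo = np.quantile(X[mask, j], alpha)
            hi = np.quantile(X[mask, j], 1 - alpha)
            for side, val, keep in (('lo', lo, X[:, j] >= lo),
                                    ('hi', hi, X[:, j] <= hi)):
                m = mask & keep
                if m.any() and (best is None or y[m].mean() > best[0]):
                    best = (y[m].mean(), j, side, val, m)
        if best is None or best[0] <= y[mask].mean():
            break                          # no candidate peel improves the box
        _, j, side, val, m = best
        lo_b, hi_b = box[j]
        box[j] = (val, hi_b) if side == 'lo' else (lo_b, val)
        mask = m
    return box, mask
```

Because only a small fraction of the data is peeled per step, the final box can isolate a small, high-density subgroup rather than splitting the data coarsely.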
Figure 2. Example of a numeric box.
Figure 3. Example of a categorical box.
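The numeric and categorical boxes of Figures 2 and 3 can both be represented as a conjunction of conditions: an interval per numeric attribute and a set of allowed levels per categorical attribute. A minimal sketch of this representation follows; the attribute names are purely illustrative, not taken from the paper's datasets.

```python
# A box as attribute -> condition: a (lo, hi) interval for numeric
# attributes, a set of allowed levels for categorical ones.
box = {'age': (30.0, 50.0), 'chest_pain': {'typical', 'atypical'}}

def in_box(row, box):
    """True if the observation (a dict) satisfies every box condition."""
    for attr, cond in box.items():
        value = row[attr]
        if isinstance(cond, tuple):          # numeric interval [lo, hi]
            lo, hi = cond
            if not lo <= value <= hi:
                return False
        elif value not in cond:              # categorical level set
            return False
    return True
```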
Figure 4. An illustration of the organization of rules in the ruleset using metarules. (a) Display of the ruleset without the organization. (b) Illustration of the new organized ruleset with association between R6–R7, R1–R4, and R5–R3. The one-way relationship between R4 and R1 shows that R4 is more specific than R1.
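The one-way relationship in Figure 4 (R4 being more specific than R1) can be checked mechanically once rules are stored as condition sets: a rule is at least as specific as another when every condition of the latter appears in it with an equal-or-narrower set of allowed values. A hypothetical sketch for categorical conditions:

```python
def at_least_as_specific(rule_a, rule_b):
    """True if rule_a's coverage is contained in rule_b's: every condition
    of rule_b appears in rule_a with an equal-or-narrower value set."""
    return all(attr in rule_a and rule_a[attr] <= allowed
               for attr, allowed in rule_b.items())

# Hypothetical categorical rules (attribute -> set of allowed values),
# named after the figure for illustration only:
R1 = {'color': {'red', 'blue'}}
R4 = {'color': {'red'}, 'size': {'small'}}
```

Here `at_least_as_specific(R4, R1)` holds but not the reverse, reproducing the figure's one-way edge from R4 to R1.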
Figure 5. Illustration of metarule creation.
Figure 6. Illustration of the construction and validation of the classifier after all the rules have been obtained.
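Once the definitive rules are selected (Figure 6), applying the classifier amounts to trying the rules in priority order and falling back to a default class when no rule fires. The sketch below is an assumed, simplified decision procedure, with made-up rule conditions for illustration:

```python
def classify(row, ruleset, default):
    """Assign the class of the first rule whose conditions all hold;
    otherwise fall back to the default (e.g. majority) class."""
    for conditions, label in ruleset:
        if all(cond(row) for cond in conditions):
            return label
    return default

# Hypothetical two-rule classifier, rules tried in priority order:
ruleset = [
    ([lambda r: r['glucose'] > 140, lambda r: r['bmi'] > 30], 1),
    ([lambda r: r['glucose'] <= 110], 0),
]
```

Each prediction is thus traceable to a single human-readable rule, which is the interpretability property the paper emphasizes.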
Figure 7. Visualization of the four measures obtained in the experiment: (a) accuracy of each model, (b) recall of each model, (c) precision of each model, and (d) F1-score of each model.
Table 3. Major properties of the datasets selected for the experimentation.
| Datasets | No. of Instances | No. of Attributes | Class Labels | Class Distribution |
|---|---|---|---|---|
| Vote | 232 | 16 | democrat: 0; republican: 1 | 142; 90 |
| Mush | 8124 | 22 | e: 0; p: 1 | 4208; 3916 |
| Cancer | 286 | 9 | no recurrent: 0; recurrent: 1 | 201; 85 |
| Heart | 267 | 20 | 0; 1 | 55; 212 |
| TicTac | 958 | 9 | negative: 1; positive: 0 | 332; 626 |
| Diabetes | 768 | 8 | yes: 1; no: 0 | 269; 499 |
| Credit | 1000 | 10 | bad: 1; good: 0 | 300; 700 |
Table 4. The hyperparameters tuned for each algorithm.
| Algorithm | Hyperparameter | Tuning |
|---|---|---|
| Random forest | n_estimators: number of trees in the forest | n_estimators = 100 |
| | max_depth: maximum depth of the trees | max_depth = None |
| | min_samples_split: minimum number of samples required to split an internal node | min_samples_split = 2 |
| | min_samples_leaf: minimum number of samples required to be at a leaf node | min_samples_leaf = 1 |
| | max_features: number of features to consider when looking for the best split | max_features = 'sqrt' |
| | class_weight: balancing of classes for imbalanced datasets | class_weight = 'balanced' |
| XG-Boost | n_estimators: number of boosting rounds | n_estimators = 100 |
| | learning_rate: step size shrinkage used to prevent overfitting | learning_rate = 0.1 |
| | max_depth: maximum depth of a tree | max_depth = 6 |
| | min_child_weight: minimum sum of instance weights needed in a child | min_child_weight = 1 |
| | subsample: fraction of samples used for training each tree | subsample = 0.8 |
| | colsample_bytree: fraction of features used for training each tree | colsample_bytree = 0.8 |
| | scale_pos_weight: balancing factor for imbalanced datasets | scale_pos_weight = 1 |
| Logistic regression | penalty: regularization type (l1, l2, or elasticnet) | penalty = 'l2' |
| | C: inverse of regularization strength (smaller values imply stronger regularization) | C = 1.0 |
| | solver: optimization algorithm for training | solver = 'lbfgs' |
| | class_weight: handling of class imbalance | class_weight = 'balanced' |
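For reproducibility, the scikit-learn baselines in Table 4 could be instantiated as below. This is an assumed reconstruction from the table, not the authors' script; the XGBoost model is omitted here, with an analogous `xgboost.XGBClassifier(...)` call assumed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Random forest with the Table 4 settings (random_state added for
# reproducibility; it is not specified in the table).
rf = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_split=2,
    min_samples_leaf=1, max_features='sqrt', class_weight='balanced',
    random_state=0)

# Logistic regression with the Table 4 settings (max_iter raised to
# ensure lbfgs convergence; also not specified in the table).
lg = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs',
                        class_weight='balanced', max_iter=1000)
```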
Table 5. Results of the recall and precision of every algorithm on the seven datasets in %.
| Datasets | Recall (RF) | Recall (XGB) | Recall (LG) | Recall (R-PRIM-Cl) | Precision (RF) | Precision (XGB) | Precision (LG) | Precision (R-PRIM-Cl) |
|---|---|---|---|---|---|---|---|---|
| Vote | 94.33 | 94.33 | 93.33 | 95.32 | 98.67 | 99.33 | 98.60 | 98.74 |
| Mush | 98.12 | 98.12 | 97.72 | 97.65 | 99.72 | 100.00 | 93.67 | 96.47 |
| Cancer | 84.17 | 89.64 | 90.31 | 92.18 | 77.39 | 78.60 | 87.72 | 92.13 |
| Heart | 92.01 | 94.37 | 78.36 | 93.57 | 87.30 | 86.57 | 91.25 | 90.14 |
| TicTac | 97.35 | 98.12 | 96.58 | 98.06 | 98.79 | 97.88 | 98.10 | 96.80 |
| Diabetes | 87.30 | 85.64 | 89.51 | 88.67 | 91.31 | 90.68 | 89.67 | 94.36 |
| Credit | 89.64 | 82.45 | 91.35 | 90.03 | 88.60 | 85.34 | 91.78 | 95.24 |
Table 6. Results of the F1-score and accuracy of every algorithm on the seven datasets in %.
| Datasets | F1-Score (RF) | F1-Score (XGB) | F1-Score (LG) | F1-Score (R-PRIM-Cl) | Accuracy (RF) | Accuracy (XGB) | Accuracy (LG) | Accuracy (R-PRIM-Cl) |
|---|---|---|---|---|---|---|---|---|
| Vote | 96.45 | 96.77 | 95.89 | 97.00 | 96.56 | 97.10 | 97.43 | 97.40 |
| Mush | 98.91 | 99.05 | 95.65 | 97.06 | 100.00 | 100.00 | 98.63 | 97.54 |
| Cancer | 80.64 | 83.76 | 89.00 | 92.15 | 71.38 | 71.78 | 87.63 | 90.04 |
| Heart | 89.59 | 90.30 | 84.32 | 91.82 | 80.20 | 83.56 | 92.13 | 96.15 |
| TicTac | 98.06 | 98.00 | 97.33 | 97.43 | 98.85 | 98.87 | 97.86 | 98.37 |
| Diabetes | 89.26 | 88.09 | 89.59 | 91.43 | 84.65 | 84.37 | 89.61 | 87.98 |
| Credit | 89.12 | 83.87 | 91.56 | 92.56 | 90.15 | 89.61 | 91.34 | 92.42 |
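The four measures reported in Tables 5 and 6 derive from the confusion-matrix counts. A minimal sketch for the binary case, taking class 1 as the positive class:

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, recall, precision, F1) for 0/1 labels,
    with class 1 treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1
```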

Nassih, R.; Berrado, A. A Random PRIM Based Algorithm for Interpretable Classification and Advanced Subgroup Discovery. Algorithms 2024, 17, 565. https://doi.org/10.3390/a17120565