1. Introduction
Rough set theory (RST) [1], initiated by Pawlak, is an effective mathematical tool for dealing with imprecise, fuzzy, and incomplete data. RST has been successfully applied in machine learning [2,3,4], knowledge discovery [5,6], expert systems [7], disease diagnostics [8,9,10], decision support [11,12,13], and other areas [14,15]. Attribute reduction is one of the research hotspots in RST. As an important technique in data preprocessing, attribute reduction has captured researchers' attention in big data and knowledge discovery [16,17,18]. The main objective of attribute reduction is to remove irrelevant or unimportant attributes while keeping the original distinguishing ability unchanged. In this way, the dimensionality of the data is reduced, and considerable time and space resources are saved for the subsequent knowledge discovery and rule extraction.
With the rapid development of network and information technology, we have gradually entered the era of big data. Datasets now have the characteristics of large volume, rapid change, and diverse data forms [19]. At the same time, owing to the collection methods and the environment during data collection, datasets often contain a large amount of missing or erroneous data. The existence of such disturbed data seriously affects decision making and judgement on big data, and may even mislead decision makers. After a long period of unremitting endeavor, scholars have achieved outstanding results in attribute reduction [20,21,22,23,24,25,26]. For example, Dai [20] proposed a semi-supervised attribute reduction method based on attribute indiscernibility. Cao [21] put forward a three-way approximate reduction approach using an information-theoretic measure. Yang [22] presented a novel incremental attribute reduction method via quantitative dominance-based neighborhood self-information. Lin et al. [16] developed a feature selection method using neighborhood multi-granulation fusion. In a variable precision rough set model, Yu et al. [24] raised a novel attribute reduction based on local attribute significance. In the literature [27], Devi proposed a new dimensionality reduction technique by considering picture fuzzy soft matrices in the decision-making process. Wen et al. [28] raised an unsupervised attribute reduction algorithm for mixed data based on a fuzzy optimal approximation set. These classic reduction models are suitable only for complete systems.
Since most datasets in real-world applications are incomplete, the classical rough set model defined with an equivalence relation is limited in handling such data: the important attributes cannot be selected correctly, which decreases the classification accuracy of the reduction set. In order to reduce incomplete systems, Zhang and Chen proposed a lambda-reduction approach based on the similarity degree with respect to a conditional attribute subset for incomplete set-valued information systems [25]. For incomplete interval-valued information systems, Li [29] proposed the concepts of similarity degree and tolerance relation between two information values of a given attribute; three reduction algorithms based on the theta-discernibility matrix, theta-information entropy, and theta-significance were then designed. Liu introduced a new attribute reduction approach using conditional entropy based on the fuzzy alpha-similarity relation [30]. Subsequently, Dai [31] proposed interval-valued fuzzy min–max similarity relations and designed two attribute reduction algorithms based on an interval-valued fuzzy discernibility pairs model. Song [32] put forward the similarity degree between information values on each attribute, and an attribute reduction method was designed using information granulation and information entropy. Zhou presented a heuristic attribute reduction algorithm with a binary similarity matrix and attribute significance as heuristic knowledge for incomplete information systems [33]. Zhang [34] presented a novel approach for knowledge reduction using discernibility techniques in a multi-granulation rough set model. He and Qu [35] put forward a fuzzy-rough iterative computation model based on symmetry relations for incomplete categorical decision information systems. Srirekha et al. proposed an attribute reduction in SE-ISI concept lattices based on object ranking for incomplete information systems [36]. Cornelis et al. put forward a generalized model of attribute reduction using fuzzy tolerance relations within the context of fuzzy rough set theory [37]. Liu applied the concepts of accurate reduction and the reduced invariant matrix to attribute reduction in information systems [38]. To reduce unnecessary tolerance classes in the original cover, Nguyen [39] introduced a new concept of stripped neighborhood covers and proposed an efficient heuristic algorithm for mixed and incomplete decision tables.
Although the above reduction algorithms can effectively reduce incomplete information systems, the classification accuracy of the reduction set is not ideal. The main reason is that only the importance of attributes is considered when selecting attributes, and the impact of attributes on classification is not considered. Usually, people take the best result of clustering as a reference standard for classification work and classify similar samples into the same cluster. In order to solve the problems mentioned above, this paper proposes an attribute reduction method from the perspective of clustering.
At present, studies that use the clustering idea to construct feature selection models for attribute reduction are relatively infrequent. In order to avoid or reduce the loss of original information after discretizing continuous values, Zhang [40] proposed a feature selection method based on fuzzy clustering, but this method has no reliable theoretical support. Jia [41] proposed a spectral clustering method based on neighborhood information entropy feature selection, which removes redundant features before clustering. In order to take the classification effect of the dataset into consideration when selecting features, Zhao proposed a fuzzy C-means clustering fuzzy rough feature selection method [42], which can improve the classification accuracy of the reduction set to a certain degree, but the effect is not obvious. Jia proposed a similarity attribute reduction in the context of clustering [43], which can greatly improve the classification accuracy of the reduction set but needs continual parameter adjustment to achieve the best classification effect; such a feature set has certain random limitations and increases the time consumption of the system. Therefore, it is necessary to design a stable model with high classification accuracy for data preprocessing.
Although these existing approaches can effectively reduce incomplete systems, they only consider the importance of the attributes themselves; they consider neither the correlation between attributes nor the influence of the conditional attributes on decision classification. In order to improve the classification accuracy of the reduction set, we apply the idea of clustering.
Based on the principle that the similarity of samples within a cluster should be as large as possible and the similarity of samples between clusters as small as possible, an attribute reduction algorithm for incomplete systems is designed under the background of clustering. First, according to the tolerance principle and the theory of knowledge granularity, we define the intra-cluster and inter-cluster similarity for an incomplete system. Secondly, a formula for calculating the similarity of intra-cluster and inter-cluster objects is designed; after normalizing the two similarities, we define the similarity of objects. Then, according to the corresponding similarity mechanism, a new attribute reduction algorithm for incomplete systems is proposed. Finally, a series of experiments verified that the proposed algorithm is significantly better than other similar algorithms in terms of running time and accuracy, and the stability of the algorithm was analyzed using the Friedman test and Bonferroni–Dunn test.
The contribution of this paper is embodied in the following four aspects:
- (1)
A tolerance class calculation in incomplete information systems is proposed and applied to knowledge granularity calculation.
- (2)
Knowledge granules are used as a measure of sample similarity to measure the similarity of inter-cluster samples and intra-cluster samples.
- (3)
A knowledge granularity reduction algorithm based on clustering context is designed in incomplete information systems.
- (4)
Extensive experiments are conducted to verify the validity of the proposed algorithm, and its stability is verified by statistical tests.
The remainder of this paper is organized as follows. The tolerance principle and related concepts of knowledge granularity are recalled in Section 2. In Section 3, we propose a similarity measure for intra-cluster and inter-cluster objects and discuss the reduction mechanism in the clustering background for missing data. We normalize the inter-cluster and intra-cluster similarities, and then design the corresponding reduction model in Section 4. In Section 5, a series of experiments is conducted and the performance of the algorithm is evaluated in terms of reduction size, running time, classification accuracy, and stability; the feasibility and effectiveness of the algorithm are thereby verified. Finally, the advantages and disadvantages of the proposed algorithm are concluded and future work is outlined.
2. Preliminaries
In this section, we review some basic concepts in rough set theory, the definitions of tolerance class, knowledge granularity, clustering metrics, and the significance of attributes for incomplete decision systems.
2.1. Basic Concept of RST
A decision information system is a quadruple $DS = (U, A, V, f)$, where $U$ is a non-empty finite set of objects and $A$ is a non-empty finite set of attributes; $A = C \cup D$ with $C \cap D = \emptyset$, where $C$ is the conditional attribute set and $D$ is the decision attribute set; $V = \bigcup_{a \in A} V_a$ is the union of the attribute domains, where $V_a$ is the value set of attribute $a$, called the domain of $a$; and $f: U \times A \rightarrow V$ is an information function with $f(x, a) \in V_a$ for each $a \in A$ and $x \in U$. For every attribute subset $P \subseteq A$, an indiscernibility relation is defined as follows:
$$IND(P) = \{(x, y) \in U \times U \mid \forall a \in P,\; f(x, a) = f(y, a)\}.$$
By the relation $IND(P)$, we can obtain the partition of $U$, denoted by $U/IND(P)$ or $U/P$. If $X \subseteq U$, the upper approximation of $X$ is denoted as
$$\overline{P}(X) = \{x \in U \mid [x]_P \cap X \neq \emptyset\},$$
and the lower approximation of $X$ can be denoted as
$$\underline{P}(X) = \{x \in U \mid [x]_P \subseteq X\},$$
where the objects in $\overline{P}(X)$ may belong to $X$, while the objects in $\underline{P}(X)$ must belong to $X$.
Definition 1. In a decision system $DS = (U, C \cup D, V, f)$, if there exist $a \in C$ and $x \in U$ such that $f(x, a) = *$, where $*$ represents a missing value, then we call the decision system an incomplete decision system (IDS). In an incomplete decision system, the tolerance relation is defined as follows:
$$T(P) = \{(x, y) \in U \times U \mid \forall a \in P,\; f(x, a) = f(y, a) \ \text{or}\ f(x, a) = * \ \text{or}\ f(y, a) = *\}.$$
$T(P)$ is symmetric and reflexive, but not transitive.

Definition 2. Given an incomplete decision system $IDS = (U, C \cup D, V, f)$, $P \subseteq C$, and $o \in U$, the tolerance class determined by $o$ with respect to $P$ is defined as follows:
$$T_P(o) = \{u \in U \mid (o, u) \in T(P)\}.$$
Let $U/T(P)$ denote the family set of the classes $T_P(o)$, which is the classification induced by $P$. If $P_1 \subseteq P_2 \subseteq C$, then $T_{P_2}(o) \subseteq T_{P_1}(o)$, which is the monotonicity of the tolerance class.
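As an illustration of Definitions 1 and 2, the following Python sketch computes tolerance classes on a small incomplete table; the table, the attribute names, and the use of '*' for missing values are assumptions made purely for demonstration.

```python
# Toy incomplete decision table: rows are objects, columns are condition
# attributes; '*' marks a missing value (data and names assumed for illustration).
U = ['x1', 'x2', 'x3', 'x4']
table = {
    'x1': {'a': 1,   'b': 2,   'c': '*'},
    'x2': {'a': 1,   'b': '*', 'c': 0},
    'x3': {'a': 2,   'b': 2,   'c': 0},
    'x4': {'a': '*', 'b': 2,   'c': 1},
}

def tolerant(x, y, P):
    """(x, y) belong to the tolerance relation T(P): on every attribute of P
    the values agree or at least one of them is missing."""
    return all(table[x][a] == table[y][a] or '*' in (table[x][a], table[y][a])
               for a in P)

def tolerance_class(x, P):
    """T_P(x): all objects tolerant with x on the attribute subset P."""
    return {y for y in U if tolerant(x, y, P)}

P = ['a', 'b']
for x in U:
    print(x, sorted(tolerance_class(x, P)))
```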
2.2. Basic Concept of Knowledge Granularity
Definition 3. Suppose $IDS = (U, C \cup D, V, f)$ is an incomplete decision system and $P \subseteq C$. $T_P(x_i)$ is the tolerance class of object $x_i$ with respect to $P$. The knowledge granularity of $P$ on $U$ is defined as follows:
$$GK_U(P) = \frac{1}{|U|^2}\sum_{i=1}^{|U|}\left|T_P(x_i)\right|,$$
where $|U|$ represents the number of objects in dataset $U$.

Because of the reflexivity and symmetry of the tolerance relation, there are a lot of repeated calculations when computing $GK_U(P)$. In order to reduce the amount of calculation, we propose Definition 4 as follows.

Definition 4. Suppose $IDS = (U, C \cup D, V, f)$ is an incomplete decision system and $P \subseteq C$. The simplified tolerance relation of object $x_i$ with respect to $P$ is obtained from the tolerance relation by deleting the reflexive pairs and one member of each symmetric pair.

Definition 5. Given an incomplete decision system $IDS = (U, C \cup D, V, f)$ and $P \subseteq C$, the simplified tolerance class of object $x_i$ with respect to attribute set $P$ is the set of objects related to $x_i$ under the simplified tolerance relation of Definition 4.

We can delete the symmetric element pairs and reflexive element pairs from Definition 3 and obtain Definition 5, so that Definition 5 does not have the characteristics of symmetry and reflexivity.
Definition 6. Given an incomplete decision system , is the simplified tolerance class of object with respect to attribute set P. The equal knowledge granularity of P on U is defined as follows:

Theorem 1. Given an incomplete decision system , let , where . All objects in the subdivision are missing a value on attribute P, . For the convenience of the following description, we mark as . Objects with all non-missing values on attribute P are marked with . If represents the equal knowledge granularity of P on U, we have: where .
Proof. Suppose that and . According to Definition 4, we can obtain and . Suppose ; then, according to Definition 5, we can obtain . Since , and , we can obtain
Property 1. Given an incomplete decision system IDS = , . If the knowledge granularity of P on U is and the equal knowledge granularity of P is , then we have:

Proof. Let , where is the i-th subdivision of and stands for the subdivision of missing values on attribute P, and stands for the number of objects in subdivision . We can obtain . According to Definition 2, suppose that and ; then we can obtain = . Since the value of each object in is , we can obtain . In the same way, we can obtain . According to Definition 3, we can obtain , then . Let = ; we can obtain = . □
The time complexity of calculating the knowledge granularity of Definition 3 follows from that of computing the tolerance classes, whereas the equal knowledge granularity of Theorem 1 only requires the sizes of the subdivisions of U/P. In addition, in the process of calculating the equal knowledge granularity, every subdivision with a cardinality of 1 is pruned, which further speeds up the calculation. Therefore, computing the equal knowledge granularity takes less time than computing the knowledge granularity of Definition 3 on the same dataset.
Example 1. An example of computing the equal knowledge granularity. Let , , . The detailed data are shown in Table 1. Let , we can obtain . We use the following two methods to calculate .
- (1)
According to Definition 5, we obtain that . According to Definition 6, we can obtain .
- (2)
Since , let , , then . According to Theorem 1, we can obtain .
Although the above two methods achieve the same result, the calculation time is different. Method 1 needs to scan the dataset multiple times, so it consumes more time. Method 2, however, only needs to scan the dataset once to obtain each subdivision of U/P. According to the number of objects in each subdivision, the combination value can be acquired quickly, and then the value of the equal knowledge granularity is calculated.
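The contrast between the two methods can be sketched in code. The snippet below assumes the classical tolerance-class form of knowledge granularity, GK(P) = (1/|U|^2) Σ|T_P(x_i)|, on a single attribute with toy values; the paper's equal knowledge granularity additionally drops reflexive and symmetric pairs, but the grouping trick of Theorem 1 is the same: one scan collects the subdivisions, and combination counts replace the pairwise comparison.

```python
from collections import Counter

# Values of a single attribute for each object; None marks a missing value
# (toy data, assumed for illustration).
values = [1, 1, 2, None, 2, 2, None]

def gk_naive(vals):
    """Method 1: pairwise scan.  Sums the tolerance-class sizes, i.e. the
    number of tolerant ordered pairs, and divides by |U|^2."""
    total = sum(1 for v in vals for w in vals
                if v is None or w is None or v == w)
    return total / len(vals) ** 2

def gk_grouped(vals):
    """Method 2: one scan groups the objects into subdivisions (Theorem 1).
    Complete objects tolerate their own group plus the missing-value group;
    missing objects tolerate everything."""
    m = sum(1 for v in vals if v is None)             # missing-value subdivision
    groups = Counter(v for v in vals if v is not None)
    total = sum(c * (c + m) for c in groups.values()) + m * len(vals)
    return total / len(vals) ** 2

print(gk_naive(values), gk_grouped(values))   # identical results, one scan vs. nested loops
```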
3. The Mechanism of Knowledge Granularity Attribute Reduction in the Background of Clustering
Most traditional attribute reduction models use the equivalence class relation to compute the importance of conditional attributes. Although these methods can effectively deal with complete decision information systems, they cannot obtain correct reduction rules in incomplete ones. In order to deal with the loss of information effectively, this paper focuses on the reduction of incomplete decision systems.
The traditional reduction model does not consider the impact of deleting redundant attributes on the classification of the dataset. If there are inconsistent objects in the dataset, the classification accuracy of the reduced set will be affected. In order to improve the data quality, this paper uses the idea of clustering. Clustering divides all objects in the dataset into different clusters according to a certain standard when the target category is unknown: objects within a cluster are as similar as possible, and objects between clusters are as dissimilar as possible. Classification assigns all objects in the dataset to classes according to a certain property when the object category is known. Good clustering results can be used as a reference standard for accurate classification; the desired result of clustering is that objects of the same class are gathered within a cluster, while objects of different classes fall into different clusters. This paper studies decision information systems with labeled data objects. Therefore, we use the results of the classification to guide the clustering of the data objects. When the data objects are clustered, they follow the principle that intra-cluster objects are as close as possible and inter-cluster objects are as far apart as possible. Next, we discuss how to measure the distance between intra-cluster and inter-cluster objects.
3.1. The Intra-Cluster Similarity for Incomplete Systems
Generally, there are two approaches for clustering calculations: distance and similarity. The closer two different objects are, the harder they are to distinguish; on the contrary, the farther apart they are, the easier they are to distinguish. In this paper, the similarity method is used to measure the distinguishing ability of objects. Knowledge granularity can measure the similarity between objects: the coarser the knowledge granularity is, the weaker the distinguishing ability is, and the finer the knowledge granularity is, the stronger the distinguishing ability is. Next, we discuss how to use knowledge granularity information to measure the similarity of objects in an incomplete system.
Definition 7 (The similarity of intra-cluster objects).
Given an incomplete decision system , . For the sake of convenience, let , . Suppose the equivalence division relationship of under attribute set P is ; then, the similarity of objects in the cluster with respect to attribute P is defined as follows (where ):
Given an incomplete decision system . , . The knowledge granularity similarity of intra-clustering objects for subdivision with respect to attribute P is , then the average intra-clustering similarity is defined as follows: The desired effect of clustering is that the similarity of intra-clustering is high, and the similarity of inter-clustering is low.
Property 2. Given an incomplete system . If , and we have

Proof. Let , where represents the object sets with missing values on attribute set P. Since , we can obtain . That is to say, each subdivision of is a subset of some subdivision of . Let . According to Theorem 1 and Definition 7, we can obtain: . Since , then . Since ; therefore, the results of and are obtained. In summary, we can obtain . □
According to Property 2, we conclude that the intra-cluster similarity is monotonic when the conditional attribute set changes.
Example 2. In Table 1, let ; we have , .
3.2. The Inter-Cluster Similarity for Incomplete Systems
Definition 9 (The inter-cluster similarity for incomplete systems).
Given an incomplete decision system , let . Suppose , ; then, the inter-cluster similarity of and with respect to attribute set P for incomplete systems is defined as follows: Assuming that and are two different clusters of objects, the inter-cluster similarity between and is calculated in two steps. The first step is to calculate the similarity after the two clusters are merged, and the second step is to remove the similarity information within each single cluster. The rest is the similarity information between objects in different clusters.
Property 3. Given an incomplete decision system . Let , where , ‘*’ is the flag of missing value. We have:

Proof. According to Definition 5 and Theorem 1, we can obtain . Since , according to Definition 9, we can conclude that . Then, we have:
Property 4. Given an incomplete decision system , , then we have .
Proof. Let . Since , then is a refinement of , and is a refinement of . Let . Suppose , then . Since , then . Since , then we have . □
From Property 4, it can be concluded that the inter-cluster similarity is monotonic with the change in conditional attributes.
Definition 10 (The average similarity of inter-cluster objects).
Given an incomplete decision system , , . In n different clusters, the inter-cluster similarity is calculated between every two clusters and the number of comparisons is ; then, the average knowledge granularity similarity of inter-cluster objects in dataset U with respect to attribute set P is defined as follows:

Example 3. In Table 1, with the same conditions as Example 2, let , we have .
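The merge-and-remove idea behind Definitions 9 and 10 can be sketched as follows. Because the paper's exact similarity functional is based on the equal knowledge granularity, the `sim` function below is only an illustrative stand-in (it counts tolerant pairs inside a cluster of single-attribute values); the structure of `sim_inter` and `avg_sim_inter` mirrors the two-step calculation and the averaging over n(n−1)/2 cluster pairs described above.

```python
from itertools import combinations

def tolerant(x, y):
    """Illustrative tolerance test on single-attribute values; None = missing."""
    return x is None or y is None or x == y

def sim(cluster):
    """Stand-in similarity of one cluster: the number of tolerant unordered
    pairs inside it (the paper uses the equal-knowledge-granularity-based
    similarity instead; this placeholder only keeps the structure visible)."""
    return sum(1 for x, y in combinations(cluster, 2) if tolerant(x, y))

def sim_inter(c1, c2):
    """Definition 9 in spirit: similarity of the merged cluster minus the
    similarity contributed inside each single cluster."""
    return sim(c1 + c2) - sim(c1) - sim(c2)

def avg_sim_inter(clusters):
    """Definition 10 in spirit: average over all n(n-1)/2 cluster pairs."""
    pairs = list(combinations(clusters, 2))
    return sum(sim_inter(c1, c2) for c1, c2 in pairs) / len(pairs)

clusters = [[1, 1, None], [2, 2], [1, 3]]   # toy clusters of attribute values
print(avg_sim_inter(clusters))
```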
4. Attribute Reduction of Knowledge Granularity for Incomplete Systems
Traditional attribute reduction methods are mostly aimed at datasets with no missing data, whereas real-world datasets are often incomplete due to subjective or objective factors. Therefore, we study information systems with missing data and propose a corresponding algorithm to improve the quality of the reduction sets of incomplete systems.
This paper discusses how to design a reduction method with the idea of clustering. For an ideal clustering effect, the objects of different clusters should be far apart, and the objects within a cluster should be close together. Here, similarity is used to measure the distance between two different objects: the higher the similarity is, the closer the objects are; conversely, the lower the similarity is, the farther apart the objects are. Based on the above analysis, we designed a formula to measure the importance of attributes, where is the weight and is the dissimilarity of intra-cluster objects. We can balance the importance of the intra-cluster similarity and inter-cluster similarity through the size of the parameter . However, this method requires continual parameter adjustment, which consumes a lot of time. To this end, we first normalize and . Then, the two similarity calculations can be measured within a unified range, avoiding the parameter adjustment process.
4.1. Normalization of Inter-Cluster Similarity and Intra-Cluster Similarity
Given an incomplete decision system . Since the number of elements in each subdivision may be different, the value range of its equal knowledge granularity may also be different. To calculate the average similarity over all subdivisions, they must be computed in the same domain. For the sake of generality, we normalize the inter-cluster similarity and intra-cluster similarity. According to Definition 7 and Theorem 1, we can obtain . When all data objects in the subdivision are indistinguishable with respect to the attribute set P, takes the maximum value. When all data objects in can be distinguished from each other, takes the minimum value. So, the result of is obtained. If the value of is mapped to the range [0,1], the correction formula of is defined as
The average similarity of intra-cluster objects is corrected as follows:
and are two object sets of different clusters. Suppose , . If , then . According to Property 3, the inter-cluster similarity with respect to and is denoted as
When , takes the minimum value. When all data objects in and are indistinguishable, takes the maximum value. Then, we can obtain . The normalized formula of is as follows:
The definition of the average similarity of inter-cluster objects is revised as follows:
After the similarities of inter-cluster and intra-cluster objects are normalized, is revised as follows:
Since represents the similarity of intra-cluster objects, the dissimilarity is 1 − . If the formula of is used to measure the effect of clustering, the larger the value of is, the better the clustering effect is.
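The correction formulas above amount to a min–max mapping of each similarity onto [0, 1] using its theoretical bounds. A minimal sketch follows, with the bounds and sample values assumed for illustration (the paper derives the actual bounds from the equal knowledge granularity of each subdivision):

```python
def normalize(value, lower, upper):
    """Map a similarity value onto [0, 1] given its theoretical bounds:
    `lower` when all objects are mutually distinguishable and `upper` when
    they are all indistinguishable on the attribute subset."""
    if upper == lower:                 # degenerate case, e.g. a single object
        return 0.0
    return (value - lower) / (upper - lower)

# Toy values (assumed): once both similarities live on the same [0, 1] scale,
# they can be compared directly and the weight parameter is no longer needed.
intra_sim = normalize(0.42, lower=0.10, upper=0.80)
inter_sim = normalize(0.25, lower=0.05, upper=0.60)
print(round(intra_sim, 3), round(inter_sim, 3))
```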
4.2. The Knowledge Granularity Attribute Reduction Algorithm for Incomplete Systems (IKAR)
In Section 4.1, we discussed the similarities of inter-cluster and intra-cluster objects from the perspective of clustering, which provided a clear goal for the next step of attribute selection.
Definition 11 (Equal knowledge granularity attribute reduction).
Given an incomplete decision system , an attribute subset is an equal knowledge granularity attribute reduction if and only if:

In Definition 11, condition (1) is the jointly sufficient condition that guarantees that the equal knowledge granularity value induced from the reduction is minimal, and condition (2) is the individually necessary condition that guarantees the reduction is minimal.
According to Definition 11, we can find a smaller reduction set with an ideal classification effect. Property 2 proves that when attributes are removed, the intra-cluster similarity increases monotonically. Property 4 proves that the inter-cluster similarity also increases when attributes are removed, so that the inter-cluster dissimilarity decreases. Obviously, the monotonicity of the evaluation formula cannot be determined. Finding the subset R that maximizes the evaluation value among all conditional attribute subsets is a combinatorial optimization problem, and trying the subsets one by one is not a practical way to solve it. So, we use a heuristic method to find the smallest reduced set. For any attribute , its inner significance is defined as follows:
The bigger the value of the inner significance is, the more important the attribute is.
In order to obtain the optimal reduction set quickly, we adopt a deletion strategy. Firstly, the importance of each attribute a is computed by the inner significance formula, and the attributes are sorted by this value. Secondly, let R = C. The evaluation value of the full attribute set is used as the initial condition, which ensures that the clustering effect after reduction is no worse than that of the raw dataset. Then, remove the least important attribute a from the remaining attributes and calculate the evaluation value of the remaining subset. If the evaluation value still satisfies the initial condition, delete attribute a and continue; otherwise, the algorithm terminates. The details of the IKAR algorithm are shown in
Figure 1.
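A compact sketch of the deletion strategy just described is given below. The evaluation function eva(P) (the normalized clustering-effect measure) and the inner significance sig(a) are passed in by the caller, since their exact formulas are those of Section 4.1 and Definition 11; the toy versions used here are assumptions for illustration only.

```python
def ikar_reduce(C, eva, sig):
    """Sketch of the IKAR deletion strategy (cf. Figure 1).  eva(P) is the
    normalized clustering-effect measure of attribute subset P and sig(a) is
    the inner significance of attribute a; both are supplied by the caller."""
    baseline = eva(C)                    # evaluation of the full attribute set
    R = list(C)
    for a in sorted(C, key=sig):         # try deletions from least significant
        if len(R) == 1:
            break
        candidate = [b for b in R if b != a]
        if eva(candidate) >= baseline:   # deletion must not worsen the effect
            R = candidate
        else:
            break                        # first failed deletion ends the search
    return R

# Toy stand-ins (assumptions for illustration only).
attributes = ['a', 'b', 'c', 'd', 'e']
toy_eva = lambda P: 1.0 - 0.01 * len(P)       # pretends smaller subsets cluster better
toy_sig = {'a': 0.4, 'b': 0.1, 'c': 0.3, 'd': 0.5, 'e': 0.2}.get
print(ikar_reduce(attributes, toy_eva, toy_sig))
```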
Example 4. In Table 1, since . Let , according to the definition of , we can obtain , , , and , then , , , . Since , we delete the attribute b from R and let , . In the same way, we obtain that , , , , and . Since , we delete the attribute e from R and let . We calculate the value of , obtain the results of and . Now, we have and . If attribute c is deleted , and the algorithm is terminated. We have .

4.3. Time Complexity Analysis
The time complexity of Step 1 is . In Step 2, we calculate the value of , which includes and . The computational time complexity of the intra-cluster similarity is , so the time complexity of calculating is . Since the time complexity of calculating the inter-cluster similarity of and is , the time complexity of is . Thus, the time complexity of Step 2 is . In Step 3, the time consumed is . Since Step 4 utilizes the results of Step 3, its time complexity is . Step 5 sorts the importance of each attribute, and the time complexity is . Since the time complexity of calculating is , the time complexity of Step 5 is + . In Step 6, the time complexity of deleting a redundant attribute is , and Step 6 needs to be executed times, , so the time complexity of Step 6 is . In summary, the time complexity of IKAR is .
5. Experimental Results Analysis
In order to evaluate the feasibility and effectiveness of the proposed algorithm in this paper, the complete dataset was preprocessed to obtain incomplete information systems, and many different attribute reduction algorithms were used for reduction. The reduction set obtained in the previous stage was classified and analyzed by multiple classifiers in the Weka Tool. It was compared with three other existing algorithms in terms of reduction set size, running time, accuracy, and algorithm stability. The specific experimental framework is shown in
Figure 2.
All of the datasets are displayed in Table 2. The twelve datasets selected were downloaded from the UCI repository. In Table 2, , , and represent the number of objects, the number of conditional attributes, and the number of categories of the decision attribute, respectively. In order to generate an incomplete information system, we deleted 12% of the attribute values from the raw datasets, with the missing values randomly and uniformly distributed. The missing values of each attribute were removed with equal probability, which eliminates any impact on the classification accuracy of the later reduction set due to biased attribute selection. During data preprocessing prior to the experiment, we kept the discrete data unchanged and discretized the continuous-valued data. Two classifiers, RF (random forest) and SMO, from the Weka v3.8 software platform were used to demonstrate the classification effect of the reduction set. In subsequent experiments, the reduced sets of each dataset were analyzed for accuracy using cross-validation: all the objects in the reduced set were randomly divided into 10 equal parts, one of which was used as the test set, and the remaining 9 were used as the training set. During the classification and analysis process, the default settings of the Weka tool were used for all parameters. In this way, the cross experiment was repeated 10 times for each reduction set. Finally, the size of the reduction set, the running time, and the classification accuracy obtained from the 10 experiments were averaged. We executed all experiments on a PC with Windows 10, an Intel(R) Core(TM) i7-10710U CPU @ 1.10 GHz (1.61 GHz), and 8 GB of memory. The algorithms were coded in Python, and the software used was PyCharm Community Edition 2020.2.3 x64.
5.1. Reduction Set Size and Running Time Analysis
This section mainly verifies the feasibility of IKAR for dataset reduction from the perspective of reduction size and computation time. Here, we selected three other representative attribute reduction approaches for incomplete systems. For the convenience of the following description, these three methods are referred to as NGLE [44], IEAR [45], and PARA [46], respectively. NGLE is the neighborhood multi-granulation rough sets-based attribute reduction using Lebesgue and entropy measures, which considers both the algebraic view and the information view. IEAR is the information entropy attribute reduction for incomplete set-valued data using the similarity degree function and the proposed -information entropy. PARA is the positive region attribute reduction using the indiscernibility and discernibility relation.
The attribute reduction sizes of the four algorithms are shown in Table 3, and Table 4 shows the time consumed by the reduction process of the four algorithms in seconds. Ave represents the average value, Best in Table 3 represents the number of times the minimum reduction set was obtained, and Best in Table 4 stands for the number of times the running time was the shortest. From Table 3, it can be seen that the average reduction set size of the IKAR algorithm is 11.833, while the average reduction set sizes obtained by the NGLE, IEAR, and PARA algorithms are 11.917, 11.00, and 12.083, respectively. IKAR obtained the minimum reduction set on the Ches, Spli, Mush, and Shut datasets, and its reduction effect was slightly better than NGLE and PARA, but not as ideal as the IEAR algorithm. On the 12 datasets in Table 2, the IEAR algorithm obtains the minimum reduction set on 11 datasets.
From the consumption times in Table 4, we can find that, although the reduction-size advantage of the IKAR algorithm is not obvious, IKAR is obviously better than the other three algorithms in running time. When calculating the reduction sets of the 12 datasets, the average time required by IKAR is 15.20 s, whilst the average times consumed by the other three algorithms are 155.40, 2003.56, and 260.38 s, respectively. From the experimental results in Table 4, it can also be found that the IKAR algorithm only needs 25.30 s to reduce the Shut dataset, while the running times of NGLE, IEAR, and PARA are 886.08, 3974.02, and 133.25 s, respectively. When reducing the Mush dataset, IKAR takes 4.45 s, and the other three algorithms take 411.09, 4544.04, and 827.16 s. Of course, the IKAR algorithm also has shortcomings: when the number of sample categories is higher, the algorithm is more time-consuming. For example, since there are 10 different categories of samples in the Hand dataset, the calculation of the IKAR algorithm takes 92.93 s, while the time consumption of the NGLE and PARA algorithms is 81.61 and 91.28 s, respectively, which is less than that of IKAR.
From the above analysis, we can conclude that the IKAR algorithm can effectively reduce the datasets and is obviously better than similar algorithms in terms of reduction speed.
5.2. Changes in Classification Accuracy with Missing Values
It is not enough to evaluate the overall performance of a reduction algorithm only from the size of the reduction set and the running time. Here, we further analyze the performance of the above four algorithms from the perspective of the classification accuracy of the reduction set. In order to find out the influence of missing values on the IKAR algorithm, we selected four datasets from Table 2: Heas, Prom, Hepa, and Stai. During the experiment, we considered 5 different missing levels: 2%, 4%, 6%, 8%, and 10%. First, 2% of the data objects in the original dataset were randomly selected, and attribute values of different attributes were randomly deleted from these data objects to generate an incomplete dataset. In order to reduce the bias of attribute selection that may be caused by random deletion, we used the 2% data just obtained as a basis, selected attributes with equal probability, and deleted another 2% of the data, which generated an incomplete dataset with 4% missing values, and so on, generating datasets with missing values of 6%, 8%, and 10%, respectively. After the incomplete datasets were ready, we used the IKAR, NGLE, IEAR, and PARA algorithms for reduction and used the SMO and RF classifiers to analyze the classification accuracy. The specific results are shown in Figure 3 and Figure 4, where the horizontal axis represents the proportion of deleted data in the dataset and the vertical axis represents the classification accuracy.
Figure 3 and Figure 4 show the trends obtained by using the SMO and RF classifiers to analyze the accuracy of the reduced datasets. From the results in Figure 3 and Figure 4, it can be seen that as the missing data increase, the classification accuracy of the above four algorithms shows a downward trend. For example, under the SMO classifier, when the proportion of missing values in the dataset Heas changes from 2% to 4%, the accuracy changes of the IKAR, NGLE, IEAR, and PARA algorithms are (84.01→83.69), (80.32→79.69), (77.01→76.86), and (79.24→78.92), respectively. Under the RF classifier, when the proportion of missing values in the dataset Heas changes from 2% to 4%, the accuracy changes of IKAR, NGLE, IEAR, and PARA are (81.43→80.94), (77.85→76.28), (74.79→73.33), and (76.24→74.86), respectively. The main reason for this phenomenon is that as the proportion of missing data increases, more and more data objects cannot be distinguished, resulting in a decrease in classification accuracy. From the trends in Figure 3 and Figure 4, it can also be seen that the IKAR algorithm changes smoothly, while the classification accuracy of the reduced sets of the other three algorithms is affected more strongly as the proportion of missing data increases. For example, under the RF classifier, when the missing proportion of the Hepa dataset changes from 2% to 10%, the accuracy of IKAR changes by 2.09, while the accuracy changes of the NGLE, IEAR, and PARA algorithms are 3.2, 3.19, and 4.49, respectively. Under the SMO classifier, on the classification accuracy of the Heas dataset, the change for the IKAR algorithm is 1.12, while for the other three algorithms it is 3.37, 2.28, and 2.93, respectively.
5.3. Classification Accuracy Analysis
The previous experiments compared the changes in classification accuracy in the presence of missing values. As the proportion of missing values changes, the accuracy of the IKAR algorithm not only changes smoothly but also achieves a better classification effect on multiple datasets. We used the SMO and RF classifiers to analyze the accuracy of the previous reduction sets, and the detailed results are shown in Figure 5 and Figure 6, respectively. In Figure 5 and Figure 6, "missing values" represents the incomplete raw dataset and Ave denotes the average accuracy over the different datasets. Under the SMO classifier, the IKAR algorithm obtained an average accuracy of 85.31%, while the average accuracies obtained by the other three algorithms NGLE, IEAR, and PARA and by the raw datasets were only 78.72%, 78.78%, 78.81%, and 84.49%, respectively. Among the 12 datasets in Table 2, the IKAR algorithm achieved the highest accuracy 11 times, and its classification accuracy is higher than that of the raw set. For example, the classification accuracy of IKAR is (94.76, 99.68, 93.86)% on the datasets Hand, Shut, and Spli, while the classification accuracy of IEAR is (81.59, 85.93, 81.53)%, an average of 10 percentage points lower than IKAR. Similarly, on the RF classifier, the average classification accuracy of IKAR (86.24%) is higher than that of the raw dataset (85.09%) and those obtained by the other three algorithms (80.79%, 79.77%, and 81.14%, respectively).
From the results shown in Figure 5 and Figure 6, we can draw three conclusions: (1) The classification accuracies obtained by the above four attribute reduction algorithms are close to those of the raw dataset, which indicates that all four algorithms can effectively reduce the incomplete dataset. (2) The IKAR algorithm has better overall performance, mainly because it considers both the correlation between attributes and the influence of attributes on the classification, so it obtains a more ideal classification accuracy. (3) The classification accuracy of the IKAR algorithm is higher than that of the original dataset, mainly because the IKAR algorithm effectively eliminates redundant attributes and reduces the influence of noisy data on classification.
5.4. Algorithm Stability Analysis
To indicate the statistical significance of the classification results, the non-parametric Friedman test and Bonferroni–Dunn test were used to analyze the classification accuracy of each classifier with the different methods in Section 5.2, where the Friedman test is a statistical test that uses the ranking of each method on each dataset. The Friedman statistic is described as follows:
$$\chi_F^2=\frac{12N}{t(t+1)}\left[\sum_{i=1}^{t}R_i^2-\frac{t(t+1)^2}{4}\right],\qquad F_F=\frac{(N-1)\chi_F^2}{N(t-1)-\chi_F^2},$$
where N is the number of datasets, t is the number of algorithms, and $R_i$ represents the average rank of the classification effect of the i-th algorithm over all datasets. The statistic $F_F$ obeys the Fisher distribution with t − 1 and (t − 1)(N − 1) degrees of freedom. If the value of $F_F$ is bigger than the critical value $F_\alpha(t-1,(t-1)(N-1))$, then the original hypothesis does not hold.
The $F_F$ test value can judge whether these algorithms are different, but it cannot indicate the superiority of any algorithm. In order to explore which algorithm is better, we used the Bonferroni–Dunn test to calculate the critical range of the difference in average rank values, which is defined as follows:
$$CD_\alpha=q_\alpha\sqrt{\frac{t(t+1)}{6N}}.$$
If the difference between the average rank values of two algorithms exceeds the critical difference $CD_\alpha$, the hypothesis that 'the performance of the two algorithms is the same' is rejected with the corresponding confidence, and the algorithm with the better average rank is statistically superior to the other; otherwise, the two algorithms cannot be considered significantly different. Generally, we set $\alpha$ = 0.05.
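For reference, the two test statistics can be computed directly from the average ranks. Using the rounded ranks reported in Table 5, the following sketch approximately reproduces the quantities quoted in the next paragraphs (small deviations are expected from the rounding of the ranks).

```python
import math

def friedman(ranks, N):
    """Friedman chi-square and its F-distributed variant, computed from the
    average ranks of t algorithms over N datasets."""
    t = len(ranks)
    chi2 = 12.0 * N / (t * (t + 1)) * (sum(r * r for r in ranks) - t * (t + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (t - 1) - chi2)
    return chi2, F

def bonferroni_dunn_cd(q_alpha, t, N):
    """Bonferroni-Dunn critical difference: two algorithms differ significantly
    if their average ranks differ by more than this value."""
    return q_alpha * math.sqrt(t * (t + 1) / (6.0 * N))

ranks = [1.08, 2.75, 3.58, 2.58]        # average ranks under SMO (Table 5)
print(friedman(ranks, N=12))            # about (23.0, 19.6); the paper reports 23.4 and 20.429
print(bonferroni_dunn_cd(2.569, t=4, N=12))   # 1.354, as used in Section 5.4
```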
In order to compare the stability of the IKAR algorithm, we chose three other similar algorithms, NGLE, IEAR, and PARA, to reduce the datasets in Table 2. Then, the reduction results were analyzed by the classifiers SMO and RF using the Weka tool, and the classification accuracy was examined with the Friedman test and the Bonferroni–Dunn test. When t = 4 and N = 12, we obtain $\chi_F^2$ = 23.4 and $F_F$ = 20.429. Under the classifier SMO, the classification accuracies of the four algorithms' reduction sets are ranked, with the number 1 representing the most ideal. The details of the ranking results are shown in Table 5. The average ranks of the IKAR, NGLE, IEAR, and PARA algorithms are 1.08, 2.75, 3.58, and 2.58, respectively. Since the critical value $F_{0.05}(3, 33)$ is 2.892 and $F_F = 20.429 > 2.892$, we can reject the original hypothesis at $\alpha = 0.05$ under the Friedman test. So, there are statistical differences in classification accuracy among the above four algorithms. Next, with $q_{0.05}$ = 2.569 we obtain $CD_{0.05}$ = 1.354, and the results of the Bonferroni–Dunn test for these four algorithms at $\alpha = 0.05$ are shown in Figure 7. The accuracy of the algorithms on the left side of the coordinate axis is relatively high in Figure 7. From Figure 7, we know that the average rank of IKAR is $s_1$ = 1.08. Among the other three algorithms, the best-ranking algorithm, PARA, has an average rank of $s_2$ = 2.58. Since $s_2 - s_1$ = 1.5 and $1.5 > CD_{0.05} = 1.354$, we can find that the IKAR algorithm is significantly superior to PARA. In the same way, IKAR is better in accuracy than NGLE and IEAR. However, there is no obvious difference in the rankings among the NGLE, IEAR, and PARA algorithms.
For the same reason, under the RF classifier, the average ranks of the classification accuracy of the above four algorithms are 1.00, 2.83, 3.67, and 2.50, as shown in Table 6. Since $\chi_F^2$ = 26.8 and $F_F$ = 32.043, then $F_F > F_{0.05}(3, 33) = 2.892$. We can see that the classification accuracy of these four algorithms is significantly different in the statistical sense under the RF classifier. From Figure 8, we can ascertain that the IKAR algorithm is significantly different from NGLE in classification, while the rankings of the NGLE, IEAR, and PARA algorithms have no obvious differences among each other.
In Table 5 and Table 6, STD represents the standard deviation of the classification accuracy of the different algorithms' reduced datasets. From Table 5 and Table 6, we know that the STD of the IKAR algorithm is the smallest among the four algorithms. The STD of the classification accuracy of IKAR is less than 1 for both the SMO and RF classifiers, while the STD values of the other algorithms are bigger than 1, and some are even greater than 3. For example, the STD of IKAR on the Heas dataset is ±0.015 under SMO, but the STD values of NGLE, IEAR, and PARA are ±1.455, ±2.318, and ±1.521, respectively. On the RF classifier, the IKAR algorithm shows the same stability in classification accuracy: on the dataset Spli, the STD of IKAR's classification accuracy is ±0.248, while the STD values of the other three algorithms are ±1.994, ±3.565, and ±3.558, respectively.
Therefore, all the test results demonstrate that there is no consistent evidence of statistical differences between any two of the four approaches under the SMO and RF classifiers. In general, the IKAR model is better than the other models in terms of stability.
6. Conclusions
In the face of incomplete systems, most traditional attribute reduction models cannot obtain effective reduction rules, which affects the classification accuracy of the reduced set. Therefore, we propose a new attribute reduction method, IKAR, based on the clustering background for incomplete systems. IKAR uses the tolerance principle to calculate knowledge granularity information and to measure the similarity of data objects. When selecting attributes, the similarity of intra-cluster objects should be as large as possible, and the similarity of inter-cluster objects should be as small as possible. The innovations of this paper are manifested in the following four aspects: (1) use of the tolerance principle to quickly calculate knowledge granularity; (2) use of knowledge granularity to calculate the similarity of inter-cluster and intra-cluster objects; (3) use of the idea of clustering to calculate the importance of attributes; (4) in addition to the conventional time, space, and precision analysis, analysis of the stability on datasets with different percentages of missing values. All the experiments show that the IKAR algorithm is not only superior in reduction time compared to the other three algorithms, but also has excellent performance in terms of accuracy and stability. Of course, the IKAR algorithm also has some shortcomings; for example, it is unsuitable for datasets with multiple decision values, and complex data types and dynamic changes of the datasets are not considered.
In our future research endeavors, we intend to work on the following aspects. (1) We will focus on attribute reduction for incomplete systems with mixed data types, especially on how to deal with missing data. (2) We will integrate incremental learning methods into the knowledge granularity reduction model in the background of clustering. (3) To address even bigger datasets more efficiently, applying GPU and MapReduce technologies to design parallel attribute reduction models or acceleration strategies is a very promising research direction.