1. Introduction
One of the most interesting and widely used approaches to multidimensional data analysis is cluster analysis, or clustering. Currently, there are many clustering algorithms. Despite significant differences between them, they all rely on the initial postulate of compactness: in the space of objects, all ‘‘close’’ objects must belong to the same cluster, and all dissimilar objects must fall into different clusters. The concept of ‘‘proximity’’ itself is interpreted differently by different clustering algorithms.
Within the framework of Discrete Mathematical Analysis (DMA), an original approach to data analysis that uses fuzzy mathematics and fuzzy logic [1], methods of so-called DPS-clustering are being developed. The present study is devoted to DPS-clustering and continues the series of papers on this problem [2,3,4,5,6,7].
The initial notion of DPS-clustering is a fuzzy model of the fundamental mathematical property of being a ‘‘limit’’. It is called the density and is a non-negative function of an arbitrary subset and an arbitrary point of the initial space in which clustering is to be performed.
The value of the density can be understood as the strength of the connection between a subset and a point, as the degree of influence of a subset on a point, or, dually, as the degree to which the point is a limit point of the subset. This view of density automatically requires its monotonicity in the subset: the larger the subset, the stronger its impact on the point and the more limiting it is for it.
Nontrivial densities always exist in a finite metric space (FMS). Each density on an FMS gives a new look at it and a new program for its study, so that density is a genuinely new mathematical concept.
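As an illustration of what such a density can look like on a concrete FMS, here is a minimal Python sketch. It is not the density used by the SDPS-algorithm discussed later in the paper; the function name, the radius r, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def ball_density(A, x, points, r=1.0):
    """Illustrative density P(A, x): the number of points of the subset A that
    fall within radius r of the point x, a crude 'strength of connection'
    between A and x. It is monotone in A: enlarging A can only increase it."""
    # A is a set of indices into `points`; x is an index as well.
    d = np.linalg.norm(points[list(A)] - points[x], axis=1)
    return int(np.sum(d <= r))

# Tiny finite metric space: five points on the plane.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 3.0]])
A = {0, 1, 2}
print(ball_density(A, 4, points, r=0.5))   # weak connection of A to the far point 4
print(ball_density(A, 1, points, r=0.5))   # strong connection of A to its own point 1
```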
Fixing the density level α and interpreting it as a limit level, we can introduce the concept of discrete perfection with level α: a subset is called discretely perfect with level α (an α-DPS-set, or simply a DPS-set) if it consists of exactly all points of the original space that are α-limit to it. A rigorous theory of DPS-sets has been constructed within the framework of DMA. In particular, it has been shown that DPS-sets have the properties of clusters. This, as well as a comparison of DPS-clustering with modern cluster analysis algorithms and its applications, is described in detail in [5].
The DPS-clustering algorithms created to date operate in finite metric spaces, depend on three parameters (the density P, the density level α, and the local coverage radius r), and have two stages.
At the first stage, topological filtering of the original space is carried out: it is cleaned of noise. The DPS-algorithm iteratively cuts out from the original space (Figure 1a) the maximal α-perfect subset (Figure 1b).
At the second stage, the DPS-algorithm splits the result of the first stage into r-connected components, which it considers to be clusters (Figure 1c).
For the array in Figure 1a and for the SDPS-algorithm, the best known of the DPS-algorithms (Section 2.2.2), the clusters are shown in different colors in Figure 1b. Similar clusters obtained on the same array by the well-known cluster analysis algorithms DBSCAN [8] and OPTICS [9] practically coincide and are shown in Figure 1c. This allows us to conclude that DPS-algorithms, like DBSCAN and OPTICS, represent a new stage in cluster analysis, since they not only split the initial space into homogeneous parts but also preliminarily clear it of noise (filter it).
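For readers who wish to reproduce this kind of comparison, a minimal sketch using the scikit-learn implementations of DBSCAN [8] and OPTICS [9] is given below. The data and the parameter values are hypothetical; the actual array of Figure 1 and the settings used in the paper are not reproduced here.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

# Hypothetical 2-D array standing in for the test array of Figure 1a:
# three condensations plus uniform noise.
rng = np.random.default_rng(0)
blobs = [rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in [(0, 0), (4, 0), (2, 3)]]
noise = rng.uniform(-2.0, 6.0, size=(60, 2))
X = np.vstack(blobs + [noise])

db = DBSCAN(eps=0.5, min_samples=8).fit(X)      # label -1 marks points filtered out as noise
op = OPTICS(min_samples=8, max_eps=0.5).fit(X)

print("DBSCAN clusters:", sorted(set(db.labels_) - {-1}))
print("OPTICS clusters:", sorted(set(op.labels_) - {-1}))
```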
Studies show that there are situations where the result of the first stage does not include all noteworthy clusters. Lowering the limit level reduces the quality of cluster recognition and is therefore not a way out of this situation.
Furthermore, due to the locality of the radius r, the partitioning of the maximal α-perfect subset into r-connected components at the second stage is often too fine and detailed, and the components need to be merged, as can be seen in Figure 1b.
The present work is devoted to correcting these disadvantages.
3. Results: Iterative DPS
The first stage of the DPS-algorithm carves out of the space X a maximal perfect subset that is β-extremely P-dense against the background of X at each of its points.
Let us turn to Figure 3b: it shows, in red, the result of the DPS algorithm on the array X (Figure 3a). It is easy to see that not all noteworthy condensations from X were included. The reason for their non-inclusion in the result is the contradiction between the r-local character of the view of X and the global approach to determining the level α of the density P from its extremality level β based on the whole image ((7) and (8)): the density level of the worthy condensations in the complement turns out to be below α.
To include them in the final result, we can lower the extremality level β and switch to the DPS-algorithm with this lower level. In the control example, there is a first (largest) level of β for which the result includes all worthy condensations. This result is shown in red in Figure 3c: we see that, along with the previous result and the worthy condensations from the complement, it also includes weak points of the r-halo of the already recognized set, which were pulled in by their proximity to it.
It can be done otherwise: keep the extremality level β, but change the original space by passing from X to the complement of the already obtained result. The result of the DPS-algorithm on this complement is shown in green in Figure 3b. If we make another similar transition, the blue cluster in Figure 3b appears.
Such a way of using the DPS scheme is called the ‘‘iterative DPS’’ algorithm.
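The scheme just described can be summarized in a short sketch: the first DPS-stage is re-run on the complement of everything already found, with the extremality level kept fixed, until the result becomes trivial. The routine dps_stage_one below is a hypothetical stand-in for the first stage of any DPS-algorithm; its internal construction is not reproduced here.

```python
def iterative_dps(X, dps_stage_one, max_iter=10):
    """Sketch of the 'iterative DPS' scheme: keep the extremality level fixed,
    but re-run the first DPS stage on the complement of everything found so far.
    `dps_stage_one(S)` is assumed to return the maximal perfect subset of S
    as a set of point identifiers."""
    found = []                      # found[i] is the result of the i-th iteration
    rest = set(X)                   # current complement, the space for the next run
    for _ in range(max_iter):
        layer = dps_stage_one(rest)
        if not layer:               # trivial result: nothing dense is left
            break
        found.append(layer)
        rest -= layer
    return found
```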
3.1. Iterations by Extremality
It will be remembered that the level α of the density P is determined from the extremality level β in the DPS-algorithm via Equation (8). Let us introduce the partition of X into the result of the first stage and its complement. According to (7) and (8), the level α is a mean of the density values over the whole space X and, by the properties of convexity, it lies between the corresponding mean over the result and the mean over the complement. The mean over the result is not below α, because the result is α-dense at each of its points. Hence, if β is non-negative, the mean over the complement cannot exceed α. Then, from these inequalities and the properties of the fuzzy comparison n, it follows that the density level required for the operation of the DPS-algorithm on the complement does not exceed the level of the original operation, which we will take as the zero one, and can be strictly less than it. This, in turn, means that such an operation can be nontrivial, i.e., that the DPS result on the complement can be a nonempty set.
In this situation of non-triviality, one can define the first iteration of the space X by extremality, the result of the original operation being naturally considered as its zero iteration.
Definition 5. Let the DPS result on the complement be non-trivial; then the first iteration of the space X with respect to the extremality level β is the disjoint union of the zero iteration and this result.
Note 1. Given this definition, the above can be understood as the first precondition for the existence of the first iteration by extremality: β must be non-negative. In this connection, the zero level β = 0 appears to be the most productive and interesting, since at it the second precondition (the non-triviality of the result) is the weakest.
A direct check shows that the same relations hold after this step, so, by repeating the above deduction with the corresponding replacements, we obtain the level of the algorithm on the new complement and the possible nontriviality of its result.
If all this can be continued up to and including the i-th step, i.e., the sets obtained at each step are nontrivial, then their union with the zero iteration will be called the i-th iteration of X by the extremality level β:
3.2. Iterations of the DPS Algorithm
Considering the original algorithms as their own zero iterations, we define their i-th iterations as the processes of building for the space X its i-th iteration by extremality and then breaking it into r-connectivity components:
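The splitting into r-connectivity components used at this step can be sketched as a breadth-first search on the r-neighborhood graph. This is only an illustration assuming Euclidean points; the FMS and the distance used in the paper may differ.

```python
from collections import deque
import numpy as np

def r_connected_components(points, subset, r):
    """Split `subset` (indices into `points`) into r-connectivity components:
    two points belong to the same component if they can be joined by a chain
    of subset points with every step of length <= r (breadth-first search)."""
    pts = np.asarray(points)
    unvisited = set(subset)
    components = []
    while unvisited:
        start = next(iter(unvisited))
        comp, queue = {start}, deque([start])
        unvisited.remove(start)
        while queue:
            u = queue.popleft()
            near = [v for v in unvisited
                    if np.linalg.norm(pts[u] - pts[v]) <= r]
            for v in near:
                unvisited.remove(v)
                comp.add(v)
                queue.append(v)
        components.append(comp)
    return components
```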
Due to the disjunctive nature of the decomposition (10), the iteration index ind is correctly defined on the i-th iteration of X: each point carries the number of the iteration at which it was included. The index ind is transferred from points to connectivity components, thus becoming multivalued. This, in turn, makes it possible to define, for any subset, a conditional version of the algorithm:
Example 1. Figure 4b–d show the sets, for the array X shown in Figure 4a, obtained as a result of the operation of the SDPS-algorithm based on the fuzzy comparison of non-negative numbers s and t [1]. Figure 5 shows both stages of the algorithm. Its result indicates that, in the general case, the second stage of DPS cannot be considered final: many components of the r-connectivity in Figure 5b need further connection. This will be done by the third stage of DPS, whose results on the given array X for the given SDPS-algorithm will be shown in Example 4.
4. Results: Third Stage of DPS
The first stage of the algorithm cuts out from the original FMS X the maximal α-perfect subset. The result of the second stage of the algorithm is the set of all r-connectivity components of this subset. Each component is considered to be a DPS-cluster because, by virtue of the assumption that the initial density P is r-local, each component is itself a DPS-set in X and, in particular, is α-dense at each of its points.
At the next stage, the result of the second stage is taken as given and is not subject to further internal transformation. The radius r is treated as infinitesimally small and the level of P as infinitely large; as a consequence, each component is considered to be a single, indivisible spot (a large point).
Spots are interpreted as fragmentary manifestations (outcrops) of global anomalous entities in X. To understand their true scale, further connection of spots is necessary where possible. This is the third and final stage of the DPS-algorithm.
4.1. Logic and Action Plan
A collection of spots C is considered by the expert to be a single whole if the advantage in closeness of the internal transitions (paths) between any of its spots over the external transitions from C to the rest of the spots allows the expert to conclude that C is non-random.
Thus, a cluster of spots C must have a tighter internal connection against the general background of all spots. There are no other constraints on C, in particular none on its shape.
Because of the α-perfectness of the spots, it is natural to take as the distance between two spots the minimum distance between their closest points (13). It will no longer be a metric, since the triangle inequality fails for d (13), as the example below shows:
Example 2. Consider three subsets of the real line placed one after another. The distance d between the two outer subsets is obviously larger than the sum of their distances d to the middle one, so the triangle inequality fails.
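A small sketch of the distance (13) between spots and of the failure of the triangle inequality it allows. The one-dimensional spots here are hypothetical and only mimic the structure of Example 2; the concrete sets of that example are not reproduced.

```python
import numpy as np

def spot_distance(A, B):
    """Distance (13) between two spots: the minimum distance between their
    closest points. It is symmetric, but the triangle inequality may fail,
    so it is only a quasi-metric."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return float(np.min(np.abs(A[:, None] - B[None, :])))

# Hypothetical one-dimensional spots: B sits between A and C, so
# d(A, C) > d(A, B) + d(B, C).
A = np.arange(0.0, 1.0, 0.1)
B = np.arange(4.0, 5.0, 0.1)
C = np.arange(8.0, 9.0, 0.1)
print(spot_distance(A, B), spot_distance(B, C), spot_distance(A, C))  # ~3.1, ~3.1, ~7.1
```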
Based on the distance d, the formalization of the expert's logic for connecting spots is constructed: in each cluster of spots C, any two spots can be connected by a chain of spots in which the distance d between neighboring links is smaller than the distance d from C to the rest of the spots. This is the d-clustering condition, necessary for C. Its formal analysis is the first part of the program for algorithmizing the third stage of DPS. Note that d-clustering and DPS-clustering are different interpretations of clustering: the first is based on connectivity, the second on limiting.
The second part of this program consists in constructing, on the basis of the initial distance d in the original space X, measures of proximity and remoteness between spots that express the expert's judgment. Based on these, from the d-clusters of spots only those are selected in which the advantage of the proximity of the internal transitions over the remoteness of the external ones is, in the expert's opinion, non-random.
4.2. First Part: The Theory of d-Clusters
Initial data and notation: X is a finite set; d is a quasi-metric on X (so that (X, d) is a quasi-metric space).
Definition 6. By the enumeration of the space X with origin at the point x we call the sequence of its points that starts at x and in which every next element is a point of the remainder closest (in d) to the set of points already enumerated.
4.2.1. The Notation
The accompanying numerical sequence of distances consists of the distances from each newly enumerated point to the set of points enumerated before it:
Definition 7. We call an eigen d-cluster with center (origin) at x an eigen (proper) segment of the enumeration of X from x for which every internal transition distance is smaller than the distance from the segment to the next element of the enumeration.
Statement 3. The cluster is independent of the choice of origin within itself. In other words, in the case of d-clustering, there is a set-theoretic equality: for any point y of the cluster, the first terms of the enumeration of X with origin at y lie in the cluster.
Proof. Induction on the number i of steps in the enumeration of X with origin at a point y of the cluster.
- • Base step. We have to show that the first element after y in the enumeration from y lies in the cluster. In the original enumeration from x, the element y has two neighbors if it is an interior element of the segment and one neighbor if it is an end element. The distance from y to at least one of these neighbors is smaller than the distance from the cluster to its complement, so the next element after y in the enumeration from y necessarily lies in the cluster.
- • Induction step. Let the first i elements of the enumeration from y lie in the cluster; let us show that the (i + 1)-th element also lies in it. Consider the positions that the already enumerated elements occupy in the original enumeration from x; two cases are possible:
- 1st: these positions have no gaps, i.e., they form an eigensegment of the original enumeration. In this case, the cluster necessarily contains an element adjacent to this segment (immediately before or immediately after it), whose distance to the enumerated part is nontrivial and smaller than the distance from the cluster to its complement. Hence, the (i + 1)-th term of the enumeration from y necessarily lies in the cluster.
- 2nd: the positions have gaps. Take the first missing position; the corresponding element of the cluster is not yet enumerated, and its distance to the already enumerated part is smaller than the distance from the cluster to its complement. Hence, in this case too, the (i + 1)-th term of the enumeration from y lies in the cluster.
□
Consequence 1. Eigen d-clusters behave like non-Archimedean balls: either one contains the other or they do not intersect. In particular, two d-clusters of the same order either coincide or do not intersect.
Proof. Let two eigen d-clusters have a common point. According to Statement 3, each of them can be enumerated starting from this common point, so the shorter one is an initial segment of the longer one and is therefore contained in it. If their orders coincide, the clusters coincide. □
The d-clusters of Definition 7 are eigen, non-point subsets of X. Let us remove this restriction on k in Definition 7 using the notion of a connectivity index, which develops and continues the topic of finite connectivity from Section 2.2.1.
Definition 8.
- 1. The connectivity index of a subset A is the minimum r for which A is r-connected (14).
- 2. The isolation index of a subset A is the distance from A to its complement in X (15).
Proof.
- 1. Proof of the first inequality. Induction on k; for k = 1, it holds directly according to (14) and (15). Suppose that it holds for the segment of length k, that is, any two of its points can be connected by a path whose transition distances do not exceed the bound. The next element of the enumeration is, according to (15), the point of the remainder closest to the segment, so the transition to it does not exceed the bound either; through it, any point of the enlarged segment can be connected to any other by a path with transition distances within the bound.
- 2. Proof of the second inequality: let a point lie outside the subset; then its distance to the subset is not smaller than the isolation index, and any path joining it to a point of the subset has at least one jump of at least that length. The inequality follows.
□
For the eigen d-cluster C (Definition 7), the proved statement means that the necessary condition (17) is fulfilled. This condition is invariant not only with respect to the origin of the enumeration but also with respect to the enumeration itself, which up to this point was obligatory for the definition of C. Therefore, it is natural to check this condition for sufficiency of d-clustering.
Statement 5. If (17) is satisfied for an eigensubset A of X, then A is a d-cluster.
Proof. There exists a radius not smaller than the connectivity index of A and smaller than its isolation index for which the set A is r-connected. Hence, in the enumeration of the space X from any point of A, the elements of the complement of A appear only after all the elements of A. □
The result of Definition 8, Statement 4, and Statement 5 is a new, invariant, and more complete definition of a d-cluster.
Definition 9. We call a d-cluster in X any subset C whose connectivity index is smaller than its isolation index.
To the eigen d-clusters C are thus added all single points and the whole space X.
Thus, the restrictions on k in Definition 7 are removed:
- • every single point is a d-cluster, since its connectivity index is zero;
- • the whole space X is a d-cluster, since its isolation index is infinite: the complement of X never appears in the enumeration, which means it lies at infinity.
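The two indices of Definition 8 and the d-cluster check of Definition 9 admit a straightforward computation. The sketch below assumes a precomputed distance matrix and a strict inequality between the two indices, which is our reading of the elided formula in Definition 9.

```python
import numpy as np

def conn_index(dist, A):
    """Connectivity index of A: the minimum r for which A is r-connected,
    i.e. the longest edge of a minimum spanning tree on A (Prim's algorithm).
    `dist` is a square matrix of pairwise distances over X."""
    A = list(A)
    in_tree, rest = {A[0]}, set(A[1:])
    longest = 0.0
    while rest:
        u, v = min(((i, j) for i in in_tree for j in rest), key=lambda e: dist[e])
        longest = max(longest, dist[u, v])
        in_tree.add(v)
        rest.remove(v)
    return longest

def isol_index(dist, A, X):
    """Isolation index of A: the distance from A to its complement in X."""
    comp = set(X) - set(A)
    return min(dist[i, j] for i in A for j in comp) if comp else float("inf")

def is_d_cluster(dist, A, X):
    # Assumed reading of Definition 9: the subset hangs together more tightly
    # than it is attached to the rest of the space.
    return conn_index(dist, A) < isol_index(dist, A, X)
```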
4.2.2. Conclusions
The collection of d-clusters forms a hierarchy of sets on X that is based on the notion of discrete connectivity and differs from the traditional, usually binary, hierarchies of hierarchical cluster analysis [10]. This hierarchy is the first part of the formalization of the third stage of DPS.
4.3. Second Part: Final Selection of d-Clusters
Thanks to (14)–(16), the search for d-clusters in the space of r-connectivity components with respect to the distance (13), that is, the formation of the hierarchy of d-clusters, is constructive. Each d-cluster C has two characteristics: the internal one (its connectivity index) and the external one (its isolation index) (17). Based on this information, the expert E must decide whether or not to combine the spots from C into a single whole.
Let us denote the set of maximal d-clusters obtained after the expert E has joined the spots. According to Consequence 1, they are all disjoint.
Under the condition that this set of maximal d-clusters is nontrivial, a partition appears that is, in the general case, coarser than the partition (5) into r-connectivity components:
We consider this partition the result of the third and final stage of the DPS-algorithm. The full history is presented in Figure 6.
Thus, for each component, one of three things can happen:
- • the component is sufficiently isolated from the rest to be of interest to the expert on its own; it is the only outcrop of the global entity behind it in X;
- • the component is not sufficiently isolated on its own, but it is part of a d-cluster C that is of interest to the expert and that, as a whole, represents a global entity in X;
- • no d-cluster containing the component is of interest to the expert: either it is not internally dense enough, or it is not externally separated enough. In the expert's view, if the component is a fragment of an outcrop of a global entity in X, then this outcrop is not clear enough.
Furthermore, Example 3 will show that all of these options are possible.
At stage 3.2 (Figure 6), the expert can act in different ways. Let us present the simplest, Boolean variant of actions for forming the final set of d-clusters.
4.3.1. Boolean Variant
Expert E decides whether the d-cluster C is included in the final result by comparing its internal and external characteristics with proximity and remoteness thresholds: the internal characteristic must not exceed the proximity threshold, and the external one must not fall below the remoteness threshold.
Expert E considers the parameter r of the DPS-algorithm to be very small (infinitesimal), much less than the proximity threshold, which, in turn, according to E, is much less than the remoteness threshold:
The proximity threshold, like the radius r in the DPS-algorithm, is built using the Kolmogorov averaging of the non-trivial distances of the FMS X [5].
For the parameter r, numerous applications of DPS-series algorithms have established which choice can be considered optimal; the studies carried out within the framework of the present paper give an analogous range for the proximity threshold. The intersection of the ranges of the parameter r and the proximity threshold is explained both by the fuzzy perception of proximity by the expert and by the diversity in the arrangement of an arbitrary FMS.
The remoteness threshold is obtained from the proximity threshold by formalizing, with the help of fuzzy comparison, the expert judgment that it must be much larger. For the fuzzy comparison given in Example 1, this yields the corresponding inequality between the two thresholds.
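A sketch of how such a threshold can be produced by Kolmogorov (quasi-arithmetic) averaging of the non-trivial distances of an FMS. The particular generator function and the coefficients used in the paper are not reproduced here; the geometric mean (generator log) below is only a placeholder.

```python
import numpy as np

def kolmogorov_mean(values, phi=np.log, phi_inv=np.exp):
    """Kolmogorov (quasi-arithmetic) averaging: phi_inv(mean(phi(values))).
    With phi = log this is the geometric mean, used here as a placeholder."""
    values = np.asarray(values, dtype=float)
    return float(phi_inv(np.mean(phi(values))))

def proximity_threshold(points):
    """Threshold built from all non-trivial pairwise distances of the FMS."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    nontrivial = d[d > 0]           # drop the zero diagonal and coincident points
    return kolmogorov_mean(nontrivial)
```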
Example 3. The array X in Figure 7a has already been seen above: it was a testing ground for the DPS, DBSCAN, and OPTICS algorithms. Figure 7b–d show the complete DPS-scenario. The result of the cutting at the first stage (Figure 7b) was divided into eight r-connectivity components at the second stage (Figure 7c). Two components (blue and light brown), and two more in a green combination, passed independently to the third stage (Figure 7d). The remaining four brown components did not pass the third stage.
Example 4. In the conditions, notation, and parameters of Example 1, Figure 8a–c show the original array and the two stages of the SDPS algorithm on it. Figure 8d–f show the complete scenario for its second iteration. Comparison of Figure 8c,f serves as clear evidence of the work done in the paper: the full version of the second iteration with the same parameters performs better than the incomplete zero version (the original SDPS algorithm) that existed before this article.
5. Discussion
The density P on a finite space X is a definition, in the language of fuzzy mathematics, of the limit property in X. From a formal point of view, P is a fuzzy structure on the direct product of the subsets of X and the points of X, monotone in the first argument (1).
Fixing a level α for P, we can define a closure operator and, through it, an ordinary topology on X. Thus, an increasing family of topologies arises on X, starting at the zero level with the trivially non-separable minimal topology that glues all points together and ending with the trivially separable maximal topology of all subsets.
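Assuming the convention that a point x is α-limit to A when P(A, x) ≥ α (the paper's exact formulas are not reproduced here), the closure operator and the notion of α-perfection mentioned above can be written as follows:

```latex
% Assumed convention: x is \alpha-limit to A  <=>  P(A, x) \ge \alpha.
\overline{A}^{\,\alpha} = A \cup \{\, x \in X : P(A,x) \ge \alpha \,\},
\qquad
A \ \text{is } \alpha\text{-perfect} \iff A = \{\, x \in X : P(A,x) \ge \alpha \,\}.
```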
Each density on the space X gives its own view of it and its own program for its study. It is a concept that does not belong to classical mathematics.
In addition to the density S (Section 2.2.2), nontrivial densities always exist in a finite metric space. A description of the most important densities and their significance for data analysis is given in [5,11].
The study of the space X with the help of the density P defined on it began with the study of perfect sets in the corresponding topology. The explanation is as follows: setting P on X defines a clustering in X: K is a cluster (P-cluster) in X if it consists of exactly all points that are P-limit to it. If α is the limit threshold, then the formal expression of what has been said coincides with α-perfection.
The DPS-algorithm builds all such clusters, but there are too many of them for meaningful cluster analysis in X: first of all, we need the ‘‘connected components’’ of the maximal α-perfect subset in X.
All this exists when X is an FMS and the density P is local on it. It is in this situation that the DPS-algorithm operates. It depends on the construction of the density P, its level α, and the localization radius r. DPS operates in two stages: first, it cuts out a subset of X (first stage) and then splits it into r-connected components, which it considers to be the DPS-clusters in X (second stage).
Despite generally good results and efficient applications [4,5,11,12,13], the DPS-algorithm in its current version has drawbacks: at the first stage, the cutting is not always complete and of high quality; at the second stage, the partitioning of the result into r-connectivity components is too fine and detailed due to the locality of the radius r. A proper partition must come from a global view of the result induced by the entire space X.
In the present paper, we have attempted to correct or mitigate these drawbacks: our answer at the first stage is the iterative cutting scheme, and at the second stage it is the addition of a third stage that connects the components of the r-connectivity.
The iterations allowed the DPS-algorithm at the first stage to achieve the completeness of its result, while maintaining a high level of extremality of cutting.
At the moment, the result of the second stage of the DPS-algorithm is considered its final result. The additional connection of components proposed in the paper occurs in two steps. The first of them has the character of a necessary condition: d-clusters are searched for among the components. According to the results of the article, their search is constructive. d-clustering is a necessary condition for connecting components into a single whole. The set of d-clusters serves as the basis for the second step, which has the character of a sufficient condition. The main thing here is the expert: his criteria form the final choice of d-clusters and, at the same time, the final partition. There are various formal variants of this choice in DMA; the simplest of them is given in Section 4.3.1.
DPS-series algorithms are actively applied in many geological and geophysical studies (analysis of seismic catalogs, search for signals in geophysical records, the problem of radioactive waste disposal, etc.) [1,4,5,7,11,12,13]. It seems that the new version of DPS developed in the present paper will make it possible to improve these applications.
The addition of the third stage to DPS, according to the authors, makes its architecture sufficient. Further development of DPS-algorithms should take place through their parameters: new constructions of densities and their connection with fuzzy logic will give (and already give [5]) the possibility of a deep study of finite metric spaces. Furthermore, the following circumstance is fundamental: in contrast to Euclidean space, the FMS X is, as a rule, arranged differently in the vicinity of each of its points x; therefore, the parameters α and r of the DPS-algorithm, which have a local nature, should depend on x. The iterative carving scheme only softens this issue.