1. Introduction
Technological advances have allowed the collection of large amounts of data. Published forecasts indicate a significant growth in the amount of data over the coming years; for instance, the Internet traffic in 2020 was circa 95 times the volume of 2005. A detailed analysis of these data can extract valuable information from them. In fact, large datasets have been widely used to identify trends and to recommend products or services, providing a customized experience that matches the interests of users [1]. These applications are made feasible through classification algorithms.
Classification algorithms organize objects from datasets to provide a meaningful understanding, that is, learning, about key factors of the data. If the data are (at least partially) labeled, organizing them is referred to as supervised learning. On the other hand, unsupervised learning is defined when no prior information is available. Unsupervised learning, the most challenging approach, is commonly addressed by clustering algorithms. Data clustering (or cluster analysis) is the task of grouping objects into clusters (groups) based on their similarity. Intuitively, objects within the same group share more common features among themselves than with objects from other groups. Clustering has a wide range of applications, e.g., biology, physics, economics, chemistry, etc. [2,3,4,5,6]. Organizing objects from a dataset into clusters requires identifying the fundamental features that make them similar to, or different from, each other. For instance, cars can be grouped based on their power, fuel consumption, price, size, etc. Each feature receives a numerical value, which is stored in a vector that represents a particular object. Objects can then be compared using distance metrics on their vectors, meaning that close objects are similar to each other.
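As a small illustration of comparing objects through their feature vectors, the following C++ sketch computes the Euclidean distance between two such vectors; the car features used here are invented example values, not data from the paper.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Euclidean distance between two feature vectors of equal length.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

int main() {
    // Two cars described by (power [hp], fuel consumption [l/100 km], price [kEUR]).
    // In practice the features would be normalized before comparison.
    std::vector<double> car1 = {110.0, 6.5, 21.0};
    std::vector<double> car2 = {150.0, 8.0, 35.0};
    std::cout << "distance = " << euclidean(car1, car2) << "\n";  // smaller means more similar
    return 0;
}
```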
Whenever classical mathematical logic cannot model a system with the sought accuracy, we can use fuzzy sets. Fuzzy sets and fuzzy logic are a form of infinite-valued logic that can be used to solve a large number of computing problems, including clustering and image analysis. The birth of fuzzy logic is traced back to Zadeh in 1965 [7]. His main idea was to introduce uncertainty in logic states. In fact, physical systems cannot be described solely by states such as true or false. This directly implies the concept of membership, where categories (e.g., hot, warm and cold) are defined, and a system status can have different degrees of membership in these categories (e.g., warm, extremely cold, etc.). Membership is usually expressed by smooth or continuous functions, mostly with regular shapes (e.g., triangular or trapezoidal). The application of fuzzy logic is based on two main concepts: fuzzification and granulation. Fuzzification is the act of replacing a set with sharply defined boundaries by one with fuzzy boundaries, whereas granulation means partitioning an object into a collection of granules of related points.
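As an illustration of such membership functions, the following C++ sketch evaluates a triangular membership function; the breakpoints and the "warm" temperature range are invented example values.

```cpp
#include <iostream>

// Triangular membership function: rises linearly from a to b, falls from b to c.
// Returns a degree of membership in [0, 1].
double triangular_membership(double x, double a, double b, double c) {
    if (x <= a || x >= c) return 0.0;
    if (x <= b) return (x - a) / (b - a);
    return (c - x) / (c - b);
}

int main() {
    // Example: degree to which 24 degrees Celsius belongs to the fuzzy set "warm",
    // assuming "warm" spans 15..35 with its peak at 25 (illustrative values only).
    std::cout << triangular_membership(24.0, 15.0, 25.0, 35.0) << "\n";  // prints 0.9
    return 0;
}
```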
In science and engineering, several processes can be described much better with such vague definitions. Fuzzy models are widely used in traffic control [8], pattern recognition [9], decision planning [10], etc. Classic clustering methods assign each object to a single cluster. This approach leads to difficulties when clusters are not well delimited (e.g., in image processing or pattern recognition) and overlapping clusters are present. In fuzzy clustering algorithms, both problems can be solved by introducing a membership function that allows objects to be assigned to multiple clusters.
Clustering has been known for decades as a computationally difficult (NP-hard) problem [11]. In general, clustering algorithms are based on the minimization of an objective function that measures similarities among the attribute vectors representing objects. Moreover, data volumes are growing fast, significantly impacting the performance of these algorithms. In this scenario, serial executions are not feasible, and parallelization is mandatory. These issues are often addressed using algorithms based on K-Means [12,13], the most popular hard-partition algorithm. Many parallel versions have been proposed, the initial solutions being based on the message-passing model. Quite recently, new solutions have been based on other techniques and platforms, e.g., OpenMP, GPUs, and Intel Xeon Phi [14]. Likewise, fuzzy clustering algorithms also demand parallel solutions to explore large datasets. The most widely adopted fuzzy clustering algorithm has an extensive literature on its parallelization [15,16]. However, these proposals do not change the requirements of the algorithm, i.e., prior knowledge of the number of groups in the dataset. This heavily impacts performance as the problem size grows, since several executions are needed to reach the sought solution.
In order to overcome these issues, an algorithm called fuzzy minimals (FM) was proposed in [17]. This algorithm uses fuzzy set theory to find clusters without any prior knowledge about the number of groups or the size or format of the dataset. Furthermore, the authors proved that this solution meets all the requirements that a good classification algorithm should have (e.g., scalability, adaptability, stability, etc.). Besides the different approaches to clustering that have been proposed, the technological advances in microelectronics have led to improvements in graphics accelerators. These improvements became, in a relatively short period of time, the driving force behind thousands of heavy computing problems. Some of these applications, based on fuzzy sets, provide solutions for image segmentation to help medical doctors diagnose illnesses, simulate wildfires, model geological phenomena, etc. Other applications have explored GPU capabilities to speed up their solutions by up to a few hundred times with respect to their sequential code [18,19].
In this paper, we introduce two novel parallel implementations of FM, which address frequent challenges in clustering algorithms [20,21,22,23]. More precisely, our study is focused on the parallel implementation on GPU, exploiting its massively parallel architecture to deliver high performance at a low cost. We compare this solution both with the traditional sequential version and with a new parallel version using MPI. Numerical simulations showed that our solution on GPU provides a significant speedup over the sequential version. In fact, the GPU solution gives results at least as fast as the expensive computing environment needed to run the MPI version.
The remainder of this paper is organized as follows.
Section 2 presents some preliminaries on the FM algorithm and GPU.
Section 3 describes our parallel FM solutions on GPU and MPI.
Section 4 shows the numerical results with a synthetically generated dataset and two real-world benchmarks. Finally,
Section 5 outlines the main results of this paper, open problems and possible future developments.
2. Preliminaries
This section contains some general remarks on the FM algorithm and GPU technology, so that the next sections are as self-contained as possible and the individual topics are easy to access.
2.1. Fuzzy minimals algorithm
FM is a data clustering algorithm that avoids the need for prior knowledge about the groups in the data. Proposed by Flores-Sintas, Cadenas and Martin [17], it combines metrics extracted from the Euclidean distance matrix and the inverse covariance matrix, making it possible to deduce the membership function to a group independently of the membership probability to other groups [17,24].
The first step of the algorithm consists of finding a metric related to the covariance matrix. This metric provides information about the homogeneity and isotropy of the feature space. We denote by E the operator of statistical mean. Thus, the elements of the covariance matrix are defined by Equation (1), where M is the vector of the average samples. We note that (1) and the Mahalanobis distance for all the feature vectors in the sample allow us to determine a diagonalized covariance matrix, whose general element is given by (2), where X is a feature vector; of course, this follows from the tensorial form of (1). The covariance matrix defined by (2) relates its entries to the diagonal elements of the covariant metric tensor.
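As a reference only, assuming Equation (1) follows the usual definition of a covariance element in terms of the statistical mean E and the average vector M described above, it would read:

```latex
% Assumed standard form of the covariance-matrix element (Equation (1)):
c_{ij} = E\!\left[(x_i - M_i)\,(x_j - M_j)\right]
```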
Now, we can define a metric for each point of the universe of discourse, which is associated with a hypersurface of a Riemannian manifold. The local properties of the hypersurface, and therefore the distance relationships between sample points, are given by the Christoffel symbols.
We can now determine the algorithm's objective function as well as the membership function. First, a sample of n data points, each displaying F features, is given by X = {x_1, ..., x_n}. Accordingly, the FM algorithm minimizes an objective function that combines the Euclidean distance between each point x and a prototype v with the membership of x to the prototype v. The membership function depends on the factor r, which measures the isotropy of the dataset; thus, r indicates the disruptions in the data that should result in new clusters. Moreover, we recall that the factor r is computed from the covariance matrix C, the mean m of X and the Euclidean norm between x and m.
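Purely as a hedged sketch, the membership and objective functions are often reported in the FM literature in a form similar to the following; the exact expressions used here may differ:

```latex
% Assumed forms, following common presentations of FM (not the original typeset equations):
\mu_{xv} = \frac{1}{1 + r^{2}\,\lVert x - v \rVert^{2}},
\qquad
J(v) = \sum_{x \in X} \mu_{xv}^{2}\,\lVert x - v \rVert^{2}
```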
The FM algorithm is an iterative process that works on cluster representatives, called prototypes, which correspond to the minima of the objective function. Algorithm 1 describes the implementation of the FM algorithm. More precisely, we assume two standard parameters, indicating the committed error degree and the minimum difference between two potential prototypes, respectively. The minima of the objective function are the prototypes of each identified cluster. We note that there is no need to specify the number of clusters, since the algorithm is based on the computation of r.
Algorithm 1 Fuzzy Minimals.
Input: dataset X, factor r and N objects. Output: prototypes found.
procedure FuzzyMinimals(X, r, N)
  for k = 1 until N do
    initialize the prototype candidate with object k; t = 0; set the change to 1
    while the change is greater than the error threshold do
      t = t + 1
      update the prototype candidate, using the membership function
      recompute the change between the current and previous candidates
    end while
    if the candidate is farther than the minimum difference from every prototype in V then
      add the candidate to V
    end if
  end for
  return V
end procedure
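To make this iteration concrete, the C++ sketch below implements one plausible reading of the FM prototype search. The membership form 1/(1 + r^2 d^2) and the membership-weighted mean update are assumptions based on common presentations of FM, not necessarily the authors' exact update rule.

```cpp
#include <cmath>
#include <vector>

using Point = std::vector<double>;

// Squared Euclidean distance between two feature vectors.
static double sq_dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Sketch of the FM prototype search (assumed update rule, for illustration only).
// eps: error threshold for convergence; delta: minimum distance between prototypes.
std::vector<Point> fuzzy_minimals(const std::vector<Point>& X, double r,
                                  double eps, double delta) {
    std::vector<Point> V;  // prototypes found
    for (const Point& start : X) {
        Point v = start;
        double change = 1.0;
        while (change > eps) {
            // Membership-weighted mean of all points (assumed Picard-style update).
            Point next(v.size(), 0.0);
            double wsum = 0.0;
            for (const Point& x : X) {
                double mu = 1.0 / (1.0 + r * r * sq_dist(x, v));  // assumed membership
                double w = mu * mu;
                for (std::size_t i = 0; i < v.size(); ++i) next[i] += w * x[i];
                wsum += w;
            }
            for (std::size_t i = 0; i < v.size(); ++i) next[i] /= wsum;
            change = std::sqrt(sq_dist(next, v));
            v = next;
        }
        // Keep the candidate only if it is far enough from every known prototype.
        bool isNew = true;
        for (const Point& p : V)
            if (std::sqrt(sq_dist(p, v)) < delta) { isNew = false; break; }
        if (isNew) V.push_back(v);
    }
    return V;
}
```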
2.2. Graphics Processing Unit
GPUs have evolved from simple configurable graphics processors to high-performance programmable units that can process complex graphical operations and, more generally, any massively parallel computation.
The current performance and low cost of GPUs have led researchers and scientists to propose solutions based on this platform. Particularly, applications that demand high computational resources (e.g., clustering, pattern recognition, differential equation solvers, etc.) are strong candidates for parallelization on GPUs. In general, a GPU is composed of a set of streaming multiprocessors (SMs), each of which has a significant number of computing cores (e.g., the Nvidia Titan X).
There are several memory levels available on GPUs (i.e., global memory, local memory, shared memory and registers). In particular, each SM provides two high-performance on-chip memories (registers and shared memory). Registers are privately used by threads, but the number of registers per thread is limited. On the other hand, a shared memory address can be accessed by all threads within the same block, allowing their synchronization. However, the maximum amount of shared memory per SM is small.
The processing in the cores is based on threads, which are grouped into blocks, while blocks are grouped into a grid. Only threads within the same block can communicate with each other via shared memory or synchronize their execution with barriers, because each block resides on its own SM. Threads from different blocks can perform these operations via global memory, but that is not a recommended solution. The introduction of Nvidia's CUDA (Compute Unified Device Architecture) on their enabled GPUs allowed developers to design general-purpose applications. A typical program using CUDA has a basic structure. Initially, memory is allocated on the GPU. Then, data are copied from the host (CPU) to the device (GPU). Next, a kernel (a function that runs on the device) is called, passing the number of blocks and threads per block. When the kernel completes, data can be copied back from the device to the host. Finally, the memory allocated on the device is released.
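The following CUDA C++ sketch illustrates this typical host-side structure; the kernel body and the sizes are placeholders, not the paper's implementation.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel: scales every element (illustrative only).
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d), n * sizeof(float));      // 1) allocate device memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);      // 2) copy host -> device

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, n, 2.0f);                           // 3) launch the kernel
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);      // 4) copy device -> host
    cudaFree(d);                                                      // 5) release device memory
    printf("h[0] = %f\n", h[0]);
    delete[] h;
    return 0;
}
```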
3. Parallel FM
In this section, we show the methodology used for the parallelization of FM using GPU and MPI.
3.1. Parallel FM on GPU
Our parallel solution on GPU can be divided into four high-level phases. The first phase (determination of the factor R) is the preprocessing phase: we determine the value of the factor R, i.e., the global parameter that measures the isotropy of the whole dataset. It is worth noticing that R is invariant with respect to the kind of execution. In the second phase (data-partitioning), the dataset is divided into multiple partitions, where each partition holds at least 10% of the original dataset. Moreover, data are distributed across all partitions following a uniform random distribution. In the third phase (parallel FM), each block of threads processes a single data-partition, computing the prototypes and their initial clusters. This is the only phase executed on GPU. We note that there are hundreds of threads processing the same partition. Finally, in the last phase (hierarchical clustering), we deal with the fact that threads from different blocks can find similar prototypes, due to the data distribution across partitions. In order to eliminate this side effect, similar prototypes are grouped by Johnson's hierarchical clustering algorithm.
Hierarchical clustering does not provide a single final answer, but rather a tree-like structure called a dendrogram. This flexible structure allows choosing clusters based on the level of the dendrogram. Since prototypes belonging to the same cluster are close to each other, the level that should be used to produce the final results is readily visible.
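As an illustration of this final phase, the following C++ sketch performs a simple agglomeration of prototypes, merging prototypes closer than a cut-off distance; this is a generic sketch, not the exact hierarchical clustering implementation used here.

```cpp
#include <cmath>
#include <vector>

using Point = std::vector<double>;

static double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Merge prototypes that are closer than `cut` (the chosen dendrogram level),
// replacing each merged pair by its midpoint. Illustrative sketch only.
std::vector<Point> merge_prototypes(std::vector<Point> protos, double cut) {
    bool merged = true;
    while (merged) {
        merged = false;
        for (std::size_t i = 0; i < protos.size() && !merged; ++i)
            for (std::size_t j = i + 1; j < protos.size() && !merged; ++j)
                if (dist(protos[i], protos[j]) < cut) {
                    for (std::size_t k = 0; k < protos[i].size(); ++k)
                        protos[i][k] = 0.5 * (protos[i][k] + protos[j][k]);
                    protos.erase(protos.begin() + j);
                    merged = true;
                }
    }
    return protos;
}
```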
Algorithm 2 shows the main program of our implementation. The program begins with the declaration of the number of blocks and threads per block that will be used. Next, the program reads all objects, storing them in a local variable X. This vector is organized into b logical partitions, and data are distributed uniformly across them (to identify cluster prototypes). The device does not have direct access to the host memory and vice versa; therefore, we need to allocate a global memory space on the device and copy the data explicitly. Once all the data are ready, the kernel PFMGPU is called, using the triple angle brackets syntax to indicate the number of blocks and threads per block. The parameters of this function are the factor R, a pointer to the vector storing the input data in global memory, and a pointer to the vector that will store the identified prototypes. At the end of the kernel, the prototypes found are copied back to the host memory. Finally, the algorithm performs a hierarchical clustering of the prototypes, showing the results and releasing the allocated memory.
Algorithm 2 PFMGPU's main program.
procedure Main
  b = number of blocks
  t = number of threads per block
  X = objects
  r = factor R
  Distribute data uniformly in b partitions
  Xd = pointer to global memory region storing input data
  Vd = pointer to global memory region storing output data
  Copy input from host memory stored in X to global memory in Xd
  PFMGPU<<<b, t>>>(r, Xd, Vd)
  Copy output from global memory stored in Vd to host memory in V
  Hierarchical clustering of V
  Show C prototypes
  Free device memory
end procedure
Clearly, our method parallelizes threads on the GPU. As shown in Figure 1, the complete dataset X is split into partitions whose boundaries are based on the b blocks, and each partition is processed by the t threads of the same multiprocessor. Threads in the same block have unique identifiers, thus we assign them to the corresponding partition locations. Next, a parallel execution allows us to identify prototype candidates. The result is stored in a shared array, which is visible to all threads of the same block. When the iteration completes for a location, each thread skips ahead by a number of positions equal to the number of threads, to avoid reprocessing.
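A minimal sketch of this index-and-stride scheme inside a kernel is shown below; the kernel name and parameters are illustrative, not the authors' code.

```cpp
// Inside a CUDA kernel: each thread walks its block's partition with a stride
// equal to the number of threads per block (illustrative sketch).
__global__ void process_partition(const float* X, int partitionSize) {
    int begin = blockIdx.x * partitionSize;   // first object of this block's partition
    int end = begin + partitionSize;          // one past the last object of the partition
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x) {
        float candidate = X[i];   // take the object as the initial prototype candidate
        (void)candidate;          // ... the FM refinement of the candidate would go here ...
    }
}
```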
Algorithm 3 shows the kernel that computes the prototypes. We note that all objects are passed as parameters, so each block identifies its own partition. Moreover, each thread has to determine its start and end positions; accordingly, there is no overlapping. These operations can be performed by means of CUDA's built-in variables, i.e., threadIdx and gridDim, which are the thread identifier in the block and the number of blocks in the grid, respectively. We declare the shared variables on line 6, hence the threads within the same block can exchange data, avoiding global memory usage. Furthermore, the prototypes found in each data-partition are stored in a shared array, together with a shared variable that keeps the prototype candidate of each thread and a shared counter that contains the total number of prototypes found in the partition. All local variables are declared on lines 8 and 9. We use an inner loop, between lines 10 and 17, to find prototype candidates. The prototype candidate found by each thread is stored in the shared array on line 18. A barrier is then created using __syncthreads. The loop between lines 21 and 26 verifies that the prototype candidate is not close to any previously added prototype. Before the kernel completes, the prototypes found are copied from shared memory to global memory; therefore, the host can reference the prototypes stored in V.
Algorithm 3 FM kernel on GPUs.
Input: factor r, dataset X and V for prototypes. Output: all kernels return void.
procedure PFMGPU(r, X, V)
  compute the initial index of the partition
  compute the final index of the partition
  for each object of the partition assigned to this thread do
    initialize the prototype candidate with the object; t = 0; set the change to 1
    while the change is greater than the error threshold do
      t = t + 1
      update the prototype candidate, using the membership function
    end while
    store the candidate in the shared array and synchronize the block
    if the thread identifier is 0 then
      for each candidate stored in the shared array do
        if it is farther than the minimum difference from every accepted prototype then
          accept it as a prototype of the partition
        end if
      end for
    end if
  end for
  for each accepted prototype assigned to this thread do
    copy it from shared memory to the global memory array V
  end for
end procedure
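The following CUDA C++ sketch reflects the overall structure described above (per-block partition, shared-memory candidate collection, a barrier, serial de-duplication by one thread, and a final copy to global memory). The FM refinement itself is abstracted into a stub, and all sizes and names are illustrative assumptions rather than the authors' code.

```cpp
#include <cuda_runtime.h>

#define MAX_THREADS 256  // assumed upper bound on threads per block (blockDim.x <= MAX_THREADS)
#define MAX_PROTOS  64   // illustrative upper bound on prototypes per partition
#define F           2    // illustrative number of features

// Stub standing in for the FM refinement of a single candidate (illustration only).
__device__ void fm_refine(const float* X, int begin, int end, float r, float* cand) {
    (void)X; (void)begin; (void)end; (void)r; (void)cand;
}

__global__ void pfmgpu(const float* X, float* V, int* Vcount,
                       int partitionSize, float r, float minDiff) {
    __shared__ float cand[MAX_THREADS][F];   // one candidate slot per thread
    __shared__ bool  valid[MAX_THREADS];
    __shared__ float protos[MAX_PROTOS][F];  // prototypes accepted in this partition
    __shared__ int   nProtos;
    if (threadIdx.x == 0) nProtos = 0;

    int begin = blockIdx.x * partitionSize;
    int rounds = (partitionSize + blockDim.x - 1) / blockDim.x;

    for (int rnd = 0; rnd < rounds; ++rnd) {
        int i = begin + rnd * blockDim.x + threadIdx.x;
        valid[threadIdx.x] = (i < begin + partitionSize);
        if (valid[threadIdx.x]) {
            for (int f = 0; f < F; ++f) cand[threadIdx.x][f] = X[i * F + f];
            fm_refine(X, begin, begin + partitionSize, r, cand[threadIdx.x]);
        }
        __syncthreads();                     // all candidates of this round are ready
        if (threadIdx.x == 0) {
            // Thread 0 serially keeps only candidates far from already accepted prototypes.
            for (unsigned tIdx = 0; tIdx < blockDim.x; ++tIdx) {
                if (!valid[tIdx]) continue;
                bool isNew = true;
                for (int p = 0; p < nProtos && isNew; ++p) {
                    float d2 = 0.0f;
                    for (int f = 0; f < F; ++f) {
                        float diff = cand[tIdx][f] - protos[p][f];
                        d2 += diff * diff;
                    }
                    if (d2 < minDiff * minDiff) isNew = false;
                }
                if (isNew && nProtos < MAX_PROTOS) {
                    for (int f = 0; f < F; ++f) protos[nProtos][f] = cand[tIdx][f];
                    ++nProtos;
                }
            }
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {                  // copy this partition's prototypes to global memory
        Vcount[blockIdx.x] = nProtos;
        for (int p = 0; p < nProtos; ++p)
            for (int f = 0; f < F; ++f)
                V[(blockIdx.x * MAX_PROTOS + p) * F + f] = protos[p][f];
    }
}
```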
3.2. Parallel FM on MPI
Our parallel implementation using MPI is similar in spirit to the one based on GPU. Thus, we have again the four phases described for the parallel FM on GPU (i.e., determination of the factor R, data-partitioning, parallel FM and hierarchical clustering). However, the final implementation using MPI is more straightforward than the GPU version. Indeed, the changes needed to map the sequential version onto the parallel one are minimal.
Algorithm 4 shows the main program of our MPI implementation. First, we define the number of processes (i.e., the number of data-partitions) and the rank of the current process. Next, a coordinator process reads the objects and computes the factor R, distributing the data across p logical partitions. Moreover, the coordinator process sends the partitions to the peer processes. All processes then run the FM function of Algorithm 1. Finally, the peers complete their execution by sending their results to the coordinator process, which can then perform the hierarchical clustering of the prototypes.
Algorithm 4 Main program in PFMMPI.
procedure Main
  p = number of processes
  id = process identifier
  if id is the coordinator then
    X = objects
    r = factor R
    Distribute objects uniformly in p partitions
    for each peer process do
      Send(r and its partition to the peer)
    end for
  else
    Receive(r and the local partition into X)
  end if
  V = FuzzyMinimals(X, r, ...)
  if id is not the coordinator then
    Send(V to the coordinator)
  else
    for each peer process do
      V = V + Receive(prototypes from the peer)
    end for
  end if
  if id is the coordinator then
    Hierarchical clustering of V
    Show C prototypes
  end if
end procedure
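A hedged MPI sketch of this coordinator/peer structure is shown below. The FM routine is replaced by a trivial stub, and the flat send/receive of raw doubles, the message tags and the feature count are illustrative simplifications rather than the authors' code.

```cpp
#include <mpi.h>
#include <vector>

// Stub standing in for the FM routine of Algorithm 1 (illustration only):
// returns the mean of the partition as a single "prototype" of F values.
std::vector<double> fuzzy_minimals(const std::vector<double>& X, double /*r*/, int F) {
    std::vector<double> proto(F, 0.0);
    int n = static_cast<int>(X.size()) / F;
    for (int i = 0; i < n; ++i)
        for (int f = 0; f < F; ++f) proto[f] += X[i * F + f] / n;
    return proto;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    const int F = 2;                 // illustrative number of features
    double r = 0.0;
    std::vector<double> local;       // this process's partition (flat: F values per object)

    if (rank == 0) {
        std::vector<std::vector<double>> parts(procs);
        // ... read the dataset, compute the factor r and fill parts[] uniformly at random ...
        for (int p = 1; p < procs; ++p) {
            int n = static_cast<int>(parts[p].size());
            MPI_Send(&r, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
            MPI_Send(&n, 1, MPI_INT, p, 1, MPI_COMM_WORLD);
            MPI_Send(parts[p].data(), n, MPI_DOUBLE, p, 2, MPI_COMM_WORLD);
        }
        local = parts[0];
    } else {
        int n = 0;
        MPI_Recv(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&n, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local.resize(n);
        MPI_Recv(local.data(), n, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    std::vector<double> V = fuzzy_minimals(local, r, F);   // every process runs FM locally

    if (rank != 0) {
        int n = static_cast<int>(V.size());
        MPI_Send(&n, 1, MPI_INT, 0, 3, MPI_COMM_WORLD);
        MPI_Send(V.data(), n, MPI_DOUBLE, 0, 4, MPI_COMM_WORLD);
    } else {
        for (int p = 1; p < procs; ++p) {
            int n = 0;
            MPI_Recv(&n, 1, MPI_INT, p, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::vector<double> buf(n);
            MPI_Recv(buf.data(), n, MPI_DOUBLE, p, 4, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            V.insert(V.end(), buf.begin(), buf.end());
        }
        // ... hierarchical clustering of the gathered prototypes in V ...
    }
    MPI_Finalize();
    return 0;
}
```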
4. Numerical simulations and results
In this section, we analyze the performance of the parallel FM algorithm. We do not need to deal with the correctness and quality of the FM algorithm, which were already discussed in [24,25]. The results of this section involve a synthetically generated dataset and two real-world benchmarks.
Most popular fuzzy clustering algorithms rely on input parameters to find clusters. Thus, these algorithms require several executions to achieve the same results as the FM algorithm. The comparison between the accuracy of the FM algorithm and other similar techniques has been extensively studied in recent years. Since the number of executions required by other solutions depends on input parameters provided by the user, we cannot perform a fair runtime comparison between the FM algorithm and other similar techniques. Accordingly, our benchmarks are based on different implementations of the FM algorithm.
All results for PFMGPU were collected on a single machine, using from one to eight blocks with 960 threads per block. In each run, the number of blocks matches the number of data-partitions. On the other hand, PFMMPI was executed on a cluster, using from one to eight nodes with one process per machine. Moreover, the number of processes matches the number of data-partitions. All machines (i.e., the single machine and the cluster machines) have the same configuration: an Intel Core i7 CPU at 3.4 GHz, 16 GB of RAM and an Nvidia GeForce GTS 450 GPU with 4 SMs. Our goal is to compare a GPU-based implementation, relying on the massive parallelism of GPU cores, with a message-passing implementation running on individually more powerful CPUs.
4.1. Ellipses dataset
The first dataset consists of three synthetically generated ellipses with a total of 10 thousand points, as shown in
Figure 2. We benchmarked four different test cases for both PFMGPU and PFMMPI. Moreover, the dataset is divided according to the number of blocks (PFMGPU) or processes (PFMMPI). We note that executions with the same number of partitions behave equivalently: sequential and parallel executions with a single partition find the same prototypes and, for the same partitions, multi-partition executions of PFMGPU and PFMMPI should also return the same output (clusters). Nevertheless, it turns out that different numbers of partitions result in different prototypes. These differences vanish with the hierarchical clustering performed for multi-partition executions.
Figure 3 shows a dendrogram generated from the prototypes found when the dataset is divided into four partitions. The coordinates of each cluster prototype are represented on the x-axis. The same dendrogram is generated by both PFMGPU and PFMMPI with four partitions; in fact, the output in the two cases is the same. We note that three main clusters are identified at the dashed level. After hierarchically clustering the prototypes, all parallel executions found three clusters, as expected.
Table 1 shows all prototypes found by each execution. We see that PFMGPU and PFMMPI return the same prototypes under the same number of partitions.
Table 2 shows the execution times collected for the ellipses benchmark. Running on a single partition, PFMGPU is about three times faster than both the sequential version and PFMMPI. However, the highest performance is achieved by PFMGPU with eight partitions. We see that PFMMPI also achieves its best performance with eight partitions, but its runtimes are higher than those of PFMGPU. This depends on the method of work division adopted. We have also computed the MPI communication times using the MPI_Wtime function, as shown in Table 2, since communication overhead can influence the total runtimes. We note that the time spent on communication by PFMMPI takes place during the transfer of objects to/from the peer nodes. Nevertheless, its impact on the final result is minimal due to the dataset file size. Moreover, the communication time is proportional to the number of processes, despite the decreasing amount of data transferred to each host. This is due to the overhead of the connection functions, which is independent of the amount of data transferred. The performance of PFMMPI with a single partition is equivalent to the sequential version: in fact, both versions run on a single machine and the overhead due to the MPI libraries is negligible.
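For reference, communication time can be measured with MPI_Wtime around the transfer calls, e.g. as in the hedged snippet below (not the exact instrumentation used here).

```cpp
#include <mpi.h>
#include <vector>

// Send a partition to `peer` and return the seconds spent in the transfer
// (illustrative instrumentation with MPI_Wtime).
double timed_send(const std::vector<double>& partition, int peer) {
    double t0 = MPI_Wtime();
    MPI_Send(partition.data(), static_cast<int>(partition.size()),
             MPI_DOUBLE, peer, /*tag=*/2, MPI_COMM_WORLD);
    return MPI_Wtime() - t0;
}
```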
4.2. Well-defined schools dataset
Figure 4 shows our second benchmark, which consists of 20,237 geographical coordinates of schools extracted from the U.S. Board on Geographic Names [26]. This dataset contains five different clusters, and it roughly doubles the number of objects of the first benchmark.
Figure 5a–c shows the dendrograms for the hierarchical clustering of prototypes generated with two, four and eight partitions, respectively. The coordinates of each cluster prototype are represented on the x-axis. Obviously, PFMGPU and PFMMPI executions give the same dendrograms. However, the dendrograms are slightly different from each other due to the data distribution across partitions; thus, the prototypes found in each partition diverge to a certain degree. Moreover, we see at once that all executions found five prototypes at the dashed level.
All prototypes found by the executions with one, two, four and eight partitions are shown in Figure 6a–d, respectively. We note that the prototypes found by different executions appear to be the same. However, a more detailed analysis shows that the prototypes are slightly different. Accordingly, the minimal variation between executions with different numbers of partitions indicates a high degree of precision.
In Table 3, the runtimes for this benchmark are shown. We see at once that lower execution times were achieved despite the larger dataset. Thus, the performance does not depend uniquely on the number of objects, but also on other factors (density, number of clusters, distances, etc.). Despite running on a GPU with only four SMs, PFMGPU showed lower execution times. More precisely, the eight-partition execution of PFMGPU was faster than PFMMPI running on eight nodes. The time spent on communication by PFMMPI increased with respect to the previous experiment; nevertheless, it has a minimal impact on the performance.
4.3. Complete schools dataset
Our final simulation is also based on data from the U.S. Board on Geographic Names [26]. More precisely, we analyzed 40,646 school locations across the USA (see Figure 7). We note that these data have no well-defined boundaries among clusters.
Figure 8a–c shows the dendrograms found for the parallel executions with two, four and eight partitions, respectively. The coordinates of each cluster prototype are represented on the x-axis. We see at once that all executions found five different clusters at the dashed level. The prototypes found are located in high-density areas; hence, the prototypes do not have to be at the center of the cluster, as in K-Means-based algorithms. The variation between dendrograms did not influence the precision of the final result. In Figure 9a–d, the prototypes found in the tests with one, two, four and eight partitions are shown, respectively. From a clustering standpoint, the differences between prototypes are minimal due to three main factors (the high volume of objects, the data distribution across partitions and the accuracy of our FM parallelization).
As in Section 4.2, the results of our simulation are shown in Table 4. As in the previous benchmarks, the parallel executions achieved their highest performance with eight partitions (circa 60 times faster than the serial execution). From one to four partitions, PFMGPU was consistently faster than PFMMPI, but the execution with eight partitions gave similar runtimes for both implementations. This depends on the fact that our GPU has only four SMs; thus, the performance decreases whenever we exceed the available resources. Moreover, the time that PFMMPI spent on communication increased at a lower rate, which balances the latency overhead.
4.4. Timing analysis
A comparison with other clustering algorithms may not produce realistic or significant results. In fact, the number of executions, and thus the total runtime, depends on parameters provided by the user. Accordingly, a practical analysis would involve the parallel implementation of the FM algorithm introduced in [25]. However, the authors did not provide the implementation, making any direct comparison impossible. Consequently, our analysis is constrained to the technology and methods shown in [25]. We can assume that the processor used was an Intel Core i5 clocked at 1.6 GHz. There are several models of this 4th-generation processor. According to the PassMark benchmarks website, the slowest processor from that generation has a rating of 2256, which is about four times slower than the i7-3770 processor used in our MPI implementation. Moreover, the reported speed was about 1.27 ms/object with 8 partitions, which is much slower than the 0.11 ms/object achieved by our MPI implementation (a raw ratio of roughly 11.5). Therefore, even after adjusting for the roughly four-fold difference between the processors used in each simulation, our approach is at least three times faster. Another possible analysis is to assess the efficiency of our parallel implementations in terms of scalability and speedup. These results are shown in Figure 10.
We see at once that both implementations are faster than the standard (single-process) implementation. In particular, we found a speedup that generally quadruples when the number of partitions doubles. This depends on the hardware in use. Indeed, in the PFMGPU implementation each partition runs on a different GPU SM, up to four partitions, and each SM is composed of hundreds of computing cores. On the other hand, in the PFMMPI implementation, although each cluster node executes a single MPI process, the CPU is multicore. Moreover, the major bottleneck with the GPU is the data transfers between the CPU and GPU memories. Nevertheless, the speedup achieved in the ellipses test is higher in the case of the GPU, as shown in Figure 10.
5. Conclusions
The results of this paper are based on the parallel version of the FM algorithm on GPU (PFMGPU). This approach provided high performance at a low computational cost. More precisely, the use of GPUs allowed parallelization on reasonably low-cost hardware, without variations in the accuracy of the FM clustering algorithm.
The benchmarks studied in this paper proved that PFMGPU can solve clustering problems several times faster than a classical CPU. In particular, with eight partitions we achieved a solution at least 50 times faster than the sequential execution. The major drawback of our solution is that each partition must have at least 10% of the data; thus, we have a strict limitation on the number of partitions. Nevertheless, PFMGPU provided better results than a conventional parallelization via MPI (PFMMPI), showing that even low-end GPUs can offer better performance. We propose that further research should be undertaken in two directions. Firstly, it is relevant to study the influence of partitions smaller than 10% on the cluster accuracy. Secondly, we need to investigate ways to improve the parallelization inside a partition.