1. Introduction
Cities have traditionally been recognized as the primary drivers of economic activity and play a crucial role in human society. By evaluating the spatial distribution pattern of human economic and social activities, we can determine that humankind has entered a city-centric era [
1]. The subject of “Urban Problems” has become increasingly prominent due to the rapid expansion of the size and population of major cities around the world [
2]. The fundamental source of “urban problems” is the conflict between limited local resources and the needs of numerous city residents [
3]. The typical phenomenon is traffic congestion [
4]. Faced with the most challenging urban traffic problems, many international metropolitan management organizations hope to promote the coordination of traffic and urban development by formulating and improving traffic development strategies. The effectiveness of these strategies essentially depends on an accurate understanding of urban hotspots distribution [
5]. Urban hotspots (also called human activity centers) generally have higher traffic densities than other city areas [
6].
Urban hotspots are usually the clustering centers of numerous traffic trajectory points that affect population flow, movement patterns and human interaction. The urban road network’s complexity and surrounding space environment make it challenging to obtain urban hotspots. The widespread application of information technologies in the cities, such as GPS navigation, provides high-resolution Spatio-temporal perception [
7]. In particular, GPS navigation services are widely used in vehicles, making it possible to accurately perceive the movement of many vehicles simultaneously in time and space [
8]. The emergence of GPS trajectory data provides a new approach and method to solve the problem of urban traffic optimization [
9]. As ordinary day-to-day GPS trajectory data, taxi track data not only reflects urban traffic conditions but also records people’s daily travel information [
10]. How to mine hidden knowledge from massive trajectory data has become an important research topic [
11,
12].
Many researchers have analyzed the GPS data of vehicle trajectory and drilled the connotative information behind these data, straining to find the spatial-temporal variation models reflecting vehicles distribution [
13]. These models are applied to public transportation optimization analysis [
14], urban population spatial and temporal dynamic distribution analysis [
15], private car navigation real-time optimal path analysis [
16], municipal road planning [
17], urban hotspots searching [
18,
19], etc. Many research results have been applied to the practice of smart city construction, achieving good results in the balance of urban traffic flow and human flow load, and playing a pivotal role in emergency treatment.
These researchers use many machine learning methods based on GPS data sets, such as clustering analysis algorithm, time series algorithm, deep learning algorithm, reinforcement learning algorithm, and so on. Mariescu-Istodor and Fränti [
20] proposed a grid-based method for GPS route analysis for retrieval that achieves efficient route search under noisy and variable sampling rate conditions. Zhang et al. [
21] proposed a hybrid method for incremental extraction of urban road network from GPS data set, which can mine cumulative change patterns of road network. Shafabakhsh et al. [
22] proposed a spatial kernel density calculation method based on GPS data to provide decision support for the allocation of medical emergency resources. Wang et al. [
23] proposed a trajectory clustering method based on adaptive distance and hierarchical clustering based on the optimal cluster number to analyze similarities and anomalies in taxi GPS trajectory. Zhang et al. [
24] proposed combining fuzzy C-means clustering with quantitative spatial regression to comprehensively analyze GPS data and establish the relationship model for improving the traffic service level from the perspective of time and space.
To extract the important information and mine the knowledge concealed in the data, a clustering algorithm is typically employed for preliminary data analysis [
25,
26]. The clustering algorithm is a fundamental and effective unsupervised learning method widely used in many fields. Clustering algorithms divide data set into different clusters according to a fixed standard (distance, similarity, etc.) [
27], making the similarity of objects in the same cluster as ample as possible and the distinction of objects in the different clusters as large as possible [
28]. These unsupervised clustering algorithms can be broadly divided into the following six categories: partition-based clustering, density-based clustering, model-based clustering, hierarchical clustering, grid-based clustering, and graph-based clustering [
29]. Among the above algorithms, the clustering based on partition represented by the K-means algorithm has the advantages of simple structure, high efficiency, easy convergence and muscular universality. As one of the mature clustering algorithms, the K-means algorithm has achieved significant application effects in many fields. But it has disadvantages such as sensitivity to noise and outliers, difficulty in selecting cluster numbers, sensitivity to the initial clustering center, etc.
Numerous scholars have presented many ideas and approaches to locating the ideal cluster centers. The K-means++ algorithm was designed to weaken the K-means algorithm’s sensitivity to the initial clustering centers; the algorithm’s fundamental idea is to maximize the distance between the original clustering centers [
30]. Krishna and Murty [
31] combined the genetic algorithm with the K-means algorithm to obtain the Genetic K-means algorithm (GAK) converged to the global optimum. The K-means algorithm execution is regarded as the K-means operator viewed as the search operator to replace crossover; Simultaneously, a kind of clustering-specific bias mutation operator is defined, called a distance-based mutation. Lu et al. [
32] proposed a new clustering algorithm called Fast Genetic K-means Algorithm (FGAK) inspired by GAK, always converging to the global optimum eventually and running much faster. Islam et al. [
33] advanced a previous genetic-searching approach called GenClust, with the intervention of fast hill-climbing of K-means and obtain an algorithm that is faster than its predecessor. Dowlatshahi and Nezamabadi-Pour [
34] adapt the structure of stochastic population-based Gravitational Search Algorithm for effectively solving the multivariate data clustering problem. Some scholars have proposed that the combination of nearest-better neighborhood and fuzzy PSO is statistically superior to multi-mode optimization, indicating that the hybrid algorithm has advantages [
35].
These improved K-means clustering algorithms achieved good results but still have disadvantages, such as excessive dependence on hyperparameters [
36], over-sensitivity to outliers, and sensitivity to cluster center initialization. Toward alleviating these problems, this paper proposes a novel K-means clustering algorithm based on hybrid “fuzzy system-particle swarm-genetic” to automatically obtain the optimal initial centers, namely the Novel FPSO-GAK clustering algorithm. This algorithm can enhance the processing effect of automatic clustering, avoid too much uncertainty of manual configuration parameters and clustering results falling into local optimum. The initialization operation of the K-means++ was used to obtain a relatively optimal cluster center points group as one initial particle of the Particle Swarm Optimization (PSO). Then, the parameters of the PSO algorithm were optimized by a fuzzy system and the GA algorithm was implemented to search for better clustering centers. Finally, the optimal clustering center points were found after several iterations. Three unsupervised evaluation indexes, the Separation (SP), Silhouette coefficient (SC) and the Sum of Squared Error (SSE), as well as two cluster similarity indexes Centroid index (CI) and Point-level centroid similarity index (CSI), were used to evaluate the clustering results. Five taxi GPS datasets were selected and compared with GAK, PSO based K-Means (PSOK), PSO based constraint K-Means (PSOCK), PSO based Weighted K-Means (PSOWK), PSO-GA based K-Means (PSO-GAK), and K-Means++ algorithms to verify the validity of the proposedalgorithm.
The innovations and main contributions of this paper are as follows:
A novel FPSO-GAK algorithm based on hybrid “fuzzy system-particle swarm-genetic” was designed to obtain excellent non-homogeneous search capability.
The PSO algorithm, fuzzy system algorithm, and genetic algorithm were organically combined to obtain the optimal initial clustering centers.
Three unsupervised evaluation indexes (SP, SC and SSE) and two cluster similarity indexes (CI and CSI) were employed to evaluate and analyze the clustering results.
The comprehensive experiment was designed and implemented to verify the effectiveness of the novel FPSO-GAK clustering algorithm.
3. Experiment Results and Analysis
3.1. Experiment Data Set
3.1.1. Taxis GPS Data Set
For the purpose of verifying the performance of the novel algorithm, five GPS data sets were used as shown in
Table 3. GPS data sets typically contain vehicle ID numbers, longitude values, latitude values, altitude values, timestamps, GpsSpeed, etc., which are periodically recorded by the onboard GLOBAL positioning system (GPS). Currently, public transport authorities in many cities have accumulated taxi GPS data sets for several years. In this study, we focused on GPS data sets from representative metropolitan areas in several countries, i.e., Aracaju (Brazil), Beijing (China), Chongqing (China), Rome (Italy), and San Francisco (USA). Data set Aracaju is a taxi GPS data set from Aracaju city in Brazil, which belongs to the standard data set in UCI Machine Learning Repository. Data set Beijing (China) is a taxis GPS data set from downtown Beijing, China, which is from the well-known Geolife Trajectories project [
51,
52]. Data set Chongqing (China) is a taxis GPS data set from urban area Chongqing, China. Data set Rome (Italy) is a taxi GPS data set from Rome city, which belongs to the data set in Crawdad and contains GPS data of about 320 taxis’ trajectories [
53]. Data set San Francisco (USA) is a taxi GPS data set from San Francisco, which belongs to the data set in Crawdad and contains GPS data of about 500 taxis’ trajectories [
53]. The data sets have been cleaned beforehand; just a subset of the original data sets has been extracted, and a few records between 1:00 and 6:00 p.m. have been eliminated.
3.1.2. Multi-Source Trajectory Data Set
In addition to the taxi vehicle GPS dataset, we validated the proposed algorithms using the multi-source trajectory dataset. The multi-source trajectory dataset is derived from the Routes 2019 subset of the MOPSI dataset (
http://cs.uef.fi/mopsi/data, accessed on 5 July 2022), which contains approximately 3 million trajectory data points [
54]. Trajectories encompass a variety of activities, such as walking, biking, hiking, jogging, driving, taking the bus, and others. Some of these trajectory points record life routines, such as going to work or shopping. The dataset is particularly suitable for migratory pattern mining, person activity identification, and location-based similarity. In the experiments of this paper, we select multi-source trajectory data points within the urban area for analysis, mainly including trajectory point data of vehicles, walking, jogging and biking. The spatial distribution of the multi-source data trajectory points is shown in
Figure 3.
3.2. Experimental Environment and Parameter Setting
The experimental environment based on a personal computer featured the following: Intel I5-8250U, dominant frequency 4 × 1.60 GHz with 8G RAM, Windows 10 operating system, and the algorithm routines were coded in MATLAB 2018b.The taxi GPS data sets contained subsets Aracaju, Beijing, Chongqing, Rome and San Francisco were used as experimental data sets. In each algorithm routine, clustering performance was evaluated using SC, SP, and SSE indicators, which paid particular attention to SC index. The above algorithm routines were trained 20 times independently in the experiment. The clustering numbers to GAK and K-means++ algorithms were initialized to 100. Then the parameters were adjusted respectively within
according to the amount of data sets to acquire the best experimental results. The initial parameter setting of PSO and fuzzy system are shown in
Table 4.
For the PSO algorithm: the inertia weight keeps the particle moving with inertia, giving it the tendency to expand the search space and the ability to explore new regions. The acceleration constants and represent the weights of the statistical acceleration terms that push each particle toward the and positions. A low value allows the particle to hover outside the target region before being pulled back, while a high value causes the particle to make a sudden dash toward or over the target region. The higher the number of populations (NP), the higher the accuracy will be obtained, but at the same time, it will prolong the computing time. The velocity range (VRP) determines the resolution (or accuracy) in the region between the current position and the best position. If velocity maximum is too high, the particle may fly past the good solution, and if velocity minimum is too small, the particle cannot perform sufficient exploration, leading to becoming stuck in local optima.
For the GA algorithm: the adaptive cross coefficient range (ACCR) is too large, it is easy to destroy the existing favorable mode, increase the randomness and easily miss the optimal individual; it is too small to renew the population effectively. The adaptive variation coefficient range (AVCR)is too small and the diversity of the population declines too fast, which easily leads to the rapid loss of effective genes and is not easy to repair; it is is too large, the diversity of the population can be guaranteed, but the probability of higher-order mode destruction also increases.
For the fuzzy system: the fuzzy range of in the PSO (FRW) control the variation range in the defuzzification process. The fuzzy range of in the PSO (FRC) control the variation range in the defuzzification process.
3.3. Experimental Results and Comparison Analysis
3.3.1. Taxis GPS Data Set
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 show the convergence diagrams for the searching optimal initial clustering centers stage of the GAK, PSOK, PSOCK, PSOWK, PSO-GAK, and the proposed algorithms utilizing
Section 3.2 parameters in five distinct GPS data sets.
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 is the convergence diagram for k-means clustering stage of 7 algorithms (the above six algorithms plus K-means++ algorithm).
Table 5,
Table 6 and
Table 7 are the comparison tables for the average Silhouette coefficient (SC), the average degree of separation (SP), and the average SSE value of 20 training sessions respectively (plus Random Swap Clustering algorithm). In order to accurately analyze the above evaluation indexes, each algorithm was iterated 180 times in a single run to guarantee convergence.
Figure 14,
Figure 15,
Figure 16,
Figure 17,
Figure 18 and
Figure 19 show The obtained clustering centers of Aracaju, Beijing, Chongqing, Rome, and San Francisco after the operation of six algorithms. Clustering centers outcome are shown on the Baidu Map background.
Table 5,
Table 6 and
Table 7 demonstrate that the proposed algorithm has superior clustering performance in terms of Silhouette coefficient, degree of separation and SSE, as well as the ability to capture better clustering results and locate more effective urban hotspots. However, in
Table 7’s San Francisco data set, the K-Means++ algorithm delivers just slightly worse than the proposed algorithm. It reveals that the K-Means++ algorithm can be used as a preliminary coarse-grained clustering algorithm in such applications. The tables also demonstrate that the overall clustering performance of the RS algorithm is good, and the clustering effect is better than the K-Means++ algorithm in most cases. In
Table 5, the average Silhouette coefficient of the proposed algorithm is higher than other algorithms. The greater the Silhouette coefficient, the more compact within a cluster and more dispersed between the clusters, indicating that the clustering effect is superior to the competitors. In
Table 6, the proposed algorithm has a higher SP value than other algorithms. The higher the SP value, the greater the separation between clustering clusters, and the higher the clustering algorithm’s ability to satisfy the needs of the majority of clustering.
Table 7 indicates that the average SSE value of the proposed algorithm in 20 training cycles is superior to other algorithms. Overall, the clustering results of the proposed algorithms all outperformed the comparison algorithms. The above suggests that hybrid heuristic initialization algorithms are effective, practical and feasible.
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 show that the proposed algorithm undergoes minimal change during the iterative convergence process. Within the five groups of taxi GPS data sets, the proposed really is in the optimal state, the optimization effect is noticeable, and its curve variations are negligible. It indicates that the proposed algorithm can obtain relatively better clustering centers by using the initialization operation of the k -means++ algorithm when initializing the populations. Assuming that the proposed algorithm is equivalent to the K-means++ algorithm, the two algorithms’ convergence curves will coincide in the convergence scenario, which is not observed in the experiments. In the iterative optimization process of the proposed algorithm, the fitness value changes slightly, and the location distribution of particles and the clustering center are constantly modified to achieve a more optimal clustering centers distribution throughout the whole GPS data set. The final clustering effect can better accommodate actual requirements. The convergence curves of the proposed algorithm and PSO-GAK algorithm remain close to each other, indicating that the parameters of the proposed algorithm should be further adjusted to improve its convergence effect, which will be studied in future work.
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 show that the proposed algorithm can obtain better particles than the GAK, PSOK, PSOCK, PSOWK, PSO-GAK and K-means++ as the input initial clustering centers of the K-means algorithm. With optimization by the proposed algorithm, improved clustering centers can be captured, significantly reducing the K-means algorithm’s sensitivity to initial clustering centers. When SSE is employed as the objective function of the k-means clustering algorithm, the clustering centers obtained by the proposed algorithm are used to perform the K-means clustering algorithm, which has faster convergence characteristics and a lower SSE value overall.
In
Figure 9, the proposed algorithm begins to converge in the second generation and converges entirely in the fourth generation, and the SSE obtained is significantly smaller than other algorithms. In
Figure 10, before the seventh generation, the convergence speed of the proposed algorithm is faster than other algorithms, and the SSE value is also relatively small. After the tenth generation, the SSE value of the proposed algorithm is slightly smaller than that of the K-means++ and GAK algorithms. In
Figure 11, proposed converges from about the second generation and converges entirely in the fifth generation. The SSE value is lower than that of other algorithms, showing that the clustering effect is excellent and the separation between clusters is significant, consequently demonstrating the algorithm’s effectiveness. In
Figure 12, the proposed converges from the second generation and entirely at about the fifth generation, with a very smooth convergence curve. In the meantime, the figure demonstrates that the SSE value is less than that of other algorithms, indicating that the clustering center captured by the proposed algorithm is effective and does not cause k-means clustering to fall into local optimal solutions. In
Figure 13, the proposed converges from about the second generation and ultimately converges at the fifth generation. In the beginning stage, the SSE value of the algorithm is substantially lower than that of other algorithms, and after convergence, it is extremely stable.
Overall, experiments on various datasets indicate that the algorithm that combines the genetic algorithm, fussy system, and particle swarm algorithm virtually captures the optimum clustering centers more effectively.
3.3.2. Multi-Source Trajectory Data Set
To further validate the performance of the proposed algorithm, we employed a more complex multi-source dataset. In order to accurately analyze the above evaluation indexes, The number of clusterings is randomly generated between 130 and 150, each algorithm was iterated 120 times in a single run to guarantee convergence.
Table 8 is the comparison tables for the average Silhouette coefficient (SC), the average degree of separation (SP), and the average SSE value of 20 training sessions respectively.
All eight algorithms converge well within less than 120 iterations in the multi-source dataset. The K-Means++ and Random Swap Clustering algorithms are much less time consuming than the other algorithms, and both achieve very good clustering results. The analysis of the experimental data shows that the Random Swap Clustering algorithm outperforms the K-Means++ algorithm. The Random Swap Clustering algorithm also outperforms single-mode heuristics in many cases. With reasonable time consumption, the evaluation indexes of the proposed algorithm are significantly better than those of the compared algorithms. This may be due to the fact that the proposed hybrid heuristic algorithm has stronger non-homogeneous spatial search capability, but also brings more computing resources and time consumption. Overall, the clustering results of the proposed algorithms all outperformed the comparison algorithms in the terms of above evaluation indexes. The above suggests that the proposed hybrid heuristic algorithms are effective.
3.4. Visual Presentation of Experimental Results
3.4.1. Visual Presentation of Taxis GPS Data Set
Figure 14,
Figure 15,
Figure 16,
Figure 17,
Figure 18 and
Figure 19 shows the centers of the clustering results for each algorithm in Baidu Map (put the clustering centers into the maps using the open software interface provided by Baidu Map). On the corresponding map, we intuitively perceive the distribution of clustering results based on various algorithms. According to pertinent maps, different clustering algorithms yield different clustering centers, indicating that different clustering results and identified urban hot areas are distinct. Since experiment GPS datasets are not cleaned according to road location limitations, the map may contain a very small number of outlier points.
3.4.2. Visual Presentation of Multi-Source Trajectory Data Set
Figure 20 shows the centers of the clustering results for each algorithm in multi-source trajectory data set. On the corresponding map, we intuitively perceive the distribution of clustering results based on various algorithms.
3.5. Clustering Algorithm Complexity Analysis
The proposed noevel FPSO-GAK algorithm consists of three parts. To facilitate analysis, the following notation is used: the number of GPS data points is N, the number of clustering is K, the particles population of PSO is M. In the first part, the k-Means++ initialization method is employed to obtain particles M population. The time complexity of this part is . In the second part, the hybrid heuristic initialization is utilized to determine optimal initial clustering centers. The iterations number of PSO algorithm is , The iterations number of GA algorithm is , number of calculations to fitness function is . The time complexity of the fuzzy system is related to the number of iterations and the number of fuzzy rules; it takes very little time and is negligible in terms of computational time complexity. The time complexity of the second part is . In the third part, the K-Means clustering algorithm is used to locate the urban hotspots. The iterations number of K-Means clustering algorithm is . The time complexity of the third part is .
The time complexity of the proposed clustering algorithm is . In practical applications, K and M are very small relative to the number of GPS data points N. The number of iterations T of the K-means++ algorithm in which the random initialization procedure fails to acquire the better initialization tends to be extremely high for large-scale datasets. Due to the good non-homogeneous search capability of the proposed hybrid heuristic initialization algorithm, the value of is relatively small. Furthermore, the better-initialized centers makes would be much smaller than T. The result of their interaction makes: the complexity of the proposed algorithm is of the same order of magnitude as that of k-Means, but it will be slightly larger numerically.
In the actual experimental operation, the PSO-GAK algorithm in literature [
55] consumes the most time since its complexity tends to
. The proposed algorithm can better adjust the search parameters automatically, and it will converge significantly faster than the PSO-GAK algorithm. The complexity of an intelligent algorithm is generally higher than the general iterative, recursive algorithm [
56]. Collectively, it can be seen that the complexity of the proposed algorithm is much less than
but greater than
. The complexity of the proposed algorithm is roughly linearly related to the number of samples, and it is very efficient and scalable for handling large data sets.
3.6. Clustering Results Similarity Analysis
External indexes have been commonly used to compare clustering algorithms by calculating how many pairs of data points are partitioned consistently between two clustering solutions. For both solutions to be consistent, a pair of points must be assigned to either the same cluster or a distinct cluster. This offers an estimation of point-level similarity but no direct information regarding cluster-level similarity. In most cases, information about the cluster-level similarity of the algorithm is of more interest than the estimation of point-level similarity. Fränti et al. [
57] proposed Centroid index (CI) measures cluster-level differences of two solutions, and centroid similarity index (CSI) measures point-level differences of two solutions. The CI index has clear intuitive interpretation on how many clusters are differently located in the two solutions by an integer value. A higher CI value indicates a lower similarity between the two solutions when the number of cluster classes is determined. Considering the effect of the number of clustering, this paper employs a normalized adjusted CI index (ACI) to compare solutions with the following formula:
where
represents the number of clusters in two solutions;
represents the absolute value function;
represents the minimum function.
Table 9 and
Table 10 are the comparison tables for the novel FPSO-GAK algorithm with other algorithms.
Table 9 shows that the ACI indices of FPSO-GAK and other algorithms are greater than 0.15, indicating a substantial difference in their clustering-level similarity. Obtaining the above outcomes under identical data input conditions reveals that the equivalent mathematical models of the FPSO-GAK algorithm differ significantly from those of other algorithms. However, the value of the ACI index is generally less than 0.25, indicating that the difference between equivalent mathematical models is not fundamental. The CSI indices in
Table 10 are more concerned with point-level similarity, and their values indicate similar model analysis information.
4. Discussion
The K-Means algorithm can achieve very good results on convex, compressible datasets. Urban roads are usually distributed in a neighborhood manner, and the trajectory points of vehicles satisfy convexity and compressibility in a large local area. Using the K-Means algorithm to identify urban hotspots can yield effective results. K-Means algorithm is very sensitive to the initial center points. Excellent initial clustering centers significantly impact the performance of the k-Means clustering algorithm. K-Means is particularly effective in fine-tuning local cluster boundaries, but is incapable of solving global cluster locations. The organic combination of global perception and local fine-tuning is the key to obtaining good initial centers. Literature [
50] proposed the method of random swap clustering, which is simple and efficient. Under the condition of a suitable KNN super parameter configuration, this method’s random swap efficiency is extremely high, it is an exquisite lightweight clustering method. Inspired by the ideas of randomness and exchange, the FPSO-GAK algorithm combines a variety of heuristic algorithms with the premise of sacrificing a little efficiency to avoid the homogeneity of the random search process and can better obtain global perception ability. At the same time, local search parameters can be fine-tuned automatically to prevent over-dependence on super parameter.
In practical applications, in order to further improve the clustering efficiency, the inertia weight and learning factor can better meet the iterative requirements by adjusting fuzzy rules. At the same time, the combination mode of PSO and GA can also be changed. PSO can improve the phenomenon that GA is prone to early maturity through global optimization in the early stage. while PSO is prone to fall into local optimization in the late stage, the deficiency can be made up through the global optimization in the late stage of GA.
Compared with many existing research algorithms in this field, the proposed algorithm has a better and relatively efficient non-homogeneous search capability, which can effectively find excellent initial clustering centers and at the same time can reduce the sensitivity of K-Means algorithm to noise to a certain extent. However, the proposed algorithm is relatively complex in structure and has a relatively high time complexity compared to many lightweight algorithms. The proposed algorithm is a class of algorithms based on a heuristic mechanism, which can only guarantee to get a better solution and cannot guarantee to obtain the optimal solution for sure.
5. Conclusions
In this paper, a Novel FPSO-GAK clustering algorithm has been proposed to effectively solve the problems of difficulty in determining the sensitivity of initializing the clustering center for a K-means clustering algorithm. The Novel FPSO-GAK clustering algorithm has been employed to capture urban hotspots in metropolitan areas in several countries worldwide. After the experimental clustering operation was completed, the SC index, SP index, and SSE index were applied to evaluate the clustering performance. The proposed Novel FPSO-GAK clustering algorithm obtained better clustering results for urban hotspots in metropolitan areas over the GAK, PSOK, PSOCK, PSOWK, PSO-GAK, and K-means++ algorithm. The algorithm presented in this paper can better improve the quality of public service for people in cities. In addition, the Novel FPSO-GAK clustering algorithm proposed in this paper can also be applied to the clustering analysis of remote sensing ground object data, animal migration area analysis, urban subway data analysis, urban heating network optimization analysis, detailed analysis of power supply flow of power grid and other fields.
There are also some shortcomings of the algorithm proposed in the paper. On the one hand, we only use the GPS data of taxis. The data amount is relatively small, leading to an insufficient resolution in the refined analysis of the distribution relationship of urban hotspots. On the other hand, the Novel FPSO-GAK clustering algorithm does not add urban road constraints, leading to some hotspots falling into particular areas where traffic vehicles cannot be reached, such as large lake areas in cities. These deficiencies will be further improved in future research work.