5.1. Discussion on Outliers
Outliers are generally defined as the data points that have the least degree of similarity to the dataset they belong to; for our classification task, a data point corresponds to the time series generated by a certain sensor, which represents the behavior of the associated parking space. With the previous naive technique, we overlooked the fact that the time series in the considered parking data span a wide range of behaviors, which makes it difficult to identify outliers by inspecting the dataset as a whole. Nevertheless, we recognized that there exist time series with similar characteristics, which allows splitting the data into clusters. We then found that, within each cluster, parking sequences are much more homogeneous and, in turn, outliers are easier to identify. For this reason, in the remainder of this paper, instead of searching for outliers in the complete dataset at once, we first split it into clusters and then perform outlier identification inside each of them.
In addition, we note that assessing whether a certain sensor is an outlier strongly depends on the definition one uses for similarity. If, for example, we were to compare parking sequences only based on the average duration of their parking events, then two sequences with the same average event duration but different event frequencies would be treated as similar. Of course, this is acceptable as long as our application does not need to track event frequency. For instance, event duration is all that matters to assess whether the parking time has expired, in which case a fine is to be issued to the car owner.
Hence, the definition of the features of interest is crucial for the correct ascertainment of outliers, and we also need to define a similarity metric over these features. In the following sections, we identify a suitable feature set for our classification problem and quantify the concept of similarity between feature vectors.
5.2. Features for Parking Analysis
In machine learning and pattern recognition [27], a feature can be defined as an individual measurable property of a phenomenon being observed. Informative, discriminating and independent features are key to the design of effective classification algorithms. For their definition, we consider, for each parking sensor i, the following statistical measures:
- (1)
Sensor occupation (SO): accounts for the amount of time during which the sensor reports an occupied state, i.e., the corresponding parking space is occupied.
- (2)
Event frequency (EF): accounts for the number of parking events per unit time.
- (3)
Parking event duration (PD): measures the duration of parking events.
- (4)
Vacancy duration (VD): measures the duration of vacancies.
We computed the hourly average trend for each of the above measures, considering two classes, (cl1) weekdays and (cl2) weekends, and averaging the data points corresponding to the same hour for all of the days in the same class. For each statistical measure, this leads to 24 average values, where the value associated with hour $t$ for Class cl1/cl2 is obtained by averaging the measure for hour $t$ across all days of the same class. Thus, hourly sensor occupation functions were obtained for each hour of the day for Classes cl1 and cl2, which we respectively denote by $SO_i^{(1)}(t)$ and $SO_i^{(2)}(t)$. Similar functions were obtained for the remaining measures, i.e., $EF_i^{(c)}(t)$, $PD_i^{(c)}(t)$, $VD_i^{(c)}(t)$, where $c$ is either one (cl1) or two (cl2). We remark that all functions have been normalized through $x \mapsto (x - x_{\min})/(x_{\max} - x_{\min})$, where $x_{\min}$ and $x_{\max}$ respectively represent the minimum and maximum elements in the dataset for the considered measure and class. Other normalizations were also evaluated, but this one led to the best classification results.
This leads to a total of 2 × 24 = 48 average values for each measure (two classes and 24 h per class), which amounts to a total of 192 values to represent the four statistical measures that we utilize to characterize the parking behavior of a sensor. Note that we purposely decided to separately process data from weekdays and weekends. This is because parking data from these two classes exhibit a significantly different behavior, and explicitly accounting for this fact increased the precision of our algorithms (although at the cost of a higher number of feature elements).
Thus, the eight functions (four per class) are computed for each sensor in the parking deployment, and their weighted sum is computed using suitable weights $w_{SO}, w_{EF}, w_{PD}, w_{VD}$, with $w_m \geq 0$ and $\sum_m w_m = 1$. In this way, we obtain a single feature function $g_i(\cdot)$ (see Equation (2)) consisting of 96 average values: the first 48 values are representative of the average hourly measures from Class cl1, whereas the second 48 values represent the hourly measures from Class cl2. The weights determine the relative importance of each statistical measure; their correct assignment is crucial to obtain a good classification performance, as we show in Section 6.2. The feature function $g_i(\cdot)$ is obtained as:
$$ g_i(\ell) = w_{SO}\, SO_i^{(c)}(t) + w_{EF}\, EF_i^{(c)}(t) + w_{PD}\, PD_i^{(c)}(t) + w_{VD}\, VD_i^{(c)}(t), \qquad (2) $$
where the entry index $\ell$ maps to a class $c \in \{1, 2\}$ (cl1 for the first half of the entries, cl2 for the second half) and to the corresponding hour $t$. We remark that $g_i(\cdot)$ is sensor-specific, i.e., one such function is computed for each parking sensor. Furthermore, for each parking sensor $i$, this function defines a feature vector $\mathbf{g}_i$ of 96 elements that are used to train the clustering algorithms of the next sections. Specifically, for sensor $i$, we set $\mathbf{g}_i = [g_i(1), \dots, g_i(96)]^T$, where each entry of $\mathbf{g}_i$ is computed from the measures of that sensor. A feature function example from our parking dataset is shown in Figure 4.
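To make the feature construction more concrete, the following sketch (our own illustration with hypothetical names, not the authors' code) computes the per-class hourly averages of the four measures for one sensor, applies the min-max normalization described above and combines the normalized profiles with the weights $w_m$; the exact ordering of the entries of $\mathbf{g}_i$ in the paper may differ from the simple per-class layout used here.

```python
# Illustrative sketch of the feature construction of Section 5.2 (all names are
# ours): hourly averages per class (weekdays/weekends), min-max normalization
# and weighted combination of the four measures.
import numpy as np

def hourly_class_profiles(samples, is_weekday):
    """Average a (num_days, 24) array over the days of each class (cl1/cl2)."""
    return samples[is_weekday].mean(axis=0), samples[~is_weekday].mean(axis=0)

def min_max(x, lo, hi):
    """Normalization used in the text: (x - min) / (max - min)."""
    return (x - lo) / (hi - lo)

def feature_function(measures, is_weekday, weights, extremes):
    """Weighted combination of the normalized per-class hourly profiles.

    measures: dict measure name -> (num_days, 24) array for one sensor
    weights:  dict measure name -> w_m, with sum(w_m) = 1
    extremes: dict (measure name, class) -> (min, max) over the whole dataset
    """
    combined = {1: 0.0, 2: 0.0}
    for name, samples in measures.items():
        profiles = hourly_class_profiles(samples, is_weekday)
        for cls, profile in zip((1, 2), profiles):
            combined[cls] = combined[cls] + weights[name] * min_max(
                profile, *extremes[(name, cls)])
    # Class cl1 entries first, then class cl2, as in the feature function g_i.
    return np.concatenate([combined[1], combined[2]])

# Minimal usage example with synthetic stand-in data.
rng = np.random.default_rng(0)
days, is_weekday = 28, np.tile([True] * 5 + [False] * 2, 4)
measures = {m: rng.random((days, 24)) for m in ("SO", "EF", "PD", "VD")}
weights = {"SO": 0.4, "EF": 0.3, "PD": 0.2, "VD": 0.1}
extremes = {(m, c): (0.0, 1.0) for m in measures for c in (1, 2)}
g_i = feature_function(measures, is_weekday, weights, extremes)  # per-sensor feature vector
```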
5.3. Selected Clustering Techniques from the Literature
Many clustering algorithms have been proposed over recent decades; see [28] for a literature survey. That survey gives the following operational definition of clustering: “given a representation of n objects, find k groups based on a measure of similarity such that the similarities between objects in the same group are high, while those between objects in different groups are low”. Note that our problem is in general more complex, as the number of clusters (data classes) is not known beforehand, but has to be inferred from the data. In this section, we review three clustering techniques from the literature, i.e., k-means, EM and DBSCAN. These tackle the clustering problem from quite different angles and will be considered for the performance evaluation of Section 6. As will be shown, while they all detect similar clusters, none of them is entirely satisfactory, as outliers most of the time go undetected. Our SOM-based approach solves this.
k-means: k-means is a very popular clustering technique [4], which is successfully used in many applications. Basically, given $n$ input data vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$, where $\mathbf{x}_j \in \mathbb{R}^d$ and $j = 1, \dots, n$, the aim is to determine a set of $k$ vectors, referred to as cluster centers, that minimizes the mean squared distance from each vector in the input set to its nearest center. A popular heuristic to achieve this is Lloyd’s algorithm [29]. First, k-means is initialized by choosing $k$ cluster centers, called centroids. Then, it proceeds by alternating the following steps until convergence:
- (1)
Compute the distances between each input data vector and each cluster centroid.
- (2)
Assign each input vector to the cluster associated with the closest centroid.
- (3)
Compute the new average of the points (data vectors) in each cluster to obtain the new cluster centroids.
The algorithm stabilizes when the assignments no longer change. The clustering results provided by this technique depend on the quality of the starting points of the algorithm, i.e., the initial centroids. In this work, we consider the k-means++ algorithm [30], which augments k-means with a low-complexity, randomized seeding technique. This is found to be very effective in avoiding the poor clusterings that are sometimes returned by the standard k-means algorithm. Note also that k-means is considered here, due to its popularity, as a baseline clustering scheme. Nevertheless, we stress that the number of clusters $k$ has to be known beforehand, which is not the case with real (unlabeled) parking data. Hence, techniques that discover a suitable number of clusters in an unsupervised manner are preferable, and the ones that we treat in the following do so.
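As an illustration, scikit-learn already provides the k-means++ seeding discussed above; the snippet below is our own sketch (with hypothetical variable names and a random stand-in for the feature matrix), not part of the evaluation of Section 6.

```python
# Sketch: k-means with k-means++ seeding on the N x 96 feature matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
G = rng.random((200, 96))                  # stand-in for the N x 96 feature vectors

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(G)             # cluster index for each parking sensor
centroids = kmeans.cluster_centers_        # one centroid per cluster
print(np.bincount(labels))                 # cluster sizes; note that k must be chosen a priori
```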
EM: EM is an unsupervised classification technique, which fits a finite Gaussian Mixture Model (GMM) to the provided input vectors $\mathbf{g}_i$, $i = 1, \dots, N$ (one vector for each parking sensor), using the EM algorithm [6] to iteratively estimate the model parameters. Within the considered GMM, the number of mixture components equals the (pre-assigned) number of clusters, and each probability distribution models one cluster. A naive implementation of the algorithm requires knowing beforehand the number of classes into which the dataset should be subdivided. Here, we implemented the procedure proposed in [6], through which the number of clusters is automatically assessed.
The steps involved in the considered EM clustering algorithm are:
- (1)
Initial values of the normal distribution models (means and standard deviations) are arbitrarily assigned.
- (2)
Means and standard deviations are iteratively refined through the expectation and maximization steps of the EM algorithm. The algorithm terminates when the distribution parameters converge or a maximum number of iterations is reached.
- (3)
Data vectors are assigned to the cluster with the maximum membership probability.
For the automatic assessment of the number of clusters, we proceeded by cross-validation. This starts by setting the number of clusters to one. Then, the training set is split into a given number of folds (10 for the results in this paper), EM is performed ten times (once per fold), and the obtained log likelihood values are averaged. If the average log likelihood increases when the number of clusters is increased by one, the procedure is repeated with the larger number of clusters; otherwise, it stops. The Weka data mining software was used to this end [31].
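A comparable procedure can be sketched with scikit-learn in place of Weka: Gaussian mixtures are fitted with an increasing number of components, and the count keeps growing while the 10-fold cross-validated log likelihood improves. This is an illustrative sketch under our own parameter choices, not the Weka configuration used for the results.

```python
# Sketch: choose the number of GMM components via 10-fold cross-validated
# log likelihood, mimicking the procedure described above.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
G = rng.random((200, 96))                        # stand-in for the feature vectors

def cv_log_likelihood(X, k, folds=10):
    scores = []
    for train, val in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
        gmm.fit(X[train])
        scores.append(gmm.score(X[val]))         # mean log likelihood on the held-out fold
    return np.mean(scores)

k, best = 1, cv_log_likelihood(G, 1)
for cand in range(2, 16):                        # cap the search for this sketch
    score = cv_log_likelihood(G, cand)
    if score <= best:                            # stop when one more cluster no longer helps
        break
    k, best = cand, score

labels = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit_predict(G)
print(k, np.bincount(labels))
```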
DBSCAN: DBSCAN can be considered the de facto standard unsupervised clustering technique [5]. A set of $n$ points (real vectors in $\mathbb{R}^d$) is taken as the input, and the objective is to group them into a number of regions. Differently from k-means, the number of clusters (regions) does not need to be specified a priori. Two parameters have to be set prior to applying the algorithm, namely:
- (1)
ε: used to define the ε-neighborhood of any input vector $\mathbf{x}$, which corresponds to the region of space whose distance from $\mathbf{x}$ is smaller than or equal to ε.
- (2)
MinPts: representing the minimum number of points needed to form a so-called dense region.
An input vector $\mathbf{x}$ is a core point if at least MinPts points (including $\mathbf{x}$) are within distance ε from it. The following applies: all of the points in the ε-neighborhood of a core point are directly reachable from it, whereas no points are directly reachable from a non-core point. A point $\mathbf{q}$ is reachable from $\mathbf{p}$ if we can locate a path of points with endpoints $\mathbf{p}$ and $\mathbf{q}$, where each point in the path is directly reachable from the previous one. All points that are not reachable from any core point are tagged as outliers. DBSCAN starts from an arbitrary starting point (vector). If its ε-neighborhood has a sufficient number of points, a cluster is initiated; otherwise, the point is considered as noise. Note that this rejection policy makes the algorithm robust to outliers, but it has to be tuned through a careful adjustment of the two parameters above, whose best setting depends on the input data distribution. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. From the first point, the cluster construction process continues by visiting all of the points that are mutually reachable, until the density-connected cluster is completely found. After this, a new point is processed, reiterating the procedure, which can lead to the discovery of a new cluster or of noise. The main advantages of the algorithm are that the number of clusters does not have to be known, that it can discover arbitrarily-shaped regions, that it is nearly insensitive to the ordering of the input points and that, if properly tuned, it is robust to outliers. Drawbacks are related to the choice of the distance measure and to the choice of the two parameters ε and MinPts, especially when clusters have large differences in densities, as a single parameter pair is unlikely to be good for all clusters.
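For reference, the same clustering step can be sketched with scikit-learn's DBSCAN, where points labeled −1 are those rejected as noise/outliers by the ε/MinPts rule; the parameter values and data below are illustrative placeholders, not the ones used in our evaluation.

```python
# Sketch: DBSCAN on the feature vectors; eps and min_samples (MinPts) must be tuned.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
G = rng.random((200, 96))                   # stand-in for the N x 96 feature vectors

db = DBSCAN(eps=1.5, min_samples=5, metric="euclidean").fit(G)
labels = db.labels_                         # -1 marks points rejected as noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int(np.sum(labels == -1)))  # number of clusters and of outliers
```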
5.4. Classification Based on Self-Organizing Maps
Next, we present an original clustering algorithm that exploits the unsupervised learning capabilities of self-organizing maps to automatically discover clusters and to concurrently identify outliers. This algorithm is here proposed, fine-tuned and tested against synthetic datasets and real parking signals (see Section 6).
The SOM [8,9] is a neuro-computational algorithm that maps high-dimensional data onto a one- or two-dimensional space through a nonlinear, competitive and unsupervised learning process. The SOM differs from other artificial neural networks in that it uses a neighborhood function to preserve the topological properties of the input space [32]. It is trained through input examples, and the input space is mapped onto a two-dimensional lattice of neurons, preserving the property that similar input patterns are mapped to nearby neurons in the map.
In this work, we consider one- and two-dimensional maps, which are respectively made of a sequence of neurons and a lattice of neurons, in both cases comprising $M$ neurons in total. These maps become selectively tuned to the input patterns through an unsupervised (also referred to as competitive) learning process. As learning progresses, the neuron weights tend to become ordered with respect to each other in such a way that a significant coordinate system for different input features is created over the lattice. In other words, a SOM creates a topographic map of the input data space, where the spatial locations or coordinates of the neurons in the lattice correspond to a particular domain or intrinsic statistical feature of the input data. Remarkably, this is achieved without requiring any prior knowledge of the input distribution.
With $\mathcal{G} = \{\mathbf{g}_1, \dots, \mathbf{g}_N\}$, we indicate the feature (input) set, and we let $\mathbf{g}_i$ be the input feature vector associated with parking sensor $i$, where $i = 1, \dots, N$; see Section 5.2. With $N$, we mean the number of sensors, and with $F = 96$ the number of elements in each feature vector. Let $\mathcal{L}$ be the lattice. Each neuron is connected to each component of the input vector, as shown in Figure 5. The links between the input vector and the neurons are weighted, such that the $j$-th neuron is associated with a synaptic-weight vector $\mathbf{w}_j = [w_{j1}, \dots, w_{jF}]^T$, where $j = 1, \dots, M$.
SOM-based clustering: The clustering algorithm is summarized in Algorithm 1 and discussed in what follows by identifying its main steps, i.e., training and clustering. Step 1: Training. The learning process occurs by means of showing input patterns $\mathbf{g}_n$ to the SOM. Each time a new pattern is inputted to the map, the neurons compete among themselves to be activated, with the result that only one winning neuron is elected at any one time. To determine the winning neuron, the input vector is compared with the synaptic-weight vectors $\mathbf{w}_j$ of all neurons. Only the neuron whose synaptic-weight vector most closely matches the current input vector according to a given distance measure (that we choose equal to the Euclidean distance, which is commonly used) dominates. Consequently, the weights of the winning neuron and of the nodes in the lattice within its neighborhood are adjusted to more closely resemble the input vector. The algorithm steps are as follows (see also Figure 6).
Algorithm 1 (SOM):
(1) Initialization: set the initial time step to $n = 0$ and choose small random values for the initial synaptic-weight vectors $\mathbf{w}_j(0)$, $j = 1, \dots, M$.
(2) Sampling: set $n = n + 1$ and sample the training input pattern $\mathbf{g}_n$, i.e., the feature vector from the $n$-th parking sensor.
(3) Similarity matching: find the winning neuron, whose index at time step $n$ is $c(n)$, by using the minimum-distance criterion:
$$ c(n) = \arg\min_{j = 1, \dots, M} d(\mathbf{g}_n, \mathbf{w}_j(n)), \qquad (3) $$
where with $d(\mathbf{a}, \mathbf{b})$, we mean the Euclidean distance between vectors $\mathbf{a}$ and $\mathbf{b}$.
(4) Update: adjust the synaptic-weight vectors of all neurons by using the update equation:
$$ \mathbf{w}_j(n+1) = \mathbf{w}_j(n) + \eta(n)\, h_{j, c(n)}(n)\, \big(\mathbf{g}_n - \mathbf{w}_j(n)\big), \qquad (4) $$
with $j = 1, \dots, M$, where $\eta(n)$ is the learning rate parameter at iteration $n$ and $h_{j, c(n)}(n)$ is the neighborhood function centered on the winning neuron $c(n)$ at iteration $n$.
(5) Adjust neighborhood: adjust the neighborhood size ($h_{j, c(n)}(\cdot)$) and the learning rate ($\eta(\cdot)$), and continue from Step (2) if $n < n_{\max}$; stop otherwise.
A good and common choice for the neighborhood function $h_{j, c(n)}(n)$ is the Gaussian function, whose span at time $n = 0$ is chosen to cover all of the neurons in the lattice and is then reduced as the map is being trained. The reason for this is that a wide initial topological neighborhood, i.e., a coarse spatial resolution in the learning process, first induces a rough global order in the synaptic-weight vector values. Hence, during training, narrowing the neighborhood improves the spatial resolution of the map without destroying the acquired global order. The learning rate $\eta(n)$ is also reduced as time goes by, and a sufficiently high number of iterations $n_{\max}$ has to be chosen so that all synaptic weights stabilize. Moreover, if the number of available input items is smaller than $n_{\max}$, new examples can be resampled from the input data until convergence. See Chapter 9 of [32] for additional details.
Step 2: Clustering. Once the training is complete, the map has adapted to efficiently and compactly represent the feature space $\mathcal{G}$. Clustering immediately follows using the trained map, as we now explain. The feature vectors $\mathbf{g}_i$, with $i = 1, \dots, N$, are fed as input to the map. For each feature vector $\mathbf{g}_i$, we use Equation (3) (with the final synaptic weights) to assess the winning neuron (also referred to as the activated neuron) in the map for this vector. With $c(i)$, we indicate the index of the winning neuron; see Equation (3). Clusters are constructed by grouping together the feature vectors returning the same index. It is then immediate to realize that the maximum number of clusters returned by the SOM equals the number of neurons in the map, $M$, i.e., one cluster per neuron. We underline that in some cases, a small number of neurons may never be activated, i.e., never be selected as the best fitting neuron for any input pattern (vector). As a consequence, the number of clusters may be strictly smaller than $M$ (in our experimental validation, this number remained very close to $M$). With this approach, the number of clusters is somewhat fixed in advance, as it strictly depends on the number of neurons in the map. Next, we propose an original algorithm that exploits the SOM to self-assess the number of clusters according to a tree-splitting approach.
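The following self-contained sketch implements Algorithm 1 for a small one-dimensional map and then performs the clustering step by assigning each feature vector to its winning neuron; the map size, decay schedules, variable names and random stand-in data are our own illustrative choices, not the settings used in Section 6.

```python
# Sketch of Algorithm 1 (SOM) with a one-dimensional map of M neurons,
# followed by the clustering step (assignment to the winning neuron).
import numpy as np

def train_som(X, M=4, n_max=2000, eta0=0.5, sigma0=None, seed=0):
    rng = np.random.default_rng(seed)
    F = X.shape[1]
    W = rng.uniform(-0.01, 0.01, size=(M, F))        # (1) small random initial weights
    sigma0 = sigma0 if sigma0 is not None else M / 2.0
    for n in range(n_max):
        x = X[rng.integers(len(X))]                  # (2) sample a training pattern
        c = np.argmin(np.linalg.norm(W - x, axis=1)) # (3) winning neuron, Eq. (3)
        eta = eta0 * np.exp(-n / n_max)              # (5) decaying learning rate
        sigma = sigma0 * np.exp(-n / n_max)          #     shrinking neighborhood span
        lattice_dist = np.arange(M) - c              # lattice distance to the winner
        h = np.exp(-(lattice_dist ** 2) / (2 * sigma ** 2))  # Gaussian neighborhood
        W += eta * h[:, None] * (x - W)              # (4) update rule, Eq. (4)
    return W

def som_clusters(X, W):
    """Clustering step: group vectors that activate the same (winning) neuron."""
    return np.argmin(np.linalg.norm(X[:, None, :] - W[None], axis=2), axis=1)

rng = np.random.default_rng(1)
G = rng.random((200, 96))           # stand-in for the feature vectors of Section 5.2
W = train_som(G, M=4)
labels = som_clusters(G, W)         # at most M clusters, one per neuron
print(np.bincount(labels, minlength=4))
```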
Unsupervised SOM-based clustering: The above SOM clustering approach requires knowing in advance the number of clusters in the dataset (i.e., the number of neurons in the SOM map). Next, we devise an unsupervised clustering algorithm that does not need to know the number of data classes in advance. We do this by taking inspiration from [33,34,35]: as done in these papers, instead of clustering data through an agglomerative approach, we adopt a divisive approach, i.e., we start from a big cluster containing the entire dataset, and we iteratively partition this initial cluster into progressively smaller ones. A SOM map with only two neurons is utilized as a non-linear classifier to split clusters into two subsets; the SOM has a higher discriminant power than the linear-discriminant functions of [33]. Furthermore, we use a local-global adaptive clustering procedure, similar to that in [36], whose cornerstone is the self-similarity found in the global and local characteristics of many real-world datasets. Specifically, we assess a correlation measure among the features of the entire dataset (global metric) and those of the smaller clusters obtained at a certain step of the algorithm (local metrics). Hence, global and local metrics are compared to determine when the current clusters have to be further split. Prior to describing the algorithm, we introduce the following concepts:
Data point: The input dataset is composed of $N$ data points, where “data point” $i$ is the feature column vector $\mathbf{g}_i$ associated with the parking sensor $i$. These vectors are conveniently represented through the full feature matrix $\mathbf{G} = [\mathbf{g}_1, \dots, \mathbf{g}_N]$. With $\mathbf{G}_{\mathcal{C}}$, we mean a submatrix of $\mathbf{G}$ obtained by collecting $p$ columns (feature vectors), not necessarily the first $p$. A generic cluster $\mathcal{C}$ containing $p$ elements is then uniquely identified by a collection of $p$ sensors and by the corresponding feature matrix $\mathbf{G}_{\mathcal{C}}$.
Cluster cohesiveness: Consider a cluster $\mathcal{C}$ with $p$ elements, and let $\mathbf{G}_{\mathcal{C}}$ be the corresponding feature matrix. We use a scatter function as a measure of its cohesiveness, i.e., to gauge the distance between the cluster elements and their mean (centroid). The centroid of $\mathcal{C}$ is computed as $\boldsymbol{\mu}_{\mathcal{C}} = (1/p) \sum_{\mathbf{g}_i \in \mathcal{C}} \mathbf{g}_i$. The dispersion of the cluster members around $\boldsymbol{\mu}_{\mathcal{C}}$ is assessed through the sample standard deviation:
$$ \sigma_{\mathcal{C}} = \sqrt{\frac{1}{p} \sum_{\mathbf{g}_i \in \mathcal{C}} \| \mathbf{g}_i - \boldsymbol{\mu}_{\mathcal{C}} \|^2}, \qquad (5) $$
where $\|\mathbf{x}\|$ is the Euclidean norm of vector $\mathbf{x}$. Likewise, the dispersion of the full feature matrix $\mathbf{G}$ is denoted by $\sigma_{\mathbf{G}}$, and we define a further threshold $\sigma_{th} = \gamma\, \sigma_{\mathbf{G}}$ for a suitable $\gamma > 0$.
Global vs. local clustering metrics: In our tests, we experimented with different metrics, and the best results were obtained by tracking the correlation among features, as we now detail. We proceed by computing two statistical measures: (1) a first metric, referred to as global, is obtained for the entire feature matrix $\mathbf{G}$; (2) a local metric is computed for the smaller clusters (matrix $\mathbf{G}_{\mathcal{C}}$).
- (1)
Global metric: Let $\mathbf{G}$ be the full feature matrix. From $\mathbf{G}$, we obtain the $N \times N$ correlation matrix $\mathbf{R}$, where $R_{ij}$ is the correlation coefficient between feature vectors $\mathbf{g}_i$ and $\mathbf{g}_j$. Thus, we average $\mathbf{R}$ by row, obtaining the $N$-sized vector $\bar{\boldsymbol{\rho}}$, with $\bar{\rho}_i = (1/N) \sum_{j=1}^{N} R_{ij}$. We respectively define $\sigma(\bar{\boldsymbol{\rho}})$ and $\mu(\bar{\boldsymbol{\rho}})$ as the sample standard deviation and the mean of $\bar{\boldsymbol{\rho}}$. We finally compute two global measures for matrix $\mathbf{G}$ as:
$$ \mathrm{meas}_1(\mathbf{G}) = \sigma(\bar{\boldsymbol{\rho}}), \qquad \mathrm{meas}_2(\mathbf{G}) = \mu(\bar{\boldsymbol{\rho}}). \qquad (6) $$
- (2)
Local metric: The local metric is computed on a subsection of the entire dataset, namely on the clusters that are obtained at runtime. Now, let us focus on one such cluster, say cluster $\mathcal{C}$ with $p$ elements. Hence, we build the corresponding feature matrix $\mathbf{G}_{\mathcal{C}}$ by selecting the $p$ columns of $\mathbf{G}$ associated with the elements in $\mathcal{C}$. Thus, we compute the correlation matrix of $\mathbf{G}_{\mathcal{C}}$, which we call $\mathbf{R}^{\mathcal{C}}$, and the $p$-sized vector $\bar{\boldsymbol{\rho}}^{\mathcal{C}}$, obtained averaging $\mathbf{R}^{\mathcal{C}}$ by row as above. The local measures associated with matrix $\mathbf{G}_{\mathcal{C}}$ are:
$$ \mathrm{meas}_1(\mathbf{G}_{\mathcal{C}}) = \sigma(\bar{\boldsymbol{\rho}}^{\mathcal{C}}), \qquad \mathrm{meas}_2(\mathbf{G}_{\mathcal{C}}) = \min(\bar{\boldsymbol{\rho}}^{\mathcal{C}}), \qquad (7) $$
where $\min(\cdot)$ returns the smallest element in $\bar{\boldsymbol{\rho}}^{\mathcal{C}}$.
Global vs. local dominance: We now elaborate on the comparison of global and local metrics. Let $\mathbf{G}$ and $\mathbf{G}_{\mathcal{C}}$ respectively be the full feature matrix and that of a cluster obtained at runtime by our algorithm. Global and local metrics are respectively computed using Equations (6) and (7) and are compared in a Pareto [37] sense as follows. We say that the global metric (matrix $\mathbf{G}$) dominates the local one (matrix $\mathbf{G}_{\mathcal{C}}$) if the following inequalities are jointly verified:
$$ \mathrm{meas}_1(\mathbf{G}) > \mathrm{meas}_1(\mathbf{G}_{\mathcal{C}}) \quad \text{and} \quad \mathrm{meas}_2(\mathbf{G}) < \mathrm{meas}_2(\mathbf{G}_{\mathcal{C}}). $$
For the first measure ($\mathrm{meas}_1$), global dominance occurs when the global standard deviation is strictly larger than the local one. For the second measure ($\mathrm{meas}_2$), global dominance occurs when the global mean is smaller than the local minimum. Full dominance occurs when the two conditions are jointly verified. Note that when the local metrics are non-dominated by the global ones, the vectors in the associated cluster either have a larger variance than the full dataset or the minimum correlation in the cluster does not exceed the average correlation in the entire dataset. In both cases, the local measures are not desirable, as the cluster still has worse correlation properties than the full data. Hence, we infer that the cluster has to be split, as it either contains uncorrelated data points or at least one outlier.
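The quantities of Equations (5)–(7) and the dominance test can be written compactly as follows; this is a sketch with our own function names, in which the feature vectors are stored as rows of an array (whereas the text arranges them as columns of $\mathbf{G}$).

```python
# Sketch: cluster cohesiveness (Eq. 5), global/local correlation measures
# (Eqs. 6 and 7) and the Pareto dominance test, for feature vectors stored as rows.
import numpy as np

def scatter(X):
    """Standard deviation of the cluster members around their centroid (Eq. 5)."""
    mu = X.mean(axis=0)
    return np.sqrt(np.mean(np.sum((X - mu) ** 2, axis=1)))

def correlation_profile(X):
    """Row-averaged correlation vector (rho bar) among the feature vectors in X."""
    return np.corrcoef(X).mean(axis=1)

def global_measures(G):
    rho = correlation_profile(G)
    return rho.std(ddof=1), rho.mean()           # meas1(G), meas2(G), Eq. (6)

def local_measures(Gc):
    rho = correlation_profile(Gc)
    return rho.std(ddof=1), rho.min()            # meas1(Gc), meas2(Gc), Eq. (7)

def global_dominates(G, Gc):
    """True when both inequalities of the dominance test hold."""
    g1, g2 = global_measures(G)
    l1, l2 = local_measures(Gc)
    return (g1 > l1) and (g2 < l2)
```

In Algorithm 2 below, a cluster is split whenever `global_dominates` returns False (condition (c1)) or its scatter exceeds the threshold $\sigma_{th}$ (condition (c2)).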
Our unsupervised SOM-based clustering technique is detailed next.
Algorithm 2 (Unsupervised SOM clustering):
(1) Initialization: Initialize an empty cluster set $\mathcal{A}$; assign all of the parking sensors to a single cluster; mark it as non-finalized; and add it to set $\mathcal{A}$. Initialize a second empty set $\mathcal{B}$, used to collect the clusters created by the algorithm through successive partitioning. Compute the global metrics $\mathrm{meas}_1(\mathbf{G})$ and $\mathrm{meas}_2(\mathbf{G})$.
(2) Evaluation: If set $\mathcal{A}$ is empty, go to Termination. Otherwise, pick a cluster at random from $\mathcal{A}$ and compute its local metrics. Let $\mathcal{C}$ and $\mathbf{G}_{\mathcal{C}}$ respectively be this cluster and the associated feature matrix. If either of the following conditions is verified, (c1) the local metrics $\mathrm{meas}_1(\mathbf{G}_{\mathcal{C}})$ and $\mathrm{meas}_2(\mathbf{G}_{\mathcal{C}})$ are non-dominated by the global ones or (c2) the cohesiveness of $\mathcal{C}$ is such that $\sigma_{\mathcal{C}} > \sigma_{th}$, then perform bipartition on cluster $\mathcal{C}$ (Step 2a below); otherwise, perform finalization (Step 2b below).
- (a)
Bipartition: Remove $\mathcal{C}$ from $\mathcal{A}$, and split it into two new clusters using a SOM (involving Step 1, Training, and Step 2, Clustering). Mark the two clusters so obtained as non-finalized and add them to set $\mathcal{A}$. Return to the Evaluation step.
- (b)
Finalization: Remove cluster $\mathcal{C}$ from $\mathcal{A}$; mark it as finalized; and add it to $\mathcal{B}$. This cluster will no longer be evaluated. Return to the Evaluation step.
(3) Termination: When all clusters have been finalized, merge all single-element clusters in $\mathcal{B}$ into a single set, which is referred to as the outlier cluster. At this stage, $\mathcal{B}$ contains the final clusters.
(4) Polishing: A final polishing (or merging) procedure is executed on the final clusters in $\mathcal{B}$. The aim of this is to join clusters that were separated to isolate outliers, but that actually contain points that are very close to one another. The polishing procedure is described below.
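Putting the pieces together, a self-contained sketch of the divisive loop of Algorithm 2 (without the polishing step) is given below. The compact two-neuron split, the treatment of degenerate splits, the random demo data and all function names are our own simplifications, not the authors' implementation.

```python
# Sketch of Algorithm 2: divisive clustering driven by the global/local metrics
# and by the cohesiveness threshold sigma_th = gamma * sigma_G.
import numpy as np

def scatter(X):
    mu = X.mean(axis=0)
    return np.sqrt(np.mean(np.sum((X - mu) ** 2, axis=1)))

def measures(X):
    rho = np.corrcoef(X).mean(axis=1)                # row-averaged correlations
    return rho.std(ddof=1), rho.mean(), rho.min()

def two_neuron_som_split(X, n_iter=300, eta0=0.5, seed=0):
    """Bipartition with a degenerate two-neuron SOM (only the winner is updated)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for n in range(n_iter):
        x = X[rng.integers(len(X))]
        c = np.argmin(np.linalg.norm(W - x, axis=1))             # Eq. (3)
        W[c] += eta0 * (1 - n / n_iter) * (x - W[c])             # Eq. (4), no neighbors
    labels = np.argmin(np.linalg.norm(X[:, None, :] - W[None], axis=2), axis=1)
    return [np.flatnonzero(labels == j) for j in (0, 1)]

def unsupervised_som_clustering(X, gamma=0.5):
    sigma_th = gamma * scatter(X)                                # target cohesiveness
    g_std, g_mean, _ = measures(X)                               # global metrics, Eq. (6)
    active, final = [np.arange(len(X))], []                      # sets A and B
    while active:                                                # evaluation loop
        idx = active.pop()
        if len(idx) < 2:                                         # singletons: finalize
            final.append(idx)
            continue
        l_std, _, l_min = measures(X[idx])                       # local metrics, Eq. (7)
        dominated = (g_std > l_std) and (g_mean < l_min)
        if (not dominated) or scatter(X[idx]) > sigma_th:        # (c1) or (c2): split
            parts = two_neuron_som_split(X[idx])
            if all(len(p) for p in parts):
                active.extend(idx[p] for p in parts)             # bipartition
            else:
                final.append(idx)                                # degenerate split: stop
        else:
            final.append(idx)                                    # finalization
    outliers = [c for c in final if len(c) == 1]                 # the outlier cluster
    return [c for c in final if len(c) > 1], outliers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.05, (40, 96)) for m in (0.2, 0.8)]
              + [rng.normal(3.0, 0.05, (1, 96))])                # two groups plus one outlier
clusters, outliers = unsupervised_som_clustering(X, gamma=0.5)
print([len(c) for c in clusters], len(outliers))
```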
Polishing: The above cluster splitting procedure is very effective in isolating outliers, and these are usually moved onto clusters containing a single element (singletons). Nevertheless, some of the resulting non-singleton clusters may contain data points that are very close to one another (according to, e.g., the Euclidean distance metric); this is because separating a single outlier from a cluster may at times require multiple splitting steps, which may entail scattering a uniform cluster into multiple, but (statistically) very similar, ones. To solve this, we use a final polishing procedure to rejoin (merge) similar clusters. A similar strategy was originally proposed in [38]. The merge works as follows: let $(\mathcal{C}_1, \mathcal{C}_2)$ be a cluster pair in $\mathcal{B}$. We first evaluate their union $\mathcal{C}_{12} = \mathcal{C}_1 \cup \mathcal{C}_2$, from which we obtain $\mathbf{G}_{\mathcal{C}_{12}}$, the feature matrix of $\mathcal{C}_{12}$, and its cohesiveness $\sigma_{\mathcal{C}_{12}}$. We compute the cohesiveness for each cluster pair in $\mathcal{B}$, and we merge the two clusters with the smallest one. This is repeated until a certain stopping condition is met. Two stopping conditions can be considered: (s1) we keep merging until the number of clusters equals a preset number $k$; (s2) we keep merging until in $\mathcal{B}$ there are no cluster pairs $(\mathcal{C}_1, \mathcal{C}_2)$ with $\sigma_{\mathcal{C}_{12}} \leq \sigma_{th}$.
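The polishing step can be sketched as follows under stopping condition (s2); `clusters` is a list of index arrays, such as the one produced by the sketch after Algorithm 2, and the function names are again ours.

```python
# Sketch of the polishing (merging) step with stopping condition (s2):
# repeatedly merge the cluster pair whose union has the smallest cohesiveness,
# as long as that cohesiveness does not exceed sigma_th.
import itertools
import numpy as np

def scatter(X):
    mu = X.mean(axis=0)
    return np.sqrt(np.mean(np.sum((X - mu) ** 2, axis=1)))

def polish(X, clusters, sigma_th):
    clusters = [np.asarray(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(scatter(X[np.concatenate((a, b))]), i, j)
                 for (i, a), (j, b) in itertools.combinations(enumerate(clusters), 2)]
        best, i, j = min(pairs)                      # most cohesive candidate union
        if best > sigma_th:                          # (s2): no pair is cohesive enough
            break
        merged = np.concatenate((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Condition (s1) would instead stop the merging as soon as a preset number of clusters k is reached.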
Discussion: Algorithm 2 iteratively splits the feature set into smaller subsets, based on the relation between global and local metrics and on a local target cohesiveness, $\sigma_{th}$. SOM classifiers are utilized due to their self-tuning and non-linear characteristics, and the decision (Voronoi) regions that separate the dataset into clusters are obtained in a fully unsupervised manner. We observe that splitting the data points (feature vectors) into progressively smaller subsets allows for a progressively increasing precision in the classification, i.e., at the beginning, we perform a coarse classification, which is subsequently refined in the following iterations. Most importantly, the decision as to whether to split is made based on self-similarity considerations, using local vs. global metrics, but also on the expected cohesiveness of the clusters. For real data, we found the self-similarity principle to be especially important (and often the only one that was used), whereas for more regular (synthetic) data patterns (e.g., same statistics across days and regularly spaced in terms of average measures), the second strategy had a higher impact. The best performance was obtained through a combination of these two approaches, which showed a robust behavior of the clustering algorithm in all of the considered settings. The final algorithm requires one to set a single parameter (γ), which represents our clue about the target cohesiveness of the true clusters. Note that the same threshold $\sigma_{th} = \gamma\,\sigma_{\mathbf{G}}$ is used in the splitting and in the polishing phases. The DBSCAN parameters ε and MinPts play a similar role.