1. Introduction and Background
Intelligent computer vision systems are becoming increasingly popular. They are a very useful tool in many applications that aid the decision-making process, for example, agriculture [1], structural health monitoring [2], robotics [3], surveillance [4], product quality and manufacturing [5], self-driving vehicles [6], medicine [7], and so on.
Image segmentation can be viewed as a useful partitioning of an image. As such, it is a central task in many computer vision systems, and one of the most difficult ones [8,9,10,11]. It can be supervised or unsupervised, as pointed out by Refs. [12,13]. In supervised image segmentation, we want to partition the image into semantically coherent regions; this can also be referred to as semantic segmentation [3,14]. As such, it is usually based on classification methods [13,15,16]. In unsupervised image segmentation, we are interested in extracting similar segments based on pixel positions and color intensity values with respect to a criterion [17]. As such, clustering algorithms are mostly used [11,12,17,18].
For supervised image segmentation, we usually need very large datasets to train a classification model, e.g., a deep neural network [2,14]. In unsupervised image segmentation, we run a clustering algorithm directly on the image under consideration [12,17]. Consequently, unsupervised image segmentation based on a clustering algorithm is weaker in performance than its classification-based counterpart. However, clustering-based methods do not require large training datasets, are faster, require less computational power, and, most importantly, can be used in conjunction with newer classification methods [3,12,13]. They are also much more useful than their supervised counterparts for certain domain-specific tasks where prior knowledge is lacking, whether due to a lack of plausible training images or other factors [12,13,19,20].
A mixture model is a compelling framework for many pattern recognition tasks [21,22,23,24]. Traditionally, it is used in machine learning as a clustering tool and, therefore, in computer vision for unsupervised image segmentation [22,25,26,27]. Mixture models can be seen as blob detectors, aimed at detecting regions of interest with similar colors or intensities in an image [9,25,28]. In this respect, similarly to fuzzy clustering algorithms [10,12,29], they are very promising for unsupervised image segmentation and can thus be useful for further feature extraction and detection of regions of interest [8,19].
Obtaining a useful segmentation with mixture models is a delicate process. First, the right mixture model must be chosen, i.e., Gaussian, Gamma, Weibull, or another that can capture the details correctly [11,25,30,31]. Next is the choice of the model selection criterion [32]. Finally, the estimation algorithm itself can be problematic [9]. All of the above can result in either under-segmentation (too few regions detected) or over-segmentation (too many segments detected) [9,28,33]. An over-segmented image can be regarded as noisy, while an under-segmented image is incomplete and thus inappropriate for use. To tackle this issue, in this paper, we investigate the usefulness of a deliberately over-segmented solution, provided by the estimated mixture model, together with methods for merging the detected regions to obtain a more useful segmentation.
Every component of the estimated mixture model represents a segment of the image. In order to reduce the number of unwanted segments, we are interested in merging different components of the mixture model [33]. By doing so, we remedy the problems that arise as a result of an inappropriate type of mixture model, the deficiencies of model selection criteria, and, to some extent, the drawbacks of the estimation algorithm [33,34,35]. We investigate a many-to-one mapping between mixture model components and clusters, since a one-to-one relationship between a mixture model component and a cluster may be insufficient [9,33].
This paper is organized as follows. We start with an explanation of mixture models and their application in clustering and image segmentation in Section 2. Section 3 is reserved for the derivation of the merging procedures and the criteria for merging. Section 4 describes the estimation algorithm. Section 5 discusses the optimal segmentation. Section 6 introduces the experimental setup and the metrics needed for evaluation, and in Section 7 we provide the experimental results. In Section 8 we provide a discussion of the experimental results, and Section 9 concludes the paper.
2. Mixture Models
Let $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ be a $d$-dimensional observed dataset of $n$ continuous observations $\mathbf{y}_j$. Each observation $\mathbf{y}_j$ is assumed to follow the probability density function (PDF):

$$f(\mathbf{y}_j \mid \Theta) = \sum_{l=1}^{c} w_l f_l(\mathbf{y}_j \mid \theta_l). \qquad (1)$$

Equation (1) gives the PDF of a mixture model. The PDF of a mixture model consists of $c$ weighted components that follow a simple parametric probability distribution, such as the Gaussian distribution, the Weibull distribution, and similar [36,37]. The component distribution PDF is given by $f_l(\cdot \mid \theta_l)$ and has the parameters $\theta_l$. For example, the PDF of the multivariate Gaussian distribution is

$$f_l(\mathbf{y}_j \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_l|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{y}_j - \boldsymbol{\mu}_l)^{\top} \boldsymbol{\Sigma}_l^{-1} (\mathbf{y}_j - \boldsymbol{\mu}_l) \right).$$

Component parameters $\theta_l$ are distribution specific. For Gaussian mixtures, those are the mean vector $\boldsymbol{\mu}_l$ and the covariance matrix $\boldsymbol{\Sigma}_l$. The weights $w_l$ have the properties of a convex combination, $w_l \geq 0$ and $\sum_{l=1}^{c} w_l = 1$ [38]. Thus, we can arrange the mixture model parameters in a set

$$\Theta = \{w_1, \ldots, w_c, \theta_1, \ldots, \theta_c\}$$

or, concisely and more conveniently,

$$\Theta = \{w_l, \theta_l\}_{l=1}^{c}.$$

The estimation of the parameters involves the estimation of the component weights $w_l$ and the component parameters $\theta_l$. Usually, we denote the estimates as $\hat{\Theta} = \{\hat{w}_l, \hat{\theta}_l\}_{l=1}^{c}$. The easiest way to estimate the mixture model parameters is via maximum likelihood, in other words, to find the parameters $\hat{\Theta}$ yielding the maximum of

$$L(\Theta \mid \mathbf{Y}) = \prod_{j=1}^{n} f(\mathbf{y}_j \mid \Theta).$$

It is often more convenient to maximize the log-likelihood function

$$\ell(\Theta \mid \mathbf{Y}) = \sum_{j=1}^{n} \log f(\mathbf{y}_j \mid \Theta), \qquad (6)$$

and the estimation can be written as

$$\hat{\Theta} = \arg\max_{\Theta} \ell(\Theta \mid \mathbf{Y}). \qquad (7)$$

The maximization problem described by Equation (7) has many inconvenient aspects, mainly because the number of components $c$ in the mixture model is not known in advance. First, it is known that the likelihood value increases with the number of components due to the overall better fit [28,39]. Second, without knowing the number of components, it is impossible to differentiate (6) and obtain a solution to the maximization problem. Lastly, the use of a global optimization algorithm is not convenient due to the large number of optimization parameters involved in a mixture model [38]. Thus, some type of model selection procedure is commonly involved in the estimation process [38].
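To make the notation concrete, the following minimal sketch (not from the original paper) evaluates the mixture PDF of Equation (1) and the log-likelihood of Equation (6) for a two-component Gaussian mixture; the weights, means, covariances, and data are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-component Gaussian mixture in d = 2 dimensions
# (weights w_l, means mu_l, covariances Sigma_l are arbitrary assumptions).
w = np.array([0.4, 0.6])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 2.0 * np.eye(2)]

def mixture_pdf(Y, w, mu, Sigma):
    """Equation (1): f(y | Theta) = sum_l w_l f_l(y | theta_l)."""
    return sum(w_l * multivariate_normal.pdf(Y, mean=m, cov=S)
               for w_l, m, S in zip(w, mu, Sigma))

# Log-likelihood of a dataset Y, Equation (6), with placeholder data.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))
log_lik = np.log(mixture_pdf(Y, w, mu, Sigma)).sum()
print(log_lik)
```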
2.1. Model Selection
Usually, we are interested in estimating the number of components $c$ and the accompanying parameters $\Theta$; however, in some cases, the number of components is known [38], in which case the model selection procedure is not needed; this occurs in the minority of cases. In the majority of cases, we are faced with the ubiquitous problem of maximizing (6) when the number of components is unknown. Nonetheless, to differentiate Equation (6), we need to know the number of components $c$. Therefore, it is common practice to try out different numbers of components and estimate the fit. Thus, it is quite common and justified to denote the mixture model in relation to the number of components $c$,

$$f(\mathbf{y}_j \mid \Theta_c) = \sum_{l=1}^{c} w_l f_l(\mathbf{y}_j \mid \theta_l).$$

Additionally, because the likelihood function increases with the number of components, the selection of the optimal model based solely on the maximum likelihood can be misleading. Therefore, different model selection criteria are needed to choose the best fit. Generally, these are called information criteria [38]. We will denote them as $\mathrm{IC}$. The model selection procedure is described by Algorithm 1 [28,39].
In summary, we estimate the optimal mixture model $\hat{\Theta}_c$ for different numbers of components $c$. If the number of components is known, we can solve Equation (7) using, for example, the EM (expectation-maximization) algorithm, denoted by the EM function in line 4 of Algorithm 1. EM is a local optimization algorithm mainly used to estimate the maximum likelihood parameters of mixture models when some initial parameter estimates are known in advance [38]. The optimal model is obtained for each number of components, and the best model is selected from among them. The model giving the optimal value of the selected information criterion is thus the best model available. ICF in line 5 of Algorithm 1 represents the generic function for estimating the chosen information criterion.
Algorithm 1: Model selection procedure
Input: dataset $\mathbf{Y}$ and the set of candidate numbers of components $\{c_{\min}, \ldots, c_{\max}\}$
Output: optimal mixture model parameters $\hat{\Theta}_{\mathrm{opt}}$ and optimal number of components $c_{\mathrm{opt}}$
1: $\mathrm{IC}_{\mathrm{opt}} \leftarrow \infty$, $\hat{\Theta}_{\mathrm{opt}} \leftarrow \emptyset$, $c_{\mathrm{opt}} \leftarrow 0$
2: foreach $c \in \{c_{\min}, \ldots, c_{\max}\}$ do
3:   Initialize $\Theta_c^{(0)}$
4:   $\hat{\Theta}_c$ = EM($\mathbf{Y}$, $\Theta_c^{(0)}$)
5:   Estimate criterion $\mathrm{IC}_c$ = ICF($\mathbf{Y}$, $\hat{\Theta}_c$)
6:   if $\mathrm{IC}_c < \mathrm{IC}_{\mathrm{opt}}$: $\mathrm{IC}_{\mathrm{opt}} \leftarrow \mathrm{IC}_c$, $\hat{\Theta}_{\mathrm{opt}} \leftarrow \hat{\Theta}_c$, $c_{\mathrm{opt}} \leftarrow c$
7: end
The importance of using a model selection procedure is illustrated in Figure 1. Different models were estimated for different numbers of components $c$. For each model, the Bayesian information criterion (BIC) was calculated [40]. The BIC is defined as

$$\mathrm{BIC} = -2\,\ell(\hat{\Theta} \mid \mathbf{Y}) + M \log n,$$

where $M$ is the number of parameters in the model. It is quite obvious that the term $M \log n$, given that $M$ increases with the number of components $c$, is a balance factor that compensates for the increase in the log-likelihood value with the increase in the number of components $c$.
BIC is also one of the most commonly used information criteria [41]. Choosing the model with the optimal value of BIC should provide the best solution [33,34]. Although every estimated model yields one of the local optima of the log-likelihood function from (6), different models provide more or less of the needed information. For example, a model with two components is clearly an under-fit, while a model with ten components gives an over-fit. In contrast, a model with five components seems to be the best selection, at least when the BIC criterion is used.
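As an illustration of Algorithm 1 with the BIC criterion, the sketch below loops over candidate numbers of components using scikit-learn's GaussianMixture, which fits the parameters by EM and reports BIC directly; note that it initializes EM with k-means rather than with the REBMIX strategy used later in the paper, so it only approximates the actual procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_by_bic(Y, c_max=10, random_state=0):
    """Fit mixtures with c = 1..c_max components and keep the lowest BIC
    (a simplified stand-in for Algorithm 1)."""
    best_c, best_bic, best_model = None, np.inf, None
    for c in range(1, c_max + 1):
        gmm = GaussianMixture(n_components=c, random_state=random_state).fit(Y)
        bic = gmm.bic(Y)                      # -2*log-likelihood + M*log(n)
        if bic < best_bic:
            best_c, best_bic, best_model = c, bic, gmm
    return best_c, best_bic, best_model

# Example with synthetic two-group data.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
c_opt, bic_opt, model = select_by_bic(Y)
print(c_opt, round(bic_opt, 1))
```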
2.2. Mixture Model-Based Clustering and Image Segmentation
In a mixture model framework, it is generally considered that each component in a model represents one homogeneous unit [42]. In other words, that homogeneous unit represents a sub-population of the whole population. In a clustering sense, that homogeneous unit represents a cluster. A cluster is a group of observations from a dataset that share similar characteristics. Thus, every observation that arose from a certain component in a mixture model is considered to form a cluster. For every observation $\mathbf{y}_j$ in dataset $\mathbf{Y}$, we can estimate the posterior probability $\tau_{jl} = w_l f_l(\mathbf{y}_j \mid \theta_l) / \sum_{i=1}^{c} w_i f_i(\mathbf{y}_j \mid \theta_i)$ that it arose from the $l$th mixture model component. The cluster assignment is made according to the maximum a posteriori (MAP) rule

$$\hat{l}_j = \arg\max_{l \in \{1, \ldots, c\}} \tau_{jl}.$$

We are interested in labeling all of the observations in dataset $\mathbf{Y}$: ergo, a clustering scheme or simply a clustering. Image segmentation is a similar process. We want to identify $k$ partitions of the image which are coherent in some way. This is illustrated in Figure 2. The original image in Figure 2a comes from the Berkeley dataset [43], and the second image (Figure 2b) is the human segmentation provided by the participants in the study.
Many clustering algorithms are used for image segmentation [44]. Image segmentation with a mixture model can be conducted using pixel intensity and position; however, using the pixel position values in the estimation of the mixture model often degrades the results due to their non-random nature. Thus, we are interested only in modeling the PDF of the pixel intensities, from which we produce a clustering (scheme). Additionally, the Gaussian distribution is usually used for the component distribution [9,31]. This is a popular choice because the multivariate extension is straightforward; thus, images with multiple channels (color images as opposed to monochrome images) can be segmented. Furthermore, mixtures of Gaussian distributions are the easiest and most straightforward to estimate. However, using the Gaussian distribution as the component distribution brings many problems. The first is that the pixel intensity is bounded, while the Gaussian distribution is not. The second problem arises from the fact that the Gaussian PDF is symmetric around the mode. When the distribution around a mode is not symmetric (i.e., the left and right tails are not equal), multiple Gaussian components are estimated to compensate for this. This leads to many clusters, which then leads to many segments that are not actually present. Although these problems mainly arise with the Gaussian mixture model, the problem of estimating too many components is also present in other mixture models, especially when estimating the PDF of an image.
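To illustrate how the MAP rule turns a fitted mixture into a segmentation, the following sketch fits a Gaussian mixture to the RGB pixel values of an image and reshapes the MAP labels back into a label map; the file name and the number of components are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from PIL import Image

# Load an RGB image (path is a placeholder) and flatten it to an (n, 3)
# dataset of pixel color values; pixel positions are deliberately not used.
img = np.asarray(Image.open("example.jpg").convert("RGB"), dtype=float) / 255.0
h, w, _ = img.shape
Y = img.reshape(-1, 3)

# Fit a Gaussian mixture with an assumed number of components c = 5.
gmm = GaussianMixture(n_components=5, random_state=0).fit(Y)

# MAP rule: assign each pixel to the component with the highest posterior.
labels = gmm.predict(Y).reshape(h, w)

# Posterior probabilities tau_jl, used later by the merging criteria.
tau = gmm.predict_proba(Y)
print(labels.shape, tau.shape)
```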
3. Merging of Redundant Components in Mixture Model-Based Clustering
Let us look at the quite famous Old Faithful geyser dataset [45]. The dataset contains 272 observations of two continuous variables, the first named eruptions and the second named waiting, as illustrated in the plots in Figure 3. The first variable, eruptions, gives the geyser eruption time in minutes, whilst the second one, waiting, gives the waiting time to the next eruption, also in minutes. Figure 3a visualizes the dataset without identified clusters, while Figure 3b,c give the clustering solutions obtained from the estimated two-component and three-component Gaussian mixture models.

It seems quite natural that the two-component clustering solution is the true one. Even without looking at the solution in Figure 3b, merely by observing the uncolored points in Figure 3a, there is convincing evidence that there are two groups of points. However, let us examine the obtained values of the IC criteria in Table 1. Even though the difference in IC values is not dramatic, the three-component solution is the optimal solution judging by two different ICs, namely AIC (Akaike information criterion [46]) and BIC. Judging by their values, the three-component solution seems to be the preferred solution. In contrast, the ICL (integrated classification likelihood [42]) criterion has a lower value for the two-component mixture, meaning that this model would be selected as optimal. This is expected, because the ICL criterion favors the solution that is more likely to be the optimal clustering solution. The sole reason for the derivation of the ICL criterion is to bypass the tendency of BIC to favor a model that is not an optimal clustering solution.
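The flavor of Table 1 can be reproduced by fitting two- and three-component Gaussian mixtures to the Old Faithful data and comparing their AIC and BIC values; the sketch below assumes a local faithful.csv file with columns eruptions and waiting, and it omits ICL, which scikit-learn does not provide.

```python
import pandas as pd
from sklearn.mixture import GaussianMixture

# Assumed local copy of the Old Faithful geyser data (272 rows,
# columns "eruptions" and "waiting").
faithful = pd.read_csv("faithful.csv")
Y = faithful[["eruptions", "waiting"]].to_numpy()

for c in (2, 3):
    gmm = GaussianMixture(n_components=c, random_state=0).fit(Y)
    print(c, round(gmm.aic(Y), 1), round(gmm.bic(Y), 1))
```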
3.1. Entropy-Based Merging
While the ICL criterion seems to be sufficient for the Old Faithful geyser dataset and gives an optimal clustering solution, it is far from bulletproof. The main problem is that it can provide a clustering solution with too few clusters in highly overlapping situations [34]. In image segmentation, such situations are frequent. Another point is made in [34]: although the number of components in the mixture model selected by, e.g., the BIC criterion is well suited for the underlying PDF of a dataset, it is not necessarily well suited for clustering. That may well mean that some clusters in the clustering are not well represented by only one component of the mixture model, meaning that a one-to-one scenario is not practical. This goes hand in hand with many density-based clustering approaches, especially the likes of the mean-shift or modal EM algorithms [47].
To propose a solution, Baudry et al. [34] developed a merging procedure based on the entropy of the soft clustering, in other words, of the posterior probabilities. Starting with a number of components $c$ in the mixture model that fits the dataset well and a number of clusters $k = c$ equal to the number of components in the mixture model, the idea is to build a sequence of clusterings with merged pairs of clusters. Thus, each newly created clustering scheme contains $k - 1$ clusters. At each stage, merging is considered between each and every pair of clusters, and the most promising pair for creating the $k - 1$ clustering scheme is the one for which the newly obtained $k - 1$ clustering scheme maximizes the criterion

$$\Delta \mathrm{Ent}(l, l') = \sum_{j=1}^{n} \left( \tau_{j, l \cup l'} \log \tau_{j, l \cup l'} - \tau_{jl} \log \tau_{jl} - \tau_{jl'} \log \tau_{jl'} \right), \qquad (12)$$

where $\tau_{jl}$ gives the posterior probability that the $j$th observation belongs to the $l$th component/cluster, $\tau_{jl'}$ gives the posterior probability that the $j$th observation belongs to the $l'$th component/cluster, and $\tau_{j, l \cup l'}$ is the posterior probability of the merged cluster combined from components/clusters $l$ and $l'$. The posterior $\tau_{j, l \cup l'}$ can be simply calculated as

$$\tau_{j, l \cup l'} = \tau_{jl} + \tau_{jl'}, \qquad (13)$$

and the obvious condition $\sum_{l=1}^{k} \tau_{jl} = 1$ should be met. The number of pair combinations to examine is therefore equal to $k(k-1)/2$ when the current number of components/clusters is $k$.
The final selected merged clustering scheme fits the data as well as the initial solution provided by the $c$ mixture model components, since it is based on the same mixture model and the likelihood does not change; only the number and definition of the clusters differ. In contrast, the likelihood of the mixture model selected by the ICL criterion can be worse. The method described above yields just one suggested set of clusters for each $k$, and the user can choose between them on substantive grounds. For example, the number of clusters in the final clustering scheme can be selected to match the number of components of the mixture model selected with the ICL criterion. Otherwise, because in each merging step we obtain the entropy of the merged pair of clusters from the previous clustering scheme, the elbow method can be applied to a plot of the entropy variation against the number of clusters to select the final number of clusters $k$ [34].
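A minimal NumPy sketch of the entropy criterion in Equation (12) follows: given the posterior matrix tau (n observations by k clusters), it evaluates the entropy change for every pair of clusters and returns the pair whose merging maximizes the criterion. This is an illustrative reading of the procedure of [34], not a reference implementation.

```python
import numpy as np
from scipy.special import xlogy   # xlogy(p, p) = p*log(p) with 0*log(0) = 0

def best_entropy_merge(tau):
    """Return (l, l_prime, criterion) for the pair of clusters whose merging
    maximizes the entropy criterion of Equation (12)."""
    k = tau.shape[1]
    best = (None, None, -np.inf)
    for l in range(k):
        for lp in range(l + 1, k):
            merged = tau[:, l] + tau[:, lp]          # Equation (13)
            crit = np.sum(xlogy(merged, merged)
                          - xlogy(tau[:, l], tau[:, l])
                          - xlogy(tau[:, lp], tau[:, lp]))
            if crit > best[2]:
                best = (l, lp, crit)
    return best

# Example with an arbitrary random posterior matrix whose rows sum to 1.
rng = np.random.default_rng(0)
tau = rng.dirichlet(np.ones(5), size=200)
print(best_entropy_merge(tau))
```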
3.2. Merging Based on Directly Estimated Mis-Classification Probabilities
The second method is described in [35]. Like the previous method, it depends on the soft clustering, or in other words the posterior probabilities $\tau_{jl}$, and on the hard clustering, or indicator values $Z_{jl}$. The indicator value $Z_{jl}$ is a binary variable indicating whether the observation $\mathbf{y}_j$ belongs to the $l$th component/cluster. However, here we are not interested in the probability of an observation being clustered to the $l$th component/cluster, but in the misclassification probability between two components of the mixture model, $l$ and $l'$. The misclassification probability can be summarized as the distance between two probability distributions, e.g., for two Gaussian distributions, the Bhattacharyya distance [35], but here it is treated in a more generalized fashion and, additionally, one more focused on the clustering application. The misclassification probability can be estimated directly as

$$\hat{p}_{l' \mid l} = \frac{\sum_{j=1}^{n} Z_{jl'} \, \tau_{jl}}{\sum_{j=1}^{n} \tau_{jl}}, \qquad (16)$$

where $\tau_{jl}$ is the soft clustering for the $l$th component/cluster and $Z_{jl'}$ is the hard clustering for the $l'$th component. It is also important to note that $\hat{p}_{l' \mid l}$ is not symmetric, meaning that in general $\hat{p}_{l' \mid l} \neq \hat{p}_{l \mid l'}$; however, this can occasionally be violated. Thus, the most optimistic criterion, $\max(\hat{p}_{l' \mid l}, \hat{p}_{l \mid l'})$, should be used to select the merging pair of components/clusters. Again, it is quite obvious that the merged pair will result in the same posterior probabilities as for the previously described method (Equation (13)).
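The directly estimated misclassification probabilities of Equation (16) can be sketched in the same setting: hard labels are obtained with the MAP rule and the symmetrized maximum is used to rank candidate pairs. Again, this is an illustrative reading of [35] rather than a reference implementation.

```python
import numpy as np

def demp_matrix(tau):
    """Estimate p(l' | l) as in Equation (16): the fraction of the posterior
    mass of cluster l that is hard-assigned (MAP) to cluster l'."""
    k = tau.shape[1]
    labels = tau.argmax(axis=1)                  # hard clustering (MAP)
    Z = np.eye(k)[labels]                        # one-hot indicator matrix Z_jl
    # p[l', l] = sum_j Z_jl' tau_jl / sum_j tau_jl
    return (Z.T @ tau) / tau.sum(axis=0)

def best_demp_merge(tau):
    """Pick the pair (l, l') maximizing max(p(l'|l), p(l|l'))."""
    p = demp_matrix(tau)
    sym = np.maximum(p, p.T)
    np.fill_diagonal(sym, -np.inf)               # ignore self-pairs
    l, lp = np.unravel_index(np.argmax(sym), sym.shape)
    return int(l), int(lp), float(sym[l, lp])

rng = np.random.default_rng(0)
tau = rng.dirichlet(np.ones(5), size=200)
print(best_demp_merge(tau))
```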
Merely by observing the merging criteria of both methods, it is clear that the two procedures will, in general, result in different final clustering schemes. Nonetheless, they share several similarities, given that $\sum_{l=1}^{k} \tau_{jl} = 1$ must hold for any number of clusters $k \leq c$. The most important similarity, at least for the image segmentation task, is the merging of overlapping components. Namely, for two components that do not overlap significantly, the posterior probabilities will obey $\tau_{jl} \approx 1$ and $\tau_{jl'} \approx 0$, or vice versa; in other words, one component will have posterior probabilities close to 1, while the other will have posterior probabilities close to 0. Thus, both criteria (Equations (12) and (16)) will be close to 0, suggesting that the components should not be merged. Otherwise, if a significant overlap between components exists, both criteria will yield considerably higher values, suggesting that the components should be merged. Simply put, both methods progress toward the solution that minimizes the overlap between the posterior probabilities. If the overlap between the mixture model components is not high, both methods will yield similar solutions; however, in the presence of high overlap between the components of the mixture model, the two methods will give different results.
3.3. Component Merging Mechanism Inside Mixture Model
An advantage of both selected merging methods and criteria lies in the fact that they can be used with any component distribution in the mixture model, as they are both independent of the distribution parameters [35]. Instead of recalculating the best possible merging pair based on the distribution component parameters, we calculate the best possible merging pair based on the posterior probabilities $\tau_{jl}$ and the clustering labels (indicator values) $Z_{jl}$. Thus, both methods have a wider range of application. Furthermore, instead of recalculating new component parameters for each newly merged pair, we simply store the merging tree and only update the posterior probabilities with Equation (13) to reflect the current/desired number of clusters. The cluster indicator values for a newly merged cluster can be obtained merely by summing the indicator values of the previously merged pair of clusters, $Z_{j, l \cup l'} = Z_{jl} + Z_{jl'}$, and the cluster labels by the union of the cluster labels of the cluster pair to be merged. Essentially, we are performing a form of hierarchical clustering on the mixture model components. By doing so, we maintain the flexibility of a merged cluster being defined by multiple mixture model components.
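A sketch of this bookkeeping is given below: no component parameters are re-estimated, only a merge tree is stored while the posterior matrix is updated by summing the merged columns (Equation (13)). The pair-selection function is pluggable; either of the two criteria sketched above can be passed in, while the self-contained example uses a crude overlap proxy that is not one of the paper's criteria.

```python
import numpy as np

def merge_sequence(tau, choose_pair, k_final):
    """Repeatedly merge the best pair of clusters until k_final clusters remain.
    `choose_pair(tau)` must return (l, l_prime, criterion); the merge tree
    records which original components form each current cluster."""
    tau = tau.copy()
    groups = [[l] for l in range(tau.shape[1])]   # label groups (merge tree leaves)
    tree = []
    while tau.shape[1] > k_final:
        l, lp, crit = choose_pair(tau)
        tree.append((list(groups[l]), list(groups[lp]), crit))
        tau[:, l] += tau[:, lp]                   # Equation (13)
        tau = np.delete(tau, lp, axis=1)
        groups[l] = groups[l] + groups[lp]        # union of cluster labels
        groups.pop(lp)
    return tau, groups, tree

def overlap_pair(tau):
    """Toy chooser used only to keep this sketch self-contained: merge the pair
    with the largest mean posterior product (NOT the criteria of Sections 3.1/3.2)."""
    k = tau.shape[1]
    best = (0, 1, -np.inf)
    for l in range(k):
        for lp in range(l + 1, k):
            crit = float(np.mean(tau[:, l] * tau[:, lp]))
            if crit > best[2]:
                best = (l, lp, crit)
    return best

rng = np.random.default_rng(0)
tau0 = rng.dirichlet(np.ones(6), size=300)        # arbitrary 6-cluster soft clustering
tau3, groups, tree = merge_sequence(tau0, overlap_pair, k_final=3)
print(groups)
```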
5. Optimal Image Segmentation
In revising the merging criteria, we have already given some of the heuristics mentioned in [34,35] that indicate when it is optimal to stop merging the mixture components and yield the final clustering solution. We revise them here and propose some more practical ones, especially with respect to image segmentation. As mentioned before, one could require that the final number of clusters after merging be equal to the number of components of the mixture model selected with the ICL criterion. On the other hand, the merging process could be stopped if the merging criterion falls below a predetermined lower bound for that specific merging criterion. For example, [35] gives a lower bound for the DEMP merging criterion in his experiments. The introduction of new hyperparameters seems, at least to us, somewhat problematic. They might work in one situation and not in another. Their values vary considerably across examples, leaving a lot of leeway to the user. This can be encouraging and annoying at the same time, and we would like to avoid it. In addition, such hyperparameters are usually difficult to automate. The best we can do is to use the elbow method to determine the optimal value; in other words, the heuristic here is the knee of the curve obtained by plotting the values of the merging criterion against the number of clusters. In clustering, we justify this with the notion of the loss of explanation caused by subsequent merging. We further note that using the elbow method for our purposes leads to the following problem: the digitized nature of the image intensities may result in multiple elbows (or knees), leading to multiple plausible stops, which makes this heuristic unappealing. Finally, we may also want to use a different metric that does not depend on the mixture model and is more focused on the current application, namely image segmentation.
Measuring entropy has proven to be a useful metric in many artificial intelligence applications on images [50,51,52,53]. The entropy of an image is simply defined as

$$H = -\sum_{j} p_j \log p_j,$$

where $j$ indexes the unique colors in the image and $p_j$ is the probability of occurrence of the $j$th unique color. Segmenting the image using the mixture model with $c$ components yields the entropy estimate

$$\hat{H}(c) = -\sum_{l=1}^{c} \hat{w}_l \log \hat{w}_l.$$

It is trivial to see that subsequently merging the mixture components reduces the estimated entropy, since $-(\hat{w}_l + \hat{w}_{l'}) \log(\hat{w}_l + \hat{w}_{l'}) \leq -\hat{w}_l \log \hat{w}_l - \hat{w}_{l'} \log \hat{w}_{l'}$ when $\hat{w}_l, \hat{w}_{l'} > 0$, due to the convex combination property of the mixture component weights. Therefore, to approach a practical solution, we propose an average of the image segmentation entropy, which we denote $\bar{H}(k)$ (Equation (20)). Finally, it should be noted that this average of the image segmentation entropy is in no way related to the average entropy of [54], where the goal is the average entropy estimate of a continuous random variable. The optimal image segmentation can therefore be chosen so as to maximize the average of the image segmentation entropy $\bar{H}(k)$.
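A small sketch of the image entropy defined above is given below; the probabilities $p_j$ are estimated from the relative frequencies of the unique values (colors of the original image, or labels of the segmented image), and the segmentation entropy is also computed from illustrative mixture weights. The proposed average of the segmentation entropy (Equation (20)) is not reproduced here.

```python
import numpy as np

def image_entropy(values):
    """Entropy H = -sum_j p_j log p_j over the unique values (colors or labels).
    For a color image, pass a flattened array of packed pixel colors; for a
    segmented image, pass the flattened label map."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Gray-scale example with an arbitrary random 8-bit image.
rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(64, 64))
print(image_entropy(gray.ravel()))

# Entropy of a segmentation estimated from illustrative mixture weights w_l.
w = np.array([0.5, 0.3, 0.2])
print(float(-(w * np.log(w)).sum()))
```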
6. Experimental Setup
In this section, we further describe the experiments, experimental algorithm settings, different algorithms used, and metrics used for the results evaluation. Experiments are conducted on digital images. Digital images can have multiple channels. Usually, color digital images have three or more channels. Each channel represents one of the primary colors and its intensity. However, monochrome digital images (gray-scale) have only one channel, which contains only the amount of light, i.e., the intensity.
Color digital images carry much more information; thus, they are more demanding to process. In contrast, monochrome images are easier to process, as they contain less information. Some algorithms used for image processing are also only capable of processing gray-scale images; therefore, color images need to be transformed into gray-scale. A mixture model-based approach is capable of handling color images and, therefore, also gray-scale images. Because gray-scale processing requires fewer parameters to be estimated, the approach has a better prognosis on gray-scale images. Nonetheless, the results for color images can be promising, as seen in [28].
Thus, the mixture model clustering approach will be applied to both gray-scale and color images to obtain segmentations. For the mixture model, we will use the Gaussian mixture model. For the information criterion, we will use BIC; for the model selection estimation strategy, we will use the single REBMIX and EM strategy from the rebmix R package. The strategy is fully described in [39]. In essence, we combine the REBMIX and EM algorithms to obtain the best possible estimates while keeping the computational overhead minimal by providing the smoothing parameter K to the REBMIX pre-processing step. The value of the parameter K is described in the previous section. The maximum number of components to consider in the model selection procedure is fixed in advance. The merging of mixture model components with the entropy criterion, explained in Section 3.1, is referred to as ENT, and the merging of mixture model components with directly estimated misclassification probabilities, explained in Section 3.2, is referred to as DEMP. In addition, a single REBMIX and EM strategy with the ICL criterion is used as a baseline approach for optimal image segmentation with Gaussian mixture models. We refer to this as ICL.
We will use several useful external clustering metrics to quantitatively evaluate various image segmentation results.
Remark 1. For completeness, we reiterate that the terms segmentation and clustering are used interchangeably, as they are equivalent in this context and refer to the same thing. For aesthetic reasons, we use the term segmentation when we want to emphasize the clustering application of image partitioning.
We use precision $P$, recall $R$, the Dice score $D$, the Jaccard score $J$, the adjusted Rand index $ARI$, and the adjusted mutual information $AMI$. Precision and recall are self-explanatory: the former is the frequency of correctly predicted positives among all predicted positives, and the latter is the frequency of correctly predicted positives among all true positives. The Dice score is the harmonic mean of precision and recall, and the Jaccard score, better known as the Intersection over Union, is the frequency of correctly predicted positives among all true and predicted positives. Formally, we define them with the following equations:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad D = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}, \quad J = \frac{TP}{TP + FP + FN},$$

where, in clustering,
True positives ($TP$) is the number of pairs of points belonging to the same cluster in both the predicted and the true clustering;
False positives ($FP$) is the number of pairs of points belonging to the same cluster in the predicted clustering but to different clusters in the true clustering;
False negatives ($FN$) is the number of pairs of points belonging to the same cluster in the true clustering but to different clusters in the predicted clustering;
True negatives ($TN$) is the number of pairs of points that do not belong to the same cluster in either the predicted or the true clustering.
These four metrics capture several meaningful aspects of the segmentation quality obtained with the different clustering algorithms compared to the ground truth (i.e., the perfect or true segmentation). On the other hand, we use $ARI$ and $AMI$ to provide a measure of agreement between two clustering solutions, i.e., the predicted and the true clustering, that takes into account the randomness of cluster assignments. This is especially important when the numbers of clusters in the predicted and true clusterings differ and when the number of predicted clusters is large. $ARI$ is defined on the range $[-1, 1]$, with values less than 0 indicating that the clustering is worse than a random cluster assignment, while values greater than zero generally quantify the degree of agreement between the clusterings. $AMI$ is defined on the range $[0, 1]$, with values closer to 0 indicating largely independent predicted and true clusterings, and values closer to 1 indicating stronger agreement between the predicted and true clusterings. The equations for $ARI$ and $AMI$ are somewhat unwieldy, and to avoid further repetition and unnecessary lengthening of the text, we omit them here; interested readers may consult [55], for example.
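The pair-counting metrics can be computed from a contingency table between the true and predicted label maps, and ARI and AMI are available in scikit-learn; the sketch below is an illustrative implementation that assumes flattened integer label maps.

```python
import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def pair_counting_scores(labels_true, labels_pred):
    """Precision, recall, Dice, and Jaccard over pairs of points."""
    C = contingency_matrix(labels_true, labels_pred)
    tp = comb(C, 2).sum()                         # pairs together in both clusterings
    tp_fp = comb(C.sum(axis=0), 2).sum()          # pairs together in the prediction
    tp_fn = comb(C.sum(axis=1), 2).sum()          # pairs together in the ground truth
    fp, fn = tp_fp - tp, tp_fn - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / (tp + fp + fn)
    return precision, recall, dice, jaccard

# Example with two arbitrary label maps (flattened segmentations).
rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=1000)
pred = rng.integers(0, 4, size=1000)
print(pair_counting_scores(truth, pred))
print(adjusted_rand_score(truth, pred), adjusted_mutual_info_score(truth, pred))
```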
Finally, a note on the clustering algorithms used alongside mixture modeling. There are many clustering algorithms, but the most popular ones, and thus the ones we use for comparison here, fall into the following two groups:
the EM algorithm with Gaussian mixtures, denoted as GMM [38], K-means, denoted as KMEANS [56], and Fuzzy C-means, denoted as FCMEANS [57];
and Mean-shift, denoted as MS [58], density-based clustering, denoted as DBSCAN [59], and Ordering Points To Identify the Clustering Structure, denoted as OPTICS [60].
In the first group of clustering algorithms, the number of clusters must be determined in advance. We use these algorithms in the experiments where we want to obtain a clustering solution with a desired number of clusters (segments). The second group of clustering algorithms estimates the optimal number of clusters based on internal criteria, providing the optimal clustering; we use these algorithms for the optimal segmentation of images. For their implementation, we use the scikit-learn library for the Python programming language [61]. To use them in the context of image segmentation, we followed the procedure of [44]. First, we used the SEEDS (Superpixels Extracted via Energy-Driven Sampling) superpixel algorithm [62] (from the OpenCV library) to reduce the number of pixels in the image. In addition, we tried different values of the required parameters to obtain the best possible segmentation on several images. Once we were satisfied, we kept these values for the remaining evaluation.
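For the second group of algorithms, a minimal comparison sketch on pixel color features is shown below; it uses scikit-learn's MeanShift, DBSCAN, and OPTICS on a random subsample of pixels instead of SEEDS superpixels, so it only approximates the pipeline described above, and all parameter values are illustrative.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import MeanShift, DBSCAN, OPTICS

# Load an image (placeholder path) and subsample pixels to keep the
# density-based algorithms tractable (the paper uses SEEDS superpixels instead).
img = np.asarray(Image.open("example.jpg").convert("RGB"), dtype=float) / 255.0
Y = img.reshape(-1, 3)
rng = np.random.default_rng(0)
sample = Y[rng.choice(len(Y), size=min(5000, len(Y)), replace=False)]

for name, algo in [
    ("MS", MeanShift(bin_seeding=True)),
    ("DBSCAN", DBSCAN(eps=0.05, min_samples=20)),
    ("OPTICS", OPTICS(min_samples=20)),
]:
    labels = algo.fit_predict(sample)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    print(name, n_clusters)
```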
8. Discussion
Merging the components of a mixture model based on the entropy or misclassification criteria appears to be a very useful approach for quickly improving image segmentation. Indeed, using a model selection procedure with a non-clustering information criterion such as BIC allows greater flexibility in image segmentation than simply using a clustering information criterion, e.g., ICL. The explanation lies in the fact that, for many images, the segment-wise color distribution differs from the Gaussian distribution. The Gaussian distribution remains attractive, however, because of its simple definition and straightforward application. The segment color distribution of a given image may be multimodal or skewed; the BIC criterion is then sufficient to model the underlying PDF of the image, but at the cost of a large number of components. ICL reduces the number of components by penalizing highly overlapping components; the price, however, is that such components may not be well suited for modeling individual segments. We have seen that neither of them is capable of handling image segmentation on its own. Merging components fixes this problem and improves the image segmentation and thus the applicability of the mixture model.
However, upon visual inspection of the segmentations, we noticed that the two merging criteria introduced here seem to suffer from an obvious drawback of the hierarchical merging of components. More precisely, only one pair of components is merged in each stage of the merging process. In this way, the merge estimation is simplified, because at each stage the merging criterion is recalculated only for the remaining untouched and merged components. However, this presents an obvious problem when the final clustering solution is not just a few stages away from the original clustering solution. Both criteria clearly focus on merging smaller components with larger ones and, likewise, on merging smaller clusters with larger ones; at each stage, they increase the size of the already large cluster, so to speak. For example, if the final merge is 10 stages away from the original solution, a smaller but informative cluster may already have been merged because of its size. The smaller cluster will avoid being merged with the larger one only if its centroid is far away in the feature space. The optimal segmentation heuristic presented here naively fixes this, as the maximum value of Equation (20) favors more clusters of equal size. The problem could also be circumvented by performing multiple merges in each stage. This should be considered more critically, however, as it could lead to further undesirable artifacts.
Here, we have dealt with the unsupervised segmentation of images using only the color pixel values. It is also known that using spatial information, i.e., the pixel positions, can enrich the segmentation, especially when the image is contaminated with a lot of noise; see [12] for example. However, including this information as two additional feature vectors does not lead to better results, since the pixel positions are not random variables; they must be exploited through well-defined neighborhood systems with spatial interactions. As a starting point, we decided to explore merging only for the color pixel values, since the additional inclusion of spatial information through a well-defined theory, such as hidden Markov random fields [38], can only improve the results.
Finally, we would like to discuss the high computational cost of estimating the mixture model. Using the predefined value of K for the pre-processing step does not reduce the size of the dataset, but only makes it more compact. We obtained sufficiently fast results for gray-scale images, but the speed was not satisfactory for color images. The value K = 256 in three-dimensional color space results in 16,777,216 bins. Most of these are empty, and we always obtain a smaller number of non-empty bins than the number of pixels in the image; however, most bins contain only one pixel. Decreasing the value of K also decreases the size of the dataset and affects the mixture estimation, especially with the EM algorithm. On the other hand, if we reduce the value of K sufficiently, we obtain significantly better computation times than all the other algorithms. Nevertheless, the reduction should be critically evaluated and some guidelines need to be given, on which we will focus further in the future.