1. Introduction
Voting methods are ubiquitously used in human and artificial intelligence to improve the accuracy of automatic as well as hand-made annotations. The theoretical reason is simple: assuming that most individual annotators are relatively good (more often right than wrong) and somewhat independent, then in most situations a majority of the annotators will be right while only a minority will be wrong, so by voting we are more likely to obtain a correct annotation than with most individual annotators. Of course, when there are too many bad annotators who do not know what they are doing, voting methods may be counter-productive: the results of good annotators will be drowned out by the results of bad annotators.
In this paper, we present our research on a novel voting method for image segmentation problems in AI, suggested to us by Professor Nguyen Tien Zung, Founder of Torus Actions SAS. This problem has been extensively explored in many papers [1,2,3,4,5,6,7,8] and, more recently, in the article [9]. The most popular voting method for image segmentation is the (soft or hard) arithmetical voting [4,5,6], where each pixel is voted on by a majority rule for that pixel only, without taking the other pixels into account. Our starting idea is that the masks in each natural segmentation problem have a natural geometric-topological structure, so the pixels are interrelated and not independent, and each pixel should be voted on not individually but in connection with the other pixels. That is why we call our method the topological voting method. The idea of considering the structure of the mask rather than voting pixel-wise in ensemble methods has been studied in [1,2,3], where an image is divided into several regions (clusters) and the voting methods are applied to the regions. We shall discuss this idea further in Section 2 of this paper.
We will present the topological voting algorithm, together with its variations, in Section 2. We will show not just one, but a whole family of topological voting methods, including local topological voting, and hybrid voting, which is a combination of arithmetical and topological voting methods, with a step to detect and exclude the outliers, i.e., those who are more likely to be wrong.
For simplicity, in this paper we will consider only the binary segmentation of 2D images, i.e., each image will be segmented into two parts: the region of interest, called the mask, and the rest (the background). To illustrate our ideas, we will work out three concrete examples: (1) segmentation of salt in seismic images; (2) segmentation of human faces in photos; and (3) segmentation of blood vessels in retinal images. These three examples are fairly typical of segmentation problems, and in all of them the experimental results, presented in Section 3, Section 4 and Section 5 of this paper, show that the topological voting method and its variations achieve better accuracy than the arithmetical voting method.
In Section 6, we offer some arguments, theoretical and experimental results, that show why topological voting methods are efficient and give better results than arithmetical ones in many situations. These are the arguments and ideas that led to our experiments. Section 7, the last section of this paper, is dedicated to conclusions and future work.
2. The Topological Voting Method
2.1. Image Segmentation and Jaccard Distance
Mathematically, one may represent a 2D-image binary segmentation tool or agent (a segmentor) as a discrete-valued map $S: \mathcal{D} \to \{0,1\}^N$ or a continuous-valued map $\sigma: \mathcal{D} \to [0,1]^N$ from a space $\mathcal{D}$ of digital 2D images (of a fixed size, for simplicity), where $N = h \times w$ (height times width) is the number of pixels per image. For an image $x \in \mathcal{D}$, if $S(x)_p = 1$, where $p = (i, j)$ is the position of a pixel ($1 \le i \le h$, $1 \le j \le w$), then this pixel is in the mask of $x$ made by the segmentor $S$; otherwise, the pixel is in the background.
Convolutional neural networks (CNNs) that are used in image segmentation problems usually give us continuous-valued segmentors $\sigma$, and we can obtain $S$ from $\sigma$ by fixing a threshold, for example $S(x)_p = 1$ if $\sigma(x)_p \ge 0.5$ and $S(x)_p = 0$ otherwise.
We will assume that each image $x$ has a true mask (the ground truth) denoted by $T(x)$. We want to measure how good our segmentor is, i.e., how far the mask $S(x)$ is from the true mask $T(x)$.
A good and widely used measure of accuracy of binary segmentation is the so-called intersection over union (IOU) score, also known as the Jaccard score, introduced by Paul Jaccard in a paper in 1901 [10]:
$$\mathrm{IOU}(U, V) = \frac{|U \cap V|}{|U \cup V|},$$
where $U \cap V$ denotes the intersection of the two masks $U$ and $V$ (i.e., the set of pixels where both segmentors have value equal to 1), $U \cup V$ denotes their union (where at least one of them has value equal to 1), and the absolute value sign denotes the surface area (i.e., the number of pixels in the set).
Remark that an equivalent way to write the Jaccard score is MOM (min over max):
$$\mathrm{IOU}(U, V) = \frac{\sum_{p} \min(U_p, V_p)}{\sum_{p} \max(U_p, V_p)},$$
where the sum is taken over all pixels $p$. An advantage of Formula (5) over Formula (4) is that it also works for soft masks, i.e., for continuous-valued segmentors: if $U$ and $V$ are two soft masks (with values in $[0,1]$), then we can define their relative soft Jaccard score as
$$J(U, V) = \frac{\sum_{p} \min(U_p, V_p)}{\sum_{p} \max(U_p, V_p)}.$$
The Jaccard distance of a mask $U$ to a true mask $V$ is defined by the formula
$$d_J(U, V) = 1 - \mathrm{IOU}(U, V),$$
and it measures “how far” $U$ is from $V$. The two masks coincide if, and only if, the Jaccard distance between them is equal to 0. In the case of soft masks, the same formula still works and gives what we call the soft Jaccard distance.
In this paper, we will mainly use the Jaccard score (and the Jaccard distance) to measure the accuracy of our automatic segmentors.
Let us mention that there is a naive binary accuracy score, which is by definition the number of pixels at which $S(x)$ coincides with $T(x)$, divided by the total number of pixels $N$. When the true mask is small (the problem of segmentation of small objects), it may happen that the segmentor $S$ gives a mask which is completely different from the true mask (no intersection), yet the binary accuracy score is still near 1 (the maximal possible value), so this binary accuracy score is not a very good measure of accuracy, though we may still compute it sometimes.
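For concreteness, the min-over-max Jaccard score and the Jaccard distance defined above can be computed with a short NumPy sketch such as the following (an illustrative implementation of the formulas, with hypothetical function names; two empty masks are treated as identical):

```python
import numpy as np

def jaccard_score(u: np.ndarray, v: np.ndarray) -> float:
    """Soft Jaccard (min over max) score of two masks with values in [0, 1].

    For hard masks (values 0/1) this coincides with |U ∩ V| / |U ∪ V|.
    """
    num = np.minimum(u, v).sum()
    den = np.maximum(u, v).sum()
    return 1.0 if den == 0 else float(num / den)  # two empty masks count as identical

def jaccard_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Jaccard distance: equal to 0 if and only if the two masks coincide."""
    return 1.0 - jaccard_score(u, v)
```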
2.2. Arithmetical Voting
In automatic as well as hand-made segmentation, one often creates not just one, but many segmentors $S_1, \dots, S_n$ for the same problem, using different CNNs, different training datasets, different data augmentation methods, etc., and then ensembles them by a voting method to hopefully obtain a segmentor which is more accurate on average than each one of them. The most obvious voting method is the majority vote: for each pixel, each segmentor has one vote, and the candidate value (0 or 1) that has the most votes wins. This majority voting method is also called the hard arithmetical voting, and there is another variant of arithmetical voting called the soft voting, see, e.g., [4,5]. In soft voting, one uses continuous-valued segmentors $\sigma_1, \dots, \sigma_n$ instead of discrete-valued segmentors $S_1, \dots, S_n$, and puts
$$\sigma_{\mathrm{vote}}(x) = \frac{1}{n} \sum_{i=1}^{n} \sigma_i(x)$$
as the soft voted mask. When $n$ is large, then by the law of large numbers, soft voting and hard voting will give more or less the same results. However, when $n$ is small, soft voting may be finer and give slightly better results than hard voting. One may fine-tune the above arithmetical voting formula by giving different weights to different segmentors (weighted averaging formula).
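As an illustration, hard (majority) and soft (averaging) arithmetical voting over a stack of predicted masks can be sketched as follows; the 0.5 thresholds follow the usual convention and are assumptions here:

```python
import numpy as np

def hard_arithmetical_vote(masks: np.ndarray) -> np.ndarray:
    """Pixel-wise majority vote over n binary masks stacked along axis 0."""
    return (masks.mean(axis=0) > 0.5).astype(np.uint8)

def soft_arithmetical_vote(soft_masks: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Average n soft masks (values in [0, 1]) pixel-wise and threshold the result."""
    return (soft_masks.mean(axis=0) >= threshold).astype(np.uint8)
```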
Our topological voting methods are very different from arithmetical voting. In the following Section 2.3, Section 2.4 and Section 2.5 we present three versions of the method. One of them, the simplest version of the proposed method, turns out to be the same as the “Best of K” method in [2], though we arrived at it independently. The other two versions are different.
2.3. Topological Voting: Simplest Version
The simplest forms (hard and soft topological voting) are presented in the following. The hard version is the one called “Best of K” in [2], with only one cluster, namely the whole image. In more detail, both versions consist of the following steps:
- (i)
For an input image $x$, take the $n$ masks $S_1(x), \dots, S_n(x)$ given by $n$ different segmentors $S_1, \dots, S_n$;
- (ii)
For each index $i$, measure the total distance from $S_i(x)$ to the other masks, with respect to some natural distance function.
There are different natural distances in geometry that fit different problems. For example, the Hausdorff distance, also known as the Hausdorff–Pompeiu distance, can be used effectively in many image processing problems [11,12]. For simplicity, here, we will only use the (soft or hard) Jaccard distance in our experiments and define the total distance $D_i$ from $S_i(x)$ to the other masks by the following formula:
$$D_i = \sum_{j \ne i} d_J\big(S_i(x), S_j(x)\big).$$
- (iii)
(Winner takes all) The mask with the smallest total Jaccard distance to the other masks wins, i.e., the voted mask is $S_k(x)$ with
$$k = \arg\min_{1 \le i \le n} D_i.$$
An equivalent way to formulate the “winner takes all” rule is: the mask with the highest total Jaccard score wins, i.e., the voted mask is $S_k(x)$ with
$$k = \arg\max_{1 \le i \le n} J_i, \quad \text{where} \quad J_i = \sum_{j \ne i} \mathrm{IOU}\big(S_i(x), S_j(x)\big).$$
In soft topological voting, one uses the same formulas as above but applied to the soft masks instead of the hard masks. A technical side note: it may be useful to regularize the sigmoid values of the soft masks before computing soft scores, for example, by truncating them at 0.2 and 0.8: a value $v$ is replaced by 0 if $v \le 0.2$, by 1 if $v \ge 0.8$, and kept unchanged if $0.2 < v < 0.8$. This is actually what we do with soft masks in our experiments.
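The simplest version of the topological voting can be sketched in a few lines, reusing the jaccard_distance helper above; this is an illustrative sketch, with the truncation thresholds 0.2 and 0.8 taken from the side note:

```python
import numpy as np

def truncate_soft_mask(m: np.ndarray, lo: float = 0.2, hi: float = 0.8) -> np.ndarray:
    """Regularize sigmoid values: clamp small values to 0 and large values to 1."""
    out = m.copy()
    out[m <= lo] = 0.0
    out[m >= hi] = 1.0
    return out

def topological_vote(masks: np.ndarray) -> np.ndarray:
    """Winner-takes-all: return the mask with the smallest total Jaccard distance
    to the other masks. `masks` has shape (n, H, W)."""
    n = masks.shape[0]
    totals = [sum(jaccard_distance(masks[i], masks[j]) for j in range(n) if j != i)
              for i in range(n)]
    return masks[int(np.argmin(totals))]
```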
It turns out that, in many cases, the above “winner takes all” simple topological voting method already gives results that are superior to the arithmetical voting method. We will show it in the example of salt segmentation in seismic images (see
Section 3).
The following diagram summarizes the system (Topological Voting Schema, simplest version): the candidates (topological objects) $S_1(x), \dots, S_n(x)$ are compared via pairwise topological distances; for each candidate one computes the total distance $D_i$ to the rest; the selected candidate is $Y = S_k(x)$, where $k = \arg\min_i D_i$.
2.4. Local Topological Voting
Some objects (e.g., blood vessels) have a complicated global structure and so even a good segmentation may be imprecise in many places. To improve the accuracy of annotation, one should in theory reduce the complexity of the things to annotate by decomposing them into smaller, simpler things. For image segmentation, it means that we can often cut a big complicated image into smaller, simpler ones, easier to segment. This ‘localized segmentation’ strategy together with the topological voting idea leads to what we call local topological voting.
To be more concrete, the local topological voting algorithm that we will use for our experiments presented in this paper is the following:
- (i)
Fix a natural number s, which will be the radius of the local neighborhoods;
- (ii)
For each pixel, consider the neighborhood of radius $s$ around that pixel: if the pixel is at position $p = (i, j)$, then its $s$-neighborhood is the square $[i-s, i+s] \times [j-s, j+s]$. (If the pixel is near the border, then its $s$-neighborhood is the intersection of this square with the image);
- (iii)
Use the topological voting algorithm, as presented in the previous subsection, on the predicted masks restricted to the $s$-neighborhood of the pixel to obtain the result for this pixel. In other words, the voted value of the pixel $p$ is equal to the value at $p$ of the annotator which is considered to be locally topologically the best in the $s$-neighborhood of $p$;
- (iv)
Do the above step (iii) for every pixel to obtain the total mask.
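A straightforward (unoptimized) sketch of this procedure, assuming the topological_vote helper above, is the following; it simply repeats the global method inside each s-neighborhood and keeps only the value at the center pixel:

```python
import numpy as np

def local_topological_vote(masks: np.ndarray, s: int) -> np.ndarray:
    """Local topological voting over masks of shape (n, H, W) with neighborhood radius s."""
    n, h, w = masks.shape
    result = np.zeros((h, w), dtype=masks.dtype)
    for i in range(h):
        for j in range(w):
            # s-neighborhood, intersected with the image near the border
            windows = masks[:, max(0, i - s):i + s + 1, max(0, j - s):j + s + 1]
            winner = topological_vote(windows)           # locally best mask inside the window
            result[i, j] = winner[min(i, s), min(j, s)]  # value of the winner at the center pixel
    return result
```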
The above local topological algorithm contains a parameter $s$, which is the size of the neighborhoods, i.e., the degree of locality. Notice that, when $s = 0$, the $s$-neighborhood of a pixel is just the pixel itself, no topological structure is taken into account, and the 0-neighborhood topological voting is just the usual arithmetical voting. At the other extreme, when $s$ is greater than the size of the image, any $s$-neighborhood is the whole image, and we recover the first version of our topological voting method. By varying $s$ from 0 to infinity, we obtain a whole family of voting methods that goes from arithmetical (pixel-by-pixel) to topological (whole-picture) voting.
Intuitively, for each problem there is an optimal neighborhood size, and the two extreme sizes (0 and infinity) are not the best ones. In our experiments, we will vary the radius $s$, and, not surprisingly, we will see that some neighborhood sizes are clearly better than others (see Section 5 on blood vessel segmentation).
We illustrate an example of this voting version in Figure 1. In that example, the radius $s$ is fixed. At each pixel, a window of that radius is applied to all the segmentors $S_1, \dots, S_n$ to obtain $n$ windows. We then use the method described in Section 2.3 on the $n$ windows to obtain the final window. The value of the center pixel of the final window is chosen as the result for the global mask. We repeat these steps for all pixels of the image to obtain the final mask.
2.5. Hybrid Topological-Arithmetical Voting
The hybrid voting that we present in this paper is a 2-round voting. An example is illustrated in Figure 2, and it goes as follows:
- (i)
In the first round, we use the (local or global) topological voting, not to choose the winner, but to choose the losers to exclude from the race.
More concretely, in this paper we will use one of the two following simple exclusion rules for each hybrid voting:
Method 1: Choose a threshold. For each input $x$, keep the masks $S_i(x)$ whose total Jaccard score $J_i$ is above the threshold (relative to the Jaccard scores of the other masks), where $1 \le i \le n$ and $n$ is the total number of our individual segmentors. A mask $S_i(x)$ that fails this criterion is considered to be an “outlier” and is excluded from the second round;
Method 2: Choose a number $k < n$. For each input $x$, keep the $k$ masks with the highest total Jaccard scores $J_i$, and exclude the other masks.
In method 1, the number of masks admitted to the second round is not fixed and differs from input to input, while in method 2, this number is always equal to $k$.
- (ii)
The candidates that remain in the second round will be voted arithmetically.
Another variation of hybrid voting is voting with weights determined by the distance function or, equivalently, by the Jaccard scores $J_i$: the higher the Jaccard score $J_i$ is (relative to the Jaccard scores of the other annotators), the higher the weight of $S_i(x)$ will be in the weighted arithmetical voting formula. Topological voting is the special case when all the weights are equal to 0 except one weight, which is equal to 1.
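As an illustration of the two-round scheme, the following sketch implements the exclusion rule of Method 2 (keep the k best masks) followed by arithmetical voting in the second round; it reuses the jaccard_score helper above, and the choice of k in the usage comment is arbitrary:

```python
import numpy as np

def hybrid_vote(masks: np.ndarray, k: int) -> np.ndarray:
    """Round 1: keep the k masks with the highest total Jaccard score (Method 2).
    Round 2: arithmetical (majority) vote among the kept masks."""
    n = masks.shape[0]
    scores = [sum(jaccard_score(masks[i], masks[j]) for j in range(n) if j != i)
              for i in range(n)]
    keep = np.argsort(scores)[-k:]  # indices of the k best masks
    return (masks[keep].mean(axis=0) > 0.5).astype(np.uint8)

# Example usage (hypothetical): voted = hybrid_vote(stack_of_masks, k=stack_of_masks.shape[0] // 2 + 1)
```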
3. Salt Segmentation in Seismic Images
The automatic segmentation of salt deposits in seismic images is an important problem for geology companies in search of hydrocarbons. In 2018, Kaggle organized the “TGS Salt Identification Challenge” on this problem and provided a training dataset of 4000 annotated grayscale images of size $101 \times 101$ [13].
Our aim here is not to build a “state-of-the-art” automatic segmentor for the Kaggle challenge (in real life, one would not cut the seismic images into very small pieces and compute the accuracy scores the way Kaggle did anyway), but just to do experiments on the topological voting method. For that purpose, we build our AI models on a popular light-weight convolutional neural network architecture called MobileNet [14], which is very handy in the sense that one can train it very fast, and it can run on small devices, such as mobile phones. We train our models using TensorFlow and Keras [15]. The loss function that we use in training our models is the sum of the binary cross entropy and the Dice loss function [16]. We use random padding, translation, rotation, flipping, and cutting to create square images of the input size required by our MobileNet models from the original images of size $101 \times 101$.
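For reference, a minimal Keras-style sketch of such a loss (binary cross entropy plus Dice loss) is given below; this is an illustration with a smoothing constant of 1.0 as an assumption, not necessarily the exact implementation used for the experiments reported here:

```python
import tensorflow as tf

def bce_dice_loss(y_true, y_pred, smooth=1.0):
    """Sum of binary cross entropy and Dice loss for binary segmentation."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return bce + (1.0 - dice)
```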
We divide the 4000 annotated images into 2 sets: the training set (3000 images) and the test set (1000 images). The results that we show in the tables below are for the test set (which, of course, is not used in the training process). The training set is then divided into 5 folds, each fold containing 600 images. For each fold, we train a corresponding AI model, which uses that fold for validation and the other four folds for training. We train each model for 500 epochs, each epoch has 3000 inputs, so in total each model is trained on 1,500,000 inputs. Each input is an image taken randomly from the training set, which then undergoes random transformations (augmentations in data pre-processing). We do not take the AI model after exactly 500 epochs of training, but rather the AI model after the epoch that gives the highest Jaccard score on the validation set among all 500 epochs.
After the above training process, we obtain 5 AI segmentors, corresponding to our 5 folds. Then, for each image in the test set, we vote on the 5 masks given by these 5 segmentors, using arithmetical voting and the (soft and hard) topological voting methods. The results are shown in Table 1. Here, the Jaccard score is the mean Jaccard score over the test set.
Some concrete examples of the masks given by our five individual segmentors, and the results of two different voting methods (arithmetical and topological), are shown in Figure 3, Figure 4, Figure 5 and Figure 6, together with the original images and the true masks (the ground truth given in the dataset). These four figures illustrate how the topological voting method works differently from the arithmetical one. We can see in these figures that the arithmetical method chooses the pixel-wise average of all the masks, regardless of the structure of each individual mask; consequently, it produces a final voted result which may be unstructured, even though a salt region is typically smooth and connected. Meanwhile, the topological method chooses the mask which is the most “central” among the others, in the sense that it has the smallest total distance to the other masks, and which is therefore reasonably close to the true mask.
In order to improve the accuracy, one can increase the number of individual segmentors. So we created 5 additional AI models using 5 additional folds, in the same way as before. The ensemble results using 10 models, shown in
Table 2, are indeed better than the ensemble results using just 5 models.
Table 2 also shows that, for the salt segmentation problem, the topological voting method clearly beats the arithmetical voting method (by more than a full percentage point). This table also shows the results of the hybrid topological-arithmetical voting method (at different thresholds), which are slightly better than the simple topological voting method.
As a side remark, we note that, if we measure the performance by using the binary accuracy metric instead of the Jaccard score, then the scores will be very high even for completely wrong segmentations, and the topological voting method will give worse results than the arithmetical voting method if we use this binary accuracy metric, see
Table 3.
4. Human Face Segmentation in Photos
The Face and Skin Detection Database, which contains 4000 images, was created by S. L. Phung, T. Y. Ke, and F. H. C. Tivive and used in [17] to support research on skin segmentation and face recognition. The dataset provides several types of ground truth, such as human face segmentation and skin segmentation. Here, we will use the human face segmentation part of this dataset.
Actually, the masks that we will use are not exactly the human faces, but the smallest rectangles containing them. To make the problem more interesting, we will partially hide the human faces in the photos with random rectangles, and let the machine learn to segment the full human faces despite those hidden parts. Figure 7e shows the original image in which the faces are covered by a random rectangle, and Figure 7f (the rightmost image) shows the mask of that image.
The dataset of 4000 photos is divided into 2 subsets: 1000 images for testing and 3000 images for training. The 3000 training images are divided into 10 folds, each fold containing 600 images, so the folds overlap: the first 5 folds form one partition of the training data, and the second 5 folds form another partition.
The original photos in the dataset are of different sizes, but we augment and resize them to the input size of our CNN models before feeding them in. The transformations that we use to augment the images are the standard ones: random rotation, flipping, brightness modification, cropping, resizing, padding, and noise adding. As mentioned above, we also add a random black rectangle to each image to partially cover the faces (without changing the masks).
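A minimal sketch of the occlusion step is given below (the rectangle-size bound max_frac is an arbitrary assumption for illustration); note that only the image is modified, never the mask:

```python
import numpy as np

def add_random_black_rectangle(image: np.ndarray, rng: np.random.Generator,
                               max_frac: float = 0.3) -> np.ndarray:
    """Cover a random rectangle of the image with black pixels; the mask is left unchanged."""
    h, w = image.shape[:2]
    rh = rng.integers(1, max(2, int(h * max_frac)))  # rectangle height
    rw = rng.integers(1, max(2, int(w * max_frac)))  # rectangle width
    top = rng.integers(0, h - rh)
    left = rng.integers(0, w - rw)
    out = image.copy()
    out[top:top + rh, left:left + rw] = 0
    return out
```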
We use two different CNN architectures for our experiment: the light-weight MobileNet [14] and the more heavy-weight EfficientNet B4 [18], so in total we obtain 20 different AI segmentors for our human face segmentation problem.
It is interesting to look at various inputs and outputs to see how the individual segmentors perform, and what their main structural mistakes are. For example:
In Figure 7, one can see that some individual segmentors mistake body skin for facial skin, while the other segmentors do not make this mistake. Topological and hybrid voting allow us to exclude the segmentors that make this mistake, so the voted result does not contain body skin in the mask; the arithmetical voting does not have this semantic advantage, and its result still contains body skin.
In
Figure 8, one can see that some individual segmentors barely recognize any facial skin, while there is one segmentor that mistakes a shirt for facial skin.
Figure 9 is another very interesting example, where some segmentors mistake a dog face for a human face.
The accuracy scores of different voting methods on our 20 models are given in
Table 4. It shows that hybrid voting gives the best results, by excluding those segmentors that make very gross mistakes.
One may wonder why the plain-vanilla topological voting method gives lower scores than the arithmetical voting method in the above facial segmentation experiment. We think that the reason lies in the fact that the masks themselves are very simple (just rectangles), whereas the topological voting method excels at more complicated masks.
5. Blood Vessel Segmentation
The problem of segmentation of tree-like structures, such as microglia extensions, neurovascular structures, blood vessels, and pulmonary trees, is of great interest in medical AI, see, e.g., [19] and references therein. Rouchdy and Cohen [19] studied the problem of segmentation of blood vessels in retinal images using a method called geodesic voting with radius (no deep learning), and showed the superiority of their method over older approaches such as the edge-based level set method [20], the Chan and Vese method [21], and the fuzzy connectedness method [22]. The database that they used is the Digital Retinal Images for Vessel Extraction (DRIVE) dataset [23].
In this section, we propose to use deep learning to solve the retinal blood vessel segmentation problem using this same DRIVE dataset. Again, we will use the light-weight MobileNet [14]. Not surprisingly, the deep neural networks can give better segmentation results than the previous image processing methods, including the geodesic voting method [19].
The DRIVE dataset contains 20 images of size $565 \times 584$. We first divide it into two sets: 15 images for training and 5 images for testing. We create 15 folds for the training and cross-validation of our AI using the 15 training images. Each fold leads to one segmentor, so in total we have 15 individual segmentors to vote on. Each fold uses 13 images for training and 2 images for validation. Each image is then transformed (augmented) using a combination of operations, including random rotation, flipping, brightness modification, noise adding, cropping, and padding, into square images of the input size of our MobileNet-based CNN, before being fed into it for training and validation. The five original images used for testing are cropped into 2000 images whose widths and heights are random whole numbers in a fixed interval, and then padded to square images of the same input size, so our test set consists of 2000 square images.
Figure 10 shows an example of our 15 individual segmentations of an image from the test set, and the results of three different voting methods on these 15 segmentations, compared to the true mask. One can see visually in these pictures that both the local topological method and the hybrid voting method give better results than the arithmetical voting method: fewer missing pixels compared to the true mask (the vessels are less broken).
Table 5 shows a comparison of voting methods on the DRIVE dataset.
One may notice that the Jaccard scores obtained in blood vessel segmentation are much lower than in the previous two problems. The main reason is simple: the vessels are very thin, and so the Jaccard scores are very sensitive to small variations in the segmentation. Another reason is that MobileNet is a light-weight CNN aimed at quick processing rather than the highest accuracy, and we did not do any special optimization here either. What is important for us here is the fact that the gain obtained by the hybrid topological-arithmetical voting method is significant compared to the arithmetical voting method.
6. Why Does Topological Voting Work?
Let us recall that the philosophical reason behind our topological and hybrid voting methods is the following: for meaningful objects, their masks must have certain geometrical or topological properties or shapes. Assuming that we have good (but not yet excellent) segmentors $S_1, \dots, S_n$, most masks given by them (for a typical input) will have reasonably good shapes, reflecting the topological-geometrical properties of the segmented object, and will be sufficiently close to the true mask. Consequently, in general, a mask which is far from the other masks will also be far from the true mask, while a mask which is closer to most other masks will also be closer to the true mask. So, even though we do not know what the true mask is, we can use the total distance function $D_i$ as a proxy to estimate how far $S_i(x)$ is from the true mask.
As a side remark, one may see a parallel between the different voting methods used in image segmentation and the voting methods used in politics. For example, elections of village mayors are similar to pixel-by-pixel voting (each village is a ‘pixel’), while many presidential elections are similar to whole-picture voting.
Considering the simplest form of the proposed method presented in Section 2.3, in low-dimensional settings, we show below theoretical and experimental results to explain why our voting method outperforms the classical arithmetical voting method. In more detail, when $X_i$ is a scalar in $\mathbb{R}$ for all $i$, the two voting methods behave similarly: we shall prove that they both converge to the mean $\theta$ with the same rate of convergence $1/\sqrt{n}$. The difference can be seen when $X_i$ is a vector whose components are dependent, reflecting the “structure” inside it. We illustrate below simulations for the case when $X_i$ is a vector in $\mathbb{R}^2$ and its second component totally depends on its first component, to show how our voting method behaves better than the arithmetical one.
6.1. One Dimension Case
In this case, $X_1, \dots, X_n$ are i.i.d. random variables with values in $\mathbb{R}$, with mean $\theta$ and variance $\sigma^2$. Recall that the classical soft arithmetical voting method (average voting) chooses the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ as its final result, while our voting policy chooses
$$\hat{X}_n = \arg\min_{X_i} \sum_{j=1}^{n} d(X_i, X_j),$$
where $d$ is some distance function. Here, we choose $d(x, y) = (x - y)^2$ to obtain a smooth function for analysis. The classical results show that $\bar{X}_n$ converges almost surely to $\theta$ (law of large numbers) and it converges in distribution to $\mathcal{N}(\theta, \sigma^2/n)$ with the rate of $1/\sqrt{n}$ (central limit theorem). With our voting method, we obtain the same guarantees for the convergence, as well as the rate of convergence, which are illustrated in Figure 11 and shown in Proposition 1. Remark that in Proposition 1, we restrict ourselves by assuming that the $X_i$ have a positive, continuous, and bounded probability density in a neighborhood of the mean $\theta$. However, this does not restrict too much, since most common distributions (such as the uniform distribution, the normal distribution, the exponential distribution, etc.) satisfy this assumption.
Proposition 1. If $X_1, \dots, X_n$ have a positive, continuous and bounded probability density in a neighborhood of θ, we have the following statements:
- 1.
$\hat{X}_n$ is the element of $\{X_1, \dots, X_n\}$ closest to the sample mean $\bar{X}_n$.
- 2.
$\hat{X}_n$ converges almost surely to θ.
- 3.
$\hat{X}_n$ converges in distribution to $\mathcal{N}(\theta, \sigma^2/n)$ with the rate of $1/\sqrt{n}$.
Proof. We will prove each statement in turn:
- 1.
Define $h(t) = \sum_{j=1}^{n} (t - X_j)^2$, so that $\hat{X}_n = \arg\min_{X_i} h(X_i)$. Expanding, we have
$$h(t) = n\left(t - \bar{X}_n\right)^2 + \sum_{j=1}^{n} X_j^2 - n\bar{X}_n^2,$$
so $h$ is a parabola whose minimum is attained at $\bar{X}_n$ and which is symmetric at $\bar{X}_n$, i.e., for every $t$,
$$h(\bar{X}_n + t) = h(\bar{X}_n - t).$$
From the variations of the function $h$ (decreasing to the left of $\bar{X}_n$ and increasing to the right of it), it follows that the element of $\{X_1, \dots, X_n\}$ closest to $\bar{X}_n$, either from the left or from the right, is the arg min of $h$ over $\{X_1, \dots, X_n\}$. So we obtain the first statement.
- 2.
Denote $\Delta_n = \hat{X}_n - \bar{X}_n$; we will show that $\Delta_n$ converges almost surely to 0, which, combined with the law of large numbers, implies that $\hat{X}_n$ converges almost surely to $\theta$. Indeed, for every $\epsilon > 0$, the probability $P(|\Delta_n| > \epsilon)$ is the probability that no $X_i$ falls within $\epsilon$ of $\bar{X}_n$, and this probability can be bounded for all $n$, since by the assumption the $X_i$ have a positive continuous density in a neighborhood of $\theta$, and $\bar{X}_n$ converges in distribution to $\mathcal{N}(\theta, \sigma^2/n)$ by the central limit theorem, combined with the upper-tail inequality of the normal distribution [24], which states that if $Z \sim \mathcal{N}(0, 1)$ then for every $t > 0$ we have
$$P(Z > t) \le \frac{1}{t\sqrt{2\pi}} e^{-t^2/2}.$$
From Theorem 7.5 in [25] we deduce that $\Delta_n$ converges almost surely to 0, and therefore $\hat{X}_n$ converges almost surely to $\theta$.
- 3.
The third statement follows from the fact that $\bar{X}_n$ converges in distribution to $\mathcal{N}(\theta, \sigma^2/n)$ with the rate of $1/\sqrt{n}$ (central limit theorem) and from the fact that $\hat{X}_n$ (the element of $\{X_1, \dots, X_n\}$ closest to $\bar{X}_n$) converges in probability to $\bar{X}_n$ with the rate of $n$. Indeed, to simplify the proof, we assume $\theta = 0$; then, for any $\epsilon > 0$, the probability $P(n\,|\hat{X}_n - \bar{X}_n| > \epsilon)$ can be bounded by a sum of two components (Formula (13)), expressed in terms of the density $f$ of the $X_i$ and the density of $\bar{X}_n$.
With the assumption, the density of $\bar{X}_n$ is positive, continuous and bounded in a neighborhood of 0 as well, because it is the density function of $\bar{X}_n$, which is the average of $X_1, \dots, X_n$. Let $U$ be the open neighborhood of 0 in which this density and $f$ are positive, continuous and bounded. We first fix $\epsilon$ such that the corresponding interval is contained in $U$; then, for $n$ large enough, the shifted interval is contained in $U$ as well, which gives uniform bounds on both densities over these intervals.
Thus the first component of (13) converges to 0 since $\bar{X}_n$ converges to 0 almost surely, and the second component converges to 0 as the length of the interval (of order $\epsilon/n$) goes to 0, which ends the proof. □
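As a quick numerical illustration of Proposition 1, one can compare the sample mean with the “closest-to-the-mean” topological estimator on simulated data (an illustrative sketch; the parameters are arbitrary):

```python
import numpy as np

def topological_estimate_1d(x: np.ndarray) -> float:
    """Return the sample point minimizing the total squared distance to the others,
    i.e., the sample point closest to the sample mean."""
    return float(x[np.argmin(np.abs(x - x.mean()))])

rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 1.0, 50, 2000
mean_err, topo_err = [], []
for _ in range(trials):
    x = rng.normal(theta, sigma, size=n)
    mean_err.append((x.mean() - theta) ** 2)
    topo_err.append((topological_estimate_1d(x) - theta) ** 2)
# Both mean squared errors are close to sigma^2 / n, in line with Proposition 1.
print(np.mean(mean_err), np.mean(topo_err), sigma ** 2 / n)
```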
6.2. Two Dimension Case
Now, $X_1, \dots, X_n$ are vectors in $\mathbb{R}^2$. The masks in each natural segmentation problem have a natural “structure” in the sense that their coordinates are not independent. For the illustrations below, we consider the simplest case where the second component of $X_i$ totally depends on its first component, i.e., $X_i = (Y_i, f(Y_i))$ for all $i$, where $Y_1, \dots, Y_n$ are i.i.d. with mean $\theta$ and variance $\sigma^2$ and $f$ is some function.
In Figure 12, we vary the function $f$ to illustrate the different policies of the two voting methods. It can be seen that, in the average voting, the first component converges perfectly to the true one, but the second component can be very far away from the true one. The reason is simple: for most functions $f$, $f$ of the average is different from the average of $f$. Meanwhile, the topological voting takes into account the trade-off between the two components.
In Figure 13, we take a symmetric function $f$. In this case, both components of the result given by topological voting converge very well to the true ones, because of the symmetry property of $f$.
In Figure 14, we keep the same $f$ but vary the distribution of the $Y_i$ to see how differently the two methods perform when the annotations have more or less variance. One can see that when we decrease the uncertainty of the annotations, both voting methods behave better, especially the topological voting. The improvement of the topological voting over the arithmetical one is significantly higher when the uncertainty of the annotations is high.
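A small simulation in the spirit of these figures (an illustrative sketch; the choice $f(y) = y^2$ and the other parameters are arbitrary assumptions) shows the behaviour described above:

```python
import numpy as np

def topological_vote_2d(points: np.ndarray) -> np.ndarray:
    """Return the sample point with the smallest total squared distance to the others."""
    dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    return points[np.argmin(dists.sum(axis=1))]

rng = np.random.default_rng(0)
theta, sigma, n = 1.0, 0.5, 200
f = lambda y: y ** 2                      # second component depends on the first
y = rng.normal(theta, sigma, size=n)
x = np.stack([y, f(y)], axis=1)           # structured samples X_i = (Y_i, f(Y_i))
true_point = np.array([theta, f(theta)])
avg = x.mean(axis=0)                      # arithmetical (average) voting
topo = topological_vote_2d(x)             # topological voting
# The average estimates the second component as E[f(Y)] != f(theta) in general,
# while the topological vote returns an actual sample lying on the curve (y, f(y)).
print(avg, topo, true_point)
```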
7. Conclusions and Future Work
This paper considers various ensemble methods for image segmentation problems. We have presented three proposed voting methods that take into account the structure of the whole mask. In three concrete examples, experimental results show that our voting methods outperform the classical arithmetical voting method. We have also provided some arguments about the philosophical reason, the mathematical guarantees, and the experimental results (for the simplest version) to explain why the proposed topological voting methods behave better than the arithmetical one.
There are several directions in which this work can be taken. One avenue is to extend Proposition 1 to the multi-dimensional case and to investigate theoretical guarantees for the two other versions (the local and hybrid ones). Another avenue is to carry out more experimental comparisons with other voting methods, for example, the nine methods presented in [2].