1. Introduction
Being able to recognize objects in images has long been a prerogative of human beings. It took over 14 years for automated systems to reach the level of an untrained human on the ImageNet challenge. Things become more complex when the task requires not only recognizing the object in an image but also identifying its boundaries. This task is called semantic segmentation, and in machine learning it entails the classification of each pixel in an image. Thanks to the performance improvements brought by machine learning models, this task is applied in many real-life scenarios [1,2]: in clinical practice, it can be used to identify polyps; similarly, in skin and blood analysis the identification of objects may help to visually bound the presence of different types of diseases. In addition, it can be used in autonomous vehicles to identify objects surrounding the vehicle, in the classification of environmental microorganisms, and in many other applications.
The standard approach is to train a system composed of two modules: an encoder and a decoder. The first module learns a low-dimensional representation of the input that captures the semantics of the image. The second module learns to reconstruct the original input from this low-dimensional feature vector. This is the approach adopted by U-Net [3], one of the first systems developed for semantic segmentation. Autoencoders [4] have also been employed for this task, thanks to their ability to learn low-level semantic representations of an image through the encoder module and to reconstruct the original input from this reduced representation. Their performance and results are the reasons why many researchers and practitioners in computer vision have adopted them.
The performance of autoencoders, like that of other classifiers, is strongly affected by the architecture configuration and by other settings usually referred to as hyper-parameters; tuning them consists in finding the best values for some attributes of the model. This is a context-specific task that requires domain knowledge as well as expertise with the adopted machine learning techniques, resulting in considerable effort and time consumption. The well-known no-free-lunch theorem for machine learning states that a single model that works well on all datasets cannot exist. Based on this evidence, another approach consists in adopting sets of classifiers, often shallow or weak, whose predictions are aggregated to form the output of the system. These frameworks are called ensemble methods. In an ensemble, the individual classifiers are trained on the same dataset in such a way that each model generalizes differently over the training space. Ensembles provide state-of-the-art results in many domains, but some properties must be ensured; one of them is enforcing some kind of diversity in the set of classifiers.
In this work, we propose a novel ensemble method for semantic segmentation. Our model is based on convolutional neural networks (CNNs) and transformers. Diversity among individual classifiers is enforced by adopting different loss functions and testing different data augmentations.
The model has been developed by combining DeepLabV3+ [5], HarDNet-MSEG [6], and Pyramid Vision Transformers [7]. The resulting solution is assessed through an extensive empirical evaluation that compares our proposal with state-of-the-art solutions, showing promising results that are often better than the best existing approaches.
The evaluation has been carried out on five different real-world scenarios, namely polyp segmentation, skin segmentation, butterfly recognition, environmental microorganisms’ classification, and leukocyte detection.
Thanks to the advances of the discipline, machine learning techniques are applied in many different areas, for instance in medical diagnosis and in biology. Convolutional neural networks (CNNs) and other classic predictors are adopted to assist researchers and practitioners in better identifying objects in images; this is the case, for instance, for skin segmentation or butterfly identification. A drawback of this technology is that a huge amount of data is needed to train these systems, while labeled data are a scarce resource in many domains. This is one of the reasons why considerable effort is spent building and publishing datasets in specific areas; an example is Kvasir-SEG [8], a recent dataset that contains polyp images annotated at the pixel level by a group of experts.
A novel architecture came from natural language processing (NLP), where researchers study how to comprehend the semantics of texts with the purpose of automating tasks such as summarization or translation. This model, called the transformer, is designed around a self-attention mechanism that enables the system to focus on specific parts of the input. Transformers have also been applied to computer vision tasks, reaching performance comparable to or even better than CNNs. Once again, the main drawback of these models resides in the large amount of data needed to train a stable and well-performing system. TransFuse [9] and UACANet [10] are two recent approaches in the medical domain that combine different techniques: the first is a combination of CNN kernels and transformers, while the second blends U-Net and a parallel axial attention autoencoder. Regardless of the architecture, the aim is to capture information at both the local and global levels.
As previously noted, semantic segmentation is of major importance in many contexts. Autonomous vehicles, for instance, use semantic segmentation to identify objects in the environment surrounding the vehicle in order to make safe decisions [11]. In clinical practice, it helps reduce exposure to serious risks by detecting pathologies in their early stages; for example, polyp detection may prevent the evolution of colorectal cancers [2]. Similarly, in skin detection, deep learning methods are employed in various areas, spanning from face detection to hand gesture recognition [12]. In this context, deep approaches have faced some difficulties, such as background clutter that hinders the reliable detection of hand gestures in real-world environments.
CNNs have shown their appeal in this context as well; two examples are the works by Roy et al. [13] and Arsalan et al. [14]. In the former, the authors use CNN-based skin detection techniques to enhance the output of a hand detector. The latter introduces a residual skip connection CNN (OR-Skip-Net) that decreases the computational effort of the network while tackling demanding skin segmentation tasks; the goal is achieved by transferring data directly from the initial layer to the last layer of the network. CNNs are also employed for automatically translating sign language [15].
A comparative analysis is reported in [12], where several leading technologies are empirically evaluated on a set of skin detection benchmarks.
Recently, deep learning has also been used for the automatic recognition and classification of leukocytes [16]. This practice helps medical practitioners diagnose various blood-related diseases. It can be done in many different ways: practitioners can analyze the percentages of leukocytes with a histogram-based technique or segment white blood cells with iterative algorithms such as GrabCut [17].
Contribution: this paper proposes a new ensemble method based on DeepLabV3+, HarDNet-MSEG, and Pyramid Vision Transformer backbones, intended for semantic segmentation. Diversity among the individual classifiers in the ensemble is enforced by adopting different loss functions and testing different data augmentation approaches. We tested the proposed method on five different scenarios and compared the results with existing frameworks. The empirical evaluation shows that our results are close to, or even better than, the state of the art.
2. Materials and Methods
In this section, we describe the techniques and approaches used to build our ensemble. In particular, we report the mathematical formalization of the loss functions adopted in training the networks.
2.1. Deep Learning for Semantic Image Segmentation
In the literature, several deep learning models have been proposed to address semantic segmentation problems.
Semantic segmentation aims to identify the objects in an image together with their boundaries. The main purpose is therefore to assign a class at the pixel level, a task achieved thanks to fully convolutional networks (FCNs). FCNs reach very high performance and, unlike standard CNNs, use a fully convolutional last layer instead of a fully connected one [18].
Deconvolutional networks, such as U-Net, are obtained by combining FCNs and autoencoders.
U-Net represents the first attempt to use autoencoders in an image segmentation task. Through the autoencoder, the input is downsampled while the number of features used to describe the input space is increased. Another emblematic example is SegNet [19]: here, the max-pooling indices of the corresponding encoder level feed the decoders, while a VGG network is used for encoding. This reduces memory usage while also producing better segmentations.
DeepLab [20] is a family of autoencoder models introduced by Google that has shown excellent results in semantic segmentation applications [21]. These are some of the main features introduced to guarantee good performance:
Dilated (atrous) convolutions reduce the effects of pooling and stride, thereby greatly increasing the resolution.
Atrous Spatial Pyramid Pooling captures information at various scales.
A combination of CNNs and probabilistic graphical models makes it possible to detect the boundaries of objects.
DeepLabV3 introduced two main innovations: a 1 × 1 convolution and batch normalization inside the Atrous Spatial Pyramid Pooling, and a set of modules placed in parallel and in cascade for dilated convolution. DeepLabV3+ [5], a further extension of the family developed by Google, is adopted in this work. Among its most important features, this extension includes a decoder based on depth-wise and point-wise convolutions: the depth-wise convolution applies a spatial filter independently to each channel, while the point-wise (1 × 1) convolution combines the channels at each location. Beyond the choice of the architecture itself, other characteristics of the model structure can be varied to obtain different designs for a framework.
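To make the depth-wise/point-wise decomposition concrete, the following minimal PyTorch sketch shows the pattern; the module name and tensor sizes are illustrative and are not taken from the DeepLabV3+ implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise convolution (per-channel spatial filtering) followed by a
    point-wise 1x1 convolution that mixes channels at each location."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch makes each filter operate on a single input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        # 1x1 convolution combines information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# example: a 256-channel feature map of size 64x64
x = torch.randn(1, 256, 64, 64)
y = DepthwiseSeparableConv(256, 128)(x)   # -> shape (1, 128, 64, 64)
```

Compared with a standard convolution, this factorization greatly reduces the number of parameters and multiplications, which is why it is used in the DeepLabV3+ decoder.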
In this paper, we adopt ResNet101 [22], a well-known CNN that learns residual functions with reference to the block inputs (we recommend [23] for an exhaustive list of CNN structures). The ResNet101 network is pre-trained on the VOC segmentation dataset and then fine-tuned with the suggested parameters (https://github.com/matlab-deep-learning/pretrained-deeplabv3plus, accessed on 20 April 2022). These parameters have not been modified in order to prevent overfitting; thus, they are the same for all the tested datasets (an illustrative training configuration is sketched after the list):
initial learning rate = 0.01;
number of epochs = 10 (with the simple data augmentation approach) or 15 (with the more complex data augmentation approach, due to the slower convergence on the larger augmented training set);
momentum = 0.9;
L2Regularization = 0.005;
Learning Rate Drop Period = 5;
Learning Rate Drop Factor = 0.2;
Shuffle = training images shuffled every epoch;
Optimizer = SGD (stochastic gradient descent).
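For illustration only, the following PyTorch-style sketch mirrors the configuration above; the actual experiments were run with the MATLAB pretrained DeepLabV3+, and the model, data, and loss used here are placeholders.

```python
import torch
import torch.nn as nn

# placeholder model and data; in the real setting this is the pretrained DeepLabV3+
model = nn.Conv2d(3, 2, kernel_size=1)
images = torch.randn(4, 3, 64, 64)
masks = torch.randint(0, 2, (4, 64, 64))
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                            weight_decay=0.005)           # L2 regularization
# drop the learning rate by a factor of 0.2 every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.2)

for epoch in range(10):                  # 10 epochs with DA1, 15 with DA2
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()
    scheduler.step()
```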
Firstly, we propose an ensemble of DeepLabV3+ models obtained by applying various loss functions and data augmentation methods; we then combine this ensemble with HarDNet-MSEG [6] and the Pyramid Vision Transformer (PVT) [7]. HarDNet-MSEG (harmonic densely connected network) is a model inspired by densely connected networks that reduces memory consumption as follows: it removes most of the connection layers at the DenseNet level in order to reduce concatenation costs, and it equalizes the input/output channel ratio by increasing the channel width of the layers with more connections.
PVT is a pure, convolution-free transformer network that aims to acquire a high-resolution representation starting from a fine-grained input. The computational cost of the model is decreased through a progressive pyramidal shrinkage that accompanies the depth of the model. A spatial-reduction attention (SRA) layer is introduced to further reduce the computational complexity of the system.
In this work, both HarDNet-MSEG and PVT have been trained with the same options in all the problems: batch size = 15; number of epochs = 100; initial learning rate = 0.0001; decay rate = 0.1; decay epoch = 50.
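As reported in the conclusions, the outputs of the individual networks are fused with the sum rule. The following sketch shows one possible implementation; the probability maps and the 0.5 threshold are illustrative assumptions.

```python
import torch

def sum_rule_fusion(prob_maps):
    """Fuse per-pixel foreground probability maps from different models by
    summing them (equivalently, for the final decision, averaging), then
    threshold at 0.5 to obtain the ensemble mask."""
    fused = torch.stack(prob_maps, dim=0).mean(dim=0)
    return (fused > 0.5).float()

# three hypothetical model outputs (sigmoid probabilities) for a 256x256 image
p_deeplab = torch.rand(1, 1, 256, 256)
p_hardnet = torch.rand(1, 1, 256, 256)
p_pvt     = torch.rand(1, 1, 256, 256)
mask = sum_rule_fusion([p_deeplab, p_hardnet, p_pvt])
```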
2.2. Loss Functions
Different loss functions influence the training phase and the performance of the model. In semantic segmentation, pixel-wise cross-entropy is one of the most widespread loss functions; it operates at the pixel level, verifying whether the predicted label of a given pixel coincides with the ground truth. One of its main problems arises when the dataset is unbalanced with respect to the labels, but this can be mitigated through the use of counterweights. The dice loss function [25] measures the overlap between the predicted segmented images and the ground truth; this approach, which we also use in this work, is widespread in semantic segmentation.
An exhaustive overview of image segmentation and loss functions is available in a recent survey [25].
2.2.1. Previously Tested Loss Functions
Our set of loss functions includes the following widely adopted functions (the interested reader can refer to [24] for a more detailed description):
Generalized Dice Loss [26], a multiclass variant of the dice loss.
Tversky Loss [27], a weighted version of the Tversky index designed to deal with unbalanced classes, i.e., the phenomenon whereby one class prevails over another.
Focal Tversky Loss [28], a variant of the Tversky loss in which a modulating factor (γ = 4/3) ensures that the model focuses on hard samples instead of properly classified examples.
Focal Generalized Dice Loss [29], the focal version of the generalized dice loss.
Log-Cosh Dice Loss, a combination of the dice loss and the log-cosh function, applied with the purpose of smoothing the loss curve.
Log-Cosh Focal Tversky Loss, based on the same smoothing idea, here applied to the focal Tversky loss.
SSIM Loss [30], obtained from the structural similarity (SSIM) index [31], usually adopted to evaluate the quality of an image.
MS-SSIM Loss, a variant of the SSIM loss defined using the multiscale structural similarity (MS-SSIM) index.
In addition, some losses obtained by combining the functions above are included in the set.
2.2.2. Cross-Entropy
The cross-entropy (CE) loss function provides a measure of the difference between two probability distributions. The goal is to minimize this difference; in doing so, CE has no bias between small and large regions.
This can be an issue when dealing with imbalanced datasets. Hence, the weighted cross-entropy loss was introduced, resulting in well-balanced classifiers for imbalanced scenarios [32].
The formula for the weighted binary cross-entropy is presented in Equation (1):

L_{wBCE} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{N} w_{ic} \left[ g_{ic} \log p_{ic} + (1 - g_{ic}) \log (1 - p_{ic}) \right]   (1)

where $C$ is the number of classes and $N$ the number of pixels. In this equation, $G$ denotes the ground-truth label image and $g_{ic}$ is the true value for pixel $i$: it is equal to 1 if the pixel belongs to class $c$ and 0 otherwise.
$P$ is the prediction for the output image and $p_{ic}$ is the probability that the $i$-th pixel belongs to the $c$-th class, obtained through a sigmoid activation; in our implementation, $P$ is produced by the softmax activation function, which returns probabilities.
$w_{ic}$ is the weight given to the $i$-th pixel of the image for class $c$. These weights are calculated by applying an average pooling over the mask with a 31 × 31 kernel and a stride of 1, so that nonmaximal activations are also taken into account.
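A possible PyTorch implementation of this weighted binary cross-entropy is sketched below. The exact 1 + 5·|avg_pool(G) − G| form of the weights follows the public HarDNet-MSEG code and is an assumption about the precise weighting used here.

```python
import torch
import torch.nn.functional as F

def weighted_bce(pred_logits, mask):
    """Weighted binary cross-entropy (Eq. 1) for a binary mask.
    Pixel weights come from a 31x31 average pooling of the ground-truth mask;
    the 1 + 5*|...| form is assumed from the HarDNet-MSEG code."""
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    bce = F.binary_cross_entropy_with_logits(pred_logits, mask, reduction='none')
    return (weit * bce).sum() / weit.sum()

# toy usage with a random binary ground truth
mask = (torch.rand(1, 1, 128, 128) > 0.5).float()
logits = torch.randn(1, 1, 128, 128, requires_grad=True)
loss = weighted_bce(logits, mask)
```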
2.2.3. Weighted Intersection over Union
Another well-known loss function is the intersection over union (IoU) loss, introduced for the first time in [33]. The original formula is

IoU = \frac{|G \cap P|}{|G \cup P|}

where, as mentioned earlier, $G$ is the ground-truth label image and $P$ is the prediction for the output image.
Unfortunately, the set operators for intersection and union are not differentiable, because they require $G$ and $P$ to contain only 0 s and 1 s; this holds for $G$ but not for $P$. The formula is therefore approximated as

IoU = \frac{\sum_{i=1}^{N} g_i \, p_i}{\sum_{i=1}^{N} \left( g_i + p_i - g_i \, p_i \right)}

where $g_i \, p_i$ is the element-wise multiplication of $G$ and $P$. In the denominator, the "intersection" between $G$ and $P$ is subtracted because we do not want to count it twice.
Once the set operators have been converted into arithmetic ones, the formula is differentiable and it is possible to evaluate the gradient.
However, IoU is an evaluation metric used to assess the goodness of a prediction, so a value of 1 corresponds to a perfect prediction. For this reason, the loss function is

L_{IoU} = 1 - IoU

Unfortunately, this function faces the same problem as CE in inferring the labels at the boundary of each object; therefore, as suggested in [34], we use the weighted intersection over union (wIoU) instead of the standard IoU.
The weighted intersection over union loss is

L_{wIoU} = 1 - \frac{1 + \sum_{c=1}^{C} \sum_{i=1}^{N} w_{ic} \, g_{ic} \, p_{ic}}{1 + \sum_{c=1}^{C} \sum_{i=1}^{N} w_{ic} \left( g_{ic} + p_{ic} - g_{ic} \, p_{ic} \right)}

where $N$ is the number of pixels and $C$ is the number of classes. The weights $w_{ic}$ are calculated as described above, and $g_{ic}$ and $p_{ic}$ are, respectively, the ground-truth value and the predicted value for pixel $i$ belonging to class $c$. We add 1 to both the numerator and the denominator in order to prevent the undefined division $0/0$.
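Analogously, a sketch of the weighted IoU loss is given below; the same assumption on the boundary-aware weighting scheme applies.

```python
import torch
import torch.nn.functional as F

def weighted_iou_loss(pred_logits, mask):
    """Weighted IoU loss: 1 - (weighted intersection + 1) / (weighted union + 1).
    The pixel weights are the same boundary-aware weights used for the weighted
    BCE (31x31 average pooling of the mask; the 1 + 5*|...| form is assumed)."""
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    pred = torch.sigmoid(pred_logits)
    inter = (weit * pred * mask).sum(dim=(2, 3))
    union = (weit * (pred + mask - pred * mask)).sum(dim=(2, 3))
    return (1 - (inter + 1) / (union + 1)).mean()
```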
2.2.4. Structure Loss
Following the intuition in [6], the weighted intersection over union and the weighted binary cross-entropy are considered together.
Instead of applying the average pooling over the mask, we apply it over the predictions; we do this to increase the diversity among the deep learning networks.
We also want to give more importance to the binary cross-entropy term, so we assign it a weight of 2, which leads to the following formulation:

L_{structure} = 2 \, L_{wBCE} + L_{wIoU}
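A minimal sketch of the resulting structure loss, with the average pooling applied to the predictions as described above, is given below; again, the 1 + 5·|·| weighting and the treatment of the weights as constants during backpropagation are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred_logits, mask):
    """Structure loss: 2 * weighted BCE + weighted IoU.
    The boundary-aware weights are computed from the predictions rather than
    the mask; they are detached so they act as constants during backprop."""
    pred = torch.sigmoid(pred_logits)
    pd = pred.detach()
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(pd, kernel_size=31, stride=1, padding=15) - pd)
    bce = F.binary_cross_entropy_with_logits(pred_logits, mask, reduction='none')
    wbce = (weit * bce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    inter = (weit * pred * mask).sum(dim=(2, 3))
    union = (weit * (pred + mask - pred * mask)).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union + 1)
    return (2 * wbce + wiou).mean()
```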
2.2.5. BoundExpStructure
We combine the structure loss, the boundary loss, and the exponential logarithmic loss. This is done to obtain better performance on the small structures of a highly imbalanced dataset and, at the same time, a better identification of the boundaries in the image.
2.2.6. Boundary Enhancement Loss
The boundary enhancement loss, proposed in [35], explicitly focuses on the boundary areas during training.
This loss performs very well since it requires neither any pre- or post-processing of the image nor a particular network architecture in order to work.
The Laplacian filter $\nabla^2$ is the cornerstone of this loss, as it generates strong responses around the boundaries and zero everywhere else. Applying the Laplacian filter to a mask $G$ gives

\nabla^2 G = \frac{\partial^2 G}{\partial x^2} + \frac{\partial^2 G}{\partial y^2}

A positive aspect of the Laplacian filter is that it can be implemented quite easily as a series of convolution operations. The idea is therefore to evaluate the difference between the filtered ground-truth labels and the filtered predictions.
The boundary enhancement loss, as proposed in [35], is

L_{BE} = \left\| \nabla^2 G - \nabla^2 P \right\|_2

where $\| \cdot \|_2$ is the $L_2$ norm. This can be computed easily as described in the original paper [35].
Following the same paper, we use the dice loss and the boundary enhancement loss together, each weighted appropriately, in combination with the structure loss:

L = L_{structure} + \lambda_1 \, L_{Dice} + \lambda_2 \, L_{BE}

The best results are achieved with appropriately tuned values of the two weights $\lambda_1$ and $\lambda_2$.
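A simplified sketch of the boundary enhancement term, using a discrete 3 × 3 Laplacian kernel implemented as a convolution, is shown below; the original formulation in [35] includes additional smoothing steps that are omitted here.

```python
import torch
import torch.nn.functional as F

# 3x3 discrete Laplacian kernel: strong response near mask boundaries,
# approximately zero in flat regions
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def boundary_enhancement_loss(pred_logits, mask):
    """L2 norm of the difference between the Laplacian-filtered ground truth
    and the Laplacian-filtered prediction (simplified sketch of L_BE)."""
    pred = torch.sigmoid(pred_logits)
    lap_gt = F.conv2d(mask, LAPLACIAN, padding=1)
    lap_pr = F.conv2d(pred, LAPLACIAN, padding=1)
    return torch.norm(lap_gt - lap_pr, p=2)
```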
2.2.7. Contour-Aware Loss
Contour-aware loss was proposed for the first time in [36]. It consists of a weighted binary cross-entropy loss in which the weights are designed to give more importance to the borders of the objects in the image.
The loss employs a morphological gradient edge detector: the difference between the dilated and the eroded label map is computed and then smoothed with a Gaussian blur. The resulting spatial weight map can be formulated as

W_{contour} = K \cdot \mathrm{Gauss}\left( \delta(G) - \epsilon(G) \right) + \mathbf{1}

where $\delta(\cdot)$ and $\epsilon(\cdot)$ are the dilation and erosion operations with a 5 × 5 kernel, $K$ is a hyperparameter assigning a high value to contour pixels (set to 5 empirically), and $\mathbf{1}$ is the matrix with 1 in every position.
The contour-aware loss is then computed as a weighted binary cross-entropy in which the pixel weights are given by $W_{contour}$:

L_{contour} = -\frac{1}{N} \sum_{i=1}^{N} W_{contour,i} \left[ g_i \log p_i + (1 - g_i) \log (1 - p_i) \right]

Finally, this contour-aware term is combined with the other terms to obtain the loss used in our experiments.
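The weight map can be implemented with standard operations, as in the following sketch, where morphology on the binary mask is realized with max pooling and the Gaussian standard deviation is an assumed value.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2D Gaussian kernel used for the smoothing step."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def contour_weight_map(mask, K=5.0):
    """W_contour = K * Gauss(dilation(G) - erosion(G)) + 1, with a 5x5 kernel.
    Dilation/erosion of a binary mask are implemented with max pooling."""
    dilated = F.max_pool2d(mask, kernel_size=5, stride=1, padding=2)
    eroded = -F.max_pool2d(-mask, kernel_size=5, stride=1, padding=2)
    blurred = F.conv2d(dilated - eroded, gaussian_kernel(), padding=2)
    return K * blurred + 1.0

def contour_aware_loss(pred_logits, mask):
    """Weighted BCE where the pixel weights are given by W_contour."""
    w = contour_weight_map(mask)
    bce = F.binary_cross_entropy_with_logits(pred_logits, mask, reduction='none')
    return (w * bce).sum() / w.sum()
```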
2.3. Data Augmentation
The training phase of a classifier and the resulting performance of the system are strongly influenced by the size of the dataset; this is also true for an ensemble method. Thus, to increase the amount of data available for training, several techniques may be applied to the original dataset. In the next paragraphs, we describe the different techniques adopted for data augmentation. We apply these techniques to the training set, both to the input samples and to their masks, while the test set is left unchanged.
Two different data augmentation approaches are tested:
DA1, a basic data augmentation consisting of horizontal flip, vertical flip, and 90° rotation.
DA2, the following operations are performed:
The image is displaced to the right or the left.
The image is displaced up or down.
The image is rotated by an angle randomly selected in the range [0°, 180°].
Horizontal or vertical shear using the MATLAB function randomAffine2d().
Horizontal or vertical flip.
Change in the brightness levels by adding the same random value (between 25 and 50) to all the RGB channels.
Change in the brightness levels by adding a different random value (between 25 and 50) to each RGB channel.
Addition of speckle noise: multiplicative noise is added to the image I as I + n × I, where n is uniformly distributed random noise with zero mean and variance 0.05.
Application of the technique “Contrast and Motion Blur”, described below.
Application of the technique “Shadows”, described below.
Application of the technique “Color Mapping”, described below.
Some artificial images (DA2 approach) contain only background pixels; to discard them, we simply delete all images with fewer than 10 pixels belonging to the foreground class. A minimal sketch of a few of these operations is shown below.
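The following NumPy sketch illustrates three of the DA2 operations (brightness change, speckle noise, and the foreground filter); the function names are ours and details may differ from the MATLAB implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng()

def change_brightness(img, same_shift=True):
    """DA2 brightness change: add a random value in [25, 50] to the RGB channels,
    either the same value for all channels or a different one per channel."""
    shift = rng.uniform(25, 50) if same_shift else rng.uniform(25, 50, size=3)
    return np.clip(img.astype(np.float32) + shift, 0, 255).astype(np.uint8)

def add_speckle(img):
    """DA2 speckle noise: I + n * I, with n uniform, zero mean, variance 0.05
    (a uniform distribution on [-sqrt(0.15), sqrt(0.15)] has variance 0.05)."""
    n = rng.uniform(-np.sqrt(0.15), np.sqrt(0.15), size=img.shape)
    return np.clip(img.astype(np.float32) * (1.0 + n), 0, 255).astype(np.uint8)

def keep_augmented(mask):
    """Filtering rule: keep an augmented sample only if its mask contains at
    least 10 foreground pixels."""
    return int(mask.sum()) >= 10
```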
2.3.1. Shadows
New image samples can be created by adding shadows to the original images. Shadows are created randomly to the left or to the right of the original image, and a fixed criterion is used to decide the intensity of the shadow (direction = 1: right; direction = 0: left).
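Since the exact intensity formula is not reproduced here, the following sketch shows only one plausible implementation, in which the image is darkened by a factor that decreases linearly toward the chosen side.

```python
import numpy as np

def add_shadow(img, direction=1, min_strength=0.5):
    """Hypothetical shadow augmentation: darken the image with a factor that
    decreases linearly toward the right (direction=1) or the left (direction=0).
    The exact intensity profile used in the paper is not reproduced here."""
    h, w = img.shape[:2]
    ramp = np.linspace(1.0, min_strength, w)        # 1.0 -> min_strength
    if direction == 0:                              # shadow on the left side
        ramp = ramp[::-1]
    shade = np.tile(ramp, (h, 1))[..., None]        # broadcast over channels
    return np.clip(img.astype(np.float32) * shade, 0, 255).astype(np.uint8)
```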
2.3.2. Contrast and Motion Blur
Another technique for data augmentation that allows one to derive new samples from an original dataset is based on the combination of contrast and motion blur. The first one increases or decreases the original contrast, and the second one simulates the movement of the camera taking the image. We developed two different contrast functions, and each time we choose one of them at random.
The first function adjusts the contrast through a parameter whose value determines whether the contrast is increased, decreased, or left unchanged. The value of the parameter is drawn at random from one of four ranges, corresponding to a hard decrease, a soft decrease, a soft increase, or a hard increase in contrast.
The second function behaves analogously: its parameter controls whether the contrast is increased, decreased, or left unchanged, and it is chosen randomly from four possible ranges, again corresponding to a hard decrease, a soft decrease, a soft increase, or a hard increase in contrast.
The blurring effect that mimics the movement of the camera is applied right after the contrast adjustment. We use the MATLAB function fspecial('motion', len, theta).
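A rough NumPy/SciPy sketch of the two steps is given below; the linear contrast function and the horizontal-only motion kernel are simplifications and do not reproduce the actual MATLAB procedures.

```python
import numpy as np
from scipy.ndimage import convolve

def adjust_contrast(img, alpha):
    """Simple linear contrast adjustment around mid-gray (illustrative only):
    alpha > 1 increases contrast, alpha < 1 decreases it, alpha = 1 is identity."""
    out = 127.5 + alpha * (img.astype(np.float32) - 127.5)
    return np.clip(out, 0, 255).astype(np.uint8)

def motion_blur(img, length=9):
    """Horizontal motion blur, a simplified stand-in for MATLAB's
    fspecial('motion', len, theta) followed by filtering."""
    kernel = np.full((1, length), 1.0 / length, dtype=np.float32)
    channels = [convolve(img[..., c].astype(np.float32), kernel, mode='nearest')
                for c in range(img.shape[2])]
    return np.clip(np.stack(channels, axis=-1), 0, 255).astype(np.uint8)
```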
2.3.3. Color Mapping
Changing the color map of an image produces a new image. In particular, it is possible to map the colors of an image onto those of another image. We generated image pairs by coupling each image in the original training set with another randomly selected image from the same set. We adopt the stain normalization toolbox (authored by Nicholas Trahearn and Adnan Khan, available online at https://warwick.ac.uk/fac/cross_fac/tia/software/sntoolbox/, accessed on 20 April 2022), which provides this functionality in three different versions.
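As a simple illustration of color mapping (not the toolbox itself), the following sketch matches per-channel mean and standard deviation between a source and a target image.

```python
import numpy as np

def color_transfer(source, target):
    """Map the color statistics of `source` onto those of `target` by matching
    per-channel mean and standard deviation (a Reinhard-style transfer).
    This is only an illustration; the stain normalization toolbox used in the
    paper implements more sophisticated mappings."""
    src = source.astype(np.float32)
    tgt = target.astype(np.float32)
    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - s_mean) / s_std * t_std + t_mean
    return np.clip(out, 0, 255).astype(np.uint8)
```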
4. Conclusions
In computer vision, we call semantic segmentation the task that involves the classification of each pixel in an image.
This is a very important task in several fields: in autonomous vehicles, it allows the identification of objects surrounding the vehicle; in medical diagnosis, it improves the ability to detect dangerous pathologies early and thus mitigates the risk of serious consequences.
In this work, we obtained state-of-the-art performance by proposing different ensembles of segmentation approaches. We have tested:
Different loss functions;
Different data augmentation approaches;
Different network topologies, i.e., convolutional neural networks and transformers (namely DeepLabV3+, HarDNet-MSEG, and the Pyramid Vision Transformer).
Finally, the outputs of the networks in the ensemble are combined by the sum rule.
The proposed ensemble has been tested on five benchmark datasets, providing state-of-the-art results: polyp detection, skin detection, leukocyte recognition, environmental microorganism detection, and butterfly recognition.
As future work, we aim to decrease the complexity of the ensembles through techniques such as pruning, quantization, low-rank factorization, and distillation.