1. Introduction
The Amazon rainforest covers an area of approximately 5.5 million km², which amounts to roughly one-third of the South American continent. It spreads across nine countries, forming an arc that begins in Bolivia and extends to French Guiana, with approximately 69% lying inside Brazilian borders. Altogether, the Amazon represents more than half of the remaining tropical forest area on Earth [
1] and contains more than 10% of all the above-ground biomass on the planet [
2]. The forest also holds about 20% of the total carbon sequestered by the world’s terrestrial forests [
3].
The forest covers the Amazon river catchment basin, through which 20% of all free-flowing fresh water on Earth [
4] flows. Together, the Amazon rainforest and river basin produce large quantities of water vapor that reach vast extents of South America. The so-called “flying rivers”, generated through evapotranspiration by the Amazon ecosystem, carry moisture to the north, west, and south, thus helping to regulate rainfall regimes in most of South America [
5]. The portion of moisture that heads south is also of enormous economic importance, irrigating crops and pastures and supplying water to the main hydroelectric plants on the continent.
Additionally, the biome contains unparalleled biodiversity, being the natural habitat of one-tenth of the currently known species [
6], the largest concentration of plant and animal species in the world. Such biodiversity forms an extremely complex, large-scale ecosystem whose evolution has naturally produced countless biological solutions. Currently, the vast majority of those species remain insufficiently studied to allow the exploitation of this enormous biotechnological potential.
Unfortunately, for many decades, the Amazon has faced several threats as a result of unsustainable economic development. Among the main threats to the biome are the expansion of agricultural activities at an industrial scale, especially soybean cultivation and cattle raising, forest fires, and illegal mining and logging [
7,
8,
9]. All those factors are directly associated with deforestation. Deforestation in the Brazilian Amazon also has important social and humanitarian consequences. While fighting to protect the forest for future generations, environmentalists [
10,
11] and members of traditional indigenous communities [
12] are often targets of attacks by loggers [
13], land grabbers, or illegal fishermen, hunters, and miners.
The above-mentioned facts indicate the importance of preserving the Amazon forest, and it is well known that Remote Sensing (RS) data provide key capabilities for large-scale, low-cost, and risk-free environmental monitoring [
14]. According to the Brazilian National Institute for Space Research (INPE) [
15], deforestation accelerated significantly in the Brazilian Legal Amazon (BLA) area during the 1990s and early 2000s. Unfortunately, since 2019, it has reemerged with substantial intensity [
16,
17]. It is an urgent situation; the World Wildlife Fund [
18] estimates that more than a quarter of the rainforest may vanish by 2030 if the current rate of deforestation continues. Important consequences of deforestation can already be observed; a recent report paper [
19] asserts that between 2010 and 2019 Brazil’s Amazon carbon stock was in deficit: the forest gave off 16.6 billion tonnes of carbon dioxide while drawing down only 13.9 billion tonnes.
Since the late 1980s, INPE has been in charge of monitoring deforestation in the Brazilian Legal Amazon. Among its main monitoring projects are PRODES [
20], DETER-A [
21] and DETER-B [
22], which are mainly based on visual interpretation of RS data. The main reason for such a human-intensive approach is the accuracy required of an official, governmental system. There is, therefore, a high demand for automatic methods that can meet the accuracy expectations while lessening human intervention and reducing response times.
Deforestation detection can be posed as a change detection problem. By and large, identifying changed or unchanged pixels requires a pair of coregistered images of the same geographic location acquired at different dates. Considering t1 as the acquisition date of an image and t2 as the date of an image of the same location acquired more recently, the problem to be tackled basically amounts to finding, at each pixel position of the image pair, differences associated with a level of forest suppression that can be regarded as deforestation.
Training a classifier for this task is a challenging problem. The first issue to be tackled is the intraclass variability of the unchanged class (i.e., no-deforestation), which encompasses forest pixels that remain standing at t2, and of the changed class (i.e., deforestation). The deforestation pixel clusters may differ substantially depending on the deforestation practices in the region, which may be associated with clear-cutting, selective logging, or degradation through forest fire. Additionally, the texture in a dense forest may differ because of the sun position or other seasonal aspects [
23]. The other important difficulty of the target application is the high unbalance between unchanged and changed class samples. Regardless of the deforestation rates, the occurrence of deforestation pixels is still at least one order of magnitude lower than that of no-deforestation pixels.
Despite the success of deep learning methods in many application fields, deep neural networks (DNNs) are known to require large amounts of labeled data to be properly trained [
23]. That can be a serious problem for operational applications, especially those based on remote sensing, for which obtaining such reference data is usually costly and time-consuming. In the case of the deforestation detection application, for example, creating such reference data involves highly trained specialists, acquainted with the characteristics of the forest cover through time, and with the different spatial and spectral deforestation patterns. The selection of a proper DL model for RS applications should, therefore, take into consideration its demand for training data.
We hypothesize that the problems mentioned in the previous paragraphs, i.e., the intraclass variability, high class unbalance, and high demand for training data, can be successfully dealt with by selecting a proper deep learning architecture and a proper training procedure. Specifically, we believe that a fully convolutional, dense labeling architecture may lessen the intraclass variability issue, since it can learn not only specific class representations but also interrelationships among the classes of interest. Additionally, a proper design of the loss function, i.e., one which gives higher importance to hard-to-classify samples, may help cope with intraclass variability and class unbalance while also lessening the demand for training data.
In this work, we propose a deep learning method based on a fully convolutional model, namely, the DeepLabv3+ [
24]. In a previous publication [
25], we presented preliminary results of this research. In the present work, we bring a thorough evaluation of the full set of experiments carried out afterward. In the course of this research, we adapted the DeepLabv3+ model to deforestation detection by setting its input as a tensor formed by stacking the bands of two coregistered images. We also adjusted many of the model’s hyperparameters, aiming at improving the delineation accuracy of deforestation polygons. Additionally, we replaced the original DeepLabv3+ cross-entropy loss function with the weighted focal loss function [
26] and fine-tuned its parameters.
We also compared the results obtained with the proposed method with those reported in a previous work [
27] over the same study area. Besides comparing the respective deforestation detection accuracies, we devised experiments to assess the methods’ demands for training data. In short, the major contributions of this work are:
Adaptation of the DeepLabv3+ semantic segmentation model to deforestation detection.
Thorough grid search on the loss function hyperparameters for assessing their relative impact on deforestation mapping accuracy.
Evaluation of the proposed method’s sensitivity to the amount of training data.
Extensive performance evaluation of the proposed method on an area of the Amazon forest.
Comparison of the proposed method with previous deep-learning-based methods for deforestation detection.
The remainder of our article is organized as follows. We first review related works in
Section 2. In
Section 3, we describe the evolution of the model lineage that brought about DeepLabv3+. In
Section 4, we present the adaptations of the original DeepLabv3+ model carried out in this research.
Section 5 is dedicated to the description of the experimental procedure, and, in
Section 6, we present and analyze the experimental results. We end the manuscript by presenting conclusions and briefly discussing directions for further work.
2. Related Works
According to Volpi and Tuia [
28], deep learning (DL) approaches for pixel-level image classification can be subdivided into two main approaches: patchwise classification and semantic segmentation. Broadly speaking, given an image patch, patchwise classification produces a single decision for its central pixel. In contrast, semantic segmentation produces, at once, concurrent decisions for all pixels of the patch.
By and large, remote sensing (RS) change detection (CD) approaches can be understood as an extension of pixel-level image classification. The main difference resides in the fact that conventional image classification takes data from a single time instance and provides a class label for each image pixel. CD, on the other hand, relies on two or more coregistered images taken at distinct time instances and classifies image pixels into changed or unchanged areas, sometimes attributing a semantic label to the changed pixels.
Dealing with change detection through deep learning leads to three main approaches. The first one is postclassification, in which the images from different epochs are classified, and afterward the classifications are compared to produce the change map. The second approach involves architectures containing two encoders, the so-called Siamese networks [
29], where each encoder processes an image of the same region acquired at a different time instance. Finally, in the so-called early fusion approaches, image features acquired at different dates are combined before being forwarded to a single encoder network. We briefly describe below notable DL-based studies dedicated to deforestation detection.
Postclassification approaches based on DL are not so common. Zulfiqar et al. [
30] exploited Landsat images for deriving a postclassification, semantic segmentation approach based on the U-Net architecture. The method was applied for forest estimation and change detection in Pakistan.
A number of early fusion patchwise classification approaches can be found in the literature. Khan et al. [
31] used a variant of the VGG-16 architecture and a long time series of Landsat images for forest change detection in Australia. Masolele et al. [
32] evaluated several deep learning approaches for land-use classification following deforestation using a large-scale dataset of Landsat images from Latin America, Africa, and Asia, where forests were converted to distinct uses. The approaches evaluated encompass 2D-CNN, LSTM, 3D-CNN, Hybrid CNN-LSTM, ConvLSTM, and CNN-MHSA. In a deforestation mapping effort in Ukraine, Shumilo et al. [
33] compared LSTM and MLP for deforestation detection using Sentinel-1 and Sentinel-2 images.
Approaches based on semantic segmentation have recently become more prevalent in the literature. Lee et al. [
34] compared semantic segmentation results provided by U-Net and SegNet for land-use/land-cover classification of high-resolution Kompsat-3 satellite images in Korean regions affected by deforestation. Isaienkov et al. [
35] compared the performance of different semantic segmentation approaches for deforestation detection in the Ukrainian Forest Ecosystem. The authors exploited a dataset of Sentinel-2 multispectral images for the evaluation of both early fusion and dual branch networks, comprising seven approaches inspired by U-Net, Siamese networks, and LSTM.
Some recent remote sensing image classification methods, although not yet applied to deforestation detection, are worth mentioning due to their notable performance. Jia et al. [
36] proposed a graph-in-graph (GiG) model and a related convolutional network (GiGCN) dedicated to HSI classification from a superpixel viewpoint. Some of those core ideas trace back to GEOBIA approaches [
37], while the GiG representation covers information inside and outside superpixels, corresponding to the local and global characteristics of geo-objects. Meanwhile, the external graph is constructed according to the spatial adjacency relationships among superpixels. The approach extracts hierarchical features and integrates them into multiple scales, improving the discriminability of GiGCN. In addition, ensemble learning is incorporated to further boost the method’s robustness. In order to classify hyperspectral images, Ahmad et al. [
38] combined 3-D and 2-D inception networks with an attention mechanism. The resulting attention-fused hybrid network relies on three attention-fused parallel hybrid subnets with different kernels, and uses high-level features to enhance the final prediction maps. That approach is able to filter out the features critical for classification. Hong et al. [
39] proposed a general multimodal deep learning framework aiming at overcoming the difficulties of DL in finely classifying complex scenes, due to the limitation of information diversity. The method is additionally able to explore the cross-modality inherent to a number of RS image classification applications. That framework, besides being able to perform pixel-wise classification, is also applicable to spatial information modeling with convolutional neural networks.
Several DL approaches specifically dedicated to deforestation detection in the Amazon were recently presented. De Bem et al. [
40] evaluated the SharpMask, U-Net, and ResU-Net semantic segmentation algorithms for deforestation detection. The proposed approach employs early fusion of Landsat images. Maretto et al. [
41] adapted the U-Net architecture for deforestation detection using Landsat images. The authors compared early fusion and dual branch approaches, and created spatiotemporal variations of the U-Net architecture, which make it possible to incorporate both spatial and temporal contexts. Tovar et al. [
42] evaluated the use of both spatial and channel attention models for deforestation detection. Watanabe et al. [
43] used FCNs to build an early fusion semantic segmentation approach for deforestation detection in Brazil and Peru using PALSAR-2/ScanSAR images. Taquary et al. [
44] proposed a combination of LSTM and CNN for deforestation detection based on orbital SAR time series of Sentinel-1 images. The approach consists of an early fusion, patchwise classification technique. Shumilo et al. [
45] exploited Sentinel-2 and Sentinel-1 imagery for deforestation detection. The approach employs U-Net-based variants and a semisupervised learning technique. Experiments were carried out on a dataset of the Kyiv region. Torres et al. [
46] evaluated several fully convolutional network architectures (U-Net, ResU-Net, SegNet, FC-DenseNet, and DeepLabv3+) for deforestation detection. The performance of such state-of-the-art models was evaluated using two Brazilian Amazon datasets with different spatial and spectral resolutions (imagery from Landsat-8 and Sentinel-2). ResU-Net consistently outperformed the other methods, providing F1-scores of 70.7% for the Landsat dataset and 70.2% for the Sentinel-2 dataset.
Adarme et al. [
27,
47] presented an evaluation of patchwise classification methods for automatic deforestation detection in the Amazon and Cerrado biomes. The compared approaches comprised both early fusion networks (SVM, CNN, CSVM) and dual branch networks (Siamese networks). The evaluated DL-based approaches outperformed the SVM baseline, both in terms of F1-score and overall accuracy. Among those works, [
47] presented the most thorough experimental evaluation. The work presented herein follows the same experimental protocol in order to make a thorough comparison with those state-of-the-art approaches. Among the approaches presented in [
47], the Early Fusion (EF) CNN and Siamese CNN (S-CNN) were used as experimental benchmarks for the evaluation that will be presented in the next sections.
The importance of [
47] for deforestation detection is corroborated by its follow-up developments, among which is the present research. Soto et al. [
23] represents another unfolding of [
47]. That work presents a DL-based approach for domain adaptation in the context of change detection tasks. The model adapts DANN [
48] as well as the EF approach of [
47]. That domain adaptation approach was shown to improve the accuracy of cross-domain deforestation detection, considering different sites in the Amazon and Brazilian Cerrado biomes.
3. Evolution of the DeepLab Family Design
DeepLab is a family of semantic segmentation CNN models whose evolution aggregated several consistent breakthroughs. In this description, we follow its timeline to present the sequence of main ideas behind its evolution.
The earliest DeepLab version was proposed in [
49] and brought among its major contributions a particular implementation of the “hole algorithm”, originally devised to efficiently compute the undecimated wavelet transform [
50]. In its first release, the so-called
atrous convolution came about to prevent the network’s output stride from going beyond the target value of eight. The DeepLabv1 encoder was built on top of the VGG-16 [
51] architecture. Among the proposed adaptations were: (1) turning the fully connected layers into convolutional ones (the first being an atrous convolution with a rate of four); (2) removing the last two pooling layers; and (3) converting the last three original VGG-16 convolutional layers into atrous convolutional ones with a dilation rate of two.
It should be stressed that atrous convolutions, expressed by a regular, sparse linear transfer function, can considerably expand their receptive fields relative to traditional convolutional filters, without enlarging the number of parameters or the computational complexity. They prevent the network’s output stride from degrading the spatial resolution of the feature maps, as a conventional sequence of convolutional and pooling blocks would do. The authors argue that the bigger the output stride, the less accurate the silhouettes of the segmented objects become.
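This parameter-free receptive-field expansion can be made concrete with the standard dilated-kernel span formula (a small illustrative helper, not code from the original work):

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated (atrous) kernel.

    A k x k kernel with dilation rate d samples the input over a span of
    k + (k - 1) * (d - 1) pixels per axis, while the number of trainable
    weights stays at k * k.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# A 3x3 kernel with dilation 2 (as in DeepLabv1's converted VGG-16
# layers) covers a 5x5 window with only 9 weights.
assert effective_kernel_size(3, 1) == 3   # ordinary convolution
assert effective_kernel_size(3, 2) == 5
assert effective_kernel_size(3, 12) == 25
```

Thus a stack of atrous layers can see a large context window while keeping the parameter count of small kernels.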
The second version of DeepLab adopted the Spatial Pyramid Pooling (SPP) concept, which was introduced in the SPP-net architecture [
52]. DeepLabv2 [
53] proposed an atrous convolution-based SPP, the so-called ASPP, which comprises several parallel branches of atrous convolutions, with different receptive fields, coming together in a single set of feature maps. Therefore, such set of feature maps is capable of capturing information at various scales [
53]. Thus, in comparison to the first release, DeepLabv2 is more efficient in segmenting objects at multiple scales. In addition, in its second release, alongside the VGG-16-based variant, DeepLab gained a residual variant built on top of the ResNet architecture [
54].
Until its second version, DeepLab relied on fully connected Conditional Random Fields (CRF) [
55] to improve semantic segmentation outcomes. By and large, the CRF processes the outcomes of the deep convolutional neural network to provide the final output map. The use of CRFs was discontinued in the third DeepLab release [
56]. As a surrogate for the CRF, DeepLabv3 uses a naive decoder which performs bilinear interpolation of the output probability maps. As for the encoder, the authors experimented with variants using backbones based on both VGG-16 and ResNet. The resulting features pass through an updated ASPP version which includes 1 × 1 filters, which stand for the degeneration of a high-rate atrous filter that avoids image border effects, and image-level features, also known as image pooling [
57]. Thus, the ASPP models multiple levels of context, gathering from local to global features.
Finally, the DeepLabv3+ model [
24] adopted an encoder–decoder architecture with a DeepLabv3-like encoder structure. The proposed alternatives were based on ResNet-101 and Xception [
58] backbones. Furthermore, a simple decoder module was devised aiming at enhancing segmentation results, especially along object boundaries. On one hand, the final features produced by the backbone pass through the ASPP to produce multiscale feature maps, which are then combined by a 1 × 1 convolution block. On the other hand, low-level backbone features go directly to the decoder, where they are submitted to a 1 × 1 convolution block; in this context, that block reduces the number of channels associated with those features, so that they do not outweigh the importance of the encoder’s output. The result is concatenated with an upsampled version of the ASPP outcome, after the latter has passed through its respective 1 × 1 convolution block. The concatenated features are then submitted to a 3 × 3 convolution block that refines them. Finally, to create the network output, such features are submitted to a simple bilinear upsampling module.
When adapting the Xception model [
58] as the backbone of the encoder network for semantic segmentation, the authors used a deeper model, as in [
59]. Max-pooling operations were replaced by depthwise separable convolution with striding, and batch normalization [
60] and ReLU activation were added after each 3 × 3 depthwise convolution, as in the MobileNet design [
61].
4. DeepLab-based Change Detection Method
In this section, we describe the adaptations performed on the DeepLabv3+ model to create the hereinafter called DeepLab-based Change Detection (DLCD) model. As the first modification, it adopts the early fusion approach, taking as input a synthetic image created by stacking, along the spectral dimension, two coregistered images from different epochs. As a consequence of being based on an FCN, DLCD delivers dense labeling of the input patches.
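The early fusion input can be illustrated with a short NumPy sketch (the band count and patch size below are illustrative assumptions, not the dataset's actual specification):

```python
import numpy as np

# Two coregistered images of the same area, acquired at dates t1 and t2
# (hypothetical shape: 64 x 64 pixels, 4 spectral bands each).
img_t1 = np.random.rand(64, 64, 4).astype(np.float32)
img_t2 = np.random.rand(64, 64, 4).astype(np.float32)

# Early fusion: stack the bands along the spectral dimension, producing
# one synthetic 8-band image that feeds the single DLCD encoder.
stacked = np.concatenate([img_t1, img_t2], axis=-1)
assert stacked.shape == (64, 64, 8)
```

The network then sees temporal change encoded as extra channels of a single input tensor, rather than as two separate branches.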
Still considering the DLCD input layer, its design was also impacted by the current BLA deforestation circumstances. Indeed, in the dataset exploited in this work, deforestation areas (or objects) are small and sparse, leading to a large class unbalance, with the vast majority of the training set pixels belonging to the no-deforestation (unchanged) class. It should be stressed that such problem contours are dissimilar to the reality of PASCAL VOC 2012 and Cityscapes, the datasets used in [
24]. So, considering such dissimilarities, we experimented with patches of smaller sizes than in [
24]. After a preliminary evaluation, we obtained better and more consistent results using a patch size of 64 × 64 pixels. In terms of the DLCD backbone, we opted for the Xception architecture, since it provided the best results in [
24].
As a consequence of such a small input patch, we opted for an encoder output stride of 8, the smallest evaluated in the original DeepLabv3+ [
24]. In order to adjust the network architecture to the selected input patch size and output stride, we changed the dilation rates of the atrous convolutions in the ASPP to 3 and 6 (originally, dilation rates of 12 and 24 were employed), and removed the atrous convolution with the highest rate, because it would degenerate into a 1 × 1 convolution. Regarding the Xception backbone, to achieve an output stride of 8, we used only three convolutions with a stride of 2 in the entry flow, and only convolutions with a stride of 1 in the middle and exit flows.
Figure 1 shows the architecture of the proposed DLCD model.
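As a rough check of how these choices fit together, the decoder data flow described in the previous section can be traced shape-by-shape. This is a sketch only: the 256 ASPP output channels, the 48 reduced low-level channels, and the low-level feature stride of 4 follow the original DeepLabv3+ design and are assumptions here, not values stated in this text.

```python
def dlcd_decoder_shapes(h=64, w=64, output_stride=8, low_level_stride=4,
                        aspp_ch=256, low_ch=48):
    """Trace (channels, height, width) through the DeepLabv3+ decoder.

    Only shapes are traced; the real model applies learned convolutions
    at every step, and a final 1x1 classification layer (omitted here)
    maps the refined features to class scores.
    """
    aspp = (aspp_ch, h // output_stride, w // output_stride)       # ASPP maps after 1x1 block
    low = (low_ch, h // low_level_stride, w // low_level_stride)   # low-level feats after 1x1 conv
    upsampled = (aspp[0], low[1], low[2])                          # bilinear upsampling of ASPP maps
    concat = (upsampled[0] + low[0], low[1], low[2])               # channel-wise concatenation
    refined = (aspp_ch, low[1], low[2])                            # 3x3 refinement block
    output = (refined[0], h, w)                                    # final bilinear upsampling
    return concat, output

concat, output = dlcd_decoder_shapes()
assert concat == (304, 16, 16)   # 256 ASPP + 48 low-level channels on a 16x16 grid
assert output == (256, 64, 64)   # back to the 64x64 input patch resolution
```

With the 64 × 64 patch and output stride of 8 adopted here, the ASPP operates on 8 × 8 feature maps, which is why the high-rate atrous branches had to be shrunk or removed.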
As mentioned before, deforestation detection problems are usually characterized by a large class unbalance, where, by and large, changed pixels are orders of magnitude less numerous than unchanged ones. Thus, the cross-entropy loss, normally exploited in image classification problems, tends to fail to deliver consistent network weights for detecting changes. The main reason is that, in dense labeling problems, it is generally not possible to force training set class balancing by simply replicating examples of the less common class.
One alternative for dealing with such scenarios is the so-called weighted focal loss function (WFL) [26]. WFL is presented in Equation (1), where y ∈ {0, 1} specifies the ground-truth class label, for which y = 1 indicates deforestation, while y = 0 means an unchanged pixel; p is the deforestation probability estimated by the proposed model for a specific pixel; and α represents the weight associated with the deforestation class:

WFL(p_t) = −α_t (1 − p_t)^γ log(p_t),    (1)

where p_t = p and α_t = α if y = 1, while p_t = 1 − p and α_t = 1 − α otherwise.

Equation (1) may be subdivided into three parts: (1) the cross-entropy term −log(p_t); (2) the weight α_t relative to each class; and (3) the modulating term (1 − p_t)^γ, which reduces the loss magnitude for well-classified pixels while enhancing the loss of misclassified pixels, with a larger γ intensifying this effect.
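A minimal NumPy sketch of the WFL may make the role of each term concrete (the α and γ values below are illustrative defaults, not the tuned ones):

```python
import numpy as np

def weighted_focal_loss(p, y, alpha=0.8, gamma=2.0, eps=1e-7):
    """Per-pixel weighted focal loss, as in Equation (1).

    p: predicted deforestation probability; y: ground truth
    (1 = deforestation, 0 = unchanged); alpha weights the
    deforestation class; gamma down-weights easy pixels.
    """
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.9, 0.6, 0.1])
y = np.array([1, 1, 1])
losses = weighted_focal_loss(p, y)
# The modulating term shrinks the loss of the well-classified pixel
# (p = 0.9) far more than that of the misclassified one (p = 0.1).
assert losses[0] < losses[1] < losses[2]

# With gamma = 0 and alpha = 0.5, the loss reduces to (scaled) ordinary
# cross-entropy, matching the degeneration noted in the text.
ce = -0.5 * np.log(np.clip(p, 1e-7, 1.0 - 1e-7))
assert np.allclose(weighted_focal_loss(p, y, alpha=0.5, gamma=0.0), ce)
```

In training, such per-pixel losses are averaged over each patch, so the rare deforestation pixels and the hard-to-classify pixels dominate the gradient.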
6. Results and Discussion
In order to investigate the influence of the weighted focal loss (WFL) function parameters α and γ on the classification performance of the DLCD method, we performed a grid search. Accordingly, in the experiments, we considered the following values of α: 0.9, 0.8, 0.7, 0.6, and 0.5, while, for each α value, we varied γ from 0 to 5, which amounts to a total of 30 (α, γ) combinations. The selected α values were intended to cover a range that goes from not weighting the classes, i.e., α = 0.5, to weighting up to one order of magnitude in favor of the deforestation class, i.e., α = 0.9. We recall that the proportion of deforestation pixels in the training tiles varies from 1% to 3%, depending on the training scenario, but, as explained in the previous section, only patches with deforestation pixels were selected for training, so we believe that α = 0.9 was a good upper bound for that parameter. As for the selected range of γ values, we basically adopted the same range used by the authors of [26] (who proposed the WFL). It is important to observe that, when γ = 0, the WFL becomes equivalent to the balanced cross-entropy loss. Additionally, with γ = 0 and α = 0.5, the loss degenerates to the ordinary cross-entropy loss function, which was the loss employed in the original implementation of DeepLabv3+.
For each tuple of WFL parameters, the mean results of 10 network initialization–training cycles using the four-tile training set scenario are presented in
Figure 4. The top chart of the figure presents the average precision outcomes; below it is the bar graph containing average recall values; and average F1-scores are presented in the third chart. The table at the bottom of
Figure 4 shows the parameter values employed in each experiment. In addition, the results associated with EF and S-CNN [
47], also expressing the mean over 10 initialization–training rounds, were included in that figure after the DLCD results.
Analyzing the precision values in
Figure 4, one can notice that the DLCD method trained using WFL consistently produced smaller amounts of false positives than EF and S-CNN. The best DLCD variant in this respect, DLCD-29, surpassed S-CNN by more than 25%. However, an analysis of the recall values makes it clear that, on the other hand, EF produced fewer false negatives than the DLCD variants, a behavior that was not observed with S-CNN. This fact can be observed on the recall chart in
Figure 4, in which EF outperformed the best of the evaluated DLCD variants in terms of recall, DLCD-3, by approximately 1.5%.
Looking at the results obtained for distinct α values, one can notice that the α value represented by the magenta shaded bars brought fewer false positives (see the precision chart). Nonetheless, it provided more false negatives, downgrading the respective recall values. In contrast, the α value represented by the reddish bars tended to bring superior recall values, suggesting that it tends to produce fewer false negatives. On the other hand, that value produced the poorest precision, providing more false positives.
The observation of the average F1-scores, which stand for the harmonic mean between precision and recall, leads to the conclusion that, in the performed experiments, the F1-score balances the conflicting behaviors of precision and recall over the range of parameter values under analysis. Among the 30 DLCD variants, only three were outperformed by EF and S-CNN in terms of F1-score. The best mean F1-score was obtained by DLCD-14 (see its α and γ values in the table at the bottom of Figure 4), outperforming EF by more than 10%. That outcome suggests a clear superiority of the presented method in terms of F1-score.
Table 4 aggregates the results shown in
Figure 4. In an attempt to isolate the impact of the γ parameter,
Table 4 presents the average values obtained for each γ value, disregarding α. By analyzing the results presented in
Table 4, one can notice that the highest average F1-score, recall, and precision were obtained when γ is equal to zero. This result suggests that it is possible to obtain stable results when the WFL is reduced to the weighted cross-entropy (WCE). However, the choice of the γ value should take into account that the best individual F1-score in
Figure 4 was obtained by DLCD-14 (its α and γ values are listed in the table of Figure 4). Anyway, in general terms, increasing γ to values beyond two has a noticeable impact on F1-score, precision, and recall. It should be noticed, however, that the decay in precision outweighs that in recall. This result suggests that false positives (false deforestation alarms) are even more impacted by the increase of γ than false negatives.
In another set of experiments, we sought to evaluate the impact of training set size on DLCD as well as on EF and S-CNN. For that purpose,
Figure 5 presents F1-scores obtained while using distinct numbers of tiles to generate the training data. Hence, the more tiles employed, the larger the training set. For all numbers of tiles selected, DLCD-14 performed better than both EF and S-CNN. Comparing S-CNN to EF, S-CNN presented better results for 1, 2, and 3 tiles, while EF performed better for a training set generated from 4 image tiles. However, the most important observation is that the smaller the training set, the greater the advantage of DLCD over EF and S-CNN. This suggests that the DLCD method is particularly suitable in cases where the training set is limited. It should be stressed that this is a very important result, since, in remote sensing change detection, a limited amount of training data is a quite common condition. Thus, considering such an advantage, the use of DLCD should be preferable to EF and S-CNN.
Figure 6 presents the change maps produced by EF, S-CNN, and DLCD-14 for tiles 6 and 14, between the years 2016 and 2017. The left column shows the results for tile 6, and the right column shows the outcomes for tile 14. The top row presents the results produced by EF, the middle row shows those produced by S-CNN, and the bottom row shows the results obtained with DLCD-14. Comparing those maps, it is possible to notice that the proposed method produced a slightly larger amount of false no-deforestation errors (plotted in blue) than EF and S-CNN. On the other hand, DLCD produced a notably lower number of false deforestation errors (indicated in red). Such an outcome is particularly important for operational reasons, considering the effort and costs involved in the field verification of actual deforestation by the local authorities in charge of penalizing the perpetrators or mitigating the effects of illegal deforestation.
To close this section, we observe that, although the proposed DLCD method produced better F1-scores than the baseline methods for the majority of the (α, γ) combinations, in a few combinations, i.e., (0.7, 3), (0.8, 4), and (0.9, 4), the accuracy values were slightly inferior. Additionally, the differences in accuracy between the best and the worst hyperparameter combinations vary substantially (up to about 19% in F1-score). Therefore, the selection of proper α and γ values is key to obtaining superior results. Additionally, although we designed the experiments so as to avoid a high level of spatial correlation between the training and test samples (by extracting them from different image tiles), we cannot be sure that the optimum α and γ values found for the target site are also the best for other sites with dissimilar forest covers. Such generalization issues open grounds for further investigation.
7. Conclusions
In this work, we proposed and evaluated a deep-learning-based deforestation detection method built on top of the DeepLabv3+ architecture. The original DeepLabv3+ model was adapted in several ways, and we adopted the weighted focal loss function during training, aiming at improving the delineation accuracy of the detected deforestation polygon silhouettes and at dealing with the extreme class unbalance characteristic of the deforestation detection application. The proposed method, denoted DeepLab-based Change Detection (DLCD), was compared with two baseline methods, Early Fusion (EF) and Siamese Convolutional Neural Network (S-CNN), which belong to the current state of the art for the deforestation detection application.
The experiments addressed the analysis of the impact of the training set size on the method’s performance, and we carried out an extensive grid search on the weighted focal loss hyperparameters in the DLCD training procedure. When compared to the baseline methods, despite presenting smaller recall values, variants of the proposed method significantly outperformed the EF and S-CNN methods in terms of precision and F1-score. In addition, the performance gains of the DLCD method were more significant for smaller amounts of training data. That is an important result, since limited amounts of training data are a reality in many change detection problems. Another significant advantage of the proposed method is that, being a dense labeling model, it provides considerably lower inference times than patchwise classification methods like S-CNN and EF. Additionally, DLCD produced a notably lower number of false positive errors, which is crucial for such an application, considering the effort and costs involved in enforcing environmental laws in a vast region like the Brazilian Legal Amazon.
The adoption of the weighted focal loss proved effective in dealing with the class unbalance characteristic of the target application. Nonetheless, such a loss function introduced additional hyperparameters when compared with S-CNN and EF. The evaluation presented herein indicates that the best-performing α and γ combinations reported in this work may be good starting points in future works, speeding up the fine-tuning of those parameters.
In regard to directions for future work, we plan to investigate whether the performance of the proposed semantic segmentation approach is transferable to other domains, e.g., training the proposed model with samples from a particular site and testing it on images covering different forest sites. In fact, investigating deep-learning-based domain adaptation is planned for the continuation of this research. Another possibility is adapting recent image classification approaches not yet applied to this problem ([
36,
38,
39] for instance) for the deforestation detection problem.