This section presents the characteristics of the Sentinel-1 mission, the classification techniques, and the methodology used in this article.
2.1. Sentinel-1 Mission
The data set employed in this study was obtained through the Sentinel-1 mission, which is composed of two satellites, namely Sentinel-1A and Sentinel-1B, launched on 3 April 2014 and 25 April 2016, respectively [
27]. They are in a sun-synchronous near-polar orbit at an altitude of 693 km, operating day and night with a 12-day repeat cycle, and they perform C-band SAR imaging [
28,
29]. Each satellite carries a SAR sensor capable of generating medium- and high-resolution measurements [
28]. In
Table 1, some of the Sentinel-1 system characteristics, such as the operating band, bandwidth, antenna size, antenna weight, and pulse repetition frequency, are presented [
27]. The Sentinel-1 systems support single- (HH or VV) and dual-polarization (HH + HV or VV + VH) operations, implemented by a transmit chain (switchable between H or V) and two parallel receive chains for H and V polarization. Additionally, the stripmap (SM), interferometric wide swath (IW), and extra-wide swath (EW) products are available with single or dual polarization. However, the wave (WV) product is only available with single polarization [
27].
According to [
27], Sentinel-1 can acquire data in four modes, which are described in the following and shown in
Figure 1. First, SM is the standard SAR stripmap imaging mode, in which a continuous sequence of pulses with a fixed elevation angle illuminates a strip of ground. Second, in IW mode, the data are acquired over three sub-swaths using the terrain observation with progressive scans SAR (TOPSAR) imaging technique. Third, in EW mode, the data are acquired over five sub-swaths, also using the TOPSAR imaging technique; the EW mode provides extensive swath coverage at the expense of spatial resolution. Finally, in WV mode, the data are obtained in small stripmap scenes called “vignettes”, located at regular intervals along the swath.
The four acquisition modes (SM, IW, EW, WV) can generate SAR level 0, level 1 SLC, level 1 GRD, and level 2 OCN products [
27], as shown in
Figure 2. The product used in our research is level 1 GRD with high-resolution SAR images, in IW mode, as shown in
Figure 3. It consists of focused SAR data that have been detected, multi-looked, and projected to ground range using an Earth ellipsoid model.
Table 2 shows some examples of applications separated by operating modes.
Sentinel-1 Image Data Set
For this study, the Sentinel-1 SAR image data set contains 400 images (patches) in VH and VV polarization with maritime targets (platforms and ships), equally distributed (i.e., 200 patches with platforms and 200 patches with ships). The image patches were acquired at different times, and some targets appear in more than one patch. Despite coming from the same target, such patches can be considered distinct because of the SAR image formation process, which is affected by variations in backscatter and by sea currents that displace the floating platforms.
Following the methodology employed in [
26], the original amplitude-type images were transformed into sigma-zero (dB) images.
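A minimal sketch of this amplitude-to-sigma-zero conversion, assuming the standard Sentinel-1 radiometric calibration sigma0 = DN²/A² with a single illustrative calibration constant A (in practice, SNAP applies a per-pixel calibration look-up table):

```python
import numpy as np

def amplitude_to_sigma0_db(dn, cal_coeff):
    """Convert amplitude digital numbers (DN) to calibrated
    sigma-zero in dB: sigma0 = DN^2 / A^2, then 10*log10."""
    dn = np.asarray(dn, dtype=float)
    sigma0 = (dn ** 2) / (cal_coeff ** 2)
    # Clip to avoid log of zero on empty pixels
    sigma0 = np.clip(sigma0, 1e-10, None)
    return 10.0 * np.log10(sigma0)

# Example: a 2x2 patch of amplitude values, illustrative constant A = 500
patch_db = amplitude_to_sigma0_db([[500.0, 1000.0], [250.0, 500.0]], 500.0)
```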
Figure 4 presents an optical image and its respective SAR image for the following targets: (i) Floating Production Storage and Offloading (FPSO) platform P-48; (ii) Floating Production Unit (FPU) P-53; (iii) Tension Leg Wellhead Platform (TLWP) P-61; (iv) Fixed Platform (FIX) PCH-1; and (v) Semisubmersible (SS) P-65. These images were collected with the ground-range-detected product; interferometric wide swath mode; high spatial resolution (20 m × 22 m, range × azimuth); pixel spacing equal to 10 m × 10 m in range and azimuth, respectively; and 5 × 1 looks (the equivalent number of looks is 4.4 [
30]).
One group of panels shows the images with VH polarization, and the other group shows the images with VV polarization, taken on the dates mentioned in
Table 3.
2.2. Classification Tools
This section presents the classifier methods employed in this article. In particular, a classifier can be defined as a function $f$ that maps the input vectors of features, $\mathbf{x} \in \mathcal{X}$, into the output class labels, $y \in \{1, \dots, C\}$, where $\mathcal{X}$ is the attribute space and $C$ is the number of classes. Usually, it is assumed that $\mathcal{X} = \mathbb{R}^{D}$ or $\mathcal{X} = \{0, 1\}^{D}$, that is, that the attribute vector is a vector of $D$ real numbers or binary bits [32]. Classification is one of the most important topics in data mining, especially for big data applications. The main task of classification is to predict the labels of the test data based on the training data [
33]. In the following, the classifiers employed in this study are presented. The first considered method is the SVM scheme, a class of statistical models first developed in the 1960s by Vladimir Vapnik [34] that can be used for classification [35]. SVM has become popular due to its applicability in a variety of contexts, such as extreme learning machines [34], CNN-based automatic target recognition in SAR images [36], and SAR ATR with independent component analysis [37].
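As an illustration of how such a classifier is applied to feature vectors (synthetic data stand in for the CNN features below; this is a sketch, not the authors' exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the CNN feature vectors of the 400 patches
X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative default-style parameters
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```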
The second scheme applied in our study is the DT, which is a nonparametric supervised learning method used for classification and regression [
38].
The main idea of this algorithm is that the tree learns to approximate the target function with a set of if-then-else decision rules inferred from the data features. Trees can be visualized as a graph, which makes them easy to interpret. In addition, they require little prior information and can handle both numerical and categorical data. On the other hand, they are unstable, since small variations in the data can cause large changes in the tree [
39,
40].
Another employed method is the RF, which is a hybrid of the bagging algorithm and random subspace method and uses DT as a basis in the classification process [
41]. In other words, RF is a combination of tree predictors, where each tree depends on the values of a random vector sampled independently with the same distribution for all the trees in the forest [
42], that is, each tree is built from a sample, which is taken with replacement from the training set. Individual DTs have high variance and tend to overfit. However, the randomness injected into forests produces DTs with reduced prediction errors. Furthermore, increasing the number of trees can produce better accuracy results and limit the generalization error [
42].
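The bagging-plus-random-subspace idea behind RF can be illustrated as follows, again with synthetic data standing in for the SAR features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Each tree is grown on a bootstrap sample of the training set and on a
# random subset of features; more trees reduce variance and bound the
# generalization error.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=1)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```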
The NB and kNN methods were investigated in this study. The NB is one of the most efficient algorithms used in ML, classification, pattern recognition, and data mining, and it is based on Bayes’ theorem [
43,
44,
45]. The kNN is a nonparametric classification method that has been used in different real-world applications due to its simplicity and efficiency [
33,
46]. The main idea of the kNN method is to predict the label of a test data point by the majority rule. In other words, the test label is predicted as the majority class among its $k$ most similar training data points in the attribute space [
33]. To avoid inaccurate prediction results, it is necessary to choose an appropriate value of $k$. A simple way to choose $k$ is to run the algorithm several times with different values of $k$ and select the one with the best result [
47].
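The simple k-selection strategy just described can be sketched as follows (the candidate values of k and the use of cross-validation as the scoring criterion are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Run the algorithm several times with different values of k
scores = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the k with the best result
best_k = max(scores, key=scores.get)
```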
We also considered the LR scheme, which is a linear model useful for classification tasks. The sigmoid function is the basis for LR. Particularly, the logistic sigmoid function is expressed as
$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
where the input $z \in \mathbb{R}$ produces results in the range $(0, 1)$. LR applies the sigmoid function to the output of a linear regression, bounding the predicted output $\hat{y}_i \in (0, 1)$, for $i = 1, \dots, n$, where $n$ is the total number of training samples [48]. The relationship between the input and the predicted output for LR is presented as
$$\hat{y} = \sigma(\mathbf{w}^{T}\mathbf{x} + b),$$
where $\mathbf{x}$ is the input, given by an $n$-dimensional vector belonging to the reals; $y$ is the current output value, given by a one-dimensional array; $\hat{y}$ is the predicted output value, given by an array; $\mathbf{w}$ contains the weight parameters; and $b$ is a bias term. Since the output is limited to the interval $(0, 1)$, it can be interpreted as a probabilistic measure; that is, LR is a variation of linear regression [48].
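The sigmoid and the resulting LR prediction can be expressed in a few lines (a sketch with hand-picked weights, not a trained model):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """LR prediction: sigmoid applied to the linear model w^T x + b."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative input and weights: w^T x + b = 0.5 - 0.5 + 0 = 0
p = predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.0)
```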
This article also considered the AdaBoost (ADBST) classifier. ADBST is an adaptive boosting algorithm proposed by Freund and Schapire in 1999 [
49], developed for binary classification. The purpose of the classifier is to train predictors sequentially, trying to correct previous predictors and focusing on the most difficult cases. The algorithm increases the weights of training samples that have been misclassified, that is, the classifier learns from previous prediction errors [
46]. Each weight reflects how difficult the corresponding sample is to classify correctly. ADBST builds a stronger classifier from a combination of weaker classifiers, and classifiers that answer correctly are rewarded. The process is repeated over $T$ rounds, $t = 1, \dots, T$, with $n$ training samples. In each iteration of the algorithm, the weights are adjusted and the samples are trained again [50]. The final model is defined by the weighted majority of the $T$ weak learners, whose weights are adjusted during the training [46,50]. Initially, all training samples must have the same weight, so $w_i = 1/n$, $i = 1, \dots, n$. Then, the algorithm considers all possible classifiers and identifies the one, $h_t$, that minimizes $\epsilon_t$, which is the sum of the weights of the misclassified points. The weight $\alpha_t$ of the new classifier is expressed as
$$\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right),$$
which depends on the accuracy with respect to the current set of weighted points. The weights are then normalized so that $\sum_{i=1}^{n} w_i = 1$, and as a result, we have a classifier with an error $\epsilon_t$. In the next round, incorrect classifications have their weights adjusted to make them more significant. Let $h_t(\mathbf{x}_i)$, assuming values of $1$ or $-1$, be the class predicted for $\mathbf{x}_i$, and $y_i$ be the correct value of the class. For the situation where the predicted value $h_t(\mathbf{x}_i)$ is equal to the observed value $y_i$, the product $y_i h_t(\mathbf{x}_i)$ is positive; otherwise, it is negative. The adjusted weights are expressed as
$$w_i \leftarrow w_i \, e^{-\alpha_t y_i h_t(\mathbf{x}_i)},$$
before renormalizing them all, so that they continue to sum to 1, that is, $\sum_{i=1}^{n} w_i = 1$ [50].
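The weight-update rule above can be checked numerically; the sketch below implements one boosting round under the stated {-1, +1} label convention (the toy labels and predictions are illustrative):

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: weighted error eps, classifier weight
    alpha = 0.5 * ln((1 - eps) / eps), and updated, renormalized
    sample weights. Labels are assumed in {-1, +1}."""
    weights = np.asarray(weights, dtype=float)
    miss = (y_true != y_pred)
    eps = weights[miss].sum()
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    # Misclassified samples become more significant; correct ones less
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()

w = np.full(4, 0.25)              # initially w_i = 1/n
y = np.array([1, 1, -1, -1])      # true labels
h = np.array([1, -1, -1, -1])     # one mistake (index 1) -> eps = 0.25
alpha, w_new = adaboost_round(w, y, h)
```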
We also employed a neural network as a classification tool. A NET is a complex technique that requires a large amount of data for the training process; it is inspired by how human neurons work, receiving a set of inputs that are used to predict one or more outputs [51]. One of the main uses of NETs is grouping data into two or more classes. Neural networks can be trained in two ways: (i) supervised learning, where each training input vector is paired with a target vector or desired output, and (ii) unsupervised learning, where the net self-organizes to extract patterns from data with no target information. In the $n$-dimensional space, the input vectors are represented as $\mathbf{x} = (x_1, \dots, x_n)$, and the coefficients or weights are represented as $\mathbf{v} = (v_1, \dots, v_n)$ [
52].
Finally, the stacking or stacked generalization technique is an ensemble method that combines multiple models to achieve better classification results [
53,
54], and this was also used in our study. This type of scheme can be more accurate than an individual classifier [
55]. For instance, [
56] demonstrates the efficiency of the technique by combining three different algorithms, DT, NB, and IB1 (a variation of the lazy algorithm).
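A stacking ensemble along these lines can be built with scikit-learn; the base learners below (DT, NB, RF) and the LR meta-learner are illustrative choices, not the exact combination of [56]:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Base learners feed their predictions to a final meta-learner
stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
acc = stack.score(X, y)
```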
2.3. Classification Setup
This section describes the methodology employed in this article, introducing the training and test groups and the steps used in the classification tools. To obtain the training and test groups, we adopted the methodology proposed in [
26], where 50 groups were created randomly, resulting in 320 training images (80% of the total samples) and 80 test images (20% of the total samples). Before starting the classification steps, the image attributes are extracted using the CNNs VGG-16 and VGG-19 [
57]. A VGG is a CNN built by stacking convolutional layers at different depths.
The difference between VGG-16 and VGG-19 is in the number of convolutional layers. More precisely, VGG-16 comprises 13 convolutional layers [
58,
59], while VGG-19 is made up of 16 convolutional layers [
60,
61]. This difference can be seen in
Figure 5 and
Figure 6. The VGG-16 is composed of: (i) conv 1—two convolutional layers with 64 channels; (ii) conv 2—two convolutional layers with 128 channels; (iii) conv 3—three convolutional layers with 256 channels; (iv) conv 4—three convolutional layers with 512 channels; (v) conv 5—three convolutional layers with 512 channels; (vi) fc6—4096 channels; (vii) fc7—4096 channels; and (viii) fc8—1000 channels. On the other hand, VGG-19 is composed of: (i) conv 1—two convolutional layers of 64 channels; (ii) conv 2—two convolutional layers with 128 channels; (iii) conv 3—four convolutional layers with 256 channels; (iv) conv 4—four convolutional layers with 512 channels; (v) conv 5—four convolutional layers with 512 channels; (vi) fc6—4096 channels; (vii) fc7—4096 channels; and (viii) fc8—1000 channels. In short, the difference between the two occurs in conv 3–5. The VGG architecture used in the research ends in the FC7 layer. Therefore, the FC7 layer with the embedding is the input for the classification algorithms, as shown in
Figure 7.
After extracting attributes with the CNNs, four different data sets are created, df-16vh, df-16vv, df-19vh, and df-19vv, which are the results of the combinations of the two CNNs, VGG-16/VGG-19, and the VH/VV polarizations.
The bootstrap technique was used to ensure reproducibility and to make sure that the classifiers are evaluated under the same conditions. Bootstrap is a random resampling technique with replacement from the primary dataset [
39,
62]. This technique makes it possible to estimate the empirical distribution of statistics [
39,
63]. In this work, each data set (i.e., df-16vh, df-16vv, df-19vh, df-19vv) is resampled 50 times, as described in
Figure 8. Similar to [
26], each resampling consists of 320 training samples and 80 test samples.
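A minimal way to generate such bootstrap groups, assuming a simple resample-with-replacement-then-split scheme (the authors' exact procedure may differ):

```python
import numpy as np
from sklearn.utils import resample

n_samples = 400          # 200 platform + 200 ship patches
indices = np.arange(n_samples)

groups = []
for g in range(50):
    # Resample with replacement, then split into 320 train / 80 test
    boot = resample(indices, replace=True, n_samples=n_samples, random_state=g)
    train, test = boot[:320], boot[320:]
    groups.append((train, test))
```

Saving the 50 index groups (rather than regenerating them) is what allows every classifier to be evaluated under the same reproducibility conditions.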
The methodology of this work is described in the items below and shown in
Figure 9.
The data set consists of eight Sentinel-1 SAR images in VH polarization and eight in VV polarization, GRD product, IW mode, obtained through ESA’s Copernicus project [
64] in the periods listed in
Table 3.
The original data set went through a calibration process using the SNAP (Sentinel Application Platform) software to transform the amplitude images into sigma-zero. Then, the cutouts of the targets were made manually in SNAP. The dimensions of the images are displayed in
Table 4. Oil platforms are identified through their geolocation (latitude, longitude) provided by the ANP [
65]. Targets without geolocation are considered ships. The number of targets extracted in each SAR image is presented in
Table 3. Each patch is individually exported as a TIFF image. The types of platforms in the images are listed in
Table 5. TIFF image patches form the VH and VV data sets from platforms and ships.
- (3)
Extraction of attributes
The VH and VV image patches are the input of two CNNs, VGG-16 and VGG-19, pre-trained in the ImageNet data set, which extract features and generate four data sets: df-16vh, df-16vv, df-19vh, and df-19vv.
Figure 5 presents an example of the application of VGG-16. Note that, in our research, both VGG-16 and VGG-19 use the FC7 layer to provide the attributes to the classification algorithms.
- (4)
Formation of train and test samples
The proportion of 80% (training) and 20% (test) is considered. Training and testing samples are generated randomly. This was the methodology applied by [
26].
- (5)
Classification with the M1 method
In this method, ML techniques are applied in order to reproduce the results of [
26].
The sample vectors in the four data sets (df-16vh, df-16vv, df-19vh, df-19vv) are randomly distributed with replacement into 50 bootstrap groups and saved, so that the classifiers are submitted to the same reproducibility conditions.
Table 5 presents the distribution of platforms between training and test samples in a bootstrap group.
- (7)
Bootstrap and formation of train and test samples
Each of the 50 bootstrap groups comprises subsets of the original data sets (df-16vh, df-16vv, df-19vh, df-19vv).
- (8)
Classification with the M2–M5 methods
ML techniques are applied to the M2–M5 methods considering the kNN, SVM, LR, DT, RF, NB, NET, and ADBST algorithms.
Statistical analysis is performed using the Shapiro–Wilk test (normality analysis), the Kruskal–Wallis test (significant difference analysis), and Dunn's test (identifying which groups differ). For brevity, only the two best results in each method (M1–M5) were considered.
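This statistical pipeline can be sketched with SciPy; the accuracy values below are synthetic stand-ins for the 50 per-group results (Dunn's post hoc test is not in SciPy but is available, e.g., in the scikit-posthocs package):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for 50 accuracy values from two classifiers
acc_a = rng.normal(0.84, 0.02, 50)
acc_b = rng.normal(0.80, 0.02, 50)

# Normality check (Shapiro-Wilk) for one sample
_, p_norm = stats.shapiro(acc_a)

# Non-parametric test (Kruskal-Wallis) for a significant difference
h_stat, p_kw = stats.kruskal(acc_a, acc_b)
```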
To perform the classification, we used the following five methods, namely M1, M2, M3, M4, and M5, defined as follows:
- (M1)
Aiming at reproducing the results obtained by [
26], the LR, SVM, RF, kNN, DT, and NB classifiers were used with parameters in the default setting. To the best of our knowledge, [
26] is the only study available in the literature for maritime target classification in Sentinel-1 SAR data based on ML techniques. For comparison purposes, in this method, the samples were randomly generated only at the time of classification and were not saved, and the parameters of the applied classification tool are the ones predefined as default in the Orange Canvas software [
66], which are described in
Table 6.
- (M2)
The number of algorithms used in M1 was increased with the addition of the NET and ADBST methods. In this step, an extensive computational search was performed, varying the parameters of the considered classification algorithms, aiming to maximize their performance. The employed parameters are presented in
Table 7. The basis for parameter adjustment is empirical and was optimized to improve the results presented by [
26]. For example, [
67] shows that 500 trees are a good choice for constituting the RF. However, this number can be increased to approximately 3000 to evaluate the results. For SVM and ADBST, [
67] shows that the proper adjustment is made by gradually changing the parameter values. Indeed, there is no analytical methodology to reach optimal parameter values because the optimization depends on the data. This was evident when [
68] optimized the parameters of SVM, kNN, and DT, demonstrating that the ideal values of the parameters can vary with the size of the training data.
- (M3)
In this method, the training data set was expanded with the concatenation of all samples from the VH and VV data sets. The test data set remained unchanged.
- (M4)
The training data set was extended with the concatenation of half of the samples of the VH and VV image data set. The test set remained with the same samples.
- (M5)
The stacked generalization technique consists of combining several classifiers, aiming to obtain better classification results [
53,
56,
69]. Since supervised classification is performed in all steps, the distribution of the training and test sets is done according to
Table 8.
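The empirical parameter search described in method M2 can be sketched with an exhaustive grid search; the parameter grid below is illustrative, not the exact ranges of Table 7:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Gradually vary the parameter values and keep the best combination
grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1.0, 10.0],
                     "kernel": ["linear", "rbf"]},
                    cv=5)
grid.fit(X, y)
best_params = grid.best_params_
```

As noted in the text, there is no analytical way to pick these values; the optimum depends on the data, which is why such a search is repeated per data set.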
In this section, the numerical results are presented and discussed. To perform the maritime target classification in the Sentinel-1 SAR data based on ML techniques, we extracted 4096 features from the images using the CNNs VGG-16 and VGG-19, available in the Orange Canvas software [
66]. For this architecture, the number of filters doubles after each max pool layer [
20]. Consequently, from the data set generated by feature extraction, 50 distinct groups were defined, separated by networks VGG-16 and VGG-19, and polarizations VH and VV. Additionally, to perform the classification, we employed the tools described in
Section 2.2 and the steps presented in
Section 2.3.
To evaluate the performance of the employed methods, we considered the following metrics: area under the curve (AUC), accuracy (Acc), F1 score, precision, and recall. The mean for each metric was computed considering the 50 groups of data (randomly created, as mentioned in
Section 2.3). Since all the results lead to the same conclusion, we decided only to discuss the Acc results in this section. The remaining results are detailed in
Appendix A.
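The evaluation metrics listed above can be computed with scikit-learn; the labels and scores below are toy values for illustration only:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 1, 0]          # ground-truth classes
y_pred  = [1, 1, 0, 0, 0, 1, 1, 0]          # predicted classes
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]  # class-1 probabilities

metrics = {
    "Acc":       accuracy_score(y_true, y_pred),
    "F1":        f1_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall":    recall_score(y_true, y_pred),
    "AUC":       roc_auc_score(y_true, y_score),  # AUC needs scores, not labels
}
```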
Finally, statistical analyses, such as the Shapiro–Wilk and Kruskal–Wallis tests, are presented to assess the overall performance of the methods. A flowchart with the stages of image acquisition, attribute extraction with DL algorithm, image classification methods, and statistical analysis of classification accuracy is presented in
Figure 9.
Table 9 presents the setups that optimize the performance of the tested methods. The parameters are the number of CNN layers and the polarization channel. The parameters of the NB and ADBST methods are not presented, since NB has no configuration parameters and ADBST presented the same results regardless of the parameter setup.
Considering the parameters displayed in
Table 9,
Table 10 shows the Acc mean values of 50 classification results; the best results are highlighted in bold. The classifiers that excelled in classification results were LR, NET, SVM, RF, and kNN, presenting a classification gain of 32.5%, 32.5%, 17.5%, 15.5%, and 2.5%, respectively.
Comparing our results with [
26], the accuracy of the CNN VGG-16 was increased by 7.7% and 4.2% for the M4-kNN (VV polarization) and M2-SVM (VH polarization), respectively. For the CNN VGG-19, the gains are about 7.0% and 3% for the M4-kNN (VV polarization) and M2-SVM (VH polarization), respectively. In methods M3 and M4, there is an improvement in both the variability of the samples and the accuracy of the classification results. The stacking technique presents accuracy results ranging from 76.1% to 84.1%. Compared with the results shown in [
26], the LR scheme is the only one that the stacking technique does not surpass.
Therefore, the RF, SVM, and NET classifiers excel in all the evaluated scenarios, and the LR had poor performance in method M4. In summary, considering all the classification techniques, the following ones stand out: LR, SVM, NET, STACK, and RF, with the highest accuracy results ranging between 80.5% and 85.45% for all the tested scenarios.
The VH polarization presents better results in detecting targets, mainly for oil platforms formed by large metallic structures of complex geometry. In general, the brightness of the targets is more intense in the VH polarization, while the background (sea) is brighter in the VV polarization. Therefore, the extracted features are best represented in the VH polarization for the VGG-16 and VGG-19 networks.
To emphasize the results highlighted in
Table 10,
Figure 10 demonstrates the top two classification results in each method with VGG-16VH. The other graphical results for VGG-19VH, VGG-16VV, and VGG-19VV are presented in
Appendix A.
Complementing the results of
Table 10,
Table 11 displays the average of the classification results for the M1 method with the CNN VGG-16/VGG-19 and the classification metrics (AUC, F1 Score, Precision, and Recall). In addition, it reproduces the results of [
26], using the same training/testing and classification data generation approach. Each metric is calculated for the six classifiers (kNN, LR, NB, RF, SVM, and DT). As in [
26], LR is the classifier with the best performance. Furthermore, it is observed that the results with VH polarization are superior to those with VV polarization. Methods M2 to M5 present result tables with the same structure as M1 for the metrics (AUC, F1 Score, Precision, and Recall). Therefore, tables are included in
Appendix A.