3.3.1. Overview
Our objective was to implement an algorithm for crop identification [
51]—specifically rice in the Coto Arrocero area—for CAP grant monitoring purposes through S2 data and ML. For this purpose, ML-based models were implemented using the Python scikit-learn [
52,
53] and XGBoost [
54] libraries.
Figure 6 shows that, once the input information had been generated, the algorithm performed an initial assessment among many ML models [
55,
56], from which the most optimal one was chosen based on the information provided. Then, hyper-parameter optimization was performed for the selected model [
57]. Once the classifier was ready, pixel-by-pixel prediction [
58] was performed and, finally, validation of the used model was carried out [
59].
The methodology used generates the most relevant data: a raster with the prediction made and a large volume of data (stored in the spatial database). The data available in the database are related to characteristics of the input data, classifier configuration, model evaluation and validation metrics, probabilities associated with each of the classes, and information on the zonal statistics between the plot and the generated raster.
This algorithm was executed as many times as images were available in the reference time-series; specifically, 53, as mentioned in the previous paragraph (
Section 3.1). The main reason for this was the difference of four weeks in applying agricultural techniques on the plot, depending on where it was located, as discussed in
Section 3.2; which we felt could cause some uncertainty in the model.
In principle, the implemented algorithm performed classification between two classes. However, it was defined with a high parameterization capability, in order to be used for the classification of more than two classes. Similarly, the input variables that will be part of the learning matrix and the necessary S2 scenes are configurable.
3.3.2. Data Preparation
So that ML models can do their work, the input information provided has to be array-like. Therefore, our vector and raster geographical information had to be transformed into a format understandable by these models.
The models used in the implemented algorithm were based on supervised learning. For this reason, the prediction was made based on the training data, which must be generated in a particular way, maintaining a balance between classes during classification and ensuring that they were sufficiently dispersed throughout the area of study.
The learning matrix (or ground-truth generation) was based on Regions of Interest (ROI) consisting of a series of polygons that indicated the areas rice in which was grown and those in which it was not. These ROIs were generated through a semi-automatic process implemented in Python and could belong to two classes:
The class refers to the output variable; that is, what the ML model predicted.
For the rice class, the mechanical part consisted of selecting the declaration line plots, which were fully adjusted (updated with the series treatments) to the phenological curve of the crop of the current campaign. Data from the current campaign’s phenological curve were obtained using the process reported in
Section 3.2. The following formula was used to establish whether or not a plot was adjusted to the phenological curve represented in
Figure 7:
where
is the mean of the NDVI index for the current treatment date and
is the standard deviation.
The class corresponding to rice was formed from the parcels of the D.O. Calasparra, seeking to select those plots which grew rice in the current campaign and which have not requested aid for its cultivation, thus avoiding overfitting as much as possible. The objective here is to avoid as much as possible the inclusion in the learning matrix of those plots that will later have to be predicted. Thus avoiding that the model learns too well that learning matrix and favoring it can make predictions with high accuracy.
The class corresponding to the non-cultivation of rice was generated by digitizing polygon-type geometries manually on an S2 image, in areas where no rice was grown. The non-rice class mainly included elements that may have behavior and spectral response similar to rice cultivation, such as the horticultural and leguminous crops cited in
Section 2 or, even, the wild vegetation proliferating in the mote of the river that crosses the Coto. In this case, approximately 20% was allocated to horticultural areas, 15% to legumes, 25% to wild vegetation in the riverbed, and the rest to other types of elements, such as areas of huge trees, scattered trees, tilled or bare soil, and so on.
A balanced data sample was obtained after generating the training matrix, with a fair percentage of rice and non-rice pixels sufficiently dispersed over the study area’s entire surface.
Once the learning matrix rows had been composed (as detailed in the previous step), we still had to define the column values obtained from the raster that included the S2 bands and the biophysical indices.
Following the lessons learned in phase 1, a raster was generated based on the S2 scene element of the corresponding time-series, consisting of bands 2, 3, 4, and 8 and the biophysical indices (NDVI and NDWI).
Finally, a raster with 1538 rows, 1003 columns, and six bands was available, with a total of 1,542,614 pixels.
Once the vector data had been generated from the ROIs and the raster onto which the values of those ROIs had been extracted, the values themselves were removed to form the array of input variables.
With regard to the transformation of the geometries into raster format so that values could be extracted from the tacked S2 image at the corresponding locations, the pixel distribution (among the classes to be classified) in the learning matrix is given in
Table 3:
Finally, once the extracting values had been launched, the learning matrix was provided to the ML model to carry out the prediction.
To avoid random relationships between variables, it is good practice to subset data from the learning matrix into training data and test data [
60]. As its name indicates, the training data set is used for fitting or model training, while the test data set is used to test the model and find the errors that occurred during the prediction. The test data set must never be used for training, and vice versa.
In this case, 75% of the learning samples were allocated for training and the remaining 25% for testing. In this process, randomness is also introduced in the input variables. Therefore, 2721 pixels remained for training and 908 for testing (498 of the non-rice class and 410 rice).
The learning matrix was composed of six input variables. Four of them (bands 2, 3, 4, and 8) contained values between 0 and 4.095. The remaining two (NDVI and NDWI), obtained from the previous ones, had values ranging from −1 to 1. The above results can lead to the predominance of one input variable over others when making predictions, making some more critical than others.
Table 4 shows a table with Pearson’s correlation coefficient for the six input variables, observing the co-linearity between some variables:
Further, for many learning estimators, the standardization of the supplied data set is an indispensable requirement, as they may behave erratically if the individual characteristics do not resemble, more or less, the standard normal distribution.
To avoid the above, the algorithm performed standardization [
61] of the input variables, consisting of a data rescale to have a mean (
) of zero and a standard deviation (
) of one. This value was calculated as follows:
3.3.5. Classification
After the model has been optimized, the algorithm proceeded to adjust the model using the training data and S2 image prediction of our time-series. To predict class membership, the class probabilities were added together.
Based on the predicted output variable, together with the probability of belonging or not to the rice class, a raster was generated with the same characteristics of the used S2 image. For example,
Figure 8 shows how the selected pixel was classified as rice (Band 1 = 80) with a probability of being rice of 95% (Band 3 = 0.953333) and a probability of not being rice of approximately 5% (Band 2 = 0.0466667). It should be clarified that this option only applies to those models that incorporate the functionality allowing association of probabilities to predictions. However, in this study, all the models selected finally presented this feature.
The first phase of the CbM is the so-called automatic and dealt with in this paper. In this automatic phase, the plots marked with yellow traffic light color will go to a phase called expert judgment. In this phase, an expert, based on information of all kinds, must indicate whether or not the plot is fulfilling the obligations of the aid requested. Therefore, products and functionalities such as those shown in
Figure 8 can be very useful for the expert’s decision making.
To avoid the maximum type 2 errors (
), false negatives, or false greens (
Section 1), resulting in the payment of a non-corresponding grant, during classification, a process was implemented that allowed for adjustment of the threshold of probability that establishes membership of one class or another. By default, the estimators used 50% to establish this membership. The implemented algorithm allowed for obtaining an additional classification using a new membership threshold. If it is preferred to control this new threshold’s value, it can be set manually through the configuration file. If the parameter is not in the configuration file, the algorithm automatically sets it as the average probability of the pixels that were predicted as rice. Thus, this threshold may differ, depending on the time-series element.
Below is a comparison between the resulting raster with a threshold of 50% (
Figure 9a) and the raster using the automatically established threshold (
Figure 9b).
3.3.7. Zonal Statistics
Once the information derived from the validation of the corresponding estimator was recorded, it proceeded to obtain the zonal statistics [
71] between the parcels and the raster generated during the prediction process. The purpose of the zonal statistic was to provide additional information to the classification, in order to facilitate crop identification; however, instead of being pixel-oriented, it is plot-oriented.
Among the statistics obtained were the typical mean, standard deviation, minimum, maximum, count, sum, and majority. At the zonal statistic level, the plots with the most pixels labeled rice were marked as rice crops and vice versa.
Furthermore, complementary statistics were added for each plot: number of pixels classified as rice, number of pixels classified as non-rice, the average probability of the plot being classified as rice or non-rice (based on the associated probabilities generated by the classifier for each of the pixels included in the plot), the reliability of the zonal statistics level classification (the fraction of the total number of pixels included in the plot that have been labeled as rice), as shown in the equation:
and the average probability that the parcel was classified as rice or as non-rice; the latter based on the associated probabilities generated by the classifier for each of the pixels included in each of the plots.
Finally, the algorithm uses all the generated statistics to establish the traffic light color for each plot.