1. Introduction
An effective approach for the information support of precision agriculture involves the quantitative assessment of cropland area and the analysis of cropland structure. Currently, the use of machine learning methods for the development of global and regional cropland masks, as well as masks of individual crops, has received much attention [
1,
2,
3,
4]. Remote sensing has emerged as a promising method for generating such masks. In particular, Sentinel imagery has made it possible to obtain information with the simultaneous coverage of significant areas with high periodicity, which is practically impossible to realize when using ground-based data.
The issue of cropland classification can be addressed by aggregating the values of visible and infrared spectrum bands to serve as input data. This process yields vegetation indices, such as the normalized difference vegetation index (
NDVI), enhanced vegetation index (EVI), normalized difference water index (NDWI), and others. A popular method for constructing crop masks is the use of a single multispectral image [
5,
6,
7]. However, this approach is more suitable for classifying land cover than for crops that are characterized by significant variations in vegetation index values during the vegetation season. Therefore, time series of vegetation indices are extensively used. For instance, Luo et al. used Sentinel-2
NDVI time series data (with various phenological characteristics) to map crops in northern China [
8], computing 20 informative features (time series and phenological) to achieve a classification accuracy of 93% using the random forest algorithm (RF). Ouzemou et al. used Landsat
NDVI time series data for cropland mapping after resharpening them to a resolution of 15 m (10 images per season) [
9].
A major disadvantage of optical data is the significant influence of weather phenomena, such as different types of clouds, haze, aerosols, and cloud shadows, on the vegetation indices’ values. The expert or automated removal of unsuitable images is not a simple task and may often lead to a significant sparsity of time series. The process of the computation of vegetation indices’ temporal composites can help address this issue by reducing the negative effects of atmospheric indices. Computing composites also reduces the amount of input data. For example, Erdanaev et al. identified agricultural crops in Uzbekistan, including cotton, wheat, rice, and others, using Landsat-8 and Sentinel-2 monthly
NDVI, EVI, and NDWI composites [
10]. Hao et al. also used Landsat-8 and Sentinel-2 imagery to construct 15-day
NDVI and EVI composites [
11]. In both cases, the authors were able to achieve a high classification accuracy. However, when working with data from different sensors, it is essential to perform data projection and resampling procedures. Low-resolution data are also used for land cover and crop mapping. For example, 16-day Moderate Resolution Imaging Spectroradiometer (MODIS)
NDVI composites were used for classification [
12,
13]. However, using low spatial products, such as imagery with a spatial resolution of 250 m, does not allow for accurate crop mapping for relatively small sites or sites with multiple crops growing simultaneously.
To create an accurate classifier, it is necessary to account for the inter-annual variability associated with climatic factors, hydrological conditions, and anthropogenic factors. Consequently, studies using multi-year satellite data are more promising. For instance, Feyisa et al. classified croplands in Ethiopia using 8-day
NDVI composites and the RF method and achieved the highest accuracy when using 3-year MODIS data [
14]. Crop mapping using satellite data for previous (or, in the case of retrospective prediction, subsequent) years is also of great interest. For example, Konduri et al. used MODIS
NDVI data for 7 previous years to generate cropland maps in the US in 2015–2018 [
15]. Later, a classifier trained on monthly composites of Sentinel-2 bands and indices in 2020 performed a retrospective prediction of crops grown in 2019 using the RF algorithm to recover crop rotation information [
16].
Often, there may be long periods between the dates of optical images suitable for analysis in regions that are characterized by distinct rainy seasons, long cloudy periods, and the presence of aerosols or fog during the vegetation season. If such a pause lasts for 2 weeks or longer, the calculation of composites (even over 15–16 days) is insufficient for the generation of continuous time series. In such cases, the missing values are recovered through interpolation, using neighboring values [
17] and splines [
18] or filters [
19,
20]. However, a more effective approach to filling the gaps in the data is time series function fitting. In this case, a function with a shape similar to that of the seasonal course of the vegetation index is selected. Extensive research has been conducted on the selection of functions for the function-fitting of the most popular vegetation indices. For example, polynomials [
11], the Gauss function [
21], logistic functions [
22,
23], and the Fourier series [
24] have been used to fit
NDVI time series.
A characteristic feature of the Russian Far East is the lack of reliable cropland maps and databases. Various machine learning (ML) models have been developed to validate ground-based observations, often using existing cropland maps for accuracy assessment [
25,
26,
27]. Subsequently, the accuracy of global and federal cropland maps developed for different climatic conditions, crops, varieties, and agricultural practices is comparatively low. The lack of a regional database that takes into account crop phenology in the Russian Far East poses a serious challenge, hindering rational land use and the development of the agro-industrial complex as a whole. In addition, the inadequate control of crop rotation and agricultural practices in the region results in non-compliance with crop rotations and the dominance of soybean. In some municipalities of the Far East, the share of soybean in the total amount of cropland exceeds 90% (
https://rosstat.gov.ru, accessed on 15 April 2024). Repeating the cultivation of soybean as the most profitable crop, in cases of non-compliance with the crop rotation, leads to a decrease in soil fertility. To maintain fertility, the inclusion of oat and perennial grasses in crop rotation is recommended [
28]. Gajduchenko et al. conducted 20-year research and demonstrated that crop rotation increases the soybean yield [
29]. Moreover, a high-yielding crop such as buckwheat can be used in crop rotation with soybean for additional benefits in terms of the overall yield.
Various factors, including waterlogging, the decline in soil fertility, and the overgrowth of weeds and trees, coupled with primitive methods of fallow land identification, have resulted in a substantial amount of unused agricultural land in the region. Therefore, the introduction of forested and waterlogged fallow land into crop rotation is one of the most important tasks of agriculture in the Russian Far East. Remote sensing methods have not been previously used to solve these important problems in this region.
The main objective of this research is to develop an automated method for cropland classification using Sentinel-2 weekly NDVI composites and data reconstruction algorithms. An important problem that this study intends to address is to establish the feasibility of classification using time series over two years of observation. An approach that has not been used before for cropland mapping in the Russian Far East is relevant due to the large number of cloudy days during the growing season, as well as the different dates of images in different years, which create difficulties for the researcher. We have proposed to consider the possibility of using function-fitting to restore composite values. The study will evaluate the function-fitting accuracy using different algorithms and the classification accuracy using different ML methods.
2. Materials and Methods
2.1. Study Area
The study area is located in Khabarovsk Municipal District in the south of the Middle Amur Lowland. This region is on the right bank of the Amur River east of Khabarovsk between 48.31° and 48.64°N latitude and 134.81° and 135.57°E longitude (
Figure 1). These lands lie in the temperate belt and experience a monsoon type of climate, i.e., winters are quite cold while summers are warm and sometimes hot, with high precipitation. The average annual precipitation is about 700 mm. The duration of the frost-free period is 130–150 days. The short duration of the vegetation season does not allow for the growth of winter crops.
The climate and soil conditions favor the cultivation of leguminous crops, primarily soybean. In 2022, soybean grew on 59% of cropland in Khabarovsk District (10,722 ha), while oat and buckwheat occupied 8% (1446 ha) and 2% (327 ha) of total cropland, respectively.
2.2. Ground Truth Data
Marked sites of the Far-Eastern Agriculture Research Institute and agricultural enterprises of the region with crop information were used as ground observation data. Information on crop rotation in 2021 and 2022 was provided by the heads of organizations and verified during ground observations. Overall, in 2021, information on 50 sites with a total area of 1130 ha was collected. In 2022, information was collected on 80 sites, covering an area of 1992 ha. In 2021, the share of fallow land and soybean exceeded 30% of the total area, while buckwheat occupied 26% of these lands, and oat and perennial grasses together occupied just over 10%. In 2022, the number of sites with oat increased from 4 to 17 (with a corresponding increase in area from 67 to 511 ha), while those with perennial grasses increased from 2 to 29 (with a corresponding increase in area from 60 to 528 ha).
Table 1 shows the number and total area of sites for each class.
2.3. Crop Phenology
Active vegetation of soybean in Khabarovsk District begins in calendar weeks 26–28 (late June to mid-July), the vegetation maximum is reached by weeks 32–33 (mid-August), and the biomass then decreases until the end of the growing season (soybean harvesting is carried out not earlier than mid-October). Oat is sown in the first half of May (weeks 18–19), with the vegetation maximum occurring in early July (weeks 26–27); it is then harvested and the NDVI declines sharply. Perennial grasses are sown under the cover of grain crops in the previous year, allowing vegetation to begin immediately with the onset of the frost-free period. Biomass growth occurs at a rapid pace, reaching its peak by calendar weeks 25–27 (second half of June to early July). Following the summer mowing, the grass regrows rapidly.
In 2021, the experimental sowing of buckwheat was resumed in Khabarovsk District after a long break. Its vegetation period is approximately only 70 days, which allows for variations in the sowing and harvesting dates. Thus, in 2021, sowing was carried out in the first half of July (calendar weeks 27–28), and harvesting was performed in the second half of September (calendar weeks 38–39). Such dates are considered optimal for the local buckwheat variety. In 2022, sowing was carried out a month earlier, commencing in early June (weeks 22–23). Buckwheat maturation occurred by mid-August, with the end of the vegetative phase occurring in week 34, and harvesting took place in late August and early September.
The fallow land represents unused cropland with grass, shrubs, and woody vegetation. The insufficient melioration of cropland is an urgent problem in the Far East, resulting in waterlogging. Vegetation on fallow land experiences growth from the beginning of spring, maintains a consistently high level during the summer, and starts to wilt with a gradual decrease in biomass during the fall. Crop phenology for soybean, oat, buckwheat, and perennial grasses (with average temperatures) is presented in
Figure 2.
2.4. Imagery Processing and NDVI Calculation
Multispectral Sentinel-2 data (Level-2A product) were accessed from the Copernicus Open Access Hub service. The Graph Processing Framework (GPF), available on the European Space Agency (ESA) Sentinel Application Platform (SNAP) v.9.0 (
http://step.esa.int/main, accessed on 21 December 2023; ESA, Paris, France), was used to create graphs and set up chains for batch processing, as shown in
Figure 3.
NDVI time series were calculated for images that fulfilled the condition of the cloud cover being less than 20% for the scene (
Figure 4).
Additionally, for the selected images, Scene Classification (SCL) products were used together with Quality Indicators (QI) for clouds. Pixels were masked if they were contained in at least one of the following bands: “scl_cloud_high_proba”, “scl_cloud_shadow”, “opaque_clouds”, “scl_cloud_medium_proba”, “cirrus_clouds”, and “scl_thin_cirrus”. Initially, SCL products were resampled to 10 m. Further, using the near-infrared (
B8) and red (
B4) bands, the
NDVI for each scene was calculated. The
NDVI was calculated as follows:
Before the dataset construction, we additionally performed “2σ-filtering”—for each field, the mean value and standard deviation of the
NDVI by site were calculated for each date. Pixels when over half of the time series values did not fall within the 2σ-interval were considered anomalous and were not included in the dataset. A total of 0.6% of the series in 2021 and 0.8% of the series in 2022 were filtered out, which shows insignificant
NDVI variability across the site. A flowchart of the current research (starting from time-series filtration, including time series restoration and classification) is presented in
Figure 5.
2.5. Composites Construction and Data Restoration
To equalize the number of observations in the time series, we calculated weekly NDVI composites. Within each time series, the values were divided by calendar weeks; in the case of several values within one calendar week, the average value was calculated. Thus, all the NDVI composites time series included 27 observations (from 17 to 43 calendar weeks—from the end of April to the end of October).
The composites reduced the sparsity in the dataset, this was observed only for weeks with images. The dataset remained highly sparse, missing 36% of the values for 2021 and 47% of the values for 2022. An example of a time series with a large number of missing observations is shown in
Figure 6.
To recover the missing values, we fitted each time series. The independent variable was the number of calendar weeks, while the dependent variable was the
NDVI value. The parameters of the function were obtained for each time series. Then, using the number of the week as an argument, we obtained the missing values of the time series. The lmfit library (
https://lmfit.github.io/lmfit-py; accessed on 15 November 2023) of the Python programming language was used for fitting function parameters by the least squares’ method—specifically, the Levenberg–Marquardt algorithm [
30]. The fitting functions were the cubic polynomial, double sinusoidal function, and Fourier series.
2.5.1. Cubic Polynomial
Time series function fitting using polynomials is among the simplest and most commonly used methods for time series reconstruction [
11,
23,
31]. A cubic polynomial (Cube) has only four parameters and has the following form:
where
a,
b,
c, and
d represent the parameters of the model.
2.5.2. Double Sinusoidal Function
Trigonometric polynomials have demonstrated their effectiveness as fitting functions for periodic dependencies, which may include time series of vegetation indices [
32]. In particular, the double sinusoidal function (DS) is suitable for dependencies that have two characteristic peaks. The DS function has six parameters and has the following general form:
where
a1,
b1,
c1,
a2,
b2, and
c2 represent the parameters of the function.
2.5.3. Fourier Series
Fourier series are the best-known trigonometric series. Fourier series decomposition is often used in signal processing and serves to extract the amplitude component, thus allowing for the elimination of noise. There are several examples of the successful application of such decomposition for time series of vegetation indices [
33,
34]. Fourier series enables the representation of a complex relationship as the sum of simple trigonometric functions. In our study, we use the first two terms of the Fourier series expansion (DF):
As in the case of the DS, the approximation is reduced to the search for six parameters represented by
w,
a0,
a1,
a2,
b1, and
b2. We calculated the
MAPE for the averaged time series of all classes for 2021 and 2022, fitted by different numbers of terms in the Fourier series (
Table S2). It was found that the mean error for one term and two terms differed significantly (
p < 0.05), while the mean error for two terms and for three terms did not differ significantly (
p > 0.05). An increase in the number of terms of Fourier series leads to an increase in the number of model parameters, which, first, requires more observations in time series (which can cause limitations in case of a lack of uncloudy data), and second, increases the complexity of the calculations and interpretation of the model. Therefore, using the first two terms of the Fourier series in the model is optimal.
2.5.4. Function Fitting Accuracy Evaluation
The mean absolute percentage error (
MAPE) was calculated for each time series to compare fitting functions and to select the function that best fit the original data:
where
i is the series index,
is the predicted
NDVI in week
j,
is the real
NDVI in week
j, and
n is the week number with the known
NDVI for series
i. The
MAPE values obtained for each series were averaged across classes and the entire dataset, as follows:
where
class is the name of the class,
m represents the number of series in the class, and
N is the number of series in the dataset. The function with the lowest
is used to reconstruct the time series when generating datasets for ML.
2.6. Machine Learning
The dataset for 2021 and 2022 was obtained after filtering. The number of series in the datasets is presented in
Table 2. Partitioning into validation for 10-fold cross-validation sets was performed based on site locations to prevent the simultaneous inclusion of series from the same site in sets, as this could result in overfitting. A total of 295,673 labeled
NDVI series belonging to 130 sites, with a total area of 3122 ha, were used in cross-validation. In general, the dataset can be considered balanced, thus avoiding distortions in the classification result.
To create classifiers, we used three ML algorithms, namely, support vector machines (SVM), RF, and gradient boosting (GB). These algorithms, along with other popular ML algorithms, were implemented using the scikit-learn package (
https://scikit-learn.org/stable; accessed on 15 November 2023) in the Python language.
2.6.1. Support Vector Machines
The SVM algorithm is widely applied in regression and classification tasks, including cropland mapping [
9,
35,
36]. The essence of SVM is to determine the optimal hyperplane within the feature space that effectively separates points belonging to different classes. The competitive advantages of the SVM include efficiency with a large number of features, memory economy, and flexibility in kernel feature selection. In this study, a high-performance variation of the LinearSVC algorithm using a linear kernel was used. The parameters of SVM are the penalization norm, loss function, and multiclass strategy. The main parameter of the algorithm is the regularization coefficient
C.
2.6.2. Random Forest
The RF algorithm is based on averaging the results of the ensembles of trees through bagging, with each tree being trained on a subsample of the dataset [
37]. The RF method is used in both regression and classification problems. It is currently the most popular algorithm in cropland mapping [
25,
27,
37]. The advantages of RF include high performance (especially for large datasets), high accuracy, and robustness regarding the presence of noise in the data [
9,
38]. Its parameters are the quality criterion, the minimum number of samples required to split an internal node, the minimum number of samples required to be at a leaf node, and the number of features for each split. However, the main parameter affecting the performance of RF is the number of trees.
2.6.3. Gradient Boosting
Similar to RF, the GB [
39] algorithm is based on an ensemble of trees. The main difference in this algorithm is the fact that training is performed sequentially. During each iteration, the deviations in predictions made by the already trained ensemble on the training sample are computed. The next model to be added to the ensemble will predict these deviations. New trees are added to the ensemble as long as the error decreases or until one of the “early stopping” rules is satisfied. GB is considered a flexible, computationally fast algorithm that works even with sparse data. It has been used in prior research for land cover mapping tasks [
40,
41,
42]. Histogram-based GB is a simplified and fast implementation of GB in the scikit-learn library, developed to classify large datasets. These fast estimators first group the input samples X into integer-valued bins (typically, 256 bins). This significant reduction in the number of splitting points to consider enables the algorithm to utilize integer-based data structures (histograms) instead of relying on sorted continuous values when constructing the trees. The set of algorithm parameters is the minimum number of samples in a leaf node, the maximum number of leaves in a node, and the number of iterations.
2.6.4. Mapping Accuracy Evaluation
The GridSearchCV tool was used to optimize the key parameters of the algorithms. The parameters selection and algorithms comparison were performed using 10-fold cross-validation (10 sets by 13 sites). The random state control was used to ensure the reproducibility of the results. In this study, the metrics used to compare the algorithms were overall accuracy (OA), user accuracy (UA), performance accuracy (PA), and F1. These metrics were calculated from the confusion matrix.
The confusion matrix is a square matrix of the dimension
n × n, where
n represents the number of classes. In the confusion matrix, each column represents the actual class and each row represents the predicted class. Each element of the matrix
Xik denotes the number of pixels of class
k that fall into class
i during classification. The diagonal elements of the matrix represent the number of correctly classified pixels. The ratio of the sum of the diagonal elements (true positive pixels
TP) to the sum of all elements of the error matrix
N (expressed as a percentage) is given by the
OA metric:
The confusion matrix also shows the number of pixels of class
k (
k ∈ [1,
n]) that are incorrectly assigned to other classes (false negative pixels
FNk).
PA represents the map accuracy from the point of view of the map maker or producer. This metric represents the frequency at which real features on the ground are correctly shown on the classified map or the probability that a certain type of land cover of a given area on the ground is classified correctly.
PA is the complement of the omission error and is given by the number of reference sites classified accurately divided by the total number of reference sites for that class:
In the confusion matrix, it is also possible to count the number of pixels erroneously assigned to class
k (
FPk).
UA represents the accuracy from the point of view of a map user rather than the map maker.
UA represents the frequency at which a class on the map is present on the ground, which is referred to as reliability. It is the complement of the commission error and is given by the total number of correct classifications for a particular class divided by the sum of row elements:
The
F1 score can be interpreted as the harmonic mean of precision and recall, which varies from 0 to 1. The relative contributions of precision and recall to the
F1 score are equal:
Based on the results of comparing the metrics, the algorithm with the most suitable parameters for cropland mapping was selected. It is also important to determine the number of sites where the crop was correctly identified to evaluate the performance of the best classifier. The number of pixels assigned to a particular class by the classifier determined the crop in each site as the class label with the largest number of pixels assigned to the site. The number and proportion of sites with correctly identified crops were used to assess the classification accuracy at the parcel level.
5. Conclusions
Time series of weekly NDVI composites for 2021 and 2022 were generated using multispectral Sentinel-2 images for crop sites in Khabarovsk District. Three types of nonlinear functions were fitted to recover observations for missing weeks. It was found that the Fourier series showed the highest accuracy in approximating the NDVI series, with a mean error of 8.2%. Thus = 11.8%, = 8.6%, = 7.6%, = 5.1%, and = 7.9%. The use of a double sinusoidal function for fitting the NDVI series resulted in an increase in the error by up to 8.6%, while using a cubic polynomial resulted in an even higher increase in error by up to 16.8%.
Tenfold cross-validation was performed to optimize the hyperparameters of three classification methods (SVM with linear kernel, RF, and GB), with a histogram-based modification, and compare their accuracy. The overall accuracy when using the optimal hyperparameters was 67.3%, 87.2%, and 85.9% for SVM, RF, and GB, respectively. The
F1 values for different classes ranged from 0.33 to 0.92 for SVM, 0.69 to 0.96 for RF, and 0.62 to 0.98 for HGB. For the main crop of the Far East, soybean, the
F1 value when using GB was 0.98. An important factor supporting the use of this algorithm is that the time cost of using GB is an order of magnitude lower compared to the other methods. Thus, it has been shown that the use of fitting functions to reconstruct
NDVI time series and subsequent GB machine learning make it possible to perform crop mapping based on multi-year data. The high accuracy and productivity of the proposed method determine the prospects for its use in regions with similar cropland structures and crop phenological cycles, including the developing agricultural regions of Northeast China that have large areas of cropland [
3,
48,
49].
In future research, we plan to test the developed method in other regions using data for more years. It is also planned to use data from different satellites to eliminate sparsity in time series. The use of a combination of satellite bands and neural networks may be very promising.
The use of precision methods for cropland identification and control will make it possible to more effectively manage the agro-industrial complex. Accurate cropland mapping and vegetation analysis can also help assess the rational use of fertilizers, whose contribution to total nitrogen emissions is large enough that it can lead to significant changes in ecosystems [
50]. Effective management in agriculture, coupled with well-managed urbanization, will allow for the faster adoption of modern agricultural methods, which, in turn, will reduce nitrogen emissions, potentially improving air and water quality [
51]. Globally, climate change greatly impacts the production of major crops [
52]. This impact is very uneven, especially for countries with large territories. The climate in Russia is warming about 2.5 times more intensely than the global average. The softening of winters in Russia could be an important factor in the future development of agriculture [
53]. An increase in precipitation and climate warming leads to an increase in soybean yields. In general, the change in agro-climatic conditions in the Middle Amur River Region is relatively favorable for soybean (which is the major crop in this region) cultivation [
54].