1. Introduction
Remote sensing techniques have long been an essential method for agricultural monitoring, with their ability to quickly and efficiently collect data on the spatial-temporal variability of farmlands and crops [
1,
2,
3]. Remote sensing-based crop-type classifications could employ a small number of known samples to predict crop types for farmland fields. Thus, it is a crucial aspect of agricultural monitoring because it is fundamental for numerous precision agriculture applications (such as crop acreage and yield estimations) [
4,
5]. Due to the similarity of crop growth and the limited information from a single Earth observation, it can be challenging to distinguish between diverse crop types using a single satellite image, especially for crops grown during the same season. Exploring and learning time-series information from multi-temporal satellite images is, therefore, a promising method for improving crop classification [
3,
6,
7,
Additionally, optical satellite images, along with the vegetation indices (such as the normalized difference vegetation index, NDVI) derived from their spectral bands, are easy to comprehend and interpret and can explicitly indicate crop growth stages. Traditionally, agricultural remote sensing applications have relied heavily on satellite data from optical sensors such as MODIS, Landsat, SPOT, and the Chinese Gaofen series [
5,
9,
10]. However, due to occlusion by clouds and shadows, the optical observation sequence for a specific location may be incomplete, as some observations can be missing. This poses a significant challenge for these methods (especially in cloudy and rainy regions). On the one hand, the absence of images at essential phenological stages could lead to inadequate crop classification performance. On the other hand, incomplete image sequences complicate subsequent tasks and severely restrict the application of time-series crop monitoring [
11,
12,
13]. Therefore, how to extract inherent time-series features that can distinguish crop types from these incomplete observation sequences becomes the key to remote sensing and crop mapping [
14,
15].
Considerable research and effort have been devoted to constructing time-series features (representation) for improving crop classification [
16]. These efforts can be categorized into three major groups: (1) important-feature-based methods, (2) time-series composition methods, and (3) time-series reconstruction methods. Instead of reconstructing regular time-series images or features, important-feature-based methods attempt to select prominent images captured during crucial phenological stages for crop identification [
17]. For instance, rape and sunflower exhibit distinct yellow spectral features (with greater spectral reflectance on the red and green bands) during flowering, and paddy fields planted with rice seedlings are saturated with water, exhibiting higher water index values [
3]. In other words, this kind of method is based on an in-depth understanding of crop growth and phenology and attempts to identify crop types using a few key images. Some studies apply time-series filtering (such as Savitzky–Golay (S-G) filtering) to incomplete multi-temporal images to estimate phenological dates (such as the start or the end of the growing season) and then use these dates to identify crop types [
18,
19,
20,
21]. However, such methods rely heavily on satellite images captured during crucial phenological stages, which may not always be available. In addition, these methods can only differentiate between crops with significant phenological differences, such as winter wheat and summer corn; it remains challenging to distinguish crops grown in the same season (for example, soy and corn).
Contrary to important feature-based methods, time-series composition methods attempt to use all available satellite images to construct more complete image sequences for crop classification (but with a longer time interval). In particular, the construction of satellite constellations significantly shortens their Earth observation (or revisit) periods. For instance, the revisit periods of the Sentinel-2 constellation with two satellites and the virtual constellation between Landsat-8 and Landsat-9 are five and eight days, respectively. Multiple images are captured by these satellites during a particular phenological stage. Therefore, these images with close acquisition dates can be composited and mosaiced to produce composited images with lower cloud/shadow coverage [
22]. Such approaches can significantly enhance the completeness of time-series observations. However, such methods expand the spectral ranges of crop phenological stages, resulting in mixed feature spaces and overlapping type spaces for crop mapping [
23]. Moreover, such methods provide only limited improvements in crop classification. In addition, they cannot completely eliminate missing values and thus cannot construct regular time-series observations in cloudy and rainy regions.
Time-series reconstruction methods are promising alternatives for dealing with incomplete time-series observations. By exploring spatial similarity, spectral correlation, and temporal trends, time-series reconstruction can predict cloud- and shadow-covered pixels to generate regular time-series images [
12,
13]. Compared to time-series composition methods, these methods produce time-series images with original (or even shorter) time intervals, which are practical and effective in time-series crop classification. Nevertheless, a few studies [
12,
24] found that a larger percentage of missing data (including long gaps between timestamps and large missing areas) results in greater uncertainty and over-smoothing, which can mislead subsequent time-series analysis. In addition, time-series reconstruction methods necessitate considerable additional processing effort.
To address the issue of incomplete time-series data, another idea is to design algorithms capable of utilizing incomplete time series directly. Recent developments in machine learning have begun to address incomplete time-series analysis. For instance, the eXtreme Gradient-Boosting (XGBoost) algorithm can handle missing values by default: whenever a missing value is encountered during prediction, the instance follows a default direction learned for each tree node [
25,
26]. In addition, masking layers can be used to identify the missing position in time-series data and then feed them directly into Long Short-Term Memory (LSTM)-based networks.
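The learned-default-direction trick can be illustrated with a toy tree node (a simplified sketch, not XGBoost's actual data structures; missing observations are encoded as None):

```python
def route(node, x):
    """Route a feature vector through a decision tree whose nodes carry a
    learned default direction for missing (None) feature values.
    Leaves are dicts with 'value'; internal nodes have 'feature',
    'threshold', 'default' ('left' or 'right'), 'left', and 'right'."""
    while 'value' not in node:
        v = x[node['feature']]
        if v is None:                 # missing observation (e.g., cloud-masked)
            branch = node['default']  # follow the direction learned in training
        else:
            branch = 'left' if v < node['threshold'] else 'right'
        node = node[branch]
    return node['value']

# A toy stump: split on the NDVI at time step 0; missing values go right.
tree = {'feature': 0, 'threshold': 0.5, 'default': 'right',
        'left': {'value': 'corn'}, 'right': {'value': 'wheat'}}
```

Here the split feature, threshold, and default direction are hypothetical; in XGBoost they are learned from the gradient statistics of the non-missing instances.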
Despite the encouraging improvements in time-series feature representation for crop classification using incomplete image sequences achieved by the aforementioned methods, these approaches have a number of limitations: (1) Existing techniques have not established a general framework for constructing or learning inherent time-series feature representations from incomplete image sequences; moreover, manually crafted features are limited to identifying specific crop types or phenological periods. (2) Supervised LSTM-based methods typically require a large number of labeled samples for training; it is, thus, difficult to apply them directly in remote sensing applications where labeling is limited [
3]. (3) From the standpoint of implementation, algorithms like XGBoost use a default or assumed trick to handle missing values in time-series data, as opposed to producing regular feature representation. Therefore, can a general framework be developed to represent inherent time-series features from incomplete image sequences in crop classification?
Recent research has focused more on self-supervised learning to extract effective representations from unlabeled data. Self-supervised pre-trained models with limited labeled data can achieve comparable performance to supervised models trained on complete and labeled data. Particularly, contrastive learning has recently demonstrated its strength for self-supervised representation learning in the computer vision domain due to its capacity to learn invariant representations from augmented data [
27,
28]. Contrastive learning explores numerous views of input images through the utilization of data augmentation techniques. It subsequently learns inherent representations by maximizing the similarity between views originating from the same sample while minimizing the similarity between views from distinct samples. This technique is widely employed in healthcare data analysis, visual comprehension, and natural language processing [
24,
29,
30], but it has been underexplored for remote sensing time-series analysis [
31].
This research aims to develop a general framework for inherent time-series feature representation from incomplete satellite image sequences to improve crop classification. This method was implemented by combining contrastive-learning-based feature representation with machine-learning-based classifications. Compared to previous approaches, this study makes three principal contributions. The first is a contrastive-learning-based framework for time-series feature representation from incomplete satellite image sequences. The second is the development of a type-wise consistency augmentation and a type-wise contrastive loss to enhance contrastive learning for supervised time-series classification. The third is an in-depth analysis of the effect of contrastive-learning-based feature representation. The proposed method is further discussed and validated through parcel-based time-series crop classifications in two study areas (one in Dijon, France, and the other in Zhaosu, China) with Sentinel-2 image sequences in comparison to existing methods.
2. Study Area and Datasets
2.1. The Dijon Study Area
The first study area is Dijon, located in the Côte-d’Or department (with the Dijon prefecture) in the Bourgogne-Franche-Comté region of northeastern France at 05°01′E and 47°17′N in geographical coordinates (latitude/longitude) on the WGS-84 ellipsoid (
Figure 1). The study area, covering a total area of approximately 5000 km², features an oceanic climate with a continental influence under the Köppen climate classification, with average temperatures between 6.8 °C and 16.1 °C and an annual average precipitation level of 740 mm. These climatic conditions are ideal for growing wheat, rape, grape, and grass.
Under the Common Agricultural Policy of the European Union, the National Institute of Forest and Geography Information (IGN) of France is responsible for gathering geographical information on the geometry of cultivated crops. The IGN institute has released anonymized parcel geometries and types of cultivated crops under an open license policy. This study used data collected in 2019 to validate the proposed model. In the raw crop type categories, there were 328 distinct crop labels organized into 23 groups. ‘Winter wheat’ (WWT), ‘winter barley’ (WBR), ‘winter rapeseed’ (WRP), ‘winter triticale’ (WTT), ‘spring barley’ (SBR), ‘corn’ (CON), ‘soy’ (SOY), ‘sunflower’ (SFL), ‘grape’ (GRA), ‘alfalfa’ (AFF), ‘grass’ (GRS), and ‘fallow’ (FLW) were selected and summarized for the Dijon study area. The two minority classes (‘sunflower’ and ‘winter triticale’) were also retained to challenge the classification methods. This further reflects the significant class imbalance in real-world crop-type-mapping datasets [32]. This study area encompassed approximately 53,400 parcels, of which 20% were selected randomly to serve as labeled samples.
In addition, Sentinel-2 time-series images (tile T32TFN) captured between February 2019 and September 2019 were used to record crop growth, as the growth stages of winter crops are concentrated in the subsequent year. Each Sentinel-2 image contained four (visible and near-infrared) bands with a spatial resolution of 10 m and six (red edge and shortwave infrared) bands at a 20 m resolution. All 48 images, captured every 5 days, were obtained from the Copernicus Open Access Hub at Level 1C, and 38 of them contained cloud and shadow contamination. The images captured on days of the year (DOY) 48, 58, 88, 133, 143, 168, 178, 233, 238, 258, and 263 were free from clouds and shadows.
2.2. The Zhaosu Study Area
The second study area is Zhaosu, situated southwest of Yining City, Xinjiang Autonomous Region, China (latitude range: 43°09′N to 43°15′N and longitude range: 80°08′E to 81°30′E in geographical coordinates) (
Figure 2). It is a highland basin surrounded by mountains in the Central Asian hinterland, with an elevation ranging from 1323 m to 6995 m. It is dominated by a continental temperate semi-arid semi-humid cool climate, with an annual average temperature of 2.9 °C and 512 mm of annual precipitation. The majority of Zhaosu is covered by calcium-rich black soil with a thick humus layer and a high organic matter content. These natural geographical and climatic conditions are optimal for the growth of spring rapeseed (from April to September), making Zhaosu the largest producer of spring rapeseed in Xinjiang.
Official farmland parcel maps are unavailable for this study. Consequently, Chinese Gaofen-1 (GF-1) satellite images were used to delineate the precise geometries of farmland parcels. The GF-1 images included one panchromatic band with a 2 m spatial resolution and four multi-spectral bands (blue, green, red, and near-infrared) with an 8 m spatial resolution. Using the Gram–Schmidt spectral sharpening algorithm, the panchromatic and multi-spectral bands were combined to produce a multi-spectral pan-sharpened image with a 2 m spatial resolution. Two GF-1 images acquired in July 2020 with 60 km-wide swaths were registered and mosaicked to cover the study area. In this study, approximately 11,400 parcels were obtained.
In July 2020, field surveys for supervised crop classification and accuracy assessments were conducted. To facilitate the field surveys, the sample sites were distributed along roads. During the surveys, a handheld GPS device (with a positioning precision of 3.0 m) was utilized to record geographic locations (in the WGS84 geographic coordinate system). Approximately 1000 parcel samples were collected (200 rapeseed parcels and 800 parcels with other crops, proportional to the percentage of rapeseed-planted area). Accordingly, this study designed a binary classification scheme containing rapeseed and other types.
Two Sentinel-2 images (tiles T44TMN and T44TNN) captured on the same day were mosaicked to cover the Zhaosu study area, and 36 observations between April and September 2020 were used to identify rapeseed. Images acquired on DOY 115, 145, 165, 185, 190, 195, 220, 235, 255, 260, and 265 in 2020 were entirely free from clouds and shadows.
The datasets and crop growth periods are summarized in
Table 1. The two study areas were distinguished by distinct climatic and topographical conditions. In addition, the cultivation status of crops varied considerably based on crop type and farming technique. These circumstances were sufficient for validating the proposed model.
3. Methodology
Time-series feature representations using contrastive learning were employed to improve parcel-based crop mapping using multi-temporal Sentinel-2 images, as illustrated in
Figure 3. This procedure consisted of four major steps: (1) pixel-wise spectral features, (2) parcel-based spectral features, (3) time-series feature representation, and (4) time-series crop classification.
Before the main process, data preprocessing was performed, including atmospheric correction, the cloud/shadow-based masking of Sentinel-2 images, the geographic registration of experimental data (including Sentinel-2 images, farmland parcel maps, and survey samples), and the generation of farmland parcel maps. First, at the pixel scale, time-series composition and band calculations were applied to Sentinel-2 images to generate spectral features and vegetation indices. Second, cloud/shadow-masked Sentinel-2 feature images (including spectral bands and indices) were overlaid onto parcel maps to generate parcel-based incomplete time-series spectral features (with missing values). Third, at the parcel scale, an enhanced contrastive learning framework mapped the time-series spectral features into an inherent feature representation (without missing values). Finally, using the feature representation and time-series classifiers, parcel-based crop classification maps were generated.
3.1. Data Preprocessing
3.1.1. Farmland Parcel Maps
Parcel-based crop mapping requires known farmland parcel geometries that are accessible in most regions of Europe [
32] (including the Dijon study area). In the absence of geometry data (as in the Zhaosu study area), farmland parcel maps were generated from high-spatial-resolution images (the GF-1 images used for Zhaosu) using the method detailed in our previous study [
33]. First, roads, waterlines, and terrain lines derived from DEMs were used to spatially split the GF-1 images of the study area into multiple subareas. Then, in each subarea, the boundary-semantic-fusion convolutional neural network (BSNet) [
33], trained on manually labeled samples of parcel boundaries, was utilized to automatically generate binary raster maps of parcel boundaries. Finally, automatic postprocessing (including the vectorization of binary parcel boundaries, topology checks on parcel geometries, and the removal of small polygons) and the manual correction of parcel polygons were applied to generate precise farmland parcel maps.
3.1.2. Sentinel-2 Images
On Sentinel-2 L1C images, the Sen2Cor algorithm was first applied for atmospheric correction to generate bottom-of-atmosphere data in which images acquired over time and space shared the same reflectance scale, thereby enhancing crop mapping when monitoring large-scale areas over time [
34]. Four spectral bands (bands 2, 3, 4, and 8) with a spatial resolution of 10 m and six spectral bands (bands 5, 6, 7, 8A, 11, and 12) with a spatial resolution of 20 m were produced for each Sentinel-2 image.
A scene classification (SCL) quality band, which labels pixels obscured by clouds and shadows, was also created using the Sen2Cor algorithm. Misclassifications in the SCL band were further corrected through expert visual interpretation, particularly at cloud and shadow edges. Then, the SCL band was reclassified into a binary masking band, with one class indicating clean pixels and the other indicating contaminated pixels (including cloudy and shadowy regions and no-data regions). Finally, masked images were generated by overlaying the masking band on the Sentinel-2 images and setting the spectral reflectance values of pixels in masked regions to a default masking value (0 in our experiments).
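A sketch of the binary reclassification and masking step, assuming the standard Sen2Cor SCL class codes (0 = no data, 1 = saturated/defective, 3 = cloud shadow, 8/9 = medium/high cloud probability, 10 = thin cirrus); the function name and list-based rasters are illustrative:

```python
# Contaminated SCL classes: no data (0), saturated/defective (1),
# cloud shadow (3), medium/high cloud probability (8, 9), thin cirrus (10).
CONTAMINATED = {0, 1, 3, 8, 9, 10}

def mask_image(scl, bands, fill=0):
    """Reclassify an SCL raster into a binary mask and apply it to spectral
    bands: contaminated pixels are set to `fill` (0 in our experiments).
    `scl` is a 2-D list of class codes; `bands` maps band name -> 2-D list."""
    masked = {}
    for name, band in bands.items():
        masked[name] = [
            [fill if scl[i][j] in CONTAMINATED else band[i][j]
             for j in range(len(band[0]))]
            for i in range(len(band))
        ]
    return masked
```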
3.2. Pixel-Wise Spectral Features
3.2.1. Time-Series Composition
Multi-temporal value composition is a common technique for suppressing atmospheric and cloud effects and reconstructing time-series observations when processing time-series optical images [
35]. This technology was employed to generate time-series images with lower cloud and shadow contamination. It was also noted that time-series composition could increase observation intervals, resulting in sparser time-series sequences.
Maximum value composition assumes that greater vegetation index values indicate more robust vegetation growth. However, this assumption does not hold for spectral reflectance: cloudy pixels exhibit higher spectral values and shadowy pixels lower values, yet both are contaminated. Consequently, a mean value composition algorithm was utilized in this study. Following a procedure similar to maximum value composition, mean value composition was applied to multi-temporal Sentinel-2 images to generate a composited image using the following equation.
mvc(i,j) = (1 / N(i,j)) × Σ_t v(i,j)_t,

where N(i,j) is the number of clean observations (not covered by clouds and shadows) at geographical location (i,j), v(i,j)_t is the pixel-wise spectral value at time step t and location (i,j) (the sum runs over the clean observations only), and mvc(i,j) is the composited value at location (i,j).
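A minimal sketch of this mean value composition over a stack of masked images (pure Python lists standing in for rasters; the masking value 0 marks contaminated pixels, as in our preprocessing):

```python
def mean_value_composite(images, nodata=0):
    """Composite multi-temporal observations pixel by pixel: average the
    N(i,j) clean observations at each location, skipping masked values.
    `images` is a list of 2-D lists that share the same shape."""
    rows, cols = len(images[0]), len(images[0][0])
    out = [[nodata] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            clean = [img[i][j] for img in images if img[i][j] != nodata]
            if clean:                     # N(i,j) > 0
                out[i][j] = sum(clean) / len(clean)
    return out
```

Locations that are masked in every observation keep the masking value, which mirrors how gaps persist in cloudy regions.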
3.2.2. Vegetation Indices
Typically, vegetation indices are derived from optical red and near-infrared (NIR) reflectance via linear or non-linear combinations. They are simple but effective parameters for characterizing vegetation cover and growth status in agricultural remote sensing applications. Additionally, compared to other multi-spectral images (such as Landsat images), Sentinel-2 images contain three additional red-edge bands that are sensitive to vegetation growth [
36]. To expand the spectral features of Sentinel-2 images, eight vegetation indices, including NDVI, EVI, MTCI (MERIS terrestrial chlorophyll index), NDRE (normalized difference red edge index), MCARI2 (modified chlorophyll absorption ratio index), REP (red edge position), IRECI (inverted red-edge chlorophyll index), and CIred-edge (red-edge chlorophyll index), were calculated [
37,
38]. Also, to maintain a consistent spatial resolution of 10 m, the 20 m spectral bands were resampled to 10 m using the nearest-neighbor algorithm when calculating vegetation indices.
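As a small illustrative sketch (band values are hypothetical surface reflectances), two of the listed indices, NDVI and NDRE, are simple normalized differences of Sentinel-2 bands (NIR = band 8, red = band 4, red edge = band 5):

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red) if (nir + red) != 0 else 0.0

def ndre(nir, red_edge):
    """Normalized difference red edge index: (NIR - RE) / (NIR + RE)."""
    return (nir - red_edge) / (nir + red_edge) if (nir + red_edge) != 0 else 0.0
```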
3.3. Parcel-Based Time-Series Features
Multi-temporal processed Sentinel-2 images (including spectral bands and derived vegetation indices) were overlaid onto farmland parcel maps to generate parcel-based time series. Spectral values were averaged within the bounds of each parcel geometry. For each Sentinel-2 band, the pixels within a parcel polygon were first identified; the average spectral value of these pixels was then taken as the feature value of the parcel. When parcels were entirely covered by clouds and shadows, their features were assigned the default masking value of 0. When parcels were partially covered, the spectral values of the clean pixels were averaged. Finally, for each parcel, a feature vector of size D × T was generated, where D is the number of spectral bands and indices, and T is the number of satellite observations.
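The parcel-averaging rule can be sketched as follows (pure Python; the pixel lists stand in for the pixels found inside a parcel polygon, and the function names are ours):

```python
def parcel_feature(pixels, mask_value=0):
    """Average the clean pixel values of one band inside one parcel.
    Returns `mask_value` when the parcel is entirely cloud/shadow covered."""
    clean = [v for v in pixels if v != mask_value]
    return sum(clean) / len(clean) if clean else mask_value

def parcel_feature_vector(band_series, mask_value=0):
    """Build the D x T feature vector of a parcel: `band_series[d][t]` is the
    list of pixel values of band d at observation t inside the parcel."""
    return [[parcel_feature(pix, mask_value) for pix in band]
            for band in band_series]
```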
3.4. Time-Series Feature Representation
This study employed contrastive learning to transform spectral features into time-series feature representations. This offers three advantages. First, feature representation contributes to learning the inherent time-series features for classification. Second, it can generate complete and regular time-series features (without missing values). Third, it can decrease the demand for large numbers of labeled samples in deep-learning-based applications. A general framework (known as TS2Vec) was previously proposed for learning time-series representations [
39]. It consists of three main components: a representation framework, consistency augmentation, and loss functions. Feature representation was performed using the representation framework. Using consistency augmentation, augmented sample pairs were generated to train the framework. Loss functions ensured the discovery of consistent features from multiple augmented samples.
This study attempted to improve the TS2Vec model for supervised time-series crop classifications (named type-wise TS2Vec) by incorporating prior-type information from labeled samples into contrastive learning. In general, we followed the architecture of the TS2Vec model [
39]. Further, the consistency augmentation and contrastive loss were enhanced. (1) When conducting consistency augmentation, we discarded original random cropping and developed novel type-wise random selection and random band-masking techniques. (2) When calculating multi-scale contrastive loss, type-wise contrastive loss was devised to replace instance-wise loss.
3.4.1. Consistency Augmentation
The establishment of positive sample pairs is fundamental in contrastive learning. Various augmentation strategies for general time-series tasks have been proposed in previous studies [
39,
40,
41]. For supervised time-series crop classification tasks, it is essential to ensure the following characteristics: (1) preserving the magnitude of time-series values; (2) retaining the length and timestamps of the time series when exploring the phenological characteristics of crop growth; (3) exploiting the strong correlations between spectral bands; and (4) introducing crop-type information to enhance consistency augmentation.
Based on these assumptions, the random cropping technique in the TS2Vec model was eliminated due to its inconsistency with assumption (2). Then, inspired by assumptions (3) and (4), a random band-masking technique and a type-wise random selection technique were implemented, respectively. Incorporating the random timestamp masking proposed by [39], we constructed the consistency augmentation, in which feature representations at the same timestamp in two augmented contexts with the same crop type were considered positive pairs.
Available type labels are high-quality-supervised information for constructing augmented contexts in contrastive learning for time-series crop classification. As shown in
Figure 4, this study proposed a type-wise random selection algorithm to construct augmented contexts in batch training.
A sample consists of a parcel-based feature vector and a crop-type label. First, within a sample batch, the crop-type label (L) of each instance was recorded in order, and the feature vectors (FV) of instances with the same type label were compiled into a subset. Then, the recorded crop-type labels were replicated as the augmented crop-type labels. For each crop-type label, a feature vector was randomly selected from the subset with the same label to serve as the augmented feature vector. Finally, the selected feature vector was combined with the crop-type label to produce a type-wise augmented sample. Type-wise random selection requires multiple instances of each crop-type label; therefore, the batch size was set greater than the total number of crop-type labels.
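The steps above can be sketched as follows (illustrative pure-Python version; real batches hold D x T feature arrays, and the function name is ours):

```python
import random

def type_wise_augment(features, labels, rng=random):
    """For each sample in the batch, draw a feature vector that carries the
    same crop-type label; the pair forms a type-wise positive pair."""
    # Group feature vectors by crop-type label.
    by_type = {}
    for fv, lab in zip(features, labels):
        by_type.setdefault(lab, []).append(fv)
    # Replicate the label sequence and draw one same-type vector per label.
    return [rng.choice(by_type[lab]) for lab in labels], list(labels)
```

Because a sample may be paired with itself (random selection with replacement within its type subset), the batch size must exceed the number of crop types so that most pairs mix different instances.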
Spectral band masking can also be adopted to generate new contexts. For each time-series input, one spectral band was randomly selected and masked (setting its values to 0) to generate an augmented context view. The contextual representations of the two context views should remain consistent. Through random spectral band masking, the contrastive learning framework can capture band-to-band correlations to establish inherent feature spaces for crop classification.
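A minimal sketch of the band-masking augmentation (the masking value 0 follows the paper's convention; the function name is ours):

```python
import random

def random_band_mask(feature_vector, rng=random):
    """Randomly mask one spectral band of a D x T parcel feature vector by
    setting all of its time steps to 0, producing an augmented context view.
    Returns a new list; the input is left untouched."""
    d = rng.randrange(len(feature_vector))
    return [[0.0] * len(band) if k == d else list(band)
            for k, band in enumerate(feature_vector)]
```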
3.4.2. Type-Wise Contrastive Loss
Multi-scale contrastive loss was employed to force the encoder to learn feature representations at multiple scales [
39]. At each scale, the TS2Vec model jointly leverages instance-wise and temporal contrastive losses to capture a contextual representation of the time series. In the instance-wise loss, the representations of other instances at timestamp t are taken as negative samples in order to capture fine-grained representations for general time-series tasks. For time-series classification tasks, this restriction is too strict for different instances that share the same class type (category). Thus, we utilized the supervised type information of labeled samples to relax this restriction by taking representations with the same class type c at timestamp t as positive samples. The type-wise contrastive loss indexed with (i, t) can be formulated as follows:

ℓ_type^(i,t) = −log [ Σ_{j=1}^{B} 1[c_j = c_i] exp(r_{i,t} · r′_{j,t}) / Σ_{j=1}^{B} ( exp(r_{i,t} · r′_{j,t}) + 1[i ≠ j] exp(r_{i,t} · r_{j,t}) ) ],

where i is the index of the input sample, B denotes the batch size, c_i is the crop-type label of sample i, 1[·] is the indicator function, and r_{i,t} and r′_{i,t} denote the representations at the same timestamp t from the two augmentations of one sample.
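As an illustrative, non-authoritative sketch of how such a type-wise loss could be computed for one (i, t) index (pure Python on toy-sized representation vectors; the actual model operates on batched tensors), with in-batch representations that share the crop type of sample i at timestamp t acting as positives and all other in-batch representations acting as negatives:

```python
import math

def type_wise_loss(r, r_aug, labels, i, t):
    """Type-wise contrastive loss for index (i, t). `r[j][t]` and
    `r_aug[j][t]` are the representation vectors of sample j at timestamp t
    from the two augmented views; `labels[j]` is its crop-type label."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Positives: cross-view representations sharing the crop type of sample i.
    pos = sum(math.exp(dot(r[i][t], r_aug[j][t]))
              for j in range(len(r)) if labels[j] == labels[i])
    # Denominator: all cross-view pairs plus same-view pairs with j != i.
    den = sum(math.exp(dot(r[i][t], r_aug[j][t])) +
              (math.exp(dot(r[i][t], r[j][t])) if j != i else 0.0)
              for j in range(len(r)))
    return -math.log(pos / den)
```

With a batch of one sample, the positives and the denominator coincide and the loss is zero; adding other-type samples to the batch strictly increases it.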
3.5. Time-Series Classification
Based on the time-series feature representation, a traditional machine-learning-based (XGBoost-based) classifier and an LSTM-based classifier were applied to generate crop classification maps.
3.5.1. XGBoost-Based Classifier
XGBoost is a highly efficient and widely used implementation of the gradient-boosted trees algorithm [
25]. It is a supervised learning algorithm for regression, classification, and ranking problems that uses sequentially built shallow decision trees to provide accurate results. In this study, the XGBoost algorithm with a “gbtree” booster and a “softmax” objective was utilized to build the XGBoost-based classifiers. In addition, the GridSearchCV technique was used for hyperparameter tuning to determine the optimal parameter values for crop classification.
3.5.2. LSTM-Based Classifier
In recent years, the recurrent neural network (RNN) and its variants (such as LSTM) have been utilized extensively in time-series analysis, such as time-series prediction [
12,
13] and time-series classification [
3,
8]. This study employed stacked LSTM models for crop classification [
3]. In LSTM-based classification models, four LSTM layers with
h (where
h equals the dimension of the input features) hidden neurons were first stacked to transform the input time-series features into high-level features. Then, a dense layer fully connected the high-level features to the crop categories, and a SoftMax activation function output crop-type probabilities to generate crop classification maps. Furthermore, a cross-entropy loss function and an Adam (Adaptive Moment Estimation) optimizer with default parameters were employed to train the LSTM-based classifiers.
3.6. Performance Evaluation and Comparison
3.6.1. Comparative Methods
To validate the effectiveness of contrastive-learning-based feature representation, several classification comparisons utilizing different time-series feature representations were performed. The baseline classification is an XGBoost-based classifier only using completely Clean Sentinel-2 images (referred to as XGB-Clean). Using all available Sentinel-2 time-series (TS) images, an LSTM-based time-series classifier and an XGBoost-based time-series classifier (referred to as LSTM-TS and XGB-TS, respectively) were constructed. Using time-series Feature Representation (FR) generated from the proposed contrastive learning framework, an LSTM-based classifier and an XGBoost-based classifier (referred to as LSTM-FR and XGB-FR, respectively) were also built. In addition, GridSearchCV was used to conduct hyperparameter tuning for the XGB-Clean, XGB-TS, and XGB-FR classifiers.
The sample sets were randomly divided into training, validation, and testing sets at a ratio of 6:2:2 for both the proposed type-wise TS2Vec model and the time-series classifiers.
3.6.2. Evaluation Metrics
Based on the confusion matrix [
42], which was created by comparing the classification results to the test samples parcel by parcel, the overall accuracy (OA), precision (P), recall (R), and F1 score were extracted to evaluate crop classification accuracy. The OA was determined by dividing the number of correctly classified parcels by the size of the whole validation dataset. Precision and recall were calculated as Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative parcels, respectively, in the confusion matrix. In addition, F1 = 2 × P × R/(P + R), the harmonic mean of precision and recall, is more meaningful than OA for a specific crop type. Greater OA, P, R, and F1 scores indicated superior results, and vice versa.
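These metrics follow directly from the confusion-matrix counts; a compact sketch for one positive class (function name and toy labels are illustrative):

```python
def classification_metrics(y_true, y_pred, positive):
    """OA over all parcels; precision, recall, and F1 for one crop type."""
    oa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return oa, precision, recall, f1
```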
5. Conclusions
Fundamental to remote sensing crop mapping is the extraction and learning of inherent time-series features that can distinguish crop types from incomplete satellite observation sequences. This study developed a contrastive-learning-based framework for time-series feature representation to improve crop classification using incomplete Sentinel-2 image sequences. The proposed method was discussed and validated through parcel-based time-series crop classifications in two study areas (one in Dijon, France, and the other in Zhaosu, China) with multi-temporal Sentinel-2 images. The classification results, with significant improvements of more than 3% in overall accuracy and 0.04 in F1 score over the comparison methods, revealed the effectiveness of the proposed method in learning time-series features for parcel-based crop classification from incomplete Sentinel-2 image sequences.
In addition, accuracy evaluations and comparisons were performed on the parcel-based classification results to discuss the number of training samples, the benefit of type-wise contrastive learning, the sensitivity to the dimension of the feature representation, and the assistance from time-series composition and vegetation indices. We concluded that (1) the combination of feature representation and traditional machine-learning-based classification can improve parcel-based crop mapping with limited labeled samples; (2) type-wise contrastive learning is more effective than instance-wise contrastive learning in time-series classification tasks; and (3) time-series composition and vegetation index preprocessing are not necessary for contrastive-learning-based feature representation.
These experiments and their conclusion can provide insights and ideas for time-series classification in agricultural remote sensing applications. In addition, the proposed method is adaptable to other satellite images and applications in future works.