2.1. Spectral Variability of Target Spectra in Multi-Temporal Images
Due to varying atmospheric, illumination, and environmental conditions, the spectral signatures of a target captured at different times may exhibit variability, which is commonly referred to as spectral variability [52]. In addition, intra-class differences in spectral characteristics may exist within the same category of targets, also resulting in variations [4]. Therefore, the prior and in-scene target spectra collected from multi-temporal HSIs may vary.
An urban-scene multi-temporal HSI pair was collected from AVIRIS data to illustrate spectral variability, as exhibited in Figure 2. These two images were captured at different periods within Jefferson County, Washington, United States. Image 1 was acquired under conditions of minimal cloud coverage, whereas Image 2 was acquired under significant cloud presence. Storage tanks were selected as the targets of interest for this example. Manual target annotations, obtained with the help of ENVI software (Version 5.1), are shown in Figure 1. Based on these manual annotations, we computed the average target spectra from one of the images to obtain prior spectra, which were then compared with the target spectra in the other HSI. Note that these spectra are in the radiance domain and have been normalized for better comparison.
Based on the visualization in Figure 2, notable differences can be observed between the prior spectra and the in-scene target spectra. Some conventional HTD methods, such as ACE, generally assume that the prior spectrum represents the mean of the target spectral distribution. However, in real-world scenarios this assumption may not hold, which degrades the effectiveness of these methods. Because only prior and in-scene spectra (without labels) are available, certain DL-based methods utilize non-learning detectors to pre-detect background spectra and then generate pseudo-target spectra by combining the prior spectra with the detected background spectra. The augmented data are considered pseudo-data because their labels are not manually annotated; they are used to optimize the parameters of neural networks. Simulating high-quality variational target spectra for training DL detectors is challenging because the variations in target spectra are not solely caused by spectral mixture.
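To make the pseudo-data idea concrete, the following is a minimal sketch of this prior-work augmentation strategy, assuming a simple linear mixing model; the array shapes and the `alpha` abundance range are illustrative placeholders, not settings from any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)
prior = rng.random(200)              # one prior target spectrum (l = 200 bands)
background = rng.random((50, 200))   # background spectra pre-detected by a
                                     # non-learning detector

# Linearly mix the prior spectrum with background spectra to simulate
# pseudo-target spectra; alpha plays the role of the target abundance.
alpha = rng.uniform(0.5, 1.0, size=(50, 1))
pseudo_targets = alpha * prior + (1.0 - alpha) * background   # (50, 200)
```

As noted above, such mixtures only model spectral-mixture variability, which is why they may fail to cover variations caused by illumination or atmospheric changes.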
2.3. Structure and Pipeline of the Proposed Detector
This subsection introduces the structure and pipeline of the proposed ICLTD. An in-scene HSI is denoted $\mathbf{X} \in \mathbb{R}^{w \times h \times l}$, where $w$, $h$, and $l$ are the width, height, and band number of the HSI, respectively. The $i$th in-scene spectrum of $\mathbf{X}$ is denoted $\mathbf{x}_i \in \mathbb{R}^{l}$. A prior spectrum captured from another HSI is denoted $\mathbf{x}_p \in \mathbb{R}^{l}$. Different from box-level target detection in the remote sensing field, existing HTD detectors commonly access in-scene spectra (without annotations) for optimization. After training, the detector outputs the target confidence scores of the in-scene spectra as detection results, which are denoted $\mathbf{D} \in \mathbb{R}^{w \times h}$. In the training phase, all of the in-scene spectra and the prior spectrum make up the data batch. For convenience, the number of in-scene spectra is denoted as $n = w \times h$. In this paper, only one prior spectrum is generated from the HSI captured at a different time. Hence, the batch size is $n + 1$.
The network structure is shown in Figure 4. As the illumination intensity varies over time and space, significant differences in the radiance amplitude may occur. To mitigate the impact of varying amplitudes, a linear transformation is used to normalize the Euclidean norm (L2 norm) of the in-scene spectra and the prior spectrum to 1. The normalized spectra are then input into multiple fully connected blocks (FCBs) for feature extraction. Each FCB consists of a fully connected (FC) layer, the ICLM, and a nonlinear activation layer. The FC layer relates to the normalized vector $\hat{\mathbf{x}}_i$ by:

$$\mathbf{z}_i = \mathbf{W} \times \hat{\mathbf{x}}_i + \mathbf{b}, \tag{1}$$

where $\times$ represents matrix multiplication, and $\mathbf{W} \in \mathbb{R}^{d \times l}$ and $\mathbf{b} \in \mathbb{R}^{d}$ are the weight and bias of the fully connected layer, respectively, where $d$ is the dimension number of the feature representations. The FC layer transforms each input vector independently; that is, there is no gradient path from $\mathbf{x}_i$ to $\mathbf{z}_j$ when $i \neq j$.
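As a minimal sketch of the input normalization and Equation (1), the snippet below applies the L2 normalization and a fully connected transform row-wise; the band number and feature dimension are illustrative values, not the paper's settings.

```python
import torch

l, d = 200, 64                            # bands, feature dimension (assumed)
x = torch.rand(5, l)                      # five radiance spectra
x_hat = x / x.norm(dim=1, keepdim=True)   # unit L2 norm per spectrum

W = torch.randn(d, l)                     # FC weight
b = torch.zeros(d)                        # FC bias
z = x_hat @ W.T + b                       # Equation (1), applied row-wise

# Each row of z depends only on the matching row of x_hat, so no gradient
# path exists between different samples at the FC layer.
```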
The realization of implicit contrastive learning is based on the designed ICLM, which is a refinement of the original batch normalization (BN) [53]. The original BN was proposed to accelerate the training process, while the ICLM is designed to regularize the optimization. The motivation for creating the ICLM is to force the detector to learn a representation of the in-scene spectra during the optimization. Without the ICLM, the detector only learns to transform the prior spectra for classification, and it cannot distinguish other spectra. The original BN could actually mitigate the above model collapse by leaking gradients from the features of the prior spectra to the in-scene spectra. However, the original BN was not specially designed for HTD and cannot provide adequate regularization, because BN preserves most gradients for the learning of prior spectra rather than passing them to the in-scene spectra. Further theoretical analysis of this process is presented in Section 2.4.
To solve this problem, the ICLM duplicates the prior spectral representation for normalization to provide adequate regularization. The pipeline of the ICLM is exhibited in Figure 5. Specifically, the ICLM augments the prior spectral representation $\mathbf{z}_p \in \mathbb{R}^{d}$ $k$ times and combines the copies with the in-scene spectral features to compute the mean ($\boldsymbol{\mu} \in \mathbb{R}^{d}$) and variance ($\boldsymbol{\sigma}^2 \in \mathbb{R}^{d}$), where $d$ is the dimension number of the input spectral features $\mathbf{z}$. The ICLM computes $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ as:

$$\boldsymbol{\mu} = \frac{1}{m}\left(\sum_{i=1}^{n} \mathbf{z}_i + k\,\mathbf{z}_p\right), \qquad \boldsymbol{\sigma}^2 = \frac{1}{m}\left(\sum_{i=1}^{n} \left(\mathbf{z}_i - \boldsymbol{\mu}\right)^2 + k\left(\mathbf{z}_p - \boldsymbol{\mu}\right)^2\right), \tag{2}$$

where $k$ is the duplication number of the prior spectral representation and $m$ is the sum of $n$ and $k$.
The computed mean and variance vectors are used to normalize the features of the in-scene spectra and the prior spectra:

$$\hat{\mathbf{z}}_i = \frac{\mathbf{z}_i - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}}, \tag{3}$$

where $\hat{\mathbf{z}}_i$ is the normalized feature vector and $\epsilon$ is a small constant for numerical stability. Learnable vectors of the mean and variance ($\boldsymbol{\beta} \in \mathbb{R}^{d}$ and $\boldsymbol{\gamma} \in \mathbb{R}^{d}$, respectively) are used to fine-tune the distributions:

$$\tilde{\mathbf{z}}_i = \boldsymbol{\gamma} \odot \hat{\mathbf{z}}_i + \boldsymbol{\beta}, \tag{4}$$

where $\odot$ represents the element-wise product (Hadamard product). The complete algorithm flow of the ICLM is exhibited in Algorithm 1.
Algorithm 1 Pipeline of the ICLM

Input: Features of the prior spectrum and the in-scene HSI: $\mathbf{z}_p$ and $\{\mathbf{z}_i\}_{i=1}^{n}$; learnable parameters of the ICLM: $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$.
Output: Normalized spectral features: $\tilde{\mathbf{z}}_p$ and $\{\tilde{\mathbf{z}}_i\}_{i=1}^{n}$.
1: Duplicate $\mathbf{z}_p$ $k$ times and construct a feature batch.
2: Compute $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ of the feature batch following Equation (2).
3: Normalize $\mathbf{z}_p$ and $\{\mathbf{z}_i\}_{i=1}^{n}$ with $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ following Equation (3).
4: Fine-tune the features with $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ following Equation (4).
5: return $\tilde{\mathbf{z}}_p$ and $\{\tilde{\mathbf{z}}_i\}_{i=1}^{n}$.
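A minimal PyTorch sketch of the ICLM follows, assuming the prior representation occupies the last row of the feature batch; the class name and this batch convention are our own choices, not fixed by the paper.

```python
import torch
import torch.nn as nn

class ICLM(nn.Module):
    """Implicit contrastive learning module (sketch). k is the duplication
    number of the prior representation from Equation (2)."""

    def __init__(self, d, k, eps=1e-5):
        super().__init__()
        self.k, self.eps = k, eps
        self.gamma = nn.Parameter(torch.ones(d))    # learnable variance vector
        self.beta = nn.Parameter(torch.zeros(d))    # learnable mean vector

    def forward(self, z):                  # z: (n + 1, d), last row = prior
        z_in, z_p = z[:-1], z[-1:]
        # Step 1: duplicate the prior representation k times (Figure 5)
        batch = torch.cat([z_in, z_p.expand(self.k, -1)], dim=0)
        # Step 2: statistics of the augmented batch, Equation (2)
        mu = batch.mean(dim=0)
        var = batch.var(dim=0, unbiased=False)
        # Step 3: normalize the original (non-duplicated) features, Equation (3)
        z_hat = (z - mu) / torch.sqrt(var + self.eps)
        # Step 4: fine-tune the distribution, Equation (4)
        return self.gamma * z_hat + self.beta
```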
After using the ICLM, a sigmoid function is applied to features to increase the nonlinear feature extraction ability of ICLTD. Multiple FCBs are used to extract features of the input spectra in a cascading manner. It is worth noting that the last FCB is free of the sigmoid layer, which brings a larger numerical range to the following classifier.
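Under the same assumptions, an FCB can be sketched as below, stacking the FC layer, the ICLM sketched above, and the sigmoid activation, with the last block skipping the sigmoid as described.

```python
import torch
import torch.nn as nn

class FCB(nn.Module):
    """Fully connected block (sketch): FC layer -> ICLM -> sigmoid."""

    def __init__(self, in_dim, d, k, last=False):
        super().__init__()
        self.fc = nn.Linear(in_dim, d)   # W and b of Equation (1)
        self.iclm = ICLM(d, k)           # module sketched after Algorithm 1
        self.last = last                 # the last FCB omits the sigmoid

    def forward(self, z):
        z = self.iclm(self.fc(z))
        return z if self.last else torch.sigmoid(z)
```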
The classifier is composed of a fully connected layer and a softmax layer. The fully connected layer outputs predicted confidence vectors, and the softmax layer normalizes them to represent probabilities. The dimensions of the predicted probability vectors are the same as those of the latent features. The values of different dimensions represent the confidence of an in-scene spectrum belonging to different categories (targets of interest or various backgrounds). Because only the target prior is known and target spectra are our interest, the probabilities belonging to the target are selected from the output. The first dimension is assumed to be the target class by default, which is used for computing the loss function. The other dimensions of the output vectors represent the probabilities of class-agnostic backgrounds. For the visualization shown in Figure 4, the predicted target probabilities of the input images are denoted $\mathbf{D}$. The predicted confidences of the prior spectrum and the $i$th in-scene spectrum of $\mathbf{X}$ are denoted $d_p$ and $d_i$, respectively.
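A hedged sketch of the classifier head is given below; following the text, the output dimension equals the latent feature dimension $d$, and the first dimension is taken as the target class.

```python
import torch
import torch.nn as nn

d = 64                                   # latent feature dimension (assumed)
classifier = nn.Sequential(nn.Linear(d, d), nn.Softmax(dim=-1))

feats = torch.randn(10, d)               # features from the last FCB
probs = classifier(feats)                # (10, d), each row sums to 1
target_conf = probs[:, 0]                # first dimension = target class
```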
Unlike existing contrastive learning-based detectors, which require both positive and negative sample pairs for their loss functions, the proposed detector only utilizes the manual annotation of the prior spectrum for explicit supervised learning. The specific loss function for classifying positive samples is:

$$\ell = -\log\left(d_p\right), \tag{5}$$

When the ICLM is not present in the detector, the feature extraction and classification of a batch of spectra become independent of each other. This implies that learning to detect the prior spectra does not help the detector distinguish in-scene spectra. If the detector is optimized with Equation (5) in the absence of the ICLM, it will over-fit to detecting the prior spectra. When the ICLM is present, the detector learns to minimize Equation (5) partly by changing the representations of the in-scene spectra, which regularizes the optimization.
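Under the reconstruction of Equation (5) used here (cross-entropy on the prior spectrum only, with the target class in the first output dimension), the loss is a one-liner:

```python
import torch

def prior_loss(probs_prior):
    """Equation (5) as reconstructed above; probs_prior is the softmax
    output vector for the prior spectrum."""
    return -torch.log(probs_prior[0] + 1e-12)   # eps guards against log(0)
```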
2.4. Analysis of the ICLM
Under the HTD task, this subsection illustrates how the ICLM realizes implicit contrastive learning and brings discriminative feature extraction capability to the detector.
From the data distribution perspective, the ICLM normalizes features to restrict extreme output aggregation or divergence [51]. Therefore, the detector will neither output identical results (feature aggregation) nor only be able to detect prior spectra (feature divergence). The ICLM, to some extent, mitigates the differences between the prior and target spectra by adjusting the data distributions. However, adjusting the feature distribution alone does not help the optimization much. To validate the effectiveness of the distribution adjustment, an ablation study is conducted in Section 3.5, where the gradient propagation component of the ICLM is disabled.
Compared to feature distribution adjustment, the gradient propagation path established by the ICLM between the in-scene spectra and prior spectra is more important. The last ICLM layer is chosen to analyze the implicit contrastive learning because the gradient calculation in this layer is simpler than those in the previous layers. Note that all of the vector multiplications presented in this subsection are Hadamard products; the product symbol is omitted for brevity.
For convenience, the hidden features before and after the last ICLM layer are denoted $\mathbf{z}$ and $\tilde{\mathbf{z}}$, respectively (for clarity of the analysis, the fine-tuning of Equation (4) is omitted). We define a spectral feature sequence to exhibit the gradient propagation path from $\tilde{\mathbf{z}}_p$ to the unlabeled $\mathbf{z}_i$ as:

$$\left\{\mathbf{z}_1, \ldots, \mathbf{z}_n, \mathbf{z}_{n+1}, \ldots, \mathbf{z}_m\right\}, \tag{6}$$

where $\{\mathbf{z}_i\}_{i=1}^{n}$ are in-scene spectral features and $\{\mathbf{z}_i\}_{i=n+1}^{m}$ are duplicated prior features. $\mathbf{z}_i$ equals $\mathbf{z}_p$ when $i > n$. The normalized spectral features outputted by the last ICLM are computed by substituting $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ in Equation (3) according to Equation (2):

$$\tilde{\mathbf{z}}_p = \frac{\mathbf{z}_p - \frac{1}{m}\sum_{j=1}^{m}\mathbf{z}_j}{\sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(\mathbf{z}_j - \boldsymbol{\mu}\right)^2 + \epsilon}}, \tag{7}$$

where only the normalized prior spectral features are exhibited because the in-scene spectral features are not used in the loss function.
According to the numerator in Equation (7), the prior spectral features are subtracted from each input feature when the ICLM normalizes the distribution means. Similarly, the denominator constructs computational relationships between any two spectral samples. Because of these established gradient propagation paths, the representations of in-scene spectra also receive supervised signals during optimization.
The gradient received by the features before the last ICLM ($\mathbf{z}$) is calculated to analyze the regularization of the implicit contrastive learning. There are three paths that could pass gradients to $\mathbf{z}_i$: $\tilde{\mathbf{z}}_i$, $\boldsymbol{\mu}$, and $\boldsymbol{\sigma}^2$. Because the gradients $\frac{\partial \ell}{\partial \boldsymbol{\mu}}$ and $\frac{\partial \ell}{\partial \boldsymbol{\sigma}^2}$ rely on $\frac{\partial \ell}{\partial \tilde{\mathbf{z}}}$, the latter is calculated first.
Since the in-scene spectral features are not used in the classification loss, their derivatives are zero:

$$\frac{\partial \ell}{\partial \tilde{\mathbf{z}}_i} = 0, \quad i \leq n, \tag{8}$$

where $\frac{\partial \ell}{\partial \tilde{\mathbf{z}}_i}$ is denoted $g_i$ for convenience. When $i > n$, the $g_i$ are the same because of duplication and are denoted $g_p$.
Next, the derivatives of $\ell$ with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ are calculated. According to Equation (3), the gradient of the loss function $\ell$ with respect to $\boldsymbol{\sigma}^2$ is:

$$\frac{\partial \ell}{\partial \boldsymbol{\sigma}^2} = \sum_{j=1}^{m} g_j \left(\mathbf{z}_j - \boldsymbol{\mu}\right)\left(-\frac{1}{2}\right)\left(\boldsymbol{\sigma}^2 + \epsilon\right)^{-\frac{3}{2}} = -\frac{k\, g_p \left(\mathbf{z}_p - \boldsymbol{\mu}\right)}{2\left(\boldsymbol{\sigma}^2 + \epsilon\right)^{\frac{3}{2}}}, \tag{9}$$

The gradient of the loss function $\ell$ with respect to $\boldsymbol{\mu}$ can be simplified as:

$$\frac{\partial \ell}{\partial \boldsymbol{\mu}} = -\sum_{j=1}^{m} \frac{g_j}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} = -\frac{k\, g_p}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}}, \tag{10}$$
According to Equations (3) and (2), the gradients of $\tilde{\mathbf{z}}_i$, $\boldsymbol{\mu}$, and $\boldsymbol{\sigma}^2$ with respect to $\mathbf{z}_i$ are, respectively:

$$\frac{\partial \tilde{\mathbf{z}}_i}{\partial \mathbf{z}_i} = \frac{1}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}}, \qquad \frac{\partial \boldsymbol{\mu}}{\partial \mathbf{z}_i} = \frac{1}{m}, \qquad \frac{\partial \boldsymbol{\sigma}^2}{\partial \mathbf{z}_i} = \frac{2\left(\mathbf{z}_i - \boldsymbol{\mu}\right)}{m}, \tag{11}$$

According to the above gradients, $\frac{\partial \ell}{\partial \mathbf{z}_i}$ can be expressed as:

$$\frac{\partial \ell}{\partial \mathbf{z}_i} = \underbrace{\frac{g_i}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}}}_{①} + \underbrace{\frac{1}{m}\,\frac{\partial \ell}{\partial \boldsymbol{\mu}}}_{②} + \underbrace{\frac{2\left(\mathbf{z}_i - \boldsymbol{\mu}\right)}{m}\,\frac{\partial \ell}{\partial \boldsymbol{\sigma}^2}}_{③}, \tag{12}$$

where the gradients labeled ①, ②, and ③ come from $\tilde{\mathbf{z}}_i$, $\boldsymbol{\mu}$, and $\boldsymbol{\sigma}^2$, respectively.
Based on the loss function (Equation (5)), $g_i = 0$ for the in-scene spectral features ($i \leq n$). According to Equation (12), the gradient of $\ell$ with respect to $\mathbf{z}_i$ ($i \leq n$) is summarized as:

$$\frac{\partial \ell}{\partial \mathbf{z}_i} = -\frac{k\, g_p}{m\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} - \frac{k\, g_p \left(\mathbf{z}_p - \boldsymbol{\mu}\right)\left(\mathbf{z}_i - \boldsymbol{\mu}\right)}{m\left(\boldsymbol{\sigma}^2 + \epsilon\right)^{\frac{3}{2}}}, \tag{13}$$

Based on Equation (13), the gradients computed from $\ell$ are passed to the in-scene spectra through $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}^2$ in the ICLM. In other words, the features of in-scene spectra are also optimized to realize the classification of the prior spectra. Therefore, the implicit contrastive learning of the in-scene spectra regularizes the optimization and prevents model collapse.
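The gradient path described by Equation (13) can be checked numerically with autograd. The sketch below builds the duplicated batch, applies a toy loss that, like Equation (5), touches only the prior rows (so $g_p = 1$ elementwise), and compares autograd's result against Equation (13); all sizes are illustrative.

```python
import torch

torch.manual_seed(0)
n, k, d = 8, 4, 5                  # in-scene count, duplication number, dim
m = n + k
eps = 1e-5

z_in = torch.randn(n, d, requires_grad=True)   # in-scene features z_1..z_n
z_p = torch.randn(d, requires_grad=True)       # prior feature z_p

# ICLM statistics over the duplicated batch (Equation (2)) and normalization
batch = torch.cat([z_in, z_p.expand(k, d)], dim=0)
mu = batch.mean(dim=0)
var = batch.var(dim=0, unbiased=False)
z_tilde = (batch - mu) / torch.sqrt(var + eps)

loss = z_tilde[n:].sum()           # depends only on the prior rows
loss.backward()

g_p = torch.ones(d)                # dloss/dz_tilde for each prior row
analytic = (-k * g_p / (m * torch.sqrt(var + eps))
            - k * g_p * (z_p - mu) * (z_in - mu) / (m * (var + eps) ** 1.5))
print(torch.allclose(z_in.grad, analytic.detach(), atol=1e-5))  # True
```

The non-zero `z_in.grad` confirms that supervised signals reach the unlabeled in-scene features exclusively through the shared batch statistics.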
To analyze how the ICLM enables the model to learn differentiated representations of in-scene spectra, we calculate the difference between the gradients obtained through the ICLM for two in-scene spectra:

$$\frac{\partial \ell}{\partial \mathbf{z}_i} - \frac{\partial \ell}{\partial \mathbf{z}_j} = -\frac{k\, g_p \left(\mathbf{z}_p - \boldsymbol{\mu}\right)\left(\mathbf{z}_i - \mathbf{z}_j\right)}{m\left(\boldsymbol{\sigma}^2 + \epsilon\right)^{\frac{3}{2}}}, \quad i, j \leq n, \tag{14}$$

Assuming the two in-scene spectra in Equation (14) are a target and a background spectrum, their inherent differences are automatically utilized by the ICLM to allocate differentiated gradients. We also calculate the difference between the gradients obtained through $\boldsymbol{\sigma}^2$ for an in-scene spectrum and the prior spectra:

$$\frac{\partial \ell}{\partial \mathbf{z}_i}\bigg|_{\boldsymbol{\sigma}^2} - \frac{\partial \ell}{\partial \mathbf{z}_p}\bigg|_{\boldsymbol{\sigma}^2} = -\frac{k\, g_p \left(\mathbf{z}_p - \boldsymbol{\mu}\right)\left(\mathbf{z}_i - \mathbf{z}_p\right)}{m\left(\boldsymbol{\sigma}^2 + \epsilon\right)^{\frac{3}{2}}}, \tag{15}$$
Based on Equation (15), the gradients received by the in-scene target spectra (through $\boldsymbol{\sigma}^2$) are more similar to those of the prior spectra than those of the in-scene background spectra, because of their inherent data similarities. In contrast, the gradients received by the in-scene background spectra differ from those of the prior spectra because of inherent data differences.
Summing Equation (12) over all of the spectral features in the sequence gives:

$$\sum_{j=1}^{m} \frac{\partial \ell}{\partial \mathbf{z}_j} = \frac{k\, g_p}{\sqrt{\boldsymbol{\sigma}^2 + \epsilon}} + \frac{\partial \ell}{\partial \boldsymbol{\mu}} + \frac{2}{m}\,\frac{\partial \ell}{\partial \boldsymbol{\sigma}^2}\sum_{j=1}^{m}\left(\mathbf{z}_j - \boldsymbol{\mu}\right) = 0, \tag{16}$$

According to Equation (16), the total amount of gradients passed through the ICLM from $\ell$ to $\mathbf{z}$ equals 0, which means that the signs of the gradients transmitted through the ICLM to the prior and in-scene spectral features are opposite. Therefore, the more the prior spectra are duplicated in the ICLM, the stronger the supervised signal transferred to the unlabeled spectra will be; that is, $k$ determines the strength of the regularization. The larger $k$ is, the more the model relies on learning the representations of the in-scene spectra to minimize the loss function. When $k = 1$, the ICLM becomes the original BN. Because $k/m$ is then close to 0, the gradients received by the in-scene spectral features are close to 0 according to Equation (15). Hence, the original BN cannot provide adequate regularization.
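The role of $k$ can also be observed empirically. The short sweep below (illustrative sizes, same toy loss as before) shows the mean gradient magnitude on the in-scene features growing with $k$, with $k = 1$ corresponding to plain BN statistics.

```python
import torch

torch.manual_seed(0)
n, d, eps = 64, 16, 1e-5
z_in = torch.randn(n, d)
z_p = torch.randn(d)

for k in (1, 16, 64):
    zi = z_in.clone().requires_grad_(True)
    batch = torch.cat([zi, z_p.expand(k, d)], dim=0)
    mu, var = batch.mean(dim=0), batch.var(dim=0, unbiased=False)
    z_tilde = (batch - mu) / torch.sqrt(var + eps)
    z_tilde[n:].sum().backward()           # loss touches only the prior rows
    print(k, zi.grad.abs().mean().item())  # magnitude grows with k
```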
To summarize, in this section we analyzed the ICLM from three perspectives. From the perspective of data distribution, the ICLM avoids excessive aggregation or divergence of the extracted features. Regarding forward propagation, the ICLM establishes gradient propagation paths between the prior and unlabeled spectra. From the perspective of gradient back-propagation, the ICLM transfers the gradient of the loss function to the in-scene spectra, and the feature differences are used to determine the transferred gradients. As the number of duplicated prior spectra in the ICLM increases (i.e., as $k$ increases), the regularization brought by implicit contrastive learning is enhanced.
2.5. Local Spectral Similarity Constraint
With the development of imaging spectroscopy technology, the spatial and spectral resolutions of HSIs have improved. Targets in HSIs may therefore occupy multiple pixels, as shown by the example in Figure 6. For a multiple-pixel target, its 3D geometric structure and its relationship with the background vary across locations, which causes differences in its spectral characteristics. Our aim is to minimize the feature differences of these locally connected spectra, thereby improving in-scene adaptability. During training, the predicted confidence scores of the in-scene spectra can be used to select candidate targets. For each candidate sample, if there are samples with higher confidence within its 3 × 3 neighborhood, we increase the similarity between the feature representations of these two samples. Note that the LSSC can be employed whenever a target occupies two or more connected pixels.
According to the above motivation, the LSSC was proposed and applied to the representations of in-scene spectra outputted by each fully connected layer of the FCBs. First, we find the target neighborhoods. A threshold, $t$, is set to select high-confidence in-scene spectra. Given the detection results of the in-scene HSI, $\mathbf{D}$, the mask that reflects the distribution of target candidates is:

$$\mathbf{M}_i = \begin{cases} 1, & \mathbf{D}_i > t, \\ 0, & \text{otherwise}, \end{cases} \tag{17}$$
Latent features of the in-scene spectra extracted by the $k$th fully connected layer are denoted $\mathbf{Z}^{k}$. Their corresponding confidence scores are denoted $\mathbf{D}$. According to $\mathbf{M}$, the representations of the target candidates are collected into the set $\mathbf{C}^{k} = \{\mathbf{z}^{k}_{i} \mid \mathbf{M}_{i} = 1\}$. For each candidate, its 3 × 3 neighborhood is constructed, encompassing multi-level representations and confidence scores. The neighborhood of features is denoted $\mathcal{N}(\mathbf{z}^{k}_{i}) = \{\mathbf{z}^{k}_{i,1}, \ldots, \mathbf{z}^{k}_{i,8}\}$, where $\mathbf{z}^{k}_{i,1}$ represents the first neighboring feature of $\mathbf{z}^{k}_{i}$. The corresponding confidence scores are $\{d_{i,1}, \ldots, d_{i,8}\}$. For each $\mathbf{z}^{k}_{i}$, the set consisting of its neighboring target spectral features is denoted $\mathcal{T}(\mathbf{z}^{k}_{i}) = \{\mathbf{z}^{k}_{i,j} \mid d_{i,j} > d_{i}\}$, where $d_{i,j}$ represents the target confidence score of $\mathbf{z}^{k}_{i,j}$. With the above features, the proposed LSSC is defined as:

$$\mathcal{L}_{\mathrm{LSSC}} = \frac{1}{n_c} \sum_{i=1}^{n_c} \sum_{k} \sum_{\mathbf{z}^{k}_{i,j} \in \mathcal{T}(\mathbf{z}^{k}_{i})} \left\| f\!\left(\mathbf{z}^{k}_{i}\right) - \mathrm{sg}\!\left(f\!\left(\mathbf{z}^{k}_{i,j}\right)\right) \right\|_{2}^{2}, \tag{18}$$

where $\mathrm{sg}(\cdot)$ represents stopping the gradient propagation of $\mathbf{z}^{k}_{i,j}$, $f$ is the softmax operation, and $n_c$ is the number of candidate targets. Stopping the gradients of the neighboring features ensures that the LSSC enhances the detectability of the target candidates rather than diminishing that of the neighboring targets. The complete calculation process of the LSSC is described in Algorithm 2. The total optimization function, $L$, is the combination of the positive sample classification loss and the LSSC:

$$L = \ell + \mathcal{L}_{\mathrm{LSSC}}, \tag{19}$$
Algorithm 2 Pipeline for computing the LSSC

Input: Spectral features extracted by the $k$th fully connected layer: $\mathbf{Z}^{k}$; confidence scores of the in-scene spectra: $\mathbf{D}$; and the threshold of the LSSC: $t$.
Output: The loss of the LSSC, $\mathcal{L}_{\mathrm{LSSC}}$.
1: Generate the target candidate mask $\mathbf{M}$ according to the predicted confidence scores following Equation (17).
2: Collect the features of the candidate in-scene targets: $\mathbf{C}^{k}$.
3: Collect the confidence scores of the candidates: $\{d_i\}$.
4: Collect the neighboring spectral features of each $\mathbf{z}^{k}_{i}$ in $\mathbf{C}^{k}$: $\mathcal{N}(\mathbf{z}^{k}_{i})$.
5: Collect the confidence scores of the neighboring features: $\{d_{i,j}\}$.
6: Collect the desired neighboring features of $\mathbf{z}^{k}_{i}$: $\mathcal{T}(\mathbf{z}^{k}_{i})$.
7: Get $\mathcal{L}_{\mathrm{LSSC}}$ following Equation (18).
8: return $\mathcal{L}_{\mathrm{LSSC}}$.
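A hedged sketch of the LSSC computation follows, implementing Algorithm 2 under the reconstruction of Equation (18) used above (squared distance between softmax-normalized features, with gradient-stopped neighbors); the function name and the dense nested loops are illustrative rather than an efficient implementation.

```python
import torch
import torch.nn.functional as F

def lssc_loss(feats, conf, t):
    """feats: (h, w, d) features from one FC layer; conf: (h, w) target
    confidence scores; t: candidate threshold of Equation (17)."""
    h, w, _ = feats.shape
    mask = conf > t                                  # Equation (17)
    loss = feats.new_zeros(())
    for i, j in mask.nonzero(as_tuple=False).tolist():
        for di in (-1, 0, 1):                        # 3 x 3 neighborhood
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if (di, dj) == (0, 0) or not (0 <= ni < h and 0 <= nj < w):
                    continue
                if conf[ni, nj] > conf[i, j]:        # higher-confidence neighbor
                    p = F.softmax(feats[i, j], dim=-1)
                    q = F.softmax(feats[ni, nj], dim=-1).detach()  # stop-grad
                    loss = loss + ((p - q) ** 2).sum()
    n_c = int(mask.sum())
    return loss / max(n_c, 1)                        # average over candidates
```

In practice this would be evaluated for the features of every fully connected layer and added to the classification loss, following Equation (19).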