1. Introduction
The regulatory monitoring of coastal water turbidity is crucial to safeguard shoreline and marine ecology during construction activities, especially in the context of climate change. Remote sensing, with its expansive spatial coverage, can significantly aid monitoring in this endeavor [1]. In recent years, the increasing utilization of UAV imagery has brought unique advantages, such as high flexibility and on-demand deployment, that further expand monitoring capabilities [2]. UAV imagery can achieve configurable resolutions down to centimeters, contingent on flight altitude [3,4]. However, an emerging challenge associated with high-resolution remote sensing is the presence of obstructions, such as vessels and marine objects, in imagery of the coastal environment, particularly in bustling port areas. Such obstructions impede data completeness by concealing valuable information beneath them [5,6]. In particular, for the monitoring of coastal turbidity plumes, these obstructions can introduce significant errors into turbidity predictions. Thus, the proper handling of information obstructed by vessels and marine objects can be crucial in high-resolution remote sensing applications for the coastal environment.
In the field of computer vision, the techniques used to restore regions occluded by specific objects in imagery are termed “image inpainting”, and they have seen extensive development over recent decades [7]. Image inpainting methods aim to reconstruct missing or damaged regions while maintaining visual plausibility. They vary in terms of data types, input formats, referencing methods, and processing systems. Recent advancements in deep learning have also revolutionized this domain [8]. State-of-the-art image inpainting techniques can be primarily categorized into two groups: (1) non-generative methods, which utilize “copy and paste” techniques, drawing information from neighboring pixels within the image [9,10]; and (2) generative methods, which employ generative data-driven models to produce realistic and content-aware completion for occluded regions [11,12]. Noteworthy generative methods include the Context Encoder [13], Partial Convolutions [14], and DeepFill v2 [15], each designed to address specific issues such as blurriness, overall and local consistency, and realism.
Table 1 presents an overview of the loss functions utilized in these different generative inpainting methods, along with their descriptions.
Video inpainting techniques have also advanced significantly in recent years by considering the temporal dimension alongside spatial structure and motion coherence, addressing the additional complexities of handling image sequences. Examples of video inpainting models include “copy and paste” techniques that rely on optical flow for pixel tracking, and the Video Inpainting Network (VINet), which prioritizes temporal consistency through recurrent feedback, flow guidance, and a temporal memory layer [16]. To better capture the relationships between frames, several inpainting models, including DSTT [17] and FuseFormer [18], further incorporate either 3D convolutional neural networks (CNNs) or self-attention-based vision transformers. Ulyanov et al. [19] introduced another alternative, the Deep Image Prior (DIP) model, which initializes a deep neural network with random noise and optimizes it to achieve the desired properties. In doing so, the DIP model offers a novel approach to address the challenges of intricate textures, dynamic backgrounds, and computational resource constraints while eliminating the necessity for model training.
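To make the DIP concept concrete, the sketch below fits a randomly initialized encoder-decoder to the observed (unobstructed) pixels of a single frame so that the network prior fills in the masked region. It is a minimal PyTorch illustration under simplifying assumptions, not the architecture or training schedule of [19]; the names `TinyHourglass` and `dip_inpaint` are hypothetical.

```python
import torch
import torch.nn as nn


class TinyHourglass(nn.Module):
    """A very small encoder-decoder standing in for the DIP network."""

    def __init__(self, in_ch=32, out_ch=5):  # e.g., 5 multispectral bands
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)


def dip_inpaint(image, mask, num_steps=2000, lr=0.01):
    """Optimize a randomly initialized network so its output matches the
    observed pixels; the network prior then completes the masked region.

    image: (1, C, H, W) obstructed frame, values in [0, 1], H and W even
    mask:  (1, 1, H, W), 1 = observed water pixel, 0 = obstructed by a vessel
    """
    _, c, h, w = image.shape
    net = TinyHourglass(in_ch=32, out_ch=c)
    z = torch.randn(1, 32, h, w)  # fixed random noise input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        out = net(z)
        # Reconstruction loss is computed on observed pixels only.
        loss = ((out - image) ** 2 * mask).sum() / mask.sum().clamp(min=1)
        loss.backward()
        opt.step()
    with torch.no_grad():
        restored = net(z)
    # Keep observed pixels and take the network prediction elsewhere.
    return image * mask + restored * (1 - mask)
```

For a multispectral frame `frame` and a binary vessel mask `vessel_mask`, calling `dip_inpaint(frame, vessel_mask)` would return a frame in which obstructed pixels are replaced by the network prediction, without any prior training on an external dataset.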
Table 1. Existing loss functions in generative inpainting models.
| Loss Terms | Description | References |
|---|---|---|
| Reconstruction Loss | Computes the pixel-wise distance between the network prediction and the ground-truth image. | [13,20,21] |
| Adversarial Loss | Encourages closer data distributions between the real and filled images. | [17,18,22] |
| Perceptual Loss | Penalizes the feature-wise dissimilarity, computed with a pre-trained Visual Geometry Group (VGG) network, between the reconstructed and ground-truth images. | [23,24,25] |
| Markov Random Fields Loss | Computes the distance between each patch in the missing regions and its nearest neighbor. | [26,27] |
| Total Variation Loss | Computes the difference between adjacent pixels in the missing regions to ensure the smoothness of the completed image. | [28,29,30] |
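For illustration, the snippet below gives simplified PyTorch versions of two of the loss terms in Table 1: a masked L1 reconstruction loss and a total variation loss restricted to the missing region. These are generic sketches, not the exact formulations used in the cited models.

```python
import torch


def reconstruction_loss(pred, target, mask):
    """Pixel-wise L1 distance between prediction and ground truth,
    averaged separately over the missing (mask == 0) and valid regions."""
    hole, valid = (1 - mask), mask
    l_hole = (torch.abs(pred - target) * hole).sum() / hole.sum().clamp(min=1)
    l_valid = (torch.abs(pred - target) * valid).sum() / valid.sum().clamp(min=1)
    return l_hole + l_valid


def total_variation_loss(pred, mask):
    """Difference between adjacent pixels inside the missing region,
    encouraging a smooth completion."""
    hole = 1 - mask
    dh = torch.abs(pred[..., 1:, :] - pred[..., :-1, :]) * hole[..., 1:, :]
    dw = torch.abs(pred[..., :, 1:] - pred[..., :, :-1]) * hole[..., :, 1:]
    return dh.mean() + dw.mean()
```

Here `pred` and `target` are (N, C, H, W) tensors and `mask` is an (N, 1, H, W) binary tensor with 1 marking observed pixels; these conventions are assumptions for the sketch only.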
Existing general inpainting models based on conventional image and video formats can potentially be further refined for remote sensing applications to address the obstruction issue. A recent study [31] employed the DIP model to remove traces of sensitive objects from synthetic-aperture radar (SAR) images across various land surface classifications, which featured an abundance of distinctive characteristics. Long et al. [32] proposed a bishift network (BSN) model with multiscale feature connectivity, including shift connections and depth-wise separable convolution, to remove and reconstruct information obstructed by thick clouds in Sentinel-2 images. Park et al. [33] proposed a deep-learning-based image segmentation and inpainting method to remove unwanted vehicles from UAV-generated road orthomosaics. However, retrieving water surface information obstructed by vessels and marine objects is more challenging. Firstly, the water surface is typically homogeneous, providing less information for a generative model to use in reconstruction [34,35]. Secondly, the dynamic nature of the coastal environment can introduce spatial and temporal noise into the inpainting process [36]. This effect is particularly noticeable in UAV data, where images are captured sequentially along a specific flight path. During this period, variations in the water surface and the movement of marine objects can introduce inconsistent information, complicating reconstruction for the inpainting model. In addition, evaluating model performance in removing obstructions from coastal remote sensing images requires a comprehensive assessment beyond individual image-wise evaluations. To the best of our knowledge, inpainting models have not yet been applied to reconstructing water surfaces in UAV imagery, highlighting a gap in the current literature.
In this study, state-of-the-art deep-learning-based inpainting models, namely, the DSTT and DIP models, are investigated to recover missing information obstructed by vessels and marine objects in sequential UAV multispectral imagery of the coastal environment. Their performance is examined qualitatively and quantitatively using a dataset of UAV multispectral images acquired during this study for monitoring turbidity plumes in the coastal environment. In the following sections, we first describe the UAV survey before presenting the two models and discussing the results obtained.
4. Discussion and Conclusions
The qualitative and quantitative evaluations presented above reveal the strengths and weaknesses of the DSTT and DIP models for the inpainting removal of obstructions to improve UAV multispectral imagery for water turbidity retrieval as follows:
- (a)
The DSTT model excels at generating high-quality visual inpainting at low obstruction percentages but encounters resolution constraints, reducing the output resolution by approximately threefold. This limitation can significantly affect the utility of the retrieved data, particularly for detailed environmental monitoring. Further improvements in the model architecture or specific transfer learning approaches will be necessary to address this limitation in the future. Moreover, the DSTT model struggles to effectively reconstruct areas with higher obstruction percentages because it cannot leverage sufficient surrounding information to restore extensively obstructed regions. As a result, its efficacy diminishes rapidly as the obstruction percentage increases owing to its reliance on adjacent frame data. This may result in incomplete or imprecise data restoration, potentially misrepresenting environmental changes in areas with substantial obstructions. Hence, it is essential to identify a threshold obstruction percentage beyond which the DSTT model should not be applied (a simple gating check of this kind is sketched after this list).
- (b)
The DIP model demonstrates remarkable consistency in inpainting quality across different obstruction percentages. This attribute, coupled with its superior overall R2 and lower MAE scores, underscores its robustness and versatility. The DIP model also offers high adaptability with its flexible resolutions but introduces variability in image quality owing to its reliance on hyperparameters and network architecture. This inconsistency poses a challenge for tasks requiring high precision, such as sensitive environmental assessments that demand fine textural detail and consistency.
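As an illustration of the gating suggested in point (a), the following sketch computes the obstruction percentage of a frame from its vessel/object mask and selects a model accordingly. The 20% threshold is a placeholder for illustration only, not a value established by this study, and the function names are hypothetical.

```python
import numpy as np


def obstruction_percentage(mask: np.ndarray) -> float:
    """mask: boolean array, True where a pixel is obstructed."""
    return 100.0 * mask.sum() / mask.size


def choose_model(mask: np.ndarray, threshold_pct: float = 20.0) -> str:
    """Prefer the faster DSTT model for lightly obstructed frames and the
    more robust DIP model otherwise (placeholder threshold)."""
    return "DSTT" if obstruction_percentage(mask) < threshold_pct else "DIP"
```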
Comparing the two models, the DIP model outperforms the DSTT model in the inpainting removal of obstructions over a wider range of obstruction percentages, particularly in terms of temporal consistency. At the same time, the data processing time of the DIP model is approximately 92.95 h in this study using a single Intel Xeon Gold 6258R processor (Intel Corp., Santa Clara, CA, USA), which is much longer than that of the DSTT model at approximately 4.69 min. The prolonged processing time highlights a crucial trade-off between processing speed and output quality, which may impede the DIP model’s adoption for remote sensing applications where timely data retrieval is essential. Hence, future research should focus on optimization strategies, such as algorithmic developments and hardware upgrades, to further reduce processing time and enhance the model’s applicability for specific remote sensing tasks.