1. Introduction
Image segmentation has emerged as a critical technique, enabling machines to interpret and understand visual information. It involves partitioning an image into meaningful segments, allowing for the identification of distinct objects and regions within the visual data. Among the various types of image segmentation, object segmentation plays a vital role by focusing on detecting and delineating individual objects within an image. For instance, in robotic vision, object segmentation allows robots to identify and manipulate objects with precision. In the field of augmented reality, it enhances the ability to overlay digital content seamlessly onto real-world objects.
With the rapid advancement of deep learning, the performance of object segmentation has seen remarkable improvements. However, underwater object segmentation still faces significant hurdles, posing a critical challenge for numerous underwater visual applications. Unlike images captured in general settings, underwater images are marred by unique complexities. The ocean environment introduces underwater turbulence diffusion, intense light absorption and scattering, various types of noise, low contrast, uneven illumination, monotonous color palettes, and intricate backgrounds [1]. Moreover, the scarcity of underwater datasets further exacerbates these issues, complicating efforts to improve segmentation accuracy. Addressing these challenges is crucial for advancements in marine biology, underwater archaeology, and environmental monitoring, paving the way for more accurate and efficient underwater exploration and analysis.
Despite the critical importance of this field, research dedicated to underwater object segmentation remains limited. With the emergence of foundation models such as the Segment Anything Model (SAM) [2], improving the performance of underwater object segmentation with limited labelled data becomes realistic. AquaSAM [3] represents the pioneering effort to adapt SAM to the underwater domain, aiming to achieve universal image segmentation in this challenging environment. However, AquaSAM fine-tunes all the parameters of SAM, which requires more labelled data for good performance and incurs a high computational cost.
In this paper, we propose an adapted Segment Anything Model for underwater object segmentation, named WaterSAM. Inspired by LoRA [4], WaterSAM adapts SAM to underwater scenarios by injecting trainable rank decomposition matrices into each layer of the Transformer architecture. Specifically, WaterSAM adds trainable rank decomposition matrices to the image encoder of the original SAM, which enhances the encoder's feature extraction ability and helps extract robust features from underwater images. Compared with fine-tuning all the parameters of SAM, WaterSAM trains only 6.7% of the parameters while keeping the pre-trained SAM weights unchanged, greatly reducing the number of trainable parameters and the computational cost. Furthermore, the injected rank decomposition matrices allow WaterSAM to capture downstream task-specific information efficiently with less labelled data. We validated our proposed model on three underwater image datasets: COD10K, SUIM, and UIIS. The results on these datasets demonstrate that our model significantly outperforms the pre-trained SAM alone on underwater segmentation tasks, making an important contribution to the field of underwater segmentation and related tasks.
In summary, our contributions are as follows:
We propose WaterSAM, an adapted version of the Segment Anything Model (SAM) specifically designed to address the unique challenges of underwater object segmentation.
We collect and process three underwater image datasets (COD10K, SUIM, UIIS) to enhance their suitability for evaluating underwater segmentation performance.
WaterSAM significantly reduces the number of trainable parameters to just 6.7% of the original model, achieving strong performance in underwater segmentation tasks. Experimental results on these datasets demonstrate the effectiveness of WaterSAM, offering lower computational costs and efficient training with fewer labeled data.
2. Background
In this section, we review the work related to our WaterSAM, covering image segmentation and foundation model adaptation.
2.1. Image Segmentation
Image segmentation is a classical computer vision task that divides an image into specific, unique areas to highlight objects of interest and provide information for object detection and related tasks. Over the years, numerous methods for image segmentation have been developed, including classic segmentation methods, co-segmentation methods, and deep learning-based semantic segmentation. Classical image segmentation methods aim to divide an image into segments or regions based on features such as color, brightness, or texture. Co-segmentation, on the other hand, involves the simultaneous segmentation of multiple images to identify and extract the same or similar objects across these images. However, due to the limitations in performance of these traditional methods, deep learning-based approaches are now commonly applied for more accurate and effective image segmentation.
2.1.1. Segmentation Methods Based on Deep Learning
FCN: The Fully Convolutional Network (FCN) [5] was introduced by Long et al. in 2015. It transformed image segmentation into an end-to-end image processing problem by replacing fully connected layers with convolutional layers. The key innovation of FCN is its ability to handle images of any size, making it a foundational model for modern deep neural networks in semantic segmentation.
YOLO: YOLO (You Only Look Once) [6] is a prediction method based on global image information and functions as an end-to-end object detection system, which can also be employed for segmentation tasks. YOLO divides images into grids and predicts the bounding boxes and categories of objects within each grid cell. The latest version, YOLOv8, builds upon the YOLO family, incorporating the experiences of previous versions while introducing innovative features and improvements to enhance performance and flexibility.
Mask R-CNN: Mask R-CNN [7] is a state-of-the-art algorithm for object segmentation, offering capabilities in target detection, object segmentation, and key point detection. Notable for its high speed and accuracy, Mask R-CNN builds on the foundations of two classical algorithms: Faster R-CNN for object detection and FCN for semantic segmentation. Faster R-CNN provides efficient and precise target detection, while FCN excels in semantic segmentation tasks.
U-Net: U-Net [8] is a convolutional neural network designed for biomedical image segmentation. It builds upon a fully convolutional neural network with architectural modifications and extensions that enable it to achieve more accurate segmentations with fewer training images. Beyond biomedical segmentation, the U-Net architecture has been applied in diffusion models for iterative image denoising and serves as a foundation for many contemporary image generation models.
2.1.2. Underwater Image Segmentation Models
Underwater segmentation technology is pivotal in marine science, robotics, and computer vision for identifying and classifying objects or regions in underwater images. The task is particularly challenging due to poor visibility and color distortion caused by light absorption and scattering. This section provides an academic review of classical and contemporary techniques in underwater image segmentation.
Initially, traditional computer vision techniques were employed for underwater image segmentation, but they struggled with the unique characteristics of underwater environments and did not achieve high-precision results. Zhang et al. [9] proposed a locally adaptive color correction method based on the principle of minimum color loss and a fusion strategy guided by the maximum attenuation map, effectively minimizing color loss by accounting for different color channels' distinct attenuation characteristics. Similarly, Li et al. [10] introduced an underwater color image segmentation method that achieves high segmentation accuracy by dynamically estimating the optimal weights for fusing the RGB channels, resulting in a grayscale image with high foreground-background contrast.
Subsequently, researchers turned to machine learning methods, which demonstrated superior performance in extracting complex features from underwater images. Efforts have been directed towards addressing color distortion issues and employing deep learning techniques to enhance segmentation performance. Drews-Jr et al. [11] pioneered the use of a convolutional neural network (CNN) for underwater image segmentation in natural settings, pretraining the network on non-underwater images and fine-tuning it with a smaller dataset of manually labeled underwater images. Similarly, Arain et al. [12] presented methods for improving underwater obstacle detection by integrating sparse stereo point clouds with monocular semantic image segmentation. Their approach enhanced obstacle detection, effectively rejected transient objects such as fish, and improved range estimation compared to using feature-based sparse and dense stereo point clouds alone.
2.2. Foundation Model Adaptation
Foundation model fine-tuning involves further training a pre-trained foundation model with domain-specific datasets. This process aims to optimize the model’s performance on specific tasks, enabling better adaptation to and completion of tasks within a particular domain. Fine-tuning is an efficient way to enhance model performance, as it allows larger models to achieve more customized functionality. While large-scale models possess formidable capabilities, their efficacy may vary across specialized domains. However, through fine-tuning, these models can be meticulously tailored to meet the specific demands and nuances of a designated domain. This section introduces some classic foundation model fine-tuning methods.
Parameter-Efficient Fine-Tuning (PEFT) Methods
Fine-tuning a foundation model typically involves adjusting all layers and parameters to suit a specific task, using smaller learning rates and task-specific data. This process leverages the shared features of the pre-trained model but often requires substantial computational resources. In contrast, Parameter-Efficient Fine-Tuning (PEFT) technology allows models to adapt swiftly to new tasks, even in resource-constrained environments, by capitalizing on the knowledge embedded within pre-trained models. PEFT enhances model performance, reduces training duration, and lowers computational costs, making deep learning research more accessible. PEFT methods include LoRA, QLoRA, and Adapter Tuning.
LoRA: Low-Rank Adaptation (LoRA) [4] is a technique for fine-tuning large pre-trained language models, such as GPT-3 or BERT. It introduces small, low-rank matrices at crucial layers of the model, enabling fine-tuning without significant modifications to the entire model structure. This approach allows effective model fine-tuning while preserving its original performance level and minimizing additional computational burden.
QLoRA: Quantized Low-Rank Adaptation (QLoRA) [13] is an efficient model fine-tuning technique that combines the principles of LoRA with deep quantization technology. QLoRA quantizes the frozen pre-trained weights to low precision while training low-rank adapters on top of them. This approach significantly reduces memory and computational requirements in large-scale models, facilitating deployment and training in resource-constrained environments.
Adapter Tuning: Similar to LoRA, adapter tuning [14] aims to enable a pre-trained model to adapt to new tasks while keeping the original parameters unchanged. This method involves inserting small neural network modules, known as "adapters", between each layer or selected layers of the model. These adapters are trainable, whereas the parameters of the original model remain fixed.
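As a sketch of the idea (an illustration, not the original implementation of [14]), an adapter is typically a bottleneck MLP with a residual connection, inserted while the surrounding layers stay frozen:

import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project, apply a nonlinearity, up-project,
    # then add the result back to the input (residual connection).
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))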
3. Preliminary and Methodology
In this section, we provide a detailed overview of our proposed WaterSAM model. Since it is built upon the SAM model, we begin with a review of the SAM model's image encoder, which we have adapted. Next, we offer a concise introduction to Low-Rank Adaptation (LoRA) [15]. Finally, we explain how LoRA is integrated into the image encoder.
3.1. Segment Anything Model
To enhance the performance of the SAM model in underwater regions, we utilize SAM as the backbone and leverage the knowledge it has learned. As illustrated in Figure 1, SAM comprises a prompt encoder, an image encoder, and a lightweight mask decoder. It employs a Transformer-based architecture, with the image encoder built on Vision Transformer (ViT) to extract image embeddings.
The ViT-based image encoder divides the input image into fixed-size patches, linearly transforms each patch into a vector representation, and adds positional encoding to these vectors to retain positional information. This sequence of embedded vectors is then processed through a multi-layer standard Transformer encoder, which includes a multi-head self-attention mechanism and a feedforward neural network. In a standard ViT, a classification head processes the output sequence to complete the image classification task; in SAM, the encoder output instead serves as the image embedding consumed by the mask decoder.
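As a minimal illustration of this pipeline (with assumed, scaled-down hyperparameters rather than SAM's actual configuration), the following PyTorch sketch shows patch embedding, positional encoding, and a standard Transformer encoder:

import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    # Illustrative ViT-style encoder: patchify, add positions, run a Transformer.
    def __init__(self, img_size=1024, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.side = img_size // patch                       # 64 patches per side
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.side * self.side, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                   # x: (B, 3, 1024, 1024)
        x = self.patch_embed(x)                              # (B, dim, 64, 64)
        x = x.flatten(2).transpose(1, 2)                     # (B, 4096, dim)
        x = self.encoder(x + self.pos_embed)                 # (B, 4096, dim)
        B, N, D = x.shape
        return x.transpose(1, 2).reshape(B, D, self.side, self.side)

With a patch size of 16, a 1024 × 1024 input yields the 64 × 64 token grid described in the next paragraph.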
The image encoder processes an input image of 1024 × 1024 pixels and outputs a 64 × 64 feature map. We keep the weights of the pre-trained prompt encoder and mask decoder frozen to avoid substantial computational overhead. During training, we primarily fine-tune the image encoder and use bounding boxes as prompts to assist the model in achieving better segmentation performance.
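In code, this freezing strategy amounts to disabling gradients for the frozen components. The attribute names below follow the public SAM implementation, though this is a sketch rather than our exact training script:

# Given a loaded model `sam`, freeze the prompt encoder and mask decoder;
# only the (LoRA-augmented) image encoder receives gradient updates.
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False
for p in sam.mask_decoder.parameters():
    p.requires_grad = False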
3.2. Low-Rank Adaptation
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique designed to perform an implicit low-rank transformation of the weight matrices of a foundation model. The core idea of LoRA is to approximate the incremental parameters of full-parameter fine-tuning in large language models (LLMs) with fewer training parameters. This results in efficient fine-tuning that uses less memory.
In contrast to full fine-tuning, where the model starts with its pre-trained weights and undergoes iterative gradient updates to optimize the conditional language modeling objective, LoRA significantly reduces the number of trainable parameters required for downstream tasks. It achieves this by employing low-rank approximation training with smaller matrices, while keeping the original LLM parameters frozen.
The LoRA technique is represented by the equation

$$W = W_0 + \Delta W = W_0 + BA,$$

where $W_0$ represents the original parameters, $\Delta W$ represents the change in parameters, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are smaller matrices used for the low-rank approximation, with rank $r \ll \min(d, k)$. The schematic diagram of the principle is shown in Figure 2.
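A minimal PyTorch sketch of this idea (illustrative, not WaterSAM's exact code) wraps a frozen linear layer with the trainable low-rank pair B and A; following the LoRA paper, A is initialized with small random values and B with zeros so that training starts exactly at $W_0$:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Computes y = W0 x + B A x, with W0 frozen and only A, B trainable.
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze W0 (and its bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()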
3.3. Adapted Image Encoder in WaterSAM
To adapt SAM for underwater segmentation tasks, we add an adaptation module to the image encoder of SAM. The adaptation is based on LoRA and is applied to the attention mechanisms of the image encoder. In this section, we present our approach and the underlying principles.
The structure of the trainable LoRA and its application to the self-attention mechanism in the image encoder, which is based on Vision Transformer (ViT), are visualized in Figure 3. In our method, we apply LoRA to the query (Q) and value (V) layers. By modifying the query layer, the model can influence its information selection process, while adjusting the value layer allows the model to govern how it processes and utilizes the selected information.
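Injecting LoRA into these two projections can be sketched as below; the attribute names blocks, attn, q_proj, and v_proj are illustrative (SAM's ViT actually fuses Q, K, and V into a single projection, which requires a fused variant of the same idea):

# Wrap the query and value projections of every attention block with the
# LoRALinear module sketched in Section 3.2.
for blk in image_encoder.blocks:
    blk.attn.q_proj = LoRALinear(blk.attn.q_proj, r=4)
    blk.attn.v_proj = LoRALinear(blk.attn.v_proj, r=4)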
The number of trainable parameters is determined by the rank $r$ and the shape of the original weights, given by $|\Theta| = 2 \times \hat{L}_{\mathrm{LoRA}} \times d_{\mathrm{model}} \times r$, where $\hat{L}_{\mathrm{LoRA}}$ represents the number of weight matrices to which we apply LoRA, $d_{\mathrm{model}}$ represents their dimension, and $r$ represents the rank of the decomposition matrices. This approach allows for efficient fine-tuning with a significantly reduced number of parameters, enhancing the encoder's adaptability without the need for extensive computational resources. For the prompt encoder and mask decoder in WaterSAM, we keep them the same as those in the pre-trained SAM.
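For a concrete, purely illustrative count, assume a ViT-B-sized encoder with 12 Transformer blocks and $d_{\mathrm{model}} = 768$, LoRA applied to Q and V (24 adapted matrices), and rank $r = 4$:

# 2 * L_hat * d_model * r trainable LoRA parameters
L_hat, d_model, r = 24, 768, 4
print(2 * L_hat * d_model * r)   # 147456, versus roughly 86M ViT-B parameters

The exact fraction of trainable parameters depends on the chosen backbone and rank.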
5. Conclusions
In conclusion, this paper presents WaterSAM, an adapted model for underwater object segmentation based on the Segment Anything Model (SAM). Our comprehensive evaluations across multiple underwater datasets, including COD10K, SUIM, and UIIS, demonstrate WaterSAM's significant improvements in segmentation performance. By integrating trainable rank decomposition matrices into the Transformer's layers, WaterSAM effectively reduces computational costs while maintaining high accuracy. This advancement is particularly notable in challenging underwater environments, where traditional models struggle with poor visibility and complex backgrounds. The results highlight WaterSAM's potential for enhancing applications in marine biology, underwater archaeology, and environmental monitoring. Following the development of SAM, its new iteration, SAM2 [19], performs better in image segmentation and also supports video segmentation. For future work, we will adapt SAM2 for underwater video segmentation tasks.