Article

Cross-Granularity Infrared Image Segmentation Network for Nighttime Marine Observations

by Hu Xu, Yang Yu, Xiaomin Zhang and Ju He
1 School of Marine Science and Technology, Northwestern Polytechnical University, No. 127 Youyi West Road, Beilin District, Xi’an 710072, China
2 Shenzhen Research & Development Institute, Northwestern Polytechnical University, Shenzhen 510085, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Mar. Sci. Eng. 2024, 12(11), 2082; https://doi.org/10.3390/jmse12112082
Submission received: 25 October 2024 / Revised: 16 November 2024 / Accepted: 16 November 2024 / Published: 18 November 2024
(This article belongs to the Section Ocean Engineering)

Abstract

Infrared image segmentation in marine environments is crucial for enhancing nighttime observations and ensuring maritime safety. While recent advancements in deep learning have significantly improved segmentation accuracy, challenges remain in nighttime marine scenes, which are characterized by low contrast and noisy backgrounds. This paper introduces a cross-granularity infrared image segmentation network, CGSegNet, designed to address these challenges for infrared imagery. The proposed method builds a hybrid cross-granularity feature framework to enhance segmentation performance in complex water surface scenarios. To suppress the semantic disparity between features of different granularities, we propose an adaptive multi-scale fusion module (AMF) that combines local granularity extraction with global context granularity. Additionally, we incorporate handcrafted histogram of oriented gradients (HOG) features through a novel HOG feature fusion module to improve edge detection accuracy under low-contrast conditions. Comprehensive experiments conducted on a public infrared segmentation dataset demonstrate that our method outperforms state-of-the-art techniques, achieving superior segmentation results compared to specialized infrared image segmentation methods. The results highlight the potential of our approach in facilitating accurate infrared image segmentation for nighttime marine observation, with implications for maritime safety and environmental monitoring.

1. Introduction

Nighttime marine observation plays a critical role in a variety of applications, such as maritime navigation, search and rescue operations, and environmental monitoring [1]. Due to the limited visibility at night [2], infrared (IR) cameras [3] have become indispensable tools for capturing thermal signatures, which provide crucial information about the surrounding environment on the water surface. Unlike optical cameras, IR sensors detect the heat radiated by objects, allowing for effective detection and segmentation even in complete darkness. As presented in Figure 1, infrared image segmentation [4] provides a reliable means of capturing thermal signatures, making it suitable for identifying ships, obstacles, and other objects under low-light or no-light conditions.
The mainstream infrared image segmentation algorithms can be classified into traditional methods [5] and deep learning methods [6]. Traditional infrared image segmentation methods often face significant challenges in marine environments. The reflective and dynamic characteristics of water waves, combined with the low contrast between objects and the background, can lead to inaccurate segmentation results. These challenges are further amplified by the varying thermal properties of the sea surface, making it difficult for traditional segmentation algorithms [7,8] to perform consistently well. In recent years, deep learning-based methods [9] have achieved significant success in IR image segmentation, with convolutional neural networks (CNNs) leading the way due to their powerful feature extraction capabilities. CNN-based models such as FCN [10], U-Net [11], and DeepLabv3+ [12] have been widely used for IR segmentation, showing promising results. However, these models often prioritize local features, which can lead to the loss of important boundary details, particularly in noisy or complex environments.
In contrast, Transformer-based image segmentation networks [13] have gained considerable attention and demonstrated strong performance across various tasks. These models capture long-range dependencies and global context information, making them highly effective for segmenting complex scenes. Benefiting from self-attention mechanisms [14], Transformers can model relationships between distant pixels, offering an advantage over CNN-based methods that primarily focus on local features. However, when applied to infrared image segmentation, Transformer-based models also face challenges. The low contrast and high noise levels in infrared images make it difficult for Transformers to accurately capture detailed boundaries, often leading to suboptimal segmentation results in such scenarios.
To address these issues, we propose a novel cross-granularity infrared image segmentation network named CGSegNet, which employs adaptive multi-scale attention to cope with the low semantic density of infrared imagery. Granularity [15] refers to the level of detail captured by the features used by segmentation models. While local granularity provides detailed fine-scale information that captures object edges and smaller features, global granularity focuses on broader, contextual features that contribute to overall scene comprehension. The cross-granularity mechanism fuses these levels by applying attention mechanisms that dynamically weigh the importance of local versus global features, allowing for a contextually informed balance in the segmentation process. In our proposed method, we integrate both global contextual correlation and local feature extraction to enhance the segmentation performance of infrared images in nighttime marine environments. While CNN-based encoders struggle with inadequate feature extraction and poor boundary segmentation, Transformer-based methods are prone to instability in low-contrast IR images. Our approach therefore incorporates a CNN-based local encoder for detailed feature extraction as well as a Transformer-based global encoder to capture long-range dependencies. Additionally, to further improve edge detection and robustness in noisy IR images, we introduce histogram of oriented gradients (HOG) features, which are known for their stability in challenging visual conditions. This multi-scale attention approach leads to more accurate and reliable segmentation results in complex maritime scenarios. Evaluated on a public infrared segmentation dataset for nighttime marine observation, our proposed method demonstrates superior performance in infrared image segmentation, achieving state-of-the-art results. Compared to traditional approaches, our multi-scale attention network outperforms existing models in complex maritime scenarios. Specifically, our method surpasses top-performing models such as ISNet [16] and MSAFFNet [17], with clear improvements in segmentation accuracy, particularly in challenging environments where boundary details and low-contrast objects are difficult to capture. These results highlight the effectiveness and robustness of our approach in addressing the unique challenges of nighttime marine observation.
In summary, the contributions of this work are summarized as follows:
  • A cross-granularity infrared image segmentation network named CGSegNet is proposed for high-quality nighttime marine observation. We construct a hybrid CNN–Transformer–HOG feature framework with multiple granularities to enhance segmentation accuracy in complex marine scenarios.
  • We design an adaptive multi-scale fusion module (AMF) to combine CNN-based local feature extraction with Transformer-based global context modeling, effectively aligning features under granularity disparity conditions.
  • To address the challenges posed by noisy and low-contrast infrared images, we introduce handcrafted HOG features into our model, improving boundary delineation and segmentation stability under low-contrast conditions.
  • Our method achieves state-of-the-art results on a public infrared segmentation dataset, demonstrating superior segmentation accuracy compared to other baseline methods. It outperforms leading segmentation models, showcasing its robustness and effectiveness in addressing the unique challenges of infrared-based marine perception.

2. Related Work

2.1. Infrared Image Semantic Segmentation

For infrared image segmentation in marine scenarios, various methods have been proposed to address the challenges posed by low-contrast and noisy images. Region-descriptor-based methods have been particularly popular for extracting features from complex infrared images. Zhang et al. [5] proposed an infrared image segmentation model with a signed pressure function (SPF) to capture selective local attention. Chan et al. [18] utilized active contours to extract statistical average intensity information, improving segmentation accuracy for infrared images. To tackle inhomogeneous regions, Li et al. [19] introduced the local binary fitting model, which estimates neighboring pixel intensities based on the value of a central pixel. However, this approach can lead to local optima, resulting in suboptimal segmentation. To further address intensity inhomogeneity, they also proposed the local intensity clustering model, which replaces the average intensity with a local cluster function to more accurately represent varying regions. Recent advancements have seen the development of cross-granularity-based segmentation models that combine global and local information for enhanced segmentation accuracy. For instance, Fang et al. [20] introduced the weighted hybrid signed pressure force model, which integrates global and local region-based SPFs. Liu et al. [21] contributed a global and local signed energy-based model for a more comprehensive analysis of intensity differences. However, boundary leakage, where segmentation inaccurately delineates the edges of foreground objects, remains a common issue, particularly in infrared imaging, and this problem is intensified in marine environments. Overall, while cross-granularity strategies have shown improved segmentation precision in certain conditions, they continue to face the unique challenges of infrared image segmentation in complex marine scenarios. The key problem is how to effectively integrate cross-granularity features.

2.2. Visual Semantic Segmentation in Maritime Environments

Semantic segmentation for visual-based systems in maritime environments has gained significant attention, particularly for unmanned surface vehicles (USVs). Cameras, due to their improving performance and cost-effectiveness compared to radar, have become the preferred sensor for maritime semantic segmentation tasks. Various advancements have been made to enhance segmentation accuracy and robustness in challenging marine conditions. Yao et al. [22] proposed a lightweight decoder architecture that improves segmentation accuracy for marine obstacles, enabling high-speed semantic segmentation in real-time video feeds. Xue et al. [6] applied the simple linear iterative clustering algorithm to a deep semantic segmentation network and introduced a super-pixel refinement technique, boosting the accuracy of obstacle segmentation on water surfaces. Zhan et al. [23] integrated conditional random fields (CRFs) into a standard encoder-decoder network to refine segmentation boundaries and enhance prediction accuracy. Further advancements were made by Girisha et al. [24], who constructed a CNN–LSTM architecture to incorporate temporal context for maritime semantic segmentation. In the context of dynamic environments, Ding et al. [25] developed a joint learning framework that explores the contribution of temporal dynamics in video-level scene segmentation, demonstrating the importance of temporal features for semantic segmentation in driving scenarios. Shi et al. [26] enhanced the segmentation of 4D point clouds by leveraging both local and global temporal information. However, existing segmentation research in maritime environments has mainly focused on visible-spectrum semantic segmentation. Infrared image segmentation for nighttime observation remains a less explored field, yet it is crucial for maintaining segmentation accuracy under low contrast and noisy backgrounds.

3. Method

To address these challenging issues in nighttime marine observation, we propose a cross-granularity infrared image segmentation method named CGSegNet. This section describes the proposed CGSegNet in detail.

3.1. Overall Structure

Figure 2 presents the overall structure of the proposed CGSegNet, which consists of four key stages. First, in the feature extraction stage, we employ a combination of a CNN encoder, a Transformer encoder, and traditional HOG feature extraction on the infrared image. The CNN modules capture local granularity spatial features, the Transformer modules extract global granularity information, and the HOG method provides additional edge granularity information. Second, we designed an adaptive multi-scale fusion module (AMF) that addresses semantic inconsistencies and aligns the extracted features across different granularity branches, ensuring a coherent fusion of features. Furthermore, the fused features produced by the AMF are fed into a segmentation head, where HOG features with edge granularity are integrated through a HOG–deep learning fusion (HDF) module. Finally, the segmentation head produces the final segmentation result. This framework is designed to effectively handle the challenges posed by infrared image segmentation in complex marine environments, such as low contrast and noise, ensuring accurate and robust segmentation performance.
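The data flow through these four stages can be summarized in the following minimal PyTorch-style sketch. The branch and module classes are placeholders standing in for the components detailed in Sections 3.2, 3.3, 3.4, 3.5 and 3.6; this is an illustrative wiring, not the authors’ exact implementation.

```python
import torch.nn as nn

class CGSegNetSketch(nn.Module):
    """Illustrative wiring of the four stages; module internals are placeholders."""

    def __init__(self, local_branch, global_branch, hog_branch, amf, hdf, seg_head):
        super().__init__()
        self.local_branch = local_branch    # CNN encoder -> local granularity features
        self.global_branch = global_branch  # Transformer encoder -> global granularity features
        self.hog_branch = hog_branch        # handcrafted HOG -> edge granularity features
        self.amf = amf                      # adaptive multi-scale fusion (AMF)
        self.hdf = hdf                      # HOG-deep learning fusion (HDF)
        self.seg_head = seg_head            # produces per-pixel class logits

    def forward(self, x):
        z_local = self.local_branch(x)       # multi-stage CNN features
        x_global = self.global_branch(x)     # multi-stage Transformer features
        h_edge = self.hog_branch(x)          # HOG edge feature map

        fused = self.amf(z_local, x_global)  # align and fuse across granularities
        fused = self.hdf(h_edge, fused)      # inject edge granularity in the head
        return self.seg_head(fused)          # final segmentation result
```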

3.2. Local Granularity Branch

The local granularity branch is designed to capture rich local spatial information from infrared images, which is crucial for effective segmentation in complex marine environments. The network consists of five convolutional blocks and four residual blocks that progressively extract hierarchical features, ranging from low-level edge and texture information to high-level semantic features. Each residual block contains multiple convolutional layers with shortcut connections that ensure stable gradient flow. Additionally, each residual block is followed by a max-pooling layer that downsamples the feature maps and reduces computational complexity while preserving important spatial details. The extracted features from the CNN module serve as the foundation for further processing in the multi-scale fusion and segmentation stages, ensuring that fine details and local patterns are well represented in the final segmentation results.
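A minimal sketch of one residual stage of this branch is given below; the channel widths, normalization layers, and kernel configuration are assumptions, since the paper does not list them.

```python
import torch.nn as nn

class ResidualStage(nn.Module):
    """One residual stage of the local granularity branch: stacked 3x3 convolutions
    with a shortcut connection, followed by max-pooling to downsample."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # match channels for the skip
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)  # halve the spatial resolution

    def forward(self, x):
        out = self.act(self.body(x) + self.shortcut(x))  # residual connection
        return self.pool(out)                            # downsampled local features
```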

3.3. Global Granularity Branch

Our global granularity branch is designed to capture long-range dependencies and global granularity information from infrared images, which is essential for handling the challenges of low contrast and noisy environments in marine scenarios.
The initial processing stage is patch embedding. The input feature map $X \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote the height, width, and number of channels, respectively, is divided into fixed-size patches that are then linearly projected into high-dimensional representations. This operation translates raw spatial image data into compact forms by encoding local structures within each patch, preparing them for further processing in Transformer-based modules. The patch embedding process can be formulated as follows:
$$X = [X_1, X_2, \ldots, X_N], \quad N = \frac{H \times W}{p^2},$$
where $N$ is the number of embedded patches and each patch has a spatial size of $p \times p$.
As shown in Figure 3, the self-attention mechanism computes pairwise interactions between all pixel positions in the image, which allows the network to focus on important regions and suppress irrelevant noise. We adopt a Transformer layer for the embedded feature maps. The embedded features $X_i$ are first linearly projected into query Q, key K, and value V representations as follows:
$$Q = X_i W_Q, \quad K = X_i W_K, \quad V = X_i W_V,$$
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices. The self-attention mechanism is then applied to calculate the attention scores by taking the dot product of the query and key matrices, followed by a softmax operation to obtain the attention weights. To capture global context effectively, multi-head attention is employed, which allows the model to jointly attend to information from different representation subspaces:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the dimension of the query/key vectors, used for scaling to prevent large dot product values.
The output of the multi-head attention module is passed through a feed-forward network (FFN) to further process the features. The FFN consists of two fully connected layers with a non-linear activation function (ReLU) in between:
$$\mathrm{FFN}(X_i) = \max(0, X_i W_1 + b_1) W_2 + b_2,$$
where $W_1$, $W_2$, $b_1$, and $b_2$ are learnable parameters.
Residual connections and layer normalization are applied after both the multi-head attention and FFN modules to stabilize training and enable deeper network architectures:
$$\hat{X}_i = \mathrm{LayerNorm}\big(X_i + \mathrm{MultiHead}(Q, K, V)\big),$$
$$X_i = \mathrm{LayerNorm}\big(\hat{X}_i + \mathrm{FFN}(\hat{X}_i)\big).$$
After applying self-attention in the Transformer layers, a patch-merging process progressively aggregates all embedded features into a full representation. The patch-merging process can be formulated as follows:
$$X' = \mathrm{Merge}(X_1, X_2, \ldots, X_N),$$
where $X'$ denotes the merged features and $\mathrm{Merge}$ denotes the merging process along the spatial dimension.
The global granularity branch effectively captures global interactions and contextual information, complementing the local features extracted by the CNN module. It plays a critical role in aligning features across different scales and resolutions in the subsequent multi-scale fusion stage.
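The Transformer layer described by the equations above can be sketched in PyTorch as follows; the embedding dimension, number of heads, and FFN expansion ratio are illustrative assumptions rather than the authors’ settings.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Post-norm Transformer layer matching the equations above:
    X_hat = LayerNorm(X + MHSA(X)), X_out = LayerNorm(X_hat + FFN(X_hat))."""

    def __init__(self, dim=96, heads=4, ffn_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_ratio),
            nn.ReLU(inplace=True),            # max(0, X W1 + b1)
            nn.Linear(dim * ffn_ratio, dim),  # ... W2 + b2
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (batch, N patches, dim)
        attn_out, _ = self.attn(x, x, x)  # Q, K, V are projections of x
        x = self.norm1(x + attn_out)      # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))   # residual connection + LayerNorm
        return x

# Example: a 640x640 input with 16x16 patches gives N = 1600 tokens
tokens = torch.randn(2, 1600, 96)
out = TransformerLayer()(tokens)          # same shape as the input
```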

3.4. HOG Feature Branch

The HOG feature branch is used in our segmentation network to capture edge granularity from infrared images. This branch is particularly effective in highlighting object boundaries and structural information, which is critical in marine environments where object contours can be faint or obscured by noise. As illustrated in Figure 4, the HOG descriptor computes gradient orientations in localized regions of an image, allowing for robust feature extraction even in low-contrast settings.
The HOG feature extraction process begins by computing the gradient of the image. Given an input image $I \in \mathbb{R}^{H \times W}$, where H and W denote the height and width of the image, the gradients $G_x$ in the x-direction and $G_y$ in the y-direction are computed using a discrete derivative filter, such as the Sobel filter, as follows:
$$G_x(i, j) = I(i, j+1) - I(i, j-1),$$
$$G_y(i, j) = I(i+1, j) - I(i-1, j),$$
where i and j are the pixel indices in the vertical and horizontal directions, respectively.
Next, the magnitude $M(i, j)$ and orientation $\theta(i, j)$ of the gradient at each pixel are calculated as follows:
$$M(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2},$$
$$\theta(i, j) = \mathrm{atan2}\big(G_y(i, j), G_x(i, j)\big),$$
where $M(i, j)$ represents the strength of the gradient at pixel $(i, j)$, and $\theta(i, j)$ is the orientation of the gradient.
The image is then divided into small spatial regions called cells, each typically consisting of 8 × 8 pixels. For each cell, a histogram of gradient orientations is constructed. The orientations are binned into B bins (typically 9 bins for 0° to 180° in unsigned gradient mode or 18 bins for 0° to 360° in signed gradient mode), and the magnitude of the gradients is used to vote for the corresponding orientation bin. The histogram $H_b$ for bin b in cell c is computed as follows:
$$H_b(c) = \sum_{(i, j) \in c} M(i, j) \cdot \delta\big(\theta(i, j) \in \mathrm{bin}_b\big),$$
where $\delta$ is the indicator function that returns 1 if the gradient orientation $\theta(i, j)$ falls within bin b, and 0 otherwise.
To improve the robustness of the feature descriptor against illumination changes and noise, the histograms are normalized. Cells are grouped into larger spatial blocks, typically consisting of 2 × 2 cells. The histograms within each block are concatenated into a single vector and normalized. The normalization can be done using various normalization schemes, with the most common being L2 normalization:
$$v_{\mathrm{norm}} = \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}},$$
where v is the concatenated histogram vector for a block, and $\epsilon$ is a small constant added to prevent division by zero.
The final HOG descriptor is obtained by concatenating all the normalized histograms from the blocks across the entire image. The resulting feature vector provides a robust representation of the gradient structure in the image, capturing important shape and edge information while being invariant to local illumination changes and small spatial shifts. This HOG feature extraction process is crucial in our segmentation network, as it complements the high-level semantic features extracted by the CNN and Transformer modules. By capturing fine-grained edge details, the HOG features enhance the model’s ability to accurately delineate object boundaries, even in challenging infrared images with low contrast and noisy backgrounds.
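The full pipeline above (gradients, magnitude and orientation, cell histograms, block normalization) can be reproduced with the following NumPy sketch; bilinear interpolation between neighboring bins and other refinements found in production HOG implementations are omitted here.

```python
import numpy as np

def hog_descriptor(img, cell=8, block=2, bins=9, eps=1e-6):
    """Minimal HOG sketch: central-difference gradients, unsigned orientation
    histograms per 8x8 cell, and L2-normalized 2x2 blocks."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # G_x(i, j) = I(i, j+1) - I(i, j-1)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # G_y(i, j) = I(i+1, j) - I(i-1, j)
    mag = np.hypot(gx, gy)                        # gradient magnitude M(i, j)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)

    h_cells, w_cells = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((h_cells, w_cells, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for r in range(h_cells):
        for c in range(w_cells):
            m = mag[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell].ravel()
            b = bin_idx[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell].ravel()
            np.add.at(hist[r, c], b, m)           # magnitude-weighted voting per bin

    feats = []
    for r in range(h_cells - block + 1):          # overlapping 2x2 blocks of cells
        for c in range(w_cells - block + 1):
            v = hist[r:r + block, c:c + block].ravel()
            feats.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))  # L2 block normalization
    return np.concatenate(feats)                  # final concatenated HOG descriptor

# Example: a 640x640 infrared frame yields a fixed-length descriptor
descriptor = hog_descriptor(np.random.rand(640, 640))
```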

3.5. Adaptive Multi-Scale Fusion (AMF)

Semantic granularity represents context information at varying levels of detail, which is crucial for understanding both global and local features in an image. In the context of semantic segmentation, local granularity enables more precise delineation of object boundaries, while global granularity captures broader contextual relationships between objects. As there is an obvious semantic disparity between local and global granularity, incorporating semantic granularity allows models to adapt to different levels of abstraction within the image. As shown in Figure 5, in our approach to infrared image segmentation for maritime applications, semantic granularity is addressed by employing the adaptive multi-scale fusion module (AMF). In the feature extraction stage, local granularity features $Z = [Z_1, Z_2, Z_3, Z_4]$ are extracted from the local granularity branch, where each stage downsamples the feature maps. Similarly, global granularity features $X = [X_1, X_2, X_3, X_4]$ are extracted from the global granularity branch with the same sizes as the local granularity features. Defining the initial input $F_1$ as equal to $X_1$, the AMF module integrates the previous fusion feature $F_{i-1}$ with each local granularity feature $Z_i$ and the corresponding global granularity feature $X_i$ to produce $F_i$.
The AFM module consists of a multi-scale atrous convolution (MAC) operation and granularity alignment operation. The MAC module employs four atrous convolution layers, including a 1 × 1 convolution layer and three 3 × 3 convolution layers with different rates (1, 4, 8). The atrous convolutions [27], also named dilated convolutions, apply holes in the convolutional filters, effectively expanding the receptive field without increasing the number of parameters or losing information through downsampling.
The MAC operation can be expressed as follows:
$$G = \mathrm{Concat}(Z_i, X_i),$$
$$G_1 = \mathrm{Concat}\big(\mathrm{conv}_{1 \times 1, 1}(G), \mathrm{conv}_{3 \times 3, 1}(G), \mathrm{conv}_{3 \times 3, 4}(G), \mathrm{conv}_{3 \times 3, 8}(G)\big),$$
where $G_1$ denotes the output of the MAC module and $\mathrm{Concat}$ denotes the concatenation operation.
Global–local granularity alignment (GA). For the fused feature $G_1$, the GA module first applies adaptive average pooling, followed by a multi-layer perceptron built from 1 × 1 convolution layers with a ReLU activation between them. A parallel local branch applies the same multi-layer perceptron directly to $G_1$; the two outputs are summed, and a sigmoid activation function produces the alignment output. For the previous fusion feature $F_{i-1}$, we apply 3 × 3 convolution layers and a max-pooling operation to bring the features to a unified size. The GA module can be formulated as follows:
$$G_2 = \mathrm{Sigmoid}\Big(\mathrm{Conv}_{1 \times 1}\big(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(\mathrm{AvgPool}(G_1)))\big) + \mathrm{Conv}_{1 \times 1}\big(\mathrm{ReLU}(\mathrm{Conv}_{1 \times 1}(G_1))\big)\Big),$$
$$F_i = \mathrm{Concat}\Big(\mathrm{Conv}_{3 \times 3}\big(\mathrm{MaxPool}(\mathrm{Conv}_{3 \times 3}(F_{i-1}))\big), G_2\Big),$$
where $F_i$ denotes the output features of the AMF module, and AvgPool and MaxPool denote the adaptive average pooling operation and the maximum pooling operation, respectively.
The AMF module combines local feature extraction with global context modeling, ensuring that both detailed object boundaries and broader scene information are captured effectively. This balance between granularity levels is particularly important for cross-granularity fusion models.
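One AMF stage, as formulated above, can be sketched in PyTorch as follows; the channel counts, the resizing path for $F_{i-1}$, and the assumption that $F_{i-1}$ has twice the spatial resolution of $Z_i$ and $X_i$ are illustrative choices, not confirmed implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMFStage(nn.Module):
    """Sketch of one adaptive multi-scale fusion stage following the equations above."""

    def __init__(self, ch, prev_ch):
        super().__init__()
        in_ch = 2 * ch  # channels after Concat(Z_i, X_i)
        # MAC: one 1x1 conv plus three 3x3 atrous convs with rates 1, 4, 8
        self.mac = nn.ModuleList([
            nn.Conv2d(in_ch, ch, 1),
            nn.Conv2d(in_ch, ch, 3, padding=1, dilation=1),
            nn.Conv2d(in_ch, ch, 3, padding=4, dilation=4),
            nn.Conv2d(in_ch, ch, 3, padding=8, dilation=8),
        ])

        def mlp():  # two 1x1 convs with a ReLU in between
            return nn.Sequential(nn.Conv2d(4 * ch, 4 * ch, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(4 * ch, 4 * ch, 1))

        self.mlp_pooled, self.mlp_local = mlp(), mlp()
        # Conv3x3 -> MaxPool -> Conv3x3 applied to the previous fusion feature
        self.prev_path = nn.Sequential(nn.Conv2d(prev_ch, ch, 3, padding=1),
                                       nn.MaxPool2d(2),
                                       nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, z_i, x_i, f_prev):
        g = torch.cat([z_i, x_i], dim=1)                       # G = Concat(Z_i, X_i)
        g1 = torch.cat([conv(g) for conv in self.mac], dim=1)  # G_1: MAC output
        g2 = torch.sigmoid(self.mlp_pooled(F.adaptive_avg_pool2d(g1, 1))
                           + self.mlp_local(g1))               # G_2: GA alignment output
        f_prev = self.prev_path(f_prev)                        # bring F_{i-1} to this scale
        return torch.cat([f_prev, g2], dim=1)                  # F_i
```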

3.6. HOG and Deep-Learning Feature Fusion (HDF)

In our proposed HOG–deep-learning feature fusion (HDF) module, we enhance the infrared image segmentation process by integrating edge gradient features extracted using the histogram of oriented gradients (HOG) with semantic features derived from deep learning networks. As HOG features only contain low-level edge information, we fuse them with high-level contextual features in the segmentation head to aggregate edge contextual information and improve boundary segmentation accuracy. To effectively combine HOG and deep learning information, we designed a cross-granularity alignment (CGA) strategy, presented in Figure 6. This alignment mechanism allows the model to selectively emphasize important regions in the image, blending local and global information to improve segmentation accuracy. For the extracted HOG features $H \in \mathbb{R}^{H \times W \times C}$ and deep-learning features $D \in \mathbb{R}^{H \times W \times C}$, we apply the CGA operation to generate enhanced HOG features $H_1$ and enhanced deep features $D_1$, which can be calculated as follows:
$$H_1 = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathrm{AvgPool}(\mathrm{Conv}_{1 \times 1}(D)), H)\big),$$
$$D_1 = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathrm{AvgPool}(\mathrm{Conv}_{1 \times 1}(H)), D)\big).$$
After obtaining the enhanced features, we fuse them in the post-fusion stage to generate the final fusion features. The post-fusion stage applies sigmoid activation functions and an element-wise multiplication to integrate $H_1$ and $D_1$. A 1 × 1 convolution layer is then used to reduce the number of channels of the fused feature. The post-fusion pipeline can be expressed as follows:
$$F = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathrm{Sigmoid}(H_1), \mathrm{Sigmoid}(D_1), D_1 \cdot H_1)\big).$$
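A minimal PyTorch sketch of the CGA and post-fusion steps is given below; the shared channel count C and the 3 × 3 local average-pooling window are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class HDF(nn.Module):
    """Sketch of HOG-deep feature fusion following the CGA and post-fusion equations."""

    def __init__(self, ch):
        super().__init__()
        self.squeeze_d = nn.Conv2d(ch, ch, 1)             # Conv1x1 on D before pooling
        self.squeeze_h = nn.Conv2d(ch, ch, 1)             # Conv1x1 on H before pooling
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)  # local AvgPool (window is an assumption)
        self.proj_h1 = nn.Conv2d(2 * ch, ch, 1)           # Conv1x1 after Concat(., H)
        self.proj_d1 = nn.Conv2d(2 * ch, ch, 1)           # Conv1x1 after Concat(., D)
        self.fuse = nn.Conv2d(3 * ch, ch, 1)              # final Conv1x1 over the three terms

    def forward(self, h, d):
        # Cross-granularity alignment (CGA)
        d_ctx = self.pool(self.squeeze_d(d))              # AvgPool(Conv1x1(D))
        h_ctx = self.pool(self.squeeze_h(h))              # AvgPool(Conv1x1(H))
        h1 = self.proj_h1(torch.cat([d_ctx, h], dim=1))   # enhanced HOG feature H_1
        d1 = self.proj_d1(torch.cat([h_ctx, d], dim=1))   # enhanced deep feature D_1
        # Post-fusion: sigmoid-activated branches plus their element-wise product
        return self.fuse(torch.cat([torch.sigmoid(h1),
                                    torch.sigmoid(d1),
                                    d1 * h1], dim=1))     # fused feature F
```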
By combining the strengths of HOG and deep learning features, our proposed fusion module effectively enhances segmentation performance, particularly in detecting fine edges and preserving global context, which is crucial for complex infrared image segmentation tasks.

4. Experiment

4.1. Experimental Datasets

To demonstrate the effectiveness of our infrared image segmentation method in nighttime marine observation, we conducted experiments on the publicly available MassMIND dataset [28]. The MassMIND dataset is a comprehensive collection of 2916 diverse long-wave infrared (LWIR) images. We divide the MassMIND dataset into training and test sets with a ratio of 7:3. Figure 7 shows the detailed distribution of the MassMIND dataset. The images represent a wide range of marine environments, from busy harbors to less active ocean areas, and were recorded across different seasons and times of day, offering both seasonal and temporal diversity. Each image in the dataset includes pixel-level instance segmentation, with objects categorized into seven distinct classes. As the first publicly available dataset of LWIR maritime images, MassMIND provides a valuable resource for advancing segmentation methods in complex marine scenarios.
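For reference, a 7:3 split of the image list could be produced as in the following sketch; the fixed random seed and simple shuffling are assumptions and may differ from the authors’ exact split protocol.

```python
import random

def split_massmind(image_paths, train_ratio=0.7, seed=0):
    """Shuffle the MassMIND image list and split it 7:3 into train/test subsets."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# With 2916 LWIR images this yields roughly 2041 training and 875 test images.
```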

4.2. Experimental Setups and Evaluation Metrics

All experiments were conducted on a system equipped with an NVIDIA RTX 3090 GPU with 24 GB of memory and an Intel i7-9700K CPU with 64 GB of RAM. Our model was implemented in Python using PyTorch 1.9 and CUDA 11.4 for GPU acceleration. The model was trained for 200 epochs using a stochastic gradient descent (SGD) optimizer with a mini-batch size of 8. The initial learning rate was set to 0.04, and a poly learning rate decay strategy was employed to progressively reduce the learning rate during training. Additionally, we utilized the classical cross-entropy function as the loss function to train our model. The model parameters were randomly initialized, and we applied a data augmentation strategy, including random flipping (with a probability of 0.5), random rotation (ranging from −0.83 to 0.83 rad), and random cropping, to improve the model’s generalization capabilities. The input image size of our model was set to 640 × 640 pixels. Figure 8 illustrates the training process of our model.
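The optimizer and schedule described above correspond to a training loop of the following form; the momentum, weight decay, and poly decay exponent values are assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

def train(model, train_loader, max_epochs=200, base_lr=0.04, power=0.9):
    """SGD training with poly learning-rate decay and per-pixel cross-entropy loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        lr = base_lr * (1.0 - epoch / max_epochs) ** power  # poly decay schedule
        for group in optimizer.param_groups:
            group["lr"] = lr
        for images, labels in train_loader:                 # augmented 640x640 inputs, batch size 8
            optimizer.zero_grad()
            loss = criterion(model(images), labels)         # cross-entropy loss
            loss.backward()
            optimizer.step()
```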
The evaluation metrics used in our segmentation experiments include intersection over union (IoU) and the F1 score, which are widely adopted for assessing the performance of image segmentation algorithms. These metrics provide insights into the accuracy of the segmentation results by comparing the predicted regions with the ground truth labels. The IoU metric measures the overlap between the predicted segmentation ( Y ^ ) and the ground truth (M) and is defined as follows:
$$\mathrm{IoU} = \frac{|\hat{Y} \cap M|}{|\hat{Y} \cup M|} = \frac{TP}{TP + FP + FN},$$
where TP, FP, and FN represent true positive, false positive, and false negative, respectively.
The F1 score is the harmonic mean of precision (P) and recall (R), offering a balanced measure of segmentation accuracy. Precision is the ratio of correctly predicted positive pixels to the total predicted positive pixels, and recall is the ratio of correctly predicted positive pixels to the actual positive pixels in the ground truth. The F1 score is defined as follows:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}.$$
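Both metrics can be computed per class from integer label maps as in the sketch below; averaging the per-class scores to obtain mIoU and the reported F1 is an assumption about the exact evaluation protocol.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per-class IoU and F1 from integer label maps, using the TP/FP/FN definitions above."""
    ious, f1s = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + 1e-9))
        precision = tp / (tp + fp + 1e-9)
        recall = tp / (tp + fn + 1e-9)
        f1s.append(2 * precision * recall / (precision + recall + 1e-9))
    return np.mean(ious), np.mean(f1s)  # mIoU and mean F1

# Example with the seven MassMIND classes
pred = np.random.randint(0, 7, (480, 640))
gt = np.random.randint(0, 7, (480, 640))
miou, mean_f1 = segmentation_metrics(pred, gt, num_classes=7)
```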

4.3. Comparison to Other Baseline Methods

To evaluate the performance of our proposed infrared image segmentation algorithm, we conducted a comprehensive comparison against several segmentation baselines, including the classical visual segmentation methods PSPNet [29], FCN [10], U-Net [11], DeepLabv3+ [27], Swin Transformer [30], Swin-UNet [31], and SegFormer [32]. In addition, we compared our method with specialized infrared segmentation methods, such as ISNet [16], AGPCNet [33], and MSAFFNet [17]. Each baseline model was trained using its recommended strategy, ensuring a fair comparison across all algorithms. We evaluated segmentation accuracy using standard metrics, which allowed us to demonstrate the effectiveness of our proposed method in enhancing segmentation performance in complex marine environments.
Quantitative analysis. Table 1 presents the comparison results for the proposed method and the other baseline methods. Our model achieves state-of-the-art (SOTA) segmentation accuracy in terms of the mIoU and F1 score metrics, significantly outperforming the classical segmentation baseline DeepLabv3+ by 6.47% in mIoU. Compared to traditional visual segmentation approaches, infrared-specific segmentation baselines show superior performance in challenging nighttime scenarios. In terms of per-class accuracy, classes with few samples, such as bridges, obstacles, and persons, exhibit low segmentation accuracy, while large-scale classes such as sky, water, and background display high segmentation accuracy across all baseline models. Figure 9 illustrates the comparison of segmentation accuracy (mIoU) for these minority classes. From these results, it is evident that the proposed method significantly improves segmentation accuracy for minority classes. This improvement is primarily due to the cross-granularity fusion strategy, which enables better feature representation for minority classes. Additionally, our model’s use of HOG features helps to capture contour information at various scales, enhancing performance in complex scenes. Compared to the baseline methods, the proposed approach demonstrates robustness in distinguishing small targets from their surroundings. Overall, these results highlight the model’s effectiveness in handling both large-scale and minority classes under challenging infrared conditions.
Efficiency analysis. Table 2 presents a comparison of the efficiencies of different segmentation methods. In terms of efficiency, our proposed infrared image segmentation method demonstrates a competitive balance between model complexity and inference speed. When comparing the parameter sizes, our method requires 138.2 MB of memory, which is on the higher end but remains lower than certain models such as SegFormer (182.3 MB) and Swin Transformer (147.5 MB). Despite having a relatively large parameter count, our method achieves an inference speed of 19 FPS, which, while not the fastest, remains within a reasonable range for real-time applications in marine environments. Notably, models like FCN and U-Net exhibit significantly higher inference speeds (215 FPS and 112 FPS, respectively), but this comes at the cost of reduced segmentation performance, especially in challenging infrared scenarios. On the other hand, state-of-the-art models such as DeepLabv3+ (30 FPS) and SegFormer (15 FPS) offer slower speeds despite their high performance in complex segmentation tasks. Swin Transformer, while a powerful model with advanced feature extraction capabilities, only achieves 18 FPS due to its larger model size.
Confusion matrix analysis. Figure 10 illustrates a confusion matrix presenting different classes using our method. The results show high classification accuracy for dominant classes like sky and water. However, there is some misclassification, particularly between sky and background, and between obstacles and other objects, reflecting the challenging nature of distinguishing fine details in such complex scenes. This could be due to similar visual characteristics in certain parts of these classes, especially under varying lighting or atmospheric conditions. The model also confuses “Obstacle” and “Bridge” objects. Since both can be similar in structure and position in the image, the model might require further training on more specific features to improve its discrimination between these classes. The matrix also indicates that smaller classes, such as “person” and “others”, have more confusion with larger classes, such as background, due to their relatively lower representation in the dataset. This suggests room for improvement in handling minority classes and refining edge cases.
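A confusion matrix such as the one in Figure 10 can be computed from the predicted and ground-truth label maps as sketched below; normalizing each row by the ground-truth pixel count is an assumption about how the figure is presented.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    ground-truth class i pixels that were predicted as class j."""
    idx = gt.astype(np.int64) * num_classes + pred.astype(np.int64)  # joint (gt, pred) index
    counts = np.bincount(idx.ravel(), minlength=num_classes ** 2)
    cm = counts.reshape(num_classes, num_classes).astype(np.float64)
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1.0)

# Example: seven MassMIND classes (sky, water, bridge, obstacle, person, background, others)
cm = confusion_matrix(np.random.randint(0, 7, (480, 640)),
                      np.random.randint(0, 7, (480, 640)), num_classes=7)
```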
Visualization analysis. Figure 11 and Figure 12 show visualization comparisons of segmentation results. The visualization results clearly demonstrate the superior performance of our method compared to other general semantic segmentation methods like U-Net, DeepLabv3+, and SegFormer. Our method produces more accurate and detailed segmentations, particularly in complex regions such as water reflections and small obstacles. While U-Net and DeepLabv3+ struggle with distinguishing fine-grained features in low-contrast areas, our method effectively captures these details, leading to smoother and more precise boundary delineations. SegFormer, though advanced, tends to lose accuracy in smaller object detection and edge preservation, areas where our approach excels, showcasing enhanced segmentation consistency across challenging maritime scenes. While ISNet and AGPCNet perform well in standard scenarios, they tend to struggle with accurately segmenting small objects and regions with low contrast, such as distant obstacles on the water surface. MSAFFNet, though effective in capturing global context, shows limitations in preserving fine details and boundary clarity. In contrast, our method excels in both global context understanding and edge preservation, producing sharper and more coherent segmentations, particularly in complex infrared marine scenes with noise and varying environmental conditions.

4.4. Ablation Analysis

In this section, we conduct ablation studies to investigate the impacts of different components and design choices in our proposed infrared image segmentation network. We evaluate the performance of our model by progressively removing or altering key components to quantify their contribution to the overall performance.
(1) Ablation analysis using different granularity branches.
We first analyze the impact of the different granularity branches on segmentation performance. Our model incorporates global and local granularity feature extraction branches, which capture detailed local textures and global contextual information. Moreover, we introduce traditional handcrafted HOG features into the deep learning model to enhance edge contextual information. To evaluate the effectiveness of this design, we remove one branch at a time, namely the local granularity (LG) branch, the global granularity (GG) branch, and the histogram of oriented gradients (HOG) feature branch, and assess the model’s performance.
Table 3 illustrates the comparison results using different granularity branches. The global granularity (GG) branch achieves higher accuracy compared to the local granularity (LG) branch and histogram of oriented gradients (HOG) feature branch, with a mean intersection over union (mIoU) of 74.21% and an F1 score of 85.04%, but at the cost of reduced inference speed (25 FPS). On the other hand, the HOG branch shows the highest efficiency with 72 FPS, but its mIoU and F1 scores are notably lower, at 63.91% and 74.32%, respectively.
When combining branches, cross-granularity approaches show a significant improvement in accuracy. The combination of global and HOG branches (GG + HOG) delivers the best performance among the cross-granularity methods, achieving a mIoU of 77.39% and an F1 score of 88.65%, albeit with a moderate FPS of 26. Our full method, which fuses all three branches (LG + GG + HOG), achieves the highest accuracy, with a mIoU of 78.30% and an F1 score of 89.37%. However, this comprehensive approach comes with the lowest inference speed, 19 FPS, due to the added complexity of integrating multiple feature branches. Overall, the results show that while individual branches offer a trade-off between accuracy and efficiency, the full combination of local, global, and HOG features maximizes segmentation accuracy, making it a strong candidate for applications where accuracy is prioritized over speed.
(2) Ablation analysis using different fusion modules.
We next investigate the effects of different fusion strategies used to combine features from the granularity branches. Our model employs a multi-scale fusion module to effectively integrate local and global features, ensuring robust segmentation in complex environments. Table 4 illustrates the ablation results using different feature fusion methods.
Different methods used for fusing global and local features were evaluated in the global–local granularity fusion experiments. The additive fusion (Add) method provided moderate segmentation accuracy with a decent inference speed. Concatenation (Concat) slightly improved the segmentation performance, although at the cost of a reduced processing speed. Using CBAM further boosted performance, showing an improved balance between accuracy and speed. However, our adaptive multi-scale fusion (AMF) method achieved the best results, effectively integrating global and local features while maintaining a reasonable processing speed, making it the most efficient approach for this task.
For HOG–deep feature fusion, we tested various fusion techniques in combination with our HDF module, which merged handcrafted HOG features with deep learning-based features. The Add method showed solid performance with a good balance of speed and accuracy, while Concat provided a slight accuracy improvement but at a slower speed. CBAM fusion also performed well, enhancing segmentation accuracy without a significant speed penalty. However, the combination of AMF and HDF delivered the best overall results, offering a strong balance of accuracy and speed, demonstrating its effectiveness in fusing multi-scale deep features and HOG features for infrared image segmentation.

5. Conclusions

This paper introduces a novel cross-granularity infrared image segmentation network named CGSegNet to tackle the challenges of infrared image segmentation in marine environments under nighttime conditions. Several key contributions are made. First, we propose a hybrid CNN–Transformer–HOG feature framework with cross-granularity that integrates CNN-based local feature extraction with Transformer-based global context modeling, improving segmentation performance in complex water surface scenarios. We also introduce handcrafted HOG features to enhance edge detection and robustness, particularly in low-contrast, noisy environments. Additionally, we designed an adaptive multi-scale fusion module (AMF) and a novel HOG–deep feature fusion module (HDF) to address semantic inconsistencies and align the extracted features across different granularity branches, effectively ensuring a coherent fusion of features. Extensive experiments conducted on the MassMIND dataset demonstrate that our method outperforms state-of-the-art techniques in both common and complex marine scenarios.
However, challenges remain in further improving the efficiency of cross-granularity fusion segmentation methods; our method currently requires a high-performance GPU, such as the NVIDIA RTX 3090, to support real-time applications. The inference speed is affected by two main factors: the model’s hybrid CNN–Transformer–HOG framework and the high computational demands of the Transformer branch. Our framework adopts a parallel structure in which three branches independently extract features and participate in both forward and backward propagation, which is inherently time-consuming. Additionally, the Transformer branch incurs high computational and memory costs. Models with Transformer blocks exhibit lower computational efficiency than pure CNN structures, with the multi-head self-attention (MHSA) mechanism posing specific challenges for efficient architecture design. To address these limitations, we identified potential strategies for future work, including model pruning and quantization to reduce model size, adopting efficient architectures such as MobileNet or EfficientNet for faster inference, and using lightweight Transformer models to optimize the self-attention mechanism. We plan to explore these approaches to enhance the real-time applicability of our method and make it more feasible for practical deployment.

Author Contributions

Conceptualization, H.X. and Y.Y.; methodology, H.X.; software, H.X.; validation, H.X., J.H., and X.Z.; formal analysis, H.X.; investigation, H.X.; resources, H.X.; data curation, H.X.; writing—original draft preparation, H.X.; writing—review and editing, H.X.; visualization, H.X.; supervision, X.Z.; project administration, H.X.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program under grant nos. 2021YFC2803000 and 2021YFC2803001.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The segmentation model presented in this study is available from the author (H.X.) upon request.

Acknowledgments

The authors acknowledge the editors and reviewers for their comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, R.; Su, Y.; Li, Y.; Zhang, L.; Feng, J. Infrared and visible image fusion methods for unmanned surface vessels with marine applications. J. Mar. Sci. Eng. 2022, 10, 588. [Google Scholar] [CrossRef]
  2. Wang, Y.; Wang, B.; Huo, L.; Fan, Y. GT-YOLO: Nearshore infrared ship detection based on infrared images. J. Mar. Sci. Eng. 2024, 12, 213. [Google Scholar] [CrossRef]
  3. Wang, H.Y.; Fang, H.M.; Chiang, Y.C. Application of unmanned aerial vehicle–based infrared images in Determining Characteristics of Sea Surface Temperature Distribution. J. Mar. Sci. Technol. 2023, 31, 2. [Google Scholar] [CrossRef]
  4. O’Byrne, M.; Pakrashi, V.; Schoefs, F.; Ghosh, B. Semantic segmentation of underwater imagery using deep networks trained on synthetic imagery. J. Mar. Sci. Eng. 2018, 6, 93. [Google Scholar] [CrossRef]
  5. Zhang, K.; Zhang, L.; Song, H.; Zhou, W. Active contours with selective local or global segmentation: A new formulation and level set method. Image Vis. Comput. 2010, 28, 668–676. [Google Scholar] [CrossRef]
  6. Xue, H.; Chen, X.; Zhang, R.; Wu, P.; Li, X.; Liu, Y. Deep learning-based maritime environment segmentation for unmanned surface vehicles using superpixel algorithms. J. Mar. Sci. Eng. 2021, 9, 1329. [Google Scholar] [CrossRef]
  7. Xu, H.; Zhang, X.; He, J.; Geng, Z.; Yu, Y.; Cheng, Y. Panoptic water surface visual perception for USVs using monocular camera sensor. IEEE Sens. J. 2024, 24, 24263–24274. [Google Scholar] [CrossRef]
  8. Xu, H.; Zhang, X.; He, J.; Geng, Z.; Pang, C.; Yu, Y. Surround-view water surface BEV segmentation for autonomous surface vehicles: Dataset, baseline and hybrid-BEV network. IEEE Trans. Intell. Veh. 2024, 1–15. [Google Scholar] [CrossRef]
  9. Zhang, L.; Sun, X.; Li, Z.; Kong, D.; Liu, J.; Ni, P. Boundary enhancement-driven accurate semantic segmentation networks for unmanned surface vessels in complex marine environments. IEEE Sens. J. 2024, 24, 24972–24987. [Google Scholar] [CrossRef]
  10. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  12. Liu, F.; Fang, M. Semantic segmentation of underwater images based on improved Deeplab. J. Mar. Sci. Eng. 2020, 8, 188. [Google Scholar] [CrossRef]
  13. He, J.; Chen, J.; Xu, H.; Yu, Y. SonarNet: Hybrid CNN-Transformer-HOG framework and multi-feature fusion mechanism for forward-Looking sonar image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4203217. [Google Scholar] [CrossRef]
  14. Wan, M.; Huang, Q.; Xu, Y.; Gu, G.; Chen, Q. Global and local multi-feature fusion-based active contour model for infrared image segmentation. Displays 2023, 78, 102452. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Li, K.; Cheng, Z.; Qiao, P.; Zheng, X.; Ji, R.; Liu, C.; Yuan, L.; Chen, J. GraCo: Granularity-controllable interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3501–3510. [Google Scholar]
  16. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  17. Tong, X.; Su, S.; Wu, P.; Guo, R.; Wei, J.; Zuo, Z.; Sun, B. MSAFFNet: A multiscale label-supervised attention feature fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002616. [Google Scholar] [CrossRef]
  18. Chan, T.F.; Vese, L.A. Active contours without edges. IEEE Trans. Image Process. 2001, 10, 266–277. [Google Scholar] [CrossRef]
  19. Li, C.; Kao, C.Y.; Gore, J.C.; Ding, Z. Minimization of region-scalable fitting energy for image segmentation. IEEE Trans. Image Process. 2008, 17, 1940–1949. [Google Scholar] [PubMed]
  20. Fang, J.; Liu, H.; Zhang, L.; Liu, J.; Liu, H. Active contour driven by weighted hybrid signed pressure force for image segmentation. IEEE Access 2019, 7, 97492–97504. [Google Scholar] [CrossRef]
  21. Liu, H.; Fang, J.; Zhang, Z.; Lin, Y. A novel active contour model guided by global and local signed energy-based pressure force. IEEE Access 2020, 8, 59412–59426. [Google Scholar] [CrossRef]
  22. Yao, L.; Kanoulas, D.; Ji, Z.; Liu, Y. ShorelineNet: An efficient deep learning approach for shoreline semantic segmentation for unmanned surface vehicles. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5403–5409. [Google Scholar]
  23. Zhan, W.; Xiao, C.; Wen, Y.; Zhou, C.; Yuan, H.; Xiu, S.; Zou, X.; Xie, C.; Li, Q. Adaptive semantic segmentation for unmanned surface vehicle navigation. Electronics 2020, 9, 213. [Google Scholar] [CrossRef]
  24. Girisha, S.; Verma, U.; Pai, M.M.; Pai, R.M. Uvid-net: Enhanced semantic segmentation of uav aerial videos by embedding temporal information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4115–4127. [Google Scholar] [CrossRef]
  25. Ding, L.; Terwilliger, J.; Sherony, R.; Reimer, B.; Fridman, L. Value of temporal dynamics information in driving scene segmentation. IEEE Trans. Intell. Veh. 2021, 7, 113–122. [Google Scholar] [CrossRef]
  26. Shi, H.; Li, R.; Liu, F.; Lin, G. Temporal feature matching and propagation for semantic segmentation on 3D point cloud sequences. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7491–7502. [Google Scholar] [CrossRef]
  27. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  28. Nirgudkar, S.; DeFilippo, M.; Sacarny, M.; Benjamin, M.; Robinette, P. Massmind: Massachusetts maritime infrared dataset. Int. J. Robot. Res. 2023, 42, 21–32. [Google Scholar] [CrossRef]
  29. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  30. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  31. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  33. Zhang, T.; Cao, S.; Pu, T.; Peng, Z. AGPCNet: Attention-guided pyramid context networks for infrared small target detection. arXiv 2021, arXiv:2111.03580. [Google Scholar]
Figure 1. Top row presents nighttime marine observation applications, with (a) infrared cameras and (b) infrared image segmentation for nighttime marine observation. The second row demonstrates semantic segmentation in a maritime setting, with (c) infrared image input and (d) segmentation mask.
Figure 2. The overview structure of CGSegNet.
Figure 3. The detailed Transformer layer in CGSegNet.
Figure 4. HOG feature extraction produced from the infrared image.
Figure 5. The detailed structure of the proposed adaptive multi-scale fusion module.
Figure 6. A detailed structure of the proposed HOG and the deep-learning feature fusion module.
Figure 7. Massachusetts Maritime Infrared (MassMIND) dataset; (a) maritime infrared image samples, (b) pixel distribution featuring different classes in the MassMIND dataset.
Figure 8. The training process of our model. (a) shows the loss curve; (b) shows the mIoU curve.
Figure 9. The comparison results of segmentation accuracy (mIoU) in classes with small sample sizes using our method and other baseline methods.
Figure 10. A confusion matrix presenting different classes using our method.
Figure 11. Visualization results using our method and other general semantic segmentation methods.
Figure 12. Visualization results using our method and other specific infrared image semantic segmentation methods.
Table 1. Comparison results using our method, other visual segmentation, and specialized infrared segmentation baseline methods in the MassMIND dataset.
(Per-class values are IoU [%]; mIoU and F1 are also given in %.)

| Method | Sky | Water | Bridge | Obstacle | Person | Background | Others | mIoU | F1 |
|---|---|---|---|---|---|---|---|---|---|
| PSPNet [29] | 95.92 | 96.76 | 47.60 | 35.58 | 31.52 | 67.35 | 85.47 | 65.74 | 75.32 |
| FCN [10] | 96.21 | 97.42 | 50.85 | 38.32 | 36.76 | 70.14 | 86.25 | 67.99 | 76.46 |
| U-Net [11] | 96.83 | 97.48 | 51.64 | 40.69 | 38.51 | 72.87 | 86.94 | 69.28 | 78.35 |
| DeepLabv3+ [27] | 97.40 | 97.91 | 55.19 | 46.32 | 41.59 | 76.60 | 87.83 | 71.83 | 80.91 |
| Swin Transformer [30] | 97.87 | 97.83 | 59.93 | 50.75 | 46.97 | 80.54 | 89.24 | 74.73 | 84.57 |
| Swin-UNet [31] | 96.92 | 97.59 | 57.48 | 49.18 | 45.84 | 78.38 | 87.56 | 73.27 | 83.42 |
| SegFormer [32] | 97.61 | 98.24 | 61.45 | 52.09 | 48.61 | 82.44 | 90.39 | 75.83 | 85.16 |
| ISNet [16] | 96.37 | 97.31 | 58.36 | 48.90 | 45.07 | 77.35 | 87.91 | 73.04 | 83.85 |
| AGPCNet [33] | 97.58 | 97.88 | 60.23 | 51.41 | 47.94 | 80.03 | 88.27 | 74.76 | 85.14 |
| MSAFFNet [17] | 98.02 | 98.37 | 62.38 | 52.83 | 49.39 | 81.56 | 89.63 | 76.02 | 86.52 |
| Ours | 98.52 | 98.75 | 66.04 | 57.21 | 53.71 | 83.07 | 90.85 | 78.30 | 89.37 |
Table 2. A comparison of the efficiencies of different segmentation methods.
| Method | Parameters (MB) | Inference Speed (FPS) |
|---|---|---|
| PSPNet [29] | 47.02 | 39.1 |
| FCN [10] | 11.05 | 215.3 |
| U-Net [11] | 25.31 | 112.4 |
| DeepLabv3+ [27] | 55.84 | 30.2 |
| Swin Transformer [30] | 147.5 | 18.5 |
| Swin-UNet [31] | 41.59 | 22.1 |
| SegFormer [32] | 182.3 | 15.3 |
| ISNet [16] | 39.2 | 25.6 |
| AGPCNet [33] | 107.3 | 17.4 |
| MSAFFNet [17] | 52.3 | 21.2 |
| Ours | 138.2 | 19.1 |
Table 3. Ablation experiment featuring different granularity branches.
| Method | mIoU | F1 Score | Inference Speed (FPS) |
|---|---|---|---|
| Baseline |  |  |  |
| Local granularity (LG) branch | 71.98 | 80.62 | 48.5 |
| Global granularity (GG) branch | 74.21 | 85.04 | 25.0 |
| Histogram of oriented gradients (HOG) feature branch | 63.91 | 74.32 | 72.1 |
| Cross-granularity |  |  |  |
| LG + GG | 76.98 | 87.49 | 23.3 |
| LG + HOG | 74.15 | 85.20 | 30.2 |
| GG + HOG | 77.39 | 88.65 | 26.5 |
| LG + GG + HOG (Ours) | 78.30 | 89.37 | 19.1 |
Table 4. Ablation experiment featuring different feature fusion methods.
| Global–Local Granularity Fusion | HOG–Deep Features Fusion | mIoU | F1 Score | Inference Speed (FPS) |
|---|---|---|---|---|
| Add | HDF | 74.19 | 83.69 | 23.5 |
| Concat | HDF | 75.37 | 86.52 | 17.5 |
| CBAM | HDF | 76.03 | 87.59 | 20.3 |
| AMF | HDF | 78.30 | 89.37 | 19.1 |
| AMF | Add | 75.65 | 87.32 | 21.5 |
| AMF | Concat | 76.85 | 87.91 | 18.2 |
| AMF | CBAM | 76.20 | 88.54 | 19.4 |
| AMF | HDF | 78.30 | 89.37 | 19.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
