1. Introduction
Ship classification plays an important role in both military and civilian fields, such as maritime traffic control, fishing vessel monitoring, and maritime search and rescue [1,2]. However, in practice, ship classification results are highly susceptible to background clutter, and recognizing intra-class differences among various ship types has proven difficult. Therefore, ship classification has become one of the research hotspots in pattern recognition.
The main types of ship image are synthetic aperture radar (SAR), visible, and infrared images. After the launch of SEASAT in the 1970s, SAR began to be used in marine environmental research. SAR images are immune to light and weather conditions, but they have low resolution and, being radar signals, are susceptible to electromagnetic interference. Visible images, on the other hand, have high resolution and detailed texture, but they are easily affected by illumination conditions: when illumination is insufficient, the amount of detail in the acquired image drops significantly. Infrared images are likewise unaffected by light conditions; although their resolution is not very high, they preserve clear target contours, and infrared sensors produce stable imaging, which is a practical advantage. Therefore, combining visible and infrared images can improve the practicability of a ship classification system.
Ship classification methods can be generalized into two categories: traditional handcrafted-feature-based methods and convolutional neural network (CNN)-based methods. Traditional handcrafted features mainly include the histogram of oriented gradients (HOG) [3], local binary patterns (LBP) [4], Hu invariant moments [5], and the scale-invariant feature transform (SIFT) [6]. Handcrafted features are only suitable for specific applications and rely on expert knowledge.
Nowadays, with the rapid development of deep learning, CNNs have become a research hotspot in computer vision, with successful adoption in image classification [7], object detection [8], and traffic sign recognition [9]. Ding et al. [10] propose a deep CNN method combining three types of data augmentation for ship object recognition. In [11], a CNN using the extreme learning machine [12,13] is proposed to recognize infrared ship images; however, it not only needs an extreme learning machine to learn the CNN features but also requires an additional integrated extreme learning machine for classification, which doubles the complexity. Li et al. [14] propose a CNN-based ship classification method that designs two networks built on AlexNet and GoogLeNet and uses models pre-trained on the ImageNet dataset for transfer learning; this method achieves good classification performance. Zhang et al. [15] propose a multi-feature structure fusion method for maritime vessel classification based on spectral regression discriminant analysis (SF-SRDA), which combines structure fusion and linear discriminant analysis. However, it can only train and test separately on visible or infrared images, without integrating the results from the two modalities. Liu et al. [16] propose a CNN-based fusion recognition method. The method designs a sensible network model to extract features from three-band images and fuse them effectively; it then uses a mutual-information-based feature selection method to sort the features by importance, which eliminates redundant information and improves computational efficiency. Chen et al. [17] propose a coarse-to-fine CNN ship-type recognition method: the training of the ‘coarse’ step is similar to a traditional CNN, while the ‘fine’ step introduces a regularization mechanism to extract more inherent ship features, and improved recognition performance can be obtained by fine-tuning the parameter settings. Shi et al. [18] propose a deep learning framework that integrates low-level features for ship classification. Aziz et al. [19] propose a robust recognition method for maritime images based on multi-modal deep learning. Huang et al. [20] propose a ship classification model that combines multiple deep CNN features and uses a fusion strategy to explore the relationship between multi-scale features. Jia et al. [21] propose a maritime ship recognition method based on two cascaded CNNs: a shallow network quickly removes background areas to reduce the computational cost, and a deep network classifies the ship types in the remaining areas.
Most existing ship classification methods use a single band of infrared or visible images, without taking into account the complementary information in images obtained by different sensors. There is relatively little research on ship classification based on the fusion of visible and infrared images, and classification accuracy needs further improvement. Although a CNN can automatically learn high-level features from ship images, a single-scale convolution kernel may lose detailed information when extracting those features. In addition, an attention mechanism [22] (see Section 2.4) can focus on the object area and suppress useless information; applying an attention mechanism to a CNN can improve the quality of the convolutional feature maps [23]. Considering also that a single feature may not be comprehensive enough to represent ship images, this study proposes ship classification based on an attention mechanism and a multi-scale convolutional neural network (MSCNN). Firstly, the MSCNN extracts the visible image features and the infrared image features. Then, the two sets of features are fused to make full use of their complementarity and obtain more comprehensive ship information. Lastly, the attention mechanism enhances the representation capability of the fused features, so as to achieve more accurate ship classification.
The major contributions of this study can be summarized as follows: (1) a two-stream symmetric MSCNN feature extraction module is proposed to extract the features of visible and infrared images; the module can selectively extract deep features of visible and infrared images that carry more detailed information. (2) The visible and infrared image features are concatenated to further exploit the complementary information within the different modal images, so that a more detailed description of the ship object can be obtained. (3) The attention mechanism is applied to the concatenated fusion layer to enhance important local details in the feature map, thereby improving the overall classification capability of the model.
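The contributions above can be summarized as a fuse-attend-classify pipeline. The NumPy sketch below shows the data flow only; the feature dimensions, the six ship classes, the sigmoid attention gates, and the single fully connected classifier head are all illustrative assumptions, not the paper's actual network configuration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify_ship(vis_feat, ir_feat, attention_gates, fc_weights, fc_bias):
    """Fusion-and-classification sketch (hypothetical shapes).

    vis_feat, ir_feat: per-stream feature vectors, as would come from the
    two MSCNN branches; attention_gates: per-dimension weights in (0, 1);
    fc_weights/fc_bias: one fully connected layer standing in for the
    classifier head.
    """
    fused = np.concatenate([vis_feat, ir_feat])   # feature-level fusion
    attended = fused * attention_gates            # attention reweighting
    logits = fc_weights @ attended + fc_bias      # fully connected layer
    return softmax(logits)                        # class probabilities

# Toy usage with random stand-in values (6 classes, 16-dim features per stream).
rng = np.random.default_rng(0)
vis, ir = rng.standard_normal(16), rng.standard_normal(16)
gates = 1.0 / (1.0 + np.exp(-rng.standard_normal(32)))  # sigmoid gates
w, b = rng.standard_normal((6, 32)) * 0.1, np.zeros(6)
probs = classify_ship(vis, ir, gates, w, b)
```

In a trained model, the gates and classifier weights would be learned end to end; the random values here only demonstrate the order of operations.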
The remainder of this paper is organized as follows. Section 2 describes the proposed classification method in detail. Section 3 introduces the visible and infrared spectra (VAIS) dataset [24] and the parameter settings, and analyzes the experimental results. Section 4 summarizes the conclusions and prospects for future work.
4. Conclusions
In this study, the authors proposed an attention mechanism and MSCNN method for accurate ship classification. Firstly, a two-stream symmetric MSCNN is adopted to extract the features of visible and infrared images, and the two feature sets are concatenated so that their complementary information can be effectively utilized. After that, the attention mechanism is applied to the concatenated fusion layer to obtain a more effective feature representation. Lastly, the attention-modified fused features are sent to the fully connected layers and the Softmax output layer to obtain the final classification result. To verify the effectiveness of the proposed method, we conduct experiments on the VAIS dataset. The results show that, compared with existing methods, the proposed method achieves better classification performance, with a classification accuracy of 93.81%. The F1-score and confusion matrix results further validate the effectiveness of the proposed method. However, in the presence of high inter-class similarity, the proposed method still produces some misclassifications, and it slightly increases the average per-image feature extraction time. In future research, we will explore how to select among the fused features while maintaining high classification accuracy, in order to improve the efficiency of the method.