1. Introduction
If an aircraft encounters safety problems during a flight, the consequences are severe [
1]. Ensuring airport safety, in addition to flight safety, is particularly important. When taxiing and parking an aircraft, it is necessary to frequently check whether there are other personnel on the airport surface. This is complicated by the fact that the appearance of staff and pedestrians on the airport surface is random and irregular. Therefore, real-time monitoring of staff and pedestrians is challenging [
2]. Furthermore, staff and pedestrians can be regarded as small targets relative to the aircraft [
3]. Owing to problems such as light reflection and long distances, small targets are difficult to detect [
4]. It is necessary to ensure that there are no pedestrians or other obstacles on the apron and flight region during the taxiing and parking of an aircraft [
5].
In response to the first problem, namely the random presence of personnel on the airport surface, the detection of targets on the apron and in the flight region relies mainly on monitoring and surveillance systems [
4]. A vision-based surveillance system can effectively reduce labour costs and the work pressure on supervisors. Moreover, such a system provides uninterrupted intelligent patrols 24 h a day, 7 days a week, which improves patrol efficiency and alarm accuracy, effectively solving the problem of the random appearance of airport personnel. In addition, vision-based surveillance systems can be used to modernise traditional security operations and management modes as well as improve the value of monitoring equipment and the efficiency of decision-making [
6]. Finally, they provide off-site and intelligent management methods for airport flight area operation specifications, quality control, and safety precautions, and improve precise supervision and scientific safety operation management capabilities [
7].
However, the second problem, the difficulty of small-target detection, arises from the following: (1) automatic dependent surveillance-broadcast (ADS-B) [8,9] cannot detect non-cooperative targets, such as pedestrians and staff, and (2) the low resolution of surface movement radar (SMR) [10,11] prevents the effective detection of small targets. Although vision-based surveillance systems can monitor all targets on the entire apron in real time and can even detect some small targets, numerous problems are still encountered in the detection of small targets, such as the fine segmentation and localisation of such targets and the learning of their non-salient features [12]; further, bad weather can result in the omission of personnel or obstacles on the apron. In summary, the above difficulties limit the detection accuracy of small targets in vision-based surveillance systems. This paper proposes a method for small-target detection on the apron to address these problems. The contributions of this study can be summarised as follows:
- (1)
A small-target detection model that can be applied to apron monitoring systems is proposed to monitor small targets (staff and pedestrians) on the airport surface.
- (2)
To enhance small-target detection accuracy, an improved small-target detection method is proposed to better extract and fuse features.
- (3)
A new, standard apron small-target dataset, mainly composed of airport apron pedestrians and staff, is established.
- (4)
The effectiveness and feasibility of the proposed method are verified on a real apron, and small targets are effectively detected, proving that the proposed method can be applied to monitor a real apron.
The remainder of this paper is organised as follows:
Section 2 presents a brief review of recent methods related to small-target detection. The proposed method is described in detail in
Section 3.
Section 4 presents the experimental results and analysis, including a comparison of the results with those of the current small-target detection algorithm. Finally, in
Section 5, we present our conclusions.
2. Related Works
Typical airport surface monitoring systems mainly monitor pedestrians, staff, and vehicles on the airport surface; however, people are much smaller than vehicles, so small-target surveillance on the airport surface should focus on monitoring people. Owing to the large area of the apron, it takes staff a significant amount of time to monitor targets on the apron, and in bad weather, personnel or obstacles on the airport surface may be missed. If not detected in time, such omissions may lead to serious safety hazards. Additionally, the limitations of ADS-B and SMR make it particularly important to develop a small-target surveillance system. Vision-based surveillance systems can address these issues.
In recent years, deep convolutional neural networks have been widely used in various vision-based surveillance methods, especially for the detection of small targets. Furthermore, the detection of small targets is widely used in various fields, such as infrared small-target detection [
13], defect detection for industry 4.0 [
14], and pest detection in smart farming [
15]. Such detection methods are divided into two categories according to how candidate boxes are processed [16]: (1) One-stage detection methods based on regression, which take the entire image as input to enlarge the receptive field and directly regress the position and category information of targets at different locations in the image; the most representative methods in this category include the YOLO series [17,18] and SSD [19]. (2) Two-stage detection methods, which first extract the candidate boxes that may contain targets, classify each candidate region, and then perform position regression; such methods mainly include the R-CNN [20], Fast R-CNN [21], Faster R-CNN [22], and Mask R-CNN [23,24,25]. One-stage methods have fast detection speeds and high adaptability to large targets; however, they easily miss small targets. Two-stage methods exhibit relatively high accuracy in small-target detection and are therefore highly suitable for this task. Thus, we chose the Mask Scoring R-CNN, a variant of the Mask R-CNN, as our baseline.
However, numerous challenges are encountered in the detection of small targets when using deep learning. First, target detection based on deep learning uses a convolutional neural network (CNN) as a feature extraction tool. When attempting to increase the receptive field in the CNN, the feature map shrinks, and the stride length may be larger than the size of the small targets; this makes it difficult for the small target to be passed on to the deep network during network convolution [
26]. Second, in commonly used datasets such as Microsoft Common Objects in Context (MS COCO), small targets account for a small proportion of all annotated targets, and the size difference between large and small targets is significant; this makes it difficult for the network to adapt to small targets. Third, owing to the complexity of the airport surface, it is highly difficult for staff to monitor targets on the apron. Target detection is the basic module of vision-based surveillance and directly determines the overall performance of the surveillance system. However, because apron targets are so small, they occupy few pixels, their shapes and outlines are unclear, and they appear similar to the surrounding background.
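To make the first difficulty concrete, the short sketch below (illustrative only) counts how many feature-map cells a target the size of the largest one in our dataset (45 × 21 pixels, see Section 4.1) can occupy at the typical strides of a ResNet-FPN backbone.

```python
# Illustrative sketch: how backbone stride shrinks a small target.
def feature_cells(target_w: int, target_h: int, stride: int) -> int:
    """Approximate number of feature-map cells a target can cover."""
    return max(target_w // stride, 1) * max(target_h // stride, 1)

for stride in (4, 8, 16, 32):  # typical ResNet-FPN strides (C2..C5)
    print(f"stride {stride:2d}: ~{feature_cells(45, 21, stride)} cell(s)")
# At stride 32, the whole person collapses into a single cell, so the
# deepest layers alone carry almost no spatial detail for such targets.
```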
This study improves on the Mask Scoring R-CNN algorithm and proposes a method of small-target detection based on an attention mechanism, as shown in
Figure 1. The proposed method can not only obtain the location and category of the small targets, but also finely segment them to obtain the corresponding geometric properties (including the length, width, area, contour, centre, etc., of the targets).
3. Materials and Methods
This study presents a model that shows improvements in terms of three aspects: feature extraction, feature fusion, and classification. First, to enable the network to learn to extract the representative features of small targets on airport aprons and provide more effective feature information for the classifier and mask prediction, the proposed method adds an attention module [
27,
28,
29] to the feature-extraction network. The dependency relationship and positional relationship in the feature space guide the network to increase the weight of useful features of small targets on airport surfaces, making the network pay more attention to features that are conducive to the detection of small targets and ignore redundant and invalid information. Second, for the feature fusion module, a bidirectional feature pyramid network (BiFPN) is introduced [
30,
31] and then compared with the traditional feature pyramid network (FPN). BiFPN uses weighted fusion that enables the network to perform fusion more effectively. Finally, a more effective detection branch structure is proposed by changing the network structure in the original algorithm, thereby further improving the detection accuracy of small targets [
32,
33].
3.1. Attention Module: CBAM
Spatially, small targets on airport aprons are randomly located, and the attention module can guide the neural network to highlight important features during detection to improve the locating accuracy. Therefore, the algorithm used in this study refers to the convolutional block attention module (CBAM) [
34,
35]. The differences in the attention module (
Figure 2) between the original method and the proposed method are as follows: in the original method, the attention module is embedded into the feature extraction network as a connection module between residual blocks; in contrast, this study attaches an independent attention module to the output of the feature extraction network. The proposed method therefore does not change the structure of the original feature extraction network, which can still be initialised with pre-trained weights. Each attention module comprises two sub-modules: the channel attention (
Figure 3) and spatial attention modules (
Figure 4).
1. Channel attention module: In a convolutional neural network, the size of the feature map gradually decreases as the depth of the network increases, while the number of feature channels increases accordingly, enabling the network to extract richer features; the information extracted by one feature channel differs from that extracted by another. Although it is necessary to increase the number of feature channels during feature extraction, not all features are necessary, and some play only an auxiliary role. The purpose of the channel attention module is to allow the network to give greater weight to important feature channels, amplify feature information that contributes significantly to subsequent tasks, and suppress irrelevant feature channels. The feature map processed via channel attention thus pays more attention to the amplified feature information.
Different feature channels contain different convolutional information. The channel attention module allows the network to place more weight on the important small-target feature channels, thereby suppressing insignificant feature channels, as shown in
Figure 3. As can be seen from this figure, global maximum pooling and global average pooling are first performed on the input feature map to aggregate its spatial information, generating a maximum-pooled vector and an average-pooled vector. Both vectors are separately sent to a shared multilayer perceptron (MLP). To reduce the parameter overhead, the first layer uses C/r neurones (where r is the channel reduction ratio) with rectified linear unit (ReLU) activation, and the second layer uses C neurones; the length of the feature vector is thus first decreased and then increased, which reduces the parameter overhead and improves the speed of attention generation. After MLP processing, the two output vectors are merged via element-wise addition [36,37,38] and activated by the sigmoid function to generate the final channel attention map. The channel attention can be calculated as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))\big) \tag{1}$$

where $F$ represents the input feature map, $F^c_{\mathrm{avg}}$ represents the average-pooled feature vector, $F^c_{\mathrm{max}}$ represents the maximum-pooled feature vector, $W_0$ and $W_1$ represent the weight parameters of the shared MLP, and $\sigma$ represents the sigmoid activation function.
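As a concrete illustration of (1), the following PyTorch sketch implements the channel attention sub-module described above; the class name and the reduction ratio r = 16 are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of Eq. (1): a shared MLP over the average- and
    max-pooled vectors. A minimal sketch; r = 16 is an assumed default."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP: C -> C/r -> C, implemented with 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # global maximum pooling
        return torch.sigmoid(avg + mx) * x  # reweight each feature channel
```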
2. Spatial attention module: In addition to knowing which feature channels are important, the feature extraction network can distinguish the importance of local spatial features. The spatial attention module guides the network to highlight regions containing useful information based on spatial relationships; because small targets on the apron appear at random locations, highlighting important local feature information is conducive to locating them. The input of the spatial attention module is the output feature of the channel attention module. Along the channel dimension, two two-dimensional spatial attention maps are obtained using global average pooling and global maximum pooling. The two maps are connected via channel concatenation, a 7 × 7 convolution is applied, and the spatial attention map is generated by sigmoid activation [39,40,41]. Spatial attention is computed as follows:

$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big) \tag{2}$$

where $\sigma$ represents the sigmoid activation function, $f^{7\times 7}$ represents a convolution with a kernel of $7 \times 7$, and $[\,\cdot\,;\,\cdot\,]$ represents concatenation along the channel direction.
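Similarly, the minimal PyTorch sketch below illustrates the spatial attention of (2); chaining it after the channel attention module reproduces the CBAM ordering described above. The class name and the default kernel size are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Eq. (2): a 7x7 convolution over the channel-wise
    average- and max-pooled maps. A minimal sketch of the module above."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)    # [B,1,H,W] channel-wise average
        mx, _ = torch.max(x, dim=1, keepdim=True)   # [B,1,H,W] channel-wise maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return attn * x  # highlight informative spatial locations
```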
3.2. Feature Fusion: BiFPN
The smallest feature layer output by the backbone is 32 times smaller than the original image, which is not conducive to the detection of small targets on the apron. When the area of a small target is less than $32^2$ pixels, its extracted features are compressed to less than one pixel, essentially losing the features of the small target. Deep convolutional layers contain more semantic information about small targets, which helps with target classification. Shallow convolutional layers are more likely to be activated by local textures and can capture more specific details, which is suitable for detecting small targets. Therefore, combining both kinds of layers improves the detection of small targets.
In an FPN, features at different levels are directly added during feature fusion; this gives the features of different levels the same weight, forcing different feature layers to contribute equally to the detection results. By adding a top-down path, shallow features can be fused with the powerful semantic information of deep features. However, because of the relatively large number of convolutions from the bottom to the top layer (dozens or even hundreds of convolution layers), many detailed features of the target are lost in the process; this lack of fine detail in the deep features affects the detection accuracy of the network. BiFPN adds bottom-up channels and weighted fusion, which compensates for the lack of detailed information and produces a more effective multiscale feature fusion.
In Figure 5, $C_i$ and $P_i$ represent the features output by the feature extraction network and the fused features, respectively. After the higher-level feature is up-sampled, it is fused with the same-level backbone feature according to (3), and the intermediate feature $P_i^{td}$ is then obtained through a 3 × 3 convolution. The same method is then used to fuse the intermediate feature, the backbone feature, and the lower-level fused output to generate $P_i$; the specific formula is shown in (4). Following the fast normalised fusion of BiFPN [30], the two steps can be written as

$$P_i^{td} = \mathrm{Conv}\!\left(\frac{w_1 \cdot C_i + w_2 \cdot \mathrm{Resize}(P_{i+1}^{td})}{w_1 + w_2 + \epsilon}\right) \tag{3}$$

$$P_i = \mathrm{Conv}\!\left(\frac{w_1' \cdot C_i + w_2' \cdot P_i^{td} + w_3' \cdot \mathrm{Resize}(P_{i-1})}{w_1' + w_2' + w_3' + \epsilon}\right) \tag{4}$$

where the $w$ terms are learnable non-negative fusion weights, $\mathrm{Resize}$ denotes up- or down-sampling to the matching resolution, and $\epsilon$ is a small constant that avoids numerical instability.
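The PyTorch sketch below illustrates the weighted fusion of (3) and (4), assuming the standard fast normalised fusion of BiFPN [30]; the class name, channel width, and ε value are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalised fusion used in Eqs. (3) and (4): a sketch assuming
    the standard BiFPN formulation of [30]; names are ours."""
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # learnable fusion weights
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, inputs):
        w = F.relu(self.w)            # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)  # normalise so the weights sum to ~1
        fused = sum(wi * xi for wi, xi in zip(w, inputs))
        return self.conv(fused)       # 3x3 convolution after fusion

# Top-down step of Eq. (3): fuse C_i with the up-sampled higher-level feature.
# p_td = WeightedFusion(2, 256)([c_i, F.interpolate(p_higher, scale_factor=2)])
```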
3.3. Classifier Head
The mask head of the Mask Scoring R-CNN method requires a classifier head to provide a category and bounding box for small targets; the bounding box is used to clip the corresponding feature layer, which is then sent to the segmentation network, and the final target mask is selected according to the target category. During detection, the original target detector predicts numerous boxes containing small targets; however, these boxes are redundant, and some overlap heavily, so many duplicated boxes must be filtered out via non-maximum suppression. We believe that the input feature already encodes rich bounding-box information because it is obtained through dozens of convolution layers. In this study, we adopted a deeper network structure (
Figure 6). Four convolutions were used to replace the first fully connected layer to recode the input features. To weaken the influence of external information during detection and highlight the target features, the output size of each convolution layer is always consistent with the input [
42].
In this study, the number of categories of small airport targets and other routine detection tasks is significantly lower than that of the COCO dataset, which contains 80 types of targets; this study detects and locates only people. Furthermore, using only fully connected layers to complete both tasks is clearly not conducive to the positioning of small targets on the apron. Therefore, given that the classification task is not complex, and considering that using both fully connected and convolutional branches greatly increases the network parameters and computational cost, we use convolutional layers, which are better suited to localisation, to complete the classification and localisation tasks. As mentioned above, the network follows the setup of the R-CNN, with four shared convolution layers and one fully connected layer. The convolutional layers extract local information suitable for classifying and locating targets in the proposed box, and the fully connected layer introduces position coding, integrating global information across different positions. At the network parameter level, the parameter count of the original fully connected layer was significantly larger than that of the convolutional layers that replace it [43].
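A minimal PyTorch sketch of the modified head described above is given below: four shared 3 × 3 convolutions that preserve the spatial size, followed by a single fully connected layer feeding the classification and box-regression outputs. The channel width, RoI size, and layer names are assumed values, not the paper's exact configuration.

```python
import torch.nn as nn

class ConvFCHead(nn.Module):
    """Classifier head sketch: four shared 3x3 convolutions (spatial size
    preserved, matching the text) followed by one fully connected layer.
    Channel width and 7x7 RoI size are assumptions."""
    def __init__(self, in_channels: int = 256, roi_size: int = 7,
                 fc_dim: int = 1024, num_classes: int = 2):  # person + background
        super().__init__()
        convs = []
        for _ in range(4):  # four shared convs re-encode the RoI features
            convs += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                      nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.fc = nn.Linear(in_channels * roi_size * roi_size, fc_dim)
        self.cls_score = nn.Linear(fc_dim, num_classes)      # classification
        self.bbox_pred = nn.Linear(fc_dim, 4 * num_classes)  # box regression

    def forward(self, roi_feats):  # roi_feats: [N, C, roi_size, roi_size]
        x = self.convs(roi_feats)
        x = self.fc(x.flatten(start_dim=1)).relu()
        return self.cls_score(x), self.bbox_pred(x)
```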
4. Experimental Results
4.1. Dataset
Because no datasets of small airport targets are publicly available, we could not compare our results with those reported in other studies under the same standard. We therefore created a small-target dataset simulating an airport and performed several experiments on it using our method.
The sizes of the small targets used in this experiment were based on the definitions of relative and absolute scale: the width and height of a small target are generally considered to be less than one-tenth of those of the image, and its absolute area is generally smaller than 32 × 32 pixels. In other words, if the area covered by a target is less than 0.25% of the entire image, it can be considered a small target. We took 300 high-quality images of an airport surface, containing 1500 small targets, by constructing a simulated airport experimental platform (Figure 7). All the images were annotated pixel-by-pixel using LabelMe, and the dataset was split in a 7:3 ratio between the training set and test set. The format of our dataset is the same as that of the COCO dataset. According to the pixel size statistics of the targets, the largest target measured 45 × 21 pixels and the maximum target size proportion was 0.12%; therefore, all targets qualify as small. Because different targets occupy different pixel proportions, we divided the small targets into 11 proportional intervals from 0 to 0.25% and counted the number of correctly identified targets in each interval to compare the identification accuracy of different methods. The number of targets in each interval is shown in Figure 8.
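For illustration, the hypothetical helper below mirrors the binning rule described above: it maps a target's relative area to one of 11 equal intervals over (0, 0.25%]. The 1024 × 768 frame size in the example is an assumption chosen to be consistent with the reported 0.12% maximum proportion.

```python
# Hypothetical helper mirroring our dataset statistics: bin each annotated
# target into one of 11 equal intervals of relative area over (0, 0.25%].
def scale_bin(box_w: int, box_h: int, img_w: int, img_h: int,
              num_bins: int = 11, max_ratio: float = 0.0025) -> int:
    """Return the 0-based interval index for a target's relative area."""
    ratio = (box_w * box_h) / (img_w * img_h)
    assert ratio <= max_ratio, "not a small target by the relative-scale rule"
    return min(int(ratio / (max_ratio / num_bins)), num_bins - 1)

# Example: the largest target (45 x 21 px) in an assumed 1024 x 768 frame
# has a relative area of ~0.12% and falls in interval 5 of the 11.
print(scale_bin(45, 21, 1024, 768))  # -> 5
```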
4.2. Experimental Details
We used the Mask Scoring R-CNN model as the baseline in our comparison experiments. All the pre-trained models used in this study are publicly available, and the accuracy over different pixel-proportion ranges was taken as the evaluation index. All the models were trained on a GPU with two images per batch, and the network weights were updated using the Adam optimiser. Unless otherwise specified, all the models were trained for 96 epochs with an initial learning rate of 0.001, and the learning rate was decayed with a cosine annealing policy; the minimum learning rate was set to 0.001 times the initial learning rate. The models were initialised with ImageNet pre-trained weights. A multiscale training strategy was added to the final model, in which the short side of the input image was randomly sampled between 640 and 768 with a step size of 32.
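A hedged sketch of this training schedule in PyTorch is shown below; the stand-in module and the per-epoch scheduler stepping are illustrative assumptions, not the paper's training code.

```python
import random
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)  # stand-in for the detection network
base_lr, epochs = 0.001, 96
optimizer = Adam(model.parameters(), lr=base_lr)
# Cosine cooling of the learning rate down to 0.001 x the initial value.
scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=base_lr * 1e-3)

for epoch in range(epochs):
    # ... one training epoch over the apron dataset would run here ...
    # Multiscale strategy: resample the short side of the input images
    # from 640 to 768 in steps of 32 (the long side scales accordingly).
    short_side = random.randrange(640, 768 + 1, 32)
    scheduler.step()
```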
4.3. Ablation Experiments
Based on the pixel sizes of all targets in the dataset (the largest target measured 45 × 21 pixels, and the maximum target size proportion was 0.12%), the number of correctly identified targets in each of the 11 proportional intervals was counted, and the identification accuracy of the different methods was compared. The results of the ablation experiments are as follows.
Attention module (CBAM): Comparing Table 1 and Table 2 shows that the detection results with the attention mechanism are better than those of the baseline: accuracy increased by 8% in the (0, 0.01) range and by 6.06% in the (0.01, 0.02) range, and the results in the other ranges reached 100%. This shows that the attention mechanism added in this study had a noticeable effect and is conducive to the detection of small targets.
Feature fusion (BiFPN): Table 3 shows that adding the feature fusion module BiFPN alone increased the detection accuracy over the Mask Scoring R-CNN by 3% in the (0, 0.01) range and by 4.55% in the (0.01, 0.02) range. Overall, seven more small targets were detected than with the baseline.
Classifier head: Table 4 shows that the detection head added in this study also had a noticeable effect: detection accuracy improved by 2% in the (0, 0.01) range and by 7.58% in the (0.01, 0.02) range, and nine more small targets were detected in total.
Our method: As shown in Table 5, the proposed method, which combines the above three modifications at specific positions, obtained the best results: accuracy in the (0, 0.01) range increased by 16%, accuracy in the (0.01, 0.02) range increased by 15.15%, and 27 more small targets were detected overall, which effectively demonstrates that the proposed method better detects small targets.
According to the results, the proposed modifications improve detection accuracy over the Mask Scoring R-CNN. Among the individual modules, adding the CBAM alone yielded the largest single improvement over the baseline in the (0, 0.01) scale range (Figure 9); that is, detection accuracy improved by 8%. Compared with the baseline, the full method proposed in this study improved accuracy by 16% and 15.15% in the (0, 0.01) and (0.01, 0.02) ranges, respectively, and all other intervals were also improved. As can be seen in Table 6, the model parameters exhibit only modest changes, and the remaining metrics change little. This is sufficient to demonstrate that the proposed method has a higher detection accuracy for small targets.
4.4. Comparison with Other Methods
In our experimental results, 17 images contained undetected small targets, whereas the Mask R-CNN results contained 24 such images. The statistics of the Mask R-CNN detection results are shown in Figure 8. As shown in Figure 10, in the (0, 0.01) interval the accuracy of our method was 23%, whereas that of the Mask R-CNN was 10% and that of the Mask Scoring R-CNN was 7%. When the proportion of small targets was in the (0.01, 0.02) interval, the accuracy of our method was 98.48%, while the accuracy of the Mask R-CNN was 95.45% and that of the Mask Scoring R-CNN was 83.33%; the accuracy in the other proportional intervals was 100%. Therefore, the detection accuracy of the proposed method is much higher than that of traditional small-target detection methods.
Qualitative comparison: As shown in Table 7, in all four images the Mask R-CNN missed targets that were detected by the proposed method. From (a)–(d), it can be observed that our method detected all the staff on the airport surface. The detection effectiveness of our method is therefore better, as it accurately and efficiently detected all small targets.
To verify the effectiveness of the proposed method, it was applied to a real airport surface to detect apron personnel. All staff in the figures meet the definition of small targets. As can be seen in
Figure 11, all the staff were detected with high accuracy, which demonstrates that the proposed method achieves the desired detection accuracy for small targets on a real airport apron. Note that the left side presents the original image, and the right side presents the detection result of the proposed method.
5. Conclusions
This study proposes a small-target detection method based on an attention mechanism; compared with the baseline, the proposed method has a higher detection accuracy and performs more detailed segmentation. The method builds on the advanced and representative Mask Scoring R-CNN algorithm. By introducing an attention module, more reasonable feature fusion, and a more effective detection branch, the effectiveness and accuracy of small-target detection at airports were improved significantly. We carried out a series of ablation and comparative experiments on a small-target dataset. The major conclusions are as follows.
- (1)
Considering that there are currently almost no public datasets suitable for apron small-target detection, we produced an apron small-target dataset that meets the definition of small targets at both the relative and absolute scales.
- (2)
Considering that the targets are very small and their features are difficult to extract, an attention module was used to focus effectively on the relevant features and thereby improve detection accuracy. Furthermore, a feature fusion module helped achieve more effective multiscale feature fusion for small targets, and a new classifier head enabled more efficient detection of a single class of small targets while reducing model complexity.
- (3)
Considering the difficulties associated with small-target detection in practical applications, a suitable method is proposed. Experiments show that the proposed method achieved improved detection accuracy compared with the baseline. The accuracy in the scale range of (0, 0.01) was improved by 16%, and in the range of (0.01, 0.02), by 15.15% compared with the Mask Scoring R-CNN.
Although the improved method was initially validated by ablation and comparative experiments with promising performance, there is still much room for further development, which is summarised as follows.
- (1)
The first task will be to further optimise the model and strengthen its practical application, for example, by reducing the number of model parameters to shorten inference time.
- (2)
Since the targets of this study are only pedestrians on the airport surface, the target category is single; thus, the next step will be to expand the dataset to other small-target detection tasks.