1. Introduction
Many achievements have been made in small-scale face detection, and they can be divided into two categories. The first is to preprocess the input image. For example, in [1], a scale proposal network was designed to estimate the scale of a face; the input image was then resized and sent to the network. In [2], a generative adversarial network was used to reconstruct small-scale faces. Both methods increase the computational cost. The second is to improve the network's own ability to detect small-scale faces, for example by enhancing the expressive power of its features. Li J et al. [3] adopted a dual shot face detector that merges a feature pyramid network (FPN) [4] and a receptive field block (RFB) [5] to enhance the characteristics of the shared feature map. Although the dual-channel detection method improves detection precision, its detection speed still needs improvement. Chi C et al. [6] proposed a receptive field enhancement module (RFEM) to generate receptive fields with various shapes to capture faces with extreme poses, but the effect of RFEM is insignificant. Considering that image preprocessing is time consuming, we improve the precision of face detection from the network itself.
In recent years, face-detection algorithms based on the faster region-based convolutional neural network (faster R-CNN) [7] have made great progress. However, faster R-CNN uses fully connected layers for classification and detection, which destroys the spatial structure of the features and makes the network insensitive to the position of faces. To address this, Dai et al. [8] proposed region-based fully convolutional networks (R-FCN), adopting a fully convolutional network (FCN) to encode the position information of the target within the region of interest (RoI), thereby resolving the conflict between the translation invariance required for classification and the position sensitivity required for detection in faster R-CNN. R-FCN performs well on the pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC) benchmark, and its detection speed is faster than that of faster R-CNN. Therefore, we adopt R-FCN to detect small-scale faces. The framework of R-FCN is shown in Figure 1a, and the novel framework we propose on top of it is shown in Figure 1b. We redesign the feature extraction network to include the original feature extraction branch of R-FCN and a new feature fusion branch. The new branch fully fuses the local information of the features extracted from the low and middle layers of the original branch with the semantic information of the features extracted from its top layer. We call the improved R-FCN "R-FCN with a feature fusion branch" (f-R-FCN). In addition, we add a receptive field adaptation block (RFAB) on top of f-R-FCN to enhance the discriminability of the features of small-scale faces; the result is called R-FCN with a feature fusion branch and RFAB (RFAB-f-R-FCN). In Figure 1, the region proposal network (RPN) is a fully convolutional neural network (CNN) that generates k boxes of different scales and aspect ratios, called anchors, centered on each pixel of the feature map. After the anchors are classified and regressed, the corresponding regions in the original image are called proposals.
The main contributions of this study are as follows:
- (1) We propose a novel R-FCN framework that adds a feature fusion branch and an RFAB to the original R-FCN.
- (2) A new feature fusion method is proposed to alleviate the low detection rate of the original R-FCN.
- (3) An RFAB is proposed to enhance the expressive power of the features of small-scale faces.
- (4) We improve the anchor setting method and adopt soft non-maximum suppression (SoftNMS) as the selection method for candidate boxes.
2. Related Works
According to the CNN-based face detection process, face detection based on deep learning can be divided into three categories: cascaded CNNs, two-stage algorithms, and single-stage algorithms. Cascade-based methods train multiple cascaded CNNs for face detection, such as multi-task cascaded convolutional networks (MTCNN) [9] and the inside contextual CNN [10]. Both cascade three CNNs: the first quickly generates candidate boxes, the second refines them, and the third makes the final selection. Two-stage algorithms divide detection into two steps: the first generates candidate boxes from the input image, and the second determines whether each candidate box contains a face. For example, Jiang H et al. [11] improved faster R-CNN by incorporating attributes specific to face detection, adopting center loss [12] and online hard example mining (OHEM) [13], but the scale of faces was ignored. In [14], face R-FCN was proposed, with position-sensitive average pooling designed to generate hidden features that enhance discriminability. Zhang C et al. [15] adopted a deformable layer based on light-head R-CNN [16] to reduce the number of channels for face detection. Single-stage face detection algorithms classify and regress the image directly, which greatly improves detection speed. For example, Wang Y et al. [17] proposed real-time face detection based on YOLOv3, in which intersection over union (IoU) is used to cluster the sizes of the initial candidate boxes, guaranteeing both accuracy and speed in complex environments. Zhang S et al. [18] proposed the single shot scale-invariant face detector (S3FD) to address the sharp performance drop of anchor-based detectors as targets become smaller. Wang J et al. [19] applied different anchor-level attention mechanisms to different layers to highlight the face region and mitigate the occlusion problem.
Many face detection applications involve low-constraint scenarios; for example, police need to track criminals through surveillance video, and attendance systems may need to count people by their faces in dense crowds. Faces in low-constraint scenarios are very small, with scales sometimes below 10 × 10 pixels. Small-scale face detection has thus become a major research focus in recent years, and many methods have been proposed. In 2017, Hu P et al. [20] proposed the tiny face detector, which integrates multilayer features and effectively uses context information to detect faces on down-sampled images, but the image pyramid reduces detection speed. Bai Y et al. [21] proposed a multi-scale FCN, training a separate FCN on feature layers of different scales so that each FCN detects faces of the corresponding scale; however, training a detector for every scale is time consuming. Zhu C et al. [22] proposed the expected max overlapping (EMO) score to explain the low intersection ratio between anchors and faces and improved performance by reducing the anchor stride. Zhang F et al. [23] adopted a selective refinement network (SRN) to improve the recall rate, used IoU loss [24] to make the regressed positions accurate, and utilized max-out labels to reduce easy negative samples.
Detection algorithms for small-scale faces are still developing. On the one hand, effectively extracting or enhancing the features of small-scale faces remains an open problem. On the other hand, different faces in the same image often have very different scales, so an excellent face detector must be scale friendly: it must handle faces of various scales while detecting small-scale ones.
3. Selection Method of Candidate Boxes and Anchor Setting of Region-Based Fully Convolutional Networks (R-FCN)
In this section, we change the non-maximum suppression (NMS) of R-FCN to SoftNMS and redesign the anchor settings.
R-FCN is a universal object detector, which has some differences from the face detector. First, the general objects have large scales and different shapes, while most of the faces are rectangular and the ratio of length to width is fixed. Second, face detection is affected by various interference factors, such as illumination, expression, occlusion, and other factors; however, these factors only slightly affect the detection of general objects. Therefore, R-FCN should be modified to make the framework suitable for small-scale face detection.
The original R-FCN uses NMS to filter candidate boxes, but the threshold setting in NMS affects detection precision: if the threshold is too low, true positive samples are suppressed; if it is too high, false positive samples increase. In face detection in low-constraint environments, occlusion between faces is common, and NMS causes missed detections. In this study, SoftNMS [25] is used to improve face detection under occlusion. SoftNMS is defined as follows:

$$ s_i = \begin{cases} s_i, & \mathrm{IoU}(M, t_i) < N_t \\ s_i \, \big(1 - \mathrm{IoU}(M, t_i)\big), & \mathrm{IoU}(M, t_i) \geq N_t \end{cases} \quad (1) $$

where s_i denotes the score of the i-th candidate box; M and t_i are the coordinates of the candidate box with the highest score and of the i-th candidate box, respectively; IoU(M, t_i) is the ratio of the intersection of candidate box i with M to the union of candidate box i with M; and N_t is a preset threshold.
Formula (1) shows that SoftNMS attenuates the score of a candidate box when its IoU with M is greater than the threshold N_t. Candidate boxes far from M are thus unaffected, while those close to M are penalized. NMS, by contrast, sets the score of a candidate box directly to 0 when its IoU with M exceeds N_t. Compared with NMS, SoftNMS therefore retains candidate boxes that would otherwise easily be deleted by mistake.
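To make Formula (1) concrete, the following is a minimal NumPy sketch of linear SoftNMS; the helper names and the score threshold used to discard fully decayed boxes are our own choices, not from the original implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, Nt=0.3, score_thresh=0.001):
    """Linear SoftNMS (Formula (1)): decay, rather than discard, overlapping boxes."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while scores.size > 0:
        m = scores.argmax()                   # M: current highest-scoring box
        M = boxes[m]
        keep.append(M)
        boxes = np.delete(boxes, m, axis=0)
        scores = np.delete(scores, m)
        if scores.size == 0:
            break
        overlap = iou(M, boxes)
        decay = np.where(overlap >= Nt, 1.0 - overlap, 1.0)
        scores = scores * decay               # attenuate instead of zeroing out
        mask = scores > score_thresh          # drop boxes whose score has collapsed
        boxes, scores = boxes[mask], scores[mask]
    return np.array(keep)
```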
In the original R-FCN, the base size of the anchor is 16, following general object detection; the anchor scales are set to (8, 16, 32) according to the scale distribution of the PASCAL VOC dataset, and the aspect ratios are (0.5, 1, 2), so each pixel corresponds to 9 anchors. Face detection, however, involves many small-scale faces, so we modify the anchors according to the specific properties of faces to prevent small-scale faces from being missed because of improper anchor settings. Considering that most faces are rectangles whose height is greater than or equal to their width, we set the anchor aspect ratios to (1, 1.3, 1.5). Wang J et al. [19] emphasized that 80% of the face scales in the Wider Face training set lie between 16 and 406 pixels. Accordingly, we set the anchor base size to 8 and the scales to (1, 2, 4, 8, 16, 32, 64). The modified R-FCN has 21 anchors per position, which cover most of the face samples in Wider Face.
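As an illustration, the sketch below enumerates the 21 anchor shapes from the settings above. The convention for converting an aspect ratio into a width and height (area-preserving, ratio = height/width) is our assumption; the paper does not spell it out.

```python
import numpy as np

def make_anchors(base_size=8, scales=(1, 2, 4, 8, 16, 32, 64), ratios=(1.0, 1.3, 1.5)):
    """21 anchors centered at the origin; ratio = height / width, since faces
    are usually at least as tall as they are wide."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = base_size * s / np.sqrt(r)   # keep anchor area ~ (base_size * scale)^2
            h = base_size * s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (21, 4): sides span roughly 8 to 512 pixels
```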
4. Feature Fusion Branch
R-FCN is ineffective when used directly for small-scale face detection. The main reason is that the position-sensitive score map used for position-sensitive pooling is convolved from the highest-level feature map of the backbone network, whose stride with respect to the original image is 16. When an input face is smaller than 16 × 16, only one pixel is available for detection, which makes classification and regression difficult. Moreover, as the number of convolution layers increases, more semantic information is integrated into the feature map while the proportion of local information about small-scale faces decreases: in a CNN, lower-level features carry more local information, while higher-level features carry rich semantic information. Therefore, we establish a feature fusion branch that uses the context information of the CNN to improve the robustness of small-scale face detection.
FPN constructs a top-down feature pyramid that contains information at various scales, but it still has deficiencies. First, FPN recursively adds high-level features to low-level features, and too much semantic information may damage the details of the low-level features, leading to insufficient detection accuracy for small targets. Second, FPN manually assigns anchors of different scales to distinct layers, which may assign targets to feature layers that are unfavorable for detection. Our feature fusion branch instead adopts a bottom-up fusion method to integrate the information of the low-level layers into the high-level features. The scale of each feature in the branch is the same as that of the corresponding feature in the backbone, as shown in Figure 2.
As in FPN, C_i denotes the last feature map of each residual group of ResNet, with i = 2, 3, 4, 5. F_i is the feature obtained by fusing C_i and F_{i-1}; to start the iteration, C_2 is taken directly as the fused feature F_2. The features from lower layers have a large scale and few channels, so a 3 × 3 convolution is used to reduce the feature scale before element-wise addition with the upper-layer features. The feature fusion operation can be expressed by Formula (2):

$$ F_i = \begin{cases} C_i, & i = 2 \\ L(C_i) \oplus \mathrm{Conv}(F_{i-1}), & i = 3, 4, 5 \end{cases} \quad (2) $$

where Conv represents the 3 × 3 convolution operation, L denotes L2 normalization, and ⊕ denotes element-wise addition.
Feature maps from different layers have distinct properties in terms of the number of channels, the scale of their values, and the norm of their pixels. The norm of shallow-layer features is generally large, while that of deep-layer features is usually small; if the two features were simply added element-wise, the shallow features would dominate the deep ones. Therefore, the L2 normalization proposed by ParseNet [26] is introduced to normalize the feature pixels before fusion:

$$ \| x \|_2 = \Big( \sum_{i=1}^{d} |x_i|^2 \Big)^{1/2} \quad (3) $$

$$ \hat{x} = \frac{x}{\| x \|_2} \quad (4) $$

where x is the feature before normalization, d is the number of channels, x̂ is the normalized feature, and |x_i| is the absolute value of x_i. Normalization changes the values of the feature and increases the difficulty of training, so the normalized pixel values should be rescaled along the channel dimension using Formula (5):

$$ y_i = \gamma_i \, \hat{x}_i \quad (5) $$

where y_i is the value of the feature after scaling and γ_i is a scale factor learned during the training stage.
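The following PyTorch sketch implements Formulas (3)-(5) as a layer; the initial value of the learnable scale is our assumption (a common choice in ParseNet-style implementations), not a value given by the paper.

```python
import torch
import torch.nn as nn

class L2Norm(nn.Module):
    """Channel-wise L2 normalization with a learnable per-channel scale (Formulas (3)-(5))."""
    def __init__(self, channels, init_scale=10.0):
        super().__init__()
        # gamma is learned during training; init_scale is an assumed starting value
        self.gamma = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x):                                 # x: (N, C, H, W)
        norm = x.pow(2).sum(dim=1, keepdim=True).sqrt() + 1e-10  # Formula (3)
        x_hat = x / norm                                  # Formula (4)
        return self.gamma.view(1, -1, 1, 1) * x_hat       # Formula (5)
```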
In summary, we add a feature fusion branch to the feature extraction network of R-FCN and perform face detection on F5. The structure of the feature fusion branch in f-R-FCN is shown in Figure 3.
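As a concrete illustration of the branch in Figure 3, here is a minimal PyTorch sketch that builds F5 from C2-C5. It assumes ResNet50 channel widths (256, 512, 1024, 2048), a stride-2 3 × 3 convolution for the downscaling in Formula (2), and the L2Norm module sketched above; whether normalization is applied to one or both operands is our reading of Formula (2), so treat this as a sketch rather than the paper's exact implementation.

```python
import torch.nn as nn

class FusionBranch(nn.Module):
    """Bottom-up fusion: F2 = C2; Fi = L2Norm(Ci) + Conv3x3(F_{i-1}) with stride 2,
    so each fused map matches the scale of the next residual stage."""
    def __init__(self, channels=(256, 512, 1024, 2048)):  # C2..C5 of ResNet50
        super().__init__()
        self.norms = nn.ModuleList(L2Norm(c) for c in channels[1:])
        self.downs = nn.ModuleList(
            nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1)
            for cin, cout in zip(channels[:-1], channels[1:])
        )

    def forward(self, feats):              # feats = [C2, C3, C4, C5]
        f = feats[0]                       # F2 = C2
        for c, norm, down in zip(feats[1:], self.norms, self.downs):
            f = norm(c) + down(f)          # Formula (2)
        return f                           # F5, the shared map used for detection
```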
5. Receptive Field Adaptation Block (RFAB)
A traditional CNN uses fixed-size convolution kernels for feature extraction, so the receptive field of each layer of neurons is fixed, which may harm the discriminability of the features. In the human visual cortex, the size of the receptive field is affected by many factors; for example, the size of the population receptive field (pRF) is a function of the eccentricity of retinal imaging, and the receptive field grows as the eccentricity increases [27]. Intuitively, the brain obtains more information from the area close to the visual center and less from the area far from it. RFB was proposed on the basis of [27]; it highlights the importance of the sampling-center area, improves the insensitivity of CNNs to small spatial changes, and achieves good results in object detection.
However, in the human visual system, the size of the receptive field is related not only to the eccentricity of retinal imaging but also to the stimulation of the visual nerves; that is, when the stimulation of a neuron differs, its receptive field size is not fixed. RFB uses an inception structure and dilated convolutions to simulate the receptive field mechanism of human retinal imaging, and linear superposition to integrate features with different receptive fields in the spatial dimension. Although the dilated convolutions increase the weight of features closer to the sampling center and thus enhance the discriminability of the features, the effect of neuronal stimulation on the receptive field is ignored. Therefore, we add an RFAB on top of f-R-FCN to enhance the features of small-scale faces by combining RFB with the selective kernel (SK) module [28]; in this way, the influence of both eccentricity and neuronal stimulation on the receptive field is considered. The structure of RFAB is shown in Figure 4.
The front part of RFAB is the same as that of RFB: the input is split into three branches by convolutions of different sizes, so each branch has a different receptive field, and two 3 × 3 convolutions replace the 5 × 5 convolution to reduce computation. "rate" in Figure 4 denotes the dilation rate of the dilated convolutions. The features of the three branches are then fused; after global average pooling, a 1 × 1 convolution, and a softmax operation, a probability value is obtained for each channel, and the network selects the receptive field size for each channel through an element-wise product. RFAB also uses a shortcut to retain the original features, which are added to the processed features to obtain the final output. RFAB is applied to the last layer of the feature fusion branch to improve the discriminability of the shared feature map.
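The following is a minimal PyTorch sketch of an RFAB-style block. The three-branch layout (a 1 × 1 path, a 3 × 3 path, and two 3 × 3 convolutions in place of a 5 × 5, each ending in a dilated 3 × 3) follows the description above, but the dilation rates, the four-fold channel reduction, and the final 1 × 1 projection are our assumptions in the spirit of RFB and SK, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFAB(nn.Module):
    """RFB-style multi-branch receptive fields with SK-style channel-wise selection."""
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        mid = channels // 4                                  # assumed reduction
        self.branch1 = nn.Sequential(                        # 1x1 path
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=rates[0], dilation=rates[0]))
        self.branch2 = nn.Sequential(                        # 3x3 path
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, mid, 3, padding=rates[1], dilation=rates[1]))
        self.branch3 = nn.Sequential(                        # two 3x3 convs ~ one 5x5
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, mid, 3, padding=rates[2], dilation=rates[2]))
        self.select = nn.Conv2d(mid, 3 * mid, 1)             # per-channel branch logits
        self.project = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b = torch.stack([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        u = b.sum(dim=1)                                     # fuse the three branches
        s = F.adaptive_avg_pool2d(u, 1)                      # global average pooling
        a = self.select(s).view(x.size(0), 3, -1, 1, 1)      # 1x1 conv -> logits
        a = F.softmax(a, dim=1)                              # softmax over the branches
        v = (a * b).sum(dim=1)                               # element-product selection
        return x + self.project(v)                           # shortcut keeps the input
```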
The structure of RFAB-f-R-FCN is shown in Figure 5. Compared with Figure 3, C5 is removed, and the feature fusion branch is constructed from C2, C3, and C4, because we find that the average precision (AP) of detection on F4 is higher than that on F5; see Section 6.2.1 for details.
6. Experiments
The datasets are Wider Face [29] and the face detection dataset and benchmark (FDDB) [30]. Wider Face contains 32,203 images with 393,703 faces, a large proportion of which are small-scale. FDDB contains 2845 images and 5171 faces and is usually used to evaluate model performance. We train the model on the Wider Face training set, validate it on the Wider Face validation set, and also evaluate our method on FDDB.
The backbone network used in the experiments is ResNet50 pretrained on ImageNet, and stochastic gradient descent (SGD) is used to update the parameters during training. We set the training hyperparameters according to [8]: the weight decay is 0.0005, the momentum is 0.9, and the initial learning rate is 0.001. The shortest side of the input image is 600, and the longest side is 1000. The network is trained for 80,000 iterations, and the learning rate is reduced to 0.0001 after 60,000 iterations. We adopt the multitask loss function of R-FCN, shown in Formula (6):

$$ L(p, u, t^u, v) = \frac{1}{N_{cls}} L_{cls}(p, u) + \lambda \, \frac{1}{N_{reg}} \, [u = 1] \, L_{reg}(t^u, v) \quad (6) $$

where N_cls is the number of classified samples and N_reg is the total number of candidate boxes; L_cls and L_reg represent the classification loss and the smooth L1 regression loss, respectively; p denotes the classification score; u ∈ {0, 1} is the true label, with u = 0 denoting background and u = 1 denoting a face; t^u denotes the coordinates of the prediction box; v is the true coordinates of the face; and λ is a balancing factor, usually set to 1.
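A minimal PyTorch sketch of Formula (6) follows, assuming p holds raw class logits and that the Iverson bracket restricts the regression term to positive (face) samples; the function signature is ours.

```python
import torch
import torch.nn.functional as F

def multitask_loss(p, u, t_u, v, N_cls, N_reg, lam=1.0):
    """Formula (6): cross-entropy classification plus smooth-L1 regression,
    with the regression term counted only for positive (u = 1) samples."""
    loss_cls = F.cross_entropy(p, u, reduction='sum') / N_cls
    pos = (u == 1)                               # Iverson bracket [u = 1]
    loss_reg = F.smooth_l1_loss(t_u[pos], v[pos], reduction='sum') / N_reg
    return loss_cls + lam * loss_reg
```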
6.1. Results of Modified R-FCN
6.1.1. Comparison between Non-Maximum Suppression (NMS) and Soft Non-Maximum Suppression (SoftNMS)
Table 1 shows the AP on the three validation subsets of Wider Face when R-FCN adopts NMS and SoftNMS. After changing NMS to SoftNMS, AP increases by 1%, 3.3%, and 4.1% on the three subsets, respectively, which verifies the effectiveness of SoftNMS. Notably, the improvement on the Easy subset is modest: SoftNMS targets missed detections caused by occlusion, while the Easy subset is of low difficulty and occlusion is relatively rare there, so SoftNMS performs about the same as NMS. The Medium and Hard subsets contain more occlusions and small-scale faces, so the AP improvement from adopting SoftNMS is more obvious.
6.1.2. Comparison of Different Anchor Settings
Table 2 shows the AP on the three subsets of Wider Face with different anchor settings. As shown in the table, the anchor scale setting strongly influences AP. When the anchor scale range is sufficiently large, AP on the Hard subset improves significantly; however, as AP on the Hard subset increases, AP on the Easy subset decreases slightly. To balance the Easy-subset decrease against the Hard-subset increase as the anchor scales grow, we finally set the anchors according to the parameters in the last row of Table 2 and take this result as the baseline for subsequent experiments. Compared with R-FCN, AP decreases by 2.5% on the Easy subset and 3.1% on the Medium subset, but increases by 11.8% on the Hard subset.
6.2. Results of Feature Fusion Branch
6.2.1. Comparison of Different Fusion Feature Layers
We experimentally compare predictions from different fusion layers. The first scheme attaches the RPN at the F4 layer, constructs the position-sensitive score map at F5, and performs the classification and regression operations there. The second scheme discards the fifth residual group of ResNet, uses only C2, C3, and C4 to construct the fusion branch, attaches the RPN at the final fusion layer F4, and constructs the position-sensitive score maps for detection there, which means that in the second scheme the RPN shares F4 with the detection network. The comparison between the two schemes and R-FCN is shown in Table 3.
Table 3 shows that, regardless of whether f-R-FCN predicts on F4 or F5, AP on the three subsets of Wider Face improves over R-FCN to different degrees, and the AP of small-scale face detection on the F4 layer is higher than that on F5. The reason is that the RPN and the classification network share the fused feature map at F4, as shown in Figures 3 and 5. Compared with R-FCN, AP improves by 0.9% on the Easy subset, 1.8% on the Medium subset, and 7.5% on the Hard subset.
6.2.2. Comparison with R-FCN
Figure 6 shows the precision–recall (PR) curves of f-R-FCN on Wider Face, where f-R-FCN is labeled fusion R-FCN. At the same recall, the precision of f-R-FCN on all three subsets is higher than that of R-FCN, and its PR curve lies further to the right and is steeper, which verifies that f-R-FCN performs better than R-FCN.
Figure 7 intuitively shows the detection results of R-FCN and f-R-FCN on small-scale faces in the Wider Face dataset. The number of faces detected by f-R-FCN in the same picture is significantly higher than that detected by R-FCN, which verifies that f-R-FCN is more effective for small-scale face detection.
6.3. Effects of RFAB
In this experiment, multi-scale training is adopted, and the input image is rescaled to {600, 1200}. In the training stage, the original R-FCN selected 6000 anchors and kept 300 after NMS; in our experiments, because resetting the anchors increases their number, the 10,000 highest-scoring anchors are selected and 1000 are kept after SoftNMS. In the test stage, the 2000 highest-scoring anchors are selected, and 600 RoIs are retained after SoftNMS. In addition, the OHEM strategy is adopted during training, and, considering the characteristics of small-scale faces, the size of position-sensitive pooling is changed from 7 × 7 to 5 × 5. To evaluate the proposed method comprehensively, PR curves of various methods are compared on Wider Face, and the discrete receiver operating characteristic (discROC) and continuous receiver operating characteristic (contROC) curves of our method and other methods are drawn on FDDB.
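The pooling change is easy to state in code. Below is a toy sketch using torchvision's PSRoIPool as a stand-in for R-FCN's position-sensitive pooling; the tensor sizes are illustrative values, not the paper's.

```python
import torch
from torchvision.ops import PSRoIPool

# Position-sensitive RoI pooling with a 5 x 5 grid instead of R-FCN's default 7 x 7.
# The score map needs k*k*(C+1) channels; here k = 5 and C + 1 = 2 (face / background).
k, classes = 5, 2
ps_pool = PSRoIPool(output_size=k, spatial_scale=1.0 / 16)   # stride-16 feature map

score_map = torch.randn(1, k * k * classes, 38, 63)          # toy score map
rois = torch.tensor([[0, 16.0, 16.0, 160.0, 160.0]])         # (batch_idx, x1, y1, x2, y2)
scores = ps_pool(score_map, rois)
print(scores.shape)                                          # torch.Size([1, 2, 5, 5])
```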
6.3.1. Comparison with and without RFAB
Table 4 shows the AP on the three subsets of Wider Face for R-FCN, f-R-FCN, and RFAB-f-R-FCN. Under the same conditions, the AP of RFAB-f-R-FCN is 0.1% lower than that of f-R-FCN on the Easy subset, 1.1% higher on the Medium subset, and 3.5% higher on the Hard subset, and it is 0.8%, 2.9%, and 11% higher than that of R-FCN on the three subsets, respectively. The reason is that RFAB enhances the discriminability of the F4 layer, which shows the effectiveness of RFAB. In Table 4, "F4 + RFAB" denotes R-FCN with both the feature fusion branch and the RFAB module, using the fused layer F4 for detection.
Figure 8 shows the results of the proposed method and R-FCN on the Wider Face dataset; the green boxes are the detections of R-FCN, and the red boxes are those of the proposed method. R-FCN misses many detections when small-scale faces are dense, while the proposed method detects small-scale faces effectively, which shows the effectiveness of the proposed method for small-scale face detection.
6.3.2. Comparison of AP between the Method Proposed and the Classical Methods
Table 5 shows the AP of the proposed method and several typical methods on the Wider Face validation set; the AP values of the comparison methods are taken from the official website of the Wider Face dataset. The AP of the proposed method is higher than that of the typical methods, and although the Hard subset of Wider Face contains many small-scale faces, the proposed method still outperforms the comparison methods on it. Therefore, the proposed method is more robust to small-scale faces than the other methods.
6.3.3. Comparison of Precision–Recall (PR) Curves between the Proposed Method and the Classical Methods
Figure 9 shows the PR curves of the proposed method and other classical methods on Wider Face. As shown in the figure, the PR curve of the proposed method is closer to the right side than that of other methods. Under the same recall rate, the precision of the proposed method is the highest, which shows that the proposed method is superior to the classical methods of comparison. The reason is that the proposed method adds the feature fusion branch and RFAB to improve feature discrimination.
6.3.4. Comparison of the Proposed Method and Classical Methods on Face Detection Dataset and Benchmark (FDDB)
Figure 10 shows the receiver operating characteristic (ROC) curves of the proposed method on the FDDB dataset. FDDB has two evaluation criteria: the discROC curve and the contROC curve. For discROC, a prediction box is judged a true positive if its IoU with the ground truth is greater than 0.5, while contROC weights each match by its IoU. The proposed method uses the unrestricted training protocol: it is first trained on the Wider Face dataset and then tested on FDDB. Because the detector marks faces with rectangles while FDDB annotates them with ellipses, the true positive rate of the proposed method under contROC is lower than that under discROC.
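The difference between the two criteria reduces to how a matched detection is credited; a minimal sketch of the two scoring rules, with our own function names:

```python
def disc_tp(iou: float) -> float:
    """discROC: a detection counts as one true positive when IoU > 0.5."""
    return 1.0 if iou > 0.5 else 0.0

def cont_tp(iou: float) -> float:
    """contROC: the credit for a matched detection is weighted by its IoU."""
    return float(iou)
```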
Figure 10 shows that, with discROC as the evaluation criterion, the proposed method performs better than multitask cascade CNN [9] and LDCF+ [31] but slightly worse than ScaleFace [32]. This is because the proposed method maps all RoIs to a feature layer of the same depth, while ScaleFace uses different networks to detect RoIs of specific scales. With contROC as the criterion, the proposed method performs better than ScaleFace and LDCF+ but worse than multitask cascade CNN, because multitask cascade CNN uses three cascaded networks to continuously fine-tune the position of the prediction box, so its final predicted positions are more precise.
In conclusion, the results of different methods vary with the evaluation standard. Compared with Wider Face, FDDB contains few small-scale face samples. Table 5, Figure 9, and Figure 10 show that our method has better comprehensive performance for small-scale face detection than the other methods.
6.4. Inference Time
We trained R-FCN, f-R-FCN, and RFAB-f-R-FCN on a single GeForce GTX 1080 graphics processing unit (GPU); their inference times are shown in Table 6. Although the detection time increases after adding the feature fusion branch and RFAB, it still meets real-time detection requirements.
7. Conclusions
Face detection in real life, such as detecting small-scale and occluded faces in extreme scenes, remains very challenging, and many small-scale face detection algorithms suffer from either low precision or long detection time. In this study, small-scale face detection based on R-FCN is explored. First, we propose a novel R-FCN framework that adds a feature fusion module and an RFAB to R-FCN to address small-scale face detection. Second, a bottom-up feature fusion method is proposed to enrich the local information of the high-layer features. Third, RFAB is proposed, enabling the network to adaptively select its receptive field, enhancing the expressive power of face features, and improving the detection rate of small-scale faces. Furthermore, we improve the anchor setting method and adopt SoftNMS as the selection method for candidate boxes. The experimental results show that the proposed method has better comprehensive performance for small-scale face detection than other methods.