The overall architecture of our AFE-RCNN is shown in
Figure 2. We propose three novel modules, i.e., (a) the residual of dual attention proposal generation module (RDA module) (
Figure 2a); (b) multi-scale adaptive feature extraction module based on point clouds (MSAA) (
Figure 2b); (c) refinement loss function module with vertex associativity (VA module) (
Figure 2c). In our AFE-RCNN, the input point clouds are processed by two branches; that is, a voxel-based branch and a point-based branch. In the voxel-based branch, we first voxelize the raw point clouds. Referring to ref. [
17], we divide the raw point clouds into small voxel grids of the same scale in
x,y,z directions. Then, we use the average of the features of all points inside the non-empty voxel as the features of the voxel. Secondly, we obtain the voxel features through a 3D spatially sparse convolutional network [
10]. Next, we project the point clouds to the 2D BEV coordinate by eight times downsampling (i.e., stack the
Z-axis features together to generate the 2D BEV feature maps), and obtain the 2D BEV features through the RDA module. Finally, the RPN generates box proposals based on the anchor box approach and BEV features. In the point-based branch, to summarize the information of the scene, we sample the key points through farthest point sampling (FPS), then the features of key points are obtained by MSAA. Next, as the rich features are beneficial to the box proposal refinement, the features of BEV, raw point clouds, and voxels are fused to build the key point features. To make the foreground points contribute more to refinement, the weights of the key points are adjusted by the weighting module (see ref. [
17]). At last, the box proposal refinement is completed based on the key point features and the key point-to-grid RoI feature abstraction (see ref. [
17]); the grid point is the center point of the voxel grid. We guarantee the correlation among the vertices of the box to optimize the box proposal regression by the advantages of the VA module. Details of the main operations are given in the subsections.
3.1. The Residual of Dual Attention Proposal Generation Module
Our AFE-RCNN firstly divides the raw point clouds into the several uniform scales of voxel grids, and then obtains the 3D voxel features through a 3D spatially sparse convolutional network. Secondly, the proposed network projects the information of 3D features to the 2D BEV coordinate. The network of the BEV feature extraction in ref. [
17] consists of the traditional 2D CNN, which usually processes all features equivalently and cannot reflect the variability requirements in object detection tasks. Moreover, the extracted feature map only aggregates the spatial and channel information in the local receptive field, which will lead to a lack of relevance of global information.
To solve the above problems, we propose the residual of dual attention proposal generation module (RDA module). Inspired by the polarized self-attention block [
30], we construct the attention block by the residual connection to ensure the integrity of information transmission for feature processing, which can process both the spatial and channel dimensions of features with very little calculation increase. The diagram of the proposed RDA module is shown in
Figure 3. The input of the module is used to mainly control the direction of information processing, avoiding the situation in which background points are overly focused. The proposed module can strengthen the correlation between object features in the spatial domain and global feature channel, which can fully excavate semantic information in both spatial dimension and channel dimension.
We first assume that the input features
are independent; the channel attention branch
is defined as Equation (1).
where
and
are all 1
1 convolutional layers,
and
are two vector transformation symbols,
presents matrix dot product operation,
is a sigmoid operator, and
is a SoftMax operator. The output of channel attention branch is:
where
is the channel multiplication operator.
Here, the spatial attention branch
is defined as Equation (3):
where
,
, and
are vector transformation symbols;
is a global pooling operator. The output of the spatial attention branch is shown as Equation (4).
is the spatial multiplication operator.
The output of one residual block
is:
where
is the weighting factor of the attention module. Here, the range of
is 0~1; the best performance can be achieved when
. After a lot of experiments, we found that the average running times before and after adding the RDA module to the RPN stage are ~0.0060 s and ~0.0071 s run on the same processor, respectively, which is a difference of milliseconds. The use of the attention network did not result in a significant decrease in the efficiency of our proposed network.
3.2. Multi-Scale Feature Extraction Module Based on Adaptive Feature Adjustment
In ref. [
17], a certain number of points are sampled as the key points. That sampling approach may cause poor robustness features in the sparse region of the point clouds. To solve this problem, we propose a multi-scale feature extraction module based on adaptive feature adjustment (MSAA) as shown in
Figure 4. Firstly, we sample 2048 points through FPS to summarize the information of the scene. This is followed by two times 4-fold downsampling and points grouping; the sampling layers with different scales that containing 512 and 128 points are obtained, respectively. Inspired by ref. [
25], to extract richer feature information, we process the features for each grouping layer based on adaptive feature adjustment (AFA). The interaction between points can be found through the AFA, which can also integrate the neighboring contextual information into the features of each point. Moreover, the multi-scale point clouds can be constructed by different sampling and grouping layers. Thus, the features of all sampling layers are interpolated to the same feature shape and concatenated together, which can construct multi-scale features of the key points.
In the point cloud feature extraction for each sampling layer, first, the features of each point (
Figure 4a) in the region
Q consist of the feature set
, which can be expressed as
,
,…,
}. In order to focus on the correlation between each point pair, we connect the points in region
Q into a local network, which can learn the interactions between all point pairs. We get the difference map
(
Figure 4b) at the same time. And then,
learns the amount of impact among point features through the impact function to get the impact map
Figure 4c, this is so that the points in region
Q can merge the neighboring contextual information, and the features of each point have better local neighborhood representation ability. The adjusted feature
is defined as Equation (6).
where
is one of the features in the feature set
.
represents the amount of impact, which is affected by each feature in
on
.
is obtained by adaptive learning of the feature modulator in the feature set
. The learning process of the feature modulator
is defined as Equation (7), where
means product operation.
The learning process of the feature modulator is mainly involved in two functions; one is the amount of impact of
on
, which is represented by the function
. The other is the relation function
. The impact function
is calculated by the MLP operation. The function
is defined as Equation (8):
In the relation function
, the amount of impact is calculated by the difference between the two feature vectors
and
in feature set
. Specifically, when
, the amount of impact on
is the feature
itself.
is defined as Equation (9):
For each feature in the local region
Q, the total output of the learned feature is shown as Equation (10). Hence,
incorporates contextual information from the entire region through the above densely connected local network of points.
In different sampling layers, we firstly use interpolation to upsample the point clouds to the same number, and then concatenate these feature vectors as multi-scale features for key points. The interpolation process can be expressed as:
where
is one of the points in the sampling point set
.
is the feature of
;
is the feature obtained by interpolation. We calculate the distance weight
between
and
through the
K-nearest neighbors. As shown in Equation (12), the further away from
, the smaller the contribution on
. Here,
K = 3.
In the point sampling and grouping stage, the feature set of the (
k + 1)-th layer is
, which is represented by Equation (10). To make the features of the (
k + 1)-th layer consistent with the features of the previous layer in the point dimension,
is processed by interpolation to obtain
. Finally, the previous layer
is obtained by concatenating
and
. The feature set of the
k-th layer is:
In addition, to provide rich feature information for box proposal refinement, we convert the BEV features obtained in
Section 3.1 to key point features by bilinear interpolation, then the set abstraction operation in ref. [
22] is used to fuse the features of voxels into key points. After the above operation, the features of BEV, raw point clouds, and voxels are fused as the key point features. Finally, we sample
m grid points in each box proposal. Here,
m = 216. For each grid point, we aggregate the features of neighboring key points to it to help the box proposal refinement.
3.3. Refinement Loss Function Module with Vertex Associativity
The refinement network is mainly used to learn the center point, size, and direction of the detection box. Two MLP layers are used to construct the refinement network, which finally performs the two tasks of confidence prediction and box regression. Generally, for the box regression loss, the vertices are usually regarded as independent of each other. However, the vertices of the bounding box are related to each other to a certain extent. The intersection over union (IoU) [
33] is used to evaluate the effectiveness of the detection boxes. For different detection boxes, there may be a situation where the smooth-L1 loss is similar but the IoU varies greatly. In order to solve this problem, we propose the refinement loss function module with vertex associativity (VA module), as shown in Equation (14). The VA module trains the bounding box as a whole and solves the problem in which the anchor cannot be regressed when the prediction box does not intersect with the ground truth box.
The weight coefficient
of
is referred to the dataset and experiment. According to the statistics of multiple experiments, a better result can be achieved when the
value is in the range 0.1~1. The best performance can be achieved when
= 0.3.
and
are shown as Equations (15) and (16).
is the predicted corner;
is the ground truth corner.
Here, the normalized distance between the center points of the prediction box and the ground truth box is directly minimized to achieve rapid convergence. We choose the height value of the 3D bounding box within a certain angular range on the lidar coordinate. The height values are then normalized as a partial value for the length and width of the bounding box. We project the 3D bounding box to the 2D BEV coordinate through the above approach, so that the DIoU can be calculated by the 2D BEV information. L
DIoU is shown as Equation (17) [
35]:
where
represents the center point of the prediction box;
is the center point of the ground truth box.
represents the function of the Euclidean distance between two center points.
c represents the diagonal distance of the smallest rectangle that can cover the predicted box and the ground truth box at the same time.
For the confidence prediction, the IoU between the ground truth boxes and the box proposals are used as the optimization object.
Here, is the confidence of the optimization object for the k-th box proposals. is the predicted score.
3.4. Training Losses
Our AFE-RCNN is trained end-to-end. According to
Figure 2, the training losses
consist of the region proposal loss
, the segmentation loss
and the refinement loss
.
- 1
The proposal generation network performs the classification and regression of anchor boxes based on the BEV features. The region proposal loss
is:
Among them, the anchor box regression is calculated through
.
is the predicted residual for the anchor box;
is the regression target for the anchor box.
represents the position parameters (
x,
y,
z,
h,
l,
w) and pose parameters (
) of the 3D bounding box.
is the anchor classification loss as shown in Equation (20).
Among them, is the ground truth label; is the predicted input. is the balance factor; is the focusing parameter that is used to adjust the sample weights. Here, we set = 0.25, g = 2.0.
- 2
The key point segmentation loss is used to filter the foreground, where the calculation method is the same as the classification loss .
- 3
The refinement network performs the confidence prediction and regression of the box proposals based on the rich feature information of key points.
is used for confidence prediction, while
in
Section 3.3 is used for box regression. The proposal refinement loss
is:
Among them, is the predicted box residual; is the proposal regression object. , represent the position parameters and pose parameters of the bounding box.
The training loss
can now be expressed as the Equation (22):