1. Introduction
Three-dimensional (3D) environment perception plays an important role in the field of autonomous driving [1]. It analyzes real-time information about the surroundings to ensure traffic safety. To avoid vehicle collisions [2], 3D object detection is an important approach among the techniques of 3D environment perception. Its task is to identify the class and predict the 3D bounding box of a targeted object in a traffic scenario. In a word, 3D object detection performance affects the traffic safety of intelligent driving [3]. As 3D object detection requires spatial information from the environment, light detection and ranging (LiDAR) is a suitable sensor because it can generate a 3D point cloud in real time [4]. Thanks to its ranging accuracy and stability, multi-beam mechanical LiDAR is the mainstream LiDAR sensor for environment perception [1,5]; it is referred to as LiDAR henceforth for simplicity. Due to the limited rotation frequency and beam number, the vertical and horizontal angular resolutions are limited, causing sparsity of the LiDAR point cloud and thus increasing the difficulty of 3D object detection [1].
Current 3D object detection methods exploit deep learning and take LiDAR points as the main input to identify and localize 3D objects [6]. To decrease the negative impact of the sparse LiDAR point cloud, researchers have done much work in several areas, such as (i) detector architecture [7], (ii) supervised loss functions [4], and (iii) data augmentation [8], which have made progress in fully supervised 3D object detection (FSOD-3D). By training with sufficient labeled data, FSOD-3D can achieve 3D environment perception performance close to that of humans.
However, there is a contradiction between the demand for 3D object detection performance and the cost of human annotation on the LiDAR point cloud. Due to the sparsity of the LiDAR point cloud and the occlusion of 3D objects, the annotation cost of 3D objects is high, so labeled datasets are often insufficient. Therefore, it is essential to utilize unlabeled data to train the 3D object detector.
Semi-supervised 3D object detection (SSOD-3D) [9,10,11] has attracted a lot of attention, for it improves the generalization ability of the 3D object detector with both labeled samples and lots of unlabeled samples recorded in various traffic scenarios. From the viewpoint of optimization, SSOD-3D is regarded as a problem that alternately optimizes the weights θ of the 3D object detector and the pseudo labels from the unlabeled dataset. For one unlabeled sample P_i (i.e., a LiDAR point cloud), its pseudo label Y_i consists of the 3D bounding boxes of the targeted objects (i.e., car, pedestrian, cyclist) predicted from P_i. This means that the capacity of θ and the quality of Y_i are coupled. To obtain the optimal θ, it is essential to decrease the false positives (FP) and false negatives (FN) in Y_i. To improve the quality of Y_i, one common approach is to utilize a label filter to remove the incorrect objects in Y_i. Sohn et al. [12] employed a confidence-based filter to remove pseudo labels whose classification confidence score is below a threshold. Wang et al. [11] extended this filter [12] in their SSOD-3D architecture with both the confidence threshold and a 3D intersection-over-union (IoU) threshold. However, in practical applications, the optimal thresholds differ with the detector architecture, the training dataset, and even the object category. It takes a lot of time to search for the optimal thresholds in the label filter, which is inefficient in actual applications. Thus, it is a challenging problem to design a more effective and convenient SSOD-3D method.
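As a concrete illustration, the threshold-based filtering described above can be sketched as follows. This is a minimal sketch, not the cited implementations; the field names and the threshold values are illustrative assumptions.

```python
# Sketch of threshold-based pseudo-label filtering, in the spirit of the
# confidence filter [12] and its confidence + 3D-IoU extension [11].
# Field names and threshold values are illustrative assumptions.

def filter_pseudo_labels(boxes, conf_thresh=0.9, iou_thresh=0.7):
    """Keep predicted 3D boxes whose classification confidence and
    predicted box-quality (IoU) score exceed fixed thresholds."""
    kept = []
    for box in boxes:
        if box["confidence"] < conf_thresh:
            continue  # low classification confidence: likely a false positive
        if box.get("iou_score", 1.0) < iou_thresh:
            continue  # low predicted localization quality
        kept.append(box)
    return kept

preds = [
    {"label": "car", "confidence": 0.95, "iou_score": 0.80},
    {"label": "car", "confidence": 0.40, "iou_score": 0.90},  # filtered out
]
print(len(filter_pseudo_labels(preds)))  # -> 1
```

The drawback motivating this paper is visible here: `conf_thresh` and `iou_thresh` are fixed hyper-parameters that must be re-tuned per detector, dataset, and category.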
In the background of intelligent driving, most self-driving cars are equipped with LiDAR and an optical camera; a sensor system with both is called a LiDAR-camera system. To remedy the sparsity of the LiDAR point cloud, researchers have studied 3D object detection methods on the LiDAR-camera system [13,14,15,16,17]; it provides a dense texture feature from the RGB image, improving the classification accuracy and confidence of the 3D detection result. Thus, it is wise for SSOD-3D to consider the prior knowledge provided by LiDAR-camera systems.
Motivated by this, we present a novel SSOD-3D method on a LiDAR-camera system. First, in order to train a 3D object detector with reliable pseudo labels, we introduce a self-paced semi-supervised learning-based 3D object detection (SPSL-3D) framework. It exploits the theory of self-paced learning (SPL) [18] to adaptively estimate the reliability weight of a pseudo label with its 3D object detection loss. After that, noticing that the prior knowledge in the LiDAR point cloud and RGB image benefits the evaluation of pseudo-label reliability, we propose a prior knowledge-based SPSL-3D (named PSPSL-3D) framework. Experiments are conducted on the autonomous driving dataset KITTI [19]. With different numbers of labeled training samples, both comparison results and ablation studies demonstrate the efficiency of the SPSL-3D and PSPSL-3D frameworks. Therefore, SPSL-3D and PSPSL-3D benefit SSOD-3D on a LiDAR-camera system. The remainder of this paper is organized as follows. First, the related works on FSOD-3D and SSOD-3D are illustrated in Section 2. Next, the proposed SPSL-3D and PSPSL-3D methods are discussed in Section 3. After that, the experimental configuration and results are analyzed in Section 4. Finally, this work is concluded in Section 5.
3. Proposed Semi-Supervised 3D Object Detection
3.1. Problem Statement
SSOD-3D is a training framework that learns the baseline detector from the labeled dataset D^l and the unlabeled dataset D^u. The baseline detector is an arbitrary 3D object detector based on the LiDAR point cloud. Let θ be the parameter set of the baseline detector. SSOD-3D aims to learn θ with higher generalization ability.
Some symbols are defined here. Let (P_i, Y_i, m_i) be the i-th training sample, where m_i = 1 or m_i = 0 means that it is ground truth (GT) labeled or not. P_i is the LiDAR point cloud, where N_i is the number of LiDAR points; it contains the 3D position and reflected intensity of each LiDAR point. Let Y_i = {B_{i,j}} with j = 1, ..., M_i be the 3D object label of P_i, where M_i is the object number. B_{i,j} represents the 3D bounding box of the j-th object using the parameter vector of the 3D bounding box [28]. Y_i is the pseudo or GT label if m_i = 0 or m_i = 1, respectively. Let f(P_i; θ) be the output of the 3D object detector with the input P_i and weight θ; f(P_i; θ) is the pseudo label of P_i.
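To make the notation concrete, a training sample can be represented, for instance, as follows. This is a hypothetical sketch: the field names are our own, and the 7-dimensional box parameterization (center, size, yaw) is only one common convention for the parameter vector of [28].

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TrainingSample:
    """One SSOD-3D training sample: a LiDAR point cloud P_i with a
    (GT or pseudo) 3D object label Y_i and the label flag m_i."""
    points: np.ndarray   # (N_i, 4): x, y, z, reflected intensity
    boxes: np.ndarray    # (M_i, 7): e.g., (x, y, z, l, w, h, yaw) per object
    is_labeled: bool     # True: GT label (m_i = 1); False: pseudo label (m_i = 0)

sample = TrainingSample(
    points=np.random.rand(2048, 4),
    boxes=np.zeros((3, 7)),
    is_labeled=False,    # pseudo-labeled, as for samples from D^u
)
print(sample.points.shape, sample.boxes.shape[0])  # -> (2048, 4) 3
```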
3.2. Previous Semi-Supervised 3D Object Detection
Before illustrating the proposed SPSL-3D, we briefly revisit the previous SSOD-3D approach [
10]. The pipeline of the previous SSOD-3D is presented in
Figure 1a. For the 3D object detector with high generalization ability, its prediction results from the unlabeled sample
and its augmented sample
are both consistent and closed to the GT labels. Based on this analysis, as the unlabeled sample does not have annotation,
was proposed to minimize the difference in labels predicted from
and
.
is the affine transformation on
of
, which contains scaling, X/Y-axis flipping, and Z-axis rotating operations.
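The affine transformation τ described above can be sketched as follows. This is a minimal sketch; the parameter ranges actually used in [10] are not reproduced here.

```python
import numpy as np

def augment(points, scale, flip_x, flip_y, yaw):
    """Apply the affine transformation tau (global scaling, X/Y-axis
    flipping, Z-axis rotation) to an (N, 4) LiDAR point cloud."""
    xyz = points[:, :3] * scale      # global scaling (copies the array)
    if flip_x:
        xyz[:, 1] = -xyz[:, 1]       # flip across the X axis (negate y)
    if flip_y:
        xyz[:, 0] = -xyz[:, 0]       # flip across the Y axis (negate x)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    xyz = xyz @ rot.T                # rotation around the Z axis
    return np.hstack([xyz, points[:, 3:]])  # reflected intensity unchanged

pts = np.random.rand(100, 4)
aug = augment(pts, scale=1.05, flip_x=True, flip_y=False, yaw=np.pi / 18)
print(aug.shape)  # -> (100, 4)
```

The same parameters define the inverse transformation τ^{-1}, which is needed later to map predictions on τ(P_i) back to the coordinate system of P_i.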
The consistency loss is the core of this scheme [10], for this loss can update the weights in the 3D object detector via back-propagation. The current SSOD-3D [10,11,38,41,47] optimizes θ by minimizing the function as:
where the 3D object detection loss of each detected object is the common one [20,28]. It is represented as an M_i-dimensional vector to describe the detection loss of each object. With the inverse affine transformation τ^{-1}, the prediction from τ(P_i) is mapped back to the same reference coordinate system as P_i. Finally, the relation between the previous SSOD-3D and traditional SSL theory is further discussed in Appendix A.1.
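For completeness, a plausible form of this consistency-based objective is sketched below. This is an assumption about the notation, not the exact equation of [10]: f(P; θ) denotes the detector output, L_det the supervised detection loss, L_cons the consistency loss, and τ the augmentation; the weighting between the two sums may differ in the original.

```latex
\min_{\theta}\;
\sum_{(P_i,\, Y_i) \in \mathcal{D}^l}
    \mathcal{L}_{\mathrm{det}}\!\big(f(P_i;\theta),\, Y_i\big)
\;+\;
\sum_{P_i \in \mathcal{D}^u}
    \mathcal{L}_{\mathrm{cons}}\!\big(f(P_i;\theta),\,
    \tau^{-1}\!\big(f(\tau(P_i);\theta)\big)\big)
```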
3.3. Self-Paced Semi-Supervised Learning-Based 3D Object Detection
The main challenge of the consistency loss in Equation (2) is that the quality of the pseudo label is uncertain. As the pseudo label is predicted by the detector itself, if it is noisy or even incorrect, the baseline detector with the optimized parameter set θ tends to detect 3D objects with low localization accuracy. To deal with this problem, we need to evaluate the reliability weight w_i of the pseudo label Y_i, where w_i is a vector whose elements reflect the reliability scores of the objects in Y_i. In the training stage, unreliable pseudo labels are filtered out with w_i. However, determining w_i is a crucial problem.
One naive idea is to adjust the reliability weight w_i with the guidance of the consistency loss of P_i: if the consistency loss of P_i enlarges, the pseudo labels in Y_i are unreliable. Based on this idea, we exploit the theory of SPL [18] to construct the mathematical relation of w_i to the consistency loss, and propose a novel SSOD-3D framework, SPSL-3D, in this paper. Its pipeline is presented in Figure 1b. SPSL-3D optimizes θ by minimizing the function as:
where λ is the age parameter to control the learning pace [48]. Let the current epoch and maximum training epoch be e and E. Furthermore, a self-paced regularization term [48] is introduced for the unlabeled sample:
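The specific regularizer is not reproduced in this text; a common choice in SPL [48], which we sketch here as an assumption because it matches the thresholding behaviour described after Equation (7) below, is the linear soft regularizer over per-object weights w_k in [0, 1]:

```latex
f(\mathbf{w};\lambda) \;=\; \lambda \sum_{k}\left(\tfrac{1}{2}\,w_k^2 - w_k\right),
\qquad w_k \in [0, 1]
```

Under this assumption, minimizing w_k L_k + λ(w_k^2/2 − w_k) over w_k in [0, 1] yields the closed form w_k* = max(0, 1 − L_k/λ): objects with loss above λ receive zero weight, and objects below λ receive a weight that decreases linearly with their loss.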
However, in deep learning, θ contains lots of parameters, so it is difficult to directly optimize Equation (3). As a modern deep neural network (DNN) is trained with batches of data of a fixed size [49], the loss of SPSL-3D is simplified as:
An alternating optimization scheme is used to optimize θ and w_i. With θ fixed, w_i needs to be optimized; the closed-form solution is obtained as Equation (7) by setting the derivative of the objective with respect to w_i to zero. For a vector L, L_k denotes its k-th element. With w_i fixed, θ is optimized in Equation (6) with a deep learning-based optimizer (e.g., Adam or SGD).
An intuitive explanation of Equation (7) is discussed here. For the k-th object in a sample with a pseudo label, if its loss is larger than λ, it is regarded as an unreliable label and cannot be used. If its loss is smaller than λ, SPSL-3D evaluates its reliability score with its consistency loss. SPSL-3D emphasizes the most reliable pseudo labels in the training stage to enhance the robustness of the baseline detector. As epoch e grows, λ increases (see Equation (4)), meaning that SPSL-3D enlarges the set of unlabeled samples used for training, thus improving the generalization ability of the baseline detector. The improvement can be found in Figure 1c.
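The weight update described above can be sketched as follows. This is a sketch under a linear soft-regularizer assumption; the exact forms of Equations (4) and (7) are not reproduced, and the schedule in `age_parameter` is illustrative.

```python
import numpy as np

def update_weights(losses, lam):
    """Self-paced reliability weights under a linear soft-regularizer
    assumption: w_k = 1 - L_k / lambda if L_k < lambda, else 0."""
    w = 1.0 - np.asarray(losses, dtype=float) / lam
    return np.clip(w, 0.0, 1.0)

def age_parameter(lam_start, lam_end, e, E):
    """Illustrative age-parameter schedule: lambda grows with epoch e,
    so more pseudo-labeled objects are admitted as training proceeds."""
    return lam_start + (lam_end - lam_start) * e / E

losses = [0.2, 0.8, 1.5]              # per-object consistency losses
w = update_weights(losses, lam=1.0)
# The object with loss 1.5 > lambda gets weight 0 (unreliable);
# the object with the smallest loss gets the largest weight.
```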
3.4. Improving SPSL-3D with Prior Knowledge
From Equation (7), SPSL-3D can adaptively adjust the reliability weight of a pseudo label using its 3D object detection loss. In fact, the reliability weight of a pseudo label is not only dependent on the consistency loss, but also on the prior knowledge in the LiDAR point cloud and RGB image provided by the LiDAR-camera system. If the LiDAR point cloud or image feature of a predicted object is not salient, its pseudo label is not reliable. Based on this analysis, to further enhance the performance of SPSL-3D with information from the LiDAR point cloud and RGB image, we propose a prior knowledge-based SPSL-3D named PSPSL-3D, which is presented in Figure 2.
In PSPSL-3D, we attempt to represent the reliability of a pseudo label with the LiDAR point cloud and RGB image. For the k-th object in the pseudo label, a prior reliability coefficient c_k is modeled from two terms: the detection difficulty coefficient c_k^d and the label accuracy coefficient c_k^a. The motivation of designing c_k is to constrain the reliability weight w_k with both the 3D detection loss and the prior knowledge from the RGB image and LiDAR point cloud.
Due to the LiDAR mechanism, c_k^d is mainly dependent on the occlusion and resolution of the object in the LiDAR point cloud. However, due to the complex situation of 3D objects in real traffic, it is difficult to model the relationship between the occlusion, the resolution, and the detection difficulty of the object. An approximate solution is provided here. Following the thought in the literature [50], a statistical variable n_k, the LiDAR point number inside the 3D bounding box of the k-th object, is used to describe c_k^d:
where cls_k is the category (i.e., car, pedestrian, cyclist) of the k-th object, and n_min(cls_k) is the minimal threshold of the LiDAR point number for the corresponding category. For the actual implementation, n_min is a statistic computed from the labeled dataset, as discussed in Section 4.2. If the object has higher resolution and less occlusion, n_k is close to or even higher than n_min(cls_k), so that c_k^d is closer to 1, and the 3D detection difficulty of the object is largely decreased.
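The point-count heuristic above can be sketched as follows. This is an illustrative implementation: the capped-ratio form of the coefficient and the per-category thresholds are assumptions (in the paper, the thresholds are statistics gathered from the labeled dataset, as noted in Section 4.2).

```python
import numpy as np

def points_in_axis_aligned_box(points, center, size):
    """Count LiDAR points inside an axis-aligned 3D box. Box rotation is
    omitted for brevity; a full version would first rotate the points
    into the box frame by the box yaw."""
    lo = np.asarray(center) - np.asarray(size) / 2.0
    hi = np.asarray(center) + np.asarray(size) / 2.0
    inside = np.all((points[:, :3] >= lo) & (points[:, :3] <= hi), axis=1)
    return int(inside.sum())

# Illustrative per-category minimal point numbers (assumed values).
N_MIN = {"car": 100, "pedestrian": 20, "cyclist": 30}

def difficulty_coeff(n_points, category):
    """Detection difficulty coefficient: closer to 1 when the object has
    enough points (high resolution, little occlusion); capped at 1."""
    return min(1.0, n_points / N_MIN[category])

print(difficulty_coeff(150, "car"), difficulty_coeff(10, "pedestrian"))  # -> 1.0 0.5
```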
Then, c_k^a is discussed. As the GT is unknown, we attempt to evaluate the annotation accuracy indirectly. On the one hand, in current 3D detection methods, the confidence score q_k of the k-th object is supervised with the 3D IoU of the predicted and GT 3D bounding boxes [21,26], so q_k can be used to describe the annotation accuracy. On the other hand, a semantic segmentation map predicted from the RGB image also contains annotation information. As the RGB image is denser than the LiDAR point cloud, semantic segmentation on the RGB image is more accurate than semantic segmentation on the LiDAR point cloud. Projecting the 3D bounding box of the k-th object onto the image plane generates a 2D bounding box, whose pixel area is A_k. The pixel area of the semantic map of the corresponding category inside this 2D bounding box is A_k^s. If the predicted 3D bounding box is accurate, q_k and the ratio of A_k^s to A_k are close to 1. Due to the occlusion of objects, the pixel area ratio alone is not accurate enough. Thus, the arithmetic mean of q_k and the pixel area ratio is used to describe c_k^a:
From the above discussion, prior knowledge is not directly extracted from the RGB image, for the RGB feature of the targeted object is affected by shadow and blur in complex traffic scenarios. Compared with the RGB image, the semantic segmentation map is more stable in reflecting the location information of a targeted object. Thus, the semantic feature is used to describe c_k^a.
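The arithmetic-mean construction described above can be sketched as follows. This is a minimal sketch; capping the area ratio at 1 is our assumption for robustness when the mask spills slightly outside the projected box.

```python
def label_accuracy_coeff(conf_score, area_box, area_semantic):
    """Label accuracy coefficient: arithmetic mean of the IoU-supervised
    confidence score q_k and the ratio of semantic-mask pixels A_k^s to
    projected-box pixels A_k, both in [0, 1]."""
    ratio = min(1.0, area_semantic / max(area_box, 1e-9))  # guard empty boxes
    return 0.5 * (conf_score + ratio)

# A well-localized box: the semantic mask fills most of the projected box.
print(label_accuracy_coeff(0.9, area_box=1000, area_semantic=850))  # -> 0.875
```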
After obtaining c_k, a scheme is required to constrain w_k with c_k. Referring to the thought in self-paced curriculum learning [51], the interval of w_k can be constrained from [0, 1] to an interval determined by c_k. This means that the interval of w_k is dependent on its prior detection coefficient c_k. To achieve this scheme, the feasible region of w_i in Equation (6) is replaced with the constrained one, and the loss function of PSPSL-3D is presented as:
As in Section 3.3, the alternating optimization scheme is used to find the optimal θ and w_i. The closed-form solution of w_i is shown in Equation (11), obtained by setting the derivative of the objective with respect to w_i to zero. The procedure of PSPSL-3D is summarized in Algorithm 1.
Algorithm 1 Proposed SPSL-3D and PSPSL-3D framework for SSOD-3D.
Inputs: baseline detector, maximum epoch E, labeled and unlabeled datasets D^l and D^u, batch sizes for D^l and D^u;
Parameters: baseline detector weight θ, current epoch e, sample weight w, age parameters of Equation (4);
Output: optimal 3D detector weight θ*
1: Pre-train the baseline 3D detector on D^l to obtain θ
2: Initialize e ← 0
3: while e < E do
4:   for each mini-batch sampled from D^l and D^u do
5:     Compute λ with e and E via Equation (4)
6:     if SPSL-3D is exploited then
7:       Optimize w using θ and λ via Equation (7)
8:       Optimize θ with w via Equation (6)
9:     end if
10:    if PSPSL-3D is exploited then
11:      Compute the prior reliability coefficients via Equations (8) and (9)
12:      Optimize w using θ, λ, and the prior coefficients via Equation (11)
13:      Optimize θ with w via Equation (10)
14:    end if
15:  end for
16:  e ← e + 1
17: end while
18: Return θ as θ*.
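Algorithm 1 can be condensed into the following training-loop sketch. The helper names `detector_loss` and `optimizer_step` are hypothetical stand-ins for detector-specific machinery, and the linear weight update and age-parameter schedule are assumptions rather than the exact Equations (4) and (7).

```python
def train_spsl3d(detector, batches, epochs, lam0, detector_loss, optimizer_step):
    """Alternating optimization following the structure of Algorithm 1:
    closed-form reliability weights with theta fixed, then a gradient
    step on the weighted loss with w fixed."""
    for e in range(epochs):
        lam = lam0 * (1.0 + e / max(epochs, 1))          # age parameter grows per epoch
        for batch in batches:
            losses_u = detector_loss(detector, batch)    # per-object pseudo-label losses
            w = [max(0.0, 1.0 - L / lam) for L in losses_u]  # SPL-style weight update
            optimizer_step(detector, batch, w)           # weighted detector update
    return detector

# Tiny smoke run with stub components.
history = []
train_spsl3d(
    detector={},
    batches=[["obj1", "obj2"]],
    epochs=2,
    lam0=1.0,
    detector_loss=lambda d, b: [0.3, 2.0],
    optimizer_step=lambda d, b, w: history.append(w),
)
print(history[0])  # -> [0.7, 0.0]
```

In the first epoch the object with loss 2.0 exceeds λ and is excluded (weight 0); as λ grows in later epochs, harder pseudo-labeled objects are gradually admitted.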
5. Discussion
The proposed SSOD-3D frameworks, SPSL-3D and PSPSL-3D, have several advantages. First, we consider the reliability of pseudo labels in the SSOD-3D training stage. As the GT annotation of a pseudo label is unknown, we attempt to use the consistency loss to represent the weight of the pseudo label. If a pseudo label is incorrect, its consistency loss is larger than those of other pseudo labels. To reduce the negative effect of pseudo labels with large noise, we need to decrease the reliability weights of such pseudo labels in the training stage. To adaptively and dynamically adjust the reliability weights of all pseudo labels, we exploit the theory of SPL [18] and then propose SPSL-3D as a novel and efficient framework. Second, we utilize multi-modal sensor data in the semi-supervised learning stage, thus further enhancing the capacity of the LiDAR-based baseline 3D object detector. The reason for using multi-modal data is that the LiDAR-camera system is widely deployed in autonomous driving systems, so both the RGB image and the LiDAR point cloud can be used in the training stage of SSOD-3D. For one object, there are generally abundant structural and textural features in the RGB image and LiDAR point cloud. However, as shown in Figure 6, the RGB image suffers from shadow, occlusion, and blur, so it is difficult to extract prior knowledge of an object from the RGB image directly. Compared with RGB images, semantic segmentation images can directly reflect the object category information. Based on this analysis, we used the area of the semantic segmentation of the object to describe the annotation accuracy. However, the semantic segmentation image also has two main limitations: it cannot identify occluded objects, and it struggles with objects that are far from the sensor. Thus, the proposed PSPSL-3D exploits Equation (9) to approximately represent the annotation accuracy of the pseudo label.
Extensive experiments in Section 4 demonstrate the effectiveness of the proposed SPSL-3D and PSPSL-3D. First, compared with the state-of-the-art SSOD-3D methods, the proposed SPSL-3D and PSPSL-3D frameworks achieve better results because they consider the reliability of the pseudo labels, thus decreasing the negative effect of incorrect pseudo labels in the training stage. Second, the proposed frameworks are also suitable for training the baseline 3D object detector in a fully supervised way. Compared with the current FSOD-3D methods, the baseline 3D detector trained with PSPSL-3D outperforms them. This means that a training scheme which weighs the labels is beneficial to baseline 3D object detector training.
In the future, we will study SSOD-3D in several directions. First, computer graphics (CG) techniques can be used to generate a huge number of labeled simulated samples; exploiting the theory of SPL [57] in SSOD-3D with unlabeled real samples and labeled simulated samples is an open problem. Second, in actual applications, the data distribution of the labeled LiDAR point clouds might differ from that of the unlabeled LiDAR point clouds, because the datasets are collected with different types of LiDAR sensors at different places. Utilizing domain adaptation in SSOD-3D is a challenging problem. We will deal with the above problems in subsequent studies.