Article

Advanced Point Cloud Techniques for Improved 3D Object Detection: A Study on DBSCAN, Attention, and Downsampling

Wenqiang Zhang, Xiang Dong, Jingjing Cheng and Shuo Wang

1 School of Electrical Engineering and Automation, Anhui University, Hefei 230601, China
2 State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2024, 15(11), 527; https://doi.org/10.3390/wevj15110527
Submission received: 16 October 2024 / Revised: 10 November 2024 / Accepted: 13 November 2024 / Published: 15 November 2024
(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)

Abstract

To address the challenges of limited detection precision and insufficient segmentation of small to medium-sized objects in dynamic and complex scenarios, such as the dense intermingling of pedestrians, vehicles, and various obstacles in urban environments, we propose an enhanced methodology. Firstly, we integrated a point cloud processing module utilizing the DBSCAN clustering algorithm to effectively segment and extract critical features from the point cloud data. Secondly, we introduced a fusion attention mechanism that significantly improves the network’s capability to capture both global and local features, thereby enhancing object detection performance in complex environments. Finally, we incorporated a CSPNet downsampling module, which substantially boosts the network’s overall performance and processing speed while reducing computational costs through advanced feature map segmentation and fusion techniques. The proposed method was evaluated using the KITTI dataset. Under moderate difficulty, the BEV mAP for detecting cars, pedestrians, and cyclists achieved 87.74%, 55.07%, and 67.78%, reflecting improvements of 1.64%, 5.84%, and 5.53% over PointPillars. For 3D mAP, the detection accuracy for cars, pedestrians, and cyclists reached 77.90%, 49.22%, and 62.10%, with improvements of 2.91%, 5.69%, and 3.03% compared to PointPillars.

1. Introduction

Three-dimensional object detection is crucial for understanding 3D scenes, especially in autonomous driving technology, where it plays a central role in route planning and ensuring safe obstacle avoidance. By analyzing data in three-dimensional space, 3D object detection accurately identifies and locates objects such as vehicles, pedestrians, and bicycles, thus enabling a comprehensive perception of complex environments [1]. Current 3D object detection networks are typically categorized based on their input type into three categories: image-based 3D object detection networks, which detect objects by extracting depth information from 2D images [2]; point cloud-based 3D object detection networks, which process data from sensors like LiDAR to provide detailed spatial information [3]; and multimodal 3D object detection networks, which combine data from various sensors, such as images and point clouds, to enhance detection precision and reliability [4].
This paper focuses on point cloud-based 3D object detection, with the goal of accurately identifying and locating various entities, such as vehicles, pedestrians, and bicycles, using 3D point cloud data [5]. Unlike conventional image-based detection methods, 3D object detection from point clouds involves handling a collection of unordered points. These point cloud data, typically obtained from sensors like LiDAR (Light Detection and Ranging), provide detailed depth information and spatial structure. Unlike the two-dimensional nature of image data, point clouds are distributed in three-dimensional space and contain information about the spatial location, volume, and orientation of objects. Given the unordered and sparse characteristics of point clouds, a significant challenge is efficiently extracting crucial details such as the position, size, and orientation of objects. This necessitates algorithms capable of processing large volumes of point data while accurately capturing and interpreting the spatial relationships and features of the targets [6].
Point-based methods, exemplified by PointNet, handle raw point clouds directly by treating them as unordered point sets [7]. Each point is processed independently, and the features are aggregated to form a comprehensive representation of the entire point cloud. The enhanced PointNet++, which improves the handling of local point cloud data, incorporates a hierarchical framework and region-focused sampling techniques to boost its ability to capture detailed features [8].
Voxel-based approaches first convert point cloud data into a 3D voxel grid, followed by feature extraction and processing for each voxel. By quantizing continuous point cloud data into a fixed grid structure, these methods streamline feature extraction and processing. However, this approach introduces a certain level of computational complexity.
VoxelNet is one of the earliest voxel-based methods proposed [9]. It transforms point cloud data into uniform 3D grids through voxelization and applies a 3D convolutional neural network for feature extraction and classification. Although this method achieves high detection accuracy, it incurs substantial computational overhead.
SECOND (Sparsely Embedded Convolutional Detection) is an advanced voxel-based method [10] that effectively reduces computational effort and improves processing speed through the introduction of sparse convolutional operations. This approach performs particularly well with high-resolution point cloud data. The SECOND method significantly enhances processing efficiency while maintaining high detection accuracy, making large-scale 3D target detection more feasible.
CenterPoint further advances the voxel-based approach by introducing a center point detection strategy, which enables more effective target localization and identification [11]. This method not only improves detection accuracy but also reduces computational resource requirements, making it more efficient and practical for large-scale applications.
The column-based method is a specialized voxel approach that simplifies voxelization by neglecting the z-dimension, thereby reducing computational complexity and transforming the three-dimensional problem into a two-dimensional one [12]. This method segments the point cloud data into multiple columnar regions and extracts features from each region.
PointPillars is an advanced point cloud target detection method that utilizes point-by-point aggregated features to construct columnar feature representations. Compared to traditional point-based methods, PointPillars offers a more efficient feature extraction approach. As research on this method progresses, several improvement strategies have emerged. Reference [13] presents a channel attention mechanism designed to address the issue of overlooking global features in pillar feature networks. This approach enables the model to capture global context more effectively by adaptively adjusting the significance of various channels. Additionally, Reference [14] introduces a two-stage reinforcement method for pillar-to-pillar relationships, which enhances the network’s ability to model connections between local features by improving interrelationships among different pillars. This approach has been demonstrated to improve the model’s performance in detecting objects within complex scenes.
However, despite these advancements significantly enhancing the performance of PointPillars, the model continues to face challenges with under-segmentation when objects overlap. When handling point cloud data with an excessively large number of points, the model’s accuracy in detection and localization can deteriorate significantly, leading to severe cases of missed or incorrectly detected targets, particularly in dense environments.
The novelty of this paper lies in addressing the under-segmentation and object overlap issues in the PointPillars method. We propose an enhanced approach by integrating DBSCAN clustering, attention mechanisms, and CSPNet modules to improve detection accuracy; specifically, as follows:
  • The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm is applied to point cloud processing. By utilizing its density-based clustering approach, DBSCAN effectively detects and removes outliers and noise within the point clouds, thereby improving data quality and enhancing the accuracy of subsequent analyses.
  • In the pillar feature network layer, a point-wise self-attention mechanism [15] and a channel attention mechanism are combined to further reduce noise within the point cloud pillars, emphasize crucial feature information, and improve the feature extraction capability of the point cloud.
  • In the downsampling module, CSPNet, which facilitates gradient flow splitting, is used to replace the conventional convolutional blocks in the original module [16]. This modification enables gradient flow to propagate through different network paths, effectively reducing computational complexity and enhancing the network’s detection performance.
The aim of this research is to address the under-segmentation and object overlap issues in the PointPillars method by proposing an enhanced approach that incorporates DBSCAN clustering, attention mechanisms, and CSPNet modules to improve detection performance. The structure of the paper is as follows: Section 1 introduces the background and novelty of the research; Section 2 reviews the related work; Section 3 presents the experimental results and analysis; and Section 4 concludes the paper and discusses future research directions.

2. Related Work

2.1. PointPillars Network Analysis

Traditional PointPillars networks initially convert 3D point cloud data into consistent 2D feature maps by segmenting them into uniform 2D grid cells (pillars), extracting features from the points within each cell, and encoding these features into fixed-length vectors. The resulting feature map is then fed into a multilayer perceptron (MLP) for feature mapping and aggregation. Subsequently, global and local spatial features are extracted through a series of 2D convolutional layers. The detection head then performs bounding box regression and classification on these features, and redundant detection boxes are eliminated using non-maximal suppression (NMS) [17]. Finally, the detection results undergo coordinate transformation and confidence filtering to produce the final target detection output, which includes category, position, size, and orientation. The algorithm’s workflow is illustrated in Figure 1.
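To make the pseudo-image construction concrete, the scatter step that places encoded pillar features onto the BEV grid can be sketched as follows; the grid size, channel count, and function name are illustrative assumptions rather than the exact PointPillars implementation.

```python
import torch

def scatter_to_pseudo_image(pillar_feats, coords, nx=432, ny=496):
    """Scatter encoded pillar features onto a dense BEV grid (pseudo-image).

    pillar_feats: (P, C) encoded feature vector of each non-empty pillar.
    coords:       (P, 2) integer (x_idx, y_idx) grid position of each pillar.
    The nx x ny grid size is an illustrative KITTI-like choice, not the paper's setting.
    """
    C = pillar_feats.shape[1]
    canvas = pillar_feats.new_zeros(C, ny * nx)
    flat_idx = coords[:, 1] * nx + coords[:, 0]   # row-major flattening of grid indices
    canvas[:, flat_idx] = pillar_feats.t()        # place each pillar's feature vector
    return canvas.view(C, ny, nx)                 # (C, H, W) pseudo-image

# Example: 1,000 non-empty pillars with 64-channel features.
feats = torch.randn(1000, 64)
coords = torch.stack([torch.randint(0, 432, (1000,)), torch.randint(0, 496, (1000,))], dim=1)
print(scatter_to_pseudo_image(feats, coords).shape)  # torch.Size([64, 496, 432])
```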
The traditional PointPillars network has limitations in processing point cloud data. Firstly, during the pillar encoding process, the fixed-grid limitation causes some degree of information loss, especially in high-density point clouds, which prevents the comprehensive capture of all details. Additionally, the original network fails to fully capture the local geometric details of the point cloud’s spatial arrangement. The feature extraction approach is relatively basic and does not fully leverage self-attention and spatial attention mechanisms to enhance feature representation [18], leading to limited performance in complex scenes. Moreover, the use of conventional convolutional blocks in the network backbone results in inefficiencies, high computational complexity, and an inability to maintain high computational performance while ensuring accuracy.

2.2. Point Cloud Processing Based on DBSCAN Clustering Algorithm

The PointPillars network processes point clouds by discarding points outside the region of interest (ROI), then shuffling and creating a random sequence from the remaining data [19], which are subsequently subjected to random sampling. Since point clouds closer to the sensor are typically denser than those farther away, random sampling can lead to the equal probability selection of both near and far points, potentially resulting in the loss of critical information from distant point clouds [20]. To address this issue, this paper introduces a point cloud data processing approach using the DBSCAN clustering algorithm [21], which involves the following steps:
  • Initial Selection: randomly select an unvisited point and mark it as visited.
  • Find Neighborhood Points: identify all points within a specified radius (eps) around the selected point; these points are termed neighborhood points.
  • Determine Core Points: If the number of neighboring points is greater than or equal to minPts, the point is designated as a core point, and a new cluster is formed that includes all the neighboring points. Conversely, if the number of neighboring points is fewer than minPts, the point is classified as a noise point.
  • Expand Cluster: For each neighborhood point of the core point, repeat the steps of finding neighborhood points and determining core points, and add the new neighborhood points to the current cluster. If a neighborhood point is also a core point, add its neighborhood points to the current cluster according to the density reachability principle.
  • Repeat the Process: continue with any remaining unvisited points, repeating the above steps until all points have been processed.
By integrating DBSCAN clustering into the network, we improve segmentation and feature extraction by ensuring that each pillar contains points closely related in terms of spatial proximity and object membership. This enhances feature extraction and results in more accurate object representations. Additionally, DBSCAN helps reduce noise by removing outliers, improving the signal-to-noise ratio and allowing the network to focus on relevant features. Unlike the fixed-size pillar grid in traditional PointPillars, DBSCAN dynamically adjusts to varying point densities, enabling the model to handle both sparse and dense environments effectively. This results in a significant improvement in object recognition accuracy, particularly in complex scenarios with overlapping or occluded objects, where traditional methods struggle.
The effect is shown in Figure 2.
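As an illustration of this preprocessing step, the following is a minimal sketch of DBSCAN-based outlier removal applied to a raw LiDAR point cloud before pillarization, using scikit-learn; the eps and min_samples values, array layout, and synthetic data are illustrative assumptions, not the parameters used in this work.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_point_cloud(points, eps=0.5, min_samples=10):
    """Remove DBSCAN noise points from an (N, 4) LiDAR array [x, y, z, intensity].

    Points labelled -1 by DBSCAN (not density-reachable from any core point)
    are treated as outliers and discarded before pillar encoding.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[:, :3])
    return points[labels != -1]

# Example with synthetic data: keep only density-consistent points.
if __name__ == "__main__":
    cloud = np.random.rand(20000, 4) * [70.0, 80.0, 4.0, 1.0]
    filtered = filter_point_cloud(cloud, eps=0.5, min_samples=10)
    print(f"kept {filtered.shape[0]} of {cloud.shape[0]} points")
```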

2.3. Integration of Attention Mechanisms

To address the issue where the PointNet network and the linear layer in the column network treat columns and their internal points equally during column feature extraction, we introduce the point-wise self-attention mechanism and the channel attention mechanism [22]. These mechanisms effectively resolve the feature extraction accuracy and performance issues caused by differences in information volume and importance by assigning different weights to different points, focusing on key information points with higher weights while disregarding noise or irrelevant information [23].
The attention mechanism improves the accuracy and effectiveness of feature extraction by dynamically adjusting the weights [24], allowing the model to adaptively select key information in various contexts. By integrating both the point-wise self-attention mechanism and the channel attention mechanism into the pillar feature network [25], the model not only enhances its capability to identify and handle critical information within the point cloud data but also improves overall performance. This method more effectively captures both the local geometric features and global structural information of the point cloud [26], resulting in richer and more precise feature representations for subsequent point cloud analysis and object detection tasks.
The point-wise self-attention mechanism primarily focuses on the local information of each point. It dynamically adjusts the feature representation of each point by calculating the similarity between that point and other points within its neighborhood. The weight calculation formula is expressed as follows:
\alpha_{ij} = \frac{\exp(\mathrm{sim}(P_i, P_j))}{\sum_{k=1}^{N} \exp(\mathrm{sim}(P_i, P_k))}
Here, \alpha_{ij} denotes the attention weight of point P_i with respect to point P_j. The similarity between point P_i and all other points P_j is calculated (usually measured by the dot product), and this similarity is converted into attention weights using the softmax function. This conversion ensures that the sum of all attention weights is 1, which is essential for subsequent calculations [27].
Aggregated features, derived from the weighted sum of all point features [28], represent the updated feature f_x of point P_i. Each point’s new feature considers its relationship with other points. The formula for the aggregated feature is expressed as follows:
f_x = \sum_{j=1}^{N} \frac{\exp(f_i f_j^{T})}{\sum_{k=1}^{N} \exp(f_i f_k^{T})} \, f_j
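The point-wise self-attention computation above can be sketched as follows, using dot-product similarity and softmax normalization over the points inside each pillar; the (pillars, points, channels) tensor layout is an assumed convention for illustration.

```python
import torch
import torch.nn.functional as F

def pointwise_self_attention(feats: torch.Tensor) -> torch.Tensor:
    """Dot-product self-attention over the points inside each pillar.

    feats: (P, N, C) tensor -- P pillars, N points per pillar, C channels.
    Each point's output feature is a softmax-weighted sum of all point
    features in its pillar, matching the alpha_ij / f_x formulas above.
    """
    sim = torch.matmul(feats, feats.transpose(1, 2))  # (P, N, N) pairwise f_i f_j^T
    weights = F.softmax(sim, dim=-1)                  # alpha_ij, each row sums to 1
    return torch.matmul(weights, feats)               # weighted sum over j

# Example: 4 pillars, 32 points each, 64-channel features.
out = pointwise_self_attention(torch.randn(4, 32, 64))
print(out.shape)  # torch.Size([4, 32, 64])
```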
The channel attention mechanism enhances the features of important channels by evaluating the significance of each channel to emphasize crucial information.
For a feature map with dimensions H × W × C, the map undergoes global average pooling and global maximum pooling to produce two feature vectors, F_avg and F_max. These vectors are subsequently processed through a fully connected layer (MLP) to compute the attention weights for each channel. The formula used is as follows:
F_{\mathrm{mlp}} = \mathrm{ReLU}(W_1 F_{\mathrm{avg}} + b_1) + \mathrm{ReLU}(W_1 F_{\mathrm{max}} + b_1)
After passing through another fully connected layer, an output feature vector F_c is obtained, expressed as follows:
F_c = \sigma(W_2 F_{\mathrm{mlp}} + b_2)
The sigmoid function σ normalizes the values to the range [0, 1], while W_1 and W_2 represent the weight matrices, and b_1 and b_2 denote the bias vectors. The feature vectors derived from the self-attention mechanism are combined with those obtained from the channel attention mechanism to create the final pillar features. These pillar features are then fed into both the 2D feature extraction network and the detection network to determine the object’s category and location. The specific flow is illustrated in Figure 3.
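As a concrete reference, the channel attention computation defined by the two formulas above can be sketched in PyTorch as follows; the reduction ratio and module structure are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention following the F_mlp / F_c formulas above.

    Global average- and max-pooled descriptors pass through a shared hidden
    layer (W1, b1), are summed after ReLU, projected back to C channels
    (W2, b2), and squashed with a sigmoid to produce per-channel weights.
    The reduction ratio is an illustrative choice, not a value from the paper.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # W1, b1
        self.fc2 = nn.Linear(channels // reduction, channels)   # W2, b2
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) pseudo-image feature map
        avg = x.mean(dim=(2, 3))                  # F_avg: (B, C)
        mx = x.amax(dim=(2, 3))                   # F_max: (B, C)
        f_mlp = self.relu(self.fc1(avg)) + self.relu(self.fc1(mx))
        weights = torch.sigmoid(self.fc2(f_mlp))  # F_c: (B, C), values in [0, 1]
        return x * weights[:, :, None, None]      # reweight channels

# Example usage on a 64-channel pseudo-image.
attn = ChannelAttention(64)
print(attn(torch.randn(2, 64, 200, 176)).shape)  # torch.Size([2, 64, 200, 176])
```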

2.4. Improvements in Downsampling

The PointPillars network utilizes the Scatter operator to generate a pseudo-image, followed by multi-scale feature extraction from this pseudo-image [29]. However, the original downsampling approach in the network predominantly emphasizes global features, which results in insufficient extraction of local features from the point cloud data [30]. This limitation can lead to the loss of detailed information and a reduction in the spatial resolution of the feature map, impairing the ability to detect small objects and complex structures. Furthermore, when dealing with large-scale point cloud data, the original downsampling method may result in higher computational complexity and increased resource consumption [31].
To enhance the comprehensiveness of feature extraction, we employ CSPNet as the downsampling feature extraction network for point cloud pseudo-images. CSPNet effectively integrates features and improves the learning capacity of convolutional neural networks, thereby increasing the model’s accuracy.
CSPNet enhances feature extraction and fusion efficiency by dividing the input feature map into two segments for parallel processing [32]. The first segment employs a conventional 3 × 3 convolutional layer to extract basic feature information, effectively capturing simple patterns and local structures. The second segment follows a more intricate process: initially, a 1 × 1 convolutional layer reduces the number of channels to decrease the computational load and streamline the feature map. This is followed by a 3 × 3 convolutional layer that extracts more complex and detailed features, capturing richer contextual information. Finally, a 1 × 1 convolutional layer adjusts the number of channels back to the original count, maintaining consistency between the processed feature map and the input feature map.
By utilizing parallel processing and channel reduction strategies, CSPNet reduces the computational burden, enabling faster processing times while preserving important feature details. After processing these two feature maps, they are fused using an addition operation, effectively integrating the information from both maps while maintaining a consistent number of channels [33]. Subsequently, the fused feature map is concatenated with an additional processed feature map. This concatenation combines all processed feature maps along the channel axis, ensuring that the number of channels in the resultant feature map remains uniform. Finally, the SiLU activation function [16] is applied to the final feature map to introduce non-linearity, allowing the model to capture more complex feature representations.
The parallel processing, channel reduction, and fusion techniques employed by CSPNet significantly improve both the feature extraction efficiency and processing speed. These steps enable CSPNet to enhance the feature extraction efficiency and fusion, maintain channel consistency, and preserve both global and local details.
The specific workflow is illustrated in Figure 4.
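The following is a minimal PyTorch sketch of a CSP-style downsampling block reflecting the split, bottleneck, addition, concatenation, and SiLU steps described above; the channel counts, stride-2 transition convolution, and BatchNorm layers are assumptions for illustration, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class CSPDownsampleBlock(nn.Module):
    """CSP-style block: channel split, parallel branches, add, concat, SiLU."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Branch A: plain 3x3 convolution on the first half of the channels.
        self.branch_a = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, bias=False), nn.BatchNorm2d(half))
        # Branch B: 1x1 reduce -> 3x3 -> 1x1 restore (bottleneck) on the second half.
        self.branch_b = nn.Sequential(
            nn.Conv2d(half, half // 2, 1, bias=False), nn.BatchNorm2d(half // 2),
            nn.Conv2d(half // 2, half // 2, 3, padding=1, bias=False), nn.BatchNorm2d(half // 2),
            nn.Conv2d(half // 2, half, 1, bias=False), nn.BatchNorm2d(half))
        # Additional processed path kept for the final channel concatenation.
        self.shortcut = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half))
        # Stride-2 transition convolution performing the actual downsampling.
        self.down = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)                    # split the feature map in two
        fused = self.branch_a(x1) + self.branch_b(x2)        # element-wise fusion
        out = torch.cat([fused, self.shortcut(x1)], dim=1)   # channel concatenation
        return self.act(self.down(out))                      # SiLU non-linearity

# Example: downsample a 64-channel pseudo-image from 496x432 to 248x216.
block = CSPDownsampleBlock(64)
print(block(torch.randn(1, 64, 496, 432)).shape)  # torch.Size([1, 64, 248, 216])
```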

3. Experimental Results and Analysis

3.1. Overview and Description of the Experimental Dataset

The KITTI dataset [17] is currently the largest dataset dedicated to evaluating computer vision algorithms in autonomous driving scenarios. It is extensively used for testing and assessing various autonomous driving technologies. The dataset covers a wide range of realistic driving environments, including city streets, country roads, and highways, and provides high-resolution image data, ensuring both comprehensive coverage and representativeness. To accommodate diverse research needs, the KITTI dataset categorizes the data into three levels—easy, medium, and difficult—based on factors such as object size, occlusion degree, and truncation extent. This classification enables researchers to evaluate and refine algorithms across different levels of complexity, ensuring their adaptability and robustness in real-world driving environments.
The KITTI dataset comprises 7481 training samples and 7518 test samples, with each sample scenario containing approximately 16,384 points. These richly annotated data are crucial for developing and testing automated driving perception and decision-making algorithms. The official leaderboard ranks performance based on the medium difficulty category, using mean average precision (mAP) as the evaluation criterion.

3.2. Experimental Parameterization Design and Environment Construction

The hyperparameters are set as follows during training: the point cloud range along the x, y, and z axes is [0 m, 69.12 m], [−39.68 m, 39.68 m], and [−3 m, 1 m], respectively, with voxel dimensions set to Vw = 0.16, Vh = 0.16, and Vd = 4. The network allows a maximum of 20,000 columns and a maximum of 32 points per column. To enhance data diversity and model robustness, the point cloud is rotated and translated before being input into the network, achieving data augmentation. The method is implemented in the PyTorch framework, and all training is performed on an NVIDIA RTX 3090 computing platform (Santa Clara, CA, USA). The Adam optimizer [22] is employed with 160 epochs of training, a batch size of 6, an initial learning rate of 3 × 10⁻³, and a cosine annealing scheduler for dynamic adjustment of the learning rate [29].
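For reference, the training settings above can be collected into a single configuration object; the dictionary below is a hedged summary with illustrative key names, not the configuration schema actually used in this work.

```python
# Training configuration summarized from the settings above; key names are
# illustrative and not tied to any particular framework's config format.
train_cfg = {
    "point_cloud_range": [0.0, -39.68, -3.0, 69.12, 39.68, 1.0],  # [x_min, y_min, z_min, x_max, y_max, z_max] (m)
    "voxel_size": [0.16, 0.16, 4.0],      # (Vw, Vh, Vd)
    "max_pillars": 20000,
    "max_points_per_pillar": 32,
    "optimizer": "adam",
    "epochs": 160,
    "batch_size": 6,
    "initial_lr": 3e-3,
    "lr_scheduler": "cosine_annealing",
}
```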

3.3. Analysis of Experimental Results

3.3.1. Performance Analysis of Mean Average Precision

In this paper, the detection results were assessed using the official KITTI BEV and 3D detection metrics. To evaluate the performance of the improved network model on the KITTI test set, we compared it with several existing algorithms, including MV3D, SECOND, VoxelNet, TANet [34], AVODFPN, PRGBNet [35], HDNET [36], and PointPillars. The specific comparison results are presented in Table 1 and Table 2.
As shown in Table 1 and Table 2, under medium detection difficulty, vehicles, pedestrians, and cyclists achieve BEV mAP scores of 87.74%, 55.07%, and 67.78%, respectively, representing improvements in detection accuracy of 1.64%, 5.84%, and 5.53% over PointPillars. For 3D mAP, the scores are 77.90%, 49.22%, and 62.10%, reflecting increases of 2.91%, 5.69%, and 3.03% in detection accuracy for cars, pedestrians, and cyclists, respectively, compared to PointPillars.

3.3.2. Evaluation of Real-Time and Detection Performance of the Model

As detailed in Table 1 and Table 2, under medium detection difficulty, the improved model achieves BEV mAP scores of 87.74%, 55.07%, and 67.78% for vehicles, pedestrians, and cyclists, respectively. These scores represent increases in detection accuracy of 1.64%, 5.84%, and 5.53% compared to PointPillars. For 3D mAP, the model records scores of 77.90%, 49.22%, and 62.10% for cars, pedestrians, and cyclists, respectively, reflecting improvements of 2.91%, 5.69%, and 3.03% over PointPillars.

3.3.3. Ablation Experiments

In this section, we conduct experiments on the KITTI validation set using PointPillars as the baseline network, incorporating various modules to enhance 3D object detection. Specifically, we evaluate the effectiveness of the following modules: a point cloud data processing module based on clustering algorithms, a column feature extraction module utilizing a fused attention mechanism, and a data processing module employing CSPNet for downsampling. By integrating these modules, we assess their impact on improving performance in 3D object detection tasks, highlighting each module’s contribution to enhancing detection accuracy and robustness. The experimental results are presented in Table 3 and Table 4.
The results presented in Table 3 and Table 4 indicate that integrating the clustering-based point cloud data processing module into the network architecture leads to significant improvements in object recognition accuracy for categories such as bicycles and cars. Additionally, incorporating the column feature extraction module, combined with a fused attention mechanism, enhances the network’s ability to capture both global and local features, resulting in marked improvements in detection performance. Furthermore, the data processing module utilizing CSPNet for downsampling not only reduces the processing time but also enhances the overall performance. Collectively, these enhancements demonstrate that each module contributes crucially to improving the detection accuracy and processing efficiency, leading to a substantial boost in the network’s overall performance.

3.3.4. Visualization of Experimental Results

To better assess the network’s performance, we apply it to generate predictions on the validation set and visualize these results in a 3D view.
As shown in Figure 5, each image consists of a camera image, the point cloud visualization generated by the PointPillars algorithm, and the point cloud visualization generated by the algorithm proposed in this paper. In Figure 5a, where objects overlap, the PointPillars algorithm struggles to clearly distinguish contours, misclassifying pedestrians as vehicles and streetlights as pedestrians. However, the proposed algorithm accurately distinguishes these objects, as shown in the box indicated by the arrow in the right image, demonstrating its ability to address occlusions and blurred object boundaries in crowded urban environments. Figure 5b further illustrates the limitations of PointPillars, where the algorithm misses several vehicles due to complex positioning and perspective, causing the vehicles to blend into the environment. In contrast, the proposed algorithm detects all vehicles, as shown in the box indicated by the arrow in the right image, successfully overcoming spatial positioning and perspective challenges in more complex scenes. Figure 5c shows the PointPillars algorithm failing to detect a car and misclassifying a distant flower bed as a car, leading to reduced accuracy with distant objects and background clutter. The proposed algorithm successfully detects all objects in this scene, improving performance in handling distant objects and cluttered backgrounds. Finally, Figure 5d demonstrates how the PointPillars algorithm misclassifies overlapping pedestrians as a bicycle and mistakes a streetlight for a pedestrian in crowded scenes with occluded objects. The right image, generated by the proposed algorithm, correctly detects all pedestrians and the streetlight, highlighting the improvements made in detecting objects in complex scenes with severe overlap and occlusion. These examples underscore how the proposed algorithm enhances object recognition accuracy in complex real-world environments.
These comparisons show that our proposed algorithm outperforms the PointPillars algorithm in detecting cars, pedestrians, and bicycles, highlighting the advantages and effectiveness of the enhanced modules used in our method.

4. Conclusions

To improve the accuracy and efficiency of 3D object detection, an enhanced pillar-based network has been proposed, incorporating three key modules: a DBSCAN-based point cloud processing module, a fused attention mechanism, and a CSPNet downsampling module. These modules work together to significantly boost detection performance, especially for small and medium-sized objects in complex urban environments. The DBSCAN module extracts key features from point clouds, improving the detection accuracy for objects such as cars and bicycles. The fused attention mechanism enhances the network’s ability to capture both global and local features, which is crucial for handling occlusions and overlaps in crowded scenes. The CSPNet downsampling module optimizes the processing speed and reduces the computational costs while maintaining performance. The experimental results on the KITTI dataset show that the proposed method outperforms existing approaches, achieving 2.91%, 5.69%, and 3.03% improvements in 3D detection accuracy for cars, pedestrians, and cyclists, respectively, along with notable gains in BEV detection. Ablation studies confirm the contribution of each module to the overall performance improvement.
This method is particularly advantageous for real-world applications, including autonomous driving, where accurate object detection in dynamic urban environments is critical for safety. The modular design also allows for easy adaptation to other fields, such as robotics and smart cities. However, further testing on diverse datasets is needed to fully assess the method’s generalizability and robustness in real-world scenarios.

Author Contributions

Conceptualization, W.Z. and J.C.; methodology, W.Z.; software, W.Z. and X.D.; validation, W.Z. and J.C.; formal analysis, W.Z.; investigation, W.Z. and X.D.; resources, W.Z.; data curation, W.Z. and J.C.; writing—original draft preparation, W.Z.; writing—review and editing, W.Z.; visualization, W.Z.; supervision, X.D. and S.W.; project administration, X.D. and S.W.; funding acquisition, X.D. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant 62273342 and in part by the Anhui Provincial Key Research and Development Project under Grant 2022a05020035.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bai, Z.; Wu, G.; Barth, M.J.; Liu, Y.; Sisbot, E.A.; Oguchi, K. PillarGrid: Deep Learning-based Cooperative Perception for 3D Object Detection from Onboard-Roadside LiDAR. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 1743–1749. [Google Scholar]
  2. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  3. Chen, D.J.; Yu, W.J.; Gao, Y.B. 3D Object Detection of LiDAR Based on Improved PointPillars. Laser Optoelectron. Prog. 2023, 60, 447–453. [Google Scholar]
  4. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  5. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  6. Li, J.N.; Wu, Z.; Xu, T.F. Research Progress of 3D Object Detection Technology Based on Point Cloud Data. Acta Opt. Sin. 2023, 43, 296–312. [Google Scholar]
  7. Li, X.L.; Zhou, Y.E.; Bi, T.F.; Yu, Q.; Wang, Z.; Huang, J.; Xu, L. A Review on the Development of Key Technologies for Lightweight Sensing Lidar. Chin. J. Lasers 2022, 49, 263–277. [Google Scholar]
  8. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
  9. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  10. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  11. Yin, T.; Zhou, X.; Krähenbühl, P. CenterPoint: Center-based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11784–11793. [Google Scholar]
  12. Sheng, H.L.; Cai, S.J.; Zhao, N.; Deng, B.; Huang, J.; Hua, X.S.; Zhao, M.J.; Lee, G.H. Rethinking IoU-Based Optimization for Single-Stage 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 544–561. [Google Scholar]
  13. Yang, Q.; Kong, D.; Chen, J.; Li, X.; Shen, Y. An Improved PointPillars Method Based on Density Clustering and Dual Attention Mechanism. Laser Optoelectron. Prog. 2024, 61, 2412003. [Google Scholar]
  14. Xu, H.; Dong, X.; Wu, W.; Yu, B.; Zhu, H. A Two-Stage Pillar Feature-Encoding Network for Pillar-Based 3D Object Detection. World Electr. Veh. J. 2023, 14, 146. [Google Scholar] [CrossRef]
  15. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  17. Huang, C.; Zhang, Z.; Liu, C.; Zhuang, Y.; Li, Y. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 390–400. [Google Scholar]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  19. Wang, Z.; Liu, L.; Yu, X.; Zhang, C.; Zhao, W. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1–10. [Google Scholar]
  20. Ku, J.; Saldana, A.; Watterson, J.; Mertz, C.; Khandelwal, S.; Maturana, D. Joint 3D proposal generation and object detection from a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1085–1094. [Google Scholar]
  21. Wang, Y.Y.; Wang, Y.N.; Liu, J.X.; Ren, J. Research on Application of Port Logistics Big Data Based on Hadoop. J. YanShan Univ. 2023, 47, 216–220. [Google Scholar]
  22. Elfwing, S.; Kabra, R.; Kawaguchi, K.; Doya, K. Sigmoid-weighted Linear Unit for Neural Network Activation Functions. In Proceedings of the IEEE Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 2–8 December 2018; pp. 5674–5683. [Google Scholar]
  23. Hu, J.; An, Y.P.; Xu, W.C.; Xiong, Z.; Liu, H. 3D Object Detection Based on Deep Semantic and Positional Information Fusion of Laser Point Clouds. Chin. J. Lasers 2023, 50, 200–210. [Google Scholar]
  24. Qiu, S.; Wu, Y.; Anwar, S.; Li, C. Investigating Attention Mechanism in 3D Point Cloud Object Detection. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 403–412. [Google Scholar]
  25. Zhai, Z.; Wang, Q.; Pan, Z.; Gao, Z.; Hu, W. Muti-Frame Point Cloud Feature Fusion Based on Attention Mechanisms for 3D Object Detection. Sensors 2022, 22, 7473. [Google Scholar] [CrossRef] [PubMed]
  26. Li, X.; Liang, B.; Huang, J.; Peng, Y.; Yan, Y.; Li, J.; Shang, W.; Wei, W. Pillar-Based 3D Object Detection from Point Cloud with Multiattention Mechanism. Wirel. Commun. Mob. Comput. 2023, 2023, 5603123. [Google Scholar] [CrossRef]
  27. Wang, L.; Song, Z.; Zhang, X.; Wang, C.; Zhang, G.; Zhu, L.; Li, J.; Liu, H. SAT-GCN: Self-Attention Graph Convolutional Network-Based 3D Object Detection for Autonomous Driving. Knowl. Based Syst. 2023, 259, 110080. [Google Scholar] [CrossRef]
  28. Wang, Z.; Fu, H.; Wang, L.; Xiao, L.; Dai, B. SCNet: Subdivision Coding Network for Object Detection Based on 3D Point Cloud. IEEE Access 2019, 7, 120449–120462. [Google Scholar] [CrossRef]
  29. Cao, P.; Chen, H.; Zhang, Y.; Wang, G. Multi-View Frustum PointNet for Object Detection in Autonomous Driving. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3896–3899. [Google Scholar]
  30. Wang, S.; Lu, K.; Xue, J.; Zhao, Y. DA-Net: Density-Aware 3D Object Detection Network for Point Clouds. IEEE Trans. Multimed. 2023, 1–14. [Google Scholar] [CrossRef]
  31. Li, C.; Gao, F.; Han, X.; Zhang, B. A New Density-Based Clustering Method Considering Spatial Distribution of LiDAR Point Cloud for Object Detection of Autonomous Driving. Electronics 2021, 10, 2005. [Google Scholar] [CrossRef]
  32. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  33. Wang, Y.; Jiang, Z.; Li, Y.; Hwang, J.N.; Xing, G.; Liu, H. RODNet: A Real-Time Radar Object Detection Network Cross-Supervised by Camera-Radar Fused Object 3D Localization. IEEE J. Sel. Top. Signal Process. 2021, 15, 954–967. [Google Scholar] [CrossRef]
  34. Zheng, K.; Zheng, Y.; Zhang, Y.; Li, B.; Wang, Z.; Li, L. TANet: Robust 3D object detection via dual attention network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 8805–8814. [Google Scholar]
  35. Zhang, W.; Xu, L.; Zhang, X.; Liu, W.; Liao, R.; Li, Z. PRGBNet: Point cloud representation with graph-based neural network for 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2022; pp. 3471–3480. [Google Scholar]
  36. Qi, C.R.; Su, H.; Mo, K.; Yi, L. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
Figure 1. PointPillars network architecture.
Figure 2. Comparison of point cloud before and after processing.
Figure 3. Feature extraction incorporating the attention mechanism.
Figure 4. Flowchart of CSPNet network.
Figure 5. Comparison of the results of PointPillars with the algorithm of this paper. The left part of each scene is the result of the baseline, and the right part is the result of the proposed approach. (a,d) show improvements for false detections caused by under-segmentation of small objects, while (b,c) show improvements for missed detections caused by occlusion.
Table 1. Results of the KITTI test BEV detection benchmark.

Model        | Car (Easy / Mod. / Hard) | Pedestrian (Easy / Mod. / Hard) | Cyclist (Easy / Mod. / Hard)
MV3D         | 86.02 / 76.90 / 68.49    | N/A / N/A / N/A                 | N/A / N/A / N/A
SECOND       | 88.07 / 79.37 / 77.95    | 55.10 / 46.27 / 44.76           | 73.67 / 56.04 / 48.78
VoxelNet     | 89.35 / 79.26 / 77.39    | 46.13 / 40.74 / 38.11           | 66.70 / 54.76 / 50.55
TANet        | 91.58 / 86.54 / 81.19    | 60.58 / 51.38 / 47.54           | 79.16 / 63.77 / 56.21
AVODFPN      | 88.53 / 83.79 / 77.90    | 58.75 / 51.50 / 47.54           | 68.09 / 57.48 / 50.77
PRGBNet      | 91.39 / 85.73 / 80.68    | 38.07 / 29.32 / 26.94           | 73.09 / 57.59 / 51.78
HDNET        | 89.14 / 86.57 / 78.32    | N/A / N/A / N/A                 | N/A / N/A / N/A
PointPillars | 88.35 / 86.10 / 79.83    | 58.66 / 50.23 / 47.19           | 79.14 / 62.25 / 56.00
Ours         | 90.09 / 87.74 / 83.72    | 62.94 / 55.07 / 53.94           | 78.70 / 67.78 / 62.34
Table 2. Results of the KITTI test 3D detection benchmark.

Model        | Car (Easy / Mod. / Hard) | Pedestrian (Easy / Mod. / Hard) | Cyclist (Easy / Mod. / Hard)
MV3D         | 71.09 / 62.35 / 55.12    | N/A / N/A / N/A                 | N/A / N/A / N/A
SECOND       | 83.13 / 73.66 / 66.20    | 51.07 / 42.56 / 37.29           | 70.51 / 53.85 / 46.90
VoxelNet     | 77.47 / 65.11 / 57.73    | 39.48 / 33.69 / 31.50           | 61.22 / 48.36 / 44.37
TANet        | 84.39 / 75.94 / 68.82    | 53.72 / 44.34 / 40.49           | 75.70 / 59.44 / 52.53
AVODFPN      | 81.94 / 71.88 / 66.38    | 50.58 / 42.81 / 40.88           | 64.00 / 52.18 / 46.61
PRGBNet      | 83.99 / 76.04 / 71.17    | 44.63 / 37.37 / 34.92           | 75.24 / 61.70 / 55.32
PointPillars | 79.05 / 74.99 / 68.30    | 52.08 / 43.53 / 41.49           | 75.78 / 59.07 / 52.92
Ours         | 84.34 / 77.90 / 74.15    | 53.08 / 49.22 / 46.20           | 81.18 / 62.10 / 56.13
Table 3. Ablation study results on the KITTI BEV detection benchmark.

Model        | Car (Easy / Mod. / Hard) | Pedestrian (Easy / Mod. / Hard) | Cyclist (Easy / Mod. / Hard)
PP           | 88.35 / 86.10 / 79.83    | 58.66 / 50.23 / 47.19           | 79.14 / 62.25 / 56.00
PP + DBSCAN  | 88.29 / 85.63 / 83.26    | 58.11 / 51.23 / 49.99           | 79.00 / 61.16 / 58.99
PP + SC      | 89.87 / 87.25 / 84.72    | 58.62 / 53.26 / 48.68           | 79.62 / 64.21 / 58.01
PP + CSPNet  | 89.66 / 86.55 / 83.96    | 60.01 / 52.64 / 46.62           | 78.39 / 66.86 / 59.21
Both         | 90.09 / 87.74 / 83.72    | 62.94 / 55.07 / 53.94           | 78.70 / 67.78 / 62.34
Table 4. Ablation study results on the KITTI 3D detection benchmark.

Model        | Car (Easy / Mod. / Hard) | Pedestrian (Easy / Mod. / Hard) | Cyclist (Easy / Mod. / Hard)
PP           | 79.05 / 74.99 / 68.30    | 52.08 / 43.53 / 41.49           | 75.78 / 59.07 / 52.92
PP + DBSCAN  | 78.06 / 75.62 / 73.11    | 52.39 / 45.88 / 44.18           | 74.66 / 59.21 / 56.21
PP + SC      | 82.47 / 76.24 / 71.12    | 53.00 / 46.62 / 44.61           | 78.88 / 61.02 / 53.99
PP + CSPNet  | 81.04 / 75.22 / 70.65    | 53.66 / 46.98 / 44.21           | 79.24 / 59.02 / 52.22
Both         | 84.34 / 77.90 / 74.15    | 53.08 / 49.22 / 46.20           | 81.18 / 62.10 / 56.13
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
