MS3D: A Multi-Scale Feature Fusion 3D Object Detection Method for Autonomous Driving Applications

Li, Ying; Zhuang, Wupeng; Yang, Guangsong

doi:10.3390/app142210667


by Ying Li ¹, Wupeng Zhuang ¹ and Guangsong Yang ²,*

¹ Chengyi College, Jimei University, Xiamen 361021, China

² School of Ocean Information Engineering, Jimei University, Xiamen 361021, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10667; https://doi.org/10.3390/app142210667

Submission received: 28 October 2024 / Revised: 10 November 2024 / Accepted: 13 November 2024 / Published: 18 November 2024

(This article belongs to the Special Issue Advances in Autonomous Driving and Smart Transportation)


Abstract

With advancements in autonomous driving, LiDAR has become central to 3D object detection due to its precision and interference resistance. However, challenges such as point cloud sparsity and unstructured data persist. This study introduces MS3D (Multi-Scale Feature Fusion 3D Object Detection Method), a novel approach to 3D object detection that leverages the architecture of a 2D Convolutional Neural Network (CNN) as its core framework. It integrates a Second Feature Pyramid Network to enhance multi-scale feature representation and contextual integration. The Adam optimizer is employed for efficient adaptive parameter tuning, significantly improving detection performance. On the KITTI dataset, MS3D achieves average precisions of 93.58%, 90.91%, and 88.46% in easy, moderate, and hard scenarios, respectively, surpassing state-of-the-art models like VoxelNet, SECOND, and PointPillars.

Keywords:

LiDAR point cloud; 3D object detection; multi-scale feature fusion; adaptive optimization; autonomous driving

1. Introduction

The rapid advancement of autonomous driving technology demands precise environmental perception and reliable object detection. Traditional 2D sensing methods, including cameras and 2D LiDAR, are limited by lighting and occlusion, which degrades their performance in complex and adverse conditions. Cameras struggle in low-light and high-contrast environments. 2D LiDAR offers excellent distance measurement precision and robustness against interference but, because it is two-dimensional, is inadequate for spatial navigation. Thus, 3D object detection using LiDAR point clouds has become a key focus in autonomous driving research.

Unlike traditional methods, 3D object detection using LiDAR point clouds offers notable advantages. It operates in all weather conditions, unaffected by lighting variations, ensuring reliable performance day and night and in adverse weather such as rain or fog. Its high spatial resolution allows for the early detection of distant objects, enhancing safety by providing more reaction time. LiDAR’s robustness to lighting changes and occlusions ensures consistent performance in complex scenarios. Additionally, the compact nature of LiDAR data facilitates efficient processing and computation, meeting the real-time demands of autonomous driving.

Overall, LiDAR-based 3D object detection is essential for autonomous vehicle perception systems, providing accurate and reliable detection crucial for safe and efficient driving. Early work, such as VoxelNet [1], introduced an end-to-end framework using voxel features extracted through 3D convolution, establishing a foundation for further research. SECOND [2] optimized efficiency with 3D sparse convolution, while PointPillars [3] enhanced detection speed by mapping voxel features to pseudo-images. SASSD [4] improved training efficiency with an auxiliary network for position and category predictions. PVRCNN [5] strengthened feature extraction by concatenating 3D sparse convolutional features. VoxelRCNN [6] increased accuracy with voxel ROI pooling. The Waymo and Google team developed the multi-view fusion algorithm [7], integrating perspective views and point clouds to boost detection accuracy. TANet [8] advanced object detection with a TA (Triple Attention) Module and CFR (Coarse-to-Fine Regression) strategy. Researchers have built large-scale LiDAR point cloud datasets, such as KITTI [9], NuScenes [10], Waymo [11], and ONCE [12], providing extensive data for training and testing, thus enhancing algorithm performance and applicability.

Despite its advantages, 3D object detection using LiDAR point clouds faces several challenges. The sparsity of point cloud data hinders traditional algorithms’ ability to delineate object boundaries and detect small, distant objects. The unordered nature of point clouds complicates feature extraction, particularly in cases of overlapping or occluded objects. Additionally, noise in point clouds can introduce misleading information, leading to increased false detections and reduced accuracy, especially in complex environments.

To address these challenges, we propose Multi-Scale Feature Fusion for 3D Object Detection (MS3D). MS3D utilizes spatial segmentation and feature extraction [3] to efficiently handle sparse LiDAR point cloud data. The method incorporates a 2D Convolutional Neural Network (CNN) to refine object boundaries, and the Second Feature Pyramid Network (SecondFPN) is utilized to augment feature discriminability. This integration enables robust cross-scale feature fusion, enhancing the model’s resilience to noise and thereby significantly boosting detection precision and stability. The Adam optimizer is used for adaptive parameter updates, accelerating convergence and boosting both training efficiency and detection precision. Moreover, a positive sample filtering strategy [13] is adopted for data augmentation, focusing on the extraction of discriminative features to minimize noise interference.

This paper is structured as follows: Section 1 introduces LiDAR-based 3D object detection. Section 2 reviews related works in 3D object detection. Section 3 explains the MS3D architecture. Section 4 compares MS3D with existing algorithms and evaluates its performance. Section 5 summarizes contributions and suggests future research directions.

2. Related Works

Three-dimensional object detection algorithms are the cornerstone of autonomous driving systems, tasked with identifying and precisely localizing road targets such as pedestrians, vehicles, and obstacles from onboard sensor data. Compared to 2D object detection, they provide more precise information, including target categories, locations (center coordinates and dimensions), and poses (rotation angles, etc.), which are crucial for helping autonomous vehicles to make precise decisions in path planning and obstacle avoidance, ensuring the safety and efficiency of driving.

In recent years, there has been a shift in 3D object detection technology from traditional computer vision methods [14,15] that rely on manual feature extraction to deep learning-driven techniques. Traditional methods, primarily based on geometric analysis and machine learning techniques such as Support Vector Machines, are limited in their ability to handle complex dynamic scenes due to environmental changes and computational efficiency. In contrast, deep learning methods, through end-to-end feature learning, have achieved single-stage direct prediction, two-stage ROI refinement strategies, and data fusion across mono-modal and cross-modal algorithms, significantly improving detection accuracy and robustness. Some 3D object detection algorithms are listed in Table 1.

In the domain of single-stage 3D object detection, algorithms such as 3D-SSD [16], IA-SSD [17], VoxelNet [1], SECOND [2], PointPillars [18], CIA-SSD [13], SA-SSD [4], Diswot [19], LargeKernel3D [20], OcTr [21], and VoxelNeXt [22] have achieved rapid and efficient target identification through the optimization of point cloud downsampling, voxelization methods, auxiliary networks, distillation techniques, innovative data structures, and sparse inspection architectures. Two-stage methods like CasA [23], BtcDet [24], MSF [25], and ConQueR [26] have enhanced detection accuracy for occluded and fast-moving targets through fine-grained feature extraction and regression optimization. Cross-modal methods including MV3D [27], EPNet [28], PointPainting [29], MVP [30], LoGoNet [31], SFD [32], and Virtual Conv [33] have further enhanced detection robustness by fusing LiDAR and visual image data.

LiDAR point cloud technology employs the echo time measurement of laser pulses to generate high-resolution point cloud models, which is vital for precise 3D spatial data acquisition in autonomous driving systems. This methodology surpasses 2D imaging by offering increased resilience against environmental lighting and weather variations, ensuring stable operation in low-visibility conditions. The technology’s high-resolution data acquisition and processing capabilities enable autonomous systems to effectively manage occlusions and detect objects across multiple scales, enhancing their perceptual abilities.

With the continuous advancement of deep learning technology, the integration of LiDAR point cloud and 3D object detection methods has significantly enhanced the accuracy and efficiency of detection in the field of autonomous driving. Methods based on voxels, such as VoxelNet [1], effectively handle large-scale point cloud data through regular three-dimensional voxel grid segmentation and the application of 3D Convolutional Neural Networks (CNNs). Meanwhile, point cloud-based methods, such as PointNet [34] and PointNet++ [35], directly process irregular point clouds and learn their local and global features, ensuring high-precision object detection. Pillar-based methods, such as PointPillars [3], enhance real-time detection by converting point cloud data to a pseudo-image plane and combining it with Convolutional Neural Networks. Additionally, the introduction of the Transformer architecture [3], leveraging its powerful attention mechanism for modeling, further captures the global relationships and detailed features within point cloud data, significantly improving the overall performance of 3D object detection.

3. Proposed Method

We introduce the 3D object detection method MS3D. Initially, we delineate the fundamental architecture of the network. Subsequently, we present the pivotal technique for voxelizing point cloud data. Following this, we elaborate on the SecondFPN feature pyramid network. Finally, we introduce the loss function.

3.1. Network Architecture

This paper introduces MS3D, a multi-scale feature fusion 3D object detection method for autonomous driving, as illustrated in Figure 1.

First, a spatial segmentation strategy divided the point cloud into uniform 3D cells, termed Pillars. Each Pillar represents a 3D cell containing $M$ points. Each point featured nine dimensions: 3D coordinates $(X, Y, Z)$, reflection intensity $r$, offset from the Pillar center $(X_{C}, Y_{C}, Z_{C})$, and offset from the grid center $(X_{P}, Y_{P})$.

Points within each Pillar were either randomly sampled or padded with zero-value points to form a uniform data structure $(D, N, P)$, where $D = 9$ is the number of feature dimensions, $N$ is the point count threshold, and $P$ is the total number of Pillars. PointNet processed these Pillar features, converting them into a $C$-channel feature representation $(C, N, P)$ that was then transformed into a pseudo-image of size $H \times W \times C$ through max pooling, where $H$ and $W$ are the height and width of the grid and $C$ is the number of channels.
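The nine-dimensional point decoration described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `decorate_pillar` is hypothetical, the Pillar center is assumed to be the mean of the points in the Pillar, and truncation stands in for the random sampling mentioned in the text.

```python
from statistics import fmean


def decorate_pillar(points, grid_center_xy, n_max):
    """Build 9-D features for one non-empty pillar and pad them to n_max rows.

    points: list of (x, y, z, r) tuples that fall inside the pillar.
    grid_center_xy: (x, y) center of the pillar's grid cell.
    """
    # Keep at most n_max points (simple truncation; the text samples randomly).
    pts = points[:n_max]
    # Assumption: the pillar center is the mean of the points in the pillar.
    cx = fmean(p[0] for p in pts)
    cy = fmean(p[1] for p in pts)
    cz = fmean(p[2] for p in pts)
    gx, gy = grid_center_xy
    feats = [
        (x, y, z, r, x - cx, y - cy, z - cz, x - gx, y - gy)
        for x, y, z, r in pts
    ]
    # Zero-pad so that every pillar has exactly n_max rows.
    feats += [(0.0,) * 9] * (n_max - len(feats))
    return feats


pillar = decorate_pillar(
    [(1.0, 2.0, 0.0, 0.5), (3.0, 2.0, 1.0, 0.1)], (2.0, 2.0), n_max=4
)
```

Stacking the output of every Pillar gives the $(D, N, P)$ structure with $D = 9$.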

A 2D Convolutional Neural Network backbone extracted multi-scale features from the pseudo-image. The network consisted of two sub-networks: one for top-down feature extraction at different resolutions, and another that upsampled features via deconvolution and concatenated the features from three different-sized feature maps.

Following the backbone, a SecondFPN was integrated to refine feature extraction further and enhance multi-scale detection capabilities.

For 3D object detection, an SSD (Single Shot MultiBox Detector) was used. The SSD head defined $K$ anchor boxes in the feature maps, each with varying sizes and aspect ratios to accommodate multi-scale detection needs. Regression predicted each anchor box’s $Z$-axis coordinate $Z_{g}$ and height $h$, constructing precise 3D bounding boxes: $\{(X_{g}, Y_{g}, Z_{g}), (l, w, h), \theta, \mathrm{score}, c\}$. Here, $(X_{g}, Y_{g}, Z_{g})$ is the bounding box center, $l$ is the length, $w$ is the width, $h$ is the height, $\theta$ is the rotation angle relative to the $xy$-plane, score is the confidence of the bounding box containing the object, and $c$ is the object class (e.g., vehicle, pedestrian).

The model training was optimized using the Adam optimizer [36], which adjusted learning rates adaptively based on parameter gradients, enhancing training efficiency and model performance.
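For reference, a single Adam update for one scalar parameter can be sketched as below. The hyperparameter values are the commonly used defaults and are assumptions here; the paper does not report the learning rate or decay rates used for MS3D.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled per parameter by the running gradient magnitude.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adam_step(param=1.0, grad=4.0, m=0.0, v=0.0, t=1)
```

On the first step the update has magnitude close to `lr` regardless of the gradient scale, which is the adaptive behaviour the text refers to.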

3.2. Voxelization

The voxel-based encoding strategy preprocessed point cloud data as follows:

Voxel Grid Construction: The point cloud dataset $P$ was divided into a spatial grid with dimensions $X \times Y \times Z$, with voxel height $H$ defined as the difference between the maximum and minimum values in the vertical coordinate. On the $xy$-plane, the point cloud was segmented into $N \times M$ cylindrical voxels using a grid size of $p \times q$, where $N = X / p$ and $M = Y / q$.

Voxel Representation: Each point $p_{i} \in P$ was characterized by its 3D coordinates $(x_{i}, y_{i}, z_{i})$ and reflection intensity $r_{i}$. The point cloud $P$ was discretized into voxels $v$, with each voxel containing a set of points $\{p_{i} = [x_{i}, y_{i}, z_{i}, r_{i}],\ i = 1, \dots, N \mid p_{i} \in v\}$. The number of points per voxel was normalized to $N_{max}$ through sampling or padding.

Feature Vector Construction: Each cylindrical voxel $P_{k}$ in the $xy$-plane, with a base size of $dx \times dy$ and fixed height $H$, aggregated internal points to form a feature vector $f_{k}$. This vector included global coordinates $(x_{i}, y_{i}, z_{i})$, reflection intensity $r_{i}$, and local offsets $(x_{ci}, y_{ci}, z_{ci})$ relative to the voxel center, represented as $f_{k} = [(x_{i}, y_{i}, z_{i}, r_{i}, x_{ci}, y_{ci}, z_{ci}) \mid \forall p_{i} \in P_{k}]$.

Pseudo-Image Reconstruction: The feature vector $f_{k}$ was then reconstructed into a pseudo-image $I$ of size $H \times W \times C$, where $C$ is the number of feature channels and $H$ and $W$ are the grid dimensions on the $xy$-plane. This transformation from unstructured point cloud data to a structured feature representation enhanced the input for subsequent models, retaining spatial structure and improving environmental perception accuracy and computational efficiency in autonomous driving systems.
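The grid assignment and pseudo-image steps above can be sketched in a few lines. This is an illustrative outline under stated assumptions: the helper names are hypothetical, image rows are taken along $y$ and columns along $x$, and each Pillar is assumed to have already been reduced to a single $C$-dimensional feature vector.

```python
def pillar_index(x, y, x_min, y_min, p, q):
    """Map a point's (x, y) to its (column, row) cell on the xy-plane grid."""
    return int((x - x_min) // p), int((y - y_min) // q)


def group_points(points, x_min, y_min, p, q):
    """Group (x, y, z, r) points by the grid cell they fall into."""
    pillars = {}
    for pt in points:
        key = pillar_index(pt[0], pt[1], x_min, y_min, p, q)
        pillars.setdefault(key, []).append(pt)
    return pillars


def scatter_to_pseudo_image(pillar_features, height, width, channels):
    """Write one C-dim vector per non-empty pillar into an H x W x C grid."""
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (col, row), feat in pillar_features.items():
        image[row][col] = list(feat)  # cells without points stay zero
    return image


cloud = [(0.2, 0.3, 0.0, 1.0), (1.7, 0.1, 0.0, 1.0), (0.9, 0.9, 0.0, 1.0)]
pillars = group_points(cloud, x_min=0.0, y_min=0.0, p=1.0, q=1.0)
image = scatter_to_pseudo_image({(1, 0): [1.0, 2.0]}, height=2, width=3, channels=2)
```

Only non-empty cells are stored in `pillars`, which is how the sparsity of the point cloud is exploited before the dense 2D backbone is applied.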

3.3. SecondFPN

The SecondFPN network employed a carefully designed feature fusion strategy to integrate high-level semantic information from deeper layers with high-resolution features from shallower layers, enhancing semantic richness, as shown in Figure 2.

The process was as follows:

Feature Extraction: Input feature maps $I$ were processed through deep convolutional layers $C^{(k_{d}, s_{d}, p_{d})}$ to extract features, where $k_{d}$, $s_{d}$, and $p_{d}$ denote the kernel size, stride, and padding, respectively.

$$F^{(1)} = C^{(k_{d}, s_{d}, p_{d})}(I) \tag{1}$$

Dimensionality Reduction: To optimize channel numbers and reduce computational parameters, 1 × 1 convolutions $C^{(k_{1}, s_{1}, p_{1})}$ were applied to the second and third feature maps $F^{(2)}$ and $F^{(3)}$, where $k_{1} = 1$, $s_{1} = 1$, and $p_{1} = 0$.

$$F^{(2)} = C^{(k_{1}, s_{1}, p_{1})}(F^{(1)}) \tag{2}$$

$$F^{(3)} = C^{(k_{1}, s_{1}, p_{1})}(F^{(2)}) \tag{3}$$

Upsampling and Fusion: The lower-resolution, semantically rich fourth feature map $F^{(4)}$ was upsampled using $U^{(k_{u})}$ to match the resolution of the lower layer feature map $F^{(2)}$, where $k_{u}$ is the upsampling kernel size. Feature fusion was performed by element-wise addition.

$$F^{(5)} = F^{(2)} + U^{(k_{u})}(F^{(4)}) \tag{4}$$

Prediction: The fused feature map $F^{(5)}$ underwent a convolutional prediction layer $P^{(k_{p}, s_{p}, p_{p})}$ for object classification and localization, where $k_{p}$, $s_{p}$, and $p_{p}$ are the kernel size, stride, and padding, respectively.

$$P = P^{(k_{p}, s_{p}, p_{p})}(F^{(5)}) = \mathrm{Conv}(F^{(5)}, k_{p}, s_{p}, p_{p}) \tag{5}$$

The SecondFPN fusion strategy enhanced feature representation and increased the network’s sensitivity to multi-scale objects, demonstrating exceptional performance in complex autonomous driving and robotic navigation scenarios.
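The 1 × 1 channel reduction and the upsample-and-add fusion of Equations (2) to (4) can be illustrated on tiny nested-list feature maps. This is a sketch under assumptions: nearest-neighbour upsampling is a stand-in for $U^{(k_{u})}$, which in the actual network may be a learned deconvolution, and the function names are hypothetical.

```python
def conv1x1(fmap, weights):
    """1x1 convolution (k=1, s=1, p=0): a per-pixel linear map over channels.

    fmap: H x W x C_in nested lists; weights: C_out x C_in nested lists.
    """
    return [
        [[sum(w * v for w, v in zip(w_row, px)) for w_row in weights] for px in row]
        for row in fmap
    ]


def upsample_nearest(fmap, factor):
    """Repeat every pixel factor times along both spatial axes."""
    out = []
    for row in fmap:
        wide = [px for px in row for _ in range(factor)]
        for _ in range(factor):
            out.append([list(px) for px in wide])
    return out


def fuse_add(a, b):
    """Element-wise addition of two feature maps of identical shape."""
    return [
        [[x + y for x, y in zip(pa, pb)] for pa, pb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


f2 = [[[1.0], [2.0]], [[3.0], [4.0]]]      # 2 x 2 x 1, higher resolution
f4 = [[[1.0]]]                              # 1 x 1 x 1, lower resolution
f5 = fuse_add(f2, upsample_nearest(f4, 2))  # Equation (4)
reduced = conv1x1([[[1.0, 2.0]]], [[0.5, 0.5]])
```

Addition requires both maps to have the same channel count, which is one reason the 1 × 1 convolutions precede the fusion step.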

3.4. Loss Function

The loss function comprised three components: localization loss $L_{loc}$, orientation classification loss $L_{dir}$, and classification loss $L_{cls}$. Each 3D bounding box was represented by a seven-dimensional vector $(x, y, z, w, l, h, \theta)$, where $(x, y, z)$ are the center coordinates, $(w, l, h)$ are the dimensions, and $\theta$ is the orientation angle.

Localization Loss $L_{loc}$: The localization loss used the Smooth $L1$ loss function [37], combining $L1$ and $L2$ loss characteristics to measure the discrepancy between predicted and ground truth bounding boxes. It balanced smoothness for small errors with robustness for larger deviations.

$$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b) \tag{6}$$

where $\Delta b$ represents the residual between predicted $\hat{b}$ and ground truth $b$.

The Smooth $L1$ loss function [37], as depicted in Figure 3, is defined as

$$\mathrm{SmoothL1}(\Delta b) = \begin{cases} 0.5 (\Delta b)^{2} & \text{if } |\Delta b| < 1 \\ |\Delta b| - 0.5 & \text{otherwise} \end{cases} \tag{7}$$
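Equations (6) and (7) translate directly into code. In this sketch the residual is assumed to be the plain difference between predicted and ground-truth values; detectors of this family often encode residuals relative to the anchor first, which the text does not specify.

```python
def smooth_l1(delta):
    """Smooth L1 of one residual: quadratic below 1, linear otherwise."""
    a = abs(delta)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def localization_loss(pred, target):
    """Sum of Smooth L1 terms over the seven box parameters (x, y, z, w, l, h, theta)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


pred = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
target = [0.0] * 7
loss = localization_loss(pred, target)  # 0.125 + 1.5 = 1.625
```

The two branches meet at $|\Delta b| = 1$ with value 0.5, so the loss is continuous while large errors contribute only linearly.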

Orientation Classification Loss $L_{dir}$: The orientation classification loss employed the Softmax function to optimize the model’s orientation predictions and minimize classification errors as follows:

$$L_{dir} = - \sum_{i = 1}^{N} \sum_{c = 1}^{C} y_{i, c} \log(P_{i, c}) \tag{8}$$

where $N$ is the number of instances in a batch, $C$ is the number of orientation classes, $y_{i, c}$ is an indicator variable (1 if the true orientation of object $i$ is $c$, otherwise 0), and $P_{i, c}$ is the predicted probability for orientation class $c$.
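A small sketch of Equation (8) follows. Because $y_{i, c}$ is one-hot, the inner sum reduces to the log-probability of the true class; the function names are hypothetical and the logits are assumed to come from the orientation head.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one instance's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direction_loss(batch_logits, true_classes):
    """Cross-entropy summed over the batch, as in Equation (8)."""
    return -sum(
        math.log(softmax(logits)[c])
        for logits, c in zip(batch_logits, true_classes)
    )


# Two orientation classes with equal logits: the loss is log(2).
loss = direction_loss([[0.0, 0.0]], [0])
```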

Classification Loss $L_{cls}$: The classification loss, using Focal Loss [37], addressed class imbalance by focusing on hard-to-classify examples while down-weighting easy ones as follows:

$$L_{cls} = - \sum_{a} \alpha_{a} (1 - p^{a})^{\gamma} \log p^{a} \tag{9}$$

where $p^{a}$ is the predicted probability for class $a$, $\alpha_{a}$ is a balancing factor, and $\gamma$ is a modulation term used to reduce the emphasis on well-classified examples.
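Equation (9) can be sketched as below, using the values $\alpha_{a} = 0.25$ and $\gamma = 2$ reported later in this section. As a simplifying assumption, `probs` holds the $p^{a}$ terms that enter the sum and a single shared $\alpha$ is used for all of them.

```python
import math


def focal_loss(probs, alpha=0.25, gamma=2.0):
    """Focal loss over the probabilities p^a that enter the sum in Equation (9)."""
    return -sum(alpha * (1.0 - p) ** gamma * math.log(p) for p in probs)


easy = focal_loss([0.9])  # confident prediction, strongly down-weighted
hard = focal_loss([0.1])  # poorly classified example, dominates the loss
```

Setting `gamma=0` and `alpha=1` recovers the ordinary negative log-likelihood, which makes the effect of the modulating factor easy to check.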

The total loss $L$ was a weighted sum of these components:

$$L = \frac{1}{N_{pos}} (\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir}) \tag{10}$$

where $N_{pos}$ is the number of positive anchors. The optimal balancing factors were determined through cross-validation and a comparative analysis of performance metrics across various parameter settings. For example, the $\alpha_{a}$ parameter was optimally set at 0.25 following a stringent experimental protocol. Initially, the dataset underwent categorical distribution scrutiny to detect under-represented classes. Subsequently, a five-fold cross-validation scheme was implemented, adjusting $\alpha_{a}$ incrementally from 0.1 to 0.5 in steps of 0.05. Analysis of F1 scores and precision-recall curves across different $\alpha_{a}$ settings confirmed that the model reached its optimal classification equilibrium at $\alpha_{a} = 0.25$, characterized by a significant boost in recall for minority classes without compromising precision. Moreover, the incorporation of an early stopping criterion averted overfitting, ensuring the model’s robust generalization capabilities. In practice, the balancing factors were set to $\alpha_{a} = 0.25$, $\gamma = 2$, $\beta_{loc} = 2$, $\beta_{cls} = 1$, and $\beta_{dir} = 0.1$ to optimize model performance.
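Equation (10) with the reported weights can be written as a one-line helper. The guard against a batch with no positive anchors is an assumption added for numerical safety; the paper does not describe how that case is handled.

```python
def total_loss(l_loc, l_cls, l_dir, n_pos, beta_loc=2.0, beta_cls=1.0, beta_dir=0.1):
    """Weighted sum of the three loss terms, normalised by the positive anchor count."""
    weighted = beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir
    return weighted / max(n_pos, 1)  # assumption: avoid division by zero


# (2 * 1.0 + 1 * 2.0 + 0.1 * 10.0) / 5 = 1.0
loss = total_loss(l_loc=1.0, l_cls=2.0, l_dir=10.0, n_pos=5)
```

With these weights the localization term counts twice as much as classification, while the orientation term acts only as a light auxiliary signal.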

4. Experiments and Analysis

An in-depth exploration of experimental design and performance evaluation is conducted. Section 4.1 delineates the characteristics of the selected dataset. Section 4.2 introduces the performance metrics for the experiment. Section 4.3 provides the configuration of the experimental environment. Section 4.4 meticulously analyzes the experimental results to assess the algorithm’s performance.

4.1. Datasets

The KITTI dataset [9] is a key benchmark in autonomous driving, featuring diverse environments such as urban, rural, and highway settings, as shown in Figure 4. It includes varied objects such as vehicles and pedestrians with challenges like occlusion, rotation, multi-object tracking, and scale variations. The dataset is divided into three difficulty levels: easy (clear backgrounds, minor occlusion), moderate (dense object overlap), and hard (complex scenes with severe occlusion). MS3D is evaluated on 7481 KITTI samples, with 3712 for training and 3769 for validation, to assess the method’s generalization and robustness. The KITTI-trained 3D object detection algorithm [38] achieves effective real-world application, underscoring the KITTI dataset’s crucial role in enhancing the algorithm’s ability to adapt to complex 3D environments.

4.2. Evaluation Metrics

AP (average precision): This metric evaluates model accuracy by averaging precision at various recall levels. A higher AP indicates superior detection accuracy.

F1 Score: The harmonic mean of precision and recall, reflecting the balance between avoiding false positives and false negatives. Higher F1 scores denote improved overall performance.

IoU (Intersection over Union): This metric measures the overlap between predicted and ground truth bounding boxes. IoU values range from 0 to 1, with higher values signifying better localization accuracy. This study employs an IoU threshold of 0.5.

Frame Rate: This metric quantifies the number of frames processed per second in video, image, or point cloud analysis. It is measured in Hertz (Hz), where 1 Hz equals one frame per second.
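Two of the metrics above are simple enough to sketch directly. The IoU here is computed for axis-aligned 2D boxes as a simplification; KITTI boxes are rotated, so a full evaluation needs polygon intersection. Function names are hypothetical.

```python
def iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


overlap = iou_axis_aligned((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))  # 1 / 7
is_match = overlap >= 0.5  # threshold used in this study
f1 = f1_score(tp=8, fp=2, fn=2)
```

A detection counts as a true positive only when its IoU with a ground-truth box reaches the threshold, so the IoU choice feeds directly into AP and F1.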

4.3. Experimental Setup

The point cloud data are restricted to the spatial dimensions of x-axis [0, 70.4] meters, y-axis [−40, 40] meters, and z-axis [−3, 1] meters, focusing on the vehicle’s front. Voxelization is performed using 0.16 m × 0.16 m × 4 m dimensions, creating vertical columns with a height of 4 m and a grid resolution of 0.16 m × 0.16 m in the xy-plane. Each voxel is limited to 32 points, totaling 40,000 voxels. Experiments are conducted on an Ubuntu 18.04 LTS system with PyTorch 1.7.1, utilizing CUDA 11.0 and cuDNN 8.0.4 for GPU acceleration.
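The grid dimensions implied by these settings follow from dividing each range by the voxel size. The helper below is a hypothetical illustration; rounding is used to absorb floating-point error in the division.

```python
def grid_shape(x_range, y_range, z_range, voxel_size):
    """Number of cells along x, y, and z for the given ranges and voxel size."""
    (x0, x1), (y0, y1), (z0, z1) = x_range, y_range, z_range
    vx, vy, vz = voxel_size
    return (
        round((x1 - x0) / vx),
        round((y1 - y0) / vy),
        round((z1 - z0) / vz),
    )


nx, ny, nz = grid_shape((0.0, 70.4), (-40.0, 40.0), (-3.0, 1.0), (0.16, 0.16, 4.0))
cells = nx * ny * nz  # 440 * 500 * 1 = 220000
```

The grid has 220,000 cells, which suggests that the 40,000 figure is a cap on non-empty voxels rather than the grid size; that reading is an assumption, since the text does not state it explicitly.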

To enhance model generalization, extensive data augmentation is employed, including global rotation, scaling, and flipping of the point clouds, as well as local adjustments near ground truth targets. Additionally, a positive sample filtering strategy [13] reduces the weights of difficult samples with significant overlap, improving training quality and focusing the model on more discriminative features to boost detection accuracy and minimize false positives.
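The global augmentations named above can be sketched for a list of points. The parameter ranges used in the experiments are not given, so the values below are placeholders, and in practice the ground-truth boxes must be transformed with the same rotation, scale, and flip.

```python
import math


def augment(points, angle, scale, flip_y):
    """Rotate about the z-axis, scale uniformly, and optionally mirror y."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z, r in points:
        # Rotation in the xy-plane; z and intensity are unaffected by it.
        xr = cos_a * x - sin_a * y
        yr = sin_a * x + cos_a * y
        xr, yr, zr = xr * scale, yr * scale, z * scale
        if flip_y:
            yr = -yr
        out.append((xr, yr, zr, r))
    return out


rotated = augment([(1.0, 0.0, 0.0, 0.5)], angle=math.pi / 2, scale=2.0, flip_y=False)
flipped = augment([(1.0, 0.0, 0.0, 0.5)], angle=math.pi / 2, scale=2.0, flip_y=True)
```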

4.4. Experiment Analysis

4.4.1. Accuracy Analysis

Tables 2–4 present a comparative analysis of MS3D against the classical algorithms VoxelNet, SECOND, and PointPillars across easy, moderate, and hard difficulty levels on the KITTI dataset.

In Table 2, MS3D demonstrates superior performance in the easy category, achieving AP (average precision) scores of 0.84 for BBOX, 0.94 for BEV, and 0.93 for 3D, outperforming competing methods. Notably, MS3D’s 3D AP surpasses VoxelNet by 12%, highlighting its strength in 3D object detection. In addition, MS3D’s F1 Scores of 0.54, 0.56, and 0.58 in the BBOX, BEV, and 3D categories, respectively, also outperform the other methods. In the 3D category in particular, its F1 Score is 8% higher than that of VoxelNet, indicating a superior balance between precision and recall.

In the moderate category of Table 3, the AP of MS3D reaches 0.82, 0.92, and 0.91 for the BBOX, BEV, and 3D object detection categories, respectively. This represents improvements of (0.15, 0.16, 0.16), (0.06, 0.07, 0.07), and (0.04, 0.06, 0.05) over VoxelNet, SECOND, and PointPillars, respectively. Additionally, the F1 Scores of MS3D are 0.54, 0.56, and 0.56 for the BBOX, BEV, and 3D categories, which are also better than those of the other methods. This indicates that MS3D has a significant advantage in detection accuracy and generalization ability in the moderate category.

In the difficult category, MS3D demonstrates superior accuracy over other methods. As indicated in Table 4, MS3D achieves higher AP scores across the BBOX, BEV, and 3D metrics. Specifically, for the 3D metric, MS3D attains an AP of 0.88, outperforming VoxelNet, SECOND, and PointPillars by 15%, 8%, and 5%, respectively. Furthermore, MS3D consistently delivers higher F1 Scores across the BBOX, BEV, and 3D metrics, underscoring its exceptional performance in complex detection tasks.

Across all three difficulty categories, both PointPillars and MS3D excel in real-time performance, achieving 62 Hz and 60 Hz, respectively, demonstrating high-efficiency point cloud processing. MS3D, while maintaining rapid processing speeds, significantly enhances detection efficacy with its multi-scale feature fusion approach, offering superior accuracy and robustness, particularly in complex environments, despite a slightly lower frame rate than PointPillars.

4.4.2. Representative Cases

Due to PointPillars’ exceptional performance, ranking second among the evaluated algorithms, this section primarily presents a visual comparison between MS3D (our model) and PointPillars. Figure 5a, Figure 6a and Figure 7a represent the point cloud data labels.

1. Easy Category: Minor Occlusion of Small Objects

In easy-category scenarios, detecting small objects remains challenging because of slight occlusions and sparse point cloud data, as shown in Figure 5. Sparse data provide too few points on small or distant targets, which are already under-represented in the dataset, so little feature information is available for learning and for the precise identification and classification of these objects. Figure 5b demonstrates that traditional methods like PointPillars struggle with missed detections, such as failing to identify a distant small vehicle obscured by closer ones, highlighting limitations in handling occlusions and multi-scale detection.

MS3D mitigates these issues by employing advanced feature fusion techniques, combining upsampling and downsampling to integrate multi-scale features effectively. This strategy enhances the model’s ability to focus on critical features and reduces the impacts of less relevant ones, improving detection performance for small objects. Additionally, 2D convolutions enhance the model’s ability to interpret 2D projections of point cloud data, enabling more accurate detection of small targets under slight occlusions. As illustrated in Figure 5c, MS3D’s feature fusion approach yields higher detection confidence, accurately identifying occluded distant small vehicles and significantly improving small object detection.

2. Moderate Category: Overlapping and Occluded Multi-Object

In moderate-category scenarios, multi-target overlap and occlusion pose significant challenges due to the unstructured nature of point cloud data, which complicates feature extraction, as shown in Figure 6. For instance, at an intersection, multiple vehicles overlap, with one black car nearly obscured on its left side. As depicted in Figure 6b, traditional methods like PointPillars fail to detect this occluded black car, resulting in detection failures.

In contrast, MS3D, leveraging the SecondFPN, effectively detects the occluded black car. The SecondFPN’s multi-scale feature extraction and fusion mechanisms enable the precise identification of targets across various scales, enhancing detection performance in complex occlusion scenarios. Additionally, the Adam optimizer’s adaptive learning rate facilitates handling the complexities of such scenes. As shown in Figure 6c, MS3D’s feature fusion capabilities significantly improve detection accuracy, effectively mitigating interference from overlapping targets.

3. Difficult Category: Complex Backgrounds and Intricate Trajectories

In challenging scenarios with complex backgrounds and diverse, dynamically moving targets such as pedestrians, bicycles, and vehicles, detecting objects is highly demanding, as shown in Figure 7. Point cloud data can be contaminated by noise, leading to inaccurate detections. Traditional methods like PointPillars often struggle with distinguishing between background and target features, resulting in false positives and missed detections. As illustrated in Figure 7b, traditional approaches may misclassify traffic signs, curved lane dividers, tall flower beds, and multi-track pedestrians as vehicles. Moreover, a fast-moving red car and a partially occluded blue car (due to backlighting and partial visibility) are not detected, highlighting the limitations of these methods in handling background complexity and incomplete imagery.

In contrast, MS3D effectively detects the fast-moving red car and the severely occluded blue car while minimizing false positives, as shown in Figure 7c. Leveraging SecondFPN’s multi-scale feature extraction and fusion capabilities, MS3D significantly enhances point cloud feature representation, enabling precise focus on critical features and accurate detection of partially visible targets. The incorporation of 2D convolutions improves the model’s interpretation of 2D projections from point cloud data, thereby enhancing detection performance for severely occluded and dynamically moving objects. Additionally, the Adam optimizer’s efficient parameter update strategy facilitates rapid convergence and high accuracy in complex scenes. Collectively, these advancements markedly improve MS3D’s performance in challenging and intricate environments.

Despite advancements, the detection of pedestrians on multiple trajectories, especially in occluded scenarios such as individuals propelling bicycles, continues to pose challenges in terms of missed detections. While MS3D has enhanced detection performance through multi-scale feature fusion, the single fusion strategy struggles to adequately extract crucial information to address occlusion challenges, particularly in cases like cyclists. Additionally, the limitations of 2D Convolutional Neural Networks in boundary and detail processing, compounded by the small size of pedestrian targets, sparsity of point clouds, and motion artifacts caused by complex trajectories, all contribute to the complexity of feature extraction and recognition, resulting in persistent issues of missed detections. Subsequent research endeavors will be directed towards the refinement of motion distortion mitigation strategies and the identification and compensation of motion artifacts to bolster the detection precision for pedestrians in multi-track walking scenarios.

5. Conclusions

MS3D pushes the envelope in 3D object detection for autonomous driving by harmonizing a unique blend of spatial segmentation, sophisticated feature extraction, and integration with 2D CNNs, which together elevate detection efficiency and precision. The innovative SecondFPN module enhances feature representation, markedly reinforcing detection stability. Additionally, the strategic use of data augmentation coupled with positive sample filtering is crafted to amplify discriminative feature learning. The training regimen is fine-tuned with the Adam optimizer and an adaptive learning rate approach, assuring rapid convergence and high accuracy. Benchmark tests on the KITTI dataset reveal that MS3D surpasses state-of-the-art methods, including VoxelNet, SECOND, and PointPillars, particularly in detecting small, distant objects, dealing with scenes of multiple overlapping objects, and maneuvering through complex backgrounds, thus advancing the field of 3D object detection technology.

Despite these capabilities, MS3D has difficulty detecting pedestrians with complex trajectories in intricate urban environments. Future work will focus on more effective techniques for mitigating motion distortion and for identifying and compensating for motion artifacts, in order to improve the identification and segmentation of pedestrians under adverse conditions. Furthermore, multimodal perception strategies, including teacher–student distillation networks [39,40,41,42], will be used to integrate cross-modal data with point cloud features, enhancing environmental understanding and thus the generalization and robustness of pedestrian detection in complex scenarios.

Building on this foundation, we will leverage large-scale autonomous driving models [43,44,45] to enhance our 3D object detection algorithms, enabling them to interpret and navigate complex urban environments more effectively. This will contribute to an overall improvement in autonomous driving performance.

To validate the accuracy and reliability of the enhanced MS3D algorithms, we will conduct comprehensive evaluations on complex datasets. Our approach will combine feature-layer fusion, multilevel feature extraction, and data augmentation within a single evaluation framework. Using TensorFlow, JAX’s XLA, and Caffe2, we will optimize the algorithms for real-time operation and mobile deployment, with the aim of improving detection performance in autonomous driving perception.

Author Contributions

Conceptualization, G.Y. and Y.L.; methodology, Y.L.; software, Y.L. and W.Z.; validation, Y.L. and W.Z.; formal analysis, Y.L.; resources, G.Y.; data curation, Y.L. and W.Z.; writing—original draft preparation, Y.L.; writing—review and editing, G.Y.; project administration, G.Y.; funding acquisition, G.Y. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Fujian Province under Grant 2021J01865, the Natural Science Foundation of Xiamen (Grant No. 3502Z202473078), and the Education and Scientific Research Project for Middle-Aged and Young Teachers of Fujian Province (Grant No. JAT210678 and Grant No. JT180877).

Data Availability Statement

The original contributions presented in this study are included in this article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499.
2. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
3. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705.
4. He, C.; Zeng, H.; Huang, J.; Hua, X.-S.; Zhang, L. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882.
5. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
6. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 1201–1209.
7. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3D object detection in lidar point clouds. In Proceedings of the Conference on Robot Learning, PMLR, Virtual, 16–18 November 2020; pp. 923–932.
8. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. TANet: Robust 3D object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11677–11684.
9. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
10. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631.
11. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454.
12. Dong, Y.; Kang, C.; Zhang, J.; Zhu, Z.; Wang, Y.; Yang, X.; Su, H.; Wei, X.; Zhu, J. Benchmarking robustness of 3D object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1022–1032.
13. Zheng, W.; Tang, W.; Chen, S.; Jiang, L.; Fu, C.-W. Cia-ssd: Confident iou-aware single-stage object detector from point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 3555–3562.
14. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, 23–26 August 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 32–36.
15. Paul, A.; Mukherjee, D.P.; Das, P.; Gangopadhyay, A.; Chintha, A.R.; Kundu, S. Improved random forest for classification. IEEE Trans. Image Process. 2018, 27, 4012–4024.
16. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3Dssd: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048.
17. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3D lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962.
18. Wang, Y.; Han, X.; Wei, X.; Luo, J. Instance Segmentation Frustum–PointPillars: A Lightweight Fusion Algorithm for Camera–LiDAR Perception in Autonomous Driving. Mathematics 2024, 12, 153.
19. Dong, P.; Li, L.; Wei, Z. Diswot: Student architecture search for distillation without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11898–11908.
20. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. Largekernel3D: Scaling up kernels in 3D sparse cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13488–13498.
21. Zhou, C.; Zhang, Y.; Chen, J.; Huang, D. Octr: Octree-based transformer for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5166–5175.
22. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. Voxelnext: Fully sparse VoxelNet for 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683.
23. Wu, H.; Deng, J.; Wen, C.; Li, X.; Wang, C.; Li, J. CasA: A cascade attention network for 3-D object detection from LiDAR point clouds. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5704511.
24. Xu, Q.; Zhong, Y.; Neumann, U. Behind the curtain: Learning occluded shapes for 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 2893–2901.
25. He, C.; Li, R.; Zhang, Y.; Li, S.; Zhang, L. Msf: Motion-guided sequential fusion for efficient 3D object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5196–5205.
26. Zhu, B.; Wang, Z.; Shi, S.; Xu, H.; Hong, L.; Li, H. Conquer: Query contrast voxel-detr for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9296–9305.
27. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
28. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3D object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–52.
29. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612.
30. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3D detection. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 16494–16507.
31. Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y. Logonet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17524–17534.
32. Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse fuse dense: Towards high quality 3D detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427.
33. Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual sparse convolution for multimodal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21653–21662.
34. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
35. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114.
36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
37. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
38. Jiang, H.W.; Chen, M.Y.; Yuan, X.C. An algorithm for visual simultaneous localization and mapping with integrated hybrid attention instance segmentation. Laser Optoelectron. Prog. 2023, 60, 404–413.
39. Wang, T.; Hu, X.; Liu, Z.; Fu, C.-W. Sparse2Dense: Learning to densify 3D features for 3D object detection. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 38533–38545.
40. Wu, X.; Tian, Z.; Wen, X.; Peng, B.; Liu, X.; Yu, K.; Zhao, H. Towards large-scale 3D representation learning with multi-dataset point prompt training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19551–19562.
41. Liu, Y.; Kong, L.; Wu, X.; Chen, R.; Li, X.; Pan, L.; Liu, Z.; Ma, Y. Multi-Space Alignments Towards Universal LiDAR Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14648–14661.
42. Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.-G. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652.
43. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.-D. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979.
44. Sun, T.; Zhang, Z.; Tan, X.; Peng, Y.; Qu, Y.; Xie, Y. Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11059–11072.
45. Wang, Z.; Rao, Y.; Yu, X.; Zhou, J.; Lu, J. Point-to-pixel prompting for point cloud analysis with pre-trained image models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4381–4397.

Figure 1. MS3D network structure diagram.

Figure 2. SecondFPN structure.

Figure 3. A graph of the Smooth L1 loss function.

Figure 4. A collection scenario of the KITTI dataset.

Figure 5. Comparison of detection performance for small objects with minor occlusion.

Figure 6. Detection performance comparison in multi-object overlap and occlusion scenarios.

Figure 7. Detection performance comparison in complex background scenarios.

Table 1. List of 3D object detection algorithms.

Abbreviation	Full Name
3D-SSD [16]	Point-Based 3D Single-Stage Object Detector
IA-SSD [17]	Not All Points Are Equal: Learning Highly Efficient Point-Based Detectors for 3D LiDAR Point Clouds
VoxelNet [1]	VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
SECOND [2]	Sparsely Embedded Convolutional Detection
PointPillars [18]	PointPillars: Fast Encoders for Object Detection from Point Clouds
CIA-SSD [13]	Confident IoU-Aware Single-Stage Object Detector from Point Cloud
SA-SSD [4]	Structure-Aware Single-Stage 3D Object Detection from Point Cloud
Diswot [19]	Student Architecture Search for Distillation Without Training
LargeKernel3D [20]	Scaling Up Kernels in 3D Sparse CNNs
OcTr [21]	Octree-Based Transformer for 3D Object Detection
VoxelNeXt [22]	Fully Sparse VoxelNet for 3D Object Detection and Tracking
CasA [23]	A Cascade Attention Network for 3-D Object Detection from LiDAR Point Clouds
BtcDet [24]	Behind the Curtain: Learning Occluded Shapes for 3D Object Detection
MSF [25]	Motion-Guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences
ConQueR [26]	Query Contrast Voxel-DETR for 3D Object Detection
MV3D [27]	Multi-View 3D Object Detection Network for Autonomous Driving
EPNet [28]	Enhancing Point Features with Image Semantics for 3D Object Detection
PointPainting [29]	PointPainting: Sequential Fusion for 3D Object Detection
MVP [30]	Multimodal Virtual Point 3D Detection
LoGoNet [31]	Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion
SFD [32]	Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion
Virtual Conv [33]	Virtual Sparse Convolution for Multimodal 3D Object Detection

Table 2. A performance comparison of MS3D with other methods in the easy category.

Methods	Frame Rate (Hz)	AP (BBOX)	AP (BEV)	AP (3D)	F1 Score (BBOX)	F1 Score (BEV)	F1 Score (3D)
VoxelNet	4.4	0.69	0.82	0.81	0.47	0.49	0.50
SECOND	20	0.78	0.89	0.87	0.51	0.53	0.55
PointPillars	62	0.80	0.90	0.90	0.53	0.55	0.56
MS3D	60	0.84	0.94	0.93	0.54	0.56	0.58

Table 3. A performance comparison of MS3D with other methods in the moderate category.

Methods	Frame Rate (Hz)	AP (BBOX)	AP (BEV)	AP (3D)	F1 Score (BBOX)	F1 Score (BEV)	F1 Score (3D)
VoxelNet	4.4	0.67	0.76	0.75	0.46	0.48	0.51
SECOND	20	0.76	0.85	0.84	0.50	0.53	0.54
PointPillars	62	0.78	0.864	0.863	0.53	0.55	0.55
MS3D	60	0.82	0.92	0.91	0.54	0.56	0.56

Table 4. A performance comparison of MS3D with other methods in the difficult category.

Methods	Frame Rate (Hz)	AP (BBOX)	AP (BEV)	AP (3D)	F1 Score (BBOX)	F1 Score (BEV)	F1 Score (3D)
VoxelNet	4.4	0.63	0.74	0.73	0.45	0.47	0.49
SECOND	20	0.74	0.78	0.80	0.49	0.52	0.53
PointPillars	62	0.75	0.84	0.83	0.52	0.54	0.53
MS3D	60	0.77	0.89	0.88	0.53	0.55	0.55
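
As a reading aid for Tables 2–4, the following minimal Python sketch recomputes the 3D AP margin of MS3D over each baseline at every difficulty level. The numbers are copied from the tables above; the dictionary layout and the helper name `ap_margin` are illustrative assumptions and are not part of the authors' evaluation code.

```python
# 3D AP values copied from Tables 2-4 (easy, moderate, difficult).
# The data layout and helper below are hypothetical, for illustration only.
AP_3D = {
    "VoxelNet":     {"easy": 0.81, "moderate": 0.75,  "difficult": 0.73},
    "SECOND":       {"easy": 0.87, "moderate": 0.84,  "difficult": 0.80},
    "PointPillars": {"easy": 0.90, "moderate": 0.863, "difficult": 0.83},
    "MS3D":         {"easy": 0.93, "moderate": 0.91,  "difficult": 0.88},
}

LEVELS = ("easy", "moderate", "difficult")


def ap_margin(baseline, level, table=AP_3D):
    """Return MS3D's 3D AP minus the baseline's, in percentage points."""
    diff = table["MS3D"][level] - table[baseline][level]
    return round(diff * 100, 1)


if __name__ == "__main__":
    for name in ("VoxelNet", "SECOND", "PointPillars"):
        margins = {level: ap_margin(name, level) for level in LEVELS}
        print(f"{name}: {margins}")
```

Read this way, the 3D AP margin over PointPillars grows from 3.0 percentage points in the easy category to 5.0 in the difficult category, while the reported frame rate is slightly lower (60 Hz versus 62 Hz).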

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

