Article

PLC-Fusion: Perspective-Based Hierarchical and Deep LiDAR Camera Fusion for 3D Object Detection in Autonomous Vehicles

1 School of Computer Science and Engineering, Central South University, Changsha 410083, China
2 Department of Computer Science, University of Chenab, Gujrat 50700, Pakistan
3 School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
4 School of Computing, Newcastle University, Newcastle Upon Tyne NE4 5TG, UK
* Author to whom correspondence should be addressed.
Information 2024, 15(11), 739; https://doi.org/10.3390/info15110739
Submission received: 2 October 2024 / Revised: 11 November 2024 / Accepted: 14 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Emerging Research in Object Tracking and Image Segmentation)

Abstract

Accurate 3D object detection is essential for autonomous driving, yet traditional LiDAR models often struggle with sparse point clouds. We propose perspective-aware hierarchical vision transformer-based LiDAR-camera fusion (PLC-Fusion) for 3D object detection to address this. This efficient, multi-modal 3D object detection framework integrates LiDAR and camera data for improved performance. First, our method enhances LiDAR data by projecting them onto a 2D plane, enabling the extraction of object perspective features from a probability map via the Object Perspective Sampling (OPS) module. It incorporates a lightweight perspective detector, consisting of interconnected 2D and monocular 3D sub-networks, to extract image features and generate object perspective proposals by predicting and refining top-scored 3D candidates. Second, it leverages two independent transformers—CamViT for 2D image features and LidViT for 3D point cloud features. These ViT-based representations are fused via the Cross-Fusion module for hierarchical and deep representation learning, improving performance and computational efficiency. These mechanisms enhance the utilization of semantic features in a region of interest (ROI) to obtain more representative point features, leading to a more effective fusion of information from both LiDAR and camera sources. PLC-Fusion outperforms existing methods, achieving a mean average precision (mAP) of 83.52% and 90.37% for 3D and BEV detection, respectively. Moreover, PLC-Fusion maintains a competitive inference time of 0.18 s. Our model addresses computational bottlenecks by eliminating the need for dense BEV searches and global attention mechanisms while improving detection range and precision.

1. Introduction

Recent advances in deep learning and automotive sensors have accelerated the progress of autonomous driving technologies, significantly improving driving efficiency and road safety. Three-dimensional (3D) object detection systems have become a critical component of autonomous driving, gaining substantial attention from academia and industry. These systems allow autonomous vehicles to track and detect objects in dynamic, real-world environments. To reduce the cost of such systems, researchers have proposed image-based approaches that leverage the rich semantic information (e.g., color and texture) captured by 2D images [1,2]. However, 3D object detection algorithms that rely solely on image data may struggle with accurate object localization due to the lack of explicit depth information [3]. Although stereo cameras have been explored as a means to estimate depth, these methods often suffer from high computational complexity and sub-optimal performance [4].
In response to the challenges posed by single-modality detection, LiDAR-based systems have been developed to achieve high-precision depth perception. However, these systems face limitations such as sparse data, especially when objects are further from the sensor [5], as well as occlusions that result in incomplete 3D object representations, making it challenging to detect smaller or distant objects like bicycles and pedestrians [6]. Furthermore, distinguishing between foreground and background objects with similar geometric features can be problematic in the absence of semantic data, leading to higher false positive rates [7]. As such, the effectiveness of single-modality detection systems remains limited for practical use in autonomous driving [8,9].
To address these limitations, researchers have explored various multi-modal 3D object detection methods that combine LiDAR and camera modalities. Conventional fusion approaches often rely on dense feature combination techniques, such as PointPainting [10] and BEVFusion [8,11], which enrich LiDAR point clouds with pixel-level image features or project image-view features into a dense bird’s-eye view (BEV) space. However, these dense methods frequently face challenges such as data misalignment, high latency, and constrained detection ranges due to the substantial computational burden of dense view transformations. To reduce this overhead, recent methods have shifted towards sparse sampling, leveraging global attention mechanisms to fuse multi-modal features selectively [12,13,14]. Although effective in reducing computational costs, these sparse approaches often fall short of dense fusion methods in terms of overall performance, leaving open questions about the potential for sparse techniques to match or surpass dense detectors.
This study introduces PLC-Fusion, a perspective-aware hierarchical vision transformer-based LiDAR-camera fusion method for 3D object detection. It is a fully sparse multi-modality 3D object detector that achieves outstanding performance, surpassing both sparse and dense competitors. By enhancing the integration of rich LiDAR and camera representations in two key areas, feature sampling and ViT-based multi-modality fusion, PLC-Fusion closes this performance gap. First, we argue that the conventional approach of random sampling can be detrimental to the learning process, especially when additional effort is needed to shift point features toward ground-truth targets. To address this, we propose the Object Perspective Sampling (OPS) module for LiDAR and image data. OPS incorporates a lightweight perspective detector, consisting of interconnected 2D and monocular 3D sub-networks, to extract image features and generate object perspective proposals by predicting and refining top-scored 3D candidates. By aligning the learning process with ground-truth targets, our input-dependent approach improves the detector's ability to recognize complex and rich environments in high-resolution images.
Second, we introduce a hierarchical vision transformer-based LiDAR-camera fusion method for 3D object detection. It includes two main components: CamViT and LidViT, which independently learn the embedding representations of camera images and LiDAR point clouds. These multi-modal inputs are fused using a Cross-Fusion module, enabling hierarchical and deep representation learning. Our proposed CamViT extends the vision transformer (ViT) model for object detection, achieving competitive 2D object detection by splitting images into small patches for representation learning. Additionally, we introduce LidViT, a volumetric-based 3D object detector that partitions LiDAR point clouds into small cubic regions. Acting as the end-to-end backbone, LidViT discretizes the point cloud into 3D grids using a transformer encoder, providing significant computational efficiency and reduced memory usage while maintaining the original 3D shape information.
Key features of the PLC-Fusion model aim to ensure a smooth integration between pre-fusion and fusion processes. Our model fully leverages the transformer encoder, making it especially effective in traffic environments where input frames vary both spatially and temporally. Unlike earlier approaches [15,16,17], PLC-Fusion avoids information bottlenecks caused by 3D-to-2D projection, such as bird’s-eye view, range view, or depth image projections. Furthermore, unlike DeepFusion [18], which requires manual pixel-to-point cloud alignment, PLC-Fusion performs end-to-end fusion directly, further enhancing robustness and eliminating the need for manual alignment. This demonstrates the model’s strong adaptability and superior performance. Our experiments using the KITTI dataset [19] demonstrate that our approach outperforms state-of-the-art methods in 3D object detection, particularly in detecting pedestrians and cyclists. To summarize, the key contributions of this study are as follows:
  • We propose an efficient perspective-aware hierarchical multimodal vision transformer-based end-to-end fusion framework, which enhances robustness and adaptability. This leads to superior detection performance, particularly for pedestrians and cyclists in complex traffic environments.
  • The OPS module is designed to improve feature sampling by aligning LiDAR and image data with ground-truth targets. This module incorporates a lightweight perspective detector that combines 2D and monocular 3D sub-networks to generate refined object perspective proposals, significantly enhancing object recognition in complex scenarios.
  • Our multi-modal cross-fusion approach leverages CamViT and LidViT to independently learn embedding representations from LiDAR point cloud and camera images. These outputs are fused via the Cross-Fusion module for hierarchical and deep representation learning, resulting in improved performance and computational efficiency.
  • Through experiments conducted in diverse urban traffic settings, the robustness and effectiveness of our method have been validated using the KITTI dataset.
This proposed framework (PLC-Fusion) presents a significant step forward in practical, real-world 3D object detection systems for autonomous vehicles, aiming to meet the demands of modern smart mobility. The manuscript is organized as follows: Section 1 introduces the motivations and significance of the research. Section 2 reviews related work. Section 3 details the proposed PLC-Fusion framework, and Section 4 details the experimental setup, evaluation metrics, and results of the experiments and discusses their implications. Finally, Section 5 concludes the study.

2. Related Work

This section reviews recent work on camera-based and LiDAR-based 3D object detection, multimodal and feature fusion approaches, and vision transformers in object detection for autonomous vehicles.

2.1. Camera-Based Object Detection

Bounding box coordinates in 3D space are often normalized using geometric constraints. To improve the precision of 3D object localization, recent methods for image-based 3D detection have aimed at recovering depth information. For instance, pseudo point clouds are generated from depth images by pseudo-LiDAR methods [20,21,22], enabling 3D object detection using these pseudo representations. Reading et al. [23] utilized a 3D convolutional neural network (CNN) to generate pseudo points, turning sparse LiDAR-derived information into dense depth reconstructions. Additionally, adaptive filters are employed by DL4CN [24], a dynamic depthwise dilated local convolutional network, to automatically learn depth maps from images. Although these geometric normalization strategies have seen substantial advances, they still lag behind in generating highly precise results.
Another approach, CaDDN [21], offers a fully end-to-end, differentiable monocular 3D detection framework that predicts categorical depth distributions for individual pixels. Despite these innovations, image-based 3D detectors frequently struggle with accuracy, producing only approximate or coarse 3D bounding boxes. The geometric link between 2D images and 3D space was exploited by Chen et al. [25] and Brazil et al. [26] to facilitate object detection. Yet, even with these technological strides, the typical outcome remains relatively coarse 3D bounding boxes.

2.2. LiDAR-Based Object Detection

As many publicly available benchmarks demonstrate, state-of-the-art 3D object detection performance is achieved by directly regressing 3D bounding boxes from point clouds. Existing methods either leverage effective convolutional neural networks (CNNs) to process voxel features [5,27,28] or employ PointNet architectures [8,29,30] to handle the sparsity and irregularity of raw point clouds. Studies suggest that combining both voxelized and raw point cloud data formats can lead to improved detection accuracy. For example, SVGA-Net [30] constructs a convolutional network that explicitly utilizes the structured data in 3D point clouds to guide the backbone in recognizing object structures. Similarly, PV-RCNN [29] introduces modules for both keypoint and voxel set abstraction by condensing voxel representations into a limited number of keypoints and pooling these keypoints to represent region-of-interest (RoI) grid points. However, LiDAR-based methods can struggle with feature ambiguity between similar objects without semantic information.

2.3. Fusion Point-Based Methods

The advantages of combining image and point cloud data have garnered increasing attention. Early fusion methods such as PointPainting [10], StructuralIF [31], and MV3D [17] augment point clouds by adding labels or features from image data. 3D-CVF [32] performs auto-calibrated projection to convert image features into a smoother bird’s-eye view (BEV) map, followed by simple concatenation of features. EPNet [33] introduces the Li-Fusion module, establishing pointwise correspondences for more precise multimodal feature aggregation. DFIM [34] addresses this by leveraging geometric and semantic consistency to transform 2D and 3D detection candidates into a unified set of joint detection candidates. Challenges arise when fusing 3D object detection data from point clouds and RGB images. A common practice is to extract features from RGB images and point clouds using separate backbones for each modality, as seen in various studies [13,35,36,37].
However, the dual-backbone approach is computationally expensive and memory-intensive due to the burden of maintaining two large-scale networks. Traditional fusion techniques typically focus on either optimizing the fusion process or improving accuracy. Recent methods, such as AVOD [38], aim to enhance model efficiency by automating feature generation, moving away from the manual processes used in earlier models like MV3D [39,40,41]. Bai et al. [14] demonstrated that pointwise fusion provides more flexibility than ROI-based approaches. Based on the fusion strategy, methods can be categorized into pointwise fusion [42] and region of interest (ROI)-based fusion [11,18,43,44,45]. As demonstrated by Liu et al. [33], pointwise fusion offers greater flexibility than ROI-based fusion. Inspired by pointwise fusion, this work explores directly aggregating point features from raw RGB images with point cloud features. In contrast to these strategies, the proposed solution uses a single backbone. Moreover, the model takes not only an RGB image but also the RGB+ image as input.

2.4. ViT-Based Methods

Transformer approaches, initially introduced in the context of machine translation and natural language processing (NLP) [46], have made significant advancements in various domains, including image classification. However, the application of transformers in 3D point cloud classification is relatively new but promising. Point cloud transformers (PCT) [47] have emerged as an innovative technique for 3D object recognition, leveraging transformers to process point cloud data [14].
Recent advancements in computer vision have ushered in a new paradigm for object detection, driven by the remarkable success of transformer models across various domains [48,49,50,51,52,53,54]. By casting detection as a set prediction problem and using transformers with parallel decoding to detect objects in 2D images, DETR [48] transformed detection, exploiting the transformer's strength in learning local context-aware representations [10,55,56]. A modified version of DETR [49] refined this approach by introducing a deformable attention module for cross-scale aggregation. Similarly, in the realm of point clouds, recent methodologies [50,51] have begun exploring the utilization of self-attention mechanisms for classification and segmentation tasks.
Fusion-based approaches combine image and LiDAR data to capitalize on both modalities’ strengths. Methods like PointPainting and MVX-Net integrate features at early or mid-level stages, whereas EPNet [33] enhances feature correspondence for more accurate detection. However, these methods struggle with data misalignment and imprecise fusion. Vision transformer (ViT)-based approaches, including point transformer and stratified transformer, introduce global feature learning and improved local representations. Yet, they rely on local neighborhood information, leading to performance degradation when point data are incomplete or inconsistent. In contrast, our proposed framework integrates LiDAR and camera data using a robust perspective-aware hierarchical multimodal vision transformer, improving detection performance, particularly for complex environments involving pedestrians and cyclists. Our cross-fusion method and OPS module further refine feature alignment and enhance object recognition.

3. Methodology

3.1. Overview

The proposed methodology leverages a multi-modal approach for 3D object detection by integrating deep learning techniques on the KITTI dataset. First, multi-modal feature extraction is performed using a geometry stream in which raw point clouds are projected onto a 2D image plane and augmented with point-wise probability values, which helps distinguish between objects with similar shapes, as shown in Figure 1. OPS incorporates a lightweight perspective detector, consisting of interconnected 2D and monocular 3D sub-networks, to extract image features and generate object perspective proposals by predicting and refining top-scored 3D candidates, as shown in Figure 2. The CamViT branch applies a vision transformer (ViT) to traffic scenes by dividing images into patches and learning global and local features through a transformer encoder, while the LiDViT branch processes sparse LiDAR point clouds by voxelizing them and extracting spatial features via a point-based transformer architecture. A cross-fusion module then combines the features from CamViT and LiDViT using cross-attention, enabling both modalities to enhance the final 3D object detection through multi-modal fusion. This fusion allows the model to capture both visual and geometric information, improving accuracy in complex traffic environments, as shown in Figure 1.

3.2. MultiModal Feature Extraction

The matrix $M \in \mathbb{R}^{H \times W \times C}$ is generated using DeepLabv3+ with a softmax activation function. The number of classes is set to four ($q_c$, $q_p$, $q_r$, and $q_b$), representing the probabilities for car, pedestrian, cyclist, and background, respectively. In the geometry stream, raw points are projected onto a 2D image plane using the following equation to facilitate the extraction of corresponding semantic features:
$$ M \begin{bmatrix} u & v & 1 \end{bmatrix}^{\top} = K R_g P, $$
where $K$ and $R_g$ represent the intrinsic and extrinsic parameters, respectively, and $P = \{p_i = (x_i, y_i, z_i)\}_{i=1}^{N} \in \mathbb{R}^{N \times 3}$ defines the raw point set. Subsequently, an augmented point set $AP = \{(x_i, y_i, z_i, q_c, q_p, q_r, q_b)\}_{i=1}^{N} \in \mathbb{R}^{N \times 7}$ is created by concatenating the raw point coordinates with the point-wise probability values extracted from the probability map. The enhanced points, now imbued with semantic information, assist the model in distinguishing between objects with similar shapes, thereby reducing the number of false positives.
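To make this projection and point-painting step concrete, the following NumPy sketch projects raw LiDAR points with the calibration matrices $K$ and $R_g$ and appends the four class probabilities sampled at the projected pixel locations. It is an illustrative sketch only; the function and variable names are ours, and a (3, 4) extrinsic matrix and an (H, W, 4) probability map are assumed.

```python
import numpy as np

def augment_points_with_semantics(points, prob_map, K, R_g):
    """Append per-point class probabilities (q_c, q_p, q_r, q_b) to raw
    LiDAR points by projecting them onto the image plane.

    points:   (N, 3) LiDAR points (x, y, z)
    prob_map: (H, W, 4) softmax output of the segmentation network
    K:        (3, 3) camera intrinsics
    R_g:      (3, 4) LiDAR-to-camera extrinsics
    Hypothetical sketch, not the authors' released implementation.
    """
    n = points.shape[0]
    pts_h = np.concatenate([points, np.ones((n, 1))], axis=1)   # homogeneous (N, 4)
    uvw = K @ (R_g @ pts_h.T)                                   # (3, N) projected coordinates
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)

    h, w, _ = prob_map.shape
    valid = (uvw[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    probs = np.zeros((n, 4), dtype=prob_map.dtype)
    probs[valid] = prob_map[v[valid], u[valid]]                 # sample the probability map
    return np.concatenate([points, probs], axis=1)              # augmented set AP, (N, 7)
```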

3.3. Object Perspective Sampling

In recent studies, query generation in 3D space is typically independent of the input data and optimized as network parameters based on randomly distributed reference points [8], anchor boxes [29], or pillars [30]. However, such input-independent query generation requires additional learning to guide proposals toward ground-truth object targets, as demonstrated in 2D detection tasks. Three-dimensional query-based and 2D detectors provide different predictions, with 2D detectors often performing better on small and distant objects. Leveraging this 2D robustness, our Object Perspective Sampling (OPS) module utilizes the perceptual advantages of 2D detection to generate 3D queries, thereby enhancing the final 3D detection, as shown in Figure 3.
The OPS module integrates 2D [21] and monocular 3D [29] sub-networks as the perspective detector. The monocular 3D sub-network predicts raw 3D attributes, including depth ($d$), rotation angles, sizes, and velocities, from multi-view and multi-scale image features. Simultaneously, the 2D sub-network predicts the corresponding 2D properties such as center coordinates $[c_x, c_y]$, confidence scores, and category labels. For each view $v$, we project the 2D box centers into 3D space using the intrinsic matrix $I_v$ and extrinsic matrix $E_v$ of the respective camera, as shown in Figure 3.
The Cross-Attention module then initiates queries by selecting the top $N_k$ boxes, ranked by confidence scores. After filtering intersecting boxes using non-maximum suppression (NMS) in 3D space, the final query proposals are determined. The process is mathematically expressed as
$$ q_i = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \sum_{m=1}^{M} \mathrm{BS}\!\left(r_{\mathrm{cam},v}^{m},\; P_v\!\left(c_i^{3D}\right)\right), $$
where $P_v(c_i^{3D})$ projects the 3D center $c_i^{3D}$ onto the $v$th image using the corresponding camera parameters. The set of hit views is denoted by $\mathcal{V}$, and $\mathrm{BS}(\cdot)$ represents the bilinear sampling function. To ensure coverage, we retain $N_r$ randomly initialized query boxes as fallback candidates in case certain objects are missed. Finally, our OPS module produces $N_q = N_k + N_r$ query proposals.
By utilizing this input-dependent query generation approach, the OPS module improves the 3D detector's ability to understand perspective priors, enabling better detection of small and distant objects. The 3D boxes are formed by combining the predicted 3D center $c^{3D}$ with the corresponding size, rotation angle, and velocity estimates.
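The query-selection logic of OPS can be sketched as follows: rank the perspective detector's 3D candidates by confidence, suppress heavily overlapping boxes with NMS, and pad the result with $N_r$ randomly initialized fallback queries. This is a simplified illustration (the BEV NMS below ignores box rotation, and all names and thresholds are ours), not the module's actual implementation.

```python
import numpy as np

def _bev_iou(a, b):
    """Axis-aligned bird's-eye-view IoU between two boxes (x, y, z, l, w, h, yaw);
    yaw is ignored for simplicity in this sketch."""
    ax1, ax2 = a[0] - a[3] / 2, a[0] + a[3] / 2
    ay1, ay2 = a[1] - a[4] / 2, a[1] + a[4] / 2
    bx1, bx2 = b[0] - b[3] / 2, b[0] + b[3] / 2
    by1, by2 = b[1] - b[4] / 2, b[1] + b[4] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[3] * a[4] + b[3] * b[4] - inter
    return inter / union if union > 0 else 0.0

def ops_query_proposals(boxes3d, scores, n_keep=100, n_random=20, iou_thr=0.5):
    """Select N_k top-scored perspective proposals with a simple greedy NMS
    and append N_r random fallback queries (N_q = N_k + N_r).

    boxes3d: (M, 7) candidates as (x, y, z, l, w, h, yaw); scores: (M,)
    """
    order = np.argsort(-scores)[: 4 * n_keep]            # pre-select top-scored candidates
    keep = []
    for i in order:
        if all(_bev_iou(boxes3d[i], boxes3d[j]) < iou_thr for j in keep):
            keep.append(i)
        if len(keep) == n_keep:
            break
    queries = boxes3d[keep]

    # Fallback queries drawn uniformly over an assumed KITTI-like detection range.
    rand = np.zeros((n_random, 7))
    rand[:, 0] = np.random.uniform(0.0, 70.4, n_random)
    rand[:, 1] = np.random.uniform(-40.0, 40.0, n_random)
    rand[:, 2] = np.random.uniform(-3.0, 1.0, n_random)
    rand[:, 3:6] = queries[:, 3:6].mean(axis=0) if len(keep) else 1.0
    return np.concatenate([queries, rand], axis=0)        # N_q query proposals
```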

3.4. Camera ViT Branch

In this study, we propose CamViT, a vision transformer-based approach tailored for object detection in traffic environments. This model applies a transformer framework by partitioning images into patches, enabling effective feature representation learning. CamViT extends the ViT model specifically for object recognition in complex traffic scenes. Given an input RGB image $I \in \mathbb{R}^{H \times W \times 3}$, the image is split into $N_p$ patches, each of size $p_h \times p_w \times 3$. The number of patches is defined as
$$ N_p = \frac{H}{p_h} \cdot \frac{W}{p_w}. $$
Each patch is then flattened into a 1D vector of size $p_h \times p_w \times 3$, resulting in a matrix $P \in \mathbb{R}^{N_p \times (p_h \cdot p_w \cdot 3)}$ representing the image in terms of its patch-wise features. These patches are projected through a linear projection layer (LP), which maps each patch into a feature vector of size $D_f$, defined as
$$ p_{f_i} = \mathrm{LP}(p_i), \quad i \in [1, N_p], $$
where $p_{f_i} \in \mathbb{R}^{D_f}$ represents the feature vector of the $i$th patch. To preserve spatial information, positional encodings $E_{pos}$ are added to each patch embedding. Furthermore, a learnable class token $T_{cls}$ is prepended to the sequence to capture global information from the image, similar to BERT's [CLS] token. The input sequence to the transformer encoder becomes
$$ S_0 = [T_{cls};\, p_{f_1} + E_{pos_1};\, p_{f_2} + E_{pos_2};\, \ldots;\, p_{f_{N_p}} + E_{pos_{N_p}}] \in \mathbb{R}^{(N_p+1) \times D_f}. $$
The input sequence $S_0$ is then passed through the transformer encoder, which alternates between multi-head self-attention (QKV) and feedforward neural network (FFN) layers, where QKV refers to the query, key, and value matrices used in self-attention, as shown in Figure 4. The attention mechanism of each encoder layer transforms the input into queries $Q_l$, keys $K_l$, and values $V_l$ at layer $l$, and the attention output $A_l$ is computed as
$$ A_l = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{D_f}}\right) V_l. $$
The attention output is then added to the input via residual connections after applying layer normalization (LN):
$$ S_l' = A_l + S_{l-1}. $$
Each attention block is followed by an FFN, which is applied independently to each position in the sequence. The FFN consists of two linear layers with a ReLU activation in between:
$$ F_l = \mathrm{ReLU}\left(W_2 \cdot \mathrm{LN}(W_1 \cdot S_l')\right), $$
where $W_1$ and $W_2$ are the learnable weights of the FFN at layer $l$, and residual connections are applied again:
$$ S_l = F_l + S_l'. $$
This process continues through $L$ layers, where $L$ is the total number of layers in the transformer encoder. After passing through all transformer layers, the final output sequence $S_L \in \mathbb{R}^{(N_p+1) \times D_f}$ contains the patch-wise learned features along with the global image representation from the class token. The final feature vector $G_c$ is extracted from the class-token representation using layer normalization:
$$ G_c = \mathrm{LN}(S_L^0), $$
where $G_c \in \mathbb{R}^{D_f}$ represents the global image feature learned by CamViT. The extracted global feature $G_c$ can be used for 2D object detection by adding an object detection head. Additionally, these features can be concatenated with other sensor data (e.g., LiDAR) for multi-modal fusion, enhancing the model's ability to perform comprehensive scene understanding and object detection. This fusion capability allows the CamViT model to capture both local details and global context, making it particularly well-suited for complex traffic scenes.
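A compact PyTorch sketch of the CamViT branch described above is given below: a convolutional patch embedding as the linear projection LP, a learnable class token and positional encodings, a stack of QKV-attention/FFN encoder layers, and extraction of $G_c$ from the class token. Patch size, depth, and feature dimension are illustrative choices, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class CamViTSketch(nn.Module):
    """Minimal CamViT-style branch: patchify, project, add class token and
    positional encodings, run a transformer encoder, and return G_c."""

    def __init__(self, img_size=(384, 1280), patch=32, d_f=256, layers=6, heads=8):
        super().__init__()
        h, w = img_size
        self.n_patches = (h // patch) * (w // patch)                      # N_p = (H/p_h)*(W/p_w)
        self.proj = nn.Conv2d(3, d_f, kernel_size=patch, stride=patch)    # LP(.) over patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_f))             # T_cls
        self.pos_emb = nn.Parameter(torch.zeros(1, self.n_patches + 1, d_f))  # E_pos
        enc_layer = nn.TransformerEncoderLayer(d_model=d_f, nhead=heads,
                                               dim_feedforward=4 * d_f,
                                               batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.norm = nn.LayerNorm(d_f)

    def forward(self, images):                     # images: (B, 3, H, W)
        x = self.proj(images)                      # (B, D_f, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, N_p, D_f) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb        # S_0
        x = self.encoder(x)                        # alternating QKV attention / FFN blocks
        g_c = self.norm(x[:, 0])                   # G_c = LN(S_L^0), global image feature
        return g_c, x[:, 1:]                       # global token + patch-wise features
```

For a 384 × 1280 input with 32 × 32 patches, this sketch yields 480 patch tokens plus the class token per image.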

3.5. LiDAR ViT Branch

LiDAR sensors capture 3D point clouds representing the environment, providing geometric and spatial information critical for object detection. However, the sparse and irregular nature of LiDAR data presents challenges in feature extraction and learning. To address this, LiDViT extends the vision transformer (ViT) architecture to process voxelized point cloud data for effective 3D object detection in complex traffic scenes. Given a raw point cloud $P \in \mathbb{R}^{N \times 3}$, where $N$ is the number of points, we first voxelize the point cloud to transform it into a structured 3D grid. Each voxel aggregates a group of points that fall within its 3D boundaries. Let the voxel grid dimensions be $V_x \times V_y \times V_z$, where $V_x$, $V_y$, and $V_z$ denote the number of voxels along the respective axes, as shown in Figure 4.
The voxelized point cloud $V \in \mathbb{R}^{V_x \times V_y \times V_z \times C}$ consists of $C$-dimensional features (e.g., coordinates, intensity, etc.) for each occupied voxel. We apply a voxel feature encoder (VFE) to learn voxel-wise representations:
$$ v_{f_{i,j,k}} = \mathrm{VFE}(v_{i,j,k}), $$
where $v_{f_{i,j,k}} \in \mathbb{R}^{D_v}$ is the feature vector for the voxel at position $(i, j, k)$, and $D_v$ is the feature dimension. This step encodes each voxel's geometric and structural properties into a latent representation. To capture the spatial structure of the 3D point cloud, we add a 3D positional encoding $E_{pos}^{3D}$ to each voxel embedding. This encodes the relative position of each voxel in the 3D grid, enabling the transformer to retain spatial context. The input sequence to the transformer encoder becomes
$$ S_0 = [v_{f_{1,1,1}} + E_{pos}^{1,1,1};\; v_{f_{1,1,2}} + E_{pos}^{1,1,2};\; \ldots;\; v_{f_{V_x,V_y,V_z}} + E_{pos}^{V_x,V_y,V_z}], $$
where $S_0 \in \mathbb{R}^{V \times D_v}$ and $V = V_x \cdot V_y \cdot V_z$ is the total number of voxels. This sequence is now ready to be processed by the transformer encoder.
Like CamViT, the LiDAR transformer encoder alternates between query-key-value (QKV) attention and feedforward neural network (FFN) blocks, adapted to process the voxel-wise features. The transformer operates on the sequence of voxel embeddings, enabling global context modeling over the entire point cloud. At each layer $l$, we compute the query $Q_l$, key $K_l$, and value $V_l$ matrices for the voxel embeddings. The attention output $A_l$ is computed as
$$ A_l = \mathrm{softmax}\!\left(\frac{Q_l K_l^{\top}}{\sqrt{D_v}}\right) V_l. $$
The attention output is added to the input via residual connections, followed by layer normalization (LN):
$$ S_l' = A_l + S_{l-1}. $$
The voxel-wise features are further refined using a feedforward neural network (FFN) consisting of two fully connected layers with a ReLU activation:
$$ F_l = \mathrm{ReLU}\left(W_2 \cdot \mathrm{LN}(W_1 \cdot S_l')\right), $$
where $W_1$ and $W_2$ are the learnable weights of the FFN at layer $l$, and residual connections are again applied:
$$ S_l = F_l + S_l'. $$
This process continues through $L_v$ layers, where $L_v$ is the total number of layers in the LiDAR transformer encoder. After processing the voxel embeddings through all transformer layers, the output sequence $S_{L_v} \in \mathbb{R}^{V \times D_v}$ contains the learned voxel-wise features. The global point cloud feature $G_v$ is extracted by applying layer normalization to the output of the final layer:
$$ G_v = \mathrm{LN}(S_{L_v}), $$
where $G_v \in \mathbb{R}^{V \times D_v}$ represents the comprehensive voxel-wise features.
The learned voxel features $G_v$ can be used to directly predict 3D bounding boxes by attaching a 3D detection head, enabling precise localization of objects in 3D space. Moreover, these features can be fused with the global image features from CamViT for multi-modal prediction, improving the robustness and accuracy of 3D object detection by leveraging both LiDAR and camera data. LiDViT effectively processes sparse and irregular point cloud data by transforming them into structured voxel representations and applying a transformer-based architecture. The model captures both local voxel-wise information and global spatial relationships, making it well-suited for 3D object detection tasks in dynamic traffic environments. Combined with CamViT, LiDViT enables powerful multi-modal fusion for accurate scene understanding and object recognition.
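To illustrate the voxelization and voxel feature encoding stage that feeds the LiDViT encoder, the sketch below buckets points into a regular grid and mean-pools a small MLP over each occupied voxel. It is a slow reference implementation with illustrative grid sizes; practical pipelines use optimized voxelizers (e.g., those in OpenPCDet), and the subsequent transformer encoder mirrors the CamViT sketch above with $D_v$ in place of $D_f$.

```python
import torch
import torch.nn as nn

def voxelize(points, voxel_size=(0.4, 0.4, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0), max_pts=32):
    """Toy voxelization: bucket points into a regular 3D grid and keep up
    to `max_pts` points per occupied voxel (reference loop, not optimized)."""
    xmin, ymin, zmin, xmax, ymax, zmax = pc_range
    mask = ((points[:, 0] >= xmin) & (points[:, 0] < xmax) &
            (points[:, 1] >= ymin) & (points[:, 1] < ymax) &
            (points[:, 2] >= zmin) & (points[:, 2] < zmax))
    pts = points[mask]
    coords = ((pts[:, :3] - torch.tensor([xmin, ymin, zmin])) /
              torch.tensor(voxel_size)).long()              # (i, j, k) voxel indices
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    voxels = torch.zeros(len(uniq), max_pts, pts.size(1))
    counts = torch.zeros(len(uniq), dtype=torch.long)
    for p, v in zip(pts, inv):                               # simple, slow reference loop
        if counts[v] < max_pts:
            voxels[v, counts[v]] = p
            counts[v] += 1
    return voxels, uniq, counts

class VFESketch(nn.Module):
    """Mean-pooled MLP producing the voxel-wise embedding v_f for each voxel;
    in_dim must match the per-point feature size (3 for xyz, 4 with intensity)."""
    def __init__(self, in_dim=3, d_v=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, d_v), nn.ReLU(), nn.Linear(d_v, d_v))

    def forward(self, voxels, counts):                       # voxels: (V, max_pts, C)
        feats = self.mlp(voxels)                             # per-point features
        denom = counts.clamp(min=1).float().view(-1, 1)
        return feats.sum(dim=1) / denom                      # (V, D_v) voxel embeddings

# Example usage with synthetic xyz points in a KITTI-like range.
pts = torch.rand(10000, 3) * torch.tensor([70.4, 80.0, 4.0]) + torch.tensor([0.0, -40.0, -3.0])
voxels, coords, counts = voxelize(pts)
voxel_feats = VFESketch(in_dim=3, d_v=256)(voxels, counts)   # tokens for the LiDViT encoder
```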

3.6. ViT-Based Cross-Fusion

Our ViT-based cross-fusion approach is tailored to leverage geometric and spatial information from multi-modal sources (LiDAR and camera) for 3D object detection. The cross-fusion framework integrates features from LiDAR and camera sensors using a transformer-based approach, enabling enhanced understanding of the scene by combining visual and depth information.
In 3D object detection tasks, especially in dynamic environments such as traffic scenes, combining the geometric information from LiDAR and the rich visual data from camera sensors can significantly improve detection accuracy. However, effective fusion of these data streams is challenging due to their different modalities and spatial characteristics. To address this, we propose a ViT-based Cross-Fusion Module, which aligns and combines features from both sensors, enhancing the model’s ability to localize and classify objects accurately in 3D space.
As described previously, the CamViT and LiDViT branches independently extract features from the camera images and LiDAR point clouds. CamViT provides 2D visual features $G_c \in \mathbb{R}^{N_c \times D_c}$, where $N_c$ is the number of image patches and $D_c$ is the feature dimension. LiDViT generates 3D spatial features $G_v \in \mathbb{R}^{V \times D_v}$, where $V$ is the number of voxels and $D_v$ is the LiDAR feature dimension. The next step is to fuse these features in a way that retains their complementary strengths: the visual richness of the camera and the geometric accuracy of LiDAR.
The core of our cross-fusion approach lies in using cross-attention between the features extracted by CamViT and LiDViT. Cross-attention allows each modality to attend to relevant information in the other modality, facilitating better contextual understanding of objects in 3D space. We define two sets of queries, keys, and values for the fusion process. In the first, the camera acts as the query and LiDAR as the key-value: the camera features $G_c$ act as the queries, whereas the LiDAR features $G_v$ serve as the keys and values. This allows the camera features to focus on important spatial regions in the LiDAR data:
$$ A_{cv} = \mathrm{softmax}\!\left(\frac{G_c G_v^{\top}}{\sqrt{D}}\right) G_v, $$
where $A_{cv} \in \mathbb{R}^{N_c \times D_v}$ is the cross-attended feature matrix and $D$ is a scaling factor.
Similarly, we reverse the roles so that the LiDAR features can attend to the visual information from the camera:
$$ A_{vc} = \mathrm{softmax}\!\left(\frac{G_v G_c^{\top}}{\sqrt{D}}\right) G_c, $$
where $A_{vc} \in \mathbb{R}^{V \times D_c}$ represents the attended camera features based on the LiDAR input.
Once the cross-attention has been computed, the fused features are combined into a unified representation that leverages the strengths of both modalities. We concatenate the cross-attended features A c v and A v c , along with the original CamViT and LiDViT features, to form a comprehensive multi-modal representation. Let the concatenated feature matrix be
$$ F_{fusion} = [A_{cv};\, G_c;\, A_{vc};\, G_v]. $$
This fused feature set $F_{fusion} \in \mathbb{R}^{(N_c + V) \times (D_c + D_v)}$ is then processed through a multi-layer perceptron (MLP) to transform it into a unified feature space:
$$ F_{final} = \mathrm{MLP}(F_{fusion}), $$
where $F_{final} \in \mathbb{R}^{M \times D_f}$, with $M = N_c + V$ and $D_f$ as the final feature dimension. This step ensures that the fused representation is well-aligned and compact. The fused features $F_{final}$ are further processed through standard feedforward layers and layer normalization to refine the multi-modal representation, as shown in Figure 4. The process is defined as
$$ F_{refined} = \mathrm{FFN}(\mathrm{LN}(F_{final})) + F_{final}, $$
where FFN denotes the feedforward network consisting of two fully connected layers with ReLU activation.
Our cross-attention mechanism is crucial for enabling the ViT-based Cross-Fusion module to integrate LiDAR and camera data, allowing it to align spatial and visual features for 3D object detection effectively. However, we recognize that cross-attention introduces additional computational costs. Specifically, the complexity of cross-attention in this context is $O(N \times M \times d)$, where $N$ and $M$ are the token counts for the two modalities and $d$ is the embedding dimension. This cost is minimized by setting $d$ to an optimal value that balances computational demands with feature expressiveness.
For multi-head attention parameters, we selected eight heads based on empirical testing, as this configuration provides sufficient feature diversity without significantly increasing computational demands. Each head operates on a subspace of the feature dimension, enhancing the model’s ability to capture modality-specific patterns. We set the embedding dimension to 256, maintaining computational efficiency while allowing rich feature interactions between the two modalities.
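The bidirectional cross-attention and fusion described above can be sketched with standard multi-head attention layers, using the stated 8 heads and 256-dimensional embeddings; the token-wise concatenation assumes $D_c = D_v$ so that $F_{fusion}$ has $(N_c + V)$ tokens of size $D_c + D_v$. Module and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossFusionSketch(nn.Module):
    """Bidirectional cross-attention between camera features G_c and LiDAR
    voxel features G_v, followed by concatenation, an MLP projection, and an
    FFN/LayerNorm refinement (F_refined = FFN(LN(F_final)) + F_final)."""

    def __init__(self, d_c=256, d_v=256, d_f=256, heads=8):
        super().__init__()
        assert d_c == d_v, "this sketch assumes equal camera/LiDAR dimensions"
        self.cam2lid = nn.MultiheadAttention(embed_dim=d_c, kdim=d_v, vdim=d_v,
                                             num_heads=heads, batch_first=True)
        self.lid2cam = nn.MultiheadAttention(embed_dim=d_v, kdim=d_c, vdim=d_c,
                                             num_heads=heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_c + d_v, d_f), nn.ReLU(),
                                 nn.Linear(d_f, d_f))
        self.norm = nn.LayerNorm(d_f)
        self.ffn = nn.Sequential(nn.Linear(d_f, 4 * d_f), nn.ReLU(),
                                 nn.Linear(4 * d_f, d_f))

    def forward(self, g_c, g_v):
        # g_c: (B, N_c, D_c) camera patch features; g_v: (B, V, D_v) voxel features
        a_cv, _ = self.cam2lid(query=g_c, key=g_v, value=g_v)   # camera attends to LiDAR
        a_vc, _ = self.lid2cam(query=g_v, key=g_c, value=g_c)   # LiDAR attends to camera
        cam_tokens = torch.cat([a_cv, g_c], dim=-1)             # (B, N_c, D_c + D_v)
        lid_tokens = torch.cat([a_vc, g_v], dim=-1)             # (B, V,  D_c + D_v)
        f_fusion = torch.cat([cam_tokens, lid_tokens], dim=1)   # (B, N_c + V, D_c + D_v)
        f_final = self.mlp(f_fusion)                            # unified D_f space
        return self.ffn(self.norm(f_final)) + f_final           # F_refined

# Example: 480 camera tokens and 2000 voxel tokens -> (1, 2480, 256) fused tokens.
fused = CrossFusionSketch()(torch.randn(1, 480, 256), torch.randn(1, 2000, 256))
```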

3.7. Detection Head

In the detection head, a D-dimensional vector y is generated after the input point features undergo dimension reduction in earlier stages. Two separate feedforward neural networks (FFNs) process this vector: one is responsible for predicting box residuals relative to the input 3D proposals, and the other estimates the confidence scores.
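A minimal sketch of such a head is shown below, assuming a 7-parameter box residual $(x, y, z, l, w, h, \theta)$ and a single confidence logit per proposal; layer widths are illustrative.

```python
import torch.nn as nn

class DetectionHeadSketch(nn.Module):
    """Two parallel FFN branches on the D-dimensional fused feature y:
    one regresses box residuals w.r.t. the input 3D proposal, the other
    predicts a confidence score."""

    def __init__(self, d_in=256, hidden=256):
        super().__init__()
        self.reg_ffn = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 7))    # (x, y, z, l, w, h, theta) residuals
        self.cls_ffn = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))    # confidence logit

    def forward(self, y):                    # y: (num_proposals, d_in)
        return self.reg_ffn(y), self.cls_ffn(y).squeeze(-1)
```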

3.8. Loss Function

The training objectives are defined based on the 3D intersection over union (IoU) between the ground-truth boxes and the 3D proposals. The following loss function is used to construct the confidence prediction target [8,11,57]:
$$ \mathcal{L}_{\mathrm{conf}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i^g \log(\hat{y}_i) + (1 - y_i^g) \log(1 - \hat{y}_i) \right], $$
where $y_i^g$ denotes the ground-truth confidence for the $i$th proposal and $\hat{y}_i$ represents the predicted confidence.
The ground-truth boxes (denoted by superscript $g$) and their corresponding proposals (denoted by superscript $c$) are used to encode the regression targets (denoted by superscript $t$), expressed as follows:
$$ x^t = \frac{x^g - x^c}{d}, \quad y^t = \frac{y^g - y^c}{d}, \quad z^t = \frac{z^g - z^c}{h^c}, \quad l^t = \log\frac{l^g}{l^c}, \quad w^t = \log\frac{w^g}{w^c}, \quad h^t = \log\frac{h^g}{h^c}, \quad \theta^t = \theta^g - \theta^c, $$
where $d = \sqrt{(l^c)^2 + (w^c)^2}$ represents the diagonal of the base of the proposal box and $(x^c, y^c, z^c, l^c, w^c, h^c, \theta^c)$ are the parameters of the proposal box.
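The target encoding and confidence objective translate directly into code; the sketch below assumes 7-parameter boxes $(x, y, z, l, w, h, \theta)$ and IoU-derived confidence targets, and omits the class-specific weighting and thresholding details applied in the cited works.

```python
import torch
import torch.nn.functional as F

def encode_regression_targets(gt, prop):
    """Encode ground-truth boxes relative to their matched proposals following
    the residual formulation above.  gt, prop: (N, 7) tensors (x, y, z, l, w, h, theta)."""
    d = torch.sqrt(prop[:, 3] ** 2 + prop[:, 4] ** 2)        # diagonal of the proposal base
    xt = (gt[:, 0] - prop[:, 0]) / d
    yt = (gt[:, 1] - prop[:, 1]) / d
    zt = (gt[:, 2] - prop[:, 2]) / prop[:, 5]                # normalized by proposal height
    lt = torch.log(gt[:, 3] / prop[:, 3])
    wt = torch.log(gt[:, 4] / prop[:, 4])
    ht = torch.log(gt[:, 5] / prop[:, 5])
    tt = gt[:, 6] - prop[:, 6]
    return torch.stack([xt, yt, zt, lt, wt, ht, tt], dim=1)

def confidence_loss(pred_logits, iou_targets):
    """Binary cross-entropy between predicted confidences and IoU-derived
    ground-truth targets y^g (illustrative; no per-class weighting)."""
    return F.binary_cross_entropy_with_logits(pred_logits, iou_targets)
```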

4. Experiment and Results

This section first details the implementation specifics of our proposed PLC-Fusion model, followed by a thorough evaluation based on experimental results using the KITTI dataset. We conduct ablation studies to validate the contribution of each model component and conclude with visualizations of the 3D object detection outcomes.

4.1. Dataset and Evaluation Metrics

The KITTI dataset [19], comprising 7481 training samples and 7518 testing samples, is widely recognized as one of the most challenging and comprehensive benchmarks for 3D object detection in autonomous driving research. This dataset is highly suited for evaluating deep learning models in real-world conditions, as it includes multiple modalities—camera images, 3D LiDAR point clouds, and corresponding calibration files. Each sample is meticulously labeled with instance-level annotations for objects such as cars, pedestrians, and cyclists, critical for self-driving vehicle perception tasks.
The dataset covers various driving scenarios, including urban, rural, and highway environments, providing a diverse range of obstacles and conditions, such as varying lighting, weather, and traffic. For model evaluation, KITTI uses a standard metric of average precision (AP), with the intersection over union (IoU) threshold set at 0.7 for cars and 0.5 for pedestrians and cyclists. The performance is assessed at three difficulty levels—easy, moderate, and hard—determined by object occlusion, truncation, and size.
In our study, the training set was split into 3769 samples for validation and 3712 samples for training, following a common practice to ensure robust model tuning and evaluation. The multi-modal nature of the dataset, combining both 2D and 3D data, allows for a thorough comparison between single-modal approaches (using either LiDAR or camera data alone) and multimodal methods that leverage both modalities. We evaluate the detection performance of our proposed PLC-Fusion model at the easy, moderate, and hard levels, comparing it against state-of-the-art single-modal detectors (e.g., PointPillars, MV3D, Pointformer [29]) and multimodal detectors (e.g., AVOD, PIXOR, MSL3D [43]). Our method demonstrates competitive performance, highlighting the effectiveness of integrating spatial and depth features in the detection pipeline. The results reinforce the importance of multimodal fusion in improving detection accuracy, particularly in challenging conditions such as heavily occluded or distant objects.
In addition, we use a 9:1 training-to-validation split for the network configuration experiments, randomly selecting 6733 samples for training and leaving 785 samples for validation. Our comparative analysis employs the PV-RCNN and PointPillars baseline models, each implemented with a comprehensive network configuration that includes non-maximum suppression (NMS) at a 0.7 IoU overlap threshold. Additionally, a range filter tailored to cars, defined as [(0.0, 70.40), (45.0, 45.0), (4.0, 1.0)], is used, along with an anchor filter of [3.92, 1.67, 1.55]. Data augmentation techniques are integrated using OpenPCDet [58] to enhance the dataset.

4.2. Training Details

Our PLC-Fusion model is trained on an NVIDIA RTX 3090 GPU with an adjusted batch size of 24, selected to ensure efficient memory utilization while maximizing parallelism. The input resolution for the image stream is resized from 1280 × 384 pixels to 1408 × 416 pixels to maintain alignment with the LiDAR data, which covers a spatial range of [0, 70.4] meters on the x-axis, [−40, 40] meters on the y-axis, and [−3, 1] meters on the z-axis. This spatial range is chosen to encompass the full 3D field of view required for autonomous driving scenarios.
We use the one-stage VoxelNet and the two-stage MSL3D detectors for point cloud feature extraction, with default hyperparameters from their original implementations. These choices are based on their effectiveness in balancing detection accuracy and computational load, an essential factor for real-time applications.
The PLC-Fusion network is trained over 80,000 steps with a batch size of 8 due to memory constraints and computational demands of multi-modal fusion. A learning rate of 0.003 is applied, determined after grid search testing to ensure stable convergence without excessive oscillations. A weight decay of 0.01 is added to prevent overfitting, while momentum is gradually adjusted from 0.95 to 0.85 to enhance learning stability. We utilize the one-cycle learning rate policy with the AdamW optimizer, which is particularly effective in accelerating convergence while maintaining robustness in large-scale multi-modal training scenarios. This setup allows PLC-Fusion to reach high performance across diverse input sources, providing reproducibility and robustness in challenging real-world 3D object detection tasks.
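For reference, the optimizer and schedule described above can be assembled as follows; `model` stands in for the full PLC-Fusion network, and the data pipeline and loss computation are omitted, so this is a configuration sketch rather than the authors' training script.

```python
import torch

model = torch.nn.Linear(256, 7)          # placeholder for the assembled PLC-Fusion network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.003, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.003,
    total_steps=80_000,
    base_momentum=0.85,                  # momentum (beta1 for AdamW) cycled between
    max_momentum=0.95,                   # these two values by the one-cycle policy
)

for step in range(80_000):               # skeleton of the training loop
    optimizer.zero_grad()
    # loss = plc_fusion_loss(model, batch)   # data loading and loss omitted in this sketch
    # loss.backward()
    optimizer.step()
    scheduler.step()                     # per-step one-cycle lr/momentum update
```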

4.3. Quantitative and Qualitative Results

Table 1 presents a performance comparison of various 3D object detection methods on the KITTI dataset. PLC-Fusion outperforms both EPNet [33] and DFIM [34], which are among the top-performing methods. EPNet achieves a 3D detection mean average precision (mAP) of 81.23%, whereas DFIM reaches 82.15%. In contrast, PLC-Fusion surpasses both with an mAP of 83.52%, demonstrating superior performance across all difficulty levels: Easy (89.69%), Moderate (82.73%), and Hard (77.82%).
For bird’s-eye view (BEV) detection, PLC-Fusion also excels, achieving a higher mAP of 90.34% compared to EPNet’s 88.79% and DFIM’s 89.02%. This improvement is consistent across the Easy (93.75%), Moderate (89.87%), and Hard (86.42%) levels, making PLC-Fusion the top performer in both tasks. Despite its higher accuracy, PLC-Fusion maintains a competitive inference time of 0.18 s, balancing performance with efficiency. This positions PLC-Fusion as a more accurate and robust solution for 3D object detection and BEV tasks, outperforming other state-of-the-art methods across all key metrics on the KITTI dataset.
The results in Table 2 provide a quantitative comparison of various state-of-the-art 3D object detection methods on the KITTI validation dataset. The methods are evaluated based on average precision (AP) for 3D detection (AP3D) and bird's-eye view detection (APBEV) across three difficulty levels: Easy (E), Moderate (M), and Hard (H). PLC-Fusion emerges as the top-performing method, achieving the highest AP3D mAP of 88.17% and APBEV mAP of 92.44%. This result is superior across all difficulty levels in both detection tasks, with 93.08% (E), 87.05% (M), and 85.10% (H) for AP3D, and 96.06% (E), 91.27% (M), and 89.58% (H) for APBEV.
When compared to StructuralIF [31], which holds the second-highest 3D detection mAP of 87.20%, PLC-Fusion shows a significant improvement, particularly in the Moderate (87.05% vs. 85.38%) and Hard (85.10% vs. 83.45%) categories. Similarly, PLC-Fusion outperforms DFIM [34], the third-highest performer with an AP3D mAP of 87.48%. In terms of APBEV, PLC-Fusion also exceeds DFIM's mAP of 91.55% by almost one percentage point. Overall, PLC-Fusion consistently delivers superior detection accuracy and sets a new benchmark for both 3D object detection and bird's-eye view detection on the KITTI dataset, as demonstrated by the visualization results in Figure 5 and Figure 6.
Table 3 presents the performance of state-of-the-art 3D object detection methods on the KITTI test dataset for the “Pedestrian” and “Cyclist” classes, comparing average precision (AP) for Easy (E), Moderate (M), and Hard (H) difficulty levels. PLC-Fusion demonstrates the best overall performance, achieving the highest AP across both the Pedestrian and Cyclist classes. For Pedestrian detection, PLC-Fusion achieves an AP 3D of 51.19% (E), 43.13% (M), and 39.73% (H), outperforming both AVOD-FPN and StructuralIF, which scored lower across all difficulty levels. In the Cyclist category, PLC-Fusion achieves a significant margin of improvement with an AP3D of 77.25% (E), 60.50% (M), and 54.06% (H). This is a marked increase over StructuralIF, the second-highest performer, which achieved 72.54% (E), 56.39% (M), and 49.28% (H).
The results indicate that the PLC-Fusion model consistently outperforms the other three models across all IoU thresholds, as shown in Figure 7. The results demonstrate that PLC-Fusion not only performs well in detecting pedestrians but also excels at detecting cyclists, especially in more challenging scenarios. Its higher accuracy in both classes highlights its effectiveness for complex real-world 3D object detection tasks.

4.4. PLC-Fusion Efficiency

Table 4 compares various 3D object detection methods based on memory usage, parallelism, speed (both in perpendicular and parallel processing), and input scale. PLC-Fusion achieves a balanced performance with moderate memory usage at 751 MB, positioned between PointPillars (561 MB) and SVGA-Net (802 MB). Although PointPillars has the lowest memory consumption, PLC-Fusion offers improved speed in parallel processing (48) compared to PointPillars (15), demonstrating better efficiency in scenarios requiring parallel computation.
PointRCNN uses the least memory (324 MB) but is slower than PLC-Fusion in parallel speed. Although F-PointNet has high memory usage at 1223 MB, it achieves comparable perpendicular speed (38) but falls short in parallel speed (10), indicating limited parallel processing efficiency. PLC-Fusion’s input scale (16,348) aligns closely with SVGA-Net and PointPillars (16,384), supporting detailed object representation without significant memory overhead. Overall, PLC-Fusion provides an optimal balance across memory usage, parallelism, and speed, making it well-suited for applications needing efficient, scalable 3D object detection.

4.5. Ablation Studies

4.5.1. Effect of the Feature Fusion Approach

The ablation study evaluates the impact of different PLC fusion strategies on average precision (AP) for 3D and BEV detection across three difficulty levels: Easy, Moderate, and Hard. The baseline method demonstrates solid performance, but all PLC fusion variations improve upon it (see Table 5). Early fusion yields noticeable improvements, while concatenation offers further gains. Summation surpasses concatenation slightly, showing better performance overall. Multiplication, however, achieves the highest AP scores across all difficulty levels for both 3D and BEV detection, indicating it is the most effective fusion strategy. Specifically, multiplication reaches peak AP values of 93.08% (Easy), 87.05% (Moderate), and 85.10% (Hard) for 3D detection and 96.06% (Easy), 91.27% (Moderate), and 89.58% (Hard) for BEV detection. This trend shows that richer feature fusion strategies enhance detection accuracy, with multiplication providing the most significant improvements.

4.5.2. Effect of Image and LiDAR Backbones

The ablation study reveals that the baseline model, which employs the V2-99 image backbone and VoxelNet LiDAR backbone without any advanced methods, yields a mean average precision (mAP) of 86.35 for 3D detection and 90.29 for BEV detection. Introducing CamViT improves performance slightly, increasing mAP 3D to 87.09 and mAP BEV to 90.73, highlighting its positive impact on image modality integration. Enabling LiDViT, however, results in a more substantial boost, with mAP 3D rising to 87.42 and mAP BEV to 91.57, underscoring its effectiveness in enhancing LiDAR data utilization. The most significant gains are observed when both CamViT and LiDViT are combined with fusion methods, leading to an mAP 3D of 88.17 and mAP BEV of 92.44, as shown in Table 6. This demonstrates that the synergistic effect of these advanced methods and fusion techniques delivers the highest performance for 3D object detection.

4.5.3. Effect of PLC-Fusion Components on Runtime

The evaluation of different configurations involving multimodal feature extraction (MFE), CamViT, LiDViT, and their fusion shows varying impacts on computational efficiency and detection performance, as shown in Table 7. Without CamViT and LiDViT but with MFE, the system requires 28.0 ms and 19,500 MB of memory, achieving 86.95% BEV accuracy and 90.82% 3D accuracy. Adding LiDViT reduces the time to 23.5 ms and the memory to 12,550 MB, improving BEV accuracy to 87.29% and 3D accuracy to 91.07%. Incorporating CamViT and LiDViT (without fusion) further enhances performance to 87.51% BEV and 91.48% 3D accuracy, though with slightly higher time and memory costs (25.0 ms, 12,700 MB). When combining MFE, CamViT, LiDViT, and fusion, the time drops significantly to 16 ms, memory usage decreases to 11,900 MB, and performance peaks at 87.96% BEV and 91.89% 3D accuracy. The optimal configuration with all components and fusion achieves the highest accuracy of 88.17% BEV and 92.44% 3D with the best computational efficiency: 13.5 ms and 4500 MB of memory.

4.5.4. Distance Analysis

The performance comparison shows that PLC-Fusion leads with the highest overall accuracy, achieving 87.05% AP3D and 91.27% APBEV. It excels across all distance ranges, with 96.63% AP3D at 0–20 m and 54.38% APBEV beyond 40 m. DFIM follows with 86.04% AP3D and 90.72% APBEV, showing strong performance at shorter ranges but lower accuracy at longer distances. MSL3D has an overall 84.71% AP3D and 90.68% APBEV, with notable accuracy at 0–20 m but reduced performance beyond 40 m. StructuralIF provides solid results at short distances but lacks BEV results, as shown in Table 8. The runtime analysis in Figure 8 compares various 3D detection methods based on their processing time in milliseconds (ms). PLC-Fusion achieves the fastest runtime at only 29 ms, making it the most efficient method among those evaluated. StructuralIF closely follows with 31 ms, whereas DFIM records 54 ms. PointPainting and EPNet show moderate runtimes at 61 ms and 71 ms, respectively. MSL3D has a higher runtime of 76 ms. Finally, PI-RCNN is the slowest, with a runtime of 91 ms. Overall, PLC-Fusion offers the best balance of computational efficiency and performance.

4.6. Analysis and Discussion

Quantitative results indicate PLC-Fusion's superior performance across all evaluated metrics: for 3D detection, the model achieves a mean average precision (mAP) of 83.52%, and for bird's-eye view detection, it records an mAP of 90.34%. These values place it ahead of top-performing models like EPNet and DFIM across the Easy, Moderate, and Hard difficulty levels, while maintaining an efficient runtime of 0.18 s per frame. Moreover, an ablation study validates the performance gains introduced by its fusion approach, showing that the multiplication strategy yields the highest accuracy across 3D and bird's-eye view (BEV) tasks. Additionally, although StructuralIF achieves a close mAP of 87.20% in 3D detection, PLC-Fusion surpasses it, particularly under the Moderate and Hard settings. The study also evaluates the efficiency of different backbone configurations, where the combination of CamViT and LidViT with fusion leads to the best results, with PLC-Fusion reaching an mAP of 88.17% for 3D detection and 92.44% for BEV.
From a computational efficiency perspective, PLC-Fusion remains optimal, consuming only 13.5 ms and 4500 MB in its ideal configuration. Runtime comparisons indicate that PLC-Fusion achieves superior processing speeds, making it well-suited for real-time applications. When evaluated across varying distances, PLC-Fusion maintains high accuracy even for objects further from the sensor, with a significant lead in both AP3D and APBEV. Overall, PLC-Fusion’s approach to LiDAR-camera fusion achieves state-of-the-art accuracy and efficiency, presenting a robust solution for 3D object detection in autonomous driving applications.
Limitations and Future Enhancements: Although PLC-Fusion demonstrates strong performance and efficiency, it is sensitive to the quality of the calibration between LiDAR and camera data. High-quality calibration is critical to ensuring accurate spatial alignment between these two modalities; otherwise, fusion errors may introduce inconsistencies, impacting detection accuracy, particularly in real-world settings where minor misalignments can occur due to sensor movement, environmental changes, or hardware limitations. Future work will investigate methods for mitigating calibration sensitivity, such as adaptive calibration modules or self-supervised alignment techniques that can adjust to varying real-world conditions.
Additionally, although PLC-Fusion shows promising scalability across different scenes and sensor configurations, future iterations of the framework will focus on improving robustness to calibration variance and extending evaluations to other large-scale datasets, such as the Waymo Open Dataset, nuScenes, and DAIR-V2X. This will not only validate PLC-Fusion's generalizability but also address broader environmental scenarios and challenging conditions to further refine the model's applicability in real-world autonomous driving tasks.

5. Conclusions

In conclusion, this study presents PLC-Fusion, a novel 3D object detection framework that integrates LiDAR and camera data using a perspective-aware, hierarchical vision transformer-based approach. The fully sparse, multi-modal detection framework enhances detection accuracy and computational efficiency, overcoming limitations of traditional single-modality systems like sparse point clouds and poor object localization. The key contributions include the Object Perspective Sampling (OPS) module for improved feature alignment and the hierarchical fusion method leveraging CamViT and LidViT for independent feature learning from 2D images and 3D point clouds. The Cross-Fusion module further refines multi-modal data integration, enhancing overall detection performance. PLC-Fusion achieves high mean average precision (mAP) for both 3D and bird’s-eye view (BEV) detection while maintaining low inference times. Experiments on the KITTI dataset show robustness in complex urban traffic environments, particularly in detecting pedestrians and cyclists. These results demonstrate the framework’s potential as a state-of-the-art solution for autonomous driving applications, addressing key computational and performance challenges of current methods. Future research directions for PLC-Fusion could explore integrating temporal data for dynamic object tracking, enabling real-time tracking in dynamic environments. Additionally, experimenting with alternative fusion strategies, such as adaptive fusion techniques or multi-modal attention mechanisms, could further improve scalability and versatility for broader applications. These advancements would enhance the robustness of PLC-Fusion in real-world scenarios with imperfect sensor calibration and varying environmental conditions.

Author Contributions

Conceptualization, H.M.; Methodology, H.M.; Software, M.A.; Formal analysis, F.A.; Investigation, X.D.; Writing—original draft, H.M.; Writing—review & editing, H.H.R.S.; Visualization, M.A.; Project administration, X.D.; Funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China Project (62172441, 62172449); the Local Science and Technology Developing Foundation Guided by the Central Government of China (Free Exploration project 2021Szvup166); the Opening Project of State Key Laboratory of Nickel and Cobalt Resources Comprehensive Utilization (GZSYS-KY-2022-018, GZSYS-KY-2022-024); Key Project of Shenzhen City Special Fund for Fundamental Research (202208183000751); and the National Natural Science Foundation of Hunan Province (2023JJ30696).

Data Availability Statement

The dataset created and examined in the present study can be accessed from the KITTI 3D object detection repository (https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d (accessed on 18 July 2023)).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, Z.; Wan, W.; Ren, M.; Zheng, X.; Fang, Z. Sparsefusion3d: Sparse sensor fusion for 3d object detection by radar and camera in environmental perception. IEEE Trans. Intell. Veh. 2023, 9, 1524–1536. [Google Scholar] [CrossRef]
  2. Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. Vpfnet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Trans. Multimed. 2022, 25, 5291–5304. [Google Scholar] [CrossRef]
  3. Uzair, M.; Dong, J.; Shi, R.; Mushtaq, H.; Ullah, I. Channel-wise and spatially-guided Multimodal feature fusion network for 3D Object Detection in Autonomous Vehicles. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5707515. [Google Scholar] [CrossRef]
  4. Nie, C.; Ju, Z.; Sun, Z.; Zhang, H. 3D object detection and tracking based on lidar-camera fusion and IMM-UKF algorithm towards highway driving. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 1242–1252. [Google Scholar] [CrossRef]
  5. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  6. Chen, Q.; Li, P.; Xu, M.; Qi, X. Sparse Activation Maps for Interpreting 3D Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 76–84. [Google Scholar] [CrossRef]
  7. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427. [Google Scholar] [CrossRef]
  8. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; Volume 2019. [Google Scholar] [CrossRef]
  9. Mushtaq, H.; Deng, X.; Ullah, I.; Ali, M.; Malik, B.H. O2SAT: Object-Oriented-Segmentation-Guided Spatial-Attention Network for 3D Object Detection in Autonomous Vehicles. Information 2024, 15, 376. [Google Scholar] [CrossRef]
  10. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  11. Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467. [Google Scholar]
  12. Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6792–6802. [Google Scholar]
  13. Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross modal transformer: Towards fast and robust 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18268–18278. [Google Scholar]
  14. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
  15. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895. [Google Scholar] [CrossRef]
  16. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
  17. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2017. [Google Scholar] [CrossRef]
  18. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; Volume 2022. [Google Scholar] [CrossRef]
  19. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar] [CrossRef]
  20. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
  21. Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  22. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310. [Google Scholar]
  23. Rukhovich, D.; Vorontsova, A.; Konushin, A. ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar] [CrossRef]
  24. Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, Z.; Luo, P. Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 1000–1001. [Google Scholar]
  25. Park, D.; Ambruş, R.; Guizilini, V.; Li, J.; Gaidon, A. Is Pseudo-Lidar needed for Monocular 3D Object detection? In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  26. Landrieu, L.; Simonovsky, M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  27. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  28. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; Volume 2019. [Google Scholar] [CrossRef]
  29. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3D Object Detection with Pointformer. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  30. He, Q.; Wang, Z.; Zeng, H.; Zeng, Y.; Liu, Y. Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 870–878. [Google Scholar]
  31. An, P.; Liang, J.; Yu, K.; Fang, B.; Ma, J. Deep structural information fusion for 3D object detection on LiDAR–camera system. Comput. Vis. Image Underst. 2022, 214, 103295. [Google Scholar] [CrossRef]
  32. Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 720–736. [Google Scholar]
  33. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–52. [Google Scholar]
  34. Chen, M.; Liu, P.; Zhao, H. LiDAR-camera fusion: Dual transformer enhancement for 3D object detection. Eng. Appl. Artif. Intell. 2023, 120, 105815. [Google Scholar] [CrossRef]
  35. Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Liu, K.; Zhao, Y.; et al. FusionFormer: A multi-sensory fusion in bird’s-eye-view and temporal consistent transformer for 3D object detection. arXiv 2023, arXiv:2309.05257. [Google Scholar]
  36. Huang, J.; Ye, Y.; Liang, Z.; Shan, Y.; Du, D. Detecting as labeling: Rethinking LiDAR-camera fusion in 3D object detection. arXiv 2023, arXiv:2311.07152. [Google Scholar]
  37. Cai, H.; Zhang, Z.; Zhou, Z.; Li, Z.; Ding, W.; Zhao, J. BEVFusion4D: Learning LiDAR-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation. arXiv 2023, arXiv:2303.17099. [Google Scholar]
  38. Khamsehashari, R.; Schill, K. Improving deep multi-modal 3D object detection for autonomous driving. In Proceedings of the 2021 7th International Conference on Automation, Robotics and Applications (ICARA), Auckland, New Zealand, 9–11 February 2021; pp. 263–267. [Google Scholar]
  39. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Deformable feature aggregation for dynamic multi-modal 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 628–644. [Google Scholar]
  40. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
  41. Liu, X.; Zhang, B.; Liu, N. The Graph Neural Network Detector Based on Neighbor Feature Alignment Mechanism in LIDAR Point Clouds. Machines 2023, 11, 116. [Google Scholar] [CrossRef]
  42. Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-net: Multimodal VoxelNet for 3D object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; Volume 2019. [Google Scholar] [CrossRef]
  43. Chen, W.; Li, P.; Zhao, H. MSL3D: 3D object detection from monocular, stereo and point cloud for autonomous driving. Neurocomputing 2022, 494, 23–32. [Google Scholar] [CrossRef]
  44. Zhu, M.; Ma, C.; Ji, P.; Yang, X. Cross-modality 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 3772–3781. [Google Scholar]
  45. Wei, Z.; Zhang, F.; Chang, S.; Liu, Y.; Wu, H.; Feng, Z. MmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review. Sensors 2022, 22, 2542. [Google Scholar] [CrossRef]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 2017. [Google Scholar]
  47. Xiang, P.; Wen, X.; Liu, Y.S.; Cao, Y.P.; Wan, P.; Zheng, W.; Han, Z. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  48. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  49. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  50. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16X16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929v2. [Google Scholar]
  51. Hua, B.S.; Tran, M.K.; Yeung, S.K. Pointwise Convolutional Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  52. Mushtaq, H.; Deng, X.; Ali, M.; Hayat, B.; Raza Sherazi, H.H. DFA-SAT: Dynamic Feature Abstraction with Self-Attention-Based 3D Object Detection for Autonomous Driving. Sustainability 2023, 15, 3667. [Google Scholar] [CrossRef]
  53. She, R.; Kang, Q.; Wang, S.; Tay, W.P.; Zhao, K.; Song, Y.; Geng, T.; Xu, Y.; Navarro, D.N.; Hartmannsgruber, A. PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  54. Lu, D.; Gao, K.; Xie, Q.; Xu, L.; Li, J. 3DGTN: 3-D Dual-Attention GLocal Transformer Network for Point Cloud Classification and Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  55. Fei, J.; Chen, W.; Heidenreich, P.; Wirges, S.; Stiller, C. SemanticVoxels: Sequential Fusion for 3D Pedestrian Detection using LiDAR Point Cloud and Semantic Segmentation. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September 2020; Volume 2020. [Google Scholar] [CrossRef]
  56. Mahmoud, A.; Waslander, S.L. Sequential Fusion via Bounding Box and Motion PointPainting for 3D Objection Detection. In Proceedings of the 2021 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021. [Google Scholar] [CrossRef]
  57. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  58. OpenPCDet Development Team. Openpcdet: An Opensource Toolbox for 3d Object Detection from Point Clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 1 October 2024).
  59. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
  60. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
Figure 1. The architecture of our PLC-Fusion model for 3D object detection using LiDAR and camera data. The raw point cloud from LiDAR and raw image data are processed by separate 3D and 2D backbones, respectively. Perspective-based sampling is applied to both modalities before passing through a vision transformer (ViT)-based model (LiDViT for LiDAR data and CamViT for image data) to establish 2D and 3D correspondence. The Cross-Fusion module integrates these features, followed by region of interest (RoI)-based 3D detection for generating 3D bounding box predictions.
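For illustration, a minimal PyTorch-style sketch of the data flow in Figure 1 is given below. It is a skeleton rather than the implementation used in this work: the module names (OPS, CamViT, LiDViT, Cross-Fusion, RoI head) follow the figure, but their constructors, interfaces, and tensor shapes are assumptions.

```python
# Illustrative skeleton of the PLC-Fusion data flow in Figure 1 (interfaces are assumed).
import torch.nn as nn

class PLCFusionSketch(nn.Module):
    def __init__(self, img_backbone, pts_backbone, ops, cam_vit, lid_vit, cross_fusion, roi_head):
        super().__init__()
        self.img_backbone = img_backbone    # 2D backbone over the RGB image
        self.pts_backbone = pts_backbone    # sparse 3D backbone over the voxelized point cloud
        self.ops = ops                      # Object Perspective Sampling: object-centric 2D/3D feature sampling
        self.cam_vit = cam_vit              # transformer encoder over sampled image features (CamViT)
        self.lid_vit = lid_vit              # transformer encoder over sampled LiDAR features (LiDViT)
        self.cross_fusion = cross_fusion    # cross-attention between the two token streams
        self.roi_head = roi_head            # RoI-based 3D detection head

    def forward(self, image, points, calib):
        img_feats = self.img_backbone(image)                   # 2D feature maps
        pts_feats = self.pts_backbone(points)                  # sparse voxel features
        cam_tokens, lid_tokens = self.ops(img_feats, pts_feats, calib)
        fused = self.cross_fusion(self.cam_vit(cam_tokens), self.lid_vit(lid_tokens))
        return self.roi_head(fused)                            # 3D boxes and scores
```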
Figure 2. Graphical depiction of the object perspective sampling process for LiDAR and camera data within the multimodal fusion model.
Figure 3. Illustration of our object perspective sampling and projection process for LiDAR and camera data within the multimodal fusion model. The sampled points from LiDAR and camera images are projected into their respective 3D and 2D coordinate systems. Sparse feature extraction is applied to both modalities before being passed into vision transformer (ViT)-based encoders (LiDAR-ViT for LiDAR features and camera-ViT for image features). These extracted features are then fused in the Cross-Fusion module to establish a 2D–3D correspondence for improved multimodal 3D object detection.
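The 2D–3D correspondence sketched in Figures 2 and 3 relies on projecting LiDAR points into the image plane using the camera calibration. The routine below is a generic KITTI-style projection sketch for illustration only; the matrix names (P2, R0_rect, Tr_velo_to_cam) follow the KITTI calibration files, and the sampling logic of the OPS module itself is not reproduced here.

```python
import numpy as np

def to_hom(mat):
    """Pad a 3x3 or 3x4 calibration matrix to 4x4 homogeneous form."""
    out = np.eye(4)
    out[:mat.shape[0], :mat.shape[1]] = mat
    return out

def project_lidar_to_image(pts_velo, P2, R0_rect, Tr_velo_to_cam):
    """Project (N, 3) LiDAR points to pixel coordinates, keeping points in front of the camera."""
    n = pts_velo.shape[0]
    pts_h = np.hstack([pts_velo, np.ones((n, 1))]).T        # (4, N) homogeneous LiDAR coordinates
    cam = to_hom(R0_rect) @ to_hom(Tr_velo_to_cam) @ pts_h  # rectified camera frame
    img = P2 @ cam                                          # (3, N) projective image coordinates
    depth = img[2]
    keep = depth > 0                                        # discard points behind the camera
    uv = (img[:2, keep] / depth[keep]).T                    # perspective divide -> (M, 2) pixel coords
    return uv, depth[keep]
```

Points that project inside the image (and inside 2D object proposals) would then be the candidates retained by perspective-based sampling.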
Figure 4. The figure illustrates the vision transformer (ViT)-based cross-fusion approach for 3D object detection, combining camera and LiDAR data. Object perspective sampling extracts features from both sensors. The camera branch (CamViT) generates 3D and 2D feature maps $H_c \in \mathbb{R}^{N_c \times D_c}$ using multi-head attention (MH-Attention) and a feedforward neural network (FFN), while the LiDAR branch (LiDViT) processes 3D voxel features $H_v \in \mathbb{R}^{V \times D_v}$ through a similar transformer architecture. The 2D and 3D feature maps from both modalities are concatenated, $F_{\mathrm{fusion}} = [A_{cv}; H_c; A_{vc}; H_v]$, and undergo cross-attention to align visual and geometric data. A final FFN refines the fused representation, $F_{\mathrm{final}} = \mathrm{MLP}(F_{\mathrm{fusion}})$, providing deep multimodal features for accurate object detection in 3D space.
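As a rough illustration of the cross-attention step described in the Figure 4 caption, the sketch below fuses camera tokens $H_c$ and LiDAR tokens $H_v$ with two multi-head attention blocks and an MLP, mirroring $F_{\mathrm{fusion}} = [A_{cv}; H_c; A_{vc}; H_v]$ and $F_{\mathrm{final}} = \mathrm{MLP}(F_{\mathrm{fusion}})$. The layer widths and the mean-pooling used to align token counts are simplifying assumptions, not the exact design of the Cross-Fusion module.

```python
import torch
import torch.nn as nn

class CrossFusionSketch(nn.Module):
    """Illustrative fusion: F_fusion = [A_cv; H_c; A_vc; H_v], F_final = MLP(F_fusion)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cam_to_lid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lid_to_cam = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(4 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, h_c, h_v):
        # h_c: (B, N_c, D) camera tokens from CamViT; h_v: (B, V, D) LiDAR tokens from LiDViT.
        a_cv, _ = self.cam_to_lid(query=h_c, key=h_v, value=h_v)   # camera queries attend to LiDAR
        a_vc, _ = self.lid_to_cam(query=h_v, key=h_c, value=h_c)   # LiDAR queries attend to camera
        # Mean-pool each stream so the four parts can be concatenated (a simplifying assumption).
        f_fusion = torch.cat([a_cv.mean(1), h_c.mean(1), a_vc.mean(1), h_v.mean(1)], dim=-1)  # (B, 4D)
        return self.mlp(f_fusion)                                   # F_final: (B, D)

# Example: CrossFusionSketch()(torch.randn(2, 128, 256), torch.randn(2, 512, 256))  # -> (2, 256)
```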
Figure 5. Visual results of the proposed method on the KITTI validation dataset. For each case of sub-figures (ad), the top row shows the visualization in the RGB image, and the bottom row displays the visualization in the LiDAR point cloud. Green represents the ground truth, and blue denotes the predicted outcomes.
Figure 6. Visual results of the proposed method on the KITTI test and validation datasets. Row (a) presents the testing results, and row (b) displays the validation outcomes. The detection results demonstrate the effectiveness of our method, with the dotted circles highlighting the undetected instances caused by distance and heavy occlusion.
Figure 7. AP vs. IoU threshold for the Car class under the Moderate setting on the KITTI validation set.
Figure 8. Comparative analysis of the runtime of our model with recent methods.
Table 1. Quantitative performance evaluation of state-of-the-art 3D object detection methods on the KITTI test dataset with AP(3D) and AP(BEV). “Mod.” denotes the modality, where “C” and “L” indicate camera and LiDAR, respectively; inference time is given in seconds. Easy, moderate, and hard difficulty levels are denoted by “E”, “M”, and “H”.
Method | Mod. | Time (s) | 3D E | 3D M | 3D H | 3D mAP | BEV E | BEV M | BEV H | BEV mAP
SECOND [5] | L | 0.05 | 83.13 | 73.66 | 66.20 | 74.33 | 79.37 | 77.95 | 79.37 | 78.90
VoxelNet [27] | L | 0.23 | 77.47 | 65.11 | 57.73 | 66.77 | 89.35 | 79.26 | 77.39 | 82.00
PointPillar [28] | L | 0.02 | 82.58 | 74.31 | 68.99 | 75.29 | 86.56 | 82.81 | 86.56 | 85.31
Pointformer [29] | L | – | 87.13 | 77.06 | 69.25 | 81.48 | – | – | – | –
SVGA-Net [30] | L | 0.03 | 87.33 | 80.47 | 75.91 | 81.24 | 92.07 | 89.88 | 85.59 | 89.18
PointRCNN [8] | L | 0.10 | 86.96 | 75.64 | 70.70 | 77.77 | 87.39 | 82.72 | 87.39 | 85.83
MV3D [17] | C + L | 0.36 | 74.97 | 63.63 | 54.00 | 64.20 | 86.62 | 78.93 | 69.80 | 78.45
AVOD-FPN [16] | C + L | 0.10 | 83.07 | 71.76 | 65.73 | 73.52 | 90.99 | 84.82 | 79.62 | 85.14
F-PointNet [59] | C + L | 0.17 | 82.19 | 69.79 | 60.59 | 70.86 | 91.17 | 84.67 | 74.77 | 83.54
3D-CVF [32] | C + L | 0.16 | 89.20 | 80.05 | 73.11 | 80.79 | 93.52 | 89.56 | 82.45 | 88.51
ContFuse [60] | C + L | 0.16 | 83.68 | 68.78 | 61.67 | 71.38 | 94.07 | 85.35 | 75.88 | 85.10
CM3D [44] | C + L | – | 87.22 | 77.28 | 72.04 | 78.85 | – | – | – | –
MSL3D [43] | C + L | 0.24 | 87.27 | 81.15 | 76.56 | 81.66 | – | – | – | –
EPNet [33] | C + L | 0.10 | 89.81 | 79.28 | 74.59 | 81.23 | 94.22 | 88.47 | 83.69 | 88.79
PointPainting [10] | C + L | 0.40 | 82.11 | 71.70 | 67.08 | 73.63 | 92.45 | 88.11 | 83.36 | 87.97
PI-RCNN [11] | C + L | 0.10 | 84.37 | 74.82 | 70.03 | 76.41 | 91.44 | 85.81 | 81.00 | 86.08
DFIM [34] | C + L | 0.19 | 88.36 | 81.37 | 76.71 | 82.15 | 92.61 | 88.69 | 85.77 | 89.02
StructuralIF [31] | C + L | 0.12 | 87.15 | 80.69 | 76.26 | 81.37 | 91.78 | 88.38 | 85.67 | 88.61
PLC-Fusion | C + L | 0.18 | 89.69 | 82.73 | 77.82 | 83.52 | 93.75 | 89.87 | 86.42 | 90.34
Table 2. Quantitative performance evaluation of state-of-the-art 3D object detection methods on the KITTI validation dataset with AP(3D) and AP(BEV). “Mod.” denotes the modality, where “C” and “L” indicate camera and LiDAR, respectively. Easy, moderate, and hard difficulty levels are denoted by “E”, “M”, and “H”.
Method | Mod. | 3D E | 3D M | 3D H | 3D mAP | BEV E | BEV M | BEV H | BEV mAP
SECOND [5] | L | 88.61 | 78.62 | 77.22 | 81.48 | 89.96 | 87.07 | 79.66 | 85.56
VoxelNet [27] | L | 81.97 | 65.46 | 62.85 | 70.09 | 89.60 | 84.81 | 78.57 | 84.33
PointRCNN [8] | L | 88.72 | 78.61 | 77.82 | 81.72 | – | – | – | –
PointPillar [28] | L | 86.46 | 77.28 | 74.65 | 79.46 | – | – | – | –
SVGA-Net [30] | L | 90.59 | 80.23 | 79.15 | 83.32 | 90.27 | 89.16 | 88.11 | 89.18
F-PointNet [59] | C + L | 83.76 | 70.92 | 63.65 | 72.78 | 88.16 | 84.02 | 76.44 | 82.87
MV3D [17] | C + L | 71.29 | 62.68 | 56.56 | 63.51 | 86.55 | 78.10 | 76.67 | 80.44
3D-CVF [32] | C + L | 89.67 | 79.88 | 78.47 | 82.67 | – | – | – | –
AVOD-FPN [16] | C + L | 84.41 | 74.44 | 68.65 | 75.83 | – | – | – | –
EPNet [33] | C + L | 92.28 | 82.59 | 80.14 | 85.00 | 95.51 | 88.76 | 88.36 | 90.88
CM3D [44] | C + L | 91.08 | 83.19 | 77.12 | 83.80 | – | – | – | –
StructuralIF [31] | C + L | 92.78 | 85.38 | 83.45 | 87.20 | – | – | – | –
MSL3D [43] | C + L | 91.78 | 84.71 | 82.20 | 86.23 | 94.35 | 90.68 | 88.40 | 91.13
DFIM [34] | C + L | 92.15 | 86.04 | 84.26 | 87.48 | 95.09 | 90.72 | 88.85 | 91.55
PLC-Fusion | C + L | 93.08 | 87.05 | 85.10 | 88.17 | 96.06 | 91.27 | 89.58 | 92.44
Table 3. Quantitative performance evaluation of state-of-the-art 3D object detection methods on the KITTI test dataset with AP(3D) and AP(BEV) for the “Pedestrian” and “Cyclist” classes.
Method | Mod. | Pedestrian AP 3D E | M | H | Cyclist AP 3D E | M | H
AVOD-FPN [16] | C + L | 50.73 | 42.54 | 39.31 | 64.03 | 50.82 | 45.20
StructuralIF [31] | C + L | 50.80 | 42.42 | 38.35 | 72.54 | 56.39 | 49.28
PLC-Fusion | C + L | 51.19 | 43.13 | 39.73 | 77.25 | 60.50 | 54.06
Table 4. Comparison of various 3D object detection methods based on memory usage, parallelism, speed, and input scale.
MethodMem.Paral.Speed⊥Speed‖Input Scale
PointRCNN324 MB2947592∼9 k
PointPillars561 MB19111516,384
F-PointNet1223 MB18381011∼17 k
SVGA-Net802 MB19112816,384
PLC-Fusion751 MB21234816,348
Table 5. Ablation study results, reported as average precision (AP) percentages for 3D and BEV detection across three difficulty levels: Easy (E), Moderate (M), and Hard (H). The methods compared include a baseline and four variations of the PLC fusion strategy: early fusion, concatenation, summation, and multiplication.
Method | 3D E | 3D M | 3D H | BEV E | BEV M | BEV H
Baseline | 90.18 | 81.44 | 81.02 | 91.35 | 84.37 | 83.46
PLC (Early fusion) | 91.40 | 83.19 | 82.43 | 93.48 | 86.35 | 85.24
PLC (Concatenation) | 92.14 | 85.36 | 83.53 | 94.92 | 89.18 | 88.47
PLC (Summation) | 92.29 | 85.47 | 83.68 | 95.09 | 89.34 | 88.65
PLC (Multiplication) | 93.08 | 87.05 | 85.10 | 96.06 | 91.27 | 89.58
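For reference, the fusion variants compared in Table 5 can be thought of as differing only in the operator that combines the camera and LiDAR features once they share a common shape. The snippet below is a minimal, hypothetical sketch of such operators; it is not the implementation used in this work.

```python
import torch

def fuse(feat_cam, feat_lidar, mode="multiplication"):
    """Combine camera and LiDAR features of identical shape with a chosen operator."""
    if mode == "concatenation":
        return torch.cat([feat_cam, feat_lidar], dim=-1)   # doubles the channel width
    if mode == "summation":
        return feat_cam + feat_lidar
    if mode == "multiplication":
        return feat_cam * feat_lidar                        # element-wise gating of one stream by the other
    raise ValueError(f"unknown fusion mode: {mode}")

f = fuse(torch.randn(16, 256), torch.randn(16, 256), mode="concatenation")  # -> (16, 512)
```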
Table 6. Ablation study on the impact of different backbone configurations and fusion methods for the image and LiDAR modalities. The results are reported as mean average precision (mAP) for 3D object detection (mAP 3D) and bird’s-eye view detection (mAP BEV). The experiment compares different combinations of CamViT, LiDViT, and their fusion on a base model consisting of an image backbone (V2-99) and a LiDAR backbone (VoxelNet).
Modality | Image Backbone | LiDAR Backbone | CamViT | LiDViT | Fusion | mAP 3D | mAP BEV
C + L | V2-99 | VoxelNet |  |  |  | 86.35 | 90.29
C + L | V2-99 | VoxelNet |  |  |  | 87.09 | 90.73
C + L | V2-99 | VoxelNet |  |  |  | 87.42 | 91.57
C + L | V2-99 | VoxelNet |  |  |  | 88.17 | 92.44
Table 7. Evaluation of different configurations of multimodal feature extraction (MFE), CamViT, LiDViT, and their fusion in terms of computational efficiency and detection performance. The results report inference time (ms), memory usage (MB), runtime (RT), bird’s-eye view accuracy (BEV), and 3D accuracy. Various combinations of the components are assessed for their impact on both performance and resource consumption.
OPS | CamViT | LiDViT | Fusion | Time (ms) | Mem. (MB) | RT | BEV (%) | 3D (%)
 |  |  |  | 28.0 | 19,500 | 9.5 | 86.95 | 90.82
 |  |  |  | 23.5 | 12,550 | 11.0 | 87.29 | 91.07
 |  |  |  | 25.0 | 12,700 | 10.6 | 87.51 | 91.48
 |  |  |  | 16 | 11,900 | 12.0 | 87.96 | 91.89
 |  |  |  | 13.5 | 4500 | 20.8 | 88.17 | 92.44
Table 8. Performance comparison of various methods for 3D object detection (AP3D) and bird’s-eye view detection (APBEV) across different distance ranges: 0–20 m, 20–40 m, and beyond 40 m. The table compares StructuralIF, MSL3D, DFIM, and PLC-Fusion, with results reported as average precision (AP) percentages for both overall detection and specific distance ranges. PLC-Fusion achieves the highest overall performance in both AP3D and APBEV across all distance intervals.
Method | AP3D Overall | AP3D 0–20 m | AP3D 20–40 m | AP3D > 40 m | APBEV Overall | APBEV 0–20 m | APBEV 20–40 m | APBEV > 40 m
StructuralIF [31] | 85.38 | 93.88 | 73.83 | 35.36 | – | – | – | –
MSL3D [43] | 84.71 | 94.38 | 75.91 | 31.95 | 90.68 | 95.24 | 88.37 | 51.39
DFIM [34] | 86.04 | 95.24 | 75.98 | 35.97 | 90.72 | 94.38 | 88.57 | 53.76
PLC-Fusion | 87.05 | 96.63 | 79.13 | 37.28 | 91.27 | 95.37 | 90.70 | 54.38
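The distance-range breakdown in Table 8 groups objects by their planar range from the ego sensor before computing AP within each bin. The helper below is a hypothetical sketch of such binning, assuming boxes are expressed in the LiDAR frame with (x, y) as the ground-plane offsets; it is not the benchmark's official evaluation code.

```python
import numpy as np

def bin_by_distance(boxes_3d, edges=(0.0, 20.0, 40.0, np.inf)):
    """Assign each 3D box (x, y, z, ...) to a distance bin: 0 -> 0-20 m, 1 -> 20-40 m, 2 -> beyond 40 m."""
    dist = np.linalg.norm(boxes_3d[:, :2], axis=1)   # planar distance from the sensor origin
    return np.digitize(dist, edges[1:-1])            # bin index per box
```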
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
