Abstract:
Objective: Multimodal three-dimensional object detection combines camera semantics with LiDAR geometry and plays a central role in autonomous driving perception. Yet performance often degrades in complex traffic due to unstable semantics, sparse geometry, and cross-modal misalignment. Camera features become unreliable under glare, low illumination, motion blur, and cluttered backgrounds, introducing uncertainty into bird’s-eye-view (BEV) projection. LiDAR point clouds are sparse at long range or in adverse weather, weakening local geometry and causing boundary blur and missed small objects. In BEV fusion, misalignment between modalities introduces conflicting cues and redundant responses, reducing confidence and accuracy. This study proposes an efficient BEV detector that enhances camera semantics, stabilizes sparse geometry, and achieves adaptive cross-modal fusion for improved accuracy and robustness without excessive computational cost.
Methods: The proposed framework, MEPFusion (multi-layer perception enhancement and gated alignment fusion), improves camera semantics, stabilizes sparse-geometry modeling, and enables gated alignment fusion in BEV space. For camera encoding, SimPVT enhances global context and salient-region representation with low overhead. A pyramid vision transformer backbone extracts multi-scale features to capture global and local dependencies. A parameter-free saliency reweighting unit amplifies target-related activations and suppresses background distraction, improving semantic separability. To reduce computation, the highest-resolution stage is removed while preserving three-scale outputs for projection. Features are lifted to BEV space via a lift–splat–shoot transformation supervised by depth estimation, stabilizing the depth distribution and reducing projection noise to improve spatial consistency. This camera branch aims to provide more reliable semantics under appearance variations, reducing texture-induced false positives and improving recall for visually small instances.
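The parameter-free saliency reweighting unit can be illustrated with a short sketch. The abstract does not give the unit's exact form, so this follows a SimAM-style energy formulation (an assumption): activations that deviate strongly from their per-channel mean are treated as salient and amplified through a sigmoid gate, with no learnable parameters.

```python
import numpy as np

def parameter_free_saliency_reweight(x, eps=1e-4):
    """Parameter-free saliency reweighting of a (C, H, W) feature map.

    SimAM-style energy gating (illustrative assumption, not the paper's
    exact unit): per-channel squared deviation from the mean is normalized
    by the channel variance, then passed through a sigmoid to form a
    multiplicative gate that amplifies target-related activations and
    suppresses near-mean background responses.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    energy = d / (4.0 * (var + eps)) + 0.5      # higher for salient cells
    gate = 1.0 / (1.0 + np.exp(-energy))        # sigmoid, always > 0.5 here
    return x * gate
```

Because the gate is computed from channel statistics alone, the unit adds no parameters, consistent with the low-overhead design goal stated above.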
For LiDAR encoding, raw point clouds are voxelized and processed through a SECOND-style sparse convolution backbone. Standard sparse convolutions are replaced with spatial density-modulated convolution (SDMConv) to address equal-weight aggregation in sparse regions. SDMConv injects a position-dependent density prior into convolution responses via multiplicative modulation, producing a center-enhanced and periphery-decayed pattern within the receptive field. The modulation mitigates feature dilution and preserves boundary cues and shape continuity, especially in long-range or partially occluded areas, without increasing parameter count. The resulting sparse 3D features are compressed along height to produce LiDAR BEV features. In practice, SDMConv acts as a plug-in operator that strengthens local geometry when point evidence is weak, benefiting thin structures and small obstacles that are easily under-sampled.
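The multiplicative density modulation in SDMConv can be sketched on a toy dense analogue. The real operator works on sparse voxel features; here a Gaussian radial prior (an assumed functional form) multiplies each kernel tap of a plain 2D convolution, producing the center-enhanced, periphery-decayed response pattern described above without adding any parameters.

```python
import numpy as np

def density_modulated_conv2d(x, weight, sigma=1.0):
    """Toy dense analogue of SDMConv: (H, W) input, (k, k) kernel.

    A position-dependent radial prior exp(-r^2 / (2*sigma^2)) multiplies
    each kernel tap (the Gaussian form is an illustrative assumption),
    enhancing the receptive-field center and decaying its periphery.
    The prior is fixed, so no parameters are added.
    """
    k = weight.shape[0]
    r = k // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    prior = np.exp(-(ys ** 2 + xs ** 2) / (2.0 * sigma ** 2))
    w = weight * prior  # multiplicative modulation of the kernel
    h, wd = x.shape
    out = np.zeros((h - 2 * r, wd - 2 * r))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out
```

On a uniform input the modulated response is strictly smaller than the unmodulated sum, reflecting the down-weighting of peripheral taps relative to the center.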
Cross-modal fusion occurs in BEV space through a gated alignment fusion module (GAFM) designed to suppress misalignment while maintaining complementarity. Each modality undergoes channel recalibration to emphasize discriminative channels. A residual mutual-guidance path performs bidirectional correction: LiDAR features refine camera BEV features in geometry-sensitive regions, while camera semantics complement LiDAR features where geometric evidence is weak. Spatial gates are predicted for each BEV cell and normalized via softmax to assign adaptive fusion weights, allowing the more reliable modality to dominate locally while reducing inconsistent responses. The fused BEV features are decoded by a multi-scale neck and a transformer-based detection head to produce 3D bounding boxes and attributes. This strategy improves optimization stability and encourages reliability-aware weighting rather than uniform mixing.
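The per-cell softmax gating at the heart of GAFM can be sketched as follows. In the full module the gate logits are predicted by small convolutional heads after channel recalibration; here they are passed in as inputs (a simplification made to keep the sketch self-contained), and the channel-recalibration and mutual-guidance paths are omitted.

```python
import numpy as np

def softmax_gated_fusion(f_cam, f_lidar, g_cam, g_lidar):
    """Per-cell softmax gating of two BEV feature maps, each (C, H, W).

    g_cam and g_lidar are (H, W) gate logits (assumed given; GAFM would
    predict them). Softmax normalization over the two modalities yields
    adaptive weights that sum to one per BEV cell, letting the locally
    more reliable modality dominate instead of uniform mixing.
    """
    logits = np.stack([g_cam, g_lidar])                   # (2, H, W)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(logits)
    w = e / e.sum(axis=0, keepdims=True)                  # per-cell weights
    return w[0] * f_cam + w[1] * f_lidar                  # broadcast over C
```

With equal logits the fusion degrades gracefully to an average; as one modality's gate logit grows, the output converges to that modality's features for the affected cells.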
Results and Discussion: Extensive experiments on the nuScenes benchmark demonstrate the effectiveness of MEPFusion. With six cameras and one LiDAR input, MEPFusion achieves 67.1% mAP and 70.8% NDS, surpassing the BEVFusion baseline by 2.8 and 1.7 percentage points, respectively. Regression errors in translation, scale, orientation, velocity, and attribute are reduced, confirming improved spatial alignment and motion estimation. Category-level analysis shows notable gains for small or geometry-sensitive objects (e.g., traffic cones, barriers, motorcycles, and pedestrians) while maintaining improvements for major vehicle classes. These trends align with module contributions: SimPVT strengthens semantics in cluttered scenes, SDMConv alleviates sparsity-induced degradation, and GAFM reduces cross-modal interference through adaptive gating and complementary fusion. Qualitative inspection suggests fewer duplicated boxes and cleaner spatial responses in crowded regions, consistent with suppressing redundant activations caused by misalignment.
Distance-stratified evaluation further verifies consistent improvements across near, mid, and far ranges. Gains remain clear even at far range, where point sparsity and scale ambiguity are most severe, highlighting the benefits of density-aware encoding and local gating. Robustness tests under adverse conditions also validate generalization. On the night subset, MEPFusion improves mAP and NDS while reducing error terms, indicating stable semantics and selective fusion under low illumination and glare. On the rain subset, performance again improves, confirming resilience to reduced contrast and weaker LiDAR returns. Ablation studies reveal both independent and synergistic effects: SimPVT alone enhances accuracy with fewer parameters; SDMConv and GAFM each bring additional gains; combining all three yields the best overall performance. Efficiency analysis shows that SDMConv introduces minimal overhead and that the full model maintains comparable throughput with fewer total parameters, indicating a favorable accuracy–efficiency trade-off for on-vehicle inference.
Conclusions: In summary, MEPFusion integrates semantic enhancement, sparse-geometry reinforcement, and gated alignment fusion within a unified BEV framework. The method achieves higher accuracy, improved regression stability, and consistent robustness across distance and environmental variations while retaining deployment-friendly efficiency. Stable semantics before view transformation, density-modulated LiDAR encoding, and adaptive BEV fusion jointly mitigate misalignment and suppress redundant noise. MEPFusion thus offers an effective and reliable multimodal perception solution for urban autonomous driving.