• Abstract: To address limited image semantics, geometric degradation caused by sparse point clouds, and difficult cross-modal alignment in multimodal 3D object detection, we propose MEPFusion, a BEV object detection method based on multi-layer perception enhancement and gated alignment fusion. First, the camera branch employs SimPVT to strengthen global semantics and salient-region modeling. Second, the LiDAR branch introduces a spatial density-modulated convolution that injects a density prior through position-dependent weights, stabilizing local geometry in sparse scenes. Finally, the fusion stage builds a dual-modality gated alignment module that suppresses mismatches and performs dynamic calibration via channel and spatial gating, enabling dynamic alignment and complementary interaction of BEV features. Experiments show that the proposed method achieves 67.1% mAP and 70.8% NDS on nuScenes, improvements of 2.8% and 1.7% over BEVFusion, respectively, significantly enhancing detection accuracy and robustness. Ablation results verify that MEPFusion delivers independent gains in semantic feature extraction, geometric modeling, and cross-modal alignment, together with synergistic effects, achieving a better overall balance among fusion quality, detection performance, and computational efficiency, and offering an efficient, reliable solution for 3D object detection in autonomous driving.

      Abstract:
      Objective: Multimodal three-dimensional object detection combines camera semantics with LiDAR geometry and plays a central role in autonomous driving perception. Yet performance often degrades in complex traffic due to unstable semantics, sparse geometry, and cross-modal misalignment. Camera features become unreliable under glare, low illumination, motion blur, and cluttered backgrounds, introducing uncertainty into bird’s-eye-view (BEV) projection. LiDAR point clouds are sparse at long range or in adverse weather, weakening local geometry and causing boundary blur and missed small objects. In BEV fusion, misalignment between modalities introduces conflicting cues and redundant responses, reducing confidence and accuracy. This study proposes an efficient BEV detector that enhances camera semantics, stabilizes sparse geometry, and achieves adaptive cross-modal fusion for improved accuracy and robustness without excessive computational cost.
      Methods: The proposed framework, MEPFusion (multi-layer perception enhancement and gated alignment fusion), improves camera semantics, stabilizes sparse-geometry modeling, and enables gated alignment fusion in BEV space. For camera encoding, SimPVT enhances global context and salient-region representation with low overhead. A pyramid vision transformer backbone extracts multi-scale features to capture global and local dependencies. A parameter-free saliency reweighting unit amplifies target-related activations and suppresses background distraction, improving semantic separability. To reduce computation, the highest-resolution stage is removed while preserving three-scale outputs for projection. Features are lifted to BEV space via a lift–splat–shoot transformation supervised by depth estimation, stabilizing the depth distribution and reducing projection noise to improve spatial consistency. This camera branch aims to provide more reliable semantics under appearance variations, reducing texture-induced false positives and improving recall for visually small instances.
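The parameter-free saliency reweighting described above can be illustrated with a minimal pure-Python sketch. The abstract does not specify the unit's exact formulation, so a SimAM-style energy term is assumed here: activations that deviate strongly from the channel mean are treated as salient and amplified through a sigmoid gate, with no learned parameters. The function name `saliency_reweight` and the constant `lam` are illustrative, not from the paper.

```python
import math

def saliency_reweight(feat, lam=1e-4):
    """Parameter-free saliency reweighting of one single-channel 2D feature map.

    Sketch under a SimAM-style assumption: activations far from the channel
    mean receive a higher energy, hence a larger sigmoid gate, which amplifies
    target-related responses and relatively suppresses flat background.
    `lam` is an assumed stabilising constant, not a learned parameter.
    """
    vals = [v for row in feat for v in row]
    n = len(vals)
    mu = sum(vals) / n
    var = sum((v - mu) ** 2 for v in vals) / n
    out = []
    for row in feat:
        new_row = []
        for v in row:
            energy = (v - mu) ** 2 / (4.0 * (var + lam)) + 0.5
            gate = 1.0 / (1.0 + math.exp(-energy))  # sigmoid gate in (0.5, 1)
            new_row.append(v * gate)
        out.append(new_row)
    return out
```

Because the gate depends only on channel statistics, the unit adds no parameters, consistent with the low-overhead design goal stated above.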
      For LiDAR encoding, raw point clouds are voxelized and processed through a SECOND-style sparse convolution backbone. Standard sparse convolutions are replaced with spatial density-modulated convolution (SDMConv) to address equal-weight aggregation in sparse regions. SDMConv injects a position-dependent density prior into convolution responses via multiplicative modulation, producing a center-enhanced and periphery-decayed pattern within the receptive field. The modulation mitigates feature dilution and preserves boundary cues and shape continuity, especially in long-range or partially occluded areas, without increasing parameter count. The resulting sparse 3D features are compressed along height to produce LiDAR BEV features. In practice, SDMConv acts as a plug-in operator that strengthens local geometry when point evidence is weak, benefiting thin structures and small obstacles that are easily under-sampled.
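The multiplicative density modulation in SDMConv can be sketched as follows for a single 3x3 receptive field on a sparse grid. The exact prior is not given in the abstract, so this sketch assumes a Gaussian-style distance decay `exp(-beta * r^2)` (center-enhanced, periphery-decayed) scaled by the local point density; the names `sdm_modulation`, `sdm_response`, and the hyperparameter `beta` are illustrative.

```python
import math

def sdm_modulation(occupancy, beta=0.5):
    """Position-dependent modulation weights over a 3x3 receptive field.

    Hedged sketch of SDMConv's density prior: each kernel offset is weighted
    by the normalised local point density and a distance decay, so sparse
    peripheral neighbours are down-weighted instead of aggregated with equal
    weight. `occupancy[dy + 1][dx + 1]` holds the point count at each offset;
    `beta` is an assumed decay hyperparameter.
    """
    total = sum(sum(row) for row in occupancy) or 1
    weights = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            density = occupancy[dy + 1][dx + 1] / total     # local density prior
            decay = math.exp(-beta * (dx * dx + dy * dy))   # centre-enhanced, periphery-decayed
            weights[(dy, dx)] = density * decay
    return weights

def sdm_response(features, weights):
    """Modulated aggregation: each neighbour feature scaled by its weight."""
    return sum(features[(dy, dx)] * w for (dy, dx), w in weights.items())
```

Since the modulation is computed from occupancy rather than learned, the operator adds no parameters, matching the plug-in behaviour described above.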
      Cross-modal fusion occurs in BEV space through a gated alignment fusion module (GAFM) designed to suppress misalignment while maintaining complementarity. Each modality undergoes channel recalibration to emphasize discriminative channels. A residual mutual-guidance path performs bidirectional correction: LiDAR features refine camera BEV features in geometry-sensitive regions, while camera semantics complement LiDAR features where geometric evidence is weak. Spatial gates are predicted for each BEV cell and normalized via softmax to assign adaptive fusion weights, allowing the more reliable modality to dominate locally while reducing inconsistent responses. The fused BEV features are decoded by a multi-scale neck and a transformer-based detection head to produce 3D bounding boxes and attributes. This strategy improves optimization stability and encourages reliability-aware weighting rather than uniform mixing.
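The per-cell softmax gating at the core of GAFM can be sketched in a few lines. In the model the reliability logits would be predicted by small convolutional heads after channel recalibration; here they are passed in directly, which is a simplifying assumption, and `gated_fuse` is an illustrative name.

```python
import math

def gated_fuse(cam_feat, lidar_feat, cam_logit, lidar_logit):
    """Per-cell gated fusion of camera and LiDAR BEV features.

    Minimal sketch of GAFM's spatial gating: for every BEV cell, a softmax
    over the two modality logits yields adaptive fusion weights, so the
    locally more reliable modality dominates instead of uniform mixing.
    """
    fused = []
    for c, l, gc, gl in zip(cam_feat, lidar_feat, cam_logit, lidar_logit):
        m = max(gc, gl)                              # shift for numerical stability
        ec, el = math.exp(gc - m), math.exp(gl - m)
        wc, wl = ec / (ec + el), el / (ec + el)      # softmax over the two modalities
        fused.append(wc * c + wl * l)
    return fused
```

With equal logits the gate reduces to an average; as one logit grows, the fused value converges to that modality's feature, which is the reliability-aware behaviour described above.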
      Results and Discussions: Extensive experiments on the nuScenes benchmark demonstrate the effectiveness of MEPFusion. With six cameras and one LiDAR input, MEPFusion achieves 67.1% mAP and 70.8% NDS, surpassing the BEVFusion baseline by 2.8% and 1.7%, respectively. Regression errors in translation, scale, orientation, velocity, and attribute are reduced, confirming improved spatial alignment and motion estimation. Category-level analysis shows notable gains for small or geometry-sensitive objects (e.g., traffic cones, barriers, motorcycles, and pedestrians) while maintaining improvements for major vehicle classes. These trends align with module contributions: SimPVT strengthens semantics in cluttered scenes, SDMConv alleviates sparsity-induced degradation, and GAFM reduces cross-modal interference through adaptive gating and complementary fusion. Qualitative inspection suggests fewer duplicated boxes and cleaner spatial responses in crowded regions, consistent with suppressing redundant activations caused by misalignment.
      Distance-stratified evaluation further verifies consistent improvements across near, mid, and far ranges. The far range, where point sparsity and scale ambiguity are most severe, still achieves clear gains, highlighting the benefits of density-aware encoding and local gating. Robustness tests under adverse conditions also validate generalization. On the night subset, MEPFusion improves mAP and NDS while reducing error terms, indicating stable semantics and selective fusion under low illumination and glare. On the rain subset, performance again improves, confirming resilience to reduced contrast and weaker LiDAR returns. Ablation studies reveal both independent and synergistic effects: SimPVT alone enhances accuracy with fewer parameters; SDMConv and GAFM each bring additional gains; combining all three yields the best overall performance. Efficiency analysis shows SDMConv introduces minimal overhead, and the full model maintains comparable throughput with reduced total parameters, indicating a favorable accuracy–efficiency trade-off for on-vehicle inference.
      Conclusions: In summary, MEPFusion integrates semantic enhancement, sparse-geometry reinforcement, and gated alignment fusion within a unified BEV framework. The method achieves higher accuracy, improved regression stability, and consistent robustness across distance and environmental variations while retaining deployment-friendly efficiency. Stable semantics before view transformation, density-modulated LiDAR encoding, and adaptive BEV fusion jointly mitigate misalignment and suppress redundant noise. MEPFusion thus offers an effective and reliable multimodal perception solution for urban autonomous driving.