Monocular 3D lane detection with a hybrid encoder and attention mechanism

郑奎; 黄影平; 罗鑫

doi:10.12086/oee.2026.250359

HA-Lane Net: Hybrid feature encoding and attention for monocular 3D lane detection

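Abstract: Three-dimensional lane detection is a key technology for autonomous driving; it aims to recover the geometry of lane lines in three-dimensional space from two-dimensional images. Constrained by the uncertainty of 3D information and by representational ambiguity in complex scenes, existing methods based on convolutional neural networks often face a trade-off between feature resolution and semantic aggregation: shallow features preserve geometric detail more faithfully, whereas deep features are more semantically discriminative but prone to spatial quantization loss and localization offset. To address this, this paper proposes a new 3D lane detection network, HA-Lane Net: 1) a hybrid encoder is designed that uses shallow VGG layers to capture fine-grained texture and deep ResNet layers to extract high-level semantics, building a balanced feature hierarchy through channel adaptation and hierarchical coupling; 2) a pre-projection channel attention mechanism is introduced at the output layer to adaptively recalibrate lane-related channels, strengthening cross-view consistency and suppressing background interference. Experimental results on the Apollo-Sim3D dataset show that the method effectively alleviates cross-scale and cross-view inconsistency. Without introducing complex structures such as Transformers, HA-Lane Net achieves significant improvements in performance metrics over 3D-LaneNet and Gen-LaneNet.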

Abstract:

Objective Three-dimensional lane detection provides essential geometric information for autonomous driving systems, supporting vehicle localization, trajectory planning, and lateral control. Compared with traditional two-dimensional lane detection, three-dimensional lane detection aims to recover the spatial structure of lane markings directly from monocular images. This task remains challenging because monocular imagery lacks explicit depth cues and is affected by perspective distortion, scale variation, and complex road environments. Recent studies have increasingly adopted bird's-eye-view representations that transform front-view image features into a top-down spatial layout. Such representations simplify geometric reasoning and facilitate the estimation of lane structures in three-dimensional space. However, existing convolutional neural network architectures still face difficulties in simultaneously preserving detailed lane geometry and extracting high-level semantic information. Shallow layers capture fine textures and structural boundaries, whereas deeper layers emphasize semantic abstraction and contextual understanding. Excessive dependence on deep features may introduce localization offsets after spatial projection, while shallow features alone are insufficient for reliable detection under challenging conditions such as illumination changes, occlusions, or background interference. A network architecture capable of integrating geometric detail with semantic representation is therefore important for improving monocular three-dimensional lane detection.

Methods To address this issue, a monocular three-dimensional lane detection framework named HA-Lane Net was developed. The framework focuses on constructing balanced feature representations through a hybrid encoding strategy together with an attention-based feature refinement mechanism.

The backbone of the proposed framework is a hybrid encoder that combines complementary properties of two classical convolutional architectures. Shallow layers derived from VGG were used to capture detailed textures and structural patterns in lane regions. These features preserve spatial cues that are important for accurate geometric reconstruction. In parallel, deeper layers based on ResNet were employed to extract semantic representations and contextual information from complex road scenes. To integrate these heterogeneous features effectively, channel adaptation and hierarchical coupling mechanisms were introduced. These operations align feature dimensions and progressively fuse shallow geometric cues with deep semantic representations across multiple network stages. As a result, the encoder constructs a multi-scale feature hierarchy that maintains geometric precision while retaining strong semantic discrimination.
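The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give layer counts, channel widths, or the exact coupling rule, so the shapes, the random weights, the additive fusion, and the helper names (`conv1x1`, `upsample_nearest`, `fuse`) are all assumptions chosen for illustration. Channel adaptation is modelled as a 1x1 convolution, which is a per-pixel linear map over channels.

```python
import numpy as np

def conv1x1(x, weight):
    """Channel adaptation: a 1x1 convolution over a (C_in, H, W) feature map.

    weight has shape (C_out, C_in); the result has shape (C_out, H, W).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(shallow, deep, w_shallow, w_deep):
    """Couple one shallow and one deep stage at the shallow resolution.

    Both inputs are first mapped to a common channel width, the deep map is
    upsampled to match the shallow one, and the two are summed (assumed rule).
    """
    factor = shallow.shape[1] // deep.shape[1]
    s = conv1x1(shallow, w_shallow)
    d = upsample_nearest(conv1x1(deep, w_deep), factor)
    return np.maximum(s + d, 0.0)  # ReLU after the additive coupling

rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 32, 32))  # stand-in for a shallow VGG-style map
deep = rng.standard_normal((256, 8, 8))      # stand-in for a deep ResNet-style map
w_s = rng.standard_normal((128, 64)) * 0.1   # hypothetical adapter weights
w_d = rng.standard_normal((128, 256)) * 0.1  # hypothetical adapter weights
fused = fuse(shallow, deep, w_s, w_d)
print(fused.shape)
```

Summation keeps the channel width fixed across stages; concatenation followed by another 1x1 convolution would be an equally plausible reading of "hierarchical coupling".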

Feature refinement was further introduced before spatial projection through a pre-projection channel attention mechanism. Inspired by the squeeze-and-excitation principle, this module recalibrates feature responses by strengthening channels related to lane structures and suppressing background signals. Applying attention prior to the front-view-to-bird's-eye-view projection improves feature consistency across viewpoints and reduces the influence of image blur, lighting variation, and environmental clutter. This design helps preserve structural lane information during the spatial transformation process.
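For readers unfamiliar with the squeeze-and-excitation principle, the following NumPy sketch shows the generic recalibration step on a single feature map. It is a plain squeeze-and-excitation block under stated assumptions, not the module used in HA-Lane Net: the reduction ratio, the random weights, and the function name `se_channel_attention` are hypothetical, and bias terms are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel recalibration.

    x:  feature map of shape (C, H, W)
    w1: reduction weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    Returns the reweighted map and the per-channel gates in (0, 1).
    """
    squeezed = x.mean(axis=(1, 2))           # squeeze: global average pool per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)  # excitation, step 1: reduce + ReLU
    gates = sigmoid(w2 @ hidden)             # excitation, step 2: expand + sigmoid
    return x * gates[:, None, None], gates   # scale every channel by its gate

rng = np.random.default_rng(0)
channels, ratio = 128, 16                    # assumed width and reduction ratio
features = rng.standard_normal((channels, 32, 32))
w1 = rng.standard_normal((channels // ratio, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // ratio)) * 0.1
refined, gates = se_channel_attention(features, w1, w2)
print(refined.shape, gates.shape)
```

In a pre-projection arrangement, the reweighted map `refined` would be what is passed to the view transformation, so that channels with low gates contribute less to the bird's-eye-view features.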

Training stability was improved using a cosine annealing learning-rate schedule, which facilitates smoother optimization of the multi-task loss function. The framework was trained and evaluated on the Apollo-Sim3D dataset, which contains diverse simulated driving scenes with accurate three-dimensional lane annotations. During inference, the network directly predicts spatial coordinates and shapes of lane markings from monocular images, enabling reconstruction of lane geometry in three-dimensional space without relying on additional sensors.
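A cosine annealing schedule lowers the learning rate from a maximum to a minimum along half a cosine period. The sketch below shows the standard formula without warm restarts; the step count and the two learning-rate bounds are illustrative values, not settings reported in the paper, and `cosine_annealing_lr` is a hypothetical helper name.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` under cosine annealing without restarts.

    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps))
    """
    progress = min(max(step, 0), total_steps) / total_steps  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only: 100 steps, decaying from 1e-3 to 1e-5.
schedule = [cosine_annealing_lr(t, 100, 1e-3, 1e-5) for t in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate changes slowly near the start and the end of training and fastest in the middle, which avoids the abrupt drops of a step schedule.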

Results and Discussions Experiments on the Apollo-Sim3D dataset show that HA-Lane Net improves three-dimensional lane detection performance compared with representative monocular approaches such as 3D-LaneNet and Gen-LaneNet. The hybrid encoder effectively integrates geometric detail with semantic context, enabling more accurate estimation of lane geometry.

The pre-projection attention mechanism further improves robustness by highlighting lane-related feature responses and suppressing irrelevant background information. This targeted refinement enhances feature consistency during the transformation from front-view images to bird's-eye-view representations and reduces localization errors caused by occlusion, illumination variation, or scene complexity. Consequently, the network produces more stable detection results across different driving scenarios.

Quantitative comparisons indicate improved accuracy in three-dimensional lane coordinate estimation relative to the reference methods. Detection results also show increased stability across diverse road environments. These improvements arise from the combined effect of hybrid feature encoding and attention-based feature refinement, which strengthen lane structure representation while mitigating background noise. In addition, the proposed framework maintains a relatively efficient architecture. Performance improvements are achieved without introducing Transformer-based components or other computationally intensive modules, allowing the network to retain manageable complexity for real-time perception tasks.

Conclusions A monocular three-dimensional lane detection framework named HA-Lane Net has been presented for improved lane perception in autonomous driving environments. The proposed approach integrates a hybrid encoder that combines shallow VGG layers with deep ResNet layers, enabling balanced extraction of geometric detail and semantic representation. A pre-projection channel attention mechanism further refines lane-related features before spatial transformation, improving robustness against environmental interference.

Experimental evaluation on the Apollo-Sim3D dataset shows that the proposed framework improves three-dimensional lane detection performance compared with representative monocular approaches such as 3D-LaneNet and Gen-LaneNet. The hybrid encoding strategy together with attention-based refinement enables more reliable estimation of lane geometry while maintaining a relatively efficient network structure. These results suggest that the proposed framework provides a practical solution for monocular three-dimensional lane perception and offers useful insights for future research on autonomous driving scene understanding.
