Abstract:
Objective Three-dimensional lane detection provides essential geometric information for autonomous driving systems, supporting vehicle localization, trajectory planning, and lateral control. Unlike traditional two-dimensional lane detection, which localizes lane markings only in the image plane, monocular three-dimensional lane detection aims to recover their spatial structure directly from a single camera image. This task remains challenging because monocular imagery lacks explicit depth cues and is affected by perspective distortion, scale variation, and complex road environments. Recent studies have increasingly adopted bird's-eye-view representations that transform front-view image features into a top-down spatial layout. Such representations simplify geometric reasoning and facilitate the estimation of lane structures in three-dimensional space. However, existing convolutional neural network architectures still struggle to preserve detailed lane geometry while also extracting high-level semantic information. Shallow layers capture fine textures and structural boundaries, whereas deeper layers emphasize semantic abstraction and contextual understanding. Excessive dependence on deep features may introduce localization offsets after spatial projection, while shallow features alone are insufficient for reliable detection under challenging conditions such as illumination changes, occlusions, or background interference. A network architecture capable of integrating geometric detail with semantic representation is therefore important for improving monocular three-dimensional lane detection.
Methods To address this issue, a monocular three-dimensional lane detection framework named HA-Lane Net was developed. The framework focuses on constructing balanced feature representations through a hybrid encoding strategy together with an attention-based feature refinement mechanism.
The backbone of the proposed framework is a hybrid encoder that combines complementary properties of two classical convolutional architectures. Shallow layers derived from VGG were used to capture detailed textures and structural patterns in lane regions. These features preserve spatial cues that are important for accurate geometric reconstruction. In parallel, deeper layers based on ResNet were employed to extract semantic representations and contextual information from complex road scenes. To integrate these heterogeneous features effectively, channel adaptation and hierarchical coupling mechanisms were introduced. These operations align feature dimensions and progressively fuse shallow geometric cues with deep semantic representations across multiple network stages. As a result, the encoder constructs a multi-scale feature hierarchy that maintains geometric precision while retaining strong semantic discrimination.
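The channel adaptation and coupling step described above can be illustrated with a minimal NumPy sketch. It is not the HA-Lane Net implementation: random arrays stand in for the VGG and ResNet feature maps, the function names (`adapt_channels`, `upsample_nearest`, `couple`) are hypothetical, and fusing by nearest-neighbour upsampling followed by element-wise addition is an assumption, since the abstract does not specify the fusion operator.

```python
import numpy as np

def adapt_channels(features, weight):
    """1x1 convolution: mix channels at every pixel, (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, features)

def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling along both spatial axes."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)

def couple(shallow, deep, w_shallow, w_deep):
    """Fuse a high-resolution shallow map with a low-resolution deep map.

    Assumes the shallow map is an integer multiple of the deep map in size
    and that fusion is plain addition (an illustrative choice).
    """
    factor = shallow.shape[1] // deep.shape[1]
    shallow_adapted = adapt_channels(shallow, w_shallow)
    deep_adapted = upsample_nearest(adapt_channels(deep, w_deep), factor)
    return shallow_adapted + deep_adapted

# Stand-in feature maps: 64 shallow channels at 8x8, 256 deep channels at 4x4.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 8, 8))
deep = rng.standard_normal((256, 4, 4))
w_shallow = rng.standard_normal((32, 64)) * 0.1   # adapts 64 -> 32 channels
w_deep = rng.standard_normal((32, 256)) * 0.1     # adapts 256 -> 32 channels

fused = couple(shallow, deep, w_shallow, w_deep)
```

In this sketch the fused map keeps the spatial resolution of the shallow branch, so fine lane geometry is not lost, while every pixel also receives the semantic response of the deep branch. Repeating `couple` at several stages would give a multi-scale hierarchy of the kind the paragraph describes.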
Features were further refined before spatial projection by a pre-projection channel attention mechanism. Inspired by the squeeze-and-excitation principle, this module recalibrates feature responses by strengthening channels related to lane structures and suppressing background signals. Applying attention prior to the front-view-to-bird's-eye-view projection improves feature consistency across viewpoints and reduces the influence of image blur, lighting variation, and environmental clutter. This design helps preserve structural lane information during the spatial transformation process.
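The squeeze-and-excitation principle behind this module can be sketched in a few lines of NumPy. This is a generic illustration rather than the module used in HA-Lane Net: the function name `channel_attention`, the reduction ratio `r = 4`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def channel_attention(features, w1, b1, w2, b2):
    """Squeeze-and-excitation style recalibration of a (C, H, W) feature map."""
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))                # shape (C,)
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)         # shape (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # shape (C,)
    # Scale: reweight every channel by its gate.
    return features * gates[:, None, None], gates

# Hypothetical sizes: 8 channels, 4x6 spatial grid, reduction ratio 4.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 6, 4
features = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)

refined, gates = channel_attention(features, w1, b1, w2, b2)
```

Each gate is a single scalar per channel, so the operation changes how strongly a channel contributes but leaves its spatial layout untouched. In a trained network the weights would be learned so that lane-related channels receive gates closer to 1 and background channels gates closer to 0.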
Training stability was improved using a cosine annealing learning-rate schedule, which facilitates smoother optimization of the multi-task loss function. The framework was trained and evaluated on the Apollo-Sim3D dataset, which contains diverse simulated driving scenes with accurate three-dimensional lane annotations. During inference, the network directly predicts spatial coordinates and shapes of lane markings from monocular images, enabling reconstruction of lane geometry in three-dimensional space without relying on additional sensors.
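For reference, a single-cycle cosine annealing schedule can be written as below. The function name and the values of `lr_max`, `lr_min`, and `total_steps` are illustrative assumptions; the abstract does not report the hyperparameters used for HA-Lane Net.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Learning rate at `step` under a single-cycle cosine annealing schedule.

    Assumes total_steps > 0. Steps outside [0, total_steps] are clamped, so
    the rate stays at lr_min once training runs past the scheduled length.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at every step of a hypothetical 100-step run.
schedule = [cosine_annealing_lr(step, 100) for step in range(101)]
```

The rate starts at `lr_max`, decays slowly at first, fastest around the midpoint, and slowly again as it approaches `lr_min`. The gentle decay at both ends is the usual reason this schedule is associated with smoother optimization.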
Results and Discussions Experiments on the Apollo-Sim3D dataset show that HA-Lane Net improves three-dimensional lane detection performance compared with representative monocular approaches such as 3D-LaneNet and Gen-LaneNet. The hybrid encoder effectively integrates geometric detail with semantic context, enabling more accurate estimation of lane geometry.
The pre-projection attention mechanism further improves robustness by highlighting lane-related feature responses and suppressing irrelevant background information. This targeted refinement enhances feature consistency during the transformation from front-view images to bird's-eye-view representations and reduces localization errors caused by occlusion, illumination variation, or scene complexity. Consequently, the network produces more stable detection results across different driving scenarios.
Quantitative comparisons indicate more accurate three-dimensional lane coordinate estimation than the reference methods. Detection results are also more stable across diverse road environments. These improvements arise from the combined effect of hybrid feature encoding and attention-based feature refinement, which strengthen lane structure representation while mitigating background noise. In addition, the proposed framework remains relatively efficient: the improvements are achieved without Transformer-based components or other computationally intensive modules, so the network retains manageable complexity for real-time perception tasks.
Conclusions A monocular three-dimensional lane detection framework named HA-Lane Net has been presented for improved lane perception in autonomous driving environments. The proposed approach integrates a hybrid encoder that combines shallow VGG layers with deep ResNet layers, enabling balanced extraction of geometric detail and semantic representation. A pre-projection channel attention mechanism further refines lane-related features before spatial transformation, improving robustness against environmental interference.
Experimental evaluation on the Apollo-Sim3D dataset shows that the proposed framework improves three-dimensional lane detection performance compared with representative monocular approaches such as 3D-LaneNet and Gen-LaneNet. The hybrid encoding strategy together with attention-based refinement enables more reliable estimation of lane geometry while maintaining a relatively efficient network structure. These results suggest that the proposed framework provides a practical solution for monocular three-dimensional lane perception and offers useful insights for future research on autonomous driving scene understanding.