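• Summary: To address boundary blurring, a core problem in the semantic segmentation of urban street scenes, this paper proposes a street-scene semantic segmentation method based on a dual-branch boundary refinement network (BRNet). A boundary refinement module strengthens edge feature extraction and effectively improves boundary segmentation quality in complex scenes. Adaptive sampling convolution improves the network's adaptability to multi-scale features, and a feature propagation module lets the two branches exchange complementary information. In addition, the Lovász loss is introduced to alleviate class imbalance. Experiments show that BRNet achieves an mIoU of 81.43% on the Cityscapes dataset, 5.69 percentage points higher than the bilateral segmentation network V2 (BiSeNetV2), and generalizes well to the WildDash2 and BDD100K datasets. It performs especially well in boundary clarity and small-object segmentation, which confirms its practical value.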

       

      Abstract:
      Objective Semantic segmentation of urban street scenes plays a central role in perception systems for autonomous driving, where reliable pixel-level understanding of roads, vehicles, infrastructure, and vulnerable road users is essential for safe planning and control. Existing real-time networks still struggle to delineate object boundaries under occlusion, scale variation, cluttered backgrounds, and complex lighting, which leads to fragmented objects, blurred contours, and missed small instances such as poles, traffic signs, riders, and pedestrians. Many architectures emphasize global context or throughput but do not model fine edges explicitly, and therefore sacrifice boundary quality. This study aimed to improve boundary precision and small-object segmentation while preserving high inference speed by designing a dual-branch network that combines explicit boundary refinement, adaptive receptive fields, and cross-branch information propagation.
      Methods A dual-branch boundary refinement network (BRNet) was developed for street-scene semantic segmentation. The architecture contained a high-resolution branch with three adaptive sampling convolution (ASC) blocks that preserved fine spatial structure and dynamically adjusted receptive fields to complex object geometry. A low-resolution branch integrated a boundary refinement module (BRM) that used two stacked convolutions to expand the effective receptive field to a 5×5 region and employed lightweight channel and spatial attention to strengthen edge features and suppress background noise near object contours. A feature propagation module (FPM) linked the two branches through progressive cross-scale fusion. It compressed channels and performed downsampling or upsampling so that enhanced semantic and boundary cues from the low-resolution path flowed back to the high-resolution path and complemented its local details. Together, these components formed a collaborative boundary optimization mechanism in which both branches jointly refined contours at different resolutions. Supervision was applied at multiple stages. A hybrid loss combined cross-entropy with the Lovász-Softmax loss to optimize intersection over union directly and to alleviate the class imbalance typical of street-scene datasets, in which road and background pixels far outnumber those of rare categories. The network was trained and evaluated on the finely annotated Cityscapes benchmark. Training used stochastic gradient descent with a poly learning-rate schedule. Common data augmentation operations, including random scaling, cropping, color jittering, and horizontal flipping, increased robustness to viewpoint and appearance variations. The implementation relied on a standard deep learning framework with synchronized batch normalization and mixed-precision computation to stabilize optimization and accelerate training. 
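The hybrid objective can be illustrated with a minimal NumPy forward-pass sketch. It assumes flattened per-pixel logits of shape (N, C), an equal weighting `lam = 1.0` between the two terms, no ignore-label handling, and a Lovász term averaged over the classes present in the labels. The function names and the weighting are hypothetical choices for illustration and are not the configuration reported for BRNet. The Lovász-Softmax term follows its published formulation: for each class, the per-pixel errors are sorted in decreasing order and dotted with the discrete gradient of the Jaccard loss.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the class axis of an (N, C) array."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def lovasz_grad(gt_sorted):
    """Discrete gradient of the Jaccard loss for ground truth sorted by decreasing error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]  # successive differences
    return jaccard


def lovasz_softmax(probs, labels):
    """Lovász-Softmax loss averaged over the classes present in `labels`."""
    losses = []
    for c in range(probs.shape[1]):
        fg = (labels == c).astype(float)
        if fg.sum() == 0:
            continue  # skip classes absent from the ground truth
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors, kind="stable")  # sort errors in decreasing order
        losses.append(float(np.dot(errors[order], lovasz_grad(fg[order]))))
    return float(np.mean(losses)) if losses else 0.0


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class for each pixel."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))


def hybrid_loss(logits, labels, lam=1.0):
    """Cross-entropy plus lam * Lovász-Softmax (lam = 1.0 is an assumed weight)."""
    probs = softmax(logits)
    return cross_entropy(probs, labels) + lam * lovasz_softmax(probs, labels)


if __name__ == "__main__":
    labels = np.array([0, 0, 1, 1])
    hard = np.eye(2)[[0, 1, 1, 1]]  # one-hot "predictions" with one error at pixel 1
    print(round(lovasz_softmax(hard, labels), 4))  # 0.4167 = ((1 - 1/2) + (1 - 2/3)) / 2
```

With hard one-hot predictions, the Lovász term reduces exactly to the mean per-class Jaccard loss (1 − IoU), which is why it targets IoU more directly than cross-entropy alone. During training it would be computed on softmax probabilities inside an autodiff framework so that gradients flow through the sorted errors.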
Performance was assessed by mean intersection over union (mIoU), mean pixel accuracy (mPA), and frames per second (FPS) to capture both segmentation quality and computational efficiency. Additional experiments on the WildDash2 and BDD100K datasets examined robustness under distribution shifts that involved different cities, illumination conditions, camera setups, and adverse weather.
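The accuracy metrics can be computed from a pixel-level confusion matrix. The sketch below assumes the common definitions: per-class IoU = TP / (TP + FP + FN) and per-class pixel accuracy = TP / (TP + FN), each averaged over classes. It also assumes the Cityscapes convention of ignoring label 255. The function names and the ignore value are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows index ground-truth classes and columns index predicted classes."""
    mask = gt != ignore_index  # drop void/ignored pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_mpa(cm):
    """Mean IoU and mean per-class pixel accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    gt_count = cm.sum(axis=1).astype(float)    # TP + FN per class
    pred_count = cm.sum(axis=0).astype(float)  # TP + FP per class
    union = gt_count + pred_count - tp         # TP + FP + FN per class
    iou = tp[union > 0] / union[union > 0]     # skip classes that never occur
    pa = tp[gt_count > 0] / gt_count[gt_count > 0]
    return float(iou.mean()), float(pa.mean())


if __name__ == "__main__":
    gt = np.array([[0, 0, 1], [1, 2, 255]])
    pred = np.array([[0, 1, 1], [1, 2, 0]])
    cm = confusion_matrix(pred, gt, num_classes=3)
    miou, mpa = miou_mpa(cm)
    print(round(miou, 4), round(mpa, 4))  # 0.7222 0.8333
```

In practice, the confusion matrix is accumulated over the entire evaluation split before the ratios are taken, so mIoU is a dataset-level statistic rather than an average of per-image scores.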
      Results and Discussions BRNet achieved an mIoU of 81.43% on the Cityscapes test set at 67 FPS on a single GPU, demonstrating a favorable balance between accuracy and real-time performance. It outperformed the bilateral segmentation network BiSeNetV2 by 5.69 percentage points in mIoU. It also exceeded other multi-branch or lightweight baselines, such as DDRNet, LMIINet, and SegNeXt, on most evaluation metrics, while approaching the accuracy of the transformer-based EfficientViT at noticeably lower computational cost. Quantitative analysis of small-object categories showed consistent gains over BiSeNetV2, DDRNet, and PIDNet for poles, traffic lights, traffic signs, pedestrians, motorcycles, and bicycles. This confirms that explicit boundary modeling and adaptive sampling improved the recognition of thin structures and distant targets. A separate comparison with models that integrate Mamba or transformer blocks indicated that BRNet obtained comparable segmentation quality with fewer parameters and lower latency, because boundary attention was introduced through compact convolutional modules rather than heavy global self-attention. Cross-dataset tests further validated the generalization ability of BRNet. On BDD100K, the network reached an mIoU of 43.18%. It showed a smaller performance drop than competing models when transferred from Cityscapes to unfamiliar urban environments containing night scenes, rain, and complex traffic patterns. On WildDash2, BRNet preserved coherent road layouts and object masks in images with motion blur, low visibility, and camera distortion. Qualitative visualizations on both datasets revealed clearer object contours, better separation of overlapping vehicles and pedestrians, and more complete road and sidewalk regions, especially under shadows, glare, or motion blur. 
Class activation maps indicated that the boundary refinement mechanism guided attention to compact regions along object outlines, which helped preserve structural integrity, reduced boundary ambiguity, and mitigated confusion between foreground and background.
      Conclusions The dual-branch BRNet with collaborative boundary refinement improved semantic segmentation of urban street scenes, especially in boundary precision and small-object delineation, while keeping real-time inference capability suitable for on-board deployment. The high-resolution branch with adaptive sampling captured detailed spatial layouts and local textures, and the low-resolution branch with the boundary refinement module enhanced semantic discrimination along contours; the feature propagation module linked these components into a continuous optimization loop for boundary representation across scales. The hybrid loss function reduced the impact of class imbalance and supported stable optimization toward high IoU scores without sacrificing pixel-wise accuracy. Experiments on Cityscapes, WildDash2, and BDD100K confirmed strong generalization across datasets and robustness to challenging imaging conditions such as adverse weather and poor illumination. These characteristics make BRNet a competitive candidate for perception stacks of autonomous driving and advanced driver-assistance systems that require accurate, efficient, and reliable understanding of complex urban environments. Future work will refine the architecture and compression strategy to further reduce latency and memory footprint, and will explore integration with panoptic segmentation, instance-level boundary detection, temporal consistency modeling, and multi-task learning frameworks to obtain richer and more coherent scene understanding from a single unified model.