
Abstract: To address the insufficient use of multi-scale semantic information and the high computational cost of the long token sequences generated by existing Transformer-based semantic segmentation networks, this paper proposes MFE-Former, an efficient semantic segmentation backbone based on multi-scale feature enhancement. The network comprises two main components: a multi-scale pooling self-attention (MPSA) module and a cross-spatial feed-forward network (CS-FFN) module. MPSA downsamples the feature-map sequence with multi-scale pooling, reducing computational cost while efficiently extracting multi-scale contextual information from the sequence and strengthening the Transformer's ability to model multi-scale information. CS-FFN replaces the traditional fully connected layers with simplified depth-wise convolution layers, reducing the parameters of the feed-forward network's initial linear transformation, and introduces cross-spatial attention (CSA) so that the model captures interactions across different spatial dimensions more effectively, further enhancing its expressive power. MFE-Former achieves mean intersection-over-union (mIoU) scores of 44.1%, 80.6%, and 38.0% on the ADE20K, Cityscapes, and COCO-Stuff datasets, respectively. Compared with mainstream segmentation algorithms, MFE-Former attains competitive segmentation accuracy at lower computational cost, effectively mitigating the insufficient use of multi-scale information and the high computational burden of existing methods.
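The MPSA idea described above can be illustrated at the shape level: queries come from the full-resolution token sequence, while keys and values are built from the same feature map pooled at several scales, so the attention matrix scales with the much shorter pooled length. The sketch below assumes average pooling at ratios 2/4/8 and omits the learned Q/K/V projections and multi-head split; the paper's actual pooling ratios and layer details are not given in the abstract.

```python
import numpy as np

def avg_pool2d(x, k):
    # x: (H, W, C); non-overlapping average pooling with window/stride k
    H, W, C = x.shape
    Hk, Wk = H // k, W // k
    return x[:Hk * k, :Wk * k].reshape(Hk, k, Wk, k, C).mean(axis=(1, 3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mpsa(x, ratios=(2, 4, 8)):
    """Shape-level sketch of multi-scale pooling self-attention.
    Keys/values are the concatenation of sequences pooled at several
    ratios (assumed here), so attention cost is O(N * M) with M the
    short pooled length instead of O(N^2) on the full sequence."""
    H, W, C = x.shape
    q = x.reshape(H * W, C)                              # (N, C), N = H*W
    pooled = [avg_pool2d(x, r).reshape(-1, C) for r in ratios]
    kv = np.concatenate(pooled, axis=0)                  # (M, C), M < N
    attn = softmax(q @ kv.T / np.sqrt(C))                # (N, M)
    return attn @ kv                                     # (N, C)
```

For a 16x16 feature map, the pooled key/value sequence has 64 + 16 + 4 = 84 tokens versus 256 full-resolution tokens, while the multi-scale pooling itself supplies context at several receptive-field sizes.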
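The CS-FFN design can likewise be sketched: the feed-forward network's initial linear (fully connected) layer is replaced by a 3x3 depth-wise convolution (9C parameters instead of C x C_hidden), followed by a cross-spatial gating step and a pointwise projection. The abstract does not specify the CSA mechanism, so the gate below, built from H-axis and W-axis pooled statistics, is a hedged stand-in for illustration only.

```python
import numpy as np

def depthwise_conv3x3(x, w):
    # x: (H, W, C); w: (3, 3, C), one 3x3 kernel per channel (depth-wise)
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W] * w[i, j]
    return out

def cs_ffn(x, w_dw, w_proj):
    """Sketch of a cross-spatial feed-forward block (assumed structure).
    The initial linear layer is replaced by a depth-wise 3x3 conv; a
    stand-in cross-spatial gate mixes H- and W-direction statistics
    before a pointwise (1x1) output projection."""
    h = np.maximum(depthwise_conv3x3(x, w_dw), 0.0)  # ReLU for brevity
    gh = h.mean(axis=1, keepdims=True)               # (H, 1, C) row stats
    gw = h.mean(axis=0, keepdims=True)               # (1, W, C) column stats
    gate = 1.0 / (1.0 + np.exp(-(gh + gw)))          # sigmoid, broadcast (H, W, C)
    return (h * gate) @ w_proj                       # pointwise projection
```

The parameter saving is easy to see: for C = 256 channels and a 4x expansion, a fully connected first layer needs 256 x 1024 weights, while the depth-wise replacement needs only 9 x 256.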