• Abstract: Convolutional neural networks (CNNs) and vision transformers (ViTs) have both demonstrated excellent performance in remote sensing image super-resolution. Owing to their strong capability for modeling long-range dependencies, ViTs generally outperform traditional CNN-based methods, but their computational complexity grows quadratically with image resolution, which severely limits their practical use in high-resolution remote sensing image reconstruction. To address this challenge, this paper proposes a multi-scale augmented state space model (MS3M). Unlike existing feature extraction methods based on fixed scanning paths, MS3M introduces an efficient grouped parallel scanning strategy that models global and non-local feature dependencies while maintaining linear computational complexity. To handle the multi-scale spatial structures inherent in remote sensing images, we further embed a multi-receptive-field aggregation mechanism into the state space model to fuse contextual information across scales. In addition, to strengthen feature representation, we propose a high-order moment channel affinity modulation module that refines local feature representations. The overall network adopts a U-shaped architecture for multi-level feature fusion. Experiments on several public remote sensing datasets show that MS3M significantly outperforms existing state-of-the-art methods in both quantitative metrics (PSNR, SSIM, and LPIPS) and visual quality, validating the effectiveness of the proposed method.


      Abstract:
      Convolutional neural networks (CNNs) and vision transformers (ViTs) represent the two dominant paradigms in remote sensing single image super-resolution (RSSISR), each with distinct strengths. While CNNs have long been the workhorse owing to their built-in inductive biases, ViTs have recently demonstrated superior performance in many cases, primarily because the self-attention mechanism excels at modeling long-range dependencies. This advantage, however, comes at a significant cost: self-attention has quadratic computational complexity with respect to image size. This limitation becomes a critical bottleneck in RSSISR, where generating high-resolution outputs from low-resolution inputs already demands extensive computation, and it severely restricts the practical deployment of ViTs for large-area remote sensing imagery.
      To overcome this fundamental challenge, we propose a novel architecture, the multi-scale augmented state space model (MS3M). Our approach builds on recent advances in state space models (SSMs), which are known for their linear computational complexity and strong potential for capturing long-range interactions. Unlike existing SSM-based feature extraction methods that rely on fixed, unidirectional scanning paths, MS3M introduces a grouped parallel scanning strategy. This design efficiently captures comprehensive global and non-local features without being constrained by a single scanning direction, while maintaining linear computational complexity and thus high efficiency.
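      As a minimal illustration of multi-directional scanning, the sketch below flattens a 2-D feature map along four scan directions; each direction yields a 1-D sequence that an SSM can process in linear time. The function name and the particular choice of four directions are assumptions for illustration; the exact grouping and merging used in MS3M are not specified here.

```python
# Hypothetical sketch of multi-directional scanning for an SSM block.
# In a grouped parallel design, the channel dimension would be split into
# groups, each group scanned along one direction, and the outputs merged.

def scan_directions(fmap):
    """Return the four 1-D sequences obtained by scanning a 2-D map
    left-to-right, right-to-left, top-to-bottom, and bottom-to-up."""
    h, w = len(fmap), len(fmap[0])
    lr = [fmap[i][j] for i in range(h) for j in range(w)]  # row-major scan
    rl = lr[::-1]                                          # reversed row-major
    tb = [fmap[i][j] for j in range(w) for i in range(h)]  # column-major scan
    bt = tb[::-1]                                          # reversed column-major
    return lr, rl, tb, bt
```

      Because each sequence has exactly H×W elements and an SSM step costs O(1) per element, processing all four directions remains linear in the number of pixels, in contrast to the quadratic cost of self-attention.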
      Furthermore, to exploit the inherent multi-scale spatial structures of remote sensing images, from the fine-grained textures of buildings to the extensive patterns of farmland, we embed a multi-receptive-field aggregation mechanism directly into the state space model. This allows the network to integrate contextual information across scales, a capability essential for accurately reconstructing complex geographical objects. To further strengthen local feature representations, we design a novel high-order moment channel affinity modulation module, which moves beyond simple first-order statistics to enable more nuanced and powerful feature transformations. The entire MS3M framework is built upon a U-shaped architecture to facilitate effective multi-level feature fusion across the encoder and decoder.
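      The idea of going beyond first-order statistics can be sketched as follows: gate each channel by a function of its higher-order moments (mean, variance, skewness). This toy version uses a fixed sum fed through a sigmoid; the actual module would learn the mapping from moments to affinities, so everything below is an illustrative assumption, not the paper's exact formulation.

```python
import math

def channel_affinity(channels, eps=1e-6):
    """Reweight each channel (a flat list of activations) by a sigmoid
    gate computed from its mean, variance, and skewness.
    Toy sketch: the moment-to-gate mapping would be learned in practice."""
    gates = []
    for ch in channels:
        n = len(ch)
        mu = sum(ch) / n                                   # first-order moment
        var = sum((x - mu) ** 2 for x in ch) / n           # second-order moment
        std = math.sqrt(var + eps)
        skew = sum(((x - mu) / std) ** 3 for x in ch) / n  # third-order moment
        s = mu + var + skew            # placeholder fusion of the moments
        gates.append(1.0 / (1.0 + math.exp(-s)))
    return [[g * x for x in ch] for g, ch in zip(gates, channels)]
```

      Incorporating variance and skewness lets the gate distinguish channels with identical means but very different activation distributions, which first-order (mean-pooling) channel attention cannot do.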
      We conduct extensive experiments on several public remote sensing datasets. The results demonstrate that our proposed MS3M achieves state-of-the-art performance, outperforming existing leading methods on objective metrics, including PSNR, SSIM, and LPIPS, as well as in subjective visual quality. These results validate the effectiveness of our architectural choices and establish MS3M as a robust and efficient solution for the challenging task of remote sensing image super-resolution.
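      The objective metrics above follow their standard definitions; for instance, PSNR between a reference image and a reconstruction (here represented as flat lists of 8-bit pixel values) is 10·log₁₀(peak²/MSE):

```python
import math

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-sized images,
    given as flat pixel lists with the stated peak value."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

      Higher PSNR indicates lower pixel-wise error; SSIM and LPIPS complement it by measuring structural and perceptual similarity, respectively.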