Abstract:
Convolutional neural networks (CNNs) and vision transformers (ViTs) are the two dominant paradigms in remote sensing single image super-resolution (RSSISR), each with distinct strengths. CNNs have long been the workhorse thanks to their inductive biases, while ViTs have recently achieved superior performance in many settings, largely owing to the self-attention mechanism's ability to model long-range dependencies. This advantage comes at a cost, however: the complexity of self-attention grows quadratically with the number of image pixels. This overhead becomes a critical bottleneck in RSSISR, where reconstructing high-resolution outputs from low-resolution inputs is already computation-intensive, and it severely restricts the practical deployment of ViTs on large-area remote sensing imagery.
To overcome this challenge, we propose a novel architecture, the multi-scale augmented state space model (MS3M). Our approach builds on recent advances in state space models (SSMs), which combine linear computational complexity with strong long-range modeling capability. Unlike existing SSM-based feature extractors that rely on fixed, unidirectional scanning paths, MS3M introduces a grouped parallel scanning strategy that captures comprehensive global and non-local features without being constrained to a single scanning direction, while strictly preserving linear computational complexity.
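The abstract does not give the exact formulation of the grouped parallel scan, so the following is only a minimal NumPy sketch of one plausible reading: channels are split into groups, each group is flattened along a different scan direction, passed through a toy O(L) linear recurrence standing in for a learned selective SSM layer, and the results are realigned and merged. All function names (`flatten_scan`, `simple_ssm`, `grouped_parallel_scan`) and the fixed `decay` parameter are illustrative assumptions, not the paper's API.

```python
import numpy as np

def flatten_scan(x, direction):
    """Flatten an (H, W, C) map into an (H*W, C) sequence along one scan path."""
    if direction == "row":        # left-to-right, top-to-bottom
        return x.reshape(-1, x.shape[-1])
    if direction == "row_rev":    # the reverse of the row-major path
        return x.reshape(-1, x.shape[-1])[::-1]
    if direction == "col":        # top-to-bottom, left-to-right
        return x.transpose(1, 0, 2).reshape(-1, x.shape[-1])
    if direction == "col_rev":    # the reverse of the column-major path
        return x.transpose(1, 0, 2).reshape(-1, x.shape[-1])[::-1]
    raise ValueError(direction)

def simple_ssm(seq, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t: linear in sequence length,
    a stand-in for a learned state space layer (decay is an assumption)."""
    h = np.zeros(seq.shape[-1])
    out = np.empty_like(seq)
    for t, x_t in enumerate(seq):
        h = decay * h + x_t
        out[t] = h
    return out

def grouped_parallel_scan(x):
    """Split C channels into 4 groups (C must be divisible by 4), scan each
    along a different direction, undo the flattening, and concatenate."""
    H, W, _ = x.shape
    groups = np.split(x, 4, axis=-1)
    outs = []
    for g, d in zip(groups, ["row", "row_rev", "col", "col_rev"]):
        y = simple_ssm(flatten_scan(g, d))
        if d == "row":
            y = y.reshape(H, W, -1)
        elif d == "row_rev":
            y = y[::-1].reshape(H, W, -1)
        elif d == "col":
            y = y.reshape(W, H, -1).transpose(1, 0, 2)
        else:
            y = y[::-1].reshape(W, H, -1).transpose(1, 0, 2)
        outs.append(y)                     # groups realigned spatially
    return np.concatenate(outs, axis=-1)   # merged multi-direction features

x = np.random.rand(8, 8, 16).astype(np.float32)
y = grouped_parallel_scan(x)
```

Each group still costs O(H*W) per channel, so running the four directions in parallel keeps the overall complexity linear in the number of pixels.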
Furthermore, to exploit the multi-scale spatial structures inherent in remote sensing images (from the fine-grained textures of buildings to the extensive patterns of farmland), we embed a multi-receptive-field aggregation mechanism directly into the state space model. This allows the network to integrate contextual information across scales, a capability essential for accurately reconstructing complex geographical objects. To further strengthen local feature representation, we design a novel high-order moment channel affinity modulation module that goes beyond simple first-order statistics, enabling more expressive channel-wise feature transformations. The overall MS3M framework adopts a U-shaped architecture to fuse multi-level features across the encoder and decoder.
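The high-order moment modulation is described only at the level of "beyond first-order statistics," so the sketch below assumes one common realization: per-channel mean (first order), variance (second order), and skewness (third order) are pooled into a channel descriptor, which is mapped to sigmoid gates that rescale the channels. The function name and the fixed averaging in place of a learned projection are hypothetical stand-ins, not the paper's module.

```python
import numpy as np

def high_order_channel_modulation(x, eps=1e-5):
    """x: (H, W, C). Gate each channel using 1st-, 2nd-, and 3rd-order
    moments of its spatial distribution (an illustrative formulation)."""
    flat = x.reshape(-1, x.shape[-1])             # (H*W, C)
    mu = flat.mean(axis=0)                        # 1st-order moment
    var = flat.var(axis=0)                        # 2nd-order central moment
    skew = ((flat - mu) ** 3).mean(axis=0) / (var + eps) ** 1.5  # 3rd-order
    stats = np.stack([mu, var, skew], axis=0)     # (3, C) channel descriptor
    # A learned projection would map stats to gates; a fixed mean is a stand-in.
    gate = 1.0 / (1.0 + np.exp(-stats.mean(axis=0)))  # sigmoid -> (C,) in (0,1)
    return x * gate                               # channel-wise modulation

x = np.random.rand(6, 6, 8)
y = high_order_channel_modulation(x)
```

Because the gates lie in (0, 1), the module reweights rather than replaces features, so it composes cleanly with the surrounding SSM blocks.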
We conduct extensive experiments on several public remote sensing datasets. The results show that MS3M achieves state-of-the-art performance, surpassing leading methods on objective metrics (PSNR, SSIM, and LPIPS) as well as in subjective visual quality. These results validate our architectural choices and establish MS3M as a robust and efficient solution for the challenging task of remote sensing image super-resolution.