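• Summary: Infrared small target detection (IRSTD) has important application value in fields such as early warning systems and precision guidance, but existing methods suffer from high computational overhead, small targets that are easily disturbed by the background, and an IoU metric that is sensitive to localization deviations of small targets. To address these problems, a multi-scale feature learning network, MFLNet, is proposed. First, a CMABlock feature extraction module is designed, which uses a linear attention mechanism to strengthen the perception of faint targets and improve feature discriminability while remaining lightweight. Second, a DMSFHead detection head is designed, which adaptively aggregates multi-scale features through a dynamic weighted fusion mechanism, balancing detail preservation against semantic enhancement and improving feature utilization. In addition, the normalized Gaussian Wasserstein distance (NWD) is adopted in place of the conventional IoU metric, and a hybrid CIoU-NWD loss is designed to reduce the model's sensitivity to small-target localization deviations during training. Experiments are conducted on three datasets, IRSTD-1k, NUDT-SIRST, and NUAA-SIRST. The results show that MFLNet outperforms current mainstream SOTA IRSTD methods in precision, recall, and F1-score on all three datasets, and ablation experiments verify the effectiveness of each component of the network. The proposed multi-scale feature learning network MFLNet effectively addresses the problems of low signal-to-clutter ratio, low feature utilization, and IoU sensitivity in infrared small target detection.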

       

      Abstract:
      Objective Infrared small target detection (IRSTD) serves as a foundational enabling technology in computer vision, with extensive and critical deployment across military and civilian domains including infrared early warning, precision guidance, maritime search and rescue, and intelligent traffic monitoring. The task presents inherent and unresolved challenges stemming from the unique physical properties of infrared small targets, which typically manifest as spot-like structures spanning only several to dozens of pixels, yielding extremely sparse texture and structural information. Compounding these difficulties, infrared imagery is frequently degraded by intense background clutter, low contrast, and low signal-to-clutter ratio (SCR), all of which obscure the faint signals of dim targets. Existing mainstream methods exhibit notable limitations in addressing these interconnected challenges: segmentation-based deep learning frameworks, while proficient in fine-grained local detail capture, incur prohibitive computational overhead and limited real-time performance; detection-based architectures, despite superior inference efficiency, demonstrate insufficient sensitivity to dim targets, suffer from target feature suppression under clutter interference, underutilize multi-scale feature information, and remain highly sensitive to minor localization deviations during training due to inherent constraints of standard intersection over union (IoU)-based loss functions. To mitigate these critical bottlenecks, this paper proposes a novel multi-scale feature learning network, hereafter referred to as MFLNet, for high-precision and efficient IRSTD.
      Methods To address the aforementioned challenges, MFLNet is built on a synergistic architecture with three core technical innovations. First, to enhance weak target feature perception in low-SCR scenarios, we develop a CSP block with Mamba-inspired linear attention, termed CMABlock, for feature extraction. This module integrates the cross stage partial (CSP) architecture with a tailored linear attention mechanism, and incorporates rotary position encodings within a Mamba-inspired linear attention submodule to realize position-aware hierarchical modeling of potential target regions. This design directs the model to focus on faint, location-critical target cues in intense background noise, improving feature discriminability while preserving a lightweight computational profile. Second, to resolve insufficient multi-scale feature utilization, we design a dynamic multi-scale feature fusion head, termed DMSFHead, as the detection decoder. Unlike conventional detection heads with fixed fusion weights, DMSFHead adopts a dynamic weighted fusion strategy and extends the standard three-layer feature fusion to a four-layer feature pyramid optimized for IRSTD, so that critical high-resolution details are retained before they are degraded in deep network layers. Its core feature aggregation and selective feature fusion modules adaptively balance contributions from multi-scale features, preserving spatial details from high-resolution layers while integrating semantic information from low-resolution layers, thus enhancing feature utilization and target-clutter discrimination. Third, to alleviate the extreme sensitivity of standard IoU metrics to minor positional shifts in tiny targets, we introduce the normalized Gaussian Wasserstein distance (NWD), which models bounding boxes as two-dimensional Gaussian distributions to generate a smoother, scale-robust similarity metric. To balance this with precise geometric regression constraints, we construct a hybrid CIoU-NWD loss function, which combines the training stability of NWD for tiny targets with the accurate bounding box regression capability of complete IoU (CIoU), mitigating issues of sparse positive samples and unstable model training.
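      As a rough illustration of why NWD is smoother than IoU for tiny boxes, the following minimal sketch computes NWD and CIoU for boxes given as (cx, cy, w, h) and blends the two losses. It is not the MFLNet implementation. The function names, the normalizing constant c = 12.8 (a dataset-dependent placeholder), and in particular the convex combination (1 - lam) * L_CIoU + lam * L_NWD are assumptions made here for illustration; the abstract states only that a weighting factor is used, not how it enters the loss.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein distance between two (cx, cy, w, h) boxes.

    Each box is modeled as a 2-D Gaussian with mean (cx, cy) and covariance
    diag(w^2 / 4, h^2 / 4). The constant c is a dataset-dependent
    normalizer; 12.8 is only a placeholder (assumption).
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two (cx, cy, w, h) boxes."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    ax1, ay1, ax2, ay2 = cxa - wa / 2, cya - ha / 2, cxa + wa / 2, cya + ha / 2
    bx1, by1, bx2, by2 = cxb - wb / 2, cyb - hb / 2, cxb + wb / 2, cyb + hb / 2
    # Plain IoU from the overlap rectangle.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    iou = inter / (wa * ha + wb * hb - inter + eps)
    # Center distance normalized by the enclosing-box diagonal.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    rho_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    diag_sq = enc_w ** 2 + enc_h ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps))
                              - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho_sq / diag_sq - alpha * v


def hybrid_loss(pred, target, lam=0.1, c=12.8):
    """Hypothetical CIoU-NWD blend: (1 - lam) * L_CIoU + lam * L_NWD.

    The form of the combination is an assumption for illustration only.
    """
    loss_ciou = 1.0 - ciou(pred, target)
    loss_nwd = 1.0 - nwd(pred, target, c)
    return (1.0 - lam) * loss_ciou + lam * loss_nwd
```

      For two 4x4-pixel boxes whose centers are 5 pixels apart, e.g. nwd((10, 10, 4, 4), (15, 10, 4, 4)), the boxes do not overlap, so the IoU term is exactly zero, whereas NWD still returns a value strictly between 0 and 1 that decays smoothly as the center distance grows.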
      Results and Discussions Extensive validation experiments are conducted on three widely used and challenging public IRSTD datasets: IRSTD-1k, NUDT-SIRST, and NUAA-SIRST, with MFLNet benchmarked against six state-of-the-art (SOTA) methods: MDvsFA, ACM, ALCNet, ISNet, AGPCNet, and DNANet. Quantitative results show that MFLNet consistently outperforms all competing methods across all benchmarks. Specifically, MFLNet achieves an F1-score of 0.845 on IRSTD-1k, a relative improvement of 10.1% over the second-best method; an F1-score of 0.945 on NUDT-SIRST, a relative improvement of 4.4%; and an F1-score of 0.889 on NUAA-SIRST, a relative improvement of 4.8% over the runner-up. Ablation studies systematically validate the effectiveness of each proposed component: replacing the attention branch of CMABlock with standard convolutions incurs consistent performance degradation across all metrics, while substituting DMSFHead with a conventional detection head leads to significant drops in both precision and recall, verifying the critical contribution of dynamic multi-scale fusion. Experiments on the hybrid loss function confirm that a weighting factor of λ=0.1 achieves optimal performance by reconciling training stability and localization accuracy. Cross-dataset generalization experiments further demonstrate MFLNet’s robustness to domain shifts, with stable performance across all six transfer tasks, indicating its ability to learn generalizable target representations rather than dataset-specific biases. Computational efficiency analysis shows that MFLNet maintains a lightweight profile, with 4.12 M parameters, a computational cost of 13.80 GFLOPs, and an inference speed of 47.40 frames/s, outperforming all compared methods in efficiency. This efficiency stems from its detection-based paradigm and lightweight module design, making it well-suited for deployment on resource-constrained edge platforms.
      Conclusions This paper proposes MFLNet, a novel high-performance multi-scale feature learning network for IRSTD. Through the synergistic design of three core components—CMABlock for enhanced weak target feature perception, DMSFHead for efficient multi-scale feature utilization, and the CIoU-NWD hybrid loss for stable model training—MFLNet systematically addresses the core technical challenges of IRSTD. Extensive experimental results confirm that MFLNet not only establishes new SOTA detection accuracy across multiple mainstream benchmarks but also exhibits superior computational efficiency. This work provides a robust and practical framework for real-world high-performance infrared detection systems, with important application value for military surveillance, autonomous navigation, and related fields. Future work will focus on optimizing sequence modeling mechanisms and exploring multi-frame spatiotemporal fusion to further improve detection performance in extremely low-SCR scenarios.