Abstract:
Objective Infrared small target detection (IRSTD) is a foundational technology in computer vision, deployed widely in military and civilian applications such as infrared early warning, precision guidance, maritime search and rescue, and intelligent traffic monitoring. The task remains challenging because of the physical properties of infrared small targets, which typically appear as spot-like structures spanning only a few to a few dozen pixels and therefore carry extremely sparse texture and structural information. In addition, infrared imagery is frequently degraded by intense background clutter, low contrast, and a low signal-to-clutter ratio (SCR), all of which obscure the faint signals of dim targets. Existing mainstream methods have notable limitations in addressing these interconnected challenges. Segmentation-based deep learning frameworks capture fine-grained local detail well but incur prohibitive computational overhead, which limits real-time performance. Detection-based architectures offer superior inference efficiency but are insufficiently sensitive to dim targets, suffer from target feature suppression under clutter interference, underutilize multi-scale feature information, and remain highly sensitive to minor localization deviations during training because of the inherent constraints of standard intersection over union (IoU)-based loss functions. To mitigate these bottlenecks, this paper proposes a novel multi-scale feature learning network, hereafter referred to as MFLNet, for high-precision and efficient IRSTD.
Methods MFLNet is built on a synergistic architecture with three core technical innovations. First, to enhance the perception of weak target features in low-SCR scenarios, we develop a cross stage partial (CSP) block with Mamba-inspired linear attention, termed CMABlock, for feature extraction. This module integrates the CSP architecture with a tailored linear attention mechanism and incorporates rotary position encodings within the Mamba-inspired linear attention submodule to realize position-aware hierarchical modeling of potential target regions. This design directs the model to focus on faint, location-critical target cues amid intense background noise, improving feature discriminability while preserving a lightweight computational profile. Second, to resolve insufficient multi-scale feature utilization, we design a dynamic multi-scale feature fusion head, termed DMSFHead, as the detection decoder. Unlike conventional detection heads with fixed fusion weights, DMSFHead adopts a dynamic weighted fusion strategy and extends the standard three-layer feature fusion to a four-layer feature pyramid optimized for IRSTD, so that critical high-resolution details are retained before they are degraded in deep network layers. Its core feature aggregation and selective feature fusion modules adaptively balance the contributions of multi-scale features, preserving spatial details from high-resolution layers while integrating semantic information from low-resolution layers, which enhances feature utilization and target-clutter discrimination. Third, to alleviate the extreme sensitivity of standard IoU metrics to minor positional shifts of tiny targets, we introduce the normalized Gaussian Wasserstein distance (NWD), which models bounding boxes as two-dimensional Gaussian distributions to produce a smoother, scale-robust similarity metric. To balance NWD with precise geometric regression constraints, we construct a hybrid CIoU-NWD loss function that combines the training stability of NWD for tiny targets with the accurate bounding box regression capability of complete IoU (CIoU), mitigating the problems of sparse positive samples and unstable model training.
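The NWD similarity and a CIoU-NWD combination of the kind described above can be illustrated with a minimal pure-Python sketch. It follows the commonly used formulations of NWD (boxes modeled as Gaussians with mean (cx, cy) and covariance diag(w²/4, h²/4)) and CIoU. The abstract does not state how the two terms are combined, so the convex combination in `hybrid_loss`, the choice to place the weight `lam` on the NWD term, and the normalizing constant `c=12.8` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """NWD similarity in (0, 1] between two (cx, cy, w, h) boxes.

    Each box is treated as a 2-D Gaussian with mean (cx, cy) and covariance
    diag(w^2/4, h^2/4). `c` is a dataset-dependent constant; 12.8 is an
    illustrative value (assumption), not one taken from the paper.
    """
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    # Squared 2-Wasserstein distance between the two Gaussians.
    w2_sq = ((xa - xb) ** 2 + (ya - yb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)


def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two (cx, cy, w, h) boxes."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    ax1, ay1, ax2, ay2 = xa - wa / 2, ya - ha / 2, xa + wa / 2, ya + ha / 2
    bx1, by1, bx2, by2 = xb - wb / 2, yb - hb / 2, xb + wb / 2, yb + hb / 2
    # Plain IoU from the intersection rectangle.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    iou = inter / (wa * ha + wb * hb - inter + eps)
    # Center-distance penalty, normalized by the enclosing-box diagonal.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    rho_sq = (xa - xb) ** 2 + (ya - yb) ** 2
    diag_sq = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho_sq / diag_sq - alpha * v


def hybrid_loss(pred, target, lam=0.1, c=12.8):
    """Assumed form: (1 - lam) * (1 - CIoU) + lam * (1 - NWD)."""
    return (1 - lam) * (1 - ciou(pred, target)) + lam * (1 - nwd(pred, target, c))


# A 6x6 target and a prediction shifted by one pixel in x and y.
target = (10.0, 10.0, 6.0, 6.0)
pred = (11.0, 11.0, 6.0, 6.0)
print(f"CIoU={ciou(pred, target):.3f}  NWD={nwd(pred, target):.3f}  "
      f"loss={hybrid_loss(pred, target):.3f}")
```

For the 6×6 box in this example, a one-pixel diagonal shift already reduces plain IoU to 25/47 (about 0.53), whereas NWD with the assumed constant stays near 0.90. This is the smoother behavior for tiny targets that motivates blending NWD into the regression loss.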
Results and Discussions Extensive validation experiments are conducted on three widely used and challenging public IRSTD datasets, IRSTD-1k, NUDT-SIRST, and NUAA-SIRST, with MFLNet benchmarked against six state-of-the-art (SOTA) methods: MDvsFA, ACM, ALCNet, ISNet, AGPCNet, and DNANet. Quantitative results show that MFLNet consistently outperforms all competing methods on all three benchmarks. Specifically, MFLNet achieves an F1-score of 0.845 on IRSTD-1k, a relative improvement of 10.1% over the second-best method; an F1-score of 0.945 on NUDT-SIRST, a relative improvement of 4.4%; and an F1-score of 0.889 on NUAA-SIRST, a relative improvement of 4.8% over the runner-up. Ablation studies systematically validate the effectiveness of each proposed component: replacing the attention branch of CMABlock with standard convolutions degrades performance consistently across all metrics, while substituting a conventional detection head for DMSFHead leads to significant drops in both precision and recall, verifying the critical contribution of dynamic multi-scale fusion. Experiments on the hybrid loss function confirm that a weighting factor of λ=0.1 achieves the best performance by reconciling training stability and localization accuracy. Cross-dataset generalization experiments further demonstrate MFLNet’s robustness to domain shifts, with stable performance across all six transfer tasks, indicating that it learns generalizable target representations rather than dataset-specific biases. Computational efficiency analysis shows that MFLNet maintains an exceptionally lightweight profile, with 4.12 M parameters, a computational cost of 13.80 GFLOPs, and an inference speed of 47.40 frames per second (f/s), outperforming all compared methods in efficiency. This efficiency stems from its detection-based paradigm and lightweight module design, making it well suited for deployment on resource-constrained edge platforms.
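As a reading aid for the reported figures, the short sketch below gives the standard F1 definition and back-calculates the runner-up F1-scores implied by the stated relative improvements. It assumes that "relative" means (new − old) / old. The helper names are hypothetical, and the back-calculated baselines are derived values, not numbers reported in the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_baseline(score, relative_gain):
    """Solve score = baseline * (1 + relative_gain) for the baseline."""
    return score / (1.0 + relative_gain)


# (MFLNet F1, relative improvement over the runner-up), as stated in the abstract.
reported = {
    "IRSTD-1k": (0.845, 0.101),
    "NUDT-SIRST": (0.945, 0.044),
    "NUAA-SIRST": (0.889, 0.048),
}

for name, (f1, gain) in reported.items():
    baseline = implied_baseline(f1, gain)
    print(f"{name}: MFLNet F1={f1:.3f}, implied runner-up F1 ~ {baseline:.3f}")
```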
Conclusions This paper proposes MFLNet, a novel high-performance multi-scale feature learning network for IRSTD. Through the synergistic design of three core components (CMABlock for enhanced weak target feature perception, DMSFHead for efficient multi-scale feature utilization, and the hybrid CIoU-NWD loss for stable model training), MFLNet systematically addresses the core technical challenges of IRSTD. Extensive experimental results confirm that MFLNet not only achieves new SOTA detection accuracy on multiple mainstream benchmarks but also exhibits superior computational efficiency. This work provides a robust and practical framework for real-world high-performance infrared detection systems, with application value for military surveillance, autonomous navigation, and related fields. Future work will focus on optimizing the sequence modeling mechanisms and exploring multi-frame spatiotemporal fusion to further improve detection performance in extremely low-SCR scenarios.