• Abstract: Existing deep learning-based infrared target detection algorithms (ITDA) have insufficient capability to extract and process features in infrared images, resulting in high rates of missed and false detections. Most algorithms cannot handle infrared targets of different scales at the same time: detecting small infrared targets relies on highly sensitive feature extraction to capture subtle cues, whereas larger targets demand stronger global understanding, so as target scale grows the model tends to overfit local textures, degrading performance and reducing accuracy in cross-scene deployment. To address these problems, this paper proposes an adaptive multi-scale infrared target detection network based on YOLO (AFITDYOLO), which strengthens multi-scale infrared target detection through a multi-layer feature extraction module and a multi-layer feature fusion module. A multi-scale feature fusion module (MFFM) is proposed, which improves the expressiveness of multi-scale feature fusion by enhancing the correlation between features at different levels of the feature pyramid. A multi-kernel feature extraction convolution (MFEConv) is proposed, which enlarges the receptive field through heterogeneous convolution groups so that it better matches the spatial distribution of targets at different scales. A cross-attention fusion module (CAFM) is proposed, which enhances important feature information and improves feature representation through the comparative interaction of feature maps output by different layers of the network. To evaluate AFITDYOLO, experiments are conducted on the UAV-and-bird dataset SIRST-UAVB and the road pedestrian-and-vehicle dataset CTIR, where it reaches mAP50 of 88.9% and 90.7%, respectively, improvements of 5.6% and 6.5% over YOLOv10n, achieving the best detection accuracy among current mainstream methods. The proposed algorithm shows excellent accuracy and adaptability in multi-scale infrared target detection and supports cross-scene deployment.


      Abstract:
      Objective Infrared target detection plays an indispensable role in numerous critical domains, such as security surveillance, autonomous driving, and military reconnaissance, owing to its unique perceptual capability under complex environments (e.g., low-light conditions and severe weather). However, infrared images inherently suffer from low contrast, blurred details, and significant noise interference, which often lead to ambiguous target edges, missing texture features, and other challenges during the detection process. Existing deep learning-based infrared target detection algorithms (ITDA) exhibit inadequate performance in feature extraction and processing for infrared images, resulting in relatively high rates of missed detection and false detection. Moreover, our systematic analysis of infrared target detection tasks reveals that algorithms tailored for small infrared targets rely heavily on high-sensitivity feature extraction to capture subtle characteristics. Nevertheless, as the scale of detected targets increases, these algorithms tend to encounter overfitting to local textures and elevated false detection rates, thereby degrading overall performance. In practical applications, detection environments are dynamically changing with targets of varying scales; thus, multi-scale detection capability is critical to ensuring algorithms maintain high reliability and adaptability in complex real-world scenarios. Unfortunately, most state-of-the-art algorithms are optimized for single-scale targets, making it challenging to simultaneously satisfy the requirements of high-precision localization for small targets and effective semantic understanding for large targets.
      Methods To address the above issues, this paper proposes an adaptive multi-scale infrared target detection network based on YOLO (AFITDYOLO). The network processes infrared images containing targets of different scales and employs a multi-layer feature extraction module and a multi-layer feature fusion module to enhance its multi-scale infrared target detection capability. First, a multi-scale feature fusion module (MFFM) is proposed. It enhances the correlation between features of different layers in the feature pyramid network (FPN) and coordinates deep semantic features with shallow spatial detail features more effectively, thereby improving the representational ability of multi-scale feature fusion. Second, a multi-kernel feature extraction convolution (MFEConv) is constructed. By employing heterogeneous convolution groups, MFEConv expands the receptive field and strengthens the model's feature extraction capability. Finally, a cross-attention fusion module (CAFM) is designed. Through the comparative interaction of feature maps output by different layers of the detection network, CAFM exploits the complementary information among these feature maps to suppress infrared image noise and further enhance feature representation capability.
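The paper does not give implementation details for MFEConv; the following is a minimal PyTorch sketch of how heterogeneous convolution groups might enlarge the receptive field. The class name is from the paper, but the specific branch kernels (1×1, 3×3, 5×5 depthwise) and the 1×1 fusion projection are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn


class MFEConv(nn.Module):
    """Hypothetical multi-kernel feature extraction convolution: parallel
    heterogeneous depthwise kernels capture targets of different scales,
    then a pointwise projection fuses the branches."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise branches with increasing kernel size widen the
        # effective receptive field at low parameter cost.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
            for k in (1, 3, 5)
        ])
        # 1x1 projection fuses the concatenated branch outputs.
        self.project = nn.Sequential(
            nn.Conv2d(3 * in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

Because each branch preserves spatial resolution, the module is a drop-in replacement for a standard convolution block in a YOLO-style backbone.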
      Results and Discussions To validate the effectiveness of the proposed method in improving detection performance, extensive training and evaluation are conducted on the CTIR dataset, which contains multi-scale infrared targets of road pedestrians and vehicles. To further verify the adaptability of the method, additional experiments are performed on the SIRST-UAVB dataset, a single-frame UAV-and-bird dataset with more complex backgrounds and smaller target scales. On these two datasets, AFITDYOLO achieves mean average precision at 50% intersection over union (mAP50) of 90.7% and 88.9%, respectively, improvements of 6.5% and 5.6% over YOLOv10n. In terms of lightweight optimization, the proposed method attains higher inference speed (measured in frames per second, FPS) while using fewer model parameters (Params) and floating-point operations (FLOPs). Compared with current mainstream methods, AFITDYOLO exhibits the highest detection accuracy, the lowest parameter count and FLOPs, and the fastest inference speed, demonstrating distinct advantages. Additionally, to evaluate the generalization ability of the proposed method, cross-dataset experiments are carried out on the HIT-UAV dataset (a high-altitude UAV infrared thermal imaging dataset) and the IRSTD-1k dataset (a classic infrared small target dataset). The results indicate that although the precision (P) of AFITDYOLO is slightly lower than that of DEIM-N, it outperforms all other mainstream methods on the remaining evaluation metrics. These findings confirm that the proposed method improves detection accuracy on the infrared datasets used in the generalization experiments, validating its strong generalization capability and its feasibility for cross-scenario deployment. Overall, the proposed method simultaneously improves detection accuracy and keeps the model lightweight, meeting the requirements of real-time detection applications.
      Conclusions The proposed AFITDYOLO network, an adaptive multi-scale infrared target detection network based on YOLO, improves the detection accuracy of infrared targets of different scales under various backgrounds with a relatively small number of parameters. The proposed MFFM strengthens the model's representational ability in multi-scale feature fusion by improving the correlation between features of different layers in the FPN. The lightweight convolution module MFEConv achieves an efficient, larger receptive field with minimal parameters by exploiting the target distribution characteristics of infrared images. The CAFM highlights important feature information, filters out irrelevant background, and suppresses noise through the comparative interaction of feature maps output by different layers, further boosting the model's feature representation capability. Experimental results demonstrate that the proposed method outperforms current mainstream algorithms in detection accuracy, lightweight performance, and generalization ability, and supports cross-scenario deployment.
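The "comparative interaction of feature maps output by different layers" in CAFM could be realized in several ways; one plausible reading is a mutual channel-attention gating, sketched below in PyTorch. The class name follows the paper, but the gating structure (each stream re-weighted by attention derived from the other) is a hypothetical interpretation, assuming the two inputs have matching channel and spatial dimensions.

```python
import torch
import torch.nn as nn


class CAFM(nn.Module):
    """Hypothetical cross-attention fusion: shallow and deep feature maps
    exchange channel-attention weights so that complementary information
    is emphasized and background noise is suppressed."""

    def __init__(self, ch: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global context per channel
        self.gate_from_deep = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.gate_from_shallow = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Each stream is gated by weights computed from the OTHER stream,
        # so the two layers' outputs interact rather than fuse blindly.
        w_deep = self.gate_from_deep(self.pool(deep))
        w_shallow = self.gate_from_shallow(self.pool(shallow))
        return shallow * w_deep + deep * w_shallow
```

In a real detection head, the shallow and deep inputs would first be resampled and projected to a common shape before this fusion step.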