Abstract:
Objective Infrared target detection plays an indispensable role in numerous critical domains, such as security surveillance, autonomous driving, and military reconnaissance, owing to its unique perceptual capability in complex environments (e.g., low-light conditions and severe weather). However, infrared images inherently suffer from low contrast, blurred details, and significant noise interference, which often lead to ambiguous target edges, missing texture features, and other challenges during detection. Existing deep learning-based infrared target detection algorithms extract and process infrared image features inadequately, resulting in relatively high rates of missed and false detections. Moreover, our systematic analysis of infrared target detection tasks reveals that algorithms tailored for small infrared targets rely heavily on high-sensitivity feature extraction to capture subtle characteristics; as the scale of the detected targets increases, these algorithms tend to overfit to local textures and suffer elevated false detection rates, degrading overall performance. In practical applications, detection environments change dynamically and contain targets of varying scales, so multi-scale detection capability is critical for maintaining high reliability and adaptability in complex real-world scenarios. Unfortunately, most state-of-the-art algorithms are optimized for single-scale targets and struggle to simultaneously satisfy the requirements of high-precision localization for small targets and effective semantic understanding for large targets.
Methods To address the above issues, this paper proposes an adaptive multi-scale infrared target detection network based on YOLO (AFITDYOLO). The network accepts infrared images containing targets of different scales and employs a multi-layer feature extraction module and a multi-layer feature fusion module to enhance multi-scale infrared target detection. First, a multi-scale feature fusion module (MFFM) is proposed. This module strengthens the correlation between features of different layers in the feature pyramid network (FPN) and coordinates deep semantic features with shallow spatial detail features more effectively, thereby improving the representational ability of multi-scale feature fusion. Second, a multi-kernel feature extraction convolution (MFEConv) is constructed; by utilizing heterogeneous convolution groups, MFEConv expands the receptive field and strengthens the model's feature extraction capability. Finally, a cross-attention fusion module (CAFM) is designed. Through the comparative interaction of feature maps output by different layers of the detection network, CAFM leverages the complementary information among these feature maps to suppress infrared image noise and further enhance feature representation.
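The heterogeneous-convolution-group idea behind MFEConv can be sketched as follows. This is a minimal illustrative PyTorch approximation, not the paper's exact MFEConv: the class name, the four-way channel split, and the particular kernel sizes (3×3, 5×5, dilated 3×3, identity) are assumptions chosen only to show how unequal kernels over channel groups enlarge the receptive field at low parameter cost.

```python
# Hedged sketch of a multi-kernel grouped convolution (MFEConv-like idea).
# All configuration choices below are illustrative assumptions.
import torch
import torch.nn as nn

class MFEConvSketch(nn.Module):
    """Split channels into groups, apply heterogeneous kernels per group
    to widen the receptive field, then fuse with a 1x1 projection."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must split into 4 groups"
        g = channels // 4
        self.branches = nn.ModuleList([
            nn.Conv2d(g, g, 3, padding=1),              # local detail
            nn.Conv2d(g, g, 5, padding=2),              # wider context
            nn.Conv2d(g, g, 3, padding=2, dilation=2),  # dilated, larger RF
            nn.Identity(),                              # cheap pass-through
        ])
        self.proj = nn.Conv2d(channels, channels, 1)    # fuse the groups

    def forward(self, x):
        groups = torch.chunk(x, 4, dim=1)
        out = torch.cat([b(g) for b, g in zip(self.branches, groups)],
                        dim=1)
        return self.proj(out)

x = torch.randn(1, 16, 32, 32)
y = MFEConvSketch(16)(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```

Because each branch convolves only a quarter of the channels, the parameter count stays well below that of a single full-channel 5×5 convolution while the mixture of kernel sizes covers several receptive-field scales at once.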
Results and Discussions To validate the effectiveness of the proposed method, extensive training and evaluation are conducted on the CTIR dataset, which comprises road pedestrians and vehicles at multiple infrared target scales. To further verify the adaptability of the method, additional experiments are performed on the SIRST-UAVB dataset, a single-frame UAV bird dataset with more complex backgrounds and smaller target scales. On these two datasets, AFITDYOLO achieves mean average precision at 50% intersection over union (mAP50) of 88.9% and 90.7%, respectively, improvements of 5.6% and 6.5% over YOLOv10n. In terms of lightweight optimization, the proposed method achieves higher inference speed (measured in frames per second, FPS) while using fewer model parameters (params) and floating-point operations (FLOPs). Compared with current mainstream methods, AFITDYOLO exhibits the highest detection accuracy, the lowest parameter count and FLOPs, and the fastest inference speed. Additionally, to evaluate generalization ability, cross-dataset experiments are carried out on the HIT-UAV dataset (a high-altitude UAV infrared thermal imaging dataset) and the IRSTD-1k dataset (a classic infrared small target dataset). Results indicate that, although the precision (P) of AFITDYOLO is slightly lower than that of DEIM-N, it outperforms all other mainstream methods on all remaining evaluation metrics. These findings confirm improved detection accuracy on the infrared datasets used in the generalization experiments, validating strong generalization capability and further demonstrating feasibility for cross-scenario deployment.
Overall, the proposed method achieves both higher detection accuracy and a lighter detection model, fully meeting the requirements of real-time detection applications.
Conclusions The proposed AFITDYOLO, an adaptive multi-scale infrared target detection network based on YOLO, improves the detection accuracy of infrared targets of different scales under various backgrounds with a relatively small number of parameters. The proposed MFFM enhances the model's representational ability in multi-scale feature fusion by strengthening the correlation between features of different layers in the FPN. The lightweight convolution module MFEConv achieves an efficient, larger receptive field with minimal parameters by leveraging the target distribution characteristics of infrared images. Furthermore, the CAFM highlights important feature information, filters out irrelevant background information, and suppresses noise through the comparative interaction of feature maps output by different layers, further boosting the model's feature representation capability. Experimental results demonstrate that the proposed method outperforms current mainstream algorithms in detection accuracy, lightweight performance, and generalization ability, along with cross-scenario deployment capability.
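The cross-layer interaction that CAFM performs can be sketched as a cross-attention operation in which one layer's feature map queries another's. This is a hedged PyTorch illustration of the general mechanism only; the class name, the single-head formulation, the 1×1 query/key/value projections, and the residual fusion are assumptions, not the paper's exact CAFM design.

```python
# Hedged sketch of cross-attention fusion between two layers' feature maps
# (CAFM-like idea). All design choices below are illustrative assumptions.
import torch
import torch.nn as nn

class CAFMSketch(nn.Module):
    """Queries from a shallow feature map attend to keys/values from a
    deep feature map, so complementary cross-layer information can
    highlight targets and suppress background noise."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, deep, shallow):
        b, c, h, w = shallow.shape
        q = self.q(shallow).flatten(2).transpose(1, 2)    # (b, hw_s, c)
        k = self.k(deep).flatten(2)                       # (b, c, hw_d)
        v = self.v(deep).flatten(2).transpose(1, 2)       # (b, hw_d, c)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (b, hw_s, hw_d)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return shallow + out                              # residual fusion

deep = torch.randn(1, 8, 8, 8)      # smaller deep-layer map
shallow = torch.randn(1, 8, 16, 16) # larger shallow-layer map
fused = CAFMSketch(8)(deep, shallow)
print(fused.shape)  # torch.Size([1, 8, 16, 16])
```

Note that because the attention matrix relates shallow positions to deep positions, the two maps need not share spatial resolution; the output follows the shallow (query) map's shape, which suits fusing detail-rich shallow features with semantic deep features.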