Abstract:
Objective Unmanned aerial vehicle (UAV) infrared target detection faces significant challenges, including limited computational resources, strong background interference, and multi-scale target variations. Compared with visible-light imaging, infrared imaging works around the clock and in all weather conditions, and it is more robust to interference. However, because of flight altitude and sensor limitations, UAV-acquired infrared images commonly suffer from low spatial resolution, a poor signal-to-noise ratio, and a wide range of target scales. In long-range imaging, targets often exhibit multi-scale distributions, with small targets of only a few pixels and medium-to-large targets covering tens of pixels coexisting in the same scene. Traditional detection methods tend to produce high false positive rates and frequent missed detections in complex backgrounds. To address these issues, this paper proposes a multi-scale infrared target detection algorithm based on YOLO, named YOLO-MITD. The algorithm aims to achieve high detection accuracy while maintaining low computational complexity, making it suitable for deployment on resource-constrained UAV platforms.
Methods The proposed YOLO-MITD algorithm introduces four core modules to enhance detection performance. First, the ghost-enhanced cross-stage fusion module (GCFM) is designed by replacing the standard convolutions in the original C2f module with lightweight GhostConv convolutions. GhostConv adopts a two-stage feature generation mechanism: a standard convolution extracts the essential (intrinsic) features, and cheap transformations implemented with depthwise separable convolution then generate the additional ghost features. This significantly reduces model parameters while enhancing multi-scale feature extraction. The theoretical speed-up ratio is approximately equal to the redundancy factor s, i.e., the number of output feature maps produced per intrinsic feature map. Second, the efficient multi-scale interaction module (EMIM) incorporates the efficient multi-scale attention (EMA) mechanism into each bottleneck of the C2f structure. EMA adopts a "feature grouping plus cross-spatial learning" strategy to recalibrate channel and spatial information jointly in a single pass, without reducing the channel dimension. The number of feature groups G is adjusted dynamically according to the channel count C, ensuring efficient feature representation while controlling computational complexity. Third, the adaptive spatial feature fusion head (ASFFH) is developed based on the adaptive spatial feature fusion (ASFF) mechanism, which consists of three steps: feature rescaling through interpolation or strided convolution to align spatial resolutions, adaptive fusion through softmax-normalized spatial weights learned by the network, and a consistency constraint that automatically suppresses gradients from conflicting regions. This module dynamically allocates fusion weights for multi-scale features and suppresses feature conflicts, effectively improving scale consistency in target detection.
Fourth, the adaptive threshold focal loss (ATFL) function is constructed with a piecewise weighting mechanism that dynamically adjusts the loss curvature. For high-confidence samples (p_t > 0.5), an exponential decay mechanism compresses the loss weights to a low level; for low-confidence samples (p_t ≤ 0.5), a rapid-growth mechanism amplifies the loss weights so that these samples contribute stronger gradients. The threshold parameter λ is adjusted adaptively based on the exponential moving average of the prediction confidence across training batches, thereby alleviating class imbalance during training.
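The claim that the GhostConv speed-up ratio is approximately s can be checked with a short cost count. The sketch below is illustrative only and is not taken from the paper's implementation: the function names, the 40 x 40 output size, and the choice of a d x d depthwise kernel for the cheap operation are assumptions, and biases are ignored.

```python
def standard_conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates of a plain k x k convolution (bias ignored)."""
    return c_out * h_out * w_out * c_in * k * k


def ghost_conv_flops(c_in, c_out, k, d, s, h_out, w_out):
    """Multiply-accumulates of a Ghost-style layer (assumed structure).

    A primary k x k convolution produces c_out / s intrinsic maps; each
    intrinsic map then yields (s - 1) ghost maps through a cheap d x d
    depthwise operation.
    """
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by s in this sketch")
    intrinsic = c_out // s
    primary = intrinsic * h_out * w_out * c_in * k * k
    cheap = (s - 1) * intrinsic * h_out * w_out * d * d
    return primary + cheap


def speedup(c_in, c_out, k, d, s, h_out=40, w_out=40):
    """Ratio of standard to Ghost-style cost; the spatial size cancels out."""
    plain = standard_conv_flops(c_in, c_out, k, h_out, w_out)
    ghost = ghost_conv_flops(c_in, c_out, k, d, s, h_out, w_out)
    return plain / ghost


if __name__ == "__main__":
    # Hypothetical layer: 128 -> 128 channels, 3 x 3 kernels for both stages.
    for s in (2, 4):
        print(f"s={s}: speed-up = {speedup(128, 128, 3, 3, s):.3f}")
```

When d equals k, the ratio simplifies to s·c/(c + s − 1), where c is the input channel count, so it approaches s whenever s is much smaller than c. For the hypothetical 128-channel layer above this gives 256/129 for s = 2, slightly below 2.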
Results and Discussions Extensive experiments were conducted on the HIT-UAV dataset, which is the first publicly available UAV infrared thermal imaging target detection dataset for high-altitude scenes. The dataset contains 2898 UAV infrared aerial images covering multiple real-world scenarios, including schools, parking lots, roads, and playgrounds, under both daytime and nighttime conditions. Notably, the target size distribution is highly uneven, with small targets accounting for approximately 70%, medium targets about 28%, and large targets only 2% of the dataset. The proposed YOLO-MITD algorithm achieved an mAP50 of 86.7%, an improvement of 6.1 percentage points over the baseline YOLOv8n. Precision and recall reached 89.7% and 83.3%, respectively, increases of 5.6 and 8.3 percentage points compared to YOLOv8n. Despite these gains, the computational cost increased by only 0.6 GFLOPs and the parameter count was kept at 3.895 M, achieving a favorable balance between accuracy and complexity. Ablation experiments demonstrated the effectiveness of each module: GCFM reduced parameters by 0.485 M and FLOPs by 6.5 G while improving mAP50 by 1.8 percentage points; EMIM improved mAP50 by 4.2 percentage points, the largest accuracy gain of any single module; ASFFH improved mAP50 by 2.6 percentage points but increased FLOPs by 2.2 G; ATFL improved mAP50 by 3.0 percentage points without increasing model complexity. The complete model combining all four modules achieved the best performance, verifying that the modules are complementary. Comparative experiments against mainstream models, including RT-DETR-l, YOLO-SRMX, YOLOv8n, YOLOv8s, YOLOv10n, YOLOv10s, YOLOv11n, and YOLOv12n, showed that YOLO-MITD achieves superior detection performance while remaining lightweight. Specifically, compared with RT-DETR-l and YOLO-SRMX, mAP50 improved by 6.6 and 3.9 percentage points, respectively.
Cross-scene generalization experiments on the CTIR and SIRST datasets further validated the algorithm's robustness, with mAP50 reaching 85.4% and 67.2%, respectively, significantly outperforming existing methods. Visualization results demonstrated that the proposed method reduces missed and false detections, particularly in complex backgrounds and for targets with incomplete contours.
Conclusions This paper presents an improved YOLOv8n-based algorithm for multi-scale infrared target detection in UAV applications. The GCFM module achieves lightweight feature extraction through GhostConv by replacing standard convolutions with cheap operations, maintaining feature expressiveness while reducing computational burden. The EMIM module enhances feature representation via the EMA attention mechanism, which captures both channel and spatial dependencies through cross-dimensional interaction without dimensionality reduction. The ASFFH module improves multi-scale fusion through adaptive weight allocation, effectively resolving feature pyramid inconsistencies and suppressing gradient conflicts. The ATFL loss function optimizes the training process through dynamic threshold adjustment, applying different modulation strategies to samples based on confidence levels and alleviating class imbalance issues. The experimental results demonstrate that the proposed algorithm strikes an excellent balance between detection accuracy and model complexity, with strong cross-scene generalization capability. The method achieves state-of-the-art performance on multiple infrared datasets while maintaining low computational requirements, making it well-suited for real-time infrared target detection tasks on resource-constrained UAV platforms. Future work will focus on hardware deployment optimization, model quantization and pruning techniques, and further improvements to real-time performance and energy efficiency to adapt to more demanding engineering application requirements.
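As a rough illustration of the adaptive weight allocation that ASFFH relies on, the sketch below fuses already-aligned feature maps with per-location softmax weights. It is a minimal single-channel, pure-Python sketch under stated assumptions, not the paper's implementation: the helper names are hypothetical, nearest-neighbour upsampling and plain subsampling stand in for the interpolation and strided convolution used for rescaling, and the weight logits are passed in directly, whereas in a real network they would be learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def downsample_stride(fmap, stride):
    """Stand-in for a strided convolution: keep every stride-th sample."""
    return [row[::stride] for row in fmap[::stride]]


def asff_fuse(levels, weight_logits):
    """Fuse same-size 2-D maps with softmax weights computed per location.

    levels:        list of 2-D lists, already rescaled to a common size.
    weight_logits: list of 2-D lists with the same shapes (assumed inputs).
    """
    h, w = len(levels[0]), len(levels[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            weights = softmax([wl[i][j] for wl in weight_logits])
            fused[i][j] = sum(a * lv[i][j] for a, lv in zip(weights, levels))
    return fused
```

With equal logits every level contributes equally, so the fused map is the element-wise mean; a large logit at one location lets a single level dominate there, which is how conflicting responses from the other levels are down-weighted.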