YOLO-MITD: a multi-scale infrared target detection algorithm

梁学才; 杜海娟; 李锦龙; 周羿阳; 杨贇; 石博; 杨俊; 许聪源; 汪佳旭

doi:10.12086/oee.2026.250322

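Abstract: To address the challenges of limited computational resources, strong background interference, and highly variable target scales in UAV infrared target detection, this paper proposes YOLO-MITD, a YOLO-based multi-scale infrared target detection algorithm. First, a ghost-enhanced cross-stage fusion module (GCFM) is designed, which rebuilds the feature fusion path with lightweight GhostConv convolutions, significantly reducing the number of model parameters while strengthening multi-scale feature extraction. Second, an efficient multi-scale interaction module (EMIM) is constructed, which strengthens key feature channels and spatial correlations through cross-dimensional interaction and improves the saliency of targets against complex backgrounds. Third, an adaptive spatial feature fusion head (ASFFH) is designed, which fuses multi-scale features through dynamic weight allocation and suppresses feature conflicts, effectively improving the scale consistency of detection. Finally, an adaptive threshold focal loss (ATFL) is constructed, which optimizes training by dynamically adjusting the hard-example mining threshold and alleviates class imbalance. Experiments on the HIT-UAV infrared dataset show that the proposed algorithm reaches an mAP50 of 86.7% at a small computational cost, 6.1% higher than the YOLOv8n baseline. Cross-scene generalization experiments on the CTIR and SIRST datasets further verify the robustness of the algorithm, with mAP50 reaching 85.4% and 67.2%, respectively, significantly outperforming mainstream models.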

Abstract:

Objective Unmanned aerial vehicle (UAV) infrared target detection faces significant challenges, including limited computational resources, strong background interference, and multi-scale target variations. Compared with visible-light imaging, infrared imaging works around the clock and in all weather, with strong resistance to interference. However, because of flight altitude and sensor limitations, UAV-acquired infrared images commonly suffer from low spatial resolution, poor signal-to-noise ratio, and large target scale spans. In long-range imaging, targets often exhibit multi-scale distributions, with small targets of only a few pixels and medium-to-large targets covering tens of pixels coexisting in the same scene. Traditional detection methods tend to produce high false positive rates and frequent missed detections in complex backgrounds. To address these issues, this paper proposes a multi-scale infrared target detection algorithm based on YOLO, named YOLO-MITD. The algorithm aims to achieve high detection accuracy while maintaining low computational complexity, making it suitable for deployment on resource-constrained UAV platforms.

Methods The proposed YOLO-MITD algorithm introduces four core modules to enhance detection performance. First, the ghost-enhanced cross-stage fusion module (GCFM) is designed by replacing the standard convolutions in the original C2f module with lightweight GhostConv convolutions. GhostConv generates features in two stages: a standard convolution extracts the essential (intrinsic) features, and cheap transformations based on depthwise separable convolution then generate the remaining ghost features. This significantly reduces model parameters while enhancing multi-scale feature extraction, and the theoretical acceleration ratio is approximately equal to the redundancy factor s. Second, the efficient multi-scale interaction module (EMIM) incorporates the efficient multi-scale attention (EMA) mechanism into each bottleneck of the C2f structure. EMA adopts a "feature grouping plus cross-spatial learning" strategy to jointly recalibrate channel and spatial information in a single pass, without reducing the channel dimension. The number of feature groups G is adjusted dynamically according to the channel count C, ensuring efficient feature representation while controlling computational complexity. Third, the adaptive spatial feature fusion head (ASFFH) is built on the adaptive spatial feature fusion (ASFF) mechanism, which consists of three steps: feature rescaling, which aligns spatial resolutions through interpolation or strided convolution; adaptive fusion, which combines the aligned features using softmax-normalized spatial weights learned by the network; and a consistency guarantee, which automatically suppresses gradients from conflicting regions. This module dynamically allocates fusion weights for multi-scale features and suppresses feature conflicts, effectively improving scale consistency in target detection. Fourth, the adaptive threshold focal loss (ATFL) function is constructed with a piecewise weighting mechanism that dynamically adjusts the loss curvature.
For high-confidence samples (p_t > 0.5), an exponential decay mechanism compresses the loss weights to low levels; for low-confidence samples (p_t <= 0.5), an explosive growth mechanism amplifies the loss weights to strengthen the gradients these samples contribute. The threshold parameter lambda is adjusted adaptively from the exponential moving average of prediction confidence across training batches, which alleviates class imbalance during training.
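The piecewise weighting described for ATFL can be illustrated with a small NumPy sketch. The abstract does not give the loss equation, so everything specific here is an assumption: the easy-sample branch borrows the standard focal-loss factor (1 - p_t)^gamma, the hard-sample branch uses (lambda - p_t)^eta with lambda kept at or above 1.5 so that its weight never drops below 1, and the mapping from the moving average of confidence to lambda is purely illustrative. The names `atfl_weight`, `atfl_loss`, `EmaThreshold`, `gamma`, `eta`, `momentum`, and `base` are hypothetical.

```python
import numpy as np

def atfl_weight(p_t, lam, gamma=2.0, eta=2.0):
    """Piecewise modulation factor (illustrative form, not the paper's equation)."""
    p_t = np.asarray(p_t, dtype=float)
    easy = (1.0 - p_t) ** gamma   # p_t > 0.5: shrinks toward 0 as p_t -> 1
    hard = (lam - p_t) ** eta     # p_t <= 0.5: at least 1 when lam >= 1.5
    return np.where(p_t > 0.5, easy, hard)

def atfl_loss(p_t, lam, gamma=2.0, eta=2.0, eps=1e-7):
    """Weighted cross-entropy term: weight(p_t) * -log(p_t), per sample."""
    p_t = np.clip(np.asarray(p_t, dtype=float), eps, 1.0 - eps)
    return atfl_weight(p_t, lam, gamma, eta) * (-np.log(p_t))

class EmaThreshold:
    """Tracks an exponential moving average of batch-mean confidence.

    The returned lam = base + ema is an assumed mapping chosen only so that
    lam stays in [base, base + 1]; the source does not specify this rule.
    """
    def __init__(self, momentum=0.9, base=1.5):
        self.momentum = momentum
        self.base = base
        self.ema = 0.5  # neutral starting confidence

    def update(self, batch_p_t):
        mean_conf = float(np.mean(batch_p_t))
        self.ema = self.momentum * self.ema + (1.0 - self.momentum) * mean_conf
        return self.base + self.ema

# One simulated training step: update the threshold, then weight the batch.
threshold = EmaThreshold()
batch = np.array([0.95, 0.80, 0.30, 0.10])
lam = threshold.update(batch)
losses = atfl_loss(batch, lam)
```

In this sketch the two confident predictions contribute almost nothing to the batch loss, while the two low-confidence ones dominate it, which is the behaviour the piecewise mechanism is meant to produce.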

Results and Discussions Extensive experiments were conducted on the HIT-UAV dataset, the first publicly available UAV infrared thermal imaging target detection dataset for high-altitude scenes. The dataset contains 2898 UAV infrared aerial images covering multiple real-world scenarios, including schools, parking lots, roads, and playgrounds, under both daytime and nighttime conditions. The target size distribution is highly diverse and dominated by small targets: small targets account for approximately 70% of the dataset, medium targets about 28%, and large targets only 2%. The proposed YOLO-MITD algorithm achieved an mAP50 of 86.7%, a 6.1% improvement over the baseline YOLOv8n. Precision and recall reached 89.7% and 83.3%, increases of 5.6% and 8.3% over YOLOv8n, respectively. Despite these gains, the computational overhead increased by only 0.6 GFLOPs and the parameter count was held to 3.895 M, a favorable balance between accuracy and complexity. Ablation experiments demonstrated the effectiveness of each module: GCFM reduced parameters by 0.485 M and FLOPs by 6.5 G while improving mAP50 by 1.8 percentage points; EMIM improved mAP50 by 4.2 percentage points, the largest accuracy gain of the four; ASFFH improved mAP50 by 2.6 percentage points but increased FLOPs by 2.2 G; ATFL improved mAP50 by 3.0 percentage points without increasing model complexity. The complete model combining all four modules achieved the best performance, verifying that the modules are complementary. Comparative experiments against mainstream models, including RT-DETR-l, YOLO-SRMX, YOLOv8n, YOLOv8s, YOLOv10n, YOLOv10s, YOLOv11n, and YOLOv12n, showed that YOLO-MITD achieves superior detection performance while remaining lightweight. Specifically, mAP50 improved by 6.6% and 3.9% over RT-DETR-l and YOLO-SRMX, respectively.
Cross-scene generalization experiments on the CTIR and SIRST datasets further validated the algorithm's robustness, with mAP50 reaching 85.4% and 67.2%, respectively, significantly outperforming existing methods. Visualization results demonstrated that the proposed method reduces missed and false detections, particularly in complex backgrounds and for targets with incomplete contours.
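The parameter and FLOP savings reported for GCFM trace back to the cost model of GhostConv. The sketch below counts multiply-accumulate operations for one layer under the usual GhostNet-style assumptions: a standard k x k convolution produces only 1/s of the output channels, and each of those intrinsic maps yields s - 1 further maps through a cheap d x d depthwise operation. Here s is the redundancy factor; the function names and the layer sizes are hypothetical, and bias terms are ignored.

```python
def standard_conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates of a plain k x k convolution."""
    return c_out * h_out * w_out * c_in * k * k

def ghost_conv_macs(c_in, c_out, k, h_out, w_out, s=2, d=3):
    """Multiply-accumulates of a GhostConv-style layer (assumed cost model)."""
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by the redundancy factor s")
    intrinsic = c_out // s
    # Stage 1: standard convolution producing the intrinsic feature maps.
    primary = intrinsic * h_out * w_out * c_in * k * k
    # Stage 2: cheap d x d depthwise operations producing the ghost maps.
    cheap = (s - 1) * intrinsic * h_out * w_out * d * d
    return primary + cheap

def speedup_ratio(c_in, c_out, k, h_out, w_out, s=2, d=3):
    """Theoretical acceleration of GhostConv over a standard convolution."""
    return (standard_conv_macs(c_in, c_out, k, h_out, w_out)
            / ghost_conv_macs(c_in, c_out, k, h_out, w_out, s, d))

# Made-up layer: 64 -> 128 channels, 3 x 3 kernel, 80 x 80 output map.
ratio = speedup_ratio(c_in=64, c_out=128, k=3, h_out=80, w_out=80, s=2, d=3)
print(f"theoretical speed-up: {ratio:.3f}")
```

When d equals k the ratio simplifies to s * c_in / (c_in + s - 1), which approaches s once the input channel count is much larger than s. The same ratio applies to the weight count, since both stages scale with the spatial size in the same way.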

Conclusions This paper presents an improved YOLOv8n-based algorithm for multi-scale infrared target detection in UAV applications. The GCFM module achieves lightweight feature extraction through GhostConv by replacing standard convolutions with cheap operations, maintaining feature expressiveness while reducing computational burden. The EMIM module enhances feature representation via the EMA attention mechanism, which captures both channel and spatial dependencies through cross-dimensional interaction without dimensionality reduction. The ASFFH module improves multi-scale fusion through adaptive weight allocation, effectively resolving feature pyramid inconsistencies and suppressing gradient conflicts. The ATFL loss function optimizes the training process through dynamic threshold adjustment, applying different modulation strategies to samples based on confidence levels and alleviating class imbalance issues. The experimental results demonstrate that the proposed algorithm strikes an excellent balance between detection accuracy and model complexity, with strong cross-scene generalization capability. The method achieves state-of-the-art performance on multiple infrared datasets while maintaining low computational requirements, making it well-suited for real-time infrared target detection tasks on resource-constrained UAV platforms. Future work will focus on hardware deployment optimization, model quantization and pruning techniques, and further improvements to real-time performance and energy efficiency to adapt to more demanding engineering application requirements.
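The adaptive weight allocation attributed to ASFFH can be illustrated with a small NumPy sketch of the rescale-then-fuse steps. It assumes three feature levels with equal channel counts, uses nearest-neighbour resizing as a stand-in for the interpolation or strided convolution, and takes the per-pixel weight logits as an input rather than learning them with a network. All function names are hypothetical, and the random tensors are placeholders, not data from the paper.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows[:, None], cols[None, :]]

def softmax(z, axis=0):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def asff_fuse(features, weight_logits, target_hw):
    """Fuse L feature levels with per-pixel softmax weights.

    features:      list of L arrays shaped (C, H_i, W_i)
    weight_logits: array shaped (L, H, W) at the target resolution
    """
    out_h, out_w = target_hw
    # Step 1: bring every level to the target resolution.
    aligned = np.stack([resize_nearest(f, out_h, out_w) for f in features])
    # Step 2: normalise the logits so the weights sum to 1 at each pixel.
    weights = softmax(weight_logits, axis=0)
    # Step 3: weighted sum over levels, broadcast across channels.
    fused = (weights[:, None, :, :] * aligned).sum(axis=0)
    return fused, weights

rng = np.random.default_rng(0)
feats = [rng.normal(size=(8, 16, 16)),
         rng.normal(size=(8, 8, 8)),
         rng.normal(size=(8, 4, 4))]
logits = rng.normal(size=(3, 8, 8))
fused, w = asff_fuse(feats, logits, target_hw=(8, 8))
```

Wherever one level's weight approaches zero at a pixel, that level contributes almost nothing to the fused output there, so its gradient at that location is scaled down by the same factor. This is one way to read the conflict suppression described for the fusion head.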
