• Abstract: To address the large scale variation and uneven spatial distribution of targets in UAV aerial images, as well as the missed and false detections that readily occur against complex backgrounds, this paper proposes a UAV aerial image target detection method based on semantic guidance and multi-dimensional feature perception. First, an adaptive multi-dimensional feature sampling unit is embedded, which strengthens the collaborative fusion of local and global features through multi-path interaction and improves multi-scale feature extraction. Second, an efficient adaptive pooling layer, the Spatial Pyramid Adaptive Pooling Fusion (SPAPF) layer, is constructed to preserve feature detail and structural information while reducing redundant computation and accelerating inference. Then, a cross-spatial global information integration module is proposed, which uses a cross-spatial multi-head self-attention mechanism to enhance global perception of locally dense targets and alleviate false detections. Finally, a semantic-guided feature fusion module is designed to fuse shallow detail features with deep semantic features efficiently, and an additional detection head for small targets is integrated to improve localization and classification accuracy. Experiments on the VisDrone2019 and VisDrone2021 datasets show that the proposed algorithm reaches mAP@0.5 of 46.8% and 44.7%, improvements of 7.2% and 6.5% over the baseline model; it outperforms the other compared algorithms and markedly reduces false and missed detections.


      Abstract:
      Objective UAV aerial imagery is widely used in military surveillance, emergency rescue, agricultural monitoring, and other fields owing to its flexible shooting angles and rich spatial information. However, target detection in such images faces prominent challenges: extreme variation in target scale, dense spatial distribution of small targets, and frequent false or missed detections caused by complex backgrounds. In addition, real-time application scenarios place strict demands on inference speed, making it difficult for existing algorithms to balance detection accuracy against computational efficiency. To address these issues, this study proposes a high-precision and efficient target detection method for UAV aerial images that strengthens the model's perception of multi-scale targets, small targets in particular, reduces false and missed detections, and retains the real-time performance required in practical applications.
      Methods Taking YOLOv11s as the baseline model, this study designs a framework of collaboratively optimized modules. First, an Adaptive Multi-dimensional Feature Sampling (AMFS) unit is embedded in the backbone network. Input feature maps are grouped along the channel dimension and processed by three parallel paths for multi-scale feature extraction; adaptive average pooling, depthwise convolution, and spatial attention are integrated to fuse local details with global context adaptively, compensating for the fixed receptive field of standard convolution and strengthening feature extraction for small targets. Second, a Spatial Pyramid Adaptive Pooling Fusion (SPAPF) layer is constructed to replace the traditional SPPF module. It combines smoothed approximate average pooling and smoothed approximate maximum pooling through learnable weight masks, preserving the structural details and texture of targets while cutting redundant computation, which accelerates inference and keeps the model lightweight. Third, a Cross-Spatial Contextual Parallel Self-Attention (CS_C2PSA) module is proposed. It introduces cross-spatial multi-head self-attention (CSL-MHSA) and staged residual connections: 1×1 and 3×3 convolutions associate local and global receptive fields, and vertical and horizontal spatial position information is integrated, which strengthens global contextual perception, reduces confusion between small targets and backgrounds or adjacent targets, and thereby cuts false detections. Finally, a Semantic-Guided Feature Fusion (SGFF) module and a dedicated Small Object Detection (SOD_Detect) head are designed. SGFF adopts a cross-layer feature alignment strategy with sub-pixel convolution and uses SE channel attention to inject deep semantic information into shallow features, fusing shallow detail and deep semantic features efficiently and producing high-resolution, semantically rich feature maps for the SOD_Detect head, which markedly improves the localization and classification accuracy of small targets. The model is trained and tested on the VisDrone2019 and VisDrone2021 datasets with a batch size of 16 for 200 epochs, on Windows 10 with an NVIDIA GeForce RTX 3090 GPU, PyTorch 1.10.1 with CUDA 11.1, and Python 3.8.
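      To make the SPAPF pooling concrete, the following PyTorch sketch shows one plausible reading of "smoothed approximate average/maximum pooling blended through learnable weight masks": a softmax-weighted window pooling whose temperature interpolates between average pooling (low temperature) and max pooling (high temperature). The class names SmoothPool2d and SPAPFSketch, the initial temperatures, and the sigmoid channel mask are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothPool2d(nn.Module):
    """Smoothed pooling over k x k windows: a softmax over each window
    weights its elements, so temperature t interpolates between average
    pooling (t -> 0) and max pooling (t -> infinity)."""
    def __init__(self, kernel_size=5, temperature=1.0):
        super().__init__()
        self.k = kernel_size
        self.t = nn.Parameter(torch.tensor(float(temperature)))  # learnable

    def forward(self, x):
        b, c, h, w = x.shape
        pad = self.k // 2
        # unfold into k*k patches: (B, C*k*k, H*W) -> (B, C, k*k, H*W)
        patches = F.unfold(x, self.k, padding=pad)
        patches = patches.view(b, c, self.k * self.k, h * w)
        weights = F.softmax(self.t * patches, dim=2)  # per-window softmax
        pooled = (weights * patches).sum(dim=2)       # (B, C, H*W)
        return pooled.view(b, c, h, w)

class SPAPFSketch(nn.Module):
    """Blends near-average and near-max smooth pooling through a learnable
    per-channel mask (a stand-in for the paper's weight masks)."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.soft_avg = SmoothPool2d(kernel_size, temperature=0.1)   # near-average
        self.soft_max = SmoothPool2d(kernel_size, temperature=10.0)  # near-max
        self.mask = nn.Parameter(torch.zeros(1, channels, 1, 1))     # blend logits
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        a = torch.sigmoid(self.mask)  # per-channel blend weight in [0, 1]
        fused = a * self.soft_max(x) + (1.0 - a) * self.soft_avg(x)
        return self.proj(fused)

      Similarly, the SGFF step that aligns deep semantics to shallow details via sub-pixel convolution and SE channel attention might look roughly as follows; SGFFSketch, its channel arguments, and the gating-plus-addition fusion rule are assumptions for illustration.

import torch
import torch.nn as nn

class SGFFSketch(nn.Module):
    """Semantic-guided fusion: sub-pixel upsampling of deep features plus
    SE channel attention that reweights shallow features (illustrative)."""
    def __init__(self, deep_c, shallow_c, scale=2, reduction=4):
        super().__init__()
        # sub-pixel convolution: expand channels, then PixelShuffle to upsample
        self.up = nn.Sequential(
            nn.Conv2d(deep_c, shallow_c * scale * scale, kernel_size=1),
            nn.PixelShuffle(scale),
        )
        # SE channel attention computed from the upsampled semantic features
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(shallow_c, shallow_c // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(shallow_c // reduction, shallow_c, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow, deep):
        up = self.up(deep)             # align deep map to shallow resolution
        gated = shallow * self.se(up)  # deep semantics gate the shallow details
        return gated + up              # fuse detail and semantic streams

      Both sketches are drop-in nn.Module blocks; for example, SGFFSketch(512, 256)(shallow, deep) would fuse a 256-channel shallow map with a 512-channel deep map at half its resolution.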
      Results and Discussions Ablation experiments on the VisDrone2019 dataset show that every improved module contributes to performance: the AMFS unit raises mAP@0.5 by 3.2%; the SPAPF layer cuts GFLOPs by 4.4 while improving Recall by 2.2%; the CS_C2PSA module raises Precision by 2.3% and mAP@0.5 by 2.0%; and the SGFF module raises Precision by 3.9% and mAP@0.5 by 3.4%. With all four modules integrated, the model reaches a Precision of 58.3%, a Recall of 43.1%, and an mAP@0.5 of 46.8%, which are 8.1%, 4.7%, and 7.2% higher than those of the baseline YOLOv11s, respectively, while GFLOPs stay at 30.8, preserving computational efficiency. Comparative experiments against 11 mainstream algorithms (including SSD, Faster-RCNN, YOLOv5s, and YOLOv11s) on VisDrone2019 and VisDrone2021 show that the proposed method achieves mAP@0.5 of 46.8% and 44.7% on the two datasets, 7.2% and 6.5% above the baseline. On the VisDrone2019 dataset, its mAP@0.5 exceeds the recent YOLOv10s and YOLOv8s by 8.0% and 8.6%, respectively. Against the UAV-specific detector ESO-DETR, mAP@0.5 improves by 2.1% and FPS by 14. Even against the lightweight PP-YOLOE, the proposed method retains advantages in both accuracy and speed, with Precision higher by 4.3% and FPS by 21. Visualizations of challenging scenarios, such as dense small targets, multi-scale aerial views, and low-light occlusion, show that the method effectively reduces false and missed detections and outperforms algorithms such as MIS-YOLOv8 on dense small targets. However, distinguishing the Pedestrian and People categories in extremely dense scenes still leaves room for improvement, which is a direction for subsequent optimization.
      Conclusions This study proposes a UAV aerial image target detection method based on semantic guidance and multi-dimensional feature perception. Through the collaborative optimization of the AMFS, SPAPF, CS_C2PSA, and SGFF modules, it effectively addresses the key difficulties of UAV aerial image target detection: large variation in target scale, densely distributed small targets, and high false and missed detection rates against complex backgrounds. Experimental results on public datasets verify that the method significantly improves detection accuracy while maintaining efficient computation, and exhibits strong generalization and robustness, providing a reliable technical solution for high-precision, real-time target detection in UAV aerial imagery. Future work will further optimize the module structure to better distinguish similar target categories in extremely dense scenes, and will explore application in more complex environments such as severe weather and heavy occlusion to broaden the method's practical scope.