• Abstract: To address the low detection accuracy caused by abundant small targets, large scale variation, and complex backgrounds in unmanned aerial vehicle (UAV) aerial images, this paper proposes an attention-fused object detection algorithm based on YOLOv8n. First, the backbone network is improved with the FasterNet architecture, whose partial convolution compensates for the detail loss incurred by conventional convolution and pooling operations and thereby strengthens small-target detection. Second, a Triplet attention mechanism is introduced to enhance the capture of feature-map information, effectively handling diverse samples and target scales. Finally, a DARFCB structure is designed on the basis of receptive-field attention convolution (RFAConv) and the convolutional block attention module (CBAM) to reinforce edge features and raise detection accuracy in complex environments. Experiments on the VisDrone2019 dataset show that, compared with the baseline model, the proposed algorithm improves P, R, and mAP50 by 4.9%, 3.9%, and 6.1%, respectively, effectively increasing detection accuracy. Experiments on the WiderPerson and PASCAL VOC datasets further demonstrate good generalization, indicating that the algorithm can provide effective support for object detection in UAV aerial images.


      Abstract:
      Objective Unmanned aerial vehicle (UAV) imagery poses enduring challenges for object detection, largely because targets are frequently small, densely distributed, and captured under heterogeneous imaging conditions. In contrast to ground-level scenes, aerial views often exhibit pronounced scale variation, viewpoint changes, partial occlusions, and severe background clutter arising from man-made structures, vegetation, road textures, and illumination-induced shadows. In addition, motion blur and sensor noise are common due to platform dynamics and altitude-dependent resolution constraints. Collectively, these factors exacerbate the loss of fine-grained cues during feature downsampling and increase both missed detections and false alarms, particularly when small objects occupy only a few pixels and their boundaries are confounded with background patterns.
Methods This work presents an attention-fused object detection approach built upon an improved YOLOv8n framework to alleviate the above limitations. Although YOLOv8n is attractive for edge deployment owing to its lightweight design, its baseline configuration can underperform on UAV datasets when high-frequency details are attenuated by successive convolution and pooling operations and when multi-scale feature aggregation is insufficiently discriminative. To strengthen representation learning while retaining computational efficiency, we introduce three complementary modifications: (i) a backbone replacement based on FasterNet to better preserve small-object structure, (ii) an inter-stage Triplet attention mechanism to model coupled channel-spatial dependencies, and (iii) a neck-level enhancement module that integrates receptive-field attention convolution (RFAConv) with the convolutional block attention module (CBAM) via a newly designed DARFCB-C2f block.

First, the original YOLOv8n backbone is replaced with FasterNet to improve the retention of fine details that are critical for detecting small targets. FasterNet employs partial convolution (PConv) together with pointwise convolution (PWConv) to balance representational capacity and computational cost. By convolving only a subset of channels, PConv removes redundant computation while preserving local structural information, and PWConv then mixes channels to sustain expressive feature interactions. Notably, the combination of PConv and PWConv yields a T-shaped effective receptive field that biases computation toward central feature-map regions. Because small objects typically occupy limited spatial extents, preserving and emphasizing these central responses mitigates the information loss induced by downsampling, so the revised backbone enhances small-object feature fidelity without substantially increasing model complexity.

Second, to accommodate the coexistence of multiple target categories and large scale disparities within a single aerial frame, Triplet attention is inserted between the backbone and the neck. Conventional attention modules often compute channel and spatial attention independently, which limits their ability to capture joint interactions across dimensions. Triplet attention instead models dependencies through three rotated branches that capture channel-height, channel-width, and height-width interactions. This triadic formulation enables richer coupling between channel and spatial representations and provides a more comprehensive description of salient object patterns across scales. In dense UAV scenes, where small and medium-sized objects appear simultaneously and compete for representational resources, the inserted attention improves feature selectivity and enhances robustness to scale variation. Minimal sketches of both building blocks follow.
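To make the partial-convolution idea concrete, the following is a minimal PyTorch-style sketch of a PConv block followed by pointwise channel mixing, written under FasterNet's split-and-concatenate formulation; the class names (PConv, FasterNetBlock), the split ratio n_div = 4, and the expansion factor are illustrative defaults rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """FasterNet-style partial convolution: a 3x3 conv touches only the first
    dim // n_div channels; the remaining channels pass through unchanged."""
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div
        self.dim_keep = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by two pointwise (1x1) convs for channel mixing; the
    spatial 3x3 on a channel subset plus the 1x1 mixing over all channels
    gives the T-shaped effective receptive field described above."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.pwconv = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the untouched channels' information intact.
        return x + self.pwconv(self.pconv(x))
```

With the default quarter split (n_div = 4), the 3x3 stage touches only a quarter of the channels, so its FLOPs drop to roughly 1/16 of a full convolution's, which is the efficiency argument behind FasterNet.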
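The three-branch design of Triplet attention can be sketched just as compactly. The version below follows the commonly published formulation (a Z-pool of max and mean responses, a 7x7 convolution, and a sigmoid gate per branch, with the three rotated outputs averaged); it is a reference sketch, not necessarily the exact module instantiated in this paper.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled responses along the channel axis."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat(
            (x.max(dim=1, keepdim=True)[0], x.mean(dim=1, keepdim=True)), dim=1
        )

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> sigmoid, producing a single-channel gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Apply the gate on three rotations of the tensor so that channel-height,
    channel-width, and height-width interactions are each modeled, then
    average the three branch outputs."""
    def __init__(self):
        super().__init__()
        self.gate_ch = AttentionGate()
        self.gate_cw = AttentionGate()
        self.gate_hw = AttentionGate()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rotation 1: swap C and H so the gate couples channel and height.
        x_ch = self.gate_ch(x.permute(0, 2, 1, 3).contiguous()).permute(0, 2, 1, 3)
        # Rotation 2: swap C and W so the gate couples channel and width.
        x_cw = self.gate_cw(x.permute(0, 3, 2, 1).contiguous()).permute(0, 3, 2, 1)
        # Identity branch: ordinary spatial (height-width) attention.
        x_hw = self.gate_hw(x)
        return (x_ch + x_cw + x_hw) / 3.0
```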
Third, we redesign the neck feature-fusion blocks by proposing a DARFCB structure and constructing a DARFCB-C2f module that replaces the standard C2f blocks of YOLOv8n. The design leverages RFAConv to enlarge the effective receptive field and perform attention-guided feature fusion, incorporating contextual cues without sacrificing boundary-sensitive information; such boundary cues are particularly important in UAV imagery, where object contours are frequently degraded by blur, low resolution, and background interference. In addition, CBAM is integrated to refine intermediate representations via channel and spatial attention, amplifying informative regions while suppressing irrelevant responses. Together, RFAConv and CBAM reinforce edge-related features and improve the reliability of multi-scale aggregation, leading to more stable detection under complex environmental conditions.
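For completeness, a minimal sketch of the standard CBAM refinement is given below. The exact wiring of RFAConv and CBAM inside the proposed DARFCB block is the paper's own design and is not reproduced here, so only the generic sequential channel-then-spatial attention of CBAM is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze with both average and max pooling, share one MLP, then gate."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(
            self.mlp(F.adaptive_avg_pool2d(x, 1))
            + self.mlp(F.adaptive_max_pool2d(x, 1))
        )
        return x * gate

class SpatialAttention(nn.Module):
    """Pool across channels, then a 7x7 conv yields a per-pixel gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat(
            (x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]), dim=1
        )
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Standard CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```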
      Results and Discussions Comprehensive experiments are conducted on the VisDrone2019 benchmark, which is widely used for UAV object detection and contains challenging scenarios including dense traffic, crowded pedestrian regions, substantial scale variation, and cluttered backgrounds. Ablation studies and module replacement analyses demonstrate that each proposed component contributes measurable gains, and their combination yields the strongest overall performance. Relative to the baseline YOLOv8n detector, the proposed method improves precision (P) by 4.9%, recall (R) by 3.9%, and mAP50 by 6.1%. These improvements reflect a meaningful reduction in both false positives and false negatives, indicating that the attention-fused design effectively enhances the discriminability and completeness of detection results in UAV scenes. To assess cross-domain robustness, additional evaluations are performed on the WiderPerson and PASCAL VOC datasets, which differ from VisDrone2019 in terms of scene composition, object density, and imaging characteristics. The proposed model maintains strong performance across these datasets, suggesting that the introduced backbone redesign, attention fusion strategy, and neck enhancement provide transferable benefits rather than dataset-specific overfitting. This generalization evidence supports the applicability of the approach to practical UAV-driven tasks such as aerial surveillance, traffic monitoring, and search-and-rescue operations.
      Conclusions In summary, this paper proposes an attention-fused YOLOv8n-based detector for UAV aerial imagery that integrates three synergistic improvements: a FasterNet backbone to preserve small-target details and alleviate downsampling-induced information loss, Triplet attention to capture richer multi-dimensional dependencies for multi-scale recognition, and a DARFCB-C2f neck module that combines RFAConv and CBAM to strengthen edge features and improve robustness in cluttered environments. Experimental results on VisDrone2019 and cross-dataset validations on WiderPerson and PASCAL VOC collectively confirm that the proposed modifications substantially improve detection accuracy while retaining the efficiency advantages of a lightweight one-stage detector.