Abstract:
Objective Unmanned aerial vehicle (UAV) imagery poses enduring challenges for object detection, largely because targets are frequently small, densely distributed, and captured under heterogeneous imaging conditions. In contrast to ground-level scenes, aerial views often exhibit pronounced scale variation, viewpoint changes, partial occlusions, and severe background clutter arising from man-made structures, vegetation, road textures, and illumination-induced shadows. In addition, motion blur and sensor noise are common due to platform dynamics and altitude-dependent resolution constraints. Collectively, these factors exacerbate the loss of fine-grained cues during feature downsampling and increase both missed detections and false alarms, particularly when small objects occupy only a few pixels and their boundaries are confounded with background patterns.
Methods This work presents an attention-fused object detection approach built upon an improved YOLOv8n framework to alleviate the above limitations. Although YOLOv8n is attractive for edge deployment owing to its lightweight design, its baseline configuration can underperform on UAV datasets when high-frequency details are attenuated by successive convolution and pooling operations and when multi-scale feature aggregation is insufficiently discriminative. To strengthen representation learning while retaining computational efficiency, we introduce three complementary modifications: (i) a FasterNet-based backbone replacement that better preserves small-object structure, (ii) a Triplet attention mechanism inserted between the backbone and the neck to model coupled channel-spatial dependencies, and (iii) a neck-level enhancement that integrates receptive-field attention convolution (RFAConv) with the convolutional block attention module (CBAM) through a newly designed DARFCB-C2f block.

First, the original YOLOv8n backbone is replaced with FasterNet to improve the retention of fine details that are critical for detecting small targets. FasterNet employs partial convolution (PConv) in conjunction with pointwise convolution (PWConv) to balance representational capacity and computational cost. By convolving only a subset of channels, PConv removes redundant computation while preserving local structural information, and the subsequent PWConv mixes channels to sustain expressive feature interactions. Notably, the combination of PConv and PWConv yields an effective receptive field resembling a T-shaped convolution, which concentrates on the central region of each feature patch. Since small objects typically occupy limited spatial extents, emphasizing these central responses helps mitigate the information loss induced by downsampling, so the revised backbone enhances small-object feature fidelity without substantially increasing model complexity.
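To make the PConv/PWConv interplay concrete, the following PyTorch sketch implements a FasterNet-style block. It is a minimal illustration under assumed defaults (a 1/4 partial ratio and a 2x pointwise expansion); class names such as `PConv` and `FasterNetBlock` are ours, not the authors' released code.

```python
# Minimal FasterNet-style block: PConv convolves only part of the
# channels, then two pointwise convs mix information across all of them.
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv over the first dim // ratio
    channels; the remaining channels pass through untouched."""
    def __init__(self, dim: int, ratio: int = 4):  # ratio = 4 is an assumed default
        super().__init__()
        self.dim_conv = dim // ratio
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by two pointwise (1x1) convs that mix channels,
    with a residual connection, in the spirit of the FasterNet design."""
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.pwconv = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pwconv(self.pconv(x))

x = torch.randn(1, 64, 80, 80)      # e.g. a mid-level feature map
print(FasterNetBlock(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

Only `dim // 4` channels pass through the 3x3 convolution, which is where the FLOP savings come from; the 1x1 convolutions then restore cross-channel interaction over the full feature map.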
Second, to better accommodate the coexistence of multiple target categories and large scale disparities within a single aerial frame, Triplet attention is inserted between the backbone and the neck. Conventional attention modules often compute channel and spatial attention independently, which limits their ability to capture joint interactions across dimensions. Triplet attention instead models dependencies through three coordinated branches: a channel-height branch, a channel-width branch, and a purely spatial height-width branch. This triadic formulation enables richer coupling between channel and spatial representations, and thereby a more comprehensive description of salient object patterns across scales. In dense UAV scenes, where small and medium-sized objects appear simultaneously and compete for representational resources, this attention insertion improves feature selectivity and enhances robustness to scale variation.

Third, we redesign the neck feature-fusion blocks by proposing a DARFCB structure and constructing a DARFCB-C2f module that replaces the standard C2f blocks of YOLOv8n. The design leverages RFAConv to expand the effective receptive field and to perform attention-guided feature fusion, incorporating contextual cues without sacrificing boundary-sensitive information; such boundary cues are particularly important in UAV imagery, where object contours are frequently degraded by blur, low resolution, and background interference. In addition, CBAM refines intermediate representations via sequential channel and spatial attention, amplifying informative regions while suppressing irrelevant responses. Together, RFAConv and CBAM reinforce edge-related features and improve the reliability of multi-scale aggregation, leading to more stable detection under complex environmental conditions.
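The triadic branch structure of Triplet attention is easiest to see in code. The sketch below follows the published Triplet attention formulation (a Z-pool followed by a 7x7 convolutional gate in each permuted view); it is an illustrative rendering, not necessarily the exact module configuration used in this work.

```python
# Triplet attention: three permuted views let (C,H), (C,W) and (H,W)
# pairs interact inside an identical Z-pool + conv gate.
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled maps along the channel axis."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> sigmoid, producing a one-channel gate."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Average of three gated views: channel-height, channel-width,
    and the ordinary spatial (height-width) view."""
    def __init__(self):
        super().__init__()
        self.gate_ch = AttentionGate()  # channel-height branch
        self.gate_cw = AttentionGate()  # channel-width branch
        self.gate_hw = AttentionGate()  # spatial branch

    def forward(self, x):
        # (B, C, H, W) -> (B, H, C, W): C and H interact in the gate.
        y1 = self.gate_ch(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # (B, C, H, W) -> (B, W, H, C): C and W interact in the gate.
        y2 = self.gate_cw(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        y3 = self.gate_hw(x)  # plain height-width attention
        return (y1 + y2 + y3) / 3.0
```

Because each permutation is its own inverse, the branches can share an identical gate structure while attending over different dimension pairs.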
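For completeness, a minimal CBAM sketch is shown below, since it is one ingredient of the DARFCB-C2f fusion. The reduction ratio of 16 and the 7x7 spatial kernel are common defaults rather than values reported here, and the abstract does not detail how CBAM is wired together with RFAConv inside DARFCB, so no attempt is made to reproduce that composite block.

```python
# Standard CBAM: channel attention (shared MLP over avg- and max-pooled
# descriptors) followed by spatial attention (7x7 conv over pooled maps).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))  # (B, C, 1, 1)
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    """Sequential channel-then-spatial refinement, as in the original CBAM."""
    def __init__(self, dim: int):
        super().__init__()
        self.ca = ChannelAttention(dim)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```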
Results and Discussions Comprehensive experiments are conducted on the VisDrone2019 benchmark, which is widely used for UAV object detection and contains challenging scenarios including dense traffic, crowded pedestrian regions, substantial scale variation, and cluttered backgrounds. Ablation studies and module-replacement analyses demonstrate that each proposed component contributes measurable gains and that their combination yields the strongest overall performance. Relative to the baseline YOLOv8n detector, the proposed method improves precision (P) by 4.9%, recall (R) by 3.9%, and mAP50 by 6.1%. These improvements reflect a meaningful reduction in both false positives and false negatives, indicating that the attention-fused design enhances the discriminability and completeness of detections in UAV scenes. To assess cross-domain robustness, additional evaluations are performed on the WiderPerson and PASCAL VOC datasets, which differ from VisDrone2019 in scene composition, object density, and imaging characteristics. The proposed model maintains strong performance across these datasets, suggesting that the backbone redesign, attention-fusion strategy, and neck enhancement provide transferable benefits rather than reflecting dataset-specific overfitting. This generalization evidence supports the applicability of the approach to practical UAV-driven tasks such as aerial surveillance, traffic monitoring, and search-and-rescue operations.
Conclusions In summary, this paper proposes an attention-fused YOLOv8n-based detector for UAV aerial imagery that integrates three synergistic improvements: a FasterNet backbone that preserves small-target details and alleviates downsampling-induced information loss, Triplet attention that captures richer multi-dimensional dependencies for multi-scale recognition, and a DARFCB-C2f neck module that combines RFAConv and CBAM to strengthen edge features and improve robustness in cluttered environments. Experimental results on VisDrone2019, together with cross-dataset validation on WiderPerson and PASCAL VOC, confirm that the proposed modifications substantially improve detection accuracy while retaining the efficiency advantages of a lightweight one-stage detector.