Abstract:
Objective Small object detection remains a significant and persistent challenge within the computer vision domain. This challenge is particularly acute in the context of drone aerial imagery, where objects captured from high altitudes appear extremely small, densely packed, and often exhibit blurred features due to low resolution and environmental factors. Traditional object detection algorithms frequently suffer from performance degradation in these scenarios, manifesting as high rates of missed detections and false positives. To address these critical limitations, this paper proposes a novel object detection algorithm named FFC-YOLO (FASFF-FreqFusion-CAA YOLO), which is built upon a frequency-aware feature fusion framework.
Methods The core of the FFC-YOLO algorithm lies in three systematic and synergistic innovations designed to enhance feature representation, fusion, and contextual understanding for small, densely packed objects. First, to strengthen multi-scale feature extraction, we redesign the detection head by incorporating a Four Adaptively Spatial Feature Fusion (Detect-FASFF) structure. This is augmented with an additional P2 detection layer dedicated to capturing the finer-grained spatial information crucial for identifying minuscule targets. The FASFF mechanism leverages four adaptive branches to dynamically aggregate features across scales, markedly improving the model's sensitivity and discriminative power for small objects of varying sizes (a minimal sketch of this weighting scheme follows after this passage).

Second, we optimize the feature fusion pathway to counteract the information loss introduced by conventional sampling operations (e.g., up-sampling and down-sampling). We introduce a Frequency-aware Feature Fusion (FreqFusion) module, which decomposes features into different frequency components, allowing high-frequency details (essential for the edges and textures of small objects) and low-frequency semantics to be integrated deliberately and effectively (see the second sketch below). The FreqFusion module is then coupled with a Bidirectional Feature Pyramid Network (BiFPN) structure. The resulting fusion architecture enables efficient, multi-level, bidirectional cross-scale information flow, mitigating feature obliteration, category misclassification, and bounding-box localization drift in scenes crowded with small objects.
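To make the adaptive weighting concrete, the following is a minimal PyTorch sketch of a four-branch spatially adaptive fusion step in the spirit of Detect-FASFF. The class and layer names are ours, and we assume the four input maps have already been resampled to a common resolution and channel width; the paper's actual head may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourWayASFF(nn.Module):
    """Illustrative four-branch adaptive spatial feature fusion (our naming).

    Each input feature map contributes a learned per-pixel weight; the
    weights are normalized with softmax so the fused map is a convex
    combination of the four scale-aligned inputs.
    """
    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per branch produces a single-channel weight map.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(4)]
        )

    def forward(self, feats):  # feats: list of 4 tensors with identical shape
        # Stack per-branch logits -> (B, 4, H, W), softmax over the branch axis.
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1
        )
        weights = F.softmax(logits, dim=1)
        # Weighted sum of the four aligned feature maps.
        return sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))

# Usage: four scale-aligned maps (e.g., P2..P5 resampled to one grid).
x = [torch.randn(1, 64, 80, 80) for _ in range(4)]
print(FourWayASFF(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```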
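The frequency-aware fusion idea can likewise be approximated with a simple low-/high-pass split: a blur yields the low-frequency band and the residual keeps the high-frequency band, after which the deep map contributes semantics and the shallow map contributes detail. This toy sketch is our simplified reading, not the exact FreqFusion operator (which relies on learned, content-adaptive filters); all names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def freq_split(x, blur_kernel=5):
    """Split a feature map into low- and high-frequency parts.

    A box blur approximates the low-pass filter; the residual retains
    edges and fine texture (the high-frequency band).
    """
    pad = blur_kernel // 2
    low = F.avg_pool2d(x, blur_kernel, stride=1, padding=pad)
    return low, x - low

class NaiveFreqFusion(nn.Module):
    """Toy frequency-aware fusion of a shallow (hi-res) and deep (lo-res) map."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, shallow, deep):
        # Upsample the deep, semantic map to the shallow map's grid.
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                             align_corners=False)
        low_d, _ = freq_split(deep)      # keep low-frequency semantics
        _, high_s = freq_split(shallow)  # keep high-frequency detail
        return self.proj(torch.cat([low_d + high_s, shallow], dim=1))

p3 = torch.randn(1, 64, 80, 80)   # shallow, high-resolution level
p4 = torch.randn(1, 64, 40, 40)   # deep, low-resolution level
print(NaiveFreqFusion(64)(p3, p4).shape)  # torch.Size([1, 64, 80, 80])
```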
Third, to enhance the model's ability to exploit surrounding semantic context (a vital cue when object appearance is ambiguous), we integrate a Context Anchor Attention (CAA) mechanism into the C2PSA module. CAA establishes dynamic associations between anchor points and their contextual surroundings, so that feature extraction selectively focuses on and amplifies informative regions around potential targets; a compact sketch of such a block is given below. This contextual awareness strengthens feature representation and helps distinguish small objects from complex backgrounds.
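For intuition, here is a CAA-style gate: local average pooling gathers context around each anchor position, depthwise horizontal and vertical strip convolutions cheaply widen the receptive field, and a sigmoid gate reweights the input features. Kernel sizes and layer names are illustrative assumptions rather than the exact configuration used inside C2PSA.

```python
import torch
import torch.nn as nn

class ContextAnchorAttention(nn.Module):
    """Sketch of a CAA-style gate: pool local context, expand it with
    depthwise strip convolutions, then reweight the input features."""
    def __init__(self, channels: int, band_kernel: int = 11):
        super().__init__()
        pad = band_kernel // 2
        self.pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)
        # Depthwise 1xK and Kx1 convs cover a wide context band cheaply.
        self.h_conv = nn.Conv2d(channels, channels, (1, band_kernel),
                                padding=(0, pad), groups=channels)
        self.v_conv = nn.Conv2d(channels, channels, (band_kernel, 1),
                                padding=(pad, 0), groups=channels)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        ctx = self.conv_out(self.v_conv(self.h_conv(self.conv_in(self.pool(x)))))
        return self.gate(ctx) * x  # amplify informative regions, damp the rest

x = torch.randn(1, 128, 40, 40)
print(ContextAnchorAttention(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```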
Results and Discussion The efficacy of the proposed FFC-YOLO algorithm is validated through extensive experiments on two benchmark datasets. On the public VisDrone2019 dataset, a standard benchmark for drone vision, FFC-YOLO achieves a mean Average Precision (mAP@0.5) of 40.0%, an absolute gain of 8.0, 7.6, and 7.8 percentage points over the widely used YOLOv8n, YOLOv10n, and YOLOv11n baselines, respectively. To further verify its generalization capability on extremely small targets, we conducted additional experiments on a custom-built, challenging dataset named `tiny-data`, which contains three person categories at progressively smaller scales (`sperson`, `lperson`, `wperson`). On this dataset, FFC-YOLO surpasses the strong YOLOv11n baseline by 9.2 percentage points in mAP@0.5, 8.7 in Precision (P), and 5.9 in Recall (R). These consistent gains across both a public and a custom dataset confirm that FFC-YOLO generalizes well and excels at the demanding task of drone-based aerial image analysis, particularly long-range, dense small-object detection.
Conclusions This paper presents a comprehensive and effective solution to the small object detection problem in aerial imagery. By innovating at the levels of feature extraction (the FASFF head), feature fusion (FreqFusion-BiFPN), and contextual modeling (CAA attention), the FFC-YOLO framework addresses key shortcomings of prior methods. The experimental evidence underscores its potential for practical application in UAV-based surveillance, traffic monitoring, and search-and-rescue operations. In future work, we plan to explore model compression techniques such as pruning and knowledge distillation to streamline the model for deployment on resource-constrained edge devices, bridging the gap between high accuracy and real-time performance.