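• Abstract: To address the low detection accuracy caused by complex backgrounds, widely varying target scales, and numerous small targets in remote sensing images, an improved RTDETR-based object detection algorithm for remote sensing images is proposed. First, the lightweight backbone network StarNet is introduced, which substantially reduces the parameter count and computational cost while improving the network's feature extraction capability. Second, a deformable attention mechanism (deformable attention, DAttention) is introduced into the attention-based intra-scale feature interaction (AIFI) module, improving the network's ability to interpret targets at different scales. Finally, a weighted bi-directional feature pyramid network (BiFPN) is adopted in the feature fusion stage to improve the network's feature fusion capability. Experimental results show that, on the DIOR dataset, the improved algorithm raises mAP50 and mAP50-95 by 1.8% and 1.7%, respectively, while reducing the parameter count by 37.7%. In addition, in generalization experiments on the RSOD dataset, the improved algorithm raises mAP50 and mAP50-95 by 1.6% and 0.5%, respectively, and reduces the parameter count by 37.9%. The improved algorithm increases detection accuracy while keeping the computational cost and parameter count low, and is better suited to object detection in remote sensing images with complex backgrounds.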

       

      Abstract:
      Objective Remote sensing images are formed by capturing the electromagnetic radiation reflected or emitted by ground objects, and they record the spatial distribution and semantic features of those objects. Object detection in remote sensing images has significant application value in fields such as environmental monitoring, military defense, and land resource management. However, because remote sensing images have complex backgrounds, large variations in target scale, and many small targets, existing detection algorithms remain deficient in feature extraction, multi-scale feature fusion, and control of model complexity, which makes it difficult to achieve high detection accuracy and computational efficiency at the same time. To address these issues, this paper analyzes the structure and performance characteristics of the RTDETR algorithm and, on that basis, proposes an improved RTDETR method for remote sensing image object detection. The aim is to strengthen the model's ability to detect multi-scale targets and to reduce model complexity while maintaining detection accuracy.
      Methods This paper proposes an improved RTDETR method for remote sensing image object detection that targets three problems: insufficient feature extraction, limited modeling of multi-scale objects, and high model complexity. First, the lightweight backbone network StarNet replaces the original ResNet18. StarNet maps input features to a high-dimensional nonlinear feature space through element-wise multiplication (the star operation), which yields richer feature representations and strengthens the model's ability to represent complex remote sensing scenes. Second, to improve the modeling of targets at different scales, a deformable attention mechanism replaces the standard attention mechanism in the AIFI module of the efficient hybrid encoder. Deformable attention combines sparse attention with dynamically learned offsets: by learning variable sampling positions, it allocates attention more flexibly and adapts to target regions at different spatial positions and scales. Third, a weighted bidirectional feature pyramid network module is introduced in the feature fusion stage of the efficient hybrid encoder. This module passes features both top-down and bottom-up, so that feature information from different levels interacts and fuses fully, which improves the model's use of multi-scale features and its detection of targets at different scales. To verify the effectiveness of the proposed method, ablation and comparative experiments were conducted on the DIOR remote sensing image dataset. In addition, to evaluate how well the algorithm generalizes to practical remote sensing scenarios, generalization experiments were conducted on the RSOD dataset to check that the improvements are not specific to a single dataset and to verify the robustness of the model.
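The weighted fusion used in the feature fusion stage can be illustrated with a minimal sketch. It assumes the fast normalized fusion rule from the original BiFPN design, in which each input feature map is scaled by a non-negative learnable weight and the weights are normalized by their sum. The abstract does not state which fusion rule the improved model uses, and the function name, the flat-list feature layout, and the `eps` value are illustrative assumptions.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-sized feature maps using normalized non-negative weights.

    features:    list of feature maps, each flattened to a list of floats
    raw_weights: one learnable scalar per feature map (hypothetical values here)
    eps:         small constant that avoids division by zero
    """
    # ReLU keeps every weight non-negative, so the result is a weighted average.
    weights = [max(w, 0.0) for w in raw_weights]
    total = sum(weights) + eps

    fused = [0.0] * len(features[0])
    for feat, w in zip(features, weights):
        for i, value in enumerate(feat):
            fused[i] += (w / total) * value
    return fused


# Toy example: a low-level map and a high-level map resized to the same shape.
low_level = [1.0, 1.0]
high_level = [3.0, 3.0]
fused = fast_normalized_fusion([low_level, high_level], raw_weights=[1.0, 3.0])
# Each fused value is close to 0.25 * 1.0 + 0.75 * 3.0 = 2.5 (slightly lower due to eps).
```

In a real network the weights would be trained parameters and the inputs would be convolutional feature tensors; the sketch only shows how the weighting lets the network decide how much each scale contributes.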
      Results and Discussions The experimental results show that the improved algorithm performs better than the original on the DIOR dataset. Precision and recall increased by 1.8% and 3.0%, respectively, and mAP50 and mAP50-95 increased by 1.8% and 1.7%, respectively. At the same time, the parameter count and computational cost of the model were reduced by 37.7% and 32.5%, respectively, indicating that the proposed method lowers model complexity substantially without sacrificing detection accuracy and achieves an effective balance between accuracy and efficiency. To further verify the generalization ability of the algorithm, generalization experiments were conducted on the RSOD dataset. There, the improved algorithm raised mAP50 and mAP50-95 by 1.6% and 0.5%, respectively, while reducing the parameter count by 37.9%. These results indicate that the method is robust and generalizes across different remote sensing datasets, which suggests that the improvements are not the result of overfitting to DIOR. In addition, the proposed algorithm was compared with several mainstream object detection algorithms on the DIOR dataset, where it achieved the highest mean average precision among the compared algorithms. The performance improvement is attributed mainly to the synergy between the deformable attention mechanism and the weighted bidirectional feature pyramid network. The deformable attention mechanism allocates attention flexibly and adaptively focuses on key regions at different scales, which strengthens the modeling of local details and multi-scale features. The weighted bidirectional feature pyramid network integrates low-level detail features with high-level semantic features through bidirectional information flow, which improves the representation of multi-scale features. In terms of model size, the improved algorithm has 12.4 M parameters and a computational cost of 38.5 GFLOPs, and its overall complexity is lower than that of the other compared algorithms. This is mainly due to the StarNet backbone network, whose star operation maps input features to a high-dimensional nonlinear feature space through element-wise multiplication. It generates rich feature representations while significantly reducing model parameters and computational overhead, which increases the practicality and deployment value of the algorithm in remote sensing image object detection tasks.
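The star operation mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the StarNet implementation: it treats the star operation as the element-wise product of two affine projections of the same input, and the helper names, weights, and input values are hypothetical.

```python
def linear(weights, bias, x):
    """Affine projection: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def star(w1, b1, w2, b2, x):
    """Element-wise product of two affine branches of the same input."""
    branch_a = linear(w1, b1, x)
    branch_b = linear(w2, b2, x)
    return [a * b for a, b in zip(branch_a, branch_b)]


# Toy setting: 2 input channels, 1 output channel, zero bias.
w1 = [[1.0, 2.0]]
w2 = [[3.0, 4.0]]
bias = [0.0]
x = [5.0, 6.0]

out = star(w1, bias, w2, bias, x)

# Expanding (x1 + 2*x2) * (3*x1 + 4*x2) gives 3*x1^2 + 10*x1*x2 + 8*x2^2,
# so a single multiplication already produces all pairwise channel products.
expanded = 3 * x[0] ** 2 + 10 * x[0] * x[1] + 8 * x[1] ** 2
print(out[0], expanded)  # both are 663.0
```

The expansion shows why the product behaves like a projection into a higher-dimensional feature space: with d input channels, one star operation implicitly combines on the order of d squared pairwise terms without adding parameters beyond the two projections.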
      Conclusions This paper addresses the challenge of balancing accuracy and efficiency in remote sensing image object detection by improving the RTDETR algorithm and proposing a detection method that accounts for both detection performance and model size. Introducing the StarNet backbone network, the deformable attention mechanism, and the weighted bidirectional feature pyramid network strengthens the model's ability to model and fuse features of complex scenes and multi-scale targets. Experimental results show that the method achieves good detection performance on the DIOR and RSOD remote sensing datasets while significantly reducing model complexity, which supports its effectiveness and generalization ability. This research provides a reference for deploying remote sensing object detection algorithms in practical applications and on resource-constrained platforms. Future work will explore more efficient feature modeling methods to broaden the algorithm's applicability in complex remote sensing scenes.