Citation: | Chen X, Peng D L, Gu Y. Real-time object detection for UAV images based on improved YOLOv5s[J]. Opto-Electron Eng, 2022, 49(3): 210372. doi: 10.12086/oee.2022.210372 |
Real-time object detection in unmanned aerial vehicle (UAV) scenarios has a wide range of military and civilian applications, including traffic monitoring and power line inspection. Because UAV images have complex backgrounds, high resolution, and large scale differences between targets, meeting the requirements of both detection accuracy and real-time performance is one of the key problems to be solved. This paper therefore proposes a balanced real-time detection algorithm based on YOLOv5s, named YOLOv5sm+. First, the influence of network width and depth on UAV image detection performance was analyzed. Experimental results on the VisDrone dataset show that, because UAV targets yield fewer internal feature maps, detection performance improves more with model depth than with model width; moreover, as model depth grows, the richer semantic information can also improve detection accuracy in generic object detection scenarios. An improved shallow network based on YOLOv5s, named YOLOv5sm, was proposed to improve the detection accuracy of major targets in UAV images by making better use of spatial features through a residual dilated convolution module that enlarges the receptive field. Then, a cross-stage attention feature fusion module (SCAM) was designed, which improves the utilization of detailed information through local feature self-supervision and improves the classification accuracy of medium and large targets through effective feature fusion. Finally, a detection head with decoupled regression and classification branches was proposed to further improve classification accuracy: the first stage completes the regression task, and the second stage uses a cross-stage convolution module to assist the classification task. This alleviates the conflict between regression and classification and improves fine-grained classification accuracy.
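As a rough illustration (not the paper's implementation), the receptive-field gain from dilated convolution can be computed with the standard recurrence: a k×k convolution with dilation d behaves like a kernel of effective size d·(k−1)+1, and stacking layers compounds the effect.

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective kernel size of a k x k convolution with dilation d."""
    return d * (k - 1) + 1

def receptive_field(layers) -> int:
    """Receptive field of a stack of convolutions, each given as a
    (kernel, dilation, stride) tuple, using the standard recurrence:
    rf += (k_eff - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, d, s in layers:
        rf += (effective_kernel(k, d) - 1) * jump
        jump *= s
    return rf

# Two plain 3x3 convolutions cover a 5x5 region...
print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # 5
# ...while two 3x3 convolutions with dilation 2 cover 9x9,
# with no extra parameters.
print(receptive_field([(3, 2, 1), (3, 2, 1)]))  # 9
```

This is why a residual dilated convolution module can enlarge the receptive field at no parameter cost; the skip connection then preserves the original fine-grained spatial features alongside the enlarged-context branch.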
Combining the balanced lightweight feature extraction network (YOLOv5sm), the cross-stage attention feature fusion module (SCAM), and the improved detection head yields the proposed YOLOv5sm+ algorithm. Experimental results on the VisDrone dataset show that the mean average precision at an intersection-over-union threshold of 0.5 (mAP50) of the proposed YOLOv5sm+ model reaches 60.6%, an increase of 4.1% over the YOLOv5s model, while YOLOv5sm+ also achieves a higher detection speed. A transfer experiment on the DIOR remote sensing dataset further verifies the effectiveness of the proposed model. The improved model has a low false alarm rate and a high recognition rate under overlapping conditions, and is well suited to object detection in UAV images.
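For reference, mAP50 counts a prediction as a true positive when its intersection over union (IoU) with a ground-truth box is at least 0.5. A minimal IoU sketch (a hypothetical helper, not the paper's evaluation code):

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Overlap rectangle (clamped to zero width/height if disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction matching a ground-truth box with IoU >= 0.5
# is scored as a true positive under the mAP50 metric.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # half-overlapping boxes: 1/3
```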
YOLOv5 backbone network architecture diagram
Structure diagram of feature fusion module
(a) Res-DConv module; (b) Receptive field mapping
Improved module structure
YOLOv5sm+ model architecture
(a) Total number of category instances on the VisDrone dataset; (b) Classes confusion matrix of YOLOv5m algorithm
Detection examples of different algorithms in the VisDrone UAV scene
Comparison of the detection results of three algorithms in dense vehicle scenes
Detection comparison of the improved algorithm on the DIOR dataset