References:
[1] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014: 580-587.
[2] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Trans Pattern Anal Mach Intell, 2015, 37(9): 1904-1916. doi: 10.1109/TPAMI.2015.2389824
[3] Girshick R. Fast R-CNN[C]//Proceedings of 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 1440-1448.
[4] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Trans Pattern Anal Mach Intell, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[5] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016: 779-788.
[6] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 7263-7271.
[7] Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 21-37.
[8] Redmon J, Farhadi A. YOLOv3: an incremental improvement[Z]. arXiv: 1804.02767, 2018.
[9] Qi C R, Su H, Mo K C, et al. PointNet: deep learning on point sets for 3D classification and segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 652-660.
[10] Qi C R, Yi L, Su H, et al. PointNet++: deep hierarchical feature learning on point sets in a metric space[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017: 5099-5108.
[11] Zhou Y, Tuzel O. VoxelNet: end-to-end learning for point cloud based 3D object detection[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018: 4490-4499.
[12] Simon M, Milz S, Amende K, et al. Complex-YOLO: real-time 3D object detection on point clouds[Z]. arXiv: 1803.06199, 2018.
[13] Beltrán J, Guindel C, Moreno F M, et al. BirdNet: a 3D object detection framework from LiDAR information[C]//Proceedings of 2018 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 2018: 3517-3523.
[14] Minemura K, Liau H, Monrroy A, et al. LMNet: real-time multiclass object detection on CPU using 3D LiDAR[C]//Proceedings of 2018 3rd Asia-Pacific Conference on Intelligent Robot Systems, Singapore, 2018: 28-34.
[15] Li B, Zhang T L, Xia T. Vehicle detection from 3D lidar using fully convolutional network[Z]. arXiv: 1608.07916, 2016.
[16] Qi C R, Liu W, Wu C X, et al. Frustum PointNets for 3D object detection from RGB-D data[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018: 918-927.
[17] Du X X, Ang M H, Karaman S, et al. A general pipeline for 3D detection of vehicles[C]//Proceedings of 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 2018: 3194-3200.
[18] Du X X, Ang M H, Rus D. Car detection for autonomous vehicle: LIDAR and vision fusion approach through deep learning framework[C]//Proceedings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vancouver, BC, Canada, 2017: 749-754.
[19] Chen X Z, Ma H M, Wan J, et al. Multi-view 3D object detection network for autonomous driving[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 1907-1915.
[20] Ku J, Mozifian M, Lee J, et al. Joint 3D proposal generation and object detection from view aggregation[C]//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 2018: 1-8.
[21] Tan M X, Le Q V. EfficientNet: rethinking model scaling for convolutional neural networks[Z]. arXiv: 1905.11946, 2020.
[22] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 2117-2125.
[23] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016: 770-778.
[24] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017: 4700-4708.
[25] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016: 2818-2826.
[26] Huang Y P, Cheng Y L, Bapna A, et al. GPipe: efficient training of giant neural networks using pipeline parallelism[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, 2019: 103-112.
[27] Chang X, Chen X D, Zhang J C, et al. An object detection and tracking algorithm based on LiDAR and camera information fusion[J]. Opto-Electron Eng, 2019, 46(7): 180420. doi: 10.12086/oee.2019.180420
[28] Chattopadhay A, Sarkar A, Howlader P, et al. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks[C]//Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 2018: 839-847.
Overview: Autonomous vehicles are equipped with cameras and light detection and ranging (LiDAR) sensors for obstacle detection. Cameras provide dense texture and color information, but as passive sensors they are vulnerable to variations in ambient lighting. LiDAR provides accurate spatial information and is unaffected by season and lighting conditions, but its point clouds are sparse and non-uniformly distributed. A single sensor therefore cannot deliver a satisfactory solution to the environment perception task. Fusing the two sensors exploits both data modalities and improves robustness and accuracy. In recent years, convolutional neural networks (CNNs) have achieved great success in vision-based object detection and also provide an efficient tool for multi-modal data fusion. This paper proposes a novel CNN-based multi-modal fusion method for object detection. The depth information provided by LiDAR is used as an additional feature to train the CNN. The disordered, sparse point cloud is projected onto the image plane to generate a depth map, which is then densified by a depth completion algorithm. The dense depth map and the RGB image are taken as the input of the network. The input data are also processed with a sliding window to reduce the information loss caused by resolution mismatch and an inappropriate aspect ratio. We adopt EfficientNet-B2 as the backbone network for feature extraction, data fusion, and detection. The network extracts features from the RGB image and the depth map separately and then fuses the feature maps through a connection layer. After a 1×1 convolution, the detection network uses a feature pyramid to generate feature maps at three scales and estimates objects through position regression and object classification. Non-maximum suppression is applied to refine the detection results across all sliding windows. The network output contains the location, class, confidence, and distance of each target. Experiments were conducted on the KITTI benchmark dataset using a workstation equipped with a 4-core processor and an NVIDIA GTX 1080 Ti GPU with 11 GB of memory, under the PyTorch framework. Quantitative analysis of single-frame inference time and mean average precision (mAP) for detection methods using different data modalities shows that our method achieves a balance between detection accuracy and detection speed. Qualitative analysis of the detection performance in various scenarios shows that the proposed method is robust under different illumination conditions and is especially effective at detecting small objects.
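As a minimal sketch of the preprocessing step described above (not the authors' implementation), the following Python/NumPy snippet projects a LiDAR point cloud onto the image plane to form a sparse depth map. The calibration inputs `T_cam_from_lidar` (a 3×4 or 4×4 rigid transform from the LiDAR frame to the camera frame) and `K` (the 3×3 camera intrinsic matrix) are assumed to be available, e.g. from the KITTI calibration files; all names here are illustrative.

```python
import numpy as np

def lidar_to_depth_map(points, T_cam_from_lidar, K, image_size):
    """Project LiDAR points (N, 3) into an HxW depth map; pixels with no point stay 0."""
    h, w = image_size
    # LiDAR xyz -> homogeneous coordinates -> camera frame
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]              # (N, 3)
    # Keep only points in front of the camera
    pts_cam = pts_cam[pts_cam[:, 2] > 0]
    # Perspective projection with the intrinsic matrix K
    uvw = (K @ pts_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]
    # Discard projections that fall outside the image
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[ok], v[ok], z[ok]
    depth = np.zeros((h, w), dtype=np.float32)
    # If several points hit the same pixel, keep the nearest one:
    # write farthest first so nearer points overwrite them.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth

# Hypothetical usage with a KITTI-sized image:
# depth = lidar_to_depth_map(scan_xyz, T_cam_from_lidar, K, image_size=(375, 1242))
```

The map produced this way is still sparse; as stated in the overview, a depth completion step (for example, simple morphological dilation and hole filling) would densify it before it is stacked with the RGB image as network input.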
Framework of the proposed method
EfficientNet-B2 architecture diagram
Feature fusion layer
Structure of the object detector
Transformation of coordinates
Projection of LiDAR point cloud on the image plane
The formation of the dense depth map
Methods of data loading. (a) Resizing and padding; (b) Sliding windows
Comparison of depth maps. (a) Sparse depth map; (b) Dense depth map
The feature extraction comparison of EfficientNet-B2 and DarkNet-53
Detection results in different scenarios
Comparison of the P-R curve between our method and YOLOv3. (a) Our method; (b) YOLOv3