Citation: | Mei T, Zhao J W, Lin S L, et al. Anchor-free instance segmentation algorithm based on YOLACTR[J]. Opto-Electron Eng, 2025, 52(5): 240265. doi: 10.12086/oee.2025.240265 |
[1] | Zhou T, Zhao Y N, Lu H L, et al. Medical image instance segmentation: from candidate region to no candidate region[J]. J Biomed Eng, 2022, 39(6): 1218−1232. doi: 10.7507/1001-5515.202201034 |
[2] | Pei S W, Ni B, Shen T M, et al. RISAT: real-time instance segmentation with adversarial training[J]. Multimed Tools Appl, 2023, 82(3): 4063−4080. doi: 10.1007/s11042-022-13447-1 |
[3] | Hong S L, Jiang Z H, Liu L Z, et al. Improved mask R-CNN combined with Otsu preprocessing for rice panicle detection and segmentation[J]. Appl Sci, 2022, 12(22): 11701. doi: 10.3390/app122211701 |
[4] | Wu M J, Zhang Y A, Lin S L, et al. Real-time semantic segmentation algorithm based on BiLevelNet[J]. Opto-Electron Eng, 2024, 51(5): 240030. doi: 10.12086/oee.2024.240030 |
[5] | Su L, Sun Y X, Yuan S Z. A survey of instance segmentation research based on deep learning[J]. CAAI Trans Intell Syst, 2021, 17(1): 16−31. doi: 10.11992/tis.202109043 |
[6] | Zhang J K, Zhao J, Zhang R, et al. Survey of image instance segmentation methods using deep learning[J]. J Chin Comput Syst, 2021, 42(1): 161−171. doi: 10.3969/j.issn.1000-1220.2021.01.028 |
[7] | Minaee S, Boykov Y, Porikli F, et al. Image segmentation using deep learning: a survey[J]. IEEE Trans Pattern Anal Mach Intell, 2022, 44(7): 3523−3542. doi: 10.1109/TPAMI.2021.3059968 |
[8] | He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision, 2017: 2980–2988. https://doi.org/10.1109/ICCV.2017.322. |
[9] | Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]//Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015: 91–99. |
[10] | Cai Z W, Vasconcelos N. Cascade R-CNN: delving into high quality object detection[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 6154–6162. https://doi.org/10.1109/CVPR.2018.00644. |
[11] | Xiao Z J, Tian H, Zhang J H, et al. Fusion of dynamic features enhances remote sensing building segmentation[J]. Opto-Electron Eng, 2025, 52(3): 240231. doi: 10.12086/oee.2025.240231 |
[12] | Chen K, Pang J M, Wang J Q, et al. Hybrid task cascade for instance segmentation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 4969–4978. https://doi.org/10.1109/CVPR.2019.00511. |
[13] | Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 779–788. https://doi.org/10.1109/CVPR.2016.91. |
[14] | Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580–587. https://doi.org/10.1109/CVPR.2014.81. |
[15] | Tian Z, Shen C H, Chen H, et al. FCOS: fully convolutional one-stage object detection[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019: 9626–9635. https://doi.org/10.1109/ICCV.2019.00972. |
[16] | Zhou X Y, Wang D Q, Krähenbühl P. Objects as points[Z]. arXiv: 1904.07850, 2019. https://arxiv.org/abs/1904.07850. |
[17] | Wang X L, Kong T, Shen C H, et al. SOLO: segmenting objects by locations[C]//16th European Conference on Computer Vision, 2020: 649–665. https://doi.org/10.1007/978-3-030-58523-5_38. |
[18] | He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778. https://doi.org/10.1109/CVPR.2016.90. |
[19] | Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017: 936–944. https://doi.org/10.1109/CVPR.2017.106. |
[20] | Liu T, Liu H Z, Li X W, et al. Improved instance segmentation method based on anchor-free segmentation network[J]. Comput Eng, 2022, 48(9): 239−247,253. doi: 10.19678/j.issn.1000-3428.0062846 |
[21] | Kirillov A, Wu Y X, He K M, et al. PointRend: image segmentation as rendering[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 9796–9805. https://doi.org/10.1109/CVPR42600.2020.00982. |
[22] | Yang S S, Wang X G, Li Y, et al. Temporally efficient vision transformer for video instance segmentation[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 2875–2885. https://doi.org/10.1109/CVPR52688.2022.00290. |
[23] | Cheng B W, Misra I, Schwing A G, et al. Masked-attention mask transformer for universal image segmentation[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 1280–1289. https://doi.org/10.1109/CVPR52688.2022.00135. |
[24] | Bolya D, Zhou C, Xiao F Y, et al. YOLACT: real-time instance segmentation[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019: 9156–9165. https://doi.org/10.1109/ICCV.2019.00925. |
[25] | Zhao J W, Lin S L, Mei T, et al. Research on instance segmentation algorithm based on YOLACT and Transformer[J]. Semicond Optoelectron, 2023, 44(1): 134−140. doi: 10.16818/j.issn1001-5868.2022110201 |
[26] | Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3213–3223. https://doi.org/10.1109/CVPR.2016.350. |
[27] | Cordts M, Omran M, Ramos S, et al. The cityscapes dataset[C]//CVPR Workshop on the Future of Datasets in Vision, 2015: 1. https://doi.org/10.48550/arXiv.1604.01685. |
[28] | Xie E Z, Sun P Z, Song X G, et al. PolarMask: single shot instance segmentation with polar representation[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 12190–12199. https://doi.org/10.1109/CVPR42600.2020.01221. |
This paper proposes an anchor-free instance segmentation algorithm based on YOLACTR to address the limitations of the single-stage YOLACT algorithm in instance segmentation tasks. The traditional YOLACT algorithm relies on bounding-box detection, which localizes regions of interest imprecisely and struggles to distinguish overlapping instances, constraining detection accuracy. This work decouples mask generation into parallel feature-learning and convolution-kernel-learning tasks, abandoning bounding-box detection in favor of a more natural mask representation.
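To make the decoupling concrete, the sketch below shows how per-instance masks can be produced from a shared mask feature map and predicted dynamic convolution kernels. This is a minimal illustration of the general dynamic-kernel idea, not the authors' implementation; the tensor shapes and names are assumptions.

```python
import torch

def dynamic_mask_head(mask_features: torch.Tensor,
                      kernels: torch.Tensor) -> torch.Tensor:
    """Apply per-instance dynamic 1x1 kernels to a shared feature map.

    mask_features: (C, H, W) output of the mask feature branch.
    kernels:       (N, C) one predicted kernel per instance.
    Returns (N, H, W) mask probabilities, one map per instance.
    """
    # Each instance mask is a weighted combination of the C feature
    # channels, with weights given by that instance's dynamic kernel.
    logits = torch.einsum('nc,chw->nhw', kernels, mask_features)
    return logits.sigmoid()

# Toy usage with random tensors standing in for network outputs.
feats = torch.randn(32, 136, 136)    # shared mask features (assumed size)
kernels = torch.randn(10, 32)        # dynamic kernels for 10 instances
masks = dynamic_mask_head(feats, kernels)
print(masks.shape)                   # torch.Size([10, 136, 136])
```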
In the implementation, random positional embedding is applied to enhance the position sensitivity of the feature maps, and a six-layer Transformer processes the spatial information while simultaneously generating dynamic convolution kernels and category information. The feature aggregation network fuses low-level features from the feature pyramid with high-level features from the prediction network, and a channel-spatial (CS) attention module further refines the feature representation. For the loss function, the method combines focal loss for the classification task with dice loss for mask generation, as sketched below.
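The following PyTorch sketch shows one standard way to realize the focal-plus-dice combination described above. The focal-loss hyperparameters (alpha = 0.25, gamma = 2.0) and the equal weighting of the two terms are assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss for the classification branch (targets are 0/1)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = logits.sigmoid()
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Down-weight easy examples so training focuses on hard ones.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(mask_logits: torch.Tensor, gt_masks: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Dice loss for the mask branch; directly measures mask overlap."""
    p = mask_logits.sigmoid().flatten(1)               # (N, H*W)
    g = gt_masks.flatten(1)
    inter = (p * g).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + g.sum(-1) + eps)).mean()

# Equal weighting here is an assumption; the paper may balance differently.
def total_loss(cls_logits, cls_targets, mask_logits, gt_masks):
    return focal_loss(cls_logits, cls_targets) + dice_loss(mask_logits, gt_masks)
```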
The network architecture consists of four primary components: a multi-scale feature generation network built on ResNet and a feature pyramid network; a mask generation network combining a Transformer with feature aggregation; a prediction network that incorporates positional information to generate dynamic convolution kernels; and auxiliary network structures that enhance overall performance. This design handles spatial relationships and instance boundaries more effectively than traditional anchor-based approaches.
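The toy skeleton below mirrors the four-component data flow end to end so the tensor shapes can be traced. Every module body is a deliberate placeholder (single convolutions and linear layers standing in for ResNet+FPN, the Transformer mask branch, and the prediction heads), so it illustrates the wiring of the architecture, not the paper's actual layers.

```python
import torch
import torch.nn as nn

class ToyYOLACTR(nn.Module):
    """Runnable stand-in for the four-part architecture described above."""

    def __init__(self, channels: int = 32, num_classes: int = 80,
                 num_instances: int = 10):
        super().__init__()
        self.channels, self.num_instances = channels, num_instances
        self.num_classes = num_classes
        # (1) Multi-scale feature generation (placeholder for ResNet + FPN).
        self.backbone = nn.Conv2d(3, channels, 3, stride=4, padding=1)
        # (2) Mask generation (placeholder for Transformer + aggregation).
        self.mask_net = nn.Conv2d(channels, channels, 3, padding=1)
        # (3) Prediction network: dynamic kernels and class logits.
        self.kernel_head = nn.Linear(channels, channels * num_instances)
        self.cls_head = nn.Linear(channels, num_classes * num_instances)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)                  # (B, C, H, W)
        mask_feats = self.mask_net(feats)              # shared mask features
        context = feats.mean(dim=(2, 3))               # (B, C) global context
        kernels = self.kernel_head(context).view(
            -1, self.num_instances, self.channels)     # (B, N, C)
        cls_logits = self.cls_head(context).view(
            -1, self.num_instances, self.num_classes)  # (B, N, K)
        # Dynamic kernels act as 1x1 convolutions over the shared features.
        masks = torch.einsum('bnc,bchw->bnhw', kernels, mask_feats)
        return masks.sigmoid(), cls_logits

model = ToyYOLACTR()
masks, classes = model(torch.randn(2, 3, 256, 256))
print(masks.shape, classes.shape)  # (2, 10, 64, 64) and (2, 10, 80)
```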
Experimental results on the MS COCO dataset show that the method achieves a mask accuracy (AP) of 35.2%, a relative improvement of 25.7% over the YOLACT algorithm. Specifically, detection accuracy improves by 37.1% for small targets, 25.8% for medium targets, and 21.9% for large targets. Compared with algorithms such as Mask R-CNN, YOLACTR, and SOLO, the method shows advantages in segmentation accuracy and edge-detail preservation. It performs especially well on overlapping objects and small targets, effectively addressing the difficulty traditional methods face in regions where instance boundaries overlap.
By decoupling the mask generation process and introducing an anchor-free design, this paper overcomes the limitations of traditional bounding-box methods, achieving balanced instance segmentation performance across object scales, with particular gains in small-target detection and in separating the boundaries of overlapping objects.
YOLACT structure diagram
YOLACTR network structure
Anchor-free instance segmentation model structure
Schematic diagram of the positional embedding
Structure diagram of mask generation network
Prediction network structure
Object detection results
Instance segmentation results
Loss variation curves under different loss function configurations
Loss and accuracy curves for the 2-layer and 6-layer Transformer models. (a) Loss variation curves; (b) Accuracy variation curves
Loss curves during the training process. (a) Total loss curve; (b) Classification loss curve; (c) Mask loss curve
Mask detection accuracy curve during training
Comparison diagrams before improvement (left) and after improvement (right)
Comparison diagram of instance segmentation results 1
Comparison diagram of instance segmentation results 2