Video object segmentation based on self-support group matching

Abstract

Memory networks are currently the mainstream approach to video object segmentation. However, because they perform only dense memory matching between video frames, the model tends to focus on local details of the target and lose its global information. To address this issue, we propose a video object segmentation method based on self-support group matching. First, we design a semantic enhancement module to capture the global information of the target; we then design a self-support module to improve matching accuracy. In addition, memory networks have a high computational cost. We therefore propose a group matching mechanism for memory matching, which reduces computational cost while avoiding interference from features that degrade the matching results. The method has been implemented on three mainstream spatiotemporal memory models (STM, STCN, and XMem) and extensively validated on multiple public video object segmentation datasets. Experimental results show that, on the DAVIS 2017 dataset, the method improves J&F by 1.5 percentage points over STCN, reaching 86.9%, while inference speed increases from 25 to 30 FPS.
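To make the difference between dense and group matching concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method's actual implementation: the abstract does not specify how groups are formed, so the sketch assumes that key and value channels are split into equal contiguous groups and that each group computes its own softmax affinity over the memory. The function names (`dense_readout`, `grouped_readout`) and the toy data are hypothetical, and the sketch does not attempt to reproduce the reported cost savings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def dense_readout(mem_keys, mem_values, query_key):
    """Dense matching for one query location: a single affinity
    computed over all key channels weights every value channel."""
    weights = softmax([dot(k, query_key) for k in mem_keys])
    dim = len(mem_values[0])
    return [sum(w * v[d] for w, v in zip(weights, mem_values))
            for d in range(dim)]


def grouped_readout(mem_keys, mem_values, query_key, groups):
    """Assumed form of group matching: channels are split into
    `groups` contiguous slices, and each slice of value channels is
    read out with an affinity computed only from the matching slice
    of key channels."""
    key_dim, val_dim = len(query_key), len(mem_values[0])
    assert key_dim % groups == 0 and val_dim % groups == 0
    ks, vs = key_dim // groups, val_dim // groups
    out = []
    for g in range(groups):
        k_slice = slice(g * ks, (g + 1) * ks)
        weights = softmax([dot(k[k_slice], query_key[k_slice])
                           for k in mem_keys])
        for d in range(g * vs, (g + 1) * vs):
            out.append(sum(w * v[d] for w, v in zip(weights, mem_values)))
    return out


# Toy memory with two entries, two key channels, two value channels.
mem_keys = [[4.0, 0.0], [0.0, 4.0]]
mem_values = [[1.0, 10.0], [2.0, 20.0]]
query_key = [4.0, 0.0]

dense = dense_readout(mem_keys, mem_values, query_key)
grouped = grouped_readout(mem_keys, mem_values, query_key, groups=2)
print("dense:  ", dense)
print("grouped:", grouped)
```

In this toy case the dense readout is dominated by the first memory entry for both value channels, whereas the grouped readout lets the second value channel be weighted only by the second key channel, which carries no evidence for either entry and therefore averages the two values. With `groups=1` the grouped readout reduces to the dense one.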