• Abstract: To address the challenges of image-quality degradation, dark-region noise, and low contrast that salient object detection faces in low-light environments, a CNN-Transformer dual-stream hybrid encoder-decoder model (Dual-Stream Hybrid Network, DSHNet) is proposed. An asymmetric dual-branch architecture fuses the multi-scale local feature extraction of CNNs with the global modeling strength of Transformers. First, a multi-level feature extraction strategy draws features from the CNN and Transformer modules separately; through global-local collaborative modeling, same-scale features from the two branches are cross-fused hierarchically in both directions. Second, a channel and spatial feature enhancement module is designed: features undergo transformation and channel compression, followed by multi-branch feature fusion. Finally, a combined Focal Loss + Dice Loss objective focuses on hard samples and optimizes region overlap, alleviating boundary blur. Experiments on the self-built complex-scene dataset LlSOD and on standard benchmarks show that DSHNet reduces the parameter count by 23% relative to the comparable hybrid model ABiUNet, and reaches 91.2% F-measure, 83.7% S-measure, and 87.2% weighted F-measure on LlSOD, making it effective for low-light salient object detection.

       

      Abstract:
      Objective Salient Object Detection (SOD) is difficult in low-light conditions: images suffer from degraded quality, heavy noise in dark regions, and low contrast, all of which make accurate and reliable detection hard. Most current methods rely on single-modality features and cannot capture the complex information in low-light images, especially under uneven illumination or when local details are lost. This study therefore proposes an efficient and robust hybrid model that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers, aiming to improve detection performance and overcome the limits of traditional methods in the dark.
      Methods This study proposes a Dual-Stream Hybrid Network (DSHNet) built on a hybrid CNN-Transformer architecture. It adopts a lightweight asymmetric dual-branch encoder-decoder structure to address the weaknesses of existing hybrid models in low-light image processing, namely inflexible multi-scale processing and inefficient feature fusion. DSHNet combines two main strengths: the CNN branch extracts local features at multiple scales, with shallow convolutions suppressing noise caused by uneven illumination and deep convolutions capturing fine-grained textures; the Transformer branch provides powerful global modeling, with self-attention breaking the bottleneck of modeling long-range dependencies between target pixels in low-light images. The model extracts features from both branches with a multi-level strategy and then applies a global-local synergy method to fuse same-scale features from the two branches bidirectionally across hierarchies, reducing information loss in multi-scale feature fusion. We also design a Channel Spatial Attention Enhancement Module (CSAEM), which dynamically fuses channel-spatial attention with channel recalibration; a 1×1 convolution learns a weighting coefficient that adaptively balances channel and spatial attention, enabling explicit perception of edge structure and effectively suppressing background noise in low-light images. In the encoding stage, an Efficient Spatial Reduction Attention (ESRA) module is applied stage by stage to the Transformer features: convolutions compress the keys and values in multi-head self-attention, and a DropKey mechanism is introduced, reducing parameters and computational complexity while avoiding over-focus on local regions.
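The CSAEM balancing idea described above can be illustrated as follows. This is a minimal NumPy sketch, not the paper's implementation: `w_balance` is a hypothetical fixed scalar standing in for the coefficient the 1×1 convolution would learn, and the pooling/gating choices are common simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def csaem(feat, w_balance=0.5):
    """Illustrative channel-spatial attention enhancement.

    feat: (C, H, W) feature map.
    w_balance: stand-in for the weight a 1x1 conv would learn
    to balance channel vs. spatial attention (assumed value).
    """
    # Channel attention: squeeze spatial dims, gate each channel.
    ch_desc = feat.mean(axis=(1, 2))            # (C,)
    ch_gate = sigmoid(ch_desc)[:, None, None]   # (C, 1, 1)
    ch_out = feat * ch_gate
    # Spatial attention: pool over channels, gate each location.
    sp_desc = feat.mean(axis=0, keepdims=True)  # (1, H, W)
    sp_gate = sigmoid(sp_desc)
    sp_out = feat * sp_gate
    # Adaptive balance of the two attention paths.
    return w_balance * ch_out + (1.0 - w_balance) * sp_out
```

In the actual module the balance is input-dependent (learned by the 1×1 convolution), so noisy dark regions can lean on channel recalibration while edges lean on spatial gating.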
We also adopt an overlapping patch embedding strategy and embed depthwise convolution in the feed-forward network to enhance the continuity of local features and avoid the dilution of dark-region features by fixed positional encoding. The decoder uses a two-stage hierarchical design: a coarse-grained branch restores the spatial resolution of global semantic features, while a fine-grained branch refines dark textures and local contrast. Saliency prediction heads are deployed at six feature stages under a deep supervision strategy, enabling coarse-to-fine gradient backpropagation and strengthening boundary constraints on dark-region targets. Finally, we use a combined Focal Loss and Dice Loss objective, which focuses the model on hard samples and improves region overlap, effectively resolving the blurred boundaries caused by the severe foreground-background imbalance in low-light scenes.
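The combined objective can be sketched as below. This is an illustrative NumPy version with standard formulations of the two losses; the relative weight `lam` and the focal parameters `gamma`/`alpha` are assumptions, since the abstract does not specify them.

```python
import numpy as np

def focal_loss(pred, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy pixels to focus on hard ones.
    gamma/alpha are commonly used defaults, not the paper's values."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pt = np.where(target == 1, pred, 1.0 - pred)   # prob of true class
    at = np.where(target == 1, alpha, 1.0 - alpha)  # class weighting
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt)))

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: 1 minus the soft overlap ratio, optimizing region overlap."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

def combined_loss(pred, target, lam=1.0):
    # lam weights the Dice term (hypothetical; the weighting is not given).
    return focal_loss(pred, target) + lam * dice_loss(pred, target)
```

Focal Loss handles the foreground-background pixel imbalance, while Dice Loss directly rewards mask overlap, which sharpens boundaries that cross-entropy-style terms alone tend to blur.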
      Results and Discussions To verify the effectiveness of DSHNet, we constructed LlSOD, a mixed dataset of complex low-light scenes containing 3,960 images with corresponding binary masks, built by screening and mixing images from the LLVIP and YLLSOD datasets with NIQE greater than 6.78 and annotating them with Labelme. Experiments were conducted on LlSOD and on the standard datasets RGB252, RGBD-385, RGBT-621, and VI-789, with 50 training epochs, an initial learning rate of 5e-5, a poly learning-rate schedule (power = 0.9), weight decay of 1e-4, and batch size 4. Ablation experiments on LlSOD show that adding the lightweight CNN branch costs only 0.3M extra parameters while raising F-measure by 5.2% and lowering MAE by 2.4% over the single Transformer branch; the CSAEM dual-branch enhancement module further reduces MAE to 0.093 and raises F-measure to 0.901, confirming that the synergy of channel and spatial attention outperforms either attention mechanism alone. Comparative experiments show that DSHNet is much lighter than the comparable hybrid model ABiUNet, with 23% fewer parameters (only 26.1 MB). On LlSOD it reaches 91.2% F-measure, 83.7% S-measure, and 87.2% weighted F-measure, exceeding the YLLSOD model dedicated to low-light SOD by 5.4%, 5.7%, and 4.8%, respectively. In generalization tests on the standard datasets, DSHNet achieves 0.021 MAE and 0.852 F-measure on RGB252 and 0.037 MAE and 0.907 F-measure on VI-789, and remains competitive on the dual-modal datasets RGBD-385 and RGBT-621 without introducing any additional modal data (0.850 and 0.854 F-measure, respectively), showing that the global-local feature fusion of DSHNet can emulate multi-modal information interaction and is highly robust.
Visual comparisons show that DSHNet localizes and segments objects stably across diverse low-light scenes such as backlit, indoor-night, and outdoor-night settings, effectively resolving the difficulties most methods face in low-light environments: separating foreground from background and localizing objects accurately.
      Conclusions In conclusion, DSHNet addresses key problems in low-light salient object detection: the difficult trade-off between global modeling ability and computational efficiency, the difficulty of jointly capturing global information and local detail, and the interference between multi-scale feature fusion and low-light noise. It combines the advantages of CNNs and Transformers and achieves a lightweight design while improving detection performance. The results show that the model is both lightweight and powerful, offering an effective new approach to single-modality low-light image saliency detection. In future work, we will continue to optimize the model structure, further reducing computational complexity while maintaining detection accuracy, and explore real-world applications such as autonomous driving and security monitoring to promote the engineering deployment of this technology.