Objective Salient Object Detection (SOD) is difficult in low-light conditions. The images often suffer from poor quality, heavy noise in dark areas, and low contrast. These issues make it hard for models to detect objects accurately and reliably. Current methods mostly rely on single-mode features. These methods cannot capture complex information in low-light images. This is especially true when light is uneven or local details are lost. Therefore, this study proposes an efficient and robust hybrid model. We combine the strengths of Convolutional Neural Networks (CNNs) and Transformers. Our goal is to improve detection performance and overcome the limits of traditional methods in the dark.
Methods This study proposes a Dual-Stream Hybrid Network (DSHNet) built on a hybrid CNN-Transformer architecture. It adopts a lightweight asymmetric dual-branch encoder-decoder structure to remedy the shortcomings of existing hybrid models in low-light image processing, such as inflexible multi-scale processing and inefficient feature fusion. DSHNet combines two main strengths. First, CNNs extract local features at multiple scales: shallow convolutions suppress noise interference caused by uneven illumination, while deep convolutions capture fine-grained texture features. Second, Transformers provide powerful global modeling: the self-attention mechanism breaks the bottleneck of modeling long-range dependencies between target pixels in low-light images. The model extracts features from both the CNN and Transformer branches with a multi-level strategy, then applies a global-local synergy method to realize cross-hierarchical bidirectional fusion of same-scale features from the two branches, mitigating information loss in multi-scale feature fusion. We also design a Channel Spatial Attention Enhancement Module (CSAEM), which dynamically fuses channel-spatial attention with channel recalibration and uses a 1×1 convolution layer to learn weighting coefficients that adaptively balance channel and spatial attention, enabling explicit perception of edge structures and effectively suppressing background noise in low-light images. In the encoding stage, the Efficient Spatial Reduction Attention (ESRA) module is applied stage by stage to the Transformer features: it compresses the keys and values in multi-head self-attention through convolution operations and introduces the DropKey mechanism, reducing parameter count and computational complexity while avoiding over-focus on local areas.
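The spatial-reduction attention with DropKey described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: average pooling stands in for the strided-convolution compression of keys and values, and the projection matrices are random placeholders for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def esra_attention(x, reduction=2, drop_key_p=0.1, training=False):
    """Sketch of ESRA-style attention on a (H, W, C) feature map.

    Keys/values are spatially compressed by average pooling (a stand-in
    for the paper's convolutional reduction), so the attention matrix
    shrinks from (H*W, H*W) to (H*W, H*W / reduction^2). During training,
    DropKey randomly masks key logits before softmax so attention cannot
    over-concentrate on a few locations.
    """
    H, W, C = x.shape
    q = x.reshape(H * W, C)

    # Spatial reduction: pool the map before forming keys/values.
    h, w = H // reduction, W // reduction
    pooled = x[:h * reduction, :w * reduction].reshape(
        h, reduction, w, reduction, C).mean(axis=(1, 3))
    kv = pooled.reshape(h * w, C)                       # (H*W / r^2, C)

    # Hypothetical projection weights (learned in the real model).
    wq = rng.standard_normal((C, C)) / np.sqrt(C)
    wk = rng.standard_normal((C, C)) / np.sqrt(C)
    wv = rng.standard_normal((C, C)) / np.sqrt(C)

    logits = (q @ wq) @ (kv @ wk).T / np.sqrt(C)        # (H*W, H*W / r^2)

    if training and drop_key_p > 0:
        # DropKey: randomly suppress key logits pre-softmax.
        mask = rng.random(logits.shape) < drop_key_p
        logits = np.where(mask, -1e9, logits)

    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ (kv @ wv)).reshape(H, W, C)

out = esra_attention(rng.standard_normal((8, 8, 16)), reduction=2)
```

With reduction=2, each query attends to 16 pooled positions instead of 64, quartering the cost of the attention matrix while keeping the output resolution unchanged.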
We also adopt an overlapping patch embedding strategy and embed depthwise convolution in the feed-forward network to enhance the continuity of local features and avoid the dilution of dark-region features caused by fixed positional encoding. In the decoding stage, we use a two-stage hierarchical design: a coarse-grained decoding branch restores the spatial resolution of global semantic features, while a fine-grained decoding branch refines dark textures and local contrast. Saliency prediction heads are deployed at six feature stages with a deep supervision strategy, realizing coarse-to-fine gradient backpropagation and strengthening the boundary constraints on dark-region targets. Finally, we use a combined loss function of Focal Loss and Dice Loss, which helps the model focus on difficult samples and improves region overlap, effectively addressing the unclear boundaries caused by severe foreground-background imbalance in low-light scenes.
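The combined objective can be illustrated with a small NumPy sketch. The equal weighting of the two terms and the focal parameters (alpha = 0.25, gamma = 2) are assumptions for illustration; the abstract does not give the actual coefficients.

```python
import numpy as np

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy pixels so training
    concentrates on hard foreground/background boundary pixels."""
    pred = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, pred, 1 - pred)
    a = np.where(target == 1, alpha, 1 - alpha)
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))

def dice_loss(pred, target, eps=1e-7):
    """Dice loss: one minus the soft overlap ratio, directly
    optimizing region overlap between prediction and mask."""
    inter = (pred * target).sum()
    return float(1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps))

def combined_loss(pred, target, w_focal=1.0, w_dice=1.0):
    # Hypothetical equal weighting of the two terms.
    return w_focal * focal_loss(pred, target) + w_dice * dice_loss(pred, target)

mask = np.zeros((32, 32)); mask[8:24, 8:24] = 1.0
good = np.clip(mask * 0.9 + 0.05, 0, 1)   # confident, close to the mask
bad = np.full((32, 32), 0.5)              # uninformative prediction
loss_good = combined_loss(good, mask)
loss_bad = combined_loss(bad, mask)
```

A prediction close to the mask yields a much smaller combined loss than a uniform 0.5 map, reflecting how the Dice term penalizes poor overlap even when per-pixel errors are moderate.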
Results and Discussions To verify the effectiveness of DSHNet, we constructed LlSOD, a mixed dataset of complex low-light scenes containing 3960 images with corresponding binary masks, by screening and mixing images from the LLVIP and YLLSOD datasets with NIQE greater than 6.78 and annotating them with Labelme. Experiments were conducted on LlSOD and on the standard datasets RGB252, RGBD-385, RGBT-621 and VI-789, with the configuration set to 50 training epochs, an initial learning rate of 5e-5, a poly learning-rate schedule with power = 0.9, weight decay of 1e-4 and batch size of 4. Ablation experiments on LlSOD show that introducing the lightweight CNN branch adds only 0.3M parameters while increasing F-measure by 5.2% and reducing MAE by 2.4% relative to the single Transformer branch; the CSAEM dual-branch enhancement module further reduces MAE to 0.093 and raises F-measure to 0.901, verifying that the synergy of channel and spatial attention outperforms a single attention mechanism. Comparative experiments show that DSHNet is much lighter than the comparable hybrid model ABiUNet, reducing the number of parameters by 23% (only 26.1MB). On LlSOD it achieves 91.2% F-measure, 83.7% S-measure and 87.2% weighted F-measure, which are 5.4%, 5.7% and 4.8% higher, respectively, than the YLLSOD model dedicated to low-light SOD. In generalization tests on standard datasets, DSHNet achieves 0.021 MAE and 0.852 F-measure on RGB252 and 0.037 MAE and 0.907 F-measure on VI-789, and even shows competitive performance on the dual-modal datasets RGBD-385 and RGBT-621 without introducing additional modal data, reaching 0.850 F-measure on RGBD-385 and 0.854 on RGBT-621. This demonstrates that DSHNet's global-local feature fusion can simulate multi-modal information interaction and is strongly robust.
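For reference, the two headline metrics reported above can be sketched in NumPy. The β² = 0.3 weighting and the fixed 0.5 binarization threshold are common SOD conventions, assumed here since the abstract does not state the exact evaluation protocol.

```python
import numpy as np

def mae(pred, mask):
    """Mean absolute error between a saliency map in [0, 1]
    and the binary ground-truth mask."""
    return float(np.abs(pred - mask).mean())

def f_measure(pred, mask, beta2=0.3, thresh=0.5, eps=1e-7):
    """F-measure as commonly used in SOD: precision and recall of the
    thresholded map, combined with beta^2 = 0.3 to emphasize precision."""
    b = pred >= thresh
    tp = np.logical_and(b, mask == 1).sum()
    precision = tp / (b.sum() + eps)
    recall = tp / ((mask == 1).sum() + eps)
    return float((1 + beta2) * precision * recall /
                 (beta2 * precision + recall + eps))

mask = np.zeros((16, 16)); mask[4:12, 4:12] = 1
perfect = mask.astype(float)   # toy prediction matching the mask
```

A prediction identical to the mask gives MAE 0 and F-measure 1, the bounds against which the scores above are read.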
Visual comparisons show that DSHNet stably localizes and segments objects in various low-light scenes such as backlit, indoor night and outdoor night scenes, effectively addressing the difficulty of separating foreground from background and the inaccurate object localization that most methods face in low-light environments.
Conclusions DSHNet successfully addresses key problems in low-light salient object detection: the difficult trade-off between global modeling capability and computational efficiency, the difficulty of jointly capturing global information and local detail, and the entanglement of multi-scale feature fusion with low-light noise. It combines the advantages of CNNs and Transformers and achieves a lightweight design while improving detection performance. The results show that the model is both lightweight and powerful, providing an effective new approach for single-modal low-light image saliency detection. In future work, we will continue to optimize the model structure, further reduce computational complexity while maintaining detection accuracy, and explore real-world applications such as autonomous driving and security monitoring to promote the engineering deployment of this technology.