Abstract:
Objective Nighttime low-light imaging is often affected by underexposure, limited dynamic range, and amplified sensor noise. These factors weaken edge, texture, and contour cues while increasing background interference and feature ambiguity. Such degradation reduces detector reliability and causes missed detections, false positives, and localization bias, with the most severe impact on small objects, distant targets, and dim instances. To address these problems, LMD-YOLO, a low-light object detection method that integrates multi-scale feature fusion and local detail enhancement, was proposed on the basis of YOLOv11n. Without substantially increasing model complexity, it strengthens multi-scale feature fusion and local detail extraction for low-light images and reduces sensitivity to the strong noise of low-illumination scenes, thereby improving low-light detection accuracy.
Methods LMD-YOLO combined multi-scale feature fusion and local detail enhancement through four components. 1) A Multi-Pool Spatial Pyramid Pooling–Fast (MPSPPF) module was inserted into the backbone. It contained parallel pooling branches and a gradient-enhanced concatenation path. The parallel branches applied MaxPool2d and AvgPool2d: max pooling strengthened salient responses and benefited key-object extraction in low-contrast images, whereas average pooling produced smoother representations and reduced noise sensitivity. The pooled outputs were fused by weighted aggregation to preserve global structure while retaining local detail cues. MPSPPF also used repeated sequential pooling, in which the same pooling operation was applied stage by stage to the previous pooled output to form a deeper pyramid. This design enlarged the effective receptive field at limited cost and built a richer feature hierarchy, capturing the local and global context that supported recognition of blurred, partially occluded, or weakly illuminated objects. The gradient-enhanced concatenation improved information flow across pooling stages and reduced detail attenuation in early feature extraction.

2) A Cross Stage Partial–Enhanced Dual Hybrid Attention Network (CSP-EDHAN) module was constructed for fine-grained low-light features. Depthwise separable convolution reduced computation while maintaining spatial sensitivity, and residual connections stabilized feature propagation under severe noise. A multi-path fusion structure increased cross-stage information flow and strengthened weak-target cues. A dual hybrid attention mechanism combined channel emphasis with spatial selection: channel emphasis highlighted informative channels under illumination degradation, while spatial selection focused responses on object regions and suppressed random-noise activations. This module improved feature contrast for dim targets and reduced false activations caused by background clutter.
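The two pooling ideas in MPSPPF, parallel max/average pooling with weighted fusion and repeated sequential pooling, can be sketched in plain Python. This is a minimal illustration on a toy 3×3 map; the stride-1 clipped-border pooling and the 0.6/0.4 fusion weights are assumptions for the sketch, not the paper's implementation:

```python
# Sketch of MPSPPF's parallel pooling and repeated sequential pooling.
def pool2d(x, k, op):
    """Stride-1 k×k pooling over 2D lists, with windows clipped at the borders."""
    h, w, r = len(x), len(x[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            win = [x[a][b]
                   for a in range(max(0, i - r), min(h, i + r + 1))
                   for b in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = max(win) if op == "max" else sum(win) / len(win)
    return out

def fuse(mx, av, w_max=0.6, w_avg=0.4):
    """Weighted aggregation: salient max-pooled cues plus smoothed avg-pooled cues."""
    return [[w_max * mx[i][j] + w_avg * av[i][j] for j in range(len(mx[0]))]
            for i in range(len(mx))]

feat = [[0, 1, 0],
        [1, 9, 1],
        [0, 1, 0]]  # a dim scene with one bright peak

fused = fuse(pool2d(feat, 3, "max"), pool2d(feat, 3, "avg"))

# Repeated sequential pooling: two stride-1 3×3 max pools see the same 5×5
# neighborhood as a single larger pool, enlarging the receptive field cheaply.
twice = pool2d(pool2d(feat, 3, "max"), 3, "max")
once5 = pool2d(feat, 5, "max")
assert twice == once5
```

The final assertion shows why stacking cheap small pools is attractive: two sequential 3×3 max pools cover the same 5×5 neighborhood as one larger pool, so the effective receptive field grows while per-stage cost stays low.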
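The dual hybrid attention in CSP-EDHAN can be illustrated with a pure-Python sketch on toy C×H×W lists; the average-based sigmoid gates below are simplified stand-ins for the module's learned channel and spatial attention, not its actual layers:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def dual_hybrid_attention(x):
    """x: C×H×W nested lists -> channel-gated, then spatially gated features."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Channel emphasis: weight each channel by its global average response.
    ch_gate = [sigmoid(sum(sum(row) for row in x[c]) / (H * W)) for c in range(C)]
    y = [[[x[c][i][j] * ch_gate[c] for j in range(W)] for i in range(H)]
         for c in range(C)]
    # Spatial selection: weight each position by its cross-channel mean, which
    # damps isolated noise spikes that appear in only a few channels.
    sp_gate = [[sigmoid(sum(y[c][i][j] for c in range(C)) / C) for j in range(W)]
               for i in range(H)]
    return [[[y[c][i][j] * sp_gate[i][j] for j in range(W)] for i in range(H)]
            for c in range(C)]

x = [[[0.0, 2.0], [0.0, 0.0]],
     [[0.0, 2.0], [0.0, 0.0]]]  # a weak target at position (0, 1)
out = dual_hybrid_attention(x)
assert out[0][0][1] > out[0][0][0]  # target position stays stronger than background
```

The sequential channel-then-spatial gating mirrors the "dual hybrid" idea: a position must both live in an informative channel and agree across channels to keep a strong response.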
3) A Dynamic Detail–Semantic Fusion Pyramid Network (DDS-FPN) was designed for neck fusion. A channel–spatial attention unit from the SDFM module guided feature selection during cross-scale aggregation. Shallow detail cues and deep semantic cues were fused with dynamic weights, which raised responses in target regions and lowered responses in noisy background regions. An adaptive downsampling (ADown) module aligned deep semantics with shallow details during scale transitions; it reduced feature misalignment, improved feature consistency across pyramid levels, and compensated for the detail loss caused by low illumination. A bidirectional pyramid path strengthened bottom-up detail propagation and top-down semantic guidance, so multi-scale features were aligned and enhanced for small, medium, and large objects under low-light conditions.

4) A dual-group detection head was proposed to balance accuracy and efficiency. Input multi-scale feature maps first passed through a shared feature-extraction stem composed of two grouped convolution layers. Grouped convolution increased feature diversity at low cost and improved localization and recognition, and the classification and regression branches shared the stem parameters to reduce redundant computation. Two independent output branches were then applied, each using a 1×1 convolution layer to predict class probabilities and bounding-box distributions. The decoupled design reduced task interference and improved training stability on noisy low-light inputs.
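The dynamic detail–semantic weighting in DDS-FPN can be sketched as a per-position gate. This is pure Python under simplifying assumptions: the sigmoid gate over summed cues is illustrative, and the attention unit and ADown alignment are omitted:

```python
import math

def dynamic_fuse(detail, semantic):
    """Fuse shallow detail and deep semantic maps with a per-position gate."""
    H, W = len(detail), len(detail[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Gate rises where both cues agree, favoring the detail map there;
            # weak, inconsistent responses fall back toward the semantic map.
            g = 1 / (1 + math.exp(-(detail[i][j] + semantic[i][j])))
            out[i][j] = g * detail[i][j] + (1 - g) * semantic[i][j]
    return out

detail   = [[3.0, 0.0], [0.0, 0.0]]   # shallow map: strong edge at (0, 0)
semantic = [[2.0, 0.0], [0.0, 0.2]]   # deep map: object evidence plus faint noise
fused = dynamic_fuse(detail, semantic)
assert fused[0][0] > fused[1][1]  # target region raised, noisy background kept low
```

The point of the dynamic weights is visible in the toy example: where detail and semantics agree, the fused response approaches the stronger cue; where only a faint, isolated response exists, the output stays close to zero.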
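A back-of-envelope weight count shows why the grouped-convolution stem of the dual-group head stays lightweight. The channel widths, the 80-class output, and the 64-channel box-distribution output are illustrative assumptions, not the paper's configuration:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k×k conv: each group maps c_in/groups -> c_out/groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

stem_standard = conv_params(256, 256, 3, groups=1)  # plain 3×3 stem layer
stem_grouped  = conv_params(256, 256, 3, groups=4)  # grouped 3×3 stem layer
head_cls = conv_params(256, 80, 1)  # 1×1 branch for class probabilities
head_reg = conv_params(256, 64, 1)  # 1×1 branch for box distributions

assert stem_grouped * 4 == stem_standard  # groups=g divides stem weights by g
```

Because the grouped stem is shared by both branches, its (already g-times smaller) cost is paid once, while the two cheap 1×1 branches keep classification and regression decoupled.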
Results and Discussions Comparative experiments were conducted on ExDark and PASCAL VOC 2012 with YOLOv11n as the baseline. On ExDark, LMD-YOLO improved precision, recall, mAP50, and mAP50-95 by 1.3%, 2.3%, 3.1%, and 2.2%, respectively. On PASCAL VOC 2012, the same metrics increased by 1.4%, 1.5%, 1.8%, and 1.3%. An additional evaluation on a complex low-light scene dataset showed gains of 2.2% in mAP50 and 1.3% in mAP50-95 over the baseline. The results indicated that MPSPPF reduced early noise sensitivity through multi-branch pooling and context aggregation: weighted fusion preserved structural cues while limiting noise amplification, and sequential pooling improved long-range context modeling, which benefited targets with weak boundaries. CSP-EDHAN strengthened weak-target representation through cross-stage fusion and attention-guided enhancement, with depthwise separable convolution preserving efficiency and improving feature robustness. DDS-FPN improved multi-scale alignment through dynamic detail–semantic fusion and attention-based selection, while ADown reduced information loss during downsampling and improved semantic–detail consistency across pyramid levels. The dual-group head kept computation lightweight and improved optimization through its decoupled branches. The gains on PASCAL VOC 2012 suggested improved general feature quality rather than a low-light domain bias, and the method showed stable behavior under mixed illumination and cluttered backgrounds, which are common in real nighttime scenes.
Conclusions LMD-YOLO improves low-light object detection through noise-robust feature extraction, weak-target enhancement, and dynamic multi-scale fusion while preserving a lightweight design. The method achieves consistent accuracy gains on low-light benchmarks and remains effective under normal illumination. Future work targets broader illumination distributions, stronger robustness to extreme noise, glare, and motion blur, and improved generalization in natural scenes.