Abstract:
Real-time and accurate detection of rockfalls on mining roads is critical for ensuring transportation safety, preventing geological hazards, and maintaining operational continuity in open-pit mining environments. Detecting rockfalls in such scenarios presents substantial challenges due to the small size of targets, irregular shapes, low contrast against the background, drastic illumination variations, and interference from adverse weather conditions such as fog, dust, and rain. Conventional single-modality visual detection methods, particularly those based solely on visible light imagery, often fail under low-visibility conditions, while existing dual-modality fusion approaches typically face limitations including complex network structures, high computational costs, and difficulties in simultaneously achieving real-time performance and high detection accuracy for small targets.
To address these challenges, this study proposes a novel progressive infrared–visible light fusion detection framework based on the Mamba state space model (SSM), which is specifically designed to efficiently integrate complementary information from infrared and visible light images while preserving fine-grained details of small targets, even under challenging environmental conditions such as low illumination, strong backlighting, partial occlusion, and adverse weather. The framework consists of three key components: (1) optimized feature extraction using MobileMamba and Wavelet Transform-based downsampling, which reduces feature loss while maintaining high-resolution details; (2) cross-modal feature fusion via the MambaFusion module, enabling adaptive integration of thermal and texture cues across multiple scales; and (3) small-object detection and localization enhancement through a Scale and Location Sensitive Loss (SLS Loss), which explicitly models scale variations and center-point offsets to improve both classification confidence and spatial precision for micro-scale rockfall targets. This multi-component design ensures robust detection in complex, unstructured mining environments and provides a balanced trade-off between computational efficiency and detection accuracy.
First, a Wavelet Transform-based WTSterm downsampling module is introduced to mitigate the loss of fine-grained features during feature extraction. Unlike conventional stride-based convolutional downsampling, the WTSterm module decomposes input images into multiple frequency components, including low-frequency approximations and directional high-frequency details. These components are concatenated along the channel dimension and projected through a 1×1 convolution, effectively preserving edge and texture information of small rock targets while reducing computational overhead. This module is integrated into an optimized MobileMamba backbone network, a lightweight visual SSM architecture that combines multi-kernel convolutions, multi-receptive field feature interaction, and wavelet-enhanced Mamba blocks. The MobileMamba backbone enables efficient extraction of thermal radiation cues from infrared images and rich spatial-texture features from visible light images, while selectively maintaining both shallow fine-grained details and deep high-level semantic features. To reduce computational redundancy, the network removes high-dimensional, small-scale feature branches that contribute little to small-object representation, thereby balancing feature richness and efficiency.
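As a minimal sketch of this wavelet-based downsampling, the PyTorch module below assumes a single-level Haar decomposition; the class name, channel sizes, and layer arrangement are illustrative assumptions rather than the exact WTSterm implementation.

```python
# Minimal sketch of a wavelet-based downsampling stem (illustrative, not the authors' code):
# Haar DWT -> channel-wise concatenation of sub-bands -> 1x1 projection.
import torch
import torch.nn as nn

class WaveletDownsample(nn.Module):
    """Replaces stride-2 convolution with a lossless Haar split plus projection."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 4 sub-bands (LL, LH, HL, HH) are stacked along the channel axis
        self.proj = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)

    def haar_dwt(self, x: torch.Tensor):
        # Single-level Haar transform via 2x2 block sums and differences
        a = x[..., 0::2, 0::2]
        b = x[..., 0::2, 1::2]
        c = x[..., 1::2, 0::2]
        d = x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2   # low-frequency approximation
        lh = (a - b + c - d) / 2   # directional high-frequency detail
        hl = (a + b - c - d) / 2   # directional high-frequency detail
        hh = (a - b - c + d) / 2   # diagonal high-frequency detail
        return ll, lh, hl, hh

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ll, lh, hl, hh = self.haar_dwt(x)
        # Concatenate frequency components, then fuse with a 1x1 convolution
        return self.proj(torch.cat([ll, lh, hl, hh], dim=1))

# Example: halve the spatial resolution of a 64-channel feature map
feat = torch.randn(1, 64, 160, 160)
down = WaveletDownsample(in_ch=64, out_ch=128)
print(down(feat).shape)  # torch.Size([1, 128, 80, 80])
```

Because the 2x2 sums and differences retain all input information up to the transform, the high-frequency sub-bands carry the edge and texture cues of small rocks into the projected feature map rather than being discarded, as they would be by strided convolution.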
Second, a MambaFusion module is designed to progressively fuse multi-scale features from infrared and visible light modalities. The fusion process leverages Mamba’s dynamic state transition and gating mechanisms, which enable adaptive selection of discriminative information while suppressing modality-specific noise. MambaFusion combines the global dependency modeling ability of the SSM with local feature retention through convolution, facilitating cross-modal correspondence at semantic and spatial levels. Multi-scale fusion is performed across large, medium, and small feature maps, with adaptive weighting applied to emphasize salient information according to scene context. This mechanism allows the network to maintain robustness under challenging conditions such as strong backlighting, low illumination, partial occlusion, and airborne dust or smoke, enhancing both target discrimination and representation for small rockfall instances.
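The simplified sketch below illustrates such gated, adaptively weighted cross-modal fusion at one pyramid level; the selective state-space (Mamba) block is abstracted as a generic global-mixing placeholder, and all module and variable names are assumptions rather than the paper's actual MambaFusion code.

```python
# Simplified cross-modal fusion sketch in the spirit of the described MambaFusion module.
# The Mamba SSM block is stood in for by `global_mixer`; names are illustrative only.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        # Placeholder for the Mamba block that models long-range (global) dependencies
        self.global_mixer = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, kernel_size=3, padding=1, groups=2 * ch),
            nn.Conv2d(2 * ch, ch, kernel_size=1),
        )
        # Local-detail branch retained through plain convolution
        self.local_branch = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        # Gating predicts per-pixel modality weights (thermal vs. texture cues)
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, 2, kernel_size=1), nn.Softmax(dim=1))

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        x = torch.cat([ir, vis], dim=1)
        w = self.gate(x)                              # adaptive modality weighting
        weighted = w[:, 0:1] * ir + w[:, 1:2] * vis   # suppress modality-specific noise
        return weighted + self.global_mixer(x) + self.local_branch(x)

# Progressive multi-scale use: one fusion block per pyramid level (large/medium/small maps)
levels = [(256, 80), (512, 40), (1024, 20)]
fused = [CrossModalFusion(c)(torch.randn(1, c, s, s), torch.randn(1, c, s, s))
         for c, s in levels]
print([f.shape for f in fused])
```

The per-pixel gate is what lets one modality dominate where the other is unreliable, for example favoring thermal radiation cues under backlighting or dust and visible-light texture under normal illumination.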
Third, to improve small-object detection and localization, the framework incorporates a Scale and Location Sensitive Loss (SLS Loss). This loss function explicitly models variations in target scale and center-point offsets, addressing the insensitivity of conventional losses (e.g., IoU, Dice) to size and position discrepancies. The scale-sensitive component reweights loss contributions according to the predicted and true target sizes, focusing training on underrepresented or mispredicted small targets. The location-sensitive component employs polar-coordinate quantification of center-point offsets, providing a strong penalty for positional errors and refining bounding box localization. Together, SLS Loss ensures that the network maintains high sensitivity to micro-scale rockfalls and achieves precise positioning in cluttered or unstructured road environments.
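One plausible reading of such a loss is sketched below: an IoU term reweighted by a scale-mismatch factor, plus a center-offset penalty normalized by target size. The exact SLS formulation is not reproduced here; the specific weighting and normalization choices are assumptions made for illustration.

```python
# Illustrative scale- and location-sensitive loss on axis-aligned boxes (x1, y1, x2, y2).
# The scale weight and offset penalty below are assumed forms, not the paper's exact SLS Loss.
import torch

def sls_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # IoU term (insensitive to size/position errors on its own)
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Scale-sensitive weight: mispredicted sizes (relative to small ground truths)
    # contribute more strongly to the loss
    scale_w = 1.0 + torch.abs(area_p - area_t) / (area_t + eps)

    # Location-sensitive term: center offset expressed as a polar radius, normalized by
    # the ground-truth diagonal so small boxes are penalized strongly
    # (the angular component of the polar quantification is omitted in this sketch)
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    dx, dy = cp[:, 0] - ct[:, 0], cp[:, 1] - ct[:, 1]
    r = torch.sqrt(dx ** 2 + dy ** 2 + eps)
    diag = torch.sqrt((target[:, 2] - target[:, 0]) ** 2 +
                      (target[:, 3] - target[:, 1]) ** 2 + eps)
    loc_penalty = r / (diag + eps)

    return (scale_w * (1.0 - iou) + loc_penalty).mean()

# Example: one predicted box against its ground truth
pred = torch.tensor([[10.0, 10.0, 30.0, 28.0]])
gt   = torch.tensor([[12.0, 11.0, 32.0, 30.0]])
print(sls_loss(pred, gt))
```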
Extensive experiments were conducted on a self-collected open-pit mine rockfall dataset containing diverse scenes with varying illumination, weather conditions, and target sizes (typically 10–50 cm in diameter). Results demonstrate that the proposed framework achieves a precision of 0.856, a recall of 0.715, and an mAP@0.5 of 0.721, with an inference speed of 16 ms per frame. Compared with conventional single-modality detection and existing dual-modality fusion approaches, the proposed method significantly improves small-target detection performance and robustness while maintaining real-time computational efficiency.

In conclusion, the progressive infrared–visible light fusion framework based on MobileMamba and MambaFusion modules effectively integrates complementary thermal and textural cues, preserves fine-grained details of micro-scale rockfall targets, and enhances small-object localization and discrimination in complex visual scenarios. By combining efficient feature extraction, adaptive cross-modal fusion, and a scale- and location-aware loss function, the proposed approach achieves high robustness under diverse environmental conditions, including low visibility, dynamic lighting, dust, and partial occlusion. The method offers a reliable, high-performance solution for real-time rockfall detection and monitoring in open-pit mining environments, contributing to safer road operation, reduced operational risk, and improved transportation efficiency. These results demonstrate the potential of lightweight state-space models combined with progressive multi-modal fusion and targeted loss functions for deployment in practical industrial safety and autonomous monitoring systems.