Summary: Real-time, accurate detection of rockfalls on mine roads is of great significance to safe production in mining areas. Rockfall detection in mines faces challenges such as tiny targets, drastic illumination changes, and interference from adverse weather, which cause conventional single-modality detection to fail under low visibility, while existing dual-modality fusion methods suffer from complex network structures, high computational overhead, and difficulty in balancing real-time performance with small-target detection accuracy. To address this, this paper proposes a novel progressive infrared–visible fusion detection algorithm. First, a wavelet transform-based WTSterm downsampling module is designed and the MobileMamba backbone structure is optimized to reduce the loss of fine detail features of tiny rockfalls and to extract infrared and visible features efficiently. Second, a MambaFusion cross-modal fusion module is designed that combines Mamba's global modeling capability with convolution's retention of local detail, using dynamic state transitions and a gating mechanism to achieve efficient, complementary fusion of infrared and visible features. Finally, a Scale and Location Sensitive Loss (SLS Loss) is adopted to increase the model's sensitivity to tiny targets and improve localization accuracy. Experiments on a self-built open-pit mine rockfall dataset show that the algorithm raises precision, recall, and mAP@0.5 to 0.856, 0.715, and 0.721, respectively, with an inference time of 16 ms, effectively improving the detection performance and robustness for small rockfall targets in complex scenes and providing an effective method for real-time road-safety monitoring in mining areas.

       

      Abstract:
      Real-time and accurate detection of rockfalls on mining roads is critical for ensuring transportation safety, preventing geological hazards, and maintaining operational continuity in open-pit mining environments. Detecting rockfalls in such scenarios presents substantial challenges due to the small size of targets, irregular shapes, low contrast against the background, drastic illumination variations, and interference from adverse weather conditions such as fog, dust, and rain. Conventional single-modality visual detection methods, particularly those based solely on visible light imagery, often fail under low-visibility conditions, while existing dual-modality fusion approaches typically face limitations including complex network structures, high computational costs, and difficulties in simultaneously achieving real-time performance and high detection accuracy for small targets.
      To address these challenges, this study proposes a novel progressive infrared–visible light fusion detection framework based on the Mamba state space model (SSM), which is specifically designed to efficiently integrate complementary information from infrared and visible light images while preserving fine-grained details of small targets, even under challenging environmental conditions such as low illumination, strong backlighting, partial occlusion, and adverse weather. The framework consists of three key components: (1) optimized feature extraction using MobileMamba and Wavelet Transform-based downsampling, which reduces feature loss while maintaining high-resolution details; (2) cross-modal feature fusion via the MambaFusion module, enabling adaptive integration of thermal and texture cues across multiple scales; and (3) small-object detection and localization enhancement through a Scale and Location Sensitive Loss (SLS Loss), which explicitly models scale variations and center-point offsets to improve both classification confidence and spatial precision for micro-scale rockfall targets. This multi-component design ensures robust detection in complex, unstructured mining environments and provides a balanced trade-off between computational efficiency and detection accuracy.
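      To make the data flow concrete, the following PyTorch skeleton shows one way these three components could be wired together. It is a minimal sketch under assumed interfaces: the backbones return multi-scale feature lists and each fusion block accepts one infrared and one visible feature map; none of this reflects the authors' released code.

```python
import torch
import torch.nn as nn

class ProgressiveFusionDetector(nn.Module):
    """Skeleton only: every submodule is a stand-in for a paper component."""

    def __init__(self, backbone_ir: nn.Module, backbone_vis: nn.Module,
                 fusion_blocks: list, head: nn.Module):
        super().__init__()
        self.backbone_ir = backbone_ir    # MobileMamba + WTSterm, infrared stream
        self.backbone_vis = backbone_vis  # MobileMamba + WTSterm, visible stream
        # One cross-modal fusion block (MambaFusion in the paper) per scale.
        self.fusion_blocks = nn.ModuleList(fusion_blocks)
        self.head = head                  # detection head, trained with SLS Loss

    def forward(self, ir: torch.Tensor, vis: torch.Tensor):
        feats_ir = self.backbone_ir(ir)    # multi-scale feature list, coarse to fine
        feats_vis = self.backbone_vis(vis)
        fused = [blk(fi, fv) for blk, fi, fv
                 in zip(self.fusion_blocks, feats_ir, feats_vis)]
        return self.head(fused)
```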
      First, a Wavelet Transform-based WTSterm downsampling module is introduced to mitigate the loss of fine-grained features during feature extraction. Unlike conventional stride-based convolutional downsampling, the WTSterm module decomposes input images into multiple frequency components, including low-frequency approximations and directional high-frequency details. These components are concatenated along the channel dimension and projected through a 1×1 convolution, effectively preserving edge and texture information of small rock targets while reducing computational overhead. This module is integrated into an optimized MobileMamba backbone network, a lightweight visual SSM architecture that combines multi-kernel convolutions, multi-receptive field feature interaction, and wavelet-enhanced Mamba blocks. The MobileMamba backbone enables efficient extraction of thermal radiation cues from infrared images and rich spatial-texture features from visible light images, while selectively maintaining both shallow fine-grained details and deep high-level semantic features. To reduce computational redundancy, the network removes high-dimensional, small-scale feature branches that contribute little to small-object representation, thereby balancing feature richness and efficiency.
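      The sketch below illustrates the wavelet-downsampling idea with a single-level Haar decomposition implemented as a fixed depthwise strided convolution, followed by channel concatenation of the four sub-bands and a 1×1 projection. The filter bank, decomposition depth, and projection width are assumptions; the abstract does not specify the WTSterm internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletDownsample(nn.Module):
    """Illustrative Haar-wavelet downsampling; an assumed stand-in for WTSterm."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Fixed 2x2 Haar filters: low-low approximation plus three
        # directional high-frequency detail bands.
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
        self.register_buffer("filt", filt.repeat(in_ch, 1, 1, 1))
        self.in_ch = in_ch
        # Channel projection after concatenating the four sub-bands.
        self.proj = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise conv with stride 2 computes all four sub-bands per channel,
        # halving spatial resolution without discarding high-frequency detail.
        sub = F.conv2d(x, self.filt, stride=2, groups=self.in_ch)
        return self.proj(sub)
```

      For example, `WaveletDownsample(32, 64)` maps a (1, 32, 128, 128) tensor to (1, 64, 64, 64), halving resolution while keeping the directional high-frequency bands that plain strided convolution tends to blur away.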
      Second, a MambaFusion module is designed to progressively fuse multi-scale features from infrared and visible light modalities. The fusion process leverages Mamba’s dynamic state transition and gating mechanisms, which enable adaptive selection of discriminative information while suppressing modality-specific noise. MambaFusion combines the global dependency modeling ability of the SSM with local feature retention through convolution, facilitating cross-modal correspondence at semantic and spatial levels. Multi-scale fusion is performed across large, medium, and small feature maps, with adaptive weighting applied to emphasize salient information according to scene context. This mechanism allows the network to maintain robustness under challenging conditions such as strong backlighting, low illumination, partial occlusion, and airborne dust or smoke, enhancing both target discrimination and representation for small rockfall instances.
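      The following sketch captures only the gating logic of such a fusion block. A full selective state-space scan is beyond a short example, so a pooled global-context vector stands in for Mamba's global branch; that substitution, and all layer shapes, are illustrative assumptions rather than the MambaFusion design itself.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Illustrative gated fusion of one infrared and one visible feature map."""

    def __init__(self, ch: int):
        super().__init__()
        # Local branch: a 3x3 conv retains fine spatial detail from both modalities.
        self.local = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)
        # Stand-in global branch: a pooled context vector modulates each channel
        # (in place of Mamba's selective state-space scan).
        self.context = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * ch, ch, kernel_size=1),
            nn.SiLU(),
        )
        # Gate decides, per pixel, how much each modality contributes.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([ir, vis], dim=1)
        g = self.gate(pair)                       # (B, ch, H, W) in [0, 1]
        mixed = g * ir + (1.0 - g) * vis          # adaptive modality weighting
        local = self.local(pair)                  # modality-joint local detail
        return mixed + local * self.context(pair)  # context-modulated fusion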
      Third, to improve small-object detection and localization, the framework incorporates a Scale and Location Sensitive Loss (SLS Loss). This loss function explicitly models variations in target scale and center-point offsets, addressing the insensitivity of conventional losses (e.g., IoU, Dice) to size and position discrepancies. The scale-sensitive component reweights loss contributions according to the predicted and true target sizes, focusing training on underrepresented or mispredicted small targets. The location-sensitive component employs polar-coordinate quantification of center-point offsets, providing a strong penalty for positional errors and refining bounding box localization. Together, SLS Loss ensures that the network maintains high sensitivity to micro-scale rockfalls and achieves precise positioning in cluttered or unstructured road environments.
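      A simplified version of such a loss for axis-aligned boxes is sketched below. The IoU term, the inverse-area scale weight, and the radius-based center-offset penalty are plausible stand-ins; the paper's exact SLS Loss formulation is not given in the abstract and is not reproduced here.

```python
import torch

def sls_loss_sketch(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2). Illustrative only."""
    # --- IoU term ---
    x1 = torch.maximum(pred[:, 0], target[:, 0])
    y1 = torch.maximum(pred[:, 1], target[:, 1])
    x2 = torch.minimum(pred[:, 2], target[:, 2])
    y2 = torch.minimum(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # --- Scale-sensitive weight: smaller targets contribute more ---
    scale_w = 1.0 / torch.sqrt(area_t + 1e-7)
    scale_w = scale_w / scale_w.mean()  # normalize to keep the loss scale stable

    # --- Location-sensitive term: radial center offset (the full polar form
    # also quantifies the offset angle, omitted here for brevity) ---
    cp = (pred[:, :2] + pred[:, 2:]) / 2      # predicted box centers
    ct = (target[:, :2] + target[:, 2:]) / 2  # true box centers
    d = cp - ct
    rho = torch.hypot(d[:, 0], d[:, 1])       # radial offset magnitude
    diag = torch.hypot(target[:, 2] - target[:, 0],
                       target[:, 3] - target[:, 1])
    loc = rho / (diag + 1e-7)                 # offset relative to box size

    return (scale_w * (1.0 - iou) + loc).mean()
```

      The inverse-area weight makes a missed 10 cm rock cost more than an equally misaligned large one, while the normalized radial term keeps penalizing center drift even when boxes still overlap, which plain IoU does not.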
      Extensive experiments were conducted on a self-collected open-pit mine rockfall dataset, containing diverse scenes with varying illumination, weather conditions, and target sizes (typically 10–50 cm in diameter). Results demonstrate that the proposed framework achieves precision of 0.856, recall of 0.715, and mAP@0.5 of 0.721, with an inference speed of 16 ms per frame. Compared with conventional single-modality detection and existing dual-modality fusion approaches, the proposed method significantly improves small-target detection performance and robustness while maintaining real-time computational efficiency.
      In conclusion, the progressive infrared–visible light fusion framework based on MobileMamba and MambaFusion modules effectively integrates complementary thermal and textural cues, preserves fine-grained details of micro-scale rockfall targets, and enhances small-object localization and discrimination in complex visual scenarios. By combining efficient feature extraction, adaptive cross-modal fusion, and a scale- and location-aware loss function, the proposed approach achieves high robustness under diverse environmental conditions, including low visibility, dynamic lighting, dust, and partial occlusion. The method offers a reliable, high-performance solution for real-time rockfall detection and monitoring in open-pit mining environments, significantly contributing to safer road operation, reduced operational risk, and improved transportation efficiency. These results demonstrate the potential of lightweight state-space models combined with progressive multi-modal fusion and targeted loss functions for deployment in practical industrial safety and autonomous monitoring systems.