• Abstract: To overcome the inherent limitations of single-modality target recognition and localization methods in complex environments, an infrared-visible multimodal target recognition and localization network based on a CNN-Transformer framework is proposed. The network uses grouped convolution units to build a CNN branch that mines visible-light target information from a local perspective, while a Transformer branch built on a feature-aggregation self-attention mechanism extracts infrared target features from a global perspective. A multimodal fusion structure is then designed to achieve feature complementarity through an adaptive weighting strategy. Finally, a multi-scale feature dynamic adjustment mechanism is introduced to alleviate feature conflicts from the spatial and channel perspectives and to enhance the scale invariance of the network. Experimental results show that the proposed network can fully mine and fuse target features of different modalities and exhibits higher accuracy and robustness than networks of the same type.

       

      Abstract:
      Objective In complex real-world scenes (such as low illumination, haze, occlusion, or strong-light interference), single-modality target recognition and localization methods face significant performance bottlenecks. Visible images are rich in texture and color detail but highly sensitive to illumination conditions; infrared images offer all-weather capability and strong penetration by virtue of thermal-radiation imaging, but suffer from low resolution and a lack of structural information. These inherent limitations make it difficult for a single-modality system to meet the requirements of all-weather perception with high robustness and high accuracy. How to deeply integrate the complementary advantages of the infrared and visible modalities, i.e., combining the local detail expression of visible light with the global thermal-semantic saliency of infrared, has therefore become a key challenge for improving target detection performance in complex environments. This paper aims to build a new neural network architecture that can deeply mine, adaptively fuse, and efficiently exploit multimodal features, so as to achieve higher accuracy, stronger robustness, and better scale invariance in target recognition and localization.
      Methods This paper proposes an infrared-visible multimodal target recognition and localization network based on a CNN-Transformer heterogeneous fusion framework. The network consists of four core modules: a visible-light-oriented CNN feature extraction branch, an infrared-oriented Transformer feature extraction branch, a cross-modal adaptive fusion module, and a multi-scale dynamic adjustment structure. In the visible-light branch, a "feature-criticality grouped convolution" mechanism is designed: feature importance is evaluated by weighting the channel-wise maximum and average values, and normalized criticality weights are generated after refinement by a fully connected layer. Based on these weights, the number of convolution output channels is dynamically allocated, so that key regions obtain a richer representation while background regions are compressed, guiding the network to focus on the essential information of the target. At the same time, a compression-extraction module composed of 1×1 pointwise convolution and GELU activation is introduced, and a hybrid downsampling strategy is adopted to reduce both the computational overhead and the loss of small-target information. In the infrared branch, a feature-aggregation shifted-window self-attention structure is constructed: the window-shifting mechanism of the Swin Transformer is borrowed to control computational complexity, and 2×2 average pooling is applied to the key (K) and value (V) matrices in self-attention for local aggregation, which significantly reduces redundant computation and strengthens global context modeling for low-contrast heat-source targets. A cross-modal fusion module is then designed: the same-scale local features from the CNN branch and global features from the Transformer branch are first normalized; channel attention (highlighting semantic importance) and spatial attention (accurately locating the target region) are then applied respectively, and shallow features are fused in to enhance detail retention; finally, the three feature streams are adaptively combined through learnable weight parameters, achieving complementary enhancement between the modalities. Finally, to alleviate feature conflicts in multi-scale fusion, a multi-scale dynamic adjustment mechanism is proposed: based on the feature pyramid structure, the fusion proportion is dynamically adjusted according to the difference in channel numbers between adjacent scales (the larger the scale difference, the smaller the fusion weight), and a channel-spatial attention mechanism is combined to strengthen target responses and suppress background interference, significantly improving the unified detection of small, medium, and large targets.
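      To make the feature-aggregation attention concrete, the following PyTorch sketch shows one plausible reading of the mechanism described above: before attention is computed, the key (K) and value (V) tokens are locally aggregated with 2×2 average pooling, shrinking the attention matrix from N×N to N×(N/4). Window partitioning and shifting (as in the Swin Transformer) are omitted for brevity, and the module name, tensor layout, and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (assumed, not the published code) of feature-aggregated attention:
# K and V are computed from a 2x2 average-pooled copy of the tokens, so the attention
# matrix shrinks from N x N to N x N/4 while queries keep full spatial resolution.
import torch
import torch.nn as nn


class AggregatedWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # 2x2 local aggregation of K/V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map of one (shifted) window or stage
        B, H, W, C = x.shape
        q = self.q(x).reshape(B, H * W, self.num_heads, self.head_dim).transpose(1, 2)

        # Aggregate tokens before projecting to K and V.
        x_pooled = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # (B, H/2, W/2, C)
        n_kv = x_pooled.shape[1] * x_pooled.shape[2]
        kv = self.kv(x_pooled).reshape(B, n_kv, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N/4, head_dim)

        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)  # (B, heads, N, N/4)
        out = (attn @ v).transpose(1, 2).reshape(B, H, W, C)
        return self.proj(out)


if __name__ == "__main__":
    feat = torch.randn(2, 16, 16, 64)  # toy infrared feature window
    print(AggregatedWindowAttention(64)(feat).shape)  # torch.Size([2, 16, 16, 64])
```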
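      Similarly, the cross-modal fusion step can be sketched as follows, under the assumption that "adaptive combination through learnable weight parameters" means a softmax-normalized weighted sum of the channel-attended CNN stream, the spatially-attended Transformer stream, and the shallow-feature stream; all module and variable names here are hypothetical.

```python
# Minimal sketch (an assumption, not the authors' implementation) of attention-guided
# adaptive cross-modal fusion for same-scale visible (CNN) and infrared (Transformer)
# features plus a shallow-feature stream, combined with learnable fusion weights.
import torch
import torch.nn as nn


class AdaptiveCrossModalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.norm_rgb = nn.BatchNorm2d(channels)
        self.norm_ir = nn.BatchNorm2d(channels)
        # Channel attention (squeeze-and-excitation style) highlights semantic importance.
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention localizes the target region.
        self.spatial_attn = nn.Sequential(nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        # Learnable fusion weights for the three streams (softmax-normalized at run time).
        self.weights = nn.Parameter(torch.ones(3))

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor, f_shallow: torch.Tensor) -> torch.Tensor:
        f_rgb = self.norm_rgb(f_rgb)
        f_ir = self.norm_ir(f_ir)
        # Channel attention on the CNN (visible) stream, spatial attention on the
        # Transformer (infrared) stream, as described in the Methods.
        f_rgb = f_rgb * self.channel_attn(f_rgb)
        sp = torch.cat([f_ir.mean(dim=1, keepdim=True), f_ir.amax(dim=1, keepdim=True)], dim=1)
        f_ir = f_ir * self.spatial_attn(sp)
        w = torch.softmax(self.weights, dim=0)
        return w[0] * f_rgb + w[1] * f_ir + w[2] * f_shallow


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    fuse = AdaptiveCrossModalFusion(64)
    print(fuse(x, torch.randn_like(x), torch.randn_like(x)).shape)  # torch.Size([2, 64, 32, 32])
```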
      Results and Discussions The method is systematically validated on three public datasets—KAIST, FLIR, and RGB-T—and a self-collected power-grid equipment dataset (covering five types of electrical equipment). The results show that: 1) the proposed feature-criticality grouped convolution achieves 83.1% mAP on the visible modality, with parameters and FLOPs reduced by 26% and 21% respectively compared with standard convolution and inference speed increased by 29%, outperforming mainstream lightweight networks and demonstrating an excellent efficiency-accuracy balance. 2) After introducing feature aggregation in the infrared branch, FLOPs decrease from 173.2 G to 116.5 G, FPS rises from 69 to 88, and mAP drops by only 0.9%, verifying its efficiency. 3) The modality-network adaptability experiment shows that the CNN performs better on visible light (83.1% mAP) than the Transformer (81.2%), while the Transformer outperforms the CNN on infrared (78.9% vs. 78.2%), strongly supporting the rationality of the heterogeneous dual-branch design. 4) In the comparison of multimodal fusion strategies, the proposed attention-guided adaptive fusion achieves 88.4% mAP on KAIST, significantly surpassing element-wise addition (86.8%) and concatenation (87.5%); feature visualization confirms its effectiveness in suppressing background noise and, in particular, in preserving small-object edges. 5) The multi-scale fusion module, built on FPN and ASFF, introduces dynamic ratio adjustment and dual attention, achieving an overall mAP of 89.3% and raising small-object mAP to 79.4%, validating its effectiveness in alleviating scale conflicts. 6) Comprehensive comparisons with five advanced methods (including dual-CNN, dual-Transformer, and CNN-Transformer cascade designs) across the four datasets show that the proposed method achieves the best or second-best results on KAIST (89.3%), FLIR (88.9%), RGB-T (87.1%), and the power-grid dataset (89.1%), with controlled computational cost (FLOPs ≈ 169 G, FPS ≈ 69), balancing accuracy, robustness, and practicality. In summary, the method systematically addresses insufficient multimodal feature complementarity and scale sensitivity through the deep synergy of modality characteristics, network architecture, and fusion strategy.
      Conclusions Addressing the problems of insufficient feature mining, weak modal complementarity, and poor multi-scale adaptability in infrared-visible multimodal target recognition and localization, this paper proposes a heterogeneous network architecture that deeply integrates the advantages of CNN and Transformer. Through the visible-light-oriented feature-criticality grouped CNN, the infrared-oriented feature-aggregation lightweight Transformer, attention-driven adaptive cross-modal fusion, and the dynamic multi-scale adjustment mechanism, the extraction efficiency, fusion quality, and scale invariance of multimodal features are systematically improved. Extensive experiments verify the superior performance of the method on multiple real-scene datasets: it not only leads clearly in detection accuracy but also maintains good computational efficiency and environmental robustness. This study provides an effective technical path for highly reliable target perception in complex environments. Future work will focus on model compression (e.g., knowledge distillation and neural architecture search) to adapt the model to edge devices, and will explore strategies to enhance generalization under extreme weather (e.g., heavy rain and smoke), so as to further promote the application of this technology in key fields such as power-grid inspection, autonomous driving, and emergency rescue.