Feature pyramid random fusion network for visible-infrared modality person re-identification

Citation: Wang R G, Wang J, Yang J, et al. Feature pyramid random fusion network for visible-infrared modality person re-identification[J]. Opto-Electron Eng, 2020, 47(12): 190669. doi: 10.12086/oee.2020.190669


Article Information
    Author biography:
    Corresponding author: Yang Juan (b. 1983), female, Ph.D., lecturer, and master's supervisor. Her research interests include video information processing, video big-data processing, and the theory and applications of deep learning and binary neural networks. E-mail: yangjuan6985@163.com
  • CLC number: TP391.4

Feature pyramid random fusion network for visible-infrared modality person re-identification

More Information
    Corresponding author: Yang Juan, E-mail: yangjuan6985@163.com
  • Abstract: Current research on person re-identification focuses only on extracting view-invariant feature representations across visible-light cameras, ignores the imaging characteristics of the infrared modality, and rarely combines results from the two modalities. Moreover, when discriminating between two images, existing methods usually compute the similarity of feature maps from a single convolutional layer, which leads to weak feature learning. To address these problems, this paper proposes a feature pyramid random fusion network, which computes similarities at multiple feature levels simultaneously so that images are matched using discriminative factors from multiple semantic levels. The model attends to the characteristics of infrared images, reduces the harmful intra-modality variation within the visible and infrared domains, balances the inter-modality heterogeneity gap, and combines the advantages of local and global feature learning, effectively addressing cross-modality person re-identification. Experiments on the SYSU-MM01 dataset evaluate mean average precision (mAP) and convergence speed. The results show that the proposed model outperforms existing state-of-the-art algorithms: the feature pyramid random fusion network converges quickly and reaches 32.12% mAP.

  • Overview: Existing work on person re-identification considers only the extraction of view-invariant feature representations across visible-light cameras; it ignores the imaging characteristics of the infrared domain, and studies that relate the visible and infrared modalities remain scarce. Besides, most methods distinguish two views by computing the similarity of feature maps from a single convolutional layer, which weakens feature learning. To handle these problems, we design a feature pyramid random fusion network (FPRnet). Firstly, we introduce SRCNN, a super-resolution reconstruction method, as a preprocessing step to alleviate the interference caused by blurred IR images and make feature learning more robust. Secondly, we take ResNet-50 pre-trained on the ImageNet dataset as the base network to learn feature representations of images in the RGB domain and the IR domain. Re-identification based on the residual network alone can only learn features at a single resolution scale, whereas tracking a specific person requires learning from multiple perspectives, including the pedestrian's overall properties, local attributes, and salient characteristics, in order to reduce misjudgments. For this reason, following the idea of the feature pyramid network, the features of different convolutional layers in ResNet-50 are organized into a pyramid structure. This allows the similarity between multiple features to be computed at the same time, and it abandons the original pyramid network's approach of using different image scales to adapt to pedestrian bounding-box images; instead, the spirit of the pyramid structure is embedded into the deep residual network as a feature-extraction module that produces the IR-RGB block. This learning scheme integrates the advantages of local and global feature learning and represents features with both strong semantics and strong geometric detail. A random fusion mechanism is then used as the basis of the feature-fusion module to complete the end-to-end design of the two branches and obtain the fusion block, which avoids the problem of excessive parameters in the pyramid model. Thirdly, after feature extraction and feature fusion are completed, cross-modality prediction is carried out. It consists of a blue cross-domain, a pink RGB-domain, and a purple IR-domain (see Fig. 1), generates three types of classification loss, and then uses a hybrid loss function to reduce both the intra-modality appearance variation and the inter-modality heterogeneity gap. The IR-RGB block and the fusion block compete against each other in a minimax game to learn the joint-modal classification loss. Finally, the original dataset is used to test FPRnet. Extensive experiments on the public SYSU-MM01 dataset, in terms of mAP and convergence speed, demonstrate the superiority of our approach over state-of-the-art methods. Furthermore, FPRnet achieves competitive results with a 32.12% mAP recognition rate and much faster convergence. The source code of FPRnet is available at https://github.com/KyreneLaura/FPRnet.
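To make the pipeline described above concrete, the following PyTorch-style sketch outlines the two-branch layout: ResNet-50 stages per modality, an IR-RGB block obtained by pooling and concatenating the top-level features, a fusion block built from randomly selected pyramid levels, and separate classifiers for the joint, fusion, RGB, and IR branches. It is only a minimal illustration under stated assumptions (it assumes a recent torchvision): the class and variable names are invented, a single backbone is reused for both modalities purely for brevity, and the top-down pyramid pathway of Table 2 is omitted here.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class FPRnetSketch(nn.Module):
    """Minimal two-branch skeleton (illustrative only, not the authors' code)."""

    def __init__(self, num_ids, feat_dim=256):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-training, as in the paper
        # Stages C(1)-C(5) of Table 1
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        # Lateral 1x1 convolutions reduce each stage to feat_dim channels (Table 2, Step 1)
        self.lateral = nn.ModuleList([nn.Conv2d(c, feat_dim, 1) for c in (256, 512, 1024, 2048)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Classifiers for the joint (cross-domain), fusion, RGB-only and IR-only branches
        self.cls_joint = nn.Linear(2 * 2048, num_ids)
        self.cls_fusion = nn.Linear(2 * feat_dim, num_ids)
        self.cls_rgb = nn.Linear(2048, num_ids)
        self.cls_ir = nn.Linear(2048, num_ids)

    def backbone(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [C(2), C(3), C(4), C(5)]

    def forward(self, rgb, ir):
        feats_r, feats_i = self.backbone(rgb), self.backbone(ir)
        # RGB block / IR block: global average pooling of C(5) (last row of Table 1)
        rgb_block = self.pool(feats_r[-1]).flatten(1)
        ir_block = self.pool(feats_i[-1]).flatten(1)
        ir_rgb_block = torch.cat([rgb_block, ir_block], dim=1)
        # Fusion block: randomly pick one pyramid level per modality and concatenate
        lr = torch.randint(0, len(self.stages), (1,)).item()
        li = torch.randint(0, len(self.stages), (1,)).item()
        p_r = self.pool(self.lateral[lr](feats_r[lr])).flatten(1)
        p_i = self.pool(self.lateral[li](feats_i[li])).flatten(1)
        fusion_block = torch.cat([p_r, p_i], dim=1)
        return (self.cls_joint(ir_rgb_block), self.cls_fusion(fusion_block),
                self.cls_rgb(rgb_block), self.cls_ir(ir_block))
```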


    Figure 1.  An illustration of the framework of the feature pyramid random fusion network. (a) The model generates a top-level joint feature (termed the IR-RGB block) and a random fusion feature (termed the fusion block). Specifically, the IR-RGB block is formed by concatenating the RGB block with the IR block, while the fusion block is generated by randomly blending features from different levels and distinct modalities; (b) The prediction consists of a blue cross-domain, a pink RGB-domain, and a purple IR-domain, which generate three types of classification loss. The IR-RGB block and the fusion block compete in a minimax game to learn the joint-modal classification loss


    Figure 2.  Feature selection. r and i represent RGB and IR domain respectively. Features P(5), P(4), P(3), and P(2) are shown by green, orange, blue and pink respectively


    Figure 3.  The method of feature fusion. (a) Horizontal concatenation; (b) Vertical concatenation; (c) Hybrid concatenation. Let r be the RGB-modality, and i be the IR-modality
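As a purely illustrative aside, one simple way to picture the concatenation modes in Fig. 3 is sketched below; the exact tensor layout the paper uses for horizontal, vertical, and hybrid concatenation is not specified here, so the reading below is an assumption.

```python
import torch

# One possible reading of Fig. 3, given only as an illustration:
# p_r and p_i are pooled pyramid features for the two modalities.
p_r = torch.randn(8, 256)  # e.g. Pr(5) for a batch of 8 RGB images
p_i = torch.randn(8, 256)  # e.g. Pi(2) for the paired IR images

horizontal = torch.cat([p_r, p_i], dim=1)   # joined along the feature axis: (8, 512)
vertical = torch.stack([p_r, p_i], dim=1)   # stacked as separate rows: (8, 2, 256)
print(horizontal.shape, vertical.shape)
```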


    Figure 4.  Comparison with state of the art on SYSU-MM01


    Figure 5.  The trend of mAP during training


    Figure 6.  The trend of loss during training


    Table 1.  Basic network structure

    Layer Input size Output size Structure
    C(1) 288×144 144×72 7×7, 64, stride=2
    C(2) 144×72 72×36 3×3 max pool, stride=2; $\left[ {\begin{array}{*{20}{c}} {1 \times 1,64}\\ {3 \times 3,64}\\ {1 \times 1,256} \end{array}} \right] \times 3$
    C(3) 72×36 36×18 $\left[ {\begin{array}{*{20}{c}} {1 \times 1,128}\\ {3 \times 3,128}\\ {1 \times 1,512} \end{array}} \right] \times 4$
    C(4) 36×18 18×9 $\left[ {\begin{array}{*{20}{c}} {1 \times 1,256}\\ {3 \times 3,256}\\ {1 \times 1,1024} \end{array}} \right] \times 6$
    C(5) 18×9 9×5 $\left[ {\begin{array}{*{20}{c}} {1 \times 1,512}\\ {3 \times 3,512}\\ {1 \times 1,2048} \end{array}} \right] \times 3$
    RGB block and IR block 9×5 1×1 Average pool
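The spatial sizes listed in Table 1 can be reproduced with a quick shape check against the standard torchvision ResNet-50. This is a hypothetical verification snippet (assuming a recent torchvision), not part of the paper's code:

```python
import torch
import torchvision.models as models

# Feed a 288x144 input through the standard ResNet-50 stages and print the
# spatial size after each one; the values match the rows of Table 1.
resnet = models.resnet50(weights=None)  # pre-trained weights are irrelevant for a shape check
x = torch.randn(1, 3, 288, 144)

x = resnet.conv1(x)                               # C(1): 7x7, 64, stride 2 -> 144x72
print("C(1):", tuple(x.shape[2:]))
x = resnet.maxpool(resnet.relu(resnet.bn1(x)))    # 3x3 max pool, stride 2
x = resnet.layer1(x)                              # C(2): 72x36
print("C(2):", tuple(x.shape[2:]))
x = resnet.layer2(x)                              # C(3): 36x18
print("C(3):", tuple(x.shape[2:]))
x = resnet.layer3(x)                              # C(4): 18x9
print("C(4):", tuple(x.shape[2:]))
x = resnet.layer4(x)                              # C(5): 9x5
print("C(5):", tuple(x.shape[2:]))
```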


    Table 2.  Building feature pyramid

    Layer C(5) C(4) C(3) C(2)
    Step1 1×1, 256, stride=1
    Step2 P'(5)↑ P'(4)↑ P'(3)↑
    Step3 + + +
    Step4 3×3, 256, stride=1
    Output size 9×5 18×9 36×18 72×36
    Hidden layer P'(5) P'(4) P'(3) P'(2)
    Step5 Average pool
    Output size 1×1 1×1 1×1 1×1
    Result layer P(5) P(4) P(3) P(2)
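A minimal sketch of the top-down pathway summarized in Table 2 is given below: lateral 1×1 convolutions (Step 1), upsampling plus element-wise addition (Steps 2-3), 3×3 smoothing (Step 4), and global average pooling down to the 1×1 vectors P(2)-P(5) (Step 5). The module name, the nearest-neighbour upsampling mode, and applying the smoothing convolution to every level are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidHead(nn.Module):
    """Illustrative FPN-style head following the steps of Table 2."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, c2, c3, c4, c5):
        # Step 1: lateral 1x1 convolutions
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4)
        p3 = self.lateral[1](c3)
        p2 = self.lateral[0](c2)
        # Steps 2-3: upsample the coarser map and add it to the finer one
        p4 = p4 + F.interpolate(p5, size=p4.shape[2:], mode="nearest")
        p3 = p3 + F.interpolate(p4, size=p3.shape[2:], mode="nearest")
        p2 = p2 + F.interpolate(p3, size=p2.shape[2:], mode="nearest")
        # Step 4: 3x3 smoothing; Step 5: average pool to 1x1 feature vectors
        levels = [self.smooth[i](p) for i, p in enumerate((p2, p3, p4, p5))]
        return [self.pool(p).flatten(1) for p in levels]  # [P(2), P(3), P(4), P(5)]
```

Size-matched interpolation is used instead of a fixed 2× factor so that the odd spatial sizes of Table 2 still align (e.g. 9×5 upsampled to 18×9 rather than 18×10).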


    Table 3.  Selecting value of weight parameter

    No. 1 2 3 4 5 6 7 8 9 10
    λ1 1 1.5 1.2 0.8 1 1 1.5 0.5 0.8 0.4
    λ2 1 1.5 1.2 0.8 0.5 0.3 1.2 0.2 0.15 0.35
    λ3 1 1.5 1.2 0.8 0.5 0.3 1.2 0.3 0.05 0.25
    mAP 31.43 28.78 29.97 30.05 26.53 27.46 29.29 30.11 30.14 28.17
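Table 3 sweeps the weights λ1, λ2, and λ3 that balance the three branch losses; the best mAP in the table (31.43%) is obtained with all three weights equal to 1. A minimal sketch of such a weighted hybrid loss is shown below, assuming plain cross-entropy for each branch; the function name and the exact per-branch loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def hybrid_loss(joint_logits, rgb_logits, ir_logits, labels, lambdas=(1.0, 1.0, 1.0)):
    """Weighted combination of the three branch classification losses:
    joint cross-domain, RGB-domain, and IR-domain (weights as in Table 3)."""
    l1, l2, l3 = lambdas
    loss_joint = F.cross_entropy(joint_logits, labels)
    loss_rgb = F.cross_entropy(rgb_logits, labels)
    loss_ir = F.cross_entropy(ir_logits, labels)
    return l1 * loss_joint + l2 * loss_rgb + l3 * loss_ir
```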


    Table 4.  Analysis of feature extraction and fusion methods

    Random feature pyramid Horizontal concatenation Vertical concatenation Hybrid concatenation
    rank-1 rank-5 mAP rank-1 rank-5 mAP T rank-1 rank-5 mAP
    Different level same modality Pr(5)Pr(2) Pr(5)Pr(2) 27.79 53.98 30.05 26.37 52.27 29.05 1 27.61 53.23 29.90
    3 27.02 52.26 29.06
    Pi(4)Pi(3) Pi(4)Pi(3) 27.89 53.58 30.10 27.13 51.87 28.96 1 28.18 53.21 30.23
    2 27.40 51.79 29.42
    Pr(5)Pr(3) 28.42 53.84 29.87 26.31 53.71 29.31 ----------
    Same level cross-modality Pi(3)Pr(3) 27.96 53.79 30.15 26.10 52.22 28.64 ----------
    Pr(5)Pr(5) Pi(5)Pi(5) 26.82 51.71 29.06 25.29 50.69 27.32 ----------
    Different level cross-modality Pr(5)Pr(2) Pi(5)Pi(2) 28.03 54.31 30.28 26.80 53.24 29.62 1 27.18 53.35 29.72
    3 26.83 52.59 29.30
    Pr(4)Pr(3) Pi(4)Pi(3) 28.05 54.48 30.39 26.81 52.83 29.93 1 27.97 53.40 29.84
    2 24.89 50.92 26.79
    Pr(5)Pi(5) Pi(4)Pr(4) Pr(5)Pi(5) Pi(2)Pr(2) 29.28 55.35 31.43 26.84 53.12 29.65 ----------


    Table 5.  Comparison of model structures

    Removed component rank-1 rank-5 rank-20 mAP
    RGB-domain and IR-domain 24.30 49.61 75.62 27.51
    IR-RGB block 25.40 50.41 76.81 28.79
    Fusion block 26.08 51.69 76.73 28.78
    Upsample 2× 26.91 52.66 77.96 29.55


    Table 6.  Comparison of prediction structures

    Component converted to rank-1 rank-5 rank-20 mAP
    Single SVD 26.33 51.76 78.95 29.35
    Double SVD 26.93 53.27 78.86 29.40
    LeakyReLU 26.96 52.74 77.67 29.42
    ReLU 26.34 51.96 78.70 28.94


    Table 7.  Super-resolution structure

    Layer Input size Input channel Output size Output channel Structure
    Preprocessing 128×128 3 256×256 1 Bicubic interpolation (2)
    Conv1 256×256 1 256×256 64 9×9, stride=1
    Conv2 256×256 64 256×256 32 1×1
    Conv3 256×256 32 512×512 1 5×5, stride=1, PixelShuffle (2)
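A minimal sketch of a module with the layer shapes of Table 7 is given below. The padding values, the crude 3-to-1 channel reduction in the preprocessing step, and the 4-channel output of the last convolution (needed so that PixelShuffle(2) produces a single channel) are assumptions filled in for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SRPreprocess(nn.Module):
    """Illustrative super-resolution preprocessing with the shapes of Table 7."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=9, padding=4)   # Conv1: 9x9, 64
        self.conv2 = nn.Conv2d(64, 32, kernel_size=1)             # Conv2: 1x1, 32
        self.conv3 = nn.Conv2d(32, 4, kernel_size=5, padding=2)   # Conv3: 5x5; 4 = 1 channel x 2 x 2
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        # x: (B, 3, 128, 128), as in the "Preprocessing" row of Table 7
        x = x.mean(dim=1, keepdim=True)  # crude 3 -> 1 channel reduction (assumption)
        x = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)  # 256x256
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.shuffle(self.conv3(x))  # (B, 1, 512, 512)
```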


    Table 8.  Comparison with state of the art on SYSU-MM01

    Methods rank-1 rank-10 rank-20 mAP
    One-stream 12.04 49.68 66.74 13.67
    Two-stream 11.65 47.99 65.50 12.85
    Zero-padding 14.80 54.12 71.33 15.95
    CmGAN 26.97 67.51 80.56 27.80
    BCTR 16.12 54.90 71.47 19.15
    BDTR 17.01 55.43 71.96 19.66
    ResNet-50 19.36 59.89 73.47 23.85
    SVDnet 21.75 58.57 73.02 25.61
    FPRnet 29.28 68.43 81.01 31.43
    FPRnet+SRCNN 30.02 69.08 81.19 32.12
    FPRnet+reranking 30.99 67.91 79.76 31.17
  • [1]


    Xu M, Yu X S, Chen D Y, et al. Pedestrian detection in complex thermal infrared surveillance scene[J]. Journal of Image and Graphics, 2018, 23(12): 1829–1837. doi: 10.11834/jig.180299

    [2]

    Zheng L, Shen L Y, Tian L, et al. Scalable person re-identification: a benchmark[C]//Proceedings of 2015 IEEE International Conference on Computer Vision, Santiago, 2015: 1116–1124.

    [3]

    Dai Z Z, Chen M Q, Zhu S Y, et al. Batch feature erasing for person re-identification and beyond[Z]. arXiv: 1811.07130 [cs.CV], 2018.

    [4]

    Wu A C, Zheng W S, Yu H X, et al. RGB-infrared cross-modality person re-identification[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, 2017.

    [5]

    Dai P Y, Ji R R, Wang H B, et al. Cross-modality person re-identification with generative adversarial training[C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, 2018: 677–683.

    [6]

    Ye M, Wang Z, Lan X Y, et al. Visible thermal person re-identification via dual-constrained top-ranking[C]// Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Palo Alto, 2018: 1092–1099.

    [7]

    Gray D, Tao H. Viewpoint invariant pedestrian recognition with an ensemble of localized features[C]//Proceedings of the 10th European Conference on Computer Vision, Marseille, France, 2008: 262–275.

    [8]

    Wang X G, Doretto G, Sebastian T, et al. Shape and appearance context modeling[C]//Proceedings of the 11th International Conference on Computer Vision, Rio de Janeiro, 2007: 1–8.

    [9]

    Li W, Zhao R, Xiao T, et al. DeepReID: deep filter pairing neural network for person re-identification[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014: 152–159.

    [10]

    Huang Y, Xu J S, Wu Q, et al. Multi-pseudo regularized label for generated data in person re-identification[J]. IEEE Transactions on Image Processing, 2018, 28(3): 1391–1403.

    [11]

    Liu J W, Zha Z J, Tian Q, et al. Multi-scale triplet CNN for person re-identification[C]//Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 2016: 192–196.

    [12]

    Qian X L, Fu Y W, Jiang Y G, et al. Multi-scale deep learning architectures for person re-identification[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, 2017: 5399–5408.

    [13]

    Chen Y B, Zhu X T, Gong S G. Person re-identification by deep learning multi-scale representations[C]//Proceedings of 2017 IEEE International Conference on Computer Vision Workshops, Venice, 2017: 2590–2600.

    [14]

    Li X, Zheng W S, Wang X J, et al. Multi-scale learning for low-resolution person re-identification[C]//Proceedings of 2015 IEEE International Conference on Computer Vision, Santiago, 2015: 3765–3773.

    [15]

    Wang Z, Hu R M, Yu Y, et al. Scale-adaptive low-resolution person re-identification via learning a discriminating surface[C]//Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, 2016: 2669–2675.

    [16]

    Jing X Y, Zhu X K, Wu F, et al. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning[J]. IEEE Transactions on Image Processing, 2017, 26(3): 1363–1378.

    [17]

    Zhang D Q, Li W J. Large-scale supervised multimodal hashing with semantic correlation maximization[C]//Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Quebec City, 2014: 2177–2183.

    [18]

    Chen Y C, Zhu X T, Zheng W S, et al. Person re-identification by camera correlation aware feature augmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(2): 392–408. doi: 10.1109/TPAMI.2017.2666805

    [19]

    Zhu X F, Huang Z, Shen H T, et al. Linear cross-modal hashing for efficient multimedia search[C]//Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, 2013: 143–152.

    [20]

    Zhai D M, Chang H, Zhen Y, et al. Parametric local multimodal hashing for cross-view similarity search[C]//Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, 2013: 2754–2760.

    [21]

    Srivastava N, Salakhutdinov R. Multimodal learning with deep Boltzmann machines[J]. Journal of Machine Learning Research, 2014, 15(84): 2949–2980.

    [22]

    Nguyen D T, Hong H G, Kim K W, et al. Person recognition system based on a combination of body images from visible light and thermal cameras[J]. Sensors, 2017, 17(3): 605.

    [23]

    Sarfraz M S, Stiefelhagen R. Deep perceptual mapping for cross-modal face recognition[J]. International Journal of Computer Vision, 2017, 122(3): 426–438.

    [24]

    Xiao T, Li H S, Ouyang W L, et al. Learning deep feature representations with domain guided dropout for person re-identification[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016: 1249–1258.

    [25]

    Wang F Q, Zuo W M, Lin L, et al. Joint learning of single-image and cross-image representations for person re-identification[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 2016: 1288–1296.

    [26]

    Jiang X Y, Wu F, Li X, et al. Deep compositional cross-modal learning to rank via local-global alignment[C]//Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, 2015: 69–78.

    [27]

    Møgelmose A, Bahnsen C, Moeslund T B, et al. Tri-modal person re-identification with RGB, depth and thermal features[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, 2013: 301–307.

    [28]

    Sun Y F, Zheng L, Deng W J, et al. SVDNet for pedestrian retrieval[C]//Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, 2017: 3800–3808.

    [29]

    Maas A L, Hannun A Y, Ng A Y. Rectifier nonlinearities improve neural network acoustic models[C]//Proceedings of 30th International Conference on Machine Learning, Atlanta, Georgia, 2013: 18–23.

    [30]

    Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks[C]//Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, 2011: 315–323.

    [31]

    Bottou L. Stochastic gradient descent tricks[M]//Montavon G, Orr G B, Müller K R. Neural Networks: Tricks of the Trade. Berlin, Heidelberg: Springer, 2012: 421–436.

    [32]

    Dong C, Loy C C, He K M, et al. Image super-resolution using deep convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(2): 295–307.



Publication history
Received: 2019-11-02
Revised: 2020-04-10
Published: 2020-12-15
