Li Y Q, Li S Z, Sun G L, et al. Lightweight Swin Transformer combined with multi-scale feature fusion for face expression recognition[J]. Opto-Electron Eng, 2025, 52(1): 240234. doi: 10.12086/oee.2025.240234

Lightweight Swin Transformer combined with multi-scale feature fusion for face expression recognition

    Fund Project: Key Projects of Natural Science for Universities in Anhui Province (2022AH050249), Anhui Province’s Department of Education Natural Science Research Projects in Universities (2023AH050164), Outstanding Youth Research Program for Universities in Anhui Province (2023AH020022), and Anhui Province’s Housing and Urban-Rural Development Science and Technology Plan Project (2023-YF058, 2023-YF113)
A lightweight Swin Transformer combined with a multi-scale feature fusion (EMA) module is proposed for facial expression recognition, addressing the Swin Transformer's excessive parameter count, poor real-time performance, and limited ability to capture the subtle, complex feature changes present in expressions. First, the proposed SPST module replaces the Swin Transformer block in the fourth stage of the original model, reducing the parameter count and making the model lightweight. Then, the multi-scale feature fusion (EMA) module is embedded after the second stage of the lightweight model; through multi-scale feature extraction and cross-spatial information aggregation, it strengthens the model's ability to capture fine expression details, improving both the accuracy and the robustness of facial expression recognition. Experimental results show that the proposed method achieves recognition accuracies of 97.56%, 86.46%, 87.29%, and 70.11% on four public datasets, namely JAFFE, FERPLUS, RAF-DB, and FANE, respectively. Compared with the original Swin Transformer, the improved model has 15.8% fewer parameters and 9.6% higher FPS, significantly enhancing real-time performance while keeping the parameter count low.
Currently, most facial expression recognition algorithms rely on convolutional neural networks (CNNs). However, CNNs depend heavily on spatial locality, which limits their ability to capture global expression features in the early layers, and stacking convolutional layers to enlarge the receptive field often causes information loss, increases the computational load, and aggravates gradient vanishing. To address these issues, researchers have increasingly explored Transformer models for image tasks. Transformers, whose self-attention mechanism is powerful at capturing local features, show promise for expression recognition but face practical limitations: traditional Transformers operate within fixed-size windows, restricting their ability to model long-range dependencies. Since facial expressions often involve coordinated changes across several regions, relying solely on local windows can hinder global feature perception and degrade recognition performance. Moreover, stacking layers to capture global information drives up the parameter count and the computational cost.
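To make the cost argument concrete, the sketch below shows plain scaled dot-product self-attention, the mechanism referenced above, in PyTorch; the tensor shapes and dimensions are illustrative, not taken from the paper. Because attention compares every token with every other token, its cost grows quadratically with the token count, which is what motivates windowed variants.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (N, d) token features; w_*: (d, d) learned projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # (N, N) pairwise similarity -- the O(N^2) term that windowing bounds
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(49, 96)                  # e.g., one 7x7 window of 96-dim tokens
w_q, w_k, w_v = (torch.randn(96, 96) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # (49, 96)
```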

    In 2021, Microsoft Research Asia introduced the Swin Transformer, which uses window-based and shifted-window multi-head self-attention (W-MSA and SW-MSA) to integrate information across windows. This design addresses the limitations of traditional Transformers by balancing global feature learning with computational efficiency, making it a promising backbone for facial expression recognition.
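A minimal sketch of the two operations named above, assuming the standard Swin layout of a (B, H, W, C) feature map; this illustrates the general window-partition and cyclic-shift idea, not the authors' code:

```python
import torch

def window_partition(x, ws):
    # W-MSA: split a (B, H, W, C) map into non-overlapping ws x ws windows;
    # self-attention is then computed independently inside each window.
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def cyclic_shift(x, ws):
    # SW-MSA: roll the map by half a window before partitioning so that
    # tokens near window borders attend across the previous partition.
    return torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))

feat = torch.randn(1, 56, 56, 96)                  # stage-1-sized map, ws = 7
wins = window_partition(cyclic_shift(feat, 7), 7)  # (64, 49, 96)
```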

    In summary, this paper combines a lightweight Swin Transformer with a multi-scale feature fusion (EMA) module for facial expression recognition. The proposed SPST module replaces the Swin Transformer block in the fourth stage of the original model to reduce the parameter count, and the EMA module embedded after the second stage strengthens the capture of fine expression details through multi-scale feature extraction and cross-spatial information aggregation. The method achieves 97.56%, 86.46%, 87.29%, and 70.11% recognition accuracy on the JAFFE, FERPLUS, RAF-DB, and FANE datasets, respectively, with 15.8% fewer parameters and 9.6% higher FPS than the original Swin Transformer.
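As a rough structural sketch of where the two modifications sit, the placeholder module below wires an EMA-style block after stage 2 and a lightweight SPST-style block in place of stage 4. The EMA and SPST internals are not specified here (nn.Identity stands in for every submodule), so this only fixes the data flow described above, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ModifiedSwin(nn.Module):
    # Stage layout only: stages 1-3 are ordinary Swin stages, `ema` is the
    # multi-scale feature fusion module after stage 2, and `spst` replaces
    # the stage-4 Swin Transformer blocks to cut parameters.
    def __init__(self, stage1, stage2, stage3, ema, spst):
        super().__init__()
        self.stage1, self.stage2, self.stage3 = stage1, stage2, stage3
        self.ema = ema
        self.spst = spst

    def forward(self, x):
        x = self.stage2(self.stage1(x))
        x = self.ema(x)     # multi-scale, cross-spatial detail aggregation
        return self.spst(self.stage3(x))

# Smoke test with identity placeholders for the real stages/modules:
net = ModifiedSwin(*(nn.Identity() for _ in range(5)))
_ = net(torch.randn(1, 3, 224, 224))
```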
