Citation: Li Y Q, Li S Z, Sun G L, et al. Lightweight Swin Transformer combined with multi-scale feature fusion for face expression recognition[J]. Opto-Electron Eng, 2025, 52(1): 240234. doi: 10.12086/oee.2025.240234
Currently, most facial expression recognition algorithms rely on convolutional neural networks (CNNs). However, CNNs depend heavily on spatial locality, which limits their ability to capture the global features of facial expressions in early layers, and stacking convolutional layers to enlarge the receptive field often causes information loss, higher computational cost, and vanishing gradients. To address these issues, researchers are increasingly applying Transformer models to image tasks. Transformers, whose self-attention mechanism powerfully captures local features within each attention window, show promise in expression recognition but face practical limitations: traditional Transformers operate within fixed-size windows, restricting their ability to model long-range dependencies. Because facial expressions typically involve coordinated changes across multiple facial regions, relying solely on local windows hinders global feature perception and degrades recognition performance, while stacking additional layers to capture global information inflates parameter counts and computational demands.
In 2021, Microsoft Research Asia introduced the Swin Transformer, which uses window-based and shifted-window multi-head self-attention (W-MSA and SW-MSA) to exchange information across windows. This design addresses the limitations of traditional Transformers by balancing global feature learning against computational efficiency, making the Swin Transformer a promising backbone for facial expression recognition.
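As a minimal sketch (not the authors' code) of the mechanism just described, the following PyTorch snippet shows how W-MSA partitions a feature map into non-overlapping windows and how SW-MSA cyclically shifts the map first so that attention straddles the previous layer's window boundaries; the 7×7 window, shift of 3, and Swin-T tensor sizes are assumptions taken from the standard Swin-T configuration.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of
    shape (num_windows * B, window_size * window_size, C) for W-MSA."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shifted_window_partition(x, window_size, shift_size):
    """Cyclically shift the map before partitioning so that SW-MSA windows
    straddle the boundaries of the previous layer's W-MSA windows."""
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(shifted, window_size)

# Swin-T uses 7x7 windows with a shift of 3 on alternating blocks.
x = torch.randn(1, 56, 56, 96)                  # stage-1 resolution at 224x224 input
print(window_partition(x, 7).shape)             # torch.Size([64, 49, 96])
print(shifted_window_partition(x, 7, 3).shape)  # torch.Size([64, 49, 96])
```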
To summarize, this paper proposes a facial expression recognition method that combines a lightweight Swin Transformer with a multi-scale feature fusion (EMA) module, addressing the Swin Transformer's excessive parameter count, poor real-time performance, and limited ability to capture the subtle, complex feature changes present in expressions. First, the proposed SPST module replaces the Swin Transformer block in the fourth stage of the original model, reducing the parameter count and making the model lightweight. Then, the EMA module is embedded after the second stage of the lightweight model; through multi-scale feature extraction and cross-spatial information aggregation, it strengthens the model's ability to capture expression details, improving both the accuracy and the robustness of facial expression recognition. Experimental results show that the proposed method achieves recognition accuracies of 97.56%, 86.46%, 87.29%, and 70.11% on four public datasets, namely JAFFE, FERPLUS, RAF-DB, and FANE, respectively. Compared with the original Swin Transformer, the improved model has 15.8% fewer parameters and 9.6% higher FPS, significantly enhancing real-time performance while keeping the model compact.
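Since the internals of the SPST and EMA modules are detailed later in the paper and are not reproduced here, the following PyTorch sketch illustrates only the wiring described above: a placeholder token mixer standing in for SPST in place of the stage-4 blocks, and a simplified EMA-style re-weighting inserted after stage 2. All module bodies, class names, and channel sizes (192 at stage 2, 768 at stage 4, per a Swin-T backbone) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SPSTBlock(nn.Module):
    """Hypothetical stand-in for the proposed SPST module (internals not
    reproduced here): a lightweight residual token mixer replacing the
    stage-4 Swin Transformer blocks to cut parameters."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, N, C) token sequence
        return x + self.mix(self.norm(x))

class SimpleEMA(nn.Module):
    """Simplified placeholder for the EMA module: parallel 1x1 and 3x3
    branches re-weight the feature map, standing in for the real module's
    multi-scale, cross-spatial aggregation."""
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (B, C, H, W) feature map
        return x * torch.sigmoid(self.branch1(x) + self.branch3(x))

# Wiring: EMA-style fusion after stage 2, SPST in place of stage 4 (Swin-T dims).
ema = SimpleEMA(192)
print(ema(torch.randn(2, 192, 28, 28)).shape)   # torch.Size([2, 192, 28, 28])
spst = SPSTBlock(768)
print(spst(torch.randn(2, 49, 768)).shape)      # torch.Size([2, 49, 768])
```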
Swin Transformer network structure diagram
Swin Transformer block module structure diagram
Self-attention computing area. (a) MSA; (b) W-MSA; (c) SW-MSA
Improved model structure diagram
SPST module structure diagram
Visualization of the BN, LN, and BCN normalization techniques
EMA module structure diagram
Activation maps of the model before and after adding the EMA module
Partial samples from the datasets
Confusion matrix validation results on JAFFE. (a) Original Swin Transformer model; (b) Improved Swin Transformer model
Confusion matrix validation results on RAF-DB. (a) Original Swin Transformer model; (b) Improved Swin Transformer model
Confusion matrix validation results on FERPLUS. (a) Original Swin Transformer model; (b) Improved Swin Transformer model
Confusion matrix validation results on FANE. (a) Original Swin Transformer model; (b) Improved Swin Transformer model