Abstract:
To address the large parameter counts, high computational cost, poor localisation, and low recognition accuracy of multi-task recognition models, we design a lightweight collaborative recognition network, LightYOLOv11s. In the backbone and neck of the network, we propose CAConv, a multi-scale convolution module based on the coordinate attention mechanism: the multi-scale convolutions capture target features at several scales, while the attention mechanism strengthens the use of semantic information and improves localisation precision. The output of the neck is passed simultaneously to the detection and pose modules in the head, so that object detection and human behaviour recognition are performed collaboratively on shared features. A joint loss function dynamically adjusts its weight parameters according to the number of objects and human behaviours in the image, thereby balancing the recognition accuracy of the two tasks. After training, the layer-adaptive magnitude-based pruning (LAMP) algorithm is applied to remove redundant weights and simplify the network structure. In addition, channel-wise knowledge distillation (CWD) normalises the channel activation maps of the teacher network so that the student network can accurately learn the teacher's key channel features, which improves the model's predictive performance. Experimental results show that LightYOLOv11s improves on four key indicators: F1-score, mAP@0.5, parameter count, and computational cost. Compared with the baseline YOLOv11s, the object detection F1-score and mAP@0.5 increased by 2.62% and 3.48%, respectively, while the parameter count decreased by 53.92% and the computational cost by 55.78%.
Compared with YOLOv11sPose, the human behaviour recognition F1-score and mAP@0.5 increased by 9.66% and 9.97%, respectively, while the parameter count decreased by 55.25% and the computational cost by 57.74%. LightYOLOv11s thus streamlines the network structure while achieving more accurate collaborative recognition of objects and human behaviour, meeting the needs of lightweight deployment. For edge deployment, experiments were carried out on NPU, GPU, and CPU cluster architectures; compared with the test results on the AutoDL server platform, they confirm that mobile devices offer significant advantages in recognition precision, inference speed, portable deployment, and operation from mobile power storage.
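
For readers unfamiliar with LAMP, the scoring rule behind the pruning step can be sketched as follows. This is a minimal NumPy illustration of the standard LAMP score, not the implementation used for LightYOLOv11s: the function names `lamp_scores` and `lamp_masks`, the toy layers, and the 50% sparsity target are all assumptions made for illustration.

```python
import numpy as np

def lamp_scores(weights):
    """LAMP score of each weight: its squared magnitude divided by the sum of
    squared magnitudes of all weights in the same layer that are at least as large."""
    flat_sq = np.asarray(weights, dtype=float).ravel() ** 2
    order = np.argsort(flat_sq)                  # ascending by squared magnitude
    sorted_sq = flat_sq[order]
    tail = np.cumsum(sorted_sq[::-1])[::-1]      # tail[i] = sum of sorted_sq[i:]
    ratio = np.divide(sorted_sq, tail, out=np.zeros_like(sorted_sq), where=tail > 0)
    scores = np.empty_like(flat_sq)
    scores[order] = ratio                        # undo the sort
    return scores.reshape(np.shape(weights))

def lamp_masks(layers, sparsity):
    """Global pruning: drop the `sparsity` fraction of weights with the lowest
    LAMP scores across all layers. Ties at the threshold are also dropped."""
    scores = [lamp_scores(w) for w in layers]
    all_scores = np.sort(np.concatenate([s.ravel() for s in scores]))
    k = int(sparsity * all_scores.size)          # number of weights to prune
    if k == 0:
        return [np.ones(s.shape, dtype=bool) for s in scores]
    threshold = all_scores[k - 1]
    return [s > threshold for s in scores]       # True = keep the weight

# Toy example with two hypothetical layers.
layers = [np.array([[0.1, -2.0], [0.5, 1.0]]), np.array([3.0, -0.2, 0.05])]
for mask in lamp_masks(layers, sparsity=0.5):
    print(mask)
```

Because the largest weight in each layer always receives a score of 1, a single global threshold yields layer-wise sparsities automatically and does not empty a layer at moderate sparsity levels.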
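
The channel-wise distillation step can be illustrated in the same spirit. The sketch below follows the common formulation of CWD, in which each channel's activation map is turned into a spatial probability distribution by a temperature softmax and the student is penalised by the KL divergence from the teacher's distribution. The names `spatial_softmax` and `cwd_loss`, the temperature `tau = 4.0`, and the random feature maps are assumptions for illustration, not values taken from the LightYOLOv11s experiments.

```python
import numpy as np

def spatial_softmax(feat, tau):
    """Normalise each channel of a (C, H, W) feature map into a probability
    distribution over its H*W spatial positions, with temperature tau."""
    c = feat.shape[0]
    logits = feat.reshape(c, -1) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cwd_loss(teacher, student, tau=4.0, eps=1e-12):
    """Mean over channels of KL(teacher || student), scaled by tau**2."""
    p_t = spatial_softmax(teacher, tau)
    p_s = spatial_softmax(student, tau)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=1)
    return (tau ** 2) * kl.mean()

# Hypothetical feature maps: 8 channels on a 16 x 16 grid.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16, 16))
student = teacher + 0.5 * rng.normal(size=(8, 16, 16))
print(cwd_loss(teacher, student))   # positive: the student deviates from the teacher
print(cwd_loss(teacher, teacher))   # zero: identical activation maps
```

Normalising per channel makes the loss depend on where each channel responds most strongly rather than on absolute activation scale, which is why the student can focus on the teacher's salient regions.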