Abstract: To address the large parameter counts, high computational cost, weak localisation ability, and low recognition accuracy of multi-task recognition models, we design a lightweight collaborative recognition network, LightYOLOv11s. In the backbone and neck, we propose CAConv, a multi-scale convolution module based on the coordinate attention mechanism, which captures features of targets at multiple sizes and uses attention to strengthen the understanding of semantic information and improve localisation accuracy. The fused features from the neck are passed to both the Detect and Pose modules in the head, so that the two tasks share the extracted features while producing decoupled outputs for objects and human behaviours, which enables efficient collaborative recognition. A joint loss function dynamically adjusts its weight parameters according to the number of objects and human behaviours in each image, balancing the recognition accuracy of the two tasks. After training, the layer-adaptive magnitude-based pruning (LAMP) algorithm prunes channels to remove redundant information and simplify the network structure. In addition, channel-wise knowledge distillation (CWD) normalises the channel activation maps of the teacher network so that the student network accurately learns the teacher's key channel features, improving the model's predictions. Experimental results show that LightYOLOv11s improves on four indicators: F1-score, mAP@0.5, parameter count, and computational overhead. For object detection, compared with the baseline YOLOv11s, F1-score and mAP@0.5 increase by 2.62% and 3.48%, respectively, while the parameter count decreases by 53.92% and the computational overhead by 55.78%. For human behaviour recognition, compared with the baseline YOLOv11sPose, F1-score and mAP@0.5 increase by 9.66% and 9.97%, respectively, while the parameter count decreases by 55.25% and the computational overhead by 57.74%. LightYOLOv11s thus achieves more accurate collaborative recognition of objects and human behaviours with a streamlined network structure, meeting the needs of lightweight deployment. For edge deployment, experiments were run on NPU, GPU, and CPU cluster architectures and compared with test results from the autodl server platform; the comparison confirms that mobile devices have significant advantages in recognition accuracy, inference speed, portable deployment, and energy storage from a mobile power supply.
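The two post-training compression steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it follows the commonly published definitions of the LAMP score and the CWD loss, uses only the Python standard library, and works on plain lists rather than tensors. The names `lamp_scores`, `cwd_loss`, and `tau` are hypothetical. For channel pruning, each entry passed to `lamp_scores` is assumed to be a per-channel magnitude (for example, a filter norm), and applying the softmax normalisation to both the teacher and the student maps is likewise an assumption taken from the standard CWD formulation.

```python
import math


def lamp_scores(weights):
    """LAMP score for every entry of one layer (flat list of floats).

    score(u) = w_u^2 / (sum of w_v^2 over entries v at least as large as u).
    The largest entry of a layer always scores 1.0, so pruning by a global
    score threshold never empties a layer completely.
    """
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = 0.0  # running sum of squares, accumulated from the largest entry down
    for i in reversed(order):
        sq = weights[i] ** 2
        tail += sq
        scores[i] = sq / tail if tail > 0.0 else 0.0
    return scores


def _log_softmax(channel, tau):
    """Log of the softmax over the spatial positions of one channel."""
    scaled = [a / tau for a in channel]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def cwd_loss(teacher, student, tau=1.0):
    """Channel-wise distillation loss for one image.

    teacher, student: lists of channels, each a flat list of H*W activations.
    Each channel is normalised into a distribution over spatial positions,
    and the loss is the KL divergence from teacher to student, averaged over
    channels and scaled by tau^2.
    """
    total = 0.0
    for t_ch, s_ch in zip(teacher, student):
        log_p = _log_softmax(t_ch, tau)  # teacher distribution (log)
        log_q = _log_softmax(s_ch, tau)  # student distribution (log)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return tau ** 2 * total / len(teacher)
```

In this sketch, entries with the smallest LAMP scores across all layers would be pruned first, and `cwd_loss` would be added to the student's training loss during fine-tuning. A larger `tau` softens the spatial distributions, so the student also attends to lower-activation regions of the teacher's maps.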