Abstract:
Objective Image classification is a foundational task in computer vision that supports applications such as clinical diagnosis, security surveillance, remote sensing image interpretation, and autonomous driving scene recognition. Deep learning models such as ResNet, DenseNet, and ViT have markedly advanced classification performance in recent years, but these networks retain inherent flaws that restrict their applicability in complex real-world scenarios. ResNet, widely adopted for image classification because its residual connections preserve gradients, lacks sufficient interaction among channel features, which impedes the selection of key discriminative features. Its initial 7×7 convolution and max-pooling layers downsample aggressively at the earliest stage of feature extraction, discarding fine-grained spatial details that are vital for classifying complex images with overlapping objects or subtle inter-class differences. Fixed channel grouping in traditional convolutions further isolates information across channels, limiting the model’s ability to integrate multi-dimensional feature cues and capture inter-channel correlations. Existing attention mechanisms such as SE and CBAM focus on individual dimensions (channel or spatial) and fail to achieve dynamic, comprehensive feature calibration across varying image contents. This study aims to address these limitations by enhancing cross-channel information interaction and strengthening feature representation and key feature extraction, thereby improving the classification performance and generalization ability of the model.
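The early downsampling described above can be made concrete with a short calculation. The sketch below assumes the standard ResNet stem settings (7×7 convolution with stride 2 and padding 3, followed by 3×3 max-pooling with stride 2 and padding 1); these strides and paddings are not stated in this abstract, and the function names are illustrative only.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stem_out(size):
    """Output size after the assumed standard ResNet stem.

    Assumption: 7x7 conv (stride 2, pad 3), then 3x3 max-pool (stride 2, pad 1).
    """
    after_conv = conv_out(size, kernel=7, stride=2, padding=3)
    return conv_out(after_conv, kernel=3, stride=2, padding=1)


if __name__ == "__main__":
    # 32 is the CIFAR/SVHN image side; 224 is a common natural-image crop size.
    for size in (32, 224):
        print(size, "->", resnet_stem_out(size))
```

Under these assumptions a 32×32 input is reduced to 8×8 before the first residual stage, which illustrates why low-resolution images are especially sensitive to the stem design.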
Methods To achieve this objective, a pooling residual image classification network with dynamic channel feature calibration (DCPRNet) is proposed, built on the ResNet-34 architecture and designed to address its inherent deficiencies. Three targeted improvements were integrated into the network to overcome the drawbacks of traditional models while maintaining computational feasibility. First, the initial layer was optimized: the 7×7 convolution was replaced with a 3×3 convolution to reduce parameter complexity by nearly half while retaining robust feature extraction capability, and the max-pooling layer was removed to preserve fine-grained spatial information for subsequent feature processing, fusion, and calibration. Second, a dynamic channel feature calibration (DCFC) module was designed, combining a random subgroup shuffle module with a chunked weighted attention (CWA) mechanism. The random subgroup shuffle module breaks the isolation between channel groups through dual-layer grouping and random shuffling, promoting cross-group information interaction and dynamic channel communication without introducing excessive computation. The CWA mechanism calibrates the weights of feature chunks in four steps: feature chunking, local importance score generation, global information aggregation by mean pooling, and sigmoid normalization. It thereby suppresses redundant background information and enhances the contribution of feature regions that are critical to target classification. Third, an average pooling-based residual module was added to the DCFC branch, using a 1×1 convolution and 2×2 average pooling in place of the traditional strided convolution to realize low-cost, smooth downsampling, reduce feature distortion, and better preserve effective feature information.
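The three operations named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the grouping scheme, the scoring rule (plain means standing in for any learned layers), and all function and parameter names are hypothetical.

```python
import numpy as np


def random_subgroup_shuffle(x, groups=4, subgroups=2, rng=None):
    """Randomly permute channel subgroups across groups (assumed reading).

    x has shape (C, H, W); C must be divisible by groups * subgroups.
    """
    rng = np.random.default_rng() if rng is None else rng
    c, h, w = x.shape
    n_blocks = groups * subgroups
    assert c % n_blocks == 0, "channels must split evenly into subgroups"
    # Two-level grouping flattened into consecutive channel blocks.
    blocks = x.reshape(n_blocks, c // n_blocks, h, w)
    # Reorder the blocks so groups receive channels from other groups.
    order = rng.permutation(n_blocks)
    return blocks[order].reshape(c, h, w)


def chunked_weighted_attention(x, chunks=4):
    """Rescale each channel chunk by a sigmoid-normalised importance score."""
    c, h, w = x.shape
    assert c % chunks == 0, "channels must split evenly into chunks"
    parts = x.reshape(chunks, c // chunks, h, w)
    # Local score per channel (spatial mean), then mean pooling per chunk.
    # Assumption: simple means replace whatever learned scoring is used.
    local = parts.mean(axis=(2, 3))        # shape (chunks, C / chunks)
    score = local.mean(axis=1)             # shape (chunks,)
    weight = 1.0 / (1.0 + np.exp(-score))  # sigmoid, values in (0, 1]
    return (parts * weight[:, None, None, None]).reshape(c, h, w)


def avgpool_shortcut(x, proj):
    """1x1 convolution (channel mixing by proj) followed by 2x2 average pooling.

    proj has shape (C_out, C_in); H and W of x must be even.
    """
    c, h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "spatial size must be even"
    y = np.einsum("oc,chw->ohw", proj, x)  # 1x1 conv is a per-pixel matmul
    o = y.shape[0]
    # Average each non-overlapping 2x2 window.
    return y.reshape(o, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
```

In this sketch the shuffle only reorders channels, so no information is lost, and the shortcut halves the spatial size by averaging every pixel in a 2×2 window rather than keeping one sample per window, as a stride-2 1×1 convolution would.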
Results and Discussions Experiments were conducted on five representative datasets (CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof) that cover simple-object, complex-category, low-resolution, and real-world natural image scenarios, so that the model is validated under diverse conditions. DCPRNet achieved classification accuracies of 96.41%, 80.36%, 96.97%, 91.59%, and 80.61% on these datasets, respectively, showing stable performance across different data types, complexity levels, and feature distributions. Ablation studies confirmed that input layer optimization, the DCFC module, and the average pooling residual module each improve model performance significantly, and that the three components complement one another, with their combination yielding the best overall result. Comparative experiments against 12 mainstream models, including ResNet-34, DenseNet-121, ViT-B/16, and MobileNet-V2, demonstrated DCPRNet’s consistent superiority: it achieved a 2.15% accuracy gain over the baseline ResNet-34 on the CIFAR-100 dataset and a 4.39% improvement on the ImageNet dataset, while maintaining comparable computational cost. Heatmap-based visual analysis further showed that DCPRNet focuses accurately on critical fine-grained features of target objects and avoids interference from irrelevant background regions, indicating strong feature calibration capability. These results confirm that the proposed improvements enhance cross-channel information interaction, mitigate the inherent limitations of traditional networks, and strengthen the model’s ability to extract discriminative features.
Conclusions DCPRNet was developed by optimizing the ResNet-34 architecture and introducing the DCFC module and the average pooling residual module. The model enhances cross-channel information interaction, strengthens feature representation and key feature extraction, and achieves a balanced trade-off among classification accuracy, computational cost, and robustness. Experimental results and visual analysis validate its superiority over mainstream models on multiple datasets and confirm its robustness and adaptability to different image scenarios and data distributions. By addressing core flaws of traditional image classification networks (insufficient channel interaction, excessive feature loss during downsampling, and limited feature calibration capability), DCPRNet provides a reliable technical solution for image classification tasks in fields including medical imaging, intelligent surveillance, and remote sensing detection. It also offers a theoretical and practical reference for the further optimization of deep learning-based image classification models, laying a foundation for more efficient, accurate, and lightweight computer vision algorithms suitable for edge computing devices.