Abstract:
Tumor grading based on microscopic imaging is critical for the diagnosis and prognosis of breast cancer, and it demands both high accuracy and interpretability. Deep networks built from CNN blocks combined with attention currently offer a stronger inductive bias but low interpretability. In contrast, deep networks based on ViT blocks offer stronger interpretability but a weaker inductive bias. To that end, we present an end-to-end adaptive model fusion for deep networks that combines ViT blocks and CNN blocks with integrated attention. Existing model fusion methods, however, suffer from negative fusion: first, there is no guarantee that both the ViT blocks and the attention-integrated CNN blocks provide adequate feature representations; second, the high similarity between the two feature representations introduces substantial redundancy, which degrades the fused model. To address these issues, the adaptive model fusion approach proposed in this study consists of multi-objective optimization, an adaptive feature representation metric, and adaptive feature fusion, which together significantly improve the model's fusion capability. The accuracy of our model is 95.14%, which is 9.73% higher than that of ViT-B/16 and 7.6% higher than that of FABNet. Moreover, the visualization maps of our model focus more on regions of nuclear heterogeneity (e.g., mega nuclei, polymorphic nuclei, multi-nuclei, and dark nuclei), which is more consistent with the regions of interest to pathologists. Overall, the proposed model outperforms other state-of-the-art models in terms of accuracy and interpretability.
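To make the fusion idea concrete, the following is a minimal sketch of adaptive feature fusion between a ViT branch and an attention-augmented CNN branch. It is an illustration under stated assumptions, not the authors' exact architecture: the module name `AdaptiveFeatureFusion`, the gating mechanism, and all dimensions are hypothetical choices for demonstration.

```python
# Minimal sketch (assumption: not the paper's exact method) of adaptively
# fusing features from a ViT branch and an attention-augmented CNN branch.
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Learns per-sample weights that balance ViT and CNN feature vectors."""
    def __init__(self, vit_dim: int, cnn_dim: int, fused_dim: int, num_classes: int):
        super().__init__()
        self.vit_proj = nn.Linear(vit_dim, fused_dim)   # project ViT features
        self.cnn_proj = nn.Linear(cnn_dim, fused_dim)   # project CNN features
        # Gate produces two weights (ViT vs. CNN) from the concatenated features.
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, 2),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, vit_feat: torch.Tensor, cnn_feat: torch.Tensor) -> torch.Tensor:
        v = self.vit_proj(vit_feat)                     # (B, fused_dim)
        c = self.cnn_proj(cnn_feat)                     # (B, fused_dim)
        w = self.gate(torch.cat([v, c], dim=-1))        # (B, 2) adaptive weights
        fused = w[:, :1] * v + w[:, 1:] * c             # weighted sum of branches
        return self.classifier(fused)

# Usage: fuse a 768-d ViT [CLS] feature with a 2048-d pooled CNN feature
# for a 3-class grading task (dimensions chosen purely for illustration).
model = AdaptiveFeatureFusion(vit_dim=768, cnn_dim=2048, fused_dim=512, num_classes=3)
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3])
```

The learned gate lets the network weight each branch per sample rather than concatenating both representations outright, which is one simple way to reduce the redundancy that the abstract identifies as a cause of negative fusion.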