MMBCD: Multimodal Breast Cancer Detection from Mammograms with Clinical History

1Indian Institute of Technology Delhi, 2Thapar Institute of Technology, 3All India Institute of Medical Sciences Delhi
Overview

Model Diagram: Our framework uses a cross-attention layer in which the textual representation of the clinical history attends to the top-K ROI embeddings. Our findings reveal the synergistic impact of textual, visual, and cross-attention embeddings on the accuracy of breast cancer detection.

Abstract

Mammography serves as a vital tool for breast cancer detection, with screening and diagnostic modalities catering to distinct patient populations. However, in resource-constrained settings, screening mammography may not be feasible, necessitating reliance on diagnostic approaches. Recent advances in deep learning have shown promise in automated malignancy prediction, yet existing methodologies often overlook the crucial clinical context inherent in diagnostic mammography. In this study, we propose a novel approach that integrates mammograms and clinical history to enhance breast cancer detection accuracy. To achieve our objective, we leverage recent advances in foundational models, using ViT for mammograms and RoBERTa for encoding text-based clinical history. Since current implementations of ViT cannot handle large 4K × 4K mammography scans, we devise a novel framework that first detects regions of interest and then classifies them using a multi-instance-learning strategy, while allowing the text embedding from the clinical history to attend to the visual regions of interest from the mammograms.
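The cross-attention step described above can be illustrated with a minimal sketch: a single text embedding acts as the query and attends over the top-K ROI embeddings to produce a fused representation. This is not the paper's implementation; the dimensions, single-head formulation, and weight matrices below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attend(text_emb, roi_embs, Wq, Wk, Wv):
    """Single-head cross-attention: the text query attends over K ROI embeddings.

    Returns the fused embedding and the attention weights over the ROIs.
    """
    q = Wq @ text_emb                      # query from clinical-history text, shape (d,)
    keys = roi_embs @ Wk.T                 # keys from ROI embeddings, shape (K, d)
    vals = roi_embs @ Wv.T                 # values from ROI embeddings, shape (K, d)
    scores = keys @ q / np.sqrt(len(q))    # scaled dot-product scores, shape (K,)
    attn = softmax(scores)                 # attention weights over the K ROIs
    return attn @ vals, attn               # fused embedding (d,), weights (K,)

# Illustrative shapes only (real encoders would be RoBERTa / ViT).
rng = np.random.default_rng(0)
d, K = 16, 5
text_emb = rng.normal(size=d)              # stand-in for a RoBERTa text embedding
roi_embs = rng.normal(size=(K, d))         # stand-ins for ViT embeddings of top-K ROIs
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

fused, attn = cross_attend(text_emb, roi_embs, Wq, Wk, Wv)
```

The fused embedding can then be concatenated with the visual and textual embeddings before classification, matching the synergy of the three embedding types noted above.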

Extensive experimentation demonstrates that our model, MMBCD, successfully incorporates contextual information while preserving image resolution and context, leading to superior results over existing methods, and showcasing its potential to significantly improve breast cancer screening practices. MMBCD achieves an (Accuracy, F1) of (0.96, 0.82) and (0.95, 0.68) on our two in-house test datasets, against (0.91, 0.41) and (0.87, 0.39) by LLaVA, and (0.84, 0.50) and (0.91, 0.27) by CLIP-ViT; both state-of-the-art multi-modal foundational models.

BibTeX

@inproceedings{mmbcd,
  author    = {Jain, Kshitiz and Bansal, Aditya and Rangarajan, Krithika and Arora, Chetan},
  title     = {MMBCD: Multimodal Breast Cancer Detection from Mammograms with Clinical History},
  booktitle = {MICCAI},
  year      = {2024},
}