Abstract
Computer-aided detection (CAD) frameworks for breast cancer screening have been researched for several decades. Early adoption of deep-learning models in CAD frameworks has shown greatly improved detection performance compared to traditional CAD on single-view images. Recently, studies have improved performance by merging information from multiple views within each screening exam. Clinically, the integration of lesion correspondence during screening is a complicated decision process that depends on the correct execution of several referencing steps. However, most multi-view CAD frameworks are deep-learning-based black-box techniques. Fully end-to-end designs make it very difficult to analyze model behaviors and fine-tune performance. More importantly, the black-box nature of the techniques discourages clinical adoption due to the lack of explicit reasoning for each multi-view referencing step. Therefore, there is a need for a multi-view detection framework that can not only detect cancers accurately but also provide step-by-step, multi-view reasoning. In this work, we present Ipsilateral-Matching-Refinement Networks (IMR-Net) for digital breast tomosynthesis (DBT) lesion detection across multiple views. Our proposed framework adaptively refines the single-view detection scores based on explicit ipsilateral lesion matching. IMR-Net is built on a robust, single-view detection CAD pipeline with a commercial development DBT dataset of 24675 DBT volumetric views from 8034 exams. Performance is measured using location-based, case-level receiver operating characteristic (ROC) and case-level free-response ROC (FROC) analysis.
Keywords: Computer-aided detection and diagnosis, Breast Cancer Screening, Digital Breast Tomosynthesis, Ipsilateral Matching
I. INTRODUCTION
BREAST cancer is one of the leading causes of cancer deaths among US women. The primary methods for breast cancer screening are mammography and digital breast tomosynthesis (DBT). However, these modalities still involve considerable false positive and false negative errors [1], [2], and the growing volume of screening exams poses challenges for current radiology workflows.
Computer Aided Detection (CAD) for breast cancer screening has been developing for multiple decades. These systems seek to assist radiologists to reduce average case reading time and to prevent missed cancer cases. Traditionally, a CAD framework consists of cascaded computer-vision and machine-learning algorithms that are based on hand-crafted features and intended to detect and classify suspicious lesions on each screening view. In the largest study to date, however, traditional CAD algorithms failed to improve radiologist performance in reading screening mammograms [3]. On the other hand, recent deep-learning methods have created a new generation of CAD frameworks that promise to deliver better performance due to more robust features that are extracted from much larger datasets and integrated by more powerful models. Freed from extensively hand-crafted features and rules, deep-learning-based object detection and classification models have been end-to-end trained on several breast imaging lesion detection tasks [4]–[12].
Recent CAD research aided by more expressive deep-learning models has shifted to multi-view information fusion, which includes three different strategies. First, ipsilateral matching seeks to match corresponding lesion candidates on the craniocaudal (CC) and mediolateral oblique (MLO) views of the same breast. Matching techniques include relation blocks [11], [13]–[15], deep aggregation [6], [7], and bipartite matching [12]. Ipsilateral matching is difficult, however, due to differences in positioning and compression of the soft breast tissue with almost no anatomical landmarks. Second, contralateral matching compares the left versus right breasts, using registration to distinguish high-density symmetric tissues from suspicious lesions that are asymmetric [6], [9]. Third, temporal matching compares the same view of the same breast between current and prior exams, again using registration methods similar to contralateral matching.
Two general concerns arose when evaluating the performance of multi-view lesion detection methods in the literature. First, studies often reported receiver operating characteristic (ROC) analysis that treats each case globally without requiring lesion location. Often this was necessary because the available research dataset only provided case-level annotations such as the absence or presence of cancer, which precluded localization ROC (LROC) or free-response ROC (FROC) that both require the lesion to be correctly localized. In some studies that reported both types of metrics, ROC substantially outperformed LROC results [6], [7], which suggested the model may be detecting false positives other than the cancers. Second, information was integrated together from multiple views without reasoning that would be needed for clinical adoption. Several deep-learning-based case-level feature aggregation approaches [6], [7] could not be evaluated on each of the multi-view referencing steps. Likewise, end-to-end trained lesion relation learning methods only provided the multi-view detection result [11], [12], [14], [15]. Without the single-view detection results and reasoning for how they correspond across the multiple views, further iterative improvement of such a framework would be very difficult. Furthermore, screening systems need to detect individual lesions, so the application of case-level models would be relegated only to triage.
In this work, we developed a novel Ipsilateral Matching and Refinement Network (IMR-Net) to perform multi-view breast cancer lesion detection. IMR-Net was designed to mimic how radiologists match and then analyze suspicious lesion candidates in corresponding ipsilateral views. The process is illustrated in Fig. 1. Unlike other studies that directly derive the multi-view case score from extracted case-level latent features, our framework was built on the foundation of more robust lesion-level detection. We summarize our key contributions as follows:
We present a novel multi-view breast cancer lesion detection framework. The proposed framework adds minimal computation cost compared to the single-view model counterpart. The framework was designed to concurrently output the single-view detection, ipsilateral matching result, and ipsilateral refinement reasoning. Together, the available information can help radiologists to have better confidence for clinical adoption.
We built our proposed methods on a robust single-view CAD pipeline. All evaluations were conducted using localization-based LROC and FROC analysis.
We validated our proposed framework on a large-scale screening tomosynthesis dataset containing exams from two major imaging manufacturers: Hologic and General Electric (GE).
II. RELATED WORKS
A. From 2D Mammography CAD to 3D DBT CAD
Convolutional neural networks (CNN) have been applied extensively for detecting breast cancer in 2-dimensional (2D) mammography images [4], [10], [16]–[18]. With the ongoing clinical adoption of digital breast tomosynthesis (DBT), the field is adapting 2D mammography CAD systems for 3-dimensional (3D) DBT volumes. Samala et al. [19] used a cascaded classifier system to transfer learned knowledge from the natural image and mammography domains to the DBT domain. Winkel et al. [20] conducted a reader study showing the generalizability of their existing architecture to both mammography and DBT. Yousefi et al. [21] extracted plane-by-plane latent features on DBT volumes and then performed volume classification using multiple instance learning. Fan et al. [22] modified the 2D Faster-RCNN architecture into a 3D feature-extraction backbone.
There are very few publicly available DBT datasets. Buda et al. [23] released a mass lesion DBT dataset containing 89 cancer, 102 benign, and 5061 normal cases. Although this dataset was used in challenges [24], all groups were treating both cancer and benign cases as positive targets largely due to limited training data. These factors limit comparison against comprehensive CAD studies that are based on the clinical workflow, which focuses on detecting cancers for both mass and calcification lesions concurrently.
Multiple large scale retrospective reader studies have shown deep-learning-based mammography screening CAD systems are approaching radiologist-level performance [6]–[8], [25], [26]. These architecture designs relied on the extraction of view- or case-level latent features to assess the overall case. As noted above, although these systems consider multiple views, they are not applicable clinically because the case-level score cannot be explained by the detection of individual lesions and tend to report better case-level ROC than LROC performance.
B. CVR-RCNN, CVR-BGN, and MommiNet
Several recent mammography CAD studies proposed using the relation block [27] to realize multi-view correspondence learning [12], [14], [15]. The relation block was shown to be more effective in modeling object relations than the simple aggregation of object features through fully connected layers. This concept led to end-to-end, multi-view architectures [14], [15], but the high diversity of visual appearance and the interdependency of object relations made it difficult to train a generic relation block end-to-end [28]. The non-local network provided almost identical context for different query positions within an image [29], which contradicted the clinical need to establish one-to-one pairs for multi-view lesion matching. This shortcoming was also observed in our pilot study, where a relation block module did not outperform our earlier explicit matching-based approach [30].
On the other hand, CVR-BGN used a bipartite graph convolutional network to enhance features within each graph node [12]. This graph convolution module facilitated better multi-view relation modeling when there was a sufficiently large training dataset. However, this black-box technique did not offer multi-view reasoning that was interpretable.
C. End-to-end Detection and Re-identification
Multi-view lesion detection is closely related to the concepts of object detection [31]–[33] and object re-identification/tracking [34], [35]. Working within the tracking-by-detection paradigm, Retina-Track jointly performed multi-object detection and tracking [36]. Retina-Track achieved object re-identification by learning an instance-level embedding network that minimized the Euclidean distance between instance-embedding vectors from the same object while maximizing the distance between vectors from different objects. The detection and tracking components of the model were jointly optimized during training, but the tracking result was not used to refine the detection. Inspired by this approach, our previous Retina-Match system [30] used a Retina-Net architecture backbone to perform lesion detection, ipsilateral lesion matching, and detection refinement end-to-end. That pilot study demonstrated that trainable lesion refinement was possible through explicit ipsilateral lesion matching.
III. PROPOSED METHOD
A. Overall Architecture
Our ipsilateral lesion refinement pipeline was motivated by how radiologists correlate a suspicious lesion candidate with the corresponding ipsilateral view, where finding a correlate greatly increases the likelihood of a true-positive detection. Our IMR-Net captured this ipsilateral relation through explicit matching and lesion score refinement. The overall architecture with three individually trained modules is shown in Fig. 2. First, initial lesion candidates were generated by a single-view detection stage based only on local pixel information, assigning to each lesion candidate a single-view detection score. This stage provided high-quality detection candidates with sufficient specificity for the ipsilateral matching process. Second, exhaustive ipsilateral pairing identified the most likely lesion candidate pairs. A Siamese network [34] computed the matching probability of each ipsilateral lesion pair. After that, a greedy matching operation ranked and preserved only the top ipsilateral pairs. Third, ipsilateral refinement began by predicting a set of per-lesion weighting factors to correlate the matching probability with the ipsilateral lesion score modifier. The final ipsilateral detection score was computed by adding the modifier to the existing single-view detection score. In the following sections, each of these three modules is described in detail.
B. Single View Detection
1). Candidate Detection Module:
Single-shot object detection methods such as You Only Look Once (YOLO) [31] and RetinaNet [37] can be used as fast candidate detection methods. Similar to our pilot study [30], a YOLO v2 model [38] processed the stack of DBT slice images to propose initial lesion candidates. A fully convolutional design with separate detection score and bounding-box prediction heads allowed the processing of images with different sizes and aspect ratios. For each proposed lesion candidate, the model predicted a candidate generation score as well as bounding-box coordinates x, y, w, and h. For each DBT slice, lesion candidates with significant overlap were removed through Non-Maximum Suppression (NMS) with an Intersection-over-Union (IoU) threshold of 0.2. The model was trained using the original YOLO loss function [31] as in Eq. 1.
$$
\begin{aligned}
L_{\text{YOLO}} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned} \tag{1}
$$
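For reference, the per-slice suppression step described above can be written as standard greedy NMS. The sketch below is illustrative only: the (x, y, w, h) box convention and helper names are assumptions, not the production implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h) with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.2):
    """Greedy non-maximum suppression; returns indices of surviving boxes."""
    order = list(np.argsort(scores)[::-1])   # highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou(boxes[best], boxes[j]) < iou_threshold]
    return keep
```

In this sketch, `nms(boxes, scores, iou_threshold=0.2)` would be applied independently to the candidates of each DBT slice.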
2). Patch Classification Module:
Traditionally, cascaded patch classifiers were used to refine detected lesion candidates [4]. Instead, recent deep-learning-based object detection methods such as RetinaNet [37] and Mask-RCNN [33] advocated for more multi-task and end-to-end trained pipelines. However, our workflow demonstrated substantial improvement in performance by applying a simple cascaded patch classifier on top of a well-tuned candidate detection model. Lesion patches were generated by cropping a fixed-size patch centered on the predicted x, y, and z location and fed to the patch classifier. We attributed the observed improvement to the significant imbalance of positive and negative samples in the candidate detection stage. The single-view detection stage's case-level LROC and ROI-level FROC performances are shown in Fig. 6a and 6b.
The patch classifier was trained using the sigmoid cross-entropy loss as in Eq. 2, where N is the number of samples in a mini-batch. All proposed patches within a volume, each carrying a patch classification score, are projected onto the same plane for the Volumetric Non-Maximum-Suppression (Vol-NMS) operation. The predicted x, y, w, and h from the YOLO detector and the corresponding patch classification score are used to compute the Vol-NMS output with an IoU threshold of 0.4. Surviving patches are used as the final output of the single-view detection stage.
$$
L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \tag{2}
$$

where p_i is the predicted score and y_i the label of sample i.
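A minimal sketch of the Vol-NMS merging is given below; it reuses the `nms` helper from the earlier sketch and assumes each patch is a simple dictionary, which is an illustrative data layout rather than the actual implementation.

```python
def vol_nms(patches, iou_threshold=0.4):
    """Project all surviving patches of one DBT volume onto a single plane
    (the z coordinate is ignored) and run greedy 2D NMS on the projected
    boxes, ranked by the patch classification score.

    Each patch is assumed to be a dict with keys 'x', 'y', 'w', 'h', 'z',
    and 'score' (the patch classification score)."""
    boxes = [(p['x'], p['y'], p['w'], p['h']) for p in patches]   # z dropped: projection
    scores = [p['score'] for p in patches]
    keep = nms(boxes, scores, iou_threshold=iou_threshold)        # nms() from the sketch above
    return [patches[i] for i in keep]
```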
C. Ipsilateral Lesion Matching
1). Matching Architecture:
Several studies have shown that the Siamese network can be used to re-identify images of the same object regardless of differences in lighting, angle, or image quality [34], [35], [39]. Therefore, we applied a Siamese network to re-identify lesion candidates across the corresponding ipsilateral (CC and MLO) views. A generic feature extraction backbone created a latent feature vector for each lesion candidate. To aid the matching process, a datum line is drawn from the pectoral muscle line to the nipple to measure the candidate-to-pectoral-muscle distance (d_pec) and the candidate-to-nipple distance (d_nip). The differences in these two distances between the paired candidates (Δd_pec and Δd_nip) were embedded and concatenated to the latent features after global average pooling. The element-wise mean-square-error of the extracted features was input to two fully connected (FC) layers with 128 and 64 elements, respectively, to compute the matching probability p_m.
$$
p^{m}_{ij} = \sigma\!\left( \mathrm{FC}_{64}\!\left( \mathrm{FC}_{128}\!\left( \left(\mathbf{f}_i - \mathbf{f}_j\right) \odot \left(\mathbf{f}_i - \mathbf{f}_j\right) \right) \right) \right) \tag{3}
$$

where f_i and f_j are the latent feature vectors (with the embedded distance differences) of the two candidates in the ipsilateral pair.
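A minimal Keras sketch of such a matching head is shown below. The 1-unit sigmoid output layer, the size of the geometric embedding, and the exact point at which the embedded distance differences are concatenated are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_matching_head(feature_dim=1280, geo_dim=2):
    """Siamese matching head sketch: element-wise squared difference of the two
    candidates' pooled latent features, combined with an embedding of the
    pair's geometric distance differences, followed by FC-128 and FC-64
    layers and a sigmoid unit producing the matching probability p_m."""
    feat_a = layers.Input(shape=(feature_dim,), name='features_view_a')
    feat_b = layers.Input(shape=(feature_dim,), name='features_view_b')
    geo_diff = layers.Input(shape=(geo_dim,), name='distance_differences')   # (delta d_pec, delta d_nip)

    diff = layers.Subtract()([feat_a, feat_b])
    sq_diff = layers.Lambda(tf.square)(diff)                      # element-wise squared difference
    geo_embed = layers.Dense(16, activation='relu')(geo_diff)     # small geometric embedding (size assumed)

    x = layers.Concatenate()([sq_diff, geo_embed])
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dense(64, activation='relu')(x)
    p_match = layers.Dense(1, activation='sigmoid', name='matching_probability')(x)
    return Model(inputs=[feat_a, feat_b, geo_diff], outputs=p_match)
```

Training this head with `tf.keras.losses.BinaryCrossentropy()` corresponds to the sigmoid cross-entropy loss in Eq. 4.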
The Siamese network was trained using the sigmoid cross-entropy loss as in Eq. 4. During training, the label of a lesion candidate pair in the mini-batch was set to 1 only if the two candidates were from the same screening exam and had the same lesion ID; otherwise, it was set to 0.
$$
L_{\text{match}} = -\frac{1}{N} \sum_{k=1}^{N} \left[ y_k \log\!\left(p^{m}_{k}\right) + (1 - y_k) \log\!\left(1 - p^{m}_{k}\right) \right] \tag{4}
$$
During inference, a greedy matching operation ranked all ipsilateral pairs based on the predicted matching probability p_m. Starting from the top-ranking ipsilateral pairs, the final pair relation was established only if both lesion candidates were not yet matched. This design is intended to mimic the intuition that each detection should have a unique ipsilateral pair in the corresponding view. This step is critical to allow the following refinement step to further separate cancer versus noncancer lesion candidates.
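A minimal sketch of the greedy one-to-one assignment follows; the dictionary-based data layout is an assumption for illustration.

```python
def greedy_match(pair_probs):
    """Greedy one-to-one ipsilateral assignment.

    `pair_probs` maps (cc_idx, mlo_idx) -> predicted matching probability.
    Pairs are visited in descending probability; a pair is kept only if
    neither candidate has already been matched."""
    matched_cc, matched_mlo, matches = set(), set(), {}
    for (cc_idx, mlo_idx), prob in sorted(pair_probs.items(),
                                          key=lambda kv: kv[1], reverse=True):
        if cc_idx not in matched_cc and mlo_idx not in matched_mlo:
            matches[(cc_idx, mlo_idx)] = prob
            matched_cc.add(cc_idx)
            matched_mlo.add(mlo_idx)
    return matches
```

For example, `greedy_match({(0, 0): 0.9, (0, 1): 0.8, (1, 1): 0.7})` keeps pairs (0, 0) and (1, 1), because CC candidate 0 is already matched when the pair (0, 1) is considered.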
2). Ipsilateral Lesion Refinement Module:
Based on the ipsilateral matching result, the refinement module modified each single-view lesion detection score. Analogous to the way radiologists perform ipsilateral matching, lesions correlated through ipsilateral views were marked as more suspicious. To derive the modifier value for each lesion patch, a set of lesion-specific weighting factors α, β, and γ was predicted to correlate the matching probability p_m with the optimum modifier value as in Eq. 5. These weights respectively reinforced (α) or weakened (β) the predicted p_m, with γ acting as a bias term. The multi-view detection score was then the sum of the modifier and the single-view detection score as in Eq. 6. Concrete examples of the refinement process are shown in Fig. 4.
$$
\Delta_i = \alpha_i \, p^{m}_{i} - \beta_i \left( 1 - p^{m}_{i} \right) + \gamma_i \tag{5}
$$
$$
s^{\text{multi}}_{i} = s^{\text{single}}_{i} + \Delta_i \tag{6}
$$
Each of the α, β, and γ values was predicted by an independent regressor with a linear output activation function. Each regressor was given the 1280-length feature vector extracted from the single-view-stage patch classifier. The continuous nature of the matching probability and the single-view lesion score makes this an underlying regression problem. The three regressors were trained using the MSE loss formulation in Eq. 7. During training, the refined scores were clipped from −∞ to 1 if the patch was labeled as positive, and from 0 to ∞ if the patch was labeled as negative.
$$
L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \tilde{s}^{\,\text{multi}}_{i} - y_i \right)^2 \tag{7}
$$

where s̃_i^multi is the refined score after the clipping described above and y_i is the patch label.
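Under the reconstruction of Eqs. 5–7 above (the exact functional form of the modifier is inferred from the text rather than given explicitly), a minimal numpy sketch of the refinement and of the clipped-target MSE loss could look like this:

```python
import numpy as np

def refine_scores(single_view_scores, match_probs, alpha, beta, gamma):
    """Ipsilateral refinement as in the reconstructed Eqs. 5-6: a strong match
    (p_m near 1) reinforces the score via alpha, a weak or absent match
    (p_m near 0) weakens it via beta, plus a per-lesion bias gamma."""
    p = np.asarray(match_probs, dtype=float)        # set to 0 when no valid ipsilateral pair exists
    modifier = alpha * p - beta * (1.0 - p) + gamma
    return np.asarray(single_view_scores, dtype=float) + modifier

def refinement_mse_loss(refined_scores, labels):
    """MSE loss (Eq. 7 as reconstructed) with the clipping rule described above:
    positives are clipped to at most 1 (no penalty for exceeding 1) and
    negatives are clipped to at least 0 (no penalty for falling below 0)."""
    refined = np.asarray(refined_scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    clipped = np.where(labels == 1, np.minimum(refined, 1.0), np.maximum(refined, 0.0))
    return float(np.mean((clipped - labels) ** 2))
```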
3). Relation Block Patch Classifier Baseline:
The relation block learns the statistical dependency of various objects end-to-end without explicit labeling of object relations [13]. To provide a controlled comparison, we replaced the ipsilateral lesion matching and lesion refinement module in IMR-Net with a cascaded patch classifier that utilized the generic relation block module [11], [27]. The top lesion detection candidates from each ipsilateral view were paired. The exhaustive relation features of each candidate with all available candidates in the other view were computed through the relation block. Relation features were then applied to each lesion’s latent features through addition. A new regressor then computed the lesion score as in our patch classifier. This model was trained using sigmoid cross-entropy loss as in Eq. 2.
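For reference, the appearance term of such a relation block can be sketched as a cross-view attention layer; the geometric term and the exact transformation dimensions are omitted here, so this is a simplified illustration rather than the baseline implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SimpleRelationBlock(layers.Layer):
    """Simplified relation block (appearance term only): every candidate in
    view A attends over all candidates in view B, and the attended value is
    added back to its own latent features."""
    def __init__(self, feature_dim=1280, key_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.query = layers.Dense(key_dim)
        self.key = layers.Dense(key_dim)
        self.value = layers.Dense(feature_dim)
        self.scale = tf.math.sqrt(tf.cast(key_dim, tf.float32))

    def call(self, feats_a, feats_b):
        # feats_a: (N, feature_dim) candidates in one view,
        # feats_b: (M, feature_dim) candidates in the other view.
        q = self.query(feats_a)                                   # (N, key_dim)
        k = self.key(feats_b)                                     # (M, key_dim)
        v = self.value(feats_b)                                   # (M, feature_dim)
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) / self.scale, axis=-1)
        relation = tf.matmul(attn, v)                             # (N, feature_dim)
        return feats_a + relation                                 # relation features applied by addition
```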
D. Implementation Detail
All models were implemented using TensorFlow 2.5 in Python 3.7 with the XLA compiler enabled. The networks were optimized using the Adam optimizer with the default settings and learning rate of . Both the candidate detection module and the patch classifier feature extraction backbones were initialized from ImageNet-pretrained MobileNetV2 weights. The ipsilateral matching model and relation block classifier were initialized from the fine-tuned patch classifier feature extractor. Each model was trained using an RTX 3090 graphics card. All models were trained using standard data augmentation techniques unless otherwise specified. During training, we first applied random brightness and window-level scaling during the normalization step. Then, we applied random scaling, cropping, and 0–360° rotation to each sample to increase model generalizability.
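As an illustration of this on-the-fly augmentation, a minimal TensorFlow sketch is given below. The parameter ranges and output patch size are illustrative rather than the exact training settings, and the arbitrary 0–360° rotation is approximated with 90° rotations plus flips (a full rotation would require, e.g., tfa.image.rotate).

```python
import tensorflow as tf

def augment(patch, out_size=256):
    """Illustrative on-the-fly augmentation: random brightness / window-level
    scaling, random zoom, crop or pad to a fixed size, and random orientation.
    All ranges are example values only."""
    patch = tf.convert_to_tensor(patch, tf.float32)                 # (H, W, 1), normalized intensities
    patch = patch * tf.random.uniform([], 0.9, 1.1)                 # window-level / brightness scaling
    patch = patch + tf.random.uniform([], -0.05, 0.05)

    zoom = tf.random.uniform([], 0.9, 1.1)                          # random scaling
    new_hw = tf.cast(tf.cast(tf.shape(patch)[:2], tf.float32) * zoom, tf.int32)
    patch = tf.image.resize(patch, new_hw)
    patch = tf.image.resize_with_crop_or_pad(patch, out_size, out_size)  # center crop/pad for simplicity

    patch = tf.image.rot90(patch, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    patch = tf.image.random_flip_left_right(patch)
    return patch
```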
1). Candidate Detection:
The candidate detection model was designed to remove obvious normal tissues from the candidate pool while maintaining cancer sensitivity. A lesion candidate was defined as positive if it was within ±3 slices in the z direction of the central slice of the reference standard and had an Intersection-over-Union (IoU) larger than 0.2. On the fly during training, DBT slice images were randomly augmented, and slices were randomly cropped into patches. Benign cases were not used as negatives in training to avoid degrading sensitivity. Only patches with scores larger than 0.4 for Hologic and 0.8 for GE were passed into the patch classifier, yielding an average of 100 false positives per view (FPPI) prior to z-direction candidate merging; ROI-level sensitivities were 98% for Hologic and 93% for GE on the validation dataset.
2). Patch Classifier:
For classification, patches were generated from the candidate detection results. During training, standard random augmentations were again performed on the fly. A patch was labeled as positive if it was within ±1 slice of the reference standard annotation and had an IoU larger than 0.2. During inference, only patches with classification scores larger than 0.05 were merged in the z-direction and passed into the ipsilateral processing stages. This yielded average FPPIs of 5.6 and 5.1, with ROI-level sensitivities of 96% and 92%, respectively, for Hologic and GE on the validation dataset.
3). Patch Matching:
Surviving patches with a classification score larger than 0.2 for both Hologic and GE were passed into the matching model. The same random augmentation as the patch classifier training was also applied, but the random cropping and scaling factors for each ipsilateral pair were synchronized to learn the relative size relation. In object re-identification, the sampling of positive and negative pairs is critical. The following possible combinations of ipsilateral pairs for true-positive (TP) and false-positive (FP) patches were randomly sampled in equal ratios during training:
TP-TP positive pairs from the same cancer case.
TP-TP negative pairs from two different cancer cases.
TP-FP negative pairs from a cancer and a normal case.
FP-FP negative pairs from two different normal cases.
During training, only the first combination above was labeled as positive, while the others were all negatives that were intentionally defined to reduce any accidental pairing. During inference, exhaustive ipsilateral pairs were formed regardless of the TP or FP label.
For each batch, there were 16 positive pairs and 48 negative pairs for a batch size of 64. The entire model remained trainable. The best model iteration was selected based on batch-level classification AUC, which reached 0.95 and 0.92 respectively for Hologic and GE testing datasets.
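A minimal sketch of such a pair-sampling routine is given below; the record fields ('exam_id', 'lesion_id') are hypothetical names, and for brevity the sketch does not re-check that negative pairs come from different cases.

```python
import random

def sample_training_pair(tp_patches, fp_patches):
    """Draw one ipsilateral training pair from the four combinations listed
    above, chosen with equal probability. Only the TP-TP pair from the same
    exam and lesion ID is labeled positive (label 1)."""
    combo = random.choice(['tp_tp_pos', 'tp_tp_neg', 'tp_fp_neg', 'fp_fp_neg'])
    if combo == 'tp_tp_pos':
        a = random.choice(tp_patches)
        same_lesion = [p for p in tp_patches
                       if p['exam_id'] == a['exam_id']
                       and p['lesion_id'] == a['lesion_id'] and p is not a]
        b = random.choice(same_lesion) if same_lesion else a
        return a, b, 1
    if combo == 'tp_tp_neg':
        a, b = random.sample(tp_patches, 2)       # ideally from two different cancer cases
    elif combo == 'tp_fp_neg':
        a, b = random.choice(tp_patches), random.choice(fp_patches)
    else:  # 'fp_fp_neg'
        a, b = random.sample(fp_patches, 2)       # ideally from two different normal cases
    return a, b, 0
```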
4). Ipsilateral Refinement:
Next, the detection was refined using the ipsilateral modifiers. This stage was trained on the lesion detection pool that survived the classification stage and NMS operation. Additionally, ipsilateral pairs were excluded when the difference in lesion-to-nipple distance was larger than 5 cm. For each lesion candidate, the matching probability was set to 0 if no valid ipsilateral detection was found.
The trainable components of the refinement module were three independent regressor heads. To train the three regressors, we first replicated the patch classifier's data pipeline, model architecture, and trained weights while attaching the three randomly initialized regressor heads. Only the newly initialized regressors remained trainable. The same augmentation used during patch classifier training was performed to prevent over-fitting. The development dataset also contained a small percentage of cases with missing ipsilateral views, for which the ipsilateral modifier was set to 0. During inference, the latent features extracted by the patch classifier model were fed to the trained regressor heads to obtain the three weighting factors.
5). Relation Block Patch Classifier Baseline:
To compare against our proposed method, the matching and refinement models were replaced with the well-known relation block patch classifier, which can theoretically play the same role. Specifically, all lesion candidates surviving the single-view detection stage were passed into the relation block patch classifier. Each batch contained 16 sets of ipsilateral pairs. For each batch, the top 8 lesion candidates in each view were listed. All exhaustive ipsilateral relations were computed through the relation block. The key, value, and query transformation functions [13] were implemented as three independent convolution layers, each taking the input feature dimension and outputting a transformed feature dimension. The lesion-to-nipple distance and lesion-to-pectoral-muscle distance were embedded and passed into the relation block as geometric features.
IV. EXPERIMENTS
A. DBT Dataset
To develop our proposed framework, we used a large-scale, highly curated dataset that has been used for commercial algorithm development (iCAD Inc., Nashua NH). The dataset contained 4182 Hologic and 3852 General Electric (GE) screening tomosynthesis cases for a total of 8034 cases. The cases were collected from 27 institutions across the United States and Europe between 2011 and 2019. The detailed train/test split is shown in Table I. Lesion level contour, type, and Breast Imaging Reporting and Data System (BIRADS) findings were annotated by Mammography Quality Standards Act (MQSA) certified radiologists with access to both radiology and pathology reports. Case-level lesion IDs were assigned to establish ipsilateral lesion pair relations.
TABLE I: Detailed train/test split of the DBT development dataset (cases, views, and cancer lesion annotations).

| | Train Case | Train View | Train Annotation | Test Case | Test View | Test Annotation |
| --- | --- | --- | --- | --- | --- | --- |
| Hologic Malignant | 729 | 2875 | 1578 | 769 | 3107 | 1661 |
| Hologic Benign | 327 | 1224 | / | 362 | 1357 | / |
| Hologic Normal | 982 | 3944 | / | 1013 | 4060 | / |
| GE Malignant | 684 | 1049 | 1097 | 678 | 998 | 1054 |
| GE Benign | 343 | 1053 | / | 334 | 985 | / |
| GE Normal | 913 | 2008 | / | 900 | 2015 | / |
An in-house breast segmentation model was used to detect the pectoral muscle in the MLO views. The nipple location was then set to the farthest point from the chest wall that intersects the breast skin mask. All geometric distances for detection candidates were then computed along the chest-wall-to-nipple datum line. All tomosynthesis slices and generated patches were normalized according to the vendor-provided DICOM window-level setting.
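As a small illustration of this geometry step, the nipple location can be picked as the skin point farthest from the chest wall. The sketch below assumes the chest wall lies along the left image edge, which is a simplification of the actual pectoral-muscle-based computation.

```python
import numpy as np

def estimate_nipple_location(skin_mask):
    """Return the (row, col) of the skin-mask pixel farthest from the chest
    wall, assuming (for illustration) the chest wall runs along column 0 so
    that the distance from the chest wall is simply the column index."""
    rows, cols = np.nonzero(skin_mask)
    if len(cols) == 0:
        raise ValueError("empty skin mask")
    farthest = int(np.argmax(cols))
    return int(rows[farthest]), int(cols[farthest])
```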
B. CBIS-DDSM Dataset
To demonstrate the generalizability of our method, our framework was re-implemented using the publicly available CBIS-DDSM (Curated Breast Imaging Subset of DDSM) [40] dataset, which comes from a very different domain because these digitized mammograms are 2D images of poor image quality. To facilitate comparison with the existing literature, we used the subset of mass cases, which contains 691 and 202 patients with ipsilateral views in the train and test sets, respectively.
C. Experimental Settings
Our main comparison experiments were conducted on our enriched DBT dataset with the cancer, benign, and normal cases specified in Table I. Only cancer annotations were treated as positive detection targets. Performance was evaluated using both LROC and FROC analysis. The following rules were used to determine true-positive (TP), false-positive (FP), and true-negative (TN) outcomes:
A cancer lesion detection was counted as a TP only if the center (x, y, z) of the detection bounding box fell within the contour of the reference standard annotation and the detection score was above a fixed threshold.
A cancer view/case was counted as a TP if at least one annotation within the view/case had a TP.
All detections other than TPs were counted as FPs.
A non-cancer case was counted as a TN only if all FP detection candidates within the case were below a certain threshold.
For LROC analysis, the case sensitivity is calculated as the fraction of cancer cases with a TP detection, and the case specificity is calculated as the fraction of noncancer cases without any FP detection. The LROC curve of a low-specificity system will fall below the diagonal line. For FROC analysis, the ROI sensitivity is calculated as the fraction of cancer lesions with a TP detection and is plotted against the average number of FPs per DBT view.
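For concreteness, a minimal sketch of one case-level operating point under the rules above is shown below; how per-case scores are aggregated into the two inputs is an assumption for illustration, and sweeping the threshold traces out the case-level curve.

```python
import numpy as np

def case_operating_point(cancer_case_tp_scores, noncancer_case_fp_scores, threshold):
    """Case sensitivity and specificity at one detection-score threshold.

    `cancer_case_tp_scores`: per cancer case, the highest score among detections
    that hit a reference lesion (use -inf if no lesion was hit).
    `noncancer_case_fp_scores`: per non-cancer case, the highest FP detection
    score (use -inf if the case had no detections)."""
    tp_scores = np.asarray(cancer_case_tp_scores, dtype=float)
    fp_scores = np.asarray(noncancer_case_fp_scores, dtype=float)
    sensitivity = float(np.mean(tp_scores >= threshold))   # cancer cases with a TP above threshold
    specificity = float(np.mean(fp_scores < threshold))    # non-cancer cases with all FPs below threshold
    return sensitivity, specificity
```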
In the following sections, we first present a comparison study of our choice of single-view detection baseline. Then, we compare IMR-Net against its single-view counterpart and against the relation block variation. After that, we present an ablation study to understand the impact of each term in our modifier formulation. Lastly, to demonstrate the generalizability of the proposed method on 2D mammography exams, we apply the identical framework to the CBIS-DDSM dataset for mass lesion detection.
D. CAD Framework Evaluation Results
1). Comparison against SOTA Object Detectors:
First, the proposed single-view detection module was compared against other state-of-the-art (SOTA) detectors trained on the same dataset. The results are shown in Fig. 5. All stand-alone detection models performed worse than our cascaded setup (p < 0.0001), which has a detection model followed by a patch classifier. This result shows the patch classifier significantly improves the separability of cancer versus noncancer candidates on top of the selected detection model. On the other hand, when YOLO was replaced with a Mask-RCNN model, the overall detection performance remained nearly identical. This indicates the performance improvement of Mask-RCNN is redundant with that of cascading a patch classifier, rendering the more complicated Mask-RCNN architecture unnecessary. Moreover, the focal loss used in Retina-Net training degraded the detection performance. We hypothesize this is because the focal loss emphasizes the more difficult false-positive detections, further skewing the cancer-versus-noncancer sample imbalance.
2). Comparison against Single-View Baseline:
Next, IMR-Net was compared against the single-view baseline described above using location-based evaluation metrics. The results are shown in Fig. 6. For both vendors, IMR-Net outperformed the single-view baseline substantially, as shown by the AUC of the case-level ROC curves. First, performance was measured as the partial area under the ROC curve (pAUC) in the high-sensitivity range of 90%–100%, which is clinically more relevant than examining the entire ROC curve. In nonparametric testing, we compared the pAUC of IMR-Net to that of the single-view baseline, resulting in a statistically significant improvement for the IMR-Net method (p < 0.0001) for both vendors. Second, at the operating points where the model is most likely to be deployed in the clinical workflow, the Hologic model at a case sensitivity of 92.5% gained 10% in specificity (p < 0.0001), and the GE model at a case sensitivity of 88% gained 7% in specificity (p < 0.0001).
Qualitatively, the effect of ipsilateral pairing is illustrated in the concrete examples shown in Fig. 4. For the cancer case, the α term increased the lesion score due to the strong ipsilateral match. For the noncancer FP, the β term reduced the score. The overall trends are visualized in Fig. 9 by comparing each ipsilateral refined score versus the single-view baseline. In these figures, the identity diagonal represents no change from ipsilateral refinement. Scores for most cancer and benign cases were shifted above the diagonal, which was expected since the matching of these actionable lesions across the two views led to greater confidence. Conversely, normal FPs were shifted below the diagonal. The matching effect, or distance from the diagonal, was stronger for Hologic than GE, which led to the difference in performance shown previously in Fig. 6. These differences may be driven by the greater z-resolution and thus higher inter-slice variation in the GE DBT images. Thus, cancer-related imaging features might be spread across more than 3 DBT slices, causing a reduction in detection sensitivity in our current model architecture. Future work may address these issues with volumetric feature aggregation techniques and more robust lesion matching model designs.
3). Comparison against Relation Block:
We next constructed a relation block variation of the ipsilateral detection pipeline, with results shown in Fig. 7. Despite its greater flexibility in relation modeling, the relation block variation was inferior to IMR-Net. The superiority of IMR-Net may be attributed to two factors. First, the ipsilateral matching module was trained with strong supervision from lesion pair labels. In contrast, the relation block could only perform relation learning through back-propagating information from the lesion labels (cancer vs. noncancer) by weak supervision. Second, the relation block training required a mini-batch configuration that inherently emphasized more difficult lesion candidates due to the top-detection sampling. We observed that merely training a generic single-view cascaded classifier with this batch configuration resulted in a performance trade-off between the high- and low-sensitivity regions of the LROC curve.
Additionally, the relation block module showed an FROC gain compared to its own single-view baseline, but that gain did not transfer to the case-level LROC. Clinically, a screening case can only be scored as negative if the CAD system suppresses all FP detections on all views. The observed FROC performance gain came from reducing the number of FP detections, but this would affect only the CAD user's interactive experience rather than case-level cancer detection performance. This experiment showed the relation block module reduced the average number of FPs a CAD user needs to dismiss but at the same time increased the number of cases with at least one FP detection. This indicates that ROI-level FROC alone is not sufficient to evaluate CAD system performance.
4). Comparisons to 2D Mammography Studies:
Screening mammography and DBT exams have the same exam structure with two CC and two MLO views for each breast. The difference between mammography and DBT CAD design is mainly twofold. First, the volumetric nature of DBT images requires the slice-based detection framework to merge redundant detections at different depth locations. Second, processing a large number of slices in tomosynthesis is more computationally intensive and requires model architecture optimization for efficient field deployment. However, the implementation of the multi-view reading strategies should generalize to both modalities.
In Fig. 8, we compared our proposed framework against other existing mammography mass detection methods [32], [41]–[44] on the CBIS-DDSM dataset using cancer lesion detection FROC. Consistent with earlier studies, only the mass subset of the CBIS-DDSM dataset was used for training, and the same FROC metric was used for performance comparison. Each module was trained from scratch using only the CBIS-DDSM dataset. As seen with the DBT results above, the 2D mammography performance also improved from the single-view to the ipsilateral model. Additionally, our simple models match or outperform other published operating points.
E. Ablation Study
We performed an ablation study to understand the role of each term in the matching refinement described in Eq. 5. The results are shown in Fig. 10.
γ Term Only: The model was retrained after the α and β terms were removed, resulting in a single-view cascade module with the bias term but without any ipsilateral matching. The decreased performance showed that the bias term alone was not sufficient as a fine-tuning step.
Non-adaptive Weighting Factors: To investigate the effectiveness of per-sample weighting factors, we optimized a set of constant scalar α, β, and γ values for the entire training dataset and evaluated those on the testing dataset. This model achieved moderate gains at high sensitivities but yielded the lowest overall AUC. This simplified approach could not address common screening complexities. For example, certain soft-tissue lesions such as architectural distortions and asymmetries may be most visible on one view only; such cases would be heavily penalized by ipsilateral matching rules that are global.
Modifier Regressor Head: To demonstrate the importance of the clinical prior knowledge formulated in Eq. 5, we replaced that function with a regressor network to make the modifier generic. A neural network with sufficient parameterization is a universal function approximator [45] and thus should be able to reproduce our ipsilateral refinement logic. The regressor took the extracted 1280-length feature vector as well as the scalar matching probability p_m as input. This model was noticeably worse than IMR-Net at high sensitivities. Even with this large dataset, this experiment demonstrated the importance of proper inductive bias in network architecture design.
Geometric Information: To show the contribution of the geometric information, we retrained the ipsilateral stage without the position encoding in the ipsilateral matching model. This model achieved gains compared to the single-view baseline but performed consistently below the full IMR-Net at clinically important operating points.
V. RUN-TIME ANALYSIS
Compared to the 2D mammography CAD workflow, the difficulty of optimizing DBT CAD run-time mainly comes from the candidate generation step that processes a large number of DBT slices. As shown in Table II, run-time was profiled on the Hologic DBT test set. The testing workstation had a single RTX 3090 GPU and an Intel i9-10900X CPU, and read data from a 1 TB SATA SSD drive. Almost all exams have 4 standard-view volumes with an average of 69 slices per view. On average, our implemented YOLO model took 8.6 s to process the 4 standard-view volumes, while Retina-Net and Mask-RCNN took 9.5 s and 28.6 s, respectively.
TABLE II: Run-time breakdown per DBT case on the Hologic test set.

| Stage | Average Time / DBT Case (secs) |
| --- | --- |
| Load Volumes | 6.9 |
| Candidate Generation | 8.6 |
| Crop Patches | 3.3 |
| Patch Classifier | 1.4 |
| Matching Model | 1.5 |
| Refinement Model | ≪ 0.01 |
| IMR-Net Total | 21.7 |
To put this in perspective, our proposed ipsilateral matching and refinement mechanism was effective and lightweight. For each DBT case, the ipsilateral components added negligible time (1.5 s) compared to the single-view detector (20.2 s).
VI. LIMITATION AND FUTURE WORK
The multi-view framework improved breast tomosynthesis screening performance by matching lesions between the two ipsilateral screening views of each breast. This study had several limitations. Although the model is designed to handle the small minority of lesions visible only in one view, there may be additional benefit from training with more single-view lesions. To facilitate interpretability and model training, the ipsilateral lesions were paired using handcrafted greedy matching, which may be optimized further by modules that are trainable. Finally, benign and cancer lesions behaved similarly during ipsilateral matching. In clinical evaluations that treat benign lesions as negatives, such matching may increase the false positives, but this should be acceptable since those benign lesions were actionable. This study presented the framework architecture and the design of its major components. Future work may extend this framework for other applications that can benefit from integrating information across multiple image views, such as exams that combine MLO tomosynthesis with CC mammography views as well as temporal matching of current versus prior exams.
VII. CONCLUSION
We have presented an ipsilateral breast cancer detection framework that significantly outperformed its single-view counterpart. Although designed to optimize a specific task, this study conveys lessons that may benefit other problems in medical imaging. This is the first academic study to report results on a large-scale, multinational dataset for commercial CAD development. The high baseline performance can be attributed to the heterogeneous patient data with accurate, lesion-level labels. Using that rich data, we developed the first tomosynthesis CAD system that achieved gains by fusing information from multiple image views. Moreover, the algorithm was designed to be portable in a data-driven way that generalized across several related breast imaging domains. For challenging problems in medical imaging, this combination of large-scale data, accurate labeling, and effective algorithm design all contribute to success.
Acknowledgments
This work was supported in part by a research agreement with iCAD Inc. and by the National Cancer Institute of the National Institutes of Health (U01-CA214183).
Contributor Information
Yinhao Ren, Pratt School of Engineering, Duke University, Durham, NC 27708 USA.
Xuan Liu, Pratt School of Engineering, Duke University, Durham, NC 27708 USA.
Jun Ge, iCAD Inc, Nashua, NH 03062 USA.
Zisheng Liang, iCAD Inc, Nashua, NH 03062 USA.
Xiaoming Xu, Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710 USA.
Lars J. Grimm, Department of Radiology, Duke University, Durham, NC 27710 USA.
Jonathan Go, iCAD Inc, Nashua, NH 03062 USA.
Jeffrey R. Marks, Department of Surgery, Duke University, Durham, NC 27710 USA
Joseph Y. Lo, Department of Radiology, Duke University, Durham, NC 27710 USA.
REFERENCES
- [1]. Lowry KP, Coley RY, Miglioretti DL, Kerlikowske K, Henderson LM, Onega T, Sprague BL, Lee JM, Herschorn S, Tosteson ANA, Rauscher G, and Lee CI, "Screening Performance of Digital Breast Tomosynthesis vs Digital Mammography in Community Practice by Patient Age, Screening Round, and Breast Density," JAMA Network Open, vol. 3, p. e2011792, 07 2020.
- [2]. Conant EF, Barlow WE, Herschorn SD, Weaver DL, Beaber EF, Tosteson ANA, Haas JS, Lowry KP, Stout NK, Trentham-Dietz A, diFlorio Alexander RM, Li CI, Schnall MD, Onega T, Sprague BL, and the Population-based Research Optimizing Screening Through Personalized Regimen (PROSPR) Consortium, "Association of Digital Breast Tomosynthesis vs Digital Mammography With Cancer Detection and Recall Rates by Age and Breast Density," JAMA Oncology, vol. 5, pp. 635–642, 05 2019.
- [3]. Lehman CD, Wellman RD, Buist DSM, Kerlikowske K, Tosteson ANA, and Miglioretti DL, "Diagnostic accuracy of digital screening mammography with and without computer-aided detection," JAMA Internal Medicine, vol. 175, no. 11, pp. 1828–1837, 2015.
- [4]. Kooi T, Litjens G, van Ginneken B, Gubern-Mérida A, Sánchez CI, Mann R, den Heeten A, and Karssemeijer N, "Large scale deep learning for computer aided detection of mammographic lesions," Medical Image Analysis, vol. 35, pp. 303–312, 2017.
- [5]. Geras K, Wolfson S, Kim S, Moy L, and Cho K, "High-resolution breast cancer screening with multi-view deep convolutional neural networks," 03 2017.
- [6]. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, Back T, Chesus M, Corrado GS, Darzi A, Etemadi M, Garcia-Vicente F, Gilbert FJ, Halling-Brown M, Hassabis D, Jansen S, Karthikesalingam A, Kelly CJ, King D, Ledsam JR, Melnick D, Mostofi H, Peng L, Reicher JJ, Romera-Paredes B, Sidebottom R, Suleyman M, Tse D, Young KC, De Fauw J, and Shetty S, "International evaluation of an AI system for breast cancer screening," Nature, vol. 577, pp. 89–94, Jan 2020.
- [7]. Kim H-E, Kim H, Han B-K, Kim KH, Han K, Nam H, Lee E, and Kim E-K, "Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study," The Lancet Digital Health, vol. 2, 02 2020.
- [8]. Shoshan Y, Bakalo R, Gilboa-Solomon F, Ratner V, Barkan E, Ozery-Flato M, Amit M, Khapun D, Ambinder E, Oluyemi E, Panigrahi B, DiCarlo P, Rosen-Zvi M, and Mullen L, "Artificial intelligence for reducing workload in breast cancer screening with digital breast tomosynthesis," Radiology, 01 2022.
- [9]. Hagos Y, Gubern-Mérida A, and Teuwen J, "Improving breast cancer detection using symmetry information with deep learning," in MICCAI 2018, 09 2018.
- [10]. Sainz de Cea MV, Diedrich K, Bakalo R, Ness L, and Richmond D, "Multi-task learning for detection and classification of cancer in screening mammography," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 (Martel AL, Abolmaesumi P, Stoyanov D, Mateus D, Zuluaga MA, Zhou SK, Racoceanu D, and Joskowicz L, eds.), Springer International Publishing, 2020.
- [11]. Ma J, Liang S, Li X, Li H, Menze B, Zhang R, and Zheng W, "Cross-view relation networks for mammogram mass detection," 2019.
- [12]. Liu Y, Zhang F, Zhang Q, Wang S, Wang Y, and Yu Y, "Cross-view correspondence reasoning based on bipartite graph convolutional network for mammogram mass detection," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [13]. Wang X, Girshick R, Gupta A, and He K, "Non-local neural networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
- [14]. Yang Z, Cao Z, Zhang Y, Han M, Xiao J, Huang L, Wu S, Ma J, and Chang P, "MommiNet: Mammographic multi-view mass identification networks," in MICCAI 2020 (Martel AL, Abolmaesumi P, Stoyanov D, Mateus D, Zuluaga MA, Zhou SK, Racoceanu D, and Joskowicz L, eds.), Springer International Publishing, 2020.
- [15]. Yang Z, Cao Z, Zhang Y, Tang Y, Lin X, Ouyang R, Wu M, Han M, Xiao J, Huang L, Wu S, Chang P, and Ma J, "MommiNet-v2: Mammographic multi-view mass identification networks," Medical Image Analysis, vol. 73, p. 102204, 2021.
- [16]. Mordang J-J, Janssen T, Bria A, Kooi T, Gubern-Mérida A, and Karssemeijer N, "Automatic microcalcification detection in multi-vendor mammography using convolutional neural networks," in MICCAI – International Workshop on Breast Imaging, 06 2016.
- [17]. Samala R, Chan H-P, Hadjiiski L, Helvie M, Cha K, and Richter C, "Multi-task transfer learning deep convolutional neural network: Application to computer-aided diagnosis of breast cancer on mammograms," Physics in Medicine and Biology, vol. 62, 10 2017.
- [18]. Ribli D, Horváth A, Unger Z, Pollner P, and Csabai I, "Detecting and classifying lesions in mammograms with deep learning," Scientific Reports, vol. 8, 03 2018.
- [19]. Samala R, Chan H-P, Hadjiiski L, Helvie M, Richter C, and Cha K, "Breast cancer diagnosis in digital breast tomosynthesis: Effects of training sample size on multi-stage transfer learning using deep neural nets," IEEE Transactions on Medical Imaging, pp. 1–1, 09 2018.
- [20]. Winkel S, Rodríguez-Ruiz A, Appelman L, Gubern-Mérida A, Karssemeijer N, Teuwen J, Wanders A, Sechopoulos I, and Mann R, "Impact of artificial intelligence support on accuracy and reading time in breast tomosynthesis image interpretation: a multi-reader multi-case study," European Radiology, vol. 31, 05 2021.
- [21]. Yousefi M, Krzyżak A, and Suen C, "Mass detection in digital breast tomosynthesis data using convolutional neural networks and multiple instance learning," Computers in Biology and Medicine, vol. 96, 05 2018.
- [22]. Fan M, Li Y, Zheng S, Peng W, Tang W, and Li L, "Computer-aided detection of mass in digital breast tomosynthesis using a faster region-based convolutional neural network," Methods, vol. 166, 02 2019.
- [23]. Buda M, Saha A, Walsh R, Ghate S, Li N, Swiecicki A, Lo J, and Mazurowski M, "A data set and deep learning algorithm for the detection of masses and architectural distortions in digital breast tomosynthesis images," JAMA Network Open, vol. 4, p. e2119100, 08 2021.
- [24]. Park J, Shoshan Y, Martí R, Campo P, Ratner V, Khapun D, Zlotnick A, Barkan E, Gilboa F, Chledowski J, Witowski J, Millet A, Kim E, Lewin A, Pyrasenko K, Chen S, Goldberg J, Patel S, Plaunova A, and Geras K, "Lessons from the first DBTex challenge," Nature Machine Intelligence, vol. 3, 07 2021.
- [25]. Lotter W, Diab A, Haslam B, Kim J, Grisot G, Wu E, Wu K, Onieva J, Boxerman J, Wang M, Bandler M, Vijayaraghavan G, and Sorensen A, "Robust breast cancer detection in mammography and digital breast tomosynthesis using an annotation-efficient deep learning approach," 12 2019.
- [26]. Chan H-P, Samala R, Hadjiiski L, and Zhou C, "Deep learning in medical image analysis," Advances in Experimental Medicine and Biology, vol. 1213, pp. 3–21, 02 2020.
- [27]. Hu H, Gu J, Zhang Z, Dai J, and Wei Y, "Relation networks for object detection," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3588–3597, 2018.
- [28]. Dai B, Zhang Y, and Lin D, "Detecting visual relationships with deep relational networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [29]. Cao Y, Xu J, Lin S, Wei F, and Hu H, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," arXiv preprint arXiv:1904.11492, 2019.
- [30]. Ren Y, Lu J, Liang Z, Grimm LJ, Kim C, Taylor-Cho M, Yoon S, Marks JR, and Lo JY, "Retina-Match: Ipsilateral mammography lesion matching in a single shot detection pipeline," in MICCAI 2021, Springer International Publishing, 2021.
- [31]. Redmon J, Divvala S, Girshick R, and Farhadi A, "You only look once: Unified, real-time object detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [32]. Ren S, He K, Girshick R, and Sun J, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, vol. 28, Curran Associates, Inc., 2015.
- [33]. He K, Gkioxari G, Dollár P, and Girshick RB, "Mask R-CNN," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
- [34]. Schroff F, Kalenichenko D, and Philbin J, "FaceNet: A unified embedding for face recognition and clustering," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [35]. Schroff F, Kalenichenko D, and Philbin J, "FaceNet: A unified embedding for face recognition and clustering," pp. 815–823, 06 2015.
- [36]. Lu Z, Rathod V, Votel R, and Huang J, "RetinaTrack: Online single stage joint detection and tracking," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [37]. Lin T-Y, Goyal P, Girshick RB, He K, and Dollár P, "Focal loss for dense object detection," in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007, 2017.
- [38]. Redmon J and Farhadi A, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [39]. You C, Zhou Y, Zhao R, Staib L, and Duncan JS, "SimCVD: Simple contrastive voxel-wise representation distillation for semi-supervised medical image segmentation," IEEE Transactions on Medical Imaging, vol. 41, no. 9, pp. 2228–2237, 2022.
- [40]. Lee R, Gimenez F, Hoogi A, Miyake K, Gorovoy M, and Rubin D, "A curated mammography data set for use in computer-aided detection and diagnosis research," Scientific Data, vol. 4, p. 170177, 12 2017.
- [41]. Campanini R, Dongiovanni D, Iampieri E, Lanconelli N, Masotti M, Palermo G, Riccardi A, and Roffilli M, "A novel featureless approach to mass detection in digital mammograms based on support vector machines," Physics in Medicine & Biology, vol. 49, no. 6, p. 961, 2004.
- [42]. Eltonsy NH, Tourassi GD, and Elmaghraby AS, "A concentric morphology model for the detection of masses in mammography," IEEE Transactions on Medical Imaging, vol. 26, no. 6, pp. 880–889, 2007.
- [43]. Sampat MP, Bovik AC, Whitman GJ, and Markey MK, "A model-based framework for the detection of spiculated masses on mammography," Medical Physics, vol. 35, no. 5, pp. 2110–2123, 2008.
- [44]. Ma J, Li X, Li H, Wang R, Menze B, and Zheng W-S, "Cross-view relation networks for mammogram mass detection," in 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021.
- [45]. Hornik K, Stinchcombe M, and White H, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.