Quantifying Relation Between Mean Validation Dice Similarity Coefficient and Clinical Utility in Automated Segmentation

POSTER

Reid D. Jockisch, Connor R. Davey, Bradley P. Sutton, Samuel Hawkins, Art Sedighi, Matthew T. Bramlet

Automated medical image segmentation models have proven invaluable in presurgical planning workflows; however, their utilization is limited by the lack of scalable verification of the clinical accuracy of model outputs. This research attempts to quantify the relationship between the mean validation Dice Similarity Coefficient (DSC) and the clinical accuracy of machine learning (ML) models for automated myocardial segmentation. We hypothesize that DSC values do not correlate with clinical utility rates for myocardial segmentations. To test this, 402 automated computed tomography (CT) segmentations were manually reviewed by a board-certified cardiologist and assessed for clinical accuracy. The assessed accuracy of this dataset was compared against the ~90% mean validation DSC via a one-sample binomial proportion test. Of the 402 cases, 52.7% were judged to be clinically accurate, a proportion significantly lower than the mean validation DSC (p<0.001). We therefore conclude that a high mean validation DSC does not equate to clinical utility, further motivating clinically aligned quality assurance (QA) metrics.
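The one-sample binomial proportion test described above can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code: the count of 212 clinically accurate cases is inferred from the reported 52.7% of 402, and the null proportion of 0.90 is taken from the ~90% mean validation DSC.

```python
from scipy.stats import binomtest

# Assumed counts, inferred from the abstract (52.7% of 402 ≈ 212 accurate cases)
n_cases = 402
n_accurate = 212

# Test the observed clinical-accuracy rate against the ~90% mean validation DSC
result = binomtest(n_accurate, n_cases, p=0.90, alternative="two-sided")

print(f"Observed proportion: {n_accurate / n_cases:.3f}")
print(f"p-value: {result.pvalue:.3g}")
```

With a sample of 402 and an observed rate near 53% against a hypothesized 90%, the resulting p-value falls far below 0.001, consistent with the abstract's reported significance.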
