Visual Intelligence Online Seminar: Echocardiography Foundation Models in Practice

Abstract

Echocardiography is the first-line imaging modality for assessing cardiac structure and function, and cardiovascular disease remains a leading cause of death worldwide. Recent echocardiography foundation models (FMs) have demonstrated strong multi-view, multi-task performance for interpretation, classification, and clinical estimation, yet their robustness for dense regression tasks is less established. Here, we evaluate FM-based video encoders for spatio-temporal left ventricular (LV) landmark detection on EchoNet-Dynamic, using two state-of-the-art systems: EchoPrime, a multi-view vision–language model trained with contrastive learning on over 12 million video–report pairs and augmented with view-informed anatomical attention and multiple-instance weighting, and PanEcho, a unified multitask model trained on large-scale labeled echocardiography for broad diagnostic and measurement prediction across views. We compare frozen, partially fine-tuned, and fully trainable adaptation regimes for precise landmark regression. Frozen FM encoders underperform, whereas selective fine-tuning, which unfreezes only the final transformer blocks, recovers most of the gains of full end-to-end training. Finally, we show that a graph-based decoder encoding LV contour anatomy and temporal motion consistently improves accuracy, achieving state-of-the-art regression performance while remaining computationally efficient.
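
To make the three adaptation regimes concrete, the following is a minimal, framework-agnostic sketch of how one might decide which parameters stay trainable. It is an illustration only, not the configuration used in this work: the function name select_trainable, the parameter naming scheme ("encoder.blocks.<i>." for transformer blocks, anything outside "encoder." for the decoder), and the default of two unfrozen blocks are all assumptions.

```python
import re

# Hypothetical naming scheme: encoder transformer blocks are assumed to be
# named "encoder.blocks.<index>.<...>"; this is not taken from EchoPrime or PanEcho.
BLOCK_RE = re.compile(r"^encoder\.blocks\.(\d+)\.")


def select_trainable(param_names, regime, last_k=2):
    """Return the set of parameter names left trainable under a regime.

    regime: "frozen"  -> encoder fixed, only non-encoder parameters train
            "partial" -> additionally train the last `last_k` encoder blocks
            "full"    -> every parameter trains
    """
    if regime not in ("frozen", "partial", "full"):
        raise ValueError(f"unknown regime: {regime!r}")

    names = list(param_names)

    # Infer the number of encoder blocks from the highest block index seen.
    block_ids = []
    for name in names:
        match = BLOCK_RE.match(name)
        if match:
            block_ids.append(int(match.group(1)))
    num_blocks = max(block_ids) + 1 if block_ids else 0

    trainable = set()
    for name in names:
        if not name.startswith("encoder."):
            # Decoder / regression head: trained in every regime.
            trainable.add(name)
        elif regime == "full":
            trainable.add(name)
        elif regime == "partial":
            match = BLOCK_RE.match(name)
            if match and int(match.group(1)) >= num_blocks - last_k:
                trainable.add(name)
        # regime == "frozen": encoder parameters are skipped.
    return trainable


if __name__ == "__main__":
    demo_names = [
        "encoder.patch_embed.weight",
        "encoder.blocks.0.attn.weight",
        "encoder.blocks.1.attn.weight",
        "encoder.blocks.2.attn.weight",
        "encoder.blocks.3.attn.weight",
        "decoder.out.weight",
    ]
    for regime in ("frozen", "partial", "full"):
        print(regime, sorted(select_trainable(demo_names, regime)))
```

In a deep learning framework, the returned set would typically be used to toggle gradient tracking per parameter before building the optimizer; how many final blocks to unfreeze is a tuning choice that this sketch leaves to the caller.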