Publication details
- Journal: Proceedings of Machine Learning Research (PMLR), vol. 307, 2026
- ISSN (electronic): 2640-3498
Abstract

Foundation models (FMs) have shown remarkable capabilities across computer vision tasks, yet their effectiveness on complex medical downstream tasks remains underexplored. This work investigates whether state-of-the-art video-based FMs for echocardiography can perform precise spatio-temporal landmark detection without extensive fine-tuning. We evaluate two recent powerful FMs, EchoPrime and PanEcho, pre-trained on a few million echocardiographic video-text pairs, for left ventricular contour detection at end-diastole (ED) and end-systole (ES) on EchoNet-Dynamic. We compare encoder regimes (frozen, partially frozen, fully trainable) and decoder heads (multilayer perceptron (MLP) vs. graph convolutional network (GCN)), and benchmark against strong non-FM backbones (ResNet-18 2D/3D, ViT-Base, MViTv2-Small). Frozen encoders perform poorly and variably (≈78.00 Dice, ED), whereas selectively unfreezing two encoder blocks with a GCN head and augmentation yields a large jump (91.71 ± 3.49 Dice, ED), recovering most of the gains of full fine-tuning. A fully trainable EchoPrime (GCN + augmentation) achieves 93.13 ± 3.11 / 90.95 ± 3.71 Dice (ED/ES), the state of the art for regression-based models on EchoNet-Dynamic. Deploying a separate, fully fine-tuned model for each task quickly becomes impractical in resource-constrained settings; our results suggest that partially fine-tuning the FM is a resource-efficient strategy that recovers most of the performance benefits of end-to-end training while avoiding the overhead of maintaining a separate model per task. The code is available at https://github.com/preetrajb/EchoVLMLandmarks.
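As a minimal illustration of the partial fine-tuning regime compared above, the PyTorch sketch below freezes a video encoder, re-enables gradients only for its final two blocks (an assumption; the abstract says "two blocks" without naming which), and attaches a small GCN head that regresses contour landmark coordinates over a ring-connected graph. Everything here is hypothetical scaffolding: the `blocks` attribute, `DummyEncoder`, `LandmarkGCN`, the 40-point contour, and all layer sizes stand in for the actual EchoPrime/PanEcho interfaces, which the abstract does not specify.

```python
import torch
import torch.nn as nn


def partially_unfreeze(encoder: nn.Module, n_trainable_blocks: int = 2) -> None:
    """Freeze every encoder parameter, then re-enable gradients only for the
    last `n_trainable_blocks` blocks (assumes a `blocks` ModuleList)."""
    for p in encoder.parameters():
        p.requires_grad = False
    for block in list(encoder.blocks)[-n_trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True


def contour_adjacency(n_points: int) -> torch.Tensor:
    """Ring adjacency for a closed contour (self-loop plus two neighbours),
    row-normalised so each node averages over its neighbourhood."""
    adj = torch.eye(n_points)
    idx = torch.arange(n_points)
    adj[idx, (idx + 1) % n_points] = 1.0
    adj[idx, (idx - 1) % n_points] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)


class LandmarkGCN(nn.Module):
    """Toy GCN head: projects a pooled video embedding to per-landmark node
    features, then mixes neighbouring nodes through a fixed adjacency."""

    def __init__(self, feat_dim: int, n_landmarks: int, adj: torch.Tensor):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.register_buffer("adj", adj)                      # (N, N)
        self.node_proj = nn.Linear(feat_dim, n_landmarks * 128)
        self.gcn1 = nn.Linear(128, 128)
        self.gcn2 = nn.Linear(128, 2)                         # (x, y) per landmark

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (B, feat_dim) pooled video embedding.
        x = self.node_proj(emb).view(-1, self.n_landmarks, 128)
        x = torch.relu(self.adj @ self.gcn1(x))               # neighbourhood mixing
        return self.adj @ self.gcn2(x)                        # (B, N, 2) coordinates


class DummyEncoder(nn.Module):
    """Stand-in for a video FM tower; real encoders expose their own blocks."""

    def __init__(self, dim: int = 512, depth: int = 12):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = torch.relu(block(x))
        return x


encoder = DummyEncoder()
partially_unfreeze(encoder, n_trainable_blocks=2)             # frozen except last 2 blocks
head = LandmarkGCN(feat_dim=512, n_landmarks=40, adj=contour_adjacency(40))

# Only the trainable encoder parameters and the head enter the optimizer.
trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable + list(head.parameters()), lr=1e-4)

coords = head(encoder(torch.randn(4, 512)))                   # (4, 40, 2)
```

The ring adjacency is just one plausible neighbourhood structure for a closed left-ventricular contour; the graph actually used in the paper is not described in the abstract.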