Abstract: Visual-language foundation models, like CLIP, learn generalized representations that enable zero-shot open-set classification. Few-shot adaptation methods, based on prompt tuning, have been ...
Abstract: CLIP, a foundational vision-language model, has emerged as a powerful tool for open-vocabulary semantic segmentation. While freezing the text encoder preserves its powerful embeddings, ...