Abstract: Vision-language models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage the potential of VLMs in adapting to downstream tasks, context ...
Abstract: Deep multi-modal clustering (DMC) aims to improve clustering performance by exploiting the abundant information available across multiple modalities. However, different modalities usually have ...