Although recent vision–language models (VLMs) have made remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a ...
Geometry Forcing (GF) Overview. (a) Our proposed GF paradigm enhances video diffusion models by aligning their intermediate representations with geometric features from VGGT. (b) Compared to DFoT, our method generates more temporally ...
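To make the idea of "aligning with geometric features" in panel (a) concrete, the following is a minimal sketch of one common form of feature alignment: penalizing the angle between a model's per-token features and reference features. The caption does not specify the objective, so the cosine-based loss, the function names `cosine_similarity` and `alignment_loss`, and the assumption that both feature sets have already been projected to the same dimension are all illustrative assumptions rather than the method as published.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def alignment_loss(model_feats, geom_feats):
    """Mean of (1 - cosine similarity) over paired per-token features.

    model_feats: hypothetical features taken from the generative model.
    geom_feats:  hypothetical reference features from a geometry encoder.
    Both are lists of vectors, assumed to share the same dimension.
    """
    if not model_feats or len(model_feats) != len(geom_feats):
        raise ValueError("need the same non-zero number of feature vectors")
    total = sum(
        1.0 - cosine_similarity(m, g) for m, g in zip(model_feats, geom_feats)
    )
    return total / len(model_feats)
```

The loss is 0 when every pair of vectors points in the same direction, regardless of magnitude, and grows to 2 when they point in opposite directions. In a training setup of this kind, such a term would typically be added to the generative objective with a weighting coefficient, and the reference encoder would be kept frozen; both choices are assumptions here as well.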