Abstract: Vision-language models (VLMs), such as CLIP, have shown remarkable capabilities in downstream tasks. However, the coupling of semantic information between the foreground and the background ...