Neural and computational evidence reveals that real-world size is a temporally late, semantically grounded, and hierarchically stable dimension of object representation in both human brains and ...
To address the degradation of visual-language (VL) representations during supervised fine-tuning (SFT) of vision-language-action (VLA) models, we introduce Visual Representation Alignment. During SFT, we pull a VLA’s visual tokens ...
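The idea of pulling visual tokens toward a reference during SFT can be illustrated with a small auxiliary loss. The sketch below is only an illustration under stated assumptions: the excerpt is truncated before it names the alignment target, so `reference_tokens` is a hypothetical stand-in for whatever features the tokens are aligned to, and the cosine-distance form, the function names, and the `weight` value are all assumptions rather than the authors' formulation.

```python
import math


def cosine_alignment_loss(visual_tokens, reference_tokens):
    """Mean (1 - cosine similarity) over paired token vectors.

    Assumption: both inputs are equal-length lists of equal-dimension
    vectors. `reference_tokens` is a hypothetical alignment target; the
    source text does not say what the visual tokens are pulled toward.
    """
    if not visual_tokens or len(visual_tokens) != len(reference_tokens):
        raise ValueError("need two non-empty token lists of equal length")
    total = 0.0
    for v, r in zip(visual_tokens, reference_tokens):
        dot = sum(a * b for a, b in zip(v, r))
        norm_v = math.sqrt(sum(a * a for a in v))
        norm_r = math.sqrt(sum(b * b for b in r))
        # Small epsilon guards against division by zero for null vectors.
        total += 1.0 - dot / (norm_v * norm_r + 1e-8)
    return total / len(visual_tokens)


def total_sft_loss(action_loss, visual_tokens, reference_tokens, weight=0.1):
    """Add the alignment term to the SFT objective (weight is an assumed value)."""
    return action_loss + weight * cosine_alignment_loss(visual_tokens, reference_tokens)
```

In this sketch the alignment term is near zero when each visual token points in the same direction as its reference and grows toward 1 as they become orthogonal, so minimizing the combined loss discourages the visual tokens from drifting during fine-tuning.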