I am a third-year Ph.D. student in Computer Science at the Australian Institute for Machine Learning (AIML), University of Adelaide, advised by Prof. Javen Qinfeng Shi. My research studies multimodal representation learning, with a particular focus on contrastive learning theory, cross-modal alignment, and identifiable causal representations.
More broadly, I am interested in how language supervision shapes semantic structure in vision-language models, and how this perspective can support interpretable, reliable, and human-aligned AI systems. My work combines theoretical analysis with empirical study to better understand representation formation in modern multimodal models.
Research interests: multimodal learning, contrastive learning theory, identifiability, causal representation learning, and vision-language models.
Yichao Cai; Zhen Zhang; Yuhang Liu; Javen Q. Shi.
arXiv preprint 2026
A theoretical study of contrastive learning geometry, alignment forces, dispersion, and the emergence of the modality gap.
Yuhang Liu; Dong Gong; Yichao Cai; Erdun Gao; Zhen Zhang; Biwei Huang; Mingming Gong; Anton van den Hengel; Javen Q. Shi.
International Conference on Learning Representations (ICLR) 2026
An investigation of whether next-token prediction alone is sufficient for learning human-interpretable concepts from data.
Yichao Cai*; Yuhang Liu*; Erdun Gao; Tianjiao Jiang; Zhen Zhang; Anton van den Hengel; Javen Q. Shi. (* equal contribution)
Advances in Neural Information Processing Systems (NeurIPS) 2025, Spotlight
Studies when controlled cross-modal misalignment can improve multimodal representation learning instead of only harming it.
Yichao Cai; Yuhang Liu; Zhen Zhang; Javen Q. Shi.
European Conference on Computer Vision (ECCV) 2024
Explores language-guided disentanglement of style and content through contrastive learning with augmented prompts.