Yichao Cai

Adelaide, Australia

yichao.cai@adelaide.edu.au

I am a third-year Ph.D. student in Computer Science at the Australian Institute for Machine Learning (AIML), Adelaide University (formerly The University of Adelaide), advised by Prof. Javen Qinfeng Shi. Before my Ph.D., I received my M.Sc. and B.Eng. degrees from Wuhan University of Technology. During my M.Sc., I spent five months as a visiting student researcher at California PATH, UC Berkeley.

I study how language supervision shapes the semantics, geometry, and identifiability of multimodal representations. My current research interests span:

representation learning (learning objectives and training paradigms, identifiability theory, semantic structure in learned representations);
vision-language modeling (multimodal alignment, multimodal LLMs, supervision design and data curation);
explainable machine learning (mechanistic interpretability, representation geometry, latent-structure characterization).

News

May 01, 2026	We had 3 papers on representation learning (contrastive learning theory, AI4Science, and graphical modeling) accepted to ICML 2026.
Feb 10, 2026	I attended MLSS Melbourne 2026 and enjoyed learning from world-class speakers and connecting with the community.
Jan 28, 2026	Check out our new preprint: The Geometric Mechanics of Contrastive Representation Learning.
Oct 15, 2025	I served as a guest lecturer in Statistical Machine Learning and presented recent advances in vision-language modeling. Slides.
Sep 19, 2025	Our work On the Value of Cross-Modal Misalignment in Multimodal Representation Learning was selected as a Spotlight at NeurIPS 2025.

Selected Publications

ICML’26
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence

Yichao Cai, Zhen Zhang, Yuhang Liu, and Javen Q. Shi

In International Conference on Machine Learning (ICML), 2026

While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We develop a measure-theoretic framework in which representation measures evolve on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality’s marginal reshapes the effective landscape of the other, allowing strong pairwise alignment to coexist with a persistent modality gap. Controlled synthetic experiments and analyses of pretrained CLIP representations support these predictions. Overall, our results shift the analytical lens from pointwise discrimination to population geometry, showing that pairwise alignment alone is insufficient to control cross-modal marginal structure.

@inproceedings{cai2026geometric,
  title = {The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence},
  author = {Cai, Yichao and Zhang, Zhen and Liu, Yuhang and Shi, Javen Q.},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2026},
}

ICLR’26
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, and Javen Q. Shi

In International Conference on Learning Representations (ICLR), 2026

Recent empirical evidence shows that LLM representations encode human-interpretable concepts. Nevertheless, the mechanisms by which these representations emerge remain largely unexplored. To shed further light on this, we introduce a novel generative model that generates tokens on the basis of such concepts, formulated as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish a rigorous identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts given the input context, up to a linear transformation. This theoretical finding: 1) provides evidence that LLMs capture essential underlying generative factors, 2) offers a unified and principled perspective for understanding the linear representation hypothesis, and 3) motivates a theoretically grounded approach for evaluating sparse autoencoders. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.

@inproceedings{liu2026predict,
  title = {I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?},
  author = {Liu, Yuhang and Gong, Dong and Cai, Yichao and Gao, Erdun and Zhang, Zhen and Huang, Biwei and Gong, Mingming and van den Hengel, Anton and Shi, Javen Q.},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2026},
}

NeurIPS’25
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning

Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, and Javen Q. Shi

In Advances in Neural Information Processing Systems (NeurIPS), 2025 (Spotlight)

Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit cross-modal misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other suggests leveraging it. We seek here to reconcile these seemingly opposing perspectives and to provide a practical guide for practitioners. Using latent variable models, we formalize cross-modal misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are absent in the text, and perturbation bias, where semantic variables are altered. Both lead to misalignment in data pairs. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning.

@inproceedings{cai2025misalignment,
  title = {On the Value of Cross-Modal Misalignment in Multimodal Representation Learning},
  author = {Cai, Yichao and Liu, Yuhang and Gao, Erdun and Jiang, Tianjiao and Zhang, Zhen and van den Hengel, Anton and Shi, Javen Q.},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year = {2025},
}

ECCV’24
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Yichao Cai, Yuhang Liu, Zhen Zhang, and Javen Q. Shi

In European Conference on Computer Vision (ECCV), 2024

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable generalization ability of the learned features. However, the features they learn often blend content and style information, which limits their generalization capabilities under distribution shifts. To address this limitation, we adopt a causal generative perspective for multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations. To achieve this, we begin by exploring image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models to extract pure content features. Taking a step further, recognizing the inherent semantic richness and logical structure of text data, we explore the use of text augmentation to isolate latent content from style features. This enables the encoders of CLIP-like models to concentrate on latent content information, refining the representations learned by pre-trained CLIP-like models. Our extensive experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks, alongside enhanced robustness to various perturbations. These results underscore the effectiveness of our proposed methods in refining vision-language representations and advancing the state of the art in multimodal learning.

@inproceedings{cai2024clap,
  title = {CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts},
  author = {Cai, Yichao and Liu, Yuhang and Zhang, Zhen and Shi, Javen Q.},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2024},
}