- What to show: monotone improvement in the agreement between the stochastic and deterministic gradients as batch size grows.
- Takeaway: in the large-batch regime, training dynamics are governed by the deterministic InfoNCE energy.
✨ TL;DR: We identify a geometric bifurcation in contrastive learning: unimodal InfoNCE has a strictly convex intrinsic landscape with a unique Gibbs equilibrium, while the symmetric multimodal objective carries a negative symmetric divergence that drives "winner-take-all" dynamics and imposes the modality gap.
While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying “uniformity” as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.
We model the embedding space Z as a compact geometric container with a reference volume measure. Encoders push data distributions forward to induce representation measures and positive-pair laws, and InfoNCE becomes an energy functional combining alignment potentials (from positives) and entropic dispersion (from negatives).
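One common way to write such a large-batch energy, shown below as a schematic in our own notation (not necessarily the paper's exact functional; s is a similarity kernel, τ the temperature, π the positive-pair law, μ the representation measure), is, up to additive constants:

```latex
% Schematic large-batch InfoNCE energy over a representation measure \mu on Z
% (notation is ours: s = similarity kernel, \tau = temperature, \pi = positive-pair law).
\mathcal{E}(\mu)
  = \underbrace{-\,\mathbb{E}_{(z,z^{+})\sim \pi}\!\left[\frac{s(z,z^{+})}{\tau}\right]}_{\text{alignment potential}}
  \;+\;
    \underbrace{\mathbb{E}_{z\sim \mu}\!\left[\log \int_{Z} e^{s(z,z')/\tau}\, d\mu(z')\right]}_{\text{entropic dispersion (log-partition)}}
```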
Main takeaway: unimodal InfoNCE admits a Gibbs-type equilibrium dictated by a strictly convex intrinsic landscape, whereas symmetric multimodal InfoNCE contains a structural repulsive divergence term that makes exact marginal matching a knife-edge condition.
🔬 Mechanism checks: (i) large-batch SGD follows a deterministic InfoNCE energy; (ii) unimodal InfoNCE converges to a unique Gibbs equilibrium; (iii) multimodal InfoNCE exhibits a divergence-driven structural modality gap under heterogeneous conditionals.
| Prediction / Claim | Observable Signature | Where |
|---|---|---|
| Large-batch consistency | Stochastic gradient aligns with the deterministic energy gradient as batch size grows. | Figures below · App. D.1 |
| Unimodal strict convexity | Single stable equilibrium; low temperature yields concentration near the Gibbs minimizer. | Figures below · App. D.2 |
| Multimodal negative symmetric divergence | Persistent divergence term; exact marginal matching is knife-edge; stable gap under misalignment. | Figures below · App. D.3 |
We compare the stochastic InfoNCE gradient (finite batch) with the deterministic energy gradient predicted by our large-batch limit. As batch size increases, the gradients become increasingly aligned, and the relative error shrinks—supporting value & gradient consistency.
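A minimal numerical sketch of this check (ours, not the paper's code; fixed toy embeddings on the sphere, dot-product similarity, temperature 0.1) compares the finite-batch InfoNCE gradient for one anchor against a large-pool reference gradient. The printed relative error should shrink as the batch grows, which is the qualitative signature referred to above.

```python
# Toy check of large-batch gradient consistency (our sketch, not the paper's code).
# Setup: fixed anchor z, positive p, and a large pool of negatives on the unit sphere
# in R^d; dot-product similarity s(z, x) = z @ x with temperature tau.
import numpy as np

rng = np.random.default_rng(0)
d, tau, pool_size = 64, 0.1, 100_000

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

z = unit(rng.normal(size=d))
pos = unit(z + 0.1 * rng.normal(size=d))          # positive: a noisy copy of the anchor
pool = unit(rng.normal(size=(pool_size, d)))      # negatives: i.i.d. points on the sphere

def infonce_grad(z, pos, negs, tau):
    """Gradient w.r.t. z of -log softmax over {pos} ∪ negs with logits s(z, x)/tau."""
    cands = np.vstack([pos, negs])                # candidate set: positive first
    logits = cands @ z / tau
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # softmax weights over candidates
    return (-pos + w @ cands) / tau

ref = infonce_grad(z, pos, pool, tau)             # "deterministic" reference: full pool

for batch in [16, 64, 256, 1024, 4096]:
    errs = []
    for _ in range(50):                           # average over resampled mini-batches
        idx = rng.choice(pool_size, size=batch, replace=False)
        g = infonce_grad(z, pos, pool[idx], tau)
        errs.append(np.linalg.norm(g - ref) / np.linalg.norm(ref))
    print(f"batch={batch:5d}  relative gradient error ≈ {np.mean(errs):.3f}")
```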
In the unimodal regime, the intrinsic functional is strictly convex with a unique stable Gibbs equilibrium. Numerically, trajectories converge to the same terminal law across initializations; decreasing temperature produces sharper concentration inside the alignment basin rather than “uniformity for its own sake”.
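In symbols, the equilibrium described here is of Gibbs type. The schematic below uses our own notation (U for the intrinsic alignment potential, vol for the reference volume measure on Z); the paper's precise functional may differ:

```latex
% Schematic Gibbs-type equilibrium for the unimodal intrinsic landscape
% (our notation: U = intrinsic alignment potential on Z, \tau = temperature).
\mu^{*}(dz) \;\propto\; \exp\!\left(-\,\frac{U(z)}{\tau}\right) \mathrm{vol}(dz),
\qquad
\mu^{*} \xrightarrow[\ \tau \to 0\ ]{} \text{concentration near } \arg\min_{z \in Z} U(z).
```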
In the multimodal regime with heterogeneous conditionals, the symmetric objective contains a negative symmetric divergence that persists after kernel sharpening. Numerically, this manifests as a stable population-level separation between learned marginals (a “modality gap”) that grows with misalignment and does not vanish with longer training.
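One simple population-level diagnostic consistent with this description (our sketch, not the paper's exact statistic) is the distance between per-modality centroids of the learned embeddings, alongside the mean cosine similarity of matched pairs:

```python
# Population-level modality-gap diagnostic (our sketch, not the paper's code).
# Given unit-normalized image and text embeddings, report the distance between
# modality centroids and the mean cross-modal cosine similarity of matched pairs.
import numpy as np

def modality_gap(img_emb: np.ndarray, txt_emb: np.ndarray) -> dict:
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    centroid_gap = float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
    paired_cos = float(np.mean(np.sum(img * txt, axis=1)))  # assumes row i of each array is a matched pair
    return {"centroid_gap": centroid_gap, "paired_cosine": paired_cos}

# Synthetic illustration: two point clouds whose means are offset along one coordinate.
rng = np.random.default_rng(0)
n, d, shift = 2_000, 32, 0.8
base = rng.normal(size=(n, d))
img = base + 0.1 * rng.normal(size=(n, d))
txt = base + 0.1 * rng.normal(size=(n, d))
txt[:, 0] += shift                                           # heterogeneous conditionals: offset one modality
print(modality_gap(img, txt))
```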
For full setup details (models, temperatures, batch sizes, seeds), see Appendix D of the paper.
@misc{cai2026geometric,
title = {The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence},
author = {Cai, Yichao and Zhang, Zhen and Liu, Yuhang and Shi, Javen Qinfeng},
year = {2026},
eprint = {2601.19597},
  archivePrefix = {arXiv},
primaryClass = {cs.LG},
doi = {10.48550/arXiv.2601.19597},
url = {https://arxiv.org/abs/2601.19597}
}