The Geometric Mechanics of Contrastive Representation Learning

Australian Institute for Machine Learning (AIML), The University of Adelaide

TL;DR: We identify a geometric bifurcation in contrastive learning: unimodal InfoNCE minimizes a strictly convex landscape with a unique Gibbs equilibrium, while the symmetric multimodal objective is coupled by a negative symmetric divergence that drives "winner-take-all" dynamics and imposes the modality gap.

Abstract

While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying “uniformity” as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.

Key Contributions

Framework

We model the embedding space Z as a compact manifold equipped with a reference volume measure. Encoders push data distributions forward to induce representation measures and positive-pair laws, and InfoNCE becomes an energy functional that combines alignment potentials (from positives) with entropic dispersion (from negatives).
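Schematically, and in notation we use here only for illustration (the paper's precise definitions are given in the framework section and may differ in convention), the large-batch limit of InfoNCE separates into an alignment term over the positive-pair law and a dispersion term over the representation measure:

```latex
% Schematic large-batch form; \mu is the representation measure on Z, \pi^{+} the
% positive-pair law, \tau the temperature. Illustrative notation, not the paper's exact objective.
\mathcal{L}_{\infty}
  = \underbrace{\mathbb{E}_{(z, z^{+}) \sim \pi^{+}}\!\left[ -\tfrac{1}{\tau} \langle z, z^{+} \rangle \right]}_{\text{alignment potential (positives)}}
  + \underbrace{\mathbb{E}_{z \sim \mu}\!\left[ \log \mathbb{E}_{z^{-} \sim \mu}\, e^{\langle z, z^{-} \rangle / \tau} \right]}_{\text{entropic dispersion (negatives)}}
```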

Analytical roadmap / overview
The geometric bifurcation of contrastive representation learning.

Main takeaway: unimodal InfoNCE admits a Gibbs-type equilibrium dictated by a strictly convex intrinsic landscape, whereas symmetric multimodal InfoNCE contains a structural repulsive divergence term that makes exact marginal matching a knife-edge condition.
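As background for the Gibbs-type equilibrium, recall the classical Gibbs variational principle; the paper's intrinsic functional may differ in detail, but this is the prototype: minimizing a fixed potential plus an entropy penalty at temperature τ yields an explicit exponential density.

```latex
% Classical Gibbs variational principle (background only); V is a fixed potential
% on Z and vol the reference volume measure.
\mu^{\star}
  = \arg\min_{\mu} \left\{ \int_{Z} V \,\mathrm{d}\mu + \tau\, \mathrm{KL}\!\left(\mu \,\Vert\, \mathrm{vol}\right) \right\}
  \quad\Longrightarrow\quad
  \mathrm{d}\mu^{\star}(z) \propto e^{-V(z)/\tau}\, \mathrm{d}\mathrm{vol}(z)
```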

Numerical Validations

🔬 Mechanism checks: (i) large-batch SGD follows a deterministic InfoNCE energy; (ii) unimodal InfoNCE converges to a unique Gibbs equilibrium; (iii) multimodal InfoNCE exhibits a divergence-driven structural modality gap under heterogeneous conditionals.

| Prediction / Claim | Observable Signature | Where |
| --- | --- | --- |
| Large-batch consistency | Stochastic gradient aligns with the deterministic energy gradient as batch size grows. | Figures below · App. D.1 |
| Unimodal strict convexity | Single stable equilibrium; low temperature yields concentration near the Gibbs minimizer. | Figures below · App. D.2 |
| Multimodal negative symmetric divergence | Persistent divergence term; exact marginal matching is knife-edge; stable gap under misalignment. | Figures below · App. D.3 |

1) Large-batch gradient consistency

We compare the stochastic InfoNCE gradient (finite batch) with the deterministic energy gradient predicted by our large-batch limit. As batch size increases, the gradients become increasingly aligned, and the relative error shrinks—supporting value & gradient consistency.

Gradient alignment vs batch size
Cosine alignment and relative gradient error between the stochastic and deterministic gradients, plotted against batch size.
  • What to look for: monotone improvement of gradient alignment as batch size grows.
  • Takeaway: training dynamics are governed by the deterministic energy in the large-batch regime.
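
A minimal, self-contained sketch of this check (a toy linear encoder and synthetic positive pairs of our own choosing, not the appendix protocol; the reference gradient is a large-batch proxy for the deterministic limit):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM_X, DIM_Z, TAU = 16, 8, 0.1
SIGMA_POS = 0.05                                    # positive-view jitter (assumed)
W = torch.randn(DIM_X, DIM_Z, requires_grad=True)   # toy linear encoder

def encode(x):
    return F.normalize(x @ W, dim=-1)               # project embeddings onto the unit sphere

def infonce_grad(batch_size):
    """Gradient of the in-batch InfoNCE loss with respect to the encoder weights."""
    x = torch.randn(batch_size, DIM_X)
    x_pos = x + SIGMA_POS * torch.randn_like(x)     # positive pair = jittered view of x
    z, z_pos = encode(x), encode(x_pos)
    logits = z @ z_pos.T / TAU                      # other in-batch pairs act as negatives
    loss = F.cross_entropy(logits, torch.arange(batch_size))
    (grad,) = torch.autograd.grad(loss, W)
    return grad.flatten()

g_ref = infonce_grad(4096)                          # large-batch proxy for the deterministic gradient
for b in (32, 128, 512, 2048):
    g = infonce_grad(b)
    cos = F.cosine_similarity(g, g_ref, dim=0).item()
    rel = ((g - g_ref).norm() / g_ref.norm()).item()
    print(f"batch={b:5d}  cosine={cos:.3f}  rel_err={rel:.3f}")
```

As the batch grows, the cosine should approach 1 and the relative error should shrink, mirroring the trend in the figure.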

2) Unimodal: unique Gibbs equilibrium & low-temperature concentration

In the unimodal regime, the intrinsic functional is strictly convex with a unique stable Gibbs equilibrium. Numerically, trajectories converge to the same terminal law across initializations; decreasing temperature produces sharper concentration inside the alignment basin rather than “uniformity for its own sake”.

Unimodal hypersphere distribution animation
Animated density of the unimodal representation measure on the hypersphere.
  • What to look for: how the density evolves under various temperature settings.
  • Takeaway: unimodal “uniformity” is an entropic tie-breaker constrained by the alignment potential.
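
A minimal numeric illustration of the low-temperature concentration, with a cosine potential standing in for the alignment potential (our choice, not the paper's setup):

```python
import numpy as np

# Discretise the circle and evaluate the Gibbs density mu_tau ∝ exp(-V/tau).
theta = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
dtheta = theta[1] - theta[0]
V = 1.0 - np.cos(theta)                              # stand-in alignment potential, minimum at theta = 0

for tau in (1.0, 0.3, 0.1, 0.03):
    gibbs = np.exp(-V / tau)
    gibbs /= gibbs.sum() * dtheta                    # normalise to a probability density
    entropy = -np.sum(gibbs * np.log(gibbs + 1e-12)) * dtheta
    mass_in_basin = np.sum(gibbs[np.abs(theta) < 0.5]) * dtheta
    print(f"tau={tau:4.2f}  diff_entropy={entropy:7.3f}  mass(|theta|<0.5)={mass_in_basin:.3f}")
```

Lower temperature reduces the differential entropy and pulls the mass into the alignment basin, rather than spreading it uniformly.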

3) Multimodal: divergence-driven structural modality gap

In the multimodal regime with heterogeneous conditionals, the symmetric objective contains a negative symmetric divergence that persists after kernel sharpening. Numerically, this manifests as a stable population-level separation between learned marginals (a “modality gap”) that grows with misalignment and does not vanish with longer training.

Animated joint-angle coupling during training
Animated joint-angle coupling (1 frame per 10 steps). The diagonal concentration indicates improved pairwise coupling, while persistent off-diagonal mass reflects mismatch-induced structure that does not disappear with longer training.
  • What to look for: how the joint mass evolves over training; whether it concentrates on the diagonal; and whether residual spread persists under misalignment.
  • Takeaway: the modality gap is a structural geometric consequence of symmetric multimodal training under mismatched conditionals.
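
A minimal sketch of this effect in code (our toy construction: two linear encoders over a shared latent with unequal conditional noise; hyper-parameters are illustrative, not the appendix settings). The printed gap is the distance between the two modality means, a simple population-level proxy:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM_U, DIM_Z, TAU = 16, 8, 0.07
NOISE_A, NOISE_B = 0.1, 0.6                          # heterogeneous conditional noise (assumed)

enc_a = torch.nn.Linear(DIM_U, DIM_Z)                # toy encoder for modality A
enc_b = torch.nn.Linear(DIM_U, DIM_Z)                # toy encoder for modality B
opt = torch.optim.Adam(list(enc_a.parameters()) + list(enc_b.parameters()), lr=1e-2)

def symmetric_infonce(z_a, z_b):
    """CLIP-style symmetric InfoNCE over the in-batch similarity matrix."""
    logits = z_a @ z_b.T / TAU
    labels = torch.arange(z_a.shape[0])
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

for step in range(2001):
    u = torch.randn(512, DIM_U)                      # shared latent cause
    z_a = F.normalize(enc_a(u + NOISE_A * torch.randn_like(u)), dim=-1)
    z_b = F.normalize(enc_b(u + NOISE_B * torch.randn_like(u)), dim=-1)
    loss = symmetric_infonce(z_a, z_b)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        gap = (z_a.mean(0) - z_b.mean(0)).norm().item()   # population-level modality-gap proxy
        print(f"step={step:4d}  loss={loss.item():.3f}  gap={gap:.3f}")
```

The trace shows whether the population means of the two modalities approach each other or stabilize at a nonzero separation as training proceeds.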

For full setup details (models, temperatures, batch sizes, seeds), see Appendix D of the paper.

Key Takeaways

  • Large-batch training dynamics are governed by a deterministic InfoNCE energy (value and gradient consistency).
  • Unimodal "uniformity" is an entropic tie-breaker inside a strictly convex landscape with a unique Gibbs equilibrium.
  • The modality gap is a structural geometric consequence of symmetric multimodal training under mismatched conditionals, not an initialization artifact.

📚 Cite this paper:
@misc{cai2026geometric,
  title        = {The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence},
  author       = {Cai, Yichao and Zhang, Zhen and Liu, Yuhang and Shi, Javen Qinfeng},
  year         = {2026},
  eprint       = {2601.19597},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  doi          = {10.48550/arXiv.2601.19597},
  url          = {https://arxiv.org/abs/2601.19597}
}