ICML 2026

The Geometric Mechanics of Contrastive Representation Learning:

Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence

* Australian Institute for Machine Learning (AIML), Adelaide University
† Responsible AI Research (RAIR) Centre, Australia

Core claim. In the analyzed regime, unimodal InfoNCE induces a strictly convex Gibbs-type intrinsic landscape, whereas symmetric multimodal InfoNCE introduces a persistent negative symmetric divergence coupling that can structurally favor a modality gap under conditional heterogeneity.

Why this paper matters. Contrastive learning is often summarized as “alignment plus uniformity,” but that view does not fully explain why cross-modal systems can align paired samples while still maintaining separated modality marginals. This work shifts the lens from pointwise discrimination to population geometry on the embedding manifold.

Paper at a glance

A single roadmap for the full argument

The paper is easiest to understand if the reader sees the entire analytical pipeline first: finite-batch InfoNCE, large-batch deterministic energies, intrinsic variational lifting, and then the bifurcation between unimodal and multimodal geometries.

Figure 1: unified analytical pipeline for contrastive learning geometry
Unified roadmap of the analysis: finite-batch InfoNCE, large-batch deterministic energies, intrinsic variational lifting, and the resulting bifurcation between unimodal and multimodal contrastive geometries.

1. Deterministic limit

Large-batch InfoNCE is shown to track an explicit deterministic energy, making the training geometry analyzable at the level of induced measures.

2. Unimodal cohesion

In the unimodal regime, the intrinsic functional is strictly convex and yields a unique Gibbs equilibrium. Entropy acts as a tie-breaker inside the alignment basin.

3. Multimodal bifurcation

In symmetric multimodal InfoNCE, a negative symmetric divergence term changes the geometry qualitatively and can favor modality-level separation.
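
For orientation, the deterministic limit in item 1 can be illustrated with the standard large-batch form of InfoNCE from the alignment-uniformity literature. This is a sketch in generic notation, not the paper's own: the similarity $s$, temperature $\tau$, and number of negatives $N$ are assumed symbols, and the paper's exact energy may differ.

```latex
% Finite-batch InfoNCE with one positive x^+ and N negatives x_j^-:
\mathcal{L}_N
  = \mathbb{E}\!\left[
      -\log \frac{e^{s(x,x^+)/\tau}}
                 {e^{s(x,x^+)/\tau} + \sum_{j=1}^{N} e^{s(x,x_j^-)/\tau}}
    \right]

% As N grows, the centered loss approaches a deterministic energy:
\mathcal{L}_N - \log N
  \;\xrightarrow[N\to\infty]{}\;
  \underbrace{-\frac{1}{\tau}\,\mathbb{E}\big[s(x,x^+)\big]}_{\text{alignment}}
  \;+\;
  \underbrace{\mathbb{E}_{x}\Big[\log \mathbb{E}_{x^-}\, e^{s(x,x^-)/\tau}\Big]}_{\text{dispersion}}
```

The first term depends only on positive pairs, while the second depends only on the marginal distribution of the embeddings, which is why the limit can be studied at the level of induced measures.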

Abstract

Summary

While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment-uniformity decomposition. We develop a measure-theoretic framework in which learning evolves representation measures on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality’s marginal reshapes the effective landscape of the other, allowing strong pairwise alignment to coexist with a persistent modality gap. Controlled synthetic experiments and analyses of pretrained CLIP representations support these predictions. Overall, our results shift the analytical lens from pointwise discrimination to population geometry, showing that pairwise alignment alone is insufficient to control cross-modal marginal structure.

Theory to evidence

The four-step narrative

1. Show the large-batch limit is real

Before making geometric claims, the paper first verifies that finite-batch gradients align with the deterministic gradient predicted by theory.

2. Establish unimodal Gibbs geometry

The next step is to show that the unimodal landscape is cohesive: a unique equilibrium exists and low temperature concentrates mass near aligned minima.

3. Expose the multimodal bifurcation

The central contrast is that symmetric multimodal InfoNCE becomes divergence-coupled, so pairwise attraction can coexist with population-level repulsion.

4. Validate on real CLIP-like systems

The final step tests whether the same signatures persist on realistic image-text embeddings and under controlled corruption on MS-COCO.

Result I

Large-batch InfoNCE tracks a deterministic energy

Figure 4: large-batch gradient consistency
Gradient alignment and relative gradient error between the stochastic InfoNCE objective and its deterministic large-batch counterpart improve as the number of negatives increases.

What this establishes

This result anchors the rest of the theory by showing that the deterministic-energy perspective is visible already at the level of practical finite-batch training.

  • Finite-batch contrastive gradients increasingly align with deterministic gradients.
  • The large-batch limit therefore captures the dominant training direction.
  • This justifies analyzing InfoNCE through induced energy landscapes on representation measures.
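
The gradient-consistency claim above can be probed with a small toy experiment. The sketch below is not the paper's experimental setup: the embedding distribution (`sample_negatives`), the dimension, the temperature, and the use of a 200,000-sample batch as a stand-in for the deterministic limit are all assumptions made for illustration. It compares only the negative (log-partition) part of the gradient, since the positive-pair term does not depend on the number of negatives.

```python
import numpy as np

def sample_negatives(rng, n, dim=8, shift=2.0):
    """Hypothetical toy distribution: shifted Gaussians projected to the unit sphere."""
    g = rng.normal(size=(n, dim))
    g[:, 0] += shift  # break rotational symmetry so the marginal is non-uniform
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def negative_term_grad(z, negatives, tau):
    """Ambient gradient w.r.t. z of log mean_j exp(<z, x_j> / tau).

    The gradient is the softmax-weighted mean of the negatives, divided by tau.
    """
    logits = negatives @ z / tau
    w = np.exp(logits - logits.max())  # numerically stable softmax weights
    w /= w.sum()
    return (w @ negatives) / tau

def mean_cosine_to_reference(n_neg, trials=20, tau=0.5, seed=0):
    """Average cosine between finite-batch gradients and a large-batch reference."""
    rng = np.random.default_rng(seed)
    z = np.zeros(8)
    z[0] = z[1] = 1.0 / np.sqrt(2.0)  # fixed anchor embedding on the sphere
    # A very large batch stands in for the deterministic (population) gradient.
    ref = negative_term_grad(z, sample_negatives(rng, 200_000), tau)
    cosines = []
    for _ in range(trials):
        g = negative_term_grad(z, sample_negatives(rng, n_neg), tau)
        cosines.append(g @ ref / (np.linalg.norm(g) * np.linalg.norm(ref)))
    return float(np.mean(cosines))

alignment = {n: mean_cosine_to_reference(n) for n in (8, 64, 512, 4096)}
print(alignment)
```

Under these assumptions the mean cosine is expected to rise toward 1 as the number of negatives grows, which is the qualitative trend the figure caption describes; the specific values depend entirely on the toy distribution.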

Result II

Unimodal InfoNCE has a cohesive Gibbs-type geometry

Figure 5: unimodal equilibria across temperature
Unimodal training evolves toward a Gibbs-type equilibrium shaped by the alignment potential; decreasing temperature concentrates mass more strongly near low-energy regions.
Figure 2: unimodal ground-state concentration
Quantitative view of unimodal low-temperature concentration: as temperature decreases, probability mass concentrates around near-minimizing regions of the alignment potential.

The unimodal case serves as the baseline for comparison: the modality evaluates cross-entropy against its own smoothed density, which produces a strictly convex intrinsic free energy and a unique Gibbs equilibrium. “Uniformity” can then be read more precisely as entropic dispersion inside an already aligned basin, rather than as a generic global force opposing alignment.
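
Low-temperature concentration of a Gibbs measure can be illustrated numerically. The sketch below is a toy under stated assumptions, not the paper's functional: it uses a hypothetical alignment potential `V(theta) = 1 - cos(theta)` on a circle and the textbook Gibbs form `p ∝ exp(-V / tau)`, then measures how much probability mass sits within a fixed radius of the minimizer.

```python
import numpy as np

def gibbs_on_circle(potential, tau, n_grid=2000):
    """Discretized Gibbs measure p ∝ exp(-V / tau) on the circle [-pi, pi)."""
    theta = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    logp = -potential(theta) / tau
    p = np.exp(logp - logp.max())  # subtract the max for numerical stability
    p /= p.sum()                   # probability mass per grid cell
    return theta, p

def mass_near_minimum(tau, radius=0.5):
    """Mass within `radius` of the minimizer of the assumed potential 1 - cos(theta)."""
    theta, p = gibbs_on_circle(lambda t: 1.0 - np.cos(t), tau)
    return float(p[np.abs(theta) < radius].sum())

temperatures = (1.0, 0.3, 0.1, 0.03)
masses = [mass_near_minimum(tau) for tau in temperatures]
for tau, m in zip(temperatures, masses):
    print(f"tau={tau:<5} mass within 0.5 rad of minimum: {m:.3f}")
```

Because the region around the minimizer is a sublevel set of the potential, its Gibbs mass grows monotonically as the temperature decreases, while at any positive temperature the entropy term keeps the mass spread inside the basin rather than collapsed to a point.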

Result III

Symmetric multimodal InfoNCE bifurcates into a different geometry

This is the paper’s central mechanism. In the multimodal regime, each modality is evaluated against the other modality’s smoothed density. After lifting to the intrinsic functional, this produces a negative symmetric divergence coupling that can favor population-level separation under conditional heterogeneity.

Figure 6: joint-angle coupling under controlled misalignment
Joint-angle coupling under controlled misalignment. The diagonal concentration weakens and deforms as heterogeneity increases, revealing how pairwise attraction can coexist with persistent population-level separation.
Figure 3: marginal gap versus latent misalignment
Quantitative signature of the multimodal bifurcation: the estimated symmetric divergence between modality marginals increases as latent cross-modal misalignment becomes stronger.
  • Pairwise alignment does not by itself guarantee matched modality marginals.
  • Under conditional heterogeneity, symmetric multimodal InfoNCE can retain a stable distribution-level gap.
  • The resulting modality gap is therefore a geometric population effect rather than merely an initialization artifact.
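
A distribution-level gap of the kind listed above can be quantified with a symmetric divergence between the two modality marginals. The page does not specify which divergence or estimator the paper uses, so the sketch below makes two assumptions: it takes the Jeffreys divergence, KL(p||q) + KL(q||p), and it models the marginals as hypothetical von Mises densities on a circle whose means differ by an offset. The offset is applied directly to the marginals as a stand-in for latent misalignment.

```python
import numpy as np

def von_mises_mass(theta, mean, kappa):
    """Discretized von Mises distribution on a fixed grid (mass per grid cell)."""
    logp = kappa * np.cos(theta - mean)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def symmetric_kl(p, q):
    """Jeffreys divergence KL(p||q) + KL(q||p) for strictly positive mass vectors."""
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

theta = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
offsets = (0.0, 0.25, 0.5, 1.0)  # hypothetical shifts between modality marginals
image_marginal = von_mises_mass(theta, 0.0, 4.0)
gaps = [symmetric_kl(image_marginal, von_mises_mass(theta, d, 4.0)) for d in offsets]
for d, g in zip(offsets, gaps):
    print(f"offset={d:<5} symmetric KL={g:.4f}")
```

The divergence is zero when the marginals coincide and grows with the offset, so it separates "matched marginals" from "aligned pairs", which are distinct properties in the argument above.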

Result IV

The same signatures appear on MS-COCO and pretrained CLIP-like models

Figure 7: MS-COCO validation of the modality-gap mechanism
Real-data validation on pretrained CLIP-like models and controlled MS-COCO corruption. Strong retrieval and small cross-modal discrepancy are related but not equivalent, and structured caption corruption enlarges the gap systematically.

What the reader should take away

  • Strong pretrained retrieval and small cross-modal discrepancy are related but not equivalent.
  • Mild, semantically plausible caption corruption already enlarges the image-text gap.
  • The gap is therefore not well captured by retrieval alone.
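
That retrieval and cross-modal discrepancy can decouple is easy to see in a constructed example. The sketch below is synthetic and does not use CLIP or MS-COCO: it builds hypothetical paired embeddings that share all but one coordinate, with the last coordinate set to `+half_gap` for images and `-half_gap` for texts. Centroid distance is used as the gap measure, which is a common proxy and an assumption here, since the page does not state the paper's discrepancy metric.

```python
import numpy as np

def paired_embeddings(n=200, dim=16, half_gap=0.6, seed=0):
    """Unit-norm image/text pairs that differ only by a constant modality offset."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n, dim))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    shared = np.sqrt(1.0 - half_gap**2) * u  # content shared by both modalities
    offset = np.full((n, 1), half_gap)       # modality-specific coordinate
    images = np.hstack([shared, offset])
    texts = np.hstack([shared, -offset])
    return images, texts

def recall_at_1(images, texts):
    """Fraction of images whose most similar text is the paired one."""
    sims = images @ texts.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(images))))

def centroid_gap(images, texts):
    """Euclidean distance between modality centroids (a proxy for the modality gap)."""
    return float(np.linalg.norm(images.mean(axis=0) - texts.mean(axis=0)))

images, texts = paired_embeddings()
print("recall@1:", recall_at_1(images, texts))
print("centroid gap:", centroid_gap(images, texts))
```

In this construction the similarity between image i and text j equals a constant plus a positive multiple of the shared-content cosine, so every image retrieves its own caption while the two centroids remain `2 * half_gap` apart. Retrieval accuracy alone therefore cannot reveal the size of the gap.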

Takeaways

What the paper shows

  • (i) Representation learning can be understood through induced population geometry, not only pairwise discrimination.
  • (ii) Unimodal InfoNCE is geometrically cohesive: it admits a unique Gibbs equilibrium.
  • (iii) Symmetric multimodal InfoNCE is qualitatively different because of a negative symmetric divergence coupling.
  • (iv) Closing the modality gap may require explicit distribution-level regularization, not merely stronger pairwise alignment.
📚 Cite this paper
@misc{cai2026geometric,
  title        = {The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence},
  author       = {Cai, Yichao and Zhang, Zhen and Liu, Yuhang and Shi, Javen Qinfeng},
  year         = {2026},
  eprint       = {2601.19597},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  doi          = {10.48550/arXiv.2601.19597},
  url          = {https://arxiv.org/abs/2601.19597}
}