Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence
Core claim. In the analyzed regime, unimodal InfoNCE induces a strictly convex Gibbs-type intrinsic landscape, whereas symmetric multimodal InfoNCE introduces a persistent negative symmetric divergence coupling that can structurally favor a modality gap under conditional heterogeneity.
Why this paper matters. Contrastive learning is often summarized as “alignment plus uniformity,” but that view does not fully explain why cross-modal systems can align paired samples while still maintaining separated modality marginals. This work shifts the lens from pointwise discrimination to population geometry on the embedding manifold.
Paper at a glance
The paper is easiest to understand if the reader sees the entire analytical pipeline first: finite-batch InfoNCE, large-batch deterministic energies, intrinsic variational lifting, and then the bifurcation between unimodal and multimodal geometries.
Large-batch InfoNCE is shown to track an explicit deterministic energy, making the training geometry analyzable at the level of induced measures.
In the unimodal regime, the intrinsic functional is strictly convex and yields a unique Gibbs equilibrium. Entropy acts as a tie-breaker inside the alignment basin.
In symmetric multimodal InfoNCE, a negative symmetric divergence term changes the geometry qualitatively and can favor modality-level separation.
Abstract
While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment-uniformity decomposition. We develop a measure-theoretic framework in which learning evolves representation measures on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality’s marginal reshapes the effective landscape of the other, allowing strong pairwise alignment to coexist with a persistent modality gap. Controlled synthetic experiments and analyses of pretrained CLIP representations support these predictions. Overall, our results shift the analytical lens from pointwise discrimination to population geometry, showing that pairwise alignment alone is insufficient to control cross-modal marginal structure.
Theory to evidence
Before making geometric claims, the paper verifies that finite-batch gradients align with the deterministic gradient predicted by the theory.
The next step is to show that the unimodal landscape is cohesive: a unique equilibrium exists and low temperature concentrates mass near aligned minima.
The central contrast is that symmetric multimodal InfoNCE becomes divergence-coupled, so pairwise attraction can coexist with population-level repulsion.
The final step tests whether the same signatures persist on realistic image-text embeddings and under controlled corruption on MS-COCO.
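The coexistence of pairwise alignment and a modality gap that these steps probe can be illustrated with two simple diagnostics. The sketch below is an assumption for illustration, not the paper's evaluation code: the function names, the centroid-distance gap measure, and the toy embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]


def centroid_gap(images, texts):
    """Euclidean distance between the two modality centroids (assumed gap measure)."""
    a, b = centroid(images), centroid(texts)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_paired_cosine(images, texts):
    """Average cosine similarity over matched (image_i, text_i) unit vectors."""
    sims = [sum(x * y for x, y in zip(u, v)) for u, v in zip(images, texts)]
    return sum(sims) / len(sims)


# Hypothetical toy data: each text embedding is its paired image embedding
# pushed along a shared third axis, then renormalized.
angles = [-0.3, -0.1, 0.1, 0.3]
images = [normalize([math.cos(t), math.sin(t), 0.0]) for t in angles]
texts = [normalize([math.cos(t), math.sin(t), 0.6]) for t in angles]

if __name__ == "__main__":
    print("mean paired cosine:", round(mean_paired_cosine(images, texts), 3))
    print("centroid gap:", round(centroid_gap(images, texts), 3))
```

In this toy, every image is still most similar to its own text, yet the two modality centroids sit a clear distance apart, which is the qualitative signature the experiments look for.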
Result I
This result anchors the rest of the theory by showing that the deterministic-energy perspective is visible already at the level of practical finite-batch training.
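A minimal numerical sketch of the value side of this consistency, under assumptions that are not taken from the paper: embeddings live on the unit circle, similarity is the cosine of the angle difference, and negatives are drawn uniformly. In that toy setting the batch-normalized log-sum-exp term of InfoNCE has a closed-form large-batch limit, log I_0(1/tau), where I_0 is the modified Bessel function. All function names here are hypothetical.

```python
import math
import random


def bessel_i0(x, terms=40):
    """Modified Bessel function I_0(x) via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def batch_log_mean_exp(anchor, negatives, tau):
    """Finite-batch term: log (1/N) sum_j exp(cos(anchor - y_j) / tau)."""
    scores = [math.cos(anchor - y) / tau for y in negatives]
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores) / len(scores))


def population_value(tau):
    """Large-batch limit for negatives uniform on the circle: log I_0(1 / tau)."""
    return math.log(bessel_i0(1.0 / tau))


def gap(n, tau=0.5, seed=0):
    """Absolute gap between the finite-batch term and its deterministic limit."""
    rng = random.Random(seed)
    negatives = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return abs(batch_log_mean_exp(0.0, negatives, tau) - population_value(tau))


if __name__ == "__main__":
    for n in (16, 256, 4096, 65536):
        print(n, round(gap(n), 4))
```

The printed gap should shrink on average as the batch size grows, although for a single seed it need not decrease monotonically. The sketch covers only the value of the negatives term, not the gradient alignment that the result itself concerns.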
Result II
The unimodal case serves as a contrastive baseline: the modality evaluates cross-entropy against its own smoothed density, which produces a strictly convex intrinsic free energy and a unique Gibbs equilibrium. This reinterprets “uniformity” more precisely as entropic dispersion inside an already aligned basin, rather than as a generic global force fighting alignment.
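The tie-breaking role of entropy can be seen in a generic finite-state free energy, F(p) = sum_i p_i U_i + tau * sum_i p_i log p_i. This is a textbook stand-in and an assumption here, not the paper's intrinsic functional; the potential `U` and the function names are hypothetical. F is strictly convex in p (a linear energy plus a strictly convex negative entropy), and its unique minimizer is the Gibbs distribution p_i proportional to exp(-U_i / tau).

```python
import math


def gibbs(potential, tau):
    """Gibbs weights p_i proportional to exp(-U_i / tau) on a finite state space."""
    m = min(potential)  # shift by the minimum for numerical stability
    weights = [math.exp(-(u - m) / tau) for u in potential]
    z = sum(weights)
    return [w / z for w in weights]


def free_energy(p, potential, tau):
    """F(p) = sum_i p_i U_i + tau * sum_i p_i log p_i, with 0 log 0 taken as 0."""
    energy = sum(pi * u for pi, u in zip(p, potential))
    neg_entropy = sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return energy + tau * neg_entropy


# Two tied minima (a stand-in for the aligned basin) and one higher-energy state.
U = [0.0, 0.0, 1.0]
p_star = gibbs(U, tau=0.1)

if __name__ == "__main__":
    print("Gibbs weights:", [round(x, 5) for x in p_star])
    print("F at Gibbs:", round(free_energy(p_star, U, 0.1), 5))
    print("F at a single minimum:", free_energy([1.0, 0.0, 0.0], U, 0.1))
```

Putting all mass on one of the tied minima gives the same energy as splitting it, so the energy term alone cannot choose between them. The entropy term does: the Gibbs solution splits mass equally across the tie, and at low temperature almost nothing remains on the higher-energy state.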
Result III
This is the paper’s central mechanism. In the multimodal regime, each modality is evaluated against the other modality’s smoothed density. After lifting to the intrinsic functional, this produces a negative symmetric divergence coupling that can favor population-level separation under conditional heterogeneity.
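To see why a divergence term entering with a negative sign can reward separation, consider the following sketch. It assumes the symmetric KL (Jeffreys) divergence between two discretized marginals; the page does not state which symmetric divergence or weighting the paper uses, so the divergence choice, the weight `lam`, and the `bump` marginals are all hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Jeffreys divergence KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def bump(center, n_bins=50, width=4.0):
    """Discretized Gaussian-shaped marginal on n_bins grid points."""
    w = [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_bins)]
    z = sum(w)
    return [x / z for x in w]


def coupling(p, q, lam=1.0):
    """Toy coupling energy: the divergence enters with a negative sign."""
    return -lam * symmetric_kl(p, q)


if __name__ == "__main__":
    for shift in (0, 2, 5):
        p, q = bump(25 - shift), bump(25 + shift)
        print("shift", shift, "coupling energy", round(coupling(p, q), 3))
```

The coupling energy is zero when the two marginals coincide and becomes more negative as they move apart, so a minimizer that sees only this term prefers separated marginals. In the full objective this pull competes with pairwise attraction, which is how alignment and a gap can coexist.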
Result IV
Takeaways
@misc{cai2026geometric,
title = {The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence},
author = {Cai, Yichao and Zhang, Zhen and Liu, Yuhang and Shi, Javen Qinfeng},
year = {2026},
eprint = {2601.19597},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
doi = {10.48550/arXiv.2601.19597},
url = {https://arxiv.org/abs/2601.19597}
}