On the Value of Cross-Modal Misalignment in Multimodal Representation Learning

* Equal Contribution
Australian Institute for Machine Learning (AIML), The University of Adelaide

✨ TL;DR: Multimodal contrastive learning identifies only unbiased shared semantics under cross-modal misalignment, modeled via selection and perturbation biases.

Abstract

Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit cross-modal misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize cross-modal misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are absent in the text, and perturbation bias, where semantic variables are altered; both lead to misalignment in data pairs. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning.

Key Contributions

Motivation

Multimodal contrastive learning aligns image and text embeddings, but captions often omit or distort semantics. Surprisingly, this misalignment sometimes helps rather than hurts, enhancing robustness. Our work seeks to understand and leverage this paradox.
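The alignment step referred to above is typically a symmetric contrastive (InfoNCE) objective over a batch of paired embeddings. A minimal NumPy sketch, not the paper's implementation (function and parameter names are illustrative):

```python
import numpy as np

def symmetric_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    img_emb, txt_emb: (n, d) arrays; row i of each forms one image-text pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) similarity matrix

    # Cross-entropy with the matching pair (the diagonal) as the target,
    # averaged over the image->text and text->image directions.
    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each image embedding toward its paired caption embedding and away from the other captions in the batch, which is why caption content determines what the image encoder learns to represent.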

Theoretical Framework

We introduce a latent variable model in which each modality is generated from shared semantics with modality-specific biases. We define two misalignment mechanisms:

- Selection bias: some semantic variables are absent from the text.
- Perturbation bias: some semantic variables are altered in the text.

Figure: Illustration of the latent variable model (left), with misalignment across modalities modeled via selection and perturbation bias. Example image-text pairs (right) illustrate the misalignment induced by these two biases.

Main Result (informal): Multimodal contrastive learning, when equipped with the correct representational bottleneck (i.e., feature size), consistently identifies unbiased shared semantics under cross-modal misalignment, irrespective of the latent causal structure.
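The two bias mechanisms can be illustrated with a toy simulation of the generative process: both modalities share semantic variables, but on the text side a selection mask drops some variables and a perturbation re-samples others. A hypothetical NumPy sketch under these assumptions, not the paper's exact model:

```python
import numpy as np

def generate_pair(rng, n_sem=5, drop=(4,), perturb=(3,), noise=0.5):
    """Sample one misaligned image-text pair in latent space.

    n_sem   : number of shared semantic variables z
    drop    : indices removed on the text side (selection bias)
    perturb : indices re-sampled on the text side (perturbation bias)
    """
    z = rng.normal(size=n_sem)                   # shared semantics

    # Image side observes all semantics (plus modality-specific noise).
    x_img = z + noise * rng.normal(size=n_sem)

    # Text side: perturbation bias alters some variables...
    z_txt = z.copy()
    z_txt[list(perturb)] = rng.normal(size=len(perturb))
    # ...and selection bias removes others entirely.
    keep = [i for i in range(n_sem) if i not in drop]
    x_txt = z_txt[keep] + noise * rng.normal(size=len(keep))

    return z, x_img, x_txt
```

In this toy setup only the variables outside the dropped and perturbed index sets carry reliable cross-modal information, which is exactly the invariant subset that, per the main result, MMCL identifies.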

Practical Guidance: (1) Large-scale pretraining: when domain shift is not a concern, rich and faithful captions maximize the shared semantics captured and boost downstream generalization. (2) Robust adaptation: deliberately omit or perturb environment-specific or sensitive cues so that the learned representation becomes invariant to them (for robustness or ethics), using misalignment as regularization.

Empirical Validation

We validate our theory on the Causal3DIdent and MPI3D synthetic benchmarks and with OpenCLIP models. Results confirm that misalignment shapes concept awareness: increasing selection or perturbation bias systematically reduces performance on the biased concepts while leaving the invariant ones stable. See the paper for full results.

OpenCLIP Case Study

We evaluate OpenCLIP on 146 visual concepts across 15 groups. Performance drops sharply on under-captioned groups like Stereotype and Emotion, confirming that annotation frequency governs learned awareness. Captions act as epistemic filters, prioritizing some semantics over others.
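Zero-shot concept probing of this kind boils down to scoring an image embedding against the embeddings of concept prompts (e.g. "a photo of a {concept}") and taking the best match. A minimal CLIP-style sketch on precomputed embeddings (illustrative, not the paper's evaluation code):

```python
import numpy as np

def zero_shot_predict(image_emb, prompt_embs):
    """Return the index of the concept prompt most similar to the image.

    image_emb  : (d,) image embedding
    prompt_embs: (k, d) embeddings of k concept prompts
    """
    # Normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    scores = txt @ img                  # (k,) cosine similarities
    return int(np.argmax(scores))
```

If a concept rarely appears in captions, its prompt embedding is poorly anchored to visual evidence, so the correct prompt stops winning this argmax; aggregating accuracy per concept group yields the coverage-vs-performance comparison reported above.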

Figure: LAION-400M caption coverage rate and OpenCLIP zero-shot F1 performance across concept groups.

Key Takeaways

Misalignment ≠ Noise: It's an Epistemic Filter

Our work supports a modern view of linguistic relativity: captions shape the conceptual space that models represent. Annotation choices reflect epistemic and ethical values. By treating annotation as a value-laden act, we can align learned representations with human priorities.

📚 Cite this paper:
@inproceedings{cai2025misalignment,
  title     = {On the Value of Cross-Modal Misalignment in Multimodal Representation Learning},
  author    = {Cai, Yichao and Liu, Yuhang and Gao, Erdun and Jiang, Tianjiao and Zhang, Zhen and van den Hengel, Anton and Shi, Javen Qinfeng},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}