This work introduces a unified framework for understanding cross-modal misalignment in vision-language learning. Contrary to the prevailing assumption of fully matched pairs, real-world image-text pairs often share only partially overlapping semantics. Using a latent variable model (LVM), we formalize two common misalignment mechanisms, selection bias and perturbation bias, and analyze their impact on contrastive learning methods such as CLIP. Our theory shows that multimodal contrastive learning inherently extracts the unbiased, shared semantics. Empirical studies on synthetic and real datasets confirm these insights.
Multimodal contrastive learning aims to align image and text embeddings in a shared space. This assumes that the pairs are semantically matched. However, in practice, the matching is often noisy: some semantics are missing (selection bias), and some are replaced or incorrect (perturbation bias). Surprisingly, such misalignment doesn't always harm performance—sometimes, it improves robustness. This paper seeks to understand and exploit that phenomenon.
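For concreteness, below is a minimal sketch of the symmetric contrastive (InfoNCE) objective used by CLIP-style models to align the two embedding spaces; the tensor shapes and temperature value are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch of the symmetric InfoNCE objective used by CLIP-style models.
# Batch size, embedding dimension, and temperature are illustrative placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine similarities of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)
```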
The authors propose a flexible latent variable model in which the full semantics of each sample are encoded as latent variables. Images and texts are generated from these latents via separate deterministic paths, and the text path may omit or alter some variables. Two types of biases are defined:
- Selection bias: some semantic variables are omitted from the text entirely.
- Perturbation bias: some semantic variables are altered or replaced, so the text describes them incorrectly.
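To make the generative story concrete, here is a toy simulation of such a model: shared latent semantics generate both modalities, while the text path drops some variables (selection bias) and corrupts others (perturbation bias). The distributions, rates, and decoders are purely illustrative assumptions.

```python
# Toy simulation of the latent variable model: shared semantics generate both
# modalities; the text path omits variables (selection bias) and corrupts
# others (perturbation bias). All rates and decoders are illustrative.
import numpy as np

rng = np.random.default_rng(0)
NUM_SEMANTICS = 5          # dimensionality of the latent semantic vector (assumed)
SELECTION_RATE = 0.3       # probability a semantic variable is omitted from the text
PERTURBATION_RATE = 0.2    # probability a retained variable is replaced by noise

def sample_pair():
    z = rng.normal(size=NUM_SEMANTICS)                 # latent semantics shared by both views
    image = np.tanh(z)                                 # stand-in for a deterministic image decoder
    z_text = z.copy()
    perturbed = rng.random(NUM_SEMANTICS) < PERTURBATION_RATE   # perturbation bias
    z_text[perturbed] = rng.normal(size=perturbed.sum())        # corrupt those variables
    selected = rng.random(NUM_SEMANTICS) > SELECTION_RATE       # selection bias
    text = np.sin(z_text[selected])                    # stand-in for a deterministic text decoder
    return image, text, z

image, text, z = sample_pair()
```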
Figure: Illustration of the proposed latent variable model (left), with misalignment across modalities modeled via selection and perturbation bias.
Main Results: The central theoretical result shows that multimodal contrastive learning identifies only the shared, unbiased components of the underlying semantic variables, regardless of the causal relationships among them. This explains why CLIP-like models maintain strong generalization even when trained on semantically noisy or misaligned image-text pairs, and it underscores the importance of careful dataset design in multimodal contrastive learning.
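Zero-shot transfer is the standard way this kind of generalization is probed for CLIP-like models. The sketch below shows the usual prompt-matching procedure; the encoder, tokenizer, and prompt template are placeholder assumptions rather than the paper's exact setup.

```python
# Sketch of zero-shot classification with a CLIP-style model: class names are
# embedded via text prompts, and each image is assigned to the most similar
# prompt. Encoders, tokenizer, and prompt template are placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_encoder, text_encoder, tokenizer, images, class_names):
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)   # (C, D) class prototypes
    image_emb = F.normalize(image_encoder(images), dim=-1)             # (N, D) image embeddings
    similarity = image_emb @ text_emb.t()                              # (N, C) cosine similarities
    return similarity.argmax(dim=-1)                                   # predicted class per image
```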
The paper evaluates the theory using numerical simulations together with real-world (MPI-) and controlled synthetic (Causal3DIdent) datasets. Metrics include zero-shot classification accuracy and how accurately the semantic variables can be predicted from learned features (R², MCC).
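A minimal sketch of how such latent-prediction metrics are commonly computed follows, assuming learned features and ground-truth semantic variables are available as arrays, and assuming MCC denotes the Matthews correlation coefficient for discrete variables (some identifiability work uses MCC for the mean correlation coefficient instead); the probe choices are illustrative, not necessarily the paper's protocol.

```python
# Sketch of latent-prediction metrics: fit a probe from learned features to each
# ground-truth semantic variable, report R^2 for continuous variables and
# (assumed) Matthews correlation for discrete ones. Arrays are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

def r2_of_latent(features: np.ndarray, latent: np.ndarray) -> float:
    """R^2 of a linear probe predicting a continuous semantic variable."""
    x_tr, x_te, y_tr, y_te = train_test_split(features, latent, random_state=0)
    probe = LinearRegression().fit(x_tr, y_tr)
    return r2_score(y_te, probe.predict(x_te))

def mcc_of_latent(features: np.ndarray, labels: np.ndarray) -> float:
    """Matthews correlation of a logistic probe predicting a discrete variable."""
    x_tr, x_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return matthews_corrcoef(y_te, probe.predict(x_te))
```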
Figure: Latent causal model underlying the generation of image samples in the Causal3DIdent dataset.
Table: Misalignment settings for text generation of Causal3DIdent dataset.
Figure: Predicting semantic variables under misalignment using image features.
This work challenges the assumption that perfect semantic alignment is necessary for effective multimodal learning. In fact, some degree of misalignment—such as selection or perturbation biases—can encourage models to focus on invariant, shared semantics. Our theoretical results show that contrastive learning recovers these unbiased components even without strong assumptions about the causal structure of the data, helping explain the robustness of CLIP-like models.
Beyond performance, our findings have broader implications. Supervision—particularly text—acts as an epistemic filter, shaping which aspects of the visual world are represented and learned. Biases in annotation reflect implicit value judgments about what is relevant or important. Rather than treating these biases as mere noise, they can be studied as signals—providing behavioral priors that inform what humans prioritize. This reframes dataset design as both a technical and ethical practice: one that impacts not only generalization, but also what concepts are visible to the model in the first place.
@article{cai2025misalignment,
  title={On the Value of Cross-Modal Misalignment in Multimodal Representation Learning},
  author={Cai, Yichao and Liu, Yuhang and Gao, Erdun and Jiang, Tianjiao and Zhang, Zhen and van den Hengel, Anton and Shi, Javen Qinfeng},
  journal={arXiv preprint arXiv:2504.10143},
  year={2025}
}