The New Locus of Representation?
A Researcher’s Perspective on the Large-Model Era with Limited Compute #

Yichao Cai


The familiar “CUDA: Out of Memory” message flashed on my screen again last week. Somehow this tiny red error felt symbolic. For those of us working in representation learning without frontier-scale infrastructure, the last few years have felt like a quiet crisis. A once-dominant paradigm—learning vector embeddings for downstream tasks—now finds itself overshadowed by a disarmingly simple alternative:

Just tokenize your input and ask a big model.

This new multimodal pipeline is seductively straightforward: tokenize an image (or audio, or anything else), feed the tokens to a giant next-token predictor alongside a text prompt, and receive a natural-language answer. Everything from classification to detection to planning collapses into instruction–response. Language becomes the universal interface.

And this raises a question that feels both intellectual and personal:

If all useful (or human-valued) tasks can be cast as QA (or, instruction–response), what is left for the representation learning researcher?

Some argue that representation is no longer the primary bottleneck—that the real levers are data curation, scaling, and better text–vision tokenization. There is truth to this view, but it is incomplete. Representation learning has not disappeared. It has simply moved. It lives now in the interfaces, the bottlenecks, and the geometry that allow these giant models to see, reason, and align.



The New Locus of Representation#

The "old" paradigm was architecturally simple: [Image (unstructured data)] → [Encoder (trained with pretext objective)] → [Vector z (representation)] → [Classifier/Regressor/Detector, etc.]. The new one looks different: ([Image] → [Tokenizer] → [Tokens]) + [Text Prompt] → [Large Model] → [Answer]. But representation is still central. It is merely relocated into subtler places.
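To make the contrast concrete, here is a toy sketch of the two interfaces in PyTorch. Everything in it is a stand-in (the encoder, the token vocabularies, the hypothetical `large_model`); the point is only the shape of each pipeline, not any real implementation.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)                                   # toy image

# "Old" paradigm: encoder -> vector z -> small task-specific head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in pretext-trained encoder
task_head = nn.Linear(128, 10)                                      # tiny downstream consumer
z = encoder(image)                                                  # the representation
logits = task_head(z)

# "New" paradigm: tokenizer -> discrete tokens, concatenated with a prompt,
# then handed to a giant next-token predictor.
visual_tokens = torch.randint(0, 16384, (1, 64))    # pretend output of a visual tokenizer
prompt_tokens = torch.randint(0, 32000, (1, 8))     # pretend tokenized text prompt
sequence = torch.cat([visual_tokens, prompt_tokens], dim=1)
# answer = large_model.generate(sequence)           # hypothetical; the consumer is now a huge LLM
```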

First: Representation hides in the tokenizer#

The visual encoder has become a visual tokenizer—tasked not with producing a single vector, but with compressing the visual world into a symbolic sequence interpretable by a language model. At its core, this reflects the classic Rate–Distortion (RD) principle, recontextualized for the era of large-scale models.
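As a reminder of the classical statement (standard rate–distortion theory, nothing specific to tokenizers), the rate–distortion function gives the fewest bits needed to keep expected distortion below a budget $D$:

$$
R(D) \;=\; \min_{\,p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})]\le D\,} I(X;\hat{X})
$$

Here $d(\cdot,\cdot)$ is the distortion measure and $I(X;\hat{X})$ is the mutual information between the input and its compressed code.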

A visual tokenizer must compress aggressively while still retaining just enough structure for the LLM to reason. This is representation learning, simply expressed through a new interface. The tokenizer plays the same conceptual role the encoder once did, except that the downstream consumer is no longer a small MLP head but a massive world model. CLIP (Radford et al., 2021) pioneered this style of self/semi-supervised compression, and DINOv3 (Siméoni et al., 2025) carries it forward.
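As a minimal sketch of what "compressing the visual world into a symbolic sequence" means in practice, the core step of a VQ-style visual tokenizer is nearest-neighbor assignment of patch embeddings to a learned codebook. The shapes and names below are illustrative, not any particular model's implementation.

```python
import torch

def quantize(patch_embeddings: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous patch embeddings to discrete token ids via the nearest codebook entry.

    patch_embeddings: (num_patches, dim); codebook: (vocab_size, dim).
    Returns integer token ids of shape (num_patches,).
    """
    # Squared Euclidean distance between every patch and every code vector.
    distances = (patch_embeddings.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(-1)
    return distances.argmin(dim=-1)          # pick the closest "visual word" for each patch

# Toy example: 64 patch embeddings, a 1024-entry visual vocabulary.
patches = torch.randn(64, 256)
codebook = torch.randn(1024, 256)
token_ids = quantize(patches, codebook)      # a 64-token "sentence" describing the image
print(token_ids.shape, token_ids.dtype)      # torch.Size([64]) torch.int64
```

Everything the LLM will ever know about the image has to survive this lossy assignment, which is exactly where the rate–distortion trade-off lives.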

Second: Representation hides in “alignment”#

“Alignment” has become an overloaded term in today’s discourse. But its two essential meanings are both fundamentally representational.

The first sense concerns how modalities understand each other. Radford et al. (2021) taught us how to map vision and language into a shared intersection of semantics. But the new multimodal reasoning paradigm demands more than intersection. It requires synergy: a space where fine-grained perceptual detail meets abstract linguistic structure, where modality-specific insights can be shared at the right moment, and where reasoning remains stable even when one modality is missing. This is a representation problem: it asks how concepts (and the relations between them) ought to be organized.
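For this first sense of alignment, the contrastive recipe popularized by Radford et al. (2021) fits in a few lines. Below is a generic sketch of the symmetric InfoNCE objective, not CLIP's exact implementation; the batch size, dimensionality, and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched pairs apart, carving out a shared
    intersection of semantics between the two modalities.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature      # (batch, batch) similarity scores
    targets = torch.arange(logits.size(0))             # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.T, targets)      # text -> image retrieval
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 paired embeddings in a 512-d shared space.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Notice that this objective only enforces the intersection; the synergy described above (fine detail, timing, robustness to a missing modality) is precisely what it does not yet capture.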

The second sense concerns alignment with reality, and this is where misconceptions proliferate. Many attribute hallucination to non-factual data and advocate for “purely factual” corpora. My view is the opposite: removing imaginative, metaphorical, or synthetic compositions weakens compositional generalization. If experience always bundles factors together—if aspects of the world never vary independently—can any agent truly learn to represent them separately? With enough factual data we can disentangle the concepts that the world itself varies independently, but when the world never separates them, observation alone cannot undo their fusion. Counterfactual data improves learning efficiency when latent structure exists, and may even uncover higher-level regularities not evident in natural distributions. In both language and vision, meaning emerges only where structure aligns with learned priors; where structure is absent, interpretation collapses.

Two non-factual words.
Although both “neuroction” and “bnlahmd” are fictitious words, only the former is interpretable, thanks to its alignment with familiar morphological structures (e.g., “neuro-” + “-ction”). In contrast, “bnlahmd,” while composed of recognizable letters, lacks internal compositional structure. Despite concealing a simple transformation (subtracting 1 from each character’s ASCII code), it appears semantically meaningless. This illustrates the principle that human interpretation relies on latent structure, not surface symbols.
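As a small aside, the hidden structure in the second word can be made explicit. Assuming the caption means the fictitious word was produced by shifting each character's ASCII code down by one, the inverse shift recovers an ordinary English word, yet no human reader perceives that structure from the surface form alone.

```python
def shift_ascii(word: str, offset: int) -> str:
    """Shift every character's ASCII code by `offset`."""
    return "".join(chr(ord(c) + offset) for c in word)

# "bnlahmd" was formed by subtracting 1 from each character of a real word,
# so shifting by +1 inverts the transformation.
print(shift_ascii("bnlahmd", 1))   # -> 'combine'
```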

The real issue is not the presence of non-factual data, but the absence of grounding after pretraining. We should pretrain on rich, heterogeneous mixtures (including the impossible), then use factual datasets, or preference-based algorithms such as DPO (Rafailov et al., 2023), to anchor the model’s concept space back to reality (or to what humans prefer).
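As a hedged sketch of what that anchoring step looks like in code, here is the DPO objective in its basic form; the argument names and batch handling are my own simplifications of the loss described by Rafailov et al. (2023).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss, sketched.

    Inputs are summed log-probabilities of the chosen / rejected responses
    under the policy being trained and under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Increase the policy's relative preference for the chosen response;
    # beta controls how tightly it stays anchored to the reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with fake log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

The pretraining mixture stays rich and imaginative; it is this later, preference-driven pressure that pulls the concept space back toward reality.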

Scaling Laws vs Representation Laws#

In recent years, our field has been mesmerized by scaling laws—the clean power-law curves that predict how performance improves with more parameters, more data, and more compute. These empirical regularities are beautiful, and they have guided the rise of modern LLMs. But scaling laws alone don't tell the whole story about the internal structure that makes models intelligent. What we lack, and what the entire research community, from large-scale labs to individual theorists, must help discover, are representation laws: predictable rules governing how latent spaces form, compress, align, and factorize as models grow.
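For contrast, here is the kind of regularity a scaling law captures. A commonly used parametric fit (popularized by compute-optimal scaling studies such as Hoffmann et al., 2022) models expected loss as a function of parameter count $N$ and training tokens $D$:

$$
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
$$

where $E$ is an irreducible loss floor and $A$, $B$, $\alpha$, $\beta$ are fitted constants. The fit tells us how loss falls and how to trade parameters against data, but nothing about how the latent space that produces that loss is organized.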

Representation laws would answer questions that scaling laws may ignore, such as:

  • Factorization & disentanglement: Under what conditions does a model begin to separate content from style? How much counterfactual variation is necessary for attribute disentanglement?
  • Compression geometry: When and how does visual detail vanish as token budgets shrink? Why do large models often converge toward sparse, modular subspaces?
  • Modality competition: Why does hallucination arise when textual priors overpower weak visual evidence? When does data misalignment between modalities help rather than hurt?

These are the structural invariants underlying high-dimensional learning systems. Scaling laws reveal how much compute we need; representation laws would reveal how intelligence organizes itself. And these laws cannot be discovered by brute-force scaling. They emerge from theory, from careful probing, from limited-compute experimentation—the kinds of work that individuals and small labs may excel at.
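To show that such laws are within reach of limited-compute experimentation, here is one simple probe of the "compression geometry" question above: the participation ratio, a standard effective-dimensionality measure. The function name and toy data are mine; the idea is just to track how many directions a representation actually uses as models or token budgets change.

```python
import numpy as np

def participation_ratio(features: np.ndarray) -> float:
    """Effective dimensionality of a feature matrix of shape (n_samples, n_dims).

    PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the eigenvalues
    of the feature covariance. PR is near n_dims for isotropic features and small
    when representations collapse into a few dominant directions.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(features) - 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)   # guard tiny negative values
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

# Toy usage: compare an isotropic representation with a collapsed one.
iso = np.random.randn(1000, 64)
collapsed = np.random.randn(1000, 2) @ np.random.randn(2, 64)   # rank-2 subspace
print(participation_ratio(iso), participation_ratio(collapsed))  # ~64 vs ~1-2
```

Tracking a quantity like this across token budgets, model sizes, or training stages is exactly the kind of experiment that fits on a single GPU yet speaks to how latent spaces compress.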

AI as Tools vs. Successors: What Are We Actually Building?#

This question about our ultimate goal—what are we building?—reveals a useful spectrum of motivations in our field.

  • On the one hand, there is the vital work of building powerful AI tools: systems that augment human capability, often by mastering language, our most polished interface for knowledge.
  • On the other hand, there is the pursuit of AI successors: foundational systems capable of perceiving and learning directly from the world, potentially surpassing human-conceived boundaries. This vision often prioritizes perception-first learning (like Yann LeCun's JEPA) and sees language as a powerful, but not exclusive, component of understanding reality.

In practice, these two visions feed each other: scaled tool models expose phenomena that successor-style research tries to explain, while successor architectures inspire methods that improve the tools.

The core question for us as researchers is where we place our focus. Representation learning sits at the heart of this interplay, determining whether we are optimizing a system to "talk about" the world as we see it, or to "see" the world in its own way.

A Hopeful View from Limited Compute#

And so the original question returns: What is left for representation learning researchers who cannot train XXB-parameter models?

The answer, I believe, is that there remains a vital and foundational role to play.

Scaling is vital—it is like building a more powerful telescope that reveals new phenomena. But revelation alone is only the first step. Alongside scaling, the work that requires focused insight includes:

  • designing better tokenizers,
  • studying how concepts form inside models,
  • crafting datasets with rich counterfactual structure,
  • engineering grounding mechanisms that reduce hallucination without suppressing imagination,
  • and developing the theoretical foundations of representation laws.

Limited computational resources foster clarity, encouraging early attention to questions of structure and efficiency. It is in these structural questions—the geometry, information flow, and factorization of concepts—that the next breakthroughs may emerge. And for those of us working with a different set of constraints, this is not a disadvantage but a powerful focus. It forces us to confront structures that can be difficult to isolate at massive scale: how concepts factorize, how modalities align, how grounding stabilizes reasoning, how counterfactuals shape abstraction. Answering these questions requires a partnership between massive compute and focused theoretical insight: patience, theory, and careful experiments that can then be validated and tested at scale.

Representation learning has not vanished. It has expanded. It has moved into the bottlenecks, interfaces, and latent geometry of large multimodal systems.

Scaling laws built the engines.
Representation laws will explain them.