World Models: AI's Pivot From Words to Physics

LLMs have never experienced reality. World models simulate physics and space instead of predicting tokens, and efforts led by Yann LeCun, Fei-Fei Li, and Google DeepMind are attracting billions in investment.

AI Research · AI Architecture · World Models · DeepMind · Robotics

Large language models can write poetry about rain but have no concept of wetness. They can explain orbital mechanics but cannot predict that a dropped ball falls down. They've never experienced the world.

That limitation is now pulling billions in investment toward the most significant architectural shift since transformers. World models—AI systems that maintain internal representations of physical reality—are the focus of three major players: Yann LeCun's AMI Labs, Google DeepMind's Genie 3, and Fei-Fei Li's World Labs. The thesis: predicting the next token was a useful detour, but intelligence evolved from perception-action loops, not language manipulation. If you want AGI, you need systems that simulate reality.

Fei-Fei Li calls LLMs "eloquent but inexperienced, knowledgeable but ungrounded": they manipulate one-dimensional sequential language with no spatial or temporal grounding.

This is not a training data problem. You cannot fix it by feeding GPT-5 more physics textbooks. The limitation is architectural. Current models have, as Scientific American puts it, "only an implicit sense of the world from their training data" without real-time updating capability. Once deployed, they do not learn from experience. UC Berkeley's Angjoo Kanazawa frames the relationship as complementary rather than competitive: the LLM serves as an interface handling language and common sense, while a world model provides the "spatial temporal memory" that LLMs lack. She believes the system that gets to AGI will need both.

World models work differently. They maintain internal representations that continuously update based on new inputs. Rather than predicting the next token, they predict how world states evolve through space and time. The technical approaches differ, but share core properties: world models must be generative (creating consistent 3D worlds), multimodal (processing diverse inputs beyond text), and interactive (outputting world states based on actions).
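Those two behaviors, folding new observations into an internal state and rolling that state forward under actions, can be made concrete with a toy sketch. Everything below is illustrative: the class names, the blending weights, and the linear "dynamics" are invented for this example and correspond to no vendor's actual API.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    """Hypothetical latent world state: a compact representation of a scene."""
    latent: list[float]
    t: float  # simulation time in seconds

class WorldModel:
    """Toy interface contrasting world-state prediction with next-token prediction."""

    def observe(self, state: WorldState, observation: list[float]) -> WorldState:
        # Continuously update the internal representation from new input
        # (the "multimodal" property: inputs need not be text).
        blended = [0.9 * s + 0.1 * o for s, o in zip(state.latent, observation)]
        return WorldState(latent=blended, t=state.t)

    def step(self, state: WorldState, action: list[float], dt: float = 0.04) -> WorldState:
        # Predict how the world evolves under an action over dt seconds
        # (the "interactive" property: actions in, world states out).
        advanced = [s + a * dt for s, a in zip(state.latent, action)]
        return WorldState(latent=advanced, t=state.t + dt)

state = WorldState(latent=[0.0, 0.0], t=0.0)
state = WorldModel().step(state, action=[1.0, -1.0])
print(state.latent, state.t)  # the state after one 40 ms action step
```

The contrast with an LLM is in the signatures: instead of `predict_next_token(tokens)`, the core operations are `observe(state, observation)` and `step(state, action)`.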

DeepMind's Genie 3 demonstrates what this looks like in practice. It is the first world model to allow real-time interaction at 24fps with 720p resolution, generating multi-minute consistent environments with physics simulation for water, lighting, and terrain. The model "teaches itself how the world works through training" rather than relying on hard-coded physics rules. Using auto-regressive frame generation, it predicts frames based on prior context and user actions. DeepMind is already commercializing this: Waymo now uses Genie 3 to train robotaxis on simulated dangerous scenarios that would be impossible to recreate safely in the real world.
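The autoregressive loop described above (each frame predicted from prior frames plus the user's action, with the model's own output fed back in as context) can be sketched in a few lines. The "model" here is deliberately trivial; Genie 3's actual network is not public, so only the loop structure is the point.

```python
from collections import deque

def predict_next_frame(context: list, action: int) -> list:
    """Stand-in for a learned frame predictor: shift the last frame by the action.

    In a real world model this is a neural network conditioned on the
    context window and the action; this toy version just illustrates I/O.
    """
    last = context[-1]
    return [pixel + action for pixel in last]

def rollout(first_frame: list, actions: list, context_len: int = 4) -> list:
    # Sliding window of recent frames serves as the conditioning context.
    context = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for action in actions:
        nxt = predict_next_frame(list(context), action)
        context.append(nxt)  # autoregression: generated frames become context
        frames.append(nxt)
    return frames

frames = rollout([0, 0, 0], actions=[1, 1, -2])
print(frames)  # -> [[0, 0, 0], [1, 1, 1], [2, 2, 2], [0, 0, 0]]
```

Feeding generated frames back in is also where long-horizon consistency gets hard: small prediction errors compound across the rollout, which is why multi-minute coherence is a headline claim for Genie 3.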

Three Distinct Bets

The competitive landscape is unusually clean.

AMI Labs is the contrarian research bet. LeCun (who left Meta to become executive chairman) has been arguing against autoregressive models for years. The company uses I-JEPA (Image-based Joint Embedding Predictive Architecture), which learns abstract visual representations by predicting masked regions in latent space rather than reconstructing pixels, bypassing the brute-force approach of predicting every detail. They're seeking €500 million at a €3 billion valuation, pre-product. Aggressive, but it reflects conviction that their architecture is fundamentally different. The company is targeting wearables, robotics, and manufacturing applications, with offices in Paris, Montreal, New York, and Singapore.
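The key idea, scoring predictions in embedding space instead of pixel space, can be shown with a deliberately tiny sketch. The encoder and predictor below are placeholders invented for this example; it illustrates the shape of a JEPA-style objective, not the actual I-JEPA training procedure.

```python
import random

def encode(patch: list) -> float:
    """Toy encoder: collapse a patch of pixels to a 1-D embedding (its mean)."""
    return sum(patch) / len(patch)

def jepa_style_loss(patches: list, predictor) -> float:
    """Mask one patch and predict its *embedding* from the others.

    The loss lives entirely in representation space -- no pixel
    reconstruction -- which is the JEPA departure from autoencoders.
    """
    target_idx = random.randrange(len(patches))
    target_emb = encode(patches[target_idx])  # target computed in latent space
    context_embs = [encode(p) for i, p in enumerate(patches) if i != target_idx]
    predicted = predictor(context_embs)
    return (predicted - target_emb) ** 2      # latent-space squared error

# A trivial "predictor" that guesses the mean of the context embeddings.
mean_predictor = lambda embs: sum(embs) / len(embs)
loss = jepa_style_loss([[1, 1], [2, 2], [3, 3]], mean_predictor)
print(loss)
```

Because the target is an embedding rather than raw pixels, the model is free to ignore unpredictable low-level detail, which is exactly the "brute force" that pixel reconstruction wastes capacity on.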

World Labs is the commercialization leader. Fei-Fei Li raised $230 million and shipped first. Their product Marble launched with a freemium model scaling to $95/month, and an API became available in January 2026. While competitors are still in research mode, World Labs is generating revenue.

DeepMind's Genie 3 started as the pure research play, but Google is moving fast to commercialize it. On January 29, Google launched Project Genie to US subscribers of its AI Ultra plan at $250/month, making Genie 3 accessible as a consumer product that lets users create, explore, and remix interactive worlds from text prompts and images. The Waymo integration for robotaxi training remains the flagship enterprise application, but Project Genie signals that Google sees a direct-to-consumer market for world models too. Operating internally at Google means no fundraising pressure, and they're leveraging that to push on both fronts simultaneously.

World models are not "post-transformer" in the sense of replacing transformers entirely. Genie 3 and other world models still use transformer components. The shift is in what they're trained to predict. Instead of next-token prediction on text, they predict world states, spatial relationships, and physical dynamics. Architecturally, this is more evolution than revolution. But the training objective change has profound implications for what these systems can learn and do.

The enabling research extends beyond any single company. Scientific American notes that technical advances include NeRF (Neural Radiance Fields) algorithms dating to 2020, with recent papers like NeoVerse and TeleWorld converting 2D video to 4D models (3D plus time). This matters because world models need to learn from visual data at scale. The ability to convert ordinary video into navigable 3D representations means training data is abundant. Every video ever recorded becomes potential training material.

What World Models Still Can't Do

Genie 3's published limitations are telling: text rendering, accurate representation of real-world locations, and extended multi-agent interactions all remain challenging. These are not trivial gaps.

Text rendering failures suggest the model lacks symbolic reasoning grounding. Location accuracy matters for any mapping or navigation application. Multi-agent limitations constrain usefulness for social simulation.

World models solve some problems that LLMs cannot, but they do not solve all problems. The open question is whether combining world models with language models produces something greater than either alone, or whether these turn out to be fundamentally different paths toward different goals.

For most developers, world models are not immediately actionable. The APIs are limited, the costs are high, and the use cases are specialized.

Three applications worth watching:

Robotics training is the clearest near-term win: simulating dangerous or rare scenarios before deploying physical robots is obviously valuable, and Waymo's adoption validates this.

Game and simulation development benefits from procedural world generation, with consistent, interactive 3D environments from simple prompts reducing production costs dramatically.

Scientific simulation is the high-upside bet. Li specifically mentions drug discovery and materials science as applications. If world models can accurately simulate molecular interactions or materials under stress, they accelerate research that currently requires expensive lab work.

Our read: The narrative tension here is between LeCun's patient research approach and the shipping pace of competitors. LeCun has been right about the limitations of autoregressive models for years, but being right early is not the same as winning. World Labs is already generating revenue. DeepMind has the Waymo integration and a $250/month consumer product in market. AMI Labs is raising at a pre-product valuation that assumes their architecture is meaningfully better. At least one of these bets will be wrong. The transformer-for-text paradigm that has dominated since 2017 is not the final answer. Whether world models replace it or supplement it is still an open question. But the concentration of talent, capital, and research attention suggests this is the architecture race to watch.
