Foundation Models: The Three-Stage AI Pipeline

Every AI product you use shares the same invisible architecture: pre-train at massive scale, adapt cheaply, deploy widely. Here's how that pipeline actually works.

Explainers · foundation models · machine learning · AI infrastructure · fine-tuning

ChatGPT, Claude, Gemini, Midjourney, GitHub Copilot. Different products, different companies, different vibes. But underneath? They all run on the same architectural pattern that emerged around 2018 and now dominates everything. That pattern is the foundation model.

Stanford's Center for Research on Foundation Models puts it this way: a foundation model is trained on broad data using self-supervision at scale, then adapted to specific downstream tasks. The key insight is the sequence. These models learn general capabilities first, get specialized later. The pre-trained model is the foundation; everything else is built on top.

This matters because it fundamentally changed the economics of AI development. Instead of training a new model for every task, organizations can take an existing foundation and adapt it. What once required hundreds of millions of dollars can now be done for thousands.

From Raw Data to Working Product

Stage 1: Pre-training. This is where the heavy compute happens. A model trains on massive amounts of unlabeled data, learning patterns without being told what to look for. According to IBM, pre-training uses self-supervision on massive unlabeled datasets to teach the model general patterns rather than specific tasks. The model predicts masked words, next tokens, or image patches. It never sees a label; it just learns structure.
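The self-supervised objective can be sketched in miniature: the next token serves as the label, so no human annotation is needed. This toy bigram predictor is illustrative only; the corpus and function names are invented for the example, and real pre-training uses neural networks over billions of tokens, not counts.

```python
from collections import Counter, defaultdict

# Toy illustration of self-supervision: the training signal comes from
# the data itself. Each word's "label" is simply the word that follows
# it in the raw, unlabeled text.
corpus = "the model learns patterns the model predicts tokens".split()

# Build (context -> next-token) counts from the unlabeled corpus.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent token seen after `word`."""
    return transitions[word].most_common(1)[0][0]

print(predict_next("the"))  # "model" follows "the" twice in the corpus
```

The same pattern scales up: swap the count table for a Transformer and the toy corpus for a web-scale dataset, and the objective is unchanged.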

The scale is genuinely hard to overstate. Stanford's AI Index shows the cost progression: the original 2017 Transformer paper required less than $1,000 in compute. GPT-4 cost approximately $78 million. Google's Gemini Ultra hit an estimated $191 million.

Nearly a 200,000x increase in seven years.

Cost doesn't always correlate with capability, though. DeepSeek-V3 achieved comparable results for roughly $5.6 million total, covering pre-training, context extension, and fine-tuning. The gap between $191 million and $5.6 million for similar performance tells you everything about how young this field still is.
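The cost figures above are easy to sanity-check. The dollar values below are the approximate estimates cited in this section, not precise accounting:

```python
# Back-of-the-envelope check on the training-cost estimates cited above
# (figures from Stanford's AI Index and DeepSeek; all approximate).
transformer_2017 = 1_000          # original Transformer, < $1,000
gpt4 = 78_000_000                 # GPT-4, ~$78 million
gemini_ultra = 191_000_000        # Gemini Ultra, ~$191 million
deepseek_v3 = 5_600_000           # DeepSeek-V3, ~$5.6 million total

print(gemini_ultra // transformer_2017)  # 191,000x growth since 2017
print(round(gemini_ultra / deepseek_v3)) # ~34x gap for similar results
```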

Stage 2: Adaptation. This is where one model becomes many. A single pre-trained foundation can be adapted to thousands of different applications through fine-tuning, prompting, RLHF, or other techniques. Fine-tuning uses a smaller, task-specific dataset to adjust the model's behavior for particular applications. The key advantage: it's dramatically cheaper than training from scratch. Organizations can spend thousands on fine-tuning rather than hundreds of millions on pre-training.

Two approaches dominate. Full fine-tuning updates all model parameters; it's more powerful but requires more compute and risks catastrophic forgetting. Parameter-efficient fine-tuning (PEFT) methods like LoRA update only a small fraction of weights, making adaptation accessible to teams without massive GPU clusters. Neptune's 2025 training report shows how this plays out in practice: typical deployments use 24 to 32 GPUs, with ranges spanning 2 to 128 or more, and teams allocate roughly 20% of compute budget to main training runs and 80% to experimentation. The real work is iteration, not the final training run.
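The arithmetic behind parameter-efficient methods is simple to see. LoRA replaces a full d × k weight update with two low-rank factors, B (d × r) and A (r × k). A minimal sketch with illustrative dimensions (not taken from any particular model):

```python
# Why LoRA is cheap: count the parameters each approach must update.
# Dimensions are illustrative, chosen to resemble one attention
# projection matrix in a mid-sized Transformer.
d, k = 4096, 4096      # shape of one weight matrix
r = 8                  # LoRA rank (typically somewhere in 4-64)

full_update = d * k            # parameters touched by full fine-tuning
lora_update = r * (d + k)      # parameters in the B and A factors

print(full_update)                         # 16,777,216
print(lora_update)                         # 65,536
print(f"{lora_update / full_update:.2%}")  # 0.39% of the weights
```

Updating well under 1% of the weights per adapted matrix is what makes adaptation feasible on a handful of GPUs rather than a cluster.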

Stage 3: Deployment. The adapted model gets wrapped in APIs, interfaces, and guardrails for end users. This stage is largely invisible to users, but it's where the model meets reality: latency requirements, cost constraints, safety filters, and the messy edge cases that benchmarks never capture.
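What "wrapped in guardrails" means can be sketched minimally. Everything here is a stand-in invented for illustration: `fake_model`, the blocklist, and the latency check are not any provider's actual API; production systems use learned safety classifiers and streaming timeouts rather than substring matches.

```python
import time

# Minimal sketch of a deployment layer: the raw model call gets wrapped
# in a safety filter and a post-hoc latency check before reaching users.
BLOCKED_TERMS = {"credit card number"}   # illustrative blocklist

def fake_model(prompt: str) -> str:
    """Stand-in for the adapted foundation model."""
    return f"echo: {prompt}"

def serve(prompt: str, timeout_s: float = 2.0) -> str:
    # Guardrail: refuse prompts that trip the safety filter.
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        return "Request refused by safety filter."
    # Latency constraint: flag responses that took too long.
    start = time.monotonic()
    reply = fake_model(prompt)
    if time.monotonic() - start > timeout_s:
        return "Request timed out."
    return reply

print(serve("hello"))                          # echo: hello
print(serve("what is my credit card number"))  # refused by the filter
```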

Emergence and Homogenization

Emergence describes capabilities that appear at scale but were never explicitly trained for. A model learns to predict text; somewhere along the way, it develops the ability to do arithmetic, write code, or reason through multi-step problems. Stanford's CRFM report identifies emergence as a defining characteristic: scale produces unexpected capabilities. This is why scaling laws dominated AI strategy for years. If capabilities emerge unpredictably at larger scales, the rational move is to train bigger models and see what happens. That logic drove compute investment from 2020 through 2024.

Homogenization is the flip side.

When many applications build on the same few foundation models, they inherit both the strengths and the defects of those foundations. Stanford puts it directly: homogenization creates "powerful leverage" but also means "defects of the foundation model are inherited by all adapted models downstream."

Our read: this is the structural risk that doesn't get enough attention. If GPT-4 has a subtle reasoning flaw, every product built on it has that same flaw. If Claude mishandles a particular type of query, so does every application using Claude as its backbone. The efficiency gains from homogenization come with correlated failure modes.

The terminology here trips people up. Are foundation models and large language models the same thing? Georgetown's CSET clarifies the relationship: foundation models and LLMs overlap significantly but differ in scope. LLMs are language-specific; they model text by creating digital representations of language. Foundation models are a broader functional category that includes models handling vision, robotics, reasoning, and multimodal inputs. All LLMs are foundation models, but not all foundation models are LLMs. A vision model trained on images can be fine-tuned for medical diagnosis or autonomous driving; it follows the same pre-train-then-adapt pattern without processing any language.

Georgetown also notes these remain "intentionally loose categories rather than watertight technical definitions." Don't expect crisp boundaries. The terminology evolved alongside the technology, and it shows.

Why Build Your Own?

Given the costs involved, why would anyone train a foundation model instead of using an existing one?

Neptune's report identifies three drivers. First, specialized problem-solving: some domains need capabilities that general-purpose models lack. Second, regulatory compliance: industries like healthcare and finance face data residency and privacy requirements that third-party APIs can't satisfy. Third, competitive advantage: organizations want to build internal capability rather than depend on external providers.

The report also reveals how training infrastructure has shifted. The field moved from "meticulously curated datasets" toward massive raw data with filtering heuristics. Synthetic data is increasingly integral for generating rare signals and balancing datasets. Training at this scale is less about careful curation and more about processing everything, then filtering aggressively.
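What "filtering aggressively" can look like in practice: two of the most common heuristics are a minimum-length cut and exact deduplication by hash. The thresholds and sample documents below are invented for illustration; production pipelines add quality classifiers, language detection, and fuzzy dedup on top.

```python
import hashlib

# Sketch of "process everything, then filter aggressively": drop
# documents that are too short, and drop exact duplicates by hash.
raw_docs = [
    "A long, informative document about transformers and attention.",
    "ok",                                                              # too short
    "A long, informative document about transformers and attention.",  # duplicate
    "Another substantive document worth keeping in the corpus.",
]

def keep(doc: str, min_chars: int = 20) -> bool:
    """Length heuristic: reject documents below a character threshold."""
    return len(doc) >= min_chars

seen: set[str] = set()
filtered = []
for doc in raw_docs:
    digest = hashlib.sha256(doc.encode()).hexdigest()
    if keep(doc) and digest not in seen:
        seen.add(digest)
        filtered.append(doc)

print(len(filtered))  # 2 documents survive: short and duplicate docs removed
```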

Foundation models aren't just a technical architecture; they're an economic architecture. The three-stage pipeline (pre-train at massive scale, adapt cheaply, deploy widely) creates a particular market structure. A handful of organizations can afford the pre-training step. They become the infrastructure layer: OpenAI, Anthropic, Google, Meta, and a few others. Everyone else builds on top through adaptation and deployment.

The leverage is enormous, both in capability and in inherited risk.

Stanford's CRFM identifies five core capability areas where foundation models now operate: language, vision, robotics, reasoning, and human interaction. That scope keeps expanding. What started as text prediction now reaches into physical manipulation, scientific discovery, and real-time decision-making.

The paradigm has limits. Emergence isn't guaranteed; some capabilities simply don't appear regardless of scale. Homogenization means concentrated points of failure. The costs of pre-training keep rising, which concentrates power among the few organizations that can afford it. But for now, the foundation model is the dominant paradigm. Understanding the three-stage pipeline, and the economics and risks embedded in each stage, is the prerequisite for understanding anything else happening in AI.

Frequently Asked Questions