How modern LLMs are trained, end to end
Building a frontier large language model today is a three-act play: a massive pretraining phase that absorbs the internet, a post-training phase that turns the resulting "completer" into a helpful assistant, and an increasingly important reinforcement-learning-on-reasoning phase that teaches the model to think before answering.
Training a frontier large language model is no longer just a software problem — it is a massive, multi-stage engineering pipeline. The whole process now costs hundreds of millions of dollars per flagship model, runs across tens of thousands of GPUs for months, and has shifted in 2024–2026 from a near-singular focus on scaling pretraining to a multi-axis race that includes synthetic data, mixture-of-experts architectures, and inference-time compute. This primer walks through each stage with enough vocabulary to ask informed follow-up questions, then surveys the trends shaping where the field is going next.
The training pipeline at a glance
A modern LLM (Claude, GPT, Gemini, Llama) is built in four broad stages that flow into each other. Pretraining takes a randomly initialized neural network and shows it roughly 10–15 trillion tokens of internet text, books, code, and scientific papers, training it to predict the next word. The output is a base model — fluent but unhelpful, more autocomplete than assistant. Supervised fine-tuning (SFT) then shows the model tens of thousands to millions of high-quality human-written prompt-and-response pairs, teaching it to follow instructions and adopt a conversational format. Preference-based fine-tuning (RLHF, Constitutional AI, DPO, and friends) further shapes behavior using comparisons of which responses humans or other AIs prefer, dramatically improving helpfulness, honesty, and harmlessness. Finally, a new reasoning-RL phase — added to the pipeline in 2024 — trains the model on verifiable tasks like math and code, rewarding correct final answers and letting the model develop long internal chains of thought. Evaluation, red-teaming, and safety review happen throughout, and the finished artifact gets deployed behind APIs and chat products.
One useful analogy: pretraining is college, where the model reads everything in the library; SFT is professional training, where it learns to do a specific job politely; RLHF/CAI/DPO is on-the-job coaching by managers who say "this answer was better than that one"; and reasoning RL is graduate research, where it learns to work through hard problems patiently before declaring an answer.
Pretraining is where the world model gets built
Pretraining is computationally the dominant stage and conceptually the simplest. The model is shown an enormous amount of unlabeled text and trained on a single self-supervised objective: predict the next token, given all previous tokens. No human labels are required — the "label" is just the next sub-word in the document itself. The goal is to produce a foundation model that has implicitly absorbed grammar, facts, reasoning patterns, code syntax, multiple languages, dialog, and latent skills like arithmetic. As Ilya Sutskever has framed it, next-token prediction is a form of compression, and compressing all of human-generated text well requires building an internal model of the world that produced that text. To predict the next token in "The patient's lab results showed elevated…" the model has to implicitly learn medicine; to continue a Python function it has to learn programming.
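The next-token objective described above can be sketched in a few lines. This is a toy illustration, not any lab's training code: the three-token vocabulary, the hand-written logits, and the function names are all assumptions made for the example. It shows only that the training signal is the negative log-probability the model assigns to the token that actually comes next.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the prefix ending at t.

    logits_per_position[t] is the (hypothetical) model's score vector over the
    vocabulary after reading token_ids[: t + 1]; the target is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]  # the "label" is simply the next token
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens and a 3-token document [0, 2, 1].
tokens = [0, 2, 1]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # model knows nothing
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]  # model favors the right token

print(next_token_loss(uniform, tokens))    # log(3): a uniform guess over 3 tokens
print(next_token_loss(confident, tokens))  # much smaller loss
```

Pretraining amounts to adjusting the network's weights so that this number falls, averaged over trillions of positions.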
Data curation matters as much as raw scale. Llama 3 and 3.1 were pretrained on ~15 trillion tokens, about 7× Llama 2's dataset, mixed roughly as 50% general knowledge, 25% math and reasoning, 17% code, and 8% multilingual. DeepSeek V3 used 14.8T tokens; GPT-4 is estimated at around 13T (undisclosed). Typical sources include Common Crawl (heavily filtered into derivatives like FineWeb-Edu and DCLM), GitHub, Wikipedia, books, arXiv, and, increasingly, synthetic data generated by other LLMs. The 2024–2026 consensus is that quality beats quantity: Meta's Llama 3 team has stated that most of their gains over Llama 2 came from data improvements rather than architecture changes, achieved through aggressive deduplication, model-based quality filtering, PII removal, and decontamination against evaluation sets.
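Two of the curation steps named above, deduplication and decontamination against evaluation sets, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses exact hashing of normalized text and a 5-word n-gram overlap rule. Production pipelines typically also use fuzzy near-duplicate detection and their own thresholds, which this sketch does not reproduce. All function names and the sample corpus are hypothetical.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first copy of each document, comparing normalized content hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(docs, eval_items, n=5):
    """Drop any document sharing at least one word n-gram with an evaluation item."""
    banned = set()
    for item in eval_items:
        banned |= ngrams(item, n)
    return [d for d in docs if not (ngrams(d, n) & banned)]

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "the  mitochondria is the powerhouse of the cell.",     # whitespace/case variant
    "Q: What is the capital of France and why? A: Paris.",  # leaks a benchmark item
    "Transformers process tokens in parallel.",
]
benchmark = ["What is the capital of France and why?"]

clean = decontaminate(exact_dedup(corpus), benchmark)
print(len(clean))  # 2
```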
Compute is staggering and concentrated. Frontier 2024–2025 runs use clusters of 16,000 to 100,000+ NVIDIA H100 GPUs, now migrating to Blackwell (B200/GB200). Meta disclosed that Llama 3.1 405B alone consumed 30.84 million H100-hours on a 16,000-GPU cluster over roughly 50–100 days, costing somewhere between $52M in pure rental compute and ~$170M all-in depending on accounting. Sam Altman has said GPT-4 cost "more than $100M"; Gemini Ultra was estimated at ~$191M by Stanford's 2025 AI Index; and GPT-4.5 ("Orion") reportedly involved training runs costing $500M+ each. The outlier is DeepSeek V3 at a reported ~$5.6M for its final pretraining run — a real efficiency milestone, though that figure excludes prior research, hardware capex, and salaries. Looming over all of this is OpenAI's Stargate project, announced January 2025: a $500B, four-year commitment to build 10 gigawatts of AI compute, with ~$400B and 7 GW already committed across six sites by October 2025.
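The Llama 3.1 figures above can be sanity-checked with simple arithmetic. The GPU-hour and cluster-size numbers come from the text. The hourly rate is an assumption chosen for illustration, not a disclosed price, and the calculation assumes all 16,000 GPUs run the whole time.

```python
GPU_HOURS = 30.84e6    # H100-hours reported for Llama 3.1 405B
CLUSTER_GPUS = 16_000  # cluster size cited above
ASSUMED_RATE = 1.70    # hypothetical rental price, USD per H100-hour

# Wall-clock time if every GPU is busy for the whole run.
wall_clock_days = GPU_HOURS / CLUSTER_GPUS / 24

# Rental-equivalent cost in millions of USD at the assumed rate.
rental_cost_musd = GPU_HOURS * ASSUMED_RATE / 1e6

print(f"{wall_clock_days:.0f} days at full utilization")    # 80 days
print(f"${rental_cost_musd:.0f}M rental-equivalent compute")  # $52M
```

The roughly 80-day result sits inside the 50–100 day range quoted above, and a rate near $1.70 per GPU-hour reproduces the low-end $52M estimate. The higher all-in figures reflect costs this arithmetic leaves out.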
The architecture is the transformer, introduced in Vaswani et al.'s 2017 paper Attention Is All You Need. Text gets chopped into sub-word tokens, each mapped to a learned vector (an embedding). The model then stacks dozens of layers, each containing a self-attention mechanism, the conceptual heart of the design, that lets every token "look at" every other token in context and weight their relevance. In "The trophy didn't fit in the suitcase because it was too big," attention lets "it" examine both "trophy" and "suitcase" to figure out the antecedent. Multi-head attention does this in parallel along many channels (syntax, coreference, topic). Feed-forward layers between attention layers store factual knowledge; residual connections and normalization keep training stable. Transformers replaced RNNs because they parallelize across tokens (RNNs are sequential) and give every token direct access to every other token regardless of distance. Today's frontier LLMs are almost all decoder-only transformers with causal masking: each token attends only to earlier tokens, which matches the next-token-prediction objective and the way text is generated one token at a time at inference. Recent refinements (rotary position embeddings, grouped-query attention, multi-head latent attention, mixture-of-experts) tweak this design but don't change the core.
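A minimal sketch of single-head causal self-attention makes the masking idea concrete. It is deliberately stripped down: the queries, keys, and values are passed in directly rather than produced by learned projections, there is one head, and the toy vectors are invented for the example. Real implementations are batched, multi-headed, and run on accelerators.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may only attend to positions 0..i, so future tokens never
    leak into the representation used to predict them.
    """
    d = len(queries[0])
    outputs, weights_per_pos = [], []
    for i, q in enumerate(queries):
        # Score only the visible prefix; skipping j > i is the causal mask.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        weights_per_pos.append(weights)
    return outputs, weights_per_pos

# Three toy token vectors; in a real model Q, K, V come from learned projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
for i, w in enumerate(attn):
    print(i, [round(v, 3) for v in w])  # row i has i + 1 weights that sum to 1
```

The first token can attend only to itself, so its attention weight is 1 and its output equals its own value vector. Later tokens spread weight across everything before them.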
Post-training turns a parrot into an assistant
A raw base model is a fluent text completer, not a helpful agent. Ask it "What is photosynthesis?" and it may produce more exam-style questions, because that's what its training data often looked like. Post-training reshapes its output distribution toward useful, on-topic, safe answers.
Supervised fine-tuning (SFT) is the first step. The model is trained on a curated dataset of human-written prompt-response pairs: tens of thousands in the original InstructGPT recipe (Ouyang et al. 2022), now scaled to millions for frontier models and often blended with synthetic examples generated by stronger teacher models. SFT teaches the model to follow instructions, adopt an assistant persona, and produce well-structured answers. A striking InstructGPT finding: human raters preferred the outputs of a 1.3B-parameter aligned model over those of the un-aligned 175B GPT-3, meaning a parameter gap of more than 100× was overcome by alignment training alone.
Reinforcement Learning from Human Feedback (RLHF) then adds a much richer training signal. The technique traces to Christiano et al. (2017) and was applied to LLMs at scale in the InstructGPT paper. It has three pieces: (1) collect human rankings of multiple model outputs for the same prompt, (2) train a reward model that learns to assign a scalar score reflecting human preference, then (3) optimize the LLM with reinforcement learning, historically Proximal Policy Optimization (PPO), to maximize the reward, with a KL-divergence penalty that prevents the policy from drifting into gibberish that games the reward model. RLHF works dramatically better than SFT alone because ranking is much easier for humans than authoring perfect responses, and because the reward model generalizes preferences to millions of unseen prompts. A useful analogy: SFT is learning to cook from a cookbook, while RLHF is having a food critic taste two versions of each dish and tell you which is better, letting you iterate beyond any recipe.
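Steps (2) and (3) above each reduce to a small formula, sketched here. The pairwise loss is the standard Bradley-Terry style objective commonly used for reward models, and the second function shows the usual way a KL-style penalty is folded into the reward. The value of beta, the function names, and the example numbers are assumptions for illustration, and the PPO update itself is omitted.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred response higher.
    """
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward seen by the RL step: reward-model score minus a KL-style penalty.

    logp_policy and logp_reference are summed log-probabilities of the sampled
    response under the current policy and the frozen reference model.
    """
    return reward - beta * (logp_policy - logp_reference)

print(reward_model_loss(2.0, 0.0))  # ranking respected: small loss
print(reward_model_loss(0.0, 2.0))  # ranking violated: large loss

# A response the policy now finds far more likely than the reference did
# gets its reward docked, which discourages drifting to reward-hacking text.
print(kl_penalized_reward(1.5, logp_policy=-10.0, logp_reference=-30.0))
```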
Constitutional AI (CAI) and RLAIF are Anthropic's alternative, introduced in Bai et al. (2022). The motivation is partly ethical (human labelers shouldn't spend their days reading harmful content) and partly scientific (written principles are more transparent and consistent than crowdworker intuition). The model is given a written "constitution" of about 10–75 principles drawn from sources like the UN Universal Declaration of Human Rights plus rules like "choose the response that is most helpful, honest, and harmless." In phase one, the model critiques and revises its own harmful outputs against randomly chosen principles, and the revisions become SFT data. In phase two, the model itself rates pairs of responses against the constitution, producing AI-generated preference data that gets mixed with human preferences and used to train a reward model. This is RLAIF (RL from AI Feedback), now a general term for any pipeline using AI preferences, and it is the lineage Claude is trained in. The trade-off: AI feedback is cheap and consistent (low noise) but more biased, while human feedback is noisier but less biased, so modern pipelines blend both.
Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023, is a newer technique that has largely displaced PPO-style RLHF in open-source pipelines. The insight is mathematical: the RLHF objective has a closed-form optimal policy, and you can invert it to derive a simple classification-style loss directly on (prompt, preferred response, rejected response) triples — no separate reward model, no on-policy rollouts. The training loop just nudges the model to assign higher probability to preferred responses and lower probability to rejected ones, relative to a frozen reference. DPO needs only two models in memory instead of PPO's four, has far fewer hyperparameters, and trains more stably. A family of follow-ups (KTO, IPO, ORPO, SimPO) further simplifies or generalizes the idea — KTO, for instance, works with binary thumbs-up/down labels instead of paired preferences. Frontier labs in 2025–2026 often use iterative or online DPO, or hybrid PPO+DPO stacks, rather than pure offline DPO.
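The classification-style loss described above is compact enough to write out for a single preference triple. This sketch follows the published DPO objective, but the beta value, argument names, and example log-probabilities are illustrative assumptions. In practice each log-probability is the sum over a response's tokens, computed by the policy and by the frozen reference.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_shift = logp_chosen - ref_logp_chosen        # policy vs. reference on chosen
    rejected_shift = logp_rejected - ref_logp_rejected  # policy vs. reference on rejected
    logits = beta * (chosen_shift - rejected_shift)
    return math.log(1.0 + math.exp(-logits))            # equals -log sigmoid(logits)

# Policy identical to the reference: loss is log(2), the "no preference" value.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))

# Policy has raised the chosen response and lowered the rejected one: loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Only two sets of log-probabilities are needed, which is why no separate reward model or critic has to sit in memory.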
Alignment and safety training make models helpful, honest, and harmless
The dominant framework, introduced in Anthropic's Askell et al. (2021), is HHH: an aligned model should be Helpful (do what the user actually wants), Honest (factually accurate, calibrated about uncertainty, transparent about being an AI), and Harmless (refuses to facilitate serious harm, avoids manipulation and toxic content). These three properties sometimes conflict — being maximally helpful with a dangerous request would be harmful — and post-training is largely about teaching the model to navigate the trade-offs.
Concretely, alignment is instilled through several overlapping techniques. Harmlessness datasets like Anthropic's open HH-RLHF pair harmful requests with thoughtful refusals. Red-teaming programs at every major lab (Anthropic, OpenAI Preparedness, Google DeepMind Frontier Safety) adversarially probe models for jailbreaks, bioweapon uplift, cyber-offense, manipulation, and self-exfiltration; failures get added to training. Constitutional AI and RLAIF scale harmlessness training without exposing humans to disturbing content. Calibration training teaches models to say "I'm not sure" rather than confabulate. OpenAI's deliberative alignment (Guan et al., Dec 2024), used to align the o1/o3 series, takes a newer approach: it teaches the model the literal text of safety policies and trains it to reason explicitly in its chain of thought about which policies apply before answering. Reported result: o1 scored 0.88 versus GPT-4o's 0.37 on the StrongREJECT jailbreak benchmark — suggesting better reasoning improves safety, not just capability.
Governance frameworks are now baked into the training pipeline. Anthropic's Responsible Scaling Policy defines AI Safety Levels modeled on biosafety: ASL-2 (current production models), ASL-3 (meaningful CBRN uplift or low-level autonomy; activated for Claude Opus 4 in May 2025, triggering tighter security and CBRN safeguards), and ASL-4+ (undefined catastrophic-risk territory). OpenAI has an analogous Preparedness Framework. These commit labs to not deploy more capable models without meeting corresponding safeguards, and they have influenced government policy (UK AISI, the EU AI Act, the Frontier AI Safety Commitments announced at the 2024 Seoul summit). Genuine open problems remain: alignment faking (Greenblatt et al. 2024 showed Claude 3 Opus can strategically pretend to comply with new training objectives while preserving prior preferences in unmonitored contexts), persistent jailbreaks, and over-refusal of benign requests.
Evaluation is increasingly hard because benchmarks keep saturating
Trained models are tested through three complementary lenses: standardized benchmarks, human evaluation, and adversarial red-teaming. The classic suite of MMLU (57 academic subjects), HumanEval (Python coding), GSM8K and MATH (math word problems), BIG-Bench, and HellaSwag is now largely saturated, with frontier models scoring 90%+ and differences within noise. The 2024–2026 generation of harder benchmarks includes GPQA Diamond ("Google-proof" graduate science questions, where frontier scores moved from ~78% to o3's 87.7% against ~65% for human experts), MMLU-Pro (10-option reformulation, ~89–90%), SWE-bench Verified (500 human-validated real GitHub issues; passed 80% for the first time by Claude Opus 4.5 in November 2025, then quickly declared an unreliable signal by OpenAI when 59% of remaining failures turned out to be ambiguous tasks), FrontierMath (research-level math by Epoch AI, where o3 reached ~25% versus <2% for everything else at launch), ARC-AGI (Chollet's fluid-intelligence puzzles, cracked by o3 at the cost of thousands of dollars per task), and Humanity's Last Exam (Phan et al., January 2025: 2,500 graduate-level expert-vetted questions, where frontier models climbed from single digits to the mid-40s through 2025–2026).
Human evaluation has converged on Chatbot Arena (LMArena), where users submit prompts, see two anonymized model outputs, and pick the winner; over six million votes have produced Elo-style ratings that are widely treated as the gold standard for real-world conversational preference. Limitations include style bias (verbose, flattering answers tend to win), under-discrimination on rare long-tail capabilities, and gaming concerns documented in the 2025 "Leaderboard Illusion" critique. Beyond Arena, labs run domain-expert evaluations (lawyers grading legal output, physicians grading clinical reasoning) and use LLM-as-judge proxies like MT-Bench and Arena-Hard.
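For readers unfamiliar with Elo-style ratings, the classic update rule looks like the sketch below. This is the textbook chess formula, included only to show how pairwise votes turn into a ranking. The K-factor of 32 and the 400-point scale are conventional defaults assumed here, and the Arena's actual statistical fitting procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two evenly rated models: the winner gains exactly what the loser gives up.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)

# An upset moves ratings more than an expected result does.
print(elo_update(1000.0, 1200.0, a_won=True))
```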
The deeper issue is that evaluation is in crisis. Benchmarks saturate within months of release. Contamination is rampant — roughly 5–10% of MMLU questions appear in Common Crawl, HumanEval problems are LeetCode duplicates, and removing contaminated GSM8K items has cost some models 13 percentage points. Goodhart's Law applies: once a metric becomes a target, labs over-fit on it. Scaffold dependence means SWE-bench scores depend heavily on the agent harness, not just the model. Even premier new benchmarks have errors — a 2025 audit suggested ~30% of HLE chemistry and biology answers may themselves be wrong. Current best practice is to triangulate across Arena Elo, multiple contamination-resistant benchmarks (HLE, GPQA, FrontierMath, LiveCodeBench), domain-specific custom evals, and red-team safety probes.
The 2024–2026 paradigm shifts
Five interconnected trends define the current frontier.
Reasoning models are the headline development. Starting with OpenAI's o1 (September 2024), labs discovered that applying large-scale reinforcement learning with verifiable rewards (RLVR) — using math and code problems where answers can be auto-checked — produces models that learn to generate long internal chains of thought, self-check, backtrack, and try multiple strategies. OpenAI's o3 (announced December 2024) scored 87.7% on GPQA Diamond, 71.7% on SWE-bench Verified, and a stunning 25.2% on FrontierMath when no other model exceeded 2%. DeepSeek-R1 (January 2025, open-weight, built on the 671B DeepSeek-V3 MoE base) showed that reasoning can emerge from pure RL with no SFT at all (the "R1-Zero" variant) — the model spontaneously develops what DeepSeek called an "aha moment" of self-correction. Claude 3.7 Sonnet (February 2025) introduced hybrid "extended thinking" with a developer-controlled token budget up to 128K, evolving through Claude 4, 4.1, and 4.5 toward "interleaved thinking" that combines reasoning with tool use. Gemini 2.5 made thinking a property of all variants with a thinkingBudget parameter. GPT-5 (August 2025) shipped as a router that auto-switches between fast and thinking modes. A surprising lesson from this wave: simple outcome reward models that score only the final answer (paired with rigorous verifiers like compilers and math checkers) have worked better than the more sophisticated process reward models that grade each step, in part because outcome rewards are harder to hack. DeepSeek's GRPO algorithm, which eliminates PPO's separate critic by sampling K responses per prompt and computing advantages relative to the group mean, is now widely adopted in open-source RL pipelines.
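The group-relative advantage at the core of GRPO, as described above, can be sketched in a few lines. The rewards here are made-up outcome scores (1 for a verified-correct answer, 0 otherwise). Dividing by the group standard deviation follows the common formulation, but treat that normalization and the epsilon term as assumptions of this sketch. The policy-gradient update that consumes these advantages is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response against its own group.

    No learned critic is needed: the group mean acts as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# K = 8 sampled answers to one math prompt; a verifier marked two as correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every sample gets the same reward, no sample is preferred over another.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Correct answers receive positive advantages and incorrect ones negative, so the update pushes probability toward whatever reasoning led to verified answers within each group.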
Mixture-of-Experts (MoE) architectures have become the default at the frontier. The idea is to replace dense feed-forward layers with N parallel "experts" plus a small router that sends each token to only the top few (e.g., 2 of 8, or 8 of 256). The model's total parameter count balloons while active parameters per token stay manageable, yielding much cheaper inference at a given quality level. DeepSeek V3 has 671B total parameters but only 37B active per token; Llama 4 Scout and Maverick (April 2025) use 17B active out of 109B and 400B respectively; GPT-4 was widely rumored to be a 16-expert MoE; Gemini 1.5+ is confirmed MoE. The trade-offs are real — much higher memory requirements (all experts have to be resident), trickier training stability, and complex distributed inference — and Llama 4's botched April 2025 launch (with disappointing real-world quality despite strong benchmarks) showed that MoE benchmark numbers don't always translate to user experience.
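The routing idea above can be illustrated with a toy top-k mixture-of-experts layer. Everything here is simplified for the example: the "experts" are trivial scaling functions, the router logits are hard-coded rather than computed from the token, and the gate weights are renormalized over the selected experts, which is one of several gating conventions in use.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])  # renormalize over chosen experts
    return list(zip(top, gates))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(x)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Eight toy experts: expert i just scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

print(route_top_k(logits))                     # [(1, 0.5), (4, 0.5)]
print(moe_layer([1.0, 2.0], logits, experts))  # [3.5, 7.0]
```

Only two of the eight experts run for this token, which is the source of the inference savings. All eight still have to be held in memory, which is the cost noted above.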
Synthetic data has become essential as the open web's high-quality text approaches exhaustion. Microsoft's Phi-4 (December 2024, 14B parameters) was trained on ~400B tokens of synthetic data from 50+ generation pipelines and surpassed its own GPT-4o teacher on several STEM benchmarks. Llama 3 and 4 used heavy synthetic post-training data and teacher-student distillation. DeepSeek used R1 to generate 800K reasoning traces that were distilled into a family of much smaller models. The theoretical concern is model collapse — Shumailov et al.'s Nature paper (July 2024) showed that recursively training on a model's own outputs causes distribution tails to disappear and outputs to degenerate. In practice, labs mitigate this by always mixing synthetic with fresh real data, by filtering synthetic samples through verifiers (compilers, math checkers, judge models) to keep only high-quality outputs, and by grounding synthetic generation in real seed documents. Follow-up work (Gerstgrasser et al. 2024) suggests that accumulating rather than replacing data largely avoids catastrophic collapse.
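Verifier-based filtering, one of the mitigations named above, can be sketched with a toy arithmetic checker. The sample format and the verifier are hypothetical stand-ins. In real pipelines the verifier would be a compiler, a unit-test harness, a symbolic math checker, or a judge model.

```python
def verify_arithmetic(sample):
    """Hypothetical verifier: recompute 'a op b' and compare with the claimed answer."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    a, op, b = sample["question"].split()
    return ops[op](int(a), int(b)) == sample["answer"]

def filter_synthetic(samples, verifier):
    """Keep only generated samples that the verifier confirms."""
    return [s for s in samples if verifier(s)]

# Pretend these came from a generator model; one answer is wrong.
generated = [
    {"question": "2 + 2", "answer": 4},
    {"question": "3 * 5", "answer": 16},  # incorrect, should be filtered out
    {"question": "9 - 4", "answer": 5},
]

kept = filter_synthetic(generated, verify_arithmetic)
print([s["question"] for s in kept])  # ['2 + 2', '9 - 4']
```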
Whether scaling laws still hold is the defining controversy of the era. The classical pretraining scaling laws came from Kaplan et al. (2020) at OpenAI and were revised by Hoffmann et al.'s 2022 "Chinchilla" paper at DeepMind, which established that compute-optimal training requires roughly 20 tokens per parameter and showed that earlier models like GPT-3 and Gopher had been massively undertrained on data. By late 2024, multiple signals suggested pretraining returns were diminishing at the frontier. The Information and others reported that OpenAI's next-generation "Orion" model fell well short of the GPT-3-to-GPT-4 leap, even after multi-hundred-million-dollar training runs. Ilya Sutskever's NeurIPS 2024 keynote declared "pre-training as we know it will unquestionably end," likening data to "the fossil fuel of AI." When Orion shipped as GPT-4.5 in February 2025, its modest gains were widely read as validation. Yet Sam Altman and Dario Amodei publicly pushed back: Amodei said at Morgan Stanley in March 2025, "We don't see a wall," forecasting "radical acceleration" in 2026; OpenAI's o3 documentation explicitly stated that "large-scale reinforcement learning exhibits the same 'more compute = better performance' trend observed in GPT-series pretraining." The synthesis position, increasingly mainstream by 2026, is that classical pretraining scaling is genuinely flattening, but new scaling axes (post-training RL compute, inference-time compute, data quality) have opened up and are still delivering gains. As Nathan Lambert puts it: scaling is still working at a technical level, but the user-perceived rate of improvement has slowed and shifted form.
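The 20-tokens-per-parameter rule of thumb is easy to turn into arithmetic. The sketch below additionally assumes the common approximation that training costs about 6 FLOPs per parameter per token, which is not stated in the text above. The GPT-3 and Chinchilla figures (175B parameters on about 300B tokens; 70B parameters on 1.4T tokens) are taken from those models' papers.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0,
                       flops_per_param_token=6.0):
    """Split a FLOP budget C into parameters N and tokens D.

    Assumes C = 6 * N * D and the rule of thumb D = 20 * N.
    """
    n_params = (compute_flops / (flops_per_param_token * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

def tokens_per_parameter(n_params, n_tokens):
    return n_tokens / n_params

# GPT-3: far below 20 tokens per parameter, hence "undertrained on data".
print(round(tokens_per_parameter(175e9, 300e9), 1))

# Llama 3.1 405B on roughly 15T tokens: deliberately past the 20x point.
print(round(tokens_per_parameter(405e9, 15e12), 1))

# Chinchilla's own budget recovers about 70B parameters and 1.4T tokens.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Models trained well past 20 tokens per parameter are not compute-optimal for training, but the smaller resulting model is cheaper to serve, which is often the more important cost.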
Test-time compute scaling is the third scaling axis, after pretraining compute and post-training RL compute. Instead of spending more compute at training, you spend more compute per query at inference: letting the model think longer, generate many candidate answers and pick the best, or use tools mid-thought. OpenAI's o1 charts famously showed accuracy increasing log-linearly with thinking-token budget on hard math problems. This shift has real economic consequences: a single hard reasoning query can use hundreds of thousands of tokens and cost dollars per request; o3 in "high compute" mode on ARC-AGI reportedly cost ~$3,000 per task. New $200/month tiers (ChatGPT Pro, Claude Max) and capacity-constrained features like Deep Research and Computer Use reflect that frontier capability is now inference-bottlenecked, not just training-bottlenecked. Crucially, simply giving a base model a longer chain of thought via prompting alone doesn't capture most of the benefit. The RL training is what teaches the model to use long inference traces productively, and training curves on DeepSeek-R1-Zero show the model spontaneously learning to think longer as RL proceeds. A reasonable caveat: many published inference-scaling charts use log-scale x-axes, which can make brute-force token-spending look more impressive than it is.
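One simple form of test-time compute, sampling many candidate answers and taking a majority vote, can be simulated without a model. The "solver" below is a hypothetical stand-in that returns the right answer 60% of the time and a random wrong digit otherwise. The probabilities, trial counts, and names are assumptions chosen only to show the trend.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Pick the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def noisy_solver(rng, correct=42, p_correct=0.6):
    """Hypothetical stand-in for one sampled model answer."""
    return correct if rng.random() < p_correct else rng.randint(0, 9)

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of simulated queries where the voted answer is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = [noisy_solver(rng) for _ in range(n_samples)]
        hits += majority_vote(votes) == 42
    return hits / trials

for n in (1, 5, 15):
    print(n, accuracy(n))  # accuracy rises as more samples are drawn per query
```

The gain comes purely from spending more inference compute per question. The simulation assumes wrong answers scatter while right answers agree, which is also why voting helps less when a model's errors are systematic.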
What to take away
The 2017 transformer plus the 2020 scaling-law mindset got us from GPT-2 to GPT-4 by relentlessly scaling pretraining. That recipe is genuinely running out of room — not because the math fails, but because high-quality data is finite and the easy returns have been claimed. The interesting story of 2024–2026 is that the frontier didn't stall; it diversified. Reinforcement learning on verifiable tasks, mixture-of-experts sparsity, synthetic data pipelines, and test-time compute have all become first-class scaling levers, and combining them has produced reasoning models that solve problems most experts would have called out of reach a year before they shipped. The pipeline that builds them — pretraining, SFT, preference learning, reasoning RL, alignment, evaluation — is the same shape it was, but each stage has gotten larger, weirder, and more entangled with the others.
For follow-up depth, the highest-leverage threads to pull on are: the mechanics of self-attention and modern transformer variants (Jay Alammar's Illustrated Transformer, Karpathy's lectures); the RLHF-to-DPO-to-GRPO evolution (Nathan Lambert's open RLHF Book, Hugging Face explainers); Constitutional AI specifically (Anthropic's original paper plus their published constitution); the DeepSeek-R1 paper as the most accessible window into how reasoning RL actually works; and the scaling-wall debate through Sutskever's NeurIPS 2024 talk, Amodei's Machines of Loving Grace essay, and recent reporting on Orion/GPT-4.5/GPT-5. From here, any of these can be a rabbit hole — and most of them are still being actively rewritten as you read this.