Tokens, Tokenizers, and How Claude Counts Them

How subword tokenization algorithms work under the hood, why they shape context window economics and model accuracy, and what Claude's proprietary tokenizer changes mean for your API budget.

Tokens are the integer-encoded chunks of text that every large language model actually reads and bills you for: not words, not characters, but statistically learned subword pieces sitting between the two. A frontier model never "reads" the letter "h" in "hello"; it reads token ID 24912 (or whatever its tokenizer assigns), looks up a learned vector for that ID, and runs that vector through its transformer. This single design choice ripples through every aspect of LLMs: it determines your context window, your bill, how well the model does math, and why a sentence in Burmese can cost ten times what the same sentence costs in English. Anthropic, the maker of Claude, defines tokens in plain terms as "the smallest individual units of a language model" and tells developers that a Claude token is roughly 3.5 English characters, or about ¾ of a word. That is close to the OpenAI rule of thumb, but Claude uses its own proprietary tokenizer, which has changed across Claude generations. It shifted most recently in Claude Opus 4.7 (April 2026), whose tokenizer can use up to 35% more tokens for the same text than its predecessors.

What a token actually is

A token can be a whole common word ("the", "apple"), a subword fragment ("token" + "ization"), a single character, a piece of punctuation, a whitespace-prefixed word (" cat" is usually one token, distinct from "cat" at the start of a string), or even a single raw byte. The tokenizer maps each token string to a unique integer ID; that ID indexes into the model's embedding matrix, producing the dense vector the transformer manipulates. Vocabularies range from BERT's 30,522 entries through GPT-4's 100,256 (cl100k_base) to GPT-4o's ~200,000 (o200k_base) and LLaMA 3's 128,000. Larger vocabularies pack more meaning into each token (shorter sequences, cheaper inference) but require larger embedding matrices and more training data to learn well.

The full pipeline from text to model is: normalization → pre-tokenization → subword tokenization → ID lookup → embedding lookup. Pre-tokenization typically applies a regex that splits on category boundaries (letters, digits, punctuation) so that merges never cross them; then the learned subword algorithm chunks the result; then each chunk becomes an integer; then the model's first layer looks up a learned vector for that integer. The integer-to-vector step is where "tokens" become "meaning" from the model's point of view.
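The stages above can be sketched end to end with a toy example. Everything in it is an assumption made for illustration: the lowercase normalization, the regex, the six-entry vocabulary, and the two-dimensional embedding values are invented, and greedy longest-match stands in for a learned subword algorithm.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"token": 0, "ization": 1, " is": 2, " fun": 3, "!": 4, "<unk>": 5}
# Hypothetical 2-dimensional embedding table, one row per token ID.
EMBEDDINGS = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4], [0.9, 0.0], [0.0, 0.0]]

def normalize(text):
    # Toy normalization; many real tokenizers preserve case.
    return text.lower()

def pre_tokenize(text):
    # Split into space-prefixed letter runs, space-prefixed digit runs, or single symbols.
    return re.findall(r" ?[a-z]+| ?\d+|[^\sa-z\d]", text)

def subword_split(chunk):
    # Greedy longest-match against the vocabulary (a stand-in for learned merges).
    pieces, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in VOCAB:
                pieces.append(chunk[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")
            i += 1
    return pieces

def encode(text):
    ids = []
    for chunk in pre_tokenize(normalize(text)):
        ids.extend(VOCAB[piece] for piece in subword_split(chunk))
    return ids

ids = encode("Tokenization is fun!")
vectors = [EMBEDDINGS[i] for i in ids]  # the embedding lookup step
print(ids)  # [0, 1, 2, 3, 4]
```

Note that pre-tokenization keeps the leading space attached to " is" and " fun", which is why whitespace-prefixed words get their own vocabulary entries.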

The four tokenization algorithms you need to know

Modern LLMs use one of four families, and the differences shape model behavior more than most users realize.

Byte Pair Encoding (BPE) is the dominant approach. Originally a data-compression algorithm (Gage, 1994), it was adapted for subword tokenization by Sennrich, Haddow, and Birch in 2016. It starts from individual characters and iteratively merges the most frequent adjacent pair into a new token, recording the merge rule, until the target vocabulary size is reached. Encoding new text simply replays those merge rules in order. Byte-level BPE, introduced with GPT-2 in 2019, starts from the 256 possible byte values rather than Unicode characters, which means any UTF-8 text can be tokenized with no <UNK> token ever needed. GPT-2/3/4/4o, RoBERTa, BART, LLaMA 3, Mistral, Gemma, and Qwen all use BPE variants. The trade-off: greedy frequency-based merges are simple and fast but sensitive to training-corpus quirks (see "glitch tokens" below).
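The merge loop is short enough to sketch in full. This is a minimal character-level version on a made-up word list, not any production tokenizer; the function names are illustrative, and ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter

def apply_merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    corpus = {tuple(word): count for word, count in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = {apply_merge(symbols, best): count for symbols, count in corpus.items()}
    return merges

def encode(word, merges):
    symbols = tuple(word)
    for pair in merges:  # replay the merge rules in the order they were learned
        symbols = apply_merge(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

"lowest" never appears in the training list, yet it encodes into two learned pieces, which is the property that lets BPE handle novel words.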

WordPiece, developed at Google and made famous by BERT, uses the same merge loop but picks the pair maximizing a PMI-like likelihood score — freq(AB) / (freq(A) × freq(B)) — favoring pairs that co-occur more than chance rather than just frequent pairs. Continuation pieces get a ## prefix ("playing" → ["play", "##ing"]). WordPiece encodes greedily at inference and falls back to [UNK] for unknown pieces.
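Both halves of WordPiece, the training-time score and the inference-time greedy match, fit in a few lines. The sketch below uses invented frequencies and a five-entry vocabulary; mapping the whole word to [UNK] when any piece is missing follows BERT-style behavior as an assumption, and the function names are illustrative.

```python
def wordpiece_score(pair_freq, freq_a, freq_b):
    # freq(AB) / (freq(A) * freq(B)): rewards pairs that co-occur more than chance.
    return pair_freq / (freq_a * freq_b)

def wordpiece_encode(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # assumed BERT-style fallback: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece_encode("playing", vocab))  # ['play', '##ing']
print(wordpiece_encode("playz", vocab))    # ['[UNK]']

# A rarer pair whose parts are also rare can outscore a frequent pair of very common parts.
print(wordpiece_score(20, 100, 40) > wordpiece_score(50, 1000, 500))  # True
```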

Unigram tokenization (Kudo, 2018) inverts the approach. It starts with a large candidate vocabulary and iteratively prunes tokens whose removal hurts a unigram language-model likelihood least. A unigram model can represent any word in multiple ways — "hugs" can become ["hug","s"] or ["h","ug","s"] — and picks the highest-probability segmentation, optionally sampling alternatives during training for regularization. T5, ALBERT, and XLNet use it.
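Choosing the highest-probability segmentation is a small dynamic program (Viterbi search). The sketch below assumes a hand-written probability table; the numbers are invented rather than taken from a trained model, and `best_segmentation` is an illustrative name.

```python
import math

def best_segmentation(word, probs):
    # best[i] holds (log-probability, pieces) for the best segmentation of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[len(word)][1]  # None if the word cannot be segmented at all

# Hypothetical unigram probabilities.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1, "ug": 0.1, "hug": 0.15}
print(best_segmentation("hugs", probs))  # ['hug', 's']
```

Here ["hug", "s"] wins with probability 0.15 × 0.1 = 0.015, against 0.05 × 0.1 × 0.1 = 0.0005 for ["h", "ug", "s"].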

SentencePiece is a library, not an algorithm: it wraps BPE or Unigram but treats the input as a raw stream of Unicode including spaces (typically encoded as ▁, U+2581). This makes it language-agnostic (critical for Chinese, Japanese, and Thai, which lack whitespace word boundaries) and lossless to decode. LLaMA 1/2, T5, Mistral, and Gemma all use SentencePiece.
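The space-marker convention is what makes decoding lossless, and it can be imitated without the library. The sketch below is a simplification under stated assumptions: it always prepends a marker to the text, ignores normalization, and assumes the input does not itself contain U+2581. The hand-picked pieces stand in for real model output.

```python
SPACE_MARK = "\u2581"  # the "▁" character

def to_marked_stream(text):
    # Prepend a marker and turn every space into the marker, so spaces are ordinary symbols.
    return SPACE_MARK + text.replace(" ", SPACE_MARK)

def from_pieces(pieces):
    # Decoding is plain concatenation, then markers back to spaces, minus the prefix space.
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text

# Hypothetical segmentation of the marked stream.
pieces = [SPACE_MARK + "Hello", SPACE_MARK + "wor", "ld"]
print("".join(pieces) == to_marked_stream("Hello world"))  # True
print(from_pieces(pieces))                                 # Hello world
```

Because word boundaries live inside the pieces themselves, no language-specific rule is needed to decide where spaces go when decoding.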

Algorithm | Selection rule | Marker convention | Representative models
BPE / byte-level BPE | Most frequent adjacent pair | Ġ for leading space | GPT-2/3/4/4o, LLaMA 3, RoBERTa
WordPiece | Pointwise mutual information | ## for continuation | BERT, DistilBERT, ELECTRA
Unigram | Drop tokens that hurt LM likelihood least | (via SentencePiece) | T5, XLNet, ALBERT
SentencePiece (wrapper) | BPE or Unigram on raw bytes | ▁ for space | LLaMA 1/2, T5, Mistral, Gemma

Why subwords, and the ratios you should remember

Word-level tokenization explodes vocabulary size and chokes on novel words; character-level tokenization produces sequences so long that the model spends its attention budget on morphology rather than meaning. Subword tokenization is the compromise: common words stay as one token, rare words decompose into recognizable pieces ("annoyingly" → ["annoy", "ing", "ly"]), and byte-level variants guarantee no out-of-vocabulary failures.

For English, OpenAI's widely cited rule of thumb is 1 token ≈ 4 characters ≈ ¾ of a word, so 100 tokens ≈ 75 words and a typical paragraph is ~100 tokens. Anthropic's number is essentially the same: a Claude token ≈ 3.5 English characters, with ~750 words ≈ 1,000 tokens, 200K tokens ≈ ~150,000 words (~500 pages), and 1M tokens ≈ ~750,000 words (~10 novels). These ratios collapse for other languages: "Cómo estás" is 10 characters but 5 GPT tokens, and a 2025 study ("The Token Tax", arXiv:2509.05486) showed that across 16 African languages, tokens-per-word ("fertility") reliably predicts model accuracy — the more your language fragments, the worse the model performs.
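These ratios translate into quick estimators. The sketch below hard-codes the 3.5 characters-per-token and 750 words-per-1,000-tokens figures quoted above; the function names are illustrative, rounding up is an arbitrary choice, and the results are only meaningful for English text.

```python
import math

def estimate_tokens_from_chars(text, chars_per_token=3.5):
    # English-only rule of thumb; other languages can fragment far more.
    return math.ceil(len(text) / chars_per_token)

def estimate_tokens_from_words(word_count, words_per_1000_tokens=750):
    return math.ceil(word_count * 1000 / words_per_1000_tokens)

print(estimate_tokens_from_chars("a" * 35))  # 10
print(estimate_tokens_from_words(750))       # 1000
print(estimate_tokens_from_words(150_000))   # 200000
```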

Why tokens are everything in an LLM economy

Context windows are measured in tokens, not words. GPT-4o tops out at 128K; Claude Opus/Sonnet 4.6 and 4.7 now reach 1,000,000 tokens at standard pricing; Gemini 1.5 Pro reaches 1–2M. Crucially, input and output share that budget: input_tokens + output_tokens ≤ context_window. Newer Claude models (since Sonnet 3.7) return a validation error instead of silently truncating when you exceed it.
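A client-side guard for the shared budget might look like the following sketch. The function names are hypothetical, and the check mirrors the inequality above rather than any SDK behavior.

```python
def remaining_output_budget(input_tokens, context_window):
    # Whatever the input does not use is the most the model could generate.
    return max(0, context_window - input_tokens)

def check_request(input_tokens, max_output_tokens, context_window):
    # Fail early, before the API returns a validation error.
    if input_tokens + max_output_tokens > context_window:
        raise ValueError(
            f"{input_tokens} input + {max_output_tokens} output tokens "
            f"exceed the {context_window}-token context window"
        )

print(remaining_output_budget(180_000, 200_000))  # 20000
check_request(180_000, 16_000, 200_000)           # fits, so nothing is raised
```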

Pricing is per-token, and output tokens cost roughly 5× input across every current Claude model. As of May 2026, Anthropic's headline rates are $5/$25 per million tokens for Opus 4.7 and 4.6, $3/$15 for Sonnet 4.6, and $1/$5 for Haiku 4.5, with the legacy Opus 3/4.1 still priced at the older $15/$75. The Batch API halves both rates; prompt caching reads cost 10% of base input (90% discount) while cache writes cost 1.25× (5-minute TTL) or 2× (1-hour TTL). This makes tokenizer efficiency a real cost factor — a tokenizer that fragments your text 30% more costs you 30% more.
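The arithmetic is simple enough to keep in a helper. The sketch below uses the headline rates quoted above; the dictionary keys are informal labels rather than API model IDs, the function names are illustrative, and the inflation helper assumes every token count grows by the same fraction.

```python
# (input, output) USD per million tokens, as quoted in the text above.
RATES_PER_MTOK = {
    "opus-4.7": (5.0, 25.0),
    "sonnet-4.6": (3.0, 15.0),
    "haiku-4.5": (1.0, 5.0),
}

def request_cost(model, input_tokens, output_tokens, batch=False):
    in_rate, out_rate = RATES_PER_MTOK[model]
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return cost / 2 if batch else cost  # the Batch API halves both rates

def with_token_inflation(cost, inflation=0.30):
    # If a tokenizer emits 30% more tokens for the same text, cost rises 30%.
    return cost * (1 + inflation)

base = request_cost("sonnet-4.6", input_tokens=10_000, output_tokens=1_000)
print(round(base, 4))                        # 0.045
print(round(with_token_inflation(base), 4))  # 0.0585
```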

Computational cost scales quadratically. Standard transformer self-attention is O(n²) in token count, so doubling the tokens roughly quadruples attention compute and, in a naive implementation that materializes the full score matrix, attention memory as well. This is why context windows above 1M require architectural tricks (sparse attention, sliding windows, ring attention) and why efficient tokenization translates directly into faster, cheaper inference.
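A two-line calculation makes the scaling concrete. It models only the quadratic attention term and ignores the parts of a transformer that scale linearly, so treat it as an upper-bound intuition rather than a cost model; the function name is illustrative.

```python
def relative_attention_cost(token_ratio):
    # Every token attends to every other token, so cost grows with the square of the length.
    return token_ratio ** 2

print(relative_attention_cost(2.0))            # 4.0
print(round(relative_attention_cost(1.3), 2))  # 1.69
```

A tokenizer that fragments text 30% more therefore raises the attention term by about 69%, not 30%.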

Tokenization breaks math. Models like LLaMA 1/2 and PaLM tokenize digits one at a time, while the GPT-2 and GPT-3 vocabularies merged frequently seen digit strings into multi-digit tokens, so a number such as "87439" could be split at positions that have nothing to do with place value. The paper "Tokenization counts" (arXiv:2402.14903) showed that left-to-right vs right-to-left digit chunking changes arithmetic accuracy substantially. The cl100k_base tokenizer used by GPT-3.5 and GPT-4 moved to more consistent chunking, grouping digits into runs of at most three from left to right. Numerical reasoning failures often trace back to the tokenizer, not the architecture.
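The difference between the two chunking directions is easy to see on a concrete number. The sketch below splits a digit string into groups of three in each direction; the function names are illustrative and no real tokenizer is involved.

```python
def chunk_left_to_right(digits, size=3):
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_right_to_left(digits, size=3):
    # Put the short group first so the remaining groups align with thousands separators.
    first = len(digits) % size
    head = [digits[:first]] if first else []
    return head + [digits[i:i + size] for i in range(first, len(digits), size)]

print(chunk_left_to_right("87439"))    # ['874', '39']
print(chunk_right_to_left("87439"))    # ['87', '439']
print(chunk_left_to_right("1234567"))  # ['123', '456', '7']
print(chunk_right_to_left("1234567"))  # ['1', '234', '567']
```

Right-to-left chunks line up with place value (ones, thousands, millions), while left-to-right chunks shift whenever the number of digits changes.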

Special tokens and the glitch-token problem

Beyond ordinary content tokens, every tokenizer reserves special tokens for structural roles: [BOS]/<s> for beginning-of-sequence, [EOS]/<|endoftext|> for end-of-sequence (emitting it is how the model signals that generation should stop), [PAD] for batch padding, [CLS] and [SEP] for BERT-style classification, [MASK] for masked-language-modeling, and chat-format role delimiters like <|im_start|>/<|im_end|> that mark the boundaries between system, user, and assistant turns. Claude's legacy tokenizer likewise reserved special entries such as META_START/META_END.

The most famous tokenization pathology is "SolidGoldMagikarp", discovered in early 2023 by Rumbelow and Watkins. GPT-3 davinci, when asked to repeat the string " SolidGoldMagikarp", would output "distribute"; " TheNitromeFan" produced "182"; " petertodd" produced insults or hallucinations. The cause: GPT's BPE tokenizer was trained on Reddit data (notably r/counting and a Twitch-Pokémon bot) where those specific strings appeared so often they earned dedicated tokens — but the model's training data was filtered to exclude that content, leaving the corresponding embedding vectors essentially untrained, random initializations clustered near the centroid of embedding space. The model literally cannot see these tokens. Similar glitch tokens have since been found in GPT-4 ("ByPrimaryKey"), LLaMA-2 ("davidjl"), GPT-4o (Chinese gambling-spam tokens like "给主人留下些什么吧"), and Vicuna ("réalis"). The phenomenon reveals an underappreciated fact: the tokenizer and the model are trained separately, on potentially different data, and that mismatch creates ghosts in the vocabulary.

How Anthropic defines tokens for Claude

Anthropic's official glossary at docs.claude.com/en/docs/about-claude/glossary (with docs.anthropic.com URLs mirroring the same content) defines a token verbatim as:

"Tokens are the smallest individual units of a language model, and can correspond to words, subwords, characters, or even bytes (in the case of Unicode). For Claude, a token approximately represents 3.5 English characters, though the exact number can vary depending on the language used. Tokens are typically hidden when interacting with language models at the 'text' level but become relevant when examining the exact inputs and outputs of a language model."

That 3.5-character estimate is the central number to remember — it's slightly more granular than OpenAI's 4-character figure, implying Claude's tokenizer fragments English text marginally more aggressively. Anthropic also publishes ~750 words ≈ 1,000 tokens in its pricing/billing materials, consistent with the character figure.

Anthropic does not publish a current Claude tokenizer. The official anthropic-tokenizer-typescript GitHub repo carries an explicit warning: "This package can be used to count tokens for Anthropic's older models. As of the Claude 3 models, this algorithm is no longer accurate, but can be used as a very rough approximation. We suggest that you rely on usage in the response body wherever possible." The legacy tokenizer (Claude 1/2/Instant era) was a BPE with ~65K vocabulary plus a handful of special tokens (EOT, SOS, META, META_START, META_END), reportedly ~70% overlapping with GPT-4's cl100k_base. Claude 3, 3.5, 4.x, and Opus 4.7 each use proprietary, undisclosed tokenizers — and Anthropic's pricing documentation explicitly warns about Opus 4.7: "Opus 4.7 uses a new tokenizer compared to previous models, contributing to its improved performance on a wide range of tasks. This new tokenizer may use up to 35% more tokens for the same fixed text." The per-token price is unchanged, but effective cost-per-request can rise materially.

The count_tokens endpoint is the source of truth

Because no public tokenizer is accurate for current Claude models, Anthropic provides a dedicated POST /v1/messages/count_tokens endpoint (originally a beta with the token-counting-2024-11-01 header in late 2024, now GA). It accepts the exact same structured inputs as the Messages API — model, system, messages, tools, thinking config, and content blocks including text, images, and PDFs — and returns a single field: {"input_tokens": <number>}.

The endpoint is free to use with its own independent rate limit ladder (100/2,000/4,000/8,000 RPM across usage tiers 1–4) that does not consume your Messages API quota. SDK methods are first-class: client.messages.count_tokens(...) in Python, client.messages.countTokens(...) in TypeScript, with equivalents in the Go, Java, Ruby, PHP, and C# SDKs, plus plain cURL against the REST endpoint. A simple Python call counting "Hello, Claude" with a system prompt returns {"input_tokens": 14}; adding a get_weather tool definition and asking "What's the weather in San Francisco?" returns 403 tokens, illustrating how tool schemas can dominate small messages. Anthropic explicitly notes the count is an estimate: "the actual number of input tokens used when creating a message may differ by a small amount", and that "token counts may include tokens added automatically by Anthropic for system optimizations. You are not billed for system-added tokens."
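A thin wrapper around the Python SDK method keeps the call in one place. In the sketch below, client.messages.count_tokens and the input_tokens field come from the description above; the wrapper names, the "YOUR_MODEL_ID" placeholder, and the system prompt text are assumptions for illustration. The demo function is defined but not called, since running it needs the anthropic package and an API key.

```python
def count_input_tokens(client, model, user_text, system=None):
    """Return the estimated input token count for a single user message."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    if system is not None:
        kwargs["system"] = system
    response = client.messages.count_tokens(**kwargs)
    return response.input_tokens

def demo():
    # Requires `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
    import anthropic

    client = anthropic.Anthropic()
    # "YOUR_MODEL_ID" is a placeholder; substitute a model ID from Anthropic's docs.
    print(count_input_tokens(client, "YOUR_MODEL_ID", "Hello, Claude",
                             system="You are a scientist"))
```

Passing the client in as a parameter makes the helper easy to exercise with a stub in unit tests, so budgeting logic can be checked without network access.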

How Claude counts the things that aren't text

Images follow a simple formula: tokens ≈ (width × height) / 750 in pixels. On models before Opus 4.7, the long edge is capped at 1,568 pixels (~1.15 megapixels) for a maximum of ~1,560 tokens per image; Opus 4.7 raised this to 2,576 pixels (~3.75 MP), which can roughly triple per-image token cost at maximum resolution. A single request can include up to 600 images on 1M-context models or 100 images on 200K-context models. PDFs flow through the same pipeline — Anthropic's docs cite a Tesla 10-Q SEC filing of ~51 mixed-content pages tokenizing to ~118,967 input tokens, useful for back-of-envelope sizing.
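The formula lends itself to a one-line estimator. The sketch below applies only the width × height / 750 rule; it does not model the downscaling applied to images that exceed a model's size limits, and rounding up is an assumption. The function name is illustrative.

```python
import math

def estimate_image_tokens(width_px, height_px):
    # Approximation from the formula above; oversized images are resized first (not modeled).
    return math.ceil(width_px * height_px / 750)

print(estimate_image_tokens(200, 200))    # 54
print(estimate_image_tokens(1000, 1000))  # 1334
```

For exact numbers, pass the image to the count_tokens endpoint rather than relying on this estimate.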

Tool definitions count as input tokens, often substantially: the count_tokens examples show even minimal tool schemas adding hundreds of tokens to the prompt. Extended thinking (Claude's visible reasoning mode) is billed as output tokens at standard output rates with a 1,024-token minimum budget. Thinking blocks from previous assistant turns are discarded and are not billed again as input; the exception is thinking from the current turn (for example, during a tool-use loop), which counts toward input on subsequent calls. Starting with Opus 4.7, thinking content is omitted from responses by default unless callers opt in. System prompts count fully as input tokens and sit between tools and messages in the prompt-caching prefix order (tools → system → messages).

Context windows and the four input-token categories

As of May 2026, Anthropic's flagship models — Opus 4.7, Opus 4.6, Sonnet 4.6, and the gated "Mythos Preview" — offer 1,000,000-token context windows at standard pricing with no surcharge, a meaningful change from the prior tier where Sonnet 4.5's 1M-beta context triggered 2× input / 1.5× output pricing above 200K. Haiku 4.5, Opus 4.5, and the entire Claude 3.x family remain at 200K. Max output per request is typically 64K–128K synchronously, expandable to 300K via the Batch API with the output-300k-2026-03-24 header.

Every Messages API response surfaces a usage block exposing four token categories: input_tokens (full price), cache_creation_input_tokens (1.25× or 2× depending on TTL), cache_read_input_tokens (0.1×, a 90% discount), and output_tokens. Combined with the Batch API's 50% discount, a cache-hit request submitted via batch costs as little as 5% of full input price — multipliers stack, making prompt caching the single most important cost lever for repetitive workloads. The breakeven is one read for a 5-minute cache or two reads for a 1-hour cache.
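The breakeven claim can be checked with a few lines of arithmetic. The sketch below works in units of "one full-price pass over the cached prefix" and uses the multipliers quoted above; the function names are illustrative, and it assumes every read lands within the cache TTL.

```python
READ_MULT = 0.10
WRITE_MULT = {"5m": 1.25, "1h": 2.0}
BATCH_MULT = 0.5

def cached_cost(reads, ttl):
    # One cache write followed by `reads` cache hits.
    return WRITE_MULT[ttl] + reads * READ_MULT

def uncached_cost(reads):
    # The same traffic with no caching: every request pays full price.
    return 1.0 + reads

def breakeven_reads(ttl):
    # Smallest number of cache hits at which caching is strictly cheaper.
    reads = 0
    while cached_cost(reads, ttl) >= uncached_cost(reads):
        reads += 1
    return reads

print(breakeven_reads("5m"))             # 1
print(breakeven_reads("1h"))             # 2
print(round(READ_MULT * BATCH_MULT, 2))  # 0.05
```

With a 1-hour cache, a single read costs 2.0 + 0.1 = 2.1 against 2.0 uncached, so caching only pays off from the second read onward.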

What this means in practice

Three takeaways structure how you should reason about tokens going forward. First, tokens are physical units of model cognition, not abstractions: they determine the model's view of place value in numbers, of word morphology, of multilingual fairness, and even of which strings the model literally cannot see (the glitch-token problem). Second, the "1 token ≈ ¾ word" rule is an English-only convenience that misleads anyone working in non-Latin scripts or morphologically rich languages, where token tax can multiply costs 3–15× for identical semantic content. Third, with Claude specifically, never rely on local tokenizer estimates for Claude 3 and later: Anthropic explicitly says its public tokenizer is no longer accurate for these models, and Opus 4.7's new tokenizer can inflate counts by up to 35% versus its predecessor at unchanged per-token rates. The free count_tokens endpoint, with its independent rate limits and full support for tools, images, and PDFs, is the only reliable way to budget tokens before sending a request, and the usage block in every response is the only reliable way to reconcile costs after.

Tokenization looks like an implementation detail. It is, in fact, the substrate on which everything else — context, cost, language fairness, arithmetic accuracy, even safety glitches — quietly depends.