Glossary Term

Tokenization

By The Codegen Team · Updated March 26, 2026

The process of converting text into smaller units called tokens that a language model can process and reason over.

Tokenization is the process of converting text into smaller units called tokens that a language model can process. Tokens are not the same as words. A single word might be one token, or it might be split into multiple subword tokens depending on the model’s vocabulary.
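A toy greedy longest-match splitter makes the word-versus-token distinction concrete. This is a simplified sketch, not any real model's tokenizer: production systems use learned vocabularies of tens of thousands of entries (e.g. byte-pair encoding), and the tiny vocabulary below is invented purely for illustration.

```python
# Hypothetical subword vocabulary -- real models learn vocabularies of
# roughly 50k-200k entries from training data.
VOCAB = {"un", "believ", "able", "token", "ization", "cat", "s"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:                               # no match: fall back to one character
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("cat"))           # ['cat'] -- a common word is one token
print(tokenize("unbelievable"))  # ['un', 'believ', 'able'] -- split into subwords
```

Running it shows both cases from the paragraph above: a common word maps to a single token, while a rarer word is assembled from several subword pieces.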

Understanding tokenization matters for developers because it determines how much code or text fits within a model’s context window. Code typically uses more tokens per line than natural language because variable names, syntax characters, and indentation all consume tokens.

Different models use different tokenization schemes. GPT-4 uses a different tokenizer than Claude, which means the same text may have different token counts across models.
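The same greedy-matching idea also shows why counts diverge across models: feed one word to two different vocabularies and you get different splits. Both vocabularies here are invented stand-ins for two models' tokenizers, not the actual GPT-4 or Claude vocabularies.

```python
def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split of `word` against `vocab`."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

# Two hypothetical vocabularies standing in for two different models.
vocab_a = {"transformer", "s"}
vocab_b = {"trans", "form", "ers"}

word = "transformers"
print(tokenize(word, vocab_a))  # ['transformer', 's'] -> 2 tokens
print(tokenize(word, vocab_b))  # ['trans', 'form', 'ers'] -> 3 tokens
```

The practical consequence: a prompt that fits one model's context window, or costs a certain amount on one API, can count differently on another.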

In plain English

How AI splits text into small chunks to process it. A token is typically a bit smaller than a word — roughly four characters of English text — and the token count determines how much you can fit into a single request and what it costs.

Why it matters

Every AI API call is priced per token and limited by the context window, which is also measured in tokens. Understanding this helps you estimate costs, structure prompts efficiently, and explain why an AI sometimes seems to "forget" earlier parts of a long conversation — it hit the token limit and earlier content fell out of the window.
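A back-of-envelope estimator captures how pricing and window limits interact. The four-characters-per-token heuristic, the price, and the window size below are all illustrative assumptions, not any provider's actual numbers — always check your provider's pricing page and model specs.

```python
CHARS_PER_TOKEN = 4          # rough average for English prose (assumption)
PRICE_PER_1K_INPUT = 0.003   # hypothetical $ per 1k input tokens
CONTEXT_WINDOW = 128_000     # hypothetical model limit, in tokens

def estimate(text: str) -> dict:
    """Rough token count, input cost, and window check for a prompt."""
    tokens = len(text) / CHARS_PER_TOKEN
    return {
        "tokens": round(tokens),
        "cost_usd": tokens / 1000 * PRICE_PER_1K_INPUT,
        "fits_in_window": tokens <= CONTEXT_WINDOW,
    }

prompt = "Summarize the attached design doc. " * 100
print(estimate(prompt))  # prints the rough estimate for this prompt
```

For real billing you would count tokens with the provider's own tokenizer rather than a character heuristic, but an estimate like this is usually close enough to catch a prompt that is about to blow past the window.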

In practice

A developer tries to send a 12,000-line codebase to an AI in one request. It costs more than expected and the agent seems to ignore files from the beginning. The problem: the codebase exceeded the context window, so early files were truncated. The fix is selective inclusion — sending only the files relevant to the task — which reduces token consumption by 80% and produces better results.
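The selective-inclusion fix can be sketched as a relevance filter: score each file by overlap with the task description and send only the best matches. The file contents and the keyword-overlap heuristic below are invented for illustration; real agents use much smarter retrieval (embeddings, symbol graphs, dependency analysis).

```python
import re

def estimate_tokens(text: str) -> int:
    """Rough token count using a 4-chars-per-token heuristic (assumption)."""
    return max(1, len(text) // 4)

def select_files(files: dict[str, str], task: str, top_n: int = 1) -> list[str]:
    """Rank files by word overlap with the task; keep the top_n most relevant."""
    task_words = set(re.findall(r"\w+", task.lower()))
    def score(name: str) -> int:
        return len(task_words & set(re.findall(r"\w+", files[name].lower())))
    return sorted(files, key=score, reverse=True)[:top_n]

# Hypothetical mini-codebase.
files = {
    "auth.py":    "def login(user, password): check password hash ...",
    "billing.py": "def charge(card, amount): invoice the customer ...",
    "utils.py":   "def slugify(text): ...",
}
task = "fix the login password check"

chosen = select_files(files, task)
full = sum(estimate_tokens(body) for body in files.values())
selected = sum(estimate_tokens(files[name]) for name in chosen)
print(chosen, f"-> {selected} of {full} estimated tokens sent")
```

Even this crude filter sends a fraction of the full codebase's tokens, which is the same trade the scenario above describes: fewer tokens, lower cost, and less irrelevant context for the model to wade through.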

How Codegen uses Tokenization

Codegen manages token consumption automatically by retrieving only the context relevant to each specific task rather than sending an entire codebase. This keeps costs predictable and ensures the agent is working with the right information rather than a noisy dump of everything. The per-task cost tracking in Codegen's analytics is partly a function of this — you can see exactly how much each task consumed and tune your task descriptions to be more targeted.

Frequently Asked Questions