Tokenization is the process of converting text into smaller units called tokens that a language model can process. Tokens are not the same as words. A single word might be one token, or it might be split into multiple subword tokens depending on the model’s vocabulary.
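The subword splitting described above can be sketched with a toy greedy longest-match tokenizer. This is an illustrative simplification, not a real model tokenizer: production systems use learned schemes such as BPE, and the vocabulary here is invented for the example.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character falls back to a single-character token
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: "tokenization" is not a single entry,
# so it splits into two subword tokens.
vocab = {"token", "ization", "the", "cat", " "}
print(tokenize("tokenization", vocab))  # ['token', 'ization']
```

Here one word becomes two tokens because the full word is absent from the vocabulary, while common fragments are present; real subword vocabularies behave the same way, just at a much larger scale.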
Understanding tokenization matters for developers because it determines how much code or text fits within a model’s context window. Code typically uses more tokens per line than natural language because variable names, syntax characters, and indentation all consume tokens.
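A rough way to see the density difference is to split text on words, punctuation, and whitespace, since each syntax character in code tends to become its own token. This regex split is only a proxy for real BPE tokenization, but the effect it shows, punctuation-heavy code producing more tokens per character than prose, holds for real tokenizers too.

```python
import re

def rough_tokens(text):
    """Crude token-count proxy: words, punctuation marks, and spaces."""
    return re.findall(r"\w+|[^\w\s]|\s", text)

prose = "the cat sat on the mat and purred softly today"
code = "result = [x**2 for x in items if x > 0]"

# The code line is shorter in characters but splits into more tokens,
# because =, [, *, >, and ] each count separately.
print(len(prose), len(rough_tokens(prose)))
print(len(code), len(rough_tokens(code)))
```

Swapping in an actual tokenizer library for `rough_tokens` gives the same qualitative result: syntax-dense lines cost disproportionately many tokens.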
Different models use different tokenization schemes. GPT-4 and Claude use different tokenizers, so the same text can yield different token counts depending on the model.
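To see why scheme choice changes token counts, compare two extreme schemes on the same string: word-level splitting versus character-level splitting. Real model tokenizers sit between these extremes with subword units, but the point stands that the count depends entirely on the scheme, not just on the text.

```python
text = "tokenization differs across models"

# Scheme A: word-level tokenization (split on whitespace)
word_tokens = text.split()

# Scheme B: character-level tokenization (one token per character)
char_tokens = list(text)

print(len(word_tokens))  # 4 tokens
print(len(char_tokens))  # 34 tokens
```

This is why token budgets and pricing estimates do not transfer directly between models: the same prompt must be re-counted with each model's own tokenizer.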
