Tokenization is the process of converting text into smaller units called tokens that a language model can process. Tokens are not the same as words. A single word might be one token, or it might be split into multiple subword tokens depending on the model’s vocabulary.
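The subword splitting described above can be sketched with a toy greedy longest-match tokenizer. This is an illustrative simplification, not a real model tokenizer: production systems use learned schemes such as BPE, and the vocabulary here is invented for the example.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character falls back to a single-character token
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: "tokenization" is not a single entry,
# so it splits into two subword tokens.
vocab = {"token", "ization", "the", "cat", " "}
print(tokenize("tokenization", vocab))  # ['token', 'ization']
```

Here one word becomes two tokens because the full word is absent from the vocabulary, while common fragments are present; real subword vocabularies behave the same way, just at a much larger scale.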
Understanding tokenization matters for developers because it determines how much code or text fits within a model’s context window. Code typically uses more tokens per line than natural language because variable names, syntax characters, and indentation all consume tokens.
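A rough way to see the density difference is to split text on words, punctuation, and whitespace, since each syntax character in code tends to become its own token. This regex split is only a proxy for real BPE tokenization, but the effect it shows, punctuation-heavy code producing more tokens per character than prose, holds for real tokenizers too.

```python
import re

def rough_tokens(text):
    """Crude token-count proxy: words, punctuation marks, and spaces."""
    return re.findall(r"\w+|[^\w\s]|\s", text)

prose = "the cat sat on the mat and purred softly today"
code = "result = [x**2 for x in items if x > 0]"

# The code line is shorter in characters but splits into more tokens,
# because =, [, *, >, and ] each count separately.
print(len(prose), len(rough_tokens(prose)))
print(len(code), len(rough_tokens(code)))
```

Swapping in an actual tokenizer library for `rough_tokens` gives the same qualitative result: syntax-dense lines cost disproportionately many tokens.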
Different models use different tokenization schemes. GPT-4 and Claude use different tokenizers, so the same text can yield different token counts depending on the model.
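To see why scheme choice changes token counts, compare two extreme schemes on the same string: word-level splitting versus character-level splitting. Real model tokenizers sit between these extremes with subword units, but the point stands that the count depends entirely on the scheme, not just on the text.

```python
text = "tokenization differs across models"

# Scheme A: word-level tokenization (split on whitespace)
word_tokens = text.split()

# Scheme B: character-level tokenization (one token per character)
char_tokens = list(text)

print(len(word_tokens))  # 4 tokens
print(len(char_tokens))  # 34 tokens
```

This is why token budgets and pricing estimates do not transfer directly between models: the same prompt must be re-counted with each model's own tokenizer.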
