Max Tokens
Max tokens controls the maximum number of tokens the model is allowed to generate in its response. It puts a hard limit on the length of the output, helping you manage response size, latency, and cost.
How it works
Tokens are the basic units of text the model processes; a token is typically a short chunk of a word, around four characters of English text. When you set max_tokens, you cap how many tokens the model may generate after your input. If the response hits this limit, generation stops immediately, even mid-sentence or mid-thought. Most APIs flag this in the response metadata (for example, a finish or stop reason such as "length" or "max_tokens"), so truncation can be detected programmatically.
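Here is a minimal sketch of setting the cap and detecting truncation, assuming an OpenAI-style Python client; the model name and prompt are placeholders, so adapt them to your provider.

# Cap output length with max_tokens and detect truncation.
# Assumes the official OpenAI Python client (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain tokenization in one paragraph."}],
    max_tokens=50,  # hard cap: generation stops after 50 output tokens
)

choice = response.choices[0]
print(choice.message.content)
# finish_reason is "length" if the cap cut the response off early,
# "stop" if the model finished on its own.
print("truncated:", choice.finish_reason == "length")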
When to use max_tokens
- When you want to control output length
- When you're working with a limited context window and need to save space
- When building summarizers, previews, or short-form content
- When you want to limit costs or latency in high-volume applications
How to use max_tokens
- Estimate your desired output length. E.g., 100 tokens ≈ 75 words
- Set max_tokens in your API call. E.g.,
max_tokens: 300
- Balance it with your prompt length. Total tokens (prompt + completion) must fit within the model's context window; the sketch after this list shows one way to check the budget
- Test and adjust. If responses come back too short or get cut off, increase the limit
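Putting those steps together, here is a sketch of a simple token budget check. It applies the 100 tokens ≈ 75 words heuristic to the target length and counts prompt tokens with the tiktoken library; the 128,000-token context window is a placeholder for whatever your model actually supports.

# Budget max_tokens against the context window.
# Assumes tiktoken is installed (pip install tiktoken).
import tiktoken

CONTEXT_WINDOW = 128_000  # placeholder; check your model's documented limit

def words_to_tokens(word_count: int) -> int:
    # Heuristic from above: 100 tokens ≈ 75 English words.
    return round(word_count / 0.75)

def budget_check(prompt: str, target_words: int) -> int:
    encoding = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models
    prompt_tokens = len(encoding.encode(prompt))
    max_tokens = words_to_tokens(target_words)
    if prompt_tokens + max_tokens > CONTEXT_WINDOW:
        raise ValueError(
            f"{prompt_tokens} prompt + {max_tokens} completion tokens "
            f"won't fit in the {CONTEXT_WINDOW}-token context window"
        )
    return max_tokens

# A 225-word summary needs roughly 300 tokens of headroom.
print(budget_check("Summarize the attached report...", target_words=225))  # -> 300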
Tips
- Use higher values for creative writing or long-form generation
- Use lower values for brief responses, like titles or summaries
- Always account for both prompt and completion length when working near the model's max context size
- If a response gets cut off mid-sentence, the max_tokens limit was likely too low; try increasing it, as in the sketch below
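The last tip can be automated. Below is a sketch that retries with a doubled cap whenever a response comes back truncated; it uses the same OpenAI-style client as above, and the complete() helper, the starting cap, and the doubling strategy are illustrative choices rather than a prescribed pattern.

# Retry with a larger max_tokens when a response is cut off.
# Assumes the OpenAI Python client; the doubling factor is arbitrary.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str, max_tokens: int = 150, retries: int = 2) -> str:
    for _ in range(retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        choice = response.choices[0]
        if choice.finish_reason != "length":  # "length" means the cap was hit
            return choice.message.content
        max_tokens *= 2  # cut off mid-sentence: double the cap and try again
    return choice.message.content  # still truncated after retries; return what we have

print(complete("Write a product description for a mechanical keyboard."))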