Max Tokens
Max tokens controls the maximum number of tokens the model is allowed to generate in its response. It puts a hard limit on the length of the output, helping you manage response size, latency, and cost.
How it works
Tokens are the basic units of text the model processes; a token is typically a short chunk of a word, around four characters of English text. When you set max_tokens, you cap how many tokens the model may generate after your input. If the response hits this limit, generation stops immediately, even mid-sentence or mid-thought. Most APIs flag this in the response metadata (for example, a finish or stop reason such as "length" or "max_tokens"), so truncation can be detected programmatically.
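Here is a minimal sketch of setting the cap and detecting truncation, assuming an OpenAI-style Python client; the model name and prompt are placeholders, so adapt them to your provider.

# Cap output length with max_tokens and detect truncation.
# Assumes the official OpenAI Python client (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain tokenization in one paragraph."}],
    max_tokens=50,  # hard cap: generation stops after 50 output tokens
)

choice = response.choices[0]
print(choice.message.content)
# finish_reason is "length" if the cap cut the response off early,
# "stop" if the model finished on its own.
print("truncated:", choice.finish_reason == "length")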
When to use max_tokens
- When you want to control output length
- When you're working with a limited context window and need to save space
- When building summarizers, previews, or short-form content
- When you want to limit costs or latency in high-volume applications
How to use max_tokens
- Estimate your desired output length. E.g., 100 tokens ≈ 75 words
- Set max_tokens in your API call. E.g.,
max_tokens: 300
- Balance it with your prompt length. Total tokens (prompt + completion) must fit within the model's context window; the sketch after this list shows one way to check the budget
- Test and adjust. If responses come back too short or get cut off, increase the limit
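Putting those steps together, here is a sketch of a simple token budget check. It applies the 100 tokens ≈ 75 words heuristic to the target length and counts prompt tokens with the tiktoken library; the 128,000-token context window is a placeholder for whatever your model actually supports.

# Budget max_tokens against the context window.
# Assumes tiktoken is installed (pip install tiktoken).
import tiktoken

CONTEXT_WINDOW = 128_000  # placeholder; check your model's documented limit

def words_to_tokens(word_count: int) -> int:
    # Heuristic from above: 100 tokens ≈ 75 English words.
    return round(word_count / 0.75)

def budget_check(prompt: str, target_words: int) -> int:
    encoding = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models
    prompt_tokens = len(encoding.encode(prompt))
    max_tokens = words_to_tokens(target_words)
    if prompt_tokens + max_tokens > CONTEXT_WINDOW:
        raise ValueError(
            f"{prompt_tokens} prompt + {max_tokens} completion tokens "
            f"won't fit in the {CONTEXT_WINDOW}-token context window"
        )
    return max_tokens

# A 225-word summary needs roughly 300 tokens of headroom.
print(budget_check("Summarize the attached report...", target_words=225))  # -> 300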
Tips
- Use higher values for creative writing or long-form generation
- Use lower values for brief responses, like titles or summaries
- Always account for both prompt and completion length when working near the model's max context size
- If a response gets cut off mid-sentence, the max_tokens limit was likely too low; try increasing it, as in the sketch below
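The last tip can be automated. Below is a sketch that retries with a doubled cap whenever a response comes back truncated; it uses the same OpenAI-style client as above, and the complete() helper, the starting cap, and the doubling strategy are illustrative choices rather than a prescribed pattern.

# Retry with a larger max_tokens when a response is cut off.
# Assumes the OpenAI Python client; the doubling factor is arbitrary.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str, max_tokens: int = 150, retries: int = 2) -> str:
    for _ in range(retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        choice = response.choices[0]
        if choice.finish_reason != "length":  # "length" means the cap was hit
            return choice.message.content
        max_tokens *= 2  # cut off mid-sentence: double the cap and try again
    return choice.message.content  # still truncated after retries; return what we have

print(complete("Write a product description for a mechanical keyboard."))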