Top P (Nucleus Sampling)

Top P, also known as nucleus sampling, is a decoding parameter that restricts the model's choices to the smallest set of tokens whose cumulative probability reaches the threshold P. It helps balance creativity and coherence in the model's responses.

How does it work?

Instead of picking from a fixed number of top tokens (like Top K), Top P dynamically selects a set of tokens based on their probabilities. For example, if top_p = 0.9, the model samples only from the most likely tokens whose combined probability adds up to at least 90%. This set can vary in size at each step of generation.
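
Here is a minimal sketch of that filtering step in Python (using NumPy; production decoders typically operate on logits inside a framework, but the idea is the same):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]        # most likely tokens first
    cumulative = np.cumsum(probs[order])
    # Include the token that pushes the cumulative mass to top_p,
    # so at least one token always survives.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()       # renormalize to sum to 1

# Toy distribution over a 5-token vocabulary
probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
print(top_p_filter(probs, top_p=0.9))      # only the top 3 tokens survive
```

Note how the size of the surviving set depends on the shape of the distribution: a peaked distribution may leave only one or two tokens, while a flat one leaves many.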

When to use Top P

  • When you want more flexibility than Top K allows
  • When aiming for natural-sounding, diverse outputs without sacrificing quality
  • When generating creative content, like stories, brainstorms, or casual dialogue

How to use Top P

  1. Start with a top_p value between 0.8 and 0.95, the most common range
  2. Include it as the top_p parameter in your API call (see the example after this list)
  3. Use temperature along with it for more nuanced control
  4. Test different values. Higher values give more randomness; lower ones increase focus
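
For example, with the OpenAI Python SDK (the model name here is only a placeholder; any model that accepts top_p works the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute your model
    messages=[{"role": "user", "content": "Brainstorm five names for a coffee shop."}],
    top_p=0.9,            # sample only from the 90% probability nucleus
    temperature=0.8,      # pair with temperature for more nuanced control
)
print(response.choices[0].message.content)
```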

Tips

  • top_p = 1.0 disables nucleus sampling entirely (i.e., all tokens are considered)
  • Lower top_p values can limit hallucinations, but the output may sound dull or repetitive
  • For most natural outputs, top_p of 0.9 combined with temperature of 0.7-0.9 is a good starting point
  • Combine with Top K cautiously: if both are set, they work together to restrict token sampling even further (see the sketch below)
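
To illustrate that last point, here is a sketch of the combined filter (assuming the common convention of applying Top K before Top P; check your library's documentation for its actual order):

```python
import numpy as np

def top_k_top_p_filter(probs: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Apply Top K first, then Top P on the surviving probability mass."""
    # Top K: keep only the k most likely tokens, then renormalize.
    order = np.argsort(probs)[::-1]
    filtered = np.zeros_like(probs)
    filtered[order[:top_k]] = probs[order[:top_k]]
    filtered /= filtered.sum()
    # Top P on what remains: keep the smallest nucleus reaching top_p.
    order = np.argsort(filtered)[::-1]
    cutoff = np.searchsorted(np.cumsum(filtered[order]), top_p) + 1
    out = np.zeros_like(filtered)
    out[order[:cutoff]] = filtered[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
# Top K alone keeps 4 tokens; Top P then trims the set down to 3.
print(top_k_top_p_filter(probs, top_k=4, top_p=0.8))
```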