Top P (Nucleus Sampling)

Top P, also known as nucleus sampling, is a decoding parameter that restricts the model's choices to the smallest set of tokens whose cumulative probability reaches the threshold P. It helps balance creativity and coherence in the model's responses.

How does it work?

Instead of picking from a fixed number of top tokens (like Top K), Top P dynamically selects a set of tokens based on their probabilities. For example, if top_p = 0.9, the model samples only from the most likely tokens whose combined probability adds up to at least 90%. This set can vary in size at each step of generation.
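
Here is a minimal sketch of that filtering step in Python (using NumPy; production decoders typically operate on logits inside a framework, but the idea is the same):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]        # most likely tokens first
    cumulative = np.cumsum(probs[order])
    # Include the token that pushes the cumulative mass to top_p,
    # so at least one token always survives.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()       # renormalize to sum to 1

# Toy distribution over a 5-token vocabulary
probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
print(top_p_filter(probs, top_p=0.9))      # only the top 3 tokens survive
```

Note how the size of the surviving set depends on the shape of the distribution: a peaked distribution may leave only one or two tokens, while a flat one leaves many.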

When to use Top P

  • When you want more flexibility than Top K allows
  • When aiming for natural-sounding, diverse outputs without sacrificing quality
  • When generating creative content, like stories, brainstorms, or casual dialogue

How to use Top P

  1. Start with a top_p value between 0.8 and 0.95, the most common range
  2. Include it as the top_p parameter in your API call (see the example after this list)
  3. Use temperature along with it for more nuanced control
  4. Test different values. Higher values give more randomness; lower ones increase focus
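
For example, with the OpenAI Python SDK (the model name here is only a placeholder; any model that accepts top_p works the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute your model
    messages=[{"role": "user", "content": "Brainstorm five names for a coffee shop."}],
    top_p=0.9,            # sample only from the 90% probability nucleus
    temperature=0.8,      # pair with temperature for more nuanced control
)
print(response.choices[0].message.content)
```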

Tips

  • top_p = 1.0 disables nucleus sampling entirely (i.e., all tokens are considered)
  • Lower top_p values can limit hallucinations, but the output may sound dull or repetitive
  • For most natural outputs, top_p of 0.9 combined with temperature of 0.7-0.9 is a good starting point
  • Combine with Top K cautiously: if both are set, they work together to restrict token sampling even further (see the sketch below)
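
To illustrate that last point, here is a sketch of the combined filter (assuming the common convention of applying Top K before Top P; check your library's documentation for its actual order):

```python
import numpy as np

def top_k_top_p_filter(probs: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Apply Top K first, then Top P on the surviving probability mass."""
    # Top K: keep only the k most likely tokens, then renormalize.
    order = np.argsort(probs)[::-1]
    filtered = np.zeros_like(probs)
    filtered[order[:top_k]] = probs[order[:top_k]]
    filtered /= filtered.sum()
    # Top P on what remains: keep the smallest nucleus reaching top_p.
    order = np.argsort(filtered)[::-1]
    cutoff = np.searchsorted(np.cumsum(filtered[order]), top_p) + 1
    out = np.zeros_like(filtered)
    out[order[:cutoff]] = filtered[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
# Top K alone keeps 4 tokens; Top P then trims the set down to 3.
print(top_k_top_p_filter(probs, top_k=4, top_p=0.8))
```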