
Every developer who starts building with the OpenAI or Anthropic SDKs eventually hits the same wall. Your system prompt is flawless. Your few-shot examples are immaculate. But your production application occasionally spits out a response so hallucinated and bizarre it reads like the model had three martinis on an empty stomach.

You didn't write a bad prompt. You left the API's generation parameters on default.

Language models don't think; they predict the next most likely token. When you adjust Temperature and Top-P, you are literally altering the mathematical probabilities the model uses to make that selection. Leave them untouched, and you are ceding control of your application's reliability to a random number generator.

Temperature: The Chaos Dial

Temperature controls the randomness of the token selection. Think of it as flattening or sharpening a probability curve.

If you ask an LLM to complete the sentence "The sky is...", the probability breakdown for the next token might look like this internally:
blue (85%) | cloudy (10%) | falling (4%) | green (1%)

At Temperature 0.0: The model is effectively deterministic (greedy decoding). It will pick "blue" essentially every time, because it only ever selects the highest-probability token. Use this for data extraction, code generation, JSON formatting, or strict Q&A over internal documents. The model will not guess, and it will not deviate.

At Temperature 0.7 (a common default): Sampling is live, but the curve is still fairly sharp. The model will usually pick "blue", will sometimes pick "cloudy", and will only rarely reach for "falling". This introduces "creativity," which is just a marketing term for mathematical variance. This is great for email drafting, general chat, and brainstorming.

At Temperature 2.0 (the maximum on most APIs): The curve is heavily flattened. The gap between "blue" and "green" shrinks dramatically, and the model begins stringing together highly improbable tokens, often producing gibberish. Never use a temperature above 1.0 in a production application unless your goal is abstract surrealist poetry.
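The flattening effect is easy to see in code. Here is a minimal sketch of temperature scaling applied to the illustrative "The sky is..." probabilities above — these numbers are the article's toy example, not real model logits:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by temperature.

    Converts probabilities to logits, divides by temperature, and
    re-normalizes with a softmax. Temperatures below 1.0 sharpen the
    curve; temperatures above 1.0 flatten it.
    """
    if temperature == 0.0:
        # Degenerate case: greedy decoding — all mass on the top token.
        best = max(probs, key=probs.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in probs}
    logits = {tok: math.log(p) / temperature for tok, p in probs.items()}
    z = sum(math.exp(l) for l in logits.values())
    return {tok: math.exp(l) / z for tok, l in logits.items()}

probs = {"blue": 0.85, "cloudy": 0.10, "falling": 0.04, "green": 0.01}

for t in (0.0, 0.7, 2.0):
    scaled = apply_temperature(probs, t)
    print(t, {tok: round(p, 3) for tok, p in scaled.items()})
```

At 0.0 every drop of probability lands on "blue"; at 2.0 the distribution spreads out and the low-probability tail suddenly gets real odds of being sampled.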


Top-P: The Bouncer at the Club

While Temperature messes with the probabilities of the tokens, Top-P (nucleus sampling) limits the pool of tokens the model is even allowed to look at.

If you set Top-P to 0.90, the model will rank all possible next tokens by probability. It then draws a strict cutoff line the second the cumulative probability of those tokens hits 90%. Any token below that line is permanently discarded from consideration, no matter what the Temperature is doing.

Why does this matter? Because the long tail of the probability distribution contains statistical garbage. If you reduce Top-P to 0.5, you are telling the model: "Only pick from the safest, most obvious tokens that make up the top 50% of the probability pool." It violently truncates hallucinations, making the model sound extremely direct and focused, but often repetitive.
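The cutoff logic can be sketched in a few lines, again using the toy "The sky is..." probabilities rather than real model outputs:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then re-normalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the cutoff line: everything below is discarded
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"blue": 0.85, "cloudy": 0.10, "falling": 0.04, "green": 0.01}

print(top_p_filter(probs, 0.90))  # "blue" and "cloudy" survive
print(top_p_filter(probs, 0.50))  # only "blue" survives
```

At 0.90, "blue" (85%) alone isn't enough mass, so "cloudy" makes the cut too; "falling" and "green" are gone regardless of what Temperature does afterward. At 0.50, "blue" alone clears the bar and the model has exactly one choice.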

The Golden Rule: Never Touch Both

The official documentation from OpenAI, Anthropic, and Cohere gives the same warning: alter Temperature OR alter Top-P, but never both at the same time.

If you lower Top-P to 0.1 (restricting the token pool) and also raise Temperature to 1.5 (flattening the curve of whatever survives the cutoff), the two settings fight each other. You are asking the model to choose randomly, but only from a tiny pool of near-identical, highly probable words. The practical result is degenerate output: the model loops, repeating the same phrase until it hits max tokens.
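A self-contained sketch (same illustrative probabilities as before) shows why the combination is self-defeating: after an aggressive Top-P cutoff, there may be only one candidate left, so Temperature has literally nothing to flatten:

```python
import math

probs = {"blue": 0.85, "cloudy": 0.10, "falling": 0.04, "green": 0.01}

# Step 1: Top-P = 0.1 — keep the smallest top-ranked set reaching 10% mass.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
pool, cumulative = {}, 0.0
for tok, p in ranked:
    pool[tok] = p
    cumulative += p
    if cumulative >= 0.10:
        break
# pool == {"blue": 0.85}: a single candidate survived the cutoff.

# Step 2: Temperature = 1.5 — try to flatten what's left.
z = sum(math.exp(math.log(p) / 1.5) for p in pool.values())
flattened = {tok: math.exp(math.log(p) / 1.5) / z for tok, p in pool.items()}

print(flattened)  # {'blue': 1.0} — temperature had nothing left to flatten
```

With one survivor, every sample is the same token, which is exactly the repetitive-loop behavior described above.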


My Production Parameter Matrix

After running hundreds of thousands of production API calls across different enterprise features, this is the exact parameter matrix I use to configure our applications:

1. Data Extraction / JSON Formatting

Temperature: 0.0 | Top-P: 1.0 (Default)

When you need the model to extract a phone number from an email and output valid JSON, creativity is a bug. You want absolute determinism. If you run the prompt 10 times, you want the exact same JSON 10 times.

2. Code Generation & Logic

Temperature: 0.2 | Top-P: 1.0 (Default)

Code needs to be syntactically strict, but a temperature of exactly 0.0 occasionally causes models to get stuck in logical loops if they make a mistake early in a Chain-of-Thought process. A tiny bump to 0.2 gives the model enough "wiggle room" to escape bad syntactic branches without hallucinating variable names.

3. RAG (Retrieval-Augmented Generation)

Temperature: 0.0 | Top-P: 1.0 (Default)

If you are injecting proprietary documents into the context window and asking the model to answer based only on those documents, the temperature must be zero. If you raise it, the model will start using its pre-training data to fill in gaps in your documents. That is how customer support bots hallucinate refund policies.

4. Marketing Copy & Brainstorming

Temperature: 0.7 - 0.8 | Top-P: 1.0 (Default)

This is where standard defaults shine. You want varied sentence structure, engaging vocabulary, and unexpected hook angles. If you set this to 0.0, every marketing email will start with "I hope this email finds you well."

5. Creative Writing / Story Generation

Temperature: 1.0 (Default) | Top-P: 0.9

For long-form prose narrative, I prefer to leave Temperature alone and restrict Top-P. A Top-P of 0.9 prevents the model from choosing the bizarre, narrative-breaking tokens at the absolute bottom of the probability pool, while still allowing enough variance in the top 90% to produce engaging dialogue and description.
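One way to operationalize the matrix above is to centralize it as configuration and pass the result as keyword arguments to whatever SDK call you make. The task names and helper function here are my own invention, not part of any SDK — the point is that each task touches exactly one dial and leaves the other at its default:

```python
# Sampling parameters per task, following the matrix above.
# Keys omitted fall back to the API's defaults (temperature 1.0, top_p 1.0
# on most providers) — each entry deliberately sets only ONE dial.
SAMPLING_PARAMS = {
    "extraction": {"temperature": 0.0},  # deterministic JSON / data pulls
    "code":       {"temperature": 0.2},  # tiny wiggle room for CoT recovery
    "rag":        {"temperature": 0.0},  # answer only from the documents
    "marketing":  {"temperature": 0.7},  # varied hooks and vocabulary
    "creative":   {"top_p": 0.9},        # trim the tail, keep temp default
}

def sampling_params(task: str) -> dict:
    """Return the sampling kwargs for a task (copy, so callers can't
    mutate the shared config)."""
    return dict(SAMPLING_PARAMS[task])

print(sampling_params("rag"))       # {'temperature': 0.0}
print(sampling_params("creative"))  # {'top_p': 0.9}
```

You would then splat these into your client call, e.g. `client.chat.completions.create(..., **sampling_params("rag"))` in an OpenAI-style SDK.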


Stop Blaming the Prompt

Prompt engineering is only half the battle. If your prompt is a meticulously crafted, 2000-word instruction manual with five examples, but your temperature is set to 1.2, your pipeline will still fail. Understand the math. Control the probability. Lock the dials before you ship.
