How to Prompt Thinking Models like DeepSeek R1 and OpenAI o3

The release of DeepSeek R1, one of a new class of language models designed to reason more effectively than their predecessors, has sparked significant interest in thinking models.
These models effectively internalize the Chain-of-Thought (CoT) prompting process that was first popularized by Google researchers in a 2022 paper. Unlike traditional LLMs, which need CoT prompting to "reason" step by step, thinking models handle reasoning natively, which often leads to better results.
However, their reasoning ability also means you need to prompt them differently for the best results, and most people get it wrong.
Key Takeaways
Prompting DeepSeek R1, OpenAI o3, and other thinking models requires a different approach compared to standard LLMs. Here are some best practices for prompting reasoning models:
| ✅ DOs | ❌ DON'Ts |
|---|---|
| 1. Use minimal prompting and let the model think independently. | 1. Don't use few-shot or CoT prompting. Thinking models perform worse when overloaded with examples or asked to "reason" step-by-step. |
| 2. Encourage more reasoning for better performance on complex tasks by prompting for additional processing time. | 2. Don't overload the model with unnecessary details. More context isn't always better. |
| 3. Use delimiters to clearly separate distinct parts of the input, making it easier for the model to interpret. | 3. Don't rely on thinking models for structured outputs unless absolutely necessary. They handle structured outputs worse than traditional models. |
| 4. Use ensembling for highly complex tasks that require high accuracy. | |
| 5. Use advanced techniques like Chain-of-Draft (CoD) prompting to minimize token use. | |
Disclaimer: These insights come from research carried out by OpenAI and Microsoft, as well as our own experience.
Let's get into it!
1. Use Minimal Prompting
Thinking models work best when given concise, direct, and structured prompts.
Too much information can actually reduce accuracy. Unlike previous models where more context helped, thinking models already structure their reasoning internally.
The best approach is to state the problem clearly and let the model figure out the steps.
✅ What are the main differences between classical and operant conditioning?
---
❌ In psychology, there are different learning theories. Classical conditioning was discovered by Pavlov, while operant conditioning was developed by Skinner. Could you please explain the difference between classical conditioning and operant conditioning? Make sure to include an example for each.
Fewer instructions allow the model to engage its reasoning process naturally.
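For API users, "minimal prompting" translates to a single, direct user message. Here's a sketch using the OpenAI Python SDK pointed at DeepSeek's OpenAI-compatible endpoint; the base URL and `deepseek-reasoner` model name are assumptions based on DeepSeek's docs at the time of writing, so double-check them before use.

```python
# Minimal sketch: a concise, zero-shot prompt to a reasoning model via an
# OpenAI-compatible client. Base URL and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder: set your own key
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint (assumed)
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek R1 via the API (assumed model name)
    messages=[
        # One clear question: no backstory, no examples, no step-by-step instructions.
        {
            "role": "user",
            "content": "What are the main differences between classical and operant conditioning?",
        }
    ],
)

print(response.choices[0].message.content)
```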
2. Encourage More Reasoning for Complex Tasks
More complex problems benefit from additional reasoning time.
Thinking models use reasoning tokens, which allow them to internally process a problem before outputting an answer. By prompting the model to take its time, you can improve the quality of the response. However, this also increases token usage, impacting cost.
✅ Analyze the economic impact of renewable energy adoption over the next 20 years. Consider factors such as job creation, energy prices, and carbon reduction. Take your time and think through each aspect carefully.
---
❌ How does renewable energy impact the economy? Answer quickly.
Encouraging longer reasoning helps with multi-step problems, improving accuracy significantly.
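If you're calling the API, you can combine this nudge with the provider's own controls. The sketch below assumes the OpenAI Python SDK and a model that accepts the `reasoning_effort` parameter (e.g. `o3-mini` at the time of writing); verify both the model name and the parameter against your provider's documentation.

```python
# Sketch: nudging a reasoning model to spend more compute on a complex task.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Analyze the economic impact of renewable energy adoption over the next 20 years. "
    "Consider factors such as job creation, energy prices, and carbon reduction. "
    "Take your time and think through each aspect carefully."
)

response = client.chat.completions.create(
    model="o3-mini",          # assumed model name; substitute your reasoning model
    reasoning_effort="high",  # more reasoning tokens: better answers, higher cost (if supported)
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```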
3. Avoid Few-Shot and Chain-of-Thought Prompting
Traditional few-shot (where you give a bunch of examples) and CoT prompting strategies reduce performance for thinking models.
According to this research, OpenAI's o1-preview model performed worse when given few-shot examples. This contrasts with older models, where few-shot learning improved results. Thinking models are already designed to break down problems internally, so explicit step-by-step guidance can interfere with their reasoning.
✅ What is the capital of Canada?
---
❌ Example 1:
Q: What is the capital of France?
A: Paris
Example 2:
Q: What is the capital of Japan?
A: Tokyo
Now answer this: What is the capital of Canada?
For thinking models, zero-shot prompts worked better than few-shot prompts.
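In API terms, few-shot prompting usually means packing worked examples into the message list as alternating user/assistant turns. The sketch below (plain Python, no API call) shows both constructions so it's clear what to send to a thinking model and what to skip.

```python
# Sketch: zero-shot vs. few-shot message construction for a chat API.

# ✅ Zero-shot: a single direct question. Preferred for thinking models.
zero_shot = [
    {"role": "user", "content": "What is the capital of Canada?"},
]

# ❌ Few-shot: worked examples before the real question. Avoid with thinking models.
few_shot = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris"},
    {"role": "user", "content": "What is the capital of Japan?"},
    {"role": "assistant", "content": "Tokyo"},
    {"role": "user", "content": "What is the capital of Canada?"},
]
```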
4. Use Thinking Models for Complex Multi-Step Tasks
Thinking models perform best on tasks that require five or more steps.
When solving problems with 3-5 steps, thinking models offered a slight improvement over standard models. For simpler tasks (fewer than 3 steps), performance may actually degrade compared to traditional LLMs, because they "overthink."
If a task is highly structured or simple, a regular LLM like GPT-4 may be a better choice.
✅ Break down the process of solving a complex physics problem involving momentum conservation. Explain each step clearly and logically.
---
❌ What is 2+2?
Tip 💡
To check how many steps a problem requires, run it through the web version of a reasoning model and see how many reasoning steps it takes before answering.
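If you route requests programmatically, this rule of thumb can drive model selection. The sketch below is purely illustrative: `estimate_steps` is a hypothetical stub (in practice you'd use a heuristic, task metadata, or a cheap classifier call), and the model names are assumptions.

```python
def estimate_steps(task: str) -> int:
    """Hypothetical stub: roughly how many reasoning steps does this task need?"""
    # Crude placeholder heuristic; replace with something task-aware.
    return 6 if len(task) > 80 else 2

def pick_model(task: str) -> str:
    """Route multi-step tasks to a reasoning model, everything else to a standard LLM."""
    if estimate_steps(task) >= 5:
        return "deepseek-reasoner"  # assumed reasoning-model name
    return "gpt-4o"                 # assumed standard-model name

print(pick_model("What is 2+2?"))  # -> gpt-4o
print(pick_model(
    "Break down the process of solving a momentum-conservation collision problem, "
    "deriving each quantity step by step."
))  # -> deepseek-reasoner
```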
5. Use Delimiters to Structure Prompts for Consistent Outputs
For regular LLMs, developers typically use delimiters like triple quotation marks, XML tags, or section titles to clearly define distinct sections of the input. This makes it easier for the model to interpret the information correctly.
Thinking models, however, struggle with structured outputs but can be guided to maintain consistency. If you need a structured response (e.g., JSON, tables, fixed formats), structure your prompt carefully.
✅ [Task: Summarize the following text]
Text: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
---
❌ Summarize this: The mitochondrion is the powerhouse of the cell. It produces ATP, the energy currency of the cell, through cellular respiration.
If structured output is critical, you're better off using a standard LLM instead of a thinking model.
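Here's a small sketch of how you might assemble a delimited prompt in code. The tag names are arbitrary; any delimiter style (XML-style tags, triple quotes, section headers) works as long as you use it consistently.

```python
# Sketch: assembling a prompt with explicit delimiters so the task,
# the source text, and the expected format are clearly separated.

text = (
    "The mitochondrion is the powerhouse of the cell. It produces ATP, "
    "the energy currency of the cell, through cellular respiration."
)

prompt = f"""[Task: Summarize the following text in two sentences]

<text>
{text}
</text>

<output_format>
Plain prose, no bullet points.
</output_format>"""

print(prompt)
```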
6. Use Ensembling for Highly Complex Tasks
For high-stakes or complex problems, ensembling improves performance.
Ensembling involves running multiple prompts (either the same prompt multiple times or variations of the prompt) and aggregating the results. This approach increases accuracy but raises costs because multiple queries are required.
✅ # Prompt 1:
What are the primary causes of climate change? Provide a well-reasoned answer.
# Prompt 2:
Explain the major contributors to climate change, focusing on human activities and natural factors.
# Prompt 3 (aggregation):
Explain what causes climate change.
<Context>
[Response 1 + Response 2]
</Context>
---
❌ What causes climate change? Answer in a single response.
While ensembling boosts performance, it's expensive and should only be used when high accuracy is critical.
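A minimal ensembling loop might look like the sketch below: it runs two phrasings of the same question, then asks the model to reconcile the answers. The model name is an assumption, and each extra call adds cost and latency.

```python
# Sketch of a simple ensembling flow: query several prompt variants,
# then aggregate the candidate answers in a final call.
from openai import OpenAI

client = OpenAI()
MODEL = "o3-mini"  # assumed reasoning-model name

variants = [
    "What are the primary causes of climate change? Provide a well-reasoned answer.",
    "Explain the major contributors to climate change, focusing on human activities and natural factors.",
]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

answers = [ask(v) for v in variants]

# Aggregation step: feed the candidate answers back as context.
aggregate_prompt = (
    "Explain what causes climate change.\n\n<Context>\n"
    + "\n\n---\n\n".join(answers)
    + "\n</Context>"
)
print(ask(aggregate_prompt))
```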
7. Use Chain-of-Draft Prompting to Minimize Token Usage
Thinking models can be token-hungry when reasoning through complex problems. Chain-of-Draft prompting offers a solution by encouraging models to generate minimal yet informative reasoning steps.
✅ [System Prompt]: Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most.
What's the compound interest on $10,000 invested for 5 years at 8% annual interest rate?
---
❌ What's the compound interest on $10,000 invested for 5 years at 8% annual interest rate?
Unlike traditional Chain-of-Thought, CoD prompts the model to limit each reasoning step to just a few words (typically 5 or fewer), much like a human jotting down concise notes rather than writing out verbose explanations.
CoD can reduce token usage by up to a whopping 80% while maintaining similar accuracy to the normal reasoning method.
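Here's a rough sketch of CoD in practice: the CoD instruction goes in the system message, and a local calculation verifies the expected answer. The model name is an assumption, and some providers restrict or remap system messages for reasoning models, so check the docs (putting the instruction at the top of the user message also works).

```python
# Sketch: Chain-of-Draft via a terse system instruction, plus a local
# sanity check of the expected answer.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",  # assumed model name
    messages=[
        {
            "role": "system",
            "content": "Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most.",
        },
        {
            "role": "user",
            "content": "What's the compound interest on $10,000 invested for 5 years at 8% annual interest rate?",
        },
    ],
)
print(response.choices[0].message.content)

# Sanity check: compound interest = P * (1 + r)^n - P
principal, rate, years = 10_000, 0.08, 5
print(round(principal * (1 + rate) ** years - principal, 2))  # 4693.28
```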
Monitor your LLM application with Helicone 💡
Get full visibility into your thinking models, detailed cost and latency metrics, and more with Helicone. Get started in 5 minutes.
Summary
Unlike standard LLMs, prompting thinking models like DeepSeek R1 and OpenAI o1 and o3 requires a different mindset and approach. In summary:
- Minimal, clear prompts work best.
- Forcing few-shot examples or step-by-step reasoning reduces effectiveness.
- When you must get a structured output, use standard LLMs instead.
- For even better responses, ensembling is a powerful (though costly) technique.
With these guidelines, you can optimize your interactions with thinking models and get the best possible responses.
You might find these useful:
- OpenAI o3 Released: Benchmarks and Comparison to o1
- How to Test Your LLM Prompts with Helicone
- Building a Simple Chatbot with OpenAI Structured Outputs
Frequently Asked Questions
What are thinking models like DeepSeek R1 and OpenAI o3?
Thinking models (such as DeepSeek R1, OpenAI o1/o3, and other reasoning models) are language models specifically designed to reason internally rather than relying on explicit Chain-of-Thought prompting. They natively incorporate the Chain-of-Thought process, often leading to better results on complex multi-step problems.
How should I prompt DeepSeek R1 differently than regular LLMs?
For DeepSeek R1 and similar thinking models, use minimal, concise prompting instead of verbose instructions. Avoid few-shot examples and explicit Chain-of-Thought prompting, which can actually degrade performance. Let the model think independently by stating problems clearly without overloading it with context.
How do I get better structured outputs from thinking models like DeepSeek R1?
Thinking models, including DeepSeek R1, typically aren't ideal for structured outputs. If you must, use delimiters and clear formatting instructions (e.g., a template), but keep in mind that traditional LLMs may perform better for this specific use case.
When should I use reasoning models like DeepSeek R1 versus traditional LLMs?
Use reasoning models like DeepSeek R1 and OpenAI o3 for complex problems requiring 5+ reasoning steps, analytical tasks, and scenarios where accuracy outweighs token efficiency. For simpler tasks, structured outputs, or when examples are beneficial, traditional LLMs may be more appropriate and cost-effective.
How can I prompt DeepSeek R1 for maximum effectiveness?
To maximize DeepSeek R1's thinking process, encourage more processing time with phrases like 'Take your time and think carefully' for complex problems. Avoid interrupting its reasoning with step-by-step instructions. For critical tasks, consider using ensembling techniques by running multiple variations of your prompt.
What is Chain-of-Draft (CoD) and how can I use it with thinking models?
Chain-of-Draft is a prompting technique that encourages models to use minimal, concise reasoning steps (typically 5 words or less). With thinking models like DeepSeek R1 and OpenAI o3, implement CoD by adding 'Think step by step, but only keep a minimum draft for each thinking step' to your prompt. This reduces token usage by up to 80% while maintaining reasoning quality.
Can I use Chain-of-Thought prompting with DeepSeek R1?
Contrary to traditional LLMs, explicit Chain-of-Thought prompting typically reduces performance in thinking models like DeepSeek R1. These models already incorporate reasoning internally, and forcing a specific thinking path can interfere with their native reasoning capabilities. Use direct, minimal prompting instead.
How do I create a DeepSeek R1 prompt template for consistent results?
For consistent results with DeepSeek R1, use a simple prompt template that includes: 1) A minimal system message, 2) Clear task description, 3) Necessary context in a structured format, and 4) A direct instruction. Use delimiters to separate sections, and avoid lengthy instructions or examples that might interfere with the model's reasoning.
Questions or feedback?
Is the information out of date? Please raise an issue or contact us; we'd love to hear from you!