
Time: 7 minute read

Created: December 20, 2024

Author: Lina Lam

OpenAI Unveils New O3 Model: What Is It and How Is It Different from O1?

OpenAI has just revealed its latest advancements in AI with the launch of the o3 and o3-mini reasoning models. These new models are designed to push the boundaries of what AI can achieve by introducing a deeper level of reasoning.

Building on the foundation of OpenAI's o1 models, the o3 family introduces several notable improvements in performance, reasoning capabilities, and testing results.

OpenAI o3 released

What is OpenAI's o3 model?

The o3 model represents the next leap forward in OpenAI's development of "reasoning" AI models. Unlike traditional large language models (LLMs) that rely on pattern recognition alone, the o3 model incorporates a process called "simulated reasoning" (SR), which deepens its reasoning capabilities beyond o1's. The model can pause and reflect on its own internal thought process before responding, mimicking human-like deliberation in a way that previous models couldn't.

What sets o3 apart from its predecessors is its ability to work through complex tasks with greater accuracy. In simple terms, while the o1 models were good at understanding and generating text, o3 goes a step further by thinking through a problem and planning its response before answering. This "private chain-of-thought" technique is at the core of that difference.
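OpenAI has not disclosed how the private chain-of-thought works inside the model, but the general "draft, critique, revise" pattern it evokes can be sketched at the application level. The snippet below is a purely illustrative loop built on the standard Chat Completions API; the prompts and the use of gpt-4o as a stand-in model are our own assumptions, and this is not OpenAI's internal mechanism.

```python
# Illustrative "reflect before answering" loop. This only mimics the
# draft -> critique -> revise pattern at the application level; o3's
# private chain-of-thought happens inside the model and is proprietary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in model; any chat model works for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def solve_with_reflection(problem: str, rounds: int = 2) -> str:
    answer = ask(problem)  # first draft
    for _ in range(rounds):
        critique = ask(
            f"Problem: {problem}\nProposed answer: {answer}\n"
            "List any mistakes, or reply with exactly OK if it is correct."
        )
        if critique.strip() == "OK":
            break  # the model found no issues with its own answer
        answer = ask(
            f"Problem: {problem}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite a corrected answer."
        )
    return answer

print(solve_with_reflection("What is 17 * 24?"))
```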

o3 vs. o1: What's Different?

So, how does o3 differ from the earlier o1 models?

  1. Reasoning Ability: The o3 models are built to simulate reasoning at a deeper level. While o1 could generate responses based on patterns learned during training, o3 actively "thinks" about the problem at hand, improving its ability to tackle complex and multi-step tasks.

  2. Performance on Benchmarks: One of the most exciting aspects of o3 is its performance on various benchmarks. For example, it scored 75.7% on the ARC-AGI visual reasoning benchmark in low-compute scenarios, which is impressive compared to human-level performance (85%). This is a huge improvement over o1 and shows just how much further o3 can go in solving challenging problems.

o3 performance

Image source: OpenAI's YouTube announcement

  3. Mathematical and Scientific Accuracy: OpenAI reports that o3 also achieved remarkable results in mathematics and science. For instance, it scored 96.7% on the American Invitational Mathematics Exam (AIME) and 87.7% on a graduate-level science exam (GPQA Diamond). These scores highlight the model's increased capacity for solving complex problems in fields that require high-level reasoning.

o3 math performance

  4. Code and Programming: In terms of coding, o3 also outperforms o1. For example, it achieved a higher score than o1 on the Codeforces benchmark, which tests a model's ability to solve competitive programming problems. This shows that o3 is more adept at tasks that require logical thinking and problem-solving.

o3 code performance

Why not o2?

If the o3 model is such a big step forward, why did OpenAI skip o2?

According to OpenAI's CEO, Sam Altman, the decision was purely a matter of avoiding potential trademark issues: the name o2 could have clashed with O2, a British telecom company. So, in the tradition of OpenAI "being really bad at names," as Altman jokingly put it, the team jumped straight to o3. The name might seem unconventional, but the reasoning behind it was purely practical.

Benchmarks Achieved by o3

ARC-AGI Benchmark

ARC-AGI tests an AI model's ability to recognize patterns in novel situations and how well it can adapt knowledge to unfamiliar challenges.

  • o3 scored 75.7% on low compute
  • o3 scored 87.5% on high compute, which is comparable to human performance at 85%

o3 arc-agi

With 87.5% accuracy in visual reasoning, o3 addresses prior models' struggles with spatial and physical reasoning about objects. This breakthrough strengthens real-world applications like robotics, medical imaging, and AR, and it has fueled the AGI conversation. o3's advancements mark a key step toward smarter, more capable AI systems.
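To make the task format concrete, here is a toy, hand-made puzzle in the same input/output grid style that ARC uses (integers stand for colors). This is an illustration of the format only, not an actual benchmark item; the hidden rule in this toy is simply "flip each grid left-to-right."

```python
# Toy puzzle in the ARC-style input/output grid format.
# Hand-made illustration, not an actual ARC-AGI task.
task = {
    "train": [
        {"input":  [[1, 0, 0],
                    [1, 1, 0]],
         "output": [[0, 0, 1],
                    [0, 1, 1]]},
    ],
    "test": {"input": [[2, 2, 0],
                       [0, 2, 0]]},
}

def apply_rule(grid):
    # The rule a solver must infer from the training pair(s):
    # mirror each row left-to-right.
    return [row[::-1] for row in grid]

print(apply_rule(task["test"]["input"]))  # [[0, 2, 2], [0, 2, 0]]
```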

American Invitational Mathematics Exam (AIME)

With an impressive 96.7% accuracy, o3 significantly outperforms o1's 83.3%. This leap showcases o3's superior ability to handle complex tasks. Mathematics, a crucial benchmark, highlights the model's capacity to grasp abstract concepts fundamental to scientific and universal understanding. o3's enhanced accuracy cements its position as a game-changer for users seeking precision and advanced reasoning in AI applications.

GPQA Diamond Benchmark

o3 scored 87.7%, demonstrating strong reasoning across graduate-level biology, physics, and chemistry questions.

o3 gpqa

EpochAI Frontier Math Benchmark

The EpochAI Frontier Math benchmark is one of the toughest tests in the field, featuring unpublished, research-level problems that demand advanced reasoning and creativity.

These problems often take professional mathematicians hours or even days to solve. o3 solved 25.2% of them; no other model had previously exceeded 2% on this benchmark.

o3 epochai


Introducing o3-mini: A More Adaptive Model

Alongside the o3 model, OpenAI also unveiled o3-mini. This lighter-weight version offers an adaptive thinking time feature, allowing users to select low, medium, or high thinking time depending on their needs. The o3-mini is designed for situations where you may not need the full power of o3 but still want to benefit from its advanced reasoning capabilities.
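OpenAI has not yet published the API surface for o3-mini, but if the low/medium/high setting is exposed as a request parameter, usage might look roughly like the sketch below. The model name "o3-mini" and the "reasoning_effort" parameter are assumptions based on the announcement, not confirmed API details.

```python
# Hypothetical sketch of selecting o3-mini's thinking time.
# The model name and the reasoning_effort parameter are assumptions;
# the real API details had not been published at the time of writing.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",          # assumed model identifier
    reasoning_effort="high",  # assumed values: "low" | "medium" | "high"
    messages=[
        {"role": "user",
         "content": "Prove that the sum of two even numbers is even."},
    ],
)
print(response.choices[0].message.content)
```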

Although smaller and faster, the o3-mini is still a powerful tool. OpenAI claims that it outperforms o1 on several key benchmarks, making it a great option for those seeking more efficient performance.

The Rise of Simulated Reasoning

OpenAI's o3 and o3-mini models represent a shift in the AI landscape, especially in the context of simulated reasoning (SR). SR is gaining traction across the AI industry, with Google launching Gemini 2.0 Flash Thinking and DeepSeek releasing models built on the same approach. SR allows AI models to consider their own results and adjust their reasoning as they go, offering a more nuanced and accurate form of problem-solving than traditional LLMs.

Simulated reasoning models, including OpenAI's o3, are designed to scale at inference time. Rather than producing an answer in one fixed pass, they can spend additional compute "thinking" before they respond, trading latency for accuracy on complex, multi-faceted tasks. This makes them especially well suited to use cases in science, mathematics, and programming.
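One concrete, widely used way to trade inference-time compute for accuracy is self-consistency: sample several independent answers and keep the majority vote. The sketch below illustrates the general idea only; it is not how o3 scales internally, and the model choice and prompt are our own assumptions.

```python
# Minimal self-consistency sketch: spend more compute at inference by
# sampling several answers and majority-voting. Illustrates the idea of
# inference-time scaling in general, not OpenAI's internal method.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",   # stand-in model for the sketch
        temperature=1.0,  # encourages diverse samples
        messages=[{"role": "user",
                   "content": question + "\nReply with only the final answer."}],
    )
    return response.choices[0].message.content.strip()

def self_consistent_answer(question: str, n: int = 5) -> str:
    votes = Counter(sample_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]  # larger n: more compute, higher accuracy
```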

When Will o3 Be Available?

As of now, OpenAI is not releasing o3 and o3-mini models for general use. Instead, the company is initially providing access to safety researchers for testing. This allows OpenAI to gather important feedback and ensure that the models are safe and effective for broader applications.

OpenAI has announced that o3-mini will be available to a wider audience by late January, with the full o3 model expected to follow shortly after. The release of these models is highly anticipated, and they represent a significant step forward in the development of AI with more advanced reasoning capabilities.


Conclusion: A Leap Towards More Advanced AI

The introduction of o3 and o3-mini is a major milestone for OpenAI, marking a shift towards AI models that can reason more effectively and tackle problems in a more human-like way. By building on the success of o1, OpenAI has developed models that can perform better on complex benchmarks, improve mathematical and scientific problem-solving, and even handle coding tasks with greater efficiency.

As simulated reasoning continues to rise in popularity across the industry, it will be exciting to see how o3 and o3-mini influence the future of AI. With OpenAI's commitment to safety and research, these models are set to change the way we think about AI's role in problem-solving and decision-making. In the coming months, we can expect to see more updates on the performance and availability of these groundbreaking models.

Questions or feedback?

Is the information out of date? Please raise an issue; we'd love to hear your insights!