
Time: 6 minute read

Created: February 19, 2025

Author: Lina Lam

Grok 3 Technical Review: Everything You Need to Know

Grok 3 just dropped, and xAI is billing it as the "Smartest AI in the world". Built on massive computational resources and designed for real-time knowledge, it’s xAI’s strongest contender yet.

xAI Releases Grok 3: Benchmark Comparison

With significant improvements over Grok 2, this model promises better coding skills, reasoning, and even scientific problem-solving.

But how well does it actually perform? Let’s break it down.

What’s New in Grok 3?

Grok 3 is a massive leap forward from its predecessor. Here’s what changed:

  • 10-15x more compute than Grok 2.
  • 100K+ Nvidia H100 GPUs: Trained on xAI’s Memphis supercomputer, one of the largest AI clusters in the world—built in 122 days.
  • Advanced reasoning: Runs multiple thought chains, self-corrects, and evaluates solutions before finalizing an answer.
  • Deep Search: A "next generation search engine" that allows Grok 3 to think about what it finds across sources and what to look for, not just search and retrieve information. Users can see its thought process in detail in real-time. Not to be confused with Deep Research.
  • Big Brain mode: A specialized mode where Grok 3 uses additional compute resources to improve its reasoning and solve complex multi-step problems.
  • Real-time knowledge: Integrated with X for access to up-to-the-minute information.
  • Better at coding, math, and science: Grok 3 excels in technical domains, making it a serious competitor in AI-driven research and programming tasks.
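
xAI has not published how Grok 3 runs and compares its thought chains, so the following is only a rough sketch of one well-known generic technique, self-consistency by majority vote. The names self_consistent_answer and make_stub_sampler are hypothetical, and the stub sampler stands in for a real model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def self_consistent_answer(
    sample_chain: Callable[[str], str], question: str, n_chains: int = 5
) -> str:
    """Sample several independent reasoning chains and return the most common final answer."""
    answers = [sample_chain(question) for _ in range(n_chains)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def make_stub_sampler() -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: cycles through fixed final answers."""
    canned = cycle(["42", "41", "42", "42", "40"])
    return lambda _question: next(canned)


if __name__ == "__main__":
    # Three of the five sampled chains agree on "42", so that answer wins the vote.
    print(self_consistent_answer(make_stub_sampler(), "What is 6 * 7?"))
```

In a real setup the sampler would call the model with a non-zero temperature so that the chains actually differ; the voting step stays the same.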

Fun Fact 💡

Grok 3 remains largely uncensored and interestingly uses Wikipedia as a source quite a lot despite Elon Musk’s public criticism of the platform.

So How Smart is Grok 3? Here Are the Benchmarks and Real-World Performance

Benchmarks: Grok 3 vs ChatGPT vs Gemini vs Claude

On paper, Grok 3 outperforms its rivals in various technical domains. Let’s look at some numbers.

According to benchmark results shown in xAI's release demo, Grok 3 scores higher than Gemini-2 Pro, DeepSeek V3, GPT-4o, and Claude 3.5 Sonnet in math (AIME), science (GPQA), and coding tasks (LiveCodeBench).

Grok 3 Benchmark compared to OpenAI o3

Image source: Outlook Business: Grok 3 Performance Against GPT-4o

Fun Fact 💡

Grok 3 is said to have successfully solved fresh, unseen problems in the 2025 AIME math competition.

LMArena Benchmarks

Perhaps more notably, in blind user-voted evaluations on LMArena—a crowd-sourced LLM benchmarking platform—Grok 3 has set a new milestone.

Unlike traditional AI benchmarks that rely on static test sets, LMArena uses live human feedback in a blind A/B test format, making it one of the most reliable indicators of real-world AI performance.

An early version of Grok 3 (codenamed “Chocolate”) has officially taken the #1 spot, outperforming models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.

Grok 3 Breaks Records

Impressively, Grok 3 also became the first model to break the 1400 Elo score barrier on LMArena, outperforming all other models in every category: Overall, Hard Prompts (prompts that utilize a pre-designed template to elicit model outputs), Coding, Math, Creative Writing, Instruction Following, Longer Query Handling, and Multi-Turn Conversations.
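
To make the Elo number more concrete, here is a minimal sketch of how blind pairwise votes can be turned into ratings with the classic Elo update. This is an illustration only: LMArena's actual scoring pipeline may use a different statistical fit, and the K-factor of 32, the starting rating, the model names, and the votes below are all assumptions made up for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new (A, B) ratings. outcome_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical models and votes; each vote is (left, right, outcome for left).
    ratings = {"model_a": 1400.0, "model_b": 1400.0}
    votes = [
        ("model_a", "model_b", 1.0),
        ("model_a", "model_b", 0.5),
        ("model_b", "model_a", 1.0),
    ]
    for left, right, outcome in votes:
        ratings[left], ratings[right] = update_elo(ratings[left], ratings[right], outcome)
    print(ratings)
```

Under this model, a rating gap of 400 points corresponds to roughly a 10-to-1 expected win ratio, which is why small differences near the top of a leaderboard translate into only slight preferences in head-to-head votes.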

Grok 3 in the Wild: Real-World Reviews

Early real-world tests show mixed but promising results. While Grok 3’s reasoning is top-tier, its performance in some areas still lags behind OpenAI’s best models.

Strengths

✔️ Advanced Reasoning

Andrej Karpathy—who got early access—noted on X that Grok 3’s “Thinking” mode solves complex problems better than many competitors. It successfully solved a tricky Settlers of Catan programming task that stumped most other models.

✔️ Logic

The model performed well on structured logic problems, solving multiple tic-tac-toe challenges with proper chains of thought.

✔️ Deep Search

Its Deep Search tool was praised for finding high-quality information on recent events, such as Apple launch rumors and stock surges, similar in depth and quality to Perplexity's Deep Research but not at the level of OpenAI's.

Weaknesses

🆇 Coding Performance

An early user on X found that Grok 3 struggled with moderately complex coding tasks, at least compared to GPT-4o and Claude, which produced better solutions, as shown below.

🆇 Math & Symbolic Logic

While strong in structured problem-solving, it failed Andrej Karpathy’s Unicode emoji mystery challenge, whereas DeepSeek's R1 performed better.

In his tweet, Karpathy said "[Grok 3] did not solve my question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message."
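
For readers unfamiliar with the trick, here is a minimal sketch of how a message can be hidden in Unicode variation selectors attached to an emoji. The byte-to-selector mapping below is one commonly described scheme and is an assumption; Karpathy's exact encoding and his Rust hint are not reproduced here, and the function names are hypothetical.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map a byte to one of the 256 variation selectors (assumed scheme)."""
    if b < 16:
        return chr(0xFE00 + b)          # VS1..VS16
    return chr(0xE0100 + (b - 16))      # VS17..VS256


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for characters that are not variation selectors."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def hide(carrier: str, message: str) -> str:
    """Append one invisible selector per UTF-8 byte of the message to the carrier."""
    return carrier + "".join(byte_to_selector(b) for b in message.encode("utf-8"))


def reveal(text: str) -> str:
    """Collect the bytes carried by variation selectors and decode them as UTF-8."""
    data = bytes(b for b in map(selector_to_byte, text) if b is not None)
    return data.decode("utf-8")


if __name__ == "__main__":
    stego = hide("😀", "hello")
    print(len(stego), reveal(stego))
```

Most renderers show the result as a plain smiling face, which is what makes the puzzle hard: a model has to notice the extra code points before it can even start decoding them.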

🆇 Humor & Creativity

The model lacks any advanced abilities for humor. When asked for jokes, it repeatedly gave variations of the same puns, similar to older LLMs, as Andrej Karpathy shared in his tweet.

🆇 Fact-checking Issues

In the same tweet, Karpathy also found Grok 3 hallucinating citations and even inventing URLs, similar to problems seen in other LLMs.

Final Thoughts

Overall, Grok 3 has been impressive so far but not perfect, and seemingly not the "Smartest AI in the world" that Musk claims.

It still lags behind OpenAI’s best models in benchmarks, and its real-world performance is mixed. Just take a look at the benchmark from earlier with OpenAI o3 added:

Grok 3 Reasoning Benchmarks with OpenAI o3 Added

How to Access Grok 3

Grok 3 is not yet available through an API, but you can access it through several channels:

  • X Premium+ Subscription: Access it directly via X for $40/month. Update the app if Grok 3 does not appear.
  • Grok app: Available on iOS and Android.
  • Web: The most up-to-date experience is on grok.com.

What’s Next for xAI and Grok?

xAI has big plans beyond Grok 3. Here’s what’s coming next:

  • API Access: Developers will soon be able to integrate Grok into their own applications.
  • Super Grok Subscription: A premium tier offering early access to cutting-edge features.
  • Voice Mode: A fully interactive AI voice assistant, expected within a week.
  • Memory Features: Persistent memory to recall past conversations for personalized interactions.
  • Bigger AI Cluster: xAI is already working on a 5x more powerful training setup.
  • Scientific Breakthroughs?: Elon Musk predicts AI will win (or at least help win) a Nobel Prize, Turing Award, or Fields Medal within the next 1-2 years.

Monitoring your xAI app with Helicone ⚡️

The easiest way to monitor and debug your xAI applications. Start capturing traces in production. Integrate in minutes.

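import os

from openai import OpenAI

client = OpenAI(
    api_key="your-x-ai-api-key",  # xAI API key
    # Helicone proxy base URL for xAI; the SDK appends /chat/completions itself
    base_url="https://x.helicone.ai/v1",
    # Helicone auth header; HELICONE_API_KEY is an assumed environment variable name
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="grok-beta",
    messages=[
        {"role": "user", "content": "Say this is a test"}
    ]
)
print(response.choices[0].message.content)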

Final Thoughts: Is Grok 3 the Undisputed King?

Grok 3 is xAI’s most serious attempt at competing with OpenAI, Google, and others. It’s a leap forward in AI capability, with superior reasoning and a massive compute infrastructure behind it.

However, despite the improvements, Grok 3 is not yet the undisputed best as it still lags behind OpenAI’s o3 model in benchmarks.

That said, xAI’s rapid progress with Grok has been nothing short of remarkable and it will be interesting to see how it evolves in the coming months.

FAQs

What is Deep Search, and how does it work?

Deep Search is Grok 3’s research-style retrieval system. Instead of just pulling up search results, it actively reads, synthesizes, and cross-verifies information before responding.

Is Grok 3 free?

No. You need an X Premium+ subscription or a Grok app subscription to access it.

When will the Grok API be available?

xAI plans to release it within the next few weeks.

Does Grok 3 have memory?

Not yet, but memory features are planned for future updates.

Will xAI open-source Grok 3?

No confirmation yet, but Grok 1 is open-source so it's a possibility.


Questions or feedback?

Is this information out of date? Please raise an issue or contact us. We'd love to hear from you!