
Grok 3 Beta

May 01, 2025

We are thrilled to unveil an early preview of Grok 3, our most advanced model yet, blending superior reasoning with extensive pretraining knowledge.



Next-Generation Intelligence from xAI

We are pleased to introduce Grok 3, our most advanced model yet, blending strong reasoning with extensive pretraining knowledge. Trained on our Colossus supercluster with 10x the compute of previous state-of-the-art models, Grok 3 displays significant improvements in reasoning, mathematics, coding, world knowledge, and instruction-following tasks. Grok 3's reasoning capabilities, refined through large-scale reinforcement learning, allow it to think for seconds to minutes, correcting errors, exploring alternatives, and delivering accurate answers. Grok 3 shows leading performance across both academic benchmarks and real-world user preferences, achieving an Elo score of 1402 in the Chatbot Arena. Alongside it, we are unveiling Grok 3 mini, a new frontier in cost-efficient reasoning. Both models are still in training and will evolve rapidly with your feedback. We are rolling out Grok 3 to users in the coming days, along with an early preview of its reasoning capabilities.

Thinking Harder: Test-time Compute and Reasoning

Today, we are announcing two beta reasoning models, Grok 3 (Think) and Grok 3 mini (Think). They were trained using reinforcement learning (RL) at an unprecedented scale to refine their chain-of-thought processes, enabling advanced reasoning in a data-efficient manner. With RL, Grok 3 (Think) learned to refine its problem-solving strategies, correct errors through backtracking, simplify steps, and utilize the knowledge it picked up during pretraining. Just like a human tackling a complex problem, Grok 3 (Think) can spend anywhere from a few seconds to several minutes reasoning, often considering multiple approaches, verifying its own solution, and evaluating how to precisely meet the requirements of the problem.


Both models are still in training, but they already show remarkable performance across a range of benchmarks. We tested these models on the 2025 American Invitational Mathematics Examination (AIME), which was released just 7 days ago on Feb 12th. With our highest level of test-time compute (cons@64, i.e., majority voting across 64 sampled solutions), Grok 3 (Think) achieved 93.3% on this competition. Grok 3 (Think) also attained 84.6% on graduate-level expert reasoning (GPQA) and 79.4% on LiveCodeBench for code generation and problem-solving. Furthermore, Grok 3 mini sets a new frontier in cost-efficient reasoning for STEM tasks that require less world knowledge, reaching 95.8% on AIME 2024 and 80.4% on LiveCodeBench.
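The cons@64 protocol amounts to sampling many independent reasoning traces for a problem and scoring the most common final answer. As a rough, hypothetical sketch of just the voting step (not our actual evaluation harness), it looks like this in Python:

```python
from collections import Counter

def consensus_answer(final_answers: list[str]) -> str:
    # Majority vote over the final answers extracted from independently
    # sampled reasoning traces (the "cons@N" scoring rule).
    answer, _count = Counter(a.strip() for a in final_answers).most_common(1)[0]
    return answer

# 64 hypothetical final answers from 64 sampled traces for one AIME problem.
answers = ["204"] * 50 + ["210"] * 10 + ["96"] * 4
print(consensus_answer(answers))  # -> "204"
```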

To use Grok 3’s reasoning capabilities, just press the Think button. Grok 3 (Think)’s mind is completely open, allowing users to inspect not only the final answer but also the model's underlying reasoning process. We have found that Grok 3 (Think)'s performance generalizes across diverse problem domains. Here are some Grok 3 reasoning examples.


With a context window of 1 million tokens — 8 times larger than our previous models — Grok 3 can process extensive documents and handle complex prompts while maintaining instruction-following accuracy. On the LOFT (128k) benchmark, which targets long-context RAG use cases, Grok 3 achieved state-of-the-art accuracy (averaged across 12 diverse tasks), showcasing its powerful information retrieval capabilities.
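To give a concrete sense of how a long-context prompt is sent in practice, here is a minimal sketch using an OpenAI-compatible Python client. The endpoint, model identifier, and document are illustrative assumptions rather than details from this announcement; consult the xAI API documentation for the actual values.

```python
# Hypothetical long-document Q&A call through an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",   # assumed endpoint
    api_key="YOUR_XAI_API_KEY",
)

# A long source document, e.g. several hundred thousand tokens of text.
with open("annual_report.txt", encoding="utf-8") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="grok-3",                   # assumed model identifier
    messages=[
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "user", "content": long_document + "\n\nQuestion: Which risk factors does the report highlight?"},
    ],
)
print(response.choices[0].message.content)
```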

  • Grok 3 also demonstrates improved factual accuracy and enhanced stylistic control.

  • Under the codename chocolate, an early version of Grok 3 topped the LMArena Chatbot Arena leaderboard, outperforming all competitors in Elo scores across all categories.

  • As we continue to scale, we are preparing to train even larger models on our 200,000 GPU cluster.