Gemini 2.5 Pro Gets Deep Think: Google's New Benchmark Leader

On June 22, Google shipped Gemini 2.5 Pro with a new reasoning mode called Deep Think and a context window double that of the previous version. The launch benchmarks show it leading the public rankings on science, math, and general reasoning, with one notable caveat. Here's what actually changed.

What Deep Think does

Most of the time, a language model generates tokens sequentially: each token is produced based on everything before it, left to right, without revisiting. That makes responses fast but limits the model's ability to catch its own early errors.

Deep Think changes this by extending inference time. Before committing to a response, the model generates multiple candidate approaches, evaluates each internally, and produces a reasoning trace showing its steps before the final answer appears. Google's release data indicates this improves accuracy on multi-step problems by roughly 15-25% compared to standard mode.
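Google has not published how Deep Think selects among its candidates, so the following is only a generic "best-of-N" sketch of the generate-then-evaluate idea, not Deep Think's actual mechanism. The function name best_of_n, the toy equation, and the scoring rule are all illustrative assumptions.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(
    candidates: Sequence[T], score: Callable[[T], float]
) -> tuple[T, list[tuple[T, float]]]:
    """Score every candidate and return the winner plus the full trace.

    Generic best-of-N selection (hypothetical helper): this is NOT
    Google's implementation, only the general shape of the idea.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    trace = [(c, score(c)) for c in candidates]
    best, _ = max(trace, key=lambda pair: pair[1])
    return best, trace


def residual_score(x: float) -> float:
    """Toy verifier: how close x comes to solving x^2 - 5x + 6 = 0."""
    return -abs(x * x - 5 * x + 6)


# Three candidate answers; checking each is cheaper than producing it.
answer, trace = best_of_n([1, 2, 4], residual_score)
print("selected:", answer)
for candidate, s in trace:
    print("candidate", candidate, "score", s)
```

The point of the sketch is the trade: every extra candidate costs inference time, but a wrong first attempt no longer has to become the final answer.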

This isn't new territory. OpenAI shipped a comparable reasoning mode with o1 in late 2024, and Claude Fable 5 includes a similar extended reasoning capability. What's different here is the combination: Deep Think is paired with a 2 million token context window, which matters for the class of problem where you need both long-context intake and careful reasoning in the same session.

The 2 million token context window

The previous Gemini 2.5 Pro handled 1 million tokens. The new version doubles that. To put it concretely: at a rough 0.75 English words per token, 2 million tokens is about 1.5 million words, which covers roughly 15 average-length novels, a full medium-sized codebase, or well over a hundred hours of transcribed conversation.
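For readers who want to check the scale themselves, here is a back-of-the-envelope conversion from tokens to everyday units. The three ratios (words per token, words per novel, speaking rate) are rough rules of thumb chosen for illustration, not figures from Google, and real tokenizers vary by language and content.

```python
# All three constants are assumptions for illustration only.
WORDS_PER_TOKEN = 0.75     # rough rule of thumb for English prose
WORDS_PER_NOVEL = 100_000  # assumed "average-length" novel
WORDS_PER_MINUTE = 150     # assumed conversational speaking rate


def context_capacity(tokens: int) -> dict[str, float]:
    """Translate a token budget into words, novels, and hours of speech."""
    words = tokens * WORDS_PER_TOKEN
    return {
        "words": round(words),
        "novels": round(words / WORDS_PER_NOVEL, 1),
        "speech_hours": round(words / WORDS_PER_MINUTE / 60, 1),
    }


for window in (1_000_000, 2_000_000):
    print(window, context_capacity(window))
```

Changing any of the constants shifts the result proportionally, which is why such comparisons are best read as orders of magnitude.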

For most everyday use, this won't change much. Short queries and conversational exchanges fit comfortably in any frontier model's context. Where the 2M window matters is in high-volume batch work: analyzing a full year of financial documents, reviewing a large codebase without truncating files, or processing months of correspondence in a single pass. Tasks that previously required chunking or retrieval pipelines can now run as single queries.
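To see when a larger window removes the need for chunking, the sketch below greedily packs documents into batches that fit a given window. The function name plan_batches, the reserve for the prompt and response, and the sample token counts are all hypothetical; a real pipeline would count tokens with the provider's tokenizer.

```python
def plan_batches(
    doc_tokens: list[int], window: int, reserve: int = 50_000
) -> list[list[int]]:
    """Greedily group document indices into batches that fit the window.

    `reserve` is an assumed allowance for instructions and the model's
    output; it is not a documented Gemini value.
    """
    budget = window - reserve
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for index, size in enumerate(doc_tokens):
        if size > budget:
            raise ValueError(f"document {index} alone exceeds the window")
        if used + size > budget:
            # Current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical token counts for four quarterly report bundles.
quarters = [400_000, 500_000, 600_000, 300_000]
print(plan_batches(quarters, window=1_000_000))
print(plan_batches(quarters, window=2_000_000))
```

With the smaller window the same material needs two passes and some way to merge their results; with the larger one it fits in a single request.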

Where the benchmarks land

At launch, Gemini 2.5 Pro with Deep Think posted leading scores across several standard benchmarks. On MMLU-Pro, a broad academic knowledge test spanning 14 disciplines, the model scored 89.8%, the highest publicly recorded score at release. GPQA Diamond, which tests graduate-level science and reasoning, came in at 82.4%. HumanEval+ for coding reached 94.1%. MATH-500 hit 97.2%, essentially saturating that test.

These numbers come from Google's release data. Aggregators like llm-stats.com track independent results as the community reproduces them.

Where it doesn't lead

Benchmark sweeps can obscure as much as they reveal. Gemini 2.5 Pro with Deep Think leads on academic knowledge and math, but Claude Opus 4.7 still holds the top score on SWE-bench Pro, the standard benchmark for software engineering task completion in realistic codebases, at 64.3%. GPT-5.5 leads on Terminal-Bench at 82.7%, which tests multi-step command-line work.

The practical read: for research, analysis, scientific reasoning, or math-heavy tasks, the updated Gemini is the strongest option publicly available right now. For writing and debugging production code, Claude Opus 4.7 remains the benchmark leader. Neither wins everywhere, and the gap between them is narrower than it was six months ago, consistent with the broader shift in market share among AI vendors.

Pricing and access

Standard Gemini 2.5 Pro costs approximately $2.50 per million input tokens and $15 per million output tokens via the API. Deep Think uses more compute (roughly 4x the standard rate) because the model generates and evaluates multiple internal reasoning paths before producing the final output. That cost is worth it for complex analytical tasks, less so for simple queries where standard mode is faster and sufficient.
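A quick worked estimate makes the trade-off concrete. The sketch below uses the approximate prices quoted above and assumes the roughly 4x figure applies as a flat multiplier on the whole request; how Deep Think usage is actually metered is not specified here, so treat the multiplier and the function name estimate_cost as assumptions.

```python
INPUT_USD_PER_MILLION = 2.50   # approximate figure quoted above
OUTPUT_USD_PER_MILLION = 15.00  # approximate figure quoted above
DEEP_THINK_MULTIPLIER = 4.0     # assumption: flat 4x on the whole request


def estimate_cost(
    input_tokens: int, output_tokens: int, deep_think: bool = False
) -> float:
    """Estimate the USD cost of one request, rounded to cents."""
    base = (
        input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    )
    return round(base * (DEEP_THINK_MULTIPLIER if deep_think else 1.0), 2)


# Example: a 200,000-token document with a 4,000-token answer.
print(estimate_cost(200_000, 4_000))
print(estimate_cost(200_000, 4_000, deep_think=True))
```

Under these assumptions the example request costs about $0.56 in standard mode and about $2.24 with Deep Think, so the difference only starts to matter at volume.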

For consumer access, the updated model is available on Google AI Plus at $7.99 per month and Google AI Ultra at $19.99 per month. Deep Think is included on both tiers with a daily prompt limit. By comparison, Gemini 2.5 Pro is roughly 50% cheaper than Claude Opus 4.7 and GPT-5.5 at comparable capability levels, a pricing gap that has made Google's stack increasingly competitive for high-volume use cases.

How to use it

In the Gemini app, the updated Pro model with Deep Think appears in the model selector. On AI Plus or Ultra, you activate Deep Think from the prompt bar when the Pro model is selected. The reasoning trace appears inline, so you can see the model's intermediate steps before the final response.

Via the API, Deep Think is activated with a thinking budget parameter in the generation configuration. The Gemini API changelog has been updated with the new parameters and usage details. The Gemini release notes cover the full list of capabilities added in this update.
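As a rough illustration, the sketch below builds a request body with a thinking budget but does not send it. The field names (generationConfig, thinkingConfig, thinkingBudget, includeThoughts) follow the publicly documented Gemini REST API; whether Deep Think in this release is controlled by exactly these fields, the model identifier, and the budget value are assumptions to check against the changelog.

```python
import json

# Endpoint pattern of the public Gemini REST API.
API_URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/{model}:generateContent"
)


def build_request(
    prompt: str,
    thinking_budget: int,
    model: str = "gemini-2.5-pro",  # assumed identifier for the updated model
    include_thoughts: bool = True,
) -> tuple[str, dict]:
    """Return (url, payload) for a request with a thinking budget set."""
    if thinking_budget < 0:
        raise ValueError("thinking_budget must be non-negative")
    payload = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {
                "thinkingBudget": thinking_budget,
                # Ask for a summary of the reasoning alongside the answer.
                "includeThoughts": include_thoughts,
            }
        },
    }
    return API_URL.format(model=model), payload


url, payload = build_request("Prove that sqrt(2) is irrational.", 8192)
print(url)
print(json.dumps(payload, indent=2))
```

Sending the payload requires an API key and an HTTP client, which are left out here so the example runs without credentials.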

For a broader comparison of how Gemini 2.5 Pro fits against ChatGPT and Claude across different task types, our full comparison guide covers the major differences by use case.

What to watch

The benchmark positions shift quickly. OpenAI and Anthropic have historically responded to Google releases within weeks, and Claude Fable 5 already brought significant gains to Anthropic's side. The more durable signal from this release is the 2M context window becoming a practical baseline: what was a differentiator six months ago is now table stakes.

What's harder to commoditize is reasoning depth. Deep Think, Claude's extended thinking, and OpenAI's o-series reasoning models represent different approaches to the same core problem: how to get a language model to slow down, check its own work, and produce better answers on problems that require more than pattern completion. That's where the frontier is right now.