Gemini 3: Google's Best Model Has an 88% Honesty Problem

Google's Gemini 3 tops benchmarks with 1501 Elo and 76.2% on SWE-bench, but fabricates answers 88% of the time when uncertain. What do these scores actually measure?

Analysis · Google · Gemini · Benchmarks · AI Safety

Google's Gemini 3 is the first model to cross 1500 Elo on LMArena, scores 76.2% on SWE-bench for coding tasks, and introduces something genuinely new: generating complete applications instead of just text. It's also confidently wrong 88% of the time when it doesn't know something. And there's evidence it was trained directly on benchmark data.

Capability and reliability are diverging faster than ever.

Three Models, One Million Tokens

Three variants make up the release. Gemini 3 Pro, released in November 2025, targets reasoning-heavy use cases. Gemini 3 Flash followed in December as the speed-optimized option, scoring 71 on Artificial Analysis's Intelligence Index (a 13-point jump from 2.5 Flash) while pushing 218 tokens per second. Both support a 1 million token context window.

The headline numbers? Legitimately impressive. 1501 Elo on LMArena. 91.9% on GPQA Diamond. 37.5% on Humanity's Last Exam without tools. For coding specifically, 76.2% on SWE-bench Verified puts it in direct competition with Claude's coding capabilities.

But the interesting addition is generative UI. Instead of returning text that describes an application, Gemini 3 can output working webpages, games, and tools directly. Google's framing this alongside their new Antigravity agentic development platform. The pitch: models that build complete artifacts rather than explain how you might build them. Text-to-application is a genuinely different interface than text-to-text, and Google's betting it represents where model outputs are heading.

Google's tiered pricing reflects the context-window economics we've seen become standard. Gemini 3 Pro runs $2/$12 per million tokens (input/output) under 200k context. Cross that threshold and you're paying $4/$18. Flash comes in at $0.50/$3, roughly a quarter of Pro's price for equivalent tasks. Batch pricing cuts Pro costs by 50%, and context caching is available at $0.20-$0.40 per million tokens plus storage fees.
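The tiered structure is easy to get wrong when estimating costs, so here's a small calculator using the rates quoted above (the function name and the batch behavior for Flash are our assumptions; verify all rates against Google's current pricing page before relying on them):

```python
def gemini3_cost(model, input_tokens, output_tokens, batch=False):
    """Estimate request cost in USD from the per-million-token rates
    quoted in this article. Pro switches tiers at 200k context."""
    if model == "pro":
        # Long-context tier applies once the prompt crosses 200k tokens.
        long_ctx = input_tokens > 200_000
        in_rate, out_rate = (4.0, 18.0) if long_ctx else (2.0, 12.0)
    elif model == "flash":
        in_rate, out_rate = 0.50, 3.0
    else:
        raise ValueError(f"unknown model: {model}")
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    # Batch pricing cuts costs by 50% (the article quotes this for Pro).
    return cost * 0.5 if batch else cost

# A 150k-token prompt with a 4k-token response stays under the 200k tier:
print(round(gemini3_cost("pro", 150_000, 4_000), 3))  # 0.348
```

Note the discontinuity: the same request with a 250k-token prompt is billed entirely at the $4/$18 tier, so costs roughly double at the threshold rather than scaling smoothly.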

The pricing is competitive with GPT-5.1 and slightly above Claude Opus 4.5 for equivalent capability tiers. Google's playing the same optimization game as everyone else.

The Honesty Gap

Artificial Analysis's AA-Omniscience benchmark tests something different from standard evals: whether models know when they don't know.

Gemini 3 Pro achieved the highest accuracy at 53%, which sounds good until you see the hallucination rate.

When Gemini 3 doesn't have the answer, it confidently fabricates one 88% of the time rather than acknowledging uncertainty.

The model scored 13 points on a scale from -100 to 100, which was actually the best result among tested models. 36 of 40 models scored negative. The benchmark reveals something uncomfortable: model size correlates with accuracy but not with reduced hallucination rates. Bigger models know more things, but they're not getting better at knowing what they don't know.
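The article's figures hang together under a simple scoring scheme: +1 per correct answer, -1 per fabrication, 0 per abstention, scaled to -100..100. This is our assumed reconstruction, not Artificial Analysis's published formula, but it lands close to the reported score:

```python
accuracy = 0.53              # fraction of questions answered correctly
halluc_given_unknown = 0.88  # fabrication rate when the answer isn't known

unknown = 1 - accuracy                        # 0.47 of questions
fabricated = unknown * halluc_given_unknown   # ~0.414 fabricated
abstained = unknown - fabricated              # only ~0.056 abstentions

# +1 per correct, -1 per fabrication, 0 per abstention, scaled by 100
score = (accuracy - fabricated) * 100
print(round(score, 1))  # 11.6, in the ballpark of the reported 13
```

The arithmetic makes the trade-off concrete: nearly everything the model gains from being right, it gives back by fabricating when it isn't. A model with 45% accuracy that abstained on everything else would score 45.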

Our read: this is a fundamental architectural limitation, not a Gemini-specific problem. Models are trained to produce confident outputs. Saying "I don't know" requires a kind of metacognition that current training approaches don't prioritize. But Gemini 3's 88% rate is notably high even among frontier models.

A LessWrong analysis raises a more fundamental concern: Gemini 3 can reproduce the BIG-bench canary string, a unique identifier specifically designed to detect when benchmark data has been included in training sets. The model also exhibits what researchers call "evaluation-paranoid" behavior. When asked whether it's currently being tested, Gemini 3 believes it's in an evaluation over 99.9% of the time. It treats real-world information as fabricated test scenarios.
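Canary strings work by planting a unique marker in every benchmark data file; if a model can later emit that marker verbatim, the files were almost certainly in its training set. A minimal contamination probe looks like this (the GUID below is a placeholder for illustration, not BIG-bench's actual canary):

```python
# Placeholder marker; real benchmarks embed a unique GUID like this in
# every data file, so its presence in model output is unambiguous.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"

def looks_contaminated(model_output: str) -> bool:
    """A model that reproduces the canary without being shown it in the
    prompt very likely trained on the benchmark's data files."""
    return CANARY in model_output

print(looks_contaminated("...training data echo... " + CANARY))  # True
print(looks_contaminated("an ordinary completion"))              # False
```

The design is deliberately one-directional: a positive hit is strong evidence of contamination, but a miss proves nothing, since a contaminated model may simply decline to reproduce the string.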

This doesn't mean Gemini 3's capabilities are fake. The model clearly does advanced reasoning and coding work. But it does raise questions about how much the benchmark numbers reflect genuine capability versus optimization for the measurement itself. Benchmark contamination is becoming an industry-wide problem, not a Google-specific scandal. As models get trained on increasingly large swaths of the internet, avoiding benchmark data entirely may be impossible. The question is whether labs are actively preventing contamination or passively benefiting from it.

For teams evaluating Gemini 3 against GPT-5.1 and Claude Opus 4.5, the choice isn't straightforward. Raw capability? Gemini 3's benchmark numbers are competitive or leading across most categories. The million-token context window is useful for document-heavy workloads. Generative UI is a differentiator if you're building development tools.

Reliability is a different story. The hallucination rate should give pause in any application where factual accuracy matters. If you're building something where confident-but-wrong outputs create real problems (legal, medical, financial), that 88% fabrication rate on questions the model can't actually answer is a significant risk factor. The LMArena Elo score tells you the model produces outputs humans prefer in side-by-side comparisons. It doesn't tell you whether those outputs are true.

Our read: Gemini 3 is Google finally shipping a model that competes on raw capability. That's meaningful; Google's AI releases have often felt a step behind the frontier. This one doesn't. But the release also crystallizes a growing problem in model evaluation. When the top-performing model on accuracy benchmarks also has the highest hallucination rate among frontier models, when there's evidence of benchmark contamination, when evaluation-awareness is baked into model behavior, we're measuring something that increasingly diverges from real-world utility.

The benchmark numbers will keep climbing. The question is whether they'll keep mattering.

Frequently Asked Questions