Gemini 2.5 Flash-Lite shows significantly higher performance than 2.0 Flash-Lite across coding, math, science, reasoning, and multimodal benchmarks. The table below reports its scores in non-thinking and thinking modes alongside Gemini 2.0 Flash:
| Benchmark | Gemini 2.0 Flash | Gemini 2.5 Flash-Lite (Non-Thinking) | Gemini 2.5 Flash-Lite (Thinking) |
|---|---|---|---|
| Reasoning & Knowledge | | | |
| Humanity's Last Exam (no tools) | 5.1%* | 5.1% | 6.9% |
| Science | | | |
| GPQA diamond | 65.2% | 64.6% | 66.7% |
| Mathematics | | | |
| AIME 2025 | 29.7% | 49.8% | 63.1% |
| Code Generation | | | |
| LiveCodeBench | 29.1% | 33.7% | 34.3% |
| Code Editing | | | |
| Aider Polyglot | 21.3% | 26.7% | 27.1% |
| Agentic Coding | | | |
| SWE-bench Verified (Single Attempt) | 21.4% | 31.6% | 27.6% |
| SWE-bench Verified (Multiple Attempts) | 34.2% | 42.6% | 44.9% |
| Factuality | | | |
| SimpleQA | 29.9% | 10.7% | 13.0% |
| FACTS grounding | 84.6% | 84.1% | 86.8% |
| Visual Reasoning | | | |
| MMMU | 69.3% | 72.9% | 72.9% |
| Image Understanding | | | |
| Vibe-Eval (Reka) | 55.4% | 51.3% | 57.5% |
| Long Context | | | |
| MRCR v2 (8-needle, 128k avg) | 19.0% | 16.6% | 30.6% |
| MRCR v2 (1M, pointwise) | 5.3% | 4.1% | 5.4% |
| Multilingual Performance | | | |
| Global MMLU (Lite) | 83.4% | 81.1% | 84.5% |
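
To make the comparison easier to read, the percentage-point gaps between columns can be computed directly from the table. The following is a minimal sketch: the scores are copied from the table above, while the names `SCORES`, `deltas`, and `regressions` are illustrative helpers, not part of any Gemini tooling.

```python
# Scores (in %) copied from the table above, in column order:
# (Gemini 2.0 Flash, 2.5 Flash-Lite non-thinking, 2.5 Flash-Lite thinking)
SCORES = {
    "Humanity's Last Exam (no tools)": (5.1, 5.1, 6.9),
    "GPQA diamond": (65.2, 64.6, 66.7),
    "AIME 2025": (29.7, 49.8, 63.1),
    "LiveCodeBench": (29.1, 33.7, 34.3),
    "Aider Polyglot": (21.3, 26.7, 27.1),
    "SWE-bench Verified (Single Attempt)": (21.4, 31.6, 27.6),
    "SWE-bench Verified (Multiple Attempts)": (34.2, 42.6, 44.9),
    "SimpleQA": (29.9, 10.7, 13.0),
    "FACTS grounding": (84.6, 84.1, 86.8),
    "MMMU": (69.3, 72.9, 72.9),
    "Vibe-Eval (Reka)": (55.4, 51.3, 57.5),
    "MRCR v2 (8-needle, 128k avg)": (19.0, 16.6, 30.6),
    "MRCR v2 (1M, pointwise)": (5.3, 4.1, 5.4),
    "Global MMLU (Lite)": (83.4, 81.1, 84.5),
}


def deltas(scores):
    """Percentage-point change of each 2.5 Flash-Lite mode vs. the 2.0 Flash column."""
    return {
        name: (round(non_thinking - base, 1), round(thinking - base, 1))
        for name, (base, non_thinking, thinking) in scores.items()
    }


def regressions(scores, column):
    """Benchmarks where a mode scores below 2.0 Flash (column 0: non-thinking, 1: thinking)."""
    return [name for name, d in deltas(scores).items() if d[column] < 0]


if __name__ == "__main__":
    print(f"{'Benchmark':42s} {'non-thinking':>12s} {'thinking':>9s}")
    for name, (d_non_thinking, d_thinking) in deltas(SCORES).items():
        print(f"{name:42s} {d_non_thinking:+12.1f} {d_thinking:+9.1f}")
```

On these figures, thinking mode falls below the Gemini 2.0 Flash column only on SimpleQA, while non-thinking mode is more mixed.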
Benchmarks Glossary
Reasoning & Knowledge
- Humanity’s Last Exam (HLE)
A multi-subject benchmark of expert-level questions, answered here without access to external tools. Measures reasoning and knowledge across many domains.
Science
- GPQA Diamond
Graduate-level multiple-choice questions in biology, physics, and chemistry, testing deep factual knowledge and reasoning.
Mathematics
- AIME 2025
Problems from the 2025 American Invitational Mathematics Examination, a competition for high-school students that tests advanced mathematical problem-solving.
Code
- LiveCodeBench
Evaluates code generation on coding tasks by executing the generated solutions (problems from the 1/1/2025–5/1/2025 period).
- Aider Polyglot
Tests multi-language code editing capabilities in real-world Git-based development workflows. Measured using Aider, an AI coding assistant.
Agentic Coding
- SWE-bench Verified
Assesses the ability to autonomously resolve real GitHub issues drawn from software engineering projects, reported for both single-attempt and multiple-attempt settings.
Factuality
- SimpleQA
Measures basic fact retrieval and answering capability on simple question-answering tasks.
- FACTS Grounding
Tests factual consistency of responses based on grounded evidence from source documents.
Visual & Multimodal
- MMMU (Massive Multi-discipline Multimodal Understanding)
Evaluates understanding of multimodal content across multiple tasks, such as interpreting charts, images, and diagrams.
- Vibe-Eval (Reka)
Measures the ability of models to interpret and reason about images, using Gemini models as evaluators.
Long Context
- MRCR v2 (8-needle)
Evaluates performance on long-context reasoning tasks with complex references. Reported at two context lengths to test scaling with long input: 128k tokens (average) and 1M tokens (pointwise).
Multilingual
- Global MMLU (Lite)
A multilingual version of the Massive Multitask Language Understanding benchmark, testing performance across languages and disciplines.