Google / Benchmarks - Langbase · Serverless AI Developer Platform

Gemini 2.5 Flash-Lite performs significantly better than 2.0 Flash-Lite across coding, math, science, reasoning, and multimodal benchmarks. The table below compares its non-thinking and thinking modes with Gemini 2.0 Flash:

Benchmark	Gemini 2.0 Flash	Gemini 2.5 Flash-Lite (Non-Thinking)	Gemini 2.5 Flash-Lite (Thinking)
Reasoning & Knowledge
Humanity's Last Exam (no tools)	5.1%*	5.1%	6.9%
Science
GPQA diamond	65.2%	64.6%	66.7%
Mathematics
AIME 2025	29.7%	49.8%	63.1%
Code Generation
LiveCodeBench	29.1%	33.7%	34.3%
Code Editing
Aider Polyglot	21.3%	26.7%	27.1%
Agentic Coding
SWE-bench Verified (Single Attempt)	21.4%	31.6%	27.6%
SWE-bench Verified (Multiple Attempts)	34.2%	42.6%	44.9%
Factuality
SimpleQA	29.9%	10.7%	13.0%
FACTS grounding	84.6%	84.1%	86.8%
Visual Reasoning
MMMU	69.3%	72.9%	72.9%
Image Understanding
Vibe-Eval (Reka)	55.4%	51.3%	57.5%
Long Context
MRCR v2 (8-needle, 128k avg)	19.0%	16.6%	30.6%
MRCR v2 (1M, pointwise)	5.3%	4.1%	5.4%
Multilingual Performance
Global MMLU (Lite)	83.4%	81.1%	84.5%
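
One way to read the table is to compute, per benchmark, how much the thinking mode changes the score and where 2.5 Flash-Lite still trails Gemini 2.0 Flash. The snippet below is a minimal sketch of that arithmetic. The scores are copied from the table above, with the asterisk on the first value dropped. The names `ROWS`, `thinking_gain`, and `regressions_vs_flash` are hypothetical helpers introduced only for this illustration.

```python
# Scores copied from the table above, in percent.
# Columns: benchmark, Gemini 2.0 Flash, 2.5 Flash-Lite (Non-Thinking), 2.5 Flash-Lite (Thinking).
# The "5.1%*" entry is stored as 5.1; the asterisk's footnote is not part of this page.
ROWS = [
    ("Humanity's Last Exam (no tools)", 5.1, 5.1, 6.9),
    ("GPQA diamond", 65.2, 64.6, 66.7),
    ("AIME 2025", 29.7, 49.8, 63.1),
    ("LiveCodeBench", 29.1, 33.7, 34.3),
    ("Aider Polyglot", 21.3, 26.7, 27.1),
    ("SWE-bench Verified (Single Attempt)", 21.4, 31.6, 27.6),
    ("SWE-bench Verified (Multiple Attempts)", 34.2, 42.6, 44.9),
    ("SimpleQA", 29.9, 10.7, 13.0),
    ("FACTS grounding", 84.6, 84.1, 86.8),
    ("MMMU", 69.3, 72.9, 72.9),
    ("Vibe-Eval (Reka)", 55.4, 51.3, 57.5),
    ("MRCR v2 (8-needle, 128k avg)", 19.0, 16.6, 30.6),
    ("MRCR v2 (1M, pointwise)", 5.3, 4.1, 5.4),
    ("Global MMLU (Lite)", 83.4, 81.1, 84.5),
]


def thinking_gain(rows):
    """Map each benchmark to (thinking - non-thinking), in percentage points."""
    return {
        name: round(thinking - non_thinking, 1)
        for name, _flash, non_thinking, thinking in rows
    }


def regressions_vs_flash(rows):
    """List benchmarks where 2.5 Flash-Lite (Thinking) scores below Gemini 2.0 Flash."""
    return [name for name, flash, _non_thinking, thinking in rows if thinking < flash]


if __name__ == "__main__":
    # Print benchmarks from largest to smallest thinking-mode gain.
    gains = thinking_gain(ROWS)
    for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
        print(f"{gain:+5.1f} pp  {name}")
    print("Below Gemini 2.0 Flash even with thinking:", regressions_vs_flash(ROWS))
```

Differences are reported in percentage points rather than relative percent, so a gain on a low-scoring benchmark is not inflated.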

Benchmarks Glossary

Reasoning & Knowledge

Humanity’s Last Exam (HLE)
A multi-subject benchmark of expert-level questions, run here without access to external tools. It measures reasoning and knowledge across domains.

Science

GPQA Diamond
Graduate-level multiple-choice questions in biology, physics, and chemistry, testing deep factual knowledge and reasoning.

Mathematics

AIME 2025
Questions from the 2025 American Invitational Mathematics Examination, a competition for high school students that tests advanced mathematical problem-solving.

Code

LiveCodeBench
Evaluates code generation on coding tasks by executing the generated code (problems from the 1/1/2025–5/1/2025 period).
Aider Polyglot
Tests multi-language code editing capabilities in real-world Git-based development workflows. Measured using Aider, an AI coding assistant.

Agentic Coding

SWE-bench Verified
Assesses the ability to autonomously resolve real GitHub issues, reported in both single-attempt and multiple-attempt formats.

Factuality

SimpleQA
Measures basic fact retrieval and answering capability on simple question-answering tasks.
FACTS Grounding
Tests factual consistency of responses based on grounded evidence from source documents.

Visual & Multimodal

MMMU (Massive Multi-discipline Multimodal Understanding)
Evaluates understanding of multimodal content across multiple tasks, such as interpreting charts, images, and diagrams.
Vibe-Eval (Reka)
Measures the ability of models to interpret and reason about images, using Gemini models as evaluators.

Long Context

MRCR v2 (8-needle)
Evaluates performance on long-context reasoning tasks with complex references. Uses 128k and 1M-token versions to test scaling with long input.

Multilingual

Global MMLU (Lite)
A multilingual version of the Massive Multitask Language Understanding benchmark, testing performance across languages and disciplines.
