    Google/
    Gemini-2.5-flash-lite

    Model Card

    Gemini 2.5 Flash-Lite performs significantly better than 2.0 Flash-Lite across coding, math, science, reasoning, and multimodal benchmarks. The table below compares it, with and without thinking enabled, against Gemini 2.0 Flash:

    Benchmark                                 Gemini 2.0 Flash  Gemini 2.5 Flash-Lite (Non-Thinking)  Gemini 2.5 Flash-Lite (Thinking)
    Reasoning & Knowledge
      Humanity's Last Exam (no tools)         5.1%*             5.1%                                  6.9%
    Science
      GPQA diamond                            65.2%             64.6%                                 66.7%
    Mathematics
      AIME 2025                               29.7%             49.8%                                 63.1%
    Code Generation
      LiveCodeBench                           29.1%             33.7%                                 34.3%
    Code Editing
      Aider Polyglot                          21.3%             26.7%                                 27.1%
    Agentic Coding
      SWE-bench Verified (Single Attempt)     21.4%             31.6%                                 27.6%
      SWE-bench Verified (Multiple Attempts)  34.2%             42.6%                                 44.9%
    Factuality
      SimpleQA                                29.9%             10.7%                                 13.0%
      FACTS grounding                         84.6%             84.1%                                 86.8%
    Visual Reasoning
      MMMU                                    69.3%             72.9%                                 72.9%
    Image Understanding
      Vibe-Eval (Reka)                        55.4%             51.3%                                 57.5%
    Long Context
      MRCR v2 (8-needle, 128k avg)            19.0%             16.6%                                 30.6%
      MRCR v2 (1M, pointwise)                 5.3%              4.1%                                  5.4%
    Multilingual Performance
      Global MMLU (Lite)                      83.4%             81.1%                                 84.5%
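
    To see where the scores in the table above rise or fall relative to Gemini 2.0 Flash, the figures can be loaded into a small script. The following is a minimal sketch, not part of the original model card: the names `SCORES`, `deltas`, and `regressions` are hypothetical, and the numbers are copied by hand from the table (the asterisk on the first Humanity's Last Exam value is dropped).

```python
# Scores copied from the benchmark table, in percent.
# Tuple order: (Gemini 2.0 Flash, 2.5 Flash-Lite non-thinking, 2.5 Flash-Lite thinking)
SCORES = {
    "Humanity's Last Exam (no tools)": (5.1, 5.1, 6.9),
    "GPQA diamond": (65.2, 64.6, 66.7),
    "AIME 2025": (29.7, 49.8, 63.1),
    "LiveCodeBench": (29.1, 33.7, 34.3),
    "Aider Polyglot": (21.3, 26.7, 27.1),
    "SWE-bench Verified (Single Attempt)": (21.4, 31.6, 27.6),
    "SWE-bench Verified (Multiple Attempts)": (34.2, 42.6, 44.9),
    "SimpleQA": (29.9, 10.7, 13.0),
    "FACTS grounding": (84.6, 84.1, 86.8),
    "MMMU": (69.3, 72.9, 72.9),
    "Vibe-Eval (Reka)": (55.4, 51.3, 57.5),
    "MRCR v2 (8-needle, 128k avg)": (19.0, 16.6, 30.6),
    "MRCR v2 (1M, pointwise)": (5.3, 4.1, 5.4),
    "Global MMLU (Lite)": (83.4, 81.1, 84.5),
}


def deltas(scores):
    """Map each benchmark to (non-thinking - baseline, thinking - baseline),
    in percentage points, rounded to one decimal place."""
    return {
        name: (round(non_thinking - base, 1), round(thinking - base, 1))
        for name, (base, non_thinking, thinking) in scores.items()
    }


def regressions(scores):
    """List benchmarks where the thinking configuration scores below the baseline."""
    return [
        name
        for name, (_, thinking_delta) in deltas(scores).items()
        if thinking_delta < 0
    ]


if __name__ == "__main__":
    # Print one line per benchmark with both deltas, signed.
    for name, (d_non, d_think) in deltas(SCORES).items():
        print(f"{name:40s} {d_non:+6.1f} {d_think:+6.1f}")
```

    Run this way, the table shows that the gains are not uniform: with thinking enabled, SimpleQA is the one benchmark that falls below the Gemini 2.0 Flash score, while the non-thinking configuration trails the baseline on several others.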

    Benchmarks Glossary

    Reasoning & Knowledge

    • Humanity’s Last Exam (HLE)
      A multi-subject benchmark of expert-level questions spanning many academic domains. Scores here are reported without access to external tools.

    Science

    • GPQA Diamond
      Graduate-level multiple-choice questions in biology, physics, and chemistry, testing deep factual knowledge and reasoning.

    Mathematics

    • AIME 2025
      American Invitational Mathematics Examination questions, designed to test high school-level mathematical problem-solving.

    Code

    • LiveCodeBench
      Evaluates code generation on programming problems collected over time, with solutions checked by execution. Results here use problems from the 1/1/2025–5/1/2025 period.

    • Aider Polyglot
      Tests multi-language code editing capabilities in real-world Git-based development workflows. Measured using Aider, an AI coding assistant.

    Agentic Coding

    • SWE-bench Verified
      Assesses the ability to autonomously resolve real GitHub issues drawn from software engineering projects, reported in both single-attempt and multiple-attempt formats.

    Factuality

    • SimpleQA
      Measures basic fact retrieval and answering capability on simple question-answering tasks.

    • FACTS Grounding
      Tests factual consistency of responses based on grounded evidence from source documents.

    Visual & Multimodal

    • MMMU (Massive Multi-discipline Multimodal Understanding)
      Evaluates understanding of multimodal content across multiple disciplines, such as interpreting charts, images, and diagrams.

    • Vibe-Eval (Reka)
      Measures the ability of models to interpret and reason about images, using Gemini models as evaluators.

    Long Context

    • MRCR v2 (8-needle)
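      Evaluates long-context retrieval and reasoning, where the model must resolve references to specific items ("needles") placed among similar content in a long input. Reported at 128k-token and 1M-token context lengths to test scaling with input size.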

    Multilingual

    • Global MMLU (Lite)
      A multilingual version of the Massive Multitask Language Understanding benchmark, testing performance across languages and disciplines.

    Metadata

    Context window: up to 1M tokens
    Input price: $0.10 per million tokens
    Output price: $0.40 per million tokens
    Knowledge cutoff: Jan 2025
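
    The per-million figures above can be turned into a rough per-request cost estimate. The following is a minimal sketch under the assumption that $0.1 per million applies to input tokens and $0.4 per million applies to output tokens; the constant and function names are hypothetical, and actual billing may differ (for example, by input type or pricing tier).

```python
# Assumed prices in USD per one million tokens, taken from the figures above.
INPUT_USD_PER_MILLION = 0.1   # assumption: applies to input (prompt) tokens
OUTPUT_USD_PER_MILLION = 0.4  # assumption: applies to output (generated) tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 200k-token prompt with a 5k-token response.
    print(f"${estimate_cost_usd(200_000, 5_000):.4f}")
```

    Under these assumptions, output tokens cost four times as much as input tokens, so long generated responses dominate the bill faster than long prompts do.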