    Moonshot AI/Kimi-K2-Instruct

    Model Card

    The tables below report the performance of Kimi-K2-Instruct on coding, tool use, math and STEM, and general tasks. On many of them it matches or outperforms the latest open-source and proprietary models.

    Coding Tasks

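    | Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (Non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | LiveCodeBench v6 | Pass@1 (Aug 24–May 25) | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
    | OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
    | MultiPL-E | - | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
    | SWE-bench Verified | Single Patch w/o Test | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
    | SWE-bench Verified | Single Attempt (Agentic Coding) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | - |
    | SWE-bench Verified | Multiple Attempts (Agentic) | 71.6 | - | - | 80.2* | 79.4* | - | - |
    | SWE-bench Multilingual | Single Attempt (Agentic) | 47.3 | 25.8 | 20.9 | 51.0 | - | 31.5 | - |
    | TerminalBench | Inhouse Framework | 30.0 | - | - | 35.5 | 43.2 | 8.3 | - |
    | TerminalBench | Terminus (Acc) | 25.0 | 16.3 | 6.6 | - | - | 30.3 | 16.8 |
    | Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |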

    Tool Use Tasks

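    | Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
    | Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
    | Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
    | AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |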

    Math & STEM Tasks

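    | Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |
    | AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1 | 33.9* | 37.0 | 46.6 |
    | MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 |
    | HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
    | CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
    | PolyMath-en | Avg@64 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
    | ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 79.7 | 59.3 | 58.5 | 57.9 |
    | AutoLogi | Acc | 89.5 | 88.9 | 83.3* | 89.8 | 86.1 | 88.2 | 84.1 |
    | GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |
    | SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |
    | Humanity’s Last Exam | Acc (Text Only) | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |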

    General Tasks

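    | Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.0 | 90.4 | 90.1 |
    | MMLU-Redux | EM | 92.7 | 90.5 | 89.2* | 93.6 | 94.2 | 92.4 | 90.6 |
    | MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |
    | IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |
    | Multi-Challenge | EM | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
    | SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
    | Livebench (2024) | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |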

    Benchmark Glossary

    | Benchmark / Metric | Description |
    | --- | --- |
    | Pass@1 | Share of problems solved correctly on the first attempt (e.g. code execution or QA tasks). |
    | Acc | Accuracy: percentage of correct responses. |
    | Avg@k | Average score over k independently sampled responses per problem (e.g. Avg@64 averages 64 runs). |
    | EM | Exact Match: how often the prediction matches the ground truth exactly. |
    | Prompt Strict | Strict prompt-level scoring: a response counts only if it satisfies every instruction in the prompt. |
    | Single Attempt | Evaluation based on a single model response (no retries or voting). |
    | Multiple Attempts | Allows retries or majority voting across several generations. |
    | Single Patch w/o Test | Code task where the model must fix a bug with a single patch and without access to test feedback. |
    | Inhouse Framework | Scores obtained with a custom in-house evaluation framework. |
    | Terminus (Acc) | Accuracy on terminal-based tasks, evaluating CLI reasoning and actions. |
    | Agentic Coding | Tasks requiring planning and tool use over multiple steps. |
    | Tool Use Tasks | Evaluation of the model’s ability to use APIs, tools, or simulated environments. |
    | Humanity’s Last Exam | High-difficulty QA benchmark meant to test general reasoning ability. |
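The three most common metrics in the glossary (Pass@1, Avg@k, EM) can be illustrated with a minimal Python sketch. It only shows how such scores are typically aggregated; it is not the evaluation harness behind the numbers above. The function names, the toy data, and the whitespace trimming used for exact match are all assumptions made for illustration.

```python
def pass_at_1(first_attempt_passed):
    """Percentage of problems whose first attempt was correct."""
    return 100.0 * sum(1 for ok in first_attempt_passed if ok) / len(first_attempt_passed)


def avg_at_k(scores_per_problem):
    """Avg@k: average each problem's k sampled scores, then average over problems."""
    per_problem = [sum(scores) / len(scores) for scores in scores_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)


def exact_match(predictions, references):
    """EM: percentage of predictions identical to the reference.

    Trimming surrounding whitespace is an assumed normalization step.
    """
    hits = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return 100.0 * hits / len(references)


# Toy data (hypothetical): 4 problems, first attempt only.
p1 = pass_at_1([True, False, True, True])

# Toy data (hypothetical): 2 problems, k = 4 sampled answers each (1 = correct).
a4 = avg_at_k([[1, 0, 1, 1], [0, 0, 1, 0]])

# Toy data (hypothetical): 3 question/answer pairs.
em = exact_match(["Paris", " 42 ", "blue"], ["Paris", "42", "red"])

print(f"Pass@1 = {p1:.1f}, Avg@4 = {a4:.1f}, EM = {em:.1f}")
```

Under these definitions, Avg@k rewards consistency across samples: a model that solves a problem in only one of four samples earns 25% credit for it, not full credit.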

    Metadata

    128K tokens
    $1 per million
    $3 per million