The tables below report the performance of Kimi-K2-Instruct against recent open-source and proprietary models on coding, tool-use, math and STEM, and general tasks. It matches or outperforms those models on many of the benchmarks listed. For more details, read here:
Coding Tasks
| Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (Non-thinking) | Claude Sonnet 4 (w/o extended) | Claude Opus 4 (w/o extended) | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|---|---|---|---|
| LiveCodeBench v6 | Pass@1 (Aug 24–May 25) | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
| MultiPL-E | — | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified | Single Patch w/o Test | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
| SWE-bench Verified | Single Attempt (Agentic Coding) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | — |
| SWE-bench Verified | Multiple Attempts (Agentic) | 71.6 | — | — | 80.2* | 79.4* | — | — |
| SWE-bench Multilingual | Single Attempt (Agentic) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | — |
| TerminalBench | Inhouse Framework | 30.0 | — | — | 35.5 | 43.2 | 8.3 | — |
| TerminalBench | Terminus (Acc) | 25.0 | 16.3 | 6.6 | — | — | 30.3 | 16.8 |
| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |
Tool Use Tasks
| Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|---|---|---|---|
| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
| AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |
Math & STEM Tasks
| Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|---|---|---|---|
| AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1 | 33.9* | 37.0 | 46.6 |
| MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 |
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
| PolyMath-en | Avg@64 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
| ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 79.7 | 59.3 | 58.5 | 57.9 |
| AutoLogi | Acc | 89.5 | 88.9 | 83.3* | 89.8 | 86.1 | 88.2 | 84.1 |
| GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |
| SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |
| Humanity’s Last Exam | Acc (Text Only) | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |
General Tasks
| Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B | Claude Sonnet 4 | Claude Opus 4 | GPT-4.1 | Gemini 2.5 Flash |
|---|---|---|---|---|---|---|---|---|
| MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.0 | 90.4 | 90.1 |
| MMLU-Redux | EM | 92.7 | 90.5 | 89.2* | 93.6 | 94.2 | 92.4 | 90.6 |
| MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |
| Multi-Challenge | EM | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
| Livebench (2024) | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |
Benchmark Glossary
| Benchmark / Metric | Description |
|---|---|
| Pass@1 | Measures correctness on first attempt (e.g. code execution or QA tasks). |
| Acc | Accuracy - percentage of correct responses. |
| Avg@k | Average score over k independently sampled runs per problem (e.g. Avg@64 averages 64 samples). |
| EM | Exact Match - how often the prediction matches the ground truth exactly. |
| Prompt Strict | IFEval prompt-level strict accuracy: a prompt counts as correct only if every instruction in it is followed under strict checking. |
| Single Attempt | Evaluation based on a single model response (no retries or voting). |
| Multiple Attempts | Allows retries or majority voting across generations. |
| Single Patch w/o Test | Code task where model must fix a bug without access to test feedback. |
| Inhouse Framework | TerminalBench results obtained with an internal agent framework rather than a public one. |
| Terminus (Acc) | TerminalBench accuracy obtained with the Terminus agent, which acts through command-line interaction. |
| Agentic Coding | Tasks requiring planning and tool use over multiple steps. |
| Tool Use Tasks | Evaluation of model’s ability to use APIs, tools, or simulated environments. |
| Humanity’s Last Exam | High-difficulty QA benchmark meant to test general reasoning ability. |
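
The metric definitions in the glossary can be illustrated with a minimal Python sketch. It assumes that Avg@k is the mean correctness over k independently sampled runs per problem, that Pass@1 scores only the first attempt, and that EM compares strings after trimming surrounding whitespace. The function names and the toy data are hypothetical, and real benchmarks apply their own answer normalization and grading rules, so this is not the evaluation code behind the tables above.

```python
from statistics import mean


def avg_at_k(runs_per_problem: list[list[bool]]) -> float:
    """Avg@k as a percentage: average correctness over the k runs of each
    problem, then average those per-problem scores."""
    per_problem = [mean(1.0 if ok else 0.0 for ok in runs) for runs in runs_per_problem]
    return 100.0 * mean(per_problem)


def pass_at_1(first_attempt_correct: list[bool]) -> float:
    """Pass@1 as a percentage: share of problems solved on the first attempt."""
    return 100.0 * mean(1.0 if ok else 0.0 for ok in first_attempt_correct)


def exact_match(predictions: list[str], references: list[str]) -> float:
    """EM as a percentage: share of predictions identical to the reference.
    Only surrounding whitespace is stripped (an assumption of this sketch)."""
    hits = [p.strip() == r.strip() for p, r in zip(predictions, references)]
    return 100.0 * mean(1.0 if hit else 0.0 for hit in hits)


# Toy data: two problems with k = 4 sampled runs each.
toy_runs = [[True, True, False, True], [False, False, True, False]]
print(avg_at_k(toy_runs))                              # 50.0
print(pass_at_1([True, False, True, True]))            # 75.0
print(exact_match(["42", " Paris "], ["42", "paris"]))  # 50.0 (case differs)
```

Averaging per problem first and then across problems gives the same result as pooling all runs only when every problem has the same k, which is the case in this sketch.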