Moonshot AI / Benchmarks - Langbase · Serverless AI Developer Platform

Moonshot AI/Kimi-K2-Instruct

Model Card

The tables below detail the performance of Kimi-K2-Instruct against recent open-source and proprietary models. It matches or outperforms them on many tasks across coding, tool use, math and STEM, and general benchmarks.

Coding Tasks

Benchmark	Metric	Kimi-K2-Instruct	DeepSeek-V3-0324	Qwen3-235B-A22B (Non-thinking)	Claude Sonnet 4 (w/o extended thinking)	Claude Opus 4 (w/o extended thinking)	GPT-4.1	Gemini 2.5 Flash
LiveCodeBench v6	Pass@1 (Aug 24–May 25)	53.7	46.9	37.0	48.5	47.4	44.7	44.7
OJBench	Pass@1	27.1	24.0	11.3	15.3	19.6	19.5	19.5
MultiPL-E	—	85.7	83.1	78.2	88.6	89.6	86.7	85.6
SWE-bench Verified	Single Patch w/o Test	51.8	36.6	39.4	50.2	53.0	40.8	32.6
SWE-bench Verified	Single Attempt (Agentic Coding)	65.8	38.8	34.4	72.7*	72.5*	54.6	—
SWE-bench Verified	Multiple Attempts (Agentic)	71.6	—	—	80.2*	79.4*	—	—
SWE-bench Multilingual	Single Attempt (Agentic)	47.3	25.8	20.9	51.0	—	31.5	—
TerminalBench	Inhouse Framework	30.0	—	—	35.5	43.2	8.3	—
TerminalBench	Terminus (Acc)	25.0	16.3	6.6	—	—	30.3	16.8
Aider-Polyglot	Acc	60.0	55.1	61.8	56.4	70.7	52.4	44.0

Tool Use Tasks

Benchmark	Metric	Kimi-K2-Instruct	DeepSeek-V3-0324	Qwen3-235B-A22B	Claude Sonnet 4	Claude Opus 4	GPT-4.1	Gemini 2.5 Flash
Tau2 retail	Avg@4	70.6	69.1	57.0	75.0	81.8	74.8	64.3
Tau2 airline	Avg@4	56.5	39.0	26.5	55.5	60.0	54.5	42.5
Tau2 telecom	Avg@4	65.8	32.5	22.1	45.2	57.0	38.6	16.9
AceBench	Acc	76.5	72.7	70.5	76.2	75.6	80.1	74.5

Math & STEM Tasks

Benchmark	Metric	Kimi-K2-Instruct	DeepSeek-V3-0324	Qwen3-235B-A22B	Claude Sonnet 4	Claude Opus 4	GPT-4.1	Gemini 2.5 Flash
AIME 2024	Avg@64	69.6	59.4*	40.1*	43.4	48.2	46.5	61.3
AIME 2025	Avg@64	49.5	46.7	24.7*	33.1	33.9*	37.0	46.6
MATH-500	Acc	97.4	94.0*	91.2*	94.0	94.4	92.4	95.4
HMMT 2025	Avg@32	38.8	27.5	11.9	15.9	15.9	19.4	34.7
CNMO 2024	Avg@16	74.3	74.7	48.6	60.4	57.6	56.6	75.0
PolyMath-en	Avg@64	65.1	59.5	51.9	52.8	49.8	54.0	49.9
ZebraLogic	Acc	89.0	84.0	37.7*	79.7	59.3	58.5	57.9
AutoLogi	Acc	89.5	88.9	83.3*	89.8	86.1	88.2	84.1
GPQA-Diamond	Avg@8	75.1	68.4*	62.9*	70.0*	74.9*	66.3	68.2
SuperGPQA	Acc	57.2	53.7	50.2	55.7	56.5	50.8	49.6
Humanity’s Last Exam	Acc (Text Only)	4.7	5.2	5.7	5.8	7.1	3.7	5.6

General Tasks

Benchmark	Metric	Kimi-K2-Instruct	DeepSeek-V3-0324	Qwen3-235B-A22B	Claude Sonnet 4	Claude Opus 4	GPT-4.1	Gemini 2.5 Flash
MMLU	EM	89.5	89.4	87.0	91.5	92.0	90.4	90.1
MMLU-Redux	EM	92.7	90.5	89.2*	93.6	94.2	92.4	90.6
MMLU-Pro	EM	81.1	81.2*	77.3	83.7	86.6	81.8	79.4
IFEval	Prompt Strict	89.8	81.1	83.2*	87.6	87.4	88.0	84.3
Multi-Challenge	EM	54.1	31.4	34.0	46.8	49.0	36.4	39.5
SimpleQA	Correct	31.0	27.7	13.2	15.9	22.8	42.3	23.3
Livebench (2024)	Pass@1	76.4	72.4	67.6	74.8	74.6	69.8	67.8

Benchmark Glossary

Benchmark / Metric	Description
Pass@1	Measures correctness on first attempt (e.g. code execution or QA tasks).
Acc	Accuracy - percentage of correct responses.
Avg@k	Average score over k independently sampled runs per problem (e.g. Avg@64 averages 64 runs).
EM	Exact Match - how often the prediction matches the ground truth exactly.
Prompt Strict	Strict evaluation where format and correctness both matter.
Single Attempt	Evaluation based on a single model response (no retries or voting).
Multiple Attempts	Allows retries or majority voting across generations.
Single Patch w/o Test	Code task where model must fix a bug without access to test feedback.
Inhouse Framework	Results obtained with a custom internal evaluation framework rather than a public one.
Terminus (Acc)	Accuracy on terminal-based tasks that evaluate CLI reasoning or actions.
Agentic Coding	Tasks requiring planning and tool use over multiple steps.
Tool Use Tasks	Evaluation of model’s ability to use APIs, tools, or simulated environments.
Humanity’s Last Exam	High-difficulty QA benchmark meant to test general reasoning ability.
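
The aggregate metrics in the glossary can be illustrated with a minimal sketch. It assumes that Avg@k is the mean score over k independently sampled runs per problem, then averaged over problems, and that Pass@1 is the share of problems solved on the first attempt. The function names and the sample data are hypothetical and are not taken from any evaluation harness used for the tables above.

```python
from statistics import mean


def avg_at_k(scores_per_problem: list[list[float]]) -> float:
    """Avg@k as a percentage: average each problem's k run scores,
    then average those per-problem means (assumed definition)."""
    per_problem = [mean(runs) for runs in scores_per_problem]
    return 100.0 * mean(per_problem)


def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """Pass@1 as a percentage: fraction of problems solved on the first attempt."""
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical data: 3 problems, k = 4 runs each (1.0 = correct, 0.0 = incorrect).
runs = [
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

print(f"Avg@4:  {avg_at_k(runs):.1f}")
print(f"Pass@1: {pass_at_1([True, False, True, True]):.1f}")
```

Averaging over several runs reduces the variance that a single sampled generation would introduce, which is presumably why the small competition-style math sets above report Avg@k rather than a single-run accuracy.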

Metadata

Context

128K tokens

Prompt Cost

$1 per million tokens

Completion Cost

$3 per million tokens
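
As a rough illustration of how the listed prices translate into a per-request cost, here is a minimal sketch. It assumes the prices are quoted per million tokens and that prompt and completion tokens are billed separately at the rates above. The function name and the token counts are hypothetical.

```python
# Prices from the metadata above, assumed to be USD per one million tokens.
PROMPT_USD_PER_MILLION = 1.0
COMPLETION_USD_PER_MILLION = 3.0


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts (hypothetical helper)."""
    total_micro_usd = (
        prompt_tokens * PROMPT_USD_PER_MILLION
        + completion_tokens * COMPLETION_USD_PER_MILLION
    )
    return total_micro_usd / 1_000_000


# Hypothetical request: 100,000 prompt tokens and 20,000 completion tokens.
cost = estimate_cost_usd(100_000, 20_000)
print(f"Estimated cost: ${cost:.2f}")
```

Because completion tokens cost three times as much as prompt tokens at these rates, long outputs dominate the bill sooner than long inputs do.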
