Anthropic / Benchmarks - Langbase · Serverless AI Developer Platform

Claude 3 Opus has been evaluated on a range of benchmarks covering reasoning, knowledge, coding, math, and vision, and compared with other models in its class, including Claude 3 Sonnet, Claude 3 Haiku, and GPT-4o.

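Here’s how Claude 3 Opus performs against other models on text and vision benchmarks. Each score is shown with the prompting setup used (for example, 0-shot CoT or 5-shot), and a dash (-) means no score is listed for that model.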

Category	Claude 3.5 Sonnet	Claude 3 Opus	GPT-4o	Gemini 1.5 Pro	Llama-400b (early snapshot)
Graduate level reasoning (GPQA)	59.4%* (0-shot CoT)	50.4% (0-shot CoT)	53.6% (0-shot CoT)	-	-
Undergraduate level knowledge (MMLU)	88.7% (5-shot)	86.8% (5-shot)	88.7% (0-shot CoT)	85.9% (5-shot)	86.1% (5-shot)
Code (HumanEval)	92.0% (0-shot)	84.9% (0-shot)	90.2% (0-shot)	84.1% (0-shot)	84.1% (0-shot)
Multilingual math (MGSM)	91.6% (0-shot CoT)	90.7% (0-shot CoT)	90.5% (0-shot CoT)	87.5% (8-shot)	-
Reasoning over text (DROP)	87.1% (3-shot)	83.1% (3-shot)	83.4% (3-shot)	74.9% (Variable)	83.5% (3-shot Pre-trained)
Mixed evaluations (BIG-Bench-Hard)	93.1% (3-shot CoT)	86.8% (3-shot CoT)	-	89.2% (3-shot CoT)	85.3% (3-shot CoT Pre-trained)
Math problem-solving (MATH)	71.1% (0-shot CoT)	60.1% (0-shot CoT)	76.6% (0-shot CoT)	67.7% (4-shot)	57.8% (4-shot CoT)
Grade school math (GSM8K)	96.4% (0-shot CoT)	95.0% (0-shot CoT)	-	90.8% (11-shot)	94.1% (8-shot CoT)

Category	Claude 3.5 Sonnet	Claude 3 Opus	GPT-4o	Gemini 1.5 Pro
Visual math reasoning (MathVista)	67.7% (0-shot CoT)	50.5% (0-shot CoT)	63.8% (0-shot CoT)	63.9% (0-shot CoT)
Science diagrams (AI2D)	94.7% (0-shot)	88.1% (0-shot)	94.2% (0-shot)	94.4% (0-shot)
Visual question answering (MMMU)	68.3% (0-shot CoT)	59.4% (0-shot CoT)	69.1% (0-shot CoT)	62.2% (0-shot CoT)
Chart Q&A (Relaxed accuracy)	90.8% (0-shot CoT)	80.8% (0-shot CoT)	85.7% (0-shot CoT)	87.2% (0-shot CoT)
Document visual Q&A (ANLS score)	95.2% (0-shot)	89.3% (0-shot)	92.8% (0-shot)	93.1% (0-shot)

To help you choose the right model for your needs, here’s a compiled table comparing the key features and capabilities of each model in the Claude family:

Claude Model	Claude 3.5 Sonnet	Claude 3 Opus	Claude 3 Sonnet	Claude 3 Haiku
Description	Most intelligent model	Powerful model for highly complex tasks	Balance of intelligence and speed	Fastest and most compact model for near-instant responsiveness
Strengths	Highest level of intelligence and capability	Top-level performance, intelligence, fluency, and understanding	Strong utility, balanced for scaled deployments	Quick and accurate targeted performance
Multilingual	Yes	Yes	Yes	Yes
Vision	Yes	Yes	Yes	Yes
API model name	claude-3-5-sonnet-20240620	claude-3-opus-20240229	claude-3-sonnet-20240229	claude-3-haiku-20240307
API format	Messages API	Messages API	Messages API	Messages API
Comparative latency	Fast	Moderately fast	Fast	Fastest
Context window	200K	200K	200K	200K
Max output	8192 tokens	4096 tokens	4096 tokens	4096 tokens
Cost (Input / Output per MTok)	$3.00 / $15.00	$15.00 / $75.00	$3.00 / $15.00	$0.25 / $1.25
Training data cut-off	Apr 2024	Aug 2023	Aug 2023	Aug 2023
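The per-MTok prices in the table can be turned into a rough per-request cost. The sketch below is a minimal illustration, assuming the prices and max output limits listed above are current. The names `ModelPricing`, `PRICING`, and `estimate_cost` are hypothetical helpers written for this example and are not part of any Anthropic or Langbase SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Prices in USD per million tokens (MTok), copied from the table above."""
    input_per_mtok: float
    output_per_mtok: float
    max_output_tokens: int


# Hypothetical lookup table keyed by the API model names listed above.
PRICING = {
    "claude-3-5-sonnet-20240620": ModelPricing(3.00, 15.00, 8192),
    "claude-3-opus-20240229": ModelPricing(15.00, 75.00, 4096),
    "claude-3-sonnet-20240229": ModelPricing(3.00, 15.00, 4096),
    "claude-3-haiku-20240307": ModelPricing(0.25, 1.25, 4096),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request.

    cost = (input_tokens * input_price + output_tokens * output_price) / 1e6
    """
    pricing = PRICING[model]
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > pricing.max_output_tokens:
        raise ValueError(
            f"{model} produces at most {pricing.max_output_tokens} output tokens"
        )
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    # Same request (10,000 input tokens, 1,000 output tokens) on each model.
    for name in PRICING:
        print(f"{name}: ${estimate_cost(name, 10_000, 1_000):.5f}")
```

For a request with 10,000 input tokens and 1,000 output tokens, this works out to about $0.225 on Claude 3 Opus, $0.045 on either Sonnet model, and $0.00375 on Claude 3 Haiku, which makes the price gap between tiers concrete.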

Benchmark Metric Glossary

GPQA (graduate-level reasoning): Assesses the ability to handle advanced questions in biology, physics, and chemistry.
MMLU (undergraduate-level knowledge): Measures knowledge and reasoning across multiple academic subjects at the undergraduate level.
HumanEval (coding proficiency): Tests the ability to generate correct Python code from problem descriptions.
MGSM (multilingual math): Evaluates performance in multilingual grade school math problems.
MathVista (visual math reasoning): Assesses visual mathematical reasoning skills, such as interpreting graphs and charts.
AI2D (science diagrams): Tests the ability to interpret and answer questions based on science diagrams.
MMMU (visual question answering): Measures the ability to answer questions about visual content, like images and diagrams.
Chart Q&A (relaxed accuracy): Evaluates the accuracy of answering questions based on chart data.
ANLS Score (document visual Q&A): Assesses the accuracy of answering questions based on document visuals, using the Average Normalized Levenshtein Similarity (ANLS) score.
MATH (math problem solving): Tests advanced problem-solving skills in various areas of mathematics.
DROP (reasoning over text): Evaluates the ability to perform discrete reasoning tasks over paragraphs of text.
BIG-Bench-Hard (mixed evaluations): A challenging benchmark that tests reasoning, common sense, and problem-solving across various tasks.
GSM8K (grade school math): Measures the ability to solve grade school-level math problems, often using chain-of-thought reasoning.
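To make the ANLS entry above more concrete, here is a minimal sketch of how an ANLS score is commonly computed. It assumes the usual formulation: each prediction is compared with every accepted answer using a normalized Levenshtein distance, similarities at or beyond a threshold distance of 0.5 are counted as 0, and the best per-question similarity is averaged over all questions. The lowercasing and whitespace stripping are assumptions of this sketch, and the exact evaluation code behind the scores in the table may differ.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,
) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    references[i] holds every accepted answer for question i.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0

    total = 0.0
    for prediction, accepted in zip(predictions, references):
        pred = prediction.strip().lower()  # normalization is an assumption
        best = 0.0
        for answer in accepted:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            # Answers that are too far off score 0 instead of partial credit.
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["Invoice 1234", "42"]
    refs = [["invoice 1235"], ["seven", "7"]]
    print(f"ANLS = {anls(preds, refs):.4f}")
```

The threshold is what separates ANLS from a plain edit-distance average: a near miss such as a single wrong character still earns most of the credit, while an unrelated answer earns none.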
