Claude 3 Opus has been evaluated across various benchmarks, demonstrating significant advancements in intelligence and performance compared to other models in its class, including Claude 3 Sonnet, Claude 3 Haiku and GPT-4o, etc.
Here’s a comparison of Claude 3 Opus's performance against other models: (for more details read here)
| Category | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro | Llama-400b (early snapshot) |
|---|---|---|---|---|---|
| Graduate level reasoning | 59.4%* (0-shot CoT) | 50.4% (0-shot CoT) | 53.6% (0-shot CoT) | - | - |
| Undergraduate level knowledge | 88.7% (5-shot) | 86.8% (5-shot) | 88.7% (0-shot CoT) | 85.9% (5-shot) | 86.1% (5-shot) |
| Code (HumanEval) | 92.0% (0-shot) | 84.9% (0-shot) | 90.2% (0-shot) | 84.1% (0-shot) | 84.1% (0-shot) |
| Multilingual math (MGSM) | 91.6% (0-shot CoT) | 90.7% (0-shot CoT) | 90.5% (0-shot CoT) | 87.5% (8-shot) | - |
| Reasoning over text (DROP) | 87.1% (3-shot) | 83.1% (3-shot) | 83.4% (3-shot) | 74.9% (Variable) | 83.5% (3-shot Pre-trained) |
| Mixed evaluations (BIG-Bench) | 93.1% (3-shot CoT) | 86.8% (3-shot CoT) | - | 89.2% (3-shot CoT) | 85.3% (3-shot CoT Pre-trained) |
| Math problem-solving (MATH) | 71.1% (0-shot CoT) | 60.1% (0-shot CoT) | 76.6% (0-shot CoT) | 67.7% (4-shot) | 57.8% (4-shot CoT) |
| Grade school math (GSM8K) | 96.4% (0-shot CoT) | 95.0% (0-shot CoT) | - | 90.8% (11-shot) | 94.1% (8-shot CoT) |
| Category | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro |
|---|---|---|---|---|
| Visual math reasoning (MathVista) | 67.7% (0-shot CoT) | 50.5% (0-shot CoT) | 63.8% (0-shot CoT) | 63.9% (0-shot CoT) |
| Science diagrams (AI2D) | 94.7% (0-shot) | 88.1% (0-shot) | 94.2% (0-shot) | 94.4% (0-shot) |
| Visual question answering (MMMU) | 68.3% (0-shot CoT) | 59.4% (0-shot CoT) | 69.1% (0-shot CoT) | 62.2% (0-shot CoT) |
| Chart Q&A (Relaxed accuracy) | 90.8% (0-shot CoT) | 80.8% (0-shot CoT) | 85.7% (0-shot CoT) | 87.2% (0-shot CoT) |
| Document visual Q&A (ANLS score) | 95.2% (0-shot) | 89.3% (0-shot) | 92.8% (0-shot) | 93.1% (0-shot) |
To help you choose the right model for your needs, here’s a compiled table comparing the key features and capabilities of each model in the Claude family:
| Claude Model | Claude 3.5 Sonnet | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku |
|---|---|---|---|---|
| Description | Most intelligent model | Powerful model for highly complex tasks | Balance of intelligence and speed | Fastest and most compact model for near-instant responsiveness |
| Strengths | Highest level of intelligence and capability | Top-level performance, intelligence, fluency, and understanding | Strong utility, balanced for scaled deployments | Quick and accurate targeted performance |
| Multilingual | Yes | Yes | Yes | Yes |
| Vision | Yes | Yes | Yes | Yes |
| API model name | claude-3-5-sonnet-20240620 | claude-3-opus-20240229 | claude-3-sonnet-20240229 | claude-3-haiku-20240307 |
| API format | Messages API | Messages API | Messages API | Messages API |
| Comparative latency | Fast | Moderately fast | Fast | Fastest |
| Context window | 200K | 200K | 200K | 200K |
| Max output | 8192 tokens | 4096 tokens | 4096 tokens | 4096 tokens |
| Cost (Input / Output per MTok) | $3.00 / $15.00 | $15.00 / $75.00 | $3.00 / $15.00 | $0.25 / $1.25 |
| Training data cut-off | Apr 2024 | Aug 2023 | Aug 2023 | Aug 2023 |
Benchmark Metric Glossary
- GPQA (graduate-level reasoning): Assesses the ability to handle advanced questions in biology, physics, and chemistry.
- MMLU (undergraduate-level knowledge): Measures knowledge and reasoning across multiple academic subjects at the undergraduate level.
- HumanEval (coding proficiency): Tests the ability to generate correct Python code from problem descriptions.
- MGSM (multilingual math): Evaluates performance in multilingual grade school math problems.
- MathVista (visual math reasoning): Assesses visual mathematical reasoning skills, such as interpreting graphs and charts.
- AI2D (science diagrams): Tests the ability to interpret and answer questions based on science diagrams.
- MMMU (visual question answering): Measures the ability to answer questions about visual content, like images and diagrams.
- Chart Q&A (relaxed accuracy): Evaluates the accuracy of answering questions based on chart data.
- ANLS Score (document visual Q&A): Assesses the accuracy of answering questions based on document visuals, using the Average Normalized Levenshtein Similarity (ANLS) score.
- MATH (math problem solving): Tests advanced problem-solving skills in various areas of mathematics.
- DROP (reasoning over text): Evaluates the ability to perform discrete reasoning tasks over paragraphs of text.
- BIG-Bench-Hard (mixed evaluations): A challenging benchmark that tests reasoning, common sense, and problem-solving across various tasks.
- GSM8K (grade school math): Measures the ability to solve grade school-level math problems, often using chain-of-thought reasoning.