Contact Support
    Anthropic/Claude-3 Opus

    Model Card

    Claude 3 Opus has been evaluated across various benchmarks, demonstrating significant advancements in intelligence and performance compared to other models in its class, including Claude 3 Sonnet, Claude 3 Haiku and GPT-4o, etc.

    Here’s a comparison of Claude 3 Opus's performance against other models: (for more details read here)

    CategoryClaude 3.5 SonnetClaude 3 OpusGPT-4oGemini 1.5 ProLlama-400b (early snapshot)
    Graduate level reasoning59.4%* (0-shot CoT)50.4% (0-shot CoT)53.6% (0-shot CoT)--
    Undergraduate level knowledge88.7% (5-shot)86.8% (5-shot)88.7% (0-shot CoT)85.9% (5-shot)86.1% (5-shot)
    Code (HumanEval)92.0% (0-shot)84.9% (0-shot)90.2% (0-shot)84.1% (0-shot)84.1% (0-shot)
    Multilingual math (MGSM)91.6% (0-shot CoT)90.7% (0-shot CoT)90.5% (0-shot CoT)87.5% (8-shot)-
    Reasoning over text (DROP)87.1% (3-shot)83.1% (3-shot)83.4% (3-shot)74.9% (Variable)83.5% (3-shot Pre-trained)
    Mixed evaluations (BIG-Bench)93.1% (3-shot CoT)86.8% (3-shot CoT)-89.2% (3-shot CoT)85.3% (3-shot CoT Pre-trained)
    Math problem-solving (MATH)71.1% (0-shot CoT)60.1% (0-shot CoT)76.6% (0-shot CoT)67.7% (4-shot)57.8% (4-shot CoT)
    Grade school math (GSM8K)96.4% (0-shot CoT)95.0% (0-shot CoT)-90.8% (11-shot)94.1% (8-shot CoT)
    CategoryClaude 3.5 SonnetClaude 3 OpusGPT-4oGemini 1.5 Pro
    Visual math reasoning (MathVista)67.7% (0-shot CoT)50.5% (0-shot CoT)63.8% (0-shot CoT)63.9% (0-shot CoT)
    Science diagrams (AI2D)94.7% (0-shot)88.1% (0-shot)94.2% (0-shot)94.4% (0-shot)
    Visual question answering (MMMU)68.3% (0-shot CoT)59.4% (0-shot CoT)69.1% (0-shot CoT)62.2% (0-shot CoT)
    Chart Q&A (Relaxed accuracy)90.8% (0-shot CoT)80.8% (0-shot CoT)85.7% (0-shot CoT)87.2% (0-shot CoT)
    Document visual Q&A (ANLS score)95.2% (0-shot)89.3% (0-shot)92.8% (0-shot)93.1% (0-shot)

    To help you choose the right model for your needs, here’s a compiled table comparing the key features and capabilities of each model in the Claude family:

    Claude ModelClaude 3.5 SonnetClaude 3 OpusClaude 3 SonnetClaude 3 Haiku
    DescriptionMost intelligent modelPowerful model for highly complex tasksBalance of intelligence and speedFastest and most compact model for near-instant responsiveness
    StrengthsHighest level of intelligence and capabilityTop-level performance, intelligence, fluency, and understandingStrong utility, balanced for scaled deploymentsQuick and accurate targeted performance
    MultilingualYesYesYesYes
    VisionYesYesYesYes
    API model nameclaude-3-5-sonnet-20240620claude-3-opus-20240229claude-3-sonnet-20240229claude-3-haiku-20240307
    API formatMessages APIMessages APIMessages APIMessages API
    Comparative latencyFastModerately fastFastFastest
    Context window200K200K200K200K
    Max output8192 tokens4096 tokens4096 tokens4096 tokens
    Cost (Input / Output per MTok)$3.00 / $15.00$15.00 / $75.00$3.00 / $15.00$0.25 / $1.25
    Training data cut-offApr 2024Aug 2023Aug 2023Aug 2023

    Benchmark Metric Glossary

    • GPQA (graduate-level reasoning): Assesses the ability to handle advanced questions in biology, physics, and chemistry.
    • MMLU (undergraduate-level knowledge): Measures knowledge and reasoning across multiple academic subjects at the undergraduate level.
    • HumanEval (coding proficiency): Tests the ability to generate correct Python code from problem descriptions.
    • MGSM (multilingual math): Evaluates performance in multilingual grade school math problems.
    • MathVista (visual math reasoning): Assesses visual mathematical reasoning skills, such as interpreting graphs and charts.
    • AI2D (science diagrams): Tests the ability to interpret and answer questions based on science diagrams.
    • MMMU (visual question answering): Measures the ability to answer questions about visual content, like images and diagrams.
    • Chart Q&A (relaxed accuracy): Evaluates the accuracy of answering questions based on chart data.
    • ANLS Score (document visual Q&A): Assesses the accuracy of answering questions based on document visuals, using the Average Normalized Levenshtein Similarity (ANLS) score.
    • MATH (math problem solving): Tests advanced problem-solving skills in various areas of mathematics.
    • DROP (reasoning over text): Evaluates the ability to perform discrete reasoning tasks over paragraphs of text.
    • BIG-Bench-Hard (mixed evaluations): A challenging benchmark that tests reasoning, common sense, and problem-solving across various tasks.
    • GSM8K (grade school math): Measures the ability to solve grade school-level math problems, often using chain-of-thought reasoning.

    Meta data

    200,000 tokens
    $15 per million
    $75 per million
    Aug 2023
    Create an agent Pipe