    Meta/Llama-3-70B

    Model Card

    The tables below compare Llama 3 models with Llama 2 models on a range of benchmarks. Higher scores are better.

    Base pretrained models

    Category               Benchmark                     Llama 3 8B  Llama 2 7B  Llama 2 13B  Llama 3 70B  Llama 2 70B
    General                MMLU (5-shot)                 66.6        45.7        53.8         79.5         69.7
                           AGIEval English (3-5 shot)    45.9        28.8        38.7         63.0         54.8
                           CommonSenseQA (7-shot)        72.6        57.6        67.6         83.8         78.7
                           Winogrande (5-shot)           76.1        73.3        75.4         83.1         81.8
                           BIG-Bench Hard (3-shot, CoT)  61.1        38.1        47.0         81.3         65.7
                           ARC-Challenge (25-shot)       78.6        53.7        67.6         93.0         85.3
    Knowledge reasoning    TriviaQA-Wiki (5-shot)        78.5        72.1        79.6         89.7         87.5
    Reading comprehension  SQuAD (1-shot)                76.4        72.2        72.1         85.6         82.6
                           QuAC (1-shot, F1)             44.4        39.6        44.9         51.1         49.4
                           BoolQ (0-shot)                75.7        65.5        66.9         79.0         73.1
                           DROP (3-shot, F1)             58.4        37.9        49.8         79.7         70.2
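
    The shot counts in parentheses (for example, 5-shot) give the number of solved examples placed in the prompt before the test question. The Python sketch below shows one plausible way to assemble such a prompt. The function name `build_k_shot_prompt`, the question/answer template, and the sample questions are hypothetical; they are not taken from Meta's evaluation setup.

```python
def build_k_shot_prompt(examples, question, k):
    """Build a k-shot prompt: k solved examples, then the unanswered question."""
    if k > len(examples):
        raise ValueError("not enough solved examples for the requested shot count")
    # Each solved example is shown as a question together with its answer.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The test question comes last, with the answer left for the model to fill in.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical solved examples, for illustration only.
solved = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "7"),
]

prompt = build_k_shot_prompt(solved, "What is 3 + 5?", k=2)
print(prompt)
```

    With k=0 the prompt contains only the test question, which corresponds to the 0-shot setting.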

    Instruction tuned models

    Benchmark             Llama 3 8B  Llama 2 7B  Llama 2 13B  Llama 3 70B  Llama 2 70B
    MMLU (5-shot)         68.4        34.1        47.8         82.0         52.9
    GPQA (0-shot)         34.2        21.7        22.3         39.5         21.0
    HumanEval (0-shot)    62.2        7.9         14.0         81.7         25.6
    GSM-8K (8-shot, CoT)  79.6        25.7        77.4         93.0         57.5
    MATH (4-shot, CoT)    30.0        3.8         6.7          50.4         11.6
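
    To make the instruction-tuned comparison concrete, the Python sketch below computes the percentage-point gain of Llama 3 70B over Llama 2 70B on each benchmark. The scores are copied from the table above; the helper name `point_gain` is illustrative.

```python
# Instruction-tuned scores from the table: (Llama 3 70B, Llama 2 70B).
scores = {
    "MMLU (5-shot)": (82.0, 52.9),
    "GPQA (0-shot)": (39.5, 21.0),
    "HumanEval (0-shot)": (81.7, 25.6),
    "GSM-8K (8-shot, CoT)": (93.0, 57.5),
    "MATH (4-shot, CoT)": (50.4, 11.6),
}


def point_gain(new, old):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


gains = {name: point_gain(new, old) for name, (new, old) in scores.items()}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"Largest gain: {largest}")
```

    The largest gap is on HumanEval, the code-generation benchmark.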

    Benchmarks and Metrics Glossary

    • MMLU (Massive Multitask Language Understanding): A multiple-choice benchmark covering 57 subjects, from STEM to the humanities and social sciences, that measures a model's knowledge and problem-solving ability.
    • MMLU-Pro (CoT): A harder variant of MMLU with more reasoning-focused questions and ten answer options per question instead of four. "CoT" indicates that the model is prompted to produce a chain of thought (step-by-step reasoning) before answering.
    • AGIEval: A benchmark built from human standardized exams, such as college entrance and law school admission tests. The base-model table above reports its English subset.
    • CommonSenseQA: A benchmark that tests a model's ability to answer questions requiring common sense reasoning.
    • Winogrande: A commonsense reasoning benchmark of fill-in-the-blank sentences in which the model must choose which of two options correctly resolves an ambiguous reference.
    • BIG-Bench (Beyond the Imitation Game Benchmark): A comprehensive benchmark covering a wide array of language tasks. The "Hard" subset focuses on more challenging tasks.
    • ARC-Challenge (AI2 Reasoning Challenge, Challenge set): The harder partition of a benchmark of grade-school-level multiple-choice science questions.
    • TriviaQA-Wiki: A benchmark for evaluating the model's ability to answer trivia questions, using Wikipedia as the source of truth.
    • SQuAD (Stanford Question Answering Dataset): A reading comprehension benchmark where models must answer questions based on a given text passage.
    • QuAC (Question Answering in Context): A benchmark for evaluating a model's ability to answer questions based on a dialogue context.
    • BoolQ (Boolean Questions): A yes/no question-answering benchmark, testing the model’s ability to provide accurate binary answers.
    • DROP (Discrete Reasoning Over Paragraphs): A reading comprehension benchmark whose questions require discrete operations, such as counting, addition, or sorting, over facts stated in a paragraph.
    • IFEval (Instruction-Following Evaluation): A benchmark that tests how reliably a model follows verifiable instructions, such as length limits or formatting constraints.
    • GPQA (Graduate-Level Google-Proof Q&A): A benchmark of expert-written multiple-choice questions in biology, physics, and chemistry that are hard to answer even with web search.
    • HumanEval: A benchmark designed to evaluate the model’s ability to write correct code based on problem descriptions.
    • MBPP (Mostly Basic Python Problems): A benchmark of entry-level Python programming tasks, each checked against test cases.
    • GSM-8K (Grade School Math 8K): A benchmark focusing on the model’s ability to solve grade-school-level math problems.
    • MATH: A benchmark of competition-level mathematics problems that require multi-step reasoning.
    • API-Bank: A benchmark testing the model's accuracy in using APIs to solve tasks.
    • BFCL (Berkeley Function-Calling Leaderboard): A benchmark that measures how accurately a model selects functions and fills in their arguments when calling tools.
    • Gorilla Benchmark API Bench: The APIBench dataset from the Gorilla project, which tests whether a model generates correct API calls for a given task.
    • Nexus: A function-calling benchmark, evaluated zero-shot and reported as macro-average accuracy.
    • Multilingual MGSM (CoT): A multilingual version of the GSM-8K benchmark, testing the model’s ability to solve math problems in multiple languages.

    Metrics:

    • macro_avg/acc (Macro Average Accuracy): Accuracy computed separately for each class or task and then averaged, so that every class or task carries equal weight regardless of its size.
    • acc_char (Accuracy Character): A measure of how accurately a model performs at the character level, typically used in text-based benchmarks.
    • em (Exact Match): A metric that measures how often the model's output exactly matches the correct answer.
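    • f1 (F1 Score): The harmonic mean of precision and recall. On question-answering benchmarks such as QuAC and DROP, it is computed over the tokens shared between the predicted and reference answers, so partially correct answers earn partial credit.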
    • pass@1: A coding metric that measures the model's ability to generate correct code on the first attempt.
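    • em_maj1@1: Exact match with majority voting: the most frequent answer among the sampled generations is compared with the reference. With a single sample, this is the same as checking that one answer. It is typically reported for reasoning and math benchmarks.
    • final_em: Exact match on the final answer extracted from the model's response, often used in more challenging benchmarks like MATH.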
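
    Several of the metrics above reduce to a few lines of arithmetic. The following Python sketch is illustrative only and is not the scoring code behind the numbers in this model card. The function names and the whitespace-only normalization are assumptions; real evaluation harnesses usually normalize answers more aggressively (for example, by stripping punctuation and articles).

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """em: 1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """f1: harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def macro_avg_accuracy(results_by_task):
    """macro_avg/acc: accuracy per task, then an unweighted mean over tasks."""
    per_task = [sum(r) / len(r) for r in results_by_task.values()]
    return sum(per_task) / len(per_task)


def pass_at_1(first_attempt_passed):
    """pass@1: fraction of problems whose first generated solution passes its tests."""
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical inputs, for illustration only.
print(exact_match("Paris", "paris"))
print(token_f1("the cat sat", "the cat slept"))
print(macro_avg_accuracy({"task_a": [1, 1, 0, 0], "task_b": [1]}))
print(pass_at_1([True, False, True, True]))
```

    In the macro-average example, task_a scores 0.5 and task_b scores 1.0, giving 0.75; pooling all five items instead would give 3/5 = 0.6, which is why the macro average is preferred when tasks differ in size.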

    Metadata

    Context window: 8192 tokens
    Input price: $0.9 per million tokens
    Output price: $0.9 per million tokens
    Knowledge cutoff: Dec 2023
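
    The figures above are enough for a rough per-request cost estimate. The Python sketch below assumes that the $0.9 per million rate applies equally to prompt and completion tokens and that both share the 8192-token limit; neither assumption is stated explicitly on this page, and the function names are illustrative.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.9  # listed price; assumed equal for input and output
CONTEXT_WINDOW_TOKENS = 8192        # listed limit; assumed to cover prompt plus completion


def fits_context(prompt_tokens, completion_tokens):
    """Check whether a request stays within the assumed shared token limit."""
    return prompt_tokens + completion_tokens <= CONTEXT_WINDOW_TOKENS


def estimate_cost_usd(prompt_tokens, completion_tokens):
    """Estimate the cost of one request at the flat per-token rate."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000


# Hypothetical request: 6,000 prompt tokens and 1,000 completion tokens.
print(fits_context(6000, 1000))
print(f"${estimate_cost_usd(6000, 1000):.4f}")
```

    For this hypothetical request of 7,000 tokens the estimate is 7,000 × 0.9 / 1,000,000 = $0.0063.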