The tables below compare Llama 3 and Llama 2 models across a range of benchmarks (for more details, read here).
Base pretrained models
| Category | Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|---|---|---|---|---|---|---|
| General | MMLU (5-shot) | 66.6 | 45.7 | 53.8 | 79.5 | 69.7 |
| | AGIEval English (3-5 shot) | 45.9 | 28.8 | 38.7 | 63 | 54.8 |
| | CommonSenseQA (7-shot) | 72.6 | 57.6 | 67.6 | 83.8 | 78.7 |
| | Winogrande (5-shot) | 76.1 | 73.3 | 75.4 | 83.1 | 81.8 |
| | BIG-Bench Hard (3-shot, CoT) | 61.1 | 38.1 | 47 | 81.3 | 65.7 |
| | ARC-Challenge (25-shot) | 78.6 | 53.7 | 67.6 | 93 | 85.3 |
| Knowledge reasoning | TriviaQA-Wiki (5-shot) | 78.5 | 72.1 | 79.6 | 89.7 | 87.5 |
| Reading comprehension | SQuAD (1-shot) | 76.4 | 72.2 | 72.1 | 85.6 | 82.6 |
| | QuAC (1-shot, F1) | 44.4 | 39.6 | 44.9 | 51.1 | 49.4 |
| | BoolQ (0-shot) | 75.7 | 65.5 | 66.9 | 79 | 73.1 |
| | DROP (3-shot, F1) | 58.4 | 37.9 | 49.8 | 79.7 | 70.2 |
Instruction-tuned models
| Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|---|---|---|---|---|---|
| MMLU (5-shot) | 68.4 | 34.1 | 47.8 | 82 | 52.9 |
| GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21 |
| HumanEval (0-shot) | 62.2 | 7.9 | 14 | 81.7 | 25.6 |
| GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93 | 57.5 |
| MATH (4-shot, CoT) | 30 | 3.8 | 6.7 | 50.4 | 11.6 |
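
To make the gaps in the instruction-tuned table easier to read, the scores can be loaded into a small structure and differenced. The following is a minimal sketch: the numbers are copied from the table above, while the names `MODELS`, `SCORES`, `gain`, and `gains` are illustrative and not part of any library.

```python
# Scores copied from the instruction-tuned table above (column order matches MODELS).
MODELS = ["Llama 3 8B", "Llama 2 7B", "Llama 2 13B", "Llama 3 70B", "Llama 2 70B"]
SCORES = {
    "MMLU (5-shot)": [68.4, 34.1, 47.8, 82.0, 52.9],
    "GPQA (0-shot)": [34.2, 21.7, 22.3, 39.5, 21.0],
    "HumanEval (0-shot)": [62.2, 7.9, 14.0, 81.7, 25.6],
    "GSM-8K (8-shot, CoT)": [79.6, 25.7, 77.4, 93.0, 57.5],
    "MATH (4-shot, CoT)": [30.0, 3.8, 6.7, 50.4, 11.6],
}


def gain(benchmark, newer, older):
    """Absolute score difference, in points, between two models on one benchmark."""
    row = SCORES[benchmark]
    return round(row[MODELS.index(newer)] - row[MODELS.index(older)], 1)


def gains(newer, older):
    """Per-benchmark point differences between two models."""
    return {bench: gain(bench, newer, older) for bench in SCORES}


if __name__ == "__main__":
    for bench, delta in gains("Llama 3 8B", "Llama 2 7B").items():
        print(f"{bench:22s} {delta:+.1f}")
```

These are differences in points on each benchmark's own scale, so they are not directly comparable across benchmarks.
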
Benchmarks and Metrics Glossary
- MMLU (Massive Multitask Language Understanding): A multiple-choice benchmark covering 57 subjects, from elementary mathematics to law and medicine, that measures a model's breadth of knowledge and problem-solving ability.
- MMLU-Pro (CoT): A harder successor to MMLU with more reasoning-focused questions and up to ten answer options per question. "CoT" (chain of thought) indicates that the model is prompted to reason step by step before giving its answer.
- AGIEval: A benchmark built from human standardized exams, such as college admission and law school admission tests. The tables report the English subset.
- CommonSenseQA: A benchmark that tests a model's ability to answer questions requiring common sense reasoning.
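- Winogrande: A commonsense reasoning benchmark of fill-in-the-blank sentences, modeled on the Winograd Schema Challenge, where the model must choose which of two candidates correctly completes an ambiguous sentence.
- BIG-Bench (Beyond the Imitation Game Benchmark): A large collaborative benchmark covering a wide array of language tasks. The "Hard" subset (BIG-Bench Hard) keeps only the tasks on which earlier language models failed to beat the average human rater.
- ARC-Challenge (AI2 Reasoning Challenge, Challenge set): A benchmark of multiple-choice grade-school science questions. The Challenge set contains the questions that simple retrieval and word co-occurrence baselines answered incorrectly.
- TriviaQA-Wiki: A question-answering benchmark of trivia questions, evaluated on the subset whose supporting evidence documents come from Wikipedia.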
- SQuAD (Stanford Question Answering Dataset): A reading comprehension benchmark where models must answer questions based on a given text passage.
- QuAC (Question Answering in Context): A benchmark of information-seeking dialogues, where the model answers a sequence of questions about a text passage and each question may depend on earlier turns.
- BoolQ (Boolean Questions): A yes/no question-answering benchmark, testing the model’s ability to provide accurate binary answers.
- DROP (Discrete Reasoning Over Paragraphs): A reading comprehension benchmark whose questions require discrete operations over a passage, such as counting, addition, or sorting.
- IFEval (Instruction-Following Evaluation): A benchmark that tests how well a model follows verifiable instructions in a prompt, such as a required length, format, or keyword.
- GPQA (Graduate-Level Google-Proof Q&A): A set of multiple-choice questions in biology, physics, and chemistry, written by domain experts to be hard to answer even with web search.
- HumanEval: A benchmark designed to evaluate the model’s ability to write correct code based on problem descriptions.
- MBPP (Mostly Basic Python Problems): A benchmark of short, entry-level Python programming tasks, each checked against test cases.
- GSM-8K (Grade School Math 8K): A benchmark focusing on the model’s ability to solve grade-school-level math problems.
- MATH: A benchmark of competition-level mathematics problems spanning topics such as algebra, geometry, and number theory.
- API-Bank: A benchmark testing the model's accuracy in using APIs to solve tasks.
- BFCL (Berkeley Function Calling Leaderboard): A benchmark that measures how accurately a model produces function (tool) calls for a given request.
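- Gorilla Benchmark API Bench: The API benchmark from the Gorilla project, testing whether a model generates correct calls to machine learning APIs from natural-language instructions.
- Nexus: A function-calling benchmark that tests whether a model selects and invokes the right API functions in a zero-shot setting. Results are reported as macro-average accuracy.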
- Multilingual MGSM (CoT): A multilingual version of the GSM-8K benchmark, testing the model’s ability to solve math problems in multiple languages.
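
The shot counts in the tables (for example, "5-shot") and the CoT label describe how the evaluation prompt is assembled: k solved examples are placed before the new question, and with chain of thought each example also shows its reasoning. The sketch below illustrates the idea only. The function name `build_few_shot_prompt`, the `Question:`/`Reasoning:`/`Answer:` template, and the example data are assumptions made for illustration, not the format used by any particular evaluation harness.

```python
def build_few_shot_prompt(examples, question, k, chain_of_thought=False):
    """Prepend k solved examples to a new question (k=0 gives a zero-shot prompt)."""
    parts = []
    for ex in examples[:k]:
        block = f"Question: {ex['question']}\n"
        if chain_of_thought:
            # With CoT, each demonstration shows its reasoning before the answer.
            block += f"Reasoning: {ex['reasoning']}\n"
        block += f"Answer: {ex['answer']}"
        parts.append(block)
    # The final block is left open so the model continues from the cue.
    cue = "Reasoning:" if chain_of_thought else "Answer:"
    parts.append(f"Question: {question}\n{cue}")
    return "\n\n".join(parts)


# Hypothetical demonstrations, written for this sketch only.
EXAMPLES = [
    {"question": "What is 2 + 3?", "reasoning": "2 plus 3 equals 5.", "answer": "5"},
    {"question": "What is 4 * 6?", "reasoning": "4 times 6 equals 24.", "answer": "24"},
]

if __name__ == "__main__":
    print(build_few_shot_prompt(EXAMPLES, "What is 7 - 4?", k=2, chain_of_thought=True))
```
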
Metrics:
- macro_avg/acc (Macro Average Accuracy): Accuracy computed separately for each class or task and then averaged with equal weight, so small subsets count as much as large ones.
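- acc_char: An accuracy variant reported for multiple-choice benchmarks. The suffix most likely indicates that candidate answers are scored by likelihood normalized by their length in characters, rather than accuracy measured character by character.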
- em (Exact Match): A metric that measures how often the model's output exactly matches the correct answer.
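- f1 (F1 Score): The harmonic mean of precision and recall. In question-answering benchmarks such as QuAC and DROP, it is typically computed over the tokens shared by the predicted and reference answers, so partially correct answers earn partial credit.
- pass@1: A coding metric that gives the fraction of problems for which a single generated solution passes all of the problem's unit tests.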
- em_maj1@1: Exact match after majority voting over sampled answers. With one sample per problem, as the "1@1" suffix suggests, this reduces to the exact match of that single answer.
- final_em: Exact match computed on the final answer extracted from the model's output, ignoring the intermediate reasoning; used for benchmarks such as MATH.
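
Several of the metrics above are simple enough to sketch in a few lines. The following is a minimal illustration, assuming a light normalization (lowercasing and whitespace collapsing) before comparison; real evaluation harnesses often normalize more aggressively, and all function names here are illustrative rather than taken from a specific library.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def macro_average(per_task_scores):
    """Unweighted mean of per-task scores: every task counts equally."""
    return sum(per_task_scores.values()) / len(per_task_scores)


def pass_at_1(first_sample_passed):
    """Fraction of problems whose single generated sample passed all unit tests."""
    return sum(first_sample_passed) / len(first_sample_passed)


if __name__ == "__main__":
    print(exact_match("Paris ", "paris"))
    print(token_f1("the cat sat", "the cat"))
    print(macro_average({"task_a": 0.5, "task_b": 1.0}))
    print(pass_at_1([True, False, True, True]))
```

Token-level F1 gives partial credit where exact match gives none, which is why reading comprehension benchmarks such as QuAC and DROP report it.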