Meta / Benchmarks - Langbase · Serverless AI Developer Platform

The tables below compare Llama 3 and Llama 2 models on a range of benchmarks.

Base pretrained models

Category	Benchmark	Llama 3 8B	Llama 2 7B	Llama 2 13B	Llama 3 70B	Llama 2 70B
General	MMLU (5-shot)	66.6	45.7	53.8	79.5	69.7
	AGIEval English (3-5 shot)	45.9	28.8	38.7	63	54.8
	CommonSenseQA (7-shot)	72.6	57.6	67.6	83.8	78.7
	Winogrande (5-shot)	76.1	73.3	75.4	83.1	81.8
	BIG-Bench Hard (3-shot, CoT)	61.1	38.1	47	81.3	65.7
	ARC-Challenge (25-shot)	78.6	53.7	67.6	93	85.3
Knowledge reasoning	TriviaQA-Wiki (5-shot)	78.5	72.1	79.6	89.7	87.5
Reading comprehension	SQuAD (1-shot)	76.4	72.2	72.1	85.6	82.6
	QuAC (1-shot, F1)	44.4	39.6	44.9	51.1	49.4
	BoolQ (0-shot)	75.7	65.5	66.9	79	73.1
	DROP (3-shot, F1)	58.4	37.9	49.8	79.7	70.2

Instruction tuned models

Benchmark	Llama 3 8B	Llama 2 7B	Llama 2 13B	Llama 3 70B	Llama 2 70B
MMLU (5-shot)	68.4	34.1	47.8	82	52.9
GPQA (0-shot)	34.2	21.7	22.3	39.5	21
HumanEval (0-shot)	62.2	7.9	14	81.7	25.6
GSM-8K (8-shot, CoT)	79.6	25.7	77.4	93	57.5
MATH (4-shot, CoT)	30	3.8	6.7	50.4	11.6
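
To read the table as differences rather than raw scores, the snippet below computes how many points Llama 3 8B gains over Llama 2 7B on each instruction-tuned benchmark. It is a minimal sketch: the scores are copied from the table above, while the dictionary names and the `point_gains` helper are hypothetical and exist only for illustration.

```python
# Scores copied from the "Instruction tuned models" table above.
llama3_8b = {"MMLU": 68.4, "GPQA": 34.2, "HumanEval": 62.2, "GSM-8K": 79.6, "MATH": 30.0}
llama2_7b = {"MMLU": 34.1, "GPQA": 21.7, "HumanEval": 7.9, "GSM-8K": 25.7, "MATH": 3.8}


def point_gains(new_scores, old_scores):
    """Return the absolute score difference per benchmark, rounded to one decimal."""
    return {name: round(new_scores[name] - old_scores[name], 1) for name in new_scores}


gains = point_gains(llama3_8b, llama2_7b)
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The same helper works for any pair of columns, for example Llama 3 70B against Llama 2 70B.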

Benchmarks and Metrics Glossary

MMLU (Massive Multitask Language Understanding): A multiple-choice benchmark covering 57 subjects, from elementary mathematics to law and medicine, that measures a model's general knowledge and problem-solving ability.
MMLU-Pro (CoT): A harder successor to MMLU with more reasoning-heavy questions and ten answer options instead of four. "CoT" (chain of thought) means the model is prompted to reason step by step before giving its answer.
AGIEval: A benchmark built from human standardized exams, such as college admission and law school admission tests. The table above reports the English-language subset.
CommonSenseQA: A benchmark that tests a model's ability to answer questions requiring common sense reasoning.
Winogrande: A commonsense reasoning benchmark of fill-in-the-blank sentences in which the model must choose which of two candidates correctly resolves an ambiguous reference.
BIG-Bench (Beyond the Imitation Game Benchmark): A comprehensive benchmark covering a wide array of language tasks. The "Hard" subset focuses on more challenging tasks.
ARC (AI2 Reasoning Challenge) - Challenge: A benchmark designed to measure a model's ability to answer challenging science questions.
TriviaQA-Wiki: A benchmark for evaluating the model's ability to answer trivia questions, using Wikipedia as the source of truth.
SQuAD (Stanford Question Answering Dataset): A reading comprehension benchmark where models must answer questions based on a given text passage.
QuAC (Question Answering in Context): A benchmark of information-seeking dialogues about a text passage, where each question depends on the preceding dialogue turns.
BoolQ (Boolean Questions): A yes/no question-answering benchmark, testing the model’s ability to provide accurate binary answers.
DROP (Discrete Reasoning Over Paragraphs): A reading comprehension benchmark whose questions require discrete operations over a paragraph, such as counting, addition, or sorting.
IFEval (Instruction-Following Evaluation): A benchmark that tests how well a model follows verifiable instructions, such as length limits or required keywords.
GPQA (Graduate-Level Google-Proof Q&A): A benchmark of expert-written multiple-choice questions in biology, physics, and chemistry that are hard to answer even with web search.
HumanEval: A benchmark designed to evaluate the model’s ability to write correct code based on problem descriptions.
MBPP (Mostly Basic Python Problems): A benchmark of short, entry-level Python programming problems, each checked against test cases.
GSM-8K (Grade School Math 8K): A benchmark focusing on the model’s ability to solve grade-school-level math problems.
MATH: A benchmark of competition-level mathematics problems spanning topics such as algebra, geometry, and number theory.
API-Bank: A benchmark testing the model's accuracy in using APIs to solve tasks.
BFCL (Berkeley Function-Calling Leaderboard): A benchmark that measures how accurately a model produces function (tool) calls.
Gorilla Benchmark API Bench: A benchmark from the Gorilla project that tests whether a model generates correct API calls for a given task.
Nexus: A function-calling benchmark, with scores reported as macro-average accuracy in a zero-shot setting.
Multilingual MGSM (CoT) (Multilingual Grade School Math): A set of GSM-8K problems translated into multiple languages, testing the model's ability to solve math problems outside English.

Metrics:

macro_avg/acc (Macro Average Accuracy): A metric that averages per-task (or per-class) accuracy with equal weight, regardless of how many examples each task contains.
acc_char (Accuracy Character): A measure of how accurately a model performs at the character level, typically used in text-based benchmarks.
em (Exact Match): A metric that measures how often the model's output exactly matches the correct answer.
f1 (F1 Score): The harmonic mean of precision and recall. In question-answering benchmarks such as QuAC and DROP, it is computed over the tokens shared between the model's answer and the reference answer, so partially correct answers earn partial credit.
pass@1: A coding metric that measures the fraction of problems for which the model's first generated solution passes all test cases.
em_maj1@1: Exact match after majority voting over sampled answers. With a single sample (maj1@1), this reduces to the exact match of the model's one answer. Used for reasoning and math benchmarks.
final_em: Exact match computed on the final answer only, ignoring intermediate reasoning. Used for benchmarks like MATH.
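
The metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the scoring code behind the reported numbers: it assumes lowercase, whitespace-split normalization, and all function names are hypothetical. Real evaluation harnesses often apply stricter normalization (for example, stripping punctuation and articles).

```python
from __future__ import annotations

from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """em: 1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """f1: harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_average(per_task_accuracy: dict[str, float]) -> float:
    """macro_avg/acc: unweighted mean of per-task accuracies."""
    return sum(per_task_accuracy.values()) / len(per_task_accuracy)


def pass_at_1(first_attempt_passed: list[bool]) -> float:
    """pass@1: fraction of problems whose first generated solution passed its tests."""
    return sum(first_attempt_passed) / len(first_attempt_passed)
```

For example, `token_f1("the Eiffel Tower", "Eiffel Tower")` gives partial credit (precision 2/3, recall 1), whereas `exact_match` on the same pair returns 0.0.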
