Mistral / Benchmarks - Langbase · Serverless AI Developer Platform

Mixtral 8x7B has been benchmarked against Llama 2 70B and GPT-3.5. It leads on most of the general benchmarks below, with its largest margins on code generation (MBPP) and math (GSM-8K). Here are the results:

General Performance Benchmarks

Benchmark	LLaMA 2 70B	GPT-3.5	Mixtral 8x7B
MMLU (MCQ in 57 subjects)	69.90%	70.00%	70.60%
HellaSwag (10-shot)	87.10%	85.50%	86.70%
ARC Challenge (25-shot)	85.10%	85.20%	85.80%
WinoGrande (5-shot)	83.20%	81.60%	81.20%
MBPP (pass@1)	49.80%	52.20%	60.70%
GSM-8K (5-shot)	53.60%	57.10%	58.40%
MT Bench (for Instruct Models)	6.86	8.32	8.30
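
To check which model leads each row of the table above, the scores can be compared directly. The following is a minimal sketch in Python: the numbers are copied from the table, while the names `MODELS`, `SCORES`, and `leader` are illustrative and not part of any Langbase or Mistral API.

```python
# Scores copied from the General Performance Benchmarks table above.
# Percentages are stored as floats; MT Bench is a score, not a percentage.
MODELS = ("LLaMA 2 70B", "GPT-3.5", "Mixtral 8x7B")

SCORES = {
    "MMLU": (69.90, 70.00, 70.60),
    "HellaSwag": (87.10, 85.50, 86.70),
    "ARC Challenge": (85.10, 85.20, 85.80),
    "WinoGrande": (83.20, 81.60, 81.20),
    "MBPP": (49.80, 52.20, 60.70),
    "GSM-8K": (53.60, 57.10, 58.40),
    "MT Bench": (6.86, 8.32, 8.30),
}


def leader(row):
    """Return the name of the model with the highest score in a row."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return MODELS[best_index]


# Map each benchmark to its leading model, then count Mixtral's wins.
leaders = {bench: leader(row) for bench, row in SCORES.items()}
mixtral_wins = sum(1 for name in leaders.values() if name == "Mixtral 8x7B")

for bench, name in leaders.items():
    print(f"{bench:14s} {name}")
print(f"Mixtral 8x7B leads {mixtral_wins} of {len(SCORES)} benchmarks")
```

On these figures Mixtral 8x7B leads four of the seven rows, LLaMA 2 70B leads HellaSwag and WinoGrande, and GPT-3.5 leads MT Bench by a narrow margin.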

Active Parameters and Detailed Performance

Model	Active Params	MMLU	HellaS	WinoG	PIQA	Arc-e	Arc-c	NQ	TriQA	HumanE	MBPP	Math	GSM8K
LLaMA 2 7B	7B	44.40%	77.10%	69.50%	77.90%	68.70%	43.20%	17.50%	56.60%	11.60%	26.10%	3.90%	16.00%
LLaMA 2 13B	13B	55.60%	80.70%	72.90%	80.80%	75.20%	48.80%	16.70%	64.00%	18.90%	35.40%	6.00%	34.30%
LLaMA 1 33B	33B	56.80%	83.70%	76.20%	82.20%	79.60%	54.40%	24.10%	68.50%	25.00%	40.90%	8.40%	44.10%
LLaMA 2 70B	70B	69.90%	85.40%	80.40%	82.60%	79.90%	56.50%	25.40%	73.00%	29.30%	49.80%	13.80%	69.60%
Mistral 7B	7B	62.50%	81.00%	74.20%	82.20%	80.50%	54.90%	23.20%	62.50%	26.20%	50.20%	12.70%	50.00%
Mixtral 8x7B	12B	70.60%	84.40%	77.20%	83.60%	83.10%	59.70%	30.60%	71.50%	40.20%	60.70%	28.40%	74.40%

Bias and Hallucination Benchmarks

Metric	Llama 2 70B	Mixtral 8x7B
BBQ (higher is better)	51.50%	55.98%
BOLD (std) (lower is better)	0.094	0.084
- Gender	0.073	0.045
- Profession	0.073	0.087
- Religious Ideology	0.133	0.089
- Political Ideology	0.140	0.146
- Race	0.049	0.052

Multilingual Performance

Model	Active Params	French Arc-c	French HellaS	French MMLU	German Arc-c	German HellaS	German MMLU	Spanish Arc-c	Spanish HellaS	Spanish MMLU	Italian Arc-c	Italian HellaS	Italian MMLU
LLaMA 1 33B	33B	39.30%	68.10%	49.90%	41.10%	63.30%	48.70%	45.70%	69.80%	52.30%	42.90%	65.40%	49.00%
LLaMA 2 70B	70B	49.90%	72.50%	64.30%	47.30%	68.70%	64.20%	50.50%	74.50%	66.00%	49.40%	70.90%	65.10%
Mixtral 8x7B	12B	58.20%	77.40%	70.90%	54.30%	73.00%	71.50%	55.40%	77.60%	72.50%	52.80%	75.10%	70.90%
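
The multilingual gap between Mixtral 8x7B and LLaMA 2 70B can be summarized per language by averaging the difference over the three tasks. This is a rough sketch only: the scores are copied from the table above, and the unweighted mean over Arc-c, HellaS, and MMLU is an assumption made for illustration, not a metric reported by the source.

```python
from statistics import mean

LANGUAGES = ("French", "German", "Spanish", "Italian")

# Per language: (Arc-c, HellaS, MMLU) in percent, copied from the table above.
LLAMA_2_70B = {
    "French": (49.90, 72.50, 64.30),
    "German": (47.30, 68.70, 64.20),
    "Spanish": (50.50, 74.50, 66.00),
    "Italian": (49.40, 70.90, 65.10),
}
MIXTRAL_8X7B = {
    "French": (58.20, 77.40, 70.90),
    "German": (54.30, 73.00, 71.50),
    "Spanish": (55.40, 77.60, 72.50),
    "Italian": (52.80, 75.10, 70.90),
}


def mean_gain(language):
    """Mean percentage-point gain of Mixtral over LLaMA 2 70B for a language."""
    pairs = zip(MIXTRAL_8X7B[language], LLAMA_2_70B[language])
    return mean(mixtral - llama for mixtral, llama in pairs)


# Unweighted mean over the three tasks (an illustrative choice).
gains = {lang: round(mean_gain(lang), 2) for lang in LANGUAGES}

for lang, gain in gains.items():
    print(f"{lang:8s} +{gain:.2f} points")
```

Mixtral 8x7B scores higher on all twelve language and task pairs, so every mean gain is positive, with the largest gains in French and German.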

Benchmarks Glossary

MMLU: Massive Multitask Language Understanding. Multiple-choice questions across 57 subjects that test knowledge and reasoning.
HellaSwag (HellaS): Commonsense reasoning benchmark in which the model picks the most plausible continuation of a passage.
ARC Challenge (Arc-c): The harder "Challenge" set of the AI2 Reasoning Challenge, made up of grade-school science questions.
Arc-e: The "Easy" set of the AI2 Reasoning Challenge, with more straightforward science questions.
WinoGrande (WinoG): Commonsense reasoning test based on resolving ambiguous pronouns.
MBPP: Mostly Basic Python Problems. Measures Python code generation; pass@1 is the share of problems solved in a single attempt.
GSM-8K: Grade school math word problems requiring multi-step reasoning and calculation. Written as GSM8K in the detailed table.
MT Bench (for Instruct Models): Multi-turn conversation benchmark for instruction-following models, scored on a scale of 1 to 10.
PIQA: Physical Interaction QA, a commonsense benchmark about everyday physical situations.
NQ: Natural Questions, evaluating open-domain question answering.
TriQA: TriviaQA, a question-answering benchmark built from trivia questions.
HumanE: HumanEval, a benchmark for Python code generation checked against unit tests.
Math: MATH, a benchmark of competition-level mathematics problems.
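
The MBPP figures in the tables are reported as pass@1. The source does not say how that number was computed, so the following is only a sketch of the commonly used pass@k estimator, assuming `n` generated samples per problem of which `c` pass the unit tests. The function names and the example data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    # One minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results, not taken from the tables above:
# one sample per problem, three of five problems solved.
example = [(1, 1), (1, 0), (1, 1), (1, 0), (1, 1)]
score = benchmark_pass_at_k(example, k=1)
print(f"pass@1 = {score:.1%}")
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved, which matches the "single attempt" reading in the glossary.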

Bias and Hallucination Benchmarks

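BBQ: Bias Benchmark for QA. Measures social bias in question answering across categories such as age, gender, and religion.
BOLD: Bias in Open-Ended Language Generation Dataset. Evaluates bias in generated text across gender, profession, race, religious ideology, and political ideology. The table reports the standard deviation for each category, where lower is better.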
