Mixtral 8x7B matches or outperforms LLaMA 2 70B and GPT-3.5 on most of the benchmarks below, with its largest gains in code generation and math. Here are the results:
General Performance Benchmarks
| Benchmark | LLaMA 2 70B | GPT-3.5 | Mixtral 8x7B |
|---|---|---|---|
| MMLU (MCQ in 57 subjects) | 69.90% | 70.00% | 70.60% |
| HellaSwag (10-shot) | 87.10% | 85.50% | 86.70% |
| ARC Challenge (25-shot) | 85.10% | 85.20% | 85.80% |
| WinoGrande (5-shot) | 83.20% | 81.60% | 81.20% |
| MBPP (pass@1) | 49.80% | 52.20% | 60.70% |
| GSM-8K (5-shot) | 53.60% | 57.10% | 58.40% |
| MT Bench (for Instruct Models) | 6.86 | 8.32 | 8.30 |
Active Parameters and Detailed Performance
| Model | Active Params | MMLU | HellaS | WinoG | PIQA | Arc-e | Arc-c | NQ | TriQA | HumanE | MBPP | Math | GSM8K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 7B | 44.40% | 77.10% | 69.50% | 77.90% | 68.70% | 43.20% | 17.50% | 56.60% | 11.60% | 26.10% | 3.90% | 16.00% |
| LLaMA 2 13B | 13B | 55.60% | 80.70% | 72.90% | 80.80% | 75.20% | 48.80% | 16.70% | 64.00% | 18.90% | 35.40% | 6.00% | 34.30% |
| LLaMA 1 33B | 33B | 56.80% | 83.70% | 76.20% | 82.20% | 79.60% | 54.40% | 24.10% | 68.50% | 25.00% | 40.90% | 8.40% | 44.10% |
| LLaMA 2 70B | 70B | 69.90% | 85.40% | 80.40% | 82.60% | 79.90% | 56.50% | 25.40% | 73.00% | 29.30% | 49.80% | 13.80% | 69.60% |
| Mistral 7B | 7B | 62.50% | 81.00% | 74.20% | 82.20% | 80.50% | 54.90% | 23.20% | 62.50% | 26.20% | 50.20% | 12.70% | 50.00% |
| Mixtral 8x7B | 12B | 70.60% | 84.40% | 77.20% | 83.60% | 83.10% | 59.70% | 30.60% | 71.50% | 40.20% | 60.70% | 28.40% | 74.40% |
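
To make the comparison in the table above easier to read, the following minimal sketch computes the per-benchmark gap between Mixtral 8x7B and LLaMA 2 70B in percentage points. The scores are copied from the two corresponding table rows; the names `BENCHMARKS`, `score_deltas`, `wins`, and `losses` are illustrative and not part of any published evaluation code.

```python
# Scores (in percent) copied from the LLaMA 2 70B and Mixtral 8x7B rows above.
BENCHMARKS = ["MMLU", "HellaS", "WinoG", "PIQA", "Arc-e", "Arc-c",
              "NQ", "TriQA", "HumanE", "MBPP", "Math", "GSM8K"]
LLAMA2_70B = [69.9, 85.4, 80.4, 82.6, 79.9, 56.5,
              25.4, 73.0, 29.3, 49.8, 13.8, 69.6]
MIXTRAL_8X7B = [70.6, 84.4, 77.2, 83.6, 83.1, 59.7,
                30.6, 71.5, 40.2, 60.7, 28.4, 74.4]


def score_deltas(names, baseline, candidate):
    """Return {benchmark: candidate - baseline} in percentage points."""
    return {n: round(c - b, 1) for n, b, c in zip(names, baseline, candidate)}


deltas = score_deltas(BENCHMARKS, LLAMA2_70B, MIXTRAL_8X7B)
wins = [name for name, d in deltas.items() if d > 0]      # Mixtral ahead
losses = [name for name, d in deltas.items() if d < 0]    # LLaMA 2 70B ahead

if __name__ == "__main__":
    for name, d in deltas.items():
        print(f"{name:7s} {d:+.1f}")
    print(f"Mixtral ahead on {len(wins)} of {len(BENCHMARKS)} benchmarks")
```

On these numbers Mixtral 8x7B leads on 9 of the 12 benchmarks while using 12B active parameters against 70B, and trails only on HellaS, WinoG, and TriQA.
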
Bias and Hallucination Benchmarks
| Metric | LLaMA 2 70B | Mixtral 8x7B |
|---|---|---|
| BBQ (higher is better) | 51.50% | 55.98% |
| BOLD (std) (lower is better) | 0.094 | 0.084 |
| - Gender | 0.073 | 0.045 |
| - Profession | 0.073 | 0.087 |
| - Religious Ideology | 0.133 | 0.089 |
| - Political Ideology | 0.140 | 0.146 |
| - Race | 0.049 | 0.052 |
Multilingual Performance
| Model | Active Params | French Arc-c | French HellaS | French MMLU | German Arc-c | German HellaS | German MMLU | Spanish Arc-c | Spanish HellaS | Spanish MMLU | Italian Arc-c | Italian HellaS | Italian MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 1 33B | 33B | 39.30% | 68.10% | 49.90% | 41.10% | 63.30% | 48.70% | 45.70% | 69.80% | 52.30% | 42.90% | 65.40% | 49.00% |
| LLaMA 2 70B | 70B | 49.90% | 72.50% | 64.30% | 47.30% | 68.70% | 64.20% | 50.50% | 74.50% | 66.00% | 49.40% | 70.90% | 65.10% |
| Mixtral 8x7B | 12B | 58.20% | 77.40% | 70.90% | 54.30% | 73.00% | 71.50% | 55.40% | 77.60% | 72.50% | 52.80% | 75.10% | 70.90% |
Benchmarks Glossary
- MMLU: Multiple-choice questions across 57 topics to evaluate knowledge and reasoning.
- HellaSwag (HellaS): Commonsense reasoning benchmark in which the model picks the most plausible continuation of a short scenario.
- ARC Challenge (Arc-c): The Challenge set of the AI2 Reasoning Challenge, made up of the harder grade-school science questions.
- Arc-e: The Easy set of the AI2 Reasoning Challenge, containing the more straightforward questions.
- WinoGrande (WinoG): Test for commonsense reasoning with ambiguous pronoun resolution.
- MBPP: Mostly Basic Python Problems; pass@1 measures how often generated Python code passes the tests in a single attempt.
- GSM-8K: Grade school math problems requiring logical reasoning and calculations.
- MT Bench (for Instruct Models): Multi-turn conversation benchmark for instruction-following models, scored on a 1 to 10 scale rather than as a percentage.
- PIQA: Physical interaction question-answering benchmark for commonsense reasoning.
- NQ: Natural Questions, evaluating open-domain question-answering capabilities.
- TriQA: TriviaQA, a question-answering benchmark built from trivia questions.
- HumanE: HumanEval, a Python code-generation benchmark in which generated functions are checked against unit tests.
- Math: The MATH benchmark of competition-style mathematics problems.
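
The pass@1 metric used for MBPP and HumanEval can be illustrated with the pass@k estimator that was introduced alongside HumanEval: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly drawn samples is correct. This is a minimal sketch under that assumption; the results above do not state the exact sampling setup behind the reported scores, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the fraction of correct samples, c / n.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then the mean of this value over all problems. For k = 1 that is simply the average fraction of correct samples per problem.
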
Bias and Hallucination Benchmarks
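- BBQ: Bias Benchmark for QA, which measures social bias in question answering across categories such as gender and religion.
- BOLD: Bias in Open-Ended Language Generation Dataset, which evaluates bias in generated text across gender, profession, religious ideology, political ideology, and race.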
