    Mistral/Mixtral-8x7B

    Model Card

    Mixtral 8x7B matches or outperforms Llama 2 70B and GPT-3.5 on most of the benchmarks reported below.

    General Performance Benchmarks

    Benchmark                      | LLaMA 2 70B | GPT-3.5 | Mixtral 8x7B
    -------------------------------|-------------|---------|-------------
    MMLU (MCQ in 57 subjects)      | 69.90%      | 70.00%  | 70.60%
    HellaSwag (10-shot)            | 87.10%      | 85.50%  | 86.70%
    ARC Challenge (25-shot)        | 85.10%      | 85.20%  | 85.80%
    WinoGrande (5-shot)            | 83.20%      | 81.60%  | 81.20%
    MBPP (pass@1)                  | 49.80%      | 52.20%  | 60.70%
    GSM-8K (5-shot)                | 53.60%      | 57.10%  | 58.40%
    MT Bench (for Instruct Models) | 6.86        | 8.32    | 8.3
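
    As a quick sanity check on the comparison above, the sketch below counts how many rows of the general table each model leads. The scores are copied from that table; the names SCORES, best_model, and count_wins are illustrative and not part of any Mistral tooling.

```python
# Scores copied from the General Performance Benchmarks table
# (percentages, except MT Bench, which is a score out of 10).
SCORES = {
    "MMLU": {"LLaMA 2 70B": 69.9, "GPT-3.5": 70.0, "Mixtral 8x7B": 70.6},
    "HellaSwag": {"LLaMA 2 70B": 87.1, "GPT-3.5": 85.5, "Mixtral 8x7B": 86.7},
    "ARC Challenge": {"LLaMA 2 70B": 85.1, "GPT-3.5": 85.2, "Mixtral 8x7B": 85.8},
    "WinoGrande": {"LLaMA 2 70B": 83.2, "GPT-3.5": 81.6, "Mixtral 8x7B": 81.2},
    "MBPP": {"LLaMA 2 70B": 49.8, "GPT-3.5": 52.2, "Mixtral 8x7B": 60.7},
    "GSM-8K": {"LLaMA 2 70B": 53.6, "GPT-3.5": 57.1, "Mixtral 8x7B": 58.4},
    "MT Bench": {"LLaMA 2 70B": 6.86, "GPT-3.5": 8.32, "Mixtral 8x7B": 8.3},
}


def best_model(row):
    """Return the model with the highest score in one benchmark row."""
    return max(row, key=row.get)


def count_wins(scores):
    """Count, per model, the number of benchmarks on which it scores highest."""
    wins = {}
    for row in scores.values():
        winner = best_model(row)
        wins[winner] = wins.get(winner, 0) + 1
    return wins


wins = count_wins(SCORES)
for model, n in sorted(wins.items(), key=lambda item: -item[1]):
    print(f"{model}: leads on {n} of {len(SCORES)} benchmarks")
```

    A win count ignores the size of each margin, so it complements the table rather than replacing it.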

    Active Parameters and Detailed Performance

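    Model        | Active Params | MMLU   | HellaS | WinoG  | PIQA   | Arc-e  | Arc-c  | NQ     | TriQA  | HumanE | MBPP   | Math   | GSM8K
    -------------|---------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------
    LLaMA 2 7B   | 7B            | 44.40% | 77.10% | 69.50% | 77.90% | 68.70% | 43.20% | 17.50% | 56.60% | 11.60% | 26.10% | 3.90%  | 16.00%
    LLaMA 2 13B  | 13B           | 55.60% | 80.70% | 72.90% | 80.80% | 75.20% | 48.80% | 16.70% | 64.00% | 18.90% | 35.40% | 6.00%  | 34.30%
    LLaMA 1 33B  | 33B           | 56.80% | 83.70% | 76.20% | 82.20% | 79.60% | 54.40% | 24.10% | 68.50% | 25.00% | 40.90% | 8.40%  | 44.10%
    LLaMA 2 70B  | 70B           | 69.90% | 85.40% | 80.40% | 82.60% | 79.90% | 56.50% | 25.40% | 73.00% | 29.30% | 49.80% | 13.80% | 69.60%
    Mistral 7B   | 7B            | 62.50% | 81.00% | 74.20% | 82.20% | 80.50% | 54.90% | 23.20% | 62.50% | 26.20% | 50.20% | 12.70% | 50.00%
    Mixtral 8x7B | 12B           | 70.60% | 84.40% | 77.20% | 83.60% | 83.10% | 59.70% | 30.60% | 71.50% | 40.20% | 60.70% | 28.40% | 74.40%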

    Bias and Hallucination Benchmarks

    Metric                       | Llama 2 70B | Mixtral 8x7B
    -----------------------------|-------------|-------------
    BBQ (higher is better)       | 51.50%      | 55.98%
    BOLD (std) (lower is better) | 0.094       | 0.084
    - Gender                     | 0.073       | 0.045
    - Profession                 | 0.073       | 0.087
    - Religious Ideology         | 0.133       | 0.089
    - Political Ideology         | 0.14        | 0.146
    - Race                       | 0.049       | 0.052
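
    The page does not say how the aggregate BOLD (std) row is derived. It appears consistent with an unweighted mean of the five category rows, and the sketch below checks that reading. The aggregation rule is an assumption, and the variable names are illustrative.

```python
from statistics import mean

# Per-category BOLD values copied from the table above, in the order
# Gender, Profession, Religious Ideology, Political Ideology, Race.
BOLD_BY_CATEGORY = {
    "Llama 2 70B": [0.073, 0.073, 0.133, 0.14, 0.049],
    "Mixtral 8x7B": [0.045, 0.087, 0.089, 0.146, 0.052],
}

# Aggregate "BOLD (std)" row as reported in the table.
REPORTED = {"Llama 2 70B": 0.094, "Mixtral 8x7B": 0.084}

# Assumption: the aggregate is the unweighted mean of the category values,
# rounded to three decimals.
averages = {
    model: round(mean(values), 3)
    for model, values in BOLD_BY_CATEGORY.items()
}

for model, avg in averages.items():
    print(f"{model}: mean of categories = {avg}, reported = {REPORTED[model]}")
```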

    Multilingual Performance

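    Model        | Active Params | French Arc-c | French HellaS | French MMLU | German Arc-c | German HellaS | German MMLU | Spanish Arc-c | Spanish HellaS | Spanish MMLU | Italian Arc-c | Italian HellaS | Italian MMLU
    -------------|---------------|--------------|---------------|-------------|--------------|---------------|-------------|---------------|----------------|--------------|---------------|----------------|-------------
    LLaMA 1 33B  | 33B           | 39.30%       | 68.10%        | 49.90%      | 41.10%       | 63.30%        | 48.70%      | 45.70%        | 69.80%         | 52.30%       | 42.90%        | 65.40%         | 49.00%
    LLaMA 2 70B  | 70B           | 49.90%       | 72.50%        | 64.30%      | 47.30%       | 68.70%        | 64.20%      | 50.50%        | 74.50%         | 66.00%       | 49.40%        | 70.90%         | 65.10%
    Mixtral 8x7B | 12B           | 58.20%       | 77.40%        | 70.90%      | 54.30%       | 73.00%        | 71.50%      | 55.40%        | 77.60%         | 72.50%       | 52.80%        | 75.10%         | 70.90%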

    Benchmarks Glossary

    • MMLU: Multiple-choice questions across 57 topics to evaluate knowledge and reasoning.
    • HellaSwag (HellaS): Commonsense reasoning benchmark in which the model chooses the most plausible continuation of a passage; the general table reports 10-shot results.
    • ARC Challenge (Arc-c): Challenge set of the AI2 Reasoning Challenge, made up of the harder grade-school science questions.
    • Arc-e: Easy set of the AI2 Reasoning Challenge, made up of the simpler grade-school science questions.
    • WinoGrande (WinoG): Test for commonsense reasoning with ambiguous pronoun resolution.
    • MBPP: Entry-level Python programming problems; pass@1 measures how often a single generated solution passes the tests.
    • GSM-8K: Grade school math problems requiring logical reasoning and calculations.
    • MT Bench (for Instruct Models): Multi-turn conversation benchmark for instruction-tuned models, scored on a scale of 1 to 10.
    • PIQA: Physical interaction question-answering benchmark for commonsense reasoning.
    • NQ: Natural Questions, evaluating open-domain question-answering capabilities.
    • TriQA: TriviaQA, a trivia question-answering benchmark testing factual knowledge.
    • HumanE: HumanEval, a Python code generation benchmark in which generated functions are checked against unit tests.
    • Math: MATH, a benchmark of competition-level mathematics problems.
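
    The MBPP entry above is reported as pass@1. The page does not describe how that figure was computed; a commonly used approach is the unbiased pass@k estimator, sketched below under that assumption. The function name pass_at_k and the sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that pass all tests
    k: number of attempts the metric allows

    Uses the estimator 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that k samples drawn without replacement all fail.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

    For k = 1 the estimator reduces to the fraction of passing samples, c / n. A benchmark-level score is the average of this value over all problems.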

    Bias and Hallucination Benchmarks

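    • BBQ: Bias Benchmark for QA, measuring social bias in question answering across categories such as age, gender identity, and religion.
    • BOLD: Bias in Open-Ended Language Generation Dataset, evaluating bias in generated text across gender, profession, race, and religious and political ideology.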

    Metadata

    32,768 tokens
    $0.6 per million
    $0.6 per million
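
    The figures above lost their labels on this page. The sketch below assumes that 32,768 tokens is the context window, that the two prices are the input and output rates in US dollars per million tokens, and that prompt and completion share the context window. All three readings are assumptions, and estimate_cost_usd is a hypothetical helper.

```python
# Values copied from the metadata above; the labels are assumptions.
CONTEXT_WINDOW_TOKENS = 32_768
INPUT_USD_PER_MILLION_TOKENS = 0.6   # assumed: first price is the input rate
OUTPUT_USD_PER_MILLION_TOKENS = 0.6  # assumed: second price is the output rate


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request under the assumed pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: input and output tokens together must fit in the context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("request exceeds the assumed context window")
    cost = (
        input_tokens * INPUT_USD_PER_MILLION_TOKENS
        + output_tokens * OUTPUT_USD_PER_MILLION_TOKENS
    )
    return cost / 1_000_000


# Hypothetical request: 30,000 prompt tokens and 2,000 completion tokens.
print(f"${estimate_cost_usd(30_000, 2_000):.4f}")
```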