    Mistral/Mixtral-8x7B

    Model Card

    Mixtral 8x7B is a high-quality sparse mixture of experts model (SMoE) with open weights. It outperforms Llama-2-70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT-3.5 on most standard benchmarks.

    Capabilities

    • Mixtral gracefully handles a context of 32k tokens.
    • It handles English, French, Italian, German, and Spanish.
    • It shows strong performance in code generation.
    • It can be fine-tuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.

    Pushing the Frontier of Open Models with Sparse Architectures

    Mixtral is a sparse mixture-of-experts network. It is a decoder-only model in which the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combines their outputs additively.
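The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mixtral's implementation: the function names `top2_route` and `moe_layer` are hypothetical, the experts are stand-in functions, and weighting the two outputs with a softmax over the selected router logits is an assumption that this page does not state.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and return (index, weight) pairs."""
    # Indices of the two largest router logits (ties keep the lower index first).
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Assumption: weights come from a softmax over the two selected logits,
    # so they are positive and sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(token, router_logits, experts):
    """Run only the two chosen experts and add their weighted outputs."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy setup: 8 stand-in experts, where expert k multiplies its input by k + 1.
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
# Experts 1 and 4 share the highest logit, so each gets weight 0.5.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top2_route(logits)
result = moe_layer([1.0, -2.0], logits, experts)
print(routes, result)
```

Only two of the eight expert functions are ever called for a given token, which is where the per-token compute saving comes from.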

    This technique increases the number of parameters of a model while controlling cost and latency, because the model uses only a fraction of its total parameters for each token. Concretely, Mixtral has 46.7B total parameters but uses only 12.9B parameters per token. It therefore processes input and generates output at the same speed and for the same cost as a 12.9B model.
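The two figures above also hint at why the total is 46.7B rather than 8 × 7B = 56B. As a rough back-of-the-envelope sketch, assume every parameter is either shared by all tokens (attention, embeddings, routers) or belongs to one of 8 equally sized experts, of which 2 are active per token. That split is a simplifying assumption, not a breakdown given on this page.

```python
TOTAL_PARAMS_B = 46.7   # total parameters, in billions (from the text)
ACTIVE_PARAMS_B = 12.9  # parameters used per token, in billions (from the text)
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2

# Share of the model that is actually used for any single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Assumed model: total  = shared + 8 * expert
#                active = shared + 2 * expert
# Subtracting the two equations isolates the per-expert size.
expert_b = (TOTAL_PARAMS_B - ACTIVE_PARAMS_B) / (NUM_EXPERTS - ACTIVE_EXPERTS)
shared_b = ACTIVE_PARAMS_B - ACTIVE_EXPERTS * expert_b

print(f"active fraction: {active_fraction:.1%}")
print(f"per-expert parameters: ~{expert_b:.2f}B")
print(f"shared parameters: ~{shared_b:.2f}B")
```

Under these assumptions, a little over a quarter of the parameters are active per token, and the experts account for most of the total.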

    Mixtral is pre-trained on data extracted from the open Web; the experts and routers are trained simultaneously.

    Key Features

    • Sparse Mixture of Experts (SMoE) Architecture: Mixtral uses a sparse mixture-of-experts network, where the model's feedforward block selects from 8 distinct groups of parameters. This allows for increased parameter count while controlling costs and latency.
    • High Efficiency: Of its 46.7B total parameters, Mixtral uses only 12.9B per token, so it processes and generates text at the speed of a 12.9B-parameter model.
    • Context Handling: Capable of managing a context window of up to 32k tokens.
    • Multilingual Support: Proficient in English, French, German, Spanish, and Italian.
    • Code Generation: Demonstrates strong performance in generating code.
    • Instruction Following: The Mixtral 8x7B Instruct variant has been fine-tuned for instruction-following, scoring 8.3 on MT-Bench.

    Metadata

    Context window: 32,768 tokens
    $0.6 per million
    $0.6 per million
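As a rough illustration of how these figures translate into a per-request cost, here is a minimal sketch. The listing does not label the two prices, so treating them as the input and output price per million tokens is an assumption, as is counting both prompt and completion tokens against the 32,768-token window. The function name `estimate_cost_usd` is hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 32_768
# Assumption: the two listed prices are for input and output tokens.
INPUT_PRICE_PER_MILLION_USD = 0.6
OUTPUT_PRICE_PER_MILLION_USD = 0.6

def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request, rejecting requests that exceed the window."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: prompt and completion share the same context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("request does not fit in the context window")
    return (input_tokens * INPUT_PRICE_PER_MILLION_USD
            + output_tokens * OUTPUT_PRICE_PER_MILLION_USD) / 1_000_000

# A request that nearly fills the window costs about two cents.
cost = estimate_cost_usd(30_000, 2_000)
print(f"${cost:.4f}")
```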