    ahmadawais/o1-reasoning-agent

    About

    OpenAI o1 reasoning agent that thinks and reasons over writing React code exactly as asked.

    Meta

    No variables defined in the prompt.

    Tools

    No tools added to the Pipe.

    OpenAI o1 Reasoning Agent

    OpenAI o1 model

    o1 is in beta. Access is limited to developers on usage tier 5.

    OpenAI o1 series models are new large language models trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, and can produce a long internal chain of thought before responding to the user. o1 models excel in scientific reasoning, ranking in the 89th percentile on competitive programming questions (Codeforces), placing among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeding human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

    o1-preview is an early preview of OpenAI's o1 model, designed to reason about hard problems using broad general knowledge about the world. For complex reasoning tasks, this is a significant advancement and represents a new level of AI capability. Given this, OpenAI is resetting the counter back to 1 and naming this series OpenAI o1.

    As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases, GPT-4o will be more capable in the near term.
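
    Since this Pipe is about generating React code, here is a minimal sketch of calling the o1-preview model directly through the OpenAI Chat Completions API. It assumes the official openai Node SDK (a recent version that accepts max_completion_tokens), an OPENAI_API_KEY in the environment, and a hypothetical prompt; it is an illustration, not this Pipe's actual configuration.

    ```ts
    // Minimal sketch: ask o1-preview to write a React component.
    // Assumes the official `openai` Node SDK and OPENAI_API_KEY in the environment.
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from process.env

    async function main() {
      const completion = await client.chat.completions.create({
        model: "o1-preview",
        // o1 beta models do not accept a system message, so all instructions
        // go in the user message; the model reasons internally before answering.
        messages: [
          {
            role: "user",
            content:
              "Write a React component <Counter /> in TypeScript that renders a " +
              "button and increments a count on each click. Use hooks only.",
          },
        ],
        // Budget covers internal reasoning tokens plus the visible output.
        max_completion_tokens: 4096,
      });

      console.log(completion.choices[0].message.content);
    }

    main().catch(console.error);
    ```

    Reasoning tokens count against max_completion_tokens, so the budget should be generous.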

    [Image: OpenAI scorecard rating]

    Key Features

    • Enhanced Reasoning: Trained to spend more time thinking before responding, solving harder problems in science, coding, and math.
    • Advanced Problem-Solving: Excels in challenging tasks like physics, chemistry, biology, and complex mathematics.
    • Performance in Math and Coding: Scored 83% on an International Mathematics Olympiad (IMO) qualifying exam and reached the 89th percentile in Codeforces coding competitions.
    • Improved Safety Features: Uses reasoning to follow safety and alignment rules, scoring 84 out of 100 on one of OpenAI's hardest jailbreaking tests (compared to 22 for GPT-4o).
    • AI Safety Partnerships: Collaborating with U.S. and U.K. AI Safety Institutes, providing early access for research, evaluation, and safety testing.
    • Target Users: Designed for researchers and developers tackling complex problems in fields like healthcare, quantum physics, and multi-step workflows.

    Langbase Recommendations

    1. STEM Developers: Ideal for building applications that require mathematical reasoning or multi-step workflows.
    2. Researchers in Science and Math: Great for generating complex formulas and analyzing data in fields like quantum physics or biology.
    3. Data Analysts: Suitable for data-heavy tasks requiring advanced reasoning and problem-solving capabilities.
    4. AI Developers: Cost-efficient option for creating AI applications focused on coding, math, and other STEM areas.
    5. Workflow Automation: Beneficial for those who need to build and execute multi-step workflows, particularly in data analysis, machine learning, and software development (a rough two-step sketch follows this list).
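
    As a rough illustration of the multi-step workflow use case above, the sketch below chains two o1-preview calls: one to draft a plan, one to implement it. The helper name ask and both prompts are hypothetical, and it again assumes the official openai Node SDK with an OPENAI_API_KEY in the environment.

    ```ts
    // Hypothetical two-step workflow: plan first, then implement the plan.
    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from process.env

    // Small helper that sends one user message to o1-preview and returns the reply text.
    async function ask(prompt: string): Promise<string> {
      const res = await client.chat.completions.create({
        model: "o1-preview",
        messages: [{ role: "user", content: prompt }],
      });
      return res.choices[0].message.content ?? "";
    }

    async function main() {
      // Step 1: let the model reason about an implementation plan.
      const plan = await ask(
        "Outline a step-by-step plan for a React <TodoList /> component with add, " +
          "toggle, and delete actions. Plan only, no code."
      );

      // Step 2: feed the plan back in and ask for the implementation.
      const code = await ask(
        `Implement this plan as a single TypeScript React component:\n\n${plan}`
      );

      console.log(code);
    }

    main().catch(console.error);
    ```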

    Transparency

    Chain-of-thought reasoning provides new opportunities for alignment and safety. OpenAI found that integrating their policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles.

    By teaching the model safety rules and how to reason about them in context, OpenAI found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and their hardest internal benchmarks for evaluating model safety refusal boundaries.

    OpenAI believes that using a chain of thought offers significant advances for safety and alignment because:

    1. It enables them to observe the model thinking in a legible way.
    2. The model reasoning about safety rules is more robust to out-of-distribution scenarios.

    To stress-test these improvements, OpenAI conducted a suite of safety tests and red-teaming before deployment, in accordance with their Preparedness Framework. They found that chain-of-thought reasoning contributed to capability improvements across evaluations. Notably, they observed instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.

    Benchmarks

    The OpenAI o1 series has been tested across a wide range of benchmarks, demonstrating state-of-the-art performance in multiple domains. Here are the results (more details here):

    Benchmark                                     GPT-4o    o1-preview    o1       Expert Human
    Competition Math (AIME 2024)                  13.4%     56.7%         83.3%    -
    Competition Code (Codeforces)                 11.0%     62.0%         89.0%    -
    PhD-Level Science Questions (GPQA Diamond)    56.1%     78.3%         78.0%    69.7%

    ML Benchmarks

    Benchmark               GPT-4o    o1
    MATH                    60.3      94.8
    MathVista (testmini)    63.8      73.2
    MMMU (val)              69.1      78.1
    MMLU                    88.0      92.3

    PhD-Level Science Questions (GPQA Diamond)

    Subcategory    GPT-4o    o1
    Chemistry      40.2      64.7
    Physics        59.5      92.8
    Biology        61.6      69.2

    Benchmarks Glossary

    • Competition Math (AIME 2024): Measures accuracy in advanced math problems.
    • Competition Code (Codeforces): Evaluates programming skills using Elo ratings.
    • PhD-Level Science Questions (GPQA Diamond): Assesses performance on complex science questions.
    • MATH: Benchmark for solving mathematical problems.
    • MathVista (testmini): Tests mathematical reasoning in visual contexts.
    • MMMU (val): Evaluates understanding across various multi-modal tasks.
    • MMLU: Measures general knowledge and language understanding across a wide range of subjects.
    • Chemistry/Physics/Biology: PhD-level science problem-solving ability.