OpenAI o1 reasoning agent that thinks and reasons before writing React code, exactly as asked.
OpenAI o1 Reasoning Agent
o1 is in beta. Access is limited to developers on usage tier 5.
OpenAI o1 series models are new large language models trained with reinforcement learning to perform complex reasoning. o1 models think before they answer, and can produce a long internal chain of thought before responding to the user. o1 models excel in scientific reasoning, ranking in the 89th percentile on competitive programming questions (Codeforces), placing among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeding human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
o1-preview is an early preview of OpenAI's o1 model, designed to reason about hard problems using broad general knowledge about the world. For complex reasoning tasks, this is a significant advancement and represents a new level of AI capability. Given this, OpenAI is resetting the counter back to 1 and naming this series OpenAI o1.
As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases, GPT-4o will be more capable in the near term.
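To make the Pipe's role concrete, here is a minimal sketch of the kind of request it makes: a direct call to o1-preview through the OpenAI Node SDK in TypeScript. The helper name, prompt wording, and component spec are illustrative assumptions, not the Pipe's actual configuration; inside Langbase you would run the Pipe itself rather than call OpenAI directly.

```typescript
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

// Hypothetical helper: ask o1-preview to write exactly the React component described.
async function writeReactComponent(spec: string): Promise<string> {
  // o1-preview is in beta: system messages, streaming, and custom temperature
  // are not supported, so all instructions go into a single user message.
  const completion = await openai.chat.completions.create({
    model: "o1-preview",
    messages: [
      {
        role: "user",
        content: `Write only the React component described below, exactly as asked.\n\n${spec}`,
      },
    ],
  });
  return completion.choices[0].message.content ?? "";
}

// Example usage with an illustrative spec.
writeReactComponent("A <Counter /> component with increment and reset buttons.")
  .then(console.log)
  .catch(console.error);
```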
Key Features
Enhanced Reasoning: Trained to spend more time thinking before responding, solving harder problems in science, coding, and math.
Advanced Problem-Solving: Excels in challenging tasks like physics, chemistry, biology, and complex mathematics.
Performance in Math and Coding: Scored 83% on International Mathematics Olympiad (IMO) qualifiers and reached the 89th percentile in Codeforces coding competitions.
Improved Safety Features: Utilizes reasoning to follow safety and alignment rules, scoring 84 on difficult jailbreaking tests (compared to 22 by GPT-4o).
AI Safety Partnerships: Collaborating with U.S. and U.K. AI Safety Institutes, providing early access for research, evaluation, and safety testing.
Target Users: Designed for researchers and developers tackling complex problems in fields like healthcare, quantum physics, and multi-step workflows.
Langbase Recommendations
STEM Developers: Ideal for building applications that require mathematical reasoning or multi-step workflows.
Researchers in Science and Math: Great for generating complex formulas and analyzing data in fields like quantum physics or biology.
Data Analysts: Suitable for data-heavy tasks requiring advanced reasoning and problem-solving capabilities.
AI Developers: Cost-efficient option for creating AI applications focused on coding, math, and other STEM areas.
Workflow Automation: Beneficial for those who need to build and execute multi-step workflows, particularly in data analysis, machine learning, and software development (see the sketch after this list).
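As a rough sketch of the multi-step workflow pattern mentioned above (not a Langbase API; the planning prompt, step parsing, and model choice are assumptions), one way to chain o1-preview calls with the OpenAI Node SDK:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Illustrative two-phase workflow: o1-preview first drafts a numbered plan,
// then each step is carried out with a follow-up call.
async function runWorkflow(task: string): Promise<string[]> {
  const plan = await openai.chat.completions.create({
    model: "o1-preview",
    messages: [
      { role: "user", content: `Break this task into 3 short numbered steps:\n${task}` },
    ],
  });

  // Keep only lines that look like "1. ...", "2. ...", etc.
  const steps = (plan.choices[0].message.content ?? "")
    .split("\n")
    .filter((line) => /^\d+\./.test(line.trim()));

  const results: string[] = [];
  for (const step of steps) {
    const completion = await openai.chat.completions.create({
      model: "o1-preview",
      messages: [
        { role: "user", content: `Carry out this step and report the result:\n${step}` },
      ],
    });
    results.push(completion.choices[0].message.content ?? "");
  }
  return results;
}

// Example: runWorkflow("Profile a slow React list component and propose a fix.");
```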
Transparency
Chain-of-thought reasoning provides new opportunities for alignment and safety. OpenAI found that integrating their policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles.
By teaching the model safety rules and how to reason about them in context, OpenAI found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and their hardest internal benchmarks for evaluating model safety refusal boundaries.
OpenAI believes that using a chain of thought offers significant advances for safety and alignment because:
It enables them to observe the model thinking in a legible way.
The model reasoning about safety rules is more robust to out-of-distribution scenarios.
To stress-test these improvements, OpenAI conducted a suite of safety tests and red-teaming before deployment, in accordance with their Preparedness Framework. They found that chain-of-thought reasoning contributed to capability improvements across evaluations. Notably, they observed instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.
Benchmarks
OpenAI o1-preview has been tested across a wide range of benchmarks, demonstrating state-of-the-art performance in multiple domains. Here are the results:
| Benchmark | GPT-4o | o1-preview | o1 | Expert Human |
|---|---|---|---|---|
| Competition Math (AIME 2024) | 13.40% | 56.70% | 83.30% | - |
| Competition Code (Codeforces) | 11.00% | 62.00% | 89.00% | - |
| PhD-Level Science Questions (GPQA Diamond) | 56.10% | 78.30% | 78.00% | 69.70% |
ML Benchmarks
| Benchmark | GPT-4o | o1 |
|---|---|---|
| MATH | 60.3 | 94.8 |
| MathVista (testmini) | 63.8 | 73.2 |
| MMMU (val) | 69.1 | 78.1 |
| MMLU | 88 | 92.3 |
PhD-Level Science Questions (GPQA Diamond)
| Subject | GPT-4o | o1 |
|---|---|---|
| Chemistry | 40.2 | 64.7 |
| Physics | 59.5 | 92.8 |
| Biology | 61.6 | 69.2 |
Benchmarks Glossary
Competition Math (AIME 2024): Measures accuracy in advanced math problems.
Competition Code (Codeforces): Evaluates programming skills using Elo ratings.