The model incorporates safety and alignment through chain-of-thought reasoning, which provides:
- Observable and legible thinking processes
- Improved robustness in handling out-of-distribution scenarios
- Enhanced performance on jailbreak evaluations
- Better adherence to safety refusal boundaries

Safety Testing
- Comprehensive safety testing conducted pre-deployment
- Red-team evaluations performed in accordance with the Preparedness Framework
- Instances of reward hacking identified and documented
- Detailed evaluation results available in the System Card