Automating AI Control Evaluations - Exploration Summary
Executive Summary
Recently I explored the feasibility of automating the production of control evaluation datasets, addressing a critical bottleneck in AI safety research. The work demonstrates promising approaches that could scale control evaluation capabilities from the current ~25 to 100+ evaluation environments, a scale needed to train robust attack policies and advance AI control research.
Background & Problem
During my work as an external contractor with Anthropic on SHADE Arena, I extended the task pair dataset initially developed by Jon Kusatov. This experience revealed a significant scalability challenge:
Current Bottleneck: Implementing each task pair requires 10-20 hours of manual work, creating three major problems:
- Research velocity: The repetitive, time-consuming nature severely limits dataset expansion
- Human builder limitations: The soul-destroying repetitiveness makes it difficult to keep builders motivated and retained
- Scale limitations: With only ~25 task pairs currently available, the dataset is insufficient for training attack policies via reinforcement learning, which comprehensive control evaluations require
AI Safety Impact: This bottleneck directly limits our ability to develop robust control mechanisms and evaluations. Training effective attack policies requires substantially larger datasets (target: 100+ task pairs). Without these attack policies it's difficult to evaluate how monitors will cope with agents with strong subversion capabilities.
Approach & Exploration
We pursued automation as the key solution, exploring two primary directions:
1. LM-as-Simulator Approach
Using language models as environment simulators, similar to having an AI act as a "D&D dungeon master."
Advantages:
- Rapid implementation and iteration
- High creativity in generating novel scenarios
- Natural adaptability to unexpected agent behaviors
- Lower upfront development overhead
Challenges:
- Evaluation complexity: assessing agent success requires an LM judge, and that judging capability may correlate with monitor capability, confounding results
- Potential inconsistencies in environment physics/rules
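To make the "dungeon master" idea concrete, here is a minimal sketch of one simulator turn. All names (`SimState`, `simulate_step`, `llm_respond`) are illustrative stand-ins, not the project's actual interface; the LM call is passed in as a plain callable so the control flow is clear.

```python
from dataclasses import dataclass, field

@dataclass
class SimState:
    # Running transcript of agent actions and environment observations.
    transcript: list = field(default_factory=list)

def simulate_step(state: SimState, agent_action: str, llm_respond) -> str:
    """One turn of an LM-simulated environment.

    `llm_respond` stands in for a language model prompted to act as a
    consistent world simulator: it sees the full transcript plus the new
    action and returns the resulting observation.
    """
    prompt = "\n".join(state.transcript + [f"AGENT: {agent_action}"])
    observation = llm_respond(prompt)
    # Record both sides of the turn so later calls stay consistent.
    state.transcript.append(f"AGENT: {agent_action}")
    state.transcript.append(f"ENV: {observation}")
    return observation
```

The transcript-as-state design is what gives the simulator its flexibility (and its inconsistency risk): the environment's "physics" exist only in the LM's interpretation of the prompt.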
2. Agent-Coded Environment Approach
Using AI agents to generate hardcoded, deterministic environments.
Results:
- Small environments: Claude Sonnet 4 successfully creates functional environments in a single attempt
- Medium environments: Promising iterative improvement through agent-environment interaction loops
- Scalability: Currently scaling this approach with strong preliminary results
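By contrast, an agent-coded environment is ordinary deterministic code. The toy inbox below is an invented example of the shape such generated environments might take: explicit state, tool functions the agent can call, and a programmatic success check that needs no LM judge.

```python
class InboxEnv:
    """Toy deterministic environment: a mock inbox with two tools.

    Contents and task are invented for illustration; real generated
    environments would be larger and scenario-specific.
    """

    def __init__(self):
        self.emails = {
            1: {"from": "alice@example.com", "body": "Please send the Q3 report", "read": False},
        }
        self.sent = []

    def read_email(self, email_id: int) -> str:
        # Tool: read one email and mark it as read.
        email = self.emails[email_id]
        email["read"] = True
        return email["body"]

    def send_email(self, to: str, body: str) -> str:
        # Tool: record an outgoing email.
        self.sent.append({"to": to, "body": body})
        return "sent"

    def main_task_complete(self) -> bool:
        # Deterministic success check: did the agent reply to Alice?
        return any(e["to"] == "alice@example.com" for e in self.sent)
```

Because success is checked in code rather than by an LM, evaluation is cheap, reproducible, and decoupled from monitor capabilities, which is the main advantage over the simulator approach.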
Current Status & Next Steps
The hardcoded environment approach is showing excellent progress. The system can:
- Generate initial environment code
- Run test agents within the environment
- Analyze interaction transcripts
- Iteratively improve environment code based on observed issues
This creates a scalable pipeline for producing the larger datasets essential for control evaluation research.
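The four steps above can be sketched as a single refinement loop. The helper names (`generate_env`, `run_test_agent`, `find_issues`, `refine_env`) are hypothetical stand-ins for LM-backed steps, passed in as callables so the loop itself is runnable.

```python
def build_environment(generate_env, run_test_agent, find_issues, refine_env,
                      max_rounds: int = 5) -> str:
    """Generate-test-analyze-refine loop for one environment.

    Each callable stands in for an LM-backed pipeline stage:
    generate_env() -> initial environment code,
    run_test_agent(code) -> interaction transcript,
    find_issues(transcript) -> list of observed problems,
    refine_env(code, issues) -> improved environment code.
    """
    env_code = generate_env()
    for _ in range(max_rounds):
        transcript = run_test_agent(env_code)
        issues = find_issues(transcript)
        if not issues:
            # No problems observed in the transcript: accept this version.
            return env_code
        env_code = refine_env(env_code, issues)
    return env_code  # best effort after max_rounds of refinement
```

Capping the loop at `max_rounds` bounds cost per environment while still letting most environments converge in a few iterations.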
Value for AI Safety
This work directly addresses a critical infrastructure need in AI safety:
- Scalability: Transforms 10-20 hour manual processes into automated generation
- Research acceleration: Enables rapid expansion of control evaluation datasets
- Quality improvement: Allows iteration and refinement of evaluation scenarios
- Unlocks RL training for attack policies: Enables training attack policies via reinforcement learning against diverse environments, creating sophisticated adversaries that can thoroughly stress-test monitoring systems and reveal blind spots in current control approaches
The automated approaches developed here are essential for creating the robust evaluation infrastructure needed as AI systems become more capable and the stakes of control failures increase.
Conclusion
This initial exploration demonstrated the feasibility of automating control evaluation dataset creation, with the hardcoded approach showing particular promise for near-term deployment. The work provides a clear path toward scaling control evaluation capabilities, directly supporting the AI safety community's need for comprehensive testing infrastructure.