Automating AI Control Evaluations - Exploration Summary
Executive Summary
Recently I explored the feasibility of automating the production of control evaluation datasets, addressing a critical bottleneck in AI safety research. The work demonstrates promising approaches that could scale control evaluation capabilities from the current ~25 to 100+ evaluation environments, a scale needed to train robust attack policies and advance AI control research.
Background & Problem
During my work as an external contractor with Anthropic on SHADE Arena, I extended the task pair dataset initially developed by Jon Kusatov. This experience revealed a significant scalability challenge:
Current Bottleneck: Implementing each task pair requires 10-20 hours of manual work, creating three major problems:
- Research velocity: The repetitive, time-consuming nature severely limits dataset expansion
- Human builder limitations: The soul-destroying repetitiveness makes it difficult to keep builders motivated and retained
- Scale limitations: With only ~25 task pairs currently available, the dataset is insufficient for training attack policies via reinforcement learning, which comprehensive control evaluations require
AI Safety Impact: This bottleneck directly limits our ability to develop robust control mechanisms and evaluations. Training effective attack policies requires substantially larger datasets (target: 100+ task pairs). Without these attack policies it's difficult to evaluate how monitors will cope with agents with strong subversion capabilities.
Approach & Exploration
We pursued automation as the key solution, exploring two primary directions:
1. LM-as-Simulator Approach
Using language models as environment simulators, similar to having an AI act as a "D&D dungeon master."
Advantages:
- Rapid implementation and iteration
- High creativity in generating novel scenarios
- Natural adaptability to unexpected agent behaviors
- Lower upfront development overhead
Challenges:
- Evaluation complexity: assessing agent success requires an LM judge, and that judging capability may correlate with monitor capability, confounding results
- Potential inconsistencies in environment physics/rules
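To make the "dungeon master" idea concrete, here is a minimal sketch of one simulator turn. All names (`SimState`, `simulate_step`, `llm_respond`) are illustrative stand-ins, not the project's actual interface; the LM call is passed in as a plain callable so the control flow is clear.

```python
from dataclasses import dataclass, field

@dataclass
class SimState:
    # Running transcript of agent actions and environment observations.
    transcript: list = field(default_factory=list)

def simulate_step(state: SimState, agent_action: str, llm_respond) -> str:
    """One turn of an LM-simulated environment.

    `llm_respond` stands in for a language model prompted to act as a
    consistent world simulator: it sees the full transcript plus the new
    action and returns the resulting observation.
    """
    prompt = "\n".join(state.transcript + [f"AGENT: {agent_action}"])
    observation = llm_respond(prompt)
    # Record both sides of the turn so later calls stay consistent.
    state.transcript.append(f"AGENT: {agent_action}")
    state.transcript.append(f"ENV: {observation}")
    return observation
```

The transcript-as-state design is what gives the simulator its flexibility (and its inconsistency risk): the environment's "physics" exist only in the LM's interpretation of the prompt.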
2. Agent-Coded Environment Approach
Using AI agents to generate hardcoded, deterministic environments.
Results:
- Small environments: Claude Sonnet 4 successfully creates functional environments in a single attempt
- Medium environments: Promising iterative improvement through agent-environment interaction loops
- Scalability: Currently scaling this approach with strong preliminary results
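By contrast, an agent-coded environment is ordinary deterministic code. The toy inbox below is an invented example of the shape such generated environments might take: explicit state, tool functions the agent can call, and a programmatic success check that needs no LM judge.

```python
class InboxEnv:
    """Toy deterministic environment: a mock inbox with two tools.

    Contents and task are invented for illustration; real generated
    environments would be larger and scenario-specific.
    """

    def __init__(self):
        self.emails = {
            1: {"from": "alice@example.com", "body": "Please send the Q3 report", "read": False},
        }
        self.sent = []

    def read_email(self, email_id: int) -> str:
        # Tool: read one email and mark it as read.
        email = self.emails[email_id]
        email["read"] = True
        return email["body"]

    def send_email(self, to: str, body: str) -> str:
        # Tool: record an outgoing email.
        self.sent.append({"to": to, "body": body})
        return "sent"

    def main_task_complete(self) -> bool:
        # Deterministic success check: did the agent reply to Alice?
        return any(e["to"] == "alice@example.com" for e in self.sent)
```

Because success is checked in code rather than by an LM, evaluation is cheap, reproducible, and decoupled from monitor capabilities, which is the main advantage over the simulator approach.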
Current Status & Next Steps
The hardcoded environment approach is showing excellent progress. The system can:
- Generate initial environment code
- Run test agents within the environment
- Analyze interaction transcripts
- Iteratively improve environment code based on observed issues
This creates a scalable pipeline for producing the larger datasets essential for control evaluation research.
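The four steps above can be sketched as a single refinement loop. The helper names (`generate_env`, `run_test_agent`, `find_issues`, `refine_env`) are hypothetical stand-ins for LM-backed steps, passed in as callables so the loop itself is runnable.

```python
def build_environment(generate_env, run_test_agent, find_issues, refine_env,
                      max_rounds: int = 5) -> str:
    """Generate-test-analyze-refine loop for one environment.

    Each callable stands in for an LM-backed pipeline stage:
    generate_env() -> initial environment code,
    run_test_agent(code) -> interaction transcript,
    find_issues(transcript) -> list of observed problems,
    refine_env(code, issues) -> improved environment code.
    """
    env_code = generate_env()
    for _ in range(max_rounds):
        transcript = run_test_agent(env_code)
        issues = find_issues(transcript)
        if not issues:
            # No problems observed in the transcript: accept this version.
            return env_code
        env_code = refine_env(env_code, issues)
    return env_code  # best effort after max_rounds of refinement
```

Capping the loop at `max_rounds` bounds cost per environment while still letting most environments converge in a few iterations.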
Value for AI Safety
This work directly addresses a critical infrastructure need in AI safety:
- Scalability: Transforms 10-20 hour manual processes into automated generation
- Research acceleration: Enables rapid expansion of control evaluation datasets
- Quality improvement: Allows iteration and refinement of evaluation scenarios
- Unlocks RL training for attack policies: Enables training attack policies via reinforcement learning against diverse environments, creating sophisticated adversaries that can thoroughly stress-test monitoring systems and reveal blind spots in current control approaches
The automated approaches developed here are essential for creating the robust evaluation infrastructure needed as AI systems become more capable and the stakes of control failures increase.
Conclusion
This initial exploration demonstrated the feasibility of automating control evaluation dataset creation, with the hardcoded approach showing particular promise for near-term deployment. The work provides a clear path toward scaling control evaluation capabilities, directly supporting the AI safety community's need for comprehensive testing infrastructure.