Auto Evaluation Builder: Part 1

My Experience with Automating AI Evaluation Building

December 2025

Around this time last year, after struggling to gain traction on the hard parts of the AI alignment problem, and with my funding drying up, I decided to learn a useful skill that I could make money with to keep me going, ideally while also making some positive impact.

Building AI evaluations seemed to be an in-demand service and a low-risk way to contribute to AI safety, so I pursued that path.

Very quickly, I was fortunate enough to get contracts with the UK AI Security Institute and Anthropic to build AI sabotage evaluations.

One particular threat model that AI safety researchers are concerned about is a scenario where a misaligned AI deployed internally at a frontier AI lab subtly sabotages safety efforts. For example, if deployed to build AI safety infrastructure, like AI monitoring processes, it could subtly sabotage the infrastructure so that the monitoring processes no longer catch other misaligned AIs taking dangerous actions.

As part of my work, I built simulated environments that allow researchers to observe whether AIs have the capability to commit sabotage. Example scenarios include investigating whether an AI deployed at a research lab could leak sensitive data without getting caught, or whether a weapons-manufacturing AI could insert backdoors into autonomous weapons for future hijacking.

Having tens of environments simulating different scenarios for evaluating sabotage capabilities is helpful, but having hundreds would provide statistically meaningful data while also unlocking the ability to train better sabotage agents (basically, gain-of-function research for AI systems ... it's fine, don't worry about it).

The trouble with manually building hundreds of these evaluations is that it becomes extremely tedious and soul-crushing very quickly. Each environment can take between 10 and 20 focused hours to build, with slow feedback loops. Often one discovers halfway through that one of the environment's requirements (such as the difficulty being correctly balanced) can't be satisfied, and has to throw the attempt away and start afresh with a new design.

Given that the stakes were so high, it had to be done ... but a better solution was needed.

Much of my workflow already involved using language models to build the environments, such as rapidly populating a simulated social media website with fake accounts, and I kept noticing parts of the workflow that could probably be automated. The question arose: how far could I take this?
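To give a sense of what that looked like in practice, here is a minimal sketch (not my actual code) of the account-population step. The prompt wording, the `FakeAccount` fields, and the `llm_complete` helper are all illustrative stand-ins for whichever model SDK you happen to use.

```python
import json
from dataclasses import dataclass

@dataclass
class FakeAccount:
    """One synthetic user for the simulated social media site."""
    handle: str
    display_name: str
    bio: str
    follower_count: int

PROMPT = """Generate {n} fictional social media accounts for a simulated website.
Return only a JSON array of objects with keys:
handle, display_name, bio, follower_count."""

def populate_accounts(llm_complete, n: int = 25) -> list[FakeAccount]:
    """Ask a language model for fake account data and parse it into objects.

    `llm_complete` is a placeholder: any function that takes a prompt string
    and returns the model's text response (e.g. a thin wrapper around your
    provider's SDK).
    """
    raw = llm_complete(PROMPT.format(n=n))
    records = json.loads(raw)
    return [FakeAccount(**record) for record in records]
```

Dozens of small steps like this, each a prompt plus a bit of glue code, made up the bulk of the manual work.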

With blessings from my collaborators and grant makers, I set off to attempt automating my entire workflow.

---

After two and a half months of iteration, I had my first taste of success: a working prototype of an automated builder that could take a high-level scenario description and output a working AI sabotage evaluation environment!

However, that was not all I had achieved. The pipeline was sufficiently general that, with a reasonable amount of tweaking, it could produce any type of AI evaluation environment.
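In rough terms, the builder's interface was just that: a scenario description in, an environment out. The sketch below is purely illustrative of that shape; the stage names, prompts, and `EvalEnvironment` fields are my shorthand here, not the actual pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class EvalEnvironment:
    """A generated evaluation: the task given to the agent, the simulated
    world it acts in, and the criteria used to judge the outcome."""
    task_prompt: str
    world_state: dict = field(default_factory=dict)
    scoring_criteria: list[str] = field(default_factory=list)

def build_environment(scenario: str, llm_complete) -> EvalEnvironment:
    """Expand a high-level scenario description into a full environment by
    chaining model calls for each stage (design, world-building, scoring).
    `llm_complete` is the same placeholder completion function as above."""
    design = llm_complete(f"Design an AI evaluation for this scenario:\n{scenario}")
    world = llm_complete(f"Generate the simulated files, accounts, and tools needed for:\n{design}")
    rubric = llm_complete(f"Write concrete pass/fail scoring criteria for:\n{design}")
    return EvalEnvironment(
        task_prompt=design,
        world_state={"artifacts": world},
        scoring_criteria=[rubric],
    )
```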

Not only could the tool help researchers study and measure AI sabotage capabilities, it could help build evaluations for all kinds of purposes, serving AI safety engineers at frontier AI labs, academics, independent researchers, governments, and enterprises!

I had a product that could solve an existential need. It was time to scale my impact ... and profit!

I applied and was quickly accepted to 5050 AI, a mentorship program run by the VC firm 50Y that takes researchers with deep technical expertise and equips them with the knowledge and guidance to become successful entrepreneurs.

With my prototype and the support of the 50Y partners and mentors, I had a clear path to save the world ...

... Or did I?