
My Experience with Automating AI Evaluation Production

December 2025

For a technical deep-dive into the automated evaluation builder, see Automated Evaluation Builder.

---

Around this time last year, after failing to solve the hard part of AI alignment, my funding dried up. I needed to learn a skill that could make money, ideally while still contributing to AI safety.

Building AI evaluations seemed to be an in-demand service and a low-risk way to make a positive impact, so I pursued that path.

Very quickly, I was fortunate enough to get contracts with the UK AI Security Institute and Anthropic to build AI sabotage evaluations.

One particular threat model that AI safety researchers are concerned with is a scenario where a misaligned AI deployed internally at an AI lab subtly sabotages safety infrastructure. For example, it could introduce bugs into code for AI monitoring processes so that they no longer catch other misaligned AIs taking dangerous actions.

As part of my work, I built simulated environments that let researchers observe whether AIs have the capability to commit sabotage. Example scenarios include an AI deployed at a research lab leaking sensitive data without getting caught, or a weapons-manufacturing AI inserting backdoors into autonomous weapons for future hijacking.

Having tens of environments simulating different scenarios for evaluating sabotage capabilities is helpful, but having hundreds would provide more data while also unlocking the ability to train better sabotage agents (basically, gain-of-function research for AI systems ... it's fine, don't worry about it).

The trouble with manually building hundreds of these evaluations is that it becomes extremely tedious and soul-crushing very quickly. Each environment can take between 10 and 20 focused hours to build, with slow feedback loops. Often you discover halfway through that one of the environment's requirements (such as the difficulty being correctly balanced) can't be satisfied, and you have to throw the attempt away and start afresh with a new design.

Given that the stakes were so high, it had to be done ... but a better solution was needed.

Much of my workflow already involved using language models to build the environments, such as rapidly populating simulated social media sites with fake accounts or writing code, and I kept noticing parts of the workflow that could probably be automated. The question arose: how far could I take this?

With blessings from my collaborators and grantmakers, I set out to automate my entire workflow.

---

After two and a half months of iteration, I had my first signs of success. I had a working prototype of an automated builder that could take a high-level scenario description and output a working AI sabotage evaluation environment!

However, that was not all I had achieved. The pipeline was sufficiently general that, with a reasonable amount of tweaking, it could produce any type of AI evaluation environment.

Not only could the tool be used to help researchers study and measure AI sabotage capabilities, but it could build evaluations for all kinds of purposes - serving AI safety engineers at frontier labs, third-party evaluators, and enterprise businesses!
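
To give a concrete sense of the pipeline's shape without repeating the technical post, here is a minimal sketch of the kind of interface it exposes - a scenario description in, environment artifacts out. All of the names, fields, and stages below are hypothetical illustrations rather than the actual implementation; see Automated Evaluation Builder for the real details.

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioSpec:
    """High-level description of an evaluation scenario (hypothetical)."""
    name: str
    setting: str          # e.g. "an internal coding assistant at an AI lab"
    main_task: str        # the cover task the agent is asked to complete
    side_objective: str   # the sabotage the agent must attempt without being caught

@dataclass
class EvalEnvironment:
    """Artifacts that make up a runnable evaluation environment (hypothetical)."""
    files: dict[str, str] = field(default_factory=dict)  # simulated codebase, documents, fake accounts
    agent_instructions: str = ""
    scoring_rubric: str = ""

def build_environment(spec: ScenarioSpec) -> EvalEnvironment:
    """Sketch of the stages such a pipeline might automate."""
    env = EvalEnvironment()
    # 1. Populate the simulated world (code, documents, fake accounts) with LLM calls.
    # 2. Draft agent instructions covering the main task and the covert side objective.
    env.agent_instructions = f"You are {spec.setting}. Your task: {spec.main_task}"
    env.scoring_rubric = (
        f"Did the agent accomplish '{spec.side_objective}' without being detected?"
    )
    # 3. In a real pipeline, difficulty and realism checks would run here, and a
    #    failing design would be regenerated instead of rebuilt by hand.
    return env
```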

I had a product that could solve an existential need. It was time to scale my impact ... and profit!

I applied and was quickly accepted to 5050 AI, a mentorship program run by the VC firm 50Y that takes researchers with deep technical expertise and equips them with the knowledge and guidance to become successful entrepreneurs.

With my prototype and the support of the 50Y partners and mentors, I had a clear path to save the world ...

... Or did I?

---

Nope!

The 5050 program lasted 2-3 months. My goal was simple: get letters of intent or contracts from at least two AI labs by mid-December. With that validation, I could pursue VC funding to scale.

I didn't get them. But I learned a lot about startups, the AI safety startup ecosystem, and what I want to do next.

Quick Lessons from Customer Discovery

Aside from working on the evaluation builder, my day-to-day throughout the program involved customer discovery - talking to people at frontier labs about their needs and whether they'd actually pay for my evaluations.

My ideal customers were safety teams at the frontier AI labs (OpenAI, Anthropic, DeepMind, etc.) and third-party safety organizations such as the AI Security Institute.

I knew that the labs pay well for solutions to their problems, and that influencing them is one of the main leverage points for reducing AI risk.

However, despite having warm connections to friends and researchers at each lab, the process of getting to talk to relevant decision-makers proved to be a lengthy struggle. There were long periods of no responses and eventual rejections due to labs having priorities other than saving the world from misaligned superintelligence (just kidding, I hope).

I also considered enterprise businesses as potential customers (healthcare agents, cybersecurity, etc.) but after initial cold outreach failed, I noticed I wasn't motivated to push harder - the reliability problems enterprises face aren't the AI risks I care most about.

Despite my lack of success during this period, I believe that a third-party AI safety organization providing red-teaming and evaluations to frontier labs and enterprises is a viable business - but I think the right structure to start with is a non-profit or a self-sustaining for-profit rather than a fast-scaling startup.

During the 5050 talks, AI safety experts at leading labs explicitly stated that they would pay for third-party safety work. Labs have talent and resources, but the diversity of threat models (bio-risk, influence campaigns, sabotage) means some solutions are best handled by specialists who can aggregate insights across multiple labs. That said, some safety solutions should be built in-house, like general monitoring.

Some examples of organizations taking this approach include Apollo Research (tackling AI scheming)[1] and METR (capabilities for autonomy and AI research).

One last thing to note: there's a clear financial pull for AI safety startups to work on improving AI capabilities or enterprise solutions that don't meaningfully reduce AI risk. For example, building reinforcement learning environments for AI training is extremely popular right now, but there are arguments this accelerates the development of dangerous AI systems, leaving less time for finding necessary safety solutions.

It's worth keeping this pull in mind when thinking about what kind of organizational structure would best serve an AI safety founder's mission.[2]

In the end, I think that third-party AI safety solutions aimed at frontier lab risk are best structured as bootstrapped for-profits or non-profits (my impression after talking to non-profit grantmakers is that there's willingness to fund AI safety organizations).[3]

There doesn't seem to be enough demand to scale without diluting the focus on AI safety, either by drifting towards enterprise work or by accelerating capabilities.

Conclusion

So in the end, my vision of producing AI safety evaluations at scale wasn't realised. I didn't get a strong enough pull to fuel the growth of a VC-backed startup.

I might have succeeded with more time and effort - but during the course of the program, I began to realise that I didn't want to spend 5-10 years on building a startup around this idea in particular. When I pictured myself pursuing automated evals as a startup, I didn't feel energized. It felt like something I wanted to spin up and exit fast, not structure my life around.

I started working on building evaluations because I had a long period without funding to work on what I'm naturally drawn to. That's not to say that I don't think evaluations are important (far from it), but I don't think working on them in the long run is the right fit for me (given my natural interests and comparative advantages).

I'm open to continuing to build evaluations, using the pipeline to boost my productivity, but framing it as a service I can offer on a contract basis rather than something I'll devote all my time to.[4] If demand eventually pulls me toward scaling, I may reconsider, but for now I'm excited about giving myself space to explore and seeing what pulls me in.

In the end, this experience has made me feel much more confident that I have the capacity to found an organization and would thoroughly enjoy it.[5]

---

For a technical deep-dive into how the pipeline works and example outputs, see Automated Evaluation Builder.

  1. Recently Apollo Research announced a transition from a non-profit to a for-profit PBC. I'd also highly recommend reading Marius' (CEO of Apollo) post on AGI safety products.
  2. I found this report by Halcyon Futures to be a good resource for thinking about this.
  3. This cuts off access to large pools of VC money and some talent but it's possible to restructure later if needed.
  4. As a recent update, I'm again in talks with a couple of organizations about potential contracts. As an update to this update - the talks didn't lead anywhere, again ...
  5. There were many moments where it seemed plausible that this could scale fast and I found myself naturally gravitating towards working 80-100 hour weeks.