The Self-Improving AI Loop That Went from 7/15 to 15/15 for Twenty-Four Cents

The most expensive part of building AI workflows isn't the API cost — it's the iteration cycle. Writing a prompt, testing it, identifying where it fails, revising it, and testing again is time-consuming manual work that most teams just accept as part of the process. A pattern called auto research, originally articulated by Andrej Karpathy and now implemented in an open-source repository, automates that iteration loop entirely. The result is an agent that improves its own outputs until it reaches a defined quality threshold, with no human in the loop after the initial setup.

What the Loop Actually Does

Auto research works by giving an agent a task, a scoring function, and a stopping condition. The agent attempts the task, evaluates the result against the scoring function, and if the score is below the threshold, it revises its own approach and tries again. Each iteration informs the next. The loop continues autonomously until the output meets the standard or the budget is exhausted. This is not prompt chaining — the agent is genuinely analyzing why its previous attempt fell short and making targeted changes to its strategy.
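The loop described above can be sketched in a few lines. This is a minimal illustration, not the code from the open-source repository: the names `run_loop`, `LoopResult`, `attempt`, `score`, and `revise` are hypothetical, and the toy functions at the bottom stand in for what would be model calls in a real setup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    output: str
    score: int
    iterations: int
    reached_target: bool


def run_loop(
    attempt: Callable[[list[str]], str],
    score: Callable[[str], int],
    revise: Callable[[list[str], str, int], list[str]],
    strategy: list[str],
    target: int,
    max_iterations: int,
) -> LoopResult:
    """Attempt, score, and revise until the target is met or the budget runs out."""
    best_output, best_score = "", -1
    for iteration in range(1, max_iterations + 1):
        output = attempt(strategy)
        current = score(output)
        if current > best_score:
            best_output, best_score = output, current
        if current >= target:
            return LoopResult(best_output, best_score, iteration, True)
        # Feed the failed attempt and its score back into the next strategy.
        strategy = revise(strategy, output, current)
    return LoopResult(best_output, best_score, max_iterations, False)


# Toy stand-ins (assumptions for illustration only; a real agent would call a model).
REQUIRED = ("price", "deadline", "contact")


def toy_attempt(strategy: list[str]) -> str:
    return " ".join(strategy)


def toy_score(output: str) -> int:
    return sum(term in output for term in REQUIRED)


def toy_revise(strategy: list[str], output: str, current: int) -> list[str]:
    missing = [term for term in REQUIRED if term not in output]
    return strategy + missing[:1]


result = run_loop(toy_attempt, toy_score, toy_revise, ["price"], target=3, max_iterations=10)
print(result)
```

The iteration cap plays the role of the budget here. A real implementation would more likely track tokens or dollars, but the control flow stays the same: stop on success, otherwise revise and retry until the cap is hit.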

The Numbers: A Concrete Benchmark

Starting from a baseline score of 7 out of 15 on a prompt optimization task, the auto research loop ran autonomously and reached 15 out of 15 — a perfect score — at a total API cost of $0.24. Shopify's internal implementation of a similar pattern produced a 53% reduction in task completion time across their tested workflows. These aren't theoretical improvements; they're measurable gains that come directly from replacing manual iteration with autonomous iteration.

Where It Applies Beyond Research

The pattern extends well beyond what its name suggests. Any task where output quality can be scored is a candidate: email copy, landing page headlines, ad variants, code correctness, data extraction accuracy. The scoring function doesn't have to be a deterministic test; it can be a rubric that an LLM evaluates each output against. The key requirement is that "good" is defined precisely enough that the scorer can reliably tell a passing result from a failing one.
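One way to make "good" precise is to write the rubric as a list of named pass/fail criteria and count how many pass. The sketch below is an assumption about how such a scorer might look, not the article's own method: `Criterion`, `HEADLINE_RUBRIC`, and `score_against_rubric` are hypothetical names, the product name "Acme" is invented, and the simple string checks stand in for judgments an LLM would make in practice.

```python
from typing import Callable, NamedTuple


class Criterion(NamedTuple):
    name: str
    passes: Callable[[str], bool]


# Hypothetical rubric for a landing page headline.
HEADLINE_RUBRIC = [
    Criterion("under 60 characters", lambda text: len(text) < 60),
    Criterion("mentions the product", lambda text: "acme" in text.lower()),
    Criterion("contains a number", lambda text: any(ch.isdigit() for ch in text)),
]


def score_against_rubric(text: str, rubric: list[Criterion]) -> tuple[int, list[str]]:
    """Return the number of criteria passed and the names of the ones that failed."""
    failed = [criterion.name for criterion in rubric if not criterion.passes(text)]
    return len(rubric) - len(failed), failed


print(score_against_rubric("Faster setup for everyone", HEADLINE_RUBRIC))
```

Returning the failed criteria alongside the score matters: the list of failures is what gives the agent something specific to fix on its next attempt, rather than just a number.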

Why This Pattern Matters Now

The standard practice for AI-assisted work still assumes a human in the revision loop. A person sends a prompt, reads the output, decides whether it's good enough, and either accepts it or sends a revised prompt. Auto research replaces the human in that loop for tasks where the quality criteria are stable and definable. The practical implication is that workflows which previously required continuous human supervision can now run to completion autonomously, escalating to a human only when the agent can't reach the target score within the budget.
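The escalation rule can be expressed as a three-way decision after each scored attempt. This is a sketch under stated assumptions: `Budget` and `decide` are hypothetical names, and the cent amounts in the usage lines are made-up figures, not measurements from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit_cents: int
    spent_cents: int = 0

    def charge(self, cost_cents: int) -> None:
        self.spent_cents += cost_cents

    def can_afford(self, cost_cents: int) -> bool:
        return self.spent_cents + cost_cents <= self.limit_cents


def decide(score: int, target: int, budget: Budget, next_call_cents: int) -> str:
    """Accept a passing result, retry while budget remains, otherwise hand off to a human."""
    if score >= target:
        return "accept"
    if budget.can_afford(next_call_cents):
        return "retry"
    return "escalate"


budget = Budget(limit_cents=50)
budget.charge(45)  # illustrative spend so far
print(decide(score=12, target=15, budget=budget, next_call_cents=3))
print(decide(score=12, target=15, budget=budget, next_call_cents=8))
```

Tracking the budget in integer cents avoids floating-point rounding when many small charges are summed.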

This shifts how you think about where to invest time. Defining quality criteria upfront, articulating precisely what a good output looks like, becomes the highest-leverage task, because once that's done, the iteration costs cents of API spend rather than hours of human attention.

Takeaway

Auto research is the clearest current example of an AI system that gets better at a task without human guidance during the iteration itself. As the open-source tooling matures and the pattern spreads beyond early adopters, the expectation will shift: manual prompt iteration will start to look like a solved problem. The teams that define their quality criteria clearly now will be positioned to automate their iteration loops first.