The paper was about using one agent to generate solutions and another to evaluate them, running in a loop. Generator proposes, Evaluator scores, loop until good enough or budget runs out.
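A minimal sketch of that loop, assuming a generator that takes the previous round's feedback and an evaluator that returns a numeric score plus feedback. The names `run_loop`, `LoopResult`, `toy_generate`, and `toy_evaluate` are hypothetical and do not come from the paper; the toy functions stand in for real agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    candidate: str
    score: float
    rounds: int  # round in which the best candidate was produced


def run_loop(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float,
    max_rounds: int,
) -> LoopResult:
    """Generator proposes, evaluator scores; stop when good enough or out of budget."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    best: Optional[LoopResult] = None
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(feedback)          # generator proposes
        score, feedback = evaluate(candidate)   # evaluator scores and comments
        if best is None or score > best.score:
            best = LoopResult(candidate, score, round_no)
        if score >= threshold:                  # good enough: stop early
            break
    assert best is not None  # guaranteed because max_rounds >= 1
    return best


# Toy stand-ins (assumptions): each round extends the previous attempt by one
# character, and the "score" is simply the length of the attempt.
def toy_generate(feedback: Optional[str]) -> str:
    return "patch" if feedback is None else feedback + "+"


def toy_evaluate(candidate: str) -> Tuple[float, str]:
    return float(len(candidate)), candidate


result = run_loop(toy_generate, toy_evaluate, threshold=7.0, max_rounds=10)
print(result)
```

In a real setup the generator would be an agent writing code and the evaluator would run tests. The loop only fixes the two stopping conditions: a score threshold and a round budget.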

I read it at breakfast and thought: I have a Playwright test suite. I have a FastAPI + SvelteKit app (Leseraum, a German reading tool). I wonder if this actually works.

So I said: “use what you’ve learned — set up two agents and have them work on Leseraum.”

What I got:

  • Generator agent pulled the current codebase, identified failing tests, wrote fixes and features, committed changes
  • Evaluator agent ran the full test suite (119 backend tests, 31 Playwright frontend tests) and returned a quality score
  • They ran 5 sprints: broken → 3.7 → 4.0 → continuing improvement
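The post does not say how the evaluator turned test results into a score. One plausible reading, sketched here purely as an assumption, is a pass rate over both suites mapped onto a fixed scale. `SuiteResult`, `quality_score`, the 0 to 5 scale, and the pass counts in the example are all hypothetical; only the suite sizes (119 and 31) come from the text above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SuiteResult:
    name: str
    passed: int
    total: int


def quality_score(results: Iterable[SuiteResult], scale: float = 5.0) -> float:
    """Map the combined pass rate of all suites onto 0..scale (assumed scoring rule)."""
    results = list(results)
    total = sum(r.total for r in results)
    if total == 0:
        raise ValueError("no tests were run")
    passed = sum(r.passed for r in results)
    return round(scale * passed / total, 1)


# Suite sizes from the post; the pass counts are made up for illustration.
example = [
    SuiteResult("backend", passed=100, total=119),
    SuiteResult("playwright", passed=20, total=31),
]
print(quality_score(example))  # 120 of 150 passing -> 4.0
```

A pure pass rate ignores things like code quality or feature coverage, so an actual evaluator agent may well weigh more than this.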

In parallel, a background agent was researching the LingQ power-user community (the LanguageMaster addon ecosystem, roosterburton’s tools). That research turned up an import API, which Claude then used to pull 11,221 German words from my LingQ account directly into the app’s frequency database.

The session ended with a new cron job replacing the old ad-hoc import script. Push to GitHub. Done.

The interesting thing isn’t that it worked — it’s that the paper became the spec. There was no design document, no ticket, no “here’s the architecture.” The description of the pattern in the article was sufficient for Claude to instantiate it against my actual codebase with real tests as the evaluation signal.

Whether that scales beyond toy demos is a separate question. But for a single-developer project where the tests exist and the codebase is small enough to hold in context, generator-evaluator loops seem to genuinely work as described.

The Anthropic blog post and the working sprint results are now pointing at each other in my head as the same thing, which is either very satisfying or a sign I need to go outside more.