Published: · 4 min read
Learn how to score the tests an AI agent writes. Record the run as a trajectory, replay it offline, and grade it without live API calls.
AI test agent evaluation is the practice of scoring the tests an AI agent writes, instead of trusting that they pass. You record the agent's run as a trajectory (a saved log of every step), replay it offline, and grade each step for correctness and relevance. Offline scoring needs no live API calls, so you can check agent quality on every pull request.
An AI agent can write 200 tests before lunch. That feels like progress.
Then a real bug ships, and not one of those tests caught it. The agent was confident, and it was wrong.
This guide shows how to stop guessing and start scoring. Stagehand 3.5.0 made the method first-class on June 3, 2026. The pattern works for any agent.
A green test suite tells you the tests ran. It does not tell you the tests were right.
An AI agent makes three mistakes a human reviewer would catch:
You cannot fix what you cannot measure. So the first job is a number, not a vibe.
A trajectory is a saved recording of an agent's run. It captures each step: what the agent saw, what it decided, and what code it produced.
You capture it once, during the agent's normal run.
// Illustrative pattern — confirm the exact Stagehand 3.5 API before use.
const trajectory = await agent.run(task, { record: true });
await saveTrajectory(trajectory, "runs/checkout-flow.json");
The recording is the receipt. Now you can study the run after it finishes, as many times as you want.
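To make the recording concrete, here is a minimal sketch of what a trajectory file could contain and how it might be turned into JSON and read back. The `Step` and `Trajectory` shapes and the `serializeTrajectory` and `parseTrajectory` helpers are assumptions made for illustration; they are not Stagehand's actual schema or API.

```typescript
// Hypothetical trajectory shape (not Stagehand's real schema).
interface Step {
  observation: string; // what the agent saw
  decision: string; // what it decided
  code: string; // what code it produced
}

interface Trajectory {
  task: string;
  steps: Step[];
}

// Turn a trajectory into JSON text that can be written to disk.
function serializeTrajectory(trajectory: Trajectory): string {
  return JSON.stringify(trajectory, null, 2);
}

// Runtime check that a parsed value has the three string fields of a step.
function isStep(value: unknown): value is Step {
  if (typeof value !== "object" || value === null) return false;
  const s = value as Record<string, unknown>;
  return (
    typeof s["observation"] === "string" &&
    typeof s["decision"] === "string" &&
    typeof s["code"] === "string"
  );
}

// Parse saved JSON text back into a trajectory, rejecting malformed files.
function parseTrajectory(json: string): Trajectory {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("trajectory must be a JSON object");
  }
  const record = data as Record<string, unknown>;
  const task = record["task"];
  const steps = record["steps"];
  if (typeof task !== "string" || !Array.isArray(steps) || !steps.every(isStep)) {
    throw new Error("trajectory is missing a task or has malformed steps");
  }
  return { task: task, steps: steps as Step[] };
}
```

Validating on load is a deliberate choice in this sketch: a truncated or hand-edited file fails loudly at parse time instead of producing a misleading score later.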
Offline means you grade the saved run without calling the live model again. No new API cost. No flaky network. Same input every time.
This matters for two reasons. It makes scoring cheap, so you can run it on every pull request. It makes scoring repeatable, so two engineers get the same result.
// Replay the saved run and score it, with no live API calls.
const run = await loadTrajectory("runs/checkout-flow.json");
const score = await evaluate(run, rubric);
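The property that makes offline scoring repeatable is that every check is a pure function of the saved data. The sketch below shows that idea under stated assumptions: the `SavedRun`, `Check`, and `CheckResult` shapes are hypothetical, and this `evaluate` is a synchronous stand-in for illustration, not a Stagehand function.

```typescript
// Hypothetical saved-run shape; a real trajectory file will carry more fields.
interface SavedRun {
  steps: { action: string; assertion?: string }[];
}

// A check is a named pure function over the saved data.
interface Check {
  name: string;
  passes: (run: SavedRun) => boolean;
}

interface CheckResult {
  name: string;
  passed: boolean;
}

// Grade the saved run. Nothing here touches the network or the model,
// so the same file always produces the same results.
function evaluate(run: SavedRun, checks: Check[]): CheckResult[] {
  return checks.map((c) => ({ name: c.name, passed: c.passes(run) }));
}

// Example: a run with two steps, only one of which asserts anything.
const sampleRun: SavedRun = {
  steps: [
    { action: "click #pay" },
    { action: "read #total", assertion: "total equals 42.00" },
  ],
};

const sampleChecks: Check[] = [
  {
    name: "has an assertion",
    passes: (r) => r.steps.some((s) => s.assertion !== undefined),
  },
  {
    name: "every step asserts",
    passes: (r) => r.steps.every((s) => s.assertion !== undefined),
  },
];

const sampleResults = evaluate(sampleRun, sampleChecks);
```

Because the checks read only the saved steps, two engineers running this against the same file get identical results.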
A single pass/fail hides too much. Grade the run on a few clear axes instead.
Stagehand 3.5.0 added evaluation types for exactly this kind of offline scoring. You define the rubric once and apply it to every saved run.
const rubric = {
  correctness: (run) => run.asserts.some(a => a.target === task.goal),
  relevance: (run) => run.steps.every(s => s.onTask),
  stability: (run) => run.reruns.every(r => r.passed),
};
A run scored as "correctness 7/10, relevance pass, flaky tests 0" gives you something concrete to discuss. "It passed" does not.
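One way to arrive at a figure like 7/10 is to apply boolean checks to several saved runs and count how many pass. The following is a sketch of that aggregation; the `RunScore` and `Summary` shapes and the `summarize` function are hypothetical names chosen for this example.

```typescript
// Hypothetical per-run result: two boolean axes plus a flaky-test count.
interface RunScore {
  correctness: boolean;
  relevance: boolean;
  flakyTests: number;
}

interface Summary {
  runs: number;
  correctness: number; // fraction of runs that passed the correctness check
  relevance: boolean; // true only if every run stayed on task
  flakyTests: number; // total flaky tests across all runs
}

// Fold per-run scores into one report that can be compared over time.
function summarize(scores: RunScore[]): Summary {
  if (scores.length === 0) {
    throw new Error("no runs to summarize");
  }
  let correct = 0;
  let allRelevant = true;
  let flaky = 0;
  for (const score of scores) {
    if (score.correctness) correct += 1;
    if (!score.relevance) allRelevant = false;
    flaky += score.flakyTests;
  }
  return {
    runs: scores.length,
    correctness: correct / scores.length,
    relevance: allRelevant,
    flakyTests: flaky,
  };
}
```

Keeping the axes separate in the summary, instead of collapsing them into one number, preserves the information that a single pass/fail would hide.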
A score you read once and forget changes nothing. Turn it into a gate.
# CI step: fail the build if the agent's tests score too low.
# Illustrative command: `evaluate` and its flags stand in for your own scoring script.
- run: npx evaluate runs/ --min-correctness 0.8 --max-flaky 0
Now the agent earns trust the same way a junior engineer does. It ships work, the work gets graded, and only graded work reaches production.
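The gate itself can be a small comparison of scores against thresholds. Below is a minimal sketch, assuming thresholds that mirror the `--min-correctness` and `--max-flaky` values in the CI step; `GateInput`, `Thresholds`, and `checkGate` are hypothetical names, not part of any published tool.

```typescript
// Hypothetical aggregated scores for the agent's saved runs.
interface GateInput {
  correctness: number; // fraction between 0 and 1
  flakyTests: number;
}

interface Thresholds {
  minCorrectness: number;
  maxFlaky: number;
}

// Return one message per violated threshold; an empty list means the gate passes.
function checkGate(input: GateInput, thresholds: Thresholds): string[] {
  const failures: string[] = [];
  if (input.correctness < thresholds.minCorrectness) {
    failures.push(
      `correctness ${input.correctness} is below the minimum ${thresholds.minCorrectness}`
    );
  }
  if (input.flakyTests > thresholds.maxFlaky) {
    failures.push(
      `flaky tests ${input.flakyTests} exceed the maximum ${thresholds.maxFlaky}`
    );
  }
  return failures;
}
```

A CI script would print the returned messages and exit with a non-zero status whenever the list is not empty. Returning messages instead of a bare boolean means the build log says which axis failed.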
I design AI test systems on a 3-Layer System:
Most teams build the first two layers and stop. They let the agent write tests and assume the green check means quality.
Offline evaluation is the Evidence Layer. It is the difference between an agent you hope works and an agent you can prove works.
Build the agent. Then prove it works.
Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.