How to Score Your AI Test Agents: Offline Evaluation with Trajectories (2026)

Published: June 10, 2026 · 4 min read

Learn how to score the tests an AI agent writes. Record the run as a trajectory, replay it offline, and grade it without live API calls.

AI test agent evaluation is the practice of scoring the tests an AI agent writes, instead of trusting that they pass. You record the agent's run as a trajectory (a saved log of every step), replay it offline, and grade each step for correctness and relevance. Offline scoring needs no live API calls, so you can check agent quality on every pull request.

An AI agent can write 200 tests before lunch. That feels like progress.

Then a real bug ships, and not one of those tests caught it. The agent was confident, and it was wrong.

This guide shows how to stop guessing and start scoring. Stagehand 3.5.0 made the method first-class on June 3, 2026. The pattern works for any agent.

1. "It passed" is not a score

A green test suite tells you the tests ran. It does not tell you the tests were right.

An AI agent makes three mistakes a human reviewer would catch:

It checks the wrong thing. The test passes, but it never asserts the real behavior.
It writes flaky tests (tests that fail at random). They go green often enough to look fine.
It tests a happy path and skips the edge case that actually breaks.
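
To make the first mistake concrete, here is a minimal sketch built around a deliberately buggy function. The names `applyDiscount`, `weakTestPasses`, and `realTestPasses` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical function under test, with a deliberate bug:
// it adds the discount instead of subtracting it.
function applyDiscount(price: number, percent: number): number {
  return price + (price * percent) / 100; // bug: should subtract
}

// A test that checks the wrong thing: it only confirms a number came back.
function weakTestPasses(): boolean {
  const result: number = applyDiscount(100, 10);
  return typeof result === "number" && !isNaN(result);
}

// A test that asserts the real behavior: 10% off 100 should be 90.
function realTestPasses(): boolean {
  return applyDiscount(100, 10) === 90;
}

const weakOutcome: boolean = weakTestPasses(); // true: green, but it proves nothing
const realOutcome: boolean = realTestPasses(); // false: the bug is caught
```

Both tests run without error. Only the second one would have stopped the bug from shipping.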

You cannot fix what you cannot measure. So the first job is a number, not a vibe.

2. Record the run as a trajectory

A trajectory is a saved recording of an agent's run. It captures each step: what the agent saw, what it decided, and what code it produced.

You capture it once, during the agent's normal run.

// Illustrative pattern: confirm the exact Stagehand 3.5 API before use.
// saveTrajectory stands for your own helper that writes the run to disk.
const trajectory = await agent.run(task, { record: true });
await saveTrajectory(trajectory, "runs/checkout-flow.json");

The recording is the receipt. Now you can study the run after it finishes, as many times as you want.
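
As a rough sketch of what such a recording could contain, the types below mirror the three things a step captures: what the agent saw, what it decided, and what code it produced. The field names and the `serializeTrajectory` and `parseTrajectory` helpers are assumptions for illustration, not Stagehand's actual format.

```typescript
// Assumed shape of one recorded step; a real recorder may store more fields.
interface TrajectoryStep {
  observation: string; // what the agent saw
  decision: string;    // what it decided
  code: string;        // what code it produced
}

interface Trajectory {
  task: string;
  steps: TrajectoryStep[];
}

// Serialize a run to JSON text so it can be written to disk and kept.
function serializeTrajectory(trajectory: Trajectory): string {
  return JSON.stringify(trajectory, null, 2);
}

// Parse saved JSON text back into a trajectory, rejecting malformed input.
function parseTrajectory(text: string): Trajectory {
  const data = JSON.parse(text) as Partial<Trajectory>;
  if (typeof data.task !== "string" || !Array.isArray(data.steps)) {
    throw new Error("Not a valid trajectory file");
  }
  return { task: data.task, steps: data.steps };
}

const recorded: Trajectory = {
  task: "Test the checkout flow",
  steps: [
    {
      observation: "Cart page with one item",
      decision: "Assert the order total",
      code: "expect(total).toBe(42)",
    },
  ],
};

// The same saved text can be parsed as many times as needed.
const saved: string = serializeTrajectory(recorded);
const replayed: Trajectory = parseTrajectory(saved);
```

Plain JSON keeps the recording readable in a diff and easy to store next to the code it describes.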

3. Replay it offline

Offline means you grade the saved run without calling the live model again. No new API cost. No flaky network. Same input every time.

This matters for two reasons. It makes scoring cheap, so you can run it on every pull request. It makes scoring repeatable, so two engineers get the same result.

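// Illustrative pattern: replay the saved run and score it, with no live API calls.
// loadTrajectory and evaluate are placeholder helpers; rubric is the object defined in step 4.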
const run = await loadTrajectory("runs/checkout-flow.json");
const score = await evaluate(run, rubric);
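
One way an `evaluate` helper like the one above could work is sketched below. It is a guess at the shape, not Stagehand's implementation: the `SavedRun` fields and the `Rubric` type are assumptions. Every check is a pure function of the saved data, so two calls on the same run return the same scores.

```typescript
// Assumed minimal shape of a saved run; a real trajectory holds more.
interface SavedRun {
  goal: string;
  asserts: { target: string }[];
  steps: { onTask: boolean }[];
}

// A rubric maps an axis name to a pure check over the saved run.
type Rubric = Record<string, (run: SavedRun) => boolean>;

// Apply every check to the saved data. No model or network call happens here.
function evaluate(run: SavedRun, rubric: Rubric): Record<string, boolean> {
  const scores: Record<string, boolean> = {};
  for (const axis of Object.keys(rubric)) {
    const check = rubric[axis];
    if (check) {
      scores[axis] = check(run);
    }
  }
  return scores;
}

const checkoutRun: SavedRun = {
  goal: "order-total",
  asserts: [{ target: "order-total" }],
  steps: [{ onTask: true }, { onTask: false }],
};

const sketchRubric: Rubric = {
  correctness: (run) => run.asserts.some((a) => a.target === run.goal),
  relevance: (run) => run.steps.every((s) => s.onTask),
};

// Grading the same saved run twice gives the same answer both times.
const firstPass = evaluate(checkoutRun, sketchRubric);
const secondPass = evaluate(checkoutRun, sketchRubric);
// Both passes yield { correctness: true, relevance: false }.
```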

4. Score each step with evaluation types

A single pass/fail hides too much. Grade the run on a few clear axes instead.

Correctness: did the test assert the behavior the task asked for?
Relevance: does each step move toward the goal, or wander?
Stability: would this test pass on a clean re-run, or is it flaky?
Coverage: did the agent test the edge case, or only the happy path?

Stagehand 3.5.0 added evaluation types for exactly this kind of offline scoring. You define the rubric once and apply it to every saved run.

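// Illustrative rubric. The run fields (goal, asserts, steps, reruns, edgeCase)
// are assumed for this example, not a confirmed Stagehand schema.
const rubric = {
  correctness: (run) => run.asserts.some((a) => a.target === run.goal),
  relevance:   (run) => run.steps.every((s) => s.onTask),
  stability:   (run) => run.reruns.every((r) => r.passed),
  coverage:    (run) => run.asserts.some((a) => a.edgeCase),
};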

A run that scores correctness 7/10, relevance pass, flaky tests 0 is a run you can talk about. "It passed" is not.
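
Each rubric check returns true or false for one run, so a figure like 7/10 has to come from aggregation. One plausible reading, offered as an assumption rather than the author's exact method, is to count how many saved runs pass each check. The `RunResult` shape and the `summarize` and `isFlaky` functions are hypothetical.

```typescript
// Assumed per-run result: one boolean per axis, plus recorded re-run outcomes.
interface RunResult {
  correctness: boolean;
  relevance: boolean;
  rerunsPassed: boolean[]; // outcome of each clean re-run of the test
}

interface Summary {
  correctness: string; // for example "7/10"
  relevance: "pass" | "fail";
  flakyTests: number;
}

// A test is treated as flaky when its re-runs disagree with each other.
function isFlaky(rerunsPassed: boolean[]): boolean {
  return rerunsPassed.some((p) => p) && rerunsPassed.some((p) => !p);
}

function summarize(results: RunResult[]): Summary {
  const correct = results.filter((r) => r.correctness).length;
  return {
    correctness: `${correct}/${results.length}`,
    relevance: results.every((r) => r.relevance) ? "pass" : "fail",
    flakyTests: results.filter((r) => isFlaky(r.rerunsPassed)).length,
  };
}

// Ten hypothetical runs: seven assert the right behavior, three do not.
const tenRuns: RunResult[] = [];
for (let i = 0; i < 10; i++) {
  tenRuns.push({ correctness: i < 7, relevance: true, rerunsPassed: [true, true] });
}

const scoreReport: Summary = summarize(tenRuns);
// scoreReport is { correctness: "7/10", relevance: "pass", flakyTests: 0 }
```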

5. Wire the score into CI

A score you read once and forget changes nothing. Turn it into a gate.

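# CI step: fail the build if the agent's tests score too low.
# Illustrative: "evaluate" stands for your own scoring script; confirm the real command and flags before use.
- run: npx evaluate runs/ --min-correctness 0.8 --max-flaky 0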

Now the agent earns trust the same way a junior engineer does. It ships work, the work gets graded, and only graded work reaches production.
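
The gate itself is a small comparison. The sketch below assumes the score has already been reduced to a correctness fraction and a flaky-test count, and uses the same two limits as the command above. The type and function names are hypothetical.

```typescript
// Assumed inputs: an aggregate score and the limits that CI enforces.
interface AgentScore {
  correctness: number; // fraction of runs that assert the right behavior, 0 to 1
  flakyTests: number;  // count of tests whose re-runs disagree
}

interface GateLimits {
  minCorrectness: number;
  maxFlaky: number;
}

// Return the list of violated limits; an empty list means the build may pass.
function gateFailures(score: AgentScore, limits: GateLimits): string[] {
  const failures: string[] = [];
  if (score.correctness < limits.minCorrectness) {
    failures.push(`correctness ${score.correctness} is below ${limits.minCorrectness}`);
  }
  if (score.flakyTests > limits.maxFlaky) {
    failures.push(`flaky tests ${score.flakyTests} exceeds ${limits.maxFlaky}`);
  }
  return failures;
}

const ciLimits: GateLimits = { minCorrectness: 0.8, maxFlaky: 0 };
const blocked: string[] = gateFailures({ correctness: 0.7, flakyTests: 0 }, ciLimits);
const allowed: string[] = gateFailures({ correctness: 0.9, flakyTests: 0 }, ciLimits);
```

A CI wrapper would print the list and exit non-zero whenever it is not empty. Returning the reasons, rather than a bare boolean, tells the engineer which limit the agent missed.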

6. Where this sits: the Evidence Layer

I design AI test systems around a 3-Layer System:

Orchestration: decides what to test.
Execution: runs the tests, where the agent writes code.
Evidence: proves the work is right.

Most teams build the first two layers and stop. They let the agent write tests and assume the green check means quality.

Offline evaluation is the Evidence Layer. It is the difference between an agent you hope works and an agent you can prove works.
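
One possible reading of the three layers, written as plain functions, is sketched here. Every name is hypothetical, and the `execute` stub only stands in for an agent. The point is the gap between the count of green runs and the count of proven runs.

```typescript
// Hypothetical record produced by the Execution layer for one task.
interface ExecutedRun {
  task: string;
  passed: boolean;      // the green check
  assertsGoal: boolean; // whether the test asserted the behavior the task asked for
}

// Orchestration: decides what to test.
function orchestrate(changedAreas: string[]): string[] {
  return changedAreas.map((area) => `Test the ${area} flow`);
}

// Execution: stands in for the agent writing and running a test.
// This stub passes every test but only asserts the goal for "checkout".
function execute(task: string): ExecutedRun {
  return { task, passed: true, assertsGoal: task.indexOf("checkout") !== -1 };
}

// Evidence: accepts a run only when it is both green and correct.
function hasEvidence(run: ExecutedRun): boolean {
  return run.passed && run.assertsGoal;
}

const executedRuns: ExecutedRun[] = orchestrate(["checkout", "login"]).map((t) => execute(t));
const greenCount: number = executedRuns.filter((r) => r.passed).length;          // 2
const provenCount: number = executedRuns.filter((r) => hasEvidence(r)).length;   // 1
```

A team that stops at the first two layers would report two passing tests. The third layer shows that only one of them proves anything.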

The 5-line checklist

Record every agent run as a trajectory.
Replay it offline, with no live API calls.
Score it on correctness, relevance, stability, and coverage.
Gate your build on the score.
Keep the trajectory, so you can re-grade when the rubric improves.
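
The last item is the one most often skipped, so here is a minimal sketch of why it pays off. A stored run is graded under an early rubric, then graded again after a coverage check is added, without re-running the agent. The `StoredRun` shape and both rubric versions are assumptions for illustration.

```typescript
// Assumed shape of a stored trajectory summary.
interface StoredRun {
  id: string;
  goal: string;
  assertedTargets: string[]; // what the generated test actually asserted
  edgeCasesTested: number;
}

type Check = (run: StoredRun) => boolean;

const assertsGoal: Check = (run) => run.assertedTargets.indexOf(run.goal) !== -1;
const testsAnEdgeCase: Check = (run) => run.edgeCasesTested > 0;

// Version 1 of the rubric only looks at correctness.
const rubricV1: Check[] = [assertsGoal];
// Version 2 adds a coverage check without touching the stored runs.
const rubricV2: Check[] = [assertsGoal, testsAnEdgeCase];

// A stored run passes only if every check in the rubric passes.
function grade(run: StoredRun, rubric: Check[]): boolean {
  return rubric.every((check) => check(run));
}

const archived: StoredRun = {
  id: "checkout-flow",
  goal: "order-total",
  assertedTargets: ["order-total"],
  edgeCasesTested: 0,
};

const gradeUnderV1: boolean = grade(archived, rubricV1); // true
const gradeUnderV2: boolean = grade(archived, rubricV2); // false: no edge case tested
```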

Build the agent. Then prove it works.

Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.

AI Testing · Test Evaluation · AI Agents · Stagehand · QA Architecture · SDET
