Published: · 6 min read
The 10 Playwright best practices for stable tests in 2026, and the ones AI code agents like Copilot and Cursor get wrong.
Playwright best practices are the rules that keep browser tests stable and easy to read. Use role-based locators (find by what users see), web-first assertions that auto-wait, and isolated tests. Seed data through the API (direct requests), not the UI. Avoid hard waits, conditional logic, and tests tied to your HTML. Turn on traces and run in parallel.
An AI agent can write 50 Playwright tests in a minute. That feels fast.
Then those tests fail at random, and nobody knows why. The agent copied old patterns from its training data. It does not know the run failed last night.
This guide lists the 10 best practices that keep tests stable. For each one, I show a small correct example. I also show what AI code agents get wrong. AI tools like Copilot, Cursor, and even Playwright codegen (the test recorder) lean on stale habits. Someone has to fix that.
A locator (a pointer to an element) should match what a person sees on screen. Use getByRole, getByLabel, or getByText. These read like the page. They also survive a redesign of your HTML.
import { test, expect } from '@playwright/test';

test('user can sign in', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('ada@example.com');
  await page.getByRole('button', { name: 'Sign in' }).click();
});
What AI agents get wrong: they reach for CSS or XPath (brittle path selectors) like page.locator('div.btn-primary > span'). Change one class name and the test breaks.
A web-first assertion (a check that auto-waits) retries until the page is ready. expect(locator).toBeVisible() waits on its own. You never add a fixed sleep.
import { test, expect } from '@playwright/test';

test('welcome message appears', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page.getByText('Welcome back')).toBeVisible();
});
What AI agents get wrong: they add await page.waitForTimeout(3000) (a hard pause). Hard waits are a common cause of flaky tests (tests that fail at random). Too short, the test fails. Too long, the suite crawls.
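To see why a retrying check beats a fixed pause, here is a minimal sketch of the retry idea in plain TypeScript. It is not Playwright's implementation. The helper name expectEventually, its parameters, and its defaults are all hypothetical, picked only for illustration.

```typescript
// Hypothetical helper: shows the retry-until-timeout idea, not Playwright's internals.
async function expectEventually(
  check: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<number> {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (true) {
    attempts += 1;
    // Return as soon as the condition holds, so no time is spent on a fixed pause.
    if (await check()) return attempts;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms (${attempts} attempts)`);
    }
    // Wait a short interval, then check again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A fast page passes on an early attempt. A slow page gets the full timeout. A fixed sleep gives you neither.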
Isolated means each test starts clean. No shared login. No leftover data from the test before. Playwright gives each test a fresh browser context (a clean session). Set up state in a hook, not across tests.
import { test, expect } from '@playwright/test';

test.beforeEach(async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('ada@example.com');
  await page.getByRole('button', { name: 'Sign in' }).click();
});

test('shows the account name', async ({ page }) => {
  await expect(page.getByRole('heading', { name: 'Ada Lovelace' })).toBeVisible();
});
What AI agents get wrong: they chain tests, where test 2 needs test 1 to run first. One failure then breaks the whole file.
To test a page, you often need data first. A user, an order, a draft. Do not click through ten screens to make it. Send the data straight to your backend with the request fixture (a built-in HTTP client). It is faster and steadier.
import { test, expect } from '@playwright/test';

test('opens an existing project', async ({ page, request }) => {
  const res = await request.post('/api/projects', {
    data: { name: 'Apollo' },
  });
  expect(res.ok()).toBeTruthy();
  await page.goto('/projects');
  await expect(page.getByText('Apollo')).toBeVisible();
});
What AI agents get wrong: they build the data through the UI every time. The test gets long and slow, and a setup step fails for reasons that have nothing to do with the real check.
A test ID (a tag added just for tests, like data-testid) works as a fallback. But reach for getByRole and getByLabel first. Those test what a real user can do. A test ID only proves an attribute exists.
import { test, expect } from '@playwright/test';

test('cart shows one item', async ({ page }) => {
  await page.goto('/cart');
  // Prefer a real role over a test id.
  await expect(page.getByRole('listitem')).toHaveCount(1);
});
What AI agents get wrong: they paste data-testid on everything. The tests pass even when the button has no label and a screen reader (assistive software) cannot find it. The test misses a real bug.
A trace (a full recording of the run) shows every step, the DOM, and the network. Set it to record only on the first retry of a failed test. You get the evidence for failures, and clean runs stay fast.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1,
  use: {
    trace: 'on-first-retry',
  },
});
What AI agents get wrong: they leave tracing off, or set trace: 'on' for every run. Off means no evidence when a test fails. Always-on slows the suite and fills your disk.
Parallel means many tests run at once. Playwright runs test files in parallel by default, while the tests inside one file run in order. For one big file of independent tests, set parallel mode. To split a slow suite across machines, use sharding (run a slice per machine).
import { test, expect } from '@playwright/test';

test.describe.configure({ mode: 'parallel' });

test('loads home', async ({ page }) => {
  await page.goto('/');
  await expect(page).toHaveTitle(/Home/);
});
Split across three machines on CI (your build server):
npx playwright test --shard=1/3
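To make the --shard=1/3 idea concrete, here is a rough sketch of how a list of test files could be cut into slices, one per machine. This is an assumption for illustration only. Playwright's real distribution logic may differ, and shardSlice and the file names are hypothetical.

```typescript
// Hypothetical helper: returns the slice of items that shard `index` of `total` would run.
// Shard indexes are 1-based, matching the --shard=1/3 notation.
function shardSlice<T>(items: T[], index: number, total: number): T[] {
  if (
    !Number.isInteger(index) ||
    !Number.isInteger(total) ||
    total < 1 ||
    index < 1 ||
    index > total
  ) {
    throw new RangeError(`Invalid shard ${index}/${total}`);
  }
  // Contiguous slices whose sizes differ by at most one.
  const start = Math.floor(((index - 1) * items.length) / total);
  const end = Math.floor((index * items.length) / total);
  return items.slice(start, end);
}

// Made-up file names, split across three machines.
const files = ['a.spec.ts', 'b.spec.ts', 'c.spec.ts', 'd.spec.ts', 'e.spec.ts'];
const shards = [1, 2, 3].map((i) => shardSlice(files, i, 3));
```

Each file lands in exactly one slice, so the three machines together run the whole suite once.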
What AI agents get wrong: they write tests that share a database row or a single user. Run those in parallel and they fight each other, so you get flaky tests.
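One way to stop parallel tests from fighting over a single user is to give each test its own data. The sketch below is a hypothetical helper, not part of Playwright. In a real test, the two inputs could come from testInfo.workerIndex and testInfo.title. It assumes test titles are unique within your suite.

```typescript
// Hypothetical helper: builds an email address that is distinct per worker and per test title.
function uniqueEmail(workerIndex: number, testTitle: string): string {
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything that is not a letter or digit
    .replace(/^-+|-+$/g, ''); // trim leading and trailing dashes
  return `user-w${workerIndex}-${slug}@example.com`;
}

// Two workers running in parallel get two different users.
const emailA = uniqueEmail(0, 'User can sign in');
const emailB = uniqueEmail(1, 'User can sign in');
```

Seed each user through the API at the start of the test, and no two tests touch the same row.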
Keep if and try out of your tests
A test should walk one clear path. No branching. If a test asks "is the button there? if so click it," it hides a bug. The button should always be there. Assert it.
import { test, expect } from '@playwright/test';

test('checkout button works', async ({ page }) => {
  await page.goto('/cart');
  // Assert the state. Do not guess it with an if.
  const checkout = page.getByRole('button', { name: 'Checkout' });
  await expect(checkout).toBeEnabled();
  await checkout.click();
});
What AI agents get wrong: they wrap clicks in if (await locator.isVisible()) to stop errors. That hides the real failure. A test that skips its own check still goes green.
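The difference is easy to see in a toy model. The sketch below uses a hypothetical FakeButton object in place of a real page element, so it runs without a browser. It is an illustration of the two shapes, not Playwright code.

```typescript
// Hypothetical stand-in for a page element; not a Playwright type.
interface FakeButton {
  visible: boolean;
  clicks: number;
}

// Anti-pattern: when the button is missing, this does nothing and reports nothing.
function clickIfVisible(button: FakeButton): void {
  if (button.visible) button.clicks += 1;
}

// Preferred shape: state the expectation first, so a missing button fails loudly.
function assertVisibleThenClick(button: FakeButton): void {
  if (!button.visible) {
    throw new Error('Expected the Checkout button to be visible');
  }
  button.clicks += 1;
}

// A broken page where the button never rendered.
const missing: FakeButton = { visible: false, clicks: 0 };
clickIfVisible(missing); // no error and no click, so a test built this way stays green
```

Calling assertVisibleThenClick(missing) throws instead, which is the failure you want to see.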
Test the behavior, not the internals. Check the visible result. Do not check a CSS class, a state variable, or a function name. Those change when you refactor (rewrite the code), even though the app still works.
import { test, expect } from '@playwright/test';

test('shows a success message after submit', async ({ page }) => {
  await page.goto('/contact');
  await page.getByLabel('Message').fill('Hello');
  await page.getByRole('button', { name: 'Send' }).click();
  // Check the user-facing result, not an internal class.
  await expect(page.getByText('Thanks, we got your message')).toBeVisible();
});
What AI agents get wrong: they assert on class="is-active" or an exact HTML shape. The test breaks on every redesign, even when nothing real changed.
A project (a named test setup) in playwright.config.ts runs the same tests under different settings. Use projects to cover Chromium, Firefox, and WebKit (the three main browser engines). One config, full coverage.
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  retries: 1,
  use: { trace: 'on-first-retry' },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
What AI agents get wrong: they hard-code one browser, or copy a config with no projects array. The suite then tests Chrome only, and a Safari-only bug ships to users.
AI writes the first draft fast. That part is real, and it is useful. But the first draft copies patterns from old code on the internet. It adds hard waits. It clicks through the UI to seed data. It wraps fragile steps in if blocks so the run stays green.
A green suite that proves nothing is worse than no suite. It buys false trust.
So the workflow is simple. Let the agent write the draft. Then a human reads it against these 10 rules and fixes what the agent got wrong. The agent moves fast. The human keeps the tests honest. That is the job of an AI QA Architect.
Build the tests with AI. Then make them stable yourself.
Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.