Published: · 10 min read
Playwright's new stabilization work echoes a flaky-test fix from enterprise QA. See the pattern, why it matters, and how teams can reduce noisy failures.
Three years ago, I built a test framework that fixed itself. Nobody called it AI. "Agent" was still the thing antivirus ran on your laptop.
I split the framework into three parts. I called them Planner, Generator, and Healer. Not because I read a paper. Those were the three jobs I needed. I was out of good names.
Last October, Playwright shipped three Test Agents in version 1.56. Three of them.
They call them Planner, Generator, and Healer.
This month, version 1.59 shipped the rest of the plumbing. It added video recording inside tests (page.screencast). It added browser.bind(), so Claude or Cursor can connect to a running browser. It added async disposables (auto-cleanup for test resources). The agents shipped in October. Their plumbing shipped last week.
So this post is about one thing. The same three-part system that saved my career just shipped as a feature in the tool everyone uses.
Here is what Playwright got right. Here is what is still missing. And here is how to start using it today. Even if you stay on your own framework.
If your tests fail at random, and someone keeps asking you to "just make the flaky tests pass" — read this.
Here is a cost every engineering manager forgets: the flaky-test cost.
One team I worked with had 1,200 end-to-end tests. About 4% failed at random on each run. That is roughly 48 red tests every run. Sounds small. It was not.
That is the flaky-test cost. It costs you people, not money. That is why budgets miss it. It shows up as missed deadlines, canceled demos, and tired engineers.
The normal fix is "try harder."
All true. None is enough. You can try harder. Flaky tests keep growing.
So I stopped fixing each test. I started fixing how all tests work together.
I won't name the company. I will say this. My tests passed on my laptop. They failed only on clean CI builds. They failed when they ran beside another team's tests.
Sometimes they failed. Not every time. Always on Tuesday, between 10:14 AM and 10:22 AM.
We lost two weeks. I tried everything. I tried everything again. I tried everything in a new order.
On day 11, I stood at a whiteboard at 9 PM. The board was full of arrows. I finally saw the truth.
The tests were fine. The framework was the problem.
My framework thought the app was the only thing under test. It was not. The CI server was under test too. So was the database snapshot job. So was the deploy timing on the staging server.
We fixed that one bug. But the two weeks taught me the big lesson:
Fixing flaky tests is not a writing problem. It is a design problem.
The tests don't need more rules. The framework around them needs to be smarter.
That is where the three-part system was born.
Here is the whole system in short. The names are mine. The ideas are obvious once you stop pretending they are one job.
The Planner's job: read a feature, a user story, or a bug. Write a test plan.
Not code. A plan. A list of flows, edge cases, set-up, clean-up. In plain Markdown.
Why it is its own job: planning and writing are not the same skill. If one thing does both, tests drift from the plan. You get tests the agent can't explain. And gaps where it had no example to copy.
Plan first. Write later.
What I built three years ago: a plan generator that read from PR descriptions, Jira tickets, and production alerts. It produced a Markdown plan. Engineers reviewed it before any code was written. About 85% of plans were approved as-is. The 15% that were rejected were caught in minutes. Not days of debugging.
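A minimal sketch of what such a plan could look like as data, and how it might be rendered to Markdown for review. The type names, field names, and the sample plan are all hypothetical. The original plan generator is not public, so treat this as one possible shape, not the actual schema.

```typescript
// Hypothetical shapes: these field names are illustrative, not the author's schema.
type PlanSource = "pr-description" | "jira-ticket" | "production-alert";

interface TestPlan {
  title: string;
  source: PlanSource;
  setup: string[];
  flows: string[];
  edgeCases: string[];
  cleanup: string[];
}

// Render one Markdown section. Empty sections still appear so reviewers can see gaps.
function renderSection(heading: string, items: string[]): string {
  const body =
    items.length > 0 ? items.map((item) => `- ${item}`).join("\n") : "- (none listed)";
  return `## ${heading}\n${body}`;
}

// Turn a plan into the Markdown document an engineer reviews before any code exists.
function renderPlan(plan: TestPlan): string {
  return [
    `# Test plan: ${plan.title}`,
    `Source: ${plan.source}`,
    renderSection("Set-up", plan.setup),
    renderSection("Flows", plan.flows),
    renderSection("Edge cases", plan.edgeCases),
    renderSection("Clean-up", plan.cleanup),
  ].join("\n\n");
}

// Made-up example plan, for illustration only.
const examplePlan: TestPlan = {
  title: "Guest checkout",
  source: "jira-ticket",
  setup: ["Seed one product in the catalog"],
  flows: ["Add product to cart", "Pay as guest"],
  edgeCases: ["Expired card"],
  cleanup: [],
};

const planMarkdown = renderPlan(examplePlan);
console.log(planMarkdown);
```

Keeping the plan as plain data makes the review step cheap. A reviewer reads a page of Markdown, not a diff of test code.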
The Generator's job: take an approved plan. Write the test code. Pick the button names. Write the checks. Set up the test.
Why it is its own job: code writing works best with a narrow goal (one plan). Not a wide goal (the whole codebase). A focused generator with one plan beats a smart generator with the whole repo.
What I built: a generator that turned plan Markdown into Playwright tests in TypeScript. It picked button names in a fixed order (data-testid first, then role, then text as last resort). It set up fixtures. It used soft checks by default. No creativity. One plan in, one test file out.
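The fixed selector order can be sketched like this. It assumes a hypothetical TargetElement description of the thing to click. The function returns Playwright locator expressions as source text, using the getByTestId, getByRole, and getByText locators and a soft check via expect.soft. It is a sketch of the rule, not the original generator.

```typescript
// Hypothetical description of a target element. The field names are assumptions.
interface TargetElement {
  testId?: string;
  role?: string;
  accessibleName?: string;
  text?: string;
}

// Build a locator expression in a fixed order:
// data-testid first, then role, then visible text as the last resort.
function pickLocator(el: TargetElement): string {
  if (el.testId) {
    return `page.getByTestId(${JSON.stringify(el.testId)})`;
  }
  if (el.role) {
    const options = el.accessibleName
      ? `, { name: ${JSON.stringify(el.accessibleName)} }`
      : "";
    return `page.getByRole(${JSON.stringify(el.role)}${options})`;
  }
  if (el.text) {
    return `page.getByText(${JSON.stringify(el.text)})`;
  }
  throw new Error("No usable selector: add a data-testid, a role, or visible text");
}

// Emit one soft visibility check as a line of generated test code.
function softVisibleCheck(el: TargetElement): string {
  return `await expect.soft(${pickLocator(el)}).toBeVisible();`;
}

// The test id wins even when a role and text are also available.
const byTestId = pickLocator({ testId: "submit-order", role: "button", text: "Submit" });
// No test id, so the role is used, with the accessible name as a filter.
const byRole = pickLocator({ role: "button", accessibleName: "Submit" });
// Only text is known, so text is the last resort.
const softLine = softVisibleCheck({ text: "Order placed" });

console.log(byTestId);
console.log(byRole);
console.log(softLine);
```

A fixed order means two runs over the same plan pick the same selector. That is the "no creativity" part.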
The Healer's job: a test fails. Check why.
Is it a real bug? A button name that moved? Or a slow server that day?
Fix the things you can. Flag the real bugs. Park the rest with notes.
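Those three outcomes can be sketched as a small triage function. The FailureSignal fields and the 5-second threshold are assumptions made for illustration. A real Healer would need richer signals than this.

```typescript
// Hypothetical failure record. Real runners expose different fields.
interface FailureSignal {
  testName: string;
  elementFound: boolean;      // was the target element located on the page?
  assertionFailed: boolean;   // did a check on app behavior fail?
  slowestResponseMs: number;  // slowest network response seen during the test
}

type Verdict = "fix-selector" | "flag-real-bug" | "park-with-notes";

interface TriageResult {
  verdict: Verdict;
  note: string;
}

// Assumed threshold: anything slower points at the environment, not the test.
const SLOW_RESPONSE_MS = 5000;

function triage(failure: FailureSignal): TriageResult {
  // Check the environment first. A slow server can also make elements look missing.
  if (failure.slowestResponseMs > SLOW_RESPONSE_MS) {
    return {
      verdict: "park-with-notes",
      note: `${failure.testName}: slow server (${failure.slowestResponseMs} ms)`,
    };
  }
  // The server was healthy but the element is gone: likely a name that moved.
  if (!failure.elementFound) {
    return { verdict: "fix-selector", note: `${failure.testName}: selector looks stale` };
  }
  // The element was there and the check still failed: treat it as a real bug.
  if (failure.assertionFailed) {
    return { verdict: "flag-real-bug", note: `${failure.testName}: app behavior changed` };
  }
  return { verdict: "park-with-notes", note: `${failure.testName}: no clear cause` };
}

const staleCase = triage({
  testName: "checkout", elementFound: false, assertionFailed: false, slowestResponseMs: 300,
});
const bugCase = triage({
  testName: "checkout", elementFound: true, assertionFailed: true, slowestResponseMs: 300,
});
const slowCase = triage({
  testName: "checkout", elementFound: false, assertionFailed: false, slowestResponseMs: 9000,
});

console.log(staleCase.verdict, bugCase.verdict, slowCase.verdict);
```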
Why it is its own job: this is the part no one wanted to hear. Healing is not "run it again until it passes." That is hiding. Healing is three steps: check, propose a fix, get it reviewed.
What I built: a Healer that compared the current page to the last green run. If the button name was stale, it proposed three new candidates. It scored each one. It opened a pull request with the best one-line change. A human reviewed it.
Humans merged about 80% of those fixes. The other 20% were caught in review. That is exactly what a good Healer looks like.
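Here is a rough sketch of the compare, score, and propose steps. The snapshot fields and the scoring weights are assumptions, not the original Healer's logic. The output is a proposal for a human to review, never an automatic merge.

```typescript
// Hypothetical snapshot of one element, taken from the last green run or the current page.
interface ElementSnapshot {
  selector: string;
  role: string;
  tag: string;
  text: string;
}

interface ScoredCandidate {
  selector: string;
  score: number;
}

interface HealProposal {
  staleSelector: string;
  candidates: ScoredCandidate[]; // at most three, best first
  patch: string;                 // one-line change shown as a tiny diff
  requiresHumanReview: true;
}

// Assumed weights (integers, out of 100): role and text matter most, tag a little.
function similarity(lastGreen: ElementSnapshot, current: ElementSnapshot): number {
  let score = 0;
  if (current.role === lastGreen.role) score += 40;
  if (current.tag === lastGreen.tag) score += 20;
  if (current.text.trim().toLowerCase() === lastGreen.text.trim().toLowerCase()) score += 40;
  return score;
}

function proposeFix(
  lastGreen: ElementSnapshot,
  currentPage: ElementSnapshot[],
): HealProposal | null {
  const candidates = currentPage
    .map((el) => ({ selector: el.selector, score: similarity(lastGreen, el) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);
  const best = candidates[0];
  // Nothing on the page resembles the old element: do not guess.
  if (!best || best.score === 0) return null;
  return {
    staleSelector: lastGreen.selector,
    candidates,
    patch: `- ${lastGreen.selector}\n+ ${best.selector}`,
    requiresHumanReview: true,
  };
}

// Made-up page data, for illustration only.
const lastGreenButton: ElementSnapshot = {
  selector: '[data-testid="pay-btn"]', role: "button", tag: "button", text: "Pay now",
};
const currentElements: ElementSnapshot[] = [
  { selector: "footer span", role: "none", tag: "span", text: "Help" },
  { selector: "a.pay-link", role: "link", tag: "a", text: "Pay now" },
  { selector: '[data-testid="pay-button"]', role: "button", tag: "button", text: "Pay now" },
  { selector: '[data-testid="cancel-btn"]', role: "button", tag: "button", text: "Cancel" },
];

const proposal = proposeFix(lastGreenButton, currentElements);
console.log(proposal);
```

Returning null when nothing matches is deliberate. A Healer that always proposes something will eventually propose something wrong.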
I don't love numbers without a shop name. My rules don't let me name the shop. So here is what I can tell you plainly:
These numbers are not magic. They come from splitting the work into three small jobs. And from watching the handoff between each job. If you already do this with your services, you already know why it works.
Playwright v1.56 shipped a set of Test Agents in VS Code and the command line: Planner, Generator, and Healer.
The plumbing around them arrived in later releases. See the release notes for v1.58 and v1.59. The supporting APIs are browser.bind() and page.screencast.
Same three jobs. Same split. Microsoft built what I built. They built it better in some ways. They missed one big thing.
Each agent works alone. You can run Planner by itself. Pass its output to Generator. Never touch Healer. That split is the whole point. An agent system where everything is tangled is just one big prompt.
The agents are optional. You don't have to buy in all at once. Drop the Healer into your old tests. Leave Planner and Generator for later. That is how real teams adopt new tools.
They shipped the plumbing, not just the agents. Two pieces matter:
page.screencast: video recording inside tests.
browser.bind(): added in v1.59. It lets any AI tool like Claude or Cursor connect to a running browser. No fresh browser. No lost cookies. No mocking your login.
Together, those two things solve a problem QA teams have been hacking around for years. Let an AI agent work on your real browser. Not a fresh empty one. Microsoft built the plumbing. You don't have to.
The one big thing they missed: the review loop.
Self-healing is not a feature. It is a deal between the test, the app, and the team.
The Healer will happily propose fixes. But who reviews them? Who sets the merge rules? Who steps in when the Healer's fix rate drops? Playwright ships the agent. It does not ship the rules around the agent.
Those rules are the hard part. And you have to build them. Whether you use Microsoft's agents or your own.
A Healer with no review loop is just a bug generator with a nice screen.
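One way to write those rules down is as a small policy the Healer must pass before it opens a pull request. Everything here is an assumption: the policy fields, the 10-review window, the 70% floor, and the one-line limit. The point is only that the rules live in code your team owns.

```typescript
type ReviewOutcome = "merged" | "rejected";

// Hypothetical policy. Your team sets these numbers, not the tool.
interface ReviewPolicy {
  windowSize: number;      // how many recent reviews to look at
  minMergeRate: number;    // pause the Healer when humans merge less than this share
  maxChangedLines: number; // bigger fixes go to a human author instead
}

const reviewPolicy: ReviewPolicy = { windowSize: 10, minMergeRate: 0.7, maxChangedLines: 1 };

// Share of the most recent reviews that ended in a merge, or null with too little history.
function recentMergeRate(history: ReviewOutcome[], windowSize: number): number | null {
  if (history.length < windowSize) return null;
  const recent = history.slice(-windowSize);
  const merged = recent.filter((outcome) => outcome === "merged").length;
  return merged / recent.length;
}

// The Healer stays active until the team has enough data to say it is underperforming.
function healerStatus(history: ReviewOutcome[], policy: ReviewPolicy): "active" | "paused" {
  const rate = recentMergeRate(history, policy.windowSize);
  if (rate === null) return "active";
  return rate >= policy.minMergeRate ? "active" : "paused";
}

// The Healer may only open a pull request. Merging always stays with a human.
function mayOpenPullRequest(
  changedLines: number,
  history: ReviewOutcome[],
  policy: ReviewPolicy,
): boolean {
  return healerStatus(history, policy) === "active" && changedLines <= policy.maxChangedLines;
}

// Made-up review histories: 8 of 10 merged, then 5 of 10 merged.
const healthyHistory: ReviewOutcome[] = [
  "merged", "merged", "rejected", "merged", "merged",
  "merged", "merged", "rejected", "merged", "merged",
];
const decliningHistory: ReviewOutcome[] = [
  "merged", "rejected", "merged", "rejected", "merged",
  "rejected", "merged", "rejected", "merged", "rejected",
];

console.log(healerStatus(healthyHistory, reviewPolicy));
console.log(healerStatus(decliningHistory, reviewPolicy));
```

When the status flips to paused, someone has to look at why. That answers "who steps in" with a name on a rota, not a hope.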
If you already use Playwright, the path is simple. Try the Planner agent in VS Code next sprint. Feed it one real user story. Compare its plan to your plan. Do that 10 times. If you would hand its plans to a junior engineer, it works. That means you found a 2–3× speed boost.
If you use Selenium, Cypress, or something older, migration got easier this month. But the system is portable. You don't need Microsoft's tools to build it. You need three things:
Start with the Healer if flaky tests block releases. Start with the Planner if you are short-staffed. Start with the Generator last. It is the flashy one. But it is the least useful without the other two.
If your team doesn't have this yet, print this post. Paste it in your design doc. Replace "I built" with "we can build." Take it to your next architecture review.
Three years ago, this system was a weird thing a weird architect built. Nothing off the shelf solved the problem.
This month, it ships as a native feature in the tool serious web teams use. Last October, the agents shipped inside Playwright. This week's v1.59 release added the production parts: video receipts, MCP interop (AI tool bridge), and async disposables.
If you are still treating flaky tests as a writing problem, you are three years behind.
If you treat them as a design problem, you are on time.
If you have been treating them as a design problem for years, you are ahead of the team that ships the framework.
That is a fine place to be.
The system worked then. It ships natively now: agents in v1.56, infrastructure in v1.59. The rules around it are still yours to build.
That is the job.
Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test systems where AI agents and human engineers work together on quality. Former Apple SDET (Apple.com and Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.