How to Implement AI in QA (2026): A Practical Framework

Published: · 6 min read

Where AI helps in QA, where it lies, and the one rule that keeps AI tests trustworthy — the AI does the work, a fixed check decides. With a working example.

[Image: dark code-style card reading 'How to implement AI in QA', split in two: AI explores, a locked check decides pass or fail.]

Key takeaways

To implement AI in QA, let AI do the work and let a fixed check decide the result. Use AI to generate tests, repair broken selectors, and explore your app. Never let AI grade its own output. Add an independent check the AI cannot change, test for repeatable results, and test that the agent did only what you asked.

What "AI in QA" actually means

AI in QA means using an AI model to help test software. That is the whole idea. The model can write test cases, fix tests that broke, click through your app like a user, or read a failure and guess the cause.

It does not mean the AI replaces testing. It means the AI does some of the work a tester used to do by hand. The judgment stays with you.

I test software for a living. The teams that win with AI in QA all draw the same line. AI does the work. A human, and a fixed check, decide if the work is good.

The one rule that keeps it safe

Here is the rule the rest of this guide hangs on.

Let AI do the work. Never let AI judge its own work.

Picture an AI agent that tests your login. It clicks around. It reports green. Everyone relaxes. But the agent decided what "pass" means. If it is too kind, it passes a broken page. Now you have shipped a bug with a green check on top.

An AI that grades its own work is not a test. It is an opinion.

So you give the AI room to explore, and you keep one thing it cannot touch. A fixed check. A known-good answer. A hard assert (a check that fails loudly) on the real outcome. The agent finds the path. The fixed check says pass or fail.

In testing, this fixed answer has a name: an oracle. An oracle is whatever decides if an outcome is correct. Neither the system being tested nor the agent driving it is allowed to influence it. Keep your oracle out of the AI's reach and most AI-in-QA risk goes away.
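A minimal sketch of that separation, in TypeScript. The type and function names are hypothetical, made up for illustration. The point is the signature: the oracle receives the real state and the known-good answer, and the agent's report is not an input at all.

```typescript
// Hypothetical types for illustration only.
type AgentReport = { claimedSuccess: boolean; summary: string };
type Booking = { room: string; durationMin: number };

// The oracle sees real state and a known-good answer.
// The agent's report is deliberately not a parameter.
function oracle(actual: Booking | null, expected: Booking): "pass" | "fail" {
  if (actual === null) return "fail";
  const matches =
    actual.room === expected.room && actual.durationMin === expected.durationMin;
  return matches ? "pass" : "fail";
}

// The agent says it worked, but nothing was actually written.
const report: AgentReport = { claimedSuccess: true, summary: "Booked room B." };
const realState: Booking | null = null;
const verdict = oracle(realState, { room: "B", durationMin: 30 });
console.log(`agent says ${report.claimedSuccess}, oracle says ${verdict}`);
```

Because the report never reaches the oracle, a too-kind agent has nothing to push on.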

Where AI helps in QA (the 4 good jobs)

These four jobs are where AI pays off today.

  1. Writing tests. Point the model at a page or a user story. It drafts test cases, including edge cases a tired human skips. You review and keep the good ones.
  2. Fixing broken tests. A button moved and the test broke. AI can find the new selector (how a test finds a button) and propose the fix. This is the biggest time-saver for most teams.
  3. Exploring the app. An AI agent can wander your app like a curious user and report what feels broken. Great for finding the bug nobody wrote a test for.
  4. Reading failures. When a test fails, AI can read the log and the trace and suggest the likely cause. It turns a wall of red into a short list to check.

In all four, the AI proposes. You and your fixed checks dispose.
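Here is what "the AI proposes, a fixed check disposes" might look like for a selector fix. This is a sketch with assumptions: the types, the acceptProposal function, and the rule itself (exactly one match, with a label you already know) are hypothetical choices, not part of any tool. The page is faked with a plain lookup table so the example runs on its own.

```typescript
// Hypothetical shapes; not from any real tool.
type Match = { text: string };
type SelectorProposal = { oldSelector: string; newSelector: string };

// Fixed rule: the new selector must match exactly one element,
// and that element must carry the label we already know is right.
function acceptProposal(
  proposal: SelectorProposal,
  query: (selector: string) => Match[],
  expectedText: string,
): boolean {
  const matches = query(proposal.newSelector);
  if (matches.length !== 1) return false;
  const only = matches[0];
  return only !== undefined && only.text.trim() === expectedText;
}

// A stand-in for the page: selector -> elements it matches.
const fakePage: Record<string, Match[]> = {
  "button.submit-v2": [{ text: " Sign in " }],
  "button": [{ text: "Sign in" }, { text: "Cancel" }],
};
const query = (selector: string): Match[] => fakePage[selector] ?? [];

const good = acceptProposal(
  { oldSelector: "#login", newSelector: "button.submit-v2" }, query, "Sign in");
const tooBroad = acceptProposal(
  { oldSelector: "#login", newSelector: "button" }, query, "Sign in");
console.log({ good, tooBroad });
```

The model can propose anything it likes. Only proposals that pass the fixed rule reach your review.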

Where AI lies in QA (the 4 traps)

This is the part most guides skip. AI in QA fails in four ways. Plan for each.

  1. It passes a broken thing. A too-kind agent calls a broken page "fine." Fix: an independent oracle the agent cannot move.
  2. It is not repeatable. The same input passes now and fails in ten minutes. Fix: run the same input twice and compare the shape of the answer.
  3. It does too much. You asked for one thing. The agent also changed a setting or sent a message. Fix: a scope check on what it touched.
  4. It depends on a model that can change. The model under your tool updates, or even goes offline, and your tests shift with it. Fix: pin the model version and keep a fallback.

That last one is not theoretical. In June 2026 a widely used model was pulled offline overnight. Teams that pinned their model switched to a fallback in one line. Teams that did not found out when their build broke.
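A sketch of "pin the model and keep a fallback", assuming your tool takes the model name as a plain string. The model identifiers and the pickModel helper are hypothetical. How you learn that a model is down (a health check, a failed call) is up to you, so here it is just a list passed in.

```typescript
// Hypothetical model identifiers; use the exact versions your vendor publishes.
type ModelConfig = { primary: string; fallback: string };

const MODELS: ModelConfig = {
  primary: "vendor-a/model-x-2026-01-15",
  fallback: "vendor-b/model-y-2025-11-02",
};

// Returns the pinned primary unless it is known to be down.
// Throws rather than silently picking some third, unpinned model.
function pickModel(config: ModelConfig, unavailable: string[]): string {
  if (unavailable.indexOf(config.primary) === -1) return config.primary;
  if (unavailable.indexOf(config.fallback) === -1) return config.fallback;
  throw new Error("Both pinned models are unavailable");
}

console.log(pickModel(MODELS, []));               // the primary
console.log(pickModel(MODELS, [MODELS.primary])); // the fallback
```

Both names live in one place, so the switch is a one-line change and it shows up in code review.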

A working example: an AI agent with an oracle it cannot move

Here is the pattern in code. An AI agent books a meeting room. Then three fixed checks decide if it really worked. The agent never grades itself.

import { test, expect } from '@playwright/test';
import { Stagehand } from '@browserbasehq/stagehand';

test('AI books a room, and only that', async () => {
  // Pin the model. Do not let it auto-upgrade under your tests.
  const stage = new Stagehand({ env: 'LOCAL', model: 'anthropic/claude-opus-4-8' });
  await stage.init();

  try {
    // 1) Let the AI do the work. It decides HOW to book the room.
    await stage.act('Book room B for 2pm tomorrow, for 30 minutes');

    // 2) The fixed check the AI cannot move (the oracle).
    //    These helpers read your real database, not the agent's report.
    const booking = await getBookingFromDb({ room: 'B', time: '14:00' });
    expect(booking).toBeTruthy();          // it did the task
    expect(booking.durationMin).toBe(30);  // exactly what we asked for

    // 3) Scope check. Did it touch anything it should not have?
    const otherChanges = await getChangesExcept(booking.id);
    expect(otherChanges).toHaveLength(0);  // no surprise side effects
  } finally {
    // Always release the agent's browser, even when a check fails.
    await stage.close();
  }
});

Read the three checks again. The agent's own "I booked it" is never trusted. The database is the oracle. The duration check catches a sloppy booking. The scope check catches the agent doing extra. (getBookingFromDb and getChangesExcept are your own helpers — they read real state, not the agent's words.)
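How you build a helper like getChangesExcept depends on your stack. One hedged sketch: snapshot the rows before the agent runs, snapshot them again after, and list every row that differs except the one you expected. The Snapshot type and changedIdsExcept function are hypothetical, and rows are reduced to strings to keep the example small.

```typescript
// Hypothetical snapshot: record id -> serialized row contents.
type Snapshot = Record<string, string>;

// Lists every record that was added, removed, or modified,
// ignoring the one record the agent was supposed to create.
function changedIdsExcept(
  before: Snapshot,
  after: Snapshot,
  allowedId: string,
): string[] {
  const allIds = Object.keys({ ...before, ...after });
  return allIds
    .filter((id) => id !== allowedId)
    .filter((id) => before[id] !== after[id])
    .sort();
}

const before: Snapshot = { "user:7": "notify=off", "room:B": "free" };
const after: Snapshot = {
  "user:7": "notify=on",            // a setting the agent was not asked to touch
  "room:B": "free",
  "booking:42": "room=B;min=30",    // the booking we asked for
};

console.log(changedIdsExcept(before, after, "booking:42"));
```

An empty list means the agent stayed in its lane. Anything else is the extra work you did not ask for.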

To catch the not-repeatable trap, run the same prompt twice and compare:

// runBooking is your own helper: it resets the test data, runs the agent,
// and returns the booking read from the database.
test('same request, same result', async () => {
  const a = await runBooking('Book room B for 2pm tomorrow, 30 minutes');
  const b = await runBooking('Book room B for 2pm tomorrow, 30 minutes');
  // The wording of the agent's reply may differ. The outcome may not.
  expect(a.room).toBe(b.room);
  expect(a.durationMin).toBe(b.durationMin);
});

The model is allowed to phrase its answer differently each time. It is not allowed to book a different room.
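If you want to run more than two rounds, the same idea can be sketched as "count the distinct outcomes". This goes a step beyond the twice-only check and is my assumption, not a requirement. RunResult, outcomeKey, and distinctOutcomes are hypothetical names. The reply text is left out of the key on purpose.

```typescript
// Hypothetical result of one agent run.
type RunResult = { room: string; durationMin: number; agentReply: string };

// Only the outcome goes into the key. The wording is ignored.
function outcomeKey(result: RunResult): string {
  return JSON.stringify([result.room, result.durationMin]);
}

// Returns one key per distinct outcome across all runs.
function distinctOutcomes(results: RunResult[]): string[] {
  const seen: Record<string, true> = {};
  for (const result of results) {
    seen[outcomeKey(result)] = true;
  }
  return Object.keys(seen);
}

const stable: RunResult[] = [
  { room: "B", durationMin: 30, agentReply: "Done, room B is yours." },
  { room: "B", durationMin: 30, agentReply: "I booked room B for 30 minutes." },
];
const drifting: RunResult[] = [
  { room: "B", durationMin: 30, agentReply: "Booked." },
  { room: "C", durationMin: 30, agentReply: "Booked." },
];

console.log(distinctOutcomes(stable).length, distinctOutcomes(drifting).length);
```

One distinct outcome is a pass. More than one means the same request is landing in different places.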

The 3 tests every AI feature needs

If you ship an AI feature to users, these three tests catch the failures that page you at 2am. Most teams write only the easy one ("does it give a good answer?") and stop there.

  • The wrong-input test. Feed it junk, empty fields, another language, a user trying to break it. A good feature fails safely. A bad one is confidently wrong.
  • The same-input-twice test. Run the exact input twice. Same kind of answer? Different wording is fine. Pass-then-fail is not.
  • The scope test. Did it do only what you asked? Or did it also change a setting, send a message, or touch a file? Extra is not helpful. Extra is a future incident.

Where to start on Monday

You do not need a platform or a budget. Start small.

  1. Pick one flaky test. Let an AI tool propose the fix. You keep final say.
  2. Add one oracle. Take your most important flow and add a hard check on the real outcome, not the agent's report.
  3. Pin your model. Lock the version your AI tools use. Add a fallback.
  4. Add the scope check to one agent run. See what it touches when you are not looking.

Do those four and you have AI in QA that you can trust. The AI does more work. You keep the judgment. The fixed checks keep everyone honest.

That is the whole job: the gap between "the AI says it passed" and "it passed, for the right reason."


Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET (Apple.com / Apple Card pre-release testing). Find him at anton.qa or on LinkedIn.


