Playwright CLI vs Scripts: How AI Agents Should Test

There’s a growing pattern in agentic development: give an AI agent a browser and let it figure things out. Tools like Playwright CLI, browser-use, and Claude Code’s computer use make it easy to point an LLM at a web page and say “test this.”

It works. Until it doesn’t.

After months of building an automated microservice deployment system where AI agents deploy, configure, and verify services end-to-end, I’ve landed on a clear mental model for when to use Playwright CLI (interactive, LLM-driven) vs standard Playwright scripts (deterministic, reproducible). The distinction matters more than most people think.

The Two Modes

Playwright CLI: Exploration Mode

The CLI is conversational. The agent navigates, takes snapshots, reads the DOM, decides what to click next. Each interaction is a new LLM call.

# Agent opens a page, takes a snapshot, reads the YAML, decides next action
playwright-cli -s=verify open https://app.example.com --headed
playwright-cli -s=verify snapshot
playwright-cli -s=verify click e{ref-from-snapshot}
playwright-cli -s=verify fill e{input-ref} "search term"

This is how my verify-test skill works in the Data Source Automator pipeline. When I’m manually testing a newly deployed microservice with Claude Code, the agent:

Opens the Verify platform in a browser
Takes a snapshot (outputs a YAML with element refs)
Reads the snapshot to find the right elements
Fills forms, clicks buttons, navigates
Evaluates results and decides the next step

Each step is an LLM inference. The agent adapts — if a dropdown is missing, it investigates. If auth expires, it re-logs. If results look wrong, it digs deeper.

This is not testing. This is exploration.

Playwright Scripts: Reproducible Mode

Scripts are deterministic. Same input, same steps, same assertions. No LLM involved in execution.

import { test, expect } from "@playwright/test";

test("Verify service search returns results", async ({ page }) => {
  await page.goto("https://app.example.com");
  await page.fill('[data-testid="search-input"]', "ACME Corp");
  await page.click('[data-testid="search-button"]');

  const results = page.locator('[data-testid="result-row"]');
  await expect(results).toHaveCount({ minimum: 1 });

  const firstResult = results.first();
  await expect(firstResult).toContainText("ACME");
});

No snapshots, no YAML parsing, no LLM reasoning about what to click next. Just code.

Why This Distinction Matters for AI Agents

When I first built the deployment automation system, everything used the CLI approach. The LLM agent would:

Deploy the service (git tag, config-deploy update, ArgoCD sync)
Run infrastructure checks (pods, health, swagger, etcd)
Open a browser via Playwright CLI
Navigate the platform UI to configure and test the service
Evaluate results and report

It worked. But three problems emerged quickly:

1. Token consumption exploded

Every Playwright CLI interaction requires the LLM to:

Read the full snapshot YAML (often 500+ lines)
Reason about which element to interact with
Formulate the next command
Evaluate the result

For a typical verification flow (login → configure → search → validate results → check details), that’s 15-20 LLM calls with large context. At scale, testing 50+ microservices, this was burning through tokens fast.

2. Variability killed reliability

The same test run twice could take different paths. The LLM might:

Click a different element if the snapshot was slightly different
Interpret results differently based on context window state
Get confused by UI changes that didn’t affect functionality

A test that passed Monday might fail Tuesday — not because the service broke, but because the agent reasoned differently.

3. Stable workflows don’t need reasoning

After the first few runs, I noticed the verification pattern was always the same:

Login (or restore auth state)
Navigate to settings
Configure autodiscovery
Create an application
Execute a search
Validate results have data
Open a detail view
Capture evidence screenshots

This doesn’t need an LLM. It needs a script.

The Architecture I Landed On

┌─────────────────────────────────────────────────┐
│           Deployment Pipeline                    │
│                                                  │
│  Deploy ──► Infra Checks ──► Browser Tests       │
│                                  │               │
│                          ┌──────┴──────┐         │
│                          │             │         │
│                    Playwright     Playwright      │
│                    Scripts        CLI             │
│                    (default)     (fallback)       │
│                          │             │         │
│                    Deterministic  LLM-driven      │
│                    Fast           Adaptive         │
│                    Low tokens     High tokens      │
│                          │             │         │
│                          └──────┬──────┘         │
│                                 │                │
│                          LLM receives            │
│                          outputs only            │
└─────────────────────────────────────────────────┘

Scripts run first. They handle the happy path: login, navigate, search, assert, screenshot. The orchestrating LLM agent receives only the test output (pass/fail + screenshots), not the raw DOM.

CLI kicks in as fallback. When a script fails unexpectedly — a UI change, a new modal, an element that moved — the agent switches to CLI mode. Now it can explore, take snapshots, reason about what changed, and either fix the issue or report it.

This is the key insight: the LLM doesn’t need to see every page load and every DOM tree. It just needs the results — and the ability to dig deeper when something goes wrong.

Real Example: The Generated Test Pattern

In the demo-video system, I took this further. Tests are auto-generated from scene definitions:

// scenes/uk-hmrc-supervised-business-register-video-a.ts
// Scene definitions: what to do, in what order, with what data
export const SCENE_CONFIGS = [
  { id: "login", name: "Login to platform" },
  { id: "navigate-settings", name: "Open service configuration" },
  { id: "configure-autodiscovery", name: "Enable the service" },
  { id: "create-application", name: "Create test application" },
  { id: "execute-search", name: "Run company search" },
  { id: "validate-results", name: "Check search results" },
];

// Each scene is a function: (page: Page) => Promise<void>
export const SCENES: Record<string, (page: Page) => Promise<void>> = {
  "login": async (page) => {
    await page.goto("https://demo1-dev.simplekyc.com");
    // ... deterministic steps
  },
  // ...
};

The generated spec file is pure boilerplate — iterates scenes, records timestamps, saves output. Zero LLM involvement at runtime.

// Auto-generated — do not edit manually
test("Record UK HMRC Supervised Business Register", async ({ browser }) => {
  const context = await browser.newContext({
    recordVideo: { dir: VIDEO_DIR, size: { width: 1280, height: 720 } },
    viewport: { width: 1280, height: 720 },
  });

  const page = await context.newPage();
  for (const config of SCENE_CONFIGS) {
    const sceneFn = SCENES[config.id];
    await sceneFn(page);
  }
  await context.close();
});

The LLM’s role? It wrote these scenes initially (using CLI exploration to understand the UI). Then it stepped back and let the scripts run.

When to Use Each

	Playwright CLI	Playwright Scripts
Purpose	Exploration, debugging, unknown flows	Verification, regression, known flows
LLM involvement	Every step	None at runtime
Token cost	High (snapshots + reasoning per action)	Near zero (agent only reads output)
Reproducibility	Low (LLM may take different paths)	High (same code, same steps)
Adaptability	High (handles unexpected UI)	Low (breaks on UI changes)
Speed	Slow (LLM latency per step)	Fast (native browser speed)
Best for	First-time flows, debugging failures, navigation	Repeated verification, CI/CD, evidence capture

The Mental Model

Think of it like a human QA engineer:

First time testing a feature? They click around, explore, take notes. This is CLI mode.
Writing the regression test? They script the exact steps they just explored. This is script mode.
Test fails unexpectedly? They go back to manual exploration to understand why. CLI mode again.

An AI agent should work the same way. Explore with the CLI, codify with scripts, fall back to CLI when scripts break.

Practical Guidelines

Start with CLI exploration when the AI agent encounters a new UI or flow. Let it snapshot, reason, navigate.
Once the flow is stable (works 3+ times consistently), generate a Playwright script. The agent itself can write it — it already knows the selectors and flow from its CLI exploration.
In automated pipelines, always use scripts as the primary test runner. The orchestrating LLM receives stdout (pass/fail, screenshots), not DOM trees.
Keep CLI as fallback. When a script fails, the agent can switch to CLI mode to diagnose: “The search button moved — let me take a snapshot and find it.”
Never ship CLI-only tests in CI. If your AI agent is doing 20 LLM calls per test run in CI, you’re paying for reasoning you don’t need.

One more distinction worth making: pure navigation is conceptually different from testing.

If an AI agent needs to browse the web — research a topic, fill out a form, extract data from a page — CLI is the right tool. That’s not testing, it’s interaction. The agent needs to reason about what it sees.

But the moment you’re verifying that a known flow produces expected results? That’s testing. Script it.

The pattern is simple: explore with CLI, verify with scripts, fall back to CLI when things break. Your AI agents will be faster, cheaper, and more reliable.

This approach is running in production for 50+ microservice deployments. The token savings from switching repetitive verifications to scripts were significant enough to justify the refactor within the first week.

For a deeper dive into the script-based approach — including scene recording, voiceover generation, and video assembly — see I Made a Product Demo Video Entirely with AI.