
Pooling Playwright browsers across FastAPI scans

Apr 21, 2026 · 2 min read · Playwright, Lighthouse, Architecture, REST APIs, Monorepo

The problem

My audit-api service runs Lighthouse and axe-core against any URL a user submits. Every scan launched a fresh Chromium instance, ran the tools, closed the browser, and launched again for the next request. On Railway's container, that cold start costs 2-4 seconds per scan, and under concurrent load the memory profile gets ugly: multiple browser processes fighting for the same container RAM.

Worse, some scans never finish. The original navigation uses wait_until="networkidle", which waits for zero network connections for 500ms. Sites with analytics pings, WebSocket heartbeats, or long-polling connections never reach that state. Scans against the Washington Post hit ERR_HTTP2_PROTOCOL_ERROR because the site's bot detection drops the connection entirely.

Three problems: launch overhead, unbounded wait, and hostile responses.

The pool pattern

One Chromium process stays alive between scans. Each scan gets a fresh browser context (isolated cookies, localStorage, session state) and closes it when the scan finishes:

import asyncio

from playwright.async_api import Browser

class BrowserPool:
    def __init__(self) -> None:
        self._browser: Browser | None = None
        self._lock = asyncio.Lock()
        self._idle_handle: asyncio.TimerHandle | None = None
        self._scan_count = 0
 
    async def acquire(self) -> Browser:
        async with self._lock:
            self._cancel_idle_timer()
            if self._browser is None or not self._browser.is_connected():
                await self._launch()
            self._scan_count += 1
            return self._browser
 
    async def release(self) -> None:
        async with self._lock:
            self._scan_count = max(0, self._scan_count - 1)
            if self._scan_count == 0:
                self._schedule_idle_shutdown()

The idle timer fires after 30 minutes with no scans. acquire() cancels it if a new scan arrives first. On app shutdown, the FastAPI lifespan handler calls pool.shutdown() to force-close the browser.
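The timer side can be filled in with `loop.call_later`. Here is a minimal sketch of the helper methods referenced above; the 30-minute constant and the `shutdown()` body are assumptions based on the description, and the real class also launches and hands out the browser:

```python
import asyncio

_IDLE_TIMEOUT_S = 30 * 60  # assumed 30-minute idle window


class BrowserPool:
    def __init__(self) -> None:
        self._browser = None  # Playwright Browser once launched
        self._idle_handle: asyncio.TimerHandle | None = None

    def _cancel_idle_timer(self) -> None:
        if self._idle_handle is not None:
            self._idle_handle.cancel()
            self._idle_handle = None

    def _schedule_idle_shutdown(self) -> None:
        # Arm a one-shot timer; acquire() disarms it if a scan
        # arrives before the idle window elapses.
        loop = asyncio.get_running_loop()
        self._idle_handle = loop.call_later(
            _IDLE_TIMEOUT_S,
            lambda: asyncio.ensure_future(self.shutdown()),
        )

    async def shutdown(self) -> None:
        # Also called from the FastAPI lifespan handler on app exit.
        self._cancel_idle_timer()
        if self._browser is not None:
            await self._browser.close()
            self._browser = None
```

In the app, a lifespan context manager awaits `pool.shutdown()` after its `yield`, so the Chromium process never outlives the server.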

Each scan creates an isolated context with a realistic User-Agent:

context = await browser.new_context(
    viewport={"width": cfg.screen_emulation.width, ...},
    user_agent=_USER_AGENT,
)
try:
    page = await context.new_page()
    await page.goto(url, wait_until="load", timeout=30_000)
    try:
        await page.wait_for_load_state("networkidle", timeout=10_000)
    except Exception:
        pass  # best-effort; proceed after load
finally:
    await context.close()

wait_until="load" fires when the DOM and all subresources finish loading. A 10-second networkidle attempt follows as a best effort. If the site keeps connections alive, the scan proceeds anyway.

Parallel Lighthouse and axe

Lighthouse runs as a subprocess. axe-core runs inside the Playwright page. They share nothing, so I overlap them:

lighthouse_task = asyncio.create_task(_run_lighthouse(url, device))
axe_task = asyncio.create_task(_run_axe_and_capture(url, device))
 
lighthouse_result, axe_results = await asyncio.gather(
    lighthouse_task, axe_task, return_exceptions=True
)
 
if isinstance(lighthouse_result, BaseException):
    raise lighthouse_result
if isinstance(axe_results, BaseException):
    raise axe_results

return_exceptions=True prevents one failure from cancelling the other, which matters because the axe task holds a browser context that needs cleanup. After both complete, I re-raise any exception explicitly.

The ScanQueue itself is serialized: one scan at a time. Lighthouse uses global performance marks that collide in concurrent runs. The parallelism is within a single scan, not across scans.
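A serialized queue can be as simple as one worker draining an asyncio.Queue. This is a sketch under that assumption; the real ScanQueue's API isn't shown, so the names here (`start`, `submit`, `_drain`) are mine, and `_scan` is a placeholder for the Lighthouse + axe pipeline:

```python
import asyncio


class ScanQueue:
    """One scan at a time: a single worker drains the queue,
    so Lighthouse's global performance marks never collide."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker: asyncio.Task | None = None

    def start(self) -> None:
        self._worker = asyncio.create_task(self._drain())

    async def submit(self, url: str) -> str:
        # Callers park on a future until the worker reaches their job.
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((url, fut))
        return await fut

    async def _drain(self) -> None:
        while True:
            url, fut = await self._queue.get()
            try:
                fut.set_result(await self._scan(url))
            except Exception as exc:
                fut.set_exception(exc)

    async def _scan(self, url: str) -> str:
        # Placeholder for the real scan (Lighthouse + axe in parallel).
        await asyncio.sleep(0)
        return f"scanned {url}"
```

Concurrent submitters simply wait their turn; the parallelism stays inside `_scan`.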

Friendly error mapping

Bot detection and network failures produce raw Playwright error strings that mean nothing to users. A tuple lookup table turns them into sentences:

_ERROR_PATTERNS: list[tuple[str, str]] = [
    ("ERR_HTTP2_PROTOCOL_ERROR",
     "This site blocked our scanner. It uses bot detection..."),
    ("ERR_NAME_NOT_RESOLVED",
     "This domain could not be found. Check the URL for typos."),
    ("Timeout 30000ms exceeded",
     "This site took too long to load..."),
]
 
def _friendly_error(raw: str) -> str:
    for pattern, message in _ERROR_PATTERNS:
        if pattern in raw:
            return message
    return raw or "Unknown scan error"

Substring matching against the raw error string. Eight patterns cover every failure I have seen in production so far.

Takeaway

Pool the expensive resource (browser process), isolate the cheap one (context per scan). Wait for what matters (load), attempt what helps (networkidle), and do not block on what might never arrive.