Overview
Project: Performance and reliability pass across audit-api and job-api, two FastAPI services running on Railway behind a Next.js frontend
Role: Solo Developer
Duration: April 2026
Purpose: Eliminate per-request resource allocation, parallelize independent I/O operations, and harden authentication across both API services in the Nx monorepo
The problems
Both services had the same anti-pattern: creating expensive resources per request and running independent operations sequentially.
audit-api launched a fresh Chromium instance for every scan, waited for networkidle (which hangs on sites with persistent connections), and ran Lighthouse and axe-core sequentially even though they share no state. Some sites blocked the headless browser with bot detection, producing cryptic errors like ERR_HTTP2_PROTOCOL_ERROR.
job-api created a new httpx.AsyncClient per ATS fetch (tearing down and rebuilding TCP connections for every request), created a new Supabase client per dependency injection call, and ran four database operations sequentially per source during polling.
Security review also surfaced two gaps: JWT decode paths did not require exp or sub claims, and the ALLOWED_HOSTS configuration could silently default to permissive behavior.
Audit-api: browser pool and parallel scanning
I replaced per-scan browser launches with a persistent BrowserPool that keeps one Chromium process alive across scans. Each scan gets a fresh browser context for cookie and state isolation, then closes it. An idle timer shuts down the browser after 30 minutes of inactivity. On app shutdown, the FastAPI lifespan handler calls pool.shutdown().
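A minimal sketch of such a pool, assuming Playwright's async API. The `BrowserPool` name, the 30-minute idle window, per-scan contexts, and `shutdown()` come from the description above; the internals are illustrative, not the service's actual code.

```python
import asyncio

IDLE_SECONDS = 30 * 60  # shut Chromium down after 30 minutes without scans


class BrowserPool:
    def __init__(self) -> None:
        self._playwright = None
        self._browser = None
        self._idle_task: asyncio.Task | None = None
        self._lock = asyncio.Lock()

    async def _ensure_browser(self):
        if self._browser is None or not self._browser.is_connected():
            # Imported lazily so the pool can be constructed without a browser.
            from playwright.async_api import async_playwright
            self._playwright = await async_playwright().start()
            self._browser = await self._playwright.chromium.launch(headless=True)
        return self._browser

    async def new_context(self):
        """One fresh context per scan: isolated cookies, storage, and cache."""
        async with self._lock:
            browser = await self._ensure_browser()
            self._reset_idle_timer()
        return await browser.new_context()

    def _reset_idle_timer(self) -> None:
        # Every scan pushes the idle shutdown out by another IDLE_SECONDS.
        if self._idle_task is not None:
            self._idle_task.cancel()
        self._idle_task = asyncio.create_task(self._idle_shutdown())

    async def _idle_shutdown(self) -> None:
        await asyncio.sleep(IDLE_SECONDS)
        await self.shutdown()

    async def shutdown(self) -> None:
        if self._idle_task is not None:
            self._idle_task.cancel()
            self._idle_task = None
        if self._browser is not None:
            await self._browser.close()
            self._browser = None
        if self._playwright is not None:
            await self._playwright.stop()
            self._playwright = None
```

A scan then acquires `context = await pool.new_context()`, runs inside it, and closes the context in a finally block; the lifespan handler awaits `pool.shutdown()` on exit.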
Navigation changed from wait_until="networkidle" to wait_until="load" with a 10-second best-effort networkidle follow-up that never blocks. A realistic Chrome User-Agent header reduced bot detection triggers. When sites still blocked the scanner, a pattern-matching function mapped raw Playwright errors to user-friendly messages.
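The navigation change and the error mapping can be sketched as follows. `ERR_HTTP2_PROTOCOL_ERROR` appears in the source; the other patterns, the messages, and the helper names are illustrative assumptions.

```python
async def navigate(page, url: str) -> None:
    # Block only on the load event; wait_until="networkidle" hangs on sites
    # that hold websockets or long-polling connections open.
    await page.goto(url, wait_until="load")
    try:
        # Best-effort settle: up to 10s of network quiet, never a failure.
        await page.wait_for_load_state("networkidle", timeout=10_000)
    except Exception:
        pass  # a timeout here is expected on chatty sites


# Map raw Playwright error strings to something a user can act on.
ERROR_PATTERNS = {
    "ERR_HTTP2_PROTOCOL_ERROR": "The site appears to block automated scanners.",
    "ERR_NAME_NOT_RESOLVED": "The domain could not be resolved.",
    "Timeout": "The page took too long to load.",
}


def friendly_error(raw: str) -> str:
    for pattern, message in ERROR_PATTERNS.items():
        if pattern in raw:
            return message
    return "The scan failed for an unknown reason."
```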
Lighthouse and axe-core now run in parallel via asyncio.gather with return_exceptions=True. Lighthouse runs as a subprocess; axe runs inside the Playwright page. They share nothing. The ScanQueue itself remains serialized because Lighthouse uses global performance marks that collide in concurrent runs.
lighthouse_task = asyncio.create_task(_run_lighthouse(url, device))
axe_task = asyncio.create_task(_run_axe_and_capture(url, device))
lighthouse_result, axe_results = await asyncio.gather(
    lighthouse_task, axe_task, return_exceptions=True
)

Job-api: connection reuse and DB parallelism
A shared httpx.AsyncClient singleton replaced per-request client creation. The pool limits are 20 max connections and 10 keepalive, with a 15-second timeout and automatic redirect following. All six ATS fetchers (Greenhouse, Lever, Ashby, Workday, SmartRecruiters, JSON-LD) share this pool. The client closes on app shutdown.
def get_http_client() -> httpx.AsyncClient:
    global _client
    if _client is None or _client.is_closed:
        _client = httpx.AsyncClient(
            timeout=15.0,
            limits=httpx.Limits(
                max_connections=20, max_keepalive_connections=10
            ),
            follow_redirects=True,
        )
    return _client

The Supabase client follows the same singleton pattern: initialized once at startup via the FastAPI lifespan, shared across all requests.
For the poller, I identified two phases of independent database operations. Phase 1 runs the upsert of new/updated jobs concurrently with fetching existing rows (to find stale jobs). Phase 2 runs the archive of stale jobs concurrently with updating the last_polled_at timestamp:
# Phase 1: upsert + existing query in parallel
upsert_resp, existing_resp = await asyncio.gather(
    asyncio.to_thread(upsert_query.execute),
    asyncio.to_thread(existing_query.execute),
)

# Phase 2: archive stale + update timestamp in parallel
await asyncio.gather(
    asyncio.to_thread(archive_query.execute),
    asyncio.to_thread(last_polled_query.execute),
)

The asyncio.to_thread calls are essential because supabase-py is a synchronous client. Without them, the blocking execute() calls would never yield to the event loop, and the operations would run sequentially despite asyncio.gather.
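The claim is easy to verify in isolation. This standalone illustration uses time.sleep as a stand-in for a synchronous DB round trip: two 0.2-second blocking calls dispatched through asyncio.to_thread overlap instead of adding up.

```python
import asyncio
import time


def blocking_query(label: str) -> str:
    time.sleep(0.2)  # stand-in for a synchronous DB round trip
    return label


async def main() -> float:
    start = time.perf_counter()
    results = await asyncio.gather(
        # Each blocking call runs in a worker thread, so the event loop
        # stays free and both calls proceed concurrently.
        asyncio.to_thread(blocking_query, "upsert"),
        asyncio.to_thread(blocking_query, "existing"),
    )
    assert results == ["upsert", "existing"]
    return time.perf_counter() - start


elapsed = asyncio.run(main())
# Both sleeps overlap: total is ~0.2s, not 0.4s.
print(f"{elapsed:.2f}s")
```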
Security hardening
JWT decode paths across all three auth strategies now require exp and sub claims:
payload = jwt.decode(
    token,
    s.admin_session_secret,
    algorithms=["HS256"],
    options={"require": ["exp", "sub"]},
)

The ALLOWED_HOSTS configuration now raises a RuntimeError at import time if unset, preventing the server from starting with permissive defaults.
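A fail-fast check along these lines; the helper name and the comma-separated format are assumptions for illustration:

```python
import os


def load_allowed_hosts(env=None) -> list[str]:
    """Parse ALLOWED_HOSTS, raising instead of defaulting to a permissive '*'."""
    raw = (os.environ if env is None else env).get("ALLOWED_HOSTS", "")
    hosts = [h.strip() for h in raw.split(",") if h.strip()]
    if not hosts:
        raise RuntimeError(
            "ALLOWED_HOSTS is not set; refusing to start with a permissive default"
        )
    return hosts


# Evaluated at import time in the settings module, so a missing value stops
# the server before it binds a port:
# ALLOWED_HOSTS = load_allowed_hosts()
```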
Results
| Change | Impact |
|---|---|
| Persistent browser pool | Eliminated 2-4s cold start per scan |
| load + best-effort networkidle | No more hanging scans on persistent-connection sites |
| Parallel Lighthouse + axe | Two independent processes overlap instead of running sequentially |
| Shared httpx client | TCP connections reused across all ATS fetchers |
| Supabase singleton | One client per app lifetime, not per request |
| Phased asyncio.gather on DB ops | 4 serial DB round trips reduced to 2 parallel phases |
| JWT exp/sub requirement | Tokens without expiration or subject rejected |
Takeaway
Pool expensive resources at the process level, and isolate per-request state with lightweight contexts. Parallelize I/O operations that do not share mutable state. And verify that async actually means concurrent: a synchronous library wrapped in async def still blocks the event loop unless you hand it off with asyncio.to_thread.
