Updated 2026-04-17: expanded from a single-ATS Greenhouse poller to six providers (adding Lever, Ashby, Workday, SmartRecruiters, and a JSON-LD fallback), added one-input ATS auto-detection, and cut scan time from 48 seconds to 8.
Overview
Project: Multi-ATS job scraper, scoring engine, and admin dashboard for my active job search
Role: Solo Developer
Duration: April 2026
Purpose: Replace daily manual scanning of dozens of company career pages with a single ranked list of relevant postings, refreshed automatically, hosted under a password-protected admin route on the portfolio
Business impact
- Polls 49 career boards across six ATS providers (Greenhouse, Lever, Ashby, Workday, SmartRecruiters, and a JSON-LD fallback) and stores every posting with a score
- Cut the time to triage a day's new listings from about 40 minutes to under 5
- Filtered thousands of raw postings per scan down to a US-only, role-matched top 20
- Cut scan runtime from 48 seconds to 8 across three performance passes, well under Vercel's 60-second function ceiling
- Gave me one authenticated surface for every job-search tool instead of separate password prompts per dashboard
The challenge
The original pipeline spoke only Greenhouse. Ten hand-curated boards, one fetcher, a scoring engine, a dashboard. That worked for a week. Then I needed Lever, then Ashby, then a Workday instance, then two Workdays with different tenants, then a careers page that rendered through JavaScript and exposed nothing but JSON-LD in the HTML. The hard parts were:
- Unifying five wildly different ATS APIs behind one fetcher contract
- Detecting a provider automatically from either a careers URL or a plain company name, so adding a source is one input instead of three
- Scraping JSON-LD job postings without pulling in a dependency for something Python's stdlib can do
- Keeping the scan button under 10 seconds even with 49 sources fanning out concurrently
- Scoring postings against my actual target profile without hand-curating each one
- Authenticating a Next.js admin dashboard against a Python service without stuffing an API key into the browser
- Running a Python service inside a JavaScript monorepo without duct tape
Architecture
The pipeline has three services cooperating through Supabase:
    Greenhouse API
           │
           ▼
┌──────────────────────┐         ┌─────────────────────────┐
│ FastAPI (job-api)    │ writes  │ Supabase (Postgres)     │
│ poller → score       │────────▶│ job_sources             │
│ sanitize HTML        │         │ job_postings            │
└──────────┬───────────┘         │ job_status_log          │
           ▲                     └────────────┬────────────┘
           │ x-api-key (cron)                 │
           │ JWT bearer (dashboard)           │ reads
┌──────────┴───────────┐                      ▼
│ Next.js proxy routes │                 ┌──────────────────┐
│ /api/jobs/*          │────────────────▶│ /tools/admin/jobs│
│ verifies admin cookie│                 │ dashboard        │
└──────────────────────┘                 └──────────────────┘
The admin browser only ever talks to the Next.js app. Next.js verifies the admin JWT cookie in proxy.ts, then forwards the session token to FastAPI as a bearer credential. The scraper itself runs from a Vercel cron that authenticates with x-api-key. One backend, two valid credentials, one source of truth for sessions.
The scraper
The ATS clients and poller live in apps/job-api/app/services/. The loop is small: list sources, fetch each board with the right fetcher, diff against stored postings, score the new ones, write them back.
Each ATS has its own fetcher, but every fetcher returns the same StandardJob dataclass. The poller dispatches off the provider column on job_sources:
FETCHERS: dict[str, Fetcher] = {
    "greenhouse": fetch_board_jobs,
    "lever": fetch_lever_jobs,
    "ashby": fetch_ashby_jobs,
    "workday": fetch_workday_jobs,
    "smartrecruiters": fetch_smartrecruiters_jobs,
    "jsonld": fetch_jsonld_jobs,
}
Adding a new ATS means adding a fetcher and a row to that dict. The poller, scoring, sanitization, and dashboard do not change.
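A minimal sketch of what that fetcher contract and the poller's dispatch-then-diff step could look like. Only the names StandardJob and FETCHERS come from the project; the dataclass fields, the Fetcher signature, the stub fetcher, and the new_postings helper are assumptions made for illustration.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardJob:
    # Field names are assumptions; the write-up only names the dataclass.
    external_id: str
    title: str
    location: str
    url: str
    description_html: str = ""


# Assumed contract: a fetcher takes a board token and returns normalized jobs.
Fetcher = Callable[[str], Awaitable[list[StandardJob]]]


async def fetch_stub_jobs(board_token: str) -> list[StandardJob]:
    """Hypothetical stand-in for a real ATS client; returns one canned posting."""
    return [
        StandardJob(
            external_id=f"{board_token}-1",
            title="Senior Frontend Engineer",
            location="Remote, US",
            url=f"https://example.com/{board_token}/1",
        )
    ]


FETCHERS: dict[str, Fetcher] = {"stub": fetch_stub_jobs}


async def fetch_source(provider: str, board_token: str) -> list[StandardJob]:
    """Pick the fetcher by provider name, the way the poller dispatches."""
    try:
        fetcher = FETCHERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None
    return await fetcher(board_token)


def new_postings(fetched: list[StandardJob], known_ids: set[str]) -> list[StandardJob]:
    """Diff step: keep only postings whose id is not already stored."""
    return [job for job in fetched if job.external_id not in known_ids]


jobs = asyncio.run(fetch_source("stub", "acme"))
fresh = new_postings(jobs, known_ids={"acme-0"})
```

Because every fetcher returns the same shape, the diff and everything downstream of it never need to know which ATS a posting came from.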
The JSON-LD fetcher is the fallback for career pages with no public API. It reads the page, parses every <script type="application/ld+json"> block with Python's stdlib html.parser, and normalizes the three shapes you find in the wild (single object, array, @graph) into StandardJob rows. Zero new dependencies for the scraping layer.
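A rough sketch of the stdlib-only approach, under stated assumptions: the class and function names here (JsonLdCollector, extract_job_postings) are hypothetical, the output is plain dicts rather than StandardJob rows, and the real fetcher also has to download the page first.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser
from typing import Any


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data: str) -> None:
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag: str) -> None:
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _flatten(node: Any) -> list[dict[str, Any]]:
    """Normalize single object, array, and @graph shapes into a flat list."""
    if isinstance(node, list):
        return [item for child in node for item in _flatten(child)]
    if isinstance(node, dict):
        if "@graph" in node:
            return _flatten(node["@graph"])
        return [node]
    return []


def extract_job_postings(html: str) -> list[dict[str, Any]]:
    collector = JsonLdCollector()
    collector.feed(html)
    postings: list[dict[str, Any]] = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the scan
        # Simplification: only handles a string @type, not a list of types.
        postings.extend(n for n in _flatten(data) if n.get("@type") == "JobPosting")
    return postings


sample = """
<html><head>
<script type="application/ld+json">{"@type": "JobPosting", "title": "A"}</script>
<script type="application/ld+json">[{"@type": "JobPosting", "title": "B"}, {"@type": "Organization"}]</script>
<script type="application/ld+json">{"@graph": [{"@type": "JobPosting", "title": "C"}]}</script>
</head></html>
"""
titles = [p["title"] for p in extract_job_postings(sample)]
```

The sample page carries one block per shape, so titles ends up with one entry from each, and the Organization node is dropped.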
The scoring engine is a weighted keyword config with five tiers: role titles, core technologies, domain skills, seniority signals, and negative keywords. A senior React/Next.js role scores high; a junior PHP contract lands in the negative zone and never surfaces. The weights live in version control, so recalibrating the filter is a PR, not a UI toggle.
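To make the tiers concrete, here is a minimal sketch of a weighted keyword scorer. The tier names follow the description above, but every keyword, every weight, and the score_posting function itself are made up for illustration; the real config is not reproduced here.

```python
from __future__ import annotations

import re

# Hypothetical keywords and weights, one dict per tier.
SCORING_CONFIG: dict[str, dict[str, int]] = {
    "role_titles": {"frontend engineer": 30, "full stack": 25},
    "core_technologies": {"react": 15, "next.js": 15, "typescript": 10},
    "domain_skills": {"accessibility": 5, "design systems": 5},
    "seniority": {"senior": 10, "staff": 10},
    "negative": {"junior": -40, "php": -30, "contract": -20},
}


def score_posting(title: str, description: str) -> int:
    """Sum the weight of every configured keyword found in the posting."""
    text = f"{title}\n{description}".lower()
    total = 0
    for tier in SCORING_CONFIG.values():
        for keyword, weight in tier.items():
            # Lookarounds instead of \b so keywords with punctuation
            # ("next.js") still match on whole-word boundaries.
            pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
            if re.search(pattern, text):
                total += weight
    return total


senior_score = score_posting("Senior Frontend Engineer", "React and Next.js with TypeScript")
junior_score = score_posting("Junior PHP Developer", "6-month contract")
```

With these invented weights the senior React/Next.js posting comes out strongly positive and the junior PHP contract strongly negative, which is the separation the dashboard's top-20 cut relies on.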
HTML descriptions get sanitized on write with bleach and a tag allowlist. The assumption that a third-party API returns safe HTML is the wrong one; stripping tags before the row lands in Postgres means every consumer (dashboard, email, backup export) inherits the safety without having to remember it.
Auto-detecting the provider
Adding a source used to mean three decisions: provider dropdown, board token, company name. For every new company, I would open the careers page, identify the ATS, copy the right slug, paste. The dashboard now does it in one input:
Company name or careers URL: stripe
↓
Detected: Greenhouse (stripe), 142 jobs
The detect_ats service first tries to parse a known ATS URL pattern (boards.greenhouse.io/*, jobs.lever.co/*, jobs.ashbyhq.com/*, *.myworkdayjobs.com/*, careers.smartrecruiters.com/*). If the input is a bare slug, it probes each provider's public API and keeps the first one that responds with a non-empty board. One input, five providers covered, a collapsible "Advanced" pane preserves manual entry for the edge cases.
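The URL half of that detection could be sketched as follows. The host patterns are the ones listed above; the function name detect_from_url, the choice of the first path segment as the slug, and the use of the first host label as the Workday tenant are assumptions, and the API-probing fallback for bare slugs is left out.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Hosts where the board slug is assumed to be the first path segment.
_PATH_SLUG_HOSTS = {
    "boards.greenhouse.io": "greenhouse",
    "jobs.lever.co": "lever",
    "jobs.ashbyhq.com": "ashby",
    "careers.smartrecruiters.com": "smartrecruiters",
}


def detect_from_url(value: str) -> tuple[str, str] | None:
    """Return (provider, slug) for a known ATS URL, or None otherwise."""
    candidate = value.strip()
    if "://" not in candidate:
        if "." not in candidate:
            return None  # bare slug or company name: caller falls back to probing
        candidate = f"https://{candidate}"
    parsed = urlparse(candidate)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    if host.endswith(".myworkdayjobs.com"):
        # Assumption: the tenant is the first label of the hostname.
        return ("workday", host.split(".")[0])
    provider = _PATH_SLUG_HOSTS.get(host)
    if provider and segments:
        return (provider, segments[0])
    return None


detected = detect_from_url("https://boards.greenhouse.io/stripe")
```

Returning None for anything unrecognized keeps the two strategies separate: URL parsing is free and deterministic, so it runs first, and only inputs it cannot place cost a round of network probes.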
Making the scan button fast
Fanning out across 49 sources exposed a chain of bottlenecks, none of which was visible with ten Greenhouse boards. A single scan started the day at 48 seconds and ended at 8, through three targeted fixes: concurrent polling with asyncio.gather, batched Supabase writes to collapse the N+1 round-trip pattern, and asyncio.to_thread on every .execute() so supabase-py's sync client stops blocking the event loop.
The first fix did nothing on its own, which is the most instructive part of the story. That walk-through lives in its own blog post: asyncio.gather is not enough for a sync client.
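The effect is easy to reproduce with nothing but the stdlib. In this sketch, blocking_query is a hypothetical stand-in for a synchronous client call, with time.sleep playing the network round trip; none of the names are from the project.

```python
import asyncio
import time


def blocking_query(source_id: int) -> int:
    """Stand-in for a synchronous client call such as a .execute()."""
    time.sleep(0.1)
    return source_id


async def poll_blocking(source_id: int) -> int:
    # Looks async, but the sync call holds the event loop for the full 0.1 s,
    # so gather ends up running the coroutines one after another.
    return blocking_query(source_id)


async def poll_threaded(source_id: int) -> int:
    # Runs the sync call in a worker thread so the loop can start the others.
    return await asyncio.to_thread(blocking_query, source_id)


async def timed(worker) -> tuple[list[int], float]:
    """Run five polls through asyncio.gather and report the wall time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(i) for i in range(5)))
    return list(results), time.perf_counter() - start


blocking_results, blocking_seconds = asyncio.run(timed(poll_blocking))
threaded_results, threaded_seconds = asyncio.run(timed(poll_threaded))
```

Both runs return the same results through the same gather call. The blocking version takes roughly the sum of the five sleeps, while the threaded one takes roughly a single sleep, because only there do the calls actually overlap.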
Auth across two runtimes
The scraper runs unattended on a cron schedule, so it needs a credential that does not expire. The dashboard runs in a browser, so it should not hold a long-lived API key.
The FastAPI service accepts either, and both paths are constant-time:
import jwt
from fastapi import Depends, HTTPException, Request, Security

# Excerpt: api_key_header, Settings, get_settings, _api_key_matches, and
# _extract_bearer_token are not shown here.
def verify_api_key_or_session(
    request: Request,
    key: str | None = Security(api_key_header),
    s: Settings = Depends(get_settings),
) -> str:
    # Cron path: static API key sent as x-api-key.
    if _api_key_matches(key, s.job_api_key):
        return "api-key"
    # Dashboard path: session JWT forwarded by the Next.js proxy as a bearer token.
    token = _extract_bearer_token(request)
    if token:
        try:
            payload = jwt.decode(token, s.admin_session_secret, algorithms=["HS256"])
        except jwt.PyJWTError:
            pass  # invalid or expired token falls through to the 401 below
        else:
            if payload.get("sub") == "tools-admin":
                return "session"
    raise HTTPException(status_code=401, detail="Unauthorized")
The Next.js app mints the JWT on /tools/login with jose, stores it as an httpOnly cookie, and verifies it in proxy.ts for every /tools/admin/* request. The Python service verifies the same HS256 signature with pyjwt. One shared secret, two runtimes, zero cross-origin quirks because the browser only ever hits same-origin Next.js proxy routes.
The poll endpoint deliberately stays API-key-only. A session cookie is not a thing cron has.
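For the API-key path, constant-time comparison comes down to hmac.compare_digest from the stdlib. The two helpers below are hypothetical counterparts to the ones referenced in the excerpt, not the project's code; this version takes a plain header mapping instead of a FastAPI Request so it runs on its own.

```python
from __future__ import annotations

import hmac
from collections.abc import Mapping


def api_key_matches(candidate: str | None, expected: str) -> bool:
    """Constant-time comparison; rejects missing keys and an unset secret."""
    if not candidate or not expected:
        return False
    return hmac.compare_digest(candidate.encode(), expected.encode())


def extract_bearer_token(headers: Mapping[str, str]) -> str | None:
    """Pull the token out of 'Authorization: Bearer <token>', if present."""
    scheme, _, token = headers.get("authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


ok = api_key_matches("s3cret", "s3cret")
token = extract_bearer_token({"authorization": "Bearer abc.def.ghi"})
```

Refusing to match when the expected secret is empty matters as much as the timing-safe compare: without that guard, a deployment with an unset environment variable would accept an empty key.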
Running Python inside Nx
The service lives in apps/job-api/ next to the Next.js and Playwright workspaces. Nx has no Python integration, which is fine because Nx only needs to dispatch commands:
{
  "name": "job-api",
  "targets": {
    "dev": {
      "executor": "nx:run-commands",
      "options": {
        "command": "uv run --package job-api uvicorn app.main:app --reload --port 8000",
        "cwd": "apps/job-api"
      }
    },
    "test": {
      "executor": "nx:run-commands",
      "options": {
        "command": "uv run --package job-api pytest -v",
        "cwd": "apps/job-api"
      }
    },
    "lint": {
      "executor": "nx:run-commands",
      "options": {
        "command": "uv run --package job-api ruff check .",
        "cwd": "apps/job-api"
      }
    },
    "mypy": {
      "executor": "nx:run-commands",
      "options": {
        "command": "uv run --package job-api mypy app/",
        "cwd": "apps/job-api"
      }
    }
  }
}
A single uv workspace at the repo root locks every Python dependency. A dedicated ci-python GitHub Actions job runs pnpm nx run-many -t lint test mypy -p job-api on every non-docs PR, gated by ci-status alongside the Node checks. Mypy runs in strict mode; ruff runs with an opinionated select list; pytest covers 129 tests across sanitize, scoring, dependencies, schemas, every ATS fetcher, the ATS detector, the poller, and all three routers.
(The uv-in-Nx plumbing got its own blog post: Running a uv Python workspace inside an Nx monorepo.)
The deploy
FastAPI runs on Railway. The Docker build uses the monorepo root as the build context so it can see the workspace lockfile:
COPY pyproject.toml uv.lock ./
COPY apps/job-api/pyproject.toml ./apps/job-api/pyproject.toml
RUN uv sync --frozen --no-dev --no-editable --package job-api
COPY apps/job-api/app ./apps/job-api/app
railway.toml binds the container to $PORT and points Railway's healthcheck at /health. A TrustedHostMiddleware wired off an ALLOWED_HOSTS env var refuses requests with forged Host headers. The only public endpoint is /health; everything else sits behind verify_api_key or verify_api_key_or_session.
The result
The dashboard shows a ranked table of everything the poller has found in the last 30 days, with a detail view that renders the sanitized description and a status column I can flip to applied, interviewing, or rejected. Each status change appends a row to job_status_log, so the history is immutable and exportable.
What this replaced: a folder of browser tabs and a Notion page I updated by hand. What it enabled: actually knowing, at any moment, which postings are new and which are worth the next round of applications. Every piece of the system is small; the value is in having them all talking to each other behind one login.
Takeaway
Full-stack projects across two languages are easier than they sound if each tool owns its job. uv owns Python dependency resolution; Nx owns task dispatch; Next.js owns the browser boundary; FastAPI owns the database. Pick a shared JWT secret, put constant-time comparisons on both sides, and the rest is plumbing.
