How we index every public Workday tenant in under 4 minutes
A look at the queue architecture, the per-tenant adaptive crawl rate, and why we deleted our headless browser fleet.
Eng team
Engineering
Workday is the biggest single source on JobsPipe. Many Fortune 500 companies run a public company.myworkdayjobs.com tenant, and between them they post tens of thousands of new roles every month. The hard part isn’t finding the tenants, since that list is public. It’s keeping every tenant’s job list fresh without hammering Workday with a thundering herd of polls.
This post walks through the architecture we settled on after three full rewrites: a per-tenant queue with adaptive crawl rates, no headless browsers, and a freshness budget of under four minutes from a posting going live to it landing in our database.
The headless browser problem
Our first version was the obvious one: spin up a Playwright pool, render each tenant’s job page, scrape the DOM. It worked. It also cost us $4,200/month in EC2 spot instances and broke about twice a week, whenever Workday shipped a UI change to an individual tenant.
The breakthrough was realizing that every Workday tenant exposes its jobs as a structured JSON feed at a predictable URL. We don’t need to render anything. We just need to hit the endpoint and parse the response. We deleted the entire browser fleet in a single PR.
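To make "hit the endpoint and parse the response" concrete, here is a minimal sketch of the parsing half. The field names (`jobs`, `id`, `title`, `location`) and the `Posting` type are assumptions made up for illustration. They are not Workday's actual schema, and the feed URL is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    job_id: str
    title: str
    location: str


def parse_feed(payload: str) -> list[Posting]:
    """Turn one JSON feed response into Posting records.

    The "jobs", "id", "title" and "location" keys are placeholders
    for illustration, not the real feed schema.
    """
    data = json.loads(payload)
    postings = []
    for item in data.get("jobs", []):
        # Skip malformed entries rather than failing the whole poll.
        if "id" not in item or "title" not in item:
            continue
        postings.append(
            Posting(
                job_id=str(item["id"]),
                title=item["title"],
                location=item.get("location", ""),
            )
        )
    return postings
```

Whatever the real schema looks like, the point stands: parsing JSON is a pure function over a string, so it can be unit-tested without a browser or a network connection.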
Per-tenant adaptive crawl rate
Different tenants post at different cadences. Stripe posts maybe ten jobs a week. SAP posts hundreds a day. Polling every tenant on the same schedule wastes requests on quiet tenants and picks up changes too slowly on fast-moving ones.
Each tenant has a queue worker with its own back-off curve. When a poll finds no changes, the next-poll interval doubles (up to a maximum of 30 minutes). When a poll finds changes, the interval halves (down to a minimum of 90 seconds). It borrows the exponential back-off idea from TCP’s timers and applies it to crawl scheduling.
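The doubling and halving rule fits in a few lines. This is a sketch of the rule as described, not our production worker. The function and constant names are made up for illustration, and only the 90-second floor and 30-minute ceiling come from the text above.

```python
MIN_INTERVAL_S = 90        # floor: 90 seconds
MAX_INTERVAL_S = 30 * 60   # ceiling: 30 minutes


def next_interval(current_s: float, found_changes: bool) -> float:
    """Return the delay before the next poll of a tenant.

    Halve the interval after a poll that found changes, double it
    after a poll that found nothing, and clamp to the configured
    floor and ceiling.
    """
    if found_changes:
        return max(MIN_INTERVAL_S, current_s / 2)
    return min(MAX_INTERVAL_S, current_s * 2)


# A tenant at the floor that goes quiet reaches the ceiling in five polls.
interval = MIN_INTERVAL_S
schedule = []
for _ in range(5):
    interval = next_interval(interval, found_changes=False)
    schedule.append(interval)
print(schedule)  # [180, 360, 720, 1440, 1800]
```

The clamp matters in both directions: without the ceiling a dormant tenant would effectively never be polled again, and without the floor a busy tenant would converge on continuous polling.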
The 4-minute SLA
Putting it together: when a Workday tenant publishes a new role, the tenant’s queue worker hits it within ~90 seconds, assuming the tenant is already polling at the minimum interval. The diff detector inserts a new record into Postgres. A logical replication stream picks up the insert and pushes it into our webhook fanout. End-to-end median: 3 minutes 41 seconds, inside the four-minute budget. p99: 7 minutes.
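The diff step can be sketched as a comparison between two snapshots of a tenant’s feed. This is an illustrative assumption about how such a detector might work, not a description of ours: here each snapshot is a hypothetical mapping from job ID to a content hash, and `diff_snapshots` and `FeedDiff` are made-up names.

```python
from typing import NamedTuple


class FeedDiff(NamedTuple):
    added: list[str]     # job IDs present only in the new snapshot
    removed: list[str]   # job IDs present only in the old snapshot
    changed: list[str]   # job IDs in both, with a different content hash


def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> FeedDiff:
    """Compare two {job_id: content_hash} snapshots of one tenant."""
    prev_ids = previous.keys()
    curr_ids = current.keys()
    return FeedDiff(
        added=sorted(curr_ids - prev_ids),
        removed=sorted(prev_ids - curr_ids),
        changed=sorted(
            job_id
            for job_id in curr_ids & prev_ids
            if current[job_id] != previous[job_id]
        ),
    )


old = {"a": "h1", "b": "h2", "c": "h3"}
new = {"b": "h2", "c": "h9", "d": "h4"}
print(diff_snapshots(old, new))
# FeedDiff(added=['d'], removed=['a'], changed=['c'])
```

In a setup like this, only the `added` IDs would become new rows. An empty diff is also the "no changes" signal that the adaptive timer needs.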
All without a single browser. Operationally calm. Cost dropped 92%. And nobody on call has been paged about Workday in five months.