Background Jobs
Status: Phase 1 shipped (idempotency + rate limiting). Job-queue migration is backlog — see "Proposed migration" below.
This doc maps the heavy work that currently runs synchronously inside the HTTP request path, and the plan to move it off the request path. It is the companion to the Phase 1 changes in this PR:
- src/lib/idempotency.ts: request idempotency for the payout endpoints
- src/lib/rate-limit.ts: Postgres-backed fixed-window rate limiter
- migration 0033_request_idempotency.sql: request_idempotency + rate_limit_buckets tables
1. In-request heavy calls (current blocking points)
The app has no queue and no worker. node-cron is in package.json but
nothing runs it. Every expensive call below runs inside the web process: it
either blocks the HTTP response or runs fire-and-forget with no durability.
1a. AI intake (Anthropic)
| Where | File | Line |
|---|---|---|
| Issue create — auto-trigger | src/app/api/test-cycles/[id]/issues/route.ts | ~215 (enqueueIssueIntake) |
| Manual re-run endpoint | src/app/api/test-cycles/[id]/issues/[issueId]/intake/route.ts | ~70 (enqueueIssueIntake) |
| The actual Anthropic call | src/lib/ai/intake.ts | ~355 (client.messages.create) |
enqueueIssueIntake is fire-and-forget within the same Node process
(void runIssueIntake(issueId)). It does not block the HTTP response, but it
is not durable: if the process restarts mid-call, the ai_runs row stays in the
running state forever and the suggestion is silently lost. There is no retry.
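The durability gap can be seen in a minimal in-memory sketch of the pattern. The Map standing in for the ai_runs table and the callModel stub are assumptions made for illustration; only the void runIssueIntake(issueId) shape comes from the codebase.

```typescript
// Hypothetical in-memory stand-in for the ai_runs table.
type RunStatus = "running" | "succeeded" | "failed";
const aiRuns = new Map<string, RunStatus>();

// Stub for the model call; the real code calls client.messages.create.
async function callModel(issueId: string): Promise<string> {
  return `suggestion for ${issueId}`;
}

async function runIssueIntake(issueId: string): Promise<void> {
  // Executes synchronously, before the first await yields control.
  aiRuns.set(issueId, "running");
  try {
    await callModel(issueId);
    aiRuns.set(issueId, "succeeded");
  } catch {
    aiRuns.set(issueId, "failed");
  }
}

// Fire-and-forget: the promise is discarded, so the caller returns at once.
function enqueueIssueIntake(issueId: string): void {
  void runIssueIntake(issueId);
}

enqueueIssueIntake("issue-1");
// The row is now "running", and only this process can ever move it forward.
const statusRightAfterEnqueue = aiRuns.get("issue-1");
```

If the process exits before the awaited call settles, nothing else holds a reference to the work, so the row never leaves the running state.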
Two more Anthropic agents follow the same fire-and-forget-in-process pattern and have the same durability gap:
- Cycle close: src/lib/ai/cycle-close.ts:274, triggered from src/app/api/admin/test-cycles/[id]/route.ts:143 (enqueueCycleClose).
- Run summary: src/lib/ai/run-summary.ts:155, triggered from src/app/api/test-cycles/[id]/runs/[runId]/route.ts:94 (enqueueRunSummary).
Phase 1 mitigation: runIssueIntake now consumes a per-user + per-org rate
limit token before the Anthropic call (checkIntakeRateLimit). A burst of
issue submissions or a retry loop can no longer run up an unbounded Anthropic
bill. Cycle-close and run-summary are not yet rate-limited — they are
admin-triggered and lower-frequency; revisit if abuse appears.
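A fixed-window limiter of the kind described above can be sketched in a few lines. This version keeps buckets in an in-memory Map rather than Postgres, and consumeToken, checkIntakeLimitSketch, and the key format are hypothetical names, not the contents of src/lib/rate-limit.ts.

```typescript
interface Bucket { windowStart: number; count: number; }
interface LimitResult { allowed: boolean; remaining: number; retryAfterSeconds: number; }

// Hypothetical in-memory bucket store; the shipped limiter uses Postgres.
const buckets = new Map<string, Bucket>();

function consumeToken(key: string, limit: number, windowMs: number, nowMs: number): LimitResult {
  // Every timestamp in the same window maps to the same window start.
  const windowStart = Math.floor(nowMs / windowMs) * windowMs;
  let bucket = buckets.get(key);
  if (!bucket || bucket.windowStart !== windowStart) {
    bucket = { windowStart, count: 0 }; // new window, counter resets
    buckets.set(key, bucket);
  }
  if (bucket.count >= limit) {
    const retryAfterSeconds = Math.ceil((windowStart + windowMs - nowMs) / 1000);
    return { allowed: false, remaining: 0, retryAfterSeconds };
  }
  bucket.count += 1;
  return { allowed: true, remaining: limit - bucket.count, retryAfterSeconds: 0 };
}

// Both the per-user and the per-org bucket must have room.
// Note: a user token is still spent when the org bucket is the one that is full.
function checkIntakeLimitSketch(userId: string, orgId: string, nowMs: number): boolean {
  const HOUR_MS = 60 * 60 * 1000;
  const user = consumeToken(`ai-intake:user:${userId}`, 30, HOUR_MS, nowMs);
  if (!user.allowed) return false;
  return consumeToken(`ai-intake:org:${orgId}`, 200, HOUR_MS, nowMs).allowed;
}
```

A Postgres-backed version would need the read-and-increment to be a single atomic statement so that concurrent requests cannot both pass the check.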
1b. Linear export (Linear GraphQL API)
| Where | File | Line |
|---|---|---|
| Export endpoint | src/app/api/admin/linear/export/route.ts | POST handler |
| createLinearIssue | src/lib/integrations/linear.ts | ~337 |
| createLinearAttachment | src/lib/integrations/linear.ts | (attachment push) |
This one fully blocks the request. In separate mode it loops over every
selected issue and awaits createLinearIssue (plus an attachment call each)
sequentially. Exporting N issues = at least N serial round-trips to Linear's
API before the HTTP response returns. At 10x scale a bulk export of hundreds of
issues will time out the request and can trip Linear's own rate limits.
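To make the scaling concern concrete, the lower bound on request time is simple arithmetic. The issue count and per-call latency below are assumed figures for illustration, not measurements from this system.

```typescript
// Lower bound on request time when every Linear call is awaited in sequence.
function serialExportLowerBoundMs(
  issueCount: number,
  callsPerIssue: number,
  latencyMs: number,
): number {
  return issueCount * callsPerIssue * latencyMs;
}

// Assumed inputs: 300 issues, 2 calls each (create + attachment),
// 250 ms per round-trip.
const estimateMs = serialExportLowerBoundMs(300, 2, 250); // 150000 ms, i.e. 2.5 minutes
```

Under these assumptions the request would run well past typical HTTP gateway timeouts before any retry or rate-limit delay is counted.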
Phase 1 mitigation: the export endpoint now enforces a per-user and
per-org rate limit (linearExportUser / linearExportOrg). It does not
yet move the work off the request path — that is backlog.
1c. Linear sync (Linear GraphQL API)
| Where | File | Line |
|---|---|---|
| Sync endpoint | src/app/api/admin/linear/sync/route.ts | ~114 (fetchLinearStatesByIdentifier) |
Manual, admin-triggered, pull-based. One batched GraphQL call, so less severe than export — but still synchronous and still grows with the number of exported issues. Not rate-limited in Phase 1 (low frequency, admin-only).
1d. Batch payouts
| Where | File |
|---|---|
| Mark-paid batch | src/app/api/admin/payouts/batch-pay/route.ts |
| Period run | src/app/api/admin/payouts/run-batch/route.ts |
These are DB-only (no external API) so latency is bounded — the risk was not latency but correctness: a flaky network retry could double-pay.
Phase 1 mitigation (shipped in this PR):
- Both endpoints now require an Idempotency-Key header. A duplicate key within 24h replays the original cached response instead of re-running the payout. See src/lib/idempotency.ts.
- The ledger insert + status flip now run inside one DB transaction, so there are no more partial writes.
- run-batch additionally wraps the payout_batches insert in the same transaction, eliminating orphaned batch rows.
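The replay behaviour can be sketched as follows. This is an in-memory illustration under stated assumptions: withIdempotency, the Map store, and the 400 response for a missing header are hypothetical, and the real logic lives in src/lib/idempotency.ts backed by the request_idempotency table.

```typescript
interface HandlerResult { status: number; body: string; }
interface CachedResponse extends HandlerResult { expiresAt: number; }

const TTL_MS = 24 * 60 * 60 * 1000; // 24h replay window
// Hypothetical stand-in for the request_idempotency table.
const idempotencyStore = new Map<string, CachedResponse>();

function withIdempotency(
  key: string | undefined,
  nowMs: number,
  handler: () => HandlerResult,
): HandlerResult & { replayed: boolean } {
  if (!key) {
    // Assumed behaviour: reject requests that omit the header.
    return { status: 400, body: "Idempotency-Key header is required", replayed: false };
  }
  const hit = idempotencyStore.get(key);
  if (hit && hit.expiresAt > nowMs) {
    // Duplicate within the window: replay, do not re-run the payout.
    return { status: hit.status, body: hit.body, replayed: true };
  }
  const result = handler(); // runs at most once per live key
  idempotencyStore.set(key, { ...result, expiresAt: nowMs + TTL_MS });
  return { ...result, replayed: false };
}
```

A database-backed version also has to handle two requests arriving with the same key at the same moment, typically with a unique constraint on the key column.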
2. Phase 1 — what shipped in this PR
| Concern | Solution | No new infra? |
|---|---|---|
| Double-pay on retry | request_idempotency table + Idempotency-Key header on both payout endpoints | ✅ Postgres only |
| Partial payout writes | Both endpoints wrap writes in db.transaction | ✅ |
| Unbounded Anthropic spend | Postgres fixed-window rate limiter on AI intake (per-user + per-org) | ✅ |
| Linear API hammering | Same limiter on the Linear export endpoint | ✅ |
Rate-limit defaults (all env-overridable, see src/lib/rate-limit.ts):
| Limiter | Env var | Default |
|---|---|---|
| AI intake / user / hr | RL_AI_INTAKE_USER_PER_HOUR | 30 |
| AI intake / org / hr | RL_AI_INTAKE_ORG_PER_HOUR | 200 |
| Linear export / user / hr | RL_LINEAR_EXPORT_USER_PER_HOUR | 60 |
| Linear export / org / hr | RL_LINEAR_EXPORT_ORG_PER_HOUR | 300 |
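One way the env-overridable defaults in the table above could be read is sketched below. readLimit and its fallback rule (ignore anything that is not a positive integer) are assumptions; the actual parsing in src/lib/rate-limit.ts may differ.

```typescript
type Env = Record<string, string | undefined>;

// Defaults copied from the table above.
const RATE_LIMIT_DEFAULTS = {
  RL_AI_INTAKE_USER_PER_HOUR: 30,
  RL_AI_INTAKE_ORG_PER_HOUR: 200,
  RL_LINEAR_EXPORT_USER_PER_HOUR: 60,
  RL_LINEAR_EXPORT_ORG_PER_HOUR: 300,
} as const;

type LimitName = keyof typeof RATE_LIMIT_DEFAULTS;

// Returns the override when it is a positive integer, otherwise the default.
function readLimit(env: Env, name: LimitName): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : RATE_LIMIT_DEFAULTS[name];
}
```

In the app this would be called as readLimit(process.env, "RL_AI_INTAKE_USER_PER_HOUR"); taking the environment as a parameter keeps the function easy to test.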
A rate-limit hit returns HTTP 429 with Retry-After and X-RateLimit-*
headers. The AI intake auto-trigger (fire-and-forget) instead records a failed
ai_runs row with failure_reason: rate_limited so the UI can show
"rate limited — try later" and the operator can re-run.
Cleanup
request_idempotency and rate_limit_buckets rows expire and can be pruned.
pruneExpiredIdempotencyKeys() and pruneStaleRateLimitBuckets() exist but are
not scheduled; wire them into the job system below, or a node-cron task, once
one exists. Until then the tables grow slowly; neither is on a hot read path
and both have indexes on their time columns.
3. Proposed migration to a job system (BACKLOG — needs a human decision)
The fire-and-forget-in-process pattern (void runX()) is not durable: a deploy
or crash drops in-flight work with no retry. To survive 10x we should move all
external-API work to durable background jobs.
Recommended: Inngest
This needs a decision from a human because Inngest requires a new account
(free tier is generous; self-hosting is possible but heavier). An alternative
that needs no account is a Postgres-backed job table polled by a node-cron
worker; pg-boss is the off-the-shelf version of this (Postgres only, no
SaaS). If "no new account" is a hard requirement, pick pg-boss.
Target architecture
HTTP request ──> enqueue job (durable) ──> return 202 immediately
                       │
                       ▼
                worker picks up job ──> Anthropic / Linear call
                       │                       │
                       ├── success ────────────┘
                       └── failure ──> retry w/ backoff ──> dead-letter

Jobs to migrate (in priority order):
- AI intake (runIssueIntake): highest volume, currently silently lost on restart.
- Linear export: currently fully blocks the request; move to a job and return a job id the UI polls.
- Cycle close / run summary: same durability fix as intake.
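The retry and dead-letter branch of the target architecture can be sketched independently of whichever job system is chosen. Everything here is illustrative: the Job shape, MAX_ATTEMPTS, BASE_DELAY_MS, and workOnce are assumptions, and the handler is synchronous only to keep the sketch short. Inngest and pg-boss each provide their own retry configuration that would replace this.

```typescript
interface Job { id: string; attempts: number; runAt: number; }

const MAX_ATTEMPTS = 3;     // assumed
const BASE_DELAY_MS = 1000; // assumed

// Exponential backoff: 1s, 2s, 4s, ... for attempt 1, 2, 3, ...
function backoffMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

type Outcome = "done" | "retry" | "dead-letter";

// One worker step: run the handler, then finish, reschedule, or give up.
function workOnce(job: Job, handler: (job: Job) => void, nowMs: number): Outcome {
  job.attempts += 1;
  try {
    handler(job);
    return "done";
  } catch {
    if (job.attempts >= MAX_ATTEMPTS) return "dead-letter";
    job.runAt = nowMs + backoffMs(job.attempts); // not eligible again until runAt
    return "retry";
  }
}
```

Because attempts and runAt live on the job record rather than in process memory, a restart between attempts loses nothing once the record is stored durably.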
Also backlog
- Linear webhook + outbox table: replace the manual pull-based sync (/api/admin/linear/sync) with push. A Linear webhook updates issues.external_status, and an integration_outbox table holds outbound state changes with retry/backoff so a Linear outage doesn't lose updates.
- Notification fan-out via outbox: createNotificationsBulk is currently fire-and-forget; route it through the same outbox for durability + retries.
- Per-org Anthropic monthly cost budget + circuit breaker: the rate limiter caps frequency; it does not cap total monthly spend. Add a per-org cents budget (sum ai_runs.cost_usd_cents for the month) and a circuit breaker that disables intake for an org once the budget is hit.
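The budget check in the last item could look roughly like this. The AiRun shape, the function names, and the choice of UTC calendar months are assumptions; in the app the sum would be a SQL aggregate over ai_runs.cost_usd_cents rather than an in-memory filter.

```typescript
interface AiRun { orgId: string; costUsdCents: number; createdAt: Date; }

// Sums an org's spend for the UTC calendar month that contains `now`.
function monthSpendCents(runs: AiRun[], orgId: string, now: Date): number {
  return runs
    .filter((r) =>
      r.orgId === orgId &&
      r.createdAt.getUTCFullYear() === now.getUTCFullYear() &&
      r.createdAt.getUTCMonth() === now.getUTCMonth())
    .reduce((sum, r) => sum + r.costUsdCents, 0);
}

// The breaker is open (intake disabled) once spend reaches the budget.
function isIntakeBlocked(runs: AiRun[], orgId: string, budgetCents: number, now: Date): boolean {
  return monthSpendCents(runs, orgId, now) >= budgetCents;
}
```

Keeping amounts in integer cents avoids floating-point drift when many small costs are summed.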