Background Jobs

Status: Phase 1 shipped (idempotency + rate limiting). Job-queue migration is backlog — see "Proposed migration" below.

This doc maps the heavy work that currently runs synchronously inside the HTTP request path, and the plan to move it off the request path. It is the companion to the Phase 1 changes in this PR:

src/lib/idempotency.ts — request idempotency for the payout endpoints
src/lib/rate-limit.ts — Postgres-backed fixed-window rate limiter
migration 0033_request_idempotency.sql — request_idempotency + rate_limit_buckets tables

1. In-request heavy calls (current blocking points)

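The app has no queue and no worker. node-cron is in package.json but nothing runs it. Each expensive call below either runs synchronously inside the HTTP request or, for the AI agents in 1a, runs fire-and-forget in the same Node process with no durability.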

1a. AI intake (Anthropic)

Where	File	Line
Issue create — auto-trigger	`src/app/api/test-cycles/[id]/issues/route.ts`	~215 (`enqueueIssueIntake`)
Manual re-run endpoint	`src/app/api/test-cycles/[id]/issues/[issueId]/intake/route.ts`	~70 (`enqueueIssueIntake`)
The actual Anthropic call	`src/lib/ai/intake.ts`	~355 (`client.messages.create`)

enqueueIssueIntake is fire-and-forget within the same Node process (void runIssueIntake(issueId)). It does not block the HTTP response, but it is not durable: if the process restarts mid-call, the ai_runs row stays marked running forever and the suggestion is silently lost. There is no retry.

Two more Anthropic agents follow the same fire-and-forget-in-process pattern and have the same durability gap:

Cycle close — src/lib/ai/cycle-close.ts:274, triggered from src/app/api/admin/test-cycles/[id]/route.ts:143 (enqueueCycleClose).
Run summary — src/lib/ai/run-summary.ts:155, triggered from src/app/api/test-cycles/[id]/runs/[runId]/route.ts:94 (enqueueRunSummary).

Phase 1 mitigation: runIssueIntake now consumes a per-user + per-org rate limit token before the Anthropic call (checkIntakeRateLimit). A burst of issue submissions or a retry loop can no longer run up an unbounded Anthropic bill. Cycle-close and run-summary are not yet rate-limited — they are admin-triggered and lower-frequency; revisit if abuse appears.
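
As a rough illustration of the fixed-window idea behind the per-user + per-org check, here is a minimal sketch. It is not the code in src/lib/rate-limit.ts: an in-memory Map stands in for the rate_limit_buckets table, and createLimiter, mayCallAnthropic, and the key format are hypothetical names chosen for this example. The limits 30 and 200 are the documented defaults.

```typescript
// Hypothetical sketch: a Map stands in for the rate_limit_buckets table.
type Bucket = { windowStart: number; count: number };
type Decision = { allowed: boolean; remaining: number; retryAfterSec: number };

const WINDOW_MS = 60 * 60 * 1000; // one-hour fixed window

function createLimiter(limit: number) {
  const buckets = new Map<string, Bucket>();

  // Consume one token for `key` at time `nowMs` (epoch milliseconds).
  return function consume(key: string, nowMs: number): Decision {
    // Every timestamp in the same hour maps to the same window start.
    const windowStart = Math.floor(nowMs / WINDOW_MS) * WINDOW_MS;
    let bucket = buckets.get(key);
    if (!bucket || bucket.windowStart !== windowStart) {
      bucket = { windowStart, count: 0 }; // a new window resets the counter
      buckets.set(key, bucket);
    }
    const retryAfterSec = Math.ceil((windowStart + WINDOW_MS - nowMs) / 1000);
    if (bucket.count >= limit) {
      return { allowed: false, remaining: 0, retryAfterSec };
    }
    bucket.count += 1;
    return { allowed: true, remaining: limit - bucket.count, retryAfterSec };
  };
}

const userLimiter = createLimiter(30); // RL_AI_INTAKE_USER_PER_HOUR default
const orgLimiter = createLimiter(200); // RL_AI_INTAKE_ORG_PER_HOUR default

// Both limits must pass before the Anthropic call is made.
// Note: in this sketch a user token is spent even if the org limit then denies.
function mayCallAnthropic(userId: string, orgId: string, nowMs: number): boolean {
  if (!userLimiter(`user:${userId}`, nowMs).allowed) return false;
  return orgLimiter(`org:${orgId}`, nowMs).allowed;
}
```

A Postgres-backed version would presumably do the same increment-and-compare in a single atomic statement so that concurrent requests cannot both slip under the limit; that detail is outside this sketch.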

1b. Linear export (Linear GraphQL API)

Where	File	Line
Export endpoint	`src/app/api/admin/linear/export/route.ts`	`POST` handler
`createLinearIssue`	`src/lib/integrations/linear.ts`	~337
`createLinearAttachment`	`src/lib/integrations/linear.ts`	(attachment push)

This one fully blocks the request. In separate mode it loops over every selected issue and awaits createLinearIssue (plus an attachment call each) sequentially. Exporting N issues therefore costs at least N serial round-trips to Linear's API, and up to 2N once the attachment calls are counted, before the HTTP response returns. At 10x scale a bulk export of hundreds of issues will time out the request and can trip Linear's own rate limits.

Phase 1 mitigation: the export endpoint now enforces a per-user and per-org rate limit (linearExportUser / linearExportOrg). It does not yet move the work off the request path — that is backlog.

1c. Linear sync (Linear GraphQL API)

Where	File	Line
Sync endpoint	`src/app/api/admin/linear/sync/route.ts`	~114 (`fetchLinearStatesByIdentifier`)

Manual, admin-triggered, pull-based. One batched GraphQL call, so less severe than export — but still synchronous and still grows with the number of exported issues. Not rate-limited in Phase 1 (low frequency, admin-only).

1d. Batch payouts

Where	File
Mark-paid batch	`src/app/api/admin/payouts/batch-pay/route.ts`
Period run	`src/app/api/admin/payouts/run-batch/route.ts`

These are DB-only (no external API), so latency is bounded. The risk was correctness, not latency: a client retrying after a flaky network failure could double-pay.

Phase 1 mitigation (shipped in this PR):

Both endpoints now require an Idempotency-Key header. A duplicate key within 24h replays the original cached response instead of re-running the payout. See src/lib/idempotency.ts.
The ledger insert + status flip now run inside one DB transaction — no more partial writes. run-batch additionally wraps the payout_batches insert in the same transaction, eliminating orphaned batch rows.
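
The replay behaviour can be sketched as follows. This is an illustration under stated assumptions, not the code in src/lib/idempotency.ts: a Map stands in for the request_idempotency table, withIdempotency and the response shape are hypothetical, and the 400 for a missing header is an assumed status code.

```typescript
// Hypothetical sketch: a Map stands in for the request_idempotency table.
type CachedResponse = { status: number; body: string };
type IdempotencyRow = { response: CachedResponse; expiresAt: number };

const IDEMPOTENCY_TTL_MS = 24 * 60 * 60 * 1000; // keys replay for 24h
const idempotencyRows = new Map<string, IdempotencyRow>();

function withIdempotency(
  key: string | null,
  nowMs: number,
  runPayout: () => CachedResponse,
): CachedResponse {
  // The header is required; the exact status code here is an assumption.
  if (!key) return { status: 400, body: "Idempotency-Key header is required" };

  const hit = idempotencyRows.get(key);
  if (hit && hit.expiresAt > nowMs) {
    return hit.response; // duplicate within 24h: replay, do not re-run
  }

  const response = runPayout(); // first sighting (or expired key): do the work
  idempotencyRows.set(key, { response, expiresAt: nowMs + IDEMPOTENCY_TTL_MS });
  return response;
}
```

The sketch ignores two things a real implementation has to handle: two requests with the same key arriving concurrently (typically resolved with a unique constraint on the key), and scoping the key so that different users or endpoints cannot collide.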

2. Phase 1 — what shipped in this PR

Concern	Solution	No new infra?
Double-pay on retry	`request_idempotency` table + `Idempotency-Key` header on both payout endpoints	✅ Postgres only
Partial payout writes	Both endpoints wrap writes in `db.transaction`	✅
Unbounded Anthropic spend	Postgres fixed-window rate limiter on AI intake (per-user + per-org)	✅
Linear API hammering	Same limiter on the Linear export endpoint	✅

Rate-limit defaults (all env-overridable, see src/lib/rate-limit.ts):

Limiter	Env var	Default
AI intake / user / hr	`RL_AI_INTAKE_USER_PER_HOUR`	30
AI intake / org / hr	`RL_AI_INTAKE_ORG_PER_HOUR`	200
Linear export / user / hr	`RL_LINEAR_EXPORT_USER_PER_HOUR`	60
Linear export / org / hr	`RL_LINEAR_EXPORT_ORG_PER_HOUR`	300

A rate-limit hit returns HTTP 429 with Retry-After and X-RateLimit-* headers. The AI intake auto-trigger (fire-and-forget) instead records a failed ai_runs row with failure_reason: rate_limited so the UI can show "rate limited — try later" and the operator can re-run.
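
For orientation, a 429 response along these lines could be assembled as in the sketch below. The specific X-RateLimit-* header names (Limit, Remaining, Reset) follow a common convention and are an assumption, as are the function and type names; check src/lib/rate-limit.ts for what the endpoints actually emit.

```typescript
// Hypothetical sketch of building the 429 status + headers for a limiter hit.
type RateLimitHit = { limit: number; resetAtMs: number };
type ResponseInitLike = { status: number; headers: Record<string, string> };

function rateLimitedResponseInit(hit: RateLimitHit, nowMs: number): ResponseInitLike {
  // Retry-After is expressed in whole seconds; never advertise less than 1s.
  const retryAfterSec = Math.max(1, Math.ceil((hit.resetAtMs - nowMs) / 1000));
  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSec),
      "X-RateLimit-Limit": String(hit.limit),
      "X-RateLimit-Remaining": "0",
      // Reset as epoch seconds (assumed convention).
      "X-RateLimit-Reset": String(Math.ceil(hit.resetAtMs / 1000)),
    },
  };
}
```

In a route handler the result could be passed straight to `new Response(body, init)`.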

Cleanup

request_idempotency and rate_limit_buckets rows expire and are safe to prune. pruneExpiredIdempotencyKeys() and pruneStaleRateLimitBuckets() exist but are not scheduled. Wire them into the job system below, or a node-cron task, once one exists. Until then the tables grow slowly; neither is on a hot read path and both have indexes on their time columns.

3. Proposed migration to a job system (BACKLOG — needs a human decision)

The fire-and-forget-in-process pattern (void runX()) is not durable: a deploy or crash drops in-flight work with no retry. To survive 10x scale we should move all external-API work to durable background jobs.

Recommended: Inngest

This needs a decision from a human because Inngest requires a new account (free tier is generous; self-hosting is possible but heavier). The alternative that needs no account is a Postgres-backed job table polled by a node-cron worker; pg-boss is the off-the-shelf version of this (Postgres only, no SaaS). If "no new account" is a hard requirement, pick pg-boss.

Target architecture

HTTP request ──> enqueue job (durable) ──> return 202 immediately
                        │
                        ▼
                 worker picks up job ──> Anthropic / Linear call
                        │                       │
                        ├── success ────────────┘
                        └── failure ──> retry w/ backoff ──> dead-letter
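
The retry and dead-letter branch of the diagram can be sketched as a small state transition. This is independent of Inngest or pg-boss, which each ship their own retry handling; the Job shape, the five-attempt cap, and the one-second exponential base are all assumptions made for illustration.

```typescript
// Hypothetical sketch of "failure -> retry w/ backoff -> dead-letter".
type Job = {
  id: string;
  attempts: number;
  state: "queued" | "done" | "dead";
  runAt: number; // epoch ms at which a worker may pick the job up
};

const MAX_ATTEMPTS = 5; // assumed cap before dead-lettering
const BASE_DELAY_MS = 1000; // assumed backoff base

// Exponential backoff: 1s after the 1st failure, 2s after the 2nd, 4s, ...
function backoffMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// Record the outcome of one worker attempt and return the job's next state.
function settle(job: Job, ok: boolean, nowMs: number): Job {
  const attempts = job.attempts + 1;
  if (ok) return { ...job, attempts, state: "done" };
  if (attempts >= MAX_ATTEMPTS) return { ...job, attempts, state: "dead" };
  return { ...job, attempts, state: "queued", runAt: nowMs + backoffMs(attempts) };
}
```

Because the job row is persisted before the HTTP response returns 202, a deploy or crash between attempts loses nothing: the next worker simply picks up any queued job whose runAt has passed.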

Jobs to migrate (in priority order):

AI intake (runIssueIntake) — highest volume, currently silently lost on restart.
Linear export — currently fully blocks the request; move to a job and return a job id the UI polls.
Cycle close / run summary — same durability fix as intake.

Also backlog

Linear webhook + outbox table — replace the manual pull-based sync (/api/admin/linear/sync) with push: Linear webhook updates issues.external_status, and an integration_outbox table holds outbound state changes with retry/backoff so a Linear outage doesn't lose updates.
Notification fan-out via outbox — createNotificationsBulk is currently fire-and-forget; route it through the same outbox for durability + retries.
Per-org Anthropic monthly cost budget + circuit breaker — the rate limiter caps frequency; it does not cap total monthly spend. Add a per-org cents budget (sum ai_runs.cost_usd_cents for the month) and a circuit breaker that disables intake for an org once the budget is hit.
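
The budget check in the last item could look roughly like the sketch below. It assumes ai_runs rows expose an org id, a cost in cents, and a creation timestamp; the AiRun field names, the UTC month boundary, and both function names are hypothetical. In production the sum would more likely be a single SQL aggregate than an in-memory filter.

```typescript
// Hypothetical sketch: an array stands in for rows of the ai_runs table.
type AiRun = { orgId: string; costUsdCents: number; createdAt: Date };

// Sum an org's spend from the start of the current UTC month up to `now`.
function monthToDateCents(runs: AiRun[], orgId: string, now: Date): number {
  const monthStart = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), 1);
  return runs
    .filter((r) => r.orgId === orgId)
    .filter((r) => r.createdAt.getTime() >= monthStart && r.createdAt.getTime() <= now.getTime())
    .reduce((sum, r) => sum + r.costUsdCents, 0);
}

// Circuit breaker: intake stays enabled only while spend is under budget.
function intakeAllowed(runs: AiRun[], orgId: string, budgetCents: number, now: Date): boolean {
  return monthToDateCents(runs, orgId, now) < budgetCents;
}
```

This complements the rate limiter rather than replacing it: the limiter bounds calls per hour, while the breaker bounds cumulative cost per month.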
