
Background Jobs

Status: Phase 1 shipped (idempotency + rate limiting). Job-queue migration is backlog — see "Proposed migration" below.

This doc maps the heavy work that currently runs inside the web process, either synchronously in the HTTP request path or fire-and-forget alongside it, and the plan to move that work to durable background jobs. It is the companion to the Phase 1 changes in this PR:

  • src/lib/idempotency.ts: request idempotency for the payout endpoints
  • src/lib/rate-limit.ts: Postgres-backed fixed-window rate limiter
  • migration 0033_request_idempotency.sql: request_idempotency + rate_limit_buckets tables

1. In-request heavy calls (current blocking points)

The app has no queue and no worker. node-cron is in package.json but nothing runs it. Every expensive call below either blocks the HTTP response or runs fire-and-forget inside the web process with no durability.

1a. AI intake (Anthropic)

Where | File | Line
Issue create (auto-trigger) | src/app/api/test-cycles/[id]/issues/route.ts | ~215 (enqueueIssueIntake)
Manual re-run endpoint | src/app/api/test-cycles/[id]/issues/[issueId]/intake/route.ts | ~70 (enqueueIssueIntake)
The actual Anthropic call | src/lib/ai/intake.ts | ~355 (client.messages.create)

enqueueIssueIntake is fire-and-forget within the same Node process (void runIssueIntake(issueId)). It does not block the HTTP response, but it is not durable: if the process restarts mid-call, the ai_runs row stays in the running state forever and the suggestion is silently lost. There is no retry.

Two more Anthropic agents follow the same fire-and-forget-in-process pattern and have the same durability gap:

  • Cycle close: src/lib/ai/cycle-close.ts:274, triggered from src/app/api/admin/test-cycles/[id]/route.ts:143 (enqueueCycleClose).
  • Run summary: src/lib/ai/run-summary.ts:155, triggered from src/app/api/test-cycles/[id]/runs/[runId]/route.ts:94 (enqueueRunSummary).

Phase 1 mitigation: runIssueIntake now consumes a per-user + per-org rate limit token before the Anthropic call (checkIntakeRateLimit). A burst of issue submissions or a retry loop can no longer run up an unbounded Anthropic bill. Cycle-close and run-summary are not yet rate-limited — they are admin-triggered and lower-frequency; revisit if abuse appears.
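
The token check described above can be sketched as a fixed-window counter. This is a minimal in-memory illustration, not the code in src/lib/rate-limit.ts: the real limiter is Postgres-backed, and the names consumeToken, checkIntake, and the bucket key format are hypothetical. The limits 30 and 200 are the documented per-hour defaults.

```typescript
// In-memory stand-in for the rate_limit_buckets table (hypothetical shape).
interface Bucket {
  windowStart: number; // epoch ms at which the fixed window began
  count: number;       // tokens consumed inside that window
}

interface LimitResult {
  allowed: boolean;
  remaining: number;
  resetAt: number; // epoch ms at which the current window ends
}

const WINDOW_MS = 60 * 60 * 1000; // one-hour windows, matching the per-hour limits

const buckets: Record<string, Bucket> = {};

// Consume one token for `key` if the current window still has room.
function consumeToken(key: string, limit: number, now: number): LimitResult {
  const windowStart = Math.floor(now / WINDOW_MS) * WINDOW_MS;
  let bucket = buckets[key];
  if (!bucket || bucket.windowStart !== windowStart) {
    // First hit in a new window: start counting from zero again.
    bucket = { windowStart: windowStart, count: 0 };
    buckets[key] = bucket;
  }
  const resetAt = windowStart + WINDOW_MS;
  if (bucket.count >= limit) {
    return { allowed: false, remaining: 0, resetAt: resetAt };
  }
  bucket.count += 1;
  return { allowed: true, remaining: limit - bucket.count, resetAt: resetAt };
}

// Both the per-user and the per-org limit must pass before the Anthropic call.
// Note: in this sketch a user token is still spent when the org limit rejects.
function checkIntake(userId: string, orgId: string, now: number): boolean {
  const user = consumeToken("ai-intake:user:" + userId, 30, now);
  if (!user.allowed) return false;
  const org = consumeToken("ai-intake:org:" + orgId, 200, now);
  return org.allowed;
}
```

A fixed window is simple to store (one row per key per window) but allows a burst of up to twice the limit across a window boundary; that trade-off is usually acceptable for cost control.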

1b. Linear export (Linear GraphQL API)

Where | File | Line
Export endpoint | src/app/api/admin/linear/export/route.ts | POST handler
createLinearIssue | src/lib/integrations/linear.ts | ~337
createLinearAttachment | src/lib/integrations/linear.ts | (attachment push)

This one fully blocks the request. In separate mode it loops over every selected issue and awaits createLinearIssue (plus an attachment call each) sequentially. Exporting N issues = at least N serial round-trips to Linear's API before the HTTP response returns. At 10x scale a bulk export of hundreds of issues will time out the request and can trip Linear's own rate limits.

Phase 1 mitigation: the export endpoint now enforces a per-user and per-org rate limit (linearExportUser / linearExportOrg). It does not yet move the work off the request path — that is backlog.

1c. Linear sync (Linear GraphQL API)

Where | File | Line
Sync endpoint | src/app/api/admin/linear/sync/route.ts | ~114 (fetchLinearStatesByIdentifier)

Manual, admin-triggered, pull-based. One batched GraphQL call, so less severe than export — but still synchronous and still grows with the number of exported issues. Not rate-limited in Phase 1 (low frequency, admin-only).

1d. Batch payouts

Where | File
Mark-paid batch | src/app/api/admin/payouts/batch-pay/route.ts
Period run | src/app/api/admin/payouts/run-batch/route.ts

These are DB-only (no external API) so latency is bounded — the risk was not latency but correctness: a flaky network retry could double-pay.

Phase 1 mitigation (shipped in this PR):

  • Both endpoints now require an Idempotency-Key header. A duplicate key within 24h replays the original cached response instead of re-running the payout. See src/lib/idempotency.ts.
  • The ledger insert + status flip now run inside one DB transaction — no more partial writes. run-batch additionally wraps the payout_batches insert in the same transaction, eliminating orphaned batch rows.
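
The replay behaviour in the first bullet can be sketched as follows. This is an in-memory illustration under stated assumptions, not the contents of src/lib/idempotency.ts: withIdempotency, CachedResponse, and the 400 response for a missing header are hypothetical, and the real store is the request_idempotency table.

```typescript
interface CachedResponse {
  status: number;
  body: string;
}

interface IdempotencyRow {
  response: CachedResponse;
  expiresAt: number; // epoch ms after which the key may be reused
}

const TTL_MS = 24 * 60 * 60 * 1000; // 24h replay window, as described above

// In-memory stand-in for the request_idempotency table.
const rows: Record<string, IdempotencyRow> = {};

function withIdempotency(
  key: string | undefined,
  now: number,
  handler: () => CachedResponse
): CachedResponse {
  if (!key) {
    // Assumption: a missing header is rejected with 400.
    return { status: 400, body: "Idempotency-Key header is required" };
  }
  const existing = rows[key];
  if (existing && existing.expiresAt > now) {
    // Duplicate key inside the window: replay, do not re-run the payout.
    return existing.response;
  }
  const response = handler();
  rows[key] = { response: response, expiresAt: now + TTL_MS };
  return response;
}
```

A database-backed version would also need a unique constraint on the key (or an equivalent lock) so that two concurrent requests with the same key cannot both run the handler; this single-threaded sketch does not model that race.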

2. Phase 1 — what shipped in this PR

Concern | Solution | No new infra?
Double-pay on retry | request_idempotency table + Idempotency-Key header on both payout endpoints | ✅ Postgres only
Partial payout writes | Both endpoints wrap writes in db.transaction |
Unbounded Anthropic spend | Postgres fixed-window rate limiter on AI intake (per-user + per-org) |
Linear API hammering | Same limiter on the Linear export endpoint |

Rate-limit defaults (all env-overridable, see src/lib/rate-limit.ts):

Limiter | Env var | Default
AI intake / user / hr | RL_AI_INTAKE_USER_PER_HOUR | 30
AI intake / org / hr | RL_AI_INTAKE_ORG_PER_HOUR | 200
Linear export / user / hr | RL_LINEAR_EXPORT_USER_PER_HOUR | 60
Linear export / org / hr | RL_LINEAR_EXPORT_ORG_PER_HOUR | 300

A rate-limit hit returns HTTP 429 with Retry-After and X-RateLimit-* headers. The AI intake auto-trigger (fire-and-forget) instead records a failed ai_runs row with failure_reason: rate_limited so the UI can show "rate limited — try later" and the operator can re-run.
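
A sketch of how an env-overridable limit and the 429 response might fit together. The helper names readLimit and tooManyRequests are hypothetical, and the exact X-RateLimit-* header names are an assumption (the doc only states the prefix); Retry-After uses the delay-in-seconds form.

```typescript
interface RateLimitHit {
  limit: number;     // configured limit for the window
  remaining: number; // tokens left (0 when rejected)
  resetAt: number;   // epoch ms at which the window ends
}

// Read a positive integer limit from an env-like map, falling back to the default.
function readLimit(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number
): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  const valid = isFinite(parsed) && parsed > 0 && Math.floor(parsed) === parsed;
  return valid ? parsed : fallback;
}

// Build the status and headers for a rejected request.
function tooManyRequests(
  hit: RateLimitHit,
  now: number
): { status: number; headers: Record<string, string> } {
  // Round up so clients never retry before the window has actually reset.
  const retryAfterSec = Math.max(1, Math.ceil((hit.resetAt - now) / 1000));
  return {
    status: 429,
    headers: {
      "Retry-After": String(retryAfterSec),
      // Assumed header names; only the X-RateLimit-* prefix is documented.
      "X-RateLimit-Limit": String(hit.limit),
      "X-RateLimit-Remaining": String(hit.remaining),
      "X-RateLimit-Reset": String(Math.ceil(hit.resetAt / 1000)),
    },
  };
}
```

In a Node handler the first argument to readLimit would typically be process.env, for example readLimit(process.env, "RL_AI_INTAKE_USER_PER_HOUR", 30).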

Cleanup

Expired request_idempotency rows and stale rate_limit_buckets rows are safe to prune. pruneExpiredIdempotencyKeys() and pruneStaleRateLimitBuckets() exist but are not scheduled. Wire them into the job system below, or into a node-cron task, once one exists. Until then the tables grow slowly; neither is on a hot read path and both have indexes on their time columns.


3. Proposed migration to a job system (BACKLOG — needs a human decision)

The fire-and-forget-in-process pattern (void runX()) is not durable: a deploy or crash drops in-flight work with no retry. To survive 10x we should move all external-API work to durable background jobs.

This needs a decision from a human because Inngest, the hosted option under consideration, requires a new account (free tier is generous; self-hosting is possible but heavier). The alternative that needs no account is a Postgres-backed job table polled by a node-cron worker; pg-boss is the off-the-shelf version of this (Postgres only, no SaaS). If "no new account" is a hard requirement, pick pg-boss.

Target architecture

HTTP request ──> enqueue job (durable) ──> return 202 immediately
                        │
                        ▼
                 worker picks up job ──> Anthropic / Linear call
                        │                       │
                        ├── success ────────────┘
                        └── failure ──> retry w/ backoff ──> dead-letter
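
The retry and dead-letter path in the diagram can be sketched with a simulated clock. This is an in-memory illustration only, not an Inngest or pg-boss integration: the Job shape, enqueue, tick, MAX_ATTEMPTS, and BASE_DELAY_MS are all hypothetical, and a real system would persist jobs durably.

```typescript
type JobStatus = "queued" | "succeeded" | "dead";

interface Job {
  id: number;
  name: string;
  attempts: number;
  runAt: number; // epoch ms; the job is not eligible before this time
  status: JobStatus;
  lastError?: string;
}

const MAX_ATTEMPTS = 3;     // hypothetical retry budget
const BASE_DELAY_MS = 1000; // hypothetical base for exponential backoff

// Stand-in for a durable job table.
const jobs: Job[] = [];

// Called from the HTTP handler: record the job, then respond 202.
function enqueue(name: string, now: number): Job {
  const job: Job = { id: jobs.length + 1, name: name, attempts: 0, runAt: now, status: "queued" };
  jobs.push(job);
  return job;
}

// One worker pass: run every job that is queued and due at `now`.
function tick(now: number, handler: (job: Job) => void): void {
  jobs.forEach(function (job) {
    if (job.status !== "queued" || job.runAt > now) return;
    job.attempts += 1;
    try {
      handler(job); // the Anthropic or Linear call would happen here
      job.status = "succeeded";
    } catch (err) {
      job.lastError = err instanceof Error ? err.message : String(err);
      if (job.attempts >= MAX_ATTEMPTS) {
        job.status = "dead"; // dead-letter: keep the row for inspection
      } else {
        // Delay doubles per failure: 1s after the 1st, 2s after the 2nd, ...
        job.runAt = now + BASE_DELAY_MS * Math.pow(2, job.attempts - 1);
      }
    }
  });
}
```

Because the job record outlives the process that created it, a restart between attempts loses nothing: the next worker pass simply picks up whatever is still queued and due.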

Jobs to migrate (in priority order):

  1. AI intake (runIssueIntake) — highest volume, currently silently lost on restart.
  2. Linear export — currently fully blocks the request; move to a job and return a job id the UI polls.
  3. Cycle close / run summary — same durability fix as intake.

Also backlog

  • Linear webhook + outbox table: replace the manual pull-based sync (/api/admin/linear/sync) with push. A Linear webhook updates issues.external_status, and an integration_outbox table holds outbound state changes with retry/backoff so a Linear outage doesn't lose updates.
  • Notification fan-out via outbox: createNotificationsBulk is currently fire-and-forget; route it through the same outbox for durability + retries.
  • Per-org Anthropic monthly cost budget + circuit breaker: the rate limiter caps frequency; it does not cap total monthly spend. Add a per-org cents budget (sum ai_runs.cost_usd_cents for the month) and a circuit breaker that disables intake for an org once the budget is hit.
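
The budget check in the last bullet could look roughly like this. It is a sketch over an in-memory array rather than a query against ai_runs; the AiRun shape, the function names, and the choice of UTC calendar months are assumptions.

```typescript
// Hypothetical projection of the ai_runs columns the check needs.
interface AiRun {
  orgId: string;
  costUsdCents: number;
  createdAt: number; // epoch ms
}

// Start of the calendar month containing `now`, in UTC (assumed billing boundary).
function monthStartUtc(now: number): number {
  const d = new Date(now);
  return Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), 1);
}

// Sum of cost_usd_cents for one org from the start of the month up to `now`.
function monthSpendCents(runs: AiRun[], orgId: string, now: number): number {
  const from = monthStartUtc(now);
  let total = 0;
  runs.forEach(function (run) {
    if (run.orgId === orgId && run.createdAt >= from && run.createdAt <= now) {
      total += run.costUsdCents;
    }
  });
  return total;
}

// Circuit breaker: intake is disabled once spend reaches the budget.
function intakeAllowed(runs: AiRun[], orgId: string, budgetCents: number, now: number): boolean {
  return monthSpendCents(runs, orgId, now) < budgetCents;
}
```

Against Postgres the same check would be a single SUM over the month's rows for the org, evaluated before each intake run alongside the rate-limit check.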
