Durable — Documentation

Quickstart

from zero to a running engine + dashboard in one command

npx @dicabrio/durable
# → dashboard + API on http://localhost:3030

This starts an embedded Postgres (first run downloads the binary; data persists in ~/.durable/pgdata), applies all migrations, and serves the dashboard and API on one port. No Docker.

Connect an app (TypeScript)

import { createFunction, serve, DurableClient } from "@durable/sdk";

const hello = createFunction({
  id: "hello",
  trigger: { event: "demo.hello" },
  handler: async ({ event, step }) => {
    return step.run("greet", () => `hi ${event.data.name}`);
  },
});

// 1. create a workspace in the dashboard (or POST /api/apps) → id + signing key
const signingKey = process.env.APP_KEY;
const client = new DurableClient({
  baseUrl: "http://localhost:3030",
  appId: process.env.APP_ID,
  signingKey,
  appUrl: "http://localhost:4000/api/durable",
});

// 2. expose the callback + register your functions
// (`app` is your own HTTP server, e.g. an Express app listening on port 4000)
app.post("/api/durable", serve([hello], { signingKey }));
await client.sync([hello]);

// 3. fire events
await client.send({ name: "demo.hello", data: { name: "world" } });

Also available: npm run dev in the repo starts the full stack (Postgres in Docker, Adminer, service, a worker, and three demo apps) via process-compose.

Concepts

the five nouns, and the replay model that makes them durable

Term	Meaning
event	A named fact (`user.created`) with a JSON payload, sent into one app's workspace.
function	Your handler plus its trigger and options, registered via `sync`.
run	One execution of a function for one event.
step	A named unit inside a run (`step.run("send-email", …)`) whose result is stored once it succeeds, so replays never execute it again.
workspace	An app × environment pair — fully isolated data, functions and signing key.

The replay model

The engine never runs your code. It POSTs the run's state — the triggering event plus all memoized step results — to your app. The SDK calls your handler from the top: completed steps return their stored results instantly (no side effects), and the first new step executes for real. Its result is persisted and the cycle repeats, one step per round-trip, until the function returns.

Because each invocation starts from the top, the code between steps must be deterministic: put every side effect and every non-deterministic value (DB write, API call, randomness, Date.now()) inside a step.run.
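
The round-trip described above can be modeled in a few lines. This is a minimal sketch, not the SDK's implementation: StepInterrupt, makeStep, invoke and runToCompletion are hypothetical names, and everything is kept synchronous for brevity (real handlers are async).

```typescript
// Minimal model of the replay loop. All names here are hypothetical.
type Memo = Map<string, unknown>;

// Thrown to unwind the handler once a new step has executed.
class StepInterrupt {
  id: string;
  data: unknown;
  constructor(id: string, data: unknown) {
    this.id = id;
    this.data = data;
  }
}

function makeStep(memo: Memo) {
  return {
    run<T>(id: string, fn: () => T): T {
      if (memo.has(id)) return memo.get(id) as T; // completed: return the stored result
      throw new StepInterrupt(id, fn());          // first new step: execute, then stop
    },
  };
}

type Handler = (ctx: { step: ReturnType<typeof makeStep> }) => unknown;
type Reply = { op: "step"; id: string; data: unknown } | { op: "done"; data: unknown };

// One invocation: the handler runs from the top against the memoized results.
function invoke(handler: Handler, memo: Memo): Reply {
  try {
    return { op: "done", data: handler({ step: makeStep(memo) }) };
  } catch (err) {
    if (err instanceof StepInterrupt) return { op: "step", id: err.id, data: err.data };
    throw err;
  }
}

// Plays the engine's role: persist each new result, then invoke again.
function runToCompletion(handler: Handler): { data: unknown; roundTrips: number } {
  const memo: Memo = new Map();
  for (let roundTrips = 1; ; roundTrips++) {
    const reply = invoke(handler, memo);
    if (reply.op === "done") return { data: reply.data, roundTrips };
    memo.set(reply.id, reply.data);
  }
}
```

A handler with two steps takes three round-trips in this model: the handler body is entered three times, but each step function executes only once. Any code placed outside step.run would run on every one of those entries, which is why it has to be deterministic.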

Run state, two layers

Layer	Values	Question it answers
status	active · completed · failed · cancelled	Is the run finished?
activity	executing · queued · waiting · sleeping · scheduled	What is an active run doing right now?

A run waiting seven days for an approval is active · waiting — alive, but consuming no compute and no worker slot.
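
One way to picture the two layers is as a discriminated union in which activity exists only while status is active. The type and function names below are hypothetical, and the rule that only executing runs hold a worker slot is an assumption drawn from the sentence above, not a documented guarantee.

```typescript
type Status = "active" | "completed" | "failed" | "cancelled";
type Activity = "executing" | "queued" | "waiting" | "sleeping" | "scheduled";

// Activity is only meaningful for an active run (assumption based on the table).
type RunState =
  | { status: "active"; activity: Activity }
  | { status: Exclude<Status, "active"> };

// Layer 1: is the run finished?
function isFinished(state: RunState): boolean {
  return state.status !== "active";
}

// Layer 2: is an active run occupying a worker right now?
// Assumption: only "executing" runs hold a slot.
function holdsWorkerSlot(state: RunState): boolean {
  return state.status === "active" && state.activity === "executing";
}
```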

Steps API

three primitives cover almost every workflow

step.run(id, fn)

Execute a side effect once; the result is memoized and returned on every later replay. A throw triggers a retry with exponential backoff; when the attempts are exhausted (DURABLE_MAX_ATTEMPTS, default 3), the run fails.

const invoice = await step.run("create-invoice", () => billing.create(order));

step.sleep(id, duration)

Durable pause — "90s", "12h", "30d" or milliseconds. No process waits; a timer wakes the run. Survives restarts and deploys.
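
To illustrate how a duration might turn into a durable timer, the sketch below converts the string forms into milliseconds and then into an absolute wake-up time, which is what the sleep op in the wire protocol carries as until. parseDuration and wakeAt are hypothetical helpers, and the accepted units (s, m, h, d) are inferred from the examples in this document rather than from the SDK.

```typescript
// Hypothetical helper, not the SDK's parser.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

// Accepts "90s", "12h", "30d" or a plain millisecond count.
function parseDuration(input: string | number): number {
  if (typeof input === "number") return input;
  const match = /^(\d+)(s|m|h|d)$/.exec(input.trim());
  if (!match) throw new Error(`invalid duration: ${input}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// An absolute timestamp survives restarts; a relative countdown would not.
function wakeAt(duration: string | number, now: Date = new Date()): string {
  return new Date(now.getTime() + parseDuration(duration)).toISOString();
}
```

In a handler the call itself would look like await step.sleep("cooldown", "3s").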

step.waitForEvent(id, { event, match?, timeout })

Park the run until a matching event arrives or the timeout elapses. match is a subset check against the incoming event.data. The call resolves with the event, or with null on timeout, which gives you human-in-the-loop approval in a few lines:

const approval = await step.waitForEvent("approve", {
  event: "approval.received",
  match: { orderId: event.data.orderId },
  timeout: "7d",
});
if (!approval) return { rejected: "timeout" };
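
The subset check behind match could look like the sketch below. isSubset is a hypothetical function, and recursing into nested objects is an assumption; the engine may compare only top-level keys.

```typescript
// True when every key in `pattern` is present in `data` with an equal value.
// Extra keys in `data` are ignored. Nested objects are compared recursively
// (an assumption, not confirmed behavior).
function isSubset(pattern: unknown, data: unknown): boolean {
  if (pattern === null || typeof pattern !== "object") return pattern === data;
  if (data === null || typeof data !== "object") return false;
  return Object.entries(pattern as Record<string, unknown>).every(([key, value]) =>
    isSubset(value, (data as Record<string, unknown>)[key]),
  );
}
```

With the approval example, a match of { orderId: "o1" } accepts an incoming event.data of { orderId: "o1", approvedBy: "ann" } and rejects { orderId: "o2" }.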

Triggers & cron

event-driven or on a schedule

trigger: { event: "order.paid" }     // runs per matching event
trigger: { cron: "0 3 * * *" }       // daily at 03:00
trigger: { cron: "*/20 * * * * *" }  // 6-field: every 20 seconds

Cron runs receive a synthetic $cron event. Schedules never double-fire (row locks) and never storm after downtime — the next occurrence is always computed strictly in the future.
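
The "strictly in the future" rule is what prevents a catch-up storm. The sketch below shows the idea for a fixed interval rather than a real cron expression, which is a deliberate simplification; nextOccurrence is a hypothetical function.

```typescript
// Returns the first tick strictly after `nowMs` for a schedule that fires at
// anchorMs, anchorMs + intervalMs, anchorMs + 2 * intervalMs, ...
// Ticks missed during downtime are skipped rather than replayed.
function nextOccurrence(anchorMs: number, intervalMs: number, nowMs: number): number {
  if (nowMs < anchorMs) return anchorMs;
  const elapsedTicks = Math.floor((nowMs - anchorMs) / intervalMs);
  return anchorMs + (elapsedTicks + 1) * intervalMs;
}
```

With a 20-second interval anchored at 0 and a service that comes back at 65 seconds, the next run is scheduled for 80 seconds; the ticks at 20, 40 and 60 are never fired retroactively.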

Flow control

six per-function policies, all with an optional per-key scope

Option	Effect on a burst	Use for
concurrency	max N executing at once; excess queues	protecting APIs & resources
priority	higher starts sooner under contention	VIP tenants, critical work
throttle	starts spread over time; nothing dropped	external rate limits
rateLimit	excess runs dropped	abuse, duplicate webhooks
debounce	burst collapses to one run with the last event, after quiet	rapid saves → one reindex
batch	events grouped; one run gets the whole list	bulk writes, metric ingestion

createFunction({
  id: "sync-crm",
  trigger: { event: "contact.changed" },
  concurrency: { limit: 2, key: "tenantId" },  // per tenant
  priority: 10,
  throttle: { limit: 1, period: "3s" },
  rateLimit: { limit: 100, period: "1m" },
  debounce: { period: "5s", key: "contactId" },
  // batch: { maxSize: 25, timeout: "10s" }  → event.data = the list
  handler: async ({ event, step }) => { /* … */ },
});

debounce vs batch vs throttle: debounce keeps only the last event and resets its timer on every arrival; batch keeps all events and its window is fixed by the first; throttle runs everything, just spaced out.
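
The difference is easiest to see by feeding the same arrival times through all three. The functions below are illustrative models written for this comparison, not the engine's scheduler; in particular, spacing throttle starts by period / limit is an assumption.

```typescript
// Arrival times in ms, sorted ascending.

// debounce: every arrival resets the timer; only the last event of a burst runs.
function debounce(arrivals: number[], periodMs: number): { firesAt: number; event: number }[] {
  const runs: { firesAt: number; event: number }[] = [];
  arrivals.forEach((t, i) => {
    const next = arrivals[i + 1];
    // The burst ends when no further event arrives within the quiet period.
    if (next === undefined || next - t >= periodMs) {
      runs.push({ firesAt: t + periodMs, event: i });
    }
  });
  return runs;
}

// batch: the first event opens a fixed window; one run gets every event in it.
function batch(arrivals: number[], maxSize: number, timeoutMs: number): number[][] {
  const groups: number[][] = [];
  let current: number[] = [];
  let windowEnd = 0;
  arrivals.forEach((t, i) => {
    if (current.length > 0 && (t >= windowEnd || current.length >= maxSize)) {
      groups.push(current);
      current = [];
    }
    if (current.length === 0) windowEnd = t + timeoutMs; // fixed by the first event
    current.push(i);
  });
  if (current.length > 0) groups.push(current);
  return groups;
}

// throttle: nothing is dropped; starts are spaced at least period / limit apart.
function throttle(arrivals: number[], limit: number, periodMs: number): number[] {
  const gap = periodMs / limit;
  let lastStart = -Infinity;
  return arrivals.map((t) => {
    lastStart = Math.max(t, lastStart + gap);
    return lastStart;
  });
}
```

For arrivals at 0, 1, 2 and 10 seconds: a 5-second debounce produces two runs (with the third and fourth events), a 10-second batch produces two runs (three events, then one), and a throttle of one start per 3 seconds produces four runs starting at 0, 3, 6 and 10 seconds.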

Apps & environments

isolation is the default, environments are explicit

An app identity is (name, environment). Every combination is a fully isolated workspace: its own app_id, its own signing key, its own events, functions and runs. An event fired into billing · acc can never trigger a function in billing · prod.

The environment defaults to dev — you never set it locally. For acceptance and production you opt in explicitly (DURABLE_ENV=acc|prod, or pick it when creating the workspace). The dashboard shows loud color-coded badges: dev grey, acc amber, prod red.

# idempotent per (name, environment) — returns the same app + key every time
curl -X POST http://localhost:3030/api/apps -d '{"name":"billing","environment":"prod"}'

Auth: every app→service call carries x-durable-app plus an HMAC-SHA256 signature over the raw body; service→app callbacks are signed with the same per-workspace key.
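
For anyone checking signatures by hand, the scheme can be reproduced with Node's built-in crypto module. This is a sketch: sign and verify are hypothetical helper names, and the hex encoding follows the x-durable-signature format given in the wire protocol section.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// hex(hmac_sha256(rawBody, key)) over the exact bytes sent on the wire.
function sign(rawBody: string, key: string): string {
  return createHmac("sha256", key).update(rawBody).digest("hex");
}

// Constant-time comparison avoids leaking how much of the signature matched.
function verify(rawBody: string, key: string, signature: string): boolean {
  const expected = Buffer.from(sign(rawBody, key), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Sign and verify the raw body string, not a re-serialized object: JSON.stringify(JSON.parse(body)) can reorder whitespace and break the signature.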

Dashboard

realtime, app-centric, safe in production

Realtime everywhere — Postgres NOTIFY → SSE push; every screen updates the moment data changes.
Trace drawer — click a run: a waterfall per step, each bar split grey (durable queue/sleep time) vs green (your server's execution time); click a step for its input/output; expand for the exact split.
Run actions — Rerun (from scratch), Rerun from step (steps before it are reused, the chosen step re-executes), Cancel.
Metrics — throughput, failure rate, durable-delay vs app-time, per-function p95, and backlog depth over time (1h / 24h / 7d).
Prod guards — in a prod workspace, Fire/Run/Cancel/Rerun arm on first click and execute only on a confirming second click.
Production auth — set DURABLE_ADMIN_TOKEN and the dashboard, tRPC and SSE surface require sign-in (httpOnly session cookie or a Bearer token). Unset = open, for local dev.

PHP SDK

same replay model, dependency-free, PHP ≥ 8.1

use Durable\{Client, DurableFunction, Serve, Step};

$fn = new DurableFunction(
  id: 'onboarding',
  trigger: ['event' => 'user.created'],
  handler: function (array $event, Step $step) {
    $user = $step->run('load-user', fn () => loadUser($event['data']['id']));
    $step->sleep('cooldown', '3s');
    return $step->run('send-email', fn () => sendMail($user));
  },
);

// callback endpoint (vanilla PHP, Laravel, Symfony — anything):
Serve::handle([$fn], $signingKey);

// register + fire:
$client = new Client($baseUrl, $appId, $signingKey, $appUrl);
$client->sync([$fn]);
$client->send('user.created', ['id' => 'u1']);

All function options (concurrency, priority, throttle, rateLimit, debounce, batch) are supported with human-readable periods ('3s', '7d'). See sdk-php/example/ for a runnable app on PHP's built-in server.

Operations

scaling, shutdown, configuration

Scaling workers

Queue capacity scales with the number of worker processes. The SKIP LOCKED queue (with an exact, advisory-locked concurrency gate) makes concurrent instances safe, so you can run as many worker-only processes as you need:

npm run worker        # pure capacity: no HTTP, no scheduler

Graceful drain

On SIGINT/SIGTERM the service stops pulling new jobs, finishes what's in flight (bounded by DURABLE_DRAIN_TIMEOUT_MS, default 15s), then exits. On timeout, abandoned jobs recover via lease expiry — nothing is lost either way. A second signal forces exit. App callbacks are bounded to the job lease, so a hung app can't wedge a drain.
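
The drain sequence can be approximated as a race between the in-flight jobs and a timer. This is a sketch of the pattern under stated assumptions, not the service's code: drain and installSignalHandlers are hypothetical names, and getInFlight stands in for however the worker tracks its current jobs.

```typescript
// Waits for in-flight jobs to settle, bounded by timeoutMs.
async function drain(
  inFlight: Promise<unknown>[],
  timeoutMs: number,
): Promise<"drained" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  const finished = Promise.allSettled(inFlight).then(() => "drained" as const);
  try {
    return await Promise.race([finished, timeout]);
  } finally {
    clearTimeout(timer); // do not keep the process alive after draining
  }
}

// First signal starts the drain; a second signal forces exit.
function installSignalHandlers(
  getInFlight: () => Promise<unknown>[],
  timeoutMs = 15_000,
): void {
  let draining = false;
  const onSignal = () => {
    if (draining) process.exit(1);
    draining = true;
    void drain(getInFlight(), timeoutMs).then(() => process.exit(0));
  };
  process.on("SIGINT", onSignal);
  process.on("SIGTERM", onSignal);
}
```

Exiting on "timed-out" is safe only because abandoned jobs are recovered through lease expiry, as described above.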

Environment variables

Variable	Default	Purpose
PORT	3030	service + dashboard port
DATABASE_URL	—	Postgres connection (unused with embedded PG)
DURABLE_PG_PORT / DURABLE_PG_DIR	5434 / ~/.durable/pgdata	embedded Postgres
DURABLE_WORKERS	2	worker loops per process
DURABLE_LEASE_MS	30000	job lease + app-call timeout
DURABLE_MAX_ATTEMPTS	3	step retries before a run fails
DURABLE_DRAIN_TIMEOUT_MS	15000	graceful-drain bound
DURABLE_ADMIN_TOKEN	unset	set → dashboard/tRPC/SSE require sign-in
DURABLE_ENV	dev	app environment on self-provisioning
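
As an illustration of how these variables and defaults fit together, here is a small loader. The DurableConfig shape and the loadConfig and intFrom names are hypothetical; only the variable names and default values come from the table.

```typescript
interface DurableConfig {
  port: number;
  databaseUrl: string | undefined;
  pgPort: number;
  pgDir: string;
  workers: number;
  leaseMs: number;
  maxAttempts: number;
  drainTimeoutMs: number;
  adminToken: string | undefined;
  env: string;
}

type Env = Record<string, string | undefined>;

// Reads an integer variable, falling back to the documented default.
function intFrom(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value)) throw new Error(`${name} must be an integer, got "${raw}"`);
  return value;
}

function loadConfig(env: Env): DurableConfig {
  return {
    port: intFrom(env, "PORT", 3030),
    databaseUrl: env.DATABASE_URL, // unset means embedded Postgres
    pgPort: intFrom(env, "DURABLE_PG_PORT", 5434),
    pgDir: env.DURABLE_PG_DIR ?? "~/.durable/pgdata",
    workers: intFrom(env, "DURABLE_WORKERS", 2),
    leaseMs: intFrom(env, "DURABLE_LEASE_MS", 30000),
    maxAttempts: intFrom(env, "DURABLE_MAX_ATTEMPTS", 3),
    drainTimeoutMs: intFrom(env, "DURABLE_DRAIN_TIMEOUT_MS", 15000),
    adminToken: env.DURABLE_ADMIN_TOKEN, // unset means the dashboard is open
    env: env.DURABLE_ENV ?? "dev",
  };
}
```

Calling loadConfig(process.env) at startup yields the defaults for anything left unset.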

Wire protocol

small enough to port an SDK in an afternoon

One HTTP round-trip advances a run by at most one new step. All bodies are JSON, and every request and response is signed with x-durable-signature: hex(hmac_sha256(rawBody, key)). App→service calls also send x-durable-app: <appId>.

Service → app (invoke)

POST {appUrl}
{ "runId": "…", "functionId": "onboarding",
  "event": { "id": "…", "name": "user.created", "data": { … } },
  "steps": { "load-user": { "type": "run", "data": { … } } } }

App → service (the next operation reached)

{ "op": "step",  "id": "send-email", "data": … }
{ "op": "sleep", "id": "cooldown",  "until": "2026-07-05T09:00:00.000Z" }
{ "op": "wait",  "id": "approve", "event": "approval.received",
  "match": { "orderId": "o1" } | null, "until": "…" }
{ "op": "done",  "data": … }
{ "op": "error", "id": "send-email" | null, "message": "…", "retryable": true }
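
When porting an SDK, it can help to pin the five replies down as types first. The sketch below mirrors the JSON shapes shown above; the Op type, parseOp and describeOp are hypothetical names, and the descriptions of what the service does with each reply paraphrase this document rather than quote the engine.

```typescript
type Op =
  | { op: "step"; id: string; data: unknown }
  | { op: "sleep"; id: string; until: string }
  | { op: "wait"; id: string; event: string; match: Record<string, unknown> | null; until: string }
  | { op: "done"; data: unknown }
  | { op: "error"; id: string | null; message: string; retryable: boolean };

const OPS = ["step", "sleep", "wait", "done", "error"] as const;

// Parses a reply body and rejects anything that is not one of the five ops.
// Only the `op` field is validated here; a real port would check every field.
function parseOp(raw: string): Op {
  const value: unknown = JSON.parse(raw);
  const op =
    typeof value === "object" && value !== null ? (value as { op?: unknown }).op : undefined;
  if (typeof op !== "string" || !(OPS as readonly string[]).includes(op)) {
    throw new Error(`unknown op: ${String(op)}`);
  }
  return value as Op;
}

// Exhaustive switch: adding a sixth op would be a compile error here.
function describeOp(reply: Op): string {
  switch (reply.op) {
    case "step":
      return `persist result of "${reply.id}", then invoke again`;
    case "sleep":
      return `wake the run at ${reply.until}`;
    case "wait":
      return `park until "${reply.event}" matches or ${reply.until} passes`;
    case "done":
      return "mark the run completed";
    case "error":
      return reply.retryable ? "retry with backoff" : "fail the run";
  }
}
```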

App → service (management)

POST /e         { "name": "user.created", "data": { … } }
POST /fn/sync   { "url": "https://app/api/durable", "functions": [ { "id", "trigger", …options } ] }
POST /api/apps  { "name": "billing", "environment": "dev" }   # provision (admin)

That's the whole surface an SDK needs: sign, sync, send, and answer invokes with one of five ops. The TypeScript and PHP implementations are both under 300 lines.