fix: harden queue lifecycle and publish gate

- Preserve phase results on partial retry and keep interrupted phase context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 18:39:19 +02:00
parent 679053b27d
commit a9429e2118
16 changed files with 1867 additions and 87 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -0,0 +1,39 @@
+# jobqueue docs
+
+Detailed architecture and runtime docs for `jobqueue`.
+
+## Doc map
+
+| File | Purpose |
+| --- | --- |
+| [`architecture.md`](./architecture.md) | Static architecture, module boundaries, data model, event model, state model |
+| [`runtime-lifecycle.md`](./runtime-lifecycle.md) | Step-by-step runtime behavior from startup through shutdown |
+| [`integration-findings.md`](./integration-findings.md) | Multi-agent scan results, verified bugs, fixes, and remaining behavioral notes |
+| [`jobqueue.c4`](./jobqueue.c4) | LikeC4 source for landscape, container, component, and runtime views |
+
+## Mental model
+
+`jobqueue` is a **single-process orchestrator** around a SQLite-backed job table.
+
+1. Consumer code creates a `JobQueue`.
+2. Jobs are persisted immediately in SQLite.
+3. A pump claims runnable jobs and hands them to a concurrency-limited worker pool.
+4. `PhaseRunner` executes configured phases and reports progress back through `JobQueue`.
+5. `JobQueue` persists each state transition, emits typed events, formats SSE payloads, and optionally sends webhooks.
+6. A retention scheduler can mark old jobs as `stale` and later delete them.
+
+## Rendering LikeC4 views
+
+The repository stores LikeC4 source in [`jobqueue.c4`](./jobqueue.c4).
+
+```bash
+npx likec4 start docs/jobqueue.c4
+```
+
+Recommended views in the file:
+
+- `index` - system landscape
+- `library` - container view of `jobqueue`
+- `runtime` - internal runtime/component view
+- `enqueue-to-complete` - dynamic happy path
+- `retry-flow` - dynamic retry path
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,241 @@
+# Architecture
+
+This document explains how `jobqueue` is structured, what each module owns, and how data moves through the system.
+
+## 1. Top-level structure
+
+`JobQueue` is the orchestrator. Everything else is a collaborator around four concerns:
+
+1. **Persistence** - `SqliteStorage`
+2. **Execution** - `WorkerPool` + `PhaseRunner`
+3. **Notifications** - `TypedEventBus`, `SseSerializer`, `WebhookDispatcher`
+4. **Lifecycle management** - `RetryStrategy`, `RetentionScheduler`, shutdown logic
+
+```mermaid
+flowchart LR
+  A[Consumer app] -->|enqueue / retry / cancel / list| B[JobQueue]
+  B --> C[SqliteStorage]
+  B --> D[WorkerPool]
+  D --> E[PhaseRunner]
+  E --> F[Phase handlers registered by consumer]
+  B --> G[TypedEventBus]
+  G --> H[SSE stream subscribers]
+  B --> I[WebhookDispatcher]
+  I --> J[Webhook endpoints]
+  B --> K[RetentionScheduler]
+  K --> C
+```
+
+## 2. Module responsibilities
+
+| Module | Responsibility | Key behavior |
+| --- | --- | --- |
+| `src/JobQueue.ts` | Public API and orchestration | Owns startup, enqueue, cancel, retry, pumping, event emission, webhook dispatch, shutdown |
+| `src/storage/SqliteStorage.ts` | SQLite persistence | Creates schema, claims jobs, persists progress, completion, failure, retry, cancellation, stale/delete |
+| `src/processor/WorkerPool.ts` | Concurrency limit | Wraps `p-limit`, tracks running promises, supports drain with timeout |
+| `src/processor/PhaseRunner.ts` | Multi-phase execution | Runs handlers in order, computes overall progress, stops on cancellation |
+| `src/retry/RetryStrategy.ts` | Retry policy | Classifies errors, computes backoff, decides retry vs fail |
+| `src/events/EventBus.ts` | In-process pub/sub | Strongly typed wrapper around Node `EventEmitter` |
+| `src/events/SseSerializer.ts` | SSE formatting | Serializes event name + JSON payload into SSE wire format |
+| `src/webhook/WebhookDispatcher.ts` | Outbound HTTP callbacks | Sends POST requests, signs payloads, retries transient failures |
+| `src/retention/RetentionScheduler.ts` | Background cleanup | Periodically marks old jobs stale and later deletes them |
+
+## 3. Public API surface
+
+Exports from `src/index.ts` expose both high-level and low-level building blocks:
+
+- `JobQueue`
+- `SqliteStorage`
+- `WorkerPool`
+- `PhaseRunner`
+- `RetryStrategy`
+- `TypedEventBus`
+- `SseSerializer`
+- `WebhookDispatcher`
+- `RetentionScheduler`
+- shared queue/job/event types
+
+That split makes the package usable in two modes:
+
+1. **Normal mode** - instantiate `JobQueue` and let it coordinate everything
+2. **Advanced mode** - reuse lower-level pieces independently in custom orchestration
+
+## 4. Persistence model
+
+`SqliteStorage` keeps a single `jobs` table. The queue is effectively modeled as persisted state transitions on that table.
+
+### Important columns
+
+| Column | Meaning |
+| --- | --- |
+| `id` | Stable job identifier |
+| `status` | `pending`, `active`, `completed`, `failed`, `cancelled`, `stale` |
+| `data` | Original enqueue payload, JSON-encoded |
+| `current_phase` | Phase currently executing or last failed/retried phase |
+| `phases_json` | Array of per-phase state objects |
+| `phase_results` | JSON object keyed by phase name |
+| `progress` / `progress_message` | Latest overall progress snapshot |
+| `error_json` | Persisted failure metadata |
+| `retry_count` / `max_attempts` | Retry bookkeeping |
+| `webhook_url` / `webhook_sent` | Delivery configuration and latest success flag |
+| `scheduled_at` | Delayed execution / retry wake-up time |
+| `completed_at` / `cancelled_at` / `updated_at` | Lifecycle timestamps |
+
+### Why SQLite works well here
+
+- queue selection is simple and local
+- state transitions are small, synchronous writes
+- WAL mode supports concurrent reads while jobs are executing
+- no separate broker is required for a single-process runtime
+
+## 5. Phase model
+
+Each job stores an array of `JobPhaseState` entries:
+
+| Field | Meaning |
+| --- | --- |
+| `name` | Phase identifier from `QueueConfig.phases` |
+| `status` | `pending`, `active`, `completed`, `failed`, `cancelled` |
+| `progress` | Per-phase progress percentage |
+| `message` | Human-oriented phase status |
+| `startedAt` / `completedAt` | Phase timestamps |
+| `error` | Last phase-level error string |
+
+`PhaseRunner` walks those phases sequentially and computes overall progress as:
+
+```text
+((phaseIndex + phaseProgress / 100) / totalPhases) * 100
+```
+
+That design gives one stable persisted representation for:
+
+- single-step jobs (`phases: ['run']`)
+- multi-step pipelines (`['download', 'process', 'upload']`)
+- retries that restart only unfinished phases
+
+## 6. Event model
+
+`JobQueue` emits in-process typed events first. The SSE stream and webhook flow are adapters on top of that state machine.
+
+### Core queue events
+
+- `job:enqueued`
+- `job:started`
+- `job:progress`
+- `job:phase:completed`
+- `job:completed`
+- `job:failed`
+- `job:retrying`
+- `job:cancelled`
+- `job:stale`
+- `job:deleted`
+- `job:webhook:delivered`
+- `job:webhook:failed`
+
+### Event ordering rule
+
+The queue persists state before emitting the corresponding event. That means listeners observe already-persisted state, not speculative state.
+
+This is important for consumers that mix:
+
+- `queue.on(...)`
+- `queue.getJob(id)`
+- `queue.listJobs(...)`
+- `queue.createEventStream(...)`
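+
+A minimal sketch of the ordering rule, assuming an in-memory `Map` in place of SQLite and a hand-rolled listener list in place of `TypedEventBus`. The names `TinyQueue`, `complete`, `onCompleted`, and `getStatus` are hypothetical and exist only to illustrate the ordering; they are not the library's API.
+
+```typescript
+type Status = "pending" | "active" | "completed";
+type Listener = (jobId: string) => void;
+
+class TinyQueue {
+  // Stand-in for the persisted jobs table.
+  private readonly rows = new Map<string, Status>();
+  private readonly listeners: Listener[] = [];
+
+  enqueue(id: string): void {
+    this.rows.set(id, "pending");
+  }
+
+  onCompleted(listener: Listener): void {
+    this.listeners.push(listener);
+  }
+
+  getStatus(id: string): Status | undefined {
+    return this.rows.get(id);
+  }
+
+  complete(id: string): void {
+    // 1. Persist the transition first.
+    this.rows.set(id, "completed");
+    // 2. Only then notify listeners.
+    for (const listener of this.listeners) listener(id);
+  }
+}
+
+const queue = new TinyQueue();
+const observed: Array<Status | undefined> = [];
+queue.enqueue("job-1");
+// The listener reads state back, as a consumer mixing on(...) and getJob(id) would.
+queue.onCompleted((id) => observed.push(queue.getStatus(id)));
+queue.complete("job-1");
+```
+
+Because the write happens before the notification, a listener that reads state back during the event sees the `completed` status rather than a stale one.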
+
+## 7. SSE model
+
+`createEventStream()` creates a web `ReadableStream<Uint8Array>`.
+
+Behavior:
+
+1. Optional snapshot of current jobs is sent first
+2. Queue subscribes the stream to in-process events
+3. Each event is serialized as `event: <name>` + JSON `data: ...`
+4. Keepalive `ping` events are emitted on an interval
+5. Cancelling the reader removes subscriptions and the keepalive timer
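+
+Step 3 can be sketched as follows. This illustrates the standard SSE wire format only and is not the actual `SseSerializer` source; the function name `serializeSse` and the sample payload are assumptions.
+
+```typescript
+// Formats one SSE frame: an event line, a data line, and a blank line as terminator.
+function serializeSse(eventName: string, payload: unknown): string {
+  // JSON.stringify without indentation emits no raw newlines,
+  // so the payload fits on a single data: line.
+  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
+}
+
+const frame = serializeSse("job:completed", {
+  job: { id: "job-1", status: "completed" },
+});
+```
+
+The trailing blank line is what tells an SSE client that the frame is complete and can be dispatched.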
+
+### Payload shapes
+
+Most runtime events include a full `job` object. Two notable exceptions:
+
+| Event | Payload detail |
+| --- | --- |
+| `job:deleted` | Includes `deletedJobId` because the record was removed from storage |
+| `ping` | No job payload, only a timestamp |
+
+## 8. Webhook model
+
+Webhooks are outbound notifications, not part of the core execution loop.
+
+### Flow
+
+1. `JobQueue` decides whether an event should trigger a webhook
+2. `WebhookDispatcher` POSTs JSON to queue-level or job-level URL
+3. Optional HMAC SHA-256 signature is attached as `X-JobQueue-Signature`
+4. 5xx and transport failures retry with exponential backoff
+5. Success sets `webhook_sent = 1` and emits `job:webhook:delivered`
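+
+The signing step might look like the sketch below, which uses Node's built-in `node:crypto` module. The hex encoding of the header value, the payload shape, and the helper names `signBody` and `verifySignature` are assumptions; the source only states that an HMAC SHA-256 signature is sent as `X-JobQueue-Signature`.
+
+```typescript
+import { createHmac, timingSafeEqual } from "node:crypto";
+
+// Sender side: HMAC SHA-256 over the exact bytes of the request body.
+function signBody(body: string, secret: string): string {
+  return createHmac("sha256", secret).update(body).digest("hex");
+}
+
+// Receiver side: recompute the signature and compare in constant time.
+function verifySignature(body: string, secret: string, signature: string): boolean {
+  const expected = Buffer.from(signBody(body, secret), "hex");
+  const received = Buffer.from(signature, "hex");
+  // timingSafeEqual throws on length mismatch, so check lengths first.
+  return expected.length === received.length && timingSafeEqual(expected, received);
+}
+
+// Hypothetical payload; the real webhook body shape is not shown in this document.
+const body = JSON.stringify({ event: "job:completed", jobId: "job-1" });
+const signature = signBody(body, "s3cret");
+const headers = {
+  "Content-Type": "application/json",
+  "X-JobQueue-Signature": signature,
+};
+```
+
+A receiver should verify against the raw request body, since re-serializing parsed JSON can change the bytes and invalidate the signature.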
+
+### Scope
+
+Supported webhook-triggering events:
+
+- `job:completed`
+- `job:failed`
+- `job:retrying`
+- `job:cancelled`
+- `job:stale`
+
+## 9. Retention model
+
+Retention is deliberately two-stage:
+
+```mermaid
+flowchart LR
+  A[completed / failed / cancelled] -->|older than staleAfterMs| B[stale]
+  B -->|older than deleteAfterMs| C[deleted]
+```
+
+Why two stages:
+
+- consumers get a visible grace period before deletion
+- `onStale` and `onDelete` hooks can clean external artifacts
+- `job:stale` is externally observable before hard deletion
+
+## 10. State machine
+
+```mermaid
+stateDiagram-v2
+  [*] --> pending : enqueue
+  pending --> active : claimPendingJob
+  pending --> cancelled : cancel
+  active --> completed : all phases succeed
+  active --> failed : fatal error / retries exhausted
+  active --> pending : recoverable error + retry
+  active --> cancelled : cancel / abort
+  completed --> stale : retention mark
+  failed --> stale : retention mark
+  cancelled --> stale : retention mark
+  stale --> [*] : retention delete
+  failed --> pending : manual retry
+  cancelled --> pending : manual retry
+  stale --> pending : manual retry
+```
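+
+The diagram can also be read as a transition table. The sketch below encodes exactly the edges drawn above; the names `transitions` and `canTransition` are hypothetical and not part of the package.
+
+```typescript
+type JobStatus = "pending" | "active" | "completed" | "failed" | "cancelled" | "stale";
+
+// Allowed target statuses per source status, copied from the state diagram.
+const transitions: Record<JobStatus, readonly JobStatus[]> = {
+  pending: ["active", "cancelled"],
+  active: ["completed", "failed", "pending", "cancelled"],
+  completed: ["stale"],
+  failed: ["stale", "pending"],
+  cancelled: ["stale", "pending"],
+  // Retention delete removes the row, so it is not a status change.
+  stale: ["pending"],
+};
+
+function canTransition(from: JobStatus, to: JobStatus): boolean {
+  return transitions[from].includes(to);
+}
+```
+
+Note that `completed` has no edge back to `pending`: per the diagram, manual retry applies to `failed`, `cancelled`, and `stale` jobs only.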
+
+## 11. Build and packaging
+
+- ESM only
+- Node 20 target
+- bundled with `tsup`
+- type declarations emitted from the same entrypoint
+- tests run under Vitest in Node environment
+
+## 12. LikeC4 companion charts
+
+See [`jobqueue.c4`](./jobqueue.c4) for:
+
+- system landscape
+- library/container view
+- runtime/component view
+- enqueue-to-complete dynamic flow
+- retry dynamic flow
--- a/docs/integration-findings.md
+++ b/docs/integration-findings.md
@@ -0,0 +1,141 @@
+# Integration findings
+
+This document records what the multi-agent scan found, what was verified directly in source, and what changed.
+
+## 1. Scan method
+
+The repository was scanned through three independent passes:
+
+1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
+2. **storage/events scan** - persistence, SSE, webhook, retention interaction
+3. **code review scan** - cross-component defects only
+
+Those results were then checked against source before documenting or patching anything.
+
+## 2. Confirmed and fixed issues
+
+### A. Duplicate `job:cancelled` event at phase boundary
+
+**Observed risk**
+
+If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:
+
+- direct `cancel(id)` call
+- `PhaseRunner` cancellation callback on next loop iteration
+
+**Why it mattered**
+
+- duplicate event bus notifications
+- duplicate SSE `job:cancelled` events
+- duplicate `job:cancelled` webhooks
+
+**Fix**
+
+`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.
+
+### B. Pending-job cancellation left phases in `pending`
+
+**Observed risk**
+
+Cancelling a job before it started produced:
+
+- job status: `cancelled`
+- phase states: still `pending`
+
+**Why it mattered**
+
+- persisted lifecycle shape was contradictory
+- dashboards and tooling reading phases could not trust phase status
+
+**Fix**
+
+`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.
+
+### C. Shutdown could close before in-flight webhook bookkeeping finished
+
+**Observed risk**
+
+Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.
+
+**Why it mattered**
+
+- `webhook_sent` might not be written
+- `job:webhook:delivered` / `job:webhook:failed` could be lost
+- delivery bookkeeping could throw against a closed database
+
+**Fix**
+
+`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.
+
+### D. Shutdown timeout skipped cleanup
+
+**Observed risk**
+
+If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:
+
+- removing listeners
+- closing storage
+
+**Why it mattered**
+
+- leaked resources
+- left queue internals half-open after failed shutdown
+
+**Fix**
+
+Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
+
+## 3. Behavioral notes kept as documentation, not code changes
+
+These are real integration characteristics, but not all are bugs.
+
+### `job:completed` precedes `job:webhook:delivered`
+
+This is expected ordering:
+
+1. job completion is persisted
+2. `job:completed` emits
+3. webhook dispatch happens
+4. `job:webhook:delivered` emits on success
+
+So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.
+
+### `job:deleted` does not contain full job payload
+
+This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.
+
+### Webhooks are best-effort, not durable outbox delivery
+
+The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.
+
+## 4. Regression coverage added
+
+New tests now cover:
+
+- cancelling a pending job marks unfinished phases cancelled
+- cancelling on a phase boundary emits one cancellation event
+- shutdown waits for in-flight webhooks
+- shutdown cleanup still happens when worker drain times out
+
+## 5. Remaining risk areas
+
+No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:
+
+1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger
+2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
+3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
+4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions
+
+## 6. Second-scan fixes and coverage expansion
+
+The deeper follow-up scan confirmed three more issues that were patched:
+
+1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
+2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
+3. **Process restart recovery** dropped interrupted phase context in failure metadata.
+
+Coverage was expanded at three levels:
+
+- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
+- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
+- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion
--- a/docs/jobqueue.c4
+++ b/docs/jobqueue.c4
@@ -0,0 +1,140 @@
+specification {
+  element actor {
+    style {
+      shape person
+    }
+  }
+
+  element system {
+    style {
+      shape rectangle
+    }
+  }
+
+  element container {
+    style {
+      shape rectangle
+    }
+  }
+
+  element component {
+    style {
+      shape component
+    }
+  }
+
+  element database {
+    style {
+      shape storage
+    }
+  }
+
+  relationship async {
+    color amber
+    line dotted
+  }
+}
+
+model {
+  consumer = actor "Consumer application"
+  webhookReceiver = system "Webhook receiver"
+
+  jobqueue = system "jobqueue" {
+    api = container "Public API" {
+      technology "ESM / TypeScript"
+      description "JobQueue constructor plus enqueue, retry, cancel, query, stream, shutdown APIs"
+    }
+
+    runtime = container "Runtime orchestrator" {
+      technology "Node.js"
+      description "Coordinates persistence, execution, retries, events, SSE, webhooks, and shutdown"
+
+      queue = component "JobQueue"
+      storage = component "SqliteStorage"
+      pool = component "WorkerPool"
+      runner = component "PhaseRunner"
+      retry = component "RetryStrategy"
+      events = component "TypedEventBus"
+      sse = component "SseSerializer"
+      retention = component "RetentionScheduler"
+      webhooks = component "WebhookDispatcher"
+
+      queue -> storage "persists job state"
+      queue -> pool "dispatches runnable jobs"
+      queue -> runner "executes phase pipeline"
+      queue -> retry "classifies failures"
+      queue -> events "emits typed queue events"
+      queue -> sse "serializes SSE payloads"
+      queue -> retention "runs stale/delete cycle"
+      queue -[async]-> webhooks "dispatches outbound callbacks"
+    }
+
+    sqlite = database "SQLite jobs database" {
+      technology "better-sqlite3 + WAL"
+    }
+
+    handlers = container "Registered phase handlers" {
+      technology "Consumer-provided async functions"
+    }
+
+    streams = container "SSE subscribers" {
+      technology "ReadableStream consumers"
+    }
+
+    api -> runtime.queue "constructs and invokes"
+    runtime.storage -> sqlite "reads/writes rows"
+    runtime.runner -> handlers "invokes phase handlers"
+    runtime.events -> streams "pushes queue events"
+  }
+
+  consumer -> jobqueue.api "enqueue / retry / cancel / inspect / subscribe"
+  jobqueue.runtime.webhooks -[async]-> webhookReceiver "POST job events"
+}
+
+views {
+  view index {
+    title "jobqueue landscape"
+    include *
+    autoLayout LeftRight
+  }
+
+  view library of jobqueue {
+    title "jobqueue containers"
+    include *
+    autoLayout LeftRight
+  }
+
+  view runtime of jobqueue.runtime {
+    title "jobqueue runtime components"
+    include *
+    autoLayout LeftRight
+  }
+
+  dynamic view enqueue-to-complete {
+    title "Enqueue to successful completion"
+    consumer -> jobqueue.api "enqueue()"
+    jobqueue.api -> jobqueue.runtime.queue "create job"
+    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending row"
+    jobqueue.runtime.queue -> jobqueue.runtime.pool "schedule worker"
+    jobqueue.runtime.pool -> jobqueue.runtime.runner "run phases"
+    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler(s)"
+    jobqueue.runtime.runner -> jobqueue.runtime.storage "persist progress + phase results"
+    jobqueue.runtime.queue -> jobqueue.runtime.events "emit queue events"
+    jobqueue.runtime.events -> jobqueue.streams "push SSE"
+    jobqueue.runtime.queue -> jobqueue.runtime.webhooks "send completion webhook"
+    jobqueue.runtime.webhooks -> webhookReceiver "POST payload"
+    jobqueue.runtime.queue -> jobqueue.runtime.storage "mark webhook_sent"
+  }
+
+  dynamic view retry-flow {
+    title "Failure and retry flow"
+    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler"
+    jobqueue.handlers -> jobqueue.runtime.queue "throw recoverable error"
+    jobqueue.runtime.queue -> jobqueue.runtime.retry "classify error"
+    jobqueue.runtime.retry -> jobqueue.runtime.queue "retry with delay"
+    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending retry"
+    jobqueue.runtime.queue -> jobqueue.runtime.events "emit job:retrying"
+    jobqueue.runtime.events -> jobqueue.streams "push SSE"
+    jobqueue.runtime.queue -[async]-> jobqueue.runtime.webhooks "dispatch retry webhook"
+  }
+}
--- a/docs/runtime-lifecycle.md
+++ b/docs/runtime-lifecycle.md
@@ -0,0 +1,239 @@
+# Runtime lifecycle
+
+This document follows one queue instance from construction through shutdown.
+
+## 1. Construction
+
+When `new JobQueue(config)` runs, the constructor does more than store config:
+
+1. normalizes config defaults
+2. opens SQLite and enables WAL mode
+3. creates retry strategy
+4. creates worker pool
+5. optionally creates webhook dispatcher
+6. resets any previously `active` jobs to `failed`
+7. optionally starts retention scheduler
+8. requests an initial pump
+
+### Why `resetActiveJobs()` exists
+
+`jobqueue` is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in `active`.
+
+## 2. Enqueue path
+
+```mermaid
+sequenceDiagram
+  participant App as Consumer app
+  participant Queue as JobQueue
+  participant DB as SqliteStorage
+  participant Pump as Pump loop
+
+  App->>Queue: enqueue(data, options)
+  Queue->>DB: createJob(...)
+  DB-->>Queue: JobRecord(status=pending)
+  Queue-->>App: jobId
+  Queue->>Queue: emit job:enqueued
+  Queue->>Pump: requestPump()
+```
+
+Key points:
+
+- enqueue is durable first, asynchronous execution second
+- a job can be scheduled for the future with `scheduledAt`
+- a per-job webhook URL can override queue-level webhook URL
+
+## 3. Pumping and dispatch
+
+The queue uses a **pump loop**, not a constantly-blocking worker thread.
+
+### Pump rules
+
+1. stop immediately if queue is closed
+2. if another pump is already running, request a repump and return
+3. while worker pool has capacity:
+   - read runnable `pending` jobs whose `scheduled_at <= now`
+   - try to claim each job atomically
+   - emit `job:started`
+   - hand claimed job to `WorkerPool`
+4. schedule a wake-up for the next delayed job
+
+### Why `claimPendingJob()` matters
+
+The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
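+
+The claim step behaves like a compare-and-set on the row status. The sketch below simulates that with an in-memory `Map`; the real storage performs the transition in SQLite, and the exact statement and the signature of `claimPendingJob()` are not shown in this document, so both are assumptions here.
+
+```typescript
+type Row = { id: string; status: "pending" | "active" };
+
+// Stand-in for the jobs table.
+const table = new Map<string, Row>([["job-1", { id: "job-1", status: "pending" }]]);
+
+// Succeeds only while the row is still pending; a second caller gets false.
+function claimPendingJob(id: string): boolean {
+  const row = table.get(id);
+  if (!row || row.status !== "pending") return false;
+  row.status = "active";
+  return true;
+}
+
+// The same candidate appears twice to mimic two attempts to start one job.
+const candidates = ["job-1", "job-1"];
+const started = candidates.filter((id) => claimPendingJob(id));
+```
+
+Only the first claim wins, so the job is handed to the worker pool once even if it was listed as a candidate more than once.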
+
+## 4. Job execution
+
+Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.
+
+```mermaid
+sequenceDiagram
+  participant Queue as JobQueue
+  participant Runner as PhaseRunner
+  participant Handler as Phase handler
+  participant DB as SqliteStorage
+  participant Events as TypedEventBus
+
+  Queue->>Runner: run(job, signal)
+  loop for each phase
+    Runner->>DB: saveProgress(on phase start)
+    Runner->>Handler: handler(job, context)
+    Handler->>Runner: ctx.progress(...)
+    Runner->>DB: saveProgress(...)
+    Runner->>Events: job:progress
+    Handler-->>Runner: phase result
+    Runner->>DB: savePhaseCompletion(...)
+    Runner->>Events: job:phase:completed
+  end
+  Queue->>DB: completeJob(...)
+  Queue->>Events: job:completed
+```
+
+## 5. Progress semantics
+
+Progress exists at two levels:
+
+- **phase progress** - what the current handler reports
+- **overall progress** - computed from phase index + phase progress
+
+Example for three phases:
+
+| Phase | Reported phase progress | Computed overall progress (rounded) |
+| --- | --- | --- |
+| `download` | 50 | 17 |
+| `process` | 25 | 42 |
+| `upload` | 80 | 93 |
+
+`ctx.progress()` persists that state immediately, then emits `job:progress`.
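+
+The table values follow from the overall-progress formula in `architecture.md`. A small sketch to reproduce them, assuming results are rounded to the nearest integer for display (the function name `overallProgress` is hypothetical):
+
+```typescript
+// ((phaseIndex + phaseProgress / 100) / totalPhases) * 100
+function overallProgress(phaseIndex: number, phaseProgress: number, totalPhases: number): number {
+  return ((phaseIndex + phaseProgress / 100) / totalPhases) * 100;
+}
+
+// Three phases: download (index 0), process (index 1), upload (index 2).
+const rows = [
+  Math.round(overallProgress(0, 50, 3)), // download at 50%
+  Math.round(overallProgress(1, 25, 3)), // process at 25%
+  Math.round(overallProgress(2, 80, 3)), // upload at 80%
+];
+```
+
+Each phase contributes an equal share of the total, so a long-running phase and a short one move the overall figure by the same amount.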
+
+## 6. Result passing between phases
+
+Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:
+
+- `ctx.phaseResult(phaseName)`
+- `ctx.phaseResults()`
+
+This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".
+
+## 7. Retry path
+
+When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.
+
+```mermaid
+sequenceDiagram
+  participant Handler as Phase handler
+  participant Queue as JobQueue
+  participant Retry as RetryStrategy
+  participant DB as SqliteStorage
+  participant Events as TypedEventBus
+
+  Handler-->>Queue: throws error
+  Queue->>Retry: shouldRetry(error, currentJob)
+  alt recoverable and attempts remain
+    Queue->>DB: scheduleRetry(...)
+    Queue->>Events: job:retrying
+    Queue->>Queue: requestPump()
+  else fatal or exhausted
+    Queue->>DB: failJob(...)
+    Queue->>Events: job:failed
+  end
+```
+
+### Retry details
+
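+- `maxAttempts` includes the initial attempt, so `maxAttempts: 3` allows at most two retries
+- errors are classified as `fatal` by default, so a job is retried only when its error is classified as recoverable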
+- backoff can be `fixed`, `linear`, or `exponential`
+- recoverable retries keep the job in `pending` with a future `scheduled_at`
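+
+A sketch of how these rules could combine. The formulas for each backoff kind, the base delay parameter, and the helper names `backoffDelayMs` and `attemptsRemain` are assumptions for illustration; the actual `RetryStrategy` may compute delays differently.
+
+```typescript
+type Backoff = "fixed" | "linear" | "exponential";
+
+// retryNumber is 1 for the first retry, 2 for the second, and so on.
+function backoffDelayMs(kind: Backoff, retryNumber: number, baseMs: number): number {
+  switch (kind) {
+    case "fixed":
+      return baseMs;
+    case "linear":
+      return baseMs * retryNumber;
+    case "exponential":
+      return baseMs * 2 ** (retryNumber - 1);
+  }
+}
+
+// maxAttempts counts the initial attempt, so attempts used so far = retryCount + 1.
+function attemptsRemain(retryCount: number, maxAttempts: number): boolean {
+  return retryCount + 1 < maxAttempts;
+}
+
+// A retry would then be scheduled by storing now + delay as the wake-up time.
+const nextScheduledAt = Date.now() + backoffDelayMs("exponential", 2, 1000);
+```
+
+With `maxAttempts: 1` the helper never allows a retry, which matches the rule that the initial attempt is counted.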
+
+## 8. Cancellation path
+
+Cancellation is cooperative:
+
+1. `queue.cancel(id)` aborts the job controller if one exists
+2. unfinished phases are persisted as `cancelled`
+3. job status becomes `cancelled`
+4. `job:cancelled` is emitted
+5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled
+
+That last rule matters. Without it, a cancel request arriving at a phase boundary could emit duplicate `job:cancelled` events. The implementation now guards against that.
+
+### Pending cancellation
+
+If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.
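+
+Both cancellation rules can be sketched together. The `Job` shape and the `cancel` helper below are simplified, hypothetical stand-ins, and treating `pending` and `active` phases as "unfinished" is an assumption.
+
+```typescript
+type PhaseStatus = "pending" | "active" | "completed" | "failed" | "cancelled";
+type Job = { status: string; phases: { name: string; status: PhaseStatus }[] };
+
+// Records emitted event names so duplicates are easy to spot.
+const emitted: string[] = [];
+
+function cancel(job: Job): boolean {
+  // Guard: if the job is already cancelled, persist and emit nothing.
+  if (job.status === "cancelled") return false;
+  for (const phase of job.phases) {
+    if (phase.status === "pending" || phase.status === "active") {
+      phase.status = "cancelled";
+    }
+  }
+  job.status = "cancelled";
+  emitted.push("job:cancelled");
+  return true;
+}
+
+const job: Job = {
+  status: "active",
+  phases: [
+    { name: "download", status: "completed" },
+    { name: "process", status: "pending" },
+  ],
+};
+
+cancel(job); // direct cancel(id) call
+cancel(job); // later phase-runner cancellation callback
+```
+
+The second call is a no-op, so only one `job:cancelled` event is recorded, and the completed `download` phase keeps its status.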
+
+## 9. SSE stream lifecycle
+
+`createEventStream()` creates a stream over queue events.
+
+### Stream startup
+
+1. optional snapshot is written first
+2. event listeners are attached
+3. periodic `ping` keepalive starts
+
+### Stream shutdown
+
+- cancelling the reader removes all attached listeners
+- keepalive timer is cleared
+- no queue state is modified
+
+## 10. Webhook lifecycle
+
+Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.
+
+### Completion + webhook ordering
+
+For a successful job:
+
+1. `completeJob()` persists `status = completed`
+2. `job:completed` is emitted
+3. webhook dispatch is scheduled
+4. successful delivery marks `webhook_sent = 1`
+5. `job:webhook:delivered` is emitted
+
+This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.
+
+### Shutdown interaction
+
+The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed webhook still needs to update `webhook_sent` or emit delivery/failure events.
+
+## 11. Retention lifecycle
+
+Retention runs independently from job execution:
+
+1. compute stale cutoff and delete cutoff
+2. mark eligible terminal jobs as `stale`
+3. run optional `onStale(job)` callback
+4. emit `job:stale`
+5. delete stale jobs past delete cutoff
+6. run optional `onDelete(job)` callback
+7. emit `job:deleted`
+
+## 12. Shutdown lifecycle
+
+Shutdown now has two responsibilities:
+
+1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
+2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage
+
+### Current behavior
+
+```mermaid
+flowchart TD
+  A["shutdown()"] --> B[closed = true]
+  B --> C[stop retention + clear wakeup timer]
+  C --> D{workers drained in time?}
+  D -- yes --> E[drain pending webhooks]
+  D -- no --> F[abort active controllers]
+  F --> G[best-effort second drain]
+  G --> E
+  E --> H[remove listeners]
+  H --> I[close SQLite]
+  I --> J{timeout happened?}
+  J -- no --> K[resolve]
+  J -- yes --> L[rethrow timeout error after cleanup]
+```
+
+### Important nuance
+
+If a handler ignores `AbortSignal`, shutdown can still time out. The queue guarantees that cleanup runs either way, but graceful completion depends on handler cooperation.
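+
+The control flow can be sketched with a `finally` block. This is a simplified illustration, not the actual `shutdown()` source: `shutdownSketch`, its parameters, and the step log are hypothetical, and the best-effort second drain is omitted.
+
+```typescript
+async function shutdownSketch(drainWorkers: () => Promise<void>, steps: string[]): Promise<void> {
+  let timeoutError: unknown;
+  try {
+    try {
+      await drainWorkers();
+    } catch (error) {
+      // Drain timed out: remember the error and abort whatever is still running.
+      timeoutError = error;
+      steps.push("abort active controllers");
+    }
+    steps.push("drain pending webhooks");
+  } finally {
+    // Teardown runs whether or not the drain succeeded.
+    steps.push("remove listeners");
+    steps.push("close storage");
+  }
+  // Surface the timeout only after cleanup has finished.
+  if (timeoutError !== undefined) throw timeoutError;
+}
+
+const steps: string[] = [];
+const outcome = shutdownSketch(() => Promise.reject(new Error("drain timeout")), steps).then(
+  () => "resolved",
+  () => "rejected",
+);
+```
+
+In this run the caller still sees the timeout as a rejection, but only after the listeners are removed and storage is closed.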