fix: harden queue lifecycle and publish gate

- Preserve phase results on partial retry and keep interrupted phase
  context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 18:39:19 +02:00
parent 679053b27d
commit a9429e2118
16 changed files with 1867 additions and 87 deletions

39
docs/README.md Normal file

@@ -0,0 +1,39 @@
# jobqueue docs
Detailed architecture and runtime docs for `jobqueue`.
## Doc map
| File | Purpose |
| --- | --- |
| [`architecture.md`](./architecture.md) | Static architecture, module boundaries, data model, event model, state model |
| [`runtime-lifecycle.md`](./runtime-lifecycle.md) | Step-by-step runtime behavior from startup through shutdown |
| [`integration-findings.md`](./integration-findings.md) | Multi-agent scan results, verified bugs, fixes, and remaining behavioral notes |
| [`jobqueue.c4`](./jobqueue.c4) | LikeC4 source for landscape, container, component, and runtime views |
## Mental model
`jobqueue` is a **single-process orchestrator** around a SQLite-backed job table.
1. Consumer code creates a `JobQueue`.
2. Jobs are persisted immediately in SQLite.
3. A pump claims runnable jobs and hands them to a concurrency-limited worker pool.
4. `PhaseRunner` executes configured phases and reports progress back through `JobQueue`.
5. `JobQueue` persists each state transition, emits typed events, formats SSE payloads, and optionally sends webhooks.
6. A retention scheduler can mark old jobs as `stale` and later delete them.
## Rendering LikeC4 views
The repository stores LikeC4 source in [`jobqueue.c4`](./jobqueue.c4).
```bash
npx likec4 start docs/jobqueue.c4
```
Recommended views in the file:
- `index` - system landscape
- `library` - container view of `jobqueue`
- `runtime` - internal runtime/component view
- `enqueue-to-complete` - dynamic happy path
- `retry-flow` - dynamic retry path

241
docs/architecture.md Normal file

@@ -0,0 +1,241 @@
# Architecture
This document explains how `jobqueue` is structured, what each module owns, and how data moves through the system.
## 1. Top-level structure
`JobQueue` is the orchestrator. Everything else is a collaborator around four concerns:
1. **Persistence** - `SqliteStorage`
2. **Execution** - `WorkerPool` + `PhaseRunner`
3. **Notifications** - `TypedEventBus`, `SseSerializer`, `WebhookDispatcher`
4. **Lifecycle management** - `RetryStrategy`, `RetentionScheduler`, shutdown logic
```mermaid
flowchart LR
A[Consumer app] -->|enqueue / retry / cancel / list| B[JobQueue]
B --> C[SqliteStorage]
B --> D[WorkerPool]
D --> E[PhaseRunner]
E --> F[Phase handlers registered by consumer]
B --> G[TypedEventBus]
G --> H[SSE stream subscribers]
B --> I[WebhookDispatcher]
I --> J[Webhook endpoints]
B --> K[RetentionScheduler]
K --> C
```
## 2. Module responsibilities
| Module | Responsibility | Key behavior |
| --- | --- | --- |
| `src/JobQueue.ts` | Public API and orchestration | Owns startup, enqueue, cancel, retry, pumping, event emission, webhook dispatch, shutdown |
| `src/storage/SqliteStorage.ts` | SQLite persistence | Creates schema, claims jobs, persists progress, completion, failure, retry, cancellation, stale/delete |
| `src/processor/WorkerPool.ts` | Concurrency limit | Wraps `p-limit`, tracks running promises, supports drain with timeout |
| `src/processor/PhaseRunner.ts` | Multi-phase execution | Runs handlers in order, computes overall progress, stops on cancellation |
| `src/retry/RetryStrategy.ts` | Retry policy | Classifies errors, computes backoff, decides retry vs fail |
| `src/events/EventBus.ts` | In-process pub/sub | Strongly typed wrapper around Node `EventEmitter` |
| `src/events/SseSerializer.ts` | SSE formatting | Serializes event name + JSON payload into SSE wire format |
| `src/webhook/WebhookDispatcher.ts` | Outbound HTTP callbacks | Sends POST requests, signs payloads, retries transient failures |
| `src/retention/RetentionScheduler.ts` | Background cleanup | Periodically marks old jobs stale and later deletes them |
## 3. Public API surface
Exports from `src/index.ts` expose both high-level and low-level building blocks:
- `JobQueue`
- `SqliteStorage`
- `WorkerPool`
- `PhaseRunner`
- `RetryStrategy`
- `TypedEventBus`
- `SseSerializer`
- `WebhookDispatcher`
- `RetentionScheduler`
- shared queue/job/event types
That split makes the package usable in two modes:
1. **Normal mode** - instantiate `JobQueue` and let it coordinate everything
2. **Advanced mode** - reuse lower-level pieces independently in custom orchestration
## 4. Persistence model
`SqliteStorage` keeps a single `jobs` table. The queue is effectively modeled as persisted state transitions on that table.
### Important columns
| Column | Meaning |
| --- | --- |
| `id` | Stable job identifier |
| `status` | `pending`, `active`, `completed`, `failed`, `cancelled`, `stale` |
| `data` | Original enqueue payload, JSON-encoded |
| `current_phase` | Phase currently executing or last failed/retried phase |
| `phases_json` | Array of per-phase state objects |
| `phase_results` | JSON object keyed by phase name |
| `progress` / `progress_message` | Latest overall progress snapshot |
| `error_json` | Persisted failure metadata |
| `retry_count` / `max_attempts` | Retry bookkeeping |
| `webhook_url` / `webhook_sent` | Delivery configuration and latest success flag |
| `scheduled_at` | Delayed execution / retry wake-up time |
| `completed_at` / `cancelled_at` / `updated_at` | Lifecycle timestamps |
### Why SQLite works well here
- queue selection is simple and local
- state transitions are small, synchronous writes
- WAL mode supports concurrent reads while jobs are executing
- no separate broker is required for a single-process runtime
## 5. Phase model
Each job stores an array of `JobPhaseState` entries:
| Field | Meaning |
| --- | --- |
| `name` | Phase identifier from `QueueConfig.phases` |
| `status` | `pending`, `active`, `completed`, `failed`, `cancelled` |
| `progress` | Per-phase progress percentage |
| `message` | Human-oriented phase status |
| `startedAt` / `completedAt` | Phase timestamps |
| `error` | Last phase-level error string |
`PhaseRunner` walks those phases sequentially and computes overall progress as:
```text
((phaseIndex + phaseProgress / 100) / totalPhases) * 100
```
That design gives one stable persisted representation for:
- single-step jobs (`phases: ['run']`)
- multi-step pipelines (`['download', 'process', 'upload']`)
- retries that restart only unfinished phases
## 6. Event model
`JobQueue` emits in-process typed events first. The SSE stream and webhook flow are adapters on top of that state machine.
### Core queue events
- `job:enqueued`
- `job:started`
- `job:progress`
- `job:phase:completed`
- `job:completed`
- `job:failed`
- `job:retrying`
- `job:cancelled`
- `job:stale`
- `job:deleted`
- `job:webhook:delivered`
- `job:webhook:failed`
### Event ordering rule
The queue persists state before emitting the corresponding event. That means listeners observe already-persisted state, not speculative state.
This is important for consumers that mix:
- `queue.on(...)`
- `queue.getJob(id)`
- `queue.listJobs(...)`
- `queue.createEventStream(...)`
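
The persist-before-emit rule can be illustrated with a minimal sketch. The in-memory `store` and the hand-rolled `subscribe`/`transition` helpers are hypothetical stand-ins for `SqliteStorage` and `TypedEventBus`, not the library's actual code.

```typescript
type DemoStatus = "pending" | "active" | "completed";
interface DemoJob { id: string; status: DemoStatus }
type DemoListener = (job: DemoJob) => void;

// In-memory stand-ins for storage and the event bus (assumptions).
const store = new Map<string, DemoJob>();
const subscribers = new Map<string, DemoListener[]>();

function subscribe(eventName: string, listener: DemoListener): void {
  const list = subscribers.get(eventName) ?? [];
  list.push(listener);
  subscribers.set(eventName, list);
}

function transition(id: string, next: DemoStatus, eventName: string): void {
  const job: DemoJob = { id, status: next };
  store.set(id, job); // 1. persist the new state first
  for (const listener of subscribers.get(eventName) ?? []) {
    listener({ ...job }); // 2. only then notify listeners
  }
}

const seen: DemoStatus[] = [];
subscribe("job:completed", (job) => {
  // A listener that re-reads storage finds the state the event describes.
  const persisted = store.get(job.id);
  if (persisted) seen.push(persisted.status);
});
transition("job-1", "completed", "job:completed");
```

If the two steps inside `transition` were swapped, the listener would read a missing or outdated row, which is the inconsistency the ordering rule is meant to rule out.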
## 7. SSE model
`createEventStream()` creates a web `ReadableStream<Uint8Array>`.
Behavior:
1. Optional snapshot of current jobs is sent first
2. Queue subscribes the stream to in-process events
3. Each event is serialized as `event: <name>` + JSON `data: ...`
4. Keepalive `ping` events are emitted on an interval
5. Cancelling the reader removes subscriptions and the keepalive timer
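
As a rough sketch of step 3, an SSE frame is an `event:` line, a `data:` line, and a terminating blank line. The function names `formatSseFrame` and `encodeSseFrame` are hypothetical; the real `SseSerializer` may differ in detail.

```typescript
// Builds one SSE frame as text. JSON.stringify (without an indent argument)
// escapes newlines inside strings, so a single data: line is sufficient.
function formatSseFrame(eventName: string, payload: unknown): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Encodes the frame as UTF-8 bytes, matching a ReadableStream<Uint8Array>.
function encodeSseFrame(eventName: string, payload: unknown): Uint8Array {
  return new TextEncoder().encode(formatSseFrame(eventName, payload));
}

const frameText = formatSseFrame("job:progress", { jobId: "job-1", progress: 42 });
const frameBytes = encodeSseFrame("job:progress", { jobId: "job-1", progress: 42 });
```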
### Payload shapes
Most runtime events include a full `job` object. Two notable exceptions:
| Event | Payload detail |
| --- | --- |
| `job:deleted` | Includes `deletedJobId` because the record was removed from storage |
| `ping` | No job payload, only a timestamp |
## 8. Webhook model
Webhooks are outbound notifications, not part of the core execution loop.
### Flow
1. `JobQueue` decides whether an event should trigger a webhook
2. `WebhookDispatcher` POSTs JSON to queue-level or job-level URL
3. Optional HMAC SHA-256 signature is attached as `X-JobQueue-Signature`
4. 5xx responses and transport failures are retried with exponential backoff
5. Success sets `webhook_sent = 1` and emits `job:webhook:delivered`
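
The retry rule in step 4 can be sketched as two small pure functions. The names, the base delay, the cap, and the choice not to retry other status codes are assumptions for illustration, not values taken from `WebhookDispatcher`.

```typescript
// An undefined status models a transport failure (no HTTP response at all).
function isRetryableDelivery(statusCode: number | undefined): boolean {
  return statusCode === undefined || (statusCode >= 500 && statusCode <= 599);
}

// Exponential backoff for a zero-based attempt index:
// baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
function webhookBackoffMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

Under these assumptions a `404` is treated as a permanent failure, while a `503` or a dropped connection is retried after an increasing delay.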
### Scope
Supported webhook-triggering events:
- `job:completed`
- `job:failed`
- `job:retrying`
- `job:cancelled`
- `job:stale`
## 9. Retention model
Retention is deliberately two-stage:
```mermaid
flowchart LR
A[completed / failed / cancelled] -->|older than staleAfterMs| B[stale]
B -->|older than deleteAfterMs| C[deleted]
```
Why two stages:
- consumers get a visible grace period before deletion
- `onStale` and `onDelete` hooks can clean external artifacts
- `job:stale` is externally observable before hard deletion
## 10. State machine
```mermaid
stateDiagram-v2
[*] --> pending : enqueue
pending --> active : claimPendingJob
pending --> cancelled : cancel
active --> completed : all phases succeed
active --> failed : fatal error / retries exhausted
active --> pending : recoverable error + retry
active --> cancelled : cancel / abort
completed --> stale : retention mark
failed --> stale : retention mark
cancelled --> stale : retention mark
stale --> [*] : retention delete
failed --> pending : manual retry
cancelled --> pending : manual retry
stale --> pending : manual retry
```
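
The diagram above can also be read as a lookup table. The following is a sketch that transcribes its arrows; `allowedTransitions` and `canTransition` are hypothetical names, and the library may not enforce transitions this way.

```typescript
type QueueStatus = "pending" | "active" | "completed" | "failed" | "cancelled" | "stale";

// Allowed transitions, transcribed from the state diagram.
const allowedTransitions: Record<QueueStatus, readonly QueueStatus[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed", "pending", "cancelled"],
  completed: ["stale"],
  failed: ["stale", "pending"],
  cancelled: ["stale", "pending"],
  stale: ["pending"], // retention delete removes the row, so it is not a status
};

function canTransition(from: QueueStatus, to: QueueStatus): boolean {
  return allowedTransitions[from].includes(to);
}
```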
## 11. Build and packaging
- ESM only
- Node 20 target
- bundled with `tsup`
- type declarations emitted from the same entrypoint
- tests run under Vitest in Node environment
## 12. LikeC4 companion charts
See [`jobqueue.c4`](./jobqueue.c4) for:
- system landscape
- library/container view
- runtime/component view
- enqueue-to-complete dynamic flow
- retry dynamic flow


@@ -0,0 +1,141 @@
# Integration findings
This document records what the multi-agent scan found, what was verified directly in source, and what changed.
## 1. Scan method
The repository was scanned through three independent passes:
1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
2. **storage/events scan** - persistence, SSE, webhook, retention interaction
3. **code review scan** - cross-component defects only
Those results were then checked against source before documenting or patching anything.
## 2. Confirmed and fixed issues
### A. Duplicate `job:cancelled` event at phase boundary
**Observed risk**
If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:
- direct `cancel(id)` call
- `PhaseRunner` cancellation callback on next loop iteration
**Why it mattered**
- duplicate event bus notifications
- duplicate SSE `job:cancelled` events
- duplicate `job:cancelled` webhooks
**Fix**
`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.
### B. Pending-job cancellation left phases in `pending`
**Observed risk**
Cancelling a job before it started produced:
- job status: `cancelled`
- phase states: still `pending`
**Why it mattered**
- persisted lifecycle shape was contradictory
- dashboards and tooling reading phases could not trust phase status
**Fix**
`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.
### C. Shutdown could close before in-flight webhook bookkeeping finished
**Observed risk**
Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.
**Why it mattered**
- `webhook_sent` might not be written
- `job:webhook:delivered` / `job:webhook:failed` could be lost
- delivery bookkeeping could throw against a closed database
**Fix**
`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.
### D. Shutdown timeout skipped cleanup
**Observed risk**
If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:
- removing listeners
- closing storage
**Why it mattered**
- leaked resources
- left queue internals half-open after failed shutdown
**Fix**
Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
## 3. Behavioral notes kept as documentation, not code changes
These are real integration characteristics, but not all are bugs.
### `job:completed` precedes `job:webhook:delivered`
This is expected ordering:
1. job completion is persisted
2. `job:completed` emits
3. webhook dispatch happens
4. `job:webhook:delivered` emits on success
So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.
### `job:deleted` does not contain full job payload
This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.
### Webhooks are best-effort, not durable outbox delivery
The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.
## 4. Regression coverage added
New tests now cover:
- cancelling a pending job marks unfinished phases cancelled
- cancelling on a phase boundary emits one cancellation event
- shutdown waits for in-flight webhooks
- shutdown cleanup still happens when worker drain times out
## 5. Remaining risk areas
No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:
1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger
2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions
## 6. Second-scan fixes and coverage expansion
The deeper follow-up scan confirmed three more issues that were patched:
1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
3. **Process restart recovery** dropped interrupted phase context in failure metadata.
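
The intended partial-retry behavior from item 2 can be sketched as follows. `resetForRetry` and the `RetryJob` shape are hypothetical simplifications; the actual storage reset logic is not shown in this document.

```typescript
type RetryPhaseStatus = "pending" | "active" | "completed" | "failed" | "cancelled";
interface RetryPhase { name: string; status: RetryPhaseStatus }
interface RetryJob { phases: RetryPhase[]; phaseResults: Record<string, unknown> }

function resetForRetry(job: RetryJob, fromStart: boolean): RetryJob {
  if (fromStart) {
    // Full retry: every phase runs again, so earlier results are discarded.
    return {
      phases: job.phases.map((p) => ({ ...p, status: "pending" as const })),
      phaseResults: {},
    };
  }
  // Partial retry: completed phases and their results are preserved.
  const phases = job.phases.map((p) =>
    p.status === "completed" ? p : { ...p, status: "pending" as const },
  );
  const phaseResults: Record<string, unknown> = {};
  for (const p of phases) {
    if (p.status === "completed" && p.name in job.phaseResults) {
      phaseResults[p.name] = job.phaseResults[p.name];
    }
  }
  return { phases, phaseResults };
}

const failedJob: RetryJob = {
  phases: [
    { name: "download", status: "completed" },
    { name: "process", status: "failed" },
  ],
  phaseResults: { download: "/tmp/file" },
};
const partialRetry = resetForRetry(failedJob, false);
const fullRetry = resetForRetry(failedJob, true);
```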
Coverage was expanded at three levels:
- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion

140
docs/jobqueue.c4 Normal file

@@ -0,0 +1,140 @@
specification {
  element actor {
    style {
      shape person
    }
  }
  element system {
    style {
      shape rectangle
    }
  }
  element container {
    style {
      shape rectangle
    }
  }
  element component {
    style {
      shape component
    }
  }
  element database {
    style {
      shape storage
    }
  }
  relationship async {
    color amber
    line dotted
  }
}
model {
  consumer = actor "Consumer application"
  webhookReceiver = system "Webhook receiver"
  jobqueue = system "jobqueue" {
    api = container "Public API" {
      technology "ESM / TypeScript"
      description "JobQueue constructor plus enqueue, retry, cancel, query, stream, shutdown APIs"
    }
    runtime = container "Runtime orchestrator" {
      technology "Node.js"
      description "Coordinates persistence, execution, retries, events, SSE, webhooks, and shutdown"
      queue = component "JobQueue"
      storage = component "SqliteStorage"
      pool = component "WorkerPool"
      runner = component "PhaseRunner"
      retry = component "RetryStrategy"
      events = component "TypedEventBus"
      sse = component "SseSerializer"
      retention = component "RetentionScheduler"
      webhooks = component "WebhookDispatcher"
      queue -> storage "persists job state"
      queue -> pool "dispatches runnable jobs"
      queue -> runner "executes phase pipeline"
      queue -> retry "classifies failures"
      queue -> events "emits typed queue events"
      queue -> sse "serializes SSE payloads"
      queue -> retention "runs stale/delete cycle"
      queue -[async]-> webhooks "dispatches outbound callbacks"
    }
    sqlite = database "SQLite jobs database" {
      technology "better-sqlite3 + WAL"
    }
    handlers = container "Registered phase handlers" {
      technology "Consumer-provided async functions"
    }
    streams = container "SSE subscribers" {
      technology "ReadableStream consumers"
    }
    api -> runtime.queue "constructs and invokes"
    runtime.storage -> sqlite "reads/writes rows"
    runtime.runner -> handlers "invokes phase handlers"
    runtime.events -> streams "pushes queue events"
  }
  consumer -> jobqueue.api "enqueue / retry / cancel / inspect / subscribe"
  jobqueue.runtime.webhooks -[async]-> webhookReceiver "POST job events"
}
views {
  view index {
    title "jobqueue landscape"
    include *
    autoLayout LeftRight
  }
  view library of jobqueue {
    title "jobqueue containers"
    include *
    autoLayout LeftRight
  }
  view runtime of jobqueue.runtime {
    title "jobqueue runtime components"
    include *
    autoLayout LeftRight
  }
  dynamic view enqueue-to-complete {
    title "Enqueue to successful completion"
    consumer -> jobqueue.api "enqueue()"
    jobqueue.api -> jobqueue.runtime.queue "create job"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending row"
    jobqueue.runtime.queue -> jobqueue.runtime.pool "schedule worker"
    jobqueue.runtime.pool -> jobqueue.runtime.runner "run phases"
    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler(s)"
    jobqueue.runtime.runner -> jobqueue.runtime.storage "persist progress + phase results"
    jobqueue.runtime.queue -> jobqueue.runtime.events "emit queue events"
    jobqueue.runtime.events -> jobqueue.streams "push SSE"
    jobqueue.runtime.queue -> jobqueue.runtime.webhooks "send completion webhook"
    jobqueue.runtime.webhooks -> webhookReceiver "POST payload"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "mark webhook_sent"
  }
  dynamic view retry-flow {
    title "Failure and retry flow"
    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler"
    jobqueue.handlers -> jobqueue.runtime.queue "throw recoverable error"
    jobqueue.runtime.queue -> jobqueue.runtime.retry "classify error"
    jobqueue.runtime.retry -> jobqueue.runtime.queue "retry with delay"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending retry"
    jobqueue.runtime.queue -> jobqueue.runtime.events "emit job:retrying"
    jobqueue.runtime.events -> jobqueue.streams "push SSE"
    jobqueue.runtime.queue -[async]-> jobqueue.runtime.webhooks "dispatch retry webhook"
  }
}

239
docs/runtime-lifecycle.md Normal file

@@ -0,0 +1,239 @@
# Runtime lifecycle
This document follows one queue instance from construction through shutdown.
## 1. Construction
When `new JobQueue(config)` runs, the constructor does more than store config:
1. normalizes config defaults
2. opens SQLite and enables WAL mode
3. creates retry strategy
4. creates worker pool
5. optionally creates webhook dispatcher
6. resets any previously `active` jobs to `failed`
7. optionally starts retention scheduler
8. requests an initial pump
### Why `resetActiveJobs()` exists
`jobqueue` is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in `active`.
## 2. Enqueue path
```mermaid
sequenceDiagram
participant App as Consumer app
participant Queue as JobQueue
participant DB as SqliteStorage
participant Pump as Pump loop
App->>Queue: enqueue(data, options)
Queue->>DB: createJob(...)
DB-->>Queue: JobRecord(status=pending)
Queue-->>App: jobId
Queue->>Queue: emit job:enqueued
Queue->>Pump: requestPump()
```
Key points:
- enqueue is durable first, asynchronous execution second
- a job can be scheduled for the future with `scheduledAt`
- a per-job webhook URL can override queue-level webhook URL
## 3. Pumping and dispatch
The queue uses a **pump loop**, not a constantly blocking worker thread.
### Pump rules
1. stop immediately if queue is closed
2. if another pump is already running, request a repump and return
3. while worker pool has capacity:
- read runnable `pending` jobs whose `scheduled_at <= now`
- try to claim each job atomically
- emit `job:started`
- hand claimed job to `WorkerPool`
4. schedule a wake-up for the next delayed job
### Why `claimPendingJob()` matters
The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
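
The claim step is essentially a compare-and-set on the row's status. A minimal sketch, using an in-memory `Map` as a hypothetical stand-in for the `jobs` table (the SQL in the comment is an assumed equivalent, not quoted from `SqliteStorage`):

```typescript
type ClaimStatus = "pending" | "active";
interface JobRow { id: string; status: ClaimStatus }

// With SQLite the same idea would be a conditional update such as
//   UPDATE jobs SET status = 'active' WHERE id = ? AND status = 'pending'
// followed by a check that exactly one row changed (assumed SQL).
function claimPendingJob(rows: Map<string, JobRow>, id: string): boolean {
  const row = rows.get(id);
  if (!row || row.status !== "pending") return false; // already claimed or gone
  row.status = "active";
  return true;
}

const jobTable = new Map<string, JobRow>([
  ["job-1", { id: "job-1", status: "pending" }],
]);
const firstClaim = claimPendingJob(jobTable, "job-1");
const secondClaim = claimPendingJob(jobTable, "job-1");
```

Only the first caller wins; a second attempt on the same row reports failure instead of starting the job again.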
## 4. Job execution
Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.
```mermaid
sequenceDiagram
participant Queue as JobQueue
participant Runner as PhaseRunner
participant Handler as Phase handler
participant DB as SqliteStorage
participant Events as TypedEventBus
Queue->>Runner: run(job, signal)
loop for each phase
Runner->>DB: saveProgress(on phase start)
Runner->>Handler: handler(job, context)
Handler->>Runner: ctx.progress(...)
Runner->>DB: saveProgress(...)
Runner->>Events: job:progress
Handler-->>Runner: phase result
Runner->>DB: savePhaseCompletion(...)
Runner->>Events: job:phase:completed
end
Queue->>DB: completeJob(...)
Queue->>Events: job:completed
```
## 5. Progress semantics
Progress exists at two levels:
- **phase progress** - what the current handler reports
- **overall progress** - computed from phase index + phase progress
Example for three phases:
| Phase | Reported phase progress | Computed overall progress |
| --- | --- | --- |
| `download` | 50 | 17 |
| `process` | 25 | 42 |
| `upload` | 80 | 93 |
`ctx.progress()` persists that state immediately, then emits `job:progress`.
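
The table values can be reproduced with a small helper. This is a sketch: the name `overallProgress`, the clamping, and the rounding to whole percentages are assumptions inferred from the table, not confirmed details of `PhaseRunner`.

```typescript
// phaseIndex is zero-based; phaseProgress is the handler-reported 0..100 value.
function overallProgress(phaseIndex: number, phaseProgress: number, totalPhases: number): number {
  if (totalPhases <= 0) throw new RangeError("totalPhases must be positive");
  // Clamp so an out-of-range report cannot push the phase share outside 0..100.
  const clamped = Math.min(100, Math.max(0, phaseProgress));
  return Math.round(((phaseIndex + clamped / 100) / totalPhases) * 100);
}

const downloadOverall = overallProgress(0, 50, 3);
const processOverall = overallProgress(1, 25, 3);
const uploadOverall = overallProgress(2, 80, 3);
```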
## 6. Result passing between phases
Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:
- `ctx.phaseResult(phaseName)`
- `ctx.phaseResults()`
This is the mechanism that turns the queue from a "single task runner" into a "multi-stage pipeline engine".
## 7. Retry path
When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.
```mermaid
sequenceDiagram
participant Handler as Phase handler
participant Queue as JobQueue
participant Retry as RetryStrategy
participant DB as SqliteStorage
participant Events as TypedEventBus
Handler-->>Queue: throws error
Queue->>Retry: shouldRetry(error, currentJob)
alt recoverable and attempts remain
Queue->>DB: scheduleRetry(...)
Queue->>Events: job:retrying
Queue->>Queue: requestPump()
else fatal or exhausted
Queue->>DB: failJob(...)
Queue->>Events: job:failed
end
```
### Retry details
- `maxAttempts` includes the initial attempt
- default disposition is `fatal`
- backoff can be `fixed`, `linear`, or `exponential`
- recoverable retries return the job to `pending` with a future `scheduled_at`
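
A minimal sketch of the retry decision and the three backoff shapes. The function names and the exact delay formulas are assumptions for illustration; `RetryStrategy` may compute delays differently.

```typescript
type Disposition = "recoverable" | "fatal";
type BackoffKind = "fixed" | "linear" | "exponential";

// retryCount is the number of retries already performed, so the attempt that
// just failed is attempt retryCount + 1. maxAttempts includes the first attempt.
function shouldRetry(disposition: Disposition, retryCount: number, maxAttempts: number): boolean {
  return disposition === "recoverable" && retryCount + 1 < maxAttempts;
}

// Delay before retry number `retry` (1-based). Formulas are assumed.
function backoffDelayMs(kind: BackoffKind, baseMs: number, retry: number): number {
  switch (kind) {
    case "fixed":
      return baseMs;
    case "linear":
      return baseMs * retry;
    case "exponential":
      return baseMs * 2 ** (retry - 1);
  }
}
```

With `maxAttempts: 3`, a job gets its initial run plus at most two retries; a `fatal` error fails immediately regardless of remaining attempts.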
## 8. Cancellation path
Cancellation is cooperative:
1. `queue.cancel(id)` aborts the job controller if one exists
2. unfinished phases are persisted as `cancelled`
3. job status becomes `cancelled`
4. `job:cancelled` is emitted
5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled
That last rule matters. Without it, a cancel request arriving between two phases could emit `job:cancelled` twice: once from the direct `cancel(id)` call and once from the `PhaseRunner` cancellation callback on its next loop iteration.
### Pending cancellation
If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.
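
The phase update can be sketched as a pure function. `cancelUnfinishedPhases` is a hypothetical name, and treating "unfinished" as `pending` or `active` is an assumption.

```typescript
type PhaseStatus = "pending" | "active" | "completed" | "failed" | "cancelled";
interface PhaseState { name: string; status: PhaseStatus }

// Completed and failed phases keep their status; everything still open is cancelled.
function cancelUnfinishedPhases(phases: readonly PhaseState[]): PhaseState[] {
  return phases.map((phase) =>
    phase.status === "pending" || phase.status === "active"
      ? { ...phase, status: "cancelled" as const }
      : phase,
  );
}

const cancelledPhases = cancelUnfinishedPhases([
  { name: "download", status: "completed" },
  { name: "process", status: "active" },
  { name: "upload", status: "pending" },
]);
```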
## 9. SSE stream lifecycle
`createEventStream()` creates a stream over queue events.
### Stream startup
1. optional snapshot is written first
2. event listeners are attached
3. periodic `ping` keepalive starts
### Stream shutdown
- cancelling the reader removes all attached listeners
- keepalive timer is cleared
- no queue state is modified
## 10. Webhook lifecycle
Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.
### Completion + webhook ordering
For a successful job:
1. `completeJob()` persists `status = completed`
2. `job:completed` is emitted
3. webhook dispatch is scheduled
4. successful delivery marks `webhook_sent = 1`
5. `job:webhook:delivered` is emitted
This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.
### Shutdown interaction
The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed job's webhook delivery still needs to update `webhook_sent` or emit delivery/failure events.
## 11. Retention lifecycle
Retention runs independently from job execution:
1. compute stale cutoff and delete cutoff
2. mark eligible terminal jobs as `stale`
3. run optional `onStale(job)` callback
4. emit `job:stale`
5. delete stale jobs past delete cutoff
6. run optional `onDelete(job)` callback
7. emit `job:deleted`
## 12. Shutdown lifecycle
Shutdown now has two responsibilities:
1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage
### Current behavior
```mermaid
flowchart TD
A["shutdown()"] --> B[closed = true]
B --> C[stop retention + clear wakeup timer]
C --> D{workers drained in time?}
D -- yes --> E[drain pending webhooks]
D -- no --> F[abort active controllers]
F --> G[best-effort second drain]
G --> E
E --> H[remove listeners]
H --> I[close SQLite]
I --> J{timeout happened?}
J -- no --> K[resolve]
J -- yes --> L[rethrow timeout error after cleanup]
```
### Important nuance
If a handler ignores `AbortSignal`, shutdown can still time out. The queue now guarantees cleanup still runs, but graceful completion still depends on handler cooperation.
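
The flowchart above can be sketched as a single async function. The `ShutdownDeps` interface and `shutdownSketch` are hypothetical; they only mirror the documented ordering and are not the actual `JobQueue.shutdown()` implementation.

```typescript
interface ShutdownDeps {
  drainWorkers(timeoutMs: number): Promise<void>; // assumed to reject on timeout
  abortActive(): void;
  drainWebhooks(): Promise<void>;
  removeListeners(): void;
  closeStorage(): void;
}

async function shutdownSketch(deps: ShutdownDeps, timeoutMs: number): Promise<void> {
  let timedOut = false;
  let timeoutError: unknown;
  try {
    await deps.drainWorkers(timeoutMs);
  } catch (error) {
    timedOut = true;
    timeoutError = error;
    deps.abortActive();
    // Best-effort second drain; a second timeout is deliberately ignored.
    await deps.drainWorkers(timeoutMs).catch(() => undefined);
  }
  try {
    await deps.drainWebhooks();
  } finally {
    // Cleanup runs whether or not the drains succeeded.
    deps.removeListeners();
    deps.closeStorage();
  }
  if (timedOut) throw timeoutError; // surface the timeout only after teardown
}
```

A caller can therefore treat a rejected shutdown as "resources released, but some handlers did not stop in time".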