fix: harden queue lifecycle and publish gate

- Preserve phase results on partial retry and keep interrupted phase context after restart. - Avoid webhook bookkeeping crashes when retention deletes stale jobs. - Add deeper unit, integration, and e2e coverage around queue seams. - Require verify job to pass before publish runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 18:39:19 +02:00
parent 679053b27d
commit a9429e2118
16 changed files with 1867 additions and 87 deletions
--- a/docs/runtime-lifecycle.md
+++ b/docs/runtime-lifecycle.md
@@ -0,0 +1,239 @@
+# Runtime lifecycle
+
+This document follows one queue instance from construction through shutdown.
+
+## 1. Construction
+
+When `new JobQueue(config)` runs, the constructor does more than store config:
+
+1. normalizes config defaults
+2. opens SQLite and enables WAL mode
+3. creates retry strategy
+4. creates worker pool
+5. optionally creates webhook dispatcher
+6. resets any previously `active` jobs to `failed`
+7. optionally starts retention scheduler
+8. requests an initial pump
+
+### Why `resetActiveJobs()` exists
+
+`jobqueue` is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in `active`.
+
+## 2. Enqueue path
+
+```mermaid
+sequenceDiagram
+  participant App as Consumer app
+  participant Queue as JobQueue
+  participant DB as SqliteStorage
+  participant Pump as Pump loop
+
+  App->>Queue: enqueue(data, options)
+  Queue->>DB: createJob(...)
+  DB-->>Queue: JobRecord(status=pending)
+  Queue-->>App: jobId
+  Queue->>Queue: emit job:enqueued
+  Queue->>Pump: requestPump()
+```
+
+Key points:
+
+- enqueue is durable first, asynchronous execution second
+- a job can be scheduled for the future with `scheduledAt`
+- a per-job webhook URL can override queue-level webhook URL
+
+## 3. Pumping and dispatch
+
+The queue uses a **pump loop**, not a constantly-blocking worker thread.
+
+### Pump rules
+
+1. stop immediately if queue is closed
+2. if another pump is already running, request a repump and return
+3. while worker pool has capacity:
+   - read runnable `pending` jobs whose `scheduled_at <= now`
+   - try to claim each job atomically
+   - emit `job:started`
+   - hand claimed job to `WorkerPool`
+4. schedule a wake-up for the next delayed job
+
+### Why `claimPendingJob()` matters
+
+The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
+
+## 4. Job execution
+
+Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.
+
+```mermaid
+sequenceDiagram
+  participant Queue as JobQueue
+  participant Runner as PhaseRunner
+  participant Handler as Phase handler
+  participant DB as SqliteStorage
+  participant Events as TypedEventBus
+
+  Queue->>Runner: run(job, signal)
+  loop for each phase
+    Runner->>DB: saveProgress(on phase start)
+    Runner->>Handler: handler(job, context)
+    Handler->>Runner: ctx.progress(...)
+    Runner->>DB: saveProgress(...)
+    Runner->>Events: job:progress
+    Handler-->>Runner: phase result
+    Runner->>DB: savePhaseCompletion(...)
+    Runner->>Events: job:phase:completed
+  end
+  Queue->>DB: completeJob(...)
+  Queue->>Events: job:completed
+```
+
+## 5. Progress semantics
+
+Progress exists at two levels:
+
+- **phase progress** - what the current handler reports
+- **overall progress** - computed from phase index + phase progress
+
+Example for three phases:
+
+| Phase | Reported phase progress | Computed overall progress |
+| --- | --- | --- |
+| `download` | 50 | 17 |
+| `process` | 25 | 42 |
+| `upload` | 80 | 93 |
+
+`ctx.progress()` persists that state immediately, then emits `job:progress`.
+
+## 6. Result passing between phases
+
+Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:
+
+- `ctx.phaseResult(phaseName)`
+- `ctx.phaseResults()`
+
+This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".
+
+## 7. Retry path
+
+When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.
+
+```mermaid
+sequenceDiagram
+  participant Handler as Phase handler
+  participant Queue as JobQueue
+  participant Retry as RetryStrategy
+  participant DB as SqliteStorage
+  participant Events as TypedEventBus
+
+  Handler-->>Queue: throws error
+  Queue->>Retry: shouldRetry(error, currentJob)
+  alt recoverable and attempts remain
+    Queue->>DB: scheduleRetry(...)
+    Queue->>Events: job:retrying
+    Queue->>Queue: requestPump()
+  else fatal or exhausted
+    Queue->>DB: failJob(...)
+    Queue->>Events: job:failed
+  end
+```
+
+### Retry details
+
+- `maxAttempts` includes the initial attempt
+- default disposition is `fatal`
+- backoff can be `fixed`, `linear`, or `exponential`
+- recoverable retries keep the job in `pending` with a future `scheduled_at`
+
+## 8. Cancellation path
+
+Cancellation is cooperative:
+
+1. `queue.cancel(id)` aborts the job controller if one exists
+2. unfinished phases are persisted as `cancelled`
+3. job status becomes `cancelled`
+4. `job:cancelled` is emitted
+5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled
+
+That last rule matters. Without it, a cancel request arriving between phase transitions could emit duplicate `job:cancelled` events. The current implementation now guards against that.
+
+### Pending cancellation
+
+If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.
+
+## 9. SSE stream lifecycle
+
+`createEventStream()` creates a stream over queue events.
+
+### Stream startup
+
+1. optional snapshot is written first
+2. event listeners are attached
+3. periodic `ping` keepalive starts
+
+### Stream shutdown
+
+- cancelling the reader removes all attached listeners
+- keepalive timer is cleared
+- no queue state is modified
+
+## 10. Webhook lifecycle
+
+Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.
+
+### Completion + webhook ordering
+
+For a successful job:
+
+1. `completeJob()` persists `status = completed`
+2. `job:completed` is emitted
+3. webhook dispatch is scheduled
+4. successful delivery marks `webhook_sent = 1`
+5. `job:webhook:delivered` is emitted
+
+This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.
+
+### Shutdown interaction
+
+The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed webhook still needs to update `webhook_sent` or emit delivery/failure events.
+
+## 11. Retention lifecycle
+
+Retention runs independently from job execution:
+
+1. compute stale cutoff and delete cutoff
+2. mark eligible terminal jobs as `stale`
+3. run optional `onStale(job)` callback
+4. emit `job:stale`
+5. delete stale jobs past delete cutoff
+6. run optional `onDelete(job)` callback
+7. emit `job:deleted`
+
+## 12. Shutdown lifecycle
+
+Shutdown now has two responsibilities:
+
+1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
+2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage
+
+### Current behavior
+
+```mermaid
+flowchart TD
+  A[shutdown()] --> B[closed = true]
+  B --> C[stop retention + clear wakeup timer]
+  C --> D{workers drained in time?}
+  D -- yes --> E[drain pending webhooks]
+  D -- no --> F[abort active controllers]
+  F --> G[best-effort second drain]
+  G --> E
+  E --> H[remove listeners]
+  H --> I[close SQLite]
+  I --> J{timeout happened?}
+  J -- no --> K[resolve]
+  J -- yes --> L[rethrow timeout error after cleanup]
+```
+
+### Important nuance
+
+If a handler ignores `AbortSignal`, shutdown can still time out. The queue now guarantees cleanup still runs, but graceful completion still depends on handler cooperation.