Runtime lifecycle
This document follows one queue instance from construction through shutdown.
1. Construction
When new JobQueue(config) runs, the constructor does more than store config:
- normalizes config defaults
- opens SQLite and enables WAL mode
- creates retry strategy
- creates worker pool
- optionally creates webhook dispatcher
- resets any previously active jobs to failed
- optionally starts retention scheduler
- requests an initial pump
Why resetActiveJobs() exists
jobqueue is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in active.
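The boot-time reset can be sketched as follows. This is a minimal illustration, assuming an in-memory array in place of the SQLite table; the JobRow shape and the error message are hypothetical and not taken from the jobqueue source.

```typescript
// Hypothetical in-memory stand-in for the jobs table; the real storage is SQLite.
type JobRowStatus = "pending" | "active" | "completed" | "failed" | "cancelled";

interface JobRow {
  id: string;
  status: JobRowStatus;
  error?: string;
}

// Mark every job left in "active" by a previous process as "failed".
// Returns the ids that were reset so the caller can log them.
function resetActiveJobs(rows: JobRow[]): string[] {
  const reset: string[] = [];
  for (const row of rows) {
    if (row.status === "active") {
      row.status = "failed";
      row.error = "process exited while job was active"; // assumed message
      reset.push(row.id);
    }
  }
  return reset;
}

const bootRows: JobRow[] = [
  { id: "a", status: "active" },
  { id: "b", status: "pending" },
  { id: "c", status: "completed" },
];
const resetIds = resetActiveJobs(bootRows);
```

Only the orphaned job changes state; pending and terminal jobs are left untouched.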
2. Enqueue path
sequenceDiagram
participant App as Consumer app
participant Queue as JobQueue
participant DB as SqliteStorage
participant Pump as Pump loop
App->>Queue: enqueue(data, options)
Queue->>DB: createJob(...)
DB-->>Queue: JobRecord(status=pending)
Queue-->>App: jobId
Queue->>Queue: emit job:enqueued
Queue->>Pump: requestPump()
Key points:
- enqueue is durable first, asynchronous execution second
- a job can be scheduled for the future with scheduledAt
- a per-job webhook URL can override queue-level webhook URL
3. Pumping and dispatch
The queue uses a pump loop, not a constantly-blocking worker thread.
Pump rules
- stop immediately if queue is closed
- if another pump is already running, request a repump and return
- while worker pool has capacity:
  - read runnable pending jobs whose scheduled_at <= now
  - try to claim each job atomically
  - emit job:started
  - hand claimed job to WorkerPool
- schedule a wake-up for the next delayed job
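The pump rules above can be sketched as a small class. This is a simplified, synchronous illustration under stated assumptions: PumpSketch, its fields, and the in-memory pending list are hypothetical, and removing a job from the list stands in for the atomic claim in SQLite.

```typescript
interface PendingJobSketch {
  id: string;
  scheduledAt: number; // epoch ms
}

class PumpSketch {
  closed = false;
  started: string[] = [];
  nextWakeAt: number | null = null;
  private pending: PendingJobSketch[];
  private capacity: number;
  private activeCount = 0;
  private pumping = false;
  private repumpRequested = false;

  constructor(pending: PendingJobSketch[], capacity: number) {
    this.pending = pending;
    this.capacity = capacity;
  }

  requestPump(now: number): void {
    if (this.closed) return; // stop immediately if the queue is closed
    if (this.pumping) {
      this.repumpRequested = true; // another pump is running: ask for a repump
      return;
    }
    this.pumping = true;
    try {
      do {
        this.repumpRequested = false;
        this.pumpOnce(now);
      } while (this.repumpRequested);
    } finally {
      this.pumping = false;
    }
  }

  private pumpOnce(now: number): void {
    while (this.activeCount < this.capacity) {
      const job = this.pending.find((j) => j.scheduledAt <= now);
      if (!job) break;
      this.pending = this.pending.filter((j) => j !== job); // stands in for the claim
      this.activeCount += 1;
      this.started.push(job.id); // where job:started would be emitted
    }
    // Remember when the earliest delayed job becomes runnable.
    const delayed = this.pending.filter((j) => j.scheduledAt > now).map((j) => j.scheduledAt);
    this.nextWakeAt = delayed.length > 0 ? Math.min(...delayed) : null;
  }
}

const pump = new PumpSketch(
  [
    { id: "a", scheduledAt: 0 },
    { id: "b", scheduledAt: 0 },
    { id: "c", scheduledAt: 500 },
  ],
  1,
);
pump.requestPump(100);
```

With a capacity of one, only the first runnable job starts, and the wake-up time points at the delayed job.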
Why claimPendingJob() matters
The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
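A minimal sketch of that claim step, assuming an in-memory map in place of SQLite. The table and column names in the SQL comment are assumptions for illustration, not the actual schema.

```typescript
type ClaimStatus = "pending" | "active";

// Hypothetical in-memory table keyed by job id.
const claimTable = new Map<string, ClaimStatus>([["job-1", "pending"]]);

// The claim succeeds only if the row is still pending at the moment of the transition.
// In SQL this is commonly written as one guarded statement, for example:
//   UPDATE jobs SET status = 'active' WHERE id = ? AND status = 'pending'
// where the claim counts as successful only if exactly one row changed.
function claimPendingJob(table: Map<string, ClaimStatus>, id: string): boolean {
  if (table.get(id) !== "pending") return false;
  table.set(id, "active");
  return true;
}

const firstClaim = claimPendingJob(claimTable, "job-1");
const secondClaim = claimPendingJob(claimTable, "job-1");
```

The second attempt fails because the row is no longer pending, which is the property that keeps a job from starting twice.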
4. Job execution
Each claimed job gets its own AbortController. PhaseRunner then executes configured phases in order.
sequenceDiagram
participant Queue as JobQueue
participant Runner as PhaseRunner
participant Handler as Phase handler
participant DB as SqliteStorage
participant Events as TypedEventBus
Queue->>Runner: run(job, signal)
loop for each phase
Runner->>DB: saveProgress(on phase start)
Runner->>Handler: handler(job, context)
Handler->>Runner: ctx.progress(...)
Runner->>DB: saveProgress(...)
Runner->>Events: job:progress
Handler-->>Runner: phase result
Runner->>DB: savePhaseCompletion(...)
Runner->>Events: job:phase:completed
end
Queue->>DB: completeJob(...)
Queue->>Events: job:completed
5. Progress semantics
Progress exists at two levels:
- phase progress - what the current handler reports
- overall progress - computed from phase index + phase progress
Example for three phases:
| Phase | Reported phase progress | Computed overall progress |
|---|---|---|
| download | 50 | 17 |
| process | 25 | 42 |
| upload | 80 | 93 |
ctx.progress() persists that state immediately, then emits job:progress.
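The table values are consistent with giving each phase an equal share of the overall range and rounding to the nearest integer. The function below is a sketch of that assumed formula; overallProgress is a hypothetical name, not part of the documented API.

```typescript
// Assumed formula: each phase owns an equal slice of the 0-100 range.
function overallProgress(phaseIndex: number, phaseProgress: number, phaseCount: number): number {
  const clamped = Math.min(100, Math.max(0, phaseProgress));
  return Math.round(((phaseIndex + clamped / 100) / phaseCount) * 100);
}

const downloadOverall = overallProgress(0, 50, 3); // 17
const processOverall = overallProgress(1, 25, 3); // 42
const uploadOverall = overallProgress(2, 80, 3); // 93
```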
6. Result passing between phases
Each handler can return JSON-serializable data. PhaseRunner stores that in phaseResults and exposes it to later handlers via:
- ctx.phaseResult(phaseName)
- ctx.phaseResults()
This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".
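A sketch of how results might flow from one phase to the next. Only phaseResult and phaseResults are named in this document; the context shape, the runPhases helper, and the synchronous handlers are simplifying assumptions.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Hypothetical context shape exposing earlier results to later handlers.
interface PhaseContextSketch {
  phaseResult(name: string): Json | undefined;
  phaseResults(): Record<string, Json>;
}

type PhaseHandlerSketch = (ctx: PhaseContextSketch) => Json;

// Run handlers in order, storing each return value under its phase name.
function runPhases(phases: Array<[string, PhaseHandlerSketch]>): Record<string, Json> {
  const results: Record<string, Json> = {};
  const ctx: PhaseContextSketch = {
    phaseResult: (name) => results[name],
    phaseResults: () => ({ ...results }), // copy so handlers cannot mutate stored results
  };
  for (const [name, handler] of phases) {
    results[name] = handler(ctx);
  }
  return results;
}

const pipelineResults = runPhases([
  ["download", () => ({ path: "/tmp/input.bin", bytes: 1024 })],
  [
    "process",
    (ctx) => {
      const download = ctx.phaseResult("download") as { path: string; bytes: number };
      return { chunks: download.bytes / 256 };
    },
  ],
]);
```

The process phase reads the byte count produced by download without either handler knowing how results are stored.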
7. Retry path
When a handler throws, JobQueue.handleFailure() decides between retry and terminal failure.
sequenceDiagram
participant Handler as Phase handler
participant Queue as JobQueue
participant Retry as RetryStrategy
participant DB as SqliteStorage
participant Events as TypedEventBus
Handler-->>Queue: throws error
Queue->>Retry: shouldRetry(error, currentJob)
alt recoverable and attempts remain
Queue->>DB: scheduleRetry(...)
Queue->>Events: job:retrying
Queue->>Queue: requestPump()
else fatal or exhausted
Queue->>DB: failJob(...)
Queue->>Events: job:failed
end
Retry details
- maxAttempts includes the initial attempt
- default disposition is fatal
- backoff can be fixed, linear, or exponential
- recoverable retries keep the job in pending with a future scheduled_at
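The retry details can be illustrated with a small sketch. The delay formulas, the baseDelayMs option, and the canRetry helper are assumptions chosen for illustration; the actual RetryStrategy may compute delays differently.

```typescript
type BackoffKind = "fixed" | "linear" | "exponential";

interface RetryPolicySketch {
  maxAttempts: number; // includes the initial attempt
  backoff: BackoffKind;
  baseDelayMs: number; // assumed option name
}

// attempt is 1-based: attempt 1 is the initial run that just failed.
function retryDelayMs(policy: RetryPolicySketch, attempt: number): number {
  switch (policy.backoff) {
    case "fixed":
      return policy.baseDelayMs;
    case "linear":
      return policy.baseDelayMs * attempt;
    case "exponential":
      return policy.baseDelayMs * Math.pow(2, attempt - 1);
  }
}

// Default disposition is fatal: only errors explicitly marked recoverable are retried,
// and only while attempts remain.
function canRetry(policy: RetryPolicySketch, attempt: number, recoverable: boolean): boolean {
  return recoverable && attempt < policy.maxAttempts;
}

const retryPolicy: RetryPolicySketch = { maxAttempts: 3, backoff: "exponential", baseDelayMs: 1000 };
const delayAfterSecondFailure = retryDelayMs(retryPolicy, 2);
const retryAfterThirdFailure = canRetry(retryPolicy, 3, true);
```

Because maxAttempts counts the initial run, a policy of three attempts allows at most two retries.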
8. Cancellation path
Cancellation is cooperative:
- queue.cancel(id) aborts the job controller if one exists
- unfinished phases are persisted as cancelled
- job status becomes cancelled
- job:cancelled is emitted
- later phase-runner cancellation callbacks become no-ops if the job is already cancelled
That last rule matters. Without it, a cancel request arriving between phase transitions could emit duplicate job:cancelled events. The current implementation guards against that.
Pending cancellation
If a job is cancelled before it starts, all unfinished phases are also marked cancelled. That keeps the persisted phase graph aligned with the top-level job status.
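The duplicate-event guard can be sketched as an idempotent cancel. CancelGuardSketch and its fields are hypothetical, and aborting the job controller is omitted so the example focuses on state and event bookkeeping.

```typescript
type CancelState = "pending" | "active" | "completed" | "cancelled";

class CancelGuardSketch {
  status: CancelState = "active";
  phases: Record<string, CancelState> = {
    download: "completed",
    process: "active",
    upload: "pending",
  };
  emitted: string[] = [];

  // Returns true only for the call that actually performed the cancellation.
  cancel(): boolean {
    if (this.status === "cancelled" || this.status === "completed") return false; // no-op
    for (const name of Object.keys(this.phases)) {
      // Finished phases keep their result; everything else becomes cancelled.
      if (this.phases[name] !== "completed") this.phases[name] = "cancelled";
    }
    this.status = "cancelled";
    this.emitted.push("job:cancelled");
    return true;
  }
}

const guarded = new CancelGuardSketch();
const firstCancel = guarded.cancel(); // for example, triggered by queue.cancel(id)
const lateCancel = guarded.cancel(); // for example, a late phase-runner callback
```

The second call changes nothing and emits nothing, so listeners see job:cancelled exactly once.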
9. SSE stream lifecycle
createEventStream() creates a stream over queue events.
Stream startup
- optional snapshot is written first
- event listeners are attached
- periodic ping keepalive starts
Stream shutdown
- cancelling the reader removes all attached listeners
- keepalive timer is cleared
- no queue state is modified
10. Webhook lifecycle
Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.
Completion + webhook ordering
For a successful job:
- completeJob() persists status = completed
- job:completed is emitted
- webhook dispatch is scheduled
- successful delivery marks webhook_sent = 1
- job:webhook:delivered is emitted
This means webhook state becomes visible in a later event, not inside the original job:completed event.
Shutdown interaction
The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed webhook still needs to update webhook_sent or emit delivery/failure events.
11. Retention lifecycle
Retention runs independently from job execution:
- compute stale cutoff and delete cutoff
- mark eligible terminal jobs as stale
- run optional onStale(job) callback
- emit job:stale
- delete stale jobs past delete cutoff
- run optional onDelete(job) callback
- emit job:deleted
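One retention pass can be sketched as a pure planning function. The option names staleAfterMs and deleteAfterMs, the finishedAt field, and the choice to measure age from finish time are assumptions made for illustration; callbacks and events are left out.

```typescript
interface RetentionJobSketch {
  id: string;
  status: "pending" | "active" | "completed" | "failed" | "cancelled" | "stale";
  finishedAt: number | null; // epoch ms; assumed field
}

interface RetentionPlan {
  markStale: string[];
  remove: string[];
}

function planRetention(
  jobs: RetentionJobSketch[],
  now: number,
  staleAfterMs: number,
  deleteAfterMs: number,
): RetentionPlan {
  const staleCutoff = now - staleAfterMs;
  const deleteCutoff = now - deleteAfterMs;
  const terminal = new Set<string>(["completed", "failed", "cancelled"]);

  // Step 1: terminal jobs that finished at or before the stale cutoff become stale.
  const markStale = jobs
    .filter((j) => terminal.has(j.status) && j.finishedAt !== null && j.finishedAt <= staleCutoff)
    .map((j) => j.id);

  // Step 2: stale jobs (including those just marked) past the delete cutoff are removed.
  const staleIds = new Set(markStale);
  const remove = jobs
    .filter(
      (j) =>
        (j.status === "stale" || staleIds.has(j.id)) &&
        j.finishedAt !== null &&
        j.finishedAt <= deleteCutoff,
    )
    .map((j) => j.id);

  return { markStale, remove };
}

const retentionPlan = planRetention(
  [
    { id: "a", status: "completed", finishedAt: 950 },
    { id: "b", status: "failed", finishedAt: 800 },
    { id: "c", status: "stale", finishedAt: 400 },
    { id: "d", status: "active", finishedAt: null },
  ],
  1000,
  100,
  500,
);
```

Active and recently finished jobs are never touched, which matches retention running independently from job execution.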
12. Shutdown lifecycle
Shutdown now has two responsibilities:
- stop new work - mark queue closed, stop retention, clear wake-up timer
- tear down safely - wait for workers, wait for webhooks, remove listeners, close storage
Current behavior
flowchart TD
A[shutdown()] --> B[closed = true]
B --> C[stop retention + clear wakeup timer]
C --> D{workers drained in time?}
D -- yes --> E[drain pending webhooks]
D -- no --> F[abort active controllers]
F --> G[best-effort second drain]
G --> E
E --> H[remove listeners]
H --> I[close SQLite]
I --> J{timeout happened?}
J -- no --> K[resolve]
J -- yes --> L[rethrow timeout error after cleanup]
Important nuance
If a handler ignores AbortSignal, shutdown can still time out. The queue guarantees that cleanup runs regardless, but graceful completion depends on handler cooperation.