fix: harden queue lifecycle and publish gate
- Preserve phase results on partial retry and keep interrupted phase context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
39
docs/README.md
Normal file
@@ -0,0 +1,39 @@
# jobqueue docs

Detailed architecture and runtime docs for `jobqueue`.

## Doc map

| File | Purpose |
| --- | --- |
| [`architecture.md`](./architecture.md) | Static architecture, module boundaries, data model, event model, state model |
| [`runtime-lifecycle.md`](./runtime-lifecycle.md) | Step-by-step runtime behavior from startup through shutdown |
| [`integration-findings.md`](./integration-findings.md) | Multi-agent scan results, verified bugs, fixes, and remaining behavioral notes |
| [`jobqueue.c4`](./jobqueue.c4) | LikeC4 source for landscape, container, component, and runtime views |

## Mental model

`jobqueue` is a **single-process orchestrator** around a SQLite-backed job table.

1. Consumer code creates a `JobQueue`.
2. Jobs are persisted immediately in SQLite.
3. A pump claims runnable jobs and hands them to a concurrency-limited worker pool.
4. `PhaseRunner` executes configured phases and reports progress back through `JobQueue`.
5. `JobQueue` persists each state transition, emits typed events, formats SSE payloads, and optionally sends webhooks.
6. A retention scheduler can mark old jobs as `stale` and later delete them.

## Rendering LikeC4 views

The repository stores LikeC4 source in [`jobqueue.c4`](./jobqueue.c4).

```bash
npx likec4 start docs/jobqueue.c4
```

Recommended views in the file:

- `index` - system landscape
- `library` - container view of `jobqueue`
- `runtime` - internal runtime/component view
- `enqueue-to-complete` - dynamic happy path
- `retry-flow` - dynamic retry path
241
docs/architecture.md
Normal file
@@ -0,0 +1,241 @@
# Architecture

This document explains how `jobqueue` is structured, what each module owns, and how data moves through the system.

## 1. Top-level structure

`JobQueue` is the orchestrator. Everything else is a collaborator around four concerns:

1. **Persistence** - `SqliteStorage`
2. **Execution** - `WorkerPool` + `PhaseRunner`
3. **Notifications** - `TypedEventBus`, `SseSerializer`, `WebhookDispatcher`
4. **Lifecycle management** - `RetryStrategy`, `RetentionScheduler`, shutdown logic

```mermaid
flowchart LR
    A[Consumer app] -->|enqueue / retry / cancel / list| B[JobQueue]
    B --> C[SqliteStorage]
    B --> D[WorkerPool]
    D --> E[PhaseRunner]
    E --> F[Phase handlers registered by consumer]
    B --> G[TypedEventBus]
    G --> H[SSE stream subscribers]
    B --> I[WebhookDispatcher]
    I --> J[Webhook endpoints]
    B --> K[RetentionScheduler]
    K --> C
```

## 2. Module responsibilities

| Module | Responsibility | Key behavior |
| --- | --- | --- |
| `src/JobQueue.ts` | Public API and orchestration | Owns startup, enqueue, cancel, retry, pumping, event emission, webhook dispatch, shutdown |
| `src/storage/SqliteStorage.ts` | SQLite persistence | Creates schema, claims jobs, persists progress, completion, failure, retry, cancellation, stale/delete |
| `src/processor/WorkerPool.ts` | Concurrency limit | Wraps `p-limit`, tracks running promises, supports drain with timeout |
| `src/processor/PhaseRunner.ts` | Multi-phase execution | Runs handlers in order, computes overall progress, stops on cancellation |
| `src/retry/RetryStrategy.ts` | Retry policy | Classifies errors, computes backoff, decides retry vs fail |
| `src/events/EventBus.ts` | In-process pub/sub | Strongly typed wrapper around Node `EventEmitter` |
| `src/events/SseSerializer.ts` | SSE formatting | Serializes event name + JSON payload into SSE wire format |
| `src/webhook/WebhookDispatcher.ts` | Outbound HTTP callbacks | Sends POST requests, signs payloads, retries transient failures |
| `src/retention/RetentionScheduler.ts` | Background cleanup | Periodically marks old jobs stale and later deletes them |

## 3. Public API surface

Exports from `src/index.ts` expose both high-level and low-level building blocks:

- `JobQueue`
- `SqliteStorage`
- `WorkerPool`
- `PhaseRunner`
- `RetryStrategy`
- `TypedEventBus`
- `SseSerializer`
- `WebhookDispatcher`
- `RetentionScheduler`
- shared queue/job/event types

That split makes the package usable in two modes:

1. **Normal mode** - instantiate `JobQueue` and let it coordinate everything
2. **Advanced mode** - reuse lower-level pieces independently in custom orchestration

## 4. Persistence model

`SqliteStorage` keeps a single `jobs` table. The queue is effectively modeled as persisted state transitions on that table.

### Important columns

| Column | Meaning |
| --- | --- |
| `id` | Stable job identifier |
| `status` | `pending`, `active`, `completed`, `failed`, `cancelled`, `stale` |
| `data` | Original enqueue payload, JSON-encoded |
| `current_phase` | Phase currently executing or last failed/retried phase |
| `phases_json` | Array of per-phase state objects |
| `phase_results` | JSON object keyed by phase name |
| `progress` / `progress_message` | Latest overall progress snapshot |
| `error_json` | Persisted failure metadata |
| `retry_count` / `max_attempts` | Retry bookkeeping |
| `webhook_url` / `webhook_sent` | Delivery configuration and latest success flag |
| `scheduled_at` | Delayed execution / retry wake-up time |
| `completed_at` / `cancelled_at` / `updated_at` | Lifecycle timestamps |

### Why SQLite works well here

- queue selection is simple and local
- state transitions are small, synchronous writes
- WAL mode supports concurrent reads while jobs are executing
- no separate broker is required for a single-process runtime

## 5. Phase model

Each job stores an array of `JobPhaseState` entries:

| Field | Meaning |
| --- | --- |
| `name` | Phase identifier from `QueueConfig.phases` |
| `status` | `pending`, `active`, `completed`, `failed`, `cancelled` |
| `progress` | Per-phase progress percentage |
| `message` | Human-oriented phase status |
| `startedAt` / `completedAt` | Phase timestamps |
| `error` | Last phase-level error string |

`PhaseRunner` walks those phases sequentially and computes overall progress as:

```text
((phaseIndex + phaseProgress / 100) / totalPhases) * 100
```

That design gives one stable persisted representation for:

- single-step jobs (`phases: ['run']`)
- multi-step pipelines (`['download', 'process', 'upload']`)
- retries that restart only unfinished phases

## 6. Event model

`JobQueue` emits in-process typed events first. The SSE stream and webhook flow are adapters on top of that state machine.

### Core queue events

- `job:enqueued`
- `job:started`
- `job:progress`
- `job:phase:completed`
- `job:completed`
- `job:failed`
- `job:retrying`
- `job:cancelled`
- `job:stale`
- `job:deleted`
- `job:webhook:delivered`
- `job:webhook:failed`

### Event ordering rule

The queue persists state before emitting the corresponding event. That means listeners observe already-persisted state, not speculative state.

This is important for consumers that mix:

- `queue.on(...)`
- `queue.getJob(id)`
- `queue.listJobs(...)`
- `queue.createEventStream(...)`
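
The persist-then-emit rule can be illustrated with a small sketch. This is not the library's implementation: `MiniBus`, `MiniQueue`, the in-memory `Map` standing in for SQLite, and the payload shape are all assumptions made for the example.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical event map; the real payload types may differ.
type QueueEvents = {
  "job:completed": { jobId: string };
};

// Thin typed wrapper around Node's EventEmitter.
class MiniBus {
  private readonly emitter = new EventEmitter();

  on<K extends keyof QueueEvents>(name: K, listener: (payload: QueueEvents[K]) => void): void {
    this.emitter.on(name, listener);
  }

  emit<K extends keyof QueueEvents>(name: K, payload: QueueEvents[K]): void {
    this.emitter.emit(name, payload);
  }
}

class MiniQueue {
  // In-memory stand-in for the SQLite jobs table (assumption).
  private readonly rows = new Map<string, { status: string }>();
  readonly bus = new MiniBus();

  enqueue(jobId: string): void {
    this.rows.set(jobId, { status: "pending" });
  }

  getJob(jobId: string): { status: string } | undefined {
    return this.rows.get(jobId);
  }

  complete(jobId: string): void {
    // 1. Persist the transition first...
    this.rows.set(jobId, { status: "completed" });
    // 2. ...then emit, so a listener that re-reads storage sees the new state.
    this.bus.emit("job:completed", { jobId });
  }
}

const queue = new MiniQueue();
const observed: string[] = [];
queue.bus.on("job:completed", ({ jobId }) => {
  observed.push(queue.getJob(jobId)?.status ?? "missing");
});
queue.enqueue("job-1");
queue.complete("job-1");
```

Because the write happens before `emit`, the listener's `getJob` call reads the completed row rather than a stale `pending` one.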

## 7. SSE model

`createEventStream()` creates a web `ReadableStream<Uint8Array>`.

Behavior:

1. Optional snapshot of current jobs is sent first
2. Queue subscribes the stream to in-process events
3. Each event is serialized as `event: <name>` + JSON `data: ...`
4. Keepalive `ping` events are emitted on an interval
5. Cancelling the reader removes subscriptions and the keepalive timer
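
Step 3 can be sketched as a small serializer. The function name `serializeSseEvent` and the sample payload are hypothetical; only the `event:` / `data:` framing and the `Uint8Array` output come from the description above.

```typescript
// Serialize one event into SSE wire format: "event: <name>\ndata: <json>\n\n".
// JSON.stringify without an indent argument yields a single line, so one data: field is enough.
function serializeSseEvent(name: string, payload: unknown): Uint8Array {
  const frame = `event: ${name}\ndata: ${JSON.stringify(payload)}\n\n`;
  return new TextEncoder().encode(frame);
}

// Hypothetical payload shape, for illustration only.
const encoded = serializeSseEvent("job:progress", { jobId: "job-1", progress: 42 });
const decoded = new TextDecoder().decode(encoded);
```

The trailing blank line is what terminates an SSE event, so a client dispatches it as soon as the chunk arrives.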

### Payload shapes

Most runtime events include a full `job` object. Two notable exceptions:

| Event | Payload detail |
| --- | --- |
| `job:deleted` | Includes `deletedJobId` because the record was removed from storage |
| `ping` | No job payload, only a timestamp |

## 8. Webhook model

Webhooks are outbound notifications, not part of the core execution loop.

### Flow

1. `JobQueue` decides whether an event should trigger a webhook
2. `WebhookDispatcher` POSTs JSON to queue-level or job-level URL
3. Optional HMAC SHA-256 signature is attached as `X-JobQueue-Signature`
4. 5xx and transport failures retry with exponential backoff
5. Success sets `webhook_sent = 1` and emits `job:webhook:delivered`
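
Steps 3 and 4 might look roughly like the sketch below. The header name comes from the flow above; the hex encoding, the helper names (`signPayload`, `verifySignature`, `isRetryableStatus`), and the payload fields are assumptions, so check the dispatcher source before relying on any of them.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign the exact request body with HMAC SHA-256. Hex encoding is an assumption.
function signPayload(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time.
function verifySignature(body: string, secret: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(body, secret), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Only 5xx responses are retried; transport errors would be handled by the caller.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Hypothetical payload, for illustration only.
const body = JSON.stringify({ event: "job:completed", jobId: "job-1" });
const headers = { "X-JobQueue-Signature": signPayload(body, "s3cret") };
```

A receiver should verify against the raw request bytes, since re-serializing parsed JSON can change key order or whitespace and invalidate the signature.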

### Scope

Supported webhook-triggering events:

- `job:completed`
- `job:failed`
- `job:retrying`
- `job:cancelled`
- `job:stale`

## 9. Retention model

Retention is deliberately two-stage:

```mermaid
flowchart LR
    A[completed / failed / cancelled] -->|older than staleAfterMs| B[stale]
    B -->|older than deleteAfterMs| C[deleted]
```

Why two stages:

- consumers get a visible grace period before deletion
- `onStale` and `onDelete` hooks can clean external artifacts
- `job:stale` is externally observable before hard deletion

## 10. State machine

```mermaid
stateDiagram-v2
    [*] --> pending : enqueue
    pending --> active : claimPendingJob
    pending --> cancelled : cancel
    active --> completed : all phases succeed
    active --> failed : fatal error / retries exhausted
    active --> pending : recoverable error + retry
    active --> cancelled : cancel / abort
    completed --> stale : retention mark
    failed --> stale : retention mark
    cancelled --> stale : retention mark
    stale --> [*] : retention delete
    failed --> pending : manual retry
    cancelled --> pending : manual retry
    stale --> pending : manual retry
```
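
The diagram's transitions can also be written as a lookup table, which is convenient for assertions in tests. This is an illustrative sketch: `transitions` and `canTransition` are hypothetical names, and deletion of a stale row is left out because it removes the record rather than changing its status.

```typescript
type JobStatus = "pending" | "active" | "completed" | "failed" | "cancelled" | "stale";

// Allowed status transitions, copied from the state diagram.
const transitions: Record<JobStatus, readonly JobStatus[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed", "pending", "cancelled"],
  completed: ["stale"],
  failed: ["stale", "pending"],
  cancelled: ["stale", "pending"],
  stale: ["pending"],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return transitions[from].includes(to);
}
```

For example, `canTransition("stale", "pending")` is true (manual retry), while `canTransition("completed", "pending")` is false because the diagram has no such edge.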

## 11. Build and packaging

- ESM only
- Node 20 target
- bundled with `tsup`
- type declarations emitted from the same entrypoint
- tests run under Vitest in Node environment

## 12. LikeC4 companion charts

See [`jobqueue.c4`](./jobqueue.c4) for:

- system landscape
- library/container view
- runtime/component view
- enqueue-to-complete dynamic flow
- retry dynamic flow
141
docs/integration-findings.md
Normal file
@@ -0,0 +1,141 @@
# Integration findings

This document records what the multi-agent scan found, what was verified directly in source, and what changed.

## 1. Scan method

The repository was scanned through three independent passes:

1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
2. **storage/events scan** - persistence, SSE, webhook, retention interaction
3. **code review scan** - cross-component defects only

Those results were then checked against source before documenting or patching anything.

## 2. Confirmed and fixed issues

### A. Duplicate `job:cancelled` event at phase boundary

**Observed risk**

If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:

- direct `cancel(id)` call
- `PhaseRunner` cancellation callback on next loop iteration

**Why it mattered**

- duplicate event bus notifications
- duplicate SSE `job:cancelled` events
- duplicate `job:cancelled` webhooks

**Fix**

`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.

### B. Pending-job cancellation left phases in `pending`

**Observed risk**

Cancelling a job before it started produced:

- job status: `cancelled`
- phase states: still `pending`

**Why it mattered**

- persisted lifecycle shape was contradictory
- dashboards and tooling reading phases could not trust phase status

**Fix**

`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.

### C. Shutdown could close before in-flight webhook bookkeeping finished

**Observed risk**

Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.

**Why it mattered**

- `webhook_sent` might not be written
- `job:webhook:delivered` / `job:webhook:failed` could be lost
- delivery bookkeeping could throw against a closed database

**Fix**

`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.

### D. Shutdown timeout skipped cleanup

**Observed risk**

If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:

- removing listeners
- closing storage

**Why it mattered**

- leaked resources
- left queue internals half-open after failed shutdown

**Fix**

Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.

## 3. Behavioral notes kept as documentation, not code changes

These are real integration characteristics, but not all are bugs.

### `job:completed` precedes `job:webhook:delivered`

This is expected ordering:

1. job completion is persisted
2. `job:completed` emits
3. webhook dispatch happens
4. `job:webhook:delivered` emits on success

So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.

### `job:deleted` does not contain full job payload

This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.

### Webhooks are best-effort, not durable outbox delivery

The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.

## 4. Regression coverage added

New tests now cover:

- cancelling a pending job marks unfinished phases cancelled
- cancelling on a phase boundary emits one cancellation event
- shutdown waits for in-flight webhooks
- shutdown cleanup still happens when worker drain times out

## 5. Remaining risk areas

No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:

1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger
2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions

## 6. Second-scan fixes and coverage expansion

The deeper follow-up scan confirmed three more issues that were patched:

1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
3. **Process restart recovery** dropped interrupted phase context in failure metadata.
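
The partial-retry fix (item 2) amounts to resetting only unfinished phases while keeping results for completed ones. The sketch below shows that idea on plain objects; `resetForRetry` and the `RetryableJob` shape are hypothetical and are not the storage layer's actual code.

```typescript
type PhaseStatus = "pending" | "active" | "completed" | "failed" | "cancelled";

interface PhaseState {
  name: string;
  status: PhaseStatus;
  progress: number;
}

// Hypothetical in-memory view of the phases_json and phase_results columns.
interface RetryableJob {
  phases: PhaseState[];
  phaseResults: Record<string, unknown>;
}

function resetForRetry(job: RetryableJob, fromStart: boolean): RetryableJob {
  if (fromStart) {
    // Full retry: every phase goes back to pending and all results are dropped.
    return {
      phases: job.phases.map((p): PhaseState => ({ name: p.name, status: "pending", progress: 0 })),
      phaseResults: {},
    };
  }
  // Partial retry: completed phases and their results survive the reset.
  const phases = job.phases.map((p): PhaseState =>
    p.status === "completed" ? p : { name: p.name, status: "pending", progress: 0 },
  );
  const phaseResults: Record<string, unknown> = {};
  for (const p of phases) {
    if (p.status === "completed" && p.name in job.phaseResults) {
      phaseResults[p.name] = job.phaseResults[p.name];
    }
  }
  return { phases, phaseResults };
}

const failedJob: RetryableJob = {
  phases: [
    { name: "download", status: "completed", progress: 100 },
    { name: "process", status: "failed", progress: 40 },
  ],
  phaseResults: { download: { bytes: 1024 } },
};
const partial = resetForRetry(failedJob, false);
const full = resetForRetry(failedJob, true);
```

Here `partial` keeps the `download` result so the retried `process` phase can still read it, while `full` starts with an empty result map.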

Coverage was expanded at three levels:

- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion
140
docs/jobqueue.c4
Normal file
@@ -0,0 +1,140 @@
specification {
  element actor {
    style {
      shape person
    }
  }

  element system {
    style {
      shape rectangle
    }
  }

  element container {
    style {
      shape rectangle
    }
  }

  element component {
    style {
      shape component
    }
  }

  element database {
    style {
      shape storage
    }
  }

  relationship async {
    color amber
    line dotted
  }
}

model {
  consumer = actor "Consumer application"
  webhookReceiver = system "Webhook receiver"

  jobqueue = system "jobqueue" {
    api = container "Public API" {
      technology "ESM / TypeScript"
      description "JobQueue constructor plus enqueue, retry, cancel, query, stream, shutdown APIs"
    }

    runtime = container "Runtime orchestrator" {
      technology "Node.js"
      description "Coordinates persistence, execution, retries, events, SSE, webhooks, and shutdown"

      queue = component "JobQueue"
      storage = component "SqliteStorage"
      pool = component "WorkerPool"
      runner = component "PhaseRunner"
      retry = component "RetryStrategy"
      events = component "TypedEventBus"
      sse = component "SseSerializer"
      retention = component "RetentionScheduler"
      webhooks = component "WebhookDispatcher"

      queue -> storage "persists job state"
      queue -> pool "dispatches runnable jobs"
      queue -> runner "executes phase pipeline"
      queue -> retry "classifies failures"
      queue -> events "emits typed queue events"
      queue -> sse "serializes SSE payloads"
      queue -> retention "runs stale/delete cycle"
      queue -[async]-> webhooks "dispatches outbound callbacks"
    }

    sqlite = database "SQLite jobs database" {
      technology "better-sqlite3 + WAL"
    }

    handlers = container "Registered phase handlers" {
      technology "Consumer-provided async functions"
    }

    streams = container "SSE subscribers" {
      technology "ReadableStream consumers"
    }

    api -> runtime.queue "constructs and invokes"
    runtime.storage -> sqlite "reads/writes rows"
    runtime.runner -> handlers "invokes phase handlers"
    runtime.events -> streams "pushes queue events"
  }

  consumer -> jobqueue.api "enqueue / retry / cancel / inspect / subscribe"
  jobqueue.runtime.webhooks -[async]-> webhookReceiver "POST job events"
}

views {
  view index {
    title "jobqueue landscape"
    include *
    autoLayout LeftRight
  }

  view library of jobqueue {
    title "jobqueue containers"
    include *
    autoLayout LeftRight
  }

  view runtime of jobqueue.runtime {
    title "jobqueue runtime components"
    include *
    autoLayout LeftRight
  }

  dynamic view enqueue-to-complete {
    title "Enqueue to successful completion"
    consumer -> jobqueue.api "enqueue()"
    jobqueue.api -> jobqueue.runtime.queue "create job"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending row"
    jobqueue.runtime.queue -> jobqueue.runtime.pool "schedule worker"
    jobqueue.runtime.pool -> jobqueue.runtime.runner "run phases"
    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler(s)"
    jobqueue.runtime.runner -> jobqueue.runtime.storage "persist progress + phase results"
    jobqueue.runtime.queue -> jobqueue.runtime.events "emit queue events"
    jobqueue.runtime.events -> jobqueue.streams "push SSE"
    jobqueue.runtime.queue -> jobqueue.runtime.webhooks "send completion webhook"
    jobqueue.runtime.webhooks -> webhookReceiver "POST payload"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "mark webhook_sent"
  }

  dynamic view retry-flow {
    title "Failure and retry flow"
    jobqueue.runtime.runner -> jobqueue.handlers "invoke handler"
    jobqueue.handlers -> jobqueue.runtime.queue "throw recoverable error"
    jobqueue.runtime.queue -> jobqueue.runtime.retry "classify error"
    jobqueue.runtime.retry -> jobqueue.runtime.queue "retry with delay"
    jobqueue.runtime.queue -> jobqueue.runtime.storage "persist pending retry"
    jobqueue.runtime.queue -> jobqueue.runtime.events "emit job:retrying"
    jobqueue.runtime.events -> jobqueue.streams "push SSE"
    jobqueue.runtime.queue -[async]-> jobqueue.runtime.webhooks "dispatch retry webhook"
  }
}
239
docs/runtime-lifecycle.md
Normal file
@@ -0,0 +1,239 @@
# Runtime lifecycle

This document follows one queue instance from construction through shutdown.

## 1. Construction

When `new JobQueue(config)` runs, the constructor does more than store config:

1. normalizes config defaults
2. opens SQLite and enables WAL mode
3. creates retry strategy
4. creates worker pool
5. optionally creates webhook dispatcher
6. resets any previously `active` jobs to `failed`
7. optionally starts retention scheduler
8. requests an initial pump

### Why `resetActiveJobs()` exists

`jobqueue` is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in `active`.

## 2. Enqueue path

```mermaid
sequenceDiagram
    participant App as Consumer app
    participant Queue as JobQueue
    participant DB as SqliteStorage
    participant Pump as Pump loop

    App->>Queue: enqueue(data, options)
    Queue->>DB: createJob(...)
    DB-->>Queue: JobRecord(status=pending)
    Queue-->>App: jobId
    Queue->>Queue: emit job:enqueued
    Queue->>Pump: requestPump()
```

Key points:

- enqueue is durable first, asynchronous execution second
- a job can be scheduled for the future with `scheduledAt`
- a per-job webhook URL can override queue-level webhook URL

## 3. Pumping and dispatch

The queue uses a **pump loop**, not a constantly-blocking worker thread.

### Pump rules

1. stop immediately if queue is closed
2. if another pump is already running, request a repump and return
3. while worker pool has capacity:
   - read runnable `pending` jobs whose `scheduled_at <= now`
   - try to claim each job atomically
   - emit `job:started`
   - hand claimed job to `WorkerPool`
4. schedule a wake-up for the next delayed job

### Why `claimPendingJob()` matters

The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
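
The list-then-claim pattern can be sketched with an in-memory stand-in for the table. `MiniStorage` is hypothetical, and the SQL in the comment is an assumption about how such a claim is usually written, not a quote from `SqliteStorage`.

```typescript
type Row = { id: string; status: "pending" | "active" };

class MiniStorage {
  private readonly rows = new Map<string, Row>();

  insert(id: string): void {
    this.rows.set(id, { id, status: "pending" });
  }

  // Returns copies, so a caller may be holding an outdated snapshot.
  listPending(): Row[] {
    return [...this.rows.values()].filter((r) => r.status === "pending").map((r) => ({ ...r }));
  }

  // Conditional transition. In SQL this would be roughly (assumed statement):
  //   UPDATE jobs SET status = 'active' WHERE id = ? AND status = 'pending'
  // and the claim succeeds only when exactly one row was changed.
  claimPendingJob(id: string): boolean {
    const row = this.rows.get(id);
    if (!row || row.status !== "pending") return false;
    row.status = "active";
    return true;
  }
}

const storage = new MiniStorage();
storage.insert("job-1");

// Two pump passes list the same candidate before either one claims it.
const firstPass = storage.listPending();
const secondPass = storage.listPending();
const claims = [
  storage.claimPendingJob(firstPass[0].id),
  storage.claimPendingJob(secondPass[0].id),
];
```

Both passes see `job-1` as a candidate, but only the first claim returns `true`, so the job is handed to the worker pool once.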

## 4. Job execution

Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.

```mermaid
sequenceDiagram
    participant Queue as JobQueue
    participant Runner as PhaseRunner
    participant Handler as Phase handler
    participant DB as SqliteStorage
    participant Events as TypedEventBus

    Queue->>Runner: run(job, signal)
    loop for each phase
        Runner->>DB: saveProgress(on phase start)
        Runner->>Handler: handler(job, context)
        Handler->>Runner: ctx.progress(...)
        Runner->>DB: saveProgress(...)
        Runner->>Events: job:progress
        Handler-->>Runner: phase result
        Runner->>DB: savePhaseCompletion(...)
        Runner->>Events: job:phase:completed
    end
    Queue->>DB: completeJob(...)
    Queue->>Events: job:completed
```

## 5. Progress semantics

Progress exists at two levels:

- **phase progress** - what the current handler reports
- **overall progress** - computed from phase index + phase progress

Example for three phases:

| Phase | Reported phase progress | Computed overall progress |
| --- | --- | --- |
| `download` | 50 | 17 |
| `process` | 25 | 42 |
| `upload` | 80 | 93 |

`ctx.progress()` persists that state immediately, then emits `job:progress`.
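
The table values can be reproduced with the overall-progress formula `((phaseIndex + phaseProgress / 100) / totalPhases) * 100`. In this sketch the function name `overallProgress` is hypothetical, and rounding to the nearest integer is inferred from the table rather than confirmed in source.

```typescript
// Overall progress in percent for a zero-based phase index.
function overallProgress(phaseIndex: number, phaseProgress: number, totalPhases: number): number {
  return ((phaseIndex + phaseProgress / 100) / totalPhases) * 100;
}

const reported = [
  { phase: "download", index: 0, phaseProgress: 50 },
  { phase: "process", index: 1, phaseProgress: 25 },
  { phase: "upload", index: 2, phaseProgress: 80 },
];

// Rounding is an assumption made to match the table above.
const overall = reported.map((row) => ({
  phase: row.phase,
  overall: Math.round(overallProgress(row.index, row.phaseProgress, reported.length)),
}));
```

Each phase contributes an equal third of the bar, so `download` at 50% is half of the first third (about 17), and `upload` at 80% is two full thirds plus 80% of the last one (about 93).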

## 6. Result passing between phases

Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:

- `ctx.phaseResult(phaseName)`
- `ctx.phaseResults()`

This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".
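
A stripped-down runner shows how results flow from one phase to the next. Only `ctx.phaseResult` and `ctx.phaseResults` are taken from the text; `runPhases`, the handler signature, and the sample pipeline are assumptions, and persistence, progress, and cancellation are left out.

```typescript
type PhaseResults = Record<string, unknown>;

interface PhaseContext {
  phaseResult(name: string): unknown;
  phaseResults(): PhaseResults;
}

// Hypothetical handler signature; the real one may carry more context.
type PhaseHandler = (job: unknown, ctx: PhaseContext) => Promise<unknown>;
type PhaseEntry = [string, PhaseHandler];

async function runPhases(job: unknown, phases: PhaseEntry[]): Promise<PhaseResults> {
  const results: PhaseResults = {};
  const ctx: PhaseContext = {
    phaseResult: (name) => results[name],
    // Return a copy so handlers cannot mutate earlier results.
    phaseResults: () => ({ ...results }),
  };
  for (const [name, handler] of phases) {
    // Phases run strictly in order; each result is stored under its phase name.
    results[name] = await handler(job, ctx);
  }
  return results;
}

const pipeline: PhaseEntry[] = [
  ["download", async () => ({ bytes: 2048 })],
  [
    "process",
    async (_job, ctx) => {
      const { bytes } = ctx.phaseResult("download") as { bytes: number };
      return { kilobytes: bytes / 1024 };
    },
  ],
];
```

Calling `runPhases({}, pipeline)` resolves to an object with both phase results, where `process` was computed from what `download` returned.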

## 7. Retry path

When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.

```mermaid
sequenceDiagram
    participant Handler as Phase handler
    participant Queue as JobQueue
    participant Retry as RetryStrategy
    participant DB as SqliteStorage
    participant Events as TypedEventBus

    Handler-->>Queue: throws error
    Queue->>Retry: shouldRetry(error, currentJob)
    alt recoverable and attempts remain
        Queue->>DB: scheduleRetry(...)
        Queue->>Events: job:retrying
        Queue->>Queue: requestPump()
    else fatal or exhausted
        Queue->>DB: failJob(...)
        Queue->>Events: job:failed
    end
```

### Retry details

- `maxAttempts` includes the initial attempt
- default disposition is `fatal`
- backoff can be `fixed`, `linear`, or `exponential`
- recoverable retries keep the job in `pending` with a future `scheduled_at`
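
The retry rules can be sketched as two pure functions. The backoff formulas, the `baseMs` parameter, and the names `backoffDelayMs` and `shouldRetry` are illustrative assumptions; the real `RetryStrategy` may compute delays differently (for example with caps or jitter).

```typescript
type Backoff = "fixed" | "linear" | "exponential";
type Disposition = "recoverable" | "fatal";

// retryNumber is 1-based: 1 is the first retry after the initial attempt.
function backoffDelayMs(kind: Backoff, baseMs: number, retryNumber: number): number {
  if (kind === "fixed") return baseMs;
  if (kind === "linear") return baseMs * retryNumber;
  return baseMs * 2 ** (retryNumber - 1); // exponential: base, 2x, 4x, ...
}

// maxAttempts includes the initial attempt, so attemptsMade counts that first run too.
function shouldRetry(disposition: Disposition, attemptsMade: number, maxAttempts: number): boolean {
  return disposition === "recoverable" && attemptsMade < maxAttempts;
}

// A retry stays pending and becomes runnable again at this timestamp.
function nextScheduledAt(now: number, kind: Backoff, baseMs: number, retryNumber: number): number {
  return now + backoffDelayMs(kind, baseMs, retryNumber);
}
```

With `maxAttempts = 3`, a recoverable job is retried after attempts 1 and 2 and fails terminally after attempt 3. An unclassified error is never retried because the default disposition is `fatal`.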

## 8. Cancellation path

Cancellation is cooperative:

1. `queue.cancel(id)` aborts the job controller if one exists
2. unfinished phases are persisted as `cancelled`
3. job status becomes `cancelled`
4. `job:cancelled` is emitted
5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled

That last rule matters. Without it, a cancel request arriving between two phases could emit `job:cancelled` twice. The current implementation guards against that.

### Pending cancellation

If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.
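
Both cancellation rules (the idempotency guard and marking unfinished phases) fit in a short sketch. `cancelJob`, the `CancellableJob` shape, and the `emit` callback are hypothetical names, and the job is a plain object rather than a persisted row.

```typescript
type PhaseStatus = "pending" | "active" | "completed" | "failed" | "cancelled";

interface CancellableJob {
  status: "pending" | "active" | "cancelled";
  phases: { name: string; status: PhaseStatus }[];
}

// Returns true only when this call actually performed the cancellation.
function cancelJob(job: CancellableJob, emit: (event: string) => void): boolean {
  // Guard: a second cancel path (for example a runner callback) becomes a no-op.
  if (job.status === "cancelled") return false;

  // Unfinished phases follow the job into cancelled; completed ones are kept.
  for (const phase of job.phases) {
    if (phase.status === "pending" || phase.status === "active") {
      phase.status = "cancelled";
    }
  }
  job.status = "cancelled";
  emit("job:cancelled");
  return true;
}

const emitted: string[] = [];
const runningJob: CancellableJob = {
  status: "active",
  phases: [
    { name: "download", status: "completed" },
    { name: "process", status: "active" },
    { name: "upload", status: "pending" },
  ],
};

const firstCancel = cancelJob(runningJob, (e) => emitted.push(e)); // direct cancel(id)
const secondCancel = cancelJob(runningJob, (e) => emitted.push(e)); // late runner callback
```

The second call returns `false` and emits nothing, which is the behavior that prevents duplicate SSE events and webhooks at a phase boundary.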

## 9. SSE stream lifecycle

`createEventStream()` creates a stream over queue events.

### Stream startup

1. optional snapshot is written first
2. event listeners are attached
3. periodic `ping` keepalive starts

### Stream shutdown

- cancelling the reader removes all attached listeners
- keepalive timer is cleared
- no queue state is modified

## 10. Webhook lifecycle

Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.

### Completion + webhook ordering

For a successful job:

1. `completeJob()` persists `status = completed`
2. `job:completed` is emitted
3. webhook dispatch is scheduled
4. successful delivery marks `webhook_sent = 1`
5. `job:webhook:delivered` is emitted

This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.

### Shutdown interaction

The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed webhook still needs to update `webhook_sent` or emit delivery/failure events.

## 11. Retention lifecycle

Retention runs independently from job execution:

1. compute stale cutoff and delete cutoff
2. mark eligible terminal jobs as `stale`
3. run optional `onStale(job)` callback
4. emit `job:stale`
5. delete stale jobs past delete cutoff
6. run optional `onDelete(job)` callback
7. emit `job:deleted`
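
The cutoff logic in steps 1, 2, and 5 can be sketched as a pure decision function. `staleAfterMs` and `deleteAfterMs` are assumed to be the retention settings; which timestamps each cutoff is measured against (`finishedAt`, `staleAt`) is an assumption, as are the names `retentionAction` and `RetentionJob`.

```typescript
type RetentionAction = "keep" | "mark-stale" | "delete";

// Hypothetical projection of a job row; timestamps are epoch milliseconds.
interface RetentionJob {
  status: "pending" | "active" | "completed" | "failed" | "cancelled" | "stale";
  finishedAt?: number; // when the job reached a terminal status
  staleAt?: number; // when the job was marked stale
}

const TERMINAL = ["completed", "failed", "cancelled"];

function retentionAction(
  job: RetentionJob,
  now: number,
  staleAfterMs: number,
  deleteAfterMs: number,
): RetentionAction {
  if (job.status === "stale") {
    // Second stage: stale rows are deleted once they pass the delete cutoff.
    return job.staleAt !== undefined && now - job.staleAt >= deleteAfterMs ? "delete" : "keep";
  }
  if (TERMINAL.includes(job.status) && job.finishedAt !== undefined) {
    // First stage: old terminal jobs are only marked, never deleted directly.
    return now - job.finishedAt >= staleAfterMs ? "mark-stale" : "keep";
  }
  // Pending and active jobs are never touched by retention.
  return "keep";
}

const HOUR = 60 * 60 * 1000;
const now = 100 * HOUR;
const decisions = [
  retentionAction({ status: "completed", finishedAt: now - 30 * HOUR }, now, 24 * HOUR, 48 * HOUR),
  retentionAction({ status: "completed", finishedAt: now - 2 * HOUR }, now, 24 * HOUR, 48 * HOUR),
  retentionAction({ status: "stale", staleAt: now - 50 * HOUR }, now, 24 * HOUR, 48 * HOUR),
  retentionAction({ status: "active" }, now, 24 * HOUR, 48 * HOUR),
];
```

A terminal job is never deleted in one step: it must first become `stale`, which is what gives consumers the grace period and a `job:stale` event before the row disappears.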

## 12. Shutdown lifecycle

Shutdown now has two responsibilities:

1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage

### Current behavior

```mermaid
flowchart TD
    A["shutdown()"] --> B[closed = true]
    B --> C[stop retention + clear wakeup timer]
    C --> D{workers drained in time?}
    D -- yes --> E[drain pending webhooks]
    D -- no --> F[abort active controllers]
    F --> G[best-effort second drain]
    G --> E
    E --> H[remove listeners]
    H --> I[close SQLite]
    I --> J{timeout happened?}
    J -- no --> K[resolve]
    J -- yes --> L[rethrow timeout error after cleanup]
```

### Important nuance

If a handler ignores `AbortSignal`, shutdown can still time out. The queue now guarantees cleanup still runs, but graceful completion still depends on handler cooperation.
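
The control flow in the shutdown flowchart can be sketched with injected steps. `shutdownSketch` and the `ShutdownSteps` interface are hypothetical; the point is the `try` / `finally` shape, not the real method names.

```typescript
interface ShutdownSteps {
  drainWorkers: () => Promise<void>; // rejects when the drain times out
  abortActive: () => void;
  drainWebhooks: () => Promise<void>;
  removeListeners: () => void;
  closeStorage: () => void;
}

async function shutdownSketch(steps: ShutdownSteps): Promise<void> {
  let timeoutError: unknown;
  try {
    try {
      await steps.drainWorkers();
    } catch (err) {
      // Remember the timeout, abort running jobs, then try one more best-effort drain.
      timeoutError = err;
      steps.abortActive();
      await steps.drainWorkers().catch(() => undefined);
    }
    // Webhook bookkeeping must finish while storage is still open.
    await steps.drainWebhooks();
  } finally {
    // Cleanup runs on every path, including timeouts.
    steps.removeListeners();
    steps.closeStorage();
  }
  // Surface the timeout only after teardown has completed.
  if (timeoutError !== undefined) throw timeoutError;
}
```

Callers therefore still see a rejected promise when a non-cooperative handler forces a timeout, but listeners and the database handle are released first.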