# Runtime lifecycle

This document follows one queue instance from construction through shutdown.

## 1. Construction

When `new JobQueue(config)` runs, the constructor does more than store config:

1. normalizes config defaults
2. opens SQLite and enables WAL mode
3. creates retry strategy
4. creates worker pool
5. optionally creates webhook dispatcher
6. resets any previously `active` jobs to `failed`
7. optionally starts retention scheduler
8. requests an initial pump

### Why `resetActiveJobs()` exists

`jobqueue` is single-process. If the process dies mid-job, there is no other worker that can safely finish that in-flight job. On next boot, the queue marks those orphaned jobs failed so they do not stay stuck in `active`.

## 2. Enqueue path

```mermaid
sequenceDiagram
    participant App as Consumer app
    participant Queue as JobQueue
    participant DB as SqliteStorage
    participant Pump as Pump loop

    App->>Queue: enqueue(data, options)
    Queue->>DB: createJob(...)
    DB-->>Queue: JobRecord(status=pending)
    Queue-->>App: jobId
    Queue->>Queue: emit job:enqueued
    Queue->>Pump: requestPump()
```

Key points:

- enqueue is durable first, asynchronous execution second
- a job can be scheduled for the future with `scheduledAt`
- a per-job webhook URL can override queue-level webhook URL

## 3. Pumping and dispatch

The queue uses a **pump loop**, not a constantly-blocking worker thread.

### Pump rules

1. stop immediately if queue is closed
2. if another pump is already running, request a repump and return
3. while worker pool has capacity:
   - read runnable `pending` jobs whose `scheduled_at <= now`
   - try to claim each job atomically
   - emit `job:started`
   - hand claimed job to `WorkerPool`
4. schedule a wake-up for the next delayed job

### Why `claimPendingJob()` matters

The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.

## 4. Job execution

Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.

```mermaid
sequenceDiagram
    participant Queue as JobQueue
    participant Runner as PhaseRunner
    participant Handler as Phase handler
    participant DB as SqliteStorage
    participant Events as TypedEventBus

    Queue->>Runner: run(job, signal)
    loop for each phase
        Runner->>DB: saveProgress(on phase start)
        Runner->>Handler: handler(job, context)
        Handler->>Runner: ctx.progress(...)
        Runner->>DB: saveProgress(...)
        Runner->>Events: job:progress
        Handler-->>Runner: phase result
        Runner->>DB: savePhaseCompletion(...)
        Runner->>Events: job:phase:completed
    end
    Queue->>DB: completeJob(...)
    Queue->>Events: job:completed
```

## 5. Progress semantics

Progress exists at two levels:

- **phase progress** - what the current handler reports
- **overall progress** - computed from phase index + phase progress

Example for three phases:

| Phase | Reported phase progress | Computed overall progress |
| --- | --- | --- |
| `download` | 50 | 17 |
| `process` | 25 | 42 |
| `upload` | 80 | 93 |

`ctx.progress()` persists that state immediately, then emits `job:progress`.

## 6. Result passing between phases

Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:

- `ctx.phaseResult(phaseName)`
- `ctx.phaseResults()`

This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".

## 7. Retry path

When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.

```mermaid
sequenceDiagram
    participant Handler as Phase handler
    participant Queue as JobQueue
    participant Retry as RetryStrategy
    participant DB as SqliteStorage
    participant Events as TypedEventBus

    Handler-->>Queue: throws error
    Queue->>Retry: shouldRetry(error, currentJob)
    alt recoverable and attempts remain
        Queue->>DB: scheduleRetry(...)
        Queue->>Events: job:retrying
        Queue->>Queue: requestPump()
    else fatal or exhausted
        Queue->>DB: failJob(...)
        Queue->>Events: job:failed
    end
```

### Retry details

- `maxAttempts` includes the initial attempt
- default disposition is `fatal`
- backoff can be `fixed`, `linear`, or `exponential`
- recoverable retries keep the job in `pending` with a future `scheduled_at`

## 8. Cancellation path

Cancellation is cooperative:

1. `queue.cancel(id)` aborts the job controller if one exists
2. unfinished phases are persisted as `cancelled`
3. job status becomes `cancelled`
4. `job:cancelled` is emitted
5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled

That last rule matters. Without it, a cancel request arriving between phase transitions could emit duplicate `job:cancelled` events. The current implementation now guards against that.

### Pending cancellation

If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.

## 9. SSE stream lifecycle

`createEventStream()` creates a stream over queue events.

### Stream startup

1. optional snapshot is written first
2. event listeners are attached
3. periodic `ping` keepalive starts

### Stream shutdown

- cancelling the reader removes all attached listeners
- keepalive timer is cleared
- no queue state is modified

## 10. Webhook lifecycle

Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.

### Completion + webhook ordering

For a successful job:

1. `completeJob()` persists `status = completed`
2. `job:completed` is emitted
3. webhook dispatch is scheduled
4. successful delivery marks `webhook_sent = 1`
5. `job:webhook:delivered` is emitted

This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.

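
A consumer that tracks webhook state therefore needs to listen for both events. The sketch below illustrates that ordering with a tiny stand-in event bus; `DemoBus`, the payload shapes, and the `webhookSent` field are assumptions made for the example, not the library's actual API.

```typescript
// Hypothetical payload shapes; the real event payloads may differ.
type QueueEvent =
  | { type: "job:completed"; jobId: string; webhookSent: boolean }
  | { type: "job:webhook:delivered"; jobId: string };

type Listener = (event: QueueEvent) => void;

// Tiny stand-in for the queue's event bus, used only for illustration.
class DemoBus {
  private listeners: Listener[] = [];

  on(listener: Listener): void {
    this.listeners.push(listener);
  }

  emit(event: QueueEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

const bus = new DemoBus();
const timeline: string[] = [];
const webhookSent = new Map<string, boolean>();

bus.on((event) => {
  if (event.type === "job:completed") {
    // The webhook has only been scheduled here, so the flag is still false.
    timeline.push(`completed (webhookSent=${event.webhookSent})`);
    webhookSent.set(event.jobId, event.webhookSent);
  } else {
    // Delivery state arrives in its own, later event.
    timeline.push("webhook delivered");
    webhookSent.set(event.jobId, true);
  }
});

// Simulated ordering for one successful job (steps 2 and 5 of the list).
bus.emit({ type: "job:completed", jobId: "job-1", webhookSent: false });
bus.emit({ type: "job:webhook:delivered", jobId: "job-1" });
```

A UI that rendered webhook status only from `job:completed` would show "not sent" indefinitely, which is why the second listener branch matters.
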
### Shutdown interaction

The queue now tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a completed webhook still needs to update `webhook_sent` or emit delivery/failure events.

## 11. Retention lifecycle

Retention runs independently from job execution:

1. compute stale cutoff and delete cutoff
2. mark eligible terminal jobs as `stale`
3. run optional `onStale(job)` callback
4. emit `job:stale`
5. delete stale jobs past delete cutoff
6. run optional `onDelete(job)` callback
7. emit `job:deleted`

## 12. Shutdown lifecycle

Shutdown now has two responsibilities:

1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage

### Current behavior

```mermaid
flowchart TD
    A["shutdown()"] --> B[closed = true]
    B --> C[stop retention + clear wakeup timer]
    C --> D{workers drained in time?}
    D -- yes --> E[drain pending webhooks]
    D -- no --> F[abort active controllers]
    F --> G[best-effort second drain]
    G --> E
    E --> H[remove listeners]
    H --> I[close SQLite]
    I --> J{timeout happened?}
    J -- no --> K[resolve]
    J -- yes --> L[rethrow timeout error after cleanup]
```

### Important nuance

If a handler ignores `AbortSignal`, shutdown can still time out. The queue now guarantees cleanup still runs, but graceful completion still depends on handler cooperation.
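
As a rough illustration of what handler cooperation can look like, the sketch below checks the signal before each unit of work. `processChunks` and its signature are hypothetical and not part of the queue's API. The abort is triggered from inside the work callback only to keep the example deterministic; in a real async handler the same check would sit after each `await`.

```typescript
// Hypothetical handler body: the name and signature are illustrative only.
function processChunks(
  chunks: string[],
  signal: AbortSignal,
  work: (chunk: string) => void,
): number {
  let processed = 0;
  for (const chunk of chunks) {
    // Cooperative check: do not start another unit of work once aborted.
    if (signal.aborted) break;
    work(chunk);
    processed += 1;
  }
  return processed;
}

const controller = new AbortController();
const handled: string[] = [];

const processedCount = processChunks(
  ["a", "b", "c", "d"],
  controller.signal,
  (chunk) => {
    handled.push(chunk);
    // Simulate shutdown aborting the job while chunk "b" is in progress.
    if (chunk === "b") controller.abort();
  },
);
```

Here the loop finishes the chunk already in progress and then stops, so chunks `c` and `d` are never started. A handler that never inspects `signal` would run all four chunks regardless of the abort.
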