# Runtime lifecycle

This document follows one queue instance from construction through shutdown.

## 1. Construction

When `new JobQueue(config)` runs, the constructor does more than store config:

1. normalizes config defaults
2. opens SQLite and enables WAL mode
3. creates retry strategy
4. creates worker pool
5. optionally creates webhook dispatcher
6. resets any previously `active` jobs to `failed`
7. optionally starts retention scheduler
8. requests an initial pump

### Why `resetActiveJobs()` exists

`jobqueue` is single-process. If the process dies mid-job, no other worker exists that can safely finish the in-flight job. On the next boot, the queue marks those orphaned jobs as `failed` so they do not stay stuck in `active`.
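
As a rough illustration, the reset amounts to one pass over persisted rows. The sketch below is an in-memory stand-in, not the library's code: the `JobRow` shape and the error message are assumptions, and the real queue performs this step against SQLite.

```typescript
type JobStatus = "pending" | "active" | "completed" | "failed" | "cancelled";

// Hypothetical row shape; the real schema is not shown in this document.
interface JobRow {
  id: string;
  status: JobStatus;
  error?: string;
}

// Marks every job left in `active` by a previous process as `failed`
// and returns how many rows were changed.
function resetActiveJobs(rows: JobRow[]): number {
  let resetCount = 0;
  for (const row of rows) {
    if (row.status === "active") {
      row.status = "failed";
      row.error = "process exited while job was active"; // illustrative message
      resetCount += 1;
    }
  }
  return resetCount;
}
```

Jobs in any other status are left untouched, so `pending` work survives a restart and is picked up by the initial pump.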

## 2. Enqueue path

```mermaid
sequenceDiagram
  participant App as Consumer app
  participant Queue as JobQueue
  participant DB as SqliteStorage
  participant Pump as Pump loop

  App->>Queue: enqueue(data, options)
  Queue->>DB: createJob(...)
  DB-->>Queue: JobRecord(status=pending)
  Queue-->>App: jobId
  Queue->>Queue: emit job:enqueued
  Queue->>Pump: requestPump()
```

Key points:

- enqueue is durable first: the job row is persisted before `enqueue` returns, and execution happens asynchronously afterward
- a job can be scheduled for the future with `scheduledAt`
- a per-job webhook URL can override queue-level webhook URL

## 3. Pumping and dispatch

The queue uses a **pump loop**, not a constantly blocking worker thread.

### Pump rules

1. stop immediately if queue is closed
2. if another pump is already running, request a repump and return
3. while worker pool has capacity:
   - read runnable `pending` jobs whose `scheduled_at <= now`
   - try to claim each job atomically
   - emit `job:started`
   - hand claimed job to `WorkerPool`
4. schedule a wake-up for the next delayed job
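
The four rules can be sketched as a small synchronous class. This is a simplified model under stated assumptions, not the queue's actual code: `PumpSketch`, `PendingJob`, and the `started` array are hypothetical names, capacity never frees up, and the real pump hands work to `WorkerPool` and sets a timer instead of returning a timestamp.

```typescript
interface PendingJob {
  id: string;
  scheduledAt: number; // epoch milliseconds
}

class PumpSketch {
  closed = false;
  readonly started: string[] = [];
  private pumping = false;
  private repumpRequested = false;
  private pending: PendingJob[];
  private freeSlots: number;

  constructor(pending: PendingJob[], capacity: number) {
    this.pending = [...pending];
    this.freeSlots = capacity;
  }

  // Returns the timestamp of the next delayed job, or null if none is waiting.
  pump(now: number): number | null {
    if (this.closed) return null; // rule 1: closed queues never dispatch
    if (this.pumping) {
      this.repumpRequested = true; // rule 2: remember the request and return
      return null;
    }
    this.pumping = true;
    try {
      do {
        this.repumpRequested = false;
        this.dispatchRunnable(now); // rule 3
      } while (this.repumpRequested);
    } finally {
      this.pumping = false;
    }
    return this.nextWakeUp(now); // rule 4
  }

  private dispatchRunnable(now: number): void {
    const runnable = this.pending.filter((job) => job.scheduledAt <= now);
    for (const job of runnable) {
      if (this.freeSlots === 0) break;
      // Removing the row stands in for the atomic claim in SQLite.
      this.pending = this.pending.filter((p) => p.id !== job.id);
      this.freeSlots -= 1;
      this.started.push(job.id); // stands in for `job:started` + hand-off
    }
  }

  private nextWakeUp(now: number): number | null {
    const future = this.pending
      .filter((job) => job.scheduledAt > now)
      .map((job) => job.scheduledAt);
    return future.length > 0 ? Math.min(...future) : null;
  }
}
```

In this synchronous model the re-entrancy guard is never actually triggered; it matters in the real queue because enqueues, retries, and worker completions can all request a pump while one is in progress.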

### Why `claimPendingJob()` matters

The queue lists candidates first, then claims them one by one with a status transition in SQLite. That second step is what prevents the same pending row from being started twice.
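
The claim is essentially a compare-and-set on the status column. The sketch below models it in memory; the SQL in the comment shows one way the same check could be expressed, with table and column names that are assumptions rather than the library's schema.

```typescript
type ClaimStatus = "pending" | "active";

// In SQLite the same check-and-set could be a single statement, for example
//   UPDATE jobs SET status = 'active' WHERE id = ? AND status = 'pending'
// where the claim succeeds only if exactly one row was changed.
// (Table and column names here are hypothetical.)
function claimPendingJob(
  rows: Map<string, { status: ClaimStatus }>,
  id: string,
): boolean {
  const row = rows.get(id);
  if (row === undefined || row.status !== "pending") {
    return false; // already claimed, or no such job
  }
  row.status = "active";
  return true;
}
```

A caller that receives `false` simply skips the candidate, which is why listing and claiming can be separate steps without starting a job twice.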

## 4. Job execution

Each claimed job gets its own `AbortController`. `PhaseRunner` then executes configured phases in order.

```mermaid
sequenceDiagram
  participant Queue as JobQueue
  participant Runner as PhaseRunner
  participant Handler as Phase handler
  participant DB as SqliteStorage
  participant Events as TypedEventBus

  Queue->>Runner: run(job, signal)
  loop for each phase
    Runner->>DB: saveProgress(on phase start)
    Runner->>Handler: handler(job, context)
    Handler->>Runner: ctx.progress(...)
    Runner->>DB: saveProgress(...)
    Runner->>Events: job:progress
    Handler-->>Runner: phase result
    Runner->>DB: savePhaseCompletion(...)
    Runner->>Events: job:phase:completed
  end
  Queue->>DB: completeJob(...)
  Queue->>Events: job:completed
```

## 5. Progress semantics

Progress exists at two levels:

- **phase progress** - what the current handler reports
- **overall progress** - computed from phase index + phase progress

Example for three phases:

| Phase | Reported phase progress | Computed overall progress |
| --- | --- | --- |
| `download` | 50 | 17 |
| `process` | 25 | 42 |
| `upload` | 80 | 93 |

`ctx.progress()` persists that state immediately, then emits `job:progress`.
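
The table values are consistent with giving each phase an equal share of the total and rounding to a whole percent. The function below is a sketch of that arithmetic; the function name, the zero-based `phaseIndex`, and the rounding rule are assumptions inferred from the table, not taken from the implementation.

```typescript
// phaseIndex is zero-based; phaseProgress is a percentage from 0 to 100.
function overallProgress(
  phaseIndex: number,
  phaseProgress: number,
  phaseCount: number,
): number {
  const fraction = (phaseIndex + phaseProgress / 100) / phaseCount;
  return Math.round(fraction * 100);
}
```

For the three-phase example, `overallProgress(0, 50, 3)`, `overallProgress(1, 25, 3)`, and `overallProgress(2, 80, 3)` give 17, 42, and 93.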

## 6. Result passing between phases

Each handler can return JSON-serializable data. `PhaseRunner` stores that in `phaseResults` and exposes it to later handlers via:

- `ctx.phaseResult(phaseName)`
- `ctx.phaseResults()`

This is the mechanism that turns the queue from "single task runner" into "multi-stage pipeline engine".
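
A minimal sketch of the idea follows. It is deliberately simplified: handlers here are synchronous and receive only a context, whereas the real handlers take `(job, context)`; `runPhases` and the type names are hypothetical.

```typescript
interface PhaseContextSketch {
  phaseResult(phaseName: string): unknown;
  phaseResults(): Record<string, unknown>;
}

interface PhaseSketch {
  name: string;
  handler: (ctx: PhaseContextSketch) => unknown;
}

// Runs phases in order and makes each earlier result visible to later phases.
function runPhases(phases: PhaseSketch[]): Record<string, unknown> {
  const results: Record<string, unknown> = {};
  const ctx: PhaseContextSketch = {
    phaseResult: (phaseName) => results[phaseName],
    phaseResults: () => ({ ...results }), // copy, so handlers cannot mutate stored results
  };
  for (const phase of phases) {
    results[phase.name] = phase.handler(ctx);
  }
  return results;
}
```

A `process` handler could, for example, read `ctx.phaseResult("download")` to find out what the `download` phase produced, without either handler knowing about the other's internals.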

## 7. Retry path

When a handler throws, `JobQueue.handleFailure()` decides between retry and terminal failure.

```mermaid
sequenceDiagram
  participant Handler as Phase handler
  participant Queue as JobQueue
  participant Retry as RetryStrategy
  participant DB as SqliteStorage
  participant Events as TypedEventBus

  Handler-->>Queue: throws error
  Queue->>Retry: shouldRetry(error, currentJob)
  alt recoverable and attempts remain
    Queue->>DB: scheduleRetry(...)
    Queue->>Events: job:retrying
    Queue->>Queue: requestPump()
  else fatal or exhausted
    Queue->>DB: failJob(...)
    Queue->>Events: job:failed
  end
```

### Retry details

- `maxAttempts` includes the initial attempt
- default disposition is `fatal`, so an error is retried only when it is classified as recoverable
- backoff can be `fixed`, `linear`, or `exponential`
- recoverable retries keep the job in `pending` with a future `scheduled_at`
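
The decision and the delay can be sketched as two small functions. The backoff formulas below are common definitions used for illustration only; the document does not specify the exact formulas, and `shouldRetrySketch`, `retryDelayMs`, and `baseDelayMs` are hypothetical names.

```typescript
type Backoff = "fixed" | "linear" | "exponential";
type Disposition = "recoverable" | "fatal";

// attemptsMade counts the initial attempt, matching `maxAttempts`.
function shouldRetrySketch(
  disposition: Disposition | undefined,
  attemptsMade: number,
  maxAttempts: number,
): boolean {
  const effective = disposition ?? "fatal"; // unclassified errors are fatal
  return effective === "recoverable" && attemptsMade < maxAttempts;
}

// Assumed formulas: fixed = base, linear = base * n, exponential = base * 2^(n-1).
function retryDelayMs(
  backoff: Backoff,
  baseDelayMs: number,
  attemptsMade: number,
): number {
  switch (backoff) {
    case "fixed":
      return baseDelayMs;
    case "linear":
      return baseDelayMs * attemptsMade;
    case "exponential":
      return baseDelayMs * 2 ** (attemptsMade - 1);
  }
}
```

With `maxAttempts = 3`, a recoverable error after the third attempt is no longer retried, because the initial attempt counts toward the limit. The computed delay would be added to the current time to produce the future `scheduled_at`.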

## 8. Cancellation path

Cancellation is cooperative:

1. `queue.cancel(id)` aborts the job controller if one exists
2. unfinished phases are persisted as `cancelled`
3. job status becomes `cancelled`
4. `job:cancelled` is emitted
5. later phase-runner cancellation callbacks become no-ops if the job is already cancelled

That last rule matters. Without it, a cancel request arriving between phase transitions could emit duplicate `job:cancelled` events. The current implementation guards against that.

### Pending cancellation

If a job is cancelled before it starts, **all unfinished phases are also marked `cancelled`**. That keeps the persisted phase graph aligned with the top-level job status.
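
The guard and the phase bookkeeping can be sketched together. This is an in-memory model with hypothetical types; it omits the `AbortController` step and assumes that already-completed phases keep their status.

```typescript
type PhaseStatus = "pending" | "active" | "completed" | "cancelled";

interface PhaseRow {
  name: string;
  status: PhaseStatus;
}

interface CancellableJob {
  id: string;
  status: "pending" | "active" | "cancelled";
  phases: PhaseRow[];
}

// Returns true if this call performed the cancellation, false if it was a no-op.
function cancelJob(
  job: CancellableJob,
  emit: (event: string, jobId: string) => void,
): boolean {
  if (job.status === "cancelled") {
    return false; // later callbacks become no-ops, so the event fires once
  }
  for (const phase of job.phases) {
    if (phase.status !== "completed") {
      phase.status = "cancelled"; // unfinished phases follow the job status
    }
  }
  job.status = "cancelled";
  emit("job:cancelled", job.id);
  return true;
}
```

The same function covers both a running job and a job that never started: in the second case every phase is still unfinished, so all of them end up `cancelled`.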

## 9. SSE stream lifecycle

`createEventStream()` creates a stream over queue events.

### Stream startup

1. optional snapshot is written first
2. event listeners are attached
3. periodic `ping` keepalive starts

### Stream shutdown

- cancelling the reader removes all attached listeners
- keepalive timer is cleared
- no queue state is modified
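
The startup and shutdown steps can be sketched with a tiny event bus and a writer callback. This is not how `createEventStream()` is implemented: `TinyBus`, `openStream`, the `snapshot` event name, and the 15-second default are assumptions. Only the `event:` / `data:` framing is standard server-sent events syntax.

```typescript
type Listener = (event: string, data: unknown) => void;

// Hypothetical stand-in for the queue's event bus.
class TinyBus {
  private listeners = new Set<Listener>();
  on(listener: Listener): void {
    this.listeners.add(listener);
  }
  off(listener: Listener): void {
    this.listeners.delete(listener);
  }
  emit(event: string, data: unknown): void {
    this.listeners.forEach((listener) => listener(event, data));
  }
  get listenerCount(): number {
    return this.listeners.size;
  }
}

// One SSE frame: an event line, a data line, then a blank line.
function formatSse(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Returns a close function that undoes everything the stream set up.
function openStream(
  bus: TinyBus,
  write: (chunk: string) => void,
  snapshot?: unknown,
  pingMs = 15000,
): () => void {
  if (snapshot !== undefined) {
    write(formatSse("snapshot", snapshot)); // 1. optional snapshot first
  }
  const listener: Listener = (event, data) => write(formatSse(event, data));
  bus.on(listener); // 2. attach listeners
  const timer = setInterval(() => write(formatSse("ping", {})), pingMs); // 3. keepalive
  return () => {
    bus.off(listener); // remove listeners
    clearInterval(timer); // clear keepalive; queue state is untouched
  };
}
```

Writing the snapshot before attaching listeners keeps the first frame a consistent baseline that later events build on.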

## 10. Webhook lifecycle

Webhooks are triggered off queue events, but they are not the primary source of truth. SQLite is.

### Completion + webhook ordering

For a successful job:

1. `completeJob()` persists `status = completed`
2. `job:completed` is emitted
3. webhook dispatch is scheduled
4. successful delivery marks `webhook_sent = 1`
5. `job:webhook:delivered` is emitted

This means webhook state becomes visible in a **later event**, not inside the original `job:completed` event.

### Shutdown interaction

The queue tracks in-flight webhook promises and waits for them during shutdown. That avoids closing SQLite while a webhook for a completed job still needs to update `webhook_sent` or emit delivery/failure events.
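
One way to track in-flight deliveries is a set of promises that remove themselves when they settle. The class below is a sketch under that assumption; `InFlightTracker` and its methods are hypothetical names, not the queue's API.

```typescript
class InFlightTracker {
  private inFlight = new Set<Promise<void>>();

  // Registers a delivery and forgets it once it settles, success or failure.
  track(delivery: Promise<void>): void {
    const tracked: Promise<void> = delivery
      .catch(() => undefined) // failures are reported elsewhere; drain must not reject
      .then(() => {
        this.inFlight.delete(tracked);
      });
    this.inFlight.add(tracked);
  }

  get size(): number {
    return this.inFlight.size;
  }

  // Resolves once every delivery registered so far has settled.
  async drain(): Promise<void> {
    await Promise.all(Array.from(this.inFlight));
  }
}
```

Shutdown would call `drain()` before closing storage, so any pending `webhook_sent` update still has an open database to write to.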

## 11. Retention lifecycle

Retention runs independently from job execution:

1. compute stale cutoff and delete cutoff
2. mark eligible terminal jobs as `stale`
3. run optional `onStale(job)` callback
4. emit `job:stale`
5. delete stale jobs past delete cutoff
6. run optional `onDelete(job)` callback
7. emit `job:deleted`
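
A retention pass can be sketched as two cutoff comparisons. The model below makes several assumptions that the document does not confirm: both cutoffs are measured from a `finishedAt` timestamp, the option names `staleAfterMs` and `deleteAfterMs` are hypothetical, and callbacks and events are reduced to returned id lists.

```typescript
interface RetainedJob {
  id: string;
  status: "completed" | "failed" | "cancelled" | "stale";
  finishedAt: number; // epoch milliseconds
}

interface RetentionResult {
  markedStale: string[];
  deleted: string[];
}

function retentionPass(
  jobs: RetainedJob[],
  now: number,
  staleAfterMs: number,
  deleteAfterMs: number,
): RetentionResult {
  const staleCutoff = now - staleAfterMs; // step 1
  const deleteCutoff = now - deleteAfterMs;
  const result: RetentionResult = { markedStale: [], deleted: [] };

  for (const job of jobs) {
    // steps 2-4: terminal jobs older than the stale cutoff become `stale`
    if (job.status !== "stale" && job.finishedAt <= staleCutoff) {
      job.status = "stale";
      result.markedStale.push(job.id);
    }
  }
  for (const job of jobs) {
    // steps 5-7: stale jobs older than the delete cutoff are removed
    if (job.status === "stale" && job.finishedAt <= deleteCutoff) {
      result.deleted.push(job.id);
    }
  }
  return result;
}
```

In this sketch a very old job can be marked stale and deleted in the same pass; whether the real scheduler allows that is not stated here.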

## 12. Shutdown lifecycle

Shutdown has two responsibilities:

1. **stop new work** - mark queue closed, stop retention, clear wake-up timer
2. **tear down safely** - wait for workers, wait for webhooks, remove listeners, close storage

### Current behavior

```mermaid
flowchart TD
  A["shutdown()"] --> B[closed = true]
  B --> C[stop retention + clear wakeup timer]
  C --> D{workers drained in time?}
  D -- yes --> E[drain pending webhooks]
  D -- no --> F[abort active controllers]
  F --> G[best-effort second drain]
  G --> E
  E --> H[remove listeners]
  H --> I[close SQLite]
  I --> J{timeout happened?}
  J -- no --> K[resolve]
  J -- yes --> L[rethrow timeout error after cleanup]
```

### Important nuance

If a handler ignores `AbortSignal`, shutdown can still time out. The queue guarantees that cleanup runs regardless, but graceful completion depends on handler cooperation.
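
The "cleanup always runs, timeout is rethrown afterward" behavior maps naturally onto `try`/`finally`. The sketch below assumes a hypothetical `ShutdownSteps` interface and bounds the second drain with the same timeout; it omits the "stop new work" steps and, for brevity, does not clear its timers.

```typescript
interface ShutdownSteps {
  drainWorkers: () => Promise<void>;
  abortActive: () => void;
  drainWebhooks: () => Promise<void>;
  cleanup: () => void; // remove listeners + close storage
}

function delay(ms: number): Promise<"timeout"> {
  return new Promise((resolve) => setTimeout(() => resolve("timeout"), ms));
}

async function shutdownSketch(
  steps: ShutdownSteps,
  timeoutMs: number,
): Promise<void> {
  let timedOut = false;
  try {
    const first = await Promise.race([
      steps.drainWorkers().then(() => "drained" as const),
      delay(timeoutMs),
    ]);
    if (first === "timeout") {
      timedOut = true;
      steps.abortActive();
      // Best-effort second drain, bounded so an uncooperative handler cannot hang it.
      await Promise.race([steps.drainWorkers(), delay(timeoutMs)]);
    }
    await steps.drainWebhooks();
  } finally {
    steps.cleanup(); // runs on success, on timeout, and if a drain step throws
  }
  if (timedOut) {
    throw new Error("shutdown timed out"); // surfaced only after cleanup
  }
}
```

A handler that never observes its `AbortSignal` corresponds to a `drainWorkers` promise that never settles: the caller still gets a rejected promise, but listeners are removed and storage is closed first.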