jobqueue/docs/integration-findings.md

# Integration findings

This document records what the multi-agent scan found, what was verified directly in source, and what changed.

## 1. Scan method

The repository was scanned through three independent passes:

1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
2. **storage/events scan** - persistence, SSE, webhook, retention interaction
3. **code review scan** - cross-component defects only

Each finding was then checked against source before anything was documented or patched.

## 2. Confirmed and fixed issues

### A. Duplicate `job:cancelled` event at phase boundary

**Observed risk**

If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:

- direct `cancel(id)` call
- `PhaseRunner` cancellation callback on next loop iteration

**Why it mattered**

- duplicate event bus notifications
- duplicate SSE `job:cancelled` events
- duplicate `job:cancelled` webhooks

**Fix**

`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.
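The guard can be sketched as a single idempotent transition shared by both paths. This is a minimal illustration, not the library's implementation: `CancellationGuardSketch`, `persistCancellation`, the in-memory `Map`, and the `JobStatus` values other than `cancelled` are assumptions made for the example.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

interface JobRecord {
  id: string;
  status: JobStatus;
}

// Hypothetical in-memory stand-in for the queue's storage and event bus.
class CancellationGuardSketch {
  readonly emitted: string[] = [];
  private readonly jobs = new Map<string, JobRecord>();

  add(id: string, status: JobStatus = "running"): void {
    this.jobs.set(id, { id, status });
  }

  // Direct cancel(id) path.
  cancel(id: string): boolean {
    return this.persistCancellation(id);
  }

  // Path taken by the phase runner's cancellation callback on its next loop iteration.
  onCancelled(id: string): boolean {
    return this.persistCancellation(id);
  }

  // Returns true only for the call that actually performed the transition.
  private persistCancellation(id: string): boolean {
    const job = this.jobs.get(id);
    if (job === undefined || job.status === "cancelled") {
      return false; // already cancelled (or unknown): persist and emit nothing
    }
    job.status = "cancelled";
    this.emitted.push("job:cancelled");
    return true;
  }
}

const guard = new CancellationGuardSketch();
guard.add("job-1");
guard.cancel("job-1"); // lands on the phase boundary
guard.onCancelled("job-1"); // runner notices the cancellation afterwards
console.log(guard.emitted); // a single "job:cancelled" entry
```

Routing both paths through one check means whichever path arrives second becomes a no-op, regardless of ordering.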

### B. Pending-job cancellation left phases in `pending`

**Observed risk**

Cancelling a job before it started produced:

- job status: `cancelled`
- phase states: still `pending`

**Why it mattered**

- persisted lifecycle shape was contradictory
- dashboards and tooling reading phases could not trust phase status

**Fix**

`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.
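A minimal sketch of that phase update, assuming a phase is "unfinished" when it is `pending` or `running`. The `Phase` shape, the status names other than `pending` and `cancelled`, and the `cancelUnfinishedPhases` helper are hypothetical.

```typescript
type PhaseStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

interface Phase {
  name: string;
  status: PhaseStatus;
}

// Phases that already reached a terminal state keep their status;
// everything else is marked cancelled together with the job.
function cancelUnfinishedPhases(phases: readonly Phase[]): Phase[] {
  return phases.map((phase): Phase =>
    phase.status === "pending" || phase.status === "running"
      ? { ...phase, status: "cancelled" }
      : phase
  );
}

const cancelledPhases = cancelUnfinishedPhases([
  { name: "fetch", status: "completed" },
  { name: "transform", status: "running" },
  { name: "upload", status: "pending" },
]);
console.log(cancelledPhases.map((phase) => `${phase.name}=${phase.status}`).join(", "));
```

Completed phases are left alone so that tooling can still see how far the job got before it was cancelled.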

### C. Shutdown could close before in-flight webhook bookkeeping finished

**Observed risk**

Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.

**Why it mattered**

- `webhook_sent` might not be written
- `job:webhook:delivered` / `job:webhook:failed` could be lost
- delivery bookkeeping could throw against a closed database

**Fix**

`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.
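One way to picture the tracking is a set of in-flight promises that shutdown awaits before it closes storage. The sketch below is illustrative only: `PendingWebhooksSketch`, `track`, `drain`, and `closeAfterWebhooks` are hypothetical names, and it assumes delivery failures are reported elsewhere (for example through `job:webhook:failed`), so the tracker swallows rejections.

```typescript
class PendingWebhooksSketch {
  private readonly pending = new Set<Promise<void>>();

  get size(): number {
    return this.pending.size;
  }

  // Register an in-flight delivery; it removes itself once it settles.
  track(delivery: Promise<void>): void {
    const settled: Promise<void> = delivery
      .catch(() => {}) // assumption: failures are surfaced by the delivery code itself
      .then(() => {
        this.pending.delete(settled);
      });
    this.pending.add(settled);
  }

  // Wait until nothing is in flight. The loop covers deliveries that are
  // tracked while earlier ones are still settling.
  async drain(): Promise<void> {
    while (this.pending.size > 0) {
      await Promise.all(Array.from(this.pending));
    }
  }
}

// Shutdown ordering: finish webhook bookkeeping first, close storage second.
async function closeAfterWebhooks(
  tracker: PendingWebhooksSketch,
  closeStorage: () => void
): Promise<void> {
  await tracker.drain();
  closeStorage();
}
```

Because storage closes only after `drain()` resolves, bookkeeping such as writing `webhook_sent` never runs against a closed database in this arrangement.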

### D. Shutdown timeout skipped cleanup

**Observed risk**

If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:

- removing listeners
- closing storage

**Why it mattered**

- event listeners and the open storage connection leaked
- left queue internals half-open after failed shutdown

**Fix**

Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
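The control flow described above can be sketched with `try` / `catch` / `finally`. The `ShutdownDeps` interface and its method names are hypothetical placeholders for the queue's internals, and the sketch assumes the drain call rejects when it times out.

```typescript
interface ShutdownDeps {
  drainWorkers(timeoutMs: number): Promise<void>; // assumed to reject on timeout
  abortActive(): void; // aborts controllers of handlers that are still running
  removeListeners(): void;
  closeStorage(): void;
}

async function shutdownSketch(deps: ShutdownDeps, timeoutMs: number): Promise<void> {
  try {
    await deps.drainWorkers(timeoutMs);
  } catch (error) {
    // Drain failed (for example, it timed out): abort whatever is still active.
    deps.abortActive();
    throw error; // reaches the caller only after the finally block has run
  } finally {
    // Runs on both the success path and the failure path.
    deps.removeListeners();
    deps.closeStorage();
  }
}
```

Callers still see the timeout error, so a failed drain is not silently reported as a clean shutdown.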

## 3. Behavioral notes kept as documentation, not code changes

These are real integration characteristics that were documented rather than patched. Not all of them are bugs.

### `job:completed` precedes `job:webhook:delivered`

This is expected ordering:

1. job completion is persisted
2. `job:completed` emits
3. webhook dispatch happens
4. `job:webhook:delivered` emits on success

So `webhookSent` may still be `false` in the `job:completed` payload. Consumers should treat webhook delivery as a separate lifecycle step.
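A consumer can model this by tracking completion and webhook delivery as two independent fields. The sketch below is a hypothetical consumer-side reducer: the event payload shape (`type`, `jobId`) and the `JobView` type are assumptions, and only the event names come from this document.

```typescript
type QueueEvent =
  | { type: "job:completed"; jobId: string }
  | { type: "job:webhook:delivered"; jobId: string }
  | { type: "job:webhook:failed"; jobId: string };

interface JobView {
  completed: boolean;
  webhook: "pending" | "delivered" | "failed";
}

// Completion and webhook delivery update separate fields, so a completed job
// with a still-pending webhook is a valid intermediate state.
function applyEvent(views: Map<string, JobView>, event: QueueEvent): void {
  const current: JobView = views.get(event.jobId) ?? { completed: false, webhook: "pending" };
  switch (event.type) {
    case "job:completed":
      views.set(event.jobId, { ...current, completed: true });
      break;
    case "job:webhook:delivered":
      views.set(event.jobId, { ...current, webhook: "delivered" });
      break;
    case "job:webhook:failed":
      views.set(event.jobId, { ...current, webhook: "failed" });
      break;
  }
}
```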

### `job:deleted` does not contain full job payload

This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.

### Webhooks are best-effort, not durable outbox delivery

The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.

## 4. Regression coverage added

New tests cover:

- cancelling a pending job marks unfinished phases cancelled
- cancelling on a phase boundary emits one cancellation event
- shutdown waits for in-flight webhooks
- shutdown cleanup still happens when worker drain times out

## 5. Remaining risk areas

No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:

1. **Durable outbound delivery** - a webhook outbox and idempotency keys would be needed if delivery guarantees have to become stronger
2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions
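For the second item, a cooperative handler is one that checks its `AbortSignal` between units of work. The sketch below shows the general pattern with the standard `AbortController` API. The `runCooperatively` helper and its signature are hypothetical and do not describe how the library invokes handlers.

```typescript
// Processes items one at a time and stops at the next item boundary once the
// signal has been aborted. A handler that never checks the signal would keep
// running and could push shutdown into its timeout path.
async function runCooperatively<T>(
  items: readonly T[],
  processItem: (item: T) => Promise<void>,
  signal: AbortSignal
): Promise<number> {
  let done = 0;
  for (const item of items) {
    if (signal.aborted) {
      throw new Error(`aborted after ${done} of ${items.length} items`);
    }
    await processItem(item);
    done += 1;
  }
  return done;
}
```

The check happens only between items, so a single long-running `processItem` call still has to observe the signal itself to be interruptible.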

## 6. Second-scan fixes and coverage expansion

A deeper follow-up scan confirmed three more issues, all of which were patched:

1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
3. **Process restart recovery** dropped interrupted phase context in failure metadata.
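The partial-retry behavior in item 2 can be illustrated with a reset that keeps completed phases when `fromStart` is `false`. This is a sketch under assumptions: the `PhaseState` shape, the in-memory representation, and `resetPhasesForRetry` are hypothetical, and the real fix operates on the persisted `phase_results`.

```typescript
type RetryPhaseStatus = "pending" | "completed" | "failed";

interface PhaseState {
  name: string;
  status: RetryPhaseStatus;
  result?: unknown;
}

interface RetryOptions {
  fromStart: boolean;
}

function resetPhasesForRetry(
  phases: readonly PhaseState[],
  options: RetryOptions
): PhaseState[] {
  return phases.map((phase): PhaseState => {
    if (!options.fromStart && phase.status === "completed") {
      return { ...phase }; // partial retry: keep the status and the stored result
    }
    return { name: phase.name, status: "pending" }; // rerun: clear any earlier result
  });
}

const beforeRetry: PhaseState[] = [
  { name: "fetch", status: "completed", result: { rows: 3 } },
  { name: "upload", status: "failed" },
];
console.log(JSON.stringify(resetPhasesForRetry(beforeRetry, { fromStart: false })));
```

With `fromStart: true` the same function resets every phase, which matches the intuition that a full retry discards earlier results.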

Coverage was expanded at three levels:

- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion