# Integration findings This document records what the multi-agent scan found, what was verified directly in source, and what changed. ## 1. Scan method The repository was scanned through three independent passes: 1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown 2. **storage/events scan** - persistence, SSE, webhook, retention interaction 3. **code review scan** - cross-component defects only Those results were then checked against source before documenting or patching anything. ## 2. Confirmed and fixed issues ### A. Duplicate `job:cancelled` event at phase boundary **Observed risk** If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`: - direct `cancel(id)` call - `PhaseRunner` cancellation callback on next loop iteration **Why it mattered** - duplicate event bus notifications - duplicate SSE `job:cancelled` events - duplicate `job:cancelled` webhooks **Fix** `JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything. ### B. Pending-job cancellation left phases in `pending` **Observed risk** Cancelling a job before it started produced: - job status: `cancelled` - phase states: still `pending` **Why it mattered** - persisted lifecycle shape was contradictory - dashboards and tooling reading phases could not trust phase status **Fix** `JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted. ### C. Shutdown could close before in-flight webhook bookkeeping finished **Observed risk** Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite. **Why it mattered** - `webhook_sent` might not be written - `job:webhook:delivered` / `job:webhook:failed` could be lost - delivery bookkeeping could throw against a closed database **Fix** `JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage. ### D. Shutdown timeout skipped cleanup **Observed risk** If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before: - removing listeners - closing storage **Why it mattered** - leaked resources - left queue internals half-open after failed shutdown **Fix** Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown. ## 3. Behavioral notes kept as documentation, not code changes These are real integration characteristics, but not all are bugs. ### `job:completed` precedes `job:webhook:delivered` This is expected ordering: 1. job completion is persisted 2. `job:completed` emits 3. webhook dispatch happens 4. `job:webhook:delivered` emits on success So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step. ### `job:deleted` does not contain full job payload This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot. ### Webhooks are best-effort, not durable outbox delivery The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue. ## 4. Regression coverage added New tests now cover: - cancelling a pending job marks unfinished phases cancelled - cancelling on a phase boundary emits one cancellation event - shutdown waits for in-flight webhooks - shutdown cleanup still happens when worker drain times out ## 5. Remaining risk areas No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows: 1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger 2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts 3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus 4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions ## 6. Second-scan fixes and coverage expansion The deeper follow-up scan confirmed three more issues that were patched: 1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job. 2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`. 3. **Process restart recovery** dropped interrupted phase context in failure metadata. Coverage was expanded at three levels: - **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics - **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges - **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion