Files
jobqueue/docs/integration-findings.md
Giancarmine Salucci a9429e2118 fix: harden queue lifecycle and publish gate
- Preserve phase results on partial retry and keep interrupted phase
  context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 18:39:19 +02:00

5.0 KiB

Integration findings

This document records what the multi-agent scan found, what was verified directly in source, and what changed.

1. Scan method

The repository was scanned through three independent passes:

  1. lifecycle scan - enqueue, scheduling, execution, retry, cancel, shutdown
  2. storage/events scan - persistence, SSE, webhook, retention interaction
  3. code review scan - cross-component defects only

Those results were then checked against source before documenting or patching anything.

2. Confirmed and fixed issues

A. Duplicate job:cancelled event at phase boundary

Observed risk

If cancel() landed after one phase finished but before the next phase started, two different code paths could emit job:cancelled:

  • direct cancel(id) call
  • PhaseRunner cancellation callback on next loop iteration

Why it mattered

  • duplicate event bus notifications
  • duplicate SSE job:cancelled events
  • duplicate job:cancelled webhooks

Fix

JobQueue now checks whether the job is already cancelled before the onCancelled callback persists or emits anything.

B. Pending-job cancellation left phases in pending

Observed risk

Cancelling a job before it started produced:

  • job status: cancelled
  • phase states: still pending

Why it mattered

  • persisted lifecycle shape was contradictory
  • dashboards and tooling reading phases could not trust phase status

Fix

JobQueue now marks all unfinished phases as cancelled whenever a cancellation is persisted.

C. Shutdown could close before in-flight webhook bookkeeping finished

Observed risk

Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when shutdown() closed SQLite.

Why it mattered

  • webhook_sent might not be written
  • job:webhook:delivered / job:webhook:failed could be lost
  • delivery bookkeeping could throw against a closed database

Fix

JobQueue now tracks pending webhook promises and drains them during shutdown before closing storage.

D. Shutdown timeout skipped cleanup

Observed risk

If WorkerPool.drain(timeout) threw, shutdown() exited before:

  • removing listeners
  • closing storage

Why it mattered

  • leaked resources
  • left queue internals half-open after failed shutdown

Fix

Cleanup now runs in a finally path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.

3. Behavioral notes kept as documentation, not code changes

These are real integration characteristics, but not all are bugs.

job:completed precedes job:webhook:delivered

This is expected ordering:

  1. job completion is persisted
  2. job:completed emits
  3. webhook dispatch happens
  4. job:webhook:delivered emits on success

So webhookSent may still be false in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.

job:deleted does not contain full job payload

This is intentional and pragmatic. Once a stale record is deleted, the queue only emits deletedJobId. The SSE contract reflects deletion, not a resurrected snapshot.

Webhooks are best-effort, not durable outbox delivery

The package retries transient delivery errors, but it does not persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.

4. Regression coverage added

New tests now cover:

  • cancelling a pending job marks unfinished phases cancelled
  • cancelling on a phase boundary emits one cancellation event
  • shutdown waits for in-flight webhooks
  • shutdown cleanup still happens when worker drain times out

5. Remaining risk areas

No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:

  1. Durable outbound delivery - webhook outbox/idempotency keys if delivery guarantees become stronger
  2. Long-running non-cooperative handlers - handlers that ignore AbortSignal can still force shutdown timeouts
  3. SSE scaling - each stream currently subscribes directly to the in-process event bus
  4. Storage portability - queue semantics are tightly coupled to SQLite row-level state transitions

6. Second-scan fixes and coverage expansion

The deeper follow-up scan confirmed three more issues that were patched:

  1. Webhook completion after retention deletion could throw when delivery bookkeeping re-fetched a deleted job.
  2. Partial retry (fromStart: false) dropped completed phase results because retry reset cleared phase_results.
  3. Process restart recovery dropped interrupted phase context in failure metadata.

Coverage was expanded at three levels:

  • Unit: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
  • Integration: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
  • E2E harness: realistic workflows covering SSE + webhooks + retries + retention deletion