Files

Giancarmine Salucci a9429e2118 fix: harden queue lifecycle and publish gate

- Preserve phase results on partial retry and keep interrupted phase
  context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-16 18:39:19 +02:00

5.0 KiB

Raw Blame History

Integration findings

This document records what the multi-agent scan found, what was verified directly in source, and what changed.

1. Scan method

The repository was scanned through three independent passes:

lifecycle scan - enqueue, scheduling, execution, retry, cancel, shutdown
storage/events scan - persistence, SSE, webhook, retention interaction
code review scan - cross-component defects only

Those results were then checked against source before documenting or patching anything.

2. Confirmed and fixed issues

A. Duplicate `job:cancelled` event at phase boundary

Observed risk

If cancel() landed after one phase finished but before the next phase started, two different code paths could emit job:cancelled:

direct cancel(id) call
PhaseRunner cancellation callback on next loop iteration

Why it mattered

duplicate event bus notifications
duplicate SSE job:cancelled events
duplicate job:cancelled webhooks

Fix

JobQueue now checks whether the job is already cancelled before the onCancelled callback persists or emits anything.

B. Pending-job cancellation left phases in `pending`

Observed risk

Cancelling a job before it started produced:

job status: cancelled
phase states: still pending

Why it mattered

persisted lifecycle shape was contradictory
dashboards and tooling reading phases could not trust phase status

Fix

JobQueue now marks all unfinished phases as cancelled whenever a cancellation is persisted.

C. Shutdown could close before in-flight webhook bookkeeping finished

Observed risk

Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when shutdown() closed SQLite.

Why it mattered

webhook_sent might not be written
job:webhook:delivered / job:webhook:failed could be lost
delivery bookkeeping could throw against a closed database

Fix

JobQueue now tracks pending webhook promises and drains them during shutdown before closing storage.

D. Shutdown timeout skipped cleanup

Observed risk

If WorkerPool.drain(timeout) threw, shutdown() exited before:

removing listeners
closing storage

Why it mattered

leaked resources
left queue internals half-open after failed shutdown

Fix

Cleanup now runs in a finally path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.

3. Behavioral notes kept as documentation, not code changes

These are real integration characteristics, but not all are bugs.

`job:completed` precedes `job:webhook:delivered`

This is expected ordering:

job completion is persisted
job:completed emits
webhook dispatch happens
job:webhook:delivered emits on success

So webhookSent may still be false in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.

`job:deleted` does not contain full job payload

This is intentional and pragmatic. Once a stale record is deleted, the queue only emits deletedJobId. The SSE contract reflects deletion, not a resurrected snapshot.

Webhooks are best-effort, not durable outbox delivery

The package retries transient delivery errors, but it does not persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.

4. Regression coverage added

New tests now cover:

cancelling a pending job marks unfinished phases cancelled
cancelling on a phase boundary emits one cancellation event
shutdown waits for in-flight webhooks
shutdown cleanup still happens when worker drain times out

5. Remaining risk areas

No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:

Durable outbound delivery - webhook outbox/idempotency keys if delivery guarantees become stronger
Long-running non-cooperative handlers - handlers that ignore AbortSignal can still force shutdown timeouts
SSE scaling - each stream currently subscribes directly to the in-process event bus
Storage portability - queue semantics are tightly coupled to SQLite row-level state transitions

6. Second-scan fixes and coverage expansion

The deeper follow-up scan confirmed three more issues that were patched:

Webhook completion after retention deletion could throw when delivery bookkeeping re-fetched a deleted job.
Partial retry (fromStart: false) dropped completed phase results because retry reset cleared phase_results.
Process restart recovery dropped interrupted phase context in failure metadata.

Coverage was expanded at three levels:

Unit: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
Integration: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
E2E harness: realistic workflows covering SSE + webhooks + retries + retention deletion

5.0 KiB Raw Blame History

Integration findings

1. Scan method

2. Confirmed and fixed issues

A. Duplicate job:cancelled event at phase boundary

B. Pending-job cancellation left phases in pending

C. Shutdown could close before in-flight webhook bookkeeping finished

D. Shutdown timeout skipped cleanup

3. Behavioral notes kept as documentation, not code changes

job:completed precedes job:webhook:delivered

job:deleted does not contain full job payload

Webhooks are best-effort, not durable outbox delivery

4. Regression coverage added

5. Remaining risk areas

6. Second-scan fixes and coverage expansion

5.0 KiB

Raw Blame History

A. Duplicate `job:cancelled` event at phase boundary

B. Pending-job cancellation left phases in `pending`

`job:completed` precedes `job:webhook:delivered`

`job:deleted` does not contain full job payload