- Preserve phase results on partial retry and keep interrupted phase context after restart. - Avoid webhook bookkeeping crashes when retention deletes stale jobs. - Add deeper unit, integration, and e2e coverage around queue seams. - Require verify job to pass before publish runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
5.0 KiB
Integration findings
This document records what the multi-agent scan found, what was verified directly in source, and what changed.
1. Scan method
The repository was scanned through three independent passes:
- lifecycle scan - enqueue, scheduling, execution, retry, cancel, shutdown
- storage/events scan - persistence, SSE, webhook, retention interaction
- code review scan - cross-component defects only
Those results were then checked against source before documenting or patching anything.
2. Confirmed and fixed issues
A. Duplicate job:cancelled event at phase boundary
Observed risk
If cancel() landed after one phase finished but before the next phase started, two different code paths could emit job:cancelled:
- direct
cancel(id)call PhaseRunnercancellation callback on next loop iteration
Why it mattered
- duplicate event bus notifications
- duplicate SSE
job:cancelledevents - duplicate
job:cancelledwebhooks
Fix
JobQueue now checks whether the job is already cancelled before the onCancelled callback persists or emits anything.
B. Pending-job cancellation left phases in pending
Observed risk
Cancelling a job before it started produced:
- job status:
cancelled - phase states: still
pending
Why it mattered
- persisted lifecycle shape was contradictory
- dashboards and tooling reading phases could not trust phase status
Fix
JobQueue now marks all unfinished phases as cancelled whenever a cancellation is persisted.
C. Shutdown could close before in-flight webhook bookkeeping finished
Observed risk
Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when shutdown() closed SQLite.
Why it mattered
webhook_sentmight not be writtenjob:webhook:delivered/job:webhook:failedcould be lost- delivery bookkeeping could throw against a closed database
Fix
JobQueue now tracks pending webhook promises and drains them during shutdown before closing storage.
D. Shutdown timeout skipped cleanup
Observed risk
If WorkerPool.drain(timeout) threw, shutdown() exited before:
- removing listeners
- closing storage
Why it mattered
- leaked resources
- left queue internals half-open after failed shutdown
Fix
Cleanup now runs in a finally path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
3. Behavioral notes kept as documentation, not code changes
These are real integration characteristics, but not all are bugs.
job:completed precedes job:webhook:delivered
This is expected ordering:
- job completion is persisted
job:completedemits- webhook dispatch happens
job:webhook:deliveredemits on success
So webhookSent may still be false in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.
job:deleted does not contain full job payload
This is intentional and pragmatic. Once a stale record is deleted, the queue only emits deletedJobId. The SSE contract reflects deletion, not a resurrected snapshot.
Webhooks are best-effort, not durable outbox delivery
The package retries transient delivery errors, but it does not persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.
4. Regression coverage added
New tests now cover:
- cancelling a pending job marks unfinished phases cancelled
- cancelling on a phase boundary emits one cancellation event
- shutdown waits for in-flight webhooks
- shutdown cleanup still happens when worker drain times out
5. Remaining risk areas
No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:
- Durable outbound delivery - webhook outbox/idempotency keys if delivery guarantees become stronger
- Long-running non-cooperative handlers - handlers that ignore
AbortSignalcan still force shutdown timeouts - SSE scaling - each stream currently subscribes directly to the in-process event bus
- Storage portability - queue semantics are tightly coupled to SQLite row-level state transitions
6. Second-scan fixes and coverage expansion
The deeper follow-up scan confirmed three more issues that were patched:
- Webhook completion after retention deletion could throw when delivery bookkeeping re-fetched a deleted job.
- Partial retry (
fromStart: false) dropped completed phase results because retry reset clearedphase_results. - Process restart recovery dropped interrupted phase context in failure metadata.
Coverage was expanded at three levels:
- Unit: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
- Integration: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
- E2E harness: realistic workflows covering SSE + webhooks + retries + retention deletion