Files
jobqueue/docs/integration-findings.md
Giancarmine Salucci a9429e2118 fix: harden queue lifecycle and publish gate
- Preserve phase results on partial retry and keep interrupted phase
  context after restart.
- Avoid webhook bookkeeping crashes when retention deletes stale jobs.
- Add deeper unit, integration, and e2e coverage around queue seams.
- Require verify job to pass before publish runs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 18:39:19 +02:00

142 lines
5.0 KiB
Markdown

# Integration findings
This document records what the multi-agent scan found, what was verified directly in source, and what changed.
## 1. Scan method
The repository was scanned through three independent passes:
1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
2. **storage/events scan** - persistence, SSE, webhook, retention interaction
3. **code review scan** - cross-component defects only
Those results were then checked against source before documenting or patching anything.
## 2. Confirmed and fixed issues
### A. Duplicate `job:cancelled` event at phase boundary
**Observed risk**
If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:
- direct `cancel(id)` call
- `PhaseRunner` cancellation callback on next loop iteration
**Why it mattered**
- duplicate event bus notifications
- duplicate SSE `job:cancelled` events
- duplicate `job:cancelled` webhooks
**Fix**
`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.
### B. Pending-job cancellation left phases in `pending`
**Observed risk**
Cancelling a job before it started produced:
- job status: `cancelled`
- phase states: still `pending`
**Why it mattered**
- persisted lifecycle shape was contradictory
- dashboards and tooling reading phases could not trust phase status
**Fix**
`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.
### C. Shutdown could close before in-flight webhook bookkeeping finished
**Observed risk**
Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.
**Why it mattered**
- `webhook_sent` might not be written
- `job:webhook:delivered` / `job:webhook:failed` could be lost
- delivery bookkeeping could throw against a closed database
**Fix**
`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.
### D. Shutdown timeout skipped cleanup
**Observed risk**
If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:
- removing listeners
- closing storage
**Why it mattered**
- leaked resources
- left queue internals half-open after failed shutdown
**Fix**
Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
## 3. Behavioral notes kept as documentation, not code changes
These are real integration characteristics, but not all are bugs.
### `job:completed` precedes `job:webhook:delivered`
This is expected ordering:
1. job completion is persisted
2. `job:completed` emits
3. webhook dispatch happens
4. `job:webhook:delivered` emits on success
So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.
### `job:deleted` does not contain full job payload
This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.
### Webhooks are best-effort, not durable outbox delivery
The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.
## 4. Regression coverage added
New tests now cover:
- cancelling a pending job marks unfinished phases cancelled
- cancelling on a phase boundary emits one cancellation event
- shutdown waits for in-flight webhooks
- shutdown cleanup still happens when worker drain times out
## 5. Remaining risk areas
No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:
1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger
2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions
## 6. Second-scan fixes and coverage expansion
The deeper follow-up scan confirmed three more issues that were patched:
1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
3. **Process restart recovery** dropped interrupted phase context in failure metadata.
Coverage was expanded at three levels:
- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion