- Preserve phase results on partial retry and keep interrupted phase context after restart. - Avoid webhook bookkeeping crashes when retention deletes stale jobs. - Add deeper unit, integration, and e2e coverage around queue seams. - Require verify job to pass before publish runs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
142 lines
5.0 KiB
Markdown
142 lines
5.0 KiB
Markdown
# Integration findings
|
|
|
|
This document records what the multi-agent scan found, what was verified directly in source, and what changed.
|
|
|
|
## 1. Scan method
|
|
|
|
The repository was scanned through three independent passes:
|
|
|
|
1. **lifecycle scan** - enqueue, scheduling, execution, retry, cancel, shutdown
|
|
2. **storage/events scan** - persistence, SSE, webhook, retention interaction
|
|
3. **code review scan** - cross-component defects only
|
|
|
|
Those results were then checked against source before documenting or patching anything.
|
|
|
|
## 2. Confirmed and fixed issues
|
|
|
|
### A. Duplicate `job:cancelled` event at phase boundary
|
|
|
|
**Observed risk**
|
|
|
|
If `cancel()` landed after one phase finished but before the next phase started, two different code paths could emit `job:cancelled`:
|
|
|
|
- direct `cancel(id)` call
|
|
- `PhaseRunner` cancellation callback on next loop iteration
|
|
|
|
**Why it mattered**
|
|
|
|
- duplicate event bus notifications
|
|
- duplicate SSE `job:cancelled` events
|
|
- duplicate `job:cancelled` webhooks
|
|
|
|
**Fix**
|
|
|
|
`JobQueue` now checks whether the job is already cancelled before the `onCancelled` callback persists or emits anything.
|
|
|
|
### B. Pending-job cancellation left phases in `pending`
|
|
|
|
**Observed risk**
|
|
|
|
Cancelling a job before it started produced:
|
|
|
|
- job status: `cancelled`
|
|
- phase states: still `pending`
|
|
|
|
**Why it mattered**
|
|
|
|
- persisted lifecycle shape was contradictory
|
|
- dashboards and tooling reading phases could not trust phase status
|
|
|
|
**Fix**
|
|
|
|
`JobQueue` now marks all unfinished phases as `cancelled` whenever a cancellation is persisted.
|
|
|
|
### C. Shutdown could close before in-flight webhook bookkeeping finished
|
|
|
|
**Observed risk**
|
|
|
|
Webhook dispatch was previously fire-and-forget. A completed job could still be mid-delivery when `shutdown()` closed SQLite.
|
|
|
|
**Why it mattered**
|
|
|
|
- `webhook_sent` might not be written
|
|
- `job:webhook:delivered` / `job:webhook:failed` could be lost
|
|
- delivery bookkeeping could throw against a closed database
|
|
|
|
**Fix**
|
|
|
|
`JobQueue` now tracks pending webhook promises and drains them during shutdown before closing storage.
|
|
|
|
### D. Shutdown timeout skipped cleanup
|
|
|
|
**Observed risk**
|
|
|
|
If `WorkerPool.drain(timeout)` threw, `shutdown()` exited before:
|
|
|
|
- removing listeners
|
|
- closing storage
|
|
|
|
**Why it mattered**
|
|
|
|
- leaked resources
|
|
- left queue internals half-open after failed shutdown
|
|
|
|
**Fix**
|
|
|
|
Cleanup now runs in a `finally` path. On timeout, active controllers are aborted, cleanup still executes, and the timeout error is rethrown after teardown.
|
|
|
|
## 3. Behavioral notes kept as documentation, not code changes
|
|
|
|
These are real integration characteristics, but not all are bugs.
|
|
|
|
### `job:completed` precedes `job:webhook:delivered`
|
|
|
|
This is expected ordering:
|
|
|
|
1. job completion is persisted
|
|
2. `job:completed` emits
|
|
3. webhook dispatch happens
|
|
4. `job:webhook:delivered` emits on success
|
|
|
|
So `webhookSent` may still be `false` in the earlier completion event. Consumers should treat webhook delivery as a separate lifecycle step.
|
|
|
|
### `job:deleted` does not contain full job payload
|
|
|
|
This is intentional and pragmatic. Once a stale record is deleted, the queue only emits `deletedJobId`. The SSE contract reflects deletion, not a resurrected snapshot.
|
|
|
|
### Webhooks are best-effort, not durable outbox delivery
|
|
|
|
The package retries transient delivery errors, but it does **not** persist a webhook outbox with replay semantics. If a process dies after job completion and before webhook delivery completes, there is no durable re-dispatch queue.
|
|
|
|
## 4. Regression coverage added
|
|
|
|
New tests now cover:
|
|
|
|
- cancelling a pending job marks unfinished phases cancelled
|
|
- cancelling on a phase boundary emits one cancellation event
|
|
- shutdown waits for in-flight webhooks
|
|
- shutdown cleanup still happens when worker drain times out
|
|
|
|
## 5. Remaining risk areas
|
|
|
|
No new blocking integration bugs were confirmed after patching, but these seams still deserve attention as the library grows:
|
|
|
|
1. **Durable outbound delivery** - webhook outbox/idempotency keys if delivery guarantees become stronger
|
|
2. **Long-running non-cooperative handlers** - handlers that ignore `AbortSignal` can still force shutdown timeouts
|
|
3. **SSE scaling** - each stream currently subscribes directly to the in-process event bus
|
|
4. **Storage portability** - queue semantics are tightly coupled to SQLite row-level state transitions
|
|
|
|
## 6. Second-scan fixes and coverage expansion
|
|
|
|
The deeper follow-up scan confirmed three more issues that were patched:
|
|
|
|
1. **Webhook completion after retention deletion** could throw when delivery bookkeeping re-fetched a deleted job.
|
|
2. **Partial retry (`fromStart: false`)** dropped completed phase results because retry reset cleared `phase_results`.
|
|
3. **Process restart recovery** dropped interrupted phase context in failure metadata.
|
|
|
|
Coverage was expanded at three levels:
|
|
|
|
- **Unit**: retry strategy, webhook retry policy, worker-pool drain timeout, storage retry/reset semantics
|
|
- **Integration**: partial retry behavior, scheduled wakeups, restart recovery, queue lifecycle edges
|
|
- **E2E harness**: realistic workflows covering SSE + webhooks + retries + retention deletion
|