- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries the
  load every GPU_POLL_INTERVAL_SECS (default 30), with no retry limit
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
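
The idle-timer bullet above can be sketched roughly as follows. This is a
minimal illustration, not the code in this commit: the WorkerCmd variants,
the spawn_worker helper, and the string event names pushed into a Vec are
assumptions made for the example (the real worker loads a model and
broadcasts events). It only shows how a recv_timeout tick lets one OS
thread both serve commands and notice idleness without a second thread.

```rust
use std::sync::mpsc::{sync_channel, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical command set; the real WorkerCmd variants are not shown here.
pub enum WorkerCmd {
    Job(String),
    Load,
    Unload,
}

/// Spawns the worker thread. Returns the command sender and a join handle
/// that yields the lifecycle events the worker recorded, in order.
pub fn spawn_worker(
    idle_timeout: Duration,
    tick: Duration,
) -> (SyncSender<WorkerCmd>, thread::JoinHandle<Vec<&'static str>>) {
    let (tx, rx) = sync_channel::<WorkerCmd>(16);
    let handle = thread::spawn(move || {
        let mut events = Vec::new();
        let mut loaded = false; // model starts unloaded (lazy)
        let mut last_activity = Instant::now();
        loop {
            match rx.recv_timeout(tick) {
                // A job or an explicit load request loads the model if needed.
                Ok(WorkerCmd::Job(_)) | Ok(WorkerCmd::Load) => {
                    if !loaded {
                        loaded = true;
                        events.push("model_ready");
                    }
                    last_activity = Instant::now();
                }
                // Manual unload releases the model immediately.
                Ok(WorkerCmd::Unload) => {
                    if loaded {
                        loaded = false;
                        events.push("model_unloaded");
                    }
                }
                // No command arrived within one tick: check the idle deadline.
                Err(RecvTimeoutError::Timeout) => {
                    if loaded && last_activity.elapsed() >= idle_timeout {
                        loaded = false;
                        events.push("model_unloaded");
                    }
                }
                // All senders dropped: shut the worker down.
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        events
    });
    (tx, handle)
}
```

With a 1s tick the unload can lag the configured timeout by up to one
tick, which is negligible against the 300s default.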
New routes:
  GET  /model/status   current state + VRAM stats
  POST /model/load     trigger load (idempotent)
  POST /model/unload   immediate unload
  GET  /model/events   SSE stream of model lifecycle events
New env vars:
  IDLE_TIMEOUT_SECS (default 300)
  GPU_POLL_INTERVAL_SECS (default 30)
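
One way the two variables might be read with their defaults is sketched
below. The LifecycleConfig struct and secs_or_default helper are
hypothetical names for this example, and falling back to the default on an
unparsable value is an assumption; the actual code may reject bad input.

```rust
use std::time::Duration;

/// Parses an optional raw value as whole seconds. Falls back to
/// `default_secs` when the value is missing or not an unsigned integer
/// (assumed behavior, chosen for the sketch).
pub fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Hypothetical holder for the two lifecycle settings.
pub struct LifecycleConfig {
    pub idle_timeout: Duration,
    pub gpu_poll_interval: Duration,
}

impl LifecycleConfig {
    /// Reads both settings from the process environment.
    pub fn from_env() -> Self {
        let get = |name: &str| std::env::var(name).ok();
        Self {
            idle_timeout: secs_or_default(get("IDLE_TIMEOUT_SECS").as_deref(), 300),
            gpu_poll_interval: secs_or_default(
                get("GPU_POLL_INTERVAL_SECS").as_deref(),
                30,
            ),
        }
    }
}
```

Keeping the parsing in a pure function makes the defaults testable without
touching the process environment.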
Tests:
  tests/test_model_lifecycle.sh: 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh: 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end
Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Step 9 used 'echo $RESULT | python3 - << HEREDOC', which hits a shell
pitfall: the heredoc redirection replaces the pipe as python3's stdin
(python3 - reads the script source from it), so the piped data is silently
dropped. By the time the script runs, stdin is at EOF, sys.stdin.read()
returns an empty string, and parsing it raises JSONDecodeError.
Fix: write RESULT to a temp file and pass its path as sys.argv[1] to the
script.
Also removed the old buggy test suite that was accidentally left appended
at lines 181-327 (had language=auto, ['id'] field, wrong DELETE assertion).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GPU warmup (src/transcriber.rs):
After creating WhisperState, run a 1s silent inference pass in load().
CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
On a cold GPU this compilation disrupts the decode pipeline mid-inference,
returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
startup so the first real job runs on fully compiled kernels.
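
A minimal sketch of the warmup step, under stated assumptions: the
Inference trait and RecordingEngine below are hypothetical stand-ins for
the real WhisperState call so that the example compiles without the
whisper bindings. The only Whisper-specific fact relied on is that the
model consumes 16 kHz mono f32 samples, so 1s of silence is 16000 zeros.

```rust
/// Whisper models consume 16 kHz mono f32 PCM.
pub const SAMPLE_RATE_HZ: usize = 16_000;

/// Hypothetical stand-in for the real inference call on WhisperState.
pub trait Inference {
    /// Runs one inference pass and returns the number of segments produced.
    fn run(&mut self, samples: &[f32]) -> Result<usize, String>;
}

/// Builds `seconds` of digital silence at the model's sample rate.
pub fn silent_samples(seconds: usize) -> Vec<f32> {
    vec![0.0; SAMPLE_RATE_HZ * seconds]
}

/// Runs a single throwaway pass over 1s of silence so that any one-time
/// GPU setup cost is paid during load rather than on the first real job.
/// The segment count is discarded; only errors are propagated.
pub fn warm_up<I: Inference>(engine: &mut I) -> Result<(), String> {
    let silence = silent_samples(1);
    engine.run(&silence).map(|_segments| ())
}

/// Test double that records how it was called instead of touching a GPU.
#[derive(Default)]
pub struct RecordingEngine {
    pub calls: usize,
    pub last_len: usize,
}

impl Inference for RecordingEngine {
    fn run(&mut self, samples: &[f32]) -> Result<usize, String> {
        self.calls += 1;
        self.last_len = samples.len();
        Ok(0) // silence is expected to produce no segments
    }
}
```

In load(), warm_up would be called once right after the state is created
and before the model is reported as ready.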
test_all.sh:
- Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
- Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
- Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
- Fix DELETE assertion: completed jobs return 409 Conflict, not 204
- Add explicit zero-segments failure check in quality inspection (step 9)
- Add progress reporting to poll loop
docs/FINDINGS.md + KNOWLEDGE.md:
Document cold GPU warmup issue, root cause, and fix.
Document language=auto as invalid API usage.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>