feat: dynamic model loading/unloading with GPU polling

- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET /model/status: current state + VRAM stats
  POST /model/load: trigger load (idempotent)
  POST /model/unload: immediate unload
  GET /model/events: SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS (default 300)
  GPU_POLL_INTERVAL_SECS (default 30)

Tests:
  tests/test_model_lifecycle.sh: 18 integration tests (full state machine, SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh: 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern, updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 17:57:20 +02:00
parent 78c6fab81b
commit b191fbe200
13 changed files with 2053 additions and 148 deletions
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -66,6 +66,8 @@ The bundled `docker-compose.yml` mounts named volumes for data and models and se
 | `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
 | `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
 | `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
+| `IDLE_TIMEOUT_SECS` | `300` | Seconds of inactivity before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
+| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |

 ### Note on CUDA device ordering
 Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
@@ -76,6 +78,194 @@ Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host

 The interactive Swagger UI is available at `http://localhost:8080/docs`.

+---
+
+## Model Lifecycle Management
+
+The server starts with the model **unloaded** (lazy loading). The model is loaded into GPU memory on the first job submission or via `POST /model/load`, and is automatically unloaded after `IDLE_TIMEOUT_SECS` seconds of inactivity.
+
+### Model State Machine
+
+```
+Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
+                                                 └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
+Ready ──(idle timeout / POST /model/unload)──► Unloaded
+WaitingForGpu ──(POST /model/unload)──► Unloaded
+```
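+
+The diagram can also be read as a transition table. The following is a minimal client-side sketch of that table, not the server's implementation. The event names (`job`, `load`, `success`, `vram_full`, `retry_ok`, `idle_timeout`, `unload`) and the `nextState` helper are illustrative assumptions, and only the transitions drawn above are included.
+
+```javascript
+// Hypothetical transition table mirroring the diagram above.
+// State names match the `state` values returned by GET /model/status.
+const TRANSITIONS = {
+  unloaded: { job: 'loading', load: 'loading' },
+  loading: { success: 'ready', vram_full: 'waiting_for_gpu' },
+  waiting_for_gpu: { retry_ok: 'loading', unload: 'unloaded' },
+  ready: { idle_timeout: 'unloaded', unload: 'unloaded' },
+};
+
+// Returns the next state, or the current state when the diagram
+// shows no transition for this (state, event) pair.
+function nextState(state, event) {
+  return TRANSITIONS[state]?.[event] ?? state;
+}
+```
+
+For example, `nextState('loading', 'vram_full')` yields `'waiting_for_gpu'`, while an event the diagram does not list for a state leaves that state unchanged.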
+
+### `GET /model/status`
+
+Returns the current model state and VRAM statistics.
+
+```bash
+curl http://localhost:8080/model/status
+```
+
+**When unloaded:**
+```json
+{ "state": "unloaded" }
+```
+
+**When loading:**
+```json
+{ "state": "loading" }
+```
+
+**When ready:**
+```json
+{
+  "state": "ready",
+  "loaded_at": "2026-05-10T14:00:00Z",
+  "vram_used_mb": 4096,
+  "vram_total_mb": 8192
+}
+```
+
+**When waiting for VRAM:**
+```json
+{
+  "state": "waiting_for_gpu",
+  "vram_needed_mb": 3951,
+  "vram_free_mb": 512,
+  "retry_in_secs": 30
+}
+```
+
+---
+
+### `POST /model/load`
+
+Requests that the model be loaded. The call is idempotent: if the model is already loading or ready, it returns immediately.
+
+```bash
+curl -X POST http://localhost:8080/model/load
+```
+
+- Returns `202 Accepted` with `{"status":"load_initiated"}` when load is triggered
+- Returns `200 OK` with `{"status":"already_ready"}` when model is already ready
+- Poll `GET /model/status` or subscribe to `GET /model/events` to know when ready
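+
+As one way to combine the load request with status polling, the sketch below triggers a load and then polls `GET /model/status` until `state` is `ready`. The helper name `loadAndWait` and its options (`fetchFn`, `pollMs`, `maxPolls`) are assumptions made for illustration; they are not part of the server API.
+
+```javascript
+// Sketch: trigger a load, then poll GET /model/status until the model is ready.
+// `fetchFn` is injectable so the helper can be exercised without a live server.
+async function loadAndWait(baseUrl, { fetchFn = fetch, pollMs = 2000, maxPolls = 150 } = {}) {
+  // Both 202 (load initiated) and 200 (already ready) count as success here.
+  const resp = await fetchFn(`${baseUrl}/model/load`, { method: 'POST' });
+  if (!resp.ok) throw new Error(`Load request failed: ${resp.status}`);
+
+  for (let i = 0; i < maxPolls; i++) {
+    const statusResp = await fetchFn(`${baseUrl}/model/status`);
+    const status = await statusResp.json();
+    if (status.state === 'ready') return status;
+    // Still unloaded, loading, or waiting_for_gpu: wait and poll again.
+    await new Promise(r => setTimeout(r, pollMs));
+  }
+  throw new Error('Model did not become ready in time');
+}
+```
+
+Polling is the simpler option for scripts; a long-lived client may prefer the SSE stream from `GET /model/events`, which avoids repeated requests.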
+
+---
+
+### `POST /model/unload`
+
+Unload the model from GPU memory immediately, freeing VRAM.
+
+```bash
+curl -X POST http://localhost:8080/model/unload
+```
+
+Returns `200 OK` regardless of current state.
+
+---
+
+### `GET /model/events` — Model SSE stream
+
+Subscribe to model lifecycle events via Server-Sent Events.
+
+```bash
+curl -N http://localhost:8080/model/events
+```
+
+**Event types:**
+
+```
+event: model_loading
+data: {"type":"model_loading"}
+
+event: model_ready
+data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}
+
+event: model_unloaded
+data: {"type":"model_unloaded"}
+
+event: model_waiting_for_gpu
+data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
+```
+
+**JavaScript example:**
+```javascript
+const es = new EventSource('/model/events');
+
+es.addEventListener('model_ready', () => {
+  console.log('Model loaded — ready to transcribe');
+});
+
+es.addEventListener('model_unloaded', () => {
+  console.log('Model freed GPU memory');
+});
+```
+
+---
+
+### Webhooks for model events
+
+When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:
+
+| Event | Fired when |
+|-------|-----------|
+| `model_ready` | Model finishes loading (after GPU warmup) |
+| `model_unloaded` | Model is freed from GPU memory |
+
+**Webhook payload** (`Content-Type: application/json`):
+```json
+{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
+{ "type": "model_unloaded" }
+```
+
+Delivery is attempted up to 3 times, with exponential backoff between attempts (1s, then 2s).
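+
+The retry schedule can be illustrated with a short sketch. This is not the server's delivery code; `backoffDelayMs`, `deliverWithRetry`, and the `send` callback are hypothetical names, and the doubling rule is inferred from the 1s and 2s delays stated above.
+
+```javascript
+// Delay before retry number `retry` (1-based): baseMs, then 2x, 4x, ...
+function backoffDelayMs(retry, baseMs = 1000) {
+  return baseMs * 2 ** (retry - 1);
+}
+
+// Calls `send()` up to `maxAttempts` times, sleeping between failed attempts.
+// Rethrows the last error once all attempts are used up.
+async function deliverWithRetry(send, { maxAttempts = 3, baseMs = 1000 } = {}) {
+  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
+    try {
+      return await send();
+    } catch (err) {
+      if (attempt === maxAttempts) throw err;
+      await new Promise(r => setTimeout(r, backoffDelayMs(attempt, baseMs)));
+    }
+  }
+}
+```
+
+With the defaults, three attempts produce two waits, of 1000 ms and 2000 ms, which matches the schedule described above.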
+
+---
+
+### Handling 503 Model Not Ready
+
+When you submit a job and the model is not yet loaded, you receive `503 Service Unavailable` with a `Retry-After` header:
+
+```
+HTTP/1.1 503 Service Unavailable
+Retry-After: 30
+Content-Type: application/json
+
+{
+  "error": "model_not_ready",
+  "state": "unloaded",
+  "retry_after_secs": 30
+}
+```
+
+| State at rejection | `retry_after_secs` | Meaning |
+|---|---|---|
+| `unloaded` | 30 | Load was triggered; retry after ~30s |
+| `loading` | 10 | Check again in 10s |
+| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |
+
+A job rejection when the model is `unloaded` **automatically triggers a load** — you do not need to call `POST /model/load` separately.
+
+**Recommended client pattern:**
+```javascript
+async function submitWithRetry(formData, maxAttempts = 10) {
+  for (let i = 0; i < maxAttempts; i++) {
+    const resp = await fetch('/jobs', { method: 'POST', body: formData });
+    if (resp.ok) return resp.json();
+    if (resp.status === 503) {
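+      // Retry-After may be missing or an HTTP-date; fall back to 30s if it is not a number.
+      const parsed = Number.parseInt(resp.headers.get('Retry-After') ?? '', 10);
+      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;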
+      const body = await resp.json();
+      console.log(`Model ${body.state} — retrying in ${retryAfter}s`);
+      await new Promise(r => setTimeout(r, retryAfter * 1000));
+      continue;
+    }
+    throw new Error(`Submit failed: ${resp.status}`);
+  }
+  throw new Error('Gave up after max attempts');
+}
+```
+
+---
+
+## API Reference
+
+The interactive Swagger UI is available at `http://localhost:8080/docs`.
+
 ### `POST /jobs` — Submit a transcription job

 Accepts a multipart/form-data body.
@@ -249,11 +439,12 @@ curl http://localhost:8080/health
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
-  "queue_depth": 0
+  "queue_depth": 0,
+  "model_state": "ready"
 }
 ```

-`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).
+`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running). `model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).

 ---

@@ -340,6 +531,11 @@ curl -X POST http://localhost:8080/jobs \

 ## Troubleshooting

+### Server returns `503 model_not_ready`
+- The model starts unloaded. A submission rejected while the state is `unloaded` has already triggered a load, so retry after the `Retry-After` interval, or call `POST /model/load` ahead of time.
+- If state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
+- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.
+
 ### Server returns 0 segments
 - Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
 - Verify the audio file is not corrupted: `ffprobe audio.mp3`