feat: dynamic model loading/unloading with GPU polling

- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mozempk
2026-05-08 17:57:20 +02:00
parent 78c6fab81b
commit b191fbe200
13 changed files with 2053 additions and 148 deletions


@@ -66,6 +66,8 @@ The bundled `docker-compose.yml` mounts named volumes for data and models and se
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
| `IDLE_TIMEOUT_SECS` | `300` | Seconds of inactivity before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |
### Note on CUDA device ordering
Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
@@ -76,6 +78,194 @@ Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host
The interactive Swagger UI is available at `http://localhost:8080/docs`.
---
## Model Lifecycle Management
The model is **unloaded** at startup (lazy loading). It is loaded into GPU memory on the first job submission or via `POST /model/load`, and is automatically unloaded after `IDLE_TIMEOUT_SECS` seconds of inactivity.
### Model State Machine
```
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                       └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
```
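The diagram can also be read as a transition table. The following is a minimal sketch, not the server's implementation: the trigger names (`load_requested`, `vram_full`, and so on) are hypothetical labels for the arrows, and the choice to leave the state unchanged for any trigger the diagram does not list is an assumption of this sketch.

```javascript
// Transition table mirroring the arrows in the diagram above.
// Trigger names are illustrative labels, not identifiers from the server.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

function nextState(state, trigger) {
  const row = TRANSITIONS[state];
  if (!row) throw new Error(`unknown state: ${state}`);
  // Assumption: triggers not drawn in the diagram leave the state unchanged.
  return Object.prototype.hasOwnProperty.call(row, trigger) ? row[trigger] : state;
}
```

For example, `nextState('loading', 'vram_full')` yields `'waiting_for_gpu'`, and a later `retry_ok` moves back to `'loading'`.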
### `GET /model/status`
Returns the current model state and VRAM statistics.
```bash
curl http://localhost:8080/model/status
```
**When unloaded:**
```json
{ "state": "unloaded" }
```
**When loading:**
```json
{ "state": "loading" }
```
**When ready:**
```json
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
```
**When waiting for VRAM:**
```json
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
```
---
### `POST /model/load`
Requests that the model be loaded. The call is idempotent: if the model is already loading or ready, it returns immediately.
```bash
curl -X POST http://localhost:8080/model/load
```
- Returns `202 Accepted` with `{"status":"load_initiated"}` when load is triggered
- Returns `200 OK` with `{"status":"already_ready"}` when model is already ready
- Poll `GET /model/status` or subscribe to `GET /model/events` to learn when the model is ready
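
The polling approach could look like the sketch below. The helper name `waitUntilReady`, the 2-second interval, and the poll limit are assumptions chosen for illustration; only the `/model/status` route and the `state` field come from this document. `fetchFn` and `sleepFn` are injectable so the helper can be exercised without a running server, and the default assumes a runtime with a global `fetch` (browsers, Node 18+).

```javascript
// Polls GET /model/status until the reported state is "ready".
// All option names and defaults here are illustrative assumptions.
async function waitUntilReady({
  baseUrl = 'http://localhost:8080',
  intervalMs = 2000,
  maxPolls = 150,
  fetchFn = fetch,
  sleepFn = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let poll = 0; poll < maxPolls; poll++) {
    const resp = await fetchFn(`${baseUrl}/model/status`);
    if (!resp.ok) throw new Error(`status check failed: ${resp.status}`);
    const status = await resp.json();
    if (status.state === 'ready') return status; // includes loaded_at and VRAM stats
    await sleepFn(intervalMs);
  }
  throw new Error(`model not ready after ${maxPolls} polls`);
}
```

A typical use would be to send `POST /model/load` first and then `await waitUntilReady()` before submitting jobs.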
---
### `POST /model/unload`
Unload the model from GPU memory immediately, freeing VRAM.
```bash
curl -X POST http://localhost:8080/model/unload
```
Returns `200 OK` regardless of current state.
---
### `GET /model/events` — Model SSE stream
Subscribe to model lifecycle events via Server-Sent Events.
```bash
curl -N http://localhost:8080/model/events
```
**Event types:**
```
event: model_loading
data: {"type":"model_loading"}
event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}
event: model_unloaded
data: {"type":"model_unloaded"}
event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
```
**JavaScript example:**
```javascript
const es = new EventSource('/model/events');
es.addEventListener('model_ready', () => {
  console.log('Model loaded, ready to transcribe');
});
es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
```
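
Clients without `EventSource` (for example, a script reading the raw response body) need to split the stream into events themselves. The following is a rough sketch under stated assumptions: frames are separated by blank lines as in the SSE wire format, every `data:` payload is JSON as in the event list above, and the input contains only complete frames. `parseSseEvents` is a hypothetical helper name, and a production client would also need to buffer partial frames across chunks.

```javascript
// Splits buffered SSE text into { event, data } objects.
// Handles only the "event:" and "data:" fields used by /model/events.
function parseSseEvents(text) {
  const events = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let event = 'message'; // SSE default when a frame has no "event:" field
    const dataLines = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    // Frames without data (such as the empty tail after the last blank line) are skipped.
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```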
---
### Webhooks for model events
When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:
| Event | Fired when |
|-------|-----------|
| `model_ready` | Model finishes loading (after GPU warmup) |
| `model_unloaded` | Model is freed from GPU memory |
**Webhook payload** (`Content-Type: application/json`):
```json
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }
```
Delivery is attempted up to 3 times, with exponential backoff between attempts (1s, then 2s).
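
To make the retry schedule concrete, here is a sketch of three attempts with doubling delays. It is written in JavaScript purely for illustration and is not the server's delivery code; `deliverWithRetry`, `send`, and the option names are hypothetical.

```javascript
// Calls send() up to maxAttempts times, doubling the wait after each failure.
// With the defaults this waits 1s after the first failure and 2s after the second.
async function deliverWithRetry(send, {
  maxAttempts = 3,
  baseDelayMs = 1000,
  sleepFn = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      // No wait after the final attempt: 3 attempts means 2 delays.
      if (attempt < maxAttempts - 1) await sleepFn(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Because delivery can give up after the third attempt, receivers that must not miss a transition may want to reconcile against `GET /model/status` periodically.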
---
### Handling 503 Model Not Ready
When you submit a job and the model is not yet loaded, you receive `503 Service Unavailable` with a `Retry-After` header:
```
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
```
| State at rejection | `retry_after_secs` | Meaning |
|---|---|---|
| `unloaded` | 30 | Load was triggered; retry after ~30s |
| `loading` | 10 | Check again in 10s |
| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |
Rejecting a job while the model is `unloaded` **automatically triggers a load**, so you do not need to call `POST /model/load` separately.
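
The table above amounts to a lookup from state to retry delay. A client-side mirror of it might look like the sketch below, which can serve as a fallback when the `Retry-After` header is unavailable. The function name is hypothetical, and the default of 30 for `gpuPollIntervalSecs` simply reuses the documented default of `GPU_POLL_INTERVAL_SECS`.

```javascript
// Mirrors the retry_after_secs table for a given rejection state.
function retryAfterSecs(state, gpuPollIntervalSecs = 30) {
  switch (state) {
    case 'unloaded':
      return 30; // a load was just triggered
    case 'loading':
      return 10; // load already in progress
    case 'waiting_for_gpu':
      return gpuPollIntervalSecs; // server retries VRAM allocation on this interval
    default:
      throw new Error(`no retry delay defined for state: ${state}`);
  }
}
```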
**Recommended client pattern:**
```javascript
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Retry-After is in seconds; fall back to 30 if it is missing or unparsable.
      const parsed = parseInt(resp.headers.get('Retry-After') ?? '', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state}: retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    // Any other status is treated as non-retryable.
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
```
---
## API Reference
The interactive Swagger UI is available at `http://localhost:8080/docs`.
### `POST /jobs` — Submit a transcription job
Accepts a multipart/form-data body.
@@ -249,11 +439,12 @@ curl http://localhost:8080/health
"gpu_name": "NVIDIA GeForce RTX 2080",
"vram_total_mb": 8192,
"model": "large-v3",
"queue_depth": 0,
"model_state": "ready"
}
```
`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running). `model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).
---
@@ -340,6 +531,11 @@ curl -X POST http://localhost:8080/jobs \
## Troubleshooting
### Server returns `503 model_not_ready`
- The model starts unloaded. Call `POST /model/load` explicitly, or just retry the job submission — rejection automatically triggers a load.
- If state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.
### Server returns 0 segments
- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`