feat: dynamic model loading/unloading with GPU polling
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 8m41s
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
New routes:
GET /model/status — current state + VRAM stats
POST /model/load — trigger load (idempotent)
POST /model/unload — immediate unload
GET /model/events — SSE stream of model lifecycle events
New env vars:
IDLE_TIMEOUT_SECS (default 300)
GPU_POLL_INTERVAL_SECS (default 30)
Tests:
tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
SSE events, webhooks, concurrency, unload-during-load)
tests/test_idle_timeout.sh — 5 tests with short IDLE_TIMEOUT_SECS=5
test_all.sh updated: loads model before job submission, asserts
model_state in /health, adds POST /model/unload at end
Docs:
docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
updated /health response shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
docs/USAGE.md
@@ -66,6 +66,8 @@ The bundled `docker-compose.yml` mounts named volumes for data and models and se
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
| `IDLE_TIMEOUT_SECS` | `300` | Seconds of idle before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |

### Note on CUDA device ordering
Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
@@ -76,6 +78,194 @@ Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host

The interactive Swagger UI is available at `http://localhost:8080/docs`.

---

## Model Lifecycle Management

The model is **unloaded** at startup (lazy loading). It is loaded into GPU memory on the first job submission or via `POST /model/load`, and is automatically unloaded after `IDLE_TIMEOUT_SECS` seconds of inactivity.

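The idle rule can be sketched as a small predicate. This is only an illustration, not the server's implementation: the name `shouldUnload` and its millisecond timestamps are hypothetical, and the sketch assumes the documented behavior that `IDLE_TIMEOUT_SECS=0` disables auto-unload.

```javascript
// Hypothetical helper: decides whether an idle model should be unloaded.
// lastActivityMs and nowMs are epoch milliseconds;
// idleTimeoutSecs mirrors IDLE_TIMEOUT_SECS (default 300).
function shouldUnload(lastActivityMs, nowMs, idleTimeoutSecs = 300) {
  // A timeout of 0 disables auto-unload entirely.
  if (idleTimeoutSecs === 0) return false;
  return nowMs - lastActivityMs >= idleTimeoutSecs * 1000;
}
```

Whether the real timer fires exactly at the threshold or on the next tick after it is not stated here, so the `>=` comparison is an assumption.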

### Model State Machine

```
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                        └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
```
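As a reading aid, the diagram above can be transcribed into a transition table. The event labels (`load_requested`, `vram_full`, and so on) are hypothetical names chosen for this sketch, and any transition the diagram does not draw is treated as a no-op, which is an assumption rather than documented behavior.

```javascript
// Transition table transcribed from the state diagram; event names are hypothetical.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state when no transition is defined.
function nextState(state, event) {
  const next = TRANSITIONS[state]?.[event];
  return next ?? state;
}
```

For example, `nextState('loading', 'vram_full')` yields `'waiting_for_gpu'`, while `nextState('ready', 'retry_ok')` leaves the state at `'ready'`.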

### `GET /model/status`

Returns the current model state and VRAM statistics.

```bash
curl http://localhost:8080/model/status
```

**When unloaded:**
```json
{ "state": "unloaded" }
```

**When loading:**
```json
{ "state": "loading" }
```

**When ready:**
```json
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
```

**When waiting for VRAM:**
```json
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
```
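A client might turn these payloads into a one-line summary. The sketch below assumes only the fields shown in the examples above; `describeStatus` is a hypothetical helper, not part of the API.

```javascript
// Hypothetical formatter for the /model/status payloads shown above.
function describeStatus(status) {
  switch (status.state) {
    case 'ready': {
      const pct = Math.round((status.vram_used_mb / status.vram_total_mb) * 100);
      return `ready since ${status.loaded_at}, VRAM ${status.vram_used_mb}/${status.vram_total_mb} MB (${pct}%)`;
    }
    case 'waiting_for_gpu': {
      // How much more free VRAM is needed before a load can succeed.
      const shortfall = status.vram_needed_mb - status.vram_free_mb;
      return `waiting for GPU: ${shortfall} MB short, next retry in ${status.retry_in_secs}s`;
    }
    default:
      // "unloaded" and "loading" carry no extra fields in the examples.
      return status.state;
  }
}
```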

---

### `POST /model/load`

Requests that the model be loaded. The call is idempotent: if the model is already loading or ready, it returns immediately.

```bash
curl -X POST http://localhost:8080/model/load
```

- Returns `202 Accepted` with `{"status":"load_initiated"}` when a load is triggered
- Returns `200 OK` with `{"status":"already_ready"}` when the model is already ready
- Poll `GET /model/status` or subscribe to `GET /model/events` to learn when the model is ready

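The load-then-poll flow described above could look roughly like this. It is a sketch under assumptions: `loadAndWait`, `pollMs`, and `maxPolls` are hypothetical, and `fetchImpl` and `sleep` are injectable only so the logic can be exercised without a running server.

```javascript
// Sketch: trigger a load, then poll /model/status until the model is ready.
async function loadAndWait(baseUrl, {
  fetchImpl = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  pollMs = 2000,   // hypothetical polling interval
  maxPolls = 150,  // hypothetical upper bound on polls
} = {}) {
  // Both documented responses (202 and 200) are successes, so any 2xx is accepted.
  const resp = await fetchImpl(`${baseUrl}/model/load`, { method: 'POST' });
  if (!resp.ok) throw new Error(`Load request failed: ${resp.status}`);

  for (let i = 0; i < maxPolls; i++) {
    const statusResp = await fetchImpl(`${baseUrl}/model/status`);
    const status = await statusResp.json();
    if (status.state === 'ready') return status;
    await sleep(pollMs);
  }
  throw new Error('Timed out waiting for the model to become ready');
}
```

Usage would be along the lines of `await loadAndWait('http://localhost:8080')`. Subscribing to `GET /model/events` avoids the polling loop altogether.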

---

### `POST /model/unload`

Unload the model from GPU memory immediately, freeing VRAM.

```bash
curl -X POST http://localhost:8080/model/unload
```

Returns `200 OK` regardless of current state.

---

### `GET /model/events` — Model SSE stream

Subscribe to model lifecycle events via Server-Sent Events.

```bash
curl -N http://localhost:8080/model/events
```

**Event types:**

```
event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
```

**JavaScript example:**
```javascript
const es = new EventSource('/model/events');

es.addEventListener('model_ready', () => {
  console.log('Model loaded, ready to transcribe');
});

es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
```
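Where `EventSource` is not available, the stream can be read as plain text and split into frames. The parser below is a minimal sketch for the event format shown above. `parseSseFrames` is a hypothetical helper, and it deliberately ignores SSE comments, `id:` and `retry:` fields, and frames split across network chunks.

```javascript
// Minimal parser for a complete chunk of SSE text.
// Frames are separated by a blank line; each frame has "event:" and "data:" fields.
function parseSseFrames(text) {
  const events = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default when no "event:" field is present
    const dataLines = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) type = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    // Frames without data (for example the trailing empty one) are skipped.
    if (dataLines.length > 0) {
      events.push({ type, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```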

---

### Webhooks for model events

When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:

| Event | Fired when |
|-------|-----------|
| `model_ready` | Model finishes loading (after GPU warmup) |
| `model_unloaded` | Model is freed from GPU memory |

**Webhook payload** (`Content-Type: application/json`):
```json
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }
```

Delivery is attempted up to 3 times, with exponential backoff between attempts (1s, then 2s).

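The retry policy can be pictured with the sketch below. It is not the server's code: `deliverWithRetry` and its option names are hypothetical, and `send` stands for any async function that rejects when a delivery attempt fails.

```javascript
// Sketch of "up to 3 attempts with exponential backoff (1s, 2s)".
async function deliverWithRetry(send, payload, {
  maxAttempts = 3,
  baseDelayMs = 1000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send(payload);
    } catch (err) {
      lastError = err;
      // Wait 1s after the first failure and 2s after the second; no wait after the last.
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Because at most three attempts are made, a receiver that is down for longer than a few seconds would miss the event under this policy, so clients may still want to check `GET /model/status` after reconnecting.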

---

### Handling 503 Model Not Ready

If you submit a job while the model is not ready, the server responds with `503 Service Unavailable` and a `Retry-After` header:

```
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
```

| State at rejection | `retry_after_secs` | Meaning |
|---|---|---|
| `unloaded` | 30 | A load was triggered; retry after about 30s |
| `loading` | 10 | A load is in progress; check again in 10s |
| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |

A job rejected while the model is `unloaded` **automatically triggers a load**, so you do not need to call `POST /model/load` separately.

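The table above amounts to a simple lookup. The function below is a hypothetical sketch of that mapping; `gpuPollIntervalSecs` stands in for the configured `GPU_POLL_INTERVAL_SECS`, and returning `null` for other states is an assumption of this sketch.

```javascript
// Mirrors the retry_after_secs table; not the server's actual code.
function retryAfterSecs(state, gpuPollIntervalSecs = 30) {
  switch (state) {
    case 'unloaded':
      return 30;
    case 'loading':
      return 10;
    case 'waiting_for_gpu':
      return gpuPollIntervalSecs;
    default:
      return null; // e.g. "ready": the job is accepted, so no retry hint applies
  }
}
```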

**Recommended client pattern:**
```javascript
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Retry-After is in seconds; fall back to 30 if the header is missing or unparsable.
      const parsed = Number.parseInt(resp.headers.get('Retry-After') ?? '', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state}; retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
```

---

## API Reference

The interactive Swagger UI is available at `http://localhost:8080/docs`.

### `POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.
@@ -249,11 +439,12 @@ curl http://localhost:8080/health
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}
```

`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running). `model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).

---

@@ -340,6 +531,11 @@ curl -X POST http://localhost:8080/jobs \

## Troubleshooting

### Server returns `503 model_not_ready`
- The model starts unloaded. Call `POST /model/load` explicitly, or just retry the job submission: a rejected submission automatically triggers a load.
- If the state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.

### Server returns 0 segments
- Check that you are **not** setting `language` to an empty string; omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`