whisper-rtx2080/docs/USAGE.md
mozempk b191fbe200
feat: dynamic model loading/unloading with GPU polling
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 17:57:20 +02:00

# Usage Guide
## Prerequisites
- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)
---
## Quick Start
### 1. Pull the image
```bash
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
```
### 2. Download a model
```bash
# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
-o ~/whisper-models/ggml-large-v3.bin
```
### 3. Start the server
```bash
docker run --rm --gpus all \
-p 8080:8080 \
-v ~/whisper-models:/models:ro \
-v whisper-data:/data \
-e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
git.sal.giize.com/mozempk/whisper-rtx2080:latest
```
### 4. Verify
```bash
curl http://localhost:8080/health
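# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0,"model_state":"unloaded"}
# model_state is "unloaded" right after startup; the model loads on the first job or POST /model/load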
```
---
## docker-compose
```bash
# Copy the compose file, configure volumes, then:
docker compose up -d
```
The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.
---
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `8080` | HTTP listen port |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
| `IDLE_TIMEOUT_SECS` | `300` | Seconds of idle before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |
### Note on CUDA device ordering
Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
---
## Model Lifecycle Management
The model starts **unloaded** on startup (lazy loading). It is loaded into GPU memory on the first job submission or via `POST /model/load`, and automatically unloaded after `IDLE_TIMEOUT_SECS` of inactivity.
### Model State Machine
```
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                          └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
```
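The diagram above can be read as a small transition table. The following is a minimal client-side sketch of it, not the server's implementation: the state names match the `state` values returned by `GET /model/status`, while the event names (`load_requested`, `vram_full`, and so on) are hypothetical labels chosen for illustration.

```javascript
// Transition table mirroring the diagram above.
// Event names are hypothetical; only the state names come from the API.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state if the event does not apply.
function nextState(state, event) {
  const next = TRANSITIONS[state]?.[event];
  return next ?? state;
}
```

A table like this can be handy for driving UI state from the lifecycle events without scattering `if` chains through client code.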
### `GET /model/status`
Returns the current model state and VRAM statistics.
```bash
curl http://localhost:8080/model/status
```
**When unloaded:**
```json
{ "state": "unloaded" }
```
**When loading:**
```json
{ "state": "loading" }
```
**When ready:**
```json
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
```
**When waiting for VRAM:**
```json
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
```
---
### `POST /model/load`
Request the model to be loaded. Idempotent — if already loading or ready, returns immediately.
```bash
curl -X POST http://localhost:8080/model/load
```
- Returns `202 Accepted` with `{"status":"load_initiated"}` when load is triggered
- Returns `200 OK` with `{"status":"already_ready"}` when model is already ready
- Poll `GET /model/status` or subscribe to `GET /model/events` to know when ready
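A polling loop for the load-then-wait flow might look like the sketch below. The helper name `waitForReady`, its options, and the 2-second default interval are assumptions for illustration. The status fetcher and the sleep function are injected so the loop can be exercised without a running server.

```javascript
// Poll the model status until it reports "ready".
// getStatus: async function returning the parsed GET /model/status body.
// Options (all hypothetical): sleep, pollSecs, maxPolls.
async function waitForReady(getStatus, { sleep, pollSecs = 2, maxPolls = 150 } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let i = 0; i < maxPolls; i++) {
    const status = await getStatus();
    if (status.state === 'ready') return status;
    // While waiting for VRAM, the status body says when the server retries next.
    const delaySecs =
      status.state === 'waiting_for_gpu' ? (status.retry_in_secs ?? pollSecs) : pollSecs;
    await wait(delaySecs * 1000);
  }
  throw new Error(`Model not ready after ${maxPolls} polls`);
}

// Example wiring against a local server:
// await fetch('http://localhost:8080/model/load', { method: 'POST' });
// await waitForReady(() => fetch('http://localhost:8080/model/status').then((r) => r.json()));
```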
---
### `POST /model/unload`
Unload the model from GPU memory immediately, freeing VRAM.
```bash
curl -X POST http://localhost:8080/model/unload
```
Returns `200 OK` regardless of current state.
---
### `GET /model/events` — Model SSE stream
Subscribe to model lifecycle events via Server-Sent Events.
```bash
curl -N http://localhost:8080/model/events
```
**Event types:**
```
event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
```
**JavaScript example:**
```javascript
const es = new EventSource('/model/events');
es.addEventListener('model_ready', () => {
  console.log('Model loaded, ready to transcribe');
});
es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
```
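Clients without `EventSource` (for example, a script reading the raw response body) need to split the stream themselves. The sketch below is a simplified parser, assuming events are separated by a blank line as in the SSE wire format and that every `data:` payload is JSON as in the event list above. The function name `parseSse` is hypothetical, and the parser ignores other SSE fields such as `id:` and `retry:`.

```javascript
// Parse raw SSE text into an array of { event, data } objects.
// Simplified: handles only `event:` and `data:` fields with JSON payloads.
function parseSse(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = 'message'; // default event name when no `event:` field is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```

A real streaming client would also buffer partial chunks until a blank line arrives; that part is left out here to keep the sketch short.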
---
### Webhooks for model events
When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:
| Event | Fired when |
|-------|-----------|
| `model_ready` | Model finishes loading (after GPU warmup) |
| `model_unloaded` | Model is freed from GPU memory |
**Webhook payloads** (`Content-Type: application/json`), one JSON object per event:
```json
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }
```
Delivery is attempted up to 3 times with exponential backoff (1s, 2s).
---
### Handling 503 Model Not Ready
When you submit a job and the model is not yet loaded, you receive `503 Service Unavailable` with a `Retry-After` header:
```
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
```
| State at rejection | `retry_after_secs` | Meaning |
|---|---|---|
| `unloaded` | 30 | Load was triggered; retry after ~30s |
| `loading` | 10 | Check again in 10s |
| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |
A job rejection when the model is `unloaded` **automatically triggers a load** — you do not need to call `POST /model/load` separately.
**Recommended client pattern:**
```javascript
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Fall back to 30s if the header is missing or not a number,
      // so a NaN delay never turns the loop into an immediate retry.
      const parsed = parseInt(resp.headers.get('Retry-After') ?? '30', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state}, retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
```
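The server documents `Retry-After` as a number of seconds, but HTTP also allows the header to carry a date, which an intermediate proxy could in principle produce. The helper below is a defensive sketch that accepts both forms. The name `parseRetryAfter` and its fallback parameter are assumptions, not part of this service.

```javascript
// Convert a Retry-After header value to a delay in seconds.
// Accepts delay-seconds ("30") or an HTTP-date; returns fallbackSecs otherwise.
function parseRetryAfter(headerValue, fallbackSecs = 30, now = Date.now()) {
  if (headerValue == null) return fallbackSecs;
  const value = headerValue.trim();
  if (value === '') return fallbackSecs;
  if (/^\d+$/.test(value)) return Number(value);
  const when = Date.parse(value);
  if (Number.isNaN(when)) return fallbackSecs;
  // Never return a negative delay if the date is already in the past.
  return Math.max(0, Math.ceil((when - now) / 1000));
}
```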
---
## API Reference
The interactive Swagger UI is available at `http://localhost:8080/docs`.
### `POST /jobs` — Submit a transcription job
Accepts a multipart/form-data body.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
| `webhook_url` | string | — | URL to POST the completed job to |
**Response:** `202 Accepted`
```json
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
```
**Example:**
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3" \
-F "language=en"
```
Auto-detect language:
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3"
```
With webhook:
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@recording.mp3" \
-F "webhook_url=https://myapp.example.com/transcription-done"
```
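The same request can be built from JavaScript. The sketch below assumes a runtime with global `FormData` and `Blob` (browsers, or Node.js 18 and later). The helper name `buildJobForm` and its option names are hypothetical. It omits optional fields that are empty, because the `language` field must be left out entirely for auto-detection rather than sent as an empty string.

```javascript
// Build the multipart body for POST /jobs.
// audio: a Blob or File; filename: name reported to the server.
function buildJobForm({ audio, filename, language, task, webhookUrl }) {
  const form = new FormData();
  form.append('audio', audio, filename);
  // Only send optional fields when they have a non-empty value.
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

// Example submission (URL assumes the Quick Start setup):
// const form = buildJobForm({ audio: fileInput.files[0], filename: 'recording.mp3' });
// const resp = await fetch('http://localhost:8080/jobs', { method: 'POST', body: form });
```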
---
### `GET /jobs/{id}` — Poll job status
```bash
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```
**Response while running:**
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}
```
**Response when done:**
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}
```
**Job statuses:**
| Status | Meaning |
|--------|---------|
| `queued` | Waiting for the GPU worker to pick it up |
| `running` | Being transcribed right now |
| `done` | Complete; `segments` array is populated |
| `failed` | Error occurred; `error` field contains the message |
| `cancelled` | Cancelled via DELETE before or during processing |
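For clients that poll instead of streaming, the status table suggests a simple loop: keep fetching until the job reaches `done`, `failed`, or `cancelled`. The sketch below assumes those three are the only terminal statuses. The helper names and the 2-second default interval are hypothetical, and the job fetcher is injected so the loop can be tried without a server.

```javascript
// Statuses after which a job no longer changes.
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// Poll until the job reaches a terminal status, then return the final job object.
// getJob: async function returning the parsed GET /jobs/{id} body.
async function pollJob(getJob, { sleep, intervalMs = 2000 } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (;;) {
    const job = await getJob();
    if (isTerminal(job.status)) return job;
    await wait(intervalMs);
  }
}

// Example wiring:
// const job = await pollJob(() => fetch(`http://localhost:8080/jobs/${id}`).then((r) => r.json()));
// if (job.status === 'failed') console.error(job.error);
```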
---
### `GET /jobs/{id}/stream` — Real-time progress via SSE
Subscribe to a Server-Sent Events stream for live progress updates.
```bash
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
```
**Event types:**
```
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}
```
```
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
```
- `percent`: overall progress, from 0 to 100
- `chunk` / `chunks_total`: which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single `done` event immediately
**JavaScript example:**
```javascript
const es = new EventSource(`/jobs/${jobId}/stream`);
es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});
es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});
es.addEventListener('error', (e) => {
  // EventSource also fires 'error' on connection problems, with no data payload.
  if (!e.data) {
    console.error('Stream connection error');
    return;
  }
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});
```
---
### `DELETE /jobs/{id}` — Cancel a job
Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).
```bash
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```
Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).
---
### `GET /health` — Service health
```bash
curl http://localhost:8080/health
```
```json
{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}
```
`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running). `model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).
---
## Output Format
The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):
```json
{
  "index": 0,
  "start": 12.34,
  "end": 15.78,
  "text": " This is a transcribed sentence.",
  "words": [
    { "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}
```
Notes:
- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
- `text` typically includes a leading space (whisper's tokenisation convention)
- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
- `probability` is the model's confidence for each word token (0 to 1)
- All timestamps are in the source language's timeline — no re-mapping occurs
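As an illustration of consuming this structure, the sketch below joins the segment texts into one transcript and lists words whose `probability` falls below a threshold, which can help flag passages worth a manual check. The function names and the default threshold of `0.6` are hypothetical choices, not values recommended by this service.

```javascript
// Concatenate segment texts; each text already carries its leading space.
function fullText(segments) {
  return segments.map((s) => s.text).join('').trim();
}

// Collect words below the confidence threshold, tolerating segments without `words`.
function lowConfidenceWords(segments, threshold = 0.6) {
  return segments
    .flatMap((s) => s.words ?? [])
    .filter((w) => w.probability < threshold)
    .map((w) => ({ text: w.text.trim(), start: w.start, end: w.end, probability: w.probability }));
}
```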
---
## Webhook Payload
When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure). Headers: `Content-Type: application/json`.
Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.
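Because a registered `webhook_url` receives both job payloads and the model lifecycle payloads described earlier, a receiver has to tell them apart. The sketch below is one way to do that, assuming model events always carry a `type` starting with `model_` and job payloads always carry string `id` and `status` fields. The function name `classifyWebhook` and its return shape are hypothetical. Since deliveries are retried, a handler built on this should also tolerate receiving the same payload more than once.

```javascript
// Classify an already-parsed webhook body.
function classifyWebhook(payload) {
  if (payload && typeof payload.type === 'string' && payload.type.startsWith('model_')) {
    return { kind: 'model_event', event: payload.type };
  }
  if (payload && typeof payload.id === 'string' && typeof payload.status === 'string') {
    return { kind: 'job', jobId: payload.id, status: payload.status };
  }
  // Anything else is left for the caller to log or reject.
  return { kind: 'unknown' };
}
```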
---
## Building from Source
```bash
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .
# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
--build-arg CUDA_VERSION=11.8.0 \
--build-arg CUDNN_TAG=cudnn8 \
--build-arg UBUNTU_VERSION=20.04 \
-t whisper-rtx2080:cu118 .
```
Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.
### Build-time ARGs
| ARG | Default | Notes |
|-----|---------|-------|
| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |
---
## Working with Audio Files
The server accepts any format ffmpeg understands. To prepare audio manually:
```bash
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3
# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm
# Submit
curl -X POST http://localhost:8080/jobs \
-F "audio=@audio.mp3"
```
---
## Troubleshooting
### Server returns `503 model_not_ready`
- The model starts unloaded. Call `POST /model/load` explicitly, or just retry the job submission — rejection automatically triggers a load.
- If state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.
### Server returns 0 segments
- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`
- Check logs for `whisper.cpp` output: the auto-detected language and confidence should appear as `info` level logs
### Server returns `failed` with ffmpeg error
- Ensure `ffmpeg` is installed in the container (it is by default)
- Verify the audio file is a valid media file
### CUDA out-of-memory
- `ggml-large-v3.bin` requires ~5-6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: `nvidia-smi`
### Wrong GPU being used
- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)