- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET /model/status: current state + VRAM stats
  POST /model/load: trigger load (idempotent)
  POST /model/unload: immediate unload
  GET /model/events: SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS (default 300)
  GPU_POLL_INTERVAL_SECS (default 30)

Tests:
  tests/test_model_lifecycle.sh: 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh: 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Usage Guide

## Prerequisites

- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU. The image is optimised for the RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)

---

## Quick Start

### 1. Pull the image

```bash
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 2. Download a model

```bash
# large-v3 recommended (~3 GB download)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin
```

### 3. Start the server

```bash
docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 4. Verify

```bash
curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0,"model_state":"unloaded"}
```

---

## docker-compose

```bash
# Copy the compose file, configure volumes, then:
docker compose up -d
```

The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `8080` | HTTP listen port |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
| `IDLE_TIMEOUT_SECS` | `300` | Seconds of idle before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |

### Note on CUDA device ordering
Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.

---

## Model Lifecycle Management

The model starts **unloaded** on startup (lazy loading). It is loaded into GPU memory on the first job submission or via `POST /model/load`, and automatically unloaded after `IDLE_TIMEOUT_SECS` of inactivity.

### Model State Machine

```
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                          └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
```

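The diagram can also be read as a transition table. The following is a minimal client-side sketch for reasoning about the states, not the server's implementation. The event names (`load_requested`, `load_ok`, `vram_full`, `retry_ok`, `idle_timeout`, `unload_requested`) are labels invented for this example, and the state strings follow the lowercase values returned by `GET /model/status`.

```javascript
// Edges copied from the state diagram; event names are illustrative only.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_ok: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state if the diagram shows no such edge.
function nextState(state, event) {
  const edges = TRANSITIONS[state];
  if (!edges) throw new Error(`Unknown state: ${state}`);
  return edges[event] ?? state;
}

console.log(nextState('unloaded', 'load_requested')); // loading
console.log(nextState('loading', 'vram_full'));       // waiting_for_gpu
```

Only the edges drawn in the diagram are encoded; any other state and event combination leaves the state unchanged in this sketch.
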
### `GET /model/status`

Returns the current model state and VRAM statistics.

```bash
curl http://localhost:8080/model/status
```

**When unloaded:**
```json
{ "state": "unloaded" }
```

**When loading:**
```json
{ "state": "loading" }
```

**When ready:**
```json
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
```

**When waiting for VRAM:**
```json
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
```

---

### `POST /model/load`

Requests a model load. The call is idempotent: if the model is already loading or ready, it returns immediately.

```bash
curl -X POST http://localhost:8080/model/load
```

- Returns `202 Accepted` with `{"status":"load_initiated"}` when a load is triggered
- Returns `200 OK` with `{"status":"already_ready"}` when the model is already ready
- Poll `GET /model/status` or subscribe to `GET /model/events` to learn when the model is ready

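The load-then-poll flow described in the list above might look like the following sketch. The function name `loadModelAndWait`, the `baseUrl` argument, and the `intervalMs`, `timeoutMs` and `fetchImpl` options are assumptions made for this example; only the two routes and the `state` field come from this guide. It relies on a global `fetch` (browsers, Node 18+) unless `fetchImpl` is supplied.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Triggers a load, then polls /model/status until the model reports "ready".
async function loadModelAndWait(baseUrl, opts = {}) {
  const { intervalMs = 2000, timeoutMs = 300000, fetchImpl = fetch } = opts;

  // Both documented responses (202 load_initiated, 200 already_ready) are 2xx.
  const loadResp = await fetchImpl(`${baseUrl}/model/load`, { method: 'POST' });
  if (!loadResp.ok) throw new Error(`Load request failed: ${loadResp.status}`);

  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await (await fetchImpl(`${baseUrl}/model/status`)).json();
    if (status.state === 'ready') return status;
    await sleep(intervalMs); // still loading or waiting_for_gpu
  }
  throw new Error('Timed out waiting for the model to become ready');
}

// Usage (not executed here):
// loadModelAndWait('http://localhost:8080').then((s) => console.log(s.loaded_at));
```
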
---

### `POST /model/unload`

Unload the model from GPU memory immediately, freeing VRAM.

```bash
curl -X POST http://localhost:8080/model/unload
```

Returns `200 OK` regardless of current state.

---

### `GET /model/events` — Model SSE stream

Subscribe to model lifecycle events via Server-Sent Events.

```bash
curl -N http://localhost:8080/model/events
```

**Event types:**

```
event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
```

**JavaScript example:**
```javascript
const es = new EventSource('/model/events');

es.addEventListener('model_ready', () => {
  console.log('Model loaded — ready to transcribe');
});

es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
```

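`EventSource` is a browser API. For clients without it, the wire format shown under "Event types" can be parsed by hand. The following is a simplified sketch: the function name `parseSseFrames` is hypothetical, it handles only the `event:` and `data:` fields, and it expects complete frames in one string. A real streaming client would also need to buffer partial frames between network reads.

```javascript
// Splits buffered SSE text into { event, data } objects.
// Simplified: only `event:` and `data:` fields, LF or CRLF line endings.
function parseSseFrames(text) {
  const frames = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = 'message'; // SSE default when no `event:` field is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
    }
    if (dataLines.length > 0) {
      // This API sends JSON in every data field, so parse it here.
      frames.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return frames;
}

const sample =
  'event: model_loading\ndata: {"type":"model_loading"}\n\n' +
  'event: model_ready\ndata: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}\n\n';

for (const { event, data } of parseSseFrames(sample)) {
  console.log(event, data);
}
```
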
---

### Webhooks for model events

When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:

| Event | Fired when |
|-------|-----------|
| `model_ready` | Model finishes loading (after GPU warmup) |
| `model_unloaded` | Model is freed from GPU memory |

**Webhook payload** (`Content-Type: application/json`):
```json
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }
```

Delivery is attempted up to 3 times with exponential backoff (1s, 2s).

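Because a registered `webhook_url` receives model lifecycle payloads in addition to the completed-job payload, a receiver has to tell the two apart. The sketch below does so by checking for the `type` field shown above and otherwise falling back to a job's `status` field. The names `classifyWebhook` and `createWebhookServer`, the port in the usage comment, and the idea that a 2xx response counts as a successful delivery are assumptions of this example, not documented server behaviour.

```javascript
const http = require('node:http');

// Model lifecycle payloads carry a `type` field; completed-job payloads are
// full job objects with a `status` field.
function classifyWebhook(body) {
  if (body.type === 'model_ready' || body.type === 'model_unloaded') {
    return { kind: 'model', event: body.type };
  }
  if (typeof body.status === 'string') {
    return { kind: 'job', event: body.status };
  }
  return { kind: 'unknown', event: null };
}

// Builds (but does not start) an HTTP server that accepts webhook POSTs.
function createWebhookServer(onEvent) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405).end();
      return;
    }
    let raw = '';
    req.on('data', (chunk) => { raw += chunk; });
    req.on('end', () => {
      try {
        onEvent(classifyWebhook(JSON.parse(raw)));
        res.writeHead(204).end(); // assumption: any 2xx is treated as delivered
      } catch {
        res.writeHead(400).end(); // body was not valid JSON
      }
    });
  });
}

// Usage (not executed here):
// createWebhookServer(console.log).listen(9000);
```
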
---

### Handling 503 Model Not Ready

When you submit a job and the model is not yet loaded, you receive `503 Service Unavailable` with a `Retry-After` header:

```
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
```

| State at rejection | `retry_after_secs` | Meaning |
|---|---|---|
| `unloaded` | 30 | Load was triggered; retry after ~30s |
| `loading` | 10 | Check again in 10s |
| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |

A job rejection when the model is `unloaded` **automatically triggers a load**, so you do not need to call `POST /model/load` separately.

**Recommended client pattern:**
```javascript
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Retry-After is in seconds; fall back to 30 if it is missing or unparsable
      const parsed = parseInt(resp.headers.get('Retry-After') ?? '', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state} — retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
```

---

## API Reference

The interactive Swagger UI is available at `http://localhost:8080/docs`.

### `POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
| `webhook_url` | string | — | URL to POST the completed job to |

**Response:** `202 Accepted`
```json
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
```

**Example:**
```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"
```

Auto-detect language:
```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"
```

With webhook:
```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
```

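The same submission can be made from JavaScript. This is a sketch under a few assumptions: `buildJobForm`, `submitJob`, `baseUrl` and `fetchImpl` are names chosen for this example, and it relies on the global `FormData`, `Blob` and `fetch` available in browsers and Node 18+. Optional fields are left out of the form rather than sent empty, which matches the advice elsewhere in this guide to omit `language` for auto-detection.

```javascript
// Builds the multipart body for POST /jobs. Only `audio` is required.
function buildJobForm(audioBlob, filename, { language, task, webhookUrl } = {}) {
  const form = new FormData();
  form.append('audio', audioBlob, filename);
  if (language) form.append('language', language); // omit entirely to auto-detect
  if (task) form.append('task', task);             // 'transcribe' or 'translate'
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

// Submits the form and returns the job id from the 202 response.
async function submitJob(baseUrl, form, fetchImpl = fetch) {
  const resp = await fetchImpl(`${baseUrl}/jobs`, { method: 'POST', body: form });
  if (resp.status !== 202) throw new Error(`Submit failed: ${resp.status}`);
  const { job_id } = await resp.json();
  return job_id;
}

// Usage (not executed here):
// const form = buildJobForm(fileInput.files[0], 'recording.mp3', { language: 'en' });
// const jobId = await submitJob('http://localhost:8080', form);
```
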
---

### `GET /jobs/{id}` — Poll job status

```bash
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

**Response while running:**
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}
```

**Response when done:**
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}
```

**Job statuses:**

| Status | Meaning |
|--------|---------|
| `queued` | Waiting for the GPU worker to pick it up |
| `running` | Being transcribed right now |
| `done` | Complete; `segments` array is populated |
| `failed` | Error occurred; `error` field contains the message |
| `cancelled` | Cancelled via DELETE before or during processing |

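A polling client only needs to know which statuses in the table are terminal. The sketch below treats `done`, `failed` and `cancelled` as terminal, matching the states listed for `DELETE /jobs/{id}`. The names `isTerminal` and `waitForJob`, and the `intervalMs` and `fetchImpl` options, are assumptions of this example.

```javascript
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls GET /jobs/{id} until the job reaches a terminal status.
async function waitForJob(baseUrl, jobId, { intervalMs = 2000, fetchImpl = fetch } = {}) {
  for (;;) {
    const resp = await fetchImpl(`${baseUrl}/jobs/${jobId}`);
    if (!resp.ok) throw new Error(`Status check failed: ${resp.status}`);
    const job = await resp.json();
    if (isTerminal(job.status)) {
      // Failed jobs carry the message in `error`
      if (job.status === 'failed') throw new Error(`Job failed: ${job.error}`);
      return job;
    }
    await sleep(intervalMs); // queued or running
  }
}

// Usage (not executed here):
// const job = await waitForJob('http://localhost:8080', jobId);
```
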
---

### `GET /jobs/{id}/stream` — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.

```bash
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
```

**Event types:**

```
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}
```

A failed job emits an `error` event:

```
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
```

- `percent`: overall progress, 0–100
- `chunk` / `chunks_total`: which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single `done` event immediately

**JavaScript example:**
```javascript
const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

es.addEventListener('error', (e) => {
  // EventSource also fires a native `error` without `data` on connection
  // problems; skip those so JSON.parse does not throw.
  if (!e.data) return;
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});
```

---

### `DELETE /jobs/{id}` — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).

```bash
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).

---

### `GET /health` — Service health

```bash
curl http://localhost:8080/health
```

```json
{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}
```

`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running). `model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).

---

## Output Format

The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):

```json
{
  "index": 0,
  "start": 12.34,
  "end": 15.78,
  "text": " This is a transcribed sentence.",
  "words": [
    { "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}
```

Notes:
- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
- `text` typically includes a leading space (whisper's tokenisation convention)
- `words` contains token-level timestamps; it may be empty if flash attention is enabled (it is disabled by default)
- `probability` is the model's confidence for each word token (0–1)
- All timestamps are on the source audio's timeline; no re-mapping occurs

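Since `start` and `end` are absolute seconds, the segments map directly onto subtitle formats. As an illustration, the sketch below converts a `segments` array to SRT text. The function names `srtTimestamp` and `segmentsToSrt` are invented for this example; the server itself is only documented to return JSON.

```javascript
// Formats seconds as an SRT timestamp, e.g. 3661.5 -> "01:01:01,500".
function srtTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Converts the `segments` array of a completed job into SRT text.
// Leading spaces in `text` (whisper's convention) are trimmed per cue.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

const example = [
  { index: 0, start: 12.34, end: 15.78, text: ' This is a transcribed sentence.' },
];
console.log(segmentsToSrt(example));
// 1
// 00:00:12,340 --> 00:00:15,780
// This is a transcribed sentence.
```
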
---

## Webhook Payload

When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure). Headers: `Content-Type: application/json`.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.

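The delays listed above double each time, starting at one second. The sketch below reproduces that schedule and its total, which gives a rough upper bound on how long a temporarily unavailable receiver has to come back. The function name `backoffDelaysSecs` is hypothetical, and whether each delay comes before or after an attempt is not stated in this guide, so treat the total as an estimate.

```javascript
// Delay n (0-based) is base * 2^n seconds.
function backoffDelaysSecs(count, baseSecs = 1) {
  return Array.from({ length: count }, (_, n) => baseSecs * 2 ** n);
}

const delays = backoffDelaysSecs(5);
console.log(delays);                            // [ 1, 2, 4, 8, 16 ]
console.log(delays.reduce((a, b) => a + b, 0)); // 31
```
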
---

## Building from Source

```bash
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

Cross-compiling without a CUDA-capable host is not supported, because the build requires `nvcc` to compile the CUDA kernels.

### Build-time ARGs

| ARG | Default | Notes |
|-----|---------|-------|
| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |

---

## Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

```bash
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"
```

---

## Troubleshooting

### Server returns `503 model_not_ready`
- The model starts unloaded. Call `POST /model/load` explicitly, or just retry the job submission; a rejection automatically triggers a load.
- If state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.

### Server returns 0 segments
- Check that you are **not** setting `language` to an empty string; omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`
- Check logs for `whisper.cpp` output: the auto-detected language and confidence should appear as `info` level logs

### Server returns `failed` with ffmpeg error
- Ensure `ffmpeg` is installed in the container (it is by default)
- Verify the audio file is a valid media file

### CUDA out-of-memory
- `ggml-large-v3.bin` requires ~5–6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: `nvidia-smi`

### Wrong GPU being used
- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)