- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
New routes:
GET /model/status — current state + VRAM stats
POST /model/load — trigger load (idempotent)
POST /model/unload — immediate unload
GET /model/events — SSE stream of model lifecycle events
New env vars:
IDLE_TIMEOUT_SECS (default 300)
GPU_POLL_INTERVAL_SECS (default 30)
Tests:
tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
SSE events, webhooks, concurrency, unload-during-load)
tests/test_idle_timeout.sh — 5 tests with short IDLE_TIMEOUT_SECS=5
test_all.sh updated: loads model before job submission, asserts
model_state in /health, adds POST /model/unload at end
Docs:
docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
updated /health response shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Usage Guide
Prerequisites
- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. ggml-large-v3.bin)
Quick Start
1. Pull the image
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
2. Download a model
# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
-o ~/whisper-models/ggml-large-v3.bin
3. Start the server
docker run --rm --gpus all \
-p 8080:8080 \
-v ~/whisper-models:/models:ro \
-v whisper-data:/data \
-e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
git.sal.giize.com/mozempk/whisper-rtx2080:latest
4. Verify
curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0,"model_state":"unloaded"}
docker-compose
# Copy the compose file, configure volumes, then:
docker compose up -d
The bundled docker-compose.yml mounts named volumes for data and models and sets sane defaults.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PORT | 8080 | HTTP listen port |
| RUST_LOG | info | Log level: error, warn, info, debug, trace |
| DATA_DIR | /data | Directory for job JSON files and temp audio |
| WHISPER_MODEL_PATH | /models/ggml-large-v3.bin | Absolute path to GGML model file |
| WHISPER_MODEL | large-v3 | Model name reported by /health (display only) |
| CUDA_DEVICE | 0 | CUDA device index to use for inference |
| IDLE_TIMEOUT_SECS | 300 | Seconds of idle before the model is automatically unloaded from GPU memory. Set to 0 to disable auto-unload. |
| GPU_POLL_INTERVAL_SECS | 30 | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |
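As a rough illustration of how the two lifecycle timing variables combine, the sketch below reads them from an environment-like object and applies the documented defaults. The function name readLifecycleConfig, the returned field names, and the fallback on invalid input are assumptions made for this example; only the variable names, the defaults, and the meaning of 0 come from the table above.

```javascript
// Hypothetical helper: mirrors the documented defaults for the two
// lifecycle timing variables. This is not the server's parsing code.
function readLifecycleConfig(env) {
  // Assumption: missing, non-numeric, or negative values fall back to the default.
  const parse = (value, fallback) => {
    const n = Number.parseInt(value ?? '', 10);
    return Number.isNaN(n) || n < 0 ? fallback : n;
  };
  const idleTimeoutSecs = parse(env.IDLE_TIMEOUT_SECS, 300);
  const gpuPollIntervalSecs = parse(env.GPU_POLL_INTERVAL_SECS, 30);
  return {
    idleTimeoutSecs,
    gpuPollIntervalSecs,
    // Documented behaviour: IDLE_TIMEOUT_SECS=0 disables auto-unload.
    autoUnload: idleTimeoutSecs > 0,
  };
}
```

A deployment script could call readLifecycleConfig(process.env) to log the effective settings before starting the container.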
Note on CUDA device ordering
Inside Docker, device ordering matches nvidia-smi (PCI bus order). On the host without Docker, ordering may differ. See FINDINGS.md for details.
API Reference
The interactive Swagger UI is available at http://localhost:8080/docs.
Model Lifecycle Management
The model is unloaded at startup (lazy loading). It is loaded into GPU memory on the first job submission or via POST /model/load, and automatically unloaded after IDLE_TIMEOUT_SECS of inactivity (unless IDLE_TIMEOUT_SECS is 0).
Model State Machine
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
└──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
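The diagram can be read as a small transition table. The sketch below transcribes it for client-side reasoning or tests. The event labels (load_requested, vram_full, and so on) are invented for this example and are not identifiers from the server, and it assumes the "VRAM full" branch leaves the Loading state. Transitions the diagram does not show, such as an unload during loading, are treated as no-ops here.

```javascript
// Transition table transcribed from the state diagram above.
// Event names are illustrative labels, not server identifiers.
const TRANSITIONS = {
  unloaded:        { load_requested: 'loading' },
  loading:         { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready:           { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state when the event
// is not listed for that state in the diagram.
function nextState(state, event) {
  const row = TRANSITIONS[state];
  if (!row) throw new Error(`unknown state: ${state}`);
  return row[event] ?? state;
}
```

For example, nextState('loading', 'vram_full') yields 'waiting_for_gpu', and a later retry_ok returns the model to 'loading'.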
GET /model/status
Returns the current model state and VRAM statistics.
curl http://localhost:8080/model/status
When unloaded:
{ "state": "unloaded" }
When loading:
{ "state": "loading" }
When ready:
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
When waiting for VRAM:
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
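Because the response shape differs per state, a client usually branches on the state field first. The following sketch turns each of the four documented shapes into a one-line summary. The function name describeModelStatus and the message wording are made up for this example.

```javascript
// Turns a GET /model/status response body into a one-line summary.
// Field names follow the example responses above.
function describeModelStatus(status) {
  switch (status.state) {
    case 'unloaded':
      return 'Model is unloaded';
    case 'loading':
      return 'Model is loading';
    case 'ready':
      return `Model ready since ${status.loaded_at}, ` +
        `${status.vram_used_mb}/${status.vram_total_mb} MB VRAM in use`;
    case 'waiting_for_gpu': {
      // How much more free VRAM is needed before a load can succeed.
      const shortfall = status.vram_needed_mb - status.vram_free_mb;
      return `Waiting for GPU: ${shortfall} MB short, retry in ${status.retry_in_secs}s`;
    }
    default:
      return `Unknown state: ${status.state}`;
  }
}
```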
POST /model/load
Request the model to be loaded. Idempotent — if already loading or ready, returns immediately.
curl -X POST http://localhost:8080/model/load
- Returns 202 Accepted with {"status":"load_initiated"} when a load is triggered
- Returns 200 OK with {"status":"already_ready"} when the model is already ready
- Poll GET /model/status or subscribe to GET /model/events to know when the model is ready
POST /model/unload
Unload the model from GPU memory immediately, freeing VRAM.
curl -X POST http://localhost:8080/model/unload
Returns 200 OK regardless of current state.
GET /model/events — Model SSE stream
Subscribe to model lifecycle events via Server-Sent Events.
curl -N http://localhost:8080/model/events
Event types:
event: model_loading
data: {"type":"model_loading"}
event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}
event: model_unloaded
data: {"type":"model_unloaded"}
event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
JavaScript example:
const es = new EventSource('/model/events');
es.addEventListener('model_ready', () => {
  console.log('Model loaded — ready to transcribe');
});
es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
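Outside the browser, EventSource may not be available, in which case the stream has to be parsed by hand. The sketch below is a minimal parser for the event/data framing shown under "Event types". It assumes the input contains only complete frames and that every data field is JSON, as in the documented events. A real streaming client would also need to buffer partial chunks and handle id, retry, and comment lines. The function name parseSseEvents is hypothetical.

```javascript
// Minimal SSE frame parser: handles "event:" and "data:" fields only.
// Frames are separated by a blank line; other fields are ignored.
function parseSseEvents(text) {
  const events = [];
  for (const frame of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default when no "event:" field is present
    const dataLines = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        // Strip the single optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ type, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```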
Webhooks for model events
When any job is submitted with a webhook_url, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:
| Event | Fired when |
|---|---|
| model_ready | Model finishes loading (after GPU warmup) |
| model_unloaded | Model is freed from GPU memory |
Webhook payload (Content-Type: application/json):
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }
Delivery is attempted up to 3 times with exponential backoff (1s, 2s).
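Since the same URL receives both job-completion and model lifecycle webhooks, a receiver needs to tell them apart. The sketch below does so by inspecting the payload. It assumes lifecycle payloads carry a type field and job payloads are Job objects with id and status, as in the examples in this guide. The function name classifyWebhook and the returned shape are invented for illustration.

```javascript
// Distinguishes model lifecycle webhooks from job-completion webhooks.
// Assumption: lifecycle payloads have "type"; job payloads have "id" and "status".
function classifyWebhook(payload) {
  if (payload && (payload.type === 'model_ready' || payload.type === 'model_unloaded')) {
    return { kind: 'model', event: payload.type, loadedAt: payload.loaded_at ?? null };
  }
  if (payload && typeof payload.id === 'string' && typeof payload.status === 'string') {
    return { kind: 'job', jobId: payload.id, status: payload.status };
  }
  // Anything else is left for the caller to log or reject.
  return { kind: 'unknown' };
}
```

A handler would typically parse the request body as JSON, pass it to classifyWebhook, and route on the kind field.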
Handling 503 Model Not Ready
When you submit a job and the model is not yet loaded, you receive 503 Service Unavailable with a Retry-After header:
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json
{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
| State at rejection | retry_after_secs | Meaning |
|---|---|---|
| unloaded | 30 | Load was triggered; retry after ~30s |
| loading | 10 | Check again in 10s |
| waiting_for_gpu | GPU_POLL_INTERVAL_SECS | VRAM contention; retry later |
A job rejection when the model is unloaded automatically triggers a load — you do not need to call POST /model/load separately.
Recommended client pattern:
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Retry-After is given in seconds; fall back to 30 if it is missing or not a number.
      const parsed = Number.parseInt(resp.headers.get('Retry-After') ?? '30', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state} — retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
POST /jobs — Submit a transcription job
Accepts a multipart/form-data body.
| Field | Type | Required | Description |
|---|---|---|---|
| audio | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| language | string | - | ISO 639-1 language code (e.g. en, fr, de). Omit to auto-detect. |
| task | string | - | transcribe (default) or translate (translates to English) |
| webhook_url | string | - | URL to POST the completed job to |
Response: 202 Accepted
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
Example:
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3" \
-F "language=en"
Auto-detect language:
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3"
With webhook:
curl -X POST http://localhost:8080/jobs \
-F "audio=@recording.mp3" \
-F "webhook_url=https://myapp.example.com/transcription-done"
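The same submissions can be built from JavaScript. The sketch below assembles the optional fields and leaves out empty values, so that an empty language never reaches the server and auto-detection applies. The helper names jobFields and buildJobForm and the options object are assumptions for this example. FormData is a global in browsers and in Node 18 or later.

```javascript
// Collects the optional form fields for POST /jobs, skipping empty values.
// An empty "language" is dropped so the server auto-detects instead.
function jobFields({ language, task, webhookUrl } = {}) {
  const fields = [];
  if (language) fields.push(['language', language]);
  if (task) {
    if (task !== 'transcribe' && task !== 'translate') {
      throw new Error(`task must be "transcribe" or "translate", got "${task}"`);
    }
    fields.push(['task', task]);
  }
  if (webhookUrl) fields.push(['webhook_url', webhookUrl]);
  return fields;
}

// Builds the multipart body: the required audio part plus any optional fields.
function buildJobForm(audioBlob, filename, options) {
  const form = new FormData();
  form.append('audio', audioBlob, filename);
  for (const [name, value] of jobFields(options)) form.append(name, value);
  return form;
}
```

The resulting form can be passed as the body of a fetch call to POST /jobs.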
GET /jobs/{id} — Poll job status
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
Response while running:
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}
Response when done:
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}
Job statuses:
| Status | Meaning |
|---|---|
| queued | Waiting for the GPU worker to pick it up |
| running | Being transcribed right now |
| done | Complete; segments array is populated |
| failed | Error occurred; error field contains the message |
| cancelled | Cancelled via DELETE before or during processing |
GET /jobs/{id}/stream — Real-time progress via SSE
Subscribe to a Server-Sent Events stream for live progress updates.
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
Event types:
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}
event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}
event: done
data: {"type":"done","job":{...full job object...}}
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
- percent: overall progress 0–100
- chunk / chunks_total: which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single done event immediately
JavaScript example:
const es = new EventSource(`/jobs/${jobId}/stream`);
es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});
es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});
es.addEventListener('error', (e) => {
  // EventSource also fires 'error' for connection problems, with no data payload.
  if (!e.data) {
    console.error('Stream connection error');
    return;
  }
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});
DELETE /jobs/{id} — Cancel a job
Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
Returns 409 Conflict if the job is already in a terminal state (done, failed, cancelled).
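A client can avoid a predictable 409 by checking the job's status before issuing the DELETE. The sketch below encodes the terminal states named above. The names isTerminal and shouldAttemptCancel are hypothetical. The check is advisory only, because the job may still finish between the poll and the DELETE.

```javascript
// Statuses after which a job no longer changes (done, failed, cancelled).
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// True when a DELETE is worth sending; terminal jobs would return 409 Conflict.
function shouldAttemptCancel(job) {
  return !isTerminal(job.status);
}
```

The same isTerminal check is a natural stop condition for a loop polling GET /jobs/{id}.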
GET /health — Service health
curl http://localhost:8080/health
{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}
queue_depth is the number of jobs waiting to be processed (not counting the one currently running). model_state reflects the current lifecycle state (unloaded, loading, waiting_for_gpu, ready).
Output Format
The segments array in a completed job contains one entry per whisper segment (typically a sentence or clause):
{
  "index": 0,
  "start": 12.34,
  "end": 15.78,
  "text": " This is a transcribed sentence.",
  "words": [
    { "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}
Notes:
- start / end are in seconds (floating point), absolute from the beginning of the input audio
- text typically includes a leading space (whisper's tokenisation convention)
- words contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
- probability is the model's confidence for each word token (0–1)
- All timestamps are in the source language's timeline; no re-mapping occurs
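As one example of consuming this format, the sketch below converts a segments array into SRT subtitle text. It relies only on the start, end, and text fields described above. The function names srtTimestamp and segmentsToSrt are made up for this example, and trimming the leading space is a display choice, not something the server does.

```javascript
// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(secs) {
  const totalMs = Math.round(secs * 1000);
  const ms = totalMs % 1000;
  const totalSecs = Math.floor(totalMs / 1000);
  const s = totalSecs % 60;
  const m = Math.floor(totalSecs / 60) % 60;
  const h = Math.floor(totalSecs / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Converts a job's segments array to SRT text. Cues are numbered from 1
// and the leading space in each segment's text is trimmed for display.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

Calling segmentsToSrt(job.segments) on a completed job gives text that can be saved as a .srt file.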
Webhook Payload
When a webhook_url is provided, the server POSTs the full Job JSON to that URL on completion (including on failure). Headers: Content-Type: application/json.
Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.
Building from Source
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .
# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
--build-arg CUDA_VERSION=11.8.0 \
--build-arg CUDNN_TAG=cudnn8 \
--build-arg UBUNTU_VERSION=20.04 \
-t whisper-rtx2080:cu118 .
Cross-compiling without a CUDA-capable host is not supported — the build requires nvcc to compile the CUDA kernels.
Build-time ARGs
| ARG | Default | Notes |
|---|---|---|
| CUDA_VERSION | 12.4.1 | Must match a tag on nvidia/cuda Docker Hub |
| CUDNN_TAG | cudnn | Use cudnn8 for CUDA 11.x images |
| UBUNTU_VERSION | 22.04 | 20.04 or 22.04 |
Working with Audio Files
The server accepts any format ffmpeg understands. To prepare audio manually:
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3
# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm
# Submit
curl -X POST http://localhost:8080/jobs \
-F "audio=@audio.mp3"
Troubleshooting
Server returns 503 model_not_ready
- The model starts unloaded. Call POST /model/load explicitly, or just retry the job submission; a rejection automatically triggers a load.
- If state is waiting_for_gpu, another process is using the GPU's VRAM. The server will retry automatically every GPU_POLL_INTERVAL_SECS seconds.
- Monitor GET /model/status or subscribe to GET /model/events to know when the model is ready.
Server returns 0 segments
- Check that you are not setting language to an empty string; omit the field entirely for auto-detection
- Verify the audio file is not corrupted: ffprobe audio.mp3
- Check logs for whisper.cpp output: the auto-detected language and confidence should appear as info level logs
Server returns failed with ffmpeg error
- Ensure ffmpeg is installed in the container (it is by default)
- Verify the audio file is a valid media file
CUDA out-of-memory
- ggml-large-v3.bin requires ~5-6 GB VRAM. Use medium or small models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: nvidia-smi
Wrong GPU being used
- Inside Docker: set CUDA_DEVICE=0 for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see FINDINGS.md