whisper-rtx2080/docs/USAGE.md
mozempk b191fbe200
feat: dynamic model loading/unloading with GPU polling
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 17:57:20 +02:00


Usage Guide

Prerequisites

  • Docker + NVIDIA Container Toolkit (for GPU access)
  • An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
  • A Whisper GGML model file (e.g. ggml-large-v3.bin)

Quick Start

1. Pull the image

docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest

2. Download a model

# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin

3. Start the server

docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest

4. Verify

curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0,"model_state":"unloaded"}

docker-compose

# Copy the compose file, configure volumes, then:
docker compose up -d

The bundled docker-compose.yml mounts named volumes for data and models and sets sane defaults.


Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PORT | 8080 | HTTP listen port |
| RUST_LOG | info | Log level: error, warn, info, debug, trace |
| DATA_DIR | /data | Directory for job JSON files and temp audio |
| WHISPER_MODEL_PATH | /models/ggml-large-v3.bin | Absolute path to GGML model file |
| WHISPER_MODEL | large-v3 | Model name reported by /health (display only) |
| CUDA_DEVICE | 0 | CUDA device index to use for inference |
| IDLE_TIMEOUT_SECS | 300 | Seconds of idle before the model is automatically unloaded from GPU memory. Set to 0 to disable auto-unload. |
| GPU_POLL_INTERVAL_SECS | 30 | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |

Note on CUDA device ordering

Inside Docker, device ordering matches nvidia-smi (PCI bus order). On the host without Docker, ordering may differ. See FINDINGS.md for details.


API Reference

The interactive Swagger UI is available at http://localhost:8080/docs.


Model Lifecycle Management

The model is not loaded at startup (lazy loading). It is loaded into GPU memory when the first job is submitted or when POST /model/load is called, and it is unloaded automatically after IDLE_TIMEOUT_SECS seconds of inactivity.

Model State Machine

Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                                 └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
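
The diagram above can also be read as a transition table. The following is a minimal client-side sketch, not the server's implementation: the state names follow the values returned by GET /model/status, while the event names (job_submitted, load_requested and so on) are hypothetical labels chosen for illustration.

```javascript
// Transition table mirroring the state diagram.
// Event names are hypothetical labels, not identifiers used by the server.
const TRANSITIONS = {
  unloaded: { job_submitted: 'loading', load_requested: 'loading' },
  loading: { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state when the diagram
// shows no arrow for the given event.
function nextState(state, event) {
  const row = TRANSITIONS[state];
  if (!row) throw new Error(`Unknown state: ${state}`);
  return row[event] ?? state;
}
```

A table like this can be handy in a UI that mirrors the model state locally and only wants to accept transitions the diagram allows.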

GET /model/status

Returns the current model state and VRAM statistics.

curl http://localhost:8080/model/status

When unloaded:

{ "state": "unloaded" }

When loading:

{ "state": "loading" }

When ready:

{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}

When waiting for VRAM:

{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}

POST /model/load

Request the model to be loaded. Idempotent — if already loading or ready, returns immediately.

curl -X POST http://localhost:8080/model/load

  • Returns 202 Accepted with {"status":"load_initiated"} when load is triggered
  • Returns 200 OK with {"status":"already_ready"} when model is already ready
  • Poll GET /model/status or subscribe to GET /model/events to know when ready
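
The load-then-poll flow described above could look roughly like this in a client. This is a sketch under assumptions: baseUrl, intervalMs, timeoutMs and fetchImpl are illustrative parameters, and the only response field relied on is state from GET /model/status.

```javascript
// Polls GET /model/status until the state is "ready" or the deadline passes.
// All option names here are illustrative, not part of the server API.
async function waitForModelReady(
  baseUrl,
  { intervalMs = 2000, timeoutMs = 300000, fetchImpl = fetch } = {}
) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const resp = await fetchImpl(`${baseUrl}/model/status`);
    if (!resp.ok) throw new Error(`Status check failed: ${resp.status}`);
    const status = await resp.json();
    if (status.state === 'ready') return status;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Model still ${status.state} after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Triggers a load (200 and 202 are both accepted), then waits for "ready".
async function loadAndWait(baseUrl, options = {}) {
  const fetchImpl = options.fetchImpl ?? fetch;
  const resp = await fetchImpl(`${baseUrl}/model/load`, { method: 'POST' });
  if (!resp.ok) throw new Error(`Load request failed: ${resp.status}`);
  return waitForModelReady(baseUrl, options);
}
```

For example, `await loadAndWait('http://localhost:8080')` before submitting a batch of jobs avoids the initial 503 round trip. Where a long-lived connection is acceptable, subscribing to GET /model/events avoids polling altogether.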

POST /model/unload

Unload the model from GPU memory immediately, freeing VRAM.

curl -X POST http://localhost:8080/model/unload

Returns 200 OK regardless of current state.


GET /model/events — Model SSE stream

Subscribe to model lifecycle events via Server-Sent Events.

curl -N http://localhost:8080/model/events

Event types:

event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}

JavaScript example:

const es = new EventSource('/model/events');

es.addEventListener('model_ready', () => {
  console.log('Model loaded — ready to transcribe');
});

es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
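
In runtimes without a built-in EventSource, the frames shown above can be parsed by hand. The sketch below assumes that each chunk of text contains only complete frames and that every data payload is JSON, as in the samples; parseSseFrames is a hypothetical helper name.

```javascript
// Parses SSE text into { event, data } objects.
// Assumes complete frames separated by blank lines and JSON data payloads.
function parseSseFrames(text) {
  return text
    .split(/\r?\n\r?\n/)
    .map((frame) => frame.trim())
    .filter((frame) => frame.length > 0)
    .map((frame) => {
      let event = 'message'; // SSE default when no "event:" line is present
      const dataLines = [];
      for (const line of frame.split(/\r?\n/)) {
        if (line.startsWith('event:')) event = line.slice(6).trim();
        else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
      }
      const data = dataLines.length ? JSON.parse(dataLines.join('\n')) : null;
      return { event, data };
    });
}
```

A real stream reader would also need to buffer partial frames across network chunks; that part is omitted here.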

Webhooks for model events

When any job is submitted with a webhook_url, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:

| Event | Fired when |
| --- | --- |
| model_ready | Model finishes loading (after GPU warmup) |
| model_unloaded | Model is freed from GPU memory |

Webhook payload (Content-Type: application/json):

{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }

Delivery is attempted up to 3 times with exponential backoff (1s, 2s).


Handling 503 Model Not Ready

When you submit a job and the model is not yet loaded, you receive 503 Service Unavailable with a Retry-After header:

HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}

| State at rejection | retry_after_secs | Meaning |
| --- | --- | --- |
| unloaded | 30 | Load was triggered; retry after ~30s |
| loading | 10 | Check again in 10s |
| waiting_for_gpu | GPU_POLL_INTERVAL_SECS | VRAM contention; retry later |

A job rejection when the model is unloaded automatically triggers a load — you do not need to call POST /model/load separately.

Recommended client pattern:

async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Fall back to 30s if the header is missing or not a plain number of seconds
      const retryAfter = Number.parseInt(resp.headers.get('Retry-After') ?? '30', 10) || 30;
      const body = await resp.json();
      console.log(`Model ${body.state} — retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}

POST /jobs — Submit a transcription job

Accepts a multipart/form-data body.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| audio | file | yes | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| language | string | no | ISO 639-1 language code (e.g. en, fr, de). Omit to auto-detect. |
| task | string | no | transcribe (default) or translate (translates to English) |
| webhook_url | string | no | URL to POST the completed job to |

Response: 202 Accepted

{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }

Example:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"

Auto-detect language:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"

With webhook:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
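
The same submission can be made from JavaScript. This is a sketch that assumes a runtime with global FormData, Blob and fetch (browsers, or Node 18+); buildJobForm and submitJob are hypothetical helper names. Optional fields are left out entirely when unset, so an empty language string is never sent.

```javascript
// Builds the multipart body for POST /jobs.
// audio is expected to be a Blob or File; optional fields are omitted when unset.
function buildJobForm({ audio, filename, language, task, webhookUrl }) {
  const form = new FormData();
  form.append('audio', audio, filename);
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

// Submits the form and returns the job_id from the 202 response.
// Any other status (including 503 model_not_ready) is surfaced as an error.
async function submitJob(baseUrl, fields, fetchImpl = fetch) {
  const resp = await fetchImpl(`${baseUrl}/jobs`, {
    method: 'POST',
    body: buildJobForm(fields),
  });
  if (resp.status !== 202) throw new Error(`Submit failed: ${resp.status}`);
  const { job_id } = await resp.json();
  return job_id;
}
```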

GET /jobs/{id} — Poll job status

curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Response while running:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}

Response when done:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}

Job statuses:

| Status | Meaning |
| --- | --- |
| queued | Waiting for the GPU worker to pick it up |
| running | Being transcribed right now |
| done | Complete; segments array is populated |
| failed | Error occurred; error field contains the message |
| cancelled | Cancelled via DELETE before or during processing |

GET /jobs/{id}/stream — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.

curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream

Event types:

event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}

event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}

  • percent: overall progress, 0 to 100
  • chunk / chunks_total: which silence-split chunk is currently being transcribed
  • If you connect after the job has finished, you receive a single done event immediately

JavaScript example:

const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

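es.addEventListener('error', (e) => {
  // EventSource also fires 'error' on connection problems; those events carry no data
  if (!e.data) return;
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});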

DELETE /jobs/{id} — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).

curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Returns 409 Conflict if the job is already in a terminal state (done, failed, cancelled).
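
A client can treat the 409 case as a normal outcome and not as a failure. The sketch below makes that explicit; cancelJob and the shape of its return value are hypothetical, and it assumes only what is stated above: a 2xx status means the cancellation was recorded, and 409 means the job had already finished.

```javascript
// Sends DELETE /jobs/{id} and reports whether the cancellation was accepted.
// The return shape is illustrative, not something the server sends back.
async function cancelJob(baseUrl, jobId, fetchImpl = fetch) {
  const resp = await fetchImpl(`${baseUrl}/jobs/${jobId}`, { method: 'DELETE' });
  if (resp.status === 409) {
    // Job is already done, failed or cancelled; nothing left to cancel.
    return { accepted: false, reason: 'already_terminal' };
  }
  if (!resp.ok) throw new Error(`Cancel failed: ${resp.status}`);
  return { accepted: true };
}
```

For a running job, an accepted cancellation still only takes effect once the current inference call returns, so the job may report running for a while afterwards.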


GET /health — Service health

curl http://localhost:8080/health

{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}

queue_depth is the number of jobs waiting to be processed (not counting the one currently running). model_state reflects the current lifecycle state (unloaded, loading, waiting_for_gpu, ready).


Output Format

The segments array in a completed job contains one entry per whisper segment (typically a sentence or clause):

{
  "index": 0,
  "start": 12.34,
  "end":   15.78,
  "text":  " This is a transcribed sentence.",
  "words": [
    { "text": " This",         "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is",           "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a",            "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed",  "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.",    "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}

Notes:

  • start / end are in seconds (floating point), absolute from the beginning of the input audio
  • text typically includes a leading space (whisper's tokenisation convention)
  • words contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
  • probability is the model's confidence for each word token (0 to 1)
  • All timestamps are in the source language's timeline — no re-mapping occurs
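
Because start and end are plain seconds, converting segments to a subtitle format is straightforward. The following sketch turns a segments array into SRT text; toSrtTime and segmentsToSrt are hypothetical helper names, and the leading space in text is trimmed for display.

```javascript
// Formats a number of seconds as an SRT timestamp (HH:MM:SS,mmm).
function toSrtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSecs = Math.floor(totalMs / 1000);
  const s = totalSecs % 60;
  const m = Math.floor(totalSecs / 60) % 60;
  const h = Math.floor(totalSecs / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Converts the segments array of a completed job into SRT cues.
// Cue numbers start at 1; cues are separated by a blank line.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

For instance, `segmentsToSrt(job.segments)` on a job in the done state yields text that can be written directly to a .srt file.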

Webhook Payload

When a webhook_url is provided, the server POSTs the full Job JSON to that URL on completion (including on failure). Headers: Content-Type: application/json.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.
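
Since a registered webhook_url receives both job payloads and model lifecycle events, a receiver needs to tell the two apart. A minimal sketch follows, under the assumption that job payloads carry id and status while model events carry a type beginning with model_ (as in the examples in this guide); classifyWebhook is a hypothetical helper.

```javascript
// Classifies an incoming webhook body (already parsed from JSON).
// Assumption: model events have a "type" starting with "model_",
// job payloads have string "id" and "status" fields.
function classifyWebhook(payload) {
  if (payload && typeof payload.type === 'string' && payload.type.startsWith('model_')) {
    return { kind: 'model_event', event: payload.type };
  }
  if (payload && typeof payload.id === 'string' && typeof payload.status === 'string') {
    return { kind: 'job', jobId: payload.id, status: payload.status };
  }
  return { kind: 'unknown' };
}
```

A handler can then route job results to its own completion logic and either ignore model events or use them to update a readiness indicator.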


Building from Source

# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .

Cross-compiling without a CUDA-capable host is not supported — the build requires nvcc to compile the CUDA kernels.

Build-time ARGs

| ARG | Default | Notes |
| --- | --- | --- |
| CUDA_VERSION | 12.4.1 | Must match a tag on nvidia/cuda Docker Hub |
| CUDNN_TAG | cudnn | Use cudnn8 for CUDA 11.x images |
| UBUNTU_VERSION | 22.04 | 20.04 or 22.04 |

Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"

Troubleshooting

Server returns 503 model_not_ready

  • The model starts unloaded. Call POST /model/load explicitly, or just retry the job submission — rejection automatically triggers a load.
  • If state is waiting_for_gpu, another process is using the GPU's VRAM. The server will retry automatically every GPU_POLL_INTERVAL_SECS seconds.
  • Monitor GET /model/status or subscribe to GET /model/events to know when the model is ready.

Server returns 0 segments

  • Check that you are not setting language to an empty string — omit the field entirely for auto-detection
  • Verify the audio file is not corrupted: ffprobe audio.mp3
  • Check logs for whisper.cpp output: the auto-detected language and confidence should appear as info level logs

Server returns failed with ffmpeg error

  • Ensure ffmpeg is installed in the container (it is by default)
  • Verify the audio file is a valid media file

CUDA out-of-memory

  • ggml-large-v3.bin requires ~5-6 GB VRAM. Use medium or small models on GPUs with less than 8 GB
  • Check that no other process is consuming VRAM: nvidia-smi

Wrong GPU being used

  • Inside Docker: set CUDA_DEVICE=0 for the first GPU (nvidia-smi order)
  • On host without Docker: device ordering may be inverted; see FINDINGS.md