mozempk/whisper-rtx2080

feat: dynamic model loading/unloading with GPU polling

- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-08 17:57:20 +02:00

Usage Guide

Prerequisites

Docker + NVIDIA Container Toolkit (for GPU access)
An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
A Whisper GGML model file (e.g. ggml-large-v3.bin)

Quick Start

1. Pull the image

docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest

2. Download a model

# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin

3. Start the server

docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest

4. Verify

curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0,"model_state":"unloaded"}

docker-compose

# Copy the compose file, configure volumes, then:
docker compose up -d

The bundled docker-compose.yml mounts named volumes for data and models and sets sane defaults.

Environment Variables

Variable	Default	Description
`PORT`	`8080`	HTTP listen port
`RUST_LOG`	`info`	Log level: `error`, `warn`, `info`, `debug`, `trace`
`DATA_DIR`	`/data`	Directory for job JSON files and temp audio
`WHISPER_MODEL_PATH`	`/models/ggml-large-v3.bin`	Absolute path to GGML model file
`WHISPER_MODEL`	`large-v3`	Model name reported by `/health` (display only)
`CUDA_DEVICE`	`0`	CUDA device index to use for inference
`IDLE_TIMEOUT_SECS`	`300`	Seconds of idle before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload.
`GPU_POLL_INTERVAL_SECS`	`30`	Seconds between VRAM-availability retries when a load fails due to insufficient VRAM.

Note on CUDA device ordering

Inside Docker, device ordering matches nvidia-smi (PCI bus order). On the host without Docker, ordering may differ. See FINDINGS.md for details.

API Reference

The interactive Swagger UI is available at http://localhost:8080/docs.

Model Lifecycle Management

The model is not loaded at startup (lazy loading). A load is triggered by POST /model/load or by the first job submission (that submission is rejected with 503 and must be retried once the model is ready). The model is automatically unloaded after IDLE_TIMEOUT_SECS of inactivity.

Model State Machine

Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                                 └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading
Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
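The transitions in the diagram can be captured in a small lookup table, which may be handy for client-side tests or a status dashboard. This is only a sketch: the event names (`load_requested`, `load_ok`, `vram_full`, `retry_ok`, `idle_timeout`, `unload_requested`) are labels invented for illustration and are not identifiers from the server, and only the transitions drawn above are included.

```javascript
// Transition table mirroring the state machine diagram.
// Event names are hypothetical labels, not server identifiers.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_ok: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

// Returns the next state, or the current state when the diagram
// defines no transition for this event.
function nextState(state, event) {
  const next = TRANSITIONS[state]?.[event];
  return next ?? state;
}
```

State names use the lowercase spelling reported by GET /model/status.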

`GET /model/status`

Returns the current model state and VRAM statistics.

curl http://localhost:8080/model/status

When unloaded:

{ "state": "unloaded" }

When loading:

{ "state": "loading" }

When ready:

{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}

When waiting for VRAM:

{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
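
A client that displays the status can branch on `state` and read the extra fields only when they are present. The helper below is a minimal sketch; `describeModelStatus` is a hypothetical name, and the "shortfall" figure assumes `vram_needed_mb` is the total the load needs while `vram_free_mb` is what is currently free.

```javascript
// Turns a GET /model/status payload into a one-line summary.
// Field names follow the example responses above.
function describeModelStatus(status) {
  switch (status.state) {
    case 'ready':
      return `ready (${status.vram_used_mb}/${status.vram_total_mb} MB VRAM in use)`;
    case 'waiting_for_gpu': {
      // Assumption: needed minus free is the amount still missing.
      const shortfall = status.vram_needed_mb - status.vram_free_mb;
      return `waiting for GPU: ${shortfall} MB short, next retry in ${status.retry_in_secs}s`;
    }
    case 'loading':
    case 'unloaded':
      return status.state;
    default:
      return `unknown state: ${status.state}`;
  }
}
```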

`POST /model/load`

Request the model to be loaded. Idempotent — if already loading or ready, returns immediately.

curl -X POST http://localhost:8080/model/load

Returns 202 Accepted with {"status":"load_initiated"} when load is triggered
Returns 200 OK with {"status":"already_ready"} when model is already ready
Poll GET /model/status or subscribe to GET /model/events to know when ready

`POST /model/unload`

Unload the model from GPU memory immediately, freeing VRAM.

curl -X POST http://localhost:8080/model/unload

Returns 200 OK regardless of current state.

`GET /model/events` — Model SSE stream

Subscribe to model lifecycle events via Server-Sent Events.

curl -N http://localhost:8080/model/events

Event types:

event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
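
Outside the browser there may be no `EventSource`, in which case the frames shown above have to be parsed by hand. The function below is a rough sketch of that, under the assumption that each frame carries one `event:` line and a JSON `data:` payload as in the examples; `parseSseFrame` is a hypothetical helper name, and other SSE fields (`id`, `retry`, comments) are ignored.

```javascript
// Parses one SSE frame (the lines between two blank lines) into
// { event, data }. Throws if the frame has no JSON data payload.
function parseSseFrame(frame) {
  let event = 'message'; // SSE default when no event field is sent
  const dataLines = [];
  for (const line of frame.split(/\r?\n/)) {
    if (line.startsWith('event:')) {
      event = line.slice('event:'.length).trim();
    } else if (line.startsWith('data:')) {
      // The SSE format strips a single leading space after the colon.
      dataLines.push(line.slice('data:'.length).replace(/^ /, ''));
    }
  }
  return { event, data: JSON.parse(dataLines.join('\n')) };
}
```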

JavaScript example:

const es = new EventSource('/model/events');

es.addEventListener('model_ready', () => {
  console.log('Model loaded — ready to transcribe');
});

es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});

Webhooks for model events

When any job is submitted with a webhook_url, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:

Event	Fired when
`model_ready`	Model finishes loading (after GPU warmup)
`model_unloaded`	Model is freed from GPU memory

Webhook payload (Content-Type: application/json):

{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }
{ "type": "model_unloaded" }

Delivery is attempted up to 3 times with exponential backoff (1s, 2s).
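
For sizing a receiver's downtime tolerance, the schedule above can be reproduced with a few lines. This is an illustrative sketch, not the server's code: `backoffDelaysSecs` is a hypothetical name, and it assumes the wait happens between attempts, so three attempts give the two delays quoted above.

```javascript
// Delays (in seconds) between attempts for a doubling backoff.
// Assumption: N attempts are separated by N - 1 waits.
function backoffDelaysSecs(maxAttempts, firstDelaySecs = 1) {
  const delays = [];
  for (let i = 0; i < maxAttempts - 1; i++) {
    delays.push(firstDelaySecs * 2 ** i);
  }
  return delays;
}
```

With `maxAttempts = 3` the total wait before the final attempt is 3 seconds, so a receiver that is unavailable for longer than that may miss the event.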

Handling 503 Model Not Ready

When you submit a job and the model is not yet loaded, you receive 503 Service Unavailable with a Retry-After header:

HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}

State at rejection	`retry_after_secs`	Meaning
`unloaded`	30	Load was triggered; retry after ~30s
`loading`	10	Check again in 10s
`waiting_for_gpu`	`GPU_POLL_INTERVAL_SECS`	VRAM contention; retry later

A job rejection when the model is unloaded automatically triggers a load — you do not need to call POST /model/load separately.
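
If the Retry-After header is unavailable (for example, stripped by a proxy), a client can derive the delay from the JSON body instead. The helper below is a sketch that mirrors the table above; `retryAfterSecs` is a hypothetical name, and `gpuPollIntervalSecs` has to be supplied by the caller to match the server's GPU_POLL_INTERVAL_SECS.

```javascript
// Picks a retry delay from a 503 response body.
// Prefers the server-provided value, then falls back to the table.
function retryAfterSecs(body, gpuPollIntervalSecs = 30) {
  if (Number.isFinite(body.retry_after_secs)) return body.retry_after_secs;
  switch (body.state) {
    case 'loading':
      return 10;
    case 'waiting_for_gpu':
      return gpuPollIntervalSecs;
    default:
      // `unloaded` per the table; also used for any unexpected state.
      return 30;
  }
}
```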

Recommended client pattern:

async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Retry-After is in seconds; fall back to 30 if missing or unparsable
      const parsed = Number.parseInt(resp.headers.get('Retry-After') ?? '', 10);
      const retryAfter = Number.isNaN(parsed) ? 30 : parsed;
      const body = await resp.json();
      console.log(`Model ${body.state}: retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}

`POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.

Field	Type	Required	Description
`audio`	file	✓	Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit.
`language`	string	—	ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect.
`task`	string	—	`transcribe` (default) or `translate` (translates to English)
`webhook_url`	string	—	URL to POST the completed job to

Response: 202 Accepted

{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }

Example:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"

Auto-detect language:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"

With webhook:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
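
The same request body can be built in JavaScript and passed to `fetch`. The sketch below assumes a runtime with global `FormData` and `Blob` (browsers, Node 18+); `buildJobForm` and its option names are hypothetical, while the form field names come from the table above. Optional fields are omitted entirely when empty, so an empty `language` never reaches the server.

```javascript
// Builds the multipart body for POST /jobs.
// `audio` must be a Blob or File; `filename` is the name sent with it.
function buildJobForm({ audio, filename, language, task, webhookUrl }) {
  const form = new FormData();
  form.append('audio', audio, filename);
  // Skip optional fields when unset or empty so server defaults apply.
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}
```

The result can be sent with `fetch('/jobs', { method: 'POST', body: form })`; the Content-Type header should be left unset so the runtime adds the multipart boundary itself.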

`GET /jobs/{id}` — Poll job status

curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Response while running:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}

Response when done:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}

Job statuses:

Status	Meaning
`queued`	Waiting for the GPU worker to pick it up
`running`	Being transcribed right now
`done`	Complete; `segments` array is populated
`failed`	Error occurred; `error` field contains the message
`cancelled`	Cancelled via DELETE before or during processing
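
A polling client mainly needs to know when to stop. The helpers below are a small sketch of that decision based on the status table; `isTerminal` and `nextPollAction` are hypothetical names, and the returned action strings are made up for illustration.

```javascript
// Statuses after which a job no longer changes.
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(job) {
  return TERMINAL_STATUSES.has(job.status);
}

// Suggested next step for a client polling GET /jobs/{id}.
function nextPollAction(job) {
  if (!isTerminal(job)) return 'poll_again';
  return job.status === 'done' ? 'read_segments' : 'stop';
}
```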

`GET /jobs/{id}/stream` — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.

curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream

Event types:

event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}

event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}

percent — overall progress 0–100
chunk / chunks_total — which silence-split chunk is currently being transcribed
If you connect after the job has finished, you receive a single done event immediately

JavaScript example:

const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

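es.addEventListener('error', (e) => {
  // 'error' also fires on connection problems, where e.data is undefined
  if (!e.data) {
    console.error('Stream connection error');
    return;
  }
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});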

`DELETE /jobs/{id}` — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).

curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Returns 409 Conflict if the job is already in a terminal state (done, failed, cancelled).

`GET /health` — Service health

curl http://localhost:8080/health

{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}

queue_depth is the number of jobs waiting to be processed (not counting the one currently running). model_state reflects the current lifecycle state (unloaded, loading, waiting_for_gpu, ready).

Output Format

The segments array in a completed job contains one entry per whisper segment (typically a sentence or clause):

{
  "index": 0,
  "start": 12.34,
  "end":   15.78,
  "text":  " This is a transcribed sentence.",
  "words": [
    { "text": " This",         "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is",           "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a",            "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed",  "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.",    "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}

Notes:

start / end are in seconds (floating point), absolute from the beginning of the input audio
text typically includes a leading space (whisper's tokenisation convention)
words contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
probability is the model's confidence for each word token (0–1)
All timestamps refer to the timeline of the source audio; no re-mapping occurs
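
Because `start` and `end` are plain seconds, converting segments to a subtitle format is mostly string formatting. The following is a sketch of an SRT export, assuming the segment shape shown above; `srtTime` and `toSrt` are hypothetical helper names and not part of the server.

```javascript
// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(secs) {
  const ms = Math.round(secs * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Converts a job's `segments` array into SRT subtitle text.
// Leading spaces in `text` are trimmed for display.
function toSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```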

Webhook Payload

When a webhook_url is provided, the server POSTs the full Job JSON to that URL on completion (including on failure). Headers: Content-Type: application/json.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.
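
Since a registered webhook URL also receives model lifecycle events, a receiver has to tell the two payload kinds apart. The function below is one possible sketch: it assumes job payloads always carry string `id` and `status` fields and that model events are identified by `type`, as in the examples in this guide. `classifyWebhook` and the returned `kind` values are hypothetical.

```javascript
// Distinguishes the two kinds of POST a webhook URL can receive:
// model lifecycle events ({ type: ... }) and completed jobs
// ({ id, status, ... }).
function classifyWebhook(payload) {
  if (payload.type === 'model_ready' || payload.type === 'model_unloaded') {
    return { kind: 'model_event', event: payload.type };
  }
  if (typeof payload.id === 'string' && typeof payload.status === 'string') {
    return { kind: 'job', jobId: payload.id, status: payload.status };
  }
  return { kind: 'unknown' };
}
```

Treating unrecognised payloads as `unknown` rather than throwing keeps the receiver tolerant of event types added later.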

Building from Source

# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .

Cross-compiling without a CUDA-capable host is not supported — the build requires nvcc to compile the CUDA kernels.

Build-time ARGs

ARG	Default	Notes
`CUDA_VERSION`	`12.4.1`	Must match a tag on `nvidia/cuda` Docker Hub
`CUDNN_TAG`	`cudnn`	Use `cudnn8` for CUDA 11.x images
`UBUNTU_VERSION`	`22.04`	`20.04` or `22.04`

Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

# Download YouTube audio
yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "https://www.youtube.com/watch?v=..."

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"

Troubleshooting

Server returns `503 model_not_ready`

The model starts unloaded. Call POST /model/load explicitly, or just retry the job submission — rejection automatically triggers a load.
If state is waiting_for_gpu, another process is using the GPU's VRAM. The server will retry automatically every GPU_POLL_INTERVAL_SECS seconds.
Monitor GET /model/status or subscribe to GET /model/events to know when the model is ready.

Server returns 0 segments

Check that you are not setting language to an empty string — omit the field entirely for auto-detection
Verify the audio file is not corrupted: ffprobe audio.mp3
Check logs for whisper.cpp output: the auto-detected language and confidence should appear as info level logs

Server returns `failed` with ffmpeg error

Ensure ffmpeg is installed in the container (it is by default)
Verify the audio file is a valid media file

CUDA out-of-memory

ggml-large-v3.bin requires ~5-6 GB VRAM. Use medium or small models on GPUs with less than 8 GB
Check that no other process is consuming VRAM: nvidia-smi

Wrong GPU being used

Inside Docker: set CUDA_DEVICE=0 for the first GPU (nvidia-smi order)
On host without Docker: device ordering may be inverted; see FINDINGS.md
