- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
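
A minimal sketch of the idle timer described above, under loud assumptions: the variants of `WorkerCmd`, the `Event` enum and `run_worker` are hypothetical names, and model loading is reduced to a boolean flag. It only illustrates how `recv_timeout` on the command channel can double as a periodic tick, so the OS thread needs no separate timer thread.

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical command set; the real WorkerCmd variants are not shown here.
#[allow(dead_code)]
#[derive(Debug)]
enum WorkerCmd {
    Load,
    Unload,
    Job(String),
}

/// Hypothetical lifecycle events recorded by the sketch.
#[derive(Debug, PartialEq)]
enum Event {
    Loaded,
    JobDone(String),
    Unloaded,
}

/// Runs until every sender is dropped and returns the events in order.
fn run_worker(rx: Receiver<WorkerCmd>, tick: Duration, idle_timeout: Duration) -> Vec<Event> {
    let mut events = Vec::new();
    let mut loaded = false; // stands in for the real model handle
    let mut last_used = Instant::now();
    loop {
        match rx.recv_timeout(tick) {
            Ok(WorkerCmd::Load) => {
                if !loaded {
                    loaded = true;
                    events.push(Event::Loaded);
                }
                last_used = Instant::now();
            }
            Ok(WorkerCmd::Unload) => {
                if loaded {
                    loaded = false;
                    events.push(Event::Unloaded);
                }
            }
            Ok(WorkerCmd::Job(id)) => {
                // Lazy load on the first job, then "run" it.
                if !loaded {
                    loaded = true;
                    events.push(Event::Loaded);
                }
                events.push(Event::JobDone(id));
                last_used = Instant::now();
            }
            // No command arrived within one tick: check the idle deadline.
            Err(RecvTimeoutError::Timeout) => {
                if loaded && last_used.elapsed() >= idle_timeout {
                    loaded = false;
                    events.push(Event::Unloaded);
                }
            }
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    events
}

fn main() {
    let (tx, rx) = sync_channel::<WorkerCmd>(8);
    let worker = thread::spawn(move || {
        run_worker(rx, Duration::from_millis(10), Duration::from_millis(100))
    });
    tx.send(WorkerCmd::Job("a".to_string())).unwrap();
    thread::sleep(Duration::from_millis(400)); // stay idle past the timeout
    drop(tx); // closing the channel stops the worker
    println!("{:?}", worker.join().unwrap());
}
```

The sketch uses a 10 ms tick and a 100 ms timeout so it finishes quickly; the commit itself uses a 1 s tick and IDLE_TIMEOUT_SECS.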
New routes:
GET /model/status — current state + VRAM stats
POST /model/load — trigger load (idempotent)
POST /model/unload — immediate unload
GET /model/events — SSE stream of model lifecycle events
New env vars:
IDLE_TIMEOUT_SECS (default 300)
GPU_POLL_INTERVAL_SECS (default 30)
Tests:
tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
SSE events, webhooks, concurrency, unload-during-load)
tests/test_idle_timeout.sh — 5 tests with short IDLE_TIMEOUT_SECS=5
test_all.sh updated: loads model before job submission, asserts
model_state in /health, adds POST /model/unload at end
Docs:
docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
updated /health response shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
whisper-rtx2080
Async REST API for GPU-accelerated speech transcription, built in Rust (Axum) on top of whisper.cpp compiled with CUDA for the NVIDIA RTX 2080 (Turing, sm_75, 8 GB VRAM). No Python.
Requirements
| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| NVIDIA Container Toolkit | nvidia-docker2 on the host |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |
Quick start
```bash
# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run, ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs
open http://localhost:8080/docs
```
API reference
| Method | Path | Description |
|---|---|---|
| POST | /jobs | Submit audio for transcription |
| GET | /jobs/{id} | Poll job status + result |
| GET | /jobs/{id}/stream | SSE: live progress + completion event |
| DELETE | /jobs/{id} | Cancel a queued or running job |
| GET | /health | GPU info + queue depth |
| GET | /docs | Swagger UI |
| GET | /openapi.json | Raw OpenAPI 3.0 spec |
POST /jobs — multipart fields
| Field | Required | Description |
|---|---|---|
| audio | ✅ | Audio file in any format ffmpeg understands; no size limit |
| language | ❌ | ISO 639-1 source language (e.g. en). Auto-detected when absent. |
| task | ❌ | transcribe (default) or translate (output always English) |
| webhook_url | ❌ | URL that receives a POST with the completed job JSON |
Job result JSON
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 142.3,
  "progress": 100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.4,
      "text": " Hello, world.",
      "words": []
    }
  ],
  "error": null,
  "created_at": "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
```
SSE events (GET /jobs/{id}/stream)
```
event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
```
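
For clients that do not use an SSE library, the stream above can be split into events by hand. The following is a minimal sketch that assumes the whole response body is already buffered as a string; `SseEvent` and `parse_sse` are hypothetical names, and the `data` payload is left as raw JSON text for a JSON library of your choice.

```rust
/// One parsed Server-Sent Event: the `event:` name and the raw `data:` payload.
#[derive(Debug, PartialEq)]
struct SseEvent {
    event: String,
    data: String,
}

/// Splits a buffered SSE body into events. Events end at a blank line;
/// several `data:` lines inside one event are joined with '\n'.
fn parse_sse(body: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut event = String::new();
    let mut data: Vec<&str> = Vec::new();

    // An extra empty line is chained on so the last event is flushed even
    // when the body does not end with a blank line.
    for line in body.lines().chain(std::iter::once("")) {
        if line.is_empty() {
            // Events without any data line are not dispatched.
            if !data.is_empty() {
                let name = if event.is_empty() {
                    "message".to_string() // default event type in the SSE format
                } else {
                    event.clone()
                };
                events.push(SseEvent { event: name, data: data.join("\n") });
            }
            event.clear();
            data.clear();
        } else if let Some(value) = line.strip_prefix("event:") {
            event = value.trim_start().to_string();
        } else if let Some(value) = line.strip_prefix("data:") {
            // A single leading space after the colon is not part of the value.
            data.push(value.strip_prefix(' ').unwrap_or(value));
        }
        // Comment lines (starting with ':') and other fields are ignored.
    }
    events
}

fn main() {
    let body = "event: progress\ndata: {\"type\":\"progress\",\"percent\":42}\n\n\
                event: done\ndata: {\"type\":\"done\"}\n\n";
    for ev in parse_sse(body) {
        println!("{} -> {}", ev.event, ev.data);
    }
}
```

A real client would feed the parser incrementally as bytes arrive and stop reading once a `done` event is seen.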
Build arguments
| ARG | Default | Notes |
|---|---|---|
| CUDA_VERSION | 12.4.1 | Passed to the NVIDIA base image tag |
| CUDNN_TAG | cudnn | cudnn for the default CUDA 12.4 image · cudnn8 for CUDA 12.1 and 11.x |
| UBUNTU_VERSION | 22.04 | Ubuntu base |
Custom CUDA version examples
```bash
# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```
Runtime environment variables
All can be overridden with -e or in docker-compose.yml:
| Variable | Default | Description |
|---|---|---|
| PORT | 8080 | TCP port the server listens on |
| RUST_LOG | info | Log level (trace, debug, info, warn, error) |
| DATA_DIR | /data | Directory for persisted job state (mount a volume here) |
| WHISPER_MODEL | large-v3 | Model name (for /health reporting) |
| WHISPER_MODEL_PATH | /models/ggml-large-v3.bin | Absolute path to the GGML model file |
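
As an illustration of how these variables and defaults fit together, here is a minimal sketch of reading them in Rust. The `Config` struct and `load_config` function are hypothetical names rather than the server's actual code, and falling back to the default on an unparsable PORT is an assumption of this sketch.

```rust
use std::env;

/// Hypothetical holder for the runtime settings in the table above.
#[derive(Debug, PartialEq)]
struct Config {
    port: u16,
    rust_log: String,
    data_dir: String,
    whisper_model: String,
    whisper_model_path: String,
}

/// Builds a Config from a lookup function, so the same code works with the
/// real environment and with a fake one in tests.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Config {
    let or_default =
        |key: &str, default: &str| get(key).unwrap_or_else(|| default.to_string());
    Config {
        // Assumption: an unset or invalid PORT falls back to 8080.
        port: get("PORT").and_then(|v| v.parse().ok()).unwrap_or(8080),
        rust_log: or_default("RUST_LOG", "info"),
        data_dir: or_default("DATA_DIR", "/data"),
        whisper_model: or_default("WHISPER_MODEL", "large-v3"),
        whisper_model_path: or_default("WHISPER_MODEL_PATH", "/models/ggml-large-v3.bin"),
    }
}

fn main() {
    // Read from the process environment, e.g. values passed with `docker run -e`.
    let cfg = load_config(|key| env::var(key).ok());
    println!("{:#?}", cfg);
}
```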
RTX 2080 optimisation notes
| Setting | Value | Reason |
|---|---|---|
| CMAKE_CUDA_ARCHITECTURES | 75 | Compiles kernels only for sm_75: smaller binary, faster build |
| GGML_CUDA_FORCE_MMQ | ON | Quantised matrix-multiply (WMMA Tensor Cores), best for Q4/Q5/Q8 models on Turing |
| GGML_CUDA_GRAPHS | ON | CUDA Graph capture → eliminates CPU→GPU dispatch overhead per call (requires sm_75+) |
| GGML_CUDA_FA_ALL_QUANTS | ON | Flash Attention tile kernels for all quantisation types |
| GGML_CUDA_F16 | ON | FP16 arithmetic via Turing Tensor Cores |
| flash_attn (runtime) | true | Enabled in WhisperContextParameters; tile-based, works on sm_75 |
| beam_size | 5 | Best accuracy/speed balance |
| temperature | 0.0 | Deterministic, fastest decode path |
| n_threads | host CPU count | CPU-side pre/post processing |
bfloat16 is intentionally not enabled — that requires Ampere (sm_80+).
flash_attn and DTW token timestamps are mutually exclusive — the server enables flash_attn and omits DTW to maximise throughput.
Webhooks
If webhook_url is set on a job, the server will POST the completed job JSON to that URL:
- Up to 5 retries with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted the failure is logged and dropped
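
The retry schedule above can be sketched as follows. This is an illustration only: `backoff_delay` and `deliver_with_retry` are hypothetical names, the HTTP call is abstracted into a closure, and the sketch assumes one initial attempt followed by up to 5 retries.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 1 s, 2 s, 4 s, 8 s, 16 s.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1u64 << attempt)
}

/// Calls `send` once, then retries up to `max_retries` times, waiting
/// `backoff_delay(n)` before each retry. `sleep` is injected so the schedule
/// can be checked without actually waiting.
fn deliver_with_retry<F, S>(mut send: F, mut sleep: S, max_retries: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    let mut last_err = match send() {
        Ok(()) => return Ok(()),
        Err(e) => e,
    };
    for attempt in 0..max_retries {
        sleep(backoff_delay(attempt));
        match send() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    // All retries exhausted: the caller logs the error and drops the webhook.
    Err(last_err)
}

fn main() {
    let mut calls = 0;
    let result = deliver_with_retry(
        || {
            calls += 1;
            // Simulated endpoint that fails twice, then accepts the POST.
            if calls < 3 { Err("connection refused".to_string()) } else { Ok(()) }
        },
        |d| println!("waiting {} s before retrying", d.as_secs()),
        5,
    );
    println!("result: {:?} after {} calls", result, calls);
}
```

In a real server the `sleep` closure would be an async timer rather than a blocking call, so a slow webhook target does not hold up other deliveries.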
Troubleshooting
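CUDA error: no kernel image available for execution on the device
→ The binary was compiled for a different GPU architecture than the one it is running on. The image is always compiled for sm_75 only, so it will not run on a GPU with another compute capability. If the GPU is sm_75, rebuild with
--build-arg CUDA_VERSION=... matching your driver.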
libcuda.so.1: cannot open shared object file
→ NVIDIA Container Toolkit is not installed or --gpus all / deploy.resources is missing.
Model not found at /models/ggml-large-v3.bin
→ The server fails immediately on start when the model file is missing. Download the model manually:
```bash
docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```
Then restart the server.
Out-of-memory on large-v3
→ The large-v3 GGML model at F16 uses ~3.1 GB VRAM, which leaves headroom on an 8 GB card.
If other GPU workloads run in parallel, switch to ggml-medium.bin (~1.5 GB) by pointing WHISPER_MODEL_PATH at it.