mozempk/whisper-rtx2080

feat: dynamic model loading/unloading with GPU polling

- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-08 17:57:20 +02:00
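The commit message above describes a `WorkerCmd` channel built on `SyncSender` and an idle timer driven by a `recv_timeout` tick inside the worker's OS thread. The following is a minimal sketch of that pattern using only the standard library, not the repository's actual code. The variant names, the `Event` type, and the `run_worker` / `spawn_worker` helpers are assumptions made for illustration, and model loading and transcription are reduced to placeholders.

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Commands accepted by the worker thread (hypothetical variants).
#[derive(Debug)]
pub enum WorkerCmd {
    Load,
    Unload,
    Job(String),
    Shutdown,
}

/// Lifecycle events the worker reports (hypothetical names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    ModelReady,
    ModelUnloaded,
    JobDone(String),
}

/// Runs the worker loop until `Shutdown` arrives or every sender is dropped.
/// Returns the events in the order they occurred.
pub fn run_worker(rx: Receiver<WorkerCmd>, idle_timeout: Duration, tick: Duration) -> Vec<Event> {
    let mut events = Vec::new();
    let mut loaded = false;
    let mut last_activity = Instant::now();

    loop {
        match rx.recv_timeout(tick) {
            Ok(WorkerCmd::Load) => {
                if !loaded {
                    loaded = true; // a real worker would load the model here
                    events.push(Event::ModelReady);
                }
                last_activity = Instant::now();
            }
            Ok(WorkerCmd::Job(id)) => {
                if !loaded {
                    loaded = true; // lazy load on the first job
                    events.push(Event::ModelReady);
                }
                events.push(Event::JobDone(id)); // transcription would run here
                last_activity = Instant::now();
            }
            Ok(WorkerCmd::Unload) => {
                if loaded {
                    loaded = false;
                    events.push(Event::ModelUnloaded);
                }
            }
            Ok(WorkerCmd::Shutdown) | Err(RecvTimeoutError::Disconnected) => break,
            // No command arrived within one tick: check the idle deadline.
            Err(RecvTimeoutError::Timeout) => {
                if loaded && last_activity.elapsed() >= idle_timeout {
                    loaded = false;
                    events.push(Event::ModelUnloaded);
                }
            }
        }
    }
    events
}

/// Spawns the worker on its own OS thread; returns the command sender and the join handle.
pub fn spawn_worker(
    idle_timeout: Duration,
    tick: Duration,
) -> (SyncSender<WorkerCmd>, thread::JoinHandle<Vec<Event>>) {
    let (tx, rx) = sync_channel(16);
    let handle = thread::spawn(move || run_worker(rx, idle_timeout, tick));
    (tx, handle)
}
```

Because the timeout branch of `recv_timeout` doubles as the timer tick, no second thread is needed. The cost is that an idle unload can happen up to one tick later than the configured timeout.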

README.md

whisper-rtx2080

Async REST API for GPU-accelerated speech transcription, built in Rust (Axum) on top of whisper.cpp compiled with CUDA for the NVIDIA RTX 2080 (Turing, sm_75, 8 GB VRAM). No Python.

Requirements

| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| NVIDIA Container Toolkit | `nvidia-docker2` on the host |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |

Quick start

# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run — ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs (`open` is the macOS command; use `xdg-open` on Linux)
open http://localhost:8080/docs
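According to the latest commit message, `POST /jobs` answers `503` with a `Retry-After` header while the model is unloaded and triggers a load in the background. A client therefore needs to distinguish "accepted" from "try again later". The sketch below shows one way to make that decision in Rust. The `SubmitOutcome` type and the `classify_submit` function are hypothetical names, and the sketch only handles the delta-seconds form of `Retry-After`; any other value falls back to a caller-supplied delay.

```rust
use std::time::Duration;

/// What a client should do after submitting a job (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SubmitOutcome {
    /// The server accepted the upload and returned a job id.
    Accepted,
    /// The model is not ready yet; resubmit after this delay.
    RetryAfter(Duration),
    /// Any other status is treated as a hard failure.
    Failed(u16),
}

/// Maps an HTTP status and optional `Retry-After` header value to an outcome.
/// `fallback` is used when the header is absent or is not a whole number of seconds.
pub fn classify_submit(status: u16, retry_after: Option<&str>, fallback: Duration) -> SubmitOutcome {
    match status {
        200..=299 => SubmitOutcome::Accepted,
        503 => {
            let delay = retry_after
                .and_then(|value| value.trim().parse::<u64>().ok())
                .map(Duration::from_secs)
                .unwrap_or(fallback);
            SubmitOutcome::RetryAfter(delay)
        }
        other => SubmitOutcome::Failed(other),
    }
}
```

A caller would loop on `RetryAfter`, sleeping for the returned duration before resubmitting the same multipart request.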

API reference

| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Submit audio for transcription |
| `GET` | `/jobs/{id}` | Poll job status + result |
| `GET` | `/jobs/{id}/stream` | SSE: live progress + completion event |
| `DELETE` | `/jobs/{id}` | Cancel a queued or running job |
| `GET` | `/health` | GPU info + queue depth |
| `GET` | `/docs` | Swagger UI |
| `GET` | `/openapi.json` | Raw OpenAPI 3.0 spec |

POST /jobs — multipart fields

| Field | Required | Description |
|---|---|---|
| `audio` | ✅ | Audio file in any format ffmpeg understands; no size limit |
| `language` | ❌ | ISO 639-1 source language (e.g. `en`). Auto-detected when absent. |
| `task` | ❌ | `transcribe` (default) or `translate` (output always English) |
| `webhook_url` | ❌ | URL that receives a POST with the completed job JSON |

Job result JSON

{
  "id":            "550e8400-e29b-41d4-a716-446655440000",
  "status":        "done",
  "language":      "en",
  "task":          "transcribe",
  "duration_secs": 142.3,
  "progress":      100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end":   2.4,
      "text":  " Hello, world.",
      "words": []
    }
  ],
  "error":        null,
  "created_at":   "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
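For clients that consume this result, the `segments` array is usually the part of interest. The sketch below models one segment as a plain Rust struct and derives a flat transcript and timestamped lines from it. The `Segment` struct mirrors the fields shown above except `words`, which is left out; the helper names are illustrative, and JSON decoding (for example with a crate such as `serde_json`) is assumed to happen elsewhere.

```rust
/// One entry of the `segments` array, minus the `words` field.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub index: u32,
    pub start: f64, // seconds from the beginning of the audio
    pub end: f64,   // seconds from the beginning of the audio
    pub text: String,
}

/// Joins all segment texts into one transcript, trimming the leading
/// space that each segment text carries in the sample above.
pub fn transcript(segments: &[Segment]) -> String {
    segments
        .iter()
        .map(|s| s.text.trim())
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Renders one line per segment in the form `[start - end] text`.
pub fn timestamped_lines(segments: &[Segment]) -> Vec<String> {
    segments
        .iter()
        .map(|s| format!("[{:.2} - {:.2}] {}", s.start, s.end, s.text.trim()))
        .collect()
}
```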

SSE events (`GET /jobs/{id}/stream`)

event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
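A client reading this stream has to split it into events before it can decode the JSON payloads. The sketch below parses the `event:` and `data:` fields of a buffered Server-Sent Events text in plain Rust. `SseEvent` and `parse_sse` are illustrative names rather than part of this project, and the parser covers only the fields used above: it ignores `id:` and `retry:`, and it emits an event only once the terminating blank line has been seen.

```rust
/// One parsed Server-Sent Event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SseEvent {
    pub event: String,
    pub data: String,
}

/// Parses a buffered SSE text into events.
/// An event is emitted at each blank line; events without data are skipped.
pub fn parse_sse(input: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut name = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in input.lines() {
        if line.is_empty() {
            // Blank line: dispatch the event collected so far.
            if !data.is_empty() {
                events.push(SseEvent {
                    event: if name.is_empty() { "message".to_string() } else { name.clone() },
                    data: data.join("\n"),
                });
            }
            name.clear();
            data.clear();
        } else if line.starts_with(':') {
            // Comment line, commonly used as a keep-alive.
            continue;
        } else {
            // Split "field: value"; a single leading space in the value is dropped.
            let (field, value) = match line.split_once(':') {
                Some((f, v)) => (f, v.strip_prefix(' ').unwrap_or(v)),
                None => (line, ""),
            };
            match field {
                "event" => name = value.to_string(),
                "data" => data.push(value),
                _ => {} // other fields are ignored in this sketch
            }
        }
    }
    events
}
```

A client would stop reading once it sees an event named `done`, since its `data` payload carries the full job object.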

Build arguments

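| ARG | Default | Notes |
|---|---|---|
| `CUDA_VERSION` | `12.4.1` | Passed to the NVIDIA base image tag |
| `CUDNN_TAG` | `cudnn` | `cudnn` for the default CUDA 12.4.1 · `cudnn8` for CUDA 11.x and for 12.1.0, as in the examples below; check that the resulting base image tag exists for your CUDA version |
| `UBUNTU_VERSION` | `22.04` | Ubuntu base |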

Custom CUDA version examples

# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .

Runtime environment variables

All can be overridden with -e or in docker-compose.yml:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | TCP port the server listens on |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
| `DATA_DIR` | `/data` | Directory for persisted job state (mount a volume here) |
| `WHISPER_MODEL` | `large-v3` | Model name (for /health reporting) |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to the GGML model file |
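As an illustration of how these variables and their defaults fit together, the sketch below reads them into a small config struct. It is not the server's actual configuration code: the `Config` struct and its methods are hypothetical, and `RUST_LOG` is left out because logging crates typically read it themselves. The lookup is passed in as a closure so the defaults can be exercised without touching the process environment.

```rust
use std::env;
use std::path::PathBuf;

/// Runtime settings with the documented defaults (hypothetical struct).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Config {
    pub port: u16,
    pub data_dir: PathBuf,
    pub model: String,
    pub model_path: PathBuf,
}

impl Config {
    /// Builds a config from any key lookup, falling back to the defaults in the table above.
    pub fn from_lookup<F>(get: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // PORT is the only value that needs parsing; reject anything that is not a u16.
        let port = match get("PORT") {
            Some(raw) => raw
                .trim()
                .parse::<u16>()
                .map_err(|e| format!("invalid PORT {raw:?}: {e}"))?,
            None => 8080,
        };
        Ok(Config {
            port,
            data_dir: PathBuf::from(get("DATA_DIR").unwrap_or_else(|| "/data".to_string())),
            model: get("WHISPER_MODEL").unwrap_or_else(|| "large-v3".to_string()),
            model_path: PathBuf::from(
                get("WHISPER_MODEL_PATH").unwrap_or_else(|| "/models/ggml-large-v3.bin".to_string()),
            ),
        })
    }

    /// Reads the settings from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```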

RTX 2080 optimisation notes

| Setting | Value | Reason |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES` | `75` | Compiles kernels only for sm_75: smaller binary, faster build |
| `GGML_CUDA_FORCE_MMQ` | `ON` | Quantised matrix-multiply (WMMA Tensor Cores); best for Q4/Q5/Q8 models on Turing |
| `GGML_CUDA_GRAPHS` | `ON` | CUDA Graph capture → eliminates CPU→GPU dispatch overhead per call (requires sm_75+) |
| `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Flash Attention tile kernels for all quantisation types |
| `GGML_CUDA_F16` | `ON` | FP16 arithmetic via Turing Tensor Cores |
| `flash_attn` (runtime) | `true` | Enabled in `WhisperContextParameters`; tile-based, works on sm_75 |
| `beam_size` | `5` | Best accuracy/speed balance |
| `temperature` | `0.0` | Deterministic, fastest decode path |
| `n_threads` | host CPU count | CPU-side pre/post processing |

bfloat16 is intentionally not enabled — that requires Ampere (sm_80+).

flash_attn and DTW token timestamps are mutually exclusive — the server enables flash_attn and omits DTW to maximise throughput.

Webhooks

If webhook_url is set on a job, the server will POST the completed job JSON to that URL:

- Up to 5 retries with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted, the failure is logged and the webhook is dropped
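The retry schedule above doubles the delay before each retry. A minimal sketch of that schedule in Rust follows, assuming one initial attempt followed by up to five retries. The function names are illustrative, and the actual HTTP POST and the sleep are passed in as closures so the schedule can be checked without network access or real waiting.

```rust
use std::time::Duration;

/// Maximum number of retries after the initial delivery attempt.
pub const MAX_RETRIES: u32 = 5;

/// Delay before the given 1-based retry: 1 s, 2 s, 4 s, 8 s, 16 s.
/// Returns `None` outside the range 1..=MAX_RETRIES.
pub fn retry_delay(retry: u32) -> Option<Duration> {
    if retry == 0 || retry > MAX_RETRIES {
        return None;
    }
    Some(Duration::from_secs(1u64 << (retry - 1)))
}

/// Calls `send` until it reports success or the retries are exhausted.
/// `wait` receives each backoff delay (a real caller would sleep).
/// Returns `true` if any attempt succeeded.
pub fn deliver_with_retries<S, W>(mut send: S, mut wait: W) -> bool
where
    S: FnMut() -> bool,
    W: FnMut(Duration),
{
    if send() {
        return true;
    }
    for retry in 1..=MAX_RETRIES {
        if let Some(delay) = retry_delay(retry) {
            wait(delay);
        }
        if send() {
            return true;
        }
    }
    false // the caller logs the failure and drops the webhook
}
```

With this schedule, a webhook that never succeeds is abandoned after 31 seconds of accumulated backoff, not counting the time spent in the requests themselves.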

Troubleshooting

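CUDA error: no kernel image available for execution on the device → The binary contains no kernels for the GPU it is running on. The image is always compiled for sm_75 only, so check that the container sees a Turing-class GPU (compute capability 7.5). Rebuilding with a different --build-arg CUDA_VERSION=... matches the CUDA toolkit to your driver but does not change the target architecture; to target another GPU, the `CMAKE_CUDA_ARCHITECTURES` setting listed under "RTX 2080 optimisation notes" would also have to change.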

libcuda.so.1: cannot open shared object file → NVIDIA Container Toolkit is not installed or --gpus all / deploy.resources is missing.

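Model not found at /models/ggml-large-v3.bin → The model file is missing from the models volume (for example, because the automatic first-start download did not complete), so the server fails immediately on start. Download the model manually: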

docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin

Then restart the server.

Out-of-memory on large-v3 → The large-v3 GGML model at F16 uses ~3.1 GB VRAM; you should have headroom on 8 GB. If running other GPU workloads in parallel, switch to ggml-medium.bin (~1.5 GB).
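The latest commit message adds a further behaviour for this case: when a model load hits a VRAM out-of-memory error, the server enters `WaitingForGpu` and retries every `GPU_POLL_INTERVAL_SECS`. The sketch below writes that lifecycle as a pure transition function. Only `WaitingForGpu` is a name taken from the commit message; the other state names, the `Input` variants, and the `step` function are assumptions made for illustration and may not match the real implementation.

```rust
/// Model lifecycle states. `WaitingForGpu` is named in the commit message;
/// the other names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelState {
    Unloaded,
    Loading,
    WaitingForGpu,
    Ready,
}

/// Things that can happen to the model (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Input {
    LoadRequested,   // first job or POST /model/load
    LoadSucceeded,
    LoadOutOfMemory, // VRAM is held by another workload
    PollTick,        // fires every GPU_POLL_INTERVAL_SECS
    UnloadRequested, // POST /model/unload
    IdleTimeout,     // IDLE_TIMEOUT_SECS elapsed with no activity
}

/// Returns the next state; inputs that do not apply leave the state unchanged.
pub fn step(state: ModelState, input: Input) -> ModelState {
    use Input::*;
    use ModelState::*;
    match (state, input) {
        (Unloaded, LoadRequested) => Loading,
        (Loading, LoadSucceeded) => Ready,
        (Loading, LoadOutOfMemory) => WaitingForGpu,
        (WaitingForGpu, PollTick) => Loading, // retry the load
        (Ready, IdleTimeout) => Unloaded,
        (_, UnloadRequested) => Unloaded,
        (current, _) => current,
    }
}
```

Making a repeated `LoadRequested` a no-op in every state except `Unloaded` is what keeps a load trigger idempotent in this sketch.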
