mozempk fd8d4deefb
fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding
GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.
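
  The warmup idea can be sketched in isolation. This is a minimal sketch, not the code in src/transcriber.rs: `silence`, `warm_up`, and the `run_inference` closure are hypothetical names, and the closure stands in for whatever wraps whisper_full_with_state. The only external fact assumed is that whisper.cpp consumes 16 kHz mono f32 samples.

```rust
/// whisper.cpp consumes 16 kHz mono f32 PCM samples.
pub const SAMPLE_RATE_HZ: usize = 16_000;

/// Build a buffer of digital silence of the given length.
pub fn silence(seconds: usize) -> Vec<f32> {
    vec![0.0; seconds * SAMPLE_RATE_HZ]
}

/// Run one throwaway inference pass over 1 s of silence.
///
/// `run_inference` is a hypothetical stand-in for the real inference call;
/// it receives the samples and returns the number of segments produced.
/// The segment count is discarded: the pass exists only to trigger kernel
/// compilation before the first real job arrives.
pub fn warm_up<F, E>(mut run_inference: F) -> Result<(), E>
where
    F: FnMut(&[f32]) -> Result<usize, E>,
{
    let samples = silence(1);
    run_inference(&samples).map(|_segments| ())
}
```

  Calling `warm_up` once at the end of model loading keeps the cost at startup rather than inside the first job.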

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:57:30 +02:00

whisper-rtx2080

Async REST API for GPU-accelerated speech transcription, built in Rust (Axum) on top of whisper.cpp compiled with CUDA for the NVIDIA RTX 2080 (Turing, sm_75, 8 GB VRAM). No Python.


Requirements

| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| NVIDIA Container Toolkit | nvidia-docker2 on the host |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |

Quick start

# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run — ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs
open http://localhost:8080/docs   # macOS; on Linux use xdg-open

API reference

| Method | Path | Description |
|---|---|---|
| POST | /jobs | Submit audio for transcription |
| GET | /jobs/{id} | Poll job status + result |
| GET | /jobs/{id}/stream | SSE: live progress + completion event |
| DELETE | /jobs/{id} | Cancel a queued or running job |
| GET | /health | GPU info + queue depth |
| GET | /docs | Swagger UI |
| GET | /openapi.json | Raw OpenAPI 3.0 spec |

POST /jobs — multipart fields

| Field | Required | Description |
|---|---|---|
| audio | yes | Audio file in any format ffmpeg understands; no size limit |
| language | no | ISO 639-1 source language (e.g. en). Auto-detected when absent; omit the field rather than sending auto. |
| task | no | transcribe (default) or translate (output always English) |
| webhook_url | no | URL that receives a POST with the completed job JSON |
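
The rules for language and task in the table above can be checked on the client before submitting. The following is a minimal sketch under stated assumptions: `Task`, `parse_task`, and `parse_language` are hypothetical names, the language check only tests the shape of the code (two lowercase ASCII letters) rather than membership in the ISO 639-1 list, and nothing here claims to mirror how the server validates input.

```rust
/// Accepted values for the `task` field.
#[derive(Debug, PartialEq)]
pub enum Task {
    Transcribe,
    Translate,
}

/// Parse the optional `task` field; an absent field means `transcribe`.
pub fn parse_task(raw: Option<&str>) -> Result<Task, String> {
    match raw {
        None | Some("transcribe") => Ok(Task::Transcribe),
        Some("translate") => Ok(Task::Translate),
        Some(other) => Err(format!("unknown task: {other}")),
    }
}

/// Shape check only: exactly two lowercase ASCII letters.
/// `Ok(None)` means the field is absent, so the language is auto-detected.
/// A value such as "auto" is rejected because it is not a two-letter code.
pub fn parse_language(raw: Option<&str>) -> Result<Option<String>, String> {
    match raw {
        None => Ok(None),
        Some(code) if code.len() == 2 && code.bytes().all(|b| b.is_ascii_lowercase()) => {
            Ok(Some(code.to_string()))
        }
        Some(other) => Err(format!("not an ISO 639-1 code: {other}")),
    }
}
```

Rejecting "auto" locally avoids a round trip for a request that relies on a value the API does not define.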

Job result JSON

{
  "id":            "550e8400-e29b-41d4-a716-446655440000",
  "status":        "done",
  "language":      "en",
  "task":          "transcribe",
  "duration_secs": 142.3,
  "progress":      100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end":   2.4,
      "text":  " Hello, world.",
      "words": []
    }
  ],
  "error":        null,
  "created_at":   "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
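
Segment texts carry their own leading space, as in " Hello, world." above, so a full transcript can be rebuilt by plain concatenation. A minimal sketch, assuming the segments have already been decoded from JSON into a hypothetical `Segment` struct (the `words` array is left out for brevity):

```rust
/// One entry of the `segments` array; field names follow the JSON above.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub index: u32,
    pub start: f64, // seconds
    pub end: f64,   // seconds
    pub text: String,
}

/// Concatenate segment texts in order and trim the outer whitespace.
/// No separator is inserted because each text already starts with a space.
pub fn transcript(segments: &[Segment]) -> String {
    let joined: String = segments.iter().map(|s| s.text.as_str()).collect();
    joined.trim().to_string()
}
```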

SSE events (GET /jobs/{id}/stream)

event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
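
A client reading this stream has to split it into events before decoding the JSON payloads. The sketch below is a minimal parser for the two fields shown above (`event:` and `data:`); `SseEvent` and `parse_sse` are hypothetical names, and it works on a complete string rather than an incremental byte stream. An event is emitted only when a blank line terminates it, which follows the Server-Sent Events framing rules.

```rust
/// One parsed Server-Sent Event.
#[derive(Debug, PartialEq)]
pub struct SseEvent {
    pub event: String,
    pub data: String,
}

/// Split SSE text into events. Lines other than `event:` and `data:`
/// (comments, `id:`, `retry:`) are ignored in this sketch.
pub fn parse_sse(stream: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut name = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in stream.lines() {
        if line.is_empty() {
            // A blank line terminates the current event.
            if !data.is_empty() {
                events.push(SseEvent {
                    // "message" is the default event name in the SSE format.
                    event: if name.is_empty() { "message".to_string() } else { name.clone() },
                    data: data.join("\n"),
                });
            }
            name.clear();
            data.clear();
        } else if let Some(value) = line.strip_prefix("event:") {
            // A single leading space after the colon is not part of the value.
            name = value.strip_prefix(' ').unwrap_or(value).to_string();
        } else if let Some(value) = line.strip_prefix("data:") {
            data.push(value.strip_prefix(' ').unwrap_or(value));
        }
    }
    events
}
```

The `data` string of each event is the JSON shown above and can be handed to any JSON decoder; a `done` event carries the full job object.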

Build arguments

| ARG | Default | Notes |
|---|---|---|
| CUDA_VERSION | 12.4.1 | Passed to the NVIDIA base image tag |
| CUDNN_TAG | cudnn | cudnn for CUDA 12.4.x · cudnn8 for CUDA 11.x and 12.1.x (see the examples) |
| UBUNTU_VERSION | 22.04 | Ubuntu base |

Custom CUDA version examples

# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .

Runtime environment variables

All can be overridden with -e or in docker-compose.yml:

| Variable | Default | Description |
|---|---|---|
| PORT | 8080 | TCP port the server listens on |
| RUST_LOG | info | Log level (trace, debug, info, warn, error) |
| DATA_DIR | /data | Directory for persisted job state (mount a volume here) |
| WHISPER_MODEL | large-v3 | Model name (for /health reporting) |
| WHISPER_MODEL_PATH | /models/ggml-large-v3.bin | Absolute path to the GGML model file |
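
How these defaults could be resolved at startup is sketched below. This is an illustration, not the server's actual configuration code: `Config`, `from_lookup`, and `from_env` are hypothetical names. The lookup is passed in as a closure so the defaults can be exercised without touching the process environment. RUST_LOG is left out because it is normally read by the logging setup rather than by application code.

```rust
use std::path::PathBuf;

/// Runtime settings with the defaults from the table above.
#[derive(Debug, PartialEq)]
pub struct Config {
    pub port: u16,
    pub data_dir: PathBuf,
    pub model: String,
    pub model_path: PathBuf,
}

impl Config {
    /// Build a config from any key -> value lookup, applying defaults
    /// for keys that are absent.
    pub fn from_lookup<F>(get: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let port = match get("PORT") {
            Some(raw) => raw
                .parse::<u16>()
                .map_err(|e| format!("invalid PORT {raw:?}: {e}"))?,
            None => 8080,
        };
        Ok(Config {
            port,
            data_dir: PathBuf::from(get("DATA_DIR").unwrap_or_else(|| "/data".to_string())),
            model: get("WHISPER_MODEL").unwrap_or_else(|| "large-v3".to_string()),
            model_path: PathBuf::from(
                get("WHISPER_MODEL_PATH")
                    .unwrap_or_else(|| "/models/ggml-large-v3.bin".to_string()),
            ),
        })
    }

    /// Read the same settings from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```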

RTX 2080 optimisation notes

| Setting | Value | Reason |
|---|---|---|
| CMAKE_CUDA_ARCHITECTURES | 75 | Compiles kernels only for sm_75: smaller binary, faster build |
| GGML_CUDA_FORCE_MMQ | ON | Quantised matrix-multiply (WMMA Tensor Cores), best for Q4/Q5/Q8 models on Turing |
| GGML_CUDA_GRAPHS | ON | CUDA Graph capture → eliminates CPU→GPU dispatch overhead per call (requires sm_75+) |
| GGML_CUDA_FA_ALL_QUANTS | ON | Flash Attention tile kernels for all quantisation types |
| GGML_CUDA_F16 | ON | FP16 arithmetic via Turing Tensor Cores |
| flash_attn (runtime) | true | Enabled in WhisperContextParameters; tile-based, works on sm_75 |
| beam_size | 5 | Best accuracy/speed balance |
| temperature | 0.0 | Deterministic, fastest decode path |
| n_threads | host CPU count | CPU-side pre/post processing |

bfloat16 is intentionally not enabled — that requires Ampere (sm_80+).

flash_attn and DTW token timestamps are mutually exclusive — the server enables flash_attn and omits DTW to maximise throughput.


Webhooks

If webhook_url is set on a job, the server will POST the completed job JSON to that URL:

  • Up to 5 retries with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
  • After all retries are exhausted, the failure is logged and the webhook is dropped
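
The retry schedule above can be sketched as a small loop. This is a minimal sketch, not the server's delivery code: `deliver_with_retries`, `send`, and `wait` are hypothetical names, and it assumes one initial attempt followed by up to 5 retries, each preceded by its delay. The HTTP call and the sleep are passed in as closures so the schedule can be checked without a network or real waiting.

```rust
use std::time::Duration;

/// Retry limit stated in the list above.
pub const MAX_RETRIES: u32 = 5;

/// Delay before retry number `retry` (0-based): 1 s, 2 s, 4 s, 8 s, 16 s.
pub fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(1u64 << retry)
}

/// Try `send` once, then retry up to MAX_RETRIES times with backoff.
/// Returns the number of retries that were needed, or the last error
/// once all retries are exhausted (the caller logs it and gives up).
pub fn deliver_with_retries<S, W>(mut send: S, mut wait: W) -> Result<u32, String>
where
    S: FnMut() -> Result<(), String>,
    W: FnMut(Duration),
{
    let mut last_error = match send() {
        Ok(()) => return Ok(0),
        Err(e) => e,
    };
    for retry in 0..MAX_RETRIES {
        wait(backoff_delay(retry));
        match send() {
            Ok(()) => return Ok(retry + 1),
            Err(e) => last_error = e,
        }
    }
    Err(last_error)
}
```

In a blocking context `wait` could be `std::thread::sleep`; in the worst case the delays add up to 31 s before the delivery is abandoned.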

Troubleshooting

CUDA error: no kernel image available for execution on the device → The binary was compiled for a different architecture. Rebuild with --build-arg CUDA_VERSION=... matching your driver. The image is always compiled for sm_75 only.

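libcuda.so.1: cannot open shared object file → NVIDIA Container Toolkit is not installed on the host, or the container was started without GPU access (--gpus all on docker run, or the deploy.resources GPU reservation in docker-compose.yml, is missing).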

Model not found at /models/ggml-large-v3.bin → The server fails immediately at startup when the model file is missing, for example if the automatic first-start download did not complete. Download the model manually:

docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin

Then restart the server.

Out-of-memory on large-v3 → The large-v3 GGML model at F16 uses ~3.1 GB VRAM; you should have headroom on 8 GB. If running other GPU workloads in parallel, switch to ggml-medium.bin (~1.5 GB).

Description
GPU-accelerated Whisper transcription API for RTX 2080 (sm_75) — pure Rust, Axum, whisper-rs, CUDA
2026-05-08 23:45:52 +02:00