# whisper-rtx2080

Async REST API for GPU-accelerated speech transcription, built in **Rust** (Axum) on top of **whisper.cpp** compiled with CUDA for the **NVIDIA RTX 2080** (Turing, sm\_75, 8 GB VRAM). No Python.

---

## Requirements

| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) | `nvidia-docker2` on the host |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |

---

## Quick start

```bash
# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run, ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs
open http://localhost:8080/docs
```

---

## API reference

| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Submit audio for transcription |
| `GET` | `/jobs/{id}` | Poll job status + result |
| `GET` | `/jobs/{id}/stream` | SSE: live progress + completion event |
| `DELETE` | `/jobs/{id}` | Cancel a queued or running job |
| `GET` | `/health` | GPU info + queue depth |
| `GET` | `/docs` | Swagger UI |
| `GET` | `/openapi.json` | Raw OpenAPI 3.0 spec |

### POST /jobs: multipart fields

| Field | Required | Description |
|---|---|---|
| `audio` | ✅ | Audio file in any format ffmpeg understands; no size limit |
| `language` | ❌ | ISO 639-1 source language (e.g. `en`). Auto-detected when absent. |
| `task` | ❌ | `transcribe` (default) or `translate` (output always English) |
| `webhook_url` | ❌ | URL that receives a `POST` with the completed job JSON |

### Job result JSON

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 142.3,
  "progress": 100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.4,
      "text": " Hello, world.",
      "words": []
    }
  ],
  "error": null,
  "created_at": "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
```

### SSE events (`GET /jobs/{id}/stream`)

```
event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
```

---

## Build arguments

| ARG | Default | Notes |
|---|---|---|
| `CUDA_VERSION` | `12.4.1` | Passed to the NVIDIA base image tag |
| `CUDNN_TAG` | `cudnn` | `cudnn` for CUDA 12.4.x · `cudnn8` for CUDA 12.1 and 11.x (see the examples below) |
| `UBUNTU_VERSION` | `22.04` | Ubuntu base |

### Custom CUDA version examples

```bash
# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

---

## Runtime environment variables

All can be overridden with `-e` or in `docker-compose.yml`:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | TCP port the server listens on |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
| `DATA_DIR` | `/data` | Directory for persisted job state (mount a volume here) |
| `WHISPER_MODEL` | `large-v3` | Model name (for `/health` reporting) |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to the GGML model file |

---

## RTX 2080 optimisation notes

| Setting | Value | Reason |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES` | `75` | Compiles kernels **only for sm\_75**: smaller binary, faster build |
| `GGML_CUDA_FORCE_MMQ` | `ON` | Quantised matrix-multiply (WMMA Tensor Cores), best for Q4/Q5/Q8 models on Turing |
| `GGML_CUDA_GRAPHS` | `ON` | CUDA Graph capture → eliminates CPU→GPU dispatch overhead per call (requires sm\_75+) |
| `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Flash Attention tile kernels for all quantisation types |
| `GGML_CUDA_F16` | `ON` | FP16 arithmetic via Turing Tensor Cores |
| `flash_attn` (runtime) | `true` | Enabled in `WhisperContextParameters`; tile-based, works on sm\_75 |
| `beam_size` | `5` | Best accuracy/speed balance |
| `temperature` | `0.0` | Deterministic, fastest decode path |
| `n_threads` | host CPU count | CPU-side pre/post processing |

> **bfloat16 is intentionally not enabled**: it requires Ampere (sm\_80+).
>
> **flash\_attn and DTW token timestamps are mutually exclusive**: the server enables
> flash\_attn and omits DTW to maximise throughput.
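
Because the kernels are compiled for sm\_75 only, it can help to confirm what Docker actually sees before starting the server. The sketch below assumes the installed `nvidia-smi` supports the `compute_cap` query field (treat this as an assumption on older drivers) and uses the public `nvidia/cuda:12.4.1-base-ubuntu22.04` image purely as a throwaway test container; neither is part of this project. An RTX 2080 should report compute capability 7.5 and roughly 8 GB of memory.

```bash
# Host-side check: GPU name, compute capability, and total VRAM.
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv

# Same check from inside a container, to confirm that the NVIDIA Container
# Toolkit exposes the GPU to Docker (the image here is only a test vehicle).
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 \
  nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
```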

---

## Webhooks

If `webhook_url` is set on a job, the server will `POST` the completed job JSON to that URL:

- Up to **5 retries** with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted, the failure is logged and the delivery is dropped

---

## Troubleshooting

**`CUDA error: no kernel image available for execution on the device`**
→ The binary was compiled for a different architecture than the GPU in use. The image is always compiled for sm\_75 only, so it will not run on other GPU generations. If the GPU is an sm\_75 device, rebuild with `--build-arg CUDA_VERSION=...` matching your driver.

**`libcuda.so.1: cannot open shared object file`**
→ NVIDIA Container Toolkit is not installed, or `--gpus all` / `deploy.resources` is missing.

**Model not found at `/models/ggml-large-v3.bin`**
→ The server exits immediately on start when the model file is missing, for example if the first-run download did not complete. Download the model manually:

```bash
docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```

Then restart the server.

**Out-of-memory on large-v3**
→ The large-v3 GGML model at F16 uses ~3.1 GB VRAM, which leaves headroom on an 8 GB card. If other GPU workloads run in parallel, switch to `ggml-medium.bin` (~1.5 GB).
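
Switching to the medium model comes down to placing a different GGML file in the models volume and pointing `WHISPER_MODEL_PATH` at it. A minimal sketch, assuming that the `whisper-models` volume from the download command above is the one mounted at `/models`, that `ggml-medium.bin` is published under the same Hugging Face repository path as the large-v3 file, and that the image was tagged `whisper-rtx2080:cu121` as in the build examples. The `docker run` flags are illustrative and not taken from this project's `docker-compose.yml`; adjust them to match your setup.

```bash
# Fetch the medium model into the same volume used for large-v3.
docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-medium.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin

# Start the server against the smaller model.
# WHISPER_MODEL only changes the name reported by /health;
# WHISPER_MODEL_PATH selects the file that is actually loaded.
docker run -d --gpus all -p 8080:8080 \
  -v whisper-models:/models \
  -e WHISPER_MODEL=medium \
  -e WHISPER_MODEL_PATH=/models/ggml-medium.bin \
  whisper-rtx2080:cu121
```

Afterwards, `curl http://localhost:8080/health` should show the new model name if the variables were picked up.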