whisper-rtx2080/README.md
mozempk 16cb6ca661
feat: GPU-accelerated Whisper API for RTX 2080 (sm_75)
- Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI)
- Async job queue with SSE progress streaming
- Webhook delivery with 5x exponential backoff
- Disk-persisted job state (survives restarts)
- Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank
- CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS
- Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR
- Gitea Actions CI: build + push to git.sal.giize.com registry
- Multi-stage Dockerfile with customizable CUDA_VERSION ARG

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 22:47:24 +02:00

# whisper-rtx2080
Async REST API for GPU-accelerated speech transcription, built in **Rust** (Axum) on top of
**whisper.cpp** compiled with CUDA for the **NVIDIA RTX 2080** (Turing, sm\_75, 8 GB VRAM).
No Python.

---
## Requirements
| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) | `nvidia-container-toolkit` package on the host (replaces the deprecated `nvidia-docker2`) |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |
---
## Quick start
```bash
# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build
# Start the server (model downloads on first run — ~3 GB)
docker compose up -d
# Check it's running
curl http://localhost:8080/health
# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }
# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .
# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream
# Browse the interactive API docs (`open` is macOS; use `xdg-open` on Linux)
open http://localhost:8080/docs
```
---
## API reference
| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Submit audio for transcription |
| `GET` | `/jobs/{id}` | Poll job status + result |
| `GET` | `/jobs/{id}/stream` | SSE: live progress + completion event |
| `DELETE` | `/jobs/{id}` | Cancel a queued or running job |
| `GET` | `/health` | GPU info + queue depth |
| `GET` | `/docs` | Swagger UI |
| `GET` | `/openapi.json` | Raw OpenAPI 3.0 spec |
### POST /jobs — multipart fields
| Field | Required | Description |
|---|---|---|
| `audio` | ✅ | Audio file — any format ffmpeg understands; no size limit |
| `language` | ❌ | ISO 639-1 source language (e.g. `en`). Auto-detected when absent. |
| `task` | ❌ | `transcribe` (default) or `translate` (output always English) |
| `webhook_url` | ❌ | URL to POST the completed job JSON to on completion |
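For clients that cannot shell out to `curl`, the fields above can be encoded by hand. The following is a minimal sketch of the `multipart/form-data` body that `curl -F` would produce, using only the Rust standard library. The function names, the boundary string and the `application/octet-stream` part type are illustrative assumptions, not part of this project; only the field names come from the table.

```rust
/// Value for the request's `Content-Type` header.
/// `boundary` is any string that does not occur in the payload (caller's choice).
fn multipart_content_type(boundary: &str) -> String {
    format!("multipart/form-data; boundary={boundary}")
}

/// Builds a multipart/form-data body for `POST /jobs`.
/// `text_fields` carries the optional fields (`language`, `task`, `webhook_url`);
/// the required `audio` part is always appended last.
fn build_job_body(
    boundary: &str,
    filename: &str,
    audio: &[u8],
    text_fields: &[(&str, &str)],
) -> Vec<u8> {
    let mut body = Vec::new();

    // Plain text parts: one per optional field.
    for (name, value) in text_fields {
        let part = format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"{name}\"\r\n\r\n{value}\r\n"
        );
        body.extend_from_slice(part.as_bytes());
    }

    // File part: headers, a blank line, then the raw audio bytes.
    let header = format!(
        "--{boundary}\r\nContent-Disposition: form-data; name=\"audio\"; filename=\"{filename}\"\r\nContent-Type: application/octet-stream\r\n\r\n"
    );
    body.extend_from_slice(header.as_bytes());
    body.extend_from_slice(audio);

    // Closing delimiter marks the end of the form.
    body.extend_from_slice(format!("\r\n--{boundary}--\r\n").as_bytes());
    body
}
```

The resulting bytes can be sent with any HTTP client, with the `Content-Type` header set to the value returned by `multipart_content_type`.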
### Job result JSON
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 142.3,
  "progress": 100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.4,
      "text": " Hello, world.",
      "words": []
    }
  ],
  "error": null,
  "created_at": "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
```
### SSE events (`GET /jobs/{id}/stream`)
```
event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
```
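On the wire, each Server-Sent Event is a group of `field: value` lines terminated by a blank line. As a rough illustration of how a client might split the stream into events, here is a small std-only Rust sketch. `SseEvent` and `parse_sse` are hypothetical names, and the parser handles only the `event` and `data` fields used above; it is not the server's implementation.

```rust
/// One parsed Server-Sent Event: the `event:` name and the joined `data:` lines.
#[derive(Debug, PartialEq)]
struct SseEvent {
    event: String,
    data: String,
}

/// Splits a buffered SSE stream into events.
/// An event is dispatched only when its terminating blank line has arrived,
/// so a trailing partial event is dropped.
fn parse_sse(stream: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut event = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in stream.lines() {
        if line.is_empty() {
            // Blank line: dispatch whatever has been collected so far.
            if !data.is_empty() {
                let name = if event.is_empty() {
                    "message".to_string() // SSE default event name
                } else {
                    std::mem::take(&mut event)
                };
                events.push(SseEvent { event: name, data: data.join("\n") });
                data.clear();
            }
            event.clear();
        } else if let Some(rest) = line.strip_prefix("event:") {
            // A single leading space after the colon is not part of the value.
            event = rest.strip_prefix(' ').unwrap_or(rest).to_string();
        } else if let Some(rest) = line.strip_prefix("data:") {
            data.push(rest.strip_prefix(' ').unwrap_or(rest));
        }
        // Comment lines (starting with ':') and other fields are ignored here.
    }
    events
}
```

The `data` string of each event is JSON and can be handed to whatever JSON library the client already uses.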
---
## Build arguments
| ARG | Default | Notes |
|---|---|---|
| `CUDA_VERSION` | `12.4.1` | Passed to the NVIDIA base image tag |
| `CUDNN_TAG` | `cudnn` | `cudnn` for recent CUDA 12.x images (e.g. 12.4.1) · `cudnn8` for CUDA 11.x and older 12.x images (e.g. 12.1.0) |
| `UBUNTU_VERSION` | `22.04` | Ubuntu base |
### Custom CUDA version examples
```bash
# CUDA 12.1
docker build \
--build-arg CUDA_VERSION=12.1.0 \
--build-arg CUDNN_TAG=cudnn8 \
-t whisper-rtx2080:cu121 .
# CUDA 11.8 (legacy)
docker build \
--build-arg CUDA_VERSION=11.8.0 \
--build-arg CUDNN_TAG=cudnn8 \
--build-arg UBUNTU_VERSION=20.04 \
-t whisper-rtx2080:cu118 .
```
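The three build arguments have to combine into a tag that actually exists in the `nvidia/cuda` repository, which is why `CUDNN_TAG` changes with the CUDA version. The sketch below shows how such a tag is typically assembled, assuming the Dockerfile follows Docker Hub's `<cuda>-<cudnn>-<flavor>-ubuntu<version>` naming. The exact `FROM` line is not shown in this README, so the function and the `flavor` parameter are assumptions for illustration.

```rust
/// Assembles an `nvidia/cuda` image reference from the build arguments.
/// `flavor` is an assumed extra component (`devel` or `runtime` on Docker Hub);
/// it is not one of this project's documented build args.
fn base_image(cuda_version: &str, cudnn_tag: &str, flavor: &str, ubuntu_version: &str) -> String {
    format!("nvidia/cuda:{cuda_version}-{cudnn_tag}-{flavor}-ubuntu{ubuntu_version}")
}
```

With the defaults from the table and a `devel` flavor this yields `nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04`. Checking that the resulting tag exists on Docker Hub before building avoids a failed pull partway through the build.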
---
## Runtime environment variables
All can be overridden with `-e` or in `docker-compose.yml`:
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | TCP port the server listens on |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
| `DATA_DIR` | `/data` | Directory for persisted job state (mount a volume here) |
| `WHISPER_MODEL` | `large-v3` | Model name (for /health reporting) |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to the GGML model file |
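As an illustration of how these variables and their defaults fit together, here is a minimal std-only Rust sketch of a config loader. The `Config` struct and its methods are hypothetical and are not taken from the server's source; only the variable names and default values come from the table above.

```rust
use std::env;

/// Runtime settings mirroring the table above (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Config {
    port: u16,
    rust_log: String,
    data_dir: String,
    whisper_model: String,
    whisper_model_path: String,
}

impl Config {
    /// Builds a config from any key lookup, falling back to the documented defaults.
    /// Taking a closure keeps the logic testable without touching the real environment.
    fn from_lookup<F>(lookup: F) -> Result<Config, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());

        // PORT is the only value that needs parsing; reject anything that is not a u16.
        let port_raw = get("PORT", "8080");
        let port = port_raw
            .parse::<u16>()
            .map_err(|e| format!("invalid PORT {port_raw:?}: {e}"))?;

        Ok(Config {
            port,
            rust_log: get("RUST_LOG", "info"),
            data_dir: get("DATA_DIR", "/data"),
            whisper_model: get("WHISPER_MODEL", "large-v3"),
            whisper_model_path: get("WHISPER_MODEL_PATH", "/models/ggml-large-v3.bin"),
        })
    }

    /// Reads the process environment.
    fn from_env() -> Result<Config, String> {
        Config::from_lookup(|key| env::var(key).ok())
    }
}
```

Calling `Config::from_env()` at startup and failing fast on an `Err` surfaces a mistyped `PORT` before the model is loaded.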
---
## RTX 2080 optimisation notes
| Setting | Value | Reason |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES` | `75` | Compiles kernels **only for sm\_75** — smaller binary, faster build |
| `GGML_CUDA_FORCE_MMQ` | `ON` | Forces the custom quantised matrix-multiply (MMQ) kernels rather than FP16 cuBLAS; best for Q4/Q5/Q8 models on Turing |
| `GGML_CUDA_GRAPHS` | `ON` | CUDA Graph capture → eliminates CPU→GPU dispatch overhead per call (requires sm\_75+) |
| `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Flash Attention tile kernels for all quantisation types |
| `GGML_CUDA_F16` | `ON` | FP16 arithmetic via Turing Tensor Cores |
| `flash_attn` (runtime) | `true` | Enabled in `WhisperContextParameters` — tile-based, works on sm\_75 |
| `beam_size` | `5` | Best accuracy/speed balance |
| `temperature` | `0.0` | Deterministic, fastest decode path |
| `n_threads` | host CPU count | CPU-side pre/post processing |
> **bfloat16 is intentionally not enabled** — that requires Ampere (sm\_80+).
>
> **flash\_attn and DTW token timestamps are mutually exclusive** — the server enables
> flash\_attn and omits DTW to maximise throughput.
---
## Webhooks
If `webhook_url` is set on a job, the server will `POST` the completed job JSON to that URL:
- Up to **5 retries** with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted the failure is logged and dropped
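The retry schedule above can be sketched as a doubling delay starting at 1 s. The following std-only Rust example assumes one initial attempt followed by up to five retries, with the delay applied before each retry. The function names are hypothetical, and the `send` and `sleep` callbacks stand in for the real HTTP call and timer.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): 1 s, 2 s, 4 s, 8 s, 16 s, ...
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1u64 << attempt)
}

/// Calls `send` once, then retries up to `max_retries` times with exponential backoff.
/// Returns `true` as soon as one attempt succeeds, `false` once retries are exhausted.
/// `sleep` is injected so the schedule can be exercised without actually waiting.
fn deliver_with_retries<S, W>(mut send: S, mut sleep: W, max_retries: u32) -> bool
where
    S: FnMut() -> bool,
    W: FnMut(Duration),
{
    if send() {
        return true;
    }
    for attempt in 0..max_retries {
        sleep(backoff_delay(attempt));
        if send() {
            return true;
        }
    }
    false
}
```

In a blocking context `std::thread::sleep` can be passed as `sleep`; an async server would use its runtime's timer instead.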
---
## Troubleshooting
**`CUDA error: no kernel image available for execution on the device`**
→ The image contains kernels for sm\_75 only, so this error means the GPU in use has a
different compute capability. Changing `--build-arg CUDA_VERSION=...` does not affect the
target architecture; the build's `CMAKE_CUDA_ARCHITECTURES` value has to be changed to match the GPU.

**`libcuda.so.1: cannot open shared object file`**
→ NVIDIA Container Toolkit is not installed, or `--gpus all` / `deploy.resources` is missing
from the run configuration.

**Model not found at `/models/ggml-large-v3.bin`**
→ The server fails immediately on start when the model file is missing, for example because
the first-run download did not complete. Download the model manually:
```bash
docker run --rm -v whisper-models:/models curlimages/curl:latest \
-L -o /models/ggml-large-v3.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```
Then restart the server.

**Out-of-memory on large-v3**
→ The large-v3 GGML model at F16 uses ~3.1 GB of VRAM, which leaves headroom on an 8 GB card.
If other GPU workloads run in parallel, switch to `ggml-medium.bin` (~1.5 GB) and point
`WHISPER_MODEL_PATH` at it.