# whisper-rtx2080

Async REST API for GPU-accelerated speech transcription, built in **Rust** (Axum) on top of
**whisper.cpp** compiled with CUDA for the **NVIDIA RTX 2080** (Turing, sm\_75, 8 GB VRAM).
No Python.

---

## Requirements

| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) | `nvidia-docker2` on the host |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |

---

## Quick start

```bash
# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run, ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs
open http://localhost:8080/docs
```

---

## API reference

| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Submit audio for transcription |
| `GET` | `/jobs/{id}` | Poll job status + result |
| `GET` | `/jobs/{id}/stream` | SSE: live progress + completion event |
| `DELETE` | `/jobs/{id}` | Cancel a queued or running job |
| `GET` | `/health` | GPU info + queue depth |
| `GET` | `/docs` | Swagger UI |
| `GET` | `/openapi.json` | Raw OpenAPI 3.0 spec |

### POST /jobs: multipart fields

| Field | Required | Description |
|---|---|---|
| `audio` | ✅ | Audio file in any format ffmpeg understands; no size limit |
| `language` | ❌ | ISO 639-1 source language (e.g. `en`). Auto-detected when absent. |
| `task` | ❌ | `transcribe` (default) or `translate` (output always English) |
| `webhook_url` | ❌ | URL to POST the completed job JSON to on completion |
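
For clients that cannot shell out to `curl`, the multipart body for `POST /jobs` can be assembled by hand. The sketch below is one way to do it with the Rust standard library only. `build_job_body` and the boundary value are hypothetical names chosen for illustration, not part of this server's code, and sending the `audio` part as `application/octet-stream` assumes the server detects the format itself via ffmpeg. The request must also carry the header `Content-Type: multipart/form-data; boundary=<boundary>`, and the boundary string must not occur inside the audio bytes.

```rust
/// Build a `multipart/form-data` body for `POST /jobs` (illustrative helper,
/// not part of the server). `fields` holds the optional text fields from the
/// table above, e.g. `("language", "en")` or `("task", "translate")`.
fn build_job_body(
    boundary: &str,
    file_name: &str,
    audio: &[u8],
    fields: &[(&str, &str)],
) -> Vec<u8> {
    let mut body = Vec::new();

    // Plain text fields: one part each.
    for (name, value) in fields {
        body.extend_from_slice(
            format!(
                "--{boundary}\r\nContent-Disposition: form-data; name=\"{name}\"\r\n\r\n{value}\r\n"
            )
            .as_bytes(),
        );
    }

    // The required `audio` part: headers, blank line, then the raw bytes.
    body.extend_from_slice(
        format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"audio\"; filename=\"{file_name}\"\r\nContent-Type: application/octet-stream\r\n\r\n"
        )
        .as_bytes(),
    );
    body.extend_from_slice(audio);

    // Closing delimiter: the boundary followed by two dashes.
    body.extend_from_slice(format!("\r\n--{boundary}--\r\n").as_bytes());
    body
}
```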

### Job result JSON

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 142.3,
  "progress": 100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 2.4,
      "text": " Hello, world.",
      "words": []
    }
  ],
  "error": null,
  "created_at": "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
```

### SSE events (`GET /jobs/{id}/stream`)

```
event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
```
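
A client consuming this stream has to split it on blank lines and read the `event:` and `data:` fields of each block. The following is a minimal sketch of that parsing using only the standard library. `SseEvent`, `parse_sse` and `percent` are hypothetical names, the parser ignores every SSE field other than `event:` and `data:`, and `percent` scans the payload as text on the assumption that it has the shape shown above; a real client would more likely use a JSON library.

```rust
/// One parsed Server-Sent Event: the `event:` name and the `data:` payload.
#[derive(Debug, PartialEq)]
struct SseEvent {
    event: String,
    data: String,
}

/// Split a raw SSE body into events. Events are separated by a blank line;
/// several `data:` lines within one event are joined with '\n'.
fn parse_sse(raw: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut event = String::new();
    let mut data: Vec<&str> = Vec::new();

    // Chain an empty line so the last event is flushed even if the input
    // does not end with a blank line.
    for line in raw.lines().chain(std::iter::once("")) {
        if line.is_empty() {
            if !event.is_empty() || !data.is_empty() {
                events.push(SseEvent {
                    event: std::mem::take(&mut event),
                    data: data.join("\n"),
                });
                data.clear();
            }
        } else if let Some(value) = line.strip_prefix("event:") {
            event = value.trim_start().to_string();
        } else if let Some(value) = line.strip_prefix("data:") {
            data.push(value.trim_start());
        }
        // Comment lines (starting with ':') and other fields are skipped.
    }
    events
}

/// Extract the integer after `"percent":` from a progress payload.
/// Returns `None` when the payload has no such key.
fn percent(data: &str) -> Option<u8> {
    let rest = data.split("\"percent\":").nth(1)?;
    let digits: String = rest
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}
```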

---

## Build arguments

| ARG | Default | Notes |
|---|---|---|
| `CUDA_VERSION` | `12.4.1` | Passed to the NVIDIA base image tag |
| `CUDNN_TAG` | `cudnn` | Must match an existing base image tag: `cudnn` for CUDA 12.4.x · `cudnn8` for CUDA 12.1 and 11.x |
| `UBUNTU_VERSION` | `22.04` | Ubuntu base |

### Custom CUDA version examples

```bash
# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

---

## Runtime environment variables

All can be overridden with `-e` or in `docker-compose.yml`:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | TCP port the server listens on |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
| `DATA_DIR` | `/data` | Directory for persisted job state (mount a volume here) |
| `WHISPER_MODEL` | `large-v3` | Model name (for /health reporting) |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to the GGML model file |
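
To make the defaulting behaviour concrete, here is a sketch of how a Rust service could resolve these variables. It is not taken from the server's source: `Config`, `from_lookup` and `from_env` are hypothetical names, and rejecting an unparsable `PORT` is an assumption about error handling. `RUST_LOG` is left out because it is normally read by the logging library rather than by application code. Taking a lookup function instead of reading the environment directly keeps the defaults testable without touching process state.

```rust
use std::path::PathBuf;

/// Runtime settings mirroring the variables and defaults in the table above.
#[derive(Debug, PartialEq)]
struct Config {
    port: u16,
    data_dir: PathBuf,
    model: String,
    model_path: PathBuf,
}

impl Config {
    /// Build a config from any key -> value lookup.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        // PORT must be a valid u16 when present; otherwise fall back to 8080.
        let port = match get("PORT") {
            Some(v) => v
                .parse::<u16>()
                .map_err(|e| format!("invalid PORT {v:?}: {e}"))?,
            None => 8080,
        };
        Ok(Config {
            port,
            data_dir: get("DATA_DIR").unwrap_or_else(|| "/data".into()).into(),
            model: get("WHISPER_MODEL").unwrap_or_else(|| "large-v3".into()),
            model_path: get("WHISPER_MODEL_PATH")
                .unwrap_or_else(|| "/models/ggml-large-v3.bin".into())
                .into(),
        })
    }

    /// Read the same settings from the process environment.
    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```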

---

## RTX 2080 optimisation notes

| Setting | Value | Reason |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES` | `75` | Compiles kernels **only for sm\_75**: smaller binary, faster build |
| `GGML_CUDA_FORCE_MMQ` | `ON` | Forces the custom quantised matrix-multiply (MMQ) kernels instead of cuBLAS; intended for Q4/Q5/Q8 models on Turing |
| `GGML_CUDA_GRAPHS` | `ON` | CUDA Graph capture eliminates per-call CPU→GPU dispatch overhead (requires sm\_75+) |
| `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Flash Attention tile kernels for all quantisation types |
| `GGML_CUDA_F16` | `ON` | FP16 arithmetic via Turing Tensor Cores |
| `flash_attn` (runtime) | `true` | Enabled in `WhisperContextParameters`; tile-based, works on sm\_75 |
| `beam_size` | `5` | Best accuracy/speed balance |
| `temperature` | `0.0` | Deterministic, fastest decode path |
| `n_threads` | host CPU count | CPU-side pre/post processing |

> **bfloat16 is intentionally not enabled**: it requires Ampere (sm\_80+).
>
> **flash\_attn and DTW token timestamps are mutually exclusive**: the server enables
> flash\_attn and omits DTW to maximise throughput.

---

## Webhooks

If `webhook_url` is set on a job, the server sends a `POST` with the completed job JSON to that URL. Failed deliveries are retried:

- Up to **5 retries** with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted, the failure is logged and the delivery is dropped
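
The retry schedule above doubles the delay on each attempt, which a few lines of Rust can express. This is a sketch of the arithmetic only, not the server's delivery code: `MAX_RETRIES`, `retry_delay` and `retry_schedule` are hypothetical names, and it assumes each delay is the wait before the corresponding retry. Under that assumption a receiver that stays down is given 1 + 2 + 4 + 8 + 16 = 31 seconds of waiting in total before the delivery is dropped.

```rust
use std::time::Duration;

/// Maximum number of retries, as stated in the list above.
const MAX_RETRIES: u32 = 5;

/// Delay before retry number `attempt` (0-based): 1 s, 2 s, 4 s, 8 s, 16 s.
/// Returns `None` once the retry budget is used up.
fn retry_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    // Doubling per attempt: 2^attempt seconds.
    Some(Duration::from_secs(1u64 << attempt))
}

/// The full schedule, handy for checking the worst-case total wait.
fn retry_schedule() -> Vec<Duration> {
    (0..MAX_RETRIES).filter_map(retry_delay).collect()
}
```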

---

## Troubleshooting

**`CUDA error: no kernel image available for execution on the device`**
→ The binary was compiled for an architecture the GPU cannot run. The image is always
compiled for sm\_75 only, so first check that the GPU is a Turing card. If it is, rebuild
with `--build-arg CUDA_VERSION=...` set to a version your driver supports.

**`libcuda.so.1: cannot open shared object file`**
→ NVIDIA Container Toolkit is not installed, or `--gpus all` / `deploy.resources` is missing.

**Model not found at `/models/ggml-large-v3.bin`**
→ If the model file is missing (for example, because the first-start download did not complete), the server fails immediately at startup. Download the model manually:
```bash
docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```
Then restart the server.

**Out-of-memory on large-v3**
→ The large-v3 GGML model at F16 uses ~3.1 GB VRAM, which leaves headroom on an 8 GB card.
If other GPU workloads run in parallel, switch to `ggml-medium.bin` (~1.5 GB).