# whisper-rtx2080

Async REST API for GPU-accelerated speech transcription, built in **Rust** (Axum) on top of
**whisper.cpp** compiled with CUDA for the **NVIDIA RTX 2080** (Turing, sm\_75, 8 GB VRAM).
No Python.

---

## Requirements

| Dependency | Notes |
|---|---|
| Docker ≥ 20.10 | |
| [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) | `nvidia-container-toolkit` on the host (supersedes the deprecated `nvidia-docker2` package) |
| Host NVIDIA driver ≥ 525 | Required for CUDA 12.x |
| GGML model file | Downloaded automatically on first start |

---

## Quick start

```bash
# Build (CUDA 12.4, sm_75, large-v3 model)
docker compose build

# Start the server (model downloads on first run — ~3 GB)
docker compose up -d

# Check it's running
curl http://localhost:8080/health

# Transcribe a file
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/speech.mp3" | jq .
# → { "job_id": "550e8400-..." }

# Poll for result
curl http://localhost:8080/jobs/550e8400-... | jq .

# Or stream progress in real time
curl -N http://localhost:8080/jobs/550e8400-.../stream

# Browse the interactive API docs
xdg-open http://localhost:8080/docs   # on macOS: open http://localhost:8080/docs
```

---

## API reference

| Method | Path | Description |
|---|---|---|
| `POST` | `/jobs` | Submit audio for transcription |
| `GET` | `/jobs/{id}` | Poll job status + result |
| `GET` | `/jobs/{id}/stream` | SSE: live progress + completion event |
| `DELETE` | `/jobs/{id}` | Cancel a queued or running job |
| `GET` | `/health` | GPU info + queue depth |
| `GET` | `/docs` | Swagger UI |
| `GET` | `/openapi.json` | Raw OpenAPI 3.0 spec |

### POST /jobs — multipart fields

| Field | Required | Description |
|---|---|---|
| `audio` | ✅ | Audio file — any format ffmpeg understands; no size limit |
| `language` | ❌ | ISO 639-1 source language (e.g. `en`). Auto-detected when absent. |
| `task` | ❌ | `transcribe` (default) or `translate` (output always English) |
| `webhook_url` | ❌ | URL that receives a `POST` with the job JSON once the job completes |
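
To make the defaults in the table concrete, here is a minimal std-only sketch of how the optional `task` and `language` fields could be resolved after the multipart body has been read. The names `Task`, `resolve_task`, and `looks_like_iso639_1` are hypothetical and are not taken from the server's source.

```rust
use std::str::FromStr;

/// Value of the optional `task` multipart field (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Task {
    /// Transcribe in the source language (the documented default).
    #[default]
    Transcribe,
    /// Translate the speech; the output is always English.
    Translate,
}

impl FromStr for Task {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "transcribe" => Ok(Task::Transcribe),
            "translate" => Ok(Task::Translate),
            other => Err(format!("unknown task: {other}")),
        }
    }
}

/// Resolve the optional `task` field, falling back to the default when absent.
pub fn resolve_task(field: Option<&str>) -> Result<Task, String> {
    match field {
        Some(value) => value.parse(),
        None => Ok(Task::default()),
    }
}

/// Loose shape check for a `language` value: exactly two ASCII lowercase letters.
/// It does not verify that Whisper supports the language.
pub fn looks_like_iso639_1(code: &str) -> bool {
    code.len() == 2 && code.bytes().all(|b| b.is_ascii_lowercase())
}
```

With this sketch, `resolve_task(None)` yields `Task::Transcribe`, and an unrecognised value is rejected with an error instead of being silently ignored.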

### Job result JSON

```json
{
  "id":            "550e8400-e29b-41d4-a716-446655440000",
  "status":        "done",
  "language":      "en",
  "task":          "transcribe",
  "duration_secs": 142.3,
  "progress":      100,
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end":   2.4,
      "text":  " Hello, world.",
      "words": []
    }
  ],
  "error":        null,
  "created_at":   "2026-05-05T21:00:00Z",
  "completed_at": "2026-05-05T21:02:13Z"
}
```
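
A client usually wants the whole transcript and not individual segments. The sketch below assumes that `start` and `end` are seconds and that each segment's `text` carries its own leading space, as in the example above. `Segment` and `full_text` are illustrative names, and JSON decoding is left out so the block needs only the standard library.

```rust
/// One entry of the `segments` array (the `words` array is omitted in this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub index: usize,
    /// Assumed to be seconds from the start of the audio.
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Concatenate segment texts in `index` order into a single transcript string.
/// Segments are joined without a separator because each text starts with a space.
pub fn full_text(segments: &[Segment]) -> String {
    let mut ordered: Vec<&Segment> = segments.iter().collect();
    ordered.sort_by_key(|s| s.index);
    let joined: String = ordered.iter().map(|s| s.text.as_str()).collect();
    joined.trim().to_string()
}
```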

### SSE events (`GET /jobs/{id}/stream`)

```
event: progress
data: {"type":"progress","percent":42}

event: progress
data: {"type":"progress","percent":91}

event: done
data: {"type":"done","job":{...full job object...}}
```
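
For clients that do not use an SSE library, the frames above can be split with a few lines of std-only Rust. This is a minimal sketch, not the server's code: `SseEvent` and `parse_sse` are hypothetical names, only the `event:` and `data:` fields are handled, and the `data` payload is returned as a raw string for a JSON library of your choice.

```rust
/// One parsed Server-Sent Event: the `event:` name and the joined `data:` payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SseEvent {
    pub event: String,
    pub data: String,
}

/// Strip the single optional space that may follow the colon of a field.
fn field_value(rest: &str) -> &str {
    rest.strip_prefix(' ').unwrap_or(rest)
}

/// Parse SSE text into events. An event is dispatched at each blank line;
/// an unterminated trailing event is discarded, as the SSE specification requires.
pub fn parse_sse(input: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut name = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in input.lines() {
        if line.is_empty() {
            // Blank line: dispatch the buffered event if it carries any data.
            if !data.is_empty() {
                let event = if name.is_empty() {
                    "message".to_string() // default event type per the SSE spec
                } else {
                    name.clone()
                };
                events.push(SseEvent { event, data: data.join("\n") });
            }
            name.clear();
            data.clear();
        } else if let Some(rest) = line.strip_prefix("event:") {
            name = field_value(rest).to_string();
        } else if let Some(rest) = line.strip_prefix("data:") {
            data.push(field_value(rest));
        }
        // Comment lines (starting with ':') and other fields are ignored here.
    }
    events
}
```

A caller would stop reading the stream once it sees an event named `done`.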

---

## Build arguments

| ARG | Default | Notes |
|---|---|---|
| `CUDA_VERSION` | `12.4.1` | Passed to the NVIDIA base image tag |
| `CUDNN_TAG` | `cudnn` | `cudnn` for recent CUDA 12.x (12.4+) · `cudnn8` for CUDA 11.x and early 12.x such as 12.1; check that the resulting base image tag exists |
| `UBUNTU_VERSION` | `22.04` | Ubuntu base |

### Custom CUDA version examples

```bash
# CUDA 12.1
docker build \
  --build-arg CUDA_VERSION=12.1.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  -t whisper-rtx2080:cu121 .

# CUDA 11.8 (legacy)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

---

## Runtime environment variables

All of these can be overridden with `docker run -e` or in `docker-compose.yml`:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | TCP port the server listens on |
| `RUST_LOG` | `info` | Log level (`trace`, `debug`, `info`, `warn`, `error`) |
| `DATA_DIR` | `/data` | Directory for persisted job state (mount a volume here) |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to the GGML model file |
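
As an illustration of how these variables and their defaults fit together, here is a minimal std-only sketch of a config loader. The `Config` struct and its methods are hypothetical and may not match the server's actual code. `RUST_LOG` is left out because it is normally consumed by the logging crate directly.

```rust
use std::env;
use std::path::PathBuf;

/// Runtime settings mirroring the environment variables in the table above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Config {
    pub port: u16,
    pub data_dir: PathBuf,
    pub model_name: String,
    pub model_path: PathBuf,
}

impl Config {
    /// Build a config from any key lookup, applying the documented defaults.
    /// Taking a closure keeps this testable without touching the real environment.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let port = match get("PORT") {
            Some(raw) => raw
                .parse::<u16>()
                .map_err(|e| format!("invalid PORT {raw:?}: {e}"))?,
            None => 8080,
        };
        Ok(Config {
            port,
            data_dir: get("DATA_DIR").unwrap_or_else(|| "/data".into()).into(),
            model_name: get("WHISPER_MODEL").unwrap_or_else(|| "large-v3".into()),
            model_path: get("WHISPER_MODEL_PATH")
                .unwrap_or_else(|| "/models/ggml-large-v3.bin".into())
                .into(),
        })
    }

    /// Read the settings from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Failing fast on an unparsable `PORT` is a design choice of this sketch; silently falling back to `8080` would hide a misconfiguration.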

---

## RTX 2080 optimisation notes

| Setting | Value | Reason |
|---|---|---|
| `CMAKE_CUDA_ARCHITECTURES` | `75` | Compiles kernels **only for sm\_75** — smaller binary, faster build |
| `GGML_CUDA_FORCE_MMQ` | `ON` | Forces the custom quantised matrix-multiply (MMQ) kernels instead of cuBLAS; best for Q4/Q5/Q8 models on Turing |
| `GGML_CUDA_GRAPHS` | `ON` | CUDA Graph capture → reduces per-call CPU→GPU launch overhead (requires sm\_75+) |
| `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Flash Attention tile kernels for all quantisation types |
| `GGML_CUDA_F16` | `ON` | FP16 arithmetic via Turing Tensor Cores |
| `flash_attn` (runtime) | `true` | Enabled in `WhisperContextParameters` — tile-based, works on sm\_75 |
| `beam_size` | `5` | Best accuracy/speed balance |
| `temperature` | `0.0` | Deterministic, fastest decode path |
| `n_threads` | host CPU count | CPU-side pre/post processing |

> **bfloat16 is intentionally not enabled** — that requires Ampere (sm\_80+).
>
> **flash\_attn and DTW token timestamps are mutually exclusive** — the server enables
> flash\_attn and omits DTW to maximise throughput.

---

## Webhooks

If `webhook_url` is set on a job, the server will `POST` the completed job JSON to that URL:
- Up to **5 retries** with exponential backoff: 1 s → 2 s → 4 s → 8 s → 16 s
- After all retries are exhausted, the failure is logged and the notification is dropped
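
The retry schedule above can be sketched in std-only Rust as follows. This assumes the listed delays are the waits before each of the 5 retries, so a delivery is attempted at most 6 times. `backoff_delays` and `deliver_with_retry` are hypothetical names, and the HTTP call and the sleep are passed in as closures so the logic stays independent of any HTTP client or async runtime.

```rust
use std::time::Duration;

/// Delays before each retry: 1 s, 2 s, 4 s, ... doubling every time.
pub fn backoff_delays(max_retries: u32) -> Vec<Duration> {
    (0..max_retries)
        // checked_shl avoids a shift overflow for unrealistically large retry counts.
        .map(|attempt| Duration::from_secs(1u64.checked_shl(attempt).unwrap_or(u64::MAX)))
        .collect()
}

/// Call `send` once, then retry after each backoff delay until it succeeds
/// or the schedule is exhausted. Returns the last error on failure.
pub fn deliver_with_retry<E>(
    mut send: impl FnMut() -> Result<(), E>,
    mut sleep: impl FnMut(Duration),
    max_retries: u32,
) -> Result<(), E> {
    let mut last = send();
    for delay in backoff_delays(max_retries) {
        if last.is_ok() {
            break;
        }
        sleep(delay);
        last = send();
    }
    last
}
```

In a real server, `sleep` would be the runtime's timer and the caller would log the returned error before dropping the notification.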

---

## Troubleshooting

**`CUDA error: no kernel image available for execution on the device`**
→ The image is compiled for sm\_75 only, so this error means the GPU in use has a
different compute capability. Changing `--build-arg CUDA_VERSION=...` does not change
the target architecture; rebuild with `CMAKE_CUDA_ARCHITECTURES` set for your GPU instead.

**`libcuda.so.1: cannot open shared object file`**
→ The NVIDIA Container Toolkit is not installed, or the container was started without GPU
access (`--gpus all` for `docker run`, a `deploy.resources` GPU reservation in Compose).

**Model not found at `/models/ggml-large-v3.bin`**
→ If the automatic first-start download did not run or did not finish, the server fails
immediately at startup. Download the model manually:
```bash
docker run --rm -v whisper-models:/models curlimages/curl:latest \
  -L -o /models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
```
Then restart the server.

**Out-of-memory on large-v3**
→ The large-v3 GGML model at F16 uses ~3.1 GB VRAM, which leaves headroom on an 8 GB card.
If other GPU workloads run in parallel, switch to `ggml-medium.bin` (~1.5 GB) by pointing
`WHISPER_MODEL_PATH` at it.