# Architecture

## Overview

`whisper-server` is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps [whisper.cpp](https://github.com/ggerganov/whisper.cpp) (via `whisper-rs`), compiled with CUDA support targeting the NVIDIA RTX 2080 (sm_75).

Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk by chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).

---

## Component Map

```
              HTTP client
                   │
          ┌────────▼────────┐
          │   Axum Router   │  (tower-http: CORS, tracing)
          │ /jobs           │
          │ /jobs/:id       │
          │ /jobs/:id/stream│
          │ /health         │
          │ /docs (Swagger) │
          └────────┬────────┘
                   │ AppState (Arc)
      ┌────────────┼────────────────┐
      │            │                │
┌─────▼──────┐ ┌───▼─────┐  ┌───────▼──────┐
│  Storage   │ │ ProgReg │  │ job_tx (mpsc)│
│ (disk JSON)│ │(DashMap)│  └───────┬──────┘
└─────▲──────┘ └───▲─────┘          │
      │            │         ┌──────▼──────────────────────┐
      │            │         │  Tokio worker task (run())  │
      │            │         │  - dequeues job IDs         │
      │            │         │  - decodes audio (ffmpeg)   │
      │            │         │  - runs silencedetect       │
      │            │         │  - chunks PCM               │
      │            │         │  - sends TranscribeRequest  │
      │            │         │  - offsets timestamps       │
      │            │         │  - persists result          │
      │            │         │  - fires webhook            │
      │            └─────────┤                             │
      └───────────────────── └──────┬──────────────────────┘
                                    │ std::sync::mpsc
                             ┌──────▼───────────────┐
                             │  whisper-gpu thread  │
                             │ (OS thread, non-Send)│
                             │  owns WhisperContext │
                             │  runs CUDA inference │
                             └──────────────────────┘
```

---

## Source Files

| File | Responsibility |
|------|---------------|
| `src/main.rs` | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
| `src/models.rs` | All data types: `Job`, `Segment`, `Word`, `SsePayload`, `JobStatus` |
| `src/error.rs` | `AppError` enum → HTTP status codes via `IntoResponse` |
| `src/storage.rs` | File-backed job store (one JSON file per job UUID) |
| `src/transcriber.rs` | Owns `WhisperContext`; sets all inference parameters; decodes output |
| `src/worker.rs` | Audio pipeline: silence detection, chunking, progress, job lifecycle |
| `src/webhook.rs` | Fire-and-forget POST with exponential backoff (5 retries) |
| `src/routes/mod.rs` | Router assembly; disables body limit on POST /jobs |
| `src/routes/jobs.rs` | Handlers: submit, get, SSE stream, delete/cancel |
| `src/routes/health.rs` | Health check + GPU info via `nvidia-smi` |

---

## Threading Model

The `WhisperContext` type from `whisper-rs` is `Send` but **not `Sync`**: it can be owned by one thread, but it cannot be shared across threads simultaneously. The design therefore uses a **two-layer concurrency model**:

### Layer 1: Tokio async runtime

All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.

### Layer 2: Dedicated OS thread (`whisper-gpu`)

A single non-async OS thread owns the `WhisperContext` for its entire lifetime. The thread loops on a `std::sync::mpsc::Receiver`, processes one inference at a time, and sends the result back through a `oneshot::Sender`.

### Communication

```
Tokio task            std::mpsc         GPU thread
─────────                               ──────────
TranscribeRequest ────────────────────► transcriber.transcribe()
                                               │
oneshot::Receiver ◄──────────────────── oneshot::Sender (Vec, lang)
```

This ensures:

- GPU inference is never interleaved (one job at a time on the GPU)
- The async runtime is never blocked by long-running GPU work
- `WhisperContext` never needs to be `Sync`

---

## Job Lifecycle

```
POST /jobs
  │
  ├─ Stream audio to disk → /.audio
  ├─ Create Job{status: Queued} in storage
  ├─ Pre-create broadcast channel in ProgressRegistry
  ├─ Send job ID into job_tx
  └─ Return 202 { job_id }

Worker picks up job_id
  │
  ├─ Load job from storage
  ├─ Mark Running → save
  ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
  ├─ detect_silence_midpoints (ffmpeg silencedetect)
  ├─ snap_to_silence → cut points
  ├─ to_chunk_ranges → [(start, end), ...]
  │
  └─ For each chunk:
       ├─ slice PCM
       ├─ trim_trailing_silence
       ├─ broadcast Progress{percent, chunk, total}
       ├─ save snapshot to disk
       ├─ send TranscribeRequest → GPU thread
       ├─ await oneshot reply
       ├─ offset timestamps by chunk_start
       └─ accumulate segments
  │
  ├─ renumber segment indices
  ├─ broadcast Done / Error
  ├─ save final Job to disk
  ├─ delete .audio file
  └─ (optional) fire webhook
```

---

## Audio Pre-Processing Pipeline

```
Input file (any format)
        │
        ▼
ffmpeg silencedetect  (runs on original file; captures full dynamic range
                       for silence detection)
        │
        ▼
Silence midpoints → snap_to_silence() → cut points at ~60s intervals
        │
        ▼
ffmpeg decode: → 16kHz, mono, f32le PCM
        │
        ▼
Per chunk: pcm[start_sample..end_sample]
        │
        ▼
trim_trailing_silence() → removes silence tail, keeps 0.5s padding
        │
        ▼
WhisperContext::full() → Vec
        │
        ▼
Offset all timestamps += chunk_start_secs
```

---

## Persistence

Jobs are persisted as pretty-printed JSON files in `DATA_DIR`:

```
/data/
  .json   ← job state (updated on every progress snapshot)
  .audio  ← raw upload (deleted after transcription)
```

On startup, `recover_interrupted_jobs()` scans all `.json` files and marks any `Running` jobs as `Failed`, since they were killed mid-transcription.

There is no database. The file-per-job approach is intentional: it is trivially inspectable, survives crashes without WAL complexity, and handles thousands of jobs without requiring a separate database service.

---

## Progress & SSE

`ProgressRegistry` is a `DashMap` that holds one broadcast channel sender per job. An entry is created when a job is submitted (before the worker starts), so clients that connect early don't miss the first events.

SSE events are typed:

| Event name | Payload shape |
|-------------|--------------|
| `progress` | `{ "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M }` |
| `done` | `{ "type": "done", "job": { ...full Job object... } }` |
| `error` | `{ "type": "error", "message": "..." }` |

If a client connects to `/jobs/:id/stream` for an already-finished job, it receives a single `done` event immediately.
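
The event table above can be sketched as a small Rust enum. This is only an illustration under stated assumptions, not the server's actual `SsePayload` definition: the variant fields are taken from the payload shapes in the table, while `job_json`, `escape_json`, and `to_sse_frame` are hypothetical names, and the JSON is formatted by hand to keep the sketch dependency-free (the real code may well use a serialization library instead).

```rust
/// Hypothetical stand-in for the server's `SsePayload` type.
#[derive(Debug, Clone, PartialEq)]
enum SsePayload {
    Progress { percent: u8, chunk: usize, chunks_total: usize },
    /// Assumption: the job arrives here as already-serialized, single-line JSON.
    Done { job_json: String },
    Error { message: String },
}

/// Escapes characters that must not appear raw inside a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl SsePayload {
    /// SSE event name, matching the "Event name" column.
    fn event_name(&self) -> &'static str {
        match self {
            SsePayload::Progress { .. } => "progress",
            SsePayload::Done { .. } => "done",
            SsePayload::Error { .. } => "error",
        }
    }

    /// JSON body, matching the "Payload shape" column.
    fn to_json(&self) -> String {
        match self {
            SsePayload::Progress { percent, chunk, chunks_total } => format!(
                "{{\"type\":\"progress\",\"percent\":{},\"chunk\":{},\"chunks_total\":{}}}",
                percent, chunk, chunks_total
            ),
            SsePayload::Done { job_json } => {
                format!("{{\"type\":\"done\",\"job\":{}}}", job_json)
            }
            SsePayload::Error { message } => {
                format!("{{\"type\":\"error\",\"message\":\"{}\"}}", escape_json(message))
            }
        }
    }

    /// One SSE frame: an `event:` line, a `data:` line, then a blank line.
    fn to_sse_frame(&self) -> String {
        format!("event: {}\ndata: {}\n\n", self.event_name(), self.to_json())
    }
}
```

Calling `to_sse_frame()` on a `Progress` value yields an `event: progress` line followed by a `data:` line with the JSON body. Keeping the `type` field inside the payload as well as in the SSE event name lets clients that only read `data:` still tell the events apart.
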
The broadcast channel has a buffer of 64 events. When a slow SSE client lags behind, the events it missed are silently dropped: its receiver yields `Err(RecvError::Lagged)`, which is filtered to `None` in the SSE stream adapter.

---

## Webhook

When `webhook_url` is provided at submission, the server POSTs the complete `Job` JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget with exponential backoff: 1s, 2s, 4s, 8s, 16s (max 5 attempts). Failures after all retries are logged and discarded.

---

## Docker / Build

### Multi-stage Dockerfile

| Stage | Base image | Purpose |
|-------|-----------|---------|
| `builder` | `nvidia/cuda:-cudnn-devel-ubuntu` | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
| `runtime` | `nvidia/cuda:-cudnn-runtime-ubuntu` | Minimal runtime; only `ffmpeg` and the binary |

`whisper-rs` bundles the whisper.cpp source inside the crate (`~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/`). There is no external clone step, so the whisper.cpp version is pinned by the crate version rather than fetched separately at build time.

### CUDA build flags (set as ENV in builder stage)

```
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75    # RTX 2080 = Turing = sm_75
GGML_CUDA_FORCE_MMQ=ON         # matrix-multiply quantized kernels
GGML_CUDA_GRAPHS=ON            # CUDA graph capture for repeated patterns
GGML_CUDA_FA_ALL_QUANTS=ON     # flash attention for all quantisation types
GGML_CUDA_F16=ON               # half-precision accumulation
```

### CI (Gitea Actions)

- Triggers on `push` to `main` and semver tags (`v*`)
- PRs: build only, no push
- Tags produced: `latest`, `sha-`, semver components on tags
- Build cache stored in registry as `:buildcache` tag
- `CUDA_VERSION` and `UBUNTU_VERSION` overridable via repo Variables