Architecture
Overview
whisper-server is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps whisper.cpp (via whisper-rs) compiled with CUDA support targeting NVIDIA RTX 2080 (sm_75).
Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk-by-chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).
Component Map
                  HTTP client
                       │
              ┌────────▼────────┐
              │   Axum Router   │  (tower-http: CORS, tracing)
              │  /jobs          │
              │  /jobs/:id      │
              │ /jobs/:id/stream│
              │  /health        │
              │  /docs (Swagger)│
              └────────┬────────┘
                       │ AppState (Arc)
          ┌────────────┼────────────────┐
          │            │                │
    ┌─────▼──────┐ ┌───▼─────┐  ┌───────▼──────┐
    │  Storage   │ │ ProgReg │  │ job_tx (mpsc)│
    │ (disk JSON)│ │(DashMap)│  └───────┬──────┘
    └─────▲──────┘ └───▲─────┘          │
          │            │         ┌──────▼──────────────────────┐
          │            │         │  Tokio worker task (run())  │
          │            │         │  - dequeues job IDs         │
          │            │         │  - decodes audio (ffmpeg)   │
          │            │         │  - runs silencedetect       │
          │            │         │  - chunks PCM               │
          │            │         │  - sends TranscribeRequest  │
          │            │         │  - offsets timestamps       │
          │            │         │  - persists result          │
          │            │         │  - fires webhook            │
          │            └─────────┤                             │
          └───────────────────── └──────┬──────────────────────┘
                                        │ std::sync::mpsc
                                 ┌──────▼───────────────┐
                                 │  whisper-gpu thread  │
                                 │ (OS thread, non-Send)│
                                 │  owns WhisperContext │
                                 │  runs CUDA inference │
                                 └──────────────────────┘
Source Files
| File | Responsibility |
|---|---|
| src/main.rs | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
| src/models.rs | All data types: Job, Segment, Word, SsePayload, JobStatus |
| src/error.rs | AppError enum → HTTP status codes via IntoResponse |
| src/storage.rs | File-backed job store (one JSON file per job UUID) |
| src/transcriber.rs | Owns WhisperContext; sets all inference parameters; decodes output |
| src/worker.rs | Audio pipeline: silence detection, chunking, progress, job lifecycle |
| src/webhook.rs | Fire-and-forget POST with exponential backoff (5 retries) |
| src/routes/mod.rs | Router assembly; disables body limit on POST /jobs |
| src/routes/jobs.rs | Handlers: submit, get, SSE stream, delete/cancel |
| src/routes/health.rs | Health check + GPU info via nvidia-smi |
Threading Model
whisper.cpp's WhisperContext is Send but not Sync — it cannot be shared across threads simultaneously. The design uses a two-layer concurrency model:
Layer 1 — Tokio async runtime
All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.
Layer 2 — Dedicated OS thread (whisper-gpu)
A single non-async OS thread owns the WhisperContext for its entire lifetime. The thread loops on a std::sync::mpsc::Receiver<TranscribeRequest>, processes one inference at a time, and sends the result back through a oneshot::Sender.
Communication
    Tokio task              std::mpsc           GPU thread
    ──────────                                  ──────────
    TranscribeRequest ────────────────────►     transcriber.transcribe()
                                                        │
    oneshot::Receiver ◄────────────────────     oneshot::Sender (Vec<Segment>, lang)
This ensures:
- GPU inference is never interleaved (one job at a time on the GPU)
- The async runtime is never blocked by long-running GPU work
- WhisperContext never needs to be Sync
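
The request/reply pattern above can be sketched with the standard library alone. This is a minimal illustration, not the server's code: `FakeContext`, `spawn_gpu_thread`, `transcribe_blocking` and the field names of `Segment` and `TranscribeRequest` are hypothetical, the fake context stands in for whisper-rs, and a `std::sync::mpsc` channel plays the role of the `oneshot::Sender` so the example needs no async runtime.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the Segment type in src/models.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start_secs: f32,
    pub end_secs: f32,
    pub text: String,
}

/// One unit of work for the GPU thread. `reply` stands in for the
/// oneshot::Sender described above (assumption: std mpsc used instead).
pub struct TranscribeRequest {
    pub pcm: Vec<f32>,
    pub reply: mpsc::Sender<(Vec<Segment>, String)>,
}

/// Placeholder for the model context. This is NOT whisper-rs; it only
/// returns a deterministic fake result so the sketch can run anywhere.
pub struct FakeContext;

impl FakeContext {
    pub fn transcribe(&mut self, pcm: &[f32]) -> (Vec<Segment>, String) {
        let seg = Segment {
            start_secs: 0.0,
            end_secs: pcm.len() as f32 / 16_000.0,
            text: format!("{} samples", pcm.len()),
        };
        (vec![seg], "en".to_string())
    }
}

/// Spawns the dedicated thread. The context is created inside the thread
/// and never leaves it, so it is never shared between threads.
pub fn spawn_gpu_thread() -> mpsc::Sender<TranscribeRequest> {
    let (tx, rx) = mpsc::channel::<TranscribeRequest>();
    let _detached = thread::Builder::new()
        .name("whisper-gpu".into())
        .spawn(move || {
            let mut ctx = FakeContext;
            // One request at a time; the loop ends once every Sender is dropped.
            for req in rx {
                let result = ctx.transcribe(&req.pcm);
                // The requester may have gone away; ignore a failed send.
                let _ = req.reply.send(result);
            }
        })
        .expect("failed to spawn whisper-gpu thread");
    tx
}

/// Blocking helper for the sketch; an async caller would await a oneshot instead.
pub fn transcribe_blocking(
    gpu: &mpsc::Sender<TranscribeRequest>,
    pcm: Vec<f32>,
) -> Option<(Vec<Segment>, String)> {
    let (reply, reply_rx) = mpsc::channel();
    gpu.send(TranscribeRequest { pcm, reply }).ok()?;
    reply_rx.recv().ok()
}
```

Calling `transcribe_blocking(&spawn_gpu_thread(), pcm)` from several threads would still run the fake inferences strictly one after another, because only the spawned thread ever touches the context.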
Job Lifecycle
    POST /jobs
      │
      ├─ Stream audio to disk → <DATA_DIR>/<uuid>.audio
      ├─ Create Job{status: Queued} in storage
      ├─ Pre-create broadcast channel in ProgressRegistry
      ├─ Send job ID into job_tx
      └─ Return 202 { job_id }

    Worker picks up job_id
      │
      ├─ Load job from storage
      ├─ Mark Running → save
      ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
      ├─ detect_silence_midpoints (ffmpeg silencedetect)
      ├─ snap_to_silence → cut points
      ├─ to_chunk_ranges → [(start, end), ...]
      │
      └─ For each chunk:
           ├─ slice PCM
           ├─ trim_trailing_silence
           ├─ broadcast Progress{percent, chunk, total}
           ├─ save snapshot to disk
           ├─ send TranscribeRequest → GPU thread
           ├─ await oneshot reply
           ├─ offset timestamps by chunk_start
           └─ accumulate segments
      │
      ├─ renumber segment indices
      ├─ broadcast Done / Error
      ├─ save final Job to disk
      ├─ delete .audio file
      └─ (optional) fire webhook
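
The two chunking steps named above, snap_to_silence and to_chunk_ranges, might look roughly like the following. The signatures, the `tolerance` parameter and the fallback to the nominal cut are assumptions made for illustration; the actual functions in src/worker.rs may differ.

```rust
/// For each nominal cut (target, 2*target, ...) pick the nearest silence
/// midpoint within `tolerance` seconds, falling back to the nominal cut
/// when no silence is close enough. All times are in seconds.
/// (Hypothetical signature; not taken from the server's source.)
pub fn snap_to_silence(duration: f64, midpoints: &[f64], target: f64, tolerance: f64) -> Vec<f64> {
    let mut cuts: Vec<f64> = Vec::new();
    if target <= 0.0 {
        return cuts; // avoid an endless loop on a nonsensical target
    }
    let mut nominal = target;
    while nominal < duration {
        let nearest = midpoints
            .iter()
            .copied()
            .filter(|&m| (m - nominal).abs() <= tolerance)
            .min_by(|a, b| (a - nominal).abs().total_cmp(&(b - nominal).abs()));
        let cut = nearest.unwrap_or(nominal);
        // Keep cut points strictly increasing and inside the file.
        if cut < duration && cuts.last().map_or(true, |&last| cut > last) {
            cuts.push(cut);
        }
        nominal += target;
    }
    cuts
}

/// Turn cut points (seconds) into half-open sample ranges covering the
/// whole file: [0, c1), [c1, c2), ..., [cn, end).
pub fn to_chunk_ranges(cuts: &[f64], duration: f64, sample_rate: u32) -> Vec<(usize, usize)> {
    let to_sample = |secs: f64| (secs * sample_rate as f64).round() as usize;
    let mut ranges = Vec::with_capacity(cuts.len() + 1);
    let mut start = 0usize;
    for &cut in cuts {
        let end = to_sample(cut);
        ranges.push((start, end));
        start = end;
    }
    ranges.push((start, to_sample(duration)));
    ranges
}
```

With a 150 s file, silence midpoints at 58.0 s, 75.0 s and 121.5 s, a 60 s target and a 5 s tolerance, this sketch cuts at 58.0 s and 121.5 s, giving three chunks that each start and end in silence.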
Audio Pre-Processing Pipeline
    Input file (any format)
        │
        ▼
    ffmpeg silencedetect
    (runs on original file: captures full dynamic range for silence detection)
        │
        ▼
    Silence midpoints → snap_to_silence() → cut points at ~60s intervals
        │
        ▼
    ffmpeg decode: → 16kHz, mono, f32le PCM
        │
        ▼
    Per chunk:
      pcm[start_sample..end_sample]
        │
        ▼
      trim_trailing_silence() → removes silence tail, keeps 0.5s padding
        │
        ▼
      WhisperContext::full() → Vec<Segment>
        │
        ▼
      Offset all timestamps += chunk_start_secs
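
The last per-chunk steps can be illustrated with a small sketch. It assumes silence is detected by a simple amplitude threshold and that `Segment` carries `start` and `end` in seconds; both are guesses for the sake of the example, as are the parameter names.

```rust
/// Sample rate of the decoded PCM (16 kHz mono, as stated above).
pub const SAMPLE_RATE: usize = 16_000;

/// Hypothetical segment shape; the real type lives in src/models.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Drop the silent tail of a chunk but keep `pad_secs` of padding after
/// the last sample whose absolute amplitude exceeds `threshold`.
/// (Assumption: amplitude thresholding; the server may use another test.)
pub fn trim_trailing_silence(pcm: &[f32], threshold: f32, pad_secs: f32) -> &[f32] {
    let pad = (pad_secs * SAMPLE_RATE as f32) as usize;
    match pcm.iter().rposition(|s| s.abs() > threshold) {
        Some(last_loud) => {
            let end = (last_loud + 1 + pad).min(pcm.len());
            &pcm[..end]
        }
        // Entirely silent chunk: keep at most the padding.
        None => &pcm[..pad.min(pcm.len())],
    }
}

/// Shift chunk-relative timestamps so they are relative to the whole file.
pub fn offset_segments(segments: &mut [Segment], chunk_start_secs: f64) {
    for seg in segments {
        seg.start += chunk_start_secs;
        seg.end += chunk_start_secs;
    }
}
```

For a chunk with one second of speech followed by two seconds of silence, `trim_trailing_silence(&pcm, 0.01, 0.5)` keeps 1.5 s of audio, and a segment spanning 0.0 s to 2.5 s inside a chunk that starts at 58.0 s ends up spanning 58.0 s to 60.5 s.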
Persistence
Jobs are persisted as pretty-printed JSON files in DATA_DIR:
    /data/
      <uuid>.json    ← job state (updated on every progress snapshot)
      <uuid>.audio   ← raw upload (deleted after transcription)
On startup, recover_interrupted_jobs() scans all .json files and marks any Running jobs as Failed (they were killed mid-transcription).
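
The recovery rule can be sketched on an in-memory list. Reading and writing the JSON files is left out, and the `Job` fields, the `Done` variant and the error message are assumptions rather than the actual definitions in src/models.rs.

```rust
/// Hypothetical status enum; Queued, Running and Failed are named in this
/// document, Done is an assumed name for the successful terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Queued,
    Running,
    Done,
    Failed,
}

/// Minimal stand-in for the persisted job record.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub id: String,
    pub status: JobStatus,
    pub error: Option<String>,
}

/// Mark every job that was Running when the process died as Failed and
/// return how many were changed. Other states are left untouched.
pub fn recover_interrupted(jobs: &mut [Job]) -> usize {
    let mut recovered = 0;
    for job in jobs.iter_mut().filter(|j| j.status == JobStatus::Running) {
        job.status = JobStatus::Failed;
        job.error = Some("interrupted by server restart".to_string());
        recovered += 1;
    }
    recovered
}
```

In the real server each changed job would also have to be written back to its `<uuid>.json` file so the new status survives the next restart.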
There is no database. The file-per-job approach is intentional: it is trivially inspectable, survives crashes without WAL complexity, and handles thousands of jobs without any additional infrastructure.
Progress & SSE
ProgressRegistry is a DashMap<JobId, broadcast::Sender<ProgressEvent>>. It is populated when a job is submitted (before the worker starts) so clients connecting early don't miss the first events.
SSE events are typed:
| Event name | Payload shape |
|---|---|
| progress | { "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M } |
| done | { "type": "done", "job": { ...full Job object... } } |
| error | { "type": "error", "message": "..." } |
If a client connects to /jobs/:id/stream for an already-finished job, it receives a single done event immediately.
The broadcast channel has a buffer of 64 events. Slow SSE clients that lag behind silently lose events: the receive call returns Err(RecvError::Lagged), which the SSE stream adapter filters to None.
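
To make the wire format concrete, here is a sketch of how a progress event could be rendered as an SSE frame (an `event:` line, one `data:` line per payload line, then a blank line). The helper names and the integer percent formula are assumptions; the server itself presumably relies on Axum's SSE support and a JSON serializer rather than hand-built strings.

```rust
/// Percent of chunks completed, clamped to 0..=100.
/// (Assumed formula; the server may weight chunks differently.)
pub fn progress_percent(chunk: usize, chunks_total: usize) -> u8 {
    if chunks_total == 0 {
        return 0;
    }
    ((chunk.min(chunks_total) * 100) / chunks_total) as u8
}

/// Render one Server-Sent Events frame. Every line of `data` gets its own
/// "data: " prefix, and the frame is terminated by an empty line.
pub fn sse_frame(event: &str, data: &str) -> String {
    let mut out = format!("event: {}\n", event);
    for line in data.lines() {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}

/// Build the "progress" frame with the payload shape from the table above.
pub fn progress_frame(chunk: usize, chunks_total: usize) -> String {
    let json = format!(
        r#"{{"type":"progress","percent":{},"chunk":{},"chunks_total":{}}}"#,
        progress_percent(chunk, chunks_total),
        chunk,
        chunks_total
    );
    sse_frame("progress", &json)
}
```

A browser client would receive such a frame through `EventSource` and dispatch it to a listener registered for the `progress` event name.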
Webhook
When webhook_url is provided at submission, the server POSTs the complete Job JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget with exponential backoff: 1s, 2s, 4s, 8s, 16s (max 5 attempts). Failures after all retries are logged and discarded.
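
The retry schedule can be sketched as follows. `backoff_delays` and `deliver_with_retry` are hypothetical names, the HTTP POST is abstracted as a closure, and sleeping after every failed attempt (including the last) is only one reading of the schedule above; the code in src/webhook.rs may interleave delays and attempts differently.

```rust
use std::time::Duration;

/// Delays of 1s, 2s, 4s, ... doubling for each attempt.
pub fn backoff_delays(max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts)
        .map(|i| Duration::from_secs(1u64 << i.min(62)))
        .collect()
}

/// Call `send` up to `max_attempts` times, invoking `sleep` with the next
/// delay after each failure. Returns the 1-based attempt that succeeded,
/// or the last error once all attempts are used up.
/// `sleep` is injected so the sketch can be exercised without waiting.
pub fn deliver_with_retry<F, S>(max_attempts: u32, mut send: F, mut sleep: S) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    let mut last_err = String::from("no attempts made");
    for (i, delay) in backoff_delays(max_attempts).into_iter().enumerate() {
        match send() {
            Ok(()) => return Ok(i as u32 + 1),
            Err(e) => last_err = e,
        }
        sleep(delay);
    }
    // Fire-and-forget: the caller would log this error and move on.
    Err(last_err)
}
```

In production the `sleep` argument would be a real timer, while a test can pass a closure that merely records the requested delays.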
Docker / Build
Multi-stage Dockerfile
| Stage | Base image | Purpose |
|---|---|---|
| builder | nvidia/cuda:<ver>-cudnn-devel-ubuntu<ver> | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
| runtime | nvidia/cuda:<ver>-cudnn-runtime-ubuntu<ver> | Minimal runtime; only ffmpeg and the binary |
whisper-rs bundles the whisper.cpp source inside the crate (~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/). There is no external clone step, so the whisper.cpp revision is fixed by the crate version.
CUDA build flags (set as ENV in builder stage)
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75 # RTX 2080 = Turing = sm_75
GGML_CUDA_FORCE_MMQ=ON # matrix-multiply quantized kernels
GGML_CUDA_GRAPHS=ON # CUDA graph capture for repeated patterns
GGML_CUDA_FA_ALL_QUANTS=ON # flash attention for all quantisation types
GGML_CUDA_F16=ON # half-precision accumulation
CI (Gitea Actions)
- Triggers on push to main and semver tags (v*)
- PRs: build only, no push
- Tags produced: latest, sha-<short>, semver components on tags
- Build cache stored in registry as :buildcache tag
- CUDA_VERSION and UBUNTU_VERSION overridable via repo Variables