whisper-rtx2080/docs/ARCHITECTURE.md
mozempk c25e8e7ffb
docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00

Architecture

Overview

whisper-server is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps whisper.cpp (via whisper-rs) compiled with CUDA support targeting NVIDIA RTX 2080 (sm_75).

Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk-by-chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).


Component Map

                   HTTP client
                       │
              ┌────────▼────────┐
              │   Axum Router   │  (tower-http: CORS, tracing)
              │  /jobs          │
              │  /jobs/:id      │
              │ /jobs/:id/stream│
              │  /health        │
              │  /docs (Swagger)│
              └────────┬────────┘
                       │ AppState (Arc)
          ┌────────────┼────────────────┐
          │            │                │
    ┌─────▼──────┐ ┌───▼────┐  ┌───────▼──────┐
    │  Storage   │ │ ProgReg│  │  job_tx (mpsc)│
    │ (disk JSON)│ │(DashMap)│  └───────┬──────┘
    └─────▲──────┘ └───▲────┘          │
          │            │         ┌──────▼──────────────────────┐
          │            │         │  Tokio worker task (run())   │
          │            │         │  - dequeues job IDs          │
          │            │         │  - decodes audio (ffmpeg)    │
          │            │         │  - runs silencedetect        │
          │            │         │  - chunks PCM                │
          │            │         │  - sends TranscribeRequest   │
          │            │         │  - offsets timestamps        │
          │            │         │  - persists result           │
          │            │         │  - fires webhook             │
          │            └─────────┤                              │
          └───────────────────── └──────┬──────────────────────┘
                                         │ std::sync::mpsc
                                  ┌──────▼───────────────┐
                                  │  whisper-gpu thread   │
                                  │  (OS thread, non-Sync)│
                                  │  owns WhisperContext  │
                                  │  runs CUDA inference  │
                                  └──────────────────────┘

Source Files

| File | Responsibility |
|---|---|
| src/main.rs | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
| src/models.rs | All data types: Job, Segment, Word, SsePayload, JobStatus |
| src/error.rs | AppError enum → HTTP status codes via IntoResponse |
| src/storage.rs | File-backed job store (one JSON file per job UUID) |
| src/transcriber.rs | Owns WhisperContext; sets all inference parameters; decodes output |
| src/worker.rs | Audio pipeline: silence detection, chunking, progress, job lifecycle |
| src/webhook.rs | Fire-and-forget POST with exponential backoff (5 retries) |
| src/routes/mod.rs | Router assembly; disables body limit on POST /jobs |
| src/routes/jobs.rs | Handlers: submit, get, SSE stream, delete/cancel |
| src/routes/health.rs | Health check + GPU info via nvidia-smi |

Threading Model

whisper.cpp's WhisperContext is Send but not Sync: it can be moved to another thread, but it cannot be accessed from several threads at the same time. The design therefore uses a two-layer concurrency model:

Layer 1 — Tokio async runtime

All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.

Layer 2 — Dedicated OS thread (whisper-gpu)

A single non-async OS thread owns the WhisperContext for its entire lifetime. The thread loops on a std::sync::mpsc::Receiver<TranscribeRequest>, processes one inference at a time, and sends the result back through a oneshot::Sender.

Communication

Tokio task          std::mpsc           GPU thread
─────────                               ──────────
TranscribeRequest ────────────────────► transcriber.transcribe()
                                             │
oneshot::Receiver ◄──────────────────── oneshot::Sender (Vec<Segment>, lang)

This ensures:

  • GPU inference is never interleaved (one job at a time on the GPU)
  • The async runtime is never blocked by long-running GPU work
  • WhisperContext never needs to be Sync
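
The request/reply pattern above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `Transcriber` here is a stand-in that only counts calls, the fields of `TranscribeRequest` are assumed, and a second `std::sync::mpsc` channel stands in for the `oneshot::Sender` so the sketch needs no external crates.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical request shape: PCM samples plus a channel for the reply.
pub struct TranscribeRequest {
    pub pcm: Vec<f32>,
    /// Stand-in for the `oneshot::Sender` named in the text.
    pub reply: mpsc::Sender<Result<Vec<String>, String>>,
}

/// Stand-in for the object that owns `WhisperContext`.
pub struct Transcriber {
    calls: usize,
}

impl Transcriber {
    pub fn transcribe(&mut self, pcm: &[f32]) -> Result<Vec<String>, String> {
        self.calls += 1;
        Ok(vec![format!("call {}: {} samples", self.calls, pcm.len())])
    }
}

/// Spawns the dedicated thread and returns the sending half of its queue.
pub fn spawn_gpu_thread() -> mpsc::Sender<TranscribeRequest> {
    let (tx, rx) = mpsc::channel::<TranscribeRequest>();
    thread::Builder::new()
        .name("whisper-gpu".into())
        .spawn(move || {
            // The transcriber is created on this thread and never leaves it.
            let mut transcriber = Transcriber { calls: 0 };
            // `recv` blocks until a request arrives; the loop ends once
            // every sender has been dropped.
            while let Ok(req) = rx.recv() {
                let result = transcriber.transcribe(&req.pcm);
                // The requester may have gone away; ignore a failed send.
                let _ = req.reply.send(result);
            }
        })
        .expect("failed to spawn whisper-gpu thread");
    tx
}

/// Submits one chunk and blocks for the reply. An async caller would
/// await a oneshot receiver here instead of blocking.
pub fn transcribe_blocking(
    tx: &mpsc::Sender<TranscribeRequest>,
    pcm: Vec<f32>,
) -> Result<Vec<String>, String> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(TranscribeRequest { pcm, reply: reply_tx })
        .map_err(|_| "gpu thread stopped".to_string())?;
    reply_rx
        .recv()
        .map_err(|_| "gpu thread dropped the reply".to_string())?
}
```

Because the receiving loop handles one request per iteration, requests are processed strictly in arrival order and never overlap, which is the property the list above relies on.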

Job Lifecycle

POST /jobs
    │
    ├─ Stream audio to disk → <DATA_DIR>/<uuid>.audio
    ├─ Create Job{status: Queued} in storage
    ├─ Pre-create broadcast channel in ProgressRegistry
    ├─ Send job ID into job_tx
    └─ Return 202 { job_id }

Worker picks up job_id
    │
    ├─ Load job from storage
    ├─ Mark Running → save
    ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
    ├─ detect_silence_midpoints (ffmpeg silencedetect)
    ├─ snap_to_silence → cut points
    ├─ to_chunk_ranges → [(start, end), ...]
    │
    └─ For each chunk:
           ├─ slice PCM
           ├─ trim_trailing_silence
           ├─ broadcast Progress{percent, chunk, total}
           ├─ save snapshot to disk
           ├─ send TranscribeRequest → GPU thread
           ├─ await oneshot reply
           ├─ offset timestamps by chunk_start
           └─ accumulate segments
    │
    ├─ renumber segment indices
    ├─ broadcast Done / Error
    ├─ save final Job to disk
    ├─ delete .audio file
    └─ (optional) fire webhook
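
The "offset timestamps" and "renumber segment indices" steps can be sketched as follows. The `Segment` fields and the function names `offset_segments` and `merge_chunks` are assumptions made for illustration; the real types live in src/models.rs and may differ.

```rust
/// Hypothetical segment shape; field names are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub index: usize,
    pub start: f64, // seconds
    pub end: f64,   // seconds
    pub text: String,
}

/// Shifts chunk-relative timestamps so they are relative to the whole file.
pub fn offset_segments(mut segments: Vec<Segment>, chunk_start_secs: f64) -> Vec<Segment> {
    for seg in &mut segments {
        seg.start += chunk_start_secs;
        seg.end += chunk_start_secs;
    }
    segments
}

/// Accumulates the segments of every chunk, then renumbers them 0..n
/// so indices are unique across the whole job.
pub fn merge_chunks(chunks: Vec<(f64, Vec<Segment>)>) -> Vec<Segment> {
    let mut all = Vec::new();
    for (chunk_start_secs, segments) in chunks {
        all.extend(offset_segments(segments, chunk_start_secs));
    }
    for (i, seg) in all.iter_mut().enumerate() {
        seg.index = i;
    }
    all
}
```

Renumbering is needed because each chunk is transcribed independently, so every chunk's segments would otherwise start again from index 0.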

Audio Pre-Processing Pipeline

Input file (any format)
        │
        ▼
  ffmpeg silencedetect
  (runs on original file — captures full dynamic range for silence detection)
        │
        ▼
  Silence midpoints → snap_to_silence() → cut points at ~60s intervals
        │
        ▼
  ffmpeg decode: → 16kHz, mono, f32le PCM
        │
        ▼
  Per chunk:
    pcm[start_sample..end_sample]
        │
        ▼
  trim_trailing_silence() → removes silence tail, keeps 0.5s padding
        │
        ▼
  WhisperContext::full() → Vec<Segment>
        │
        ▼
  Offset all timestamps += chunk_start_secs
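
A rough sketch of the chunking steps in this pipeline, under stated assumptions: the function names match the ones in the diagram, but their signatures, the fixed 60-second grid, the snapping tolerance, and the amplitude threshold are all guesses for illustration rather than the project's actual parameters.

```rust
/// Samples per second after decoding (16 kHz mono, as stated above).
pub const SAMPLE_RATE: usize = 16_000;

/// For each multiple of `target_secs`, picks the nearest silence midpoint
/// within `tolerance_secs`; falls back to the exact multiple otherwise.
pub fn snap_to_silence(
    duration_secs: f64,
    silences: &[f64],
    target_secs: f64,
    tolerance_secs: f64,
) -> Vec<f64> {
    let mut cuts: Vec<f64> = Vec::new();
    let mut ideal = target_secs;
    while ideal < duration_secs {
        let nearest = silences
            .iter()
            .copied()
            .filter(|s| (*s - ideal).abs() <= tolerance_secs)
            .min_by(|a, b| (*a - ideal).abs().total_cmp(&(*b - ideal).abs()));
        let cut = nearest.unwrap_or(ideal);
        // Keep cut points strictly increasing.
        if cuts.last().map_or(true, |&last| cut > last) {
            cuts.push(cut);
        }
        ideal += target_secs;
    }
    cuts
}

/// Converts cut points (seconds) into half-open sample ranges.
pub fn to_chunk_ranges(cuts: &[f64], total_samples: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0usize;
    for &cut in cuts {
        let end = ((cut * SAMPLE_RATE as f64) as usize).min(total_samples);
        if end > start {
            ranges.push((start, end));
            start = end;
        }
    }
    if total_samples > start {
        ranges.push((start, total_samples));
    }
    ranges
}

/// Drops the silent tail of a chunk but keeps `pad_samples` after the
/// last sample whose amplitude exceeds `threshold`.
pub fn trim_trailing_silence(pcm: &[f32], threshold: f32, pad_samples: usize) -> &[f32] {
    let last_loud = pcm
        .iter()
        .rposition(|s| s.abs() > threshold)
        .map_or(0, |i| i + 1);
    let end = (last_loud + pad_samples).min(pcm.len());
    &pcm[..end]
}
```

With the 0.5 s padding mentioned above, `pad_samples` would be `SAMPLE_RATE / 2`.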

Persistence

Jobs are persisted as pretty-printed JSON files in DATA_DIR:

/data/
  <uuid>.json     ← job state (updated on every progress snapshot)
  <uuid>.audio    ← raw upload (deleted after transcription)

On startup, recover_interrupted_jobs() scans all .json files and marks any Running jobs as Failed (they were killed mid-transcription).
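
The recovery rule can be illustrated on in-memory data. This sketch leaves out the directory scan and JSON parsing; the `Job` fields, the `Done` variant, and the error message are assumptions, and only the Running to Failed transition comes from the description above.

```rust
/// Hypothetical status enum; only Queued, Running and Failed are named
/// in this document, `Done` is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Queued,
    Running,
    Done,
    Failed,
}

/// Hypothetical job record reduced to the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub id: String,
    pub status: JobStatus,
    pub error: Option<String>,
}

/// Marks every job still `Running` as `Failed` and returns how many were
/// changed. Jobs in any other state are left untouched.
pub fn recover_interrupted(jobs: &mut [Job]) -> usize {
    let mut recovered = 0;
    for job in jobs.iter_mut().filter(|j| j.status == JobStatus::Running) {
        job.status = JobStatus::Failed;
        job.error = Some("interrupted by server restart".to_string());
        recovered += 1;
    }
    recovered
}
```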

There is no database. The file-per-job approach is intentional: job files are trivially inspectable, crash recovery needs no write-ahead log, and thousands of jobs can be handled with no database to run or maintain.


Progress & SSE

ProgressRegistry is a DashMap<JobId, broadcast::Sender<ProgressEvent>>. It is populated when a job is submitted (before the worker starts) so clients connecting early don't miss the first events.

SSE events are typed:

| Event name | Payload shape |
|---|---|
| progress | { "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M } |
| done | { "type": "done", "job": { ...full Job object... } } |
| error | { "type": "error", "message": "..." } |

If a client connects to /jobs/:id/stream for an already-finished job, it receives a single done event immediately.

The broadcast channel buffers 64 events. An SSE client that falls further behind than that loses the oldest events: its receiver returns Err(RecvError::Lagged), which the SSE stream adapter filters to None, so the missed events are skipped silently rather than reported to the client.
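
For reference, a typed SSE event is sent as an `event:` line followed by one or more `data:` lines and a blank line. The sketch below shows what a progress event could look like as raw text. The helper names `sse_frame` and `progress_payload` are hypothetical, and the server most likely builds these frames through its web framework rather than by hand.

```rust
/// Formats one Server-Sent Event frame: an `event:` line, one `data:`
/// line per payload line, and a terminating blank line.
pub fn sse_frame(event: &str, data: &str) -> String {
    let mut out = format!("event: {event}\n");
    for line in data.lines() {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}

/// Builds the progress payload shape listed in this section.
pub fn progress_payload(percent: u8, chunk: usize, chunks_total: usize) -> String {
    format!(
        r#"{{"type":"progress","percent":{percent},"chunk":{chunk},"chunks_total":{chunks_total}}}"#
    )
}
```

A client using the browser `EventSource` API would subscribe to the `progress`, `done`, and `error` event names separately, since named events are not delivered to the default `message` handler.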


Webhook

When webhook_url is provided at submission, the server POSTs the complete Job JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget with exponential backoff: 1s, 2s, 4s, 8s, 16s (max 5 attempts). Failures after all retries are logged and discarded.
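
The delay schedule above is a doubling sequence starting at one second. A minimal sketch follows, with the HTTP call and the sleep injected as closures so it needs no external crates. `backoff_delays` and `deliver_with_retry` are hypothetical names, and whether the real code waits before or after each attempt is an assumption here.

```rust
use std::time::Duration;

/// Returns the delays 1s, 2s, 4s, ... for `max_attempts` attempts.
/// `max_attempts` must stay below 64 to avoid shifting out of range.
pub fn backoff_delays(max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts)
        .map(|n| Duration::from_secs(1u64 << n))
        .collect()
}

/// Calls `attempt` once per entry in `delays` until it returns `true`.
/// After each failed attempt it waits for that entry's delay (assumed
/// ordering). Returns `false` when every attempt failed.
pub fn deliver_with_retry(
    delays: &[Duration],
    mut attempt: impl FnMut() -> bool,
    mut sleep: impl FnMut(Duration),
) -> bool {
    for &delay in delays {
        if attempt() {
            return true;
        }
        sleep(delay);
    }
    false
}
```

In production the `sleep` closure would be an async timer, and a `false` return corresponds to the "logged and discarded" outcome described above.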


Docker / Build

Multi-stage Dockerfile

| Stage | Base image | Purpose |
|---|---|---|
| builder | nvidia/cuda:<ver>-cudnn-devel-ubuntu<ver> | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
| runtime | nvidia/cuda:<ver>-cudnn-runtime-ubuntu<ver> | Minimal runtime; only ffmpeg and the binary |

whisper-rs bundles the whisper.cpp source inside its whisper-rs-sys crate (~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/). There is no external clone step, so the whisper.cpp version is pinned by the crate version.

CUDA build flags (set as ENV in builder stage)

GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75        # RTX 2080 = Turing = sm_75
GGML_CUDA_FORCE_MMQ=ON             # matrix-multiply quantized kernels
GGML_CUDA_GRAPHS=ON                # CUDA graph capture for repeated patterns
GGML_CUDA_FA_ALL_QUANTS=ON         # flash attention for all quantisation types
GGML_CUDA_F16=ON                   # half-precision accumulation

CI (Gitea Actions)

  • Triggers on push to main and semver tags (v*)
  • PRs: build only, no push
  • Tags produced: latest, sha-<short>, semver components on tags
  • Build cache stored in registry as :buildcache tag
  • CUDA_VERSION and UBUNTU_VERSION overridable via repo Variables