docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00
parent 8fc45ee86f
commit c25e8e7ffb
4 changed files with 1019 additions and 0 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,233 @@
+# Architecture
+
+## Overview
+
+`whisper-server` is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps [whisper.cpp](https://github.com/ggerganov/whisper.cpp) (via `whisper-rs`) compiled with CUDA support targeting NVIDIA RTX 2080 (sm_75).
+
+Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk-by-chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).
+
+---
+
+## Component Map
+
+```
+                   HTTP client
+                       │
+              ┌────────▼────────┐
+              │   Axum Router   │  (tower-http: CORS, tracing)
+              │  /jobs          │
+              │  /jobs/:id      │
+              │  /jobs/:id/stream│
+              │  /health        │
+              │  /docs (Swagger)│
+              └────────┬────────┘
+                       │ AppState (Arc)
+          ┌────────────┼────────────────┐
+          │            │                │
+    ┌─────▼──────┐ ┌───▼────┐  ┌───────▼──────┐
+    │  Storage   │ │ ProgReg│  │  job_tx (mpsc)│
+    │ (disk JSON)│ │(DashMap)│  └───────┬──────┘
+    └─────▲──────┘ └───▲────┘          │
+          │            │         ┌──────▼──────────────────────┐
+          │            │         │  Tokio worker task (run())   │
+          │            │         │  - dequeues job IDs          │
+          │            │         │  - decodes audio (ffmpeg)    │
+          │            │         │  - runs silencedetect        │
+          │            │         │  - chunks PCM                │
+          │            │         │  - sends TranscribeRequest   │
+          │            │         │  - offsets timestamps        │
+          │            │         │  - persists result           │
+          │            │         │  - fires webhook             │
+          │            └─────────┤                              │
+          └───────────────────── └──────┬──────────────────────┘
+                                         │ std::sync::mpsc
+                                  ┌──────▼─────────────────┐
+                                  │  whisper-gpu thread    │
+                                  │  (OS thread, non-async)│
+                                  │  owns WhisperContext   │
+                                  │  runs CUDA inference   │
+                                  └────────────────────────┘
+```
+
+---
+
+## Source Files
+
+| File | Responsibility |
+|------|---------------|
+| `src/main.rs` | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
+| `src/models.rs` | All data types: `Job`, `Segment`, `Word`, `SsePayload`, `JobStatus` |
+| `src/error.rs` | `AppError` enum → HTTP status codes via `IntoResponse` |
+| `src/storage.rs` | File-backed job store (one JSON file per job UUID) |
+| `src/transcriber.rs` | Owns `WhisperContext`; sets all inference parameters; decodes output |
+| `src/worker.rs` | Audio pipeline: silence detection, chunking, progress, job lifecycle |
+| `src/webhook.rs` | Fire-and-forget POST with exponential backoff (5 retries) |
+| `src/routes/mod.rs` | Router assembly; disables body limit on POST /jobs |
+| `src/routes/jobs.rs` | Handlers: submit, get, SSE stream, delete/cancel |
+| `src/routes/health.rs` | Health check + GPU info via `nvidia-smi` |
+
+---
+
+## Threading Model
+
+whisper.cpp's `WhisperContext` is `Send` but **not `Sync`** — it cannot be shared across threads simultaneously. The design uses a **two-layer concurrency model**:
+
+### Layer 1 — Tokio async runtime
+All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.
+
+### Layer 2 — Dedicated OS thread (`whisper-gpu`)
+A single non-async OS thread owns the `WhisperContext` for its entire lifetime. The thread loops on a `std::sync::mpsc::Receiver<TranscribeRequest>`, processes one inference at a time, and sends the result back through a `oneshot::Sender`.
+
+### Communication
+```
+Tokio task          std::mpsc           GPU thread
+─────────                               ──────────
+TranscribeRequest ────────────────────► transcriber.transcribe()
+                                             │
+oneshot::Receiver ◄──────────────────── oneshot::Sender (Vec<Segment>, lang)
+```
+
+This ensures:
+- GPU inference is never interleaved (one inference request at a time on the GPU)
+- The async runtime is never blocked by long-running GPU work
+- `WhisperContext` never needs to be `Sync`
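+
+A minimal sketch of this hand-off, using only the standard library. `Transcriber`, the fields of `TranscribeRequest`, and `spawn_gpu_thread` are hypothetical stand-ins rather than the project's actual definitions, and a `std::sync::mpsc` channel replaces the Tokio `oneshot` reply so the example runs without extra crates:
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+
+/// Hypothetical stand-in for the real request type.
+struct TranscribeRequest {
+    pcm: Vec<f32>,
+    /// Reply channel; the real server uses a Tokio oneshot sender here.
+    reply: mpsc::Sender<(Vec<String>, String)>,
+}
+
+/// Hypothetical stand-in for the type that owns `WhisperContext`.
+struct Transcriber;
+
+impl Transcriber {
+    fn transcribe(&mut self, pcm: &[f32]) -> (Vec<String>, String) {
+        // Placeholder for CUDA inference: report the sample count.
+        (vec![format!("{} samples", pcm.len())], "en".to_string())
+    }
+}
+
+/// Spawns the dedicated thread and returns the sending half of its queue.
+fn spawn_gpu_thread() -> mpsc::Sender<TranscribeRequest> {
+    let (tx, rx) = mpsc::channel::<TranscribeRequest>();
+    thread::Builder::new()
+        .name("whisper-gpu".into())
+        .spawn(move || {
+            // Created inside the thread, so it is never shared between threads.
+            let mut transcriber = Transcriber;
+            // `recv` blocks; the loop ends once every sender is dropped.
+            while let Ok(req) = rx.recv() {
+                let result = transcriber.transcribe(&req.pcm);
+                // The requester may have gone away; ignore a failed send.
+                let _ = req.reply.send(result);
+            }
+        })
+        .expect("failed to spawn whisper-gpu thread");
+    tx
+}
+```
+
+A caller builds a fresh reply channel per request, sends the request, and waits on the receiver. In the real server that wait would presumably be an `.await` on the `oneshot::Receiver`, so no Tokio worker thread blocks.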
+
+---
+
+## Job Lifecycle
+
+```
+POST /jobs
+    │
+    ├─ Stream audio to disk → <DATA_DIR>/<uuid>.audio
+    ├─ Create Job{status: Queued} in storage
+    ├─ Pre-create broadcast channel in ProgressRegistry
+    ├─ Send job ID into job_tx
+    └─ Return 202 { job_id }
+
+Worker picks up job_id
+    │
+    ├─ Load job from storage
+    ├─ Mark Running → save
+    ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
+    ├─ detect_silence_midpoints (ffmpeg silencedetect)
+    ├─ snap_to_silence → cut points
+    ├─ to_chunk_ranges → [(start, end), ...]
+    │
+    └─ For each chunk:
+           ├─ slice PCM
+           ├─ trim_trailing_silence
+           ├─ broadcast Progress{percent, chunk, total}
+           ├─ save snapshot to disk
+           ├─ send TranscribeRequest → GPU thread
+           ├─ await oneshot reply
+           ├─ offset timestamps by chunk_start
+           └─ accumulate segments
+    │
+    ├─ renumber segment indices
+    ├─ broadcast Done / Error
+    ├─ save final Job to disk
+    ├─ delete .audio file
+    └─ (optional) fire webhook
+```
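+
+The last steps of the per-chunk loop (offsetting timestamps, accumulating segments, renumbering) could look roughly like the sketch below. The `Segment` fields and the helper names `offset_segments`, `renumber`, and `merge_chunks` are assumptions made for illustration; the real types live in `src/models.rs` and may differ:
+
+```rust
+/// Hypothetical, simplified version of the `Segment` type.
+#[derive(Debug, Clone, PartialEq)]
+struct Segment {
+    index: usize,
+    start: f64, // seconds
+    end: f64,   // seconds
+    text: String,
+}
+
+/// Shifts chunk-relative timestamps onto the timeline of the whole file.
+fn offset_segments(segments: &mut [Segment], chunk_start_secs: f64) {
+    for seg in segments.iter_mut() {
+        seg.start += chunk_start_secs;
+        seg.end += chunk_start_secs;
+    }
+}
+
+/// Assigns consecutive indices after all chunks have been merged.
+fn renumber(segments: &mut [Segment]) {
+    for (i, seg) in segments.iter_mut().enumerate() {
+        seg.index = i;
+    }
+}
+
+/// Merges per-chunk results; each entry pairs a chunk's start time with its segments.
+fn merge_chunks(chunks: Vec<(f64, Vec<Segment>)>) -> Vec<Segment> {
+    let mut all = Vec::new();
+    for (chunk_start_secs, mut segments) in chunks {
+        offset_segments(&mut segments, chunk_start_secs);
+        all.extend(segments);
+    }
+    renumber(&mut all);
+    all
+}
+```
+
+Renumbering only once, after the merge, keeps indices contiguous even when a chunk yields no segments.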
+
+---
+
+## Audio Pre-Processing Pipeline
+
+```
+Input file (any format)
+        │
+        ▼
+  ffmpeg silencedetect
+  (runs on original file — captures full dynamic range for silence detection)
+        │
+        ▼
+  Silence midpoints → snap_to_silence() → cut points at ~60s intervals
+        │
+        ▼
+  ffmpeg decode: → 16kHz, mono, f32le PCM
+        │
+        ▼
+  Per chunk:
+    pcm[start_sample..end_sample]
+        │
+        ▼
+  trim_trailing_silence() → removes silence tail, keeps 0.5s padding
+        │
+        ▼
+  WhisperContext::full() → Vec<Segment>
+        │
+        ▼
+  Offset all timestamps += chunk_start_secs
+```
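+
+One way the cut-point and slicing steps might be written is sketched here. The function names come from this document, but the signatures, the `tolerance_secs` parameter, and the hard-cut fallback when no silence is nearby are assumptions, not the code in `src/worker.rs`:
+
+```rust
+/// Sample rate of the decoded PCM (16 kHz mono, as described above).
+const SAMPLE_RATE: usize = 16_000;
+
+/// Picks one cut roughly every `target_secs`, snapping to the closest silence
+/// midpoint within `tolerance_secs`; falls back to a hard cut otherwise.
+fn snap_to_silence(
+    duration_secs: f64,
+    silence_midpoints: &[f64],
+    target_secs: f64,
+    tolerance_secs: f64,
+) -> Vec<f64> {
+    // Guarantees that every cut lies strictly after the previous one.
+    assert!(target_secs > 0.0 && tolerance_secs < target_secs);
+    let mut cuts = Vec::new();
+    let mut ideal = target_secs;
+    while ideal < duration_secs {
+        let nearest = silence_midpoints
+            .iter()
+            .copied()
+            .filter(|m| *m < duration_secs && (*m - ideal).abs() <= tolerance_secs)
+            .min_by(|a, b| (*a - ideal).abs().total_cmp(&(*b - ideal).abs()));
+        let cut = nearest.unwrap_or(ideal);
+        cuts.push(cut);
+        ideal = cut + target_secs;
+    }
+    cuts
+}
+
+/// Turns cut points into consecutive `(start, end)` ranges covering the file.
+fn to_chunk_ranges(cuts: &[f64], duration_secs: f64) -> Vec<(f64, f64)> {
+    let mut ranges = Vec::with_capacity(cuts.len() + 1);
+    let mut start = 0.0;
+    for &cut in cuts {
+        ranges.push((start, cut));
+        start = cut;
+    }
+    ranges.push((start, duration_secs));
+    ranges
+}
+
+/// Borrows the samples for one range, clamped to the buffer length.
+fn slice_chunk(pcm: &[f32], start_secs: f64, end_secs: f64) -> &[f32] {
+    let start = ((start_secs * SAMPLE_RATE as f64) as usize).min(pcm.len());
+    let end = ((end_secs * SAMPLE_RATE as f64) as usize).clamp(start, pcm.len());
+    &pcm[start..end]
+}
+```
+
+Measuring the next target from the previous cut, rather than from fixed multiples of the interval, keeps chunks close to the target length even after a cut has been moved to a silence.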
+
+---
+
+## Persistence
+
+Jobs are persisted as pretty-printed JSON files in `DATA_DIR`:
+
+```
+/data/
+  <uuid>.json     ← job state (updated on every progress snapshot)
+  <uuid>.audio    ← raw upload (deleted after transcription)
+```
+
+On startup, `recover_interrupted_jobs()` scans all `.json` files and marks any `Running` jobs as `Failed` (they were killed mid-transcription).
+
+There is no database. The file-per-job approach is intentional: each job can be inspected with any text editor, crash recovery needs no WAL, and a data directory holding thousands of jobs needs no additional infrastructure.
+
+---
+
+## Progress & SSE
+
+`ProgressRegistry` is a `DashMap<JobId, broadcast::Sender<ProgressEvent>>`. It is populated when a job is submitted (before the worker starts) so clients connecting early don't miss the first events.
+
+SSE events are typed:
+
+| Event name  | Payload shape |
+|-------------|--------------|
+| `progress`  | `{ "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M }` |
+| `done`      | `{ "type": "done", "job": { ...full Job object... } }` |
+| `error`     | `{ "type": "error", "message": "..." }` |
+
+If a client connects to `/jobs/:id/stream` for an already-finished job, it receives a single `done` event immediately.
+
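+The broadcast channel has a buffer of 64 events. A slow SSE client that falls more than 64 events behind is not disconnected; it silently loses the oldest events. Its receiver yields `Err(RecvError::Lagged)` for the skipped events, the SSE stream adapter filters that error to `None`, and the stream continues with the oldest event still in the buffer.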
+
+---
+
+## Webhook
+
+When `webhook_url` is provided at submission, the server POSTs the complete `Job` JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget: failed attempts are retried with exponential backoff (1s, 2s, 4s, 8s, 16s; at most 5 retries), and failures after all retries are logged and discarded.
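+
+The retry schedule can be sketched as follows, assuming the five delays correspond to five retries after an initial attempt. `backoff_delay` and `deliver_with_retries` are hypothetical names; the real code in `src/webhook.rs` is presumably async, whereas this version takes the send and sleep operations as closures so the schedule can be checked without a network or real waiting:
+
+```rust
+use std::time::Duration;
+
+/// Delay before retry number `retry` (0-based): 1s, 2s, 4s, 8s, 16s.
+fn backoff_delay(retry: u32) -> Duration {
+    Duration::from_secs(1u64 << retry)
+}
+
+/// Calls `send` once, then retries up to `max_retries` times with backoff.
+/// Returns the last error if every attempt fails.
+fn deliver_with_retries<E>(
+    max_retries: u32,
+    mut send: impl FnMut() -> Result<(), E>,
+    mut sleep: impl FnMut(Duration),
+) -> Result<(), E> {
+    let mut last_err = match send() {
+        Ok(()) => return Ok(()),
+        Err(e) => e,
+    };
+    for retry in 0..max_retries {
+        sleep(backoff_delay(retry));
+        match send() {
+            Ok(()) => return Ok(()),
+            Err(e) => last_err = e,
+        }
+    }
+    Err(last_err)
+}
+```
+
+In a blocking context `std::thread::sleep` can be passed as the `sleep` argument; the caller logs the returned error and drops it, matching the behaviour described above.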
+
+---
+
+## Docker / Build
+
+### Multi-stage Dockerfile
+
+| Stage | Base image | Purpose |
+|-------|-----------|---------|
+| `builder` | `nvidia/cuda:<ver>-cudnn-devel-ubuntu<ver>` | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
+| `runtime` | `nvidia/cuda:<ver>-cudnn-runtime-ubuntu<ver>` | Minimal runtime; only `ffmpeg` and the binary |
+
+`whisper-rs` pulls in the whisper.cpp source through its `whisper-rs-sys` crate (`~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/`). There is no external clone step, so the whisper.cpp revision is fixed by the `whisper-rs-sys` crate version.
+
+### CUDA build flags (set as ENV in builder stage)
+```
+GGML_CUDA=ON
+CMAKE_CUDA_ARCHITECTURES=75        # RTX 2080 = Turing = sm_75
+GGML_CUDA_FORCE_MMQ=ON             # matrix-multiply quantized kernels
+GGML_CUDA_GRAPHS=ON                # CUDA graph capture for repeated patterns
+GGML_CUDA_FA_ALL_QUANTS=ON         # flash attention for all quantisation types
+GGML_CUDA_F16=ON                   # half-precision accumulation
+```
+
+### CI (Gitea Actions)
+- Triggers on `push` to `main` and semver tags (`v*`)
+- Pull requests: the image is built but not pushed
+- Image tags produced: `latest`, `sha-<short>`, and semver components for `v*` tag builds
+- Build cache stored in registry as `:buildcache` tag
+- `CUDA_VERSION` and `UBUNTU_VERSION` overridable via repo Variables