docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00
parent 8fc45ee86f
commit c25e8e7ffb
4 changed files with 1019 additions and 0 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,233 @@
+# Architecture
+
+## Overview
+
+`whisper-server` is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps [whisper.cpp](https://github.com/ggerganov/whisper.cpp) (via `whisper-rs`) compiled with CUDA support targeting NVIDIA RTX 2080 (sm_75).
+
+Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk-by-chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).
+
+---
+
+## Component Map
+
+```
+                   HTTP client
+                       │
+              ┌────────▼────────┐
+              │   Axum Router   │  (tower-http: CORS, tracing)
+              │  /jobs          │
+              │  /jobs/:id      │
+              │  /jobs/:id/stream│
+              │  /health        │
+              │  /docs (Swagger)│
+              └────────┬────────┘
+                       │ AppState (Arc)
+          ┌────────────┼────────────────┐
+          │            │                │
+    ┌─────▼──────┐ ┌───▼────┐  ┌───────▼──────┐
+    │  Storage   │ │ ProgReg│  │  job_tx (mpsc)│
+    │ (disk JSON)│ │(DashMap)│  └───────┬──────┘
+    └─────▲──────┘ └───▲────┘          │
+          │            │         ┌──────▼──────────────────────┐
+          │            │         │  Tokio worker task (run())   │
+          │            │         │  - dequeues job IDs          │
+          │            │         │  - decodes audio (ffmpeg)    │
+          │            │         │  - runs silencedetect        │
+          │            │         │  - chunks PCM                │
+          │            │         │  - sends TranscribeRequest   │
+          │            │         │  - offsets timestamps        │
+          │            │         │  - persists result           │
+          │            │         │  - fires webhook             │
+          │            └─────────┤                              │
+          └───────────────────── └──────┬──────────────────────┘
+                                         │ std::sync::mpsc
+                                  ┌──────▼───────────────┐
+                                  │  whisper-gpu thread   │
+                                  │  (OS thread, non-Send)│
+                                  │  owns WhisperContext  │
+                                  │  runs CUDA inference  │
+                                  └──────────────────────┘
+```
+
+---
+
+## Source Files
+
+| File | Responsibility |
+|------|---------------|
+| `src/main.rs` | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
+| `src/models.rs` | All data types: `Job`, `Segment`, `Word`, `SsePayload`, `JobStatus` |
+| `src/error.rs` | `AppError` enum → HTTP status codes via `IntoResponse` |
+| `src/storage.rs` | File-backed job store (one JSON file per job UUID) |
+| `src/transcriber.rs` | Owns `WhisperContext`; sets all inference parameters; decodes output |
+| `src/worker.rs` | Audio pipeline: silence detection, chunking, progress, job lifecycle |
+| `src/webhook.rs` | Fire-and-forget POST with exponential backoff (5 retries) |
+| `src/routes/mod.rs` | Router assembly; disables body limit on POST /jobs |
+| `src/routes/jobs.rs` | Handlers: submit, get, SSE stream, delete/cancel |
+| `src/routes/health.rs` | Health check + GPU info via `nvidia-smi` |
+
+---
+
+## Threading Model
+
+whisper.cpp's `WhisperContext` is `Send` but **not `Sync`** — it cannot be shared across threads simultaneously. The design uses a **two-layer concurrency model**:
+
+### Layer 1 — Tokio async runtime
+All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.
+
+### Layer 2 — Dedicated OS thread (`whisper-gpu`)
+A single non-async OS thread owns the `WhisperContext` for its entire lifetime. The thread loops on a `std::sync::mpsc::Receiver<TranscribeRequest>`, processes one inference at a time, and sends the result back through a `oneshot::Sender`.
+
+### Communication
+```
+Tokio task          std::mpsc           GPU thread
+─────────                               ──────────
+TranscribeRequest ────────────────────► transcriber.transcribe()
+                                             │
+oneshot::Receiver ◄──────────────────── oneshot::Sender (Vec<Segment>, lang)
+```
+
+This ensures:
+- GPU inference is never interleaved (one job at a time on the GPU)
+- The async runtime is never blocked by long-running GPU work
+- `WhisperContext` never needs to be `Sync`
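+
+A minimal sketch of this hand-off, using only the standard library. It assumes a hypothetical `TranscribeRequest` that carries its own reply channel, and it substitutes a `std::sync::mpsc` channel for the `oneshot` used in the real code. The "inference" here is a placeholder that only reports the sample count:
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+
+// Hypothetical request type: the audio plus a channel for the reply.
+struct TranscribeRequest {
+    pcm: Vec<f32>,
+    reply: mpsc::Sender<String>,
+}
+
+// Spawns the single thread that would own the inference state.
+fn spawn_gpu_thread() -> mpsc::Sender<TranscribeRequest> {
+    let (tx, rx) = mpsc::channel::<TranscribeRequest>();
+    thread::Builder::new()
+        .name("whisper-gpu".into())
+        .spawn(move || {
+            // The loop ends once every sender has been dropped.
+            for req in rx {
+                // Placeholder for real inference.
+                let text = format!("{} samples", req.pcm.len());
+                // The requester may have gone away; ignore a failed send.
+                let _ = req.reply.send(text);
+            }
+        })
+        .expect("failed to spawn whisper-gpu thread");
+    tx
+}
+
+// Sends one request and waits for its reply (blocking stand-in for `.await`).
+fn transcribe_blocking(tx: &mpsc::Sender<TranscribeRequest>, pcm: Vec<f32>) -> Option<String> {
+    let (reply_tx, reply_rx) = mpsc::channel();
+    tx.send(TranscribeRequest { pcm, reply: reply_tx }).ok()?;
+    reply_rx.recv().ok()
+}
+```
+
+Because the worker thread is the only owner of the inference state, requests are serialised by construction and no lock is needed.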
+
+---
+
+## Job Lifecycle
+
+```
+POST /jobs
+    │
+    ├─ Stream audio to disk → <DATA_DIR>/<uuid>.audio
+    ├─ Create Job{status: Queued} in storage
+    ├─ Pre-create broadcast channel in ProgressRegistry
+    ├─ Send job ID into job_tx
+    └─ Return 202 { job_id }
+
+Worker picks up job_id
+    │
+    ├─ Load job from storage
+    ├─ Mark Running → save
+    ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
+    ├─ detect_silence_midpoints (ffmpeg silencedetect)
+    ├─ snap_to_silence → cut points
+    ├─ to_chunk_ranges → [(start, end), ...]
+    │
+    └─ For each chunk:
+           ├─ slice PCM
+           ├─ trim_trailing_silence
+           ├─ broadcast Progress{percent, chunk, total}
+           ├─ save snapshot to disk
+           ├─ send TranscribeRequest → GPU thread
+           ├─ await oneshot reply
+           ├─ offset timestamps by chunk_start
+           └─ accumulate segments
+    │
+    ├─ renumber segment indices
+    ├─ broadcast Done / Error
+    ├─ save final Job to disk
+    ├─ delete .audio file
+    └─ (optional) fire webhook
+```
+
+---
+
+## Audio Pre-Processing Pipeline
+
+```
+Input file (any format)
+        │
+        ▼
+  ffmpeg silencedetect
+  (runs on original file — captures full dynamic range for silence detection)
+        │
+        ▼
+  Silence midpoints → snap_to_silence() → cut points at ~60s intervals
+        │
+        ▼
+  ffmpeg decode: → 16kHz, mono, f32le PCM
+        │
+        ▼
+  Per chunk:
+    pcm[start_sample..end_sample]
+        │
+        ▼
+  trim_trailing_silence() → removes silence tail, keeps 0.5s padding
+        │
+        ▼
+  WhisperContext::full() → Vec<Segment>
+        │
+        ▼
+  Offset all timestamps += chunk_start_secs
+```
+
+---
+
+## Persistence
+
+Jobs are persisted as pretty-printed JSON files in `DATA_DIR`:
+
+```
+/data/
+  <uuid>.json     ← job state (updated on every progress snapshot)
+  <uuid>.audio    ← raw upload (deleted after transcription)
+```
+
+On startup, `recover_interrupted_jobs()` scans all `.json` files and marks any `Running` jobs as `Failed` (they were killed mid-transcription).
+
+There is no database. The file-per-job approach is intentional: it is trivially inspectable, survives crashes without WAL complexity, and handles thousands of jobs without any extra infrastructure.
+
+---
+
+## Progress & SSE
+
+`ProgressRegistry` is a `DashMap<JobId, broadcast::Sender<ProgressEvent>>`. It is populated when a job is submitted (before the worker starts) so clients connecting early don't miss the first events.
+
+SSE events are typed:
+
+| Event name  | Payload shape |
+|-------------|--------------|
+| `progress`  | `{ "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M }` |
+| `done`      | `{ "type": "done", "job": { ...full Job object... } }` |
+| `error`     | `{ "type": "error", "message": "..." }` |
+
+If a client connects to `/jobs/:id/stream` for an already-finished job, it receives a single `done` event immediately.
+
+The broadcast channel has a buffer of 64 events. A slow SSE client that falls more than 64 events behind is not disconnected: its stream yields `Err(BroadcastStreamRecvError::Lagged(n))`, which the SSE stream adapter filters to `None`, so the missed events are silently skipped.
+
+---
+
+## Webhook
+
+When `webhook_url` is provided at submission, the server POSTs the complete `Job` JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget with exponential backoff: 1s, 2s, 4s, 8s, 16s (max 5 attempts). Failures after all retries are logged and discarded.
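+
+The delay schedule can be sketched as follows. `backoff_delays` is a hypothetical helper, not the function in `webhook.rs`, and it only computes the waits; whether a wait comes before or after each attempt is not specified above:
+
+```rust
+use std::time::Duration;
+
+// Returns the doubling delay schedule: 1s, 2s, 4s, ... one entry per attempt.
+// Assumes `max_attempts` is small (well below 64) so the shift cannot overflow.
+fn backoff_delays(max_attempts: u32) -> Vec<Duration> {
+    (0..max_attempts)
+        .map(|attempt| Duration::from_secs(1u64 << attempt))
+        .collect()
+}
+```
+
+With `max_attempts = 5` this yields the 1s, 2s, 4s, 8s, 16s sequence described above.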
+
+---
+
+## Docker / Build
+
+### Multi-stage Dockerfile
+
+| Stage | Base image | Purpose |
+|-------|-----------|---------|
+| `builder` | `nvidia/cuda:<ver>-cudnn-devel-ubuntu<ver>` | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
+| `runtime` | `nvidia/cuda:<ver>-cudnn-runtime-ubuntu<ver>` | Minimal runtime; only `ffmpeg` and the binary |
+
+`whisper-rs` bundles the whisper.cpp source inside the crate (`~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/`). There is no external clone step, so the whisper.cpp version is fixed by the crate version rather than by the state of the upstream repository at build time.
+
+### CUDA build flags (set as ENV in builder stage)
+```
+GGML_CUDA=ON
+CMAKE_CUDA_ARCHITECTURES=75        # RTX 2080 = Turing = sm_75
+GGML_CUDA_FORCE_MMQ=ON             # matrix-multiply quantized kernels
+GGML_CUDA_GRAPHS=ON                # CUDA graph capture for repeated patterns
+GGML_CUDA_FA_ALL_QUANTS=ON         # flash attention for all quantisation types
+GGML_CUDA_F16=ON                   # half-precision accumulation
+```
+
+### CI (Gitea Actions)
+- Triggers on `push` to `main` and semver tags (`v*`)
+- PRs: build only, no push
+- Tags produced: `latest`, `sha-<short>`, semver components on tags
+- Build cache stored in registry as `:buildcache` tag
+- `CUDA_VERSION` and `UBUNTU_VERSION` overridable via repo Variables
--- a/docs/CODE_STYLE.md
+++ b/docs/CODE_STYLE.md
@@ -0,0 +1,211 @@
+# Code Style
+
+This document describes the conventions used in this codebase. Follow them when adding or modifying code.
+
+---
+
+## Language & Edition
+
+- Rust 2021 edition
+- Stable toolchain only — no nightly features
+- `rustfmt` is installed in the Docker builder; run it before committing
+
+---
+
+## Error Handling
+
+### Use `AppError` everywhere
+All fallible functions in the HTTP layer return `crate::Result<T>` which is `std::result::Result<T, AppError>`.
+
+```rust
+// Good
+pub async fn get_job(...) -> Result<Json<Job>> {
+    let job = state.storage.get(&id).await?;
+    Ok(Json(job))
+}
+
+// Bad — don't use anyhow inside handlers
+pub async fn get_job(...) -> anyhow::Result<...> { ... }
+```
+
+`AppError` has four variants that map cleanly to HTTP status codes:
+
+| Variant | HTTP |
+|---------|------|
+| `NotFound(String)` | 404 |
+| `BadRequest(String)` | 400 |
+| `Conflict(String)` | 409 |
+| `Internal(String)` | 500 |
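+
+A sketch of the mapping in the table, written without Axum so it stands alone. The `status_code` helper is hypothetical; the real code performs this conversion inside its `IntoResponse` implementation:
+
+```rust
+#[derive(Debug)]
+enum AppError {
+    NotFound(String),
+    BadRequest(String),
+    Conflict(String),
+    Internal(String),
+}
+
+impl AppError {
+    // Numeric HTTP status for each variant, matching the table above.
+    fn status_code(&self) -> u16 {
+        match self {
+            AppError::NotFound(_) => 404,
+            AppError::BadRequest(_) => 400,
+            AppError::Conflict(_) => 409,
+            AppError::Internal(_) => 500,
+        }
+    }
+}
+```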
+
+### `anyhow` is only for `main()`
+`anyhow::Result` is used in `main()` for startup failures where a single error type is not needed. It must not leak into handler or worker code.
+
+### Propagate with `?`, map errors explicitly
+Prefer `.map_err(|e| AppError::Internal(format!("context: {e}")))` over `.unwrap()` or `.expect()` in any code path that can be reached at runtime.
+
+```rust
+// Good
+fs::write(&path, payload).await.map_err(|e| {
+    AppError::Internal(format!("failed to write job {}: {e}", job.id))
+})?;
+
+// Bad
+fs::write(&path, payload).await.unwrap();
+```
+
+---
+
+## Async & Concurrency
+
+### Tokio for I/O, std thread for GPU
+- All file I/O, HTTP, ffmpeg subprocess calls: use `tokio::fs`, `tokio::process::Command`
+- GPU inference runs on a dedicated `std::thread` — never spawn a `tokio::task` that blocks on `WhisperContext`
+
+### Don't block the async runtime
+Long CPU-bound or blocking operations must not run on Tokio tasks. Currently the only such operation is whisper inference, which lives on the `whisper-gpu` OS thread and communicates back via `oneshot` channels.
+
+### Channel conventions
+| Channel type | Used for |
+|-------------|---------|
+| `tokio::sync::mpsc::UnboundedSender<JobId>` | HTTP → worker: new job notifications |
+| `std::sync::mpsc::Sender<TranscribeRequest>` | Worker → GPU thread: inference requests |
+| `tokio::sync::oneshot` | GPU thread → worker: inference result |
+| `tokio::sync::broadcast` | Worker → SSE subscribers: progress events |
+
+---
+
+## State Management
+
+`AppState` is `Clone` and passed via Axum's `State` extractor. All fields are either `Arc<T>` or cheap-to-clone types (`mpsc::UnboundedSender` is `Clone`).
+
+```rust
+#[derive(Clone)]
+pub struct AppState {
+    pub job_tx:      mpsc::UnboundedSender<models::JobId>,  // cheap clone
+    pub storage:     Arc<storage::Storage>,                   // shared ref
+    pub progress:    worker::ProgressRegistry,                // Arc<DashMap>
+    pub model_name:  Arc<str>,                                // shared ref
+    pub queue_depth: Arc<std::sync::atomic::AtomicUsize>,    // shared atomic
+    pub gpu_device:  u32,                                     // copy
+}
+```
+
+Never put `Mutex<T>` in `AppState` for hot paths. Prefer atomics (`AtomicUsize`) for counters and `DashMap` for concurrent maps.
+
+---
+
+## Naming
+
+- Types: `UpperCamelCase`
+- Functions / methods / locals: `snake_case`
+- Constants: `SCREAMING_SNAKE_CASE`
+- Module files match the concept they contain (`storage.rs`, `transcriber.rs`)
+
+```rust
+const TARGET_CHUNK_SECS: f32 = 60.0;
+const SNAP_WINDOW_SECS:  f32 = 30.0;
+```
+
+Align related constant definitions vertically with spaces; the compiler doesn't care and the block is easier to scan. Note that `rustfmt` collapses this alignment unless the item carries `#[rustfmt::skip]`.
+
+---
+
+## Comments
+
+Comment *why*, not *what*. The code shows what; the comment explains decisions, non-obvious constraints, and gotchas.
+
+```rust
+// Good — explains the constraint
+// no_context: do not use previous call's transcript as initial prompt.
+// Each silence-chunked audio segment is independent; cross-chunk context
+// would re-anchor the decoder to any hallucinations from the prior chunk.
+fp.set_no_context(true);
+
+// Bad — restates the code
+// set no_context to true
+fp.set_no_context(true);
+```
+
+For **disabled code** (commented-out functionality): always include a reason.
+
+```rust
+// Flash Attention disabled: causes silent 0-segment output on some
+// real-world audio (conference recordings, noisy MP3s).
+// params.flash_attn(true);
+```
+
+---
+
+## Logging
+
+Use `tracing` macros at the appropriate level:
+
+| Level | When |
+|-------|------|
+| `error!` | Unrecoverable failure; always investigated |
+| `warn!` | Recoverable unexpected state (e.g. ffmpeg unavailable, job not found at dequeue) |
+| `info!` | Significant lifecycle events (server start, model load, job queued/done) |
+| `debug!` | Per-chunk detail, useful when diagnosing a specific job |
+| `trace!` | Very high-frequency detail (sample-level, buffer operations) |
+
+Always use structured fields, not string interpolation, so logs are machine-parseable (the subscriber emits JSON):
+
+```rust
+// Good
+tracing::info!(job_id = %id, model = path, "job queued");
+
+// Bad
+tracing::info!("job {} queued, model: {}", id, path);
+```
+
+---
+
+## Serialisation
+
+- All public API types derive `Serialize`/`Deserialize` from `serde`
+- Use `#[serde(rename_all = "snake_case")]` on enums used in JSON responses
+- Use `#[serde(skip_serializing_if = "Option::is_none")]` on optional fields to keep the JSON clean
+- Timestamps are `chrono::DateTime<Utc>` — they serialise to ISO 8601 automatically
+
+---
+
+## OpenAPI Annotations
+
+All route handlers are annotated with `#[utoipa::path(...)]`. Keep the annotations up to date when adding or changing endpoints. The Swagger UI is served at `/docs` in development.
+
+---
+
+## File Structure Conventions
+
+```
+src/
+  main.rs          — entry point only; no business logic
+  models.rs        — pure data types; no logic
+  error.rs         — AppError + IntoResponse
+  storage.rs       — I/O only; no business logic
+  transcriber.rs   — whisper inference only
+  worker.rs        — orchestration: audio pipeline, job lifecycle
+  webhook.rs       — HTTP delivery only
+  routes/
+    mod.rs         — router assembly only
+    jobs.rs        — HTTP handlers for job endpoints
+    health.rs      — HTTP handler for /health
+```
+
+Keep each file focused on a single concern. Business logic lives in `worker.rs`. Data shapes live in `models.rs`. Avoid cross-cutting imports in the wrong direction (e.g. `storage.rs` must not import from `worker.rs`).
+
+---
+
+## Release Profile
+
+```toml
+[profile.release]
+opt-level     = 3
+lto           = "thin"
+codegen-units = 1
+strip         = "symbols"
+```
+
+- `lto = "thin"` is a good balance between link time and binary size/performance
+- `codegen-units = 1` lets LLVM optimise each crate as a single unit, which maximises inlining within the crate (cross-crate inlining is handled by LTO); this matters for hot inference paths
+- `strip = "symbols"` keeps the Docker runtime image small; debug info is not needed in production
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -0,0 +1,217 @@
+# Findings, Quirks & Research Notes
+
+This document records all non-obvious behaviour, surprising bugs, hardware quirks, and research findings discovered during the development and testing of this project. It exists so we don't rediscover the same things twice.
+
+---
+
+## whisper.cpp
+
+### `detect_language=true` is a language-ID-only mode — NOT "auto-detect and transcribe"
+
+**Severity: Critical (was a production regression)**
+
+In `whisper.cpp` (`whisper_full_with_state`):
+```c
+if (params.detect_language) {
+    return 0;  // exits immediately after language detection
+}
+```
+
+Setting `detect_language=true` causes whisper to auto-detect the language, print it to stderr, and then **return 0 without running the decoder**. The result is always 0 segments.
+
+The `whisper-rs` docs suggest this is equivalent to auto-detect — **it is not**.
+
+**Correct API for auto-detect + transcription:**
+```rust
+fp.set_language(None);   // passes language = NULL to whisper.cpp → auto-detects AND transcribes
+```
+
+**Wrong:**
+```rust
+fp.set_detect_language(true);   // language-ID mode only — 0 segments returned
+```
+
+This bug caused every job submitted without an explicit `language=` parameter to return 0 segments after commit `35e7ea8`. Fixed in `6327ffc`.
+
+---
+
+### `no_speech_thold` is not implemented
+
+The `whisper.cpp` header exposes `no_speech_thold` as a parameter, but the source contains a `// TODO: not implemented` comment on the actual check. Calling `fp.set_no_speech_thold(...)` has no effect. Do not rely on it.
+
+---
+
+### `entropy_thold` only fires when `result_len > 32`
+
+whisper's entropy check (which triggers temperature retry on repetitive output) is only evaluated when the segment has more than 32 output tokens. This means:
+
+- Short hallucination loops of 1-2 words (e.g. "kas", "sick", "Bye.") are **never caught**, no matter how low you set the threshold
+- The check is useful for medium-length loops (~9 word phrases have theoretical entropy ≈ log₂(9) ≈ 3.17)
+- Default `entropy_thold=2.4` catches 1-4 unique-token loops; we raised it to 3.5 to also catch 9-word phrase loops
+- The retry schedule uses `temperature_inc`: on failure, whisper retries with temp += 0.2 until temp=1.0
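+
+The entropy figures quoted above can be reproduced with a small sketch. It follows the log₂ convention used in the bullets; whisper.cpp's own computation may use a different logarithm base, so treat the exact thresholds as approximate. `entropy_bits` is a hypothetical helper, not part of `whisper-rs`:
+
+```rust
+use std::collections::HashMap;
+
+// Shannon entropy (in bits) of the token frequency distribution.
+fn entropy_bits(tokens: &[&str]) -> f64 {
+    if tokens.is_empty() {
+        return 0.0;
+    }
+    let mut counts: HashMap<&str, usize> = HashMap::new();
+    for t in tokens {
+        *counts.entry(*t).or_insert(0) += 1;
+    }
+    let total = tokens.len() as f64;
+    counts
+        .values()
+        .map(|&c| {
+            let p = c as f64 / total;
+            -p * p.log2()
+        })
+        .sum()
+}
+```
+
+A loop over nine distinct tokens gives log₂(9), while a single repeated token gives zero.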
+
+---
+
+### `vad_filter=true` causes "Okay." hallucinations
+
+When the VAD (Voice Activity Detection) filter is enabled, whisper silences quiet sections before feeding the audio to the decoder. For conference recordings with audience reactions or low-volume speakers, this causes whisper to fill the resulting void with short filler tokens ("Okay.", "Yeah.", "Thank you.") at ~1s intervals.
+
+**Do not use `vad_filter`** on recordings with ambient audience sound or variable volume speakers.
+
+---
+
+### Flash Attention (`flash_attn=true`) causes 0 segments on some audio
+
+Flash attention is currently disabled (the call is commented out, with the reason noted next to it). When tested on real-world conference recordings (noisy MP3s), it silently returned 0 segments on certain audio windows. The root cause was not fully investigated. Leaving it off is safe; the performance benefit is marginal for large-v3.
+
+---
+
+### `no_context=true` is essential for chunked processing
+
+When `no_context=false` (default), whisper uses the transcript from the previous `full()` call as an initial prompt for the next one. In our pipeline, each chunk is a separate `full()` call. Without `no_context=true`, a hallucinated phrase from chunk N gets fed as a prompt into chunk N+1, poisoning it. This can cascade across the entire transcript.
+
+---
+
+### Timestamps are in centiseconds internally
+
+whisper.cpp returns `t0` and `t1` as integer centiseconds. The conversion to seconds is:
+```rust
+let start = state.full_get_segment_t0(i)? as f32 / 100.0;
+```
+
+This is not documented prominently. The divide-by-100 is critical — omitting it gives timestamps 100× too large.
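+
+A small sketch combining the conversion with the per-chunk offset applied later in the pipeline. `to_absolute_secs` is a hypothetical helper, and it uses `f64` rather than the `f32` shown above to keep long recordings precise:
+
+```rust
+// Converts a whisper.cpp timestamp (centiseconds, relative to the chunk)
+// into seconds relative to the start of the whole recording.
+fn to_absolute_secs(t_centisecs: i64, chunk_start_secs: f64) -> f64 {
+    t_centisecs as f64 / 100.0 + chunk_start_secs
+}
+```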
+
+---
+
+### `full_n_segments_from_state` vs `full_n_segments`
+
+Two versions of this function exist in `whisper-rs`:
+- `full_n_segments_from_state(&state)` — correct; reads from the state created by `create_state()`
+- `full_n_segments(&ctx)` — reads from the context's internal state (used for single-threaded, non-state-based calls)
+
+Since we use `create_state()` + `state.full(fp, pcm)`, always use the `_from_state` variants. Using the wrong variant returns stale or zero results from a previous inference.
+
+---
+
+## CUDA / Hardware
+
+### CUDA device index ordering differs between host and Docker
+
+On the development machine:
+- `nvidia-smi`: GPU 0 = RTX 2080 SUPER (8 GB), GPU 1 = RTX 3060 (12 GB)
+- `whisper.cpp` CUDA on **host**: Device 0 = RTX 3060, Device 1 = RTX 2080 SUPER (**inverted**)
+- `whisper.cpp` CUDA **inside Docker**: Device 0 = RTX 2080 SUPER (**matches nvidia-smi**)
+
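+The inversion on the host occurs because `CUDA_DEVICE_ORDER` is unset there, so the CUDA runtime uses its default `FASTEST_FIRST` ordering, which here places the RTX 3060 first. The Docker image explicitly sets `CUDA_DEVICE_ORDER=PCI_BUS_ID`, which forces the same ordering as `nvidia-smi`.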
+
+**To target RTX 2080 SUPER on host**: `CUDA_DEVICE=1`  
+**Inside Docker**: `CUDA_DEVICE=0`
+
+The `/health` endpoint queries GPU info via `nvidia-smi --id=<device>` which uses the nvidia-smi (PCI_BUS_ID) ordering. When running on the host with `CUDA_DEVICE=1`, the health endpoint correctly reports RTX 2080 SUPER.
+
+---
+
+### RTX 2080 is Turing (sm_75) — not Ampere
+
+The RTX 2080 (non-Super, non-Ti) uses the Turing architecture, compute capability sm_75. This is relevant because:
+- Some CUDA kernels are only compiled for sm_80+ (Ampere) by default
+- `CMAKE_CUDA_ARCHITECTURES=75` must be set explicitly, otherwise the build falls back to a generic/slower kernel or fails
+- `GGML_CUDA_FORCE_MMQ=ON` enables the matrix-multiply-quantized kernels that are Turing-optimised
+
+---
+
+### VRAM allocation: ~5-6 GB for large-v3
+
+The `ggml-large-v3.bin` model occupies approximately 5-6 GB of VRAM on the RTX 2080's 8 GB pool. This leaves roughly 2-3 GB for CUDA workspace, which is sufficient for f16 inference with beam_size=5. Do not run two instances on the same GPU simultaneously.
+
+---
+
+## Audio Processing
+
+### ffmpeg `silencedetect` logs to stderr, not stdout
+
+When running `ffmpeg -i <input> -af silencedetect=n=-35dB:d=0.4 -f null -`, silence events are printed to **stderr** (not stdout), in this format:
+```
+[silencedetect @ 0x...] silence_start: 12.345
+[silencedetect @ 0x...] silence_end: 13.456 | silence_duration: 1.111
+```
+
+The parser must read `output.stderr`, not `output.stdout`.
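+
+A sketch of how the captured stderr text might be turned into silence midpoints. Both function names are hypothetical and the real parser in `worker.rs` may differ; the sketch only assumes the log format shown above:
+
+```rust
+// Extracts the number that follows `key` on a log line, if present.
+fn field(line: &str, key: &str) -> Option<f64> {
+    let rest = &line[line.find(key)? + key.len()..];
+    rest.split_whitespace().next()?.parse().ok()
+}
+
+// Pairs each silence_start with the next silence_end and returns the midpoints.
+fn silence_midpoints(stderr: &str) -> Vec<f64> {
+    let mut midpoints = Vec::new();
+    let mut start: Option<f64> = None;
+    for line in stderr.lines() {
+        if let Some(s) = field(line, "silence_start:") {
+            start = Some(s);
+        } else if let Some(end) = field(line, "silence_end:") {
+            // An end without a preceding start is ignored.
+            if let Some(s) = start.take() {
+                midpoints.push((s + end) / 2.0);
+            }
+        }
+    }
+    midpoints
+}
+```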
+
+---
+
+### Whisper requires exactly 16kHz mono f32 PCM
+
+whisper.cpp's `full()` function expects:
+- Sample rate: exactly 16,000 Hz
+- Channels: 1 (mono)
+- Format: f32 little-endian (values in [-1.0, 1.0])
+
+Deviating from any of these silently produces garbage output. The ffmpeg decode command:
+```
+ffmpeg -i <input> -f f32le -ac 1 -ar 16000 -
+```
+converts any input format to this exactly.
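+
+On the receiving side, the raw bytes from ffmpeg's stdout have to be reinterpreted as samples. A minimal sketch, with hypothetical helper names and assuming the byte stream is exactly the `f32le` output described above:
+
+```rust
+// Reinterprets little-endian f32 bytes as samples; a trailing partial
+// sample (fewer than 4 bytes) is ignored.
+fn f32le_to_samples(bytes: &[u8]) -> Vec<f32> {
+    bytes
+        .chunks_exact(4)
+        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
+        .collect()
+}
+
+// Duration of a 16 kHz mono buffer in seconds.
+fn duration_secs(samples: &[f32]) -> f64 {
+    samples.len() as f64 / 16_000.0
+}
+```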
+
+---
+
+### MP3 is fully supported by ffmpeg → whisper
+
+whisper itself only accepts PCM; it has no MP3 decoder. But since we always decode through ffmpeg first, any format ffmpeg supports (MP3, AAC, FLAC, OGG, WAV, M4A, WEBM, etc.) works as input. There is no codec-level restriction.
+
+---
+
+### Chunking trade-offs
+
+| Chunk size | Pros | Cons |
+|-----------|------|------|
+| 30s | Hallucinations contained to tiny window | Very short proper nouns / spellings get no context |
+| 60s | Good balance; ~2× whisper's native 30s window | Isolated 10s sections (e.g. name spelling) still lack context |
+| 120-180s | Better context for short sections | Hallucinations can corrupt larger content blocks |
+
+Current setting: **60s**. Snap window: **±30s** from the target cut point.
+
+The snap-to-silence algorithm avoids micro-chunks (<5s) and trailing slivers (<25% of target) by stopping early.
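+
+One way the rules above could fit together is sketched below. This is an illustration under stated assumptions, not the algorithm in `worker.rs`: the `MIN_CHUNK_SECS` constant and the function signature are hypothetical, and a cut falls back to the raw target when no silence lies inside the window:
+
+```rust
+const TARGET_CHUNK_SECS: f32 = 60.0;
+const SNAP_WINDOW_SECS: f32 = 30.0;
+const MIN_CHUNK_SECS: f32 = 5.0; // hypothetical name for the micro-chunk limit
+
+// Returns cut points (seconds) for a recording of `duration` seconds.
+fn snap_to_silence(duration: f32, midpoints: &[f32]) -> Vec<f32> {
+    let mut cuts = Vec::new();
+    let mut last = 0.0_f32;
+    loop {
+        let target = last + TARGET_CHUNK_SECS;
+        // Stop early when the remainder would be a sliver (<25% of target).
+        if duration - target < 0.25 * TARGET_CHUNK_SECS {
+            break;
+        }
+        // Nearest silence midpoint inside the snap window that does not
+        // create a micro-chunk; otherwise cut at the target itself.
+        let cut = midpoints
+            .iter()
+            .copied()
+            .filter(|&m| {
+                (m - target).abs() <= SNAP_WINDOW_SECS
+                    && m - last >= MIN_CHUNK_SECS
+                    && m < duration
+            })
+            .min_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()))
+            .unwrap_or(target);
+        cuts.push(cut);
+        last = cut;
+    }
+    cuts
+}
+```
+
+The audio after the last cut becomes the final chunk, so it can run somewhat longer than the target instead of leaving a short trailing sliver.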
+
+---
+
+## Quality Findings
+
+### Quality baseline on 101-minute conference recording (ggml-large-v3)
+
+Reference: human-corrected transcript (12,894 words)
+
+| Metric | Score |
+|--------|-------|
+| WER | 9.3% |
+| Word coverage | 93.1% |
+| 1-gram F1 | 94.9% |
+| 3-gram F1 | 84.7% |
+| 5-gram F1 | 77.5% |
+
+The 3-gram F1 is the most informative single metric for conference transcription: it captures both word accuracy and local phrase fidelity without being overly sensitive to exact phrasing.
+
+### Remaining failure modes
+
+| Pattern | Location | Root cause | Fixable? |
+|---------|----------|-----------|---------|
+| 'KAS' ×12 | ~2801s | Speaker spells "K-A-S"; 60s chunk isolates 10s section with no context | Increase chunk size to 90s |
+| 'sick' ×4 | ~4540s | Single 9s segment; `result_len <= 32` → entropy check skipped | `compression_ratio_thold`? |
+| 'Bye.' ×10 | ~6070s | Speaker says goodbye multiple times at end; trailing silence trim can't help — real audio | Post-processing dedup (declined by user) |
+| 5 content gaps | Various | Chunk windows with noisy/overlapping audio → whisper skips content | Retry at 30s scope |
+
+---
+
+## Rust / Library Quirks
+
+### `whisper-rs` 0.13 bundles whisper.cpp source
+
+`whisper-rs-sys` includes the full whisper.cpp source tree inside the crate. The build is entirely self-contained — no internet access is needed during `cargo build` once the registry cache is warm. The whisper.cpp version is pinned by the crate version; updating whisper.cpp requires bumping the `whisper-rs` dependency.
+
+### `BroadcastStream` silently drops lagged events
+
+`tokio_stream::wrappers::BroadcastStream` yields `Err(BroadcastStreamRecvError::Lagged(n))` when a subscriber falls behind by `n` events. Our SSE adapter filters this to `None`, so the lagged events are silently skipped. Clients that can't keep up with the SSE stream will miss progress events but will still receive the final `done` event (or can poll `GET /jobs/:id`).
+
+### `DashMap` as `ProgressRegistry`
+
+`DashMap<JobId, broadcast::Sender<ProgressEvent>>` provides concurrent map access through sharded locks rather than one global lock. Senders are cleaned up 30 seconds after a job completes (see the `sleep(30s)` + `registry.remove()` in `worker::run`). The 30-second window gives SSE subscribers time to receive the `done` event before the channel is dropped.
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -0,0 +1,358 @@
+# Usage Guide
+
+## Prerequisites
+
+- Docker + NVIDIA Container Toolkit (for GPU access)
+- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
+- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)
+
+---
+
+## Quick Start
+
+### 1. Pull the image
+
+```bash
+docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
+```
+
+### 2. Download a model
+
+```bash
+# large-v3 recommended (~3 GB)
+mkdir -p ~/whisper-models
+curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
+  -o ~/whisper-models/ggml-large-v3.bin
+```
+
+### 3. Start the server
+
+```bash
+docker run --rm --gpus all \
+  -p 8080:8080 \
+  -v ~/whisper-models:/models:ro \
+  -v whisper-data:/data \
+  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
+  git.sal.giize.com/mozempk/whisper-rtx2080:latest
+```
+
+### 4. Verify
+
+```bash
+curl http://localhost:8080/health
+# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}
+```
+
+---
+
+## docker-compose
+
+```bash
+# Copy the compose file, configure volumes, then:
+docker compose up -d
+```
+
+The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.
+
+---
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `PORT` | `8080` | HTTP listen port |
+| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
+| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
+| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
+| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
+| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
+
+### Note on CUDA device ordering
+Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
+
+---
+
+## API Reference
+
+The interactive Swagger UI is available at `http://localhost:8080/docs`.
+
+### `POST /jobs` — Submit a transcription job
+
+Accepts a multipart/form-data body.
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
+| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
+| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
+| `webhook_url` | string | — | URL to POST the completed job to |
+
+**Response:** `202 Accepted`
+```json
+{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
+```
+
+**Example:**
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@/path/to/recording.mp3" \
+  -F "language=en"
+```
+
+Auto-detect language:
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@/path/to/recording.mp3"
+```
+
+With webhook:
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@recording.mp3" \
+  -F "webhook_url=https://myapp.example.com/transcription-done"
+```
+
+---
+
+### `GET /jobs/{id}` — Poll job status
+
+```bash
+curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
+```
+
+**Response while running:**
+```json
+{
+  "id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "running",
+  "task": "transcribe",
+  "progress": 42,
+  "created_at": "2026-05-06T10:00:00Z"
+}
+```
+
+**Response when done:**
+```json
+{
+  "id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "done",
+  "language": "en",
+  "task": "transcribe",
+  "duration_secs": 3720.5,
+  "progress": 100,
+  "created_at": "2026-05-06T10:00:00Z",
+  "completed_at": "2026-05-06T10:12:34Z",
+  "filename": "recording.mp3",
+  "segments": [
+    {
+      "index": 0,
+      "start": 0.0,
+      "end": 4.52,
+      "text": " Hello and welcome to the conference.",
+      "words": [
+        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
+        ...
+      ]
+    },
+    ...
+  ]
+}
+```
+
+**Job statuses:**
+
+| Status | Meaning |
+|--------|---------|
+| `queued` | Waiting for the GPU worker to pick it up |
+| `running` | Being transcribed right now |
+| `done` | Complete; `segments` array is populated |
+| `failed` | Error occurred; `error` field contains the message |
+| `cancelled` | Cancelled via DELETE before or during processing |
+
+---
+
+### `GET /jobs/{id}/stream` — Real-time progress via SSE
+
+Subscribe to a Server-Sent Events stream for live progress updates.
+
+```bash
+curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
+```
+
+**Event types:**
+
+```
+event: progress
+data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}
+
+event: progress
+data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}
+
+event: done
+data: {"type":"done","job":{...full job object...}}
+```
+
+On failure, the stream emits an `error` event:
+
+```
+event: error
+data: {"type":"error","message":"ffmpeg spawn failed: ..."}
+```
+
+- `percent` — overall progress 0–100
+- `chunk` / `chunks_total` — which silence-split chunk is currently being transcribed
+- If you connect after the job has finished, you receive a single `done` event immediately
+
+**JavaScript example:**
+```javascript
+const es = new EventSource(`/jobs/${jobId}/stream`);
+
+es.addEventListener('progress', (e) => {
+  const { percent, chunk, chunks_total } = JSON.parse(e.data);
+  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
+});
+
+es.addEventListener('done', (e) => {
+  const { job } = JSON.parse(e.data);
+  console.log('Transcript:', job.segments.map(s => s.text).join(''));
+  es.close();
+});
+
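+es.addEventListener('error', (e) => {
+  // EventSource also fires a native `error` event (with no `data`) on
+  // connection problems, so only parse events that carry a payload.
+  if (!e.data) {
+    console.error('Stream connection error');
+    return;
+  }
+  const { message } = JSON.parse(e.data);
+  console.error('Failed:', message);
+  es.close();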
+```
+
+---
+
+### `DELETE /jobs/{id}` — Cancel a job
+
+Marks a queued job as cancelled immediately. For a running job, the cancellation is recorded, but the worker checks the flag only after the current whisper.cpp inference call returns, so that call runs to completion first (the server does not abort an inference call that is already in progress).
+
+```bash
+curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
+```
+
+Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).
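+
+A client can tell the two outcomes apart by checking for the `409` status. This is a minimal sketch under assumptions: `cancelJob` and its return values are hypothetical names, the success status code is not specified here (the sketch accepts any 2xx), and a global `fetch` is assumed unless `fetchImpl` is passed.
+
+```javascript
+// Hypothetical helper: cancel a job and report whether the request took effect.
+async function cancelJob(baseUrl, id, fetchImpl = fetch) {
+  const res = await fetchImpl(`${baseUrl}/jobs/${id}`, { method: 'DELETE' });
+  // 409 Conflict: the job was already done, failed, or cancelled.
+  if (res.status === 409) return 'already-finished';
+  if (!res.ok) throw new Error(`DELETE /jobs/${id} returned HTTP ${res.status}`);
+  // A running job may still finish its current inference call before it stops.
+  return 'cancel-requested';
+}
+```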
+
+---
+
+### `GET /health` — Service health
+
+```bash
+curl http://localhost:8080/health
+```
+
+```json
+{
+  "status": "ok",
+  "gpu_name": "NVIDIA GeForce RTX 2080",
+  "vram_total_mb": 8192,
+  "model": "large-v3",
+  "queue_depth": 0
+}
+```
+
+`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).
+
+---
+
+## Output Format
+
+The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):
+
+```json
+{
+  "index": 0,
+  "start": 12.34,
+  "end":   15.78,
+  "text":  " This is a transcribed sentence.",
+  "words": [
+    { "text": " This",         "start": 12.34, "end": 12.56, "probability": 0.97 },
+    { "text": " is",           "start": 12.56, "end": 12.72, "probability": 0.99 },
+    { "text": " a",            "start": 12.72, "end": 12.84, "probability": 0.98 },
+    { "text": " transcribed",  "start": 12.84, "end": 13.40, "probability": 0.95 },
+    { "text": " sentence.",    "start": 13.40, "end": 15.78, "probability": 0.96 }
+  ]
+}
+```
+
+Notes:
+- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
+- `text` typically includes a leading space (whisper's tokenisation convention)
+- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
+- `probability` is the model's confidence for each word token (0–1)
+- All timestamps refer to the timeline of the source audio; no re-mapping occurs
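+
+Because `start` and `end` are absolute seconds, the segments convert directly into subtitle formats. The sketch below produces SRT text; `srtTime` and `segmentsToSrt` are hypothetical helper names and are not provided by the server.
+
+```javascript
+// Format seconds as an SRT timestamp: HH:MM:SS,mmm
+function srtTime(seconds) {
+  const ms = Math.round(seconds * 1000);
+  const pad = (n, width = 2) => String(n).padStart(width, '0');
+  const h = Math.floor(ms / 3600000);
+  const m = Math.floor((ms % 3600000) / 60000);
+  const s = Math.floor((ms % 60000) / 1000);
+  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
+}
+
+// Hypothetical helper: turn a job's `segments` array into SRT subtitle text.
+function segmentsToSrt(segments) {
+  return segments
+    // SRT cues are numbered from 1; trim() drops whisper's leading space.
+    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
+    .join('\n');
+}
+```
+
+Usage: `segmentsToSrt(job.segments)` on a job whose status is `done`.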
+
+---
+
+## Webhook Payload
+
+When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure). Headers: `Content-Type: application/json`.
+
+Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If every attempt fails, the failure is logged and the notification is dropped.
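+
+Since delivery is retried, a receiver may see the same job more than once, for example if its response is lost. The sketch below acknowledges duplicates without processing them twice. It rests on assumptions: `handleWebhook` and `seen` are hypothetical names, the in-memory set does not survive restarts, and the server is assumed to treat a 2xx response as successful delivery.
+
+```javascript
+const http = require('node:http');
+
+// Job IDs already handled, so a re-delivered webhook is acknowledged but not processed twice.
+const seen = new Set();
+
+// Hypothetical handler: returns the HTTP status code to send back for a webhook body.
+function handleWebhook(rawBody, onJob) {
+  let job;
+  try {
+    job = JSON.parse(rawBody);
+  } catch {
+    return 400; // body is not JSON
+  }
+  if (!job || typeof job.id !== 'string') return 400;
+  if (!seen.has(job.id)) {
+    seen.add(job.id);
+    onJob(job); // check job.status (e.g. "done" or "failed") before reading job.segments
+  }
+  return 204;
+}
+
+// Wire the handler into a plain Node HTTP server (not started here).
+const server = http.createServer((req, res) => {
+  let body = '';
+  req.on('data', (chunk) => { body += chunk; });
+  req.on('end', () => {
+    res.statusCode = handleWebhook(body, (job) => console.log(job.id, job.status));
+    res.end();
+  });
+});
+// server.listen(3000); // the port is an arbitrary choice for this sketch
+```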
+
+---
+
+## Building from Source
+
+```bash
+# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
+docker build -t whisper-rtx2080 .
+
+# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
+docker build \
+  --build-arg CUDA_VERSION=11.8.0 \
+  --build-arg CUDNN_TAG=cudnn8 \
+  --build-arg UBUNTU_VERSION=20.04 \
+  -t whisper-rtx2080:cu118 .
+```
+
+Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.
+
+### Build-time ARGs
+
+| ARG | Default | Notes |
+|-----|---------|-------|
+| `CUDA_VERSION` | `12.4.1` | Must match a tag of the `nvidia/cuda` image on Docker Hub |
+| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
+| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |
+
+---
+
+## Working with Audio Files
+
+The server accepts any format ffmpeg understands. To prepare audio manually:
+
+```bash
+# Download YouTube audio
+yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "https://www.youtube.com/watch?v=..."
+
+# Convert to whisper's native format (optional — the server does this automatically)
+ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm
+
+# Submit
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@audio.mp3"
+```
+
+---
+
+## Troubleshooting
+
+### Server returns 0 segments
+- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
+- Verify the audio file is not corrupted: `ffprobe audio.mp3`
+- Check the logs for `whisper.cpp` output: the auto-detected language and its confidence should appear as `info`-level log lines
+
+### Server returns `failed` with ffmpeg error
+- Ensure `ffmpeg` is installed in the container (it is by default)
+- Verify the audio file is a valid media file
+
+### CUDA out-of-memory
+- `ggml-large-v3.bin` requires ~5-6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
+- Check that no other process is consuming VRAM: `nvidia-smi`
+
+### Wrong GPU being used
+- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
+- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)