docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 17s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
233
docs/ARCHITECTURE.md
Normal file
@@ -0,0 +1,233 @@
# Architecture

## Overview

`whisper-server` is a single-binary Rust HTTP server that exposes an asynchronous REST API for speech transcription. It wraps [whisper.cpp](https://github.com/ggerganov/whisper.cpp) (via `whisper-rs`) compiled with CUDA support targeting NVIDIA RTX 2080 (sm_75).

Jobs are submitted as audio files, chunked on silence boundaries, transcribed chunk-by-chunk on the GPU, and returned as timestamped JSON. Progress is streamed to clients in real time over Server-Sent Events (SSE).

---

## Component Map

```
                      HTTP client
                           │
                  ┌────────▼────────┐
                  │   Axum Router   │  (tower-http: CORS, tracing)
                  │  /jobs          │
                  │  /jobs/:id      │
                  │ /jobs/:id/stream│
                  │  /health        │
                  │  /docs (Swagger)│
                  └────────┬────────┘
                           │  AppState (Arc)
              ┌────────────┼────────────────┐
              │            │                │
        ┌─────▼──────┐ ┌───▼────┐   ┌───────▼──────┐
        │  Storage   │ │ ProgReg│   │ job_tx (mpsc)│
        │ (disk JSON)│ │(DashMap)│  └───────┬──────┘
        └─────▲──────┘ └───▲────┘           │
              │            │         ┌──────▼──────────────────────┐
              │            │         │  Tokio worker task (run())  │
              │            │         │  - dequeues job IDs         │
              │            │         │  - decodes audio (ffmpeg)   │
              │            │         │  - runs silencedetect       │
              │            │         │  - chunks PCM               │
              │            │         │  - sends TranscribeRequest  │
              │            │         │  - offsets timestamps       │
              │            │         │  - persists result          │
              │            │         │  - fires webhook            │
              │            └─────────┤                             │
              └───────────────────── └──────┬──────────────────────┘
                                            │  std::sync::mpsc
                                     ┌──────▼───────────────┐
                                     │  whisper-gpu thread  │
                                     │ (OS thread, non-Send)│
                                     │  owns WhisperContext │
                                     │  runs CUDA inference │
                                     └──────────────────────┘
```

---

## Source Files

| File | Responsibility |
|------|---------------|
| `src/main.rs` | Startup: env vars, storage init, worker spawn, router assembly, OpenAPI |
| `src/models.rs` | All data types: `Job`, `Segment`, `Word`, `SsePayload`, `JobStatus` |
| `src/error.rs` | `AppError` enum → HTTP status codes via `IntoResponse` |
| `src/storage.rs` | File-backed job store (one JSON file per job UUID) |
| `src/transcriber.rs` | Owns `WhisperContext`; sets all inference parameters; decodes output |
| `src/worker.rs` | Audio pipeline: silence detection, chunking, progress, job lifecycle |
| `src/webhook.rs` | Fire-and-forget POST with exponential backoff (5 retries) |
| `src/routes/mod.rs` | Router assembly; disables body limit on POST /jobs |
| `src/routes/jobs.rs` | Handlers: submit, get, SSE stream, delete/cancel |
| `src/routes/health.rs` | Health check + GPU info via `nvidia-smi` |

---

## Threading Model

The `WhisperContext` type from `whisper-rs` is `Send` but **not `Sync`**: it can be moved to another thread, but it cannot be shared across threads simultaneously. The design therefore uses a **two-layer concurrency model**:

### Layer 1: Tokio async runtime
All HTTP handling, file I/O, ffmpeg subprocesses, and job lifecycle management run on the Tokio thread pool. This is where async/await is used.

### Layer 2: Dedicated OS thread (`whisper-gpu`)
A single non-async OS thread owns the `WhisperContext` for its entire lifetime. The thread loops on a `std::sync::mpsc::Receiver<TranscribeRequest>`, processes one inference at a time, and sends each result back through a `oneshot::Sender`.

### Communication
```
Tokio task                  std::mpsc                GPU thread
─────────                                            ──────────
TranscribeRequest ────────────────────► transcriber.transcribe()
                                                │
oneshot::Receiver ◄──────────────────── oneshot::Sender (Vec<Segment>, lang)
```

This ensures:
- GPU inference is never interleaved (one job at a time on the GPU)
- The async runtime is never blocked by long-running GPU work
- `WhisperContext` never needs to be `Sync`
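
The request/reply pattern above can be sketched with the standard library alone. This is a minimal illustration, not the server's code: `fake_transcribe` stands in for the real inference call, the `Segment` fields are assumed, and a second `std::sync::mpsc` channel is used for the reply where the server uses a `oneshot` channel.

```rust
use std::sync::mpsc;
use std::thread;

/// Assumed shape of a transcription segment (illustrative only).
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start: f32,
    end: f32,
    text: String,
}

/// One unit of work for the dedicated thread: PCM samples plus a reply channel.
struct TranscribeRequest {
    pcm: Vec<f32>,
    reply: mpsc::Sender<(Vec<Segment>, String)>,
}

/// Hypothetical placeholder for inference: reports the clip length at 16 kHz.
fn fake_transcribe(pcm: &[f32]) -> (Vec<Segment>, String) {
    let secs = pcm.len() as f32 / 16_000.0;
    let seg = Segment { start: 0.0, end: secs, text: format!("{} samples", pcm.len()) };
    (vec![seg], "en".to_string())
}

/// Spawns the single worker thread and returns the sending half of its queue.
fn spawn_gpu_thread() -> mpsc::Sender<TranscribeRequest> {
    let (tx, rx) = mpsc::channel::<TranscribeRequest>();
    thread::Builder::new()
        .name("whisper-gpu".into())
        .spawn(move || {
            // Requests are handled strictly one at a time, in arrival order.
            // The loop ends once every Sender has been dropped.
            for req in rx {
                let result = fake_transcribe(&req.pcm);
                // The requester may have gone away; a closed reply channel is ignored.
                let _ = req.reply.send(result);
            }
        })
        .expect("failed to spawn whisper-gpu thread");
    tx
}

/// Sends one request and blocks until the reply arrives.
/// Returns None if the worker thread is gone.
fn transcribe_blocking(
    tx: &mpsc::Sender<TranscribeRequest>,
    pcm: Vec<f32>,
) -> Option<(Vec<Segment>, String)> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(TranscribeRequest { pcm, reply: reply_tx }).ok()?;
    reply_rx.recv().ok()
}
```

In an async caller the blocking `recv()` would be replaced by awaiting the `oneshot::Receiver`, which is what keeps the runtime threads free while the GPU works.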

---

## Job Lifecycle

```
POST /jobs
  │
  ├─ Stream audio to disk → <DATA_DIR>/<uuid>.audio
  ├─ Create Job{status: Queued} in storage
  ├─ Pre-create broadcast channel in ProgressRegistry
  ├─ Send job ID into job_tx
  └─ Return 202 { job_id }

Worker picks up job_id
  │
  ├─ Load job from storage
  ├─ Mark Running → save
  ├─ decode_audio (ffmpeg → 16kHz mono f32 PCM)
  ├─ detect_silence_midpoints (ffmpeg silencedetect)
  ├─ snap_to_silence → cut points
  ├─ to_chunk_ranges → [(start, end), ...]
  │
  └─ For each chunk:
       ├─ slice PCM
       ├─ trim_trailing_silence
       ├─ broadcast Progress{percent, chunk, total}
       ├─ save snapshot to disk
       ├─ send TranscribeRequest → GPU thread
       ├─ await oneshot reply
       ├─ offset timestamps by chunk_start
       └─ accumulate segments
  │
  ├─ renumber segment indices
  ├─ broadcast Done / Error
  ├─ save final Job to disk
  ├─ delete .audio file
  └─ (optional) fire webhook
```
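
The "offset timestamps" and "renumber segment indices" steps can be illustrated as follows. This is a sketch under assumptions: the `Segment` fields, the 0-based indices, and the helper names `offset_segments`, `renumber`, and `merge_chunks` are hypothetical and not taken from `src/worker.rs`.

```rust
/// Assumed segment shape; the real `Segment` in `src/models.rs` may differ.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    index: usize,
    start: f64,
    end: f64,
    text: String,
}

/// Shifts chunk-relative timestamps onto the timeline of the whole file.
fn offset_segments(segments: &mut [Segment], chunk_start_secs: f64) {
    for seg in segments.iter_mut() {
        seg.start += chunk_start_secs;
        seg.end += chunk_start_secs;
    }
}

/// Assigns consecutive indices (0-based here, an assumption) across all chunks.
fn renumber(segments: &mut [Segment]) {
    for (i, seg) in segments.iter_mut().enumerate() {
        seg.index = i;
    }
}

/// Merges per-chunk results. Each entry pairs a chunk's start time (seconds)
/// with the segments produced for that chunk, in chunk order.
fn merge_chunks(chunks: Vec<(f64, Vec<Segment>)>) -> Vec<Segment> {
    let mut all = Vec::new();
    for (chunk_start, mut segs) in chunks {
        offset_segments(&mut segs, chunk_start);
        all.extend(segs);
    }
    renumber(&mut all);
    all
}
```

Renumbering has to happen after all chunks are merged, because each chunk is transcribed independently and its segment indices restart from the beginning.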

---

## Audio Pre-Processing Pipeline

```
Input file (any format)
        │
        ▼
ffmpeg silencedetect
  (runs on original file: captures full dynamic range for silence detection)
        │
        ▼
Silence midpoints → snap_to_silence() → cut points at ~60s intervals
        │
        ▼
ffmpeg decode: → 16kHz, mono, f32le PCM
        │
        ▼
Per chunk:
  pcm[start_sample..end_sample]
        │
        ▼
  trim_trailing_silence() → removes silence tail, keeps 0.5s padding
        │
        ▼
  WhisperContext::full() → Vec<Segment>
        │
        ▼
  Offset all timestamps += chunk_start_secs
```
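
A rough sketch of the chunking steps follows. The function names mirror the pipeline above, but the signatures, the snapping tolerance, and the amplitude threshold are assumptions made for illustration; the actual logic in `src/worker.rs` may differ.

```rust
const SAMPLE_RATE: usize = 16_000;

/// Chooses cut points roughly every `target_secs`. Each ideal cut is moved to the
/// nearest silence midpoint within `tolerance_secs`; otherwise the ideal time is used.
fn snap_to_silence(duration: f64, silences: &[f64], target_secs: f64, tolerance_secs: f64) -> Vec<f64> {
    assert!(target_secs > 0.0, "target_secs must be positive");
    let mut cuts = Vec::new();
    let mut last = 0.0_f64;
    while last + target_secs < duration {
        let ideal = last + target_secs;
        let cut = silences
            .iter()
            .copied()
            .filter(|&m| m > last && m < duration && (m - ideal).abs() <= tolerance_secs)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()))
            .unwrap_or(ideal);
        cuts.push(cut);
        last = cut;
    }
    cuts
}

/// Turns cut points into contiguous (start, end) ranges covering the whole file.
fn to_chunk_ranges(cuts: &[f64], duration: f64) -> Vec<(f64, f64)> {
    let mut ranges = Vec::with_capacity(cuts.len() + 1);
    let mut start = 0.0;
    for &cut in cuts {
        ranges.push((start, cut));
        start = cut;
    }
    ranges.push((start, duration));
    ranges
}

/// Converts a range in seconds to sample indices, clamped to the buffer length.
fn sample_range(range: (f64, f64), total_samples: usize) -> (usize, usize) {
    let to_idx = |secs: f64| ((secs * SAMPLE_RATE as f64).round() as usize).min(total_samples);
    (to_idx(range.0), to_idx(range.1))
}

/// Drops the silent tail of a chunk but keeps `pad_secs` of padding after the last
/// sample whose magnitude exceeds `threshold`. An all-silent chunk keeps only the padding.
fn trim_trailing_silence(pcm: &[f32], threshold: f32, pad_secs: f64) -> &[f32] {
    let pad = (pad_secs * SAMPLE_RATE as f64) as usize;
    match pcm.iter().rposition(|s| s.abs() > threshold) {
        Some(last_loud) => &pcm[..(last_loud + 1 + pad).min(pcm.len())],
        None => &pcm[..pad.min(pcm.len())],
    }
}
```

For a 150 s file with silence midpoints at 58.0, 95.0, and 121.5 s, a 60 s target, and a 5 s tolerance, this produces cuts at 58.0 and 121.5 s, so no chunk boundary falls in the middle of speech near those points.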

---

## Persistence

Jobs are persisted as pretty-printed JSON files in `DATA_DIR`:

```
/data/
  <uuid>.json    ← job state (updated on every progress snapshot)
  <uuid>.audio   ← raw upload (deleted after transcription)
```

On startup, `recover_interrupted_jobs()` scans all `.json` files and marks any `Running` jobs as `Failed` (they were killed mid-transcription).
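
The recovery rule can be sketched on in-memory values. The example below skips JSON parsing and directory scanning entirely; the `Job` fields, the `Done` variant name, the error message, and the helper names are assumptions for illustration.

```rust
use std::path::{Path, PathBuf};

/// Assumed status set; `Queued`, `Running`, and `Failed` appear in this document,
/// while `Done` is a guess at the name of the success state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Queued,
    Running,
    Done,
    Failed,
}

/// Minimal stand-in for the persisted job record.
#[derive(Debug, Clone, PartialEq)]
struct Job {
    id: String,
    status: JobStatus,
    error: Option<String>,
}

/// Path of the JSON state file for a job: <DATA_DIR>/<uuid>.json
fn job_json_path(data_dir: &Path, id: &str) -> PathBuf {
    data_dir.join(format!("{id}.json"))
}

/// Path of the uploaded audio for a job: <DATA_DIR>/<uuid>.audio
fn job_audio_path(data_dir: &Path, id: &str) -> PathBuf {
    data_dir.join(format!("{id}.audio"))
}

/// Marks every job that was `Running` when the process died as `Failed`.
/// Returns how many jobs were changed so the caller knows what to re-save.
fn recover_interrupted(jobs: &mut [Job]) -> usize {
    let mut recovered = 0;
    for job in jobs.iter_mut().filter(|j| j.status == JobStatus::Running) {
        job.status = JobStatus::Failed;
        job.error = Some("interrupted by server restart".to_string());
        recovered += 1;
    }
    recovered
}
```

Only `Running` jobs are touched, matching the rule stated above; jobs in any other state are left as they were loaded.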

There is no database. The file-per-job approach is intentional: each job file is trivially inspectable, the store survives crashes without WAL complexity, and it handles thousands of jobs with no extra infrastructure.

---

## Progress & SSE

`ProgressRegistry` is a `DashMap<JobId, broadcast::Sender<ProgressEvent>>`. It is populated when a job is submitted (before the worker starts) so clients connecting early don't miss the first events.

SSE events are typed:

| Event name  | Payload shape |
|-------------|--------------|
| `progress`  | `{ "type": "progress", "percent": 0-100, "chunk": N, "chunks_total": M }` |
| `done`      | `{ "type": "done", "job": { ...full Job object... } }` |
| `error`     | `{ "type": "error", "message": "..." }` |

If a client connects to `/jobs/:id/stream` for an already-finished job, it receives a single `done` event immediately.

The broadcast channel has a buffer of 64 events. A slow SSE client that falls further behind than that loses the overwritten events: its receiver yields `Err(RecvError::Lagged)`, which the SSE stream adapter filters to `None`, so the gap is skipped silently.
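
To make the event table concrete, here is a hand-rolled sketch of how each payload maps to an SSE frame on the wire (`event:` line, `data:` line, blank line). The real server presumably relies on its JSON and SSE libraries; the variant fields, `escape_json`, and `to_sse_frame` are illustrative assumptions, and the `done` payload is passed in as already-serialized JSON.

```rust
/// Assumed payload variants, following the table above.
enum SsePayload {
    Progress { percent: u8, chunk: usize, chunks_total: usize },
    /// `job_json` must be compact, single-line JSON: a raw newline would end the
    /// `data:` field early.
    Done { job_json: String },
    Error { message: String },
}

/// Escapes a string for embedding inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders one payload as a complete SSE frame, terminated by a blank line.
fn to_sse_frame(payload: &SsePayload) -> String {
    let (event, data) = match payload {
        SsePayload::Progress { percent, chunk, chunks_total } => (
            "progress",
            format!(
                r#"{{"type":"progress","percent":{percent},"chunk":{chunk},"chunks_total":{chunks_total}}}"#
            ),
        ),
        SsePayload::Done { job_json } => (
            "done",
            format!(r#"{{"type":"done","job":{job_json}}}"#),
        ),
        SsePayload::Error { message } => (
            "error",
            format!(r#"{{"type":"error","message":"{}"}}"#, escape_json(message)),
        ),
    };
    format!("event: {event}\ndata: {data}\n\n")
}
```

Because the event name is carried in the `event:` field, a browser client can register separate `EventSource` listeners for `progress`, `done`, and `error` instead of switching on the `type` field.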

---

## Webhook

When `webhook_url` is provided at submission, the server POSTs the complete `Job` JSON to that URL after the job reaches a terminal state. Delivery is fire-and-forget with exponential backoff: 1s, 2s, 4s, 8s, 16s (max 5 attempts). Failures after all retries are logged and discarded.
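
The retry schedule can be sketched independently of any HTTP client. In this illustration `send` stands in for the POST and `sleep` for the async delay; both are injected so the logic stays testable. It assumes a delay follows every failed attempt, which is one reading of the schedule above, and the names `backoff_delay` and `deliver_with_retry` are hypothetical.

```rust
use std::time::Duration;

/// Delay after the given 0-based attempt: 1s, 2s, 4s, 8s, 16s, ...
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(1u64 << attempt)
}

/// Calls `send` up to `max_attempts` times, waiting with exponential backoff after
/// each failure. Returns the 1-based attempt number that succeeded, or the last error.
fn deliver_with_retry<F, S>(max_attempts: u32, mut send: F, mut sleep: S) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    let mut last_err = String::from("no attempts made");
    for attempt in 0..max_attempts {
        match send() {
            Ok(()) => return Ok(attempt + 1),
            Err(e) => last_err = e,
        }
        // Assumption: a delay follows every failure, including the final one.
        sleep(backoff_delay(attempt));
    }
    Err(last_err)
}
```

With `max_attempts = 5` and a target that never responds, the injected `sleep` sees delays of 1, 2, 4, 8, and 16 seconds before the error is returned for logging.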

---

## Docker / Build

### Multi-stage Dockerfile

| Stage | Base image | Purpose |
|-------|-----------|---------|
| `builder` | `nvidia/cuda:<ver>-cudnn-devel-ubuntu<ver>` | Full CUDA devel + Rust toolchain; compiles whisper.cpp CUDA kernels and the Rust binary |
| `runtime` | `nvidia/cuda:<ver>-cudnn-runtime-ubuntu<ver>` | Minimal runtime; only `ffmpeg` and the binary |

`whisper-rs` bundles the whisper.cpp source inside the crate (`~/.cargo/registry/.../whisper-rs-sys-*/whisper.cpp/`). There is no external clone step, so the whisper.cpp version is pinned by the crate version.

### CUDA build flags (set as ENV in builder stage)
```
GGML_CUDA=ON
CMAKE_CUDA_ARCHITECTURES=75     # RTX 2080 = Turing = sm_75
GGML_CUDA_FORCE_MMQ=ON          # matrix-multiply quantized kernels
GGML_CUDA_GRAPHS=ON             # CUDA graph capture for repeated patterns
GGML_CUDA_FA_ALL_QUANTS=ON      # flash attention for all quantisation types
GGML_CUDA_F16=ON                # half-precision accumulation
```

### CI (Gitea Actions)
- Triggers on `push` to `main` and on semver tags (`v*`)
- PRs: build only, no push
- Image tags produced: `latest`, `sha-<short>`, and semver components for tagged releases
- Build cache stored in the registry under the `:buildcache` tag
- `CUDA_VERSION` and `UBUNTU_VERSION` are overridable via repo Variables