docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00
parent 8fc45ee86f
commit c25e8e7ffb
4 changed files with 1019 additions and 0 deletions
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -0,0 +1,358 @@
+# Usage Guide
+
+## Prerequisites
+
+- Docker + NVIDIA Container Toolkit (for GPU access)
+- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
+- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)
+
+---
+
+## Quick Start
+
+### 1. Pull the image
+
+```bash
+docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
+```
+
+### 2. Download a model
+
+```bash
+# large-v3 recommended (~3 GB)
+mkdir -p ~/whisper-models
+curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
+  -o ~/whisper-models/ggml-large-v3.bin
+```
+
+### 3. Start the server
+
+```bash
+docker run --rm --gpus all \
+  -p 8080:8080 \
+  -v ~/whisper-models:/models:ro \
+  -v whisper-data:/data \
+  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
+  git.sal.giize.com/mozempk/whisper-rtx2080:latest
+```
+
+### 4. Verify
+
+```bash
+curl http://localhost:8080/health
+# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}
+```
+
+---
+
+## docker-compose
+
+```bash
+# Copy the compose file, configure volumes, then:
+docker compose up -d
+```
+
+The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.
+
+---
+
+## Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `PORT` | `8080` | HTTP listen port |
+| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
+| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
+| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
+| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
+| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
+
+### Note on CUDA device ordering
+Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
+
+---
+
+## API Reference
+
+The interactive Swagger UI is available at `http://localhost:8080/docs`.
+
+### `POST /jobs` — Submit a transcription job
+
+Accepts a `multipart/form-data` body.
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
+| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
+| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
+| `webhook_url` | string | — | URL that receives a POST with the full job JSON when the job finishes, including on failure |
+
+**Response:** `202 Accepted`
+```json
+{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
+```
+
+**Example:**
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@/path/to/recording.mp3" \
+  -F "language=en"
+```
+
+Auto-detect language:
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@/path/to/recording.mp3"
+```
+
+With webhook:
+```bash
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@recording.mp3" \
+  -F "webhook_url=https://myapp.example.com/transcription-done"
+```
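The same request can be built in JavaScript. This is a minimal sketch assuming Node 18+ or a browser, where `FormData`, `Blob`, and `fetch` are globals. The field names come from the table above; `buildJobForm` and `submitJob` are hypothetical helper names, not part of the server. Empty optional fields are left out rather than sent as empty strings, which matters for `language` auto-detection.

```javascript
// Build the multipart body for POST /jobs.
// buildJobForm is a hypothetical helper; only the field names are from the docs.
function buildJobForm(audio, filename, { language, task, webhookUrl } = {}) {
  const form = new FormData();
  form.append('audio', audio, filename);
  // Omit empty optional fields entirely instead of sending "".
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

// Submit the form and return the job id from the 202 response.
// fetchImpl is injectable so the function can be exercised without a server.
async function submitJob(baseUrl, form, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/jobs`, { method: 'POST', body: form });
  if (res.status !== 202) throw new Error(`unexpected status ${res.status}`);
  const { job_id } = await res.json();
  return job_id;
}
```

Usage, assuming the audio is already in memory as a `Blob`: `await submitJob('http://localhost:8080', buildJobForm(blob, 'recording.mp3', { language: 'en' }))`.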
+
+---
+
+### `GET /jobs/{id}` — Poll job status
+
+```bash
+curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
+```
+
+**Response while running:**
+```json
+{
+  "id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "running",
+  "task": "transcribe",
+  "progress": 42,
+  "created_at": "2026-05-06T10:00:00Z"
+}
+```
+
+**Response when done:**
+```json
+{
+  "id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "done",
+  "language": "en",
+  "task": "transcribe",
+  "duration_secs": 3720.5,
+  "progress": 100,
+  "created_at": "2026-05-06T10:00:00Z",
+  "completed_at": "2026-05-06T10:12:34Z",
+  "filename": "recording.mp3",
+  "segments": [
+    {
+      "index": 0,
+      "start": 0.0,
+      "end": 4.52,
+      "text": " Hello and welcome to the conference.",
+      "words": [
+        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
+        ...
+      ]
+    },
+    ...
+  ]
+}
+```
+
+**Job statuses:**
+
+| Status | Meaning |
+|--------|---------|
+| `queued` | Waiting for the GPU worker to pick it up |
+| `running` | Being transcribed right now |
+| `done` | Complete; `segments` array is populated |
+| `failed` | Error occurred; `error` field contains the message |
+| `cancelled` | Cancelled via DELETE before or during processing |
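A client that does not use the SSE stream can poll until one of the terminal statuses in the table appears. The sketch below assumes Node 18+ or a browser; `pollJob`, `intervalMs`, `fetchImpl`, and `sleep` are hypothetical names, and the 2-second default interval is an arbitrary choice rather than a server recommendation.

```javascript
// Statuses after which a job no longer changes (taken from the table above).
const TERMINAL = new Set(['done', 'failed', 'cancelled']);

// Poll GET /jobs/{id} until the job reaches a terminal status, then return it.
// fetchImpl and sleep are injectable so the loop can be tested without a server.
async function pollJob(baseUrl, jobId, {
  intervalMs = 2000,
  fetchImpl = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (;;) {
    const res = await fetchImpl(`${baseUrl}/jobs/${jobId}`);
    if (!res.ok) throw new Error(`GET /jobs/${jobId} returned ${res.status}`);
    const job = await res.json();
    if (TERMINAL.has(job.status)) return job;
    await sleep(intervalMs);
  }
}
```

The caller still has to inspect `job.status` afterwards, since `failed` and `cancelled` end the loop just like `done`.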
+
+---
+
+### `GET /jobs/{id}/stream` — Real-time progress via SSE
+
+Subscribe to a Server-Sent Events stream for live progress updates.
+
+```bash
+curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
+```
+
+**Event types:**
+
+```
+event: progress
+data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}
+
+event: progress
+data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}
+
+event: done
+data: {"type":"done","job":{...full job object...}}
+```
+
+```
+event: error
+data: {"type":"error","message":"ffmpeg spawn failed: ..."}
+```
+
+- `percent` — overall progress 0–100
+- `chunk` / `chunks_total` — which silence-split chunk is currently being transcribed
+- If you connect after the job has finished, you receive a single `done` event immediately
+
+**JavaScript example:**
+```javascript
+const es = new EventSource(`/jobs/${jobId}/stream`);
+
+es.addEventListener('progress', (e) => {
+  const { percent, chunk, chunks_total } = JSON.parse(e.data);
+  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
+});
+
+es.addEventListener('done', (e) => {
+  const { job } = JSON.parse(e.data);
+  console.log('Transcript:', job.segments.map(s => s.text).join(''));
+  es.close();
+});
+
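+es.addEventListener('error', (e) => {
+  // Server-sent `error` events carry JSON in e.data; connection-level errors have none.
+  if (e.data) {
+    console.error('Failed:', JSON.parse(e.data).message);
+  } else {
+    console.error('SSE connection error');
+  }
+  es.close();
+});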
+```
+
+---
+
+### `DELETE /jobs/{id}` — Cancel a job
+
+Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).
+
+```bash
+curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
+```
+
+Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).
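A client can tell the two outcomes apart by status code. In this sketch `cancelJob` is a hypothetical helper. Only the `409` case is documented above, so any 2xx response is assumed to mean the cancellation was accepted and anything else is treated as an error.

```javascript
// Cancel a job and report whether the request took effect.
// Assumption: the success status is some 2xx code; the docs only specify 409.
async function cancelJob(baseUrl, jobId, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/jobs/${jobId}`, { method: 'DELETE' });
  if (res.status === 409) return 'already-finished';
  if (res.ok) return 'cancel-requested';
  throw new Error(`DELETE /jobs/${jobId} returned ${res.status}`);
}
```

For a running job, `'cancel-requested'` does not mean the job has stopped yet; keep polling `GET /jobs/{id}` until the status becomes `cancelled`.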
+
+---
+
+### `GET /health` — Service health
+
+```bash
+curl http://localhost:8080/health
+```
+
+```json
+{
+  "status": "ok",
+  "gpu_name": "NVIDIA GeForce RTX 2080",
+  "vram_total_mb": 8192,
+  "model": "large-v3",
+  "queue_depth": 0
+}
+```
+
+`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).
+
+---
+
+## Output Format
+
+The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):
+
+```json
+{
+  "index": 0,
+  "start": 12.34,
+  "end":   15.78,
+  "text":  " This is a transcribed sentence.",
+  "words": [
+    { "text": " This",         "start": 12.34, "end": 12.56, "probability": 0.97 },
+    { "text": " is",           "start": 12.56, "end": 12.72, "probability": 0.99 },
+    { "text": " a",            "start": 12.72, "end": 12.84, "probability": 0.98 },
+    { "text": " transcribed",  "start": 12.84, "end": 13.40, "probability": 0.95 },
+    { "text": " sentence.",    "start": 13.40, "end": 15.78, "probability": 0.96 }
+  ]
+}
+```
+
+Notes:
+- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
+- `text` typically includes a leading space (whisper's tokenisation convention)
+- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
+- `probability` is the model's confidence for each word token (0–1)
+- All timestamps are on the timeline of the original (source-language) audio; no re-mapping occurs
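Because each segment carries absolute `start` / `end` times in seconds, turning a completed job into subtitles is a small formatting step. The sketch below converts a `segments` array to SRT; `srtTime` and `segmentsToSrt` are hypothetical helper names, and the leading space in `text` is trimmed.

```javascript
// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Convert the `segments` array of a completed job into SRT text.
// Cues are numbered from 1 and separated by a blank line.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

Usage: `segmentsToSrt(job.segments)` on the object returned by `GET /jobs/{id}` once `status` is `done`.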
+
+---
+
+## Webhook Payload
+
+When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure). Headers: `Content-Type: application/json`.
+
+Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all attempts fail, the failure is logged and the notification is dropped.
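The retry schedule can be illustrated with a short sketch. This is not the server's implementation; `backoffDelays` and `deliverWithRetry` are hypothetical names. The text above does not say whether the server waits after the final failed attempt, so the sketch assumes it does not.

```javascript
// Delays for exponential backoff: base, 2*base, 4*base, ...
function backoffDelays(count, baseMs = 1000) {
  return Array.from({ length: count }, (_, i) => baseMs * 2 ** i);
}

// Call `send` up to `attempts` times, waiting between failed attempts.
// Assumption: there is no wait after the final failed attempt.
async function deliverWithRetry(send, {
  attempts = 5,
  baseMs = 1000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  const delays = backoffDelays(attempts, baseMs);
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) await sleep(delays[i]);
    }
  }
  throw lastError; // every attempt failed; the caller logs and drops
}
```

With the defaults, `backoffDelays(5)` yields the 1s, 2s, 4s, 8s, 16s sequence in milliseconds. A webhook receiver should respond quickly and tolerate duplicate deliveries, since a slow or failed acknowledgement could trigger another attempt.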
+
+---
+
+## Building from Source
+
+```bash
+# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
+docker build -t whisper-rtx2080 .
+
+# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
+docker build \
+  --build-arg CUDA_VERSION=11.8.0 \
+  --build-arg CUDNN_TAG=cudnn8 \
+  --build-arg UBUNTU_VERSION=20.04 \
+  -t whisper-rtx2080:cu118 .
+```
+
+Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.
+
+### Build-time ARGs
+
+| ARG | Default | Notes |
+|-----|---------|-------|
+| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
+| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
+| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |
+
+---
+
+## Working with Audio Files
+
+The server accepts any format ffmpeg understands. To prepare audio manually:
+
+```bash
+# Download YouTube audio
+yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" "https://www.youtube.com/watch?v=..."
+
+# Convert to whisper's native format (optional — the server does this automatically)
+ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm
+
+# Submit
+curl -X POST http://localhost:8080/jobs \
+  -F "audio=@audio.mp3"
+```
+
+---
+
+## Troubleshooting
+
+### Server returns 0 segments
+- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
+- Verify the audio file is not corrupted: `ffprobe audio.mp3`
+- Check the logs for `whisper.cpp` output: the auto-detected language and its confidence should appear in `info`-level log lines
+
+### Server returns `failed` with ffmpeg error
+- Ensure `ffmpeg` is installed in the container (it is by default)
+- Verify the audio file is a valid media file
+
+### CUDA out-of-memory
+- `ggml-large-v3.bin` requires ~5-6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
+- Check that no other process is consuming VRAM: `nvidia-smi`
+
+### Wrong GPU being used
+- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
+- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)