# Usage Guide

## Prerequisites

- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)

---

## Quick Start

### 1. Pull the image

```bash
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 2. Download a model

```bash
# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin
```

### 3. Start the server

```bash
docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 4. Verify

```bash
curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}
```

---

## docker-compose

```bash
# Copy the compose file, configure volumes, then:
docker compose up -d
```

The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `8080` | HTTP listen port |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
| `IDLE_TIMEOUT_SECS` | `300` | Seconds of idle before the model is automatically unloaded from GPU memory. Set to `0` to disable auto-unload. |
| `GPU_POLL_INTERVAL_SECS` | `30` | Seconds between VRAM-availability retries when a load fails due to insufficient VRAM. |

### Note on CUDA device ordering

Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.

---

## Model Lifecycle Management

The model starts **unloaded** on startup (lazy loading). It is loaded into GPU memory on the first job submission or via `POST /model/load`, and automatically unloaded after `IDLE_TIMEOUT_SECS` of inactivity.

### Model State Machine

```
Unloaded ──(job / POST /model/load)──► Loading ──(success)──► Ready
                                          └──(VRAM full)──► WaitingForGpu ──(retry OK)──► Loading

Ready ──(idle timeout / POST /model/unload)──► Unloaded
WaitingForGpu ──(POST /model/unload)──► Unloaded
```

### `GET /model/status`

Returns the current model state and VRAM statistics.

```bash
curl http://localhost:8080/model/status
```

**When unloaded:**

```json
{ "state": "unloaded" }
```

**When loading:**

```json
{ "state": "loading" }
```

**When ready:**

```json
{
  "state": "ready",
  "loaded_at": "2026-05-10T14:00:00Z",
  "vram_used_mb": 4096,
  "vram_total_mb": 8192
}
```

**When waiting for VRAM:**

```json
{
  "state": "waiting_for_gpu",
  "vram_needed_mb": 3951,
  "vram_free_mb": 512,
  "retry_in_secs": 30
}
```

---

### `POST /model/load`

Request the model to be loaded. Idempotent — if already loading or ready, returns immediately.

```bash
curl -X POST http://localhost:8080/model/load
```

- Returns `202 Accepted` with `{"status":"load_initiated"}` when load is triggered
- Returns `200 OK` with `{"status":"already_ready"}` when model is already ready
- Poll `GET /model/status` or subscribe to `GET /model/events` to know when ready

---

### `POST /model/unload`

Unload the model from GPU memory immediately, freeing VRAM.

```bash
curl -X POST http://localhost:8080/model/unload
```

Returns `200 OK` regardless of current state.
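
The state machine shown earlier in this section can also be written as a small transition table, which is handy when a client wants to predict what an action will do. The sketch below is a client-side model only. The state names match the `state` values returned by `GET /model/status`, but the event names (`load_requested`, `load_succeeded`, `vram_full`, `retry_ok`, `idle_timeout`, `unload_requested`) are hypothetical labels for the arrows in the diagram, and treating unlisted combinations as no-ops is an assumption.

```javascript
// Client-side model of the documented lifecycle (not the server's code).
// Event names are hypothetical labels for the arrows in the diagram.
const TRANSITIONS = {
  unloaded: { load_requested: 'loading' },
  loading: { load_succeeded: 'ready', vram_full: 'waiting_for_gpu' },
  waiting_for_gpu: { retry_ok: 'loading', unload_requested: 'unloaded' },
  ready: { idle_timeout: 'unloaded', unload_requested: 'unloaded' },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns the state after `event`. When the diagram has no arrow for the
// combination, the state is returned unchanged (an assumption of this sketch).
function nextState(state, event) {
  if (!has(TRANSITIONS, state)) {
    throw new Error(`unknown state: ${state}`);
  }
  const row = TRANSITIONS[state];
  return has(row, event) ? row[event] : state;
}

console.log(nextState('unloaded', 'load_requested')); // loading
console.log(nextState('loading', 'vram_full'));       // waiting_for_gpu
```

Keeping the table as data makes it easy to compare against the diagram line by line.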

---

### `GET /model/events` — Model SSE stream

Subscribe to model lifecycle events via Server-Sent Events.

```bash
curl -N http://localhost:8080/model/events
```

**Event types:**

```
event: model_loading
data: {"type":"model_loading"}

event: model_ready
data: {"type":"model_ready","loaded_at":"2026-05-10T14:00:00Z"}

event: model_unloaded
data: {"type":"model_unloaded"}

event: model_waiting_for_gpu
data: {"type":"model_waiting_for_gpu","vram_needed_mb":3951,"vram_free_mb":512,"retry_in_secs":30}
```

**JavaScript example:**

```javascript
const es = new EventSource('/model/events');

es.addEventListener('model_ready', () => {
  console.log('Model loaded — ready to transcribe');
});

es.addEventListener('model_unloaded', () => {
  console.log('Model freed GPU memory');
});
```

---

### Webhooks for model events

When any job is submitted with a `webhook_url`, that URL is registered to receive model lifecycle webhooks for the lifetime of the server process. The following events trigger a webhook POST:

| Event | Fired when |
|-------|-----------|
| `model_ready` | Model finishes loading (after GPU warmup) |
| `model_unloaded` | Model is freed from GPU memory |

**Webhook payload** (`Content-Type: application/json`):

```json
{ "type": "model_ready", "loaded_at": "2026-05-10T14:00:00Z" }

{ "type": "model_unloaded" }
```

Delivery is attempted up to 3 times with exponential backoff (1s, 2s).
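
Because lifecycle webhooks go to the same `webhook_url` that receives the completed job, a receiver has to tell the two payload shapes apart. The following is a minimal sketch, assuming that model events always carry a `type` beginning with `model_` and that job payloads carry `id` and `status` as in the examples in this guide. The function name `classifyWebhook` and its return shape are hypothetical.

```javascript
// Sorts an already-parsed webhook body into one of the payload shapes shown
// in this guide. The `kind` labels are illustrative, not part of the API.
function classifyWebhook(payload) {
  if (payload === null || typeof payload !== 'object') {
    return { kind: 'unknown' };
  }
  // Model lifecycle events, e.g. { "type": "model_ready", "loaded_at": "..." }
  if (typeof payload.type === 'string' && payload.type.startsWith('model_')) {
    return {
      kind: 'model_event',
      event: payload.type,
      loadedAt: payload.loaded_at ?? null,
    };
  }
  // Completed jobs: the full job object, recognised here by `id` + `status`.
  if (typeof payload.id === 'string' && typeof payload.status === 'string') {
    return { kind: 'job', jobId: payload.id, status: payload.status };
  }
  return { kind: 'unknown' };
}

console.log(classifyWebhook({ type: 'model_unloaded' }).kind);    // model_event
console.log(classifyWebhook({ id: 'abc', status: 'done' }).kind); // job
```

In a real receiver this would run on the JSON-decoded request body, with anything classified as `unknown` logged rather than rejected.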

---

### Handling 503 Model Not Ready

When you submit a job and the model is not yet loaded, you receive `503 Service Unavailable` with a `Retry-After` header:

```
HTTP/1.1 503 Service Unavailable
Retry-After: 30
Content-Type: application/json

{
  "error": "model_not_ready",
  "state": "unloaded",
  "retry_after_secs": 30
}
```

| State at rejection | `retry_after_secs` | Meaning |
|---|---|---|
| `unloaded` | 30 | Load was triggered; retry after ~30s |
| `loading` | 10 | Check again in 10s |
| `waiting_for_gpu` | `GPU_POLL_INTERVAL_SECS` | VRAM contention; retry later |

A job rejection when the model is `unloaded` **automatically triggers a load** — you do not need to call `POST /model/load` separately.

**Recommended client pattern:**

```javascript
async function submitWithRetry(formData, maxAttempts = 10) {
  for (let i = 0; i < maxAttempts; i++) {
    const resp = await fetch('/jobs', { method: 'POST', body: formData });
    if (resp.ok) return resp.json();
    if (resp.status === 503) {
      // Fall back to 30s if the header is missing or not a plain number of seconds.
      const retryAfter = Number.parseInt(resp.headers.get('Retry-After') ?? '30', 10) || 30;
      const body = await resp.json();
      console.log(`Model ${body.state} — retrying in ${retryAfter}s`);
      await new Promise(r => setTimeout(r, retryAfter * 1000));
      continue;
    }
    throw new Error(`Submit failed: ${resp.status}`);
  }
  throw new Error('Gave up after max attempts');
}
```

---

## API Reference

The interactive Swagger UI is available at `http://localhost:8080/docs`.

### `POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
| `webhook_url` | string | — | URL to POST the completed job to |

**Response:** `202 Accepted`

```json
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
```

**Example:**

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"
```

Auto-detect language:

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"
```

With webhook:

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
```

---

### `GET /jobs/{id}` — Poll job status

```bash
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

**Response while running:**

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}
```

**Response when done:**

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}
```

**Job statuses:**

| Status | Meaning |
|--------|---------|
| `queued` | Waiting for the GPU worker to pick it up |
| `running` | Being transcribed right now |
| `done` | Complete; `segments` array is populated |
| `failed` | Error occurred; `error` field contains the message |
| `cancelled` | Cancelled via DELETE before or during processing |

---

### `GET /jobs/{id}/stream` — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.
```bash
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
```

**Event types:**

```
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}
```

```
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
```

- `percent` — overall progress 0–100
- `chunk` / `chunks_total` — which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single `done` event immediately

**JavaScript example:**

```javascript
const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

es.addEventListener('error', (e) => {
  // The browser also fires 'error' for connection problems; those events
  // carry no data, so skip them and let EventSource reconnect.
  if (!e.data) return;
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});
```

---

### `DELETE /jobs/{id}` — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).

```bash
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).

---

### `GET /health` — Service health

```bash
curl http://localhost:8080/health
```

```json
{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0,
  "model_state": "ready"
}
```

`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).
`model_state` reflects the current lifecycle state (`unloaded`, `loading`, `waiting_for_gpu`, `ready`).

---

## Output Format

The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):

```json
{
  "index": 0,
  "start": 12.34,
  "end": 15.78,
  "text": " This is a transcribed sentence.",
  "words": [
    { "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}
```

Notes:

- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
- `text` typically includes a leading space (whisper's tokenisation convention)
- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
- `probability` is the model's confidence for each word token (0–1)
- All timestamps are in the source language's timeline — no re-mapping occurs

---

## Webhook Payload

When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure).

Headers: `Content-Type: application/json`.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.

---

## Building from Source

```bash
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.

### Build-time ARGs

| ARG | Default | Notes |
|-----|---------|-------|
| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |

---

## Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

```bash
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"
```

---

## Troubleshooting

### Server returns `503 model_not_ready`

- The model starts unloaded. Call `POST /model/load` explicitly, or just retry the job submission — rejection automatically triggers a load.
- If state is `waiting_for_gpu`, another process is using the GPU's VRAM. The server will retry automatically every `GPU_POLL_INTERVAL_SECS` seconds.
- Monitor `GET /model/status` or subscribe to `GET /model/events` to know when the model is ready.

### Server returns 0 segments

- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`
- Check logs for `whisper.cpp` output: the auto-detected language and confidence should appear as `info` level logs

### Server returns `failed` with ffmpeg error

- Ensure `ffmpeg` is installed in the container (it is by default)
- Verify the audio file is a valid media file

### CUDA out-of-memory

- `ggml-large-v3.bin` requires ~5-6 GB VRAM.
  Use `medium` or `small` models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: `nvidia-smi`

### Wrong GPU being used

- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)
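
The empty-`language` pitfall listed under "Server returns 0 segments" is easy to avoid on the client by appending only those optional fields that have a value. The following is a minimal sketch, assuming a runtime with the standard `FormData` API (browsers, Node 18+). The helper names `optionalFields` and `buildJobForm` are hypothetical; the field names come from the `POST /jobs` table.

```javascript
// Returns [name, value] pairs for the optional POST /jobs fields, skipping
// anything unset, empty, or whitespace-only, so `language` is omitted rather
// than sent as "".
function optionalFields({ language, task, webhookUrl } = {}) {
  const candidates = [
    ['language', language],
    ['task', task],
    ['webhook_url', webhookUrl],
  ];
  return candidates.filter(
    ([, value]) => typeof value === 'string' && value.trim() !== ''
  );
}

// Builds the multipart body; `audio` is expected to be a File or Blob.
function buildJobForm(audio, options) {
  const form = new FormData();
  form.append('audio', audio);
  for (const [name, value] of optionalFields(options)) {
    form.append(name, value);
  }
  return form;
}

// With an empty language, only the `task` pair is kept.
console.log(optionalFields({ language: '', task: 'translate' }));
```

The resulting form can be passed straight to `fetch('/jobs', { method: 'POST', body: form })`.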