# Usage Guide

## Prerequisites

- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)

---

## Quick Start

### 1. Pull the image

```bash
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 2. Download a model

```bash
# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin
```

### 3. Start the server

```bash
docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest
```

### 4. Verify

```bash
curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}
```

---

## docker-compose

```bash
# Copy the compose file, configure volumes, then:
docker compose up -d
```

The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `8080` | HTTP listen port |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |

### Note on CUDA device ordering

Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.


---

## API Reference

The interactive Swagger UI is available at `http://localhost:8080/docs`.

### `POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
| `webhook_url` | string | — | URL to POST the completed job to |

**Response:** `202 Accepted`

```json
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
```

**Example:**

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"
```

Auto-detect language:

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"
```

With webhook:

```bash
curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
```

---

### `GET /jobs/{id}` — Poll job status

```bash
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

**Response while running:**

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}
```

**Response when done:**

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}
```

**Job statuses:**

| Status | Meaning |
|--------|---------|
| `queued` | Waiting for the GPU worker to pick it up |
| `running` | Being transcribed right now |
| `done` | Complete; `segments` array is populated |
| `failed` | Error occurred; `error` field contains the message |
| `cancelled` | Cancelled via DELETE before or during processing |

---

### `GET /jobs/{id}/stream` — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.

```bash
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
```

**Event types:**

```
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}
```

```
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
```

- `percent` — overall progress 0–100
- `chunk` / `chunks_total` — which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single `done` event immediately

**JavaScript example:**

```javascript
const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

es.addEventListener('error', (e) => {
  // The browser also fires a data-less 'error' event on connection problems,
  // so only parse the payload when the server actually sent one.
  if (e.data) {
    const { message } = JSON.parse(e.data);
    console.error('Failed:', message);
  } else {
    console.error('Stream connection error');
  }
  es.close();
});
```

---

### `DELETE /jobs/{id}` — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).


```bash
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```

Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).

---

### `GET /health` — Service health

```bash
curl http://localhost:8080/health
```

```json
{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0
}
```

`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).

---

## Output Format

The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):

```json
{
  "index": 0,
  "start": 12.34,
  "end": 15.78,
  "text": " This is a transcribed sentence.",
  "words": [
    { "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}
```

Notes:

- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
- `text` typically includes a leading space (whisper's tokenisation convention)
- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
- `probability` is the model's confidence for each word token (0–1)
- All timestamps are on the source audio's timeline — no re-mapping occurs

---

## Webhook Payload

When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure).

Headers: `Content-Type: application/json`.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all retries fail, the error is logged and dropped.

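
A receiver for this payload usually wants to reduce the job to something directly usable. The sketch below is one possible way to do that, assuming the payload has the `status`, `error`, and `segments` fields shown in this guide. The function names `srtTimestamp`, `segmentsToSrt`, and `handleWebhookJob` are hypothetical and not part of the server; the HTTP endpoint that receives the POST and parses the JSON body is left out.

```javascript
// Hypothetical helpers for consuming the webhook body; not part of the server.

// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Turn the `segments` array into SRT cues. Each segment becomes one cue;
// the leading space whisper puts in `text` is trimmed.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

// Decide what to do with a parsed webhook body (the full job object).
function handleWebhookJob(job) {
  if (job.status === 'done') {
    return { ok: true, srt: segmentsToSrt(job.segments || []) };
  }
  // For failed jobs the `error` field carries the message; any other
  // status is reported generically (an assumption of this sketch).
  return { ok: false, error: job.error || `job ended with status ${job.status}` };
}
```

In a receiver, `handleWebhookJob(JSON.parse(body))` would be called once the request body has been read. Because delivery is retried, the handler is best kept idempotent, for example by keying any stored output on the job `id`.
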

---

## Building from Source

```bash
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .
```

Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.

### Build-time ARGs

| ARG | Default | Notes |
|-----|---------|-------|
| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |

---

## Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

```bash
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"
```

---

## Troubleshooting

### Server returns 0 segments

- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`
- Check logs for `whisper.cpp` output: the auto-detected language and confidence should appear as `info` level logs

### Server returns `failed` with ffmpeg error

- Ensure `ffmpeg` is installed in the container (it is by default)
- Verify the audio file is a valid media file

### CUDA out-of-memory

- `ggml-large-v3.bin` requires ~5-6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: `nvidia-smi`

### Wrong GPU being used

- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)