whisper-rtx2080/docs/USAGE.md
mozempk c25e8e7ffb
docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00

# Usage Guide
## Prerequisites
- Docker + NVIDIA Container Toolkit (for GPU access)
- An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
- A Whisper GGML model file (e.g. `ggml-large-v3.bin`)
---
## Quick Start
### 1. Pull the image
```bash
docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest
```
### 2. Download a model
```bash
# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
-o ~/whisper-models/ggml-large-v3.bin
```
### 3. Start the server
```bash
docker run --rm --gpus all \
-p 8080:8080 \
-v ~/whisper-models:/models:ro \
-v whisper-data:/data \
-e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
git.sal.giize.com/mozempk/whisper-rtx2080:latest
```
### 4. Verify
```bash
curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}
```
---
## docker-compose
```bash
# Copy the compose file, configure volumes, then:
docker compose up -d
```
The bundled `docker-compose.yml` mounts named volumes for data and models and sets sane defaults.
---
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `8080` | HTTP listen port |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `DATA_DIR` | `/data` | Directory for job JSON files and temp audio |
| `WHISPER_MODEL_PATH` | `/models/ggml-large-v3.bin` | Absolute path to GGML model file |
| `WHISPER_MODEL` | `large-v3` | Model name reported by `/health` (display only) |
| `CUDA_DEVICE` | `0` | CUDA device index to use for inference |
### Note on CUDA device ordering
Inside Docker, device ordering matches `nvidia-smi` (PCI bus order). On the host without Docker, ordering may differ. See [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker) for details.
---
## API Reference
The interactive Swagger UI is available at `http://localhost:8080/docs`.
### `POST /jobs` — Submit a transcription job
Accepts a multipart/form-data body.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `audio` | file | ✓ | Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit. |
| `language` | string | — | ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect. |
| `task` | string | — | `transcribe` (default) or `translate` (translates to English) |
| `webhook_url` | string | — | URL to POST the completed job to |
**Response:** `202 Accepted`
```json
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }
```
**Example:**
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3" \
-F "language=en"
```
Auto-detect language:
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@/path/to/recording.mp3"
```
With webhook:
```bash
curl -X POST http://localhost:8080/jobs \
-F "audio=@recording.mp3" \
-F "webhook_url=https://myapp.example.com/transcription-done"
```
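
The same request can be made from code. The following is a minimal sketch, assuming Node 18 or later (for the global `fetch`, `FormData`, and `Blob`); `buildJobForm` and `submitJob` are hypothetical helper names, not part of the server. It leaves out `language` when it is unset or empty, so the server falls back to auto-detection.

```javascript
// Sketch only: assumes Node 18+ (global fetch, FormData, Blob).
// `buildJobForm` and `submitJob` are hypothetical helper names.

/** Build the multipart body for POST /jobs, omitting unset optional fields. */
function buildJobForm({ audio, filename, language, task, webhookUrl }) {
  const form = new FormData();
  form.append('audio', new Blob([audio]), filename);
  // Omit `language` entirely to request auto-detection;
  // an empty string is not the same as leaving the field out.
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

/** Submit the job and return the job_id from the 202 response. */
async function submitJob(baseUrl, fields) {
  const res = await fetch(`${baseUrl}/jobs`, {
    method: 'POST',
    body: buildJobForm(fields),
  });
  if (res.status !== 202) throw new Error(`Unexpected status ${res.status}`);
  const { job_id } = await res.json();
  return job_id;
}
```

A call might look like `submitJob('http://localhost:8080', { audio: await fs.promises.readFile('recording.mp3'), filename: 'recording.mp3' })`, with `fs` taken from `node:fs`.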
---
### `GET /jobs/{id}` — Poll job status
```bash
curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```
**Response while running:**
```json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"task": "transcribe",
"progress": 42,
"created_at": "2026-05-06T10:00:00Z"
}
```
**Response when done:**
```json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "done",
"language": "en",
"task": "transcribe",
"duration_secs": 3720.5,
"progress": 100,
"created_at": "2026-05-06T10:00:00Z",
"completed_at": "2026-05-06T10:12:34Z",
"filename": "recording.mp3",
"segments": [
{
"index": 0,
"start": 0.0,
"end": 4.52,
"text": " Hello and welcome to the conference.",
"words": [
{ "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
...
]
},
...
]
}
```
**Job statuses:**
| Status | Meaning |
|--------|---------|
| `queued` | Waiting for the GPU worker to pick it up |
| `running` | Being transcribed right now |
| `done` | Complete; `segments` array is populated |
| `failed` | Error occurred; `error` field contains the message |
| `cancelled` | Cancelled via DELETE before or during processing |
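
A client that polls needs to stop once the job reaches one of the three terminal statuses in the table. A minimal sketch of such a loop, assuming Node 18 or later; `isTerminal` and `pollJob` are hypothetical names, and the 2-second default interval is an arbitrary choice rather than a server recommendation:

```javascript
// Statuses after which a job no longer changes (taken from the table above).
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

/**
 * Poll GET /jobs/{id} until a terminal status is reached, then return the job.
 * `fetchImpl` is injectable so the loop can be exercised without a server.
 */
async function pollJob(baseUrl, jobId, { intervalMs = 2000, fetchImpl = fetch } = {}) {
  for (;;) {
    const res = await fetchImpl(`${baseUrl}/jobs/${jobId}`);
    if (!res.ok) throw new Error(`GET /jobs/${jobId} returned ${res.status}`);
    const job = await res.json();
    if (isTerminal(job.status)) return job;
    // Wait before asking again so the server is not flooded with requests.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The caller still has to inspect `job.status` afterwards: `done` carries `segments`, while `failed` carries `error`.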
---
### `GET /jobs/{id}/stream` — Real-time progress via SSE
Subscribe to a Server-Sent Events stream for live progress updates.
```bash
curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream
```
**Event types:**
```
event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}
event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}
event: done
data: {"type":"done","job":{...full job object...}}
```
```
event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}
```
- `percent`: overall progress, from 0 to 100
- `chunk` / `chunks_total`: which silence-split chunk is currently being transcribed
- If you connect after the job has finished, you receive a single `done` event immediately
**JavaScript example:**
```javascript
const es = new EventSource(`/jobs/${jobId}/stream`);
es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});
es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});
es.addEventListener('error', (e) => {
  // Connection failures also fire 'error', but without a data payload;
  // the browser retries those automatically.
  if (!e.data) {
    console.error('Stream connection error');
    return;
  }
  const { message } = JSON.parse(e.data);
  console.error('Failed:', message);
  es.close();
});
```
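
Where `EventSource` is unavailable, the stream can be read as plain text and split into events by hand. The sketch below follows the basic SSE framing rules (events end with a blank line, `data:` lines are joined with a newline, lines starting with `:` are comments). `parseSseEvents` is a hypothetical helper; it assumes the input contains only complete events and that every `data` payload is JSON, as in the examples above.

```javascript
/**
 * Split raw SSE text into { event, data } objects.
 * Assumes `raw` holds complete events only (no buffering across chunks)
 * and that each data payload is a JSON document.
 */
function parseSseEvents(raw) {
  const events = [];
  for (const block of raw.split(/\r?\n\r?\n/)) {
    let event = 'message'; // default event name when no `event:` line is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === '' || line.startsWith(':')) continue; // blank line or comment
      const colon = line.indexOf(':');
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? '' : line.slice(colon + 1);
      if (value.startsWith(' ')) value = value.slice(1); // strip one leading space
      if (field === 'event') event = value;
      else if (field === 'data') dataLines.push(value);
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```

A real client would also need to buffer partial events between network reads, which this sketch leaves out.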
---
### `DELETE /jobs/{id}` — Cancel a job
Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).
```bash
curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000
```
Returns `409 Conflict` if the job is already in a terminal state (`done`, `failed`, `cancelled`).
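
A client can separate "cancellation accepted" from "job already finished" by looking at the status code. This is a minimal sketch with hypothetical helper names. Only the `409` case is documented above; treating any 2xx response as success is an assumption, since the success code is not stated here.

```javascript
/** Map the HTTP status of DELETE /jobs/{id} to a simple outcome string. */
function cancelOutcome(status) {
  if (status >= 200 && status < 300) return 'cancel-requested'; // assumption: any 2xx means accepted
  if (status === 409) return 'already-terminal'; // documented: done, failed, or cancelled
  return 'error';
}

/** Send the DELETE request; `fetchImpl` is injectable for testing. */
async function cancelJob(baseUrl, jobId, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/jobs/${jobId}`, { method: 'DELETE' });
  const outcome = cancelOutcome(res.status);
  if (outcome === 'error') {
    throw new Error(`DELETE /jobs/${jobId} returned ${res.status}`);
  }
  return outcome;
}
```

For a running job, `cancel-requested` only means the flag was recorded; the job may keep its `running` status until the current inference call returns.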
---
### `GET /health` — Service health
```bash
curl http://localhost:8080/health
```
```json
{
"status": "ok",
"gpu_name": "NVIDIA GeForce RTX 2080",
"vram_total_mb": 8192,
"model": "large-v3",
"queue_depth": 0
}
```
`queue_depth` is the number of jobs waiting to be processed (not counting the one currently running).
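
Scripts that start the container and submit jobs straight away may want to wait for `/health` first. A minimal sketch, assuming Node 18 or later; `isHealthy` and `waitUntilHealthy` are hypothetical names, and the retry count and interval are arbitrary defaults. Whether the server answers at all while the model is still loading is not specified here, so connection errors are simply retried.

```javascript
/** True when the /health body reports an "ok" status. */
function isHealthy(body) {
  return Boolean(body) && body.status === 'ok';
}

/** Retry GET /health until it reports ok or the attempts run out. */
async function waitUntilHealthy(
  baseUrl,
  { attempts = 30, intervalMs = 1000, fetchImpl = fetch } = {},
) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      const res = await fetchImpl(`${baseUrl}/health`);
      if (res.ok) {
        const body = await res.json();
        if (isHealthy(body)) return body;
      }
    } catch (err) {
      // Likely a refused connection while the container is starting; retry.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Service at ${baseUrl} not healthy after ${attempts} attempts`);
}
```

The returned body includes `queue_depth`, which a caller could use to decide whether to submit more work now or later.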
---
## Output Format
The `segments` array in a completed job contains one entry per whisper segment (typically a sentence or clause):
```json
{
"index": 0,
"start": 12.34,
"end": 15.78,
"text": " This is a transcribed sentence.",
"words": [
{ "text": " This", "start": 12.34, "end": 12.56, "probability": 0.97 },
{ "text": " is", "start": 12.56, "end": 12.72, "probability": 0.99 },
{ "text": " a", "start": 12.72, "end": 12.84, "probability": 0.98 },
{ "text": " transcribed", "start": 12.84, "end": 13.40, "probability": 0.95 },
{ "text": " sentence.", "start": 13.40, "end": 15.78, "probability": 0.96 }
]
}
```
Notes:
- `start` / `end` are in seconds (floating point), absolute from the beginning of the input audio
- `text` typically includes a leading space (whisper's tokenisation convention)
- `words` contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
- `probability` is the model's confidence for each word token (0 to 1)
- All timestamps refer to the timeline of the source audio; no re-mapping occurs
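
Because `start` and `end` are absolute seconds, the `segments` array converts directly into subtitle formats. The sketch below produces SubRip (SRT) text; `srtTimestamp` and `segmentsToSrt` are hypothetical helpers, not something the server provides.

```javascript
/** Format seconds as an SRT timestamp: HH:MM:SS,mmm */
function srtTimestamp(seconds) {
  // Round to whole milliseconds first to avoid floating-point drift.
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width = 2) => String(n).padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

/** Convert a job's `segments` array to SubRip (SRT) subtitle text. */
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => {
      const range = `${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}`;
      // trim() drops the leading space that whisper puts on segment text.
      return `${i + 1}\n${range}\n${seg.text.trim()}\n`;
    })
    .join('\n');
}
```

For the segment shown above, `srtTimestamp(12.34)` gives `00:00:12,340` and `srtTimestamp(15.78)` gives `00:00:15,780`.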
---
## Webhook Payload
When a `webhook_url` is provided, the server POSTs the full `Job` JSON to that URL on completion (including on failure). Headers: `Content-Type: application/json`.
Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all attempts fail, the failure is logged and the notification is dropped; it is not retried later.
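
On the receiving side, a webhook endpoint only needs to accept a JSON POST and respond. Here is a minimal receiver sketch using Node's built-in `http` module; `summariseJob` and `createWebhookReceiver` are hypothetical names. The status code the server treats as a successful delivery is not documented here, so replying with `204` is an assumption.

```javascript
const http = require('node:http');

/** Turn a webhook body (the full Job JSON) into a one-line summary. */
function summariseJob(body) {
  const job = JSON.parse(body);
  if (job.status === 'done') {
    const text = (job.segments || []).map((s) => s.text).join('').trim();
    return `job ${job.id} done: ${text}`;
  }
  return `job ${job.id} ${job.status}: ${job.error || 'no error message'}`;
}

/** Create (but do not start) an HTTP server that accepts webhook POSTs. */
function createWebhookReceiver(onSummary) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST') {
      res.writeHead(405).end();
      return;
    }
    let body = '';
    req.setEncoding('utf8');
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        onSummary(summariseJob(body));
        res.writeHead(204).end(); // assumption: a 2xx reply counts as delivered
      } catch (err) {
        res.writeHead(400).end(); // body was not valid JSON
      }
    });
  });
}
```

Calling `createWebhookReceiver(console.log).listen(9000)` would start it on port 9000 (an arbitrary port). Since the server may retry a delivery, a handler with side effects should tolerate receiving the same job more than once.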
---
## Building from Source
```bash
# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .
# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
--build-arg CUDA_VERSION=11.8.0 \
--build-arg CUDNN_TAG=cudnn8 \
--build-arg UBUNTU_VERSION=20.04 \
-t whisper-rtx2080:cu118 .
```
Cross-compiling without a CUDA-capable host is not supported — the build requires `nvcc` to compile the CUDA kernels.
### Build-time ARGs
| ARG | Default | Notes |
|-----|---------|-------|
| `CUDA_VERSION` | `12.4.1` | Must match a tag on `nvidia/cuda` Docker Hub |
| `CUDNN_TAG` | `cudnn` | Use `cudnn8` for CUDA 11.x images |
| `UBUNTU_VERSION` | `22.04` | `20.04` or `22.04` |
---
## Working with Audio Files
The server accepts any format ffmpeg understands. To prepare audio manually:
```bash
# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3
# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm
# Submit
curl -X POST http://localhost:8080/jobs \
-F "audio=@audio.mp3"
```
---
## Troubleshooting
### Server returns 0 segments
- Check that you are **not** setting `language` to an empty string — omit the field entirely for auto-detection
- Verify the audio file is not corrupted: `ffprobe audio.mp3`
- Check logs for `whisper.cpp` output: the auto-detected language and confidence should appear as `info` level logs
### Server returns `failed` with ffmpeg error
- Ensure `ffmpeg` is installed in the container (it is by default)
- Verify the audio file is a valid media file
### CUDA out-of-memory
- `ggml-large-v3.bin` requires ~5-6 GB VRAM. Use `medium` or `small` models on GPUs with less than 8 GB
- Check that no other process is consuming VRAM: `nvidia-smi`
### Wrong GPU being used
- Inside Docker: set `CUDA_DEVICE=0` for the first GPU (nvidia-smi order)
- On host without Docker: device ordering may be inverted; see [FINDINGS.md](FINDINGS.md#cuda-device-index-ordering-differs-between-host-and-docker)