mozempk/whisper-rtx2080

docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-06 10:17:53 +02:00

Usage Guide

Prerequisites

Docker + NVIDIA Container Toolkit (for GPU access)
An NVIDIA GPU — optimised for RTX 2080 (sm_75), but any CUDA-capable GPU works
A Whisper GGML model file (e.g. ggml-large-v3.bin)

Quick Start

1. Pull the image

docker pull git.sal.giize.com/mozempk/whisper-rtx2080:latest

2. Download a model

# large-v3 recommended (~3 GB)
mkdir -p ~/whisper-models
curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin" \
  -o ~/whisper-models/ggml-large-v3.bin

3. Start the server

docker run --rm --gpus all \
  -p 8080:8080 \
  -v ~/whisper-models:/models:ro \
  -v whisper-data:/data \
  -e WHISPER_MODEL_PATH=/models/ggml-large-v3.bin \
  git.sal.giize.com/mozempk/whisper-rtx2080:latest

4. Verify

curl http://localhost:8080/health
# {"status":"ok","gpu_name":"NVIDIA GeForce RTX 2080","vram_total_mb":8192,"model":"large-v3","queue_depth":0}

docker-compose

# Copy the compose file, configure volumes, then:
docker compose up -d

The bundled docker-compose.yml mounts named volumes for data and models and sets sane defaults.

Environment Variables

Variable	Default	Description
`PORT`	`8080`	HTTP listen port
`RUST_LOG`	`info`	Log level: `error`, `warn`, `info`, `debug`, `trace`
`DATA_DIR`	`/data`	Directory for job JSON files and temp audio
`WHISPER_MODEL_PATH`	`/models/ggml-large-v3.bin`	Absolute path to GGML model file
`WHISPER_MODEL`	`large-v3`	Model name reported by `/health` (display only)
`CUDA_DEVICE`	`0`	CUDA device index to use for inference

Note on CUDA device ordering

Inside Docker, device ordering matches nvidia-smi (PCI bus order). On the host without Docker, ordering may differ. See FINDINGS.md for details.

API Reference

The interactive Swagger UI is available at http://localhost:8080/docs.

`POST /jobs` — Submit a transcription job

Accepts a multipart/form-data body.

Field	Type	Required	Description
`audio`	file	✓	Audio file. Any format ffmpeg supports (MP3, WAV, FLAC, AAC, OGG, M4A, WEBM, …). No size limit.
`language`	string	—	ISO 639-1 language code (e.g. `en`, `fr`, `de`). Omit to auto-detect.
`task`	string	—	`transcribe` (default) or `translate` (translates to English)
`webhook_url`	string	—	URL to POST the completed job to

Response: 202 Accepted

{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }

Example:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3" \
  -F "language=en"

Auto-detect language:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@/path/to/recording.mp3"

With webhook:

curl -X POST http://localhost:8080/jobs \
  -F "audio=@recording.mp3" \
  -F "webhook_url=https://myapp.example.com/transcription-done"
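
The same request can be sent from JavaScript. The following is a minimal sketch, assuming a runtime with global `fetch`, `FormData` and `Blob` (Node 18+ or a browser). The helper names `buildJobForm` and `submitJob` are illustrative and not part of the server. Optional fields are left out when unset, so an unset `language` triggers auto-detection instead of sending an empty string.

```javascript
// Build the multipart body for POST /jobs.
// `buildJobForm` and `submitJob` are hypothetical helper names for this sketch.
function buildJobForm(audio, filename, { language, task, webhookUrl } = {}) {
  const form = new FormData();
  form.append('audio', audio, filename);
  // Omit optional fields entirely instead of sending empty strings.
  if (language) form.append('language', language);
  if (task) form.append('task', task);
  if (webhookUrl) form.append('webhook_url', webhookUrl);
  return form;
}

// Submit the form and return the job ID from the 202 Accepted response.
async function submitJob(baseUrl, form) {
  const res = await fetch(`${baseUrl}/jobs`, { method: 'POST', body: form });
  if (res.status !== 202) {
    throw new Error(`Unexpected status ${res.status}`);
  }
  const { job_id } = await res.json();
  return job_id;
}
```

A call might look like `submitJob('http://localhost:8080', buildJobForm(blob, 'recording.mp3', { language: 'en' }))`, where `blob` holds the audio bytes.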

`GET /jobs/{id}` — Poll job status

curl http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Response while running:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "task": "transcribe",
  "progress": 42,
  "created_at": "2026-05-06T10:00:00Z"
}

Response when done:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "language": "en",
  "task": "transcribe",
  "duration_secs": 3720.5,
  "progress": 100,
  "created_at": "2026-05-06T10:00:00Z",
  "completed_at": "2026-05-06T10:12:34Z",
  "filename": "recording.mp3",
  "segments": [
    {
      "index": 0,
      "start": 0.0,
      "end": 4.52,
      "text": " Hello and welcome to the conference.",
      "words": [
        { "text": " Hello", "start": 0.0, "end": 0.68, "probability": 0.98 },
        ...
      ]
    },
    ...
  ]
}

Job statuses:

Status	Meaning
`queued`	Waiting for the GPU worker to pick it up
`running`	Being transcribed right now
`done`	Complete; `segments` array is populated
`failed`	Error occurred; `error` field contains the message
`cancelled`	Cancelled via DELETE before or during processing
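
A client that polls can stop as soon as the job reaches one of the terminal statuses (`done`, `failed`, `cancelled`). The sketch below assumes global `fetch` (Node 18+ or a browser); `isTerminal` and `waitForJob` are hypothetical names, and the 2-second interval is an arbitrary choice.

```javascript
// Statuses after which a job no longer changes (from the table above).
const TERMINAL_STATUSES = new Set(['done', 'failed', 'cancelled']);

function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// Poll GET /jobs/{id} until the job reaches a terminal status.
// The default interval is an assumption of this sketch, not a server requirement.
async function waitForJob(baseUrl, jobId, intervalMs = 2000) {
  for (;;) {
    const res = await fetch(`${baseUrl}/jobs/${jobId}`);
    if (!res.ok) throw new Error(`Unexpected status ${res.status}`);
    const job = await res.json();
    if (isTerminal(job.status)) return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```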

`GET /jobs/{id}/stream` — Real-time progress via SSE

Subscribe to a Server-Sent Events stream for live progress updates.

curl -N http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000/stream

Event types:

event: progress
data: {"type":"progress","percent":23,"chunk":2,"chunks_total":8}

event: progress
data: {"type":"progress","percent":47,"chunk":4,"chunks_total":8}

event: done
data: {"type":"done","job":{...full job object...}}

event: error
data: {"type":"error","message":"ffmpeg spawn failed: ..."}

percent — overall progress 0–100
chunk / chunks_total — which silence-split chunk is currently being transcribed
If you connect after the job has finished, you receive a single done event immediately

JavaScript example:

const es = new EventSource(`/jobs/${jobId}/stream`);

es.addEventListener('progress', (e) => {
  const { percent, chunk, chunks_total } = JSON.parse(e.data);
  console.log(`${percent}% (chunk ${chunk}/${chunks_total})`);
});

es.addEventListener('done', (e) => {
  const { job } = JSON.parse(e.data);
  console.log('Transcript:', job.segments.map(s => s.text).join(''));
  es.close();
});

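es.addEventListener('error', (e) => {
  // Connection errors also fire 'error' but carry no data payload.
  const message = e.data ? JSON.parse(e.data).message : 'connection error';
  console.error('Failed:', message);
  es.close();
});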
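
`EventSource` is a browser API. Where it is not available, the stream can be read with `fetch` and parsed by hand. The following is a rough sketch, assuming Node 18+ (global `fetch`, async-iterable response bodies). It handles only the `event:` and `data:` fields shown under Event types, not the full SSE format, and the names `parseSse` and `streamJob` are hypothetical.

```javascript
// Split buffered SSE text into complete events.
// Returns the parsed events plus any trailing partial event to keep buffering.
function parseSse(buffer) {
  const events = [];
  const blocks = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = blocks.pop(); // the last block may still be incomplete
  for (const block of blocks) {
    let event = 'message';
    const data = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) data.push(line.slice(5).replace(/^ /, ''));
    }
    if (data.length > 0) events.push({ event, data: JSON.parse(data.join('\n')) });
  }
  return { events, rest };
}

// Read the stream until a `done` or `error` event arrives.
async function streamJob(baseUrl, jobId, onProgress) {
  const res = await fetch(`${baseUrl}/jobs/${jobId}/stream`);
  if (!res.ok) throw new Error(`Unexpected status ${res.status}`);
  const decoder = new TextDecoder();
  let buffer = '';
  for await (const chunk of res.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const { events, rest } = parseSse(buffer);
    buffer = rest;
    for (const { event, data } of events) {
      if (event === 'progress') onProgress(data);
      else if (event === 'done') return data.job;
      else if (event === 'error') throw new Error(data.message);
    }
  }
  throw new Error('Stream ended before the job finished');
}
```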

`DELETE /jobs/{id}` — Cancel a job

Marks a queued job as cancelled immediately. For running jobs, the cancellation is recorded but the current whisper.cpp inference call completes before the worker checks the flag (whisper.cpp does not support mid-inference abort).

curl -X DELETE http://localhost:8080/jobs/550e8400-e29b-41d4-a716-446655440000

Returns 409 Conflict if the job is already in a terminal state (done, failed, cancelled).
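
A client can tell "cancellation accepted" apart from "job already finished" by the status code. The sketch below assumes global `fetch`. The success status code is not specified here, so any 2xx response is treated as accepted; that is an assumption. The names `cancelJob` and `interpretCancel` are hypothetical.

```javascript
// Cancel a job. Resolves to true if the server accepted the cancellation,
// false if the job was already in a terminal state (409 Conflict).
async function cancelJob(baseUrl, jobId) {
  const res = await fetch(`${baseUrl}/jobs/${jobId}`, { method: 'DELETE' });
  return interpretCancel(res.status);
}

function interpretCancel(status) {
  if (status === 409) return false; // already done, failed or cancelled
  if (status >= 200 && status < 300) return true; // assumption: any 2xx means accepted
  throw new Error(`Unexpected status ${status}`);
}
```

Because a running job finishes its current inference call before the worker checks the flag, a caller may want to keep polling until the status actually changes to `cancelled`.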

`GET /health` — Service health

curl http://localhost:8080/health

{
  "status": "ok",
  "gpu_name": "NVIDIA GeForce RTX 2080",
  "vram_total_mb": 8192,
  "model": "large-v3",
  "queue_depth": 0
}

queue_depth is the number of jobs waiting to be processed (not counting the one currently running).
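
Scripts that start the container and submit jobs straight away may want to wait for the health endpoint first. This is a minimal sketch, assuming global `fetch` (Node 18+). The retry count and delay are arbitrary, and `isHealthy` and `waitUntilHealthy` are hypothetical names. Only the `"status":"ok"` value shown above is treated as healthy.

```javascript
// True when the /health body reports the "ok" status shown above.
function isHealthy(health) {
  return Boolean(health) && health.status === 'ok';
}

// Retry GET /health until it reports ok or the attempts run out.
async function waitUntilHealthy(baseUrl, { attempts = 30, delayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      const res = await fetch(`${baseUrl}/health`);
      if (res.ok) {
        const health = await res.json();
        if (isHealthy(health)) return health;
      }
    } catch {
      // Server is not accepting connections yet; retry after the delay.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Server not healthy after ${attempts} attempts`);
}
```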

Output Format

The segments array in a completed job contains one entry per whisper segment (typically a sentence or clause):

{
  "index": 0,
  "start": 12.34,
  "end":   15.78,
  "text":  " This is a transcribed sentence.",
  "words": [
    { "text": " This",         "start": 12.34, "end": 12.56, "probability": 0.97 },
    { "text": " is",           "start": 12.56, "end": 12.72, "probability": 0.99 },
    { "text": " a",            "start": 12.72, "end": 12.84, "probability": 0.98 },
    { "text": " transcribed",  "start": 12.84, "end": 13.40, "probability": 0.95 },
    { "text": " sentence.",    "start": 13.40, "end": 15.78, "probability": 0.96 }
  ]
}

Notes:

start / end are in seconds (floating point), absolute from the beginning of the input audio
text typically includes a leading space (whisper's tokenisation convention)
words contains token-level timestamps; may be empty if flash attention is enabled (it is disabled by default)
probability is the model's confidence for each word token (0–1)
All timestamps refer to the timeline of the source audio; no re-mapping occurs
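
Because `start` and `end` are absolute seconds, segments map directly onto subtitle formats. The following sketch converts a `segments` array to SRT text. The helper names `srtTime` and `segmentsToSrt` are illustrative, and the leading space in `text` is trimmed.

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build SRT cues (1-based counter, time range, text), separated by blank lines.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}
```

For example, `srtTime(3720.5)` yields `01:02:00,500`.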

Webhook Payload

When a webhook_url is provided, the server POSTs the full Job JSON to that URL on completion (including on failure). Headers: Content-Type: application/json.

Delivery is attempted up to 5 times with exponential backoff (1s, 2s, 4s, 8s, 16s). If all attempts fail, the error is logged and the webhook is dropped; the job itself remains available via GET /jobs/{id}.
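
On the receiving side, the payload can be handled like any JSON POST. The following is a minimal receiver sketch for Node, assuming the payload has the same fields as the job object returned by GET /jobs/{id}. The port `9000` and the names `handleWebhookBody` and `startReceiver` are hypothetical.

```javascript
// Turn a webhook request body into a short summary.
// Field names follow the job object shown for GET /jobs/{id}.
function handleWebhookBody(body) {
  const job = JSON.parse(body);
  if (job.status === 'done') {
    const text = (job.segments || []).map((s) => s.text).join('').trim();
    return { id: job.id, ok: true, text };
  }
  return { id: job.id, ok: false, error: job.error || job.status };
}

// Minimal receiver on a hypothetical port. The dynamic import keeps the
// sketch usable from both CommonJS and ES module files.
async function startReceiver(port = 9000) {
  const http = await import('node:http');
  const server = http.createServer((req, res) => {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        console.log(handleWebhookBody(body));
        res.writeHead(204).end();
      } catch (err) {
        res.writeHead(400).end(); // body was not valid JSON
      }
    });
  });
  server.listen(port);
  return server;
}
```

Responding quickly and doing any slow work afterwards is a reasonable default, since the sender retries on failure.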

Building from Source

# Build the Docker image locally (requires Docker Buildx + NVIDIA CUDA driver on host)
docker build -t whisper-rtx2080 .

# Custom CUDA version (e.g. for CUDA 11.8 on older drivers)
docker build \
  --build-arg CUDA_VERSION=11.8.0 \
  --build-arg CUDNN_TAG=cudnn8 \
  --build-arg UBUNTU_VERSION=20.04 \
  -t whisper-rtx2080:cu118 .

Cross-compiling without a CUDA-capable host is not supported — the build requires nvcc to compile the CUDA kernels.

Build-time ARGs

ARG	Default	Notes
`CUDA_VERSION`	`12.4.1`	Must match a tag on `nvidia/cuda` Docker Hub
`CUDNN_TAG`	`cudnn`	Use `cudnn8` for CUDA 11.x images
`UBUNTU_VERSION`	`22.04`	`20.04` or `22.04`

Working with Audio Files

The server accepts any format ffmpeg understands. To prepare audio manually:

# Download YouTube audio
yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=..." -o audio.mp3

# Convert to whisper's native format (optional — the server does this automatically)
ffmpeg -i audio.mp3 -f f32le -ac 1 -ar 16000 audio.pcm

# Submit
curl -X POST http://localhost:8080/jobs \
  -F "audio=@audio.mp3"

Troubleshooting

Server returns 0 segments

Check that you are not setting language to an empty string — omit the field entirely for auto-detection
Verify the audio file is not corrupted: ffprobe audio.mp3
Check the logs for whisper.cpp output: the auto-detected language and its confidence should appear as info-level log lines

Server returns `failed` with ffmpeg error

Ensure ffmpeg is installed in the container (it is by default)
Verify the audio file is a valid media file

CUDA out-of-memory

ggml-large-v3.bin requires ~5-6 GB VRAM. On GPUs with less than 8 GB, use a medium or small model and point WHISPER_MODEL_PATH at it
Check that no other process is consuming VRAM: nvidia-smi

Wrong GPU being used

Inside Docker: set CUDA_DEVICE=0 for the first GPU (nvidia-smi order)
On host without Docker: device ordering may be inverted; see FINDINGS.md
