mozempk/whisper-rtx2080

fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding

GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-06 11:57:30 +02:00

Findings, Quirks & Research Notes

This document records all non-obvious behaviour, surprising bugs, hardware quirks, and research findings discovered during the development and testing of this project. It exists so we don't rediscover the same things twice.

Cold GPU: first job returns 0 segments in ~0.5s after container restart

Symptom: After container restart, the first submitted job completes in ~0.5s and returns 0 segments. Language is detected correctly. All subsequent jobs work fine.

Root cause: CUDA JIT-compiles its device kernels on the first call to whisper_full_with_state. On a cold GPU, this compilation happens synchronously mid-inference and disrupts the decode pipeline, causing it to return immediately with 0 results.

Why subsequent jobs are fine: Compiled kernels are cached in the CUDA driver for the lifetime of the process. Once the first (warmup) call completes, all further calls use the cached compiled kernels.

Why language detection can succeed on the same call: Language detection uses a mel-spectrogram + encoder pass on the first 30s of audio. These lighter kernels may compile faster or be partially cached, while the full decoder kernels (the heavier path) are what cause the failure.

Fix (in Transcriber::load()):

let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz — just enough to trigger kernel compilation
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
wp.set_language(Some("en"));
wp.set_print_progress(false);
let _ = state.full(wp, &silence); // 0 segments expected; side-effect is the goal
tracing::info!("GPU warmup complete");

Also fixed simultaneously: create_state() was called per-chunk (~700 MB GPU allocation each time), causing VRAM churn under concurrent processes. State is now created once and reused. See WhisperState reuse section above.

`language=auto` is not a valid API parameter

Passing language=auto in the multipart form is accepted without error but is incorrect. The language field expects an ISO 639-1 code (e.g. en, fr) or should be omitted entirely for auto-detection. Passing "auto" causes whisper-rs to forward the string "auto" as a language code, which whisper.cpp does not recognise, and it may fall back in undefined ways.

Correct usage:

Auto-detect: omit the language field entirely
Explicit: language=en
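
The rule above can be enforced at the API boundary. The following is a minimal sketch, not the project's actual handler: `normalize_language` is a hypothetical helper, and treating an empty string the same as an omitted field is an assumption. It only checks the shape of the code (two ASCII letters); it does not verify that whisper supports the language.

```rust
/// Hypothetical helper: maps the raw multipart `language` field to the value
/// that would be passed to `set_language`. `Ok(None)` means "auto-detect".
pub fn normalize_language(field: Option<&str>) -> Result<Option<String>, String> {
    let raw = match field.map(str::trim) {
        // Field omitted (or empty, by assumption): let whisper auto-detect.
        None | Some("") => return Ok(None),
        Some(v) => v,
    };
    // Reject the pseudo-value explicitly so callers get a clear error.
    if raw.eq_ignore_ascii_case("auto") {
        return Err("language=auto is not valid; omit the field to auto-detect".to_string());
    }
    // Shape check only: exactly two ASCII letters, normalised to lowercase.
    if raw.len() == 2 && raw.chars().all(|c| c.is_ascii_alphabetic()) {
        Ok(Some(raw.to_ascii_lowercase()))
    } else {
        Err(format!("language must be an ISO 639-1 code, got {raw:?}"))
    }
}
```

The returned `Option<String>` maps directly onto the two correct usages listed above: `None` for auto-detect, `Some("en")` for an explicit language.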

`detect_language=true` is a language-ID-only mode — NOT "auto-detect and transcribe"

Severity: Critical (was a production regression)

In whisper.cpp (whisper_full_with_state):

if (params.detect_language) {
    return 0;  // exits immediately after language detection
}

Setting detect_language=true causes whisper to auto-detect the language, print it to stderr, and then return 0 without running the decoder. The result is always 0 segments.

The whisper-rs docs suggest this is equivalent to auto-detect — it is not.

Correct API for auto-detect + transcription:

fp.set_language(None);   // passes language = NULL to whisper.cpp → auto-detects AND transcribes

Wrong:

fp.set_detect_language(true);   // language-ID mode only — 0 segments returned

This bug caused every job submitted without an explicit language= parameter to return 0 segments after commit 35e7ea8. Fixed in 6327ffc.

`no_speech_thold` is not implemented

The whisper.cpp header exposes no_speech_thold as a parameter, but the source contains a // TODO: not implemented comment on the actual check. Calling fp.set_no_speech_thold(...) has no effect. Do not rely on it.

`entropy_thold` only fires when `result_len > 32`

Whisper's entropy check (which triggers a temperature retry on repetitive output) is evaluated only when the segment has more than 32 output tokens. This means:

Short hallucination loops of 1-2 words (e.g. "kas", "sick", "Bye.") are never caught, whatever value the threshold is set to
The check is useful for medium-length loops (a loop over ~9 distinct words has a theoretical entropy of about log₂(9) ≈ 3.17)
The default entropy_thold=2.4 catches loops of 1-4 unique tokens; we raised it to 3.5 to also catch 9-word phrase loops
The retry schedule uses temperature_inc: on failure, whisper retries with temp += 0.2 until temp reaches 1.0
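
To make the numbers above concrete, here is a small sketch of the gating logic as described in this section. It is an illustration, not whisper.cpp's code: the function names are hypothetical, token IDs are plain `u32` values, and the entropy is computed in bits (log₂) only to match the figures quoted above. Whether whisper.cpp uses the same logarithm base is not confirmed here and is worth checking against the source before tuning thresholds.

```rust
use std::collections::HashMap;

/// Shannon entropy, in bits, of the token distribution over the last
/// `window` tokens. Returns 0.0 for an empty input.
pub fn tail_entropy_bits(tokens: &[u32], window: usize) -> f64 {
    let start = tokens.len().saturating_sub(window);
    let tail = &tokens[start..];
    if tail.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tail {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tail.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Mirrors the rule described above: the check applies only to segments with
/// more than 32 tokens, and fires when the entropy is below the threshold.
pub fn entropy_check_fires(tokens: &[u32], thold: f64) -> bool {
    tokens.len() > 32 && tail_entropy_bits(tokens, 32) < thold
}
```

Under these assumptions, a 45-token loop over 9 distinct tokens passes a threshold of 2.4 but is flagged at 3.5, while a 10-token loop of a single repeated token is never flagged because it fails the length gate.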

`vad_filter=true` causes "Okay." hallucinations

When the VAD (Voice Activity Detection) filter is enabled, whisper silences quiet sections before feeding the audio to the decoder. For conference recordings with audience reactions or low-volume speakers, this causes whisper to fill the resulting void with short filler tokens ("Okay.", "Yeah.", "Thank you.") at ~1s intervals.

Do not use vad_filter on recordings with ambient audience sound or variable-volume speakers.

Flash Attention (`flash_attn=true`) causes 0 segments on some audio

Flash attention is disabled in the code, with a comment. When tested on real-world conference recordings (noisy MP3s), it silently returned 0 segments on certain audio windows. The root cause was not fully investigated. It is safe to leave disabled; the performance benefit is marginal for large-v3.

`no_context=true` is essential for chunked processing

When no_context=false (default), whisper uses the transcript from the previous full() call as an initial prompt for the next one. In our pipeline, each chunk is a separate full() call. Without no_context=true, a hallucinated phrase from chunk N gets fed as a prompt into chunk N+1, poisoning it. This can cascade across the entire transcript.

Timestamps are in centiseconds internally

whisper.cpp returns t0 and t1 as integer centiseconds. The conversion to seconds is:

let start = state.full_get_segment_t0(i)? as f32 / 100.0;

This is not documented prominently. The divide-by-100 is critical — omitting it gives timestamps 100× too large.
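
As an illustration of the unit handling, the sketch below converts centisecond values to seconds and formats them for display. Both helpers are hypothetical (they are not part of whisper-rs), and using `i64`/`f64` rather than the `f32` shown above is a choice made for this example only.

```rust
/// Converts a whisper centisecond timestamp to seconds.
pub fn centis_to_secs(t: i64) -> f64 {
    t as f64 / 100.0
}

/// Formats a centisecond timestamp as `HH:MM:SS.cc` using integer math only,
/// so no floating-point rounding is involved. Negative input is clamped to 0.
pub fn format_centis(t: i64) -> String {
    let t = t.max(0);
    let (cs, total_secs) = (t % 100, t / 100);
    let (s, total_mins) = (total_secs % 60, total_secs / 60);
    let (m, h) = (total_mins % 60, total_mins / 60);
    format!("{h:02}:{m:02}:{s:02}.{cs:02}")
}
```

For example, the raw value 280100 corresponds to 2801 seconds, which formats as 00:46:41.00.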

`full_n_segments_from_state` vs `full_n_segments`

Two versions of this function exist in whisper-rs:

full_n_segments_from_state(&state) — correct; reads from the state created by create_state()
full_n_segments(&ctx) — reads from the context's internal state (used for single-threaded, non-state-based calls)

Since we use create_state() + state.full(fp, pcm), always use the _from_state variants. Using the wrong variant returns stale or zero results from a previous inference.

CUDA / Hardware

CUDA device index ordering differs between host and Docker

On the development machine:

nvidia-smi: GPU 0 = RTX 2080 SUPER (8 GB), GPU 1 = RTX 3060 (12 GB)
whisper.cpp CUDA on host: Device 0 = RTX 3060, Device 1 = RTX 2080 SUPER (inverted)
whisper.cpp CUDA inside Docker: Device 0 = RTX 2080 SUPER (matches nvidia-smi)

The inversion on the host is caused by CUDA_DEVICE_ORDER not being set to PCI_BUS_ID. The Docker image explicitly sets CUDA_DEVICE_ORDER=PCI_BUS_ID, which forces the expected ordering.

To target RTX 2080 SUPER on host: CUDA_DEVICE=1
Inside Docker: CUDA_DEVICE=0

The /health endpoint queries GPU info via nvidia-smi --id=<device> which uses the nvidia-smi (PCI_BUS_ID) ordering. When running on the host with CUDA_DEVICE=1, the health endpoint correctly reports RTX 2080 SUPER.

RTX 2080 is Turing (sm_75) — not Ampere

The RTX 2080 (non-Super, non-Ti) uses the Turing architecture, compute capability 7.5 (sm_75); the RTX 2080 SUPER in the development machine is Turing sm_75 as well. This is relevant because:

Some CUDA kernels are only compiled for sm_80+ (Ampere) by default
CMAKE_CUDA_ARCHITECTURES=75 must be set explicitly, otherwise the build falls back to a generic/slower kernel or fails
GGML_CUDA_FORCE_MMQ=ON enables the matrix-multiply-quantized kernels that are Turing-optimised

VRAM allocation: ~5 GB for large-v3

The ggml-large-v3.bin model occupies approximately 5-6 GB of VRAM on the RTX 2080's 8 GB pool. This leaves roughly 2-3 GB for CUDA workspace, which is sufficient for f16 inference with beam_size=5. Do not run two instances on the same GPU simultaneously.

Audio Processing

ffmpeg `silencedetect` logs to stderr, not stdout

When running ffmpeg -af silencedetect=n=-35dB:d=0.4 -f null -, silence events are printed to stderr (not stdout), in this format:

[silencedetect @ 0x...] silence_start: 12.345
[silencedetect @ 0x...] silence_end: 13.456 | silence_duration: 1.111

The parser must read output.stderr, not output.stdout.
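
A minimal parsing sketch for the log format shown above follows. The `Silence` struct and function names are hypothetical and do not reflect the project's actual parser; the only assumption about ffmpeg's output is the two line shapes quoted above. A `silence_start` with no matching `silence_end` is dropped.

```rust
/// One silence interval reported by ffmpeg's silencedetect filter, in seconds.
#[derive(Debug, Clone, PartialEq)]
pub struct Silence {
    pub start: f64,
    pub end: f64,
}

/// Extracts the number that follows `key` on a log line, if present.
fn value_after(line: &str, key: &str) -> Option<f64> {
    let rest = &line[line.find(key)? + key.len()..];
    rest.split_whitespace().next()?.parse().ok()
}

/// Parses captured stderr text into silence intervals, pairing each
/// `silence_start` with the next `silence_end`. Other log lines are ignored.
pub fn parse_silences(stderr: &str) -> Vec<Silence> {
    let mut out = Vec::new();
    let mut pending_start: Option<f64> = None;
    for line in stderr.lines().filter(|l| l.contains("silencedetect")) {
        if let Some(s) = value_after(line, "silence_start:") {
            pending_start = Some(s);
        } else if let Some(e) = value_after(line, "silence_end:") {
            if let Some(s) = pending_start.take() {
                out.push(Silence { start: s, end: e });
            }
        }
    }
    out
}
```

When ffmpeg is run through `std::process::Command::output()`, the text to pass in would be `String::from_utf8_lossy(&output.stderr)`.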

Whisper requires exactly 16kHz mono f32 PCM

whisper.cpp's full() function expects:

Sample rate: exactly 16,000 Hz
Channels: 1 (mono)
Format: f32 little-endian (values in [-1.0, 1.0])

Deviating from any of these silently produces garbage output. The ffmpeg decode command:

ffmpeg -i <input> -f f32le -ac 1 -ar 16000 -

converts any input format to exactly this layout.
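
On the receiving side, the raw bytes from ffmpeg's stdout have to be reinterpreted as `f32` samples. The sketch below shows one way to do that with the standard library only; the function names are hypothetical, and rejecting input whose length is not a multiple of 4 bytes is a design choice for this example.

```rust
/// Decodes raw `f32le` bytes (as written by `ffmpeg -f f32le`) into samples.
/// Returns an error if the byte count is not a multiple of 4.
pub fn f32le_to_samples(bytes: &[u8]) -> Result<Vec<f32>, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!(
            "truncated PCM: {} bytes is not a multiple of 4",
            bytes.len()
        ));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}

/// Duration in seconds, assuming the 16 kHz mono layout described above.
pub fn duration_secs(samples: &[f32]) -> f64 {
    samples.len() as f64 / 16_000.0
}
```

Using `from_le_bytes` keeps the decode correct regardless of the host's native endianness.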

MP3 is fully supported by ffmpeg → whisper

whisper itself only accepts PCM; it has no MP3 decoder. But since we always decode through ffmpeg first, any format ffmpeg supports (MP3, AAC, FLAC, OGG, WAV, M4A, WEBM, etc.) works as input. There is no codec-level restriction.

Chunking trade-offs

| Chunk size | Pros | Cons |
| --- | --- | --- |
| 30s | Hallucinations contained to tiny window | Very short proper nouns / spellings get no context |
| 60s | Good balance; ~2× whisper's native 30s window | Isolated 10s sections (e.g. name spelling) still lack context |
| 120-180s | Better context for short sections | Hallucinations can corrupt larger content blocks |

Current setting: 60s. Snap window: ±30s from the target cut point.

The snap-to-silence algorithm avoids micro-chunks (<5s) and trailing slivers (<25% of target) by stopping early.
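
The sketch below is one possible reading of the behaviour described above, not the project's implementation. `plan_cuts` is a hypothetical function; the inputs (`silences` as candidate cut times in seconds, e.g. silence midpoints) and the exact tie-breaking are assumptions. It snaps each cut to the silence nearest the target within the window, rejects candidates that would produce a chunk under 5s or leave a tail under 25% of the target, and stops early when the remainder past the next target would be a sliver.

```rust
/// Hypothetical chunk planner. Returns cut points in seconds, strictly inside
/// (0, total). The final chunk runs from the last cut to `total`.
pub fn plan_cuts(total: f64, target: f64, window: f64, silences: &[f64]) -> Vec<f64> {
    const MIN_CHUNK: f64 = 5.0; // micro-chunk limit from the text above
    let mut cuts = Vec::new();
    if target <= 0.0 {
        return cuts;
    }
    let sliver = 0.25 * target; // trailing-sliver limit from the text above
    let mut start = 0.0_f64;
    loop {
        let ideal = start + target;
        // Stop early: if the audio left after the ideal cut would be a
        // sliver, let the current chunk run to the end instead.
        if total - ideal < sliver {
            break;
        }
        // Choose the silence closest to the ideal cut within the snap window,
        // skipping candidates that create a micro-chunk or a trailing sliver.
        let cut = silences
            .iter()
            .copied()
            .filter(|&s| (s - ideal).abs() <= window)
            .filter(|&s| s - start >= MIN_CHUNK && total - s >= sliver)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()))
            .unwrap_or(ideal); // no usable silence: cut at the target itself
        cuts.push(cut);
        start = cut;
    }
    cuts
}
```

With the current settings this would be called as `plan_cuts(total, 60.0, 30.0, &silences)`. One consequence of stopping early is that the last chunk can be somewhat longer than the target.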

Quality Findings

Quality baseline on 101-minute conference recording (ggml-large-v3)

Reference: human-corrected transcript (12,894 words)

| Metric | Score |
| --- | --- |
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |

The 3-gram F1 is the most informative single metric for conference transcription: it captures both word accuracy and local phrase fidelity without being overly sensitive to exact phrasing.
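
This document does not state how the n-gram F1 scores were computed, so the following is only a sketch of one common definition: F1 over word n-grams with clipped (multiset) overlap. The function names are hypothetical, and text normalisation (case folding, punctuation stripping) is deliberately left out; the scores in the table may have been produced differently.

```rust
use std::collections::HashMap;

/// Counts word n-grams in `words`. Returns an empty map when `n` is 0 or the
/// input is shorter than `n`.
fn ngram_counts<'a>(words: &[&'a str], n: usize) -> HashMap<Vec<&'a str>, usize> {
    let mut counts = HashMap::new();
    if n == 0 || words.len() < n {
        return counts;
    }
    for w in words.windows(n) {
        *counts.entry(w.to_vec()).or_insert(0) += 1;
    }
    counts
}

/// n-gram F1 between a hypothesis and a reference transcript. Each n-gram
/// counts as matched at most as many times as it occurs in the reference.
pub fn ngram_f1(hyp: &str, reference: &str, n: usize) -> f64 {
    let h: Vec<&str> = hyp.split_whitespace().collect();
    let r: Vec<&str> = reference.split_whitespace().collect();
    let hc = ngram_counts(&h, n);
    let rc = ngram_counts(&r, n);
    let overlap: usize = hc
        .iter()
        .map(|(g, &c)| c.min(*rc.get(g).unwrap_or(&0)))
        .sum();
    let h_total: usize = hc.values().sum();
    let r_total: usize = rc.values().sum();
    if overlap == 0 || h_total == 0 || r_total == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / h_total as f64;
    let recall = overlap as f64 / r_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under this definition, a dropped or substituted word breaks up to n neighbouring n-grams, which is why the score falls as n grows for the same transcript.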

Remaining failure modes

| Pattern | Location | Root cause | Fixable? |
| --- | --- | --- | --- |
| 'KAS' ×12 | ~2801s | Speaker spells "K-A-S"; 60s chunk isolates 10s section with no context | Increase chunk size to 90s |
| 'sick' ×4 | ~4540s | Single 9s segment; `result_len <= 32` → entropy check skipped | `compression_ratio_thold`? |
| 'Bye.' ×10 | ~6070s | Speaker says goodbye multiple times at end; trailing silence trim can't help because this is real audio | Post-processing dedup (declined by user) |
| 5 content gaps | Various | Chunk windows with noisy/overlapping audio → whisper skips content | Retry at 30s scope |

Rust / Library Quirks

`whisper-rs` 0.13 bundles whisper.cpp source

whisper-rs-sys includes the full whisper.cpp source tree inside the crate. The build is entirely self-contained — no internet access is needed during cargo build once the registry cache is warm. The whisper.cpp version is pinned by the crate version; updating whisper.cpp requires bumping the whisper-rs dependency.

`BroadcastStream` silently drops lagged receivers

tokio_stream::wrappers::BroadcastStream yields Err(BroadcastStreamRecvError::Lagged(n)) when a subscriber falls behind and n events have been skipped. Our SSE adapter filters this error to None, so the client is never told that events were dropped. Clients that can't keep up with the SSE stream will miss progress events but will still receive the final done event (or can poll GET /jobs/:id).

`DashMap` as `ProgressRegistry`

DashMap<JobId, broadcast::Sender<ProgressEvent>> provides concurrent map access through sharded locks, so there is no single global lock (it is not strictly lock-free). Senders are cleaned up 30 seconds after a job completes (see the sleep(30s) + registry.remove() in worker::run). The 30-second window gives SSE subscribers time to receive the done event before the channel is dropped.
