mozempk/whisper-rtx2080

Files

Build & Push Docker Image / build-and-push (push) Successful in 6m39s

Details

fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding

GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-06 11:57:30 +02:00

6.7 KiB

Raw Permalink Blame History

Whisper RTX2080 — Lessons Learned & Improvement Notes

Quality Baseline (as of 2026-05-06)

Audio: 101-minute YouTube conference talk (Unblocked — Peter Werry) Model: ggml-large-v3, chunking at 60s on silence boundaries

Metric	Score
WER	9.3%
Word coverage	93.1%
1-gram F1	94.9%
3-gram F1	84.7%
5-gram F1	77.5%

Cold GPU Warmup — First Job Returns 0 Segments in ~0.5s

Severity: Critical (production issue, intermittent, hard to diagnose)

Symptom: After a container restart, the very first submitted job completes in ~0.5 seconds and returns 0 segments. Subsequent jobs work correctly.

Root cause: CUDA JIT-compiles its kernels on the first call to whisper_full_with_state. On a cold GPU, this compilation happens mid-inference and blocks/disrupts the decode pipeline, causing whisper to return immediately with 0 segments.

Why language detection can still succeed: Language detection uses only a small mel-spectrogram + encoder pass on the first 30 seconds of audio. Some of these kernels may already be compiled or cached from a prior session. The full decoder kernels (the heavier ones) are what get JIT-compiled on the first full inference.

Fix: In Transcriber::load(), after creating the state, run a 1-second silent inference pass:

let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
wp.set_language(Some("en"));
let _ = state.full(wp, &silence); // forces CUDA JIT — 0 segments expected
tracing::info!("GPU warmup complete");

This forces all CUDA kernel compilation at startup. The first real job then runs on fully compiled kernels. Startup takes a few seconds longer but every job is reliable.

`set_detect_language(true)` is NOT "auto-detect and transcribe"

whisper.cpp source: if (params.detect_language) { return 0; } — it exits immediately after language detection, returns 0 segments
Correct API: fp.set_language(None) → passes language = NULL to whisper.cpp, which auto-detects AND transcribes
set_detect_language(true) is only for language identification workflows, not transcription
This caused 0-segment regressions on every job submitted without an explicit language= param

VAD filter causes hallucinations

vad_filter=true silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
Fix: Remove vad_filter entirely

Remaining Known Issues

1. Short-token hallucination loops (unfixable by entropy_thold)

entropy_thold is only evaluated when result_len > 32 output tokens
Short loops like kas, sick, Bye. (each 1 token) are never caught, no matter how low you set the threshold
Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
Possible future fix: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
compression_ratio_thold may also help but wasn't tested

4. Cold GPU: first job returns 0 segments in ~0.5s (intermittent, after container restart)

CUDA JIT-compiles kernels on the first call to whisper_full_with_state. On a cold GPU this compilation blocks/disrupts the decode pipeline mid-inference, causing an immediate return with 0 segments.

Fix: Run a 1-second silent warmup inference in Transcriber::load(). This forces JIT compilation at startup so the first real job runs on fully compiled kernels.

Largest: 439 words at ~68 min, 328 words at ~80 min, then 3 × ~293-250 word gaps
These are chunks where whisper produced off-topic or repetitive output instead of real content
Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
Possible future fix: retry failed chunks at smaller scope (30s), detect by low-confidence score or segment density

3. CUDA device ordering inversion

nvidia-smi: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
whisper.cpp on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
Inside Docker: matches nvidia-smi order
Health endpoint uses nvml (nvidia-smi ordering) → reports wrong GPU name when running on host
Workaround: CUDA_DEVICE=1 on host to target RTX 2080 SUPER

Whisper Parameter Tuning Notes

Current values in src/transcriber.rs:

beam_size = 5, patience = 1.0
entropy_thold = 3.5        (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
logprob_thold = -1.0       (rejects very low confidence segments)
temperature_inc = 0.2      (fallback temperature increment on failure)
no_context = true          (prevents context from one chunk poisoning the next)
suppress_non_speech_tokens = true
suppress_blank = true
language = None            (auto-detect + transcribe)

What NOT to set:

vad_filter=true → hallucination loops on quiet speech
detect_language=true → returns 0 segments, transcription never runs

Audio Pre-Processing Pipeline

Download: yt-dlp → MP3
Convert: ffmpeg → 16kHz mono WAV (whisper native format)
Silence detection: ffmpeg silencedetect filter at -35dB / 0.4s min duration
Chunking: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
Trim trailing silence per chunk: -35dB threshold, 0.5s padding (applied before whisper)
Transcribe each chunk independently, offset timestamps, concatenate

Why chunking helps: Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.

Chunk size trade-off:

Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
Current sweet spot: 60s. If 'KAS'-type issues are a priority, try 90-120s.

Potential Future Improvements (Prioritized)

Retry bad chunks at smaller scope — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
Increase chunk size to 90-120s — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
compression_ratio_thold — may catch short-token loops that entropy_thold misses; test values around 2.0-2.4
Adaptive snap window — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
Per-segment confidence scoring — expose avg_logprob per segment in the JSON output for downstream filtering
Multiple model support — medium model for speed, large-v3 for quality; selectable per job

6.7 KiB Raw Permalink Blame History Unescape Escape