Whisper RTX2080 — Lessons Learned & Improvement Notes
Quality Baseline (as of 2026-05-06)
Audio: 101-minute YouTube conference talk (Unblocked - Peter Werry)
Model: ggml-large-v3, chunking at 60s on silence boundaries
| Metric | Score |
|---|---|
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |
Critical Bugs Found & Fixed
set_detect_language(true) is NOT "auto-detect and transcribe"
- whisper.cpp source: `if (params.detect_language) { return 0; }`. It exits immediately after language detection and returns 0 segments.
- Correct API: `fp.set_language(None)` → passes `language = NULL` to whisper.cpp, which auto-detects AND transcribes
- `set_detect_language(true)` is only for language identification workflows, not transcription
- This caused 0-segment regressions on every job submitted without an explicit `language=` param
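A minimal sketch of a guard against this regression, assuming job parameters arrive with an optional language field. `JobParams`, `LanguageMode`, and `resolve_language` are hypothetical names for illustration and are not taken from src/transcriber.rs. The idea is that a missing language maps to "pass `None`" and nothing in the transcription path ever reaches for `set_detect_language(true)`.

```rust
/// Hypothetical job parameters; the field name is illustrative only.
#[derive(Debug, Default)]
pub struct JobParams {
    /// Explicit language code such as "en"; None when the caller omitted it.
    pub language: Option<String>,
}

/// What the transcriber should request from whisper.
#[derive(Debug, PartialEq)]
pub enum LanguageMode {
    /// Pass no language (NULL): auto-detect, then transcribe.
    AutoDetectAndTranscribe,
    /// Pass an explicit language code.
    Fixed(String),
}

/// Map request parameters to a language mode.
/// Treating an empty string like a missing value is an assumption of this sketch.
pub fn resolve_language(params: &JobParams) -> LanguageMode {
    match params.language.as_deref().map(str::trim) {
        None | Some("") => LanguageMode::AutoDetectAndTranscribe,
        Some(code) => LanguageMode::Fixed(code.to_ascii_lowercase()),
    }
}
```

With this shape, `AutoDetectAndTranscribe` would translate to `fp.set_language(None)` and `Fixed(code)` to an explicit language, so there is no code path where a transcription job enables detect-only mode.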
VAD filter causes hallucinations
- `vad_filter=true` silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
- Fix: Remove `vad_filter` entirely
Remaining Known Issues
1. Short-token hallucination loops (unfixable by entropy_thold)
- `entropy_thold` is only evaluated when `result_len > 32` output tokens
- Short loops like `kas`, `sick`, `Bye.` (each 1 token) are never caught, no matter how low you set the threshold
- Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
- Possible future fix: post-process to collapse consecutive identical segments (user declined this for now; raw output only)
- `compression_ratio_thold` may also help but wasn't tested
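The length gate can be illustrated with a small standalone model. This is a sketch, not whisper.cpp's code: the 32-token window, the base-2 logarithm, and the token ids are assumptions chosen for illustration, and the base whisper.cpp actually uses was not verified here.

```rust
use std::collections::HashMap;

/// Window size and minimum length assumed for this sketch.
const WINDOW: usize = 32;

/// Shannon entropy, in bits, of the token distribution over the last WINDOW tokens.
pub fn window_entropy_bits(tokens: &[u32]) -> f64 {
    let start = tokens.len().saturating_sub(WINDOW);
    let window = &tokens[start..];
    if window.is_empty() {
        return 0.0;
    }
    // Count how often each token id occurs in the window.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in window {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = window.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// True when the decode would be rejected as a repetition loop.
/// The check is skipped entirely unless more than WINDOW tokens were produced.
pub fn fails_entropy_check(tokens: &[u32], entropy_thold: f64) -> bool {
    tokens.len() > WINDOW && window_entropy_bits(tokens) < entropy_thold
}
```

In this model a 12-token run of one repeated id has zero entropy but is never rejected, because the length gate short-circuits first. A 36-token loop over a 9-token phrase has entropy just under log₂(9) and is rejected at a threshold of 3.5.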
2. Five significant content gaps (~1600 words total)
- Largest: 439 words at ~68 min, 328 words at ~80 min, then three gaps of roughly 250-293 words each
- These are chunks where whisper produced off-topic or repetitive output instead of real content
- Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
- Possible future fix: retry failed chunks at smaller scope (30s), detect by low-confidence score or segment density
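A rough sketch of how the retry idea might look, assuming per-chunk statistics are available after transcription. `ChunkStats`, `RetryPolicy`, and both functions are hypothetical, and any threshold values passed in would be untuned placeholders rather than measured numbers.

```rust
/// Hypothetical per-chunk summary; field names are illustrative only.
pub struct ChunkStats {
    pub duration_s: f64,
    pub word_count: usize,
    pub avg_logprob: f64,
}

/// Placeholder thresholds; real values would need tuning against known-bad chunks.
pub struct RetryPolicy {
    pub min_words_per_s: f64,
    pub min_avg_logprob: f64,
}

/// Flag a chunk when it is suspiciously sparse or low-confidence.
pub fn needs_retry(c: &ChunkStats, p: &RetryPolicy) -> bool {
    if c.duration_s <= 0.0 {
        return false;
    }
    let density = c.word_count as f64 / c.duration_s;
    density < p.min_words_per_s || c.avg_logprob < p.min_avg_logprob
}

/// Split [start_s, end_s) into consecutive windows of at most `window_s` seconds.
pub fn retry_windows(start_s: f64, end_s: f64, window_s: f64) -> Vec<(f64, f64)> {
    let mut out = Vec::new();
    if window_s <= 0.0 || end_s <= start_s {
        return out;
    }
    let mut t = start_s;
    while t < end_s {
        let next = (t + window_s).min(end_s);
        out.push((t, next));
        t = next;
    }
    out
}
```

A flagged 60s chunk starting at 60s would be re-run as two windows, 60-90s and 90-120s. Whether word density or average logprob is the better signal for these gaps is untested.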
3. CUDA device ordering inversion
- nvidia-smi: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
- whisper.cpp on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
- Inside Docker: matches nvidia-smi order
- Health endpoint uses nvml (nvidia-smi ordering) → reports wrong GPU name when running on host
- Workaround: `CUDA_DEVICE=1` on host to target RTX 2080 SUPER
Whisper Parameter Tuning Notes
Current values in src/transcriber.rs:
beam_size = 5, patience = 1.0
entropy_thold = 3.5 (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
logprob_thold = -1.0 (rejects very low confidence segments)
temperature_inc = 0.2 (fallback temperature increment on failure)
no_context = true (prevents context from one chunk poisoning the next)
suppress_non_speech_tokens = true
suppress_blank = true
language = None (auto-detect + transcribe)
What NOT to set:
- `vad_filter=true` → hallucination loops on quiet speech
- `detect_language=true` → returns 0 segments, transcription never runs
Audio Pre-Processing Pipeline
- Download: yt-dlp → MP3
- Convert: ffmpeg → 16kHz mono WAV (whisper native format)
- Silence detection: ffmpeg `silencedetect` filter at -35dB / 0.4s min duration
- Chunking: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
- Trim trailing silence per chunk: -35dB threshold, 0.5s padding (applied before whisper)
- Transcribe each chunk independently, offset timestamps, concatenate
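The chunking step above could be sketched roughly as follows. This is not the code in the repository: `Silence` and `plan_cuts` are hypothetical names, and measuring the ideal cut as the previous cut plus the target length is an assumption about how "target 60s" is applied.

```rust
/// A detected silence interval, in seconds from the start of the audio.
#[derive(Debug, Clone, Copy)]
pub struct Silence {
    pub start: f64,
    pub end: f64,
}

impl Silence {
    fn midpoint(&self) -> f64 {
        (self.start + self.end) / 2.0
    }
}

/// Plan chunk boundaries: aim for `target` seconds per chunk, snap each cut to
/// the silence midpoint nearest the ideal position within +/- `window` seconds,
/// and fall back to a hard cut at the ideal position when none qualifies.
pub fn plan_cuts(total: f64, silences: &[Silence], target: f64, window: f64) -> Vec<f64> {
    let mut cuts = Vec::new();
    if target <= 0.0 {
        return cuts;
    }
    let mut chunk_start = 0.0;
    while total - chunk_start > target {
        let ideal = chunk_start + target;
        let snapped = silences
            .iter()
            .map(Silence::midpoint)
            // Only midpoints inside the window that still move the cut forward.
            .filter(|m| (m - ideal).abs() <= window && *m > chunk_start && *m < total)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()));
        let cut = snapped.unwrap_or(ideal);
        cuts.push(cut);
        chunk_start = cut;
    }
    cuts
}
```

For a 200s file with silences centred at 59s, 100.5s, and 126s, this yields cuts at 59s and 126s (both snapped) and a hard cut at 186s, since no silence lies within 30s of that position.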
Why chunking helps: Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.
Chunk size trade-off:
- Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
- Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
- Current sweet spot: 60s. If 'KAS'-type issues are a priority, try 90-120s.
Potential Future Improvements (Prioritized)
- Retry bad chunks at smaller scope — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
- Increase chunk size to 90-120s — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
- compression_ratio_thold — may catch short-token loops that entropy_thold misses; test values around 2.0-2.4
- Adaptive snap window — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
- Per-segment confidence scoring — expose avg_logprob per segment in the JSON output for downstream filtering
- Multiple model support — medium model for speed, large-v3 for quality; selectable per job
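For the adaptive snap window item, one possible shape is to try a list of progressively wider windows before giving up. This is a sketch under assumptions: `Cut` and `snap_adaptive` are hypothetical names, and the window list (for example 30s then 45s) is supplied by the caller.

```rust
/// Outcome of choosing a cut position near an ideal timestamp.
#[derive(Debug, PartialEq)]
pub enum Cut {
    /// Snapped to a silence midpoint found within the given window.
    Snapped { at: f64, window: f64 },
    /// No silence qualified in any window; cut at the ideal position.
    Hard(f64),
}

/// Try each window in order and snap to the nearest silence midpoint
/// inside the first window that contains one.
pub fn snap_adaptive(ideal: f64, midpoints: &[f64], windows: &[f64]) -> Cut {
    for &w in windows {
        let best = midpoints
            .iter()
            .copied()
            .filter(|m| (m - ideal).abs() <= w)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()));
        if let Some(at) = best {
            return Cut::Snapped { at, window: w };
        }
    }
    Cut::Hard(ideal)
}
```

Returning the window that matched makes it easy to log how often the wider window is actually needed, which would help decide whether the extra range is worth keeping.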