whisper-rtx2080/KNOWLEDGE.md
mozempk 8fc45ee86f
docs: add KNOWLEDGE.md with lessons learned and improvement notes
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:09:27 +02:00

# Whisper RTX2080 — Lessons Learned & Improvement Notes
## Quality Baseline (as of 2026-05-06)
Audio: 101-minute YouTube conference talk (Unblocked — Peter Werry)
Model: ggml-large-v3, chunking at 60s on silence boundaries
| Metric | Score |
|-------------------|--------|
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |
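How these scores were computed is not recorded here. For reference, a minimal sketch of the standard WER definition (word-level edit distance divided by reference length) could look like the following. The function name and the lowercase, whitespace-split tokenisation are assumptions for illustration, not the scoring script actually used.

```rust
/// Word error rate: word-level Levenshtein distance / number of reference words.
/// Tokenisation is a plain whitespace split on lowercased text (an assumption).
fn word_error_rate(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<String> = reference.split_whitespace().map(|w| w.to_lowercase()).collect();
    let h: Vec<String> = hypothesis.split_whitespace().map(|w| w.to_lowercase()).collect();
    if r.is_empty() {
        // WER is undefined for an empty reference; treat "both empty" as a perfect match.
        return if h.is_empty() { 0.0 } else { f64::INFINITY };
    }
    // Rolling-row Levenshtein distance over words.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let substitution = prev[j] + usize::from(rw != hw);
            let deletion = prev[j + 1] + 1; // reference word missing from hypothesis
            let insertion = cur[j] + 1; // extra word in hypothesis
            cur[j + 1] = substitution.min(deletion).min(insertion);
        }
        prev = cur;
    }
    prev[h.len()] as f64 / r.len() as f64
}
```

For example, scoring "the quick fox" against the reference "the quick brown fox" counts one deletion over four reference words. Punctuation and number normalisation are left out and would change real scores.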
---
## Critical Bugs Found & Fixed
### `set_detect_language(true)` is NOT "auto-detect and transcribe"
- `whisper.cpp` source: `if (params.detect_language) { return 0; }`. The call returns right after language detection, so it produces 0 segments
- **Correct API**: `fp.set_language(None)` → passes `language = NULL` to whisper.cpp, which auto-detects AND transcribes
- `set_detect_language(true)` is only for language identification workflows, not transcription
- This caused 0-segment regressions on every job submitted without an explicit `language=` param
### VAD filter causes hallucinations
- `vad_filter=true` silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
- **Fix**: Remove `vad_filter` entirely
---
## Remaining Known Issues
### 1. Short-token hallucination loops (unfixable by entropy_thold)
- `entropy_thold` is only evaluated when a decoded sequence has more than 32 output tokens (`result_len > 32`)
- Short loops like `kas`, `sick`, `Bye.` (each 1 token) are **never caught**, whatever value the threshold is set to
- Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
- **Possible future fix**: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
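- `compression_ratio_thold` may also help but wasn't tested (the name comes from OpenAI Whisper; confirm that whisper.cpp exposes an equivalent before relying on it)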
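
The post-processing idea above is not part of the current pipeline (output stays raw). If it is revisited, a minimal sketch might look like this. The `Segment` struct, the `collapse_repeats` name and the `min_run` parameter are hypothetical and not taken from `src/transcriber.rs`.

```rust
/// Hypothetical segment shape; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_s: f64,
    end_s: f64,
    text: String,
}

/// Collapse runs of at least `min_run` consecutive segments with the same
/// (trimmed, case-insensitive) text into one segment spanning the whole run.
/// Shorter runs are kept untouched, so a genuine "Yes. Yes." survives.
fn collapse_repeats(segments: &[Segment], min_run: usize) -> Vec<Segment> {
    let key = |s: &Segment| s.text.trim().to_lowercase();
    let mut out = Vec::new();
    let mut i = 0;
    while i < segments.len() {
        // Find the end of the run of identical texts starting at i.
        let mut j = i + 1;
        while j < segments.len() && key(&segments[j]) == key(&segments[i]) {
            j += 1;
        }
        if j - i >= min_run {
            // Keep the first copy and stretch its end time over the run.
            let mut merged = segments[i].clone();
            merged.end_s = segments[j - 1].end_s;
            out.push(merged);
        } else {
            out.extend_from_slice(&segments[i..j]);
        }
        i = j;
    }
    out
}
```

With `min_run = 3`, the 'Bye.' ×10 run would shrink to a single segment while isolated repeats pass through unchanged. The right value for `min_run` would need checking against real transcripts.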
### 2. Five significant content gaps (~1600 words total)
- Largest: 439 words at ~68 min and 328 words at ~80 min, followed by three gaps of roughly 250-293 words each
- These are chunks where whisper produced off-topic or repetitive output instead of real content
- Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
- **Possible future fix**: retry failed chunks at a smaller scope (30s windows); detect them by a low confidence score or low segment density
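
As a rough illustration of the detection step, the sketch below flags a chunk when it has too few words per second or too little vocabulary variety, then splits it into 30s retry windows. `ChunkResult`, `needs_retry`, `retry_windows` and both thresholds are hypothetical; confidence (logprob) is left out because it is not exposed in the output yet.

```rust
use std::collections::HashSet;

/// Hypothetical result of transcribing one chunk.
struct ChunkResult {
    duration_s: f64,
    text: String,
}

/// Flag a chunk as suspicious when word density or word variety is too low.
/// Both thresholds are placeholders that would need tuning on real data.
fn needs_retry(chunk: &ChunkResult, min_words_per_s: f64, min_distinct_ratio: f64) -> bool {
    let words: Vec<String> = chunk.text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if words.is_empty() || chunk.duration_s <= 0.0 {
        return true;
    }
    let density = words.len() as f64 / chunk.duration_s;
    let distinct: HashSet<&str> = words.iter().map(|w| w.as_str()).collect();
    let distinct_ratio = distinct.len() as f64 / words.len() as f64;
    density < min_words_per_s || distinct_ratio < min_distinct_ratio
}

/// Split [start_s, end_s) into consecutive windows of at most `window_s` seconds.
fn retry_windows(start_s: f64, end_s: f64, window_s: f64) -> Vec<(f64, f64)> {
    assert!(window_s > 0.0, "window must be positive");
    let mut out = Vec::new();
    let mut t = start_s;
    while t < end_s {
        let next = (t + window_s).min(end_s);
        out.push((t, next));
        t = next;
    }
    out
}
```

A 60s chunk containing only ten "Bye." segments would be flagged by the density check, and `retry_windows(120.0, 180.0, 30.0)` yields two 30s windows. Genuinely silent chunks would also be flagged, so a real implementation would need a retry limit.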
### 3. CUDA device ordering inversion
- `nvidia-smi`: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
- `whisper.cpp` on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
- Inside Docker: matches nvidia-smi order
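- Health endpoint uses NVML (nvidia-smi ordering) → reports wrong GPU name when running on host
- Likely cause: CUDA enumerates devices fastest-first by default, while nvidia-smi uses PCI bus order; setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` should make the two agree (untested here)
- **Workaround**: `CUDA_DEVICE=1` on host to target RTX 2080 SUPER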
---
## Whisper Parameter Tuning Notes
Current values in `src/transcriber.rs`:
```
beam_size = 5, patience = 1.0
entropy_thold = 3.5 (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
logprob_thold = -1.0 (rejects very low confidence segments)
temperature_inc = 0.2 (fallback temperature increment on failure)
no_context = true (prevents context from one chunk poisoning the next)
suppress_non_speech_tokens = true
suppress_blank = true
language = None (auto-detect + transcribe)
```
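The `log₂(9) ≈ 3.17` figure in the `entropy_thold` note can be reproduced with a small sketch. It computes Shannon entropy in bits over the tail of a token sequence, matching the base-2 reasoning above. The function name and the window parameter are assumptions, and whisper.cpp's own log base is not verified here: with a natural log the same loop would score ln(9) ≈ 2.20, and the maximum over 32 tokens would be ln(32) ≈ 3.47.

```rust
use std::collections::HashMap;

/// Shannon entropy (in bits) of the token distribution over the last
/// `window` tokens. A low value means the tail is highly repetitive.
fn tail_entropy_bits(tokens: &[u32], window: usize) -> f64 {
    let tail = &tokens[tokens.len().saturating_sub(window)..];
    if tail.is_empty() {
        return 0.0;
    }
    // Count how often each token id occurs in the tail.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tail {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tail.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

A perfectly uniform 9-token loop scores log₂(9) bits, which is below 3.5 and would therefore be flagged; 32 distinct tokens score 5 bits and pass. A 1-token loop scores 0, but it is only checked once the sequence exceeds 32 tokens, which is why short loops slip through.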
**What NOT to set:**
- `vad_filter=true` → hallucination loops on quiet speech
- `detect_language=true` → returns 0 segments, transcription never runs
---
## Audio Pre-Processing Pipeline
1. **Download**: yt-dlp → MP3
2. **Convert**: ffmpeg → 16 kHz mono WAV (the input format whisper.cpp expects)
3. **Silence detection**: ffmpeg `silencedetect` filter at -35dB / 0.4s min duration
4. **Chunking**: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
5. **Trim trailing silence** per chunk: -35dB threshold, 0.5s padding (applied before whisper)
6. **Transcribe** each chunk independently, offset timestamps, concatenate
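
The chunking rule in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `chunk_boundaries` and its parameters are hypothetical names, and silences are assumed to arrive as `(start, end)` pairs in seconds, already parsed from `silencedetect` output.

```rust
/// Pick chunk boundaries: aim for `target_s` seconds per chunk, snap each cut
/// to the nearest silence midpoint within ±`window_s`, else hard-cut at the target.
fn chunk_boundaries(
    total_s: f64,
    silences: &[(f64, f64)],
    target_s: f64,
    window_s: f64,
) -> Vec<(f64, f64)> {
    // target > window guarantees every cut lands after the current cursor.
    assert!(target_s > window_s && window_s >= 0.0, "need target_s > window_s >= 0");
    let midpoints: Vec<f64> = silences.iter().map(|&(s, e)| (s + e) / 2.0).collect();
    let mut chunks = Vec::new();
    let mut cursor = 0.0;
    while total_s - cursor > target_s {
        let ideal = cursor + target_s;
        let cut = midpoints
            .iter()
            .copied()
            .filter(|&m| (m - ideal).abs() <= window_s && m < total_s)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()))
            .unwrap_or(ideal); // no silence in the window: hard cut
        chunks.push((cursor, cut));
        cursor = cut;
    }
    // Whatever remains becomes the final (possibly short) chunk.
    if total_s > cursor {
        chunks.push((cursor, total_s));
    }
    chunks
}
```

With `target_s = 60.0` and `window_s = 30.0`, chunks come out between 30s and 90s long, except for the final remainder. The timestamp offset for each chunk in step 6 is simply the chunk's start value.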
**Why chunking helps:** Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.
**Chunk size trade-off:**
- Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
- Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
- Current sweet spot: 60s. If 'KAS'-type issues are a priority, try 90-120s.
---
## Potential Future Improvements (Prioritized)
1. **Retry bad chunks at smaller scope** — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
2. **Increase chunk size to 90-120s** — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
3. **compression_ratio_thold** — may catch short-token loops that entropy_thold misses; test values around 2.0-2.4
4. **Adaptive snap window** — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
5. **Per-segment confidence scoring** — expose avg_logprob per segment in the JSON output for downstream filtering
6. **Multiple model support** — medium model for speed, large-v3 for quality; selectable per job