docs: add KNOWLEDGE.md with lessons learned and improvement notes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:09:27 +02:00
parent 6327ffc09d
commit 8fc45ee86f
1 changed files with 102 additions and 0 deletions
--- a/KNOWLEDGE.md
+++ b/KNOWLEDGE.md
@@ -0,0 +1,102 @@
+# Whisper RTX2080 — Lessons Learned & Improvement Notes
+
+## Quality Baseline (as of 2026-05-06)
+
+Audio: 101-minute YouTube conference talk (Unblocked — Peter Werry)
+Model: ggml-large-v3, chunking at 60s on silence boundaries
+
+| Metric            | Score  |
+|-------------------|--------|
+| WER               | 9.3%   |
+| Word coverage     | 93.1%  |
+| 1-gram F1         | 94.9%  |
+| 3-gram F1         | 84.7%  |
+| 5-gram F1         | 77.5%  |
+
+---
+
+## Critical Bugs Found & Fixed
+
+### `set_detect_language(true)` is NOT "auto-detect and transcribe"
+- `whisper.cpp` source: `if (params.detect_language) { return 0; }`. The call returns right after language detection, so it produces 0 segments
+- **Correct API**: `fp.set_language(None)` → passes `language = NULL` to whisper.cpp, which auto-detects AND transcribes
+- `set_detect_language(true)` is only for language identification workflows, not transcription
+- This caused 0-segment regressions on every job submitted without an explicit `language=` param
+
+### VAD filter causes hallucinations
+- `vad_filter=true` silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
+- **Fix**: Remove `vad_filter` entirely
+
+---
+
+## Remaining Known Issues
+
+### 1. Short-token hallucination loops (unfixable by entropy_thold)
+- `entropy_thold` is only evaluated when `result_len > 32` output tokens
+- Short loops like `kas`, `sick`, `Bye.` (each 1 token) are **never caught**, whatever value the threshold is set to, because the length gate fails before the entropy comparison runs
+- Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
+- **Possible future fix**: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
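+- `compression_ratio_thold` may also help but wasn't tested. whisper.cpp does not appear to expose a parameter by that name (its header describes `entropy_thold` as the analogue of OpenAI Whisper's `compression_ratio_threshold`), so this would likely have to be a post-processing check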
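+
+The interaction between loop length and `entropy_thold` can be checked with a small sketch. It assumes the score is the Shannon entropy of token-id counts over the last 32 tokens of a decode, which is how whisper.cpp's sequence scoring appears to work. The function names are hypothetical, and the log base is an assumption worth verifying against the source: the tuning notes in this file quote log₂(9) ≈ 3.17 bits, but if whisper.cpp uses the natural log, a 9-token loop scores about 2.19 and the maximum possible over a 32-token window is ln(32) ≈ 3.47.
+
+```rust
+use std::collections::HashMap;
+
+/// Shannon entropy (in nats) of the token-id distribution over the last
+/// `window` tokens. Hypothetical helper, not whisper.cpp's actual code.
+fn window_entropy_nats(tokens: &[u32], window: usize) -> f64 {
+    let start = tokens.len().saturating_sub(window);
+    let tail = &tokens[start..];
+    if tail.is_empty() {
+        return 0.0;
+    }
+    let mut counts: HashMap<u32, usize> = HashMap::new();
+    for &t in tail {
+        *counts.entry(t).or_insert(0) += 1;
+    }
+    let n = tail.len() as f64;
+    let mut h = 0.0;
+    for &c in counts.values() {
+        let p = c as f64 / n;
+        h -= p * p.ln();
+    }
+    h
+}
+
+/// Convert nats to bits for comparison with log2-based estimates.
+fn nats_to_bits(nats: f64) -> f64 {
+    nats / std::f64::consts::LN_2
+}
+
+fn main() {
+    // A 9-token phrase repeated until it fills a 32-token window.
+    let looped: Vec<u32> = (0..32).map(|i| i % 9).collect();
+    let h = window_entropy_nats(&looped, 32);
+    println!("9-token loop: {:.2} nats, {:.2} bits", h, nats_to_bits(h));
+
+    // A single repeated token has zero entropy, but a 1-token segment never
+    // reaches the `result_len > 32` gate, so the threshold is not consulted.
+    let single = vec![7u32; 12];
+    println!("1-token loop: {:.2} nats", window_entropy_nats(&single, 32));
+}
+```
+
+If the natural-log assumption holds, a threshold above ln(32) would mark every decode longer than 32 tokens as failed, which is worth ruling out before tuning `entropy_thold` further.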
+
+### 2. Five significant content gaps (~1600 words total)
+- Largest: 439 words at ~68 min and 328 words at ~80 min, followed by three gaps of roughly 250-293 words each
+- These are chunks where whisper produced off-topic or repetitive output instead of real content
+- Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
+- **Possible future fix**: retry failed chunks at smaller scope (30s), detect by low-confidence score or segment density
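+
+A minimal sketch of the retry idea, assuming each chunk result carries a word count, a duration, and an average log-probability. The struct, the field names, and both thresholds are hypothetical placeholders rather than tuned values from this project.
+
+```rust
+/// Summary of one transcribed chunk. Field names are hypothetical.
+struct ChunkStats {
+    duration_s: f64,
+    word_count: usize,
+    avg_logprob: f64,
+}
+
+/// Illustrative thresholds only; real values would need tuning.
+const MIN_WORDS_PER_SEC: f64 = 1.0;
+const MIN_AVG_LOGPROB: f64 = -1.0;
+
+/// Flag a chunk whose word density or confidence looks suspiciously low.
+fn needs_retry(c: &ChunkStats) -> bool {
+    if c.duration_s <= 0.0 {
+        return false;
+    }
+    let density = c.word_count as f64 / c.duration_s;
+    density < MIN_WORDS_PER_SEC || c.avg_logprob < MIN_AVG_LOGPROB
+}
+
+/// Split [start, end) into sub-windows of at most `max_len` seconds.
+fn split_window(start: f64, end: f64, max_len: f64) -> Vec<(f64, f64)> {
+    assert!(max_len > 0.0, "max_len must be positive");
+    let mut out = Vec::new();
+    let mut t = start;
+    while t < end {
+        let next = (t + max_len).min(end);
+        out.push((t, next));
+        t = next;
+    }
+    out
+}
+
+fn main() {
+    // A 60s chunk that only produced 12 words, e.g. a short repeated token.
+    let sparse = ChunkStats { duration_s: 60.0, word_count: 12, avg_logprob: -0.4 };
+    if needs_retry(&sparse) {
+        // Re-run the same span as 30s windows.
+        for (s, e) in split_window(60.0, 120.0, 30.0) {
+            println!("retry {:.0}s..{:.0}s", s, e);
+        }
+    }
+}
+```
+
+Word density alone would miss chunks that are fluent but off-topic, so a real detector would probably need a second signal alongside it.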
+
+### 3. CUDA device ordering inversion
+- `nvidia-smi`: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
+- `whisper.cpp` on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
+- Inside Docker: matches nvidia-smi order
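+- Health endpoint uses NVML (nvidia-smi ordering) → reports the wrong GPU name when running on host
+- **Workaround**: `CUDA_DEVICE=1` on host to target the RTX 2080 SUPER. The inversion is likely because CUDA enumerates devices fastest-first by default while nvidia-smi uses PCI bus order; setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` should align the two (not verified in this setup)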
+
+---
+
+## Whisper Parameter Tuning Notes
+
+Current values in `src/transcriber.rs`:
+
+```
+beam_size = 5, patience = 1.0
+entropy_thold = 3.5        (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
+logprob_thold = -1.0       (rejects very low confidence segments)
+temperature_inc = 0.2      (fallback temperature increment on failure)
+no_context = true          (prevents context from one chunk poisoning the next)
+suppress_non_speech_tokens = true
+suppress_blank = true
+language = None            (auto-detect + transcribe)
+```
+
+**What NOT to set:**
+- `vad_filter=true` → hallucination loops on quiet speech
+- `detect_language=true` → returns 0 segments, transcription never runs
+
+---
+
+## Audio Pre-Processing Pipeline
+
+1. **Download**: yt-dlp → MP3
+2. **Convert**: ffmpeg → 16 kHz mono WAV (the input format whisper expects)
+3. **Silence detection**: ffmpeg `silencedetect` filter at -35dB / 0.4s min duration
+4. **Chunking**: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
+5. **Trim trailing silence** per chunk: -35dB threshold, 0.5s padding (applied before whisper)
+6. **Transcribe** each chunk independently, offset timestamps, concatenate
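+
+Step 4 can be sketched as follows. This is a minimal illustration, not the project's actual chunker: the `Silence` type and `chunk_boundaries` function are hypothetical, and it assumes the 60s target and ±30s window are measured from the previous cut rather than from a fixed grid.
+
+```rust
+/// A detected silence interval, in seconds (hypothetical type).
+#[derive(Clone, Copy, Debug)]
+struct Silence {
+    start: f64,
+    end: f64,
+}
+
+/// Choose cut points: aim for `target` seconds after the previous cut, snap to
+/// the nearest silence midpoint within ±`window`, otherwise cut at the target.
+fn chunk_boundaries(total: f64, silences: &[Silence], target: f64, window: f64) -> Vec<f64> {
+    assert!(target > 0.0, "target must be positive");
+    let mids: Vec<f64> = silences.iter().map(|s| (s.start + s.end) / 2.0).collect();
+    let mut cuts = Vec::new();
+    let mut pos = 0.0;
+    while total - pos > target {
+        let ideal = pos + target;
+        let snapped = mids
+            .iter()
+            .copied()
+            .filter(|&m| m > pos && m < total && (m - ideal).abs() <= window)
+            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()));
+        // Fall back to a hard cut when no silence lies inside the window.
+        let cut = snapped.unwrap_or(ideal);
+        cuts.push(cut);
+        pos = cut;
+    }
+    cuts
+}
+
+fn main() {
+    let silences = [
+        Silence { start: 55.0, end: 57.0 },
+        Silence { start: 130.0, end: 131.0 },
+    ];
+    let cuts = chunk_boundaries(200.0, &silences, 60.0, 30.0);
+    println!("cut points (s): {:?}", cuts);
+}
+```
+
+Each cut then becomes the timestamp offset added to that chunk's segments in step 6.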
+
+**Why chunking helps:** Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.
+
+**Chunk size trade-off:**
+- Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
+- Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
+- Current sweet spot: 60s. If 'KAS'-type issues are a priority, try 90-120s.
+
+---
+
+## Potential Future Improvements (Prioritized)
+
+1. **Retry bad chunks at smaller scope** — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
+2. **Increase chunk size to 90-120s** — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
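+3. **Compression-ratio check**: may catch short-token loops that entropy_thold misses; whisper.cpp does not appear to expose `compression_ratio_thold` directly, so this would likely be a post-processing step; test values around 2.0-2.4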
+4. **Adaptive snap window** — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
+5. **Per-segment confidence scoring** — expose avg_logprob per segment in the JSON output for downstream filtering
+6. **Multiple model support** — medium model for speed, large-v3 for quality; selectable per job