whisper-rtx2080/KNOWLEDGE.md
mozempk fd8d4deefb
fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding
GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:57:30 +02:00

# Whisper RTX2080 — Lessons Learned & Improvement Notes
## Quality Baseline (as of 2026-05-06)
Audio: 101-minute YouTube conference talk (Unblocked — Peter Werry)
Model: ggml-large-v3, chunking at 60s on silence boundaries
| Metric | Score |
|-------------------|--------|
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |
---
## Cold GPU Warmup — First Job Returns 0 Segments in ~0.5s
**Severity: Critical (production issue, intermittent, hard to diagnose)**
**Symptom:** After a container restart, the very first submitted job completes in ~0.5 seconds and returns 0 segments. Subsequent jobs work correctly.
**Root cause:** CUDA JIT-compiles its kernels on the **first** call to `whisper_full_with_state`. On a cold GPU, this compilation happens mid-inference and blocks/disrupts the decode pipeline, causing whisper to return immediately with 0 segments.
**Why language detection can still succeed:** Language detection uses only a small mel-spectrogram + encoder pass on the first 30 seconds of audio. Some of these kernels may already be compiled or cached from a prior session. The full decoder kernels (the heavier ones) are what get JIT-compiled on the first full inference.
**Fix:** In `Transcriber::load()`, after creating the state, run a 1-second silent inference pass:
```rust
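use whisper_rs::{FullParams, SamplingStrategy};

// Inside Transcriber::load(), once `state` (the WhisperState) exists:
let silence = vec![0.0f32; 16_000]; // 1s of silence @ 16 kHz
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
wp.set_language(Some("en")); // fixed language, so no auto-detection pass during warmup
// Result is discarded on purpose: the call only forces CUDA JIT. 0 segments expected.
let _ = state.full(wp, &silence);
tracing::info!("GPU warmup complete");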
```
This forces all CUDA kernel compilation at startup. The first real job then runs on fully compiled kernels. Startup takes a few seconds longer but every job is reliable.
---
### `set_detect_language(true)` is NOT "auto-detect and transcribe"
- `whisper.cpp` source: `if (params.detect_language) { return 0; }`. It exits immediately after language detection and returns 0 segments
- **Correct API**: `fp.set_language(None)` → passes `language = NULL` to whisper.cpp, which auto-detects AND transcribes
- `set_detect_language(true)` is only for language identification workflows, not transcription
- This caused 0-segment regressions on every job submitted without an explicit `language=` param
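
The mapping from the request's optional `language` field to the value passed to `set_language` can be sketched as follows. This is a minimal illustration, not the service's actual code: `resolve_language` is a hypothetical helper, and rejecting `auto` with an error (rather than silently treating it as auto-detect) is an assumption. The check only validates the two-letter shape of the code, not membership in ISO 639-1.

```rust
/// Hypothetical helper: turns the optional `language` request field into the
/// value to hand to `set_language`. `Ok(None)` means "auto-detect and transcribe".
fn resolve_language(param: Option<&str>) -> Result<Option<String>, String> {
    match param.map(str::trim) {
        // Field omitted or empty: pass None so whisper auto-detects.
        None | Some("") => Ok(None),
        // "auto" is not an ISO 639-1 code; reject it instead of guessing.
        Some(code) if code.eq_ignore_ascii_case("auto") => {
            Err("language=auto is invalid; omit the field for auto-detect".to_string())
        }
        // Shape check only: exactly two ASCII letters.
        Some(code) if code.len() == 2 && code.chars().all(|c| c.is_ascii_alphabetic()) => {
            Ok(Some(code.to_ascii_lowercase()))
        }
        Some(other) => Err(format!("unsupported language code: {other}")),
    }
}
```

The resulting `Option<String>` would then be passed on as `fp.set_language(lang.as_deref())`, so the `None` case reaches whisper.cpp as `language = NULL`.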
### VAD filter causes hallucinations
- `vad_filter=true` silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
- **Fix**: Remove `vad_filter` entirely
---
## Remaining Known Issues
### 1. Short-token hallucination loops (unfixable by entropy_thold)
- `entropy_thold` is only evaluated when `result_len > 32` output tokens
- Short loops like `kas`, `sick`, `Bye.` (each 1 token) are **never caught**, no matter how low you set the threshold
- Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
- **Possible future fix**: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
- `compression_ratio_thold` may also help but wasn't tested
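
Should the post-processing option ever be revisited, collapsing consecutive identical segments could look roughly like the sketch below. The `Segment` struct and `collapse_repeats` function are illustrative names, not part of the current code base, and the current decision remains raw output only.

```rust
/// Illustrative segment shape; the real output type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_s: f64,
    end_s: f64,
    text: String,
}

/// Merges runs of consecutive segments with identical (trimmed) text into one
/// segment spanning the whole run. Non-adjacent repeats are left untouched.
fn collapse_repeats(segments: &[Segment]) -> Vec<Segment> {
    let mut out: Vec<Segment> = Vec::new();
    for seg in segments {
        let is_repeat = out
            .last()
            .map_or(false, |prev| prev.text.trim() == seg.text.trim());
        if is_repeat {
            // Extend the previous segment instead of emitting a duplicate.
            if let Some(prev) = out.last_mut() {
                prev.end_s = seg.end_s;
            }
        } else {
            out.push(seg.clone());
        }
    }
    out
}
```

Applied to the 'kas' ×12 run, this would leave a single 'kas' segment covering the full time span. It also merges legitimate repetitions such as a speaker saying "Yes." twice, which is one reason to keep it opt-in.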
### 2. Large word gaps
- Largest: 439 words at ~68 min, 328 words at ~80 min, then three gaps of ~250-293 words
- These are chunks where whisper produced off-topic or repetitive output instead of real content
- Likely caused by speaker overlap, audience noise, or poor audio quality in those windows
- **Possible future fix**: retry failed chunks at a smaller scope (30s), detecting them by low confidence score or segment density
### 3. CUDA device ordering inversion
- `nvidia-smi`: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
- `whisper.cpp` on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
- Inside Docker: matches nvidia-smi order
- Health endpoint uses nvml (nvidia-smi ordering) → reports wrong GPU name when running on host
- **Workaround**: `CUDA_DEVICE=1` on host to target RTX 2080 SUPER
### 4. Cold GPU: first job returns 0 segments in ~0.5s (intermittent, after container restart)
CUDA JIT-compiles kernels on the first call to `whisper_full_with_state`. On a cold GPU this compilation blocks/disrupts the decode pipeline mid-inference, causing an immediate return with 0 segments.
**Fix**: Run a 1-second silent warmup inference in `Transcriber::load()`. This forces JIT compilation at startup so the first real job runs on fully compiled kernels.
---
## Whisper Parameter Tuning Notes
Current values in `src/transcriber.rs`:
```
beam_size = 5, patience = 1.0
entropy_thold = 3.5 (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
logprob_thold = -1.0 (rejects very low confidence segments)
temperature_inc = 0.2 (fallback temperature increment on failure)
no_context = true (prevents context from one chunk poisoning the next)
suppress_non_speech_tokens = true
suppress_blank = true
language = None (auto-detect + transcribe)
```
**What NOT to set:**
- `vad_filter=true` → hallucination loops on quiet speech
- `detect_language=true` → returns 0 segments, transcription never runs
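
To make the `entropy_thold` reasoning concrete, the sketch below computes the Shannon entropy of the last 32 tokens and skips the check for shorter outputs. It is an illustration under stated assumptions, not whisper.cpp's code: `window_entropy_bits` is a hypothetical function, and it measures entropy in bits to match the log₂(9) figure quoted above; whether whisper.cpp uses the same logarithm base has not been verified here.

```rust
use std::collections::HashMap;

/// Entropy (in bits) of the token distribution over the last `window` tokens.
/// Returns None when the output is not longer than the window, mirroring the
/// "only evaluated when result_len > 32" behaviour described in this document.
fn window_entropy_bits(tokens: &[u32], window: usize) -> Option<f64> {
    if window == 0 || tokens.len() <= window {
        return None; // too short: the entropy check never runs
    }
    let tail = &tokens[tokens.len() - window..];
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tail {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = window as f64;
    let entropy = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>();
    Some(entropy)
}
```

Under these assumptions a 9-token loop that fills the window scores at or just below log₂(9) ≈ 3.17, under the 3.5 threshold, so it is rejected. A one-token loop would score 0, but its output never exceeds 32 tokens, so the function returns `None` and nothing is rejected.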
---
## Audio Pre-Processing Pipeline
1. **Download**: yt-dlp → MP3
2. **Convert**: ffmpeg → 16kHz mono WAV (whisper native format)
3. **Silence detection**: ffmpeg `silencedetect` filter at -35dB / 0.4s min duration
4. **Chunking**: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
5. **Trim trailing silence** per chunk: -35dB threshold, 0.5s padding (applied before whisper)
6. **Transcribe** each chunk independently, offset timestamps, concatenate
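
The chunking rule in step 4 can be sketched as below. This is a minimal illustration rather than the pipeline's actual code: `chunk_cuts` and its parameters are hypothetical names, and the silence midpoints are assumed to have already been parsed from the `silencedetect` output into seconds.

```rust
/// Returns the cut points (in seconds) that split `duration_s` of audio into chunks.
/// Each cut aims at `target_s` after the previous one and snaps to the nearest
/// silence midpoint within ±`snap_s`; with no silence in range it hard-cuts.
fn chunk_cuts(duration_s: f64, silence_mids: &[f64], target_s: f64, snap_s: f64) -> Vec<f64> {
    let mut cuts = Vec::new();
    if target_s <= 0.0 {
        return cuts; // a non-positive target would never advance
    }
    let mut prev = 0.0;
    while prev + target_s < duration_s {
        let ideal = prev + target_s;
        // Nearest silence midpoint inside the snap window and after the previous cut.
        let snapped = silence_mids
            .iter()
            .copied()
            .filter(|&m| m > prev && m < duration_s && (m - ideal).abs() <= snap_s)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()));
        let cut = snapped.unwrap_or(ideal); // fallback: hard cut at the target
        cuts.push(cut);
        prev = cut;
    }
    cuts
}
```

With the values from this pipeline the call would be `chunk_cuts(duration, &mids, 60.0, 30.0)`. The chunks are then the intervals between consecutive cuts, and each cut doubles as the timestamp offset to add back when concatenating the transcribed chunks in step 6.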
**Why chunking helps:** Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.
**Chunk size trade-off:**
- Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
- Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
- Current sweet spot: 60s. If 'kas'-type loops (see Remaining Known Issues, item 1) are a priority, try 90-120s.
---
## Potential Future Improvements (Prioritized)
1. **Retry bad chunks at smaller scope** — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
2. **Increase chunk size to 90-120s** — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
3. **compression_ratio_thold** — may catch short-token loops that entropy_thold misses; test values around 2.0-2.4
4. **Adaptive snap window** — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
5. **Per-segment confidence scoring** — expose avg_logprob per segment in the JSON output for downstream filtering
6. **Multiple model support** — medium model for speed, large-v3 for quality; selectable per job
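
For item 1, the detection step might look like the sketch below. Everything here is an assumption made for illustration: `ChunkStats`, `needs_retry`, the use of words per minute as the density measure, and any threshold values passed in would all need tuning against real transcripts.

```rust
/// Illustrative per-chunk summary; field names are hypothetical.
struct ChunkStats {
    duration_s: f64,
    word_count: usize,
    avg_logprob: f64,
}

/// Flags a chunk for a retry at smaller scope when its word density or its
/// average log-probability falls below the given thresholds.
fn needs_retry(stats: &ChunkStats, min_words_per_min: f64, min_avg_logprob: f64) -> bool {
    if stats.duration_s <= 0.0 {
        return false; // nothing to measure
    }
    let words_per_min = stats.word_count as f64 / (stats.duration_s / 60.0);
    words_per_min < min_words_per_min || stats.avg_logprob < min_avg_logprob
}
```

A chunk that is genuinely silent would also trip the density check, so a real implementation would probably combine this with the silence detection already done during pre-processing.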