All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m39s
GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
130 lines
6.7 KiB
Markdown
130 lines
6.7 KiB
Markdown
# Whisper RTX2080 — Lessons Learned & Improvement Notes
|
||
|
||
## Quality Baseline (as of 2026-05-06)
|
||
|
||
Audio: 101-minute YouTube conference talk (Unblocked — Peter Werry)
|
||
Model: ggml-large-v3, chunking at 60s on silence boundaries
|
||
|
||
| Metric | Score |
|
||
|-------------------|--------|
|
||
| WER | 9.3% |
|
||
| Word coverage | 93.1% |
|
||
| 1-gram F1 | 94.9% |
|
||
| 3-gram F1 | 84.7% |
|
||
| 5-gram F1 | 77.5% |
|
||
|
||
---
|
||
|
||
## Cold GPU Warmup — First Job Returns 0 Segments in ~0.5s
|
||
|
||
**Severity: Critical (production issue, intermittent, hard to diagnose)**
|
||
|
||
**Symptom:** After a container restart, the very first submitted job completes in ~0.5 seconds and returns 0 segments. Subsequent jobs work correctly.
|
||
|
||
**Root cause:** CUDA JIT-compiles its kernels on the **first** call to `whisper_full_with_state`. On a cold GPU, this compilation happens mid-inference and blocks/disrupts the decode pipeline, causing whisper to return immediately with 0 segments.
|
||
|
||
**Why language detection can still succeed:** Language detection uses only a small mel-spectrogram + encoder pass on the first 30 seconds of audio. Some of these kernels may already be compiled or cached from a prior session. The full decoder kernels (the heavier ones) are what get JIT-compiled on the first full inference.
|
||
|
||
**Fix:** In `Transcriber::load()`, after creating the state, run a 1-second silent inference pass:
|
||
```rust
|
||
let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz
|
||
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
|
||
wp.set_language(Some("en"));
|
||
let _ = state.full(wp, &silence); // forces CUDA JIT — 0 segments expected
|
||
tracing::info!("GPU warmup complete");
|
||
```
|
||
This forces all CUDA kernel compilation at startup. The first real job then runs on fully compiled kernels. Startup takes a few seconds longer but every job is reliable.
|
||
|
||
---
|
||
|
||
### `set_detect_language(true)` is NOT "auto-detect and transcribe"
|
||
- `whisper.cpp` source: `if (params.detect_language) { return 0; }` — it exits immediately after language detection, returns 0 segments
|
||
- **Correct API**: `fp.set_language(None)` → passes `language = NULL` to whisper.cpp, which auto-detects AND transcribes
|
||
- `set_detect_language(true)` is only for language identification workflows, not transcription
|
||
- This caused 0-segment regressions on every job submitted without an explicit `language=` param
|
||
|
||
### VAD filter causes hallucinations
|
||
- `vad_filter=true` silences quiet audience speech → whisper fills the void with "Okay." hallucinations at ~1s intervals
|
||
- **Fix**: Remove `vad_filter` entirely
|
||
|
||
---
|
||
|
||
## Remaining Known Issues
|
||
|
||
### 1. Short-token hallucination loops (unfixable by entropy_thold)
|
||
- `entropy_thold` is only evaluated when `result_len > 32` output tokens
|
||
- Short loops like `kas`, `sick`, `Bye.` (each 1 token) are **never caught**, no matter how low you set the threshold
|
||
- Current occurrences: 'kas' ×12 at ~2801s, 'sick' ×4 at ~4540s, 'Bye.' ×10 at ~6070s
|
||
- **Possible future fix**: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
|
||
- `compression_ratio_thold` may also help but wasn't tested
|
||
|
||
### 4. Cold GPU: first job returns 0 segments in ~0.5s (intermittent, after container restart)
|
||
|
||
CUDA JIT-compiles kernels on the first call to `whisper_full_with_state`. On a cold GPU this compilation blocks/disrupts the decode pipeline mid-inference, causing an immediate return with 0 segments.
|
||
|
||
**Fix**: Run a 1-second silent warmup inference in `Transcriber::load()`. This forces JIT compilation at startup so the first real job runs on fully compiled kernels.
|
||
|
||
---
|
||
|
||
- Largest: 439 words at ~68 min, 328 words at ~80 min, then 3 × ~293-250 word gaps
|
||
- These are chunks where whisper produced off-topic or repetitive output instead of real content
|
||
- Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
|
||
- **Possible future fix**: retry failed chunks at smaller scope (30s), detect by low-confidence score or segment density
|
||
|
||
### 3. CUDA device ordering inversion
|
||
- `nvidia-smi`: GPU0=RTX 2080 SUPER, GPU1=RTX 3060
|
||
- `whisper.cpp` on host: Device 0=RTX 3060, Device 1=RTX 2080 SUPER (inverted vs nvidia-smi)
|
||
- Inside Docker: matches nvidia-smi order
|
||
- Health endpoint uses nvml (nvidia-smi ordering) → reports wrong GPU name when running on host
|
||
- **Workaround**: `CUDA_DEVICE=1` on host to target RTX 2080 SUPER
|
||
|
||
---
|
||
|
||
## Whisper Parameter Tuning Notes
|
||
|
||
Current values in `src/transcriber.rs`:
|
||
|
||
```
|
||
beam_size = 5, patience = 1.0
|
||
entropy_thold = 3.5 (catches ~9-word phrase loops, theoretical entropy ≈ log₂(9) ≈ 3.17)
|
||
logprob_thold = -1.0 (rejects very low confidence segments)
|
||
temperature_inc = 0.2 (fallback temperature increment on failure)
|
||
no_context = true (prevents context from one chunk poisoning the next)
|
||
suppress_non_speech_tokens = true
|
||
suppress_blank = true
|
||
language = None (auto-detect + transcribe)
|
||
```
|
||
|
||
**What NOT to set:**
|
||
- `vad_filter=true` → hallucination loops on quiet speech
|
||
- `detect_language=true` → returns 0 segments, transcription never runs
|
||
|
||
---
|
||
|
||
## Audio Pre-Processing Pipeline
|
||
|
||
1. **Download**: yt-dlp → MP3
|
||
2. **Convert**: ffmpeg → 16kHz mono WAV (whisper native format)
|
||
3. **Silence detection**: ffmpeg `silencedetect` filter at -35dB / 0.4s min duration
|
||
4. **Chunking**: target 60s, snap to nearest silence midpoint within ±30s window, fallback to hard cut
|
||
5. **Trim trailing silence** per chunk: -35dB threshold, 0.5s padding (applied before whisper)
|
||
6. **Transcribe** each chunk independently, offset timestamps, concatenate
|
||
|
||
**Why chunking helps:** Whisper hallucinations compound over time. Starting each chunk fresh limits how far a bad segment can spread.
|
||
|
||
**Chunk size trade-off:**
|
||
- Smaller (60s): less hallucination spread, but short isolated sections (e.g. someone spelling a name) lose context
|
||
- Larger (180s): more context, handles short sections better, but hallucinations can corrupt more content
|
||
- Current sweet spot: 60s. If 'KAS'-type issues are a priority, try 90-120s.
|
||
|
||
---
|
||
|
||
## Potential Future Improvements (Prioritized)
|
||
|
||
1. **Retry bad chunks at smaller scope** — detect low-quality output (by segment density or avg logprob) and re-run the chunk at 30s windows
|
||
2. **Increase chunk size to 90-120s** — better context for short proper nouns / name spelling; test if hallucination spread stays acceptable
|
||
3. **compression_ratio_thold** — may catch short-token loops that entropy_thold misses; test values around 2.0-2.4
|
||
4. **Adaptive snap window** — if no silence in ±30s, try ±45s before hard-cutting; reduces long unbroken speech chunks
|
||
5. **Per-segment confidence scoring** — expose avg_logprob per segment in the JSON output for downstream filtering
|
||
6. **Multiple model support** — medium model for speed, large-v3 for quality; selectable per job
|