fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding

GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:57:30 +02:00
parent d5a88d1866
commit fd8d4deefb
4 changed files with 252 additions and 9 deletions
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -4,7 +4,39 @@ This document records all non-obvious behaviour, surprising bugs, hardware quirk

 ---

-## whisper.cpp
+### Cold GPU: first job returns 0 segments in ~0.5s after container restart
+
+**Symptom:** After container restart, the first submitted job completes in ~0.5s and returns 0 segments. Language is detected correctly. All subsequent jobs work fine.
+
+**Root cause:** CUDA JIT-compiles its device kernels on the first call to `whisper_full_with_state`. On a cold GPU, this compilation happens synchronously mid-inference and disrupts the decode pipeline, causing it to return immediately with 0 results.
+
+**Why subsequent jobs are fine:** Compiled kernels are cached in the CUDA driver for the lifetime of the process. Once the first (warmup) call completes, all further calls use the cached compiled kernels.
+
+**Why language detection can succeed on the same call:** Language detection uses a mel-spectrogram + encoder pass on the first 30s of audio. These lighter kernels may compile faster or be partially cached, while the full decoder kernels (the heavier path) are what causes the failure.
+
+**Fix (in `Transcriber::load()`):**
+```rust
+let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz — just enough to trigger kernel compilation
+let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
+wp.set_language(Some("en"));
+wp.set_print_progress(false);
+let _ = state.full(wp, &silence); // 0 segments expected; side-effect is the goal
+tracing::info!("GPU warmup complete");
+```
+
+**Also fixed simultaneously:** `create_state()` was called per-chunk (~700 MB GPU allocation each time), causing VRAM churn under concurrent processes. State is now created once and reused. See `WhisperState` reuse section above.
+
+---
+
+### `language=auto` is not a valid API parameter
+
+Passing `language=auto` in the multipart form is silently incorrect. The `language` field expects an ISO 639-1 code (e.g. `en`, `fr`) or should be **omitted entirely** for auto-detection. Passing "auto" causes whisper-rs to pass the string "auto" as a language code, which whisper.cpp does not recognise and may fallback in undefined ways.
+
+**Correct usage:**
+- Auto-detect: omit the `language` field entirely
+- Explicit: `language=en`
+
+---

 ### `detect_language=true` is a language-ID-only mode — NOT "auto-detect and transcribe"