fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m39s
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m39s
GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
31
KNOWLEDGE.md
31
KNOWLEDGE.md
@@ -15,7 +15,27 @@ Model: ggml-large-v3, chunking at 60s on silence boundaries
|
||||
|
||||
---
|
||||
|
||||
## Critical Bugs Found & Fixed
|
||||
## Cold GPU Warmup — First Job Returns 0 Segments in ~0.5s
|
||||
|
||||
**Severity: Critical (production issue, intermittent, hard to diagnose)**
|
||||
|
||||
**Symptom:** After a container restart, the very first submitted job completes in ~0.5 seconds and returns 0 segments. Subsequent jobs work correctly.
|
||||
|
||||
**Root cause:** CUDA JIT-compiles its kernels on the **first** call to `whisper_full_with_state`. On a cold GPU, this compilation happens mid-inference and blocks/disrupts the decode pipeline, causing whisper to return immediately with 0 segments.
|
||||
|
||||
**Why language detection can still succeed:** Language detection uses only a small mel-spectrogram + encoder pass on the first 30 seconds of audio. Some of these kernels may already be compiled or cached from a prior session. The full decoder kernels (the heavier ones) are what get JIT-compiled on the first full inference.
|
||||
|
||||
**Fix:** In `Transcriber::load()`, after creating the state, run a 1-second silent inference pass:
|
||||
```rust
|
||||
let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz
|
||||
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
|
||||
wp.set_language(Some("en"));
|
||||
let _ = state.full(wp, &silence); // forces CUDA JIT — 0 segments expected
|
||||
tracing::info!("GPU warmup complete");
|
||||
```
|
||||
This forces all CUDA kernel compilation at startup. The first real job then runs on fully compiled kernels. Startup takes a few seconds longer but every job is reliable.
|
||||
|
||||
---
|
||||
|
||||
### `set_detect_language(true)` is NOT "auto-detect and transcribe"
|
||||
- `whisper.cpp` source: `if (params.detect_language) { return 0; }` — it exits immediately after language detection, returns 0 segments
|
||||
@@ -38,7 +58,14 @@ Model: ggml-large-v3, chunking at 60s on silence boundaries
|
||||
- **Possible future fix**: post-process to collapse consecutive identical segments (user declined this for now — raw output only)
|
||||
- `compression_ratio_thold` may also help but wasn't tested
|
||||
|
||||
### 2. Five significant content gaps (~1600 words total)
|
||||
### 4. Cold GPU: first job returns 0 segments in ~0.5s (intermittent, after container restart)
|
||||
|
||||
CUDA JIT-compiles kernels on the first call to `whisper_full_with_state`. On a cold GPU this compilation blocks/disrupts the decode pipeline mid-inference, causing an immediate return with 0 segments.
|
||||
|
||||
**Fix**: Run a 1-second silent warmup inference in `Transcriber::load()`. This forces JIT compilation at startup so the first real job runs on fully compiled kernels.
|
||||
|
||||
---
|
||||
|
||||
- Largest: 439 words at ~68 min, 328 words at ~80 min, then 3 × ~293-250 word gaps
|
||||
- These are chunks where whisper produced off-topic or repetitive output instead of real content
|
||||
- Likely caused by: speaker overlap, audience noise, or poor audio quality in those windows
|
||||
|
||||
Reference in New Issue
Block a user