GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
12 KiB
Findings, Quirks & Research Notes
This document records all non-obvious behaviour, surprising bugs, hardware quirks, and research findings discovered during the development and testing of this project. It exists so we don't rediscover the same things twice.
Cold GPU: first job returns 0 segments in ~0.5s after container restart
Symptom: After container restart, the first submitted job completes in ~0.5s and returns 0 segments. Language is detected correctly. All subsequent jobs work fine.
Root cause: CUDA JIT-compiles its device kernels on the first call to whisper_full_with_state. On a cold GPU, this compilation happens synchronously mid-inference and disrupts the decode pipeline, causing it to return immediately with 0 results.
Why subsequent jobs are fine: Compiled kernels are cached in the CUDA driver for the lifetime of the process. Once the first (warmup) call completes, all further calls use the cached compiled kernels.
Why language detection can succeed on the same call: Language detection uses a mel-spectrogram + encoder pass on the first 30s of audio. These lighter kernels may compile faster or be partially cached, while the full decoder kernels (the heavier path) are what causes the failure.
Fix (in Transcriber::load()):
let silence = vec![0.0f32; 16_000]; // 1s @ 16 kHz — just enough to trigger kernel compilation
let mut wp = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
wp.set_language(Some("en"));
wp.set_print_progress(false);
let _ = state.full(wp, &silence); // 0 segments expected; side-effect is the goal
tracing::info!("GPU warmup complete");
Also fixed simultaneously: create_state() was called per-chunk (~700 MB GPU allocation each time), causing VRAM churn under concurrent processes. State is now created once and reused. See WhisperState reuse section above.
language=auto is not a valid API parameter
Passing language=auto in the multipart form is silently incorrect. The language field expects an ISO 639-1 code (e.g. en, fr) or should be omitted entirely for auto-detection. Passing "auto" causes whisper-rs to pass the string "auto" as a language code, which whisper.cpp does not recognise and may fallback in undefined ways.
Correct usage:
- Auto-detect: omit the
languagefield entirely - Explicit:
language=en
detect_language=true is a language-ID-only mode — NOT "auto-detect and transcribe"
Severity: Critical (was a production regression)
In whisper.cpp (whisper_full_with_state):
if (params.detect_language) {
return 0; // exits immediately after language detection
}
Setting detect_language=true causes whisper to auto-detect the language, print it to stderr, and then return 0 without running the decoder. The result is always 0 segments.
The whisper-rs docs suggest this is equivalent to auto-detect — it is not.
Correct API for auto-detect + transcription:
fp.set_language(None); // passes language = NULL to whisper.cpp → auto-detects AND transcribes
Wrong:
fp.set_detect_language(true); // language-ID mode only — 0 segments returned
This bug caused every job submitted without an explicit language= parameter to return 0 segments after commit 35e7ea8. Fixed in 6327ffc.
no_speech_thold is not implemented
The whisper.cpp header exposes no_speech_thold as a parameter, but the source contains a // TODO: not implemented comment on the actual check. Calling fp.set_no_speech_thold(...) has no effect. Do not rely on it.
entropy_thold only fires when result_len > 32
whisper's entropy check (which triggers temperature retry on repetitive output) is only evaluated when the segment has more than 32 output tokens. This means:
- Short hallucination loops of 1-2 words (e.g. "kas", "sick", "Bye.") are never caught, no matter how low you set the threshold
- The check is useful for medium-length loops (~9 word phrases have theoretical entropy ≈ log₂(9) ≈ 3.17)
- Default
entropy_thold=2.4catches 1-4 unique-token loops; we raised it to 3.5 to also catch 9-word phrase loops - The retry schedule uses
temperature_inc: on failure, whisper retries with temp += 0.2 until temp=1.0
vad_filter=true causes "Okay." hallucinations
When VAD (Voice Activity Detection) filter is enabled, whisper silences quiet sections before feeding to the decoder. For conference recordings with audience reactions or low-volume speakers, this causes whisper to fill the resulting void with short filler tokens ("Okay.", "Yeah.", "Thank you.") at ~1s intervals.
Do not use vad_filter on recordings with ambient audience sound or variable volume speakers.
Flash Attention (flash_attn=true) causes 0 segments on some audio
Flash attention is disabled with a comment. When tested on real-world conference recordings (noisy MP3s), it silently returned 0 segments on certain audio windows. The root cause was not fully investigated. Safe to disable; the performance benefit is marginal for large-v3.
no_context=true is essential for chunked processing
When no_context=false (default), whisper uses the transcript from the previous full() call as an initial prompt for the next one. In our pipeline, each chunk is a separate full() call. Without no_context=true, a hallucinated phrase from chunk N gets fed as a prompt into chunk N+1, poisoning it. This can cascade across the entire transcript.
Timestamps are in centiseconds internally
whisper.cpp returns t0 and t1 as integer centiseconds. The conversion to seconds is:
let start = state.full_get_segment_t0(i)? as f32 / 100.0;
This is not documented prominently. The divide-by-100 is critical — omitting it gives timestamps 100× too large.
full_n_segments_from_state vs full_n_segments
Two versions of this function exist in whisper-rs:
full_n_segments_from_state(&state)— correct; reads from the state created bycreate_state()full_n_segments(&ctx)— reads from the context's internal state (used for single-threaded, non-state-based calls)
Since we use create_state() + state.full(fp, pcm), always use the _from_state variants. Using the wrong variant returns stale or zero results from a previous inference.
CUDA / Hardware
CUDA device index ordering differs between host and Docker
On the development machine:
nvidia-smi: GPU 0 = RTX 2080 SUPER (8 GB), GPU 1 = RTX 3060 (12 GB)whisper.cppCUDA on host: Device 0 = RTX 3060, Device 1 = RTX 2080 SUPER (inverted)whisper.cppCUDA inside Docker: Device 0 = RTX 2080 SUPER (matches nvidia-smi)
The inversion on the host is caused by CUDA_DEVICE_ORDER not being set to PCI_BUS_ID. The Docker image explicitly sets CUDA_DEVICE_ORDER=PCI_BUS_ID, which forces the expected ordering.
To target RTX 2080 SUPER on host: CUDA_DEVICE=1
Inside Docker: CUDA_DEVICE=0
The /health endpoint queries GPU info via nvidia-smi --id=<device> which uses the nvidia-smi (PCI_BUS_ID) ordering. When running on the host with CUDA_DEVICE=1, the health endpoint correctly reports RTX 2080 SUPER.
RTX 2080 is Turing (sm_75) — not Ampere
The RTX 2080 (non-Super, non-Ti) uses the Turing architecture, compute capability sm_75. This is relevant because:
- Some CUDA kernels are only compiled for sm_80+ (Ampere) by default
CMAKE_CUDA_ARCHITECTURES=75must be set explicitly, otherwise the build falls back to a generic/slower kernel or failsGGML_CUDA_FORCE_MMQ=ONenables the matrix-multiply-quantized kernels that are Turing-optimised
VRAM allocation: ~5 GB for large-v3
The ggml-large-v3.bin model occupies approximately 5-6 GB of VRAM on the RTX 2080's 8 GB pool. This leaves ~2 GB for CUDA workspace, which is sufficient for f16 inference with beam_size=5. Do not run two instances on the same GPU simultaneously.
Audio Processing
ffmpeg silencedetect logs to stderr, not stdout
When running ffmpeg -af silencedetect=n=-35dB:d=0.4 -f null -, silence events are printed to stderr (not stdout), in this format:
[silencedetect @ 0x...] silence_start: 12.345
[silencedetect @ 0x...] silence_end: 13.456 | silence_duration: 1.111
The parser must read output.stderr, not output.stdout.
Whisper requires exactly 16kHz mono f32 PCM
whisper.cpp's full() function expects:
- Sample rate: exactly 16,000 Hz
- Channels: 1 (mono)
- Format: f32 little-endian (values in [-1.0, 1.0])
Deviating from any of these silently produces garbage output. The ffmpeg decode command:
ffmpeg -i <input> -f f32le -ac 1 -ar 16000 -
converts any input format to this exactly.
MP3 is fully supported by ffmpeg → whisper
whisper itself only accepts PCM; it has no MP3 decoder. But since we always decode through ffmpeg first, any format ffmpeg supports (MP3, AAC, FLAC, OGG, WAV, M4A, WEBM, etc.) works as input. There is no codec-level restriction.
Chunking trade-offs
| Chunk size | Pros | Cons |
|---|---|---|
| 30s | Hallucinations contained to tiny window | Very short proper nouns / spellings get no context |
| 60s | Good balance; ~2× whisper's native 30s window | Isolated 10s sections (e.g. name spelling) still lack context |
| 120-180s | Better context for short sections | Hallucinations can corrupt larger content blocks |
Current setting: 60s. Snap window: ±30s from the target cut point.
The snap-to-silence algorithm avoids micro-chunks (<5s) and trailing slivers (<25% of target) by stopping early.
Quality Findings
Quality baseline on 101-minute conference recording (ggml-large-v3)
Reference: human-corrected transcript (12,894 words)
| Metric | Score |
|---|---|
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |
The 3-gram F1 is the most informative single metric for conference transcription: it captures both word accuracy and local phrase fidelity without being overly sensitive to exact phrasing.
Remaining failure modes
| Pattern | Location | Root cause | Fixable? |
|---|---|---|---|
| 'KAS' ×12 | ~2801s | Speaker spells "K-A-S"; 60s chunk isolates 10s section with no context | Increase chunk size to 90s |
| 'sick' ×4 | ~4540s | Single 9s segment; result_len < 32 → entropy check skipped |
compression_ratio_thold? |
| 'Bye.' ×10 | ~6070s | Speaker says goodbye multiple times at end; trailing silence trim can't help — real audio | Post-processing dedup (declined by user) |
| 5 content gaps | Various | Chunk windows with noisy/overlapping audio → whisper skips content | Retry at 30s scope |
Rust / Library Quirks
whisper-rs 0.13 bundles whisper.cpp source
whisper-rs-sys includes the full whisper.cpp source tree inside the crate. The build is entirely self-contained — no internet access is needed during cargo build once the registry cache is warm. The whisper.cpp version is pinned by the crate version; updating whisper.cpp requires bumping the whisper-rs dependency.
BroadcastStream silently drops lagged receivers
tokio_stream::wrappers::BroadcastStream returns Err(RecvError::Lagged(n)) when a subscriber falls behind. This is filtered to None in our SSE adapter, which silently drops the lagged events. Clients that can't keep up with the SSE stream will miss progress events but will still receive the final done event (or can poll GET /jobs/:id).
DashMap as ProgressRegistry
DashMap<JobId, broadcast::Sender<ProgressEvent>> provides lock-free concurrent map access. Senders are cleaned up 30 seconds after a job completes (see the sleep(30s) + registry.remove() in worker::run). The 30-second window gives SSE subscribers time to receive the done event before the channel is dropped.