# Findings, Quirks & Research Notes

This document records all non-obvious behaviour, surprising bugs, hardware quirks, and research findings discovered during the development and testing of this project. It exists so we don't rediscover the same things twice.

---

## whisper.cpp

### `detect_language=true` is a language-ID-only mode, NOT "auto-detect and transcribe"

**Severity: Critical (was a production regression)**

In `whisper.cpp` (`whisper_full_with_state`):

```c
if (params.detect_language) {
    return 0; // exits immediately after language detection
}
```

Setting `detect_language=true` makes whisper detect the language, print it to stderr, and then **return 0 without running the decoder**. The result is always 0 segments.

The `whisper-rs` docs suggest this is equivalent to auto-detect. **It is not.**

**Correct API for auto-detect + transcription:**

```rust
fp.set_language(None); // passes language = NULL to whisper.cpp → auto-detects AND transcribes
```

**Wrong:**

```rust
fp.set_detect_language(true); // language-ID mode only; 0 segments returned
```

This bug caused every job submitted without an explicit `language=` parameter to return 0 segments after commit `35e7ea8`. Fixed in `6327ffc`.
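A minimal sketch of how a job's optional `language=` parameter could be mapped onto the value passed to `set_language`, so that a missing parameter never needs `set_detect_language`. The helper name `language_for_whisper` and the decision to treat an empty string or `"auto"` as "no language given" are assumptions for illustration, not taken from the project's code.

```rust
/// Map a job's optional `language=` parameter to the argument for
/// `set_language`. `None` means "auto-detect AND transcribe".
/// Hypothetical helper: treating "" and "auto" as auto-detect is an assumption.
fn language_for_whisper(param: Option<&str>) -> Option<&str> {
    match param.map(str::trim) {
        None | Some("") | Some("auto") => None,
        Some(code) => Some(code),
    }
}

// Intended use (whisper-rs call as shown above):
//     fp.set_language(language_for_whisper(job_language));
```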

---

### `no_speech_thold` is not implemented

The `whisper.cpp` header exposes `no_speech_thold` as a parameter, but the source marks the actual check with a `// TODO: not implemented` comment. Calling `fp.set_no_speech_thold(...)` therefore has no effect in the version we build against. Do not rely on it.

---

### `entropy_thold` only fires when `result_len > 32`

whisper's entropy check (which triggers a temperature retry on repetitive output) is only evaluated when the segment has more than 32 output tokens. This means:

- Short hallucination loops of 1-2 words (e.g. "kas", "sick", "Bye.") are **never caught**, no matter how high you set the threshold
- The check is useful for medium-length loops (a loop over 9 distinct words has a theoretical entropy of about log₂(9) ≈ 3.17)
- The default `entropy_thold=2.4` catches loops of 1-4 unique tokens; we raised it to 3.5 to also catch 9-word phrase loops
- The retry schedule uses `temperature_inc`: on failure, whisper retries with temp += 0.2 until temp = 1.0
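The gate and the threshold arithmetic above can be illustrated with a small standalone sketch. It is not whisper.cpp's code: the function names are hypothetical, and it assumes the entropy is measured in bits (log base 2) over the last 32 tokens, matching the log₂ figures quoted above. If the real implementation uses the natural logarithm, every value is smaller by a factor of ln 2 ≈ 0.693, so the thresholds would need to be re-derived against the source.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits (log base 2, an assumption matching the figures
/// above) of the token distribution over the last `window` tokens.
fn token_entropy_bits(tokens: &[u32], window: usize) -> f64 {
    let tail = &tokens[tokens.len().saturating_sub(window)..];
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in tail {
        *counts.entry(t).or_insert(0) += 1;
    }
    let n = tail.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Mirrors the gate described above: segments with 32 tokens or fewer are
/// never flagged, whatever the threshold.
fn fails_entropy_check(tokens: &[u32], entropy_thold: f64) -> bool {
    tokens.len() > 32 && token_entropy_bits(tokens, 32) < entropy_thold
}
```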

---

### `vad_filter=true` causes "Okay." hallucinations

When the VAD (Voice Activity Detection) filter is enabled, quiet sections are silenced before the audio is fed to the decoder. For conference recordings with audience reactions or low-volume speakers, whisper then fills the resulting void with short filler tokens ("Okay.", "Yeah.", "Thank you.") at ~1s intervals.

**Do not use `vad_filter`** on recordings with ambient audience sound or speakers of varying volume.

---

### Flash Attention (`flash_attn=true`) causes 0 segments on some audio

Flash attention is currently disabled (the code carries a comment to that effect). When tested on real-world conference recordings (noisy MP3s), it silently returned 0 segments on certain audio windows. The root cause was not fully investigated. It is safe to leave disabled; the performance benefit is marginal for large-v3.

---

### `no_context=true` is essential for chunked processing

When `no_context=false` (the default), whisper uses the transcript from the previous `full()` call as the initial prompt for the next one. In our pipeline, each chunk is a separate `full()` call. Without `no_context=true`, a hallucinated phrase from chunk N is fed as a prompt into chunk N+1 and poisons it. This can cascade across the entire transcript.

---

### Timestamps are in centiseconds internally

whisper.cpp returns `t0` and `t1` as integer centiseconds. The conversion to seconds is:

```rust
let start = state.full_get_segment_t0(i)? as f32 / 100.0;
```

This is not documented prominently. The divide-by-100 is critical: omitting it gives timestamps 100× too large.
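As a small illustration of the unit handling, the conversion can be wrapped in helpers so the divide-by-100 lives in one place. The names `centis_to_seconds` and `format_timestamp` are hypothetical, and the sketch assumes non-negative timestamps. It uses `f64` rather than `f32` to keep sub-centisecond precision on long recordings.

```rust
/// Convert a whisper.cpp timestamp (integer centiseconds) to seconds.
fn centis_to_seconds(t_cs: i64) -> f64 {
    t_cs as f64 / 100.0
}

/// Format a non-negative centisecond timestamp as HH:MM:SS.cc.
fn format_timestamp(t_cs: i64) -> String {
    let cs = t_cs % 100;
    let total_s = t_cs / 100;
    let (h, m, s) = (total_s / 3600, (total_s / 60) % 60, total_s % 60);
    format!("{h:02}:{m:02}:{s:02}.{cs:02}")
}
```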

---

### `full_n_segments_from_state` vs `full_n_segments`

Two versions of this function exist in `whisper-rs`:

- `full_n_segments_from_state(&state)`: correct for our pipeline; reads from the state created by `create_state()`
- `full_n_segments(&ctx)`: reads from the context's internal state (used for single-threaded, non-state-based calls)

Since we use `create_state()` + `state.full(fp, pcm)`, always use the `_from_state` variants. Using the wrong variant returns stale or zero results from a previous inference.

---

## CUDA / Hardware

### CUDA device index ordering differs between host and Docker

On the development machine:

- `nvidia-smi`: GPU 0 = RTX 2080 SUPER (8 GB), GPU 1 = RTX 3060 (12 GB)
- `whisper.cpp` CUDA on **host**: Device 0 = RTX 3060, Device 1 = RTX 2080 SUPER (**inverted**)
- `whisper.cpp` CUDA **inside Docker**: Device 0 = RTX 2080 SUPER (**matches nvidia-smi**)

The inversion on the host is caused by `CUDA_DEVICE_ORDER` not being set to `PCI_BUS_ID`. The Docker image explicitly sets `CUDA_DEVICE_ORDER=PCI_BUS_ID`, which forces the expected ordering.

- **To target RTX 2080 SUPER on host**: `CUDA_DEVICE=1`
- **Inside Docker**: `CUDA_DEVICE=0`

The `/health` endpoint queries GPU info via `nvidia-smi --id=<device>`, which uses the nvidia-smi (PCI_BUS_ID) ordering. When running on the host with `CUDA_DEVICE=1`, the health endpoint correctly reports RTX 2080 SUPER.
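One way to guard against this mismatch is to check `CUDA_DEVICE_ORDER` at startup and log a warning when it is not `PCI_BUS_ID`. The sketch below is an assumption about how such a check could look; `matches_nvidia_smi_order` and `device_order_warning` are hypothetical names, not functions from this project.

```rust
/// True when CUDA enumerates devices in PCI bus order, i.e. the same order
/// that `nvidia-smi` uses.
fn matches_nvidia_smi_order(cuda_device_order: Option<&str>) -> bool {
    cuda_device_order == Some("PCI_BUS_ID")
}

/// Build a startup warning when the CUDA index may not match nvidia-smi.
fn device_order_warning(cuda_device_order: Option<&str>, cuda_device: u32) -> Option<String> {
    if matches_nvidia_smi_order(cuda_device_order) {
        None
    } else {
        Some(format!(
            "CUDA_DEVICE_ORDER is not PCI_BUS_ID; CUDA device {cuda_device} may not be nvidia-smi GPU {cuda_device}"
        ))
    }
}

// Intended use at startup:
//     let order = std::env::var("CUDA_DEVICE_ORDER").ok();
//     if let Some(msg) = device_order_warning(order.as_deref(), 1) { eprintln!("{msg}"); }
```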

---

### RTX 2080 is Turing (sm_75), not Ampere

The RTX 2080 (non-Super, non-Ti) uses the Turing architecture, compute capability sm_75. This is relevant because:

- Some CUDA kernels are only compiled for sm_80+ (Ampere) by default
- `CMAKE_CUDA_ARCHITECTURES=75` must be set explicitly; otherwise the build falls back to a generic, slower kernel or fails
- `GGML_CUDA_FORCE_MMQ=ON` enables the matrix-multiply-quantized kernels that are Turing-optimised

---

### VRAM allocation: ~5-6 GB for large-v3

The `ggml-large-v3.bin` model occupies approximately 5-6 GB of VRAM on the RTX 2080's 8 GB pool. This leaves at least ~2 GB for CUDA workspace, which is sufficient for f16 inference with beam_size=5. Do not run two instances on the same GPU simultaneously.

---

## Audio Processing

### ffmpeg `silencedetect` logs to stderr, not stdout

When running `ffmpeg -i <input> -af silencedetect=n=-35dB:d=0.4 -f null -`, silence events are printed to **stderr** (not stdout), in this format:

```
[silencedetect @ 0x...] silence_start: 12.345
[silencedetect @ 0x...] silence_end: 13.456 | silence_duration: 1.111
```

The parser must read `output.stderr`, not `output.stdout`.
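A minimal parsing sketch for the log format shown above. The `Silence` type and both function names are hypothetical, and the parser assumes each `silence_end` line follows its matching `silence_start` line, as in the sample. Feed it the captured stderr, for example `String::from_utf8_lossy(&output.stderr)` from `std::process::Output`.

```rust
/// One detected silence interval, in seconds.
#[derive(Debug, PartialEq)]
struct Silence {
    start: f64,
    end: f64,
}

/// Extract the number that follows `key` on a log line, if present.
fn number_after(line: &str, key: &str) -> Option<f64> {
    let rest = &line[line.find(key)? + key.len()..];
    rest.split_whitespace().next()?.parse().ok()
}

/// Pair up `silence_start` / `silence_end` events from ffmpeg's stderr text.
fn parse_silencedetect(stderr: &str) -> Vec<Silence> {
    let mut out = Vec::new();
    let mut pending_start: Option<f64> = None;
    for line in stderr.lines().filter(|l| l.contains("silencedetect")) {
        if let Some(start) = number_after(line, "silence_start:") {
            pending_start = Some(start);
        } else if let Some(end) = number_after(line, "silence_end:") {
            // Ignore an end event that has no preceding start.
            if let Some(start) = pending_start.take() {
                out.push(Silence { start, end });
            }
        }
    }
    out
}
```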

---

### Whisper requires exactly 16kHz mono f32 PCM

whisper.cpp's `full()` function expects:

- Sample rate: exactly 16,000 Hz
- Channels: 1 (mono)
- Format: f32 little-endian (values in [-1.0, 1.0])

Deviating from any of these silently produces garbage output. The following ffmpeg decode command converts any input format to exactly this layout:

```
ffmpeg -i <input> -f f32le -ac 1 -ar 16000 -
```
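The raw bytes that ffmpeg writes to stdout still have to be reinterpreted as `f32` samples before they are handed to `full()`. A minimal sketch of that step, with hypothetical helper names; it assumes the byte stream is the `f32le` output described above and silently drops any trailing partial sample.

```rust
/// Reinterpret raw little-endian f32 bytes (ffmpeg `-f f32le`) as samples.
/// Trailing bytes that do not form a whole 4-byte sample are ignored.
fn f32le_bytes_to_samples(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Duration in seconds of a mono buffer at whisper's fixed 16 kHz rate.
fn duration_seconds(n_samples: usize) -> f64 {
    n_samples as f64 / 16_000.0
}
```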

---

### MP3 is fully supported by ffmpeg → whisper

whisper itself only accepts PCM; it has no MP3 decoder. But since we always decode through ffmpeg first, any format ffmpeg supports (MP3, AAC, FLAC, OGG, WAV, M4A, WEBM, etc.) works as input. There is no codec-level restriction.

---

### Chunking trade-offs

| Chunk size | Pros | Cons |
|-----------|------|------|
| 30s | Hallucinations contained to tiny window | Very short proper nouns / spellings get no context |
| 60s | Good balance; ~2× whisper's native 30s window | Isolated 10s sections (e.g. name spelling) still lack context |
| 120-180s | Better context for short sections | Hallucinations can corrupt larger content blocks |

Current setting: **60s**. Snap window: **±30s** from the target cut point.

The snap-to-silence algorithm avoids micro-chunks (<5s) and trailing slivers (<25% of target) by stopping early.
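The algorithm itself is not shown in these notes, so the following is only one possible reading of the rules above, under stated assumptions: cuts snap to the midpoint of the nearest silence within the window, and planning stops as soon as another cut would leave a chunk shorter than 5s or a tail shorter than 25% of the target. `snap_cut` and `plan_cuts` are hypothetical names.

```rust
/// Snap `target` (seconds) to the midpoint of the closest silence whose
/// midpoint lies within `window` seconds; fall back to `target` itself.
fn snap_cut(target: f64, window: f64, silences: &[(f64, f64)]) -> f64 {
    silences
        .iter()
        .map(|&(start, end)| (start + end) / 2.0)
        .filter(|mid| (mid - target).abs() <= window)
        .min_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()))
        .unwrap_or(target)
}

/// Plan cut points for a recording of `total` seconds.
fn plan_cuts(total: f64, chunk: f64, window: f64, silences: &[(f64, f64)]) -> Vec<f64> {
    let mut cuts = Vec::new();
    let mut pos = 0.0;
    // Stop early if the remainder after the next target would be a sliver.
    while total - (pos + chunk) >= 0.25 * chunk {
        let cut = snap_cut(pos + chunk, window, silences);
        // Stop early rather than create a micro-chunk on either side.
        if cut - pos < 5.0 || total - cut < 5.0 {
            break;
        }
        cuts.push(cut);
        pos = cut;
    }
    cuts
}
```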

---

## Quality Findings

### Quality baseline on 101-minute conference recording (ggml-large-v3)

Reference: human-corrected transcript (12,894 words)

| Metric | Score |
|--------|-------|
| WER | 9.3% |
| Word coverage | 93.1% |
| 1-gram F1 | 94.9% |
| 3-gram F1 | 84.7% |
| 5-gram F1 | 77.5% |

The 3-gram F1 is the most informative single metric for conference transcription: it captures both word accuracy and local phrase fidelity without being overly sensitive to exact phrasing.
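These notes do not define how the n-gram F1 scores were computed. The sketch below shows one common definition, offered as an assumption: whitespace tokenisation, multiset overlap of n-grams, and the harmonic mean of precision and recall. The function names are hypothetical, and any text normalisation (case, punctuation) is left out.

```rust
use std::collections::HashMap;

/// Count every n-gram (as a word sequence) in `words`.
fn ngram_counts<'a>(words: &[&'a str], n: usize) -> HashMap<Vec<&'a str>, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        return counts; // `windows(0)` would panic
    }
    for gram in words.windows(n) {
        *counts.entry(gram.to_vec()).or_insert(0) += 1;
    }
    counts
}

/// F1 over n-gram multisets: precision against the hypothesis, recall
/// against the reference. Returns 0.0 when there is no overlap.
fn ngram_f1(reference: &str, hypothesis: &str, n: usize) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    let rc = ngram_counts(&r, n);
    let hc = ngram_counts(&h, n);
    let overlap: usize = hc
        .iter()
        .map(|(gram, &c)| c.min(*rc.get(gram).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let r_total: usize = rc.values().sum();
    let h_total: usize = hc.values().sum();
    let precision = overlap as f64 / h_total as f64;
    let recall = overlap as f64 / r_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```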

### Remaining failure modes

| Pattern | Location | Root cause | Fixable? |
|---------|----------|-----------|---------|
| 'KAS' ×12 | ~2801s | Speaker spells "K-A-S"; 60s chunk isolates 10s section with no context | Increase chunk size to 90s |
| 'sick' ×4 | ~4540s | Single 9s segment; `result_len ≤ 32` → entropy check skipped | `compression_ratio_thold`? |
| 'Bye.' ×10 | ~6070s | Speaker says goodbye multiple times at end; trailing silence trim can't help because this is real audio | Post-processing dedup (declined by user) |
| 5 content gaps | Various | Chunk windows with noisy/overlapping audio → whisper skips content | Retry at 30s scope |

---

## Rust / Library Quirks

### `whisper-rs` 0.13 bundles whisper.cpp source

`whisper-rs-sys` includes the full whisper.cpp source tree inside the crate. The build is entirely self-contained: no internet access is needed during `cargo build` once the registry cache is warm. The whisper.cpp version is pinned by the crate version; updating whisper.cpp requires bumping the `whisper-rs` dependency.

### `BroadcastStream` silently drops lagged receivers

`tokio_stream::wrappers::BroadcastStream` yields `Err(BroadcastStreamRecvError::Lagged(n))` when a subscriber falls behind. Our SSE adapter filters this to `None`, which silently drops the lagged events. Clients that can't keep up with the SSE stream will miss progress events but will still receive the final `done` event (or can poll `GET /jobs/:id`).

### `DashMap` as `ProgressRegistry`

`DashMap<JobId, broadcast::Sender<ProgressEvent>>` provides concurrent map access without a single global lock (it shards the map internally rather than being truly lock-free). Senders are cleaned up 30 seconds after a job completes (see the `sleep(30s)` + `registry.remove()` in `worker::run`). The 30-second window gives SSE subscribers time to receive the `done` event before the channel is dropped.