Comprehensive integration guide for frontend/full-stack developers:
- Architecture overview diagram
- Quick start (submit + poll in ~20 lines)
- Model lifecycle: state machine diagram, all 4 /model/* endpoints,
SSE event subscription with JS examples
- Job submission: multipart fields, 503 model_not_ready handling,
retry-with-auto-load pattern
- Job progress: polling vs SSE, all event types with payloads
- Webhooks: job completion + model lifecycle, Express receiver example,
how to distinguish job vs model payloads
- Health check field reference
- Cancellation semantics (GPU inference not interruptible)
- Full TypeScript type definitions for all API shapes
- React hooks: useModelStatus, useJobStream, useTranscribe
- Complete WhisperClient class example with ensureModelReady,
streamProgress, and end-to-end transcribe()
- Error reference table with all 400/404/409/503/500 shapes
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add Dockerfile 'tester' stage (FROM builder):
- Symlinks /usr/local/cuda/lib64/stubs/libcuda.so → libcuda.so.1
so the test binary can satisfy the dynamic linker without a real GPU
- Runs `cargo test --release` reusing the cached release build artifacts
(no recompilation — tests complete in ~6s)
- docker build --target tester . to run all 30 unit tests
- Add 'test' job to .gitea/workflows/docker-build.yml:
- Runs before build-and-push (build-and-push needs: test)
- Builds --target tester with registry build cache
- Gate: build-and-push only runs when all tests pass
- Add run_tests.sh convenience script for local use:
- Accepts optional test name filter as first argument
- Respects CUDA_VERSION / UBUNTU_VERSION env overrides
All 30 unit tests pass:
error::tests — 7 tests (OOM detection, ModelNotReady HTTP shape)
models::tests — 17 tests (state machine, serialization, retry-after)
worker::tests — 6 tests (chunk ranges, silence snap/trim)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
New routes:
GET /model/status — current state + VRAM stats
POST /model/load — trigger load (idempotent)
POST /model/unload — immediate unload
GET /model/events — SSE stream of model lifecycle events
New env vars:
IDLE_TIMEOUT_SECS (default 300)
GPU_POLL_INTERVAL_SECS (default 30)
Tests:
tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
SSE events, webhooks, concurrency, unload-during-load)
tests/test_idle_timeout.sh — 5 tests with short IDLE_TIMEOUT_SECS=5
test_all.sh updated: loads model before job submission, asserts
model_state in /health, adds POST /model/unload at end
Docs:
docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
updated /health response shape
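
The idle timer described above (a recv_timeout tick inside the worker's OS thread, with no extra timer thread) could look roughly like the sketch below. The WorkerCmd variants, the worker_loop name, and the string log are hypothetical stand-ins for illustration; the real command enum and model handling are not shown in this message.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical command set; the real WorkerCmd likely carries richer payloads.
enum WorkerCmd {
    Load,
    Unload,
    Job(String),
}

/// Runs until every sender is dropped and returns a log of state transitions.
/// `tick` plays the role of the 1s recv_timeout; `idle_timeout` the role of
/// IDLE_TIMEOUT_SECS. Model loading is simulated with a boolean flag.
fn worker_loop(rx: Receiver<WorkerCmd>, idle_timeout: Duration, tick: Duration) -> Vec<String> {
    let mut log = Vec::new();
    let mut loaded = false;
    let mut last_used = Instant::now();
    loop {
        match rx.recv_timeout(tick) {
            Ok(WorkerCmd::Load) => {
                if !loaded {
                    loaded = true;
                    log.push("loaded".to_string());
                }
                last_used = Instant::now();
            }
            Ok(WorkerCmd::Job(name)) => {
                // Lazy load: the first job triggers a load if needed.
                if !loaded {
                    loaded = true;
                    log.push("loaded".to_string());
                }
                log.push(format!("job:{name}"));
                last_used = Instant::now();
            }
            Ok(WorkerCmd::Unload) => {
                if loaded {
                    loaded = false;
                    log.push("unloaded".to_string());
                }
            }
            // No command arrived within one tick: check the idle deadline.
            Err(RecvTimeoutError::Timeout) => {
                if loaded && last_used.elapsed() >= idle_timeout {
                    loaded = false;
                    log.push("idle_unloaded".to_string());
                }
            }
            // All senders dropped: shut the worker down.
            Err(RecvTimeoutError::Disconnected) => return log,
        }
    }
}
```

Because the timeout doubles as the timer tick, the unload happens at most one tick after the idle deadline, and the worker never needs a second thread or a lock around the model.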
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Step 9 used 'echo $RESULT | python3 - << HEREDOC' which is a bash gotcha:
the heredoc takes over stdin (as the script source), so the pipe is
silently ignored and sys.stdin.read() returns an empty string → JSONDecodeError.
Fix: write RESULT to a temp file and pass it as sys.argv[1] to the script.
Also removed the old buggy test suite that was accidentally left appended
at lines 181-327 (had language=auto, ['id'] field, wrong DELETE assertion).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GPU warmup (src/transcriber.rs):
After creating WhisperState, run a 1s silent inference pass in load().
CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
On a cold GPU this compilation disrupts the decode pipeline mid-inference,
returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
startup so the first real job runs on fully compiled kernels.
test_all.sh:
- Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
- Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
- Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
- Fix DELETE assertion: completed jobs return 409 Conflict, not 204
- Add explicit zero-segments failure check in quality inspection (step 9)
- Add progress reporting to poll loop
docs/FINDINGS.md + KNOWLEDGE.md:
Document cold GPU warmup issue, root cause, and fix.
Document language=auto as invalid API usage.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously create_state() was called for every 60s audio chunk, triggering
whisper_init_state() each time. This allocates ~700 MB of GPU compute buffers
(KV caches, CUDA workspace) and re-initialises the CUDA backend per chunk.
For a 101-minute audio (102 chunks), this caused 102 GPU re-initialisations
and VRAM allocation cycles. Under VRAM pressure from concurrent processes,
CUDA allocation failures occurred silently — whisper returned language
detection results but 0 segments.
Fix: create WhisperState once in Transcriber::load() and reuse it for every
transcription call. GPU memory is stable; no_context=true prevents KV-cache
contamination between chunks.
WhisperState is Send+Sync (explicitly declared in whisper-rs) and holds its
own Arc<WhisperInnerContext>, so the model weights stay alive even after
WhisperContext is dropped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
detect_language=true causes whisper.cpp to return 0 immediately after
language detection without running the decoder. From the whisper.cpp source:
if (params.detect_language) { return 0; }
Setting language=null triggers auto-detection AND transcription.
This was the root cause of 0 segments on all jobs without an explicit language.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Whisper hallucinates filler tokens (Bye., Thank you., etc.) into
end-of-chunk silence. This is especially visible on the final chunk
of long audio where the outro silence triggers a 10× repetition loop.
Fix: after slicing each PCM chunk, scan backwards to find the last
sample above −35 dB, then keep 0.5 s of padding and truncate.
Applied to every chunk, not just the last — any chunk ending in a long
silence period gets the same protection.
Constants match the silencedetect filter already used for chunking:
THRESHOLD = 0.0178 (−35 dB)
PADDING = 8000 samples (0.5 s at 16 kHz)
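
A minimal sketch of the trimming step, assuming mono f32 PCM normalised to [-1.0, 1.0]. The function name trim_trailing_silence and the handling of an all-silent chunk (returned unchanged) are assumptions made for illustration; only the two constants come from this change.

```rust
/// Amplitude equivalent to -35 dB full scale: 10^(-35/20) ≈ 0.0178.
const THRESHOLD: f32 = 0.0178;
/// Padding kept after the last loud sample: 0.5 s at 16 kHz.
const PADDING: usize = 8000;

/// Returns the prefix of `pcm` that ends PADDING samples after the last
/// sample whose absolute amplitude exceeds THRESHOLD.
/// Assumption: a chunk with no sample above the threshold is left as-is.
fn trim_trailing_silence(pcm: &[f32]) -> &[f32] {
    // rposition scans from the end, so the search stops at the last loud sample.
    match pcm.iter().rposition(|s| s.abs() > THRESHOLD) {
        Some(last_loud) => {
            // Never extend past the original chunk length.
            let end = (last_loud + 1 + PADDING).min(pcm.len());
            &pcm[..end]
        }
        None => pcm,
    }
}
```

Returning a sub-slice avoids copying the chunk; a caller that needs an owned buffer can call to_vec() on the result.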
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ProgressEvent::Progress now carries chunk index and total count
- SsePayload::Progress gains chunk / chunks_total fields
→ SSE clients can show 'chunk N/51' instead of bare percent
- process_job persists job.progress to storage at each chunk boundary
→ GET /jobs/:id now shows live progress (not stuck at 0)
- Emits Progress event at chunk START (boundary event), not just on
whisper's internal callback
- entropy_thold raised to 3.5 (catches medium-phrase loops; triggers
whisper's own temperature-retry instead of silent repetition)
- no_speech_thold removed (marked '// TODO: not implemented' in the
whisper.cpp source, so it was a no-op)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to
find silence midpoints. Build chunk boundaries every 180s, snapping to
the nearest silence midpoint within ±30s (fallback: hard cut).
Each chunk is transcribed independently with its own CUDA context;
timestamps are shifted by chunk_start before merging. Progress is
scaled per-chunk across the overall 0-100% job range.
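
The boundary selection can be sketched as follows. This is an illustration under stated assumptions, not the code in this change: the function name chunk_boundaries is hypothetical, silence midpoints are passed in as seconds, and targets are placed at fixed multiples of the interval (the actual code might instead measure from the previous cut).

```rust
/// Builds chunk boundaries (seconds) for audio of length `duration`.
/// A target is placed every `interval` seconds; each target snaps to the
/// nearest silence midpoint within ±`window` seconds, or stays where it is
/// (hard cut) when no midpoint is close enough.
fn chunk_boundaries(duration: f64, silences: &[f64], interval: f64, window: f64) -> Vec<f64> {
    let mut bounds = vec![0.0];
    let mut target = interval;
    while target < duration {
        let cut = silences
            .iter()
            .copied()
            // Keep only midpoints inside the snap window around the target.
            .filter(|m| (m - target).abs() <= window)
            // Pick the midpoint closest to the target.
            .min_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()))
            // Fallback: hard cut at the target itself.
            .unwrap_or(target);
        bounds.push(cut);
        target += interval;
    }
    bounds.push(duration);
    bounds
}
```

With interval = 180.0 and window = 30.0, consecutive pairs in the returned vector give each chunk's start and end; the start is the offset added to that chunk's segment timestamps before merging.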
Result on 101-min YouTube audio (34 chunks, 1714 silence points):
- Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations)
- After: x4 max consecutive run, all repetitions verified genuine
Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)>
instead of a raw ProgressTx so each chunk can independently scale its
contribution to the job's broadcast channel.
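
One way such a per-chunk callback could be built is sketched below. The helper name scaled_progress and the zero-based chunk index are assumptions; a callback that crosses a thread boundary would also need a Send bound, which is omitted here for brevity.

```rust
/// Wraps a job-level progress callback so that a chunk's local 0-100
/// percentage maps onto that chunk's share of the overall 0-100 range.
/// `chunk` is zero-based; `chunks_total` is the number of chunks in the job.
fn scaled_progress(
    chunk: usize,
    chunks_total: usize,
    on_job_progress: impl Fn(u8) + 'static,
) -> Box<dyn Fn(u8)> {
    Box::new(move |local: u8| {
        // Clamp defensively in case the inner callback overshoots 100.
        let local = local.min(100) as usize;
        // Chunks already finished contribute 100 each; the current one adds `local`.
        let overall = (chunk * 100 + local) / chunks_total.max(1);
        on_job_progress(overall.min(100) as u8);
    })
}
```

For example, with 4 chunks, a local value of 50 on chunk index 1 reports 37 overall, and a local value of 100 on the same chunk reports 50.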
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
set_no_context(true) stops whisper from feeding its own output back as
context for the next segment. Without this, at audio end the model
hallucinates a phrase ('All right.', 'So I think we're going to wrap up.')
and repeats it hundreds of times in a tight loop.
Observed: 759x 'All right.' + 750x 'So I think we're going to wrap up.'
in the final 8 seconds of a 101min YouTube conference recording.
After fix: clean termination with no repetition loops.
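
Repetition loops like the one above can be measured with a small helper that finds the longest run of identical consecutive segments. The sketch below is a hypothetical illustration (the name max_consecutive_run and the plain string-slice input are assumptions), not the tooling used to produce the counts in this message.

```rust
/// Returns the segment text with the longest run of identical consecutive
/// entries, together with the run length, or None for empty input.
/// On ties, the earliest run wins.
fn max_consecutive_run<'a>(segments: &[&'a str]) -> Option<(&'a str, usize)> {
    let mut best: Option<(&'a str, usize)> = None;
    let mut current: Option<(&'a str, usize)> = None;
    for &seg in segments {
        // Extend the current run if the text repeats, otherwise start a new one.
        let run = match current {
            Some((text, n)) if text == seg => (text, n + 1),
            _ => (seg, 1),
        };
        current = Some(run);
        if best.map_or(true, |(_, n)| run.1 > n) {
            best = Some(run);
        }
    }
    best
}
```

A run length in the hundreds on the final seconds of audio is a strong sign of a hallucination loop; short runs usually need a manual check against the audio.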
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Revert no_speech_thold from 0.0 back to 0.6 (whisper.cpp default)
0.0 means 'suppress if p(no-speech) > 0.0' which fires on every segment,
silently producing 0-segment output for all real-world audio
- Revert SamplingStrategy from Greedy{best_of:5} back to BeamSearch{beam_size:5}
Greedy with temperature=0.0 and best_of>1 is undefined in whisper.cpp
- Restore entropy_thold=2.4 and logprob_thold=-1.0 defaults
- Keep flash_attn disabled (was causing silent failures on conference audio)
- Tested: 59 segments on 5 min YouTube conference audio, 29 on repair audio
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>