Commit Graph

6 Commits

Author SHA1 Message Date
mozempk
b191fbe200 feat: dynamic model loading/unloading with GPU polling
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 8m41s
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
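
A minimal sketch of the idle-timer idea from the list above: one worker thread blocks on recv_timeout, treats each timeout as a tick, and unloads once the model has been idle long enough. The WorkerCmd variants, ModelState, and run_worker are illustrative names rather than the crate's actual definitions; job payloads, real model loading, and the WaitingForGpu retry path are left out.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Commands sent over a SyncSender<WorkerCmd> (variants are assumptions).
pub enum WorkerCmd {
    Load,
    Unload,
    Job, // payload omitted in this sketch
    Shutdown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelState {
    Unloaded,
    Loaded,
}

/// Runs the worker loop and returns every state transition it made.
pub fn run_worker(rx: Receiver<WorkerCmd>, idle_timeout: Duration, tick: Duration) -> Vec<ModelState> {
    let mut state = ModelState::Unloaded; // lazy: nothing loaded at start
    let mut last_used = Instant::now();
    let mut transitions = Vec::new();
    loop {
        match rx.recv_timeout(tick) {
            // A job or an explicit load request brings the model up if needed.
            Ok(WorkerCmd::Load) | Ok(WorkerCmd::Job) => {
                if state == ModelState::Unloaded {
                    state = ModelState::Loaded;
                    transitions.push(state);
                }
                last_used = Instant::now();
            }
            // Manual unload releases the model immediately.
            Ok(WorkerCmd::Unload) => {
                if state == ModelState::Loaded {
                    state = ModelState::Unloaded;
                    transitions.push(state);
                }
            }
            Ok(WorkerCmd::Shutdown) | Err(RecvTimeoutError::Disconnected) => return transitions,
            // No command within one tick: check the idle deadline.
            Err(RecvTimeoutError::Timeout) => {
                if state == ModelState::Loaded && last_used.elapsed() >= idle_timeout {
                    state = ModelState::Unloaded;
                    transitions.push(state);
                }
            }
        }
    }
}
```

Because the tick doubles as the timer, the same thread that owns the model also decides when to release it, which is why no extra timer thread is needed.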

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)
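
One way the two variables could be read, shown as a sketch. LifecycleConfig and parse_secs are hypothetical names, and falling back to the default on an unparsable value is an assumption; the commit only states the variable names and defaults.

```rust
use std::time::Duration;

/// Parses a seconds value, falling back to `default_secs` when the
/// variable is unset or not a valid integer (fallback is an assumption).
pub fn parse_secs(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Hypothetical container for the two lifecycle settings.
pub struct LifecycleConfig {
    pub idle_timeout: Duration,
    pub gpu_poll_interval: Duration,
}

impl LifecycleConfig {
    pub fn from_env() -> Self {
        let get = |key: &str| std::env::var(key).ok();
        Self {
            idle_timeout: parse_secs(get("IDLE_TIMEOUT_SECS").as_deref(), 300),
            gpu_poll_interval: parse_secs(get("GPU_POLL_INTERVAL_SECS").as_deref(), 30),
        }
    }
}
```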

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 17:57:20 +02:00
mozempk
d5a88d1866 fix: create WhisperState once at load time, reuse across all chunks
Some checks failed
Build & Push Docker Image / build-and-push (push) Has been cancelled
Previously create_state() was called for every 60s audio chunk, triggering
whisper_init_state() each time. This allocates ~700 MB of GPU compute buffers
(KV caches, CUDA workspace) and re-initialises the CUDA backend per chunk.

For a 101-minute audio (102 chunks), this caused 102 GPU re-initialisations
and VRAM allocation cycles. Under VRAM pressure from concurrent processes,
CUDA allocation failures occurred silently — whisper returned language
detection results but 0 segments.

Fix: create WhisperState once in Transcriber::load() and reuse it for every
transcription call. GPU memory is stable; no_context=true prevents KV-cache
contamination between chunks.

WhisperState is Send+Sync (explicitly declared in whisper-rs) and holds its
own Arc<WhisperInnerContext>, so the model weights stay alive even after
WhisperContext is dropped.
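
The shape of the fix can be sketched without whisper-rs. MockState below is a stand-in for WhisperState that only counts how often it is initialised; it is not the real API, and transcribe_chunks is a hypothetical method name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts state initialisations so the reuse is observable.
pub static STATE_INITS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for WhisperState (NOT the whisper-rs type).
struct MockState;

impl MockState {
    fn new() -> Self {
        // In the real code this is the expensive step that allocates GPU buffers.
        STATE_INITS.fetch_add(1, Ordering::SeqCst);
        MockState
    }

    /// Pretends to transcribe a chunk; returns the number of samples seen.
    fn full(&mut self, chunk: &[f32]) -> usize {
        chunk.len()
    }
}

pub struct Transcriber {
    state: MockState,
}

impl Transcriber {
    /// The state is created exactly once, at load time.
    pub fn load() -> Self {
        Self { state: MockState::new() }
    }

    /// Every chunk reuses the same state instead of creating a new one.
    pub fn transcribe_chunks(&mut self, chunks: &[Vec<f32>]) -> usize {
        chunks.iter().map(|c| self.state.full(c)).sum()
    }
}
```

With 102 chunks the counter still reads one initialisation, where the previous per-chunk approach would have produced 102.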

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:51:33 +02:00
mozempk
ef9c04b070 fix: trim trailing silence from each chunk before whisper
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m44s
Whisper hallucinates filler tokens (Bye., Thank you., etc.) into
end-of-chunk silence. This is especially visible on the final chunk
of long audio where the outro silence triggers a 10× repetition loop.

Fix: after slicing each PCM chunk, scan backwards to find the last
sample above −35 dB, then keep 0.5 s of padding and truncate.
Applied to every chunk, not just the last — any chunk ending in a long
silence period gets the same protection.

Constants match the silencedetect filter already used for chunking:
  THRESHOLD = 0.0178  (−35 dB)
  PADDING   = 8000 samples (0.5 s at 16 kHz)
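
A sketch of the backward scan using the two constants above, assuming mono f32 PCM at 16 kHz. The function name is illustrative, and leaving an all-silent chunk untouched is an assumption the commit message does not cover.

```rust
pub const THRESHOLD: f32 = 0.0178; // about -35 dB relative to full scale
pub const PADDING: usize = 8_000; // 0.5 s at 16 kHz

/// Returns the chunk truncated to the last sample above THRESHOLD plus PADDING.
pub fn trim_trailing_silence(samples: &[f32]) -> &[f32] {
    // rposition scans from the end and yields the index of the last loud sample.
    match samples.iter().rposition(|s| s.abs() > THRESHOLD) {
        Some(last_loud) => {
            let end = (last_loud + 1 + PADDING).min(samples.len());
            &samples[..end]
        }
        // Assumption: a chunk with no sample above the threshold is returned as-is.
        None => samples,
    }
}
```

A chunk with 1 s of speech followed by 2 s of silence comes back as 1.5 s; a chunk that stays loud to its last sample is unchanged.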

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 02:13:20 +02:00
mozempk
35e7ea8d28 feat: progress reporting with chunk context + live job persistence
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m38s
- ProgressEvent::Progress now carries chunk index and total count
- SsePayload::Progress gains chunk / chunks_total fields
  → SSE clients can show 'chunk N/51' instead of bare percent
- process_job persists job.progress to storage at each chunk boundary
  → GET /jobs/:id now shows live progress (not stuck at 0)
- Emits Progress event at chunk START (boundary event), not just on
  whisper's internal callback
- entropy_thold raised to 3.5 (catches medium-phrase loops; triggers
  whisper's own temperature-retry instead of silent repetition)
- no_speech_thold removed (the whisper.cpp source marks it
  "// TODO: not implemented", so setting it was a no-op)
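
A sketch of what a chunk-aware progress event might look like. The field names chunk and chunks_total follow the list above; the percent field, the 1-based chunk numbering, and both helper functions are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProgressEvent {
    Progress {
        percent: u8,
        chunk: usize,        // 1-based in this sketch (assumption)
        chunks_total: usize,
    },
}

/// Builds the boundary event emitted when a chunk STARTS:
/// overall progress equals the share of chunks already finished.
pub fn chunk_start_event(chunk: usize, chunks_total: usize) -> ProgressEvent {
    let total = chunks_total.max(1); // avoid division by zero
    let done = chunk.saturating_sub(1).min(total);
    ProgressEvent::Progress {
        percent: (done * 100 / total) as u8,
        chunk,
        chunks_total,
    }
}

/// Formats the event the way an SSE client might display it.
pub fn label(ev: &ProgressEvent) -> String {
    match ev {
        ProgressEvent::Progress { percent, chunk, chunks_total } => {
            format!("chunk {chunk}/{chunks_total} ({percent}%)")
        }
    }
}
```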

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 02:00:46 +02:00
mozempk
fb8556441c feat: silence-based audio chunking before transcription
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m40s
Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to
find silence midpoints. Build chunk boundaries every 180s, snapping to
the nearest silence midpoint within ±30s (fallback: hard cut).
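
The snapping rule can be sketched as follows, assuming the nominal cuts sit on a fixed grid (180 s, 360 s, ...) rather than 180 s after the previous snapped cut; the commit message does not say which. chunk_boundaries is a hypothetical name, and all times are in seconds.

```rust
/// Returns the cut points between chunks.
/// Each nominal cut snaps to the nearest silence midpoint within `window`;
/// if none is in range, the nominal position is used (hard cut).
pub fn chunk_boundaries(
    duration: f64,
    silence_midpoints: &[f64],
    target: f64,
    window: f64,
) -> Vec<f64> {
    let mut cuts = Vec::new();
    let mut nominal = target;
    while nominal < duration {
        let snapped = silence_midpoints
            .iter()
            .copied()
            .filter(|m| (m - nominal).abs() <= window)
            .min_by(|a, b| (a - nominal).abs().total_cmp(&(b - nominal).abs()))
            .unwrap_or(nominal);
        cuts.push(snapped);
        nominal += target;
    }
    cuts
}
```

With target = 180 and window = 30, neighbouring snap ranges (150 to 210, 330 to 390, ...) never overlap, so the cuts stay in order.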

Each chunk is transcribed independently with its own CUDA context;
timestamps are shifted by chunk_start before merging. Progress is
scaled per-chunk across the overall 0-100% job range.
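
The timestamp shift before merging amounts to adding each chunk's start offset to its segments. Segment and merge_chunks are illustrative names, not the project's types; times are in seconds.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Takes (chunk_start, segments) pairs in chunk order and returns one
/// timeline with every segment shifted by its chunk's start offset.
pub fn merge_chunks(chunks: Vec<(f64, Vec<Segment>)>) -> Vec<Segment> {
    chunks
        .into_iter()
        .flat_map(|(chunk_start, segments)| {
            segments.into_iter().map(move |s| Segment {
                start: s.start + chunk_start,
                end: s.end + chunk_start,
                text: s.text,
            })
        })
        .collect()
}
```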

Result on 101-min YouTube audio (34 chunks, 1714 silence points):
- Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations)
- After:    x4 max consecutive run, all repetitions verified genuine

Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)>
instead of a raw ProgressTx so each chunk can independently scale its
contribution to the job's broadcast channel.
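
A sketch of how a per-chunk callback could map its local 0-100% onto the job's overall range. A std mpsc Sender stands in for the job's broadcast channel, scaled_progress is a hypothetical helper, and chunk_index is 0-based here.

```rust
use std::sync::mpsc::Sender;

/// Returns a callback that converts chunk-local percent into overall job percent
/// and forwards it to `job_tx`.
pub fn scaled_progress(
    chunk_index: usize,
    chunks_total: usize,
    job_tx: Sender<u8>,
) -> Box<dyn Fn(u8) + Send> {
    let total = chunks_total.max(1); // avoid division by zero
    Box::new(move |chunk_pct: u8| {
        let chunk_pct = chunk_pct.min(100) as usize;
        // Chunks already finished contribute 100 each; the current one its share.
        let overall = (chunk_index * 100 + chunk_pct) / total;
        // A closed receiver is ignored: progress is best-effort.
        let _ = job_tx.send(overall.min(100) as u8);
    })
}
```

For the second of four chunks, local 0%, 50%, and 100% map to overall 25%, 37%, and 50%.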

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 01:08:06 +02:00
mozempk
16cb6ca661 feat: GPU-accelerated Whisper API for RTX 2080 (sm_75)
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 11m13s
- Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI)
- Async job queue with SSE progress streaming
- Webhook delivery with 5x exponential backoff
- Disk-persisted job state (survives restarts)
- Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank
- CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS
- Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR
- Gitea Actions CI: build + push to git.sal.giize.com registry
- Multi-stage Dockerfile with customizable CUDA_VERSION ARG
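
For the webhook retries, a possible delay schedule is sketched below. The commit only says "5x exponential backoff"; the 1 s base delay and the doubling factor are assumptions, and backoff_delays is a hypothetical helper.

```rust
use std::time::Duration;

/// Delay before each retry: base, 2*base, 4*base, ...
pub fn backoff_delays(base: Duration, attempts: u32) -> Vec<Duration> {
    (0..attempts).map(|n| base * 2u32.pow(n)).collect()
}
```

With a 1 s base and 5 attempts the delays are 1, 2, 4, 8, and 16 seconds.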

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 22:47:24 +02:00