mozempk/whisper-rtx2080

Author	SHA1	Message	Date
mozempk	b191fbe200	feat: dynamic model loading/unloading with GPU polling - Model starts unloaded (lazy); loads on first job or POST /model/load - Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity - POST /model/unload for immediate manual release - GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries every GPU_POLL_INTERVAL_SECS (default 30) indefinitely - POST /jobs when unloaded → 503 + Retry-After header, triggers load - AppError::OutOfMemory and AppError::ModelNotReady variants - WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel - Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread) - Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks) - webhook_registry: all clients that ever submitted a webhook_url receive model_ready and model_unloaded webhooks - GPU warmup retained on every (re)load. New routes: GET /model/status (current state + VRAM stats); POST /model/load (trigger load, idempotent); POST /model/unload (immediate unload); GET /model/events (SSE stream of model lifecycle events). New env vars: IDLE_TIMEOUT_SECS (default 300); GPU_POLL_INTERVAL_SECS (default 30). Tests: tests/test_model_lifecycle.sh with 18 integration tests (full state machine, SSE events, webhooks, concurrency, unload-during-load); tests/test_idle_timeout.sh with 5 tests using a short IDLE_TIMEOUT_SECS=5; test_all.sh updated to load the model before job submission, assert model_state in /health, and add POST /model/unload at the end. Docs: docs/USAGE.md gains a model lifecycle section, the new env vars, the 503 retry pattern, and the updated /health response shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-08 17:57:20 +02:00
mozempk	d5a88d1866	fix: create WhisperState once at load time, reuse across all chunks. Previously create_state() was called for every 60s audio chunk, triggering whisper_init_state() each time. This allocates ~700 MB of GPU compute buffers (KV caches, CUDA workspace) and re-initialises the CUDA backend per chunk. For a 101-minute audio (102 chunks), this caused 102 GPU re-initialisations and VRAM allocation cycles. Under VRAM pressure from concurrent processes, CUDA allocation failures occurred silently: whisper returned language detection results but 0 segments. Fix: create WhisperState once in Transcriber::load() and reuse it for every transcription call. GPU memory is stable; no_context=true prevents KV-cache contamination between chunks. WhisperState is Send+Sync (explicitly declared in whisper-rs) and holds its own Arc<WhisperInnerContext>, so the model weights stay alive even after WhisperContext is dropped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 11:51:33 +02:00
mozempk	ef9c04b070	fix: trim trailing silence from each chunk before whisper. Whisper hallucinates filler tokens (Bye., Thank you., etc.) into end-of-chunk silence. This is especially visible on the final chunk of long audio where the outro silence triggers a 10× repetition loop. Fix: after slicing each PCM chunk, scan backwards to find the last sample above −35 dB, then keep 0.5 s of padding and truncate. Applied to every chunk, not just the last: any chunk ending in a long silence period gets the same protection. Constants match the silencedetect filter already used for chunking: THRESHOLD = 0.0178 (−35 dB); PADDING = 8000 samples (0.5 s at 16 kHz). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 02:13:20 +02:00
mozempk	35e7ea8d28	feat: progress reporting with chunk context + live job persistence - ProgressEvent::Progress now carries chunk index and total count - SsePayload::Progress gains chunk / chunks_total fields → SSE clients can show 'chunk N/51' instead of bare percent - process_job persists job.progress to storage at each chunk boundary → GET /jobs/:id now shows live progress (not stuck at 0) - Emits Progress event at chunk START (boundary event), not just on whisper's internal callback - entropy_thold raised to 3.5 (catches medium-phrase loops; triggers whisper's own temperature-retry instead of silent repetition) - no_speech_thold removed (confirmed // TODO: not implemented in whisper.cpp source; was a no-op). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 02:00:46 +02:00
mozempk	fb8556441c	feat: silence-based audio chunking before transcription. Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to find silence midpoints. Build chunk boundaries every 180s, snapping to the nearest silence midpoint within ±30s (fallback: hard cut). Each chunk is transcribed independently with its own CUDA context; timestamps are shifted by chunk_start before merging. Progress is scaled per-chunk across the overall 0-100% job range. Result on 101-min YouTube audio (34 chunks, 1714 silence points): - Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations) - After: x4 max consecutive run, all repetitions verified genuine. Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)> instead of a raw ProgressTx so each chunk can independently scale its contribution to the job's broadcast channel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 01:08:06 +02:00
mozempk	16cb6ca661	feat: GPU-accelerated Whisper API for RTX 2080 (sm_75) - Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI) - Async job queue with SSE progress streaming - Webhook delivery with 5x exponential backoff - Disk-persisted job state (survives restarts) - Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank - CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS - Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR - Gitea Actions CI: build + push to git.sal.giize.com registry - Multi-stage Dockerfile with customizable CUDA_VERSION ARG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-05 22:47:24 +02:00
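
The trailing-silence trim described in commit ef9c04b070 could look roughly like the sketch below. It is an illustration, not the repository's code: the function name `trim_trailing_silence` is hypothetical, the input is assumed to be mono `f32` PCM at 16 kHz, and returning an empty slice for an all-silent chunk is an assumption. Only the two constants come from the commit message; 0.0178 is the linear amplitude for −35 dB, since 10^(−35/20) ≈ 0.0178.

```rust
/// Linear amplitude threshold for -35 dB (value taken from the commit message).
const THRESHOLD: f32 = 0.0178;
/// Padding kept after the last loud sample: 0.5 s at 16 kHz (from the commit message).
const PADDING: usize = 8000;

/// Hypothetical helper: returns `chunk` with its trailing silence cut off.
/// Scans backwards for the last sample whose magnitude exceeds THRESHOLD,
/// keeps PADDING samples after it, and truncates the rest.
fn trim_trailing_silence(chunk: &[f32]) -> &[f32] {
    match chunk.iter().rposition(|s| s.abs() > THRESHOLD) {
        Some(last_loud) => {
            // Never read past the end of the chunk when adding the padding.
            let end = (last_loud + 1 + PADDING).min(chunk.len());
            &chunk[..end]
        }
        // Assumption: a chunk with no sample above the threshold is dropped entirely.
        None => &chunk[..0],
    }
}
```

Returning a sub-slice avoids copying the PCM data; a caller that owns a `Vec<f32>` could call `truncate` with the same end index instead.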

6 Commits
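
The boundary snapping described in commit fb8556441c (a cut every 180 s, moved to the nearest silence midpoint within ±30 s, otherwise a hard cut) can be sketched as follows. The function name `chunk_boundaries` and its parameters are hypothetical, and the sketch assumes the nominal cut points sit on a fixed grid (180 s, 360 s, ...) rather than being measured from the previous snapped cut; the commit message does not say which the repository uses.

```rust
/// Hypothetical helper: returns cut points in seconds for audio of length `duration`.
/// `silence_midpoints` are the midpoints reported by silence detection,
/// `target` is the nominal chunk length and `window` the allowed snap distance.
fn chunk_boundaries(
    duration: f64,
    silence_midpoints: &[f64],
    target: f64,
    window: f64,
) -> Vec<f64> {
    let mut cuts = Vec::new();
    // Assumption: nominal cuts lie on a fixed grid of multiples of `target`.
    let mut nominal = target;
    while nominal < duration {
        // Nearest silence midpoint within the snap window, if any.
        let snapped = silence_midpoints
            .iter()
            .copied()
            .filter(|m| (m - nominal).abs() <= window && *m < duration)
            .min_by(|a, b| (a - nominal).abs().total_cmp(&(b - nominal).abs()));
        // Fallback: hard cut at the nominal position.
        cuts.push(snapped.unwrap_or(nominal));
        nominal += target;
    }
    cuts
}
```

With `target = 180.0` and `window = 30.0`, each cut stays inside its own 60 s band around the grid point, so the returned cuts are strictly increasing as long as `window` is less than half of `target`.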