whisper-rtx2080

mozempk/whisper-rtx2080

Author	SHA1	Message	Date
mozempk	bcaf8680db	docs: add FRONTEND_INTEGRATION.md developer guide. Comprehensive integration guide for frontend/full-stack developers: - Architecture overview diagram - Quick start (submit + poll in ~20 lines) - Model lifecycle: state machine diagram, all 4 /model/* endpoints, SSE event subscription with JS examples - Job submission: multipart fields, 503 model_not_ready handling, retry-with-auto-load pattern - Job progress: polling vs SSE, all event types with payloads - Webhooks: job completion + model lifecycle, Express receiver example, how to distinguish job vs model payloads - Health check field reference - Cancellation semantics (GPU inference not interruptible) - Full TypeScript type definitions for all API shapes - React hooks: useModelStatus, useJobStream, useTranscribe - Complete WhisperClient class example with ensureModelReady, streamProgress, and end-to-end transcribe() - Error reference table with all 400/404/409/503/500 shapes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-08 23:47:13 +02:00
mozempk	d0148260e3	test: add unit test infrastructure (Docker tester stage + CI). - Add Dockerfile 'tester' stage (FROM builder): - Symlinks /usr/local/cuda/lib64/stubs/libcuda.so → libcuda.so.1 so the test binary can satisfy the dynamic linker without a real GPU - Runs `cargo test --release` reusing the cached release build artifacts (no recompilation; tests complete in ~6s) - `docker build --target tester .` to run all 30 unit tests - Add 'test' job to .gitea/workflows/docker-build.yml: - Runs before build-and-push (build-and-push needs: test) - Builds --target tester with registry build cache - Gate: build-and-push only runs when all tests pass - Add run_tests.sh convenience script for local use: - Accepts optional test name filter as first argument - Respects CUDA_VERSION / UBUNTU_VERSION env overrides. All 30 unit tests pass: error::tests: 7 tests (OOM detection, ModelNotReady HTTP shape); models::tests: 17 tests (state machine, serialization, retry-after); worker::tests: 6 tests (chunk ranges, silence snap/trim). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> 0.0.2	2026-05-08 18:20:52 +02:00
mozempk	b191fbe200	feat: dynamic model loading/unloading with GPU polling. - Model starts unloaded (lazy); loads on first job or POST /model/load - Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity - POST /model/unload for immediate manual release - GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries every GPU_POLL_INTERVAL_SECS (default 30) indefinitely - POST /jobs when unloaded → 503 + Retry-After header, triggers load - AppError::OutOfMemory and AppError::ModelNotReady variants - WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel - Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread) - Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks) - webhook_registry: all clients that ever submitted a webhook_url receive model_ready and model_unloaded webhooks - GPU warmup retained on every (re)load. New routes: GET /model/status (current state + VRAM stats); POST /model/load (trigger load, idempotent); POST /model/unload (immediate unload); GET /model/events (SSE stream of model lifecycle events). New env vars: IDLE_TIMEOUT_SECS (default 300); GPU_POLL_INTERVAL_SECS (default 30). Tests: tests/test_model_lifecycle.sh: 18 integration tests (full state machine, SSE events, webhooks, concurrency, unload-during-load); tests/test_idle_timeout.sh: 5 tests with short IDLE_TIMEOUT_SECS=5; test_all.sh updated: loads model before job submission, asserts model_state in /health, adds POST /model/unload at end. Docs: docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern, updated /health response shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-08 17:57:20 +02:00
mozempk	78c6fab81b	fix: remove duplicate old test suite and fix step 9 pipe/heredoc bug. Step 9 used 'echo $RESULT | python3 - << HEREDOC', which is a bash gotcha: the heredoc takes over stdin (as the script source), so the pipe is silently ignored and sys.stdin.read() returns an empty string → JSONDecodeError. Fix: write RESULT to a temp file and pass it as sys.argv[1] to the script. Also removed the old buggy test suite that was accidentally left appended at lines 181-327 (had language=auto, ['id'] field, wrong DELETE assertion). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 12:13:15 +02:00
mozempk	fd8d4deefb	fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding. GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop. docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 11:57:30 +02:00
mozempk	d5a88d1866	fix: create WhisperState once at load time, reuse across all chunks. Previously create_state() was called for every 60s audio chunk, triggering whisper_init_state() each time. This allocates ~700 MB of GPU compute buffers (KV caches, CUDA workspace) and re-initialises the CUDA backend per chunk. For a 101-minute audio (102 chunks), this caused 102 GPU re-initialisations and VRAM allocation cycles. Under VRAM pressure from concurrent processes, CUDA allocation failures occurred silently: whisper returned language detection results but 0 segments. Fix: create WhisperState once in Transcriber::load() and reuse it for every transcription call. GPU memory is stable; no_context=true prevents KV-cache contamination between chunks. WhisperState is Send+Sync (explicitly declared in whisper-rs) and holds its own Arc<WhisperInnerContext>, so the model weights stay alive even after WhisperContext is dropped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 11:51:33 +02:00
mozempk	c25e8e7ffb	docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> v0.0.1	2026-05-06 10:17:53 +02:00
mozempk	8fc45ee86f	docs: add KNOWLEDGE.md with lessons learned and improvement notes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 10:09:27 +02:00
mozempk	6327ffc09d	fix: use set_language(None) for auto-detect instead of set_detect_language(true). detect_language=true causes whisper.cpp to return 0 immediately after language detection without running the decoder (whisper.cpp source: `if (params.detect_language) { return 0; }`). Setting language=null triggers auto-detection AND transcription. This was the root cause of 0 segments on all jobs without explicit language. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 02:58:35 +02:00
mozempk	ef9c04b070	fix: trim trailing silence from each chunk before whisper. Whisper hallucinates filler tokens (Bye., Thank you., etc.) into end-of-chunk silence. This is especially visible on the final chunk of long audio where the outro silence triggers a 10× repetition loop. Fix: after slicing each PCM chunk, scan backwards to find the last sample above −35 dB, then keep 0.5 s of padding and truncate. Applied to every chunk, not just the last: any chunk ending in a long silence period gets the same protection. Constants match the silencedetect filter already used for chunking: THRESHOLD = 0.0178 (−35 dB); PADDING = 8000 samples (0.5 s at 16 kHz). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 02:13:20 +02:00
mozempk	35e7ea8d28	feat: progress reporting with chunk context + live job persistence. - ProgressEvent::Progress now carries chunk index and total count - SsePayload::Progress gains chunk / chunks_total fields → SSE clients can show 'chunk N/51' instead of bare percent - process_job persists job.progress to storage at each chunk boundary → GET /jobs/:id now shows live progress (not stuck at 0) - Emits Progress event at chunk START (boundary event), not just on whisper's internal callback - entropy_thold raised to 3.5 (catches medium-phrase loops; triggers whisper's own temperature-retry instead of silent repetition) - no_speech_thold removed (confirmed // TODO: not implemented in whisper.cpp source; was a no-op). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 02:00:46 +02:00
mozempk	fb8556441c	feat: silence-based audio chunking before transcription. Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to find silence midpoints. Build chunk boundaries every 180s, snapping to the nearest silence midpoint within ±30s (fallback: hard cut). Each chunk is transcribed independently with its own CUDA context; timestamps are shifted by chunk_start before merging. Progress is scaled per-chunk across the overall 0-100% job range. Result on 101-min YouTube audio (34 chunks, 1714 silence points): - Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations) - After: x4 max consecutive run, all repetitions verified genuine. Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)> instead of a raw ProgressTx so each chunk can independently scale its contribution to the job's broadcast channel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 01:08:06 +02:00
mozempk	9a36000062	fix: disable previous-text conditioning to prevent end-of-file loops. set_no_context(true) stops whisper from feeding its own output back as context for the next segment. Without this, at audio end the model hallucinates a phrase ('All right.', 'So I think we're going to wrap up.') and repeats it hundreds of times in a tight loop. Observed: 759x 'All right.' + 750x 'So I think we're going to wrap up.' in the final 8 seconds of a 101min YouTube conference recording. After fix: clean termination with no repetition loops. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 00:14:56 +02:00
mozempk	2176206afe	fix: restore correct no_speech_thold and BeamSearch defaults. - Revert no_speech_thold from 0.0 back to 0.6 (whisper.cpp default); 0.0 means 'suppress if p(no-speech) > 0.0', which fires on every segment, silently producing 0-segment output for all real-world audio - Revert SamplingStrategy from Greedy{best_of:5} back to BeamSearch{beam_size:5}; Greedy with temperature=0.0 and best_of>1 is undefined in whisper.cpp - Restore entropy_thold=2.4 and logprob_thold=-1.0 defaults - Keep flash_attn disabled (was causing silent failures on conference audio) - Tested: 59 segments on 5 min YouTube conference audio, 29 on repair audio. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-05 23:50:12 +02:00
mozempk	16cb6ca661	feat: GPU-accelerated Whisper API for RTX 2080 (sm_75). - Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI) - Async job queue with SSE progress streaming - Webhook delivery with 5x exponential backoff - Disk-persisted job state (survives restarts) - Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank - CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS - Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR - Gitea Actions CI: build + push to git.sal.giize.com registry - Multi-stage Dockerfile with customizable CUDA_VERSION ARG. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-05 22:47:24 +02:00
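
Commit ef9c04b070 describes trimming trailing silence from each PCM chunk: scan backwards for the last sample above the threshold, keep 0.5 s of padding, and truncate. The sketch below illustrates that idea only; it is not the repository's code. The two constants are the values quoted in the commit message, while the function name `trim_trailing_silence`, the `f32` sample type, and the choice to leave an all-silent chunk untouched are assumptions made for illustration.

```rust
/// Amplitude threshold quoted in the commit message (about -35 dB).
const THRESHOLD: f32 = 0.0178;
/// Padding kept after the last loud sample: 8000 samples = 0.5 s at 16 kHz.
const PADDING: usize = 8000;

/// Hypothetical helper: returns the chunk with trailing silence cut off.
/// Assumes 16 kHz mono samples normalised to [-1.0, 1.0].
fn trim_trailing_silence(pcm: &[f32]) -> &[f32] {
    // Scan backwards for the last sample whose magnitude exceeds the threshold.
    match pcm.iter().rposition(|s| s.abs() > THRESHOLD) {
        Some(last_loud) => {
            // Keep PADDING samples after it, clamped to the chunk length.
            let end = (last_loud + 1 + PADDING).min(pcm.len());
            &pcm[..end]
        }
        // Entirely silent chunk: returned unchanged in this sketch (an assumption).
        None => pcm,
    }
}
```

For a 2 s chunk (32,000 samples) whose last loud sample sits at index 999, this keeps the first 9,000 samples: 1,000 up to and including the loud sample, plus 8,000 of padding.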

15 Commits
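
Commit fb8556441c describes building chunk boundaries every 180 s and snapping each one to the nearest silence midpoint within ±30 s, with a hard cut as the fallback. A minimal sketch of that snapping rule follows, not the repository's implementation. It assumes the nominal targets sit on a fixed 180 s grid rather than being measured from the previous cut, and that silence midpoints arrive as seconds in a slice; the function name `chunk_boundaries` and both constant names are hypothetical.

```rust
/// Values quoted in the commit message.
const TARGET_CHUNK_SECS: f64 = 180.0;
const SNAP_WINDOW_SECS: f64 = 30.0;

/// Hypothetical helper: returns cut points in seconds, strictly inside (0, duration).
fn chunk_boundaries(duration: f64, silence_midpoints: &[f64]) -> Vec<f64> {
    let mut cuts = Vec::new();
    let mut target = TARGET_CHUNK_SECS;
    while target < duration {
        // Nearest silence midpoint within the snap window of the nominal target.
        let snapped = silence_midpoints
            .iter()
            .copied()
            .filter(|m| (m - target).abs() <= SNAP_WINDOW_SECS && *m < duration)
            .min_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()));
        // Fallback: hard cut at the nominal target when no midpoint qualifies.
        cuts.push(snapped.unwrap_or(target));
        target += TARGET_CHUNK_SECS;
    }
    cuts
}
```

Because every cut lies within 30 s of its own grid point and grid points are 180 s apart, the cuts stay strictly increasing without an extra ordering check.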