whisper-rtx2080

mozempk/whisper-rtx2080

Author	SHA1	Message	Date
mozempk	bcaf8680db	docs: add FRONTEND_INTEGRATION.md developer guide All checks were successful Build & Push Docker Image / test (push) Successful in 5m54s Details Build & Push Docker Image / build-and-push (push) Successful in 17s Details Comprehensive integration guide for frontend/full-stack developers: - Architecture overview diagram - Quick start (submit + poll in ~20 lines) - Model lifecycle: state machine diagram, all 4 /model/* endpoints, SSE event subscription with JS examples - Job submission: multipart fields, 503 model_not_ready handling, retry-with-auto-load pattern - Job progress: polling vs SSE, all event types with payloads - Webhooks: job completion + model lifecycle, Express receiver example, how to distinguish job vs model payloads - Health check field reference - Cancellation semantics (GPU inference not interruptible) - Full TypeScript type definitions for all API shapes - React hooks: useModelStatus, useJobStream, useTranscribe - Complete WhisperClient class example with ensureModelReady, streamProgress, and end-to-end transcribe() - Error reference table with all 400/404/409/503/500 shapes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-08 23:47:13 +02:00
mozempk	b191fbe200	feat: dynamic model loading/unloading with GPU polling All checks were successful Build & Push Docker Image / build-and-push (push) Successful in 8m41s Details - Model starts unloaded (lazy); loads on first job or POST /model/load - Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity - POST /model/unload for immediate manual release - GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries every GPU_POLL_INTERVAL_SECS (default 30) indefinitely - POST /jobs when unloaded → 503 + Retry-After header, triggers load - AppError::OutOfMemory and AppError::ModelNotReady variants - WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel - Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread) - Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks) - webhook_registry: all clients that ever submitted a webhook_url receive model_ready and model_unloaded webhooks - GPU warmup retained on every (re)load New routes: GET /model/status — current state + VRAM stats POST /model/load — trigger load (idempotent) POST /model/unload — immediate unload GET /model/events — SSE stream of model lifecycle events New env vars: IDLE_TIMEOUT_SECS (default 300) GPU_POLL_INTERVAL_SECS (default 30) Tests: tests/test_model_lifecycle.sh — 18 integration tests (full state machine, SSE events, webhooks, concurrency, unload-during-load) tests/test_idle_timeout.sh — 5 tests with short IDLE_TIMEOUT_SECS=5 test_all.sh updated: loads model before job submission, asserts model_state in /health, adds POST /model/unload at end Docs: docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern, updated /health response shape Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-08 17:57:20 +02:00
mozempk	fd8d4deefb	fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding All checks were successful Build & Push Docker Image / build-and-push (push) Successful in 6m39s Details GPU warmup (src/transcriber.rs): After creating WhisperState, run a 1s silent inference pass in load(). CUDA JIT-compiles device kernels on the first whisper_full_with_state call. On a cold GPU this compilation disrupts the decode pipeline mid-inference, returning 0 segments in ~0.5s. The warmup forces all kernel compilation at startup so the first real job runs on fully compiled kernels. test_all.sh: - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps) - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO) - Fix DELETE assertion: completed jobs return 409 Conflict, not 204 - Add explicit zero-segments failure check in quality inspection (step 9) - Add progress reporting to poll loop docs/FINDINGS.md + KNOWLEDGE.md: Document cold GPU warmup issue, root cause, and fix. Document language=auto as invalid API usage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 11:57:30 +02:00
mozempk	c25e8e7ffb	docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/ All checks were successful Build & Push Docker Image / build-and-push (push) Successful in 17s Details Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 10:17:53 +02:00

4 Commits