4 Commits

Author SHA1 Message Date
mozempk
bcaf8680db docs: add FRONTEND_INTEGRATION.md developer guide
Comprehensive integration guide for frontend/full-stack developers:

- Architecture overview diagram
- Quick start (submit + poll in ~20 lines)
- Model lifecycle: state machine diagram, all 4 /model/* endpoints,
  SSE event subscription with JS examples
- Job submission: multipart fields, 503 model_not_ready handling,
  retry-with-auto-load pattern
- Job progress: polling vs SSE, all event types with payloads
- Webhooks: job completion + model lifecycle, Express receiver example,
  how to distinguish job vs model payloads
- Health check field reference
- Cancellation semantics (GPU inference not interruptible)
- Full TypeScript type definitions for all API shapes
- React hooks: useModelStatus, useJobStream, useTranscribe
- Complete WhisperClient class example with ensureModelReady,
  streamProgress, and end-to-end transcribe()
- Error reference table with all 400/404/409/503/500 shapes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 23:47:13 +02:00
mozempk
b191fbe200 feat: dynamic model loading/unloading with GPU polling
- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries
  every GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: every webhook_url ever submitted receives the
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load
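
The idle-timer bullet above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual worker: the `WorkerCmd` variants, `worker_loop`, `spawn_worker`, and the string log are all hypothetical names invented here, and the real load/unload calls are replaced by a boolean flag. Only `SyncSender<WorkerCmd>` and the `recv_timeout` tick come from the commit message.

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical command set; the real WorkerCmd variants are not shown in the commit.
pub enum WorkerCmd {
    Load,
    Unload,
    Job(String),
}

/// Runs until every sender is dropped and returns a log of what happened.
/// `tick` plays the role of the 1s recv_timeout mentioned in the commit.
pub fn worker_loop(
    rx: Receiver<WorkerCmd>,
    idle_timeout: Duration,
    tick: Duration,
) -> Vec<&'static str> {
    let mut log = Vec::new();
    let mut loaded = false; // stand-in for "model is resident in VRAM"
    let mut last_activity = Instant::now();
    loop {
        match rx.recv_timeout(tick) {
            Ok(WorkerCmd::Load) => {
                if !loaded {
                    loaded = true;
                    log.push("loaded");
                }
                last_activity = Instant::now();
            }
            Ok(WorkerCmd::Job(_path)) => {
                // Assumption: a job lazily loads the model before running.
                if !loaded {
                    loaded = true;
                    log.push("loaded");
                }
                log.push("job");
                last_activity = Instant::now();
            }
            Ok(WorkerCmd::Unload) => {
                if loaded {
                    loaded = false;
                    log.push("unloaded");
                }
            }
            // No command arrived within one tick: use the wake-up as the idle check,
            // so no separate timer thread is needed.
            Err(RecvTimeoutError::Timeout) => {
                if loaded && last_activity.elapsed() >= idle_timeout {
                    loaded = false;
                    log.push("idle_unloaded");
                }
            }
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    log
}

/// Spawns the loop on its own OS thread and hands back the bounded sender.
pub fn spawn_worker(
    idle_timeout: Duration,
    tick: Duration,
) -> (SyncSender<WorkerCmd>, thread::JoinHandle<Vec<&'static str>>) {
    let (tx, rx) = sync_channel(16);
    let handle = thread::spawn(move || worker_loop(rx, idle_timeout, tick));
    (tx, handle)
}
```

With `spawn_worker(Duration::from_secs(300), Duration::from_secs(1))` the loop would wake once per second and unload after five idle minutes. The idle check is only as precise as the tick, which seems an acceptable trade for avoiding a second thread.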

New routes:
  GET  /model/status  — current state + VRAM stats
  POST /model/load    — trigger load (idempotent)
  POST /model/unload  — immediate unload
  GET  /model/events  — SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS       (default 300)
  GPU_POLL_INTERVAL_SECS  (default 30)
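
One way the two variables and their defaults could be read is sketched below. `LifecycleConfig` and `secs_or_default` are hypothetical names, and silently falling back to the default on an unparsable value is an assumption made for this sketch; the commit does not say how invalid values are handled.

```rust
use std::time::Duration;

/// Parses a whole number of seconds, falling back to `default_secs`
/// when the variable is unset or not a valid u64 (assumed behaviour).
pub fn secs_or_default(raw: Option<&str>, default_secs: u64) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(default_secs);
    Duration::from_secs(secs)
}

/// Hypothetical holder for the two lifecycle settings.
pub struct LifecycleConfig {
    pub idle_timeout: Duration,
    pub gpu_poll_interval: Duration,
}

impl LifecycleConfig {
    /// Reads the environment once, applying the defaults listed in the commit.
    pub fn from_env() -> Self {
        let idle = std::env::var("IDLE_TIMEOUT_SECS").ok();
        let poll = std::env::var("GPU_POLL_INTERVAL_SECS").ok();
        Self {
            idle_timeout: secs_or_default(idle.as_deref(), 300),
            gpu_poll_interval: secs_or_default(poll.as_deref(), 30),
        }
    }
}
```

Keeping the parsing in a pure function makes the defaults easy to check without touching the process environment.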

Tests:
  tests/test_model_lifecycle.sh — 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh    — 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-08 17:57:20 +02:00
mozempk
fd8d4deefb fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding
GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.
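
The shape of the warmup described above might look like the following sketch. It does not use the real whisper bindings: the `Inference` trait and `load_with_warmup` are hypothetical stand-ins for the state object and `load()` in src/transcriber.rs, and the only outside fact relied on is that Whisper consumes 16 kHz mono f32 samples.

```rust
/// Whisper models consume 16 kHz mono f32 PCM.
pub const SAMPLE_RATE_HZ: usize = 16_000;

/// Builds `seconds` of digital silence at the model's sample rate.
pub fn silent_samples(seconds: usize) -> Vec<f32> {
    vec![0.0; SAMPLE_RATE_HZ * seconds]
}

/// Hypothetical abstraction over "run one full inference pass";
/// returns the number of segments produced.
pub trait Inference {
    fn run(&mut self, samples: &[f32]) -> Result<usize, String>;
}

/// Runs one throwaway pass over 1s of silence before handing the engine out,
/// so any first-call kernel compilation happens at load time, not on a real job.
pub fn load_with_warmup<E: Inference>(mut engine: E) -> Result<E, String> {
    // The segment count of the warmup pass is irrelevant and is discarded.
    let _segments = engine.run(&silent_samples(1))?;
    Ok(engine)
}
```

A warmup failure is propagated rather than ignored in this sketch, on the reasoning that an engine that cannot complete a silent pass is unlikely to handle real audio.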

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:57:30 +02:00
mozempk
c25e8e7ffb docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:17:53 +02:00