# Whisper Backend Investigation: Observations & Findings

## Summary

The `whisper-rtx2080` backend **does work correctly** when the GPU is warm. The empty-segments problem is a **transient cold-GPU issue**, not a code bug.

---

## What Was Tried

### 1. Direct API test: 30 s WAV (warm GPU)

```bash
curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_30s.wav" \
  -F "task=transcribe" \
  -F "language=en"
```

**Result:** 6 segments returned in ~25 s. Backend works.

---

### 2. Direct API test: 717 s prepared WAV (warm GPU)

```bash
curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_prepared.wav" \
  -F "task=transcribe"
```

**Result:** 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.

---

### 3. End-to-end PWA submission: YouTube URL

Submitted `https://www.youtube.com/watch?v=KQDVDtklf34` through the PWA.

- Job `d6178677` was submitted to whisper (confirmed via Docker logs)
- Language detection fired (confirmed via logs)
- Job completed in ~30 s
- Webhook received with HTTP 200 (confirmed via logs)
- **BUT** `segments_json = "[]"` stored in the DB

This was a **cold-GPU run** right after container restart.

---

### 4. GPU architecture mismatch investigation

- `docker info` reported `RTX 3060 (sm_86)` inside the container
- `Dockerfile` compiled with `CMAKE_CUDA_ARCHITECTURES=75` (RTX 2080 / sm_75)
- Hypothesis: wrong binary → silent 0-output
- **User confirmed this is a Docker reporting error; the GPU is actually an RTX 2080 (sm_75)**
- Reverted any Dockerfile changes back to `CMAKE_CUDA_ARCHITECTURES=75`

---

### 5. Source code analysis: `transcriber.rs`

Key findings from reading the Rust source:

| Setting | Value | Effect |
|---|---|---|
| `set_language(None)` | ✅ Correct | Auto-detects language, returns segments |
| `set_detect_language(true)` | ❌ Wrong | Returns 0 segments (early exit) |
| `entropy_thold` | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |

The code uses `set_language(None)`, which is correct. Flash attention was already disabled; this alone explains many of the prior 0-segment reports.

---

### 6. Webhook behavior

- The backend fires the webhook **exactly once**, after ALL internal 60 s silence-based chunks complete.
- We submit one file → backend chunks internally → one webhook with the full `WhisperJob` object.
- Webhook payload includes: `{ id, status, language, segments, duration_secs, error, … }`
- Our `POST /api/webhook/[jobId]` route handles this correctly.

---

### 7. Captions fast-path (yt-dlp VTT)

When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline **skips Whisper entirely** and parses the VTT. If VTT parsing returns `[]` (edge case with certain caption formats), the job completes with empty segments and no whisper involvement. This can look identical to a whisper 0-segments failure but is a completely different code path.

---

## Root Cause of Empty Segments

**Cold GPU after container restart.**

Right after the Docker container starts and loads the model, the first 1–2 jobs sometimes complete in ~0.5 s with 0 segments, which is physically impossible for real audio transcription. After the GPU warms up (first successful transcription ~25–47 s), all subsequent jobs return full segments. This is a transient state that resolves on its own.

It is **not** caused by:

- Wrong CUDA architecture (GPU is RTX 2080 and the binary is sm_75, which is correct)
- `set_detect_language` (not used)
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
- Webhook not firing (logs confirmed 200 OK webhook delivery)

---

## Observations

| Observation | Status |
|---|---|
| Backend returns full segments when GPU is warm | ✅ Confirmed |
| Webhook fires once per job with full payload | ✅ Confirmed |
| `json.job_id` (not `json.id`) is the correct response field | ✅ Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | ✅ Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |

---

## What Was NOT Touched (per user request)

- `whisper-rtx2080` Dockerfile, Rust source, or any backend configuration
- Any backend API behaviour

---

## Next Steps (Backend Side, User Handling Separately)

- Monitor first-job-after-restart 0-segment issue
- Optionally: warm up GPU on container start with a small silent WAV
- Consider retrying a job if `segments == []` and `duration_secs > 5`
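
The retry condition in the last step can be sketched as a small predicate over the webhook payload. This is a minimal illustration, not the project's code: the function names, `MAX_RETRIES`, and the reading of `duration_secs` as the audio length in seconds are all assumptions. Only the `segments` and `duration_secs` field names and the 5 s threshold come from the notes above.

```python
from typing import Any, Mapping

# Threshold taken from the retry idea above.
MIN_AUDIO_SECS = 5.0
# Hypothetical cap: the cold state was seen on the first 1-2 jobs after a restart.
MAX_RETRIES = 2


def looks_like_cold_gpu_result(payload: Mapping[str, Any]) -> bool:
    """Return True when a job reports no segments for audio longer than the threshold.

    Assumes `duration_secs` is the length of the submitted audio.
    """
    segments = payload.get("segments") or []
    duration = float(payload.get("duration_secs") or 0.0)
    return len(segments) == 0 and duration > MIN_AUDIO_SECS


def should_retry(payload: Mapping[str, Any], attempts_so_far: int) -> bool:
    """Retry only suspicious empty results, and only up to MAX_RETRIES times."""
    return attempts_so_far < MAX_RETRIES and looks_like_cold_gpu_result(payload)


if __name__ == "__main__":
    cold = {"segments": [], "duration_secs": 717.0}
    print(should_retry(cold, attempts_so_far=0))
```

The check only inspects the payload, so it cannot tell a cold-GPU result from an empty VTT fast-path result; a caller would need to know which code path produced the job before resubmitting it to whisper.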