Whisper Backend Investigation — Observations & Findings
Summary
The whisper-rtx2080 backend does work correctly when the GPU is warm.
The empty-segments problem is a transient cold-GPU issue, not a code bug.
What Was Tried
1. Direct API test — 30 s WAV (warm GPU)
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_30s.wav" \
-F "task=transcribe" \
-F "language=en"
Result: 6 segments returned in ~25 s. Backend works.
2. Direct API test — 717 s prepared WAV (warm GPU)
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_prepared.wav" \
-F "task=transcribe"
Result: 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.
3. End-to-end PWA submission — YouTube URL
Submitted https://www.youtube.com/watch?v=KQDVDtklf34 through the PWA.
- Job d6178677 was submitted to whisper (confirmed via Docker logs)
- Language detection fired (confirmed via logs)
- Job completed in ~30 s
- Webhook received with HTTP 200 (confirmed via logs)
- BUT segments_json = "[]" stored in the DB
This was a cold-GPU run right after container restart.
4. GPU architecture mismatch investigation
- docker info reported RTX 3060 (sm_86) inside the container
- Dockerfile compiled with CMAKE_CUDA_ARCHITECTURES=75 (RTX 2080 / sm_75)
- Hypothesis: wrong binary → silent 0-output
- User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)
- Reverted any Dockerfile changes back to CMAKE_CUDA_ARCHITECTURES=75
5. Source code analysis — transcriber.rs
Key findings from reading the Rust source:
| Setting | Value | Effect |
|---|---|---|
| set_language(None) | ✅ Correct | Auto-detects language, returns segments |
| set_detect_language(true) | ❌ Wrong | Returns 0 segments (early exit) |
| entropy_thold | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |
The code uses set_language(None) which is correct.
Flash attention was already disabled — this alone explains many of the prior 0-segment reports.
6. Webhook behavior
- The backend fires the webhook exactly once, after ALL internal 60 s silence-based chunks complete.
- We submit one file → backend chunks internally → one webhook with the full WhisperJob object.
- Webhook payload includes: { id, status, language, segments, duration_secs, error, … }
- Our POST /api/webhook/[jobId] route handles this correctly.
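As a rough illustration of the payload described above, the sketch below narrows an untrusted webhook body to the listed fields and produces the string stored as segments_json. The names parseWhisperJob and toSegmentsJson are hypothetical, the nullability of language and error is an assumption, and the segment shape is left opaque because these notes do not document it. It is not the actual route handler.

```typescript
// Fields listed in the notes; the elided remainder ("…") is not modelled.
interface WhisperJob {
  id: string;
  status: string;
  language: string | null; // assumption: may be absent
  segments: unknown[];     // segment shape is not documented here
  duration_secs: number;
  error: string | null;    // assumption: null or missing on success
}

// Narrow an untrusted webhook body to WhisperJob, or return null.
function parseWhisperJob(body: unknown): WhisperJob | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (typeof b.id !== "string") return null;
  if (typeof b.status !== "string") return null;
  if (!Array.isArray(b.segments)) return null;
  if (typeof b.duration_secs !== "number") return null;
  return {
    id: b.id,
    status: b.status,
    language: typeof b.language === "string" ? b.language : null,
    segments: b.segments,
    duration_secs: b.duration_secs,
    error: typeof b.error === "string" ? b.error : null,
  };
}

// Serialise the segments for the segments_json column.
function toSegmentsJson(job: WhisperJob): string {
  return JSON.stringify(job.segments);
}
```

An empty segments array passes validation and serialises to "[]", which matches what was seen in the DB for the cold-GPU run: the webhook path itself behaves correctly even when the result is empty.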
7. Captions fast-path (yt-dlp VTT)
When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline skips Whisper entirely
and parses the VTT. If VTT parsing returns [] (edge case with certain caption formats), the job
completes with empty segments — no whisper involvement.
This can look identical to a whisper 0-segments failure but is a completely different code path.
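The notes do not say which caption formats trigger the empty parse. Purely as an illustration, the sketch below shows a tolerant cue parser that handles two variations WebVTT allows and that a strict parser could miss: timestamps without the hours field, and cue settings trailing the end timestamp. parseVtt and its helpers are hypothetical names, not the pipeline's actual parser.

```typescript
interface Cue {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// "HH:MM:SS.mmm" or "MM:SS.mmm" -> seconds
function vttTimeToSecs(t: string): number {
  let secs = 0;
  for (const part of t.split(":")) secs = secs * 60 + Number(part);
  return secs;
}

// Hours are optional; anything after the end timestamp (cue settings) is ignored.
const TIMING =
  /^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})/;

function parseVtt(vtt: string): Cue[] {
  const cues: Cue[] = [];
  // Normalise line endings, then split into blank-line-separated blocks.
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    const idx = lines.findIndex((l) => TIMING.test(l));
    if (idx === -1) continue; // WEBVTT header, NOTE or STYLE block
    const m = TIMING.exec(lines[idx])!;
    // Join payload lines and strip inline tags such as <c>...</c>.
    const text = lines.slice(idx + 1).join(" ").replace(/<[^>]+>/g, "").trim();
    if (text.length === 0) continue;
    cues.push({ start: vttTimeToSecs(m[1]), end: vttTimeToSecs(m[2]), text });
  }
  return cues;
}
```

If the fast-path keeps returning [] for some videos, logging the first few lines of the VTT that failed to parse would show whether either of these variations is involved.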
Root Cause of Empty Segments
Cold GPU after container restart.
Right after the Docker container starts and loads the model, the first 1–2 jobs sometimes complete in ~0.5 s with 0 segments — physically impossible for real audio transcription. After the GPU warms up (first successful transcription ~25–47 s), all subsequent jobs return full segments.
This is a transient state that resolves on its own. It is not caused by:
- Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
- set_detect_language (not used)
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
- Webhook not firing (logs confirmed 200 OK webhook delivery)
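The "physically impossible" argument can be made concrete with the realtime factor from the warm-GPU tests: 717 s of audio in ~47 s is roughly 15× realtime, whereas the same audio finishing in ~0.5 s would imply over 1000×. The sketch below encodes that check; the function names and the ceiling of 50× are hypothetical, chosen only to sit well above the measured figure.

```typescript
// Seconds of audio processed per second of wall-clock time.
function realtimeFactor(audioSecs: number, elapsedSecs: number): number {
  return audioSecs / elapsedSecs;
}

// Hypothetical ceiling, well above the ~15x measured on the warm RTX 2080.
const MAX_PLAUSIBLE_FACTOR = 50;

// True when a job finished faster than the GPU could plausibly transcribe it.
function isImplausiblyFast(audioSecs: number, elapsedSecs: number): boolean {
  if (elapsedSecs <= 0) return true;
  return realtimeFactor(audioSecs, elapsedSecs) > MAX_PLAUSIBLE_FACTOR;
}

// Cold-GPU signature from these notes: zero segments AND an impossible speed.
function looksLikeColdGpuResult(
  audioSecs: number,
  elapsedSecs: number,
  segmentCount: number,
): boolean {
  return segmentCount === 0 && isImplausiblyFast(audioSecs, elapsedSecs);
}
```

Requiring both conditions keeps genuinely silent audio, which can legitimately yield zero segments after a normal-length run, from being flagged.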
Observations
| Observation | Status |
|---|---|
| Backend returns full segments when GPU is warm | ✅ Confirmed |
| Webhook fires once per job with full payload | ✅ Confirmed |
| json.job_id (not json.id) is the correct response field | ✅ Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | ✅ Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |
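The job_id observation is easy to get wrong because the webhook payload uses id while the response in question uses job_id. A minimal sketch of a reader for that response, assuming it is the JSON body returned when a job is submitted; extractJobId is a hypothetical helper name:

```typescript
// Read the job identifier from the response body.
// Per the observation above, the field is job_id, not id.
function extractJobId(json: unknown): string {
  if (typeof json === "object" && json !== null) {
    const jobId = (json as Record<string, unknown>).job_id;
    if (typeof jobId === "string" && jobId.length > 0) return jobId;
  }
  throw new Error("response has no job_id field");
}
```

Failing loudly here, rather than falling back to json.id, surfaces a field mismatch at submission time instead of leaving a job that can never be matched to its webhook.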
What Was NOT Touched (per user request)
- whisper-rtx2080 Dockerfile, Rust source, or any backend configuration
- Any backend API behaviour
Next Steps (Backend Side — User Handling Separately)
- Monitor first-job-after-restart 0-segment issue
- Optionally: warm up GPU on container start with a small silent WAV
- Consider retrying a job if segments == [] and duration_secs > 5
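The last two suggestions can be sketched as follows. shouldRetry applies the condition from the list plus a retry cap, and makeSilentWav builds a small silent 16-bit mono PCM file that could be posted as a warm-up job. Both function names, the cap of 2 attempts, and the 16 kHz default are assumptions for illustration; whether silence is enough to warm the GPU has not been verified in these notes.

```typescript
interface JobResult {
  segments: unknown[];
  duration_secs: number;
}

const MAX_RETRIES = 2; // hypothetical cap so a genuinely silent file cannot loop forever

// Retry when the result is empty although the audio is longer than 5 s.
function shouldRetry(job: JobResult, attempt: number): boolean {
  return job.segments.length === 0 && job.duration_secs > 5 && attempt < MAX_RETRIES;
}

// Build a silent 16-bit mono PCM WAV (44-byte header followed by zeroed samples).
function makeSilentWav(seconds: number, sampleRate = 16000): Uint8Array {
  const dataBytes = Math.floor(seconds * sampleRate) * 2;
  const buf = new ArrayBuffer(44 + dataBytes); // ArrayBuffer is zero-filled
  const view = new DataView(buf);
  const ascii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);  // RIFF chunk size
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataBytes, true);
  return new Uint8Array(buf);
}
```

The resulting bytes could be submitted to the same /jobs endpoint used in the direct API tests once the container reports ready, with the warm-up result discarded.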