Files
tonemark/backend.issue.md
Giancarmine Salucci 13a96b6efa
Some checks failed
Build & Push Docker Image / build-and-push (push) Failing after 11s
Initial commit: Tonemark PWA
Tonemark is a SvelteKit PWA for transcribing YouTube videos, audio
and video files, and microphone recordings using a local Whisper backend.

Features:
- Dark glassmorphic UI with electric-lime accent (5 switchable themes)
- Rail nav (desktop) / tab bar (mobile) layout
- Drop zone, YouTube URL input, and live audio recording inputs
- Audio mode waveform cards (none / standard / aggressive / auto)
- Real-time transcription progress with animated waveform
- Job queue with SSE streaming updates
- Push notifications on job completion
- PWA with native SvelteKit service worker
- SRT / TXT / MD / JSON transcript downloads

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 16:41:25 +02:00

137 lines
4.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Whisper Backend Investigation — Observations & Findings
## Summary
The `whisper-rtx2080` backend **does work correctly** when the GPU is warm.
The empty-segments problem is a **transient cold-GPU issue**, not a code bug.
---
## What Was Tried
### 1. Direct API test — 30 s WAV (warm GPU)
```bash
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_30s.wav" \
-F "task=transcribe" \
-F "language=en"
```
**Result:** 6 segments returned in ~25 s. Backend works.
---
### 2. Direct API test — 717 s prepared WAV (warm GPU)
```bash
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_prepared.wav" \
-F "task=transcribe"
```
**Result:** 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.
---
### 3. End-to-end PWA submission — YouTube URL
Submitted `https://www.youtube.com/watch?v=KQDVDtklf34` through the PWA.
- Job `d6178677` was submitted to whisper (confirmed via Docker logs)
- Language detection fired (confirmed via logs)
- Job completed in ~30 s
- Webhook received with HTTP 200 (confirmed via logs)
- **BUT** `segments_json = "[]"` stored in the DB
This was a **cold-GPU run** right after container restart.
---
### 4. GPU architecture mismatch investigation
- `docker info` reported `RTX 3060 (sm_86)` inside the container
- `Dockerfile` compiled with `CMAKE_CUDA_ARCHITECTURES=75` (RTX 2080 / sm_75)
- Hypothesis: wrong binary → silent 0-output
- **User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)**
- Reverted any Dockerfile changes back to `CMAKE_CUDA_ARCHITECTURES=75`
---
### 5. Source code analysis — `transcriber.rs`
Key findings from reading the Rust source:
| Setting | Value | Effect |
|---|---|---|
| `set_language(None)` | ✅ Correct | Auto-detects language, returns segments |
| `set_detect_language(true)` | ❌ Wrong | Returns 0 segments (early exit) |
| `entropy_thold` | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |
The code uses `set_language(None)` which is correct.
Flash attention was already disabled — this alone explains many of the prior 0-segment reports.
---
### 6. Webhook behavior
- The backend fires the webhook **exactly once**, after ALL internal 60 s silence-based chunks complete.
- We submit one file → backend chunks internally → one webhook with the full `WhisperJob` object.
- Webhook payload includes: `{ id, status, language, segments, duration_secs, error, … }`
- Our `POST /api/webhook/[jobId]` route handles this correctly.
---
### 7. Captions fast-path (yt-dlp VTT)
When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline **skips Whisper entirely**
and parses the VTT. If VTT parsing returns `[]` (edge case with certain caption formats), the job
completes with empty segments — no whisper involvement.
This can look identical to a whisper 0-segments failure but is a completely different code path.
---
## Root Cause of Empty Segments
**Cold GPU after container restart.**
Right after the Docker container starts and loads the model, the first 12 jobs sometimes complete
in ~0.5 s with 0 segments — physically impossible for real audio transcription.
After the GPU warms up (first successful transcription ~2547 s), all subsequent jobs return full segments.
This is a transient state that resolves on its own. It is **not** caused by:
- Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
- `set_detect_language` (not used)
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
- Webhook not firing (logs confirmed 200 OK webhook delivery)
---
## Observations
| Observation | Status |
|---|---|
| Backend returns full segments when GPU is warm | ✅ Confirmed |
| Webhook fires once per job with full payload | ✅ Confirmed |
| `json.job_id` (not `json.id`) is the correct response field | ✅ Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | ✅ Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |
---
## What Was NOT Touched (per user request)
- `whisper-rtx2080` Dockerfile, Rust source, or any backend configuration
- Any backend API behaviour
---
## Next Steps (Backend Side — User Handling Separately)
- Monitor first-job-after-restart 0-segment issue
- Optionally: warm up GPU on container start with a small silent WAV
- Consider retrying a job if `segments == []` and `duration_secs > 5`