Initial commit: Tonemark PWA
Some checks failed
Build & Push Docker Image / build-and-push (push) Failing after 11s

Tonemark is a SvelteKit PWA for transcribing YouTube videos, audio
and video files, and microphone recordings using a local Whisper backend.

Features:
- Dark glassmorphic UI with electric-lime accent (5 switchable themes)
- Rail nav (desktop) / tab bar (mobile) layout
- Drop zone, YouTube URL input, and live audio recording inputs
- Audio mode waveform cards (none / standard / aggressive / auto)
- Real-time transcription progress with animated waveform
- Job queue with SSE streaming updates
- Push notifications on job completion
- PWA with native SvelteKit service worker
- SRT / TXT / MD / JSON transcript downloads

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
Giancarmine Salucci
2026-05-06 16:41:25 +02:00
commit 13a96b6efa
68 changed files with 9712 additions and 0 deletions

136
backend.issue.md Normal file
View File

@@ -0,0 +1,136 @@
# Whisper Backend Investigation — Observations & Findings
## Summary
The `whisper-rtx2080` backend **does work correctly** when the GPU is warm.
The empty-segments problem is a **transient cold-GPU issue**, not a code bug.
---
## What Was Tried
### 1. Direct API test — 30 s WAV (warm GPU)
```bash
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_30s.wav" \
-F "task=transcribe" \
-F "language=en"
```
**Result:** 6 segments returned in ~25 s. Backend works.
---
### 2. Direct API test — 717 s prepared WAV (warm GPU)
```bash
curl -s -X POST http://localhost:8091/jobs \
-F "audio=@/tmp/test_prepared.wav" \
-F "task=transcribe"
```
**Result:** 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.
---
### 3. End-to-end PWA submission — YouTube URL
Submitted `https://www.youtube.com/watch?v=KQDVDtklf34` through the PWA.
- Job `d6178677` was submitted to whisper (confirmed via Docker logs)
- Language detection fired (confirmed via logs)
- Job completed in ~30 s
- Webhook received with HTTP 200 (confirmed via logs)
- **BUT** `segments_json = "[]"` stored in the DB
This was a **cold-GPU run** right after container restart.
---
### 4. GPU architecture mismatch investigation
- `docker info` reported `RTX 3060 (sm_86)` inside the container
- `Dockerfile` compiled with `CMAKE_CUDA_ARCHITECTURES=75` (RTX 2080 / sm_75)
- Hypothesis: wrong binary → silent 0-output
- **User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)**
- Reverted any Dockerfile changes back to `CMAKE_CUDA_ARCHITECTURES=75`
---
### 5. Source code analysis — `transcriber.rs`
Key findings from reading the Rust source:
| Setting | Value | Effect |
|---|---|---|
| `set_language(None)` | ✅ Correct | Auto-detects language, returns segments |
| `set_detect_language(true)` | ❌ Wrong | Returns 0 segments (early exit) |
| `entropy_thold` | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |
The code uses `set_language(None)` which is correct.
Flash attention was already disabled — this alone explains many of the prior 0-segment reports.
---
### 6. Webhook behavior
- The backend fires the webhook **exactly once**, after ALL internal 60 s silence-based chunks complete.
- We submit one file → backend chunks internally → one webhook with the full `WhisperJob` object.
- Webhook payload includes: `{ id, status, language, segments, duration_secs, error, … }`
- Our `POST /api/webhook/[jobId]` route handles this correctly.
---
### 7. Captions fast-path (yt-dlp VTT)
When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline **skips Whisper entirely**
and parses the VTT. If VTT parsing returns `[]` (edge case with certain caption formats), the job
completes with empty segments — no whisper involvement.
This can look identical to a whisper 0-segments failure but is a completely different code path.
---
## Root Cause of Empty Segments
**Cold GPU after container restart.**
Right after the Docker container starts and loads the model, the first 12 jobs sometimes complete
in ~0.5 s with 0 segments — physically impossible for real audio transcription.
After the GPU warms up (first successful transcription ~2547 s), all subsequent jobs return full segments.
This is a transient state that resolves on its own. It is **not** caused by:
- Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
- `set_detect_language` (not used)
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
- Webhook not firing (logs confirmed 200 OK webhook delivery)
---
## Observations
| Observation | Status |
|---|---|
| Backend returns full segments when GPU is warm | ✅ Confirmed |
| Webhook fires once per job with full payload | ✅ Confirmed |
| `json.job_id` (not `json.id`) is the correct response field | ✅ Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | ✅ Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |
---
## What Was NOT Touched (per user request)
- `whisper-rtx2080` Dockerfile, Rust source, or any backend configuration
- Any backend API behaviour
---
## Next Steps (Backend Side — User Handling Separately)
- Monitor first-job-after-restart 0-segment issue
- Optionally: warm up GPU on container start with a small silent WAV
- Consider retrying a job if `segments == []` and `duration_secs > 5`