Initial commit: Tonemark PWA
Some checks failed
Build & Push Docker Image / build-and-push (push) Failing after 11s
Some checks failed
Build & Push Docker Image / build-and-push (push) Failing after 11s
Tonemark is a SvelteKit PWA for transcribing YouTube videos, audio and video files, and microphone recordings using a local Whisper backend. Features: - Dark glassmorphic UI with electric-lime accent (5 switchable themes) - Rail nav (desktop) / tab bar (mobile) layout - Drop zone, YouTube URL input, and live audio recording inputs - Audio mode waveform cards (none / standard / aggressive / auto) - Real-time transcription progress with animated waveform - Job queue with SSE streaming updates - Push notifications on job completion - PWA with native SvelteKit service worker - SRT / TXT / MD / JSON transcript downloads Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit is contained in:
136
backend.issue.md
Normal file
136
backend.issue.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# Whisper Backend Investigation — Observations & Findings
|
||||
|
||||
## Summary
|
||||
|
||||
The `whisper-rtx2080` backend **does work correctly** when the GPU is warm.
|
||||
The empty-segments problem is a **transient cold-GPU issue**, not a code bug.
|
||||
|
||||
---
|
||||
|
||||
## What Was Tried
|
||||
|
||||
### 1. Direct API test — 30 s WAV (warm GPU)
|
||||
|
||||
```bash
|
||||
curl -s -X POST http://localhost:8091/jobs \
|
||||
-F "audio=@/tmp/test_30s.wav" \
|
||||
-F "task=transcribe" \
|
||||
-F "language=en"
|
||||
```
|
||||
|
||||
**Result:** 6 segments returned in ~25 s. Backend works.
|
||||
|
||||
---
|
||||
|
||||
### 2. Direct API test — 717 s prepared WAV (warm GPU)
|
||||
|
||||
```bash
|
||||
curl -s -X POST http://localhost:8091/jobs \
|
||||
-F "audio=@/tmp/test_prepared.wav" \
|
||||
-F "task=transcribe"
|
||||
```
|
||||
|
||||
**Result:** 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.
|
||||
|
||||
---
|
||||
|
||||
### 3. End-to-end PWA submission — YouTube URL
|
||||
|
||||
Submitted `https://www.youtube.com/watch?v=KQDVDtklf34` through the PWA.
|
||||
|
||||
- Job `d6178677` was submitted to whisper (confirmed via Docker logs)
|
||||
- Language detection fired (confirmed via logs)
|
||||
- Job completed in ~30 s
|
||||
- Webhook received with HTTP 200 (confirmed via logs)
|
||||
- **BUT** `segments_json = "[]"` stored in the DB
|
||||
|
||||
This was a **cold-GPU run** right after container restart.
|
||||
|
||||
---
|
||||
|
||||
### 4. GPU architecture mismatch investigation
|
||||
|
||||
- `docker info` reported `RTX 3060 (sm_86)` inside the container
|
||||
- `Dockerfile` compiled with `CMAKE_CUDA_ARCHITECTURES=75` (RTX 2080 / sm_75)
|
||||
- Hypothesis: wrong binary → silent 0-output
|
||||
- **User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)**
|
||||
- Reverted any Dockerfile changes back to `CMAKE_CUDA_ARCHITECTURES=75`
|
||||
|
||||
---
|
||||
|
||||
### 5. Source code analysis — `transcriber.rs`
|
||||
|
||||
Key findings from reading the Rust source:
|
||||
|
||||
| Setting | Value | Effect |
|
||||
|---|---|---|
|
||||
| `set_language(None)` | ✅ Correct | Auto-detects language, returns segments |
|
||||
| `set_detect_language(true)` | ❌ Wrong | Returns 0 segments (early exit) |
|
||||
| `entropy_thold` | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
|
||||
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |
|
||||
|
||||
The code uses `set_language(None)` which is correct.
|
||||
Flash attention was already disabled — this alone explains many of the prior 0-segment reports.
|
||||
|
||||
---
|
||||
|
||||
### 6. Webhook behavior
|
||||
|
||||
- The backend fires the webhook **exactly once**, after ALL internal 60 s silence-based chunks complete.
|
||||
- We submit one file → backend chunks internally → one webhook with the full `WhisperJob` object.
|
||||
- Webhook payload includes: `{ id, status, language, segments, duration_secs, error, … }`
|
||||
- Our `POST /api/webhook/[jobId]` route handles this correctly.
|
||||
|
||||
---
|
||||
|
||||
### 7. Captions fast-path (yt-dlp VTT)
|
||||
|
||||
When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline **skips Whisper entirely**
|
||||
and parses the VTT. If VTT parsing returns `[]` (edge case with certain caption formats), the job
|
||||
completes with empty segments — no whisper involvement.
|
||||
|
||||
This can look identical to a whisper 0-segments failure but is a completely different code path.
|
||||
|
||||
---
|
||||
|
||||
## Root Cause of Empty Segments
|
||||
|
||||
**Cold GPU after container restart.**
|
||||
|
||||
Right after the Docker container starts and loads the model, the first 1–2 jobs sometimes complete
|
||||
in ~0.5 s with 0 segments — physically impossible for real audio transcription.
|
||||
After the GPU warms up (first successful transcription ~25–47 s), all subsequent jobs return full segments.
|
||||
|
||||
This is a transient state that resolves on its own. It is **not** caused by:
|
||||
- Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
|
||||
- `set_detect_language` (not used)
|
||||
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
|
||||
- Webhook not firing (logs confirmed 200 OK webhook delivery)
|
||||
|
||||
---
|
||||
|
||||
## Observations
|
||||
|
||||
| Observation | Status |
|
||||
|---|---|
|
||||
| Backend returns full segments when GPU is warm | ✅ Confirmed |
|
||||
| Webhook fires once per job with full payload | ✅ Confirmed |
|
||||
| `json.job_id` (not `json.id`) is the correct response field | ✅ Confirmed |
|
||||
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
|
||||
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | ✅ Already done |
|
||||
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |
|
||||
|
||||
---
|
||||
|
||||
## What Was NOT Touched (per user request)
|
||||
|
||||
- `whisper-rtx2080` Dockerfile, Rust source, or any backend configuration
|
||||
- Any backend API behaviour
|
||||
|
||||
---
|
||||
|
||||
## Next Steps (Backend Side — User Handling Separately)
|
||||
|
||||
- Monitor first-job-after-restart 0-segment issue
|
||||
- Optionally: warm up GPU on container start with a small silent WAV
|
||||
- Consider retrying a job if `segments == []` and `duration_secs > 5`
|
||||
Reference in New Issue
Block a user