tonemark/backend.issue.md
Giancarmine Salucci 13a96b6efa
Initial commit: Tonemark PWA
Tonemark is a SvelteKit PWA for transcribing YouTube videos, audio
and video files, and microphone recordings using a local Whisper backend.

Features:
- Dark glassmorphic UI with electric-lime accent (5 switchable themes)
- Rail nav (desktop) / tab bar (mobile) layout
- Drop zone, YouTube URL input, and live audio recording inputs
- Audio mode waveform cards (none / standard / aggressive / auto)
- Real-time transcription progress with animated waveform
- Job queue with SSE streaming updates
- Push notifications on job completion
- PWA with native SvelteKit service worker
- SRT / TXT / MD / JSON transcript downloads

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 16:41:25 +02:00

Whisper Backend Investigation — Observations & Findings

Summary

The whisper-rtx2080 backend does work correctly when the GPU is warm. The empty-segments problem is a transient cold-GPU issue, not a code bug.


What Was Tried

1. Direct API test — 30 s WAV (warm GPU)

curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_30s.wav" \
  -F "task=transcribe" \
  -F "language=en"

Result: 6 segments returned in ~25 s. Backend works.


2. Direct API test — 717 s prepared WAV (warm GPU)

curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_prepared.wav" \
  -F "task=transcribe"

Result: 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.


3. End-to-end PWA submission — YouTube URL

Submitted https://www.youtube.com/watch?v=KQDVDtklf34 through the PWA.

  • Job d6178677 was submitted to whisper (confirmed via Docker logs)
  • Language detection fired (confirmed via logs)
  • Job completed in ~30 s
  • Webhook received with HTTP 200 (confirmed via logs)
  • BUT segments_json = "[]" stored in the DB

This was a cold-GPU run right after container restart.


4. GPU architecture mismatch investigation

  • docker info reported RTX 3060 (sm_86) inside the container
  • Dockerfile compiled with CMAKE_CUDA_ARCHITECTURES=75 (RTX 2080 / sm_75)
  • Hypothesis: wrong binary → silent 0-output
  • User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)
  • Reverted any Dockerfile changes back to CMAKE_CUDA_ARCHITECTURES=75

5. Source code analysis — transcriber.rs

Key findings from reading the Rust source:

| Setting | Value | Effect |
| --- | --- | --- |
| set_language(None) | Correct | Auto-detects language, returns segments |
| set_detect_language(true) | Wrong | Returns 0 segments (early exit) |
| entropy_thold | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |

The code uses set_language(None) which is correct.
Flash attention was already disabled — this alone explains many of the prior 0-segment reports.


6. Webhook behavior

  • The backend fires the webhook exactly once, after ALL internal 60 s silence-based chunks complete.
  • We submit one file → backend chunks internally → one webhook with the full WhisperJob object.
  • Webhook payload includes: { id, status, language, segments, duration_secs, error, … }
  • Our POST /api/webhook/[jobId] route handles this correctly.
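
A minimal sketch of how the webhook body could be validated before it is stored. The field names come from the payload listed above; the field types, the `WhisperJobPayload` interface name, and the `parseWebhookPayload` helper are assumptions for illustration, not the actual route code. Segment contents are not described in these notes, so they stay `unknown`.

```typescript
// Hypothetical shape, limited to the fields named in the notes.
// Types are assumed; the real payload carries further fields.
interface WhisperJobPayload {
  id: string;
  status: string;
  language: string | null;
  segments: unknown[];
  duration_secs: number;
  error: string | null;
}

// Narrow an untrusted JSON body to WhisperJobPayload, or return null
// when a required field is missing or has the wrong type.
function parseWebhookPayload(body: unknown): WhisperJobPayload | null {
  if (typeof body !== "object" || body === null) return null;
  const { id, status, language, segments, duration_secs, error } =
    body as Record<string, unknown>;
  if (typeof id !== "string" || typeof status !== "string") return null;
  if (!Array.isArray(segments)) return null;
  if (typeof duration_secs !== "number") return null;
  return {
    id,
    status,
    language: typeof language === "string" ? language : null,
    segments,
    duration_secs,
    error: typeof error === "string" ? error : null,
  };
}
```

Rejecting a body without a `segments` array keeps a malformed delivery distinguishable from a well-formed one that carries `[]`.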

7. Captions fast-path (yt-dlp VTT)

When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline skips Whisper entirely and parses the VTT. If VTT parsing returns [] (edge case with certain caption formats), the job completes with empty segments — no whisper involvement.

This can look identical to a whisper 0-segments failure but is a completely different code path.
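
One way to keep the two cases apart in logs is to tag each result with the path that produced it. This is a sketch only: `SegmentSource`, `PipelineResult`, and `describeEmptyResult` are hypothetical names, and the current pipeline may record this differently or not at all.

```typescript
// Which pipeline path produced the segments (illustrative names).
type SegmentSource = "captions" | "whisper";

interface PipelineResult {
  source: SegmentSource;
  segments: unknown[];
}

// Return a log message for an empty result, or null when segments exist.
function describeEmptyResult(result: PipelineResult): string | null {
  if (result.segments.length > 0) return null;
  return result.source === "captions"
    ? "VTT fast-path parsed zero cues (Whisper was never called)"
    : "Whisper returned zero segments";
}
```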


Root Cause of Empty Segments

Cold GPU after container restart.

Right after the Docker container starts and loads the model, the first 1 to 2 jobs sometimes complete in ~0.5 s with 0 segments, which is physically impossible for real audio transcription. After the GPU warms up (first successful transcription takes ~25 to 47 s), all subsequent jobs return full segments.

This is a transient state that resolves on its own. It is not caused by:

  • Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
  • set_detect_language (not used)
  • Audio preparation issues (direct tests with our prepared WAV return 340 segments)
  • Webhook not firing (logs confirmed 200 OK webhook delivery)
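
Because the state clears on its own, a bounded retry on the PWA side is one way to absorb it. The sketch below encodes the condition listed under Next Steps (`segments == []` and `duration_secs > 5`). The `CompletedJob` interface, the `shouldRetry` helper, and the retry cap are assumptions for illustration.

```typescript
// Only the two fields the retry rule needs.
interface CompletedJob {
  segments: unknown[];
  duration_secs: number;
}

const MIN_AUDIO_SECS = 5; // threshold taken from the notes
const MAX_RETRIES = 2;    // assumed cap so a genuinely silent file does not retry forever

// True when a finished job has no segments despite non-trivial audio
// and the retry budget is not yet used up.
function shouldRetry(job: CompletedJob, attemptsSoFar: number): boolean {
  const emptyForRealAudio =
    job.segments.length === 0 && job.duration_secs > MIN_AUDIO_SECS;
  return emptyForRealAudio && attemptsSoFar < MAX_RETRIES;
}
```

The rule should only be applied to jobs that actually went through Whisper; an empty result from the captions fast-path would not be fixed by resubmitting the same VTT.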

Observations

| Observation | Status |
| --- | --- |
| Backend returns full segments when GPU is warm | Confirmed |
| Webhook fires once per job with full payload | Confirmed |
| json.job_id (not json.id) is the correct response field | Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | Confirmed |
| Flash attention disabled in Dockerfile prevents 0-segment edge cases | Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |

What Was NOT Touched (per user request)

  • whisper-rtx2080 Dockerfile, Rust source, or any backend configuration
  • Any backend API behaviour

Next Steps (Backend Side — User Handling Separately)

  • Monitor first-job-after-restart 0-segment issue
  • Optionally: warm up GPU on container start with a small silent WAV
  • Consider retrying a job if segments == [] and duration_secs > 5
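
For the warm-up idea, a small silent WAV can be generated in memory rather than shipped as a file. This is a sketch under assumptions: `buildSilentWav` is a hypothetical helper, and 16 kHz mono 16-bit PCM is chosen because Whisper models consume 16 kHz audio; whether the backend requires that format, and whether pure silence is enough to warm the GPU, has not been tested here.

```typescript
// Build a mono 16-bit PCM WAV that contains only silence.
function buildSilentWav(seconds: number, sampleRate = 16000): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = Math.floor(seconds * sampleRate) * bytesPerSample;
  // ArrayBuffer is zero-filled, so every sample is already silent.
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);  // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);            // fmt chunk size
  view.setUint16(20, 1, true);             // audio format: PCM
  view.setUint16(22, 1, true);             // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  return new Uint8Array(buffer);
}
```

The resulting bytes could be posted in the `audio` form field the same way as in the direct API tests above. A warm-up job is expected to return no segments, so its result should be discarded rather than checked.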