tonemark/backend.issue.md

# Whisper Backend Investigation — Observations & Findings

## Summary

The `whisper-rtx2080` backend **does work correctly** when the GPU is warm.
The empty-segments problem is a **transient cold-GPU issue**, not a code bug.

---

## What Was Tried

### 1. Direct API test — 30 s WAV (warm GPU)

```bash
curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_30s.wav" \
  -F "task=transcribe" \
  -F "language=en"
```

**Result:** 6 segments returned in ~25 s. Backend works.

---

### 2. Direct API test — 717 s prepared WAV (warm GPU)

```bash
curl -s -X POST http://localhost:8091/jobs \
  -F "audio=@/tmp/test_prepared.wav" \
  -F "task=transcribe"
```

**Result:** 340 segments, ~47 s total (~15× realtime for RTX 2080). Backend works.
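
The realtime figure can be re-derived from the two direct tests. A minimal sketch; `realtimeFactor` is a hypothetical helper name, and the inputs are the durations reported in tests 1 and 2:

```typescript
// Realtime factor: seconds of audio transcribed per second of wall-clock time.
function realtimeFactor(audioSecs: number, wallSecs: number): number {
  if (wallSecs <= 0) throw new RangeError("wallSecs must be positive");
  return audioSecs / wallSecs;
}

const longRun = realtimeFactor(717, 47); // test 2: 717 s WAV in ~47 s
const shortRun = realtimeFactor(30, 25); // test 1: 30 s WAV in ~25 s
console.log(longRun.toFixed(1), shortRun.toFixed(1)); // prints "15.3 1.2"
```

The short file runs at only ~1.2× realtime, which may reflect fixed per-job overhead dominating a 30 s input; that has not been measured.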

---

### 3. End-to-end PWA submission — YouTube URL

Submitted `https://www.youtube.com/watch?v=KQDVDtklf34` through the PWA.

- Job `d6178677` was submitted to whisper (confirmed via Docker logs)
- Language detection fired (confirmed via logs)
- Job completed in ~30 s
- Webhook received with HTTP 200 (confirmed via logs)
- **BUT** `segments_json = "[]"` stored in the DB

This was a **cold-GPU run**, made right after a container restart.

---

### 4. GPU architecture mismatch investigation

- `docker info` reported `RTX 3060 (sm_86)` inside the container
- `Dockerfile` compiled with `CMAKE_CUDA_ARCHITECTURES=75` (RTX 2080 / sm_75)
- Hypothesis: wrong binary → silent 0-output
- **User confirmed this is a Docker reporting error — GPU is actually RTX 2080 (sm_75)**
- Reverted any Dockerfile changes back to `CMAKE_CUDA_ARCHITECTURES=75`
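
The mismatch hypothesis comes down to comparing the GPU's compute capability with the architectures the binary was built for. A minimal sketch of that comparison; both function names are hypothetical, and the capability string is assumed to come from something like `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` on a recent driver:

```typescript
// Convert a compute capability such as "7.5" into the CMake-style arch "75".
function capabilityToArch(computeCap: string): string {
  const match = /^(\d+)\.(\d+)$/.exec(computeCap.trim());
  if (match === null) {
    throw new Error(`unrecognised compute capability: ${computeCap}`);
  }
  return `${match[1]}${match[2]}`;
}

// CMAKE_CUDA_ARCHITECTURES takes a semicolon-separated list, e.g. "75;86".
// Entries may carry a "-real" or "-virtual" suffix, which is ignored here.
function binarySupportsGpu(cmakeArchs: string, computeCap: string): boolean {
  const wanted = capabilityToArch(computeCap);
  const built = cmakeArchs
    .split(";")
    .map((a) => a.trim().replace(/-(real|virtual)$/, ""));
  return built.indexOf(wanted) !== -1;
}

console.log(binarySupportsGpu("75", "7.5")); // RTX 2080 against the current build
console.log(binarySupportsGpu("75", "8.6")); // what the misreported RTX 3060 would imply
```

This exact-match check is deliberately strict: it ignores PTX forward compatibility, so a `false` result means "not built natively for this GPU", not necessarily "will not run".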

---

### 5. Source code analysis — `transcriber.rs`

Key findings from reading the Rust source:

| Setting | Value | Effect |
|---|---|---|
| `set_language(None)` | ✅ Correct | Auto-detects language, returns segments |
| `set_detect_language(true)` | ❌ Wrong | Returns 0 segments (early exit) |
| `entropy_thold` | 3.5 (vs default 2.4) | Catches medium-phrase hallucination loops |
| Flash attention | Disabled (commented out) | Was causing 0-segment output on some audio |

The code uses `set_language(None)`, which is correct.
Flash attention was already disabled. While it was enabled, it caused 0-segment output on some audio, which likely explains many of the earlier 0-segment reports.

---

### 6. Webhook behavior

- The backend fires the webhook **exactly once**, after ALL internal 60 s silence-based chunks complete.
- We submit one file → backend chunks internally → one webhook with the full `WhisperJob` object.
- Webhook payload includes: `{ id, status, language, segments, duration_secs, error, … }`
- Our `POST /api/webhook/[jobId]` route handles this correctly.
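
A defensive parse of the webhook body could look like the sketch below. The field names come from the payload listed above; the field types, the nullable `language` and `error`, and the name `parseWhisperJob` are assumptions, and segments are left as `unknown[]` because their shape is not documented here:

```typescript
// Shape of the webhook payload; types are assumed, not taken from the backend.
interface WhisperJob {
  id: string;
  status: string;
  language: string | null;
  segments: unknown[];
  duration_secs: number;
  error: string | null;
}

// Returns a typed job, or null if a required field is missing or mistyped.
function parseWhisperJob(body: unknown): WhisperJob | null {
  if (typeof body !== "object" || body === null) return null;
  const { id, status, language, segments, duration_secs, error } =
    body as Record<string, unknown>;
  if (typeof id !== "string" || typeof status !== "string") return null;
  if (!Array.isArray(segments)) return null;
  if (typeof duration_secs !== "number") return null;
  return {
    id,
    status,
    language: typeof language === "string" ? language : null,
    segments,
    duration_secs,
    error: typeof error === "string" ? error : null,
  };
}
```

An empty `segments` array still parses successfully, so the caller can tell "valid payload, no segments" apart from "malformed payload".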

---

### 7. Captions fast-path (yt-dlp VTT)

When yt-dlp finds YouTube auto-generated captions (VTT), the pipeline **skips Whisper entirely**
and parses the VTT. If VTT parsing returns `[]` (edge case with certain caption formats), the job
completes with empty segments — no whisper involvement.

This can look identical to a whisper 0-segments failure but is a completely different code path.
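
Recording which path produced the segments keeps the two failure modes apart. A minimal sketch; the `SegmentSource` tag and the `diagnose` helper are hypothetical and would have to be stored by the pipeline alongside the job:

```typescript
type SegmentSource = "captions" | "whisper"; // hypothetical tag stored per job
type Diagnosis = "ok" | "vtt-parse-empty" | "whisper-empty";

// Classify a finished job by the path that produced it and its segment count.
function diagnose(source: SegmentSource, segmentCount: number): Diagnosis {
  if (segmentCount > 0) return "ok";
  return source === "captions" ? "vtt-parse-empty" : "whisper-empty";
}

console.log(diagnose("captions", 0)); // prints "vtt-parse-empty"
console.log(diagnose("whisper", 0));  // prints "whisper-empty"
```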

---

## Root Cause of Empty Segments

**Cold GPU after container restart.**

Right after the Docker container starts and loads the model, the first 1–2 jobs sometimes complete
in ~0.5 s with 0 segments, which is physically impossible for real audio transcription.
Once the GPU has warmed up (the first successful transcriptions above took ~25–47 s), all subsequent jobs return full segments.

This is a transient state that resolves on its own. It is **not** caused by:
- Wrong CUDA architecture (GPU is RTX 2080, binary is sm_75 — correct)
- `set_detect_language` (not used)
- Audio preparation issues (direct tests with our prepared WAV return 340 segments)
- Webhook not firing (logs confirmed 200 OK webhook delivery)
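
The "physically impossible" timing can be turned into a check. A minimal sketch, assuming the caller measures elapsed time itself; the ceiling of 50× realtime is an assumed margin above the ~15× seen in test 2, not a measured limit, and `looksLikeColdGpuResult` is a hypothetical name:

```typescript
// Assumed ceiling: roughly 3x the ~15x realtime observed on a warm GPU.
const MAX_PLAUSIBLE_REALTIME_FACTOR = 50;

// True when a job returned no segments faster than the GPU could plausibly
// have processed the audio, i.e. the cold-GPU signature described above.
function looksLikeColdGpuResult(
  segmentCount: number,
  audioSecs: number,
  elapsedSecs: number,
): boolean {
  if (segmentCount > 0 || audioSecs <= 0) return false;
  const minPlausibleSecs = audioSecs / MAX_PLAUSIBLE_REALTIME_FACTOR;
  return elapsedSecs < minPlausibleSecs;
}

console.log(looksLikeColdGpuResult(0, 717, 0.5)); // prints true
console.log(looksLikeColdGpuResult(0, 717, 47));  // prints false
```

A 0-segment job that took a normal amount of time is not flagged, so this only catches the fast-exit case.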

---

## Observations

| Observation | Status |
|---|---|
| Backend returns full segments when GPU is warm | ✅ Confirmed |
| Webhook fires once per job with full payload | ✅ Confirmed |
| `json.job_id` (not `json.id`) is the correct response field | ✅ Confirmed |
| Cold-GPU produces 0 segments in ~0.5 s | ✅ Confirmed |
| Flash attention disabled (commented out in `transcriber.rs`) prevents 0-segment edge cases | ✅ Already done |
| VTT fast-path can produce empty segments if VTT parse fails | ⚠️ Edge case, not investigated further |

---

## What Was NOT Touched (per user request)

- `whisper-rtx2080` Dockerfile, Rust source, or any backend configuration
- Any backend API behaviour

---

## Next Steps (Backend Side — User Handling Separately)

- Monitor first-job-after-restart 0-segment issue
- Optionally: warm up GPU on container start with a small silent WAV
- Consider retrying a job if `segments == []` and `duration_secs > 5`
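
The last two items can be sketched as follows. `buildSilentWav` produces a 16-bit mono PCM WAV of silence that could be posted as a warm-up job, and `shouldRetry` encodes the rule above with an attempt cap. Both names, the 16 kHz default, and the cap of 2 attempts are assumptions; whether a silent file is enough to warm the GPU is untested:

```typescript
// Build an in-memory WAV (16-bit mono PCM) containing only silence.
function buildSilentWav(seconds: number, sampleRate = 16000): Uint8Array {
  const numSamples = Math.floor(seconds * sampleRate);
  const dataBytes = numSamples * 2; // 2 bytes per 16-bit sample
  const buf = new ArrayBuffer(44 + dataBytes); // zero-filled, so samples are silent
  const view = new DataView(buf);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);      // data chunk size
  return new Uint8Array(buf);
}

// Retry only empty results for audio longer than 5 s, up to maxAttempts.
function shouldRetry(
  segmentCount: number,
  durationSecs: number,
  attempt: number,
  maxAttempts = 2,
): boolean {
  return segmentCount === 0 && durationSecs > 5 && attempt < maxAttempts;
}
```

The attempt cap matters because the VTT fast-path and genuinely silent audio can also yield `[]`, and those would otherwise retry forever.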