Commit Graph

12 Commits

Author SHA1 Message Date
mozempk
78c6fab81b fix: remove duplicate old test suite and fix step 9 pipe/heredoc bug
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 16s
Step 9 used 'echo $RESULT | python3 - << HEREDOC' which is a bash gotcha:
the heredoc takes over stdin (as the script source), so the pipe is
silently ignored and sys.stdin.read() returns an empty string → JSONDecodeError.

Fix: write RESULT to a temp file and pass it as sys.argv[1] to the script.

Also removed the old buggy test suite that was accidentally left appended
at lines 181-327 (had language=auto, ['id'] field, wrong DELETE assertion).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 12:13:15 +02:00
mozempk
fd8d4deefb fix: GPU warmup on startup + fix test_all.sh + document cold-GPU finding
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m39s
GPU warmup (src/transcriber.rs):
  After creating WhisperState, run a 1s silent inference pass in load().
  CUDA JIT-compiles device kernels on the first whisper_full_with_state call.
  On a cold GPU this compilation disrupts the decode pipeline mid-inference,
  returning 0 segments in ~0.5s. The warmup forces all kernel compilation at
  startup so the first real job runs on fully compiled kernels.

test_all.sh:
  - Fix submit response field: 'id' → 'job_id' (was breaking all downstream steps)
  - Remove language=auto: not a valid ISO 639-1 code; omit field for auto-detect
  - Make BASE and AUDIO configurable via env vars (WHISPER_BASE_URL, TEST_AUDIO)
  - Fix DELETE assertion: completed jobs return 409 Conflict, not 204
  - Add explicit zero-segments failure check in quality inspection (step 9)
  - Add progress reporting to poll loop

docs/FINDINGS.md + KNOWLEDGE.md:
  Document cold GPU warmup issue, root cause, and fix.
  Document language=auto as invalid API usage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:57:30 +02:00
mozempk
d5a88d1866 fix: create WhisperState once at load time, reuse across all chunks
Some checks failed
Build & Push Docker Image / build-and-push (push) Has been cancelled
Previously create_state() was called for every 60s audio chunk, triggering
whisper_init_state() each time. This allocates ~700 MB of GPU compute buffers
(KV caches, CUDA workspace) and re-initialises the CUDA backend per chunk.

For a 101-minute recording (102 chunks), this caused 102 GPU re-initialisations
and VRAM allocation cycles. Under VRAM pressure from concurrent processes,
CUDA allocation failures occurred silently — whisper returned language
detection results but 0 segments.

Fix: create WhisperState once in Transcriber::load() and reuse it for every
transcription call. GPU memory is stable; no_context=true prevents KV-cache
contamination between chunks.

WhisperState is Send+Sync (explicitly declared in whisper-rs) and holds its
own Arc<WhisperInnerContext>, so the model weights stay alive even after
WhisperContext is dropped.
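
A minimal sketch of the "allocate once, reuse per chunk" pattern described above.
FakeState is a hypothetical stand-in for whisper-rs's WhisperState so the example
compiles without the crate, and the Mutex around it is an assumption about how
the state is shared, not something this commit confirms.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Counts how many times the (expensive) state initialisation ran.
static STATE_INITS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for WhisperState: creating it models the
/// costly GPU buffer allocation.
struct FakeState {
    calls: usize,
}

impl FakeState {
    fn new() -> Self {
        STATE_INITS.fetch_add(1, Ordering::SeqCst);
        FakeState { calls: 0 }
    }

    /// Pretend inference: returns the number of samples it was given.
    fn run(&mut self, chunk: &[f32]) -> usize {
        self.calls += 1;
        chunk.len()
    }
}

struct Transcriber {
    // Created once in load(); every chunk borrows it through the lock.
    state: Mutex<FakeState>,
}

impl Transcriber {
    fn load() -> Self {
        Transcriber { state: Mutex::new(FakeState::new()) }
    }

    fn transcribe_chunk(&self, chunk: &[f32]) -> usize {
        let mut state = self.state.lock().unwrap();
        state.run(chunk)
    }
}
```

With this shape, processing 102 chunks triggers exactly one initialisation
instead of 102.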

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 11:51:33 +02:00
mozempk
c25e8e7ffb docs: add ARCHITECTURE, CODE_STYLE, FINDINGS, USAGE under docs/
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 17s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
v0.0.1
2026-05-06 10:17:53 +02:00
mozempk
8fc45ee86f docs: add KNOWLEDGE.md with lessons learned and improvement notes
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 18s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 10:09:27 +02:00
mozempk
6327ffc09d fix: use set_language(None) for auto-detect instead of set_detect_language(true)
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m43s
detect_language=true causes whisper.cpp to return 0 immediately after
language detection without running the decoder. From the whisper.cpp source:
  if (params.detect_language) { return 0; }
Setting language to None (null) triggers auto-detection AND transcription.

This was the root cause of 0 segments on all jobs without explicit language.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 02:58:35 +02:00
mozempk
ef9c04b070 fix: trim trailing silence from each chunk before whisper
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m44s
Whisper hallucinates filler tokens (Bye., Thank you., etc.) into
end-of-chunk silence. This is especially visible on the final chunk
of long audio where the outro silence triggers a 10× repetition loop.

Fix: after slicing each PCM chunk, scan backwards to find the last
sample above −35 dB, then keep 0.5 s of padding and truncate.
Applied to every chunk, not just the last — any chunk ending in a long
silence period gets the same protection.

Constants match the silencedetect filter already used for chunking:
  THRESHOLD = 0.0178  (−35 dB)
  PADDING   = 8000 samples (0.5 s at 16 kHz)
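
The backward scan described above can be sketched as follows. The function name
is hypothetical, and leaving an entirely silent chunk untouched is an assumption;
the commit message does not say how that case is handled.

```rust
/// Linear amplitude for -35 dB: 10^(-35/20) ≈ 0.0178.
const THRESHOLD: f32 = 0.0178;
/// Samples kept after the last loud sample: 0.5 s at 16 kHz.
const PADDING: usize = 8000;

/// Returns the chunk with trailing silence cut off, keeping at most
/// PADDING samples after the last sample whose magnitude exceeds THRESHOLD.
fn trim_trailing_silence(pcm: &[f32]) -> &[f32] {
    // rposition scans from the end of the slice towards the start.
    match pcm.iter().rposition(|s| s.abs() > THRESHOLD) {
        Some(last_loud) => {
            let end = (last_loud + 1 + PADDING).min(pcm.len());
            &pcm[..end]
        }
        // Assumption: a chunk with no sample above the threshold is returned as-is.
        None => pcm,
    }
}
```

Returning a sub-slice avoids copying the PCM data; the caller can pass the
shorter slice straight to the transcription call.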

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 02:13:20 +02:00
mozempk
35e7ea8d28 feat: progress reporting with chunk context + live job persistence
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m38s
- ProgressEvent::Progress now carries chunk index and total count
- SsePayload::Progress gains chunk / chunks_total fields
  → SSE clients can show 'chunk N/51' instead of bare percent
- process_job persists job.progress to storage at each chunk boundary
  → GET /jobs/:id now shows live progress (not stuck at 0)
- Emits Progress event at chunk START (boundary event), not just on
  whisper's internal callback
- entropy_thold raised to 3.5 (catches medium-phrase loops; triggers
  whisper's own temperature-retry instead of silent repetition)
- no_speech_thold removed (the whisper.cpp source marks it
  '// TODO: not implemented', so setting it was a no-op)
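
A rough sketch of a progress event carrying chunk context. Only the chunk and
chunks_total fields come from this commit; the percent field name, the 1-based
chunk numbering, and progress_label are assumptions for illustration.

```rust
/// Illustrative shape of a progress event with chunk context.
#[derive(Debug, Clone, PartialEq)]
enum ProgressEvent {
    Progress {
        percent: u8,         // assumed field name
        chunk: usize,        // assumed to be 1-based for display
        chunks_total: usize,
    },
}

/// Hypothetical helper: renders the label an SSE client might show.
fn progress_label(ev: &ProgressEvent) -> String {
    match ev {
        ProgressEvent::Progress { percent, chunk, chunks_total } => {
            format!("chunk {}/{} ({}%)", chunk, chunks_total, percent)
        }
    }
}
```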

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 02:00:46 +02:00
mozempk
fb8556441c feat: silence-based audio chunking before transcription
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m40s
Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to
find silence midpoints. Build chunk boundaries every 180s, snapping to
the nearest silence midpoint within ±30s (fallback: hard cut).
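
The boundary rule above can be sketched like this. It assumes targets sit on a
fixed 180s grid (180, 360, ...) rather than being measured from the previous
cut, which the commit message does not specify; all names are hypothetical.

```rust
const TARGET_CHUNK_SECS: f64 = 180.0;
const SNAP_WINDOW_SECS: f64 = 30.0;

/// Returns cut points in seconds, strictly inside (0, duration).
/// Each target on the 180 s grid snaps to the nearest silence midpoint
/// within ±30 s; if none exists, the target itself is used (hard cut).
fn chunk_boundaries(duration: f64, silence_midpoints: &[f64]) -> Vec<f64> {
    let mut cuts = Vec::new();
    let mut target = TARGET_CHUNK_SECS;
    while target < duration {
        let cut = silence_midpoints
            .iter()
            .copied()
            .filter(|m| (m - target).abs() <= SNAP_WINDOW_SECS && *m < duration)
            .min_by(|a, b| (a - target).abs().total_cmp(&(b - target).abs()))
            .unwrap_or(target); // fallback: hard cut at the grid position
        cuts.push(cut);
        target += TARGET_CHUNK_SECS;
    }
    cuts
}
```

Because the snap window (±30s) is smaller than half the grid spacing, cuts
produced this way are always in increasing order.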

Each chunk is transcribed independently with its own CUDA context;
timestamps are shifted by chunk_start before merging. Progress is
scaled per-chunk across the overall 0-100% job range.
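
A minimal sketch of the timestamp shift and per-chunk progress scaling. The
Segment type, the millisecond unit, and the equal weighting of chunks are
assumptions; the real code may weight chunks by duration.

```rust
/// Hypothetical segment type; timestamps are relative to the chunk start.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_ms: i64,
    end_ms: i64,
    text: String,
}

/// Moves chunk-relative timestamps onto the timeline of the full recording.
fn shift_segments(mut segments: Vec<Segment>, chunk_start_ms: i64) -> Vec<Segment> {
    for seg in &mut segments {
        seg.start_ms += chunk_start_ms;
        seg.end_ms += chunk_start_ms;
    }
    segments
}

/// Maps a chunk-local percentage (0-100) to the job-wide 0-100 range,
/// assuming every chunk contributes an equal share.
fn overall_progress(chunk_index: usize, chunks_total: usize, chunk_percent: u8) -> u8 {
    let done = chunk_index * 100 + chunk_percent.min(100) as usize;
    (done / chunks_total.max(1)).min(100) as u8
}
```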

Result on 101-min YouTube audio (34 chunks, 1714 silence points):
- Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations)
- After:    x4 max consecutive run, all repetitions verified genuine

Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)>
instead of a raw ProgressTx so each chunk can independently scale its
contribution to the job's broadcast channel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 01:08:06 +02:00
mozempk
9a36000062 fix: disable previous-text conditioning to prevent end-of-file loops
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m41s
set_no_context(true) stops whisper from feeding its own output back as
context for the next segment. Without this, at audio end the model
hallucinates a phrase ('All right.', 'So I think we're going to wrap up.')
and repeats it hundreds of times in a tight loop.

Observed: 759x 'All right.' + 750x 'So I think we're going to wrap up.'
in the final 8 seconds of a 101min YouTube conference recording.
After fix: clean termination with no repetition loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 00:14:56 +02:00
mozempk
2176206afe fix: restore correct no_speech_thold and BeamSearch defaults
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m28s
- Revert no_speech_thold from 0.0 back to 0.6 (whisper.cpp default)
  0.0 means 'suppress if p(no-speech) > 0.0' which fires on every segment,
  silently producing 0-segment output for all real-world audio
- Revert SamplingStrategy from Greedy{best_of:5} back to BeamSearch{beam_size:5}
  Greedy with temperature=0.0 and best_of>1 is undefined in whisper.cpp
- Restore entropy_thold=2.4 and logprob_thold=-1.0 defaults
- Keep flash_attn disabled (was causing silent failures on conference audio)
- Tested: 59 segments on 5 min YouTube conference audio, 29 on repair audio

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 23:50:12 +02:00
mozempk
16cb6ca661 feat: GPU-accelerated Whisper API for RTX 2080 (sm_75)
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 11m13s
- Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI)
- Async job queue with SSE progress streaming
- Webhook delivery with 5x exponential backoff
- Disk-persisted job state (survives restarts)
- Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank
- CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS
- Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR
- Gitea Actions CI: build + push to git.sal.giize.com registry
- Multi-stage Dockerfile with customizable CUDA_VERSION ARG
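
One way to read "5x exponential backoff" for webhook delivery is five attempts
with a doubling delay between them. The sketch below assumes that reading and a
1s base delay; neither is confirmed by this commit, and backoff_delay is a
hypothetical helper.

```rust
use std::time::Duration;

/// Assumed meaning of "5x": at most five delivery attempts.
const MAX_ATTEMPTS: u32 = 5;
/// Assumed base delay; the commit message does not state one.
const BASE_DELAY: Duration = Duration::from_secs(1);

/// Given the number of failed attempts so far (1-based), returns how long
/// to wait before the next attempt, or None once attempts are exhausted.
fn backoff_delay(failed_attempts: u32) -> Option<Duration> {
    if failed_attempts == 0 || failed_attempts >= MAX_ATTEMPTS {
        return None;
    }
    // 1 s, 2 s, 4 s, 8 s for failed attempts 1 through 4.
    Some(BASE_DELAY * 2u32.pow(failed_attempts - 1))
}
```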

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 22:47:24 +02:00