4 Commits

Author SHA1 Message Date
mozempk
fb8556441c feat: silence-based audio chunking before transcription
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m40s
Run ffmpeg silencedetect (n=-35dB, d=0.4s) on the original audio to
find silence midpoints. Build chunk boundaries every 180s, snapping to
the nearest silence midpoint within ±30s (fallback: hard cut).
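
A minimal sketch of the boundary-snapping step described above, not the code from this commit. It assumes the silence midpoints have already been parsed from the silencedetect output into a sorted list of seconds, and that the ideal cuts sit on a fixed 180s grid; the function name and signature are hypothetical.

```rust
/// Build chunk boundaries (in seconds) for audio of length `duration`.
/// `silences` holds silence midpoints in seconds (hypothetical input shape).
/// `target` is the ideal chunk length, `window` the snapping tolerance.
fn chunk_boundaries(duration: f64, silences: &[f64], target: f64, window: f64) -> Vec<f64> {
    let mut bounds = vec![0.0];
    let mut ideal = target;
    while ideal < duration {
        // Nearest silence midpoint within ±window of the ideal cut, if any.
        let snapped = silences
            .iter()
            .copied()
            .filter(|s| (s - ideal).abs() <= window)
            .min_by(|a, b| (a - ideal).abs().total_cmp(&(b - ideal).abs()));
        // Fallback: hard cut at the ideal position.
        let cut = snapped.unwrap_or(ideal);
        let last = *bounds.last().unwrap();
        // Keep boundaries strictly increasing and inside the audio.
        if cut > last && cut < duration {
            bounds.push(cut);
        }
        ideal += target;
    }
    bounds.push(duration);
    bounds
}
```

With `target = 180.0` and `window = 30.0`, a silence at 170s pulls the first cut back by 10s, while an ideal cut at 360s with no silence between 330s and 390s stays a hard cut.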

Each chunk is transcribed independently with its own CUDA context;
timestamps are shifted by chunk_start before merging. Progress is
scaled per-chunk across the overall 0-100% job range.
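
The timestamp shift before merging could look roughly like this. The `Segment` struct, the millisecond units, and the `merge_chunks` name are assumptions made for illustration; the commit only states that timestamps are shifted by chunk_start.

```rust
/// One transcribed segment with chunk-relative timestamps (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_ms: i64,
    end_ms: i64,
    text: String,
}

/// Shift each chunk's segments by that chunk's start offset and
/// concatenate them in chunk order.
fn merge_chunks(chunks: Vec<(i64, Vec<Segment>)>) -> Vec<Segment> {
    let mut merged = Vec::new();
    for (chunk_start_ms, segments) in chunks {
        for seg in segments {
            merged.push(Segment {
                start_ms: seg.start_ms + chunk_start_ms,
                end_ms: seg.end_ms + chunk_start_ms,
                text: seg.text,
            });
        }
    }
    merged
}
```

Because every chunk is transcribed from its own time zero, the shift is the only step needed to place segments on the timeline of the original audio.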

Result on 101-min YouTube audio (34 chunks, 1714 silence points):
- Previous: x1025 'Yeah.' + x1008 sentence-length loops (hallucinations)
- After:    x4 max consecutive run, all repetitions verified genuine
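
The commit does not say how the consecutive runs were counted. One plausible way to measure the "max consecutive run" figure, sketched with a hypothetical helper over the segment texts:

```rust
/// Length of the longest run of identical consecutive segment texts.
/// Texts are compared after trimming surrounding whitespace.
fn max_consecutive_run(texts: &[&str]) -> usize {
    let mut best = 0;
    let mut current = 0;
    let mut prev: Option<&str> = None;
    for &t in texts {
        let t = t.trim();
        if Some(t) == prev {
            current += 1;
        } else {
            current = 1;
            prev = Some(t);
        }
        best = best.max(current);
    }
    best
}
```

A hallucination loop shows up as a run in the hundreds, whereas genuine speech rarely repeats the same segment more than a few times in a row.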

Also refactored TranscribeRequest to carry on_progress: Box<dyn Fn(u8)>
instead of a raw ProgressTx so each chunk can independently scale its
contribution to the job's broadcast channel.
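
A sketch of how a boxed callback lets each chunk scale its own progress. The single-field `TranscribeRequest` here is a stand-in (the real struct presumably carries more), `chunk_request` is a hypothetical helper, and a std `mpsc` channel replaces the job's broadcast channel to keep the example dependency-free.

```rust
use std::sync::mpsc::Sender;

/// Stand-in for the refactored request: only the progress callback is shown.
struct TranscribeRequest {
    on_progress: Box<dyn Fn(u8)>,
}

/// Build a request for chunk `index` of `total` (total must be > 0) whose
/// callback maps chunk-local 0-100% onto this chunk's slice of the job range.
fn chunk_request(tx: Sender<u8>, index: usize, total: usize) -> TranscribeRequest {
    TranscribeRequest {
        on_progress: Box::new(move |chunk_pct: u8| {
            let pct = chunk_pct.min(100) as usize;
            let overall = (index * 100 + pct) / total;
            // Ignore send errors: the receiver may already be gone.
            let _ = tx.send(overall as u8);
        }),
    }
}
```

The transcriber only ever reports 0-100% for its own chunk; the closure owns the knowledge of where that chunk sits in the overall job, which a raw channel handle could not express.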

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 01:08:06 +02:00
mozempk
9a36000062 fix: disable previous-text conditioning to prevent end-of-file loops
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m41s
set_no_context(true) stops whisper from feeding its own output back as
context for the next segment. Without this, at audio end the model
hallucinates a phrase ('All right.', 'So I think we're going to wrap up.')
and repeats it hundreds of times in a tight loop.

Observed: 759x 'All right.' + 750x 'So I think we're going to wrap up.'
in the final 8 seconds of a 101-min YouTube conference recording.
After fix: clean termination with no repetition loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-06 00:14:56 +02:00
mozempk
2176206afe fix: restore correct no_speech_thold and BeamSearch defaults
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 6m28s
- Revert no_speech_thold from 0.0 back to 0.6 (whisper.cpp default)
  0.0 means 'suppress if p(no-speech) > 0.0' which fires on every segment,
  silently producing 0-segment output for all real-world audio
- Revert SamplingStrategy from Greedy{best_of:5} back to BeamSearch{beam_size:5}
  Greedy with temperature=0.0 and best_of>1 is undefined in whisper.cpp
- Restore entropy_thold=2.4 and logprob_thold=-1.0 defaults
- Keep flash_attn disabled (was causing silent failures on conference audio)
- Tested: 59 segments on 5 min YouTube conference audio, 29 on repair audio
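
To illustrate the first bullet, here is a deliberately simplified model of the rule as this commit describes it ("suppress if p(no-speech) > threshold"). It is not whisper.cpp's actual decision logic, which may take other signals into account; the function name and the sample probabilities are made up for the example.

```rust
/// Count segments that survive the simplified no-speech rule:
/// a segment is dropped when its no-speech probability exceeds `thold`.
fn surviving_segments(no_speech_probs: &[f32], thold: f32) -> usize {
    no_speech_probs.iter().filter(|&&p| !(p > thold)).count()
}
```

Under this model, probabilities of 0.02, 0.3, and 0.75 leave two segments at a threshold of 0.6 and none at 0.0, since any positive probability exceeds zero.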

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 23:50:12 +02:00
mozempk
16cb6ca661 feat: GPU-accelerated Whisper API for RTX 2080 (sm_75)
All checks were successful
Build & Push Docker Image / build-and-push (push) Successful in 11m13s
- Pure Rust: Axum 0.7 + whisper-rs 0.13 (CUDA FFI)
- Async job queue with SSE progress streaming
- Webhook delivery with 5x exponential backoff
- Disk-persisted job state (survives restarts)
- Anti-hallucination params: no_speech_thold, entropy_thold, suppress_blank
- CUDA sm_75 flags: GGML_CUDA_FORCE_MMQ, GGML_CUDA_GRAPHS, GGML_CUDA_FA_ALL_QUANTS
- Configurable via env: CUDA_DEVICE, WHISPER_MODEL_PATH, PORT, DATA_DIR
- Gitea Actions CI: build + push to git.sal.giize.com registry
- Multi-stage Dockerfile with customizable CUDA_VERSION ARG
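
A minimal sketch of webhook delivery with exponential backoff, under stated assumptions: the delay doubles from a configurable base, "5x" is read as five attempts in total, and the sender and sleeper are injected as closures so that no HTTP client or async runtime is needed. All names are hypothetical and not taken from the repository.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (0-based): base, 2*base, 4*base, ...
fn backoff_delay(base: Duration, retry: u32) -> Duration {
    base * 2u32.pow(retry)
}

/// Call `send` until it succeeds or `max_attempts` is exhausted,
/// waiting between attempts. Returns the number of attempts used.
fn deliver_with_backoff<S, W>(
    mut send: S,
    mut wait: W,
    base: Duration,
    max_attempts: u32,
) -> Result<u32, String>
where
    S: FnMut() -> Result<(), String>,
    W: FnMut(Duration),
{
    let mut last_err = String::from("no attempts made");
    for attempt in 0..max_attempts {
        match send() {
            Ok(()) => return Ok(attempt + 1),
            Err(e) => last_err = e,
        }
        // No wait after the final failed attempt.
        if attempt + 1 < max_attempts {
            wait(backoff_delay(base, attempt));
        }
    }
    Err(last_err)
}
```

With a 1s base and five attempts, a permanently failing endpoint is tried after waits of 1s, 2s, 4s, and 8s before the error is returned.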

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-05 22:47:24 +02:00