feat: dynamic model loading/unloading with GPU polling

- Model starts unloaded (lazy); loads on first job or POST /model/load
- Auto-unloads after IDLE_TIMEOUT_SECS (default 300) of inactivity
- POST /model/unload for immediate manual release
- GPU-busy detection: on VRAM OOM, enters WaitingForGpu and retries every
  GPU_POLL_INTERVAL_SECS (default 30) indefinitely
- POST /jobs when unloaded → 503 + Retry-After header, triggers load
- AppError::OutOfMemory and AppError::ModelNotReady variants
- WorkerCmd channel (SyncSender<WorkerCmd>) replaces bare tx_req channel
- Idle timer via recv_timeout(1s) tick inside OS thread (no extra thread)
- Model lifecycle events broadcast via tokio broadcast channel (SSE + webhooks)
- webhook_registry: all clients that ever submitted a webhook_url receive
  model_ready and model_unloaded webhooks
- GPU warmup retained on every (re)load

New routes:
  GET  /model/status: current state + VRAM stats
  POST /model/load: trigger load (idempotent)
  POST /model/unload: immediate unload
  GET  /model/events: SSE stream of model lifecycle events

New env vars:
  IDLE_TIMEOUT_SECS (default 300)
  GPU_POLL_INTERVAL_SECS (default 30)

Tests:
  tests/test_model_lifecycle.sh: 18 integration tests (full state machine,
    SSE events, webhooks, concurrency, unload-during-load)
  tests/test_idle_timeout.sh: 5 tests with short IDLE_TIMEOUT_SECS=5
  test_all.sh updated: loads model before job submission, asserts
    model_state in /health, adds POST /model/unload at end

Docs:
  docs/USAGE.md: model lifecycle section, new env vars, 503 retry pattern,
    updated /health response shape

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
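
For readers unfamiliar with the idle-timer approach named in the message (a `recv_timeout` tick on the worker's own OS thread, with no separate timer thread), here is a minimal sketch using only the standard library. It is not the code from this commit. The `WorkerCmd::Unload` and `WorkerCmd::Job` variants, the `Event` enum, and the `run_worker` and `spawn_worker` functions are assumptions made for illustration. The tick length is a parameter here, whereas the message states a fixed 1s tick. As a simplification, a job loads the model directly instead of being rejected with 503 at the HTTP layer.

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Commands sent to the worker thread. Only `Load` appears in the diff;
/// `Unload` and `Job` are assumed names for this sketch.
#[derive(Debug)]
pub enum WorkerCmd {
    Load,
    Unload,
    Job(String),
}

/// Hypothetical lifecycle events; the commit broadcasts similar events
/// over a tokio broadcast channel, which this sketch replaces with a Vec.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    ModelReady,
    ModelUnloaded,
}

/// Runs until every sender is dropped and returns the events it emitted.
/// The idle check happens on each `recv_timeout` timeout, so no extra
/// timer thread is needed.
pub fn run_worker(rx: Receiver<WorkerCmd>, idle_timeout: Duration, tick: Duration) -> Vec<Event> {
    let mut events = Vec::new();
    let mut loaded = false;
    let mut last_used = Instant::now();

    loop {
        match rx.recv_timeout(tick) {
            // Both an explicit load and a job bring the model up lazily
            // and reset the idle clock.
            Ok(WorkerCmd::Load) | Ok(WorkerCmd::Job(_)) => {
                if !loaded {
                    loaded = true;
                    events.push(Event::ModelReady);
                }
                last_used = Instant::now();
            }
            // Manual unload is a no-op when nothing is loaded.
            Ok(WorkerCmd::Unload) => {
                if loaded {
                    loaded = false;
                    events.push(Event::ModelUnloaded);
                }
            }
            // No command arrived within one tick: check for idleness.
            Err(RecvTimeoutError::Timeout) => {
                if loaded && last_used.elapsed() >= idle_timeout {
                    loaded = false;
                    events.push(Event::ModelUnloaded);
                }
            }
            // All senders are gone: shut the worker down.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    events
}

/// Spawns the worker on its own OS thread behind a bounded channel.
pub fn spawn_worker(
    idle_timeout: Duration,
    tick: Duration,
) -> (SyncSender<WorkerCmd>, thread::JoinHandle<Vec<Event>>) {
    let (tx, rx) = sync_channel(16);
    let handle = thread::spawn(move || run_worker(rx, idle_timeout, tick));
    (tx, handle)
}
```

The tick length bounds how late an idle unload can fire: with a 1s tick and a 300s timeout, the unload happens at most about one second after the deadline.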
2026-05-08 17:57:20 +02:00
parent 78c6fab81b
commit b191fbe200
13 changed files with 2053 additions and 148 deletions
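
The jobs.rs hunk below returns `AppError::ModelNotReady { state, retry_after_secs }` when the model is not ready, and the commit message says clients see this as 503 plus a Retry-After header. The conversion itself is not part of the excerpt, so the following framework-free sketch is only one way it might look. The `HttpReply` struct, the `to_reply` method, the `u64` type of `retry_after_secs`, and the body text are all assumptions; the real code presumably goes through the web framework's own response conversion.

```rust
/// Reduced stand-in for the crate's error type. Only the variant used in
/// the diff is modelled, and the field types are assumed.
#[derive(Debug)]
pub enum AppError {
    ModelNotReady { state: String, retry_after_secs: u64 },
}

/// Hypothetical, framework-independent view of an HTTP response.
#[derive(Debug)]
pub struct HttpReply {
    pub status: u16,
    pub headers: Vec<(String, String)>,
    pub body: String,
}

impl AppError {
    /// Maps the error to 503 Service Unavailable with a Retry-After
    /// header given in seconds.
    pub fn to_reply(&self) -> HttpReply {
        match self {
            AppError::ModelNotReady { state, retry_after_secs } => HttpReply {
                status: 503,
                headers: vec![("Retry-After".to_string(), retry_after_secs.to_string())],
                body: format!("model not ready (state: {})", state),
            },
        }
    }
}
```

A client can then sleep for the number of seconds in Retry-After and resubmit the job, which matches the "503 retry pattern" the message says is documented in docs/USAGE.md.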
--- a/src/routes/jobs.rs
+++ b/src/routes/jobs.rs
@@ -19,7 +19,7 @@ use uuid::Uuid;

 use crate::{
    models::{Job, JobId, JobStatus, SubmitResponse},
-    worker::{audio_path_for, ProgressEvent},
+    worker::{audio_path_for, ProgressEvent, WorkerCmd},
    AppError, AppState, Result,
 };

@@ -107,6 +107,36 @@ pub async fn submit_job(
        ));
    }

+    // Check model state before accepting the job.
+    let (model_ready, retry_after_secs, state_tag) = {
+        let ms = state.model_state.read().await;
+        let ready = ms.is_ready();
+        let retry = ms.retry_after_secs();
+        let tag   = ms.tag().to_string();
+        (ready, retry, tag)
+    };
+
+    // Register the webhook URL regardless of model state — so model lifecycle
+    // events are delivered even if the job itself is rejected.
+    if let Some(url) = &webhook_url {
+        state.webhook_registry.lock()
+            .unwrap_or_else(|e| e.into_inner())
+            .insert(url.clone());
+    }
+
+    if !model_ready {
+        // Trigger a load if the model is simply unloaded (not already loading).
+        if state_tag == "unloaded" {
+            let _ = state.cmd_tx.try_send(WorkerCmd::Load);
+        }
+        // Clean up the audio file we already wrote to disk.
+        let _ = tokio::fs::remove_file(&audio_path).await;
+        return Err(AppError::ModelNotReady {
+            state: state_tag,
+            retry_after_secs,
+        });
+    }
+
    let mut job = Job::new(id, task, webhook_url, filename);
    job.language = language;