feat: GPU model lazy-load/unload lifecycle management

- Domain: add ModelState, ModelStateEvent, ModelNotReady, ManageModelLifecycle
  (in-port), ModelLoader and ModelStateEventBus (out-ports)
- Application: InMemoryModelStateEventBus; ModelLifecycleService — state
  machine (ReentrantLock), lazy load on first request, idle-timeout auto-unload
  (configurable via trueref.embedding.idle-timeout-seconds, default 300 s),
  job guard (skips unload while ingestion is running), platform-thread CUDA executor
- Adapters: OnnxModelLoader wires embedder + reranker start/stop; remove
  @PostConstruct/@PreDestroy from OnnxEmbeddingService and OnnxRerankerService;
  requireStarted() now throws ModelNotReady instead of IllegalStateException
- REST: GET /api/model/status, POST /api/model/unload (409 when jobs running,
  force=true to override), GET /api/model/status/stream (SSE)
- GlobalExceptionHandler: ModelNotReady -> 503 + Retry-After header
- HybridSearchService: calls lifecycle.ensureReady() before every search so
  both REST and MCP paths get ModelNotReady (-> 503 / MCP error) when unloaded
- TrueRefMcpTools: catches ModelNotReady, returns retry hint in MCP error text
- Tests: InMemoryModelStateEventBusTest, ModelLifecycleServiceTest (10 cases),
  OnnxModelLoaderTest, GlobalExceptionHandlerTest — all 41 tests green
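
The lifecycle described in the Application bullet can be sketched roughly as
below. This is a minimal illustration, not the ModelLifecycleService from this
commit: the class name LifecycleSketch, the method unloadIfIdle, the state names,
the BooleanSupplier job guard and the explicit nowNanos parameter are all
assumptions made for the example. It also loads synchronously inside
ensureReady(), whereas the real service surfaces ModelNotReady to callers.

```java
import java.time.Duration;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: names echo the commit message, bodies are assumptions.
final class LifecycleSketch {
    enum ModelState { UNLOADED, LOADING, READY, UNLOADING }

    /** Stand-in for the ModelLoader out-port: starts and stops the models. */
    interface ModelLoader {
        void load();
        void unload();
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final ModelLoader loader;
    private final BooleanSupplier jobsRunning; // job guard (assumed shape)
    private final long idleTimeoutNanos;
    private ModelState state = ModelState.UNLOADED;
    private long lastUsedNanos;

    LifecycleSketch(ModelLoader loader, BooleanSupplier jobsRunning, Duration idleTimeout) {
        this.loader = loader;
        this.jobsRunning = jobsRunning;
        this.idleTimeoutNanos = idleTimeout.toNanos();
    }

    /** Lazy load: loads on the first call, later calls only refresh the idle clock. */
    void ensureReady(long nowNanos) {
        lock.lock();
        try {
            if (state != ModelState.READY) {
                state = ModelState.LOADING;
                try {
                    loader.load();
                    state = ModelState.READY;
                } finally {
                    // A failed load falls back to UNLOADED so a later call can retry.
                    if (state != ModelState.READY) state = ModelState.UNLOADED;
                }
            }
            lastUsedNanos = nowNanos;
        } finally {
            lock.unlock();
        }
    }

    /** Unloads only when READY, idle past the timeout, and no job is running. */
    boolean unloadIfIdle(long nowNanos) {
        lock.lock();
        try {
            boolean idle = nowNanos - lastUsedNanos >= idleTimeoutNanos;
            if (state != ModelState.READY || !idle || jobsRunning.getAsBoolean()) {
                return false;
            }
            state = ModelState.UNLOADING;
            try {
                loader.unload();
            } finally {
                state = ModelState.UNLOADED;
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    ModelState state() {
        lock.lock();
        try {
            return state;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch a scheduler would call unloadIfIdle(System.nanoTime()) periodically
and request paths would call ensureReady(System.nanoTime()). Passing the clock
value in keeps the idle-timeout logic testable without sleeping.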

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
moze
2026-05-09 15:44:33 +02:00
parent 943a38fd36
commit 5c6085df99
24 changed files with 1144 additions and 17 deletions


@@ -7,6 +7,7 @@ import com.trueref.application.observability.InMemoryJobEventBus;
 import com.trueref.application.observability.JobObservationService;
 import com.trueref.application.resolve.LibraryResolver;
 import com.trueref.application.search.HybridSearchService;
+import com.trueref.domain.port.in.ManageModelLifecycle;
 import com.trueref.domain.port.out.ChunkStore;
 import com.trueref.domain.port.out.CodeParser;
 import com.trueref.domain.port.out.EmbeddingCache;
@@ -76,10 +77,11 @@ public class ApplicationBeans {
 EmbeddingService embedder,
 RerankerService reranker,
 RepositoryStore repos,
+ManageModelLifecycle lifecycle,
 @Value("${trueref.search.rrf-k:60}") int rrfK,
 @Value("${trueref.reranker.top-k:50}") int rerankTopK,
 @Value("${trueref.search.final-top-k:20}") int finalTopK) {
-return new HybridSearchService(chunks, embedder, reranker, repos, rrfK, rerankTopK, finalTopK);
+return new HybridSearchService(chunks, embedder, reranker, repos, lifecycle, rrfK, rerankTopK, finalTopK);
 }
 @Bean