llama-cpp

5 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
Giancarmine Salucci	e7e389c0e1	llama+compose: fix bigctx startup timing - compose: increase start_period for bigctx services - gemma4-e4b-bigctx: 60s -> 150s (5 GiB model + warmup + 163840 ctx takes ~90-120s) - gemma4-e2b-bigctx: 60s -> 120s (large ctx 393216 allocation) - smollm3/qwen3-4b bigctx: 60s -> 90s - llama: extend health poll from 30x2s=60s to 75x2s=150s - llama: require 3 consecutive unhealthy before giving up (avoids false positives during Docker start_period window)	2026-05-06 19:03:31 +02:00
Giancarmine Salucci	0618078937	llama: fix bigctx double-profile conflict (llama_server name collision)	2026-05-06 17:41:20 +02:00
Giancarmine Salucci	33333507a5	llama: remove stopped containers before start to fix name conflict	2026-05-06 17:37:47 +02:00
Giancarmine Salucci	9f0193c3fc	Add llama launcher script - ./llama (interactive menu) or ./llama <cmd> [args] - start <model> [--bigctx] [--webui]: verify model file, warn before stopping running server, health-wait after start - stop: stop all llama containers - status: running model + health + env vars - logs [--follow]: tail server logs - build: build TurboQuant images - bench <model>: run llama-bench via bench profile	2026-05-06 17:31:35 +02:00
Giancarmine Salucci	4ad296608b	Initial commit: tuned multi-model llama.cpp stack - 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design	2026-05-06 15:56:40 +02:00
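
The health-wait behaviour described in commit e7e389c0e1 (75 polls at 2 s intervals, giving up only after 3 consecutive unhealthy results) can be sketched roughly as follows. This is an illustrative sketch, not the launcher's actual code: the function name `wait_healthy`, the `get_status` callable, and the status strings (modelled on Docker's `starting` / `healthy` / `unhealthy` health states) are all assumptions.

```python
import time
from typing import Callable


def wait_healthy(
    get_status: Callable[[], str],   # hypothetical: returns "starting", "healthy" or "unhealthy"
    attempts: int = 75,              # 75 polls x 2 s = 150 s, per the commit message
    interval_s: float = 2.0,
    unhealthy_limit: int = 3,        # consecutive "unhealthy" results tolerated before giving up
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the server is healthy; return False on timeout or repeated failure."""
    consecutive_unhealthy = 0
    for _ in range(attempts):
        status = get_status()
        if status == "healthy":
            return True
        if status == "unhealthy":
            consecutive_unhealthy += 1
            # A single "unhealthy" may be a transient result around the
            # start_period window, so only a streak is treated as failure.
            if consecutive_unhealthy >= unhealthy_limit:
                return False
        else:
            # Any other state (e.g. "starting") resets the streak.
            consecutive_unhealthy = 0
        sleep(interval_s)
    return False  # poll budget exhausted without reaching "healthy"
```

In a real launcher, `get_status` would presumably wrap a container health query; passing it in as a callable keeps the polling logic testable without Docker.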