9f0193c3fc
|
Add llama launcher script
- ./llama (interactive menu) or ./llama <cmd> [args]
- start <model> [--bigctx] [--webui]: verify the model file exists, warn before stopping a running server, wait for a healthy server after start
- stop: stop all llama containers
- status: running model + health + env vars
- logs [--follow]: tail server logs
- build: build TurboQuant images
- bench <model>: run llama-bench via bench profile
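
The health-wait step of `start` could look roughly like the sketch below. This is not the script from this commit. It assumes the server exposes llama.cpp's `/health` endpoint and that `curl` is installed. The names `probe_health` and `wait_for_health`, the URL, and the attempt and interval defaults are all hypothetical.

```shell
# Hypothetical probe: succeeds only when the URL answers with a 2xx status.
# curl -f turns HTTP errors (such as 503 while the model loads) into a failure.
probe_health() {
  curl -fsS --max-time 2 "$1" >/dev/null 2>&1
}

# Hypothetical helper: poll the health URL until it succeeds or attempts run out.
# Usage: wait_for_health <url> [attempts] [interval_seconds]
wait_for_health() {
  url=$1
  attempts=${2:-30}
  interval=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if probe_health "$url"; then
      return 0   # server reported healthy
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1       # gave up: server never became healthy
}
```

A caller might run `wait_for_health "http://localhost:8080/health" 30 2 || echo "server did not become healthy"`, where the port is an assumption. Bounding the number of attempts keeps the launcher from hanging if the container fails to start.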
|
2026-05-06 17:31:35 +02:00 |
4ad296608b
|
Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo, KV cache kept in system RAM): 2-16x larger context
- turbo2 KV cache: 2x smaller, benchmarked against a perplexity (PPL) quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
|
2026-05-06 15:56:40 +02:00 |