Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions
--- a/envs/.env.gemma4-e4b-bigctx
+++ b/envs/.env.gemma4-e4b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): recommended ctx with turbo2 = 163840,
+# i.e. +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
+# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
+# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
+# Use this profile when you need >24K context; otherwise use gemma4-e4b.
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
+
+N_GPU_LAYERS=42
+CTX_SIZE=163840
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
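
The figures in the header comment of this env file can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit. It copies the numbers from the comment and assumes that "KB" there means KiB (1024 bytes) and that the 346 MiB figure covers the whole KV cache at the full context size. The function names are hypothetical.

```python
# Sanity check of the figures quoted in envs/.env.gemma4-e4b-bigctx.
# Assumption: "KB" in the header means KiB, and 346 MiB is the total
# KV cache size at the full 163840-token context.

CTX_BIGCTX = 163_840         # CTX_SIZE in this profile
CTX_PURE_GPU = 24_576        # pure-GPU context quoted in the header
KV_TOTAL_MIB = 346           # "RAM at 163840: 346 MiB KV"
KV_Q4_0_KIB_PER_TOK = 4.5    # q4_0 figure quoted in the header


def extra_tokens(big_ctx: int, base_ctx: int) -> int:
    """Tokens of context gained by moving the KV cache to RAM."""
    return big_ctx - base_ctx


def kib_per_token(total_mib: float, ctx: int) -> float:
    """Average KV footprint per token, in KiB."""
    return total_mib * 1024 / ctx


gain = extra_tokens(CTX_BIGCTX, CTX_PURE_GPU)
ratio = CTX_BIGCTX / CTX_PURE_GPU
per_tok = kib_per_token(KV_TOTAL_MIB, CTX_BIGCTX)
shrink = KV_Q4_0_KIB_PER_TOK / per_tok

print(f"context gain: +{gain} tokens ({ratio:.2f}x)")
print(f"turbo2 KV: {per_tok:.2f} KiB/tok, {shrink:.2f}x smaller than q4_0")
```

Under these assumptions the gain matches the +139264 tokens in the header, the context ratio falls inside the 2-16x range from the commit message, and the per-token footprint lands close to the quoted 2.1 KB/tok, about half the q4_0 figure.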