Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
26
envs/.env.gemma4-e2b-bigctx
Normal file
@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
# Use this profile when you need >24K context; otherwise use gemma4-e2b.
# ==============================================================================

MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf

N_GPU_LAYERS=99
CTX_SIZE=393216

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
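
To make the role of each variable concrete, here is a hedged Python sketch of how an env file like `envs/.env.gemma4-e2b-bigctx` could be turned into a `llama-server` command line. The compose file is not part of this diff, so the variable-to-flag mapping, the `/models` directory, and the helper names `parse_env` and `build_args` are assumptions for illustration; the real stack may wire these differently.

```python
import shlex

# Assumed mapping from env variable names to llama-server long options.
FLAG_MAP = {
    "N_GPU_LAYERS": "--n-gpu-layers",
    "CTX_SIZE": "--ctx-size",
    "THREADS": "--threads",
    "THREADS_BATCH": "--threads-batch",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
    "PARALLEL": "--parallel",
}

ENV_TEXT = """\
# bigctx profile (abridged copy of the env file)
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=393216
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def build_args(env: dict[str, str], model_dir: str = "/models") -> list[str]:
    """Build an argv list; model_dir is a hypothetical mount point."""
    args = ["llama-server", "--model", f"{model_dir}/{env['MODEL_FILE']}"]
    for key, flag in FLAG_MAP.items():
        if key in env:
            args += [flag, env[key]]
    # EXTRA_ARGS is passed through verbatim, split like a shell would.
    args += shlex.split(env.get("EXTRA_ARGS", ""))
    return args


if __name__ == "__main__":
    print(shlex.join(build_args(parse_env(ENV_TEXT))))
```

Keeping `EXTRA_ARGS` as a free-form pass-through means profile-specific switches such as `--no-kv-offload` can be added per model without touching the shared mapping.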