# ==============================================================================
# Gemma 4 E2B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
# Use this profile when you need >24K context; otherwise use gemma4-e2b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=393216
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
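# Sanity check of the header figures (plain arithmetic, no new measurements):
#   651 MiB * 1024 KiB/MiB / 393216 tok ~= 1.70 KiB KV per token
#   393216 - 24576 = 368640 extra tokens vs the pure-GPU profile
# Assumes "KB" in the header means KiB; at 1 KB = 1000 bytes it is ~1.74 KB/tok.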