# ==============================================================================
# Gemma 4 E4B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
# Use this profile when you need >24K context; otherwise use gemma4-e4b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
N_GPU_LAYERS=42
CTX_SIZE=163840
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload