# ==============================================================================
# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
# Use this profile when you need >24K context; otherwise use smollm3-3b.
# ==============================================================================
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=65536
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload