- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
27 lines
851 B
Plaintext
# ==============================================================================
# SmolLM3 3B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
# Use this profile when you need >24K context; otherwise use smollm3-3b.
# ==============================================================================

MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf

N_GPU_LAYERS=99
CTX_SIZE=65536

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
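For orientation, here is a minimal sketch of how a profile like this could expand into a `llama-server` command line. How the repository's compose file actually consumes these variables is not shown here, so the mapping is an assumption. The long option names are the ones upstream llama.cpp documents, `turbo2` is taken to be a cache type supplied by the TurboQuant build rather than upstream, and `MODEL_DIR` is a hypothetical mount point. The sketch quotes `EXTRA_ARGS` because an unquoted value containing spaces cannot be sourced by a POSIX shell, even though env-file loaders generally read the rest of the line as the value.

```shell
#!/bin/sh
# Values copied from the bigctx profile above.
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=65536
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS="--flash-attn on --mmap --no-kv-offload"

# Hypothetical model directory; the real mount point is not given in the source.
MODEL_DIR=/models

# Build the argument list as positional parameters.
# $EXTRA_ARGS is deliberately unquoted so it splits into separate words.
set -- llama-server \
  --model "$MODEL_DIR/$MODEL_FILE" \
  --n-gpu-layers "$N_GPU_LAYERS" \
  --ctx-size "$CTX_SIZE" \
  --threads "$THREADS" \
  --threads-batch "$THREADS_BATCH" \
  --batch-size "$BATCH_SIZE" \
  --ubatch-size "$UBATCH_SIZE" \
  --cache-type-k "$CACHE_TYPE_K" \
  --cache-type-v "$CACHE_TYPE_V" \
  --parallel "$PARALLEL" \
  $EXTRA_ARGS

# Print the command instead of running it, so the sketch is safe to try anywhere.
CMD="$*"
echo "$CMD"
```

Printing rather than executing makes it easy to compare this profile against the pure-GPU `smollm3-3b` one before launching anything. Going by the header comment, the differences to look for are the larger `--ctx-size` and the `--no-kv-offload` flag that keeps the KV cache in system RAM.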