Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions
--- a/envs/.env.gemma4-e2b
+++ b/envs/.env.gemma4-e2b
@@ -0,0 +1,43 @@
+# ==============================================================================
+# Gemma 4 E2B-it Q4_K_M — Google DeepMind (April 2025)
+# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
+#   - 2.3B effective params (5.1B total with PLE embedding tables)
+#   - 35 layers, hybrid local (512-token window) + global attention
+#   - 128K context window
+# Model size: ~2.9 GB Q4_K_M  |  Full GPU fit (ngl=99, VRAM ~3.4 GB total)
+# Modalities: text + image + audio (ASR/translation) + video frames
+#
+# Download:
+#   huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \
+#     google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify the exact filename after download — bartowski naming may vary.
+#       Check: ls models/google_gemma*
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
+
+# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup.
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits!
+# 65K/131K OOM (full global-attn layers eat VRAM at large ctx).
+# Baseline: 350 pp / 64.6 tg t/s  |  At 32K ctx: 365 pp / 66.8 tg t/s (fa=1)
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+# f16 KV — model small, KV overhead negligible even at 32K
+CACHE_TYPE_K=f16
+CACHE_TYPE_V=f16
+
+# 2 parallel slots — fast model (66 tg t/s), VRAM headroom available
+PARALLEL=2
+
+# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0)
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.gemma4-e2b-bigctx
+++ b/envs/.env.gemma4-e2b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# Gemma 4 E2B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
+# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
+# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
+# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
+# Use this profile when you need >24K context; otherwise use gemma4-e2b.
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=393216
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.gemma4-e4b
+++ b/envs/.env.gemma4-e4b
@@ -0,0 +1,43 @@
+# ==============================================================================
+# Gemma 4 E4B-it Q4_K_M — Google DeepMind (April 2025)
+# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
+#   - 4.5B effective params (8B total with PLE embedding tables)
+#   - 42 layers, hybrid local (512-token window) + global attention
+#   - 128K context window
+# Model size: ~4.7 GB Q4_K_M  |  CPU-split needed (exceeds 3.7 GB VRAM)
+# Modalities: text + image + audio (ASR/translation) + video frames
+#
+# Download:
+#   huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \
+#     google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify the exact filename after download — bartowski naming may vary.
+#       Check: ls models/google_gemma*
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# ALL 42 layers fit on GPU when no other containers hold VRAM!
+# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s  (ngl=28 was only 59/16.5)
+# Max ctx=24576 (hybrid attention, 32K OOM).  fa=1 works (+3% vs fa=0).
+# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg)
+N_GPU_LAYERS=42
+
+# 24K max — hybrid sliding-window keeps most layers' KV tiny
+# 32K OOM due to global-attn layers hitting VRAM wall
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+# fa=1 confirmed working on hybrid Gemma4 attention
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.gemma4-e4b-bigctx
+++ b/envs/.env.gemma4-e4b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
+# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
+# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
+# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
+# Use this profile when you need >24K context; otherwise use gemma4-e4b.
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
+
+N_GPU_LAYERS=42
+CTX_SIZE=163840
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.qwen3-4b
+++ b/envs/.env.qwen3-4b
@@ -0,0 +1,42 @@
+# ==============================================================================
+# Qwen3-4B-Instruct Q4_K_M — Alibaba (May 2025)
+# Architecture: Decoder-only transformer, GQA
+#   - 4B params, 32 layers
+#   - 32K native context (128K with YaRN)
+# Model size: ~2.4 GB Q4_K_M  |  Full GPU fit (ngl=99)
+# Features: thinking mode (/think /no_think), tool calling, 119 languages,
+#           Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes).
+#
+# Download:
+#   huggingface-cli download bartowski/Qwen3-4B-GGUF \
+#     Qwen3-4B-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify exact filename after download:
+#       ls models/Qwen3-4B*
+# ==============================================================================
+
+MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
+
+# All layers fit — ~2.4 GB leaves ~1.3 GB free for KV + compute
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Max ctx=8192 (12K OOM). Full attention — all KV must fit at full ctx.
+# GGUF native limit=40960, but VRAM walls at ~8K.
+# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%).
+CTX_SIZE=16384
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+# 1 parallel slot — limited VRAM at 8K ctx with 2.4GB model
+PARALLEL=1
+
+# fa=1 gives +6% tg speed on full-attention Qwen3
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.qwen3-4b-bigctx
+++ b/envs/.env.qwen3-4b-bigctx
@@ -0,0 +1,24 @@
+# ==============================================================================
+# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384)
+# Baseline TG: ~39 t/s (empty KV).
+# Use this profile when you need >16K context; otherwise use qwen3-4b.
+# ==============================================================================
+
+MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.qwen35-9b
+++ b/envs/.env.qwen35-9b
@@ -0,0 +1,41 @@
+# ==============================================================================
+# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
+# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
+# Model size: 8.86 GB  |  VRAM usage: ~3.4 GB (11 layers on GPU)
+# RAM usage: ~5.5 GB (21 layers pinned via mlock)
+#
+# Benchmark results (turbo2 KV, ngl=11, fa=1):
+#   t=1→0.86  t=2→1.62  t=3→2.25  t=4→2.94  t=5→3.56  t=6→4.38 ← best
+#   t=8→4.22  t=12→3.61  (hyperthreading hurts above 6)
+# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
+# Achieved: 4.38 t/s = 86% efficiency
+#
+# Download:
+#   huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
+#     Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
+# ==============================================================================
+
+MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
+
+# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
+N_GPU_LAYERS=11
+
+# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
+CTX_SIZE=32768
+
+# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
+# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
+EXTRA_ARGS=--flash-attn on --no-mmap --mlock
--- a/envs/.env.smollm3-3b
+++ b/envs/.env.smollm3-3b
@@ -0,0 +1,42 @@
+# ==============================================================================
+# SmolLM3 3B-it Q4_K_M — HuggingFace (2025)
+# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio)
+#   - 3B params, 11.2T training tokens
+#   - 64K native context (128K with YaRN)
+# Model size: ~1.9 GB Q4_K_M  |  Full GPU fit (ngl=99)
+# Features: thinking mode (/think /no_think), tool calling, 6 languages,
+#           Apache 2.0. AIME 2025: 36.7% in think mode.
+#
+# Download:
+#   huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \
+#     HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify exact filename after download:
+#       ls models/SmolLM3* models/HuggingFaceTB_SmolLM3*
+# ==============================================================================
+
+MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
+
+# All layers fit comfortably — ~1.9 GB leaves ~1.8 GB free for KV + compute
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Max ctx=24576 (32K OOM). Baseline: 249 pp / 56.8 tg t/s.
+# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%).
+# Model context limit = 65536, VRAM is the constraint here.
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q8_0
+CACHE_TYPE_V=q8_0
+
+# 2 parallel slots — less headroom at 24K ctx vs original 16K estimate
+PARALLEL=2
+
+# fa=1 gives small but consistent improvement (+2 tg t/s)
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.smollm3-3b-bigctx
+++ b/envs/.env.smollm3-3b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
+# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
+# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
+# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
+# Use this profile when you need >24K context; otherwise use smollm3-3b.
+# ==============================================================================
+
+MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=65536
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload