Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
This commit is contained in:
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions

43
envs/.env.gemma4-e2b Normal file
View File

@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M — Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
# - 2.3B effective params (5.1B total with PLE embedding tables)
# - 35 layers, hybrid local (512-token window) + global attention
# - 128K context window
# Model size: ~2.9 GB Q4_K_M | Full GPU fit (ngl=99, VRAM ~3.4 GB total)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
# huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \
# google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download — bartowski naming may vary.
# Check: ls models/google_gemma*
# ==============================================================================
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup.
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits!
# 65K/131K OOM (full global-attn layers eat VRAM at large ctx).
# Baseline: 350 pp / 64.6 tg t/s | At 32K ctx: 365 pp / 66.8 tg t/s (fa=1)
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
# f16 KV — model small, KV overhead negligible even at 32K
CACHE_TYPE_K=f16
CACHE_TYPE_V=f16
# 2 parallel slots — fast model (66 tg t/s), VRAM headroom available
PARALLEL=2
# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0)
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
# Use this profile when you need >24K context; otherwise use gemma4-e2b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=393216
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

43
envs/.env.gemma4-e4b Normal file
View File

@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M — Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
# - 4.5B effective params (8B total with PLE embedding tables)
# - 42 layers, hybrid local (512-token window) + global attention
# - 128K context window
# Model size: ~4.7 GB Q4_K_M | CPU-split needed (exceeds 3.7 GB VRAM)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
# huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \
# google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download — bartowski naming may vary.
# Check: ls models/google_gemma*
# ==============================================================================
MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# ALL 42 layers fit on GPU when no other containers hold VRAM!
# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s (ngl=28 was only 59/16.5)
# Max ctx=24576 (hybrid attention, 32K OOM). fa=1 works (+3% vs fa=0).
# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg)
N_GPU_LAYERS=42
# 24K max — hybrid sliding-window keeps most layers' KV tiny
# 32K OOM due to global-attn layers hitting VRAM wall
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
# fa=1 confirmed working on hybrid Gemma4 attention
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
# Use this profile when you need >24K context; otherwise use gemma4-e4b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
N_GPU_LAYERS=42
CTX_SIZE=163840
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

42
envs/.env.qwen3-4b Normal file
View File

@@ -0,0 +1,42 @@
# ==============================================================================
# Qwen3-4B-Instruct Q4_K_M — Alibaba (May 2025)
# Architecture: Decoder-only transformer, GQA
# - 4B params, 32 layers
# - 32K native context (128K with YaRN)
# Model size: ~2.4 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 119 languages,
# Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes).
#
# Download:
# huggingface-cli download bartowski/Qwen3-4B-GGUF \
# Qwen3-4B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
# ls models/Qwen3-4B*
# ==============================================================================
MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
# All layers fit — ~2.4 GB leaves ~1.3 GB free for KV + compute
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=8192 (12K OOM). Full attention — all KV must fit at full ctx.
# GGUF native limit=40960, but VRAM walls at ~8K.
# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%).
CTX_SIZE=16384
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
# 1 parallel slot — limited VRAM at 8K ctx with 2.4GB model
PARALLEL=1
# fa=1 gives +6% tg speed on full-attention Qwen3
EXTRA_ARGS=--flash-attn on --mmap

24
envs/.env.qwen3-4b-bigctx Normal file
View File

@@ -0,0 +1,24 @@
# ==============================================================================
# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384)
# Baseline TG: ~39 t/s (empty KV).
# Use this profile when you need >16K context; otherwise use qwen3-4b.
# ==============================================================================
MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

41
envs/.env.qwen35-9b Normal file
View File

@@ -0,0 +1,41 @@
# ==============================================================================
# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU)
# RAM usage: ~5.5 GB (21 layers pinned via mlock)
#
# Benchmark results (turbo2 KV, ngl=11, fa=1):
# t=1→0.86 t=2→1.62 t=3→2.25 t=4→2.94 t=5→3.56 t=6→4.38 ← best
# t=8→4.22 t=12→3.61 (hyperthreading hurts above 6)
# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
# Achieved: 4.38 t/s = 86% efficiency
#
# Download:
# huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
# Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
# ==============================================================================
MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
N_GPU_LAYERS=11
# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
CTX_SIZE=32768
# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
EXTRA_ARGS=--flash-attn on --no-mmap --mlock

42
envs/.env.smollm3-3b Normal file
View File

@@ -0,0 +1,42 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M — HuggingFace (2025)
# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio)
# - 3B params, 11.2T training tokens
# - 64K native context (128K with YaRN)
# Model size: ~1.9 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 6 languages,
# Apache 2.0. AIME 2025: 36.7% in think mode.
#
# Download:
# huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \
# HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
# ls models/SmolLM3* models/HuggingFaceTB_SmolLM3*
# ==============================================================================
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
# All layers fit comfortably — ~1.9 GB leaves ~1.8 GB free for KV + compute
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=24576 (32K OOM). Baseline: 249 pp / 56.8 tg t/s.
# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%).
# Model context limit = 65536, VRAM is the constraint here.
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q8_0
CACHE_TYPE_V=q8_0
# 2 parallel slots — less headroom at 24K ctx vs original 16K estimate
PARALLEL=2
# fa=1 gives small but consistent improvement (+2 tg t/s)
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
# Use this profile when you need >24K context; otherwise use smollm3-3b.
# ==============================================================================
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=65536
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload