Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
43
envs/.env.gemma4-e2b
Normal file
@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M - Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
#   - 2.3B effective params (5.1B total with PLE embedding tables)
#   - 35 layers, hybrid local (512-token window) + global attention
#   - 128K context window
# Model size: ~2.9 GB Q4_K_M | Full GPU fit (ngl=99, VRAM ~3.4 GB total)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
#   huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \
#     google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download; bartowski naming may vary.
#   Check: ls models/google_gemma*
# ==============================================================================

MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf

# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup.
N_GPU_LAYERS=99

# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits!
# 65K/131K OOM (full global-attn layers eat VRAM at large ctx).
# Baseline: 350 pp / 64.6 tg t/s | At 32K ctx: 365 pp / 66.8 tg t/s (fa=1)
CTX_SIZE=24576

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

# f16 KV - model small, KV overhead negligible even at 32K
CACHE_TYPE_K=f16
CACHE_TYPE_V=f16

# 2 parallel slots - fast model (66 tg t/s), VRAM headroom available
PARALLEL=2

# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0)
EXTRA_ARGS=--flash-attn on --mmap
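How an env file like the one above might be turned into a `llama-server` command line can be sketched as follows. The compose file is not part of this diff, so the variable-to-flag mapping in `FLAG_FOR`, the helper names `parse_env` and `build_argv`, and the `/models` mount point are assumptions for illustration only; the flags themselves are standard llama.cpp server options.

```python
import shlex

# Hypothetical mapping from env variable to llama-server flag (assumption:
# the real compose file may wire these differently).
FLAG_FOR = {
    "N_GPU_LAYERS": "--n-gpu-layers",
    "CTX_SIZE": "--ctx-size",
    "THREADS": "--threads",
    "THREADS_BATCH": "--threads-batch",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
    "PARALLEL": "--parallel",
}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def build_argv(env, model_dir="/models"):
    """Build an argv list; model_dir is an assumed container mount point."""
    argv = ["llama-server", "--model", f"{model_dir}/{env['MODEL_FILE']}"]
    for key, flag in FLAG_FOR.items():
        if key in env:
            argv += [flag, env[key]]
    # EXTRA_ARGS is passed through verbatim, split the way a shell would.
    argv += shlex.split(env.get("EXTRA_ARGS", ""))
    return argv


SAMPLE = """\
# values copied from envs/.env.gemma4-e2b
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=f16
CACHE_TYPE_V=f16
PARALLEL=2
EXTRA_ARGS=--flash-attn on --mmap
"""

argv = build_argv(parse_env(SAMPLE))
print(" ".join(argv))
```

Keeping `EXTRA_ARGS` as a verbatim pass-through is what lets the bigctx variants differ from the base profiles by a single appended flag.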
26
envs/.env.gemma4-e2b-bigctx
Normal file
@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo / --no-kv-offload)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
#   +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0 t/s at 50% KV fill / 26.6 t/s at 25% fill (PCIe BW-bound).
# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
# Use this profile when you need >24K context; otherwise use gemma4-e2b.
# ==============================================================================

MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf

N_GPU_LAYERS=99
CTX_SIZE=393216

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
43
envs/.env.gemma4-e4b
Normal file
@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M - Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
#   - 4.5B effective params (8B total with PLE embedding tables)
#   - 42 layers, hybrid local (512-token window) + global attention
#   - 128K context window
# Model size: ~4.7 GB Q4_K_M | Exceeds 3.7 GB VRAM on paper, but ngl=42 fits (see benchmark below)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
#   huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \
#     google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download; bartowski naming may vary.
#   Check: ls models/google_gemma*
# ==============================================================================

MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf

# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# ALL 42 layers fit on GPU when no other containers hold VRAM!
# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s (ngl=28 was only 59 pp / 16.5 tg)
# Max ctx=24576 (hybrid attention, 32K OOM). fa=1 works (+3% vs fa=0).
# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg)
N_GPU_LAYERS=42

# 24K max - hybrid sliding-window keeps most layers' KV tiny
# 32K OOM due to global-attn layers hitting VRAM wall
CTX_SIZE=24576

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=128

CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0

PARALLEL=1

# fa=1 confirmed working on hybrid Gemma4 attention
EXTRA_ARGS=--flash-attn on --mmap
26
envs/.env.gemma4-e4b-bigctx
Normal file
@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo / --no-kv-offload)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
#   +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8 t/s at 50% KV fill / 22.4 t/s at 25% fill (PCIe BW-bound).
# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
# Use this profile when you need >24K context; otherwise use gemma4-e4b.
# ==============================================================================

MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf

N_GPU_LAYERS=42
CTX_SIZE=163840

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=128

CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
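The "RAM at ctx" figures in the two Gemma bigctx headers follow from per-token KV size multiplied by context length. A minimal sketch of that arithmetic, assuming "KB" means 1024 bytes and that the per-token values are rounded (so the result only approximates the quoted MiB figures); `kv_mib` is a hypothetical helper name:

```python
def kv_mib(kb_per_token, ctx_tokens):
    """KV cache size in MiB for a given per-token cost (KB) and context."""
    return kb_per_token * ctx_tokens / 1024.0


# (per-token KV in KB, context, MiB quoted in the env header)
QUOTED = {
    "gemma4-e2b-bigctx (q4_0)": (1.7, 393216, 651),
    "gemma4-e4b-bigctx (turbo2)": (2.1, 163840, 346),
}

for name, (kb, ctx, quoted_mib) in QUOTED.items():
    computed = kv_mib(kb, ctx)
    diff_pct = (computed / quoted_mib - 1.0) * 100.0
    print(f"{name}: computed {computed:.1f} MiB, quoted {quoted_mib} MiB ({diff_pct:+.1f}%)")
```

Both results land within a few percent of the quoted numbers, which is consistent with the per-token figures having been rounded to one decimal place.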
42
envs/.env.qwen3-4b
Normal file
@@ -0,0 +1,42 @@
# ==============================================================================
# Qwen3-4B-Instruct Q4_K_M - Alibaba (May 2025)
# Architecture: Decoder-only transformer, GQA
#   - 4B params, 36 layers
#   - 32K native context (128K with YaRN)
# Model size: ~2.4 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 119 languages,
#   Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes).
#
# Download:
#   huggingface-cli download bartowski/Qwen3-4B-GGUF \
#     Qwen3-4B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
#   ls models/Qwen3-4B*
# ==============================================================================

MODEL_FILE=Qwen3-4B-Q4_K_M.gguf

# All layers fit - ~2.4 GB leaves ~1.3 GB free for KV + compute
N_GPU_LAYERS=99

# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=8192 (12K OOM). Full attention - all KV must fit at full ctx.
# GGUF native limit=40960, but VRAM walls at ~8K (CTX_SIZE below is set to 16384; re-verify).
# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%).
CTX_SIZE=16384

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0

# 1 parallel slot - limited VRAM at 8K ctx with 2.4GB model
PARALLEL=1

# fa=1 gives +6% tg speed on full-attention Qwen3
EXTRA_ARGS=--flash-attn on --mmap
24
envs/.env.qwen3-4b-bigctx
Normal file
@@ -0,0 +1,24 @@
# ==============================================================================
# Qwen3-4B Q4_K_M - bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384)
# Baseline TG: ~39 t/s (empty KV).
# Use this profile when you need >16K context; otherwise use qwen3-4b.
# ==============================================================================

MODEL_FILE=Qwen3-4B-Q4_K_M.gguf

N_GPU_LAYERS=99
CTX_SIZE=24576

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
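Each bigctx header ends with the same rule: use the bigctx profile only when the base profile's context is too small. A small sketch of that rule for the three models whose profile pairs appear above; `PROFILES` and `choose_profile` are hypothetical names, and the context values are the `CTX_SIZE` settings from the env files:

```python
# (base CTX_SIZE, bigctx CTX_SIZE) as configured in the env files above.
PROFILES = {
    "gemma4-e2b": (24576, 393216),
    "gemma4-e4b": (24576, 163840),
    "qwen3-4b": (16384, 24576),
}


def choose_profile(model, needed_ctx):
    """Return the env-file suffix to use for a required context length."""
    base_ctx, big_ctx = PROFILES[model]
    if needed_ctx <= base_ctx:
        return model  # pure-GPU profile is enough and is faster
    if needed_ctx <= big_ctx:
        return f"{model}-bigctx"  # KV cache moves to system RAM
    raise ValueError(f"{model}: {needed_ctx} tokens exceeds bigctx limit {big_ctx}")


for model, (base_ctx, big_ctx) in PROFILES.items():
    print(f"{model}: {big_ctx / base_ctx:.1f}x context with bigctx")

print(choose_profile("qwen3-4b", 20000))
```

Preferring the base profile whenever it fits reflects the speed cost noted in the headers: generation slows as the RAM-resident KV cache fills.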
41
envs/.env.qwen35-9b
Normal file
@@ -0,0 +1,41 @@
# ==============================================================================
# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 - TurboQuant SM75
# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU)
# RAM usage: ~5.5 GB (21 layers pinned via mlock)
#
# Benchmark results (turbo2 KV, ngl=11, fa=1), tg t/s per thread count:
#   t=1→0.86  t=2→1.62  t=3→2.25  t=4→2.94  t=5→3.56  t=6→4.38 ← best
#   t=8→4.22  t=12→3.61 (hyperthreading hurts above 6)
# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
# Achieved: 4.38 t/s = 86% efficiency
#
# Download:
#   huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
#     Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
# ==============================================================================

MODEL_FILE=Qwen3.5-9B.Q8_0.gguf

# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
N_GPU_LAYERS=11

# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
CTX_SIZE=32768

# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=128

# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2

PARALLEL=1

# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
EXTRA_ARGS=--flash-attn on --no-mmap --mlock
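The ceiling and efficiency figures in the header above can be reproduced from the quoted numbers. This is only a worked version of the header's own arithmetic (RAM bandwidth divided by model size, then best measured rate divided by that ceiling); the variable names are illustrative:

```python
MODEL_GB = 8.86        # model size quoted in the header
RAM_BW_GB_S = 45.0     # RAM bandwidth quoted in the header

# Thread sweep from the header: thread count -> generated tokens per second.
SWEEP = {1: 0.86, 2: 1.62, 3: 2.25, 4: 2.94, 5: 3.56, 6: 4.38, 8: 4.22, 12: 3.61}

# Each generated token reads every weight once, so bandwidth / size bounds t/s.
ceiling = RAM_BW_GB_S / MODEL_GB

best_threads = max(SWEEP, key=SWEEP.get)
efficiency = SWEEP[best_threads] / ceiling

print(f"ceiling ~{ceiling:.1f} t/s, best t={best_threads}, efficiency ~{efficiency:.0%}")
```

The sweep peaks at the physical core count and drops beyond it, matching the `THREADS=6` setting in the file.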
42
envs/.env.smollm3-3b
Normal file
@@ -0,0 +1,42 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M - Hugging Face (2025)
# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio)
#   - 3B params, 11.2T training tokens
#   - 64K native context (128K with YaRN)
# Model size: ~1.9 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 6 languages,
#   Apache 2.0. AIME 2025: 36.7% in think mode.
#
# Download:
#   huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \
#     HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
#   ls models/SmolLM3* models/HuggingFaceTB_SmolLM3*
# ==============================================================================

MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf

# All layers fit comfortably - ~1.9 GB leaves ~1.8 GB free for KV + compute
N_GPU_LAYERS=99

# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=24576 (32K OOM). Baseline: 249 pp / 56.8 tg t/s.
# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%).
# Model context limit = 65536, VRAM is the constraint here.
CTX_SIZE=24576

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=q8_0
CACHE_TYPE_V=q8_0

# 2 parallel slots - less headroom at 24K ctx vs original 16K estimate
PARALLEL=2

# fa=1 gives small but consistent improvement (+1.5 tg t/s)
EXTRA_ARGS=--flash-attn on --mmap
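The flash-attention gains quoted in the base profiles can be checked against the baseline and fa=1 generation rates given in the same headers. A short sketch, assuming the "Baseline" figure in each header is the right reference point (the headers do not say whether it was measured at the same context length); `pct_gain` is a hypothetical helper:

```python
# (baseline tg t/s, tg t/s with --flash-attn on), as quoted in the env headers.
TG_RATES = {
    "gemma4-e2b": (64.6, 66.8),
    "qwen3-4b": (41.6, 44.3),
    "smollm3-3b": (56.8, 58.3),
}


def pct_gain(before, after):
    """Relative change from before to after, in percent."""
    return (after / before - 1.0) * 100.0


for model, (before, after) in TG_RATES.items():
    print(f"{model}: {after - before:+.1f} t/s ({pct_gain(before, after):+.1f}%)")
```

Expressing each gain both in t/s and in percent avoids mixing the two units when comparing models with very different baseline speeds.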
26
envs/.env.smollm3-3b-bigctx
Normal file
@@ -0,0 +1,26 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M - bigctx variant (KV in RAM via -nkvo / --no-kv-offload)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
#   +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2 t/s at 50% KV fill / 23.7 t/s at 25% fill (PCIe BW-bound).
# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
# Use this profile when you need >24K context; otherwise use smollm3-3b.
# ==============================================================================

MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf

N_GPU_LAYERS=99
CTX_SIZE=65536

THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=256

CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2

PARALLEL=1

EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
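The "est. ... (PCIe BW)" speeds in the three bigctx headers are consistent with a simple model: time per token is the empty-cache time plus the time to move the filled part of the KV cache across the bus once. The sketch below is an inference from the quoted numbers, not the method used by the benchmark scripts; in particular the 8 GB/s effective bandwidth is an assumption chosen because it reproduces the quoted estimates, and `est_tps` is a hypothetical helper.

```python
PCIE_BYTES_PER_S = 8e9  # assumed effective bandwidth; not stated in the source


def est_tps(baseline_tps, kv_mib, fill, bw=PCIE_BYTES_PER_S):
    """Estimated tokens/s when `fill` (0..1) of a kv_mib-sized cache is in use."""
    kv_bytes = kv_mib * 2**20 * fill
    seconds_per_token = 1.0 / baseline_tps + kv_bytes / bw
    return 1.0 / seconds_per_token


# name: (baseline t/s, KV MiB at full ctx, {fill: t/s quoted in the header})
QUOTED = {
    "gemma4-e2b-bigctx": (61.7, 651, {0.50: 17.0, 0.25: 26.6}),
    "gemma4-e4b-bigctx": (30.0, 346, {0.50: 17.8, 0.25: 22.4}),
    "smollm3-3b-bigctx": (53.1, 714, {0.50: 15.2, 0.25: 23.7}),
}

for name, (baseline, kv_mib, estimates) in QUOTED.items():
    for fill, quoted in estimates.items():
        modelled = est_tps(baseline, kv_mib, fill)
        print(f"{name} @ {fill:.0%} fill: modelled {modelled:.1f} t/s, quoted {quoted} t/s")
```

Under this assumption the modelled values stay within roughly 0.1 t/s of every quoted estimate, which also supports reading "@50%" and "@25%" as KV cache fill levels.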