Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
This commit is contained in:
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions

41
envs/.env.qwen35-9b Normal file
View File

@@ -0,0 +1,41 @@
# ==============================================================================
# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU)
# RAM usage: ~5.5 GB (21 layers pinned via mlock)
#
# Benchmark results (turbo2 KV, ngl=11, fa=1):
# t=1→0.86 t=2→1.62 t=3→2.25 t=4→2.94 t=5→3.56 t=6→4.38 ← best
# t=8→4.22 t=12→3.61 (hyperthreading hurts above 6)
# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
# Achieved: 4.38 t/s = 86% efficiency
#
# Download:
# huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
# Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
# ==============================================================================
MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
N_GPU_LAYERS=11
# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
CTX_SIZE=32768
# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
EXTRA_ARGS=--flash-attn on --no-mmap --mlock