
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design

2026-05-06 15:56:40 +02:00


Benchmarking Findings

Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.
All benchmarks: llama.cpp image local/llama-cpp-turboquant:*-cuda-sm75-mmq (TurboQuant fork, built with -DGGML_CUDA_FORCE_MMQ=ON).
Date: 2026-05-05 / 2026-05-06.

1. FORCE_MMQ — Free +6–11% on Turing GPUs

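Finding: On Turing GPUs without tensor cores (the GTX 16 series: GTX 1650, GTX 1660, etc.), the default build routes quantized matrix multiplication through cuBLAS GEMM, which is slower than the hand-written MMQ (matrix-multiply quantized) kernels. Compiling with -DGGML_CUDA_FORCE_MMQ=ON forces the MMQ path unconditionally. The RTX 20 series (e.g. RTX 2060) is also Turing but does have tensor cores; it was not benchmarked here.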

Model	Standard image t/s	TurboQuant t/s	Gain
SmolLM3-3B	~49.9	53.1	+6.2%
Gemma4-E2B	~55.7	61.7	+10.7%
Gemma4-E4B	~27.0	30.0	+11.4%
Qwen3-4B	~36.7	38.8	+5.7%

Caution: On GPUs with tensor cores (Ampere, Ada, Hopper; e.g. RTX 30/40 series), the tensor-core path is faster, so FORCE_MMQ is expected to hurt on those cards. This image is built for SM75 only.

2. KV Quantization — turbo2 is the best sweet spot

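Finding: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") on top of llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, and its perplexity loss stays inside the quality gate for three of the four models tested (Qwen3-4B is the exception, see below).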

Perplexity delta vs f16 baseline (quality gate: Δ < 0.5):

KV type	SmolLM3-3B	Gemma4-E2B	Gemma4-E4B	Qwen3-4B
q8_0	✓	✓	✓	✓
q4_0	✓	✓	✓	✓
turbo2	✓	✓	✓	✗ BROKEN
turbo3	✓	✓	✓	✗ BROKEN
turbo4	✓	✓	✓	✗ BROKEN

⚠️ Critical: turbo2/3/4 break Qwen3-4B

Qwen3-4B uses GQA with 8 KV heads (32 query heads), which costs about 40 KB of KV cache per token at q4_0. At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:

ctx=4096   turbo2: PPL=1.79  (baseline 1.76, Δ=0.03 ✓)
ctx=8192   turbo2: PPL=4.2   (Δ=2.4 ✗)
ctx=16384  turbo2: PPL=15.4  (Δ=13.7 ✗)
ctx=32768  turbo2: PPL=438   (broken)

Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
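
The quality gate can be applied mechanically to the deltas logged above. The following is a minimal sketch, not the logic of kv_quant_test.sh; the function name and data layout are illustrative assumptions, and the numbers are copied from the Qwen3-4B turbo2 log.

```python
# Quality gate used in this section: a KV type passes if (PPL - baseline PPL) < 0.5.
GATE = 0.5


def passes_gate(delta_ppl: float, gate: float = GATE) -> bool:
    """Return True when the perplexity delta stays below the gate."""
    return delta_ppl < gate


# ctx -> reported PPL delta for Qwen3-4B with turbo2 KV (from the log above).
# ctx=32768 is omitted because the log reports no delta for it, only "broken".
qwen3_4b_turbo2_delta = {4096: 0.03, 8192: 2.4, 16384: 13.7}

results = {ctx: passes_gate(d) for ctx, d in qwen3_4b_turbo2_delta.items()}
for ctx, ok in results.items():
    print(f"ctx={ctx:<6} turbo2 {'pass' if ok else 'FAIL'}")
```

Only ctx=4096 passes, which is why the table marks the whole turbo family as broken for this model instead of recommending it for short contexts only.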

3. MQA Architecture — Gemma4 E2B/E4B KV is tiny

Finding: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers: those layers keep a single KV head instead of one KV head per group of query heads as in GQA. The result is a dramatically smaller KV cache:

Model	KV bytes/token (q4_0)	Architecture
SmolLM3-3B	~19.8 KB	GQA
Qwen3-4B	~39.6 KB	full GQA
Gemma4-E4B	~4.5 KB	MQA-like (42 layers)
Gemma4-E2B	~1.7 KB	MQA (35 layers)

Implication: E2B can hold 393,216 tokens of q4_0 KV cache in only 651 MiB of RAM. E4B can hold 163,840 tokens of turbo2 KV cache in 346 MiB of RAM.
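
As a rough cross-check, the RAM figures follow from the per-token sizes in the table. This sketch assumes the table's "KB" means KiB and, for E4B, that turbo2 is exactly half of q4_0 (the text only says "roughly half"), so the E4B estimate is expected to be a little off.

```python
def kv_ram_mib(ctx_tokens: int, kib_per_token: float) -> float:
    """Estimate KV cache size in MiB for a given context length."""
    return ctx_tokens * kib_per_token / 1024


# Gemma4-E2B, q4_0: ~1.7 KiB/token from the table.
e2b_mib = kv_ram_mib(393216, 1.7)

# Gemma4-E4B, turbo2: assumed to be half of the ~4.5 KiB/token q4_0 figure.
e4b_mib = kv_ram_mib(163840, 4.5 / 2)

print(f"E2B @ 393216 tokens: {e2b_mib:.1f} MiB (reported: 651 MiB)")
print(f"E4B @ 163840 tokens: {e4b_mib:.1f} MiB (reported: 346 MiB)")
```

The E2B estimate lands within a few MiB of the reported value; the E4B estimate comes out slightly high, consistent with turbo2 being somewhat less than half of q4_0 for that model.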

⚠️ turbo2 is worse for E2B (MQA padding artifact)

turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header and padding overhead outweighs the savings from the lower bit width. Measured on E2B:

ctx=32768  q4_0: 57 MiB KV   turbo2: 68 MiB KV  (+19% worse!)

Do not use turbo2 for Gemma4-E2B bigctx. Use q4_0.

4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost

Finding: --no-kv-offload (-nkvo) keeps the KV cache in host RAM instead of VRAM, which leaves VRAM entirely free for model weights and compute buffers. The tradeoff is generation speed: every generated token requires reading the full KV cache over PCIe x4.

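Bandwidth model: t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000), where gpu_ms_empty is the per-token time in ms with an empty cache, ctx is the number of tokens currently in the cache, and pcie_bw_bps is the effective PCIe bandwidth in bytes per second.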

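PCIe x4 Gen3 ≈ 8 GB/s effective (obtained by fitting the bandwidth model to actual results). This is about double the ~3.9 GB/s theoretical peak of a PCIe 3.0 x4 link, so treat 8 GB/s as a fitted model parameter, not as a measured link speed.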
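
The bandwidth model is easy to evaluate directly. The sketch below is illustrative only: it assumes gpu_ms_empty can be approximated as 1000 divided by the pure-GPU t/s from section 1, derives kv_bytes_per_tok from the 651 MiB / 393216-token E2B figure, and takes 8 GB/s as 8e9 bytes per second.

```python
def predicted_tps(gpu_ms_empty: float, ctx_tokens: int,
                  kv_bytes_per_tok: float, pcie_bw_bytes_per_s: float = 8e9) -> float:
    """Tokens/s when the whole KV cache is read over PCIe for every token."""
    transfer_ms = ctx_tokens * kv_bytes_per_tok / pcie_bw_bytes_per_s * 1000
    return 1000 / (gpu_ms_empty + transfer_ms)


# Gemma4-E2B bigctx with q4_0 KV.
gpu_ms_empty = 1000 / 61.7                   # assumption: pure-GPU speed of 61.7 t/s
kv_bytes_per_tok = 651 * 1024 * 1024 / 393216
half_fill = 393216 // 2                      # the "50% context fill" operating point

tps = predicted_tps(gpu_ms_empty, half_fill, kv_bytes_per_tok)
print(f"E2B at 50% fill: {tps:.1f} t/s")
```

Under these assumptions the model predicts about 17 t/s, in line with the ~17 t/s at 50% fill listed for E2B bigctx in the summary table.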

Context gains with -nkvo (v4, TurboQuant):

Model	Pure-GPU ctx	-nkvo q4_0 recommended ctx	-nkvo turbo2 recommended ctx	KV type used
SmolLM3-3B	24576	32768	65536	turbo2
Gemma4-E2B	24576	393216	393216	q4_0 (turbo2 worse!)
Gemma4-E4B	24576	98304	163840	turbo2
Qwen3-4B	16384	24576	BROKEN	q4_0

Recommendation threshold: ≥ 15 t/s at 50% context fill.

⚠️ Qwen3.5-9B cannot use -nkvo

Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, -nkvo OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.

5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible

Finding: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause llama-perplexity to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly.

Performance ceiling: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = 5.1 t/s. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited.
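
The ceiling estimate is a one-line roofline calculation: if every generated token has to stream all weights from RAM once, bandwidth divided by model size bounds the token rate. A minimal sketch using the figures quoted above (the function name is illustrative):

```python
def bandwidth_ceiling_tps(ram_bw_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s when each token reads all weights once."""
    return ram_bw_gb_s / model_size_gb


ceiling = bandwidth_ceiling_tps(45.0, 8.86)   # 45 GB/s RAM, 8.86 GB Q8_0 model
efficiency = 4.38 / ceiling                   # 4.38 t/s achieved

print(f"ceiling: {ceiling:.1f} t/s, efficiency: {efficiency:.0%}")
```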

Thread optimization (i7-10750H, 6 physical / 12 logical):

Optimal: THREADS=6 (one per physical core)
HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)

6. Gemma4-E4B — all layers fit when VRAM is free

Surprise: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.

ngl sweep result:

ngl=28  →  59 pp / 16.5 tg t/s
ngl=35  →  101 pp / 24.6 tg t/s
ngl=42  →  133 pp / 32.0 tg t/s  ← all layers, much faster

Caution: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.

7. Flash Attention (+2–3% pp, required for bigctx)

--flash-attn on is required for -nkvo bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~2–3% pp, ~1% tg). Always enable it.

8. Benchmarking pitfalls

False OOM from prefill timeout

Early test scripts ran llama-perplexity on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s).
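
One way to avoid this misreading is to classify timeouts separately from allocation failures. The following is a hypothetical Python sketch, not the logic of the actual shell scripts; the OOM marker strings are assumptions and should be adjusted to whatever your llama.cpp build prints when an allocation fails.

```python
import subprocess
import sys

# Assumed substrings of an allocation-failure message; adjust to your build's output.
OOM_MARKERS = ("out of memory", "failed to allocate")


def classify_run(cmd: list[str], timeout_s: float) -> str:
    """Run cmd and return 'ok', 'oom', 'timeout', or 'error'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Slow prefill is not an allocation failure: report it as its own outcome.
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    stderr = proc.stderr.lower()
    if any(marker in stderr for marker in OOM_MARKERS):
        return "oom"
    return "error"


# Stand-in for a slow llama-perplexity run: a child process that just sleeps.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
print(classify_run(slow_cmd, timeout_s=0.5))
```

Combined with the tiny input file, this keeps the allocation check fast while still reporting a slow run as a timeout and not as OOM.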

kv/tok measurement anomalies

The kv_bytes_per_tok column in cpu_ctx_test.sh is computed as kv_mib / ctx. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.
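
The effect is visible in numbers already quoted in this document. The sketch below applies the same kv_mib / ctx calculation (converted to bytes) to two Gemma4-E2B q4_0 data points; it is an illustration, not the script's code.

```python
MIB = 1024 * 1024


def kv_bytes_per_tok(kv_mib: float, ctx: int) -> float:
    """Per-token KV cost in bytes, computed as total KV size / context length."""
    return kv_mib * MIB / ctx


# Gemma4-E2B, q4_0: (KV MiB, ctx) pairs from sections 3 and 4.
at_32k = kv_bytes_per_tok(57, 32768)
at_384k = kv_bytes_per_tok(651, 393216)

inflation = at_32k / at_384k - 1
print(f"ctx=32768:  {at_32k:.0f} B/token")
print(f"ctx=393216: {at_384k:.0f} B/token")
print(f"smaller ctx reads {inflation:.1%} higher")
```

Even between these two points the smaller context reads about 5% higher per token, so values from much smaller contexts should not be used to calibrate the bandwidth model.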

Summary: Recommended configurations

Model	Profile	KV type	CTX	t/s (bigctx: at 50% fill)	Notes
SmolLM3-3B	pure-GPU	q8_0	24576	~53	max VRAM ctx
SmolLM3-3B	bigctx	turbo2	65536	~15@50%	714 MiB RAM
Gemma4-E2B	pure-GPU	f16	24576	~62	MQA = tiny KV
Gemma4-E2B	bigctx	q4_0	393216	~17@50%	651 MiB RAM, turbo2 worse
Gemma4-E4B	pure-GPU	q4_0	24576	~30	ngl=42 all layers
Gemma4-E4B	bigctx	turbo2	163840	~18@50%	346 MiB RAM
Qwen3-4B	pure-GPU	q4_0	16384	~39	NO turbo KV ever
Qwen3-4B	bigctx	q4_0	24576	~11@50%	turbo2 broken
Qwen3.5-9B	pure-GPU	turbo2	32768	~4.4	RAM-bound, no bigctx
