Files
llama-cpp/docs/FINDINGS.md
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00

159 lines
6.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Benchmarking Findings
Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.
All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `DGGML_CUDA_FORCE_MMQ=ON`).
Date: 2026-05-05 / 2026-05-06.
---
## 1. FORCE_MMQ — Free +611% on Turing GPUs
**Finding**: GPUs without tensor cores (Turing = RTX 1650, 1660, 2060 etc.) run the GEMM path through cuBLAS GEMM, which is slower than the hand-written MMQ (matrix-multiply quantized) kernel. Compiling with `DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally.
| Model | Standard image t/s | TurboQuant t/s | Gain |
|---|---|---|---|
| SmolLM3-3B | ~49.9 | 53.1 | +6.2% |
| Gemma4-E2B | ~55.7 | 61.7 | +10.7% |
| Gemma4-E4B | ~27.0 | 30.0 | +11.4% |
| Qwen3-4B | ~36.7 | 38.8 | +5.7% |
**Caution**: On Ampere/Ada/Hopper (RTX 3000+/4000+), tensor cores are faster. `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only.
---
## 2. KV Quantization — turbo2 is the best sweet spot
**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss.
**Perplexity delta vs f16 baseline** (quality gate: Δ < 0.5):
| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B |
|---|---|---|---|---|
| q8_0 | ✓ | ✓ | ✓ | ✓ |
| q4_0 | ✓ | ✓ | ✓ | ✓ |
| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** |
### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B
Qwen3-4B uses full GQA (32 KV heads, 40 KB/token). At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:
```
ctx=4096 turbo2: PPL=1.79 (baseline 1.76, Δ=0.03 ✓)
ctx=8192 turbo2: PPL=4.2 (Δ=2.4 ✗)
ctx=16384 turbo2: PPL=15.4 (Δ=13.7 ✗)
ctx=32768 turbo2: PPL=438 (broken)
```
**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0.
---
## 3. MQA Architecture — Gemma4 E2B/E4B KV is tiny
**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers — only 1 KV head is maintained per token instead of full GQA. This results in dramatically smaller KV cache:
| Model | KV bytes/token (q4_0) | Architecture |
|---|---|---|
| SmolLM3-3B | ~19.8 KB | GQA |
| Qwen3-4B | ~39.6 KB | full GQA |
| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) |
| Gemma4-E2B | ~1.7 KB | MQA (35 layers) |
**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM. E4B can hold 163K tokens with 346 MiB RAM.
### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact)
turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. At E2B:
```
ctx=32768 q4_0: 57 MiB KV turbo2: 68 MiB KV (+19% worse!)
```
**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0.
---
## 4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost
**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The tradeoff is token generation speed — each token generation requires reading the full KV cache over PCIe x4.
**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)`
PCIe x4 Gen3 ≈ **8 GB/s** practical (measured from BW model fit to actual results).
### Context gains with -nkvo (v4, TurboQuant):
| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used |
|---|---|---|---|---|
| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 |
| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) |
| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 |
| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 |
Recommendation threshold: ≥ 15 t/s at 50% context fill.
### ⚠️ Qwen3.5-9B cannot use -nkvo
Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.
---
## 5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible
**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly.
**Performance ceiling**: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited.
**Thread optimization** (i7-10750H, 6 physical / 12 logical):
- Optimal: `THREADS=6` (one per physical core)
- HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)
---
## 6. Gemma4-E4B — all layers fit when VRAM is free
**Surprise**: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.
ngl sweep result:
```
ngl=28 → 59 pp / 16.5 tg t/s
ngl=35 → 101 pp / 24.6 tg t/s
ngl=42 → 133 pp / 32.0 tg t/s ← all layers, much faster
```
**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.
---
## 7. Flash Attention (+23% pp, required for bigctx)
`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~23% pp, ~1% tg). Always enable it.
---
## 8. Benchmarking pitfalls
### False OOM from prefill timeout
Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s).
### kv/tok measurement anomalies
The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.
---
## Summary: Recommended configurations
| Model | Profile | KV type | CTX | t/s@base | Notes |
|---|---|---|---|---|---|
| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx |
| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM |
| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV |
| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse |
| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers |
| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM |
| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever |
| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken |
| Qwen3.5-9B | pure-GPU | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx |