llama-cpp/docs/FINDINGS.md

# Benchmarking Findings

Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.
All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `DGGML_CUDA_FORCE_MMQ=ON`).
Date: 2026-05-05 / 2026-05-06.

---

## 1. FORCE_MMQ — Free +6–11% on Turing GPUs

**Finding**: GPUs without tensor cores (Turing = RTX 1650, 1660, 2060 etc.) run the GEMM path through cuBLAS GEMM, which is slower than the hand-written MMQ (matrix-multiply quantized) kernel. Compiling with `DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally.

| Model | Standard image t/s | TurboQuant t/s | Gain |
|---|---|---|---|
| SmolLM3-3B | ~49.9 | 53.1 | +6.2% |
| Gemma4-E2B | ~55.7 | 61.7 | +10.7% |
| Gemma4-E4B | ~27.0 | 30.0 | +11.4% |
| Qwen3-4B | ~36.7 | 38.8 | +5.7% |

**Caution**: On Ampere/Ada/Hopper (RTX 3000+/4000+), tensor cores are faster. `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only.

---

## 2. KV Quantization — turbo2 is the best sweet spot

**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss.

**Perplexity delta vs f16 baseline** (quality gate: Δ < 0.5):

| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B |
|---|---|---|---|---|
| q8_0 | ✓ | ✓ | ✓ | ✓ |
| q4_0 | ✓ | ✓ | ✓ | ✓ |
| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** |

### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B

Qwen3-4B uses full GQA (32 KV heads, 40 KB/token). At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:

```
ctx=4096   turbo2: PPL=1.79  (baseline 1.76, Δ=0.03 ✓)
ctx=8192   turbo2: PPL=4.2   (Δ=2.4 ✗)
ctx=16384  turbo2: PPL=15.4  (Δ=13.7 ✗)
ctx=32768  turbo2: PPL=438   (broken)
```

**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0.

---

## 3. MQA Architecture — Gemma4 E2B/E4B KV is tiny

**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers — only 1 KV head is maintained per token instead of full GQA. This results in dramatically smaller KV cache:

| Model | KV bytes/token (q4_0) | Architecture |
|---|---|---|
| SmolLM3-3B | ~19.8 KB | GQA |
| Qwen3-4B | ~39.6 KB | full GQA |
| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) |
| Gemma4-E2B | ~1.7 KB | MQA (35 layers) |

**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM. E4B can hold 163K tokens with 346 MiB RAM.

### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact)

turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. At E2B:

```
ctx=32768  q4_0: 57 MiB KV   turbo2: 68 MiB KV  (+19% worse!)
```

**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0.

---

## 4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost

**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The tradeoff is token generation speed — each token generation requires reading the full KV cache over PCIe x4.

**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)`

PCIe x4 Gen3 ≈ **8 GB/s** practical (measured from BW model fit to actual results).

### Context gains with -nkvo (v4, TurboQuant):

| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used |
|---|---|---|---|---|
| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 |
| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) |
| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 |
| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 |

Recommendation threshold: ≥ 15 t/s at 50% context fill.

### ⚠️ Qwen3.5-9B cannot use -nkvo

Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.

---

## 5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible

**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly.

**Performance ceiling**: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited.

**Thread optimization** (i7-10750H, 6 physical / 12 logical):
- Optimal: `THREADS=6` (one per physical core)
- HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)

---

## 6. Gemma4-E4B — all layers fit when VRAM is free

**Surprise**: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.

ngl sweep result:
```
ngl=28  →  59 pp / 16.5 tg t/s
ngl=35  →  101 pp / 24.6 tg t/s
ngl=42  →  133 pp / 32.0 tg t/s  ← all layers, much faster
```

**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.

---

## 7. Flash Attention (+2–3% pp, required for bigctx)

`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~2–3% pp, ~1% tg). Always enable it.

---

## 8. Benchmarking pitfalls

### False OOM from prefill timeout
Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s).

### kv/tok measurement anomalies
The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.

---

## Summary: Recommended configurations

| Model | Profile | KV type | CTX | t/s@base | Notes |
|---|---|---|---|---|---|
| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx |
| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM |
| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV |
| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse |
| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers |
| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM |
| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever |
| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken |
| Qwen3.5-9B | pure-GPU | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx |