Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
docs/ARCHITECTURE.md
@@ -0,0 +1,210 @@
# Architecture

Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.

---

## Docker Compose Architecture

### Image Strategy

Two custom images are built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp:

| Image | Target | Used by |
|---|---|---|
| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services |
| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services |

Both are built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`:
- SM75 = Turing architecture codepath (this GTX 16-series GPU has no tensor cores)
- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM
- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server

Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.**

### Compose Structure

```
compose.yaml
├── x-gpu    - NVIDIA runtime + capability passthrough (merged into all services)
├── x-hc     - Common healthcheck (curl /health, start_period overridden per service)
├── x-server - Merged into all server services:
│   ├── volumes: ./models:/models:ro
│   ├── ports: 8080:8080
│   ├── network alias: llama-current (all servers share this alias)
│   ├── entrypoint: llama-server with $$VAR shell expansion from env_file
│   └── restart: unless-stopped
└── x-bench  - Merged into all bench services:
    ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
    └── entrypoint: /bin/bash /scripts/benchmark.sh (overridable)
```

### Profile System

Docker Compose profiles select which model server to start. Only one model server should run at a time: there is a single GPU, and every server publishes host port 8080.

```
docker compose --profile <PROFILE> up -d
```

**Server profiles** (bring up `llama-server` on port 8080):

| Profile | Model | Image | VRAM | Strategy |
|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |

**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM):

| Profile | Model | KV type | CTX | ~t/s at 50% ctx fill |
|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 |

**Bench profiles** (one-shot benchmark containers):

| Profile | Service | Purpose |
|---|---|---|
| `bench-qwen35-9b` | bench-qwen35-9b | Qwen3.5-9B bench; also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (all model files are accessible from this container) |
| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench |
| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench |
| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench |
| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench |

**Add-on profile** (combine with any model):

| Profile | Service | Purpose |
|---|---|---|
| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` |

### Env File Architecture

Each model has a dedicated `envs/.env.<model>` file injected into the container. Shell variables are written as `$$VAR` in the compose command to escape Compose interpolation, so the container shell expands them at runtime.

```
envs/
├── .env.smollm3-3b          ← pure-GPU: q8_0 KV, ctx=24576
├── .env.smollm3-3b-bigctx   ← -nkvo: turbo2 KV, ctx=65536
├── .env.gemma4-e2b          ← pure-GPU: f16 KV, ctx=24576
├── .env.gemma4-e2b-bigctx   ← -nkvo: q4_0 KV, ctx=393216 (turbo2 worse for MQA)
├── .env.gemma4-e4b          ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
├── .env.gemma4-e4b-bigctx   ← -nkvo: turbo2 KV, ctx=163840, ngl=42
├── .env.qwen3-4b            ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
├── .env.qwen3-4b-bigctx     ← -nkvo: q4_0 KV, ctx=24576 (NO turbo2 ever)
└── .env.qwen35-9b           ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
```

Key env variables per file:

```bash
MODEL_FILE               # filename under /models/
N_GPU_LAYERS             # ngl: how many transformer layers are offloaded to GPU
CTX_SIZE                 # context window size
THREADS / THREADS_BATCH  # CPU threads for generation / for batch (prompt) processing
BATCH_SIZE / UBATCH_SIZE # logical / physical batch size
CACHE_TYPE_K/V           # KV quantization: f16 | q8_0 | q4_0 | turbo2
PARALLEL                 # number of concurrent request slots
EXTRA_ARGS               # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
```

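To make the env-to-command-line step concrete, here is a minimal Python sketch of what the container shell effectively produces after expanding the `$$VAR` references. The flags are standard llama.cpp options, but the pairing of each flag with these variable names is an assumption (the actual compose entrypoint is not shown here), and `example.gguf` is a placeholder filename.

```python
import shlex

# Hypothetical mapping from env-file keys to llama-server flags. The flags are
# standard llama.cpp options; pairing them with these variable names is assumed.
FLAG_FOR_VAR = {
    "N_GPU_LAYERS": "--n-gpu-layers",
    "CTX_SIZE": "--ctx-size",
    "THREADS": "--threads",
    "THREADS_BATCH": "--threads-batch",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
    "PARALLEL": "--parallel",
}


def build_server_argv(env: dict[str, str]) -> list[str]:
    """Assemble the llama-server argument list from env-file values."""
    argv = ["llama-server", "--model", f"/models/{env['MODEL_FILE']}"]
    for var, flag in FLAG_FOR_VAR.items():
        if var in env:  # variables missing from the env file are skipped
            argv += [flag, env[var]]
    # EXTRA_ARGS is split roughly the way a shell splits an unquoted expansion.
    argv += shlex.split(env.get("EXTRA_ARGS", ""))
    return argv


# Values taken from the gemma4-e4b-bigctx description; the filename is a placeholder.
example_env = {
    "MODEL_FILE": "example.gguf",
    "N_GPU_LAYERS": "42",
    "CTX_SIZE": "163840",
    "CACHE_TYPE_K": "turbo2",
    "CACHE_TYPE_V": "turbo2",
    "EXTRA_ARGS": "--flash-attn on --no-kv-offload",
}
print(" ".join(build_server_argv(example_env)))
```

Keeping `EXTRA_ARGS` as a free-form string means a bigctx profile can differ from its pure-GPU sibling by only a few env-file lines.
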
---

## Test Script Architecture

All test scripts run inside the `bench-qwen35-9b` container (it uses the `full` image, which includes all binaries), with all model files accessible via `/models/`.

### scripts/kv_quant_test.sh

**Purpose**: Determine the optimal KV quantization type for each model at various context sizes.
**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination and measures Δ vs the f16 baseline.
**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.

```
for each model:
    for each ctx in CTX_CANDIDATES:
        run f16 baseline → get PPL_baseline
        for each KV type in MODEL_KV_TYPES:
            run with that KV type → get PPL
            report Δ = PPL - PPL_baseline
```

**Outputs**:
- Pass/fail per (model, ctx, KV type) combination
- Recommendation: highest-quality KV type that stays within the quality gate at all tested ctx

**Known limitations**:
- `Qwen3.5-9B`: the hybrid linear-attention architecture is incompatible with `llama-perplexity`, so the test always fails. This is not a real model issue; the server works correctly.
- At very small ctx (< 4096), block-padding overhead inflates turbo2's apparent per-token cost.

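The gate logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script itself: the function names are hypothetical, and the sample data is the Qwen3-4B turbo2 series reported in docs/FINDINGS.md, with the f16 baselines at ctx=8192 and ctx=16384 back-computed from the reported Δ values.

```python
QUALITY_GATE = 0.5  # maximum acceptable PPL increase over the f16 baseline


def gate_results(ppl: dict[tuple[int, str], float]) -> dict[tuple[int, str], bool]:
    """Map (ctx, kv_type) to True when PPL - PPL_f16 stays below the gate."""
    results = {}
    for (ctx, kv_type), value in ppl.items():
        if kv_type == "f16":
            continue  # the baseline is not compared against itself
        delta = value - ppl[(ctx, "f16")]
        results[(ctx, kv_type)] = delta < QUALITY_GATE
    return results


def types_passing_everywhere(results: dict[tuple[int, str], bool]) -> list[str]:
    """KV types that pass the gate at every tested context size."""
    kv_types = {kv for _, kv in results}
    return sorted(
        kv for kv in kv_types
        if all(ok for (_, k), ok in results.items() if k == kv)
    )


# Qwen3-4B turbo2 figures; baselines at 8192 and 16384 are derived as PPL - Δ.
qwen3_4b_ppl = {
    (4096, "f16"): 1.76, (4096, "turbo2"): 1.79,
    (8192, "f16"): 1.8, (8192, "turbo2"): 4.2,
    (16384, "f16"): 1.7, (16384, "turbo2"): 15.4,
}
results = gate_results(qwen3_4b_ppl)
print(results)
print(types_passing_everywhere(results))
```

Requiring a pass at every tested ctx is what rules turbo2 out for Qwen3-4B even though it passes at ctx=4096.
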
### scripts/cpu_ctx_test.sh

**Purpose**: Find the maximum viable context size when using `-nkvo` (KV in host RAM), accounting for the PCIe bandwidth penalty.
**Method**: Two phases per (model, ctx, KV type), followed by a recommendation:

1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates the full KV cache at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.

2. **Speed estimation** (analytic bandwidth model, where `ctx` is the number of tokens currently held in the KV cache and `baseline` is the empty-context t/s):
   ```
   GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
     PCIe_BW  = 8 GB/s  (PCIe x4 Gen3 practical)

   RAM-bound models (qwen35-9b, ngl=11):
     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
     RAM_BW   = 45 GB/s (DDR4-2933)
   ```

3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`.

**kv_bytes_per_tok** is measured empirically as `KV_MiB_allocated / ctx_size` (converted to bytes) from the actual alloc run.

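As a worked check of the bandwidth model, the following Python sketch evaluates the formula for the SmolLM3-3B bigctx profile. It assumes that "8 GB/s" means 8e9 bytes/s and takes the baseline (53.1 t/s) and KV allocation (714 MiB at ctx=65536) from docs/FINDINGS.md; the function names are illustrative only.

```python
PCIE_BW = 8e9      # bytes/s, assuming "8 GB/s" means 8 * 10**9
MIB = 1024 ** 2


def kv_bytes_per_tok(kv_mib: float, ctx_size: int) -> float:
    """Empirical per-token KV cost: allocated KV size divided by context size."""
    return kv_mib * MIB / ctx_size


def tokens_per_second(baseline_tps: float, cached_tokens: int,
                      bytes_per_tok: float, bandwidth: float = PCIE_BW) -> float:
    """Per-token time = compute time at empty context + KV transfer time."""
    compute_ms = 1000.0 / baseline_tps
    transfer_ms = cached_tokens * bytes_per_tok / bandwidth * 1000.0
    return 1000.0 / (compute_ms + transfer_ms)


# SmolLM3-3B bigctx: 53.1 t/s baseline, 714 MiB of turbo2 KV at ctx=65536.
ctx = 65536
per_tok = kv_bytes_per_tok(714, ctx)
print(round(tokens_per_second(53.1, ctx // 2, per_tok), 1))  # 15.2 at 50% fill
```

With these inputs the sketch reproduces the 15.2 t/s listed for `smollm3-3b-bigctx`; passing `bandwidth=45e9` would model the RAM-bound case instead.
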
**KV types tested per model**:

| Model | KV types | Reason |
|---|---|---|
| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
| qwen35-9b | q4_0 only | OOMs regardless (skipped) |

### scripts/benchmark.sh

Default entrypoint for bench containers. Runs a `llama-bench` sweep over prompt/generation lengths and thread counts and writes CSV output to `/results/`.

### scripts/quality_test.sh

Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.

---

## Data Flow

```
Model GGUF files (./models/)
        │
        ▼
Docker container (/models/ read-only bind mount)
        │
        ├─── llama-server ──► OpenAI-compatible API on :8080
        │         │
        │    env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
        │                     CACHE_TYPE_K/V, EXTRA_ARGS, ...
        │
        └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
                  │
             test scripts (scripts/ read-only bind mount)
```

## Port / Network Layout

```
Host:8080 ──► llama_server container:8080
Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)

llama-net (bridge):
  llama-current - alias shared by ALL server services; only one runs at a time
```
docs/FINDINGS.md
@@ -0,0 +1,158 @@
# Benchmarking Findings

Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.
All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `-DGGML_CUDA_FORCE_MMQ=ON`).
Date: 2026-05-05 / 2026-05-06.

---

## 1. FORCE_MMQ: Free +6–11% on Turing GPUs

**Finding**: On Turing GPUs without tensor cores (the GTX 16 series: GTX 1650, 1660, etc.), the default build sends matrix multiplications through cuBLAS GEMM, which is slower than the hand-written MMQ (quantized matrix-multiply) kernels. Compiling with `-DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally.

| Model | Standard image t/s | TurboQuant t/s | Gain |
|---|---|---|---|
| SmolLM3-3B | ~49.9 | 53.1 | +6.2% |
| Gemma4-E2B | ~55.7 | 61.7 | +10.7% |
| Gemma4-E4B | ~27.0 | 30.0 | +11.4% |
| Qwen3-4B | ~36.7 | 38.8 | +5.7% |

**Caution**: On Ampere/Ada/Hopper (e.g. RTX 30/40 series), the tensor-core path is faster, so `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only.

---

## 2. KV Quantization: turbo2 is the sweet spot

**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss on three of the four models tested.

**Quality gate results** (perplexity delta vs f16 baseline, gate: Δ < 0.5):

| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B |
|---|---|---|---|---|
| q8_0 | ✓ | ✓ | ✓ | ✓ |
| q4_0 | ✓ | ✓ | ✓ | ✓ |
| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** |

### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B

Qwen3-4B uses GQA with 8 KV heads (32 query heads) and has the largest KV cache of the tested models (~40 KB/token at q4_0). At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:

```
ctx=4096   turbo2: PPL=1.79  (baseline 1.76, Δ=0.03 ✓)
ctx=8192   turbo2: PPL=4.2   (Δ=2.4 ✗)
ctx=16384  turbo2: PPL=15.4  (Δ=13.7 ✗)
ctx=32768  turbo2: PPL=438   (broken)
```

**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0.

---

## 3. MQA Architecture: Gemma4 E2B/E4B KV is tiny

**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers: only 1 KV head is maintained instead of the several KV heads of a GQA model. This results in a dramatically smaller KV cache:

| Model | KV bytes/token (q4_0) | Architecture |
|---|---|---|
| SmolLM3-3B | ~19.8 KB | GQA |
| Qwen3-4B | ~39.6 KB | GQA (8 KV heads) |
| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) |
| Gemma4-E2B | ~1.7 KB | MQA (35 layers) |

**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM. E4B can hold 163K tokens with 346 MiB RAM.

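For intuition about where these per-token figures come from, here is a small Python sketch of the usual KV-size arithmetic: K and V each store `n_layers × n_kv_heads × head_dim` elements per token. The bits-per-element values follow the ggml block layouts (q4_0 packs 32 elements into 18 bytes, q8_0 into 34 bytes). The Qwen3-4B shape used below (36 layers, 8 KV heads, head dimension 128) is an assumption taken from the public model configuration, not from this document; it is used because it lands close to the ~39.6 KB measured above.

```python
# Effective bits per element, including per-block scales (ggml block formats).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       kv_type: str) -> float:
    """Theoretical KV cache bytes per token for a uniform attention stack."""
    elements = 2 * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    return elements * BITS_PER_ELEMENT[kv_type] / 8


# Assumed Qwen3-4B shape: 36 layers, 8 KV heads, head_dim 128.
q4 = kv_bytes_per_token(36, 8, 128, "q4_0")
print(q4, round(q4 / 1024, 1))  # 41472.0 bytes, i.e. 40.5 KiB per token
```

The same function with `n_kv_heads=1` shows why an MQA layer costs a small fraction of a GQA layer; the Gemma4 shapes are not given here, so they are left out.
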
### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact)

turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. For E2B:

```
ctx=32768  q4_0: 57 MiB KV   turbo2: 68 MiB KV  (+19% worse!)
```

**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0.

---

## 4. -nkvo (KV in RAM): Massive Context Gain at PCIe Cost

**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The trade-off is token generation speed: generating each token requires reading the populated KV cache over PCIe.

**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)`, where `gpu_ms_empty` is the per-token time at empty context (1000 / baseline t/s), `ctx` is the number of cached tokens, and `pcie_bw_bps` is the PCIe bandwidth in bytes per second.

Effective PCIe bandwidth ≈ **8 GB/s** (obtained by fitting the bandwidth model to actual results). Note that a PCIe Gen3 x4 link tops out at about 3.9 GB/s in theory, so the fitted figure suggests the effective link is wider than x4.

### Context gains with -nkvo (v4, TurboQuant):

| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used |
|---|---|---|---|---|
| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 |
| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) |
| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 |
| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 |

Recommendation threshold: ≥ 15 t/s at 50% context fill.

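The recommendation threshold can also be read in reverse: given a baseline speed and a per-token KV cost, what is the largest context that still reaches 15 t/s at 50% fill? The Python sketch below solves the bandwidth model for `ctx`. It assumes 8 GB/s means 8e9 bytes/s, and the candidate list is simply the set of ctx values that appear in this document's tables, since the script's real candidate list is not shown.

```python
def max_ctx_for_target(baseline_tps: float, bytes_per_tok: float,
                       target_tps: float = 15.0, fill: float = 0.5,
                       bandwidth: float = 8e9) -> int:
    """Largest ctx whose t/s at the given fill fraction still meets target_tps."""
    budget_s = 1.0 / target_tps - 1.0 / baseline_tps  # time left for KV transfer
    if budget_s <= 0:
        return 0  # the baseline itself is already slower than the target
    return int(budget_s * bandwidth / (bytes_per_tok * fill))


# Assumed candidate list: ctx values that appear in the tables of this document.
CANDIDATES = [16384, 24576, 32768, 65536, 98304, 163840, 393216]

# SmolLM3-3B with turbo2: 53.1 t/s baseline, 714 MiB of KV at ctx=65536.
limit = max_ctx_for_target(53.1, 714 * 1024**2 / 65536)
recommended = max(c for c in CANDIDATES if c <= limit)
print(limit, recommended)
```

For SmolLM3-3B the limit comes out at roughly 67K tokens, so 65536 is the largest candidate that fits, in line with the table above.
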
### ⚠️ Qwen3.5-9B cannot use -nkvo

Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.

---

## 5. Qwen3.5-9B: RAM-bound, llama-perplexity incompatible

**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not an OOM; the evaluation tool simply can't handle the architecture). The server works correctly.

**Performance ceiling**: Each generated token has to stream the model weights from memory once, so the theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. Generation is almost entirely RAM-bandwidth-limited.

**Thread optimization** (i7-10750H, 6 physical / 12 logical):
- Optimal: `THREADS=6` (one per physical core)
- Hyper-threading hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)

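The performance ceiling quoted above is a one-line calculation; the Python sketch below spells it out and adds a sanity check on the 45 GB/s figure. The dual-channel configuration used for the peak-bandwidth estimate is an assumption, since the memory layout is not stated here.

```python
def ram_bound_ceiling_tps(ram_bw_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on t/s when every token streams all weights from RAM once."""
    return ram_bw_gb_s / model_size_gb


ceiling = ram_bound_ceiling_tps(45.0, 8.86)
efficiency = 4.38 / ceiling  # measured t/s relative to the ceiling
print(f"{ceiling:.2f} t/s ceiling, {efficiency:.0%} efficiency")

# Theoretical DDR4-2933 peak, assuming dual-channel (2 channels x 8 bytes per transfer).
ddr4_peak_gb_s = 2 * 8 * 2933e6 / 1e9
print(f"{ddr4_peak_gb_s:.1f} GB/s theoretical peak")
```

The script prints a ceiling of 5.08 t/s at 86% efficiency, and a theoretical peak of 46.9 GB/s, which is consistent with using 45 GB/s as the practical bandwidth.
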
---

## 6. Gemma4-E4B: all layers fit when VRAM is free

**Surprise**: E4B's Q4_K_M file is 4.7 GB, which is larger than the 3.7 GB of VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.

ngl sweep result:
```
ngl=28 →  59 pp / 16.5 tg t/s
ngl=35 → 101 pp / 24.6 tg t/s
ngl=42 → 133 pp / 32.0 tg t/s   ← all layers, much faster
```

**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.

---

## 7. Flash Attention (+2–3% pp, required for bigctx)

`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOMs otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~2–3% pp, ~1% tg). Always enable it.

---

## 8. Benchmarking pitfalls

### False OOM from prefill timeout
Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as an OOM. Fix: use a 64-line "tiny" file for alloc checks. The model still allocates the full KV cache at startup, then exits after trivial compute (< 15s).

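One way to avoid this class of misreading is to classify each run explicitly instead of treating every failure as an OOM. The Python sketch below is a hypothetical helper, not the logic of the actual scripts: the function name, the timeout value, and the stderr substrings used to detect an OOM are all assumptions and would need to be matched against real llama.cpp logs.

```python
import subprocess
import sys

# Assumed substrings that indicate an allocation failure; adjust to real logs.
OOM_MARKERS = ("out of memory", "failed to allocate")


def classify_alloc_run(argv: list[str], timeout_s: float = 60.0) -> str:
    """Return 'ok', 'oom', 'timeout', or 'error' for one alloc-check command."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"  # slow prefill is not evidence of an OOM
    if proc.returncode == 0:
        return "ok"
    stderr = proc.stderr.lower()
    return "oom" if any(marker in stderr for marker in OOM_MARKERS) else "error"


# Stand-in commands so the sketch runs without llama.cpp installed.
print(classify_alloc_run([sys.executable, "-c", "pass"]))
print(classify_alloc_run([sys.executable, "-c", "import time; time.sleep(5)"],
                         timeout_s=0.5))
```

In real use, `argv` would be the `llama-perplexity` invocation for the tiny input file; keeping `timeout` separate from `oom` makes a slow run visible as such in the results.
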
### kv/tok measurement anomalies
The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.

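The effect is visible in the E2B q4_0 figures already reported in this document (57 MiB at ctx=32768, 651 MiB at ctx=393216). A short Python sketch of the `kv_mib / ctx` calculation, with a hypothetical function name:

```python
MIB = 1024 ** 2


def measured_kv_bytes_per_tok(kv_mib: float, ctx: int) -> float:
    """Per-token KV cost as derived from an alloc run: allocated bytes / ctx."""
    return kv_mib * MIB / ctx


small = measured_kv_bytes_per_tok(57, 32768)    # E2B q4_0 at ctx=32768
large = measured_kv_bytes_per_tok(651, 393216)  # E2B q4_0 at ctx=393216
print(round(small), round(large))  # 1824 1736
```

The smaller context reports about 5% more bytes per token than the larger one, which is why calibration should use the large-ctx measurements.
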
---

## Summary: Recommended configurations

| Model | Profile | KV type | CTX | t/s (base, or at 50% fill) | Notes |
|---|---|---|---|---|---|
| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx |
| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM |
| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV |
| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse |
| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers |
| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM |
| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever |
| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken |
| Qwen3.5-9B | mixed (ngl=11) | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx |