llama-cpp/docs/ARCHITECTURE.md
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00

# Architecture
Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.
---
## Docker Compose Architecture
### Image Strategy
Two custom images built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp:
| Image | Target | Used by |
|---|---|---|
| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services |
| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services |
Both built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`:
- SM75 = Turing codepath (compute capability 7.5); the GTX 1650 Ti has no tensor cores
- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM
- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server
Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.**
### Compose Structure
```
compose.yaml
├── x-gpu     - NVIDIA runtime + capability passthrough (merged into all services)
├── x-hc      - Common healthcheck (curl /health, start_period overridden per service)
├── x-server  - Merged into all server services:
│   ├── volumes: ./models:/models:ro
│   ├── ports: 8080:8080
│   ├── network alias: llama-current (all servers share this alias)
│   ├── entrypoint: llama-server with $$VAR shell expansion from env_file
│   └── restart: unless-stopped
└── x-bench   - Merged into all bench services:
    ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
    └── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)
```
### Profile System
Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU).
```
docker compose --profile <PROFILE> up -d
```
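Because every server profile publishes host port 8080 and shares the single GPU, switching models means stopping one profile before starting the next. A minimal sketch of such a switch follows; `switch_profile` and the `DRY_RUN` variable are hypothetical helpers for illustration and are not part of the repository.

```bash
# Hypothetical helper (not in the repo): stop one server profile, then start another.
# With DRY_RUN=1 the docker commands are printed instead of executed.
switch_profile() {
    local old="$1" new="$2" run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
    # Only start the new profile if the old one was stopped successfully.
    $run docker compose --profile "$old" down &&
        $run docker compose --profile "$new" up -d
}

# Preview the commands for moving from qwen3-4b to smollm3-3b.
DRY_RUN=1 switch_profile qwen3-4b smollm3-3b
```

Dropping `DRY_RUN=1` runs the two commands for real; the `webui` add-on profile would need to be listed separately if it should stay up.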
**Server profiles** (bring up `llama-server` on port 8080):
| Profile | Model | Image | VRAM | Strategy |
|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |
**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM):
| Profile | Model | KV type | CTX | ~t/s@50% ctx |
|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 |
**Bench profiles** (one-shot benchmark containers):
| Profile | Service | Purpose |
|---|---|---|
| `bench-qwen35-9b` | bench-qwen35-9b | Qwen3.5-9B bench; also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (every model's files are mounted under `/models/`) |
| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench |
| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench |
| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench |
| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench |
**Add-on profile** (combine with any model):
| Profile | Service | Purpose |
|---|---|---|
| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` |
### Env File Architecture
Each model has a dedicated `envs/.env.<model>` file injected into the container. Shell variables use `$$VAR` in the compose command to escape compose interpolation — the container shell expands them at runtime.
```
envs/
├── .env.smollm3-3b ← pure-GPU: q8_0 KV, ctx=24576
├── .env.smollm3-3b-bigctx ← -nkvo: turbo2 KV, ctx=65536
├── .env.gemma4-e2b ← pure-GPU: f16 KV, ctx=24576
├── .env.gemma4-e2b-bigctx ← -nkvo: q4_0 KV, ctx=393216 (turbo2 worse for MQA)
├── .env.gemma4-e4b ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
├── .env.gemma4-e4b-bigctx ← -nkvo: turbo2 KV, ctx=163840, ngl=42
├── .env.qwen3-4b ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
├── .env.qwen3-4b-bigctx ← -nkvo: q4_0 KV, ctx=24576 (NO turbo2 ever)
└── .env.qwen35-9b ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
```
Key env variables per file:
```bash
MODEL_FILE # filename under /models/
N_GPU_LAYERS # ngl: how many transformer layers offloaded to GPU
CTX_SIZE # context window size
THREADS / THREADS_BATCH
BATCH_SIZE / UBATCH_SIZE
CACHE_TYPE_K/V # KV quantization: f16 | q8_0 | q4_0 | turbo2
PARALLEL # number of concurrent request slots
EXTRA_ARGS # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
```
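To make the mapping from env file to command line concrete, here is a sketch of how the container shell might turn these variables into a `llama-server` invocation. The real entrypoint in `compose.yaml` is not reproduced in this document, so the flag layout is an assumption; the model filename, thread counts, batch sizes, and `PARALLEL` value are illustrative placeholders, while `N_GPU_LAYERS`, `CTX_SIZE`, and the KV type follow the `.env.gemma4-e4b` summary above. In `compose.yaml` each reference would be written `$$MODEL_FILE`, `$$CTX_SIZE`, and so on, so that Compose passes a literal `$MODEL_FILE` through for the container shell to expand.

```bash
# Values as they might appear in envs/.env.gemma4-e4b.
MODEL_FILE="gemma4-e4b-q4_k_m.gguf"   # assumed filename, not from the repo
N_GPU_LAYERS=42
CTX_SIZE=24576
THREADS=6                             # assumed
THREADS_BATCH=6                       # assumed
BATCH_SIZE=512                        # assumed
UBATCH_SIZE=512                       # assumed
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1                            # assumed
EXTRA_ARGS="--flash-attn on"

# Print the command line the container shell would see after expansion.
# EXTRA_ARGS is deliberately unquoted so it splits into separate arguments.
build_server_cmd() {
    echo llama-server \
        --model "/models/${MODEL_FILE}" \
        --n-gpu-layers "${N_GPU_LAYERS}" \
        --ctx-size "${CTX_SIZE}" \
        --threads "${THREADS}" \
        --threads-batch "${THREADS_BATCH}" \
        --batch-size "${BATCH_SIZE}" \
        --ubatch-size "${UBATCH_SIZE}" \
        --cache-type-k "${CACHE_TYPE_K}" \
        --cache-type-v "${CACHE_TYPE_V}" \
        --parallel "${PARALLEL}" \
        --host 0.0.0.0 --port 8080 \
        ${EXTRA_ARGS}
}

build_server_cmd
```

Printing the command instead of executing it is a convenient way to check an env file before bringing a profile up.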
---
## Test Script Architecture
All test scripts run inside the `bench-qwen35-9b` container, which uses the `full` image (all binaries) and has every model file available under `/models/`.
### scripts/kv_quant_test.sh
**Purpose**: Determine optimal KV quantization type for each model at various context sizes.
**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.
**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.
```
for each model:
    for each ctx in CTX_CANDIDATES:
        run f16 baseline → get PPL_baseline
        for each KV type in MODEL_KV_TYPES:
            run with that KV type → get PPL
            report Δ = PPL - PPL_baseline
```
**Outputs**:
- Pass/fail per (model, ctx, KV type) combination
- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx
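The pass/fail decision reduces to one comparison per run. The sketch below shows that step in isolation; `ppl_gate` is a hypothetical helper name, and the perplexity values in the example calls are made up for illustration rather than taken from any benchmark run.

```bash
# Classify one perplexity measurement against the f16 baseline.
# Usage: ppl_gate <ppl_baseline> <ppl> [gate]   (gate defaults to 0.5)
ppl_gate() {
    awk -v base="$1" -v ppl="$2" -v gate="${3:-0.5}" 'BEGIN {
        delta = ppl - base
        # Strictly below the gate is acceptable; at or above it is degraded.
        status = (delta < gate) ? "acceptable" : "degraded"
        printf "%.3f %s\n", delta, status
    }'
}

# Illustrative values only.
ppl_gate 12.40 12.61
ppl_gate 12.40 13.05
```

Each call prints the delta followed by its classification, which is enough to build the per-combination pass/fail table.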
**Known limitations**:
- `Qwen3.5-9B`: hybrid linear-attention architecture is incompatible with `llama-perplexity` → always fails. Not a real model issue; the server works correctly.
- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.
### scripts/cpu_ctx_test.sh
**Purpose**: Find maximum viable context size when using `-nkvo` (KV in host RAM), accounting for PCIe bandwidth penalty.
**Method**: Two phases per (model, ctx, KV type), followed by a recommendation:
1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.
2. **Speed estimation** (analytic bandwidth model):
```
GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
  t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
  PCIe_BW  = 8 GB/s (PCIe x4 Gen3 practical)

RAM-bound models (qwen35-9b, ngl=11):
  t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
  RAM_BW   = 45 GB/s (DDR4-2933)
```
3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`.
**kv_bytes_per_tok** measured empirically: `KV_MiB_allocated / ctx_size` from actual alloc run.
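The method above can be sketched in a few shell functions. This is an illustration under stated assumptions, not the contents of `cpu_ctx_test.sh`: the function names are hypothetical, 1 GB/s is taken as 10^9 bytes per second, the baseline speed and KV sizes in the example calls are invented, and the `llama-perplexity` command in the comment uses a placeholder model path and input file.

```bash
# Phase 1: alloc check. Runs a command under a time limit and reports the result.
# Example (placeholder paths):
#   alloc_check 60 llama-perplexity -m /models/<file>.gguf -f /tmp/short.txt \
#       -c 65536 -ctk q4_0 -ctv q4_0 -nkvo
alloc_check() {
    local limit="$1"; shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "alloc-ok"
    else
        echo "oom-or-timeout"
    fi
}

# kv_bytes_per_tok <kv_mib_allocated> <ctx_size>
kv_bytes_per_tok() {
    awk -v mib="$1" -v ctx="$2" 'BEGIN { printf "%.0f\n", mib * 1048576 / ctx }'
}

# Phase 2: analytic speed estimate.
# tps_at_fill <baseline_tps> <tokens_in_cache> <kv_bytes_per_tok> <bw_gb_per_s>
tps_at_fill() {
    awk -v base="$1" -v ctx="$2" -v kvb="$3" -v bw="$4" 'BEGIN {
        compute_ms  = 1000 / base                          # per-token compute time
        transfer_ms = ctx * kvb / (bw * 1000000000) * 1000 # KV read over the bus
        printf "%.1f\n", 1000 / (compute_ms + transfer_ms)
    }'
}

# Recommendation: largest ctx whose estimated speed at 50% fill meets MIN_TPS
# (default 15). The comparison uses the value rounded to one decimal place.
# recommend_ctx <baseline_tps> <kv_bytes_per_tok> <bw_gb_per_s> <ctx>...
recommend_ctx() {
    local base="$1" kvb="$2" bw="$3"; shift 3
    local best=0 ctx tps
    for ctx in "$@"; do
        tps=$(tps_at_fill "$base" "$((ctx / 2))" "$kvb" "$bw")
        if awk -v t="$tps" -v min="${MIN_TPS:-15}" 'BEGIN { exit !(t >= min) }'; then
            if [ "$ctx" -gt "$best" ]; then best="$ctx"; fi
        fi
    done
    echo "$best"
}

# Illustrative numbers only: 25 t/s baseline, 16000 bytes/token, 8 GB/s link.
kv_bytes_per_tok 1024 65536
tps_at_fill 25 12288 16000 8
recommend_ctx 25 16000 8 16384 24576 32768
```

For the RAM-bound case the same functions apply with the bandwidth argument set to the RAM figure instead of the PCIe one.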
**KV types tested per model**:
| Model | KV types | Reason |
|---|---|---|
| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
| qwen35-9b | q4_0 only | OOMs regardless (skipped) |
### scripts/benchmark.sh
Default entrypoint for bench containers. Runs `llama-bench` sweep over prompt/generation lengths and thread counts, outputs CSV to `/results/`.
### scripts/quality_test.sh
Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.
---
## Data Flow
```
Model GGUF files (./models/)
        │
        ▼
Docker container (/models/ read-only bind mount)
        ├─── llama-server ──► OpenAI-compatible API on :8080
        │         │
        │    env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
        │                     CACHE_TYPE_K/V, EXTRA_ARGS, ...
        └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
                  │
             test scripts (scripts/ read-only bind mount)
```
## Port / Network Layout
```
Host:8080 ──► llama_server container:8080
Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)
llama-net (bridge):
  llama-current: alias shared by ALL server services; only one runs at a time
```
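Since clients reach whichever server is active through the same address, a readiness probe only needs the base URL. The following is a minimal sketch of such a probe against the `/health` endpoint used by the common healthcheck; `wait_healthy` and its default retry values are hypothetical and not part of the repository.

```bash
# Hypothetical helper: poll the server's /health endpoint until it answers.
# Usage: wait_healthy <base_url> [tries] [delay_s]
wait_healthy() {
    local base="$1" tries="${2:-30}" delay="${3:-2}" i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors, e.g. while the model is still loading.
        if curl -sf --max-time 2 "$base/health" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# From the host:                        wait_healthy http://localhost:8080
# From another container on llama-net:  wait_healthy http://llama-current:8080
```

The function returns 0 as soon as the server reports healthy and 1 once the retries are exhausted, so it can gate a follow-up request in a script.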