Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,210 @@
+# Architecture
+
+Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.
+
+---
+
+## Docker Compose Architecture
+
+### Image Strategy
+
+Two custom images built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp:
+
+| Image | Target | Used by |
+|---|---|---|
+| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services |
+| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services |
+
+Both built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`:
+- SM75 = Turing (compute capability 7.5) codepath; GTX 16-series Turing parts such as this one have no tensor cores
+- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM
+- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server
+
+Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.**
+
+### Compose Structure
+
+```
+compose.yaml
+├── x-gpu           — NVIDIA runtime + capability passthrough (merged into all services)
+├── x-hc            — Common healthcheck (curl /health, start_period overridden per service)
+├── x-server        — Merged into all server services:
+│   ├── volumes: ./models:/models:ro
+│   ├── ports: 8080:8080
+│   ├── network alias: llama-current (all servers share this alias)
+│   ├── entrypoint: llama-server with $$VAR shell expansion from env_file
+│   └── restart: unless-stopped
+└── x-bench         — Merged into all bench services:
+    ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
+    └── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)
+```
+
+### Profile System
+
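+Each model server sits behind its own Docker Compose profile. Compose does not enforce exclusivity between profiles, so bring up only one model profile at a time: there is a single GPU, and every server binds host port 8080.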
+
+```
+docker compose --profile <PROFILE> up -d
+```
+
+**Server profiles** (bring up `llama-server` on port 8080):
+
+| Profile | Model | Image | VRAM | Strategy |
+|---|---|---|---|---|
+| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
+| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
+| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
+| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
+| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |
+
+**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM):
+
+| Profile | Model | KV type | CTX | ~t/s@50% ctx |
+|---|---|---|---|---|
+| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 |
+| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 |
+| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 |
+| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 |
+
+**Bench profiles** (one-shot benchmark containers):
+
+| Profile | Service | Purpose |
+|---|---|---|
+| `bench-qwen35-9b` | bench-qwen35-9b | Qwen3.5-9B bench; also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (every model file is mounted under `/models/`) |
+| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench |
+| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench |
+| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench |
+| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench |
+
+**Add-on profile** (combine with any model):
+
+| Profile | Service | Purpose |
+|---|---|---|
+| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` |
+
+### Env File Architecture
+
+Each model has a dedicated `envs/.env.<model>` file injected into the container. Shell variables use `$$VAR` in the compose command to escape compose interpolation — the container shell expands them at runtime.
+
+```
+envs/
+├── .env.smollm3-3b         ← pure-GPU: q8_0 KV, ctx=24576
+├── .env.smollm3-3b-bigctx  ← -nkvo:    turbo2 KV, ctx=65536
+├── .env.gemma4-e2b         ← pure-GPU: f16 KV, ctx=24576
+├── .env.gemma4-e2b-bigctx  ← -nkvo:    q4_0 KV, ctx=393216 (turbo2 worse for MQA)
+├── .env.gemma4-e4b         ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
+├── .env.gemma4-e4b-bigctx  ← -nkvo:    turbo2 KV, ctx=163840, ngl=42
+├── .env.qwen3-4b           ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
+├── .env.qwen3-4b-bigctx    ← -nkvo:    q4_0 KV, ctx=24576 (NO turbo2 ever)
+└── .env.qwen35-9b          ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
+```
+
+Key env variables per file:
+
+```bash
+MODEL_FILE          # filename under /models/
+N_GPU_LAYERS        # ngl: how many transformer layers offloaded to GPU
+CTX_SIZE            # context window size
+THREADS / THREADS_BATCH
+BATCH_SIZE / UBATCH_SIZE
+CACHE_TYPE_K/V      # KV quantization: f16 | q8_0 | q4_0 | turbo2
+PARALLEL            # number of concurrent request slots
+EXTRA_ARGS          # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
+```
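+
+A minimal sketch of what one such file might look like, here for `envs/.env.qwen35-9b`. Only `CTX_SIZE`, `N_GPU_LAYERS`, the turbo2 KV type, and the use of mlock come from this document; the model filename, thread counts, batch sizes, slot count, and the exact flags in `EXTRA_ARGS` are illustrative assumptions, not the values used in the repository.
+
+```bash
+# Hypothetical envs/.env.qwen35-9b (illustrative, not the committed file)
+# Placeholder filename under /models/ (assumption)
+MODEL_FILE=Qwen3.5-9B-Q8_0.gguf
+# From this document: 11 layers offloaded, 32768-token context, turbo2 KV
+N_GPU_LAYERS=11
+CTX_SIZE=32768
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+# Assumed: one thread per physical core of the i7-10750H
+THREADS=6
+THREADS_BATCH=6
+# Assumed batch sizes and a single request slot
+BATCH_SIZE=512
+UBATCH_SIZE=512
+PARALLEL=1
+# Assumed flag set; --mlock pins the weights in RAM
+EXTRA_ARGS=--flash-attn on --mlock
+```
+
+Comments sit on their own lines so that no value picks up trailing text, and `EXTRA_ARGS` is left unquoted so the container shell can split it into separate arguments when it expands `$$EXTRA_ARGS`.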
+
+---
+
+## Test Script Architecture
+
+All test scripts run inside the `bench-qwen35-9b` container, which uses the `full` image with all binaries and has every model file accessible under `/models/`.
+
+### scripts/kv_quant_test.sh
+
+**Purpose**: Determine optimal KV quantization type for each model at various context sizes.  
+**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.  
+**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.
+
+```
+for each model:
+  for each ctx in CTX_CANDIDATES:
+    run f16 baseline → get PPL_baseline
+    for each KV type in MODEL_KV_TYPES:
+      run with that KV type → get PPL
+      report Δ = PPL - PPL_baseline
+```
+
+**Outputs**:
+- Pass/fail per (model, ctx, KV type) combination
+- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx
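+
+As a worked illustration of the gate, take a hypothetical f16 baseline perplexity of 9.80 at some context size, with 9.95 for q4_0 and 10.42 for turbo2. These numbers are invented for the example and are not results from this stack.
+
+```latex
+\begin{aligned}
+\Delta_{\text{q4\_0}}  &= 9.95 - 9.80  = 0.15 < 0.5    &&\Rightarrow \text{acceptable} \\
+\Delta_{\text{turbo2}} &= 10.42 - 9.80 = 0.62 \geq 0.5 &&\Rightarrow \text{degraded}
+\end{aligned}
+```
+
+Because Δ is defined as PPL minus the baseline, a positive value means the quantized cache scored worse than f16.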
+
+**Known limitations**:
+- `Qwen3.5-9B`: hybrid linear-attention architecture is incompatible with `llama-perplexity` → always fails. Not a real model issue; the server works correctly.
+- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.
+
+### scripts/cpu_ctx_test.sh
+
+**Purpose**: Find maximum viable context size when using `-nkvo` (KV in host RAM), accounting for PCIe bandwidth penalty.  
+**Method**: Two phases per (model, ctx, KV type), followed by a recommendation:
+
+1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.
+
+2. **Speed estimation** (analytic bandwidth model):
+   ```
+   GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
+     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
+     PCIe_BW = 8 GB/s  (assumed practical host-to-GPU rate; a Gen3 x4 link peaks near 3.9 GB/s, so 8 GB/s implies x8 or wider)
+
+   RAM-bound models (qwen35-9b, ngl=11):
+     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
+     RAM_BW = 45 GB/s  (DDR4-2933)
+   ```
+
+3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`.
+
+**kv_bytes_per_tok** is measured empirically: `KV_MiB_allocated × 1024² / ctx_size` from the actual alloc run (the MiB figure is converted to bytes so the units match the bandwidth term).
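+
+To make the units in the speed model concrete, here is a worked example with made-up inputs: a baseline of 30 t/s, 8192 bytes of KV per token, and a 65536-token context at 50% fill, taking `ctx` in the formula to mean the 32768 tokens actually filled (an assumption about how the script applies it). None of these inputs are measurements from this stack.
+
+```latex
+\begin{aligned}
+t_{\text{base}} &= \frac{1000}{30} \approx 33.33\ \text{ms/token} \\
+t_{\text{KV}}   &= \frac{32768 \times 8192\ \text{B}}{8 \times 10^{9}\ \text{B/s}} \times 1000 \approx 33.55\ \text{ms/token} \\
+\text{t/s}      &= \frac{1000}{33.33 + 33.55} \approx 14.95
+\end{aligned}
+```
+
+With these inputs the estimate lands just under the 15 t/s threshold, so this context size would narrowly miss the recommendation.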
+
+**KV types tested per model**:
+
+| Model | KV types | Reason |
+|---|---|---|
+| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
+| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
+| qwen35-9b | q4_0 only | OOMs regardless (skipped) |
+
+### scripts/benchmark.sh
+
+Default entrypoint for bench containers. Runs `llama-bench` sweep over prompt/generation lengths and thread counts, outputs CSV to `/results/`.
+
+### scripts/quality_test.sh
+
+Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.
+
+---
+
+## Data Flow
+
+```
+Model GGUF files (./models/)
+        │
+        ▼
+Docker container (/models/ read-only bind mount)
+        │
+        ├─── llama-server ──► OpenAI-compatible API on :8080
+        │         │
+        │    env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
+        │                     CACHE_TYPE_K/V, EXTRA_ARGS, ...
+        │
+        └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
+                  │
+             test scripts (scripts/ read-only bind mount)
+```
+
+## Port / Network Layout
+
+```
+Host:8080 ──► llama_server container:8080
+Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)
+
+llama-net (bridge):
+  llama-current  — alias shared by ALL server services; only one runs at a time
+```