Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,174 @@
+# llama-cpp-docker
+
+Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).  
+Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
+
+---
+
+## What this is
+
+A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:
+
+- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
+- **TurboQuant image** — custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
+- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
+- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
+- **Open WebUI** — optional web UI, profile-composable with any model
+
+> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.  
+> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.
+
+---
+
+## Quick start
+
+### 1. Build the TurboQuant image (once, ~20 min)
+
+```bash
+docker compose --profile qwen35-9b build llama-qwen35-9b
+```
+
+This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.
+
+### 2. Download models
+
+```bash
+bash scripts/download_models.sh
+```
+
+Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).  
+To download individual models:
+
+```bash
+bash scripts/download_models.sh smollm3
+bash scripts/download_models.sh qwen35-9b
+# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
+```
+
+### 3. Start a model
+
+```bash
+# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
+docker compose --profile smollm3-3b up -d
+
+# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
+docker compose --profile gemma4-e2b up -d
+
+# Add Open WebUI to any running model
+docker compose --profile gemma4-e2b --profile webui up -d
+```
+
+API is available at **http://localhost:8080** (OpenAI-compatible).  
+WebUI at **http://localhost:3000**.
+
+---
+
+## Models
+
+| Profile | Model | Size | t/s | CTX | Highlights |
+|---|---|---|---|---|---|
+| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
+| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
+| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
+| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
+| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
+
+### Big context profiles (KV in RAM via `-nkvo`)
+
+Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
+
+| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
+|---|---|---|---|---|---|
+| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
+| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
+| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
+| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
+
+```bash
+docker compose --profile gemma4-e2b-bigctx up -d
+```
+
+---
+
+## Running benchmarks
+
+One-shot — results written to `benchmark-results/`:
+
+```bash
+# Standard llama-bench sweep
+docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
+
+# KV quantization quality test (all models)
+docker compose --profile bench-qwen35-9b run --rm -T \
+  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
+
+# Context size test with bandwidth model (all models)
+docker compose --profile bench-qwen35-9b run --rm -T \
+  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
+
+# Ad-hoc llama-bench
+docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
+  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
+```
+
+---
+
+## Project structure
+
+```
+compose.yaml             — All services, profiles, YAML anchors
+envs/
+  .env.<model>           — Pure-GPU tuned params per model
+  .env.<model>-bigctx    — -nkvo KV-in-RAM params
+scripts/
+  download_models.sh     — huggingface-cli download helper
+  benchmark.sh           — Default bench entrypoint (llama-bench sweep)
+  kv_quant_test.sh       — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
+  cpu_ctx_test.sh        — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
+  quality_test.sh        — Early generation quality test (superseded by kv_quant_test.sh)
+docs/
+  FINDINGS.md            — What we learned, surprises, and what to watch out for
+  ARCHITECTURE.md        — Compose and test script architecture in detail
+models/                  — GGUF model files (gitignored, downloaded separately)
+benchmark-results/       — Test output logs and CSVs (gitignored)
+```
+
+---
+
+## Key findings
+
+> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).
+
+**FORCE_MMQ gives free +6–11% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
+
+**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
+
+**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
+
+**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
+
+**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
+
+**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
+
+---
+
+## Requirements
+
+- Docker + NVIDIA Container Toolkit
+- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
+- `huggingface-cli` for model downloads: `pip install huggingface_hub`
+- ~25 GB disk for all models (download selectively as needed)
+
+---
+
+## Tuning for different hardware
+
+Edit `envs/.env.<model>` files. Key parameters:
+
+- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
+- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
+- `CACHE_TYPE_K/V` — `f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
+- `THREADS` — match physical core count (HT hurts for RAM-bound models)
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.