# llama-cpp-docker

Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing). Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.

---

## What this is

A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:

- **Per-model env files**: all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
- **TurboQuant image**: custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
- **Bigctx profiles**: `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
- **Benchmark scripts**: reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
- **Open WebUI**: optional web UI, profile-composable with any model

> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.

---

## Quick start

### 1. Build the TurboQuant image (once, ~20 min)

```bash
docker compose --profile qwen35-9b build llama-qwen35-9b
```

This builds both the `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.

### 2. Download models

```bash
bash scripts/download_models.sh
```

Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).

To download individual models:

```bash
bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
```

### 3. Start a model

```bash
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d

# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d

# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d
```

API is available at **http://localhost:8080** (OpenAI-compatible). WebUI at **http://localhost:3000**.

---

## Models

| Profile | Model | Size | t/s | CTX | Highlights |
|---|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |

### Big context profiles (KV in RAM via `-nkvo`)

Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).

| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|---|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |

```bash
docker compose --profile gemma4-e2b-bigctx up -d
```

---

## Running benchmarks

One-shot runs; results are written to `benchmark-results/`:

```bash
# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b

# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b

# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b

# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
```

---

## Project structure

```
compose.yaml            # All services, profiles, YAML anchors
envs/
  .env.                 # Pure-GPU tuned params per model
  .env.-bigctx          # -nkvo KV-in-RAM params
scripts/
  download_models.sh    # huggingface-cli download helper
  benchmark.sh          # Default bench entrypoint (llama-bench sweep)
  kv_quant_test.sh      # PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
  cpu_ctx_test.sh       # -nkvo alloc check + PCIe/RAM BW model → max viable ctx
  quality_test.sh       # Early generation quality test (superseded by kv_quant_test.sh)
docs/
  FINDINGS.md           # What we learned, surprises, and what to watch out for
  ARCHITECTURE.md       # Compose and test script architecture in detail
models/                 # GGUF model files (gitignored, downloaded separately)
benchmark-results/      # Test output logs and CSVs (gitignored)
```

---

## Key findings

> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).

**FORCE_MMQ gives free +6–11% on Turing GPUs.** Turing GPUs without tensor cores (GTX 1650, GTX 1660) are faster with the MMQ kernel than with cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs, where it would hurt performance.

**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.

**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** The MQA architecture produces tiny KV tensors, so block-quantization padding overhead makes turbo2 larger than q4_0. Use q4_0 for E2B bigctx.

**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and throughput at 50% fill is still about 17 t/s (down from 62 t/s in the pure-GPU profile).

**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM, so no bigctx profile is possible. The existing 32K config with turbo2 KV in VRAM is the ceiling.

**`llama-perplexity` is incompatible with Qwen3.5-9B.** The hybrid linear-attention architecture causes the PPL tool to fail. This is not a real model limitation: the server works correctly.
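
The Gemma4-E2B RAM figure above is simple arithmetic: per-token KV footprint times context length. A minimal sketch of that check, assuming "KB" here means 1024 bytes; `kv_ram_mib` is a hypothetical helper name and is not one of the repository's scripts:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/): estimate KV cache size in MiB
# from a context length in tokens and a per-token KV footprint in KiB.
kv_ram_mib() {
  local ctx_tokens="$1" kib_per_token="$2"
  # awk does the floating-point math that bash integer arithmetic cannot.
  awk -v ctx="$ctx_tokens" -v kib="$kib_per_token" \
    'BEGIN { printf "%.1f\n", ctx * kib / 1024 }'
}

# Gemma4-E2B bigctx: 393216 tokens at roughly 1.7 KiB per token
kv_ram_mib 393216 1.7   # prints 652.8
```

The result is within rounding of the 651 MiB reported for `gemma4-e2b-bigctx`, since 1.7 is itself a rounded per-token figure. The per-token value depends on the KV cache type, so use the number measured for the cache type you intend to run.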

---

## Requirements

- Docker + NVIDIA Container Toolkit
- NVIDIA GPU (SM75 for the pre-built image; rebuild with a different `CUDA_DOCKER_ARCH` for other architectures)
- `huggingface-cli` for model downloads: `pip install huggingface_hub`
- ~25 GB disk for all models (download selectively as needed)

---

## Tuning for different hardware

Edit `envs/.env.` files. Key parameters:

- `N_GPU_LAYERS`: increase if you have more VRAM, decrease for CPU-split
- `CTX_SIZE`: reduce on OOM, increase if there is VRAM headroom
- `CACHE_TYPE_K/V`: `f16` > `q8_0` > `q4_0` > `turbo2` for quality; generally the reverse order for size (Gemma4-E2B is an exception, see Key findings)
- `THREADS`: match the physical core count (HT hurts for RAM-bound models)

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full parameter reference.
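
The `THREADS` guidance above calls for the physical core count rather than the logical one. A minimal sketch for Linux hosts, assuming `lscpu` from util-linux is installed; the function name `physical_cores` is hypothetical, and the `nproc` fallback reports logical CPUs, so it overcounts when hyper-threading is enabled:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/): print the number of physical cores.
physical_cores() {
  local n=0
  if command -v lscpu >/dev/null 2>&1; then
    # Each unique (core, socket) pair is one physical core; '#' lines are headers.
    n="$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')"
  fi
  if [ "${n:-0}" -lt 1 ]; then
    # Fallback: logical CPU count (includes hyper-threads).
    n="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
  fi
  echo "$n"
}

echo "THREADS=$(physical_cores)"
```

Treat the printed value as a starting point and confirm it with a benchmark run before committing it to an env file.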