Files
llama-cpp/README.md
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00

6.8 KiB
Raw Blame History

llama-cpp-docker

Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.


What this is

A Docker Compose setup that runs multiple LLMs via llama.cpp, with:

  • Per-model env files — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
  • TurboQuant image — custom build with FORCE_MMQ (+611% free speed on Turing GPUs) and turbo2/3/4 KV quantization
  • Bigctx profiles-nkvo (KV in RAM) variants that multiply usable context by 216× at modest speed cost
  • Benchmark scripts — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
  • Open WebUI — optional web UI, profile-composable with any model

Hardware target: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
Parameters will work on any similar Turing GPU. See docs/FINDINGS.md before porting to other architectures.


Quick start

1. Build the TurboQuant image (once, ~20 min)

docker compose --profile qwen35-9b build llama-qwen35-9b

This builds both server-cuda-sm75-mmq and full-cuda-sm75-mmq tags used by all services.

2. Download models

bash scripts/download_models.sh

Downloads all five models to ./models/. Requires huggingface-cli (pip install huggingface_hub).
To download individual models:

bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all

3. Start a model

# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d

# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d

# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d

API is available at http://localhost:8080 (OpenAI-compatible).
WebUI at http://localhost:3000.


Models

Profile Model Size t/s CTX Highlights
qwen35-9b Qwen3.5-9B Q8_0 8.9 GB ~4.4 32K Reasoning distill, hybrid linear-attn
gemma4-e2b Gemma4-E2B Q4_K_M 2.9 GB ~62 24K Multimodal (image/audio/video), MQA
gemma4-e4b Gemma4-E4B Q4_K_M 4.7 GB ~30 24K Multimodal, larger, CPU-split
smollm3-3b SmolLM3-3B Q4_K_M 1.9 GB ~53 24K Thinking mode, tool calling, Apache 2.0
qwen3-4b Qwen3-4B Q4_K_M 2.4 GB ~39 16K Thinking mode, 119 languages, best ecosystem

Big context profiles (KV in RAM via -nkvo)

Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).

Profile Model KV type CTX ~t/s@50% fill RAM KV usage
smollm3-3b-bigctx SmolLM3-3B turbo2 65536 15.2 714 MiB
gemma4-e2b-bigctx Gemma4-E2B q4_0 393216 17.0 651 MiB
gemma4-e4b-bigctx Gemma4-E4B turbo2 163840 17.8 346 MiB
qwen3-4b-bigctx Qwen3-4B q4_0 24576 11.2 ~972 MiB
docker compose --profile gemma4-e2b-bigctx up -d

Running benchmarks

One-shot — results written to benchmark-results/:

# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b

# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b

# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b

# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'

Project structure

compose.yaml             — All services, profiles, YAML anchors
envs/
  .env.<model>           — Pure-GPU tuned params per model
  .env.<model>-bigctx    — -nkvo KV-in-RAM params
scripts/
  download_models.sh     — huggingface-cli download helper
  benchmark.sh           — Default bench entrypoint (llama-bench sweep)
  kv_quant_test.sh       — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
  cpu_ctx_test.sh        — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
  quality_test.sh        — Early generation quality test (superseded by kv_quant_test.sh)
docs/
  FINDINGS.md            — What we learned, surprises, and what to watch out for
  ARCHITECTURE.md        — Compose and test script architecture in detail
models/                  — GGUF model files (gitignored, downloaded separately)
benchmark-results/       — Test output logs and CSVs (gitignored)

Key findings

Full details in docs/FINDINGS.md.

FORCE_MMQ gives free +611% on Turing GPUs. GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.

turbo2 KV quantization breaks Qwen3-4B. At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.

turbo2 is paradoxically larger than q4_0 for Gemma4-E2B. MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.

Gemma4's MQA architecture enables extreme context. E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).

Qwen3.5-9B cannot use -nkvo. At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.

llama-perplexity is incompatible with Qwen3.5-9B. Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.


Requirements

  • Docker + NVIDIA Container Toolkit
  • NVIDIA GPU (SM75 for pre-built image; rebuild with different CUDA_DOCKER_ARCH for other architectures)
  • huggingface-cli for model downloads: pip install huggingface_hub
  • ~25 GB disk for all models (download selectively as needed)

Tuning for different hardware

Edit envs/.env.<model> files. Key parameters:

  • N_GPU_LAYERS — increase for more VRAM, decrease for CPU-split
  • CTX_SIZE — reduce if OOM, increase if VRAM headroom
  • CACHE_TYPE_K/Vf16 > q8_0 > q4_0 > turbo2 quality; reverse order for size
  • THREADS — match physical core count (HT hurts for RAM-bound models)

See docs/ARCHITECTURE.md for full parameter reference.