Files

Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design

2026-05-06 15:56:40 +02:00

6.8 KiB

Raw Blame History

llama-cpp-docker

Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.

What this is

A Docker Compose setup that runs multiple LLMs via llama.cpp, with:

Per-model env files — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
TurboQuant image — custom build with FORCE_MMQ (+6–11% free speed on Turing GPUs) and turbo2/3/4 KV quantization
Bigctx profiles — -nkvo (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
Benchmark scripts — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
Open WebUI — optional web UI, profile-composable with any model

Hardware target: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
Parameters will work on any similar Turing GPU. See docs/FINDINGS.md before porting to other architectures.

Quick start

1. Build the TurboQuant image (once, ~20 min)

docker compose --profile qwen35-9b build llama-qwen35-9b

This builds both server-cuda-sm75-mmq and full-cuda-sm75-mmq tags used by all services.

2. Download models

bash scripts/download_models.sh

Downloads all five models to ./models/. Requires huggingface-cli (pip install huggingface_hub).
To download individual models:

bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all

3. Start a model

# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d

# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d

# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d

API is available at http://localhost:8080 (OpenAI-compatible).
WebUI at http://localhost:3000.

Models

Profile	Model	Size	t/s	CTX	Highlights
`qwen35-9b`	Qwen3.5-9B Q8_0	8.9 GB	~4.4	32K	Reasoning distill, hybrid linear-attn
`gemma4-e2b`	Gemma4-E2B Q4_K_M	2.9 GB	~62	24K	Multimodal (image/audio/video), MQA
`gemma4-e4b`	Gemma4-E4B Q4_K_M	4.7 GB	~30	24K	Multimodal, larger, CPU-split
`smollm3-3b`	SmolLM3-3B Q4_K_M	1.9 GB	~53	24K	Thinking mode, tool calling, Apache 2.0
`qwen3-4b`	Qwen3-4B Q4_K_M	2.4 GB	~39	16K	Thinking mode, 119 languages, best ecosystem

Big context profiles (KV in RAM via `-nkvo`)

Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).

Profile	Model	KV type	CTX	~t/s@50% fill	RAM KV usage
`smollm3-3b-bigctx`	SmolLM3-3B	turbo2	65536	15.2	714 MiB
`gemma4-e2b-bigctx`	Gemma4-E2B	q4_0	393216	17.0	651 MiB
`gemma4-e4b-bigctx`	Gemma4-E4B	turbo2	163840	17.8	346 MiB
`qwen3-4b-bigctx`	Qwen3-4B	q4_0	24576	11.2	~972 MiB

docker compose --profile gemma4-e2b-bigctx up -d

Running benchmarks

One-shot — results written to benchmark-results/:

# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b

# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b

# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b

# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'

Project structure

compose.yaml             — All services, profiles, YAML anchors
envs/
  .env.<model>           — Pure-GPU tuned params per model
  .env.<model>-bigctx    — -nkvo KV-in-RAM params
scripts/
  download_models.sh     — huggingface-cli download helper
  benchmark.sh           — Default bench entrypoint (llama-bench sweep)
  kv_quant_test.sh       — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
  cpu_ctx_test.sh        — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
  quality_test.sh        — Early generation quality test (superseded by kv_quant_test.sh)
docs/
  FINDINGS.md            — What we learned, surprises, and what to watch out for
  ARCHITECTURE.md        — Compose and test script architecture in detail
models/                  — GGUF model files (gitignored, downloaded separately)
benchmark-results/       — Test output logs and CSVs (gitignored)

Key findings

Full details in docs/FINDINGS.md.

FORCE_MMQ gives free +6–11% on Turing GPUs. GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.

turbo2 KV quantization breaks Qwen3-4B. At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.

turbo2 is paradoxically larger than q4_0 for Gemma4-E2B. MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.

Gemma4's MQA architecture enables extreme context. E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).

Qwen3.5-9B cannot use -nkvo. At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.

llama-perplexity is incompatible with Qwen3.5-9B. Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.

Requirements

Docker + NVIDIA Container Toolkit
NVIDIA GPU (SM75 for pre-built image; rebuild with different CUDA_DOCKER_ARCH for other architectures)
huggingface-cli for model downloads: pip install huggingface_hub
~25 GB disk for all models (download selectively as needed)

Tuning for different hardware

Edit envs/.env.<model> files. Key parameters:

N_GPU_LAYERS — increase for more VRAM, decrease for CPU-split
CTX_SIZE — reduce if OOM, increase if VRAM headroom
CACHE_TYPE_K/V — f16 > q8_0 > q4_0 > turbo2 quality; reverse order for size
THREADS — match physical core count (HT hurts for RAM-bound models)

See docs/ARCHITECTURE.md for full parameter reference.

6.8 KiB Raw Blame History Unescape Escape