- compose: increase start_period for bigctx services - gemma4-e4b-bigctx: 60s -> 150s (5 GiB model + warmup + 163840 ctx takes ~90-120s) - gemma4-e2b-bigctx: 60s -> 120s (large ctx 393216 allocation) - smollm3/qwen3-4b bigctx: 60s -> 90s - llama: extend health poll from 30x2s=60s to 75x2s=150s - llama: require 3 consecutive unhealthy before giving up (avoids false positives during Docker start_period window)
llama-cpp-docker
Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
What this is
A Docker Compose setup that runs multiple LLMs via llama.cpp, with:
- Per-model env files — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
- TurboQuant image — custom build with
FORCE_MMQ(+6–11% free speed on Turing GPUs) andturbo2/3/4KV quantization - Bigctx profiles —
-nkvo(KV in RAM) variants that multiply usable context by 2–16× at modest speed cost - Benchmark scripts — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
- Open WebUI — optional web UI, profile-composable with any model
Hardware target: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
Parameters will work on any similar Turing GPU. See docs/FINDINGS.md before porting to other architectures.
Quick start
1. Build the TurboQuant image (once, ~20 min)
docker compose --profile qwen35-9b build llama-qwen35-9b
This builds both server-cuda-sm75-mmq and full-cuda-sm75-mmq tags used by all services.
2. Download models
bash scripts/download_models.sh
Downloads all five models to ./models/. Requires huggingface-cli (pip install huggingface_hub).
To download individual models:
bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
3. Start a model
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d
# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d
# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d
API is available at http://localhost:8080 (OpenAI-compatible).
WebUI at http://localhost:3000.
Models
| Profile | Model | Size | t/s | CTX | Highlights |
|---|---|---|---|---|---|
qwen35-9b |
Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
gemma4-e2b |
Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
gemma4-e4b |
Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
smollm3-3b |
SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
qwen3-4b |
Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
Big context profiles (KV in RAM via -nkvo)
Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|---|---|---|---|---|---|
smollm3-3b-bigctx |
SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
gemma4-e2b-bigctx |
Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
gemma4-e4b-bigctx |
Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
qwen3-4b-bigctx |
Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
docker compose --profile gemma4-e2b-bigctx up -d
Running benchmarks
One-shot — results written to benchmark-results/:
# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
Project structure
compose.yaml — All services, profiles, YAML anchors
envs/
.env.<model> — Pure-GPU tuned params per model
.env.<model>-bigctx — -nkvo KV-in-RAM params
scripts/
download_models.sh — huggingface-cli download helper
benchmark.sh — Default bench entrypoint (llama-bench sweep)
kv_quant_test.sh — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
cpu_ctx_test.sh — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
quality_test.sh — Early generation quality test (superseded by kv_quant_test.sh)
docs/
FINDINGS.md — What we learned, surprises, and what to watch out for
ARCHITECTURE.md — Compose and test script architecture in detail
models/ — GGUF model files (gitignored, downloaded separately)
benchmark-results/ — Test output logs and CSVs (gitignored)
Key findings
Full details in docs/FINDINGS.md.
FORCE_MMQ gives free +6–11% on Turing GPUs. GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
turbo2 KV quantization breaks Qwen3-4B. At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
turbo2 is paradoxically larger than q4_0 for Gemma4-E2B. MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
Gemma4's MQA architecture enables extreme context. E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
Qwen3.5-9B cannot use -nkvo. At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
llama-perplexity is incompatible with Qwen3.5-9B. Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
Requirements
- Docker + NVIDIA Container Toolkit
- NVIDIA GPU (SM75 for pre-built image; rebuild with different
CUDA_DOCKER_ARCHfor other architectures) huggingface-clifor model downloads:pip install huggingface_hub- ~25 GB disk for all models (download selectively as needed)
Tuning for different hardware
Edit envs/.env.<model> files. Key parameters:
N_GPU_LAYERS— increase for more VRAM, decrease for CPU-splitCTX_SIZE— reduce if OOM, increase if VRAM headroomCACHE_TYPE_K/V—f16>q8_0>q4_0>turbo2quality; reverse order for sizeTHREADS— match physical core count (HT hurts for RAM-bound models)
See docs/ARCHITECTURE.md for full parameter reference.