Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design
This commit is contained in:
174
README.md
Normal file
174
README.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# llama-cpp-docker
|
||||
|
||||
Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
|
||||
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
|
||||
|
||||
---
|
||||
|
||||
## What this is
|
||||
|
||||
A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:
|
||||
|
||||
- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
|
||||
- **TurboQuant image** — custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
|
||||
- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
|
||||
- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
|
||||
- **Open WebUI** — optional web UI, profile-composable with any model
|
||||
|
||||
> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
|
||||
> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.
|
||||
|
||||
---
|
||||
|
||||
## Quick start
|
||||
|
||||
### 1. Build the TurboQuant image (once, ~20 min)
|
||||
|
||||
```bash
|
||||
docker compose --profile qwen35-9b build llama-qwen35-9b
|
||||
```
|
||||
|
||||
This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.
|
||||
|
||||
### 2. Download models
|
||||
|
||||
```bash
|
||||
bash scripts/download_models.sh
|
||||
```
|
||||
|
||||
Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).
|
||||
To download individual models:
|
||||
|
||||
```bash
|
||||
bash scripts/download_models.sh smollm3
|
||||
bash scripts/download_models.sh qwen35-9b
|
||||
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
|
||||
```
|
||||
|
||||
### 3. Start a model
|
||||
|
||||
```bash
|
||||
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
|
||||
docker compose --profile smollm3-3b up -d
|
||||
|
||||
# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
|
||||
docker compose --profile gemma4-e2b up -d
|
||||
|
||||
# Add Open WebUI to any running model
|
||||
docker compose --profile gemma4-e2b --profile webui up -d
|
||||
```
|
||||
|
||||
API is available at **http://localhost:8080** (OpenAI-compatible).
|
||||
WebUI at **http://localhost:3000**.
|
||||
|
||||
---
|
||||
|
||||
## Models
|
||||
|
||||
| Profile | Model | Size | t/s | CTX | Highlights |
|
||||
|---|---|---|---|---|---|
|
||||
| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
|
||||
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
|
||||
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
|
||||
| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
|
||||
| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
|
||||
|
||||
### Big context profiles (KV in RAM via `-nkvo`)
|
||||
|
||||
Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
|
||||
|
||||
| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|
||||
|---|---|---|---|---|---|
|
||||
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
|
||||
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
|
||||
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
|
||||
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
|
||||
|
||||
```bash
|
||||
docker compose --profile gemma4-e2b-bigctx up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Running benchmarks
|
||||
|
||||
One-shot — results written to `benchmark-results/`:
|
||||
|
||||
```bash
|
||||
# Standard llama-bench sweep
|
||||
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
|
||||
|
||||
# KV quantization quality test (all models)
|
||||
docker compose --profile bench-qwen35-9b run --rm -T \
|
||||
--entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
|
||||
|
||||
# Context size test with bandwidth model (all models)
|
||||
docker compose --profile bench-qwen35-9b run --rm -T \
|
||||
--entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
|
||||
|
||||
# Ad-hoc llama-bench
|
||||
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
|
||||
bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Project structure
|
||||
|
||||
```
|
||||
compose.yaml — All services, profiles, YAML anchors
|
||||
envs/
|
||||
.env.<model> — Pure-GPU tuned params per model
|
||||
.env.<model>-bigctx — -nkvo KV-in-RAM params
|
||||
scripts/
|
||||
download_models.sh — huggingface-cli download helper
|
||||
benchmark.sh — Default bench entrypoint (llama-bench sweep)
|
||||
kv_quant_test.sh — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
|
||||
cpu_ctx_test.sh — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
|
||||
quality_test.sh — Early generation quality test (superseded by kv_quant_test.sh)
|
||||
docs/
|
||||
FINDINGS.md — What we learned, surprises, and what to watch out for
|
||||
ARCHITECTURE.md — Compose and test script architecture in detail
|
||||
models/ — GGUF model files (gitignored, downloaded separately)
|
||||
benchmark-results/ — Test output logs and CSVs (gitignored)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key findings
|
||||
|
||||
> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).
|
||||
|
||||
**FORCE_MMQ gives free +6–11% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
|
||||
|
||||
**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
|
||||
|
||||
**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
|
||||
|
||||
**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
|
||||
|
||||
**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
|
||||
|
||||
**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
|
||||
|
||||
---
|
||||
|
||||
## Requirements
|
||||
|
||||
- Docker + NVIDIA Container Toolkit
|
||||
- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
|
||||
- `huggingface-cli` for model downloads: `pip install huggingface_hub`
|
||||
- ~25 GB disk for all models (download selectively as needed)
|
||||
|
||||
---
|
||||
|
||||
## Tuning for different hardware
|
||||
|
||||
Edit `envs/.env.<model>` files. Key parameters:
|
||||
|
||||
- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
|
||||
- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
|
||||
- `CACHE_TYPE_K/V` — `f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
|
||||
- `THREADS` — match physical core count (HT hurts for RAM-bound models)
|
||||
|
||||
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.
|
||||
Reference in New Issue
Block a user