Files
llama-cpp/README.md
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00

175 lines
6.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# llama-cpp-docker
Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
---
## What this is
A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:
- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
- **TurboQuant image** — custom build with `FORCE_MMQ` (+611% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 216× at modest speed cost
- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
- **Open WebUI** — optional web UI, profile-composable with any model
> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.
---
## Quick start
### 1. Build the TurboQuant image (once, ~20 min)
```bash
docker compose --profile qwen35-9b build llama-qwen35-9b
```
This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.
### 2. Download models
```bash
bash scripts/download_models.sh
```
Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).
To download individual models:
```bash
bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
```
### 3. Start a model
```bash
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d
# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d
# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d
```
API is available at **http://localhost:8080** (OpenAI-compatible).
WebUI at **http://localhost:3000**.
---
## Models
| Profile | Model | Size | t/s | CTX | Highlights |
|---|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
### Big context profiles (KV in RAM via `-nkvo`)
Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|---|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
```bash
docker compose --profile gemma4-e2b-bigctx up -d
```
---
## Running benchmarks
One-shot — results written to `benchmark-results/`:
```bash
# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
```
---
## Project structure
```
compose.yaml — All services, profiles, YAML anchors
envs/
.env.<model> — Pure-GPU tuned params per model
.env.<model>-bigctx — -nkvo KV-in-RAM params
scripts/
download_models.sh — huggingface-cli download helper
benchmark.sh — Default bench entrypoint (llama-bench sweep)
kv_quant_test.sh — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
cpu_ctx_test.sh — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
quality_test.sh — Early generation quality test (superseded by kv_quant_test.sh)
docs/
FINDINGS.md — What we learned, surprises, and what to watch out for
ARCHITECTURE.md — Compose and test script architecture in detail
models/ — GGUF model files (gitignored, downloaded separately)
benchmark-results/ — Test output logs and CSVs (gitignored)
```
---
## Key findings
> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).
**FORCE_MMQ gives free +611% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
---
## Requirements
- Docker + NVIDIA Container Toolkit
- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
- `huggingface-cli` for model downloads: `pip install huggingface_hub`
- ~25 GB disk for all models (download selectively as needed)
---
## Tuning for different hardware
Edit `envs/.env.<model>` files. Key parameters:
- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
- `CACHE_TYPE_K/V``f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
- `THREADS` — match physical core count (HT hurts for RAM-bound models)
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.