llama-cpp/README.md

# llama-cpp-docker

Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.

---

## What this is

A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:

- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
- **TurboQuant image** — custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
- **Open WebUI** — optional web UI, profile-composable with any model

> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.

---

## Quick start

### 1. Build the TurboQuant image (once, ~20 min)

```bash
docker compose --profile qwen35-9b build llama-qwen35-9b
```

This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.

### 2. Download models

```bash
bash scripts/download_models.sh
```

Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).
To download individual models:

```bash
bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
```

### 3. Start a model

```bash
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d

# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d

# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d
```

API is available at **http://localhost:8080** (OpenAI-compatible).
WebUI at **http://localhost:3000**.

---

## Models

| Profile | Model | Size | t/s | CTX | Highlights |
|---|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |

### Big context profiles (KV in RAM via `-nkvo`)

Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).

| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|---|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |

```bash
docker compose --profile gemma4-e2b-bigctx up -d
```

---

## Running benchmarks

One-shot — results written to `benchmark-results/`:

```bash
# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b

# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b

# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b

# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
```

---

## Project structure

```
compose.yaml             — All services, profiles, YAML anchors
envs/
  .env.<model>           — Pure-GPU tuned params per model
  .env.<model>-bigctx    — -nkvo KV-in-RAM params
scripts/
  download_models.sh     — huggingface-cli download helper
  benchmark.sh           — Default bench entrypoint (llama-bench sweep)
  kv_quant_test.sh       — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
  cpu_ctx_test.sh        — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
  quality_test.sh        — Early generation quality test (superseded by kv_quant_test.sh)
docs/
  FINDINGS.md            — What we learned, surprises, and what to watch out for
  ARCHITECTURE.md        — Compose and test script architecture in detail
models/                  — GGUF model files (gitignored, downloaded separately)
benchmark-results/       — Test output logs and CSVs (gitignored)
```

---

## Key findings

> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).

**FORCE_MMQ gives free +6–11% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.

**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.

**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.

**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).

**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.

**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.

---

## Requirements

- Docker + NVIDIA Container Toolkit
- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
- `huggingface-cli` for model downloads: `pip install huggingface_hub`
- ~25 GB disk for all models (download selectively as needed)

---

## Tuning for different hardware

Edit `envs/.env.<model>` files. Key parameters:

- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
- `CACHE_TYPE_K/V` — `f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
- `THREADS` — match physical core count (HT hurts for RAM-bound models)

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.