- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design
8.4 KiB
Architecture
Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.
Docker Compose Architecture
Image Strategy
Two custom images built from the TurboQuant fork of llama.cpp:
| Image | Target | Used by |
|---|---|---|
local/llama-cpp-turboquant:server-cuda-sm75-mmq |
server |
All llama-server services |
local/llama-cpp-turboquant:full-cuda-sm75-mmq |
full |
All bench/test services |
Both built with CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON":
- SM75 = Turing architecture codepath (no tensor cores)
FORCE_MMQ= always use hand-written MMQ kernels instead of cuBLAS GEMMfulltarget includesllama-bench,llama-perplexity,llama-clialongside the server
Both images share the same custom entrypoint wrapper that enables the turbo2/3/4 KV quantization types unavailable in upstream llama.cpp. All docker run calls must use --entrypoint="" to bypass the wrapper.
Compose Structure
compose.yaml
├── x-gpu — NVIDIA runtime + capability passthrough (merged into all services)
├── x-hc — Common healthcheck (curl /health, start_period overridden per service)
├── x-server — Merged into all server services:
│ ├── volumes: ./models:/models:ro
│ ├── ports: 8080:8080
│ ├── network alias: llama-current (all servers share this alias)
│ ├── entrypoint: llama-server with $$VAR shell expansion from env_file
│ └── restart: unless-stopped
└── x-bench — Merged into all bench services:
├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
└── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)
Profile System
Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU).
docker compose --profile <PROFILE> up -d
Server profiles (bring up llama-server on port 8080):
| Profile | Model | Image | VRAM | Strategy |
|---|---|---|---|---|
qwen35-9b |
Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
gemma4-e2b |
Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
gemma4-e4b |
Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
smollm3-3b |
SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
qwen3-4b |
Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |
Bigctx profiles (server with -nkvo: KV cache in host RAM):
| Profile | Model | KV type | CTX | ~t/s@50% ctx |
|---|---|---|---|---|
smollm3-3b-bigctx |
SmolLM3-3B | turbo2 | 65536 | 15.2 |
gemma4-e2b-bigctx |
Gemma4-E2B | q4_0 | 393216 | 17.0 |
gemma4-e4b-bigctx |
Gemma4-E4B | turbo2 | 163840 | 17.8 |
qwen3-4b-bigctx |
Qwen3-4B | q4_0 | 24576 | 11.2 |
Bench profiles (one-shot benchmark containers):
| Profile | Service | Purpose |
|---|---|---|
bench-qwen35-9b |
bench-qwen35-9b | Also hosts cpu_ctx_test.sh / kv_quant_test.sh (all models have model files accessible) |
bench-gemma4-e2b |
bench-gemma4-e2b | E2B bench |
bench-gemma4-e4b |
bench-gemma4-e4b | E4B bench |
bench-smollm3-3b |
bench-smollm3-3b | SmolLM3 bench |
bench-qwen3-4b |
bench-qwen3-4b | Qwen3-4B bench |
Add-on profile (combine with any model):
| Profile | Service | Purpose |
|---|---|---|
webui |
openwebui | Open WebUI connecting to llama-current:8080 |
Env File Architecture
Each model has a dedicated envs/.env.<model> file injected into the container. Shell variables use $$VAR in the compose command to escape compose interpolation — the container shell expands them at runtime.
envs/
├── .env.smollm3-3b ← pure-GPU: q8_0 KV, ctx=24576
├── .env.smollm3-3b-bigctx ← -nkvo: turbo2 KV, ctx=65536
├── .env.gemma4-e2b ← pure-GPU: f16 KV, ctx=24576
├── .env.gemma4-e2b-bigctx ← -nkvo: q4_0 KV, ctx=393216 (turbo2 worse for MQA)
├── .env.gemma4-e4b ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
├── .env.gemma4-e4b-bigctx ← -nkvo: turbo2 KV, ctx=163840, ngl=42
├── .env.qwen3-4b ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
├── .env.qwen3-4b-bigctx ← -nkvo: q4_0 KV, ctx=24576 (NO turbo2 ever)
└── .env.qwen35-9b ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
Key env variables per file:
MODEL_FILE # filename under /models/
N_GPU_LAYERS # ngl: how many transformer layers offloaded to GPU
CTX_SIZE # context window size
THREADS / THREADS_BATCH
BATCH_SIZE / UBATCH_SIZE
CACHE_TYPE_K/V # KV quantization: f16 | q8_0 | q4_0 | turbo2
PARALLEL # number of concurrent request slots
EXTRA_ARGS # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
Test Script Architecture
All test scripts run inside the bench-qwen35-9b container (has full image with all binaries), with all model files accessible via /models/.
scripts/kv_quant_test.sh
Purpose: Determine optimal KV quantization type for each model at various context sizes.
Method: llama-perplexity on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.
Quality gate: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.
for each model:
for each ctx in CTX_CANDIDATES:
run f16 baseline → get PPL_baseline
for each KV type in MODEL_KV_TYPES:
run with that KV type → get PPL
report Δ = PPL - PPL_baseline
Outputs:
- Pass/fail per (model, ctx, KV type) combination
- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx
Known limitations:
Qwen3.5-9B: hybrid linear-attention architecture is incompatible withllama-perplexity→ always fails. Not a real model issue; the server works correctly.- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.
scripts/cpu_ctx_test.sh
Purpose: Find maximum viable context size when using -nkvo (KV in host RAM), accounting for PCIe bandwidth penalty.
Method: Two-phase per (model, ctx, KV type):
-
Alloc check (fast, ~15s): run
llama-perplexityon a 64-line file with-nkvo. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM. -
Speed estimation (analytic bandwidth model):
GPU-compute models (smollm3, e2b, e4b, qwen3-4b): t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000) PCIe_BW = 8 GB/s (PCIe x4 Gen3 practical) RAM-bound models (qwen35-9b, ngl=11): t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000) RAM_BW = 45 GB/s (DDR4-2933) -
Recommendation: highest ctx where
t/s@50%fill ≥ 15.
kv_bytes_per_tok measured empirically: KV_MiB_allocated / ctx_size from actual alloc run.
KV types tested per model:
| Model | KV types | Reason |
|---|---|---|
| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
| qwen35-9b | q4_0 only | OOMs regardless (skipped) |
scripts/benchmark.sh
Default entrypoint for bench containers. Runs llama-bench sweep over prompt/generation lengths and thread counts, outputs CSV to /results/.
scripts/quality_test.sh
Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.
Data Flow
Model GGUF files (./models/)
│
▼
Docker container (/models/ read-only bind mount)
│
├─── llama-server ──► OpenAI-compatible API on :8080
│ │
│ env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
│ CACHE_TYPE_K/V, EXTRA_ARGS, ...
│
└─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
│
test scripts (scripts/ read-only bind mount)
Port / Network Layout
Host:8080 ──► llama_server container:8080
Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)
llama-net (bridge):
llama-current — alias shared by ALL server services; only one runs at a time