Files

Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design

2026-05-06 15:56:40 +02:00

8.4 KiB

Raw Blame History

Architecture

Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.

Docker Compose Architecture

Image Strategy

Two custom images built from the TurboQuant fork of llama.cpp:

Image	Target	Used by
`local/llama-cpp-turboquant:server-cuda-sm75-mmq`	`server`	All llama-server services
`local/llama-cpp-turboquant:full-cuda-sm75-mmq`	`full`	All bench/test services

Both built with CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON":

SM75 = Turing architecture codepath (no tensor cores)
FORCE_MMQ = always use hand-written MMQ kernels instead of cuBLAS GEMM
full target includes llama-bench, llama-perplexity, llama-cli alongside the server

Both images share the same custom entrypoint wrapper that enables the turbo2/3/4 KV quantization types unavailable in upstream llama.cpp. All docker run calls must use --entrypoint="" to bypass the wrapper.

Compose Structure

compose.yaml
├── x-gpu           — NVIDIA runtime + capability passthrough (merged into all services)
├── x-hc            — Common healthcheck (curl /health, start_period overridden per service)
├── x-server        — Merged into all server services:
│   ├── volumes: ./models:/models:ro
│   ├── ports: 8080:8080
│   ├── network alias: llama-current (all servers share this alias)
│   ├── entrypoint: llama-server with $$VAR shell expansion from env_file
│   └── restart: unless-stopped
└── x-bench         — Merged into all bench services:
    ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
    └── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)

Profile System

Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU).

docker compose --profile <PROFILE> up -d

Server profiles (bring up llama-server on port 8080):

Profile	Model	Image	VRAM	Strategy
`qwen35-9b`	Qwen3.5-9B Q8_0	TurboQuant (built)	3.4 GB (11 layers)	RAM-bound; mlock pins weights
`gemma4-e2b`	Gemma4-E2B Q4_K_M	TurboQuant	~3.4 GB	Full GPU, MQA
`gemma4-e4b`	Gemma4-E4B Q4_K_M	TurboQuant	~3.5 GB	Full GPU (42 layers, CPU-split)
`smollm3-3b`	SmolLM3-3B Q4_K_M	TurboQuant	~2.0 GB	Full GPU
`qwen3-4b`	Qwen3-4B Q4_K_M	TurboQuant	~2.5 GB	Full GPU

Bigctx profiles (server with -nkvo: KV cache in host RAM):

Profile	Model	KV type	CTX	~t/s@50% ctx
`smollm3-3b-bigctx`	SmolLM3-3B	turbo2	65536	15.2
`gemma4-e2b-bigctx`	Gemma4-E2B	q4_0	393216	17.0
`gemma4-e4b-bigctx`	Gemma4-E4B	turbo2	163840	17.8
`qwen3-4b-bigctx`	Qwen3-4B	q4_0	24576	11.2

Bench profiles (one-shot benchmark containers):

Profile	Service	Purpose
`bench-qwen35-9b`	bench-qwen35-9b	Also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (all models have model files accessible)
`bench-gemma4-e2b`	bench-gemma4-e2b	E2B bench
`bench-gemma4-e4b`	bench-gemma4-e4b	E4B bench
`bench-smollm3-3b`	bench-smollm3-3b	SmolLM3 bench
`bench-qwen3-4b`	bench-qwen3-4b	Qwen3-4B bench

Add-on profile (combine with any model):

Profile	Service	Purpose
`webui`	openwebui	Open WebUI connecting to `llama-current:8080`

Env File Architecture

Each model has a dedicated envs/.env.<model> file injected into the container. Shell variables use $$VAR in the compose command to escape compose interpolation — the container shell expands them at runtime.

envs/
├── .env.smollm3-3b         ← pure-GPU: q8_0 KV, ctx=24576
├── .env.smollm3-3b-bigctx  ← -nkvo:    turbo2 KV, ctx=65536
├── .env.gemma4-e2b         ← pure-GPU: f16 KV, ctx=24576
├── .env.gemma4-e2b-bigctx  ← -nkvo:    q4_0 KV, ctx=393216 (turbo2 worse for MQA)
├── .env.gemma4-e4b         ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
├── .env.gemma4-e4b-bigctx  ← -nkvo:    turbo2 KV, ctx=163840, ngl=42
├── .env.qwen3-4b           ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
├── .env.qwen3-4b-bigctx    ← -nkvo:    q4_0 KV, ctx=24576 (NO turbo2 ever)
└── .env.qwen35-9b          ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock

Key env variables per file:

MODEL_FILE          # filename under /models/
N_GPU_LAYERS        # ngl: how many transformer layers offloaded to GPU
CTX_SIZE            # context window size
THREADS / THREADS_BATCH
BATCH_SIZE / UBATCH_SIZE
CACHE_TYPE_K/V      # KV quantization: f16 | q8_0 | q4_0 | turbo2
PARALLEL            # number of concurrent request slots
EXTRA_ARGS          # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)

Test Script Architecture

All test scripts run inside the bench-qwen35-9b container (has full image with all binaries), with all model files accessible via /models/.

scripts/kv_quant_test.sh

Purpose: Determine optimal KV quantization type for each model at various context sizes.
Method: llama-perplexity on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.
Quality gate: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.

for each model:
  for each ctx in CTX_CANDIDATES:
    run f16 baseline → get PPL_baseline
    for each KV type in MODEL_KV_TYPES:
      run with that KV type → get PPL
      report Δ = PPL - PPL_baseline

Outputs:

Pass/fail per (model, ctx, KV type) combination
Recommendation: highest-quality KV type that stays within quality gate at all tested ctx

Known limitations:

Qwen3.5-9B: hybrid linear-attention architecture is incompatible with llama-perplexity → always fails. Not a real model issue; the server works correctly.
At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.

scripts/cpu_ctx_test.sh

Purpose: Find maximum viable context size when using -nkvo (KV in host RAM), accounting for PCIe bandwidth penalty.
Method: Two-phase per (model, ctx, KV type):

Alloc check (fast, ~15s): run llama-perplexity on a 64-line file with -nkvo. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.

Speed estimation (analytic bandwidth model):

GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
  t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
  PCIe_BW = 8 GB/s  (PCIe x4 Gen3 practical)

RAM-bound models (qwen35-9b, ngl=11):
  t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
  RAM_BW = 45 GB/s  (DDR4-2933)

Recommendation: highest ctx where t/s@50%fill ≥ 15.

kv_bytes_per_tok measured empirically: KV_MiB_allocated / ctx_size from actual alloc run.

KV types tested per model:

Model	KV types	Reason
smollm3, e2b, e4b	q4_0 + turbo2	Both safe (PPL gate passes)
qwen3-4b	q4_0 only	turbo2 breaks at ctx≥8192
qwen35-9b	q4_0 only	OOMs regardless (skipped)

scripts/benchmark.sh

Default entrypoint for bench containers. Runs llama-bench sweep over prompt/generation lengths and thread counts, outputs CSV to /results/.

scripts/quality_test.sh

Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.

Data Flow

Model GGUF files (./models/)
        │
        ▼
Docker container (/models/ read-only bind mount)
        │
        ├─── llama-server ──► OpenAI-compatible API on :8080
        │         │
        │    env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
        │                     CACHE_TYPE_K/V, EXTRA_ARGS, ...
        │
        └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
                  │
             test scripts (scripts/ read-only bind mount)

Port / Network Layout

Host:8080 ──► llama_server container:8080
Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)

llama-net (bridge):
  llama-current  — alias shared by ALL server services; only one runs at a time

8.4 KiB Raw Blame History Unescape Escape