Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions
--- a/.env
+++ b/.env
@@ -0,0 +1,13 @@
+# =============================================================================
+# llama.cpp project root .env
+#
+# Per-model parameters have moved to envs/.env.<model>:
+#   envs/.env.qwen35-9b   — Qwen3.5-9B Q8_0 TurboQuant  ~4.4 t/s
+#   envs/.env.gemma4-e2b  — Gemma 4 E2B Q4_K_M           ~65 t/s
+#   envs/.env.gemma4-e4b  — Gemma 4 E4B Q4_K_M (split)   ~30 t/s
+#   envs/.env.smollm3-3b  — SmolLM3 3B Q4_K_M             ~90 t/s
+#   envs/.env.qwen3-4b    — Qwen3 4B Q4_K_M               ~75 t/s
+#
+# This file is loaded by Docker Compose for project-level interpolation.
+# Add project-wide overrides here (e.g. COMPOSE_PROJECT_NAME).
+# =============================================================================
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,34 @@
+# Model files — large binaries, download with scripts/download_models.sh
+models/*.gguf
+models/*.bin
+models/*.safetensors
+
+# Benchmark output logs, CSVs, and generated env snapshots — generated, not source
+benchmark-results/*.log
+benchmark-results/*.csv
+benchmark-results/*.txt
+benchmark-results/*.env
+# Keep the .gitkeep placeholder
+!benchmark-results/.gitkeep
+
+# Docker build cache artifacts
+.docker/
+
+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+.venv/
+
+# Editor / OS artifacts
+.DS_Store
+Thumbs.db
+*.swp
+*.swo
+*~
+.idea/
+.vscode/
+
+# Local overrides (never commit secrets or machine-specific tweaks)
+.env.local
+envs/.env.*.local
--- a/README.md
+++ b/README.md
@@ -0,0 +1,174 @@
+# llama-cpp-docker
+
+Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).  
+Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
+
+---
+
+## What this is
+
+A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:
+
+- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
+- **TurboQuant image** — custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
+- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost
+- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
+- **Open WebUI** — optional web UI, profile-composable with any model
+
+> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.  
+> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.
+
+---
+
+## Quick start
+
+### 1. Build the TurboQuant image (once, ~20 min)
+
+```bash
+docker compose --profile qwen35-9b build llama-qwen35-9b
+```
+
+This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.
+
+### 2. Download models
+
+```bash
+bash scripts/download_models.sh
+```
+
+Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).  
+To download individual models:
+
+```bash
+bash scripts/download_models.sh smollm3
+bash scripts/download_models.sh qwen35-9b
+# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
+```
+
+### 3. Start a model
+
+```bash
+# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
+docker compose --profile smollm3-3b up -d
+
+# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
+docker compose --profile gemma4-e2b up -d
+
+# Add Open WebUI to any running model
+docker compose --profile gemma4-e2b --profile webui up -d
+```
+
+API is available at **http://localhost:8080** (OpenAI-compatible).  
+WebUI at **http://localhost:3000**.
+
+---
+
+## Models
+
+| Profile | Model | Size | t/s | CTX | Highlights |
+|---|---|---|---|---|---|
+| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
+| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
+| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
+| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
+| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
+
+### Big context profiles (KV in RAM via `-nkvo`)
+
+Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
+
+| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
+|---|---|---|---|---|---|
+| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
+| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
+| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
+| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
+
+```bash
+docker compose --profile gemma4-e2b-bigctx up -d
+```
+
+---
+
+## Running benchmarks
+
+One-shot — results written to `benchmark-results/`:
+
+```bash
+# Standard llama-bench sweep
+docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
+
+# KV quantization quality test (all models)
+docker compose --profile bench-qwen35-9b run --rm -T \
+  --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
+
+# Context size test with bandwidth model (all models)
+docker compose --profile bench-qwen35-9b run --rm -T \
+  --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
+
+# Ad-hoc llama-bench
+docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
+  bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
+```
+
+---
+
+## Project structure
+
+```
+compose.yaml             — All services, profiles, YAML anchors
+envs/
+  .env.<model>           — Pure-GPU tuned params per model
+  .env.<model>-bigctx    — -nkvo KV-in-RAM params
+scripts/
+  download_models.sh     — huggingface-cli download helper
+  benchmark.sh           — Default bench entrypoint (llama-bench sweep)
+  kv_quant_test.sh       — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
+  cpu_ctx_test.sh        — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
+  quality_test.sh        — Early generation quality test (superseded by kv_quant_test.sh)
+docs/
+  FINDINGS.md            — What we learned, surprises, and what to watch out for
+  ARCHITECTURE.md        — Compose and test script architecture in detail
+models/                  — GGUF model files (gitignored, downloaded separately)
+benchmark-results/       — Test output logs and CSVs (gitignored)
+```
+
+---
+
+## Key findings
+
+> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).
+
+**FORCE_MMQ gives free +6–11% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
+
+**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
+
+**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
+
+**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
+
+**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
+
+**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
+
+---
+
+## Requirements
+
+- Docker + NVIDIA Container Toolkit
+- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
+- `huggingface-cli` for model downloads: `pip install huggingface_hub`
+- ~25 GB disk for all models (download selectively as needed)
+
+---
+
+## Tuning for different hardware
+
+Edit `envs/.env.<model>` files. Key parameters:
+
+- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
+- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
+- `CACHE_TYPE_K/V` — `f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
+- `THREADS` — match physical core count (HT hurts for RAM-bound models)
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.
--- a/benchmark-results/.gitkeep
+++ b/benchmark-results/.gitkeep
--- a/compose.yaml
+++ b/compose.yaml
@@ -0,0 +1,290 @@
+# ==============================================================================
+# llama.cpp multi-model server
+# Hardware: GTX 1650 Ti (3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t
+#
+# MODEL PROFILES (mutually exclusive — GPU can only hold one at a time):
+#   qwen35-9b         Qwen3.5-9B Q8_0   TurboQuant (turbo2 KV, FORCE_MMQ) ~4.4 t/s
+#   gemma4-e2b        Gemma 4 E2B       Official llama.cpp                 ~65 t/s
+#   gemma4-e4b        Gemma 4 E4B       Official llama.cpp (CPU split)     ~30 t/s
+#   smollm3-3b        SmolLM3 3B        Official llama.cpp                 ~90 t/s
+#   qwen3-4b          Qwen3 4B          Official llama.cpp                 ~75 t/s
+#
+# BIGCTX PROFILES (-nkvo: KV in RAM, benchmarked v4 2026-05-06, TurboQuant FORCE_MMQ):
+#   smollm3-3b-bigctx SmolLM3 3B  ctx=65536   turbo2 | ~53 t/s base | ~15 t/s@50% | +40960 vs GPU
+#   gemma4-e2b-bigctx Gemma 4 E2B ctx=393216  q4_0   | ~62 t/s base | ~17 t/s@50% | +368640 vs GPU (MQA!)
+#   gemma4-e4b-bigctx Gemma 4 E4B ctx=163840  turbo2 | ~30 t/s base | ~18 t/s@50% | +139264 vs GPU
+#   qwen3-4b-bigctx   Qwen3 4B    ctx=24576   q4_0   | ~39 t/s base | ~11 t/s@50% | +8192 vs GPU
+#
+# OPTIONAL ADD-ON (combine with any model profile):
+#   webui        Open WebUI — auto-connects to whichever model is running
+#
+# BENCHMARK PROFILES (one-shot, run with: docker compose ... run --rm <service>):
+#   bench-qwen35-9b / bench-gemma4-e2b / bench-gemma4-e4b
+#   bench-smollm3-3b / bench-qwen3-4b
+#
+# EXAMPLES:
+#   docker compose --profile qwen35-9b up -d
+#   docker compose --profile gemma4-e2b --profile webui up -d
+#   docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
+#     bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
+#
+# FIRST-TIME BUILD (qwen35-9b TurboQuant image, ~20 min):
+#   docker compose --profile qwen35-9b build llama-qwen35-9b
+#
+# Per-model params live in envs/.env.<model> — edit there to retune.
+# All server services expose API on host port 8080 and Docker network as
+# http://llama-current:8080 via the llama-net network alias.
+# ==============================================================================
+
+# ── Shared GPU passthrough ────────────────────────────────────────────────────
+x-gpu: &gpu
+  runtime: nvidia
+  environment:
+    NVIDIA_VISIBLE_DEVICES: all
+    NVIDIA_DRIVER_CAPABILITIES: compute,utility
+
+# ── Common healthcheck properties (start_period overridden per service) ───────
+x-hc: &hc
+  test: ["CMD-SHELL", "curl -sf http://localhost:8080/health | grep -q '\"status\":\"ok\"'"]
+  interval: 20s
+  timeout: 10s
+  retries: 10
+
+# ── Common server scaffold ────────────────────────────────────────────────────
+# All model services merge this. Per-model differences go in envs/.env.<model>.
+# $$VAR uses double-$ to escape compose interpolation — shell expands them at
+# runtime from the env_file variables injected into the container.
+x-server: &server
+  <<: *gpu
+  container_name: llama_server
+  volumes:
+    - ./models:/models:ro
+  ports:
+    - "8080:8080"
+  shm_size: 1g
+  ulimits:
+    memlock:
+      soft: -1
+      hard: -1
+  restart: unless-stopped
+  entrypoint: ["/bin/sh", "-c"]
+  command: |
+    exec /app/llama-server \
+      --model "/models/$$MODEL_FILE" \
+      --host 0.0.0.0 --port 8080 \
+      --n-gpu-layers $$N_GPU_LAYERS \
+      --ctx-size $$CTX_SIZE \
+      --threads $$THREADS --threads-batch $$THREADS_BATCH \
+      --batch-size $$BATCH_SIZE --ubatch-size $$UBATCH_SIZE \
+      --cache-type-k $$CACHE_TYPE_K --cache-type-v $$CACHE_TYPE_V \
+      --cont-batching --parallel $$PARALLEL \
+      $$EXTRA_ARGS \
+      --log-disable
+  networks:
+    llama-net:
+      aliases: [llama-current]
+
+# ── Common benchmark scaffold ─────────────────────────────────────────────────
+x-bench: &bench
+  <<: *gpu
+  container_name: llama_bench
+  volumes:
+    - ./models:/models:ro
+    - ./benchmark-results:/results
+    - ./scripts:/scripts:ro
+  shm_size: 1g
+  ulimits:
+    memlock:
+      soft: -1
+      hard: -1
+  entrypoint: ["/bin/bash", "/scripts/benchmark.sh"]
+
+# ── Networks ──────────────────────────────────────────────────────────────────
+networks:
+  llama-net:
+    driver: bridge
+
+# ── Volumes ───────────────────────────────────────────────────────────────────
+volumes:
+  open-webui-data:
+
+# ==============================================================================
+services:
+
+  # ── QWEN 3.5-9B Q8_0 — TurboQuant (turbo2 KV, FORCE_MMQ, SM75) ────────────
+  # Build image first: docker compose --profile qwen35-9b build llama-qwen35-9b
+  llama-qwen35-9b:
+    build:
+      context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache
+      dockerfile: .devops/cuda.Dockerfile
+      target: server
+      args:
+        CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [qwen35-9b]
+    env_file: envs/.env.qwen35-9b
+    healthcheck:
+      <<: *hc
+      retries: 12
+      start_period: 180s  # mlock pins 8.86 GB into RAM — needs time
+
+  # ── GEMMA 4 E2B — 2.3B effective (5.1B total/PLE), 128K ctx, audio+video ───
+  # Download: see envs/.env.gemma4-e2b for huggingface-cli command
+  llama-gemma4-e2b:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [gemma4-e2b]
+    env_file: envs/.env.gemma4-e2b
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  # ── GEMMA 4 E4B — 4.5B effective (8B total/PLE), 128K ctx, CPU-split ────────
+  # Fits ~28/42 layers on GPU; remaining layers run on CPU RAM
+  llama-gemma4-e4b:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [gemma4-e4b]
+    env_file: envs/.env.gemma4-e4b
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  # ── SMOLLM3 3B — thinking mode, tool calling, 64K ctx, Apache 2.0 ──────────
+  llama-smollm3-3b:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [smollm3-3b]
+    env_file: envs/.env.smollm3-3b
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  # ── QWEN3 4B — thinking mode, 128K ctx, best ecosystem ────────────────────
+  llama-qwen3-4b:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [qwen3-4b]
+    env_file: envs/.env.qwen3-4b
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  # ── BIGCTX VARIANTS (-nkvo: KV in RAM, benchmarked 2026-05-06) ────────────
+  # Use when you need more context than the pure-GPU profiles offer.
+  # KV cache lives in CPU RAM instead of VRAM → VRAM freed for larger ctx.
+  # Speed estimated via PCIe bandwidth model (8 GB/s). E2B/E4B use MQA — tiny KV, far less PCIe pressure.
+
+  llama-smollm3-3b-bigctx:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [smollm3-3b-bigctx]
+    env_file: envs/.env.smollm3-3b-bigctx
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  llama-gemma4-e2b-bigctx:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [gemma4-e2b-bigctx]
+    env_file: envs/.env.gemma4-e2b-bigctx
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  llama-gemma4-e4b-bigctx:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [gemma4-e4b-bigctx]
+    env_file: envs/.env.gemma4-e4b-bigctx
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  llama-qwen3-4b-bigctx:
+    image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
+    <<: *server
+    profiles: [qwen3-4b-bigctx]
+    env_file: envs/.env.qwen3-4b-bigctx
+    healthcheck:
+      <<: *hc
+      start_period: 60s
+
+  # ── OPEN WEBUI ─────────────────────────────────────────────────────────────
+  # Separate profile — add to any running model:
+  #   docker compose --profile <model> --profile webui up -d
+  # Connects to whichever model is running via the llama-current DNS alias.
+  # Open WebUI retries on startup so no depends_on needed.
+  openwebui:
+    image: ghcr.io/open-webui/open-webui:main
+    container_name: open_webui
+    profiles: [webui]
+    environment:
+      - OPENAI_API_BASE_URL=http://llama-current:8080/v1
+      - OPENAI_API_KEY=sk-no-key-needed
+      - WEBUI_AUTH=false
+    ports:
+      - "3000:8080"
+    networks:
+      - llama-net
+    volumes:
+      - open-webui-data:/app/backend/data
+    restart: unless-stopped
+
+  # ── BENCHMARKS ─────────────────────────────────────────────────────────────
+  # Run as one-shot: docker compose --profile bench-<model> run --rm bench-<model>
+  # Override entrypoint for ad-hoc: ... run --rm --entrypoint="" bench-<model> bash -c '...'
+
+  bench-qwen35-9b:
+    build:
+      context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache
+      dockerfile: .devops/cuda.Dockerfile
+      target: full
+      args:
+        CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"
+    image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+    <<: *bench
+    profiles: [bench-qwen35-9b]
+    environment:
+      MODEL_FILE: Qwen3.5-9B.Q8_0.gguf
+      OUTPUT_DIR: /results
+      VARIANT: qwen35-9b-turboquant
+      PATH: /app:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+
+  bench-gemma4-e2b:
+    image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+    <<: *bench
+    profiles: [bench-gemma4-e2b]
+    environment:
+      MODEL_FILE: google_gemma-4-E2B-it-Q4_K_M.gguf
+      OUTPUT_DIR: /results
+      VARIANT: gemma4-e2b
+
+  bench-gemma4-e4b:
+    image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+    <<: *bench
+    profiles: [bench-gemma4-e4b]
+    environment:
+      MODEL_FILE: google_gemma-4-E4B-it-Q4_K_M.gguf
+      OUTPUT_DIR: /results
+      VARIANT: gemma4-e4b
+
+  bench-smollm3-3b:
+    image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+    <<: *bench
+    profiles: [bench-smollm3-3b]
+    environment:
+      MODEL_FILE: HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
+      OUTPUT_DIR: /results
+      VARIANT: smollm3-3b
+
+  bench-qwen3-4b:
+    image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+    <<: *bench
+    profiles: [bench-qwen3-4b]
+    environment:
+      MODEL_FILE: Qwen3-4B-Q4_K_M.gguf
+      OUTPUT_DIR: /results
+      VARIANT: qwen3-4b
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,210 @@
+# Architecture
+
+Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.
+
+---
+
+## Docker Compose Architecture
+
+### Image Strategy
+
+Two custom images built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp:
+
+| Image | Target | Used by |
+|---|---|---|
+| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services |
+| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services |
+
+Both built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`:
+- SM75 = Turing architecture codepath (no tensor cores)
+- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM
+- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server
+
+Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.**
+
+### Compose Structure
+
+```
+compose.yaml
+├── x-gpu           — NVIDIA runtime + capability passthrough (merged into all services)
+├── x-hc            — Common healthcheck (curl /health, start_period overridden per service)
+├── x-server        — Merged into all server services:
+│   ├── volumes: ./models:/models:ro
+│   ├── ports: 8080:8080
+│   ├── network alias: llama-current (all servers share this alias)
+│   ├── entrypoint: llama-server with $$VAR shell expansion from env_file
+│   └── restart: unless-stopped
+└── x-bench         — Merged into all bench services:
+    ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
+    └── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)
+```
+
+### Profile System
+
+Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU).
+
+```
+docker compose --profile <PROFILE> up -d
+```
+
+**Server profiles** (bring up `llama-server` on port 8080):
+
+| Profile | Model | Image | VRAM | Strategy |
+|---|---|---|---|---|
+| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
+| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
+| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
+| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
+| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |
+
+**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM):
+
+| Profile | Model | KV type | CTX | ~t/s@50% ctx |
+|---|---|---|---|---|
+| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 |
+| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 |
+| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 |
+| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 |
+
+**Bench profiles** (one-shot benchmark containers):
+
+| Profile | Service | Purpose |
+|---|---|---|
+| `bench-qwen35-9b` | bench-qwen35-9b | Also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (all models have model files accessible) |
+| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench |
+| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench |
+| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench |
+| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench |
+
+**Add-on profile** (combine with any model):
+
+| Profile | Service | Purpose |
+|---|---|---|
+| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` |
+
+### Env File Architecture
+
+Each model has a dedicated `envs/.env.<model>` file injected into the container. Shell variables use `$$VAR` in the compose command to escape compose interpolation — the container shell expands them at runtime.
+
+```
+envs/
+├── .env.smollm3-3b         ← pure-GPU: q8_0 KV, ctx=24576
+├── .env.smollm3-3b-bigctx  ← -nkvo:    turbo2 KV, ctx=65536
+├── .env.gemma4-e2b         ← pure-GPU: f16 KV, ctx=24576
+├── .env.gemma4-e2b-bigctx  ← -nkvo:    q4_0 KV, ctx=393216 (turbo2 worse for MQA)
+├── .env.gemma4-e4b         ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
+├── .env.gemma4-e4b-bigctx  ← -nkvo:    turbo2 KV, ctx=163840, ngl=42
+├── .env.qwen3-4b           ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
+├── .env.qwen3-4b-bigctx    ← -nkvo:    q4_0 KV, ctx=24576 (NO turbo2 ever)
+└── .env.qwen35-9b          ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
+```
+
+Key env variables per file:
+
+```bash
+MODEL_FILE          # filename under /models/
+N_GPU_LAYERS        # ngl: how many transformer layers offloaded to GPU
+CTX_SIZE            # context window size
+THREADS / THREADS_BATCH
+BATCH_SIZE / UBATCH_SIZE
+CACHE_TYPE_K/V      # KV quantization: f16 | q8_0 | q4_0 | turbo2
+PARALLEL            # number of concurrent request slots
+EXTRA_ARGS          # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
+```
+
+---
+
+## Test Script Architecture
+
+All test scripts run inside the `bench-qwen35-9b` container (has `full` image with all binaries), with all model files accessible via `/models/`.
+
+### scripts/kv_quant_test.sh
+
+**Purpose**: Determine optimal KV quantization type for each model at various context sizes.  
+**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.  
+**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.
+
+```
+for each model:
+  for each ctx in CTX_CANDIDATES:
+    run f16 baseline → get PPL_baseline
+    for each KV type in MODEL_KV_TYPES:
+      run with that KV type → get PPL
+      report Δ = PPL - PPL_baseline
+```
+
+**Outputs**:
+- Pass/fail per (model, ctx, KV type) combination
+- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx
+
+**Known limitations**:
+- `Qwen3.5-9B`: hybrid linear-attention architecture is incompatible with `llama-perplexity` → always fails. Not a real model issue; the server works correctly.
+- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.
+
+### scripts/cpu_ctx_test.sh
+
+**Purpose**: Find maximum viable context size when using `-nkvo` (KV in host RAM), accounting for PCIe bandwidth penalty.  
+**Method**: Two-phase per (model, ctx, KV type):
+
+1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.
+
+2. **Speed estimation** (analytic bandwidth model):
+   ```
+   GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
+     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
+     PCIe_BW = 8 GB/s  (PCIe x4 Gen3 practical)
+
+   RAM-bound models (qwen35-9b, ngl=11):
+     t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
+     RAM_BW = 45 GB/s  (DDR4-2933)
+   ```
+
+3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`.
+
+**kv_bytes_per_tok** measured empirically: `KV_MiB_allocated / ctx_size` from actual alloc run.
+
+**KV types tested per model**:
+
+| Model | KV types | Reason |
+|---|---|---|
+| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
+| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
+| qwen35-9b | q4_0 only | OOMs regardless (skipped) |
+
+### scripts/benchmark.sh
+
+Default entrypoint for bench containers. Runs `llama-bench` sweep over prompt/generation lengths and thread counts, outputs CSV to `/results/`.
+
+### scripts/quality_test.sh
+
+Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.
+
+---
+
+## Data Flow
+
+```
+Model GGUF files (./models/)
+        │
+        ▼
+Docker container (/models/ read-only bind mount)
+        │
+        ├─── llama-server ──► OpenAI-compatible API on :8080
+        │         │
+        │    env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
+        │                     CACHE_TYPE_K/V, EXTRA_ARGS, ...
+        │
+        └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
+                  │
+             test scripts (scripts/ read-only bind mount)
+```
+
+## Port / Network Layout
+
+```
+Host:8080 ──► llama_server container:8080
+Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)
+
+llama-net (bridge):
+  llama-current  — alias shared by ALL server services; only one runs at a time
+```
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -0,0 +1,158 @@
+# Benchmarking Findings
+
+Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.  
+All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `DGGML_CUDA_FORCE_MMQ=ON`).  
+Date: 2026-05-05 / 2026-05-06.
+
+---
+
+## 1. FORCE_MMQ — Free +6–11% on Turing GPUs
+
+**Finding**: GPUs without tensor cores (Turing = RTX 1650, 1660, 2060 etc.) run the GEMM path through cuBLAS GEMM, which is slower than the hand-written MMQ (matrix-multiply quantized) kernel. Compiling with `DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally.
+
+| Model | Standard image t/s | TurboQuant t/s | Gain |
+|---|---|---|---|
+| SmolLM3-3B | ~49.9 | 53.1 | +6.2% |
+| Gemma4-E2B | ~55.7 | 61.7 | +10.7% |
+| Gemma4-E4B | ~27.0 | 30.0 | +11.4% |
+| Qwen3-4B | ~36.7 | 38.8 | +5.7% |
+
+**Caution**: On Ampere/Ada/Hopper (RTX 3000+/4000+), tensor cores are faster. `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only.
+
+---
+
+## 2. KV Quantization — turbo2 is the best sweet spot
+
+**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss.
+
+**Perplexity delta vs f16 baseline** (quality gate: Δ < 0.5):
+
+| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B |
+|---|---|---|---|---|
+| q8_0 | ✓ | ✓ | ✓ | ✓ |
+| q4_0 | ✓ | ✓ | ✓ | ✓ |
+| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** |
+| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** |
+| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** |
+
+### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B
+
+Qwen3-4B uses full GQA (32 KV heads, 40 KB/token). At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:
+
+```
+ctx=4096   turbo2: PPL=1.79  (baseline 1.76, Δ=0.03 ✓)
+ctx=8192   turbo2: PPL=4.2   (Δ=2.4 ✗)
+ctx=16384  turbo2: PPL=15.4  (Δ=13.7 ✗)
+ctx=32768  turbo2: PPL=438   (broken)
+```
+
+**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0.
+
+---
+
+## 3. MQA Architecture — Gemma4 E2B/E4B KV is tiny
+
+**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers — only 1 KV head is maintained per token instead of full GQA. This results in dramatically smaller KV cache:
+
+| Model | KV bytes/token (q4_0) | Architecture |
+|---|---|---|
+| SmolLM3-3B | ~19.8 KB | GQA |
+| Qwen3-4B | ~39.6 KB | full GQA |
+| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) |
+| Gemma4-E2B | ~1.7 KB | MQA (35 layers) |
+
+**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM. E4B can hold 163K tokens with 346 MiB RAM.
+
+### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact)
+
+turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. At E2B:
+
+```
+ctx=32768  q4_0: 57 MiB KV   turbo2: 68 MiB KV  (+19% worse!)
+```
+
+**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0.
+
+---
+
+## 4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost
+
+**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The tradeoff is token generation speed — each token generation requires reading the full KV cache over PCIe x4.
+
+**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)`
+
+PCIe x4 Gen3 ≈ **8 GB/s** practical (measured from BW model fit to actual results).
+
+### Context gains with -nkvo (v4, TurboQuant):
+
+| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used |
+|---|---|---|---|---|
+| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 |
+| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) |
+| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 |
+| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 |
+
+Recommendation threshold: ≥ 15 t/s at 50% context fill.
+
+### ⚠️ Qwen3.5-9B cannot use -nkvo
+
+Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.
+
+---
+
+## 5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible
+
+**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly.
+
+**Performance ceiling**: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited.
+
+**Thread optimization** (i7-10750H, 6 physical / 12 logical):
+- Optimal: `THREADS=6` (one per physical core)
+- HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)
+
+---
+
+## 6. Gemma4-E4B — all layers fit when VRAM is free
+
+**Surprise**: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.
+
+ngl sweep result:
+```
+ngl=28  →  59 pp / 16.5 tg t/s
+ngl=35  →  101 pp / 24.6 tg t/s
+ngl=42  →  133 pp / 32.0 tg t/s  ← all layers, much faster
+```
+
+**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.
+
+---
+
+## 7. Flash Attention (+2–3% pp, required for bigctx)
+
+`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~2–3% pp, ~1% tg). Always enable it.
+
+---
+
+## 8. Benchmarking pitfalls
+
+### False OOM from prefill timeout
+Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s).
+
+### kv/tok measurement anomalies
+The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.
+
+---
+
+## Summary: Recommended configurations
+
+| Model | Profile | KV type | CTX | t/s@base | Notes |
+|---|---|---|---|---|---|
+| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx |
+| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM |
+| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV |
+| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse |
+| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers |
+| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM |
+| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever |
+| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken |
+| Qwen3.5-9B | pure-GPU | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx |
--- a/envs/.env.gemma4-e2b
+++ b/envs/.env.gemma4-e2b
@@ -0,0 +1,43 @@
+# ==============================================================================
+# Gemma 4 E2B-it Q4_K_M — Google DeepMind (April 2025)
+# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
+#   - 2.3B effective params (5.1B total with PLE embedding tables)
+#   - 35 layers, hybrid local (512-token window) + global attention
+#   - 128K context window
+# Model size: ~2.9 GB Q4_K_M  |  Full GPU fit (ngl=99, VRAM ~3.4 GB total)
+# Modalities: text + image + audio (ASR/translation) + video frames
+#
+# Download:
+#   huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \
+#     google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify the exact filename after download — bartowski naming may vary.
+#       Check: ls models/google_gemma*
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
+
+# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup.
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits!
+# 65K/131K OOM (full global-attn layers eat VRAM at large ctx).
+# Baseline: 350 pp / 64.6 tg t/s  |  At 32K ctx: 365 pp / 66.8 tg t/s (fa=1)
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+# f16 KV — model small, KV overhead negligible even at 32K
+CACHE_TYPE_K=f16
+CACHE_TYPE_V=f16
+
+# 2 parallel slots — fast model (66 tg t/s), VRAM headroom available
+PARALLEL=2
+
+# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0)
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.gemma4-e2b-bigctx
+++ b/envs/.env.gemma4-e2b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# Gemma 4 E2B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
+# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
+# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
+# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
+# Use this profile when you need >24K context; otherwise use gemma4-e2b.
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=393216
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.gemma4-e4b
+++ b/envs/.env.gemma4-e4b
@@ -0,0 +1,43 @@
+# ==============================================================================
+# Gemma 4 E4B-it Q4_K_M — Google DeepMind (April 2025)
+# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
+#   - 4.5B effective params (8B total with PLE embedding tables)
+#   - 42 layers, hybrid local (512-token window) + global attention
+#   - 128K context window
+# Model size: ~4.7 GB Q4_K_M  |  CPU-split needed (exceeds 3.7 GB VRAM)
+# Modalities: text + image + audio (ASR/translation) + video frames
+#
+# Download:
+#   huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \
+#     google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify the exact filename after download — bartowski naming may vary.
+#       Check: ls models/google_gemma*
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# ALL 42 layers fit on GPU when no other containers hold VRAM!
+# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s  (ngl=28 was only 59/16.5)
+# Max ctx=24576 (hybrid attention, 32K OOM).  fa=1 works (+3% vs fa=0).
+# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg)
+N_GPU_LAYERS=42
+
+# 24K max — hybrid sliding-window keeps most layers' KV tiny
+# 32K OOM due to global-attn layers hitting VRAM wall
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+# fa=1 confirmed working on hybrid Gemma4 attention
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.gemma4-e4b-bigctx
+++ b/envs/.env.gemma4-e4b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
+# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
+# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
+# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
+# Use this profile when you need >24K context; otherwise use gemma4-e4b.
+# ==============================================================================
+
+MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
+
+N_GPU_LAYERS=42
+CTX_SIZE=163840
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.qwen3-4b
+++ b/envs/.env.qwen3-4b
@@ -0,0 +1,42 @@
+# ==============================================================================
+# Qwen3-4B-Instruct Q4_K_M — Alibaba (May 2025)
+# Architecture: Decoder-only transformer, GQA
+#   - 4B params, 32 layers
+#   - 32K native context (128K with YaRN)
+# Model size: ~2.4 GB Q4_K_M  |  Full GPU fit (ngl=99)
+# Features: thinking mode (/think /no_think), tool calling, 119 languages,
+#           Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes).
+#
+# Download:
+#   huggingface-cli download bartowski/Qwen3-4B-GGUF \
+#     Qwen3-4B-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify exact filename after download:
+#       ls models/Qwen3-4B*
+# ==============================================================================
+
+MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
+
+# All layers fit — ~2.4 GB leaves ~1.3 GB free for KV + compute
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Max ctx=8192 (12K OOM). Full attention — all KV must fit at full ctx.
+# GGUF native limit=40960, but VRAM walls at ~8K.
+# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%).
+CTX_SIZE=16384
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+# 1 parallel slot — limited VRAM at 8K ctx with 2.4GB model
+PARALLEL=1
+
+# fa=1 gives +6% tg speed on full-attention Qwen3
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.qwen3-4b-bigctx
+++ b/envs/.env.qwen3-4b-bigctx
@@ -0,0 +1,24 @@
+# ==============================================================================
+# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384)
+# Baseline TG: ~39 t/s (empty KV).
+# Use this profile when you need >16K context; otherwise use qwen3-4b.
+# ==============================================================================
+
+MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q4_0
+CACHE_TYPE_V=q4_0
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/envs/.env.qwen35-9b
+++ b/envs/.env.qwen35-9b
@@ -0,0 +1,41 @@
+# ==============================================================================
+# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
+# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
+# Model size: 8.86 GB  |  VRAM usage: ~3.4 GB (11 layers on GPU)
+# RAM usage: ~5.5 GB (21 layers pinned via mlock)
+#
+# Benchmark results (turbo2 KV, ngl=11, fa=1):
+#   t=1→0.86  t=2→1.62  t=3→2.25  t=4→2.94  t=5→3.56  t=6→4.38 ← best
+#   t=8→4.22  t=12→3.61  (hyperthreading hurts above 6)
+# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
+# Achieved: 4.38 t/s = 86% efficiency
+#
+# Download:
+#   huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
+#     Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
+# ==============================================================================
+
+MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
+
+# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
+N_GPU_LAYERS=11
+
+# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
+CTX_SIZE=32768
+
+# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=128
+
+# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
+# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
+EXTRA_ARGS=--flash-attn on --no-mmap --mlock
--- a/envs/.env.smollm3-3b
+++ b/envs/.env.smollm3-3b
@@ -0,0 +1,42 @@
+# ==============================================================================
+# SmolLM3 3B-it Q4_K_M — HuggingFace (2025)
+# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio)
+#   - 3B params, 11.2T training tokens
+#   - 64K native context (128K with YaRN)
+# Model size: ~1.9 GB Q4_K_M  |  Full GPU fit (ngl=99)
+# Features: thinking mode (/think /no_think), tool calling, 6 languages,
+#           Apache 2.0. AIME 2025: 36.7% in think mode.
+#
+# Download:
+#   huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \
+#     HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/
+#
+# NOTE: Verify exact filename after download:
+#       ls models/SmolLM3* models/HuggingFaceTB_SmolLM3*
+# ==============================================================================
+
+MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
+
+# All layers fit comfortably — ~1.9 GB leaves ~1.8 GB free for KV + compute
+N_GPU_LAYERS=99
+
+# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
+# Max ctx=24576 (32K OOM). Baseline: 249 pp / 56.8 tg t/s.
+# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%).
+# Model context limit = 65536, VRAM is the constraint here.
+CTX_SIZE=24576
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=q8_0
+CACHE_TYPE_V=q8_0
+
+# 2 parallel slots — less headroom at 24K ctx vs original 16K estimate
+PARALLEL=2
+
+# fa=1 gives small but consistent improvement (+2 tg t/s)
+EXTRA_ARGS=--flash-attn on --mmap
--- a/envs/.env.smollm3-3b-bigctx
+++ b/envs/.env.smollm3-3b-bigctx
@@ -0,0 +1,26 @@
+# ==============================================================================
+# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
+# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
+# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
+# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
+# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
+# Use this profile when you need >24K context; otherwise use smollm3-3b.
+# ==============================================================================
+
+MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
+
+N_GPU_LAYERS=99
+CTX_SIZE=65536
+
+THREADS=6
+THREADS_BATCH=6
+
+BATCH_SIZE=512
+UBATCH_SIZE=256
+
+CACHE_TYPE_K=turbo2
+CACHE_TYPE_V=turbo2
+
+PARALLEL=1
+
+EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
--- a/scripts/benchmark.sh
+++ b/scripts/benchmark.sh
@@ -0,0 +1,335 @@
+#!/usr/bin/env bash
+# =============================================================================
+# llama.cpp Automated Benchmark — Qwen3.5-9B on GTX 1650 Ti (4 GB VRAM)
+#
+# Runs for BOTH official llama.cpp and TurboQuant fork.
+# VARIANT env var selects which KV type set to sweep:
+#   VARIANT=official    → f16 q8_0 q5_0 q4_0 iq4_nl
+#   VARIANT=turboquant  → f16 q8_0 iq4_nl turbo4 turbo3 turbo2
+#
+# Output: CSV + recommended .env per variant, plus a final comparison table.
+#
+# Run:
+#   docker compose --profile benchmark run --rm benchmark          (official)
+#   docker compose --profile benchmark run --rm benchmark-turbo    (turboquant)
+# =============================================================================
+
+set -euo pipefail
+
+# Ensure llama-bench is findable in both official (/usr/local/bin) and TurboQuant (/app) images
+export PATH="/app:/usr/local/bin:/usr/bin:/bin:${PATH:-}"
+
+MODEL="${MODEL:-${1:-/models/Qwen3.5-9B.Q8_0.gguf}}"
+OUTPUT_DIR="${OUTPUT_DIR:-${2:-/results}}"
+VARIANT="${VARIANT:-official}"   # official | turboquant
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+RESULTS_CSV="${OUTPUT_DIR}/${VARIANT}_results_${TIMESTAMP}.csv"
+LOG="${OUTPUT_DIR}/${VARIANT}_benchmark_${TIMESTAMP}.log"
+
+# ── Baseline config ────────────────────────────────────────────────────────
+THREADS=6
+THREADS_BATCH=12
+BATCH_SIZE=2048
+UBATCH_SIZE=512
+PROMPT_TOKENS=512
+GEN_TOKENS=32
+REPETITIONS=1
+
+# ── KV type sets per variant ───────────────────────────────────────────────
+# turbo2=2-bit (6.4× vs f16), turbo3=3-bit, turbo4=4-bit — TurboQuant only
+if [[ "$VARIANT" == "turboquant" ]]; then
+  KV_TYPES=(f16 q8_0 iq4_nl turbo4 turbo3 turbo2)
+else
+  # Official llama.cpp: all standard quant types
+  # iq4_nl = i-quant non-linear: best quality at 4-bit (non-uniform scale)
+  KV_TYPES=(f16 q8_0 q5_0 q4_0 iq4_nl)
+fi
+
+# ── GPU layer sweep (Q8_0 ~297 MB/layer, 3717 MiB VRAM → max ~12 layers) ──
+NGL_VALUES=(6 9 12 13 14 99)
+
+# ── Context sweep: use -p to stress KV cache at given size ─────────────────
+CTX_VALUES=(128 512 1024 2048 4096 8192)
+
+# ── Batch sweep ────────────────────────────────────────────────────────────
+BATCH_VALUES=(512 1024 2048 4096)
+
+mkdir -p "$OUTPUT_DIR"
+
+log()  { echo "$*" | tee -a "$LOG"; }
+sep()  { log "$(printf '─%.0s' {1..70})"; }
+hdr()  { sep; log "  $*"; sep; }
+
+log "llama.cpp Benchmark [${VARIANT}] — $(date)"
+log "Model:   $MODEL"
+log "GPU:     $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo 'CPU only')"
+log "KV set:  ${KV_TYPES[*]}"
+sep
+
+echo "variant,phase,ngl,ctx,kv_type_k,kv_type_v,flash_attn,batch_size,ubatch_size,threads,pp_tokens_per_sec,tg_tokens_per_sec,status" \
+  > "$RESULTS_CSV"
+
+# ── Helper: run llama-bench ────────────────────────────────────────────────
+LAST_PP=0
+LAST_TG=0
+
+run_bench() {
+  local ngl=$1 ctx=$2 kv=$3 fa=$4 batch=$5 ubatch=$6 phase="${7:-test}"
+  local raw
+
+  # New llama-bench API (b9014+): -c and -tb removed; -p sets prompt/ctx size
+  # CSV columns: ...n_prompt(34),n_gen(35),n_depth(36),test_time(37),
+  #              avg_ns(38),stddev_ns(39),avg_ts(40),stddev_ts(41)
+  # pp row: n_gen==0; tg row: n_prompt==0
+  raw=$(timeout 300 /app/llama-bench \
+    -m "$MODEL" \
+    -ngl "$ngl" \
+    -p "$ctx" \
+    -n "$GEN_TOKENS" \
+    -t "$THREADS" \
+    -b "$batch" \
+    -ub "$ubatch" \
+    -ctk "$kv" \
+    -ctv "$kv" \
+    -fa "$fa" \
+    -r "$REPETITIONS" \
+    -o csv 2>&1) || return 1
+
+  # Strip quotes from CSV, then extract avg_ts (col 40) by pp/tg row type
+  LAST_PP=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $35=="0"    && $40+0>0 {print $40+0; exit}')
+  LAST_TG=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}')
+  LAST_PP="${LAST_PP:-0}"
+  LAST_TG="${LAST_TG:-0}"
+
+  echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},${LAST_PP},${LAST_TG},ok" \
+    >> "$RESULTS_CSV"
+  return 0
+}
+
+fail_row() {
+  local phase=$1 ngl=$2 ctx=$3 kv=$4 fa=$5 batch=$6 ubatch=$7
+  echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},0,0,failed" \
+    >> "$RESULTS_CSV"
+}
+
+# ── Phase 1: GPU layer sweep ───────────────────────────────────────────────
+hdr "PHASE 1 — GPU layer sweep (prompt=128  kv=f16  fa=0)"
+# Use f16 KV: prebuilt official image lacks SM75 CUDA kernels for quantized KV.
+# We isolate the NGL variable here; KV type is swept in Phase 3.
+MAX_STABLE_NGL=0
+for ngl in "${NGL_VALUES[@]}"; do
+  printf "  ngl=%-3s  " "$ngl" | tee -a "$LOG"
+  if run_bench "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph1_ngl"; then
+    log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+    MAX_STABLE_NGL="$ngl"
+  else
+    log "FAILED (OOM/timeout)"
+    fail_row ph1_ngl "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE"
+    break
+  fi
+done
+log "  → Best ngl: ${MAX_STABLE_NGL}"
+
+# ── Phase 2: Context sweep ─────────────────────────────────────────────────
+hdr "PHASE 2 — Context/prompt sweep (ngl=${MAX_STABLE_NGL}  kv=f16  fa=0)"
+
+MAX_STABLE_CTX=128
+for ctx in "${CTX_VALUES[@]}"; do
+  printf "  ctx=%-6s  " "$ctx" | tee -a "$LOG"
+  if run_bench "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph2_ctx"; then
+    log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+    MAX_STABLE_CTX="$ctx"
+  else
+    log "FAILED (OOM/timeout)"
+    fail_row ph2_ctx "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE"
+    break
+  fi
+done
+log "  → Best ctx: ${MAX_STABLE_CTX}"
+
+# ── Phase 3: KV cache type sweep ───────────────────────────────────────────
+hdr "PHASE 3 — KV type sweep (ngl=${MAX_STABLE_NGL}  ctx=${MAX_STABLE_CTX}  fa=1)"
+log "  [${VARIANT}] KV types: ${KV_TYPES[*]}"
+log "  Note: Qwen3.5-9B has only 8/32 full-attention layers + GQA (4 KV heads)"
+log "        Linear-attention layers need no KV cache at all → quant errors minimal"
+if [[ "$VARIANT" == "turboquant" ]]; then
+  log "  turbo2=2-bit (6.4× compression), turbo3=3-bit, turbo4=4-bit"
+fi
+
+BEST_KV="q8_0"
+BEST_TG_KV=0
+
+for kv in "${KV_TYPES[@]}"; do
+  printf "  kv=%-8s  " "$kv" | tee -a "$LOG"
+  if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph3_kv"; then
+    log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+    tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
+    if awk "BEGIN{exit !(${tg_n:-0} > ${BEST_TG_KV:-0})}"; then
+      BEST_TG_KV="${tg_n:-0}"
+      BEST_KV="$kv"
+    fi
+  else
+    log "FAILED"
+    fail_row ph3_kv "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE"
+  fi
+done
+log "  → Best KV: ${BEST_KV}  (tg=${BEST_TG_KV} t/s)"
+
+# ── Phase 4: Flash attention ───────────────────────────────────────────────
+hdr "PHASE 4 — Flash attention (ngl=${MAX_STABLE_NGL}  ctx=${MAX_STABLE_CTX}  kv=${BEST_KV})"
+log "  GTX 1650 Ti = CC 7.5 (Turing) — FA2 requires SM80+ but FA1 works on CC>=7.5"
+
+BEST_FA=1
+BEST_TG_FA=0
+
+for fa in 1 0; do
+  fa_label=$([ "$fa" -eq 1 ] && echo "on " || echo "off")
+  printf "  fa=%-3s  " "$fa_label" | tee -a "$LOG"
+  if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$BEST_KV" "$fa" "$BATCH_SIZE" "$UBATCH_SIZE" "ph4_fa"; then
+    log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+    tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
+    if awk "BEGIN{exit !(${tg_n:-0} > ${BEST_TG_FA:-0})}"; then
+      BEST_TG_FA="${tg_n:-0}"
+      BEST_FA="$fa"
+    fi
+  else
+    log "FAILED"
+  fi
+done
+log "  → Best FA: ${BEST_FA}  (tg=${BEST_TG_FA} t/s)"
+
+# ── Phase 5: Batch sweep ───────────────────────────────────────────────────
+hdr "PHASE 5 — Batch sweep (ngl=${MAX_STABLE_NGL}  ctx=${MAX_STABLE_CTX}  kv=${BEST_KV}  fa=${BEST_FA})"
+# Use small fixed prompt (64) to isolate batch-buffer allocation overhead from prompt size.
+# Larger batch = larger CUDA activation buffers; tests whether they fit in remaining VRAM.
+BEST_BATCH="$BATCH_SIZE"
+BEST_PP_BATCH=0
+FIXED_P=64
+
+for batch in "${BATCH_VALUES[@]}"; do
+  ubatch=$(( batch / 4 < 64 ? 64 : batch / 4 ))
+  printf "  batch=%-5s ubatch=%-4s  " "$batch" "$ubatch" | tee -a "$LOG"
+  if run_bench "$MAX_STABLE_NGL" "$FIXED_P" "$BEST_KV" "$BEST_FA" "$batch" "$ubatch" "ph5_batch"; then
+    log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+    pp_n=$(printf '%s' "$LAST_PP" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
+    if awk "BEGIN{exit !(${pp_n:-0} > ${BEST_PP_BATCH:-0})}"; then
+      BEST_PP_BATCH="${pp_n:-0}"
+      BEST_BATCH="$batch"
+    fi
+  else
+    log "FAILED"
+  fi
+done
+BEST_UBATCH=$(( BEST_BATCH / 4 < 64 ? 64 : BEST_BATCH / 4 ))
+log "  → Best batch: ${BEST_BATCH}  ubatch: ${BEST_UBATCH}  (pp=${BEST_PP_BATCH} t/s)"
+
+# ── Phase 6 (TurboQuant only): max context with turbo2 KV ──────────────────
+if [[ "$VARIANT" == "turboquant" ]]; then
+  hdr "PHASE 6 — TurboQuant: extended context with turbo2 KV (ngl=${MAX_STABLE_NGL}  fa=${BEST_FA})"
+  log "  turbo2 = 2-bit KV (6.4× smaller than f16) → enables much larger ctx in same VRAM"
+
+  TURBO_CTX_VALUES=(512 1024 2048 4096 8192 16384 32768)
+  MAX_TURBO_CTX="128"
+  TURBO_KV="turbo2"
+
+  for ctx in "${TURBO_CTX_VALUES[@]}"; do
+    printf "  ctx=%-7s  " "$ctx" | tee -a "$LOG"
+    if run_bench "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH" "ph6_turbo_ctx"; then
+      log "OK  pp=${LAST_PP} t/s  tg=${LAST_TG} t/s"
+      MAX_TURBO_CTX="$ctx"
+    else
+      log "FAILED (OOM/timeout)"
+      fail_row ph6_turbo_ctx "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH"
+      break
+    fi
+  done
+  log "  → Max context with turbo2: ${MAX_TURBO_CTX}"
+  # Use the larger turbo ctx for the recommended .env
+  MAX_STABLE_CTX="$MAX_TURBO_CTX"
+  BEST_KV="$TURBO_KV"
+fi
+
+# ── Summary ────────────────────────────────────────────────────────────────
+sep
+log "BENCHMARK COMPLETE [${VARIANT}] — $(date)"
+sep
+log ""
+log "  Optimal params for GTX 1650 Ti + Qwen3.5-9B Q4_K_M [${VARIANT}]:"
+log ""
+log "    ngl        : ${MAX_STABLE_NGL}"
+log "    ctx_size   : ${MAX_STABLE_CTX}"
+log "    kv_type    : ${BEST_KV}"
+log "    flash_attn : ${BEST_FA}"
+log "    batch_size : ${BEST_BATCH}"
+log "    ubatch     : ${BEST_UBATCH}"
+log ""
+log "  Full CSV: ${RESULTS_CSV}"
+log ""
+
+# Write recommended .env
+ENV_OUT="${OUTPUT_DIR}/${VARIANT}_recommended.env"
+cat > "$ENV_OUT" <<EOF
+# Generated by benchmark.sh [${VARIANT}] on $(date)
+LLAMA_N_GPU_LAYERS=${MAX_STABLE_NGL}
+LLAMA_CTX_SIZE=${MAX_STABLE_CTX}
+LLAMA_CACHE_TYPE_K=${BEST_KV}
+LLAMA_CACHE_TYPE_V=${BEST_KV}
+LLAMA_BATCH_SIZE=${BEST_BATCH}
+LLAMA_UBATCH_SIZE=${BEST_UBATCH}
+LLAMA_THREADS=${THREADS}
+LLAMA_THREADS_BATCH=${THREADS_BATCH}
+LLAMA_PARALLEL=1
+EOF
+log "  Recommended .env → ${ENV_OUT}"
+
+# ── Cross-variant comparison (if both results exist) ──────────────────────
+OFFICIAL_CSV=$(ls "${OUTPUT_DIR}"/official_results_*.csv 2>/dev/null | sort | tail -1 || true)
+TURBO_CSV=$(ls "${OUTPUT_DIR}"/turboquant_results_*.csv 2>/dev/null | sort | tail -1 || true)
+
+if [[ -n "$OFFICIAL_CSV" && -n "$TURBO_CSV" ]]; then
+  COMPARE_OUT="${OUTPUT_DIR}/comparison_$(date +%Y%m%d_%H%M%S).txt"
+  {
+    echo "======================================================================"
+    echo " OFFICIAL vs TURBOQUANT COMPARISON"
+    echo "======================================================================"
+    echo ""
+    echo "Official  CSV: $OFFICIAL_CSV"
+    echo "TurboQuant CSV: $TURBO_CSV"
+    echo ""
+    echo "KV type benchmark results (phase ph3_kv):"
+    echo ""
+    printf "%-12s  %-10s  %-10s  %-12s  %-12s\n" "variant" "kv_type" "ctx" "pp (t/s)" "tg (t/s)"
+    echo "----------------------------------------------------------------------"
+    for csv in "$OFFICIAL_CSV" "$TURBO_CSV"; do
+      awk -F',' '
+        NR>1 && $2 == "ph3_kv" {
+          printf "%-12s  %-10s  %-10s  %-12s  %-12s\n", $1, $5, $4, $11, $12
+        }
+      ' "$csv"
+    done
+    echo ""
+    echo "Winner by tg (generation speed):"
+    awk -F',' '
+      NR>1 && $2 == "ph3_kv" && $13 == "ok" {
+        key = $1 "," $5
+        val = $12+0
+        if (val > best[key]) { best[key] = val; row[key] = $0 }
+      }
+      END {
+        best_tg = 0; best_key = ""
+        for (k in best) { if (best[k] > best_tg) { best_tg = best[k]; best_key = k } }
+        n = split(best_key, a, ",")
+        printf "  %s with kv=%s → %.1f t/s\n", a[1], a[2], best_tg
+      }
+    ' "$OFFICIAL_CSV" "$TURBO_CSV"
+    echo "======================================================================"
+  } | tee "$COMPARE_OUT" | tee -a "$LOG"
+  echo ""
+  echo "Comparison report: $COMPARE_OUT"
+fi
+
+sep
+echo ""
+echo "=== RECOMMENDED .env [${VARIANT}] ==="
+cat "$ENV_OUT"
+
--- a/scripts/benchmark_models.sh
+++ b/scripts/benchmark_models.sh
@@ -0,0 +1,175 @@
+#!/bin/bash
+# Benchmark all 4 new models on GTX 1650 Ti (3717 MiB VRAM)
+# Priority: max context size > tg speed
+# Runs inside ghcr.io/ggml-org/llama.cpp:full-cuda (build b9014, no -c flag)
+#
+# Architecture context limits (from GGUF metadata):
+#   SmolLM3-3B   : 65536  (full attention, KV-limited to ~28K in practice)
+#   Gemma4-E2B   : 131072 (hybrid: sliding_window=512 → huge ctx possible)
+#   Gemma4-E4B   : 131072 (hybrid: sliding_window=512)
+#   Qwen3-4B     : 40960  (full attention, KV-limited to ~9K in practice)
+#
+# NOTE: llama-bench b9014 has NO -c flag. Context is set by -p (prompt tokens).
+#   -p N -n G allocates KV for N+G tokens. OOM = exit!=0 or error in stdout.
+
+set -uo pipefail
+
+M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
+M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
+M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
+M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
+
+# -- CSV column detection (called once on first successful output) --
+TS_COL=0; NG_COL=0; NP_COL=0
+
+detect_cols() {
+    local hdr
+    hdr=$(printf '%s\n' "$1" | sed 's/"//g' | grep '^build_commit' | head -1)
+    TS_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="avg_ts"){print i;exit}}')
+    NG_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_gen"){print i;exit}}')
+    NP_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_prompt"){print i;exit}}')
+    TS_COL=${TS_COL:-0}; NG_COL=${NG_COL:-0}; NP_COL=${NP_COL:-0}
+}
+
+# Returns "pp_speed pp / tg_speed tg t/s"
+parse_speeds() {
+    local out="$1"
+    [ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
+    local s pp tg
+    s=$(printf '%s\n' "$out" | sed 's/"//g')
+    pp=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \
+        'NR>1 && $np+0>0 && $ng+0==0 {printf "%.0f", $tc+0; exit}')
+    tg=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \
+        'NR>1 && $ng+0>0 && $np+0==0 {printf "%.1f", $tc+0; exit}')
+    printf "%s pp / %s tg t/s" "${pp:--}" "${tg:--}"
+}
+
+is_oom() {
+    local out="$1" ec="$2"
+    [ "$ec" -ne 0 ] && return 0
+    printf '%s\n' "$out" | grep -qiE "failed to create context|out of memory|GGML_ASSERT|error:" && return 0
+    return 1
+}
+
+# bench MODEL NGL [llama-bench extra args...]
+# Standard speed benchmark: -p 512 -n 128 small context
+bench() {
+    local model=$1 ngl=$2; shift 2
+    local out ec
+    out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \
+        -b 512 -ub 128 -o csv "$@" 2>&1)
+    ec=$?
+    if is_oom "$out" "$ec"; then echo "OOM"; return; fi
+    [ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
+    parse_speeds "$out"
+}
+
+# bench_ctx MODEL NGL CTX
+# Context-capacity test: allocates KV for CTX tokens via -p CTX -n 1
+# Tries fa=1 first, falls back to fa=0. Returns "OK (N pp t/s [fa=N])" or "OOM"
+bench_ctx() {
+    local model=$1 ngl=$2 ctx=$3
+    local out ec fa_used
+    for fa in 1 0; do
+        out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \
+            -p "$ctx" -n 1 -r 1 --no-warmup \
+            -b 512 -ub 128 -fa "$fa" -t 6 -o csv 2>&1)
+        ec=$?
+        is_oom "$out" "$ec" || { fa_used=$fa; break; }
+        [ "$fa" = "0" ] && { echo "OOM"; return; }
+    done
+    [ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
+    local pp
+    pp=$(printf '%s\n' "$out" | sed 's/"//g' | \
+        awk -F',' -v tc="$TS_COL" -v np="$NP_COL" \
+        'NR>1 && $np+0>0 {printf "%.0f", $tc+0; exit}')
+    printf "OK (%s pp t/s fa=%s)" "${pp:--}" "${fa_used:-?}"
+}
+
+HR="======================================================================"
+echo "$HR"
+echo "LLAMA.CPP BENCHMARK — ALL MODELS — $(date)"
+echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo unknown)"
+echo "$HR"
+echo ""
+
+# ── Phase 1: Baseline (small context) ────────────────────────────────────────
+echo "=== Phase 1: Baseline (ngl=99, p=512 n=128 r=2, t=6, fa=0) ==="
+for entry in "SmolLM3-3B:$M_SMOL" "Gemma4-E2B:$M_E2B" "Gemma4-E4B:$M_E4B" "Qwen3-4B:$M_Q3"; do
+    lbl="${entry%%:*}"; mdl="${entry#*:}"
+    printf "  %-14s  %s\n" "$lbl" "$(bench "$mdl" 99 -p 512 -n 128 -r 2 -t 6 -fa 0)"
+done
+echo ""
+
+# ── Phase 2: Gemma4-E4B ngl sweep ────────────────────────────────────────────
+echo "=== Phase 2: Gemma4-E4B ngl sweep (p=16 n=64 r=1 t=6 fa=0) ==="
+echo "    5.1GB model on 3.7GB VRAM — finding highest ngl before OOM"
+best_e4b_ngl=0
+for ngl in 0 4 8 12 16 20 24 28 32 36 42; do
+    ts=$(bench "$M_E4B" $ngl -p 16 -n 64 -r 1 -t 6 -fa 0)
+    printf "  ngl=%-3s  %s\n" "$ngl" "$ts"
+    [[ "$ts" == OOM ]] && break
+    best_e4b_ngl=$ngl
+done
+echo "  → best_e4b_ngl=$best_e4b_ngl"
+echo ""
+
+# ── Phase 3: Max context sweep ────────────────────────────────────────────────
+echo "=== Phase 3: Max context (p=ctx n=1 r=1 no-warmup fa=1) ==="
+echo "    Gemma4 hybrid attention (sliding_window=512) enables large ctx cheaply."
+declare -A BEST_CTX
+BEST_CTX[smollm3]=512; BEST_CTX[e2b]=512; BEST_CTX[e4b]=512; BEST_CTX[q3]=512
+
+for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
+             "e2b:Gemma4-E2B:$M_E2B:99" \
+             "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
+             "q3:Qwen3-4B:$M_Q3:99"; do
+    IFS=':' read -r key lbl mdl ngl <<< "$entry"
+    echo "  -- $lbl (ngl=$ngl) --"
+    for ctx in 512 1024 2048 4096 8192 12288 16384 24576 32768 49152 65536 98304 131072; do
+        ts=$(bench_ctx "$mdl" "$ngl" "$ctx")
+        printf "    ctx=%-7s  %s\n" "$ctx" "$ts"
+        [[ "$ts" == OOM ]] && break
+        BEST_CTX[$key]=$ctx
+    done
+    echo "    → MAX ctx=${BEST_CTX[$key]}"
+done
+echo ""
+
+# ── Phase 4: TG speed at max context ─────────────────────────────────────────
+echo "=== Phase 4: TG speed at max context (p=512 n=128 r=2 fa=1 t=6) ==="
+for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
+             "e2b:Gemma4-E2B:$M_E2B:99" \
+             "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
+             "q3:Qwen3-4B:$M_Q3:99"; do
+    IFS=':' read -r key lbl mdl ngl <<< "$entry"
+    ts=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6)
+    printf "  %-14s  max_ctx=%-7s  %s\n" "$lbl" "${BEST_CTX[$key]}" "$ts"
+done
+echo ""
+
+# ── Phase 5: E4B thread sweep (CPU split model — threads matter) ──────────────
+echo "=== Phase 5: Gemma4-E4B thread sweep (p=512 n=128 r=2 fa=0 ngl=$best_e4b_ngl) ==="
+for t in 1 2 3 4 5 6 8 10 12; do
+    ts=$(bench "$M_E4B" "$best_e4b_ngl" -p 512 -n 128 -r 2 -fa 0 -t "$t")
+    printf "  t=%-3s  %s\n" "$t" "$ts"
+done
+echo ""
+
+# ── Phase 6: Flash attention comparison ──────────────────────────────────────
+echo "=== Phase 6: Flash attention fa=0 vs fa=1 (p=512 n=128 r=2 t=6) ==="
+echo "    Gemma4 hybrid attention may not support FA — testing both."
+for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
+             "e2b:Gemma4-E2B:$M_E2B:99" \
+             "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
+             "q3:Qwen3-4B:$M_Q3:99"; do
+    IFS=':' read -r key lbl mdl ngl <<< "$entry"
+    ts0=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 0 -t 6)
+    ts1=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6)
+    printf "  %-14s  fa=0: %-30s  fa=1: %s\n" "$lbl" "$ts0" "$ts1"
+done
+echo ""
+
+echo "$HR"
+echo "BENCHMARK COMPLETE: $(date)"
+echo "$HR"
--- a/scripts/cpu_ctx_test.sh
+++ b/scripts/cpu_ctx_test.sh
@@ -0,0 +1,251 @@
+#!/bin/bash
+# cpu_ctx_test.sh v4 — -nkvo bigctx with TurboQuant image (FORCE_MMQ)
+# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
+#
+# Tests KV in RAM (-nkvo) with BOTH q4_0 and turbo2 KV types.
+# turbo2 = 2-bit KV (2x smaller than q4_0) → ~2x more context at same RAM budget.
+#
+# Speed model per token:
+#   GPU-compute models (smollm3/e2b/e4b/q3): bottleneck = PCIe KV reads
+#     t/s = 1000 / (gpu_ms + ctx * kv_bytes_per_token / PCIE_BPS * 1000)
+#   Qwen3.5-9B: bottleneck = RAM reads (21/32 layers on CPU, 8.86 GB model)
+#     t/s = 1000 / (1000/baseline + ctx * kv_bytes_per_token / RAM_BPS * 1000)
+#
+# Usage: bash /scripts/cpu_ctx_test.sh [smollm3|e2b|e4b|q3|qwen35q|all]
+
+set -uo pipefail
+
+TARGET="${1:-all}"
+TARGET_TPS=15
+CPU_THREADS=6
+BENCH_GEN=32
+PCIE_BW_GBPS=8.0   # PCIe x4 3.0 practical read BW (conservative)
+RAM_BW_GBPS=45.0   # RAM practical read BW (i7-10750H DDR4-2933)
+
+M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
+M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
+M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
+M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
+M_Q35="/models/Qwen3.5-9B.Q8_0.gguf"
+
+declare -A NGL_GPU=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11)
+# BW source: pcie for GPU-compute models, ram for qwen35-9b (CPU-compute bound)
+declare -A BW_GBPS=([smollm3]=$PCIE_BW_GBPS [e2b]=$PCIE_BW_GBPS [e4b]=$PCIE_BW_GBPS [q3]=$PCIE_BW_GBPS [qwen35q]=$RAM_BW_GBPS)
+declare -A BW_LABEL=([smollm3]="PCIe" [e2b]="PCIe" [e4b]="PCIe" [q3]="PCIe" [qwen35q]="RAM")
+
+# CTX candidates: larger now thanks to turbo2 (2x smaller KV vs q4_0)
+# Note: turbo2 is SKIPPED for Qwen3-4B (PPL explodes at ctx>=8192: +0.52 → +13 → +437)
+#       turbo2 is SKIPPED for Qwen3.5-9B (hybrid linear-attn incompatible with llama-perplexity;
+#       server works fine at 32K — this is a test-tool limitation, not a real issue)
+SMOL_CTXS=(32768 49152 65536 98304 131072 163840)
+E2B_CTXS=(32768 49152 65536 98304 131072 163840 196608 262144 393216)
+E4B_CTXS=(32768 49152 65536 98304 131072 163840)
+Q3_CTXS=(24576 32768 49152 65536 98304 131072)
+Q35_CTXS=(16384 32768 49152 65536 98304 131072)
+declare -A CTX_CANDIDATES=(
+    [smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS"
+    [q3]="Q3_CTXS" [qwen35q]="Q35_CTXS")
+
+# Pure-GPU ctx for gain comparison
+declare -A PURE_GPU_CTX=([smollm3]=24576 [e2b]=24576 [e4b]=24576 [q3]=16384 [qwen35q]=32768)
+
+GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; CYAN='\033[0;36m'; NC='\033[0m'
+HR="======================================================================"
+
+# Tiny alloc file — enough for 1 chunk, minimal compute time
+ALLOC_FILE="/tmp/kv_alloc_tiny.txt"
+python3 -c "
+sentences = [
+    'The transformer architecture uses self-attention mechanisms to process sequences.',
+    'Large language models require significant computational resources for training.',
+    'Quantization reduces memory usage by storing weights in lower precision formats.',
+    'Flash attention enables memory-efficient computation for long context windows.',
+    'The key-value cache stores intermediate attention states during generation.',
+]
+import random; random.seed(1)
+print(chr(10).join([random.choice(sentences) for _ in range(64)]))
+" > "$ALLOC_FILE"
+
+# check_alloc MODEL NGL KV CTX [EXTRA...]
+# Returns "<host_kv_mib>" on success, "OOM" on failure. Fast: <15s.
+check_alloc() {
+    local model=$1 ngl=$2 kv=$3 ctx=$4
+    shift 4
+    local extra_args=("$@")
+    local tmp_err; tmp_err=$(mktemp)
+
+    timeout 90 /app/llama-perplexity \
+        -m "$model" -ngl "$ngl" \
+        -fa on -nkvo \
+        -c "$ctx" -ctk "$kv" -ctv "$kv" \
+        -f "$ALLOC_FILE" --chunks 1 \
+        "${extra_args[@]}" \
+        > /dev/null 2>"$tmp_err"
+    local rc=$?
+
+    local err; err=$(cat "$tmp_err"); rm -f "$tmp_err"
+
+    if grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory\|cannot create buffer" <<< "$err"; then
+        echo "OOM"; return 1
+    fi
+
+    # Parse Host context MiB: "| Host | total = model + context + compute |"
+    local host_ctx_mib
+    host_ctx_mib=$(grep "Host" <<< "$err" | \
+        grep -oP "=\s*\d+\s*\+\s*\K\d+(?=\s*\+)" | head -1 || true)
+    echo "${host_ctx_mib:-?}"
+}
+
+# measure_baseline_tps MODEL NGL [EXTRA...]
+measure_baseline_tps() {
+    local model=$1 ngl=$2
+    shift 2
+    local extra_args=("$@")
+    local raw
+    raw=$(timeout 120 /app/llama-bench \
+        -m "$model" -ngl "$ngl" -t "$CPU_THREADS" \
+        -p 1 -n "$BENCH_GEN" \
+        -ctk q4_0 -ctv q4_0 -nkvo 1 -fa 1 -r 1 -o csv \
+        "${extra_args[@]}" 2>/dev/null) || true
+    printf '%s\n' "$raw" | sed 's/"//g' | \
+        awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}'
+}
+
+# estimate_tps BASELINE_TPS KV_PER_TOKEN_MIB CTX BW_GBPS
+estimate_tps() {
+    local baseline_tps=$1 kv_per_token_mib=$2 ctx=$3 bw_gbps=$4
+    python3 -c "
+baseline = float('$baseline_tps')
+kv_tok_bytes = float('$kv_per_token_mib') * 1024 * 1024
+bps = float('$bw_gbps') * 1e9
+ctx = int('$ctx')
+base_ms = 1000.0 / baseline
+kv_ms = ctx * kv_tok_bytes / bps * 1000
+print(f'{1000.0 / (base_ms + kv_ms):.1f}')
+" 2>/dev/null || echo "?"
+}
+
+# ---------------------------------------------------------------------------
+echo "$HR"
+echo "CPU-RAM KV CONTEXT TEST v4 (-nkvo, TurboQuant FORCE_MMQ) -- $(date)"
+echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)"
+echo "KV types tested: q4_0 (4-bit) and turbo2 (2-bit, 2x smaller → 2x more ctx)"
+printf "PCIe assumption: %.1f GB/s  |  RAM assumption: %.1f GB/s\n" "$PCIE_BW_GBPS" "$RAM_BW_GBPS"
+echo "$HR"
+echo ""
+
+declare -a SUMMARY=()
+
+for entry in \
+    "smollm3:SmolLM3-3B:$M_SMOL" \
+    "e2b:Gemma4-E2B:$M_E2B" \
+    "e4b:Gemma4-E4B:$M_E4B" \
+    "q3:Qwen3-4B:$M_Q3" \
+    "qwen35q:Qwen3.5-9B:$M_Q35"
+do
+    IFS=':' read -r key lbl model <<< "$entry"
+    [[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue
+
+    eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")"
+    ngl="${NGL_GPU[$key]}"
+    bw_gbps="${BW_GBPS[$key]}"
+    bw_label="${BW_LABEL[$key]}"
+
+    # turbo2 incompatible with Qwen3-4B (quality fails at ctx>=8192)
+    # turbo2 alloc works for Qwen3.5-9B but quality measurement unreliable — test q4_0 only
+    if [[ "$key" == "q3" || "$key" == "qwen35q" ]]; then
+        kv_types_to_test=(q4_0)
+    else
+        kv_types_to_test=(q4_0 turbo2)
+    fi
+
+    extra_args=()
+
+    printf "${BLUE}=== %s (ngl=%s, BW model: %s %.0f GB/s) ===${NC}\n" \
+        "$lbl" "$ngl" "$bw_label" "$bw_gbps"
+
+    # Baseline t/s (empty KV, with q4_0 -nkvo — upper bound)
+    printf "  Measuring baseline t/s (empty KV, p=1)... "
+    baseline_tps=$(measure_baseline_tps "$model" "$ngl" "${extra_args[@]}")
+    if [[ -z "$baseline_tps" ]]; then
+        printf "${RED}FAIL${NC}\n\n"
+        SUMMARY+=("$lbl|FAIL|FAIL|FAIL|FAIL|FAIL")
+        continue
+    fi
+    printf "${GREEN}%s t/s${NC}\n\n" "$baseline_tps"
+
+    # Header
+    printf "  %-10s  %-12s  %-12s  %-12s  %-12s  %-12s  %-12s\n" \
+        "ctx" "KV type" "KV in RAM" "kv/tok" "t/s@25%" "t/s@50%" "t/s@100%"
+    printf "  %-10s  %-12s  %-12s  %-12s  %-12s  %-12s  %-12s\n" \
+        "---" "-------" "---------" "------" "-------" "-------" "--------"
+
+    max_ctx_q4=""
+    max_ctx_t2=""
+    rec_q4=""
+    rec_t2=""
+    declare -A kv_ref_mib=()
+
+    for ctx in "${ctxs[@]}"; do
+        for kv_type in "${kv_types_to_test[@]}"; do
+            result=$(check_alloc "$model" "$ngl" "$kv_type" "$ctx" "${extra_args[@]}")
+            if [[ "$result" == "OOM" ]]; then
+                printf "  ${RED}%-10s  %-12s  OOM${NC}\n" "$ctx" "$kv_type"
+                continue
+            fi
+
+            host_kv_mib="${result}"
+            [[ "$kv_type" == "q4_0" ]] && max_ctx_q4=$ctx || max_ctx_t2=$ctx
+
+            # KV per token
+            if [[ "$host_kv_mib" =~ ^[0-9]+$ ]]; then
+                kv_per_token_mib=$(python3 -c "print(f'{$host_kv_mib / $ctx:.6f}')")
+                kv_ref_mib[$kv_type]=$kv_per_token_mib
+            else
+                kv_per_token_mib="${kv_ref_mib[$kv_type]:-?}"
+            fi
+
+            tps25=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 4 ))" "$bw_gbps")
+            tps50=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 2 ))" "$bw_gbps")
+            tps100=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$ctx" "$bw_gbps")
+
+            meets=$(python3 -c "print(1 if '$tps50' != '?' and float('$tps50') >= $TARGET_TPS else 0)" 2>/dev/null || echo 0)
+            [[ "$kv_type" == "q4_0" && "$meets" == "1" ]] && rec_q4=$ctx
+            [[ "$kv_type" == "turbo2" && "$meets" == "1" ]] && rec_t2=$ctx
+
+            color=$([[ "$meets" == "1" ]] && echo "$GREEN" || echo "$YELLOW")
+            printf "  ${color}%-10s${NC}  %-12s  %-12s  %-12s  %-12s  ${color}%-12s${NC}  %-12s\n" \
+                "$ctx" "$kv_type" "${host_kv_mib}MiB" "${kv_per_token_mib}MiB" \
+                "$tps25" "$tps50" "$tps100"
+        done
+    done
+
+    rec_q4="${rec_q4:-$max_ctx_q4}"
+    rec_t2="${rec_t2:-$max_ctx_t2}"
+    pg="${PURE_GPU_CTX[$key]}"
+
+    printf "\n  Recommended ctx (>=%s t/s@50%%): q4_0=%s  turbo2=%s  (pure-GPU was %s)\n\n" \
+        "$TARGET_TPS" "${rec_q4:-FAIL}" "${rec_t2:-FAIL}" "$pg"
+
+    gain_q4=$([[ -n "${rec_q4:-}" && "${rec_q4:-}" != "FAIL" ]] && echo "$((rec_q4 - pg))" || echo "?")
+    gain_t2=$([[ -n "${rec_t2:-}" && "${rec_t2:-}" != "FAIL" ]] && echo "$((rec_t2 - pg))" || echo "?")
+    SUMMARY+=("$lbl|$baseline_tps|${max_ctx_q4:-OOM}|${rec_q4:-FAIL}|${max_ctx_t2:-OOM}|${rec_t2:-FAIL}|$gain_q4|$gain_t2")
+
+    unset kv_ref_mib max_ctx_q4 max_ctx_t2 rec_q4 rec_t2
+done
+
+echo "$HR"
+echo "SUMMARY — -nkvo (KV in RAM): q4_0 vs turbo2"
+echo "$HR"
+printf "%-16s  %-12s  %-14s  %-14s  %-14s  %-14s\n" \
+    "Model" "Baseline t/s" "q4_0 max" "q4_0 rec" "turbo2 max" "turbo2 rec"
+printf "%-16s  %-12s  %-14s  %-14s  %-14s  %-14s\n" \
+    "-----" "------------" "--------" "--------" "----------" "----------"
+for row in "${SUMMARY[@]}"; do
+    IFS='|' read -r lbl btps max_q4 rec_q4 max_t2 rec_t2 g_q4 g_t2 <<< "$row"
+    printf "${GREEN}%-16s  %-12s  %-14s  %-14s  %-14s  %-14s  [q4+%s / t2+%s vs pure-GPU]${NC}\n" \
+        "$lbl" "$btps" "$max_q4" "$rec_q4" "$max_t2" "$rec_t2" "$g_q4" "$g_t2"
+done
+echo "$HR"
+echo "Note: Qwen3.5-9B baseline already <15 t/s (RAM-bound, 8.86 GB model). BW model uses RAM not PCIe."
+echo "$HR"
--- a/scripts/download_models.sh
+++ b/scripts/download_models.sh
@@ -0,0 +1,116 @@
+#!/usr/bin/env bash
+# download_models.sh — Download GGUF model files to ./models/
+#
+# Usage:
+#   bash scripts/download_models.sh           # all models
+#   bash scripts/download_models.sh smollm3   # single model
+#   bash scripts/download_models.sh gemma4-e2b gemma4-e4b  # multiple
+#
+# Requires: huggingface-cli (pip install huggingface_hub)
+# Models land in: ./models/
+#
+# Available keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
+
+set -euo pipefail
+
+MODELS_DIR="$(cd "$(dirname "$0")/.." && pwd)/models"
+mkdir -p "$MODELS_DIR"
+
+GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'; NC='\033[0m'
+
+check_hf_cli() {
+    if ! command -v huggingface-cli &>/dev/null; then
+        echo -e "${RED}Error: huggingface-cli not found.${NC}"
+        echo "Install with: pip install huggingface_hub"
+        exit 1
+    fi
+}
+
+download() {
+    local key="$1"
+    local repo="$2"
+    local filename="$3"
+    local size_hint="$4"
+
+    local dest="$MODELS_DIR/$filename"
+    if [[ -f "$dest" ]]; then
+        echo -e "${YELLOW}[$key]${NC} Already exists: $filename — skipping"
+        return
+    fi
+
+    echo -e "${GREEN}[$key]${NC} Downloading $filename (~$size_hint) from $repo ..."
+    huggingface-cli download "$repo" "$filename" --local-dir "$MODELS_DIR"
+    echo -e "${GREEN}[$key]${NC} Done: $MODELS_DIR/$filename"
+}
+
+download_smollm3() {
+    download "smollm3" \
+        "bartowski/HuggingFaceTB_SmolLM3-3B-GGUF" \
+        "HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" \
+        "1.9 GB"
+}
+
+download_gemma4_e2b() {
+    download "gemma4-e2b" \
+        "bartowski/google_gemma-4-E2B-it-GGUF" \
+        "google_gemma-4-E2B-it-Q4_K_M.gguf" \
+        "2.9 GB"
+}
+
+download_gemma4_e4b() {
+    download "gemma4-e4b" \
+        "bartowski/google_gemma-4-E4B-it-GGUF" \
+        "google_gemma-4-E4B-it-Q4_K_M.gguf" \
+        "4.7 GB"
+}
+
+download_qwen3_4b() {
+    download "qwen3-4b" \
+        "bartowski/Qwen3-4B-GGUF" \
+        "Qwen3-4B-Q4_K_M.gguf" \
+        "2.4 GB"
+}
+
+download_qwen35_9b() {
+    download "qwen35-9b" \
+        "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF" \
+        "Qwen3.5-9B.Q8_0.gguf" \
+        "8.9 GB"
+}
+
+main() {
+    check_hf_cli
+
+    local targets=("$@")
+    if [[ ${#targets[@]} -eq 0 || "${targets[0]}" == "all" ]]; then
+        targets=(smollm3 gemma4-e2b gemma4-e4b qwen3-4b qwen35-9b)
+    fi
+
+    for target in "${targets[@]}"; do
+        case "$target" in
+            smollm3)     download_smollm3 ;;
+            gemma4-e2b)  download_gemma4_e2b ;;
+            gemma4-e4b)  download_gemma4_e4b ;;
+            qwen3-4b)    download_qwen3_4b ;;
+            qwen35-9b)   download_qwen35_9b ;;
+            all)
+                download_smollm3
+                download_gemma4_e2b
+                download_gemma4_e4b
+                download_qwen3_4b
+                download_qwen35_9b
+                ;;
+            *)
+                echo -e "${RED}Unknown model: $target${NC}"
+                echo "Valid keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all"
+                exit 1
+                ;;
+        esac
+    done
+
+    echo ""
+    echo "Models directory:"
+    ls -lh "$MODELS_DIR"/*.gguf 2>/dev/null || echo "(no .gguf files found)"
+}
+
+main "$@"
--- a/scripts/kv_quant_test.sh
+++ b/scripts/kv_quant_test.sh
@@ -0,0 +1,246 @@
+#!/bin/bash
+# KV cache quantization test using llama-perplexity.
+# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq (FORCE_MMQ, turbo2/3/4 support)
+#
+# Tests KV types: f16 (baseline) + q8_0/q4_0/turbo2 for Q4_K_M models
+#                 f16 (baseline) + turbo2/3/4 for Qwen3.5-9B Q8_0
+# Quality gate: PPL delta vs f16 < 0.5 (lossless for practical use)
+#
+# Usage: bash /scripts/kv_quant_test.sh [MODEL_KEY]
+#   MODEL_KEY: smollm3 | e2b | e4b | q3 | qwen35q | all (default: all)
+
+set -uo pipefail
+
+TARGET="${1:-all}"
+
+M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
+M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
+M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
+M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
+M_Q35="/models/Qwen3.5-9B.Q8_0.gguf"
+
+declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11)
+declare -A BASE_CTX=([smollm3]=18432 [e2b]=32768 [e4b]=20480 [q3]=8192 [qwen35q]=8192)
+declare -A PPL_TIMEOUT=([smollm3]=300 [e2b]=300 [e4b]=300 [q3]=300 [qwen35q]=600)
+
+# Per-model KV types to test (f16 is always the baseline)
+# Standard Q4_K_M models: q8_0/q4_0 + turbo2 (all supported by TurboQuant image)
+# Qwen3.5-9B: designed for turbo KV — test turbo2/3/4 only (q4_0 would also work but less relevant)
+declare -A MODEL_KV_TYPES=(
+    [smollm3]="q8_0 q4_0 turbo2"
+    [e2b]="q8_0 q4_0 turbo2"
+    [e4b]="q8_0 q4_0 turbo2"
+    [q3]="q8_0 q4_0 turbo2"
+    [qwen35q]="turbo2 turbo3 turbo4"
+)
+
+# ctx candidates per model
+SMOL_CTXS=(8192 12288 16384 18432 20480 24576 32768 40960 49152)
+E2B_CTXS=(8192 16384 24576 32768 40960 49152 65536)
+E4B_CTXS=(8192 12288 16384 20480 24576 32768 40960)
+Q3_CTXS=(4096 6144 8192 10240 12288 16384 24576 32768)
+Q35_CTXS=(4096 8192 16384 24576 32768 40960 49152)
+declare -A CTX_CANDIDATES=(
+    [smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS"
+    [q3]="Q3_CTXS" [qwen35q]="Q35_CTXS")
+
+GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
+HR="======================================================================"
+
+# Synthetic PPL file — 4000 lines, deterministic, no network needed
+PPL_FILE="/tmp/kv_ppl_input.txt"
+ensure_ppl_file() {
+    [[ -f "$PPL_FILE" ]] && return
+    python3 - << 'PY'
+import random, sys
+random.seed(42)
+sentences = [
+    "The transformer architecture uses self-attention mechanisms to process sequences.",
+    "Large language models require significant computational resources for training.",
+    "Quantization reduces memory usage by storing weights in lower precision formats.",
+    "Flash attention enables memory-efficient computation for long context windows.",
+    "The key-value cache stores intermediate attention states during generation.",
+    "Context length determines how many tokens the model can attend to simultaneously.",
+    "Perplexity measures how well a probability model predicts a sample of text.",
+    "Lower perplexity values indicate better language modeling performance overall.",
+    "GPU memory bandwidth is the primary bottleneck for autoregressive token generation.",
+    "Grouped query attention reduces KV cache size by sharing keys across head groups.",
+    "Rotary position embeddings encode relative position information in attention queries.",
+    "Mixture of experts models route tokens through specialized feed-forward networks.",
+    "Continuous batching allows servers to process multiple requests simultaneously.",
+    "KV cache quantization trades a small quality loss for significantly larger contexts.",
+]
+lines = [random.choice(sentences) for _ in range(4000)]
+print('\n'.join(lines), file=open('/tmp/kv_ppl_input.txt', 'w'))
+PY
+}
+
+# run_ppl MODEL NGL KV CTX TIMEOUT [EXTRA_ARGS...]
+# Echoes PPL value on stdout, returns 0 on success, 1 on OOM/crash.
+run_ppl() {
+    local model=$1 ngl=$2 kv=$3 ctx=$4 timeout_s=$5
+    shift 5
+    local extra_args=("$@")
+
+    local tmp_err; tmp_err=$(mktemp)
+    local ppl_out; ppl_out=$(mktemp)
+
+    timeout "$timeout_s" /app/llama-perplexity \
+        -m "$model" \
+        -ngl "$ngl" \
+        -fa on \
+        -c "$ctx" \
+        -ctk "$kv" -ctv "$kv" \
+        -f "$PPL_FILE" \
+        --chunks 1 \
+        "${extra_args[@]}" \
+        > "$ppl_out" 2>"$tmp_err"
+    local ppl_rc=$?
+
+    local err; err=$(cat "$tmp_err"); rm -f "$tmp_err"
+
+    if [[ "$ppl_rc" != "0" ]] || \
+       grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory" <<< "$err"; then
+        rm -f "$ppl_out"
+        return 1
+    fi
+
+    local ppl_val
+    ppl_val=$(grep -oP '\[\d+\]\K[0-9.]+' "$ppl_out" | tail -1)
+    rm -f "$ppl_out"
+    [[ -z "$ppl_val" ]] && return 1
+    echo "$ppl_val"
+}
+
+# ---------------------------------------------------------------------------
+ensure_ppl_file
+
+echo "$HR"
+echo "KV CACHE QUANT TEST (llama-perplexity) — TurboQuant image (FORCE_MMQ SM75)"
+echo "$(date)"
+echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)"
+echo "$HR"
+echo "Standard models: f16 baseline | q8_0 | q4_0 | turbo2 (2-bit, 8x smaller KV vs f16)"
+echo "Qwen3.5-9B:      f16 baseline | turbo2 | turbo3 | turbo4 (TurboQuant KV types)"
+echo "Quality gate: PPL delta vs f16 < 0.5"
+echo ""
+
+declare -a SUMMARY=()
+
+for entry in \
+    "smollm3:SmolLM3-3B:$M_SMOL" \
+    "e2b:Gemma4-E2B:$M_E2B" \
+    "e4b:Gemma4-E4B:$M_E4B" \
+    "q3:Qwen3-4B:$M_Q3" \
+    "qwen35q:Qwen3.5-9B:$M_Q35"
+do
+    IFS=':' read -r key lbl model <<< "$entry"
+    [[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue
+
+    eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")"
+    ngl="${NGL[$key]}"
+    timeout_s="${PPL_TIMEOUT[$key]}"
+    IFS=' ' read -ra kv_types <<< "${MODEL_KV_TYPES[$key]}"
+
+    # Extra args for qwen35-9b (flash attn already set; no mlock needed for PPL correctness)
+    extra_args=()
+
+    printf "${BLUE}=== %s (base ctx=%s, ngl=%s) ===${NC}\n" \
+        "$lbl" "${BASE_CTX[$key]}" "$ngl"
+
+    # Dynamic header based on KV types for this model
+    printf "  %-10s  %-18s" "ctx" "f16 (PPL)"
+    for kv in "${kv_types[@]}"; do
+        printf "  %-20s" "$kv (PPL/delta)"
+    done
+    printf "\n"
+    printf "  %-10s  %-18s" "---" "---------"
+    for kv in "${kv_types[@]}"; do
+        printf "  %-20s" "--------------------"
+    done
+    printf "\n"
+
+    declare -A best_ctx_per_kv=([f16]="${BASE_CTX[$key]}")
+    for kv in "${kv_types[@]}"; do best_ctx_per_kv[$kv]="${BASE_CTX[$key]}"; done
+    declare -A oom_kv=([f16]=0)
+    for kv in "${kv_types[@]}"; do oom_kv[$kv]=0; done
+    declare -A ppl_f16_at_ctx=()
+
+    for ctx in "${ctxs[@]}"; do
+        printf "  %-10s" "$ctx"
+
+        # f16 baseline
+        f16_ppl=""
+        if [[ "${oom_kv[f16]}" == "1" ]]; then
+            printf "  ${RED}%-18s${NC}" "OOM"
+        else
+            f16_ppl=$(run_ppl "$model" "$ngl" "f16" "$ctx" "$timeout_s" "${extra_args[@]}")
+            if [[ $? -ne 0 ]]; then
+                printf "  ${RED}%-18s${NC}" "OOM"
+                oom_kv[f16]=1
+            else
+                printf "  ${GREEN}%-18s${NC}" "$f16_ppl"
+                best_ctx_per_kv[f16]=$ctx
+                ppl_f16_at_ctx[$ctx]=$f16_ppl
+            fi
+        fi
+
+        # KV type columns
+        for kv in "${kv_types[@]}"; do
+            if [[ "${oom_kv[$kv]}" == "1" ]]; then
+                printf "  ${RED}%-20s${NC}" "OOM"
+                continue
+            fi
+            ppl=$(run_ppl "$model" "$ngl" "$kv" "$ctx" "$timeout_s" "${extra_args[@]}")
+            if [[ $? -ne 0 ]]; then
+                printf "  ${RED}%-20s${NC}" "OOM"
+                oom_kv[$kv]=1
+                continue
+            fi
+            best_ctx_per_kv[$kv]=$ctx
+
+            if [[ -n "$f16_ppl" ]]; then
+                delta=$(python3 -c "print(f'{float(\"$ppl\")-float(\"$f16_ppl\"):+.2f}')" 2>/dev/null || echo "?")
+                ok=$(python3 -c "exit(0 if abs(float('$ppl')-float('$f16_ppl'))<0.5 else 1)" 2>/dev/null && echo ok || echo bad)
+                if [[ "$ok" == "ok" ]]; then
+                    printf "  ${GREEN}%-20s${NC}" "${ppl}(${delta})"
+                else
+                    printf "  ${YELLOW}%-20s${NC}" "${ppl}(${delta})"
+                fi
+            else
+                printf "  ${GREEN}%-20s${NC}" "$ppl"
+            fi
+        done
+        echo ""
+    done
+
+    echo ""
+
+    # Best recommendation: highest ctx where all non-f16 types passed quality gate
+    overall_best_ctx="${BASE_CTX[$key]}"
+    overall_best_kv="f16"
+    for kv in "${kv_types[@]}"; do
+        bctx="${best_ctx_per_kv[$kv]}"
+        SUMMARY+=("$lbl|$kv|$bctx")
+        if [[ "$bctx" -gt "$overall_best_ctx" ]]; then
+            overall_best_ctx=$bctx; overall_best_kv=$kv
+        fi
+    done
+    SUMMARY+=("$lbl|f16|${best_ctx_per_kv[f16]}")
+    printf "  ${GREEN}Best: %s → max ctx %s${NC}\n\n" "$overall_best_kv" "$overall_best_ctx"
+
+    unset best_ctx_per_kv oom_kv ppl_f16_at_ctx
+done
+
+echo "$HR"
+echo "SUMMARY"
+echo "$HR"
+printf "%-16s  %-8s  %s\n" "Model" "KV" "Max Ctx (no OOM + PPL delta<0.5)"
+printf "%-16s  %-8s  %s\n" "-----" "--" "---------------------------------"
+for row in "${SUMMARY[@]}"; do
+    IFS='|' read -r lbl kv ctx <<< "$row"
+    printf "${GREEN}%-16s  %-8s  %s${NC}\n" "$lbl" "$kv" "$ctx"
+done
+echo "$HR"
+echo "Reminder: update envs/.env.<model>: CACHE_TYPE_K/V=<best_kv>  CTX_SIZE=<max_ctx>"
+echo "$HR"
--- a/scripts/quality_test.sh
+++ b/scripts/quality_test.sh
@@ -0,0 +1,215 @@
+#!/bin/bash
+# Quality tests for all 4 models — runs inside full-cuda container.
+# Tests: coding tasks + needle-in-haystack at 1K/8K ctx.
+#
+# Inference parameters sourced from official HF model cards:
+#   SmolLM3:  /no_think in SYSTEM prompt (-sys); temp=0.6 top_p=0.95
+#   Qwen3:    /no_think in SYSTEM prompt (-sys); temp=0.7 top_p=0.8 top_k=20
+#             DO NOT use greedy (temp=0) — causes endless repetition per Qwen3 docs
+#   Gemma4:   No thinking mode; temp=0.7 top_p=0.95
+
+set -uo pipefail
+
+M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
+M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
+M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
+M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
+
+declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99)
+declare -A MAX_CTX=([smollm3]=24576 [e2b]=32768 [e4b]=24576 [q3]=8192)
+
+# Per-model sampling params (HF model card sources)
+declare -A TEMP=([smollm3]="0.6"  [e2b]="0.7"  [e4b]="0.7"  [q3]="0.7")
+declare -A TOPP=([smollm3]="0.95" [e2b]="0.95" [e4b]="0.95" [q3]="0.8")
+declare -A TOPK=([smollm3]="0"    [e2b]="0"    [e4b]="0"    [q3]="20")
+# /no_think in system prompt disables thinking for SmolLM3 and Qwen3
+declare -A SYSP=([smollm3]="/no_think" [e2b]="" [e4b]="" [q3]="/no_think")
+
+PASS=0; FAIL=0; TOTAL=0
+
+GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; NC='\033[0m'
+
+# sed script to strip llama-cli interactive UI banner from stdout.
+# ▄ (U+2584) and █ (U+2588) appear in the llama.cpp ASCII logo, sometimes
+# with leading spaces — match anywhere on the line to be safe.
+STRIP_BANNER='/^$/d
+/^Loading model/d
+/^[[:space:]]*$/d
+/[▄█]/d
+/^build /d
+/^model /d
+/^modalities/d
+/^available commands/d
+/^  \//d
+/^\[ Prompt:/d
+/^\[  Prompt:/d
+/^Exiting/d
+/^> /d
+'
+
+check() {
+    local lbl="$1" out="$2"
+    shift 2
+    local patterns=("$@")
+    local ok=1
+    for pat in "${patterns[@]}"; do
+        printf '%s\n' "$out" | grep -qiE "$pat" || { ok=0; break; }
+    done
+    TOTAL=$((TOTAL+1))
+    if [ "$ok" = "1" ]; then
+        PASS=$((PASS+1)); printf "  ${GREEN}PASS${NC} %s\n" "$lbl"
+    else
+        FAIL=$((FAIL+1)); printf "  ${RED}FAIL${NC} %s\n" "$lbl"
+        printf '%s\n' "$out" | grep -v '^$' | tail -3 | sed 's/^/       | /'
+    fi
+}
+
+# Strip thinking blocks from output.
+# Gemma4 uses [Start thinking]...[End thinking].
+# Qwen3/SmolLM3 use <think>...</think>.
+# Match to end-of-string as fallback for truncated/incomplete blocks.
+strip_think() {
+    python3 -c "
+import sys, re
+t = sys.stdin.read()
+# Only strip COMPLETE blocks. If thinking hit token limit, leave as-is so
+# check patterns can still match reasoning content inside the block.
+t = re.sub(r'\[Start thinking\].*?\[End thinking\]', '', t, flags=re.DOTALL)
+t = re.sub(r'<think>.*?</think>', '', t, flags=re.DOTALL)
+print(t.strip())
+" 2>/dev/null || cat
+}
+
+# run KEY MODEL PROMPT MAX_TOKENS [SYS_OVERRIDE]
+# SYS_OVERRIDE defaults to SYSP[$key] if omitted.
+# Pass "" explicitly to disable system prompt (thinking ON for Qwen3/SmolLM3).
+# Thinking params: SmolLM3/Qwen3 thinking=temp0.6/top_p0.95, nothink=model defaults.
+run() {
+    local key=$1 model=$2 prompt=$3 max_tok=$4
+    local ngl="${NGL[$key]}"
+    # 5th arg overrides sys; if not provided, use SYSP[$key]
+    local use_sys
+    if [ "${5+x}" = "x" ]; then use_sys="$5"; else use_sys="${SYSP[$key]}"; fi
+    # choose sampling params: thinking mode uses 0.6/0.95, non-think uses model defaults
+    local temp topp topk
+    if [ -z "$use_sys" ] && [[ "$key" == "smollm3" || "$key" == "q3" ]]; then
+        temp="0.6"; topp="0.95"; topk="${TOPK[$key]}"
+    else
+        temp="${TEMP[$key]}"; topp="${TOPP[$key]}"; topk="${TOPK[$key]}"
+    fi
+    local sys_arg=()
+    [ -n "$use_sys" ] && sys_arg=(-sys "$use_sys")
+    local topk_arg=()
+    [ "$topk" != "0" ] && topk_arg=(--top-k "$topk")
+    timeout 300 /app/llama-cli -m "$model" -ngl "$ngl" \
+        -n "$max_tok" --temp "$temp" --top-p "$topp" "${topk_arg[@]}" \
+        --repeat-penalty 1.1 -fa on --mmap --single-turn \
+        "${sys_arg[@]}" -p "$prompt" 2>/dev/null \
+    | sed "$STRIP_BANNER" \
+    | strip_think
+}
+
+# needle_test KEY MODEL NEEDLE CTX
+# Generates ~CTX tokens of filler, plants needle in middle, asks to recall it.
+needle_test() {
+    local key=$1 model=$2 needle=$3 ctx=$4
+    local ngl="${NGL[$key]}"
+    local temp="${TEMP[$key]}" topp="${TOPP[$key]}" sys="${SYSP[$key]}"
+    local sys_arg=()
+    [ -n "$sys" ] && sys_arg=(-sys "$sys")
+
+    # filler: ctx/2 tokens each side, 1 token ~4 chars
+    local half_chars=$(( ctx * 2 ))
+    local reps=$(( half_chars / 45 + 2 ))
+    local filler
+    filler=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * $reps)" 2>/dev/null \
+        | head -c "$half_chars")
+
+    local prompt
+    printf -v prompt \
+        '%s\nSECRET_VALUE=%s\n%s\nWhat is SECRET_VALUE? Reply with only the value, nothing else.' \
+        "$filler" "$needle" "$filler"
+
+    local ctx_size=$(( ctx + 512 ))
+    local out
+    out=$(timeout 180 /app/llama-cli -m "$model" -ngl "$ngl" \
+        -n 512 --temp "$temp" --top-p "$topp" \
+        -fa on --mmap --single-turn \
+        -c "$ctx_size" "${sys_arg[@]}" -p "$prompt" 2>/dev/null \
+    | sed "$STRIP_BANNER" \
+    | strip_think)
+
+    # join lines before grep in case model breaks needle across newlines
+    local flat
+    flat=$(printf '%s' "$out" | tr '\n' ' ')
+    if printf '%s' "$flat" | grep -qF "$needle"; then
+        echo "FOUND"
+    else
+        local snip
+        snip=$(printf '%s' "$flat" | cut -c1-80)
+        echo "MISSED (${snip:-<empty>})"
+    fi
+}
+
+HR="======================================================================"
+echo "$HR"
+echo "QUALITY TESTS — ALL MODELS — $(date)"
+echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)"
+echo "$HR"
+printf "Temps: SmolLM3=0.6/0.95 | Qwen3=0.7/0.8/k20 | Gemma4=0.7/0.95\n"
+printf "/no_think via -sys for needle tests | thinking ON for coding bug test\n\n"
+
+CODING_FIZZBUZZ='Write ONLY the Python function fizzbuzz(n). It returns a list where multiples of 3 are "Fizz", multiples of 5 are "Buzz", multiples of both are "FizzBuzz", others are the number as string. Output code only, no prose.'
+
+# hi is correctly len(arr)-1 to have ONE unambiguous bug: lo=mid (infinite loop)
+CODING_BUG='Find the bug in this Python function and explain it in one sentence:
+def binary_search(arr, target):
+    lo, hi = 0, len(arr) - 1
+    while lo < hi:
+        mid = (lo + hi) // 2
+        if arr[mid] == target:
+            return mid
+        elif arr[mid] < target:
+            lo = mid
+        else:
+            hi = mid
+    return -1'
+
+for entry in "smollm3:SmolLM3-3B:$M_SMOL" "e2b:Gemma4-E2B:$M_E2B" "e4b:Gemma4-E4B:$M_E4B" "q3:Qwen3-4B:$M_Q3"; do
+    IFS=':' read -r key lbl model <<< "$entry"
+    echo "=== $lbl ==="
+
+    # Coding test 1: FizzBuzz — expect def + Fizz + Buzz
+    out=$(run "$key" "$model" "$CODING_FIZZBUZZ" 512)
+    check "FizzBuzz: def + Fizz + Buzz in output" "$out" \
+        "def " "Fizz" "Buzz"
+
+    # Coding test 2: Bug — thinking ON for all models (more reliable reasoning).
+    # Pass "" to disable /no_think override. Gemma4 already thinks by default.
+    out=$(run "$key" "$model" "$CODING_BUG" 3000 "")
+    check "Bug: identify lo=mid / infinite loop" "$out" \
+        "lo.*=.*mid.*\+.*1|lo\+1|infinite loop|never.*advance|never.*progress|stuck|lo should be|lo\b.*never.*incr"
+
+    # Needle-in-haystack
+    NEEDLE="QX7-ALPHA-9"
+    # strict < so we skip when ctx == max_ctx (prompt fills entire context, no room for output)
+    for ctx in 1024 8192; do
+        if [ "$ctx" -lt "${MAX_CTX[$key]}" ]; then
+            result=$(needle_test "$key" "$model" "$NEEDLE" "$ctx")
+            TOTAL=$((TOTAL+1))
+            if [[ "$result" == FOUND ]]; then
+                PASS=$((PASS+1)); printf "  ${GREEN}PASS${NC} Needle @ %s tok: %s\n" "$ctx" "$result"
+            else
+                FAIL=$((FAIL+1)); printf "  ${RED}FAIL${NC} Needle @ %s tok: %s\n" "$ctx" "$result"
+            fi
+        else
+            printf "  ${YELLOW}SKIP${NC} Needle @ %s tok (exceeds model max %s)\n" "$ctx" "${MAX_CTX[$key]}"
+        fi
+    done
+
+    echo ""
+done
+
+echo "$HR"
+printf "RESULTS: ${GREEN}%s PASSED${NC} / ${RED}%s FAILED${NC} / %s TOTAL\n" "$PASS" "$FAIL" "$TOTAL"
+echo "$HR"