Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
This commit is contained in:
2026-05-06 15:56:40 +02:00
commit 4ad296608b
22 changed files with 2530 additions and 0 deletions

13
.env Normal file
View File

@@ -0,0 +1,13 @@
# =============================================================================
# llama.cpp project root .env
#
# Per-model parameters have moved to envs/.env.<model>:
# envs/.env.qwen35-9b — Qwen3.5-9B Q8_0 TurboQuant ~4.4 t/s
# envs/.env.gemma4-e2b — Gemma 4 E2B Q4_K_M ~65 t/s
# envs/.env.gemma4-e4b — Gemma 4 E4B Q4_K_M (split) ~30 t/s
# envs/.env.smollm3-3b — SmolLM3 3B Q4_K_M ~90 t/s
# envs/.env.qwen3-4b — Qwen3 4B Q4_K_M ~75 t/s
#
# This file is loaded by Docker Compose for project-level interpolation.
# Add project-wide overrides here (e.g. COMPOSE_PROJECT_NAME).
# =============================================================================

34
.gitignore vendored Normal file
View File

@@ -0,0 +1,34 @@
# Model files — large binaries, download with scripts/download_models.sh
models/*.gguf
models/*.bin
models/*.safetensors
# Benchmark output logs, CSVs, and generated env snapshots — generated, not source
benchmark-results/*.log
benchmark-results/*.csv
benchmark-results/*.txt
benchmark-results/*.env
# Keep the .gitkeep placeholder
!benchmark-results/.gitkeep
# Docker build cache artifacts
.docker/
# Python cache
__pycache__/
*.pyc
*.pyo
.venv/
# Editor / OS artifacts
.DS_Store
Thumbs.db
*.swp
*.swo
*~
.idea/
.vscode/
# Local overrides (never commit secrets or machine-specific tweaks)
.env.local
envs/.env.*.local

174
README.md Normal file
View File

@@ -0,0 +1,174 @@
# llama-cpp-docker
Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing).
Fully benchmarked and tuned: every parameter justified by measurement, not guesswork.
---
## What this is
A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with:
- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware
- **TurboQuant image** — custom build with `FORCE_MMQ` (+611% free speed on Turing GPUs) and `turbo2/3/4` KV quantization
- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 216× at modest speed cost
- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing
- **Open WebUI** — optional web UI, profile-composable with any model
> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933.
> Parameters will work on any similar Turing GPU. See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures.
---
## Quick start
### 1. Build the TurboQuant image (once, ~20 min)
```bash
docker compose --profile qwen35-9b build llama-qwen35-9b
```
This builds both `server-cuda-sm75-mmq` and `full-cuda-sm75-mmq` tags used by all services.
### 2. Download models
```bash
bash scripts/download_models.sh
```
Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`).
To download individual models:
```bash
bash scripts/download_models.sh smollm3
bash scripts/download_models.sh qwen35-9b
# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
```
### 3. Start a model
```bash
# Start SmolLM3 (fastest, 53 t/s, 65K context in bigctx mode)
docker compose --profile smollm3-3b up -d
# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context)
docker compose --profile gemma4-e2b up -d
# Add Open WebUI to any running model
docker compose --profile gemma4-e2b --profile webui up -d
```
API is available at **http://localhost:8080** (OpenAI-compatible).
WebUI at **http://localhost:3000**.
---
## Models
| Profile | Model | Size | t/s | CTX | Highlights |
|---|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 |
| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem |
### Big context profiles (KV in RAM via `-nkvo`)
Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck).
| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage |
|---|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB |
```bash
docker compose --profile gemma4-e2b-bigctx up -d
```
---
## Running benchmarks
One-shot — results written to `benchmark-results/`:
```bash
# Standard llama-bench sweep
docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b
# KV quantization quality test (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b
# Context size test with bandwidth model (all models)
docker compose --profile bench-qwen35-9b run --rm -T \
--entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b
# Ad-hoc llama-bench
docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
```
---
## Project structure
```
compose.yaml — All services, profiles, YAML anchors
envs/
.env.<model> — Pure-GPU tuned params per model
.env.<model>-bigctx — -nkvo KV-in-RAM params
scripts/
download_models.sh — huggingface-cli download helper
benchmark.sh — Default bench entrypoint (llama-bench sweep)
kv_quant_test.sh — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx
cpu_ctx_test.sh — -nkvo alloc check + PCIe/RAM BW model → max viable ctx
quality_test.sh — Early generation quality test (superseded by kv_quant_test.sh)
docs/
FINDINGS.md — What we learned, surprises, and what to watch out for
ARCHITECTURE.md — Compose and test script architecture in detail
models/ — GGUF model files (gitignored, downloaded separately)
benchmark-results/ — Test output logs and CSVs (gitignored)
```
---
## Key findings
> Full details in [docs/FINDINGS.md](docs/FINDINGS.md).
**FORCE_MMQ gives free +611% on Turing GPUs.** GPUs without tensor cores (RTX 1650, 1660, 2060) are faster with the MMQ kernel than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs — it would hurt.
**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 → 4.2 → 15 → 438). Never use turbo2/3/4 for Qwen3-4B. Use q4_0.
**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx.
**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed barely drops (62 → 17 t/s@50% fill).
**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling.
**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly.
---
## Requirements
- Docker + NVIDIA Container Toolkit
- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures)
- `huggingface-cli` for model downloads: `pip install huggingface_hub`
- ~25 GB disk for all models (download selectively as needed)
---
## Tuning for different hardware
Edit `envs/.env.<model>` files. Key parameters:
- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split
- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom
- `CACHE_TYPE_K/V``f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size
- `THREADS` — match physical core count (HT hurts for RAM-bound models)
See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference.

View File

290
compose.yaml Normal file
View File

@@ -0,0 +1,290 @@
# ==============================================================================
# llama.cpp multi-model server
# Hardware: GTX 1650 Ti (3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t
#
# MODEL PROFILES (mutually exclusive — GPU can only hold one at a time):
# qwen35-9b Qwen3.5-9B Q8_0 TurboQuant (turbo2 KV, FORCE_MMQ) ~4.4 t/s
# gemma4-e2b Gemma 4 E2B Official llama.cpp ~65 t/s
# gemma4-e4b Gemma 4 E4B Official llama.cpp (CPU split) ~30 t/s
# smollm3-3b SmolLM3 3B Official llama.cpp ~90 t/s
# qwen3-4b Qwen3 4B Official llama.cpp ~75 t/s
#
# BIGCTX PROFILES (-nkvo: KV in RAM, benchmarked v4 2026-05-06, TurboQuant FORCE_MMQ):
# smollm3-3b-bigctx SmolLM3 3B ctx=65536 turbo2 | ~53 t/s base | ~15 t/s@50% | +40960 vs GPU
# gemma4-e2b-bigctx Gemma 4 E2B ctx=393216 q4_0 | ~62 t/s base | ~17 t/s@50% | +368640 vs GPU (MQA!)
# gemma4-e4b-bigctx Gemma 4 E4B ctx=163840 turbo2 | ~30 t/s base | ~18 t/s@50% | +139264 vs GPU
# qwen3-4b-bigctx Qwen3 4B ctx=24576 q4_0 | ~39 t/s base | ~11 t/s@50% | +8192 vs GPU
#
# OPTIONAL ADD-ON (combine with any model profile):
# webui Open WebUI — auto-connects to whichever model is running
#
# BENCHMARK PROFILES (one-shot, run with: docker compose ... run --rm <service>):
# bench-qwen35-9b / bench-gemma4-e2b / bench-gemma4-e4b
# bench-smollm3-3b / bench-qwen3-4b
#
# EXAMPLES:
# docker compose --profile qwen35-9b up -d
# docker compose --profile gemma4-e2b --profile webui up -d
# docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \
# bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null'
#
# FIRST-TIME BUILD (qwen35-9b TurboQuant image, ~20 min):
# docker compose --profile qwen35-9b build llama-qwen35-9b
#
# Per-model params live in envs/.env.<model> — edit there to retune.
# All server services expose API on host port 8080 and Docker network as
# http://llama-current:8080 via the llama-net network alias.
# ==============================================================================
# ── Shared GPU passthrough ────────────────────────────────────────────────────
x-gpu: &gpu
runtime: nvidia
environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: compute,utility
# ── Common healthcheck properties (start_period overridden per service) ───────
x-hc: &hc
test: ["CMD-SHELL", "curl -sf http://localhost:8080/health | grep -q '\"status\":\"ok\"'"]
interval: 20s
timeout: 10s
retries: 10
# ── Common server scaffold ────────────────────────────────────────────────────
# All model services merge this. Per-model differences go in envs/.env.<model>.
# $$VAR uses double-$ to escape compose interpolation — shell expands them at
# runtime from the env_file variables injected into the container.
x-server: &server
<<: *gpu
container_name: llama_server
volumes:
- ./models:/models:ro
ports:
- "8080:8080"
shm_size: 1g
ulimits:
memlock:
soft: -1
hard: -1
restart: unless-stopped
entrypoint: ["/bin/sh", "-c"]
command: |
exec /app/llama-server \
--model "/models/$$MODEL_FILE" \
--host 0.0.0.0 --port 8080 \
--n-gpu-layers $$N_GPU_LAYERS \
--ctx-size $$CTX_SIZE \
--threads $$THREADS --threads-batch $$THREADS_BATCH \
--batch-size $$BATCH_SIZE --ubatch-size $$UBATCH_SIZE \
--cache-type-k $$CACHE_TYPE_K --cache-type-v $$CACHE_TYPE_V \
--cont-batching --parallel $$PARALLEL \
$$EXTRA_ARGS \
--log-disable
networks:
llama-net:
aliases: [llama-current]
# ── Common benchmark scaffold ─────────────────────────────────────────────────
x-bench: &bench
<<: *gpu
container_name: llama_bench
volumes:
- ./models:/models:ro
- ./benchmark-results:/results
- ./scripts:/scripts:ro
shm_size: 1g
ulimits:
memlock:
soft: -1
hard: -1
entrypoint: ["/bin/bash", "/scripts/benchmark.sh"]
# ── Networks ──────────────────────────────────────────────────────────────────
networks:
llama-net:
driver: bridge
# ── Volumes ───────────────────────────────────────────────────────────────────
volumes:
open-webui-data:
# ==============================================================================
services:
# ── QWEN 3.5-9B Q8_0 — TurboQuant (turbo2 KV, FORCE_MMQ, SM75) ────────────
# Build image first: docker compose --profile qwen35-9b build llama-qwen35-9b
llama-qwen35-9b:
build:
context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache
dockerfile: .devops/cuda.Dockerfile
target: server
args:
CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [qwen35-9b]
env_file: envs/.env.qwen35-9b
healthcheck:
<<: *hc
retries: 12
start_period: 180s # mlock pins 8.86 GB into RAM — needs time
# ── GEMMA 4 E2B — 2.3B effective (5.1B total/PLE), 128K ctx, audio+video ───
# Download: see envs/.env.gemma4-e2b for huggingface-cli command
llama-gemma4-e2b:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [gemma4-e2b]
env_file: envs/.env.gemma4-e2b
healthcheck:
<<: *hc
start_period: 60s
# ── GEMMA 4 E4B — 4.5B effective (8B total/PLE), 128K ctx, CPU-split ────────
# Fits ~28/42 layers on GPU; remaining layers run on CPU RAM
llama-gemma4-e4b:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [gemma4-e4b]
env_file: envs/.env.gemma4-e4b
healthcheck:
<<: *hc
start_period: 60s
# ── SMOLLM3 3B — thinking mode, tool calling, 64K ctx, Apache 2.0 ──────────
llama-smollm3-3b:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [smollm3-3b]
env_file: envs/.env.smollm3-3b
healthcheck:
<<: *hc
start_period: 60s
# ── QWEN3 4B — thinking mode, 128K ctx, best ecosystem ────────────────────
llama-qwen3-4b:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [qwen3-4b]
env_file: envs/.env.qwen3-4b
healthcheck:
<<: *hc
start_period: 60s
# ── BIGCTX VARIANTS (-nkvo: KV in RAM, benchmarked 2026-05-06) ────────────
# Use when you need more context than the pure-GPU profiles offer.
# KV cache lives in CPU RAM instead of VRAM → VRAM freed for larger ctx.
# Speed estimated via PCIe bandwidth model (8 GB/s). E2B/E4B use MQA — tiny KV, far less PCIe pressure.
llama-smollm3-3b-bigctx:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [smollm3-3b-bigctx]
env_file: envs/.env.smollm3-3b-bigctx
healthcheck:
<<: *hc
start_period: 60s
llama-gemma4-e2b-bigctx:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [gemma4-e2b-bigctx]
env_file: envs/.env.gemma4-e2b-bigctx
healthcheck:
<<: *hc
start_period: 60s
llama-gemma4-e4b-bigctx:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [gemma4-e4b-bigctx]
env_file: envs/.env.gemma4-e4b-bigctx
healthcheck:
<<: *hc
start_period: 60s
llama-qwen3-4b-bigctx:
image: local/llama-cpp-turboquant:server-cuda-sm75-mmq
<<: *server
profiles: [qwen3-4b-bigctx]
env_file: envs/.env.qwen3-4b-bigctx
healthcheck:
<<: *hc
start_period: 60s
# ── OPEN WEBUI ─────────────────────────────────────────────────────────────
# Separate profile — add to any running model:
# docker compose --profile <model> --profile webui up -d
# Connects to whichever model is running via the llama-current DNS alias.
# Open WebUI retries on startup so no depends_on needed.
openwebui:
image: ghcr.io/open-webui/open-webui:main
container_name: open_webui
profiles: [webui]
environment:
- OPENAI_API_BASE_URL=http://llama-current:8080/v1
- OPENAI_API_KEY=sk-no-key-needed
- WEBUI_AUTH=false
ports:
- "3000:8080"
networks:
- llama-net
volumes:
- open-webui-data:/app/backend/data
restart: unless-stopped
# ── BENCHMARKS ─────────────────────────────────────────────────────────────
# Run as one-shot: docker compose --profile bench-<model> run --rm bench-<model>
# Override entrypoint for ad-hoc: ... run --rm --entrypoint="" bench-<model> bash -c '...'
bench-qwen35-9b:
build:
context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache
dockerfile: .devops/cuda.Dockerfile
target: full
args:
CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"
image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
<<: *bench
profiles: [bench-qwen35-9b]
environment:
MODEL_FILE: Qwen3.5-9B.Q8_0.gguf
OUTPUT_DIR: /results
VARIANT: qwen35-9b-turboquant
PATH: /app:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
bench-gemma4-e2b:
image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
<<: *bench
profiles: [bench-gemma4-e2b]
environment:
MODEL_FILE: google_gemma-4-E2B-it-Q4_K_M.gguf
OUTPUT_DIR: /results
VARIANT: gemma4-e2b
bench-gemma4-e4b:
image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
<<: *bench
profiles: [bench-gemma4-e4b]
environment:
MODEL_FILE: google_gemma-4-E4B-it-Q4_K_M.gguf
OUTPUT_DIR: /results
VARIANT: gemma4-e4b
bench-smollm3-3b:
image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
<<: *bench
profiles: [bench-smollm3-3b]
environment:
MODEL_FILE: HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
OUTPUT_DIR: /results
VARIANT: smollm3-3b
bench-qwen3-4b:
image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
<<: *bench
profiles: [bench-qwen3-4b]
environment:
MODEL_FILE: Qwen3-4B-Q4_K_M.gguf
OUTPUT_DIR: /results
VARIANT: qwen3-4b

210
docs/ARCHITECTURE.md Normal file
View File

@@ -0,0 +1,210 @@
# Architecture
Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM.
---
## Docker Compose Architecture
### Image Strategy
Two custom images built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp:
| Image | Target | Used by |
|---|---|---|
| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services |
| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services |
Both built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`:
- SM75 = Turing architecture codepath (no tensor cores)
- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM
- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server
Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.**
### Compose Structure
```
compose.yaml
├── x-gpu — NVIDIA runtime + capability passthrough (merged into all services)
├── x-hc — Common healthcheck (curl /health, start_period overridden per service)
├── x-server — Merged into all server services:
│ ├── volumes: ./models:/models:ro
│ ├── ports: 8080:8080
│ ├── network alias: llama-current (all servers share this alias)
│ ├── entrypoint: llama-server with $$VAR shell expansion from env_file
│ └── restart: unless-stopped
└── x-bench — Merged into all bench services:
├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro
└── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable)
```
### Profile System
Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU).
```
docker compose --profile <PROFILE> up -d
```
**Server profiles** (bring up `llama-server` on port 8080):
| Profile | Model | Image | VRAM | Strategy |
|---|---|---|---|---|
| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights |
| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA |
| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) |
| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU |
| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU |
**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM):
| Profile | Model | KV type | CTX | ~t/s@50% ctx |
|---|---|---|---|---|
| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 |
| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 |
| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 |
| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 |
**Bench profiles** (one-shot benchmark containers):
| Profile | Service | Purpose |
|---|---|---|
| `bench-qwen35-9b` | bench-qwen35-9b | Also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (all models have model files accessible) |
| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench |
| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench |
| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench |
| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench |
**Add-on profile** (combine with any model):
| Profile | Service | Purpose |
|---|---|---|
| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` |
### Env File Architecture
Each model has a dedicated `envs/.env.<model>` file injected into the container. Shell variables use `$$VAR` in the compose command to escape compose interpolation — the container shell expands them at runtime.
```
envs/
├── .env.smollm3-3b ← pure-GPU: q8_0 KV, ctx=24576
├── .env.smollm3-3b-bigctx ← -nkvo: turbo2 KV, ctx=65536
├── .env.gemma4-e2b ← pure-GPU: f16 KV, ctx=24576
├── .env.gemma4-e2b-bigctx ← -nkvo: q4_0 KV, ctx=393216 (turbo2 worse for MQA)
├── .env.gemma4-e4b ← pure-GPU: q4_0 KV, ctx=24576, ngl=42
├── .env.gemma4-e4b-bigctx ← -nkvo: turbo2 KV, ctx=163840, ngl=42
├── .env.qwen3-4b ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever)
├── .env.qwen3-4b-bigctx ← -nkvo: q4_0 KV, ctx=24576 (NO turbo2 ever)
└── .env.qwen35-9b ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock
```
Key env variables per file:
```bash
MODEL_FILE # filename under /models/
N_GPU_LAYERS # ngl: how many transformer layers offloaded to GPU
CTX_SIZE # context window size
THREADS / THREADS_BATCH
BATCH_SIZE / UBATCH_SIZE
CACHE_TYPE_K/V # KV quantization: f16 | q8_0 | q4_0 | turbo2
PARALLEL # number of concurrent request slots
EXTRA_ARGS # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload)
```
---
## Test Script Architecture
All test scripts run inside the `bench-qwen35-9b` container (has `full` image with all binaries), with all model files accessible via `/models/`.
### scripts/kv_quant_test.sh
**Purpose**: Determine optimal KV quantization type for each model at various context sizes.
**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline.
**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded.
```
for each model:
for each ctx in CTX_CANDIDATES:
run f16 baseline → get PPL_baseline
for each KV type in MODEL_KV_TYPES:
run with that KV type → get PPL
report Δ = PPL - PPL_baseline
```
**Outputs**:
- Pass/fail per (model, ctx, KV type) combination
- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx
**Known limitations**:
- `Qwen3.5-9B`: hybrid linear-attention architecture is incompatible with `llama-perplexity` → always fails. Not a real model issue; the server works correctly.
- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost.
### scripts/cpu_ctx_test.sh
**Purpose**: Find maximum viable context size when using `-nkvo` (KV in host RAM), accounting for PCIe bandwidth penalty.
**Method**: Two-phase per (model, ctx, KV type):
1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM.
2. **Speed estimation** (analytic bandwidth model):
```
GPU-compute models (smollm3, e2b, e4b, qwen3-4b):
t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000)
PCIe_BW = 8 GB/s (PCIe x4 Gen3 practical)
RAM-bound models (qwen35-9b, ngl=11):
t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000)
RAM_BW = 45 GB/s (DDR4-2933)
```
3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`.
**kv_bytes_per_tok** measured empirically: `KV_MiB_allocated / ctx_size` from actual alloc run.
**KV types tested per model**:
| Model | KV types | Reason |
|---|---|---|
| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) |
| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 |
| qwen35-9b | q4_0 only | OOMs regardless (skipped) |
### scripts/benchmark.sh
Default entrypoint for bench containers. Runs `llama-bench` sweep over prompt/generation lengths and thread counts, outputs CSV to `/results/`.
### scripts/quality_test.sh
Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison.
---
## Data Flow
```
Model GGUF files (./models/)
Docker container (/models/ read-only bind mount)
├─── llama-server ──► OpenAI-compatible API on :8080
│ │
│ env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE,
│ CACHE_TYPE_K/V, EXTRA_ARGS, ...
└─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount)
test scripts (scripts/ read-only bind mount)
```
## Port / Network Layout
```
Host:8080 ──► llama_server container:8080
Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network)
llama-net (bridge):
llama-current — alias shared by ALL server services; only one runs at a time
```

158
docs/FINDINGS.md Normal file
View File

@@ -0,0 +1,158 @@
# Benchmarking Findings
Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM.
All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `DGGML_CUDA_FORCE_MMQ=ON`).
Date: 2026-05-05 / 2026-05-06.
---
## 1. FORCE_MMQ — Free +611% on Turing GPUs
**Finding**: GPUs without tensor cores (Turing = RTX 1650, 1660, 2060 etc.) run the GEMM path through cuBLAS GEMM, which is slower than the hand-written MMQ (matrix-multiply quantized) kernel. Compiling with `DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally.
| Model | Standard image t/s | TurboQuant t/s | Gain |
|---|---|---|---|
| SmolLM3-3B | ~49.9 | 53.1 | +6.2% |
| Gemma4-E2B | ~55.7 | 61.7 | +10.7% |
| Gemma4-E4B | ~27.0 | 30.0 | +11.4% |
| Qwen3-4B | ~36.7 | 38.8 | +5.7% |
**Caution**: On Ampere/Ada/Hopper (RTX 3000+/4000+), tensor cores are faster. `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only.
---
## 2. KV Quantization — turbo2 is the best sweet spot
**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss.
**Perplexity delta vs f16 baseline** (quality gate: Δ < 0.5):
| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B |
|---|---|---|---|---|
| q8_0 | ✓ | ✓ | ✓ | ✓ |
| q4_0 | ✓ | ✓ | ✓ | ✓ |
| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** |
| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** |
### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B
Qwen3-4B uses full GQA (32 KV heads, 40 KB/token). At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation:
```
ctx=4096 turbo2: PPL=1.79 (baseline 1.76, Δ=0.03 ✓)
ctx=8192 turbo2: PPL=4.2 (Δ=2.4 ✗)
ctx=16384 turbo2: PPL=15.4 (Δ=13.7 ✗)
ctx=32768 turbo2: PPL=438 (broken)
```
**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0.
---
## 3. MQA Architecture — Gemma4 E2B/E4B KV is tiny
**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers — only 1 KV head is maintained per token instead of full GQA. This results in dramatically smaller KV cache:
| Model | KV bytes/token (q4_0) | Architecture |
|---|---|---|
| SmolLM3-3B | ~19.8 KB | GQA |
| Qwen3-4B | ~39.6 KB | full GQA |
| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) |
| Gemma4-E2B | ~1.7 KB | MQA (35 layers) |
**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM. E4B can hold 163K tokens with 346 MiB RAM.
### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact)
turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. At E2B:
```
ctx=32768 q4_0: 57 MiB KV turbo2: 68 MiB KV (+19% worse!)
```
**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0.
---
## 4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost
**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The tradeoff is token generation speed — each token generation requires reading the full KV cache over PCIe x4.
**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)`
PCIe x4 Gen3 ≈ **8 GB/s** practical (measured from BW model fit to actual results).
### Context gains with -nkvo (v4, TurboQuant):
| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used |
|---|---|---|---|---|
| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 |
| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) |
| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 |
| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 |
Recommendation threshold: ≥ 15 t/s at 50% context fill.
### ⚠️ Qwen3.5-9B cannot use -nkvo
Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option.
---
## 5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible
**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly.
**Performance ceiling**: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited.
**Thread optimization** (i7-10750H, 6 physical / 12 logical):
- Optimal: `THREADS=6` (one per physical core)
- HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s)
---
## 6. Gemma4-E4B — all layers fit when VRAM is free
**Surprise**: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs.
ngl sweep result:
```
ngl=28 → 59 pp / 16.5 tg t/s
ngl=35 → 101 pp / 24.6 tg t/s
ngl=42 → 133 pp / 32.0 tg t/s ← all layers, much faster
```
**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B.
---
## 7. Flash Attention (+23% pp, required for bigctx)
`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~23% pp, ~1% tg). Always enable it.
---
## 8. Benchmarking pitfalls
### False OOM from prefill timeout
Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s).
### kv/tok measurement anomalies
The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration.
---
## Summary: Recommended configurations
| Model | Profile | KV type | CTX | t/s@base | Notes |
|---|---|---|---|---|---|
| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx |
| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM |
| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV |
| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse |
| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers |
| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM |
| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever |
| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken |
| Qwen3.5-9B | pure-GPU | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx |

43
envs/.env.gemma4-e2b Normal file
View File

@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M — Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
# - 2.3B effective params (5.1B total with PLE embedding tables)
# - 35 layers, hybrid local (512-token window) + global attention
# - 128K context window
# Model size: ~2.9 GB Q4_K_M | Full GPU fit (ngl=99, VRAM ~3.4 GB total)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
# huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \
# google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download — bartowski naming may vary.
# Check: ls models/google_gemma*
# ==============================================================================
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup.
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits!
# 65K/131K OOM (full global-attn layers eat VRAM at large ctx).
# Baseline: 350 pp / 64.6 tg t/s | At 32K ctx: 365 pp / 66.8 tg t/s (fa=1)
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
# f16 KV — model small, KV overhead negligible even at 32K
CACHE_TYPE_K=f16
CACHE_TYPE_V=f16
# 2 parallel slots — fast model (66 tg t/s), VRAM headroom available
PARALLEL=2
# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0)
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E2B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216
# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!).
# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW).
# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA).
# Use this profile when you need >24K context; otherwise use gemma4-e2b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=393216
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

43
envs/.env.gemma4-e4b Normal file
View File

@@ -0,0 +1,43 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M — Google DeepMind (April 2025)
# Architecture: Dense transformer + Per-Layer Embeddings (PLE)
# - 4.5B effective params (8B total with PLE embedding tables)
# - 42 layers, hybrid local (512-token window) + global attention
# - 128K context window
# Model size: ~4.7 GB Q4_K_M | CPU-split needed (exceeds 3.7 GB VRAM)
# Modalities: text + image + audio (ASR/translation) + video frames
#
# Download:
# huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \
# google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify the exact filename after download — bartowski naming may vary.
# Check: ls models/google_gemma*
# ==============================================================================
MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# ALL 42 layers fit on GPU when no other containers hold VRAM!
# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s (ngl=28 was only 59/16.5)
# Max ctx=24576 (hybrid attention, 32K OOM). fa=1 works (+3% vs fa=0).
# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg)
N_GPU_LAYERS=42
# 24K max — hybrid sliding-window keeps most layers' KV tiny
# 32K OOM due to global-attn layers hitting VRAM wall
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
# fa=1 confirmed working on hybrid Gemma4 attention
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840
# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok.
# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW).
# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU).
# Use this profile when you need >24K context; otherwise use gemma4-e4b.
# ==============================================================================
MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf
N_GPU_LAYERS=42
CTX_SIZE=163840
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

42
envs/.env.qwen3-4b Normal file
View File

@@ -0,0 +1,42 @@
# ==============================================================================
# Qwen3-4B-Instruct Q4_K_M — Alibaba (May 2025)
# Architecture: Decoder-only transformer, GQA
# - 4B params, 32 layers
# - 32K native context (128K with YaRN)
# Model size: ~2.4 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 119 languages,
# Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes).
#
# Download:
# huggingface-cli download bartowski/Qwen3-4B-GGUF \
# Qwen3-4B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
# ls models/Qwen3-4B*
# ==============================================================================
MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
# All layers fit — ~2.4 GB leaves ~1.3 GB free for KV + compute
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=8192 (12K OOM). Full attention — all KV must fit at full ctx.
# GGUF native limit=40960, but VRAM walls at ~8K.
# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%).
CTX_SIZE=16384
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
# 1 parallel slot — limited VRAM at 8K ctx with 2.4GB model
PARALLEL=1
# fa=1 gives +6% tg speed on full-attention Qwen3
EXTRA_ARGS=--flash-attn on --mmap

24
envs/.env.qwen3-4b-bigctx Normal file
View File

@@ -0,0 +1,24 @@
# ==============================================================================
# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384)
# Baseline TG: ~39 t/s (empty KV).
# Use this profile when you need >16K context; otherwise use qwen3-4b.
# ==============================================================================
MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

41
envs/.env.qwen35-9b Normal file
View File

@@ -0,0 +1,41 @@
# ==============================================================================
# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU)
# RAM usage: ~5.5 GB (21 layers pinned via mlock)
#
# Benchmark results (turbo2 KV, ngl=11, fa=1):
# t=1→0.86 t=2→1.62 t=3→2.25 t=4→2.94 t=5→3.56 t=6→4.38 ← best
# t=8→4.22 t=12→3.61 (hyperthreading hurts above 6)
# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
# Achieved: 4.38 t/s = 86% efficiency
#
# Download:
# huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
# Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
# ==============================================================================
MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
N_GPU_LAYERS=11
# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
CTX_SIZE=32768
# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=128
# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
EXTRA_ARGS=--flash-attn on --no-mmap --mlock

42
envs/.env.smollm3-3b Normal file
View File

@@ -0,0 +1,42 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M — HuggingFace (2025)
# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio)
# - 3B params, 11.2T training tokens
# - 64K native context (128K with YaRN)
# Model size: ~1.9 GB Q4_K_M | Full GPU fit (ngl=99)
# Features: thinking mode (/think /no_think), tool calling, 6 languages,
# Apache 2.0. AIME 2025: 36.7% in think mode.
#
# Download:
# huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \
# HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/
#
# NOTE: Verify exact filename after download:
# ls models/SmolLM3* models/HuggingFaceTB_SmolLM3*
# ==============================================================================
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
# All layers fit comfortably — ~1.9 GB leaves ~1.8 GB free for KV + compute
N_GPU_LAYERS=99
# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB):
# Max ctx=24576 (32K OOM). Baseline: 249 pp / 56.8 tg t/s.
# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%).
# Model context limit = 65536, VRAM is the constraint here.
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q8_0
CACHE_TYPE_V=q8_0
# 2 parallel slots — less headroom at 24K ctx vs original 16K estimate
PARALLEL=2
# fa=1 gives small but consistent improvement (+2 tg t/s)
EXTRA_ARGS=--flash-attn on --mmap

View File

@@ -0,0 +1,26 @@
# ==============================================================================
# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536
# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok.
# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW).
# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx.
# Use this profile when you need >24K context; otherwise use smollm3-3b.
# ==============================================================================
MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=65536
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload

335
scripts/benchmark.sh Executable file
View File

@@ -0,0 +1,335 @@
#!/usr/bin/env bash
# =============================================================================
# llama.cpp Automated Benchmark — Qwen3.5-9B on GTX 1650 Ti (4 GB VRAM)
#
# Runs for BOTH official llama.cpp and TurboQuant fork.
# VARIANT env var selects which KV type set to sweep:
# VARIANT=official → f16 q8_0 q5_0 q4_0 iq4_nl
# VARIANT=turboquant → f16 q8_0 iq4_nl turbo4 turbo3 turbo2
#
# Output: CSV + recommended .env per variant, plus a final comparison table.
#
# Run:
# docker compose --profile benchmark run --rm benchmark (official)
# docker compose --profile benchmark run --rm benchmark-turbo (turboquant)
# =============================================================================
set -euo pipefail
# Ensure llama-bench is findable in both official (/usr/local/bin) and TurboQuant (/app) images
export PATH="/app:/usr/local/bin:/usr/bin:/bin:${PATH:-}"
MODEL="${MODEL:-${1:-/models/Qwen3.5-9B.Q8_0.gguf}}"
OUTPUT_DIR="${OUTPUT_DIR:-${2:-/results}}"
VARIANT="${VARIANT:-official}" # official | turboquant
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULTS_CSV="${OUTPUT_DIR}/${VARIANT}_results_${TIMESTAMP}.csv"
LOG="${OUTPUT_DIR}/${VARIANT}_benchmark_${TIMESTAMP}.log"
# ── Baseline config ────────────────────────────────────────────────────────
THREADS=6
THREADS_BATCH=12
BATCH_SIZE=2048
UBATCH_SIZE=512
PROMPT_TOKENS=512
GEN_TOKENS=32
REPETITIONS=1
# ── KV type sets per variant ───────────────────────────────────────────────
# turbo2=2-bit (6.4× vs f16), turbo3=3-bit, turbo4=4-bit — TurboQuant only
if [[ "$VARIANT" == "turboquant" ]]; then
KV_TYPES=(f16 q8_0 iq4_nl turbo4 turbo3 turbo2)
else
# Official llama.cpp: all standard quant types
# iq4_nl = i-quant non-linear: best quality at 4-bit (non-uniform scale)
KV_TYPES=(f16 q8_0 q5_0 q4_0 iq4_nl)
fi
# ── GPU layer sweep (Q8_0 ~297 MB/layer, 3717 MiB VRAM → max ~12 layers) ──
NGL_VALUES=(6 9 12 13 14 99)
# ── Context sweep: use -p to stress KV cache at given size ─────────────────
CTX_VALUES=(128 512 1024 2048 4096 8192)
# ── Batch sweep ────────────────────────────────────────────────────────────
BATCH_VALUES=(512 1024 2048 4096)
mkdir -p "$OUTPUT_DIR"
log() { echo "$*" | tee -a "$LOG"; }
sep() { log "$(printf '─%.0s' {1..70})"; }
hdr() { sep; log " $*"; sep; }
log "llama.cpp Benchmark [${VARIANT}] — $(date)"
log "Model: $MODEL"
log "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo 'CPU only')"
log "KV set: ${KV_TYPES[*]}"
sep
echo "variant,phase,ngl,ctx,kv_type_k,kv_type_v,flash_attn,batch_size,ubatch_size,threads,pp_tokens_per_sec,tg_tokens_per_sec,status" \
> "$RESULTS_CSV"
# ── Helper: run llama-bench ────────────────────────────────────────────────
LAST_PP=0
LAST_TG=0
run_bench() {
local ngl=$1 ctx=$2 kv=$3 fa=$4 batch=$5 ubatch=$6 phase="${7:-test}"
local raw
# New llama-bench API (b9014+): -c and -tb removed; -p sets prompt/ctx size
# CSV columns: ...n_prompt(34),n_gen(35),n_depth(36),test_time(37),
# avg_ns(38),stddev_ns(39),avg_ts(40),stddev_ts(41)
# pp row: n_gen==0; tg row: n_prompt==0
raw=$(timeout 300 /app/llama-bench \
-m "$MODEL" \
-ngl "$ngl" \
-p "$ctx" \
-n "$GEN_TOKENS" \
-t "$THREADS" \
-b "$batch" \
-ub "$ubatch" \
-ctk "$kv" \
-ctv "$kv" \
-fa "$fa" \
-r "$REPETITIONS" \
-o csv 2>&1) || return 1
# Strip quotes from CSV, then extract avg_ts (col 40) by pp/tg row type
LAST_PP=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $35=="0" && $40+0>0 {print $40+0; exit}')
LAST_TG=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}')
LAST_PP="${LAST_PP:-0}"
LAST_TG="${LAST_TG:-0}"
echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},${LAST_PP},${LAST_TG},ok" \
>> "$RESULTS_CSV"
return 0
}
fail_row() {
local phase=$1 ngl=$2 ctx=$3 kv=$4 fa=$5 batch=$6 ubatch=$7
echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},0,0,failed" \
>> "$RESULTS_CSV"
}
# ── Phase 1: GPU layer sweep ───────────────────────────────────────────────
hdr "PHASE 1 — GPU layer sweep (prompt=128 kv=f16 fa=0)"
# Use f16 KV: prebuilt official image lacks SM75 CUDA kernels for quantized KV.
# We isolate the NGL variable here; KV type is swept in Phase 3.
MAX_STABLE_NGL=0
for ngl in "${NGL_VALUES[@]}"; do
printf " ngl=%-3s " "$ngl" | tee -a "$LOG"
if run_bench "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph1_ngl"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
MAX_STABLE_NGL="$ngl"
else
log "FAILED (OOM/timeout)"
fail_row ph1_ngl "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE"
break
fi
done
log " → Best ngl: ${MAX_STABLE_NGL}"
# ── Phase 2: Context sweep ─────────────────────────────────────────────────
hdr "PHASE 2 — Context/prompt sweep (ngl=${MAX_STABLE_NGL} kv=f16 fa=0)"
MAX_STABLE_CTX=128
for ctx in "${CTX_VALUES[@]}"; do
printf " ctx=%-6s " "$ctx" | tee -a "$LOG"
if run_bench "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph2_ctx"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
MAX_STABLE_CTX="$ctx"
else
log "FAILED (OOM/timeout)"
fail_row ph2_ctx "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE"
break
fi
done
log " → Best ctx: ${MAX_STABLE_CTX}"
# ── Phase 3: KV cache type sweep ───────────────────────────────────────────
hdr "PHASE 3 — KV type sweep (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} fa=1)"
log " [${VARIANT}] KV types: ${KV_TYPES[*]}"
log " Note: Qwen3.5-9B has only 8/32 full-attention layers + GQA (4 KV heads)"
log " Linear-attention layers need no KV cache at all → quant errors minimal"
if [[ "$VARIANT" == "turboquant" ]]; then
log " turbo2=2-bit (6.4× compression), turbo3=3-bit, turbo4=4-bit"
fi
BEST_KV="q8_0"
BEST_TG_KV=0
for kv in "${KV_TYPES[@]}"; do
printf " kv=%-8s " "$kv" | tee -a "$LOG"
if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph3_kv"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
if awk "BEGIN{exit !(${tg_n:-0} > ${BEST_TG_KV:-0})}"; then
BEST_TG_KV="${tg_n:-0}"
BEST_KV="$kv"
fi
else
log "FAILED"
fail_row ph3_kv "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE"
fi
done
log " → Best KV: ${BEST_KV} (tg=${BEST_TG_KV} t/s)"
# ── Phase 4: Flash attention ───────────────────────────────────────────────
hdr "PHASE 4 — Flash attention (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} kv=${BEST_KV})"
log " GTX 1650 Ti = CC 7.5 (Turing) — FA2 requires SM80+ but FA1 works on CC>=7.5"
BEST_FA=1
BEST_TG_FA=0
for fa in 1 0; do
fa_label=$([ "$fa" -eq 1 ] && echo "on " || echo "off")
printf " fa=%-3s " "$fa_label" | tee -a "$LOG"
if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$BEST_KV" "$fa" "$BATCH_SIZE" "$UBATCH_SIZE" "ph4_fa"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
if awk "BEGIN{exit !(${tg_n:-0} > ${BEST_TG_FA:-0})}"; then
BEST_TG_FA="${tg_n:-0}"
BEST_FA="$fa"
fi
else
log "FAILED"
fi
done
log " → Best FA: ${BEST_FA} (tg=${BEST_TG_FA} t/s)"
# ── Phase 5: Batch sweep ───────────────────────────────────────────────────
hdr "PHASE 5 — Batch sweep (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} kv=${BEST_KV} fa=${BEST_FA})"
# Use small fixed prompt (64) to isolate batch-buffer allocation overhead from prompt size.
# Larger batch = larger CUDA activation buffers; tests whether they fit in remaining VRAM.
BEST_BATCH="$BATCH_SIZE"
BEST_PP_BATCH=0
FIXED_P=64
for batch in "${BATCH_VALUES[@]}"; do
ubatch=$(( batch / 4 < 64 ? 64 : batch / 4 ))
printf " batch=%-5s ubatch=%-4s " "$batch" "$ubatch" | tee -a "$LOG"
if run_bench "$MAX_STABLE_NGL" "$FIXED_P" "$BEST_KV" "$BEST_FA" "$batch" "$ubatch" "ph5_batch"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
pp_n=$(printf '%s' "$LAST_PP" | grep -oP '[0-9]+\.?[0-9]*' | head -1)
if awk "BEGIN{exit !(${pp_n:-0} > ${BEST_PP_BATCH:-0})}"; then
BEST_PP_BATCH="${pp_n:-0}"
BEST_BATCH="$batch"
fi
else
log "FAILED"
fi
done
BEST_UBATCH=$(( BEST_BATCH / 4 < 64 ? 64 : BEST_BATCH / 4 ))
log " → Best batch: ${BEST_BATCH} ubatch: ${BEST_UBATCH} (pp=${BEST_PP_BATCH} t/s)"
# ── Phase 6 (TurboQuant only): max context with turbo2 KV ──────────────────
if [[ "$VARIANT" == "turboquant" ]]; then
hdr "PHASE 6 — TurboQuant: extended context with turbo2 KV (ngl=${MAX_STABLE_NGL} fa=${BEST_FA})"
log " turbo2 = 2-bit KV (6.4× smaller than f16) → enables much larger ctx in same VRAM"
TURBO_CTX_VALUES=(512 1024 2048 4096 8192 16384 32768)
MAX_TURBO_CTX="128"
TURBO_KV="turbo2"
for ctx in "${TURBO_CTX_VALUES[@]}"; do
printf " ctx=%-7s " "$ctx" | tee -a "$LOG"
if run_bench "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH" "ph6_turbo_ctx"; then
log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s"
MAX_TURBO_CTX="$ctx"
else
log "FAILED (OOM/timeout)"
fail_row ph6_turbo_ctx "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH"
break
fi
done
log " → Max context with turbo2: ${MAX_TURBO_CTX}"
# Use the larger turbo ctx for the recommended .env
MAX_STABLE_CTX="$MAX_TURBO_CTX"
BEST_KV="$TURBO_KV"
fi
# ── Summary ────────────────────────────────────────────────────────────────
sep
log "BENCHMARK COMPLETE [${VARIANT}] — $(date)"
sep
log ""
log " Optimal params for GTX 1650 Ti + Qwen3.5-9B Q4_K_M [${VARIANT}]:"
log ""
log " ngl : ${MAX_STABLE_NGL}"
log " ctx_size : ${MAX_STABLE_CTX}"
log " kv_type : ${BEST_KV}"
log " flash_attn : ${BEST_FA}"
log " batch_size : ${BEST_BATCH}"
log " ubatch : ${BEST_UBATCH}"
log ""
log " Full CSV: ${RESULTS_CSV}"
log ""
# Write recommended .env
ENV_OUT="${OUTPUT_DIR}/${VARIANT}_recommended.env"
cat > "$ENV_OUT" <<EOF
# Generated by benchmark.sh [${VARIANT}] on $(date)
LLAMA_N_GPU_LAYERS=${MAX_STABLE_NGL}
LLAMA_CTX_SIZE=${MAX_STABLE_CTX}
LLAMA_CACHE_TYPE_K=${BEST_KV}
LLAMA_CACHE_TYPE_V=${BEST_KV}
LLAMA_BATCH_SIZE=${BEST_BATCH}
LLAMA_UBATCH_SIZE=${BEST_UBATCH}
LLAMA_THREADS=${THREADS}
LLAMA_THREADS_BATCH=${THREADS_BATCH}
LLAMA_PARALLEL=1
EOF
log " Recommended .env → ${ENV_OUT}"
# ── Cross-variant comparison (if both results exist) ──────────────────────
OFFICIAL_CSV=$(ls "${OUTPUT_DIR}"/official_results_*.csv 2>/dev/null | sort | tail -1 || true)
TURBO_CSV=$(ls "${OUTPUT_DIR}"/turboquant_results_*.csv 2>/dev/null | sort | tail -1 || true)
if [[ -n "$OFFICIAL_CSV" && -n "$TURBO_CSV" ]]; then
COMPARE_OUT="${OUTPUT_DIR}/comparison_$(date +%Y%m%d_%H%M%S).txt"
{
echo "======================================================================"
echo " OFFICIAL vs TURBOQUANT COMPARISON"
echo "======================================================================"
echo ""
echo "Official CSV: $OFFICIAL_CSV"
echo "TurboQuant CSV: $TURBO_CSV"
echo ""
echo "KV type benchmark results (phase ph3_kv):"
echo ""
printf "%-12s %-10s %-10s %-12s %-12s\n" "variant" "kv_type" "ctx" "pp (t/s)" "tg (t/s)"
echo "----------------------------------------------------------------------"
for csv in "$OFFICIAL_CSV" "$TURBO_CSV"; do
awk -F',' '
NR>1 && $2 == "ph3_kv" {
printf "%-12s %-10s %-10s %-12s %-12s\n", $1, $5, $4, $11, $12
}
' "$csv"
done
echo ""
echo "Winner by tg (generation speed):"
awk -F',' '
NR>1 && $2 == "ph3_kv" && $13 == "ok" {
key = $1 "," $5
val = $12+0
if (val > best[key]) { best[key] = val; row[key] = $0 }
}
END {
best_tg = 0; best_key = ""
for (k in best) { if (best[k] > best_tg) { best_tg = best[k]; best_key = k } }
n = split(best_key, a, ",")
printf " %s with kv=%s → %.1f t/s\n", a[1], a[2], best_tg
}
' "$OFFICIAL_CSV" "$TURBO_CSV"
echo "======================================================================"
} | tee "$COMPARE_OUT" | tee -a "$LOG"
echo ""
echo "Comparison report: $COMPARE_OUT"
fi
sep
echo ""
echo "=== RECOMMENDED .env [${VARIANT}] ==="
cat "$ENV_OUT"

175
scripts/benchmark_models.sh Normal file
View File

@@ -0,0 +1,175 @@
#!/bin/bash
# Benchmark all 4 new models on GTX 1650 Ti (3717 MiB VRAM)
# Priority: max context size > tg speed
# Runs inside ghcr.io/ggml-org/llama.cpp:full-cuda (build b9014, no -c flag)
#
# Architecture context limits (from GGUF metadata):
# SmolLM3-3B : 65536 (full attention, KV-limited to ~28K in practice)
# Gemma4-E2B : 131072 (hybrid: sliding_window=512 → huge ctx possible)
# Gemma4-E4B : 131072 (hybrid: sliding_window=512)
# Qwen3-4B : 40960 (full attention, KV-limited to ~9K in practice)
#
# NOTE: llama-bench b9014 has NO -c flag. Context is set by -p (prompt tokens).
# -p N -n G allocates KV for N+G tokens. OOM = exit!=0 or error in stdout.
set -uo pipefail
M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
# -- CSV column detection (called once on first successful output) --
TS_COL=0; NG_COL=0; NP_COL=0
detect_cols() {
local hdr
hdr=$(printf '%s\n' "$1" | sed 's/"//g' | grep '^build_commit' | head -1)
TS_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="avg_ts"){print i;exit}}')
NG_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_gen"){print i;exit}}')
NP_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_prompt"){print i;exit}}')
TS_COL=${TS_COL:-0}; NG_COL=${NG_COL:-0}; NP_COL=${NP_COL:-0}
}
# Returns "pp_speed pp / tg_speed tg t/s"
parse_speeds() {
local out="$1"
[ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
local s pp tg
s=$(printf '%s\n' "$out" | sed 's/"//g')
pp=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \
'NR>1 && $np+0>0 && $ng+0==0 {printf "%.0f", $tc+0; exit}')
tg=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \
'NR>1 && $ng+0>0 && $np+0==0 {printf "%.1f", $tc+0; exit}')
printf "%s pp / %s tg t/s" "${pp:--}" "${tg:--}"
}
is_oom() {
local out="$1" ec="$2"
[ "$ec" -ne 0 ] && return 0
printf '%s\n' "$out" | grep -qiE "failed to create context|out of memory|GGML_ASSERT|error:" && return 0
return 1
}
# bench MODEL NGL [llama-bench extra args...]
# Standard speed benchmark: -p 512 -n 128 small context
bench() {
local model=$1 ngl=$2; shift 2
local out ec
out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \
-b 512 -ub 128 -o csv "$@" 2>&1)
ec=$?
if is_oom "$out" "$ec"; then echo "OOM"; return; fi
[ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
parse_speeds "$out"
}
# bench_ctx MODEL NGL CTX
# Context-capacity test: allocates KV for CTX tokens via -p CTX -n 1
# Tries fa=1 first, falls back to fa=0. Returns "OK (N pp t/s [fa=N])" or "OOM"
bench_ctx() {
local model=$1 ngl=$2 ctx=$3
local out ec fa_used
for fa in 1 0; do
out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \
-p "$ctx" -n 1 -r 1 --no-warmup \
-b 512 -ub 128 -fa "$fa" -t 6 -o csv 2>&1)
ec=$?
is_oom "$out" "$ec" || { fa_used=$fa; break; }
[ "$fa" = "0" ] && { echo "OOM"; return; }
done
[ "${TS_COL:-0}" = "0" ] && detect_cols "$out"
local pp
pp=$(printf '%s\n' "$out" | sed 's/"//g' | \
awk -F',' -v tc="$TS_COL" -v np="$NP_COL" \
'NR>1 && $np+0>0 {printf "%.0f", $tc+0; exit}')
printf "OK (%s pp t/s fa=%s)" "${pp:--}" "${fa_used:-?}"
}
HR="======================================================================"
echo "$HR"
echo "LLAMA.CPP BENCHMARK — ALL MODELS — $(date)"
echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo unknown)"
echo "$HR"
echo ""
# ── Phase 1: Baseline (small context) ────────────────────────────────────────
echo "=== Phase 1: Baseline (ngl=99, p=512 n=128 r=2, t=6, fa=0) ==="
for entry in "SmolLM3-3B:$M_SMOL" "Gemma4-E2B:$M_E2B" "Gemma4-E4B:$M_E4B" "Qwen3-4B:$M_Q3"; do
lbl="${entry%%:*}"; mdl="${entry#*:}"
printf " %-14s %s\n" "$lbl" "$(bench "$mdl" 99 -p 512 -n 128 -r 2 -t 6 -fa 0)"
done
echo ""
# ── Phase 2: Gemma4-E4B ngl sweep ────────────────────────────────────────────
echo "=== Phase 2: Gemma4-E4B ngl sweep (p=16 n=64 r=1 t=6 fa=0) ==="
echo " 5.1GB model on 3.7GB VRAM — finding highest ngl before OOM"
best_e4b_ngl=0
for ngl in 0 4 8 12 16 20 24 28 32 36 42; do
ts=$(bench "$M_E4B" $ngl -p 16 -n 64 -r 1 -t 6 -fa 0)
printf " ngl=%-3s %s\n" "$ngl" "$ts"
[[ "$ts" == OOM ]] && break
best_e4b_ngl=$ngl
done
echo " → best_e4b_ngl=$best_e4b_ngl"
echo ""
# ── Phase 3: Max context sweep ────────────────────────────────────────────────
echo "=== Phase 3: Max context (p=ctx n=1 r=1 no-warmup fa=1) ==="
echo " Gemma4 hybrid attention (sliding_window=512) enables large ctx cheaply."
declare -A BEST_CTX
BEST_CTX[smollm3]=512; BEST_CTX[e2b]=512; BEST_CTX[e4b]=512; BEST_CTX[q3]=512
for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
"e2b:Gemma4-E2B:$M_E2B:99" \
"e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
"q3:Qwen3-4B:$M_Q3:99"; do
IFS=':' read -r key lbl mdl ngl <<< "$entry"
echo " -- $lbl (ngl=$ngl) --"
for ctx in 512 1024 2048 4096 8192 12288 16384 24576 32768 49152 65536 98304 131072; do
ts=$(bench_ctx "$mdl" "$ngl" "$ctx")
printf " ctx=%-7s %s\n" "$ctx" "$ts"
[[ "$ts" == OOM ]] && break
BEST_CTX[$key]=$ctx
done
echo " → MAX ctx=${BEST_CTX[$key]}"
done
echo ""
# ── Phase 4: TG speed at max context ─────────────────────────────────────────
echo "=== Phase 4: TG speed at max context (p=512 n=128 r=2 fa=1 t=6) ==="
for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
"e2b:Gemma4-E2B:$M_E2B:99" \
"e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
"q3:Qwen3-4B:$M_Q3:99"; do
IFS=':' read -r key lbl mdl ngl <<< "$entry"
ts=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6)
printf " %-14s max_ctx=%-7s %s\n" "$lbl" "${BEST_CTX[$key]}" "$ts"
done
echo ""
# ── Phase 5: E4B thread sweep (CPU split model — threads matter) ──────────────
echo "=== Phase 5: Gemma4-E4B thread sweep (p=512 n=128 r=2 fa=0 ngl=$best_e4b_ngl) ==="
for t in 1 2 3 4 5 6 8 10 12; do
ts=$(bench "$M_E4B" "$best_e4b_ngl" -p 512 -n 128 -r 2 -fa 0 -t "$t")
printf " t=%-3s %s\n" "$t" "$ts"
done
echo ""
# ── Phase 6: Flash attention comparison ──────────────────────────────────────
echo "=== Phase 6: Flash attention fa=0 vs fa=1 (p=512 n=128 r=2 t=6) ==="
echo " Gemma4 hybrid attention may not support FA — testing both."
for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \
"e2b:Gemma4-E2B:$M_E2B:99" \
"e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \
"q3:Qwen3-4B:$M_Q3:99"; do
IFS=':' read -r key lbl mdl ngl <<< "$entry"
ts0=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 0 -t 6)
ts1=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6)
printf " %-14s fa=0: %-30s fa=1: %s\n" "$lbl" "$ts0" "$ts1"
done
echo ""
echo "$HR"
echo "BENCHMARK COMPLETE: $(date)"
echo "$HR"

251
scripts/cpu_ctx_test.sh Normal file
View File

@@ -0,0 +1,251 @@
#!/bin/bash
# cpu_ctx_test.sh v4 — -nkvo bigctx with TurboQuant image (FORCE_MMQ)
# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq
#
# Tests KV in RAM (-nkvo) with BOTH q4_0 and turbo2 KV types.
# turbo2 = 2-bit KV (2x smaller than q4_0) → ~2x more context at same RAM budget.
#
# Speed model per token:
# GPU-compute models (smollm3/e2b/e4b/q3): bottleneck = PCIe KV reads
# t/s = 1000 / (gpu_ms + ctx * kv_bytes_per_token / PCIE_BPS * 1000)
# Qwen3.5-9B: bottleneck = RAM reads (21/32 layers on CPU, 8.86 GB model)
# t/s = 1000 / (1000/baseline + ctx * kv_bytes_per_token / RAM_BPS * 1000)
#
# Usage: bash /scripts/cpu_ctx_test.sh [smollm3|e2b|e4b|q3|qwen35q|all]
set -uo pipefail
TARGET="${1:-all}"
TARGET_TPS=15
CPU_THREADS=6
BENCH_GEN=32
PCIE_BW_GBPS=8.0 # PCIe x4 3.0 practical read BW (conservative)
RAM_BW_GBPS=45.0 # RAM practical read BW (i7-10750H DDR4-2933)
M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
M_Q35="/models/Qwen3.5-9B.Q8_0.gguf"
declare -A NGL_GPU=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11)
# BW source: pcie for GPU-compute models, ram for qwen35-9b (CPU-compute bound)
declare -A BW_GBPS=([smollm3]=$PCIE_BW_GBPS [e2b]=$PCIE_BW_GBPS [e4b]=$PCIE_BW_GBPS [q3]=$PCIE_BW_GBPS [qwen35q]=$RAM_BW_GBPS)
declare -A BW_LABEL=([smollm3]="PCIe" [e2b]="PCIe" [e4b]="PCIe" [q3]="PCIe" [qwen35q]="RAM")
# CTX candidates: larger now thanks to turbo2 (2x smaller KV vs q4_0)
# Note: turbo2 is SKIPPED for Qwen3-4B (PPL explodes at ctx>=8192: +0.52 → +13 → +437)
# turbo2 is SKIPPED for Qwen3.5-9B (hybrid linear-attn incompatible with llama-perplexity;
# server works fine at 32K — this is a test-tool limitation, not a real issue)
SMOL_CTXS=(32768 49152 65536 98304 131072 163840)
E2B_CTXS=(32768 49152 65536 98304 131072 163840 196608 262144 393216)
E4B_CTXS=(32768 49152 65536 98304 131072 163840)
Q3_CTXS=(24576 32768 49152 65536 98304 131072)
Q35_CTXS=(16384 32768 49152 65536 98304 131072)
declare -A CTX_CANDIDATES=(
[smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS"
[q3]="Q3_CTXS" [qwen35q]="Q35_CTXS")
# Pure-GPU ctx for gain comparison
declare -A PURE_GPU_CTX=([smollm3]=24576 [e2b]=24576 [e4b]=24576 [q3]=16384 [qwen35q]=32768)
GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; CYAN='\033[0;36m'; NC='\033[0m'
HR="======================================================================"
# Tiny alloc file — enough for 1 chunk, minimal compute time
ALLOC_FILE="/tmp/kv_alloc_tiny.txt"
python3 -c "
sentences = [
'The transformer architecture uses self-attention mechanisms to process sequences.',
'Large language models require significant computational resources for training.',
'Quantization reduces memory usage by storing weights in lower precision formats.',
'Flash attention enables memory-efficient computation for long context windows.',
'The key-value cache stores intermediate attention states during generation.',
]
import random; random.seed(1)
print(chr(10).join([random.choice(sentences) for _ in range(64)]))
" > "$ALLOC_FILE"
# check_alloc MODEL NGL KV CTX [EXTRA...]
# Returns "<host_kv_mib>" on success, "OOM" on failure. Fast: <15s.
check_alloc() {
local model=$1 ngl=$2 kv=$3 ctx=$4
shift 4
local extra_args=("$@")
local tmp_err; tmp_err=$(mktemp)
timeout 90 /app/llama-perplexity \
-m "$model" -ngl "$ngl" \
-fa on -nkvo \
-c "$ctx" -ctk "$kv" -ctv "$kv" \
-f "$ALLOC_FILE" --chunks 1 \
"${extra_args[@]}" \
> /dev/null 2>"$tmp_err"
local rc=$?
local err; err=$(cat "$tmp_err"); rm -f "$tmp_err"
if grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory\|cannot create buffer" <<< "$err"; then
echo "OOM"; return 1
fi
# Parse Host context MiB: "| Host | total = model + context + compute |"
local host_ctx_mib
host_ctx_mib=$(grep "Host" <<< "$err" | \
grep -oP "=\s*\d+\s*\+\s*\K\d+(?=\s*\+)" | head -1 || true)
echo "${host_ctx_mib:-?}"
}
# measure_baseline_tps MODEL NGL [EXTRA...]
measure_baseline_tps() {
local model=$1 ngl=$2
shift 2
local extra_args=("$@")
local raw
raw=$(timeout 120 /app/llama-bench \
-m "$model" -ngl "$ngl" -t "$CPU_THREADS" \
-p 1 -n "$BENCH_GEN" \
-ctk q4_0 -ctv q4_0 -nkvo 1 -fa 1 -r 1 -o csv \
"${extra_args[@]}" 2>/dev/null) || true
printf '%s\n' "$raw" | sed 's/"//g' | \
awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}'
}
# estimate_tps BASELINE_TPS KV_PER_TOKEN_MIB CTX BW_GBPS
estimate_tps() {
local baseline_tps=$1 kv_per_token_mib=$2 ctx=$3 bw_gbps=$4
python3 -c "
baseline = float('$baseline_tps')
kv_tok_bytes = float('$kv_per_token_mib') * 1024 * 1024
bps = float('$bw_gbps') * 1e9
ctx = int('$ctx')
base_ms = 1000.0 / baseline
kv_ms = ctx * kv_tok_bytes / bps * 1000
print(f'{1000.0 / (base_ms + kv_ms):.1f}')
" 2>/dev/null || echo "?"
}
# ---------------------------------------------------------------------------
echo "$HR"
echo "CPU-RAM KV CONTEXT TEST v4 (-nkvo, TurboQuant FORCE_MMQ) -- $(date)"
echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)"
echo "KV types tested: q4_0 (4-bit) and turbo2 (2-bit, 2x smaller → 2x more ctx)"
printf "PCIe assumption: %.1f GB/s | RAM assumption: %.1f GB/s\n" "$PCIE_BW_GBPS" "$RAM_BW_GBPS"
echo "$HR"
echo ""
declare -a SUMMARY=()
for entry in \
"smollm3:SmolLM3-3B:$M_SMOL" \
"e2b:Gemma4-E2B:$M_E2B" \
"e4b:Gemma4-E4B:$M_E4B" \
"q3:Qwen3-4B:$M_Q3" \
"qwen35q:Qwen3.5-9B:$M_Q35"
do
IFS=':' read -r key lbl model <<< "$entry"
[[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue
eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")"
ngl="${NGL_GPU[$key]}"
bw_gbps="${BW_GBPS[$key]}"
bw_label="${BW_LABEL[$key]}"
# turbo2 incompatible with Qwen3-4B (quality fails at ctx>=8192)
# turbo2 alloc works for Qwen3.5-9B but quality measurement unreliable — test q4_0 only
if [[ "$key" == "q3" || "$key" == "qwen35q" ]]; then
kv_types_to_test=(q4_0)
else
kv_types_to_test=(q4_0 turbo2)
fi
extra_args=()
printf "${BLUE}=== %s (ngl=%s, BW model: %s %.0f GB/s) ===${NC}\n" \
"$lbl" "$ngl" "$bw_label" "$bw_gbps"
# Baseline t/s (empty KV, with q4_0 -nkvo — upper bound)
printf " Measuring baseline t/s (empty KV, p=1)... "
baseline_tps=$(measure_baseline_tps "$model" "$ngl" "${extra_args[@]}")
if [[ -z "$baseline_tps" ]]; then
printf "${RED}FAIL${NC}\n\n"
SUMMARY+=("$lbl|FAIL|FAIL|FAIL|FAIL|FAIL")
continue
fi
printf "${GREEN}%s t/s${NC}\n\n" "$baseline_tps"
# Header
printf " %-10s %-12s %-12s %-12s %-12s %-12s %-12s\n" \
"ctx" "KV type" "KV in RAM" "kv/tok" "t/s@25%" "t/s@50%" "t/s@100%"
printf " %-10s %-12s %-12s %-12s %-12s %-12s %-12s\n" \
"---" "-------" "---------" "------" "-------" "-------" "--------"
max_ctx_q4=""
max_ctx_t2=""
rec_q4=""
rec_t2=""
declare -A kv_ref_mib=()
for ctx in "${ctxs[@]}"; do
for kv_type in "${kv_types_to_test[@]}"; do
result=$(check_alloc "$model" "$ngl" "$kv_type" "$ctx" "${extra_args[@]}")
if [[ "$result" == "OOM" ]]; then
printf " ${RED}%-10s %-12s OOM${NC}\n" "$ctx" "$kv_type"
continue
fi
host_kv_mib="${result}"
[[ "$kv_type" == "q4_0" ]] && max_ctx_q4=$ctx || max_ctx_t2=$ctx
# KV per token
if [[ "$host_kv_mib" =~ ^[0-9]+$ ]]; then
kv_per_token_mib=$(python3 -c "print(f'{$host_kv_mib / $ctx:.6f}')")
kv_ref_mib[$kv_type]=$kv_per_token_mib
else
kv_per_token_mib="${kv_ref_mib[$kv_type]:-?}"
fi
tps25=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 4 ))" "$bw_gbps")
tps50=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 2 ))" "$bw_gbps")
tps100=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$ctx" "$bw_gbps")
meets=$(python3 -c "print(1 if '$tps50' != '?' and float('$tps50') >= $TARGET_TPS else 0)" 2>/dev/null || echo 0)
[[ "$kv_type" == "q4_0" && "$meets" == "1" ]] && rec_q4=$ctx
[[ "$kv_type" == "turbo2" && "$meets" == "1" ]] && rec_t2=$ctx
color=$([[ "$meets" == "1" ]] && echo "$GREEN" || echo "$YELLOW")
printf " ${color}%-10s${NC} %-12s %-12s %-12s %-12s ${color}%-12s${NC} %-12s\n" \
"$ctx" "$kv_type" "${host_kv_mib}MiB" "${kv_per_token_mib}MiB" \
"$tps25" "$tps50" "$tps100"
done
done
rec_q4="${rec_q4:-$max_ctx_q4}"
rec_t2="${rec_t2:-$max_ctx_t2}"
pg="${PURE_GPU_CTX[$key]}"
printf "\n Recommended ctx (>=%s t/s@50%%): q4_0=%s turbo2=%s (pure-GPU was %s)\n\n" \
"$TARGET_TPS" "${rec_q4:-FAIL}" "${rec_t2:-FAIL}" "$pg"
gain_q4=$([[ -n "${rec_q4:-}" && "${rec_q4:-}" != "FAIL" ]] && echo "$((rec_q4 - pg))" || echo "?")
gain_t2=$([[ -n "${rec_t2:-}" && "${rec_t2:-}" != "FAIL" ]] && echo "$((rec_t2 - pg))" || echo "?")
SUMMARY+=("$lbl|$baseline_tps|${max_ctx_q4:-OOM}|${rec_q4:-FAIL}|${max_ctx_t2:-OOM}|${rec_t2:-FAIL}|$gain_q4|$gain_t2")
unset kv_ref_mib max_ctx_q4 max_ctx_t2 rec_q4 rec_t2
done
echo "$HR"
echo "SUMMARY — -nkvo (KV in RAM): q4_0 vs turbo2"
echo "$HR"
printf "%-16s %-12s %-14s %-14s %-14s %-14s\n" \
"Model" "Baseline t/s" "q4_0 max" "q4_0 rec" "turbo2 max" "turbo2 rec"
printf "%-16s %-12s %-14s %-14s %-14s %-14s\n" \
"-----" "------------" "--------" "--------" "----------" "----------"
for row in "${SUMMARY[@]}"; do
IFS='|' read -r lbl btps max_q4 rec_q4 max_t2 rec_t2 g_q4 g_t2 <<< "$row"
printf "${GREEN}%-16s %-12s %-14s %-14s %-14s %-14s [q4+%s / t2+%s vs pure-GPU]${NC}\n" \
"$lbl" "$btps" "$max_q4" "$rec_q4" "$max_t2" "$rec_t2" "$g_q4" "$g_t2"
done
echo "$HR"
echo "Note: Qwen3.5-9B baseline already <15 t/s (RAM-bound, 8.86 GB model). BW model uses RAM not PCIe."
echo "$HR"

116
scripts/download_models.sh Executable file
View File

@@ -0,0 +1,116 @@
#!/usr/bin/env bash
# download_models.sh — Download GGUF model files to ./models/
#
# Usage:
# bash scripts/download_models.sh # all models
# bash scripts/download_models.sh smollm3 # single model
# bash scripts/download_models.sh gemma4-e2b gemma4-e4b # multiple
#
# Requires: huggingface-cli (pip install huggingface_hub)
# Models land in: ./models/
#
# Available keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all
set -euo pipefail
MODELS_DIR="$(cd "$(dirname "$0")/.." && pwd)/models"
mkdir -p "$MODELS_DIR"
GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'; NC='\033[0m'
check_hf_cli() {
if ! command -v huggingface-cli &>/dev/null; then
echo -e "${RED}Error: huggingface-cli not found.${NC}"
echo "Install with: pip install huggingface_hub"
exit 1
fi
}
download() {
local key="$1"
local repo="$2"
local filename="$3"
local size_hint="$4"
local dest="$MODELS_DIR/$filename"
if [[ -f "$dest" ]]; then
echo -e "${YELLOW}[$key]${NC} Already exists: $filename — skipping"
return
fi
echo -e "${GREEN}[$key]${NC} Downloading $filename (~$size_hint) from $repo ..."
huggingface-cli download "$repo" "$filename" --local-dir "$MODELS_DIR"
echo -e "${GREEN}[$key]${NC} Done: $MODELS_DIR/$filename"
}
download_smollm3() {
download "smollm3" \
"bartowski/HuggingFaceTB_SmolLM3-3B-GGUF" \
"HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" \
"1.9 GB"
}
download_gemma4_e2b() {
download "gemma4-e2b" \
"bartowski/google_gemma-4-E2B-it-GGUF" \
"google_gemma-4-E2B-it-Q4_K_M.gguf" \
"2.9 GB"
}
download_gemma4_e4b() {
download "gemma4-e4b" \
"bartowski/google_gemma-4-E4B-it-GGUF" \
"google_gemma-4-E4B-it-Q4_K_M.gguf" \
"4.7 GB"
}
download_qwen3_4b() {
download "qwen3-4b" \
"bartowski/Qwen3-4B-GGUF" \
"Qwen3-4B-Q4_K_M.gguf" \
"2.4 GB"
}
download_qwen35_9b() {
download "qwen35-9b" \
"Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF" \
"Qwen3.5-9B.Q8_0.gguf" \
"8.9 GB"
}
main() {
check_hf_cli
local targets=("$@")
if [[ ${#targets[@]} -eq 0 || "${targets[0]}" == "all" ]]; then
targets=(smollm3 gemma4-e2b gemma4-e4b qwen3-4b qwen35-9b)
fi
for target in "${targets[@]}"; do
case "$target" in
smollm3) download_smollm3 ;;
gemma4-e2b) download_gemma4_e2b ;;
gemma4-e4b) download_gemma4_e4b ;;
qwen3-4b) download_qwen3_4b ;;
qwen35-9b) download_qwen35_9b ;;
all)
download_smollm3
download_gemma4_e2b
download_gemma4_e4b
download_qwen3_4b
download_qwen35_9b
;;
*)
echo -e "${RED}Unknown model: $target${NC}"
echo "Valid keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all"
exit 1
;;
esac
done
echo ""
echo "Models directory:"
ls -lh "$MODELS_DIR"/*.gguf 2>/dev/null || echo "(no .gguf files found)"
}
main "$@"

246
scripts/kv_quant_test.sh Normal file
View File

@@ -0,0 +1,246 @@
#!/bin/bash
# KV cache quantization test using llama-perplexity.
# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq (FORCE_MMQ, turbo2/3/4 support)
#
# Tests KV types: f16 (baseline) + q8_0/q4_0/turbo2 for Q4_K_M models
# f16 (baseline) + turbo2/3/4 for Qwen3.5-9B Q8_0
# Quality gate: PPL delta vs f16 < 0.5 (lossless for practical use)
#
# Usage: bash /scripts/kv_quant_test.sh [MODEL_KEY]
# MODEL_KEY: smollm3 | e2b | e4b | q3 | qwen35q | all (default: all)
set -uo pipefail
TARGET="${1:-all}"
M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
M_Q35="/models/Qwen3.5-9B.Q8_0.gguf"
declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11)
declare -A BASE_CTX=([smollm3]=18432 [e2b]=32768 [e4b]=20480 [q3]=8192 [qwen35q]=8192)
declare -A PPL_TIMEOUT=([smollm3]=300 [e2b]=300 [e4b]=300 [q3]=300 [qwen35q]=600)
# Per-model KV types to test (f16 is always the baseline)
# Standard Q4_K_M models: q8_0/q4_0 + turbo2 (all supported by TurboQuant image)
# Qwen3.5-9B: designed for turbo KV — test turbo2/3/4 only (q4_0 would also work but less relevant)
declare -A MODEL_KV_TYPES=(
[smollm3]="q8_0 q4_0 turbo2"
[e2b]="q8_0 q4_0 turbo2"
[e4b]="q8_0 q4_0 turbo2"
[q3]="q8_0 q4_0 turbo2"
[qwen35q]="turbo2 turbo3 turbo4"
)
# ctx candidates per model
SMOL_CTXS=(8192 12288 16384 18432 20480 24576 32768 40960 49152)
E2B_CTXS=(8192 16384 24576 32768 40960 49152 65536)
E4B_CTXS=(8192 12288 16384 20480 24576 32768 40960)
Q3_CTXS=(4096 6144 8192 10240 12288 16384 24576 32768)
Q35_CTXS=(4096 8192 16384 24576 32768 40960 49152)
declare -A CTX_CANDIDATES=(
[smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS"
[q3]="Q3_CTXS" [qwen35q]="Q35_CTXS")
GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m'
HR="======================================================================"
# Synthetic PPL file — 4000 lines, deterministic, no network needed
PPL_FILE="/tmp/kv_ppl_input.txt"
ensure_ppl_file() {
[[ -f "$PPL_FILE" ]] && return
python3 - << 'PY'
import random, sys
random.seed(42)
sentences = [
"The transformer architecture uses self-attention mechanisms to process sequences.",
"Large language models require significant computational resources for training.",
"Quantization reduces memory usage by storing weights in lower precision formats.",
"Flash attention enables memory-efficient computation for long context windows.",
"The key-value cache stores intermediate attention states during generation.",
"Context length determines how many tokens the model can attend to simultaneously.",
"Perplexity measures how well a probability model predicts a sample of text.",
"Lower perplexity values indicate better language modeling performance overall.",
"GPU memory bandwidth is the primary bottleneck for autoregressive token generation.",
"Grouped query attention reduces KV cache size by sharing keys across head groups.",
"Rotary position embeddings encode relative position information in attention queries.",
"Mixture of experts models route tokens through specialized feed-forward networks.",
"Continuous batching allows servers to process multiple requests simultaneously.",
"KV cache quantization trades a small quality loss for significantly larger contexts.",
]
lines = [random.choice(sentences) for _ in range(4000)]
print('\n'.join(lines), file=open('/tmp/kv_ppl_input.txt', 'w'))
PY
}
# run_ppl MODEL NGL KV CTX TIMEOUT [EXTRA_ARGS...]
# Echoes PPL value on stdout, returns 0 on success, 1 on OOM/crash.
run_ppl() {
local model=$1 ngl=$2 kv=$3 ctx=$4 timeout_s=$5
shift 5
local extra_args=("$@")
local tmp_err; tmp_err=$(mktemp)
local ppl_out; ppl_out=$(mktemp)
timeout "$timeout_s" /app/llama-perplexity \
-m "$model" \
-ngl "$ngl" \
-fa on \
-c "$ctx" \
-ctk "$kv" -ctv "$kv" \
-f "$PPL_FILE" \
--chunks 1 \
"${extra_args[@]}" \
> "$ppl_out" 2>"$tmp_err"
local ppl_rc=$?
local err; err=$(cat "$tmp_err"); rm -f "$tmp_err"
if [[ "$ppl_rc" != "0" ]] || \
grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory" <<< "$err"; then
rm -f "$ppl_out"
return 1
fi
local ppl_val
ppl_val=$(grep -oP '\[\d+\]\K[0-9.]+' "$ppl_out" | tail -1)
rm -f "$ppl_out"
[[ -z "$ppl_val" ]] && return 1
echo "$ppl_val"
}
# ---------------------------------------------------------------------------
ensure_ppl_file
echo "$HR"
echo "KV CACHE QUANT TEST (llama-perplexity) — TurboQuant image (FORCE_MMQ SM75)"
echo "$(date)"
echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)"
echo "$HR"
echo "Standard models: f16 baseline | q8_0 | q4_0 | turbo2 (2-bit, 8x smaller KV vs f16)"
echo "Qwen3.5-9B: f16 baseline | turbo2 | turbo3 | turbo4 (TurboQuant KV types)"
echo "Quality gate: PPL delta vs f16 < 0.5"
echo ""
declare -a SUMMARY=()
for entry in \
"smollm3:SmolLM3-3B:$M_SMOL" \
"e2b:Gemma4-E2B:$M_E2B" \
"e4b:Gemma4-E4B:$M_E4B" \
"q3:Qwen3-4B:$M_Q3" \
"qwen35q:Qwen3.5-9B:$M_Q35"
do
IFS=':' read -r key lbl model <<< "$entry"
[[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue
eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")"
ngl="${NGL[$key]}"
timeout_s="${PPL_TIMEOUT[$key]}"
IFS=' ' read -ra kv_types <<< "${MODEL_KV_TYPES[$key]}"
# Extra args for qwen35-9b (flash attn already set; no mlock needed for PPL correctness)
extra_args=()
printf "${BLUE}=== %s (base ctx=%s, ngl=%s) ===${NC}\n" \
"$lbl" "${BASE_CTX[$key]}" "$ngl"
# Dynamic header based on KV types for this model
printf " %-10s %-18s" "ctx" "f16 (PPL)"
for kv in "${kv_types[@]}"; do
printf " %-20s" "$kv (PPL/delta)"
done
printf "\n"
printf " %-10s %-18s" "---" "---------"
for kv in "${kv_types[@]}"; do
printf " %-20s" "--------------------"
done
printf "\n"
declare -A best_ctx_per_kv=([f16]="${BASE_CTX[$key]}")
for kv in "${kv_types[@]}"; do best_ctx_per_kv[$kv]="${BASE_CTX[$key]}"; done
declare -A oom_kv=([f16]=0)
for kv in "${kv_types[@]}"; do oom_kv[$kv]=0; done
declare -A ppl_f16_at_ctx=()
for ctx in "${ctxs[@]}"; do
printf " %-10s" "$ctx"
# f16 baseline
f16_ppl=""
if [[ "${oom_kv[f16]}" == "1" ]]; then
printf " ${RED}%-18s${NC}" "OOM"
else
f16_ppl=$(run_ppl "$model" "$ngl" "f16" "$ctx" "$timeout_s" "${extra_args[@]}")
if [[ $? -ne 0 ]]; then
printf " ${RED}%-18s${NC}" "OOM"
oom_kv[f16]=1
else
printf " ${GREEN}%-18s${NC}" "$f16_ppl"
best_ctx_per_kv[f16]=$ctx
ppl_f16_at_ctx[$ctx]=$f16_ppl
fi
fi
# KV type columns
for kv in "${kv_types[@]}"; do
if [[ "${oom_kv[$kv]}" == "1" ]]; then
printf " ${RED}%-20s${NC}" "OOM"
continue
fi
ppl=$(run_ppl "$model" "$ngl" "$kv" "$ctx" "$timeout_s" "${extra_args[@]}")
if [[ $? -ne 0 ]]; then
printf " ${RED}%-20s${NC}" "OOM"
oom_kv[$kv]=1
continue
fi
best_ctx_per_kv[$kv]=$ctx
if [[ -n "$f16_ppl" ]]; then
delta=$(python3 -c "print(f'{float(\"$ppl\")-float(\"$f16_ppl\"):+.2f}')" 2>/dev/null || echo "?")
ok=$(python3 -c "exit(0 if abs(float('$ppl')-float('$f16_ppl'))<0.5 else 1)" 2>/dev/null && echo ok || echo bad)
if [[ "$ok" == "ok" ]]; then
printf " ${GREEN}%-20s${NC}" "${ppl}(${delta})"
else
printf " ${YELLOW}%-20s${NC}" "${ppl}(${delta})"
fi
else
printf " ${GREEN}%-20s${NC}" "$ppl"
fi
done
echo ""
done
echo ""
# Best recommendation: highest ctx where all non-f16 types passed quality gate
overall_best_ctx="${BASE_CTX[$key]}"
overall_best_kv="f16"
for kv in "${kv_types[@]}"; do
bctx="${best_ctx_per_kv[$kv]}"
SUMMARY+=("$lbl|$kv|$bctx")
if [[ "$bctx" -gt "$overall_best_ctx" ]]; then
overall_best_ctx=$bctx; overall_best_kv=$kv
fi
done
SUMMARY+=("$lbl|f16|${best_ctx_per_kv[f16]}")
printf " ${GREEN}Best: %s → max ctx %s${NC}\n\n" "$overall_best_kv" "$overall_best_ctx"
unset best_ctx_per_kv oom_kv ppl_f16_at_ctx
done
echo "$HR"
echo "SUMMARY"
echo "$HR"
printf "%-16s %-8s %s\n" "Model" "KV" "Max Ctx (no OOM + PPL delta<0.5)"
printf "%-16s %-8s %s\n" "-----" "--" "---------------------------------"
for row in "${SUMMARY[@]}"; do
IFS='|' read -r lbl kv ctx <<< "$row"
printf "${GREEN}%-16s %-8s %s${NC}\n" "$lbl" "$kv" "$ctx"
done
echo "$HR"
echo "Reminder: update envs/.env.<model>: CACHE_TYPE_K/V=<best_kv> CTX_SIZE=<max_ctx>"
echo "$HR"

215
scripts/quality_test.sh Normal file
View File

@@ -0,0 +1,215 @@
#!/bin/bash
# Quality tests for all 4 models — runs inside full-cuda container.
# Tests: coding tasks + needle-in-haystack at 1K/8K ctx.
#
# Inference parameters sourced from official HF model cards:
# SmolLM3: /no_think in SYSTEM prompt (-sys); temp=0.6 top_p=0.95
# Qwen3: /no_think in SYSTEM prompt (-sys); temp=0.7 top_p=0.8 top_k=20
# DO NOT use greedy (temp=0) — causes endless repetition per Qwen3 docs
# Gemma4: No thinking mode; temp=0.7 top_p=0.95
set -uo pipefail
M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf"
M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf"
M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf"
M_Q3="/models/Qwen3-4B-Q4_K_M.gguf"
declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99)
declare -A MAX_CTX=([smollm3]=24576 [e2b]=32768 [e4b]=24576 [q3]=8192)
# Per-model sampling params (HF model card sources)
declare -A TEMP=([smollm3]="0.6" [e2b]="0.7" [e4b]="0.7" [q3]="0.7")
declare -A TOPP=([smollm3]="0.95" [e2b]="0.95" [e4b]="0.95" [q3]="0.8")
declare -A TOPK=([smollm3]="0" [e2b]="0" [e4b]="0" [q3]="20")
# /no_think in system prompt disables thinking for SmolLM3 and Qwen3
declare -A SYSP=([smollm3]="/no_think" [e2b]="" [e4b]="" [q3]="/no_think")
PASS=0; FAIL=0; TOTAL=0
GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; NC='\033[0m'
# sed script to strip llama-cli interactive UI banner from stdout.
# ▄ (U+2584) and █ (U+2588) appear in the llama.cpp ASCII logo, sometimes
# with leading spaces — match anywhere on the line to be safe.
STRIP_BANNER='/^$/d
/^Loading model/d
/^[[:space:]]*$/d
/[▄█]/d
/^build /d
/^model /d
/^modalities/d
/^available commands/d
/^ \//d
/^\[ Prompt:/d
/^\[ Prompt:/d
/^Exiting/d
/^> /d
'
check() {
local lbl="$1" out="$2"
shift 2
local patterns=("$@")
local ok=1
for pat in "${patterns[@]}"; do
printf '%s\n' "$out" | grep -qiE "$pat" || { ok=0; break; }
done
TOTAL=$((TOTAL+1))
if [ "$ok" = "1" ]; then
PASS=$((PASS+1)); printf " ${GREEN}PASS${NC} %s\n" "$lbl"
else
FAIL=$((FAIL+1)); printf " ${RED}FAIL${NC} %s\n" "$lbl"
printf '%s\n' "$out" | grep -v '^$' | tail -3 | sed 's/^/ | /'
fi
}
# Strip thinking blocks from output.
# Gemma4 uses [Start thinking]...[End thinking].
# Qwen3/SmolLM3 use <think>...</think>.
# Match to end-of-string as fallback for truncated/incomplete blocks.
strip_think() {
python3 -c "
import sys, re
t = sys.stdin.read()
# Only strip COMPLETE blocks. If thinking hit token limit, leave as-is so
# check patterns can still match reasoning content inside the block.
t = re.sub(r'\[Start thinking\].*?\[End thinking\]', '', t, flags=re.DOTALL)
t = re.sub(r'<think>.*?</think>', '', t, flags=re.DOTALL)
print(t.strip())
" 2>/dev/null || cat
}
# run KEY MODEL PROMPT MAX_TOKENS [SYS_OVERRIDE]
# SYS_OVERRIDE defaults to SYSP[$key] if omitted.
# Pass "" explicitly to disable system prompt (thinking ON for Qwen3/SmolLM3).
# Thinking params: SmolLM3/Qwen3 thinking=temp0.6/top_p0.95, nothink=model defaults.
run() {
local key=$1 model=$2 prompt=$3 max_tok=$4
local ngl="${NGL[$key]}"
# 5th arg overrides sys; if not provided, use SYSP[$key]
local use_sys
if [ "${5+x}" = "x" ]; then use_sys="$5"; else use_sys="${SYSP[$key]}"; fi
# choose sampling params: thinking mode uses 0.6/0.95, non-think uses model defaults
local temp topp topk
if [ -z "$use_sys" ] && [[ "$key" == "smollm3" || "$key" == "q3" ]]; then
temp="0.6"; topp="0.95"; topk="${TOPK[$key]}"
else
temp="${TEMP[$key]}"; topp="${TOPP[$key]}"; topk="${TOPK[$key]}"
fi
local sys_arg=()
[ -n "$use_sys" ] && sys_arg=(-sys "$use_sys")
local topk_arg=()
[ "$topk" != "0" ] && topk_arg=(--top-k "$topk")
timeout 300 /app/llama-cli -m "$model" -ngl "$ngl" \
-n "$max_tok" --temp "$temp" --top-p "$topp" "${topk_arg[@]}" \
--repeat-penalty 1.1 -fa on --mmap --single-turn \
"${sys_arg[@]}" -p "$prompt" 2>/dev/null \
| sed "$STRIP_BANNER" \
| strip_think
}
# needle_test KEY MODEL NEEDLE CTX
# Generates ~CTX tokens of filler, plants needle in middle, asks to recall it.
needle_test() {
local key=$1 model=$2 needle=$3 ctx=$4
local ngl="${NGL[$key]}"
local temp="${TEMP[$key]}" topp="${TOPP[$key]}" sys="${SYSP[$key]}"
local sys_arg=()
[ -n "$sys" ] && sys_arg=(-sys "$sys")
# filler: ctx/2 tokens each side, 1 token ~4 chars
local half_chars=$(( ctx * 2 ))
local reps=$(( half_chars / 45 + 2 ))
local filler
filler=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * $reps)" 2>/dev/null \
| head -c "$half_chars")
local prompt
printf -v prompt \
'%s\nSECRET_VALUE=%s\n%s\nWhat is SECRET_VALUE? Reply with only the value, nothing else.' \
"$filler" "$needle" "$filler"
local ctx_size=$(( ctx + 512 ))
local out
out=$(timeout 180 /app/llama-cli -m "$model" -ngl "$ngl" \
-n 512 --temp "$temp" --top-p "$topp" \
-fa on --mmap --single-turn \
-c "$ctx_size" "${sys_arg[@]}" -p "$prompt" 2>/dev/null \
| sed "$STRIP_BANNER" \
| strip_think)
# join lines before grep in case model breaks needle across newlines
local flat
flat=$(printf '%s' "$out" | tr '\n' ' ')
if printf '%s' "$flat" | grep -qF "$needle"; then
echo "FOUND"
else
local snip
snip=$(printf '%s' "$flat" | cut -c1-80)
echo "MISSED (${snip:-<empty>})"
fi
}
HR="======================================================================"
echo "$HR"
echo "QUALITY TESTS — ALL MODELS — $(date)"
echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)"
echo "$HR"
printf "Temps: SmolLM3=0.6/0.95 | Qwen3=0.7/0.8/k20 | Gemma4=0.7/0.95\n"
printf "/no_think via -sys for needle tests | thinking ON for coding bug test\n\n"
CODING_FIZZBUZZ='Write ONLY the Python function fizzbuzz(n). It returns a list where multiples of 3 are "Fizz", multiples of 5 are "Buzz", multiples of both are "FizzBuzz", others are the number as string. Output code only, no prose.'
# hi is correctly len(arr)-1 to have ONE unambiguous bug: lo=mid (infinite loop)
CODING_BUG='Find the bug in this Python function and explain it in one sentence:
def binary_search(arr, target):
lo, hi = 0, len(arr) - 1
while lo < hi:
mid = (lo + hi) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
lo = mid
else:
hi = mid
return -1'
for entry in "smollm3:SmolLM3-3B:$M_SMOL" "e2b:Gemma4-E2B:$M_E2B" "e4b:Gemma4-E4B:$M_E4B" "q3:Qwen3-4B:$M_Q3"; do
IFS=':' read -r key lbl model <<< "$entry"
echo "=== $lbl ==="
# Coding test 1: FizzBuzz — expect def + Fizz + Buzz
out=$(run "$key" "$model" "$CODING_FIZZBUZZ" 512)
check "FizzBuzz: def + Fizz + Buzz in output" "$out" \
"def " "Fizz" "Buzz"
# Coding test 2: Bug — thinking ON for all models (more reliable reasoning).
# Pass "" to disable /no_think override. Gemma4 already thinks by default.
out=$(run "$key" "$model" "$CODING_BUG" 3000 "")
check "Bug: identify lo=mid / infinite loop" "$out" \
"lo.*=.*mid.*\+.*1|lo\+1|infinite loop|never.*advance|never.*progress|stuck|lo should be|lo\b.*never.*incr"
# Needle-in-haystack
NEEDLE="QX7-ALPHA-9"
# strict < so we skip when ctx == max_ctx (prompt fills entire context, no room for output)
for ctx in 1024 8192; do
if [ "$ctx" -lt "${MAX_CTX[$key]}" ]; then
result=$(needle_test "$key" "$model" "$NEEDLE" "$ctx")
TOTAL=$((TOTAL+1))
if [[ "$result" == FOUND ]]; then
PASS=$((PASS+1)); printf " ${GREEN}PASS${NC} Needle @ %s tok: %s\n" "$ctx" "$result"
else
FAIL=$((FAIL+1)); printf " ${RED}FAIL${NC} Needle @ %s tok: %s\n" "$ctx" "$result"
fi
else
printf " ${YELLOW}SKIP${NC} Needle @ %s tok (exceeds model max %s)\n" "$ctx" "${MAX_CTX[$key]}"
fi
done
echo ""
done
echo "$HR"
printf "RESULTS: ${GREEN}%s PASSED${NC} / ${RED}%s FAILED${NC} / %s TOTAL\n" "$PASS" "$FAIL" "$TOTAL"
echo "$HR"