From 4ad296608b0d5a3dee262dbbc8861edb2aab5b67 Mon Sep 17 00:00:00 2001 From: Giancarmine Salucci Date: Wed, 6 May 2026 15:56:40 +0200 Subject: [PATCH] Initial commit: tuned multi-model llama.cpp stack - 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B - TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs - Bigctx profiles (-nkvo KV in RAM): 2-16x context gain - turbo2 KV: 2x smaller, benchmarked against PPL quality gate - Per-model env files with justified parameters - kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts - docs/FINDINGS.md: surprises, pitfalls, recommendations - docs/ARCHITECTURE.md: compose + test script design --- .env | 13 ++ .gitignore | 34 ++++ README.md | 174 +++++++++++++++++++ benchmark-results/.gitkeep | 0 compose.yaml | 290 +++++++++++++++++++++++++++++++ docs/ARCHITECTURE.md | 210 ++++++++++++++++++++++ docs/FINDINGS.md | 158 +++++++++++++++++ envs/.env.gemma4-e2b | 43 +++++ envs/.env.gemma4-e2b-bigctx | 26 +++ envs/.env.gemma4-e4b | 43 +++++ envs/.env.gemma4-e4b-bigctx | 26 +++ envs/.env.qwen3-4b | 42 +++++ envs/.env.qwen3-4b-bigctx | 24 +++ envs/.env.qwen35-9b | 41 +++++ envs/.env.smollm3-3b | 42 +++++ envs/.env.smollm3-3b-bigctx | 26 +++ scripts/benchmark.sh | 335 ++++++++++++++++++++++++++++++++++++ scripts/benchmark_models.sh | 175 +++++++++++++++++++ scripts/cpu_ctx_test.sh | 251 +++++++++++++++++++++++++++ scripts/download_models.sh | 116 +++++++++++++ scripts/kv_quant_test.sh | 246 ++++++++++++++++++++++++++ scripts/quality_test.sh | 215 +++++++++++++++++++++++ 22 files changed, 2530 insertions(+) create mode 100644 .env create mode 100644 .gitignore create mode 100644 README.md create mode 100644 benchmark-results/.gitkeep create mode 100644 compose.yaml create mode 100644 docs/ARCHITECTURE.md create mode 100644 docs/FINDINGS.md create mode 100644 envs/.env.gemma4-e2b create mode 100644 envs/.env.gemma4-e2b-bigctx create mode 100644 envs/.env.gemma4-e4b create mode 100644 envs/.env.gemma4-e4b-bigctx create 
mode 100644 envs/.env.qwen3-4b create mode 100644 envs/.env.qwen3-4b-bigctx create mode 100644 envs/.env.qwen35-9b create mode 100644 envs/.env.smollm3-3b create mode 100644 envs/.env.smollm3-3b-bigctx create mode 100755 scripts/benchmark.sh create mode 100644 scripts/benchmark_models.sh create mode 100644 scripts/cpu_ctx_test.sh create mode 100755 scripts/download_models.sh create mode 100644 scripts/kv_quant_test.sh create mode 100644 scripts/quality_test.sh diff --git a/.env b/.env new file mode 100644 index 0000000..1b74164 --- /dev/null +++ b/.env @@ -0,0 +1,13 @@ +# ============================================================================= +# llama.cpp project root .env +# +# Per-model parameters have moved to envs/.env.: +# envs/.env.qwen35-9b — Qwen3.5-9B Q8_0 TurboQuant ~4.4 t/s +# envs/.env.gemma4-e2b — Gemma 4 E2B Q4_K_M ~62 t/s +# envs/.env.gemma4-e4b — Gemma 4 E4B Q4_K_M (split) ~30 t/s +# envs/.env.smollm3-3b — SmolLM3 3B Q4_K_M ~53 t/s +# envs/.env.qwen3-4b — Qwen3 4B Q4_K_M ~39 t/s +# +# This file is loaded by Docker Compose for project-level interpolation. +# Add project-wide overrides here (e.g. COMPOSE_PROJECT_NAME). 
+# ============================================================================= diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..0787b72 --- /dev/null +++ b/.gitignore @@ -0,0 +1,34 @@ +# Model files — large binaries, download with scripts/download_models.sh +models/*.gguf +models/*.bin +models/*.safetensors + +# Benchmark output logs, CSVs, and generated env snapshots — generated, not source +benchmark-results/*.log +benchmark-results/*.csv +benchmark-results/*.txt +benchmark-results/*.env +# Keep the .gitkeep placeholder +!benchmark-results/.gitkeep + +# Docker build cache artifacts +.docker/ + +# Python cache +__pycache__/ +*.pyc +*.pyo +.venv/ + +# Editor / OS artifacts +.DS_Store +Thumbs.db +*.swp +*.swo +*~ +.idea/ +.vscode/ + +# Local overrides (never commit secrets or machine-specific tweaks) +.env.local +envs/.env.*.local diff --git a/README.md b/README.md new file mode 100644 index 0000000..6b911c5 --- /dev/null +++ b/README.md @@ -0,0 +1,174 @@ +# llama-cpp-docker + +Production-ready llama.cpp server stack for a single consumer GPU (GTX 1650 Ti / SM75 Turing). +Fully benchmarked and tuned: every parameter justified by measurement, not guesswork. + +--- + +## What this is + +A Docker Compose setup that runs multiple LLMs via [llama.cpp](https://github.com/ggerganov/llama.cpp), with: + +- **Per-model env files** — all parameters (ctx, KV type, ngl, threads) tuned per model on this hardware +- **TurboQuant image** — custom build with `FORCE_MMQ` (+6–11% free speed on Turing GPUs) and `turbo2/3/4` KV quantization +- **Bigctx profiles** — `-nkvo` (KV in RAM) variants that multiply usable context by 2–16× at modest speed cost +- **Benchmark scripts** — reproducible PPL quality tests and PCIe/RAM bandwidth-modeled context sizing +- **Open WebUI** — optional web UI, profile-composable with any model + +> **Hardware target**: GTX 1650 Ti (SM75, 3717 MiB VRAM), i7-10750H, 15 GiB DDR4-2933. +> Parameters will work on any similar Turing GPU. 
See [docs/FINDINGS.md](docs/FINDINGS.md) before porting to other architectures. + +--- + +## Quick start + +### 1. Build the TurboQuant image (once, ~20 min) + +```bash +docker compose --profile qwen35-9b build llama-qwen35-9b +``` + +This builds the `server-cuda-sm75-mmq` tag used by all server services. The `full-cuda-sm75-mmq` tag used by the bench services is defined on `bench-qwen35-9b` and has to be built through that service. + +### 2. Download models + +```bash +bash scripts/download_models.sh +``` + +Downloads all five models to `./models/`. Requires `huggingface-cli` (`pip install huggingface_hub`). +To download individual models: + +```bash +bash scripts/download_models.sh smollm3 +bash scripts/download_models.sh qwen35-9b +# options: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all +``` + +### 3. Start a model + +```bash +# Start SmolLM3 (53 t/s, 65K context in bigctx mode) +docker compose --profile smollm3-3b up -d + +# Start Gemma4-E2B (multimodal, 62 t/s, up to 393K context) +docker compose --profile gemma4-e2b up -d + +# Add Open WebUI to any running model +docker compose --profile gemma4-e2b --profile webui up -d +``` + +API is available at **http://localhost:8080** (OpenAI-compatible). +WebUI at **http://localhost:3000**. + +--- + +## Models + +| Profile | Model | Size | t/s | CTX | Highlights | +|---|---|---|---|---|---| +| `qwen35-9b` | Qwen3.5-9B Q8_0 | 8.9 GB | ~4.4 | 32K | Reasoning distill, hybrid linear-attn | +| `gemma4-e2b` | Gemma4-E2B Q4_K_M | 2.9 GB | ~62 | 24K | Multimodal (image/audio/video), MQA | +| `gemma4-e4b` | Gemma4-E4B Q4_K_M | 4.7 GB | ~30 | 24K | Multimodal, larger, CPU-split | +| `smollm3-3b` | SmolLM3-3B Q4_K_M | 1.9 GB | ~53 | 24K | Thinking mode, tool calling, Apache 2.0 | +| `qwen3-4b` | Qwen3-4B Q4_K_M | 2.4 GB | ~39 | 16K | Thinking mode, 119 languages, best ecosystem | + +### Big context profiles (KV in RAM via `-nkvo`) + +Use when you need more context than the pure-GPU profiles offer. Speed drops as context fills (PCIe bandwidth bottleneck). 
+ +| Profile | Model | KV type | CTX | ~t/s@50% fill | RAM KV usage | +|---|---|---|---|---|---| +| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | 714 MiB | +| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | 651 MiB | +| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | 346 MiB | +| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | ~972 MiB | + +```bash +docker compose --profile gemma4-e2b-bigctx up -d +``` + +--- + +## Running benchmarks + +One-shot — results written to `benchmark-results/`: + +```bash +# Standard llama-bench sweep +docker compose --profile bench-smollm3-3b run --rm bench-smollm3-3b + +# KV quantization quality test (all models) +docker compose --profile bench-qwen35-9b run --rm -T \ + --entrypoint="bash /scripts/kv_quant_test.sh all" bench-qwen35-9b + +# Context size test with bandwidth model (all models) +docker compose --profile bench-qwen35-9b run --rm -T \ + --entrypoint="bash /scripts/cpu_ctx_test.sh all" bench-qwen35-9b + +# Ad-hoc llama-bench +docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \ + bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null' +``` + +--- + +## Project structure + +``` +compose.yaml — All services, profiles, YAML anchors +envs/ + .env. 
— Pure-GPU tuned params per model + .env.-bigctx — -nkvo KV-in-RAM params +scripts/ + download_models.sh — huggingface-cli download helper + benchmark.sh — Default bench entrypoint (llama-bench sweep) + kv_quant_test.sh — PPL quality test: f16 vs q8_0/q4_0/turbo2 per model/ctx + cpu_ctx_test.sh — -nkvo alloc check + PCIe/RAM BW model → max viable ctx + quality_test.sh — Early generation quality test (superseded by kv_quant_test.sh) +docs/ + FINDINGS.md — What we learned, surprises, and what to watch out for + ARCHITECTURE.md — Compose and test script architecture in detail +models/ — GGUF model files (gitignored, downloaded separately) +benchmark-results/ — Test output logs and CSVs (gitignored) +``` + +--- + +## Key findings + +> Full details in [docs/FINDINGS.md](docs/FINDINGS.md). + +**FORCE_MMQ gives free +6–11% on Turing GPUs without tensor cores.** On these GPUs (the GTX 16-series, e.g. GTX 1650 and GTX 1660) the MMQ kernel is faster than cuBLAS GEMM. The TurboQuant image compiles this in. Do not use this image on Ampere/Ada GPUs, where it would hurt. + +**turbo2 KV quantization breaks Qwen3-4B.** At ctx ≥ 8192, PPL degrades catastrophically (1.79 at ctx=4096 → 4.2 → 15 → 438 at ctx=32768). Never use turbo2/3/4 for Qwen3-4B. Use q4_0. + +**turbo2 is paradoxically larger than q4_0 for Gemma4-E2B.** MQA architecture produces tiny KV tensors; block-quantization padding overhead makes turbo2 actually larger. Use q4_0 for E2B bigctx. + +**Gemma4's MQA architecture enables extreme context.** E2B has only 1.7 KB KV/token (vs SmolLM3's 19.8 KB). 393K context costs just 651 MiB RAM, and speed stays usable (62 t/s empty → 17 t/s@50% fill). + +**Qwen3.5-9B cannot use -nkvo.** At ngl=11, model weights + OS fill all 15 GiB RAM. No bigctx possible. Existing 32K config with turbo2 KV in VRAM is the ceiling. + +**`llama-perplexity` is incompatible with Qwen3.5-9B.** Hybrid linear-attention architecture causes the PPL tool to fail. Not a real model limitation — the server works correctly. 
+ +--- + +## Requirements + +- Docker + NVIDIA Container Toolkit +- NVIDIA GPU (SM75 for pre-built image; rebuild with different `CUDA_DOCKER_ARCH` for other architectures) +- `huggingface-cli` for model downloads: `pip install huggingface_hub` +- ~25 GB disk for all models (download selectively as needed) + +--- + +## Tuning for different hardware + +Edit `envs/.env.` files. Key parameters: + +- `N_GPU_LAYERS` — increase for more VRAM, decrease for CPU-split +- `CTX_SIZE` — reduce if OOM, increase if VRAM headroom +- `CACHE_TYPE_K/V` — `f16` > `q8_0` > `q4_0` > `turbo2` quality; reverse order for size +- `THREADS` — match physical core count (HT hurts for RAM-bound models) + +See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for full parameter reference. diff --git a/benchmark-results/.gitkeep b/benchmark-results/.gitkeep new file mode 100644 index 0000000..e69de29 diff --git a/compose.yaml b/compose.yaml new file mode 100644 index 0000000..8e53afe --- /dev/null +++ b/compose.yaml @@ -0,0 +1,290 @@ +# ============================================================================== +# llama.cpp multi-model server +# Hardware: GTX 1650 Ti (3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t +# +# MODEL PROFILES (mutually exclusive — GPU can only hold one at a time): +# qwen35-9b Qwen3.5-9B Q8_0 TurboQuant (turbo2 KV, FORCE_MMQ) ~4.4 t/s +# gemma4-e2b Gemma 4 E2B TurboQuant (FORCE_MMQ) ~62 t/s +# gemma4-e4b Gemma 4 E4B TurboQuant (FORCE_MMQ, CPU split) ~30 t/s +# smollm3-3b SmolLM3 3B TurboQuant (FORCE_MMQ) ~53 t/s +# qwen3-4b Qwen3 4B TurboQuant (FORCE_MMQ) ~39 t/s +# +# BIGCTX PROFILES (-nkvo: KV in RAM, benchmarked v4 2026-05-06, TurboQuant FORCE_MMQ): +# smollm3-3b-bigctx SmolLM3 3B ctx=65536 turbo2 | ~53 t/s base | ~15 t/s@50% | +40960 vs GPU +# gemma4-e2b-bigctx Gemma 4 E2B ctx=393216 q4_0 | ~62 t/s base | ~17 t/s@50% | +368640 vs GPU (MQA!) 
+# gemma4-e4b-bigctx Gemma 4 E4B ctx=163840 turbo2 | ~30 t/s base | ~18 t/s@50% | +139264 vs GPU +# qwen3-4b-bigctx Qwen3 4B ctx=24576 q4_0 | ~39 t/s base | ~11 t/s@50% | +8192 vs GPU +# +# OPTIONAL ADD-ON (combine with any model profile): +# webui Open WebUI — auto-connects to whichever model is running +# +# BENCHMARK PROFILES (one-shot, run with: docker compose ... run --rm ): +# bench-qwen35-9b / bench-gemma4-e2b / bench-gemma4-e4b +# bench-smollm3-3b / bench-qwen3-4b +# +# EXAMPLES: +# docker compose --profile qwen35-9b up -d +# docker compose --profile gemma4-e2b --profile webui up -d +# docker compose --profile bench-smollm3-3b run --rm --entrypoint="" bench-smollm3-3b \ +# bash -c '/app/llama-bench -m /models/$MODEL_FILE -ngl 99 -o csv 2>/dev/null' +# +# FIRST-TIME BUILD (qwen35-9b TurboQuant image, ~20 min): +# docker compose --profile qwen35-9b build llama-qwen35-9b +# +# Per-model params live in envs/.env. — edit there to retune. +# All server services expose API on host port 8080 and Docker network as +# http://llama-current:8080 via the llama-net network alias. +# ============================================================================== + +# ── Shared GPU passthrough ──────────────────────────────────────────────────── +x-gpu: &gpu + runtime: nvidia + environment: + NVIDIA_VISIBLE_DEVICES: all + NVIDIA_DRIVER_CAPABILITIES: compute,utility + +# ── Common healthcheck properties (start_period overridden per service) ─────── +x-hc: &hc + test: ["CMD-SHELL", "curl -sf http://localhost:8080/health | grep -q '\"status\":\"ok\"'"] + interval: 20s + timeout: 10s + retries: 10 + +# ── Common server scaffold ──────────────────────────────────────────────────── +# All model services merge this. Per-model differences go in envs/.env.. +# $$VAR uses double-$ to escape compose interpolation — shell expands them at +# runtime from the env_file variables injected into the container. 
+x-server: &server + <<: *gpu + container_name: llama_server + volumes: + - ./models:/models:ro + ports: + - "8080:8080" + shm_size: 1g + ulimits: + memlock: + soft: -1 + hard: -1 + restart: unless-stopped + entrypoint: ["/bin/sh", "-c"] + command: | + exec /app/llama-server \ + --model "/models/$$MODEL_FILE" \ + --host 0.0.0.0 --port 8080 \ + --n-gpu-layers $$N_GPU_LAYERS \ + --ctx-size $$CTX_SIZE \ + --threads $$THREADS --threads-batch $$THREADS_BATCH \ + --batch-size $$BATCH_SIZE --ubatch-size $$UBATCH_SIZE \ + --cache-type-k $$CACHE_TYPE_K --cache-type-v $$CACHE_TYPE_V \ + --cont-batching --parallel $$PARALLEL \ + $$EXTRA_ARGS \ + --log-disable + networks: + llama-net: + aliases: [llama-current] + +# ── Common benchmark scaffold ───────────────────────────────────────────────── +x-bench: &bench + <<: *gpu + container_name: llama_bench + volumes: + - ./models:/models:ro + - ./benchmark-results:/results + - ./scripts:/scripts:ro + shm_size: 1g + ulimits: + memlock: + soft: -1 + hard: -1 + entrypoint: ["/bin/bash", "/scripts/benchmark.sh"] + +# ── Networks ────────────────────────────────────────────────────────────────── +networks: + llama-net: + driver: bridge + +# ── Volumes ─────────────────────────────────────────────────────────────────── +volumes: + open-webui-data: + +# ============================================================================== +services: + + # ── QWEN 3.5-9B Q8_0 — TurboQuant (turbo2 KV, FORCE_MMQ, SM75) ──────────── + # Build image first: docker compose --profile qwen35-9b build llama-qwen35-9b + llama-qwen35-9b: + build: + context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache + dockerfile: .devops/cuda.Dockerfile + target: server + args: + CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON" + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [qwen35-9b] + env_file: envs/.env.qwen35-9b + healthcheck: + <<: *hc + retries: 12 + start_period: 180s # mlock pins 8.86 GB into RAM 
— needs time + + # ── GEMMA 4 E2B — 2.3B effective (5.1B total/PLE), 128K ctx, audio+video ─── + # Download: see envs/.env.gemma4-e2b for huggingface-cli command + llama-gemma4-e2b: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [gemma4-e2b] + env_file: envs/.env.gemma4-e2b + healthcheck: + <<: *hc + start_period: 60s + + # ── GEMMA 4 E4B — 4.5B effective (8B total/PLE), 128K ctx, CPU-split ──────── + # Fits ~28/42 layers on GPU; remaining layers run on CPU RAM + llama-gemma4-e4b: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [gemma4-e4b] + env_file: envs/.env.gemma4-e4b + healthcheck: + <<: *hc + start_period: 60s + + # ── SMOLLM3 3B — thinking mode, tool calling, 64K ctx, Apache 2.0 ────────── + llama-smollm3-3b: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [smollm3-3b] + env_file: envs/.env.smollm3-3b + healthcheck: + <<: *hc + start_period: 60s + + # ── QWEN3 4B — thinking mode, 128K ctx, best ecosystem ──────────────────── + llama-qwen3-4b: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [qwen3-4b] + env_file: envs/.env.qwen3-4b + healthcheck: + <<: *hc + start_period: 60s + + # ── BIGCTX VARIANTS (-nkvo: KV in RAM, benchmarked 2026-05-06) ──────────── + # Use when you need more context than the pure-GPU profiles offer. + # KV cache lives in CPU RAM instead of VRAM → VRAM freed for larger ctx. + # Speed estimated via PCIe bandwidth model (8 GB/s). E2B/E4B use MQA — tiny KV, far less PCIe pressure. 
+ + llama-smollm3-3b-bigctx: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [smollm3-3b-bigctx] + env_file: envs/.env.smollm3-3b-bigctx + healthcheck: + <<: *hc + start_period: 60s + + llama-gemma4-e2b-bigctx: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [gemma4-e2b-bigctx] + env_file: envs/.env.gemma4-e2b-bigctx + healthcheck: + <<: *hc + start_period: 60s + + llama-gemma4-e4b-bigctx: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [gemma4-e4b-bigctx] + env_file: envs/.env.gemma4-e4b-bigctx + healthcheck: + <<: *hc + start_period: 60s + + llama-qwen3-4b-bigctx: + image: local/llama-cpp-turboquant:server-cuda-sm75-mmq + <<: *server + profiles: [qwen3-4b-bigctx] + env_file: envs/.env.qwen3-4b-bigctx + healthcheck: + <<: *hc + start_period: 60s + + # ── OPEN WEBUI ───────────────────────────────────────────────────────────── + # Separate profile — add to any running model: + # docker compose --profile --profile webui up -d + # Connects to whichever model is running via the llama-current DNS alias. + # Open WebUI retries on startup so no depends_on needed. + openwebui: + image: ghcr.io/open-webui/open-webui:main + container_name: open_webui + profiles: [webui] + environment: + - OPENAI_API_BASE_URL=http://llama-current:8080/v1 + - OPENAI_API_KEY=sk-no-key-needed + - WEBUI_AUTH=false + ports: + - "3000:8080" + networks: + - llama-net + volumes: + - open-webui-data:/app/backend/data + restart: unless-stopped + + # ── BENCHMARKS ───────────────────────────────────────────────────────────── + # Run as one-shot: docker compose --profile bench- run --rm bench- + # Override entrypoint for ad-hoc: ... run --rm --entrypoint="" bench- bash -c '...' 
+ + bench-qwen35-9b: + build: + context: https://github.com/TheTom/llama-cpp-turboquant.git#feature/turboquant-kv-cache + dockerfile: .devops/cuda.Dockerfile + target: full + args: + CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON" + image: local/llama-cpp-turboquant:full-cuda-sm75-mmq + <<: *bench + profiles: [bench-qwen35-9b] + environment: + MODEL_FILE: Qwen3.5-9B.Q8_0.gguf + OUTPUT_DIR: /results + VARIANT: qwen35-9b-turboquant + PATH: /app:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin + + bench-gemma4-e2b: + image: local/llama-cpp-turboquant:full-cuda-sm75-mmq + <<: *bench + profiles: [bench-gemma4-e2b] + environment: + MODEL_FILE: google_gemma-4-E2B-it-Q4_K_M.gguf + OUTPUT_DIR: /results + VARIANT: gemma4-e2b + + bench-gemma4-e4b: + image: local/llama-cpp-turboquant:full-cuda-sm75-mmq + <<: *bench + profiles: [bench-gemma4-e4b] + environment: + MODEL_FILE: google_gemma-4-E4B-it-Q4_K_M.gguf + OUTPUT_DIR: /results + VARIANT: gemma4-e4b + + bench-smollm3-3b: + image: local/llama-cpp-turboquant:full-cuda-sm75-mmq + <<: *bench + profiles: [bench-smollm3-3b] + environment: + MODEL_FILE: HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf + OUTPUT_DIR: /results + VARIANT: smollm3-3b + + bench-qwen3-4b: + image: local/llama-cpp-turboquant:full-cuda-sm75-mmq + <<: *bench + profiles: [bench-qwen3-4b] + environment: + MODEL_FILE: Qwen3-4B-Q4_K_M.gguf + OUTPUT_DIR: /results + VARIANT: qwen3-4b diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..3752169 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,210 @@ +# Architecture + +Hardware: GTX 1650 Ti Mobile (SM75/Turing, 3717 MiB VRAM) + i7-10750H 6c/12t + 15 GiB DDR4-2933 RAM. 
+ +--- + +## Docker Compose Architecture + +### Image Strategy + +Two custom images built from the [TurboQuant fork](https://github.com/TheTom/llama-cpp-turboquant) of llama.cpp: + +| Image | Target | Used by | +|---|---|---| +| `local/llama-cpp-turboquant:server-cuda-sm75-mmq` | `server` | All llama-server services | +| `local/llama-cpp-turboquant:full-cuda-sm75-mmq` | `full` | All bench/test services | + +Both built with `CUDA_DOCKER_ARCH: "75 -DGGML_CUDA_FORCE_MMQ=ON"`: +- SM75 = Turing architecture codepath (no tensor cores) +- `FORCE_MMQ` = always use hand-written MMQ kernels instead of cuBLAS GEMM +- `full` target includes `llama-bench`, `llama-perplexity`, `llama-cli` alongside the server + +Both images share the same custom entrypoint wrapper that enables the `turbo2/3/4` KV quantization types unavailable in upstream llama.cpp. **All `docker run` calls must use `--entrypoint=""` to bypass the wrapper.** + +### Compose Structure + +``` +compose.yaml +├── x-gpu — NVIDIA runtime + capability passthrough (merged into all services) +├── x-hc — Common healthcheck (curl /health, start_period overridden per service) +├── x-server — Merged into all server services: +│ ├── volumes: ./models:/models:ro +│ ├── ports: 8080:8080 +│ ├── network alias: llama-current (all servers share this alias) +│ ├── entrypoint: llama-server with $$VAR shell expansion from env_file +│ └── restart: unless-stopped +└── x-bench — Merged into all bench services: + ├── volumes: ./models:/models:ro, ./benchmark-results:/results, ./scripts:/scripts:ro + └── entrypoint: /bin/bash /scripts/benchmark.sh (overrideable) +``` + +### Profile System + +Docker Compose profiles allow mutually exclusive model selection. Only one model server should run at a time (single GPU). 
+ +``` +docker compose --profile up -d +``` + +**Server profiles** (bring up `llama-server` on port 8080): + +| Profile | Model | Image | VRAM | Strategy | +|---|---|---|---|---| +| `qwen35-9b` | Qwen3.5-9B Q8_0 | TurboQuant (built) | 3.4 GB (11 layers) | RAM-bound; mlock pins weights | +| `gemma4-e2b` | Gemma4-E2B Q4_K_M | TurboQuant | ~3.4 GB | Full GPU, MQA | +| `gemma4-e4b` | Gemma4-E4B Q4_K_M | TurboQuant | ~3.5 GB | Full GPU (42 layers, CPU-split) | +| `smollm3-3b` | SmolLM3-3B Q4_K_M | TurboQuant | ~2.0 GB | Full GPU | +| `qwen3-4b` | Qwen3-4B Q4_K_M | TurboQuant | ~2.5 GB | Full GPU | + +**Bigctx profiles** (server with `-nkvo`: KV cache in host RAM): + +| Profile | Model | KV type | CTX | ~t/s@50% ctx | +|---|---|---|---|---| +| `smollm3-3b-bigctx` | SmolLM3-3B | turbo2 | 65536 | 15.2 | +| `gemma4-e2b-bigctx` | Gemma4-E2B | q4_0 | 393216 | 17.0 | +| `gemma4-e4b-bigctx` | Gemma4-E4B | turbo2 | 163840 | 17.8 | +| `qwen3-4b-bigctx` | Qwen3-4B | q4_0 | 24576 | 11.2 | + +**Bench profiles** (one-shot benchmark containers): + +| Profile | Service | Purpose | +|---|---|---| +| `bench-qwen35-9b` | bench-qwen35-9b | Also hosts `cpu_ctx_test.sh` / `kv_quant_test.sh` (all models have model files accessible) | +| `bench-gemma4-e2b` | bench-gemma4-e2b | E2B bench | +| `bench-gemma4-e4b` | bench-gemma4-e4b | E4B bench | +| `bench-smollm3-3b` | bench-smollm3-3b | SmolLM3 bench | +| `bench-qwen3-4b` | bench-qwen3-4b | Qwen3-4B bench | + +**Add-on profile** (combine with any model): + +| Profile | Service | Purpose | +|---|---|---| +| `webui` | openwebui | Open WebUI connecting to `llama-current:8080` | + +### Env File Architecture + +Each model has a dedicated `envs/.env.` file injected into the container. Shell variables use `$$VAR` in the compose command to escape compose interpolation — the container shell expands them at runtime. 
+ +``` +envs/ +├── .env.smollm3-3b ← pure-GPU: q8_0 KV, ctx=24576 +├── .env.smollm3-3b-bigctx ← -nkvo: turbo2 KV, ctx=65536 +├── .env.gemma4-e2b ← pure-GPU: f16 KV, ctx=24576 +├── .env.gemma4-e2b-bigctx ← -nkvo: q4_0 KV, ctx=393216 (turbo2 worse for MQA) +├── .env.gemma4-e4b ← pure-GPU: q4_0 KV, ctx=24576, ngl=42 +├── .env.gemma4-e4b-bigctx ← -nkvo: turbo2 KV, ctx=163840, ngl=42 +├── .env.qwen3-4b ← pure-GPU: q4_0 KV, ctx=16384 (NO turbo2 ever) +├── .env.qwen3-4b-bigctx ← -nkvo: q4_0 KV, ctx=24576 (NO turbo2 ever) +└── .env.qwen35-9b ← mixed: turbo2 KV, ctx=32768, ngl=11, mlock +``` + +Key env variables per file: + +```bash +MODEL_FILE # filename under /models/ +N_GPU_LAYERS # ngl: how many transformer layers offloaded to GPU +CTX_SIZE # context window size +THREADS / THREADS_BATCH +BATCH_SIZE / UBATCH_SIZE +CACHE_TYPE_K/V # KV quantization: f16 | q8_0 | q4_0 | turbo2 +PARALLEL # number of concurrent request slots +EXTRA_ARGS # passed verbatim to llama-server (e.g. --flash-attn on --no-kv-offload) +``` + +--- + +## Test Script Architecture + +All test scripts run inside the `bench-qwen35-9b` container (has `full` image with all binaries), with all model files accessible via `/models/`. + +### scripts/kv_quant_test.sh + +**Purpose**: Determine optimal KV quantization type for each model at various context sizes. +**Method**: `llama-perplexity` on a 4000-line synthetic text file. Computes perplexity for each (model, ctx, KV type) combination, measures Δ vs f16 baseline. +**Quality gate**: Δ < 0.5 → acceptable; Δ ≥ 0.5 → degraded. 
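The quality gate can be sketched as a small shell helper. This is only an illustration, not the code in `kv_quant_test.sh`: the function name `ppl_gate`, its argument order, and the output format are assumptions. The sample values are the Qwen3-4B turbo2 figures at ctx=4096 reported in docs/FINDINGS.md.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from kv_quant_test.sh): classify a KV type
# by its perplexity delta against the f16 baseline.
# Usage: ppl_gate <baseline_ppl> <ppl> [threshold]   (threshold defaults to 0.5)
ppl_gate() {
  awk -v base="$1" -v ppl="$2" -v gate="${3:-0.5}" 'BEGIN {
    delta = ppl - base
    # delta < gate is acceptable; delta >= gate is degraded
    verdict = (delta < gate) ? "acceptable" : "degraded"
    printf "delta=%.2f %s\n", delta, verdict
  }'
}

ppl_gate 1.76 1.79   # Qwen3-4B, turbo2, ctx=4096
```

A delta exactly at the threshold counts as degraded, matching the `Δ ≥ 0.5` rule.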
+ +``` +for each model: + for each ctx in CTX_CANDIDATES: + run f16 baseline → get PPL_baseline + for each KV type in MODEL_KV_TYPES: + run with that KV type → get PPL + report Δ = PPL - PPL_baseline +``` + +**Outputs**: +- Pass/fail per (model, ctx, KV type) combination +- Recommendation: highest-quality KV type that stays within quality gate at all tested ctx + +**Known limitations**: +- `Qwen3.5-9B`: hybrid linear-attention architecture is incompatible with `llama-perplexity` → always fails. Not a real model issue; the server works correctly. +- At very small ctx (< 4096), block-padding overhead inflates turbo2 apparent per-token cost. + +### scripts/cpu_ctx_test.sh + +**Purpose**: Find maximum viable context size when using `-nkvo` (KV in host RAM), accounting for PCIe bandwidth penalty. +**Method**: Two-phase per (model, ctx, KV type): + +1. **Alloc check** (fast, ~15s): run `llama-perplexity` on a 64-line file with `-nkvo`. The model allocates full KV at startup regardless of input length. If it exits cleanly → alloc succeeds; timeout/error → OOM. + +2. **Speed estimation** (analytic bandwidth model): + ``` + GPU-compute models (smollm3, e2b, e4b, qwen3-4b): + t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / PCIe_BW × 1000) + PCIe_BW = 8 GB/s (PCIe x4 Gen3 practical) + + RAM-bound models (qwen35-9b, ngl=11): + t/s(ctx) = 1000 / (1000/baseline + ctx × kv_bytes_per_tok / RAM_BW × 1000) + RAM_BW = 45 GB/s (DDR4-2933) + ``` + +3. **Recommendation**: highest ctx where `t/s@50%fill ≥ 15`. + +**kv_bytes_per_tok** measured empirically: `KV_MiB_allocated / ctx_size` from actual alloc run. + +**KV types tested per model**: + +| Model | KV types | Reason | +|---|---|---| +| smollm3, e2b, e4b | q4_0 + turbo2 | Both safe (PPL gate passes) | +| qwen3-4b | q4_0 only | turbo2 breaks at ctx≥8192 | +| qwen35-9b | q4_0 only | OOMs regardless (skipped) | + +### scripts/benchmark.sh + +Default entrypoint for bench containers. 
Runs `llama-bench` sweep over prompt/generation lengths and thread counts, outputs CSV to `/results/`. + +### scripts/quality_test.sh + +Early script (superseded by kv_quant_test.sh). Tested KV types via basic generation quality comparison. + +--- + +## Data Flow + +``` +Model GGUF files (./models/) + │ + ▼ +Docker container (/models/ read-only bind mount) + │ + ├─── llama-server ──► OpenAI-compatible API on :8080 + │ │ + │ env_file values: MODEL_FILE, N_GPU_LAYERS, CTX_SIZE, + │ CACHE_TYPE_K/V, EXTRA_ARGS, ... + │ + └─── llama-bench / llama-perplexity ──► benchmark-results/ (bind mount) + │ + test scripts (scripts/ read-only bind mount) +``` + +## Port / Network Layout + +``` +Host:8080 ──► llama_server container:8080 +Host:3000 ──► open_webui container:8080 ──► http://llama-current:8080/v1 (Docker network) + +llama-net (bridge): + llama-current — alias shared by ALL server services; only one runs at a time +``` diff --git a/docs/FINDINGS.md b/docs/FINDINGS.md new file mode 100644 index 0000000..ba4ad0f --- /dev/null +++ b/docs/FINDINGS.md @@ -0,0 +1,158 @@ +# Benchmarking Findings + +Hardware: GTX 1650 Ti Mobile (Turing/SM75, 3717 MiB VRAM, CC 7.5) + i7-10750H 6c/12t, 15 GiB DDR4-2933 RAM. +All benchmarks: llama.cpp `local/llama-cpp-turboquant:*-cuda-sm75-mmq` image (TurboQuant fork, `-DGGML_CUDA_FORCE_MMQ=ON`). +Date: 2026-05-05 / 2026-05-06. + +--- + +## 1. FORCE_MMQ — Free +6–11% on Turing GPUs + +**Finding**: Turing GPUs without tensor cores (the GTX 16-series, e.g. GTX 1650 and GTX 1660) route the GEMM path through cuBLAS, which is slower than the hand-written MMQ (matrix-multiply quantized) kernel. Compiling with `-DGGML_CUDA_FORCE_MMQ=ON` forces the MMQ path unconditionally. 
+ +| Model | Standard image t/s | TurboQuant t/s | Gain | +|---|---|---|---| +| SmolLM3-3B | ~49.9 | 53.1 | +6.2% | +| Gemma4-E2B | ~55.7 | 61.7 | +10.7% | +| Gemma4-E4B | ~27.0 | 30.0 | +11.4% | +| Qwen3-4B | ~36.7 | 38.8 | +5.7% | + +**Caution**: On Ampere/Ada/Hopper (RTX 3000+/4000+), tensor cores are faster. `FORCE_MMQ` would *hurt* on those cards. This image is SM75-only. + +--- + +## 2. KV Quantization — turbo2 is the sweet spot for most models + +**Finding**: The TurboQuant fork adds 2/3/4-bit KV quantization ("turbo2/3/4") beyond llama.cpp's built-in q8_0/q4_0. turbo2 at 2 bits is roughly half the size of q4_0, with acceptable perplexity loss. + +**Perplexity delta vs f16 baseline** (quality gate: Δ < 0.5): + +| KV type | SmolLM3-3B | Gemma4-E2B | Gemma4-E4B | Qwen3-4B | +|---|---|---|---|---| +| q8_0 | ✓ | ✓ | ✓ | ✓ | +| q4_0 | ✓ | ✓ | ✓ | ✓ | +| turbo2 | ✓ | ✓ | ✓ | **✗ BROKEN** | +| turbo3 | ✓ | ✓ | ✓ | **✗ BROKEN** | +| turbo4 | ✓ | ✓ | ✓ | **✗ BROKEN** | + +### ⚠️ Critical: turbo2/3/4 breaks Qwen3-4B + +Qwen3-4B uses GQA with 8 KV heads (32 query heads), about 40 KB/token at q4_0. At ctx ≥ 8192, turbo KV quantization causes catastrophic PPL degradation: + +``` +ctx=4096 turbo2: PPL=1.79 (baseline 1.76, Δ=0.03 ✓) +ctx=8192 turbo2: PPL=4.2 (Δ=2.4 ✗) +ctx=16384 turbo2: PPL=15.4 (Δ=13.7 ✗) +ctx=32768 turbo2: PPL=438 (broken) +``` + +**Never use turbo2/3/4 for Qwen3-4B.** Use q4_0. + +--- + +## 3. MQA Architecture — Gemma4 E2B/E4B KV is tiny + +**Finding**: Gemma4's hybrid attention uses Multi-Query Attention (MQA) for most layers: a single KV head is shared by all query heads instead of the several KV heads that GQA keeps. This results in a dramatically smaller KV cache: + +| Model | KV bytes/token (q4_0) | Architecture | +|---|---|---| +| SmolLM3-3B | ~19.8 KB | GQA | +| Qwen3-4B | ~39.6 KB | GQA (8 KV heads) | +| Gemma4-E4B | ~4.5 KB | MQA-like (42 layers) | +| Gemma4-E2B | ~1.7 KB | MQA (35 layers) | + +**Implication**: E2B can hold 393K tokens in KV cache with only 651 MiB RAM (q4_0). E4B can hold 163K tokens with 346 MiB RAM (turbo2). 
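The relationship between per-token KV cost, context size, and RAM footprint is plain arithmetic. The sketch below reproduces it with two illustrative helpers. The function names are hypothetical and do not come from the repository scripts, and it assumes the "KB" in the table means KiB (1024 bytes), which is what makes the table agree with the measured 651 MiB.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not from the repo scripts).
# Assumption: "KB" in the KV table means KiB (1024 bytes).

# Per-token KV cost in KiB from a measured allocation: KV_MiB / ctx.
kv_kib_per_tok() {   # args: <kv_mib_allocated> <ctx_size>
  awk -v mib="$1" -v ctx="$2" 'BEGIN { printf "%.2f\n", mib * 1024 / ctx }'
}

# Projected KV allocation in MiB for a per-token cost and a context size.
kv_mib() {           # args: <kib_per_token> <ctx_size>
  awk -v kib="$1" -v ctx="$2" 'BEGIN { printf "%.0f\n", kib * ctx / 1024 }'
}

kv_kib_per_tok 651 393216   # Gemma4-E2B, q4_0
kv_kib_per_tok 346 163840   # Gemma4-E4B, turbo2
kv_mib 1.7 393216           # E2B projection from the rounded table value
```

The first call gives about 1.70 KiB/token for E2B, in line with the table. The E4B call gives about 2.16 KiB/token, roughly half the q4_0 figure of ~4.5 KB, because the 346 MiB allocation uses turbo2.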
+ +### ⚠️ turbo2 is *worse* for E2B (MQA padding artifact) + +turbo2 uses block quantization. For MQA models with tiny KV tensors, the per-block header/padding overhead is proportionally larger than the savings. At E2B: + +``` +ctx=32768 q4_0: 57 MiB KV turbo2: 68 MiB KV (+19% worse!) +``` + +**Do not use turbo2 for Gemma4-E2B bigctx.** Use q4_0. + +--- + +## 4. -nkvo (KV in RAM) — Massive Context Gain at PCIe Cost + +**Finding**: `--no-kv-offload` moves the KV cache from VRAM to host RAM. VRAM is then entirely free for model weights and compute. The tradeoff is token generation speed — each token generation requires reading the full KV cache over PCIe x4. + +**Bandwidth model**: `t/s = 1000 / (gpu_ms_empty + ctx × kv_bytes_per_tok / pcie_bw_bps × 1000)` + +PCIe x4 Gen3 ≈ **8 GB/s** practical (measured from BW model fit to actual results). + +### Context gains with -nkvo (v4, TurboQuant): + +| Model | Pure-GPU ctx | -nkvo q4_0 rec | -nkvo turbo2 rec | KV type used | +|---|---|---|---|---| +| SmolLM3-3B | 24576 | 32768 | **65536** | turbo2 | +| Gemma4-E2B | 24576 | **393216** | 393216 | q4_0 (turbo2 worse!) | +| Gemma4-E4B | 24576 | 98304 | **163840** | turbo2 | +| Qwen3-4B | 16384 | **24576** | BROKEN | q4_0 | + +Recommendation threshold: ≥ 15 t/s at 50% context fill. + +### ⚠️ Qwen3.5-9B cannot use -nkvo + +Qwen3.5-9B (Q8_0, 8.86 GB) with ngl=11 fills nearly all 15 GiB RAM with model weights + system overhead. At any tested context size, `-nkvo` OOMs. The existing server config at ctx=32768 with turbo2 KV in VRAM is the only viable option. + +--- + +## 5. Qwen3.5-9B — RAM-bound, llama-perplexity incompatible + +**Finding**: This model has a hybrid architecture: 8 full-attention layers + 24 linear-attention layers. The linear-attention layers cause `llama-perplexity` to fail (not OOM — the evaluation tool simply can't handle the architecture). The server works correctly. 
+ +**Performance ceiling**: Theoretical max t/s = RAM_BW / model_size = 45 GB/s ÷ 8.86 GB = **5.1 t/s**. Achieved: 4.38 t/s = 86% efficiency. This is purely RAM-bandwidth-limited. + +**Thread optimization** (i7-10750H, 6 physical / 12 logical): +- Optimal: `THREADS=6` (one per physical core) +- HT hurts: t=8 → 4.22 t/s (worse than t=6 → 4.38 t/s) + +--- + +## 6. Gemma4-E4B — all layers fit when VRAM is free + +**Surprise**: E4B's Q4_K_M file is 4.7 GB — larger than the 3.7 GB VRAM. However, model weight loading is paged; at ngl=42, ALL 42 layers fit in VRAM during inference because llama.cpp holds only the needed tensors. The "file size > VRAM" heuristic is wrong for split configs. + +ngl sweep result: +``` +ngl=28 → 59 pp / 16.5 tg t/s +ngl=35 → 101 pp / 24.6 tg t/s +ngl=42 → 133 pp / 32.0 tg t/s ← all layers, much faster +``` + +**Caution**: ngl=42 fails if another container is holding VRAM. Always stop other services before starting E4B. + +--- + +## 7. Flash Attention (+2–3% pp, required for bigctx) + +`--flash-attn on` is required for `-nkvo` bigctx profiles (prefill OOM otherwise at large ctx). For standard pure-GPU profiles it gives a small speed boost (~2–3% pp, ~1% tg). Always enable it. + +--- + +## 8. Benchmarking pitfalls + +### False OOM from prefill timeout +Early test scripts ran `llama-perplexity` on a full wiki dataset. At large contexts, prefill takes >600s and the script misread the timeout as OOM. Fix: use a 64-line "tiny" file for alloc checks — the model allocates the full KV cache at startup, then exits after trivial compute (< 15s). + +### kv/tok measurement anomalies +The `kv_bytes_per_tok` column in cpu_ctx_test.sh is computed as `kv_mib / ctx`. At small ctx, block padding dominates and the value appears higher. The true per-token cost stabilizes at larger ctx. Use ctx ≥ 32768 values for BW model calibration. 
+ +--- + +## Summary: Recommended configurations + +| Model | Profile | KV type | CTX | t/s@base | Notes | +|---|---|---|---|---|---| +| SmolLM3-3B | pure-GPU | q8_0 | 24576 | ~53 | max VRAM ctx | +| SmolLM3-3B | bigctx | turbo2 | 65536 | ~15@50% | 714 MiB RAM | +| Gemma4-E2B | pure-GPU | f16 | 24576 | ~62 | MQA = tiny KV | +| Gemma4-E2B | bigctx | q4_0 | 393216 | ~17@50% | 651 MiB RAM, turbo2 worse | +| Gemma4-E4B | pure-GPU | q4_0 | 24576 | ~30 | ngl=42 all layers | +| Gemma4-E4B | bigctx | turbo2 | 163840 | ~18@50% | 346 MiB RAM | +| Qwen3-4B | pure-GPU | q4_0 | 16384 | ~39 | NO turbo KV ever | +| Qwen3-4B | bigctx | q4_0 | 24576 | ~11@50% | turbo2 broken | +| Qwen3.5-9B | pure-GPU | turbo2 | 32768 | ~4.4 | RAM-bound, no bigctx | diff --git a/envs/.env.gemma4-e2b b/envs/.env.gemma4-e2b new file mode 100644 index 0000000..e54fc01 --- /dev/null +++ b/envs/.env.gemma4-e2b @@ -0,0 +1,43 @@ +# ============================================================================== +# Gemma 4 E2B-it Q4_K_M — Google DeepMind (April 2025) +# Architecture: Dense transformer + Per-Layer Embeddings (PLE) +# - 2.3B effective params (5.1B total with PLE embedding tables) +# - 35 layers, hybrid local (512-token window) + global attention +# - 128K context window +# Model size: ~2.9 GB Q4_K_M | Full GPU fit (ngl=99, VRAM ~3.4 GB total) +# Modalities: text + image + audio (ASR/translation) + video frames +# +# Download: +# huggingface-cli download bartowski/google_gemma-4-E2B-it-GGUF \ +# google_gemma-4-E2B-it-Q4_K_M.gguf --local-dir ./models/ +# +# NOTE: Verify the exact filename after download — bartowski naming may vary. +# Check: ls models/google_gemma* +# ============================================================================== + +MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf + +# All 35 layers fit in VRAM. PLE layers are small compute, large embedding lookup. 
+N_GPU_LAYERS=99 + +# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB): +# Hybrid sliding-window attention (512-token) keeps KV tiny → 32K ctx fits! +# 65K/131K OOM (full global-attn layers eat VRAM at large ctx). +# Baseline: 350 pp / 64.6 tg t/s | At 32K ctx: 365 pp / 66.8 tg t/s (fa=1) +CTX_SIZE=24576 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +# f16 KV — model small, KV overhead negligible even at 32K +CACHE_TYPE_K=f16 +CACHE_TYPE_V=f16 + +# 2 parallel slots — fast model (66 tg t/s), VRAM headroom available +PARALLEL=2 + +# fa=1 confirmed working on hybrid Gemma4 attention (+5% vs fa=0) +EXTRA_ARGS=--flash-attn on --mmap diff --git a/envs/.env.gemma4-e2b-bigctx b/envs/.env.gemma4-e2b-bigctx new file mode 100644 index 0000000..5d7eb86 --- /dev/null +++ b/envs/.env.gemma4-e2b-bigctx @@ -0,0 +1,26 @@ +# ============================================================================== +# Gemma 4 E2B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo) +# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): q4_0 rec ctx=393216 +# +368640 tokens vs pure-GPU 24576. MQA arch = only 1.7 KB KV/tok (tiny!). +# Speed at ctx=393216: baseline 61.7 t/s, est. 17.0@50% / 26.6@25% (PCIe BW). +# RAM at 393216: 651 MiB KV. q4_0 used (turbo2 paradoxically larger for MQA). +# Use this profile when you need >24K context; otherwise use gemma4-e2b. 
+# ============================================================================== + +MODEL_FILE=google_gemma-4-E2B-it-Q4_K_M.gguf + +N_GPU_LAYERS=99 +CTX_SIZE=393216 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +CACHE_TYPE_K=q4_0 +CACHE_TYPE_V=q4_0 + +PARALLEL=1 + +EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload diff --git a/envs/.env.gemma4-e4b b/envs/.env.gemma4-e4b new file mode 100644 index 0000000..9c7f8c8 --- /dev/null +++ b/envs/.env.gemma4-e4b @@ -0,0 +1,43 @@ +# ============================================================================== +# Gemma 4 E4B-it Q4_K_M — Google DeepMind (April 2025) +# Architecture: Dense transformer + Per-Layer Embeddings (PLE) +# - 4.5B effective params (8B total with PLE embedding tables) +# - 42 layers, hybrid local (512-token window) + global attention +# - 128K context window +# Model size: ~4.7 GB Q4_K_M | CPU-split needed (exceeds 3.7 GB VRAM) +# Modalities: text + image + audio (ASR/translation) + video frames +# +# Download: +# huggingface-cli download bartowski/google_gemma-4-E4B-it-GGUF \ +# google_gemma-4-E4B-it-Q4_K_M.gguf --local-dir ./models/ +# +# NOTE: Verify the exact filename after download — bartowski naming may vary. +# Check: ls models/google_gemma* +# ============================================================================== + +MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf + +# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB): +# ALL 42 layers fit on GPU when no other containers hold VRAM! +# ngl sweep: ngl=42 → 133 pp / 32.0 tg t/s (ngl=28 was only 59/16.5) +# Max ctx=24576 (hybrid attention, 32K OOM). fa=1 works (+3% vs fa=0). 
+# Thread sweep: t=4-6 optimal (GPU-only now, CPU largely idle for tg) +N_GPU_LAYERS=42 + +# 24K max — hybrid sliding-window keeps most layers' KV tiny +# 32K OOM due to global-attn layers hitting VRAM wall +CTX_SIZE=24576 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=128 + +CACHE_TYPE_K=q4_0 +CACHE_TYPE_V=q4_0 + +PARALLEL=1 + +# fa=1 confirmed working on hybrid Gemma4 attention +EXTRA_ARGS=--flash-attn on --mmap diff --git a/envs/.env.gemma4-e4b-bigctx b/envs/.env.gemma4-e4b-bigctx new file mode 100644 index 0000000..b64c904 --- /dev/null +++ b/envs/.env.gemma4-e4b-bigctx @@ -0,0 +1,26 @@ +# ============================================================================== +# Gemma 4 E4B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo) +# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=163840 +# +139264 tokens vs pure-GPU 24576. turbo2 KV = 2.1 KB/tok vs q4_0 4.5 KB/tok. +# Speed at ctx=163840: baseline 30.0 t/s, est. 17.8@50% / 22.4@25% (PCIe BW). +# RAM at 163840: 346 MiB KV. ngl=42 (all layers on GPU). +# Use this profile when you need >24K context; otherwise use gemma4-e4b. 
+# ============================================================================== + +MODEL_FILE=google_gemma-4-E4B-it-Q4_K_M.gguf + +N_GPU_LAYERS=42 +CTX_SIZE=163840 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=128 + +CACHE_TYPE_K=turbo2 +CACHE_TYPE_V=turbo2 + +PARALLEL=1 + +EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload diff --git a/envs/.env.qwen3-4b b/envs/.env.qwen3-4b new file mode 100644 index 0000000..d929152 --- /dev/null +++ b/envs/.env.qwen3-4b @@ -0,0 +1,42 @@ +# ============================================================================== +# Qwen3-4B-Instruct Q4_K_M — Alibaba (May 2025) +# Architecture: Decoder-only transformer, GQA +# - 4B params, 32 layers +# - 32K native context (128K with YaRN) +# Model size: ~2.4 GB Q4_K_M | Full GPU fit (ngl=99) +# Features: thinking mode (/think /no_think), tool calling, 119 languages, +# Apache 2.0. Strong code + reasoning. Best ecosystem (most fine-tunes). +# +# Download: +# huggingface-cli download bartowski/Qwen3-4B-GGUF \ +# Qwen3-4B-Q4_K_M.gguf --local-dir ./models/ +# +# NOTE: Verify exact filename after download: +# ls models/Qwen3-4B* +# ============================================================================== + +MODEL_FILE=Qwen3-4B-Q4_K_M.gguf + +# All layers fit — ~2.4 GB leaves ~1.3 GB free for KV + compute +N_GPU_LAYERS=99 + +# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB): +# Max ctx=8192 (12K OOM). Full attention — all KV must fit at full ctx. +# GGUF native limit=40960, but VRAM walls at ~8K. +# Baseline: 181 pp / 41.6 tg t/s. At 8K ctx fa=1: 191 pp / 44.3 tg t/s (+6%). 
+CTX_SIZE=16384 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +CACHE_TYPE_K=q4_0 +CACHE_TYPE_V=q4_0 + +# 1 parallel slot — limited VRAM at 8K ctx with 2.4GB model +PARALLEL=1 + +# fa=1 gives +6% tg speed on full-attention Qwen3 +EXTRA_ARGS=--flash-attn on --mmap diff --git a/envs/.env.qwen3-4b-bigctx b/envs/.env.qwen3-4b-bigctx new file mode 100644 index 0000000..6018085 --- /dev/null +++ b/envs/.env.qwen3-4b-bigctx @@ -0,0 +1,24 @@ +# ============================================================================== +# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo) +# Benchmarked 2026-05-06: -nkvo max ctx=24576 (+8K vs pure-GPU 16384) +# Baseline TG: ~39 t/s (empty KV). +# Use this profile when you need >16K context; otherwise use qwen3-4b. +# ============================================================================== + +MODEL_FILE=Qwen3-4B-Q4_K_M.gguf + +N_GPU_LAYERS=99 +CTX_SIZE=24576 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +CACHE_TYPE_K=q4_0 +CACHE_TYPE_V=q4_0 + +PARALLEL=1 + +EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload diff --git a/envs/.env.qwen35-9b b/envs/.env.qwen35-9b new file mode 100644 index 0000000..bfd8a38 --- /dev/null +++ b/envs/.env.qwen35-9b @@ -0,0 +1,41 @@ +# ============================================================================== +# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75 +# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads +# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU) +# RAM usage: ~5.5 GB (21 layers pinned via mlock) +# +# Benchmark results (turbo2 KV, ngl=11, fa=1): +# t=1→0.86 t=2→1.62 t=3→2.25 t=4→2.94 t=5→3.56 t=6→4.38 ← best +# t=8→4.22 t=12→3.61 (hyperthreading hurts above 6) +# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model) +# Achieved: 4.38 t/s = 86% efficiency +# +# Download: +# huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \ +# 
Qwen3.5-9B.Q8_0.gguf --local-dir ./models/ +# ============================================================================== + +MODEL_FILE=Qwen3.5-9B.Q8_0.gguf + +# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048. +N_GPU_LAYERS=11 + +# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16) +CTX_SIZE=32768 + +# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts. +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=128 + +# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image. +CACHE_TYPE_K=turbo2 +CACHE_TYPE_V=turbo2 + +PARALLEL=1 + +# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads) +# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit) +EXTRA_ARGS=--flash-attn on --no-mmap --mlock diff --git a/envs/.env.smollm3-3b b/envs/.env.smollm3-3b new file mode 100644 index 0000000..4295827 --- /dev/null +++ b/envs/.env.smollm3-3b @@ -0,0 +1,42 @@ +# ============================================================================== +# SmolLM3 3B-it Q4_K_M — HuggingFace (2025) +# Architecture: Decoder-only transformer, GQA + NoPE (3:1 ratio) +# - 3B params, 11.2T training tokens +# - 64K native context (128K with YaRN) +# Model size: ~1.9 GB Q4_K_M | Full GPU fit (ngl=99) +# Features: thinking mode (/think /no_think), tool calling, 6 languages, +# Apache 2.0. AIME 2025: 36.7% in think mode. +# +# Download: +# huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF \ +# HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf --local-dir ./models/ +# +# NOTE: Verify exact filename after download: +# ls models/SmolLM3* models/HuggingFaceTB_SmolLM3* +# ============================================================================== + +MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf + +# All layers fit comfortably — ~1.9 GB leaves ~1.8 GB free for KV + compute +N_GPU_LAYERS=99 + +# Benchmarked 2026-05-05 on GTX 1650 Ti (3717 MiB): +# Max ctx=24576 (32K OOM). 
Baseline: 249 pp / 56.8 tg t/s. +# At 24K ctx with fa=1: 260 pp / 58.3 tg t/s (+2%). +# Model context limit = 65536, VRAM is the constraint here. +CTX_SIZE=24576 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +CACHE_TYPE_K=q8_0 +CACHE_TYPE_V=q8_0 + +# 2 parallel slots — less headroom at 24K ctx vs original 16K estimate +PARALLEL=2 + +# fa=1 gives small but consistent improvement (+2 tg t/s) +EXTRA_ARGS=--flash-attn on --mmap diff --git a/envs/.env.smollm3-3b-bigctx b/envs/.env.smollm3-3b-bigctx new file mode 100644 index 0000000..666faf8 --- /dev/null +++ b/envs/.env.smollm3-3b-bigctx @@ -0,0 +1,26 @@ +# ============================================================================== +# SmolLM3 3B-it Q4_K_M — bigctx variant (KV in RAM via -nkvo) +# Benchmarked 2026-05-06 v4 (TurboQuant FORCE_MMQ): turbo2 rec ctx=65536 +# +40960 tokens vs pure-GPU 24576. turbo2 KV = 10.9 KB/tok vs q4_0 19.8 KB/tok. +# Speed at ctx=65536: baseline 53.1 t/s, est. 15.2@50% / 23.7@25% (PCIe BW). +# RAM at 65536: 714 MiB KV. turbo2 passes PPL quality gate at all tested ctx. +# Use this profile when you need >24K context; otherwise use smollm3-3b. +# ============================================================================== + +MODEL_FILE=HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf + +N_GPU_LAYERS=99 +CTX_SIZE=65536 + +THREADS=6 +THREADS_BATCH=6 + +BATCH_SIZE=512 +UBATCH_SIZE=256 + +CACHE_TYPE_K=turbo2 +CACHE_TYPE_V=turbo2 + +PARALLEL=1 + +EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload diff --git a/scripts/benchmark.sh b/scripts/benchmark.sh new file mode 100755 index 0000000..629d0c9 --- /dev/null +++ b/scripts/benchmark.sh @@ -0,0 +1,335 @@ +#!/usr/bin/env bash +# ============================================================================= +# llama.cpp Automated Benchmark — Qwen3.5-9B on GTX 1650 Ti (4 GB VRAM) +# +# Runs for BOTH official llama.cpp and TurboQuant fork. 
+# VARIANT env var selects which KV type set to sweep: +# VARIANT=official → f16 q8_0 q5_0 q4_0 iq4_nl +# VARIANT=turboquant → f16 q8_0 iq4_nl turbo4 turbo3 turbo2 +# +# Output: CSV + recommended .env per variant, plus a final comparison table. +# +# Run: +# docker compose --profile benchmark run --rm benchmark (official) +# docker compose --profile benchmark run --rm benchmark-turbo (turboquant) +# ============================================================================= + +set -euo pipefail + +# Ensure llama-bench is findable in both official (/usr/local/bin) and TurboQuant (/app) images +export PATH="/app:/usr/local/bin:/usr/bin:/bin:${PATH:-}" + +MODEL="${MODEL:-${1:-/models/Qwen3.5-9B.Q8_0.gguf}}" +OUTPUT_DIR="${OUTPUT_DIR:-${2:-/results}}" +VARIANT="${VARIANT:-official}" # official | turboquant +TIMESTAMP=$(date +%Y%m%d_%H%M%S) +RESULTS_CSV="${OUTPUT_DIR}/${VARIANT}_results_${TIMESTAMP}.csv" +LOG="${OUTPUT_DIR}/${VARIANT}_benchmark_${TIMESTAMP}.log" + +# ── Baseline config ──────────────────────────────────────────────────────── +THREADS=6 +THREADS_BATCH=12 +BATCH_SIZE=2048 +UBATCH_SIZE=512 +PROMPT_TOKENS=512 +GEN_TOKENS=32 +REPETITIONS=1 + +# ── KV type sets per variant ─────────────────────────────────────────────── +# turbo2=2-bit (6.4× vs f16), turbo3=3-bit, turbo4=4-bit — TurboQuant only +if [[ "$VARIANT" == "turboquant" ]]; then + KV_TYPES=(f16 q8_0 iq4_nl turbo4 turbo3 turbo2) +else + # Official llama.cpp: all standard quant types + # iq4_nl = i-quant non-linear: best quality at 4-bit (non-uniform scale) + KV_TYPES=(f16 q8_0 q5_0 q4_0 iq4_nl) +fi + +# ── GPU layer sweep (Q8_0 ~297 MB/layer, 3717 MiB VRAM → max ~12 layers) ── +NGL_VALUES=(6 9 12 13 14 99) + +# ── Context sweep: use -p to stress KV cache at given size ───────────────── +CTX_VALUES=(128 512 1024 2048 4096 8192) + +# ── Batch sweep ──────────────────────────────────────────────────────────── +BATCH_VALUES=(512 1024 2048 4096) + +mkdir -p "$OUTPUT_DIR" + +log() { echo "$*" | tee -a 
"$LOG"; } +sep() { log "$(printf '─%.0s' {1..70})"; } +hdr() { sep; log " $*"; sep; } + +log "llama.cpp Benchmark [${VARIANT}] — $(date)" +log "Model: $MODEL" +log "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo 'CPU only')" +log "KV set: ${KV_TYPES[*]}" +sep + +echo "variant,phase,ngl,ctx,kv_type_k,kv_type_v,flash_attn,batch_size,ubatch_size,threads,pp_tokens_per_sec,tg_tokens_per_sec,status" \ + > "$RESULTS_CSV" + +# ── Helper: run llama-bench ──────────────────────────────────────────────── +LAST_PP=0 +LAST_TG=0 + +run_bench() { + local ngl=$1 ctx=$2 kv=$3 fa=$4 batch=$5 ubatch=$6 phase="${7:-test}" + local raw + + # New llama-bench API (b9014+): -c and -tb removed; -p sets prompt/ctx size + # CSV columns: ...n_prompt(34),n_gen(35),n_depth(36),test_time(37), + # avg_ns(38),stddev_ns(39),avg_ts(40),stddev_ts(41) + # pp row: n_gen==0; tg row: n_prompt==0 + raw=$(timeout 300 /app/llama-bench \ + -m "$MODEL" \ + -ngl "$ngl" \ + -p "$ctx" \ + -n "$GEN_TOKENS" \ + -t "$THREADS" \ + -b "$batch" \ + -ub "$ubatch" \ + -ctk "$kv" \ + -ctv "$kv" \ + -fa "$fa" \ + -r "$REPETITIONS" \ + -o csv 2>&1) || return 1 + + # Strip quotes from CSV, then extract avg_ts (col 40) by pp/tg row type + LAST_PP=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $35=="0" && $40+0>0 {print $40+0; exit}') + LAST_TG=$(printf '%s\n' "$raw" | sed 's/"//g' | awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}') + LAST_PP="${LAST_PP:-0}" + LAST_TG="${LAST_TG:-0}" + + echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},${LAST_PP},${LAST_TG},ok" \ + >> "$RESULTS_CSV" + return 0 +} + +fail_row() { + local phase=$1 ngl=$2 ctx=$3 kv=$4 fa=$5 batch=$6 ubatch=$7 + echo "${VARIANT},${phase},${ngl},${ctx},${kv},${kv},${fa},${batch},${ubatch},${THREADS},0,0,failed" \ + >> "$RESULTS_CSV" +} + +# ── Phase 1: GPU layer sweep ─────────────────────────────────────────────── +hdr "PHASE 1 — GPU layer sweep 
(prompt=128 kv=f16 fa=0)" +# Use f16 KV: prebuilt official image lacks SM75 CUDA kernels for quantized KV. +# We isolate the NGL variable here; KV type is swept in Phase 3. +MAX_STABLE_NGL=0 +for ngl in "${NGL_VALUES[@]}"; do + printf " ngl=%-3s " "$ngl" | tee -a "$LOG" + if run_bench "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph1_ngl"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + MAX_STABLE_NGL="$ngl" + else + log "FAILED (OOM/timeout)" + fail_row ph1_ngl "$ngl" 128 f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" + break + fi +done +log " → Best ngl: ${MAX_STABLE_NGL}" + +# ── Phase 2: Context sweep ───────────────────────────────────────────────── +hdr "PHASE 2 — Context/prompt sweep (ngl=${MAX_STABLE_NGL} kv=f16 fa=0)" + +MAX_STABLE_CTX=128 +for ctx in "${CTX_VALUES[@]}"; do + printf " ctx=%-6s " "$ctx" | tee -a "$LOG" + if run_bench "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph2_ctx"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + MAX_STABLE_CTX="$ctx" + else + log "FAILED (OOM/timeout)" + fail_row ph2_ctx "$MAX_STABLE_NGL" "$ctx" f16 0 "$BATCH_SIZE" "$UBATCH_SIZE" + break + fi +done +log " → Best ctx: ${MAX_STABLE_CTX}" + +# ── Phase 3: KV cache type sweep ─────────────────────────────────────────── +hdr "PHASE 3 — KV type sweep (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} fa=0)" +log " [${VARIANT}] KV types: ${KV_TYPES[*]}" +log " Note: Qwen3.5-9B has only 8/32 full-attention layers + GQA (4 KV heads)" +log " Linear-attention layers need no KV cache at all → quant errors minimal" +if [[ "$VARIANT" == "turboquant" ]]; then + log " turbo2=2-bit (6.4× compression), turbo3=3-bit, turbo4=4-bit" +fi + +BEST_KV="q8_0" +BEST_TG_KV=0 + +for kv in "${KV_TYPES[@]}"; do + printf " kv=%-8s " "$kv" | tee -a "$LOG" + if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE" "ph3_kv"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1) + if awk
"BEGIN{exit !(${tg_n:-0} > ${BEST_TG_KV:-0})}"; then + BEST_TG_KV="${tg_n:-0}" + BEST_KV="$kv" + fi + else + log "FAILED" + fail_row ph3_kv "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$kv" 0 "$BATCH_SIZE" "$UBATCH_SIZE" + fi +done +log " → Best KV: ${BEST_KV} (tg=${BEST_TG_KV} t/s)" + +# ── Phase 4: Flash attention ─────────────────────────────────────────────── +hdr "PHASE 4 — Flash attention (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} kv=${BEST_KV})" +log " GTX 1650 Ti = CC 7.5 (Turing) — FA2 requires SM80+ but FA1 works on CC>=7.5" + +BEST_FA=1 +BEST_TG_FA=0 + +for fa in 1 0; do + fa_label=$([ "$fa" -eq 1 ] && echo "on " || echo "off") + printf " fa=%-3s " "$fa_label" | tee -a "$LOG" + if run_bench "$MAX_STABLE_NGL" "$MAX_STABLE_CTX" "$BEST_KV" "$fa" "$BATCH_SIZE" "$UBATCH_SIZE" "ph4_fa"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + tg_n=$(printf '%s' "$LAST_TG" | grep -oP '[0-9]+\.?[0-9]*' | head -1) + if awk "BEGIN{exit !(${tg_n:-0} > ${BEST_TG_FA:-0})}"; then + BEST_TG_FA="${tg_n:-0}" + BEST_FA="$fa" + fi + else + log "FAILED" + fi +done +log " → Best FA: ${BEST_FA} (tg=${BEST_TG_FA} t/s)" + +# ── Phase 5: Batch sweep ─────────────────────────────────────────────────── +hdr "PHASE 5 — Batch sweep (ngl=${MAX_STABLE_NGL} ctx=${MAX_STABLE_CTX} kv=${BEST_KV} fa=${BEST_FA})" +# Use small fixed prompt (64) to isolate batch-buffer allocation overhead from prompt size. +# Larger batch = larger CUDA activation buffers; tests whether they fit in remaining VRAM. +BEST_BATCH="$BATCH_SIZE" +BEST_PP_BATCH=0 +FIXED_P=64 + +for batch in "${BATCH_VALUES[@]}"; do + ubatch=$(( batch / 4 < 64 ? 
64 : batch / 4 )) + printf " batch=%-5s ubatch=%-4s " "$batch" "$ubatch" | tee -a "$LOG" + if run_bench "$MAX_STABLE_NGL" "$FIXED_P" "$BEST_KV" "$BEST_FA" "$batch" "$ubatch" "ph5_batch"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + pp_n=$(printf '%s' "$LAST_PP" | grep -oP '[0-9]+\.?[0-9]*' | head -1) + if awk "BEGIN{exit !(${pp_n:-0} > ${BEST_PP_BATCH:-0})}"; then + BEST_PP_BATCH="${pp_n:-0}" + BEST_BATCH="$batch" + fi + else + log "FAILED" + fi +done +BEST_UBATCH=$(( BEST_BATCH / 4 < 64 ? 64 : BEST_BATCH / 4 )) +log " → Best batch: ${BEST_BATCH} ubatch: ${BEST_UBATCH} (pp=${BEST_PP_BATCH} t/s)" + +# ── Phase 6 (TurboQuant only): max context with turbo2 KV ────────────────── +if [[ "$VARIANT" == "turboquant" ]]; then + hdr "PHASE 6 — TurboQuant: extended context with turbo2 KV (ngl=${MAX_STABLE_NGL} fa=${BEST_FA})" + log " turbo2 = 2-bit KV (6.4× smaller than f16) → enables much larger ctx in same VRAM" + + TURBO_CTX_VALUES=(512 1024 2048 4096 8192 16384 32768) + MAX_TURBO_CTX="128" + TURBO_KV="turbo2" + + for ctx in "${TURBO_CTX_VALUES[@]}"; do + printf " ctx=%-7s " "$ctx" | tee -a "$LOG" + if run_bench "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH" "ph6_turbo_ctx"; then + log "OK pp=${LAST_PP} t/s tg=${LAST_TG} t/s" + MAX_TURBO_CTX="$ctx" + else + log "FAILED (OOM/timeout)" + fail_row ph6_turbo_ctx "$MAX_STABLE_NGL" "$ctx" "$TURBO_KV" "$BEST_FA" "$BEST_BATCH" "$BEST_UBATCH" + break + fi + done + log " → Max context with turbo2: ${MAX_TURBO_CTX}" + # Use the larger turbo ctx for the recommended .env + MAX_STABLE_CTX="$MAX_TURBO_CTX" + BEST_KV="$TURBO_KV" +fi + +# ── Summary ──────────────────────────────────────────────────────────────── +sep +log "BENCHMARK COMPLETE [${VARIANT}] — $(date)" +sep +log "" +log " Optimal params for GTX 1650 Ti + Qwen3.5-9B Q8_0 [${VARIANT}]:" +log "" +log " ngl : ${MAX_STABLE_NGL}" +log " ctx_size : ${MAX_STABLE_CTX}" +log " kv_type : ${BEST_KV}" +log " flash_attn : ${BEST_FA}" +log "
batch_size : ${BEST_BATCH}" +log " ubatch : ${BEST_UBATCH}" +log "" +log " Full CSV: ${RESULTS_CSV}" +log "" + +# Write recommended .env +ENV_OUT="${OUTPUT_DIR}/${VARIANT}_recommended.env" +cat > "$ENV_OUT" </dev/null | sort | tail -1 || true) +TURBO_CSV=$(ls "${OUTPUT_DIR}"/turboquant_results_*.csv 2>/dev/null | sort | tail -1 || true) + +if [[ -n "$OFFICIAL_CSV" && -n "$TURBO_CSV" ]]; then + COMPARE_OUT="${OUTPUT_DIR}/comparison_$(date +%Y%m%d_%H%M%S).txt" + { + echo "======================================================================" + echo " OFFICIAL vs TURBOQUANT COMPARISON" + echo "======================================================================" + echo "" + echo "Official CSV: $OFFICIAL_CSV" + echo "TurboQuant CSV: $TURBO_CSV" + echo "" + echo "KV type benchmark results (phase ph3_kv):" + echo "" + printf "%-12s %-10s %-10s %-12s %-12s\n" "variant" "kv_type" "ctx" "pp (t/s)" "tg (t/s)" + echo "----------------------------------------------------------------------" + for csv in "$OFFICIAL_CSV" "$TURBO_CSV"; do + awk -F',' ' + NR>1 && $2 == "ph3_kv" { + printf "%-12s %-10s %-10s %-12s %-12s\n", $1, $5, $4, $11, $12 + } + ' "$csv" + done + echo "" + echo "Winner by tg (generation speed):" + awk -F',' ' + NR>1 && $2 == "ph3_kv" && $13 == "ok" { + key = $1 "," $5 + val = $12+0 + if (val > best[key]) { best[key] = val; row[key] = $0 } + } + END { + best_tg = 0; best_key = "" + for (k in best) { if (best[k] > best_tg) { best_tg = best[k]; best_key = k } } + n = split(best_key, a, ",") + printf " %s with kv=%s → %.1f t/s\n", a[1], a[2], best_tg + } + ' "$OFFICIAL_CSV" "$TURBO_CSV" + echo "======================================================================" + } | tee "$COMPARE_OUT" | tee -a "$LOG" + echo "" + echo "Comparison report: $COMPARE_OUT" +fi + +sep +echo "" +echo "=== RECOMMENDED .env [${VARIANT}] ===" +cat "$ENV_OUT" + diff --git a/scripts/benchmark_models.sh b/scripts/benchmark_models.sh new file mode 100644 index 0000000..64c890d --- 
/dev/null +++ b/scripts/benchmark_models.sh @@ -0,0 +1,175 @@ +#!/bin/bash +# Benchmark all 4 new models on GTX 1650 Ti (3717 MiB VRAM) +# Priority: max context size > tg speed +# Runs inside ghcr.io/ggml-org/llama.cpp:full-cuda (build b9014, no -c flag) +# +# Architecture context limits (from GGUF metadata): +# SmolLM3-3B : 65536 (full attention, KV-limited to ~28K in practice) +# Gemma4-E2B : 131072 (hybrid: sliding_window=512 → huge ctx possible) +# Gemma4-E4B : 131072 (hybrid: sliding_window=512) +# Qwen3-4B : 40960 (full attention, KV-limited to ~9K in practice) +# +# NOTE: llama-bench b9014 has NO -c flag. Context is set by -p (prompt tokens). +# -p N -n G allocates KV for N+G tokens. OOM = exit!=0 or error in stdout. + +set -uo pipefail + +M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" +M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf" +M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf" +M_Q3="/models/Qwen3-4B-Q4_K_M.gguf" + +# -- CSV column detection (called once on first successful output) -- +TS_COL=0; NG_COL=0; NP_COL=0 + +detect_cols() { + local hdr + hdr=$(printf '%s\n' "$1" | sed 's/"//g' | grep '^build_commit' | head -1) + TS_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="avg_ts"){print i;exit}}') + NG_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_gen"){print i;exit}}') + NP_COL=$(printf '%s\n' "$hdr" | awk -F',' '{for(i=1;i<=NF;i++) if($i=="n_prompt"){print i;exit}}') + TS_COL=${TS_COL:-0}; NG_COL=${NG_COL:-0}; NP_COL=${NP_COL:-0} +} + +# Returns "pp_speed pp / tg_speed tg t/s" +parse_speeds() { + local out="$1" + [ "${TS_COL:-0}" = "0" ] && detect_cols "$out" + local s pp tg + s=$(printf '%s\n' "$out" | sed 's/"//g') + pp=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \ + 'NR>1 && $np+0>0 && $ng+0==0 {printf "%.0f", $tc+0; exit}') + tg=$(printf '%s\n' "$s" | awk -F',' -v tc="$TS_COL" -v np="$NP_COL" -v ng="$NG_COL" \ + 'NR>1 && $ng+0>0 && $np+0==0 {printf "%.1f", 
$tc+0; exit}') + printf "%s pp / %s tg t/s" "${pp:--}" "${tg:--}" +} + +is_oom() { + local out="$1" ec="$2" + [ "$ec" -ne 0 ] && return 0 + printf '%s\n' "$out" | grep -qiE "failed to create context|out of memory|GGML_ASSERT|error:" && return 0 + return 1 +} + +# bench MODEL NGL [llama-bench extra args...] +# Standard speed benchmark: -p 512 -n 128 small context +bench() { + local model=$1 ngl=$2; shift 2 + local out ec + out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \ + -b 512 -ub 128 -o csv "$@" 2>&1) + ec=$? + if is_oom "$out" "$ec"; then echo "OOM"; return; fi + [ "${TS_COL:-0}" = "0" ] && detect_cols "$out" + parse_speeds "$out" +} + +# bench_ctx MODEL NGL CTX +# Context-capacity test: allocates KV for CTX tokens via -p CTX -n 1 +# Tries fa=1 first, falls back to fa=0. Returns "OK (N pp t/s [fa=N])" or "OOM" +bench_ctx() { + local model=$1 ngl=$2 ctx=$3 + local out ec fa_used + for fa in 1 0; do + out=$(timeout 250 /app/llama-bench -m "$model" -ngl "$ngl" \ + -p "$ctx" -n 1 -r 1 --no-warmup \ + -b 512 -ub 128 -fa "$fa" -t 6 -o csv 2>&1) + ec=$? 
+ is_oom "$out" "$ec" || { fa_used=$fa; break; } + [ "$fa" = "0" ] && { echo "OOM"; return; } + done + [ "${TS_COL:-0}" = "0" ] && detect_cols "$out" + local pp + pp=$(printf '%s\n' "$out" | sed 's/"//g' | \ + awk -F',' -v tc="$TS_COL" -v np="$NP_COL" \ + 'NR>1 && $np+0>0 {printf "%.0f", $tc+0; exit}') + printf "OK (%s pp t/s fa=%s)" "${pp:--}" "${fa_used:-?}" +} + +HR="======================================================================" +echo "$HR" +echo "LLAMA.CPP BENCHMARK — ALL MODELS — $(date)" +echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo unknown)" +echo "$HR" +echo "" + +# ── Phase 1: Baseline (small context) ──────────────────────────────────────── +echo "=== Phase 1: Baseline (ngl=99, p=512 n=128 r=2, t=6, fa=0) ===" +for entry in "SmolLM3-3B:$M_SMOL" "Gemma4-E2B:$M_E2B" "Gemma4-E4B:$M_E4B" "Qwen3-4B:$M_Q3"; do + lbl="${entry%%:*}"; mdl="${entry#*:}" + printf " %-14s %s\n" "$lbl" "$(bench "$mdl" 99 -p 512 -n 128 -r 2 -t 6 -fa 0)" +done +echo "" + +# ── Phase 2: Gemma4-E4B ngl sweep ──────────────────────────────────────────── +echo "=== Phase 2: Gemma4-E4B ngl sweep (p=16 n=64 r=1 t=6 fa=0) ===" +echo " 5.1GB model on 3.7GB VRAM — finding highest ngl before OOM" +best_e4b_ngl=0 +for ngl in 0 4 8 12 16 20 24 28 32 36 42; do + ts=$(bench "$M_E4B" $ngl -p 16 -n 64 -r 1 -t 6 -fa 0) + printf " ngl=%-3s %s\n" "$ngl" "$ts" + [[ "$ts" == OOM ]] && break + best_e4b_ngl=$ngl +done +echo " → best_e4b_ngl=$best_e4b_ngl" +echo "" + +# ── Phase 3: Max context sweep ──────────────────────────────────────────────── +echo "=== Phase 3: Max context (p=ctx n=1 r=1 no-warmup fa=1) ===" +echo " Gemma4 hybrid attention (sliding_window=512) enables large ctx cheaply." 
+declare -A BEST_CTX +BEST_CTX[smollm3]=512; BEST_CTX[e2b]=512; BEST_CTX[e4b]=512; BEST_CTX[q3]=512 + +for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \ + "e2b:Gemma4-E2B:$M_E2B:99" \ + "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \ + "q3:Qwen3-4B:$M_Q3:99"; do + IFS=':' read -r key lbl mdl ngl <<< "$entry" + echo " -- $lbl (ngl=$ngl) --" + for ctx in 512 1024 2048 4096 8192 12288 16384 24576 32768 49152 65536 98304 131072; do + ts=$(bench_ctx "$mdl" "$ngl" "$ctx") + printf " ctx=%-7s %s\n" "$ctx" "$ts" + [[ "$ts" == OOM ]] && break + BEST_CTX[$key]=$ctx + done + echo " → MAX ctx=${BEST_CTX[$key]}" +done +echo "" + +# ── Phase 4: TG speed at max context ───────────────────────────────────────── +echo "=== Phase 4: TG speed at max context (p=512 n=128 r=2 fa=1 t=6) ===" +for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \ + "e2b:Gemma4-E2B:$M_E2B:99" \ + "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \ + "q3:Qwen3-4B:$M_Q3:99"; do + IFS=':' read -r key lbl mdl ngl <<< "$entry" + ts=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6) + printf " %-14s max_ctx=%-7s %s\n" "$lbl" "${BEST_CTX[$key]}" "$ts" +done +echo "" + +# ── Phase 5: E4B thread sweep (CPU split model — threads matter) ────────────── +echo "=== Phase 5: Gemma4-E4B thread sweep (p=512 n=128 r=2 fa=0 ngl=$best_e4b_ngl) ===" +for t in 1 2 3 4 5 6 8 10 12; do + ts=$(bench "$M_E4B" "$best_e4b_ngl" -p 512 -n 128 -r 2 -fa 0 -t "$t") + printf " t=%-3s %s\n" "$t" "$ts" +done +echo "" + +# ── Phase 6: Flash attention comparison ────────────────────────────────────── +echo "=== Phase 6: Flash attention fa=0 vs fa=1 (p=512 n=128 r=2 t=6) ===" +echo " Gemma4 hybrid attention may not support FA — testing both." 
+for entry in "smollm3:SmolLM3-3B:$M_SMOL:99" \ + "e2b:Gemma4-E2B:$M_E2B:99" \ + "e4b:Gemma4-E4B:$M_E4B:$best_e4b_ngl" \ + "q3:Qwen3-4B:$M_Q3:99"; do + IFS=':' read -r key lbl mdl ngl <<< "$entry" + ts0=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 0 -t 6) + ts1=$(bench "$mdl" "$ngl" -p 512 -n 128 -r 2 -fa 1 -t 6) + printf " %-14s fa=0: %-30s fa=1: %s\n" "$lbl" "$ts0" "$ts1" +done +echo "" + +echo "$HR" +echo "BENCHMARK COMPLETE: $(date)" +echo "$HR" diff --git a/scripts/cpu_ctx_test.sh b/scripts/cpu_ctx_test.sh new file mode 100644 index 0000000..43945db --- /dev/null +++ b/scripts/cpu_ctx_test.sh @@ -0,0 +1,251 @@ +#!/bin/bash +# cpu_ctx_test.sh v4 — -nkvo bigctx with TurboQuant image (FORCE_MMQ) +# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq +# +# Tests KV in RAM (-nkvo) with BOTH q4_0 and turbo2 KV types. +# turbo2 = 2-bit KV (2x smaller than q4_0) → ~2x more context at same RAM budget. +# +# Speed model per token: +# GPU-compute models (smollm3/e2b/e4b/q3): bottleneck = PCIe KV reads +# t/s = 1000 / (gpu_ms + ctx * kv_bytes_per_token / PCIE_BPS * 1000) +# Qwen3.5-9B: bottleneck = RAM reads (21/32 layers on CPU, 8.86 GB model) +# t/s = 1000 / (1000/baseline + ctx * kv_bytes_per_token / RAM_BPS * 1000) +# +# Usage: bash /scripts/cpu_ctx_test.sh [smollm3|e2b|e4b|q3|qwen35q|all] + +set -uo pipefail + +TARGET="${1:-all}" +TARGET_TPS=15 +CPU_THREADS=6 +BENCH_GEN=32 +PCIE_BW_GBPS=8.0 # PCIe x4 3.0 practical read BW (conservative) +RAM_BW_GBPS=45.0 # RAM practical read BW (i7-10750H DDR4-2933) + +M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" +M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf" +M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf" +M_Q3="/models/Qwen3-4B-Q4_K_M.gguf" +M_Q35="/models/Qwen3.5-9B.Q8_0.gguf" + +declare -A NGL_GPU=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11) +# BW source: pcie for GPU-compute models, ram for qwen35-9b (CPU-compute bound) +declare -A BW_GBPS=([smollm3]=$PCIE_BW_GBPS [e2b]=$PCIE_BW_GBPS [e4b]=$PCIE_BW_GBPS 
[q3]=$PCIE_BW_GBPS [qwen35q]=$RAM_BW_GBPS)
+declare -A BW_LABEL=([smollm3]="PCIe" [e2b]="PCIe" [e4b]="PCIe" [q3]="PCIe" [qwen35q]="RAM")
+
+# CTX candidates: larger now thanks to turbo2 (2x smaller KV vs q4_0)
+# Note: turbo2 is SKIPPED for Qwen3-4B (PPL explodes at ctx>=8192: +0.52 → +13 → +437)
+# turbo2 is SKIPPED for Qwen3.5-9B (hybrid linear-attn incompatible with llama-perplexity;
+# server works fine at 32K — this is a test-tool limitation, not a real issue)
+SMOL_CTXS=(32768 49152 65536 98304 131072 163840)
+E2B_CTXS=(32768 49152 65536 98304 131072 163840 196608 262144 393216)
+E4B_CTXS=(32768 49152 65536 98304 131072 163840)
+Q3_CTXS=(24576 32768 49152 65536 98304 131072)
+Q35_CTXS=(16384 32768 49152 65536 98304 131072)
+declare -A CTX_CANDIDATES=(
+  [smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS"
+  [q3]="Q3_CTXS" [qwen35q]="Q35_CTXS")
+
+# Pure-GPU ctx for gain comparison
+declare -A PURE_GPU_CTX=([smollm3]=24576 [e2b]=24576 [e4b]=24576 [q3]=16384 [qwen35q]=32768)
+
+GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; CYAN='\033[0;36m'; NC='\033[0m'
+HR="======================================================================"
+
+# Tiny alloc file — enough for 1 chunk, minimal compute time
+ALLOC_FILE="/tmp/kv_alloc_tiny.txt"
+python3 -c "
+sentences = [
+    'The transformer architecture uses self-attention mechanisms to process sequences.',
+    'Large language models require significant computational resources for training.',
+    'Quantization reduces memory usage by storing weights in lower precision formats.',
+    'Flash attention enables memory-efficient computation for long context windows.',
+    'The key-value cache stores intermediate attention states during generation.',
+]
+import random; random.seed(1)
+print(chr(10).join([random.choice(sentences) for _ in range(64)]))
+" > "$ALLOC_FILE"
+
+# check_alloc MODEL NGL KV CTX [EXTRA...]
+# Echoes the host KV size in MiB (or "?" if it cannot be parsed) on success, "OOM" on failure. Fast: <15s.
+check_alloc() { + local model=$1 ngl=$2 kv=$3 ctx=$4 + shift 4 + local extra_args=("$@") + local tmp_err; tmp_err=$(mktemp) + + timeout 90 /app/llama-perplexity \ + -m "$model" -ngl "$ngl" \ + -fa on -nkvo \ + -c "$ctx" -ctk "$kv" -ctv "$kv" \ + -f "$ALLOC_FILE" --chunks 1 \ + "${extra_args[@]}" \ + > /dev/null 2>"$tmp_err" + local rc=$? + + local err; err=$(cat "$tmp_err"); rm -f "$tmp_err" + + if grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory\|cannot create buffer" <<< "$err"; then + echo "OOM"; return 1 + fi + + # Parse Host context MiB: "| Host | total = model + context + compute |" + local host_ctx_mib + host_ctx_mib=$(grep "Host" <<< "$err" | \ + grep -oP "=\s*\d+\s*\+\s*\K\d+(?=\s*\+)" | head -1 || true) + echo "${host_ctx_mib:-?}" +} + +# measure_baseline_tps MODEL NGL [EXTRA...] +measure_baseline_tps() { + local model=$1 ngl=$2 + shift 2 + local extra_args=("$@") + local raw + raw=$(timeout 120 /app/llama-bench \ + -m "$model" -ngl "$ngl" -t "$CPU_THREADS" \ + -p 1 -n "$BENCH_GEN" \ + -ctk q4_0 -ctv q4_0 -nkvo 1 -fa 1 -r 1 -o csv \ + "${extra_args[@]}" 2>/dev/null) || true + printf '%s\n' "$raw" | sed 's/"//g' | \ + awk -F',' 'NR>1 && $34=="0" && $35+0>0 && $40+0>0 {print $40+0; exit}' +} + +# estimate_tps BASELINE_TPS KV_PER_TOKEN_MIB CTX BW_GBPS +estimate_tps() { + local baseline_tps=$1 kv_per_token_mib=$2 ctx=$3 bw_gbps=$4 + python3 -c " +baseline = float('$baseline_tps') +kv_tok_bytes = float('$kv_per_token_mib') * 1024 * 1024 +bps = float('$bw_gbps') * 1e9 +ctx = int('$ctx') +base_ms = 1000.0 / baseline +kv_ms = ctx * kv_tok_bytes / bps * 1000 +print(f'{1000.0 / (base_ms + kv_ms):.1f}') +" 2>/dev/null || echo "?" 
+} + +# --------------------------------------------------------------------------- +echo "$HR" +echo "CPU-RAM KV CONTEXT TEST v4 (-nkvo, TurboQuant FORCE_MMQ) -- $(date)" +echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)" +echo "KV types tested: q4_0 (4-bit) and turbo2 (2-bit, 2x smaller → 2x more ctx)" +printf "PCIe assumption: %.1f GB/s | RAM assumption: %.1f GB/s\n" "$PCIE_BW_GBPS" "$RAM_BW_GBPS" +echo "$HR" +echo "" + +declare -a SUMMARY=() + +for entry in \ + "smollm3:SmolLM3-3B:$M_SMOL" \ + "e2b:Gemma4-E2B:$M_E2B" \ + "e4b:Gemma4-E4B:$M_E4B" \ + "q3:Qwen3-4B:$M_Q3" \ + "qwen35q:Qwen3.5-9B:$M_Q35" +do + IFS=':' read -r key lbl model <<< "$entry" + [[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue + + eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")" + ngl="${NGL_GPU[$key]}" + bw_gbps="${BW_GBPS[$key]}" + bw_label="${BW_LABEL[$key]}" + + # turbo2 incompatible with Qwen3-4B (quality fails at ctx>=8192) + # turbo2 alloc works for Qwen3.5-9B but quality measurement unreliable — test q4_0 only + if [[ "$key" == "q3" || "$key" == "qwen35q" ]]; then + kv_types_to_test=(q4_0) + else + kv_types_to_test=(q4_0 turbo2) + fi + + extra_args=() + + printf "${BLUE}=== %s (ngl=%s, BW model: %s %.0f GB/s) ===${NC}\n" \ + "$lbl" "$ngl" "$bw_label" "$bw_gbps" + + # Baseline t/s (empty KV, with q4_0 -nkvo — upper bound) + printf " Measuring baseline t/s (empty KV, p=1)... 
" + baseline_tps=$(measure_baseline_tps "$model" "$ngl" "${extra_args[@]}") + if [[ -z "$baseline_tps" ]]; then + printf "${RED}FAIL${NC}\n\n" + SUMMARY+=("$lbl|FAIL|FAIL|FAIL|FAIL|FAIL") + continue + fi + printf "${GREEN}%s t/s${NC}\n\n" "$baseline_tps" + + # Header + printf " %-10s %-12s %-12s %-12s %-12s %-12s %-12s\n" \ + "ctx" "KV type" "KV in RAM" "kv/tok" "t/s@25%" "t/s@50%" "t/s@100%" + printf " %-10s %-12s %-12s %-12s %-12s %-12s %-12s\n" \ + "---" "-------" "---------" "------" "-------" "-------" "--------" + + max_ctx_q4="" + max_ctx_t2="" + rec_q4="" + rec_t2="" + declare -A kv_ref_mib=() + + for ctx in "${ctxs[@]}"; do + for kv_type in "${kv_types_to_test[@]}"; do + result=$(check_alloc "$model" "$ngl" "$kv_type" "$ctx" "${extra_args[@]}") + if [[ "$result" == "OOM" ]]; then + printf " ${RED}%-10s %-12s OOM${NC}\n" "$ctx" "$kv_type" + continue + fi + + host_kv_mib="${result}" + [[ "$kv_type" == "q4_0" ]] && max_ctx_q4=$ctx || max_ctx_t2=$ctx + + # KV per token + if [[ "$host_kv_mib" =~ ^[0-9]+$ ]]; then + kv_per_token_mib=$(python3 -c "print(f'{$host_kv_mib / $ctx:.6f}')") + kv_ref_mib[$kv_type]=$kv_per_token_mib + else + kv_per_token_mib="${kv_ref_mib[$kv_type]:-?}" + fi + + tps25=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 4 ))" "$bw_gbps") + tps50=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$(( ctx / 2 ))" "$bw_gbps") + tps100=$(estimate_tps "$baseline_tps" "$kv_per_token_mib" "$ctx" "$bw_gbps") + + meets=$(python3 -c "print(1 if '$tps50' != '?' 
and float('$tps50') >= $TARGET_TPS else 0)" 2>/dev/null || echo 0) + [[ "$kv_type" == "q4_0" && "$meets" == "1" ]] && rec_q4=$ctx + [[ "$kv_type" == "turbo2" && "$meets" == "1" ]] && rec_t2=$ctx + + color=$([[ "$meets" == "1" ]] && echo "$GREEN" || echo "$YELLOW") + printf " ${color}%-10s${NC} %-12s %-12s %-12s %-12s ${color}%-12s${NC} %-12s\n" \ + "$ctx" "$kv_type" "${host_kv_mib}MiB" "${kv_per_token_mib}MiB" \ + "$tps25" "$tps50" "$tps100" + done + done + + rec_q4="${rec_q4:-$max_ctx_q4}" + rec_t2="${rec_t2:-$max_ctx_t2}" + pg="${PURE_GPU_CTX[$key]}" + + printf "\n Recommended ctx (>=%s t/s@50%%): q4_0=%s turbo2=%s (pure-GPU was %s)\n\n" \ + "$TARGET_TPS" "${rec_q4:-FAIL}" "${rec_t2:-FAIL}" "$pg" + + gain_q4=$([[ -n "${rec_q4:-}" && "${rec_q4:-}" != "FAIL" ]] && echo "$((rec_q4 - pg))" || echo "?") + gain_t2=$([[ -n "${rec_t2:-}" && "${rec_t2:-}" != "FAIL" ]] && echo "$((rec_t2 - pg))" || echo "?") + SUMMARY+=("$lbl|$baseline_tps|${max_ctx_q4:-OOM}|${rec_q4:-FAIL}|${max_ctx_t2:-OOM}|${rec_t2:-FAIL}|$gain_q4|$gain_t2") + + unset kv_ref_mib max_ctx_q4 max_ctx_t2 rec_q4 rec_t2 +done + +echo "$HR" +echo "SUMMARY — -nkvo (KV in RAM): q4_0 vs turbo2" +echo "$HR" +printf "%-16s %-12s %-14s %-14s %-14s %-14s\n" \ + "Model" "Baseline t/s" "q4_0 max" "q4_0 rec" "turbo2 max" "turbo2 rec" +printf "%-16s %-12s %-14s %-14s %-14s %-14s\n" \ + "-----" "------------" "--------" "--------" "----------" "----------" +for row in "${SUMMARY[@]}"; do + IFS='|' read -r lbl btps max_q4 rec_q4 max_t2 rec_t2 g_q4 g_t2 <<< "$row" + printf "${GREEN}%-16s %-12s %-14s %-14s %-14s %-14s [q4+%s / t2+%s vs pure-GPU]${NC}\n" \ + "$lbl" "$btps" "$max_q4" "$rec_q4" "$max_t2" "$rec_t2" "$g_q4" "$g_t2" +done +echo "$HR" +echo "Note: Qwen3.5-9B baseline already <15 t/s (RAM-bound, 8.86 GB model). BW model uses RAM not PCIe." 
+echo "$HR" diff --git a/scripts/download_models.sh b/scripts/download_models.sh new file mode 100755 index 0000000..d5f53e0 --- /dev/null +++ b/scripts/download_models.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# download_models.sh — Download GGUF model files to ./models/ +# +# Usage: +# bash scripts/download_models.sh # all models +# bash scripts/download_models.sh smollm3 # single model +# bash scripts/download_models.sh gemma4-e2b gemma4-e4b # multiple +# +# Requires: huggingface-cli (pip install huggingface_hub) +# Models land in: ./models/ +# +# Available keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all + +set -euo pipefail + +MODELS_DIR="$(cd "$(dirname "$0")/.." && pwd)/models" +mkdir -p "$MODELS_DIR" + +GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'; NC='\033[0m' + +check_hf_cli() { + if ! command -v huggingface-cli &>/dev/null; then + echo -e "${RED}Error: huggingface-cli not found.${NC}" + echo "Install with: pip install huggingface_hub" + exit 1 + fi +} + +download() { + local key="$1" + local repo="$2" + local filename="$3" + local size_hint="$4" + + local dest="$MODELS_DIR/$filename" + if [[ -f "$dest" ]]; then + echo -e "${YELLOW}[$key]${NC} Already exists: $filename — skipping" + return + fi + + echo -e "${GREEN}[$key]${NC} Downloading $filename (~$size_hint) from $repo ..." 
+ huggingface-cli download "$repo" "$filename" --local-dir "$MODELS_DIR" + echo -e "${GREEN}[$key]${NC} Done: $MODELS_DIR/$filename" +} + +download_smollm3() { + download "smollm3" \ + "bartowski/HuggingFaceTB_SmolLM3-3B-GGUF" \ + "HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" \ + "1.9 GB" +} + +download_gemma4_e2b() { + download "gemma4-e2b" \ + "bartowski/google_gemma-4-E2B-it-GGUF" \ + "google_gemma-4-E2B-it-Q4_K_M.gguf" \ + "2.9 GB" +} + +download_gemma4_e4b() { + download "gemma4-e4b" \ + "bartowski/google_gemma-4-E4B-it-GGUF" \ + "google_gemma-4-E4B-it-Q4_K_M.gguf" \ + "4.7 GB" +} + +download_qwen3_4b() { + download "qwen3-4b" \ + "bartowski/Qwen3-4B-GGUF" \ + "Qwen3-4B-Q4_K_M.gguf" \ + "2.4 GB" +} + +download_qwen35_9b() { + download "qwen35-9b" \ + "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF" \ + "Qwen3.5-9B.Q8_0.gguf" \ + "8.9 GB" +} + +main() { + check_hf_cli + + local targets=("$@") + if [[ ${#targets[@]} -eq 0 || "${targets[0]}" == "all" ]]; then + targets=(smollm3 gemma4-e2b gemma4-e4b qwen3-4b qwen35-9b) + fi + + for target in "${targets[@]}"; do + case "$target" in + smollm3) download_smollm3 ;; + gemma4-e2b) download_gemma4_e2b ;; + gemma4-e4b) download_gemma4_e4b ;; + qwen3-4b) download_qwen3_4b ;; + qwen35-9b) download_qwen35_9b ;; + all) + download_smollm3 + download_gemma4_e2b + download_gemma4_e4b + download_qwen3_4b + download_qwen35_9b + ;; + *) + echo -e "${RED}Unknown model: $target${NC}" + echo "Valid keys: smollm3 | gemma4-e2b | gemma4-e4b | qwen3-4b | qwen35-9b | all" + exit 1 + ;; + esac + done + + echo "" + echo "Models directory:" + ls -lh "$MODELS_DIR"/*.gguf 2>/dev/null || echo "(no .gguf files found)" +} + +main "$@" diff --git a/scripts/kv_quant_test.sh b/scripts/kv_quant_test.sh new file mode 100644 index 0000000..313fbb8 --- /dev/null +++ b/scripts/kv_quant_test.sh @@ -0,0 +1,246 @@ +#!/bin/bash +# KV cache quantization test using llama-perplexity. 
+# Image: local/llama-cpp-turboquant:full-cuda-sm75-mmq (FORCE_MMQ, turbo2/3/4 support) +# +# Tests KV types: f16 (baseline) + q8_0/q4_0/turbo2 for Q4_K_M models +# f16 (baseline) + turbo2/3/4 for Qwen3.5-9B Q8_0 +# Quality gate: PPL delta vs f16 < 0.5 (lossless for practical use) +# +# Usage: bash /scripts/kv_quant_test.sh [MODEL_KEY] +# MODEL_KEY: smollm3 | e2b | e4b | q3 | qwen35q | all (default: all) + +set -uo pipefail + +TARGET="${1:-all}" + +M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" +M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf" +M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf" +M_Q3="/models/Qwen3-4B-Q4_K_M.gguf" +M_Q35="/models/Qwen3.5-9B.Q8_0.gguf" + +declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99 [qwen35q]=11) +declare -A BASE_CTX=([smollm3]=18432 [e2b]=32768 [e4b]=20480 [q3]=8192 [qwen35q]=8192) +declare -A PPL_TIMEOUT=([smollm3]=300 [e2b]=300 [e4b]=300 [q3]=300 [qwen35q]=600) + +# Per-model KV types to test (f16 is always the baseline) +# Standard Q4_K_M models: q8_0/q4_0 + turbo2 (all supported by TurboQuant image) +# Qwen3.5-9B: designed for turbo KV — test turbo2/3/4 only (q4_0 would also work but less relevant) +declare -A MODEL_KV_TYPES=( + [smollm3]="q8_0 q4_0 turbo2" + [e2b]="q8_0 q4_0 turbo2" + [e4b]="q8_0 q4_0 turbo2" + [q3]="q8_0 q4_0 turbo2" + [qwen35q]="turbo2 turbo3 turbo4" +) + +# ctx candidates per model +SMOL_CTXS=(8192 12288 16384 18432 20480 24576 32768 40960 49152) +E2B_CTXS=(8192 16384 24576 32768 40960 49152 65536) +E4B_CTXS=(8192 12288 16384 20480 24576 32768 40960) +Q3_CTXS=(4096 6144 8192 10240 12288 16384 24576 32768) +Q35_CTXS=(4096 8192 16384 24576 32768 40960 49152) +declare -A CTX_CANDIDATES=( + [smollm3]="SMOL_CTXS" [e2b]="E2B_CTXS" [e4b]="E4B_CTXS" + [q3]="Q3_CTXS" [qwen35q]="Q35_CTXS") + +GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; BLUE='\033[0;34m'; NC='\033[0m' +HR="======================================================================" + +# Synthetic PPL file — 4000 lines, 
deterministic, no network needed +PPL_FILE="/tmp/kv_ppl_input.txt" +ensure_ppl_file() { + [[ -f "$PPL_FILE" ]] && return + python3 - << 'PY' +import random, sys +random.seed(42) +sentences = [ + "The transformer architecture uses self-attention mechanisms to process sequences.", + "Large language models require significant computational resources for training.", + "Quantization reduces memory usage by storing weights in lower precision formats.", + "Flash attention enables memory-efficient computation for long context windows.", + "The key-value cache stores intermediate attention states during generation.", + "Context length determines how many tokens the model can attend to simultaneously.", + "Perplexity measures how well a probability model predicts a sample of text.", + "Lower perplexity values indicate better language modeling performance overall.", + "GPU memory bandwidth is the primary bottleneck for autoregressive token generation.", + "Grouped query attention reduces KV cache size by sharing keys across head groups.", + "Rotary position embeddings encode relative position information in attention queries.", + "Mixture of experts models route tokens through specialized feed-forward networks.", + "Continuous batching allows servers to process multiple requests simultaneously.", + "KV cache quantization trades a small quality loss for significantly larger contexts.", +] +lines = [random.choice(sentences) for _ in range(4000)] +print('\n'.join(lines), file=open('/tmp/kv_ppl_input.txt', 'w')) +PY +} + +# run_ppl MODEL NGL KV CTX TIMEOUT [EXTRA_ARGS...] +# Echoes PPL value on stdout, returns 0 on success, 1 on OOM/crash. 
+run_ppl() { + local model=$1 ngl=$2 kv=$3 ctx=$4 timeout_s=$5 + shift 5 + local extra_args=("$@") + + local tmp_err; tmp_err=$(mktemp) + local ppl_out; ppl_out=$(mktemp) + + timeout "$timeout_s" /app/llama-perplexity \ + -m "$model" \ + -ngl "$ngl" \ + -fa on \ + -c "$ctx" \ + -ctk "$kv" -ctv "$kv" \ + -f "$PPL_FILE" \ + --chunks 1 \ + "${extra_args[@]}" \ + > "$ppl_out" 2>"$tmp_err" + local ppl_rc=$? + + local err; err=$(cat "$tmp_err"); rm -f "$tmp_err" + + if [[ "$ppl_rc" != "0" ]] || \ + grep -qi "out of memory\|failed to allocate\|cudaMalloc failed\|CUDA_ERROR_OUT_OF_MEMORY\|ggml_cuda_malloc\|cannot allocate memory" <<< "$err"; then + rm -f "$ppl_out" + return 1 + fi + + local ppl_val + ppl_val=$(grep -oP '\[\d+\]\K[0-9.]+' "$ppl_out" | tail -1) + rm -f "$ppl_out" + [[ -z "$ppl_val" ]] && return 1 + echo "$ppl_val" +} + +# --------------------------------------------------------------------------- +ensure_ppl_file + +echo "$HR" +echo "KV CACHE QUANT TEST (llama-perplexity) — TurboQuant image (FORCE_MMQ SM75)" +echo "$(date)" +echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)" +echo "$HR" +echo "Standard models: f16 baseline | q8_0 | q4_0 | turbo2 (2-bit, 8x smaller KV vs f16)" +echo "Qwen3.5-9B: f16 baseline | turbo2 | turbo3 | turbo4 (TurboQuant KV types)" +echo "Quality gate: PPL delta vs f16 < 0.5" +echo "" + +declare -a SUMMARY=() + +for entry in \ + "smollm3:SmolLM3-3B:$M_SMOL" \ + "e2b:Gemma4-E2B:$M_E2B" \ + "e4b:Gemma4-E4B:$M_E4B" \ + "q3:Qwen3-4B:$M_Q3" \ + "qwen35q:Qwen3.5-9B:$M_Q35" +do + IFS=':' read -r key lbl model <<< "$entry" + [[ "$TARGET" != "all" && "$TARGET" != "$key" ]] && continue + + eval "ctxs=(\"\${${CTX_CANDIDATES[$key]}[@]}\")" + ngl="${NGL[$key]}" + timeout_s="${PPL_TIMEOUT[$key]}" + IFS=' ' read -ra kv_types <<< "${MODEL_KV_TYPES[$key]}" + + # Extra args for qwen35-9b (flash attn already set; no mlock needed for PPL correctness) + extra_args=() + + printf "${BLUE}=== %s (base ctx=%s, ngl=%s) 
===${NC}\n" \ + "$lbl" "${BASE_CTX[$key]}" "$ngl" + + # Dynamic header based on KV types for this model + printf " %-10s %-18s" "ctx" "f16 (PPL)" + for kv in "${kv_types[@]}"; do + printf " %-20s" "$kv (PPL/delta)" + done + printf "\n" + printf " %-10s %-18s" "---" "---------" + for kv in "${kv_types[@]}"; do + printf " %-20s" "--------------------" + done + printf "\n" + + declare -A best_ctx_per_kv=([f16]="${BASE_CTX[$key]}") + for kv in "${kv_types[@]}"; do best_ctx_per_kv[$kv]="${BASE_CTX[$key]}"; done + declare -A oom_kv=([f16]=0) + for kv in "${kv_types[@]}"; do oom_kv[$kv]=0; done + declare -A ppl_f16_at_ctx=() + + for ctx in "${ctxs[@]}"; do + printf " %-10s" "$ctx" + + # f16 baseline + f16_ppl="" + if [[ "${oom_kv[f16]}" == "1" ]]; then + printf " ${RED}%-18s${NC}" "OOM" + else + f16_ppl=$(run_ppl "$model" "$ngl" "f16" "$ctx" "$timeout_s" "${extra_args[@]}") + if [[ $? -ne 0 ]]; then + printf " ${RED}%-18s${NC}" "OOM" + oom_kv[f16]=1 + else + printf " ${GREEN}%-18s${NC}" "$f16_ppl" + best_ctx_per_kv[f16]=$ctx + ppl_f16_at_ctx[$ctx]=$f16_ppl + fi + fi + + # KV type columns + for kv in "${kv_types[@]}"; do + if [[ "${oom_kv[$kv]}" == "1" ]]; then + printf " ${RED}%-20s${NC}" "OOM" + continue + fi + ppl=$(run_ppl "$model" "$ngl" "$kv" "$ctx" "$timeout_s" "${extra_args[@]}") + if [[ $? 
-ne 0 ]]; then
+        printf " ${RED}%-20s${NC}" "OOM"
+        oom_kv[$kv]=1
+        continue
+      fi
+
+      if [[ -n "$f16_ppl" ]]; then
+        delta=$(python3 -c "print(f'{float(\"$ppl\")-float(\"$f16_ppl\"):+.2f}')" 2>/dev/null || echo "?")
+        ok=$(python3 -c "exit(0 if abs(float('$ppl')-float('$f16_ppl'))<0.5 else 1)" 2>/dev/null && echo ok || echo bad)
+        if [[ "$ok" == "ok" ]]; then
+          best_ctx_per_kv[$kv]=$ctx  # counts only when the PPL delta is within the quality gate
+          printf " ${GREEN}%-20s${NC}" "${ppl}(${delta})"
+        else
+          printf " ${YELLOW}%-20s${NC}" "${ppl}(${delta})"
+        fi
+      else
+        best_ctx_per_kv[$kv]=$ctx; printf " ${GREEN}%-20s${NC}" "$ppl"  # no f16 baseline at this ctx: gate cannot be applied
+      fi
+    done
+    echo ""
+  done
+
+  echo ""
+
+  # Best recommendation: the non-f16 KV type with the highest ctx that passed the quality gate
+  overall_best_ctx="${BASE_CTX[$key]}"
+  overall_best_kv="f16"
+  for kv in "${kv_types[@]}"; do
+    bctx="${best_ctx_per_kv[$kv]}"
+    SUMMARY+=("$lbl|$kv|$bctx")
+    if [[ "$bctx" -gt "$overall_best_ctx" ]]; then
+      overall_best_ctx=$bctx; overall_best_kv=$kv
+    fi
+  done
+  SUMMARY+=("$lbl|f16|${best_ctx_per_kv[f16]}")
+  printf " ${GREEN}Best: %s → max ctx %s${NC}\n\n" "$overall_best_kv" "$overall_best_ctx"
+
+  unset best_ctx_per_kv oom_kv ppl_f16_at_ctx
+done
+
+echo "$HR"
+echo "SUMMARY"
+echo "$HR"
+printf "%-16s %-8s %s\n" "Model" "KV" "Max Ctx (no OOM + PPL delta<0.5)"
+printf "%-16s %-8s %s\n" "-----" "--" "---------------------------------"
+for row in "${SUMMARY[@]}"; do
+  IFS='|' read -r lbl kv ctx <<< "$row"
+  printf "${GREEN}%-16s %-8s %s${NC}\n" "$lbl" "$kv" "$ctx"
+done
+echo "$HR"
+echo "Reminder: update envs/.env.: CACHE_TYPE_K/V= CTX_SIZE="
+echo "$HR"
diff --git a/scripts/quality_test.sh b/scripts/quality_test.sh
new file mode 100644
index 0000000..6d1770e
--- /dev/null
+++ b/scripts/quality_test.sh
@@ -0,0 +1,215 @@
+#!/bin/bash
+# Quality tests for all 4 models — runs inside full-cuda container.
+# Tests: coding tasks + needle-in-haystack at 1K/8K ctx.
+# +# Inference parameters sourced from official HF model cards: +# SmolLM3: /no_think in SYSTEM prompt (-sys); temp=0.6 top_p=0.95 +# Qwen3: /no_think in SYSTEM prompt (-sys); temp=0.7 top_p=0.8 top_k=20 +# DO NOT use greedy (temp=0) — causes endless repetition per Qwen3 docs +# Gemma4: No thinking mode; temp=0.7 top_p=0.95 + +set -uo pipefail + +M_SMOL="/models/HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf" +M_E2B="/models/google_gemma-4-E2B-it-Q4_K_M.gguf" +M_E4B="/models/google_gemma-4-E4B-it-Q4_K_M.gguf" +M_Q3="/models/Qwen3-4B-Q4_K_M.gguf" + +declare -A NGL=([smollm3]=99 [e2b]=99 [e4b]=42 [q3]=99) +declare -A MAX_CTX=([smollm3]=24576 [e2b]=32768 [e4b]=24576 [q3]=8192) + +# Per-model sampling params (HF model card sources) +declare -A TEMP=([smollm3]="0.6" [e2b]="0.7" [e4b]="0.7" [q3]="0.7") +declare -A TOPP=([smollm3]="0.95" [e2b]="0.95" [e4b]="0.95" [q3]="0.8") +declare -A TOPK=([smollm3]="0" [e2b]="0" [e4b]="0" [q3]="20") +# /no_think in system prompt disables thinking for SmolLM3 and Qwen3 +declare -A SYSP=([smollm3]="/no_think" [e2b]="" [e4b]="" [q3]="/no_think") + +PASS=0; FAIL=0; TOTAL=0 + +GREEN='\033[0;32m'; RED='\033[0;31m'; YELLOW='\033[1;33m'; NC='\033[0m' + +# sed script to strip llama-cli interactive UI banner from stdout. +# ▄ (U+2584) and █ (U+2588) appear in the llama.cpp ASCII logo, sometimes +# with leading spaces — match anywhere on the line to be safe. 
+STRIP_BANNER='/^$/d
+/^Loading model/d
+/^[[:space:]]*$/d
+/[▄█]/d
+/^build /d
+/^model /d
+/^modalities/d
+/^available commands/d
+/^ \//d
+/^\[ Prompt:/d
+/^\[ Prompt:/d
+/^Exiting/d
+/^> /d
+'
+
+check() {
+  local lbl="$1" out="$2"
+  shift 2
+  local patterns=("$@")
+  local ok=1
+  for pat in "${patterns[@]}"; do
+    printf '%s\n' "$out" | grep -qiE "$pat" || { ok=0; break; }
+  done
+  TOTAL=$((TOTAL+1))
+  if [ "$ok" = "1" ]; then
+    PASS=$((PASS+1)); printf " ${GREEN}PASS${NC} %s\n" "$lbl"
+  else
+    FAIL=$((FAIL+1)); printf " ${RED}FAIL${NC} %s\n" "$lbl"
+    printf '%s\n' "$out" | grep -v '^$' | tail -3 | sed 's/^/ | /'
+  fi
+}
+
+# Strip thinking blocks from output.
+# Gemma4 uses [Start thinking]...[End thinking].
+# Qwen3/SmolLM3 use <think>...</think>.
+# Only complete blocks are stripped; a truncated block (no closing tag) is left in place.
+strip_think() {
+  python3 -c "
+import sys, re
+t = sys.stdin.read()
+# Only strip COMPLETE blocks. If thinking hit token limit, leave as-is so
+# check patterns can still match reasoning content inside the block.
+t = re.sub(r'\[Start thinking\].*?\[End thinking\]', '', t, flags=re.DOTALL)
+t = re.sub(r'<think>.*?</think>', '', t, flags=re.DOTALL)
+print(t.strip())
+" 2>/dev/null || cat
+}
+
+# run KEY MODEL PROMPT MAX_TOKENS [SYS_OVERRIDE]
+# SYS_OVERRIDE defaults to SYSP[$key] if omitted.
+# Pass "" explicitly to disable system prompt (thinking ON for Qwen3/SmolLM3).
+# Thinking params: SmolLM3/Qwen3 thinking=temp0.6/top_p0.95, nothink=model defaults.
+run() { + local key=$1 model=$2 prompt=$3 max_tok=$4 + local ngl="${NGL[$key]}" + # 5th arg overrides sys; if not provided, use SYSP[$key] + local use_sys + if [ "${5+x}" = "x" ]; then use_sys="$5"; else use_sys="${SYSP[$key]}"; fi + # choose sampling params: thinking mode uses 0.6/0.95, non-think uses model defaults + local temp topp topk + if [ -z "$use_sys" ] && [[ "$key" == "smollm3" || "$key" == "q3" ]]; then + temp="0.6"; topp="0.95"; topk="${TOPK[$key]}" + else + temp="${TEMP[$key]}"; topp="${TOPP[$key]}"; topk="${TOPK[$key]}" + fi + local sys_arg=() + [ -n "$use_sys" ] && sys_arg=(-sys "$use_sys") + local topk_arg=() + [ "$topk" != "0" ] && topk_arg=(--top-k "$topk") + timeout 300 /app/llama-cli -m "$model" -ngl "$ngl" \ + -n "$max_tok" --temp "$temp" --top-p "$topp" "${topk_arg[@]}" \ + --repeat-penalty 1.1 -fa on --mmap --single-turn \ + "${sys_arg[@]}" -p "$prompt" 2>/dev/null \ + | sed "$STRIP_BANNER" \ + | strip_think +} + +# needle_test KEY MODEL NEEDLE CTX +# Generates ~CTX tokens of filler, plants needle in middle, asks to recall it. +needle_test() { + local key=$1 model=$2 needle=$3 ctx=$4 + local ngl="${NGL[$key]}" + local temp="${TEMP[$key]}" topp="${TOPP[$key]}" sys="${SYSP[$key]}" + local sys_arg=() + [ -n "$sys" ] && sys_arg=(-sys "$sys") + + # filler: ctx/2 tokens each side, 1 token ~4 chars + local half_chars=$(( ctx * 2 )) + local reps=$(( half_chars / 45 + 2 )) + local filler + filler=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * $reps)" 2>/dev/null \ + | head -c "$half_chars") + + local prompt + printf -v prompt \ + '%s\nSECRET_VALUE=%s\n%s\nWhat is SECRET_VALUE? Reply with only the value, nothing else.' 
\ + "$filler" "$needle" "$filler" + + local ctx_size=$(( ctx + 512 )) + local out + out=$(timeout 180 /app/llama-cli -m "$model" -ngl "$ngl" \ + -n 512 --temp "$temp" --top-p "$topp" \ + -fa on --mmap --single-turn \ + -c "$ctx_size" "${sys_arg[@]}" -p "$prompt" 2>/dev/null \ + | sed "$STRIP_BANNER" \ + | strip_think) + + # join lines before grep in case model breaks needle across newlines + local flat + flat=$(printf '%s' "$out" | tr '\n' ' ') + if printf '%s' "$flat" | grep -qF "$needle"; then + echo "FOUND" + else + local snip + snip=$(printf '%s' "$flat" | cut -c1-80) + echo "MISSED (${snip:-})" + fi +} + +HR="======================================================================" +echo "$HR" +echo "QUALITY TESTS — ALL MODELS — $(date)" +echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)" +echo "$HR" +printf "Temps: SmolLM3=0.6/0.95 | Qwen3=0.7/0.8/k20 | Gemma4=0.7/0.95\n" +printf "/no_think via -sys for needle tests | thinking ON for coding bug test\n\n" + +CODING_FIZZBUZZ='Write ONLY the Python function fizzbuzz(n). It returns a list where multiples of 3 are "Fizz", multiples of 5 are "Buzz", multiples of both are "FizzBuzz", others are the number as string. Output code only, no prose.' 
+ +# hi is correctly len(arr)-1 to have ONE unambiguous bug: lo=mid (infinite loop) +CODING_BUG='Find the bug in this Python function and explain it in one sentence: +def binary_search(arr, target): + lo, hi = 0, len(arr) - 1 + while lo < hi: + mid = (lo + hi) // 2 + if arr[mid] == target: + return mid + elif arr[mid] < target: + lo = mid + else: + hi = mid + return -1' + +for entry in "smollm3:SmolLM3-3B:$M_SMOL" "e2b:Gemma4-E2B:$M_E2B" "e4b:Gemma4-E4B:$M_E4B" "q3:Qwen3-4B:$M_Q3"; do + IFS=':' read -r key lbl model <<< "$entry" + echo "=== $lbl ===" + + # Coding test 1: FizzBuzz — expect def + Fizz + Buzz + out=$(run "$key" "$model" "$CODING_FIZZBUZZ" 512) + check "FizzBuzz: def + Fizz + Buzz in output" "$out" \ + "def " "Fizz" "Buzz" + + # Coding test 2: Bug — thinking ON for all models (more reliable reasoning). + # Pass "" to disable /no_think override. Gemma4 already thinks by default. + out=$(run "$key" "$model" "$CODING_BUG" 3000 "") + check "Bug: identify lo=mid / infinite loop" "$out" \ + "lo.*=.*mid.*\+.*1|lo\+1|infinite loop|never.*advance|never.*progress|stuck|lo should be|lo\b.*never.*incr" + + # Needle-in-haystack + NEEDLE="QX7-ALPHA-9" + # strict < so we skip when ctx == max_ctx (prompt fills entire context, no room for output) + for ctx in 1024 8192; do + if [ "$ctx" -lt "${MAX_CTX[$key]}" ]; then + result=$(needle_test "$key" "$model" "$NEEDLE" "$ctx") + TOTAL=$((TOTAL+1)) + if [[ "$result" == FOUND ]]; then + PASS=$((PASS+1)); printf " ${GREEN}PASS${NC} Needle @ %s tok: %s\n" "$ctx" "$result" + else + FAIL=$((FAIL+1)); printf " ${RED}FAIL${NC} Needle @ %s tok: %s\n" "$ctx" "$result" + fi + else + printf " ${YELLOW}SKIP${NC} Needle @ %s tok (exceeds model max %s)\n" "$ctx" "${MAX_CTX[$key]}" + fi + done + + echo "" +done + +echo "$HR" +printf "RESULTS: ${GREEN}%s PASSED${NC} / ${RED}%s FAILED${NC} / %s TOTAL\n" "$PASS" "$FAIL" "$TOTAL" +echo "$HR"