- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
42 lines
1.5 KiB
Plaintext
# ==============================================================================
# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 Q8_0 — TurboQuant SM75
# Architecture: 32 layers (8 full-attn + 24 linear-attn), GQA 4 KV heads
# Model size: 8.86 GB | VRAM usage: ~3.4 GB (11 layers on GPU)
# RAM usage: ~5.5 GB (21 layers pinned via mlock)
#
# Benchmark results (turbo2 KV, ngl=11, fa=1):
#   t=1→0.86 t=2→1.62 t=3→2.25 t=4→2.94 t=5→3.56 t=6→4.38 ← best
#   t=8→4.22 t=12→3.61 (hyperthreading hurts above 6)
# Theoretical ceiling: ~5.1 t/s (45 GB/s RAM BW ÷ 8.86 GB model)
# Achieved: 4.38 t/s = 86% efficiency
#
# Download:
#   huggingface-cli download Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF \
#       Qwen3.5-9B.Q8_0.gguf --local-dir ./models/
# ==============================================================================

MODEL_FILE=Qwen3.5-9B.Q8_0.gguf

# GPU: 11 layers fit in 3.7 GB VRAM. ngl=12 causes OOM at ctx>2048.
N_GPU_LAYERS=11

# 32K context fits with turbo2 KV (~104 MiB overhead vs ~3.3 GB for f16)
CTX_SIZE=32768

# t=6 is optimal for i7-10750H (6 physical cores). t>6 uses HT which hurts.
THREADS=6
THREADS_BATCH=6

BATCH_SIZE=512
UBATCH_SIZE=128

# turbo2: 2-bit KV cache, 6.4× smaller than f16. Requires TurboQuant image.
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2

PARALLEL=1

# --no-mmap --mlock: pins entire model in RAM (prevents paging, avoids cold reads)
# --flash-attn on: required with turbo2 KV (fa=0 + turbo2 has no speed benefit)
EXTRA_ARGS=--flash-attn on --no-mmap --mlock
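
The settings above are presumably expanded into a llama-server command line by the compose setup. That wiring is not shown here, so the following is only a rough sketch: the `FLAG_FOR` mapping, the `/models` directory, and the helper names `parse_env` and `build_args` are assumptions made for illustration, not the project's actual logic. The flags themselves are standard llama.cpp server options, and `turbo2` is only accepted by the TurboQuant image.

```python
import shlex

# Hypothetical mapping from this env file's variable names to llama-server flags.
FLAG_FOR = {
    "N_GPU_LAYERS": "--n-gpu-layers",
    "CTX_SIZE": "--ctx-size",
    "THREADS": "--threads",
    "THREADS_BATCH": "--threads-batch",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
    "PARALLEL": "--parallel",
}

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def build_args(env, model_dir="/models"):
    """Build an argv list; model_dir is an assumed container path."""
    args = ["llama-server", "--model", f"{model_dir}/{env['MODEL_FILE']}"]
    for key, flag in FLAG_FOR.items():
        if key in env:
            args += [flag, env[key]]
    # EXTRA_ARGS is split shell-style and appended unchanged.
    args += shlex.split(env.get("EXTRA_ARGS", ""))
    return args

SAMPLE = """
MODEL_FILE=Qwen3.5-9B.Q8_0.gguf
N_GPU_LAYERS=11
CTX_SIZE=32768
THREADS=6
CACHE_TYPE_K=turbo2
CACHE_TYPE_V=turbo2
EXTRA_ARGS=--flash-attn on --no-mmap --mlock
"""

args = build_args(parse_env(SAMPLE))
print(" ".join(args))
```

Keeping the mapping in a single table makes it easy to see which variables reach the server; anything without its own variable goes through `EXTRA_ARGS`.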
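
The ceiling and efficiency figures in the header comment can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted in that comment and follows the same simplification the header makes: every generated token streams the full 8.86 GB model from RAM once, ignoring the 11 layers held in VRAM. The function names are illustrative.

```python
# Inputs copied from the header comment of this env file.
RAM_BANDWIDTH_GB_S = 45.0  # quoted RAM bandwidth
MODEL_SIZE_GB = 8.86       # Q8_0 model size

# Measured tokens/s per thread count, from the benchmark comment.
MEASURED_TPS = {1: 0.86, 2: 1.62, 3: 2.25, 4: 2.94,
                5: 3.56, 6: 4.38, 8: 4.22, 12: 3.61}

def bandwidth_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on tokens/s if each token reads the whole model once."""
    return bandwidth_gb_s / model_gb

def best_thread_count(results):
    """Return the thread count with the highest measured throughput."""
    return max(results, key=results.get)

ceiling = bandwidth_ceiling(RAM_BANDWIDTH_GB_S, MODEL_SIZE_GB)
best_t = best_thread_count(MEASURED_TPS)
efficiency = MEASURED_TPS[best_t] / ceiling

print(f"ceiling ~{ceiling:.1f} t/s, best t={best_t}, efficiency {efficiency:.0%}")
# prints: ceiling ~5.1 t/s, best t=6, efficiency 86%
```

The same calculation can be rerun for another machine by changing the two constants and replacing the measured table with fresh benchmark results.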