llama-cpp/envs/.env.qwen3-4b-bigctx
Giancarmine Salucci 4ad296608b Initial commit: tuned multi-model llama.cpp stack
- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design
2026-05-06 15:56:40 +02:00


# ==============================================================================
# Qwen3-4B Q4_K_M — bigctx variant (KV in RAM via -nkvo)
# Benchmarked 2026-05-06: with -nkvo the max context is 24576 tokens (+8K vs. the pure-GPU limit of 16384).
# Baseline token generation (TG): ~39 t/s with an empty KV cache.
# Use this profile when you need >16K context; otherwise use qwen3-4b.
# ==============================================================================
MODEL_FILE=Qwen3-4B-Q4_K_M.gguf
N_GPU_LAYERS=99
CTX_SIZE=24576
THREADS=6
THREADS_BATCH=6
BATCH_SIZE=512
UBATCH_SIZE=256
CACHE_TYPE_K=q4_0
CACHE_TYPE_V=q4_0
PARALLEL=1
EXTRA_ARGS=--flash-attn on --mmap --no-kv-offload
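
To make the profile easier to audit, here is a minimal sketch of how these variables might translate into a `llama-server` command line. The variable-to-flag mapping, the `/models/` path prefix, and the helper name `build_llama_cmd` are assumptions for illustration; the repository's compose file is the authority on how the values are actually consumed. The file is parsed line by line rather than sourced, because the unquoted spaces in `EXTRA_ARGS` would not survive a plain `. file` in a shell.

```shell
#!/bin/sh
# Hypothetical helper: prints the llama-server command line implied by an env file.
# The flag mapping and the /models/ prefix are assumptions, not taken from the repo.
build_llama_cmd() {
  env_file=$1
  model= ngl= ctx= threads= threads_batch= batch= ubatch= ctk= ctv= parallel= extra=
  # Split each line at the first '='; the rest of the line becomes the value.
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case $key in
      ''|'#'*)       continue ;;               # skip blank lines and comments
      MODEL_FILE)    model=$value ;;
      N_GPU_LAYERS)  ngl=$value ;;
      CTX_SIZE)      ctx=$value ;;
      THREADS)       threads=$value ;;
      THREADS_BATCH) threads_batch=$value ;;
      BATCH_SIZE)    batch=$value ;;
      UBATCH_SIZE)   ubatch=$value ;;
      CACHE_TYPE_K)  ctk=$value ;;
      CACHE_TYPE_V)  ctv=$value ;;
      PARALLEL)      parallel=$value ;;
      EXTRA_ARGS)    extra=$value ;;
    esac
  done < "$env_file"
  # Print the command instead of running it, so it can be reviewed first.
  printf '%s\n' "llama-server -m /models/$model -ngl $ngl -c $ctx -t $threads -tb $threads_batch -b $batch -ub $ubatch -ctk $ctk -ctv $ctv -np $parallel $extra"
}
```

Calling `build_llama_cmd llama-cpp/envs/.env.qwen3-4b-bigctx` prints the command without starting a server, which makes it possible to compare this profile against the pure-GPU `qwen3-4b` profile with a simple diff.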