llama-cpp

mozempk/llama-cpp


Commit Graph


Author

SHA1

Message

Date

Giancarmine Salucci

322364e6fc

compose: fix server command structure (critical bug)

Compose shlex-splits 'command: |' block scalar into a list when used with
'entrypoint: ["/bin/sh","-c"]'. Docker then runs '/bin/sh -c exec' where
'exec' is the only -c argument and '/app/llama-server' becomes $0. 'exec'
with no program in sh exits 0 immediately → 37-restart crash-loop, no server.

Fix: use 'entrypoint: []' and 'command: [/bin/sh, -c, <|block>]' so the full
shell command is passed as a single list element — not further split by Compose.

2026-05-06 23:31:12 +02:00
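
The failure described in 322364e6fc can be reproduced outside Docker. The sketch below is an approximation, not Compose's actual code: it assumes a string `command` is split shell-style (modelled here with Python's `shlex.split`) and that a list `command` is passed through unchanged. `compose_argv` is a hypothetical helper, and the server flags are placeholders.

```python
import shlex


def compose_argv(entrypoint, command):
    """Approximate the argv a container receives from entrypoint + command.

    Assumption: a string command is split shell-style, while a list
    command is passed through element by element.
    """
    cmd = shlex.split(command) if isinstance(command, str) else list(command)
    return list(entrypoint) + cmd


# Hypothetical server command; the flags are placeholders.
script = "exec /app/llama-server --host 0.0.0.0 --port 8080\n"

# Before the fix: block scalar + entrypoint ["/bin/sh", "-c"].
broken = compose_argv(["/bin/sh", "-c"], script)

# After the fix: empty entrypoint, whole script as one list element.
fixed = compose_argv([], ["/bin/sh", "-c", script])

print(broken)  # 'exec' alone follows -c; the server path lands in $0
print(fixed)   # the full script is the single -c argument
```

In the broken form, `sh` receives only `exec` as its `-c` script and treats the remaining words as positional parameters, which matches the immediate exit described in the commit message.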

Giancarmine Salucci

e7e389c0e1

llama+compose: fix bigctx startup timing

- compose: increase start_period for bigctx services
  - gemma4-e4b-bigctx: 60s -> 150s (5 GiB model + warmup + 163840 ctx takes ~90-120s)
  - gemma4-e2b-bigctx: 60s -> 120s (large ctx 393216 allocation)
  - smollm3/qwen3-4b bigctx: 60s -> 90s
- llama: extend health poll from 30x2s=60s to 75x2s=150s
- llama: require 3 consecutive unhealthy before giving up (avoids
  false positives during Docker start_period window)

2026-05-06 19:03:31 +02:00
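
The polling change in e7e389c0e1 can be sketched as follows. The `llama` script itself is not shown on this page, so this is only an illustration under assumptions: `wait_healthy` and `probe` are hypothetical names, and `probe` is assumed to return one of Docker's health states (`starting`, `healthy`, `unhealthy`).

```python
import time


def wait_healthy(probe, attempts=75, interval=2.0, max_unhealthy=3,
                 sleep=time.sleep):
    """Poll probe() until it reports "healthy".

    Gives up after `attempts` polls (75 x 2 s = 150 s by default) or
    after `max_unhealthy` consecutive "unhealthy" results.
    """
    streak = 0  # consecutive "unhealthy" results seen so far
    for _ in range(attempts):
        status = probe()
        if status == "healthy":
            return True
        # Any other state (e.g. "starting") resets the streak.
        streak = streak + 1 if status == "unhealthy" else 0
        if streak >= max_unhealthy:
            return False
        sleep(interval)
    return False


# Usage with a canned sequence of states instead of a real container.
states = iter(["starting", "unhealthy", "unhealthy", "starting", "healthy"])
print(wait_healthy(lambda: next(states), sleep=lambda s: None))
```

Requiring a streak rather than a single `unhealthy` result means one transient failure during startup does not abort the wait.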

Giancarmine Salucci

4ad296608b

Initial commit: tuned multi-model llama.cpp stack

- 5 models: SmolLM3-3B, Gemma4-E2B/E4B, Qwen3-4B, Qwen3.5-9B
- TurboQuant image (FORCE_MMQ): +6-11% free speed on Turing GPUs
- Bigctx profiles (-nkvo KV in RAM): 2-16x context gain
- turbo2 KV: 2x smaller, benchmarked against PPL quality gate
- Per-model env files with justified parameters
- kv_quant_test.sh + cpu_ctx_test.sh benchmark scripts
- docs/FINDINGS.md: surprises, pitfalls, recommendations
- docs/ARCHITECTURE.md: compose + test script design

2026-05-06 15:56:40 +02:00
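
The KV-related items in 4ad296608b rest on simple arithmetic: the cache grows linearly with context length and with bytes per element. Halving the element size therefore halves the cache, and keeping the cache in system RAM (`-nkvo`) removes VRAM as the limit on context. The sketch below uses the standard size formula for a plain attention cache. The model shape is hypothetical and not taken from any model in the commit, and the half-size format is a stand-in, not a statement about how turbo2 is encoded.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Size of a K + V cache: two tensors per layer, one row per position."""
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)


# Hypothetical model shape, chosen only to make the numbers round.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16 = kv_cache_bytes(32768, **shape)                       # 16-bit elements
half = kv_cache_bytes(32768, **shape, bytes_per_elem=1.0)  # half-size format

print(f"f16: {f16 / 2**30:.1f} GiB, half-size: {half / 2**30:.1f} GiB")
```

Models with sliding-window or shared-KV layers need less than this formula suggests, so treat the result as an upper-bound estimate.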

3 Commits