trueref/FINDINGS.md
moze e54e1dd33b
fix(mcp): align SDK and wire streamable server manually
- align all io.modelcontextprotocol.sdk artifacts to 0.18.1 via
  dependencyManagement so Spring AI transitives no longer pull mcp 0.10.0
- exclude Spring AI's legacy MCP server/webmvc auto-config, which is binary-
  incompatible with the 0.18.1 streamable transport APIs
- build McpSyncServer directly against WebMvcStreamableServerTransportProvider
  and adapt Spring AI ToolCallbacks to MCP SyncToolSpecifications manually
- keep /mcp as the sole Streamable HTTP endpoint for both initialize/tool calls
  and optional SSE event streams
- update MCP transport documentation to match the new runtime

Validated locally with:
- POST /mcp initialize -> HTTP 200 + Mcp-Session-Id
- POST /mcp tools/list -> returns resolve-library-id + get-library-docs
2026-05-06 03:05:22 +02:00

trueref — Findings

Research notes backing the choices in ARCHITECTURE.md. Each section ends with a verdict and follow-up questions if any.


F1. Context7 ingestion behavior (what we replicate functionally)

  • Context7 ingests git repositories and crawls associated docs sites driven by a context7.json manifest at the repo root, plus an optional llms.txt index.
  • It produces snippets shaped roughly as { title, description, source, code, language } and serves them via two MCP tools: resolve-library-id and get-library-docs.
  • The get-library-docs API accepts topic and tokens parameters; topic biases retrieval, tokens caps the response size (defaults observed in client docs: ~5000).
  • Source: upstash/context7 GitHub repo & MCP docs.

Verdict: functional parity is achievable without copying the manifest schema. Our chunk model captures the same fields under different names (symbol/content/filePath/language). MCP tool signatures are kept byte-identical for LLM compatibility.


F2. Embedded vector store choice — Lucene 9 over Qdrant

  • Qdrant is a Rust binary; embedding it in a fat JAR requires extracting & spawning a child process, contradicting the "single JAR, embedded everything" goal.
  • Apache Lucene 9.x ships HNSW kNN alongside BM25 in the same index (the field class is KnnVectorField in 9.0, renamed to KnnFloatVectorField in 9.5). Pure JVM, no native deps.
  • Lucene supports filtered kNN (KnnFloatVectorQuery with a BooleanQuery filter), which we need for (repoId, versionId) scoping.
  • Trade-off: Lucene HNSW lacks some Qdrant conveniences (e.g. payload-rich filtering, quantization presets, named vectors). Acceptable for our scale; we get BM25 in the same store for free.

Verdict: Lucene 9 (we'll target the latest 9.x). One IndexWriter, refresh-on-search via SearcherManager.


F3. Embedding model — bge-m3

  • BAAI/bge-m3: 568M params, 8192 ctx, multilingual (100+ langs), trained on multi-functionality (dense + sparse + colbert).
  • ONNX export available (BAAI provides it; community variants on HuggingFace).
  • License: MIT-style (model weights), works for self-hosted commercial use.
  • Vector dim: 1024 (dense). Sparse vocab compatible with Lucene if we want SPLADE-like sparse — out of scope for v1.

Verdict: bge-m3 (dense only for v1). Sparse channel deferred.


F4. Reranker — bge-reranker-v2-m3

  • Cross-encoder, scores (query, passage) pairs.
  • Same family as embedder: balanced quality/cost, ONNX-exportable.
  • Apache 2.0 license.

Verdict: bge-reranker-v2-m3. Top-K candidates from RRF fed in, top-N (default 20) returned.
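
The fusion step that feeds the reranker can be sketched in plain Java. This is a minimal illustration of Reciprocal Rank Fusion over two ranked lists of chunk ids, not the project's actual code: the class name RrfFusion, the constant 60, and the tie-break by id are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: fuses ranked lists of chunk ids with Reciprocal Rank Fusion. */
final class RrfFusion {

    /** Commonly used RRF smoothing constant; an assumption, not taken from the findings. */
    static final int RRF_K = 60;

    /**
     * Each input list is ordered best-first. A chunk's fused score is the sum of
     * 1 / (RRF_K + rank) over every list it appears in, with rank starting at 1.
     */
    static List<String> fuse(List<List<String>> rankings, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (RRF_K + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by id so the output is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused.subList(0, Math.min(topK, fused.size()));
    }

    public static void main(String[] args) {
        List<String> bm25 = List.of("c1", "c2", "c3");
        List<String> knn = List.of("c3", "c1", "c4");
        // c1 appears near the top of both lists, so it is ranked first.
        System.out.println(fuse(List.of(bm25, knn), 3));
    }
}
```

The fused top-K list is what the cross-encoder would then rescore; the final top-N cut happens after reranking.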


F5. ML runtime — ONNX Runtime (Java bindings)

  • ONNX Runtime has official Java bindings (com.microsoft.onnxruntime:onnxruntime + onnxruntime_gpu).
  • Execution providers we will support:
    • CUDA (onnxruntime_gpu): Linux + Windows with an NVIDIA driver compatible with CUDA 12.x.
    • DirectML (onnxruntime-directml): Windows, any DX12 GPU.
    • CPU: always-on fallback.
  • ONNX Runtime has no Vulkan execution provider. Our earlier "Vulkan fallback" wish is not satisfiable in this stack — we drop it.
  • Generative LLMs in ONNX (e.g. Phi-3.5-mini) are possible but awkward (KV cache management, tokenizer differences). Since we picked retrieval-only, no generative model is needed.

Verdict: ONNX Runtime, providers tried in order: cuda → directml → cpu. Vulkan dropped (documented).
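
The "try providers in order" rule can be illustrated without touching ONNX Runtime at all. The sketch below assumes each provider is represented by a factory that either returns a session object or throws; ProviderFallback, Selected, and the string sessions are hypothetical names for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical sketch of ordered execution-provider fallback; no ONNX Runtime calls are made. */
final class ProviderFallback {

    /** Which provider won, and the session object its factory produced. */
    record Selected<S>(String provider, S session) {}

    /** Tries each factory in insertion order and returns the first one that does not throw. */
    static <S> Selected<S> firstAvailable(Map<String, Callable<S>> factories) {
        IllegalStateException failure = new IllegalStateException("no execution provider available");
        for (Map.Entry<String, Callable<S>> entry : factories.entrySet()) {
            try {
                return new Selected<>(entry.getKey(), entry.getValue().call());
            } catch (Exception | LinkageError ex) {
                // LinkageError covers missing native libraries (UnsatisfiedLinkError).
                failure.addSuppressed(ex);
            }
        }
        throw failure;
    }

    public static void main(String[] args) {
        Map<String, Callable<String>> factories = new LinkedHashMap<>();
        factories.put("cuda", () -> { throw new UnsatisfiedLinkError("no CUDA runtime"); });
        factories.put("directml", () -> { throw new IllegalStateException("not on Windows"); });
        factories.put("cpu", () -> "cpu-session");
        System.out.println(firstAvailable(factories).provider());
    }
}
```

A LinkedHashMap keeps the configured order, and the suppressed exceptions preserve why each earlier provider was rejected, which is useful in startup logs.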


F6. Java version — 21 LTS, not 25

  • Spring Boot 3.5.x officially supports Java 17 to 23.
  • Spring AI 1.0.x targets the same range.
  • Java 25 is supported by neither at time of writing; adopting it risks obscure reflection/MR-JAR issues with downstream libs (JGit, Lucene, ONNX bindings).
  • Java 21 is LTS and has stable virtual threads. Structured concurrency (StructuredTaskScope) is still a preview API in 21, so we'll guard usage behind a thin wrapper to ease later upgrade.

Verdict: Java 21 LTS. Re-evaluate a move to 25 once Spring Boot certifies it.


F7. Differential indexing scheme

  • We chose dedupe-by-content-hash AND git-diff-driven file skipping.
  • The hash dedupe alone means an unchanged chunk is embedded once, no matter how many tags contain it.
  • The git-diff path additionally avoids parsing/chunking unchanged files, which dominates ingest CPU on large repos.
  • Storage model:
    • chunks: one row per unique content_hash. Vector lives in Lucene keyed by chunkId.
    • chunk_versions: many-to-many; one row per (chunk, version, file, line range).
    • Search: BooleanQuery(filter=chunk_versions.version_id IN scope) joined to vector field.
  • The chunk dedupe ratio is reported as a UI metric — it's the most intuitive measure of "differential" effectiveness.

Verdict: confirmed; both mechanisms compose without conflict.
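
The storage model above can be sketched with in-memory collections. This is a simplified illustration, assuming SHA-256 as the content hash and defining the dedupe ratio as 1 - unique chunks / chunk occurrences; neither choice is stated in these notes, and ChunkDedupe is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the chunks / chunk_versions split. */
final class ChunkDedupe {

    /** One row per (chunk, version, file, line range). */
    record ChunkVersion(String contentHash, String versionId, String filePath, int startLine, int endLine) {}

    private final Map<String, String> chunks = new HashMap<>();          // content hash -> content
    private final List<ChunkVersion> chunkVersions = new ArrayList<>();
    private int embedCalls = 0;

    /** SHA-256 is an assumption; any stable content hash works for the sketch. */
    static String sha256(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    void ingest(String versionId, String filePath, int startLine, int endLine, String content) {
        String hash = sha256(content);
        if (chunks.putIfAbsent(hash, content) == null) {
            embedCalls++; // stand-in for the expensive embedding call, paid once per unique hash
        }
        chunkVersions.add(new ChunkVersion(hash, versionId, filePath, startLine, endLine));
    }

    int embedCalls() { return embedCalls; }

    /** Assumed definition: share of chunk occurrences that needed no new embedding. */
    double dedupeRatio() {
        return chunkVersions.isEmpty() ? 0.0 : 1.0 - (double) chunks.size() / chunkVersions.size();
    }

    public static void main(String[] args) {
        ChunkDedupe store = new ChunkDedupe();
        store.ingest("v1.0", "src/A.java", 1, 10, "class A {}");
        store.ingest("v1.1", "src/A.java", 1, 10, "class A {}"); // unchanged across tags
        store.ingest("v1.1", "src/B.java", 1, 5, "class B {}");
        System.out.println(store.embedCalls() + " embeddings, ratio " + store.dedupeRatio());
    }
}
```

The git-diff path would sit in front of ingest and skip unchanged files entirely, so their chunk_versions rows would be copied from the previous tag rather than recomputed.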


F8. MCP transport — Streamable HTTP

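  • The current MCP spec (revision 2025-03-26) defines Streamable HTTP: a single /mcp endpoint that accepts POST for JSON-RPC messages and may answer with an SSE stream (or serve one via GET) for long-lived/streamed responses. It replaces the HTTP+SSE transport of revision 2024-11-05.
  • Spring AI 1.0's MCP server/webmvc auto-configuration is built around the legacy SSE transport and is binary-incompatible with the streamable transport APIs of MCP SDK 0.18.1, so we exclude it and build McpSyncServer directly against WebMvcStreamableServerTransportProvider.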
  • We expose only Streamable HTTP, no SSE-only legacy endpoint (per user spec).

Verdict: Streamable HTTP only at /mcp.
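
For manual checks, an initialize call against the endpoint can be built with the JDK's HTTP client. The sketch below only constructs the request; the URL http://localhost:8080/mcp, the client name, and the class name McpInitializeRequest are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds a Streamable HTTP "initialize" request. */
final class McpInitializeRequest {

    /** JSON-RPC initialize message; clientInfo values are placeholders. */
    static final String INITIALIZE_BODY = """
            {"jsonrpc":"2.0","id":1,"method":"initialize","params":{
              "protocolVersion":"2025-03-26",
              "capabilities":{},
              "clientInfo":{"name":"smoke-client","version":"0.0.1"}}}
            """;

    static HttpRequest build(URI endpoint) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                // The client must be ready for either a plain JSON reply or an SSE stream.
                .header("Accept", "application/json, text/event-stream")
                .POST(HttpRequest.BodyPublishers.ofString(INITIALIZE_BODY))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build(URI.create("http://localhost:8080/mcp")); // assumed local address
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running server should yield a response carrying an Mcp-Session-Id header, which later requests such as tools/list then echo back.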


F9. Embedded SQL store — H2 (MVCC)

  • H2 in MVCC mode supports concurrent readers and a single writer with row-level locking. Good enough for our metadata write rates (jobs, versions, chunk_versions).
  • File-based, single JAR dependency, JDBC.
  • Considered & rejected:
    • DuckDB: column-store, slower OLTP, no good Flyway story.
    • SQLite: poor concurrency under write load.
    • Embedded Postgres (zonky): pulls a 100+ MB native binary per OS — fights the fat JAR goal.

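Verdict: H2 file-based with MVCC (always on with the MVStore engine in H2 2.x, which no longer needs an explicit MVCC=true URL setting), with Flyway migrations.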


F10. Job orchestration — custom virtual-thread orchestrator

  • Spring Batch is feature-rich but requires a JobRepository (typically Postgres or H2) and adds startup cost we don't need.
  • Our jobs are per-tag, simple linear stage sequences, with persistence-of-status as the only durability requirement.
  • Custom orchestrator: each IngestionJob runs on a virtual thread; stages execute sequentially; stage transitions are durably written to H2 in a transaction; JobEventBus emits events for SSE.
  • Crash recovery: on startup, scan jobs in RUNNING status, mark them FAILED (or resume specific resumable stages — v2).

Verdict: custom orchestrator. Spring Batch deferred unless we hit a ceiling.
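
The orchestrator's control flow is small enough to sketch. In this illustration an in-memory list stands in for the H2 status table, and IngestionJobSketch, Stage, and the stage names are hypothetical; it requires Java 21 for Thread.ofVirtual().

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch: one job, sequential stages, one virtual thread. */
final class IngestionJobSketch {

    enum Status { RUNNING, SUCCEEDED, FAILED }

    /** A named unit of work. */
    record Stage(String name, Runnable body) {}

    /** Stand-in for the H2 status table: every transition is appended in order. */
    final List<String> transitions = new CopyOnWriteArrayList<>();
    volatile Status status;

    /** Runs the stages on a fresh virtual thread and waits for the job to finish. */
    Status runToCompletion(List<Stage> stages) {
        Thread worker = Thread.ofVirtual().name("ingest-job").start(() -> runStages(stages));
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for job", e);
        }
        return status;
    }

    private void runStages(List<Stage> stages) {
        status = Status.RUNNING;
        for (Stage stage : stages) {
            transitions.add("START " + stage.name());
            try {
                stage.body().run();
            } catch (RuntimeException e) {
                transitions.add("FAILED " + stage.name());
                status = Status.FAILED;
                return; // later stages are skipped
            }
            transitions.add("DONE " + stage.name());
        }
        status = Status.SUCCEEDED;
    }

    /** Crash recovery: any job still RUNNING at startup is marked FAILED. */
    static void recoverOnStartup(Map<String, Status> persistedJobs) {
        persistedJobs.replaceAll((id, s) -> s == Status.RUNNING ? Status.FAILED : s);
    }

    public static void main(String[] args) {
        IngestionJobSketch job = new IngestionJobSketch();
        Status result = job.runToCompletion(List.of(
                new Stage("clone", () -> {}),
                new Stage("parse", () -> {})));
        System.out.println(result + " " + job.transitions);
    }
}
```

In the real design each transitions.add would be a transactional write plus a JobEventBus event; the sketch keeps only the ordering guarantees.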


F11. Code parser — pure-Java heuristic for v1, tree-sitter pluggable for v2

The Java tree-sitter ecosystem in 2026 is fragmented:

  • io.github.tree-sitter:jtreesitter uses Project Panama FFI → requires Java 22+. We target Java 21 LTS, so this is out.
  • io.github.bonede:tree-sitter is JNI-based and works on Java 21, but bundling per-OS (linux/windows/mac × x64/arm64) native grammar binaries for many languages bloats the fat JAR significantly and creates a packaging matrix we don't want to maintain in v1.
  • ai.serenade.treesitter:java-tree-sitter is unmaintained.

Decision (v1): ship a pure-Java heuristic CodeParser adapter. Strategies, tried in order per file:

  1. Markdown / .txt / .rst: split by ATX/Setext headings; large sections further split by paragraph.
  2. Brace-balanced languages (java, c, c++, c#, go, rust, js, ts, kotlin, scala, swift): walk the file tracking brace depth + line-based heuristics (function signatures, top-level declarations) to extract chunks of complete top-level constructs. Symbol name extracted via a tiny regex per language.
  3. Indent-based languages (python, yaml, ruby): split on top-level def/class/module boundaries; symbol name from the declaration line.
  4. Fallback (any text file): sliding-window of N lines (default 80) with M lines overlap (default 10).
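
Strategy 4 is the simplest of the four and can be sketched directly. The code below is an illustrative version of the sliding-window fallback, using the defaults from the list (80 lines, 10 overlap) as parameters; SlidingWindowChunker and Chunk are hypothetical names, and line numbers are 1-based and inclusive by assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fallback chunker: fixed-size line windows with overlap. */
final class SlidingWindowChunker {

    /** startLine and endLine are 1-based and inclusive. */
    record Chunk(int startLine, int endLine, String content) {}

    static List<Chunk> chunk(List<String> lines, int windowSize, int overlap) {
        if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
            throw new IllegalArgumentException("need windowSize > overlap >= 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        int step = windowSize - overlap; // each window starts this many lines after the previous one
        for (int start = 0; start < lines.size(); start += step) {
            int end = Math.min(start + windowSize, lines.size());
            chunks.add(new Chunk(start + 1, end, String.join("\n", lines.subList(start, end))));
            if (end == lines.size()) {
                break; // the file is covered; avoid a trailing chunk made only of overlap
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 200; i++) {
            lines.add("line " + i);
        }
        for (Chunk c : chunk(lines, 80, 10)) {
            System.out.println(c.startLine() + "-" + c.endLine());
        }
    }
}
```

With a 200-line file and the default settings this yields the windows 1-80, 71-150, and 141-200, so every boundary is covered by 10 shared lines.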

The CodeParser port is unchanged. A future tree-sitter implementation (once we upgrade the JDK or upstream packaging matures) can be swapped in by providing an alternate @Component and toggling a config flag; that is exactly what hexagonal architecture buys us.

Verdict: pure-Java heuristic parser for v1; tree-sitter remains a documented future enhancement.


F12. Concurrency caps & GPU contention

  • User chose unbounded virtual threads. This is safe for I/O-bound stages.
  • ONNX inference is GPU-bound; letting an unbounded number of threads run inference at once oversubscribes GPU memory and compute, even where a single OrtSession tolerates concurrent calls. Two mitigations:
    1. A session pool of size N (config embedding.session-count, default 2).
    2. A Semaphore(N) acquired by any caller before invoking inference. Pool & semaphore sizes match.
  • This means tag-level parallelism is naturally throttled by GPU capacity without explicit per-tag limits.

Verdict: session pool + semaphore. Document the knob clearly in application.yml.
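
The pool-plus-semaphore pairing can be sketched generically. In this illustration the session type is a type parameter rather than OrtSession, and SessionPool and withSession are hypothetical names; the pool and the semaphore are sized identically, as described above.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical pool: N sessions guarded by a semaphore with N permits. */
final class SessionPool<S> {

    private final BlockingQueue<S> idle;
    private final Semaphore permits;

    SessionPool(List<S> sessions) {
        if (sessions.isEmpty()) {
            throw new IllegalArgumentException("need at least one session");
        }
        this.idle = new ArrayBlockingQueue<>(sessions.size(), true, sessions);
        this.permits = new Semaphore(sessions.size(), true); // fair: callers are served in arrival order
    }

    /** Blocks until a session is free, runs the inference, then returns the session to the pool. */
    <R> R withSession(Function<S, R> inference) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a session", e);
        }
        S session = idle.poll(); // non-null: holding a permit guarantees an idle session
        try {
            return inference.apply(session);
        } finally {
            idle.offer(session); // hand the session back before releasing the permit
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        SessionPool<String> pool = new SessionPool<>(List.of("session-0", "session-1"));
        System.out.println(pool.withSession(s -> "ran on " + s));
    }
}
```

Because virtual threads park cheaply while waiting on the semaphore, any number of ingestion jobs can call withSession and at most N inferences run at a time.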


F13. Frontend in fat JAR

  • SvelteKit @sveltejs/adapter-static produces a fully static bundle (HTML/CSS/JS). We build it as a Maven sub-step (frontend-maven-plugin) and copy frontend/build/ to bootstrap/src/main/resources/static/. Spring serves it by default.
  • SPA fallback: a WebMvcConfigurer maps all unmatched non-API paths to index.html so client-side routing works.

Verdict: static adapter + Spring static-resource serving. Single artifact preserved.


F14. Open questions / future work

  1. Sparse channel (bge-m3 sparse / SPLADE) for stronger lexical recall — deferred to v2.
  2. Per-language reranker fine-tuning — out of scope (no fine-tuning, per spec).
  3. Compaction job to truly delete orphan chunks (currently soft-delete on versions). Schedule TBD.
  4. Watched-folder auto-discovery semantics: how often do we rescan ./data/watched/? Default proposal: every 5 min + on filesystem watch event (Java NIO WatchService).
  5. Repo size cap: do we need a maximum total cloned size to prevent runaway disk use? Currently unlimited; could add per-repo and global caps in v2.
  6. GPU memory introspection: Linux NVML via JNI (jnvml) for GPU mem gauges; on Windows + DirectML we surface only "available/in-use" booleans.

F15. References (for re-checking when libraries bump)

  • Context7 repo & MCP tool surface — to sanity-check schema fidelity on releases.
  • Spring AI 1.0.x release notes — verify MCP server Streamable HTTP module name & API.
  • Spring Boot 3.5.x release notes — confirm Java version compatibility window.
  • Lucene 9.x kNN docs — confirm filtered vector query API surface.
  • ONNX Runtime Java release notes — confirm CUDA/DirectML EP availability per version.
  • BAAI/bge-m3 model card — confirm ONNX export availability/format.
  • MCP spec 2025-03-26 — Streamable HTTP transport requirements.

Use the Context7 MCP lookup skill before bumping any of the above to fetch fresh, version-specific docs.


F16. Smoke-test log (2026-04-21)

End-to-end smoke after first assembly:

  • mvn -pl trueref-bootstrap -am package → BUILD SUCCESS, fat JAR ~582 MB.
  • mvn test → 16 tests pass (parser 6, pooling 5, disk cache 5), 0 failures.
  • java -jar trueref-bootstrap/target/trueref.jar --trueref.embedding.session-count=0 — started in 3.6 s.
  • GET /actuator/health → UP (db H2, disk, ping, ssl).
  • POST /api/repos + GET /api/repos — round-trips a repo.
  • GET /swagger-ui.html → 302 redirect (to /swagger-ui/index.html), GET /v3/api-docs → 200.
  • GET / → 200 (SvelteKit SPA served from Spring static resources).
  • Historical note: at this point the server still used the legacy WebMVC SSE transport, so POST /mcp without an established GET /sse session returned HTTP 500. This was later replaced by the Streamable HTTP transport on GET/POST /mcp.

Fixes landed during smoke:

  • V1__init_schema.sql: H2 in PostgreSQL mode rejects AUTO_INCREMENT. Switched job_log_events.id to BIGINT GENERATED BY DEFAULT AS IDENTITY and removed the explicit NULL constraint.
  • OnnxProperties.sessionCount can now be 0 (disables the ONNX stack, for environments where models aren't available); GpuSemaphore accepts 0 permits by internally using 1 (never acquired in disabled mode).
  • OnnxEmbeddingService / OnnxRerankerService short-circuit in disabled mode; reranker pass-through preserves input order.
  • ApplicationBeans exposes only concrete beans (not both the class and its interface) to avoid ambiguous autowiring.