# trueref: Findings

Research notes backing the choices in [ARCHITECTURE.md](ARCHITECTURE.md). Each section ends with a verdict and, where relevant, follow-up questions.

---

## F1. Context7 ingestion behavior (what we replicate functionally)

- Context7 ingests git repositories and crawls associated docs sites, driven by a `context7.json` manifest at the repo root plus an optional `llms.txt` index.
- It produces snippets shaped roughly as `{ title, description, source, code, language }` and serves them via two MCP tools: `resolve-library-id` and `get-library-docs`.
- The `get-library-docs` tool accepts `topic` and `tokens` parameters: `topic` biases retrieval, and `tokens` caps the response size (default observed in client docs: ~5000).
- Source: upstash/context7 GitHub repo & MCP docs.

**Verdict:** functional parity is achievable without copying the manifest schema. Our chunk model captures the same fields under different names (`symbol`/`content`/`filePath`/`language`). MCP tool signatures are kept **byte-identical** for LLM compatibility.

---

## F2. Embedded vector store choice: Lucene 9 over Qdrant

- Qdrant is a Rust binary; embedding it in a fat JAR requires extracting and spawning a child process, contradicting the "single JAR, embedded everything" goal.
- **Apache Lucene 9.x** ships HNSW kNN search (`KnnFloatVectorField` in recent 9.x releases) alongside BM25 in the same index. Pure JVM, no native deps.
- Lucene supports **filtered kNN** (`KnnFloatVectorQuery` with a filter query such as a `BooleanQuery`), which we need for `(repoId, versionId)` scoping.
- Trade-off: Lucene HNSW lacks some of Qdrant's payload-filtering and vector-management features (e.g. quantization presets, named vectors). Acceptable at our scale, and we get BM25 in the same store for free.

**Verdict:** Lucene 9 (targeting the latest 9.x). One `IndexWriter`, refresh-on-search via `SearcherManager`.

---

## F3. Embedding model: bge-m3

- BAAI/bge-m3: 568M params, 8192-token context, multilingual (100+ languages), trained for multi-functionality (dense + sparse + ColBERT-style multi-vector retrieval).
- ONNX export available (BAAI provides it; community variants exist on Hugging Face).
- License: MIT-style (model weights), suitable for self-hosted commercial use.
- Vector dim: 1024 (dense). The sparse channel's token weights could be indexed in Lucene for SPLADE-like retrieval; out of scope for v1.

**Verdict:** bge-m3 (dense only for v1). Sparse channel deferred.

---

## F4. Reranker: bge-reranker-v2-m3

- Cross-encoder: scores (query, passage) pairs jointly.
- Same family as the embedder; balanced quality/cost; ONNX-exportable.
- Apache 2.0 license.

**Verdict:** bge-reranker-v2-m3. The top-K candidates from RRF are fed in; the top-N (default 20) are returned.

---

## F5. ML runtime: ONNX Runtime (Java bindings)

- ONNX Runtime has **official Java bindings** (`com.microsoft.onnxruntime:onnxruntime`, plus `onnxruntime_gpu` for CUDA).
- Execution providers we will support:
  - **CUDA** (`onnxruntime_gpu`): Linux and Windows, with an NVIDIA driver recent enough for CUDA 12.x.
  - **DirectML** (`onnxruntime-directml`): Windows, any DX12-capable GPU.
  - **CPU**: always-on fallback.
- ONNX Runtime has **no Vulkan execution provider**. Our earlier "Vulkan fallback" wish cannot be satisfied in this stack, so we drop it.
- Generative LLMs in ONNX (e.g. Phi-3.5-mini) are possible but awkward (KV-cache management, tokenizer differences). Since we chose **retrieval-only**, no generative model is needed.

**Verdict:** ONNX Runtime, with providers tried in order: cuda → directml → cpu. Vulkan dropped (documented).

---

## F6. Java version: 21 LTS, not 25

- Spring Boot 3.5.x officially supports Java 17–23.
- Spring AI 1.0.x targets the same range.
- Neither supports Java 25 at the time of writing; adopting it would risk obscure reflection/multi-release-JAR issues in downstream libraries (JGit, Lucene, ONNX bindings).
- Java 21 is an LTS release with final (non-preview) virtual threads. Structured concurrency (`StructuredTaskScope`) is only a preview API in 21 (still preview as of JDK 23), so we guard any usage behind a thin wrapper to ease a later upgrade.

**Verdict:** Java 21 LTS. Re-evaluate moving to 25 once Spring Boot certifies it.

---

## F7. Differential indexing scheme

- We chose **dedupe-by-content-hash** AND **git-diff-driven file skipping**.
- Hash dedupe alone means each unique chunk is embedded once: unchanged code carried across tags costs no further embedding work.
- The git-diff path additionally skips parsing and chunking unchanged files, which dominates ingest CPU on large repos.
- Storage model:
  - `chunks`: one row per unique `content_hash`. The vector lives in Lucene, keyed by `chunkId`.
  - `chunk_versions`: many-to-many; one row per `(chunk, version, file, line range)`.
- Search: the vector query is restricted to chunks whose `chunk_versions.version_id` is in the requested scope (conceptually `BooleanQuery(filter=chunk_versions.version_id IN scope)` joined to the vector field).
- The chunk dedupe ratio is reported as a UI metric; it is the most intuitive measure of how effective the differential scheme is.

**Verdict:** confirmed; both mechanisms compose without conflict.

---

## F8. MCP transport: Streamable HTTP

- The current MCP spec (revision 2025-03-26) defines **Streamable HTTP**: a single MCP endpoint (ours is `/mcp`) that clients `POST` to, where the server may answer with a single JSON response or open an SSE stream for long-lived/streamed responses. It replaces the deprecated HTTP+SSE transport from revision 2024-11-05.
- Spring AI 1.0 ships an MCP server module that supports Streamable HTTP via Spring MVC.
- We expose **only** Streamable HTTP, with no legacy SSE-only endpoint (per user spec).

**Verdict:** Streamable HTTP only, at `/mcp`.

---

## F9. Embedded SQL store: H2 (MVCC)

- H2's default MVStore engine is MVCC with row-level locking: readers do not block writers, and concurrent writers only conflict on the same rows. Good enough for our metadata write rates (jobs, versions, chunk_versions).
- File-based, single JAR dependency, JDBC.
- Considered and rejected:
  - **DuckDB**: column store, slower for OLTP, no good Flyway story.
  - **SQLite**: poor concurrency under write load.
  - **Embedded Postgres (zonky)**: pulls a 100+ MB native binary per OS, which fights the fat-JAR goal.

**Verdict:** H2 file-based (MVStore, which is always MVCC; the legacy `MVCC=TRUE` URL switch no longer exists in current H2), with Flyway migrations.

---

## F10. Job orchestration: custom virtual-thread orchestrator

- Spring Batch is feature-rich but requires a JobRepository (typically backed by Postgres or H2) and adds startup cost we don't need.
- Our jobs are **per-tag**, **simple linear stage sequences**, whose only durability requirement is persisting their status.
- Custom orchestrator: each `IngestionJob` runs on a virtual thread; stages execute sequentially; each stage transition is written durably to H2 in a transaction; a `JobEventBus` emits events for SSE.
- Crash recovery: on startup, scan for jobs in `RUNNING` status and mark them `FAILED` (resuming specific resumable stages is planned for v2).

**Verdict:** custom orchestrator. Spring Batch deferred unless we hit a ceiling.

---

## F11. Code parser: pure-Java heuristic for v1, tree-sitter pluggable for v2

The Java tree-sitter ecosystem in 2026 is fragmented:

- **`io.github.tree-sitter:jtreesitter`** uses Project Panama's Foreign Function & Memory API (final only in Java 22) and therefore requires **Java 22+**. We target Java 21 LTS, so it is out.
- **`io.github.bonede:tree-sitter`** is JNI-based and works on Java 21, but bundling per-OS native grammar binaries (linux/windows/mac × x64/arm64) for many languages bloats the fat JAR significantly and creates a packaging matrix we don't want to maintain in v1.
- **`ai.serenade.treesitter:java-tree-sitter`** is unmaintained.

**Decision (v1):** ship a pure-Java heuristic `CodeParser` adapter behind the unchanged `CodeParser` port. Strategies, tried in order per file:

1. **Markdown / `.txt` / `.rst`**: split by ATX/Setext headings; split large sections further by paragraph.
2. **Brace-balanced languages** (java, c, c++, c#, go, rust, js, ts, kotlin, scala, swift): walk the file tracking brace depth, plus line-based heuristics (function signatures, top-level declarations), to extract chunks of complete top-level constructs. The symbol name is extracted via a tiny per-language regex.
3. **Indent-based languages** (python, yaml, ruby): split on top-level `def`/`class`/`module` boundaries; the symbol name comes from the declaration line.
4. **Fallback** (any text file): a sliding window of N lines (default 80) with M lines of overlap (default 10).
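
The brace-depth walk (strategy 2) and the sliding-window fallback (strategy 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not trueref's adapter: the `HeuristicChunker` class, the `Chunk` record and both method names are hypothetical, braces inside strings and comments are not skipped, and the signature heuristics and per-language symbol regexes are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of two v1 heuristics: brace-depth chunking and the sliding-window fallback.
 * Hypothetical names; not trueref's actual CodeParser adapter.
 */
final class HeuristicChunker {

    /** A chunk spanning 1-based, inclusive line numbers. */
    record Chunk(int startLine, int endLine, String content) {}

    /**
     * One chunk per top-level brace-balanced construct. A chunk opens at the first
     * non-blank line seen at depth 0 and closes on the line where depth returns to 0
     * after at least one '{'. Leading depth-0 lines (package, imports, annotations)
     * are attached to the next construct. Naive: braces in strings/comments count too.
     */
    static List<Chunk> braceChunks(List<String> lines) {
        List<Chunk> chunks = new ArrayList<>();
        int depth = 0;
        int start = -1;              // 0-based first line of the open chunk, or -1 if none
        boolean openedBlock = false; // has the open chunk seen a '{' yet?
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (start < 0 && depth == 0 && !line.isBlank()) {
                start = i;
            }
            for (char c : line.toCharArray()) {
                if (c == '{') {
                    depth++;
                    openedBlock = true;
                } else if (c == '}' && depth > 0) {
                    depth--;
                }
            }
            if (start >= 0 && depth == 0 && openedBlock) {
                chunks.add(chunk(lines, start, i + 1));
                start = -1;
                openedBlock = false;
            }
        }
        if (start >= 0) {            // trailing text or an unbalanced construct
            chunks.add(chunk(lines, start, lines.size()));
        }
        return chunks;
    }

    /** Fallback: windows of {@code size} lines; consecutive windows share {@code overlap} lines. */
    static List<Chunk> slidingWindow(List<String> lines, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<Chunk> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < lines.size(); start += step) {
            int end = Math.min(start + size, lines.size());
            chunks.add(chunk(lines, start, end));
            if (end == lines.size()) {
                break;               // this window already reaches the end of the file
            }
        }
        return chunks;
    }

    /** Builds a chunk from a 0-based start (inclusive) to end (exclusive). */
    private static Chunk chunk(List<String> lines, int start, int end) {
        return new Chunk(start + 1, end, String.join("\n", lines.subList(start, end)));
    }
}
```

With the defaults (80 lines, 10 overlap), a 200-line file yields windows covering lines 1–80, 71–150 and 141–200. Attaching leading `package`/`import` lines to the first construct is just one choice made for this sketch; a real adapter might emit them as a separate chunk.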

A future tree-sitter implementation (once a JDK upgrade or upstream packaging makes it practical) can be swapped in by providing an alternate `@Component` and toggling a config flag; that is exactly what the hexagonal architecture buys us.

**Verdict:** pure-Java heuristic parser for v1; tree-sitter remains a documented future enhancement.

---

## F12. Concurrency caps & GPU contention

- The user chose **unbounded virtual threads**. This is safe for I/O-bound stages.
- ONNX inference is GPU-bound. ONNX Runtime generally allows concurrent `run` calls on one `OrtSession`, but the DirectML provider does not, and unbounded concurrent calls on CUDA just contend for device memory and compute. Two mitigations:
  1. A **session pool** of size N (config `embedding.session-count`, default 2).
  2. A **`Semaphore(N)`** that every caller acquires before invoking inference. Pool and semaphore sizes match.
- As a result, tag-level parallelism is naturally throttled by GPU capacity without explicit per-tag limits.

**Verdict:** session pool + semaphore. Document the knob clearly in `application.yml`.

---

## F13. Frontend in fat JAR

- SvelteKit's `@sveltejs/adapter-static` produces a fully static bundle (HTML/CSS/JS). We build it as a Maven sub-step (frontend-maven-plugin) and copy `frontend/build/` to `bootstrap/src/main/resources/static/`, which Spring Boot serves by default.
- SPA fallback: a `WebMvcConfigurer` maps all unmatched non-API paths to `index.html` so client-side routing works.

**Verdict:** static adapter + Spring static-resource serving. Single artifact preserved.

---

## F14. Open questions / future work

1. **Sparse channel** (bge-m3 sparse / SPLADE) for stronger lexical recall: deferred to v2.
2. **Per-language reranker fine-tuning**: out of scope (no fine-tuning, per spec).
3. **Compaction job** to truly delete orphan chunks (currently versions are only soft-deleted). Schedule TBD.
4. **Watched-folder** auto-discovery semantics: how often do we rescan `./data/watched/`? Default proposal: every 5 min, plus on filesystem watch events (Java NIO `WatchService`).
5. **Repo size cap**: do we need a maximum total cloned size to prevent runaway disk use? Currently unlimited; per-repo and global caps could be added in v2.
6. **GPU memory introspection**: NVML via JNI (`jnvml`) on Linux for GPU memory gauges; on Windows + DirectML we surface only "available/in-use" booleans.

---

## F15. References (for re-checking when libraries bump)

- Context7 repo & MCP tool surface: sanity-check schema fidelity on releases.
- Spring AI 1.0.x release notes: verify the MCP server Streamable HTTP module name & API.
- Spring Boot 3.5.x release notes: confirm the Java version compatibility window.
- Lucene 9.x kNN docs: confirm the filtered vector query API surface.
- ONNX Runtime Java release notes: confirm CUDA/DirectML EP availability per version.
- BAAI/bge-m3 model card: confirm ONNX export availability/format.
- MCP spec 2025-03-26: Streamable HTTP transport requirements.

> Use the Context7 MCP lookup skill before bumping any of the above to fetch fresh, version-specific docs.

---

## F16. Smoke-test log (2026-04-21)

End-to-end smoke test after first assembly:

- `mvn -pl trueref-bootstrap -am package` → BUILD SUCCESS, fat JAR ~582 MB.
- `mvn test` → **16 tests pass** (parser 6, pooling 5, disk cache 5), **0 failures**.
- `java -jar trueref-bootstrap/target/trueref.jar --trueref.embedding.session-count=0` → started in 3.6 s.
- `GET /actuator/health` → `UP` (db H2, disk, ping, ssl).
- `POST /api/repos` + `GET /api/repos`: round-trips a repo.
- `GET /swagger-ui.html` → 302 redirect (to `/swagger-ui/index.html`); `GET /v3/api-docs` → 200.
- `GET /` → 200 (SvelteKit SPA served from Spring static resources).
- Historical note: at this point the server still used the legacy WebMVC SSE transport, so `POST /mcp` without an established `GET /sse` session returned HTTP 500. This was later replaced by the Streamable HTTP transport on `GET`/`POST /mcp`.

Fixes landed during smoke:

- `V1__init_schema.sql`: H2 in PostgreSQL mode rejects `AUTO_INCREMENT`.
  Switched `job_log_events.id` to `BIGINT GENERATED BY DEFAULT AS IDENTITY` and removed the explicit `NULL` constraint.
- `OnnxProperties.sessionCount` can now be 0, which disables the ONNX stack for environments where models aren't available; `GpuSemaphore` accepts 0 permits by internally using 1 (never acquired in disabled mode).
- `OnnxEmbeddingService` / `OnnxRerankerService` short-circuit in disabled mode; the reranker pass-through preserves input order.
- `ApplicationBeans` exposes only concrete beans (not both the class and its interface) to avoid ambiguous autowiring.
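
The disabled-mode fixes above, combined with the F12 permit scheme, can be sketched roughly as below. Everything here is hypothetical: `GpuGate` and `PassThroughReranker` are illustrative stand-ins, not trueref's `GpuSemaphore` or `OnnxRerankerService`, and the actual cross-encoder call is deliberately left out.

```java
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical permit gate for GPU inference; a session count of 0 means "ONNX disabled". */
final class GpuGate {
    private final boolean enabled;
    private final Semaphore permits;

    GpuGate(int sessionCount) {
        if (sessionCount < 0) {
            throw new IllegalArgumentException("sessionCount must be >= 0");
        }
        this.enabled = sessionCount > 0;
        // new Semaphore(0) is legal, but every acquire() on it would block forever.
        // Use at least 1 permit; in disabled mode the gate is simply never acquired.
        this.permits = new Semaphore(Math.max(1, sessionCount), true);
    }

    boolean enabled() {
        return enabled;
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    /** Runs {@code inference} while holding one permit (one pooled session in F12's scheme). */
    <T> T withPermit(Supplier<T> inference) {
        if (!enabled) {
            throw new IllegalStateException("ONNX stack disabled (session count 0)");
        }
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            throw new IllegalStateException("interrupted while waiting for a GPU permit", e);
        }
        try {
            return inference.get();
        } finally {
            permits.release();
        }
    }
}

/** Hypothetical reranker front end showing the disabled-mode pass-through. */
final class PassThroughReranker {
    private final GpuGate gate;

    PassThroughReranker(GpuGate gate) {
        this.gate = gate;
    }

    /** Returns at most {@code n} passages; when disabled, keeps the incoming (e.g. RRF) order. */
    List<String> rerank(String query, List<String> passages, int n) {
        int limit = Math.max(0, Math.min(n, passages.size()));
        if (!gate.enabled()) {
            return List.copyOf(passages.subList(0, limit)); // short-circuit: no model call
        }
        return gate.withPermit(() -> scoreWithCrossEncoder(query, passages, limit));
    }

    // Placeholder: a real implementation would score (query, passage) pairs with the
    // cross-encoder and return the top `limit` by score. Deliberately omitted here.
    private List<String> scoreWithCrossEncoder(String query, List<String> passages, int limit) {
        throw new UnsupportedOperationException("model scoring omitted from this sketch");
    }
}
```

The fair (`true`) semaphore is a choice made for this sketch so waiting callers get permits in FIFO order; nothing in these notes says which fairness setting trueref uses.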