trueref/FINDINGS.md
moze e54e1dd33b
fix(mcp): align SDK and wire streamable server manually
- align all io.modelcontextprotocol.sdk artifacts to 0.18.1 via
  dependencyManagement so Spring AI transitives no longer pull mcp 0.10.0
- exclude Spring AI's legacy MCP server/webmvc auto-config, which is binary-
  incompatible with the 0.18.1 streamable transport APIs
- build McpSyncServer directly against WebMvcStreamableServerTransportProvider
  and adapt Spring AI ToolCallbacks to MCP SyncToolSpecifications manually
- keep /mcp as the sole Streamable HTTP endpoint for both initialize/tool calls
  and optional SSE event streams
- update MCP transport documentation to match the new runtime

Validated locally with:
- POST /mcp initialize -> HTTP 200 + Mcp-Session-Id
- POST /mcp tools/list -> returns resolve-library-id + get-library-docs
2026-05-06 03:05:22 +02:00

# trueref — Findings
Research notes backing the choices in [ARCHITECTURE.md](ARCHITECTURE.md). Each section ends with a verdict and follow-up questions if any.
---
## F1. Context7 ingestion behavior (what we replicate functionally)
- Context7 ingests git repositories and crawls associated docs sites driven by a `context7.json` manifest at the repo root, plus an optional `llms.txt` index.
- It produces snippets shaped roughly as `{ title, description, source, code, language }` and serves them via two MCP tools: `resolve-library-id` and `get-library-docs`.
- The `get-library-docs` API accepts `topic` and `tokens` parameters; topic biases retrieval, tokens caps the response size (defaults observed in client docs: ~5000).
- Source: upstash/context7 GitHub repo & MCP docs.
**Verdict:** functional parity is achievable without copying the manifest schema. Our chunk model captures the same fields under different names (`symbol`/`content`/`filePath`/`language`). MCP tool signatures are kept **byte-identical** for LLM compatibility.
---
## F2. Embedded vector store choice — Lucene 9 over Qdrant
- Qdrant is a Rust binary; embedding it in a fat JAR requires extracting & spawning a child process, contradicting the "single JAR, embedded everything" goal.
- **Apache Lucene ≥9.0** ships HNSW kNN alongside BM25 in the same index (the field class is `KnnFloatVectorField` from 9.5 onward; earlier 9.x releases name it `KnnVectorField`). Pure JVM, no native deps.
- Lucene supports **filtered kNN** (`KnnFloatVectorQuery` with a `BooleanQuery` filter), which we need for `(repoId, versionId)` scoping.
- Trade-off: Lucene HNSW lacks some Qdrant conveniences (rich payload filtering, quantization presets, named vectors). Acceptable for our scale; we get BM25 in the same store for free.
**Verdict:** Lucene 9 (we'll target the latest 9.x). One `IndexWriter`, refresh-on-search via `SearcherManager`.
---
## F3. Embedding model — bge-m3
- BAAI/bge-m3: 568M params, 8192 ctx, multilingual (100+ langs), trained on multi-functionality (dense + sparse + colbert).
- ONNX export available (BAAI provides it; community variants on HuggingFace).
- License: MIT-style (model weights), works for self-hosted commercial use.
- Vector dim: 1024 (dense). Sparse vocab compatible with Lucene if we want SPLADE-like sparse — out of scope for v1.
**Verdict:** bge-m3 (dense only for v1). Sparse channel deferred.
---
## F4. Reranker — bge-reranker-v2-m3
- Cross-encoder, scores (query, passage) pairs.
- Same family as embedder: balanced quality/cost, ONNX-exportable.
- Apache 2.0 license.
**Verdict:** bge-reranker-v2-m3. Top-K candidates from RRF fed in, top-N (default 20) returned.
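
The verdict names RRF (reciprocal rank fusion) as the step that merges lexical and vector candidates before reranking. A minimal sketch of that fusion follows, assuming the conventional constant `k = 60` and 1-based ranks. The names `RrfFusion` and `fuse` are hypothetical and are not taken from the trueref code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reciprocal rank fusion over several ranked lists of chunk ids. */
final class RrfFusion {
    private RrfFusion() {}

    /**
     * Scores each id as the sum over all lists of 1 / (k + rank), with rank starting at 1,
     * and returns the ids with the topK highest fused scores.
     */
    static List<String> fuse(List<List<String>> rankedLists, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Highest score first; ties are broken by id so the order is deterministic.
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, entries.size()); i++) {
            out.add(entries.get(i).getKey());
        }
        return out;
    }
}
```

With a BM25 list `[a, b, c]` and a kNN list `[c, a, d]`, ids that appear in both lists (`a`, `c`) outrank ids that appear in only one, which is the behavior we want before handing candidates to the cross-encoder.
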
---
## F5. ML runtime — ONNX Runtime (Java bindings)
- ONNX Runtime has **official Java bindings** (`com.microsoft.onnxruntime:onnxruntime` + `onnxruntime_gpu`).
- Execution providers we will support:
  - **CUDA** (`onnxruntime_gpu`): Linux + Windows, with an NVIDIA driver compatible with CUDA 12.x.
  - **DirectML** (`onnxruntime-directml`): Windows, any DX12 GPU.
  - **CPU**: always-on fallback.
- ONNX Runtime has **no Vulkan execution provider**. Our earlier "Vulkan fallback" wish is not satisfiable in this stack — we drop it.
- Generative LLMs in ONNX (e.g. Phi-3.5-mini) are possible but awkward (KV cache management, tokenizer differences). Since we picked **retrieval-only**, no generative model is needed.
**Verdict:** ONNX Runtime, providers tried in order: cuda → directml → cpu. Vulkan dropped (documented).
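
The "tried in order" rule can be sketched without any ONNX Runtime dependency. The example below is a generic fallback chain under stated assumptions: `ProviderFallback`, `Candidate`, and `Selection` are hypothetical names, and each initializer stands in for whatever ONNX Runtime call registers the provider and opens the session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical sketch: try execution providers in a fixed order, falling back on failure. */
final class ProviderFallback {
    private ProviderFallback() {}

    /** One candidate provider: a name plus an initializer that throws if it is unavailable. */
    record Candidate<S>(String name, Callable<S> init) {}

    /** The provider that was selected, its session handle, and why earlier ones were skipped. */
    record Selection<S>(String provider, S session, List<String> skipped) {}

    static <S> Selection<S> firstAvailable(List<Candidate<S>> candidates) {
        List<String> skipped = new ArrayList<>();
        for (Candidate<S> c : candidates) {
            try {
                return new Selection<>(c.name(), c.init().call(), List.copyOf(skipped));
            } catch (Exception | LinkageError e) {
                // A missing native library surfaces as a LinkageError, a rejected
                // provider as an ordinary exception; record the reason and move on.
                skipped.add(c.name() + ": " + e.getMessage());
            }
        }
        throw new IllegalStateException("no execution provider available: " + skipped);
    }
}
```

In the real stack the candidate list would be `cuda`, `directml`, `cpu`, with CPU last because its initializer is not expected to fail. Logging the `skipped` reasons at startup makes it clear why a GPU provider was not picked.
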
---
## F6. Java version — 21 LTS, not 25
- Spring Boot 3.5.x officially supports Java 17–23.
- Spring AI 1.0.x targets the same range.
- Java 25 is supported by neither at time of writing; risking obscure reflection/MR-JAR issues with downstream libs (JGit, Lucene, ONNX bindings).
- Java 21 is LTS and has stable virtual threads. Structured concurrency (`StructuredTaskScope`) is still a preview API in Java 21 and stayed in preview through 23, so we'll guard usage behind a thin wrapper to ease later upgrade.
**Verdict:** Java 21 LTS. Re-evaluate to 25 once Spring Boot certifies it.
---
## F7. Differential indexing scheme
- We chose **dedupe-by-content-hash** AND **git-diff-driven file skipping**.
- The hash dedupe alone gives constant-cost embeddings for unchanged code across tags.
- The git-diff path additionally avoids parsing/chunking unchanged files, which dominates ingest CPU on large repos.
- Storage model:
  - `chunks`: one row per unique `content_hash`. Vector lives in Lucene keyed by `chunkId`.
  - `chunk_versions`: many-to-many; one row per `(chunk, version, file, line range)`.
  - Search: `BooleanQuery(filter=chunk_versions.version_id IN scope)` joined to vector field.
- The chunk dedupe ratio is reported as a UI metric — it's the most intuitive measure of "differential" effectiveness.
**Verdict:** confirmed; both mechanisms compose without conflict.
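
The hash-dedupe half of the scheme can be illustrated with an in-memory stand-in for the two tables. This is a sketch under assumptions: SHA-256 as the content hash, and a dedupe ratio defined as the share of occurrences that reused an existing chunk. The findings above fix neither choice, and `ChunkDedupe` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HexFormat;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the chunks / chunk_versions tables. */
final class ChunkDedupe {
    /** One row of chunk_versions: where a chunk occurs in a given version. */
    record Occurrence(String contentHash, String versionId, String filePath, int startLine, int endLine) {}

    private final Map<String, String> chunksByHash = new LinkedHashMap<>(); // content_hash -> content
    private final List<Occurrence> occurrences = new ArrayList<>();
    private int embeddingsComputed = 0;

    static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    /** Records an occurrence; only a previously unseen hash costs an embedding. */
    void ingest(String versionId, String filePath, int startLine, int endLine, String content) {
        String hash = sha256(content);
        if (chunksByHash.putIfAbsent(hash, content) == null) {
            embeddingsComputed++; // the embedding call would go here
        }
        occurrences.add(new Occurrence(hash, versionId, filePath, startLine, endLine));
    }

    int uniqueChunks() { return chunksByHash.size(); }

    int embeddingsComputed() { return embeddingsComputed; }

    /** Share of occurrences that reused an existing chunk (one possible definition). */
    double dedupeRatio() {
        return occurrences.isEmpty() ? 0.0 : 1.0 - (double) chunksByHash.size() / occurrences.size();
    }
}
```

Ingesting two tags that share most of their files adds one `Occurrence` per chunk per tag, but the embedding counter only grows for content that has not been seen before.
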
---
## F8. MCP transport — Streamable HTTP
- The current MCP spec (revision 2025-03-26) defines **Streamable HTTP**: a single `/mcp` endpoint that takes `POST` requests and may answer with either a plain JSON response or an SSE stream for long-lived/streamed responses. It replaces the SSE transport of revision 2024-11-05, which is now deprecated.
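- Spring AI 1.0 ships an MCP server module for Spring MVC, which we expected to provide Streamable HTTP. The first assembly still ran on its legacy WebMVC SSE transport (see the historical note in F16), so transport support must be checked against the release notes listed in F15.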
- We expose **only** Streamable HTTP, no SSE-only legacy endpoint (per user spec).
**Verdict:** Streamable HTTP only at `/mcp`.
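
For reference, the client side of this transport can be sketched with the JDK's `java.net.http` types. The example only builds the requests and sends nothing. The helper name `McpRequests`, the `clientInfo` values, and the host and port in the usage note are placeholders. The header names and the `initialize` payload shape follow the 2025-03-26 spec as we understand it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client-side helper that builds Streamable HTTP requests for /mcp. */
final class McpRequests {
    private McpRequests() {}

    /** JSON-RPC initialize request; the clientInfo values are placeholders. */
    static final String INITIALIZE_BODY = """
            {"jsonrpc":"2.0","id":1,"method":"initialize","params":{
              "protocolVersion":"2025-03-26",
              "capabilities":{},
              "clientInfo":{"name":"example-client","version":"0.0.1"}}}
            """;

    static final String TOOLS_LIST_BODY = """
            {"jsonrpc":"2.0","id":2,"method":"tools/list"}
            """;

    private static HttpRequest.Builder post(URI endpoint, String body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                // The client must accept both a plain JSON reply and an SSE stream.
                .header("Accept", "application/json, text/event-stream")
                .POST(HttpRequest.BodyPublishers.ofString(body));
    }

    static HttpRequest initialize(URI endpoint) {
        return post(endpoint, INITIALIZE_BODY).build();
    }

    /** Later requests carry the session id returned by the initialize response. */
    static HttpRequest toolsList(URI endpoint, String sessionId) {
        return post(endpoint, TOOLS_LIST_BODY).header("Mcp-Session-Id", sessionId).build();
    }
}
```

A caller would send `McpRequests.initialize(URI.create("http://localhost:8080/mcp"))` through `HttpClient`, read `Mcp-Session-Id` from the response headers, send the `notifications/initialized` notification the spec requires, and then issue `tools/list` with that session id.
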
---
## F9. Embedded SQL store — H2 (MVCC)
- H2's MVStore engine is multi-versioned (MVCC) and supports concurrent readers and a single writer with row-level locking. Good enough for our metadata write rates (jobs, versions, chunk_versions).
- File-based, single JAR dependency, JDBC.
- Considered & rejected:
  - **DuckDB**: column-store, slower OLTP, no good Flyway story.
  - **SQLite**: poor concurrency under write load.
  - **Embedded Postgres (zonky)**: pulls a 100+ MB native binary per OS, which fights the fat JAR goal.
**Verdict:** H2 file-based on the default MVStore engine, with Flyway migrations. To our knowledge H2 2.x no longer accepts the legacy `MVCC=true` URL setting (MVStore is always multi-versioned), so the flag should be omitted from the JDBC URL.
---
## F10. Job orchestration — custom virtual-thread orchestrator
- Spring Batch is feature-rich but requires a JobRepository (typically Postgres or H2) and adds startup cost we don't need.
- Our jobs are **per-tag**, **simple linear stage sequences**, with persistence-of-status as the only durability requirement.
- Custom orchestrator: each `IngestionJob` runs on a virtual thread; stages execute sequentially; stage transitions are durably written to H2 in a transaction; `JobEventBus` emits events for SSE.
- Crash recovery: on startup, scan jobs in `RUNNING` status, mark them `FAILED` (or resume specific resumable stages — v2).
**Verdict:** custom orchestrator. Spring Batch deferred unless we hit a ceiling.
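
The stage sequencing and crash-recovery rules above can be sketched with stdlib types only. In this hypothetical `MiniOrchestrator`, two maps stand in for the H2 job table and the executor is injected; none of these names come from the actual code. Event emission (`JobEventBus`) is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical sketch of a linear-stage job runner with status persistence. */
final class MiniOrchestrator {
    enum Status { QUEUED, RUNNING, SUCCEEDED, FAILED }

    /** A named stage; the body throws a RuntimeException to fail the job. */
    record Stage(String name, Runnable body) {}

    // Stand-ins for the H2 job table: jobId -> status and jobId -> current stage.
    final Map<String, Status> statusStore = new ConcurrentHashMap<>();
    final Map<String, String> stageStore = new ConcurrentHashMap<>();
    private final ExecutorService executor;

    MiniOrchestrator(ExecutorService executor) { this.executor = executor; }

    Future<?> submit(String jobId, List<Stage> stages) {
        statusStore.put(jobId, Status.QUEUED);
        return executor.submit(() -> {
            statusStore.put(jobId, Status.RUNNING);
            try {
                for (Stage s : stages) {
                    stageStore.put(jobId, s.name()); // transition recorded before the stage runs
                    s.body().run();
                }
                statusStore.put(jobId, Status.SUCCEEDED);
            } catch (RuntimeException e) {
                statusStore.put(jobId, Status.FAILED);
            }
        });
    }

    /** Blocks until the job's task has finished and returns its recorded status. */
    Status await(String jobId, Future<?> task) {
        try {
            task.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return statusStore.get(jobId);
    }

    /** Startup recovery: any job still RUNNING was interrupted by a crash. */
    int recoverAfterCrash() {
        int recovered = 0;
        for (Map.Entry<String, Status> e : statusStore.entrySet()) {
            if (e.getValue() == Status.RUNNING) {
                e.setValue(Status.FAILED);
                recovered++;
            }
        }
        return recovered;
    }
}
```

On Java 21, passing `Executors.newVirtualThreadPerTaskExecutor()` gives each job its own virtual thread, which matches the design described above. In the real implementation each `put` would be a transactional write to H2.
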
---
## F11. Code parser — pure-Java heuristic for v1, tree-sitter pluggable for v2
The Java tree-sitter ecosystem in 2026 is fragmented:
- **`io.github.tree-sitter:jtreesitter`** uses Project Panama FFI → requires **Java 22+**. We target Java 21 LTS, so this is out.
- **`io.github.bonede:tree-sitter`** is JNI-based and works on Java 21, but bundling per-OS (linux/windows/mac × x64/arm64) native grammar binaries for many languages bloats the fat JAR significantly and creates a packaging matrix we don't want to maintain in v1.
- **`ai.serenade.treesitter:java-tree-sitter`** is unmaintained.
**Decision (v1):** ship a pure-Java heuristic `CodeParser` adapter. Strategies, tried in order per file:
1. **Markdown / `.txt` / `.rst`**: split by ATX/Setext headings; large sections further split by paragraph.
2. **Brace-balanced languages** (java, c, c++, c#, go, rust, js, ts, kotlin, scala, swift): walk the file tracking brace depth + line-based heuristics (function signatures, top-level declarations) to extract chunks of complete top-level constructs. Symbol name extracted via a tiny regex per language.
3. **Indent-based languages** (python, yaml, ruby): split on top-level `def`/`class`/`module` boundaries; symbol name from the declaration line.
4. **Fallback** (any text file): sliding-window of N lines (default 80) with M lines overlap (default 10).
The `CodeParser` port is unchanged. A future tree-sitter implementation (when JDK upgrade or upstream packaging matures) can be swapped in by providing an alternate `@Component` and toggling a config flag — that's exactly what hexagonal architecture buys us.
**Verdict:** pure-Java heuristic parser for v1; tree-sitter remains a documented future enhancement.
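
Strategy 4, the sliding-window fallback, is the simplest of the four and can be sketched directly. The defaults of 80 lines with 10 lines of overlap come from the list above. The class and record names are hypothetical, and line numbers are reported 1-based and inclusive as an assumption about how chunk ranges are stored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the fallback chunker: fixed-size line windows with overlap. */
final class SlidingWindowChunker {
    private SlidingWindowChunker() {}

    /** A chunk with a 1-based, inclusive line range. */
    record Chunk(int startLine, int endLine, String content) {}

    static List<Chunk> chunk(List<String> lines, int window, int overlap) {
        if (window <= 0 || overlap < 0 || overlap >= window) {
            throw new IllegalArgumentException("need window > overlap >= 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        int step = window - overlap; // each window starts this many lines after the previous one
        for (int start = 0; start < lines.size(); start += step) {
            int end = Math.min(start + window, lines.size());
            chunks.add(new Chunk(start + 1, end, String.join("\n", lines.subList(start, end))));
            if (end == lines.size()) {
                break; // the last window reached the end of the file
            }
        }
        return chunks;
    }
}
```

For a 200-line file with the defaults, this yields three chunks covering lines 1 to 80, 71 to 150, and 141 to 200, so every boundary is shared by two chunks.
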
---
## F12. Concurrency caps & GPU contention
- User chose **unbounded virtual threads**. This is safe for I/O-bound stages.
- ONNX inference is GPU-bound. ONNX Runtime documents concurrent `run` calls on one session as thread-safe, but unbounded calls from many virtual threads would oversubscribe GPU memory and compute. Two mitigations:
  1. A **session pool** of size N (config `embedding.session-count`, default 2).
  2. A **`Semaphore(N)`** acquired by any caller before invoking inference. Pool & semaphore sizes match.
- This means tag-level parallelism is naturally throttled by GPU capacity without explicit per-tag limits.
**Verdict:** session pool + semaphore. Document the knob clearly in `application.yml`.
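
The pool-plus-semaphore arrangement can be sketched with plain `java.util.concurrent` types. In this hypothetical `BoundedSessionPool`, the type parameter `S` stands in for the inference session type, and `acquireUninterruptibly` is used only to keep the example free of checked exceptions. A production version would probably prefer interruptible acquisition.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical sketch: N sessions guarded by a semaphore with N permits. */
final class BoundedSessionPool<S> {
    private final BlockingQueue<S> idle;
    private final Semaphore permits;

    BoundedSessionPool(List<S> sessions) {
        if (sessions.isEmpty()) {
            throw new IllegalArgumentException("need at least one session");
        }
        this.idle = new ArrayBlockingQueue<>(sessions.size(), false, sessions);
        this.permits = new Semaphore(sessions.size()); // pool and semaphore sizes match
    }

    /** Runs one inference call with exclusive use of a pooled session. */
    <R> R withSession(Function<S, R> inference) {
        permits.acquireUninterruptibly(); // callers beyond N wait here
        try {
            S session = idle.poll(); // non-null: holding a permit guarantees an idle session
            try {
                return inference.apply(session);
            } finally {
                idle.add(session); // always return the session, even if inference throws
            }
        } finally {
            permits.release();
        }
    }
}
```

Because every caller must hold a permit before touching a session, any number of ingestion threads can call `withSession` while at most N inference calls are in flight.
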
---
## F13. Frontend in fat JAR
- SvelteKit `@sveltejs/adapter-static` produces a fully static bundle (HTML/CSS/JS). We build it as a Maven sub-step (frontend-maven-plugin) and copy `frontend/build/` to `bootstrap/src/main/resources/static/`. Spring serves it by default.
- SPA fallback: a `WebMvcConfigurer` maps all unmatched non-API paths to `index.html` so client-side routing works.
**Verdict:** static adapter + Spring static-resource serving. Single artifact preserved.
---
## F14. Open questions / future work
1. **Sparse channel** (bge-m3 sparse / SPLADE) for stronger lexical recall — deferred to v2.
2. **Per-language reranker fine-tuning** — out of scope (no fine-tuning, per spec).
3. **Compaction job** to truly delete orphan chunks (currently soft-delete on versions). Schedule TBD.
4. **Watched-folder** auto-discovery semantics: how often do we rescan `./data/watched/`? Default proposal: every 5 min + on filesystem watch event (Java NIO `WatchService`).
5. **Repo size cap**: do we need a maximum total cloned size to prevent runaway disk use? Currently unlimited; could add per-repo and global caps in v2.
6. **GPU memory introspection**: Linux NVML via JNI (`jnvml`) for GPU mem gauges; on Windows + DirectML we surface only "available/in-use" booleans.
---
## F15. References (for re-checking when libraries bump)
- Context7 repo & MCP tool surface — to sanity-check schema fidelity on releases.
- Spring AI 1.0.x release notes — verify MCP server Streamable HTTP module name & API.
- Spring Boot 3.5.x release notes — confirm Java version compatibility window.
- Lucene 9.x kNN docs — confirm filtered vector query API surface.
- ONNX Runtime Java release notes — confirm CUDA/DirectML EP availability per version.
- BAAI/bge-m3 model card — confirm ONNX export availability/format.
- MCP spec 2025-03-26 — Streamable HTTP transport requirements.
> Use the Context7 MCP lookup skill before bumping any of the above to fetch fresh, version-specific docs.
---
## F16. Smoke-test log (2026-04-21)
End-to-end smoke after first assembly:
- `mvn -pl trueref-bootstrap -am package` → BUILD SUCCESS, fat JAR ~582 MB.
- `mvn test` → **16 tests pass** (parser 6, pooling 5, disk cache 5), **0 failures**.
- `java -jar trueref-bootstrap/target/trueref.jar --trueref.embedding.session-count=0` — started in 3.6 s.
- `GET /actuator/health` → `UP` (db H2, disk, ping, ssl).
- `POST /api/repos` + `GET /api/repos` — round-trips a repo.
- `GET /swagger-ui.html` → 302 redirect (to `/swagger-ui/index.html`), `GET /v3/api-docs` → 200.
- `GET /` → 200 (SvelteKit SPA served from Spring static resources).
- Historical note: at this point the server still used the legacy WebMVC SSE transport, so `POST /mcp` without an established `GET /sse` session returned HTTP 500. This was later replaced by the Streamable HTTP transport on `GET`/`POST /mcp`.
Fixes landed during smoke:
- `V1__init_schema.sql`: H2 in PostgreSQL mode rejects `AUTO_INCREMENT`. Switched `job_log_events.id` to `BIGINT GENERATED BY DEFAULT AS IDENTITY` and removed the explicit `NULL` constraint.
- `OnnxProperties.sessionCount` can now be 0 (disables the ONNX stack, for environments where models aren't available); `GpuSemaphore` accepts 0 permits by internally using 1 (never acquired in disabled mode).
- `OnnxEmbeddingService` / `OnnxRerankerService` short-circuit in disabled mode; reranker pass-through preserves input order.
- `ApplicationBeans` exposes only concrete beans (not both the class and its interface) to avoid ambiguous autowiring.