# trueref — Architecture > Self-hosted, fat-JAR, Java-21 clone of [Context7](https://github.com/upstash/context7) ingestion + retrieval, with first-class differential per-tag indexing, embedded vector + BM25 store, ONNX-accelerated embeddings/rerank, Streamable-HTTP MCP server, REST + OpenAPI, and a SvelteKit UI. ## 1. Goals & Non-Goals ### Goals - **Functional parity with Context7** ingestion outcome (own chunk schema). - **Differential per-tag indexing**: every git tag of every registered repo is independently queryable. - **Embedded everything**: single fat JAR runnable on a workstation/server. No external Postgres/Qdrant. - **GPU-accelerated retrieval** via ONNX Runtime (CUDA Linux/Win, DirectML Win, CPU fallback). - **MCP Streamable-HTTP server** exposing exactly two tools: `resolve-library-id`, `get-library-docs` — drop-in for any MCP client. - **Full observability** of ingestion pipelines surfaced in the UI (live progress, log tail, history, timings, resource usage). - **REST + OpenAPI/Swagger** for programmatic and UI use. - **SvelteKit UI** for repo registration, indexing control, monitoring, and ad-hoc query. - **Hexagonal architecture** so vector store, embedder, parser, persistence, etc. are swappable. ### Non-Goals - No public hosted SaaS — self-host only. - No model fine-tuning. - No mobile app. - No generative LLM in the pipeline (retrieval-only, like Context7). - No multi-tenancy / auth (LAN-only deployment). --- ## 2. Tech Stack (locked) | Concern | Choice | Rationale | |---|---|---| | Language / runtime | **Java 21 LTS** | Virtual threads stable; Spring Boot 3.5 supported. (Java 25 dropped — Boot 3.5 supports up to 23.) | | Framework | **Spring Boot 3.5.x** + **Spring AI 1.0.x** | Web MVC + virtual-thread executor; Spring AI for embedding/MCP abstractions. | | Build | **Maven** | Stable, ubiquitous, Spring-Boot first-class. | | Metadata store | **H2 (MVCC mode, file-based)** + Flyway | Zero ops, JDBC, MVCC concurrency, fits fat JAR. | | Vector + lexical store | **Apache Lucene 9.x** | Pure JVM. BM25 + HNSW kNN in one index. Collapses two stores. | | Embedding model | **BAAI/bge-m3** (ONNX) | Multilingual, 8k context, dense+sparse capable. MIT-like license. | | Reranker | **BAAI/bge-reranker-v2-m3** (ONNX) | Cross-encoder, Apache 2.0. | | ML runtime | **ONNX Runtime** (`onnxruntime_gpu` Linux CUDA / `onnxruntime-directml` Win / `onnxruntime` CPU) | In-JVM via official Java bindings. | | Git | **JGit** | Pure Java; clone, fetch, tag enumeration, diff. | | Code parsing | **Pure-Java heuristic chunker** (markdown-aware, brace-balanced for C-family, indent-based for Python, sliding-window fallback) | No native deps; preserves fat-JAR purity. Tree-sitter is a documented future swap (see FINDINGS §F11). | | Job orchestration | **Custom virtual-thread orchestrator** + H2-backed durable state | Fast, no Spring Batch overhead. | | MCP server | **Spring AI MCP Server (Streamable HTTP)** | Spec 2025-03-26, single `/mcp` endpoint. | | REST docs | **springdoc-openapi** | OpenAPI 3 + Swagger UI auto-generated. | | Observability | **Micrometer + OpenTelemetry**, exposed via REST/SSE for UI. **Prometheus + Grafana optional** via `/actuator/prometheus`. | UI-first; Prom/Graf attach later. | | Frontend | **SvelteKit + `@sveltejs/adapter-static`** | Built into `bootstrap/src/main/resources/static/`, served by Spring as part of fat JAR. | | Packaging | **Single fat JAR** via `spring-boot-maven-plugin` | One artifact, embedded everything. | --- ## 3. Hexagonal Layout (Maven multi-module) Direction of dependencies is enforced by Maven coordinates alone — no ArchUnit needed. ``` trueref-parent/ (pom; BOM + plugin management) ├── trueref-domain pure Java; records, sealed types, port interfaces. ZERO deps. ├── trueref-application use-case impls; depends on: domain ├── trueref-adapters ALL adapters live here; depends on: domain, application │ └── com.trueref.adapter │ ├── in │ │ ├── rest @RestController + DTOs + OpenAPI + SSE │ │ └── mcp MCP tool defs (Spring AI MCP server) │ └── out │ ├── persistence.h2 JdbcClient + Flyway, RepositoryStore impl │ ├── vectorstore.lucene Lucene BM25 + HNSW kNN, ChunkStore impl │ ├── embedding.onnx ONNX bge-m3 + bge-reranker-v2-m3 │ ├── git.jgit GitClient impl │ ├── parsing.treesitter CodeParser impl │ └── cache.disk EmbeddingCache (file-per-hash) ├── trueref-frontend SvelteKit; built via frontend-maven-plugin into static jar └── trueref-bootstrap @SpringBootApplication; wires beans; produces fat JAR depends on: domain, application, adapters, frontend ``` **Dependency rule (Maven-enforced):** - `domain` → nothing. - `application` → `domain`. - `adapters` → `domain` + `application`. - `frontend` → none (resource-only jar). - `bootstrap` → all of the above (the only place wiring lives). > All packages live under `com.trueref.*` regardless of module. Module boundaries enforce dependency direction; package layout inside `adapters` mirrors the in/out hexagonal convention. --- ## 4. Core Domain Model ``` Repository { id: UUID name: String // "spring-projects/spring-boot" remoteUrl: String? // null if local-only localPath: Path // either user-provided or our managed clone dir managedClone: bool // true if WE clone/fetch ignoreGlobs: List // per-repo overrides maxFileSizeBytes: long // default 1MB pollIntervalSec: long // default 3600; 0 disables polling versionMappingRules: List // exact, v-prefix, release-prefix, regex createdAt, updatedAt } Version { id: UUID repoId: UUID tag: String // "v1.2.3" or branch name commitSha: String status: enum { DISCOVERED, INDEXING, INDEXED, FAILED, INACTIVE } indexedAt: Instant? chunkCount: int errorMessage: String? } Chunk { // global, deduplicated by content_hash id: UUID contentHash: String // sha256 of canonicalized content content: String // the snippet text language: String // "java", "python", "markdown", ... symbol: String? // function/class name if AST-extracted tokenCount: int // dense + sparse vectors stored in Lucene index, not here } ChunkVersion { // many-to-many: which versions contain which chunks chunkId: UUID versionId: UUID filePath: String startLine: int endLine: int // PK (chunkId, versionId, filePath, startLine) } IngestionJob { id: UUID repoId: UUID versionId: UUID? // null = repo-level (e.g. discovery) type: enum { DISCOVER_TAGS, INDEX_VERSION, COMPACT, REFRESH } status: enum { QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED } startedAt, finishedAt stages: List } JobStage { jobId: UUID name: enum { CLONE, FETCH, CHECKOUT, DISCOVER_FILES, PARSE, CHUNK, EMBED, INDEX, COMMIT } status: enum { PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED } startedAt, finishedAt itemsProcessed: long itemsTotal: long bytesProcessed: long errorMessage: String? } JobLogEvent { // ring-buffered + persisted; streamed via SSE jobId: UUID ts: Instant level: enum { DEBUG, INFO, WARN, ERROR } stage: JobStage.name? message: String } ``` --- ## 5. Ingestion Pipeline ``` ┌────────────────────────────────────────────────────────┐ │ IngestionOrchestrator (virtual-thread per stage) │ └────────────────────────────────────────────────────────┘ │ ┌──────────────────────────┼──────────────────────────────────────────┐ ▼ ▼ ▼ [CLONE/FETCH] [DISCOVER_TAGS] [INDEX_VERSION job] JGit pull/clone git tag list ∩ (per (repo,tag)) version mapping rules │ ┌────────────────────────────┤ ▼ ▼ [CHECKOUT worktree] (parallel tags up to N) │ ▼ [DISCOVER_FILES] respect .gitignore + defaults + per-repo globs + max file size │ ▼ [GIT_DIFF vs prev indexed tag] → if exists, only changed files reach PARSE │ ▼ [PARSE] heuristic chunker (markdown sections; brace-balanced; indent-based; sliding-window fallback) │ ▼ [CHUNK] AST-aware splits + sliding-window fallback │ ▼ [HASH + DEDUPE] content_hash lookup → existing chunkId reused │ ▼ [EMBED] ONNX bge-m3 NEW chunks only (GPU semaphore-gated batch) │ ▼ [INDEX] Lucene upsert: - chunk doc with vector - chunk_version doc │ ▼ [COMMIT] Lucene commit + H2 transaction │ ▼ Version.status = INDEXED ``` ### Key invariants 1. **Embeddings are computed at most once per `content_hash`.** Persistent disk cache keyed by hash → vector bytes. 2. **A tag's chunks = union of (a) reused chunks via hash and (b) newly-embedded chunks.** This makes re-indexing a near-identical tag almost free. 3. **Git-diff fast path:** if a tag's parent (nearest previously indexed tag in semver order) exists, only files changed in `git diff parent..tag` are re-parsed. Unchanged files contribute their parent's chunk_versions verbatim with new line offsets adjusted by diff (or fully re-parsed if rename detection is ambiguous). 4. **Per-stage virtual-thread pools.** Threads themselves are unbounded (per user spec), but a **GPU semaphore** (default `permits = ortSessionCount`) gates ONNX inference to avoid GPU OOM. Lucene writer is single-thread (its own queue). --- ## 6. Search Pipeline ``` query ─► [Query Rewrite] rule-based: lowercase, dedupe stop tokens, │ optional library-id-aware expansion ▼ [BM25 search] [Dense kNN search] Lucene similarity Lucene HNSW (bge-m3 dense) │ │ └─────────────► [RRF fusion] ◄──────┘ │ ▼ top-K candidates (default 50) │ ▼ [Cross-encoder rerank] ONNX bge-reranker-v2-m3 (GPU semaphore) │ ▼ [Token-budget assemble] pack snippets up to `tokens` param (default 5000, min 500, max 50000) │ ▼ ranked snippets w/ citations (file path, repo, tag, lines) ``` All searches are **scoped** to `(repoId, versionId)` filter clauses on the Lucene index using `chunk_versions` join semantics. --- ## 7. MCP Server (Streamable HTTP) - Single endpoint: `POST /mcp` (JSON-RPC over HTTP) with optional SSE upgrade per request, per [MCP 2025-03-26 spec](https://spec.modelcontextprotocol.io/specification/2025-03-26/basic/transports/). - **Two tools, exactly matching Context7 schema:** ### `resolve-library-id` ```json { "name": "resolve-library-id", "description": "Resolves a library/package name to a trueref-compatible library ID...", "inputSchema": { "type": "object", "required": ["libraryName"], "properties": { "libraryName": { "type": "string" }, "query": { "type": "string", "description": "optional, ranks results by relevance" } } } } ``` Returns ranked candidate library IDs (`/{owner}/{repo}` style) with metadata (description, snippet count, available versions, source reputation). ### `get-library-docs` ```json { "name": "get-library-docs", "inputSchema": { "type": "object", "required": ["libraryId"], "properties": { "libraryId": { "type": "string", "description": "/org/project[/version]" }, "topic": { "type": "string" }, "tokens": { "type": "integer", "minimum": 500, "maximum": 50000, "default": 5000 } } } } ``` ### On-demand indexing flow - If `libraryId` includes a version that maps to a known git tag but is **not yet indexed**: 1. Enqueue `INDEX_VERSION` job immediately. 2. Return a **partial** response built from the **nearest indexed tag** (semver-closest) plus a status block: `{ "indexing": { "status": "in_progress", "version": "1.2.3", "retryAfterSec": 30 } }`. - If version maps to **no** tag: return error `version_not_found` with the list of candidate tags discovered. --- ## 8. REST API Surface | Method | Path | Purpose | |---|---|---| | GET | `/api/repos` | List registered repos | | POST | `/api/repos` | Register (local path or remote URL) | | GET | `/api/repos/{id}` | Repo detail + version summary | | DELETE | `/api/repos/{id}` | Unregister + soft-delete versions | | POST | `/api/repos/{id}/discover` | Force tag discovery | | GET | `/api/repos/{id}/versions` | All known versions + status | | POST | `/api/repos/{id}/versions/{tag}/index` | Index a specific tag | | POST | `/api/repos/{id}/versions/{tag}/reindex` | Force re-index | | GET | `/api/jobs` | List jobs (filter by repo/version/status) | | GET | `/api/jobs/{id}` | Job detail with stages | | GET | `/api/jobs/{id}/log` (SSE) | Live log stream | | GET | `/api/jobs/stream` (SSE) | Live job-status events for the dashboard | | POST | `/api/search` | Hybrid search across one or more (repo, version) scopes | | GET | `/api/resolve?q=react` | Library-ID resolution preview | | GET | `/api/observability/metrics` | UI-friendly aggregated metrics JSON | | GET | `/api/observability/resources` | Heap, GPU mem (via NVML when present), index size | | GET | `/swagger-ui/index.html` | Swagger UI | | GET | `/v3/api-docs` | OpenAPI JSON | | ANY | `/mcp` | MCP Streamable HTTP endpoint | | GET | `/actuator/prometheus` | Prometheus scrape (optional) | | GET | `/**` | SPA fallback to `index.html` | --- ## 9. Concurrency & Performance - **Virtual threads everywhere** for I/O (HTTP, JGit, file I/O, Lucene reads). - **`Tomcat` configured with virtual-thread executor** (`spring.threads.virtual.enabled=true`). - **Per-stage logical pools** are unbounded virtual-thread executors per orchestrator instance. - **GPU access gated by a `Semaphore`** with permits = number of ONNX sessions (configurable, default = 2). - **Lucene writer**: single `IndexWriter` instance protected by a queue; readers use a refresh-on-search `SearcherManager`. - **Embedding cache**: file-per-hash on disk under `data/embedding-cache/`; hot LRU in memory. - **Tag concurrency**: not capped (per spec), but each tag job awaits the GPU semaphore — natural backpressure. --- ## 10. Observability - **Metrics** via Micrometer (`MeterRegistry`): - Counters: chunks_embedded, chunks_reused, files_skipped, jobs_succeeded/failed. - Timers: stage durations per stage name. - Gauges: active_jobs, gpu_semaphore_available, lucene_index_size_bytes, heap_used. - **OpenTelemetry traces** for every job (one trace per `IngestionJob`, span per `JobStage`). - **JobEventBus**: in-process pub/sub. SSE controllers subscribe and push events to UI. - **UI dashboards** (no Grafana required): - "Live" tab: progress bars per running (repo, tag), per-stage throughput, log tail. - "History" tab: paginated jobs table. - "Stats" tab: per-stage timing histograms, chunk counts per repo/version, chunk dedupe ratio. - "Resources" tab: heap, GPU memory (NVML where available), index size on disk. - **Prometheus** scraping is opt-in (Actuator endpoint). --- ## 11. Storage Layout (on disk) ``` $TRUEREF_HOME/ # default: ./data ├── h2/ # H2 database files ├── lucene/ # single index dir; one Lucene writer ├── repos/ # managed clones (when managedClone=true) │ └── /... ├── embedding-cache/ # one file per content_hash → fp16 vector bytes ├── models/ # ONNX model files (auto-downloaded on first run) │ ├── bge-m3/ │ └── bge-reranker-v2-m3/ └── logs/ ``` --- ## 12. Configuration (excerpt) ```yaml trueref: home: ${TRUEREF_HOME:./data} ingestion: poll-interval-default: 1h tag-cap-default: 100 # most-recent N tags by semver/date max-file-size-bytes-default: 1048576 embedding: model: bge-m3 onnx-providers: [cuda, directml, cpu] # tried in order session-count: 2 # = GPU semaphore permits batch-size: 32 reranker: model: bge-reranker-v2-m3 top-k: 50 search: rrf-k: 60 final-top-k: 20 mcp: tokens-default: 5000 tokens-min: 500 tokens-max: 50000 spring: threads.virtual.enabled: true ``` --- ## 13. Out-of-the-box behaviors locked from clarifications - **Auth**: none (LAN-only) on REST and MCP. - **Tag selection**: default cap 100 most-recent; on-demand index of any tag via UI search OR via MCP when an unindexed version is requested. - **Differential indexing**: dedupe by `content_hash` AND skip unchanged files via `git diff parent..tag`. - **Repo input**: UI-add (local path or remote URL) AND watched folder `./data/watched/` for bare repos. - **Re-index trigger**: on-demand + scheduled `git fetch` poll (default 1h per repo). - **Stale tag cleanup**: soft delete via `Version.status=INACTIVE`; compaction job reclaims orphan chunks. - **Embedding cache**: persistent on disk, keyed by `content_hash`. - **Concurrency**: unbounded virtual threads, GPU semaphore-gated. See [FINDINGS.md](FINDINGS.md) for research backing each choice.