trueref/ARCHITECTURE.md
moze c5f950c2c0
Initial commit: trueref v0.1.0-SNAPSHOT
Java 21 / Spring Boot 3.5.3 multi-module Maven project.
Hybrid BM25+HNSW search with RRF, cross-encoder reranker,
ONNX Runtime 1.22.0 (CPU + CUDA 12 GPU variants).
2026-05-06 00:49:16 +02:00

trueref — Architecture

Self-hosted, fat-JAR, Java-21 clone of Context7 ingestion + retrieval, with first-class differential per-tag indexing, embedded vector + BM25 store, ONNX-accelerated embeddings/rerank, Streamable-HTTP MCP server, REST + OpenAPI, and a SvelteKit UI.

1. Goals & Non-Goals

Goals

  • Functional parity with Context7 ingestion outcome (own chunk schema).
  • Differential per-tag indexing: every git tag of every registered repo is independently queryable.
  • Embedded everything: single fat JAR runnable on a workstation/server. No external Postgres/Qdrant.
  • GPU-accelerated retrieval via ONNX Runtime (CUDA Linux/Win, DirectML Win, CPU fallback).
  • MCP Streamable-HTTP server exposing exactly two tools: resolve-library-id, get-library-docs — drop-in for any MCP client.
  • Full observability of ingestion pipelines surfaced in the UI (live progress, log tail, history, timings, resource usage).
  • REST + OpenAPI/Swagger for programmatic and UI use.
  • SvelteKit UI for repo registration, indexing control, monitoring, and ad-hoc query.
  • Hexagonal architecture so vector store, embedder, parser, persistence, etc. are swappable.

Non-Goals

  • No public hosted SaaS — self-host only.
  • No model fine-tuning.
  • No mobile app.
  • No generative LLM in the pipeline (retrieval-only, like Context7).
  • No multi-tenancy / auth (LAN-only deployment).

2. Tech Stack (locked)

| Concern | Choice | Rationale |
|---|---|---|
| Language / runtime | Java 21 LTS | Virtual threads stable; Spring Boot 3.5 supported. (Java 25 dropped: Boot 3.5 supports up to 23.) |
| Framework | Spring Boot 3.5.x + Spring AI 1.0.x | Web MVC + virtual-thread executor; Spring AI for embedding/MCP abstractions. |
| Build | Maven | Stable, ubiquitous, Spring-Boot first-class. |
| Metadata store | H2 (MVCC mode, file-based) + Flyway | Zero ops, JDBC, MVCC concurrency, fits fat JAR. |
| Vector + lexical store | Apache Lucene 9.x | Pure JVM. BM25 + HNSW kNN in one index. Collapses two stores. |
| Embedding model | BAAI/bge-m3 (ONNX) | Multilingual, 8k context, dense+sparse capable. MIT license. |
| Reranker | BAAI/bge-reranker-v2-m3 (ONNX) | Cross-encoder, Apache 2.0. |
| ML runtime | ONNX Runtime (onnxruntime_gpu Linux CUDA / onnxruntime-directml Win / onnxruntime CPU) | In-JVM via official Java bindings. |
| Git | JGit | Pure Java; clone, fetch, tag enumeration, diff. |
| Code parsing | Pure-Java heuristic chunker (markdown-aware, brace-balanced for C-family, indent-based for Python, sliding-window fallback) | No native deps; preserves fat-JAR purity. Tree-sitter is a documented future swap (see FINDINGS §F11). |
| Job orchestration | Custom virtual-thread orchestrator + H2-backed durable state | Fast, no Spring Batch overhead. |
| MCP server | Spring AI MCP Server (Streamable HTTP) | Spec 2025-03-26, single /mcp endpoint. |
| REST docs | springdoc-openapi | OpenAPI 3 + Swagger UI auto-generated. |
| Observability | Micrometer + OpenTelemetry, exposed via REST/SSE for UI. Prometheus + Grafana optional via /actuator/prometheus. | UI-first; Prom/Graf attach later. |
| Frontend | SvelteKit + @sveltejs/adapter-static | Built into bootstrap/src/main/resources/static/, served by Spring as part of fat JAR. |
| Packaging | Single fat JAR via spring-boot-maven-plugin | One artifact, embedded everything. |

3. Hexagonal Layout (Maven multi-module)

Direction of dependencies is enforced by Maven coordinates alone — no ArchUnit needed.

trueref-parent/                  (pom; BOM + plugin management)
├── trueref-domain               pure Java; records, sealed types, port interfaces. ZERO deps.
├── trueref-application          use-case impls; depends on: domain
├── trueref-adapters             ALL adapters live here; depends on: domain, application
│   └── com.trueref.adapter
│       ├── in
│       │   ├── rest             @RestController + DTOs + OpenAPI + SSE
│       │   └── mcp              MCP tool defs (Spring AI MCP server)
│       └── out
│           ├── persistence.h2   JdbcClient + Flyway, RepositoryStore impl
│           ├── vectorstore.lucene  Lucene BM25 + HNSW kNN, ChunkStore impl
│           ├── embedding.onnx   ONNX bge-m3 + bge-reranker-v2-m3
│           ├── git.jgit         GitClient impl
│           ├── parsing.treesitter  CodeParser impl
│           └── cache.disk       EmbeddingCache (file-per-hash)
├── trueref-frontend             SvelteKit; built via frontend-maven-plugin into static jar
└── trueref-bootstrap            @SpringBootApplication; wires beans; produces fat JAR
                                 depends on: domain, application, adapters, frontend

Dependency rule (Maven-enforced):

  • domain → nothing.
  • application → domain.
  • adapters → domain + application.
  • frontend → none (resource-only jar).
  • bootstrap → all of the above (the only place wiring lives).

All packages live under com.trueref.* regardless of module. Module boundaries enforce dependency direction; package layout inside adapters mirrors the in/out hexagonal convention.


4. Core Domain Model

Repository {
  id: UUID
  name: String                 // "spring-projects/spring-boot"
  remoteUrl: String?           // null if local-only
  localPath: Path              // either user-provided or our managed clone dir
  managedClone: bool           // true if WE clone/fetch
  ignoreGlobs: List<String>    // per-repo overrides
  maxFileSizeBytes: long       // default 1MB
  pollIntervalSec: long        // default 3600; 0 disables polling
  versionMappingRules: List<TagPattern>  // exact, v-prefix, release-prefix, regex
  createdAt, updatedAt
}

Version {
  id: UUID
  repoId: UUID
  tag: String                  // "v1.2.3" or branch name
  commitSha: String
  status: enum { DISCOVERED, INDEXING, INDEXED, FAILED, INACTIVE }
  indexedAt: Instant?
  chunkCount: int
  errorMessage: String?
}

Chunk {                        // global, deduplicated by content_hash
  id: UUID
  contentHash: String          // sha256 of canonicalized content
  content: String              // the snippet text
  language: String             // "java", "python", "markdown", ...
  symbol: String?              // function/class name if AST-extracted
  tokenCount: int
  // dense + sparse vectors stored in Lucene index, not here
}

ChunkVersion {                 // many-to-many: which versions contain which chunks
  chunkId: UUID
  versionId: UUID
  filePath: String
  startLine: int
  endLine: int
  // PK (chunkId, versionId, filePath, startLine)
}

IngestionJob {
  id: UUID
  repoId: UUID
  versionId: UUID?             // null = repo-level (e.g. discovery)
  type: enum { DISCOVER_TAGS, INDEX_VERSION, COMPACT, REFRESH }
  status: enum { QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED }
  startedAt, finishedAt
  stages: List<JobStage>
}

JobStage {
  jobId: UUID
  name: enum { CLONE, FETCH, CHECKOUT, DISCOVER_FILES, PARSE, CHUNK, EMBED, INDEX, COMMIT }
  status: enum { PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED }
  startedAt, finishedAt
  itemsProcessed: long
  itemsTotal: long
  bytesProcessed: long
  errorMessage: String?
}

JobLogEvent {                  // ring-buffered + persisted; streamed via SSE
  jobId: UUID
  ts: Instant
  level: enum { DEBUG, INFO, WARN, ERROR }
  stage: JobStage.name?
  message: String
}
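
The four rule kinds listed for Repository.versionMappingRules (exact, v-prefix, release-prefix, regex) could be modelled with the records and sealed types the domain module is meant to contain. This is a minimal sketch, not the project's actual code: the names `versionOf`, `Prefixed`, and `resolve`, the "first matching rule wins" policy, and the use of the first regex capture group are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical shape for Repository.versionMappingRules; all names are illustrative.
sealed interface TagPattern {

    /** Returns the version a tag maps to, or empty if this rule does not apply. */
    Optional<String> versionOf(String tag);

    /** Tag and version are the same string, e.g. "1.2.3". Matches any non-empty tag. */
    record Exact() implements TagPattern {
        public Optional<String> versionOf(String tag) {
            return tag.isEmpty() ? Optional.empty() : Optional.of(tag);
        }
    }

    /** Covers both the v-prefix ("v1.2.3") and release-prefix ("release-1.2.3") cases. */
    record Prefixed(String prefix) implements TagPattern {
        public Optional<String> versionOf(String tag) {
            return tag.startsWith(prefix) && tag.length() > prefix.length()
                    ? Optional.of(tag.substring(prefix.length()))
                    : Optional.empty();
        }
    }

    /** Regex rule; the first capture group is taken as the version (an assumption). */
    record Regex(Pattern pattern) implements TagPattern {
        public Optional<String> versionOf(String tag) {
            Matcher m = pattern.matcher(tag);
            return m.matches() && m.groupCount() >= 1
                    ? Optional.ofNullable(m.group(1))
                    : Optional.empty();
        }
    }

    /** Applies the rules in order; the first rule that matches wins. */
    static Optional<String> resolve(List<TagPattern> rules, String tag) {
        return rules.stream()
                .map(rule -> rule.versionOf(tag))
                .flatMap(Optional::stream)
                .findFirst();
    }
}
```

Because `Exact` accepts any non-empty tag, it would normally sit last in the rule list so that the more specific rules get a chance first.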

5. Ingestion Pipeline

                ┌────────────────────────────────────────────────────────┐
                │  IngestionOrchestrator  (virtual-thread per stage)     │
                └────────────────────────────────────────────────────────┘
                              │
   ┌──────────────────────────┼──────────────────────────────────────────┐
   ▼                          ▼                                          ▼
[CLONE/FETCH]          [DISCOVER_TAGS]                          [INDEX_VERSION job]
 JGit pull/clone        git tag list ∩                            (per (repo,tag))
                        version mapping
                        rules
                                                                         │
                                            ┌────────────────────────────┤
                                            ▼                            ▼
                                      [CHECKOUT worktree]          (parallel tags up to N)
                                            │
                                            ▼
                                      [DISCOVER_FILES]
                                       respect .gitignore +
                                       defaults + per-repo globs +
                                       max file size
                                            │
                                            ▼
                                      [GIT_DIFF vs prev indexed tag]
                                       → if exists, only changed
                                          files reach PARSE
                                            │
                                            ▼
                                      [PARSE]  heuristic chunker
                                       (markdown sections; brace-balanced;
                                       indent-based; sliding-window fallback)
                                            │
                                            ▼
                                      [CHUNK]  AST-aware splits +
                                       sliding-window fallback
                                            │
                                            ▼
                                      [HASH + DEDUPE]
                                       content_hash lookup → existing
                                          chunkId reused
                                            │
                                            ▼
                                      [EMBED]  ONNX bge-m3
                                       NEW chunks only
                                       (GPU semaphore-gated batch)
                                            │
                                            ▼
                                      [INDEX]  Lucene upsert:
                                       - chunk doc with vector
                                       - chunk_version doc
                                            │
                                            ▼
                                      [COMMIT]  Lucene commit +
                                       H2 transaction
                                            │
                                            ▼
                                       Version.status = INDEXED
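
The DISCOVER_FILES step in the diagram combines ignore globs with a size cap. A minimal sketch of that filter using the JDK's glob matcher follows. The class name `IngestFileFilter` is hypothetical, .gitignore handling is deliberately left out, and JDK glob syntax is not identical to .gitignore syntax, so treat this as an approximation of the idea only.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

// Hypothetical helper: per-repo ignore globs plus the max-file-size cap.
final class IngestFileFilter {
    private final List<PathMatcher> ignore;
    private final long maxFileSizeBytes;

    IngestFileFilter(List<String> ignoreGlobs, long maxFileSizeBytes) {
        // "glob:" selects the JDK glob syntax, where ** crosses directory boundaries.
        this.ignore = ignoreGlobs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g))
                .toList();
        this.maxFileSizeBytes = maxFileSizeBytes;
    }

    /** True if the file should continue to PARSE; the path must be relative to the repo root. */
    boolean accept(Path relativePath, long sizeBytes) {
        if (sizeBytes > maxFileSizeBytes) {
            return false;
        }
        return ignore.stream().noneMatch(m -> m.matches(relativePath));
    }
}
```

With the default cap of 1048576 bytes and globs such as `node_modules/**`, a 2 KB `src/Main.java` passes while anything under `node_modules/` is dropped before parsing.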

Key invariants

  1. Embeddings are computed at most once per content_hash. Persistent disk cache keyed by hash → vector bytes.
  2. A tag's chunks = union of (a) reused chunks via hash and (b) newly-embedded chunks. This makes re-indexing a near-identical tag almost free.
  3. Git-diff fast path: if a tag's parent (nearest previously indexed tag in semver order) exists, only files changed in git diff parent..tag are re-parsed. Unchanged files contribute their parent's chunk_versions verbatim with new line offsets adjusted by diff (or fully re-parsed if rename detection is ambiguous).
  4. Per-stage virtual-thread pools. Threads themselves are unbounded (per user spec), but a GPU semaphore (default permits = ortSessionCount) gates ONNX inference to avoid GPU OOM. The Lucene writer is single-threaded and fed by its own queue.
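
Invariants 1 and 2 can be illustrated with a small sketch: hash the canonicalized content, then compute the embedding only if that hash has not been seen. The class name `EmbedOnce` is hypothetical, the in-memory map stands in for the persistent disk cache, and the canonicalization rule (normalize line endings, strip trailing whitespace) is an assumption, since the document does not define it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: at most one embedding computation per content hash.
final class EmbedOnce {
    private final Map<String, float[]> cache = new ConcurrentHashMap<>();
    private final Function<String, float[]> embedder; // stands in for the ONNX bge-m3 call

    EmbedOnce(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    /** Assumed canonicalization: CRLF to LF, then strip trailing whitespace. */
    static String canonicalize(String content) {
        return content.replace("\r\n", "\n").stripTrailing();
    }

    /** Lowercase hex SHA-256 of the canonicalized content. */
    static String contentHash(String content) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(canonicalize(content).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is mandatory on every Java platform", e);
        }
    }

    /** Returns the cached vector, or embeds the content once and caches the result. */
    float[] vectorFor(String content) {
        String canonical = canonicalize(content);
        // computeIfAbsent on ConcurrentHashMap runs the mapping function at most once per key.
        return cache.computeIfAbsent(contentHash(content), hash -> embedder.apply(canonical));
    }
}
```

Two tags that contain the same snippet, differing only in line endings, hash to the same key here, so the second tag reuses the first tag's vector instead of triggering inference again.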

6. Search Pipeline

query  ─►  [Query Rewrite]      rule-based: lowercase, dedupe stop tokens,
              │                  optional library-id-aware expansion
              ▼
           [BM25 search]                [Dense kNN search]
           Lucene similarity            Lucene HNSW (bge-m3 dense)
                  │                                   │
                  └─────────────► [RRF fusion] ◄──────┘
                                          │
                                          ▼
                                  top-K candidates (default 50)
                                          │
                                          ▼
                                  [Cross-encoder rerank]
                                  ONNX bge-reranker-v2-m3
                                  (GPU semaphore)
                                          │
                                          ▼
                                  [Token-budget assemble]
                                  pack snippets up to `tokens` param
                                  (default 5000, min 500, max 50000)
                                          │
                                          ▼
                                   ranked snippets w/ citations
                                   (file path, repo, tag, lines)

All searches are scoped by (repoId, versionId) filter clauses on the Lucene index, using chunk_versions join semantics.
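
The RRF fusion step merges the BM25 and dense rankings by summing 1 / (k + rank) per document across lists, with rank starting at 1. A minimal sketch under that standard definition follows. The class and method names are hypothetical, chunk ids are plain strings for brevity, and ties are broken by id only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reciprocal rank fusion over any number of ranked id lists.
final class RrfFusion {

    /**
     * @param rankings ranked id lists, best first (e.g. BM25 hits and kNN hits)
     * @param k        RRF constant (search.rrf-k, 60 in the configuration excerpt)
     * @param topK     number of fused candidates to keep
     */
    static List<String> fuse(List<List<String>> rankings, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // i is zero-based, so the 1-based rank is i + 1.
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Double>comparingByKey()));
        return entries.stream().limit(topK).map(Map.Entry::getKey).toList();
    }
}
```

For BM25 order [a, b, c] and dense order [b, d, a] with k = 60, b scores 1/62 + 1/61 and a scores 1/61 + 1/63, so b ranks ahead of a; documents found by only one retriever (d, c) trail both.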


7. MCP Server (Streamable HTTP)

  • Single endpoint: POST /mcp (JSON-RPC over HTTP) with optional SSE upgrade per request, per MCP 2025-03-26 spec.
  • Two tools, exactly matching Context7 schema:

resolve-library-id

{
  "name": "resolve-library-id",
  "description": "Resolves a library/package name to a trueref-compatible library ID...",
  "inputSchema": {
    "type": "object",
    "required": ["libraryName"],
    "properties": {
      "libraryName": { "type": "string" },
      "query":       { "type": "string", "description": "optional, ranks results by relevance" }
    }
  }
}

Returns ranked candidate library IDs (/{owner}/{repo} style) with metadata (description, snippet count, available versions, source reputation).

get-library-docs

{
  "name": "get-library-docs",
  "inputSchema": {
    "type": "object",
    "required": ["libraryId"],
    "properties": {
      "libraryId": { "type": "string", "description": "/org/project[/version]" },
      "topic":     { "type": "string" },
      "tokens":    { "type": "integer", "minimum": 500, "maximum": 50000, "default": 5000 }
    }
  }
}
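
How the tokens parameter might drive snippet assembly is sketched below. Two behaviours here are assumptions rather than documented choices: out-of-range values are clamped to the 500..50000 range (a stricter server could reject them against the schema instead), and a snippet that does not fit is skipped so that smaller, lower-ranked snippets can still use the remaining budget. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of token-budget assembly for get-library-docs.
final class TokenBudget {
    static final int DEFAULT = 5000;
    static final int MIN = 500;
    static final int MAX = 50_000;

    record Snippet(String id, int tokenCount) {}

    /** Null means the caller omitted the parameter; otherwise clamp into [MIN, MAX]. */
    static int clamp(Integer requested) {
        if (requested == null) {
            return DEFAULT;
        }
        return Math.max(MIN, Math.min(MAX, requested));
    }

    /** Greedily packs snippets in rank order without exceeding the budget. */
    static List<Snippet> pack(List<Snippet> ranked, Integer requestedTokens) {
        int remaining = clamp(requestedTokens);
        List<Snippet> packed = new ArrayList<>();
        for (Snippet snippet : ranked) {
            if (snippet.tokenCount() > remaining) {
                continue; // does not fit; a later, smaller snippet still might
            }
            packed.add(snippet);
            remaining -= snippet.tokenCount();
        }
        return packed;
    }
}
```

Stopping at the first snippet that does not fit would preserve rank order more strictly, at the cost of leaving part of the budget unused.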

On-demand indexing flow

  • If libraryId includes a version that maps to a known git tag but is not yet indexed:
    1. Enqueue INDEX_VERSION job immediately.
    2. Return a partial response built from the nearest indexed tag (semver-closest) plus a status block: { "indexing": { "status": "in_progress", "version": "1.2.3", "retryAfterSec": 30 } }.
  • If version maps to no tag: return error version_not_found with the list of candidate tags discovered.
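
The document does not define "semver-closest", so the sketch below picks one plausible reading: prefer the smallest major-version gap, then the smallest minor gap, then the smallest patch gap. That ordering, the strict MAJOR.MINOR.PATCH parser (no pre-release or build suffixes), and all names are assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose the fallback version for a partial response.
final class NearestVersion {

    record SemVer(int major, int minor, int patch) {
        /** Parses plain "MAJOR.MINOR.PATCH"; suffixes such as "-rc1" are not handled. */
        static SemVer parse(String text) {
            String[] parts = text.split("\\.");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected MAJOR.MINOR.PATCH: " + text);
            }
            return new SemVer(Integer.parseInt(parts[0]),
                    Integer.parseInt(parts[1]),
                    Integer.parseInt(parts[2]));
        }
    }

    /** Returns the indexed version nearest to the wanted one, or empty if nothing is indexed. */
    static Optional<SemVer> closest(SemVer wanted, List<SemVer> indexed) {
        return indexed.stream().min(
                Comparator.<SemVer>comparingInt(v -> Math.abs(v.major() - wanted.major()))
                        .thenComparingInt(v -> Math.abs(v.minor() - wanted.minor()))
                        .thenComparingInt(v -> Math.abs(v.patch() - wanted.patch())));
    }
}
```

Under this reading, a request for 1.2.3 with 1.0.0, 1.2.0 and 2.0.0 indexed falls back to 1.2.0 while the INDEX_VERSION job for 1.2.3 runs.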

8. REST API Surface

| Method | Path | Purpose |
|---|---|---|
| GET | /api/repos | List registered repos |
| POST | /api/repos | Register (local path or remote URL) |
| GET | /api/repos/{id} | Repo detail + version summary |
| DELETE | /api/repos/{id} | Unregister + soft-delete versions |
| POST | /api/repos/{id}/discover | Force tag discovery |
| GET | /api/repos/{id}/versions | All known versions + status |
| POST | /api/repos/{id}/versions/{tag}/index | Index a specific tag |
| POST | /api/repos/{id}/versions/{tag}/reindex | Force re-index |
| GET | /api/jobs | List jobs (filter by repo/version/status) |
| GET | /api/jobs/{id} | Job detail with stages |
| GET | /api/jobs/{id}/log (SSE) | Live log stream |
| GET | /api/jobs/stream (SSE) | Live job-status events for the dashboard |
| POST | /api/search | Hybrid search across one or more (repo, version) scopes |
| GET | /api/resolve?q=react | Library-ID resolution preview |
| GET | /api/observability/metrics | UI-friendly aggregated metrics JSON |
| GET | /api/observability/resources | Heap, GPU mem (via NVML when present), index size |
| GET | /swagger-ui/index.html | Swagger UI |
| GET | /v3/api-docs | OpenAPI JSON |
| ANY | /mcp | MCP Streamable HTTP endpoint |
| GET | /actuator/prometheus | Prometheus scrape (optional) |
| GET | /** | SPA fallback to index.html |

9. Concurrency & Performance

  • Virtual threads everywhere for I/O (HTTP, JGit, file I/O, Lucene reads).
  • Tomcat configured with virtual-thread executor (spring.threads.virtual.enabled=true).
  • Per-stage logical pools are unbounded virtual-thread executors per orchestrator instance.
  • GPU access gated by a Semaphore with permits = number of ONNX sessions (configurable, default = 2).
  • Lucene writer: single IndexWriter instance protected by a queue; readers use a refresh-on-search SearcherManager.
  • Embedding cache: file-per-hash on disk under data/embedding-cache/; hot LRU in memory.
  • Tag concurrency: not capped (per spec), but each tag job awaits the GPU semaphore — natural backpressure.
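
The semaphore gate described in this list can be sketched in a few lines. The class name `GpuGate` is hypothetical and the supplier stands in for an ONNX session call. The sketch works with any kind of thread; on Java 21 the callers would typically be virtual threads, which park cheaply while waiting for a permit.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: unbounded callers, bounded concurrent inference.
final class GpuGate {
    private final Semaphore permits;

    /** One permit per ONNX session (embedding.session-count, default 2). */
    GpuGate(int sessionCount) {
        this.permits = new Semaphore(sessionCount);
    }

    /** Blocks until a permit is free, runs the inference, and always releases the permit. */
    <T> T run(Supplier<T> inference) {
        permits.acquireUninterruptibly();
        try {
            return inference.get();
        } finally {
            permits.release();
        }
    }

    /** Suitable as the source for a gauge such as gpu_semaphore_available. */
    int available() {
        return permits.availablePermits();
    }
}
```

Since every tag job funnels through `run`, adding more tag jobs lengthens the wait for a permit rather than increasing GPU memory pressure, which is the backpressure the last bullet refers to.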

10. Observability

  • Metrics via Micrometer (MeterRegistry):
    • Counters: chunks_embedded, chunks_reused, files_skipped, jobs_succeeded/failed.
    • Timers: stage durations per stage name.
    • Gauges: active_jobs, gpu_semaphore_available, lucene_index_size_bytes, heap_used.
  • OpenTelemetry traces for every job (one trace per IngestionJob, span per JobStage).
  • JobEventBus: in-process pub/sub. SSE controllers subscribe and push events to UI.
  • UI dashboards (no Grafana required):
    • "Live" tab: progress bars per running (repo, tag), per-stage throughput, log tail.
    • "History" tab: paginated jobs table.
    • "Stats" tab: per-stage timing histograms, chunk counts per repo/version, chunk dedupe ratio.
    • "Resources" tab: heap, GPU memory (NVML where available), index size on disk.
  • Prometheus scraping is opt-in (Actuator endpoint).
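
An in-process pub/sub bus of the kind JobEventBus describes can be very small. The sketch below is an assumption about its shape, not the project's implementation: the event record is trimmed to four fields, delivery is synchronous on the publisher's thread, and a subscriber that throws is dropped so one broken SSE connection cannot stall the rest.

```java
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical sketch of an in-process event bus for job log events.
final class JobEventBus {

    record JobLogEvent(String jobId, Instant ts, String level, String message) {}

    // Copy-on-write keeps iteration safe while subscribers come and go.
    private final List<Consumer<JobLogEvent>> subscribers = new CopyOnWriteArrayList<>();

    /** Registers a listener and returns a handle that unsubscribes it. */
    Runnable subscribe(Consumer<JobLogEvent> listener) {
        subscribers.add(listener);
        return () -> subscribers.remove(listener);
    }

    /** Delivers the event to every current subscriber. */
    void publish(JobLogEvent event) {
        for (Consumer<JobLogEvent> subscriber : subscribers) {
            try {
                subscriber.accept(event);
            } catch (RuntimeException e) {
                // Assumed policy: a failing subscriber is removed instead of retried.
                subscribers.remove(subscriber);
            }
        }
    }
}
```

An SSE controller would call `subscribe` when a client connects and run the returned handle when the connection completes or times out.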

11. Storage Layout (on disk)

$TRUEREF_HOME/                  # default: ./data
├── h2/                         # H2 database files
├── lucene/                     # single index dir; one Lucene writer
├── repos/                      # managed clones (when managedClone=true)
│   └── <repoId>/...
├── embedding-cache/            # one file per content_hash → fp16 vector bytes
├── models/                     # ONNX model files (auto-downloaded on first run)
│   ├── bge-m3/
│   └── bge-reranker-v2-m3/
└── logs/

12. Configuration (excerpt)

trueref:
  home: ${TRUEREF_HOME:./data}
  ingestion:
    poll-interval-default: 1h
    tag-cap-default: 100              # most-recent N tags by semver/date
    max-file-size-bytes-default: 1048576
  embedding:
    model: bge-m3
    onnx-providers: [cuda, directml, cpu]   # tried in order
    session-count: 2                  # = GPU semaphore permits
    batch-size: 32
  reranker:
    model: bge-reranker-v2-m3
    top-k: 50
  search:
    rrf-k: 60
    final-top-k: 20
  mcp:
    tokens-default: 5000
    tokens-min: 500
    tokens-max: 50000
spring:
  threads.virtual.enabled: true

13. Out-of-the-box behaviors locked from clarifications

  • Auth: none (LAN-only) on REST and MCP.
  • Tag selection: default cap 100 most-recent; on-demand index of any tag via UI search OR via MCP when an unindexed version is requested.
  • Differential indexing: dedupe by content_hash AND skip unchanged files via git diff parent..tag.
  • Repo input: UI-add (local path or remote URL) AND watched folder ./data/watched/ for bare repos.
  • Re-index trigger: on-demand + scheduled git fetch poll (default 1h per repo).
  • Stale tag cleanup: soft delete via Version.status=INACTIVE; compaction job reclaims orphan chunks.
  • Embedding cache: persistent on disk, keyed by content_hash.
  • Concurrency: unbounded virtual threads, GPU semaphore-gated.
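
The stale-tag cleanup rule implies that a chunk becomes an orphan once no active version references it. A minimal sketch of that set computation follows. The names are hypothetical, the ChunkVersion record is reduced to its two ids, and a real compaction job would run this as a query against H2 rather than in memory.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch: find chunks that only INACTIVE versions (or none) still reference.
final class OrphanChunks {

    record ChunkVersion(UUID chunkId, UUID versionId) {}

    static Set<UUID> find(Set<UUID> allChunkIds,
                          List<ChunkVersion> links,
                          Set<UUID> inactiveVersionIds) {
        // A chunk is live if at least one link points at a version that is not INACTIVE.
        Set<UUID> live = new HashSet<>();
        for (ChunkVersion link : links) {
            if (!inactiveVersionIds.contains(link.versionId())) {
                live.add(link.chunkId());
            }
        }
        Set<UUID> orphans = new HashSet<>(allChunkIds);
        orphans.removeAll(live);
        return orphans;
    }
}
```

Because chunks are deduplicated globally by content_hash, a chunk shared between an inactive tag and an active one stays live; only chunks unique to inactive tags are reclaimed.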

See FINDINGS.md for research backing each choice.