Initial commit: trueref v0.1.0-SNAPSHOT
Some checks failed
Build and publish Docker image / Build and push (push) Failing after 1m27s
Java 21 / Spring Boot 3.5.3 multi-module Maven project. Hybrid BM25+HNSW search with RRF, cross-encoder reranker, ONNX Runtime 1.22.0 (CPU + CUDA 12 GPU variants).
428
ARCHITECTURE.md
Normal file
@@ -0,0 +1,428 @@
# trueref — Architecture

> Self-hosted, fat-JAR, Java-21 clone of [Context7](https://github.com/upstash/context7) ingestion + retrieval, with first-class differential per-tag indexing, embedded vector + BM25 store, ONNX-accelerated embeddings/rerank, Streamable-HTTP MCP server, REST + OpenAPI, and a SvelteKit UI.

## 1. Goals & Non-Goals

### Goals
- **Functional parity with Context7** ingestion outcome (own chunk schema).
- **Differential per-tag indexing**: every git tag of every registered repo is independently queryable.
- **Embedded everything**: single fat JAR runnable on a workstation/server. No external Postgres/Qdrant.
- **GPU-accelerated retrieval** via ONNX Runtime (CUDA Linux/Win, DirectML Win, CPU fallback).
- **MCP Streamable-HTTP server** exposing exactly two tools: `resolve-library-id`, `get-library-docs` — drop-in for any MCP client.
- **Full observability** of ingestion pipelines surfaced in the UI (live progress, log tail, history, timings, resource usage).
- **REST + OpenAPI/Swagger** for programmatic and UI use.
- **SvelteKit UI** for repo registration, indexing control, monitoring, and ad-hoc query.
- **Hexagonal architecture** so vector store, embedder, parser, persistence, etc. are swappable.

### Non-Goals
- No public hosted SaaS — self-host only.
- No model fine-tuning.
- No mobile app.
- No generative LLM in the pipeline (retrieval-only, like Context7).
- No multi-tenancy / auth (LAN-only deployment).

---

## 2. Tech Stack (locked)

| Concern | Choice | Rationale |
|---|---|---|
| Language / runtime | **Java 21 LTS** | Virtual threads stable; Spring Boot 3.5 supported. (Java 25 dropped — Boot 3.5 supports up to 23.) |
| Framework | **Spring Boot 3.5.x** + **Spring AI 1.0.x** | Web MVC + virtual-thread executor; Spring AI for embedding/MCP abstractions. |
| Build | **Maven** | Stable, ubiquitous, Spring-Boot first-class. |
| Metadata store | **H2 (MVCC mode, file-based)** + Flyway | Zero ops, JDBC, MVCC concurrency, fits fat JAR. |
| Vector + lexical store | **Apache Lucene 9.x** | Pure JVM. BM25 + HNSW kNN in one index. Collapses two stores. |
| Embedding model | **BAAI/bge-m3** (ONNX) | Multilingual, 8k context, dense+sparse capable. MIT-like license. |
| Reranker | **BAAI/bge-reranker-v2-m3** (ONNX) | Cross-encoder, Apache 2.0. |
| ML runtime | **ONNX Runtime** (`onnxruntime_gpu` Linux CUDA / `onnxruntime-directml` Win / `onnxruntime` CPU) | In-JVM via official Java bindings. |
| Git | **JGit** | Pure Java; clone, fetch, tag enumeration, diff. |
| Code parsing | **Pure-Java heuristic chunker** (markdown-aware, brace-balanced for C-family, indent-based for Python, sliding-window fallback) | No native deps; preserves fat-JAR purity. Tree-sitter is a documented future swap (see FINDINGS §F11). |
| Job orchestration | **Custom virtual-thread orchestrator** + H2-backed durable state | Fast, no Spring Batch overhead. |
| MCP server | **Spring AI MCP Server (Streamable HTTP)** | Spec 2025-03-26, single `/mcp` endpoint. |
| REST docs | **springdoc-openapi** | OpenAPI 3 + Swagger UI auto-generated. |
| Observability | **Micrometer + OpenTelemetry**, exposed via REST/SSE for UI. **Prometheus + Grafana optional** via `/actuator/prometheus`. | UI-first; Prom/Graf attach later. |
| Frontend | **SvelteKit + `@sveltejs/adapter-static`** | Built into `bootstrap/src/main/resources/static/`, served by Spring as part of fat JAR. |
| Packaging | **Single fat JAR** via `spring-boot-maven-plugin` | One artifact, embedded everything. |

---

## 3. Hexagonal Layout (Maven multi-module)

Direction of dependencies is enforced by Maven coordinates alone — no ArchUnit needed.

```
trueref-parent/                      (pom; BOM + plugin management)
├── trueref-domain                   pure Java; records, sealed types, port interfaces. ZERO deps.
├── trueref-application              use-case impls; depends on: domain
├── trueref-adapters                 ALL adapters live here; depends on: domain, application
│   └── com.trueref.adapter
│       ├── in
│       │   ├── rest                 @RestController + DTOs + OpenAPI + SSE
│       │   └── mcp                  MCP tool defs (Spring AI MCP server)
│       └── out
│           ├── persistence.h2       JdbcClient + Flyway, RepositoryStore impl
│           ├── vectorstore.lucene   Lucene BM25 + HNSW kNN, ChunkStore impl
│           ├── embedding.onnx       ONNX bge-m3 + bge-reranker-v2-m3
│           ├── git.jgit             GitClient impl
│           ├── parsing.treesitter   CodeParser impl
│           └── cache.disk           EmbeddingCache (file-per-hash)
├── trueref-frontend                 SvelteKit; built via frontend-maven-plugin into static jar
└── trueref-bootstrap                @SpringBootApplication; wires beans; produces fat JAR
                                     depends on: domain, application, adapters, frontend
```

**Dependency rule (Maven-enforced):**
- `domain` → nothing.
- `application` → `domain`.
- `adapters` → `domain` + `application`.
- `frontend` → none (resource-only jar).
- `bootstrap` → all of the above (the only place wiring lives).

> All packages live under `com.trueref.*` regardless of module. Module boundaries enforce dependency direction; package layout inside `adapters` mirrors the in/out hexagonal convention.

---

## 4. Core Domain Model

```
Repository {
  id: UUID
  name: String                           // "spring-projects/spring-boot"
  remoteUrl: String?                     // null if local-only
  localPath: Path                        // either user-provided or our managed clone dir
  managedClone: bool                     // true if WE clone/fetch
  ignoreGlobs: List<String>              // per-repo overrides
  maxFileSizeBytes: long                 // default 1MB
  pollIntervalSec: long                  // default 3600; 0 disables polling
  versionMappingRules: List<TagPattern>  // exact, v-prefix, release-prefix, regex
  createdAt, updatedAt
}

Version {
  id: UUID
  repoId: UUID
  tag: String                            // "v1.2.3" or branch name
  commitSha: String
  status: enum { DISCOVERED, INDEXING, INDEXED, FAILED, INACTIVE }
  indexedAt: Instant?
  chunkCount: int
  errorMessage: String?
}

Chunk {                                  // global, deduplicated by content_hash
  id: UUID
  contentHash: String                    // sha256 of canonicalized content
  content: String                        // the snippet text
  language: String                       // "java", "python", "markdown", ...
  symbol: String?                        // function/class name if AST-extracted
  tokenCount: int
  // dense + sparse vectors stored in Lucene index, not here
}

ChunkVersion {                           // many-to-many: which versions contain which chunks
  chunkId: UUID
  versionId: UUID
  filePath: String
  startLine: int
  endLine: int
  // PK (chunkId, versionId, filePath, startLine)
}

IngestionJob {
  id: UUID
  repoId: UUID
  versionId: UUID?                       // null = repo-level (e.g. discovery)
  type: enum { DISCOVER_TAGS, INDEX_VERSION, COMPACT, REFRESH }
  status: enum { QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED }
  startedAt, finishedAt
  stages: List<JobStage>
}

JobStage {
  jobId: UUID
  name: enum { CLONE, FETCH, CHECKOUT, DISCOVER_FILES, PARSE, CHUNK, EMBED, INDEX, COMMIT }
  status: enum { PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED }
  startedAt, finishedAt
  itemsProcessed: long
  itemsTotal: long
  bytesProcessed: long
  errorMessage: String?
}

JobLogEvent {                            // ring-buffered + persisted; streamed via SSE
  jobId: UUID
  ts: Instant
  level: enum { DEBUG, INFO, WARN, ERROR }
  stage: JobStage.name?
  message: String
}
```

---

## 5. Ingestion Pipeline

```
          ┌────────────────────────────────────────────────────────┐
          │    IngestionOrchestrator (virtual-thread per stage)    │
          └────────────────────────────────────────────────────────┘
                              │
      ┌───────────────────────┼─────────────────────────────────────────┐
      ▼                       ▼                                         ▼
[CLONE/FETCH]          [DISCOVER_TAGS]                         [INDEX_VERSION job]
JGit pull/clone        git tag list ∩                           (per (repo,tag))
                       version mapping
                       rules
                                                                        │
                                            ┌───────────────────────────┤
                                            ▼                           ▼
                                   [CHECKOUT worktree]       (parallel tags up to N)
                                            │
                                            ▼
                               [DISCOVER_FILES]
                               respect .gitignore +
                               defaults + per-repo globs +
                               max file size
                                            │
                                            ▼
                               [GIT_DIFF vs prev indexed tag]
                               → if exists, only changed
                               files reach PARSE
                                            │
                                            ▼
                               [PARSE] heuristic chunker
                               (markdown sections; brace-balanced;
                               indent-based; sliding-window fallback)
                                            │
                                            ▼
                               [CHUNK] AST-aware splits +
                               sliding-window fallback
                                            │
                                            ▼
                               [HASH + DEDUPE]
                               content_hash lookup → existing
                               chunkId reused
                                            │
                                            ▼
                               [EMBED] ONNX bge-m3
                               NEW chunks only
                               (GPU semaphore-gated batch)
                                            │
                                            ▼
                               [INDEX] Lucene upsert:
                               - chunk doc with vector
                               - chunk_version doc
                                            │
                                            ▼
                               [COMMIT] Lucene commit +
                               H2 transaction
                                            │
                                            ▼
                               Version.status = INDEXED
```
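
The `[HASH + DEDUPE]` and `[EMBED]` stages can be sketched as a hash-keyed lookup in front of the embedder. This is a minimal illustration, not trueref's actual code: the class name `EmbeddingDedupe`, the canonicalization rule (normalize CRLF, strip outer whitespace), and the in-memory map standing in for the on-disk cache are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: embeds each distinct content_hash at most once. */
final class EmbeddingDedupe {

    // In-memory stand-in for the persistent hash -> vector cache.
    private final Map<String, float[]> vectorsByHash = new ConcurrentHashMap<>();
    // Stand-in for the ONNX bge-m3 call.
    private final Function<String, float[]> embedder;
    private final AtomicInteger embedCalls = new AtomicInteger();

    EmbeddingDedupe(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    /** SHA-256 hex of the canonicalized content (the canonicalization rule is an assumption). */
    static String contentHash(String content) {
        String canonical = content.replace("\r\n", "\n").strip();
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(sha256.digest(canonical.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java runtime ships SHA-256, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the cached vector for known content and embeds only on a cache miss. */
    float[] embedOnce(String content) {
        return vectorsByHash.computeIfAbsent(contentHash(content), hash -> {
            embedCalls.incrementAndGet();
            return embedder.apply(content);
        });
    }

    /** Number of times the embedder was actually invoked. */
    int embedCalls() {
        return embedCalls.get();
    }
}
```

With this shape, two tags that share a file verbatim trigger one embedder call for each shared chunk, and the second tag only pays for the hash lookup.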

### Key invariants

1. **Embeddings are computed at most once per `content_hash`.** Persistent disk cache keyed by hash → vector bytes.
2. **A tag's chunks = union of (a) reused chunks via hash and (b) newly-embedded chunks.** This makes re-indexing a near-identical tag almost free.
3. **Git-diff fast path:** if a tag's parent (nearest previously indexed tag in semver order) exists, only files changed in `git diff parent..tag` are re-parsed. Unchanged files contribute their parent's chunk_versions verbatim with new line offsets adjusted by diff (or fully re-parsed if rename detection is ambiguous).
4. **Per-stage virtual-thread pools.** Threads themselves are unbounded (per user spec), but a **GPU semaphore** (default `permits = ortSessionCount`) gates ONNX inference to avoid GPU OOM. Lucene writer is single-thread (its own queue).
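
Invariant 3 depends on choosing a parent tag. One way to pick "the nearest previously indexed tag in semver order" is sketched below. The class `ParentTagSelector` and its methods are hypothetical, and the sketch assumes only plain `MAJOR.MINOR.PATCH` tags (optionally `v`-prefixed) take part; pre-release and non-semver tags are simply ignored here.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks the diff baseline for a tag among already-indexed tags. */
final class ParentTagSelector {

    // Assumption: only MAJOR.MINOR.PATCH tags, optionally prefixed with "v", are comparable.
    private static final Pattern SEMVER = Pattern.compile("^v?(\\d+)\\.(\\d+)\\.(\\d+)$");

    record SemVer(int major, int minor, int patch, String tag) {
        // Orders by the numeric components only; the original tag text is ignored.
        static final Comparator<SemVer> ORDER = Comparator
                .comparingInt(SemVer::major)
                .thenComparingInt(SemVer::minor)
                .thenComparingInt(SemVer::patch);
    }

    static Optional<SemVer> parse(String tag) {
        Matcher m = SEMVER.matcher(tag);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new SemVer(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                tag));
    }

    /** Highest indexed tag that sorts strictly below the target, if any. */
    static Optional<String> parentOf(String targetTag, List<String> indexedTags) {
        Optional<SemVer> target = parse(targetTag);
        if (target.isEmpty()) {
            return Optional.empty(); // no baseline, so the caller would do a full parse
        }
        return indexedTags.stream()
                .map(ParentTagSelector::parse)
                .flatMap(Optional::stream)
                .filter(candidate -> SemVer.ORDER.compare(candidate, target.get()) < 0)
                .max(SemVer.ORDER)
                .map(SemVer::tag);
    }
}
```

For example, with `v1.0.0`, `v1.2.0`, and `v1.3.0` already indexed, the baseline for `v1.2.3` is `v1.2.0`. An empty result means there is nothing to diff against and every file goes through PARSE.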

---

## 6. Search Pipeline

```
query ─► [Query Rewrite]   rule-based: lowercase, dedupe stop tokens,
                │          optional library-id-aware expansion
                ▼
  [BM25 search]                        [Dense kNN search]
  Lucene similarity                    Lucene HNSW (bge-m3 dense)
        │                                      │
        └─────────────► [RRF fusion] ◄─────────┘
                              │
                              ▼
                top-K candidates (default 50)
                              │
                              ▼
                   [Cross-encoder rerank]
                   ONNX bge-reranker-v2-m3
                       (GPU semaphore)
                              │
                              ▼
                   [Token-budget assemble]
             pack snippets up to `tokens` param
             (default 5000, min 500, max 50000)
                              │
                              ▼
                ranked snippets w/ citations
                (file path, repo, tag, lines)
```

All searches are **scoped** to `(repoId, versionId)` filter clauses on the Lucene index using `chunk_versions` join semantics.
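
The `[RRF fusion]` step can be illustrated with the standard Reciprocal Rank Fusion formula, where each list contributes `1 / (k + rank)` to a document's score and ranks start at 1. The class `RrfFusion` and its signature are hypothetical; the sketch fuses plain chunk-ID lists and breaks ties by ID so the output is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: Reciprocal Rank Fusion over ranked lists of chunk IDs. */
final class RrfFusion {

    private RrfFusion() {}

    /**
     * Fuses ranked lists into one list of at most topK IDs.
     * score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
     */
    static List<String> fuse(List<List<String>> rankedLists, int k, int topK) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Highest score first; ties broken by ID for a stable order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Double>comparingByKey()));
        return entries.stream().limit(topK).map(Map.Entry::getKey).toList();
    }
}
```

With `k = 60`, a BM25 list `[a, b, c]` and a dense list `[b, c, d]` fuse to `[b, c, a, d]`: `b` and `c` appear in both lists, so they outrank `a` even though `a` was first in one of them.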

---

## 7. MCP Server (Streamable HTTP)

- Single endpoint: `POST /mcp` (JSON-RPC over HTTP) with optional SSE upgrade per request, per [MCP 2025-03-26 spec](https://spec.modelcontextprotocol.io/specification/2025-03-26/basic/transports/).
- **Two tools, exactly matching Context7 schema:**

### `resolve-library-id`
```json
{
  "name": "resolve-library-id",
  "description": "Resolves a library/package name to a trueref-compatible library ID...",
  "inputSchema": {
    "type": "object",
    "required": ["libraryName"],
    "properties": {
      "libraryName": { "type": "string" },
      "query": { "type": "string", "description": "optional, ranks results by relevance" }
    }
  }
}
```
Returns ranked candidate library IDs (`/{owner}/{repo}` style) with metadata (description, snippet count, available versions, source reputation).

### `get-library-docs`
```json
{
  "name": "get-library-docs",
  "inputSchema": {
    "type": "object",
    "required": ["libraryId"],
    "properties": {
      "libraryId": { "type": "string", "description": "/org/project[/version]" },
      "topic": { "type": "string" },
      "tokens": { "type": "integer", "minimum": 500, "maximum": 50000, "default": 5000 }
    }
  }
}
```
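
The `tokens` parameter feeds the token-budget assembly step. The sketch below shows one possible handling; `TokenBudget` and `Snippet` are hypothetical names. Two behaviors are assumptions rather than anything stated here: out-of-range values are clamped instead of rejected, and a snippet that does not fit is skipped so that smaller, lower-ranked snippets can still use the remaining budget.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the token-budget assembly step. */
final class TokenBudget {

    static final int MIN_TOKENS = 500;
    static final int MAX_TOKENS = 50_000;
    static final int DEFAULT_TOKENS = 5_000;

    record Snippet(String id, int tokenCount) {}

    private TokenBudget() {}

    /** Applies the default, then clamps into [MIN_TOKENS, MAX_TOKENS] (clamping is an assumption). */
    static int effectiveBudget(Integer requested) {
        if (requested == null) {
            return DEFAULT_TOKENS;
        }
        return Math.max(MIN_TOKENS, Math.min(MAX_TOKENS, requested));
    }

    /** Walks snippets in rank order and keeps each one that still fits the remaining budget. */
    static List<Snippet> pack(List<Snippet> ranked, int budget) {
        List<Snippet> packed = new ArrayList<>();
        int remaining = budget;
        for (Snippet snippet : ranked) {
            if (snippet.tokenCount() <= remaining) {
                packed.add(snippet);
                remaining -= snippet.tokenCount();
            }
        }
        return packed;
    }
}
```

Stopping at the first snippet that overflows would be an equally defensible policy; it keeps the output a strict prefix of the ranking at the cost of leaving some budget unused.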

### On-demand indexing flow
- If `libraryId` includes a version that maps to a known git tag but is **not yet indexed**:
  1. Enqueue `INDEX_VERSION` job immediately.
  2. Return a **partial** response built from the **nearest indexed tag** (semver-closest) plus a status block: `{ "indexing": { "status": "in_progress", "version": "1.2.3", "retryAfterSec": 30 } }`.
- If version maps to **no** tag: return error `version_not_found` with the list of candidate tags discovered.
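
Before either branch of this flow can run, the `libraryId` has to be split and its version mapped to a tag. A rough sketch follows, with several assumptions: the record `LibraryId` is hypothetical, only the exact, `v`-prefix, and `release-` prefix forms are tried (the exact spelling of the release prefix is a guess, and regex rules are left out), and everything after the third `/` is treated as the version.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical parser for "/org/project[/version]" plus a simple version-to-tag lookup. */
record LibraryId(String org, String project, Optional<String> version) {

    static LibraryId parse(String raw) {
        // A limit of 4 keeps any further "/" inside the version segment.
        String[] parts = raw.split("/", 4);
        boolean shapeOk = parts.length >= 3
                && parts[0].isEmpty()
                && !parts[1].isBlank()
                && !parts[2].isBlank()
                && (parts.length == 3 || !parts[3].isBlank());
        if (!shapeOk) {
            throw new IllegalArgumentException("expected /org/project[/version], got: " + raw);
        }
        Optional<String> version = parts.length == 4 ? Optional.of(parts[3]) : Optional.empty();
        return new LibraryId(parts[1], parts[2], version);
    }

    /**
     * Tries the exact version, then a "v" prefix, then a "release-" prefix.
     * An empty result corresponds to the version_not_found branch.
     */
    Optional<String> resolveTag(List<String> knownTags) {
        if (version.isEmpty()) {
            return Optional.empty();
        }
        String v = version.get();
        return List.of(v, "v" + v, "release-" + v).stream()
                .filter(knownTags::contains)
                .findFirst();
    }
}
```

A caller would then check the resolved tag's `Version.status`: if it is not `INDEXED`, enqueue `INDEX_VERSION` and answer from the nearest indexed tag; if nothing resolves, return `version_not_found`.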

---

## 8. REST API Surface

| Method | Path | Purpose |
|---|---|---|
| GET | `/api/repos` | List registered repos |
| POST | `/api/repos` | Register (local path or remote URL) |
| GET | `/api/repos/{id}` | Repo detail + version summary |
| DELETE | `/api/repos/{id}` | Unregister + soft-delete versions |
| POST | `/api/repos/{id}/discover` | Force tag discovery |
| GET | `/api/repos/{id}/versions` | All known versions + status |
| POST | `/api/repos/{id}/versions/{tag}/index` | Index a specific tag |
| POST | `/api/repos/{id}/versions/{tag}/reindex` | Force re-index |
| GET | `/api/jobs` | List jobs (filter by repo/version/status) |
| GET | `/api/jobs/{id}` | Job detail with stages |
| GET | `/api/jobs/{id}/log` (SSE) | Live log stream |
| GET | `/api/jobs/stream` (SSE) | Live job-status events for the dashboard |
| POST | `/api/search` | Hybrid search across one or more (repo, version) scopes |
| GET | `/api/resolve?q=react` | Library-ID resolution preview |
| GET | `/api/observability/metrics` | UI-friendly aggregated metrics JSON |
| GET | `/api/observability/resources` | Heap, GPU mem (via NVML when present), index size |
| GET | `/swagger-ui/index.html` | Swagger UI |
| GET | `/v3/api-docs` | OpenAPI JSON |
| ANY | `/mcp` | MCP Streamable HTTP endpoint |
| GET | `/actuator/prometheus` | Prometheus scrape (optional) |
| GET | `/**` | SPA fallback to `index.html` |

---

## 9. Concurrency & Performance

- **Virtual threads everywhere** for I/O (HTTP, JGit, file I/O, Lucene reads).
- **`Tomcat` configured with virtual-thread executor** (`spring.threads.virtual.enabled=true`).
- **Per-stage logical pools** are unbounded virtual-thread executors per orchestrator instance.
- **GPU access gated by a `Semaphore`** with permits = number of ONNX sessions (configurable, default = 2).
- **Lucene writer**: single `IndexWriter` instance protected by a queue; readers use a refresh-on-search `SearcherManager`.
- **Embedding cache**: file-per-hash on disk under `data/embedding-cache/`; hot LRU in memory.
- **Tag concurrency**: not capped (per spec), but each tag job awaits the GPU semaphore — natural backpressure.
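
The semaphore gate in the list above can be sketched with plain `java.util.concurrent`. The class `GpuGate` is a hypothetical name, and the `Supplier` stands in for an ONNX inference call. In the design described here the callers would be virtual threads, which park cheaply while waiting for a permit.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical gate: at most sessionCount inference calls run at the same time. */
final class GpuGate {

    private final Semaphore permits;

    GpuGate(int sessionCount) {
        this.permits = new Semaphore(sessionCount);
    }

    /** Blocks until a permit is free, runs the inference, and always releases the permit. */
    <T> T run(Supplier<T> inference) {
        permits.acquireUninterruptibly();
        try {
            return inference.get();
        } finally {
            permits.release();
        }
    }

    /** Permits currently free; the kind of value a gpu_semaphore_available gauge could report. */
    int available() {
        return permits.availablePermits();
    }
}
```

Because the permit is released in `finally`, a failed batch does not leak capacity, and any number of tag jobs can queue on the gate without more than `sessionCount` of them touching the GPU at once.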

---

## 10. Observability

- **Metrics** via Micrometer (`MeterRegistry`):
  - Counters: chunks_embedded, chunks_reused, files_skipped, jobs_succeeded/failed.
  - Timers: stage durations per stage name.
  - Gauges: active_jobs, gpu_semaphore_available, lucene_index_size_bytes, heap_used.
- **OpenTelemetry traces** for every job (one trace per `IngestionJob`, span per `JobStage`).
- **JobEventBus**: in-process pub/sub. SSE controllers subscribe and push events to UI.
- **UI dashboards** (no Grafana required):
  - "Live" tab: progress bars per running (repo, tag), per-stage throughput, log tail.
  - "History" tab: paginated jobs table.
  - "Stats" tab: per-stage timing histograms, chunk counts per repo/version, chunk dedupe ratio.
  - "Resources" tab: heap, GPU memory (NVML where available), index size on disk.
- **Prometheus** scraping is opt-in (Actuator endpoint).

---

## 11. Storage Layout (on disk)

```
$TRUEREF_HOME/                 # default: ./data
├── h2/                        # H2 database files
├── lucene/                    # single index dir; one Lucene writer
├── repos/                     # managed clones (when managedClone=true)
│   └── <repoId>/...
├── embedding-cache/           # one file per content_hash → fp16 vector bytes
├── models/                    # ONNX model files (auto-downloaded on first run)
│   ├── bge-m3/
│   └── bge-reranker-v2-m3/
└── logs/
```

---

## 12. Configuration (excerpt)

```yaml
trueref:
  home: ${TRUEREF_HOME:./data}
  ingestion:
    poll-interval-default: 1h
    tag-cap-default: 100                    # most-recent N tags by semver/date
    max-file-size-bytes-default: 1048576
  embedding:
    model: bge-m3
    onnx-providers: [cuda, directml, cpu]   # tried in order
    session-count: 2                        # = GPU semaphore permits
    batch-size: 32
  reranker:
    model: bge-reranker-v2-m3
    top-k: 50
  search:
    rrf-k: 60
    final-top-k: 20
  mcp:
    tokens-default: 5000
    tokens-min: 500
    tokens-max: 50000
spring:
  threads.virtual.enabled: true
```

---

## 13. Out-of-the-box behaviors locked from clarifications

- **Auth**: none (LAN-only) on REST and MCP.
- **Tag selection**: default cap 100 most-recent; on-demand index of any tag via UI search OR via MCP when an unindexed version is requested.
- **Differential indexing**: dedupe by `content_hash` AND skip unchanged files via `git diff parent..tag`.
- **Repo input**: UI-add (local path or remote URL) AND watched folder `./data/watched/` for bare repos.
- **Re-index trigger**: on-demand + scheduled `git fetch` poll (default 1h per repo).
- **Stale tag cleanup**: soft delete via `Version.status=INACTIVE`; compaction job reclaims orphan chunks.
- **Embedding cache**: persistent on disk, keyed by `content_hash`.
- **Concurrency**: unbounded virtual threads, GPU semaphore-gated.

See [FINDINGS.md](FINDINGS.md) for research backing each choice.