# Architecture
Last Updated: 2026-04-01T12:05:23.000Z
## Overview
TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a worker-threaded indexing pipeline backed by SQLite via better-sqlite3, Drizzle ORM, FTS5, and sqlite-vec.
- Primary language: TypeScript (147 `.ts` files) with a small amount of JavaScript configuration and build code (2 `.js` files), excluding generated output and dependencies
- Application type: Full-stack SvelteKit application with server-side indexing, retrieval, and MCP integration
- Runtime framework: SvelteKit with adapter-node
- Storage: SQLite in WAL mode with Drizzle-managed relational schema, FTS5 full-text indexes, and sqlite-vec virtual tables for vector lookup
- Concurrency: Node.js `worker_threads` for parse and embed workers, plus auxiliary write-worker infrastructure
- Testing: Vitest for unit and integration coverage
## Project Structure
- `src/routes`: SvelteKit pages and HTTP endpoints, including the public UI and `/api/v1` surface
- `src/lib/server`: Backend implementation grouped by concern: `api`, `config`, `crawler`, `db`, `embeddings`, `mappers`, `models`, `parser`, `pipeline`, `search`, `services`, `utils`
- `src/mcp`: Standalone MCP server entry point, client, tests, and tool handlers
- `scripts`: Build helpers, including worker bundling
- `static`: Static assets such as `robots.txt`
- `docs/features`: Feature-level implementation notes and product documentation
- `build`: Generated SvelteKit output and bundled worker entrypoints
## Key Directories
### `src/routes`
Contains the UI entry points and API routes. The API tree under `src/routes/api/v1` is the public HTTP contract for repository management, version discovery, indexing jobs, search/context retrieval, embedding settings, indexing settings, filesystem browsing, worker-status inspection, and SSE progress streaming.
### `src/lib/server/db`
Owns SQLite schema definitions, relational migrations, connection bootstrapping, and sqlite-vec loading. Database startup goes through `initializeDatabase()` and `getClient()`, both of which configure WAL-mode pragmas and ensure sqlite-vec is loaded on each connection before vector-backed queries run.
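
The per-connection setup described above can be sketched as follows. This is a minimal illustration, not the project's code: `configureConnection` and the `SqliteConnection` interface are hypothetical, and the `synchronous` and `foreign_keys` pragmas are assumed, since the text only states that WAL-mode pragmas are applied. A fake connection stands in for the database so the sketch runs without native modules.

```typescript
// Minimal structural type (hypothetical). A better-sqlite3 Database
// exposes a compatible `pragma` method.
interface SqliteConnection {
  pragma(statement: string): unknown;
}

// Hypothetical helper: apply pragmas first, then load the vector extension
// through the supplied loader so vector-backed queries can run afterwards.
function configureConnection<T extends SqliteConnection>(
  db: T,
  loadExtension: (db: T) => void,
): T {
  db.pragma("journal_mode = WAL");
  db.pragma("synchronous = NORMAL"); // assumed, not confirmed by the text
  db.pragma("foreign_keys = ON"); // assumed, not confirmed by the text
  loadExtension(db);
  return db;
}

// Fake connection that records calls in order.
const calls: string[] = [];
const fakeDb: SqliteConnection = {
  pragma: (statement) => {
    calls.push(`pragma ${statement}`);
    return undefined;
  },
};
configureConnection(fakeDb, () => calls.push("load sqlite-vec"));
console.log(calls);
```

With the real libraries, the same helper could presumably be called with a `better-sqlite3` database and a loader that calls the `load` function exported by the `sqlite-vec` package.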
### `src/lib/server/search`
Implements keyword, vector, and hybrid retrieval. Keyword search uses SQLite FTS5 and BM25-style ranking. Vector search uses `SqliteVecStore` to maintain per-profile sqlite-vec `vec0` tables plus rowid mapping tables, and hybrid search blends FTS and vector candidates through reciprocal rank fusion.
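
Reciprocal rank fusion scores each candidate by summing `1 / (k + rank)` over every ranked list it appears in. The sketch below shows the general technique only. The function name, the snippet ids, and the constant `k = 60` (a commonly used default) are assumptions, not values taken from TrueRef.

```typescript
// Fuse several ranked id lists. Ranks are 1-based, and a higher fused
// score means a better overall position across the lists.
function reciprocalRankFusion(
  lists: string[][],
  k = 60, // assumed smoothing constant
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

const keywordHits = ["s1", "s2", "s3"]; // hypothetical FTS5 ranking
const vectorHits = ["s3", "s1", "s4"]; // hypothetical KNN ranking
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
console.log(fused.map((entry) => entry.id)); // [ 's1', 's3', 's2', 's4' ]
```

Snippets that appear in both lists (`s1`, `s3`) outrank those found by only one strategy, which is the behavior hybrid retrieval relies on.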
### `src/lib/server/pipeline`
Coordinates crawl, diff, parse, store, embed, and job-state broadcasting. The pipeline module consists of:
- `IndexingPipeline`: orchestrates crawl, diff, parse, transactional replacement, optional embedding generation, and repository statistics updates
- `WorkerPool`: manages parse workers, an optional embed worker, an optional write worker, per-repository-and-version serialization, worker respawn, and runtime concurrency changes
- `worker-entry.ts`: parse worker that opens its own `better-sqlite3` connection, runs the indexing pipeline, and reports progress back to the parent
- `embed-worker-entry.ts`: embedding worker that loads the active profile, creates an `EmbeddingService`, and generates vectors after parse completion
- `write-worker-entry.ts`: batch-write worker with a `write`/`write_ack`/`write_error` message protocol for document and snippet persistence
- `progress-broadcaster.ts`: server-side pub/sub for per-job, per-repository, global, and worker-status SSE streams
- `startup.ts`: recovers stale jobs, constructs singleton queue/pipeline/pool/broadcaster instances, loads concurrency settings, and drains queued work after restart
- `worker-types.ts`: shared TypeScript discriminated unions for parse, embed, and write worker protocols
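
The write-worker message protocol listed above can be modeled as a discriminated union. Only the `write`, `write_ack`, and `write_error` tags come from this document. The payload fields and the `handleWrite` and `describeResponse` helpers are hypothetical.

```typescript
type SnippetRow = { id: string; content: string };

// Parent -> worker request (payload shape is assumed).
type WriteRequest = { type: "write"; batchId: number; snippets: SnippetRow[] };

// Worker -> parent responses (payload shapes are assumed).
type WriteAck = { type: "write_ack"; batchId: number; written: number };
type WriteError = { type: "write_error"; batchId: number; error: string };
type WriteWorkerResponse = WriteAck | WriteError;

// Worker side: persist a batch and answer with an ack or an error message.
function handleWrite(
  request: WriteRequest,
  persist: (rows: SnippetRow[]) => number,
): WriteWorkerResponse {
  try {
    return { type: "write_ack", batchId: request.batchId, written: persist(request.snippets) };
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    return { type: "write_error", batchId: request.batchId, error };
  }
}

// Parent side: the `type` field narrows the union in each branch.
function describeResponse(message: WriteWorkerResponse): string {
  switch (message.type) {
    case "write_ack":
      return `batch ${message.batchId}: wrote ${message.written} rows`;
    case "write_error":
      return `batch ${message.batchId}: failed (${message.error})`;
    default: {
      const unreachable: never = message; // compile-time exhaustiveness check
      return unreachable;
    }
  }
}

const ok = handleWrite(
  { type: "write", batchId: 1, snippets: [{ id: "s1", content: "hello" }] },
  (rows) => rows.length,
);
const failed = handleWrite({ type: "write", batchId: 2, snippets: [] }, () => {
  throw new Error("database is locked");
});
console.log(describeResponse(ok));
console.log(describeResponse(failed));
```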
### `src/lib/server/crawler` and `src/lib/server/parser`
Convert GitHub repositories and local folders into normalized snippet records. Crawlers fetch repository contents and configuration, parsers split Markdown, code, config, HTML-like, and plain-text files into searchable snippet records, and downstream services persist searchable content and embeddings.
### `src/mcp`
Provides a thin compatibility layer over the HTTP API. The MCP server exposes `resolve-library-id` and `query-docs` over stdio or HTTP and forwards work to local handlers that reuse the application retrieval stack.
## Design Patterns
- **Service layer**: business logic lives in classes such as `RepositoryService`, `VersionService`, `SearchService`, `HybridSearchService`, and `EmbeddingService`
- **Factory pattern**: embedding providers are created from persisted profile records through registry/factory helpers
- **Mapper/entity separation**: mappers translate between raw database rows and domain entities such as `RepositoryEntity`, `RepositoryVersionEntity`, and `EmbeddingProfileEntity`
- **Module-level singletons**: pipeline startup owns lifecycle for `JobQueue`, `IndexingPipeline`, `WorkerPool`, and `ProgressBroadcaster`, with accessor functions for route handlers
- **Pub/sub**: `ProgressBroadcaster` maintains job, repository, global, and worker-status subscriptions for SSE delivery
- **Discriminated unions**: worker message protocols use a `type` field for type-safe parent/worker communication
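
The pub/sub pattern can be illustrated with a small topic-based broadcaster. This is a generic sketch under assumed names (`TopicBroadcaster`, the `job:42` and `global` topic keys, the event shape) and does not reproduce the `ProgressBroadcaster` API.

```typescript
type Listener<T> = (event: T) => void;

// Hypothetical broadcaster: listeners subscribe per topic and receive
// every event published to that topic until they unsubscribe.
class TopicBroadcaster<T> {
  private topics = new Map<string, Set<Listener<T>>>();

  subscribe(topic: string, listener: Listener<T>): () => void {
    let listeners = this.topics.get(topic);
    if (!listeners) {
      listeners = new Set();
      this.topics.set(topic, listeners);
    }
    listeners.add(listener);
    const owned = listeners;
    return () => {
      owned.delete(listener);
      // Drop the topic once its last listener is gone.
      if (owned.size === 0 && this.topics.get(topic) === owned) {
        this.topics.delete(topic);
      }
    };
  }

  // Returns how many listeners received the event.
  publish(topic: string, event: T): number {
    const listeners = this.topics.get(topic);
    if (!listeners) return 0;
    listeners.forEach((listener) => listener(event));
    return listeners.size;
  }
}

const broadcaster = new TopicBroadcaster<{ jobId: string; processed: number }>();
const received: string[] = [];
const unsubscribe = broadcaster.subscribe("job:42", (e) => {
  received.push(`job ${e.jobId}: ${e.processed}`);
});
broadcaster.subscribe("global", (e) => {
  received.push(`global ${e.jobId}`);
});
broadcaster.publish("job:42", { jobId: "42", processed: 10 });
broadcaster.publish("global", { jobId: "42", processed: 10 });
unsubscribe();
const delivered = broadcaster.publish("job:42", { jobId: "42", processed: 20 });
console.log(received, delivered);
```

Returning an unsubscribe function from `subscribe` makes it straightforward to release a listener when an SSE client disconnects.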
## Key Components
### SvelteKit server bootstrap
`src/hooks.server.ts` initializes the relational database, opens the shared raw SQLite client, loads the default embedding profile, creates the optional `EmbeddingService`, reads indexing concurrency from the `settings` table, and initializes the queue/pipeline/worker infrastructure.
### Database layer
`src/lib/server/db/schema.ts` defines repositories, repository versions, documents, snippets, embedding profiles, relational embedding metadata, indexing jobs, repository configs, and generic settings. Relational embedding rows keep canonical model metadata and raw float buffers, while sqlite-vec virtual tables are managed separately per profile through `SqliteVecStore`.
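
As an illustration of what storing "raw float buffers" can look like, the sketch below packs a vector into bytes and back. The 32-bit float encoding in platform byte order is an assumption, and `encodeEmbedding` and `decodeEmbedding` are hypothetical names. The actual column format is not specified here.

```typescript
// Pack a vector into a byte array suitable for a BLOB column
// (assumes float32 values in platform byte order).
function encodeEmbedding(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector);
  return new Uint8Array(floats.buffer);
}

// Unpack a BLOB back into floats. The bytes are copied first so the
// Float32Array view starts at offset 0 and is correctly aligned.
function decodeEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("embedding blob length must be a multiple of 4 bytes");
  }
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

const blob = encodeEmbedding([0.5, -1, 2]);
const roundTripped = decodeEmbedding(blob);
console.log(blob.byteLength, Array.from(roundTripped)); // 12 [ 0.5, -1, 2 ]
```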
### sqlite-vec integration
`src/lib/server/db/sqlite-vec.ts` centralizes sqlite-vec loading and deterministic per-profile table naming. `SqliteVecStore` creates `vec0` tables plus rowid mapping tables, backfills missing rows from `snippet_embeddings`, removes stale vector references, and executes nearest-neighbor queries constrained by repository, optional version, and profile.
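
One way to obtain deterministic per-profile table names is to combine a sanitized slug with a short hash of the profile id. The naming scheme, the `vec_` prefix, and both function names below are assumptions for illustration. The real scheme in `sqlite-vec.ts` may differ.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units, rendered as 8 hex digits.
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical naming helper: the slug keeps the name readable, and the
// hash keeps ids apart when they sanitize to the same slug.
function vecTableName(profileId: string): string {
  const slug = profileId
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "")
    .slice(0, 24);
  return `vec_${slug || "profile"}_${fnv1aHex(profileId)}`;
}

console.log(vecTableName("local-default"));
console.log(vecTableName("local.default"));
```

Because the result contains only lowercase letters, digits, and underscores, it can be interpolated into `CREATE VIRTUAL TABLE` statements without further quoting.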
### Retrieval API
`src/routes/api/v1/context/+server.ts` validates input, resolves repository and optional version scope, chooses keyword, semantic, or hybrid retrieval, applies token budgeting, and formats JSON or text responses. `/api/v1/libs/search` handles repository-level lookup, while MCP tool handlers expose the same retrieval behavior over stdio or HTTP transports.
### Search engine
`SearchService` preprocesses raw user input into FTS5-safe expressions before keyword search. `HybridSearchService` supports explicit keyword, semantic, and hybrid modes, falls back to vector retrieval when keyword search yields no candidates and an embedding provider is configured, and uses reciprocal rank fusion to merge ranked lists. `VectorSearch` delegates KNN execution to `SqliteVecStore` instead of doing brute-force in-memory cosine scoring.
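
A simple way to make raw input safe for an FTS5 `MATCH` expression is to quote every term, so that operators such as `OR`, `NOT`, `*`, or `:` are treated as literal text. The sketch below shows that approach only. `toFts5Query` is a hypothetical name, and the preprocessing in `SearchService` may be more elaborate.

```typescript
// Turn free-form input into a sequence of quoted FTS5 string terms.
// Adjacent quoted terms are combined with an implicit AND by FTS5.
function toFts5Query(raw: string): string {
  return raw
    .split(/\s+/)
    .map((term) => term.replace(/"/g, "")) // drop quotes instead of escaping them
    .filter((term) => term.length > 0)
    .map((term) => `"${term}"`)
    .join(" ");
}

console.log(toFts5Query('useState OR "hooks" NOT'));
// "useState" "OR" "hooks" "NOT"
```

An empty result signals that the input contained no usable terms, in which case a caller would typically skip the keyword query altogether.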
### Repository and version management
`RepositoryService` and `VersionService` provide CRUD, indexing-status, cleanup, and statistics logic for indexed repositories and tagged versions, including sqlite-vec cleanup when repository-scoped or version-scoped content is removed.
### Worker-threaded indexing
The active indexing path is parse-worker-first: queued jobs are dispatched to parse workers, progress is written to SQLite and broadcast over SSE, and successful parse completion can enqueue embedding work on the dedicated embed worker. The worker pool also exposes status snapshots through `/api/v1/workers`. The write worker is implemented and bundled at build time, but the parse and embed flow remains the primary live path through `IndexingPipeline` and `WorkerPool`.
### SSE streaming and job control
`progress-broadcaster.ts` provides real-time Server-Sent Event streaming of indexing progress. Route handlers under `/api/v1/jobs/stream` and `/api/v1/jobs/[id]/stream` expose SSE endpoints, and `/api/v1/workers` exposes worker-pool status. Job control endpoints support pause, resume, and cancel transitions backed by SQLite job state.
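
On the wire, each Server-Sent Event is a block of `field: value` lines terminated by a blank line. The sketch below formats one such frame. The `progress` event name and the payload fields are hypothetical, since the event names used by the streaming endpoints are not listed here.

```typescript
// Serialize one SSE frame: optional id, event name, JSON data, blank line.
function formatSseEvent(event: string, data: unknown, id?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${event}`);
  // Each line of the payload needs its own "data:" prefix.
  for (const line of JSON.stringify(data ?? null).split("\n")) {
    lines.push(`data: ${line}`);
  }
  return lines.join("\n") + "\n\n";
}

const frame = formatSseEvent("progress", { jobId: "j1", processed: 3 });
console.log(JSON.stringify(frame));
// "event: progress\ndata: {\"jobId\":\"j1\",\"processed\":3}\n\n"
```

A browser client would consume such frames through `EventSource` and `addEventListener("progress", ...)`, with the response served as `text/event-stream`.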
### Indexing settings
`/api/v1/settings/indexing` exposes GET and PUT for indexing concurrency. The value is persisted in the `settings` table and applied live to the `WorkerPool` through `setMaxConcurrency()`.
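
Applying a concurrency change live means that raising the limit starts waiting jobs immediately, while lowering it lets running jobs finish before the new limit takes effect. The sketch below illustrates that behavior. Only the method name `setMaxConcurrency` comes from this document, and the `ConcurrencyGate` class, including its minimum of one slot, is hypothetical.

```typescript
// Hypothetical in-memory gate that tracks running and waiting job ids.
class ConcurrencyGate {
  private max: number;
  private running = new Set<string>();
  private waiting: string[] = [];

  constructor(max: number) {
    this.max = Math.max(1, Math.floor(max));
  }

  submit(jobId: string): void {
    this.waiting.push(jobId);
    this.drain();
  }

  finish(jobId: string): void {
    if (this.running.delete(jobId)) this.drain();
  }

  // Running jobs are never interrupted when the limit is lowered.
  setMaxConcurrency(max: number): void {
    this.max = Math.max(1, Math.floor(max));
    this.drain();
  }

  active(): string[] {
    return Array.from(this.running);
  }

  private drain(): void {
    while (this.running.size < this.max) {
      const next = this.waiting.shift();
      if (next === undefined) break;
      this.running.add(next);
    }
  }
}

const gate = new ConcurrencyGate(1);
["a", "b", "c"].forEach((id) => gate.submit(id));
const beforeRaise = gate.active(); // only "a" runs
gate.setMaxConcurrency(2);
const afterRaise = gate.active(); // "b" starts immediately
console.log(beforeRaise, afterRaise);
```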
## Dependencies
### Production
- `@modelcontextprotocol/sdk`: MCP server transport and protocol types
- `@xenova/transformers`: local embedding support
- `better-sqlite3`: synchronous SQLite driver used by the main app and workers
- `sqlite-vec`: SQLite vector extension used for `vec0` storage and nearest-neighbor queries
- `zod`: runtime validation for MCP tools and server helpers
### Development
- `@sveltejs/kit` and `@sveltejs/adapter-node`: application framework and Node deployment target
- `drizzle-kit` and `drizzle-orm`: schema management and typed database access
- `esbuild`: worker entrypoint bundling into `build/workers`
- `vite` and `@tailwindcss/vite`: application bundling and Tailwind integration
- `vitest` and `@vitest/browser-playwright`: server and browser test execution
- `eslint`, `typescript-eslint`, `eslint-plugin-svelte`, `prettier`, `prettier-plugin-svelte`, `prettier-plugin-tailwindcss`: linting and formatting
- `typescript` and `@types/node`: type-checking and Node typings
## Module Organization
The backend is organized by responsibility rather than by route. HTTP handlers under `src/routes/api/v1` are intentionally thin and delegate to modules in `src/lib/server`. Within `src/lib/server`, concerns are separated into:
- `models` and `mappers` for entity translation
- `services` for repository/version operations
- `search` for keyword, vector, and hybrid retrieval strategies
- `crawler` and `parser` for indexing input transformation
- `pipeline` for orchestration, workers, and job execution
- `embeddings` for provider abstraction and vector generation
- `db`, `api`, `config`, and `utils` for persistence, response formatting, validation, and shared helpers
The frontend and backend live in the same SvelteKit repository, but most non-UI behavior is implemented on the server side.
## Data Flow
### Indexing flow
1. Server startup runs database initialization, opens the shared client, loads sqlite-vec, and initializes the pipeline singletons.
2. Startup recovery marks interrupted jobs as failed, resets repositories stuck in `indexing`, reads persisted concurrency settings, and drains queued jobs.
3. `JobQueue` dispatches eligible work to the `WorkerPool`, which serializes by `(repositoryId, versionId)` and posts jobs to idle parse workers.
4. Each parse worker opens its own SQLite connection, crawls the source, computes differential work, parses files into snippets, and persists replacement data through the indexing pipeline.
5. The parent thread updates job progress in SQLite and broadcasts SSE progress and worker-status events.
6. If an embedding provider is configured, the completed parse job triggers embed work that stores canonical embedding blobs and synchronizes sqlite-vec profile tables for nearest-neighbor lookup.
7. Repository/version statistics and job status are finalized in SQLite, and control endpoints can pause, resume, or cancel subsequent queued work.
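
Step 3 can be sketched as a pure selection function: a queued job is dispatchable only if no running or already-selected job shares its `(repositoryId, versionId)` key. The `QueuedJob` shape and both function names are hypothetical. The sketch illustrates the serialization rule, not the `WorkerPool` implementation.

```typescript
interface QueuedJob {
  id: string;
  repositoryId: string;
  versionId: string | null; // null is assumed to mean the default branch
}

// JSON encoding keeps ("a", null) distinct from ("a", "").
function serializationKey(job: QueuedJob): string {
  return JSON.stringify([job.repositoryId, job.versionId]);
}

// Pick jobs in queue order, skipping any whose key is already busy.
function pickDispatchable(
  queue: QueuedJob[],
  running: QueuedJob[],
  idleWorkers: number,
): QueuedJob[] {
  const busy = new Set(running.map(serializationKey));
  const picked: QueuedJob[] = [];
  for (const job of queue) {
    if (picked.length >= idleWorkers) break;
    const key = serializationKey(job);
    if (busy.has(key)) continue;
    busy.add(key);
    picked.push(job);
  }
  return picked;
}

const queue: QueuedJob[] = [
  { id: "j1", repositoryId: "r1", versionId: "v1" },
  { id: "j2", repositoryId: "r1", versionId: "v1" }, // same key as j1
  { id: "j3", repositoryId: "r1", versionId: "v2" },
  { id: "j4", repositoryId: "r2", versionId: null },
];
const dispatched = pickDispatchable(queue, [], 2);
console.log(dispatched.map((job) => job.id)); // [ 'j1', 'j3' ]
```

Here `j2` stays queued until `j1` completes, while `j3` can run in parallel because it targets a different version.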
### Retrieval flow
1. Clients call `/api/v1/libs/search`, `/api/v1/context`, or the MCP tools.
2. Route handlers validate input and use the shared SQLite client.
3. Keyword search uses FTS5 through `SearchService`; semantic search uses sqlite-vec KNN through `VectorSearch`; hybrid search merges both paths with reciprocal rank fusion.
4. Retrieval is scoped by repository and optional version. When keyword search yields no candidates and an embedding provider is configured, retrieval falls back to vector search.
5. Token budgeting selects ranked snippets for the response formatter, which emits repository-aware JSON or text payloads.
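
Token budgeting in step 5 can be approximated by walking the ranked list and keeping snippets while they fit. The sketch below makes two assumptions that are not taken from the source: token counts are estimated at roughly four characters per token, and a snippet that does not fit is skipped so that a smaller, lower-ranked one may still be included.

```typescript
interface RankedSnippet {
  id: string;
  content: string;
}

// Rough heuristic (assumed): about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep snippets in rank order while the running total stays within budget.
function selectWithinBudget(snippets: RankedSnippet[], maxTokens: number): RankedSnippet[] {
  const selected: RankedSnippet[] = [];
  let used = 0;
  for (const snippet of snippets) {
    const cost = estimateTokens(snippet.content);
    if (used + cost > maxTokens) continue; // too large, try the next one
    selected.push(snippet);
    used += cost;
  }
  return selected;
}

const ranked: RankedSnippet[] = [
  { id: "a", content: "x".repeat(40) }, // about 10 tokens
  { id: "b", content: "x".repeat(100) }, // about 25 tokens
  { id: "c", content: "x".repeat(20) }, // about 5 tokens
];
const chosen = selectWithinBudget(ranked, 16);
console.log(chosen.map((snippet) => snippet.id)); // [ 'a', 'c' ]
```

Stopping at the first snippet that does not fit would be an equally valid policy. It preserves strict rank order at the cost of leaving some budget unused.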
## Build System
- Build command: `npm run build` (runs `vite build` then `node scripts/build-workers.mjs`)
- Worker bundling: `scripts/build-workers.mjs` uses esbuild to compile `worker-entry.ts`, `embed-worker-entry.ts`, and `write-worker-entry.ts` into `build/workers/` as ESM bundles
- Test command: `npm test`
- Primary local run command: `npm run dev`
- MCP entry points: `npm run mcp:start` and `npm run mcp:http`
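
The worker bundling step could be expressed with esbuild options along the following lines. This is a sketch, not the contents of `scripts/build-workers.mjs`. The entry paths are inferred from the pipeline directory described earlier, the `WorkerBuildOptions` interface is a local stand-in for esbuild's build options, and marking `better-sqlite3` as external is an assumption based on it being a native module.

```typescript
// Local stand-in for the subset of esbuild build options used here.
interface WorkerBuildOptions {
  entryPoints: string[];
  bundle: boolean;
  platform: "node" | "browser" | "neutral";
  format: "esm" | "cjs" | "iife";
  outdir: string;
  external: string[];
}

const pipelineDir = "src/lib/server/pipeline"; // inferred location

const workerBuildOptions: WorkerBuildOptions = {
  entryPoints: [
    `${pipelineDir}/worker-entry.ts`,
    `${pipelineDir}/embed-worker-entry.ts`,
    `${pipelineDir}/write-worker-entry.ts`,
  ],
  bundle: true,
  platform: "node",
  format: "esm",
  outdir: "build/workers",
  external: ["better-sqlite3"], // assumed: native addons are not bundled
};

console.log(workerBuildOptions);
```

An object of this shape would be passed to esbuild's `build()` function from a Node script that runs after `vite build`.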