mozempk/trueref


Giancarmine Salucci 9525c58e9a feat(TRUEREF-0023): add sqlite-vec search pipeline

2026-04-01 14:09:19 +02:00

12 KiB


Architecture

Last Updated: 2026-04-01T12:05:23.000Z

Overview

TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a worker-threaded indexing pipeline backed by SQLite via better-sqlite3, Drizzle ORM, FTS5, and sqlite-vec.

Primary language: TypeScript (147 .ts files) with a small amount of JavaScript configuration and build code (2 .js files), excluding generated output and dependencies
Application type: Full-stack SvelteKit application with server-side indexing, retrieval, and MCP integration
Runtime framework: SvelteKit with adapter-node
Storage: SQLite in WAL mode with Drizzle-managed relational schema, FTS5 full-text indexes, and sqlite-vec virtual tables for vector lookup
Concurrency: Node.js worker_threads for parse, embed, and auxiliary write-worker infrastructure
Testing: Vitest for unit and integration coverage

Project Structure

src/routes: SvelteKit pages and HTTP endpoints, including the public UI and /api/v1 surface
src/lib/server: Backend implementation grouped by concern: api, config, crawler, db, embeddings, mappers, models, parser, pipeline, search, services, utils
src/mcp: Standalone MCP server entry point, client, tests, and tool handlers
scripts: Build helpers, including worker bundling
static: Static assets such as robots.txt
docs/features: Feature-level implementation notes and product documentation
build: Generated SvelteKit output and bundled worker entrypoints

Key Directories

`src/routes`

Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, version discovery, indexing jobs, search/context retrieval, embedding settings, indexing settings, filesystem browsing, worker-status inspection, and SSE progress streaming.

`src/lib/server/db`

Owns SQLite schema definitions, relational migrations, connection bootstrapping, and sqlite-vec loading. Database startup goes through initializeDatabase() and getClient(), both of which configure WAL-mode pragmas and ensure sqlite-vec is loaded on each connection before vector-backed queries run.

`src/lib/server/search`

Implements keyword, vector, and hybrid retrieval. Keyword search uses SQLite FTS5 and BM25-style ranking. Vector search uses SqliteVecStore to maintain per-profile sqlite-vec vec0 tables plus rowid mapping tables, and hybrid search blends FTS and vector candidates through reciprocal rank fusion.
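As a rough illustration of the fusion step, the sketch below merges two ranked ID lists with reciprocal rank fusion. The function name, the sample snippet IDs, and the constant k = 60 (a commonly used default) are assumptions for illustration; the actual HybridSearchService may weight or normalize differently.

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// with 1-based ranks. k = 60 is an assumed default, not taken from TrueRef.
function reciprocalRankFusion(
  lists: string[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical snippet IDs from the keyword (FTS5) and vector paths.
const keywordHits = ["s1", "s2", "s3"];
const vectorHits = ["s3", "s1", "s4"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
```

Items that appear in both lists accumulate two contributions, so they tend to outrank items found by only one path, without requiring the FTS and vector scores to share a scale.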

`src/lib/server/pipeline`

Coordinates crawl, diff, parse, store, embed, and job-state broadcasting. The pipeline module consists of:

IndexingPipeline: orchestrates crawl, diff, parse, transactional replacement, optional embedding generation, and repository statistics updates
WorkerPool: manages parse workers, an optional embed worker, an optional write worker, per-repository-and-version serialization, worker respawn, and runtime concurrency changes
worker-entry.ts: parse worker that opens its own better-sqlite3 connection, runs the indexing pipeline, and reports progress back to the parent
embed-worker-entry.ts: embedding worker that loads the active profile, creates an EmbeddingService, and generates vectors after parse completion
write-worker-entry.ts: batch-write worker with a write/write_ack/write_error message protocol for document and snippet persistence
progress-broadcaster.ts: server-side pub/sub for per-job, per-repository, global, and worker-status SSE streams
startup.ts: recovers stale jobs, constructs singleton queue/pipeline/pool/broadcaster instances, loads concurrency settings, and drains queued work after restart
worker-types.ts: shared TypeScript discriminated unions for parse, embed, and write worker protocols
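The write/write_ack/write_error protocol lends itself to a discriminated union. The sketch below shows the general shape; only the three `type` tags come from this document, while the payload fields (`jobId`, `snippets`, `written`, `error`) and the `describeMessage` helper are hypothetical.

```typescript
// Hypothetical payloads; only the `type` tags are taken from the document.
type WriteRequest = { type: "write"; jobId: string; snippets: string[] };
type WriteAck = { type: "write_ack"; jobId: string; written: number };
type WriteError = { type: "write_error"; jobId: string; error: string };
type WriteWorkerMessage = WriteRequest | WriteAck | WriteError;

function describeMessage(msg: WriteWorkerMessage): string {
  // Switching on `type` narrows `msg` to the matching variant in each case.
  switch (msg.type) {
    case "write":
      return `write ${msg.snippets.length} snippets for ${msg.jobId}`;
    case "write_ack":
      return `ack ${msg.written} rows for ${msg.jobId}`;
    case "write_error":
      return `error for ${msg.jobId}: ${msg.error}`;
    default: {
      // Compile-time exhaustiveness check: adding a new variant without
      // handling it makes this assignment a type error.
      const unreachable: never = msg;
      return unreachable;
    }
  }
}
```

A handler on either side of the worker boundary can then switch on `type` and let the compiler flag any unhandled message variant.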

`src/lib/server/crawler` and `src/lib/server/parser`

Convert GitHub repositories and local folders into normalized snippet records. Crawlers fetch repository contents and configuration, parsers split Markdown, code, config, HTML-like, and plain-text files into searchable snippet records, and downstream services persist searchable content and embeddings.

`src/mcp`

Provides a thin compatibility layer over the HTTP API. The MCP server exposes resolve-library-id and query-docs over stdio or HTTP and forwards work to local handlers that reuse the application retrieval stack.

Design Patterns

Service layer: business logic lives in classes such as RepositoryService, VersionService, SearchService, HybridSearchService, and EmbeddingService
Factory pattern: embedding providers are created from persisted profile records through registry/factory helpers
Mapper/entity separation: mappers translate between raw database rows and domain entities such as RepositoryEntity, RepositoryVersionEntity, and EmbeddingProfileEntity
Module-level singletons: pipeline startup owns lifecycle for JobQueue, IndexingPipeline, WorkerPool, and ProgressBroadcaster, with accessor functions for route handlers
Pub/sub: ProgressBroadcaster maintains job, repository, global, and worker-status subscriptions for SSE delivery
Discriminated unions: worker message protocols use a type field for type-safe parent/worker communication
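To make the factory pattern concrete, here is a minimal registry sketch that builds a provider from a persisted profile record. The `EmbeddingProfile` and `EmbeddingProvider` shapes, the `providerKind` field, and both helper functions are assumptions for illustration and are not taken from the TrueRef source.

```typescript
// Hypothetical shapes for a persisted profile and the provider built from it.
interface EmbeddingProfile {
  id: string;
  providerKind: string;
  dimensions: number;
}

interface EmbeddingProvider {
  readonly name: string;
  readonly dimensions: number;
}

type ProviderFactory = (profile: EmbeddingProfile) => EmbeddingProvider;

const providerRegistry = new Map<string, ProviderFactory>();

function registerProvider(kind: string, factory: ProviderFactory): void {
  providerRegistry.set(kind, factory);
}

function createProvider(profile: EmbeddingProfile): EmbeddingProvider {
  const factory = providerRegistry.get(profile.providerKind);
  if (!factory) {
    throw new Error(`Unknown provider kind: ${profile.providerKind}`);
  }
  return factory(profile);
}

// Register one illustrative provider kind and build an instance from a record.
registerProvider("local", (profile) => ({
  name: `local:${profile.id}`,
  dimensions: profile.dimensions,
}));

const provider = createProvider({ id: "default", providerKind: "local", dimensions: 384 });
```

Keeping construction behind a registry means callers only need the profile row, and new provider kinds can be added without touching the call sites.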

Key Components

SvelteKit server bootstrap

src/hooks.server.ts initializes the relational database, opens the shared raw SQLite client, loads the default embedding profile, creates the optional EmbeddingService, reads indexing concurrency from the settings table, and initializes the queue/pipeline/worker infrastructure.

Database layer

src/lib/server/db/schema.ts defines repositories, repository versions, documents, snippets, embedding profiles, relational embedding metadata, indexing jobs, repository configs, and generic settings. Relational embedding rows keep canonical model metadata and raw float buffers, while sqlite-vec virtual tables are managed separately per profile through SqliteVecStore.

sqlite-vec integration

src/lib/server/db/sqlite-vec.ts centralizes sqlite-vec loading and deterministic per-profile table naming. SqliteVecStore creates vec0 tables plus rowid mapping tables, backfills missing rows from snippet_embeddings, removes stale vector references, and executes nearest-neighbor queries constrained by repository, optional version, and profile.
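Table names cannot be bound as SQL parameters, so a per-profile table name has to be derived from the profile ID in a deterministic and identifier-safe way. The sketch below shows one possible scheme; the `vec_` prefix, the `_rowids` suffix, and the function name are assumptions, and the real naming in sqlite-vec.ts may differ.

```typescript
// One possible deterministic naming scheme (assumed, not TrueRef's actual one).
function vecTableNames(profileId: string): { vectors: string; rowids: string } {
  const safe = profileId
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_") // collapse anything outside [a-z0-9] into "_"
    .replace(/^_+|_+$/g, ""); // trim leading and trailing underscores
  if (safe.length === 0) {
    throw new Error("Profile id has no usable characters");
  }
  return { vectors: `vec_${safe}`, rowids: `vec_${safe}_rowids` };
}

const names = vecTableNames("OpenAI/text-embedding-3");
```

Note that this simple sanitizer maps distinct IDs such as `a-b` and `a_b` to the same name; a production scheme would need to rule out such collisions, for example by appending a hash of the original ID.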

Retrieval API

src/routes/api/v1/context/+server.ts validates input, resolves repository and optional version scope, chooses keyword, semantic, or hybrid retrieval, applies token budgeting, and formats JSON or text responses. /api/v1/libs/search handles repository-level lookup, while MCP tool handlers expose the same retrieval behavior over stdio or HTTP transports.
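Token budgeting can be sketched as a greedy pass over the ranked snippets. The code below is an illustration under stated assumptions: the four-characters-per-token estimate, the `RankedSnippet` shape, and the choice to skip oversized snippets rather than stop at the first one are all hypothetical, not a description of the actual endpoint.

```typescript
interface RankedSnippet {
  id: string;
  content: string;
}

// Crude token estimate (assumption): roughly four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk the ranked list in order and keep every snippet that still fits.
function selectWithinBudget(ranked: RankedSnippet[], maxTokens: number): RankedSnippet[] {
  const selected: RankedSnippet[] = [];
  let used = 0;
  for (const snippet of ranked) {
    const cost = estimateTokens(snippet.content);
    if (used + cost > maxTokens) continue; // too large; try lower-ranked ones
    selected.push(snippet);
    used += cost;
  }
  return selected;
}

const rankedSnippets: RankedSnippet[] = [
  { id: "a", content: "x".repeat(40) }, // 10 estimated tokens
  { id: "b", content: "x".repeat(100) }, // 25 estimated tokens
  { id: "c", content: "x".repeat(20) }, // 5 estimated tokens
];
const chosen = selectWithinBudget(rankedSnippets, 16);
```

Replacing `continue` with `break` would instead stop at the first snippet that does not fit, which preserves strict rank order at the cost of leaving some of the budget unused.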

Search engine

SearchService preprocesses raw user input into FTS5-safe expressions before keyword search. HybridSearchService supports explicit keyword, semantic, and hybrid modes, falls back to vector retrieval when keyword search yields no candidates and an embedding provider is configured, and uses reciprocal rank fusion to merge ranked lists. VectorSearch delegates KNN execution to SqliteVecStore instead of doing brute-force in-memory cosine scoring.
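One common way to make raw input safe for an FTS5 MATCH expression is to quote every whitespace-separated token as an FTS5 string, doubling any embedded double quotes. The sketch below shows that approach; it is an assumption about the general technique and does not reproduce the actual preprocessing in SearchService.

```typescript
// Quote each token so FTS5 treats operators such as AND, NEAR, "-" or "("
// as literal text. Inside an FTS5 string, a double quote is escaped by doubling it.
function toFts5Query(raw: string): string {
  return raw
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => `"${token.replace(/"/g, '""')}"`)
    .join(" "); // adjacent phrases are implicitly ANDed by FTS5
}

const safeQuery = toFts5Query("  foo-bar  NEAR(x ");
```

An empty or whitespace-only input yields an empty string, so a caller would need to skip the MATCH query in that case rather than pass it to SQLite.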

Repository and version management

RepositoryService and VersionService provide CRUD, indexing-status, cleanup, and statistics logic for indexed repositories and tagged versions, including sqlite-vec cleanup when repository-scoped or version-scoped content is removed.

Worker-threaded indexing

The active indexing path is parse-worker-first: queued jobs are dispatched to parse workers, progress is written to SQLite and broadcast over SSE, and a successfully completed parse job can enqueue embedding work on the dedicated embed worker. The worker pool also exposes status snapshots through /api/v1/workers. The write worker is implemented and bundled at build time, but the parse/embed flow described by IndexingPipeline and WorkerPool remains the primary live path.

SSE streaming and job control

progress-broadcaster.ts provides real-time Server-Sent Event streaming of indexing progress. Route handlers under /api/v1/jobs/stream and /api/v1/jobs/[id]/stream expose SSE endpoints, and /api/v1/workers exposes worker-pool status. Job control endpoints support pause, resume, and cancel transitions backed by SQLite job state.
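The combination of topic-based pub/sub and SSE framing can be sketched as follows. The class name, the `job:42` topic naming, the `progress` event name, and the payload shape are hypothetical; only the `event:`/`data:` framing with a blank-line terminator comes from the Server-Sent Events format itself.

```typescript
type Listener<T> = (event: T) => void;

// Minimal topic-based pub/sub (illustrative, not the real ProgressBroadcaster).
class TopicBroadcaster<T> {
  private readonly topics = new Map<string, Set<Listener<T>>>();

  // Returns an unsubscribe function for the caller to invoke on disconnect.
  subscribe(topic: string, listener: Listener<T>): () => void {
    const listeners = this.topics.get(topic) ?? new Set<Listener<T>>();
    this.topics.set(topic, listeners);
    listeners.add(listener);
    return () => {
      listeners.delete(listener);
      // Drop the topic once empty, but only if it still maps to this set.
      if (listeners.size === 0 && this.topics.get(topic) === listeners) {
        this.topics.delete(topic);
      }
    };
  }

  publish(topic: string, event: T): void {
    this.topics.get(topic)?.forEach((listener) => listener(event));
  }
}

// One SSE frame: an event name line, a single-line JSON data line, blank line.
function toSseFrame(eventName: string, payload: unknown): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

interface JobProgress {
  jobId: string;
  percent: number;
}

const broadcaster = new TopicBroadcaster<JobProgress>();
const frames: string[] = [];
const unsubscribe = broadcaster.subscribe("job:42", (event) => {
  frames.push(toSseFrame("progress", event));
});
broadcaster.publish("job:42", { jobId: "42", percent: 50 });
unsubscribe(); // e.g. when the HTTP client disconnects
broadcaster.publish("job:42", { jobId: "42", percent: 100 }); // no listener left
```

In a real route handler the listener would write each frame to the response stream, and the unsubscribe function would run when the stream is cancelled.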

Indexing settings

/api/v1/settings/indexing exposes GET and PUT for indexing concurrency. The value is persisted in the settings table and applied live to the WorkerPool through setMaxConcurrency().
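Applying a concurrency change live usually means re-running the dispatch loop whenever the limit changes. The sketch below illustrates that idea with a hypothetical `MiniPool`; apart from the method name `setMaxConcurrency`, nothing here is taken from the actual WorkerPool, and the behavior on lowering the limit (let running jobs finish, start nothing new) is an assumption.

```typescript
// Illustrative pool: job IDs stand in for real worker dispatch.
class MiniPool {
  private readonly queue: string[] = [];
  private readonly running = new Set<string>();
  private maxConcurrency: number;
  private readonly start: (jobId: string) => void;

  constructor(maxConcurrency: number, start: (jobId: string) => void) {
    this.maxConcurrency = maxConcurrency;
    this.start = start;
  }

  enqueue(jobId: string): void {
    this.queue.push(jobId);
    this.drain();
  }

  finish(jobId: string): void {
    this.running.delete(jobId);
    this.drain();
  }

  // Raising the limit starts queued jobs at once; lowering it never
  // interrupts running jobs, it only delays new starts.
  setMaxConcurrency(value: number): void {
    this.maxConcurrency = Math.max(1, Math.floor(value));
    this.drain();
  }

  private drain(): void {
    while (this.running.size < this.maxConcurrency && this.queue.length > 0) {
      const next = this.queue.shift() as string;
      this.running.add(next);
      this.start(next);
    }
  }
}

const started: string[] = [];
const pool = new MiniPool(1, (jobId) => {
  started.push(jobId);
});
pool.enqueue("a");
pool.enqueue("b");
pool.enqueue("c"); // only "a" is running at this point
pool.setMaxConcurrency(2); // "b" starts immediately
```

Because every state change funnels through the same drain step, the new limit takes effect without restarting the pool.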

Dependencies

Production

@modelcontextprotocol/sdk: MCP server transport and protocol types
@xenova/transformers: local embedding support
better-sqlite3: synchronous SQLite driver used by the main app and workers
sqlite-vec: SQLite vector extension used for vec0 storage and nearest-neighbor queries
zod: runtime validation for MCP tools and server helpers

Development

@sveltejs/kit and @sveltejs/adapter-node: application framework and Node deployment target
drizzle-kit and drizzle-orm: schema management and typed database access
esbuild: worker entrypoint bundling into build/workers
vite and @tailwindcss/vite: application bundling and Tailwind integration
vitest and @vitest/browser-playwright: server and browser test execution
eslint, typescript-eslint, eslint-plugin-svelte, prettier, prettier-plugin-svelte, prettier-plugin-tailwindcss: linting and formatting
typescript and @types/node: type-checking and Node typings

Module Organization

The backend is organized by responsibility rather than by route. HTTP handlers under src/routes/api/v1 are intentionally thin and delegate to modules in src/lib/server. Within src/lib/server, concerns are separated into:

models and mappers for entity translation
services for repository/version operations
search for keyword, vector, and hybrid retrieval strategies
crawler and parser for indexing input transformation
pipeline for orchestration, workers, and job execution
embeddings for provider abstraction and vector generation
db, api, config, and utils for persistence, response formatting, validation, and shared helpers

The frontend and backend live in the same SvelteKit repository, but most non-UI behavior is implemented on the server side.

Data Flow

Indexing flow

Server startup runs database initialization, opens the shared client, loads sqlite-vec, and initializes the pipeline singletons.
Startup recovery marks interrupted jobs as failed, resets repositories stuck in indexing, reads persisted concurrency settings, and drains queued jobs.
JobQueue dispatches eligible work to the WorkerPool, which serializes by (repositoryId, versionId) and posts jobs to idle parse workers.
Each parse worker opens its own SQLite connection, crawls the source, computes differential work, parses files into snippets, and persists replacement data through the indexing pipeline.
The parent thread updates job progress in SQLite and broadcasts SSE progress and worker-status events.
If an embedding provider is configured, the completed parse job triggers embed work that stores canonical embedding blobs and synchronizes sqlite-vec profile tables for nearest-neighbor lookup.
Repository/version statistics and job status are finalized in SQLite, and control endpoints can pause, resume, or cancel subsequent queued work.
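The serialization by (repositoryId, versionId) in the dispatch step can be sketched as a selection function that never picks two jobs with the same key. The `QueuedJob` shape, the key encoding, and the `pickDispatchable` helper are assumptions for illustration, not the actual JobQueue or WorkerPool code.

```typescript
interface QueuedJob {
  id: string;
  repositoryId: string;
  versionId: string | null;
}

// JSON-encode the pair so that no choice of IDs can produce the same key
// for two different (repositoryId, versionId) combinations.
function jobKey(job: QueuedJob): string {
  return JSON.stringify([job.repositoryId, job.versionId]);
}

// Pick jobs in FIFO order, at most one per key, up to the idle worker count.
function pickDispatchable(
  queue: QueuedJob[],
  runningKeys: Set<string>,
  idleWorkers: number
): QueuedJob[] {
  const picked: QueuedJob[] = [];
  const busy = new Set(runningKeys);
  for (const job of queue) {
    if (picked.length >= idleWorkers) break;
    const key = jobKey(job);
    if (busy.has(key)) continue; // same repository and version already in flight
    busy.add(key);
    picked.push(job);
  }
  return picked;
}

// Hypothetical queue contents.
const queuedJobs: QueuedJob[] = [
  { id: "j1", repositoryId: "/org/a", versionId: "v1" },
  { id: "j2", repositoryId: "/org/a", versionId: "v1" }, // same key as j1
  { id: "j3", repositoryId: "/org/a", versionId: "v2" },
  { id: "j4", repositoryId: "/org/b", versionId: null },
];
const dispatchable = pickDispatchable(queuedJobs, new Set<string>(), 3);
```

Here `j2` stays queued until `j1` finishes, while jobs for other versions or repositories proceed in parallel.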

Retrieval flow

Clients call /api/v1/libs/search, /api/v1/context, or the MCP tools.
Route handlers validate input and use the shared SQLite client.
Keyword search uses FTS5 through SearchService; semantic search uses sqlite-vec KNN through VectorSearch; hybrid search merges both paths with reciprocal rank fusion.
Retrieval is scoped by repository and optional version. When keyword search yields no usable candidates and an embedding provider is configured, retrieval falls back to vector search.
Token budgeting selects ranked snippets for the response formatter, which emits repository-aware JSON or text payloads.

Build System

Build command: npm run build (runs vite build then node scripts/build-workers.mjs)
Worker bundling: scripts/build-workers.mjs uses esbuild to compile worker-entry.ts, embed-worker-entry.ts, and write-worker-entry.ts into build/workers/ as ESM bundles
Test command: npm test
Primary local run command: npm run dev
MCP entry points: npm run mcp:start and npm run mcp:http
