diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index d32ec18..323f3ee 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1,15 +1,16 @@
 # Architecture
 
-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z
 
 ## Overview
 
-TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.
+TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a multi-threaded server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.
 
-- Primary language: TypeScript (110 files) with a small amount of JavaScript configuration (2 files)
-- Application type: Full-stack SvelteKit application with server-side indexing and retrieval services
+- Primary language: TypeScript (141 files) with a small amount of JavaScript configuration (2 files)
+- Application type: Full-stack SvelteKit application with worker-threaded indexing and retrieval services
 - Runtime framework: SvelteKit with adapter-node
-- Storage: SQLite with Drizzle-managed schema plus hand-written FTS5 setup
+- Storage: SQLite (WAL mode) with Drizzle-managed schema plus hand-written FTS5 setup
+- Concurrency: Node.js worker_threads for parse and embedding work
 - Testing: Vitest with separate client and server projects
 
 ## Project Structure
@@ -25,7 +26,7 @@ TrueRef is a TypeScript-first, self-hosted documentation retrieval platform buil
 
 ### src/routes
 
-Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, and JSON schema discovery.
+Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, JSON schema discovery, real-time SSE progress streaming, and job control (pause/resume/cancel).
 
 ### src/lib/server/db
 
@@ -33,7 +34,15 @@ Owns SQLite schema definitions, migration bootstrapping, and FTS initialization.
 
 ### src/lib/server/pipeline
 
-Coordinates crawl, parse, chunk, store, and optional embedding generation work. Startup recovery marks stale jobs as failed, resets repositories stuck in indexing state, initializes singleton queue/pipeline instances, and drains queued work after restart.
+Coordinates crawl, parse, chunk, store, and optional embedding generation work using a worker thread pool. The pipeline module consists of:
+
+- **WorkerPool** (`worker-pool.ts`): Manages a configurable number of Node.js `worker_threads` for parse jobs and an optional dedicated embed worker. Dispatches jobs round-robin to idle workers, enforces per-repository serialisation (one active job per repo), auto-respawns crashed workers, and supports runtime concurrency adjustment via `setMaxConcurrency()`. Falls back to main-thread execution when worker scripts are not found.
+- **Parse worker** (`worker-entry.ts`): Runs in a worker thread. Opens its own `better-sqlite3` connection (WAL mode, `busy_timeout = 5000`), constructs a local `IndexingPipeline` instance, and processes jobs by posting `progress`, `done`, or `failed` messages back to the parent.
+- **Embed worker** (`embed-worker-entry.ts`): Dedicated worker for embedding generation. Loads the embedding profile from the database, creates an `EmbeddingService`, and processes embed requests after the parse worker finishes a job.
+- **ProgressBroadcaster** (`progress-broadcaster.ts`): Server-side pub/sub for real-time SSE streaming. Supports per-job, per-repository, and global subscriptions. Caches the last event per job for reconnect support.
+- **Worker types** (`worker-types.ts`): Shared TypeScript discriminated union types for `ParseWorkerRequest`/`ParseWorkerResponse` and `EmbedWorkerRequest`/`EmbedWorkerResponse` message protocols.
+- **Startup** (`startup.ts`): Recovers stale jobs, constructs singleton `JobQueue`, `IndexingPipeline`, `WorkerPool`, and `ProgressBroadcaster` instances, reads concurrency settings from the database, and drains queued work after restart.
+- **JobQueue** (`job-queue.ts`): SQLite-backed queue that delegates to the `WorkerPool` when available, with pause/resume/cancel support.
 
 ### src/lib/server/search
 
@@ -49,16 +58,18 @@ Provides a thin compatibility layer over the HTTP API. The MCP server exposes re
 
 ## Design Patterns
 
-- No explicit design patterns detected from semantic analysis.
-- The implementation does consistently use service classes such as RepositoryService, SearchService, and HybridSearchService for business logic.
-- Mapping and entity layers separate raw database rows from domain objects through mapper/entity pairs such as RepositoryMapper and RepositoryEntity.
-- Pipeline startup uses module-level singleton state for JobQueue and IndexingPipeline lifecycle management.
+- The WorkerPool implements an **observer/callback pattern**: the pool owner provides `onProgress`, `onJobDone`, `onJobFailed`, `onEmbedDone`, and `onEmbedFailed` callbacks at construction time, and the pool invokes them when workers post messages.
+- ProgressBroadcaster implements a **pub/sub pattern** with three subscription tiers (per-job, per-repository, global) and last-event caching for SSE reconnect.
+- The implementation consistently uses **service classes** such as RepositoryService, SearchService, and HybridSearchService for business logic.
+- Mapping and entity layers separate raw database rows from domain objects through **mapper/entity pairs** such as RepositoryMapper and RepositoryEntity.
+- Pipeline startup uses **module-level singletons** for JobQueue, IndexingPipeline, WorkerPool, and ProgressBroadcaster lifecycle management, with accessor functions (getQueue, getPool, getBroadcaster) for route handlers.
+- Worker message protocols use **TypeScript discriminated unions** (`type` field) for type-safe worker ↔ parent communication.
 
 ## Key Components
 
 ### SvelteKit server bootstrap
 
-src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, starts the indexing pipeline, and applies CORS headers to all /api routes.
+src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, reads indexing concurrency settings from the database, starts the indexing pipeline with WorkerPool and ProgressBroadcaster via `initializePipeline(db, embeddingService, { concurrency, dbPath })`, and applies CORS headers to all /api routes.
 
 ### Database layer
 
@@ -80,6 +91,22 @@ src/lib/server/services/repository.service.ts provides CRUD and statistics for i
 
 src/mcp/index.ts creates the MCP server, registers the two supported tools, and exposes them over stdio or streamable HTTP.
 
+### Worker thread pool
+
+src/lib/server/pipeline/worker-pool.ts manages a pool of Node.js worker threads. Parse workers run the full crawl → parse → store pipeline inside isolated threads with their own better-sqlite3 connections (WAL mode enables concurrent readers). An optional embed worker handles embedding generation in a separate thread. The pool enforces per-repository serialisation, auto-respawns crashed workers, and supports runtime concurrency changes persisted through the settings table.
+
+### SSE streaming
+
+src/lib/server/pipeline/progress-broadcaster.ts provides real-time Server-Sent Event streaming of indexing progress. Route handlers in src/routes/api/v1/jobs/stream and src/routes/api/v1/jobs/[id]/stream expose SSE endpoints. The broadcaster supports per-job, per-repository, and global subscriptions, with last-event caching for reconnect via the `Last-Event-ID` header.
+
+### Job control
+
+src/routes/api/v1/jobs/[id]/pause, resume, and cancel endpoints allow runtime control of indexing jobs. The JobQueue supports pause/resume/cancel state transitions persisted to SQLite.
+
+### Indexing settings
+
+src/routes/api/v1/settings/indexing exposes GET and PUT for indexing concurrency. PUT validates and clamps the value to `max(cpus - 1, 1)`, persists it to the settings table, and live-updates the WorkerPool via `setMaxConcurrency()`.
+
 ## Dependencies
 
 ### Production
@@ -93,6 +120,7 @@ src/mcp/index.ts creates the MCP server, registers the two supported tools, and
 
 - @sveltejs/kit and @sveltejs/adapter-node: application framework and Node deployment target
 - drizzle-kit and drizzle-orm: schema management and typed database access
+- esbuild: worker thread entry point bundling (build/workers/)
 - vite and @tailwindcss/vite: bundling and Tailwind integration
 - vitest and @vitest/browser-playwright: server and browser test execution
 - eslint, typescript-eslint, eslint-plugin-svelte, prettier, prettier-plugin-svelte, prettier-plugin-tailwindcss: linting and formatting
@@ -116,12 +144,13 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be
 
 ### Indexing flow
 
-1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts.
-2. The pipeline recovers stale jobs, initializes crawler/parser infrastructure, and resumes queued work.
-3. Crawlers ingest GitHub or local repository contents.
-4. Parsers split files into document and snippet records with token counts and metadata.
-5. Database modules persist repositories, documents, snippets, versions, configs, and job state.
-6. If an embedding provider is configured, embedding services generate vectors for snippet search.
+1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts, which creates the WorkerPool, ProgressBroadcaster, and JobQueue singletons.
+2. The pipeline recovers stale jobs (marks running → failed, indexing → error), reads concurrency settings, and resumes queued work.
+3. When a job is enqueued, the JobQueue delegates to the WorkerPool, which dispatches work to an idle parse worker thread.
+4. Each parse worker opens its own better-sqlite3 connection (WAL mode) and runs the full crawl → parse → store pipeline, posting progress messages back to the parent thread.
+5. The parent thread updates job progress in the database and broadcasts SSE events through the ProgressBroadcaster.
+6. On parse completion, if an embedding provider is configured, the WorkerPool enqueues an embed request to the dedicated embed worker, which generates vectors in its own thread.
+7. Job control endpoints allow pausing, resuming, or cancelling jobs at runtime.
 
 ### Retrieval flow
 
@@ -135,7 +164,8 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be
 
 ## Build System
 
-- Build command: npm run build
+- Build command: npm run build (runs `vite build` then `node scripts/build-workers.mjs`)
+- Worker bundling: scripts/build-workers.mjs uses esbuild to compile worker-entry.ts and embed-worker-entry.ts into build/workers/ as ESM bundles (.mjs), with $lib path aliases resolved and better-sqlite3/@xenova/transformers marked external
 - Test command: npm run test
 - Primary local run command from package.json: npm run dev
 - MCP entry points: npm run mcp:start and npm run mcp:http
diff --git a/docs/FINDINGS.md b/docs/FINDINGS.md
index 13892fd..788a99d 100644
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -1,25 +1,29 @@
 # Findings
 
-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z
 
 ## Initializer Summary
 
-- JIRA: FEEDBACK-0001
+- JIRA: TRUEREF-0022
 - Refresh mode: REFRESH_IF_REQUIRED
-- Result: refreshed affected documentation only. ARCHITECTURE.md and FINDINGS.md were updated from current repository analysis; CODE_STYLE.md remained trusted and unchanged because the documented conventions still match the codebase.
+- Result: Refreshed ARCHITECTURE.md and FINDINGS.md. CODE_STYLE.md remained trusted — new worker thread code follows established conventions.
 
 ## Research Performed
 
-- Discovered source-language distribution, dependency manifest, import patterns, and project structure.
-- Read the retrieval, formatter, token-budget, parser, mapper, and response-model modules affected by the latest implementation changes.
-- Compared the trusted cache state with current behavior to identify which documentation files were actually stale.
-- Confirmed package scripts for build and test.
-- Confirmed Linux-native md5sum availability for documentation trust metadata.
+- Discovered 141 TypeScript/JavaScript source files (up from 110), with new pipeline worker, broadcaster, and SSE endpoint files.
+- Read worker-pool.ts, worker-entry.ts, embed-worker-entry.ts, worker-types.ts, progress-broadcaster.ts, startup.ts, job-queue.ts to understand the new worker thread architecture.
+- Read SSE endpoints (jobs/stream, jobs/[id]/stream) and job control endpoints (pause, resume, cancel).
+- Read indexing settings endpoint and hooks.server.ts to verify startup wiring changes.
+- Read build-workers.mjs and package.json to verify build system and dependency changes.
+- Compared trusted cache state with current codebase to identify ARCHITECTURE.md as stale.
+- Confirmed CODE_STYLE.md conventions still match the codebase — new code uses PascalCase classes, camelCase functions, tab indentation, ESM imports, and TypeScript discriminated unions consistent with existing style.
 
 ## Open Questions For Planner
 
 - Verify whether the retrieval response contract should document the new repository and version metadata fields formally in a public API reference beyond the architecture summary.
 - Verify whether parser chunking should evolve further from file-level and declaration-level boundaries to member-level semantic chunks for class-heavy codebases.
+- Verify whether the SSE streaming contract (event names, data shapes) should be documented in a dedicated API reference for external consumers.
+- Assess whether the WorkerPool fallback mode (main-thread execution when worker scripts are missing) needs explicit test coverage or should be removed in favour of a hard build requirement.
 
 ## Planner Notes Template
 
diff --git a/src/lib/components/RepositoryCard.svelte b/src/lib/components/RepositoryCard.svelte
index 8cbdb84..698fa78 100644
--- a/src/lib/components/RepositoryCard.svelte
+++ b/src/lib/components/RepositoryCard.svelte
@@ -1,5 +1,5 @@