chore(TRUEREF-0022): fix lint errors and update architecture docs

- Fix 15 ESLint errors across pipeline workers, SSE endpoints, and UI
- Replace explicit any with proper entity types in worker entries
- Remove unused imports and variables (basename, SSEEvent, getBroadcasterFn, seedRules)
- Use empty catch clauses instead of unused error variables
- Use SvelteSet for reactive Set state in repository page
- Fix operator precedence in nullish coalescing expression
- Replace $state+$effect with $derived for concurrency input
- Use resolve() directly in href for navigation lint rule
- Update ARCHITECTURE.md and FINDINGS.md for worker-thread architecture
2026-03-30 17:28:38 +02:00
parent 7630740403
commit 6297edf109
11 changed files with 85 additions and 69 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1,15 +1,16 @@
 # Architecture

-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z

 ## Overview

-TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.
+TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a multi-threaded server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.

-- Primary language: TypeScript (110 files) with a small amount of JavaScript configuration (2 files)
-- Application type: Full-stack SvelteKit application with server-side indexing and retrieval services
+- Primary language: TypeScript (141 files) with a small amount of JavaScript configuration (2 files)
+- Application type: Full-stack SvelteKit application with worker-threaded indexing and retrieval services
 - Runtime framework: SvelteKit with adapter-node
-- Storage: SQLite with Drizzle-managed schema plus hand-written FTS5 setup
+- Storage: SQLite (WAL mode) with Drizzle-managed schema plus hand-written FTS5 setup
+- Concurrency: Node.js worker_threads for parse and embedding work
 - Testing: Vitest with separate client and server projects

 ## Project Structure
@@ -25,7 +26,7 @@ TrueRef is a TypeScript-first, self-hosted documentation retrieval platform buil

 ### src/routes

-Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, and JSON schema discovery.
+Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, JSON schema discovery, real-time SSE progress streaming, and job control (pause/resume/cancel).

 ### src/lib/server/db

@@ -33,7 +34,15 @@ Owns SQLite schema definitions, migration bootstrapping, and FTS initialization.

 ### src/lib/server/pipeline

-Coordinates crawl, parse, chunk, store, and optional embedding generation work. Startup recovery marks stale jobs as failed, resets repositories stuck in indexing state, initializes singleton queue/pipeline instances, and drains queued work after restart.
+Coordinates crawl, parse, chunk, store, and optional embedding generation work using a worker thread pool. The pipeline module consists of:
+
+- **WorkerPool** (`worker-pool.ts`): Manages a configurable number of Node.js `worker_threads` for parse jobs and an optional dedicated embed worker. Dispatches jobs round-robin to idle workers, enforces per-repository serialisation (one active job per repo), auto-respawns crashed workers, and supports runtime concurrency adjustment via `setMaxConcurrency()`. Falls back to main-thread execution when worker scripts are not found.
+- **Parse worker** (`worker-entry.ts`): Runs in a worker thread. Opens its own `better-sqlite3` connection (WAL mode, `busy_timeout = 5000`), constructs a local `IndexingPipeline` instance, and processes jobs by posting `progress`, `done`, or `failed` messages back to the parent.
+- **Embed worker** (`embed-worker-entry.ts`): Dedicated worker for embedding generation. Loads the embedding profile from the database, creates an `EmbeddingService`, and processes embed requests after the parse worker finishes a job.
+- **ProgressBroadcaster** (`progress-broadcaster.ts`): Server-side pub/sub for real-time SSE streaming. Supports per-job, per-repository, and global subscriptions. Caches the last event per job for reconnect support.
+- **Worker types** (`worker-types.ts`): Shared TypeScript discriminated union types for `ParseWorkerRequest`/`ParseWorkerResponse` and `EmbedWorkerRequest`/`EmbedWorkerResponse` message protocols.
+- **Startup** (`startup.ts`): Recovers stale jobs, constructs singleton `JobQueue`, `IndexingPipeline`, `WorkerPool`, and `ProgressBroadcaster` instances, reads concurrency settings from the database, and drains queued work after restart.
+- **JobQueue** (`job-queue.ts`): SQLite-backed queue that delegates to the `WorkerPool` when available, with pause/resume/cancel support.

 ### src/lib/server/search

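The worker message protocol named in the hunk above can be illustrated with a minimal sketch. Only the `type` discriminant and the `progress`, `done`, and `failed` message kinds come from the diff; the remaining fields (`jobId`, `processed`, `total`, `error`) and the `describeResponse` helper are hypothetical, since `worker-types.ts` itself is not shown here.

```typescript
// Hypothetical response shape for a parse worker. Only the `type` values
// ('progress' | 'done' | 'failed') are taken from the diff; all other
// fields are assumptions made for illustration.
type ParseWorkerResponse =
	| { type: 'progress'; jobId: string; processed: number; total: number }
	| { type: 'done'; jobId: string }
	| { type: 'failed'; jobId: string; error: string };

// Narrowing on `type` gives each branch access to exactly the fields
// that belong to that message variant.
function describeResponse(msg: ParseWorkerResponse): string {
	switch (msg.type) {
		case 'progress':
			return `${msg.jobId}: ${msg.processed}/${msg.total}`;
		case 'done':
			return `${msg.jobId}: done`;
		case 'failed':
			return `${msg.jobId}: failed (${msg.error})`;
		default: {
			// Compile-time exhaustiveness check: adding a new variant
			// without handling it makes this assignment a type error.
			const unreachable: never = msg;
			return unreachable;
		}
	}
}
```

In a layout like this, the parent thread's `message` handler can switch on `msg.type` without casts, which is the kind of type-safe worker ↔ parent communication the documentation describes.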
@@ -49,16 +58,18 @@ Provides a thin compatibility layer over the HTTP API. The MCP server exposes re

 ## Design Patterns

-- No explicit design patterns detected from semantic analysis.
-- The implementation does consistently use service classes such as RepositoryService, SearchService, and HybridSearchService for business logic.
-- Mapping and entity layers separate raw database rows from domain objects through mapper/entity pairs such as RepositoryMapper and RepositoryEntity.
-- Pipeline startup uses module-level singleton state for JobQueue and IndexingPipeline lifecycle management.
+- The WorkerPool implements an **observer/callback pattern**: the pool owner provides `onProgress`, `onJobDone`, `onJobFailed`, `onEmbedDone`, and `onEmbedFailed` callbacks at construction time, and the pool invokes them when workers post messages.
+- ProgressBroadcaster implements a **pub/sub pattern** with three subscription tiers (per-job, per-repository, global) and last-event caching for SSE reconnect.
+- The implementation consistently uses **service classes** such as RepositoryService, SearchService, and HybridSearchService for business logic.
+- Mapping and entity layers separate raw database rows from domain objects through **mapper/entity pairs** such as RepositoryMapper and RepositoryEntity.
+- Pipeline startup uses **module-level singletons** for JobQueue, IndexingPipeline, WorkerPool, and ProgressBroadcaster lifecycle management, with accessor functions (getQueue, getPool, getBroadcaster) for route handlers.
+- Worker message protocols use **TypeScript discriminated unions** (`type` field) for type-safe worker ↔ parent communication.

 ## Key Components

 ### SvelteKit server bootstrap

-src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, starts the indexing pipeline, and applies CORS headers to all /api routes.
+src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, reads indexing concurrency settings from the database, starts the indexing pipeline with WorkerPool and ProgressBroadcaster via `initializePipeline(db, embeddingService, { concurrency, dbPath })`, and applies CORS headers to all /api routes.

 ### Database layer

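The three-tier pub/sub with last-event caching described in the hunk above might look roughly like the following. This is a sketch under stated assumptions, not the contents of `progress-broadcaster.ts`: the class name, the method names, and the `ProgressEvent` fields are all hypothetical.

```typescript
// Hypothetical event shape; the real event fields are not shown in the diff.
interface ProgressEvent {
	jobId: string;
	repositoryId: string;
	progress: number;
}

type Listener = (event: ProgressEvent) => void;

// Illustrative stand-in for ProgressBroadcaster (all names are assumptions).
class ProgressBroadcasterSketch {
	private jobListeners = new Map<string, Set<Listener>>();
	private repoListeners = new Map<string, Set<Listener>>();
	private globalListeners = new Set<Listener>();
	private lastEventByJob = new Map<string, ProgressEvent>();

	// Each subscribe method returns an unsubscribe function.
	subscribeJob(jobId: string, listener: Listener): () => void {
		return this.addTo(this.jobListeners, jobId, listener);
	}

	subscribeRepository(repositoryId: string, listener: Listener): () => void {
		return this.addTo(this.repoListeners, repositoryId, listener);
	}

	subscribeAll(listener: Listener): () => void {
		this.globalListeners.add(listener);
		return () => {
			this.globalListeners.delete(listener);
		};
	}

	// A reconnecting SSE client can be replayed the cached event for its job.
	getLastEvent(jobId: string): ProgressEvent | undefined {
		return this.lastEventByJob.get(jobId);
	}

	// Cache first, then fan out to the job, repository, and global tiers.
	publish(event: ProgressEvent): void {
		this.lastEventByJob.set(event.jobId, event);
		this.jobListeners.get(event.jobId)?.forEach((l) => l(event));
		this.repoListeners.get(event.repositoryId)?.forEach((l) => l(event));
		this.globalListeners.forEach((l) => l(event));
	}

	private addTo(map: Map<string, Set<Listener>>, key: string, listener: Listener): () => void {
		const listeners = map.get(key) ?? new Set<Listener>();
		map.set(key, listeners);
		listeners.add(listener);
		return () => {
			listeners.delete(listener);
		};
	}
}
```

An SSE route handler would typically call one of the subscribe methods when the stream opens and invoke the returned function when the client disconnects.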
@@ -80,6 +91,22 @@ src/lib/server/services/repository.service.ts provides CRUD and statistics for i

 src/mcp/index.ts creates the MCP server, registers the two supported tools, and exposes them over stdio or streamable HTTP.

+### Worker thread pool
+
+src/lib/server/pipeline/worker-pool.ts manages a pool of Node.js worker threads. Parse workers run the full crawl → parse → store pipeline inside isolated threads with their own better-sqlite3 connections (WAL mode enables concurrent readers). An optional embed worker handles embedding generation in a separate thread. The pool enforces per-repository serialisation, auto-respawns crashed workers, and supports runtime concurrency changes persisted through the settings table.
+
+### SSE streaming
+
+src/lib/server/pipeline/progress-broadcaster.ts provides real-time Server-Sent Event streaming of indexing progress. Route handlers in src/routes/api/v1/jobs/stream and src/routes/api/v1/jobs/[id]/stream expose SSE endpoints. The broadcaster supports per-job, per-repository, and global subscriptions, with last-event caching for reconnect via the `Last-Event-ID` header.
+
+### Job control
+
+src/routes/api/v1/jobs/[id]/pause, resume, and cancel endpoints allow runtime control of indexing jobs. The JobQueue supports pause/resume/cancel state transitions persisted to SQLite.
+
+### Indexing settings
+
+src/routes/api/v1/settings/indexing exposes GET and PUT for indexing concurrency. PUT validates and clamps the value to `max(cpus - 1, 1)`, persists it to the settings table, and live-updates the WorkerPool via `setMaxConcurrency()`.
+
 ## Dependencies

 ### Production
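The clamping rule for the indexing settings endpoint in the hunk above, `max(cpus - 1, 1)`, can be sketched as a pure function. The function name, the lower bound of 1, and the fallback for non-numeric input are assumptions; the actual endpoint may reject invalid values instead of coercing them.

```typescript
// Hypothetical helper: clamps a requested concurrency to [1, max(cpuCount - 1, 1)].
// The upper bound follows the rule stated in the docs; the lower bound and the
// handling of non-finite input are assumptions for illustration.
function clampConcurrency(requested: number, cpuCount: number): number {
	const upper = Math.max(cpuCount - 1, 1);
	const whole = Math.floor(requested);
	if (!Number.isFinite(whole)) {
		return 1;
	}
	return Math.min(Math.max(whole, 1), upper);
}
```

In a real handler the CPU count would presumably come from `os.cpus().length`; it is passed as a parameter here so the rule can be exercised without depending on the host machine.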
@@ -93,6 +120,7 @@ src/mcp/index.ts creates the MCP server, registers the two supported tools, and

 - @sveltejs/kit and @sveltejs/adapter-node: application framework and Node deployment target
 - drizzle-kit and drizzle-orm: schema management and typed database access
+- esbuild: worker thread entry point bundling (build/workers/)
 - vite and @tailwindcss/vite: bundling and Tailwind integration
 - vitest and @vitest/browser-playwright: server and browser test execution
 - eslint, typescript-eslint, eslint-plugin-svelte, prettier, prettier-plugin-svelte, prettier-plugin-tailwindcss: linting and formatting
@@ -116,12 +144,13 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be

 ### Indexing flow

-1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts.
-2. The pipeline recovers stale jobs, initializes crawler/parser infrastructure, and resumes queued work.
-3. Crawlers ingest GitHub or local repository contents.
-4. Parsers split files into document and snippet records with token counts and metadata.
-5. Database modules persist repositories, documents, snippets, versions, configs, and job state.
-6. If an embedding provider is configured, embedding services generate vectors for snippet search.
+1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts, which creates the WorkerPool, ProgressBroadcaster, and JobQueue singletons.
+2. The pipeline recovers stale jobs (marks running → failed, indexing → error), reads concurrency settings, and resumes queued work.
+3. When a job is enqueued, the JobQueue delegates to the WorkerPool, which dispatches work to an idle parse worker thread.
+4. Each parse worker opens its own better-sqlite3 connection (WAL mode) and runs the full crawl → parse → store pipeline, posting progress messages back to the parent thread.
+5. The parent thread updates job progress in the database and broadcasts SSE events through the ProgressBroadcaster.
+6. On parse completion, if an embedding provider is configured, the WorkerPool enqueues an embed request to the dedicated embed worker, which generates vectors in its own thread.
+7. Job control endpoints allow pausing, resuming, or cancelling jobs at runtime.

 ### Retrieval flow

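Step 3 of the new indexing flow, together with the per-repository serialisation rule, can be pictured with a small scheduler sketch. It models only the dispatch decision (which queued jobs may start), not real `worker_threads`; the class and its methods other than `setMaxConcurrency()` are hypothetical, and the real `WorkerPool` may order or select jobs differently.

```typescript
// Hypothetical job shape for illustration.
interface QueuedJob {
	id: string;
	repositoryId: string;
}

// Sketch of "one active job per repo" under a concurrency limit.
class RepoSerialisingScheduler {
	private queue: QueuedJob[] = [];
	private activeRepos = new Set<string>();
	private maxConcurrency: number;

	constructor(maxConcurrency: number) {
		this.maxConcurrency = Math.max(1, Math.floor(maxConcurrency));
	}

	enqueue(job: QueuedJob): void {
		this.queue.push(job);
	}

	// Returns the jobs that may start now. Because at most one job runs per
	// repository, the number of active repositories equals the number of
	// active jobs.
	dispatch(): QueuedJob[] {
		const started: QueuedJob[] = [];
		const waiting: QueuedJob[] = [];
		for (const job of this.queue) {
			const hasSlot = this.activeRepos.size < this.maxConcurrency;
			if (hasSlot && !this.activeRepos.has(job.repositoryId)) {
				this.activeRepos.add(job.repositoryId);
				started.push(job);
			} else {
				waiting.push(job);
			}
		}
		this.queue = waiting;
		return started;
	}

	// Frees the repository so its next queued job can be dispatched.
	complete(job: QueuedJob): void {
		this.activeRepos.delete(job.repositoryId);
	}

	// Mirrors the runtime adjustment named in the docs; takes effect on the
	// next dispatch() call in this sketch.
	setMaxConcurrency(value: number): void {
		this.maxConcurrency = Math.max(1, Math.floor(value));
	}
}
```

With a limit of 2 and jobs queued for repositories A, A, B, and C in that order, the first dispatch starts the first A job and the B job; the second A job waits for its repository, and the C job waits for a free slot.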
@@ -135,7 +164,8 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be

 ## Build System

-- Build command: npm run build
+- Build command: npm run build (runs `vite build` then `node scripts/build-workers.mjs`)
+- Worker bundling: scripts/build-workers.mjs uses esbuild to compile worker-entry.ts and embed-worker-entry.ts into build/workers/ as ESM bundles (.mjs), with $lib path aliases resolved and better-sqlite3/@xenova/transformers marked external
 - Test command: npm run test
 - Primary local run command from package.json: npm run dev
 - MCP entry points: npm run mcp:start and npm run mcp:http
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -1,25 +1,29 @@
 # Findings

-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z

 ## Initializer Summary

-- JIRA: FEEDBACK-0001
+- JIRA: TRUEREF-0022
 - Refresh mode: REFRESH_IF_REQUIRED
-- Result: refreshed affected documentation only. ARCHITECTURE.md and FINDINGS.md were updated from current repository analysis; CODE_STYLE.md remained trusted and unchanged because the documented conventions still match the codebase.
+- Result: Refreshed ARCHITECTURE.md and FINDINGS.md. CODE_STYLE.md remained trusted — new worker thread code follows established conventions.

 ## Research Performed

-- Discovered source-language distribution, dependency manifest, import patterns, and project structure.
-- Read the retrieval, formatter, token-budget, parser, mapper, and response-model modules affected by the latest implementation changes.
-- Compared the trusted cache state with current behavior to identify which documentation files were actually stale.
-- Confirmed package scripts for build and test.
-- Confirmed Linux-native md5sum availability for documentation trust metadata.
+- Discovered 141 TypeScript/JavaScript source files (up from 110), with new pipeline worker, broadcaster, and SSE endpoint files.
+- Read worker-pool.ts, worker-entry.ts, embed-worker-entry.ts, worker-types.ts, progress-broadcaster.ts, startup.ts, job-queue.ts to understand the new worker thread architecture.
+- Read SSE endpoints (jobs/stream, jobs/[id]/stream) and job control endpoints (pause, resume, cancel).
+- Read indexing settings endpoint and hooks.server.ts to verify startup wiring changes.
+- Read build-workers.mjs and package.json to verify build system and dependency changes.
+- Compared trusted cache state with current codebase to identify ARCHITECTURE.md as stale.
+- Confirmed CODE_STYLE.md conventions still match the codebase — new code uses PascalCase classes, camelCase functions, tab indentation, ESM imports, and TypeScript discriminated unions consistent with existing style.

 ## Open Questions For Planner

 - Verify whether the retrieval response contract should document the new repository and version metadata fields formally in a public API reference beyond the architecture summary.
 - Verify whether parser chunking should evolve further from file-level and declaration-level boundaries to member-level semantic chunks for class-heavy codebases.
+- Verify whether the SSE streaming contract (event names, data shapes) should be documented in a dedicated API reference for external consumers.
+- Assess whether the WorkerPool fallback mode (main-thread execution when worker scripts are missing) needs explicit test coverage or should be removed in favour of a hard build requirement.

 ## Planner Notes Template