diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index d32ec18..323f3ee 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1,15 +1,16 @@
# Architecture
-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z
## Overview
-TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.
+TrueRef is a TypeScript-first, self-hosted documentation retrieval platform built on SvelteKit. The repository contains a Node-targeted web application, a REST API, a Model Context Protocol server, and a multi-threaded server-side indexing pipeline backed by SQLite via better-sqlite3 and Drizzle ORM.
-- Primary language: TypeScript (110 files) with a small amount of JavaScript configuration (2 files)
-- Application type: Full-stack SvelteKit application with server-side indexing and retrieval services
+- Primary language: TypeScript (141 files) with a small amount of JavaScript configuration (2 files)
+- Application type: Full-stack SvelteKit application with worker-threaded indexing and retrieval services
- Runtime framework: SvelteKit with adapter-node
-- Storage: SQLite with Drizzle-managed schema plus hand-written FTS5 setup
+- Storage: SQLite (WAL mode) with Drizzle-managed schema plus hand-written FTS5 setup
+- Concurrency: Node.js worker_threads for parse and embedding work
- Testing: Vitest with separate client and server projects
## Project Structure
@@ -25,7 +26,7 @@ TrueRef is a TypeScript-first, self-hosted documentation retrieval platform buil
### src/routes
-Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, and JSON schema discovery.
+Contains the UI entry points and API routes. The API tree under src/routes/api/v1 is the public HTTP contract for repository management, indexing jobs, search/context retrieval, settings, filesystem browsing, JSON schema discovery, real-time SSE progress streaming, and job control (pause/resume/cancel).
### src/lib/server/db
@@ -33,7 +34,15 @@ Owns SQLite schema definitions, migration bootstrapping, and FTS initialization.
### src/lib/server/pipeline
-Coordinates crawl, parse, chunk, store, and optional embedding generation work. Startup recovery marks stale jobs as failed, resets repositories stuck in indexing state, initializes singleton queue/pipeline instances, and drains queued work after restart.
+Coordinates crawl, parse, chunk, store, and optional embedding generation work using a worker thread pool. The pipeline module consists of:
+
+- **WorkerPool** (`worker-pool.ts`): Manages a configurable number of Node.js `worker_threads` for parse jobs and an optional dedicated embed worker. Dispatches jobs round-robin to idle workers, enforces per-repository serialisation (one active job per repo), auto-respawns crashed workers, and supports runtime concurrency adjustment via `setMaxConcurrency()`. Falls back to main-thread execution when worker scripts are not found.
+- **Parse worker** (`worker-entry.ts`): Runs in a worker thread. Opens its own `better-sqlite3` connection (WAL mode, `busy_timeout = 5000`), constructs a local `IndexingPipeline` instance, and processes jobs by posting `progress`, `done`, or `failed` messages back to the parent.
+- **Embed worker** (`embed-worker-entry.ts`): Dedicated worker for embedding generation. Loads the embedding profile from the database, creates an `EmbeddingService`, and processes embed requests after the parse worker finishes a job.
+- **ProgressBroadcaster** (`progress-broadcaster.ts`): Server-side pub/sub for real-time SSE streaming. Supports per-job, per-repository, and global subscriptions. Caches the last event per job for reconnect support.
+- **Worker types** (`worker-types.ts`): Shared TypeScript discriminated union types for `ParseWorkerRequest`/`ParseWorkerResponse` and `EmbedWorkerRequest`/`EmbedWorkerResponse` message protocols.
+- **Startup** (`startup.ts`): Recovers stale jobs, constructs singleton `JobQueue`, `IndexingPipeline`, `WorkerPool`, and `ProgressBroadcaster` instances, reads concurrency settings from the database, and drains queued work after restart.
+- **JobQueue** (`job-queue.ts`): SQLite-backed queue that delegates to the `WorkerPool` when available, with pause/resume/cancel support.
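As a rough illustration of the discriminated-union message protocol described for `worker-types.ts`, the sketch below models a parse worker response with the three message kinds named above (`progress`, `done`, `failed`). The field names `jobId`, `processed`, `total`, and `error`, and the helper `describeResponse`, are assumptions made for this example and may not match the actual types.

```typescript
// Hypothetical message shapes; only the `type` values come from the description above.
type ParseWorkerResponse =
	| { type: 'progress'; jobId: string; processed: number; total: number }
	| { type: 'done'; jobId: string }
	| { type: 'failed'; jobId: string; error: string };

// Narrowing on `type` gives each branch access to its own fields only.
function describeResponse(msg: ParseWorkerResponse): string {
	switch (msg.type) {
		case 'progress':
			return `job ${msg.jobId}: ${msg.processed}/${msg.total}`;
		case 'done':
			return `job ${msg.jobId}: done`;
		case 'failed':
			return `job ${msg.jobId}: failed (${msg.error})`;
		default: {
			// Exhaustiveness check: adding a variant without a case fails to compile.
			const unreachable: never = msg;
			return unreachable;
		}
	}
}
```

A parent thread could call a handler like this from its worker `message` listener, so that an unhandled message kind surfaces at compile time rather than at runtime.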
### src/lib/server/search
@@ -49,16 +58,18 @@ Provides a thin compatibility layer over the HTTP API. The MCP server exposes re
## Design Patterns
-- No explicit design patterns detected from semantic analysis.
-- The implementation does consistently use service classes such as RepositoryService, SearchService, and HybridSearchService for business logic.
-- Mapping and entity layers separate raw database rows from domain objects through mapper/entity pairs such as RepositoryMapper and RepositoryEntity.
-- Pipeline startup uses module-level singleton state for JobQueue and IndexingPipeline lifecycle management.
+- The WorkerPool implements an **observer/callback pattern**: the pool owner provides `onProgress`, `onJobDone`, `onJobFailed`, `onEmbedDone`, and `onEmbedFailed` callbacks at construction time, and the pool invokes them when workers post messages.
+- ProgressBroadcaster implements a **pub/sub pattern** with three subscription tiers (per-job, per-repository, global) and last-event caching for SSE reconnect.
+- The implementation consistently uses **service classes** such as RepositoryService, SearchService, and HybridSearchService for business logic.
+- Mapping and entity layers separate raw database rows from domain objects through **mapper/entity pairs** such as RepositoryMapper and RepositoryEntity.
+- Pipeline startup uses **module-level singletons** for JobQueue, IndexingPipeline, WorkerPool, and ProgressBroadcaster lifecycle management, with accessor functions (getQueue, getPool, getBroadcaster) for route handlers.
+- Worker message protocols use **TypeScript discriminated unions** (`type` field) for type-safe worker ↔ parent communication.
## Key Components
### SvelteKit server bootstrap
-src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, starts the indexing pipeline, and applies CORS headers to all /api routes.
+src/hooks.server.ts initializes the database, loads persisted embedding configuration, creates the optional EmbeddingService, reads indexing concurrency settings from the database, starts the indexing pipeline with WorkerPool and ProgressBroadcaster via `initializePipeline(db, embeddingService, { concurrency, dbPath })`, and applies CORS headers to all /api routes.
### Database layer
@@ -80,6 +91,22 @@ src/lib/server/services/repository.service.ts provides CRUD and statistics for i
src/mcp/index.ts creates the MCP server, registers the two supported tools, and exposes them over stdio or streamable HTTP.
+### Worker thread pool
+
+src/lib/server/pipeline/worker-pool.ts manages a pool of Node.js worker threads. Parse workers run the full crawl → parse → store pipeline inside isolated threads with their own better-sqlite3 connections (WAL mode enables concurrent readers). An optional embed worker handles embedding generation in a separate thread. The pool enforces per-repository serialisation, auto-respawns crashed workers, and supports runtime concurrency changes persisted through the settings table.
+
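The per-repository serialisation rule can be sketched without any threads at all. The class below is a minimal, hypothetical scheduler (`RepoSerialisingDispatcher`, `QueuedJob`, and the `start` callback are names invented for this example, not the WorkerPool API): it starts at most `maxConcurrency` jobs at once and never runs two jobs for the same repository concurrently.

```typescript
interface QueuedJob {
	id: string;
	repositoryId: string;
}

class RepoSerialisingDispatcher {
	private readonly pending: QueuedJob[] = [];
	private readonly activeRepos = new Set<string>();
	private running = 0;
	private maxConcurrency: number;
	private readonly start: (job: QueuedJob) => void;

	constructor(maxConcurrency: number, start: (job: QueuedJob) => void) {
		this.maxConcurrency = Math.max(1, maxConcurrency);
		this.start = start;
	}

	enqueue(job: QueuedJob): void {
		this.pending.push(job);
		this.dispatch();
	}

	// Called when a job finishes (successfully or not) to free its slot and repository.
	complete(job: QueuedJob): void {
		this.activeRepos.delete(job.repositoryId);
		this.running--;
		this.dispatch();
	}

	// Raising the limit starts waiting jobs immediately; lowering it only affects future dispatches.
	setMaxConcurrency(limit: number): void {
		this.maxConcurrency = Math.max(1, limit);
		this.dispatch();
	}

	private dispatch(): void {
		while (this.running < this.maxConcurrency) {
			// Pick the oldest pending job whose repository has no active job.
			const index = this.pending.findIndex((j) => !this.activeRepos.has(j.repositoryId));
			if (index === -1) return;
			const [job] = this.pending.splice(index, 1);
			this.activeRepos.add(job.repositoryId);
			this.running++;
			this.start(job);
		}
	}
}
```

In a real pool, `start` would post the job to an idle worker and `complete` would be driven by the worker's `done` or `failed` message; crash recovery and the embed worker are left out of this sketch.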
+### SSE streaming
+
+src/lib/server/pipeline/progress-broadcaster.ts provides real-time Server-Sent Event streaming of indexing progress. Route handlers in src/routes/api/v1/jobs/stream and src/routes/api/v1/jobs/[id]/stream expose SSE endpoints. The broadcaster supports per-job, per-repository, and global subscriptions, with last-event caching for reconnect via the `Last-Event-ID` header.
+
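To make the three subscription tiers and last-event caching concrete, here is a small in-memory sketch. It is not the ProgressBroadcaster implementation: `SketchBroadcaster`, `JobProgressEvent`, its fields, and the `progress` event name are all assumptions for illustration, and the replay-on-subscribe behaviour is a simplification of reconnect handling via `Last-Event-ID`.

```typescript
interface JobProgressEvent {
	id: number;
	jobId: string;
	repositoryId: string;
	stage: string;
}

type Listener = (event: JobProgressEvent) => void;

class SketchBroadcaster {
	private readonly byJob = new Map<string, Set<Listener>>();
	private readonly byRepo = new Map<string, Set<Listener>>();
	private readonly global = new Set<Listener>();
	private readonly lastByJob = new Map<string, JobProgressEvent>();

	subscribeJob(jobId: string, listener: Listener): () => void {
		// Replay the cached event so a reconnecting client sees current state at once.
		const last = this.lastByJob.get(jobId);
		if (last) listener(last);
		return this.add(this.byJob, jobId, listener);
	}

	subscribeRepository(repositoryId: string, listener: Listener): () => void {
		return this.add(this.byRepo, repositoryId, listener);
	}

	subscribeAll(listener: Listener): () => void {
		this.global.add(listener);
		return () => {
			this.global.delete(listener);
		};
	}

	publish(event: JobProgressEvent): void {
		this.lastByJob.set(event.jobId, event);
		const targets = [this.byJob.get(event.jobId), this.byRepo.get(event.repositoryId), this.global];
		for (const set of targets) set?.forEach((listener) => listener(event));
	}

	private add(map: Map<string, Set<Listener>>, key: string, listener: Listener): () => void {
		const set = map.get(key) ?? new Set<Listener>();
		map.set(key, set);
		set.add(listener);
		return () => {
			set.delete(listener);
		};
	}
}

// Serialises one event as a Server-Sent Events frame (fields, then a blank line).
function toSseFrame(event: JobProgressEvent): string {
	return `id: ${event.id}\nevent: progress\ndata: ${JSON.stringify(event)}\n\n`;
}
```

A route handler could subscribe inside a `ReadableStream` and enqueue `toSseFrame(event)` for each callback, calling the returned unsubscribe function when the client disconnects.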
+### Job control
+
+The pause, resume, and cancel endpoints under src/routes/api/v1/jobs/[id]/ allow runtime control of indexing jobs. The JobQueue supports pause/resume/cancel state transitions persisted to SQLite.
+
+### Indexing settings
+
+src/routes/api/v1/settings/indexing exposes GET and PUT for indexing concurrency. PUT validates the value, caps it at `max(cpus - 1, 1)`, persists it to the settings table, and live-updates the WorkerPool via `setMaxConcurrency()`.
+
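The clamping rule can be written as a small pure function. This is a sketch under assumptions: `clampConcurrency` is a hypothetical helper name, the lower bound of 1 and the truncation of fractional input are choices made for this example, and the caller is assumed to pass the CPU count (for instance from `os.cpus().length`).

```typescript
// Clamp a requested concurrency into [1, max(cpuCount - 1, 1)].
function clampConcurrency(requested: number, cpuCount: number): number {
	const upper = Math.max(cpuCount - 1, 1);
	// Treat NaN or infinite input as the minimum rather than rejecting it (assumption).
	if (!Number.isFinite(requested)) return 1;
	return Math.min(Math.max(Math.trunc(requested), 1), upper);
}
```

On a 4-core host this yields an upper bound of 3, leaving one core for the main thread; on a single-core host the bound stays at 1.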
## Dependencies
### Production
@@ -93,6 +120,7 @@ src/mcp/index.ts creates the MCP server, registers the two supported tools, and
- @sveltejs/kit and @sveltejs/adapter-node: application framework and Node deployment target
- drizzle-kit and drizzle-orm: schema management and typed database access
+- esbuild: worker thread entry point bundling (build/workers/)
- vite and @tailwindcss/vite: bundling and Tailwind integration
- vitest and @vitest/browser-playwright: server and browser test execution
- eslint, typescript-eslint, eslint-plugin-svelte, prettier, prettier-plugin-svelte, prettier-plugin-tailwindcss: linting and formatting
@@ -116,12 +144,13 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be
### Indexing flow
-1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts.
-2. The pipeline recovers stale jobs, initializes crawler/parser infrastructure, and resumes queued work.
-3. Crawlers ingest GitHub or local repository contents.
-4. Parsers split files into document and snippet records with token counts and metadata.
-5. Database modules persist repositories, documents, snippets, versions, configs, and job state.
-6. If an embedding provider is configured, embedding services generate vectors for snippet search.
+1. Server startup runs initializeDatabase() and initializePipeline() from src/hooks.server.ts, which creates the WorkerPool, ProgressBroadcaster, and JobQueue singletons.
+2. The pipeline recovers stale jobs (marks running → failed, indexing → error), reads concurrency settings, and resumes queued work.
+3. When a job is enqueued, the JobQueue delegates to the WorkerPool, which dispatches work to an idle parse worker thread.
+4. Each parse worker opens its own better-sqlite3 connection (WAL mode) and runs the full crawl → parse → store pipeline, posting progress messages back to the parent thread.
+5. The parent thread updates job progress in the database and broadcasts SSE events through the ProgressBroadcaster.
+6. On parse completion, if an embedding provider is configured, the WorkerPool enqueues an embed request to the dedicated embed worker, which generates vectors in its own thread.
+7. Job control endpoints allow pausing, resuming, or cancelling jobs at runtime.
### Retrieval flow
@@ -135,7 +164,8 @@ The frontend and backend share the same SvelteKit repository, but most non-UI be
## Build System
-- Build command: npm run build
+- Build command: npm run build (runs `vite build` then `node scripts/build-workers.mjs`)
+- Worker bundling: scripts/build-workers.mjs uses esbuild to compile worker-entry.ts and embed-worker-entry.ts into build/workers/ as ESM bundles (.mjs), with $lib path aliases resolved and better-sqlite3/@xenova/transformers marked external
- Test command: npm run test
- Primary local run command from package.json: npm run dev
- MCP entry points: npm run mcp:start and npm run mcp:http
diff --git a/docs/FINDINGS.md b/docs/FINDINGS.md
index 13892fd..788a99d 100644
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -1,25 +1,29 @@
# Findings
-Last Updated: 2026-03-27T00:24:13.000Z
+Last Updated: 2026-03-30T00:00:00.000Z
## Initializer Summary
-- JIRA: FEEDBACK-0001
+- JIRA: TRUEREF-0022
- Refresh mode: REFRESH_IF_REQUIRED
-- Result: refreshed affected documentation only. ARCHITECTURE.md and FINDINGS.md were updated from current repository analysis; CODE_STYLE.md remained trusted and unchanged because the documented conventions still match the codebase.
+- Result: Refreshed ARCHITECTURE.md and FINDINGS.md. CODE_STYLE.md remained trusted and unchanged because the new worker thread code follows established conventions.
## Research Performed
-- Discovered source-language distribution, dependency manifest, import patterns, and project structure.
-- Read the retrieval, formatter, token-budget, parser, mapper, and response-model modules affected by the latest implementation changes.
-- Compared the trusted cache state with current behavior to identify which documentation files were actually stale.
-- Confirmed package scripts for build and test.
-- Confirmed Linux-native md5sum availability for documentation trust metadata.
+- Discovered 141 TypeScript source files (up from 110), including new pipeline worker, broadcaster, and SSE endpoint files.
+- Read worker-pool.ts, worker-entry.ts, embed-worker-entry.ts, worker-types.ts, progress-broadcaster.ts, startup.ts, job-queue.ts to understand the new worker thread architecture.
+- Read SSE endpoints (jobs/stream, jobs/[id]/stream) and job control endpoints (pause, resume, cancel).
+- Read indexing settings endpoint and hooks.server.ts to verify startup wiring changes.
+- Read build-workers.mjs and package.json to verify build system and dependency changes.
+- Compared trusted cache state with current codebase to identify ARCHITECTURE.md as stale.
+- Confirmed CODE_STYLE.md conventions still match the codebase — new code uses PascalCase classes, camelCase functions, tab indentation, ESM imports, and TypeScript discriminated unions consistent with existing style.
## Open Questions For Planner
- Verify whether the retrieval response contract should document the new repository and version metadata fields formally in a public API reference beyond the architecture summary.
- Verify whether parser chunking should evolve further from file-level and declaration-level boundaries to member-level semantic chunks for class-heavy codebases.
+- Verify whether the SSE streaming contract (event names, data shapes) should be documented in a dedicated API reference for external consumers.
+- Assess whether the WorkerPool fallback mode (main-thread execution when worker scripts are missing) needs explicit test coverage or should be removed in favour of a hard build requirement.
## Planner Notes Template
diff --git a/package.json b/package.json
index 3c0e331..3b7a3a1 100644
--- a/package.json
+++ b/package.json
@@ -5,7 +5,7 @@
"type": "module",
"scripts": {
"dev": "vite dev",
- "build": "vite build",
+ "build": "vite build && node scripts/build-workers.mjs",
"preview": "vite preview",
"prepare": "svelte-kit sync || echo ''",
"check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
@@ -34,6 +34,7 @@
"@vitest/browser-playwright": "^4.1.0",
"drizzle-kit": "^0.31.8",
"drizzle-orm": "^0.45.1",
+ "esbuild": "^0.24.0",
"eslint": "^9.39.2",
"eslint-config-prettier": "^10.1.8",
"eslint-plugin-svelte": "^3.14.0",
diff --git a/scripts/build-workers.mjs b/scripts/build-workers.mjs
new file mode 100644
index 0000000..331e2ad
--- /dev/null
+++ b/scripts/build-workers.mjs
@@ -0,0 +1,38 @@
+import * as esbuild from 'esbuild';
+import { existsSync } from 'node:fs';
+
+const entries = [
+ 'src/lib/server/pipeline/worker-entry.ts',
+ 'src/lib/server/pipeline/embed-worker-entry.ts'
+];
+
+try {
+ const existing = entries.filter(e => existsSync(e));
+ if (existing.length === 0) {
+ console.log('[build-workers] No worker entry files found yet, skipping.');
+ process.exit(0);
+ }
+
+ await esbuild.build({
+ entryPoints: existing,
+ bundle: true,
+ platform: 'node',
+ target: 'node20',
+ format: 'esm',
+ outdir: 'build/workers',
+ outExtension: { '.js': '.mjs' },
+ alias: {
+ '$lib': './src/lib',
+ '$lib/server': './src/lib/server'
+ },
+ external: ['better-sqlite3', '@xenova/transformers'],
+ banner: {
+ js: "import { createRequire } from 'module'; const require = createRequire(import.meta.url);"
+ }
+ });
+
+ console.log(`[build-workers] Compiled ${existing.length} worker(s) to build/workers/`);
+} catch (err) {
+ console.error('[build-workers] Error:', err);
+ process.exit(1);
+}
diff --git a/src/hooks.server.ts b/src/hooks.server.ts
index 1d25e06..4c43ea0 100644
--- a/src/hooks.server.ts
+++ b/src/hooks.server.ts
@@ -16,6 +16,7 @@ import {
type EmbeddingProfileEntityProps
} from '$lib/server/models/embedding-profile.js';
import { EmbeddingProfileMapper } from '$lib/server/mappers/embedding-profile.mapper.js';
+import { env } from '$env/dynamic/private';
import type { Handle } from '@sveltejs/kit';
// ---------------------------------------------------------------------------
@@ -47,7 +48,29 @@ try {
embeddingService = new EmbeddingService(db, provider, activeProfile.id);
}
- initializePipeline(db, embeddingService);
+ // Read database path from environment
+ const dbPath = env.DATABASE_URL;
+
+ // Read indexing concurrency setting from database
+ let concurrency = 2; // default
+ if (dbPath) {
+ const concurrencyRow = db
+ .prepare<[], { value: string }>(
+ "SELECT value FROM settings WHERE key = 'indexing.concurrency' LIMIT 1"
+ )
+ .get();
+ if (concurrencyRow) {
+ try {
+ const parsed = JSON.parse(concurrencyRow.value);
+ const candidate = parsed?.value;
+ if (Number.isInteger(candidate) && candidate >= 1) concurrency = candidate;
+ } catch {
+ // Malformed JSON in the settings row: keep the default of 2
+ }
+ }
+ }
+
+ initializePipeline(db, embeddingService, { concurrency, dbPath });
console.log('[hooks.server] Indexing pipeline initialised.');
} catch (err) {
console.error(
diff --git a/src/lib/components/RepositoryCard.svelte b/src/lib/components/RepositoryCard.svelte
index 8cbdb84..698fa78 100644
--- a/src/lib/components/RepositoryCard.svelte
+++ b/src/lib/components/RepositoryCard.svelte
@@ -1,5 +1,5 @@
@@ -181,6 +236,11 @@
>
Status
+