TRUEREF-0023 rewrite indexing pipeline - parallel reads - serialized writes

This commit is contained in:
Giancarmine Salucci
2026-04-02 09:49:38 +02:00
parent 9525c58e9a
commit f86be4106b
68 changed files with 5042 additions and 3131 deletions


@@ -88,6 +88,7 @@ The UI currently polls `GET /api/v1/jobs?repositoryId=...` every 2 seconds. This
#### Worker Thread lifecycle
Each worker is a long-lived `node:worker_threads` `Worker` instance that:
1. Opens its own `better-sqlite3` connection to the same database file.
2. Listens for `{ type: 'run', jobId }` messages from the main thread.
3. Runs `IndexingPipeline.run(job)`, emitting `postMessage` progress events at each stage boundary and every N files.
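The lifecycle above can be sketched as a worker entry module. This is a minimal sketch: the message guard is illustrative, and the pipeline call is stubbed with a comment since `IndexingPipeline` lives elsewhere in this design.

```typescript
import { parentPort } from 'node:worker_threads';

type RunMessage = { type: 'run'; jobId: string };

// Narrow unknown messages so a malformed payload is ignored rather than
// crashing the long-lived worker.
function isRunMessage(msg: unknown): msg is RunMessage {
  if (typeof msg !== 'object' || msg === null) return false;
  const m = msg as Partial<RunMessage>;
  return m.type === 'run' && typeof m.jobId === 'string';
}

// parentPort is null when this module is loaded outside a Worker.
if (parentPort) {
  parentPort.on('message', (msg: unknown) => {
    if (!isRunMessage(msg)) return;
    // Here the worker would load the job row over its own better-sqlite3
    // connection and run IndexingPipeline.run(job), calling
    // parentPort.postMessage(...) at each stage boundary and every N files.
  });
}
```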
@@ -100,18 +101,18 @@ Manages a pool of `concurrency` workers.
```typescript
interface WorkerPoolOptions {
concurrency: number; // default: Math.max(1, os.cpus().length - 1), capped at 4
workerScript: string; // absolute path to the compiled worker entry
}
class WorkerPool {
private workers: Worker[];
private idle: Worker[];
enqueue(jobId: string): void;
private dispatch(worker: Worker, jobId: string): void;
private onWorkerMessage(msg: WorkerMessage): void;
private onWorkerExit(worker: Worker, code: number): void;
}
```
@@ -120,12 +121,14 @@ Workers are kept alive across jobs. If a worker crashes (non-zero exit), the poo
#### Parallelism and write contention
With WAL mode enabled (already the case), SQLite supports:
- **One concurrent writer** (the transaction lock)
- **Many concurrent readers**
The `replaceSnippets` transactions for different repositories write disjoint rows, so they never conflict logically; SQLite still serialises them through its single write lock, but each transaction is short, so waits stay well inside `busy_timeout`. The `cloneFromAncestor` operation writes to the same tables but under different `version_id` values, so those transactions are likewise disjoint at the row level.
Two jobs on the **same repository** (e.g. `/my-lib/v1.0.0` and `/my-lib/v2.0.0`) can run in parallel because:
- Differential indexing (TRUEREF-0021) ensures `v2.0.0` reads from `v1.0.0`'s already-committed rows.
- The write transactions for each version touch disjoint `version_id` partitions.
@@ -134,6 +137,7 @@ If write contention still occurs under parallel load, `busy_timeout = 5000` (alr
#### Concurrency limit per repository
To prevent a user from queuing 500 tags and overwhelming the worker pool, the pool enforces:
- **Max 1 running job per repository** for the default branch (re-index).
- **Max `concurrency` total running jobs** across all repositories.
- Version jobs for the same repository are serialised within the pool (the queue picks the oldest queued version job for a given repo only when no other version job for that repo is running).
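The queue rule above can be sketched as a pure pick function. The shapes here are illustrative; the real `WorkerPool` tracks richer per-job state.

```typescript
interface QueuedJob {
  jobId: string;
  repositoryId: string;
  queuedAt: number; // epoch ms, used for FIFO ordering
}

// Oldest queued job whose repository has no job currently running.
// Returns undefined when every queued repo is busy.
function pickNext(
  queue: readonly QueuedJob[],
  runningRepos: ReadonlySet<string>,
): QueuedJob | undefined {
  return [...queue]
    .sort((a, b) => a.queuedAt - b.queuedAt)
    .find((job) => !runningRepos.has(job.repositoryId));
}
```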
@@ -148,15 +152,15 @@ Replace the opaque integer progress with a structured stage model:
```typescript
type IndexingStage =
| 'queued'
| 'differential' // computing ancestor diff
| 'crawling' // fetching files from GitHub or local FS
| 'cloning' // cloning unchanged files from ancestor (differential only)
| 'parsing' // parsing files into snippets
| 'storing' // writing documents + snippets to DB
| 'embedding' // generating vector embeddings
| 'done'
| 'failed';
```
### Extended Job Schema
@@ -172,22 +176,24 @@ The `progress` column (0–100) is retained for backward compatibility and overa
```typescript
interface ProgressMessage {
type: 'progress';
jobId: string;
stage: IndexingStage;
stageDetail?: string; // human-readable detail for the current stage
  progress: number;      // 0–100 overall
processedFiles: number;
totalFiles: number;
}
```
Workers emit this message:
- On every stage transition (crawl start, parse start, store start, embed start).
- Every `PROGRESS_EMIT_EVERY = 10` files during the parse loop.
- On job completion or failure.
The main thread receives these messages and does two things:
1. Writes the update to `indexing_jobs` in SQLite (batched — one write per message, not per file).
2. Pushes the payload to any open SSE channels for that jobId.
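Step 1 can be sketched as a pure builder that turns one message into one `UPDATE`. The column names are an assumption derived from the message fields; the real `indexing_jobs` schema may differ. The interface is re-declared locally so the sketch is self-contained.

```typescript
interface ProgressMessage {
  type: 'progress';
  jobId: string;
  stage: string;
  stageDetail?: string;
  progress: number;      // 0–100 overall
  processedFiles: number;
  totalFiles: number;
}

// One UPDATE per message: write volume scales with messages (bounded by
// PROGRESS_EMIT_EVERY), not with file count.
function toJobUpdate(msg: ProgressMessage): { sql: string; params: Array<string | number> } {
  return {
    sql: 'UPDATE indexing_jobs SET stage = ?, progress = ?, processed_files = ?, total_files = ? WHERE id = ?',
    params: [msg.stage, msg.progress, msg.processedFiles, msg.totalFiles, msg.jobId],
  };
}
```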
@@ -198,6 +204,7 @@ The main thread receives these messages and does two things:
### `GET /api/v1/jobs/:id/stream`
Opens an SSE connection for a specific job. The server:
1. Sends the current job state as the first event immediately (no initial lag).
2. Pushes `ProgressMessage` events as the worker emits them.
3. Sends a final `event: done` or `event: failed` event, then closes the connection.
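The per-event framing can be sketched as a small formatter; the `id`/`event`/`data` field layout follows the SSE wire format shown below, while the helper name is illustrative.

```typescript
// Formats one Server-Sent Events frame. `event` is 'progress', 'done',
// or 'failed'; the monotonically increasing `id` lets reconnecting
// clients resume via the Last-Event-ID header.
function toSseFrame(id: number, event: string, data: object): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The handler would write frames like these to a response opened with `Content-Type: text/event-stream`, sending the current job state as the first frame before streaming live updates.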
@@ -216,7 +223,7 @@ id: 1
event: progress
data: {"stage":"crawling","progress":0,"processedFiles":0,"totalFiles":0}
id: 2
event: progress
data: {"stage":"parsing","progress":12,"processedFiles":240,"totalFiles":2000}
@@ -281,7 +288,7 @@ Expose via the settings table (key `indexing.concurrency`):
```typescript
interface IndexingSettings {
  concurrency: number; // 1–max(cpus-1, 1); default 2
}
```
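Values read from the settings table should be clamped into the documented range. A minimal sketch, assuming a fallback to the default of 2 for unparseable values:

```typescript
import * as os from 'node:os';

// Clamp a user-supplied indexing.concurrency value into 1..max(cpus-1, 1).
function clampConcurrency(requested: number): number {
  const max = Math.max(os.cpus().length - 1, 1);
  if (!Number.isFinite(requested)) return 2; // default from the settings table
  return Math.min(Math.max(1, Math.floor(requested)), max);
}
```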
@@ -362,13 +369,13 @@ The embedding stage must **not** run inside the same Worker Thread as the crawl/
### Why a dedicated embedding worker
| Concern | Per-parse-worker model | Dedicated embedding worker |
|---|---|---|
| Memory | N × ~100 MB (model weights + WASM heap) per worker | 1 × ~100 MB regardless of concurrency |
| Model warm-up | Paid once per worker spawn; cold starts slow | Paid once at server startup |
| Batch size | Each worker batches only its own job's snippets | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker | Update one file |
| API rate limiting | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled |
| Concern | Per-parse-worker model | Dedicated embedding worker |
| ------------------ | ------------------------------------------------------ | -------------------------------------------------------------------------------- |
| Memory | N × ~100 MB (model weights + WASM heap) per worker | 1 × ~100 MB regardless of concurrency |
| Model warm-up | Paid once per worker spawn; cold starts slow | Paid once at server startup |
| Batch size | Each worker batches only its own job's snippets | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker | Update one file |
| API rate limiting | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled |
With `Xenova/all-MiniLM-L6-v2`, the WASM model and weight files occupy ~90–120 MB of heap. Running three parse workers with embedded model loading costs ~300–360 MB of resident memory that can never be freed while the server is alive. A dedicated worker keeps that cost fixed at one instance.
@@ -415,6 +422,7 @@ Instead, the existing `findSnippetIdsMissingEmbeddings` query is the handshake:
5. Main thread routes this to the SSE broadcaster → UI updates the embedding progress slice.
This means:
- The embedding worker reads snippet text from the DB itself (no IPC serialisation of content).
- The model is loaded once, stays warm, and processes batches from all repositories in FIFO order.
- Parse workers are never blocked waiting for embeddings — they complete their job stages and exit immediately.
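FIFO batching over the ids returned by `findSnippetIdsMissingEmbeddings` can be sketched as follows; the batch size of 32 is an assumption, not a value fixed by this design.

```typescript
// Split snippet ids into fixed-size batches, preserving FIFO order so
// jobs queued first are embedded first.
function toBatches<T>(ids: readonly T[], size = 32): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}
```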
@@ -424,15 +432,15 @@ This means:
```typescript
// Main → Embedding worker
type EmbedRequest =
| { type: 'embed'; jobId: string; repositoryId: string; versionId: string | null }
| { type: 'shutdown' };
// Embedding worker → Main
type EmbedResponse =
| { type: 'embed-progress'; jobId: string; done: number; total: number }
| { type: 'embed-done'; jobId: string }
| { type: 'embed-failed'; jobId: string; error: string }
| { type: 'ready' }; // emitted once after model warm-up completes
```
The `ready` message allows the server startup sequence to defer routing any embed requests until the model is loaded, preventing a race on first run.