TRUEREF-0023 rewrite indexing pipeline - parallel reads - serialized writes

This commit is contained in:
Giancarmine Salucci
2026-04-02 09:49:38 +02:00
parent 9525c58e9a
commit f86be4106b
68 changed files with 5042 additions and 3131 deletions


@@ -88,6 +88,7 @@ The UI currently polls `GET /api/v1/jobs?repositoryId=...` every 2 seconds. This
#### Worker Thread lifecycle
Each worker is a long-lived `node:worker_threads` `Worker` instance that:
1. Opens its own `better-sqlite3` connection to the same database file.
2. Listens for `{ type: 'run', jobId }` messages from the main thread.
3. Runs `IndexingPipeline.run(job)`, emitting `postMessage` progress events at each stage boundary and every N files.
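The lifecycle above can be sketched as a worker entry module. This is a minimal sketch: the message guard is illustrative, and the pipeline call is stubbed with a comment since `IndexingPipeline` lives elsewhere in this design.

```typescript
import { parentPort } from 'node:worker_threads';

type RunMessage = { type: 'run'; jobId: string };

// Narrow unknown messages so a malformed payload is ignored rather than
// crashing the long-lived worker.
function isRunMessage(msg: unknown): msg is RunMessage {
  if (typeof msg !== 'object' || msg === null) return false;
  const m = msg as Partial<RunMessage>;
  return m.type === 'run' && typeof m.jobId === 'string';
}

// parentPort is null when this module is loaded outside a Worker.
if (parentPort) {
  parentPort.on('message', (msg: unknown) => {
    if (!isRunMessage(msg)) return;
    // Here the worker would load the job row over its own better-sqlite3
    // connection and run IndexingPipeline.run(job), calling
    // parentPort.postMessage(...) at each stage boundary and every N files.
  });
}
```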
@@ -100,18 +101,18 @@ Manages a pool of `concurrency` workers.
```typescript
interface WorkerPoolOptions {
concurrency: number; // default: Math.max(1, os.cpus().length - 1), capped at 4
workerScript: string; // absolute path to the compiled worker entry
}
class WorkerPool {
private workers: Worker[];
private idle: Worker[];
enqueue(jobId: string): void;
private dispatch(worker: Worker, jobId: string): void;
private onWorkerMessage(msg: WorkerMessage): void;
private onWorkerExit(worker: Worker, code: number): void;
}
```
@@ -120,12 +121,14 @@ Workers are kept alive across jobs. If a worker crashes (non-zero exit), the poo
#### Parallelism and write contention
With WAL mode enabled (already the case), SQLite supports:
- **One concurrent writer** (the transaction lock)
- **Many concurrent readers**
The `replaceSnippets` transactions for different repositories write disjoint rows, so they never conflict logically; SQLite still serialises them through its single write lock, but each transaction is short, so waits stay well inside `busy_timeout`. The `cloneFromAncestor` operation writes to the same tables but under different `version_id` values, so those transactions are likewise disjoint at the row level.
Two jobs on the **same repository** (e.g. `/my-lib/v1.0.0` and `/my-lib/v2.0.0`) can run in parallel because:
- Differential indexing (TRUEREF-0021) ensures `v2.0.0` reads from `v1.0.0`'s already-committed rows.
- The write transactions for each version touch disjoint `version_id` partitions.
@@ -134,6 +137,7 @@ If write contention still occurs under parallel load, `busy_timeout = 5000` (alr
#### Concurrency limit per repository
To prevent a user from queuing 500 tags and overwhelming the worker pool, the pool enforces:
- **Max 1 running job per repository** for the default branch (re-index).
- **Max `concurrency` total running jobs** across all repositories.
- Version jobs for the same repository are serialised within the pool (the queue picks the oldest queued version job for a given repo only when no other version job for that repo is running).
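The queue rule above can be sketched as a pure pick function. The shapes here are illustrative; the real `WorkerPool` tracks richer per-job state.

```typescript
interface QueuedJob {
  jobId: string;
  repositoryId: string;
  queuedAt: number; // epoch ms, used for FIFO ordering
}

// Oldest queued job whose repository has no job currently running.
// Returns undefined when every queued repo is busy.
function pickNext(
  queue: readonly QueuedJob[],
  runningRepos: ReadonlySet<string>,
): QueuedJob | undefined {
  return [...queue]
    .sort((a, b) => a.queuedAt - b.queuedAt)
    .find((job) => !runningRepos.has(job.repositoryId));
}
```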
@@ -148,15 +152,15 @@ Replace the opaque integer progress with a structured stage model:
```typescript
type IndexingStage =
| 'queued'
| 'differential' // computing ancestor diff
| 'crawling' // fetching files from GitHub or local FS
| 'cloning' // cloning unchanged files from ancestor (differential only)
| 'parsing' // parsing files into snippets
| 'storing' // writing documents + snippets to DB
| 'embedding' // generating vector embeddings
| 'done'
| 'failed';
```
### Extended Job Schema
@@ -172,22 +176,24 @@ The `progress` column (0–100) is retained for backward compatibility and overa
```typescript
interface ProgressMessage {
type: 'progress';
jobId: string;
stage: IndexingStage;
stageDetail?: string; // human-readable detail for the current stage
  progress: number;      // 0–100 overall
processedFiles: number;
totalFiles: number;
}
```
Workers emit this message:
- On every stage transition (crawl start, parse start, store start, embed start).
- Every `PROGRESS_EMIT_EVERY = 10` files during the parse loop.
- On job completion or failure.
The main thread receives these messages and does two things:
1. Writes the update to `indexing_jobs` in SQLite (batched — one write per message, not per file).
2. Pushes the payload to any open SSE channels for that jobId.
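Step 1 can be sketched as a pure builder that turns one message into one `UPDATE`. The column names are an assumption derived from the message fields; the real `indexing_jobs` schema may differ. The interface is re-declared locally so the sketch is self-contained.

```typescript
interface ProgressMessage {
  type: 'progress';
  jobId: string;
  stage: string;
  stageDetail?: string;
  progress: number;      // 0–100 overall
  processedFiles: number;
  totalFiles: number;
}

// One UPDATE per message: write volume scales with messages (bounded by
// PROGRESS_EMIT_EVERY), not with file count.
function toJobUpdate(msg: ProgressMessage): { sql: string; params: Array<string | number> } {
  return {
    sql: 'UPDATE indexing_jobs SET stage = ?, progress = ?, processed_files = ?, total_files = ? WHERE id = ?',
    params: [msg.stage, msg.progress, msg.processedFiles, msg.totalFiles, msg.jobId],
  };
}
```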
@@ -198,6 +204,7 @@ The main thread receives these messages and does two things:
### `GET /api/v1/jobs/:id/stream`
Opens an SSE connection for a specific job. The server:
1. Sends the current job state as the first event immediately (no initial lag).
2. Pushes `ProgressMessage` events as the worker emits them.
3. Sends a final `event: done` or `event: failed` event, then closes the connection.
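The per-event framing can be sketched as a small formatter; the `id`/`event`/`data` field layout follows the SSE wire format shown below, while the helper name is illustrative.

```typescript
// Formats one Server-Sent Events frame. `event` is 'progress', 'done',
// or 'failed'; the monotonically increasing `id` lets reconnecting
// clients resume via the Last-Event-ID header.
function toSseFrame(id: number, event: string, data: object): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The handler would write frames like these to a response opened with `Content-Type: text/event-stream`, sending the current job state as the first frame before streaming live updates.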
@@ -216,7 +223,7 @@ id: 1
event: progress
data: {"stage":"crawling","progress":0,"processedFiles":0,"totalFiles":0}
id: 2
event: progress
data: {"stage":"parsing","progress":12,"processedFiles":240,"totalFiles":2000}
@@ -281,7 +288,7 @@ Expose via the settings table (key `indexing.concurrency`):
```typescript
interface IndexingSettings {
  concurrency: number; // 1–max(cpus-1, 1); default 2
}
```
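Values read from the settings table should be clamped into the documented range. A minimal sketch, assuming a fallback to the default of 2 for unparseable values:

```typescript
import * as os from 'node:os';

// Clamp a user-supplied indexing.concurrency value into 1..max(cpus-1, 1).
function clampConcurrency(requested: number): number {
  const max = Math.max(os.cpus().length - 1, 1);
  if (!Number.isFinite(requested)) return 2; // default from the settings table
  return Math.min(Math.max(1, Math.floor(requested)), max);
}
```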
@@ -362,13 +369,13 @@ The embedding stage must **not** run inside the same Worker Thread as the crawl/
### Why a dedicated embedding worker
| Concern | Per-parse-worker model | Dedicated embedding worker |
|---|---|---|
| Memory | N × ~100 MB (model weights + WASM heap) per worker | 1 × ~100 MB regardless of concurrency |
| Model warm-up | Paid once per worker spawn; cold starts slow | Paid once at server startup |
| Batch size | Each worker batches only its own job's snippets | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker | Update one file |
| API rate limiting | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled |
| Concern | Per-parse-worker model | Dedicated embedding worker |
| ------------------ | ------------------------------------------------------ | -------------------------------------------------------------------------------- |
| Memory | N × ~100 MB (model weights + WASM heap) per worker | 1 × ~100 MB regardless of concurrency |
| Model warm-up | Paid once per worker spawn; cold starts slow | Paid once at server startup |
| Batch size | Each worker batches only its own job's snippets | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker | Update one file |
| API rate limiting | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled |
With `Xenova/all-MiniLM-L6-v2`, the WASM model and weight files occupy ~90–120 MB of heap. Running three parse workers with embedded model loading costs ~300–360 MB of resident memory that can never be freed while the server is alive. A dedicated worker keeps that cost fixed at one instance.
@@ -415,6 +422,7 @@ Instead, the existing `findSnippetIdsMissingEmbeddings` query is the handshake:
5. Main thread routes this to the SSE broadcaster → UI updates the embedding progress slice.
This means:
- The embedding worker reads snippet text from the DB itself (no IPC serialisation of content).
- The model is loaded once, stays warm, and processes batches from all repositories in FIFO order.
- Parse workers are never blocked waiting for embeddings — they complete their job stages and exit immediately.
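FIFO batching over the ids returned by `findSnippetIdsMissingEmbeddings` can be sketched as follows; the batch size of 32 is an assumption, not a value fixed by this design.

```typescript
// Split snippet ids into fixed-size batches, preserving FIFO order so
// jobs queued first are embedded first.
function toBatches<T>(ids: readonly T[], size = 32): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < ids.length; i += size) {
    batches.push(ids.slice(i, i + size));
  }
  return batches;
}
```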
@@ -424,15 +432,15 @@ This means:
```typescript
// Main → Embedding worker
type EmbedRequest =
| { type: 'embed'; jobId: string; repositoryId: string; versionId: string | null }
| { type: 'shutdown' };
// Embedding worker → Main
type EmbedResponse =
| { type: 'embed-progress'; jobId: string; done: number; total: number }
| { type: 'embed-done'; jobId: string }
| { type: 'embed-failed'; jobId: string; error: string }
| { type: 'ready' }; // emitted once after model warm-up completes
```
The `ready` message allows the server startup sequence to defer routing any embed requests until the model is loaded, preventing a race on first run.