TRUEREF-0023 rewrite indexing pipeline - parallel reads - serialized writes
@@ -88,6 +88,7 @@ The UI currently polls `GET /api/v1/jobs?repositoryId=...` every 2 seconds. This

#### Worker Thread lifecycle

Each worker is a long-lived `node:worker_threads` `Worker` instance that:

1. Opens its own `better-sqlite3` connection to the same database file.
2. Listens for `{ type: 'run', jobId }` messages from the main thread.
3. Runs `IndexingPipeline.run(job)`, emitting `postMessage` progress events at each stage boundary and every N files.
@@ -100,18 +101,18 @@ Manages a pool of `concurrency` workers.

```typescript
interface WorkerPoolOptions {
  concurrency: number;  // default: Math.max(1, os.cpus().length - 1), capped at 4
  workerScript: string; // absolute path to the compiled worker entry
}

class WorkerPool {
  private workers: Worker[];
  private idle: Worker[];

  enqueue(jobId: string): void;
  private dispatch(worker: Worker, jobId: string): void;
  private onWorkerMessage(msg: WorkerMessage): void;
  private onWorkerExit(worker: Worker, code: number): void;
}
```
@@ -120,12 +121,14 @@ Workers are kept alive across jobs. If a worker crashes (non-zero exit), the poo

#### Parallelism and write contention

With WAL mode enabled (already the case), SQLite supports:

- **One concurrent writer** (the transaction lock)
- **Many concurrent readers**

The `replaceSnippets` transaction for different repositories never contends — they write different rows. The `cloneFromAncestor` operation writes to the same tables but different `version_id` values, so WAL checkpoint logic keeps them non-overlapping at the page level.

Two jobs on the **same repository** (e.g. `/my-lib/v1.0.0` and `/my-lib/v2.0.0`) can run in parallel because:

- Differential indexing (TRUEREF-0021) ensures `v2.0.0` reads from `v1.0.0`'s already-committed rows.
- The write transactions for each version touch disjoint `version_id` partitions.
@@ -134,6 +137,7 @@ If write contention still occurs under parallel load, `busy_timeout = 5000` (alr

#### Concurrency limit per repository

To prevent a user from queuing 500 tags and overwhelming the worker pool, the pool enforces:

- **Max 1 running job per repository** for the default branch (re-index).
- **Max `concurrency` total running jobs** across all repositories.
- Version jobs for the same repository are serialised within the pool (the queue picks the oldest queued version job for a given repo only when no other version job for that repo is running).
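
The queueing rules above can be sketched as a pure selection function. `QueuedJob` and `pickNextJob` are illustrative names, not part of the existing pool code; the sketch folds the default-branch and version-job rules into a single max-one-running-job-per-repository check:

```typescript
interface QueuedJob {
  jobId: string;
  repositoryId: string;
  queuedAt: number; // epoch ms; lower = older
}

// Picks the next job to dispatch, or null if nothing is runnable right now.
// Enforces the global `concurrency` cap and at most one running job per
// repository, preferring the oldest queued job.
function pickNextJob(
  queued: QueuedJob[],
  running: QueuedJob[],
  concurrency: number,
): QueuedJob | null {
  if (running.length >= concurrency) return null;
  const busyRepos = new Set(running.map((j) => j.repositoryId));
  const runnable = queued
    .filter((j) => !busyRepos.has(j.repositoryId))
    .sort((a, b) => a.queuedAt - b.queuedAt);
  return runnable[0] ?? null;
}
```

The pool would call this each time a worker goes idle or a job is enqueued, then dispatch the returned job to an idle worker.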
@@ -148,15 +152,15 @@ Replace the opaque integer progress with a structured stage model:

```typescript
type IndexingStage =
  | 'queued'
  | 'differential' // computing ancestor diff
  | 'crawling'     // fetching files from GitHub or local FS
  | 'cloning'      // cloning unchanged files from ancestor (differential only)
  | 'parsing'      // parsing files into snippets
  | 'storing'      // writing documents + snippets to DB
  | 'embedding'    // generating vector embeddings
  | 'done'
  | 'failed';
```
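
One way to derive a backward-compatible 0–100 `progress` value from the stage model is a weighted stage table. The boundaries below are illustrative assumptions, not values from this spec (the type is restated so the sketch is self-contained):

```typescript
type IndexingStage = 'queued' | 'differential' | 'crawling' | 'cloning'
  | 'parsing' | 'storing' | 'embedding' | 'done' | 'failed';

// Assumed stage boundaries (not from the spec): overall progress when each
// stage begins, so e.g. 'parsing' spans 30-70 and 'embedding' spans 80-100.
const STAGE_START: Record<IndexingStage, number> = {
  queued: 0, differential: 2, crawling: 5, cloning: 25,
  parsing: 30, storing: 70, embedding: 80, done: 100, failed: 100,
};
const ORDER: IndexingStage[] = ['queued', 'differential', 'crawling',
  'cloning', 'parsing', 'storing', 'embedding', 'done'];

// Stage start offset plus the file-level completion within the stage,
// scaled to the stage's span of the bar.
function overallProgress(stage: IndexingStage, processed: number, total: number): number {
  const i = ORDER.indexOf(stage);
  if (i < 0) return 100; // 'failed'; in practice the last value would be kept
  const start = STAGE_START[stage];
  const end = i + 1 < ORDER.length ? STAGE_START[ORDER[i + 1]] : 100;
  const frac = total > 0 ? Math.min(1, processed / total) : 0;
  return Math.round(start + (end - start) * frac);
}
```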
### Extended Job Schema
@@ -172,22 +176,24 @@ The `progress` column (0–100) is retained for backward compatibility and overa

```typescript
interface ProgressMessage {
  type: 'progress';
  jobId: string;
  stage: IndexingStage;
  stageDetail?: string; // human-readable detail for the current stage
  progress: number;     // 0–100 overall
  processedFiles: number;
  totalFiles: number;
}
```

Workers emit this message:

- On every stage transition (crawl start, parse start, store start, embed start).
- Every `PROGRESS_EMIT_EVERY = 10` files during the parse loop.
- On job completion or failure.

The main thread receives these messages and does two things:

1. Writes the update to `indexing_jobs` in SQLite (batched — one write per message, not per file).
2. Pushes the payload to any open SSE channels for that jobId.
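
The parse-loop cadence can be sketched as a small predicate. `shouldEmitProgress` is a hypothetical helper, and emitting on the final file is an added assumption so the last progress event is never skipped:

```typescript
const PROGRESS_EMIT_EVERY = 10;

// True when the parse loop should emit a progress message after finishing
// the given file: every Nth file, and always on the last one (assumption).
function shouldEmitProgress(processedFiles: number, totalFiles: number): boolean {
  return processedFiles % PROGRESS_EMIT_EVERY === 0 || processedFiles === totalFiles;
}
```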
@@ -198,6 +204,7 @@ The main thread receives these messages and does two things:
### `GET /api/v1/jobs/:id/stream`

Opens an SSE connection for a specific job. The server:

1. Sends the current job state as the first event immediately (no initial lag).
2. Pushes `ProgressMessage` events as the worker emits them.
3. Sends a final `event: done` or `event: failed` event, then closes the connection.
@@ -216,7 +223,7 @@ id: 1
event: progress
data: {"stage":"crawling","progress":0,"processedFiles":0,"totalFiles":0}

id: 2
event: progress
data: {"stage":"parsing","progress":12,"processedFiles":240,"totalFiles":2000}
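
Frames like the above can be produced by a one-line serialiser (`sseFrame` is a hypothetical helper, shown only to pin down the wire format):

```typescript
// Serialises one SSE frame: id line, event name, JSON data payload, and the
// blank-line terminator that separates frames. The data must not contain
// raw newlines, which JSON.stringify guarantees here.
function sseFrame(id: number, event: string, data: unknown): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```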
@@ -281,7 +288,7 @@ Expose via the settings table (key `indexing.concurrency`):

```typescript
interface IndexingSettings {
  concurrency: number; // 1–max(cpus-1, 1); default 2
}
```
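
Reading this setting defensively might look like the following sketch; `resolveConcurrency` is a hypothetical helper that clamps into the documented range and falls back to the default of 2:

```typescript
// Clamps a raw settings value (possibly missing or malformed) into the
// documented range 1..max(cpus - 1, 1); non-integer input falls back to
// the default of 2 before clamping.
function resolveConcurrency(raw: unknown, cpuCount: number): number {
  const max = Math.max(cpuCount - 1, 1);
  const n = typeof raw === 'number' && Number.isInteger(raw) ? raw : 2;
  return Math.min(Math.max(n, 1), max);
}
```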
@@ -362,13 +369,13 @@ The embedding stage must **not** run inside the same Worker Thread as the crawl/

### Why a dedicated embedding worker

| Concern | Per-parse-worker model | Dedicated embedding worker |
|---|---|---|
| Memory | N × ~100 MB (model weights + WASM heap) per worker | 1 × ~100 MB regardless of concurrency |
| Model warm-up | Paid once per worker spawn; cold starts slow | Paid once at server startup |
| Batch size | Each worker batches only its own job's snippets | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker | Update one file |
| API rate limiting | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled |

With `Xenova/all-MiniLM-L6-v2`, the WASM model and weight files occupy ~90–120 MB of heap. Running three parse workers with embedded model loading costs ~300–360 MB of resident memory that can never be freed while the server is alive. A dedicated worker keeps that cost fixed at one instance.
@@ -415,6 +422,7 @@ Instead, the existing `findSnippetIdsMissingEmbeddings` query is the handshake:

5. Main thread routes this to the SSE broadcaster → UI updates the embedding progress slice.

This means:

- The embedding worker reads snippet text from the DB itself (no IPC serialisation of content).
- The model is loaded once, stays warm, and processes batches from all repositories in FIFO order.
- Parse workers are never blocked waiting for embeddings — they complete their job stages and exit immediately.
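
The FIFO behaviour can be sketched as a minimal dispatch gate on the main thread. `EmbedQueue` is an illustrative structure, not existing code; `enqueue` and `complete` return the request that should be posted to the single embedding worker next, if any:

```typescript
interface EmbedJob { jobId: string; repositoryId: string; versionId: string | null }

// FIFO gate for the single embedding worker: at most one embed request is
// in flight; the rest wait in arrival order across all repositories.
class EmbedQueue {
  private pending: EmbedJob[] = [];
  private active: EmbedJob | null = null;

  // Returns the job to dispatch now, or null if the worker is busy.
  enqueue(job: EmbedJob): EmbedJob | null {
    this.pending.push(job);
    return this.next();
  }

  // Called on `embed-done` / `embed-failed`; returns the next job, if any.
  complete(): EmbedJob | null {
    this.active = null;
    return this.next();
  }

  private next(): EmbedJob | null {
    if (this.active || this.pending.length === 0) return null;
    this.active = this.pending.shift()!;
    return this.active;
  }
}
```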
@@ -424,15 +432,15 @@ This means:

```typescript
// Main → Embedding worker
type EmbedRequest =
  | { type: 'embed'; jobId: string; repositoryId: string; versionId: string | null }
  | { type: 'shutdown' };

// Embedding worker → Main
type EmbedResponse =
  | { type: 'embed-progress'; jobId: string; done: number; total: number }
  | { type: 'embed-done'; jobId: string }
  | { type: 'embed-failed'; jobId: string; error: string }
  | { type: 'ready' }; // emitted once after model warm-up completes
```

The `ready` message allows the server startup sequence to defer routing any embed requests until the model is loaded, preventing a race on first-run.