TRUEREF-0023 rewrite indexing pipeline - parallel reads - serialized writes
@@ -335,3 +335,47 @@ Add subsequent research below this section.

- Risks / follow-ups:
  - Iteration 2 task decomposition must treat the current dirty code files from iterations 0 and 1 as the validation baseline; otherwise the executor will keep rediscovering pre-existing worktree drift instead of new task deltas.
  - The sqlite-vec bootstrap helper and the relational cleanup should be planned as one acceptance unit before any downstream vec0, worker-status, or admin-page tasks, because that is the smallest unit that removes the known broken intermediate state.

### 2026-04-01T00:00:00.000Z — TRUEREF-0023 iteration 3 navbar follow-up planning research

- Task: Plan the accepted follow-up request to add an admin route to the main navbar.
- Files inspected:
  - `prompts/TRUEREF-0023/progress.yaml`
  - `prompts/TRUEREF-0023/iteration_2/review_report.yaml`
  - `prompts/TRUEREF-0023/prompt.yaml`
  - `package.json`
  - `src/routes/+layout.svelte`
  - `src/routes/admin/jobs/+page.svelte`
- Findings:
  - The accepted iteration-2 workspace is green: `review_report.yaml` records a passing build, passing tests, and no workspace diagnostics, so this request is a narrow additive follow-up rather than a rework of the sqlite-vec/admin jobs implementation.
  - The main navbar is defined entirely in `src/routes/+layout.svelte` and already uses base-aware SvelteKit navigation via `resolve as resolveRoute` from `$app/paths` for the existing `Repositories`, `Search`, and `Settings` links.
  - The existing admin surface already lives at `src/routes/admin/jobs/+page.svelte`, which sets the page title to `Job Queue - TrueRef Admin`; a navbar entry can therefore target `/admin/jobs` directly without introducing new routes, loaders, or components.
  - Repository findings from the earlier lint planning work already confirm that the codebase avoids root-relative internal navigation in SvelteKit pages and components, so the new navbar link should follow the existing `resolveRoute('/...')` anchor pattern.
  - No dedicated test file currently covers the shared navbar. The appropriate validation for this follow-up remains repository-level `npm run build` and `npm test` after the single layout edit.
- Risks / follow-ups:
  - The follow-up navigation request should stay isolated to the shared layout so it does not reopen the accepted sqlite-vec implementation surface.
  - Build and test validation remain the appropriate regression checks because no dedicated navbar test currently exists.

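A minimal sketch of the planned link, following the existing `resolveRoute` anchor pattern already present in `src/routes/+layout.svelte` (the layout already imports `resolve as resolveRoute` from `$app/paths` per the findings above). The label and position are assumptions:

```svelte
<!-- Hypothetical navbar entry; label and placement are illustrative -->
<a href={resolveRoute('/admin/jobs')}>Admin</a>
```
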
### 2026-04-01T12:05:23.000Z — TRUEREF-0023 iteration 5 tabs filter and bulk reprocess planning research

- Task: Plan the follow-up repo-detail UI change to filter version rows in the tabs/tags view and add a bulk action that reprocesses all errored tags without adding a new backend endpoint.
- Files inspected:
  - `prompts/TRUEREF-0023/progress.yaml`
  - `prompts/TRUEREF-0023/prompt.yaml`
  - `prompts/TRUEREF-0023/iteration_2/plan.md`
  - `prompts/TRUEREF-0023/iteration_2/tasks.yaml`
  - `src/routes/repos/[id]/+page.svelte`
  - `src/routes/api/v1/libs/[id]/versions/[tag]/index/+server.ts`
  - `src/routes/api/v1/api-contract.integration.test.ts`
  - `package.json`
- Findings:
  - The relevant UI surface is entirely in `src/routes/repos/[id]/+page.svelte`; the page already loads `versions`, renders per-version state badges, and exposes per-tag `Index` and `Remove` buttons.
  - Version states are concretely `pending`, `indexing`, `indexed`, and `error`, and the page already centralizes their labels and color classes in `stateLabels` and `stateColors`.
  - Existing per-tag reprocessing is implemented by `handleIndexVersion(tag)`, which POSTs to `/api/v1/libs/:id/versions/:tag/index`; the corresponding backend route exists and returns a queued job DTO with status `202`.
  - No bulk reprocess endpoint exists, so the lowest-risk implementation is a UI-only bulk action that iterates the existing per-tag route.
  - The page already contains a bounded batching pattern in `handleRegisterSelected()` with `BATCH_SIZE = 5`, which provides a concrete local precedent for bulk tag operations without inventing a new concurrency model.
  - There is no existing page-component or browser test targeting `src/routes/repos/[id]/+page.svelte`; nearby automated coverage is API-contract focused, so this iteration should rely on `npm run build` and `npm test` regression validation unless a developer discovers an existing Svelte page harness during implementation.
  - Context7 lookup for Svelte and SvelteKit could not be completed in this environment because the configured API key is invalid; planning therefore relied on the installed versions from `package.json` (`svelte` `^5.51.0`, `@sveltejs/kit` `^2.50.2`) and the live page patterns already present in the repository.
- Risks / follow-ups:
  - Bulk reprocessing must avoid queuing duplicate jobs for tags already shown as `indexing` or already tracked in `activeVersionJobs`.
  - Filter state should be implemented as local UI state only and must not disturb the existing `onMount(loadVersions)` fetch path or the SSE job-progress flow.

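The bounded-batch precedent noted in the findings can be sketched as a generic helper. `BATCH_SIZE` mirrors the page's existing constant; `runInBatches`, `reprocessErroredTags`, and their signatures are hypothetical names for illustration, not the page's actual code:

```typescript
const BATCH_SIZE = 5; // mirrors the existing handleRegisterSelected() precedent

// Run an async action over items in bounded batches so that at most
// BATCH_SIZE requests are in flight at once.
async function runInBatches<T>(items: T[], action: (item: T) => Promise<void>): Promise<void> {
  for (let i = 0; i < items.length; i += BATCH_SIZE) {
    await Promise.all(items.slice(i, i + BATCH_SIZE).map(action));
  }
}

// Hypothetical bulk action: reprocess every errored tag through the existing
// per-tag index call, returning the tags that were queued.
async function reprocessErroredTags(
  versions: { tag: string; state: string }[],
  indexVersion: (tag: string) => Promise<void>
): Promise<string[]> {
  const errored = versions.filter((v) => v.state === 'error').map((v) => v.tag);
  await runInBatches(errored, indexVersion);
  return errored;
}
```

In the real page the action would be the existing `handleIndexVersion(tag)`, with tags already shown as `indexing` filtered out first, per the risk note above.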
@@ -47,8 +47,8 @@ Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versi

containing shell metacharacters).

3. **Path partitioning**: The changed-file list is split into `changedPaths` (added + modified + renamed-destination) and `deletedPaths`. `unchangedPaths` is derived as `ancestorFilePaths − changedPaths − deletedPaths`.

4. **Guard**: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed documents, or when all files changed (nothing to clone).

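The partitioning rule can be sketched with sets. `ChangedFile` exists in `crawler/types.ts` per the API table, but its exact shape here is an assumption, as is the helper name:

```typescript
interface ChangedFile {
  path: string;
  status: 'added' | 'modified' | 'renamed' | 'deleted';
  previousPath?: string; // set for renames (assumed field)
}

// unchangedPaths = ancestorFilePaths − changedPaths − deletedPaths
function partitionPaths(ancestorFilePaths: string[], changes: ChangedFile[]) {
  // changedPaths: added + modified + renamed-destination
  const changedPaths = new Set(
    changes.filter((c) => c.status !== 'deleted').map((c) => c.path)
  );
  const deletedPaths = new Set(
    changes.filter((c) => c.status === 'deleted').map((c) => c.path)
  );
  const unchangedPaths = ancestorFilePaths.filter(
    (p) => !changedPaths.has(p) && !deletedPaths.has(p)
  );
  return { changedPaths, deletedPaths, unchangedPaths };
}
```
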
@@ -74,18 +74,18 @@ matching files are returned. This minimises GitHub API requests and local I/O.

## API Surface Changes

| Symbol                                 | Location                            | Change                                        |
| -------------------------------------- | ----------------------------------- | --------------------------------------------- |
| `buildDifferentialPlan`                | `pipeline/differential-strategy.ts` | **New** — async function                      |
| `DifferentialPlan`                     | `pipeline/differential-strategy.ts` | **New** — interface                           |
| `findBestAncestorVersion`              | `utils/tag-order.ts`                | **New** — pure function                       |
| `fetchGitHubChangedFiles`              | `crawler/github-compare.ts`         | **New** — async function                      |
| `getChangedFilesBetweenRefs`           | `utils/git.ts`                      | **New** — sync function (uses `execFileSync`) |
| `ChangedFile`                          | `crawler/types.ts`                  | **New** — interface                           |
| `CrawlOptions.allowedPaths`            | `crawler/types.ts`                  | **New** — optional field                      |
| `IndexingPipeline.crawl()`             | `pipeline/indexing.pipeline.ts`     | **Modified** — added `allowedPaths` param     |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts`     | **New** — private method                      |
| `IndexingPipeline.run()`               | `pipeline/indexing.pipeline.ts`     | **Modified** — Stage 0 added                  |

---

@@ -88,6 +88,7 @@ The UI currently polls `GET /api/v1/jobs?repositoryId=...` every 2 seconds. This

#### Worker Thread lifecycle

Each worker is a long-lived `node:worker_threads` `Worker` instance that:

1. Opens its own `better-sqlite3` connection to the same database file.
2. Listens for `{ type: 'run', jobId }` messages from the main thread.
3. Runs `IndexingPipeline.run(job)`, emitting `postMessage` progress events at each stage boundary and every N files.

@@ -100,18 +101,18 @@ Manages a pool of `concurrency` workers.

```typescript
interface WorkerPoolOptions {
  concurrency: number; // default: Math.max(1, os.cpus().length - 1), capped at 4
  workerScript: string; // absolute path to the compiled worker entry
}

class WorkerPool {
  private workers: Worker[];
  private idle: Worker[];

  enqueue(jobId: string): void;
  private dispatch(worker: Worker, jobId: string): void;
  private onWorkerMessage(msg: WorkerMessage): void;
  private onWorkerExit(worker: Worker, code: number): void;
}
```

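The default-concurrency rule in the comment above can be written out directly; the function name is illustrative:

```typescript
import os from 'node:os';

// Default worker concurrency: leave one CPU for the main thread, cap at 4.
function defaultConcurrency(cpuCount: number = os.cpus().length): number {
  return Math.min(4, Math.max(1, cpuCount - 1));
}
```
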
@@ -120,12 +121,14 @@ Workers are kept alive across jobs. If a worker crashes (non-zero exit), the poo

#### Parallelism and write contention

With WAL mode enabled (already the case), SQLite supports:

- **One concurrent writer** (the transaction lock)
- **Many concurrent readers**

The `replaceSnippets` transactions for different repositories never contend — they write different rows. The `cloneFromAncestor` operation writes to the same tables but different `version_id` values, so WAL checkpoint logic keeps them non-overlapping at the page level.

Two jobs on the **same repository** (e.g. `/my-lib/v1.0.0` and `/my-lib/v2.0.0`) can run in parallel because:

- Differential indexing (TRUEREF-0021) ensures `v2.0.0` reads from `v1.0.0`'s already-committed rows.
- The write transactions for each version touch disjoint `version_id` partitions.

@@ -134,6 +137,7 @@ If write contention still occurs under parallel load, `busy_timeout = 5000` (alr

#### Concurrency limit per repository

To prevent a user from queuing 500 tags and overwhelming the worker pool, the pool enforces:

- **Max 1 running job per repository** for the default branch (re-index).
- **Max `concurrency` total running jobs** across all repositories.
- Version jobs for the same repository are serialised within the pool (the queue picks the oldest queued version job for a given repo only when no other version job for that repo is running).

@@ -148,15 +152,15 @@ Replace the opaque integer progress with a structured stage model:

```typescript
type IndexingStage =
  | 'queued'
  | 'differential' // computing ancestor diff
  | 'crawling' // fetching files from GitHub or local FS
  | 'cloning' // cloning unchanged files from ancestor (differential only)
  | 'parsing' // parsing files into snippets
  | 'storing' // writing documents + snippets to DB
  | 'embedding' // generating vector embeddings
  | 'done'
  | 'failed';
```

### Extended Job Schema
@@ -172,22 +176,24 @@ The `progress` column (0–100) is retained for backward compatibility and overa

```typescript
interface ProgressMessage {
  type: 'progress';
  jobId: string;
  stage: IndexingStage;
  stageDetail?: string; // human-readable detail for the current stage
  progress: number; // 0–100 overall
  processedFiles: number;
  totalFiles: number;
}
```

Workers emit this message:

- On every stage transition (crawl start, parse start, store start, embed start).
- Every `PROGRESS_EMIT_EVERY = 10` files during the parse loop.
- On job completion or failure.

The main thread receives these messages and does two things:

1. Writes the update to `indexing_jobs` in SQLite (batched — one write per message, not per file).
2. Pushes the payload to any open SSE channels for that jobId.

@@ -198,6 +204,7 @@ The main thread receives these messages and does two things:

### `GET /api/v1/jobs/:id/stream`

Opens an SSE connection for a specific job. The server:

1. Sends the current job state as the first event immediately (no initial lag).
2. Pushes `ProgressMessage` events as the worker emits them.
3. Sends a final `event: done` or `event: failed` event, then closes the connection.

@@ -216,7 +223,7 @@ id: 1
event: progress
data: {"stage":"crawling","progress":0,"processedFiles":0,"totalFiles":0}

id: 2
event: progress
data: {"stage":"parsing","progress":12,"processedFiles":240,"totalFiles":2000}

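The wire format above can be produced by a small serializer. The helper name and the event-id counter are illustrative, not the documented server code:

```typescript
// Serialize one SSE frame: "id: N", "event: <name>", JSON data, blank line.
function sseFrame(id: number, event: string, data: object): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The trailing blank line is what terminates an SSE event, so every frame must end with `\n\n`.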
@@ -281,7 +288,7 @@ Expose via the settings table (key `indexing.concurrency`):

```typescript
interface IndexingSettings {
  concurrency: number; // 1–max(cpus-1, 1); default 2
}
```

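Because the value comes from a settings row, it should be clamped into the documented `1–max(cpus-1, 1)` range on read. A sketch, with an assumed helper name and the default of 2 from the interface comment:

```typescript
import os from 'node:os';

// Clamp a stored `indexing.concurrency` setting into the documented range,
// falling back to the default of 2 for missing or malformed values.
function resolveConcurrency(stored: unknown, cpuCount: number = os.cpus().length): number {
  const max = Math.max(1, cpuCount - 1);
  const n = typeof stored === 'number' && Number.isFinite(stored) ? Math.trunc(stored) : 2;
  return Math.min(max, Math.max(1, n));
}
```
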
@@ -362,13 +369,13 @@ The embedding stage must **not** run inside the same Worker Thread as the crawl/

### Why a dedicated embedding worker

| Concern            | Per-parse-worker model                                 | Dedicated embedding worker                                                       |
| ------------------ | ------------------------------------------------------ | -------------------------------------------------------------------------------- |
| Memory             | N × ~100 MB (model weights + WASM heap) per worker     | 1 × ~100 MB regardless of concurrency                                            |
| Model warm-up      | Paid once per worker spawn; cold starts slow           | Paid once at server startup                                                      |
| Batch size         | Each worker batches only its own job's snippets        | All in-flight jobs queue to one worker → larger batches → higher WASM throughput |
| Provider migration | Must update every worker                               | Update one file                                                                  |
| API rate limiting  | N parallel streams to the same API → N×rate-limit hits | One serial stream, naturally throttled                                           |

With `Xenova/all-MiniLM-L6-v2`, the WASM model and weight files occupy ~90–120 MB of heap. Running three parse workers with embedded model loading costs ~300–360 MB of resident memory that can never be freed while the server is alive. A dedicated worker keeps that cost fixed at one instance.

@@ -415,6 +422,7 @@ Instead, the existing `findSnippetIdsMissingEmbeddings` query is the handshake:

5. Main thread routes this to the SSE broadcaster → UI updates the embedding progress slice.

This means:

- The embedding worker reads snippet text from the DB itself (no IPC serialisation of content).
- The model is loaded once, stays warm, and processes batches from all repositories in FIFO order.
- Parse workers are never blocked waiting for embeddings — they complete their job stages and exit immediately.

@@ -424,15 +432,15 @@ This means:

```typescript
// Main → Embedding worker
type EmbedRequest =
  | { type: 'embed'; jobId: string; repositoryId: string; versionId: string | null }
  | { type: 'shutdown' };

// Embedding worker → Main
type EmbedResponse =
  | { type: 'embed-progress'; jobId: string; done: number; total: number }
  | { type: 'embed-done'; jobId: string }
  | { type: 'embed-failed'; jobId: string; error: string }
  | { type: 'ready' }; // emitted once after model warm-up completes
```

The `ready` message allows the server startup sequence to defer routing any embed requests until the model is loaded, preventing a race on first run.

955 docs/features/TRUEREF-0023.md Normal file
@@ -0,0 +1,955 @@

# TRUEREF-0023 — libSQL Migration, Native Vector Search, Parallel Tag Indexing, and Performance Hardening

**Priority:** P1
**Status:** Draft
**Depends On:** TRUEREF-0001, TRUEREF-0022
**Blocks:** —

---

## Overview

TrueRef currently uses `better-sqlite3` for all database access. This creates three compounding performance problems:

1. **Vector search does not scale.** `VectorSearch.vectorSearch()` loads the entire `snippet_embeddings` table for a repository into Node.js memory and computes cosine similarity in a JavaScript loop. A repository with 100k snippets at 1536 OpenAI dimensions allocates ~600 MB per query and ties up the worker thread for seconds before returning results.
2. **Missing composite indexes cause table scans on every query.** The schema defines FK columns used in every search and embedding filter, but declares zero composite or covering indexes on them. Every call to `searchSnippets`, `findSnippetIdsMissingEmbeddings`, and `cloneFromAncestor` performs full or near-full table scans.
3. **The SQLite connection is under-configured.** Critical pragmas (`synchronous`, `cache_size`, `mmap_size`, `temp_store`) are absent, leaving significant I/O throughput on the table.

The solution is to replace `better-sqlite3` with `@libsql/better-sqlite3` — an embeddable, drop-in synchronous replacement that is a superset of the better-sqlite3 API and exposes libSQL's native vector index (`libsql_vector_idx`). Because the API is identical, no service-layer or ORM code changes are needed beyond import statements and the vector search implementation.

Two additional structural improvements are delivered in the same feature:

4. **Per-repo job serialization is too coarse.** `WorkerPool` prevents any two jobs sharing the same `repositoryId` from running in parallel. This means indexing 200 tags of a single library is fully sequential — one tag at a time — even though different tags write to entirely disjoint row sets. The constraint should track `(repositoryId, versionId)` pairs instead.
5. **Write-lock contention under parallel indexing.** When multiple parse workers flush parsed snippets simultaneously, they all compete for the SQLite write lock, spending most of their time in `busy_timeout` back-off. A single dedicated write worker eliminates this: parse workers become pure CPU workers (crawl → parse → send batches over `postMessage`) and the write worker is the sole DB writer.
6. **The admin UI is unusable under load.** The job queue page has no status or repository filters, no worker status panel, no skeleton loading, uses blocking `alert()` / `confirm()` dialogs, and `IndexingProgress` still polls every 2 seconds instead of consuming the existing SSE stream.

---

## Goals

1. Replace `better-sqlite3` with `@libsql/better-sqlite3` with minimal code churn — import paths only.
2. Add a libSQL vector index on `snippet_embeddings` so that KNN queries execute inside SQLite instead of in a JavaScript loop.
3. Add the six composite and covering indexes required by the hot query paths.
4. Tune the SQLite pragma configuration for I/O performance.
5. Eliminate the leading cause of OOM risk during semantic search.
6. Keep a single embedded database file — no external server, no network.
7. Allow multiple tags of the same repository to index in parallel (unrelated version rows, no write conflict).
8. Eliminate write-lock contention between parallel parse workers by introducing a single dedicated write worker.
9. Rebuild the admin jobs page with full filtering (status, repository, free-text), a live worker status panel, skeleton loading on initial fetch, per-action inline spinners, non-blocking toast notifications, and SSE-driven real-time updates throughout.

---

## Non-Goals

- Migrating to the async `@libsql/client` package (HTTP/embedded-replica mode).
- Changing the Drizzle ORM adapter (`drizzle-orm/better-sqlite3` stays unchanged).
- Changing `drizzle.config.ts` dialect (`sqlite` is still correct for embedded libSQL).
- Adding hybrid/approximate indexing beyond the default HNSW strategy provided by `libsql_vector_idx`.
- Parallelizing embedding batches across providers (separate feature).
- Horizontally scaling across processes.
- Allowing more than one job for the exact same `(repositoryId, versionId)` pair to run concurrently (still serialized — duplicate detection in `JobQueue` is unchanged).
- A full admin authentication system (out of scope).
- Mobile-responsive redesign of the entire admin section (out of scope).

---

## Problem Detail

### 1. Vector Search — Full Table Scan in JavaScript

**File:** `src/lib/server/search/vector.search.ts`

```typescript
// Current: no LIMIT, loads ALL embeddings for repo into memory
const rows = this.db.prepare<unknown[], RawEmbeddingRow>(sql).all(...params);

const scored: VectorSearchResult[] = rows.map((row) => {
  const embedding = new Float32Array(
    row.embedding.buffer,
    row.embedding.byteOffset,
    row.embedding.byteLength / 4
  );
  return { snippetId: row.snippet_id, score: cosineSimilarity(queryEmbedding, embedding) };
});

return scored.sort((a, b) => b.score - a.score).slice(0, limit);
```

For a repo with N snippets and D dimensions, this allocates `N × D × 4` bytes per query. At N=100k and D=1536, that is ~600 MB allocated synchronously. The result is sorted entirely in JS before the top-k is returned. With a native vector index, SQLite returns only the top-k rows.
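The allocation figure can be checked directly; the function name is illustrative:

```typescript
// Per-query allocation of the current in-memory scan: N × D × 4 bytes
// (one 32-bit float per dimension).
function scanBytes(snippets: number, dims: number): number {
  return snippets * dims * 4;
}

const mb = scanBytes(100_000, 1536) / (1024 * 1024); // ≈ 586 MiB, the "~600 MB" above
```
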
### 2. Missing Composite Indexes
|
||||
|
||||
The `snippets`, `documents`, and `snippet_embeddings` tables are queried with multi-column WHERE predicates in every hot path, but no composite indexes exist:
|
||||
|
||||
| Table | Filter columns | Used in |
|
||||
| -------------------- | ----------------------------- | ---------------------------------------------- |
|
||||
| `snippets` | `(repository_id, version_id)` | All search, diff, clone |
|
||||
| `snippets` | `(repository_id, type)` | Type-filtered queries |
|
||||
| `documents` | `(repository_id, version_id)` | Diff strategy, clone |
|
||||
| `snippet_embeddings` | `(profile_id, snippet_id)` | `findSnippetIdsMissingEmbeddings` LEFT JOIN |
|
||||
| `repositories` | `(state)` | `searchRepositories` WHERE `state = 'indexed'` |
|
||||
| `indexing_jobs` | `(repository_id, status)` | Job status lookups |
|
||||
|
||||
Without these indexes, SQLite performs a B-tree scan of the primary key and filters rows in memory. On a 500k-row `snippets` table this is the dominant cost of every search.
|
||||
|
||||
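The table above maps directly to standard `CREATE INDEX` statements. A migration might look like the following; the index names are assumptions, not committed schema:

```sql
CREATE INDEX IF NOT EXISTS idx_snippets_repo_version ON snippets (repository_id, version_id);
CREATE INDEX IF NOT EXISTS idx_snippets_repo_type ON snippets (repository_id, type);
CREATE INDEX IF NOT EXISTS idx_documents_repo_version ON documents (repository_id, version_id);
CREATE INDEX IF NOT EXISTS idx_embeddings_profile_snippet ON snippet_embeddings (profile_id, snippet_id);
CREATE INDEX IF NOT EXISTS idx_repositories_state ON repositories (state);
CREATE INDEX IF NOT EXISTS idx_jobs_repo_status ON indexing_jobs (repository_id, status);
```
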

### 3. Under-configured SQLite Connection

**File:** `src/lib/server/db/client.ts` and `src/lib/server/db/index.ts`

Current pragmas:

```typescript
client.pragma('journal_mode = WAL');
client.pragma('foreign_keys = ON');
client.pragma('busy_timeout = 5000');
```

Missing:

- `synchronous = NORMAL` — halves fsync overhead vs the default FULL; safe with WAL
- `cache_size = -65536` — 64 MB page cache; the default is 2 MB
- `temp_store = MEMORY` — temp tables and sort spills stay in RAM
- `mmap_size = 268435456` — 256 MB memory-mapped read path; bypasses system-call overhead for reads
- `wal_autocheckpoint = 1000` — more frequent checkpoints prevent WAL growth

### 4. Admin UI — Current Problems

**File:** `src/routes/admin/jobs/+page.svelte`, `src/lib/components/IndexingProgress.svelte`

| Problem                                                        | Location                                  | Impact                                                       |
| -------------------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------------ |
| `IndexingProgress` polls every 2 s via `setInterval` + `fetch` | `IndexingProgress.svelte`                 | Constant HTTP traffic; progress lags by up to 2 s            |
| No status or repository filter controls                        | `admin/jobs/+page.svelte`                 | With 200 tag jobs, finding a specific one requires scrolling |
| No worker status panel                                         | — (no endpoint exists)                    | Operator cannot see which workers are busy or idle           |
| `alert()` for errors, `confirm()` for cancel                   | `admin/jobs/+page.svelte` — `showToast()` | Blocks the entire browser tab; unusable under parallel jobs  |
| `actionInProgress` is a single string, not per-job             | `admin/jobs/+page.svelte`                 | Pausing job A disables buttons on all other jobs             |
| No skeleton loading — blank + spinner on first load            | `admin/jobs/+page.svelte`                 | Layout shift; no structural preview while data loads         |
| Hard-coded `limit=50` query, no pagination                     | `admin/jobs/+page.svelte:fetchJobs()`     | Page truncates silently for large queues                     |

---

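The missing pragmas listed in the SQLite connection section could be applied at connection setup alongside the existing three. A sketch, with an assumed helper name and a minimal structural type for the client:

```typescript
// Apply the missing performance pragmas from the list above.
function tunePragmas(client: { pragma(stmt: string): unknown }): void {
  client.pragma('synchronous = NORMAL'); // safe with WAL
  client.pragma('cache_size = -65536'); // 64 MB page cache
  client.pragma('temp_store = MEMORY');
  client.pragma('mmap_size = 268435456'); // 256 MB mmap read path
  client.pragma('wal_autocheckpoint = 1000');
}
```
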
## Architecture

### Drop-In Replacement: `@libsql/better-sqlite3`

`@libsql/better-sqlite3` is published by Turso and implemented as a Node.js native addon wrapping the libSQL embedded engine. The exported class is API-compatible with `better-sqlite3`:

```typescript
// before
import Database from 'better-sqlite3';
const db = new Database('/path/to/file.db');
db.pragma('journal_mode = WAL');
const rows = db.prepare('SELECT ...').all(...params);

// after — identical code
import Database from '@libsql/better-sqlite3';
const db = new Database('/path/to/file.db');
db.pragma('journal_mode = WAL');
const rows = db.prepare('SELECT ...').all(...params);
```

All of the following continue to work unchanged:

- The `drizzle-orm/better-sqlite3` adapter and `migrate` helper
- `drizzle-kit` with `dialect: 'sqlite'`
- Prepared statements, transactions, WAL pragmas, foreign keys
- Worker-thread per-thread connections (`worker-entry.ts`, `embed-worker-entry.ts`)
- All `import type Database from 'better-sqlite3'` type imports (replaced in lock-step)

### Vector Index Design

libSQL provides `libsql_vector_idx()` — a virtual index type stored in a shadow table alongside the main table. Once indexed, KNN queries use a SQL `vector_top_k()` function:

```sql
-- KNN: return top-k snippet IDs closest to the query vector
SELECT snippet_id
FROM vector_top_k('idx_snippet_embeddings_vec', vector_from_float32(?), ?)
```

`vector_from_float32(blob)` accepts the same raw little-endian Float32 bytes currently stored in the `embedding` blob column. **No data migration is needed** — the existing blob column can be re-indexed with `libsql_vector_idx` pointing at the bytes-stored column.

The index strategy:

1. Add a generated `vec_embedding` column of type `F32_BLOB(dimensions)` to `snippet_embeddings`, populated from the existing `embedding` blob via a migration trigger.
2. Create the vector index: `CREATE INDEX idx_snippet_embeddings_vec ON snippet_embeddings (libsql_vector_idx(vec_embedding))`.
3. Rewrite `VectorSearch.vectorSearch()` to use `vector_top_k()` with a two-step join instead of the in-memory loop.
4. Update `EmbeddingService.embedSnippets()` to write `vec_embedding` on insert.

Dimensions are profile-specific. Because the index is per-column, a separate index is needed per embedding dimensionality. For v1, a single index covering the default profile's dimensions is sufficient; multi-profile KNN can be handled with a `WHERE profile_id = ?` pre-filter on the `vector_top_k` results.

### Updated Vector Search Query
```typescript
vectorSearch(queryEmbedding: Float32Array, options: VectorSearchOptions): VectorSearchResult[] {
  const { repositoryId, versionId, profileId = 'local-default', limit = 50 } = options;

  // Encode query vector as raw bytes (same format as stored blobs)
  const queryBytes = Buffer.from(queryEmbedding.buffer);

  // Use libSQL vector_top_k for ANN — returns ordered (rowid, distance) pairs
  let sql = `
    SELECT se.snippet_id,
           vector_distance_cos(se.vec_embedding, vector_from_float32(?)) AS score
    FROM vector_top_k('idx_snippet_embeddings_vec', vector_from_float32(?), ?) AS knn
    JOIN snippet_embeddings se ON se.rowid = knn.id
    JOIN snippets s ON s.id = se.snippet_id
    WHERE s.repository_id = ?
      AND se.profile_id = ?
  `;
  const params: unknown[] = [queryBytes, queryBytes, limit * 4, repositoryId, profileId];

  if (versionId) {
    sql += ' AND s.version_id = ?';
    params.push(versionId);
  }

  sql += ' ORDER BY score ASC LIMIT ?';
  params.push(limit);

  return this.db
    .prepare<unknown[], { snippet_id: string; score: number }>(sql)
    .all(...params)
    .map((row) => ({ snippetId: row.snippet_id, score: 1 - row.score }));
}
```

`vector_distance_cos` returns distance (0 = identical), so `1 - distance` gives a similarity score in [0, 1] matching the existing `VectorSearchResult.score` contract.
|
||||

---

## Implementation Plan

### Phase 1 — Package Swap (no logic changes)

**Files touched:** `package.json`, all `.ts` files that import `better-sqlite3`

1. In `package.json`:
   - Remove `"better-sqlite3": "^12.6.2"` from `dependencies`
   - Add `"@libsql/better-sqlite3": "^0.4.0"` to `dependencies`
   - Remove `"@types/better-sqlite3": "^7.6.13"` from `devDependencies` (`@libsql/better-sqlite3` ships its own TypeScript declarations)

2. Replace all import statements (35 occurrences across the 29 files listed below):

| Old import                                                      | New import                                           |
| --------------------------------------------------------------- | ---------------------------------------------------- |
| `import Database from 'better-sqlite3'`                         | `import Database from '@libsql/better-sqlite3'`      |
| `import type Database from 'better-sqlite3'`                    | `import type Database from '@libsql/better-sqlite3'` |
| `import { drizzle } from 'drizzle-orm/better-sqlite3'`          | unchanged                                            |
| `import { migrate } from 'drizzle-orm/better-sqlite3/migrator'` | unchanged                                            |

Affected production files:

- `src/lib/server/db/index.ts`
- `src/lib/server/db/client.ts`
- `src/lib/server/embeddings/embedding.service.ts`
- `src/lib/server/pipeline/indexing.pipeline.ts`
- `src/lib/server/pipeline/job-queue.ts`
- `src/lib/server/pipeline/startup.ts`
- `src/lib/server/pipeline/worker-entry.ts`
- `src/lib/server/pipeline/embed-worker-entry.ts`
- `src/lib/server/pipeline/differential-strategy.ts`
- `src/lib/server/search/vector.search.ts`
- `src/lib/server/search/hybrid.search.service.ts`
- `src/lib/server/search/search.service.ts`
- `src/lib/server/services/repository.service.ts`
- `src/lib/server/services/version.service.ts`
- `src/lib/server/services/embedding-settings.service.ts`

Affected test files (same mechanical replacement):

- `src/routes/api/v1/api-contract.integration.test.ts`
- `src/routes/api/v1/sse-and-settings.integration.test.ts`
- `src/routes/settings/page.server.test.ts`
- `src/lib/server/db/schema.test.ts`
- `src/lib/server/embeddings/embedding.service.test.ts`
- `src/lib/server/pipeline/indexing.pipeline.test.ts`
- `src/lib/server/pipeline/differential-strategy.test.ts`
- `src/lib/server/search/search.service.test.ts`
- `src/lib/server/search/hybrid.search.service.test.ts`
- `src/lib/server/services/repository.service.test.ts`
- `src/lib/server/services/version.service.test.ts`
- `src/routes/api/v1/settings/embedding/server.test.ts`
- `src/routes/api/v1/libs/[id]/index/server.test.ts`
- `src/routes/api/v1/libs/[id]/versions/discover/server.test.ts`

3. Run all tests — they should pass with zero logic changes: `npm test`

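Since the swap is purely mechanical, it can be scripted; a sketch using GNU grep/sed (the `-i.bak` backups exist only to review the diff before committing). The pattern is anchored on `from '`, so the `drizzle-orm/better-sqlite3` imports are left untouched:

```shell
# Sketch: mechanical import swap across src/ (GNU sed; review, then delete .bak files)
grep -rl "from 'better-sqlite3'" src --include='*.ts' \
  | xargs -r sed -i.bak "s|from 'better-sqlite3'|from '@libsql/better-sqlite3'|g"
```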
### Phase 2 — Pragma Hardening

**Files touched:** `src/lib/server/db/client.ts`, `src/lib/server/db/index.ts`

Add the following pragmas to both connection factories (raw client and `initializeDatabase()`):

```typescript
client.pragma('synchronous = NORMAL');
client.pragma('cache_size = -65536'); // 64 MB (negative values are KiB)
client.pragma('temp_store = MEMORY');
client.pragma('mmap_size = 268435456'); // 256 MB
client.pragma('wal_autocheckpoint = 1000');
```

Worker threads (`worker-entry.ts`, `embed-worker-entry.ts`) open their own connections — apply the same pragmas there.

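These pragmas end up repeated in four places (two factories plus both worker entries), so a small shared helper keeps them from drifting. A sketch; `applyTuningPragmas` and `PragmaClient` are names invented here, modeled on better-sqlite3's `pragma()` method:

```typescript
// Sketch (assumed names): one list of tuning pragmas shared by every
// connection factory instead of copy-pasted pragma() calls.
interface PragmaClient {
  pragma(directive: string): unknown; // shape of better-sqlite3's pragma()
}

const TUNING_PRAGMAS = [
  'synchronous = NORMAL',
  'cache_size = -65536', // 64 MB page cache (negative values are KiB)
  'temp_store = MEMORY',
  'mmap_size = 268435456', // 256 MB
  'wal_autocheckpoint = 1000'
] as const;

function applyTuningPragmas(client: PragmaClient): void {
  for (const directive of TUNING_PRAGMAS) client.pragma(directive);
}
```

Each factory would call `applyTuningPragmas(client)` right after setting `journal_mode = WAL`.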
### Phase 3 — Composite Indexes (Drizzle migration)

**Files touched:** `src/lib/server/db/schema.ts`, new migration SQL file

Add indexes in `schema.ts` using Drizzle's `index()` helper:

```typescript
// snippets table
export const snippets = sqliteTable(
  'snippets',
  {
    /* unchanged */
  },
  (t) => [
    index('idx_snippets_repo_version').on(t.repositoryId, t.versionId),
    index('idx_snippets_repo_type').on(t.repositoryId, t.type)
  ]
);

// documents table
export const documents = sqliteTable(
  'documents',
  {
    /* unchanged */
  },
  (t) => [index('idx_documents_repo_version').on(t.repositoryId, t.versionId)]
);

// snippet_embeddings table
export const snippetEmbeddings = sqliteTable(
  'snippet_embeddings',
  {
    /* unchanged */
  },
  (table) => [
    primaryKey({ columns: [table.snippetId, table.profileId] }), // unchanged
    index('idx_embeddings_profile').on(table.profileId, table.snippetId)
  ]
);

// repositories table
export const repositories = sqliteTable(
  'repositories',
  {
    /* unchanged */
  },
  (t) => [index('idx_repositories_state').on(t.state)]
);

// indexing_jobs table
export const indexingJobs = sqliteTable(
  'indexing_jobs',
  {
    /* unchanged */
  },
  (t) => [index('idx_jobs_repo_status').on(t.repositoryId, t.status)]
);
```

Generate and apply migration: `npm run db:generate && npm run db:migrate`

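The acceptance criteria verify index usage with `EXPLAIN QUERY PLAN`; a tiny assertion helper makes that check mechanical in tests. A sketch; `planUsesIndex` and the `PlanRow` shape are invented here, mirroring SQLite's `detail` output column:

```typescript
// Sketch (hypothetical helper): check EXPLAIN QUERY PLAN rows for a named
// index. SQLite reports index usage in the detail column as
// "SEARCH <table> USING INDEX <name> (...)", while a full scan is "SCAN <table>".
interface PlanRow {
  detail: string;
}

function planUsesIndex(rows: PlanRow[], table: string, indexName: string): boolean {
  return rows.some(
    (r) => r.detail.startsWith(`SEARCH ${table}`) && r.detail.includes(`USING INDEX ${indexName}`)
  );
}
```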
### Phase 4 — Vector Column and Index (Drizzle migration)

**Files touched:** `src/lib/server/db/schema.ts`, new migration SQL, `src/lib/server/search/vector.search.ts`, `src/lib/server/embeddings/embedding.service.ts`

#### 4a. Schema: add `vec_embedding` column

Add `vec_embedding` to `snippet_embeddings`. Drizzle has no `F32_BLOB` column type helper; define one with `customType`:

```typescript
import { customType } from 'drizzle-orm/sqlite-core';

const f32Blob = (name: string, dimensions: number) =>
  customType<{ data: Buffer }>({
    dataType() {
      return `F32_BLOB(${dimensions})`;
    }
  })(name);

export const snippetEmbeddings = sqliteTable(
  'snippet_embeddings',
  {
    snippetId: text('snippet_id')
      .notNull()
      .references(() => snippets.id, { onDelete: 'cascade' }),
    profileId: text('profile_id')
      .notNull()
      .references(() => embeddingProfiles.id, { onDelete: 'cascade' }),
    model: text('model').notNull(),
    dimensions: integer('dimensions').notNull(),
    embedding: blob('embedding').notNull(), // existing blob — kept for backward compat
    vecEmbedding: f32Blob('vec_embedding', 1536), // libSQL vector column (nullable during migration fill)
    createdAt: integer('created_at').notNull()
  },
  (table) => [
    primaryKey({ columns: [table.snippetId, table.profileId] }),
    index('idx_embeddings_profile').on(table.profileId, table.snippetId)
  ]
);
```

Because dimensionality is fixed per model, `F32_BLOB(1536)` covers OpenAI `text-embedding-3-small`; `text-embedding-3-large` emits 3072-dimensional vectors by default and would need its own column and index. A follow-up can parameterize the dimensionality per profile.

#### 4b. Migration SQL: populate `vec_embedding` from the existing `embedding` blob and create the vector index

The vector index cannot be expressed in Drizzle-generated DDL — it must be applied in the FTS-style custom SQL file (`src/lib/server/db/fts.sql` or an equivalent `vectors.sql`):

```sql
-- Backfill vec_embedding from the existing raw blob data.
-- The stored blobs are already raw little-endian float32 arrays, which is
-- exactly the F32_BLOB representation, so a direct copy suffices.
UPDATE snippet_embeddings
SET vec_embedding = embedding
WHERE vec_embedding IS NULL AND embedding IS NOT NULL;

-- Create the ANN vector index. libSQL declares it over the
-- libsql_vector_idx() expression; settings are extra string arguments.
CREATE INDEX IF NOT EXISTS idx_snippet_embeddings_vec
ON snippet_embeddings (
  libsql_vector_idx(vec_embedding, 'metric=cosine', 'compress_neighbors=float8', 'max_neighbors=20')
);
```

Add a call to this SQL in `initializeDatabase()` alongside the existing `fts.sql` execution:

```typescript
const vectorSql = readFileSync(join(__dirname, 'vectors.sql'), 'utf-8');
client.exec(vectorSql);
```

#### 4c. Update `EmbeddingService.embedSnippets()`

When inserting a new embedding, write both the blob and the vec column:

```typescript
const insert = this.db.prepare<[string, string, string, number, Buffer, Buffer]>(`
  INSERT OR REPLACE INTO snippet_embeddings
    (snippet_id, profile_id, model, dimensions, embedding, vec_embedding, created_at)
  VALUES (?, ?, ?, ?, ?, ?, unixepoch())
`);

// inside the transaction:
insert.run(
  snippet.id,
  this.profileId,
  embedding.model,
  embedding.dimensions,
  embeddingBuffer,
  embeddingBuffer // same bytes — a raw float32 blob is a valid F32_BLOB value
);
```

#### 4d. Rewrite `VectorSearch.vectorSearch()`

Replace the full-scan JS loop with `vector_top_k()`:

```typescript
vectorSearch(queryEmbedding: Float32Array, options: VectorSearchOptions): VectorSearchResult[] {
  const { repositoryId, versionId, profileId = 'local-default', limit = 50 } = options;

  // Raw float32 bytes bind directly as an F32_BLOB vector
  const queryBytes = Buffer.from(queryEmbedding.buffer);
  const candidatePool = limit * 4; // over-fetch for post-filter

  let sql = `
    SELECT se.snippet_id,
           vector_distance_cos(se.vec_embedding, ?) AS distance
    FROM vector_top_k('idx_snippet_embeddings_vec', ?, ?) AS knn
    JOIN snippet_embeddings se ON se.rowid = knn.id
    JOIN snippets s ON s.id = se.snippet_id
    WHERE s.repository_id = ?
      AND se.profile_id = ?
  `;
  const params: unknown[] = [queryBytes, queryBytes, candidatePool, repositoryId, profileId];

  if (versionId) {
    sql += ' AND s.version_id = ?';
    params.push(versionId);
  }

  sql += ' ORDER BY distance ASC LIMIT ?';
  params.push(limit);

  return this.db
    .prepare<unknown[], { snippet_id: string; distance: number }>(sql)
    .all(...params)
    .map((row) => ({ snippetId: row.snippet_id, score: 1 - row.distance }));
}
```

The `score` contract is preserved (1 = identical, 0 = orthogonal). The `cosineSimilarity` helper function is no longer called at runtime but can be kept for unit tests.

### Phase 5 — Per-Job Serialization Key Fix

**Files touched:** `src/lib/server/pipeline/worker-pool.ts`

The current serialization guard uses a bare `repositoryId`:

```typescript
// current
private runningRepoIds = new Set<string>();
// blocks any job whose repositoryId is already in the set
const jobIdx = this.jobQueue.findIndex((j) => !this.runningRepoIds.has(j.repositoryId));
```

Different tags of the same repository write to completely disjoint rows (`version_id`-partitioned documents, snippets, and embeddings). The only genuine conflict is two jobs for the same `(repositoryId, versionId)` pair, which `JobQueue.enqueue()` already prevents via the `status IN ('queued', 'running')` deduplication check.

Change the guard to key on the compound pair:

```typescript
// still a Set<string>, but keyed on the compound (repositoryId, versionId) pair
private runningJobKeys = new Set<string>();

private jobKey(repositoryId: string, versionId?: string | null): string {
  return `${repositoryId}|${versionId ?? ''}`;
}
```

Update all four sites that read/write `runningRepoIds`:

| Location                             | Old                                                   | New                                                                                      |
| ------------------------------------ | ----------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| `dispatch()` find                    | `!this.runningRepoIds.has(j.repositoryId)`            | `!this.runningJobKeys.has(this.jobKey(j.repositoryId, j.versionId))`                     |
| `dispatch()` add                     | `this.runningRepoIds.add(job.repositoryId)`           | `this.runningJobKeys.add(this.jobKey(job.repositoryId, job.versionId))`                  |
| `onWorkerMessage` done/failed delete | `this.runningRepoIds.delete(runningJob.repositoryId)` | `this.runningJobKeys.delete(this.jobKey(runningJob.repositoryId, runningJob.versionId))` |
| `onWorkerExit` delete                | same                                                  | same                                                                                     |

The `QueuedJob` and `RunningJob` interfaces already carry `versionId` — no type changes needed.

The only case that remains serialized is two `versionId = null` jobs (default-branch re-index) for the same repository; both map to the stable key `"repositoryId|"` and are correctly deduplicated.

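A quick sanity check of the compound-key behavior (the same `jobKey` logic as above, reproduced standalone for illustration):

```typescript
// Standalone copy of the compound key for illustration
function jobKey(repositoryId: string, versionId?: string | null): string {
  return `${repositoryId}|${versionId ?? ''}`;
}

// Two tags of one repository produce distinct keys (may run concurrently);
// two default-branch jobs (versionId null/undefined) collide (stay serialized).
```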
---

### Phase 6 — Dedicated Write Worker (Single-Writer Pattern)

**Files touched:** `src/lib/server/pipeline/worker-types.ts`, `src/lib/server/pipeline/write-worker-entry.ts` (new), `src/lib/server/pipeline/worker-entry.ts`, `src/lib/server/pipeline/worker-pool.ts`

#### Motivation

With Phase 5 in place, N tags of the same library can index in parallel. Each parse worker currently opens its own DB connection and holds the write lock while storing parsed snippets. Under N concurrent writers, each worker spends the majority of its wall-clock time waiting in `busy_timeout` back-off. The fix is the single-writer pattern: one dedicated write worker owns the only writable DB connection; parse workers become stateless CPU workers that send write batches over `postMessage`.

```
Parse Worker 1 ──┐  WriteRequest (docs[], snippets[])    ┌── WriteAck
Parse Worker 2 ──┼──────────────────────────────────────► Write Worker (sole DB writer)
Parse Worker N ──┘                                       └── single better-sqlite3 connection
```

#### New message types (`worker-types.ts`)

```typescript
export interface WriteRequest {
  type: 'write';
  jobId: string;
  documents: SerializedDocument[];
  snippets: SerializedSnippet[];
}

export interface WriteAck {
  type: 'write_ack';
  jobId: string;
  documentCount: number;
  snippetCount: number;
}

export interface WriteError {
  type: 'write_error';
  jobId: string;
  error: string;
}

// SerializedDocument / SerializedSnippet mirror the DB column shapes
// (plain objects, safe to transfer via structured clone)
```

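On the parent side the pool will need to narrow these replies; a sketch of a discriminated-union guard (the inline types mirror `WriteAck` / `WriteError` above so the snippet is self-contained):

```typescript
// Self-contained mirrors of the reply types above
type WriterReply =
  | { type: 'write_ack'; jobId: string; documentCount: number; snippetCount: number }
  | { type: 'write_error'; jobId: string; error: string };

// Narrow a reply from the write worker before routing it
function isWriteError(msg: WriterReply): msg is Extract<WriterReply, { type: 'write_error' }> {
  return msg.type === 'write_error';
}
```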
#### Write worker (`write-worker-entry.ts`)

The write worker:

- Opens its own `Database` connection (WAL mode, all pragmas from Phase 2)
- Listens for `WriteRequest` messages
- Wraps each batch in a single transaction
- Posts `WriteAck` or `WriteError` back to the parent, which forwards the ack to the originating parse worker by `jobId`

```typescript
import Database from '@libsql/better-sqlite3';
import { workerData, parentPort } from 'node:worker_threads';
// WorkerInitData (carrying dbPath) lives in worker-types alongside the new message types
import type { WriteRequest, WriteAck, WriteError, WorkerInitData } from './worker-types.js';

const db = new Database((workerData as WorkerInitData).dbPath);
db.pragma('journal_mode = WAL');
db.pragma('synchronous = NORMAL');
db.pragma('cache_size = -65536');
db.pragma('foreign_keys = ON');

const insertDoc = db.prepare(`INSERT OR REPLACE INTO documents (...) VALUES (...)`);
const insertSnippet = db.prepare(`INSERT OR REPLACE INTO snippets (...) VALUES (...)`);

const writeBatch = db.transaction((req: WriteRequest) => {
  for (const doc of req.documents) insertDoc.run(doc);
  for (const snip of req.snippets) insertSnippet.run(snip);
});

parentPort!.on('message', (req: WriteRequest) => {
  try {
    writeBatch(req);
    const ack: WriteAck = {
      type: 'write_ack',
      jobId: req.jobId,
      documentCount: req.documents.length,
      snippetCount: req.snippets.length
    };
    parentPort!.postMessage(ack);
  } catch (err) {
    const fail: WriteError = { type: 'write_error', jobId: req.jobId, error: String(err) };
    parentPort!.postMessage(fail);
  }
});
```

#### Parse worker changes (`worker-entry.ts`)

Parse workers lose their DB connection. `IndexingPipeline` receives a `sendWrite` callback instead of a `db` instance. After parsing each file batch, the worker calls `sendWrite({ type: 'write', jobId, documents, snippets })` and awaits the `WriteAck` before continuing. This keeps back-pressure: a slow write worker naturally throttles the parse workers without additional semaphores.

#### WorkerPool changes

- Spawn one write worker at startup (always, regardless of embedding config)
- Route incoming `write_ack` / `write_error` messages to the correct waiting parse worker via a `Map<jobId, resolve>` promise registry
- The write worker is separate from the embed worker — embed writes (`snippet_embeddings`) can still go through the write worker by adding an `EmbedWriteRequest` message type, or remain in the embed worker, since embedding runs after parsing completes (no lock contention with active parse jobs)

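The `Map<jobId, resolve>` registry could be sketched like this (assumed names; the reply shapes mirror `WriteAck` / `WriteError` from Phase 6):

```typescript
// Sketch: pending-write registry that turns write_ack / write_error replies
// into promise settlement for the awaiting parse job.
type WriterReply =
  | { type: 'write_ack'; jobId: string }
  | { type: 'write_error'; jobId: string; error: string };

class PendingWrites {
  private pending = new Map<string, { resolve: () => void; reject: (e: Error) => void }>();

  // Called by the parse-job side right after posting a WriteRequest
  wait(jobId: string): Promise<void> {
    return new Promise((resolve, reject) => this.pending.set(jobId, { resolve, reject }));
  }

  // Called from the write worker's 'message' handler
  settle(msg: WriterReply): void {
    const entry = this.pending.get(msg.jobId);
    if (!entry) return; // e.g. the job was already cancelled
    this.pending.delete(msg.jobId);
    if (msg.type === 'write_ack') entry.resolve();
    else entry.reject(new Error(msg.error));
  }
}
```

One limitation worth noting: the registry assumes at most one outstanding write per `jobId`, which holds because each parse worker awaits its ack before sending the next batch.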
#### Conflict analysis with Phase 5

Phases 5 and 6 compose cleanly:

- Phase 5 allows multiple `(repo, versionId)` jobs to run concurrently
- Phase 6 ensures all those concurrent jobs share a single write path — contention is eliminated by design
- The write worker is stateless with respect to job identity; it simply executes batches in arrival order (Node.js `postMessage` delivery is FIFO-ordered)
- The embed worker remains a separate worker thread (it runs after parse completes, so it never overlaps with active parse writes for the same job)

---

### Phase 7 — Admin UI Overhaul

**Files touched:**

- `src/routes/admin/jobs/+page.svelte` — rebuilt
- `src/routes/api/v1/workers/+server.ts` — new endpoint
- `src/lib/components/admin/JobStatusBadge.svelte` — extend with spinner variant
- `src/lib/components/admin/JobSkeleton.svelte` — new
- `src/lib/components/admin/WorkerStatusPanel.svelte` — new
- `src/lib/components/admin/Toast.svelte` — new
- `src/lib/components/IndexingProgress.svelte` — switch to SSE

#### 7a. New API endpoint: `GET /api/v1/workers`

The `WorkerPool` singleton tracks running jobs in `runningJobs: Map<Worker, RunningJob>` and idle workers in `idleWorkers: Worker[]`. Expose this state as a lightweight REST snapshot:

```typescript
// GET /api/v1/workers
// Response shape:
interface WorkersResponse {
  concurrency: number; // configured max workers
  active: number; // workers with a running job
  idle: number; // workers waiting for work
  workers: WorkerStatus[]; // one entry per spawned parse worker
}

interface WorkerStatus {
  index: number; // worker slot (0-based)
  state: 'idle' | 'running'; // current state
  jobId: string | null; // null when idle
  repositoryId: string | null;
  versionId: string | null;
}
```

The route handler calls `getPool().getStatus()` — add a `getStatus(): WorkersResponse` method to `WorkerPool` that reads `runningJobs` and `idleWorkers` without any DB call. This is read-only and runs on the main thread.

The SSE stream at `/api/v1/jobs/stream` should emit a new `worker-status` event type whenever a worker transitions idle ↔ running (on `dispatch()` and job completion). This lets the worker panel update in real time without polling the REST endpoint.

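`getStatus()` is essentially a pure projection of the pool's internals; a sketch with the pool state passed in explicitly (names and shapes follow this plan, not existing code):

```typescript
interface RunningJob {
  jobId: string;
  repositoryId: string;
  versionId: string | null;
}

// Sketch: derive a WorkersResponse-style snapshot from running jobs + idle count.
// (Real slot indexes would come from stable worker identity; synthesized here.)
function snapshotWorkers(concurrency: number, running: RunningJob[], idleCount: number) {
  const workers = [
    ...running.map((j, index) => ({ index, state: 'running' as const, ...j })),
    ...Array.from({ length: idleCount }, (_, k) => ({
      index: running.length + k,
      state: 'idle' as const,
      jobId: null,
      repositoryId: null,
      versionId: null
    }))
  ];
  return { concurrency, active: running.length, idle: idleCount, workers };
}
```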
#### 7b. `GET /api/v1/jobs` — add `repositoryId` free-text and multi-status filter

The existing endpoint already accepts `repositoryId` (exact match) and `status` (single value). Extend:

- `repositoryId` to also support prefix match (e.g. `?repositoryId=/facebook` returns all `/facebook/*` repos)
- `status` to accept comma-separated values: `?status=queued,running`
- `page` and `pageSize` query params (default `pageSize=50`, max 200) in addition to `limit` for backwards compatibility

Return `{ jobs, total, page, pageSize }`, with `total` always reflecting the filtered-but-unpaginated count.

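Parsing those extended params could look like this (a sketch; `parseJobsQuery` is a name invented here, and the clamping mirrors the defaults stated above):

```typescript
const VALID_STATUSES = new Set(['queued', 'running', 'paused', 'cancelled', 'done', 'failed']);

// Sketch: normalize the extended query params with the documented defaults,
// silently dropping unknown status values
function parseJobsQuery(params: URLSearchParams) {
  const statuses = (params.get('status') ?? '')
    .split(',')
    .map((s) => s.trim())
    .filter((s) => VALID_STATUSES.has(s));
  const page = Math.max(1, Number(params.get('page')) || 1);
  const pageSize = Math.min(200, Math.max(1, Number(params.get('pageSize')) || 50));
  return { repositoryPrefix: params.get('repositoryId') ?? '', statuses, page, pageSize };
}
```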
#### 7c. New component: `JobSkeleton.svelte`

A set of skeleton rows matching the job table structure. Shown during the initial fetch before any data arrives. Uses Tailwind `animate-pulse`:

```svelte
<!-- renders N skeleton rows -->
<script lang="ts">
  let { rows = 5 }: { rows?: number } = $props();
</script>

{#each Array(rows) as _, i (i)}
  <tr>
    <td class="px-6 py-4">
      <div class="h-4 w-48 animate-pulse rounded bg-gray-200"></div>
      <div class="mt-1 h-3 w-24 animate-pulse rounded bg-gray-100"></div>
    </td>
    <td class="px-6 py-4">
      <div class="h-5 w-16 animate-pulse rounded-full bg-gray-200"></div>
    </td>
    <td class="px-6 py-4">
      <div class="h-4 w-20 animate-pulse rounded bg-gray-200"></div>
    </td>
    <td class="px-6 py-4">
      <div class="h-2 w-32 animate-pulse rounded-full bg-gray-200"></div>
    </td>
    <td class="px-6 py-4">
      <div class="h-4 w-28 animate-pulse rounded bg-gray-200"></div>
    </td>
    <td class="px-6 py-4 text-right">
      <div class="ml-auto h-7 w-20 animate-pulse rounded bg-gray-200"></div>
    </td>
  </tr>
{/each}
```

#### 7d. New component: `Toast.svelte`

Replaces all `alert()` / `console.log()` calls in the jobs page. Renders a fixed-position stack in the bottom-right corner. Each toast auto-dismisses after 4 seconds and can be manually closed:

```svelte
<!-- Usage: bind a toasts array and call push({ message, type }) -->
<script lang="ts">
  export interface ToastItem {
    id: string;
    message: string;
    type: 'success' | 'error' | 'info';
  }

  let { toasts = $bindable([]) }: { toasts: ToastItem[] } = $props();

  function dismiss(id: string) {
    toasts = toasts.filter((t) => t.id !== id);
  }
</script>

<div class="fixed right-4 bottom-4 z-50 flex flex-col gap-2">
  {#each toasts as toast (toast.id)}
    <!-- color by type, close button, auto-dismiss via onmount timer -->
  {/each}
</div>
```

The jobs page replaces `showToast()` with pushing onto the bound `toasts` array. The `confirm()` for cancel is replaced with an inline confirmation state per job (`pendingCancelId`) that shows "Confirm cancel?" / "Yes" / "No" buttons inside the row.

#### 7e. New component: `WorkerStatusPanel.svelte`

A compact panel displayed above the job table showing the worker pool health. Subscribes to the `worker-status` SSE events and falls back to polling `GET /api/v1/workers` every 5 s on SSE error:

```
┌─────────────────────────────────────────────────────────┐
│ Workers  [2 / 4 active]             ████░░░░  50%       │
│ Worker 0  ● running   /facebook/react / v18.3.0         │
│ Worker 1  ● running   /facebook/react / v17.0.2         │
│ Worker 2  ○ idle                                        │
│ Worker 3  ○ idle                                        │
└─────────────────────────────────────────────────────────┘
```

Each worker row shows: slot index, status dot (animated green pulse for running), repository ID, version tag, and a link to the job row in the table below.

#### 7f. Filter bar on the jobs page

Add a filter strip between the page header and the table:

```
[ Repository: _______________ ]  [ Status: ▾ all ]  [ 🔍 Apply ]  [ ↺ Reset ]
```

- **Repository field**: free-text input, matches `repositoryId` prefix (e.g. `/facebook` shows all `/facebook/*`)
- **Status dropdown**: multi-select checkboxes for `queued`, `running`, `paused`, `cancelled`, `done`, `failed`; default = all
- Filters are applied client-side against the loaded `jobs` array for instant feedback, and also re-fetched from the API on Apply to get the correct total count
- Filter state is mirrored to URL search params (`?repo=...&status=...`) so the view is bookmarkable and survives refresh

#### 7g. Per-job action spinner and disabled state

Replace the single `actionInProgress: string | null` with a per-job map. Note that a plain `Map` inside `$state` is not deeply reactive to `set`/`delete` in Svelte 5, so use `SvelteMap` from `svelte/reactivity`:

```typescript
import { SvelteMap } from 'svelte/reactivity';

let actionInProgress = new SvelteMap<string, 'pausing' | 'resuming' | 'cancelling'>();
```

Each action button shows an inline spinner (small `animate-spin` circle) and is disabled only for that row. Other rows remain fully interactive during the action. On completion the entry is deleted from the map.

#### 7h. `IndexingProgress.svelte` — switch from polling to SSE

The component currently uses `setInterval + fetch` at 2 s. Replace with the per-job SSE stream already available at `/api/v1/jobs/{id}/stream`:

```typescript
// replace the $effect body
$effect(() => {
  job = null;
  const es = new EventSource(`/api/v1/jobs/${jobId}/stream`);

  es.addEventListener('job-progress', (event) => {
    const data = JSON.parse((event as MessageEvent).data);
    job = { ...job, ...data };
  });

  es.addEventListener('job-done', () => {
    void fetch(`/api/v1/jobs/${jobId}`)
      .then((r) => r.json())
      .then((d) => {
        job = d.job;
        oncomplete?.();
      });
    es.close();
  });

  es.addEventListener('job-failed', (event) => {
    const data = JSON.parse((event as MessageEvent).data);
    job = { ...job, status: 'failed', error: data.error };
    oncomplete?.();
    es.close();
  });

  es.onerror = () => {
    // on SSE failure fall back to a single fetch to get current state
    es.close();
    void fetch(`/api/v1/jobs/${jobId}`)
      .then((r) => r.json())
      .then((d) => {
        job = d.job;
      });
  };

  return () => es.close();
});
```

This reduces network traffic from one request every 2 s to zero polling requests during active indexing — updates arrive as server-push events.

#### 7i. Pagination on the jobs page

Replace the hard-coded `?limit=50` fetch with paginated requests:

```typescript
let currentPage = $state(1);
const PAGE_SIZE = 50;

async function fetchJobs() {
  const params = new URLSearchParams({
    page: String(currentPage),
    pageSize: String(PAGE_SIZE),
    ...(filterRepo ? { repositoryId: filterRepo } : {}),
    ...(filterStatuses.length ? { status: filterStatuses.join(',') } : {})
  });
  const data = await fetch(`/api/v1/jobs?${params}`).then((r) => r.json());
  jobs = data.jobs;
  total = data.total;
}
```

Render a simple `« Prev  Page N of M  Next »` control below the table, hidden when `total <= PAGE_SIZE`.

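One edge case worth handling: when a filter change shrinks `total`, `currentPage` can point past the last page. A small clamp helper (invented name, same `PAGE_SIZE` semantics as above) keeps the pager valid:

```typescript
// Sketch: clamp the current page into [1, totalPages] after `total` changes
function clampPage(page: number, total: number, pageSize: number): number {
  const totalPages = Math.max(1, Math.ceil(total / pageSize));
  return Math.min(Math.max(1, page), totalPages);
}
```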
---

## Acceptance Criteria

- [ ] `npm install` with `@libsql/better-sqlite3` succeeds; `better-sqlite3` is absent from `node_modules`
- [ ] All existing unit and integration tests pass after the Phase 1 import swap
- [ ] `npm run db:migrate` applies the composite index migration cleanly against an existing database
- [ ] `npm run db:migrate` applies the vector column migration cleanly; `SELECT vec_embedding FROM snippet_embeddings LIMIT 1` returns a non-NULL value for any previously-embedded snippet
- [ ] `GET /api/v1/context?libraryId=...&query=...` with a semantic-mode or hybrid-mode request returns results in ≤ 200 ms on a repository with 50k+ snippets (vs the previous multi-second response)
- [ ] Memory profiled during a /context request shows no allocation spike proportional to repository size
- [ ] `EXPLAIN QUERY PLAN` on the `snippets` search query shows `SEARCH snippets USING INDEX idx_snippets_repo_version` instead of `SCAN snippets`
- [ ] Worker threads (`worker-entry.ts`, `embed-worker-entry.ts`) start and complete an indexing job successfully after the package swap
- [ ] `drizzle-kit studio` connects and browses the migrated database
- [ ] Re-indexing a repository after the migration correctly populates `vec_embedding` on all new snippets
- [ ] `cosineSimilarity` unit tests still pass (the function is kept)
- [ ] Starting two indexing jobs for different tags of the same repository simultaneously results in both jobs reaching `running` state concurrently (not one waiting for the other)
- [ ] Starting two indexing jobs for the **same** `(repositoryId, versionId)` pair returns the existing job (deduplication unchanged)
- [ ] With 4 parse workers and 4 concurrent tag jobs, zero `SQLITE_BUSY` errors appear in logs
- [ ] The write worker is present in the thread list during active indexing (`worker_threads` inspector shows `write-worker-entry`)
- [ ] A `WriteError` from the write worker marks the originating job as `failed` with the error message propagated to the SSE stream
- [ ] `GET /api/v1/workers` returns a `WorkersResponse` JSON object with correct `active`, `idle`, and `workers[]` fields while jobs are in-flight
- [ ] The `worker-status` SSE event is emitted by `/api/v1/jobs/stream` whenever a worker transitions state
- [ ] The admin jobs page shows skeleton rows (not a blank screen) during the initial `fetchJobs()` call
- [ ] No `alert()` or `confirm()` calls exist in `admin/jobs/+page.svelte` after this change; all notifications go through `Toast.svelte`
- [ ] Pausing job A while job B is also in progress does not disable job B's action buttons
- [ ] The status filter multi-select correctly restricts the visible job list; the URL updates to reflect the filter state
- [ ] The repository prefix filter `?repositoryId=/facebook` returns all jobs whose `repositoryId` starts with `/facebook`
- [ ] Paginating past page 1 fetches the next batch from the API, not from the client-side array
- [ ] `IndexingProgress.svelte` has no `setInterval` call; it uses `EventSource` for progress updates
- [ ] The `WorkerStatusPanel` shows the correct number of running workers live during a multi-tag indexing run
- [ ] Refreshing the jobs page with `?repo=/facebook/react&status=running` pre-populates the filters and fetches with those params

---
|
||||
|
||||
## Migration Safety
|
||||
|
||||
### Backward Compatibility
|
||||
|
||||
The `embedding` blob column is kept. The `vec_embedding` column is nullable during the backfill window and is populated in two ways:

1. The `UPDATE` in `vectors.sql` fills all existing rows on startup
2. New embeddings populate it at insert time
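Both paths rely on the same byte layout: a raw float32 blob serves the legacy `embedding` column and the vec0 column alike. A minimal serialization sketch (the helper names are illustrative, not taken from `embedding.service.ts`):

```typescript
// Serialize an embedding to the raw float32 blob layout shared by the
// legacy `embedding` column and sqlite-vec's vec0 vector columns.
function embeddingToBlob(values: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(values).buffer);
}

// Decode a stored blob back into numbers, e.g. to validate a backfilled
// row. The byte length must be a multiple of 4 (one float32 per value).
function blobToEmbedding(blob: Uint8Array): number[] {
  const floats = new Float32Array(
    blob.buffer,
    blob.byteOffset,
    blob.byteLength / 4
  );
  return Array.from(floats);
}
```

Because the layouts match, the startup `UPDATE` can copy the legacy blob verbatim; no per-row re-encoding is required.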

If `vec_embedding IS NULL` for a row (e.g., a row inserted before the migration runs), the vector index silently omits that row from results. The fallback in `HybridSearchService` to FTS-only mode still applies when no embeddings exist, so degraded-but-correct behavior is preserved.

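The degraded-mode decision can be reduced to one guard. A hypothetical sketch (the mode names and function are illustrative; the actual `HybridSearchService` logic is not reproduced here):

```typescript
type SearchMode = 'hybrid' | 'fts-only';

// Fall back to FTS-only search when no row has a populated
// vec_embedding, instead of returning empty vector results.
function chooseSearchMode(populatedEmbeddingCount: number): SearchMode {
  return populatedEmbeddingCount > 0 ? 'hybrid' : 'fts-only';
}
```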
### Rollback

Rollback before Phase 4 (vector column): remove `@libsql/better-sqlite3`, restore `better-sqlite3`, and restore the original imports. No schema changes have been made at that point.

Rollback after Phase 4: the schema now has the `vec_embedding` column. Drop the column with a migration reversal and restore the imports. The `embedding` blob is intact throughout, so there is no data loss.

### SQLite File Compatibility

libSQL embedded mode reads and writes standard SQLite 3 files: the WAL file, page size, and encoding are unchanged. An existing production database opened with `@libsql/better-sqlite3` is fully readable and writable. The vector index is stored in a shadow table, `idx_snippet_embeddings_vec_shadow`, which better-sqlite3 would simply ignore if the change is rolled back (it is a regular table with a special name).

---

## Dependencies

| Package                  | Action                        | Reason                                             |
| ------------------------ | ----------------------------- | -------------------------------------------------- |
| `better-sqlite3`         | Remove from `dependencies`    | Replaced by `@libsql/better-sqlite3`               |
| `@types/better-sqlite3`  | Remove from `devDependencies` | `@libsql/better-sqlite3` ships its own types       |
| `@libsql/better-sqlite3` | Add to `dependencies`         | Drop-in libSQL node addon                          |
| `drizzle-orm`            | No change                     | `better-sqlite3` adapter works unchanged           |
| `drizzle-kit`            | No change                     | `dialect: 'sqlite'` is correct for embedded libSQL |

No new runtime dependencies beyond the package replacement.

---

## Testing Strategy

### Unit Tests

- `src/lib/server/search/vector.search.ts`: add a test asserting KNN results are correct for a seeded 3-vector table; verify memory use is not proportional to table size (mock `db.prepare` to assert no unbounded `.all()` is called)
- `src/lib/server/embeddings/embedding.service.ts`: existing tests cover insert round-trips; verify the `vec_embedding` column is non-NULL after `embedSnippets()`

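The seeded 3-vector KNN assertion can be illustrated with a brute-force stand-in. The real test would exercise `vectorSearch()` against the vec0 index; the cosine-similarity loop below only demonstrates the expected ordering (all names here are illustrative):

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force KNN reference: rank rows by similarity to the query and
// return the top-k ids. The unit test compares vectorSearch() output
// against an ordering like this one.
function knn(
  query: number[],
  rows: { id: string; vec: number[] }[],
  k: number
): string[] {
  return rows
    .map((r) => ({ id: r.id, score: cosineSimilarity(query, r.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.id);
}
```

For seeds `a = [1, 0]`, `b = [0, 1]`, `c = [0.9, 0.1]` and query `[1, 0]`, the two nearest neighbors are `a` then `c`, which is the kind of fixed expectation the test can assert on.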
### Integration Tests

- `api-contract.integration.test.ts`: existing tests already use `new Database(':memory:')` and continue to work with `@libsql/better-sqlite3` because the in-memory path is identical
- Add one test to `api-contract.integration.test.ts`: seed a repository plus multiple embeddings, call `/api/v1/context` in semantic mode, and assert non-empty results and a response time under 500 ms on the in-memory DB

### UI Tests

- `src/routes/admin/jobs/+page.svelte`: add Vitest browser tests (Playwright) verifying:
  - Skeleton rows appear before the first fetch resolves (mock `fetch` to delay 200 ms)
  - The status filter restricts displayed rows and the URL param updates
  - Pausing job A leaves job B's buttons enabled
  - A toast appears and auto-dismisses on successful pause
  - The cancel flow shows an inline confirmation, not `window.confirm`
- `src/lib/components/IndexingProgress.svelte`: unit test that no `setInterval` is created; verify `EventSource` is opened with the correct URL

### Performance Regression Gate

Add a benchmark script `scripts/bench-vector-search.mjs` that:

1. Creates an in-memory libSQL database
2. Seeds 10,000 snippet embeddings (random `Float32Array`, 1536 dims)
3. Runs 100 `vectorSearch()` calls
4. Asserts p99 latency < 50 ms

This benchmark gates CI on Phase 4 correctness and speed.
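The p99 assertion in step 4 can use the nearest-rank percentile: with 100 samples, p99 is the 99th value after sorting ascending. A sketch of the gate (function names are illustrative; the real script's structure may differ):

```typescript
// Nearest-rank percentile: sort ascending, take index ceil(p/100 * n) - 1.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Throw (failing CI) when the p99 latency meets or exceeds the budget.
function assertP99Under(latenciesMs: number[], budgetMs: number): void {
  const p99 = percentile(latenciesMs, 99);
  if (p99 >= budgetMs) {
    throw new Error(`p99 ${p99}ms exceeds ${budgetMs}ms budget`);
  }
}
```

Using a throw rather than a boolean return lets the script's exit code fail the CI job directly.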