feat(TRUEREF-0023): add sqlite-vec search pipeline

Giancarmine Salucci
2026-04-01 14:09:19 +02:00
parent 0752636847
commit 9525c58e9a
45 changed files with 4009 additions and 614 deletions


@@ -1,29 +1,28 @@
# Findings
Last Updated: 2026-03-30T00:00:00.000Z
Last Updated: 2026-04-01T12:05:23.000Z
## Initializer Summary
- JIRA: TRUEREF-0022
- JIRA: TRUEREF-0023
- Refresh mode: REFRESH_IF_REQUIRED
- Result: Refreshed ARCHITECTURE.md and FINDINGS.md. CODE_STYLE.md remained trusted — new worker thread code follows established conventions.
- Result: Refreshed ARCHITECTURE.md and FINDINGS.md. CODE_STYLE.md remained trusted — sqlite-vec, worker-status, and write-worker additions follow the established conventions already documented.
## Research Performed
- Discovered 141 TypeScript/JavaScript source files (up from 110), with new pipeline worker, broadcaster, and SSE endpoint files.
- Read worker-pool.ts, worker-entry.ts, embed-worker-entry.ts, worker-types.ts, progress-broadcaster.ts, startup.ts, job-queue.ts to understand the new worker thread architecture.
- Read SSE endpoints (jobs/stream, jobs/[id]/stream) and job control endpoints (pause, resume, cancel).
- Read indexing settings endpoint and hooks.server.ts to verify startup wiring changes.
- Read build-workers.mjs and package.json to verify build system and dependency changes.
- Compared trusted cache state with current codebase to identify ARCHITECTURE.md as stale.
- Confirmed CODE_STYLE.md conventions still match the codebase — new code uses PascalCase classes, camelCase functions, tab indentation, ESM imports, and TypeScript discriminated unions consistent with existing style.
- Counted 149 TypeScript/JavaScript source files in the repository-wide scan and verified the live, non-generated source mix as 147 `.ts` files and 2 `.js` files.
- Read `package.json`, `.prettierrc`, and `eslint.config.js` to verify dependencies, formatting rules, and linting conventions.
- Read `sqlite-vec.ts`, `sqlite-vec.store.ts`, `vector.search.ts`, `hybrid.search.service.ts`, `schema.ts`, `client.ts`, and startup wiring to verify the accepted sqlite-vec implementation and current retrieval architecture.
- Read `worker-pool.ts`, `worker-types.ts`, `write-worker-entry.ts`, and `/api/v1/workers/+server.ts` to verify the current worker topology and status surface.
- Compared `docs/docs_cache_state.yaml` against the live docs and codebase to identify stale cache evidence and architecture drift.
- Confirmed `CODE_STYLE.md` still matches the codebase: tabs, single quotes, `trailingComma: none`, ESM imports with `node:` built-ins, flat ESLint config, and descriptive PascalCase/camelCase naming remain consistent.
## Open Questions For Planner
- Verify whether the retrieval response contract should document the new repository and version metadata fields formally in a public API reference beyond the architecture summary.
- Verify whether parser chunking should evolve further from file-level and declaration-level boundaries to member-level semantic chunks for class-heavy codebases.
- Verify whether the SSE streaming contract (event names, data shapes) should be documented in a dedicated API reference for external consumers.
- Assess whether the WorkerPool fallback mode (main-thread execution when worker scripts are missing) needs explicit test coverage or should be removed in favour of a hard build requirement.
- Verify whether the write-worker protocol should become part of the active indexing flow or remain documented as optional infrastructure only.
- Verify whether worker-status and SSE event payloads should be documented in a dedicated API reference for external consumers.
- Verify whether sqlite-vec operational details such as per-profile vec-table lifecycle and backfill behavior should move into a separate persistence document if the subsystem grows further.
- Assess whether the WorkerPool fallback mode (main-thread execution when worker scripts are missing) still belongs in the runtime contract or should be removed in favour of a hard build requirement.
## Planner Notes Template
@@ -37,6 +36,41 @@ Add subsequent research below this section.
- Findings:
- Risks / follow-ups:
### 2026-04-01 — TRUEREF-0023 initializer refresh audit
- Task: Refresh only stale or invalid documentation after the accepted sqlite-vec implementation.
- Files inspected:
- `docs/docs_cache_state.yaml`
- `docs/ARCHITECTURE.md`
- `docs/CODE_STYLE.md`
- `docs/FINDINGS.md`
- `package.json`
- `.prettierrc`
- `eslint.config.js`
- `src/hooks.server.ts`
- `src/lib/server/db/client.ts`
- `src/lib/server/db/schema.ts`
- `src/lib/server/db/sqlite-vec.ts`
- `src/lib/server/search/sqlite-vec.store.ts`
- `src/lib/server/search/vector.search.ts`
- `src/lib/server/search/hybrid.search.service.ts`
- `src/lib/server/pipeline/startup.ts`
- `src/lib/server/pipeline/worker-pool.ts`
- `src/lib/server/pipeline/worker-types.ts`
- `src/lib/server/pipeline/write-worker-entry.ts`
- `src/routes/api/v1/workers/+server.ts`
- `scripts/build-workers.mjs`
- Findings:
- The trusted cache metadata was no longer reliable as evidence for planning: `docs/docs_cache_state.yaml` still referenced 2026-03-27 hashes while `ARCHITECTURE.md` and `FINDINGS.md` had been edited later.
- `ARCHITECTURE.md` was stale. It still described only parse and embed worker concurrency, omitted the `sqlite-vec` production dependency, and did not document the current per-profile vec-table storage layer, worker-status endpoint, or write-worker infrastructure.
- The current retrieval stack uses sqlite-vec concretely: `loadSqliteVec()` bootstraps connections, `SqliteVecStore` manages vec0 tables plus rowid mapping tables, and `VectorSearch` delegates nearest-neighbor lookup to that store instead of brute-force scoring.
- The worker architecture now includes parse, embed, and write worker protocols in `worker-types.ts`, build-time bundling for all three entries, and a `/api/v1/workers` route that returns `WorkerPool` status snapshots.
- `CODE_STYLE.md` remained valid and did not require refresh. The observed source and config files still use tabs, single quotes, `trailingComma: none`, flat ESLint config, ESM imports, PascalCase class names, and camelCase helpers exactly as already documented.
- `FINDINGS.md` itself was stale because the initializer summary still referred to `TRUEREF-0022` instead of the requested `TRUEREF-0023` refresh.
- Risks / follow-ups:
- The write-worker protocol exists and is bundled, but the active indexing path is still centered on the parse plus optional embed flow. Future documentation should keep distinguishing implemented infrastructure from the currently exercised path.
- Cache validity should continue to be driven by deterministic hash evidence rather than document timestamps or trust text alone.
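
The per-profile vec0 storage described in the findings above can be illustrated with a minimal sketch of the SQL such a store might issue. This is not the actual `SqliteVecStore` code: the helper names, the `vec_snippets_` table prefix, and the `repository_id` / `version_id` metadata columns are assumptions made for illustration. Only the `vec0` module name and the `MATCH ... ORDER BY distance LIMIT ?` query form come from the sqlite-vec usage quoted elsewhere in this file.

```typescript
// Hypothetical helpers: table and column names are illustrative only and are
// not taken from the real SqliteVecStore implementation.

/** Maps a profile id to an identifier-safe vec table name (assumed naming scheme). */
function vecTableName(profileId: string): string {
	const safe = profileId.replace(/[^A-Za-z0-9_]/g, '_');
	return `vec_snippets_${safe}`;
}

/** Builds the DDL for one vec0 virtual table per embedding profile. */
function createVecTableSql(profileId: string, dimensions: number): string {
	if (!Number.isInteger(dimensions) || dimensions <= 0) {
		throw new Error(`invalid embedding dimensions: ${dimensions}`);
	}
	return (
		`CREATE VIRTUAL TABLE IF NOT EXISTS ${vecTableName(profileId)} ` +
		`USING vec0(embedding float[${dimensions}], repository_id text, version_id text)`
	);
}

/** Builds a KNN query filtered by an assumed repository_id metadata column. */
function knnQuerySql(profileId: string): string {
	return (
		`SELECT rowid, distance FROM ${vecTableName(profileId)} ` +
		`WHERE embedding MATCH ? AND repository_id = ? ` +
		`ORDER BY distance LIMIT ?`
	);
}
```

Under these assumptions, a connection that has already been bootstrapped with `sqliteVec.load(db)` would execute the DDL once per profile and bind the query vector, repository id, and limit to the three placeholders of the KNN statement.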
### 2026-03-27 — FEEDBACK-0001 initializer refresh audit
- Task: Refresh only stale documentation after changes to retrieval, formatters, token budgeting, and parser behavior.
@@ -192,3 +226,112 @@ Add subsequent research below this section.
- Risks / follow-ups:
- The fix should preserve the existing `/repos/[id]` route shape instead of redesigning it to a rest-parameter route unless a broader navigation contract change is explicitly requested.
- Any normalization helper introduced for the repo detail page should be reused consistently across server load and client event handlers to avoid mixed encoded and decoded repository IDs during navigation and fetches.
### 2026-04-01 — TRUEREF-0023 sqlite-vec replanning research
- Task: Replan the rejected libSQL-native vector iteration around sqlite-vec using the current worktree and verified runtime constraints.
- Files inspected:
- `package.json`
- `docs/docs_cache_state.yaml`
- `prompts/TRUEREF-0023/prompt.yaml`
- `prompts/TRUEREF-0023/progress.yaml`
- `prompts/TRUEREF-0023/iteration_0/review_report.yaml`
- `prompts/TRUEREF-0023/iteration_0/plan.md`
- `prompts/TRUEREF-0023/iteration_0/tasks.yaml`
- `src/lib/server/db/client.ts`
- `src/lib/server/db/index.ts`
- `src/lib/server/db/schema.ts`
- `src/lib/server/db/fts.sql`
- `src/lib/server/db/vectors.sql`
- `src/lib/server/db/schema.test.ts`
- `src/lib/server/search/vector.search.ts`
- `src/lib/server/search/hybrid.search.service.test.ts`
- `src/lib/server/embeddings/embedding.service.ts`
- `src/lib/server/embeddings/embedding.service.test.ts`
- `src/lib/server/pipeline/job-queue.ts`
- `src/lib/server/pipeline/progress-broadcaster.ts`
- `src/lib/server/pipeline/progress-broadcaster.test.ts`
- `src/lib/server/pipeline/worker-pool.ts`
- `src/lib/server/pipeline/worker-entry.ts`
- `src/lib/server/pipeline/embed-worker-entry.ts`
- `src/lib/server/pipeline/worker-types.ts`
- `src/lib/server/pipeline/startup.ts`
- `src/lib/server/pipeline/indexing.pipeline.ts`
- `src/lib/server/pipeline/indexing.pipeline.test.ts`
- `src/routes/api/v1/jobs/+server.ts`
- `src/routes/api/v1/jobs/stream/+server.ts`
- `src/routes/api/v1/jobs/[id]/stream/+server.ts`
- `src/routes/api/v1/sse-and-settings.integration.test.ts`
- `src/routes/admin/jobs/+page.svelte`
- `src/lib/components/IndexingProgress.svelte`
- `src/lib/components/admin/JobStatusBadge.svelte`
- `src/lib/components/admin/JobSkeleton.svelte`
- `src/lib/components/admin/Toast.svelte`
- `src/lib/components/admin/WorkerStatusPanel.svelte`
- `scripts/build-workers.mjs`
- `node_modules/libsql/types/index.d.ts`
- `node_modules/libsql/index.js`
- Findings:
- Iteration 0 already changed the workspace materially: direct DB imports were switched from `better-sqlite3` to `libsql`, the extra WAL-related pragmas were added in the main DB clients and embed worker, composite indexes plus `vec_embedding` were added to the Drizzle schema and migration metadata, `IndexingProgress.svelte` now uses SSE, the admin jobs page was overhauled, and `WorkerPool` now serializes on `(repositoryId, versionId)` instead of repository only.
- The rejected vector implementation is still invalid in the current tree. `src/lib/server/db/vectors.sql` contains the rejected libSQL-native assumptions, including a dangling `USING libsql_vector_idx(...)` clause with no valid `CREATE INDEX` statement, and `src/lib/server/search/vector.search.ts` still performs full-table JS cosine scoring over `snippet_embeddings` instead of true in-database KNN.
- `sqlite-vec` is not currently present in `package.json` or the lockfile, and there is no existing `sqliteVec.load(...)`, `db.loadExtension(...)`, `vec0`, or extension bootstrap code anywhere under `src/`.
- Context7 sqlite-vec docs confirm the supported Node integration path is `import * as sqliteVec from 'sqlite-vec'; sqliteVec.load(db);`, storing vectors in a `vec0` virtual table and querying with `WHERE embedding MATCH ? ORDER BY distance LIMIT ?`. The docs also show vec0 metadata columns can be filtered directly, which fits the repositoryId, versionId, and profileId requirements.
- Context7 `better-sqlite3` v12.6.2 docs confirm `db.loadExtension(path)` exists. The installed `libsql` package in this workspace also exposes `loadExtension(path): this` in `node_modules/libsql/types/index.d.ts` and `loadExtension(...args)` in `node_modules/libsql/index.js`, so extension loading is not obviously blocked by the driver API surface alone.
- The review report remains the only verified runtime evidence for the current libsql path: `vector_from_float32(...)` is unavailable and `libsql_vector_idx` DDL is rejected in this environment. That invalidates the original native-vector approach but does not by itself prove sqlite-vec extension loading succeeds through the current `libsql` package alias, so the replan must include explicit connection-bootstrap and test coverage for real extension loading on the main DB client and worker-owned connections.
- Two iteration-0 deliverables referenced in the rejected plan do not exist in the current worktree: `src/lib/server/pipeline/write-worker-entry.ts` and `src/routes/api/v1/workers/+server.ts`. `scripts/build-workers.mjs` and the admin `WorkerStatusPanel.svelte` already reference those missing paths, so iteration 1 must either create them or revert those dangling references as part of a consistent plan.
- The existing admin/SSE work is largely salvageable. `src/routes/api/v1/jobs/stream/+server.ts`, `src/routes/api/v1/jobs/[id]/stream/+server.ts`, `src/lib/server/pipeline/progress-broadcaster.ts`, `src/lib/components/IndexingProgress.svelte`, `src/lib/components/admin/JobSkeleton.svelte`, `src/lib/components/admin/Toast.svelte`, and `src/lib/components/admin/WorkerStatusPanel.svelte` provide a usable foundation, but `src/routes/admin/jobs/+page.svelte` still contains `confirm(...)` and the queue API still only supports exact `repository_id = ?` and single-status filtering.
- The existing tests still encode the rejected pre-sqlite-vec model: `embedding.service.test.ts`, `schema.test.ts`, `hybrid.search.service.test.ts`, and `indexing.pipeline.test.ts` seed and assert against `snippet_embeddings.embedding` blobs only. The sqlite-vec replan therefore needs new DB bootstrap helpers, vec-table lifecycle assertions, and vector-search tests that validate actual vec0 writes and filtered KNN queries.
- Risks / follow-ups:
- The current worktree is dirty with iteration-0 partial changes and generated migration metadata, so iteration-1 tasks must explicitly distinguish keep/revise/revert work to avoid sibling tasks fighting over the same files.
- Because the current `libsql` package appears to expose `loadExtension`, the replan should avoid assuming an immediate full revert to upstream `better-sqlite3`; instead it should sequence a driver/bootstrap compatibility decision around actual sqlite-vec extension loading behavior with testable acceptance criteria.
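
For contrast with in-database KNN, the full-table JavaScript cosine scoring that the findings above attribute to `vector.search.ts` can be sketched roughly as follows. This is a simplified illustration, not the project's code: the `StoredEmbedding` and `ScoredSnippet` shapes and the function names are hypothetical, and real rows would be decoded from `snippet_embeddings` blobs first.

```typescript
// Hypothetical row shapes; the real data lives in `snippet_embeddings`.
interface StoredEmbedding {
	snippetId: string;
	embedding: Float32Array;
}

interface ScoredSnippet {
	snippetId: string;
	score: number;
}

/** Cosine similarity of two equal-length vectors; returns 0 for zero vectors. */
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
	if (a.length !== b.length) {
		throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
	}
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i];
		normA += a[i] * a[i];
		normB += b[i] * b[i];
	}
	const denom = Math.sqrt(normA) * Math.sqrt(normB);
	return denom === 0 ? 0 : dot / denom;
}

/** Scores every candidate row and keeps the k best: O(rows * dimensions) per query. */
function bruteForceTopK(query: Float32Array, rows: StoredEmbedding[], k: number): ScoredSnippet[] {
	return rows
		.map((row) => ({ snippetId: row.snippetId, score: cosineSimilarity(query, row.embedding) }))
		.sort((x, y) => y.score - x.score)
		.slice(0, k);
}
```

Every query touches every candidate embedding, which is the cost the replan aims to move into a `vec0` nearest-neighbor lookup.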
### 2026-04-01 — TRUEREF-0023 iteration-2 current-worktree verification
- Task: Replan iteration 2 against the post-iteration-1 workspace state so the first validation unit no longer leaves a known vec_embedding mismatch behind.
- Files inspected:
- `package.json`
- `package-lock.json`
- `scripts/build-workers.mjs`
- `src/lib/server/db/client.ts`
- `src/lib/server/db/index.ts`
- `src/lib/server/db/schema.ts`
- `src/lib/server/db/vectors.sql`
- `src/lib/server/db/migrations/0006_yielding_centennial.sql`
- `src/lib/server/db/schema.test.ts`
- `src/lib/server/embeddings/embedding.service.ts`
- `src/lib/server/embeddings/embedding.service.test.ts`
- `src/lib/server/search/vector.search.ts`
- `src/lib/server/search/hybrid.search.service.ts`
- `src/lib/server/search/hybrid.search.service.test.ts`
- `src/lib/server/pipeline/job-queue.ts`
- `src/lib/server/pipeline/worker-pool.ts`
- `src/lib/server/pipeline/worker-entry.ts`
- `src/lib/server/pipeline/embed-worker-entry.ts`
- `src/lib/server/pipeline/worker-types.ts`
- `src/lib/server/pipeline/indexing.pipeline.ts`
- `src/lib/server/pipeline/indexing.pipeline.test.ts`
- `src/lib/server/pipeline/startup.ts`
- `src/lib/server/pipeline/progress-broadcaster.ts`
- `src/routes/api/v1/jobs/+server.ts`
- `src/routes/api/v1/jobs/stream/+server.ts`
- `src/routes/api/v1/sse-and-settings.integration.test.ts`
- `src/routes/admin/jobs/+page.svelte`
- `src/lib/components/IndexingProgress.svelte`
- `src/lib/components/admin/JobStatusBadge.svelte`
- `src/lib/components/admin/JobSkeleton.svelte`
- `src/lib/components/admin/Toast.svelte`
- `src/lib/components/admin/WorkerStatusPanel.svelte`
- `prompts/TRUEREF-0023/iteration_1/plan.md`
- `prompts/TRUEREF-0023/iteration_1/tasks.yaml`
- Findings:
- Iteration 1 already completed the direct-driver reset in the working tree: `package.json` and `package-lock.json` now contain real `better-sqlite3` plus `sqlite-vec`, and the current production/test files read in this pass import `better-sqlite3`, not `libsql`.
- The remaining failing intermediate state is exactly the schema/write mismatch called out in the review report: `src/lib/server/db/schema.ts` and `src/lib/server/db/migrations/0006_yielding_centennial.sql` still declare `vec_embedding`, `src/lib/server/db/index.ts` still executes `vectors.sql`, and `src/lib/server/embeddings/embedding.service.ts` still inserts `(embedding, vec_embedding)` into `snippet_embeddings`.
- `src/lib/server/db/vectors.sql` is still invalid startup SQL. It contains a dangling `USING libsql_vector_idx(...)` clause with no enclosing `CREATE INDEX`, so leaving it in the initialization path keeps the rejected libSQL-native design alive.
- The first iteration-1 task boundary was therefore wrong for the current baseline: the package/import reset is already present, but it only becomes a valid foundation once the relational `vec_embedding` artifacts and `EmbeddingService` insert path are cleaned up in the same validation unit.
- The current search path is still the pre-sqlite-vec implementation. `src/lib/server/search/vector.search.ts` reads every candidate embedding blob and scores in JavaScript; no `vec0`, `sqliteVec.load(db)`, or sqlite-vec KNN query exists anywhere under `src/` yet.
- The write worker and worker-status backend are still missing in the live tree even though they are already referenced elsewhere: `scripts/build-workers.mjs` includes `src/lib/server/pipeline/write-worker-entry.ts`, `src/lib/components/admin/WorkerStatusPanel.svelte` fetches `/api/v1/workers`, and `src/routes/api/v1/jobs/stream/+server.ts` currently has no worker-status event source.
- The admin jobs page remains incomplete but salvageable: `src/routes/admin/jobs/+page.svelte` still uses `confirm(...)` and `alert(...)`, while `JobSkeleton.svelte`, `Toast.svelte`, `WorkerStatusPanel.svelte`, `JobStatusBadge.svelte`, and `IndexingProgress.svelte` already provide the intended UI foundation.
- `src/lib/server/pipeline/job-queue.ts` still only supports exact `repository_id = ?` and single `status = ?` filtering, so API-side filter work remains a separate backend task and does not need to block the vector-storage implementation.
- Risks / follow-ups:
- Iteration 2 task decomposition must treat the current dirty code files from iterations 0 and 1 as the validation baseline; otherwise the executor will keep rediscovering pre-existing worktree drift instead of new task deltas.
- The sqlite-vec bootstrap helper and the relational cleanup should be planned as one acceptance unit before any downstream vec0, worker-status, or admin-page tasks, because that is the smallest unit that removes the known broken intermediate state.
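
The job-queue limitation noted in the findings above (exact `repository_id = ?` plus a single `status = ?`) could be relaxed along the following lines. This is only a sketch of one possible direction, assuming multi-status filtering is what the separate backend task is after. The `JobFilter` shape and `buildJobWhere` helper are hypothetical and do not exist in `job-queue.ts`.

```typescript
// Hypothetical filter shape; the current job queue accepts a single status only.
interface JobFilter {
	repositoryId?: string;
	statuses?: string[];
}

interface SqlFragment {
	where: string;
	params: string[];
}

/** Builds a parameterized WHERE clause; values are always bound, never interpolated. */
function buildJobWhere(filter: JobFilter): SqlFragment {
	const clauses: string[] = [];
	const params: string[] = [];

	if (filter.repositoryId !== undefined) {
		clauses.push('repository_id = ?');
		params.push(filter.repositoryId);
	}

	if (filter.statuses && filter.statuses.length > 0) {
		// One placeholder per requested status, e.g. "status IN (?, ?)".
		const placeholders = filter.statuses.map(() => '?').join(', ');
		clauses.push(`status IN (${placeholders})`);
		params.push(...filter.statuses);
	}

	return {
		where: clauses.length > 0 ? `WHERE ${clauses.join(' AND ')}` : '',
		params
	};
}
```

An empty filter yields an empty clause, so the same helper would cover the unfiltered listing as well as the filtered one.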