feat(TRUEREF-0021): implement differential tag indexing

U811073
2026-03-30 13:12:50 +02:00
committed by Giancarmine Salucci
parent e63279fcf6
commit f4fe8c6043
10 changed files with 1281 additions and 9 deletions


@@ -0,0 +1,113 @@
# TRUEREF-0021 — Differential Tag Indexing
**Priority:** P1
**Status:** Implemented
**Depends On:** TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
**Blocks:**
---
## Problem Statement
Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC
UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the
overwhelming majority of files are unchanged — often only dependency manifests (`package.json`,
`*.lock`) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API
quota, and embedding credits.
---
## Solution
Differential tag indexing detects when an already-indexed ancestor version exists for a given
target tag, determines exactly which files changed, and:
1. **Clones** unchanged document rows, snippet rows, and embedding rows from the ancestor version
into the target version in a single SQLite transaction (`cloneFromAncestor`).
2. **Crawls** only the changed (added / modified) files, parses and embeds them normally.
3. **Skips** deleted files (not cloned, not crawled).
4. **Falls back** to a full crawl, without failing the job, when no indexed ancestor can be found or any step fails.
---
## Algorithm
### Stage 0 — Differential Plan (`buildDifferentialPlan`)
Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versionId`:
1. **Ancestor selection** (`findBestAncestorVersion` in `tag-order.ts`): Loads all `indexed`
versions for the repository, parses their tags as semver, and returns the closest predecessor
to the target tag. Falls back to creation-timestamp ordering for non-semver tags.
2. **Changed-file detection**: For GitHub repositories, calls the GitHub Compare API
(`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses
`git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts` (implemented with
`execFileSync` — not `execSync` — to prevent shell-injection attacks on branch/tag names
containing shell metacharacters).
3. **Path partitioning**: The changed-file list is split into `changedPaths` (added, modified,
and renamed-destination paths) and `deletedPaths`. `unchangedPaths` is derived as the set
difference `ancestorFilePaths - changedPaths - deletedPaths`.
4. **Guard**: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed
documents, or when all files changed (nothing to clone).
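
Steps 2 to 4 can be sketched as two small pure functions. This is only an illustration: `parseNameStatus`, `partitionPaths`, and the `ChangedFile` fields shown here are hypothetical and not taken from the commit. The sketch also assumes that the old path of a rename is treated as deleted, which the text above does not specify.

```typescript
// Hypothetical shapes; the real `ChangedFile` interface may differ.
type ChangeStatus = "added" | "modified" | "removed" | "renamed";

interface ChangedFile {
  path: string;
  status: ChangeStatus;
  previousPath?: string; // only set for renames
}

interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Parse the tab-separated output of `git diff --name-status <base> <head>`.
// Status letters: A (added), D (deleted), M (modified), R<score> (renamed),
// C<score> (copied). Paths that git quotes or escapes are not handled here.
function parseNameStatus(output: string): ChangedFile[] {
  const files: ChangedFile[] = [];
  for (const line of output.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const [code = "", first = "", second] = line.split("\t");
    if (first === "") continue;
    const kind = code.charAt(0);
    if (kind === "A") {
      files.push({ path: first, status: "added" });
    } else if (kind === "D") {
      files.push({ path: first, status: "removed" });
    } else if (kind === "R" && second !== undefined) {
      files.push({ path: second, status: "renamed", previousPath: first });
    } else if (kind === "C" && second !== undefined) {
      // A copy leaves the source untouched; the destination is a new file.
      files.push({ path: second, status: "added" });
    } else {
      // M, T and anything unrecognised is conservatively re-crawled.
      files.push({ path: first, status: "modified" });
    }
  }
  return files;
}

// Split ancestor paths into changed / deleted / unchanged.
// Returns null when nothing is left to clone (the guard in step 4).
function partitionPaths(
  ancestorFilePaths: string[],
  changes: ChangedFile[],
): PathPartition | null {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const change of changes) {
    if (change.status === "removed") {
      deletedPaths.add(change.path);
    } else {
      changedPaths.add(change.path);
    }
    // Assumption of this sketch: the rename source no longer exists.
    if (change.status === "renamed" && change.previousPath !== undefined) {
      deletedPaths.add(change.previousPath);
    }
  }
  const unchangedPaths = new Set<string>();
  for (const path of ancestorFilePaths) {
    if (!changedPaths.has(path) && !deletedPaths.has(path)) {
      unchangedPaths.add(path);
    }
  }
  if (unchangedPaths.size === 0) return null;
  return { changedPaths, deletedPaths, unchangedPaths };
}
```

Keeping both functions free of I/O means the partitioning logic can be unit-tested with literal strings, independent of whether the change list came from the GitHub Compare API or from a local `git diff`.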
### Stage 0.5 — Clone Unchanged Files (`cloneFromAncestor`)
When `buildDifferentialPlan` returns a non-null plan with `unchangedPaths.size > 0`:
- Fetches ancestor `documents` rows for the unchanged paths using a parameterised
`IN (?, ?, …)` query (no string interpolation of path values → no SQL injection).
- Inserts new `documents` rows for each, with new UUIDs and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- The entire operation runs inside a single `this.db.transaction(…)()` call for atomicity.
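
The ID remapping at the heart of the clone step can be shown without a database. The sketch below works on plain arrays; the row shapes, the `cloneRows` name, and the injected `newId` generator are all assumptions made for illustration. In the pipeline itself, the resulting rows would be inserted inside the single transaction described above.

```typescript
// Hypothetical, simplified row shapes (the real tables have more columns).
interface DocumentRow { id: string; versionId: string; filePath: string; }
interface SnippetRow { id: string; documentId: string; content: string; }
interface EmbeddingRow { snippetId: string; vector: number[]; }

interface RowSet {
  documents: DocumentRow[];
  snippets: SnippetRow[];
  embeddings: EmbeddingRow[];
}

// Build the rows to insert for the target version. Old IDs are mapped to
// freshly generated ones so that cloned snippets point at cloned documents
// and cloned embeddings point at cloned snippets.
function cloneRows(
  source: RowSet,
  ancestorVersionId: string,
  targetVersionId: string,
  unchangedPaths: Set<string>,
  newId: () => string,
): RowSet {
  const documentIdMap = new Map<string, string>();
  const snippetIdMap = new Map<string, string>();

  const documents: DocumentRow[] = [];
  for (const doc of source.documents) {
    if (doc.versionId !== ancestorVersionId) continue;
    if (!unchangedPaths.has(doc.filePath)) continue;
    const id = newId();
    documentIdMap.set(doc.id, id);
    documents.push({ ...doc, id, versionId: targetVersionId });
  }

  const snippets: SnippetRow[] = [];
  for (const snippet of source.snippets) {
    const documentId = documentIdMap.get(snippet.documentId);
    if (documentId === undefined) continue; // belongs to a non-cloned document
    const id = newId();
    snippetIdMap.set(snippet.id, id);
    snippets.push({ ...snippet, id, documentId });
  }

  const embeddings: EmbeddingRow[] = [];
  for (const embedding of source.embeddings) {
    const snippetId = snippetIdMap.get(embedding.snippetId);
    if (snippetId === undefined) continue;
    embeddings.push({ ...embedding, snippetId });
  }

  return { documents, snippets, embeddings };
}
```

Passing `newId` in as a parameter keeps the function deterministic under test; production code would presumably supply a UUID generator.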
### Stage 1 — Partial Crawl
`IndexingPipeline.crawl()` accepts an optional third argument `allowedPaths?: Set<string>`.
When provided (set to `differentialPlan.changedPaths`), the crawl result is filtered so only
matching files are returned. This minimises GitHub API requests and local I/O.
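
The filtering itself is small. A minimal sketch, assuming a hypothetical `CrawledFile` shape and a standalone helper rather than the actual `crawl()` method:

```typescript
// Hypothetical shape of one crawled file.
interface CrawledFile {
  path: string;
  content: string;
}

// Keep only files whose path is in `allowedPaths`.
// When no set is given, the crawl result is returned unchanged (full crawl).
function filterByAllowedPaths(
  files: CrawledFile[],
  allowedPaths?: Set<string>,
): CrawledFile[] {
  if (allowedPaths === undefined) return files;
  return files.filter((file) => allowedPaths.has(file.path));
}
```

Membership is an exact string comparison, so the crawler and the change detector need to agree on path normalisation (leading slashes, separators) for the filter to match anything.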
---
## API Surface Changes
| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — async function |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — interface |
| `findBestAncestorVersion` | `utils/tag-order.ts` | **New** — pure function |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | **New** — async function |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | **New** — sync function (uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | **New** — interface |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | **New** — optional field |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | **Modified** — added `allowedPaths` param |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | **New** — private method |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | **Modified** — Stage 0 added |
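
Because the table lists `findBestAncestorVersion` as a pure function, its behaviour is easy to illustrate in isolation. The following is a sketch under stated assumptions, not the code from `tag-order.ts`: the `IndexedVersion` fields and the `pickAncestor` name are hypothetical, only plain `MAJOR.MINOR.PATCH` tags (with an optional `v` prefix) count as semver, and the timestamp fallback is applied when the target tag itself is not semver.

```typescript
// Hypothetical shape of an already-indexed version.
interface IndexedVersion {
  id: string;
  tag: string;
  createdAt: number; // epoch milliseconds
}

type SemverTuple = [number, number, number];

// Accepts "1.2.3" or "v1.2.3". Pre-release and build metadata are
// intentionally out of scope for this sketch.
function parseSemver(tag: string): SemverTuple | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  if (match === null) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative if a < b, positive if a > b, zero if equal.
function compareSemver(a: SemverTuple, b: SemverTuple): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Return the closest indexed predecessor of `target`, or null if none exists.
function pickAncestor(
  target: { tag: string; createdAt: number },
  indexed: IndexedVersion[],
): IndexedVersion | null {
  const targetSemver = parseSemver(target.tag);
  let best: IndexedVersion | null = null;

  if (targetSemver !== null) {
    let bestSemver: SemverTuple | null = null;
    for (const candidate of indexed) {
      const candidateSemver = parseSemver(candidate.tag);
      if (candidateSemver === null) continue;
      if (compareSemver(candidateSemver, targetSemver) >= 0) continue;
      if (bestSemver === null || compareSemver(candidateSemver, bestSemver) > 0) {
        best = candidate;
        bestSemver = candidateSemver;
      }
    }
    return best;
  }

  // Non-semver target: fall back to the most recently created earlier version.
  for (const candidate of indexed) {
    if (candidate.createdAt >= target.createdAt) continue;
    if (best === null || candidate.createdAt > best.createdAt) best = candidate;
  }
  return best;
}
```

Comparing numeric tuples rather than strings matters here: `v1.10.0` sorts after `v1.2.0` numerically but before it lexicographically.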
---
## Correctness Properties
- **Atomicity**: `cloneFromAncestor` wraps all inserts in one SQLite transaction; a failure
leaves the target version with no partially-cloned data.
- **Idempotency (fallback)**: If the clone or plan step fails for any reason, the pipeline
catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- **No shell injection**: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument
array rather than `execSync` with a template-literal string.
- **No SQL injection**: Path values are never interpolated into SQL strings; only `?`
placeholders are used.
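
To make the SQL-injection property concrete, here is one way a parameterised `IN (?, ?, …)` query could be assembled. The helper name `buildDocumentQueries` and the `file_path` column are assumptions; `version_id` and the `documents` table come from the description above. The chunking is an addition of this sketch, motivated by the fact that SQLite caps the number of bound parameters per statement.

```typescript
interface ParameterisedQuery {
  sql: string;      // contains only `?` placeholders, never path text
  params: string[]; // values bound by the database driver
}

// Build one query per chunk of paths. Path values travel exclusively in
// `params`, so their content can never alter the SQL text.
function buildDocumentQueries(
  versionId: string,
  paths: string[],
  chunkSize: number = 500,
): ParameterisedQuery[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const queries: ParameterisedQuery[] = [];
  for (let start = 0; start < paths.length; start += chunkSize) {
    const chunk = paths.slice(start, start + chunkSize);
    const placeholders = chunk.map(() => "?").join(", ");
    queries.push({
      sql: `SELECT * FROM documents WHERE version_id = ? AND file_path IN (${placeholders})`,
      params: [versionId, ...chunk],
    });
  }
  return queries;
}
```

Only the number of placeholders depends on the input; a path such as `a'; DROP TABLE documents; --` ends up as an ordinary bound string value.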
---
## Fallback Conditions
`buildDifferentialPlan` returns `null` (triggering a full crawl) when:
- No versions for this repository have `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or `git diff` call throws an error.
- Any other unexpected exception is thrown inside `buildDifferentialPlan`.
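
The conditions above all reduce to "either a usable plan or `null`". A minimal sketch of that contract follows, assuming a hypothetical wrapper `planOrFullCrawl` that takes the plan builder and a warning logger as parameters; the real code may structure its error handling differently.

```typescript
// Hypothetical, reduced plan shape; only the fields needed here.
interface PlanLike {
  changedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Resolve to a usable plan, or to null to signal "do a full crawl".
// Errors are logged as warnings and never propagate to the caller.
async function planOrFullCrawl(
  build: () => Promise<PlanLike | null>,
  warn: (message: string) => void,
): Promise<PlanLike | null> {
  try {
    const plan = await build();
    if (plan === null) return null;                     // no usable ancestor
    if (plan.unchangedPaths.size === 0) return null;    // nothing to clone
    return plan;
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    warn(`differential plan failed, falling back to full crawl: ${reason}`);
    return null;
  }
}
```

With this shape the caller needs a single `null` check to choose between the differential path and the full crawl, regardless of which condition triggered the fallback.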