TRUEREF-0021 — Differential Tag Indexing
Priority: P1 Status: Implemented Depends On: TRUEREF-0014, TRUEREF-0017, TRUEREF-0019 Blocks: —
Problem Statement
Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC
UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the
overwhelming majority of files are unchanged — often only dependency manifests (package.json,
*.lock) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API
quota, and embedding credits.
Solution
Differential tag indexing detects when an already-indexed ancestor version exists for a given target tag, determines exactly which files changed, and:
- Clones unchanged document rows, snippet rows, and embedding rows from the ancestor version into the target version in a single SQLite transaction (`cloneFromAncestor`).
- Crawls only the changed (added / modified) files, parses and embeds them normally.
- Skips deleted files (not cloned, not crawled).
- Falls back silently to a full crawl when no indexed ancestor can be found or any step fails.
Algorithm
Stage 0 — Differential Plan (buildDifferentialPlan)
Executed in IndexingPipeline.run() before the crawl, when the job has a versionId:
- Ancestor selection (`findBestAncestorVersion` in `tag-order.ts`): Loads all `indexed` versions for the repository, parses their tags as semver, and returns the closest predecessor to the target tag. Falls back to creation-timestamp ordering for non-semver tags.
- Changed-file detection: For GitHub repositories, calls the GitHub Compare API (`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses `git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts` (implemented with `execFileSync`, not `execSync`, to prevent shell-injection attacks on branch/tag names containing shell metacharacters).
- Path partitioning: The changed-file list is split into `changedPaths` (added + modified + renamed-destination) and `deletedPaths`. `unchangedPaths` is derived as `ancestorFilePaths − changedPaths − deletedPaths`.
- Guard: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed documents, or when all files changed (nothing to clone).
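The ancestor-selection and path-partitioning steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `tag-order.ts` or `differential-strategy.ts`: the `IndexedVersion`, `ChangedFile`, and `PathPartition` shapes, the function names `findClosestPredecessor` and `partitionPaths`, and the `previousPath` field are all hypothetical. It also assumes that a rename's old path should count as deleted, and that pre-release semver suffixes can be ignored.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface IndexedVersion {
  id: string;
  tag: string;
  createdAt: number; // epoch milliseconds
}

interface ChangedFile {
  path: string; // for renames, the destination path
  previousPath?: string; // rename source (assumed field)
  status: "added" | "modified" | "removed" | "renamed";
}

type Semver = [number, number, number];

// Accepts "1.2.3" or "v1.2.3"; pre-release/build suffixes are not handled here.
function parseSemver(tag: string): Semver | null {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

function compareSemver(a: Semver, b: Semver): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Closest predecessor: the highest indexed version strictly below the target.
// Assumption: a non-semver target is ordered by creation timestamp instead.
function findClosestPredecessor(
  targetTag: string,
  targetCreatedAt: number,
  indexed: IndexedVersion[],
): IndexedVersion | null {
  const target = parseSemver(targetTag);
  let best: IndexedVersion | null = null;
  let bestSemver: Semver | null = null;
  for (const candidate of indexed) {
    if (target === null) {
      if (candidate.createdAt >= targetCreatedAt) continue;
      if (best === null || candidate.createdAt > best.createdAt) best = candidate;
      continue;
    }
    const parsed = parseSemver(candidate.tag);
    if (parsed === null || compareSemver(parsed, target) >= 0) continue;
    if (bestSemver === null || compareSemver(parsed, bestSemver) > 0) {
      best = candidate;
      bestSemver = parsed;
    }
  }
  return best;
}

interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

function partitionPaths(ancestorFilePaths: string[], changes: ChangedFile[]): PathPartition {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const file of changes) {
    if (file.status === "removed") {
      deletedPaths.add(file.path);
    } else {
      changedPaths.add(file.path); // added, modified, or rename destination
      // Assumption: the old path of a rename is treated as deleted.
      if (file.previousPath !== undefined) deletedPaths.add(file.previousPath);
    }
  }
  // unchangedPaths = ancestorFilePaths minus changedPaths minus deletedPaths
  const unchangedPaths = new Set(
    ancestorFilePaths.filter((p) => !changedPaths.has(p) && !deletedPaths.has(p)),
  );
  return { changedPaths, deletedPaths, unchangedPaths };
}
```

In this sketch a plan would be worth building only when `findClosestPredecessor` returns a version and `unchangedPaths.size > 0`, which mirrors the guard described above.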
Stage 0.5 — Clone Unchanged Files (cloneFromAncestor)
When buildDifferentialPlan returns a non-null plan with unchangedPaths.size > 0:
- Fetches ancestor `documents` rows for the unchanged paths using a parameterised `IN (?, ?, …)` query (no string interpolation of path values → no SQL injection).
- Inserts new `documents` rows for each, with new UUIDs and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- The entire operation runs inside a single `this.db.transaction(…)()` call for atomicity.
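The placeholder query and the ID re-keying described above can be illustrated in isolation, without a database. The sketch below is not the body of `cloneFromAncestor`. The row shapes, the `file_path`, `document_id`, `snippet_id`, `content`, and `embedding` column names, and the helpers `buildUnchangedDocsQuery` and `cloneRows` are assumptions made for illustration. The `newId` callback stands in for a UUID generator.

```typescript
// Hypothetical row shapes; only version_id is named in the text above.
interface DocumentRow { id: string; version_id: string; file_path: string }
interface SnippetRow { id: string; document_id: string; content: string }
interface EmbeddingRow { snippet_id: string; embedding: number[] }

// One "?" per path; the path values travel separately as bound parameters.
function buildUnchangedDocsQuery(
  ancestorVersionId: string,
  paths: string[],
): { sql: string; params: string[] } {
  const placeholders = paths.map(() => "?").join(", ");
  return {
    sql: `SELECT * FROM documents WHERE version_id = ? AND file_path IN (${placeholders})`,
    params: [ancestorVersionId, ...paths],
  };
}

// Re-keys ancestor rows for the target version, keeping foreign keys consistent.
function cloneRows(
  documents: DocumentRow[],
  snippets: SnippetRow[],
  embeddings: EmbeddingRow[],
  targetVersionId: string,
  newId: () => string,
): { documents: DocumentRow[]; snippets: SnippetRow[]; embeddings: EmbeddingRow[] } {
  const docIds = new Map<string, string>(); // ancestor document id -> cloned id
  const clonedDocuments = documents.map((doc) => {
    const id = newId();
    docIds.set(doc.id, id);
    return { ...doc, id, version_id: targetVersionId };
  });

  const snippetIds = new Map<string, string>(); // ancestor snippet id -> cloned id
  const clonedSnippets: SnippetRow[] = [];
  for (const snippet of snippets) {
    const documentId = docIds.get(snippet.document_id);
    if (documentId === undefined) continue; // parent document was not cloned
    const id = newId();
    snippetIds.set(snippet.id, id);
    clonedSnippets.push({ ...snippet, id, document_id: documentId });
  }

  const clonedEmbeddings: EmbeddingRow[] = [];
  for (const row of embeddings) {
    const snippetId = snippetIds.get(row.snippet_id);
    if (snippetId === undefined) continue; // parent snippet was not cloned
    clonedEmbeddings.push({ ...row, snippet_id: snippetId });
  }
  return { documents: clonedDocuments, snippets: clonedSnippets, embeddings: clonedEmbeddings };
}
```

Only the number of `?` placeholders depends on the input, so a path containing quotes or SQL keywords cannot alter the statement. SQLite caps the number of bound parameters per statement, so an implementation handling very large path sets may need to batch the query.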
Stage 1 — Partial Crawl
IndexingPipeline.crawl() accepts an optional third argument allowedPaths?: Set<string>.
When provided (set to differentialPlan.changedPaths), the crawl result is filtered so only
matching files are returned. This minimises GitHub API requests and local I/O.
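The filtering step can be pictured as a simple set-membership check. The helper below is a hypothetical sketch: `filterByAllowedPaths` and the `path` field on crawled files are assumed names, not the actual signature of `IndexingPipeline.crawl()`.

```typescript
// Keeps only files whose path is in allowedPaths.
// When allowedPaths is undefined, everything is kept (full crawl).
function filterByAllowedPaths<T extends { path: string }>(
  files: T[],
  allowedPaths?: Set<string>,
): T[] {
  if (allowedPaths === undefined) return files;
  return files.filter((file) => allowedPaths.has(file.path));
}
```

Under this sketch an empty set keeps no files at all, which would correspond to a tag where files were only deleted.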
API Surface Changes
| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | New: async function |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | New: interface |
| `findBestAncestorVersion` | `utils/tag-order.ts` | New: pure function |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | New: async function |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | New: sync function (uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | New: interface |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | New: optional field |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | Modified: added `allowedPaths` param |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | New: private method |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | Modified: Stage 0 added |
Correctness Properties
- Atomicity: `cloneFromAncestor` wraps all inserts in one SQLite transaction; a failure leaves the target version with no partially-cloned data.
- Idempotency (fallback): If the clone or plan step fails for any reason, the pipeline catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- No shell injection: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument array rather than `execSync` with a template-literal string.
- No SQL injection: Path values are never interpolated into SQL strings; only `?` placeholders are used.
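To make the shell-injection point concrete, the sketch below builds the argument array that would be handed to `execFileSync` and parses the tab-separated output of `git diff --name-status`. It is an illustration under assumptions, not the contents of `git.ts`: the names `buildDiffArgs`, `parseNameStatus`, and `ParsedChange` are hypothetical, and passing the two refs as separate arguments is one of several equivalent ways to invoke the diff.

```typescript
type ChangeStatus = "added" | "modified" | "removed" | "renamed";

interface ParsedChange {
  status: ChangeStatus;
  path: string;
  previousPath?: string; // set only for renames
}

// Each ref is its own argv entry, so no shell ever parses it.
// Intended for something like:
//   execFileSync("git", args, { cwd: repoPath, encoding: "utf8" })
function buildDiffArgs(baseRef: string, headRef: string): string[] {
  return ["diff", "--name-status", baseRef, headRef];
}

// Parses lines such as "M\tpackage.json" or "R100\told.ts\tnew.ts".
// Paths that git quotes (unusual characters) are not handled in this sketch.
function parseNameStatus(output: string): ParsedChange[] {
  const changes: ParsedChange[] = [];
  for (const line of output.split("\n")) {
    const fields = line.split("\t");
    const code = fields[0] ?? "";
    const first = fields[1];
    const second = fields[2];
    if (code === "" || first === undefined) continue; // blank or malformed line
    const kind = code.charAt(0);
    if (kind === "R" && second !== undefined) {
      changes.push({ status: "renamed", path: second, previousPath: first });
    } else if (kind === "C" && second !== undefined) {
      changes.push({ status: "added", path: second }); // copy: destination is new
    } else if (kind === "A") {
      changes.push({ status: "added", path: first });
    } else if (kind === "D") {
      changes.push({ status: "removed", path: first });
    } else {
      changes.push({ status: "modified", path: first }); // M, T, and anything else
    }
  }
  return changes;
}
```

A tag name such as `v1.1.0; rm -rf /` stays a single, inert argument in this form, whereas interpolating it into a string passed to `execSync` would let the shell interpret the `;`.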
Fallback Conditions
The differential plan returns null (triggering a full crawl) when:
- No versions for this repository have `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or `git diff` call throws an error.
- Any unexpected exception inside `buildDifferentialPlan`.
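The conditions above fall into two groups: guard checks that return `null` deliberately, and exceptions that are caught and converted to `null`. The sketch below shows one way to express both. The names `PlanInputs`, `fallbackReason`, and `planOrNull` are hypothetical. The wrapper is synchronous for brevity, whereas `buildDifferentialPlan` is async, so a real version would `await` inside the same `try` block.

```typescript
interface PlanInputs {
  indexedVersionCount: number; // versions with state = 'indexed'
  ancestorDocumentCount: number; // documents indexed for the chosen ancestor
  unchangedPathCount: number; // unchangedPaths.size
}

// Returns why a full crawl is needed, or null when a differential plan is viable.
function fallbackReason(inputs: PlanInputs): string | null {
  if (inputs.indexedVersionCount === 0) return "no indexed ancestor version";
  if (inputs.ancestorDocumentCount === 0) return "ancestor has no indexed documents";
  if (inputs.unchangedPathCount === 0) return "all files changed, nothing to clone";
  return null;
}

// Converts any exception from the plan builder into null plus a warning,
// so the caller can proceed with a full crawl.
function planOrNull<T>(build: () => T | null, warn: (message: string) => void): T | null {
  try {
    return build();
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    warn(`differential plan failed, using full crawl: ${detail}`);
    return null;
  }
}
```

With this shape the caller has a single branch to handle: a `null` plan means a full crawl, regardless of which condition produced it.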