TRUEREF-0021 — Differential Tag Indexing

Priority: P1
Status: Implemented
Depends On: TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
Blocks:


Problem Statement

Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the overwhelming majority of files are unchanged — often only dependency manifests (package.json, *.lock) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API quota, and embedding credits.


Solution

Differential tag indexing detects when an already-indexed ancestor version exists for a given target tag, determines exactly which files changed, and:

  1. Clones unchanged document rows, snippet rows, and embedding rows from the ancestor version into the target version in a single SQLite transaction (cloneFromAncestor).
  2. Crawls only the changed (added / modified) files, parses and embeds them normally.
  3. Skips deleted files (not cloned, not crawled).
  4. Falls back to a full crawl, logging a warning rather than failing the job, when no indexed ancestor can be found or any step fails.

Algorithm

Stage 0 — Differential Plan (buildDifferentialPlan)

Executed in IndexingPipeline.run() before the crawl, when the job has a versionId:

  1. Ancestor selection (findBestAncestorVersion in tag-order.ts): Loads all indexed versions for the repository, parses their tags as semver, and returns the closest predecessor to the target tag. Falls back to creation-timestamp ordering for non-semver tags.

  2. Changed-file detection: For GitHub repositories, calls the GitHub Compare API (fetchGitHubChangedFiles in github-compare.ts). For local repositories, uses git diff --name-status via getChangedFilesBetweenRefs in git.ts (implemented with execFileSync — not execSync — to prevent shell-injection attacks on branch/tag names containing shell metacharacters).

  3. Path partitioning: The changed-file list is split into changedPaths (added + modified + renamed-destination) and deletedPaths. unchangedPaths is derived as ancestorFilePaths − changedPaths − deletedPaths.

  4. Guard: Returns null when no indexed ancestor exists, when the ancestor has no indexed documents, or when all files changed (nothing to clone).
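
The ancestor selection and path partitioning steps above can be sketched as pure functions. This is a minimal illustration, not the code in tag-order.ts or differential-strategy.ts: the IndexedVersion and ChangedFile shapes, the findBestAncestor and partitionPaths names, the rule that pre-release suffixes are ignored, and the treatment of a rename's source path as deleted are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real interfaces may carry more fields.
interface IndexedVersion { tag: string; createdAt: number } // createdAt: epoch ms
type Semver = [number, number, number];
type ChangeStatus = "added" | "modified" | "removed" | "renamed";
interface ChangedFile { path: string; status: ChangeStatus; previousPath?: string }

// Parses "v1.2.3" or "1.2.3"; pre-release and build suffixes are ignored here.
function parseSemver(tag: string): Semver | null {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

function compareSemver(a: Semver, b: Semver): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Closest indexed predecessor of `target`, or null when there is none.
function findBestAncestor(indexed: IndexedVersion[], target: IndexedVersion): IndexedVersion | null {
  const targetSemver = parseSemver(target.tag);
  let best: IndexedVersion | null = null;
  let bestSemver: Semver | null = null;
  if (targetSemver) {
    for (const v of indexed) {
      const s = parseSemver(v.tag);
      if (!s || compareSemver(s, targetSemver) >= 0) continue; // not a predecessor
      if (!bestSemver || compareSemver(s, bestSemver) > 0) {
        best = v;
        bestSemver = s;
      }
    }
    if (best) return best;
  }
  // Fallback: latest version created before the target.
  for (const v of indexed) {
    if (v.createdAt >= target.createdAt) continue;
    if (!best || v.createdAt > best.createdAt) best = v;
  }
  return best;
}

interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Returns null when nothing can be cloned (the guard in step 4).
function partitionPaths(ancestorFilePaths: Iterable<string>, changes: ChangedFile[]): PathPartition | null {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const c of changes) {
    if (c.status === "removed") deletedPaths.add(c.path);
    else changedPaths.add(c.path); // added, modified, or rename destination
    // Assumption: the old name of a renamed file no longer exists in the target.
    if (c.status === "renamed" && c.previousPath) deletedPaths.add(c.previousPath);
  }
  const unchangedPaths = new Set<string>();
  for (const p of ancestorFilePaths) {
    if (!changedPaths.has(p) && !deletedPaths.has(p)) unchangedPaths.add(p);
  }
  return unchangedPaths.size > 0 ? { changedPaths, deletedPaths, unchangedPaths } : null;
}
```

Comparing version components numerically matters here: a plain string comparison would rank v1.10.0 below v1.2.0 and pick the wrong ancestor.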

Stage 0.5 — Clone Unchanged Files (cloneFromAncestor)

When buildDifferentialPlan returns a non-null plan with unchangedPaths.size > 0:

  • Fetches ancestor documents rows for the unchanged paths using a parameterised IN (?, ?, …) query (no string interpolation of path values → no SQL injection).
  • Inserts new documents rows for each, with new UUIDs and version_id = targetVersionId.
  • Fetches ancestor snippets rows for those document IDs; inserts clones with new IDs.
  • Fetches ancestor snippet_embeddings rows; inserts clones pointing to the new snippet IDs.
  • The entire operation runs inside a single this.db.transaction(…)() call for atomicity.
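
The ID remapping behind these steps can be illustrated in memory. The sketch below is not the cloneFromAncestor implementation: the row shapes, the cloneRows name, and the injected newId generator are hypothetical, and in the pipeline the equivalent reads and inserts would run against SQLite inside the single transaction described above.

```typescript
// Hypothetical row shapes; the real tables have more columns.
interface DocumentRow { id: string; versionId: string; filePath: string }
interface SnippetRow { id: string; documentId: string; content: string }
interface EmbeddingRow { snippetId: string; vector: number[] }
interface VersionRows {
  documents: DocumentRow[];
  snippets: SnippetRow[];
  embeddings: EmbeddingRow[];
}

// Builds the rows to insert for the target version from the ancestor's rows.
function cloneRows(
  ancestor: VersionRows,
  unchangedPaths: Set<string>,
  targetVersionId: string,
  newId: () => string, // e.g. a UUID generator
): VersionRows {
  // Old document ID -> new document ID, for unchanged files only.
  const docIds = new Map<string, string>();
  const documents: DocumentRow[] = [];
  for (const doc of ancestor.documents) {
    if (!unchangedPaths.has(doc.filePath)) continue;
    const id = newId();
    docIds.set(doc.id, id);
    documents.push({ ...doc, id, versionId: targetVersionId });
  }

  // Snippets follow their document; each gets a fresh ID.
  const snippetIds = new Map<string, string>();
  const snippets: SnippetRow[] = [];
  for (const snippet of ancestor.snippets) {
    const documentId = docIds.get(snippet.documentId);
    if (documentId === undefined) continue;
    const id = newId();
    snippetIds.set(snippet.id, id);
    snippets.push({ ...snippet, id, documentId });
  }

  // Embeddings are re-pointed at the new snippet IDs; vectors are reused as-is.
  const embeddings: EmbeddingRow[] = [];
  for (const embedding of ancestor.embeddings) {
    const snippetId = snippetIds.get(embedding.snippetId);
    if (snippetId === undefined) continue;
    embeddings.push({ ...embedding, snippetId });
  }
  return { documents, snippets, embeddings };
}
```

Because the vectors are copied rather than recomputed, cloned files cost no embedding credits.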

Stage 1 — Partial Crawl

IndexingPipeline.crawl() accepts an optional third argument allowedPaths?: Set<string>. When provided (set to differentialPlan.changedPaths), the crawl result is filtered so only matching files are returned. This minimises GitHub API requests and local I/O.
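
A minimal sketch of that filter, assuming a hypothetical CrawledFile shape and a standalone filterByAllowedPaths helper (the actual crawl() signature and result type may differ):

```typescript
// Hypothetical result item; the real crawl result may carry more metadata.
interface CrawledFile { path: string; content: string }

// With no allowedPaths the crawl result passes through untouched (full crawl);
// otherwise only files named in the set are kept.
function filterByAllowedPaths(files: CrawledFile[], allowedPaths?: Set<string>): CrawledFile[] {
  if (!allowedPaths) return files;
  return files.filter((file) => allowedPaths.has(file.path));
}
```

Keeping the parameter optional means the full-crawl fallback needs no separate code path: it simply omits the set.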


API Surface Changes

| Symbol | Location | Change |
| --- | --- | --- |
| buildDifferentialPlan | pipeline/differential-strategy.ts | New: async function |
| DifferentialPlan | pipeline/differential-strategy.ts | New: interface |
| findBestAncestorVersion | utils/tag-order.ts | New: pure function |
| fetchGitHubChangedFiles | crawler/github-compare.ts | New: async function |
| getChangedFilesBetweenRefs | utils/git.ts | New: sync function (uses execFileSync) |
| ChangedFile | crawler/types.ts | New: interface |
| CrawlOptions.allowedPaths | crawler/types.ts | New: optional field |
| IndexingPipeline.crawl() | pipeline/indexing.pipeline.ts | Modified: added allowedPaths param |
| IndexingPipeline.cloneFromAncestor() | pipeline/indexing.pipeline.ts | New: private method |
| IndexingPipeline.run() | pipeline/indexing.pipeline.ts | Modified: Stage 0 added |

Correctness Properties

  • Atomicity: cloneFromAncestor wraps all inserts in one SQLite transaction; a failure leaves the target version with no partially-cloned data.
  • Idempotency (fallback): If the clone or plan step fails for any reason, the pipeline catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
  • No shell injection: getChangedFilesBetweenRefs uses execFileSync with an argument array rather than execSync with a template-literal string.
  • No SQL injection: Path values are never interpolated into SQL strings; only ? placeholders are used.
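
The two injection properties can be illustrated with a short sketch. It is not the code in git.ts or indexing.pipeline.ts: the ExecFile type, the changedFilesBetweenRefs and buildPathQuery names, and the file_path column are assumptions, and the process executor is injected here only so the example runs without a git checkout.

```typescript
// Hypothetical executor type. With Node this could be backed by
//   (file, args) => execFileSync(file, args, { encoding: "utf8" })
// from node:child_process, which spawns the program without a shell.
type ExecFile = (file: string, args: string[]) => string;

interface NameStatusEntry { status: string; path: string }

// Each ref is passed as its own argv element, so shell metacharacters
// in a tag name are never interpreted.
function changedFilesBetweenRefs(exec: ExecFile, base: string, head: string): NameStatusEntry[] {
  const output = exec("git", ["diff", "--name-status", base, head]);
  const entries: NameStatusEntry[] = [];
  for (const line of output.split("\n")) {
    const parts = line.split("\t");
    if (parts.length < 2) continue; // blank or malformed line
    // Rename lines look like "R100<TAB>old<TAB>new"; keep the destination path.
    entries.push({ status: parts[0], path: parts[parts.length - 1] });
  }
  return entries;
}

// Builds one "?" per path; path values travel only in the params array.
function buildPathQuery(versionId: string, paths: string[]): { sql: string; params: string[] } {
  const placeholders = paths.map(() => "?").join(", ");
  return {
    sql: `SELECT * FROM documents WHERE version_id = ? AND file_path IN (${placeholders})`,
    params: [versionId, ...paths],
  };
}
```

SQLite caps the number of bound parameters per statement, so a very large set of unchanged paths may need to be queried in chunks.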

Fallback Conditions

The differential plan returns null (triggering a full crawl) when:

  • No versions for this repository have state = 'indexed'.
  • The best ancestor has no indexed documents.
  • All files changed between ancestor and target (unchangedPaths.size === 0).
  • The GitHub Compare API call or git diff call throws an error.
  • Any unexpected exception inside buildDifferentialPlan.
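
One way to express this fallback is a small wrapper that converts any failure into a null plan. The planOrNull name and the warn callback are hypothetical; the sketch only assumes, as stated above, that the plan builder is async and that a null plan means "run a full crawl".

```typescript
// Runs the plan builder and converts any thrown error or rejection into null,
// so the caller has a single "no plan, do a full crawl" branch.
async function planOrNull<P>(
  buildPlan: () => Promise<P | null>,
  warn: (message: string) => void,
): Promise<P | null> {
  try {
    return await buildPlan();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`Differential plan failed, falling back to full crawl: ${reason}`);
    return null;
  }
}
```

A caller would then crawl with allowedPaths set only when the returned plan is non-null.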