# TRUEREF-0021 — Differential Tag Indexing

**Priority:** P1
**Status:** Implemented
**Depends On:** TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
**Blocks:** —

---

## Problem Statement

Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the overwhelming majority of files are unchanged; often only dependency manifests (`package.json`, `*.lock`) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API quota, and embedding credits.

---

## Solution

Differential tag indexing detects when an already-indexed ancestor version exists for a given target tag, determines exactly which files changed, and:

1. **Clones** unchanged document rows, snippet rows, and embedding rows from the ancestor version into the target version in a single SQLite transaction (`cloneFromAncestor`).
2. **Crawls** only the changed (added / modified) files, then parses and embeds them normally.
3. **Skips** deleted files (not cloned, not crawled).
4. **Falls back** to a full crawl, without failing the job, when no indexed ancestor can be found or any step fails.

---

## Algorithm

### Stage 0 — Differential Plan (`buildDifferentialPlan`)

Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versionId`:

1. **Ancestor selection** (`findBestAncestorVersion` in `tag-order.ts`):
   Loads all `indexed` versions for the repository, parses their tags as semver, and returns the closest predecessor to the target tag. Falls back to creation-timestamp ordering for non-semver tags.
2. **Changed-file detection**:
   For GitHub repositories, calls the GitHub Compare API (`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses `git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts` (implemented with `execFileSync`, not `execSync`, to prevent shell-injection attacks on branch/tag names containing shell metacharacters).
3. **Path partitioning**:
   The changed-file list is split into `changedPaths` (added + modified + renamed-destination) and `deletedPaths`. `unchangedPaths` is derived as `ancestorFilePaths − changedPaths − deletedPaths`.
4. **Guard**:
   Returns `null` when no indexed ancestor exists, when the ancestor has no indexed documents, or when all files changed (nothing to clone).

### Stage 0.5 — Clone Unchanged Files (`cloneFromAncestor`)

When `buildDifferentialPlan` returns a non-null plan with `unchangedPaths.size > 0`:

- Fetches ancestor `documents` rows for the unchanged paths using a parameterised `IN (?, ?, …)` query (no string interpolation of path values → no SQL injection).
- Inserts new `documents` rows for each, with new UUIDs and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- The entire operation runs inside a single `this.db.transaction(…)()` call for atomicity.

### Stage 1 — Partial Crawl

`IndexingPipeline.crawl()` accepts an optional third argument `allowedPaths?: Set<string>`. When provided (set to `differentialPlan.changedPaths`), the crawl result is filtered so only matching files are returned. This minimises GitHub API requests and local I/O.

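
The path partitioning from Stage 0 and the `allowedPaths` filter from Stage 1 can be sketched as follows. This is a minimal illustration, not the project's implementation: the `ChangedFile` shape, its `previousPath` field, and the helper names `partitionPaths` and `filterByAllowedPaths` are assumptions made for the example. It also assumes that the source path of a rename is treated as deleted, which the text above does not state.

```typescript
// Hypothetical shapes for illustration; the project's `ChangedFile` interface may differ.
type ChangeStatus = "added" | "modified" | "removed" | "renamed";

interface ChangedFile {
  path: string; // path in the target tag (the destination for renames)
  status: ChangeStatus;
  previousPath?: string; // assumed field: source path of a rename
}

interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Splits the ancestor's file list into changed, deleted, and unchanged paths.
function partitionPaths(
  ancestorFilePaths: readonly string[],
  changes: readonly ChangedFile[],
): PathPartition {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();

  for (const change of changes) {
    if (change.status === "removed") {
      deletedPaths.add(change.path);
      continue;
    }
    // added + modified + renamed-destination must be crawled again
    changedPaths.add(change.path);
    // Assumption: the rename source no longer exists in the target, so it is not cloned.
    if (change.status === "renamed" && change.previousPath !== undefined) {
      deletedPaths.add(change.previousPath);
    }
  }

  // unchangedPaths = ancestorFilePaths − changedPaths − deletedPaths
  const unchangedPaths = new Set(
    ancestorFilePaths.filter((p) => !changedPaths.has(p) && !deletedPaths.has(p)),
  );
  return { changedPaths, deletedPaths, unchangedPaths };
}

// Keeps only crawled files whose path is allowed; with no filter, keeps everything.
function filterByAllowedPaths<T extends { path: string }>(
  files: readonly T[],
  allowedPaths?: Set<string>,
): T[] {
  return allowedPaths === undefined
    ? files.slice()
    : files.filter((f) => allowedPaths.has(f.path));
}
```

In this sketch, a plan whose `unchangedPaths` is empty corresponds to the guard case where there is nothing to clone, and `changedPaths` is what would be passed as `allowedPaths` to the crawl.
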

---

## API Surface Changes

| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — async function |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — interface |
| `findBestAncestorVersion` | `utils/tag-order.ts` | **New** — pure function |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | **New** — async function |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | **New** — sync function (uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | **New** — interface |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | **New** — optional field |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | **Modified** — added `allowedPaths` param |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | **New** — private method |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | **Modified** — Stage 0 added |

---

## Correctness Properties

- **Atomicity**: `cloneFromAncestor` wraps all inserts in one SQLite transaction; a failure leaves the target version with no partially-cloned data.
- **Idempotency (fallback)**: If the clone or plan step fails for any reason, the pipeline catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- **No shell injection**: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument array rather than `execSync` with a template-literal string.
- **No SQL injection**: Path values are never interpolated into SQL strings; only `?` placeholders are used.

---

## Fallback Conditions

`buildDifferentialPlan` returns `null` (triggering a full crawl) when:

- No versions for this repository have `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or `git diff` call throws an error.
- Any unexpected exception is thrown inside `buildDifferentialPlan`.
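
The first fallback condition depends on ancestor selection finding nothing usable. As a rough sketch of how a closest-predecessor lookup with a `null` result could behave, consider the code below. It is not the actual `findBestAncestorVersion`: the `IndexedVersion` shape, the helper names, and the simplified semver parsing (pre-release and build suffixes are ignored) are all assumptions made for illustration.

```typescript
// Hypothetical record shape; the project's version rows may carry different fields.
interface IndexedVersion {
  id: string;
  tag: string;
  createdAt: number; // creation timestamp, epoch milliseconds (assumed)
}

type SemverTuple = [number, number, number];

// Parses "v1.2.3" or "1.2.3". Simplification: pre-release/build suffixes are ignored.
function parseSemver(tag: string): SemverTuple | null {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

// Negative when a < b, positive when a > b, zero when equal.
function compareSemver(a: SemverTuple, b: SemverTuple): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Returns the closest indexed predecessor of the target, or null when none exists.
function pickClosestPredecessor(
  target: { tag: string; createdAt: number },
  indexed: readonly IndexedVersion[],
): IndexedVersion | null {
  const targetSemver = parseSemver(target.tag);

  if (targetSemver === null) {
    // Non-semver target: fall back to creation-timestamp ordering.
    let latest: IndexedVersion | null = null;
    for (const candidate of indexed) {
      if (candidate.createdAt >= target.createdAt) continue;
      if (latest === null || candidate.createdAt > latest.createdAt) latest = candidate;
    }
    return latest;
  }

  // Semver target: keep the highest version that is strictly below the target.
  // Assumption: non-semver candidates are skipped in this branch.
  let best: IndexedVersion | null = null;
  let bestSemver: SemverTuple | null = null;
  for (const candidate of indexed) {
    const semver = parseSemver(candidate.tag);
    if (semver === null || compareSemver(semver, targetSemver) >= 0) continue;
    if (bestSemver === null || compareSemver(semver, bestSemver) > 0) {
      best = candidate;
      bestSemver = semver;
    }
  }
  return best;
}
```

A `null` result here corresponds to the case where no indexed ancestor exists, so the caller would proceed with a full crawl.
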