# TRUEREF-0021 — Differential Tag Indexing

**Priority:** P1
**Status:** Implemented
**Depends On:** TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
**Blocks:** None

---

## Problem Statement

Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC
UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the
overwhelming majority of files are unchanged; often only dependency manifests (`package.json`,
`*.lock`) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API
quota, and embedding credits.

---

## Solution

Differential tag indexing detects when an already-indexed ancestor version exists for a given
target tag, determines exactly which files changed, and:

1. **Clones** document rows, snippet rows, and embedding rows for unchanged files from the
   ancestor version into the target version in a single SQLite transaction (`cloneFromAncestor`).
2. **Crawls** only the changed (added or modified) files, then parses and embeds them normally.
3. **Skips** deleted files (neither cloned nor crawled).
4. **Falls back** to a full crawl, without failing the job, when no indexed ancestor can be found
   or any step fails.

---

## Algorithm
### Stage 0 — Differential Plan (`buildDifferentialPlan`)

Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versionId`:

1. **Ancestor selection** (`findBestAncestorVersion` in `tag-order.ts`): Loads all `indexed`
   versions for the repository, parses their tags as semver, and returns the closest predecessor
   to the target tag. Falls back to creation-timestamp ordering for non-semver tags.

2. **Changed-file detection**: For GitHub repositories, calls the GitHub Compare API
   (`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses
   `git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts`. That function is
   implemented with `execFileSync` rather than `execSync`, so branch or tag names containing
   shell metacharacters cannot be used for shell injection.

3. **Path partitioning**: The changed-file list is split into `changedPaths` (added, modified,
   and renamed-destination paths) and `deletedPaths`. `unchangedPaths` is derived as
   `ancestorFilePaths − changedPaths − deletedPaths`.

4. **Guard**: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed
   documents, or when all files changed (nothing to clone).

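
Steps 2 to 4 can be sketched as pure functions over plain strings and sets. This is a minimal illustration, not the project's implementation: `parseNameStatus`, `partitionPaths`, and the `ChangedFile` shape used here are assumptions. The sketch also treats the source path of a rename as deleted so that it is never cloned, which the description above does not specify.

```typescript
// Hypothetical shapes for illustration; the real `ChangedFile` and
// `DifferentialPlan` interfaces may differ.
type ChangeStatus = "added" | "modified" | "removed" | "renamed";

interface ChangedFile {
  path: string; // path in the target ref
  status: ChangeStatus;
  previousPath?: string; // only set for renames
}

interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Parse the tab-separated output of `git diff --name-status <base> <head>`.
function parseNameStatus(output: string): ChangedFile[] {
  const result: ChangedFile[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const parts = line.split("\t");
    const code = parts[0] ?? "";
    const first = parts[1] ?? "";
    const second = parts[2];
    if (code.startsWith("R") && second !== undefined) {
      // Renames are reported as "R<score>\t<old path>\t<new path>".
      result.push({ path: second, status: "renamed", previousPath: first });
    } else if (code.startsWith("C") && second !== undefined) {
      // Copies keep the old file; only the destination is new.
      result.push({ path: second, status: "added" });
    } else if (code === "D") {
      result.push({ path: first, status: "removed" });
    } else if (code === "A") {
      result.push({ path: first, status: "added" });
    } else {
      // M, T, and anything unrecognised are conservatively treated as modified.
      result.push({ path: first, status: "modified" });
    }
  }
  return result;
}

// Split paths into changed / deleted / unchanged; null means "nothing to clone".
function partitionPaths(
  ancestorFilePaths: readonly string[],
  changes: readonly ChangedFile[],
): PathPartition | null {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const file of changes) {
    if (file.status === "removed") {
      deletedPaths.add(file.path);
    } else {
      changedPaths.add(file.path);
      // Assumption: the old name of a renamed file no longer exists in the target.
      if (file.previousPath !== undefined) deletedPaths.add(file.previousPath);
    }
  }
  const unchangedPaths = new Set<string>();
  for (const path of ancestorFilePaths) {
    if (!changedPaths.has(path) && !deletedPaths.has(path)) unchangedPaths.add(path);
  }
  return unchangedPaths.size === 0 ? null : { changedPaths, deletedPaths, unchangedPaths };
}

// Usage: five ancestor files, of which two survive unchanged.
const diffOutput =
  "M\tpackage.json\nD\tdocs/old.md\nR100\tsrc/a.ts\tsrc/b.ts\nA\tsrc/new.ts\n";
const plan = partitionPaths(
  ["package.json", "docs/old.md", "src/a.ts", "README.md", "src/index.ts"],
  parseNameStatus(diffOutput),
);
```

Here only `README.md` and `src/index.ts` end up in `unchangedPaths`; everything else is either re-crawled or dropped.
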
### Stage 0.5 — Clone Unchanged Files (`cloneFromAncestor`)

When `buildDifferentialPlan` returns a non-null plan with `unchangedPaths.size > 0`:

- Fetches ancestor `documents` rows for the unchanged paths using a parameterised
  `IN (?, ?, …)` query. Path values are bound as parameters and never interpolated into the
  SQL string, so they cannot cause SQL injection.
- Inserts new `documents` rows for each, with new UUIDs and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- The entire operation runs inside a single `this.db.transaction(…)()` call for atomicity.

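
The ID remapping behind these steps can be illustrated without a database. The sketch below clones in-memory rows and keeps two maps from old to new IDs so that cloned snippets and embeddings point at their cloned parents. The row shapes, every column name other than `version_id`, the `cloneUnchanged` name, and the `newId` callback are assumptions for illustration; the real method performs the equivalent inserts inside one SQLite transaction.

```typescript
// Hypothetical row shapes; only `version_id` is named in the description above.
interface DocumentRow { id: string; version_id: string; file_path: string; }
interface SnippetRow { id: string; document_id: string; content: string; }
interface EmbeddingRow { snippet_id: string; vector: number[]; }

interface Store {
  documents: DocumentRow[];
  snippets: SnippetRow[];
  embeddings: EmbeddingRow[];
}

function cloneUnchanged(
  store: Store,
  ancestorVersionId: string,
  targetVersionId: string,
  unchangedPaths: Set<string>,
  newId: () => string, // e.g. a UUID generator in real code
): number {
  const docIdMap = new Map<string, string>(); // ancestor document ID -> cloned ID
  const snippetIdMap = new Map<string, string>(); // ancestor snippet ID -> cloned ID

  const newDocs: DocumentRow[] = [];
  for (const doc of store.documents) {
    if (doc.version_id !== ancestorVersionId || !unchangedPaths.has(doc.file_path)) continue;
    const id = newId();
    docIdMap.set(doc.id, id);
    newDocs.push({ ...doc, id, version_id: targetVersionId });
  }

  const newSnippets: SnippetRow[] = [];
  for (const snippet of store.snippets) {
    const newDocId = docIdMap.get(snippet.document_id);
    if (newDocId === undefined) continue; // parent document was not cloned
    const id = newId();
    snippetIdMap.set(snippet.id, id);
    newSnippets.push({ ...snippet, id, document_id: newDocId });
  }

  const newEmbeddings: EmbeddingRow[] = [];
  for (const embedding of store.embeddings) {
    const newSnippetId = snippetIdMap.get(embedding.snippet_id);
    if (newSnippetId === undefined) continue;
    newEmbeddings.push({ snippet_id: newSnippetId, vector: embedding.vector.slice() });
  }

  // Apply all rows at the end, mimicking the all-or-nothing transaction.
  store.documents.push(...newDocs);
  store.snippets.push(...newSnippets);
  store.embeddings.push(...newEmbeddings);
  return newDocs.length;
}

// Usage with a deterministic ID generator so the result is easy to inspect.
let counter = 0;
const store: Store = {
  documents: [
    { id: "d1", version_id: "v1", file_path: "README.md" },
    { id: "d2", version_id: "v1", file_path: "package.json" },
  ],
  snippets: [
    { id: "s1", document_id: "d1", content: "intro" },
    { id: "s2", document_id: "d2", content: "deps" },
  ],
  embeddings: [{ snippet_id: "s1", vector: [0.1, 0.2] }],
};
const cloned = cloneUnchanged(store, "v1", "v2", new Set(["README.md"]), () => `new-${++counter}`);
```

Because embeddings are copied rather than recomputed, no embedding credits are spent on the cloned rows.
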
### Stage 1 — Partial Crawl

`IndexingPipeline.crawl()` accepts an optional third argument `allowedPaths?: Set<string>`.
When provided (set to `differentialPlan.changedPaths`), the crawl result is filtered so that only
matching files are returned. This minimises GitHub API requests and local I/O.

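
A minimal sketch of such a filter, assuming the crawl produces a list of entries with a `path` field. The `CrawledFile` type and the `filterByAllowedPaths` helper are hypothetical names, not taken from the codebase.

```typescript
// Hypothetical crawl entry; the real crawler types may carry more fields.
interface CrawledFile {
  path: string;
  content: string;
}

function filterByAllowedPaths(
  crawledFiles: CrawledFile[],
  allowedPaths?: Set<string>,
): CrawledFile[] {
  // No allow-list means a full crawl: keep everything.
  if (allowedPaths === undefined) return crawledFiles;
  return crawledFiles.filter((file) => allowedPaths.has(file.path));
}

// Usage: only the changed manifest survives the filter.
const crawled: CrawledFile[] = [
  { path: "package.json", content: "{}" },
  { path: "README.md", content: "# Docs" },
];
const kept = filterByAllowedPaths(crawled, new Set(["package.json"]));
```

Note the distinction between `undefined` (no restriction) and an empty set (keep nothing) in this sketch.
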

---

## API Surface Changes

| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | **New** (async function) |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | **New** (interface) |
| `findBestAncestorVersion` | `utils/tag-order.ts` | **New** (pure function) |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | **New** (async function) |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | **New** (sync function, uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | **New** (interface) |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | **New** (optional field) |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | **Modified** (added `allowedPaths` parameter) |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | **New** (private method) |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | **Modified** (Stage 0 added) |

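
The table lists `findBestAncestorVersion` as a pure function. To illustrate what ancestor selection over already-loaded version records could look like, here is a simplified sketch. The `VersionRecord` shape, the `parseSemver` helper (which ignores pre-release and build suffixes), and the rule that a semver target only considers semver candidates are all assumptions; real semver precedence also orders pre-release identifiers, which this sketch omits.

```typescript
// Hypothetical record shape for illustration.
interface VersionRecord {
  id: string;
  tag: string;
  createdAt: number; // creation timestamp, e.g. epoch milliseconds
}

type Semver = [number, number, number];

// Accepts "1.2.3" or "v1.2.3"; anything after the patch number is ignored.
function parseSemver(tag: string): Semver | null {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(tag);
  if (match === null) return null;
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

function compareSemver(a: Semver, b: Semver): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

function findBestAncestor(
  target: VersionRecord,
  indexed: readonly VersionRecord[],
): VersionRecord | null {
  const targetSemver = parseSemver(target.tag);
  let best: VersionRecord | null = null;

  if (targetSemver !== null) {
    // Semver mode: the highest indexed version strictly below the target.
    let bestSemver: Semver | null = null;
    for (const candidate of indexed) {
      const semver = parseSemver(candidate.tag);
      if (semver === null || compareSemver(semver, targetSemver) >= 0) continue;
      if (bestSemver === null || compareSemver(semver, bestSemver) > 0) {
        best = candidate;
        bestSemver = semver;
      }
    }
    return best;
  }

  // Fallback mode: the most recently created version older than the target.
  for (const candidate of indexed) {
    if (candidate.createdAt >= target.createdAt) continue;
    if (best === null || candidate.createdAt > best.createdAt) best = candidate;
  }
  return best;
}

// Usage: numeric comparison picks v1.10.0, not v1.2.0, as predecessor of v1.11.0.
const indexed: VersionRecord[] = [
  { id: "a", tag: "v1.2.0", createdAt: 10 },
  { id: "b", tag: "v1.10.0", createdAt: 20 },
  { id: "c", tag: "v2.0.0", createdAt: 30 },
  { id: "d", tag: "nightly", createdAt: 40 },
];
const ancestor = findBestAncestor({ id: "t", tag: "v1.11.0", createdAt: 50 }, indexed);
```

Keeping the function free of database access means the ordering rules can be unit-tested with plain arrays like the one above.
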

---

## Correctness Properties

- **Atomicity**: `cloneFromAncestor` wraps all inserts in one SQLite transaction; if any insert
  fails, the target version is left with no partially cloned data.
- **Safe fallback**: If the clone or plan step fails for any reason, the pipeline
  catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- **No shell injection**: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument
  array rather than `execSync` with a template-literal string.
- **No SQL injection**: Path values are never interpolated into SQL strings; only `?`
  placeholders are used.

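
The two injection-related properties can be illustrated together. In this sketch, `buildDiffArgs`, `runGitDiff`, `buildInQuery`, and the `file_path` column are hypothetical names; only `execFileSync`, `git diff --name-status`, the `documents` table, and `version_id` come from the description above. Rejecting refs that start with `-` is an extra precaution assumed here, since an argument array prevents shell interpretation but not option parsing by `git` itself.

```typescript
import { execFileSync } from "node:child_process";

// Build the argv for git; each ref stays one argument regardless of its content.
function buildDiffArgs(baseRef: string, headRef: string): string[] {
  for (const ref of [baseRef, headRef]) {
    // Assumed extra guard: a leading dash would be parsed by git as an option.
    if (ref.startsWith("-")) throw new Error(`Refusing suspicious ref: ${ref}`);
  }
  return ["diff", "--name-status", baseRef, headRef];
}

function runGitDiff(repoPath: string, baseRef: string, headRef: string): string {
  // execFileSync spawns git directly, so no shell ever interprets the refs.
  return execFileSync("git", buildDiffArgs(baseRef, headRef), {
    cwd: repoPath,
    encoding: "utf8",
  });
}

// Build a parameterised query: one `?` per path, values passed separately.
function buildInQuery(
  versionId: string,
  paths: readonly string[],
): { sql: string; params: string[] } {
  const placeholders = paths.map(() => "?").join(", ");
  return {
    sql: `SELECT * FROM documents WHERE version_id = ? AND file_path IN (${placeholders})`,
    params: [versionId, ...paths],
  };
}

// Usage: hostile input stays inert data in both cases.
const diffArgs = buildDiffArgs("v1.0.0", "v1.0.1; rm -rf /");
const query = buildInQuery("ver-1", ["a.ts", "b'); DROP TABLE documents;--"]);
```

Very large path lists may need to be split into chunks, because SQLite caps the number of bound parameters per statement.
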

---

## Fallback Conditions

The differential plan returns `null` (triggering a full crawl) when:

- No versions for this repository have `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or the `git diff` call throws an error.
- Any other unexpected exception occurs inside `buildDifferentialPlan`.

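
The way these conditions collapse into a single decision can be sketched as follows. `planOrFallback`, the `Plan` shape, and the injected `buildPlan` callback are hypothetical stand-ins for `buildDifferentialPlan` and the surrounding pipeline code; the point is only that both a `null` plan and a thrown error lead to a full crawl.

```typescript
// Hypothetical, reduced plan shape for illustration.
interface Plan {
  changedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

type CrawlDecision =
  | { mode: "differential"; plan: Plan }
  | { mode: "full"; reason: string };

async function planOrFallback(
  buildPlan: () => Promise<Plan | null>,
  warn: (message: string) => void = console.warn,
): Promise<CrawlDecision> {
  try {
    const plan = await buildPlan();
    if (plan === null || plan.unchangedPaths.size === 0) {
      return { mode: "full", reason: "no usable differential plan" };
    }
    return { mode: "differential", plan };
  } catch (error) {
    // Any failure while planning is downgraded to a warning plus a full crawl.
    const message = error instanceof Error ? error.message : String(error);
    warn(`Differential planning failed, falling back to full crawl: ${message}`);
    return { mode: "full", reason: message };
  }
}

// Usage: a valid plan, a null plan, and a failing planner.
const warnings: string[] = [];
function demo(): Promise<CrawlDecision[]> {
  return Promise.all([
    planOrFallback(async () => ({
      changedPaths: new Set(["package.json"]),
      unchangedPaths: new Set(["README.md"]),
    })),
    planOrFallback(async () => null),
    planOrFallback(
      async () => {
        throw new Error("compare API unavailable");
      },
      (message) => warnings.push(message),
    ),
  ]);
}
```

Returning a tagged decision instead of throwing keeps the caller's control flow linear: it either crawls `plan.changedPaths` or crawls everything.
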