# TRUEREF-0021 — Differential Tag Indexing
**Priority:** P1
**Status:** Implemented
**Depends On:** TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
**Blocks:**
---
## Problem Statement
Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC
UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the
overwhelming majority of files are unchanged — often only dependency manifests (`package.json`,
`*.lock`) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API
quota, and embedding credits.
---
## Solution
Differential tag indexing detects when an already-indexed ancestor version exists for a given
target tag, determines exactly which files changed, and:
1. **Clones** unchanged document rows, snippet rows, and embedding rows from the ancestor version
into the target version in a single SQLite transaction (`cloneFromAncestor`).
2. **Crawls** only the changed (added / modified) files, parses and embeds them normally.
3. **Skips** deleted files (not cloned, not crawled).
4. **Falls back** to a full crawl, without failing the job, when no indexed ancestor can be found or any step fails.
---
## Algorithm
### Stage 0 — Differential Plan (`buildDifferentialPlan`)
Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versionId`:
1. **Ancestor selection** (`findBestAncestorVersion` in `tag-order.ts`): Loads all `indexed`
versions for the repository, parses their tags as semver, and returns the closest predecessor
to the target tag. Falls back to creation-timestamp ordering for non-semver tags.
2. **Changed-file detection**: For GitHub repositories, calls the GitHub Compare API
(`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses
`git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts` (implemented with
`execFileSync` — not `execSync` — to prevent shell-injection attacks on branch/tag names
containing shell metacharacters).
3. **Path partitioning**: The changed-file list is split into `changedPaths` (added + modified
+ renamed-destination) and `deletedPaths`. `unchangedPaths` is derived as the set difference
`ancestorFilePaths - changedPaths - deletedPaths`.
4. **Guard**: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed
documents, or when all files changed (nothing to clone).
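
The ancestor selection, path partitioning, and guard steps above can be sketched as follows. This is a minimal illustration, not the actual `tag-order.ts` or `differential-strategy.ts` code: the `IndexedVersion` and `ChangedFile` shapes, the function names `closestPredecessor` and `partitionPaths`, the simplified `major.minor.patch` parsing (no pre-release handling), and the choice to treat a rename's source path as deleted are all assumptions.

```typescript
// Hypothetical record shapes; the real types in the codebase may differ.
interface IndexedVersion { id: string; tag: string; createdAt: number }
type ChangeStatus = 'added' | 'modified' | 'removed' | 'renamed';
interface ChangedFile { path: string; status: ChangeStatus; previousPath?: string }
interface PathPartition {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}
type Semver = [number, number, number];

// Parse "v1.2.3" or "1.2.3"; anything else is treated as a non-semver tag.
function parseSemver(tag: string): Semver | null {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

function compareSemver(a: Semver, b: Semver): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Pick the indexed version closest below the target. Semver order is used when
// the target tag parses; otherwise creation timestamps decide (assumption).
function closestPredecessor(
  target: { tag: string; createdAt: number },
  indexed: IndexedVersion[]
): IndexedVersion | null {
  const targetSemver = parseSemver(target.tag);
  if (targetSemver) {
    const older = indexed
      .map((v) => ({ v, s: parseSemver(v.tag) }))
      .filter((x): x is { v: IndexedVersion; s: Semver } =>
        x.s !== null && compareSemver(x.s, targetSemver) < 0)
      .sort((a, b) => compareSemver(b.s, a.s));
    return older[0]?.v ?? null;
  }
  const older = indexed
    .filter((v) => v.createdAt < target.createdAt)
    .sort((a, b) => b.createdAt - a.createdAt);
  return older[0] ?? null;
}

// Split the diff into changed / deleted sets and derive the unchanged set as
// ancestorFilePaths minus both. Returns null when nothing is left to clone.
function partitionPaths(
  ancestorFilePaths: string[],
  changes: ChangedFile[]
): PathPartition | null {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const change of changes) {
    if (change.status === 'removed') {
      deletedPaths.add(change.path);
    } else {
      changedPaths.add(change.path);
      // Assumption: the old name of a renamed file no longer exists in the target.
      if (change.status === 'renamed' && change.previousPath) {
        deletedPaths.add(change.previousPath);
      }
    }
  }
  const unchangedPaths = new Set<string>(
    ancestorFilePaths.filter((p) => !changedPaths.has(p) && !deletedPaths.has(p))
  );
  return unchangedPaths.size === 0 ? null : { changedPaths, deletedPaths, unchangedPaths };
}
```

In this sketch a `null` result from either function corresponds to the guard: the caller would skip cloning and run a full crawl.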
### Stage 0.5 — Clone Unchanged Files (`cloneFromAncestor`)
When `buildDifferentialPlan` returns a non-null plan with `unchangedPaths.size > 0`:
- Fetches ancestor `documents` rows for the unchanged paths using a parameterised
`IN (?, ?, …)` query (no string interpolation of path values → no SQL injection).
- Inserts new `documents` rows for each, with new UUIDs and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- The entire operation runs inside a single `this.db.transaction(…)()` call for atomicity.
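
The ID remapping behind these steps can be sketched in memory as follows. This is an illustration only: the row shapes, the field names, and the helpers `placeholders` and `cloneUnchangedRows` are hypothetical, and the sketch returns the cloned rows instead of inserting them. In the real method the reads and inserts go through SQLite inside the transaction described above.

```typescript
// Hypothetical row shapes; actual column names in the schema may differ.
interface DocumentRow { id: string; versionId: string; filePath: string }
interface SnippetRow { id: string; documentId: string; content: string }
interface EmbeddingRow { snippetId: string; vector: number[] }
interface RowSet { documents: DocumentRow[]; snippets: SnippetRow[]; embeddings: EmbeddingRow[] }

// Builds "?, ?, ?" for a parameterised IN (...) clause. Values are bound
// separately, so path strings never become part of the SQL text.
function placeholders(count: number): string {
  return Array.from({ length: count }, () => '?').join(', ');
}

// Clone documents, snippets, and embeddings for the unchanged paths, giving
// every document and snippet a fresh ID and rewriting the foreign keys.
function cloneUnchangedRows(
  ancestor: RowSet,
  unchangedPaths: Set<string>,
  targetVersionId: string,
  newId: () => string
): RowSet {
  const docIdMap = new Map<string, string>(); // ancestor document ID -> cloned ID
  const documents = ancestor.documents
    .filter((d) => unchangedPaths.has(d.filePath))
    .map((d) => {
      const id = newId();
      docIdMap.set(d.id, id);
      return { ...d, id, versionId: targetVersionId };
    });

  const snippetIdMap = new Map<string, string>(); // ancestor snippet ID -> cloned ID
  const snippets = ancestor.snippets
    .filter((s) => docIdMap.has(s.documentId))
    .map((s) => {
      const id = newId();
      snippetIdMap.set(s.id, id);
      return { ...s, id, documentId: docIdMap.get(s.documentId) as string };
    });

  const embeddings = ancestor.embeddings
    .filter((e) => snippetIdMap.has(e.snippetId))
    .map((e) => ({ ...e, snippetId: snippetIdMap.get(e.snippetId) as string }));

  return { documents, snippets, embeddings };
}
```

A query built with `placeholders(paths.length)` binds one parameter per path. SQLite caps the number of bound parameters per statement, so a very large unchanged set may need to be fetched in chunks.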
### Stage 1 — Partial Crawl
`IndexingPipeline.crawl()` accepts an optional third argument `allowedPaths?: Set<string>`.
When provided (set to `differentialPlan.changedPaths`), the crawl result is filtered so only
matching files are returned. This minimises GitHub API requests and local I/O.
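
A minimal sketch of that filter, assuming a hypothetical `CrawledFile` shape and helper name `filterByAllowedPaths` (the real crawl result type is not shown in this document):

```typescript
// Hypothetical crawl-result item.
interface CrawledFile { path: string; content: string }

// With no allow-list the full result passes through (full crawl); with one,
// only files whose path is in the set are kept (partial crawl).
function filterByAllowedPaths(
  files: CrawledFile[],
  allowedPaths?: Set<string>
): CrawledFile[] {
  if (!allowedPaths) return files;
  return files.filter((file) => allowedPaths.has(file.path));
}
```

Note the difference between `undefined` and an empty set in this sketch: the former keeps every file, the latter keeps none.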
---
## API Surface Changes
| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — async function |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — interface |
| `findBestAncestorVersion` | `utils/tag-order.ts` | **New** — pure function |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | **New** — async function |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | **New** — sync function (uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | **New** — interface |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | **New** — optional field |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | **Modified** — added `allowedPaths` param |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | **New** — private method |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | **Modified** — Stage 0 added |
---
## Correctness Properties
- **Atomicity**: `cloneFromAncestor` wraps all inserts in one SQLite transaction; a failure
leaves the target version with no partially-cloned data.
- **Graceful fallback**: If the clone or plan step fails for any reason, the pipeline
catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- **No shell injection**: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument
array rather than `execSync` with a template-literal string.
- **No SQL injection**: Path values are never interpolated into SQL strings; only `?`
placeholders are used.
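
To illustrate the shell-injection property, the sketch below builds the `git diff --name-status` argument array and parses the tab-separated output. It is not the actual `git.ts` code: the names `buildDiffArgs`, `parseNameStatus`, `getChangedFiles`, and `DiffEntry` are hypothetical, and the process runner is passed in as a parameter so the example stays self-contained. In Node it could be `(file, args) => execFileSync(file, args, { encoding: 'utf8' })` from `node:child_process`. The rejection of refs that start with `-` is an extra precaution added here, not something the source describes.

```typescript
// Hypothetical runner type; see the lead-in for an execFileSync-based runner.
type Exec = (file: string, args: string[]) => string;

interface DiffEntry {
  status: 'added' | 'modified' | 'removed' | 'renamed';
  path: string;
  previousPath?: string;
}

// Each ref is its own array element, so no shell ever parses it. A ref that
// starts with "-" could still be read by git as an option, so reject it.
function buildDiffArgs(baseRef: string, headRef: string): string[] {
  for (const ref of [baseRef, headRef]) {
    if (ref.length === 0 || ref.charAt(0) === '-') {
      throw new Error(`Refusing suspicious ref: ${ref}`);
    }
  }
  return ['diff', '--name-status', baseRef, headRef];
}

// Parse lines such as "M\tsrc/a.ts" or "R100\told.ts\tnew.ts".
// Quoted paths (unusual characters) are not handled in this sketch.
function parseNameStatus(output: string): DiffEntry[] {
  const entries: DiffEntry[] = [];
  for (const line of output.split('\n')) {
    if (line.trim() === '') continue;
    const [code, first, second] = line.split('\t');
    if (!code || !first) continue;
    const kind = code.charAt(0);
    if (kind === 'D') {
      entries.push({ status: 'removed', path: first });
    } else if (kind === 'R' && second) {
      entries.push({ status: 'renamed', path: second, previousPath: first });
    } else if (kind === 'C' && second) {
      entries.push({ status: 'added', path: second }); // copy destination is a new file
    } else if (kind === 'A') {
      entries.push({ status: 'added', path: first });
    } else {
      entries.push({ status: 'modified', path: first });
    }
  }
  return entries;
}

function getChangedFiles(exec: Exec, baseRef: string, headRef: string): DiffEntry[] {
  return parseNameStatus(exec('git', buildDiffArgs(baseRef, headRef)));
}
```

Because the refs travel as discrete arguments, a tag name such as `v1.0.0; rm -rf /` reaches git as a single (invalid) ref string and is never executed as a command.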
---
## Fallback Conditions
No differential plan is produced (triggering a full crawl) when:
- No versions for this repository have `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or `git diff` call throws an error.
- An unexpected exception is thrown inside `buildDifferentialPlan`.
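
The conditions above all collapse into one outcome: no usable plan, so the pipeline runs a full crawl. A minimal sketch of that control flow, assuming a hypothetical wrapper `planOrFullCrawl`, a simplified `Plan` shape, and an injected `warn` logger (none of which are taken from the actual pipeline code):

```typescript
// Simplified plan shape for illustration only.
interface Plan { changedPaths: Set<string>; unchangedPaths: Set<string> }

// Returns a plan only when one was built and there is something to clone.
// Any error is logged and converted to null, which the caller treats as
// "run a full crawl".
async function planOrFullCrawl(
  build: () => Promise<Plan | null>,
  warn: (message: string) => void
): Promise<Plan | null> {
  try {
    const plan = await build();
    if (plan === null || plan.unchangedPaths.size === 0) return null;
    return plan;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    warn(`Differential plan failed, falling back to full crawl: ${reason}`);
    return null;
  }
}
```

A caller would pass `plan?.changedPaths` as the crawl's `allowedPaths` argument, so a `null` plan naturally leaves the allow-list undefined.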