feat(TRUEREF-0021): implement differential tag indexing
Committed by Giancarmine Salucci
Parent: e63279fcf6
Commit: f4fe8c6043
113
docs/features/TRUEREF-0021.md
Normal file
@@ -0,0 +1,113 @@
# TRUEREF-0021 — Differential Tag Indexing

**Priority:** P1
**Status:** Implemented
**Depends On:** TRUEREF-0014, TRUEREF-0017, TRUEREF-0019
**Blocks:** —

---

## Problem Statement

Repositories with many version tags (e.g. hundreds or thousands, as seen in projects like RWC
UXFramework) make full re-indexing prohibitively expensive. Between consecutive semver tags the
overwhelming majority of files are unchanged; often only dependency manifests (`package.json`,
`*.lock`) differ. Indexing the complete file tree for every tag wastes compute time, GitHub API
quota, and embedding credits.

---

## Solution

Differential tag indexing detects when an already-indexed ancestor version exists for a given
target tag, determines exactly which files changed, and then:

1. **Clones** unchanged document rows, snippet rows, and embedding rows from the ancestor version
   into the target version in a single SQLite transaction (`cloneFromAncestor`).
2. **Crawls** only the changed (added / modified / renamed) files, then parses and embeds them
   normally.
3. **Skips** deleted files (not cloned, not crawled).
4. **Falls back** to a full crawl, without failing the job, when no indexed ancestor can be found
   or any step fails.
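The four steps above can be sketched as a small orchestration function. This is an illustrative outline only: `Plan`, `Steps`, and `indexVersion` are hypothetical names rather than the project's API, and the sketch is synchronous for brevity even though the real `buildDifferentialPlan` is async.

```typescript
// Hypothetical shapes for illustration; the real pipeline types may differ.
interface Plan {
  changedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

interface Steps {
  buildPlan(): Plan | null; // null means "no usable ancestor"
  cloneUnchanged(paths: Set<string>): void; // assumed to be transactional
  crawl(allowedPaths?: Set<string>): string[]; // returns the paths it indexed
}

// Try the differential path first; on any failure, fall back to a full crawl.
function indexVersion(
  steps: Steps,
  warn: (message: string) => void = console.warn,
): string[] {
  let plan: Plan | null = null;
  try {
    plan = steps.buildPlan();
    if (plan !== null) {
      steps.cloneUnchanged(plan.unchangedPaths);
    }
  } catch (err) {
    warn(`differential indexing failed, using full crawl: ${String(err)}`);
    plan = null;
  }
  // With a plan, crawl only the changed files. Deleted files appear in neither
  // set, so they are neither cloned nor crawled.
  return plan !== null ? steps.crawl(plan.changedPaths) : steps.crawl();
}
```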

---

## Algorithm

### Stage 0 — Differential Plan (`buildDifferentialPlan`)

Executed in `IndexingPipeline.run()` before the crawl, when the job has a `versionId`:

1. **Ancestor selection** (`findBestAncestorVersion` in `tag-order.ts`): Loads all versions of
   the repository that are in the `indexed` state, parses their tags as semver, and returns the
   closest version that precedes the target tag. Falls back to creation-timestamp ordering for
   non-semver tags.

2. **Changed-file detection**: For GitHub repositories, calls the GitHub Compare API
   (`fetchGitHubChangedFiles` in `github-compare.ts`). For local repositories, uses
   `git diff --name-status` via `getChangedFilesBetweenRefs` in `git.ts`. That helper uses
   `execFileSync` rather than `execSync`, so branch/tag names containing shell metacharacters
   cannot inject shell commands.

3. **Path partitioning**: The changed-file list is split into `changedPaths` (added, modified,
   and renamed-destination paths) and `deletedPaths`. `unchangedPaths` is derived as
   `ancestorFilePaths − changedPaths − deletedPaths`.

4. **Guard**: Returns `null` when no indexed ancestor exists, when the ancestor has no indexed
   documents, or when all files changed (nothing to clone).
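Ancestor selection (step 1) can be illustrated with a minimal sketch. The signature of `findBestAncestorVersion` is not shown in this document, so `VersionInfo`, `parseSemver`, and `pickAncestor` below are hypothetical names. The sketch also assumes plain `MAJOR.MINOR.PATCH` tags with an optional `v` prefix (no pre-release suffixes), and assumes the timestamp fallback is used only when the target tag itself is not semver.

```typescript
// Hypothetical record shape; the project's version rows may carry other fields.
interface VersionInfo {
  tag: string;
  createdAt: number; // epoch milliseconds
}

type Semver = [number, number, number];

// Parse "v1.2.3" or "1.2.3"; pre-release and build suffixes are not handled here.
function parseSemver(tag: string): Semver | null {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(tag);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

// Negative when a < b, positive when a > b, zero when equal.
function compareSemver(a: Semver, b: Semver): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Return the closest indexed predecessor of `target`, or null if none exists.
function pickAncestor(target: VersionInfo, indexed: VersionInfo[]): VersionInfo | null {
  const others = indexed.filter((v) => v.tag !== target.tag);
  const targetSemver = parseSemver(target.tag);

  if (targetSemver !== null) {
    // Semver path: highest version that is still strictly below the target.
    const older = others
      .map((v) => ({ v, sv: parseSemver(v.tag) }))
      .filter(
        (x): x is { v: VersionInfo; sv: Semver } =>
          x.sv !== null && compareSemver(x.sv, targetSemver) < 0,
      )
      .sort((a, b) => compareSemver(b.sv, a.sv));
    return older[0]?.v ?? null;
  }

  // Fallback path: most recently created version that predates the target.
  const earlier = others
    .filter((v) => v.createdAt < target.createdAt)
    .sort((a, b) => b.createdAt - a.createdAt);
  return earlier[0] ?? null;
}
```

Sorting candidates in descending order and taking the first element keeps both paths symmetric: each returns the nearest predecessor under its own ordering, or `null` so the caller can fall back to a full crawl.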

### Stage 0.5 — Clone Unchanged Files (`cloneFromAncestor`)

When `buildDifferentialPlan` returns a non-null plan with `unchangedPaths.size > 0`,
`cloneFromAncestor`:

- Fetches ancestor `documents` rows for the unchanged paths using a parameterised
  `IN (?, ?, …)` query (path values are never interpolated into the SQL string, so there is no
  SQL injection risk).
- Inserts a new `documents` row for each, with a new UUID and `version_id = targetVersionId`.
- Fetches ancestor `snippets` rows for those document IDs; inserts clones with new IDs.
- Fetches ancestor `snippet_embeddings` rows; inserts clones pointing to the new snippet IDs.
- Runs the entire operation inside a single `this.db.transaction(…)()` call for atomicity.
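The ID remapping at the heart of the clone step can be sketched in memory, without a database. The row shapes and the names `cloneRows` and `placeholders` are assumptions for illustration; the real tables have more columns. That cloned snippets reference the cloned documents (rather than the ancestor's) is also an assumption, inferred from the steps above.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical row shapes; the real tables have more columns.
interface DocumentRow { id: string; versionId: string; filePath: string }
interface SnippetRow { id: string; documentId: string; content: string }
interface EmbeddingRow { snippetId: string; vector: number[] }

// Build "?, ?, ?" for a parameterised IN (...) clause; values are bound separately.
function placeholders(count: number): string {
  return Array.from({ length: count }, () => "?").join(", ");
}

// Copy ancestor rows to the target version, giving documents and snippets fresh
// IDs and rewriting foreign keys so clones reference clones, not ancestor rows.
function cloneRows(
  targetVersionId: string,
  docs: DocumentRow[],
  snippets: SnippetRow[],
  embeddings: EmbeddingRow[],
): { docs: DocumentRow[]; snippets: SnippetRow[]; embeddings: EmbeddingRow[] } {
  const docIds = new Map<string, string>(); // ancestor doc ID -> cloned doc ID
  const snippetIds = new Map<string, string>(); // ancestor snippet ID -> cloned snippet ID

  const newDocs = docs.map((d) => {
    const id = randomUUID();
    docIds.set(d.id, id);
    return { ...d, id, versionId: targetVersionId };
  });

  const newSnippets: SnippetRow[] = [];
  for (const s of snippets) {
    const documentId = docIds.get(s.documentId);
    if (documentId === undefined) continue; // belongs to a document we did not clone
    const id = randomUUID();
    snippetIds.set(s.id, id);
    newSnippets.push({ ...s, id, documentId });
  }

  const newEmbeddings: EmbeddingRow[] = [];
  for (const e of embeddings) {
    const snippetId = snippetIds.get(e.snippetId);
    if (snippetId !== undefined) newEmbeddings.push({ ...e, snippetId });
  }

  return { docs: newDocs, snippets: newSnippets, embeddings: newEmbeddings };
}
```

With a SQLite driver, the three insert passes would run inside one transaction, and `placeholders(paths.length)` would supply the `IN (…)` list while the paths themselves are passed as bound parameters.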

### Stage 1 — Partial Crawl

`IndexingPipeline.crawl()` accepts an optional third argument `allowedPaths?: Set<string>`.
When provided (set to `differentialPlan.changedPaths`), the crawl result is filtered so only
matching files are returned. This minimises GitHub API requests and local I/O.
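A minimal sketch of the allow-list filter follows. `CrawledFile` and `filterByAllowedPaths` are hypothetical names; the actual filtering lives inside `IndexingPipeline.crawl()` and may be applied at a different point in the crawl.

```typescript
// Hypothetical crawled-file shape; only `path` matters for the filter.
interface CrawledFile {
  path: string;
  content: string;
}

// Keep every file when no allow-list is given; otherwise keep only listed paths.
function filterByAllowedPaths(
  files: CrawledFile[],
  allowedPaths?: Set<string>,
): CrawledFile[] {
  if (allowedPaths === undefined) return files;
  return files.filter((file) => allowedPaths.has(file.path));
}
```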

---

## API Surface Changes

| Symbol | Location | Change |
|---|---|---|
| `buildDifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — async function |
| `DifferentialPlan` | `pipeline/differential-strategy.ts` | **New** — interface |
| `findBestAncestorVersion` | `utils/tag-order.ts` | **New** — pure function |
| `fetchGitHubChangedFiles` | `crawler/github-compare.ts` | **New** — async function |
| `getChangedFilesBetweenRefs` | `utils/git.ts` | **New** — sync function (uses `execFileSync`) |
| `ChangedFile` | `crawler/types.ts` | **New** — interface |
| `CrawlOptions.allowedPaths` | `crawler/types.ts` | **New** — optional field |
| `IndexingPipeline.crawl()` | `pipeline/indexing.pipeline.ts` | **Modified** — added `allowedPaths` param |
| `IndexingPipeline.cloneFromAncestor()` | `pipeline/indexing.pipeline.ts` | **New** — private method |
| `IndexingPipeline.run()` | `pipeline/indexing.pipeline.ts` | **Modified** — Stage 0 added |

---

## Correctness Properties

- **Atomicity**: `cloneFromAncestor` wraps all inserts in one SQLite transaction; a failure
  rolls the transaction back, leaving the target version with no partially-cloned data.
- **Idempotency (fallback)**: If the clone or plan step fails for any reason, the pipeline
  catches the error, logs a warning, and continues with a full crawl. No data loss occurs.
- **No shell injection**: `getChangedFilesBetweenRefs` uses `execFileSync` with an argument
  array rather than `execSync` with a template-literal string.
- **No SQL injection**: Path values are never interpolated into SQL strings; only `?`
  placeholders are used.
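The shell-injection property can be illustrated with a sketch of a `git diff --name-status` wrapper. The names `ChangedFileSketch`, `parseNameStatus`, and `diffNameStatus` are hypothetical and do not reproduce the project's `getChangedFilesBetweenRefs` or `ChangedFile`. Treating a rename as a deletion of the old path plus a renamed destination is an assumption, and paths that git quotes because of unusual characters are not handled.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical result shape; the project's `ChangedFile` interface may differ.
interface ChangedFileSketch {
  status: "added" | "modified" | "deleted" | "renamed";
  path: string; // for renames, the destination path
}

// Parse `git diff --name-status` output. Lines look like "<code>\t<path>",
// or "R<score>\t<old>\t<new>" / "C<score>\t<old>\t<new>" for renames and copies.
function parseNameStatus(output: string): ChangedFileSketch[] {
  const files: ChangedFileSketch[] = [];
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue;
    const [code, first, second] = line.split("\t");
    if (code === undefined || first === undefined) continue;
    if (code.startsWith("R") && second !== undefined) {
      files.push({ status: "deleted", path: first }); // assumption: old name is gone
      files.push({ status: "renamed", path: second });
    } else if (code.startsWith("C") && second !== undefined) {
      files.push({ status: "added", path: second }); // a copy creates a new file
    } else if (code === "A") {
      files.push({ status: "added", path: first });
    } else if (code === "D") {
      files.push({ status: "deleted", path: first });
    } else {
      files.push({ status: "modified", path: first }); // M, T, and anything else
    }
  }
  return files;
}

function diffNameStatus(repoDir: string, baseRef: string, headRef: string): ChangedFileSketch[] {
  // Arguments are passed as an array, so no shell ever interprets the ref names.
  // The trailing "--" stops git from treating either ref as a path.
  const output = execFileSync(
    "git",
    ["diff", "--name-status", baseRef, headRef, "--"],
    { cwd: repoDir, encoding: "utf8" },
  );
  return parseNameStatus(output);
}
```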

---

## Fallback Conditions

`buildDifferentialPlan` returns `null` (triggering a full crawl) when:

- No version of this repository has `state = 'indexed'`.
- The best ancestor has no indexed documents.
- All files changed between ancestor and target (`unchangedPaths.size === 0`).
- The GitHub Compare API call or `git diff` call throws an error.
- Any other unexpected exception is thrown inside `buildDifferentialPlan`.
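The path partitioning and the `unchangedPaths.size === 0` guard can be sketched as follows. `ChangedEntry`, `PartitionedPlan`, and `partitionPaths` are hypothetical names; the real `DifferentialPlan` interface may carry additional fields such as the ancestor version.

```typescript
// Hypothetical shapes for illustration only.
interface ChangedEntry {
  status: "added" | "modified" | "renamed" | "deleted";
  path: string;
}

interface PartitionedPlan {
  changedPaths: Set<string>;
  deletedPaths: Set<string>;
  unchangedPaths: Set<string>;
}

// Split the diff into changed and deleted paths, derive the unchanged set from
// the ancestor's file list, and return null when there is nothing to clone.
function partitionPaths(
  ancestorFilePaths: string[],
  changes: ChangedEntry[],
): PartitionedPlan | null {
  const changedPaths = new Set<string>();
  const deletedPaths = new Set<string>();
  for (const change of changes) {
    if (change.status === "deleted") {
      deletedPaths.add(change.path);
    } else {
      changedPaths.add(change.path); // added, modified, or renamed destination
    }
  }

  // unchangedPaths = ancestorFilePaths − changedPaths − deletedPaths
  const unchangedPaths = new Set<string>();
  for (const path of ancestorFilePaths) {
    if (!changedPaths.has(path) && !deletedPaths.has(path)) {
      unchangedPaths.add(path);
    }
  }

  // Guard: with no unchanged files, a differential run saves nothing.
  if (unchangedPaths.size === 0) return null;
  return { changedPaths, deletedPaths, unchangedPaths };
}
```

Returning `null` rather than an empty plan lets the caller treat "no ancestor", "nothing to clone", and "diff failed" the same way: run a full crawl.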