# TRUEREF-0017 — Incremental Re-indexing (Checksum Diff)

**Priority:** P1
**Status:** Pending
**Depends On:** TRUEREF-0009
**Blocks:** —

---

## Overview

Optimize re-indexing by skipping files that haven't changed since the last indexing run. Uses file checksums (SHA-256) to detect changes. Only modified, added, or deleted files trigger parser/embedding work. This dramatically reduces re-indexing time for large repositories.

---

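As a point of reference, the per-file checksum the diff relies on can be produced with Node's built-in `crypto` module. This is a minimal sketch, not the crawler's actual code; the assumption is that `CrawledFile.sha` holds a hex digest like the one below.

```typescript
import { createHash } from "node:crypto";
import { Buffer } from "node:buffer";

// Sketch: SHA-256 hex digest of a file's content. Identical content
// yields an identical checksum regardless of mtime, so a touched but
// unmodified file is still classified as unchanged.
function checksumOf(content: string | Buffer): string {
  return createHash("sha256").update(content).digest("hex");
}
```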
## Acceptance Criteria

- [ ] Checksum comparison before parsing each file
- [ ] Unchanged files reuse existing `Document` and `Snippet` records (no re-parse, no re-embed)
- [ ] New files: full parse + embed
- [ ] Modified files: delete old snippets, parse new ones, re-embed
- [ ] Deleted files (present in DB but not in new crawl): delete documents and snippets
- [ ] Job progress reflects total files (including skipped), not just processed
- [ ] Statistics updated correctly after incremental run
- [ ] Integration test covering unchanged, modified, added, and deleted files

---

## Diff Algorithm

```typescript
interface FileDiff {
  added: CrawledFile[];     // new files not in DB
  modified: CrawledFile[];  // files with changed checksum
  deleted: string[];        // file paths in DB but not in crawl
  unchanged: string[];      // file paths with matching checksum
}

function computeDiff(
  crawledFiles: CrawledFile[],
  existingDocs: Document[] // documents currently in DB for this repo
): FileDiff {
  const existingMap = new Map(existingDocs.map((d) => [d.filePath, d]));
  const crawledMap = new Map(crawledFiles.map((f) => [f.path, f]));

  const added: CrawledFile[] = [];
  const modified: CrawledFile[] = [];
  const unchanged: string[] = [];

  for (const file of crawledFiles) {
    const existing = existingMap.get(file.path);
    if (!existing) {
      added.push(file);
    } else if (existing.checksum !== file.sha) {
      modified.push(file);
    } else {
      unchanged.push(file.path);
    }
  }

  const deleted = existingDocs
    .filter((doc) => !crawledMap.has(doc.filePath))
    .map((doc) => doc.filePath);

  return { added, modified, deleted, unchanged };
}
```

---

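To make the contract concrete, here is a self-contained run of the algorithm over a small repo. `File` and `Doc` are stand-ins for the real `CrawledFile` and `Document`, reduced to the only fields the diff reads, and the function body is condensed from the one above.

```typescript
interface File { path: string; sha: string; }
interface Doc { filePath: string; checksum: string; }

function computeDiff(crawled: File[], existing: Doc[]) {
  const existingMap = new Map(existing.map((d) => [d.filePath, d]));
  const crawledPaths = new Set(crawled.map((f) => f.path));

  const added: File[] = [];
  const modified: File[] = [];
  const unchanged: string[] = [];

  for (const file of crawled) {
    const prev = existingMap.get(file.path);
    if (!prev) added.push(file);                       // not in DB → added
    else if (prev.checksum !== file.sha) modified.push(file); // checksum drift
    else unchanged.push(file.path);                    // skip parse + embed
  }

  // In DB but absent from the crawl → deleted
  const deleted = existing
    .filter((d) => !crawledPaths.has(d.filePath))
    .map((d) => d.filePath);

  return { added, modified, deleted, unchanged };
}

// a.ts unchanged, b.ts modified, c.ts newly added, d.ts deleted
const diff = computeDiff(
  [
    { path: "a.ts", sha: "aaa" },
    { path: "b.ts", sha: "bbb2" },
    { path: "c.ts", sha: "ccc" },
  ],
  [
    { filePath: "a.ts", checksum: "aaa" },
    { filePath: "b.ts", checksum: "bbb1" },
    { filePath: "d.ts", checksum: "ddd" },
  ]
);
// diff.added → [c.ts], diff.modified → [b.ts],
// diff.deleted → ["d.ts"], diff.unchanged → ["a.ts"]
```

Only `b.ts` and `c.ts` reach the parser; `a.ts` keeps its existing records and `d.ts` is queued for deletion.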
## Integration with IndexingPipeline

```typescript
// In IndexingPipeline.run(), after crawling:

const existingDocs = this.getExistingDocuments(repo.id, job.versionId);
const diff = computeDiff(crawledResult.files, existingDocs);

// Record the total file count (including files that will be skipped)
this.updateJob(job.id, {
  totalFiles: crawledResult.files.length
});

// Process only changed/new files
const filesToProcess = [...diff.added, ...diff.modified];
const newSnippets: NewSnippet[] = [];
const newDocuments: NewDocument[] = [];
const docIdsToDelete: string[] = [];

// Map modified files to their existing document IDs for deletion
for (const file of diff.modified) {
  const existing = existingDocs.find((d) => d.filePath === file.path);
  if (existing) docIdsToDelete.push(existing.id);
}

// Map deleted file paths to document IDs
for (const filePath of diff.deleted) {
  const existing = existingDocs.find((d) => d.filePath === filePath);
  if (existing) docIdsToDelete.push(existing.id);
}

// Parse new/modified files
for (const [i, file] of filesToProcess.entries()) {
  const docId = crypto.randomUUID();
  newDocuments.push({ id: docId, ...buildDocument(file, repo.id, job.versionId) });
  newSnippets.push(...parseFile(file, { repositoryId: repo.id, documentId: docId }));

  // Count ALL files (including skipped) in progress
  const totalProcessed = diff.unchanged.length + i + 1;
  const progress = Math.round((totalProcessed / crawledResult.files.length) * 80);
  this.updateJob(job.id, {
    processedFiles: totalProcessed,
    progress
  });
}

// Atomic replacement of only changed documents
this.replaceSnippets(repo.id, docIdsToDelete, newDocuments, newSnippets);
```

---

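The progress arithmetic in the loop above counts skipped (unchanged) files as already processed and scales the file stage into a 0–80 band, reserving the final 20% for downstream work such as embedding and storage. Extracted as a hypothetical pure helper (not an existing function in the codebase), it would look like:

```typescript
// Hypothetical helper mirroring the inline progress math: skipped files
// count toward progress, and the file stage maps onto 0–80 so later
// pipeline stages can fill the remaining 20%.
function fileStageProgress(
  unchangedCount: number,
  processedIndex: number, // 0-based index into filesToProcess
  totalFiles: number      // all crawled files, including skipped
): number {
  const totalProcessed = unchangedCount + processedIndex + 1;
  return Math.round((totalProcessed / totalFiles) * 80);
}

// 1,000 files, 950 unchanged: first changed file → 76, last → 80
```

This keeps the progress bar from appearing stalled at 0% when most of a large repo is skipped in the first instant of the run.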
## Performance Impact

For a typical repository with 1,000 files where 50 changed:

- **Without incremental**: 1,000 files parsed + 1,000 embed batches
- **With incremental**: 50 files parsed + 50 embed batches
- **Estimated speedup**: ~20x for re-indexing (crawling and checksum computation still touch every file, so the gain applies to the parse/embed stages)

---

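The ~20x figure follows directly from the assumption that parse + embed work dominates and scales with the number of changed files. A back-of-envelope helper (illustrative only):

```typescript
// Estimated speedup under the assumption that per-file parse + embed
// cost dominates re-indexing; crawl/checksum overhead is ignored here.
function estimatedSpeedup(totalFiles: number, changedFiles: number): number {
  return totalFiles / changedFiles;
}

// estimatedSpeedup(1000, 50) → 20
```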
## Files to Modify

- `src/lib/server/pipeline/indexing.pipeline.ts` — add diff computation
- `src/lib/server/pipeline/diff.ts` — `computeDiff` function (new file)
- `src/lib/server/pipeline/diff.test.ts` — unit tests