trueref/docs/features/TRUEREF-0017.md
2026-03-22 17:08:15 +01:00

TRUEREF-0017 — Incremental Re-indexing (Checksum Diff)

Priority: P1 | Status: Pending | Depends On: TRUEREF-0009 | Blocks:


Overview

Optimize re-indexing by skipping files that have not changed since the last indexing run. File checksums (SHA-256) are compared to detect changes, so only added, modified, or deleted files trigger parser or embedding work. For large repositories, re-indexing cost then scales with the number of changed files rather than with repository size.
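As a minimal sketch of the checksum comparison, assuming Node.js and its built-in `node:crypto` module: the helper name `sha256Hex` is hypothetical, and the real crawler may obtain its checksums differently.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: hex-encoded SHA-256 digest of a file's content.
function sha256Hex(content: string | Uint8Array): string {
  return createHash('sha256').update(content).digest('hex');
}

// Identical content always yields the same digest, so a stored checksum
// that equals the freshly computed one means the file can be skipped.
const stored = sha256Hex('export const a = 1;\n');
const fresh = sha256Hex('export const a = 1;\n');
const edited = sha256Hex('export const a = 2;\n');

console.log(stored === fresh);  // true: unchanged, skip
console.log(stored === edited); // false: modified, re-parse and re-embed
```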


Acceptance Criteria

  • Checksum comparison before parsing each file
  • Unchanged files reuse existing Document and Snippet records (no re-parse, no re-embed)
  • New files: full parse + embed
  • Modified files: delete old snippets, parse new ones, re-embed
  • Deleted files (present in DB but not in new crawl): delete documents and snippets
  • Job progress reflects total files (including skipped), not just processed
  • Statistics updated correctly after incremental run
  • Integration test covering unchanged, modified, added, and deleted files
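The per-file rules in the criteria above can be read as a small decision table. The sketch below is illustrative only; `planAction` and the action names are hypothetical and do not appear in the codebase.

```typescript
type Action = 'parse-and-embed' | 'replace' | 'delete' | 'reuse';

// Hypothetical helper: decide what to do with one file path, given the
// checksum stored in the DB (undefined if there is no record) and the
// checksum from the new crawl (undefined if the file is gone).
function planAction(storedSum: string | undefined, crawledSum: string | undefined): Action | null {
  if (storedSum === undefined && crawledSum === undefined) return null; // unknown path
  if (storedSum === undefined) return 'parse-and-embed'; // new file
  if (crawledSum === undefined) return 'delete';         // removed from the repository
  return storedSum === crawledSum ? 'reuse' : 'replace'; // unchanged vs. modified
}

console.log(planAction(undefined, 'aaa')); // parse-and-embed
console.log(planAction('aaa', 'bbb'));     // replace
console.log(planAction('aaa', undefined)); // delete
console.log(planAction('aaa', 'aaa'));     // reuse
```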

Diff Algorithm

interface FileDiff {
  added: CrawledFile[];      // new files not in DB
  modified: CrawledFile[];   // files with changed checksum
  deleted: string[];         // file paths in DB but not in crawl
  unchanged: string[];       // file paths with matching checksum
}

function computeDiff(
  crawledFiles: CrawledFile[],
  existingDocs: Document[]   // documents currently in DB for this repo
): FileDiff {
  const existingMap = new Map(existingDocs.map(d => [d.filePath, d]));
  const crawledMap = new Map(crawledFiles.map(f => [f.path, f]));

  const added: CrawledFile[] = [];
  const modified: CrawledFile[] = [];
  const unchanged: string[] = [];

  for (const file of crawledFiles) {
    const existing = existingMap.get(file.path);
    if (!existing) {
      added.push(file);
    } else if (existing.checksum !== file.sha) {
      modified.push(file);
    } else {
      unchanged.push(file.path);
    }
  }

  const deleted = existingDocs
    .filter(doc => !crawledMap.has(doc.filePath))
    .map(doc => doc.filePath);

  return { added, modified, deleted, unchanged };
}
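A worked example may help when writing the unit tests. The type shapes below are assumptions that contain only the fields the diff reads. `StoredDocument` stands in for the spec's `Document` and is renamed here only to avoid clashing with the DOM's global `Document` type. The function is restated in condensed form so that the snippet runs on its own.

```typescript
// Assumed minimal shapes: only the fields the diff actually reads.
interface CrawledFile { path: string; sha: string }
interface StoredDocument { id: string; filePath: string; checksum: string }
interface FileDiff {
  added: CrawledFile[];
  modified: CrawledFile[];
  deleted: string[];
  unchanged: string[];
}

// Condensed restatement of computeDiff, same logic as in the spec.
function computeDiff(crawledFiles: CrawledFile[], existingDocs: StoredDocument[]): FileDiff {
  const existingMap = new Map(existingDocs.map(d => [d.filePath, d] as const));
  const crawledPaths = new Set(crawledFiles.map(f => f.path));
  const diff: FileDiff = { added: [], modified: [], deleted: [], unchanged: [] };
  for (const file of crawledFiles) {
    const existing = existingMap.get(file.path);
    if (!existing) diff.added.push(file);
    else if (existing.checksum !== file.sha) diff.modified.push(file);
    else diff.unchanged.push(file.path);
  }
  diff.deleted = existingDocs
    .filter(d => !crawledPaths.has(d.filePath))
    .map(d => d.filePath);
  return diff;
}

// Sample data (made up): three documents in the DB, three files in the crawl.
const existingDocs: StoredDocument[] = [
  { id: 'doc-1', filePath: 'README.md', checksum: 'aaa' },
  { id: 'doc-2', filePath: 'docs/api.md', checksum: 'bbb' },
  { id: 'doc-3', filePath: 'docs/old.md', checksum: 'ccc' },
];
const crawledFiles: CrawledFile[] = [
  { path: 'README.md', sha: 'aaa' },   // same checksum as doc-1
  { path: 'docs/api.md', sha: 'bb2' }, // checksum differs from doc-2
  { path: 'docs/new.md', sha: 'ddd' }, // no record in the DB
];

const diff = computeDiff(crawledFiles, existingDocs);
console.log(diff.added.map(f => f.path));    // [ 'docs/new.md' ]
console.log(diff.modified.map(f => f.path)); // [ 'docs/api.md' ]
console.log(diff.deleted);                   // [ 'docs/old.md' ]
console.log(diff.unchanged);                 // [ 'README.md' ]
```

Every crawled path lands in exactly one of `added`, `modified`, or `unchanged`, so those three lists always sum to the crawl size; `deleted` is counted separately because it comes from the DB side.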

Integration with IndexingPipeline

// In IndexingPipeline.run(), after crawling:

const existingDocs = this.getExistingDocuments(repo.id, job.versionId);
const diff = computeDiff(crawledResult.files, existingDocs);

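// Record the total, and count unchanged files as already processed so that
// progress is correct even when no file needs re-parsing
this.updateJob(job.id, {
  totalFiles: crawledResult.files.length,
  processedFiles: diff.unchanged.length,
});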

// Process only changed/new files
const filesToProcess = [...diff.added, ...diff.modified];
const newSnippets: NewSnippet[] = [];
const newDocuments: NewDocument[] = [];
const docIdsToDelete: string[] = [];

// Collect the IDs of documents to remove: modified files are replaced and
// deleted files are dropped. A path-to-ID map avoids a linear scan per file.
const docIdByPath = new Map(existingDocs.map(d => [d.filePath, d.id]));
for (const path of [...diff.modified.map(f => f.path), ...diff.deleted]) {
  const existingId = docIdByPath.get(path);
  if (existingId) docIdsToDelete.push(existingId);
}

// Parse new/modified files
for (const [i, file] of filesToProcess.entries()) {
  const docId = crypto.randomUUID();
  // Spread first so the generated docId wins over any id set by buildDocument
  newDocuments.push({ ...buildDocument(file, repo.id, job.versionId), id: docId });
  newSnippets.push(...parseFile(file, { repositoryId: repo.id, documentId: docId }));

  // Count ALL files (including skipped) in progress
  const totalProcessed = diff.unchanged.length + i + 1;
  const progress = Math.round((totalProcessed / crawledResult.files.length) * 80);
  this.updateJob(job.id, {
    processedFiles: totalProcessed,
    progress,
  });
}

// Atomic replacement of only changed documents
this.replaceSnippets(repo.id, docIdsToDelete, newDocuments, newSnippets);
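The progress arithmetic in the parse loop above can be checked in isolation. The helper below is a hypothetical extraction (the name `parseProgress` is not from the spec). The factor of 80 is taken from the loop as written; the source does not say which later stages use the remaining 20 points.

```typescript
// Hypothetical helper mirroring the progress arithmetic in the parse loop.
// `index` is the zero-based position within filesToProcess.
function parseProgress(unchangedCount: number, index: number, totalFiles: number) {
  // Unchanged files count as processed from the start.
  const processedFiles = unchangedCount + index + 1;
  // The parse stage spans 0..80 on the job's progress scale.
  const progress = Math.round((processedFiles / totalFiles) * 80);
  return { processedFiles, progress };
}

// 1,000 crawled files: 950 unchanged, 50 to re-parse.
console.log(parseProgress(950, 0, 1000));  // { processedFiles: 951, progress: 76 }
console.log(parseProgress(950, 49, 1000)); // { processedFiles: 1000, progress: 80 }
```

Because skipped files are counted up front, an incremental run with few changes starts close to the end of the parse stage rather than at zero.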

Performance Impact

For a typical repository with 1,000 files where 50 changed:

  • Without incremental: 1,000 files parsed + 1,000 embed batches
  • With incremental: 50 files parsed + 50 embed batches
  • Estimated speedup: ~20x for the parse and embed stages (1,000 / 50); crawling and checksum comparison still cover all files
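The estimate is a simple ratio of files handled by a full run to files handled by an incremental run. A rough sketch, assuming per-file parse and embed cost is uniform and ignoring the crawl and diff overhead (the function name is hypothetical):

```typescript
// Hypothetical estimate of parse/embed speedup for an incremental run.
// Assumes uniform per-file cost; ignores crawl and checksum overhead.
function estimatedSpeedup(totalFiles: number, changedFiles: number): number {
  if (changedFiles <= 0) return Infinity; // nothing to parse or embed
  return totalFiles / changedFiles;
}

console.log(estimatedSpeedup(1000, 50));  // 20
console.log(estimatedSpeedup(1000, 500)); // 2
```

The benefit shrinks as the share of changed files grows: when half of the repository changes, the estimate drops to 2x.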

Files to Modify

  • src/lib/server/pipeline/indexing.pipeline.ts: add diff computation
  • src/lib/server/pipeline/diff.ts: computeDiff function (new file)
  • src/lib/server/pipeline/diff.test.ts: unit tests