TRUEREF-0017 — Incremental Re-indexing (Checksum Diff)
Priority: P1 | Status: Pending | Depends On: TRUEREF-0009 | Blocks: —
Overview
Optimize re-indexing by skipping files that haven't changed since the last indexing run. File checksums (SHA-256) detect changes, so only modified, added, or deleted files trigger parser/embedding work. This dramatically reduces re-indexing time for large repositories.
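For reference, a minimal checksum helper might look like the sketch below, assuming Node's built-in node:crypto and node:fs/promises. The checksumFile name is hypothetical, and the crawler may equally well reuse a git blob SHA as the checksum (which is what the file.sha field in the diff code suggests).

import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// Hex-encoded SHA-256 of a file's contents; this is the value that would be
// persisted as Document.checksum and compared on the next crawl.
async function checksumFile(path: string): Promise<string> {
  const contents = await readFile(path);
  return createHash('sha256').update(contents).digest('hex');
}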
Acceptance Criteria
- Checksum comparison before parsing each file
- Unchanged files reuse existing Document and Snippet records (no re-parse, no re-embed)
- New files: full parse + embed
- Modified files: delete old snippets, parse new ones, re-embed
- Deleted files (present in DB but not in new crawl): delete documents and snippets
- Job progress reflects total files (including skipped), not just processed
- Statistics updated correctly after incremental run
- Integration test covering unchanged, modified, added, and deleted files
Diff Algorithm
interface FileDiff {
  added: CrawledFile[];    // new files not in DB
  modified: CrawledFile[]; // files with changed checksum
  deleted: string[];       // file paths in DB but not in crawl
  unchanged: string[];     // file paths with matching checksum
}

function computeDiff(
  crawledFiles: CrawledFile[],
  existingDocs: Document[] // documents currently in DB for this repo
): FileDiff {
  const existingMap = new Map(existingDocs.map(d => [d.filePath, d]));
  const crawledMap = new Map(crawledFiles.map(f => [f.path, f]));

  const added: CrawledFile[] = [];
  const modified: CrawledFile[] = [];
  const unchanged: string[] = [];

  for (const file of crawledFiles) {
    const existing = existingMap.get(file.path);
    if (!existing) {
      added.push(file);
    } else if (existing.checksum !== file.sha) {
      modified.push(file);
    } else {
      unchanged.push(file.path);
    }
  }

  const deleted = existingDocs
    .filter(doc => !crawledMap.has(doc.filePath))
    .map(doc => doc.filePath);

  return { added, modified, deleted, unchanged };
}
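Note that computeDiff is a pure, synchronous function: it touches neither the filesystem nor the database, which keeps it trivially unit-testable (see the test sketch under Files to Modify below).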
Integration with IndexingPipeline
// In IndexingPipeline.run(), after crawling:
const existingDocs = this.getExistingDocuments(repo.id, job.versionId);
const diff = computeDiff(crawledResult.files, existingDocs);
// Log diff summary
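// Sketch only: no logging interface is specified in this ticket, so a plain
// console.log stands in for whatever logger the pipeline actually uses.
console.log(
  `[reindex] ${diff.added.length} added, ${diff.modified.length} modified, ` +
    `${diff.deleted.length} deleted, ${diff.unchanged.length} unchanged`
);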
this.updateJob(job.id, {
  totalFiles: crawledResult.files.length,
});

// Process only changed/new files
const filesToProcess = [...diff.added, ...diff.modified];
const newSnippets: NewSnippet[] = [];
const newDocuments: NewDocument[] = [];
const docIdsToDelete: string[] = [];

// Map modified and deleted file paths to their existing document IDs for deletion
const docByPath = new Map(existingDocs.map(d => [d.filePath, d]));
for (const file of diff.modified) {
  const existing = docByPath.get(file.path);
  if (existing) docIdsToDelete.push(existing.id);
}
for (const filePath of diff.deleted) {
  const existing = docByPath.get(filePath);
  if (existing) docIdsToDelete.push(existing.id);
}

// Parse new/modified files
for (const [i, file] of filesToProcess.entries()) {
  const docId = crypto.randomUUID();
  newDocuments.push({ id: docId, ...buildDocument(file, repo.id, job.versionId) });
  newSnippets.push(...parseFile(file, { repositoryId: repo.id, documentId: docId }));

  // Count ALL files (including skipped) in progress
  const totalProcessed = diff.unchanged.length + i + 1;
  const progress = Math.round((totalProcessed / crawledResult.files.length) * 80);
  this.updateJob(job.id, {
    processedFiles: totalProcessed,
    progress,
  });
}

// Atomic replacement of only changed documents
this.replaceSnippets(repo.id, docIdsToDelete, newDocuments, newSnippets);
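The ticket requires the replacement to be atomic but leaves replaceSnippets itself unspecified. Below is a minimal sketch, assuming a better-sqlite3 handle on this.db, a snippets.document_id foreign key declared ON DELETE CASCADE, and hypothetical insertDocument/insertSnippet helpers; embedding of the new snippets is assumed to happen before this call.

// Inside IndexingPipeline (sketch).
private replaceSnippets(
  repoId: string,
  docIdsToDelete: string[],
  newDocuments: NewDocument[],
  newSnippets: NewSnippet[]
): void {
  const swap = this.db.transaction(() => {
    // Guard on repository_id (hypothetical column name) so a stale document ID
    // can never touch another repository's rows.
    const del = this.db.prepare(
      'DELETE FROM documents WHERE id = ? AND repository_id = ?'
    );
    for (const id of docIdsToDelete) del.run(id, repoId); // snippet rows cascade away
    for (const doc of newDocuments) this.insertDocument(doc);       // hypothetical helper
    for (const snippet of newSnippets) this.insertSnippet(snippet); // hypothetical helper
  });
  swap(); // better-sqlite3 runs the whole callback in a single transaction
}

Because deletes and inserts share one transaction, a crashed run leaves either the old or the new version of a document's snippets in place, never a document without snippets.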
Performance Impact
For a typical repository with 1,000 files where 50 changed:
- Without incremental: 1,000 files parsed + 1,000 embed batches
- With incremental: 50 files parsed + 50 embed batches
- Estimated speedup: ~20x for re-indexing (crawling and checksum comparison still visit all 1,000 files, but those steps are cheap compared with parsing and embedding)
Files to Modify
- src/lib/server/pipeline/indexing.pipeline.ts — add diff computation
- src/lib/server/pipeline/diff.ts — computeDiff function (new file)
- src/lib/server/pipeline/diff.test.ts — unit tests
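A starting point for diff.test.ts, covering all four cases in one run (Vitest is assumed as the test runner, and the type import path is a placeholder):

import { describe, expect, it } from 'vitest';
import { computeDiff } from './diff';
import type { CrawledFile, Document } from './types'; // placeholder import path

describe('computeDiff', () => {
  it('classifies added, modified, deleted, and unchanged files', () => {
    const crawled = [
      { path: 'a.ts', sha: 'aaa' }, // checksum matches -> unchanged
      { path: 'b.ts', sha: 'b2b' }, // checksum differs -> modified
      { path: 'c.ts', sha: 'ccc' }, // not in DB -> added
    ] as CrawledFile[];
    const existing = [
      { id: '1', filePath: 'a.ts', checksum: 'aaa' },
      { id: '2', filePath: 'b.ts', checksum: 'b1b' },
      { id: '3', filePath: 'd.ts', checksum: 'ddd' }, // not crawled -> deleted
    ] as Document[];

    const diff = computeDiff(crawled, existing);

    expect(diff.added.map(f => f.path)).toEqual(['c.ts']);
    expect(diff.modified.map(f => f.path)).toEqual(['b.ts']);
    expect(diff.deleted).toEqual(['d.ts']);
    expect(diff.unchanged).toEqual(['a.ts']);
  });
});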