TRUEREF-0004 — Local Filesystem Crawler

Priority: P1
Status: Pending
Depends On: TRUEREF-0001, TRUEREF-0003 (shares types and filter logic)
Blocks: TRUEREF-0009

Overview

Implement a local filesystem crawler that indexes repositories stored on disk. It uses the same file filtering logic as the GitHub crawler but reads from the local filesystem through the Node.js fs APIs. This is useful for private internal codebases, monorepos on disk, and offline development.

When indexing a local project, the crawler should prefer the repository's root .gitignore when present so local indexing follows the same intent developers use in day-to-day work. If no .gitignore exists, or if it does not exclude common dependency and artifact paths, the crawler must still avoid indexing those paths by default. The goal is to return relevant library code and documentation, not vendored dependencies, caches, lockfiles, or generated build output.

Acceptance Criteria

Walk a directory tree and enumerate all files
Apply the same extension and size filters as the GitHub crawler
Apply trueref.json include/exclude rules
Respect a root .gitignore file when present
Prune common dependency / artifact directories even when .gitignore is absent
Exclude common lockfiles and minified bundle artifacts from indexing
Read file contents as UTF-8 strings
Compute SHA-256 checksum per file for change detection
Detect trueref.json / context7.json at the repo root before filtering other files
Report progress via callback
Skip symlinks and special files (devices, sockets, etc.)
Unit tests with temporary directory fixtures
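
The last criterion could be approached with a small fixture helper. The sketch below is only an illustration: createFixture and listFiles are hypothetical names, and listFiles is a simplified synchronous stand-in for the crawler's walk (it prunes node_modules only), not the LocalCrawler itself.

```typescript
import { mkdtempSync, mkdirSync, writeFileSync, readdirSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { dirname, join } from 'node:path';

// Hypothetical helper: write { relativePath: content } into a fresh temp directory.
function createFixture(files: Record<string, string>): string {
  const root = mkdtempSync(join(tmpdir(), 'trueref-local-'));
  for (const relPath of Object.keys(files)) {
    const absPath = join(root, relPath);
    mkdirSync(dirname(absPath), { recursive: true });
    writeFileSync(absPath, files[relPath], 'utf-8');
  }
  return root;
}

// Simplified stand-in for the walk: regular files only, node_modules pruned.
function listFiles(dir: string, rel = ''): string[] {
  const files: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const relPath = rel ? `${rel}/${entry.name}` : entry.name;
    if (entry.isDirectory()) {
      if (entry.name === 'node_modules') continue; // pruned before descending
      files.push(...listFiles(join(dir, entry.name), relPath));
    } else if (entry.isFile()) {
      files.push(relPath); // symlinks and special files fall through and are skipped
    }
  }
  return files.sort();
}

const fixtureRoot = createFixture({
  'README.md': '# demo\n',
  'src/index.ts': 'export const x = 1;\n',
  'node_modules/dep/index.js': 'module.exports = {};\n',
});
const found = listFiles(fixtureRoot);
rmSync(fixtureRoot, { recursive: true, force: true }); // always clean up the fixture
```

A real test would call LocalCrawler.crawl against the fixture root instead of listFiles, and could add symlinks on platforms that allow creating them.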

Data Types

Reuses CrawledFile and CrawlResult from the TRUEREF-0003 crawler types and adds one options type:

export interface LocalCrawlOptions {
  rootPath: string;          // absolute path to repository root
  config?: RepoConfig;       // parsed trueref.json
  onProgress?: (processed: number, total: number) => void;
}
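
To illustrate the onProgress contract (called once per processed file with a 1-based count and the total number of files that passed filtering), here is a minimal sketch. simulateCrawl is a hypothetical stand-in that only exercises the callback, RepoConfig is stubbed as an assumption, and the root path is a placeholder.

```typescript
// Stub for the real RepoConfig from the parsed trueref.json (assumption).
type RepoConfig = Record<string, unknown>;

// Local copy of the option shape defined in this spec.
interface LocalCrawlOptions {
  rootPath: string;
  config?: RepoConfig;
  onProgress?: (processed: number, total: number) => void;
}

// Hypothetical stand-in for LocalCrawler.crawl: reports progress, reads nothing.
function simulateCrawl(paths: string[], options: LocalCrawlOptions): void {
  paths.forEach((_path, i) => options.onProgress?.(i + 1, paths.length));
}

const updates: string[] = [];
simulateCrawl(['a.ts', 'b.md', 'c.json', 'd.ts'], {
  rootPath: '/repos/example', // placeholder path
  onProgress: (processed, total) => {
    updates.push(`${Math.round((processed / total) * 100)}%`);
  },
});
// updates is now ['25%', '50%', '75%', '100%']
```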

Implementation

import { promises as fs } from 'fs';
import * as path from 'path';

export class LocalCrawler {
  async crawl(options: LocalCrawlOptions): Promise<CrawlResult> {
    // 1. Load root .gitignore if present
    const gitignore = await this.loadGitignore(options.rootPath);

    // 2. Enumerate files recursively, pruning ignored directories early
    const allFiles = await this.walkDirectory(options.rootPath, '', gitignore);

    // 3. Look for trueref.json / context7.json first
    const configFile = allFiles.find(f =>
      f === 'trueref.json' || f === 'context7.json'
    );
    let config = options.config;
    if (configFile && !config) {
      config = await this.parseConfigFile(
        path.join(options.rootPath, configFile)
      );
    }

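    // 4. Filter files by extension, size, and config rules
    const filteredFiles: string[] = [];
    for (const relPath of allFiles) {
      // The walk only returns paths, so the size is read here
      const { size } = await fs.stat(path.join(options.rootPath, relPath));
      if (shouldIndexFile(relPath, size, config)) filteredFiles.push(relPath);
    }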

    // 5. Read and return file contents
    const crawledFiles: CrawledFile[] = [];
    for (const [i, relPath] of filteredFiles.entries()) {
      const absPath = path.join(options.rootPath, relPath);
      const content = await fs.readFile(absPath, 'utf-8');
      const sha = computeSHA256(content);
      crawledFiles.push({
        path: relPath,
        content,
        size: Buffer.byteLength(content, 'utf-8'),
        sha,
        language: detectLanguage(relPath),
      });
      options.onProgress?.(i + 1, filteredFiles.length);
    }

    return {
      files: crawledFiles,
      totalFiles: filteredFiles.length,
      skippedFiles: allFiles.length - filteredFiles.length,
      branch: 'local',
      commitSha: computeSHA256(crawledFiles.map(f => f.sha).join('')),
    };
  }

  private async walkDirectory(dir: string, rel = '', gitignore?: GitignoreFilter): Promise<string[]> {
    const entries = await fs.readdir(dir, { withFileTypes: true });
    const files: string[] = [];
    for (const entry of entries) {
      if (!entry.isFile() && !entry.isDirectory()) continue; // skip symlinks, devices
      const relPath = rel ? `${rel}/${entry.name}` : entry.name;
      if (entry.isDirectory()) {
        if (shouldPruneDirectory(relPath) || gitignore?.isIgnored(relPath, true)) {
          continue;
        }
        files.push(...await this.walkDirectory(
          path.join(dir, entry.name), relPath, gitignore
        ));
      } else {
        if (gitignore?.isIgnored(relPath, false)) continue;
        files.push(relPath);
      }
    }
    return files;
  }
}
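
The implementation above relies on a GitignoreFilter with an isIgnored(relPath, isDirectory) method but does not define it. The following is a deliberately simplified sketch of what such a filter might look like. parseGitignore is a hypothetical name, and only a subset of gitignore syntax is handled: blank lines, # comments, plain names, trailing-slash directory patterns, patterns anchored with a slash, and * within a path segment. Negation (!), **, ? and character classes are not supported. A production crawler would more likely use a dedicated gitignore parsing library.

```typescript
interface GitignoreFilter {
  isIgnored(relPath: string, isDirectory: boolean): boolean;
}

interface Rule {
  regex: RegExp;
  dirOnly: boolean;  // pattern ended with '/'
  anchored: boolean; // pattern contained a '/' and is matched against the full relative path
}

// Convert a simple glob into a regex; '*' matches within one path segment only.
function globToRegex(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\?]/g, '\\$&')
    .replace(/\*/g, '[^/]*');
  return new RegExp(`^${escaped}$`);
}

function parseGitignore(text: string): GitignoreFilter {
  const rules: Rule[] = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    // Skip blanks and comments; negation is unsupported in this sketch.
    if (line === '' || line.startsWith('#') || line.startsWith('!')) continue;
    const dirOnly = line.endsWith('/');
    let pattern = dirOnly ? line.slice(0, -1) : line;
    const anchored = pattern.includes('/');
    if (pattern.startsWith('/')) pattern = pattern.slice(1);
    rules.push({ regex: globToRegex(pattern), dirOnly, anchored });
  }
  return {
    isIgnored(relPath, isDirectory) {
      const name = relPath.slice(relPath.lastIndexOf('/') + 1);
      return rules.some(rule => {
        if (rule.dirOnly && !isDirectory) return false;
        // Unanchored patterns match the entry name at any depth.
        return rule.regex.test(rule.anchored ? relPath : name);
      });
    },
  };
}

const filter = parseGitignore('# build output\ndist/\n*.log\n/secrets.txt\n');
```

Because the walk tests each directory before descending into it, the filter only needs to match the entry itself; files beneath an ignored directory are never presented to it.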

Ignore Handling

Filtering happens in three layers:

1. Root .gitignore rules, so local indexing skips what the project already keeps out of version control.
2. Built-in exclusions for dependency stores and artifacts such as node_modules, dist, build, .next, vendor, target, .venv, __pycache__, caches, coverage output, and other generated directories.
3. Shared file-level exclusions for oversized files, unsupported extensions, known lockfiles such as package-lock.json and pnpm-lock.yaml, and minified/bundled assets such as vendor.min.js or app.bundle.js.

Directory pruning should happen during the walk so large dependency trees are never enumerated in the first place.
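
The built-in layers might be sketched as below. shouldPruneDirectory is named in the implementation, but its body here is an assumption. isExcludedArtifact is a hypothetical helper name, and the directory and lockfile sets contain only the examples listed above, so the real lists are likely longer.

```typescript
// Directory names pruned during the walk (taken from the examples in this spec).
const PRUNED_DIRECTORIES = new Set([
  'node_modules', 'dist', 'build', '.next', 'vendor',
  'target', '.venv', '__pycache__', 'coverage',
]);

// Lockfiles named in this spec; a real list would cover more package managers.
const LOCKFILES = new Set(['package-lock.json', 'pnpm-lock.yaml']);

// Last path segment of a '/'-separated relative path.
function baseName(relPath: string): string {
  return relPath.slice(relPath.lastIndexOf('/') + 1);
}

// Assumed behaviour: prune when the directory's own name is in the set, at any depth.
function shouldPruneDirectory(relPath: string): boolean {
  return PRUNED_DIRECTORIES.has(baseName(relPath));
}

// Hypothetical file-level check for lockfiles and minified or bundled assets.
function isExcludedArtifact(relPath: string): boolean {
  const name = baseName(relPath);
  if (LOCKFILES.has(name)) return true;
  return /\.(min|bundle)\.(js|css)$/.test(name);
}

const pruneChecks = {
  nestedNodeModules: shouldPruneDirectory('packages/app/node_modules'),
  sourceDir: shouldPruneDirectory('src'),
  minified: isExcludedArtifact('assets/vendor.min.js'),
  bundle: isExcludedArtifact('app.bundle.js'),
  lockfile: isExcludedArtifact('pnpm-lock.yaml'),
  source: isExcludedArtifact('src/index.ts'),
};
```

Matching on the directory name alone keeps the check cheap enough to run on every directory entry, which is what lets the walk skip large dependency trees without listing them.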

Checksum Computation

import { createHash } from 'crypto';

function computeSHA256(content: string): string {
  return createHash('sha256').update(content, 'utf-8').digest('hex');
}
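
As a rough illustration of how per-file checksums could drive change detection between two crawls, the sketch below compares two path-to-checksum maps. ChecksumIndex, ChangeSet, and diffChecksums are hypothetical names that are not defined elsewhere in this spec.

```typescript
import { createHash } from 'node:crypto';

function computeSHA256(content: string): string {
  return createHash('sha256').update(content, 'utf-8').digest('hex');
}

// Hypothetical shape: relative path -> SHA-256 of the file content.
type ChecksumIndex = Map<string, string>;

interface ChangeSet {
  added: string[];
  changed: string[];
  removed: string[];
}

// Classify paths by comparing the checksums of a previous and a current crawl.
function diffChecksums(previous: ChecksumIndex, current: ChecksumIndex): ChangeSet {
  const changes: ChangeSet = { added: [], changed: [], removed: [] };
  current.forEach((sha, filePath) => {
    const before = previous.get(filePath);
    if (before === undefined) changes.added.push(filePath);
    else if (before !== sha) changes.changed.push(filePath);
  });
  previous.forEach((_sha, filePath) => {
    if (!current.has(filePath)) changes.removed.push(filePath);
  });
  return changes;
}

const previousCrawl: ChecksumIndex = new Map([
  ['README.md', computeSHA256('# v1\n')],
  ['src/a.ts', computeSHA256('export const a = 1;\n')],
  ['src/old.ts', computeSHA256('export {};\n')],
]);
const currentCrawl: ChecksumIndex = new Map([
  ['README.md', computeSHA256('# v2\n')],
  ['src/a.ts', computeSHA256('export const a = 1;\n')],
  ['src/new.ts', computeSHA256('export {};\n')],
]);
const changes = diffChecksums(previousCrawl, currentCrawl);
// changes: added ['src/new.ts'], changed ['README.md'], removed ['src/old.ts']
```

Only files in the added and changed lists would need to be re-indexed; unchanged files keep their existing index entries.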

Files to Create

src/lib/server/crawler/local.crawler.ts
src/lib/server/crawler/local.crawler.test.ts
