TRUEREF-0003 — GitHub Repository Crawler
Priority: P0 Status: Pending Depends On: TRUEREF-0001 Blocks: TRUEREF-0009, TRUEREF-0013
Overview
Implement the GitHub crawler that fetches repository file trees and downloads file contents using the GitHub REST API. The crawler respects rate limits, supports private repos via PAT, and applies include/exclude filtering from trueref.json configuration.
The shared file-filtering layer is also responsible for keeping retrieval focused on repository source and docs rather than dependency trees or generated artifacts. That means common dependency/build/cache directories, lockfiles, and minified bundles are excluded even if the repository does not provide explicit config.
Acceptance Criteria
- Fetch complete file tree for a GitHub repo (default branch or specific tag/branch)
- Filter files by extension (only index relevant file types)
- Apply `trueref.json` folder/file include/exclude rules
- Exclude common dependency, cache, and build-artifact directories via shared filtering
- Exclude known lockfiles and minified / bundled assets via shared filtering
- Download file contents in parallel (with concurrency limit)
- Handle GitHub API rate limiting (respect `X-RateLimit-*` headers, exponential backoff)
- Support private repositories via GitHub Personal Access Token (PAT)
- Return structured `CrawledFile` objects for each fetched file
- Report progress via callback (for job tracking)
- Unit tests with mocked GitHub API responses
Indexable File Types
The crawler only downloads files with these extensions:
const INDEXABLE_EXTENSIONS = new Set([
// Documentation
'.md', '.mdx', '.txt', '.rst',
// Code
'.ts', '.tsx', '.js', '.jsx',
'.py', '.rb', '.go', '.rs', '.java', '.cs', '.cpp', '.c', '.h',
'.swift', '.kt', '.php', '.scala', '.clj', '.ex', '.exs',
'.sh', '.bash', '.zsh', '.fish',
// Config / data
'.json', '.yaml', '.yml', '.toml',
// Web
'.html', '.css', '.svelte', '.vue',
]);
const MAX_FILE_SIZE_BYTES = 500_000; // 500 KB — skip large generated files
Data Types
export interface CrawledFile {
path: string; // relative path within repo, e.g. "src/index.ts"
content: string; // UTF-8 file content
size: number; // bytes
sha: string; // GitHub blob SHA (used as checksum)
language: string; // detected from extension
}
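The `language` field is detected from the file extension. A minimal sketch of that detection, assuming a hand-maintained extension map (the helper name `detectLanguage` and the mapping entries are illustrative, not part of the spec):

```typescript
import path from 'node:path';

// Hypothetical extension → language map; extend to cover all
// INDEXABLE_EXTENSIONS in the real implementation.
const EXTENSION_LANGUAGES: Record<string, string> = {
  '.ts': 'typescript', '.tsx': 'typescript',
  '.js': 'javascript', '.jsx': 'javascript',
  '.py': 'python', '.rb': 'ruby', '.go': 'go', '.rs': 'rust',
  '.md': 'markdown', '.mdx': 'markdown',
  '.json': 'json', '.yaml': 'yaml', '.yml': 'yaml',
};

export function detectLanguage(filePath: string): string {
  // path.extname returns '' for files without an extension.
  const ext = path.extname(filePath).toLowerCase();
  return EXTENSION_LANGUAGES[ext] ?? 'plaintext';
}
```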
export interface CrawlResult {
files: CrawledFile[];
totalFiles: number; // files matching filters
skippedFiles: number; // filtered out or too large
branch: string; // branch/tag that was crawled
commitSha: string; // HEAD commit SHA
}
export interface CrawlOptions {
owner: string;
repo: string;
ref?: string; // branch, tag, or commit SHA; defaults to repo default branch
token?: string; // GitHub PAT for private repos
config?: RepoConfig; // parsed trueref.json
onProgress?: (processed: number, total: number) => void;
}
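For illustration, a repository's `trueref.json` might look like the fragment below. The `RepoConfig` schema is defined in TRUEREF-0001, so the field names here (`folders`, `excludeFolders`, `excludeFiles`) are inferred from the filtering logic in this ticket:

```json
{
  "folders": ["src", "docs"],
  "excludeFolders": ["src/generated"],
  "excludeFiles": ["CHANGELOG.md"]
}
```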
GitHub API Usage
Step 1: Get default branch (if ref not specified)
GET https://api.github.com/repos/{owner}/{repo}
→ { default_branch: "main", stargazers_count: 12345 }
Step 2: Fetch file tree (recursive)
GET https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1
→ {
tree: [
{ path: "src/index.ts", type: "blob", size: 1234, sha: "abc123", url: "..." },
...
],
truncated: false
}
If truncated: true, the tree exceeds GitHub's limits for recursive responses (roughly 100k entries). There is no pagination for this endpoint; fall back to non-recursive, directory-by-directory tree requests, filtering top-level directories first.
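Steps 1–2 can be sketched as below, assuming Node 18+ global `fetch`. The function names (`fetchTree`, `blobEntries`) are illustrative, not part of the spec; only `blob` entries are downloadable files, while `tree` entries are directories:

```typescript
interface TreeEntry { path: string; type: string; size?: number; sha: string; }
interface TreeResponse { tree: TreeEntry[]; truncated: boolean; }

// Fetch the recursive file tree for a ref (branch, tag, or commit SHA).
export async function fetchTree(
  owner: string, repo: string, ref: string, token?: string
): Promise<TreeResponse> {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/git/trees/${ref}?recursive=1`,
    { headers: token ? { Authorization: `Bearer ${token}` } : {} }
  );
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json() as Promise<TreeResponse>;
}

// Keep only file entries; directory entries have type "tree".
export function blobEntries(resp: TreeResponse): TreeEntry[] {
  return resp.tree.filter((e) => e.type === 'blob');
}
```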
Step 3: Download file contents (parallel)
GET https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={ref}
→ { content: "<base64>", encoding: "base64", size: 1234, sha: "abc123" }
Alternative for large repos: use raw content URL:
GET https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
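The contents endpoint returns base64 (with embedded newlines, which Node's `Buffer` tolerates). A decoding sketch, with `decodeContents` as an illustrative name:

```typescript
interface ContentsResponse { content: string; encoding: string; size: number; sha: string; }

// Decode the step-3 response body into a UTF-8 string.
export function decodeContents(resp: ContentsResponse): string {
  if (resp.encoding !== 'base64') {
    throw new Error(`Unexpected encoding: ${resp.encoding}`);
  }
  return Buffer.from(resp.content, 'base64').toString('utf-8');
}
```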
Filtering Logic
function shouldIndexFile(
filePath: string,
fileSize: number,
config?: RepoConfig
): boolean {
const ext = path.extname(filePath).toLowerCase();
const base = path.basename(filePath);
// 1. Must have indexable extension
if (!INDEXABLE_EXTENSIONS.has(ext)) return false;
// 2. Must not exceed size limit
if (fileSize > MAX_FILE_SIZE_BYTES) return false;
// 3. Exclude lockfiles and other non-source artifacts
if (IGNORED_FILE_NAMES.has(base)) return false;
// 4. Exclude minified and bundled assets
if (base.includes('.min.') || base.endsWith('.bundle.js') || base.endsWith('.bundle.css')) {
return false;
}
// 5. Apply config excludeFiles (exact filename match)
if (config?.excludeFiles?.includes(base)) return false;
// 6. Exclude common dependency/build/cache directories at any depth
if (isInIgnoredDirectory(filePath)) return false;
// 7. Apply config excludeFolders (regex or prefix match)
if (config?.excludeFolders?.some(folder =>
filePath.startsWith(folder) || new RegExp(folder).test(filePath)
)) return false;
// 8. Apply config folders allowlist (if specified, only index those paths)
if (config?.folders?.length) {
const inAllowedFolder = config.folders.some(folder =>
filePath.startsWith(folder) || new RegExp(folder).test(filePath)
);
if (!inAllowedFolder) return false;
}
return true;
}
The shared ignored-directory list is intentionally broader than the original baseline and covers common language ecosystems and build tools, for example node_modules, dist, build, .next, .svelte-kit, vendor, target, __pycache__, .venv, coverage output, cache directories, and generated-code folders.
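A minimal sketch of the shared ignored-directory check. The full list lives in `file-filter.ts`; the entries below are just the examples named in the paragraph above:

```typescript
// Example entries only — the real set is broader.
const IGNORED_DIRECTORIES = new Set([
  'node_modules', 'dist', 'build', '.next', '.svelte-kit',
  'vendor', 'target', '__pycache__', '.venv', 'coverage', '.cache',
]);

// True if any directory segment of the path (at any depth) is ignored.
// The final segment is the filename and is checked elsewhere.
export function isInIgnoredDirectory(filePath: string): boolean {
  return filePath
    .split('/')
    .slice(0, -1)
    .some((segment) => IGNORED_DIRECTORIES.has(segment));
}
```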
Rate Limiting
class GitHubRateLimiter {
private remaining = 5000;
private resetAt = Date.now();
updateFromHeaders(headers: Headers): void {
this.remaining = parseInt(headers.get('X-RateLimit-Remaining') ?? '5000');
this.resetAt = parseInt(headers.get('X-RateLimit-Reset') ?? '0') * 1000;
}
async waitIfNeeded(): Promise<void> {
if (this.remaining <= 10) {
const waitMs = Math.max(0, this.resetAt - Date.now()) + 1000;
await sleep(waitMs);
}
}
}
Requests are made with a concurrency limit of 10 parallel downloads using a semaphore/pool pattern.
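One way to sketch that semaphore/pool pattern, assuming a generic worker-pool helper (`mapLimit` is an illustrative name): `limit` workers each pull the next unclaimed index until the input is exhausted, so at most `limit` downloads run concurrently.

```typescript
export async function mapLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims an index synchronously (before awaiting),
  // so no item is processed twice.
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        const i = next++;
        results[i] = await fn(items[i]);
      }
    }
  );
  await Promise.all(workers);
  return results;
}
```

In the crawler, `fn` would wrap a single file download plus the rate limiter's `waitIfNeeded()` call.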
Error Handling
| Scenario | Behavior |
|---|---|
| 404 Not Found | Throw RepositoryNotFoundError |
| 401 Unauthorized | Throw AuthenticationError (invalid or missing token) |
| 403 Forbidden | If X-RateLimit-Remaining: 0, wait and retry; else throw PermissionError |
| 422 Unprocessable | Tree too large; switch to directory-by-directory traversal |
| Network error | Retry up to 3 times with exponential backoff |
| File content decode error | Skip file, log warning |
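The network-error row above (up to 3 retries with exponential backoff) could be sketched as a generic wrapper; `withRetry` and its default delays are assumptions, not spec:

```typescript
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Delays of 500 ms, 1 s, 2 s, ... before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```

In the real crawler, the catch branch would also inspect the error: only transient failures (network errors, rate-limited 403s) should be retried; 404/401 should throw immediately per the table.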
Implementation Notes
- Prefer `raw.githubusercontent.com` for file downloads — faster and doesn't count against the rate limit as heavily as the API.
- Cache the file tree in memory during a single crawl run to avoid redundant requests.
- The `sha` field from the tree response is the blob SHA — use this as the document checksum, not the file content SHA.
- Detect `trueref.json` / `context7.json` in the tree before downloading other files, so filtering rules apply to the rest of the crawl.
Files to Create
- `src/lib/server/crawler/github.crawler.ts`
- `src/lib/server/crawler/rate-limiter.ts`
- `src/lib/server/crawler/file-filter.ts`
- `src/lib/server/crawler/types.ts`
- `src/lib/server/crawler/github.crawler.test.ts`