docs: update docs, add new features

Giancarmine Salucci
2026-03-25 15:11:01 +01:00
parent 59628dd408
commit b9d52405fa
4 changed files with 376 additions and 19 deletions


@@ -11,6 +11,8 @@
Implement the GitHub crawler that fetches repository file trees and downloads file contents using the GitHub REST API. The crawler respects rate limits, supports private repos via PAT, and applies include/exclude filtering from `trueref.json` configuration.
The shared file-filtering layer is also responsible for keeping retrieval focused on repository source and docs rather than dependency trees or generated artifacts. That means common dependency/build/cache directories, lockfiles, and minified bundles are excluded even if the repository does not provide explicit config.
---
## Acceptance Criteria
@@ -18,6 +20,8 @@ Implement the GitHub crawler that fetches repository file trees and downloads fi
- [ ] Fetch complete file tree for a GitHub repo (default branch or specific tag/branch)
- [ ] Filter files by extension (only index relevant file types)
- [ ] Apply `trueref.json` folder/file include/exclude rules
- [ ] Exclude common dependency, cache, and build-artifact directories via shared filtering
- [ ] Exclude known lockfiles and minified / bundled assets via shared filtering
- [ ] Download file contents in parallel (with concurrency limit)
- [ ] Handle GitHub API rate limiting (respect `X-RateLimit-*` headers, exponential backoff)
- [ ] Support private repositories via GitHub Personal Access Token (PAT)
@@ -126,6 +130,7 @@ function shouldIndexFile(
config?: RepoConfig
): boolean {
const ext = path.extname(filePath).toLowerCase();
const base = path.basename(filePath);
// 1. Must have indexable extension
if (!INDEXABLE_EXTENSIONS.has(ext)) return false;
@@ -133,15 +138,26 @@ function shouldIndexFile(
// 2. Must not exceed size limit
if (fileSize > MAX_FILE_SIZE_BYTES) return false;
// 3. Apply config excludeFiles (exact filename match)
if (config?.excludeFiles?.includes(path.basename(filePath))) return false;
// 3. Exclude lockfiles and other non-source artifacts
if (IGNORED_FILE_NAMES.has(base)) return false;
// 4. Apply config excludeFolders (regex or prefix match)
// 4. Exclude minified and bundled assets
if (base.includes('.min.') || base.endsWith('.bundle.js') || base.endsWith('.bundle.css')) {
return false;
}
// 5. Apply config excludeFiles (exact filename match)
if (config?.excludeFiles?.includes(base)) return false;
// 6. Exclude common dependency/build/cache directories at any depth
if (isInIgnoredDirectory(filePath)) return false;
// 7. Apply config excludeFolders (regex or prefix match)
if (config?.excludeFolders?.some(folder =>
filePath.startsWith(folder) || new RegExp(folder).test(filePath)
)) return false;
// 5. Apply config folders allowlist (if specified, only index those paths)
// 8. Apply config folders allowlist (if specified, only index those paths)
if (config?.folders?.length) {
const inAllowedFolder = config.folders.some(folder =>
filePath.startsWith(folder) || new RegExp(folder).test(filePath)
@@ -149,17 +165,12 @@ function shouldIndexFile(
if (!inAllowedFolder) return false;
}
// 6. Default excludes: node_modules, .git, dist, build, coverage
const defaultExcludes = [
'node_modules/', '.git/', 'dist/', 'build/', 'coverage/',
'.next/', '__pycache__/', 'vendor/', 'target/', '.cache/',
];
if (defaultExcludes.some(ex => filePath.startsWith(ex))) return false;
return true;
}
```
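Steps 3 and 4 of `shouldIndexFile` depend only on the file's base name. The following is a minimal sketch of those two checks in isolation. The entries in `IGNORED_FILE_NAMES` are an assumption (the spec names the set but does not list its contents here), and `isIgnoredArtifact` is a hypothetical helper introduced only for illustration.

```typescript
// Assumed contents: the spec references IGNORED_FILE_NAMES without listing its entries.
const IGNORED_FILE_NAMES: ReadonlySet<string> = new Set([
  'package-lock.json',
  'yarn.lock',
  'pnpm-lock.yaml',
  'Cargo.lock',
  'poetry.lock',
  'Gemfile.lock',
  'composer.lock',
]);

// Hypothetical helper: true for lockfiles and for minified or bundled assets.
function isIgnoredArtifact(filePath: string): boolean {
  // Repo-relative paths from the GitHub tree API use '/' separators.
  const base = filePath.split('/').pop() ?? filePath;

  // Step 3: exact match against known lockfile names.
  if (IGNORED_FILE_NAMES.has(base)) return true;

  // Step 4: same substring and suffix tests as in shouldIndexFile.
  return (
    base.includes('.min.') ||
    base.endsWith('.bundle.js') ||
    base.endsWith('.bundle.css')
  );
}
```

Because the minified check looks for the literal `.min.` infix, a file such as `src/minimal.ts` is still indexed, while `web/static/app.min.js` is skipped.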
The shared ignored-directory list is intentionally broader than the original baseline and covers common language ecosystems and build tools, for example `node_modules`, `dist`, `build`, `.next`, `.svelte-kit`, `vendor`, `target`, `__pycache__`, `.venv`, coverage output, cache directories, and generated-code folders.
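The directory check used in step 6 could be sketched as follows. This is an illustration under stated assumptions, not the shared implementation: `IGNORED_DIRECTORIES` is a hypothetical name holding only a subset of the list described above, and the body of `isInIgnoredDirectory` is a guess at one reasonable approach (matching whole path segments).

```typescript
// Hypothetical subset of the shared ignored-directory list; the real list is broader.
const IGNORED_DIRECTORIES: ReadonlySet<string> = new Set([
  'node_modules', '.git', 'dist', 'build', 'coverage', '.next',
  '.svelte-kit', 'vendor', 'target', '__pycache__', '.venv', '.cache',
]);

// Sketch: true when any directory segment of a repo-relative path is ignored.
function isInIgnoredDirectory(filePath: string): boolean {
  // Normalize separators and drop empty segments from leading or doubled slashes.
  const segments = filePath.replace(/\\/g, '/').split('/').filter(Boolean);

  // The last segment is the file name itself, so only the preceding ones are directories.
  const directories = segments.slice(0, -1);

  return directories.some(dir => IGNORED_DIRECTORIES.has(dir));
}
```

Matching on whole segments catches nested cases such as `packages/app/node_modules/lodash/index.js`, which a `startsWith('node_modules/')` prefix test would miss, and it avoids false positives on files like `build.ts` at the repository root.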
---
## Rate Limiting