docs: update docs, add new features
@@ -11,6 +11,8 @@
 
 Implement the GitHub crawler that fetches repository file trees and downloads file contents using the GitHub REST API. The crawler respects rate limits, supports private repos via PAT, and applies include/exclude filtering from `trueref.json` configuration.
 
+The shared file-filtering layer is also responsible for keeping retrieval focused on repository source and docs rather than dependency trees or generated artifacts. That means common dependency/build/cache directories, lockfiles, and minified bundles are excluded even if the repository does not provide explicit config.
+
 ---
 
 ## Acceptance Criteria
@@ -18,6 +20,8 @@ Implement the GitHub crawler that fetches repository file trees and downloads fi
 - [ ] Fetch complete file tree for a GitHub repo (default branch or specific tag/branch)
 - [ ] Filter files by extension (only index relevant file types)
 - [ ] Apply `trueref.json` folder/file include/exclude rules
+- [ ] Exclude common dependency, cache, and build-artifact directories via shared filtering
+- [ ] Exclude known lockfiles and minified / bundled assets via shared filtering
 - [ ] Download file contents in parallel (with concurrency limit)
 - [ ] Handle GitHub API rate limiting (respect `X-RateLimit-*` headers, exponential backoff)
 - [ ] Support private repositories via GitHub Personal Access Token (PAT)
@@ -126,6 +130,7 @@ function shouldIndexFile(
   config?: RepoConfig
 ): boolean {
   const ext = path.extname(filePath).toLowerCase();
+  const base = path.basename(filePath);
 
   // 1. Must have indexable extension
   if (!INDEXABLE_EXTENSIONS.has(ext)) return false;
@@ -133,15 +138,26 @@ function shouldIndexFile(
   // 2. Must not exceed size limit
   if (fileSize > MAX_FILE_SIZE_BYTES) return false;
 
-  // 3. Apply config excludeFiles (exact filename match)
-  if (config?.excludeFiles?.includes(path.basename(filePath))) return false;
+  // 3. Exclude lockfiles and other non-source artifacts
+  if (IGNORED_FILE_NAMES.has(base)) return false;
 
-  // 4. Apply config excludeFolders (regex or prefix match)
+  // 4. Exclude minified and bundled assets
+  if (base.includes('.min.') || base.endsWith('.bundle.js') || base.endsWith('.bundle.css')) {
+    return false;
+  }
+
+  // 5. Apply config excludeFiles (exact filename match)
+  if (config?.excludeFiles?.includes(base)) return false;
+
+  // 6. Exclude common dependency/build/cache directories at any depth
+  if (isInIgnoredDirectory(filePath)) return false;
+
+  // 7. Apply config excludeFolders (regex or prefix match)
   if (config?.excludeFolders?.some(folder =>
     filePath.startsWith(folder) || new RegExp(folder).test(filePath)
   )) return false;
 
-  // 5. Apply config folders allowlist (if specified, only index those paths)
+  // 8. Apply config folders allowlist (if specified, only index those paths)
   if (config?.folders?.length) {
     const inAllowedFolder = config.folders.some(folder =>
       filePath.startsWith(folder) || new RegExp(folder).test(filePath)
@@ -149,17 +165,12 @@ function shouldIndexFile(
     if (!inAllowedFolder) return false;
   }
 
-  // 6. Default excludes: node_modules, .git, dist, build, coverage
-  const defaultExcludes = [
-    'node_modules/', '.git/', 'dist/', 'build/', 'coverage/',
-    '.next/', '__pycache__/', 'vendor/', 'target/', '.cache/',
-  ];
-  if (defaultExcludes.some(ex => filePath.startsWith(ex))) return false;
-
   return true;
 }
 ```
 
+The shared ignored-directory list is intentionally broader than the original baseline and covers common language ecosystems and build tools, for example `node_modules`, `dist`, `build`, `.next`, `.svelte-kit`, `vendor`, `target`, `__pycache__`, `.venv`, coverage output, cache directories, and generated-code folders.
+
 ---
 
 ## Rate Limiting
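The diff above calls `isInIgnoredDirectory` and `IGNORED_FILE_NAMES` but does not show their definitions. The following is a minimal sketch of what such a shared filter might look like, not the project's actual implementation. The set name `IGNORED_DIR_NAMES`, the helper `isIgnoredFileName`, the specific lockfile names, and the assumption that paths are repo-relative with `/` separators are all hypothetical; the directory names are taken from the paragraph and the removed `defaultExcludes` list in the diff.

```typescript
// Hypothetical contents: the real lists live in the shared filtering module,
// which this diff does not show.
const IGNORED_DIR_NAMES: ReadonlySet<string> = new Set([
  'node_modules', 'dist', 'build', '.next', '.svelte-kit',
  'vendor', 'target', '__pycache__', '.venv', 'coverage', '.cache',
]);

// Assumed lockfile names, chosen for illustration only.
const IGNORED_FILE_NAMES: ReadonlySet<string> = new Set([
  'package-lock.json', 'yarn.lock', 'pnpm-lock.yaml', 'Cargo.lock',
]);

/** True if any directory segment of a repo-relative POSIX path is ignored. */
function isInIgnoredDirectory(filePath: string): boolean {
  const segments = filePath.split('/');
  // The last segment is the file name itself, so only the directories are checked.
  return segments.slice(0, -1).some(segment => IGNORED_DIR_NAMES.has(segment));
}

/** True for lockfiles and minified or bundled assets (steps 3 and 4 in the diff). */
function isIgnoredFileName(base: string): boolean {
  return (
    IGNORED_FILE_NAMES.has(base) ||
    base.includes('.min.') ||
    base.endsWith('.bundle.js') ||
    base.endsWith('.bundle.css')
  );
}
```

Matching on path segments rather than on a path prefix is one way to get the "at any depth" behavior named in step 6: the removed `defaultExcludes` check used `filePath.startsWith(...)`, which only catches top-level directories, whereas a segment check also rejects a path such as `packages/app/node_modules/lodash/index.js`.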