docs: update docs, add new features

2026-03-25 15:11:01 +01:00
parent 59628dd408
commit b9d52405fa
4 changed files with 376 additions and 19 deletions
--- a/docs/features/TRUEREF-0003.md
+++ b/docs/features/TRUEREF-0003.md
@@ -11,6 +11,8 @@

 Implement the GitHub crawler that fetches repository file trees and downloads file contents using the GitHub REST API. The crawler respects rate limits, supports private repos via PAT, and applies include/exclude filtering from `trueref.json` configuration.

+The shared file-filtering layer is also responsible for keeping retrieval focused on repository source and docs rather than dependency trees or generated artifacts. That means common dependency/build/cache directories, lockfiles, and minified bundles are excluded even if the repository does not provide explicit config.
+
 ---

 ## Acceptance Criteria
@@ -18,6 +20,8 @@ Implement the GitHub crawler that fetches repository file trees and downloads fi
 - [ ] Fetch complete file tree for a GitHub repo (default branch or specific tag/branch)
 - [ ] Filter files by extension (only index relevant file types)
 - [ ] Apply `trueref.json` folder/file include/exclude rules
+- [ ] Exclude common dependency, cache, and build-artifact directories via shared filtering
+- [ ] Exclude known lockfiles and minified or bundled assets via shared filtering
 - [ ] Download file contents in parallel (with concurrency limit)
 - [ ] Handle GitHub API rate limiting (respect `X-RateLimit-*` headers, exponential backoff)
 - [ ] Support private repositories via GitHub Personal Access Token (PAT)
@@ -126,6 +130,7 @@ function shouldIndexFile(
  config?: RepoConfig
 ): boolean {
  const ext = path.extname(filePath).toLowerCase();
+  const base = path.basename(filePath);

  // 1. Must have indexable extension
  if (!INDEXABLE_EXTENSIONS.has(ext)) return false;
@@ -133,15 +138,26 @@ function shouldIndexFile(
  // 2. Must not exceed size limit
  if (fileSize > MAX_FILE_SIZE_BYTES) return false;

-  // 3. Apply config excludeFiles (exact filename match)
-  if (config?.excludeFiles?.includes(path.basename(filePath))) return false;
+  // 3. Exclude lockfiles and other non-source artifacts
+  if (IGNORED_FILE_NAMES.has(base)) return false;

-  // 4. Apply config excludeFolders (regex or prefix match)
+  // 4. Exclude minified and bundled assets
+  if (base.includes('.min.') || base.endsWith('.bundle.js') || base.endsWith('.bundle.css')) {
+    return false;
+  }
+
+  // 5. Apply config excludeFiles (exact filename match)
+  if (config?.excludeFiles?.includes(base)) return false;
+
+  // 6. Exclude common dependency/build/cache directories at any depth
+  if (isInIgnoredDirectory(filePath)) return false;
+
+  // 7. Apply config excludeFolders (regex or prefix match)
  if (config?.excludeFolders?.some(folder =>
    filePath.startsWith(folder) || new RegExp(folder).test(filePath)
  )) return false;

-  // 5. Apply config folders allowlist (if specified, only index those paths)
+  // 8. Apply config folders allowlist (if specified, only index those paths)
  if (config?.folders?.length) {
    const inAllowedFolder = config.folders.some(folder =>
      filePath.startsWith(folder) || new RegExp(folder).test(filePath)
@@ -149,17 +165,12 @@ function shouldIndexFile(
    if (!inAllowedFolder) return false;
  }

-  // 6. Default excludes: node_modules, .git, dist, build, coverage
-  const defaultExcludes = [
-    'node_modules/', '.git/', 'dist/', 'build/', 'coverage/',
-    '.next/', '__pycache__/', 'vendor/', 'target/', '.cache/',
-  ];
-  if (defaultExcludes.some(ex => filePath.startsWith(ex))) return false;
-
  return true;
 }
 ```

+The shared ignored-directory list is intentionally broader than the original baseline and covers common language ecosystems and build tools, for example `node_modules`, `dist`, `build`, `.next`, `.svelte-kit`, `vendor`, `target`, `__pycache__`, `.venv`, coverage output, cache directories, and generated-code folders.
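+
+A minimal sketch of the shared helpers that `shouldIndexFile` refers to, assuming the ignored directories and file names are held in plain sets. The set contents are illustrative subsets taken from the examples in this document, not the full shared list, and `isMinifiedOrBundled` is a hypothetical name for the inline check in step 4.
+
+```typescript
+// Illustrative subsets only; the real shared lists are broader.
+const IGNORED_DIRECTORY_NAMES = new Set([
+  'node_modules', '.git', 'dist', 'build', 'coverage', '.next',
+  '.svelte-kit', 'vendor', 'target', '__pycache__', '.venv', '.cache',
+]);
+
+const IGNORED_FILE_NAMES = new Set(['package-lock.json', 'pnpm-lock.yaml']);
+
+// True when any directory segment of a repo-relative POSIX path is ignored.
+function isInIgnoredDirectory(filePath: string): boolean {
+  const segments = filePath.split('/');
+  // The last segment is the file name itself, so only its parents are checked.
+  return segments.slice(0, -1).some(segment => IGNORED_DIRECTORY_NAMES.has(segment));
+}
+
+// Hypothetical name for the step 4 check on minified and bundled assets.
+function isMinifiedOrBundled(base: string): boolean {
+  return base.includes('.min.') || base.endsWith('.bundle.js') || base.endsWith('.bundle.css');
+}
+```
+
+Matching on path segments rather than a path prefix lets nested paths such as `packages/app/node_modules/x.js` be excluded at any depth.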
+
 ---

 ## Rate Limiting
--- a/docs/features/TRUEREF-0004.md
+++ b/docs/features/TRUEREF-0004.md
@@ -11,6 +11,8 @@

 Implement a local filesystem crawler that indexes repositories stored on disk. Uses the same file filtering logic as the GitHub crawler but reads from the local filesystem using Node.js `fs` APIs. Useful for private internal codebases, monorepos on disk, and offline development.

+When indexing a local project, the crawler should honor the repository's root `.gitignore` when present, so local indexing excludes the same paths developers already keep out of version control. If no `.gitignore` exists, or if it does not exclude common dependency and artifact paths, the crawler must still avoid indexing those paths by default. The goal is to return relevant library code and documentation, not vendored dependencies, caches, lockfiles, or generated build output.
+
 ---

 ## Acceptance Criteria
@@ -18,6 +20,9 @@ Implement a local filesystem crawler that indexes repositories stored on disk. U
 - [ ] Walk a directory tree and enumerate all files
 - [ ] Apply the same extension and size filters as the GitHub crawler
 - [ ] Apply `trueref.json` include/exclude rules
+- [ ] Respect a root `.gitignore` file when present
+- [ ] Prune common dependency / artifact directories even when `.gitignore` is absent
+- [ ] Exclude common lockfiles and minified bundle artifacts from indexing
 - [ ] Read file contents as UTF-8 strings
 - [ ] Compute SHA-256 checksum per file for change detection
 - [ ] Detect `trueref.json` / `context7.json` at the repo root before filtering other files
@@ -46,10 +51,13 @@ export interface LocalCrawlOptions {
 ```typescript
 export class LocalCrawler {
  async crawl(options: LocalCrawlOptions): Promise<CrawlResult> {
-    // 1. Enumerate all files recursively
-    const allFiles = await this.walkDirectory(options.rootPath);
+    // 1. Load root .gitignore if present
+    const gitignore = await this.loadGitignore(options.rootPath);

-    // 2. Look for trueref.json / context7.json first
+    // 2. Enumerate files recursively, pruning ignored directories early
+    const allFiles = await this.walkDirectory(options.rootPath, '', gitignore);
+
+    // 3. Look for trueref.json / context7.json first
    const configFile = allFiles.find(f =>
      f === 'trueref.json' || f === 'context7.json'
    );
@@ -60,13 +68,13 @@ export class LocalCrawler {
      );
    }

-    // 3. Filter files
+    // 4. Filter files
    const filteredFiles = allFiles.filter(relPath => {
      const stat = statCache.get(relPath);
      return shouldIndexFile(relPath, stat.size, config);
    });

-    // 4. Read and return file contents
+    // 5. Read and return file contents
    const crawledFiles: CrawledFile[] = [];
    for (const [i, relPath] of filteredFiles.entries()) {
      const absPath = path.join(options.rootPath, relPath);
@@ -91,17 +99,21 @@ export class LocalCrawler {
    };
  }

-  private async walkDirectory(dir: string, rel = ''): Promise<string[]> {
+  private async walkDirectory(dir: string, rel = '', gitignore?: GitignoreFilter): Promise<string[]> {
    const entries = await fs.readdir(dir, { withFileTypes: true });
    const files: string[] = [];
    for (const entry of entries) {
      if (!entry.isFile() && !entry.isDirectory()) continue; // skip symlinks, devices
      const relPath = rel ? `${rel}/${entry.name}` : entry.name;
      if (entry.isDirectory()) {
+        if (shouldPruneDirectory(relPath) || gitignore?.isIgnored(relPath, true)) {
+          continue;
+        }
        files.push(...await this.walkDirectory(
-          path.join(dir, entry.name), relPath
+          path.join(dir, entry.name), relPath, gitignore
        ));
      } else {
+        if (gitignore?.isIgnored(relPath, false)) continue;
        files.push(relPath);
      }
    }
@@ -112,6 +124,18 @@ export class LocalCrawler {

 ---

+## Ignore Handling
+
+Filtering happens in three layers:
+
+1. Root `.gitignore` rules, so local indexing skips what the project already keeps out of version control.
+2. Built-in exclusions for dependency stores and artifacts such as `node_modules`, `dist`, `build`, `.next`, `vendor`, `target`, `.venv`, `__pycache__`, caches, coverage output, and other generated directories.
+3. Shared file-level exclusions for oversized files, unsupported extensions, known lockfiles such as `package-lock.json` and `pnpm-lock.yaml`, and minified/bundled assets such as `vendor.min.js` or `app.bundle.js`.
+
+Directory pruning should happen during the walk so large dependency trees are never enumerated in the first place.
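+
+The `GitignoreFilter` used by `walkDirectory` could look roughly like the sketch below. It is a hypothetical and deliberately simplified implementation: it handles plain names, trailing-slash directory patterns, and root-anchored paths only, and it skips glob and negation patterns entirely. A production implementation would more likely delegate to an existing gitignore-matching library.
+
+```typescript
+// Hypothetical, simplified filter. Patterns containing `*` or starting
+// with `!` are skipped, so globs and negation are NOT supported here.
+class GitignoreFilter {
+  private rules: { name: string; dirOnly: boolean; anchored: boolean }[] = [];
+
+  constructor(gitignoreText: string) {
+    for (const raw of gitignoreText.split('\n')) {
+      const line = raw.trim();
+      if (!line || line.startsWith('#') || line.startsWith('!') || line.includes('*')) continue;
+      const dirOnly = line.endsWith('/');
+      const name = line.replace(/^\//, '').replace(/\/$/, '');
+      // A slash at the start or in the middle anchors the pattern to the root.
+      const anchored = line.startsWith('/') || name.includes('/');
+      this.rules.push({ name, dirOnly, anchored });
+    }
+  }
+
+  // `relPath` is a POSIX-style path relative to the repository root.
+  isIgnored(relPath: string, isDirectory: boolean): boolean {
+    const base = relPath.split('/').pop() ?? relPath;
+    return this.rules.some(rule => {
+      if (rule.dirOnly && !isDirectory) return false;
+      return rule.anchored ? relPath === rule.name : base === rule.name;
+    });
+  }
+}
+```
+
+Because the walk skips an ignored directory before recursing into it, the filter only needs to match the entry itself, never files nested beneath it.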
+
+---
+
 ## Checksum Computation

 ```typescript
--- a/docs/features/TRUEREF-0019.md
+++ b/docs/features/TRUEREF-0019.md
@@ -2,7 +2,7 @@

 **Priority:** P1
 **Status:** Pending
-**Depends On:** TRUEREF-0001, TRUEREF-0002
+**Depends On:** TRUEREF-0001, TRUEREF-0002, TRUEREF-0020
 **Blocks:** —

 ---
@@ -16,6 +16,8 @@ TrueRef is intended for corporate environments where developers work with privat

 Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants.

+This ticket depends on TRUEREF-0020 so that version-targeted retrieval remains semantically correct after version indexing is made commit-accurate. Without version-scoped hybrid retrieval, semantic results can still leak across versions even if version metadata is stored correctly.
+
 ---

 ## Part 1: Git-Native Version Indexing
--- a/docs/features/TRUEREF-0020.md
+++ b/docs/features/TRUEREF-0020.md
@@ -0,0 +1,320 @@
+# TRUEREF-0020 — Embedding Profiles, Default Local Embeddings, and Version-Scoped Semantic Retrieval
+
+**Priority:** P1
+**Status:** Pending
+**Depends On:** TRUEREF-0007, TRUEREF-0008, TRUEREF-0009, TRUEREF-0010, TRUEREF-0011, TRUEREF-0012, TRUEREF-0014, TRUEREF-0018
+**Blocks:** TRUEREF-0019
+
+---
+
+## Overview
+
+TrueRef already has the main ingredients for embeddings and hybrid search, but the current design is still centered on a single hard-coded provider configuration and does not guarantee version-safe semantic retrieval at query time. This feature formalizes the full provider-registry approach and makes semantic retrieval production-ready for both the REST API and MCP surfaces.
+
+The scope is intentionally narrow:
+
+1. Introduce first-class embedding profiles so custom AI providers can be registered without hard-coding provider names throughout the API, UI, and runtime.
+2. Enable embeddings by default using the local `@xenova/transformers` model so a fresh install provides semantic retrieval out of the box.
+3. Make semantic and hybrid retrieval version-scoped, so a query for a specific library and version only searches snippets indexed for that exact version.
+4. Extend the API and MCP `query-docs` path to use the active embedding profile at query time.
+
+Out of scope:
+
+- semantic repository discovery or reranking for `libs/search`
+- inferring the repository from the query text
+- adding multi-tenant provider isolation
+
+Consumers are expected to pass an exact library or repository identifier, plus the required version, when they want version-specific retrieval.
+
+---
+
+## Problem Statement
+
+Current semantic search support has four structural gaps:
+
+1. Query-time semantic retrieval is not reliably wired to the configured provider.
+2. The embedding configuration shape is fixed to `openai | local | none`, which does not scale to custom provider adapters.
+3. Stored embeddings are keyed too narrowly to support multiple profiles or safe provider migration.
+4. The vector search path does not enforce version scoping as strongly as the keyword search path.
+
+That leaves TrueRef in a state where embeddings may be generated at indexing time, but retrieval behavior, provider flexibility, and version guarantees are still weaker than required.
+
+---
+
+## Goals
+
+- Make semantic retrieval work by default on a fresh install.
+- Keep the default self-hosted path fully local.
+- Support custom AI providers through a provider registry plus profile system.
+- Keep the API as the source of truth for retrieval behavior.
+- Keep MCP as a thin compatibility layer over the API.
+- Guarantee version-scoped hybrid retrieval when a versioned library ID is provided.
+
+---
+
+## Non-Goals
+
+- semantic repository search
+- automatic repo selection from free-text intent
+- remote provider secrets management beyond the current settings persistence model
+- support for non-embedding rerankers in this ticket
+
+---
+
+## Default Local Embeddings
+
+Embeddings should be enabled by default with the local model path instead of shipping in FTS-only mode.
+
+### Default Runtime Behavior
+
+- Install `@xenova/transformers` as a normal runtime dependency rather than treating it as optional for the default setup.
+- Seed the default embedding profile to the local provider.
+- Default model: `Xenova/all-MiniLM-L6-v2`
+- Default dimensions: `384`
+- New repositories index snippets with embeddings automatically unless the user explicitly disables embeddings.
+- Query-time retrieval uses hybrid mode automatically when the active profile is healthy.
+- If the local model cannot be loaded, the system should surface a clear startup or settings error instead of silently pretending semantic search is enabled.
+
+### Acceptance Criteria
+
+- [ ] `@xenova/transformers` is installed by default for production/runtime use
+- [ ] Fresh installations default to an active local embedding profile
+- [ ] No manual provider configuration is required to get semantic search on a clean setup
+- [ ] The settings UI shows local embeddings as the default active profile
+- [ ] Disabling embeddings remains possible from settings
+
+---
+
+## Embedding Profile Registry
+
+Replace the single enum-style config with a registry-oriented model.
+
+### Core Concepts
+
+#### Provider Adapter
+
+A provider adapter is code registered in the server runtime that knows how to validate config and generate embeddings for one provider kind.
+
+Examples:
+
+- `local-transformers`
+- `openai-compatible`
+- future custom adapters added in code without redesigning the API contract
+
+#### Embedding Profile
+
+An embedding profile is persisted configuration selecting one provider adapter plus its runtime settings.
+
+```typescript
+interface EmbeddingProfile {
+  id: string;
+  providerKind: string;
+  title: string;
+  enabled: boolean;
+  isDefault: boolean;
+  config: Record<string, unknown>;
+  model: string;
+  dimensions: number;
+  createdAt: number;
+  updatedAt: number;
+}
+```
+
+### Registry Responsibilities
+
+- create provider instance from profile
+- validate profile config
+- expose provider metadata to the settings API and UI
+- allow future custom providers without widening TypeScript unions across the app
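+
+The responsibilities above can be sketched as a small registry keyed by `providerKind`. Everything here is a hypothetical shape for illustration: `ProviderAdapter`, `EmbeddingProvider`, `ProviderRegistry`, and their method names are assumptions, and `ProfileLike` repeats only the fields of `EmbeddingProfile` that the registry needs.
+
+```typescript
+// Subset of EmbeddingProfile that the registry needs (assumed shape).
+interface ProfileLike {
+  providerKind: string;
+  model: string;
+  dimensions: number;
+  config: Record<string, unknown>;
+}
+
+interface EmbeddingProvider {
+  embed(texts: string[]): Promise<number[][]>;
+}
+
+interface ProviderAdapter {
+  kind: string;
+  // Returns a list of human-readable config errors; empty means valid.
+  validate(config: Record<string, unknown>): string[];
+  create(profile: ProfileLike): EmbeddingProvider;
+}
+
+class ProviderRegistry {
+  private adapters = new Map<string, ProviderAdapter>();
+
+  register(adapter: ProviderAdapter): void {
+    this.adapters.set(adapter.kind, adapter);
+  }
+
+  // Metadata for the settings API and UI: which provider kinds exist.
+  kinds(): string[] {
+    return Array.from(this.adapters.keys());
+  }
+
+  // Validate the profile config, then build a provider instance from it.
+  createProvider(profile: ProfileLike): EmbeddingProvider {
+    const adapter = this.adapters.get(profile.providerKind);
+    if (!adapter) throw new Error(`Unknown provider kind: ${profile.providerKind}`);
+    const errors = adapter.validate(profile.config);
+    if (errors.length > 0) throw new Error(`Invalid profile config: ${errors.join('; ')}`);
+    return adapter.create(profile);
+  }
+}
+```
+
+Because `providerKind` is a plain string looked up at runtime, adding an adapter means one `register` call rather than widening a union type across the app.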
+
+### Acceptance Criteria
+
+- [ ] Provider selection is no longer hard-coded to `openai | local | none`
+- [ ] Providers are instantiated through a registry keyed by `providerKind`
+- [ ] Profiles are stored as first-class records rather than a single settings blob
+- [ ] One profile can be marked as the default active profile for indexing and retrieval
+- [ ] Settings endpoints return profile data and provider metadata cleanly
+
+---
+
+## Data Model Changes
+
+The current `snippet_embeddings` shape is insufficient for multiple profiles because it allows only one embedding row per snippet.
+
+### New Tables / Changes
+
+#### `embedding_profiles`
+
+```typescript
+embeddingProfiles {
+  id: text('id').primaryKey(),
+  providerKind: text('provider_kind').notNull(),
+  title: text('title').notNull(),
+  enabled: integer('enabled', { mode: 'boolean' }).notNull().default(true),
+  isDefault: integer('is_default', { mode: 'boolean' }).notNull().default(false),
+  model: text('model').notNull(),
+  dimensions: integer('dimensions').notNull(),
+  config: text('config', { mode: 'json' }).notNull(),
+  createdAt: integer('created_at').notNull(),
+  updatedAt: integer('updated_at').notNull(),
+}
+```
+
+#### `snippet_embeddings`
+
+Add `profile_id` and replace the single-row-per-snippet constraint with a composite key or unique index on `(snippet_id, profile_id)`.
+
+```typescript
+snippetEmbeddings {
+  snippetId: text('snippet_id').notNull(),
+  profileId: text('profile_id').notNull(),
+  model: text('model').notNull(),
+  dimensions: integer('dimensions').notNull(),
+  embedding: blob('embedding').notNull(),
+  createdAt: integer('created_at').notNull(),
+  // plus a composite primary key or unique index on (snippet_id, profile_id)
+}
+```
+
+### Migration Requirements
+
+- [ ] migration adds `embedding_profiles`
+- [ ] migration updates `snippet_embeddings` for profile scoping
+- [ ] migration seeds a default local profile using `Xenova/all-MiniLM-L6-v2`
+- [ ] migration safely maps existing single-provider configs into one default profile when upgrading
+
+---
+
+## Query-Time Semantic Retrieval
+
+The API must resolve the active embedding profile at request time instead of baking provider selection into startup-only flows.
+
+### API Behavior
+
+`GET /api/v1/context`
+
+- keeps `libraryId`, `query`, `tokens`, and `type`
+- adds optional `searchMode=auto|keyword|semantic|hybrid`
+- adds optional `alpha` for hybrid blending
+- uses the default active embedding profile when `searchMode` is `auto`, `semantic`, or `hybrid`
+- falls back to keyword mode only when embeddings are disabled or the caller explicitly requests keyword mode
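+
+One way to read the rules above is as a small mode-resolution step followed by score blending. The sketch below is an assumption-laden illustration: `resolveSearchMode` and `blendScores` are hypothetical names, `auto` is assumed to resolve to hybrid, and the linear blend is only one plausible meaning of `alpha`, which this ticket does not define precisely.
+
+```typescript
+type SearchMode = 'auto' | 'keyword' | 'semantic' | 'hybrid';
+type ResolvedMode = 'keyword' | 'semantic' | 'hybrid';
+
+// `embeddingsEnabled` stands in for "an enabled default profile exists".
+function resolveSearchMode(
+  requested: SearchMode | undefined,
+  embeddingsEnabled: boolean
+): ResolvedMode {
+  const mode = requested ?? 'auto';
+  // Keyword is used only on explicit request or when embeddings are disabled.
+  if (mode === 'keyword' || !embeddingsEnabled) return 'keyword';
+  // Assumption: `auto` resolves to hybrid, matching the default runtime behavior.
+  return mode === 'auto' ? 'hybrid' : mode;
+}
+
+// Assumed blending rule: alpha weights the semantic score and
+// (1 - alpha) weights the keyword score. Alpha is clamped to [0, 1].
+function blendScores(keywordScore: number, semanticScore: number, alpha: number): number {
+  const a = Math.min(1, Math.max(0, alpha));
+  return a * semanticScore + (1 - a) * keywordScore;
+}
+```
+
+Under this reading, `alpha=0` reproduces keyword ranking and `alpha=1` reproduces pure semantic ranking, provided both scores are normalized to a comparable range first.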
+
+### Version-Scoped Retrieval Rules
+
+- when `libraryId` includes a version, both FTS and vector retrieval must filter to the resolved `versionId`
+- re-fetching snippets after ranking must also preserve `versionId`
+- default-branch snippets must not bleed into versioned queries
+- one version's embeddings must not be compared against another version's snippets for the same repository
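+
+A minimal in-memory sketch of the version-scoping rule for the vector phase, assuming cosine similarity as the ranking metric. The row shape, the function names, and the use of `versionId: null` for default-branch snippets are assumptions for illustration; the real query would apply the same filter in SQL against `snippet_embeddings`.
+
+```typescript
+// Hypothetical row shape; `versionId: null` stands in for default-branch snippets.
+interface EmbeddedSnippet {
+  snippetId: string;
+  versionId: string | null;
+  profileId: string;
+  embedding: Float32Array;
+}
+
+function cosineSimilarity(a: Float32Array, b: Float32Array): number {
+  let dot = 0;
+  let normA = 0;
+  let normB = 0;
+  for (let i = 0; i < a.length; i++) {
+    dot += a[i] * b[i];
+    normA += a[i] * a[i];
+    normB += b[i] * b[i];
+  }
+  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
+}
+
+// Filter to the resolved version and active profile BEFORE scoring, so rows
+// from other versions or profiles can never enter the ranking.
+function searchVersionScoped(
+  rows: EmbeddedSnippet[],
+  query: Float32Array,
+  versionId: string,
+  profileId: string,
+  limit: number
+): { snippetId: string; score: number }[] {
+  return rows
+    .filter(r => r.versionId === versionId && r.profileId === profileId)
+    .map(r => ({ snippetId: r.snippetId, score: cosineSimilarity(query, r.embedding) }))
+    .sort((x, y) => y.score - x.score)
+    .slice(0, limit);
+}
+```
+
+Filtering before ranking matters: a post-ranking filter could fill the top results with other versions' snippets and return too few matches for the requested one.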
+
+### Acceptance Criteria
+
+- [ ] `/api/v1/context` loads the active embedding profile at request time
+- [ ] hybrid retrieval works without restarting the server after profile changes
+- [ ] `searchMode` is supported for context queries
+- [ ] versioned `libraryId` queries enforce version filters in both FTS and vector phases
+- [ ] JSON responses can include retrieval metadata such as mode, profile ID, model, and alpha
+
+---
+
+## MCP Surface
+
+MCP should stay thin and inherit semantic behavior from the API.
+
+### `query-docs`
+
+Extend the MCP tool schema to support:
+
+- `searchMode?: 'auto' | 'keyword' | 'semantic' | 'hybrid'`
+- `alpha?: number`
+
+The MCP server should forward these options directly to `/api/v1/context`.
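+
+Forwarding can be as simple as copying the optional fields onto the query string. The sketch below is illustrative only: `QueryDocsArgs` and `buildContextUrl` are hypothetical names, the argument shape is assumed rather than taken from the existing tool schema, and the base URL is supplied by the caller.
+
+```typescript
+// Assumed argument shape for the MCP `query-docs` tool.
+interface QueryDocsArgs {
+  libraryId: string;
+  query: string;
+  tokens?: number;
+  searchMode?: 'auto' | 'keyword' | 'semantic' | 'hybrid';
+  alpha?: number;
+}
+
+// Build the `/api/v1/context` URL, forwarding the new options only when the
+// caller supplied them so that older clients keep their existing behavior.
+function buildContextUrl(baseUrl: string, args: QueryDocsArgs): string {
+  const url = new URL('/api/v1/context', baseUrl);
+  url.searchParams.set('libraryId', args.libraryId);
+  url.searchParams.set('query', args.query);
+  if (args.tokens !== undefined) url.searchParams.set('tokens', String(args.tokens));
+  if (args.searchMode !== undefined) url.searchParams.set('searchMode', args.searchMode);
+  if (args.alpha !== undefined) url.searchParams.set('alpha', String(args.alpha));
+  return url.toString();
+}
+```
+
+Omitting absent fields instead of sending defaults keeps the API as the single place where `searchMode` and `alpha` defaults are decided.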
+
+### Explicitly Out of Scope
+
+- semantic reranking for `resolve-library-id`
+- automatic library detection from the query text
+
+### Acceptance Criteria
+
+- [ ] MCP `query-docs` supports the same retrieval mode controls as the API
+- [ ] MCP stdio and HTTP transports both preserve the new options
+- [ ] MCP remains backward compatible when the new fields are omitted
+
+---
+
+## Settings and Profile Management
+
+The existing settings page must evolve from a single provider switcher into profile management for the supported provider kinds.
+
+### Required UX Changes
+
+- show the default local profile as the initial active profile
+- allow enabling/disabling embeddings globally
+- allow creating additional custom profiles for supported provider adapters
+- allow selecting exactly one default profile
+- show provider health and profile test results
+- warn when changing the default profile requires re-embedding to preserve semantic quality
+
+### Acceptance Criteria
+
+- [ ] `/settings` supports profile-based embedding configuration
+- [ ] users can create an `openai-compatible` custom profile with arbitrary base URL and model
+- [ ] the local default profile is visible and editable
+- [ ] switching the default profile triggers a re-embedding workflow or explicit warning state
+
+---
+
+## Indexing and Re-Embedding
+
+Indexing must embed snippets against the default active profile, and profile changes must be operationally explicit.
+
+### Required Behavior
+
+- new indexing jobs use the current default profile
+- re-indexing stores embeddings under that profile ID
+- changing the default profile does not silently reuse embeddings from another profile
+- if a profile is changed in a way that invalidates stored embeddings, affected repositories must be marked as needing re-embedding or re-indexing
+
+### Acceptance Criteria
+
+- [ ] indexing records which profile produced each embedding row
+- [ ] re-embedding can be triggered after default-profile changes
+- [ ] no cross-profile embedding reuse occurs
+
+---
+
+## Test Coverage
+
+- [ ] migration tests for `embedding_profiles` and `snippet_embeddings`
+- [ ] unit tests for provider registry resolution
+- [ ] unit tests for version-scoped vector search
+- [ ] unit tests for hybrid retrieval with explicit `searchMode`
+- [ ] API tests covering default local profile behavior on fresh setup
+- [ ] MCP tests covering `query-docs` semantic and hybrid forwarding
+
+---
+
+## Files to Modify
+
+- `package.json` — install `@xenova/transformers` as a runtime dependency
+- `src/lib/server/db/schema.ts`
+- `src/lib/server/db/migrations/*`
+- `src/lib/server/embeddings/provider.ts`
+- `src/lib/server/embeddings/local.provider.ts`
+- `src/lib/server/embeddings/openai.provider.ts`
+- `src/lib/server/embeddings/factory.ts` or replacement registry module
+- `src/lib/server/embeddings/embedding.service.ts`
+- `src/lib/server/search/vector.search.ts`
+- `src/lib/server/search/hybrid.search.service.ts`
+- `src/routes/api/v1/context/+server.ts`
+- `src/routes/api/v1/settings/embedding/+server.ts`
+- `src/routes/api/v1/settings/embedding/test/+server.ts`
+- `src/routes/settings/+page.svelte`
+- `src/mcp/client.ts`
+- `src/mcp/tools/query-docs.ts`
+- `src/mcp/index.ts`
+
+---
+
+## Notes
+
+This ticket intentionally leaves `libs/search` as keyword-only. The caller is expected to identify the target library and, when needed, pass a version-qualified library ID such as `/owner/repo/v1.2.3` before requesting semantic retrieval.