# TrueRef — Product Requirements Document

**Version:** 1.0
**Date:** 2026-03-22
**Status:** Draft

---

## 1. Executive Summary

TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.

The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.

---

## 2. Problem Statement

### 2.1 Context7's Limitations

- The indexing and crawling backend is entirely private and closed-source.
- Only public libraries already in the context7.com catalog are available.
- Private, internal, or niche repositories cannot be added.
- Data sovereignty: all queries go to context7.com servers.
- No way to self-host for air-gapped or compliance-constrained environments.

### 2.2 The Gap

Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.

---

## 3. Goals & Non-Goals

### Goals

- Replicate all context7 capabilities: library search, documentation retrieval, MCP tools (`resolve-library-id`, `query-docs`).
- Support both GitHub-hosted and local filesystem repositories.
- Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
- Expose a REST API compatible with context7's `/api/v2/*` surface.
- Ship an MCP server implementing `resolve-library-id` and `query-docs`.
- Provide a web UI for repository management and search exploration.
- Support `trueref.json` config files in repos (analogous to `context7.json`).
- Support versioned documentation via git tags/branches.
- Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).

### Non-Goals (v1)

- Authentication & authorization (deferred to a future version).
- Skill generation (context7 CLI skill feature).
- Multi-tenant SaaS mode.
- Binary pre-built releases / Docker image (infrastructure, not product).
- Paid API tier / rate limiting.
- Support for non-git version control systems.

---

## 4. Users & Personas

### Primary: The Developer / Tech Lead

Configures TrueRef, adds repositories, integrates the MCP server with their AI coding assistant. Technical, comfortable with CLI and config files.

### Secondary: The AI Coding Assistant

The "user" at query time. Calls `resolve-library-id` and `query-docs` via MCP to retrieve documentation snippets for code generation.

---

## 5. Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                          TrueRef Platform                          │
│                                                                    │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐   │
│  │      Web UI      │   │     REST API     │   │  MCP Server   │   │
│  │   (SvelteKit)    │   │    /api/v1/*     │   │ (stdio/HTTP)  │   │
│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘   │
│           │                      │                     │           │
│           └──────────────────────┼─────────────────────┘           │
│                                  │                                 │
│                    ┌─────────────▼──────────────┐                  │
│                    │       Service Layer        │                  │
│                    │       LibraryService       │                  │
│                    │       SearchService        │                  │
│                    │       IndexingService      │                  │
│                    └─────────────┬──────────────┘                  │
│                                  │                                 │
│           ┌──────────────────────┼───────────────────┐             │
│           │                      │                   │             │
│  ┌────────▼────────┐   ┌─────────▼──────┐   ┌────────▼──────────┐  │
│  │ Indexing        │   │ SQLite DB      │   │ Vector/FTS Index  │  │
│  │ Pipeline        │   │ (drizzle-orm)  │   │ (SQLite FTS5 +    │  │
│  │   Crawler       │   │                │   │  embeddings)      │  │
│  │   Parser        │   │                │   │                   │  │
│  │   Chunker       │   │                │   │                   │  │
│  └────────┬────────┘   └────────────────┘   └───────────────────┘  │
│           │                                                        │
│   ┌───────▼──────────────────────┐                                 │
│   │ Repository Sources           │                                 │
│   │ - GitHub API                 │                                 │
│   │ - Local filesystem           │                                 │
│   └──────────────────────────────┘                                 │
└────────────────────────────────────────────────────────────────────┘
```

### Technology Stack

| Layer | Technology |
|-------|-----------|
| Framework | SvelteKit (Node adapter) |
| Language | TypeScript |
| Database | SQLite via better-sqlite3 + drizzle-orm |
| Full-Text Search | SQLite FTS5 |
| Vector Search | SQLite `sqlite-vec` extension (cosine similarity) |
| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
| MCP Protocol | `@modelcontextprotocol/sdk` |
| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
| CSS | TailwindCSS v4 |
| Testing | Vitest |
| Linting | ESLint + Prettier |

---

## 6. Data Model

### 6.1 Repositories

A `Repository` is the top-level entity. It maps to a GitHub repo or local directory.

```
Repository {
  id             TEXT PRIMARY KEY   -- e.g. "/facebook/react"
  title          TEXT NOT NULL      -- display name
  description    TEXT
  source         TEXT NOT NULL      -- "github" | "local"
  sourceUrl      TEXT               -- GitHub URL or local path
  branch         TEXT               -- default branch
  state          TEXT NOT NULL      -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets  INTEGER DEFAULT 0
  totalTokens    INTEGER DEFAULT 0
  trustScore     REAL DEFAULT 0
  stars          INTEGER
  lastIndexedAt  DATETIME
  createdAt      DATETIME
  updatedAt      DATETIME
}
```

### 6.2 Repository Versions

```
RepositoryVersion {
  id            TEXT PRIMARY KEY
  repositoryId  TEXT FK → Repository
  tag           TEXT NOT NULL       -- git tag or branch name
  title         TEXT
  state         TEXT                -- "pending" | "indexed" | "error"
  indexedAt     DATETIME
}
```

### 6.3 Documents (parsed files)

```
Document {
  id            TEXT PRIMARY KEY
  repositoryId  TEXT FK → Repository
  versionId     TEXT FK → RepositoryVersion (nullable = default branch)
  filePath      TEXT NOT NULL
  title         TEXT
  content       TEXT NOT NULL       -- raw markdown/code
  language      TEXT                -- programming language if code file
  tokenCount    INTEGER
  checksum      TEXT                -- SHA-256 for change detection
  indexedAt     DATETIME
}
```

### 6.4 Snippets (indexed chunks)

```
Snippet {
  id            TEXT PRIMARY KEY
  documentId    TEXT FK → Document
  repositoryId  TEXT FK → Repository
  type          TEXT NOT NULL       -- "code" | "info"
  title         TEXT
  content       TEXT NOT NULL       -- the actual searchable text/code
  language      TEXT
  breadcrumb    TEXT                -- heading hierarchy path
  tokenCount    INTEGER
  embedding     BLOB                -- float32[] stored as blob
  createdAt     DATETIME
}
```

### 6.5 Indexing Jobs

```
IndexingJob {
  id              TEXT PRIMARY KEY
  repositoryId    TEXT FK → Repository
  versionId       TEXT
  status          TEXT                -- "queued" | "running" | "done" | "failed"
  progress        INTEGER DEFAULT 0   -- 0-100
  totalFiles      INTEGER
  processedFiles  INTEGER
  error           TEXT
  startedAt       DATETIME
  completedAt     DATETIME
  createdAt       DATETIME
}
```

### 6.6 Repository Configuration (`trueref.json`)

```
RepositoryConfig {
  repositoryId      TEXT FK → Repository
  projectTitle      TEXT
  description       TEXT
  folders           TEXT[]              -- include paths
  excludeFolders    TEXT[]
  excludeFiles      TEXT[]
  rules             TEXT[]              -- best practices for LLMs
  previousVersions  { tag, title }[]
}
```

---

## 7. Core Features

### F1: Repository Management

Add, remove, update, and list repositories. Support GitHub (public/private via token) and local filesystem sources. Trigger indexing on demand or on schedule.

### F2: GitHub Crawler

Fetch repository file trees via GitHub Trees API. Download file contents. Respect `trueref.json` include/exclude rules. Support rate limiting and incremental re-indexing (checksum-based).

### F3: Local Filesystem Crawler

Walk directory trees. Apply include/exclude rules from `trueref.json`. Watch for file changes (optional).

### F4: Document Parser & Chunker

- Parse Markdown files into sections (heading-based splitting).
- Extract code blocks from Markdown.
- Parse standalone code files into function/class-level chunks.
- Calculate token counts.
- Produce structured `Snippet` records (type: "code" or "info").

### F5: Embedding & Vector Storage

- Generate embeddings for each snippet using a pluggable embeddings backend.
- Store embeddings as binary blobs in SQLite (sqlite-vec).
- Support fallback to FTS5-only search when no embedding provider is configured.
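
To make the `embedding BLOB -- float32[] stored as blob` column and the cosine-similarity ranking from F5 concrete, here is a minimal TypeScript sketch. The helper names (`encodeEmbedding`, `decodeEmbedding`, `cosineSimilarity`) are hypothetical and not part of any TrueRef or sqlite-vec API, and the sketch assumes vectors are stored in the platform's native byte order. It only illustrates the round trip between a number array and raw bytes, plus an in-process similarity computation; a real deployment would likely let the vector extension do the comparison.

```typescript
// Hypothetical helpers for illustration only; not part of TrueRef or sqlite-vec.

/** Pack a vector into raw float32 bytes suitable for a SQLite BLOB column. */
function encodeEmbedding(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector);
  // View the float32 storage as bytes (4 bytes per dimension, native byte order).
  return new Uint8Array(floats.buffer);
}

/** Unpack a BLOB read from the database back into a Float32Array. */
function decodeEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Copy into a fresh buffer first: driver-provided buffers may be views
  // at offsets that are not 4-byte aligned, which Float32Array rejects.
  const copy = new Uint8Array(blob);
  return new Float32Array(copy.buffer);
}

/** Cosine similarity in [-1, 1]; returns 0 if either vector has zero length. */
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

A snippet's vector would be passed through `encodeEmbedding` before insertion and through `decodeEmbedding` after a read. Rejecting mismatched dimensions early matters here because the embedding provider is pluggable, so vectors produced by different providers may not be comparable.
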
### F6: Semantic Search Engine

- Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
- Query-time retrieval: given `libraryId + query`, return ranked snippets.
- Library search: given `libraryName + query`, return matching repositories.

### F7: REST API (`/api/v1/*`)

- `GET /api/v1/libs/search?query=&libraryName=` — search libraries (context7-compatible)
- `GET /api/v1/context?query=&libraryId=&type=json|txt` — fetch documentation
- `GET /api/v1/libs` — list all indexed libraries
- `POST /api/v1/libs` — add a new repository
- `DELETE /api/v1/libs/:id` — remove a repository
- `POST /api/v1/libs/:id/index` — trigger re-indexing
- `GET /api/v1/jobs/:id` — get indexing job status

### F8: MCP Server

- Tool: `resolve-library-id` — search for libraries by name
- Tool: `query-docs` — fetch documentation by libraryId + query
- Transport: stdio (primary), HTTP (optional)
- Compatible with Claude Code, Cursor, and other MCP-aware tools

### F9: Web UI — Repository Dashboard

- List all repositories with status, snippet count, last indexed date
- Add/remove repositories (GitHub URL or local path)
- Trigger re-indexing
- View indexing job progress
- View repository config (`trueref.json`)

### F10: Web UI — Search Explorer

- Interactive search interface (resolve library → query docs)
- Preview snippets with syntax highlighting
- View raw document content

### F11: `trueref.json` Config Support

- Parse `trueref.json` from repo root (or `context7.json` for compatibility)
- Apply `folders`, `excludeFolders`, `excludeFiles` during crawling
- Inject `rules` into LLM context alongside snippets
- Support `previousVersions` for versioned documentation

### F12: Indexing Pipeline & Job Queue

- SQLite-backed job queue (no external message broker required)
- Sequential processing with progress tracking
- Error recovery and retry logic
- Incremental re-indexing using file checksums

### F13: Version Support

- Index specific git tags/branches per repository
- Serve version-specific context when libraryId includes version (`/owner/repo/v1.2.3`)
- UI for managing available versions

---

## 8. API Compatibility with context7

TrueRef's REST API mirrors context7's `/api/v2/*` interface to allow drop-in compatibility:

| context7 Endpoint | TrueRef Endpoint | Notes |
|-------------------|-----------------|-------|
| `GET /api/v2/libs/search` | `GET /api/v1/libs/search` | Same query params |
| `GET /api/v2/context` | `GET /api/v1/context` | Same query params, same response shape |

The MCP tool names and input schemas are identical:

- `resolve-library-id` with `libraryName` + `query`
- `query-docs` with `libraryId` + `query`

Library IDs follow the same convention: `/owner/repo` or `/owner/repo/version`.

---

## 9. Non-Functional Requirements

### Performance

- Library search: < 200ms p99
- Documentation retrieval: < 500ms p99 for 20 snippets
- Indexing throughput: > 1,000 files/minute (GitHub API rate-limited)

### Reliability

- Failed indexing jobs must not corrupt existing indexed data
- Atomic snippet replacement during re-indexing

### Portability

- Single SQLite file for all data
- Runs on Linux, macOS, Windows (Node.js 20+)
- No required external services beyond optional embedding API

### Scalability (v1 constraints)

- Designed for single-node deployment
- SQLite suitable for up to ~500 repositories, ~500k snippets

---

## 10. Milestones & Feature Order

| ID | Feature | Priority | Depends On |
|----|---------|----------|-----------|
| TRUEREF-0001 | Database schema & core data models | P0 | — |
| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
| TRUEREF-0013 | `trueref.json` config file support | P0 | TRUEREF-0003 |
| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
| TRUEREF-0015 | Web UI — repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
| TRUEREF-0016 | Web UI — search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |

---

## 11. Open Questions

1. **Embedding provider default**: Should we bundle a local model (transformers.js) or require external API configuration? Local model provides zero-config experience but adds ~100MB to install size.
2. **Streaming responses**: Should `query-docs` support streaming for long responses?
3. **GitHub private repos**: Should we store the GitHub token in the DB or require it per-request?
4. **context7.json backward compatibility**: Should we auto-detect and use `context7.json` if no `trueref.json` is present?
5. **Re-indexing strategy**: Automatic periodic re-indexing (cron) vs. manual-only?