TrueRef — Product Requirements Document
Version: 1.0 Date: 2026-03-22 Status: Draft
1. Executive Summary
TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.
The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.
2. Problem Statement
2.1 Context7's Limitations
- The indexing and crawling backend is entirely private and closed-source.
- Only public libraries already in the context7.com catalog are available.
- Private, internal, or niche repositories cannot be added.
- Data sovereignty: all queries go to context7.com servers.
- No way to self-host for air-gapped or compliance-constrained environments.
2.2 The Gap
Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.
3. Goals & Non-Goals
Goals
- Replicate all context7 capabilities: library search, documentation retrieval, MCP tools (resolve-library-id, query-docs).
- Support both GitHub-hosted and local filesystem repositories.
- Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
- Expose a REST API compatible with context7's /api/v2/* surface.
- Ship an MCP server implementing resolve-library-id and query-docs.
- Provide a web UI for repository management and search exploration.
- Support trueref.json config files in repos (analogous to context7.json).
- Support versioned documentation via git tags/branches.
- Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).
Non-Goals (v1)
- Authentication & authorization (deferred to a future version).
- Skill generation (context7 CLI skill feature).
- Multi-tenant SaaS mode.
- Binary pre-built releases / Docker image (infrastructure, not product).
- Paid API tier / rate limiting.
- Support for non-git version control systems.
4. Users & Personas
Primary: The Developer / Tech Lead
Configures TrueRef, adds repositories, integrates the MCP server with their AI coding assistant. Technical, comfortable with CLI and config files.
Secondary: The AI Coding Assistant
The "user" at query time. Calls resolve-library-id and query-docs via MCP to retrieve documentation snippets for code generation.
5. Architecture Overview
┌────────────────────────────────────────────────────────────────────┐
│                          TrueRef Platform                          │
│                                                                    │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐   │
│  │      Web UI      │   │     REST API     │   │  MCP Server   │   │
│  │   (SvelteKit)    │   │    /api/v1/*     │   │ (stdio/HTTP)  │   │
│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘   │
│           │                      │                     │           │
│           └──────────────────────┼─────────────────────┘           │
│                                  │                                 │
│                    ┌─────────────▼──────────────┐                  │
│                    │       Service Layer        │                  │
│                    │  LibraryService            │                  │
│                    │  SearchService             │                  │
│                    │  IndexingService           │                  │
│                    └─────────────┬──────────────┘                  │
│                                  │                                 │
│           ┌──────────────────────┼───────────────────┐             │
│           │                      │                   │             │
│  ┌────────▼────────┐   ┌─────────▼──────┐   ┌────────▼──────────┐  │
│  │ Indexing        │   │ SQLite DB      │   │ Vector/FTS Index  │  │
│  │ Pipeline        │   │ (drizzle-orm)  │   │ (SQLite FTS5 +    │  │
│  │   Crawler       │   │                │   │  embeddings)      │  │
│  │   Parser        │   │                │   │                   │  │
│  │   Chunker       │   │                │   │                   │  │
│  └────────┬────────┘   └────────────────┘   └───────────────────┘  │
│           │                                                        │
│   ┌───────▼──────────────────────┐                                 │
│   │ Repository Sources           │                                 │
│   │ - GitHub API                 │                                 │
│   │ - Local filesystem           │                                 │
│   └──────────────────────────────┘                                 │
└────────────────────────────────────────────────────────────────────┘
Technology Stack
| Layer | Technology |
|---|---|
| Framework | SvelteKit (Node adapter) |
| Language | TypeScript |
| Database | SQLite via better-sqlite3 + drizzle-orm |
| Full-Text Search | SQLite FTS5 |
| Vector Search | SQLite sqlite-vec extension (cosine similarity) |
| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
| MCP Protocol | @modelcontextprotocol/sdk |
| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
| CSS | TailwindCSS v4 |
| Testing | Vitest |
| Linting | ESLint + Prettier |
6. Data Model
6.1 Repositories
A Repository is the top-level entity. It maps to a GitHub repo or local directory.
Repository {
  id             TEXT PRIMARY KEY    -- e.g. "/facebook/react"
  title          TEXT NOT NULL       -- display name
  description    TEXT
  source         TEXT NOT NULL       -- "github" | "local"
  sourceUrl      TEXT                -- GitHub URL or local path
  branch         TEXT                -- default branch
  state          TEXT NOT NULL       -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets  INTEGER DEFAULT 0
  totalTokens    INTEGER DEFAULT 0
  trustScore     REAL DEFAULT 0
  stars          INTEGER
  lastIndexedAt  DATETIME
  createdAt      DATETIME
  updatedAt      DATETIME
}
6.2 Repository Versions
RepositoryVersion {
  id            TEXT PRIMARY KEY
  repositoryId  TEXT FK → Repository
  tag           TEXT NOT NULL        -- git tag or branch name
  title         TEXT
  state         TEXT                 -- "pending" | "indexed" | "error"
  indexedAt     DATETIME
}
6.3 Documents (parsed files)
Document {
  id            TEXT PRIMARY KEY
  repositoryId  TEXT FK → Repository
  versionId     TEXT FK → RepositoryVersion (nullable = default branch)
  filePath      TEXT NOT NULL
  title         TEXT
  content       TEXT NOT NULL        -- raw markdown/code
  language      TEXT                 -- programming language if code file
  tokenCount    INTEGER
  checksum      TEXT                 -- SHA-256 for change detection
  indexedAt     DATETIME
}
6.4 Snippets (indexed chunks)
Snippet {
  id            TEXT PRIMARY KEY
  documentId    TEXT FK → Document
  repositoryId  TEXT FK → Repository
  type          TEXT NOT NULL        -- "code" | "info"
  title         TEXT
  content       TEXT NOT NULL        -- the actual searchable text/code
  language      TEXT
  breadcrumb    TEXT                 -- heading hierarchy path
  tokenCount    INTEGER
  embedding     BLOB                 -- float32[] stored as blob
  createdAt     DATETIME
}
6.5 Indexing Jobs
IndexingJob {
  id              TEXT PRIMARY KEY
  repositoryId    TEXT FK → Repository
  versionId       TEXT
  status          TEXT               -- "queued" | "running" | "done" | "failed"
  progress        INTEGER DEFAULT 0  -- 0-100
  totalFiles      INTEGER
  processedFiles  INTEGER
  error           TEXT
  startedAt       DATETIME
  completedAt     DATETIME
  createdAt       DATETIME
}
6.6 Repository Configuration (trueref.json)
RepositoryConfig {
  repositoryId      TEXT FK → Repository
  projectTitle      TEXT
  description       TEXT
  folders           TEXT[]           -- include paths
  excludeFolders    TEXT[]
  excludeFiles      TEXT[]
  rules             TEXT[]           -- best practices for LLMs
  previousVersions  { tag, title }[]
}
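To make the RepositoryConfig shape concrete, here is a minimal sketch of how a trueref.json payload could be validated before it is stored. The field names come from the model above; the function name parseTrueRefConfig, the helper names, and the lenient handling of missing or malformed fields are assumptions for illustration, not a settled design.

```typescript
// Hypothetical validator for a trueref.json payload. Field names mirror the
// RepositoryConfig model; the leniency rules below are assumptions.
interface PreviousVersion {
  tag: string;
  title: string;
}

interface TrueRefConfig {
  projectTitle: string | null;
  description: string | null;
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
  rules: string[];
  previousVersions: PreviousVersion[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Keeps only string entries; a missing or non-array field becomes [].
function stringList(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

function stringOrNull(value: unknown): string | null {
  return typeof value === "string" ? value : null;
}

function parseTrueRefConfig(json: string): TrueRefConfig {
  const raw: unknown = JSON.parse(json);
  if (!isRecord(raw)) {
    throw new Error("trueref.json must contain a JSON object");
  }
  const versions: unknown[] = Array.isArray(raw.previousVersions)
    ? raw.previousVersions
    : [];
  const previousVersions: PreviousVersion[] = [];
  for (const v of versions) {
    // Entries without a string tag are skipped; a missing title falls back to the tag.
    if (isRecord(v) && typeof v.tag === "string") {
      previousVersions.push({
        tag: v.tag,
        title: typeof v.title === "string" ? v.title : v.tag,
      });
    }
  }
  return {
    projectTitle: stringOrNull(raw.projectTitle),
    description: stringOrNull(raw.description),
    folders: stringList(raw.folders),
    excludeFolders: stringList(raw.excludeFolders),
    excludeFiles: stringList(raw.excludeFiles),
    rules: stringList(raw.rules),
    previousVersions,
  };
}
```

Dropping malformed entries instead of rejecting the whole file keeps indexing going when a repository ships a partly invalid config; a stricter mode that reports errors may be preferable in practice.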
7. Core Features
F1: Repository Management
Add, remove, update, and list repositories. Support GitHub (public/private via token) and local filesystem sources. Trigger indexing on demand or on schedule.
F2: GitHub Crawler
Fetch repository file trees via GitHub Trees API. Download file contents. Respect trueref.json include/exclude rules. Support rate limiting and incremental re-indexing (checksum-based).
F3: Local Filesystem Crawler
Walk directory trees. Apply include/exclude rules from trueref.json. Watch for file changes (optional).
F4: Document Parser & Chunker
- Parse Markdown files into sections (heading-based splitting).
- Extract code blocks from Markdown.
- Parse standalone code files into function/class-level chunks.
- Calculate token counts.
- Produce structured Snippet records (type: "code" or "info").
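The heading-based splitting and code-block extraction described above can be sketched as follows. This is a simplified illustration, not the planned parser: the ParsedSnippet shape borrows field names from the Snippet model, while chunkMarkdown, the " > " breadcrumb separator, and the rule that every fenced block becomes its own "code" snippet are assumptions. Token counting and standalone code files are left out.

```typescript
// Sketch of heading-based Markdown chunking (assumed rules, see lead-in).
interface ParsedSnippet {
  type: "code" | "info";
  title: string | null;
  content: string;
  language: string | null;
  breadcrumb: string;
}

const FENCE = "`".repeat(3); // Markdown code fence marker

function chunkMarkdown(markdown: string): ParsedSnippet[] {
  const snippets: ParsedSnippet[] = [];
  const headings: string[] = []; // heading path; index = heading level - 1
  let prose: string[] = [];
  let code: string[] | null = null; // non-null while inside a fenced block
  let codeLanguage: string | null = null;

  const context = () => {
    const path = headings.filter(Boolean); // skips levels that were never set
    return { title: path[path.length - 1] ?? null, breadcrumb: path.join(" > ") };
  };
  const flushProse = () => {
    const content = prose.join("\n").trim();
    if (content) snippets.push({ type: "info", content, language: null, ...context() });
    prose = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    const isFence = line.trim().startsWith(FENCE);
    if (code !== null) {
      if (isFence) {
        // Closing fence: emit the collected lines as one code snippet.
        snippets.push({ type: "code", content: code.join("\n"), language: codeLanguage, ...context() });
        code = null;
      } else {
        code.push(line);
      }
      continue;
    }
    if (isFence) {
      flushProse();
      code = [];
      codeLanguage = line.trim().slice(FENCE.length).trim() || null;
      continue;
    }
    const heading = /^(#{1,6})\s+(.+)$/.exec(line);
    if (heading) {
      flushProse();
      const level = heading[1].length;
      headings.length = Math.min(headings.length, level - 1); // drop deeper levels
      headings[level - 1] = heading[2].trim();
      continue;
    }
    prose.push(line);
  }
  flushProse();
  return snippets;
}
```

Carrying the breadcrumb on each snippet lets a short code block stay searchable by the headings it appeared under, even when the block itself contains no descriptive text.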
F5: Embedding & Vector Storage
- Generate embeddings for each snippet using a pluggable embeddings backend.
- Store embeddings as binary blobs in SQLite (sqlite-vec).
- Support fallback to FTS5-only search when no embedding provider is configured.
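As a rough illustration of the "float32[] stored as blob" idea, the sketch below packs an embedding into bytes, unpacks it again, and compares two vectors by cosine similarity. It does not call sqlite-vec; the little-endian layout and the function names are assumptions, and in the real system the similarity computation would be delegated to the vector extension.

```typescript
// Pack an embedding into a little-endian float32 byte array (assumed layout).
function embeddingToBlob(embedding: readonly number[]): Uint8Array {
  const blob = new Uint8Array(embedding.length * 4);
  const view = new DataView(blob.buffer);
  embedding.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return blob;
}

// Inverse of embeddingToBlob; rejects blobs whose size is not a multiple of 4.
function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("Embedding blob length must be a multiple of 4 bytes");
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const out = new Float32Array(blob.byteLength / 4);
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getFloat32(i * 4, true);
  }
  return out;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions do not match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}
```

Storing four bytes per dimension means a 384-dimension embedding takes 1,536 bytes per snippet, which is worth keeping in mind against the ~500k snippet ceiling stated for v1.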
F6: Semantic Search Engine
- Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
- Query-time retrieval: given libraryId + query, return ranked snippets.
- Library search: given libraryName + query, return matching repositories.
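Reciprocal rank fusion combines the vector and FTS5 result lists using only each snippet's position in each list, so the two scoring scales never need to be normalised against each other. A minimal sketch, assuming each search returns snippet IDs ordered best-first; the constant k = 60 is a commonly used default and is an assumption here, as is the function name.

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of the id in that list.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    // Highest fused score first; ties broken by id for stable output.
    (a, b) => b.score - a.score || a.id.localeCompare(b.id)
  );
}
```

For example, with a vector ranking of a, b, c and a keyword ranking of b, d, a, the fused order is b, a, d, c: snippets found by both searches rise above those found by only one.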
F7: REST API (/api/v1/*)
- GET /api/v1/libs/search?query=&libraryName= (search libraries, context7-compatible)
- GET /api/v1/context?query=&libraryId=&type=json|txt (fetch documentation)
- GET /api/v1/libs (list all indexed libraries)
- POST /api/v1/libs (add a new repository)
- DELETE /api/v1/libs/:id (remove a repository)
- POST /api/v1/libs/:id/index (trigger re-indexing)
- GET /api/v1/jobs/:id (get indexing job status)
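A client would typically assemble these URLs with proper query-string encoding, since library IDs contain slashes. The helpers below are a hypothetical sketch: the paths and parameter names come from the endpoint list above, while the function names and the base URL in the usage note are assumptions.

```typescript
// Hypothetical URL builders for the two context7-compatible endpoints.
type ContextFormat = "json" | "txt";

function buildLibrarySearchUrl(baseUrl: string, libraryName: string, query: string): string {
  const url = new URL("/api/v1/libs/search", baseUrl);
  url.searchParams.set("libraryName", libraryName);
  url.searchParams.set("query", query);
  return url.toString();
}

function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  type: ContextFormat = "json"
): string {
  const url = new URL("/api/v1/context", baseUrl);
  url.searchParams.set("libraryId", libraryId); // slashes are percent-encoded
  url.searchParams.set("query", query);
  url.searchParams.set("type", type);
  return url.toString();
}
```

Calling buildContextUrl("http://localhost:3000", "/facebook/react", "effect cleanup") yields a URL that can be passed directly to fetch; the host and port shown are placeholders for wherever the instance is deployed.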
F8: MCP Server
- Tool: resolve-library-id (search for libraries by name)
- Tool: query-docs (fetch documentation by libraryId + query)
- Transport: stdio (primary), HTTP (optional)
- Compatible with Claude Code, Cursor, and other MCP-aware tools
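The two tools can be described as plain objects with a name, a description, and a JSON Schema for their input, which is the general shape MCP uses for tool definitions. The sketch below deliberately avoids the SDK so it stays self-contained; the description strings, the choice to mark both inputs as required, and the missingArguments helper are assumptions.

```typescript
// Plain-object sketch of the two tool definitions (no SDK calls).
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: "string"; description: string }>;
    required: string[];
  };
}

const tools: ToolDefinition[] = [
  {
    name: "resolve-library-id",
    description: "Search indexed libraries by name and return matching library IDs.",
    inputSchema: {
      type: "object",
      properties: {
        libraryName: { type: "string", description: "Library name to look up" },
        query: { type: "string", description: "What the caller is trying to do" },
      },
      required: ["libraryName", "query"], // assumed: both inputs mandatory
    },
  },
  {
    name: "query-docs",
    description: "Fetch documentation snippets for a library ID and a query.",
    inputSchema: {
      type: "object",
      properties: {
        libraryId: { type: "string", description: "ID such as /owner/repo" },
        query: { type: "string", description: "Question or topic to retrieve" },
      },
      required: ["libraryId", "query"], // assumed: both inputs mandatory
    },
  },
];

// Returns the required arguments that are absent or not non-empty strings.
function missingArguments(tool: ToolDefinition, args: Record<string, unknown>): string[] {
  return tool.inputSchema.required.filter(
    (key) => typeof args[key] !== "string" || args[key] === ""
  );
}
```

Checking arguments before touching the search layer lets the server return a clear error to the assistant instead of an empty result set.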
F9: Web UI — Repository Dashboard
- List all repositories with status, snippet count, last indexed date
- Add/remove repositories (GitHub URL or local path)
- Trigger re-indexing
- View indexing job progress
- View repository config (trueref.json)
F10: Web UI — Search Explorer
- Interactive search interface (resolve library → query docs)
- Preview snippets with syntax highlighting
- View raw document content
F11: trueref.json Config Support
- Parse trueref.json from repo root (or context7.json for compatibility)
- Apply folders, excludeFolders, excludeFiles during crawling
- Inject rules into LLM context alongside snippets
- Support previousVersions for versioned documentation
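How folders, excludeFolders, and excludeFiles interact during crawling is not pinned down above, so the sketch below picks one plausible reading: excludes always win, folders are matched as path prefixes, excludeFiles are matched by exact file name, and an empty folders list means "index everything". All of these rules, and the function names, are assumptions; glob or regex patterns are not handled.

```typescript
// Hypothetical include/exclude filter for crawled file paths.
interface CrawlRules {
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
}

// Strips a leading "./" or "/" and any trailing "/" ("./docs/" -> "docs").
function normalizePath(path: string): string {
  return path.replace(/^\.?\/+/, "").replace(/\/+$/, "");
}

// True when filePath is the folder itself or lies somewhere beneath it.
function isUnder(filePath: string, folder: string): boolean {
  const prefix = normalizePath(folder);
  return prefix === "" || filePath === prefix || filePath.startsWith(prefix + "/");
}

function shouldIndex(filePath: string, rules: CrawlRules): boolean {
  const path = normalizePath(filePath);
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  if (rules.excludeFiles.includes(fileName)) return false;
  if (rules.excludeFolders.some((folder) => isUnder(path, folder))) return false;
  // No include list means every non-excluded file is eligible.
  return rules.folders.length === 0 || rules.folders.some((folder) => isUnder(path, folder));
}
```

Matching on "prefix + /" rather than a bare prefix keeps a folders entry of docs from accidentally including a sibling directory such as docs-old.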
F12: Indexing Pipeline & Job Queue
- SQLite-backed job queue (no external message broker required)
- Sequential processing with progress tracking
- Error recovery and retry logic
- Incremental re-indexing using file checksums
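Checksum-based incremental re-indexing comes down to comparing the SHA-256 stored on each Document with the hash of the file as currently crawled. A minimal sketch, assuming stored checksums are hex strings keyed by file path; planReindex and the ReindexPlan shape are illustrative names, not part of the specification.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a file's text content.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface ReindexPlan {
  added: string[];     // new files: parse and insert
  changed: string[];   // checksum differs: replace document and snippets
  unchanged: string[]; // checksum matches: skip
  removed: string[];   // no longer in the crawl: delete
}

// storedChecksums: filePath -> checksum from the previous index run.
// currentFiles: filePath -> file content from the current crawl.
function planReindex(
  storedChecksums: Map<string, string>,
  currentFiles: Map<string, string>
): ReindexPlan {
  const plan: ReindexPlan = { added: [], changed: [], unchanged: [], removed: [] };
  currentFiles.forEach((content, filePath) => {
    const previous = storedChecksums.get(filePath);
    if (previous === undefined) plan.added.push(filePath);
    else if (previous !== sha256(content)) plan.changed.push(filePath);
    else plan.unchanged.push(filePath);
  });
  storedChecksums.forEach((_checksum, filePath) => {
    if (!currentFiles.has(filePath)) plan.removed.push(filePath);
  });
  return plan;
}
```

Only the added and changed lists need parsing and embedding, which is where most indexing time goes; unchanged files cost one hash each.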
F13: Version Support
- Index specific git tags/branches per repository
- Serve version-specific context when libraryId includes version (/owner/repo/v1.2.3)
- UI for managing available versions
8. API Compatibility with context7
TrueRef's REST API mirrors context7's /api/v2/* interface under the /api/v1/* prefix, so existing context7 clients need only a base-path change:
| context7 Endpoint | TrueRef Endpoint | Notes |
|---|---|---|
| GET /api/v2/libs/search | GET /api/v1/libs/search | Same query params |
| GET /api/v2/context | GET /api/v1/context | Same query params, same response shape |
The MCP tool names and input schemas are identical:
- resolve-library-id with libraryName + query
- query-docs with libraryId + query
Library IDs follow the same convention: /owner/repo or /owner/repo/version.
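Splitting a library ID into its repository and optional version is a small but recurring step in both the REST and MCP paths. The sketch below assumes IDs are exactly /owner/repo or /owner/repo/version with no empty segments; parseLibraryId and LibraryRef are hypothetical names, and how IDs for local filesystem repositories are formed is left open.

```typescript
interface LibraryRef {
  repositoryId: string;   // always "/owner/repo"
  version: string | null; // null means the default branch
}

// Parses "/owner/repo" or "/owner/repo/version"; throws on any other shape.
function parseLibraryId(libraryId: string): LibraryRef {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(libraryId);
  if (!match) {
    throw new Error(`Invalid library ID: ${libraryId}`);
  }
  const [, owner, repo, version] = match;
  return { repositoryId: `/${owner}/${repo}`, version: version ?? null };
}
```

For example, parseLibraryId("/facebook/react/v1.2.3") returns repositoryId "/facebook/react" and version "v1.2.3", while "/facebook/react" returns a null version.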
9. Non-Functional Requirements
Performance
- Library search: < 200ms p99
- Documentation retrieval: < 500ms p99 for 20 snippets
- Indexing throughput: > 1,000 files/minute (GitHub sources may be slower due to API rate limits)
Reliability
- Failed indexing jobs must not corrupt existing indexed data
- Atomic snippet replacement during re-indexing
Portability
- Single SQLite file for all data
- Runs on Linux, macOS, Windows (Node.js 20+)
- No required external services beyond optional embedding API
Scalability (v1 constraints)
- Designed for single-node deployment
- SQLite suitable for up to ~500 repositories, ~500k snippets
10. Milestones & Feature Order
| ID | Feature | Priority | Depends On |
|---|---|---|---|
| TRUEREF-0001 | Database schema & core data models | P0 | — |
| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
| TRUEREF-0013 | trueref.json config file support | P0 | TRUEREF-0003 |
| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
| TRUEREF-0015 | Web UI — repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
| TRUEREF-0016 | Web UI — search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |
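The "Depends On" column defines a dependency graph, and a topological sort is one way to check that it has no cycles and to derive a valid build order. The sketch below copies the dependencies from the table; the buildOrder function is illustrative and ignores the Priority column.

```typescript
// Feature dependencies transcribed from the milestone table.
const dependsOn: Record<string, string[]> = {
  "TRUEREF-0001": [],
  "TRUEREF-0002": ["TRUEREF-0001"],
  "TRUEREF-0003": ["TRUEREF-0001"],
  "TRUEREF-0004": ["TRUEREF-0001"],
  "TRUEREF-0005": ["TRUEREF-0001"],
  "TRUEREF-0006": ["TRUEREF-0005"],
  "TRUEREF-0007": ["TRUEREF-0005"],
  "TRUEREF-0008": ["TRUEREF-0006", "TRUEREF-0007"],
  "TRUEREF-0009": ["TRUEREF-0003", "TRUEREF-0005"],
  "TRUEREF-0010": ["TRUEREF-0006", "TRUEREF-0009"],
  "TRUEREF-0011": ["TRUEREF-0010"],
  "TRUEREF-0012": ["TRUEREF-0011"],
  "TRUEREF-0013": ["TRUEREF-0003"],
  "TRUEREF-0014": ["TRUEREF-0003"],
  "TRUEREF-0015": ["TRUEREF-0002", "TRUEREF-0009"],
  "TRUEREF-0016": ["TRUEREF-0010", "TRUEREF-0015"],
  "TRUEREF-0017": ["TRUEREF-0009"],
  "TRUEREF-0018": ["TRUEREF-0007", "TRUEREF-0015"],
};

// Depth-first topological sort; throws if the graph contains a cycle.
function buildOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (id: string): void => {
    if (state.get(id) === "done") return;
    if (state.get(id) === "visiting") {
      throw new Error(`Dependency cycle at ${id}`);
    }
    state.set(id, "visiting");
    for (const dep of graph[id] ?? []) visit(dep);
    state.set(id, "done");
    order.push(id);
  };
  Object.keys(graph).sort().forEach((id) => visit(id));
  return order;
}
```

Every feature in the table depends only on lower-numbered features, so working through the IDs in numeric order already satisfies all dependencies.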
11. Open Questions
- Embedding provider default: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to install size.
- Streaming responses: Should query-docs support streaming for long responses?
- GitHub private repos: Should we store the GitHub token in the DB or require it per-request?
- context7.json backward compatibility: Should we auto-detect and use context7.json if no trueref.json is present?
- Re-indexing strategy: Automatic periodic re-indexing (cron) vs. manual-only?