trueref/docs/PRD.md
2026-03-27 02:23:01 +01:00

TrueRef — Product Requirements Document

Version: 1.0 | Date: 2026-03-22 | Status: Draft


1. Executive Summary

TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.

The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.


2. Problem Statement

2.1 Context7's Limitations

  • The indexing and crawling backend is entirely private and closed-source.
  • Only public libraries already in the context7.com catalog are available.
  • Private, internal, or niche repositories cannot be added.
  • Data sovereignty: all queries go to context7.com servers.
  • No way to self-host for air-gapped or compliance-constrained environments.

2.2 The Gap

Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.


3. Goals & Non-Goals

Goals

  • Replicate context7's core capabilities: library search, documentation retrieval, MCP tools (resolve-library-id, query-docs). Skill generation is excluded (see Non-Goals).
  • Support both GitHub-hosted and local filesystem repositories.
  • Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
  • Expose a REST API under /api/v1/* that mirrors the query parameters and response shapes of context7's /api/v2/* surface.
  • Ship an MCP server implementing resolve-library-id and query-docs.
  • Provide a web UI for repository management and search exploration.
  • Support trueref.json config files in repos (analogous to context7.json).
  • Support versioned documentation via git tags/branches.
  • Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).

Non-Goals (v1)

  • Authentication & authorization (deferred to a future version).
  • Skill generation (context7 CLI skill feature).
  • Multi-tenant SaaS mode.
  • Pre-built binary releases / Docker image (treated as infrastructure, not product).
  • Paid API tier / rate limiting.
  • Support for non-git version control systems.

4. Users & Personas

Primary: The Developer / Tech Lead

Configures TrueRef, adds repositories, and integrates the MCP server with their AI coding assistant. Technical, and comfortable with the CLI and config files.

Secondary: The AI Coding Assistant

The "user" at query time. Calls resolve-library-id and query-docs via MCP to retrieve documentation snippets for code generation.


5. Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                          TrueRef Platform                          │
│                                                                    │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐  │
│  │   Web UI         │   │   REST API       │   │  MCP Server   │  │
│  │  (SvelteKit)     │   │  /api/v1/*       │   │  (stdio/HTTP) │  │
│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘  │
│           │                      │                     │           │
│           └──────────────────────┼─────────────────────┘           │
│                                  │                                  │
│                    ┌─────────────▼──────────────┐                  │
│                    │      Service Layer           │                  │
│                    │  LibraryService              │                  │
│                    │  SearchService               │                  │
│                    │  IndexingService             │                  │
│                    └─────────────┬──────────────┘                  │
│                                  │                                  │
│           ┌──────────────────────┼───────────────────┐             │
│           │                      │                   │             │
│  ┌────────▼────────┐  ┌─────────▼──────┐  ┌────────▼──────────┐  │
│  │  Indexing        │  │  SQLite DB     │  │  Vector/FTS Index │  │
│  │  Pipeline        │  │  (drizzle-orm) │  │  (SQLite FTS5 +   │  │
│  │  Crawler         │  │               │  │   embeddings)      │  │
│  │  Parser          │  │               │  │                    │  │
│  │  Chunker         │  │               │  │                    │  │
│  └────────┬────────┘  └───────────────┘  └────────────────────┘  │
│           │                                                         │
│   ┌───────▼──────────────────────┐                                 │
│   │    Repository Sources         │                                 │
│   │  - GitHub API                 │                                 │
│   │  - Local filesystem           │                                 │
│   └──────────────────────────────┘                                 │
└────────────────────────────────────────────────────────────────────┘

Technology Stack

| Layer | Technology |
| --- | --- |
| Framework | SvelteKit (Node adapter) |
| Language | TypeScript |
| Database | SQLite via better-sqlite3 + drizzle-orm |
| Full-Text Search | SQLite FTS5 |
| Vector Search | SQLite sqlite-vec extension (cosine similarity) |
| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
| MCP Protocol | @modelcontextprotocol/sdk |
| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
| CSS | TailwindCSS v4 |
| Testing | Vitest |
| Linting | ESLint + Prettier |

6. Data Model

6.1 Repositories

A Repository is the top-level entity. It maps to a GitHub repo or local directory.

Repository {
  id          TEXT PRIMARY KEY        -- e.g. "/facebook/react"
  title       TEXT NOT NULL           -- display name
  description TEXT
  source      TEXT NOT NULL           -- "github" | "local"
  sourceUrl   TEXT                    -- GitHub URL or local path
  branch      TEXT                    -- default branch
  state       TEXT NOT NULL           -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets INTEGER DEFAULT 0
  totalTokens   INTEGER DEFAULT 0
  trustScore    REAL DEFAULT 0
  stars         INTEGER
  lastIndexedAt DATETIME
  createdAt     DATETIME
  updatedAt     DATETIME
}

6.2 Repository Versions

RepositoryVersion {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  tag          TEXT NOT NULL          -- git tag or branch name
  title        TEXT
  state        TEXT                   -- "pending" | "indexed" | "error"
  indexedAt    DATETIME
}

6.3 Documents (parsed files)

Document {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
  filePath     TEXT NOT NULL
  title        TEXT
  content      TEXT NOT NULL          -- raw markdown/code
  language     TEXT                   -- programming language if code file
  tokenCount   INTEGER
  checksum     TEXT                   -- SHA-256 for change detection
  indexedAt    DATETIME
}

6.4 Snippets (indexed chunks)

Snippet {
  id           TEXT PRIMARY KEY
  documentId   TEXT FK → Document
  repositoryId TEXT FK → Repository
  type         TEXT NOT NULL          -- "code" | "info"
  title        TEXT
  content      TEXT NOT NULL          -- the actual searchable text/code
  language     TEXT
  breadcrumb   TEXT                   -- heading hierarchy path
  tokenCount   INTEGER
  embedding    BLOB                   -- float32[] stored as blob
  createdAt    DATETIME
}

6.5 Indexing Jobs

IndexingJob {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT
  status       TEXT                   -- "queued" | "running" | "done" | "failed"
  progress     INTEGER DEFAULT 0      -- 0-100
  totalFiles   INTEGER
  processedFiles INTEGER
  error        TEXT
  startedAt    DATETIME
  completedAt  DATETIME
  createdAt    DATETIME
}

6.6 Repository Configuration (trueref.json)

RepositoryConfig {
  repositoryId  TEXT FK → Repository
  projectTitle  TEXT
  description   TEXT
  folders       TEXT[]                -- include paths
  excludeFolders TEXT[]
  excludeFiles  TEXT[]
  rules         TEXT[]                -- best practices for LLMs
  previousVersions { tag, title }[]
}
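
As a rough illustration of how a trueref.json file could be read into the shape above, the TypeScript sketch below validates parsed JSON and fills in defaults. The names `TrueRefConfig` and `parseTrueRefConfig`, the choice to default missing lists to empty arrays, and the fallback of a missing version title to its tag are all assumptions, not part of this specification.

```typescript
// Hypothetical in-memory shape for trueref.json; field names follow section 6.6.
interface TrueRefConfig {
  projectTitle?: string;
  description?: string;
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
  rules: string[];
  previousVersions: { tag: string; title: string }[];
}

// Returns a string array, treating a missing field as an empty list (assumption).
function stringArray(value: unknown, field: string): string[] {
  if (value === undefined) return [];
  if (!Array.isArray(value) || !value.every((v) => typeof v === "string")) {
    throw new Error(`trueref.json: "${field}" must be an array of strings`);
  }
  return value as string[];
}

function optionalString(value: unknown, field: string): string | undefined {
  if (value === undefined) return undefined;
  if (typeof value === "string") return value;
  throw new Error(`trueref.json: "${field}" must be a string`);
}

function parseTrueRefConfig(jsonText: string): TrueRefConfig {
  const raw: unknown = JSON.parse(jsonText);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("trueref.json: top-level value must be an object");
  }
  const obj = raw as Record<string, unknown>;

  const versions = obj.previousVersions ?? [];
  if (!Array.isArray(versions)) {
    throw new Error('trueref.json: "previousVersions" must be an array');
  }
  const previousVersions = versions.map((entry: unknown, i: number) => {
    const tag = (entry as { tag?: unknown } | null)?.tag;
    const title = (entry as { title?: unknown } | null)?.title;
    if (typeof tag !== "string") {
      throw new Error(`trueref.json: previousVersions[${i}].tag must be a string`);
    }
    // A missing title falls back to the tag itself (assumption).
    return { tag, title: typeof title === "string" ? title : tag };
  });

  return {
    projectTitle: optionalString(obj.projectTitle, "projectTitle"),
    description: optionalString(obj.description, "description"),
    folders: stringArray(obj.folders, "folders"),
    excludeFolders: stringArray(obj.excludeFolders, "excludeFolders"),
    excludeFiles: stringArray(obj.excludeFiles, "excludeFiles"),
    rules: stringArray(obj.rules, "rules"),
    previousVersions,
  };
}
```

Failing loudly on malformed fields, rather than silently ignoring them, is one possible design choice; a crawler could equally log a warning and fall back to defaults.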

7. Core Features

F1: Repository Management

Add, remove, update, and list repositories. Support GitHub (public, or private via token) and local filesystem sources. Trigger indexing on demand or on a schedule.

F2: GitHub Crawler

Fetch repository file trees via GitHub Trees API. Download file contents. Respect trueref.json include/exclude rules. Support rate limiting and incremental re-indexing (checksum-based).
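
Checksum-based incremental re-indexing can be pictured as a diff between the checksums stored at the last index and those seen in the current crawl. The sketch below is a minimal illustration under that assumption; `FileEntry`, `ReindexPlan`, and `planReindex` are hypothetical names, and the checksums are assumed to be precomputed (for example, the SHA-256 values stored on Document records).

```typescript
interface FileEntry {
  filePath: string;
  checksum: string; // e.g. a hex-encoded SHA-256 digest
}

interface ReindexPlan {
  added: string[];     // new files: parse and index
  changed: string[];   // checksum differs: re-parse and replace snippets
  removed: string[];   // no longer present: delete documents and snippets
  unchanged: string[]; // identical checksum: skip
}

// Compares the previously indexed files with the freshly crawled ones.
function planReindex(indexed: FileEntry[], crawled: FileEntry[]): ReindexPlan {
  const previous = new Map<string, string>();
  for (const file of indexed) previous.set(file.filePath, file.checksum);

  const plan: ReindexPlan = { added: [], changed: [], removed: [], unchanged: [] };
  const seen = new Set<string>();

  for (const file of crawled) {
    seen.add(file.filePath);
    const oldChecksum = previous.get(file.filePath);
    if (oldChecksum === undefined) plan.added.push(file.filePath);
    else if (oldChecksum !== file.checksum) plan.changed.push(file.filePath);
    else plan.unchanged.push(file.filePath);
  }
  for (const file of indexed) {
    if (!seen.has(file.filePath)) plan.removed.push(file.filePath);
  }
  return plan;
}
```

Only the `added` and `changed` lists would need file contents downloaded and parsed, which is where the savings against GitHub rate limits would come from.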

F3: Local Filesystem Crawler

Walk directory trees. Apply include/exclude rules from trueref.json. Watch for file changes (optional).

F4: Document Parser & Chunker

  • Parse Markdown files into sections (heading-based splitting).
  • Extract code blocks from Markdown.
  • Parse standalone code files into function/class-level chunks.
  • Calculate token counts.
  • Produce structured Snippet records (type: "code" or "info").
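
The Markdown path of this feature could look roughly like the sketch below: headings delimit "info" snippets, fenced blocks become "code" snippets, and the heading stack supplies the breadcrumb. It is a simplified illustration, not the planned parser; `ParsedSnippet`, `chunkMarkdown`, and the " > " breadcrumb separator are assumptions, only ATX headings and backtick fences are handled, and token counting is left out.

```typescript
type SnippetType = "code" | "info";

interface ParsedSnippet {
  type: SnippetType;
  title: string;           // nearest heading
  breadcrumb: string;      // heading hierarchy, e.g. "Guide > Install"
  language: string | null; // fence info string for code snippets
  content: string;
}

function chunkMarkdown(markdown: string): ParsedSnippet[] {
  const snippets: ParsedSnippet[] = [];
  const headings: string[] = []; // heading stack; index = level - 1
  let prose: string[] = [];
  let code: string[] | null = null; // non-null while inside a fenced block
  let codeLang: string | null = null;

  const context = () => {
    const trail = headings.filter(Boolean); // skips unset levels
    return { title: trail[trail.length - 1] ?? "", breadcrumb: trail.join(" > ") };
  };
  const flushProse = () => {
    const text = prose.join("\n").trim();
    if (text) snippets.push({ type: "info", ...context(), language: null, content: text });
    prose = [];
  };
  const flushCode = () => {
    if (code !== null) {
      snippets.push({ type: "code", ...context(), language: codeLang, content: code.join("\n") });
    }
    code = null;
  };

  for (const line of markdown.split(/\r?\n/)) {
    const fence = /^```\s*([\w+-]*)\s*$/.exec(line);
    if (code !== null) {
      if (fence) flushCode(); // closing fence
      else code.push(line);
      continue;
    }
    if (fence) {
      flushProse(); // keep prose and code in document order
      code = [];
      codeLang = fence[1] || null;
      continue;
    }
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flushProse();
      const level = heading[1].length;
      headings.length = level - 1; // drop headings at this level and deeper
      headings[level - 1] = heading[2].trim();
      continue;
    }
    prose.push(line);
  }
  flushCode(); // tolerate an unterminated fence at end of file
  flushProse();
  return snippets;
}
```

Standalone code files would need a language-aware splitter for function- and class-level chunks, which this sketch does not attempt.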

F5: Embedding & Vector Storage

  • Generate embeddings for each snippet using a pluggable embeddings backend.
  • Store embeddings as binary blobs in SQLite (sqlite-vec).
  • Support fallback to FTS5-only search when no embedding provider is configured.
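
To make the "float32[] stored as blob" idea concrete, the sketch below packs a vector into raw bytes, restores it, and computes cosine similarity in plain TypeScript. The function names are hypothetical and the byte layout is simply the platform's native float32 order; the exact format expected by sqlite-vec should be checked against its documentation, and in a real deployment the similarity search would be delegated to the extension rather than done in application code.

```typescript
// Packs a vector into 4 bytes per component (native byte order).
function embeddingToBlob(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector);
  return new Uint8Array(floats.buffer);
}

// Restores a Float32Array from bytes produced by embeddingToBlob.
function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("Embedding blob length must be a multiple of 4 bytes");
  }
  // Copy first so the backing buffer starts at offset 0 and is 4-byte aligned.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

// Cosine similarity in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Because the blob stores raw floats, switching embedding providers (and therefore dimensions) would require re-embedding existing snippets; the dimension check in `cosineSimilarity` is one place such a mismatch would surface.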

F6: Semantic Search Engine

  • Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
  • Query-time retrieval: given libraryId + query, return ranked snippets.
  • Library search: given libraryName + query, return matching repositories.
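
Reciprocal rank fusion combines the two result lists using only rank positions: each snippet scores the sum of 1 / (k + rank) over every list it appears in. The sketch below shows the idea; the function name is hypothetical, and k = 60 is the commonly used default from the RRF literature rather than a value fixed by this document.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// rankings: one array of snippet IDs per retriever, best match first.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    // Highest score first; ties broken by ID for deterministic output.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}
```

Since only ranks are used, the cosine scores and BM25 scores never need to be normalised against each other. When no embedding provider is configured, passing a single FTS5 ranking reduces this to the keyword order.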

F7: REST API (/api/v1/*)

  • GET /api/v1/libs/search?query=&libraryName= — search libraries (context7-compatible)
  • GET /api/v1/context?query=&libraryId=&type=json|txt — fetch documentation
  • GET /api/v1/libs — list all indexed libraries
  • POST /api/v1/libs — add a new repository
  • DELETE /api/v1/libs/:id — remove a repository
  • POST /api/v1/libs/:id/index — trigger re-indexing
  • GET /api/v1/jobs/:id — get indexing job status
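
For client authors, a request to the context endpoint could be assembled as in the sketch below. The helper name `buildContextUrl` and the example host are hypothetical; only the path and the query, libraryId, and type parameters come from the list above.

```typescript
type ContextFormat = "json" | "txt";

// Builds the URL for GET /api/v1/context with properly encoded parameters.
function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  type: ContextFormat = "json",
): string {
  const params = [
    ["query", query],
    ["libraryId", libraryId],
    ["type", type],
  ]
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Strip trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/context?${params}`;
}
```

Encoding matters here because library IDs contain slashes; a call such as `buildContextUrl("http://localhost:3000", "/facebook/react", "use effect cleanup", "txt")` yields a URL that can be passed to any HTTP client.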

F8: MCP Server

  • Tool: resolve-library-id — search for libraries by name
  • Tool: query-docs — fetch documentation by libraryId + query
  • Transport: stdio (primary), HTTP (optional)
  • Compatible with Claude Code, Cursor, and other MCP-aware tools
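
The input contract of the two tools can be summarised in a few lines of plain TypeScript. This sketch deliberately does not use @modelcontextprotocol/sdk; `TOOL_INPUTS` and `validateToolCall` are hypothetical names, and treating every listed field as a required non-empty string is an assumption.

```typescript
// Required string arguments per tool, as named in this document.
const TOOL_INPUTS = {
  "resolve-library-id": ["libraryName", "query"],
  "query-docs": ["libraryId", "query"],
} as const;

type ToolName = keyof typeof TOOL_INPUTS;

interface ValidatedCall {
  tool: ToolName;
  args: Record<string, string>;
}

// Checks the tool name and arguments before dispatching to the service layer.
function validateToolCall(tool: string, args: Record<string, unknown>): ValidatedCall {
  // hasOwnProperty avoids matching inherited keys such as "toString".
  if (!Object.prototype.hasOwnProperty.call(TOOL_INPUTS, tool)) {
    throw new Error(`Unknown tool: ${tool}`);
  }
  const name = tool as ToolName;
  const checked: Record<string, string> = {};
  for (const field of TOOL_INPUTS[name]) {
    const value = args[field];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`${name}: "${field}" must be a non-empty string`);
    }
    checked[field] = value;
  }
  return { tool: name, args: checked };
}
```

In an actual server, the SDK's own schema mechanism would normally perform this validation; the sketch only shows which fields each tool expects.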

F9: Web UI — Repository Dashboard

  • List all repositories with status, snippet count, last indexed date
  • Add/remove repositories (GitHub URL or local path)
  • Trigger re-indexing
  • View indexing job progress
  • View repository config (trueref.json)

F10: Web UI — Search Explorer

  • Interactive search interface (resolve library → query docs)
  • Preview snippets with syntax highlighting
  • View raw document content

F11: trueref.json Config Support

  • Parse trueref.json from the repo root (falling back to context7.json for compatibility, pending Open Question 4)
  • Apply folders, excludeFolders, excludeFiles during crawling
  • Inject rules into LLM context alongside snippets
  • Support previousVersions for versioned documentation
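
One plausible reading of the include/exclude rules is sketched below. This document does not define the matching semantics, so the following are assumptions: `folders` and `excludeFolders` are path prefixes relative to the repo root, `excludeFiles` matches file names exactly, excludes win over includes, and an empty `folders` list means "include everything". `CrawlRules` and `shouldIndex` are hypothetical names.

```typescript
interface CrawlRules {
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
}

// Normalises "./docs/", "/docs", and "docs" to "docs".
function normalizePath(path: string): string {
  return path.replace(/^\.?\/+/, "").replace(/\/+$/, "");
}

// True if filePath is the folder itself or lies somewhere beneath it.
function isUnder(filePath: string, folder: string): boolean {
  const prefix = normalizePath(folder);
  // An empty prefix denotes the repository root, which contains every file.
  return prefix === "" || filePath === prefix || filePath.startsWith(prefix + "/");
}

function shouldIndex(filePath: string, rules: CrawlRules): boolean {
  const path = normalizePath(filePath);
  const fileName = path.slice(path.lastIndexOf("/") + 1);

  if (rules.excludeFiles.includes(fileName)) return false;
  if (rules.excludeFolders.some((folder) => isUnder(path, folder))) return false;
  if (rules.folders.length === 0) return true;
  return rules.folders.some((folder) => isUnder(path, folder));
}
```

Matching on whole path segments (the `prefix + "/"` check) keeps a rule for `docs` from accidentally covering `docs-old`. Glob or regex support would be a natural extension but is not assumed here.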

F12: Indexing Pipeline & Job Queue

  • SQLite-backed job queue (no external message broker required)
  • Sequential processing with progress tracking
  • Error recovery and retry logic
  • Incremental re-indexing using file checksums
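
The sequential processing, progress tracking, and retry behaviour listed above might be sketched as follows. The status and progress fields mirror the IndexingJob model in section 6.5; `runJob`, the synchronous `processFile` callback, and the default of three attempts per file are assumptions made to keep the example small. A real queue would persist each state change to SQLite and run asynchronously.

```typescript
type JobStatus = "queued" | "running" | "done" | "failed";

interface IndexingJob {
  id: string;
  status: JobStatus;
  progress: number; // 0-100
  totalFiles: number;
  processedFiles: number;
  error: string | null;
}

// Processes files one at a time, retrying each up to maxAttempts times.
// Returns the final job state without mutating the input.
function runJob(
  job: IndexingJob,
  files: string[],
  processFile: (filePath: string) => void,
  maxAttempts = 3,
): IndexingJob {
  let current: IndexingJob = {
    ...job,
    status: "running",
    totalFiles: files.length,
    processedFiles: 0,
    progress: 0,
    error: null,
  };

  for (let i = 0; i < files.length; i++) {
    for (let attempt = 1; ; attempt++) {
      try {
        processFile(files[i]);
        break; // success: move on to the next file
      } catch (err) {
        if (attempt >= maxAttempts) {
          const message = err instanceof Error ? err.message : String(err);
          // Progress recorded so far is kept for diagnostics.
          return { ...current, status: "failed", error: message };
        }
      }
    }
    const processedFiles = i + 1;
    const progress = Math.floor((processedFiles / files.length) * 100);
    current = { ...current, processedFiles, progress };
  }
  return { ...current, status: "done", progress: 100 };
}
```

Retrying per file rather than per job is one possible policy; it avoids redoing completed work after a transient failure, at the cost of more bookkeeping.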

F13: Version Support

  • Index specific git tags/branches per repository
  • Serve version-specific context when the libraryId includes a version (for example, /owner/repo/v1.2.3)
  • UI for managing available versions

8. API Compatibility with context7

TrueRef's REST API mirrors context7's /api/v2/* interface under the /api/v1/* prefix, so an existing client needs to change only its base URL and path prefix:

| context7 Endpoint | TrueRef Endpoint | Notes |
| --- | --- | --- |
| GET /api/v2/libs/search | GET /api/v1/libs/search | Same query params |
| GET /api/v2/context | GET /api/v1/context | Same query params, same response shape |

The MCP tool names and input schemas are identical:

  • resolve-library-id with libraryName + query
  • query-docs with libraryId + query

Library IDs follow the same convention: /owner/repo or /owner/repo/version.
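
A small parser for this ID convention might look like the sketch below. `LibraryRef` and `parseLibraryId` are hypothetical names; the sketch assumes exactly two or three slash-separated segments and does not cover how local-filesystem repositories are identified, which this document leaves open.

```typescript
interface LibraryRef {
  repositoryId: string;   // "/owner/repo"
  version: string | null; // e.g. "v1.2.3", or null for the default branch
}

// Splits "/owner/repo" or "/owner/repo/version" into its parts.
function parseLibraryId(libraryId: string): LibraryRef {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(libraryId);
  if (!match) {
    throw new Error(`Invalid library ID: ${libraryId}`);
  }
  const [, owner, repo, version] = match;
  return { repositoryId: `/${owner}/${repo}`, version: version ?? null };
}
```

A null version maps naturally onto the nullable versionId on Document records, where null denotes the default branch.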


9. Non-Functional Requirements

Performance

  • Library search: < 200ms p99
  • Documentation retrieval: < 500ms p99 for 20 snippets
  • Indexing throughput: > 1,000 files/minute, subject to GitHub API rate limits for GitHub-hosted sources

Reliability

  • Failed indexing jobs must not corrupt existing indexed data
  • Atomic snippet replacement during re-indexing
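
The two reliability requirements amount to "build the replacement completely, then swap it in as one step". The in-memory sketch below illustrates that pattern only; `SnippetStore` is a hypothetical class, and in the actual system the same guarantee would more likely come from a single SQLite transaction that deletes and inserts snippets together.

```typescript
class SnippetStore {
  private byRepository = new Map<string, string[]>();

  get(repositoryId: string): string[] {
    return this.byRepository.get(repositoryId) ?? [];
  }

  // Runs the builder to completion before touching stored data.
  // If the builder throws, the previous snippets remain visible unchanged.
  replace(repositoryId: string, build: () => string[]): void {
    const next = build(); // may throw; nothing has been modified yet
    this.byRepository.set(repositoryId, next); // single swap
  }
}
```

Readers therefore see either the old snippet set or the new one, never a partially re-indexed mixture.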

Portability

  • Single SQLite file for all data
  • Runs on Linux, macOS, Windows (Node.js 20+)
  • No required external services beyond optional embedding API

Scalability (v1 constraints)

  • Designed for single-node deployment
  • SQLite is expected to be suitable for up to ~500 repositories and ~500k snippets

10. Milestones & Feature Order

| ID | Feature | Priority | Depends On |
| --- | --- | --- | --- |
| TRUEREF-0001 | Database schema & core data models | P0 | |
| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
| TRUEREF-0013 | trueref.json config file support | P0 | TRUEREF-0003 |
| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
| TRUEREF-0015 | Web UI: repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
| TRUEREF-0016 | Web UI: search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |

11. Open Questions

  1. Embedding provider default: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to the install size.
  2. Streaming responses: Should query-docs support streaming for long responses?
  3. GitHub private repos: Should we store the GitHub token in the DB or require it per-request?
  4. context7.json backward compatibility: Should we auto-detect and use context7.json if no trueref.json is present?
  5. Re-indexing strategy: Automatic periodic re-indexing (cron) vs. manual-only?