# TrueRef — Product Requirements Document

**Version:** 1.0
**Date:** 2026-03-22
**Status:** Draft

---

## 1. Executive Summary

TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.

The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.

---

## 2. Problem Statement

### 2.1 Context7's Limitations
- The indexing and crawling backend is entirely private and closed-source.
- Only public libraries already in the context7.com catalog are available.
- Private, internal, or niche repositories cannot be added.
- Data sovereignty: all queries go to context7.com servers.
- No way to self-host for air-gapped or compliance-constrained environments.

### 2.2 The Gap
Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.

---

## 3. Goals & Non-Goals

### Goals
- Replicate all context7 capabilities: library search, documentation retrieval, MCP tools (`resolve-library-id`, `query-docs`).
- Support both GitHub-hosted and local filesystem repositories.
- Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
- Expose a REST API under `/api/v1/*` that matches the query parameters and response shapes of context7's `/api/v2/*` surface.
- Ship an MCP server implementing `resolve-library-id` and `query-docs`.
- Provide a web UI for repository management and search exploration.
- Support `trueref.json` config files in repos (analogous to `context7.json`).
- Support versioned documentation via git tags/branches.
- Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).

### Non-Goals (v1)
- Authentication & authorization (deferred to a future version).
- Skill generation (context7 CLI skill feature).
- Multi-tenant SaaS mode.
- Binary pre-built releases / Docker image (infrastructure, not product).
- Paid API tier / rate limiting.
- Support for non-git version control systems.

---

## 4. Users & Personas

### Primary: The Developer / Tech Lead
Configures TrueRef, adds repositories, and integrates the MCP server with their AI coding assistant. Technical and comfortable with CLIs and config files.

### Secondary: The AI Coding Assistant
The "user" at query time. Calls `resolve-library-id` and `query-docs` via MCP to retrieve documentation snippets for code generation.

---

## 5. Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                          TrueRef Platform                          │
│                                                                    │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐  │
│  │   Web UI         │   │   REST API       │   │  MCP Server   │  │
│  │  (SvelteKit)     │   │  /api/v1/*       │   │  (stdio/HTTP) │  │
│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘  │
│           │                      │                     │           │
│           └──────────────────────┼─────────────────────┘           │
│                                  │                                  │
│                    ┌─────────────▼──────────────┐                  │
│                    │      Service Layer           │                  │
│                    │  LibraryService              │                  │
│                    │  SearchService               │                  │
│                    │  IndexingService             │                  │
│                    └─────────────┬──────────────┘                  │
│                                  │                                  │
│           ┌──────────────────────┼───────────────────┐             │
│           │                      │                   │             │
│  ┌────────▼────────┐  ┌─────────▼──────┐  ┌────────▼──────────┐  │
│  │  Indexing        │  │  SQLite DB     │  │  Vector/FTS Index │  │
│  │  Pipeline        │  │  (drizzle-orm) │  │  (SQLite FTS5 +   │  │
│  │  Crawler         │  │               │  │   embeddings)      │  │
│  │  Parser          │  │               │  │                    │  │
│  │  Chunker         │  │               │  │                    │  │
│  └────────┬────────┘  └───────────────┘  └────────────────────┘  │
│           │                                                         │
│   ┌───────▼──────────────────────┐                                 │
│   │    Repository Sources         │                                 │
│   │  - GitHub API                 │                                 │
│   │  - Local filesystem           │                                 │
│   └──────────────────────────────┘                                 │
└────────────────────────────────────────────────────────────────────┘
```

### Technology Stack
| Layer | Technology |
|-------|-----------|
| Framework | SvelteKit (Node adapter) |
| Language | TypeScript |
| Database | SQLite via better-sqlite3 + drizzle-orm |
| Full-Text Search | SQLite FTS5 |
| Vector Search | SQLite `sqlite-vec` extension (cosine similarity) |
| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
| MCP Protocol | `@modelcontextprotocol/sdk` |
| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
| CSS | TailwindCSS v4 |
| Testing | Vitest |
| Linting | ESLint + Prettier |

---

## 6. Data Model

### 6.1 Repositories
A `Repository` is the top-level entity. It maps to a GitHub repo or local directory.

```
Repository {
  id          TEXT PRIMARY KEY        -- e.g. "/facebook/react"
  title       TEXT NOT NULL           -- display name
  description TEXT
  source      TEXT NOT NULL           -- "github" | "local"
  sourceUrl   TEXT                    -- GitHub URL or local path
  branch      TEXT                    -- default branch
  state       TEXT NOT NULL           -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets INTEGER DEFAULT 0
  totalTokens   INTEGER DEFAULT 0
  trustScore    REAL DEFAULT 0
  stars         INTEGER
  lastIndexedAt DATETIME
  createdAt     DATETIME
  updatedAt     DATETIME
}
```

### 6.2 Repository Versions
```
RepositoryVersion {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  tag          TEXT NOT NULL          -- git tag or branch name
  title        TEXT
  state        TEXT                   -- "pending" | "indexed" | "error"
  indexedAt    DATETIME
}
```

### 6.3 Documents (parsed files)
```
Document {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
  filePath     TEXT NOT NULL
  title        TEXT
  content      TEXT NOT NULL          -- raw markdown/code
  language     TEXT                   -- programming language if code file
  tokenCount   INTEGER
  checksum     TEXT                   -- SHA-256 for change detection
  indexedAt    DATETIME
}
```

### 6.4 Snippets (indexed chunks)
```
Snippet {
  id           TEXT PRIMARY KEY
  documentId   TEXT FK → Document
  repositoryId TEXT FK → Repository
  type         TEXT NOT NULL          -- "code" | "info"
  title        TEXT
  content      TEXT NOT NULL          -- the actual searchable text/code
  language     TEXT
  breadcrumb   TEXT                   -- heading hierarchy path
  tokenCount   INTEGER
  embedding    BLOB                   -- float32[] stored as blob
  createdAt    DATETIME
}
```

### 6.5 Indexing Jobs
```
IndexingJob {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT
  status       TEXT                   -- "queued" | "running" | "done" | "failed"
  progress     INTEGER DEFAULT 0      -- 0-100
  totalFiles   INTEGER
  processedFiles INTEGER
  error        TEXT
  startedAt    DATETIME
  completedAt  DATETIME
  createdAt    DATETIME
}
```
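
The schema stores `progress` next to `totalFiles` and `processedFiles` but does not say how the three relate. A minimal sketch, assuming `progress` is simply the share of processed files (the `computeProgress` helper and the `JobCounters` type are hypothetical names, not part of the schema):

```typescript
// Subset of IndexingJob columns needed to derive progress.
interface JobCounters {
  totalFiles: number | null;
  processedFiles: number | null;
}

// Assumption: progress (0-100) is the percentage of files processed so far.
// Integer arithmetic (processed * 100 / total) avoids floating-point
// surprises such as 0.29 * 100 evaluating to 28.999999999999996.
function computeProgress(job: JobCounters): number {
  if (!job.totalFiles || job.totalFiles <= 0) return 0;
  const processed = Math.min(Math.max(job.processedFiles ?? 0, 0), job.totalFiles);
  return Math.floor((processed * 100) / job.totalFiles);
}
```

A job with an unknown or zero `totalFiles` reports 0 rather than dividing by zero, which keeps the value safe to render while the crawler is still listing files.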

### 6.6 Repository Configuration (`trueref.json`)
```
RepositoryConfig {
  repositoryId  TEXT FK → Repository
  projectTitle  TEXT
  description   TEXT
  folders       TEXT[]                -- include paths
  excludeFolders TEXT[]
  excludeFiles  TEXT[]
  rules         TEXT[]                -- best practices for LLMs
  previousVersions { tag, title }[]
}
```
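
To make the shape concrete, the following is a minimal sketch of how a `trueref.json` file could be parsed into the fields listed above. The `parseTrueRefConfig` and `stringArray` helpers are hypothetical, and the defaulting behaviour (missing arrays become empty, malformed entries are dropped) is an assumption rather than a decided requirement:

```typescript
interface PreviousVersion {
  tag: string;
  title: string;
}

// Field names follow the RepositoryConfig model; repositoryId is assigned by the caller.
interface TrueRefConfig {
  projectTitle: string | null;
  description: string | null;
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
  rules: string[];
  previousVersions: PreviousVersion[];
}

// Keep only the string entries of a value that should be a string array.
function stringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

// Hypothetical helper: parse raw JSON text and fill in defaults for missing fields.
function parseTrueRefConfig(raw: string): TrueRefConfig {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("trueref.json must contain a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const versions: unknown[] = Array.isArray(obj.previousVersions) ? obj.previousVersions : [];
  return {
    projectTitle: typeof obj.projectTitle === "string" ? obj.projectTitle : null,
    description: typeof obj.description === "string" ? obj.description : null,
    folders: stringArray(obj.folders),
    excludeFolders: stringArray(obj.excludeFolders),
    excludeFiles: stringArray(obj.excludeFiles),
    rules: stringArray(obj.rules),
    // Drop entries that lack a string tag or title.
    previousVersions: versions.filter(
      (v): v is PreviousVersion =>
        typeof v === "object" &&
        v !== null &&
        typeof (v as PreviousVersion).tag === "string" &&
        typeof (v as PreviousVersion).title === "string",
    ),
  };
}
```

Dropping malformed entries keeps a single bad field from blocking indexing; a stricter variant could reject the file and surface the error on the indexing job instead.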

---

## 7. Core Features

### F1: Repository Management
Add, remove, update, and list repositories. Support GitHub (public/private via token) and local filesystem sources. Trigger indexing on demand or on schedule.

### F2: GitHub Crawler
Fetch repository file trees via GitHub Trees API. Download file contents. Respect `trueref.json` include/exclude rules. Support rate limiting and incremental re-indexing (checksum-based).

### F3: Local Filesystem Crawler
Walk directory trees. Apply include/exclude rules from `trueref.json`. Watch for file changes (optional).
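
Both crawlers apply the same include/exclude rules, so the filter can be a pure function shared between them. The sketch below assumes the simplest matching semantics: `folders` and `excludeFolders` are path prefixes, `excludeFiles` are exact file names, and an empty `folders` list means "include everything". Pattern or glob support would be a separate decision; `shouldIndex`, `normalize`, and `isUnder` are hypothetical names.

```typescript
interface CrawlRules {
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
}

// Normalize to forward slashes with no leading "./" or "/" and no trailing "/".
function normalize(path: string): string {
  return path.replace(/\\/g, "/").replace(/^(\.\/|\/)+/, "").replace(/\/+$/, "");
}

// True when `path` is `folder` itself or lies somewhere beneath it.
// Empty folder entries never match.
function isUnder(path: string, folder: string): boolean {
  const f = normalize(folder);
  return f !== "" && (path === f || path.startsWith(f + "/"));
}

// Assumed precedence: exclusions win over inclusions.
function shouldIndex(filePath: string, rules: CrawlRules): boolean {
  const path = normalize(filePath);
  const fileName = path.split("/").pop() ?? "";
  if (rules.excludeFiles.includes(fileName)) return false;
  if (rules.excludeFolders.some((folder) => isUnder(path, folder))) return false;
  const includes = rules.folders.map(normalize).filter((f) => f !== "");
  return includes.length === 0 || includes.some((folder) => isUnder(path, folder));
}
```

Matching on `folder + "/"` rather than a bare prefix keeps `docs` from accidentally including a sibling such as `docs-old`.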

### F4: Document Parser & Chunker
- Parse Markdown files into sections (heading-based splitting).
- Extract code blocks from Markdown.
- Parse standalone code files into function/class-level chunks.
- Calculate token counts.
- Produce structured `Snippet` records (type: "code" or "info").
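
The Markdown path of the chunker can be sketched as a single pass over the file. This is an illustrative outline under several assumptions: only ATX headings (`#` to `######`) and unindented backtick fences are recognized, the breadcrumb separator is `" > "`, and the token count is a rough 4-characters-per-token estimate standing in for a real tokenizer. `SnippetDraft`, `chunkMarkdown`, and `estimateTokens` are hypothetical names.

```typescript
interface SnippetDraft {
  type: "code" | "info";
  title: string;
  breadcrumb: string; // heading hierarchy path
  language: string | null;
  content: string;
  tokenCount: number;
}

// Rough estimate (about 4 characters per token); a real tokenizer would replace this.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function chunkMarkdown(markdown: string): SnippetDraft[] {
  const snippets: SnippetDraft[] = [];
  const headings: string[] = []; // current heading path, index = level - 1
  let prose: string[] = [];
  let code: string[] | null = null; // non-null while inside a fenced block
  let codeLang: string | null = null;

  const push = (type: "code" | "info", lines: string[], language: string | null): void => {
    // Code keeps its indentation; prose is trimmed.
    const content = type === "code" ? lines.join("\n") : lines.join("\n").trim();
    if (content.trim() === "") return;
    const path = headings.filter((h) => Boolean(h));
    snippets.push({
      type,
      title: path.length > 0 ? path[path.length - 1] : "",
      breadcrumb: path.join(" > "),
      language,
      content,
      tokenCount: estimateTokens(content),
    });
  };
  const flushProse = (): void => {
    push("info", prose, null);
    prose = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    const fence = /^`{3,}\s*([\w+-]*)/.exec(line);
    if (code !== null) {
      if (fence) {
        push("code", code, codeLang); // closing fence ends the code snippet
        code = null;
      } else {
        code.push(line);
      }
      continue;
    }
    if (fence) {
      flushProse();
      code = [];
      codeLang = fence[1] || null;
      continue;
    }
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flushProse();
      const level = heading[1].length;
      headings.length = level - 1; // drop deeper levels from the path
      headings[level - 1] = heading[2].trim();
      continue;
    }
    prose.push(line);
  }
  if (code !== null) push("code", code, codeLang); // unterminated fence
  flushProse();
  return snippets;
}
```

Each code block becomes its own `code` snippet and inherits the breadcrumb of the section it appears in, so a retrieved example still carries its heading context. Standalone code files would need a separate, language-aware splitter.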

### F5: Embedding & Vector Storage
- Generate embeddings for each snippet using a pluggable embeddings backend.
- Store embeddings as binary blobs in SQLite (sqlite-vec).
- Support fallback to FTS5-only search when no embedding provider is configured.
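
As an illustration of the blob storage format, the sketch below serializes an embedding to raw float32 bytes and reads it back, with a plain cosine similarity for reference. The little-endian layout and the helper names (`embeddingToBlob`, `blobToEmbedding`, `cosineSimilarity`) are assumptions; in the real system the similarity computation would be delegated to the vector extension rather than done in application code.

```typescript
// Serialize an embedding as raw little-endian float32 bytes for a BLOB column.
function embeddingToBlob(embedding: number[]): Uint8Array {
  const view = new DataView(new ArrayBuffer(embedding.length * 4));
  embedding.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return new Uint8Array(view.buffer);
}

// Read the blob back; DataView tolerates unaligned byte offsets.
function blobToEmbedding(blob: Uint8Array): Float32Array {
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const out = new Float32Array(Math.floor(blob.byteLength / 4));
  for (let i = 0; i < out.length; i++) out[i] = view.getFloat32(i * 4, true);
  return out;
}

// Cosine similarity: dot(a, b) / (|a| * |b|); defined as 0 for zero vectors.
function cosineSimilarity(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

A snippet whose `embedding` column is `NULL` would simply be skipped by vector search, which is one way the FTS5-only fallback could coexist with partially embedded data.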

### F6: Semantic Search Engine
- Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
- Query-time retrieval: given `libraryId + query`, return ranked snippets.
- Library search: given `libraryName + query`, return matching repositories.
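
Reciprocal rank fusion combines the two result lists using ranks only, so BM25 scores and cosine similarities never need to be put on a common scale. A minimal sketch, assuming each retriever returns snippet IDs best-first and using the conventional smoothing constant `k = 60` (the value is a tunable assumption, and `reciprocalRankFusion` is a hypothetical name):

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Each ranking is a list of snippet IDs ordered best-first.
// Fused score of an item = sum over rankings of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score, breaking ties by ID for deterministic output.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

When no embedding provider is configured, passing only the FTS5 ranking leaves its order unchanged, so the same code path can serve both hybrid and keyword-only search.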

### F7: REST API (`/api/v1/*`)
- `GET /api/v1/libs/search?query=&libraryName=` — search libraries (context7-compatible)
- `GET /api/v1/context?query=&libraryId=&type=json|txt` — fetch documentation
- `GET /api/v1/libs` — list all indexed libraries
- `POST /api/v1/libs` — add a new repository
- `DELETE /api/v1/libs/:id` — remove a repository
- `POST /api/v1/libs/:id/index` — trigger re-indexing
- `GET /api/v1/jobs/:id` — get indexing job status
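
For illustration, a client could build the documentation request as follows. The `buildContextUrl` helper and the `http://localhost:3000` base URL in the usage note are hypothetical; only the path and the `query`, `libraryId`, and `type` parameters come from the endpoint list above.

```typescript
type ContextFormat = "json" | "txt";

// Build a GET /api/v1/context URL; URLSearchParams handles the escaping
// of the slashes inside the library ID.
function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  type: ContextFormat = "json",
): string {
  const url = new URL("/api/v1/context", baseUrl);
  url.searchParams.set("query", query);
  url.searchParams.set("libraryId", libraryId);
  url.searchParams.set("type", type);
  return url.toString();
}
```

For example, `buildContextUrl("http://localhost:3000", "/facebook/react", "useEffect cleanup")` yields a URL that can be passed directly to `fetch`.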

### F8: MCP Server
- Tool: `resolve-library-id` — search for libraries by name
- Tool: `query-docs` — fetch documentation by libraryId + query
- Transport: stdio (primary), HTTP (optional)
- Compatible with Claude Code, Cursor, and other MCP-aware tools

### F9: Web UI — Repository Dashboard
- List all repositories with status, snippet count, last indexed date
- Add/remove repositories (GitHub URL or local path)
- Trigger re-indexing
- View indexing job progress
- View repository config (`trueref.json`)

### F10: Web UI — Search Explorer
- Interactive search interface (resolve library → query docs)
- Preview snippets with syntax highlighting
- View raw document content

### F11: `trueref.json` Config Support
- Parse `trueref.json` from the repo root (falling back to `context7.json` for compatibility; see Open Question 4)
- Apply `folders`, `excludeFolders`, `excludeFiles` during crawling
- Inject `rules` into LLM context alongside snippets
- Support `previousVersions` for versioned documentation

### F12: Indexing Pipeline & Job Queue
- SQLite-backed job queue (no external message broker required)
- Sequential processing with progress tracking
- Error recovery and retry logic
- Incremental re-indexing using file checksums
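
Checksum-based incremental re-indexing reduces to a diff between the checksums stored on `Document` rows and the files seen in the latest crawl. A minimal sketch using Node's built-in `crypto` module; the `planReindex` function and the `IndexPlan` shape are hypothetical:

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest, matching the Document.checksum column.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface IndexPlan {
  added: string[];
  changed: string[];
  unchanged: string[];
  removed: string[];
}

// stored:  filePath -> checksum from the previous index run
// crawled: filePath -> file content from the current crawl
function planReindex(stored: Map<string, string>, crawled: Map<string, string>): IndexPlan {
  const plan: IndexPlan = { added: [], changed: [], unchanged: [], removed: [] };
  crawled.forEach((content, filePath) => {
    const previous = stored.get(filePath);
    if (previous === undefined) plan.added.push(filePath);
    else if (previous === sha256(content)) plan.unchanged.push(filePath);
    else plan.changed.push(filePath);
  });
  stored.forEach((_checksum, filePath) => {
    if (!crawled.has(filePath)) plan.removed.push(filePath);
  });
  return plan;
}
```

Only `added` and `changed` files need to be parsed and embedded again; `removed` files have their documents and snippets deleted, ideally in the same transaction so that a failed job leaves the previous index intact.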

### F13: Version Support
- Index specific git tags/branches per repository
- Serve version-specific context when the `libraryId` includes a version segment (e.g. `/owner/repo/v1.2.3`)
- UI for managing available versions

---

## 8. API Compatibility with context7

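TrueRef's REST API mirrors context7's `/api/v2/*` interface. Only the path prefix differs (`/api/v1` instead of `/api/v2`), so an existing client can switch by changing its base URL and path prefix: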

| context7 Endpoint | TrueRef Endpoint | Notes |
|-------------------|-----------------|-------|
| `GET /api/v2/libs/search` | `GET /api/v1/libs/search` | Same query params |
| `GET /api/v2/context` | `GET /api/v1/context` | Same query params, same response shape |

The MCP tool names and input schemas are identical:
- `resolve-library-id` with `libraryName` + `query`
- `query-docs` with `libraryId` + `query`

Library IDs follow the same convention: `/owner/repo` or `/owner/repo/version`.
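
A small parser makes the ID convention explicit. This sketch assumes exactly two or three slash-separated segments and treats anything else as invalid; `parseLibraryId` and `LibraryRef` are hypothetical names.

```typescript
interface LibraryRef {
  owner: string;
  repo: string;
  version: string | null; // null means the default branch
  repositoryId: string; // "/owner/repo", the Repository primary key
}

// Accepts "/owner/repo" or "/owner/repo/version".
function parseLibraryId(libraryId: string): LibraryRef {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(libraryId);
  if (!match) throw new Error(`Invalid library ID: ${libraryId}`);
  const owner = match[1];
  const repo = match[2];
  const version: string | undefined = match[3];
  return { owner, repo, version: version ?? null, repositoryId: `/${owner}/${repo}` };
}
```

Splitting out `repositoryId` lets the service layer look up the repository first and then resolve the optional version against its `RepositoryVersion` rows.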

---

## 9. Non-Functional Requirements

### Performance
- Library search: < 200ms p99
- Documentation retrieval: < 500ms p99 for 20 snippets
- Indexing throughput: > 1,000 files/minute for the pipeline itself; GitHub-sourced crawls are additionally bounded by GitHub API rate limits

### Reliability
- Failed indexing jobs must not corrupt existing indexed data
- Atomic snippet replacement during re-indexing

### Portability
- Single SQLite file for all data
- Runs on Linux, macOS, Windows (Node.js 20+)
- No required external services beyond optional embedding API

### Scalability (v1 constraints)
- Designed for single-node deployment
- SQLite is expected to handle up to ~500 repositories and ~500k snippets

---

## 10. Milestones & Feature Order

| ID | Feature | Priority | Depends On |
|----|---------|----------|-----------|
| TRUEREF-0001 | Database schema & core data models | P0 | — |
| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
| TRUEREF-0013 | `trueref.json` config file support | P0 | TRUEREF-0003 |
| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
| TRUEREF-0015 | Web UI — repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
| TRUEREF-0016 | Web UI — search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |

---

## 11. Open Questions

1. **Embedding provider default**: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to the install size.
2. **Streaming responses**: Should `query-docs` support streaming for long responses?
3. **GitHub private repos**: Should we store the GitHub token in the DB or require it per-request?
4. **context7.json backward compatibility**: Should we auto-detect and use `context7.json` if no `trueref.json` is present?
5. **Re-indexing strategy**: Automatic periodic re-indexing (cron) vs. manual-only?