mozempk/trueref


Giancarmine Salucci 18437dfa7c chore: initial project scaffold

2026-03-22 17:08:15 +01:00


TrueRef — Product Requirements Document

Version: 1.0
Date: 2026-03-22
Status: Draft

1. Executive Summary

TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.

The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.

2. Problem Statement

2.1 Context7's Limitations

The indexing and crawling backend is entirely private and closed-source.
Only public libraries already in the context7.com catalog are available.
Private, internal, or niche repositories cannot be added.
No data sovereignty: all queries go to context7.com servers.
No way to self-host for air-gapped or compliance-constrained environments.

2.2 The Gap

Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.

3. Goals & Non-Goals

Goals

Replicate context7's core capabilities: library search, documentation retrieval, MCP tools (resolve-library-id, query-docs).
Support both GitHub-hosted and local filesystem repositories.
Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
Expose a REST API under /api/v1/* that mirrors context7's /api/v2/* search and context endpoints.
Ship an MCP server implementing resolve-library-id and query-docs.
Provide a web UI for repository management and search exploration.
Support trueref.json config files in repos (analogous to context7.json).
Support versioned documentation via git tags/branches.
Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).

Non-Goals (v1)

Authentication & authorization (deferred to a future version).
Skill generation (context7 CLI skill feature).
Multi-tenant SaaS mode.
Pre-built binary releases / Docker image (infrastructure, not product).
Paid API tier / rate limiting.
Support for non-git version control systems.

4. Users & Personas

Primary: The Developer / Tech Lead

Configures TrueRef, adds repositories, and integrates the MCP server with their AI coding assistant. Technical, comfortable with CLI and config files.

Secondary: The AI Coding Assistant

The "user" at query time. Calls resolve-library-id and query-docs via MCP to retrieve documentation snippets for code generation.

5. Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                          TrueRef Platform                          │
│                                                                    │
│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐  │
│  │   Web UI         │   │   REST API       │   │  MCP Server   │  │
│  │  (SvelteKit)     │   │  /api/v1/*       │   │  (stdio/HTTP) │  │
│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘  │
│           │                      │                     │           │
│           └──────────────────────┼─────────────────────┘           │
│                                  │                                  │
│                    ┌─────────────▼──────────────┐                  │
│                    │      Service Layer           │                  │
│                    │  LibraryService              │                  │
│                    │  SearchService               │                  │
│                    │  IndexingService             │                  │
│                    └─────────────┬──────────────┘                  │
│                                  │                                  │
│           ┌──────────────────────┼───────────────────┐             │
│           │                      │                   │             │
│  ┌────────▼────────┐  ┌─────────▼──────┐  ┌────────▼──────────┐  │
│  │  Indexing        │  │  SQLite DB     │  │  Vector/FTS Index │  │
│  │  Pipeline        │  │  (drizzle-orm) │  │  (SQLite FTS5 +   │  │
│  │  Crawler         │  │               │  │   embeddings)      │  │
│  │  Parser          │  │               │  │                    │  │
│  │  Chunker         │  │               │  │                    │  │
│  └────────┬────────┘  └───────────────┘  └────────────────────┘  │
│           │                                                         │
│   ┌───────▼──────────────────────┐                                 │
│   │    Repository Sources         │                                 │
│   │  - GitHub API                 │                                 │
│   │  - Local filesystem           │                                 │
│   └──────────────────────────────┘                                 │
└────────────────────────────────────────────────────────────────────┘

Technology Stack

Layer	Technology
Framework	SvelteKit (Node adapter)
Language	TypeScript
Database	SQLite via better-sqlite3 + drizzle-orm
Full-Text Search	SQLite FTS5
Vector Search	SQLite `sqlite-vec` extension (cosine similarity)
Embeddings	Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API
MCP Protocol	`@modelcontextprotocol/sdk`
HTTP	SvelteKit API routes + optional standalone MCP HTTP server
CSS	TailwindCSS v4
Testing	Vitest
Linting	ESLint + Prettier

6. Data Model

6.1 Repositories

A Repository is the top-level entity. It maps to a GitHub repo or local directory.

Repository {
  id          TEXT PRIMARY KEY        -- e.g. "/facebook/react"
  title       TEXT NOT NULL           -- display name
  description TEXT
  source      TEXT NOT NULL           -- "github" | "local"
  sourceUrl   TEXT                    -- GitHub URL or local path
  branch      TEXT                    -- default branch
  state       TEXT NOT NULL           -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets INTEGER DEFAULT 0
  totalTokens   INTEGER DEFAULT 0
  trustScore    REAL DEFAULT 0
  stars         INTEGER
  lastIndexedAt DATETIME
  createdAt     DATETIME
  updatedAt     DATETIME
}

6.2 Repository Versions

RepositoryVersion {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  tag          TEXT NOT NULL          -- git tag or branch name
  title        TEXT
  state        TEXT                   -- "pending" | "indexed" | "error"
  indexedAt    DATETIME
}

6.3 Documents (parsed files)

Document {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
  filePath     TEXT NOT NULL
  title        TEXT
  content      TEXT NOT NULL          -- raw markdown/code
  language     TEXT                   -- programming language if code file
  tokenCount   INTEGER
  checksum     TEXT                   -- SHA-256 for change detection
  indexedAt    DATETIME
}

6.4 Snippets (indexed chunks)

Snippet {
  id           TEXT PRIMARY KEY
  documentId   TEXT FK → Document
  repositoryId TEXT FK → Repository
  type         TEXT NOT NULL          -- "code" | "info"
  title        TEXT
  content      TEXT NOT NULL          -- the actual searchable text/code
  language     TEXT
  breadcrumb   TEXT                   -- heading hierarchy path
  tokenCount   INTEGER
  embedding    BLOB                   -- float32[] stored as blob
  createdAt    DATETIME
}
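
To make the `embedding` column concrete, the following TypeScript sketch packs a vector into bytes for a BLOB column and reads it back, and adds the cosine similarity named in the technology stack. The helper names (`embeddingToBlob`, `blobToEmbedding`, `cosineSimilarity`) are hypothetical, and the byte layout (raw float32 values in platform byte order) is an assumption for illustration, not a documented TrueRef format.

```typescript
// Hypothetical helpers; the raw-float32 layout is an assumption for illustration.

/** Pack a vector into bytes suitable for a BLOB column (4 bytes per value). */
function embeddingToBlob(embedding: number[]): Uint8Array {
  const floats = Float32Array.from(embedding);
  // Copy the float32 bytes so the returned array owns its own buffer.
  return new Uint8Array(floats.buffer.slice(0));
}

/** Read a BLOB back into a Float32Array. */
function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("blob length is not a multiple of 4 bytes");
  }
  // slice() copies into a fresh buffer at offset 0, which guarantees alignment.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

/** Cosine similarity of two equal-length vectors; returns 0 for a zero vector. */
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Storing the vector as a flat byte array keeps each snippet and its embedding in the same row, which fits the single-file SQLite goal; a real implementation would follow whatever layout the chosen vector extension expects.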

6.5 Indexing Jobs

IndexingJob {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
  status       TEXT                   -- "queued" | "running" | "done" | "failed"
  progress     INTEGER DEFAULT 0      -- 0-100
  totalFiles   INTEGER
  processedFiles INTEGER
  error        TEXT
  startedAt    DATETIME
  completedAt  DATETIME
  createdAt    DATETIME
}

6.6 Repository Configuration (`trueref.json`)

RepositoryConfig {
  repositoryId  TEXT FK → Repository
  projectTitle  TEXT
  description   TEXT
  folders       TEXT[]                -- include paths
  excludeFolders TEXT[]
  excludeFiles  TEXT[]
  rules         TEXT[]                -- best practices for LLMs
  previousVersions { tag, title }[]
}
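
As an illustration of how a `trueref.json` file could be read into this shape, here is a minimal TypeScript sketch. The field names come from the model above; the function name `parseTrueRefConfig`, the choice to default missing arrays to empty, and the choice to silently drop malformed entries are assumptions, not specified behavior.

```typescript
// Sketch only: defaults and error handling below are assumptions.
interface TrueRefConfig {
  projectTitle?: string;
  description?: string;
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
  rules: string[];
  previousVersions: { tag: string; title: string }[];
}

/** Keep only string entries; anything that is not an array becomes []. */
function stringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

function isVersionEntry(v: unknown): v is { tag: string; title: string } {
  if (typeof v !== "object" || v === null) return false;
  const entry = v as Record<string, unknown>;
  return typeof entry["tag"] === "string" && typeof entry["title"] === "string";
}

function parseTrueRefConfig(json: string): TrueRefConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("trueref.json must contain a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  const versions: unknown[] = Array.isArray(obj["previousVersions"])
    ? obj["previousVersions"]
    : [];
  const config: TrueRefConfig = {
    folders: stringArray(obj["folders"]),
    excludeFolders: stringArray(obj["excludeFolders"]),
    excludeFiles: stringArray(obj["excludeFiles"]),
    rules: stringArray(obj["rules"]),
    previousVersions: versions.filter(isVersionEntry),
  };
  if (typeof obj["projectTitle"] === "string") config.projectTitle = obj["projectTitle"];
  if (typeof obj["description"] === "string") config.description = obj["description"];
  return config;
}
```

For example, a file containing `{"projectTitle": "My SDK", "folders": ["docs"], "rules": ["Prefer the v2 client"]}` would yield a config with one include folder, one rule, and empty exclude lists.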

7. Core Features

F1: Repository Management

Add, remove, update, and list repositories. Support GitHub (public/private via token) and local filesystem sources. Trigger indexing on demand or on a schedule.

F2: GitHub Crawler

Fetch repository file trees via the GitHub Trees API and download file contents. Respect trueref.json include/exclude rules. Handle GitHub API rate limits and support incremental re-indexing (checksum-based).

F3: Local Filesystem Crawler

Walk directory trees. Apply include/exclude rules from trueref.json. Watch for file changes (optional).
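
Both crawlers apply the same include/exclude rules, so the check can be sketched as one shared predicate. The semantics below are assumptions for illustration: `folders` and `excludeFolders` are treated as directory prefixes relative to the repo root, `excludeFiles` as exact file names, an empty `folders` list as "include everything", and excludes as taking precedence. The name `shouldIndex` is hypothetical.

```typescript
// Assumed semantics (not specified by this document): prefix matching for
// folders, exact base-name matching for excludeFiles, excludes win over includes.
interface CrawlRules {
  folders: string[];
  excludeFolders: string[];
  excludeFiles: string[];
}

/** Strip a leading "./" or "/" and any trailing slashes. */
function normalizeDir(dir: string): string {
  return dir.replace(/^\.?\/+/, "").replace(/\/+$/, "");
}

/** True when filePath is the directory itself or lives somewhere beneath it. */
function isUnder(filePath: string, dir: string): boolean {
  const d = normalizeDir(dir);
  return d === "" || filePath === d || filePath.startsWith(d + "/");
}

function shouldIndex(filePath: string, rules: CrawlRules): boolean {
  const relPath = filePath.replace(/^\.?\/+/, "");
  const baseName = relPath.slice(relPath.lastIndexOf("/") + 1);
  if (rules.excludeFiles.includes(baseName)) return false;
  if (rules.excludeFolders.some((dir) => isUnder(relPath, dir))) return false;
  return rules.folders.length === 0 || rules.folders.some((dir) => isUnder(relPath, dir));
}
```

With `folders: ["docs"]` and `excludeFolders: ["docs/internal"]`, the path `docs/guide.md` passes while `docs/internal/notes.md` and `src/index.ts` are skipped. Matching on `dir + "/"` rather than a bare prefix keeps `docs-old/` from being treated as part of `docs`.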

F4: Document Parser & Chunker

Parse Markdown files into sections (heading-based splitting).
Extract code blocks from Markdown.
Parse standalone code files into function/class-level chunks.
Calculate token counts.
Produce structured Snippet records (type: "code" or "info").
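
The Markdown steps above can be sketched as a single pass over the file's lines. This is a minimal illustration, not the parser the project will ship: it only recognizes ATX headings (`#` to `######`) and backtick fences, and the `Chunk` shape is a simplified stand-in for the Snippet record. The names `chunkMarkdown` and `FENCE` are hypothetical.

```typescript
// Simplified sketch: ATX headings and backtick fences only.
const FENCE = "`".repeat(3); // avoids writing a literal fence inside this example

interface Chunk {
  type: "code" | "info";
  breadcrumb: string; // heading path, e.g. "Guide > Install"
  language?: string;
  content: string;
}

function chunkMarkdown(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const headings: string[] = []; // current heading path; index = level - 1
  let prose: string[] = [];
  let code: string[] | null = null; // non-null while inside a fenced block
  let codeLanguage = "";

  const breadcrumb = (): string => headings.filter((h) => Boolean(h)).join(" > ");
  const flushProse = (): void => {
    const text = prose.join("\n").trim();
    if (text !== "") chunks.push({ type: "info", breadcrumb: breadcrumb(), content: text });
    prose = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    if (line.startsWith(FENCE)) {
      if (code === null) {
        flushProse(); // keep prose and code in document order
        code = [];
        codeLanguage = line.slice(FENCE.length).trim();
      } else {
        const chunk: Chunk = { type: "code", breadcrumb: breadcrumb(), content: code.join("\n") };
        if (codeLanguage !== "") chunk.language = codeLanguage;
        chunks.push(chunk);
        code = null;
      }
      continue;
    }
    if (code !== null) {
      code.push(line);
      continue;
    }
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flushProse();
      const level = (heading[1] ?? "").length;
      headings.length = level - 1; // drop headings at this level and deeper
      headings[level - 1] = (heading[2] ?? "").trim();
      continue;
    }
    prose.push(line);
  }
  flushProse(); // an unterminated fence is dropped in this sketch
  return chunks;
}
```

Carrying the heading path along as a breadcrumb lets each snippet stay meaningful when it is retrieved out of context, which is the purpose of the `breadcrumb` column in the data model.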

F5: Embedding & Vector Storage

Generate embeddings for each snippet using a pluggable embeddings backend.
Store embeddings as binary blobs in SQLite (sqlite-vec).
Support fallback to FTS5-only search when no embedding provider is configured.

F6: Semantic Search Engine

Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
Query-time retrieval: given libraryId + query, return ranked snippets.
Library search: given libraryName + query, return matching repositories.
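
Reciprocal rank fusion combines the two result lists using only each item's rank, so vector distances and BM25 scores never need to be put on a common scale. The sketch below assumes each search returns an ordered list of snippet IDs; the function name is hypothetical, and `k = 60` is a commonly used default rather than a value this document specifies.

```typescript
// Each input is a list of snippet IDs ordered best-first.
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Given a vector ranking `["a", "b", "c"]` and a keyword ranking `["b", "d", "a"]`, the fused order is `b, a, d, c`: items that appear in both lists rise above items that appear in only one. When no embedding provider is configured, passing only the FTS5 ranking returns it in its original order, which matches the FTS5-only fallback described in F5.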

F7: REST API (`/api/v1/*`)

GET /api/v1/libs/search?query=&libraryName= — search libraries (context7-compatible)
GET /api/v1/context?query=&libraryId=&type=json|txt — fetch documentation
GET /api/v1/libs — list all indexed libraries
POST /api/v1/libs — add a new repository
DELETE /api/v1/libs/:id — remove a repository
POST /api/v1/libs/:id/index — trigger re-indexing
GET /api/v1/jobs/:id — get indexing job status
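
As a sketch of how a client might address the two query endpoints, the helpers below build request URLs with the standard `URL` API. The function names are hypothetical, and the base URL is whatever address the TrueRef instance is served from; the one in the usage note is only an example.

```typescript
// Hypothetical client-side helpers; paths and parameter names are taken from the list above.
type ContextFormat = "json" | "txt";

function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  format: ContextFormat = "json",
): string {
  const url = new URL("/api/v1/context", baseUrl);
  // searchParams handles encoding, e.g. the slashes inside a library ID.
  url.searchParams.set("query", query);
  url.searchParams.set("libraryId", libraryId);
  url.searchParams.set("type", format);
  return url.toString();
}

function buildLibrarySearchUrl(baseUrl: string, libraryName: string, query: string): string {
  const url = new URL("/api/v1/libs/search", baseUrl);
  url.searchParams.set("query", query);
  url.searchParams.set("libraryName", libraryName);
  return url.toString();
}
```

A call such as `buildContextUrl("http://localhost:3000", "/facebook/react", "useEffect cleanup")` produces a URL that can be passed directly to `fetch`.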

F8: MCP Server

Tool: resolve-library-id — search for libraries by name
Tool: query-docs — fetch documentation by libraryId + query
Transport: stdio (primary), HTTP (optional)
Compatible with Claude Code, Cursor, and other MCP-aware tools
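
For orientation, this is roughly what a tool invocation looks like on the wire. MCP uses JSON-RPC 2.0, tools are invoked with the `tools/call` method, and the stdio transport exchanges one JSON message per line. The sketch below builds such a message by hand; the helper names are hypothetical, and a real server or client would normally rely on `@modelcontextprotocol/sdk` rather than constructing messages manually.

```typescript
// Hand-built JSON-RPC message for illustration; helper names are hypothetical.
type TrueRefTool = "resolve-library-id" | "query-docs";

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: TrueRefTool; arguments: Record<string, string> };
}

function buildToolCall(
  id: number,
  tool: TrueRefTool,
  args: Record<string, string>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool, arguments: args } };
}

/** Serialize for a newline-delimited stream: one message per line, no embedded newlines. */
function serializeForStdio(message: ToolCallRequest): string {
  // JSON.stringify without an indent argument never emits raw newlines.
  return JSON.stringify(message) + "\n";
}
```

An assistant would typically issue `resolve-library-id` with `libraryName` and `query` first, then pass the returned ID to `query-docs` as `libraryId`.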

F9: Web UI — Repository Dashboard

List all repositories with status, snippet count, last indexed date
Add/remove repositories (GitHub URL or local path)
Trigger re-indexing
View indexing job progress
View repository config (trueref.json)

F10: Web UI — Search Explorer

Interactive search interface (resolve library → query docs)
Preview snippets with syntax highlighting
View raw document content

F11: `trueref.json` Config Support

Parse trueref.json from repo root (or context7.json for compatibility)
Apply folders, excludeFolders, excludeFiles during crawling
Inject rules into LLM context alongside snippets
Support previousVersions for versioned documentation

F12: Indexing Pipeline & Job Queue

SQLite-backed job queue (no external message broker required)
Sequential processing with progress tracking
Error recovery and retry logic
Incremental re-indexing using file checksums
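
Checksum-based incremental re-indexing comes down to comparing the checksums stored with each Document against the checksums of the files found by the current crawl. A minimal sketch, assuming both sides are available as maps from file path to SHA-256 hex string; the names `planReindex` and `progressPercent`, and the choice to report 100 for an empty job, are assumptions.

```typescript
// Both maps: filePath -> checksum (e.g. SHA-256 hex). Names are hypothetical.
interface ReindexPlan {
  added: string[];     // new files: parse and index
  changed: string[];   // checksum differs: replace snippets
  unchanged: string[]; // skip
  removed: string[];   // no longer present: delete document and snippets
}

function planReindex(stored: Map<string, string>, current: Map<string, string>): ReindexPlan {
  const plan: ReindexPlan = { added: [], changed: [], unchanged: [], removed: [] };
  current.forEach((checksum, filePath) => {
    const previous = stored.get(filePath);
    if (previous === undefined) plan.added.push(filePath);
    else if (previous !== checksum) plan.changed.push(filePath);
    else plan.unchanged.push(filePath);
  });
  stored.forEach((_checksum, filePath) => {
    if (!current.has(filePath)) plan.removed.push(filePath);
  });
  return plan;
}

/** Progress on the 0-100 scale used by IndexingJob.progress. */
function progressPercent(processedFiles: number, totalFiles: number): number {
  if (totalFiles <= 0) return 100; // assumption: an empty job counts as complete
  return Math.min(100, Math.floor((processedFiles / totalFiles) * 100));
}
```

Only the `added` and `changed` paths need to go through parsing and embedding, which is where most of the indexing cost sits.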

F13: Version Support

Index specific git tags/branches per repository
Serve version-specific context when libraryId includes version (/owner/repo/v1.2.3)
UI for managing available versions

8. API Compatibility with context7

TrueRef's REST API mirrors context7's /api/v2/* interface under a /api/v1/* prefix, so existing context7 clients need only a different base URL and path prefix:

context7 Endpoint	TrueRef Endpoint	Notes
`GET /api/v2/libs/search`	`GET /api/v1/libs/search`	Same query params
`GET /api/v2/context`	`GET /api/v1/context`	Same query params, same response shape

The MCP tool names and input schemas are identical:

resolve-library-id with libraryName + query
query-docs with libraryId + query

Library IDs follow the same convention: /owner/repo or /owner/repo/version.
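
A small parser makes the ID convention concrete. This sketch accepts exactly the two forms named above, `/owner/repo` and `/owner/repo/version`; the function names are hypothetical, and how IDs are assigned to local filesystem repositories is not covered here.

```typescript
// Accepts "/owner/repo" or "/owner/repo/version"; anything else is rejected.
interface LibraryId {
  owner: string;
  repo: string;
  version?: string;
}

function parseLibraryId(id: string): LibraryId {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(id);
  if (!match) throw new Error(`invalid library ID: ${id}`);
  const parsed: LibraryId = { owner: match[1] ?? "", repo: match[2] ?? "" };
  if (match[3] !== undefined) parsed.version = match[3];
  return parsed;
}

/** The repository-level ID without any version suffix. */
function toRepositoryId(id: LibraryId): string {
  return `/${id.owner}/${id.repo}`;
}
```

Parsing `/owner/repo/v1.2.3` yields version `v1.2.3` and a repository ID of `/owner/repo`, which is the value that would be looked up as `Repository.id` before selecting the matching RepositoryVersion.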

9. Non-Functional Requirements

Performance

Library search: < 200ms p99
Documentation retrieval: < 500ms p99 for 20 snippets
Indexing throughput: > 1,000 files/minute, subject to GitHub API rate limits for GitHub sources

Reliability

Failed indexing jobs must not corrupt existing indexed data
Atomic snippet replacement during re-indexing

Portability

Single SQLite file for all data
Runs on Linux, macOS, Windows (Node.js 20+)
No required external services beyond optional embedding API

Scalability (v1 constraints)

Designed for single-node deployment
SQLite is suitable for up to ~500 repositories and ~500k snippets

10. Milestones & Feature Order

ID	Feature	Priority	Depends On
TRUEREF-0001	Database schema & core data models	P0	—
TRUEREF-0002	Repository management service & REST API	P0	TRUEREF-0001
TRUEREF-0003	GitHub repository crawler	P0	TRUEREF-0001
TRUEREF-0004	Local filesystem crawler	P1	TRUEREF-0001
TRUEREF-0005	Document parser & chunker	P0	TRUEREF-0001
TRUEREF-0006	SQLite FTS5 full-text search	P0	TRUEREF-0005
TRUEREF-0007	Embedding generation & vector storage	P1	TRUEREF-0005
TRUEREF-0008	Hybrid semantic search engine	P1	TRUEREF-0006, TRUEREF-0007
TRUEREF-0009	Indexing pipeline & job queue	P0	TRUEREF-0003, TRUEREF-0005
TRUEREF-0010	REST API (search + context endpoints)	P0	TRUEREF-0006, TRUEREF-0009
TRUEREF-0011	MCP server (stdio transport)	P0	TRUEREF-0010
TRUEREF-0012	MCP server (HTTP transport)	P1	TRUEREF-0011
TRUEREF-0013	`trueref.json` config file support	P0	TRUEREF-0003
TRUEREF-0014	Repository version management	P1	TRUEREF-0003
TRUEREF-0015	Web UI — repository dashboard	P1	TRUEREF-0002, TRUEREF-0009
TRUEREF-0016	Web UI — search explorer	P2	TRUEREF-0010, TRUEREF-0015
TRUEREF-0017	Incremental re-indexing (checksum diff)	P1	TRUEREF-0009
TRUEREF-0018	Embedding provider configuration UI	P2	TRUEREF-0007, TRUEREF-0015

11. Open Questions

Embedding provider default: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to the install size.
Streaming responses: Should query-docs support streaming for long responses?
GitHub private repos: Should we store the GitHub token in the DB or require it per-request?
context7.json backward compatibility: Should we auto-detect and use context7.json if no trueref.json is present?
Re-indexing strategy: Automatic periodic re-indexing (cron) vs. manual-only?
