chore: initial project scaffold
docs/PRD.md
@@ -0,0 +1,365 @@
# TrueRef — Product Requirements Document

**Version:** 1.0
**Date:** 2026-03-22
**Status:** Draft

---

## 1. Executive Summary

TrueRef is a self-hosted, open-source documentation intelligence platform: a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7, whose indexing backend is private, TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.

The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls, without sending code to third-party services.

---

## 2. Problem Statement

### 2.1 Context7's Limitations
- The indexing and crawling backend is entirely private and closed-source.
- Only public libraries already in the context7.com catalog are available.
- Private, internal, or niche repositories cannot be added.
- No data sovereignty: all queries go to context7.com servers.
- No way to self-host for air-gapped or compliance-constrained environments.

### 2.2 The Gap
Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.

---

## 3. Goals & Non-Goals

### Goals
- Replicate context7's core capabilities: library search, documentation retrieval, and the MCP tools (`resolve-library-id`, `query-docs`).
- Support both GitHub-hosted and local filesystem repositories.
- Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
- Expose a REST API compatible with context7's `/api/v2/*` surface.
- Ship an MCP server implementing `resolve-library-id` and `query-docs`.
- Provide a web UI for repository management and search exploration.
- Support `trueref.json` config files in repos (analogous to `context7.json`).
- Support versioned documentation via git tags/branches.
- Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).

### Non-Goals (v1)
- Authentication & authorization (deferred to a future version).
- Skill generation (context7 CLI skill feature).
- Multi-tenant SaaS mode.
- Binary pre-built releases / Docker image (infrastructure, not product).
- Paid API tier / rate limiting.
- Support for non-git version control systems.

---

## 4. Users & Personas

### Primary: The Developer / Tech Lead
Configures TrueRef, adds repositories, and integrates the MCP server with their AI coding assistant. Technical, and comfortable with the CLI and config files.

### Secondary: The AI Coding Assistant
The "user" at query time. Calls `resolve-library-id` and `query-docs` via MCP to retrieve documentation snippets for code generation.

---

## 5. Architecture Overview

```
┌───────────────────────────────────────────────────────────────────┐
│                         TrueRef Platform                          │
│                                                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌─────────────────┐  │
│  │      Web UI      │  │     REST API     │  │   MCP Server    │  │
│  │   (SvelteKit)    │  │    /api/v1/*     │  │  (stdio/HTTP)   │  │
│  └────────┬─────────┘  └────────┬─────────┘  └────────┬────────┘  │
│           │                     │                     │           │
│           └─────────────────────┼─────────────────────┘           │
│                                 │                                 │
│                   ┌─────────────▼──────────────┐                  │
│                   │       Service Layer        │                  │
│                   │       LibraryService       │                  │
│                   │       SearchService        │                  │
│                   │       IndexingService      │                  │
│                   └─────────────┬──────────────┘                  │
│                                 │                                 │
│           ┌─────────────────────┼──────────────────┐              │
│           │                     │                  │              │
│  ┌────────▼────────┐  ┌─────────▼──────┐  ┌────────▼──────────┐   │
│  │  Indexing       │  │  SQLite DB     │  │ Vector/FTS Index  │   │
│  │  Pipeline       │  │  (drizzle-orm) │  │ (SQLite FTS5 +    │   │
│  │  Crawler        │  │                │  │  embeddings)      │   │
│  │  Parser         │  │                │  │                   │   │
│  │  Chunker        │  │                │  │                   │   │
│  └────────┬────────┘  └────────────────┘  └───────────────────┘   │
│           │                                                       │
│   ┌───────▼──────────────────────┐                                │
│   │  Repository Sources          │                                │
│   │  - GitHub API                │                                │
│   │  - Local filesystem          │                                │
│   └──────────────────────────────┘                                │
└───────────────────────────────────────────────────────────────────┘
```

### Technology Stack
| Layer | Technology |
|-------|-----------|
| Framework | SvelteKit (Node adapter) |
| Language | TypeScript |
| Database | SQLite via better-sqlite3 + drizzle-orm |
| Full-Text Search | SQLite FTS5 |
| Vector Search | SQLite `sqlite-vec` extension (cosine similarity) |
| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
| MCP Protocol | `@modelcontextprotocol/sdk` |
| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
| CSS | TailwindCSS v4 |
| Testing | Vitest |
| Linting | ESLint + Prettier |

---

## 6. Data Model

### 6.1 Repositories
A `Repository` is the top-level entity. It maps to a GitHub repo or local directory.

```
Repository {
  id            TEXT PRIMARY KEY     -- e.g. "/facebook/react"
  title         TEXT NOT NULL        -- display name
  description   TEXT
  source        TEXT NOT NULL        -- "github" | "local"
  sourceUrl     TEXT                 -- GitHub URL or local path
  branch        TEXT                 -- default branch
  state         TEXT NOT NULL        -- "pending" | "indexing" | "indexed" | "error"
  totalSnippets INTEGER DEFAULT 0
  totalTokens   INTEGER DEFAULT 0
  trustScore    REAL DEFAULT 0
  stars         INTEGER
  lastIndexedAt DATETIME
  createdAt     DATETIME
  updatedAt     DATETIME
}
```

### 6.2 Repository Versions
```
RepositoryVersion {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  tag          TEXT NOT NULL         -- git tag or branch name
  title        TEXT
  state        TEXT                  -- "pending" | "indexed" | "error"
  indexedAt    DATETIME
}
```

### 6.3 Documents (parsed files)
```
Document {
  id           TEXT PRIMARY KEY
  repositoryId TEXT FK → Repository
  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
  filePath     TEXT NOT NULL
  title        TEXT
  content      TEXT NOT NULL         -- raw markdown/code
  language     TEXT                  -- programming language if code file
  tokenCount   INTEGER
  checksum     TEXT                  -- SHA-256 for change detection
  indexedAt    DATETIME
}
```

### 6.4 Snippets (indexed chunks)
```
Snippet {
  id           TEXT PRIMARY KEY
  documentId   TEXT FK → Document
  repositoryId TEXT FK → Repository
  type         TEXT NOT NULL         -- "code" | "info"
  title        TEXT
  content      TEXT NOT NULL         -- the actual searchable text/code
  language     TEXT
  breadcrumb   TEXT                  -- heading hierarchy path
  tokenCount   INTEGER
  embedding    BLOB                  -- float32[] stored as blob
  createdAt    DATETIME
}
```

### 6.5 Indexing Jobs
```
IndexingJob {
  id             TEXT PRIMARY KEY
  repositoryId   TEXT FK → Repository
  versionId      TEXT
  status         TEXT                -- "queued" | "running" | "done" | "failed"
  progress       INTEGER DEFAULT 0   -- 0-100
  totalFiles     INTEGER
  processedFiles INTEGER
  error          TEXT
  startedAt      DATETIME
  completedAt    DATETIME
  createdAt      DATETIME
}
```
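The `progress`, `totalFiles`, and `processedFiles` columns suggest that job progress is derived from file counts. The following TypeScript sketch shows one way that derivation could look. The `computeProgress` helper, the `JobCounters` type, and the round-down rule are assumptions made for illustration and are not part of the schema above.

```typescript
// Hypothetical helper: derive the 0-100 `progress` value of an IndexingJob.
type JobStatus = "queued" | "running" | "done" | "failed";

interface JobCounters {
  status: JobStatus;
  totalFiles: number | null;
  processedFiles: number | null;
}

function computeProgress(job: JobCounters): number {
  if (job.status === "done") return 100;
  // Unknown or empty totals: report 0 instead of dividing by zero.
  if (!job.totalFiles || job.processedFiles === null) return 0;
  const percent = Math.floor((job.processedFiles / job.totalFiles) * 100);
  return Math.min(100, Math.max(0, percent)); // clamp to the documented range
}
```

Rounding down keeps a running job from reporting 100 before its last file has been processed.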

### 6.6 Repository Configuration (`trueref.json`)
```
RepositoryConfig {
  repositoryId     TEXT FK → Repository
  projectTitle     TEXT
  description      TEXT
  folders          TEXT[]            -- include paths
  excludeFolders   TEXT[]
  excludeFiles     TEXT[]
  rules            TEXT[]            -- best practices for LLMs
  previousVersions { tag, title }[]
}
```

---

## 7. Core Features

### F1: Repository Management
Add, remove, update, and list repositories. Support GitHub (public/private via token) and local filesystem sources. Trigger indexing on demand or on a schedule.

### F2: GitHub Crawler
Fetch repository file trees via the GitHub Trees API. Download file contents. Respect `trueref.json` include/exclude rules. Handle GitHub API rate limits and support incremental, checksum-based re-indexing.

### F3: Local Filesystem Crawler
Walk directory trees. Apply include/exclude rules from `trueref.json`. Optionally watch for file changes.

### F4: Document Parser & Chunker
- Parse Markdown files into sections (heading-based splitting).
- Extract code blocks from Markdown.
- Parse standalone code files into function/class-level chunks.
- Calculate token counts.
- Produce structured `Snippet` records (type: "code" or "info").
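To make heading-based splitting concrete, here is a minimal TypeScript sketch. It is not the planned parser: the `Section` type, the `splitMarkdownByHeadings` name, and the `" > "` breadcrumb separator are assumptions for illustration. It handles ATX headings (`#` to `######`) only and keeps fenced code inside the section it belongs to.

```typescript
interface Section {
  breadcrumb: string; // heading hierarchy, e.g. "Guide > Install"
  content: string;    // text below the heading, code fences included
}

function splitMarkdownByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  const path: string[] = []; // path[level - 1] holds the current heading at that level
  let buffer: string[] = [];
  let inFence = false;

  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content !== "") {
      sections.push({ breadcrumb: path.filter(Boolean).join(" > "), content });
    }
    buffer = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    // Track fenced code so that "# comment" lines inside code are not headings.
    if (/^\s*(`{3,}|~{3,})/.test(line)) inFence = !inFence;
    const heading = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flush();
      const [, hashes = "", title = ""] = heading;
      path.length = hashes.length - 1; // drop headings deeper than the parent
      path[hashes.length - 1] = title.trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
}
```

Each returned section could then be split further into "code" and "info" snippets, with `breadcrumb` stored in the `Snippet.breadcrumb` column.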

### F5: Embedding & Vector Storage
- Generate embeddings for each snippet using a pluggable embeddings backend.
- Store embeddings as binary blobs in SQLite (sqlite-vec).
- Support fallback to FTS5-only search when no embedding provider is configured.
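As an illustration of storing embeddings as binary blobs, the sketch below packs a vector into raw float32 bytes, unpacks it again, and computes cosine similarity in plain TypeScript. The function names are hypothetical, and the exact blob layout expected by the vector extension should be checked against its documentation. In the real system the similarity computation would presumably run inside SQLite rather than in application code.

```typescript
// Encode an embedding as raw float32 bytes (platform byte order).
function embeddingToBlob(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector);
  return new Uint8Array(floats.buffer);
}

// Decode a blob back into floats; copying first guarantees 4-byte alignment.
function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("Blob length is not a multiple of 4");
  }
  const copy = new Uint8Array(blob);
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Cosine similarity is undefined for zero vectors; treat it as no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

A float32 blob uses 4 bytes per dimension, so a 384-dimensional embedding occupies 1,536 bytes per snippet.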

### F6: Semantic Search Engine
- Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
- Query-time retrieval: given `libraryId + query`, return ranked snippets.
- Library search: given `libraryName + query`, return matching repositories.
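Reciprocal rank fusion (RRF) merges several ranked lists by giving each item the score `sum(1 / (k + rank))` over the lists in which it appears. The sketch below shows the idea on snippet IDs. The function name and the constant `k = 60` (a commonly used default) are assumptions, not values fixed by this document.

```typescript
// Fuse ranked ID lists (best first) into a single ranking using RRF.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Hypothetical snippet IDs from the two retrievers.
const vectorHits = ["a", "b", "c"];
const keywordHits = ["b", "d", "a"];
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
// fused is ["b", "a", "d", "c"]: items found by both retrievers rise to the top.
```

Because RRF uses only ranks, the BM25 scores and cosine similarities never need to be normalized onto a common scale.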

### F7: REST API (`/api/v1/*`)
- `GET /api/v1/libs/search?query=&libraryName=` - search libraries (context7-compatible)
- `GET /api/v1/context?query=&libraryId=&type=json|txt` - fetch documentation
- `GET /api/v1/libs` - list all indexed libraries
- `POST /api/v1/libs` - add a new repository
- `DELETE /api/v1/libs/:id` - remove a repository
- `POST /api/v1/libs/:id/index` - trigger re-indexing
- `GET /api/v1/jobs/:id` - get indexing job status
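As a sketch of how a client might call the documentation endpoint, the helper below builds a `/api/v1/context` URL from the query parameters listed above. The `buildContextUrl` name and the `http://localhost:3000` base URL are hypothetical. Library IDs contain slashes, so they must be URL-encoded when passed as a query parameter, which `URLSearchParams` handles.

```typescript
type ContextFormat = "json" | "txt";

// Build the request URL for GET /api/v1/context.
function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  type: ContextFormat = "json",
): string {
  const url = new URL("/api/v1/context", baseUrl);
  url.searchParams.set("libraryId", libraryId);
  url.searchParams.set("query", query);
  url.searchParams.set("type", type);
  return url.toString();
}

// Hypothetical local deployment; the host and port are assumptions.
const contextUrl = buildContextUrl("http://localhost:3000", "/facebook/react", "use effect");
```

The resulting string can be passed directly to `fetch`.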

### F8: MCP Server
- Tool: `resolve-library-id` - search for libraries by name
- Tool: `query-docs` - fetch documentation by libraryId + query
- Transport: stdio (primary), HTTP (optional)
- Compatible with Claude Code, Cursor, and other MCP-aware tools

### F9: Web UI - Repository Dashboard
- List all repositories with status, snippet count, last indexed date
- Add/remove repositories (GitHub URL or local path)
- Trigger re-indexing
- View indexing job progress
- View repository config (`trueref.json`)

### F10: Web UI - Search Explorer
- Interactive search interface (resolve library → query docs)
- Preview snippets with syntax highlighting
- View raw document content

### F11: `trueref.json` Config Support
- Parse `trueref.json` from repo root (or `context7.json` for compatibility)
- Apply `folders`, `excludeFolders`, `excludeFiles` during crawling
- Inject `rules` into LLM context alongside snippets
- Support `previousVersions` for versioned documentation
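The include/exclude step can be sketched as a path filter. The version below is a deliberately simple TypeScript illustration: it assumes `folders` and `excludeFolders` are matched as path prefixes and `excludeFiles` as exact file names. Whether these fields also accept glob patterns is not specified here, and the `shouldIndex` and `isUnder` helpers are hypothetical.

```typescript
interface TrueRefConfig {
  folders?: string[];
  excludeFolders?: string[];
  excludeFiles?: string[];
}

// True when filePath is the folder itself or lies somewhere below it.
function isUnder(filePath: string, folder: string): boolean {
  const prefix = folder.replace(/^\.?\/+|\/+$/g, ""); // strip "./", "/", and trailing "/"
  return prefix === "" || filePath === prefix || filePath.startsWith(prefix + "/");
}

// Decide whether a repo-relative path (forward slashes) should be crawled.
function shouldIndex(filePath: string, config: TrueRefConfig): boolean {
  const fileName = filePath.split("/").pop() ?? filePath;
  if (config.excludeFiles?.includes(fileName)) return false;
  if (config.excludeFolders?.some((folder) => isUnder(filePath, folder))) return false;
  const folders = config.folders ?? [];
  // An empty or missing `folders` list means "include everything".
  return folders.length === 0 || folders.some((folder) => isUnder(filePath, folder));
}

// Hypothetical config values for illustration.
const exampleConfig: TrueRefConfig = {
  folders: ["docs"],
  excludeFolders: ["docs/internal"],
  excludeFiles: ["CHANGELOG.md"],
};
```

In this sketch exclusions are checked before inclusions, so an excluded folder stays excluded even when its parent is listed in `folders`.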

### F12: Indexing Pipeline & Job Queue
- SQLite-backed job queue (no external message broker required)
- Sequential processing with progress tracking
- Error recovery and retry logic
- Incremental re-indexing using file checksums
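Checksum-based incremental re-indexing amounts to comparing the stored `Document.checksum` of each file with the SHA-256 of its current content. The TypeScript sketch below shows one possible shape for that comparison using Node's built-in `node:crypto` module. The `planReindex` function and the `ReindexPlan` type are hypothetical names.

```typescript
import { createHash } from "node:crypto";

function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface ReindexPlan {
  changed: string[];   // new or modified files: re-parse and re-embed
  unchanged: string[]; // checksum matches: keep existing snippets
  removed: string[];   // stored but no longer present: delete snippets
}

function planReindex(
  stored: Map<string, string>,  // filePath -> checksum from the last run
  current: Map<string, string>, // filePath -> file content from the crawler
): ReindexPlan {
  const plan: ReindexPlan = { changed: [], unchanged: [], removed: [] };
  for (const [filePath, content] of current) {
    const same = stored.get(filePath) === sha256(content);
    (same ? plan.unchanged : plan.changed).push(filePath);
  }
  for (const filePath of stored.keys()) {
    if (!current.has(filePath)) plan.removed.push(filePath);
  }
  return plan;
}
```

Only the `changed` files need to pass through the parser and embedding backend again, which matters most when embeddings come from a paid or slow API.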

### F13: Version Support
- Index specific git tags/branches per repository
- Serve version-specific context when libraryId includes version (`/owner/repo/v1.2.3`)
- UI for managing available versions

---

## 8. API Compatibility with context7

TrueRef's REST API mirrors context7's `/api/v2/*` interface under the `/api/v1/*` prefix, so existing clients need only change the base URL and path prefix:

| context7 Endpoint | TrueRef Endpoint | Notes |
|-------------------|-----------------|-------|
| `GET /api/v2/libs/search` | `GET /api/v1/libs/search` | Same query params |
| `GET /api/v2/context` | `GET /api/v1/context` | Same query params, same response shape |

The MCP tool names and input schemas are identical:
- `resolve-library-id` with `libraryName` + `query`
- `query-docs` with `libraryId` + `query`

Library IDs follow the same convention: `/owner/repo` or `/owner/repo/version`.
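A small parser makes the ID convention explicit. The sketch below is an assumption-laden illustration: `parseLibraryId` and `LibraryRef` are hypothetical names, and the version segment is assumed to contain no slashes, so a branch name such as `release/1.0` would need a different encoding.

```typescript
interface LibraryRef {
  repositoryId: string; // "/owner/repo"
  version?: string;     // e.g. "v1.2.3"; absent means the default branch
}

function parseLibraryId(libraryId: string): LibraryRef {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(libraryId);
  if (!match) {
    throw new Error(`Invalid library ID: ${libraryId}`);
  }
  const [, owner, repo, version] = match;
  const repositoryId = `/${owner}/${repo}`;
  return version === undefined ? { repositoryId } : { repositoryId, version };
}
```

For example, `parseLibraryId("/owner/repo/v1.2.3")` yields `repositoryId` `"/owner/repo"` and `version` `"v1.2.3"`, while an input without a leading slash throws.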

---

## 9. Non-Functional Requirements

### Performance
- Library search: < 200ms p99
- Documentation retrieval: < 500ms p99 for 20 snippets
- Indexing throughput: > 1,000 files/minute (GitHub sources are additionally bound by GitHub API rate limits)

### Reliability
- Failed indexing jobs must not corrupt existing indexed data
- Atomic snippet replacement during re-indexing

### Portability
- Single SQLite file for all data
- Runs on Linux, macOS, Windows (Node.js 20+)
- No required external services beyond optional embedding API

### Scalability (v1 constraints)
- Designed for single-node deployment
- SQLite is expected to be suitable for up to ~500 repositories and ~500k snippets

---

## 10. Milestones & Feature Order

| ID | Feature | Priority | Depends On |
|----|---------|----------|-----------|
| TRUEREF-0001 | Database schema & core data models | P0 | None |
| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
| TRUEREF-0013 | `trueref.json` config file support | P0 | TRUEREF-0003 |
| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
| TRUEREF-0015 | Web UI - repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
| TRUEREF-0016 | Web UI - search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |
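The "Depends On" column defines a dependency graph, so a valid implementation order can be derived mechanically. The TypeScript sketch below applies a layered topological sort to the items that lead up to the stdio MCP server (TRUEREF-0011). The `buildOrder` helper is hypothetical, and ties within a layer are broken alphabetically by ID purely for determinism.

```typescript
type DependencyGraph = Record<string, string[]>;

// Return the IDs in an order where every item comes after its dependencies.
function buildOrder(graph: DependencyGraph): string[] {
  const remaining = new Map<string, Set<string>>();
  for (const [id, deps] of Object.entries(graph)) {
    remaining.set(id, new Set(deps));
  }
  const order: string[] = [];
  while (remaining.size > 0) {
    // Items with no unmet dependencies can be worked on now.
    const ready = [...remaining.keys()]
      .filter((id) => remaining.get(id)!.size === 0)
      .sort();
    if (ready.length === 0) {
      throw new Error("Dependency cycle or unknown dependency");
    }
    for (const id of ready) {
      order.push(id);
      remaining.delete(id);
    }
    for (const deps of remaining.values()) {
      for (const id of ready) deps.delete(id);
    }
  }
  return order;
}

// Subset of the table above: the path to TRUEREF-0011.
const mcpPath: DependencyGraph = {
  "TRUEREF-0001": [],
  "TRUEREF-0003": ["TRUEREF-0001"],
  "TRUEREF-0005": ["TRUEREF-0001"],
  "TRUEREF-0006": ["TRUEREF-0005"],
  "TRUEREF-0009": ["TRUEREF-0003", "TRUEREF-0005"],
  "TRUEREF-0010": ["TRUEREF-0006", "TRUEREF-0009"],
  "TRUEREF-0011": ["TRUEREF-0010"],
};
const mcpOrder = buildOrder(mcpPath);
```

Items that become ready in the same round, such as TRUEREF-0003 and TRUEREF-0005, have no dependency on each other and could be built in parallel.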

---

## 11. Open Questions

1. **Embedding provider default**: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to the install size.
2. **Streaming responses**: Should `query-docs` support streaming for long responses?
3. **GitHub private repos**: Should we store the GitHub token in the DB or require it per-request?
4. **context7.json backward compatibility**: Should we auto-detect and use `context7.json` if no `trueref.json` is present?
5. **Re-indexing strategy**: Should re-indexing run automatically on a schedule (cron), or be manual-only?