chore: initial project scaffold

2026-03-22 17:08:15 +01:00
commit 18437dfa7c
53 changed files with 12002 additions and 0 deletions
--- a/docs/PRD.md
+++ b/docs/PRD.md
@@ -0,0 +1,365 @@
+# TrueRef — Product Requirements Document
+
+**Version:** 1.0
+**Date:** 2026-03-22
+**Status:** Draft
+
+---
+
+## 1. Executive Summary
+
+TrueRef is a self-hosted, open-source documentation intelligence platform — a full-stack clone of context7 that lets teams index, search, and query their own private or public code repositories. Unlike context7 (which has a private indexing backend), TrueRef ships the complete stack: crawler, parser, indexer, semantic search engine, REST API, MCP server, and a management web UI.
+
+The core use case is enabling AI coding assistants (Claude Code, Cursor, Zed, etc.) to retrieve accurate, up-to-date, version-aware documentation from repositories the user controls — without sending code to third-party services.
+
+---
+
+## 2. Problem Statement
+
+### 2.1 Context7's Limitations
+- The indexing and crawling backend is entirely private and closed-source.
+- Only public libraries already in the context7.com catalog are available.
+- Private, internal, or niche repositories cannot be added.
+- No data sovereignty: all queries go to context7.com servers.
+- No way to self-host for air-gapped or compliance-constrained environments.
+
+### 2.2 The Gap
+Teams with internal SDKs, private libraries, proprietary documentation, or a need for data sovereignty have no tooling that provides context7-equivalent LLM documentation retrieval.
+
+---
+
+## 3. Goals & Non-Goals
+
+### Goals
+- Replicate context7's core capabilities: library search, documentation retrieval, and the MCP tools (`resolve-library-id`, `query-docs`).
+- Support both GitHub-hosted and local filesystem repositories.
+- Provide a full indexing pipeline: crawl → parse → chunk → embed → store → query.
+- Expose a REST API under `/api/v1/*` that mirrors the request and response shapes of context7's `/api/v2/*` surface.
+- Ship an MCP server implementing `resolve-library-id` and `query-docs`.
+- Provide a web UI for repository management and search exploration.
+- Support `trueref.json` config files in repos (analogous to `context7.json`).
+- Support versioned documentation via git tags/branches.
+- Self-hostable with minimal dependencies (SQLite-first, no external vector DB required).
+
+### Non-Goals (v1)
+- Authentication & authorization (deferred to a future version).
+- Skill generation (context7 CLI skill feature).
+- Multi-tenant SaaS mode.
+- Binary pre-built releases / Docker image (infrastructure, not product).
+- Paid API tier / rate limiting.
+- Support for non-git version control systems.
+
+---
+
+## 4. Users & Personas
+
+### Primary: The Developer / Tech Lead
+Configures TrueRef, adds repositories, and integrates the MCP server with their AI coding assistant. Technical and comfortable with the CLI and config files.
+
+### Secondary: The AI Coding Assistant
+The "user" at query time. Calls `resolve-library-id` and `query-docs` via MCP to retrieve documentation snippets for code generation.
+
+---
+
+## 5. Architecture Overview
+
+```
+┌────────────────────────────────────────────────────────────────────┐
+│                          TrueRef Platform                          │
+│                                                                    │
+│  ┌──────────────────┐   ┌──────────────────┐   ┌───────────────┐  │
+│  │   Web UI         │   │   REST API       │   │  MCP Server   │  │
+│  │  (SvelteKit)     │   │  /api/v1/*       │   │  (stdio/HTTP) │  │
+│  └────────┬─────────┘   └────────┬─────────┘   └───────┬───────┘  │
+│           │                      │                     │           │
+│           └──────────────────────┼─────────────────────┘           │
+│                                  │                                  │
+│                    ┌─────────────▼──────────────┐                  │
+│                    │      Service Layer           │                  │
+│                    │  LibraryService              │                  │
+│                    │  SearchService               │                  │
+│                    │  IndexingService             │                  │
+│                    └─────────────┬──────────────┘                  │
+│                                  │                                  │
+│           ┌──────────────────────┼───────────────────┐             │
+│           │                      │                   │             │
+│  ┌────────▼────────┐  ┌─────────▼──────┐  ┌────────▼──────────┐  │
+│  │  Indexing        │  │  SQLite DB     │  │  Vector/FTS Index │  │
+│  │  Pipeline        │  │  (drizzle-orm) │  │  (SQLite FTS5 +   │  │
+│  │  Crawler         │  │               │  │   embeddings)      │  │
+│  │  Parser          │  │               │  │                    │  │
+│  │  Chunker         │  │               │  │                    │  │
+│  └────────┬────────┘  └───────────────┘  └────────────────────┘  │
+│           │                                                         │
+│   ┌───────▼──────────────────────┐                                 │
+│   │    Repository Sources         │                                 │
+│   │  - GitHub API                 │                                 │
+│   │  - Local filesystem           │                                 │
+│   └──────────────────────────────┘                                 │
+└────────────────────────────────────────────────────────────────────┘
+```
+
+### Technology Stack
+| Layer | Technology |
+|-------|-----------|
+| Framework | SvelteKit (Node adapter) |
+| Language | TypeScript |
+| Database | SQLite via better-sqlite3 + drizzle-orm |
+| Full-Text Search | SQLite FTS5 |
+| Vector Search | SQLite `sqlite-vec` extension (cosine similarity) |
+| Embeddings | Pluggable: local (transformers.js / ONNX) or OpenAI-compatible API |
+| MCP Protocol | `@modelcontextprotocol/sdk` |
+| HTTP | SvelteKit API routes + optional standalone MCP HTTP server |
+| CSS | TailwindCSS v4 |
+| Testing | Vitest |
+| Linting | ESLint + Prettier |
+
+---
+
+## 6. Data Model
+
+### 6.1 Repositories
+A `Repository` is the top-level entity. It maps to a GitHub repo or local directory.
+
+```
+Repository {
+  id          TEXT PRIMARY KEY        -- e.g. "/facebook/react"
+  title       TEXT NOT NULL           -- display name
+  description TEXT
+  source      TEXT NOT NULL           -- "github" | "local"
+  sourceUrl   TEXT                    -- GitHub URL or local path
+  branch      TEXT                    -- default branch
+  state       TEXT NOT NULL           -- "pending" | "indexing" | "indexed" | "error"
+  totalSnippets INTEGER DEFAULT 0
+  totalTokens   INTEGER DEFAULT 0
+  trustScore    REAL DEFAULT 0
+  stars         INTEGER
+  lastIndexedAt DATETIME
+  createdAt     DATETIME
+  updatedAt     DATETIME
+}
+```
+
+### 6.2 Repository Versions
+```
+RepositoryVersion {
+  id           TEXT PRIMARY KEY
+  repositoryId TEXT FK → Repository
+  tag          TEXT NOT NULL          -- git tag or branch name
+  title        TEXT
+  state        TEXT                   -- "pending" | "indexed" | "error"
+  indexedAt    DATETIME
+}
+```
+
+### 6.3 Documents (parsed files)
+```
+Document {
+  id           TEXT PRIMARY KEY
+  repositoryId TEXT FK → Repository
+  versionId    TEXT FK → RepositoryVersion (nullable = default branch)
+  filePath     TEXT NOT NULL
+  title        TEXT
+  content      TEXT NOT NULL          -- raw markdown/code
+  language     TEXT                   -- programming language if code file
+  tokenCount   INTEGER
+  checksum     TEXT                   -- SHA-256 for change detection
+  indexedAt    DATETIME
+}
+```
+
+### 6.4 Snippets (indexed chunks)
+```
+Snippet {
+  id           TEXT PRIMARY KEY
+  documentId   TEXT FK → Document
+  repositoryId TEXT FK → Repository
+  type         TEXT NOT NULL          -- "code" | "info"
+  title        TEXT
+  content      TEXT NOT NULL          -- the actual searchable text/code
+  language     TEXT
+  breadcrumb   TEXT                   -- heading hierarchy path
+  tokenCount   INTEGER
+  embedding    BLOB                   -- float32[] stored as blob
+  createdAt    DATETIME
+}
+```
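The `embedding` column above is described as a `float32[]` stored as a blob. One way that round trip could look is sketched below. The helper names `embeddingToBlob` and `blobToEmbedding` are hypothetical, and the byte layout (raw float32 values in platform byte order) is an assumption; the PRD does not specify it.

```typescript
// Hypothetical helpers: encode an embedding for a BLOB column and decode it again.
// Assumed layout: raw float32 values, 4 bytes per dimension, platform byte order.
function embeddingToBlob(embedding: number[]): Uint8Array {
  const floats = new Float32Array(embedding);
  // View the float32 buffer as raw bytes without copying.
  return new Uint8Array(floats.buffer);
}

function blobToEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error(`Blob length ${blob.byteLength} is not a multiple of 4`);
  }
  // Copy first so the result is 4-byte aligned even when `blob` is a view
  // into a larger buffer at an arbitrary offset.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

// These sample values are exactly representable in float32, so they round-trip unchanged.
const encoded = embeddingToBlob([0.5, -1, 2.25]);
const restored = blobToEmbedding(encoded);
console.log(encoded.byteLength, restored);
```

A three-dimensional embedding occupies 12 bytes, so blob size grows linearly with the embedding dimension chosen by the provider.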
+
+### 6.5 Indexing Jobs
+```
+IndexingJob {
+  id           TEXT PRIMARY KEY
+  repositoryId TEXT FK → Repository
+  versionId    TEXT
+  status       TEXT                   -- "queued" | "running" | "done" | "failed"
+  progress     INTEGER DEFAULT 0      -- 0-100
+  totalFiles   INTEGER
+  processedFiles INTEGER
+  error        TEXT
+  startedAt    DATETIME
+  completedAt  DATETIME
+  createdAt    DATETIME
+}
+```
+
+### 6.6 Repository Configuration (`trueref.json`)
+```
+RepositoryConfig {
+  repositoryId  TEXT FK → Repository
+  projectTitle  TEXT
+  description   TEXT
+  folders       TEXT[]                -- include paths
+  excludeFolders TEXT[]
+  excludeFiles  TEXT[]
+  rules         TEXT[]                -- best practices for LLMs
+  previousVersions { tag, title }[]
+}
+```
+
+---
+
+## 7. Core Features
+
+### F1: Repository Management
+Add, remove, update, and list repositories. Support GitHub (public, or private via token) and local filesystem sources. Trigger indexing on demand or on a schedule.
+
+### F2: GitHub Crawler
+Fetch repository file trees via GitHub Trees API. Download file contents. Respect `trueref.json` include/exclude rules. Support rate limiting and incremental re-indexing (checksum-based).
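The checksum-based incremental re-indexing mentioned above could work roughly as follows. This is a minimal sketch, not the project's implementation: `sha256`, `diffByChecksum`, and the `CrawlDiff` shape are hypothetical names, and it assumes the previous run's checksums are available as a map from file path to checksum.

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of a file's content, matching the Document.checksum idea.
function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface CrawlDiff {
  added: string[];
  changed: string[];
  removed: string[];
  unchanged: string[];
}

// `stored` maps filePath -> checksum from the previous run (assumed input);
// `current` maps filePath -> freshly crawled content.
function diffByChecksum(stored: Map<string, string>, current: Map<string, string>): CrawlDiff {
  const diff: CrawlDiff = { added: [], changed: [], removed: [], unchanged: [] };
  current.forEach((content, filePath) => {
    const previous = stored.get(filePath);
    if (previous === undefined) diff.added.push(filePath);
    else if (previous !== sha256(content)) diff.changed.push(filePath);
    else diff.unchanged.push(filePath);
  });
  stored.forEach((_checksum, filePath) => {
    if (!current.has(filePath)) diff.removed.push(filePath);
  });
  return diff;
}

const storedChecksums = new Map<string, string>([
  ["README.md", sha256("# Hello")],
  ["docs/api.md", sha256("old text")],
  ["docs/gone.md", sha256("deleted later")],
]);
const crawled = new Map<string, string>([
  ["README.md", "# Hello"],
  ["docs/api.md", "new text"],
  ["docs/new.md", "fresh"],
]);
const crawlDiff = diffByChecksum(storedChecksums, crawled);
console.log(crawlDiff);
```

Only the `added` and `changed` paths would need to be parsed and embedded again, and `removed` paths would have their documents and snippets deleted.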
+
+### F3: Local Filesystem Crawler
+Walk directory trees. Apply include/exclude rules from `trueref.json`. Watch for file changes (optional).
+
+### F4: Document Parser & Chunker
+- Parse Markdown files into sections (heading-based splitting).
+- Extract code blocks from Markdown.
+- Parse standalone code files into function/class-level chunks.
+- Calculate token counts.
+- Produce structured `Snippet` records (type: "code" or "info").
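The heading-based splitting and code-block extraction listed above can be sketched as a single pass over the lines of a Markdown file. This is an illustration under assumptions, not the planned parser: `chunkMarkdown` and `SnippetDraft` are hypothetical names, only backtick fences and ATX (`#`) headings are handled, and token counting is left out.

```typescript
// Built from an escape so this sample never contains a literal Markdown fence.
const FENCE = "\u0060\u0060\u0060";

interface SnippetDraft {
  type: "code" | "info";
  breadcrumb: string; // heading path, e.g. "Guide > Install"
  content: string;
  language?: string;
}

function chunkMarkdown(markdown: string): SnippetDraft[] {
  const snippets: SnippetDraft[] = [];
  const headings: string[] = []; // heading stack, index = level - 1
  let prose: string[] = [];
  let code: string[] | null = null; // non-null while inside a fenced block
  let codeLanguage = "";

  const breadcrumb = () => headings.filter(Boolean).join(" > ");
  const flushProse = () => {
    const text = prose.join("\n").trim();
    if (text) snippets.push({ type: "info", breadcrumb: breadcrumb(), content: text });
    prose = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.indexOf(FENCE) === 0) {
      if (code === null) {
        flushProse(); // keep prose before the block in document order
        code = [];
        codeLanguage = line.slice(FENCE.length).trim();
      } else {
        snippets.push({
          type: "code",
          breadcrumb: breadcrumb(),
          content: code.join("\n"),
          language: codeLanguage || undefined,
        });
        code = null;
      }
      continue;
    }
    if (code !== null) {
      code.push(line);
      continue;
    }
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flushProse();
      const level = heading[1].length;
      // Drop headings at the same level or deeper before recording the new one.
      headings.length = Math.min(headings.length, level - 1);
      headings[level - 1] = heading[2].trim();
      continue;
    }
    prose.push(line);
  }
  flushProse();
  return snippets;
}

const sampleDoc = ["# Guide", "Intro text.", "## Install", FENCE + "bash", "npm install", FENCE, "Run it."].join("\n");
const drafts = chunkMarkdown(sampleDoc);
console.log(drafts);
```

For the sample document this yields an "info" snippet under `Guide`, a "code" snippet tagged `bash` under `Guide > Install`, and a trailing "info" snippet under the same breadcrumb.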
+
+### F5: Embedding & Vector Storage
+- Generate embeddings for each snippet using a pluggable embeddings backend.
+- Store embeddings as binary blobs in SQLite (sqlite-vec).
+- Support fallback to FTS5-only search when no embedding provider is configured.
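To make the similarity measure concrete, the sketch below computes cosine similarity in plain TypeScript and ranks snippets by it. This is a brute-force illustration only; the PRD names the `sqlite-vec` extension for vector search, and nothing here reflects that extension's API. `cosineSimilarity` and `rankBySimilarity` are hypothetical names.

```typescript
// Cosine similarity between two embeddings of equal dimension.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedSnippet {
  id: string;
  embedding: Float32Array;
}

// Brute-force ranking: score every snippet against the query, highest first.
function rankBySimilarity(query: Float32Array, snippets: EmbeddedSnippet[]): string[] {
  return snippets
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.embedding) }))
    .sort((x, y) => y.score - x.score)
    .map((s) => s.id);
}

const ranked = rankBySimilarity(new Float32Array([1, 0]), [
  { id: "orthogonal", embedding: new Float32Array([0, 1]) },
  { id: "same-direction", embedding: new Float32Array([2, 0]) },
  { id: "opposite", embedding: new Float32Array([-1, 0]) },
]);
console.log(ranked);
```

When no embedding provider is configured there is nothing to score, which is where the FTS5-only fallback described above would apply.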
+
+### F6: Semantic Search Engine
+- Hybrid search: vector similarity + FTS5 keyword matching (BM25) with reciprocal rank fusion.
+- Query-time retrieval: given `libraryId + query`, return ranked snippets.
+- Library search: given `libraryName + query`, return matching repositories.
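Reciprocal rank fusion, named above as the way to combine vector and keyword results, gives each item a score of 1 / (k + rank) per list and sums the scores across lists. The sketch below assumes each search backend returns snippet IDs in ranked order; `reciprocalRankFusion` is a hypothetical name, and `k = 60` is a commonly used default rather than a value taken from the PRD.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Fuse several ranked ID lists: score(id) = sum over lists of 1 / (k + rank),
// where rank is 1-based. IDs missing from a list contribute nothing for it.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + 1 / (k + index + 1);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

const vectorHits = ["s3", "s1", "s7"]; // hypothetical vector-search order
const keywordHits = ["s1", "s9", "s3"]; // hypothetical FTS5/BM25 order
const fused = reciprocalRankFusion([vectorHits, keywordHits]);
console.log(fused.map((r) => r.id)); // s1 and s3 lead because both lists contain them
```

Because only ranks are used, the two backends do not need comparable score scales, which is convenient when mixing BM25 scores with cosine similarities.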
+
+### F7: REST API (`/api/v1/*`)
+- `GET /api/v1/libs/search?query=&libraryName=` — search libraries (context7-compatible)
+- `GET /api/v1/context?query=&libraryId=&type=json|txt` — fetch documentation
+- `GET /api/v1/libs` — list all indexed libraries
+- `POST /api/v1/libs` — add a new repository
+- `DELETE /api/v1/libs/:id` — remove a repository
+- `POST /api/v1/libs/:id/index` — trigger re-indexing
+- `GET /api/v1/jobs/:id` — get indexing job status
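A client for the context endpoint listed above mostly needs to build the query string correctly, since library IDs contain slashes. The sketch below is an assumption-laden illustration: `buildContextUrl` is a hypothetical helper, and the base URL (including the port) is a placeholder for wherever a TrueRef instance is running.

```typescript
type ContextFormat = "json" | "txt";

// Build a GET /api/v1/context URL with the three query parameters from the PRD.
function buildContextUrl(
  baseUrl: string,
  libraryId: string,
  query: string,
  format: ContextFormat = "json",
): string {
  const url = new URL("/api/v1/context", baseUrl);
  // URLSearchParams percent-encodes reserved characters such as "/" in the library ID.
  url.searchParams.set("query", query);
  url.searchParams.set("libraryId", libraryId);
  url.searchParams.set("type", format);
  return url.toString();
}

// Placeholder host and port; substitute the address of an actual deployment.
const contextUrl = buildContextUrl("http://localhost:5173", "/facebook/react", "useEffect cleanup");
console.log(contextUrl);
```

The returned string can be passed to any HTTP client; parsing it back with `new URL(...)` recovers the original `libraryId` and `query` values.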
+
+### F8: MCP Server
+- Tool: `resolve-library-id` — search for libraries by name
+- Tool: `query-docs` — fetch documentation by libraryId + query
+- Transport: stdio (primary), HTTP (optional)
+- Compatible with Claude Code, Cursor, and other MCP-aware tools
+
+### F9: Web UI — Repository Dashboard
+- List all repositories with status, snippet count, last indexed date
+- Add/remove repositories (GitHub URL or local path)
+- Trigger re-indexing
+- View indexing job progress
+- View repository config (`trueref.json`)
+
+### F10: Web UI — Search Explorer
+- Interactive search interface (resolve library → query docs)
+- Preview snippets with syntax highlighting
+- View raw document content
+
+### F11: `trueref.json` Config Support
+- Parse `trueref.json` from repo root (or `context7.json` for compatibility)
+- Apply `folders`, `excludeFolders`, `excludeFiles` during crawling
+- Inject `rules` into LLM context alongside snippets
+- Support `previousVersions` for versioned documentation
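How the `folders`, `excludeFolders`, and `excludeFiles` fields might be applied during crawling is sketched below. The PRD does not define the matching semantics, so this assumes plain path-prefix matching for folders and exact file-name matching for files; `TrueRefConfig`, `isInside`, and `shouldIndex` are hypothetical names.

```typescript
interface TrueRefConfig {
  folders?: string[];
  excludeFolders?: string[];
  excludeFiles?: string[];
}

// True when `filePath` is `folder` itself or lives underneath it.
// Leading and trailing slashes on `folder` are ignored; an empty folder matches everything.
function isInside(filePath: string, folder: string): boolean {
  const prefix = folder.replace(/^\/+|\/+$/g, "");
  if (prefix === "") return true;
  return filePath === prefix || filePath.indexOf(prefix + "/") === 0;
}

// Assumed semantics: an empty or missing `folders` list includes the whole repository,
// and any exclude rule wins over an include rule.
function shouldIndex(filePath: string, config: TrueRefConfig): boolean {
  const folders = config.folders || [];
  const included = folders.length === 0 || folders.some((f) => isInside(filePath, f));
  const inExcludedFolder = (config.excludeFolders || []).some((f) => isInside(filePath, f));
  const fileName = filePath.split("/").pop() || "";
  const excludedFile = (config.excludeFiles || []).indexOf(fileName) !== -1;
  return included && !inExcludedFolder && !excludedFile;
}

const sampleConfig: TrueRefConfig = {
  folders: ["docs", "src"],
  excludeFolders: ["docs/archive"],
  excludeFiles: ["CHANGELOG.md"],
};
const candidatePaths = ["docs/guide.md", "docs/archive/old.md", "src/CHANGELOG.md", "test/a.md"];
console.log(candidatePaths.filter((p) => shouldIndex(p, sampleConfig)));
```

Matching on a prefix followed by `/` keeps a folder named `docs` from accidentally matching a sibling such as `documentation`.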
+
+### F12: Indexing Pipeline & Job Queue
+- SQLite-backed job queue (no external message broker required)
+- Sequential processing with progress tracking
+- Error recovery and retry logic
+- Incremental re-indexing using file checksums
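The sequential processing, progress tracking, and retry behaviour listed above can be modelled as a few state transitions on a job record. The sketch below keeps jobs in memory for illustration, whereas the PRD calls for a SQLite-backed queue. The `attempts` field, the `maxAttempts` limit, and the function names are assumptions that do not appear in the `IndexingJob` schema.

```typescript
type JobStatus = "queued" | "running" | "done" | "failed";

interface IndexingJob {
  id: string;
  status: JobStatus;
  totalFiles: number;
  processedFiles: number;
  progress: number; // 0-100
  attempts: number; // hypothetical retry counter, not in the PRD schema
  error?: string;
}

// Claim the first queued job; `jobs` is assumed to be in creation order.
function claimNext(jobs: IndexingJob[]): IndexingJob | undefined {
  for (const job of jobs) {
    if (job.status === "queued") {
      job.status = "running";
      job.attempts += 1;
      return job;
    }
  }
  return undefined;
}

function recordProgress(job: IndexingJob, processedFiles: number): void {
  job.processedFiles = processedFiles;
  job.progress = job.totalFiles === 0 ? 100 : Math.min(100, Math.floor((processedFiles / job.totalFiles) * 100));
  if (processedFiles >= job.totalFiles) job.status = "done";
}

// Requeue the job until it has used up `maxAttempts`, then mark it failed.
function recordFailure(job: IndexingJob, message: string, maxAttempts = 3): void {
  job.error = message;
  job.status = job.attempts < maxAttempts ? "queued" : "failed";
}

const jobQueue: IndexingJob[] = [
  { id: "job-1", status: "queued", totalFiles: 4, processedFiles: 0, progress: 0, attempts: 0 },
];
const statusLog: string[] = [];

const firstRun = claimNext(jobQueue);
if (firstRun) {
  statusLog.push(firstRun.status);
  recordProgress(firstRun, 1);
  recordFailure(firstRun, "rate limited");
  statusLog.push(firstRun.status);
}
const secondRun = claimNext(jobQueue);
if (secondRun) {
  statusLog.push(secondRun.status);
  recordProgress(secondRun, 4);
  statusLog.push(secondRun.status);
}
console.log(statusLog, jobQueue[0].progress, jobQueue[0].attempts);
```

In a SQLite-backed version the claim step would need to happen inside a transaction so that two workers cannot pick up the same job, although sequential processing with a single worker sidesteps most of that.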
+
+### F13: Version Support
+- Index specific git tags/branches per repository
+- Serve version-specific context when the libraryId includes a version (`/owner/repo/v1.2.3`)
+- UI for managing available versions
+
+---
+
+## 8. API Compatibility with context7
+
+TrueRef's REST API mirrors context7's `/api/v2/*` interface under the `/api/v1/*` prefix, so existing clients need only change the base URL and path prefix:
+
+| context7 Endpoint | TrueRef Endpoint | Notes |
+|-------------------|-----------------|-------|
+| `GET /api/v2/libs/search` | `GET /api/v1/libs/search` | Same query params |
+| `GET /api/v2/context` | `GET /api/v1/context` | Same query params, same response shape |
+
+The MCP tool names and input schemas are identical:
+- `resolve-library-id` with `libraryName` + `query`
+- `query-docs` with `libraryId` + `query`
+
+Library IDs follow the same convention: `/owner/repo` or `/owner/repo/version`.
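Splitting a library ID into its parts is the first step in serving version-specific context. The sketch below parses the two forms given above; `parseLibraryId` and `LibraryRef` are hypothetical names, and the rule that each segment is non-empty and contains no slash is an assumption.

```typescript
interface LibraryRef {
  owner: string;
  repo: string;
  version?: string; // undefined means the default branch
  repositoryId: string; // "/owner/repo", usable as the Repository.id key
}

// Accepts "/owner/repo" or "/owner/repo/version"; anything else is rejected.
function parseLibraryId(libraryId: string): LibraryRef {
  const match = /^\/([^/]+)\/([^/]+)(?:\/([^/]+))?$/.exec(libraryId);
  if (!match) {
    throw new Error(`Invalid library ID: ${libraryId}`);
  }
  const [, owner, repo, version] = match;
  return { owner, repo, version, repositoryId: `/${owner}/${repo}` };
}

const unversioned = parseLibraryId("/facebook/react");
const versioned = parseLibraryId("/vercel/next.js/v14.3.0");
console.log(unversioned, versioned);
```

The `repositoryId` field always drops the version segment, so both forms resolve to the same repository row before a version lookup is attempted.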
+
+---
+
+## 9. Non-Functional Requirements
+
+### Performance
+- Library search: < 200ms p99
+- Documentation retrieval: < 500ms p99 for 20 snippets
+- Indexing throughput: > 1,000 files/minute, except where GitHub API rate limits are the bottleneck
+
+### Reliability
+- Failed indexing jobs must not corrupt existing indexed data
+- Atomic snippet replacement during re-indexing
+
+### Portability
+- Single SQLite file for all data
+- Runs on Linux, macOS, Windows (Node.js 20+)
+- No required external services beyond optional embedding API
+
+### Scalability (v1 constraints)
+- Designed for single-node deployment
+- SQLite is expected to handle up to ~500 repositories and ~500k snippets
+
+---
+
+## 10. Milestones & Feature Order
+
+| ID | Feature | Priority | Depends On |
+|----|---------|----------|-----------|
+| TRUEREF-0001 | Database schema & core data models | P0 | — |
+| TRUEREF-0002 | Repository management service & REST API | P0 | TRUEREF-0001 |
+| TRUEREF-0003 | GitHub repository crawler | P0 | TRUEREF-0001 |
+| TRUEREF-0004 | Local filesystem crawler | P1 | TRUEREF-0001 |
+| TRUEREF-0005 | Document parser & chunker | P0 | TRUEREF-0001 |
+| TRUEREF-0006 | SQLite FTS5 full-text search | P0 | TRUEREF-0005 |
+| TRUEREF-0007 | Embedding generation & vector storage | P1 | TRUEREF-0005 |
+| TRUEREF-0008 | Hybrid semantic search engine | P1 | TRUEREF-0006, TRUEREF-0007 |
+| TRUEREF-0009 | Indexing pipeline & job queue | P0 | TRUEREF-0003, TRUEREF-0005 |
+| TRUEREF-0010 | REST API (search + context endpoints) | P0 | TRUEREF-0006, TRUEREF-0009 |
+| TRUEREF-0011 | MCP server (stdio transport) | P0 | TRUEREF-0010 |
+| TRUEREF-0012 | MCP server (HTTP transport) | P1 | TRUEREF-0011 |
+| TRUEREF-0013 | `trueref.json` config file support | P0 | TRUEREF-0003 |
+| TRUEREF-0014 | Repository version management | P1 | TRUEREF-0003 |
+| TRUEREF-0015 | Web UI — repository dashboard | P1 | TRUEREF-0002, TRUEREF-0009 |
+| TRUEREF-0016 | Web UI — search explorer | P2 | TRUEREF-0010, TRUEREF-0015 |
+| TRUEREF-0017 | Incremental re-indexing (checksum diff) | P1 | TRUEREF-0009 |
+| TRUEREF-0018 | Embedding provider configuration UI | P2 | TRUEREF-0007, TRUEREF-0015 |
+
+---
+
+## 11. Open Questions
+
+1. **Embedding provider default**: Should we bundle a local model (transformers.js) or require external API configuration? A local model provides a zero-config experience but adds ~100MB to the install size.
+2. **Streaming responses**: Should `query-docs` support streaming for long responses?
+3. **GitHub private repos**: Should we store the GitHub token in the DB or require it per-request?
+4. **context7.json backward compatibility**: Should we auto-detect and use `context7.json` if no `trueref.json` is present?
+5. **Re-indexing strategy**: Automatic periodic re-indexing (cron) vs. manual-only?