diff --git a/docs/features/TRUEREF-0019.md b/docs/features/TRUEREF-0019.md new file mode 100644 index 0000000..8b4dfa3 --- /dev/null +++ b/docs/features/TRUEREF-0019.md @@ -0,0 +1,347 @@ +# TRUEREF-0019 — Git-Native Version Indexing and Corporate Deployment Support + +**Priority:** P1 +**Status:** Pending +**Depends On:** TRUEREF-0001, TRUEREF-0002 +**Blocks:** — + +--- + +## Overview + +TrueRef is intended for corporate environments where developers work with private repositories hosted on Bitbucket Server/Data Center and self-hosted GitLab instances. This feature covers two tightly related concerns: + +1. **Git-native version indexing** — resolve version tags to specific commit hashes, extract file trees per commit using `git archive`, and store indexed snippets against an exact commit rather than a floating label. +2. **Corporate deployment support** — run TrueRef inside Docker with access to private remotes, per-host credentials, and corporate CA certificates without requiring changes to the host machine beyond environment variables and file mounts. + +Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants. + +--- + +## Part 1: Git-Native Version Indexing + +### Background + +Currently, versions are registered manually with a tag and a title. There is no link between a version tag and a specific git commit, so re-indexing the same version after a branch has moved produces different results silently. For documentation retrieval to be trustworthy, each indexed version must be pinned to an immutable commit hash. + +### Version → Commit Hash Resolution + +Git tags are the canonical mapping. Both lightweight and annotated tags must be handled: + +```sh +# Resolves any tag, branch, or ref to the underlying commit hash +git -C /path/to/repo rev-parse ^{commit} +``` + +The `^{commit}` peel dereferences annotated tag objects to the commit they point at. + +For GitHub-hosted repositories the GitHub REST API provides the same mapping without requiring a local clone: + +``` +GET /repos/{owner}/{repo}/git/refs/tags/{tag} +``` + +If the returned object type is `tag` (annotated), a second request is required to dereference it: + +``` +GET /repos/{owner}/{repo}/git/tags/{sha} +→ { object: { sha: "", type: "commit" } } +``` + +### Schema Change + +Add `commit_hash` to the versions table: + +```sql +ALTER TABLE versions ADD COLUMN commit_hash TEXT; +``` + +Populated at version registration time. Re-indexing a version always uses the stored commit hash, never re-resolves the tag (tags can be moved; the stored hash is immutable). + +### Tag Auto-Discovery + +When a repository is added or re-fetched, TrueRef should discover all version tags automatically and propose them to the user rather than requiring manual entry: + +```sh +git -C /path/to/repo tag -l +# → ["v1.0.0", "v1.1.0", "v2.0.0", ...] +``` + +For each tag, resolve to a commit hash and pre-populate the versions table. The user can then select which versions to index from the UI. + +### Per-Version File Extraction via `git archive` + +`git archive` extracts a clean file tree from any commit without disturbing the working directory or requiring a separate clone: + +```sh +git -C /path/to/repo archive | tar -x -C /tmp/trueref-idx/-/ +``` + +Advantages over `git checkout` or worktrees: +- Working directory is completely untouched +- No `.git` directory in the output (cleaner for parsing) +- Temp directory deleted after indexing with no git state to clean up +- Multiple versions can be extracted in parallel + +The indexing pipeline receives the extracted temp directory as its source path — identical to the existing local crawler interface. + +``` +trigger index v2.0.0 + → look up commit_hash = "a3f9c12" + → git archive a3f9c12 | tar -x /tmp/trueref/repo-v2.0.0/ + → run crawler on /tmp/trueref/repo-v2.0.0/ + → store snippets { repo_id, version_tag, commit_hash } + → rm -rf /tmp/trueref/repo-v2.0.0/ +``` + +### trueref.json Explicit Override + +Allow commit hashes to be pinned explicitly per version, overriding tag resolution. Useful when tags are mutable, versioning is non-standard, or a specific patch within a release must be targeted: + +```json +{ + "previousVersions": [ + { + "tag": "v2.0.0", + "title": "Version 2.0.0", + "commitHash": "a3f9c12abc..." + } + ] +} +``` + +### Edge Cases + +| Case | Handling | +|------|----------| +| Annotated tags | `rev-parse ^{commit}` peels to commit automatically | +| Mutable tags (e.g. `latest`) | Re-resolve on re-index; warn in UI if hash has changed | +| Branch as version | `rev-parse origin/^{commit}` gives tip; re-resolves on re-index | +| Shallow clone | Run `git fetch --unshallow` before `git archive` if commit is unavailable | +| Submodules | `git archive --recurse-submodules` or document as a known limitation | +| Git LFS | `git lfs pull` required after archive if LFS-tracked files are needed for indexing | + +### Acceptance Criteria + +- [ ] `versions` table has a `commit_hash` column with a migration +- [ ] `resolveTagToCommit(repoPath, tag): string` utility using `child_process` +- [ ] Tag auto-discovery runs on repo registration and on manual re-fetch +- [ ] Discovered tags appear in the UI with their resolved commit hashes +- [ ] Indexing a version extracts files via `git archive` to a temp directory +- [ ] Temp directory is deleted after indexing completes (success or failure) +- [ ] Snippets are stored with both `version_tag` and `commit_hash` +- [ ] `trueref.json` `commitHash` field overrides tag resolution when present +- [ ] Re-indexing a version with a moved tag warns the user if the hash has changed + +--- + +## Part 2: Corporate Deployment Support + +### Background + +Corporate environments introduce three constraints not present in standard cloud setups: + +1. **Multiple private git remotes** — typically Bitbucket Server/Data Center and self-hosted GitLab on separate hostnames, each requiring independent credentials. +2. **Credential management on Windows** — Git Credential Manager stores credentials in the Windows Credential Manager (DPAPI-encrypted), which is inaccessible from inside a Linux container. Personal Access Tokens passed as environment variables are the correct substitute. +3. **Custom CA certificates** — on-premise servers use certificates issued by corporate CAs not trusted by default in Linux container images. All git operations, HTTP requests, and Node.js `fetch` calls fail until the CA is registered at the OS level. + +### Multi-Remote Credential Configuration + +Two environment variables carry the HTTPS tokens, one per remote. Git's per-host credential scoping routes each token to the correct server: + +```sh +# In docker-entrypoint.sh, before any git operation + +if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then + git config --global \ + "credential.https://${BITBUCKET_HOST}.helper" \ + "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f" +fi + +if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then + git config --global \ + "credential.https://${GITLAB_HOST}.helper" \ + "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f" +fi +``` + +Username conventions by server type: + +| Server | HTTPS username | Password | +|--------|---------------|----------| +| Bitbucket Server / Data Center | `x-token-auth` | HTTP access token | +| Bitbucket Cloud | account username | App password | +| GitLab (self-hosted or cloud) | `oauth2` | Personal access token | +| GitLab deploy token | `gitlab-deploy-token` | Deploy token secret | + +SSH authentication is also supported and preferred for long-lived deployments. The host SSH configuration (`~/.ssh/config`) handles per-host key selection and travels into the container via volume mount. + +### Corporate CA Certificate Handling + +The CA certificate is mounted as a read-only file at a staging path. The entrypoint detects the encoding (PEM or DER) and installs it at the system level before any other operation runs: + +```sh +if [ -f /certs/corp-ca.crt ]; then + if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then + cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt + else + openssl x509 -inform DER -in /certs/corp-ca.crt \ + -out /usr/local/share/ca-certificates/corp-ca.crt + fi + update-ca-certificates 2>/dev/null +fi +``` + +Installing at the OS level means git, curl, and Node.js `fetch` all trust the certificate without per-tool flags. `GIT_SSL_NO_VERIFY` is explicitly not used. + +`openssl` must be added to the production image: + +```dockerfile +RUN apk add --no-cache openssl +``` + +### SSH Key Permissions + +Windows NTFS does not track Unix permissions. SSH keys mounted from `%USERPROFILE%\.ssh` arrive with world-readable permissions that the SSH client rejects. The entrypoint corrects this before any git operation: + +```sh +if [ -d /root/.ssh ]; then + chmod 700 /root/.ssh + chmod 600 /root/.ssh/* 2>/dev/null || true +fi +``` + +### docker-compose.yml + +```yaml +services: + web: + build: . + ports: + - "3000:3000" + volumes: + - trueref-data:/data + - ${USERPROFILE}/.ssh:/root/.ssh:ro + - ${USERPROFILE}/.gitconfig:/root/.gitconfig:ro + - ${CORP_CA_CERT}:/certs/corp-ca.crt:ro + environment: + DATABASE_URL: /data/trueref.db + GIT_TOKEN_BITBUCKET: "${BITBUCKET_TOKEN}" + GIT_TOKEN_GITLAB: "${GITLAB_TOKEN}" + BITBUCKET_HOST: "${BITBUCKET_HOST}" + GITLAB_HOST: "${GITLAB_HOST}" + restart: unless-stopped + + mcp: + build: . + command: mcp + ports: + - "3001:3001" + environment: + TRUEREF_API_URL: http://web:3000 + MCP_PORT: "3001" + depends_on: + - web + restart: unless-stopped + +volumes: + trueref-data: +``` + +### .env Template + +```env +# Corporate CA certificate (PEM or DER, auto-detected) +CORP_CA_CERT=C:/path/to/corp-ca.crt + +# Git remote hostnames +BITBUCKET_HOST=bitbucket.corp.example.com +GITLAB_HOST=gitlab.corp.example.com + +# Personal access tokens (never commit these) +BITBUCKET_TOKEN= +GITLAB_TOKEN= +``` + +### Full docker-entrypoint.sh + +```sh +#!/bin/sh +set -e + +# 1. Trust corporate CA — must run first +if [ -f /certs/corp-ca.crt ]; then + if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then + cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt + else + openssl x509 -inform DER -in /certs/corp-ca.crt \ + -out /usr/local/share/ca-certificates/corp-ca.crt + fi + update-ca-certificates 2>/dev/null +fi + +# 2. Fix SSH key permissions (Windows mounts arrive world-readable) +if [ -d /root/.ssh ]; then + chmod 700 /root/.ssh + chmod 600 /root/.ssh/* 2>/dev/null || true +fi + +# 3. Per-host HTTPS credential helpers +if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then + git config --global \ + "credential.https://${BITBUCKET_HOST}.helper" \ + "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f" +fi + +if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then + git config --global \ + "credential.https://${GITLAB_HOST}.helper" \ + "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f" +fi + +case "${1:-web}" in + web) + echo "Running database migrations..." + DATABASE_URL="$DATABASE_URL" npx drizzle-kit migrate + echo "Starting TrueRef web app on port ${PORT:-3000}..." + exec node build + ;; + mcp) + MCP_PORT="${MCP_PORT:-3001}" + echo "Starting TrueRef MCP HTTP server on port ${MCP_PORT}..." + exec npx tsx src/mcp/index.ts --transport http --port "$MCP_PORT" + ;; + *) + exec "$@" + ;; +esac +``` + +### Acceptance Criteria + +- [ ] `docker-entrypoint.sh` runs CA install, SSH permission fix, and credential helpers in that order before any other operation +- [ ] CA cert mount path is `/certs/corp-ca.crt`; PEM and DER are both handled without user intervention +- [ ] `openssl` is installed in the production Docker image +- [ ] Per-host credential helpers are configured only when the relevant env vars are set (no-op if absent) +- [ ] SSH key permissions are corrected unconditionally when `/root/.ssh` exists +- [ ] `docker-compose.yml` exposes `BITBUCKET_HOST`, `GITLAB_HOST`, `GIT_TOKEN_BITBUCKET`, `GIT_TOKEN_GITLAB`, and `CORP_CA_CERT` as first-class configuration points +- [ ] `.env.example` includes all corporate deployment variables with placeholder values +- [ ] README Docker section documents the corporate setup (cert export on Windows, SSH mount, per-host tokens) + +--- + +## Files to Create + +- `Dockerfile` +- `docker-compose.yml` +- `docker-entrypoint.sh` +- `.dockerignore` +- `.env.example` (update) + +## Files to Modify + +- `src/lib/server/db/schema.ts` — add `commit_hash` to versions table +- `src/lib/server/db/migrations/` — new migration for `commit_hash` +- `src/lib/server/services/version.service.ts` — `resolveTagToCommit`, tag auto-discovery +- `src/lib/server/pipeline/indexing.pipeline.ts` — `git archive` extraction, temp dir lifecycle +- `README.md` — Docker section, corporate deployment guide