# TRUEREF-0019 — Git-Native Version Indexing and Corporate Deployment Support **Priority:** P1 **Status:** Pending **Depends On:** TRUEREF-0001, TRUEREF-0002 **Blocks:** — --- ## Overview TrueRef is intended for corporate environments where developers work with private repositories hosted on Bitbucket Server/Data Center and self-hosted GitLab instances. This feature covers two tightly related concerns: 1. **Git-native version indexing** — resolve version tags to specific commit hashes, extract file trees per commit using `git archive`, and store indexed snippets against an exact commit rather than a floating label. 2. **Corporate deployment support** — run TrueRef inside Docker with access to private remotes, per-host credentials, and corporate CA certificates without requiring changes to the host machine beyond environment variables and file mounts. Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants. --- ## Part 1: Git-Native Version Indexing ### Background Currently, versions are registered manually with a tag and a title. There is no link between a version tag and a specific git commit, so re-indexing the same version after a branch has moved produces different results silently. For documentation retrieval to be trustworthy, each indexed version must be pinned to an immutable commit hash. ### Version → Commit Hash Resolution Git tags are the canonical mapping. Both lightweight and annotated tags must be handled: ```sh # Resolves any tag, branch, or ref to the underlying commit hash git -C /path/to/repo rev-parse ^{commit} ``` The `^{commit}` peel dereferences annotated tag objects to the commit they point at. For GitHub-hosted repositories the GitHub REST API provides the same mapping without requiring a local clone: ``` GET /repos/{owner}/{repo}/git/refs/tags/{tag} ``` If the returned object type is `tag` (annotated), a second request is required to dereference it: ``` GET /repos/{owner}/{repo}/git/tags/{sha} → { object: { sha: "", type: "commit" } } ``` ### Schema Change Add `commit_hash` to the versions table: ```sql ALTER TABLE versions ADD COLUMN commit_hash TEXT; ``` Populated at version registration time. Re-indexing a version always uses the stored commit hash, never re-resolves the tag (tags can be moved; the stored hash is immutable). ### Tag Auto-Discovery When a repository is added or re-fetched, TrueRef should discover all version tags automatically and propose them to the user rather than requiring manual entry: ```sh git -C /path/to/repo tag -l # → ["v1.0.0", "v1.1.0", "v2.0.0", ...] ``` For each tag, resolve to a commit hash and pre-populate the versions table. The user can then select which versions to index from the UI. ### Per-Version File Extraction via `git archive` `git archive` extracts a clean file tree from any commit without disturbing the working directory or requiring a separate clone: ```sh git -C /path/to/repo archive | tar -x -C /tmp/trueref-idx/-/ ``` Advantages over `git checkout` or worktrees: - Working directory is completely untouched - No `.git` directory in the output (cleaner for parsing) - Temp directory deleted after indexing with no git state to clean up - Multiple versions can be extracted in parallel The indexing pipeline receives the extracted temp directory as its source path — identical to the existing local crawler interface. ``` trigger index v2.0.0 → look up commit_hash = "a3f9c12" → git archive a3f9c12 | tar -x /tmp/trueref/repo-v2.0.0/ → run crawler on /tmp/trueref/repo-v2.0.0/ → store snippets { repo_id, version_tag, commit_hash } → rm -rf /tmp/trueref/repo-v2.0.0/ ``` ### trueref.json Explicit Override Allow commit hashes to be pinned explicitly per version, overriding tag resolution. Useful when tags are mutable, versioning is non-standard, or a specific patch within a release must be targeted: ```json { "previousVersions": [ { "tag": "v2.0.0", "title": "Version 2.0.0", "commitHash": "a3f9c12abc..." } ] } ``` ### Edge Cases | Case | Handling | |------|----------| | Annotated tags | `rev-parse ^{commit}` peels to commit automatically | | Mutable tags (e.g. `latest`) | Re-resolve on re-index; warn in UI if hash has changed | | Branch as version | `rev-parse origin/^{commit}` gives tip; re-resolves on re-index | | Shallow clone | Run `git fetch --unshallow` before `git archive` if commit is unavailable | | Submodules | `git archive --recurse-submodules` or document as a known limitation | | Git LFS | `git lfs pull` required after archive if LFS-tracked files are needed for indexing | ### Acceptance Criteria - [ ] `versions` table has a `commit_hash` column with a migration - [ ] `resolveTagToCommit(repoPath, tag): string` utility using `child_process` - [ ] Tag auto-discovery runs on repo registration and on manual re-fetch - [ ] Discovered tags appear in the UI with their resolved commit hashes - [ ] Indexing a version extracts files via `git archive` to a temp directory - [ ] Temp directory is deleted after indexing completes (success or failure) - [ ] Snippets are stored with both `version_tag` and `commit_hash` - [ ] `trueref.json` `commitHash` field overrides tag resolution when present - [ ] Re-indexing a version with a moved tag warns the user if the hash has changed --- ## Part 2: Corporate Deployment Support ### Background Corporate environments introduce three constraints not present in standard cloud setups: 1. **Multiple private git remotes** — typically Bitbucket Server/Data Center and self-hosted GitLab on separate hostnames, each requiring independent credentials. 2. **Credential management on Windows** — Git Credential Manager stores credentials in the Windows Credential Manager (DPAPI-encrypted), which is inaccessible from inside a Linux container. Personal Access Tokens passed as environment variables are the correct substitute. 3. **Custom CA certificates** — on-premise servers use certificates issued by corporate CAs not trusted by default in Linux container images. All git operations, HTTP requests, and Node.js `fetch` calls fail until the CA is registered at the OS level. ### Multi-Remote Credential Configuration Two environment variables carry the HTTPS tokens, one per remote. Git's per-host credential scoping routes each token to the correct server: ```sh # In docker-entrypoint.sh, before any git operation if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then git config --global \ "credential.https://${BITBUCKET_HOST}.helper" \ "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f" fi if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then git config --global \ "credential.https://${GITLAB_HOST}.helper" \ "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f" fi ``` Username conventions by server type: | Server | HTTPS username | Password | |--------|---------------|----------| | Bitbucket Server / Data Center | `x-token-auth` | HTTP access token | | Bitbucket Cloud | account username | App password | | GitLab (self-hosted or cloud) | `oauth2` | Personal access token | | GitLab deploy token | `gitlab-deploy-token` | Deploy token secret | SSH authentication is also supported and preferred for long-lived deployments. The host SSH configuration (`~/.ssh/config`) handles per-host key selection and travels into the container via volume mount. ### Corporate CA Certificate Handling The CA certificate is mounted as a read-only file at a staging path. The entrypoint detects the encoding (PEM or DER) and installs it at the system level before any other operation runs: ```sh if [ -f /certs/corp-ca.crt ]; then if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt else openssl x509 -inform DER -in /certs/corp-ca.crt \ -out /usr/local/share/ca-certificates/corp-ca.crt fi update-ca-certificates 2>/dev/null fi ``` Installing at the OS level means git, curl, and Node.js `fetch` all trust the certificate without per-tool flags. `GIT_SSL_NO_VERIFY` is explicitly not used. `openssl` must be added to the production image: ```dockerfile RUN apk add --no-cache openssl ``` ### SSH Key Permissions Windows NTFS does not track Unix permissions. SSH keys mounted from `%USERPROFILE%\.ssh` arrive with world-readable permissions that the SSH client rejects. The entrypoint corrects this before any git operation: ```sh if [ -d /root/.ssh ]; then chmod 700 /root/.ssh chmod 600 /root/.ssh/* 2>/dev/null || true fi ``` ### docker-compose.yml ```yaml services: web: build: . ports: - "3000:3000" volumes: - trueref-data:/data - ${USERPROFILE}/.ssh:/root/.ssh:ro - ${USERPROFILE}/.gitconfig:/root/.gitconfig:ro - ${CORP_CA_CERT}:/certs/corp-ca.crt:ro environment: DATABASE_URL: /data/trueref.db GIT_TOKEN_BITBUCKET: "${BITBUCKET_TOKEN}" GIT_TOKEN_GITLAB: "${GITLAB_TOKEN}" BITBUCKET_HOST: "${BITBUCKET_HOST}" GITLAB_HOST: "${GITLAB_HOST}" restart: unless-stopped mcp: build: . command: mcp ports: - "3001:3001" environment: TRUEREF_API_URL: http://web:3000 MCP_PORT: "3001" depends_on: - web restart: unless-stopped volumes: trueref-data: ``` ### .env Template ```env # Corporate CA certificate (PEM or DER, auto-detected) CORP_CA_CERT=C:/path/to/corp-ca.crt # Git remote hostnames BITBUCKET_HOST=bitbucket.corp.example.com GITLAB_HOST=gitlab.corp.example.com # Personal access tokens (never commit these) BITBUCKET_TOKEN= GITLAB_TOKEN= ``` ### Full docker-entrypoint.sh ```sh #!/bin/sh set -e # 1. Trust corporate CA — must run first if [ -f /certs/corp-ca.crt ]; then if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt else openssl x509 -inform DER -in /certs/corp-ca.crt \ -out /usr/local/share/ca-certificates/corp-ca.crt fi update-ca-certificates 2>/dev/null fi # 2. Fix SSH key permissions (Windows mounts arrive world-readable) if [ -d /root/.ssh ]; then chmod 700 /root/.ssh chmod 600 /root/.ssh/* 2>/dev/null || true fi # 3. Per-host HTTPS credential helpers if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then git config --global \ "credential.https://${BITBUCKET_HOST}.helper" \ "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f" fi if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then git config --global \ "credential.https://${GITLAB_HOST}.helper" \ "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f" fi case "${1:-web}" in web) echo "Running database migrations..." DATABASE_URL="$DATABASE_URL" npx drizzle-kit migrate echo "Starting TrueRef web app on port ${PORT:-3000}..." exec node build ;; mcp) MCP_PORT="${MCP_PORT:-3001}" echo "Starting TrueRef MCP HTTP server on port ${MCP_PORT}..." exec npx tsx src/mcp/index.ts --transport http --port "$MCP_PORT" ;; *) exec "$@" ;; esac ``` ### Acceptance Criteria - [ ] `docker-entrypoint.sh` runs CA install, SSH permission fix, and credential helpers in that order before any other operation - [ ] CA cert mount path is `/certs/corp-ca.crt`; PEM and DER are both handled without user intervention - [ ] `openssl` is installed in the production Docker image - [ ] Per-host credential helpers are configured only when the relevant env vars are set (no-op if absent) - [ ] SSH key permissions are corrected unconditionally when `/root/.ssh` exists - [ ] `docker-compose.yml` exposes `BITBUCKET_HOST`, `GITLAB_HOST`, `GIT_TOKEN_BITBUCKET`, `GIT_TOKEN_GITLAB`, and `CORP_CA_CERT` as first-class configuration points - [ ] `.env.example` includes all corporate deployment variables with placeholder values - [ ] README Docker section documents the corporate setup (cert export on Windows, SSH mount, per-host tokens) --- ## Files to Create - `Dockerfile` - `docker-compose.yml` - `docker-entrypoint.sh` - `.dockerignore` - `.env.example` (update) ## Files to Modify - `src/lib/server/db/schema.ts` — add `commit_hash` to versions table - `src/lib/server/db/migrations/` — new migration for `commit_hash` - `src/lib/server/services/version.service.ts` — `resolveTagToCommit`, tag auto-discovery - `src/lib/server/pipeline/indexing.pipeline.ts` — `git archive` extraction, temp dir lifecycle - `README.md` — Docker section, corporate deployment guide