Files
trueref/docs/features/TRUEREF-0019.md
2026-03-27 02:23:01 +01:00

13 KiB

TRUEREF-0019 — Git-Native Version Indexing and Corporate Deployment Support

Priority: P1 Status: Pending Depends On: TRUEREF-0001, TRUEREF-0002, TRUEREF-0020 Blocks:


Overview

TrueRef is intended for corporate environments where developers work with private repositories hosted on Bitbucket Server/Data Center and self-hosted GitLab instances. This feature covers two tightly related concerns:

  1. Git-native version indexing — resolve version tags to specific commit hashes, extract file trees per commit using git archive, and store indexed snippets against an exact commit rather than a floating label.
  2. Corporate deployment support — run TrueRef inside Docker with access to private remotes, per-host credentials, and corporate CA certificates without requiring changes to the host machine beyond environment variables and file mounts.

Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants.

This ticket depends on TRUEREF-0020 so that version-targeted retrieval remains semantically correct after version indexing is made commit-accurate. Without version-scoped hybrid retrieval, semantic results can still leak across versions even if version metadata is stored correctly.


Part 1: Git-Native Version Indexing

Background

Currently, versions are registered manually with a tag and a title. There is no link between a version tag and a specific git commit, so re-indexing the same version after a branch has moved produces different results silently. For documentation retrieval to be trustworthy, each indexed version must be pinned to an immutable commit hash.

Version → Commit Hash Resolution

Git tags are the canonical mapping. Both lightweight and annotated tags must be handled:

# Resolves any tag, branch, or ref to the underlying commit hash
git -C /path/to/repo rev-parse <tag>^{commit}

The ^{commit} peel dereferences annotated tag objects to the commit they point at.

For GitHub-hosted repositories the GitHub REST API provides the same mapping without requiring a local clone:

GET /repos/{owner}/{repo}/git/refs/tags/{tag}

If the returned object type is tag (annotated), a second request is required to dereference it:

GET /repos/{owner}/{repo}/git/tags/{sha}
→ { object: { sha: "<commit-sha>", type: "commit" } }

Schema Change

Add commit_hash to the versions table:

ALTER TABLE versions ADD COLUMN commit_hash TEXT;

Populated at version registration time. Re-indexing a version always uses the stored commit hash, never re-resolves the tag (tags can be moved; the stored hash is immutable).

Tag Auto-Discovery

When a repository is added or re-fetched, TrueRef should discover all version tags automatically and propose them to the user rather than requiring manual entry:

git -C /path/to/repo tag -l
# → ["v1.0.0", "v1.1.0", "v2.0.0", ...]

For each tag, resolve to a commit hash and pre-populate the versions table. The user can then select which versions to index from the UI.

Per-Version File Extraction via git archive

git archive extracts a clean file tree from any commit without disturbing the working directory or requiring a separate clone:

git -C /path/to/repo archive <commit-hash> | tar -x -C /tmp/trueref-idx/<repo>-<tag>/

Advantages over git checkout or worktrees:

  • Working directory is completely untouched
  • No .git directory in the output (cleaner for parsing)
  • Temp directory deleted after indexing with no git state to clean up
  • Multiple versions can be extracted in parallel

The indexing pipeline receives the extracted temp directory as its source path — identical to the existing local crawler interface.

trigger index v2.0.0
  → look up commit_hash = "a3f9c12"
  → git archive a3f9c12 | tar -x /tmp/trueref/repo-v2.0.0/
  → run crawler on /tmp/trueref/repo-v2.0.0/
  → store snippets { repo_id, version_tag, commit_hash }
  → rm -rf /tmp/trueref/repo-v2.0.0/

trueref.json Explicit Override

Allow commit hashes to be pinned explicitly per version, overriding tag resolution. Useful when tags are mutable, versioning is non-standard, or a specific patch within a release must be targeted:

{
	"previousVersions": [
		{
			"tag": "v2.0.0",
			"title": "Version 2.0.0",
			"commitHash": "a3f9c12abc..."
		}
	]
}

Edge Cases

Case Handling
Annotated tags rev-parse <tag>^{commit} peels to commit automatically
Mutable tags (e.g. latest) Re-resolve on re-index; warn in UI if hash has changed
Branch as version rev-parse origin/<branch>^{commit} gives tip; re-resolves on re-index
Shallow clone Run git fetch --unshallow before git archive if commit is unavailable
Submodules git archive --recurse-submodules or document as a known limitation
Git LFS git lfs pull required after archive if LFS-tracked files are needed for indexing

Acceptance Criteria

  • versions table has a commit_hash column with a migration
  • resolveTagToCommit(repoPath, tag): string utility using child_process
  • Tag auto-discovery runs on repo registration and on manual re-fetch
  • Discovered tags appear in the UI with their resolved commit hashes
  • Indexing a version extracts files via git archive to a temp directory
  • Temp directory is deleted after indexing completes (success or failure)
  • Snippets are stored with both version_tag and commit_hash
  • trueref.json commitHash field overrides tag resolution when present
  • Re-indexing a version with a moved tag warns the user if the hash has changed

Part 2: Corporate Deployment Support

Background

Corporate environments introduce three constraints not present in standard cloud setups:

  1. Multiple private git remotes — typically Bitbucket Server/Data Center and self-hosted GitLab on separate hostnames, each requiring independent credentials.
  2. Credential management on Windows — Git Credential Manager stores credentials in the Windows Credential Manager (DPAPI-encrypted), which is inaccessible from inside a Linux container. Personal Access Tokens passed as environment variables are the correct substitute.
  3. Custom CA certificates — on-premise servers use certificates issued by corporate CAs not trusted by default in Linux container images. All git operations, HTTP requests, and Node.js fetch calls fail until the CA is registered at the OS level.

Multi-Remote Credential Configuration

Two environment variables carry the HTTPS tokens, one per remote. Git's per-host credential scoping routes each token to the correct server:

# In docker-entrypoint.sh, before any git operation

if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
  git config --global \
    "credential.https://${BITBUCKET_HOST}.helper" \
    "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
fi

if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
  git config --global \
    "credential.https://${GITLAB_HOST}.helper" \
    "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
fi

Username conventions by server type:

Server HTTPS username Password
Bitbucket Server / Data Center x-token-auth HTTP access token
Bitbucket Cloud account username App password
GitLab (self-hosted or cloud) oauth2 Personal access token
GitLab deploy token gitlab-deploy-token Deploy token secret

SSH authentication is also supported and preferred for long-lived deployments. The host SSH configuration (~/.ssh/config) handles per-host key selection and travels into the container via volume mount.

Corporate CA Certificate Handling

The CA certificate is mounted as a read-only file at a staging path. The entrypoint detects the encoding (PEM or DER) and installs it at the system level before any other operation runs:

if [ -f /certs/corp-ca.crt ]; then
  if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
    cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
  else
    openssl x509 -inform DER -in /certs/corp-ca.crt \
      -out /usr/local/share/ca-certificates/corp-ca.crt
  fi
  update-ca-certificates 2>/dev/null
fi

Installing at the OS level means git, curl, and Node.js fetch all trust the certificate without per-tool flags. GIT_SSL_NO_VERIFY is explicitly not used.

openssl must be added to the production image:

RUN apk add --no-cache openssl

SSH Key Permissions

Windows NTFS does not track Unix permissions. SSH keys mounted from %USERPROFILE%\.ssh arrive with world-readable permissions that the SSH client rejects. The entrypoint corrects this before any git operation:

if [ -d /root/.ssh ]; then
  chmod 700 /root/.ssh
  chmod 600 /root/.ssh/* 2>/dev/null || true
fi

docker-compose.yml

services:
  web:
    build: .
    ports:
      - '3000:3000'
    volumes:
      - trueref-data:/data
      - ${USERPROFILE}/.ssh:/root/.ssh:ro
      - ${USERPROFILE}/.gitconfig:/root/.gitconfig:ro
      - ${CORP_CA_CERT}:/certs/corp-ca.crt:ro
    environment:
      DATABASE_URL: /data/trueref.db
      GIT_TOKEN_BITBUCKET: '${BITBUCKET_TOKEN}'
      GIT_TOKEN_GITLAB: '${GITLAB_TOKEN}'
      BITBUCKET_HOST: '${BITBUCKET_HOST}'
      GITLAB_HOST: '${GITLAB_HOST}'
    restart: unless-stopped

  mcp:
    build: .
    command: mcp
    ports:
      - '3001:3001'
    environment:
      TRUEREF_API_URL: http://web:3000
      MCP_PORT: '3001'
    depends_on:
      - web
    restart: unless-stopped

volumes:
  trueref-data:

.env Template

# Corporate CA certificate (PEM or DER, auto-detected)
CORP_CA_CERT=C:/path/to/corp-ca.crt

# Git remote hostnames
BITBUCKET_HOST=bitbucket.corp.example.com
GITLAB_HOST=gitlab.corp.example.com

# Personal access tokens (never commit these)
BITBUCKET_TOKEN=
GITLAB_TOKEN=

Full docker-entrypoint.sh

#!/bin/sh
set -e

# 1. Trust corporate CA — must run first
if [ -f /certs/corp-ca.crt ]; then
  if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
    cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
  else
    openssl x509 -inform DER -in /certs/corp-ca.crt \
      -out /usr/local/share/ca-certificates/corp-ca.crt
  fi
  update-ca-certificates 2>/dev/null
fi

# 2. Fix SSH key permissions (Windows mounts arrive world-readable)
if [ -d /root/.ssh ]; then
  chmod 700 /root/.ssh
  chmod 600 /root/.ssh/* 2>/dev/null || true
fi

# 3. Per-host HTTPS credential helpers
if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
  git config --global \
    "credential.https://${BITBUCKET_HOST}.helper" \
    "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
fi

if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
  git config --global \
    "credential.https://${GITLAB_HOST}.helper" \
    "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
fi

case "${1:-web}" in
  web)
    echo "Running database migrations..."
    DATABASE_URL="$DATABASE_URL" npx drizzle-kit migrate
    echo "Starting TrueRef web app on port ${PORT:-3000}..."
    exec node build
    ;;
  mcp)
    MCP_PORT="${MCP_PORT:-3001}"
    echo "Starting TrueRef MCP HTTP server on port ${MCP_PORT}..."
    exec npx tsx src/mcp/index.ts --transport http --port "$MCP_PORT"
    ;;
  *)
    exec "$@"
    ;;
esac

Acceptance Criteria

  • docker-entrypoint.sh runs CA install, SSH permission fix, and credential helpers in that order before any other operation
  • CA cert mount path is /certs/corp-ca.crt; PEM and DER are both handled without user intervention
  • openssl is installed in the production Docker image
  • Per-host credential helpers are configured only when the relevant env vars are set (no-op if absent)
  • SSH key permissions are corrected unconditionally when /root/.ssh exists
  • docker-compose.yml exposes BITBUCKET_HOST, GITLAB_HOST, GIT_TOKEN_BITBUCKET, GIT_TOKEN_GITLAB, and CORP_CA_CERT as first-class configuration points
  • .env.example includes all corporate deployment variables with placeholder values
  • README Docker section documents the corporate setup (cert export on Windows, SSH mount, per-host tokens)

Files to Create

  • Dockerfile
  • docker-compose.yml
  • docker-entrypoint.sh
  • .dockerignore
  • .env.example (update)

Files to Modify

  • src/lib/server/db/schema.ts — add commit_hash to versions table
  • src/lib/server/db/migrations/ — new migration for commit_hash
  • src/lib/server/services/version.service.tsresolveTagToCommit, tag auto-discovery
  • src/lib/server/pipeline/indexing.pipeline.tsgit archive extraction, temp dir lifecycle
  • README.md — Docker section, corporate deployment guide