# TRUEREF-0019 — Git-Native Version Indexing and Corporate Deployment Support
**Priority:** P1
**Status:** Pending
**Depends On:** TRUEREF-0001, TRUEREF-0002, TRUEREF-0020
**Blocks:**

---
## Overview
TrueRef is intended for corporate environments where developers work with private repositories hosted on Bitbucket Server/Data Center and self-hosted GitLab instances. This feature covers two tightly related concerns:
1. **Git-native version indexing** — resolve version tags to specific commit hashes, extract file trees per commit using `git archive`, and store indexed snippets against an exact commit rather than a floating label.
2. **Corporate deployment support** — run TrueRef inside Docker with access to private remotes, per-host credentials, and corporate CA certificates without requiring changes to the host machine beyond environment variables and file mounts.
Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants.
This ticket depends on TRUEREF-0020 so that version-targeted retrieval remains semantically correct after version indexing is made commit-accurate. Without version-scoped hybrid retrieval, semantic results can still leak across versions even if version metadata is stored correctly.

---
## Part 1: Git-Native Version Indexing
### Background
Currently, versions are registered manually with a tag and a title. There is no link between a version tag and a specific git commit, so re-indexing the same version after its tag or branch has moved silently produces different results. For documentation retrieval to be trustworthy, each indexed version must be pinned to an immutable commit hash.
### Version → Commit Hash Resolution
Git tags are the canonical mapping. Both lightweight and annotated tags must be handled:
```sh
# Resolves any tag, branch, or ref to the underlying commit hash
git -C /path/to/repo rev-parse <tag>^{commit}
```
The `^{commit}` peel dereferences annotated tag objects to the commit they point at.
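
A minimal TypeScript sketch of the `resolveTagToCommit(repoPath, tag)` utility named in the acceptance criteria, assuming Node's `child_process`. The injectable `run` parameter and the hash-format check are illustrative assumptions, not part of the ticket.

```typescript
import { execFileSync } from "node:child_process";

// Runs git with the given arguments and returns trimmed stdout.
// Injectable (hypothetical design choice) so the resolver can be
// exercised without a real repository.
type GitRunner = (args: string[]) => string;

const runGit: GitRunner = (args) =>
  execFileSync("git", args, { encoding: "utf8" }).trim();

// Resolves a tag, branch, or other ref to the commit hash it points at.
function resolveTagToCommit(
  repoPath: string,
  tag: string,
  run: GitRunner = runGit,
): string {
  // `^{commit}` peels annotated tag objects down to the commit;
  // `--verify` makes git fail on anything that is not a single valid object.
  const hash = run(["-C", repoPath, "rev-parse", "--verify", `${tag}^{commit}`]);
  // Accept SHA-1 (40 hex chars) or SHA-256 (64 hex chars) object names.
  if (!/^(?:[0-9a-f]{40}|[0-9a-f]{64})$/.test(hash)) {
    throw new Error(`Unexpected rev-parse output for ${tag}: ${hash}`);
  }
  return hash;
}
```
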
For GitHub-hosted repositories the GitHub REST API provides the same mapping without requiring a local clone:
```
GET /repos/{owner}/{repo}/git/refs/tags/{tag}
```
If the returned object type is `tag` (annotated), a second request is required to dereference it:
```
GET /repos/{owner}/{repo}/git/tags/{sha}
→ { object: { sha: "<commit-sha>", type: "commit" } }
```
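
The two-request dereference can be sketched as follows. This is a rough illustration using the endpoint paths quoted above; the function name `resolveGithubTag`, the injectable `fetchJson` parameter, and the `apiBase` default are assumptions made for the example, and authentication is omitted.

```typescript
// Relevant part of the ref and tag responses: the object being pointed at.
interface GitObjectRef {
  object: { sha: string; type: string };
}

type JsonFetcher = (url: string) => Promise<GitObjectRef>;

// Default fetcher using the global fetch available in current Node.js versions.
const githubFetch: JsonFetcher = async (url) => {
  const res = await fetch(url, { headers: { Accept: "application/vnd.github+json" } });
  if (!res.ok) throw new Error(`GitHub API returned ${res.status} for ${url}`);
  return (await res.json()) as GitObjectRef;
};

async function resolveGithubTag(
  owner: string,
  repo: string,
  tag: string,
  fetchJson: JsonFetcher = githubFetch,
  apiBase = "https://api.github.com",
): Promise<string> {
  const base = `${apiBase}/repos/${owner}/${repo}/git`;
  let obj = (await fetchJson(`${base}/refs/tags/${tag}`)).object;
  // Annotated tags point at a tag object; follow it until a commit is reached.
  while (obj.type === "tag") {
    obj = (await fetchJson(`${base}/tags/${obj.sha}`)).object;
  }
  if (obj.type !== "commit") {
    throw new Error(`Tag ${tag} resolves to a ${obj.type}, not a commit`);
  }
  return obj.sha;
}
```

A lightweight tag returns after one request; an annotated tag costs at least two.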
### Schema Change
Add `commit_hash` to the versions table:
```sql
ALTER TABLE versions ADD COLUMN commit_hash TEXT;
```
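The column is populated at version registration time. Re-indexing a release version always uses the stored commit hash and never silently follows the tag, because tags can be moved while the stored hash is immutable. Mutable tags and branches are handled separately under Edge Cases.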
### Tag Auto-Discovery
When a repository is added or re-fetched, TrueRef should discover all version tags automatically and propose them to the user rather than requiring manual entry:
```sh
git -C /path/to/repo tag -l
# → ["v1.0.0", "v1.1.0", "v2.0.0", ...]
```
For each tag, resolve to a commit hash and pre-populate the versions table. The user can then select which versions to index from the UI.
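
As a sketch of the discovery step, the output of `git tag -l` (one tag per line) could be filtered and ordered before it is proposed to the user. The ticket does not define what counts as a version tag, so the `vMAJOR.MINOR.PATCH` pattern and the name `parseVersionTags` below are assumptions.

```typescript
// Assumed pattern: optional "v" prefix followed by three numeric components.
const VERSION_TAG = /^v?(\d+)\.(\d+)\.(\d+)$/;

interface DiscoveredTag {
  tag: string;
  major: number;
  minor: number;
  patch: number;
}

// Parses raw `git tag -l` output and returns version-like tags, newest first.
function parseVersionTags(stdout: string): DiscoveredTag[] {
  return stdout
    .split(/\r?\n/)
    .map((line) => line.trim())
    .flatMap((line) => {
      const m = VERSION_TAG.exec(line);
      if (!m) return []; // skip blank lines and non-version tags
      return [{ tag: line, major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) }];
    })
    // Numeric comparison so that v1.10.0 sorts above v1.9.0.
    .sort((a, b) => b.major - a.major || b.minor - a.minor || b.patch - a.patch);
}
```

Each returned tag would then be passed through tag-to-commit resolution before being written to the versions table.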
### Per-Version File Extraction via `git archive`
`git archive` extracts a clean file tree from any commit without disturbing the working directory or requiring a separate clone:
```sh
git -C /path/to/repo archive <commit-hash> | tar -x -C /tmp/trueref-idx/<repo>-<tag>/
```
Advantages over `git checkout` or worktrees:
- Working directory is completely untouched
- No `.git` directory in the output (cleaner for parsing)
- Temp directory deleted after indexing with no git state to clean up
- Multiple versions can be extracted in parallel
The indexing pipeline receives the extracted temp directory as its source path — identical to the existing local crawler interface.
```
trigger index v2.0.0
→ look up commit_hash = "a3f9c12"
→ git archive a3f9c12 | tar -x -C /tmp/trueref/repo-v2.0.0/
→ run crawler on /tmp/trueref/repo-v2.0.0/
→ store snippets { repo_id, version_tag, commit_hash }
→ rm -rf /tmp/trueref/repo-v2.0.0/
```
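
The extract, index, and clean-up sequence can be sketched in TypeScript as below. This is an illustration under stated assumptions: `withExtractedCommit` and the injectable `extract` parameter are hypothetical names, and the default extractor buffers the whole archive in memory, which a real pipeline would likely replace with a streaming pipe.

```typescript
import { execFileSync } from "node:child_process";
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Extractor = (repoPath: string, commitHash: string, destDir: string) => void;

// Equivalent of `git archive <hash> | tar -x -C <dir>`, done in two steps.
const gitArchiveExtract: Extractor = (repoPath, commitHash, destDir) => {
  const tarball = execFileSync(
    "git",
    ["-C", repoPath, "archive", "--format=tar", commitHash],
    { maxBuffer: 1024 * 1024 * 1024 }, // allow archives up to 1 GiB in memory
  );
  execFileSync("tar", ["-x", "-f", "-", "-C", destDir], { input: tarball });
};

// Extracts one commit into a fresh temp directory, runs `index` on it,
// and removes the directory whether indexing succeeds or throws.
function withExtractedCommit<T>(
  repoPath: string,
  commitHash: string,
  index: (sourceDir: string) => T,
  extract: Extractor = gitArchiveExtract,
): T {
  const dir = mkdtempSync(join(tmpdir(), "trueref-idx-"));
  try {
    extract(repoPath, commitHash, dir);
    return index(dir);
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}
```

Because every call gets its own `mkdtempSync` directory, several versions can be extracted side by side without colliding.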
### trueref.json Explicit Override
Allow commit hashes to be pinned explicitly per version, overriding tag resolution. Useful when tags are mutable, versioning is non-standard, or a specific patch within a release must be targeted:
```json
{
  "previousVersions": [
    {
      "tag": "v2.0.0",
      "title": "Version 2.0.0",
      "commitHash": "a3f9c12abc..."
    }
  ]
}
```
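
A small sketch of the override rule, assuming the `trueref.json` shape shown above. The helper name `commitForVersion` and the `resolveTag` callback are hypothetical; the point is only that an explicit `commitHash` wins and tag resolution is the fallback.

```typescript
interface PreviousVersion {
  tag: string;
  title: string;
  commitHash?: string; // optional explicit pin
}

interface TrueRefConfig {
  previousVersions?: PreviousVersion[];
}

// Returns the pinned hash for `tag` if trueref.json provides one,
// otherwise falls back to resolving the tag through git.
function commitForVersion(
  config: TrueRefConfig,
  tag: string,
  resolveTag: (tag: string) => string,
): string {
  const pinned = config.previousVersions?.find((v) => v.tag === tag)?.commitHash;
  return pinned ?? resolveTag(tag);
}
```

Keeping the fallback as a callback means the same function works with a local `rev-parse` resolver or a remote API lookup.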
### Edge Cases
| Case | Handling |
| ---------------------------- | ---------------------------------------------------------------------------------- |
| Annotated tags | `rev-parse <tag>^{commit}` peels to commit automatically |
| Mutable tags (e.g. `latest`) | Re-resolve on re-index; warn in UI if hash has changed |
| Branch as version | `rev-parse origin/<branch>^{commit}` gives tip; re-resolves on re-index |
| Shallow clone | Run `git fetch --unshallow` before `git archive` if commit is unavailable |
| Submodules                   | `git archive` does not include submodule contents; archive each submodule separately or document as a known limitation |
| Git LFS | `git lfs pull` required after archive if LFS-tracked files are needed for indexing |
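
The mutable-tag row can be illustrated with a small comparison helper. This is a sketch only: `detectTagDrift` and the shape of its result are assumptions, and the wording of the warning is left to the UI.

```typescript
type TagDrift =
  | { moved: false }
  | { moved: true; storedHash: string; currentHash: string; message: string };

// Compares the hash stored at registration time with a freshly resolved one.
// A null stored hash means the version has not been pinned yet.
function detectTagDrift(
  tag: string,
  storedHash: string | null,
  currentHash: string,
): TagDrift {
  if (storedHash === null || storedHash === currentHash) {
    return { moved: false };
  }
  return {
    moved: true,
    storedHash,
    currentHash,
    // Short hashes keep the warning readable; full hashes stay in the result.
    message: `Tag ${tag} moved from ${storedHash.slice(0, 7)} to ${currentHash.slice(0, 7)}`,
  };
}
```

The caller decides what to do with a `moved: true` result, for example showing the message and asking before re-indexing.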
### Acceptance Criteria
- [ ] `versions` table has a `commit_hash` column with a migration
- [ ] `resolveTagToCommit(repoPath, tag): string` utility using `child_process`
- [ ] Tag auto-discovery runs on repo registration and on manual re-fetch
- [ ] Discovered tags appear in the UI with their resolved commit hashes
- [ ] Indexing a version extracts files via `git archive` to a temp directory
- [ ] Temp directory is deleted after indexing completes (success or failure)
- [ ] Snippets are stored with both `version_tag` and `commit_hash`
- [ ] `trueref.json` `commitHash` field overrides tag resolution when present
- [ ] Re-indexing a version with a moved tag warns the user if the hash has changed
---
## Part 2: Corporate Deployment Support
### Background
Corporate environments introduce three constraints not present in standard cloud setups:
1. **Multiple private git remotes** — typically Bitbucket Server/Data Center and self-hosted GitLab on separate hostnames, each requiring independent credentials.
2. **Credential management on Windows** — Git Credential Manager stores credentials in the Windows Credential Manager (DPAPI-encrypted), which is inaccessible from inside a Linux container. Personal Access Tokens passed as environment variables are the correct substitute.
3. **Custom CA certificates** — on-premise servers use certificates issued by corporate CAs not trusted by default in Linux container images. All git operations, HTTP requests, and Node.js `fetch` calls fail until the CA is registered at the OS level.
### Multi-Remote Credential Configuration
Two environment variables carry the HTTPS tokens, one per remote. Git's per-host credential scoping routes each token to the correct server:
```sh
# In docker-entrypoint.sh, before any git operation
if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
  git config --global \
    "credential.https://${BITBUCKET_HOST}.helper" \
    "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
fi
if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
  git config --global \
    "credential.https://${GITLAB_HOST}.helper" \
    "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
fi
```
Username conventions by server type:
| Server | HTTPS username | Password |
| ------------------------------ | --------------------- | --------------------- |
| Bitbucket Server / Data Center | `x-token-auth` | HTTP access token |
| Bitbucket Cloud | account username | App password |
| GitLab (self-hosted or cloud) | `oauth2` | Personal access token |
| GitLab deploy token | `gitlab-deploy-token` | Deploy token secret |
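
To make the structure of the helper strings explicit, the sketch below builds the key and value that the entrypoint passes to `git config --global`. The function name `credentialHelperConfig` and its input shape are hypothetical; the generated strings match the shell snippet above.

```typescript
interface RemoteCredential {
  host: string; // e.g. the value of BITBUCKET_HOST
  username: string; // taken from the username conventions table
  tokenEnvVar: string; // name of the env var holding the token, not its value
}

// Returns [key, value] for `git config --global <key> <value>`.
// The token itself is never written to the config: the helper reads the
// environment variable each time git asks for credentials.
function credentialHelperConfig(cred: RemoteCredential): [string, string] {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(cred.tokenEnvVar)) {
    throw new Error(`Invalid environment variable name: ${cred.tokenEnvVar}`);
  }
  const key = `credential.https://${cred.host}.helper`;
  const value =
    `!f() { echo username=${cred.username}; ` +
    `echo password=$${cred.tokenEnvVar}; }; f`;
  return [key, value];
}
```
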
SSH authentication is also supported and preferred for long-lived deployments. The host SSH configuration (`~/.ssh/config`) handles per-host key selection and travels into the container via volume mount.
### Corporate CA Certificate Handling
The CA certificate is mounted as a read-only file at a staging path. The entrypoint detects the encoding (PEM or DER) and installs it at the system level before any other operation runs:
```sh
if [ -f /certs/corp-ca.crt ]; then
  if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
    cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
  else
    openssl x509 -inform DER -in /certs/corp-ca.crt \
      -out /usr/local/share/ca-certificates/corp-ca.crt
  fi
  update-ca-certificates 2>/dev/null
fi
```
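Installing at the OS level means git and curl trust the certificate without per-tool flags. Node.js does not read the OS trust store by default; it uses its bundled CA list, so Node.js `fetch` calls additionally need `NODE_EXTRA_CA_CERTS` pointing at the installed PEM file. `GIT_SSL_NO_VERIFY` is explicitly not used.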
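
For readers unfamiliar with the two encodings, the detection and conversion step can be sketched in TypeScript. This only illustrates the format difference: unlike `openssl x509`, it does not validate that the bytes are a well-formed certificate, and the helper names `isPem` and `toPem` are hypothetical.

```typescript
const PEM_HEADER = "-----BEGIN CERTIFICATE-----";
const PEM_FOOTER = "-----END CERTIFICATE-----";

// PEM is text with a marker line; DER is the raw binary form of the same data.
function isPem(cert: Buffer): boolean {
  return cert.toString("latin1").includes(PEM_HEADER);
}

// Returns the certificate as PEM text, wrapping DER bytes in base64
// with 64-character lines between the header and footer.
function toPem(cert: Buffer): string {
  if (isPem(cert)) return cert.toString("utf8");
  const lines = cert.toString("base64").match(/.{1,64}/g) ?? [];
  return [PEM_HEADER, ...lines, PEM_FOOTER, ""].join("\n");
}
```
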
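`openssl` must be added to the production image. On Alpine, `update-ca-certificates` is provided by the `ca-certificates` package, so that is installed as well:
```dockerfile
RUN apk add --no-cache openssl ca-certificates
```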
### SSH Key Permissions
Windows NTFS does not track Unix permissions. SSH keys mounted from `%USERPROFILE%\.ssh` arrive with world-readable permissions that the SSH client rejects. The entrypoint corrects this before any git operation:
```sh
if [ -d /root/.ssh ]; then
  # Ignore chmod errors: a read-only mount cannot be modified
  chmod 700 /root/.ssh 2>/dev/null || true
  chmod 600 /root/.ssh/* 2>/dev/null || true
fi
```
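
The check that makes the SSH client reject a key is a test on the file mode: a private key must not be accessible by group or others. A minimal sketch of that test, with hypothetical helper names, is shown here for diagnostics such as logging which mounted keys would be refused.

```typescript
// True when any group or other permission bit is set (mask 0o077).
function isTooPermissive(mode: number): boolean {
  return (mode & 0o077) !== 0;
}

// Formats the permission bits of a mode as three octal digits, e.g. "644".
function describeMode(mode: number): string {
  return (mode & 0o777).toString(8).padStart(3, "0");
}
```

The `mode` value would typically come from `fs.statSync(path).mode`, which includes file-type bits above the permission bits; both helpers mask those out.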
### docker-compose.yml
```yaml
services:
  web:
    build: .
    ports:
      - '3000:3000'
    volumes:
      - trueref-data:/data
      - ${USERPROFILE}/.ssh:/root/.ssh:ro
      - ${USERPROFILE}/.gitconfig:/root/.gitconfig:ro
      - ${CORP_CA_CERT}:/certs/corp-ca.crt:ro
    environment:
      DATABASE_URL: /data/trueref.db
      GIT_TOKEN_BITBUCKET: '${BITBUCKET_TOKEN}'
      GIT_TOKEN_GITLAB: '${GITLAB_TOKEN}'
      BITBUCKET_HOST: '${BITBUCKET_HOST}'
      GITLAB_HOST: '${GITLAB_HOST}'
    restart: unless-stopped
  mcp:
    build: .
    command: mcp
    ports:
      - '3001:3001'
    environment:
      TRUEREF_API_URL: http://web:3000
      MCP_PORT: '3001'
    depends_on:
      - web
    restart: unless-stopped
volumes:
  trueref-data:
```
### .env Template
```env
# Corporate CA certificate (PEM or DER, auto-detected)
CORP_CA_CERT=C:/path/to/corp-ca.crt
# Git remote hostnames
BITBUCKET_HOST=bitbucket.corp.example.com
GITLAB_HOST=gitlab.corp.example.com
# Personal access tokens (never commit these)
BITBUCKET_TOKEN=
GITLAB_TOKEN=
```
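
Because each credential helper is configured only when both its host and its token are present, a half-filled `.env` silently disables a remote. The sketch below shows one way the application could report that at startup; `checkRemoteEnv` is a hypothetical helper, and the variable names are the container-side names from `docker-compose.yml`.

```typescript
type Env = Record<string, string | undefined>;

interface RemoteSpec {
  name: string;
  hostVar: string;
  tokenVar: string;
}

const REMOTES: RemoteSpec[] = [
  { name: "bitbucket", hostVar: "BITBUCKET_HOST", tokenVar: "GIT_TOKEN_BITBUCKET" },
  { name: "gitlab", hostVar: "GITLAB_HOST", tokenVar: "GIT_TOKEN_GITLAB" },
];

// Reports which remotes are fully configured and which are only half set.
function checkRemoteEnv(env: Env): { configured: string[]; warnings: string[] } {
  const configured: string[] = [];
  const warnings: string[] = [];
  for (const r of REMOTES) {
    const hasHost = Boolean(env[r.hostVar]);
    const hasToken = Boolean(env[r.tokenVar]);
    if (hasHost && hasToken) {
      configured.push(r.name);
    } else if (hasHost !== hasToken) {
      warnings.push(`${r.name}: set both ${r.hostVar} and ${r.tokenVar}, or neither`);
    }
  }
  return { configured, warnings };
}
```

Calling `checkRemoteEnv(process.env)` at startup and logging the warnings would surface a missing token before the first clone fails.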
### Full docker-entrypoint.sh
```sh
#!/bin/sh
set -e

# 1. Trust corporate CA (must run first)
if [ -f /certs/corp-ca.crt ]; then
  if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
    cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
  else
    openssl x509 -inform DER -in /certs/corp-ca.crt \
      -out /usr/local/share/ca-certificates/corp-ca.crt
  fi
  update-ca-certificates 2>/dev/null
fi

# 2. Fix SSH key permissions (Windows mounts arrive world-readable).
#    chmod fails on a read-only mount, so errors are ignored to keep
#    `set -e` from aborting the entrypoint.
if [ -d /root/.ssh ]; then
  chmod 700 /root/.ssh 2>/dev/null || true
  chmod 600 /root/.ssh/* 2>/dev/null || true
fi

# 3. Per-host HTTPS credential helpers
if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
  git config --global \
    "credential.https://${BITBUCKET_HOST}.helper" \
    "!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
fi
if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
  git config --global \
    "credential.https://${GITLAB_HOST}.helper" \
    "!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
fi

# 4. Dispatch on the first argument (defaults to `web`)
case "${1:-web}" in
  web)
    echo "Running database migrations..."
    DATABASE_URL="$DATABASE_URL" npx drizzle-kit migrate
    echo "Starting TrueRef web app on port ${PORT:-3000}..."
    exec node build
    ;;
  mcp)
    MCP_PORT="${MCP_PORT:-3001}"
    echo "Starting TrueRef MCP HTTP server on port ${MCP_PORT}..."
    exec npx tsx src/mcp/index.ts --transport http --port "$MCP_PORT"
    ;;
  *)
    exec "$@"
    ;;
esac
```
### Acceptance Criteria
- [ ] `docker-entrypoint.sh` runs CA install, SSH permission fix, and credential helpers in that order before any other operation
- [ ] CA cert mount path is `/certs/corp-ca.crt`; PEM and DER are both handled without user intervention
- [ ] `openssl` is installed in the production Docker image
- [ ] Per-host credential helpers are configured only when the relevant env vars are set (no-op if absent)
- [ ] SSH key permissions are corrected unconditionally when `/root/.ssh` exists
- [ ] `docker-compose.yml` exposes `BITBUCKET_HOST`, `GITLAB_HOST`, `GIT_TOKEN_BITBUCKET`, `GIT_TOKEN_GITLAB`, and `CORP_CA_CERT` as first-class configuration points
- [ ] `.env.example` includes all corporate deployment variables with placeholder values
- [ ] README Docker section documents the corporate setup (cert export on Windows, SSH mount, per-host tokens)
---
## Files to Create
- `Dockerfile`
- `docker-compose.yml`
- `docker-entrypoint.sh`
- `.dockerignore`
- `.env.example` (update)
## Files to Modify
- `src/lib/server/db/schema.ts`: add `commit_hash` to versions table
- `src/lib/server/db/migrations/`: new migration for `commit_hash`
- `src/lib/server/services/version.service.ts`: `resolveTagToCommit`, tag auto-discovery
- `src/lib/server/pipeline/indexing.pipeline.ts`: `git archive` extraction, temp dir lifecycle
- `README.md`: Docker section, corporate deployment guide