350 lines
13 KiB
Markdown
350 lines
13 KiB
Markdown
# TRUEREF-0019 — Git-Native Version Indexing and Corporate Deployment Support
|
|
|
|
**Priority:** P1
|
|
**Status:** Pending
|
|
**Depends On:** TRUEREF-0001, TRUEREF-0002, TRUEREF-0020
|
|
**Blocks:** —
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
TrueRef is intended for corporate environments where developers work with private repositories hosted on Bitbucket Server/Data Center and self-hosted GitLab instances. This feature covers two tightly related concerns:
|
|
|
|
1. **Git-native version indexing** — resolve version tags to specific commit hashes, extract file trees per commit using `git archive`, and store indexed snippets against an exact commit rather than a floating label.
|
|
2. **Corporate deployment support** — run TrueRef inside Docker with access to private remotes, per-host credentials, and corporate CA certificates without requiring changes to the host machine beyond environment variables and file mounts.
|
|
|
|
Together these allow a team to deploy TrueRef once, point it at their internal repositories, and have it return version-accurate documentation to LLM assistants.
|
|
|
|
This ticket depends on TRUEREF-0020 so that version-targeted retrieval remains semantically correct after version indexing is made commit-accurate. Without version-scoped hybrid retrieval, semantic results can still leak across versions even if version metadata is stored correctly.
|
|
|
|
---
|
|
|
|
## Part 1: Git-Native Version Indexing
|
|
|
|
### Background
|
|
|
|
Currently, versions are registered manually with a tag and a title. There is no link between a version tag and a specific git commit, so re-indexing the same version after a branch has moved produces different results silently. For documentation retrieval to be trustworthy, each indexed version must be pinned to an immutable commit hash.
|
|
|
|
### Version → Commit Hash Resolution
|
|
|
|
Git tags are the canonical mapping. Both lightweight and annotated tags must be handled:
|
|
|
|
```sh
|
|
# Resolves any tag, branch, or ref to the underlying commit hash
|
|
git -C /path/to/repo rev-parse <tag>^{commit}
|
|
```
|
|
|
|
The `^{commit}` peel dereferences annotated tag objects to the commit they point at.
|
|
|
|
For GitHub-hosted repositories the GitHub REST API provides the same mapping without requiring a local clone:
|
|
|
|
```
|
|
GET /repos/{owner}/{repo}/git/refs/tags/{tag}
|
|
```
|
|
|
|
If the returned object type is `tag` (annotated), a second request is required to dereference it:
|
|
|
|
```
|
|
GET /repos/{owner}/{repo}/git/tags/{sha}
|
|
→ { object: { sha: "<commit-sha>", type: "commit" } }
|
|
```
|
|
|
|
### Schema Change
|
|
|
|
Add `commit_hash` to the versions table:
|
|
|
|
```sql
|
|
ALTER TABLE versions ADD COLUMN commit_hash TEXT;
|
|
```
|
|
|
|
Populated at version registration time. Re-indexing a version always uses the stored commit hash, never re-resolves the tag (tags can be moved; the stored hash is immutable).
|
|
|
|
### Tag Auto-Discovery
|
|
|
|
When a repository is added or re-fetched, TrueRef should discover all version tags automatically and propose them to the user rather than requiring manual entry:
|
|
|
|
```sh
|
|
git -C /path/to/repo tag -l
|
|
# → ["v1.0.0", "v1.1.0", "v2.0.0", ...]
|
|
```
|
|
|
|
For each tag, resolve to a commit hash and pre-populate the versions table. The user can then select which versions to index from the UI.
|
|
|
|
### Per-Version File Extraction via `git archive`
|
|
|
|
`git archive` extracts a clean file tree from any commit without disturbing the working directory or requiring a separate clone:
|
|
|
|
```sh
|
|
git -C /path/to/repo archive <commit-hash> | tar -x -C /tmp/trueref-idx/<repo>-<tag>/
|
|
```
|
|
|
|
Advantages over `git checkout` or worktrees:
|
|
- Working directory is completely untouched
|
|
- No `.git` directory in the output (cleaner for parsing)
|
|
- Temp directory deleted after indexing with no git state to clean up
|
|
- Multiple versions can be extracted in parallel
|
|
|
|
The indexing pipeline receives the extracted temp directory as its source path — identical to the existing local crawler interface.
|
|
|
|
```
|
|
trigger index v2.0.0
|
|
→ look up commit_hash = "a3f9c12"
|
|
→ git archive a3f9c12 | tar -x /tmp/trueref/repo-v2.0.0/
|
|
→ run crawler on /tmp/trueref/repo-v2.0.0/
|
|
→ store snippets { repo_id, version_tag, commit_hash }
|
|
→ rm -rf /tmp/trueref/repo-v2.0.0/
|
|
```
|
|
|
|
### trueref.json Explicit Override
|
|
|
|
Allow commit hashes to be pinned explicitly per version, overriding tag resolution. Useful when tags are mutable, versioning is non-standard, or a specific patch within a release must be targeted:
|
|
|
|
```json
|
|
{
|
|
"previousVersions": [
|
|
{
|
|
"tag": "v2.0.0",
|
|
"title": "Version 2.0.0",
|
|
"commitHash": "a3f9c12abc..."
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### Edge Cases
|
|
|
|
| Case | Handling |
|
|
|------|----------|
|
|
| Annotated tags | `rev-parse <tag>^{commit}` peels to commit automatically |
|
|
| Mutable tags (e.g. `latest`) | Re-resolve on re-index; warn in UI if hash has changed |
|
|
| Branch as version | `rev-parse origin/<branch>^{commit}` gives tip; re-resolves on re-index |
|
|
| Shallow clone | Run `git fetch --unshallow` before `git archive` if commit is unavailable |
|
|
| Submodules | `git archive --recurse-submodules` or document as a known limitation |
|
|
| Git LFS | `git lfs pull` required after archive if LFS-tracked files are needed for indexing |
|
|
|
|
### Acceptance Criteria
|
|
|
|
- [ ] `versions` table has a `commit_hash` column with a migration
|
|
- [ ] `resolveTagToCommit(repoPath, tag): string` utility using `child_process`
|
|
- [ ] Tag auto-discovery runs on repo registration and on manual re-fetch
|
|
- [ ] Discovered tags appear in the UI with their resolved commit hashes
|
|
- [ ] Indexing a version extracts files via `git archive` to a temp directory
|
|
- [ ] Temp directory is deleted after indexing completes (success or failure)
|
|
- [ ] Snippets are stored with both `version_tag` and `commit_hash`
|
|
- [ ] `trueref.json` `commitHash` field overrides tag resolution when present
|
|
- [ ] Re-indexing a version with a moved tag warns the user if the hash has changed
|
|
|
|
---
|
|
|
|
## Part 2: Corporate Deployment Support
|
|
|
|
### Background
|
|
|
|
Corporate environments introduce three constraints not present in standard cloud setups:
|
|
|
|
1. **Multiple private git remotes** — typically Bitbucket Server/Data Center and self-hosted GitLab on separate hostnames, each requiring independent credentials.
|
|
2. **Credential management on Windows** — Git Credential Manager stores credentials in the Windows Credential Manager (DPAPI-encrypted), which is inaccessible from inside a Linux container. Personal Access Tokens passed as environment variables are the correct substitute.
|
|
3. **Custom CA certificates** — on-premise servers use certificates issued by corporate CAs not trusted by default in Linux container images. All git operations, HTTP requests, and Node.js `fetch` calls fail until the CA is registered at the OS level.
|
|
|
|
### Multi-Remote Credential Configuration
|
|
|
|
Two environment variables carry the HTTPS tokens, one per remote. Git's per-host credential scoping routes each token to the correct server:
|
|
|
|
```sh
|
|
# In docker-entrypoint.sh, before any git operation
|
|
|
|
if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
|
|
git config --global \
|
|
"credential.https://${BITBUCKET_HOST}.helper" \
|
|
"!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
|
|
fi
|
|
|
|
if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
|
|
git config --global \
|
|
"credential.https://${GITLAB_HOST}.helper" \
|
|
"!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
|
|
fi
|
|
```
|
|
|
|
Username conventions by server type:
|
|
|
|
| Server | HTTPS username | Password |
|
|
|--------|---------------|----------|
|
|
| Bitbucket Server / Data Center | `x-token-auth` | HTTP access token |
|
|
| Bitbucket Cloud | account username | App password |
|
|
| GitLab (self-hosted or cloud) | `oauth2` | Personal access token |
|
|
| GitLab deploy token | `gitlab-deploy-token` | Deploy token secret |
|
|
|
|
SSH authentication is also supported and preferred for long-lived deployments. The host SSH configuration (`~/.ssh/config`) handles per-host key selection and travels into the container via volume mount.
|
|
|
|
### Corporate CA Certificate Handling
|
|
|
|
The CA certificate is mounted as a read-only file at a staging path. The entrypoint detects the encoding (PEM or DER) and installs it at the system level before any other operation runs:
|
|
|
|
```sh
|
|
if [ -f /certs/corp-ca.crt ]; then
|
|
if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
|
|
cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
|
|
else
|
|
openssl x509 -inform DER -in /certs/corp-ca.crt \
|
|
-out /usr/local/share/ca-certificates/corp-ca.crt
|
|
fi
|
|
update-ca-certificates 2>/dev/null
|
|
fi
|
|
```
|
|
|
|
Installing at the OS level means git, curl, and Node.js `fetch` all trust the certificate without per-tool flags. `GIT_SSL_NO_VERIFY` is explicitly not used.
|
|
|
|
`openssl` must be added to the production image:
|
|
|
|
```dockerfile
|
|
RUN apk add --no-cache openssl
|
|
```
|
|
|
|
### SSH Key Permissions
|
|
|
|
Windows NTFS does not track Unix permissions. SSH keys mounted from `%USERPROFILE%\.ssh` arrive with world-readable permissions that the SSH client rejects. The entrypoint corrects this before any git operation:
|
|
|
|
```sh
|
|
if [ -d /root/.ssh ]; then
|
|
chmod 700 /root/.ssh
|
|
chmod 600 /root/.ssh/* 2>/dev/null || true
|
|
fi
|
|
```
|
|
|
|
### docker-compose.yml
|
|
|
|
```yaml
|
|
services:
|
|
web:
|
|
build: .
|
|
ports:
|
|
- "3000:3000"
|
|
volumes:
|
|
- trueref-data:/data
|
|
- ${USERPROFILE}/.ssh:/root/.ssh:ro
|
|
- ${USERPROFILE}/.gitconfig:/root/.gitconfig:ro
|
|
- ${CORP_CA_CERT}:/certs/corp-ca.crt:ro
|
|
environment:
|
|
DATABASE_URL: /data/trueref.db
|
|
GIT_TOKEN_BITBUCKET: "${BITBUCKET_TOKEN}"
|
|
GIT_TOKEN_GITLAB: "${GITLAB_TOKEN}"
|
|
BITBUCKET_HOST: "${BITBUCKET_HOST}"
|
|
GITLAB_HOST: "${GITLAB_HOST}"
|
|
restart: unless-stopped
|
|
|
|
mcp:
|
|
build: .
|
|
command: mcp
|
|
ports:
|
|
- "3001:3001"
|
|
environment:
|
|
TRUEREF_API_URL: http://web:3000
|
|
MCP_PORT: "3001"
|
|
depends_on:
|
|
- web
|
|
restart: unless-stopped
|
|
|
|
volumes:
|
|
trueref-data:
|
|
```
|
|
|
|
### .env Template
|
|
|
|
```env
|
|
# Corporate CA certificate (PEM or DER, auto-detected)
|
|
CORP_CA_CERT=C:/path/to/corp-ca.crt
|
|
|
|
# Git remote hostnames
|
|
BITBUCKET_HOST=bitbucket.corp.example.com
|
|
GITLAB_HOST=gitlab.corp.example.com
|
|
|
|
# Personal access tokens (never commit these)
|
|
BITBUCKET_TOKEN=
|
|
GITLAB_TOKEN=
|
|
```
|
|
|
|
### Full docker-entrypoint.sh
|
|
|
|
```sh
|
|
#!/bin/sh
|
|
set -e
|
|
|
|
# 1. Trust corporate CA — must run first
|
|
if [ -f /certs/corp-ca.crt ]; then
|
|
if openssl x509 -inform PEM -in /certs/corp-ca.crt -noout 2>/dev/null; then
|
|
cp /certs/corp-ca.crt /usr/local/share/ca-certificates/corp-ca.crt
|
|
else
|
|
openssl x509 -inform DER -in /certs/corp-ca.crt \
|
|
-out /usr/local/share/ca-certificates/corp-ca.crt
|
|
fi
|
|
update-ca-certificates 2>/dev/null
|
|
fi
|
|
|
|
# 2. Fix SSH key permissions (Windows mounts arrive world-readable)
|
|
if [ -d /root/.ssh ]; then
|
|
chmod 700 /root/.ssh
|
|
chmod 600 /root/.ssh/* 2>/dev/null || true
|
|
fi
|
|
|
|
# 3. Per-host HTTPS credential helpers
|
|
if [ -n "$GIT_TOKEN_BITBUCKET" ] && [ -n "$BITBUCKET_HOST" ]; then
|
|
git config --global \
|
|
"credential.https://${BITBUCKET_HOST}.helper" \
|
|
"!f() { echo username=x-token-auth; echo password=\$GIT_TOKEN_BITBUCKET; }; f"
|
|
fi
|
|
|
|
if [ -n "$GIT_TOKEN_GITLAB" ] && [ -n "$GITLAB_HOST" ]; then
|
|
git config --global \
|
|
"credential.https://${GITLAB_HOST}.helper" \
|
|
"!f() { echo username=oauth2; echo password=\$GIT_TOKEN_GITLAB; }; f"
|
|
fi
|
|
|
|
case "${1:-web}" in
|
|
web)
|
|
echo "Running database migrations..."
|
|
DATABASE_URL="$DATABASE_URL" npx drizzle-kit migrate
|
|
echo "Starting TrueRef web app on port ${PORT:-3000}..."
|
|
exec node build
|
|
;;
|
|
mcp)
|
|
MCP_PORT="${MCP_PORT:-3001}"
|
|
echo "Starting TrueRef MCP HTTP server on port ${MCP_PORT}..."
|
|
exec npx tsx src/mcp/index.ts --transport http --port "$MCP_PORT"
|
|
;;
|
|
*)
|
|
exec "$@"
|
|
;;
|
|
esac
|
|
```
|
|
|
|
### Acceptance Criteria
|
|
|
|
- [ ] `docker-entrypoint.sh` runs CA install, SSH permission fix, and credential helpers in that order before any other operation
|
|
- [ ] CA cert mount path is `/certs/corp-ca.crt`; PEM and DER are both handled without user intervention
|
|
- [ ] `openssl` is installed in the production Docker image
|
|
- [ ] Per-host credential helpers are configured only when the relevant env vars are set (no-op if absent)
|
|
- [ ] SSH key permissions are corrected unconditionally when `/root/.ssh` exists
|
|
- [ ] `docker-compose.yml` exposes `BITBUCKET_HOST`, `GITLAB_HOST`, `GIT_TOKEN_BITBUCKET`, `GIT_TOKEN_GITLAB`, and `CORP_CA_CERT` as first-class configuration points
|
|
- [ ] `.env.example` includes all corporate deployment variables with placeholder values
|
|
- [ ] README Docker section documents the corporate setup (cert export on Windows, SSH mount, per-host tokens)
|
|
|
|
---
|
|
|
|
## Files to Create
|
|
|
|
- `Dockerfile`
|
|
- `docker-compose.yml`
|
|
- `docker-entrypoint.sh`
|
|
- `.dockerignore`
|
|
- `.env.example` (update)
|
|
|
|
## Files to Modify
|
|
|
|
- `src/lib/server/db/schema.ts` — add `commit_hash` to versions table
|
|
- `src/lib/server/db/migrations/` — new migration for `commit_hash`
|
|
- `src/lib/server/services/version.service.ts` — `resolveTagToCommit`, tag auto-discovery
|
|
- `src/lib/server/pipeline/indexing.pipeline.ts` — `git archive` extraction, temp dir lifecycle
|
|
- `README.md` — Docker section, corporate deployment guide
|