diff --git a/docs/FINDINGS.md b/docs/FINDINGS.md index 3834bae..57c1c86 100644 --- a/docs/FINDINGS.md +++ b/docs/FINDINGS.md @@ -3145,3 +3145,243 @@ Footer component needs null-safe access since initial state is `null`: **Document Version:** 3.0 **Last Updated by:** Planner Agent (RECIPE-0009 Iteration 1) **Next Update:** Developer Agent + +--- + +--- + +# Session Findings: Instagram Extraction & Production Lessons + +*Recorded during active development sessions (2025–2026). These are hard-won discoveries from real debugging — not theoretical analysis.* + +--- + +## Instagram: Caption Truncation in Web GraphQL API + +**Symptom:** LLM says "no recipe found" even though the full recipe IS in the Instagram caption. + +**Root cause:** Instagram's web GraphQL API (`doc_id=8845758582119845`) silently truncates captions in `edge_media_to_caption.edges[0].node.text`. Truncation is **inconsistent**: +- Sometimes ends with `….` (Unicode U+2026 + period) +- Sometimes cuts off mid-sentence with no marker at all + +Known examples: +- `DWWxiymssxE`: GraphQL returns 327 chars, full caption is 393 chars (no truncation marker) +- `DXT73izCBoH`: GraphQL returns 744 chars, cuts off mid-sentence `"Versa nella tortiera co'"` + +**Fix:** Never trust the GraphQL-intercepted caption. Always use DOM extraction (`extractWithStrategies` → `extractFromHTMLSection` → `tryExpandCaptionInHTMLSection` clicks "… more" button). Keep the intercepted GraphQL caption only as an emergency fallback when DOM extraction fails entirely. + +**Key lesson:** The `….` suffix check is **not sufficient** to detect truncation. The only reliable approach is to always go through the DOM. + +--- + +## Instagram: Mobile API vs GraphQL API (yt-dlp behavior) + +**How yt-dlp selects which API to call:** +1. If `sessionid` cookie present → calls `https://i.instagram.com/api/v1/media/{PK}/info/` (mobile API) +2. 
If mobile API fails (or no sessionid) → falls back to GraphQL `doc_id=8845758582119845` + +**Mobile API User-Agent:** +- Desktop UA → HTTP 404 +- Instagram Android UA → HTTP 200 with full response +- The `--user-agent` CLI flag only affects video download requests, **not** API calls — yt-dlp uses its own hardcoded headers for API calls + +**Mobile API also truncates:** Even with a valid sessionid and HTTP 200, `caption.text` in the mobile API response can still be truncated. DOM extraction is the only fully reliable source. + +**Shortcode → PK conversion:** +```python +def shortcode_to_pk(sc): + alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_' + n = 0 + for c in sc: n = n * 64 + alphabet.index(c) + return n +``` + +--- + +## Instagram: Creator-Written `….` vs API Truncation + +**Gotcha:** Some creators intentionally end their captions with `….` or `#seriesname….` as a signature or series marker. This is NOT API truncation. + +**Example:** Reel `DW5zH3xjY-_` ("5030 LOW CAL 💪") — the `….` is written by the creator as a series signature. The reel has only 213 chars of real content and no recipe. + +**Implication:** Never use `….` suffix as the primary signal to fetch more content — always use DOM extraction regardless. + +--- + +## Instagram: cookies.txt vs auth.json — Session Management + +**Two auth formats coexist:** +- `secrets/auth.json` — Playwright `storageState` format (JSON, cookies + origins) +- `secrets/cookies.txt` — Netscape format for yt-dlp + +**yt-dlp overwrites cookies.txt** after each extraction, removing `sessionid`. The next run regenerates it from `auth.json` via `maybeConvertAuthJson()` before each call. This is safe in normal operation — but inspecting cookies.txt directly between runs will show a reduced file. 
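The regeneration step described above can be pictured with a minimal sketch. It assumes the Playwright `storageState` cookie fields (`name`, `value`, `domain`, `path`, `expires`, `httpOnly`, `secure`) and the tab-separated Netscape layout. The names `toNetscapeCookies` and `hasCookie` are hypothetical and are not the project's `maybeConvertAuthJson()`, whose implementation is not shown in these notes.

```typescript
// Hypothetical sketch, not the project's maybeConvertAuthJson().
// Shape of the cookie entries in a Playwright storageState file.
interface StorageStateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix seconds; -1 marks a session cookie
  httpOnly: boolean;
  secure: boolean;
}

interface StorageState {
  cookies: StorageStateCookie[];
}

// Render cookies as Netscape cookies.txt text:
// domain, subdomain flag, path, secure flag, expiry, name, value (tab-separated).
function toNetscapeCookies(state: StorageState): string {
  const lines = state.cookies.map((c) => {
    // A leading dot means the cookie also applies to subdomains.
    const includeSubdomains = c.domain.startsWith('.') ? 'TRUE' : 'FALSE';
    // Session cookies are written with expiry 0 here (an assumption about
    // what the consuming tool accepts).
    const expiry = c.expires > 0 ? Math.floor(c.expires) : 0;
    // "#HttpOnly_" is the prefix curl uses to mark HttpOnly cookies.
    const domain = c.httpOnly ? `#HttpOnly_${c.domain}` : c.domain;
    return [
      domain,
      includeSubdomains,
      c.path,
      c.secure ? 'TRUE' : 'FALSE',
      String(expiry),
      c.name,
      c.value,
    ].join('\t');
  });
  return ['# Netscape HTTP Cookie File', ...lines].join('\n') + '\n';
}

// Check whether a cookies.txt body still contains a given cookie name,
// e.g. to confirm `sessionid` survived a run.
function hasCookie(cookiesTxt: string, name: string): boolean {
  return cookiesTxt
    .split('\n')
    .filter((line) => line.trim() !== '')
    // Keep data lines; "#HttpOnly_" lines are data, other "#" lines are comments.
    .filter((line) => !line.startsWith('#') || line.startsWith('#HttpOnly_'))
    .some((line) => line.split('\t')[5] === name);
}
```

A check like `hasCookie(text, 'sessionid')` on the file contents is one way to tell a reduced cookies.txt from a freshly regenerated one when inspecting the file between runs.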
+ +**`sessionid` is critical.** Without it: +- yt-dlp mobile API returns HTTP 404 (empty response) +- Falls back to GraphQL → truncated caption + +**Auth scheduler:** `scheduler.ts` runs every 15 minutes to renew the session by navigating to Instagram. Verify with logs: `[Scheduler] Instagram authentication renewed successfully`. + +--- + +## Instagram: Playwright Browser Session Expiry (independent of cookies) + +**Symptom:** Playwright navigates to Instagram, sees a profile selector ("Continue as …"), clicks Continue, gets redirected to `/accounts/login/`. + +**Root cause:** The `sessionid` cookie is valid for API calls but the browser-level session can expire independently. Instagram shows the profile selector as a soft prompt which, when clicked, triggers a re-auth that fails with a stale session. + +**Diagnosis:** +- `svg[aria-label="Home"]` found → session valid ✅ +- `(N) Instagram` in title with notifications count → logged in ✅ +- Profile selector visible → session expired, need to re-authenticate + +**Fix:** Re-authenticate by updating `auth.json` with a fresh login from a real browser session and copying to the volume at `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json`. + +--- + +## Instagram: DOM Extraction Strategy Order (2025/2026) + +`extractWithStrategies` tries 6 approaches in order. Only one reliably works now: + +| Strategy | Status | Reason | +|---|---|---| +| `embedded-json` | ❌ Fails | Instagram removed `window.__additionalDataLoaded` | +| `internal-state` | ❌ Fails | Instagram removed `window._sharedData` | +| `html-section` | ✅ Works | DOM extraction + "… more" button click | +| `dom-selector` | ⚠️ Partial | Simpler DOM query, may miss truncated captions | +| `graphql-api` | ⚠️ Truncated | Live interception but caption is still truncated | +| `legacy` | ❌ Fails | Old format gone | + +**Note:** Clicking "… more" triggers feed-loading GraphQL calls (`xdt_api__v1__clips__home__connection_v2`) as a side effect. 
The full text comes purely from the expanded DOM, not a network response. + +--- + +## LLM: phi4-mini Recipe Detection Too Strict + +**Problem:** phi4-mini rejected valid Italian Instagram recipe posts as "no recipe found" during detection. + +**Root cause:** Detection prompt required quantities + at least 2 steps. Italian Instagram posts often: +- Omit explicit quantities (just list ingredients by name) +- Say "full recipe at link in bio" with no steps at all + +**Detection prompt evolution:** +- v1: title + 3 ingredients with quantities + 2 steps +- v2: title + 3 ingredients (no quantities) + 1 step +- v3 (current): title + 2 ingredients, NO step requirement + +**Lesson:** If it reads like food content with at least 2 named ingredients, say yes. + +--- + +## LLM: gemma4 Thinking Models Behavior + +**gemma4 models on llama-swap (`http://192.168.1.50:8080`):** +- `gemma4-e2b-q8_0` — smaller/faster +- `gemma4-e4b-q6k` — better quality (production model) +- `gemma4-26b-moe-iq4xs`, `granite-3.3-8b-q6k`, `deepseek-r1-8b-q6k` also available + +**gemma4 is a "thinking" model:** Outputs internal reasoning before the actual answer. + +With `max_tokens: 1024`: Model skips most reasoning and puts the answer directly in `content`. The `reasoning_content` fallback in `parser.ts` covers edge cases where content is empty. + +**vs phi4-mini:** phi4-mini is more literal and strict. For permissive recipe detection of Italian informal posts, gemma4 is significantly better. + +--- + +## Tandoor: Steps Required to Save Ingredients + +**Symptom:** Recipe saved to Tandoor has no ingredients even though parsing succeeded. + +**Root cause:** Tandoor requires at least one Step for ingredients to be associated. When `recipe.steps` is null/empty: +```typescript +// Old code — creates stepCount=1 but no actual step: +const stepCount = recipe.steps?.length || 1; +(recipe.steps || []).map(...) 
// returns [] → all ingredients lost +``` + +**Fix in `tandoor.ts` `buildTandoorRecipeDTO()`:** When `recipe.steps` is null or empty, create a placeholder: +```typescript +const steps = (recipe.steps?.length ? recipe.steps : ['Vedi la ricetta completa al link in bio.']); +``` + +--- + +## SvelteKit SSE: Phase Updates Never Reaching UI + +**Symptom:** Processing animation showed "Prepping" throughout, then jumped straight to done. + +**Three root causes found:** + +1. **`updateQueueItem` never set `currentPhase`:** Spreading `...items[idx]` but never applying `update.phase`. Fix: + ```typescript + currentPhase: update.phase ?? prev.currentPhase + ``` + +2. **Progress events silently discarded:** SSE `type: 'progress'` messages received but `progressEvents` array never updated. Live messages (e.g. "Parsing with LLM…") were dropped. Fix: append `data.event` to `progressEvents`. + +3. **Initial SSE snapshot missing `phase`:** The initial broadcast of queued items omitted `phase: item.currentPhase`. Items already in-progress on page load showed the wrong phase. Fix: include `phase` in the initial snapshot. + +--- + +## Gitea CI: Common Failure Modes + +**Chromium not available in Alpine Docker:** +`vite.config.ts` defines two vitest projects: `client` (browser, needs Chromium) and `server` (Node.js). Alpine CI has no Chromium. Always specify: +```bash +npm run test:unit -- --run --project=server +``` + +**`$env/dynamic/private` throws in Docker build (no `.env`):** +Any code reading SvelteKit env vars at module import time will throw during Docker `RUN npm test` because there's no `.env` file in the build. Fix: mock the module in affected tests: +```typescript +vi.mock('$env/dynamic/private', () => ({ + env: { OPENAI_BASE_URL: 'http://localhost:11434', OPENAI_MODEL: 'test-model' } +})); +``` + +**Registry secrets must be set manually in Gitea:** +`REGISTRY_USERNAME` and `REGISTRY_TOKEN` must be created in repo Settings → Actions → Secrets. 
They are not automatically available. + +--- + +## TypeScript Quirk: Async Callback Closure Narrowing + +```typescript +let interceptedCaption: string | null = null; +page.on('response', async () => { interceptedCaption = 'value'; }); // assigned in async callback +// Control-flow analysis does not track assignments made inside callbacks, so in the +// outer scope TypeScript still sees `interceptedCaption` as `null`, and as `never` +// inside a truthiness check such as `if (interceptedCaption)`. +const capturedCaption = interceptedCaption as string | null; // cast widens the type back to `string | null` +``` + +--- + +## Production Architecture: yt-dlp + Playwright Split + +**Current split (as of commit `c9f5300`+):** +- **Playwright** → caption extraction (DOM, always full text) +- **yt-dlp** → thumbnail URL only (fast, no browser overhead) +- Both run **in parallel** in `QueueProcessor.ts` + +**Why not yt-dlp for caption?** Both mobile API and GraphQL responses can be truncated even with a valid session. DOM is the only reliable source. + +**Why not Playwright for thumbnail?** yt-dlp extracts the thumbnail cleanly and quickly. Playwright-based thumbnail extraction was fragile. + +--- + +## Infrastructure Reference + +| Resource | Value | +|---|---| +| App URL | `https://insta-recipe.sal.giize.com` | +| SSH | `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa_ideapad moze@192.168.1.50` | +| Compose file | `/home/moze/Server/stacks/insta-recipe/compose.yaml` | +| Env file | `/home/moze/Server/stacks/insta-recipe/.env` | +| Docker registry | `git.sal.giize.com/mozempk/insta-recipe:latest` | +| Build | `docker buildx build --platform linux/amd64 -t git.sal.giize.com/mozempk/insta-recipe:latest --push .` | +| Deploy | `docker compose pull && docker compose up -d` | +| LLM (internal) | `http://chat_llama-cpp:8080/v1` | +| LLM (external) | `http://192.168.1.50:8080` | +| Current LLM model | `gemma4-e4b-q6k` (via `LLM_MODEL` in `.env`) | +| Auth file (host) | `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json` | +| Auth file (container) | `/app/secrets/auth.json` |