docs: add session findings — Instagram extraction, LLM, SSE, CI lessons

Documents hard-won discoveries from active debugging sessions:
- Instagram GraphQL/mobile API silent caption truncation (no marker)
- DOM extraction (html-section strategy) as the only reliable approach
- creator-written '….' vs API truncation: cannot use as signal
- cookies.txt vs auth.json session management and sessionid loss
- Playwright browser session expiry independent of API cookies
- phi4-mini too strict for Italian recipe posts → gemma4 switch
- gemma4 thinking model behavior with max_tokens: 1024
- Tandoor requires Step for ingredients to be saved
- SvelteKit SSE: 3 bugs that caused phase updates to never reach UI
- Gitea CI gotchas: Alpine Chromium, $env/dynamic/private, secrets
- yt-dlp + Playwright split architecture rationale
- Infrastructure reference table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 03:13:17 +02:00
parent 61876f18e5
commit ecd2aef971
1 changed files with 240 additions and 0 deletions
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -3145,3 +3145,243 @@ Footer component needs null-safe access since initial state is `null`:
 **Document Version:** 3.0  
 **Last Updated by:** Planner Agent (RECIPE-0009 Iteration 1)  
 **Next Update:** Developer Agent
 ---
 ---
 # Session Findings: Instagram Extraction & Production Lessons
 *Recorded during active development sessions (2025–2026). These are hard-won discoveries from real debugging — not theoretical analysis.*
 ---
 ## Instagram: Caption Truncation in Web GraphQL API
 **Symptom:** LLM says "no recipe found" even though the full recipe IS in the Instagram caption.
 **Root cause:** Instagram's web GraphQL API (`doc_id=8845758582119845`) silently truncates captions in `edge_media_to_caption.edges[0].node.text`. Truncation is **inconsistent**:
 - Sometimes ends with `….` (Unicode U+2026 + period)
 - Sometimes cuts off mid-sentence with no marker at all
 Known examples:
 - `DWWxiymssxE`: GraphQL returns 327 chars, full caption is 393 chars (no truncation marker)
 - `DXT73izCBoH`: GraphQL returns 744 chars, cuts off mid-sentence `"Versa nella tortiera co'"`
 **Fix:** Never trust the GraphQL-intercepted caption. Always use DOM extraction: `extractWithStrategies` → `extractFromHTMLSection` → `tryExpandCaptionInHTMLSection`, which clicks the "… more" button. Keep the intercepted GraphQL caption only as an emergency fallback when DOM extraction fails entirely.
 **Key lesson:** The `….` suffix check is **not sufficient** to detect truncation. The only reliable approach is to always go through the DOM.
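The preference order above can be sketched as a small pure function. This is a minimal illustration, assuming both captions have already been collected as nullable strings; `chooseCaption` and its return shape are hypothetical names, not the project's actual API. Note that it deliberately never inspects the `….` suffix.

```typescript
// Hypothetical helper: names and return shape are illustrative only.
type CaptionSource = 'dom' | 'graphql-fallback' | 'none';

interface CaptionChoice {
  text: string | null;
  source: CaptionSource;
}

function chooseCaption(
  domCaption: string | null,
  graphqlCaption: string | null
): CaptionChoice {
  // The DOM caption wins whenever it is non-empty.
  const dom = domCaption?.trim();
  if (dom) {
    return { text: dom, source: 'dom' };
  }
  // Emergency fallback: may be silently truncated, so the source is
  // reported to let callers log or flag the result.
  const gql = graphqlCaption?.trim();
  if (gql) {
    return { text: gql, source: 'graphql-fallback' };
  }
  return { text: null, source: 'none' };
}
```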
 ---
 ## Instagram: Mobile API vs GraphQL API (yt-dlp behavior)
 **How yt-dlp selects which API to call:**
 1. If `sessionid` cookie present → calls `https://i.instagram.com/api/v1/media/{PK}/info/` (mobile API)
 2. If mobile API fails (or no sessionid) → falls back to GraphQL `doc_id=8845758582119845`
 **Mobile API User-Agent:**
 - Desktop UA → HTTP 404
 - Instagram Android UA → HTTP 200 with full response
 - The `--user-agent` CLI flag only affects video download requests, **not** API calls — yt-dlp uses its own hardcoded headers for API calls
 **Mobile API also truncates:** Even with a valid sessionid and HTTP 200, `caption.text` in the mobile API response can still be truncated. DOM extraction is the only fully reliable source.
 **Shortcode → PK conversion:**
 ```python
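 def shortcode_to_pk(sc):
     """Decode an Instagram shortcode (base 64, URL-safe alphabet) into its numeric media PK."""
     alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_'
     n = 0
     for c in sc:
         n = n * 64 + alphabet.index(c)  # raises ValueError on an invalid character
     return n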
 ```
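If the same conversion is needed on the TypeScript side, a direct port should not use `number`: an 11-character shortcode encodes up to 66 bits, beyond the 53 bits a `number` represents exactly. The sketch below uses `BigInt` instead; `shortcodeToPk` and `mediaInfoUrl` are hypothetical names and are not taken from the codebase.

```typescript
const SHORTCODE_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_';

// Decode a shortcode as a base-64 number, using the same alphabet as the
// Python snippet above. BigInt avoids precision loss on large PKs.
function shortcodeToPk(shortcode: string): bigint {
  const base = BigInt(64);
  let pk = BigInt(0);
  for (let i = 0; i < shortcode.length; i++) {
    const digit = SHORTCODE_ALPHABET.indexOf(shortcode.charAt(i));
    if (digit < 0) {
      throw new Error('Invalid shortcode character: ' + shortcode.charAt(i));
    }
    pk = pk * base + BigInt(digit);
  }
  return pk;
}

// The PK is placed into the mobile API URL as a decimal string.
function mediaInfoUrl(shortcode: string): string {
  return 'https://i.instagram.com/api/v1/media/' + shortcodeToPk(shortcode).toString() + '/info/';
}
```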
 ---
 ## Instagram: Creator-Written `….` vs API Truncation
 **Gotcha:** Some creators intentionally end their captions with `….` or `#seriesname….` as a signature or series marker. This is NOT API truncation.
 **Example:** Reel `DW5zH3xjY-_` ("5030 LOW CAL 💪") — the `….` is written by the creator as a series signature. The reel has only 213 chars of real content and no recipe.
 **Implication:** Never use `….` suffix as the primary signal to fetch more content — always use DOM extraction regardless.
 ---
 ## Instagram: cookies.txt vs auth.json — Session Management
 **Two auth formats coexist:**
 - `secrets/auth.json` — Playwright `storageState` format (JSON, cookies + origins)
 - `secrets/cookies.txt` — Netscape format for yt-dlp
 **yt-dlp overwrites `cookies.txt`** after each extraction, removing `sessionid`. `maybeConvertAuthJson()` regenerates the file from `auth.json` before every yt-dlp call, so this is safe in normal operation. Inspecting `cookies.txt` directly between runs will nevertheless show a reduced file without `sessionid`.
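To make the regeneration step concrete, here is a minimal sketch of turning the `cookies` array of a Playwright `storageState` file into Netscape cookie-file text. It is an assumption about what such a conversion looks like, not a copy of `maybeConvertAuthJson()`; `toNetscapeCookies` and `hasSessionId` are hypothetical names, and the interface lists only the cookie fields used here.

```typescript
// Subset of the cookie fields found in a Playwright storageState file.
interface StorageCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix seconds; -1 for session cookies
  secure: boolean;
}

// Hypothetical converter: one tab-separated line per cookie
// (domain, include-subdomains, path, secure, expiry, name, value).
function toNetscapeCookies(cookies: StorageCookie[]): string {
  const lines = cookies.map((c) => {
    const includeSubdomains = c.domain.charAt(0) === '.' ? 'TRUE' : 'FALSE';
    const expiry = c.expires > 0 ? Math.floor(c.expires) : 0; // 0 = session cookie
    return [
      c.domain,
      includeSubdomains,
      c.path,
      c.secure ? 'TRUE' : 'FALSE',
      String(expiry),
      c.name,
      c.value
    ].join('\t');
  });
  return ['# Netscape HTTP Cookie File'].concat(lines).join('\n') + '\n';
}

// Quick sanity check before handing the file to yt-dlp.
function hasSessionId(cookies: StorageCookie[]): boolean {
  return cookies.some((c) => c.name === 'sessionid' && c.value !== '');
}
```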
 **`sessionid` is critical.** Without it:
 - yt-dlp mobile API returns HTTP 404 (empty response)
 - Falls back to GraphQL → truncated caption
 **Auth scheduler:** `scheduler.ts` runs every 15 minutes to renew the session by navigating to Instagram. Verify with logs: `[Scheduler] Instagram authentication renewed successfully`.
 ---
 ## Instagram: Playwright Browser Session Expiry (independent of cookies)
 **Symptom:** Playwright navigates to Instagram, sees a profile selector ("Continue as …"), clicks Continue, gets redirected to `/accounts/login/`.
 **Root cause:** The `sessionid` cookie is valid for API calls but the browser-level session can expire independently. Instagram shows the profile selector as a soft prompt which, when clicked, triggers a re-auth that fails with a stale session.
 **Diagnosis:**
 - `svg[aria-label="Home"]` found → session valid ✅
 - `(N) Instagram` in title with notifications count → logged in ✅
 - Profile selector visible → session expired, need to re-authenticate
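The diagnosis rules above can be written as a pure classifier over signals already read from the page. This is a sketch under assumptions: `PageSignals` and `classifySession` are hypothetical names, and the Playwright calls that would populate the signals are left out so the logic stays testable in isolation.

```typescript
type SessionState = 'valid' | 'expired' | 'unknown';

// Signals assumed to be collected from the page beforehand.
interface PageSignals {
  hasHomeIcon: boolean;        // svg[aria-label="Home"] was found
  title: string;               // document title
  hasProfileSelector: boolean; // "Continue as …" prompt is visible
  url: string;                 // current page URL
}

function classifySession(s: PageSignals): SessionState {
  // A login redirect or the profile selector means the browser session is gone,
  // even if the sessionid cookie still works for API calls.
  if (s.url.indexOf('/accounts/login') !== -1 || s.hasProfileSelector) {
    return 'expired';
  }
  if (s.hasHomeIcon) {
    return 'valid';
  }
  // "(N) Instagram" in the title carries the notification count of a logged-in user.
  if (/^\(\d+\)\s+Instagram/.test(s.title)) {
    return 'valid';
  }
  return 'unknown';
}
```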
 **Fix:** Re-authenticate by updating `auth.json` with a fresh login from a real browser session and copying to the volume at `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json`.
 ---
 ## Instagram: DOM Extraction Strategy Order (2025/2026)
 `extractWithStrategies` tries 6 approaches in order. Only one reliably works now:
 | Strategy | Status | Reason |
 |---|---|---|
 | `embedded-json` | ❌ Fails | Instagram removed `window.__additionalDataLoaded` |
 | `internal-state` | ❌ Fails | Instagram removed `window._sharedData` |
 | `html-section` | ✅ Works | DOM extraction + "… more" button click |
 | `dom-selector` | ⚠️ Partial | Simpler DOM query, may miss truncated captions |
 | `graphql-api` | ⚠️ Truncated | Live interception but caption is still truncated |
 | `legacy` | ❌ Fails | Old format gone |
 **Note:** Clicking "… more" triggers feed-loading GraphQL calls (`xdt_api__v1__clips__home__connection_v2`) as a side effect. The full text comes purely from the expanded DOM, not a network response.
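A strategy chain of this kind can be sketched as a loop that returns the first non-empty result and treats a thrown error as "try the next one". The sketch is illustrative only: `Strategy`, `StrategyResult`, and `firstSuccessful` are hypothetical names, and the real `extractWithStrategies` may be structured differently.

```typescript
interface Strategy {
  name: string;
  run: () => Promise<string | null>;
}

interface StrategyResult {
  caption: string;
  strategy: string;
}

// Try each strategy in order; return the first non-empty caption.
async function firstSuccessful(strategies: Strategy[]): Promise<StrategyResult | null> {
  for (const strategy of strategies) {
    try {
      const caption = (await strategy.run())?.trim();
      if (caption) {
        return { caption, strategy: strategy.name };
      }
    } catch {
      // A failing strategy must not abort the chain; fall through to the next one.
    }
  }
  return null;
}
```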
 ---
 ## LLM: phi4-mini Recipe Detection Too Strict
 **Problem:** phi4-mini rejected valid Italian Instagram recipe posts as "no recipe found" during detection.
 **Root cause:** Detection prompt required quantities + at least 2 steps. Italian Instagram posts often:
 - Omit explicit quantities (just list ingredients by name)
 - Say "full recipe at link in bio" with no steps at all
 **Detection prompt evolution:**
 - v1: title + 3 ingredients with quantities + 2 steps
 - v2: title + 3 ingredients (no quantities) + 1 step  
 - v3 (current): title + 2 ingredients, NO step requirement
 **Lesson:** If it reads like food content with at least 2 named ingredients, say yes.
 ---
 ## LLM: gemma4 Thinking Models Behavior
 **gemma4 models on llama-swap (`http://192.168.1.50:8080`):**
 - `gemma4-e2b-q8_0` — smaller/faster
 - `gemma4-e4b-q6k` — better quality (production model)
 - `gemma4-26b-moe-iq4xs`, `granite-3.3-8b-q6k`, `deepseek-r1-8b-q6k` also available
 **gemma4 is a "thinking" model:** Outputs internal reasoning before the actual answer.
 With `max_tokens: 1024`, the model skips most of its reasoning and puts the answer directly in `content`. The `reasoning_content` fallback in `parser.ts` covers the edge cases where `content` is empty.
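The fallback can be pictured as follows. This is a minimal sketch assuming an OpenAI-compatible message object that may carry a `reasoning_content` field next to `content`; `ChatMessage` and `extractAnswer` are hypothetical names, and the actual logic in `parser.ts` may differ.

```typescript
// Assumed shape of the assistant message returned by the LLM endpoint.
interface ChatMessage {
  content?: string | null;
  reasoning_content?: string | null;
}

// Prefer `content`; fall back to `reasoning_content` only when content is empty.
function extractAnswer(message: ChatMessage): string | null {
  const content = message.content?.trim();
  if (content) {
    return content;
  }
  const reasoning = message.reasoning_content?.trim();
  return reasoning ? reasoning : null;
}
```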
 **vs phi4-mini:** phi4-mini is more literal and strict. For permissive recipe detection of Italian informal posts, gemma4 is significantly better.
 ---
 ## Tandoor: Steps Required to Save Ingredients
 **Symptom:** Recipe saved to Tandoor has no ingredients even though parsing succeeded.
 **Root cause:** Tandoor requires at least one Step for ingredients to be associated. When `recipe.steps` is null/empty:
 ```typescript
 // Old code — creates stepCount=1 but no actual step:
 const stepCount = recipe.steps?.length || 1;
 (recipe.steps || []).map(...) // returns [] → all ingredients lost
 ```
 **Fix in `tandoor.ts` `buildTandoorRecipeDTO()`:** When `recipe.steps` is null or empty, create a placeholder:
 ```typescript
 const steps = (recipe.steps?.length ? recipe.steps : ['Vedi la ricetta completa al link in bio.']);
 ```
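Why an empty step list loses the ingredients becomes clearer with a simplified model. The types below are assumptions made for illustration and are not Tandoor's real schema; `buildSteps` is a hypothetical helper, and attaching all ingredients to the first step is likewise an assumption about the mapping. With zero steps, `map` has nothing to attach the ingredients to, so the placeholder guarantees at least one carrier.

```typescript
// Simplified, assumed shapes (not the actual Tandoor DTO).
interface ParsedRecipe {
  title: string;
  ingredients: string[];
  steps?: string[] | null;
}

interface StepDTO {
  instruction: string;
  ingredients: string[];
}

const PLACEHOLDER_STEP = 'Vedi la ricetta completa al link in bio.';

function buildSteps(recipe: ParsedRecipe): StepDTO[] {
  // Guarantee at least one step so the ingredients have somewhere to live.
  const steps = recipe.steps?.length ? recipe.steps : [PLACEHOLDER_STEP];
  return steps.map((instruction, index) => ({
    instruction,
    // Assumption: every ingredient is attached to the first step.
    ingredients: index === 0 ? recipe.ingredients : []
  }));
}
```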
 ---
 ## SvelteKit SSE: Phase Updates Never Reaching UI
 **Symptom:** Processing animation showed "Prepping" throughout, then jumped straight to done.
 **Three root causes found:**
 1. **`updateQueueItem` never set `currentPhase`:** The function spread `...items[idx]` into the updated item but never applied `update.phase`. Fix:
   ```typescript
   currentPhase: update.phase ?? prev.currentPhase
   ```
 2. **Progress events silently discarded:** SSE `type: 'progress'` messages received but `progressEvents` array never updated. Live messages (e.g. "Parsing with LLM…") were dropped. Fix: append `data.event` to `progressEvents`.
 3. **Initial SSE snapshot missing `phase`:** The initial broadcast of queued items omitted `phase: item.currentPhase`. Items already in-progress on page load showed the wrong phase. Fix: include `phase` in the initial snapshot.
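The three fixes can be sketched together as pure state helpers. The item and update shapes are assumptions for illustration; only `updateQueueItem`, `currentPhase`, `progressEvents`, and `phase` appear in the notes above, while `appendProgress` and `toSnapshot` are hypothetical names.

```typescript
// Assumed, simplified shapes for the client-side queue state.
interface QueueItem {
  id: string;
  currentPhase: string;
  progressEvents: string[];
}

interface QueueUpdate {
  id: string;
  phase?: string;
}

// Fix 1: apply the incoming phase instead of only spreading the old item.
function updateQueueItem(items: QueueItem[], update: QueueUpdate): QueueItem[] {
  return items.map((prev) =>
    prev.id === update.id
      ? { ...prev, currentPhase: update.phase ?? prev.currentPhase }
      : prev
  );
}

// Fix 2: keep progress messages instead of discarding them.
function appendProgress(items: QueueItem[], id: string, event: string): QueueItem[] {
  return items.map((prev) =>
    prev.id === id
      ? { ...prev, progressEvents: prev.progressEvents.concat([event]) }
      : prev
  );
}

// Fix 3: include the phase in the initial snapshot sent on connect.
function toSnapshot(items: QueueItem[]): QueueUpdate[] {
  return items.map((item) => ({ id: item.id, phase: item.currentPhase }));
}
```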
 ---
 ## Gitea CI: Common Failure Modes
 **Chromium not available in Alpine Docker:**
 `vite.config.ts` defines two vitest projects: `client` (browser, needs Chromium) and `server` (Node.js). Alpine CI has no Chromium. Always specify:
 ```bash
 npm run test:unit -- --run --project=server
 ```
 **`$env/dynamic/private` throws in Docker build (no `.env`):**
 Any code reading SvelteKit env vars at module import time will throw during Docker `RUN npm test` because there's no `.env` file in the build. Fix: mock the module in affected tests:
 ```typescript
 vi.mock('$env/dynamic/private', () => ({
  env: { OPENAI_BASE_URL: 'http://localhost:11434', OPENAI_MODEL: 'test-model' }
 }));
 ```
 **Registry secrets must be set manually in Gitea:**
 `REGISTRY_USERNAME` and `REGISTRY_TOKEN` must be created in repo Settings → Actions → Secrets. They are not automatically available.
 ---
 ## TypeScript Quirk: Async Callback Closure Narrowing
 ```typescript
 let interceptedCaption: string | null = null;
 // The only other assignment happens inside an async callback.
 page.on('response', async () => { interceptedCaption = 'value'; });
 // Control-flow analysis does not track assignments made inside callbacks, so in the
 // outer scope the variable stays narrowed to `null` (and to `never` inside a truthiness check).
 const capturedCaption = interceptedCaption as string | null; // explicit cast restores the declared type
 ```
 ---
 ## Production Architecture: yt-dlp + Playwright Split
 **Current split (as of commit `c9f5300`+):**
 - **Playwright** → caption extraction (DOM, always full text)
 - **yt-dlp** → thumbnail URL only (fast, no browser overhead)
 - Both run **in parallel** in `QueueProcessor.ts`
 **Why not yt-dlp for caption?** Both mobile API and GraphQL responses can be truncated even with a valid session. DOM is the only reliable source.
 **Why not Playwright for thumbnail?** yt-dlp extracts thumbnail cleanly and quickly. Playwright-based thumbnail extraction was fragile.
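The parallel split can be sketched as below. The two extractor functions are passed in as parameters because their real signatures are not shown here; `extractInParallel` is a hypothetical name, and treating a failure on one side as `null` rather than failing the whole job is an assumption, not confirmed behavior of `QueueProcessor.ts`.

```typescript
interface ExtractionResult {
  caption: string | null;
  thumbnailUrl: string | null;
}

// Run both extractors concurrently; a failure on one side yields null
// for that field instead of rejecting the whole extraction (assumed policy).
async function extractInParallel(
  getCaption: () => Promise<string>,      // e.g. the Playwright DOM extraction
  getThumbnailUrl: () => Promise<string>  // e.g. the yt-dlp metadata lookup
): Promise<ExtractionResult> {
  const captionTask: Promise<string | null> = getCaption().catch(() => null);
  const thumbnailTask: Promise<string | null> = getThumbnailUrl().catch(() => null);
  const [caption, thumbnailUrl] = await Promise.all([captionTask, thumbnailTask]);
  return { caption, thumbnailUrl };
}
```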
 ---
 ## Infrastructure Reference
 | Resource | Value |
 |---|---|
 | App URL | `https://insta-recipe.sal.giize.com` |
 | SSH | `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa_ideapad moze@192.168.1.50` |
 | Compose file | `/home/moze/Server/stacks/insta-recipe/compose.yaml` |
 | Env file | `/home/moze/Server/stacks/insta-recipe/.env` |
 | Docker registry | `git.sal.giize.com/mozempk/insta-recipe:latest` |
 | Build | `docker buildx build --platform linux/amd64 -t git.sal.giize.com/mozempk/insta-recipe:latest --push .` |
 | Deploy | `docker compose pull && docker compose up -d` |
 | LLM (internal) | `http://chat_llama-cpp:8080/v1` |
 | LLM (external) | `http://192.168.1.50:8080` |
 | Current LLM model | `gemma4-e4b-q6k` (via `LLM_MODEL` in `.env`) |
 | Auth file (host) | `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json` |
 | Auth file (container) | `/app/secrets/auth.json` |