docs: add session findings — Instagram extraction, LLM, SSE, CI lessons

Documents hard-won discoveries from active debugging sessions:
- Instagram GraphQL/mobile API silent caption truncation (no marker)
- DOM extraction (html-section strategy) as the only reliable approach
- creator-written '….' vs API truncation — cannot use as signal
- cookies.txt vs auth.json session management and sessionid loss
- Playwright browser session expiry independent of API cookies
- phi4-mini too strict for Italian recipe posts → gemma4 switch
- gemma4 thinking model behavior with max_tokens: 1024
- Tandoor requires Step for ingredients to be saved
- SvelteKit SSE: 3 bugs that caused phase updates to never reach UI
- Gitea CI gotchas: Alpine Chromium, $env/dynamic/private, secrets
- yt-dlp + Playwright split architecture rationale
- Infrastructure reference table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Giancarmine Salucci
2026-05-13 03:13:17 +02:00
parent 61876f18e5
commit ecd2aef971


@@ -3145,3 +3145,243 @@ Footer component needs null-safe access since initial state is `null`:
**Document Version:** 3.0
**Last Updated by:** Planner Agent (RECIPE-0009 Iteration 1)
**Next Update:** Developer Agent
---
---
# Session Findings: Instagram Extraction & Production Lessons
*Recorded during active development sessions (2025-2026). These are hard-won discoveries from real debugging, not theoretical analysis.*
---
## Instagram: Caption Truncation in Web GraphQL API
**Symptom:** LLM says "no recipe found" even though the full recipe IS in the Instagram caption.
**Root cause:** Instagram's web GraphQL API (`doc_id=8845758582119845`) silently truncates captions in `edge_media_to_caption.edges[0].node.text`. Truncation is **inconsistent**:
- Sometimes ends with `….` (Unicode U+2026 + period)
- Sometimes cuts off mid-sentence with no marker at all
Known examples:
- `DWWxiymssxE`: GraphQL returns 327 chars, full caption is 393 chars (no truncation marker)
- `DXT73izCBoH`: GraphQL returns 744 chars, cuts off mid-sentence `"Versa nella tortiera co'"`
**Fix:** Never trust the GraphQL-intercepted caption. Always use DOM extraction (`extractWithStrategies` → `extractFromHTMLSection` → `tryExpandCaptionInHTMLSection` clicks "… more" button). Keep the intercepted GraphQL caption only as an emergency fallback when DOM extraction fails entirely.
**Key lesson:** The `….` suffix check is **not sufficient** to detect truncation. The only reliable approach is to always go through the DOM.
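The "DOM first, GraphQL only as an emergency fallback" rule can be sketched as a small selection function. This is a minimal illustration, not the project's actual code: `pickCaption`, `CaptionResult`, and the `source` labels are hypothetical names introduced here.

```typescript
// Hypothetical label recording where the caption came from.
type CaptionSource = "dom" | "graphql-fallback";

interface CaptionResult {
  text: string;
  source: CaptionSource;
}

// Prefer the DOM caption; use the intercepted GraphQL caption only when
// DOM extraction produced nothing at all. Caption length is deliberately
// not compared: a longer GraphQL caption is still not trusted over the DOM.
function pickCaption(
  domCaption: string | null,
  graphqlCaption: string | null,
): CaptionResult | null {
  const dom = domCaption?.trim();
  if (dom) return { text: dom, source: "dom" };

  const gql = graphqlCaption?.trim();
  if (gql) return { text: gql, source: "graphql-fallback" };

  return null; // nothing usable from either source
}
```

Recording the source alongside the text makes it possible to log when the possibly truncated fallback was used.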
---
## Instagram: Mobile API vs GraphQL API (yt-dlp behavior)
**How yt-dlp selects which API to call:**
1. If `sessionid` cookie present → calls `https://i.instagram.com/api/v1/media/{PK}/info/` (mobile API)
2. If mobile API fails (or no sessionid) → falls back to GraphQL `doc_id=8845758582119845`
**Mobile API User-Agent:**
- Desktop UA → HTTP 404
- Instagram Android UA → HTTP 200 with full response
- The `--user-agent` CLI flag only affects video download requests, **not** API calls — yt-dlp uses its own hardcoded headers for API calls
**Mobile API also truncates:** Even with a valid sessionid and HTTP 200, `caption.text` in the mobile API response can still be truncated. DOM extraction is the only fully reliable source.
**Shortcode → PK conversion:**
```python
def shortcode_to_pk(sc):
    # Decode the shortcode as a base-64 number using Instagram's URL-safe alphabet.
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_'
    n = 0
    for c in sc:
        n = n * 64 + alphabet.index(c)
    return n
```
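If the same conversion is needed on the TypeScript side, a port might look like the sketch below. An 11-character shortcode encodes up to 66 bits, which exceeds `Number.MAX_SAFE_INTEGER`, so the sketch uses `bigint` (this assumes an ES2020-capable runtime). The name `shortcodeToPk` is hypothetical.

```typescript
const SHORTCODE_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Decode a shortcode as a base-64 number. Returns a bigint because the
// result can exceed the range that a JavaScript number represents exactly.
function shortcodeToPk(shortcode: string): bigint {
  const base = BigInt(64);
  let pk = BigInt(0);
  for (let i = 0; i < shortcode.length; i++) {
    const index = SHORTCODE_ALPHABET.indexOf(shortcode[i]);
    if (index === -1) {
      throw new Error(`Invalid shortcode character: ${shortcode[i]}`);
    }
    pk = pk * base + BigInt(index);
  }
  return pk;
}
```

Calling `shortcodeToPk(sc).toString()` gives the decimal string to substitute for `{PK}` in the mobile API URL.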
---
## Instagram: Creator-Written `….` vs API Truncation
**Gotcha:** Some creators intentionally end their captions with `….` or `#seriesname….` as a signature or series marker. This is NOT API truncation.
**Example:** Reel `DW5zH3xjY-_` ("5030 LOW CAL 💪") — the `….` is written by the creator as a series signature. The reel has only 213 chars of real content and no recipe.
**Implication:** Never use `….` suffix as the primary signal to fetch more content — always use DOM extraction regardless.
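A short sketch shows why the suffix check fails in both directions. The function name is hypothetical; the two sample strings come from the cases described in these notes (a creator-style signature and the mid-sentence cut from `DXT73izCBoH`).

```typescript
// Naive heuristic: treat a trailing "…." (U+2026 followed by a period)
// as evidence of API truncation.
function endsWithTruncationMarker(caption: string): boolean {
  return caption.trim().endsWith("\u2026.");
}

// Creator signature, not truncation: the heuristic gives a false positive.
const creatorSigned = "#seriesname\u2026.";

// Cut off mid-sentence with no marker: the heuristic gives a false negative.
const silentlyCut = "Versa nella tortiera co'";

console.log(endsWithTruncationMarker(creatorSigned)); // true, but the caption is complete
console.log(endsWithTruncationMarker(silentlyCut)); // false, but the caption is truncated
```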
---
## Instagram: cookies.txt vs auth.json — Session Management
**Two auth formats coexist:**
- `secrets/auth.json` — Playwright `storageState` format (JSON, cookies + origins)
- `secrets/cookies.txt` — Netscape format for yt-dlp
**yt-dlp overwrites `cookies.txt`** after each extraction, removing `sessionid`. `maybeConvertAuthJson()` regenerates the file from `auth.json` before each call, so this is safe in normal operation, but inspecting `cookies.txt` directly between runs will show a reduced file.
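For orientation, converting Playwright `storageState` cookies to the Netscape format could look roughly like this. It is a sketch only, not the project's `maybeConvertAuthJson()`: the function names are hypothetical, and the `#HttpOnly_` prefix follows the curl convention, which the consumer is assumed to accept.

```typescript
// Subset of the cookie fields found in a Playwright storageState file.
interface StorageStateCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix seconds; -1 for session cookies
  httpOnly: boolean;
  secure: boolean;
}

// Render cookies as a Netscape cookie file: seven tab-separated fields per line.
function toNetscapeCookies(cookies: StorageStateCookie[]): string {
  const lines = cookies.map((c) => {
    const domain = c.httpOnly ? `#HttpOnly_${c.domain}` : c.domain;
    const includeSubdomains = c.domain.startsWith(".") ? "TRUE" : "FALSE";
    const secure = c.secure ? "TRUE" : "FALSE";
    const expires = c.expires > 0 ? Math.floor(c.expires) : 0; // 0 = session cookie
    return [domain, includeSubdomains, c.path, secure, String(expires), c.name, c.value].join("\t");
  });
  return ["# Netscape HTTP Cookie File", ...lines].join("\n") + "\n";
}

// Quick guard before invoking yt-dlp: is the critical cookie present?
function hasSessionId(cookies: StorageStateCookie[]): boolean {
  return cookies.some((c) => c.name === "sessionid" && c.value.length > 0);
}
```

Checking `hasSessionId` before each run would surface a missing `sessionid` early instead of letting it show up later as a truncated caption.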
**`sessionid` is critical.** Without it:
- yt-dlp mobile API returns HTTP 404 (empty response)
- Falls back to GraphQL → truncated caption
**Auth scheduler:** `scheduler.ts` runs every 15 minutes to renew the session by navigating to Instagram. Verify with logs: `[Scheduler] Instagram authentication renewed successfully`.
---
## Instagram: Playwright Browser Session Expiry (independent of cookies)
**Symptom:** Playwright navigates to Instagram, sees a profile selector ("Continue as …"), clicks Continue, gets redirected to `/accounts/login/`.
**Root cause:** The `sessionid` cookie is valid for API calls but the browser-level session can expire independently. Instagram shows the profile selector as a soft prompt which, when clicked, triggers a re-auth that fails with a stale session.
**Diagnosis:**
- `svg[aria-label="Home"]` found → session valid ✅
- `(N) Instagram` in title with notifications count → logged in ✅
- Profile selector visible → session expired, need to re-authenticate
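The three checks above can be combined into one classifier. The sketch below is a pure function over signals already read from the page; `PageSignals` and `classifySession` are hypothetical names, and giving the profile selector priority over the other two signals is an assumption.

```typescript
type SessionState = "valid" | "expired" | "unknown";

interface PageSignals {
  hasHomeIcon: boolean; // svg[aria-label="Home"] was found
  title: string; // document title
  hasProfileSelector: boolean; // "Continue as …" prompt is visible
}

function classifySession(signals: PageSignals): SessionState {
  // Assumption: a visible profile selector overrides the other signals.
  if (signals.hasProfileSelector) return "expired";
  if (signals.hasHomeIcon) return "valid";
  // A title such as "(3) Instagram" carries a notification count,
  // which is only shown to a logged-in user.
  if (/^\(\d+\) Instagram/.test(signals.title)) return "valid";
  return "unknown";
}
```

Returning `"unknown"` rather than guessing lets the caller decide whether to retry or to treat the session as expired.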
**Fix:** Re-authenticate by generating a fresh `auth.json` from a login in a real browser session and copying it to the volume at `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json`.
---
## Instagram: DOM Extraction Strategy Order (2025/2026)
`extractWithStrategies` tries 6 approaches in order. Only one reliably works now:
| Strategy | Status | Reason |
|---|---|---|
| `embedded-json` | ❌ Fails | Instagram removed `window.__additionalDataLoaded` |
| `internal-state` | ❌ Fails | Instagram removed `window._sharedData` |
| `html-section` | ✅ Works | DOM extraction + "… more" button click |
| `dom-selector` | ⚠️ Partial | Simpler DOM query, may miss truncated captions |
| `graphql-api` | ⚠️ Truncated | Live interception but caption is still truncated |
| `legacy` | ❌ Fails | Old format gone |
**Note:** Clicking "… more" triggers feed-loading GraphQL calls (`xdt_api__v1__clips__home__connection_v2`) as a side effect. The full text comes purely from the expanded DOM, not a network response.
---
## LLM: phi4-mini Recipe Detection Too Strict
**Problem:** phi4-mini rejected valid Italian Instagram recipe posts as "no recipe found" during detection.
**Root cause:** Detection prompt required quantities + at least 2 steps. Italian Instagram posts often:
- Omit explicit quantities (just list ingredients by name)
- Say "full recipe at link in bio" with no steps at all
**Detection prompt evolution:**
- v1: title + 3 ingredients with quantities + 2 steps
- v2: title + 3 ingredients (no quantities) + 1 step
- v3 (current): title + 2 ingredients, NO step requirement
**Lesson:** If it reads like food content with at least 2 named ingredients, say yes.
---
## LLM: gemma4 Thinking Models Behavior
**gemma4 models on llama-swap (`http://192.168.1.50:8080`):**
- `gemma4-e2b-q8_0` — smaller/faster
- `gemma4-e4b-q6k` — better quality (production model)
- `gemma4-26b-moe-iq4xs`, `granite-3.3-8b-q6k`, `deepseek-r1-8b-q6k` also available
**gemma4 is a "thinking" model:** Outputs internal reasoning before the actual answer.
With `max_tokens: 1024`, the model skips most of its reasoning and puts the answer directly in `content`. The `reasoning_content` fallback in `parser.ts` covers edge cases where `content` is empty.
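The fallback described above might look like the following sketch. It is not the actual `parser.ts` code; `ChatMessage` models only the two fields mentioned here, and `extractAnswer` is a hypothetical name.

```typescript
// Only the two fields of the chat-completion message that matter here.
interface ChatMessage {
  content?: string | null;
  reasoning_content?: string | null;
}

// Prefer `content`; fall back to `reasoning_content` when content is
// missing or blank (for example, when the token budget ran out mid-reasoning).
function extractAnswer(message: ChatMessage): string | null {
  const content = message.content?.trim();
  if (content) return content;

  const reasoning = message.reasoning_content?.trim();
  return reasoning ? reasoning : null;
}
```

The caller still has to parse whatever comes back, since text recovered from `reasoning_content` may not be in the expected output format.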
**vs phi4-mini:** phi4-mini is more literal and strict. For permissive recipe detection of Italian informal posts, gemma4 is significantly better.
---
## Tandoor: Steps Required to Save Ingredients
**Symptom:** Recipe saved to Tandoor has no ingredients even though parsing succeeded.
**Root cause:** Tandoor requires at least one Step for ingredients to be associated. When `recipe.steps` is null/empty:
```typescript
// Old code — creates stepCount=1 but no actual step:
const stepCount = recipe.steps?.length || 1;
(recipe.steps || []).map(...) // returns [] → all ingredients lost
```
**Fix in `tandoor.ts` `buildTandoorRecipeDTO()`:** When `recipe.steps` is null or empty, create a placeholder:
```typescript
const steps = (recipe.steps?.length ? recipe.steps : ['Vedi la ricetta completa al link in bio.']);
```
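To make the failure mode concrete: ingredients are nested inside steps, so zero steps means nowhere to put them. The sketch below uses simplified, assumed shapes (`ParsedRecipe`, `StepDTO`) rather than Tandoor's full schema, and attaching every ingredient to the first step is also an assumption.

```typescript
interface ParsedRecipe {
  title: string;
  ingredients: string[];
  steps?: string[] | null;
}

// Simplified step shape: each step carries its own ingredient list.
interface StepDTO {
  instruction: string;
  ingredients: { food: { name: string } }[];
}

const PLACEHOLDER_STEP = "Vedi la ricetta completa al link in bio.";

function buildSteps(recipe: ParsedRecipe): StepDTO[] {
  // Guarantee at least one step so the ingredients have a parent.
  const instructions = recipe.steps?.length ? recipe.steps : [PLACEHOLDER_STEP];
  return instructions.map((instruction, i) => ({
    instruction,
    // Assumption: all ingredients are attached to the first step.
    ingredients: i === 0 ? recipe.ingredients.map((name) => ({ food: { name } })) : [],
  }));
}
```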
---
## SvelteKit SSE: Phase Updates Never Reaching UI
**Symptom:** Processing animation showed "Prepping" throughout, then jumped straight to done.
**Three root causes found:**
1. **`updateQueueItem` never set `currentPhase`:** Spreading `...items[idx]` but never applying `update.phase`. Fix:
```typescript
currentPhase: update.phase ?? prev.currentPhase
```
2. **Progress events silently discarded:** SSE `type: 'progress'` messages received but `progressEvents` array never updated. Live messages (e.g. "Parsing with LLM…") were dropped. Fix: append `data.event` to `progressEvents`.
3. **Initial SSE snapshot missing `phase`:** The initial broadcast of queued items omitted `phase: item.currentPhase`. Items already in-progress on page load showed the wrong phase. Fix: include `phase` in the initial snapshot.
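Taken together, the three fixes amount to a small state update plus a complete snapshot. The sketch below is illustrative only: the message shapes, `applySseMessage`, and `toSnapshot` are hypothetical, and the real store code may be organised differently.

```typescript
interface QueueItem {
  id: string;
  currentPhase: string;
  progressEvents: string[];
}

// Assumed message shapes for illustration.
type SseMessage =
  | { type: "update"; id: string; phase?: string }
  | { type: "progress"; id: string; event: string };

// Client side: apply one SSE message immutably.
function applySseMessage(items: QueueItem[], msg: SseMessage): QueueItem[] {
  return items.map((prev) => {
    if (prev.id !== msg.id) return prev;
    if (msg.type === "progress") {
      // Fix 2: keep live progress messages instead of dropping them.
      return { ...prev, progressEvents: [...prev.progressEvents, msg.event] };
    }
    // Fix 1: actually apply the phase carried by the update.
    return { ...prev, currentPhase: msg.phase ?? prev.currentPhase };
  });
}

// Server side, fix 3: the initial snapshot must include the current phase.
function toSnapshot(item: QueueItem): { id: string; phase: string } {
  return { id: item.id, phase: item.currentPhase };
}
```

Keeping the update as a pure function makes each of the three bugs testable without a browser or an open SSE connection.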
---
## Gitea CI: Common Failure Modes
**Chromium not available in Alpine Docker:**
`vite.config.ts` defines two vitest projects: `client` (browser, needs Chromium) and `server` (Node.js). Alpine CI has no Chromium. Always specify:
```bash
npm run test:unit -- --run --project=server
```
**`$env/dynamic/private` throws in Docker build (no `.env`):**
Any code reading SvelteKit env vars at module import time will throw during Docker `RUN npm test` because there's no `.env` file in the build. Fix: mock the module in affected tests:
```typescript
vi.mock('$env/dynamic/private', () => ({
  env: { OPENAI_BASE_URL: 'http://localhost:11434', OPENAI_MODEL: 'test-model' }
}));
```
**Registry secrets must be set manually in Gitea:**
`REGISTRY_USERNAME` and `REGISTRY_TOKEN` must be created in repo Settings → Actions → Secrets. They are not automatically available.
---
## TypeScript Quirk: Async Callback Closure Narrowing
```typescript
let interceptedCaption: string | null = null;
page.on('response', async () => { interceptedCaption = 'value'; }); // assigned in async callback
// Control-flow analysis ignores assignments made inside callbacks, so if the outer
// scope has no other assignment, TypeScript narrows `interceptedCaption` to `null`
// here (and to `never` inside a truthiness check).
const capturedCaption = interceptedCaption as string | null; // explicit cast required
```
---
## Production Architecture: yt-dlp + Playwright Split
**Current split (as of commit `c9f5300`+):**
- **Playwright** → caption extraction (DOM, always full text)
- **yt-dlp** → thumbnail URL only (fast, no browser overhead)
- Both run **in parallel** in `QueueProcessor.ts`
**Why not yt-dlp for caption?** Both mobile API and GraphQL responses can be truncated even with a valid session. DOM is the only reliable source.
**Why not Playwright for thumbnail?** yt-dlp extracts thumbnail cleanly and quickly. Playwright-based thumbnail extraction was fragile.
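The parallel split can be sketched with `Promise.allSettled`, so that a failure in one extractor does not discard the result of the other. This is an assumption about the desired behaviour, not a description of `QueueProcessor.ts`; `extractInParallel` and its two callback parameters are hypothetical.

```typescript
interface ExtractionResult {
  caption: string | null;
  thumbnailUrl: string | null;
}

// Run both extractors concurrently and keep whatever succeeded.
async function extractInParallel(
  getCaption: () => Promise<string>, // e.g. Playwright DOM extraction
  getThumbnailUrl: () => Promise<string>, // e.g. a yt-dlp invocation
): Promise<ExtractionResult> {
  const [caption, thumbnail] = await Promise.allSettled([
    getCaption(),
    getThumbnailUrl(),
  ]);
  return {
    caption: caption.status === "fulfilled" ? caption.value : null,
    thumbnailUrl: thumbnail.status === "fulfilled" ? thumbnail.value : null,
  };
}
```

With `Promise.all` instead, a thumbnail failure would reject the whole operation even though the caption, the part that matters for recipe parsing, was extracted successfully.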
---
## Infrastructure Reference
| Resource | Value |
|---|---|
| App URL | `https://insta-recipe.sal.giize.com` |
| SSH | `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa_ideapad moze@192.168.1.50` |
| Compose file | `/home/moze/Server/stacks/insta-recipe/compose.yaml` |
| Env file | `/home/moze/Server/stacks/insta-recipe/.env` |
| Docker registry | `git.sal.giize.com/mozempk/insta-recipe:latest` |
| Build | `docker buildx build --platform linux/amd64 -t git.sal.giize.com/mozempk/insta-recipe:latest --push .` |
| Deploy | `docker compose pull && docker compose up -d` |
| LLM (internal) | `http://chat_llama-cpp:8080/v1` |
| LLM (external) | `http://192.168.1.50:8080` |
| Current LLM model | `gemma4-e4b-q6k` (via `LLM_MODEL` in `.env`) |
| Auth file (host) | `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json` |
| Auth file (container) | `/app/secrets/auth.json` |