docs: add session findings — Instagram extraction, LLM, SSE, CI lessons
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 33s
Documents hard-won discoveries from active debugging sessions:

- Instagram GraphQL/mobile API silent caption truncation (no marker)
- DOM extraction (html-section strategy) as the only reliable approach
- creator-written '….' vs API truncation: cannot use as signal
- cookies.txt vs auth.json session management and sessionid loss
- Playwright browser session expiry independent of API cookies
- phi4-mini too strict for Italian recipe posts → gemma4 switch
- gemma4 thinking model behavior with max_tokens: 1024
- Tandoor requires Step for ingredients to be saved
- SvelteKit SSE: 3 bugs that caused phase updates to never reach UI
- Gitea CI gotchas: Alpine Chromium, $env/dynamic/private, secrets
- yt-dlp + Playwright split architecture rationale
- Infrastructure reference table

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
240
docs/FINDINGS.md
@@ -3145,3 +3145,243 @@ Footer component needs null-safe access since initial state is `null`:
**Document Version:** 3.0
**Last Updated by:** Planner Agent (RECIPE-0009 Iteration 1)
**Next Update:** Developer Agent

---

# Session Findings: Instagram Extraction & Production Lessons

*Recorded during active development sessions (2025–2026). These are hard-won discoveries from real debugging, not theoretical analysis.*

---

## Instagram: Caption Truncation in Web GraphQL API

**Symptom:** The LLM reports "no recipe found" even though the full recipe is in the Instagram caption.

**Root cause:** Instagram's web GraphQL API (`doc_id=8845758582119845`) silently truncates captions in `edge_media_to_caption.edges[0].node.text`. Truncation is **inconsistent**:
- Sometimes ends with `….` (Unicode U+2026 + period)
- Sometimes cuts off mid-sentence with no marker at all

Known examples:
- `DWWxiymssxE`: GraphQL returns 327 chars, full caption is 393 chars (no truncation marker)
- `DXT73izCBoH`: GraphQL returns 744 chars, cuts off mid-sentence `"Versa nella tortiera co'"`

**Fix:** Never trust the GraphQL-intercepted caption. Always use DOM extraction (`extractWithStrategies` → `extractFromHTMLSection` → `tryExpandCaptionInHTMLSection`, which clicks the "… more" button). Keep the intercepted GraphQL caption only as an emergency fallback when DOM extraction fails entirely.

**Key lesson:** Checking for a `….` suffix is **not sufficient** to detect truncation. The only reliable approach is to always go through the DOM.
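
The fallback rule above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: `chooseCaption` and `CaptionChoice` are hypothetical names, and the sketch deliberately never inspects the `….` suffix.

```typescript
/** Which source the final caption came from (hypothetical type for this sketch). */
interface CaptionChoice {
  text: string | null;
  source: "dom" | "graphql-fallback" | "none";
}

/**
 * Prefer the DOM-extracted caption. Use the intercepted GraphQL caption only
 * when DOM extraction produced nothing at all. No suffix check is performed,
 * because a trailing "…." does not reliably indicate truncation.
 */
function chooseCaption(
  domCaption: string | null,
  interceptedCaption: string | null
): CaptionChoice {
  const dom = (domCaption ?? "").trim();
  if (dom.length > 0) {
    return { text: dom, source: "dom" };
  }
  const intercepted = (interceptedCaption ?? "").trim();
  if (intercepted.length > 0) {
    // Emergency fallback: this text may be silently truncated.
    return { text: intercepted, source: "graphql-fallback" };
  }
  return { text: null, source: "none" };
}
```

Recording the `source` alongside the text makes it possible to log or flag recipes that were parsed from a possibly truncated caption.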

---

## Instagram: Mobile API vs GraphQL API (yt-dlp behavior)

**How yt-dlp selects which API to call:**
1. If a `sessionid` cookie is present → calls `https://i.instagram.com/api/v1/media/{PK}/info/` (mobile API)
2. If the mobile API fails (or there is no `sessionid`) → falls back to GraphQL `doc_id=8845758582119845`

**Mobile API User-Agent:**
- Desktop UA → HTTP 404
- Instagram Android UA → HTTP 200 with full response
- The `--user-agent` CLI flag only affects video download requests, **not** API calls; yt-dlp uses its own hardcoded headers for API calls

**Mobile API also truncates:** Even with a valid `sessionid` and HTTP 200, `caption.text` in the mobile API response can still be truncated. DOM extraction is the only fully reliable source.

**Shortcode → PK conversion:**
```python
def shortcode_to_pk(sc):
    # The shortcode is read as a base-64 number over this URL-safe alphabet.
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_'
    n = 0
    for c in sc:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * 64 + alphabet.index(c)
    return n
```
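
Since the application code is TypeScript, the same conversion might look like the sketch below. It assumes a runtime with `BigInt` support, which is needed because an 11-character shortcode encodes up to 66 bits and would overflow a JavaScript `number`. `shortcodeToPk` is a hypothetical name, not a function from the codebase.

```typescript
const SHORTCODE_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/**
 * Decode an Instagram shortcode as a base-64 number over SHORTCODE_ALPHABET.
 * Returns a bigint because real PKs exceed Number.MAX_SAFE_INTEGER.
 */
function shortcodeToPk(shortcode: string): bigint {
  const base = BigInt(64);
  let pk = BigInt(0);
  for (let i = 0; i < shortcode.length; i++) {
    const ch = shortcode.charAt(i);
    const value = SHORTCODE_ALPHABET.indexOf(ch);
    if (value < 0) {
      throw new Error("Invalid shortcode character: " + ch);
    }
    pk = pk * base + BigInt(value);
  }
  return pk;
}
```

Call `.toString()` on the result when interpolating it into the `/api/v1/media/{PK}/info/` path.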

---

## Instagram: Creator-Written `….` vs API Truncation

**Gotcha:** Some creators intentionally end their captions with `….` or `#seriesname….` as a signature or series marker. This is NOT API truncation.

**Example:** Reel `DW5zH3xjY-_` ("5030 LOW CAL 💪"): the `….` is written by the creator as a series signature. The reel has only 213 chars of real content and no recipe.

**Implication:** Never use the `….` suffix as the primary signal to fetch more content. Always use DOM extraction regardless.

---

## Instagram: cookies.txt vs auth.json (Session Management)

**Two auth formats coexist:**
- `secrets/auth.json`: Playwright `storageState` format (JSON, cookies + origins)
- `secrets/cookies.txt`: Netscape format for yt-dlp

**yt-dlp overwrites cookies.txt** after each extraction, removing `sessionid`. `maybeConvertAuthJson()` regenerates the file from `auth.json` before each yt-dlp call, so this is safe in normal operation, but inspecting `cookies.txt` directly between runs will show a reduced file.

**`sessionid` is critical.** Without it:
- The yt-dlp mobile API call returns HTTP 404 (empty response)
- yt-dlp falls back to GraphQL → truncated caption

**Auth scheduler:** `scheduler.ts` runs every 15 minutes to renew the session by navigating to Instagram. Verify with logs: `[Scheduler] Instagram authentication renewed successfully`.
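
To make the two formats concrete, here is a rough sketch of converting the `cookies` array of a Playwright `storageState` file into Netscape cookie-file text. It is not the actual `maybeConvertAuthJson()` implementation; `StoredCookie` and `toNetscapeCookies` are hypothetical names, and writing `0` as the expiry of session cookies is an assumption of this sketch.

```typescript
/** Subset of a Playwright storageState cookie that this sketch needs. */
interface StoredCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix seconds; -1 marks a session cookie
  httpOnly: boolean;
  secure: boolean;
}

/** Render cookies as tab-separated Netscape cookie-file lines. */
function toNetscapeCookies(cookies: StoredCookie[]): string {
  const lines = cookies.map((c) => {
    // A leading dot means the cookie also applies to subdomains.
    const includeSubdomains = c.domain.charAt(0) === "." ? "TRUE" : "FALSE";
    // Assumption: session cookies (expires = -1) are written with expiry 0.
    const expires = c.expires > 0 ? Math.floor(c.expires) : 0;
    // HttpOnly cookies are conventionally marked with a "#HttpOnly_" prefix.
    const domain = c.httpOnly ? "#HttpOnly_" + c.domain : c.domain;
    return [
      domain,
      includeSubdomains,
      c.path,
      c.secure ? "TRUE" : "FALSE",
      String(expires),
      c.name,
      c.value,
    ].join("\t");
  });
  return ["# Netscape HTTP Cookie File", ...lines].join("\n") + "\n";
}
```

After conversion, a quick sanity check is that the output contains a line ending in `sessionid` followed by its value; if it does not, yt-dlp will end up on the truncating GraphQL path.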

---

## Instagram: Playwright Browser Session Expiry (independent of cookies)

**Symptom:** Playwright navigates to Instagram, sees a profile selector ("Continue as …"), clicks Continue, and gets redirected to `/accounts/login/`.

**Root cause:** The `sessionid` cookie is valid for API calls, but the browser-level session can expire independently. Instagram shows the profile selector as a soft prompt which, when clicked, triggers a re-auth that fails with a stale session.

**Diagnosis:**
- `svg[aria-label="Home"]` found → session valid ✅
- `(N) Instagram` in the page title (N = notification count) → logged in ✅
- Profile selector visible → session expired, need to re-authenticate

**Fix:** Re-authenticate: generate a fresh `auth.json` from a login in a real browser session and copy it to the volume at `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json`.
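
The diagnosis checklist could be expressed as a pure classifier over signals collected from the page. This is a sketch under assumptions: `SessionSignals` and `classifySession` are hypothetical names, and checking the profile selector first is a choice made for this example, not something the findings above prescribe.

```typescript
type SessionState = "valid" | "expired" | "unknown";

/** Signals gathered from the page by the caller (for example via Playwright). */
interface SessionSignals {
  homeIconFound: boolean;          // svg[aria-label="Home"] is present
  pageTitle: string;               // document title
  profileSelectorVisible: boolean; // "Continue as …" prompt is shown
}

function classifySession(signals: SessionSignals): SessionState {
  // Assumption: the profile selector is treated as the strongest signal.
  if (signals.profileSelectorVisible) return "expired";
  if (signals.homeIconFound) return "valid";
  // A title such as "(3) Instagram" carries a notification count.
  if (/^\(\d+\)\s+Instagram/.test(signals.pageTitle)) return "valid";
  return "unknown";
}
```

Keeping the classifier separate from the Playwright calls means it can be unit-tested in the `server` vitest project without a browser.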

---

## Instagram: DOM Extraction Strategy Order (2025/2026)

`extractWithStrategies` tries 6 approaches in order. Only one works reliably now:

| Strategy | Status | Reason |
|---|---|---|
| `embedded-json` | ❌ Fails | Instagram removed `window.__additionalDataLoaded` |
| `internal-state` | ❌ Fails | Instagram removed `window._sharedData` |
| `html-section` | ✅ Works | DOM extraction + "… more" button click |
| `dom-selector` | ⚠️ Partial | Simpler DOM query, may miss truncated captions |
| `graphql-api` | ⚠️ Truncated | Live interception but caption is still truncated |
| `legacy` | ❌ Fails | Old format gone |

**Note:** Clicking "… more" triggers feed-loading GraphQL calls (`xdt_api__v1__clips__home__connection_v2`) as a side effect. The full text comes purely from the expanded DOM, not from a network response.

---

## LLM: phi4-mini Recipe Detection Too Strict

**Problem:** phi4-mini rejected valid Italian Instagram recipe posts as "no recipe found" during detection.

**Root cause:** The detection prompt required quantities + at least 2 steps. Italian Instagram posts often:
- Omit explicit quantities (just list ingredients by name)
- Say "full recipe at link in bio" with no steps at all

**Detection prompt evolution:**
- v1: title + 3 ingredients with quantities + 2 steps
- v2: title + 3 ingredients (no quantities) + 1 step
- v3 (current): title + 2 ingredients, NO step requirement

**Lesson:** If it reads like food content with at least 2 named ingredients, say yes.
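
The v3 criteria live in the LLM prompt, but the same rule can be mirrored as a post-check on the parsed output. The sketch below is illustrative only: `RecipeCandidate` and `meetsV3Criteria` are hypothetical names, and the field layout is an assumption about what the parser returns.

```typescript
/** Assumed shape of the LLM's structured output for this sketch. */
interface RecipeCandidate {
  title?: string | null;
  ingredients?: string[] | null;
  steps?: string[] | null;
}

/**
 * v3 rule: a non-empty title plus at least two named ingredients.
 * Quantities and steps are deliberately not required.
 */
function meetsV3Criteria(candidate: RecipeCandidate): boolean {
  const hasTitle = (candidate.title ?? "").trim().length > 0;
  const namedIngredients = (candidate.ingredients ?? []).filter(
    (ingredient) => ingredient.trim().length > 0
  );
  return hasTitle && namedIngredients.length >= 2;
}
```

A check like this can catch cases where the model answers "yes" but returns an empty ingredient list.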

---

## LLM: gemma4 Thinking Model Behavior

**gemma4 models on llama-swap (`http://192.168.1.50:8080`):**
- `gemma4-e2b-q8_0`: smaller/faster
- `gemma4-e4b-q6k`: better quality (production model)
- `gemma4-26b-moe-iq4xs`, `granite-3.3-8b-q6k`, `deepseek-r1-8b-q6k` also available

**gemma4 is a "thinking" model:** It outputs internal reasoning before the actual answer.

**With `max_tokens: 1024`:** The model skips most reasoning and puts the answer directly in `content`. The `reasoning_content` fallback in `parser.ts` covers edge cases where `content` is empty.

**vs phi4-mini:** phi4-mini is more literal and strict. For permissive recipe detection on informal Italian posts, gemma4 is significantly better.
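
The `reasoning_content` fallback can be pictured roughly as follows. This is not the code in `parser.ts`; `ChatMessage` and `extractAnswerText` are hypothetical names, and the message shape is assumed to be the assistant message of an OpenAI-compatible chat completion response.

```typescript
/** Assumed subset of the assistant message in a chat completion response. */
interface ChatMessage {
  content?: string | null;
  reasoning_content?: string | null;
}

/**
 * Return the answer text, preferring `content` and falling back to
 * `reasoning_content` when `content` is missing or blank.
 */
function extractAnswerText(message: ChatMessage): string | null {
  const content = (message.content ?? "").trim();
  if (content.length > 0) return content;
  const reasoning = (message.reasoning_content ?? "").trim();
  return reasoning.length > 0 ? reasoning : null;
}
```

Returning `null` when both fields are empty lets the caller distinguish "model said nothing" from "model said no recipe".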

---

## Tandoor: Steps Required to Save Ingredients

**Symptom:** A recipe saved to Tandoor has no ingredients even though parsing succeeded.

**Root cause:** Tandoor requires at least one Step for ingredients to be associated. When `recipe.steps` is null/empty:
```typescript
// Old code: stepCount falls back to 1, but no step object is ever created.
const stepCount = recipe.steps?.length || 1;
(recipe.steps || []).map(...) // returns [] → all ingredients lost
```

**Fix in `tandoor.ts` `buildTandoorRecipeDTO()`:** When `recipe.steps` is null or empty, create a placeholder:
```typescript
// Placeholder text is Italian for "See the full recipe at the link in bio."
const steps = recipe.steps?.length ? recipe.steps : ['Vedi la ricetta completa al link in bio.'];
```
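
A fuller sketch of the idea, under stated assumptions: `ParsedRecipe`, `StepDTO`, and `buildSteps` are hypothetical names, the DTO shape is heavily simplified and is not Tandoor's real schema, and attaching every ingredient to the first step is a choice made for this example only.

```typescript
/** Simplified parser output (assumed shape). */
interface ParsedRecipe {
  steps?: string[] | null;
  ingredients: string[];
}

/** Simplified step payload; not Tandoor's full schema. */
interface StepDTO {
  instruction: string;
  ingredients: string[];
}

// Italian for "See the full recipe at the link in bio."
const PLACEHOLDER_STEP = "Vedi la ricetta completa al link in bio.";

/** Always emit at least one step so the ingredients have something to attach to. */
function buildSteps(recipe: ParsedRecipe): StepDTO[] {
  const instructions =
    recipe.steps && recipe.steps.length > 0 ? recipe.steps : [PLACEHOLDER_STEP];
  return instructions.map((instruction, index) => ({
    instruction,
    // Assumption: all ingredients go on the first step.
    ingredients: index === 0 ? recipe.ingredients : [],
  }));
}
```

The key property is that `buildSteps` never returns an empty array, so ingredients cannot be dropped because of a missing step.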

---

## SvelteKit SSE: Phase Updates Never Reaching UI

**Symptom:** The processing animation showed "Prepping" throughout, then jumped straight to done.

**Three root causes found:**

1. **`updateQueueItem` never set `currentPhase`:** It spread `...items[idx]` but never applied `update.phase`. Fix:
   ```typescript
   currentPhase: update.phase ?? prev.currentPhase
   ```

2. **Progress events silently discarded:** SSE `type: 'progress'` messages were received, but the `progressEvents` array was never updated. Live messages (e.g. "Parsing with LLM…") were dropped. Fix: append `data.event` to `progressEvents`.

3. **Initial SSE snapshot missing `phase`:** The initial broadcast of queued items omitted `phase: item.currentPhase`. Items already in progress on page load showed the wrong phase. Fix: include `phase` in the initial snapshot.
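
The first two fixes can be combined into one immutable update function. This is a sketch with assumed shapes: `QueueItem`, `QueueUpdate`, and `applyUpdate` are hypothetical names and do not mirror the store's real types.

```typescript
interface QueueItem {
  id: string;
  currentPhase: string | null;
  progressEvents: string[];
}

interface QueueUpdate {
  id: string;
  phase?: string | null; // present on phase changes
  event?: string;        // present on 'progress' messages
}

/** Return a new list with the update merged into the matching item. */
function applyUpdate(items: QueueItem[], update: QueueUpdate): QueueItem[] {
  return items.map((prev) => {
    if (prev.id !== update.id) return prev;
    return {
      ...prev,
      // Fix 1: apply the incoming phase, keeping the old one if absent.
      currentPhase: update.phase ?? prev.currentPhase,
      // Fix 2: append progress events instead of discarding them.
      progressEvents:
        update.event !== undefined
          ? [...prev.progressEvents, update.event]
          : prev.progressEvents,
    };
  });
}
```

Because the function returns new objects rather than mutating in place, Svelte's reactivity picks up each change when the result is assigned back to the store.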

---

## Gitea CI: Common Failure Modes

**Chromium not available in Alpine Docker:**
`vite.config.ts` defines two vitest projects: `client` (browser, needs Chromium) and `server` (Node.js). The Alpine CI image has no Chromium. Always specify:
```bash
npm run test:unit -- --run --project=server
```

**`$env/dynamic/private` throws in Docker build (no `.env`):**
Any code that reads SvelteKit env vars at module import time will throw during Docker `RUN npm test` because there is no `.env` file in the build. Fix: mock the module in affected tests:
```typescript
vi.mock('$env/dynamic/private', () => ({
  env: { OPENAI_BASE_URL: 'http://localhost:11434', OPENAI_MODEL: 'test-model' }
}));
```

**Registry secrets must be set manually in Gitea:**
`REGISTRY_USERNAME` and `REGISTRY_TOKEN` must be created in repo Settings → Actions → Secrets. They are not automatically available.

---

## TypeScript Quirk: Async Callback Closure Narrowing

```typescript
let interceptedCaption: string | null = null;
page.on('response', async () => { interceptedCaption = 'value'; }); // assigned only inside the async callback
// Control-flow analysis ignores assignments made inside callbacks, so in the outer
// scope TypeScript still narrows `interceptedCaption` to `null` (and to `never`
// inside a truthiness check) if no other assignment exists there.
const capturedCaption = interceptedCaption as string | null; // explicit cast required
```

---

## Production Architecture: yt-dlp + Playwright Split

**Current split (as of commit `c9f5300`+):**
- **Playwright** → caption extraction (DOM, always full text)
- **yt-dlp** → thumbnail URL only (fast, no browser overhead)
- Both run **in parallel** in `QueueProcessor.ts`

**Why not yt-dlp for caption?** Both mobile API and GraphQL responses can be truncated even with a valid session. DOM is the only reliable source.

**Why not Playwright for thumbnail?** yt-dlp extracts the thumbnail cleanly and quickly. Playwright-based thumbnail extraction was fragile.

---

## Infrastructure Reference

| Resource | Value |
|---|---|
| App URL | `https://insta-recipe.sal.giize.com` |
| SSH | `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa_ideapad moze@192.168.1.50` |
| Compose file | `/home/moze/Server/stacks/insta-recipe/compose.yaml` |
| Env file | `/home/moze/Server/stacks/insta-recipe/.env` |
| Docker registry | `git.sal.giize.com/mozempk/insta-recipe:latest` |
| Build | `docker buildx build --platform linux/amd64 -t git.sal.giize.com/mozempk/insta-recipe:latest --push .` |
| Deploy | `docker compose pull && docker compose up -d` |
| LLM (internal) | `http://chat_llama-cpp:8080/v1` |
| LLM (external) | `http://192.168.1.50:8080` |
| Current LLM model | `gemma4-e4b-q6k` (via `LLM_MODEL` in `.env`) |
| Auth file (host) | `/home/moze/Server/stacks/insta-recipe/data/secrets/auth.json` |
| Auth file (container) | `/app/secrets/auth.json` |