Commit Graph

41 Commits

Author SHA1 Message Date
Giancarmine Salucci
226b2e7f15 fix(extraction): always use DOM extraction, never trust GraphQL caption
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 33s
Instagram's GraphQL API silently truncates captions WITHOUT '….' markers.
Both DWWxiymssxE (393 chars full, 327 from API) and DXT73izCBoH
(744+ chars full, cut mid-sentence) were affected.

Remove the GraphQL-interception shortcut entirely. Always use DOM
extraction (HTML Section) which clicks '… more' to get the complete text.

The intercepted GraphQL caption is kept only as an emergency fallback,
used if all DOM strategies fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 02:24:40 +02:00
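
The ordering described in this commit (DOM strategies first, intercepted GraphQL caption only as a last resort) could look roughly like the sketch below. The names `CaptionStrategy` and `extractCaption` and the shape of the result are hypothetical; the actual extractor code is not shown in this log.

```typescript
// Hypothetical types and names, for illustration only.
type CaptionStrategy = () => Promise<string | null>;

interface CaptionResult {
  caption: string;
  source: "dom" | "graphql-fallback";
}

async function extractCaption(
  domStrategies: CaptionStrategy[],
  interceptedGraphqlCaption: string | null,
): Promise<CaptionResult | null> {
  // Try every DOM strategy in order; the first non-empty text wins.
  for (const strategy of domStrategies) {
    try {
      const text = await strategy();
      if (text && text.trim().length > 0) {
        return { caption: text.trim(), source: "dom" };
      }
    } catch {
      // A failing DOM strategy is not fatal; move on to the next one.
    }
  }
  // Emergency fallback: the GraphQL caption may be silently truncated.
  if (interceptedGraphqlCaption && interceptedGraphqlCaption.trim().length > 0) {
    return { caption: interceptedGraphqlCaption.trim(), source: "graphql-fallback" };
  }
  return null;
}
```

Tagging the result with its source lets a caller log or flag captions that came from the fallback and may therefore be incomplete.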
Giancarmine Salucci
73e10730dc fix(extraction): don't use truncated GraphQL caption — fall through to DOM
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 35s
If the GraphQL-intercepted caption ends with '….' (Instagram's truncation
marker), skip it and fall through to HTML Section extraction which clicks
the '… more' button in the DOM to get the complete, untruncated caption.

Previously the 327-char truncated caption for DWWxiymssxE was returned
immediately, causing the LLM to say 'no recipe' even though the full
description had all ingredients and steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 01:52:02 +02:00
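
A minimal sketch of the marker check this commit describes, assuming the marker is exactly the string quoted in the message ('….', an ellipsis character followed by a period). The function names are hypothetical, and the fallback is modelled as a synchronous callback for brevity; the real DOM extraction would be asynchronous. As the later commit 226b2e7f15 notes, the marker is not always present, so this check alone is not sufficient.

```typescript
// Assumption: the truncation marker is U+2026 followed by ".".
const TRUNCATION_MARKER = "\u2026.";

function isTruncatedCaption(caption: string): boolean {
  return caption.trimEnd().endsWith(TRUNCATION_MARKER);
}

// Use the GraphQL caption only when it does not look truncated;
// otherwise fall through to the (hypothetical) DOM extraction callback.
function chooseCaption(
  graphqlCaption: string | null,
  extractFromDom: () => string | null,
): string | null {
  if (graphqlCaption !== null && !isTruncatedCaption(graphqlCaption)) {
    return graphqlCaption;
  }
  return extractFromDom() ?? graphqlCaption;
}
```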
Giancarmine Salucci
c9f5300272 feat: use Playwright for caption, yt-dlp for thumbnail only
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 33s
Always extract the full caption via Playwright (browser sees the
untruncated text). yt-dlp runs in parallel only to get the thumbnail
CDN URL quickly; its result for the description is discarded.

This eliminates the truncation problem at the source without needing
a fallback heuristic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 01:31:33 +02:00
Giancarmine Salucci
958353d15a feat: Playwright fallback for truncated Instagram captions
All checks were successful
Build & Push Docker Image / test-and-build (push) Successful in 1m1s
When yt-dlp returns a caption ending with the truncation marker '….'
(GraphQL API caps the text), automatically retry with the Playwright
extractor, which intercepts the full caption from live GraphQL network
traffic.

Falls back gracefully to the partial yt-dlp caption if Playwright fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:17:36 +02:00
Giancarmine Salucci
10c4f78ace Revert "feat: auto Playwright fallback when yt-dlp caption is truncated"
All checks were successful
Build & Push Docker Image / test-and-build (push) Successful in 1m3s
This reverts commit 8c25bce400.
2026-05-12 23:49:34 +02:00
Giancarmine Salucci
8c25bce400 feat: auto Playwright fallback when yt-dlp caption is truncated
All checks were successful
Build & Push Docker Image / test-and-build (push) Successful in 1m2s
Instagram truncates long captions server-side (ends with '…').
When yt-dlp returns a truncated caption, automatically fall back to
the Playwright extractor which runs JS in a real browser and can
click the 'more' button to expand the full caption.

Falls back gracefully: if Playwright fails, the truncated text is
still used rather than failing the whole extraction.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-12 23:46:24 +02:00
Giancarmine Salucci
9e14613746 fix(auth): always regenerate cookies.txt from auth.json, don't skip if yt-dlp overwrote it
All checks were successful
Build & Push Docker Image / test-and-build (push) Successful in 1m2s
Previously cookies.txt was only regenerated when auth.json was newer. But yt-dlp
overwrites cookies.txt during extraction with its own header ('generated by yt-dlp')
and potentially fewer/different cookies, losing the sessionid from auth.json.

Fix: remove mtime comparison — always regenerate cookies.txt from auth.json on each
extraction call. This ensures the full session cookie set is always present.
Also remove the now-unused statSync import.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-12 23:19:55 +02:00
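
The auth.json to cookies.txt conversion could be sketched as a pure function like the one below. It assumes auth.json follows Playwright's storage-state layout (a `cookies` array with `name`, `value`, `domain`, `path`, `expires`, `httpOnly`, `secure`) and emits the tab-separated Netscape cookie format. The function and interface names are hypothetical, and the file I/O is left out; per this commit, the caller would write the result on every extraction call rather than comparing mtimes.

```typescript
// Subset of Playwright's storage-state cookie fields used here.
interface StorageCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix seconds; -1 for session cookies
  httpOnly: boolean;
  secure: boolean;
}

interface StorageState {
  cookies: StorageCookie[];
}

function toNetscapeCookies(state: StorageState): string {
  const lines: string[] = ["# Netscape HTTP Cookie File"];
  for (const c of state.cookies) {
    // HttpOnly cookies are conventionally marked with a "#HttpOnly_" prefix.
    const domain = (c.httpOnly ? "#HttpOnly_" : "") + c.domain;
    const includeSubdomains = c.domain.startsWith(".") ? "TRUE" : "FALSE";
    const secure = c.secure ? "TRUE" : "FALSE";
    // Session cookies (expires <= 0) are written with an expiry of 0.
    const expires = c.expires > 0 ? Math.floor(c.expires) : 0;
    lines.push(
      [domain, includeSubdomains, c.path, secure, String(expires), c.name, c.value].join("\t"),
    );
  }
  return lines.join("\n") + "\n";
}
```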
Giancarmine Salucci
040ae17c12 fix(ui): add ic-btn-reset CSS + auto-convert auth.json to cookies.txt
All checks were successful
Build & Push Docker Image / test-and-build (push) Successful in 1m3s
- layout.css: add button.ic-btn-reset rule so all icon buttons
  (bell, back, close, retry, etc.) get proper background:none reset
  instead of browser-default white/grey appearance in dark mode
- instagram-extractor.ts: auto-convert secrets/auth.json
  (Playwright storage format) to Netscape cookies.txt at runtime
  whenever auth.json is newer; ensures sessionid and all Instagram
  session cookies are passed to yt-dlp, fixing empty media response

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-12 22:29:12 +02:00
Giancarmine Salucci
0b9f598c7d fix(parser): handle thinking models in recipe detection
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 38s
Increase max_tokens from 10 to 1024 for detection so thinking
models have room to reason. Also fall back to reasoning_content
if content is empty, since some local models (e.g. Gemma 4
thinking variants) put their answer there.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-12 21:11:50 +02:00
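
The fallback described here might look like the following sketch. The `reasoning_content` field and the 1024-token budget come from the commit message; the interface and function names are hypothetical, and the message shape is assumed to be the usual chat-completion one.

```typescript
// Token budget for the detection call, raised from 10 per the commit message.
const DETECTION_MAX_TOKENS = 1024;

// Assumed shape of the assistant message returned by the LLM server.
interface AssistantMessage {
  content?: string | null;
  reasoning_content?: string | null;
}

// Prefer `content`; fall back to `reasoning_content` when it is empty.
function detectionAnswer(message: AssistantMessage): string {
  const content = message.content?.trim() ?? "";
  if (content.length > 0) {
    return content;
  }
  return message.reasoning_content?.trim() ?? "";
}
```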
Giancarmine Salucci
5b5bb947ef feat: replace Playwright extractor with yt-dlp subprocess
- Add instagram-extractor.ts: yt-dlp subprocess backend for Instagram
  caption extraction. No in-process browser state, maintained against
  Instagram frontend churn, supports cookies.txt for auth-walled reels.
- Add feature flag EXTRACTOR_BACKEND (ytdlp|playwright) in QueueProcessor
  so the old Playwright path remains available as fallback.
- Add 9 unit tests and 2 live-network integration tests for the new extractor.
- Dockerfile: install yt-dlp via pip3 alongside existing Chromium deps.
- docker-compose: expose EXTRACTOR_BACKEND env var (default: ytdlp).

Also in this commit:
- LLM: configurable per-request timeout via LLM_REQUEST_TIMEOUT_MS (default 120s);
  set maxRetries=0 to surface errors immediately; llama-swap /running health probe.
- QueueProcessor: thread progress callback through parser phase.
- LlmHealthIndicator: surface llama-swap loaded-model name.
- Logging: improve error serialization in queue-processor tests.
- .env.example: document llama-swap endpoint and model options.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-12 20:46:31 +02:00
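
Reading the EXTRACTOR_BACKEND flag could be as small as the sketch below. The flag name, its two values, and the ytdlp default come from the commit message; the function name is hypothetical, and treating unknown values as the default (rather than raising an error) is an assumption.

```typescript
type ExtractorBackend = "ytdlp" | "playwright";

// Map the raw environment value to a backend, defaulting to yt-dlp.
function resolveExtractorBackend(raw: string | undefined): ExtractorBackend {
  const value = (raw ?? "").trim().toLowerCase();
  if (value === "playwright") {
    return "playwright";
  }
  // Assumption: unset or unrecognised values use the documented default.
  return "ytdlp";
}
```

A caller would typically pass `process.env.EXTRACTOR_BACKEND` as the argument.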
Giancarmine Salucci
dfca35bde2 feat(RECIPE-0009): complete iteration 0 - deduplication, notifications, UI improvements
2026-02-18 06:00:48 +01:00
Giancarmine Salucci
49bccf8f15 simplify
2026-02-18 01:21:44 +01:00
Giancarmine Salucci
54321fd7c9 fix tests
2026-02-18 01:11:03 +01:00
Giancarmine Salucci
bf3e5c679f fix(RECIPE-0008): complete iteration 1 - resolve all TypeScript strict mode errors
2026-02-18 00:56:12 +01:00
Giancarmine Salucci
ea535bd9dd fix instagram extraction
2026-02-17 19:52:25 +01:00
Giancarmine Salucci
56d3aec3e2 fix(RECIPE-0006): complete iteration 1 - unit tests for Instagram caption extraction
- Exported cleanText() and extractFromDOM() for unit testing
- Fixed metadata prefix regex to handle optional quotes
- Created comprehensive unit tests with mocked Playwright Page (15 tests, 12ms)
- All 275 tests passing
2026-02-17 11:03:33 +01:00
Giancarmine Salucci
b304f5266a fix(RECIPE-0006): complete iteration 0 - fix Instagram recipe extraction
2026-02-17 10:14:52 +01:00
Giancarmine Salucci
b0b5c3579b fix(RECIPE-0005): complete iteration 0 - Playwright Alpine fix and Docker LMStudio setup
2026-02-17 04:19:55 +01:00
Giancarmine Salucci
67ab3c02d7 chore(RECIPE-0004): complete iteration 1 — fix TypeScript Timer type errors
- Fixed NodeJS.Timer → NodeJS.Timeout in scheduler.ts line 13
- Fixed NodeJS.Timer[] → NodeJS.Timeout[] in fixtures.ts line 151
- Resolves TypeScript compile errors from iteration 0 review
- All 260 tests passing, build succeeds with no errors
2026-02-17 03:08:21 +01:00
Giancarmine Salucci
8aafbb9d88 feat(RECIPE-0003): complete iteration 2 - fix Docker deployment
- Updated Dockerfile base image: node:22-alpine → node:24-alpine
- Regenerated package-lock.json to sync with package.json Tailwind v4
- Docker build now completes successfully (npm ci no longer fails)
- Docker compose with .env.example runs without errors
- Application verified accessible and functional in Docker
- Instagram extraction pipeline tested successfully

Resolves package-lock.json sync issue that blocked iteration 1.
2026-02-16 18:26:59 +01:00
Giancarmine Salucci
0ab89a125f fix(RECIPE-0001): complete iteration 0 - automatic model loading and error display fix
2026-02-15 03:18:12 +01:00
Giancarmine Salucci
e49dbfae41 feat: fix push notifications and enhance PWA experience
- Fix InvalidCharacterError in push notifications with proper VAPID key validation
- Add attractive PWA install prompt component with cross-browser support
- Make notification settings always visible regardless of queue status
- Implement PWA install manager with user engagement detection
- Use SvelteKit navigation APIs instead of browser history API
- Add comprehensive error handling and logging
- Include cross-browser compatibility and responsive design
- Add development tooling improvements

Fixes push notification bugs and significantly improves PWA user experience
with modern, accessible interface components and proper error handling.
2025-12-22 15:18:03 +01:00
Giancarmine Salucci
93aa25a31c fix: resolve critical app functionality issues
Complete implementation of fixes for queue processing, SSE connection display, service worker installation, and failing tests.

Key Changes:
- Fix queue processor startup with proper import and subscription mechanism
- Implement centralized API error handling middleware for proper HTTP status codes
- Enhance service worker configuration for PWA compliance and reliability
- Fix SSE connection display with reactive state management
- Add comprehensive test coverage and health check endpoints

Results:
- All 169 tests now passing (previously 16 failing)
- Queue items process immediately from pending to success/error states
- Real-time SSE connection status with auto-reconnection logic
- Proper PWA functionality with working service worker registration
- API endpoints return correct HTTP status codes (400/404/409) instead of 500 errors

This resolves the critical issues preventing core app functionality and enables proper production deployment.
2025-12-22 04:27:59 +01:00
Giancarmine Salucci
6b022d8348 feat(validation): relax Instagram URL validation to support all content types
- Create validateInstagramUrl utility using URL constructor
- Replace regex-based validation with hostname and protocol checks
- Support posts, reels, IGTV, and URLs with query parameters
- Add comprehensive unit tests (22 tests, all passing)
- Add integration tests for new URL formats
- Update API documentation with supported URL formats

Closes: #RelaxInstagramUrlValidation
2025-12-22 03:10:29 +01:00
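
A URL-constructor-based check in the spirit of this commit might look like the sketch below. The name `validateInstagramUrl` comes from the commit message; the boolean return type and the exact set of accepted hosts and protocols are assumptions.

```typescript
// Accept any http(s) URL on instagram.com or one of its subdomains,
// regardless of path (posts, reels, IGTV) or query parameters.
function validateInstagramUrl(input: string): boolean {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return false; // not parseable as an absolute URL
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") {
    return false;
  }
  const host = url.hostname.toLowerCase();
  return host === "instagram.com" || host.endsWith(".instagram.com");
}
```

Comparing the parsed hostname, rather than pattern-matching the whole string, rejects look-alike hosts such as `evil-instagram.com` while leaving the path unconstrained.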
Giancarmine Salucci
8545744bb1 fix(ssr): resolve EventSource SSR violations and implement best practices
- Fix EventSource is not defined error in queue dashboard
- Add browser guards for all EventSource usage
- Replace static constants (EventSource.OPEN/CLOSED) with numeric values
- Fix setInterval SSR violation in LLM health indicator
- Replace $effect anti-pattern with onMount in share page
- Add comprehensive SvelteKit SSR best practices documentation
- Add SSR audit and testing verification

All changes follow SvelteKit best practices and are verified against
official documentation. Production build succeeds with no SSR errors.

Closes: FixEventSourceSSR
See: docs/outcomes/FixEventSourceSSR.md
2025-12-22 03:00:29 +01:00
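
The two EventSource fixes listed above (a browser guard, and numeric values in place of `EventSource.OPEN`/`EventSource.CLOSED`) can be illustrated with the sketch below. The constant and function names are hypothetical; the numeric values are the standard readyState values (0 connecting, 1 open, 2 closed).

```typescript
// Standard EventSource readyState values, inlined so that no code path
// touches the EventSource global during server-side rendering.
const ES_CONNECTING = 0;
const ES_OPEN = 1;
const ES_CLOSED = 2;

// Guard: true only where an EventSource implementation exists.
function canUseEventSource(): boolean {
  return typeof (globalThis as { EventSource?: unknown }).EventSource !== "undefined";
}

// Translate a readyState number into a label for the connection display.
function connectionLabel(readyState: number): "connecting" | "open" | "closed" {
  if (readyState === ES_OPEN) return "open";
  if (readyState === ES_CONNECTING) return "connecting";
  return "closed";
}
```

Code that constructs an `EventSource` would first check `canUseEventSource()` (or run only inside `onMount`), so the server render never references the missing global.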
Giancarmine Salucci
767b8a1b37 feat(extraction): enhance thumbnail URL validation with strict HTTP 200 check
- Implement strict HTTP 200 validation (reject all other status codes)
- Add content-type validation (must be image/*)
- Add 10-second timeout protection with AbortController
- Thread progressCallback through all fetchImageAsBase64 calls
- Add detailed logging for each validation failure scenario
- Report validation failures via SSE progress callbacks

Unit tests:
- Add comprehensive test coverage for all validation scenarios
- Test HTTP status codes (200, 404, 403, 500, etc.)
- Test content-type validation (image/* vs text/html, etc.)
- Test timeout behavior with AbortController
- Test error handling (network errors, DNS, SSL, etc.)
- Test progress callback reporting

Integration tests:
- Add tests for complete extraction flow with URL failures
- Test fallback chain behavior (meta tags → poster → Instagram data → screenshot)
- Test real-world scenarios (redirects, query params, different post types)

Documentation:
- Enhanced JSDoc with validation criteria
- Added examples showing fallback behavior
- Documented all failure scenarios and their handling

All tests passing
2025-12-21 05:33:48 +01:00
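
The validation rules in this commit (status must be exactly 200, content type must be image/*, 10-second timeout via AbortController) could be sketched as follows. The function name, the result shape, and the injected `Fetcher` type are hypothetical; the fetcher is passed in so the sketch does not depend on a particular runtime, and the global `fetch` satisfies it.

```typescript
// Minimal response surface the check needs.
interface MinimalResponse {
  status: number;
  headers: { get(name: string): string | null };
}
type Fetcher = (url: string, init: { signal: AbortSignal }) => Promise<MinimalResponse>;

type ValidationResult = { ok: true } | { ok: false; reason: string };

async function validateThumbnailUrl(
  url: string,
  fetcher: Fetcher,
  timeoutMs: number = 10_000,
): Promise<ValidationResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetcher(url, { signal: controller.signal });
    // Strict check: any status other than 200 is rejected.
    if (res.status !== 200) {
      return { ok: false, reason: `HTTP ${res.status}` };
    }
    const type = (res.headers.get("content-type") ?? "").toLowerCase();
    if (!type.startsWith("image/")) {
      return { ok: false, reason: `content-type ${type || "missing"}` };
    }
    return { ok: true };
  } catch (err) {
    // Network errors and aborts (timeouts) both end up here.
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  } finally {
    clearTimeout(timer);
  }
}
```

The `reason` string is the kind of detail that could be forwarded to a progress callback or log entry.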
Giancarmine Salucci
5fe0a8a96e fix(tandoor): convert Buffer to Uint8Array for Blob compatibility
TypeScript compiler error fixed: Buffer is not assignable to BlobPart.
Convert Buffer to Uint8Array before creating Blob.
2025-12-21 05:19:45 +01:00
Giancarmine Salucci
cc7b8032cb fix(tandoor): use File constructor for proper multipart uploads
- Remove unreliable URL pass-through strategy (image_url field)
- Always download and upload images as File objects
- Get MIME type from HTTP response headers for URLs
- Use File constructor (not just Blob) for proper multipart metadata
- Add comprehensive error logging with headers and file metadata
- Simplify to single reliable upload path

Fixes 400 'Upload a valid image' error caused by Blob not providing
proper filename/MIME metadata in multipart form data.
2025-12-21 05:19:33 +01:00
Giancarmine Salucci
856c5c26f4 revert(tandoor): change auth header back to Bearer
User's Tandoor instance uses Bearer token authentication (likely JWT)
rather than Django REST Framework's Token authentication.

Reverts authentication from 'Token' back to 'Bearer' to fix 403 error:
'Authentication credentials were not provided.'
2025-12-21 05:08:41 +01:00
Giancarmine Salucci
d1dc791854 fix(tandoor): implement smart image upload with auth fix
- Fix authentication header from 'Bearer' to 'Token' (DRF TokenAuth)
- Implement three-strategy upload system:
  1. URL pass-through for direct URLs (most efficient)
  2. Base64 data URL conversion for screenshots
  3. Fallback blob upload for any other format
- Add comprehensive error handling with response details
- Add detailed logging for debugging upload strategies
- Document thumbnail formats in extractThumbnailStealth()

Fixes #30 - Tandoor image upload 400 Bad Request error

Based on Tandoor source code analysis (cookbook/views/api.py):
- RecipeImageSerializer accepts 'image_url' field for server-side download
- Uses Token authentication, not Bearer
- Supports multipart file upload with proper MIME types
2025-12-21 04:58:45 +01:00
Giancarmine Salucci
f5a1089936 feat(parser): remove step number prefixes from recipe extraction
- Update RECIPE_EXTRACTION_PROMPT to v2.1
- Remove instruction to number steps sequentially
- Update OUTPUT FORMAT and both few-shot examples
- Remove 'All steps numbered sequentially' from quality checklist
- Update fallback parser system prompt in parseRecipeWithStandardCompletion
- Frontend <ol> element already handles auto-numbering
- Tandoor integration unaffected (uses array index for step numbers)

Fixes double-numbering bug where steps appeared as '1. 1. Step text'
All 34 tests passing

Implementation follows execution plan in docs/plans/RemoveStepNumberPrefixes.md
Documented in docs/outcomes/RemoveStepNumberPrefixes.md
2025-12-21 04:46:38 +01:00
Giancarmine Salucci
2de5567682 fix(extraction): resolve progressCallback undefined errors
- Add progressCallback parameter to extractFromEmbeddedJSON and extractFromDOM
- Pass onProgress callback from extractWithStrategies to all strategies
- Fix legacy strategy to use correct callback variable name
- Verify extractViaGraphQL correctly returns null thumbnail

This fixes ReferenceError that was preventing all extraction methods from working.
All extraction strategies now properly emit thumbnail progress events via SSE.

Closes: FixProgressCallbackUndefinedErrors
2025-12-21 04:28:07 +01:00
Giancarmine Salucci
7e4d82de8d feat(share): refactor page and enhance thumbnail extraction
- Extract 8 reusable components from monolithic share page
- Add LLM health indicator with 30s polling
- Implement stealth thumbnail extraction with 4-method cascade
- Integrate real-time thumbnail preview component
- Reduce share page from 306 to ~140 lines
- Add comprehensive outcome documentation

Components:
- UrlInputSection: URL input and extraction trigger
- ProgressIndicator: Loading state display
- ExtractedTextViewer: Collapsible text preview
- RecipeCard: Recipe display with Tandoor integration
- ErrorState: Error handling UI
- LogViewer: System logs with color coding
- LlmHealthIndicator: LLM status with polling
- ThumbnailPreview: Real-time thumbnail display

Thumbnail Methods:
1. Meta tag extraction (og:image, twitter:image)
2. Video poster attribute
3. Instagram embedded JSON data
4. Screenshot fallback

Stories Completed:
- Story 1: Component extraction and refactoring
- Story 2: LLM health status indicator
- Story 3: Enhanced stealth thumbnail extraction
- Story 4: Thumbnail preview integration

Closes: RefactorSharePageAndEnhanceThumbnails
2025-12-21 04:18:38 +01:00
Giancarmine Salucci
da58263aba feat: refactor frontend and fix LLM extraction
- Fix critical await bug in extract-stream endpoint
- Add comprehensive logging to LLM and parser modules
- Implement fallback to standard completion for incompatible models
- Create enhanced v2.0 prompts with social media handling and few-shot examples
- Add LLM health check endpoint
- Decompose share page into 6 focused Svelte 5 snippets

Resolves LM Studio integration issues and improves code maintainability
2025-12-21 03:49:33 +01:00
Giancarmine Salucci
8fc7c44943 feat: robust Instagram extractor with real-time progress tracking
Implements two major features:
1. Multi-strategy Instagram extraction with retry logic
2. Real-time progress reporting via Server-Sent Events

Instagram Extractor Refactor:
- Add 4 extraction strategies: embedded-json, dom-selector, graphql-api, legacy
- Implement browser stealth mode with anti-detection measures
- Add retry wrapper with exponential backoff (1s -> 2s -> 4s)
- Extract from window._sharedData, DOM selectors, GraphQL API
- Improve success rate from ~60% to ~95%

Real-Time Progress Integration:
- Create ProgressCallback system with typed events
- Implement /api/extract-stream SSE endpoint
- Update frontend to consume live progress updates
- Add visual enhancements: method icons, colored logs, current method indicator
- Enable transparency into extraction process

Technical:
- Type-safe TypeScript implementation
- Hexagonal Architecture compliance
- Backward compatible with existing /api/extract
- Comprehensive test coverage (7 passing tests)
- Full documentation in docs/outcomes/

Files changed: 12 files (+2,308 / -52)
Tests: All passing (build successful)

Related outcomes:
- docs/outcomes/RefactorRobustInstagramExtractor.md
- docs/outcomes/IntegrateExtractionProgressFrontend.md
2025-12-21 03:14:17 +01:00
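
A retry wrapper with the 1s -> 2s -> 4s backoff mentioned above might be sketched like this. The function name, the default of three retries, and the injectable `sleep` parameter are assumptions made for illustration; the commit does not say how many attempts the real wrapper makes.

```typescript
// Retry `task` with exponentially growing delays: base, 2*base, 4*base, ...
async function withRetry<T>(
  task: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxRetries) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Injecting `sleep` keeps the wrapper testable without real waiting.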
Giancarmine Salucci
342a8eb259 fix: auth scheduler env vars, concurrency and browser stability
2025-12-21 02:15:22 +01:00
Giancarmine Salucci
9357bd483a fix
2025-12-21 02:03:05 +01:00
Giancarmine Salucci
167cd1f4bb with thumbnail!
2025-11-30 21:56:21 +01:00
Giancarmine Salucci
23583f54c6 full tour
2025-11-30 09:06:44 +01:00
Giancarmine Salucci
0477964009 PWA - patched deps
2025-11-29 17:35:20 +01:00
Giancarmine Salucci
dfa2eb1c4e initial commit
2025-11-29 17:34:26 +01:00