Instagram's GraphQL API silently truncates captions WITHOUT '….' markers.
Both DWWxiymssxE (393 chars full, 327 from API) and DXT73izCBoH
(744+ chars full, cut mid-sentence) were affected.
Remove the GraphQL-interception shortcut entirely. Always use DOM
extraction (HTML Section) which clicks '… more' to get the complete text.
The intercepted GraphQL caption is kept only as emergency fallback if
all DOM strategies fail.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the GraphQL-intercepted caption ends with '….' (Instagram's truncation
marker), skip it and fall through to HTML Section extraction which clicks
the '… more' button in the DOM to get the complete, untruncated caption.
Previously the 327-char truncated caption for DWWxiymssxE was returned
immediately, causing the LLM to say 'no recipe' even though the full
description had all ingredients and steps.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add instagram-extractor.ts: yt-dlp subprocess backend for Instagram
caption extraction. No in-process browser state, maintained against
Instagram frontend churn, supports cookies.txt for auth-walled reels.
- Add feature flag EXTRACTOR_BACKEND (ytdlp|playwright) in QueueProcessor
so the old Playwright path remains available as fallback.
- Add 9 unit tests and 2 live-network integration tests for the new extractor.
- Dockerfile: install yt-dlp via pip3 alongside existing Chromium deps.
- docker-compose: expose EXTRACTOR_BACKEND env var (default: ytdlp).
Also in this commit:
- LLM: configurable per-request timeout via LLM_REQUEST_TIMEOUT_MS (default 120s);
set maxRetries=0 to surface errors immediately; llama-swap /running health probe.
- QueueProcessor: thread progress callback through parser phase.
- LlmHealthIndicator: surface llama-swap loaded-model name.
- Logging: improve error serialization in queue-processor tests.
- .env.example: document llama-swap endpoint and model options.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Exported cleanText() and extractFromDOM() for unit testing
- Fixed metadata prefix regex to handle optional quotes
- Created comprehensive unit tests with mocked Playwright Page (15 tests, 12ms)
- All 275 tests passing
- Fixed NodeJS.Timer → NodeJS.Timeout in scheduler.ts line 13
- Fixed NodeJS.Timer[] → NodeJS.Timeout[] in fixtures.ts line 151
- Resolves TypeScript compile errors from iteration 0 review
- All 260 tests passing, build succeeds with no errors
- Implement strict HTTP 200 validation (reject all other status codes)
- Add content-type validation (must be image/*)
- Add 10-second timeout protection with AbortController
- Thread progressCallback through all fetchImageAsBase64 calls
- Add detailed logging for each validation failure scenario
- Report validation failures via SSE progress callbacks
Unit tests:
- Add comprehensive test coverage for all validation scenarios
- Test HTTP status codes (200, 404, 403, 500, etc.)
- Test content-type validation (image/* vs text/html, etc.)
- Test timeout behavior with AbortController
- Test error handling (network errors, DNS, SSL, etc.)
- Test progress callback reporting
Integration tests:
- Add tests for complete extraction flow with URL failures
- Test fallback chain behavior (meta tags → poster → Instagram data → screenshot)
- Test real-world scenarios (redirects, query params, different post types)
Documentation:
- Enhanced JSDoc with validation criteria
- Added examples showing fallback behavior
- Documented all failure scenarios and their handling
All tests passing ✅
- Fix authentication header from 'Bearer' to 'Token' (DRF TokenAuth)
- Implement three-strategy upload system:
1. URL pass-through for direct URLs (most efficient)
2. Base64 data URL conversion for screenshots
3. Fallback blob upload for any other format
- Add comprehensive error handling with response details
- Add detailed logging for debugging upload strategies
- Document thumbnail formats in extractThumbnailStealth()
Fixes#30 - Tandoor image upload 400 Bad Request error
Based on Tandoor source code analysis (cookbook/views/api.py):
- RecipeImageSerializer accepts 'image_url' field for server-side download
- Uses Token authentication, not Bearer
- Supports multipart file upload with proper MIME types
- Add progressCallback parameter to extractFromEmbeddedJSON and extractFromDOM
- Pass onProgress callback from extractWithStrategies to all strategies
- Fix legacy strategy to use correct callback variable name
- Verify extractViaGraphQL correctly returns null thumbnail
This fixes ReferenceError that was preventing all extraction methods from working.
All extraction strategies now properly emit thumbnail progress events via SSE.
Closes: FixProgressCallbackUndefinedErrors