fix(extraction): don't use truncated GraphQL caption — fall through to DOM
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 35s

If the GraphQL-intercepted caption ends with '….' (Instagram's truncation
marker), skip it and fall through to HTML Section extraction which clicks
the '… more' button in the DOM to get the complete, untruncated caption.

Previously, the 327-char truncated caption for DWWxiymssxE was returned
immediately, so the LLM reported 'no recipe' even though the full
description contained all ingredients and steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Giancarmine Salucci
2026-05-13 01:52:02 +02:00
parent c9f5300272
commit 73e10730dc

@@ -1386,9 +1386,14 @@ export async function extractTextAndThumbnail(
     });
     await page.waitForTimeout(1000);
-    // If we intercepted a full caption, use it immediately
-    if (interceptedCaption) {
-      console.log('[Extractor] Using intercepted caption from network traffic');
+    // Use intercepted GraphQL caption only if it is NOT truncated.
+    // Instagram truncates captions with "…." (U+2026 + "."). If that
+    // marker is present, fall through to HTML Section which will click
+    // "… more" in the DOM and get the complete text.
+    const TRUNCATED = '\u2026.';
+    const capturedCaption = interceptedCaption as string | null;
+    if (capturedCaption && !capturedCaption.trimEnd().endsWith(TRUNCATED)) {
+      console.log('[Extractor] Using intercepted caption from network traffic (not truncated)');
       const thumbnail = await extractThumbnailStealth(page, onProgress);
       onProgress?.({
         type: 'complete',
@@ -1396,7 +1401,12 @@ export async function extractTextAndThumbnail(
         method: 'graphql-intercept',
         timestamp: new Date().toISOString()
       });
-      return { bodyText: cleanText(interceptedCaption), thumbnail };
+      return { bodyText: cleanText(capturedCaption), thumbnail };
+    }
+    if (capturedCaption) {
+      console.log(
+        `[Extractor] GraphQL caption truncated (${capturedCaption.length} chars, ends with "….") — falling through to DOM extraction`
+      );
     }
     const result = await extractWithStrategies(url, page, context, onProgress);
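
The truncation check in this change can be sketched in isolation as below. This is a minimal illustration, not the extractor's actual code: `isTruncatedCaption`, `chooseCaptionSource`, and the `'dom-fallback'` label are hypothetical names introduced here. Only the `'\u2026.'` marker, the `trimEnd().endsWith()` test, and the `'graphql-intercept'` method name are taken from the diff.

```typescript
// Marker taken from the diff: U+2026 (horizontal ellipsis) followed by ".".
const TRUNCATION_MARKER = '\u2026.';

// Hypothetical helper: true when the caption ends with the truncation
// marker, ignoring trailing whitespace (same test as in the diff).
function isTruncatedCaption(caption: string): boolean {
  return caption.trimEnd().endsWith(TRUNCATION_MARKER);
}

// 'graphql-intercept' appears in the diff; 'dom-fallback' is an assumed
// label for the path that falls through to DOM extraction.
type CaptionSource = 'graphql-intercept' | 'dom-fallback';

// Hypothetical helper: decide which extraction path should supply the caption.
function chooseCaptionSource(intercepted: string | null): CaptionSource {
  if (intercepted && !isTruncatedCaption(intercepted)) {
    return 'graphql-intercept';
  }
  // Missing, empty, or truncated caption: defer to the DOM path.
  return 'dom-fallback';
}
```

With this check, a caption that genuinely ends in "…." would also be routed to the DOM path. That appears to be the safer failure mode, since it presumably costs a slower extraction but does not lose text.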