fix(extraction): don't use truncated GraphQL caption — fall through to DOM
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 35s
If the GraphQL-intercepted caption ends with '….' (Instagram's truncation marker), skip it and fall through to HTML Section extraction, which clicks the '… more' button in the DOM to get the complete, untruncated caption.

Previously, the 327-char truncated caption for DWWxiymssxE was returned immediately, causing the LLM to say 'no recipe' even though the full description had all ingredients and steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
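The truncation check described above can be isolated as a small predicate. This is a minimal sketch, not code from the commit: the helper name `isTruncatedCaption` and the constant name `TRUNCATION_MARKER` are hypothetical, and the marker value is taken from the commit's own description (U+2026 followed by a period).

```typescript
// Hypothetical helper (not part of the commit): reports whether a caption
// ends with the truncation marker described in the commit message.
const TRUNCATION_MARKER = '\u2026.'; // U+2026 HORIZONTAL ELLIPSIS followed by "."

function isTruncatedCaption(caption: string): boolean {
  // Ignore trailing whitespace so "….\n" is still treated as truncated.
  return caption.trimEnd().endsWith(TRUNCATION_MARKER);
}
```

Note that a caption ending in a bare ellipsis or in three ASCII dots is not flagged, since the check requires the exact two-character suffix.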
@@ -1386,9 +1386,14 @@ export async function extractTextAndThumbnail(
       });
       await page.waitForTimeout(1000);
 
-      // If we intercepted a full caption, use it immediately
-      if (interceptedCaption) {
-        console.log('[Extractor] Using intercepted caption from network traffic');
+      // Use intercepted GraphQL caption only if it is NOT truncated.
+      // Instagram truncates captions with "…." (U+2026 + "."). If that
+      // marker is present, fall through to HTML Section which will click
+      // "… more" in the DOM and get the complete text.
+      const TRUNCATED = '\u2026.';
+      const capturedCaption = interceptedCaption as string | null;
+      if (capturedCaption && !capturedCaption.trimEnd().endsWith(TRUNCATED)) {
+        console.log('[Extractor] Using intercepted caption from network traffic (not truncated)');
         const thumbnail = await extractThumbnailStealth(page, onProgress);
         onProgress?.({
           type: 'complete',
@@ -1396,7 +1401,12 @@ export async function extractTextAndThumbnail(
           method: 'graphql-intercept',
           timestamp: new Date().toISOString()
         });
-        return { bodyText: cleanText(interceptedCaption), thumbnail };
+        return { bodyText: cleanText(capturedCaption), thumbnail };
+      }
+      if (capturedCaption) {
+        console.log(
+          `[Extractor] GraphQL caption truncated (${capturedCaption.length} chars, ends with "….") — falling through to DOM extraction`
+        );
       }
 
     const result = await extractWithStrategies(url, page, context, onProgress);
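
The control flow introduced by this change can be exercised outside the browser with a stubbed fallback. The following is a rough sketch under stated assumptions: `pickCaption`, `CaptionResult`, and the `'dom-fallback'` label are hypothetical names for illustration, and `extractFromDom` stands in for whatever DOM-based strategy the real extractor runs. Only the `'\u2026.'` marker and the `'graphql-intercept'` method label come from the diff.

```typescript
// Hypothetical result shape; 'dom-fallback' is an illustrative label only.
type CaptionResult = {
  bodyText: string;
  method: 'graphql-intercept' | 'dom-fallback';
};

// Prefer the intercepted caption, but only when it is present and does not
// end with the truncation marker; otherwise defer to the DOM-based fallback.
async function pickCaption(
  intercepted: string | null,
  extractFromDom: () => Promise<string>, // stand-in for the real DOM strategy
): Promise<CaptionResult> {
  const TRUNCATED = '\u2026.';
  if (intercepted && !intercepted.trimEnd().endsWith(TRUNCATED)) {
    return { bodyText: intercepted, method: 'graphql-intercept' };
  }
  // Covers both cases: no intercepted caption, or a truncated one.
  return { bodyText: await extractFromDom(), method: 'dom-fallback' };
}
```

Passing the fallback as a function keeps the decision logic testable without a Playwright page; the stub is only invoked when the intercepted caption is missing or truncated.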