fix(extraction): always use DOM extraction, never trust GraphQL caption
Some checks failed
Build & Push Docker Image / test-and-build (push) Failing after 33s

Instagram's GraphQL API silently truncates captions WITHOUT '….' markers.
Both DWWxiymssxE (393 chars full, 327 from API) and DXT73izCBoH
(744+ chars full, cut mid-sentence) were affected.
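
To illustrate why the marker check cannot catch this case, here is a minimal sketch. `looksTruncated` is a hypothetical helper that mirrors the condition removed by this commit, and the captions are made-up strings, not the text of the affected posts.

```typescript
// The marker the removed check looked for: U+2026 followed by ".".
const TRUNCATED = '\u2026.';

// Hypothetical helper mirroring the removed condition.
function looksTruncated(caption: string): boolean {
  return caption.trimEnd().endsWith(TRUNCATED);
}

// Made-up caption for illustration only.
const fullCaption = 'First sentence of the post. Second sentence that the API drops.';

// A cut with no marker, as described above, and the same cut with the marker.
const silentCut = fullCaption.slice(0, 40);
const markedCut = silentCut + TRUNCATED;

console.log(looksTruncated(markedCut)); // true
console.log(looksTruncated(silentCut)); // false, although the text is incomplete
```

A caption cut without the marker passes the check, so the old shortcut returned it as if it were complete.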

Remove the GraphQL-interception shortcut entirely. Always use DOM
extraction (HTML Section), which clicks '… more' to get the complete text.

The intercepted GraphQL caption is kept only as an emergency fallback
in case all DOM strategies fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Giancarmine Salucci
2026-05-13 02:24:40 +02:00
parent 73e10730dc
commit 226b2e7f15

@@ -1386,32 +1386,29 @@ export async function extractTextAndThumbnail(
   });
   await page.waitForTimeout(1000);
-  // Use intercepted GraphQL caption only if it is NOT truncated.
-  // Instagram truncates captions with "…." (U+2026 + "."). If that
-  // marker is present, fall through to HTML Section which will click
-  // "… more" in the DOM and get the complete text.
-  const TRUNCATED = '\u2026.';
+  // Always use DOM extraction (HTML Section) — it clicks "… more" in
+  // the browser and gets the fully expanded caption. The GraphQL
+  // interception is unreliable: Instagram often truncates captions
+  // in API responses without any "…." marker, so we cannot trust
+  // the intercepted text to be complete.
   const capturedCaption = interceptedCaption as string | null;
-  if (capturedCaption && !capturedCaption.trimEnd().endsWith(TRUNCATED)) {
-    console.log('[Extractor] Using intercepted caption from network traffic (not truncated)');
-    const thumbnail = await extractThumbnailStealth(page, onProgress);
-    onProgress?.({
-      type: 'complete',
-      message: 'Extraction completed via GraphQL interception',
-      method: 'graphql-intercept',
-      timestamp: new Date().toISOString()
-    });
-    return { bodyText: cleanText(capturedCaption), thumbnail };
-  }
   if (capturedCaption) {
     console.log(
-      `[Extractor] GraphQL caption truncated (${capturedCaption.length} chars, ends with "….") — falling through to DOM extraction`
+      `[Extractor] Intercepted GraphQL caption (${capturedCaption.length} chars) — always using DOM extraction for full text`
     );
   }
   const result = await extractWithStrategies(url, page, context, onProgress);
   if (!result.success || !result.data) {
+    // DOM extraction failed — fall back to intercepted caption if available
+    if (capturedCaption) {
+      console.log(
+        '[Extractor] DOM extraction failed — using intercepted GraphQL caption as fallback'
+      );
+      const thumbnail = await extractThumbnailStealth(page, onProgress);
+      return { bodyText: cleanText(capturedCaption), thumbnail };
+    }
     throw new Error(result.error || 'Extraction failed');
   }