fix(extraction): always use DOM extraction, never trust GraphQL caption

Instagram's GraphQL API silently truncates captions WITHOUT '….' markers.
Both DWWxiymssxE (393 chars full, 327 from API) and DXT73izCBoH
(744+ chars full, cut mid-sentence) were affected.
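
A minimal sketch of why the old suffix check could not catch these cases. The `looksTruncated` helper is a hypothetical name that mirrors the condition removed by this commit, and the sample strings are invented for illustration; they are not the real captions or real API output.

```typescript
// Marker the old check relied on: U+2026 (horizontal ellipsis) followed by ".".
const TRUNCATED = '\u2026.';

// Hypothetical helper mirroring the removed suffix check.
function looksTruncated(caption: string): boolean {
  return caption.trimEnd().endsWith(TRUNCATED);
}

// Illustrative strings only (assumption), not real API responses.
const markedCut = 'First sentence. Second sentence starts\u2026.';
const silentCut = 'First sentence. Second sentence starts';

console.log(looksTruncated(markedCut)); // true
console.log(looksTruncated(silentCut)); // false, although the text is incomplete
```

A caption that is cut without the marker is indistinguishable from a complete one under this check, which is the motivation for dropping the shortcut.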

Remove the GraphQL-interception shortcut entirely. Always use DOM
extraction (HTML Section) which clicks '… more' to get the complete text.

The intercepted GraphQL caption is kept only as an emergency fallback
if all DOM strategies fail.
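
The resulting priority order can be sketched as a small pure function. This is an illustration under assumptions, not the code from the diff: `DomResult`, its `text` field, and `chooseCaption` are hypothetical names, and the real return type of `extractWithStrategies` is not shown in this commit.

```typescript
// Hypothetical result shape (assumption); the real type is not visible in the diff.
interface DomResult {
  success: boolean;
  text?: string;
  error?: string;
}

// Prefer the DOM text; use the intercepted caption only when DOM extraction fails.
function chooseCaption(dom: DomResult, intercepted: string | null): string {
  if (dom.success && dom.text) {
    return dom.text;
  }
  if (intercepted) {
    // Emergency fallback: this text may be truncated.
    return intercepted;
  }
  throw new Error(dom.error || 'Extraction failed');
}

// prints: full caption
console.log(chooseCaption({ success: true, text: 'full caption' }, 'cut capt'));
// prints: cut capt
console.log(chooseCaption({ success: false, error: 'timeout' }, 'cut capt'));
```

With this ordering the intercepted caption never overrides a successful DOM result; it only prevents a hard failure when nothing else is available.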

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Giancarmine Salucci
2026-05-13 02:24:40 +02:00
parent 73e10730dc
commit 226b2e7f15


@@ -1386,32 +1386,29 @@ export async function extractTextAndThumbnail(
   });
   await page.waitForTimeout(1000);
-  // Use intercepted GraphQL caption only if it is NOT truncated.
-  // Instagram truncates captions with "…." (U+2026 + "."). If that
-  // marker is present, fall through to HTML Section which will click
-  // "… more" in the DOM and get the complete text.
-  const TRUNCATED = '\u2026.';
+  // Always use DOM extraction (HTML Section) — it clicks "… more" in
+  // the browser and gets the fully expanded caption. The GraphQL
+  // interception is unreliable: Instagram often truncates captions
+  // in API responses without any "…." marker, so we cannot trust
+  // the intercepted text to be complete.
   const capturedCaption = interceptedCaption as string | null;
-  if (capturedCaption && !capturedCaption.trimEnd().endsWith(TRUNCATED)) {
-    console.log('[Extractor] Using intercepted caption from network traffic (not truncated)');
-    const thumbnail = await extractThumbnailStealth(page, onProgress);
-    onProgress?.({
-      type: 'complete',
-      message: 'Extraction completed via GraphQL interception',
-      method: 'graphql-intercept',
-      timestamp: new Date().toISOString()
-    });
-    return { bodyText: cleanText(capturedCaption), thumbnail };
-  }
   if (capturedCaption) {
     console.log(
-      `[Extractor] GraphQL caption truncated (${capturedCaption.length} chars, ends with "….") — falling through to DOM extraction`
+      `[Extractor] Intercepted GraphQL caption (${capturedCaption.length} chars) — always using DOM extraction for full text`
     );
   }
   const result = await extractWithStrategies(url, page, context, onProgress);
   if (!result.success || !result.data) {
+    // DOM extraction failed — fall back to intercepted caption if available
+    if (capturedCaption) {
+      console.log(
+        '[Extractor] DOM extraction failed — using intercepted GraphQL caption as fallback'
+      );
+      const thumbnail = await extractThumbnailStealth(page, onProgress);
+      return { bodyText: cleanText(capturedCaption), thumbnail };
+    }
     throw new Error(result.error || 'Extraction failed');
   }