Files

Giancarmine Salucci 767b8a1b37 feat(extraction): enhance thumbnail URL validation with strict HTTP 200 check

- Implement strict HTTP 200 validation (reject all other status codes)
- Add content-type validation (must be image/*)
- Add 10-second timeout protection with AbortController
- Thread progressCallback through all fetchImageAsBase64 calls
- Add detailed logging for each validation failure scenario
- Report validation failures via SSE progress callbacks

Unit tests:
- Add comprehensive test coverage for all validation scenarios
- Test HTTP status codes (200, 404, 403, 500, etc.)
- Test content-type validation (image/* vs text/html, etc.)
- Test timeout behavior with AbortController
- Test error handling (network errors, DNS, SSL, etc.)
- Test progress callback reporting

Integration tests:
- Add tests for complete extraction flow with URL failures
- Test fallback chain behavior (meta tags → poster → Instagram data → screenshot)
- Test real-world scenarios (redirects, query params, different post types)

Documentation:
- Enhanced JSDoc with validation criteria
- Added examples showing fallback behavior
- Documented all failure scenarios and their handling

All tests passing ✅

2025-12-21 05:33:48 +01:00

26 KiB

Raw Blame History

Execution Plan: Validate Thumbnail URL Status

Created: 2025-12-21
Analyst: GitHub Copilot
Status: Ready for Implementation

Executive Summary

When extracting thumbnails from Instagram posts, the current implementation fetches image URLs and converts them to base64 data URIs. However, the URL validation is insufficient - it only checks response.ok which accepts any 2xx status code. This plan enhances thumbnail URL validation to explicitly require HTTP 200 status, add content-type validation, implement timeouts, and provide detailed progress reporting for debugging and user feedback.

Goal: Ensure thumbnail URL extraction methods fail gracefully and report detailed validation failures, allowing the system to properly fall back through the extraction strategy chain.

Current State Analysis

Existing Implementation

Location: src/lib/server/extraction.ts

Current fetchImageAsBase64 function:

async function fetchImageAsBase64(imageUrl: string): Promise<string | null> {
	try {
		const response = await fetch(imageUrl);
		if (!response.ok) return null;

		const arrayBuffer = await response.arrayBuffer();
		const buffer = Buffer.from(arrayBuffer);
		const contentType = response.headers.get('content-type') || 'image/jpeg';

		return `data:${contentType};base64,${buffer.toString('base64')}`;
	} catch (e) {
		console.error('[Thumbnail] Failed to fetch image:', e);
		return null;
	}
}

Issues:

response.ok accepts 200-299, but 204 No Content or 206 Partial Content are problematic
No explicit status code logging for debugging
No content-type validation (could download non-image data)
No timeout protection (could hang indefinitely)
No progress reporting for failed validations
Generic error logging doesn't distinguish failure types

Extraction Strategy Chain

The extractThumbnailStealth function tries multiple methods:

Meta tags (og:image, twitter:image) → Uses fetchImageAsBase64
Video poster attribute → Uses fetchImageAsBase64
Instagram data structures (display_url, thumbnail_src) → Uses fetchImageAsBase64
Screenshot fallback → Always succeeds with base64

When a URL method fails, it should cleanly return null and continue to the next method. Enhanced validation ensures we don't accept invalid URLs.

Stories

Story 1: Enhance URL Validation in fetchImageAsBase64

Objective: Implement strict HTTP 200 validation, content-type checking, and timeout protection.

Location: src/lib/server/extraction.ts

Technical Specifications:

/**
 * Helper: Fetch image from URL and convert to base64 data URI
 * 
 * Validation criteria:
 * - HTTP status must be exactly 200
 * - Content-Type must start with 'image/'
 * - Request timeout: 10 seconds
 * 
 * @param imageUrl - The image URL to fetch
 * @param progressCallback - Optional callback for progress reporting
 * @returns Base64 data URI or null if validation fails
 */
async function fetchImageAsBase64(
	imageUrl: string,
	progressCallback?: ProgressCallback
): Promise<string | null> {
	try {
		// Create abort controller for timeout
		const controller = new AbortController();
		const timeoutId = setTimeout(() => controller.abort(), 10000); // 10s timeout

		console.log(`[Thumbnail] Validating URL: ${imageUrl}`);

		const response = await fetch(imageUrl, {
			signal: controller.signal
		});

		clearTimeout(timeoutId);

		// Strict status validation: must be exactly 200
		if (response.status !== 200) {
			console.warn(`[Thumbnail] URL validation failed: HTTP ${response.status} for ${imageUrl}`);
			progressCallback?.({
				type: 'status',
				message: `Thumbnail URL returned HTTP ${response.status}, trying next method...`,
				timestamp: new Date().toISOString()
			});
			return null;
		}

		// Validate content-type
		const contentType = response.headers.get('content-type') || '';
		if (!contentType.startsWith('image/')) {
			console.warn(`[Thumbnail] URL validation failed: Invalid content-type '${contentType}' for ${imageUrl}`);
			progressCallback?.({
				type: 'status',
				message: `Thumbnail URL returned non-image content (${contentType}), trying next method...`,
				timestamp: new Date().toISOString()
			});
			return null;
		}

		console.log(`[Thumbnail] URL validation successful: ${imageUrl} (${contentType})`);

		const arrayBuffer = await response.arrayBuffer();
		const buffer = Buffer.from(arrayBuffer);

		const base64Data = `data:${contentType};base64,${buffer.toString('base64')}`;

		progressCallback?.({
			type: 'status',
			message: `Thumbnail fetched and validated from URL`,
			timestamp: new Date().toISOString()
		});

		return base64Data;
	} catch (e) {
		if (e instanceof Error) {
			if (e.name === 'AbortError') {
				console.error(`[Thumbnail] URL fetch timeout: ${imageUrl}`);
				progressCallback?.({
					type: 'status',
					message: `Thumbnail URL fetch timeout, trying next method...`,
					timestamp: new Date().toISOString()
				});
			} else {
				console.error(`[Thumbnail] Failed to fetch image from ${imageUrl}:`, e.message);
				progressCallback?.({
					type: 'status',
					message: `Thumbnail URL fetch failed (${e.message}), trying next method...`,
					timestamp: new Date().toISOString()
				});
			}
		} else {
			console.error('[Thumbnail] Failed to fetch image:', e);
		}
		return null;
	}
}

Changes:

Add progressCallback parameter
Use AbortController for 10-second timeout
Check response.status === 200 explicitly
Validate content-type starts with 'image/'
Add detailed logging for each failure scenario
Report validation progress via callbacks
Clear timeout after successful fetch

Acceptance Criteria:

✅ Only HTTP 200 responses are accepted
✅ Only responses with image/* content-type are accepted
✅ Requests timeout after 10 seconds
✅ Each failure type is logged with specific message
✅ Progress callbacks report validation attempts and failures
✅ Function returns null for any validation failure
✅ Timeout is properly cleared to prevent memory leaks

Dependencies: None

Risk Assessment:

Low Risk: Changes are isolated to helper function
Backwards Compatible: Signature change is additive (optional parameter)
Timeout: 10s might be too short for slow networks, but Instagram CDN is typically fast

Story 2: Thread Progress Callback Through Extraction Methods

Objective: Update all callsites of fetchImageAsBase64 to pass the progressCallback.

Location: src/lib/server/extraction.ts

Technical Specifications:

Update extractThumbnailStealth to pass progressCallback to all fetchImageAsBase64 calls:

async function extractThumbnailStealth(
	page: Page,
	progressCallback?: ProgressCallback
): Promise<string | null> {
	console.log('[Thumbnail] Starting stealth extraction');

	// Method 1: Try meta tags (most stealthy)
	try {
		const ogImage = await page.getAttribute('meta[property="og:image"]', 'content');
		if (ogImage) {
			console.log('[Thumbnail] Found og:image meta tag');
			const imageBuffer = await fetchImageAsBase64(ogImage, progressCallback); // ✅ Pass callback
			if (imageBuffer) {
				if (progressCallback) {
					progressCallback({
						type: 'thumbnail',
						message: 'Thumbnail extracted from meta tags',
						data: { thumbnail: imageBuffer },
						timestamp: new Date().toISOString()
					});
				}
				return imageBuffer;
			}
		}

		const twitterImage = await page.getAttribute('meta[name="twitter:image"]', 'content');
		if (twitterImage) {
			console.log('[Thumbnail] Found twitter:image meta tag');
			const imageBuffer = await fetchImageAsBase64(twitterImage, progressCallback); // ✅ Pass callback
			if (imageBuffer) {
				if (progressCallback) {
					progressCallback({
						type: 'thumbnail',
						message: 'Thumbnail extracted from twitter meta tag',
						data: { thumbnail: imageBuffer },
						timestamp: new Date().toISOString()
					});
				}
				return imageBuffer;
			}
		}
	} catch (e) {
		console.log('[Thumbnail] Meta tag method failed:', e);
	}

	// Method 2: Try video poster attribute
	try {
		const poster = await page.getAttribute('video', 'poster');
		if (poster) {
			console.log('[Thumbnail] Found video poster attribute');
			const imageBuffer = await fetchImageAsBase64(poster, progressCallback); // ✅ Pass callback
			if (imageBuffer) {
				if (progressCallback) {
					progressCallback({
						type: 'thumbnail',
						message: 'Thumbnail extracted from video poster',
						data: { thumbnail: imageBuffer },
						timestamp: new Date().toISOString()
					});
				}
				return imageBuffer;
			}
		}
	} catch (e) {
		console.log('[Thumbnail] Video poster method failed:', e);
	}

	// Method 3: Try Instagram window data structures
	try {
		const thumbnailUrl = await page.evaluate(() => {
			const data = (window as any).__additionalDataLoaded;
			if (data) {
				for (const key in data) {
					const item = data[key];
					if (item?.graphql?.shortcode_media?.display_url) {
						return item.graphql.shortcode_media.display_url;
					}
					if (item?.graphql?.shortcode_media?.thumbnail_src) {
						return item.graphql.shortcode_media.thumbnail_src;
					}
				}
			}
			return null;
		});

		if (thumbnailUrl) {
			console.log('[Thumbnail] Found thumbnail in Instagram data structures');
			const imageBuffer = await fetchImageAsBase64(thumbnailUrl, progressCallback); // ✅ Pass callback
			if (imageBuffer) {
				if (progressCallback) {
					progressCallback({
						type: 'thumbnail',
						message: 'Thumbnail extracted from Instagram data',
						data: { thumbnail: imageBuffer },
						timestamp: new Date().toISOString()
					});
				}
				return imageBuffer;
			}
		}
	} catch (e) {
		console.log('[Thumbnail] Instagram data method failed:', e);
	}

	// Method 4: Screenshot fallback (existing method)
	console.log('[Thumbnail] Falling back to screenshot method');
	const screenshotThumbnail = await extractThumbnailScreenshot(page);
	if (screenshotThumbnail && progressCallback) {
		progressCallback({
			type: 'thumbnail',
			message: 'Thumbnail extracted via screenshot',
			data: { thumbnail: screenshotThumbnail },
			timestamp: new Date().toISOString()
		});
	}
	return screenshotThumbnail;
}

Changes:

Update all 4 fetchImageAsBase64 calls in extractThumbnailStealth
Pass progressCallback parameter to each call
Maintain existing success callbacks

Acceptance Criteria:

✅ All callsites pass progressCallback to fetchImageAsBase64
✅ Frontend receives detailed progress updates via SSE
✅ Users can see which URL methods were tried and why they failed
✅ Existing functionality remains unchanged

Dependencies: Story 1

Risk Assessment:

Low Risk: Simple parameter passing
No Breaking Changes: progressCallback is optional

Story 3: Add Unit Tests for URL Validation

Objective: Test all validation scenarios for fetchImageAsBase64.

Location: src/tests/thumbnail-validation.spec.ts (new file)

Technical Specifications:

import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
import type { ProgressCallback } from '$lib/server/extraction';

// Import the function to test (will need to export it or test through public API)
// For testing purposes, we'll mock fetch

describe('fetchImageAsBase64 URL Validation', () => {
	let originalFetch: typeof globalThis.fetch;
	let mockProgressCallback: ProgressCallback;

	beforeEach(() => {
		originalFetch = globalThis.fetch;
		mockProgressCallback = vi.fn();
	});

	afterEach(() => {
		globalThis.fetch = originalFetch;
	});

	it('should accept HTTP 200 with image content-type', async () => {
		const mockImageData = Buffer.from('fake-image-data');
		globalThis.fetch = vi.fn().mockResolvedValue({
			status: 200,
			headers: {
				get: (name: string) => name === 'content-type' ? 'image/jpeg' : null
			},
			arrayBuffer: async () => mockImageData.buffer
		});

		// Call function and verify result
		// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
		// expect(result).toMatch(/^data:image\/jpeg;base64,/);
		// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
		// 	type: 'status',
		// 	message: expect.stringContaining('validated')
		// }));
	});

	it('should reject HTTP 404 status', async () => {
		globalThis.fetch = vi.fn().mockResolvedValue({
			status: 404,
			headers: { get: () => null }
		});

		// const result = await fetchImageAsBase64('https://example.com/missing.jpg', mockProgressCallback);
		// expect(result).toBeNull();
		// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
		// 	message: expect.stringContaining('404')
		// }));
	});

	it('should reject HTTP 204 No Content', async () => {
		globalThis.fetch = vi.fn().mockResolvedValue({
			status: 204,
			headers: { get: () => null }
		});

		// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
		// expect(result).toBeNull();
	});

	it('should reject non-image content-type', async () => {
		globalThis.fetch = vi.fn().mockResolvedValue({
			status: 200,
			headers: {
				get: (name: string) => name === 'content-type' ? 'text/html' : null
			}
		});

		// const result = await fetchImageAsBase64('https://example.com/page.html', mockProgressCallback);
		// expect(result).toBeNull();
		// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
		// 	message: expect.stringContaining('non-image')
		// }));
	});

	it('should timeout after 10 seconds', async () => {
		globalThis.fetch = vi.fn().mockImplementation(
			() => new Promise((resolve) => {
				setTimeout(() => resolve({ status: 200 }), 15000);
			})
		);

		// const result = await fetchImageAsBase64('https://slow.example.com/image.jpg', mockProgressCallback);
		// expect(result).toBeNull();
		// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
		// 	message: expect.stringContaining('timeout')
		// }));
	}, 12000); // Set test timeout > fetch timeout

	it('should handle network errors gracefully', async () => {
		globalThis.fetch = vi.fn().mockRejectedValue(new Error('Network error'));

		// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
		// expect(result).toBeNull();
		// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
		// 	message: expect.stringContaining('failed')
		// }));
	});

	it('should accept various image content-types', async () => {
		const contentTypes = ['image/jpeg', 'image/png', 'image/gif', 'image/webp', 'image/svg+xml'];
		
		for (const contentType of contentTypes) {
			const mockImageData = Buffer.from('fake-image-data');
			globalThis.fetch = vi.fn().mockResolvedValue({
				status: 200,
				headers: {
					get: (name: string) => name === 'content-type' ? contentType : null
				},
				arrayBuffer: async () => mockImageData.buffer
			});

			// const result = await fetchImageAsBase64(`https://example.com/image`, mockProgressCallback);
			// expect(result).toMatch(new RegExp(`^data:${contentType};base64,`));
		}
	});
});

describe('extractThumbnailStealth fallback chain', () => {
	it('should try all methods and fall back to screenshot', async () => {
		// Mock all URL methods to fail (404)
		// Mock screenshot to succeed
		// Verify screenshot method is called
		// Verify all URL methods were attempted
	});

	it('should stop at first successful URL method', async () => {
		// Mock og:image to return 404
		// Mock twitter:image to return 200
		// Verify video poster is not tried
	});
});

Testing Strategy:

Mock fetch with different responses
Test each validation criterion independently
Test timeout behavior with delayed promises
Test error handling
Test progress callback invocations
Integration test for fallback chain

Acceptance Criteria:

✅ All validation scenarios have test coverage
✅ Tests verify progress callbacks are invoked correctly
✅ Tests verify fallback behavior
✅ Tests run successfully in CI/CD pipeline

Dependencies: Story 1, Story 2

Risk Assessment:

Low Risk: Tests don't affect production code
Coverage: Ensures validation logic works correctly

Story 4: Add Integration Test for Complete Extraction Flow

Objective: Test end-to-end extraction with URL validation failures.

Location: src/tests/extraction-url-validation.integration.spec.ts (new file)

Technical Specifications:

import { describe, it, expect } from 'vitest';
import { extractTextAndThumbnail } from '$lib/server/extraction';

describe('Thumbnail URL Validation Integration', () => {
	it('should fall back to screenshot when all URL methods fail', async () => {
		// This test requires a real Instagram URL or mocked page
		// Test scenario:
		// 1. Mock Instagram page with meta tags pointing to invalid URLs (404)
		// 2. Verify extraction still succeeds with screenshot fallback
		// 3. Verify progress callbacks show URL failures
	});

	it('should use URL method when available and valid', async () => {
		// Test scenario:
		// 1. Mock Instagram page with valid og:image URL
		// 2. Verify thumbnail is fetched from URL (not screenshot)
		// 3. Verify progress shows successful URL fetch
	});

	it('should report detailed progress for URL validation failures', async () => {
		const progressEvents: any[] = [];
		const progressCallback = (event: any) => progressEvents.push(event);

		// Extract from URL with failing meta tag URLs
		// await extractTextAndThumbnail(testUrl, progressCallback);

		// Verify progress events include:
		// - URL validation attempts
		// - HTTP status codes for failures
		// - Fallback to screenshot
		// expect(progressEvents).toContainEqual(
		// 	expect.objectContaining({
		// 		message: expect.stringContaining('HTTP 404')
		// 	})
		// );
	});
});

Acceptance Criteria:

✅ Integration tests validate end-to-end flow
✅ Tests verify fallback behavior in realistic scenarios
✅ Tests confirm progress reporting works correctly
✅ Tests can run in CI with mocked Instagram pages

Dependencies: Story 1, Story 2, Story 3

Risk Assessment:

Medium Risk: Integration tests may require more complex mocking
Maintenance: May need updates when Instagram changes page structure

Story 5: Update Documentation

Objective: Document the enhanced URL validation behavior.

Location:

src/lib/server/extraction.ts (JSDoc)
README.md (if applicable)

Technical Specifications:

Update JSDoc for fetchImageAsBase64:

/**
 * Helper: Fetch image from URL and convert to base64 data URI
 * 
 * **Validation Criteria:**
 * - HTTP status must be exactly 200 (not 2xx, only 200)
 * - Content-Type must start with 'image/' (e.g., image/jpeg, image/png, image/webp)
 * - Request must complete within 10 seconds
 * 
 * **Failure Scenarios:**
 * - Non-200 status → Returns null, reports status code via progress callback
 * - Invalid content-type → Returns null, reports content-type via progress callback
 * - Timeout → Returns null, reports timeout via progress callback
 * - Network error → Returns null, reports error message via progress callback
 * 
 * **Usage in Fallback Chain:**
 * This function is used by `extractThumbnailStealth()` which tries multiple URL sources:
 * 1. Meta tags (og:image, twitter:image)
 * 2. Video poster attribute
 * 3. Instagram data structures (display_url, thumbnail_src)
 * 4. Screenshot fallback (always succeeds)
 * 
 * When this function returns null, extraction continues to the next method.
 * 
 * @param imageUrl - The image URL to fetch (must be HTTPS)
 * @param progressCallback - Optional callback for progress reporting
 * @returns Base64 data URI (data:image/*;base64,...) or null if validation fails
 * 
 * @example
 * ```typescript
 * const thumbnail = await fetchImageAsBase64(
 *   'https://instagram.com/image.jpg',
 *   (event) => console.log(event.message)
 * );
 * 
 * if (thumbnail) {
 *   // thumbnail is a valid base64 data URI
 *   console.log(thumbnail.substring(0, 50)); // "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
 * } else {
 *   // URL validation failed, try next method
 * }
 * ```
 */

Update main extraction documentation:

/**
 * Extract thumbnail from Instagram post using stealth techniques
 * 
 * Tries multiple methods in order of stealth:
 * 1. Meta tags (og:image, twitter:image) - Returns: Direct HTTPS URL → Base64
 * 2. Video poster attribute - Returns: Direct HTTPS URL → Base64
 * 3. Instagram window data structures - Returns: Direct HTTPS URL → Base64
 * 4. Screenshot fallback - Returns: Base64 data URL (data:image/jpeg;base64,...)
 * 
 * **URL Validation (Methods 1-3):**
 * Each URL method validates the image URL before converting to base64:
 * - Requires HTTP 200 status (other 2xx codes are rejected)
 * - Requires image/* content-type
 * - 10-second timeout protection
 * - Detailed progress reporting for debugging
 * 
 * If URL validation fails, extraction continues to the next method.
 * The screenshot fallback (Method 4) always succeeds (barring page errors).
 * 
 * @param page - Playwright page instance
 * @param progressCallback - Optional progress callback for SSE updates
 * @returns Image URL (either direct HTTPS URL converted to base64, or screenshot base64) or null if all methods fail
 */

Acceptance Criteria:

✅ JSDoc clearly explains validation criteria
✅ Documentation includes failure scenarios
✅ Examples show how validation works
✅ Developers understand why strict validation is important

Dependencies: Story 1, Story 2

Risk Assessment:

No Risk: Documentation only

Technical Dependencies

External Dependencies

Node.js fetch API: Built-in (Node 18+)
AbortController: Built-in (Node 15+)
Buffer: Built-in Node.js module

Internal Dependencies

src/lib/server/extraction.ts: Main extraction logic
ProgressCallback type: Existing type for SSE reporting
Playwright Page type: For extraction methods

Testing Strategy

Unit Tests

Mock fetch with different HTTP status codes
Mock content-type headers
Test timeout behavior
Verify progress callback invocations

Integration Tests

Test complete extraction flow with failing URLs
Verify fallback chain works correctly
Test with realistic Instagram-like pages

Manual Testing

Test with real Instagram URLs
Monitor SSE progress updates in frontend
Verify logs show detailed failure information

Rollout Plan

Phase 1: Core Validation Enhancement (Story 1)

Implement enhanced fetchImageAsBase64
Add timeout, status check, content-type validation
Deploy to development environment
Monitor logs for validation failures

Phase 2: Progress Reporting (Story 2)

Thread progress callback through extraction methods
Test SSE updates in frontend
Verify user sees helpful error messages

Phase 3: Testing & Documentation (Stories 3-5)

Add comprehensive test coverage
Update documentation
Prepare for production deployment

Phase 4: Production Deployment

Deploy to production
Monitor extraction success rates
Analyze which URL methods succeed/fail
Adjust timeout if needed based on metrics

Success Metrics

Validation Accuracy

✅ 0% false positives (valid URLs rejected)
✅ 100% invalid URLs detected (404, non-image, etc.)
✅ Fallback chain works in all scenarios

Performance

✅ URL validation adds < 500ms to extraction time
✅ Timeout prevents hanging requests
✅ No memory leaks from uncleaned timeouts

User Experience

✅ Frontend shows detailed progress for URL validation
✅ Users understand why certain methods failed
✅ Extraction still succeeds even when URLs are invalid

Observability

✅ Logs show HTTP status codes for failed URLs
✅ Logs distinguish between timeout, network error, invalid status
✅ Metrics track URL validation success rate per method

Risk Mitigation

Risk: Instagram CDN Blocks Validation Requests

Likelihood: Low
Impact: Medium
Mitigation:

Monitor HTTP status codes in production
If 403/429 errors increase, consider adding user-agent headers
May need to use browser context for fetching (more stealthy)

Risk: Timeout Too Short for Slow Networks

Likelihood: Medium
Impact: Low
Mitigation:

Start with 10s timeout
Monitor timeout frequency in logs
Adjust to 15s if needed based on data
Screenshot fallback ensures extraction still succeeds

Risk: Content-Type Header Missing or Incorrect

Likelihood: Low
Impact: Low
Mitigation:

Default to 'image/jpeg' when content-type is empty
Consider checking file extension as secondary validation
Rely on arrayBuffer() to fail for truly non-image data

Appendix: Validation Flow Diagram

extractThumbnailStealth()
│
├─ Method 1: Meta Tags (og:image, twitter:image)
│  ├─ Find URL in page
│  ├─ Call fetchImageAsBase64(url, callback)
│  │  ├─ Fetch with 10s timeout
│  │  ├─ Check status === 200 ❌ → return null → Try Method 2
│  │  ├─ Check content-type startsWith('image/') ❌ → return null → Try Method 2
│  │  └─ Convert to base64 ✅ → return base64 → SUCCESS
│  └─ If null, continue to Method 2
│
├─ Method 2: Video Poster Attribute
│  ├─ Find poster URL
│  ├─ Call fetchImageAsBase64(url, callback)
│  │  └─ [same validation as Method 1]
│  └─ If null, continue to Method 3
│
├─ Method 3: Instagram Data Structures
│  ├─ Extract display_url or thumbnail_src
│  ├─ Call fetchImageAsBase64(url, callback)
│  │  └─ [same validation as Method 1]
│  └─ If null, continue to Method 4
│
└─ Method 4: Screenshot Fallback
   └─ extractThumbnailScreenshot(page)
      └─ Always returns base64 (or null on page error)

Key Points:

Each URL method independently validates before converting to base64
Validation failures return null and trigger next method
Progress callbacks report each validation attempt and failure
Screenshot fallback ensures extraction succeeds even if all URLs fail

Conclusion

This plan enhances thumbnail URL validation to be more robust, observable, and user-friendly. By implementing strict HTTP 200 validation, content-type checking, and timeout protection, we ensure that the extraction system only accepts valid image URLs and gracefully falls back when URLs are invalid. The detailed progress reporting helps with debugging and provides transparency to users about the extraction process.

Implementation Priority: Medium-High
Estimated Effort: 2-3 days
Complexity: Medium

26 KiB Raw Blame History

Execution Plan: Validate Thumbnail URL Status

Executive Summary

Current State Analysis

Existing Implementation

Extraction Strategy Chain

Stories

Story 1: Enhance URL Validation in fetchImageAsBase64

Story 2: Thread Progress Callback Through Extraction Methods

Story 3: Add Unit Tests for URL Validation

Story 4: Add Integration Test for Complete Extraction Flow

Story 5: Update Documentation

Technical Dependencies

External Dependencies

Internal Dependencies

Testing Strategy

Unit Tests

Integration Tests

Manual Testing

Rollout Plan

Phase 1: Core Validation Enhancement (Story 1)

Phase 2: Progress Reporting (Story 2)

Phase 3: Testing & Documentation (Stories 3-5)

Phase 4: Production Deployment

Success Metrics

Validation Accuracy

Performance

User Experience

Observability

Risk Mitigation

Risk: Instagram CDN Blocks Validation Requests

Risk: Timeout Too Short for Slow Networks

Risk: Content-Type Header Missing or Incorrect

Appendix: Validation Flow Diagram

Conclusion

26 KiB

Raw Blame History