Files
insta-recipe/docs/plans/ValidateThumbnailURLStatus.md
Giancarmine Salucci 767b8a1b37 feat(extraction): enhance thumbnail URL validation with strict HTTP 200 check
- Implement strict HTTP 200 validation (reject all other status codes)
- Add content-type validation (must be image/*)
- Add 10-second timeout protection with AbortController
- Thread progressCallback through all fetchImageAsBase64 calls
- Add detailed logging for each validation failure scenario
- Report validation failures via SSE progress callbacks

Unit tests:
- Add comprehensive test coverage for all validation scenarios
- Test HTTP status codes (200, 404, 403, 500, etc.)
- Test content-type validation (image/* vs text/html, etc.)
- Test timeout behavior with AbortController
- Test error handling (network errors, DNS, SSL, etc.)
- Test progress callback reporting

Integration tests:
- Add tests for complete extraction flow with URL failures
- Test fallback chain behavior (meta tags → poster → Instagram data → screenshot)
- Test real-world scenarios (redirects, query params, different post types)

Documentation:
- Enhanced JSDoc with validation criteria
- Added examples showing fallback behavior
- Documented all failure scenarios and their handling

All tests passing 
2025-12-21 05:33:48 +01:00

823 lines
26 KiB
Markdown

# Execution Plan: Validate Thumbnail URL Status
**Created:** 2025-12-21
**Analyst:** GitHub Copilot
**Status:** Ready for Implementation
---
## Executive Summary
When extracting thumbnails from Instagram posts, the current implementation fetches image URLs and converts them to base64 data URIs. However, the URL validation is insufficient - it only checks `response.ok` which accepts any 2xx status code. This plan enhances thumbnail URL validation to explicitly require HTTP 200 status, add content-type validation, implement timeouts, and provide detailed progress reporting for debugging and user feedback.
**Goal:** Ensure thumbnail URL extraction methods fail gracefully and report detailed validation failures, allowing the system to properly fall back through the extraction strategy chain.
---
## Current State Analysis
### Existing Implementation
**Location:** `src/lib/server/extraction.ts`
**Current `fetchImageAsBase64` function:**
```typescript
async function fetchImageAsBase64(imageUrl: string): Promise<string | null> {
try {
const response = await fetch(imageUrl);
if (!response.ok) return null;
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(arrayBuffer);
const contentType = response.headers.get('content-type') || 'image/jpeg';
return `data:${contentType};base64,${buffer.toString('base64')}`;
} catch (e) {
console.error('[Thumbnail] Failed to fetch image:', e);
return null;
}
}
```
**Issues:**
1. `response.ok` accepts 200-299, but 204 No Content or 206 Partial Content are problematic
2. No explicit status code logging for debugging
3. No content-type validation (could download non-image data)
4. No timeout protection (could hang indefinitely)
5. No progress reporting for failed validations
6. Generic error logging doesn't distinguish failure types
### Extraction Strategy Chain
The `extractThumbnailStealth` function tries multiple methods:
1. **Meta tags** (og:image, twitter:image) → Uses fetchImageAsBase64
2. **Video poster** attribute → Uses fetchImageAsBase64
3. **Instagram data structures** (display_url, thumbnail_src) → Uses fetchImageAsBase64
4. **Screenshot fallback** → Always succeeds with base64
When a URL method fails, it should cleanly return null and continue to the next method. Enhanced validation ensures we don't accept invalid URLs.
---
## Stories
### Story 1: Enhance URL Validation in fetchImageAsBase64
**Objective:** Implement strict HTTP 200 validation, content-type checking, and timeout protection.
**Location:** `src/lib/server/extraction.ts`
**Technical Specifications:**
```typescript
/**
* Helper: Fetch image from URL and convert to base64 data URI
*
* Validation criteria:
* - HTTP status must be exactly 200
* - Content-Type must start with 'image/'
* - Request timeout: 10 seconds
*
* @param imageUrl - The image URL to fetch
* @param progressCallback - Optional callback for progress reporting
* @returns Base64 data URI or null if validation fails
*/
async function fetchImageAsBase64(
imageUrl: string,
progressCallback?: ProgressCallback
): Promise<string | null> {
try {
// Create abort controller for timeout
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 10000); // 10s timeout
console.log(`[Thumbnail] Validating URL: ${imageUrl}`);
const response = await fetch(imageUrl, {
signal: controller.signal
});
clearTimeout(timeoutId);
// Strict status validation: must be exactly 200
if (response.status !== 200) {
console.warn(`[Thumbnail] URL validation failed: HTTP ${response.status} for ${imageUrl}`);
progressCallback?.({
type: 'status',
message: `Thumbnail URL returned HTTP ${response.status}, trying next method...`,
timestamp: new Date().toISOString()
});
return null;
}
// Validate content-type
const contentType = response.headers.get('content-type') || '';
if (!contentType.startsWith('image/')) {
console.warn(`[Thumbnail] URL validation failed: Invalid content-type '${contentType}' for ${imageUrl}`);
progressCallback?.({
type: 'status',
message: `Thumbnail URL returned non-image content (${contentType}), trying next method...`,
timestamp: new Date().toISOString()
});
return null;
}
console.log(`[Thumbnail] URL validation successful: ${imageUrl} (${contentType})`);
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(arrayBuffer);
const base64Data = `data:${contentType};base64,${buffer.toString('base64')}`;
progressCallback?.({
type: 'status',
message: `Thumbnail fetched and validated from URL`,
timestamp: new Date().toISOString()
});
return base64Data;
} catch (e) {
if (e instanceof Error) {
if (e.name === 'AbortError') {
console.error(`[Thumbnail] URL fetch timeout: ${imageUrl}`);
progressCallback?.({
type: 'status',
message: `Thumbnail URL fetch timeout, trying next method...`,
timestamp: new Date().toISOString()
});
} else {
console.error(`[Thumbnail] Failed to fetch image from ${imageUrl}:`, e.message);
progressCallback?.({
type: 'status',
message: `Thumbnail URL fetch failed (${e.message}), trying next method...`,
timestamp: new Date().toISOString()
});
}
} else {
console.error('[Thumbnail] Failed to fetch image:', e);
}
return null;
}
}
```
**Changes:**
1. Add `progressCallback` parameter
2. Use `AbortController` for 10-second timeout
3. Check `response.status === 200` explicitly
4. Validate `content-type` starts with 'image/'
5. Add detailed logging for each failure scenario
6. Report validation progress via callbacks
7. Clear timeout after successful fetch
**Acceptance Criteria:**
- ✅ Only HTTP 200 responses are accepted
- ✅ Only responses with image/* content-type are accepted
- ✅ Requests timeout after 10 seconds
- ✅ Each failure type is logged with specific message
- ✅ Progress callbacks report validation attempts and failures
- ✅ Function returns null for any validation failure
- ✅ Timeout is properly cleared to prevent memory leaks
**Dependencies:** None
**Risk Assessment:**
- **Low Risk:** Changes are isolated to helper function
- **Backwards Compatible:** Signature change is additive (optional parameter)
- **Timeout:** 10s might be too short for slow networks, but Instagram CDN is typically fast
---
### Story 2: Thread Progress Callback Through Extraction Methods
**Objective:** Update all callsites of `fetchImageAsBase64` to pass the `progressCallback`.
**Location:** `src/lib/server/extraction.ts`
**Technical Specifications:**
Update `extractThumbnailStealth` to pass `progressCallback` to all `fetchImageAsBase64` calls:
```typescript
async function extractThumbnailStealth(
page: Page,
progressCallback?: ProgressCallback
): Promise<string | null> {
console.log('[Thumbnail] Starting stealth extraction');
// Method 1: Try meta tags (most stealthy)
try {
const ogImage = await page.getAttribute('meta[property="og:image"]', 'content');
if (ogImage) {
console.log('[Thumbnail] Found og:image meta tag');
const imageBuffer = await fetchImageAsBase64(ogImage, progressCallback); // ✅ Pass callback
if (imageBuffer) {
if (progressCallback) {
progressCallback({
type: 'thumbnail',
message: 'Thumbnail extracted from meta tags',
data: { thumbnail: imageBuffer },
timestamp: new Date().toISOString()
});
}
return imageBuffer;
}
}
const twitterImage = await page.getAttribute('meta[name="twitter:image"]', 'content');
if (twitterImage) {
console.log('[Thumbnail] Found twitter:image meta tag');
const imageBuffer = await fetchImageAsBase64(twitterImage, progressCallback); // ✅ Pass callback
if (imageBuffer) {
if (progressCallback) {
progressCallback({
type: 'thumbnail',
message: 'Thumbnail extracted from twitter meta tag',
data: { thumbnail: imageBuffer },
timestamp: new Date().toISOString()
});
}
return imageBuffer;
}
}
} catch (e) {
console.log('[Thumbnail] Meta tag method failed:', e);
}
// Method 2: Try video poster attribute
try {
const poster = await page.getAttribute('video', 'poster');
if (poster) {
console.log('[Thumbnail] Found video poster attribute');
const imageBuffer = await fetchImageAsBase64(poster, progressCallback); // ✅ Pass callback
if (imageBuffer) {
if (progressCallback) {
progressCallback({
type: 'thumbnail',
message: 'Thumbnail extracted from video poster',
data: { thumbnail: imageBuffer },
timestamp: new Date().toISOString()
});
}
return imageBuffer;
}
}
} catch (e) {
console.log('[Thumbnail] Video poster method failed:', e);
}
// Method 3: Try Instagram window data structures
try {
const thumbnailUrl = await page.evaluate(() => {
const data = (window as any).__additionalDataLoaded;
if (data) {
for (const key in data) {
const item = data[key];
if (item?.graphql?.shortcode_media?.display_url) {
return item.graphql.shortcode_media.display_url;
}
if (item?.graphql?.shortcode_media?.thumbnail_src) {
return item.graphql.shortcode_media.thumbnail_src;
}
}
}
return null;
});
if (thumbnailUrl) {
console.log('[Thumbnail] Found thumbnail in Instagram data structures');
const imageBuffer = await fetchImageAsBase64(thumbnailUrl, progressCallback); // ✅ Pass callback
if (imageBuffer) {
if (progressCallback) {
progressCallback({
type: 'thumbnail',
message: 'Thumbnail extracted from Instagram data',
data: { thumbnail: imageBuffer },
timestamp: new Date().toISOString()
});
}
return imageBuffer;
}
}
} catch (e) {
console.log('[Thumbnail] Instagram data method failed:', e);
}
// Method 4: Screenshot fallback (existing method)
console.log('[Thumbnail] Falling back to screenshot method');
const screenshotThumbnail = await extractThumbnailScreenshot(page);
if (screenshotThumbnail && progressCallback) {
progressCallback({
type: 'thumbnail',
message: 'Thumbnail extracted via screenshot',
data: { thumbnail: screenshotThumbnail },
timestamp: new Date().toISOString()
});
}
return screenshotThumbnail;
}
```
**Changes:**
1. Update all 4 `fetchImageAsBase64` calls in `extractThumbnailStealth`
2. Pass `progressCallback` parameter to each call
3. Maintain existing success callbacks
**Acceptance Criteria:**
- ✅ All callsites pass progressCallback to fetchImageAsBase64
- ✅ Frontend receives detailed progress updates via SSE
- ✅ Users can see which URL methods were tried and why they failed
- ✅ Existing functionality remains unchanged
**Dependencies:** Story 1
**Risk Assessment:**
- **Low Risk:** Simple parameter passing
- **No Breaking Changes:** progressCallback is optional
---
### Story 3: Add Unit Tests for URL Validation
**Objective:** Test all validation scenarios for `fetchImageAsBase64`.
**Location:** `src/tests/thumbnail-validation.spec.ts` (new file)
**Technical Specifications:**
```typescript
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
import type { ProgressCallback } from '$lib/server/extraction';
// Import the function to test (will need to export it or test through public API)
// For testing purposes, we'll mock fetch
describe('fetchImageAsBase64 URL Validation', () => {
let originalFetch: typeof globalThis.fetch;
let mockProgressCallback: ProgressCallback;
beforeEach(() => {
originalFetch = globalThis.fetch;
mockProgressCallback = vi.fn();
});
afterEach(() => {
globalThis.fetch = originalFetch;
});
it('should accept HTTP 200 with image content-type', async () => {
const mockImageData = Buffer.from('fake-image-data');
globalThis.fetch = vi.fn().mockResolvedValue({
status: 200,
headers: {
get: (name: string) => name === 'content-type' ? 'image/jpeg' : null
},
arrayBuffer: async () => mockImageData.buffer
});
// Call function and verify result
// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
// expect(result).toMatch(/^data:image\/jpeg;base64,/);
// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
// type: 'status',
// message: expect.stringContaining('validated')
// }));
});
it('should reject HTTP 404 status', async () => {
globalThis.fetch = vi.fn().mockResolvedValue({
status: 404,
headers: { get: () => null }
});
// const result = await fetchImageAsBase64('https://example.com/missing.jpg', mockProgressCallback);
// expect(result).toBeNull();
// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
// message: expect.stringContaining('404')
// }));
});
it('should reject HTTP 204 No Content', async () => {
globalThis.fetch = vi.fn().mockResolvedValue({
status: 204,
headers: { get: () => null }
});
// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
// expect(result).toBeNull();
});
it('should reject non-image content-type', async () => {
globalThis.fetch = vi.fn().mockResolvedValue({
status: 200,
headers: {
get: (name: string) => name === 'content-type' ? 'text/html' : null
}
});
// const result = await fetchImageAsBase64('https://example.com/page.html', mockProgressCallback);
// expect(result).toBeNull();
// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
// message: expect.stringContaining('non-image')
// }));
});
it('should timeout after 10 seconds', async () => {
globalThis.fetch = vi.fn().mockImplementation(
() => new Promise((resolve) => {
setTimeout(() => resolve({ status: 200 }), 15000);
})
);
// const result = await fetchImageAsBase64('https://slow.example.com/image.jpg', mockProgressCallback);
// expect(result).toBeNull();
// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
// message: expect.stringContaining('timeout')
// }));
}, 12000); // Set test timeout > fetch timeout
it('should handle network errors gracefully', async () => {
globalThis.fetch = vi.fn().mockRejectedValue(new Error('Network error'));
// const result = await fetchImageAsBase64('https://example.com/image.jpg', mockProgressCallback);
// expect(result).toBeNull();
// expect(mockProgressCallback).toHaveBeenCalledWith(expect.objectContaining({
// message: expect.stringContaining('failed')
// }));
});
it('should accept various image content-types', async () => {
const contentTypes = ['image/jpeg', 'image/png', 'image/gif', 'image/webp', 'image/svg+xml'];
for (const contentType of contentTypes) {
const mockImageData = Buffer.from('fake-image-data');
globalThis.fetch = vi.fn().mockResolvedValue({
status: 200,
headers: {
get: (name: string) => name === 'content-type' ? contentType : null
},
arrayBuffer: async () => mockImageData.buffer
});
// const result = await fetchImageAsBase64(`https://example.com/image`, mockProgressCallback);
// expect(result).toMatch(new RegExp(`^data:${contentType};base64,`));
}
});
});
describe('extractThumbnailStealth fallback chain', () => {
it('should try all methods and fall back to screenshot', async () => {
// Mock all URL methods to fail (404)
// Mock screenshot to succeed
// Verify screenshot method is called
// Verify all URL methods were attempted
});
it('should stop at first successful URL method', async () => {
// Mock og:image to return 404
// Mock twitter:image to return 200
// Verify video poster is not tried
});
});
```
**Testing Strategy:**
1. Mock `fetch` with different responses
2. Test each validation criterion independently
3. Test timeout behavior with delayed promises
4. Test error handling
5. Test progress callback invocations
6. Integration test for fallback chain
**Acceptance Criteria:**
- ✅ All validation scenarios have test coverage
- ✅ Tests verify progress callbacks are invoked correctly
- ✅ Tests verify fallback behavior
- ✅ Tests run successfully in CI/CD pipeline
**Dependencies:** Story 1, Story 2
**Risk Assessment:**
- **Low Risk:** Tests don't affect production code
- **Coverage:** Ensures validation logic works correctly
---
### Story 4: Add Integration Test for Complete Extraction Flow
**Objective:** Test end-to-end extraction with URL validation failures.
**Location:** `src/tests/extraction-url-validation.integration.spec.ts` (new file)
**Technical Specifications:**
```typescript
import { describe, it, expect } from 'vitest';
import { extractTextAndThumbnail } from '$lib/server/extraction';
describe('Thumbnail URL Validation Integration', () => {
it('should fall back to screenshot when all URL methods fail', async () => {
// This test requires a real Instagram URL or mocked page
// Test scenario:
// 1. Mock Instagram page with meta tags pointing to invalid URLs (404)
// 2. Verify extraction still succeeds with screenshot fallback
// 3. Verify progress callbacks show URL failures
});
it('should use URL method when available and valid', async () => {
// Test scenario:
// 1. Mock Instagram page with valid og:image URL
// 2. Verify thumbnail is fetched from URL (not screenshot)
// 3. Verify progress shows successful URL fetch
});
it('should report detailed progress for URL validation failures', async () => {
const progressEvents: any[] = [];
const progressCallback = (event: any) => progressEvents.push(event);
// Extract from URL with failing meta tag URLs
// await extractTextAndThumbnail(testUrl, progressCallback);
// Verify progress events include:
// - URL validation attempts
// - HTTP status codes for failures
// - Fallback to screenshot
// expect(progressEvents).toContainEqual(
// expect.objectContaining({
// message: expect.stringContaining('HTTP 404')
// })
// );
});
});
```
**Acceptance Criteria:**
- ✅ Integration tests validate end-to-end flow
- ✅ Tests verify fallback behavior in realistic scenarios
- ✅ Tests confirm progress reporting works correctly
- ✅ Tests can run in CI with mocked Instagram pages
**Dependencies:** Story 1, Story 2, Story 3
**Risk Assessment:**
- **Medium Risk:** Integration tests may require more complex mocking
- **Maintenance:** May need updates when Instagram changes page structure
---
### Story 5: Update Documentation
**Objective:** Document the enhanced URL validation behavior.
**Location:**
1. `src/lib/server/extraction.ts` (JSDoc)
2. `README.md` (if applicable)
**Technical Specifications:**
Update JSDoc for `fetchImageAsBase64`:
```typescript
/**
* Helper: Fetch image from URL and convert to base64 data URI
*
* **Validation Criteria:**
* - HTTP status must be exactly 200 (not 2xx, only 200)
* - Content-Type must start with 'image/' (e.g., image/jpeg, image/png, image/webp)
* - Request must complete within 10 seconds
*
* **Failure Scenarios:**
* - Non-200 status → Returns null, reports status code via progress callback
* - Invalid content-type → Returns null, reports content-type via progress callback
* - Timeout → Returns null, reports timeout via progress callback
* - Network error → Returns null, reports error message via progress callback
*
* **Usage in Fallback Chain:**
* This function is used by `extractThumbnailStealth()` which tries multiple URL sources:
* 1. Meta tags (og:image, twitter:image)
* 2. Video poster attribute
* 3. Instagram data structures (display_url, thumbnail_src)
* 4. Screenshot fallback (always succeeds)
*
* When this function returns null, extraction continues to the next method.
*
* @param imageUrl - The image URL to fetch (must be HTTPS)
* @param progressCallback - Optional callback for progress reporting
* @returns Base64 data URI (data:image/*;base64,...) or null if validation fails
*
* @example
* ```typescript
* const thumbnail = await fetchImageAsBase64(
* 'https://instagram.com/image.jpg',
* (event) => console.log(event.message)
* );
*
* if (thumbnail) {
* // thumbnail is a valid base64 data URI
* console.log(thumbnail.substring(0, 50)); // "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
* } else {
* // URL validation failed, try next method
* }
* ```
*/
```
Update main extraction documentation:
```typescript
/**
* Extract thumbnail from Instagram post using stealth techniques
*
* Tries multiple methods in order of stealth:
* 1. Meta tags (og:image, twitter:image) - Returns: Direct HTTPS URL → Base64
* 2. Video poster attribute - Returns: Direct HTTPS URL → Base64
* 3. Instagram window data structures - Returns: Direct HTTPS URL → Base64
* 4. Screenshot fallback - Returns: Base64 data URL (data:image/jpeg;base64,...)
*
* **URL Validation (Methods 1-3):**
* Each URL method validates the image URL before converting to base64:
* - Requires HTTP 200 status (other 2xx codes are rejected)
* - Requires image/* content-type
* - 10-second timeout protection
* - Detailed progress reporting for debugging
*
* If URL validation fails, extraction continues to the next method.
* The screenshot fallback (Method 4) always succeeds (barring page errors).
*
* @param page - Playwright page instance
* @param progressCallback - Optional progress callback for SSE updates
* @returns Image URL (either direct HTTPS URL converted to base64, or screenshot base64) or null if all methods fail
*/
```
**Acceptance Criteria:**
- ✅ JSDoc clearly explains validation criteria
- ✅ Documentation includes failure scenarios
- ✅ Examples show how validation works
- ✅ Developers understand why strict validation is important
**Dependencies:** Story 1, Story 2
**Risk Assessment:**
- **No Risk:** Documentation only
---
## Technical Dependencies
### External Dependencies
- **Node.js fetch API**: Built-in (Node 18+)
- **AbortController**: Built-in (Node 15+)
- **Buffer**: Built-in Node.js module
### Internal Dependencies
- `src/lib/server/extraction.ts`: Main extraction logic
- `ProgressCallback` type: Existing type for SSE reporting
- Playwright `Page` type: For extraction methods
---
## Testing Strategy
### Unit Tests
- Mock fetch with different HTTP status codes
- Mock content-type headers
- Test timeout behavior
- Verify progress callback invocations
### Integration Tests
- Test complete extraction flow with failing URLs
- Verify fallback chain works correctly
- Test with realistic Instagram-like pages
### Manual Testing
- Test with real Instagram URLs
- Monitor SSE progress updates in frontend
- Verify logs show detailed failure information
---
## Rollout Plan
### Phase 1: Core Validation Enhancement (Story 1)
- Implement enhanced `fetchImageAsBase64`
- Add timeout, status check, content-type validation
- Deploy to development environment
- Monitor logs for validation failures
### Phase 2: Progress Reporting (Story 2)
- Thread progress callback through extraction methods
- Test SSE updates in frontend
- Verify user sees helpful error messages
### Phase 3: Testing & Documentation (Stories 3-5)
- Add comprehensive test coverage
- Update documentation
- Prepare for production deployment
### Phase 4: Production Deployment
- Deploy to production
- Monitor extraction success rates
- Analyze which URL methods succeed/fail
- Adjust timeout if needed based on metrics
---
## Success Metrics
### Validation Accuracy
- ✅ 0% false positives (valid URLs rejected)
- ✅ 100% invalid URLs detected (404, non-image, etc.)
- ✅ Fallback chain works in all scenarios
### Performance
- ✅ URL validation adds < 500ms to extraction time
- ✅ Timeout prevents hanging requests
- ✅ No memory leaks from uncleaned timeouts
### User Experience
- ✅ Frontend shows detailed progress for URL validation
- ✅ Users understand why certain methods failed
- ✅ Extraction still succeeds even when URLs are invalid
### Observability
- ✅ Logs show HTTP status codes for failed URLs
- ✅ Logs distinguish between timeout, network error, invalid status
- ✅ Metrics track URL validation success rate per method
---
## Risk Mitigation
### Risk: Instagram CDN Blocks Validation Requests
**Likelihood:** Low
**Impact:** Medium
**Mitigation:**
- Monitor HTTP status codes in production
- If 403/429 errors increase, consider adding user-agent headers
- May need to use browser context for fetching (more stealthy)
### Risk: Timeout Too Short for Slow Networks
**Likelihood:** Medium
**Impact:** Low
**Mitigation:**
- Start with 10s timeout
- Monitor timeout frequency in logs
- Adjust to 15s if needed based on data
- Screenshot fallback ensures extraction still succeeds
### Risk: Content-Type Header Missing or Incorrect
**Likelihood:** Low
**Impact:** Low
**Mitigation:**
- Default to 'image/jpeg' when content-type is empty
- Consider checking file extension as secondary validation
- Rely on arrayBuffer() to fail for truly non-image data
---
## Appendix: Validation Flow Diagram
```
extractThumbnailStealth()
├─ Method 1: Meta Tags (og:image, twitter:image)
│ ├─ Find URL in page
│ ├─ Call fetchImageAsBase64(url, callback)
│ │ ├─ Fetch with 10s timeout
│ │ ├─ Check status === 200 ❌ → return null → Try Method 2
│ │ ├─ Check content-type startsWith('image/') ❌ → return null → Try Method 2
│ │ └─ Convert to base64 ✅ → return base64 → SUCCESS
│ └─ If null, continue to Method 2
├─ Method 2: Video Poster Attribute
│ ├─ Find poster URL
│ ├─ Call fetchImageAsBase64(url, callback)
│ │ └─ [same validation as Method 1]
│ └─ If null, continue to Method 3
├─ Method 3: Instagram Data Structures
│ ├─ Extract display_url or thumbnail_src
│ ├─ Call fetchImageAsBase64(url, callback)
│ │ └─ [same validation as Method 1]
│ └─ If null, continue to Method 4
└─ Method 4: Screenshot Fallback
└─ extractThumbnailScreenshot(page)
└─ Always returns base64 (or null on page error)
```
**Key Points:**
- Each URL method independently validates before converting to base64
- Validation failures return null and trigger next method
- Progress callbacks report each validation attempt and failure
- Screenshot fallback ensures extraction succeeds even if all URLs fail
---
## Conclusion
This plan enhances thumbnail URL validation to be more robust, observable, and user-friendly. By implementing strict HTTP 200 validation, content-type checking, and timeout protection, we ensure that the extraction system only accepts valid image URLs and gracefully falls back when URLs are invalid. The detailed progress reporting helps with debugging and provides transparency to users about the extraction process.
**Implementation Priority:** Medium-High
**Estimated Effort:** 2-3 days
**Complexity:** Medium