fix(RECIPE-0006): complete iteration 1 - unit tests for Instagram caption extraction

- Exported cleanText() and extractFromDOM() for unit testing
- Fixed metadata prefix regex to handle optional quotes
- Created comprehensive unit tests with mocked Playwright Page (15 tests, 12ms)
- All 275 tests passing
2026-02-17 11:02:59 +01:00
parent b304f5266a
commit 56d3aec3e2
4 changed files with 433 additions and 13 deletions
--- a/docs/FINDINGS.md
+++ b/docs/FINDINGS.md
@@ -1590,6 +1590,165 @@ From prior research (RECIPE-0001), `llm.ts` already implements:

 ---

-**Document Version:** 1.7  
-**Last Updated by:** Planner Agent (RECIPE-0005 Iteration 0)  
+### [Planner] Research Notes - RECIPE-0006 Iteration 1 (2026-02-17)
+
+**Task:** Transform E2E test to unit test with mocked fixtures and fix extraction logic iteratively
+
+#### Problem Analysis
+**Research Date:** 2026-02-17T10:00:00.000Z  
+**Source:** review_report.yaml, extraction.ts analysis, test fixtures
+
+**Iteration 0 Failure:**
+- E2E test created but never executed during development
+- User manually ran test and it FAILED
+- Current output: `"16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe..."`
+- Expected output: Full recipe starting with `"La cacio e pepe infallibile di Luciano Monosilio 🍝"`
+
+**Root Cause Analysis:**
+1. **DOM selectors failing**: Lines 331-341 of extraction.ts try selectors but none match Instagram's current structure
+2. **Fallback to og:description**: Lines 348-357 extract from `<meta property="og:description">`, whose content starts with a metadata prefix
+3. **Regex cleanup insufficient**: Line 356 applies the regex `^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+` to strip that prefix, but the prefix is not removed cleanly (see the regex analysis below: the opening quote after the colon is left behind)
+
+**Current extractFromDOM() Flow:**
+```
+1. Try selectors: article h1, article span[dir="auto"], article div[role="button"] + span, article span:not([aria-label])
+   → All fail (return null or < 100 chars)
+2. Fallback to og:description meta tag
+   → Returns: "16K likes, 325 comments - username on date: caption..."
+3. Apply metadata cleanup regex
+   → Regex doesn't match properly (or matches but leaves quotes)
+4. Pass to cleanText()
+   → cleanText() removes hashtags but metadata prefix remains
+```
+
+---
+
+#### Vitest Unit Testing for Playwright Mocking
+**Research Date:** 2026-02-17T10:00:00.000Z  
+**Source:** TESTING.md, existing tests (queue-processor.spec.ts, scheduler.spec.ts)
+
+**Mocking Strategy:**
+From TESTING.md and existing test patterns, Vitest provides module-level mocking:
+
+```typescript
+import { vi } from 'vitest';
+
+// Mock the entire module; Vitest hoists vi.mock() calls above the file's imports
+vi.mock('$lib/server/extraction', () => ({
+  extractTextAndThumbnail: vi.fn().mockResolvedValue({
+    bodyText: 'Mocked text',
+    thumbnail: 'https://example.com/thumb.jpg'
+  })
+}));
+```
+
+**For Unit Testing extractFromDOM():**
+- Cannot mock the entire `extraction.ts` module (we're testing functions inside it)
+- Need to test internal functions directly (extractFromDOM, cleanText are not exported)
+- Options:
+  1. **Export functions for testing** (add `export` to extractFromDOM and cleanText)
+  2. **Mock Playwright Page.evaluate()** (mock the browser automation layer)
+  3. **Integration test with mocked browser context**
+
+**Chosen Approach: Export Internal Functions**
+- Cleanest separation of concerns
+- Allows direct unit testing without browser overhead
+- Follows existing pattern (extractTextAndThumbnail is already exported)
+- Test Runtime: < 10ms (vs 30s for E2E test)
+
+**Test Structure:**
+```typescript
+// Unit test with fixtures
+import { describe, it, expect, vi } from 'vitest';
+import { extractFromDOM, cleanText } from '$lib/server/extraction';
+
+describe('Instagram Caption Extraction Unit Tests', () => {
+  it('should clean metadata prefix from og:description', async () => {
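+    // Input and expected value share the same caption text so the assertion can pass once the prefix is stripped
+    const input = '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe infallibile di Luciano Monosilio...';
+    const expected = 'La cacio e pepe infallibile di Luciano Monosilio...';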
+    
+    // Create mock page that returns problematic og:description
+    const mockPage = {
+      evaluate: vi.fn().mockResolvedValue(input)
+    };
+    
+    const result = await extractFromDOM(mockPage as any);
+    expect(result.bodyText).toBe(expected);
+  });
+});
+```
+
+---
+
+#### Metadata Prefix Regex Analysis
+**Research Date:** 2026-02-17T10:00:00.000Z  
+**Source:** extraction.ts line 356, test fixtures
+
+**Current Regex (Line 356):**
+```typescript
+const cleanedContent = content.replace(/^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+/, '');
+```
+
+**Test Against Actual Input:**
+```
+Input:    '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...'
+Pattern:  '^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+'
+          ^----- Should match "16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "
+```
+
+**Issue:** Pattern matches but leaves opening quote `"` after the colon.
+
+**Problems Identified:**
+1. Pattern doesn't account for quotes after colon
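+2. Date pattern `[^:]+` is permissive rather than precise: it accepts any run of non-colon characters (here "October 17, 2025")
+3. Pattern ends with `\s+`, which consumes only the whitespace after the colon, so when the format is `: "` (colon-space-quote) the quote is left in the output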
+
+**Improved Regex:**
+```typescript
+// Match: "X likes, Y comments - username on date: " (with optional quote)
+/^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s*["']?/
+```
+
+**Breakdown:**
+- `^\d+K?` - Matches "16K" or "16" (K is optional)
+- `\s+likes,\s+\d+\s+comments` - Matches " likes, 325 comments"
+- `\s+-\s+[\w.]+` - Matches " - chef.antonio.la.cava" (alphanumeric + dots)
+- `\s+on\s+[^:]+:` - Matches " on October 17, 2025:" (anything before colon)
+- `\s*` - Optional whitespace after colon
+- `["']?` - Optional quote character (single or double)
+
+**This should properly strip:**
+- `"16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "` → (empty)
+
+---
+
+#### Files to Modify - RECIPE-0006 Iteration 1
+
+**Primary Changes:**
+1. **src/lib/server/extraction.ts**
+   - Export `extractFromDOM` for unit testing
+   - Export `cleanText` for unit testing
+   - Fix metadata prefix regex in extractFromDOM() (line 356)
+
+2. **src/tests/instagram-caption-extraction.unit.spec.ts** (NEW)
+   - Replace E2E test with unit test
+   - Mock page.evaluate() to return test fixtures
+   - Test both problematic and expected outputs
+   - Runtime < 100ms
+
+3. **src/tests/instagram-caption-extraction.e2e.spec.ts** (MODIFY)
+   - Mark as `.skip` or remove (replaced by unit test)
+   - Keep file for future real-world validation (optional)
+
+**Dependencies:**
+- Vitest mocking (vi.fn(), mockResolvedValue)
+- Test fixtures from context_compact.yaml
+- No external libraries needed
+
+**Parallelization:**
+- All changes are independent
+- Unit test can be written in parallel with extraction.ts fix
+- Test validates fix iteratively
+
+---
+
+**Document Version:** 1.8  
+**Last Updated by:** Planner Agent (RECIPE-0006 Iteration 1)  
 **Next Update:** Developer Agent
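
The regex change described in the "Metadata Prefix Regex Analysis" notes above can be checked in isolation with a small sketch. Both patterns and the sample string are copied from those notes; `stripMetadataPrefix` is a hypothetical helper used only for illustration and is not a function in `extraction.ts`. The trailing `...` in the sample is the elision from the notes, not real caption text.

```typescript
// Patterns as quoted in the research notes (current vs. improved).
const currentPattern =
  /^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+/;
const improvedPattern =
  /^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s*["']?/;

// Hypothetical helper (assumption): removes the first match of `pattern`.
function stripMetadataPrefix(content: string, pattern: RegExp): string {
  return content.replace(pattern, '');
}

// og:description excerpt from the notes; note the opening quote after the colon.
const sample =
  '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...';

const withCurrent = stripMetadataPrefix(sample, currentPattern);
const withImproved = stripMetadataPrefix(sample, improvedPattern);

console.log(withCurrent);  // "La cacio e pepe...   (opening quote survives)
console.log(withImproved); // La cacio e pepe...
```

The improved pattern only handles the quote that opens the caption. If the real `og:description` also ends with a closing quote, that would presumably need separate handling, which this sketch does not attempt.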