fix(RECIPE-0006): complete iteration 1 - unit tests for Instagram caption extraction

- Exported cleanText() and extractFromDOM() for unit testing
- Fixed metadata prefix regex to handle optional quotes
- Created comprehensive unit tests with mocked Playwright Page (15 tests, 12ms)
- All 275 tests passing
Giancarmine Salucci
2026-02-17 11:02:59 +01:00
parent b304f5266a
commit 56d3aec3e2
4 changed files with 433 additions and 13 deletions


@@ -1590,6 +1590,165 @@ From prior research (RECIPE-0001), `llm.ts` already implements:
---
**Document Version:** 1.7
**Last Updated by:** Planner Agent (RECIPE-0005 Iteration 0)
### [Planner] Research Notes - RECIPE-0006 Iteration 1 (2026-02-17)
**Task:** Transform E2E test to unit test with mocked fixtures and fix extraction logic iteratively
#### Problem Analysis
**Research Date:** 2026-02-17T10:00:00.000Z
**Source:** review_report.yaml, extraction.ts analysis, test fixtures
**Iteration 0 Failure:**
- E2E test created but never executed during development
- User manually ran test and it FAILED
- Current output: `"16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe..."`
- Expected output: Full recipe starting with `"La cacio e pepe infallibile di Luciano Monosilio 🍝"`
**Root Cause Analysis:**
1. **DOM selectors failing**: Lines 331-341 of extraction.ts try selectors but none match Instagram's current structure
2. **Fallback to og:description**: Lines 348-357 extract from `<meta property="og:description">`, which contains the metadata prefix
3. **Regex cleanup insufficient**: Line 356 tries to strip the metadata with the regex `^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+`, but the prefix is not fully stripped from the output
**Current extractFromDOM() Flow:**
```
1. Try selectors: article h1, article span[dir="auto"], article div[role="button"] + span, article span:not([aria-label])
→ All fail (return null or < 100 chars)
2. Fallback to og:description meta tag
→ Returns: "16K likes, 325 comments - username on date: caption..."
3. Apply metadata cleanup regex
→ Regex doesn't match properly (or matches but leaves quotes)
4. Pass to cleanText()
→ cleanText() removes hashtags but metadata prefix remains
```
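The flow above can be sketched as a pure function, which makes the fallback order easy to reason about without a browser. This is only an illustration: `pickCaption`, `CaptionCandidate`, and `MIN_CAPTION_LENGTH` are hypothetical names, the 100-character threshold is taken from the "< 100 chars" note, and the real `extractFromDOM()` reads these values through Playwright rather than receiving them as arguments.

```typescript
// Hypothetical threshold, taken from the "< 100 chars" note in the flow above.
const MIN_CAPTION_LENGTH = 100;

// The cleanup regex quoted from line 356 of extraction.ts.
const METADATA_PREFIX =
  /^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+/;

type CaptionSource = 'dom' | 'og:description' | 'none';

interface CaptionCandidate {
  text: string;
  source: CaptionSource;
}

// selectorTexts: text found by each DOM selector, in priority order (null = no match).
// ogDescription: content of the og:description meta tag, or null if absent.
function pickCaption(
  selectorTexts: Array<string | null>,
  ogDescription: string | null
): CaptionCandidate {
  // Step 1: the first selector result that is long enough wins.
  for (const text of selectorTexts) {
    if (text !== null && text.trim().length >= MIN_CAPTION_LENGTH) {
      return { text: text.trim(), source: 'dom' };
    }
  }
  // Steps 2-3: fall back to og:description and apply the metadata cleanup regex.
  if (ogDescription !== null) {
    return {
      text: ogDescription.replace(METADATA_PREFIX, ''),
      source: 'og:description'
    };
  }
  return { text: '', source: 'none' };
}
```

With every selector returning `null` or a short string, the result comes from `og:description`, which is the path that produced the failing output in Iteration 0.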
---
#### Vitest Unit Testing for Playwright Mocking
**Research Date:** 2026-02-17T10:00:00.000Z
**Source:** TESTING.md, existing tests (queue-processor.spec.ts, scheduler.spec.ts)
**Mocking Strategy:**
From TESTING.md and existing test patterns, Vitest provides module-level mocking:
```typescript
// Mock entire module BEFORE imports
vi.mock('$lib/server/extraction', () => ({
  extractTextAndThumbnail: vi.fn().mockResolvedValue({
    bodyText: 'Mocked text',
    thumbnail: 'https://example.com/thumb.jpg'
  })
}));
```
**For Unit Testing extractFromDOM():**
- Cannot mock the entire `extraction.ts` module (we're testing functions inside it)
- Need to test internal functions directly (extractFromDOM, cleanText are not exported)
- Options:
  1. **Export functions for testing** (add `export` to extractFromDOM and cleanText)
  2. **Mock Playwright Page.evaluate()** (mock the browser automation layer)
  3. **Integration test with mocked browser context**
**Chosen Approach: Export Internal Functions**
- Cleanest separation of concerns
- Allows direct unit testing without browser overhead
- Follows existing pattern (extractTextAndThumbnail is already exported)
- Test Runtime: < 10ms (vs 30s for E2E test)
**Test Structure:**
```typescript
// Unit test with fixtures
import { describe, it, expect, vi } from 'vitest';
import { extractFromDOM, cleanText } from '$lib/server/extraction';

describe('Instagram Caption Extraction Unit Tests', () => {
  it('should clean metadata prefix from og:description', async () => {
    // Abbreviated fixture: the real og:description carries the full caption
    const input = '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...';
    // Same fixture without the metadata prefix and opening quote
    const expected = 'La cacio e pepe...';
    // Create mock page that returns problematic og:description
    const mockPage = {
      evaluate: vi.fn().mockResolvedValue(input)
    };
    const result = await extractFromDOM(mockPage as any);
    expect(result.bodyText).toBe(expected);
  });
});
```
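If the `as any` cast in the test above feels too loose, a small hand-rolled stub is one alternative. The sketch below assumes that `extractFromDOM()` only ever calls `page.evaluate()`, which the notes do not confirm; `PageLike`, `StubPage`, and `makeStubPage` are hypothetical names introduced here for illustration. Results are queued so that earlier calls (the selector attempts) and later calls (the `og:description` fallback) can return different values.

```typescript
// Hypothetical minimal surface of a Playwright Page that this sketch relies on.
interface PageLike {
  evaluate(pageFunction: unknown, arg?: unknown): Promise<unknown>;
}

// Stub that also records every call for later inspection.
interface StubPage extends PageLike {
  calls: unknown[][];
}

// Returns the queued results in order; the last one repeats once the queue is exhausted.
function makeStubPage(...results: unknown[]): StubPage {
  const calls: unknown[][] = [];
  let index = 0;
  return {
    calls,
    evaluate(pageFunction: unknown, arg?: unknown): Promise<unknown> {
      calls.push([pageFunction, arg]);
      const result = results[Math.min(index, results.length - 1)];
      index += 1;
      return Promise.resolve(result);
    }
  };
}
```

A test could then build `makeStubPage(null, ogDescriptionFixture)` so the first call simulates failing selectors and the second returns the meta tag content. Whether that call order matches the real function has to be checked against extraction.ts.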
---
#### Metadata Prefix Regex Analysis
**Research Date:** 2026-02-17T10:00:00.000Z
**Source:** extraction.ts line 356, test fixtures
**Current Regex (Line 356):**
```typescript
const cleanedContent = content.replace(/^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+/, '');
```
**Test Against Actual Input:**
```
Input: '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...'
Pattern: '^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+'
^----- Should match "16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "
```
**Issue:** Pattern matches but leaves opening quote `"` after the colon.
**Problems Identified:**
1. Pattern doesn't account for quotes after colon
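2. Date pattern `[^:]+` is permissive: it consumes everything up to the first colon ("October 17, 2025" here), so a date that itself contains a colon (for example a time) would not be handled
3. Pattern requires whitespace after the colon (`:\s+`) and stops there, but the actual format is `: "` (colon, space, quote)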
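The behaviour described above can be reproduced outside the extraction code. The snippet below applies the line 356 regex to the abbreviated fixture from these notes; `CURRENT_PREFIX`, `ogDescription`, and `cleaned` are illustrative names, not identifiers from extraction.ts.

```typescript
// Regex as quoted from extraction.ts line 356.
const CURRENT_PREFIX =
  /^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s+/;

// Abbreviated og:description fixture from the notes above.
const ogDescription =
  '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...';

// The prefix up to and including the space after the colon is removed,
// but the opening double quote survives.
const cleaned = ogDescription.replace(CURRENT_PREFIX, '');

console.log(cleaned.startsWith('"')); // true
```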
**Improved Regex:**
```typescript
// Match: "X likes, Y comments - username on date: " (with optional quote)
/^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s*["']?/
```
**Breakdown:**
- `^\d+K?` - Matches "16K" or "16" (K is optional)
- `\s+likes,\s+\d+\s+comments` - Matches " likes, 325 comments"
- `\s+-\s+[\w.]+` - Matches " - chef.antonio.la.cava" (alphanumeric + dots)
- `\s+on\s+[^:]+:` - Matches " on October 17, 2025:" (anything before colon)
- `\s*` - Optional whitespace after colon
- `["']?` - Optional quote character (single or double)
**This should strip the whole prefix, including the opening quote:**
- `16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "` → (empty)
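As a quick check of the breakdown above, the improved pattern can be wrapped in a helper and run against a few inputs. `stripMetadataPrefix` and `IMPROVED_PREFIX` are hypothetical names for this sketch, and every sample other than the first is invented for illustration rather than taken from real Instagram output.

```typescript
// Improved pattern from the notes: optional whitespace and one optional quote after the colon.
const IMPROVED_PREFIX =
  /^\d+K?\s+likes,\s+\d+\s+comments\s+-\s+[\w.]+\s+on\s+[^:]+:\s*["']?/;

// Removes the "X likes, Y comments - user on date: " prefix when present;
// text without the prefix is returned unchanged.
function stripMetadataPrefix(ogDescription: string): string {
  return ogDescription.replace(IMPROVED_PREFIX, '');
}

const samples: string[] = [
  // Fixture from the notes (double quote after the colon).
  '16K likes, 325 comments - chef.antonio.la.cava on October 17, 2025: "La cacio e pepe...',
  // Invented sample: single quote after the colon.
  "5 likes, 1 comments - a_b on Jan 2, 2024: 'Hi",
  // Invented sample: no quote at all.
  '12 likes, 3 comments - user.name on May 1, 2025: Hello',
  // Invented sample: no metadata prefix, must pass through untouched.
  'Just a caption: no prefix'
];

for (const sample of samples) {
  console.log(stripMetadataPrefix(sample));
}
```

The pattern is anchored at the start of the string, so it only removes the opening quote. If `og:description` also ends with a closing quote, that character is left in place and would need separate handling.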
---
#### Files to Modify - RECIPE-0006 Iteration 1
**Primary Changes:**
1. **src/lib/server/extraction.ts**
   - Export `extractFromDOM` for unit testing
   - Export `cleanText` for unit testing
   - Fix metadata prefix regex in extractFromDOM() (line 356)
2. **src/tests/instagram-caption-extraction.unit.spec.ts** (NEW)
   - Replace E2E test with unit test
   - Mock page.evaluate() to return test fixtures
   - Test both problematic and expected outputs
   - Runtime < 100ms
3. **src/tests/instagram-caption-extraction.e2e.spec.ts** (MODIFY)
   - Mark as `.skip` or remove (replaced by unit test)
   - Keep file for future real-world validation (optional)
**Dependencies:**
- Vitest mocking (vi.fn(), mockResolvedValue)
- Test fixtures from context_compact.yaml
- No external libraries needed
**Parallelization:**
- All changes are independent
- Unit test can be written in parallel with extraction.ts fix
- Test validates fix iteratively
---
**Document Version:** 1.8
**Last Updated by:** Planner Agent (RECIPE-0006 Iteration 1)
**Next Update:** Developer Agent