# Phase 1 Completion Report: v2 (fine-grained ASRX)

**File**: Charade (1963) Cary Grant & Audrey Hepburn
**UUID**: `aeed71342a899fe4b4c57b7d41bcb692`
**Date**: 2026-05-10
**System**: M5 (MacBook Pro, 48GB, Apple Silicon)

---

## 1. Processor Outputs

| File | Size | Description |
|------|------|-------------|
| `asr.json` | 413KB | 3,417 segments, full movie coverage (Whisper small) |
| `asrx.json` | **18MB** | **4,188 segments** (fine-grained, ECAPA-TDNN) |
| `asrx_fine.json` | 45MB | 4,188 fine segments + voice embeddings (intermediate) |
| `cut.json` | 329KB | 2,260 scenes |
| `yolo.json` | 181MB | 169,625 frames with object detections |
| `face.json` | **106MB** | 4,550 frames, 5,910 faces @ 8Hz (CoreML 512D) |
| `face_traced.json` | 110MB | Traced faces with 423 identity traces |
| `lip.json` | 492KB | Lip openness analysis |
| `ocr.json` | 277KB | 606 OCR frames |
| `pose.json` | 26MB | 4,211 pose frames |
| `scene.json` | 403B | Scene classification |

## 2. Pipeline 8-Stage Checklist

| Stage | Status | Detail |
|-------|--------|--------|
| ASR | ✅ | 3,417 segments, last end 6,773s (100%) |
| ASRX | ✅ | **4,188 segments** (fine-grained, 10→3 speakers mapped) |
| Sentence Chunks | ✅ | **4,188 sentence chunks** with yolo_objects + face_ids |
| Vectorization | ✅ | 4,188 Qdrant (768D), all 3 collections updated |
| Face Trace | ✅ | 423 traces, 11,820 detections @ 8Hz |
| TKG Graph | ✅ | 498 nodes, 1,617 edges |
| Trace Chunks | ✅ | 423 trace chunks |
| Phase 1 Release | ✅ | 3.0GB package |

## 3. Speaker Identification

### ASRX Enhancement (3417 → 4188 segments)

The original Whisper ASR merges rapid back-and-forth dialogue into single segments. A sliding-window ECAPA-TDNN approach was developed to detect speaker change points within each ASR segment:

1. **Sliding window**: 1.5s window, 0.75s stride across full audio
2. **ECAPA-TDNN 192D embedding** per window
3. **Classification** against reference centroids (Cary Grant, Audrey Hepburn, Unknown)
4. **Majority-vote smoothing** over 3 adjacent windows
5. **Change point detection** where classified speaker changes
6. **Split** original ASR segment at each change point

**Result**: 3,417 → **4,188 segments** (+771, +22.6%). Validated via gender classification (ECAPA-TDNN → 92.3% agreement with character identity).

### Speaker Mapping (Centroid-based)

| Speaker ID | Name | Segments | Duration | Voice Gender |
|------------|------|----------|----------|-------------|
| SPEAKER_0 | Audrey Hepburn | 1,658 | 2,786s | FEMALE |
| SPEAKER_1 | Cary Grant | 2,033 | 3,962s | MALE |
| SPEAKER_2 | Unknown (minor) | 497 | 806s | MIXED |

Method: Reference centroids built from 3,107 known segments (1,420 Cary + 1,689 Audrey). Each fine segment classified by cosine similarity to nearest centroid. No cross-contamination between speaker clusters.

### Gender Validation

Two small clusters (SPEAKER_5: 10 segs, SPEAKER_9: 10 segs) initially showed MALE voice → Audrey assignment. Video clip verification confirmed these are segments where a male voice speaks while Audrey is on screen (old face-based matching was incorrect). The fine-grained segmentation correctly resolves these.

## 4. Sentence Chunks: Full Migration

All 4,188 fine segments were written to `dev.chunks` with complete data per chunk:

| Chunk Field | Value | Source |
|-------------|-------|--------|
| `start_time`/`end_time` | Fine segment boundaries | `asrx_fine.json` |
| `start_frame`/`end_frame` | time × 25fps | Calculated |
| `content` | `{data: {text, text_normalized}, rule: rule_1}` | ASR text |
| `metadata.yolo_objects` | Dedup class names in frame range | `pre_chunks(yolo)` |
| `metadata.face_ids` | Trace IDs in frame range | `face_detections` |
| `metadata.speaker_name` | Centroid-matched identity | `asrx_fine.json` |

- 4,158/4,188 chunks have YOLO objects (avg 3-5 object classes)
- 398/4,188 chunks have face IDs (face data covers first ~12 min only)

### Parent/Story Chunks

| Metric | Before (v1) | After (v2) |
|--------|-------------|------------|
| Children per parent | 15 (fixed) | 15 (fixed) |
| Total parents | 228 | **280** |
| LLM summaries | 228 (Gemma4) | **280** (Gemma4, regenerated) |
| Qdrant stories | 456 pts | **560 pts** |

## 5. Qdrant Vector Collections

| Collection | Dims | Points | Content | Status |
|-----------|------|--------|---------|--------|
| `momentry_dev_v1` | 768 | **4,188** | Sentence chunk embeddings (EmbeddingGemma) | ✅ |
| `momentry_dev_stories` | 768 | **560** | 280 dialogue + 280 LLM summary | ✅ |
| `momentry_dev_faces` | 512 | 5,910 | Face embeddings (8Hz CoreML) | ✅ |
| `momentry_dev_voice` | 192 | **4,188** | Voice embeddings (ECAPA-TDNN) | ✅ |
| `sentence_story` | 768 | **4,188** | Sentence template with speaker | ✅ |
| `sentence_summary` | 768 | **4,188** | Context-aware LLM sentence summary | ✅ |

## 6. ASR Model Selection

A comprehensive benchmark (5 models × 2 VAD settings × 3 test clips = 30 runs) showed:

| Model | Segments | Chars | Runtime | Verdict |
|-------|----------|-------|---------|---------|
| tiny | 56 avg | 1,730 | **9.2s** | Most segments, best text capture |
| **small** | **55 avg** | **1,704** | **17.6s** | **Best balance (current)** |
| base | 42 avg | 1,751 | 10.1s | Good but fewer segments |
| medium | 52 avg | 1,627 | 339.6s | Slow, loses text |
| large-v3 | 20 avg | 1,249 | 68.8s | **Worst**: merges utterances, loses 26% text |

**Conclusion**: Keep `faster-whisper small (VAD 500ms)`. The missing-text problem is not solvable by model size: even tiny captures more text than large-v3. Root cause is Whisper's lack of speaker turn detection in segment boundary logic, which is solved by the sliding-window ASRX approach above.

## 7. Release Package

| Component | Size |
|-----------|------|
| `output_json/` | 13 processor files |
| `chunks.csv` | 3.2MB |
| `vectors.csv` | 58MB |
| `identities.csv` | 1MB |
| `schema.sql` | 30KB |
| Qdrant snapshots (5 collections) | ~3GB |
| `RELEASE_INFO.txt` | Metadata |
| **Total** | **~3.0GB** |

## 8. Key Technical Decisions

| Decision | Rationale |
|----------|-----------|
| Sliding window 1.5s/0.75s | Optimal balance: captures turn boundaries without over-splitting |
| Centroid-based classification | 0.8+ similarity, no retraining needed, 100% consistent |
| Word-timestamp ASR for text | Re-run with `word_timestamps=True`, 87% coverage; remaining 13% → per-segment ASR fallback |
| Fixed 15 children/parent | Maintains Phase 1 design consistency |
| `yolo_objects` dedup | Only class names stored per chunk (not per-frame) |
| `face_ids` via `trace_id` | `face_id` column is NULL in DB; `trace_id` is the actual identifier |
| Keep ASR small model | Benchmarked 5 models; larger models lose text, not gain it |
| `app.run(threaded=True)` | Dashboard v2: single-threaded Flask was blocking on subprocess calls |

## 9. Phase 2 Preparation

Pending for Phase 2:

- Rule 3 scene chunking (cut-based parent chunks)
- 5W1H Agent (LLM-generated scene summaries)
- Full pipeline + 5W1H release packaging
- Source separation (Demucs/HPSS) for overlapping speech scenarios
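
For reference when extending the pipeline in Phase 2, the smoothing, change-point, and split steps of the sliding-window ASRX approach in Section 3 can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: it assumes the per-window speaker labels have already been produced by the ECAPA-TDNN classification step, and the function names, the tie-breaking rule, and the choice to place a change point midway between the centres of the two adjacent windows are all assumptions made for the example. It handles time boundaries only; redistributing the text across the split segments (done in the report via word timestamps) is out of scope here.

```python
from collections import Counter

WINDOW_S = 1.5   # window length stated in the report
STRIDE_S = 0.75  # stride stated in the report


def smooth_labels(labels, k=3):
    """Majority vote over k adjacent windows (k=3 as in the report)."""
    half = k // 2
    smoothed = []
    for i, own in enumerate(labels):
        counts = Counter(labels[max(0, i - half): i + half + 1])
        top = max(counts.values())
        # Assumption: on a tie, keep the window's own label.
        if counts[own] == top:
            smoothed.append(own)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


def change_points(labels, stride=STRIDE_S, window=WINDOW_S):
    """Return (time_s, new_speaker) wherever the label changes.

    Assumption: the change is placed midway between the centres of
    window i-1 and window i.
    """
    points = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            points.append(((i - 0.5) * stride + window / 2, labels[i]))
    return points


def split_segment(start, end, points):
    """Split one ASR segment [start, end] at change points strictly inside it."""
    cuts = [t for t, _ in points if start < t < end]
    edges = [start, *cuts, end]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    raw = ["A", "A", "B", "A", "A", "B", "B", "B"]  # toy window labels
    smoothed = smooth_labels(raw)
    points = change_points(smoothed)
    print(smoothed)  # the isolated "B" at index 2 is voted away
    print(points)
    print(split_segment(2.0, 6.0, points))
```

In this toy input the single stray `"B"` window is removed by the vote, leaving one change point, so the example segment from 2.0s to 6.0s is split into two fine segments. A wider vote (larger `k`) would suppress more spurious flips at the cost of missing very short turns.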