Phase 1 Completion Report — v2 (fine-grained ASRX)
File: Charade (1963) Cary Grant & Audrey Hepburn
UUID: aeed71342a899fe4b4c57b7d41bcb692
Date: 2026-05-10
System: M5 (MacBook Pro, 48GB, Apple Silicon)
1. Processor Outputs
| File | Size | Description |
|---|---|---|
| `asr.json` | 413KB | 3,417 segments, full movie coverage (Whisper small) |
| `asrx.json` | 18MB | 4,188 segments (fine-grained, ECAPA-TDNN) |
| `asrx_fine.json` | 45MB | 4,188 fine segments + voice embeddings (intermediate) |
| `cut.json` | 329KB | 2,260 scenes |
| `yolo.json` | 181MB | 169,625 frames with object detections |
| `face.json` | 106MB | 4,550 frames, 5,910 faces @ 8Hz (CoreML 512D) |
| `face_traced.json` | 110MB | Traced faces with 423 identity traces |
| `lip.json` | 492KB | Lip openness analysis |
| `ocr.json` | 277KB | 606 OCR frames |
| `pose.json` | 26MB | 4,211 pose frames |
| `scene.json` | 403B | Scene classification |
2. Pipeline 8-Stage Checklist
| Stage | Status | Detail |
|---|---|---|
| ASR | ✅ | 3,417 segments, last end 6,773s (100%) |
| ASRX | ✅ | 4,188 segments (fine-grained, 10→3 speakers mapped) |
| Sentence Chunks | ✅ | 4,188 sentence chunks with yolo_objects + face_ids |
| Vectorization | ✅ | 4,188 Qdrant (768D), all 3 collections updated |
| Face Trace | ✅ | 423 traces, 11,820 detections @ 8Hz |
| TKG Graph | ✅ | 498 nodes, 1,617 edges |
| Trace Chunks | ✅ | 423 trace chunks |
| Phase 1 Release | ✅ | 3.0GB package |
3. Speaker Identification
ASRX Enhancement (3417 → 4188 segments)
The original Whisper ASR merges rapid back-and-forth dialogue into single segments. A sliding-window ECAPA-TDNN approach was developed to detect speaker change points within each ASR segment:
- Sliding window: 1.5s window, 0.75s stride across full audio
- ECAPA-TDNN 192D embedding per window
- Classification against reference centroids (Cary Grant, Audrey Hepburn, Unknown)
- Majority-vote smoothing over 3 adjacent windows
- Change point detection where classified speaker changes
- Split original ASR segment at each change point
Result: 3,417 → 4,188 segments (+771, +22.6%). Validated via gender classification (ECAPA-TDNN → 92.3% agreement with character identity).
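The smoothing, change-point, and split steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes each window has already been classified into a speaker label, and the function names, the tie-breaking rule, and the choice to place a change point at the centre of the first window with the new label are all assumptions.

```python
from collections import Counter

WINDOW_S = 1.5   # window length from the report
STRIDE_S = 0.75  # stride from the report


def smooth_labels(labels, k=3):
    """Majority vote over k adjacent windows; on a tie, keep the window's own label."""
    half = k // 2
    smoothed = []
    for i, current in enumerate(labels):
        votes = Counter(labels[max(0, i - half): i + half + 1])
        top, top_count = votes.most_common(1)[0]
        # Only overwrite the label when another label strictly outvotes it.
        smoothed.append(current if votes[current] == top_count else top)
    return smoothed


def change_points(labels, stride=STRIDE_S, window=WINDOW_S):
    """Times (seconds from audio start) where the label differs from the previous window.

    Assumption: the change is placed at the centre of the first window
    that carries the new label.
    """
    return [
        i * stride + window / 2
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


def split_segment(start, end, points):
    """Split one ASR segment [start, end] at every change point strictly inside it."""
    cuts = [start] + [p for p in points if start < p < end] + [end]
    return list(zip(cuts[:-1], cuts[1:]))
```

With eight window labels where a single isolated "audrey" window sits inside a run of "cary", the smoothing removes the isolated window, and an ASR segment spanning the remaining real change is cut into two sub-segments.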
Speaker Mapping (Centroid-based)
| Speaker ID | Name | Segments | Duration | Voice Gender |
|---|---|---|---|---|
| SPEAKER_0 | Audrey Hepburn | 1,658 | 2,786s | FEMALE |
| SPEAKER_1 | Cary Grant | 2,033 | 3,962s | MALE |
| SPEAKER_2 | Unknown (minor) | 497 | 806s | MIXED |
Method: Reference centroids were built from 3,107 known segments (1,420 Cary + 1,689 Audrey). Each fine segment was then assigned to the nearest centroid by cosine similarity. No cross-contamination between speaker clusters.
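A minimal sketch of centroid-based classification by cosine similarity is shown below. It uses plain Python lists in place of the 192D ECAPA-TDNN embeddings. The function names are hypothetical, and treating a best similarity below 0.8 as "Unknown" is an assumption made for illustration; the report does not state how the Unknown class is assigned.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classify(embedding, centroids, min_similarity=0.8, fallback="Unknown"):
    """Return (speaker_name, similarity) for the nearest centroid.

    Assumption: segments whose best similarity is below min_similarity
    are labelled with the fallback name.
    """
    best_name, best_sim = fallback, -1.0
    for name, reference in centroids.items():
        sim = cosine(embedding, reference)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < min_similarity:
        return fallback, best_sim
    return best_name, best_sim
```

Because the centroids are fixed once they are built, the same embedding always maps to the same speaker, and no model retraining is involved.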
Gender Validation
Two small clusters (SPEAKER_5 and SPEAKER_9, 10 segments each) initially paired a MALE voice with an Audrey assignment. Video clip verification confirmed that in these segments a male voice speaks while Audrey is on screen, so the old face-based matching was incorrect. The fine-grained segmentation resolves these cases correctly.
4. Sentence Chunks — Full Migration
All 4,188 fine segments were written to dev.chunks with complete data per chunk:
| Chunk Field | Value | Source |
|---|---|---|
| `start_time/end_time` | Fine segment boundaries | asrx_fine.json |
| `start_frame/end_frame` | time × 25fps | Calculated |
| `content` | `{data: {text, text_normalized}, rule: rule_1}` | ASR text |
| `metadata.yolo_objects` | Dedup class names in frame range | pre_chunks(yolo) |
| `metadata.face_ids` | Trace IDs in frame range | face_detections |
| `metadata.speaker_name` | Centroid-matched identity | asrx_fine.json |
- 4,158/4,188 chunks have YOLO objects (avg 3-5 object classes)
- 398/4,188 chunks have face IDs (face data covers first ~12 min only)
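The field mapping in the table above can be illustrated with a short sketch that assembles one chunk from a fine segment. The input shapes (`segment` dict, `yolo_frames` mapping frame index to class names, `face_detections` as `(frame, trace_id)` pairs), the rounding used for frame conversion, and the lower-case/whitespace normalization of `text_normalized` are all assumptions; the real schema and normalization rule may differ.

```python
FPS = 25  # frame rate stated in the report


def to_frame(seconds, fps=FPS):
    """Convert a timestamp to a frame index (rounding is an assumption)."""
    return int(round(seconds * fps))


def build_chunk(segment, yolo_frames, face_detections, fps=FPS):
    """Assemble one sentence chunk from a fine segment and per-frame detections."""
    start_frame = to_frame(segment["start"], fps)
    end_frame = to_frame(segment["end"], fps)

    # Deduplicated YOLO class names seen anywhere in the chunk's frame range.
    yolo_objects = sorted({
        name
        for frame, names in yolo_frames.items()
        if start_frame <= frame <= end_frame
        for name in names
    })
    # Trace IDs (not face_id, which the report says is NULL) in the frame range.
    face_ids = sorted({
        trace_id
        for frame, trace_id in face_detections
        if start_frame <= frame <= end_frame and trace_id is not None
    })

    text = segment["text"]
    return {
        "start_time": segment["start"],
        "end_time": segment["end"],
        "start_frame": start_frame,
        "end_frame": end_frame,
        "content": {
            # Placeholder normalization: lower-case and collapse whitespace.
            "data": {"text": text, "text_normalized": " ".join(text.lower().split())},
            "rule": "rule_1",
        },
        "metadata": {
            "yolo_objects": yolo_objects,
            "face_ids": face_ids,
            "speaker_name": segment["speaker_name"],
        },
    }
```

Storing only the deduplicated class names keeps each chunk small, at the cost of dropping per-frame positions and counts.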
Parent/Story Chunks
| Metric | Before (v1) | After (v2) |
|---|---|---|
| Children per parent | 15 (fixed) | 15 (fixed) |
| Total parents | 228 | 280 |
| LLM summaries | 228 (Gemma4) | 280 (Gemma4, regenerated) |
| Qdrant stories | 456 pts | 560 pts |
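The parent counts in the table are consistent with grouping consecutive sentence chunks in blocks of 15 and letting the last parent hold the remainder. The sketch below shows that arithmetic; the sequential grouping and the helper name are assumptions, since the report states only the fixed size and the totals.

```python
def group_into_parents(chunk_ids, children_per_parent=15):
    """Group consecutive chunk IDs into parents of a fixed size.

    Assumption: the final parent simply holds whatever chunks remain.
    """
    ids = list(chunk_ids)
    return [
        ids[i:i + children_per_parent]
        for i in range(0, len(ids), children_per_parent)
    ]


v1_parents = group_into_parents(range(3417))  # v1: one chunk per ASR segment
v2_parents = group_into_parents(range(4188))  # v2: one chunk per fine segment

# Each parent contributes two story points (dialogue + LLM summary).
v1_story_points = 2 * len(v1_parents)
v2_story_points = 2 * len(v2_parents)
```

Under these assumptions the v2 run yields 279 full parents plus one parent with 3 children.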
5. Qdrant Vector Collections
| Collection | Dims | Points | Content | Status |
|---|---|---|---|---|
| `momentry_dev_v1` | 768 | 4,188 | Sentence chunk embeddings (EmbeddingGemma) | ✅ |
| `momentry_dev_stories` | 768 | 560 | 280 dialogue + 280 LLM summary | ✅ |
| `momentry_dev_faces` | 512 | 5,910 | Face embeddings (8Hz CoreML) | ✅ |
| `momentry_dev_voice` | 192 | 4,188 | Voice embeddings (ECAPA-TDNN) | ✅ |
| `sentence_story` | 768 | 4,188 | Sentence template with speaker | ✅ |
| `sentence_summary` | 768 | 4,188 | Context-aware LLM sentence summary | ✅ |
6. ASR Model Selection
A comprehensive benchmark (5 models × 2 VAD settings × 3 test clips = 30 runs) showed:
| Model | Segments | Chars | Runtime | Verdict |
|---|---|---|---|---|
| tiny | 56 avg | 1,730 | 9.2s | Most segments, best text capture |
| small | 55 avg | 1,704 | 17.6s | Best balance (current) |
| base | 42 avg | 1,751 | 10.1s | Good but fewer segments |
| medium | 52 avg | 1,627 | 339.6s | Slow, loses text |
| large-v3 | 20 avg | 1,249 | 68.8s | Worst: merges utterances, loses 26% text |
Conclusion: Keep faster-whisper small (VAD 500ms). Model size does not solve the missing-text problem: even tiny captures more text than large-v3. The root cause is that Whisper's segment-boundary logic has no speaker-turn detection, which the sliding-window ASRX approach in Section 3 addresses.
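As a rough check on the text-loss comparison, the snippet below computes each model's character loss relative to a baseline, using the average character counts from the benchmark table. Using small as the baseline is an assumption; the report does not say which reference the 26% figure was measured against.

```python
# Average characters per run, copied from the benchmark table.
AVG_CHARS = {
    "tiny": 1730,
    "small": 1704,
    "base": 1751,
    "medium": 1627,
    "large-v3": 1249,
}


def text_loss_pct(avg_chars, baseline="small"):
    """Percentage of characters lost relative to the baseline model.

    Negative values mean the model captured more text than the baseline.
    """
    reference = avg_chars[baseline]
    return {
        model: 100.0 * (reference - chars) / reference
        for model, chars in avg_chars.items()
    }


loss = text_loss_pct(AVG_CHARS)
```

Against the small baseline, large-v3 loses about 26.7% of the characters, in line with the roughly 26% stated in the table, while tiny and base come out slightly negative, meaning they capture marginally more text than small.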
7. Release Package
| Component | Size |
|---|---|
| `output_json/` | 13 processor files |
| `chunks.csv` | 3.2MB |
| `vectors.csv` | 58MB |
| `identities.csv` | 1MB |
| `schema.sql` | 30KB |
| Qdrant snapshots (5 collections) | ~3GB |
| `RELEASE_INFO.txt` | Metadata |
| Total | ~3.0GB |
8. Key Technical Decisions
| Decision | Rationale |
|---|---|
| Sliding window 1.5s/0.75s | Optimal balance: captures turn boundaries without over-splitting |
| Centroid-based classification | 0.8+ similarity, no retraining needed, 100% consistent |
| Word-timestamp ASR for text | Re-run with word_timestamps=True, 87% coverage; remaining 13% → per-segment ASR fallback |
| Fixed 15 children/parent | Maintains Phase 1 design consistency |
| `yolo_objects` dedup | Only class names stored per chunk (not per-frame) |
| `face_ids` via `trace_id` | `face_id` column is NULL in DB; `trace_id` is the actual identifier |
| Keep ASR small model | Benchmarked 5 models; larger models lose text, not gain it |
| `app.run(threaded=True)` | Dashboard v2: single-threaded Flask was blocking on subprocess calls |
9. Phase 2 Preparation
Pending for Phase 2:
- Rule 3 scene chunking (cut-based parent chunks)
- 5W1H Agent (LLM-generated scene summaries)
- Full pipeline + 5W1H release packaging
- Source separation (Demucs/HPSS) for overlapping speech scenarios