momentry_core/docs/PHASE1_COMPLETION_REPORT.md
Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
2026-05-11 07:03:22 +08:00


Phase 1 Completion Report — v2 (fine-grained ASRX)

  • File: Charade (1963) Cary Grant & Audrey Hepburn
  • UUID: aeed71342a899fe4b4c57b7d41bcb692
  • Date: 2026-05-10
  • System: M5 (MacBook Pro, 48GB, Apple Silicon)


1. Processor Outputs

| File | Size | Description |
| --- | --- | --- |
| asr.json | 413KB | 3,417 segments, full movie coverage (Whisper small) |
| asrx.json | 18MB | 4,188 segments (fine-grained, ECAPA-TDNN) |
| asrx_fine.json | 45MB | 4,188 fine segments + voice embeddings (intermediate) |
| cut.json | 329KB | 2,260 scenes |
| yolo.json | 181MB | 169,625 frames with object detections |
| face.json | 106MB | 4,550 frames, 5,910 faces @ 8Hz (CoreML 512D) |
| face_traced.json | 110MB | Traced faces with 423 identity traces |
| lip.json | 492KB | Lip openness analysis |
| ocr.json | 277KB | 606 OCR frames |
| pose.json | 26MB | 4,211 pose frames |
| scene.json | 403B | Scene classification |

2. Pipeline 8-Stage Checklist

| Stage | Status | Detail |
| --- | --- | --- |
| ASR | | 3,417 segments, last end 6,773s (100%) |
| ASRX | | 4,188 segments (fine-grained, 10→3 speakers mapped) |
| Sentence Chunks | | 4,188 sentence chunks with yolo_objects + face_ids |
| Vectorization | | 4,188 Qdrant (768D), all 3 collections updated |
| Face Trace | | 423 traces, 11,820 detections @ 8Hz |
| TKG Graph | | 498 nodes, 1,617 edges |
| Trace Chunks | | 423 trace chunks |
| Phase 1 Release | | 3.0GB package |

3. Speaker Identification

ASRX Enhancement (3,417 → 4,188 segments)

The original Whisper ASR merges rapid back-and-forth dialogue into single segments. A sliding-window ECAPA-TDNN approach was developed to detect speaker change points within each ASR segment:

  1. Sliding window: 1.5s window, 0.75s stride across full audio
  2. ECAPA-TDNN 192D embedding per window
  3. Classification against reference centroids (Cary Grant, Audrey Hepburn, Unknown)
  4. Majority-vote smoothing over 3 adjacent windows
  5. Change point detection where classified speaker changes
  6. Split original ASR segment at each change point

Result: 3,417 → 4,188 segments (+771, +22.6%). The split was validated with gender classification on the ECAPA-TDNN embeddings, which showed 92.3% agreement with the assigned character identity.
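The six steps above can be sketched in a few lines. This is a minimal illustration, not the project's script: it assumes the per-window speaker labels have already been produced by the classifier, and the function names, the toy labels, and the rule that a change point sits halfway between two window centres are all assumptions made for the example.

```python
from collections import Counter

WINDOW_S = 1.5   # window length stated in the report
STRIDE_S = 0.75  # stride stated in the report


def window_starts(duration_s, window_s=WINDOW_S, stride_s=STRIDE_S):
    """Start time of every full window that fits inside the audio."""
    if duration_s < window_s:
        return []
    count = int((duration_s - window_s) / stride_s) + 1
    return [i * stride_s for i in range(count)]


def smooth(labels):
    """Majority vote over each window and its two neighbours (3 windows)."""
    smoothed = []
    for i, own in enumerate(labels):
        votes = Counter(labels[max(0, i - 1):i + 2])
        top, count = votes.most_common(1)[0]
        # Without at least two matching votes, keep the window's own label.
        smoothed.append(top if count >= 2 else own)
    return smoothed


def change_points(starts, labels, window_s=WINDOW_S):
    """(time, new_label) wherever the label differs from the previous window.

    Assumption: the change is placed halfway between the two window centres.
    """
    centres = [s + window_s / 2 for s in starts]
    return [
        ((centres[i - 1] + centres[i]) / 2, labels[i])
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


def split_segment(seg_start, seg_end, points):
    """Split one ASR segment at every change point strictly inside it."""
    cuts = [t for t, _ in points if seg_start < t < seg_end]
    bounds = [seg_start, *cuts, seg_end]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy run on 6 seconds of audio with made-up per-window labels.
starts = window_starts(6.0)
raw = ["Cary", "Cary", "Audrey", "Cary", "Audrey", "Audrey", "Audrey"]
points = change_points(starts, smooth(raw))
pieces = split_segment(1.0, 5.0, points)
```

In the toy run the isolated "Audrey" window at index 2 is voted away, so the ASR segment from 1.0s to 5.0s is split once rather than three times.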

Speaker Mapping (Centroid-based)

| Speaker ID | Name | Segments | Duration | Voice Gender |
| --- | --- | --- | --- | --- |
| SPEAKER_0 | Audrey Hepburn | 1,658 | 2,786s | FEMALE |
| SPEAKER_1 | Cary Grant | 2,033 | 3,962s | MALE |
| SPEAKER_2 | Unknown (minor) | 497 | 806s | MIXED |

Method: Reference centroids built from 3,107 known segments (1,420 Cary + 1,689 Audrey). Each fine segment classified by cosine similarity to nearest centroid. No cross-contamination between speaker clusters.
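A minimal sketch of centroid-based classification follows. It uses toy 3D vectors in place of the 192D ECAPA-TDNN embeddings, and the helper names are hypothetical. Treating the 0.8 similarity figure from the decisions table as a cut-off below which a segment falls back to "Unknown" is an assumption; the report does not state how the Unknown class is assigned.

```python
import math


def mean_vector(vectors):
    """Centroid of a list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(embedding, centroids, min_similarity=0.8):
    """Return (speaker_name, similarity) for the nearest centroid.

    Assumption: anything below min_similarity is labelled "Unknown".
    """
    name, similarity = max(
        ((n, cosine(embedding, c)) for n, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (name if similarity >= min_similarity else "Unknown"), similarity


# Toy reference segments; real centroids would use the known segments.
centroids = {
    "Cary Grant": mean_vector([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "Audrey Hepburn": mean_vector([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}

near_cary = classify([1.0, 0.05, 0.0], centroids)
ambiguous = classify([0.7, 0.7, 0.0], centroids)
```

Because the centroids are fixed once built, the same embedding always receives the same label, which matches the "no retraining needed" rationale given later in the report.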

Gender Validation

Two small clusters (SPEAKER_5: 10 segs, SPEAKER_9: 10 segs) initially showed a MALE voice but were assigned to Audrey. Video clip verification confirmed that in these segments a male voice speaks while Audrey is on screen, so the old face-based matching was incorrect. The fine-grained segmentation resolves these cases correctly.

4. Sentence Chunks — Full Migration

All 4,188 fine segments were written to dev.chunks with complete data per chunk:

| Chunk Field | Value | Source |
| --- | --- | --- |
| start_time/end_time | Fine segment boundaries | asrx_fine.json |
| start_frame/end_frame | time × 25fps | Calculated |
| content | {data: {text, text_normalized}, rule: rule_1} | ASR text |
| metadata.yolo_objects | Deduplicated class names in frame range | pre_chunks(yolo) |
| metadata.face_ids | Trace IDs in frame range | face_detections |
| metadata.speaker_name | Centroid-matched identity | asrx_fine.json |

  • 4,158/4,188 chunks have YOLO objects (avg 3-5 object classes)
  • 398/4,188 chunks have face IDs (face data covers first ~12 min only)
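The field mapping above could be assembled roughly as follows. This is a sketch under assumptions: the input shapes (`yolo_by_frame`, `traces_by_frame` as frame-indexed dicts), the rounding in `to_frame`, and the `normalise_text` rule are all hypothetical, since the report only names the fields and their sources.

```python
FPS = 25  # frame rate stated in the report


def to_frame(seconds, fps=FPS):
    # Rounding is an assumption; the report only says "time x 25fps".
    return int(round(seconds * fps))


def normalise_text(text):
    # Hypothetical normalisation: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def build_chunk(segment, yolo_by_frame, traces_by_frame):
    """Build one sentence chunk from a fine segment and per-frame lookups."""
    start_frame = to_frame(segment["start"])
    end_frame = to_frame(segment["end"])
    objects, traces = set(), set()
    for frame in range(start_frame, end_frame + 1):
        # Sets deduplicate: only class names and trace IDs are kept.
        objects.update(yolo_by_frame.get(frame, ()))
        traces.update(traces_by_frame.get(frame, ()))
    return {
        "start_time": segment["start"],
        "end_time": segment["end"],
        "start_frame": start_frame,
        "end_frame": end_frame,
        "content": {
            "data": {
                "text": segment["text"],
                "text_normalized": normalise_text(segment["text"]),
            },
            "rule": "rule_1",
        },
        "metadata": {
            "yolo_objects": sorted(objects),
            "face_ids": sorted(traces),
            "speaker_name": segment["speaker_name"],
        },
    }


# Toy inputs: frame number -> detected class names / trace IDs.
segment = {"start": 1.0, "end": 1.2, "text": "Hello  There",
           "speaker_name": "Cary Grant"}
yolo_by_frame = {25: ["person", "tie"], 28: ["person"], 40: ["car"]}
traces_by_frame = {26: [7]}
chunk = build_chunk(segment, yolo_by_frame, traces_by_frame)
```

Storing a sorted, deduplicated list per chunk keeps the metadata small compared with per-frame detections, which is the trade-off the report records under its key decisions.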

Parent/Story Chunks

| Metric | Before (v1) | After (v2) |
| --- | --- | --- |
| Children per parent | 15 (fixed) | 15 (fixed) |
| Total parents | 228 | 280 |
| LLM summaries | 228 (Gemma4) | 280 (Gemma4, regenerated) |
| Qdrant stories | 456 pts | 560 pts |
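The parent and point counts follow directly from the fixed group size. The sketch below reproduces the arithmetic, assuming the last parent simply takes the remaining children and that each parent contributes two story vectors (one dialogue, one LLM summary); the function name is hypothetical.

```python
CHILDREN_PER_PARENT = 15  # fixed group size from the report
POINTS_PER_PARENT = 2     # assumed: one dialogue vector + one summary vector


def group_parents(chunk_ids, size=CHILDREN_PER_PARENT):
    """Group consecutive sentence chunks into parents of at most `size`."""
    return [chunk_ids[i:i + size] for i in range(0, len(chunk_ids), size)]


v1_parents = group_parents(list(range(3417)))  # before: coarse ASR segments
v2_parents = group_parents(list(range(4188)))  # after: fine-grained segments

v1_points = len(v1_parents) * POINTS_PER_PARENT
v2_points = len(v2_parents) * POINTS_PER_PARENT
```

With these inputs the grouping yields 228 and 280 parents and 456 and 560 story points, matching the table; the final parent holds 12 children in v1 and 3 in v2.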

5. Qdrant Vector Collections

| Collection | Dims | Points | Content | Status |
| --- | --- | --- | --- | --- |
| momentry_dev_v1 | 768 | 4,188 | Sentence chunk embeddings (EmbeddingGemma) | |
| momentry_dev_stories | 768 | 560 | 280 dialogue + 280 LLM summary | |
| momentry_dev_faces | 512 | 5,910 | Face embeddings (8Hz CoreML) | |
| momentry_dev_voice | 192 | 4,188 | Voice embeddings (ECAPA-TDNN) | |
| sentence_story | 768 | 4,188 | Sentence template with speaker | |
| sentence_summary | 768 | 4,188 | Context-aware LLM sentence summary | |

6. ASR Model Selection

A comprehensive benchmark (5 models × 2 VAD settings × 3 test clips = 30 runs) showed:

| Model | Segments (avg) | Chars | Runtime | Verdict |
| --- | --- | --- | --- | --- |
| tiny | 56 | 1,730 | 9.2s | Most segments, best text capture |
| small | 55 | 1,704 | 17.6s | Best balance (current) |
| base | 42 | 1,751 | 10.1s | Good but fewer segments |
| medium | 52 | 1,627 | 339.6s | Slow, loses text |
| large-v3 | 20 | 1,249 | 68.8s | Worst: merges utterances, loses 26% text |

Conclusion: Keep faster-whisper small (VAD 500ms). The missing-text problem is not solvable by model size — even tiny captures more text than large-v3. Root cause is Whisper's lack of speaker turn detection in segment boundary logic, which is solved by the sliding-window ASRX approach above.
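The size of the benchmark grid and the text-capture gap between models can be checked with a short script. The VAD setting and clip names below are placeholders, since the report does not list them. The report also does not say which baseline its 26% figure uses, so comparing against `small` is an assumption.

```python
from itertools import product

MODELS = ["tiny", "small", "base", "medium", "large-v3"]
VAD_SETTINGS = ["vad_setting_1", "vad_setting_2"]  # placeholders
CLIPS = ["clip_1", "clip_2", "clip_3"]             # placeholders

# One run per (model, VAD setting, clip) combination.
runs = list(product(MODELS, VAD_SETTINGS, CLIPS))

# Character counts copied from the benchmark table.
AVG_CHARS = {
    "tiny": 1730,
    "small": 1704,
    "base": 1751,
    "medium": 1627,
    "large-v3": 1249,
}


def relative_to(baseline, table=AVG_CHARS):
    """Percent difference in captured characters versus the baseline model."""
    base = table[baseline]
    return {m: round((c - base) / base * 100, 1) for m, c in table.items()}


delta = relative_to("small")
```

Against `small`, `large-v3` comes out at roughly 27% fewer characters while `tiny` and `base` capture slightly more, which is consistent with the conclusion that a larger model does not recover the missing text.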

7. Release Package

| Component | Size |
| --- | --- |
| output_json/ | 13 processor files |
| chunks.csv | 3.2MB |
| vectors.csv | 58MB |
| identities.csv | 1MB |
| schema.sql | 30KB |
| Qdrant snapshots (5 collections) | ~3GB |
| RELEASE_INFO.txt | Metadata |
| Total | ~3.0GB |

8. Key Technical Decisions

| Decision | Rationale |
| --- | --- |
| Sliding window 1.5s/0.75s | Optimal balance: captures turn boundaries without over-splitting |
| Centroid-based classification | 0.8+ similarity, no retraining needed, 100% consistent |
| Word-timestamp ASR for text | Re-run with word_timestamps=True, 87% coverage; remaining 13% → per-segment ASR fallback |
| Fixed 15 children/parent | Maintains Phase 1 design consistency |
| yolo_objects dedup | Only class names stored per chunk (not per-frame) |
| face_ids via trace_id | face_id column is NULL in DB; trace_id is the actual identifier |
| Keep ASR small model | Benchmarked 5 models; larger models lose text, not gain it |
| app.run(threaded=True) | Dashboard v2: single-threaded Flask was blocking on subprocess calls |
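The word-timestamp decision can be illustrated with a small sketch: words from a `word_timestamps=True` run are assigned to fine segments, and segments that receive no words are queued for the per-segment ASR fallback. Assigning each word by its midpoint and the function name `assign_words` are assumptions made for this example; the report does not describe the matching rule.

```python
def assign_words(words, segments):
    """Assign timestamped words to segments.

    words:    list of (start_s, end_s, text)
    segments: list of (start_s, end_s), non-overlapping
    Returns (texts, fallback) where fallback lists the indices of
    segments that received no words and need a per-segment ASR pass.
    """
    buckets = [[] for _ in segments]
    for w_start, w_end, text in words:
        midpoint = (w_start + w_end) / 2  # assumed matching rule
        for i, (s_start, s_end) in enumerate(segments):
            if s_start <= midpoint < s_end:
                buckets[i].append(text)
                break
    texts = [" ".join(bucket) for bucket in buckets]
    fallback = [i for i, text in enumerate(texts) if not text]
    return texts, fallback


# Toy data: the middle segment gets no words and needs the fallback.
words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "there"), (1.6, 2.0, "Who")]
segments = [(0.0, 1.0), (1.0, 1.5), (1.5, 2.5)]
texts, fallback = assign_words(words, segments)
coverage = 1 - len(fallback) / len(segments)
```

The `coverage` value plays the role of the 87% figure in the table: it is the share of segments that obtained text from word timestamps alone.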

9. Phase 2 Preparation

Pending for Phase 2:

  • Rule 3 scene chunking (cut-based parent chunks)
  • 5W1H Agent (LLM-generated scene summaries)
  • Full pipeline + 5W1H release packaging
  • Source separation (Demucs/HPSS) for overlapping speech scenarios