
Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes

Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4

2026-05-11 07:03:22 +08:00


Phase 1 Completion Report — v2 (fine-grained ASRX)

File: Charade (1963) Cary Grant & Audrey Hepburn
UUID: aeed71342a899fe4b4c57b7d41bcb692
Date: 2026-05-10
System: M5 (MacBook Pro, 48GB, Apple Silicon)

1. Processor Outputs

File	Size	Description
`asr.json`	413KB	3,417 segments, full movie coverage (Whisper small)
`asrx.json`	18MB	4,188 segments (fine-grained, ECAPA-TDNN)
`asrx_fine.json`	45MB	4,188 fine segments + voice embeddings (intermediate)
`cut.json`	329KB	2,260 scenes
`yolo.json`	181MB	169,625 frames with object detections
`face.json`	106MB	4,550 frames, 5,910 faces @ 8Hz (CoreML 512D)
`face_traced.json`	110MB	Traced faces with 423 identity traces
`lip.json`	492KB	Lip openness analysis
`ocr.json`	277KB	606 OCR frames
`pose.json`	26MB	4,211 pose frames
`scene.json`	403B	Scene classification

2. Pipeline 8-Stage Checklist

Stage	Status	Detail
ASR	✅	3,417 segments, last end 6,773s (100%)
ASRX	✅	4,188 segments (fine-grained, 10→3 speakers mapped)
Sentence Chunks	✅	4,188 sentence chunks with yolo_objects + face_ids
Vectorization	✅	4,188 Qdrant (768D), all 3 collections updated
Face Trace	✅	423 traces, 11,820 detections @ 8Hz
TKG Graph	✅	498 nodes, 1,617 edges
Trace Chunks	✅	423 trace chunks
Phase 1 Release	✅	3.0GB package

3. Speaker Identification

ASRX Enhancement (3417 → 4188 segments)

The original Whisper ASR merges rapid back-and-forth dialogue into single segments. A sliding-window ECAPA-TDNN approach was developed to detect speaker change points within each ASR segment:

- Slide a 1.5s window with a 0.75s stride across the full audio.
- Compute a 192D ECAPA-TDNN embedding for each window.
- Classify each window against the reference centroids (Cary Grant, Audrey Hepburn, Unknown).
- Smooth the labels with a majority vote over 3 adjacent windows.
- Mark a change point wherever the classified speaker changes.
- Split the original ASR segment at each change point.

Result: 3,417 → 4,188 segments (+771, +22.6%). The split was validated with ECAPA-TDNN voice gender classification, which showed 92.3% agreement between voice gender and character identity.
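The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the per-window speaker labels are supplied by hand instead of coming from ECAPA-TDNN embeddings, and the function names, the tie handling in the vote, and the placement of a boundary midway between two window centers are all assumptions.

```python
from collections import Counter

WINDOW_S = 1.5   # window length stated in the report
STRIDE_S = 0.75  # stride stated in the report


def window_starts(duration_s, window_s=WINDOW_S, stride_s=STRIDE_S):
    """Start times of every full window that fits inside the audio."""
    starts = []
    i = 0
    while i * stride_s + window_s <= duration_s + 1e-9:
        starts.append(i * stride_s)
        i += 1
    return starts


def smooth_labels(labels, k=3):
    """Majority vote over k adjacent windows, centered on each window.

    The majority label wins only with a strict majority; otherwise the
    window keeps its own label (assumption: tie handling is not specified).
    """
    half = k // 2
    smoothed = []
    for i, label in enumerate(labels):
        neighborhood = labels[max(0, i - half): i + half + 1]
        top, count = Counter(neighborhood).most_common(1)[0]
        smoothed.append(top if count > len(neighborhood) / 2 else label)
    return smoothed


def change_points(labels, starts, window_s=WINDOW_S):
    """Boundary times between adjacent windows whose labels differ.

    Each boundary is placed midway between the two window centers
    (assumption: the report does not say where the cut is placed).
    """
    points = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            prev_center = starts[i - 1] + window_s / 2
            cur_center = starts[i] + window_s / 2
            points.append((prev_center + cur_center) / 2)
    return points


def split_segment(start, end, points):
    """Split one ASR segment at every change point strictly inside it."""
    cuts = [p for p in points if start < p < end]
    bounds = [start, *cuts, end]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy run: 7.5s of audio with hand-written per-window labels.
starts = window_starts(7.5)
raw = ["A", "A", "A", "B", "A", "B", "B", "B", "B"]
smoothed = smooth_labels(raw)
points = change_points(smoothed, starts)
pieces = split_segment(2.0, 5.0, points)
```

In the toy run the isolated "B" and "A" labels in the middle are voted away, leaving a single change point, so the ASR segment from 2.0s to 5.0s is split into two fine segments.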

Speaker Mapping (Centroid-based)

Speaker ID	Name	Segments	Duration	Voice Gender
SPEAKER_0	Audrey Hepburn	1,658	2,786s	FEMALE
SPEAKER_1	Cary Grant	2,033	3,962s	MALE
SPEAKER_2	Unknown (minor)	497	806s	MIXED

Method: Reference centroids were built from 3,107 known segments (1,420 Cary + 1,689 Audrey). Each fine segment was assigned to the centroid with the highest cosine similarity. No cross-contamination between speaker clusters was observed.
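A minimal sketch of centroid-based classification follows, using toy 3D vectors in place of the 192D voice embeddings. The helper names are hypothetical, and routing a segment to "Unknown" when its best similarity falls below 0.8 is an assumption: the report mentions "0.8+ similarity" but does not state how the Unknown class is assigned.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(embeddings):
    """Mean of the unit-length embeddings, re-normalized to unit length."""
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
    return normalize(mean)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify(embedding, centroids, min_similarity=0.8, fallback="Unknown"):
    """Return (speaker_name, similarity) for the nearest centroid.

    Assumption: segments below min_similarity are labeled with the fallback.
    """
    name, sim = max(
        ((n, cosine(embedding, c)) for n, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (name if sim >= min_similarity else fallback), sim


# Toy reference data: two known speakers, two example segments each.
centroids = {
    "Cary Grant": centroid([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "Audrey Hepburn": centroid([[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]),
}

clear_name, clear_sim = classify([1.0, 0.05, 0.0], centroids)
ambiguous_name, ambiguous_sim = classify([1.0, 1.0, 0.0], centroids)
```

Because the centroids are fixed once they are built, classifying the same embedding always gives the same label, and no model retraining is involved.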

Gender Validation

Two small clusters (SPEAKER_5 and SPEAKER_9, 10 segments each) initially paired a MALE voice with an Audrey Hepburn assignment. Video clip verification confirmed that in these segments a male voice speaks while Audrey is on screen, so the old face-based matching was incorrect. The fine-grained segmentation resolves these cases correctly.

4. Sentence Chunks — Full Migration

All 4,188 fine segments were written to dev.chunks with complete data per chunk:

Chunk Field	Value	Source
`start_time`/`end_time`	Fine segment boundaries	`asrx_fine.json`
`start_frame`/`end_frame`	time × 25fps	Calculated
`content`	`{data: {text, text_normalized}, rule: rule_1}`	ASR text
`metadata.yolo_objects`	Dedup class names in frame range	`pre_chunks(yolo)`
`metadata.face_ids`	Trace IDs in frame range	`face_detections`
`metadata.speaker_name`	Centroid-matched identity	`asrx_fine.json`

4,158/4,188 chunks have YOLO objects (avg 3-5 object classes)
398/4,188 chunks have face IDs (face data covers first ~12 min only)
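The field table above can be illustrated with a small chunk builder. This is a sketch under stated assumptions: the input shapes (`yolo_frames` as a frame-to-classes mapping, `face_detections` as a list of dicts) are hypothetical, the frame index is obtained by rounding `time × 25`, and `text_normalized` is approximated as lowercased text with collapsed whitespace because the report does not define the normalization.

```python
FPS = 25  # frame rate used in the report (time x 25fps)


def to_frame(t_seconds, fps=FPS):
    """Convert a timestamp to a frame index (rounding is an assumption)."""
    return int(round(t_seconds * fps))


def build_chunk(segment, yolo_frames, face_detections, fps=FPS):
    """Assemble one sentence chunk from a fine segment and detector outputs."""
    start_frame = to_frame(segment["start"], fps)
    end_frame = to_frame(segment["end"], fps)

    def in_range(frame):
        return start_frame <= frame <= end_frame

    # Deduplicated class names: stored per chunk, not per frame.
    yolo_objects = sorted(
        {cls for frame, classes in yolo_frames.items() if in_range(frame) for cls in classes}
    )
    # Face IDs are trace IDs of detections that fall inside the frame range.
    face_ids = sorted({d["trace_id"] for d in face_detections if in_range(d["frame"])})

    text = segment["text"].strip()
    return {
        "start_time": segment["start"],
        "end_time": segment["end"],
        "start_frame": start_frame,
        "end_frame": end_frame,
        "content": {
            # text_normalized here is a placeholder normalization (assumption).
            "data": {"text": text, "text_normalized": " ".join(text.lower().split())},
            "rule": "rule_1",
        },
        "metadata": {
            "yolo_objects": yolo_objects,
            "face_ids": face_ids,
            "speaker_name": segment["speaker_name"],
        },
    }


# Toy inputs, not taken from the actual processor files.
segment = {"start": 12.0, "end": 13.48, "text": "  Hello,  Reggie. ", "speaker_name": "Cary Grant"}
yolo_frames = {299: ["car"], 300: ["person", "tie"], 320: ["person"], 338: ["dog"]}
face_detections = [
    {"frame": 310, "trace_id": 7},
    {"frame": 311, "trace_id": 7},
    {"frame": 400, "trace_id": 9},
]
chunk = build_chunk(segment, yolo_frames, face_detections)
```

A chunk whose frame range contains no face detections simply gets an empty `face_ids` list, which is consistent with most chunks lacking face IDs when face data covers only part of the film.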

Parent/Story Chunks

Metric	Before (v1)	After (v2)
Children per parent	15 (fixed)	15 (fixed)
Total parents	228	280
LLM summaries	228 (Gemma4)	280 (Gemma4, regenerated)
Qdrant stories	456 pts	560 pts
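The parent counts in the table follow from the fixed group size. The short check below assumes that v1 parents were built from the 3,417 original ASR segments and v2 parents from the 4,188 fine segments, and that each parent contributes two story points (one dialogue, one LLM summary); the function name is hypothetical.

```python
CHILDREN_PER_PARENT = 15  # fixed group size from the report


def group_into_parents(chunk_ids, size=CHILDREN_PER_PARENT):
    """Group consecutive sentence chunks into parents of at most `size` children."""
    return [chunk_ids[i:i + size] for i in range(0, len(chunk_ids), size)]


parents_v1 = group_into_parents(list(range(3417)))
parents_v2 = group_into_parents(list(range(4188)))

# Two Qdrant story points per parent: dialogue text plus LLM summary.
story_points_v1 = 2 * len(parents_v1)
story_points_v2 = 2 * len(parents_v2)

print(len(parents_v1), story_points_v1)  # 228 456
print(len(parents_v2), story_points_v2)  # 280 560
```

Under this grouping the last parent is shorter than the rest: 12 children in v1 and 3 children in v2.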

5. Qdrant Vector Collections

Collection	Dims	Points	Content	Status
`momentry_dev_v1`	768	4,188	Sentence chunk embeddings (EmbeddingGemma)	✅
`momentry_dev_stories`	768	560	280 dialogue + 280 LLM summary	✅
`momentry_dev_faces`	512	5,910	Face embeddings (8Hz CoreML)	✅
`momentry_dev_voice`	192	4,188	Voice embeddings (ECAPA-TDNN)	✅
`sentence_story`	768	4,188	Sentence template with speaker	✅
`sentence_summary`	768	4,188	Context-aware LLM sentence summary	✅

6. ASR Model Selection

A comprehensive benchmark (5 models × 2 VAD settings × 3 test clips = 30 runs) showed:

Model	Segments	Chars	Runtime	Verdict
tiny	56 avg	1,730	9.2s	Most segments, best text capture
small	55 avg	1,704	17.6s	Best balance (current)
base	42 avg	1,751	10.1s	Good but fewer segments
medium	52 avg	1,627	339.6s	Slow, loses text
large-v3	20 avg	1,249	68.8s	Worst: merges utterances, loses 26% text

Conclusion: Keep faster-whisper small (VAD 500ms). The missing-text problem is not solvable by model size — even tiny captures more text than large-v3. Root cause is Whisper's lack of speaker turn detection in segment boundary logic, which is solved by the sliding-window ASRX approach above.
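The text-loss figure can be checked against the benchmark table. The sketch below assumes the loss is measured as the relative drop in average character count against the current `small` model; the report does not state which baseline the 26% refers to.

```python
# Average characters captured per clip, copied from the benchmark table.
avg_chars = {
    "tiny": 1730,
    "small": 1704,
    "base": 1751,
    "medium": 1627,
    "large-v3": 1249,
}


def text_loss(chars, baseline):
    """Relative text loss against a baseline; negative means more text captured."""
    return 1 - chars / baseline


loss_vs_small = {model: text_loss(c, avg_chars["small"]) for model, c in avg_chars.items()}

for model, loss in loss_vs_small.items():
    print(f"{model:>8}: {loss * 100:+.1f}%")
```

Against `small`, `large-v3` loses roughly 26.7% of the text, while `tiny` and `base` come out slightly negative, meaning they capture marginally more characters than `small`.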

7. Release Package

Component	Size
`output_json/`	13 processor files
`chunks.csv`	3.2MB
`vectors.csv`	58MB
`identities.csv`	1MB
`schema.sql`	30KB
Qdrant snapshots (5 collections)	~3GB
`RELEASE_INFO.txt`	Metadata
Total	~3.0GB

8. Key Technical Decisions

Decision	Rationale
Sliding window 1.5s/0.75s	Optimal balance: captures turn boundaries without over-splitting
Centroid-based classification	0.8+ similarity, no retraining needed, 100% consistent
Word-timestamp ASR for text	Re-run with `word_timestamps=True`, 87% coverage; remaining 13% → per-segment ASR fallback
Fixed 15 children/parent	Maintains Phase 1 design consistency
`yolo_objects` dedup	Only class names stored per chunk (not per-frame)
`face_ids` via `trace_id`	`face_id` column is NULL in DB; `trace_id` is the actual identifier
Keep ASR small model	Benchmarked 5 models; larger models lose text rather than gain it
`app.run(threaded=True)`	Dashboard v2: single-threaded Flask was blocking on subprocess calls
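The word-timestamp decision in the table above can be sketched as follows. This is an illustration only: it assumes words are assigned to the fine segment that contains their midpoint, and that "coverage" means the share of fine segments that receive at least one word. Segments left empty are the ones that would go to the per-segment ASR fallback. All names here are hypothetical.

```python
def assign_words(fine_segments, words):
    """Assign each timestamped word to the fine segment containing its midpoint.

    fine_segments: list of (start, end) tuples in seconds
    words: list of (start, end, text) tuples from a word-timestamp ASR run
    """
    texts = []
    for seg_start, seg_end in fine_segments:
        picked = [
            text
            for w_start, w_end, text in words
            if seg_start <= (w_start + w_end) / 2 < seg_end
        ]
        texts.append(" ".join(picked))
    return texts


def coverage(texts):
    """Fraction of segments that received at least one word."""
    return sum(1 for t in texts if t) / len(texts)


def needs_fallback(texts):
    """Indices of segments with no words; these would be re-transcribed individually."""
    return [i for i, t in enumerate(texts) if not t]


# Toy data: three fine segments, five words.
fine_segments = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.0)]
words = [
    (0.1, 0.4, "Hello"),
    (0.5, 1.2, "there"),
    (1.4, 1.8, "Who"),
    (1.9, 2.3, "are"),
    (2.4, 2.9, "you"),
]
texts = assign_words(fine_segments, words)
fallback_indices = needs_fallback(texts)
```

The word "Who" starts before the 1.5s boundary but its midpoint lies after it, so it lands in the second segment; the third segment has no words and is flagged for fallback.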

9. Phase 2 Preparation

Pending for Phase 2:

Rule 3 scene chunking (cut-based parent chunks)
5W1H Agent (LLM-generated scene summaries)
Full pipeline + 5W1H release packaging
Source separation (Demucs/HPSS) for overlapping speech scenarios
