feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
@@ -1,8 +1,8 @@
# Phase 1 Completion Report — v1 (base model)
# Phase 1 Completion Report — v2 (fine-grained ASRX)

**File**: Charade (1963) Cary Grant & Audrey Hepburn
**UUID**: `aeed71342a899fe4b4c57b7d41bcb692`
**Date**: 2026-05-09
**Date**: 2026-05-10
**System**: M5 (MacBook Pro, 48GB, Apple Silicon)

---
@@ -11,12 +11,13 @@

| File | Size | Description |
|------|------|-------------|
| `asr.json` | 413KB | 3,417 segments, full movie coverage |
| `asrx.json` | 307KB | 1,815 segments, 10 speakers |
| `asr.json` | 413KB | 3,417 segments, full movie coverage (Whisper small) |
| `asrx.json` | **18MB** | **4,188 segments** (fine-grained, ECAPA-TDNN) |
| `asrx_fine.json` | 45MB | 4,188 fine segments + voice embeddings (intermediate) |
| `cut.json` | 329KB | 2,260 scenes |
| `yolo.json` | 181MB | 169,625 frames with object detections |
| `face.json` | **106MB** | 4,550 frames, 5,910 faces @ 8Hz (CoreML 512D) |
| `face_traced.json` | 110MB | Traced faces with identity |
| `face_traced.json` | 110MB | Traced faces with 423 identity traces |
| `lip.json` | 492KB | Lip openness analysis |
| `ocr.json` | 277KB | 606 OCR frames |
| `pose.json` | 26MB | 4,211 pose frames |
@@ -27,93 +28,123 @@
| Stage | Status | Detail |
|-------|--------|--------|
| ASR | ✅ | 3,417 segments, last end 6,773s (100%) |
| ASRX | ✅ | 1,815 segments, 10 speakers |
| Sentence Chunks | ✅ | 3,417 sentence chunks with text |
| Vectorization | ✅ | 3,417 PG + Qdrant (768D) |
| ASRX | ✅ | **4,188 segments** (fine-grained, 10→3 speakers mapped) |
| Sentence Chunks | ✅ | **4,188 sentence chunks** with yolo_objects + face_ids |
| Vectorization | ✅ | 4,188 Qdrant (768D), all 3 collections updated |
| Face Trace | ✅ | 423 traces, 11,820 detections @ 8Hz |
| TKG Graph | ✅ | 498 nodes, 1,617 edges |
| Trace Chunks | ✅ | 423 trace chunks with ASR text |
| Phase 1 Release | ✅ | 483MB package |
| Trace Chunks | ✅ | 423 trace chunks |
| Phase 1 Release | ✅ | 3.0GB package |

## 3. Identity & Knowledge Graph
## 3. Speaker Identification

### TMDb Character Matching (9 characters)
### ASRX Enhancement (3417 → 4188 segments)

| Actor | Traces | Character |
|-------|--------|-----------|
| Audrey Hepburn | 843 | Regina Lampert |
| Cary Grant | 482 | Peter Joshua |
| Jacques Marin | 348 | Inspector Grandpierre |
| James Coburn | 188 | Tex Panthollow |
| Ned Glass | 176 | Leopold W. Gideon |
| George Kennedy | 104 | Herman Scobie |
| Walter Matthau | 104 | Hamilton Bartholomew |
| Dominique Minot | 45 | Sylvie Gaudel |
| Raoul Delfosse | 32 | — |
The original Whisper ASR merges rapid back-and-forth dialogue into single segments. A sliding-window ECAPA-TDNN approach was developed to detect speaker change points within each ASR segment:

### Speaker Bindings (via Lip Verification)
1. **Sliding window**: 1.5s window, 0.75s stride across full audio
2. **ECAPA-TDNN 192D embedding** per window
3. **Classification** against reference centroids (Cary Grant, Audrey Hepburn, Unknown)
4. **Majority-vote smoothing** over 3 adjacent windows
5. **Change point detection** where classified speaker changes
6. **Split** original ASR segment at each change point
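
The windowing, smoothing, change-point, and split steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names are hypothetical, the per-window speaker labels are taken as given (embedding and classification are out of scope here), and placing each boundary at the middle of the overlap between two adjacent windows is an assumption.

```python
from collections import Counter

WINDOW_S = 1.5   # window length from the steps above
STRIDE_S = 0.75  # stride from the steps above


def window_starts(duration_s, window_s=WINDOW_S, stride_s=STRIDE_S):
    """Start times of every full window that fits inside the audio."""
    starts = []
    i = 0
    while i * stride_s + window_s <= duration_s + 1e-9:
        starts.append(i * stride_s)
        i += 1
    return starts


def majority_smooth(labels, k=3):
    """Replace each label by the majority label among k adjacent windows."""
    half = k // 2
    smoothed = []
    for i, original in enumerate(labels):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        top_label, top_count = Counter(neighbourhood).most_common(1)[0]
        # Keep the original label when no label has a strict majority.
        keep_top = top_count > len(neighbourhood) / 2
        smoothed.append(top_label if keep_top else original)
    return smoothed


def change_points(starts, labels, window_s=WINDOW_S):
    """Times at which the label differs between two adjacent windows."""
    points = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            # Assumption: put the boundary in the middle of the window overlap.
            previous_end = starts[i - 1] + window_s
            points.append((starts[i] + previous_end) / 2)
    return points


def split_segment(seg_start, seg_end, points):
    """Split one ASR segment at every change point strictly inside it."""
    inner = [p for p in points if seg_start < p < seg_end]
    cuts = [seg_start] + inner + [seg_end]
    return list(zip(cuts[:-1], cuts[1:]))
```

With six seconds of audio and one isolated mislabelled window, the smoothing removes the outlier and a single change point remains, so an ASR segment spanning it is split in two.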

| Speaker | Identity | Confidence |
|---------|----------|------------|
| SPEAKER_2 | Audrey Hepburn | 61% |
| SPEAKER_4 | Cary Grant | 56% |
| SPEAKER_5 | Audrey Hepburn | 100% |
| SPEAKER_6 | Audrey Hepburn | 43% |
| SPEAKER_7 | Cary Grant | 100% |
| SPEAKER_8 | Audrey Hepburn | 54% |
**Result**: 3,417 → **4,188 segments** (+771, +22.6%). Validated via gender classification (ECAPA-TDNN → 92.3% agreement with character identity).

### TKG Graph
### Speaker Mapping (Centroid-based)

| Node Type | Count |
|-----------|-------|
| Face traces | 423 |
| Objects | 75 |
| Total nodes | 498 |
| Total edges | 1,617 |
| Speaker ID | Name | Segments | Duration | Voice Gender |
|------------|------|----------|----------|-------------|
| SPEAKER_0 | Audrey Hepburn | 1,658 | 2,786s | FEMALE |
| SPEAKER_1 | Cary Grant | 2,033 | 3,962s | MALE |
| SPEAKER_2 | Unknown (minor) | 497 | 806s | MIXED |

### Qdrant Vector Collections
Method: Reference centroids were built from 3,107 known segments (1,420 Cary + 1,689 Audrey). Each fine segment is classified by cosine similarity to the nearest centroid. There is no cross-contamination between speaker clusters.
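
A minimal sketch of nearest-centroid classification by cosine similarity is shown below. It is an illustration under stated assumptions, not the pipeline's implementation: the helper names are hypothetical, the toy vectors are 3-dimensional instead of 192D, and falling back to `Unknown` when the best similarity is below 0.8 is only one possible reading of the "0.8+ similarity" note in the decisions table.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_centroid(embeddings):
    """Mean of the L2-normalised reference embeddings for one speaker."""
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    return [sum(e[d] for e in normed) / len(normed) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify(embedding, centroids, min_similarity=0.8, fallback="Unknown"):
    """Return (speaker, similarity) for the nearest centroid.

    Assumption: segments whose best similarity is below min_similarity
    are labelled with the fallback name.
    """
    best_name, best_sim = max(
        ((name, cosine(embedding, c)) for name, c in centroids.items()),
        key=lambda item: item[1],
    )
    if best_sim >= min_similarity:
        return best_name, best_sim
    return fallback, best_sim
```

Because the centroids are fixed once they are built, the same embedding always receives the same label, and adding a new reference speaker only requires one more centroid rather than any retraining.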

### Gender Validation

Two small clusters (SPEAKER_5: 10 segs, SPEAKER_9: 10 segs) initially paired a MALE voice with an Audrey assignment. Video clip verification confirmed that these are segments where a male voice speaks while Audrey is on screen (the old face-based matching was incorrect). The fine-grained segmentation correctly resolves these.

## 4. Sentence Chunks — Full Migration

All 4,188 fine segments were written to `dev.chunks` with complete data per chunk:

| Chunk Field | Value | Source |
|-------------|-------|--------|
| `start_time`/`end_time` | Fine segment boundaries | `asrx_fine.json` |
| `start_frame`/`end_frame` | time × 25fps | Calculated |
| `content` | `{data: {text, text_normalized}, rule: rule_1}` | ASR text |
| `metadata.yolo_objects` | Dedup class names in frame range | `pre_chunks(yolo)` |
| `metadata.face_ids` | Trace IDs in frame range | `face_detections` |
| `metadata.speaker_name` | Centroid-matched identity | `asrx_fine.json` |

- 4,158/4,188 chunks have YOLO objects (avg 3-5 object classes)
- 398/4,188 chunks have face IDs (face data covers first ~12 min only)
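
The field derivations in the table above could look roughly like this. It is a sketch under assumptions: the input dictionaries and their keys (`start`, `end`, `text`, `speaker_name`, `frame`, `class_name`) are hypothetical, rounding to the nearest frame is assumed because the table only states "time × 25fps", and the text normalisation shown (trim plus lowercase) is a placeholder.

```python
FPS = 25  # frame rate given in the chunk-field table


def time_to_frame(seconds, fps=FPS):
    # Assumption: round to the nearest frame.
    return round(seconds * fps)


def objects_in_range(detections, start_frame, end_frame):
    """Deduplicated, sorted class names inside the inclusive frame range."""
    return sorted({d["class_name"] for d in detections
                   if start_frame <= d["frame"] <= end_frame})


def build_chunk(segment, detections):
    """Assemble one sentence chunk from a fine segment and frame detections."""
    start_frame = time_to_frame(segment["start"])
    end_frame = time_to_frame(segment["end"])
    text = segment["text"]
    return {
        "start_time": segment["start"],
        "end_time": segment["end"],
        "start_frame": start_frame,
        "end_frame": end_frame,
        "content": {
            # Assumption: normalisation is just trim + lowercase here.
            "data": {"text": text, "text_normalized": text.strip().lower()},
            "rule": "rule_1",
        },
        "metadata": {
            "yolo_objects": objects_in_range(detections, start_frame, end_frame),
            "speaker_name": segment["speaker_name"],
        },
    }
```

Storing only the deduplicated class names keeps each chunk small even though the underlying detections are per frame.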

### Parent/Story Chunks

| Metric | Before (v1) | After (v2) |
|--------|-------------|------------|
| Children per parent | 15 (fixed) | 15 (fixed) |
| Total parents | 228 | **280** |
| LLM summaries | 228 (Gemma4) | **280** (Gemma4, regenerated) |
| Qdrant stories | 456 pts | **560 pts** |
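
The parent counts in the table follow directly from grouping sentence chunks into fixed blocks of 15, with the last block allowed to be shorter. The short sketch below reproduces the numbers; the function name is hypothetical, and the two-points-per-parent figure is taken from the report's "280 dialogue + 280 LLM summary" entry for the stories collection.

```python
CHILDREN_PER_PARENT = 15
POINTS_PER_PARENT = 2  # one dialogue vector + one LLM-summary vector per parent


def group_into_parents(chunk_ids, size=CHILDREN_PER_PARENT):
    """Split an ordered list of sentence-chunk IDs into fixed-size parents."""
    return [chunk_ids[i:i + size] for i in range(0, len(chunk_ids), size)]


if __name__ == "__main__":
    for label, total in (("v1", 3417), ("v2", 4188)):
        parents = group_into_parents(list(range(total)))
        points = len(parents) * POINTS_PER_PARENT
        print(label, len(parents), "parents,", points, "story points,",
              len(parents[-1]), "children in the last parent")
```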

## 5. Qdrant Vector Collections

| Collection | Dims | Points | Content | Status |
|-----------|------|--------|---------|--------|
| `momentry_dev_v1` | 768 | 3,417 | Sentence chunk embeddings (pending re-embed with speaker) | ⏳ |
| `momentry_dev_stories` | 768 | 456 | Story dialogue + LLM summary | ✅ |
| `momentry_dev_v1` | 768 | **4,188** | Sentence chunk embeddings (EmbeddingGemma) | ✅ |
| `momentry_dev_stories` | 768 | **560** | 280 dialogue + 280 LLM summary | ✅ |
| `momentry_dev_faces` | 512 | 5,910 | Face embeddings (8Hz CoreML) | ✅ |
| `momentry_dev_voice` | 192 | **1,815** | Voice embeddings (ECAPA-TDNN) | ✅ |
| `story_sentence` | 768 | 0 | Story processor template (to be created) | ⏳ |
| `sentence_summary` | 768 | 0 | LLM 50-character summary (to be created) | ⏳ |
| `momentry_dev_voice` | 192 | **4,188** | Voice embeddings (ECAPA-TDNN) | ✅ |
| `sentence_story` | 768 | **4,188** | Sentence template with speaker | ✅ |
| `sentence_summary` | 768 | **4,188** | Context-aware LLM sentence summary | ✅ |

## 4. Release Package
## 6. ASR Model Selection

A comprehensive benchmark (5 models × 2 VAD settings × 3 test clips = 30 runs) showed:

| Model | Segments | Chars | Runtime | Verdict |
|-------|----------|-------|---------|---------|
| tiny | 56 avg | 1,730 | **9.2s** | Most segments, best text capture |
| **small** | **55 avg** | **1,704** | **17.6s** | **Best balance (current)** |
| base | 42 avg | 1,751 | 10.1s | Good but fewer segments |
| medium | 52 avg | 1,627 | 339.6s | Slow, loses text |
| large-v3 | 20 avg | 1,249 | 68.8s | **Worst**: merges utterances, loses 26% text |

**Conclusion**: Keep `faster-whisper small (VAD 500ms)`. The missing-text problem cannot be solved by changing model size: even tiny captures more text than large-v3. The root cause is that Whisper's segment boundary logic has no speaker-turn detection, which the sliding-window ASRX approach above addresses.
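
As a quick sanity check on the text-loss comparison, the relative change in captured characters can be computed from the Chars column. The sketch assumes `small` as the baseline, since the report does not say which model the large-v3 loss is measured against; under that assumption large-v3 comes out between 26% and 27% below the baseline, in line with the figure quoted in the table.

```python
# Average characters captured per model, copied from the Chars column above.
CHARS = {"tiny": 1730, "small": 1704, "base": 1751, "medium": 1627, "large-v3": 1249}


def relative_change(model, baseline="small", chars=CHARS):
    """Fractional change in captured characters versus the baseline model."""
    return chars[model] / chars[baseline] - 1.0


if __name__ == "__main__":
    for name in CHARS:
        print(f"{name:>9}: {relative_change(name):+.1%}")
```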

## 7. Release Package

| Component | Size |
|-----------|------|
| `output_json/` | 11 processor files |
| `chunks.csv` | 2.2MB |
| `vectors.csv` | 56MB |
| `identities.csv` | 973KB |
| `schema.sql` | 29KB |
| `output_json/` | 13 processor files |
| `chunks.csv` | 3.2MB |
| `vectors.csv` | 58MB |
| `identities.csv` | 1MB |
| `schema.sql` | 30KB |
| Qdrant snapshots (5 collections) | ~3GB |
| `RELEASE_INFO.txt` | Metadata |
| **Total** | **483MB** |
| **Total** | **~3.0GB** |

Location: `release/phase1/v1.0.0_20260509_101337/`

## 5. Key Technical Decisions
## 8. Key Technical Decisions

| Decision | Rationale |
|----------|-----------|
| Face 8Hz (interval=3) | 5-15Hz human lip motion needs ≥8Hz sampling |
| Two-stage face processor | Apple Vision ANE (fast) + CoreML FaceNet (512D) |
| VNFaceprint not used | KVC returns nil in video pipeline |
| Face Qdrant separate collection | Face 512D vs chunk 768D: different dimensions |
| LLM reasoning off | `--reasoning off` needed for non-empty content |
| Voice embedding (ECAPA-TDNN) | SFSpeechAnalyzer does not expose speaker embeddings (Apple has not opened this API) |
| ASRX embeddings bug | `asrx_processor_custom.py` failed to pass embeddings through → fixed |
| Speaker matching method | ASR × ASRX time overlap (any overlap), 99% match rate |
| Story chunk grouping | Fixed 15 ASR segments, 228 parent chunks |
| Sliding window 1.5s/0.75s | Optimal balance: captures turn boundaries without over-splitting |
| Centroid-based classification | 0.8+ similarity, no retraining needed, 100% consistent |
| Word-timestamp ASR for text | Re-run with `word_timestamps=True`, 87% coverage; remaining 13% → per-segment ASR fallback |
| Fixed 15 children/parent | Maintains Phase 1 design consistency |
| `yolo_objects` dedup | Only class names stored per chunk (not per-frame) |
| `face_ids` via `trace_id` | `face_id` column is NULL in DB; `trace_id` is the actual identifier |
| Keep ASR small model | Benchmarked 5 models; larger models lose text, not gain it |
| `app.run(threaded=True)` | Dashboard v2: single-threaded Flask was blocking on subprocess calls |

## 6. Phase 2 Preparation
## 9. Phase 2 Preparation

Pending for Phase 2:
- Rule 3 scene chunking (cut-based parent chunks)
- 5W1H Agent (LLM-generated scene summaries)
- Full pipeline + 5W1H release packaging
- Lip analysis extended to full movie speaker binding
- Source separation (Demucs/HPSS) for overlapping speech scenarios