feat: ASRX hybrid pipeline, identity history, worker fixes, checkpoint system

2026-06-02 07:13:23 +08:00
parent e3066c3f49
commit e1572907ae
198 changed files with 43705 additions and 8910 deletions
--- a/docs_v1.0/M4_workspace/2026-05-27_charade_pipeline_checklist.md
+++ b/docs_v1.0/M4_workspace/2026-05-27_charade_pipeline_checklist.md
@@ -0,0 +1,242 @@
+---
+title: Charade Full Movie Pipeline Checklist
+version: 1.0
+date: 2026-05-27
+author: M5Max48
+status: in_progress
+---
+
+# Charade Full Movie Pipeline Checklist
+
+**File UUID**: `c3c635e3641da80dde10cc555ffcdda5`
+**File Name**: Charade (1963) Cary Grant & Audrey Hepburn | Comedy Mystery Romance Thriller | Full Movie.mp4
+**Duration**: 6785 seconds (113 minutes)
+**Total Frames**: 169,625
+
+---
+
+## P0: Processor Outputs
+
+### Purpose
+Raw processor output files, stored in `/Users/accusys/momentry/output_dev/`. These are the data source for the later ingestion steps.
+
+### Processor Details
+
+| Processor | Expected Output | Size Estimate | Purpose | Status |
+|-----------|-----------------|---------------|---------|--------|
+| CUT | `c3c635e3641da80dde10cc555ffcdda5.cut.json` | ~170KB | Scene boundary detection; cut points feed Rule 3 chunking | ✅ Done |
+| YOLO | `c3c635e3641da80dde10cc555ffcdda5.yolo.json` | ~50-80MB | Object detection: per-frame object classes and positions | 🔄 Running |
+| Face | `c3c635e3641da80dde10cc555ffcdda5.face.json` | ~1.5GB | Face detection + 512-dim embedding (FaceNet CoreML) | 🔄 44% |
+| Face Traced | `c3c635e3641da80dde10cc555ffcdda5.face_traced.json` | ~1.2GB | Face tracking: consecutive appearances of the same person → trace_id | ⏳ Pending (after Face) |
+| OCR | `c3c635e3641da80dde10cc555ffcdda5.ocr.json` | ~50KB | Text recognition from frames | ❌ Skipped |
+| Pose | `c3c635e3641da80dde10cc555ffcdda5.pose.json` | ~20MB | Body pose estimation | 🔄 Running |
+| ASRX | `c3c635e3641da80dde10cc555ffcdda5.asrx.json` | ~8MB | Speaker diarization (speaker segmentation) | ✅ Done (reuse from public) |
+| Visual Chunk | `c3c635e3641da80dde10cc555ffcdda5.visual_chunk.json` | ~60KB | Visual scene chunk metadata | ✅ Done |
+| Scene | `c3c635e3641da80dde10cc555ffcdda5.scene.json` | ~300B | Scene list from CUT | ✅ Done |
+| Scene Meta | `c3c635e3641da80dde10cc555ffcdda5.scene_meta.json` | ~50KB | Heuristic scene metadata (person and object statistics) | ⏳ Pending |
+| Story LLM | `c3c635e3641da80dde10cc555ffcdda5.story_llm.json` | ~800KB | LLM-generated story summaries per chunk | ✅ Done |
+| Story Story | `c3c635e3641da80dde10cc555ffcdda5.story_story.json` | ~800KB | Story parent-child relationships | ✅ Done |
+| TMDb | `c3c635e3641da80dde10cc555ffcdda5.tmdb.json` | ~5KB | TMDb cast list with face embeddings | ⏳ Pending |
+| 5W1H | `c3c635e3641da80dde10cc555ffcdda5.5w1h.json` | ~500KB | 5W1H agent output (who/when/where/what/why/how) | ✅ Done |
+
+### Key Dependencies
+- Face Traced can only run after Face completes (face_traced.json = face.json + tracking)
+- Scene Meta requires both Face and YOLO to complete
+- TMDb matching runs after Face Traced completes
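+
+The ordering constraints above can be sketched as a small dependency graph. This is a minimal illustration using Python's standard `graphlib`; the processor keys (`face`, `yolo`, and so on) are hypothetical labels chosen for the example, not names taken from the worker code.
+
+```python
+from graphlib import TopologicalSorter
+
+# Processor -> set of processors that must finish first (from the list above).
+# The key names are illustrative assumptions.
+DEPENDENCIES = {
+    "face_traced": {"face"},
+    "scene_meta": {"face", "yolo"},
+    "tmdb": {"face_traced"},
+}
+
+
+def run_order(deps):
+    """Return one valid execution order that respects every dependency."""
+    return list(TopologicalSorter(deps).static_order())
+
+
+order = run_order(DEPENDENCIES)
+print(order)
+```
+
+Any order it returns places `face` before `face_traced`, and `face_traced` before `tmdb`.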
+
+---
+
+## P1: Database Records
+
+### Purpose
+Store the processor outputs in PostgreSQL so that API queries can use them.
+
+### Table Details
+
+| Table | Expected Records | Purpose | Verification Query | Status |
+|-------|------------------|---------|-------------------|--------|
+| `dev.videos` | 1 row | Video metadata (duration, fps, status) | `SELECT file_uuid, status FROM dev.videos WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ✅ Registered |
+| `dev.monitor_jobs` | 1 row | Processing job state machine | `SELECT uuid, status, completed_processors FROM dev.monitor_jobs WHERE uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | 🔄 Running |
+| `dev.pre_chunks` | ~7,000 rows | Raw processor outputs (ASR sentences, YOLO objects, etc.) | `SELECT COUNT(*) FROM dev.pre_chunks WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+| `dev.face_detections` | ~70,000 rows | Face detection records (one per face per frame) | `SELECT COUNT(*) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+| `dev.face_detections.embedding` | ~70,000 non-NULL | 512-dim FaceNet embedding (used for identity matching) | `SELECT COUNT(embedding) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+| `dev.face_detections.trace_id` | ~70,000 non-NULL | Face tracking ID (same person appearing across consecutive frames) | `SELECT COUNT(trace_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+| `dev.face_detections.identity_id` | ~50,000 non-NULL | TMDb identity binding (Audrey, Cary, etc.) | `SELECT COUNT(identity_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+
+### Key Points
+- `embedding` must be non-NULL for TMDb matching to work (an earlier `store_traced_faces.py` bug, now fixed)
+- `trace_id` is computed by `store_traced_faces.py` from face_traced.json
+- `identity_id` is computed by `match_faces_to_tmdb.py` (cosine similarity > 0.5)
+
+---
+
+## P2: Chunk Ingestion
+
+### Purpose
+Convert the raw processor outputs into searchable chunks for RAG queries.
+
+### Chunk Types
+
+| Chunk Type | Expected Count | Purpose | Source | Verification Query | Status |
+|------------|----------------|---------|--------|-------------------|--------|
+| sentence (Rule 1) | ~1,700 | Sentence-level chunks for text search | ASR output → sentence split | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'sentence'` | ⏳ Pending |
+| llm_parent | ~800 | LLM-generated summary parent chunks | Story LLM output | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'llm_parent'` | ⏳ Pending |
+| story_parent | ~800 | Story parent chunks (narrative segments) | Story processor | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'story_parent'` | ⏳ Pending |
+| story_child | ~1,700 | Story child chunks (linked to sentence) | Story processor | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'story_child'` | ⏳ Pending |
+| cut (Rule 3) | ~500 | Scene-level chunks for scene search | CUT output → scene boundaries | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'cut'` | ⏳ Pending |
+| trace | ~3,600 | Face trace chunks (identity-centric) | Face Traced output | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'trace'` | ⏳ Pending |
+
+### Ingestion Pipeline
+1. **Rule 1**: ASR → sentence split → chunk + embedding → Qdrant
+2. **Rule 3**: CUT + ASR → scene chunks → chunk + embedding → Qdrant
+3. **Trace**: Face Traced → trace chunks → TKG nodes → Qdrant
+
+### Key Points
+- `start_frame` / `end_frame` must be computed correctly (earlier bug: frame=0)
+- Chunks must have an `embedding` to be searchable
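+
+As a rough illustration of the frame calculation, the sketch below derives the frame rate from the file header above (169,625 frames over 6785 seconds) and converts chunk timestamps to frame indices. The function names and the round-to-nearest convention are assumptions; the actual ingestion code may floor instead.
+
+```python
+FPS = 169_625 / 6785  # total frames / duration in seconds = 25.0
+
+
+def time_to_frame(seconds, fps=FPS):
+    """Convert a timestamp in seconds to the nearest frame index."""
+    return round(seconds * fps)
+
+
+def chunk_frames(start_time, end_time, fps=FPS):
+    """Return (start_frame, end_frame) for a chunk, rejecting inverted ranges."""
+    start_frame = time_to_frame(start_time, fps)
+    end_frame = time_to_frame(end_time, fps)
+    if end_frame < start_frame:
+        raise ValueError("end_time precedes start_time")
+    return start_frame, end_frame
+
+
+print(chunk_frames(355.0, 357.0))
+```
+
+A chunk that starts well into the film but reports frame 0 is the symptom of the earlier bug.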
+
+---
+
+## P3: Vector Embeddings
+
+### Purpose
+Convert chunk text into 768-dim embeddings and store them in PostgreSQL and Qdrant for semantic search.
+
+### Embedding Targets
+
+| Target | Expected Count | Model | Purpose | Verification | Status |
+|--------|----------------|-------|---------|--------------|--------|
+| PostgreSQL `dev.chunk.embedding` | ~5,000 | Gemma-2-9B (768-dim) | Text semantic search | `SELECT COUNT(embedding) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
+| Qdrant `momentry_dev_rule1_v2` | ~5,000 points | Gemma-2-9B | Fast vector similarity search | `curl -H "api-key: Test3200Test3200Test3200" "http://localhost:6333/collections/momentry_dev_rule1_v2"` | ⏳ Pending |
+| Qdrant `_face` collection | ~70,000 points | FaceNet-512 (512-dim) | Face identity search | Face embeddings sync via `sync_face_embeddings()` | ⏳ Pending |
+
+### Embedding Pipeline
+1. **Text chunks**: `embeddinggemma_server.py` (port 11436) → 768-dim embedding
+2. **Face embeddings**: FaceNet CoreML (from face.json) → 512-dim embedding (already generated in P0)
+3. **Sync to Qdrant**: `sync_face_embeddings()` function in Rust
+
+### Key Points
+- Text embeddings use Gemma-2-9B (local LLM server)
+- Face embeddings use FaceNet-512 (CoreML, ANE accelerated)
+- Qdrant provides fast similarity search (cosine similarity)
+
+---
+
+## P4: Identity Binding
+
+### Purpose
+Bind detected faces to TMDb identities (Audrey Hepburn, Cary Grant, etc.) for identity_text search.
+
+### Identity Matching Pipeline
+
+| Step | Expected Result | Method | Verification | Status |
+|------|-----------------|--------|--------------|--------|
+| TMDb seeds loaded | 23 identities | `tmdb_embed_extractor.py` → TMDb profile face embeddings | `SELECT COUNT(*) FROM dev.identities WHERE source = 'tmdb' AND face_embedding IS NOT NULL` | ✅ Done |
+| Face matching | ~50,000 bindings | `match_faces_to_tmdb.py` → cosine similarity > 0.5 | `SELECT COUNT(identity_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND identity_id IS NOT NULL` | ⏳ Pending |
+| Audrey Hepburn faces | ~16,000 | Highest similarity match | `SELECT COUNT(*) FROM dev.face_detections fd JOIN dev.identities i ON fd.identity_id = i.id WHERE fd.file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND i.name = 'Audrey Hepburn'` | ⏳ Pending |
+| Cary Grant faces | ~5,000 | Second highest match | Same query for Cary Grant | ⏳ Pending |
+
+### Matching Algorithm
+```python
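+# match_faces_to_tmdb.py (sketch; traces = {trace_id: [embedding, ...]} is an assumed shape)
+import numpy as np
+def cosine_similarity(a, b):
+    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+def match_traces(traces, tmdb_identities, threshold=0.5):
+    best = {}  # trace_id -> (tmdb_identity, similarity); highest similarity wins
+    for trace_id, trace_faces in traces.items():
+        for face_embedding in trace_faces:
+            for tmdb_identity in tmdb_identities:
+                similarity = cosine_similarity(face_embedding, tmdb_identity.face_embedding)
+                if similarity >= threshold and similarity > best.get(trace_id, (None, 0.0))[1]:
+                    best[trace_id] = (tmdb_identity, similarity)
+    return best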
+```
+
+### Key Points
+- TMDb seeds need a `face_embedding` (previously verified: 23 identities with embeddings)
+- Face `embedding` must be non-NULL (an earlier `store_traced_faces.py` bug, now fixed)
+- Threshold: 0.5 (adjustable)
+
+---
+
+## P5: API Endpoints
+
+### Purpose
+Verify that the API endpoints correctly return identity_text search results.
+
+### API Tests
+
+| Endpoint | Purpose | Expected Response | Test Command | Status |
+|----------|---------|-------------------|--------------|--------|
+| `/api/v1/search/identity_text` | Search chunk text → identities | Results with `identity_name`, `trace_id`, `identity_source` | `curl "http://localhost:3003/api/v1/search/identity_text?file_uuid=c3c635e3641da80dde10cc555ffcdda5&q=Regina&limit=5"` | ⏳ Pending |
+| `/api/v1/identities` | List identities with TMDb | Identity list with `tmdb_id`, `face_embedding` | `curl "http://localhost:3003/api/v1/identities?name=Audrey"` | ⏳ Pending |
+| `/api/v1/progress/:file_uuid` | Check processing progress | JSON with `status`, `completed_processors` | `curl "http://localhost:3003/api/v1/progress/c3c635e3641da80dde10cc555ffcdda5"` | ⏳ Pending |
+
+### Expected API Response Example
+```json
+{
+  "success": true,
+  "total": 5,
+  "results": [
+    {
+      "chunk_id": "sentence_123",
+      "start_time": 355.0,
+      "text_content": "Oh, mine's Regina Lampert.",
+      "identity_id": 9,
+      "identity_name": "Audrey Hepburn",
+      "identity_source": "tmdb",
+      "trace_id": 169
+    }
+  ]
+}
+```
+
+### Key Points
+- The `identity_text` API needs correct `chunk.start_frame` / `chunk.end_frame` values (earlier bug: frame=0)
+- `identity_id` must be non-NULL for `identity_name` to be returned
+
+---
+
+## P6: Completion Criteria
+
+### Purpose
+Verify that the pipeline has fully completed and that every ingestion step succeeded.
+
+### Final Verification Checklist
+
+| Criteria | Purpose | Check Command | Expected Result | Status |
+|----------|---------|---------------|-----------------|--------|
+| All processor outputs exist | Confirm every processor JSON file was produced | `ls -la output_dev/c3c635e3641da80dde10cc555ffcdda5.*` | 14+ files with size > 0 | ⏳ Pending |
+| Job status = completed | Confirm the worker finished the job | `SELECT status FROM dev.monitor_jobs WHERE uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `completed` | ⏳ Pending |
+| Video status = completed | Confirm the video state was updated | `SELECT status FROM dev.videos WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `completed` | ⏳ Pending |
+| All chunks have embeddings | Confirm text embeddings are complete | `SELECT COUNT(*) = COUNT(embedding) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (all chunks have embedding) | ⏳ Pending |
+| Face traces assigned | Confirm face tracking is complete | `SELECT COUNT(*) = COUNT(trace_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (all faces have trace_id) | ⏳ Pending |
+| TMDb matching done | Confirm identity binding is complete | `SELECT COUNT(identity_id) > 40000 FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (> 40K identity bindings) | ⏳ Pending |
+| Qdrant synced | Confirm vector search is ready | Check Qdrant points count | Points increased by ~5,000 | ⏳ Pending |
+
+### Success Thresholds
+- **Face detections**: ~70,000 (169K frames / 3 sample interval)
+- **Identity bindings**: > 40,000 (60% match rate)
+- **Chunks with embeddings**: > 4,000 (all chunk types)
+- **Qdrant points**: > 90,000 (current) → > 95,000 (after Charade)
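+
+The arithmetic behind the first two thresholds can be checked with a short sketch. It assumes that "3 sample interval" means every third frame is sampled; the variable names are illustrative only.
+
+```python
+TOTAL_FRAMES = 169_625
+SAMPLE_INTERVAL = 3      # assumed: every third frame is sampled
+EXPECTED_FACES = 70_000  # expected face detections
+MIN_BINDINGS = 40_000    # identity-binding floor
+
+sampled_frames = TOTAL_FRAMES // SAMPLE_INTERVAL
+faces_per_sampled_frame = EXPECTED_FACES / sampled_frames
+min_match_rate = MIN_BINDINGS / EXPECTED_FACES
+
+print(f"sampled frames:          {sampled_frames}")
+print(f"faces per sampled frame: {faces_per_sampled_frame:.2f}")
+print(f"minimum match rate:      {min_match_rate:.1%}")
+```
+
+Under that assumption, about 56,500 frames are sampled, so ~70,000 detections implies roughly 1.2 faces per sampled frame, and the 40,000 floor corresponds to a match rate of about 57%.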
+
+---
+
+## Verification Script
+
+```bash
+# Run after completion
+./scripts/verify_charade_pipeline.sh c3c635e3641da80dde10cc555ffcdda5
+```
+
+---
+
+## Notes
+
+- OCR processor failed, skipped
+- Face detection using SwiftFace (ANE accelerated)
+- TMDb matching using `scripts/match_faces_to_tmdb.py`
+- Expected total processing time: ~2-3 hours
+
+---
+
+## Version History
+
+| Version | Date | Author | Changes |
+|---------|------|--------|---------|
+| 1.0 | 2026-05-27 | M5Max48 | Initial checklist |
--- a/docs_v1.0/M4_workspace/2026-05-29_identity_sync_and_wp_fixes.md
+++ b/docs_v1.0/M4_workspace/2026-05-29_identity_sync_and_wp_fixes.md
@@ -0,0 +1,49 @@
+# Session Summary: Identity Fixes + WP Proxy Fixes + Data Sync
+
+**Date**: 2026-05-29
+**Author**: OpenCode
+**Status**: Completed (marcom team testing)
+
+## What Was Done (Chronological)
+
+### 1. Production Identity Fixes (3002)
+- **James Coburn restored** (id=18738, confirmed)
+- **Chantal Goya restored** (id=18737, confirmed)
+- **Louis Viret name/status fixed**
+- **Sequences fixed**: `identities_id_seq` (48→18734), `face_detections_id_seq` (141383→932413), `identity_history_id_seq`, `identity_bindings_id_seq`, `pre_chunks_id_seq`, `file_identities_id_seq`
+- **COALESCE fix** for `reference_data` NULL crash (`postgres_db.rs:3198`, `storage.rs:196`)
+
+### 2. Bug Fixes
+- **DELETE identity**: Fixed binding order bug + removed `identity_confidence` column reference
+- **PATCH identity**: Nested JSON metadata is now merged with `jsonb_deep_merge`
+- **mergeinto UNDO/REDO**: MongoDB deserialization fix (`Collection<Document>`)
+
+### 3. Library Page Infinite Load Fix
+- **Root cause**: WP scan proxy (snippet 48) didn't forward query params → infinite pagination loop
+- **Fix**: Added `$request->get_query_params()` forwarding in scan proxy
+- **Safety**: Added `maxPages = 10` limit in JS pagination
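+
+The failure mode and the guard can be illustrated with a small simulation. The real code is JavaScript (snippet 55) behind a WordPress proxy (snippet 48); the Python below is only a sketch, and the function names and the `(items, has_more)` response shape are hypothetical.
+
+```python
+MAX_PAGES = 10  # mirrors the maxPages = 10 guard
+
+
+def fetch_all(fetch_page, max_pages=MAX_PAGES):
+    """Collect items page by page, stopping at the last page or at the guard."""
+    items = []
+    for page in range(1, max_pages + 1):
+        batch, has_more = fetch_page(page)
+        items.extend(batch)
+        if not has_more:
+            break
+    return items
+
+
+def broken_proxy(page):
+    """Simulates the bug: the page parameter is dropped, so page 1 always comes back."""
+    return ["file-a", "file-b"], True
+
+
+def fixed_proxy(page):
+    """Simulates the fix: the page parameter is honoured and three pages exist."""
+    return [f"file-{page}"], page < 3
+
+
+print(len(fetch_all(broken_proxy)))
+print(fetch_all(fixed_proxy))
+```
+
+Without the guard, the broken proxy would keep the loop running indefinitely; with it, the loop stops after ten requests even if the proxy regresses.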
+
+### 4. Identity Data Sync (Dev → Production)
+- **Full replacement** of `public.identities`, `public.identity_bindings`, `public.identity_history` with dev data
+- James Coburn id: 18738 → 11
+- Bindings: 11,892 → 12,834 (+942)
+- **Verification**: 0 differences between schemas
+
+### 5. Snippet 55 Filter
+- Added `.filter(f => f.is_registered)` to show only registered files on library page
+- Changed `status:'unregistered'` → `status: f.status || 'unregistered'`
+
+## Key Decisions
+- Library page filter: default show registered files only
+- Identity sync: full DELETE + INSERT (not UPDATE) to ensure consistency
+- No user-defined metadata fields (starred/notes/role) preserved — matches dev exactly
+
+## Handoff to Marcom
+- `/people/` page should show correct identity state
+- `/library/` page should show only registered files (4 currently)
+- Login required for `/library/` — redirects to `/login/` if not authenticated
+
+## Files Modified
+- `snippet 48` (/scan WP proxy — query param forwarding)
+- `snippet 55` (library page JS — registered-only filter, maxPages safety)
+- `docs_v1.0/M4_workspace/2026-05-29_identity_sync_prod.md` (sync record)
--- a/docs_v1.0/M4_workspace/2026-05-29_identity_sync_prod.md
+++ b/docs_v1.0/M4_workspace/2026-05-29_identity_sync_prod.md
@@ -0,0 +1,45 @@
+# Identity Data Sync: Dev (3003) → Production (3002)
+
+**Date**: 2026-05-29
+**Author**: OpenCode
+**Status**: Completed
+
+## Summary
+
+Fully synced all identity-related tables from dev schema to public schema on PostgreSQL `momentry` database.
+
+## What Was Done
+
+1. **Identities table** (`public.identities`): Replaced with `dev.identities` (69 records, original ids preserved)
+2. **Identity_bindings** (`public.identity_bindings`): Replaced with `dev.identity_bindings` (12,834 records)
+3. **Identity_history** (`public.identity_history`): Replaced with `dev.identity_history` (10 records)
+4. **Sequences**: Updated `identities_id_seq`, `identity_bindings_id_seq`, `identity_history_id_seq` to match
+
+### Key Changes
+- **James Coburn**: Changed from id=18738 → id=11 (dev's original id)
+- **Chantal Goya**: Changed from id=18737 → id=18736 (dev's id)
+- **Metadata**: Now matches dev schema — TMDB fields only, no user-defined fields (starred, notes, role, aliases, user_confirmed are removed as expected)
+- **Bindings**: Increased from 11,892 → 12,834 (+942 bindings)
+
+### Not Changed
+- `face_detections` — identical in both schemas (135,521 records)
+- `pre_chunks` — large difference (public: 1.3M vs dev: 3.3M) but NOT related to identity
+- All other non-identity tables unchanged
+
+## Verification
+
+```text
+-- Counts match
+identities:        69 = 69 ✅
+identity_bindings: 12,834 = 12,834 ✅
+identity_history:  10 = 10 ✅
+
+-- No differences
+id/uuid mismatch:         0
+metadata/status/name diffs: 0
+```
+
+## Files Referenced
+
+- `AGENTS.md` — Development isolation rules
+- `/Users/accusys/momentry_core/docs_v1.0/M4_workspace/2026-05-29_wp_api_url_update.md` — Previous session handoff
--- a/docs_v1.0/M4_workspace/2026-05-29_mergeinto_null_faceid_fix.md
+++ b/docs_v1.0/M4_workspace/2026-05-29_mergeinto_null_faceid_fix.md
@@ -0,0 +1,27 @@
+# 2026-05-29: Mergeinto NULL face_id Fix
+
+## Problem
+Production server (3002) returned `"error":"error occurred while decoding column 0: unexpected null; try decoding as an 'Option'"` when using mergeinto after clicking undo on a merge.
+
+## Root Cause
+`src/api/identity_binding.rs:428` decodes `face_id` from `face_detections` as `String` (non-Option), but **135,521 records** in the production `face_detections` table have NULL `face_id`. When merging an identity whose face_detections include NULL face_ids, the SQLx decode fails with the error above.
+
+## Fix
+- Changed `(String, Option<i32>)` → `(Option<String>, Option<i32>)` at line 428
+- Changed `face_id_list` to use `filter_map` instead of `map` to skip NULL face_ids
+- Changed `faces_count` to use `face_id_list.len()` instead of `face_ids.len()` (matching the actual transferred count)
+
+## Files Changed
+- `momentry_core/src/api/identity_binding.rs` — 3 lines changed
+
+## Verification
+- 234 library tests pass
+- `cargo fmt` passes
+- Production binary rebuilt (`target/release/momentry`)
+- Production server restarted on port 3002 (PID 92043)
+
+## Identities with NULL face_id (20 identities, ~135k records)
+Audrey Hepburn (36k), Cary Grant (15k), Bernard Musson, Walter Matthau, Jacques Marin, George Kennedy, Michel Thomass, Antonio Passalia, etc. — all `type=people, status=confirmed`. These identities were likely imported from bulk face detection data without face_id generation.
+
+## Data Note
+The NULL face_ids are a pre-existing data quality issue. The fix prevents crashes but doesn't clean up the NULL data. Faces with NULL face_id won't be tracked in undo history (they stay with the target after undo), but the bulk transfer (`WHERE identity_id = $1`) still works correctly.