feat: ASRX hybrid pipeline, identity history, worker fixes, checkpoint system

Accusys
2026-06-02 07:13:23 +08:00
parent e3066c3f49
commit e1572907ae
198 changed files with 43705 additions and 8910 deletions

@@ -0,0 +1,242 @@
---
title: Charade Full Movie Pipeline Checklist
version: 1.0
date: 2026-05-27
author: M5Max48
status: in_progress
---
# Charade Full Movie Pipeline Checklist
**File UUID**: `c3c635e3641da80dde10cc555ffcdda5`
**File Name**: Charade (1963) Cary Grant & Audrey Hepburn | Comedy Mystery Romance Thriller | Full Movie.mp4
**Duration**: 6785 seconds (113 minutes)
**Total Frames**: 169,625
---
## P0: Processor Outputs
### Purpose
Raw processor output files, stored in `/Users/accusys/momentry/output_dev/`. These are the data source for the later ingestion steps.
### Processor Details
| Processor | Expected Output | Size Estimate | Purpose | Status |
|-----------|-----------------|---------------|---------|--------|
| CUT | `c3c635e3641da80dde10cc555ffcdda5.cut.json` | ~170KB | Scene boundary detection; cut points are used for Rule 3 chunking | ✅ Done |
| YOLO | `c3c635e3641da80dde10cc555ffcdda5.yolo.json` | ~50-80MB | Object detection; object classes and positions per frame | 🔄 Running |
| Face | `c3c635e3641da80dde10cc555ffcdda5.face.json` | ~1.5GB | Face detection + 512-dim embedding (FaceNet CoreML) | 🔄 44% |
| Face Traced | `c3c635e3641da80dde10cc555ffcdda5.face_traced.json` | ~1.2GB | Face tracking; consecutive appearances of the same person → trace_id | ⏳ Pending (after Face) |
| OCR | `c3c635e3641da80dde10cc555ffcdda5.ocr.json` | ~50KB | Text recognition from frames | ❌ Skipped |
| Pose | `c3c635e3641da80dde10cc555ffcdda5.pose.json` | ~20MB | Body pose estimation | 🔄 Running |
| ASRX | `c3c635e3641da80dde10cc555ffcdda5.asrx.json` | ~8MB | Speaker diarization (speaker segmentation) | ✅ Done (reuse from public) |
| Visual Chunk | `c3c635e3641da80dde10cc555ffcdda5.visual_chunk.json` | ~60KB | Visual scene chunk metadata | ✅ Done |
| Scene | `c3c635e3641da80dde10cc555ffcdda5.scene.json` | ~300B | Scene list from CUT | ✅ Done |
| Scene Meta | `c3c635e3641da80dde10cc555ffcdda5.scene_meta.json` | ~50KB | Heuristic scene metadata (people + object statistics) | ⏳ Pending |
| Story LLM | `c3c635e3641da80dde10cc555ffcdda5.story_llm.json` | ~800KB | LLM-generated story summaries per chunk | ✅ Done |
| Story Story | `c3c635e3641da80dde10cc555ffcdda5.story_story.json` | ~800KB | Story parent-child relationships | ✅ Done |
| TMDb | `c3c635e3641da80dde10cc555ffcdda5.tmdb.json` | ~5KB | TMDb cast list with face embeddings | ⏳ Pending |
| 5W1H | `c3c635e3641da80dde10cc555ffcdda5.5w1h.json` | ~500KB | 5W1H agent output (who/when/where/what/why/how) | ✅ Done |
### Key Dependencies
- Face Traced can run only after Face completes (`face_traced.json` = `face.json` + tracking)
- Scene Meta requires Face and YOLO to complete
- TMDb matching runs after Face Traced completes
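
The dependencies above form a small directed graph, so a valid run order can be derived rather than hard-coded. The following is a minimal sketch using Python's standard `graphlib`; the lowercase processor names and the `run_order` helper are illustrative assumptions, not names from the worker code.

```python
from graphlib import TopologicalSorter

# Processor -> set of processors it depends on (taken from the list above).
# The lowercase names are illustrative, not the worker's real identifiers.
DEPENDENCIES = {
    "face_traced": {"face"},
    "scene_meta": {"face", "yolo"},
    "tmdb": {"face_traced"},
}


def run_order(dependencies):
    """Return one valid execution order in which every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
print(order)
```

Processors without listed dependencies (CUT, Pose, ASRX, and so on) can run at any point and are omitted from the graph here.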
---
## P1: Database Records
### Purpose
Store processor outputs in PostgreSQL for API queries.
### Table Details
| Table | Expected Records | Purpose | Verification Query | Status |
|-------|------------------|---------|-------------------|--------|
| `dev.videos` | 1 row | Video metadata (duration, fps, status) | `SELECT file_uuid, status FROM dev.videos WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ✅ Registered |
| `dev.monitor_jobs` | 1 row | Processing job state machine | `SELECT uuid, status, completed_processors FROM dev.monitor_jobs WHERE uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | 🔄 Running |
| `dev.pre_chunks` | ~7,000 rows | Raw processor outputs (ASR sentences, YOLO objects, etc.) | `SELECT COUNT(*) FROM dev.pre_chunks WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
| `dev.face_detections` | ~70,000 rows | Face detection records (one row per face per frame) | `SELECT COUNT(*) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
| `dev.face_detections.embedding` | ~70,000 non-NULL | 512-dim FaceNet embedding (used for identity matching) | `SELECT COUNT(embedding) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
| `dev.face_detections.trace_id` | ~70,000 non-NULL | Face tracking ID (consecutive appearances of the same person across frames) | `SELECT COUNT(trace_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
| `dev.face_detections.identity_id` | ~50,000 non-NULL | TMDb identity binding (Audrey, Cary, etc.) | `SELECT COUNT(identity_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
### Key Points
- `embedding` must be non-NULL before TMDb matching can run (a previous `store_traced_faces.py` bug, now fixed)
- `trace_id` is computed by `store_traced_faces.py` from `face_traced.json`
- `identity_id` is computed by `match_faces_to_tmdb.py` (cosine similarity > 0.5)
---
## P2: Chunk Ingestion
### Purpose
Convert raw processor outputs into searchable chunks for RAG queries.
### Chunk Types
| Chunk Type | Expected Count | Purpose | Source | Verification Query | Status |
|------------|----------------|---------|--------|-------------------|--------|
| sentence (Rule 1) | ~1,700 | Sentence-level chunks for text search | ASR output → sentence split | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'sentence'` | ⏳ Pending |
| llm_parent | ~800 | LLM-generated summary parent chunks | Story LLM output | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'llm_parent'` | ⏳ Pending |
| story_parent | ~800 | Story parent chunks (narrative segments) | Story processor | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'story_parent'` | ⏳ Pending |
| story_child | ~1,700 | Story child chunks (linked to sentence) | Story processor | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'story_child'` | ⏳ Pending |
| cut (Rule 3) | ~500 | Scene-level chunks for scene search | CUT output → scene boundaries | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'cut'` | ⏳ Pending |
| trace | ~3,600 | Face trace chunks (identity-centric) | Face Traced output | `SELECT COUNT(*) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND chunk_type = 'trace'` | ⏳ Pending |
### Ingestion Pipeline
1. **Rule 1**: ASR → sentence split → chunk + embedding → Qdrant
2. **Rule 3**: CUT + ASR → scene chunks → chunk + embedding → Qdrant
3. **Trace**: Face Traced → trace chunks → TKG nodes → Qdrant
### Key Points
- `start_frame` / `end_frame` must be computed correctly (previous bug: frame=0)
- Chunks must have an `embedding` to be searchable
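
As a worked example of the frame calculation: 169,625 frames over 6785 seconds is exactly 25 fps, so a chunk's frame range follows directly from its timestamps. The sketch below assumes round-to-nearest conversion and uses hypothetical helper names (`time_to_frame`, `chunk_frames`); the ingestion code may use a different rounding rule.

```python
TOTAL_FRAMES = 169_625
DURATION_SECONDS = 6785
FPS = TOTAL_FRAMES / DURATION_SECONDS  # 25.0 for this file


def time_to_frame(seconds, fps=FPS):
    """Convert a timestamp in seconds to the nearest frame index (assumed rule)."""
    return int(round(seconds * fps))


def chunk_frames(start_time, end_time, fps=FPS):
    """Return (start_frame, end_frame), never letting end precede start."""
    start_frame = time_to_frame(start_time, fps)
    end_frame = max(start_frame, time_to_frame(end_time, fps))
    return start_frame, end_frame


# start_time 355.0 is taken from the sample API response; 357.2 is made up.
frames = chunk_frames(355.0, 357.2)
print(frames)
```

A chunk whose timestamps are non-zero but whose frames both come out as 0 is the symptom of the earlier `frame=0` bug.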
---
## P3: Vector Embeddings
### Purpose
Convert chunk text into 768-dim embeddings and store them in PostgreSQL and Qdrant for semantic search.
### Embedding Targets
| Target | Expected Count | Model | Purpose | Verification | Status |
|--------|----------------|-------|---------|--------------|--------|
| PostgreSQL `dev.chunk.embedding` | ~5,000 | Gemma-2-9B (768-dim) | Text semantic search | `SELECT COUNT(embedding) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | ⏳ Pending |
| Qdrant `momentry_dev_rule1_v2` | ~5,000 points | Gemma-2-9B | Fast vector similarity search | `curl -H "api-key: Test3200Test3200Test3200" "http://localhost:6333/collections/momentry_dev_rule1_v2"` | ⏳ Pending |
| Qdrant `_face` collection | ~70,000 points | FaceNet-512 (512-dim) | Face identity search | Face embeddings sync via `sync_face_embeddings()` | ⏳ Pending |
### Embedding Pipeline
1. **Text chunks**: `embeddinggemma_server.py` (port 11436) → 768-dim embedding
2. **Face embeddings**: FaceNet CoreML (from face.json) → 512-dim embedding (already generated in P0)
3. **Sync to Qdrant**: `sync_face_embeddings()` function in Rust
### Key Points
- Text embeddings use Gemma-2-9B (local LLM server)
- Face embeddings use FaceNet-512 (CoreML, ANE accelerated)
- Qdrant provides fast similarity search (cosine similarity)
---
## P4: Identity Binding
### Purpose
Bind detected faces to TMDb identities (Audrey Hepburn, Cary Grant, etc.) for `identity_text` search.
### Identity Matching Pipeline
| Step | Expected Result | Method | Verification | Status |
|------|-----------------|--------|--------------|--------|
| TMDb seeds loaded | 23 identities | `tmdb_embed_extractor.py` → TMDb profile face embeddings | `SELECT COUNT(*) FROM dev.identities WHERE source = 'tmdb' AND face_embedding IS NOT NULL` | ✅ Done |
| Face matching | ~50,000 bindings | `match_faces_to_tmdb.py` → cosine similarity > 0.5 | `SELECT COUNT(identity_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND identity_id IS NOT NULL` | ⏳ Pending |
| Audrey Hepburn faces | ~16,000 | Highest similarity match | `SELECT COUNT(*) FROM dev.face_detections fd JOIN dev.identities i ON fd.identity_id = i.id WHERE fd.file_uuid = 'c3c635e3641da80dde10cc555ffcdda5' AND i.name = 'Audrey Hepburn'` | ⏳ Pending |
| Cary Grant faces | ~5,000 | Second highest match | Same query for Cary Grant | ⏳ Pending |
### Matching Algorithm
```python
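# match_faces_to_tmdb.py (sketch; input shapes are assumptions)
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_traces(traces, tmdb_identities, threshold=0.5):
    matches = {}  # trace_id -> matched TMDb identity
    for trace_id, trace_faces in traces.items():  # assumed: {trace_id: [embedding, ...]}
        for face_embedding in trace_faces:
            for tmdb_identity in tmdb_identities:
                if cosine_similarity(face_embedding, tmdb_identity.face_embedding) >= threshold:
                    matches[trace_id] = tmdb_identity
    return matches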
```
### Key Points
- TMDb seeds need a `face_embedding` (previously verified: 23 identities with embeddings)
- Face `embedding` must be non-NULL (a previous `store_traced_faces.py` bug, now fixed)
- Threshold: 0.5 (adjustable)
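
To illustrate how the adjustable threshold interacts with a "highest similarity" rule, here is a small sketch that picks the single best TMDb seed for a trace and rejects it if the score falls below the threshold. It is an assumption about one reasonable tie-breaking policy, not a description of what `match_faces_to_tmdb.py` does; `best_identity` and the toy 2-dim vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_identity(trace_faces, tmdb_seeds, threshold=0.5):
    """Return (name, score) for the best-matching seed, or None below threshold.

    trace_faces: list of face embeddings belonging to one trace.
    tmdb_seeds:  {identity_name: seed_embedding} (hypothetical shape).
    """
    if not trace_faces:
        return None
    best_name, best_score = None, -1.0
    for name, seed in tmdb_seeds.items():
        # Score a seed by its best-matching face within the trace.
        score = max(cosine_similarity(face, seed) for face in trace_faces)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else None


# Toy 2-dim vectors standing in for the real 512-dim FaceNet embeddings.
seeds = {"Audrey Hepburn": [1.0, 0.0], "Cary Grant": [0.0, 1.0]}
clear_match = best_identity([[0.9, 0.1], [0.8, 0.3]], seeds)
ambiguous = best_identity([[0.7, 0.7]], seeds, threshold=0.8)
print(clear_match, ambiguous)
```

Raising the threshold trades coverage (fewer bindings) for precision, which is worth keeping in mind when comparing results against the > 40,000 bindings target.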
---
## P5: API Endpoints
### Purpose
Verify that the API endpoints return correct `identity_text` search results.
### API Tests
| Endpoint | Purpose | Expected Response | Test Command | Status |
|----------|---------|-------------------|--------------|--------|
| `/api/v1/search/identity_text` | Search chunk text → identities | Results with `identity_name`, `trace_id`, `identity_source` | `curl "http://localhost:3003/api/v1/search/identity_text?file_uuid=c3c635e3641da80dde10cc555ffcdda5&q=Regina&limit=5"` | ⏳ Pending |
| `/api/v1/identities` | List identities with TMDb | Identity list with `tmdb_id`, `face_embedding` | `curl "http://localhost:3003/api/v1/identities?name=Audrey"` | ⏳ Pending |
| `/api/v1/progress/:file_uuid` | Check processing progress | JSON with `status`, `completed_processors` | `curl "http://localhost:3003/api/v1/progress/c3c635e3641da80dde10cc555ffcdda5"` | ⏳ Pending |
### Expected API Response Example
```json
{
  "success": true,
  "total": 5,
  "results": [
    {
      "chunk_id": "sentence_123",
      "start_time": 355.0,
      "text_content": "Oh, mine's Regina Lampert.",
      "identity_id": 9,
      "identity_name": "Audrey Hepburn",
      "identity_source": "tmdb",
      "trace_id": 169
    }
  ]
}
```
### Key Points
- The `identity_text` API requires correct `chunk.start_frame` / `chunk.end_frame` values (previous bug: frame=0)
- `identity_id` must be non-NULL for `identity_name` to be returned
---
## P6: Completion Criteria
### Purpose
Verify that the pipeline has fully completed and that all ingestion steps succeeded.
### Final Verification Checklist
| Criteria | Purpose | Check Command | Expected Result | Status |
|----------|---------|---------------|-----------------|--------|
| All processor outputs exist | Confirm all processor JSON files were generated | `ls -la output_dev/c3c635e3641da80dde10cc555ffcdda5.*` | 14+ files with size > 0 | ⏳ Pending |
| Job status = completed | Confirm the worker finished the job | `SELECT status FROM dev.monitor_jobs WHERE uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `completed` | ⏳ Pending |
| Video status = completed | Confirm the video state was updated | `SELECT status FROM dev.videos WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `completed` | ⏳ Pending |
| All chunks have embeddings | Confirm text embeddings are complete | `SELECT COUNT(*) = COUNT(embedding) FROM dev.chunk WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (all chunks have embedding) | ⏳ Pending |
| Face traces assigned | Confirm face tracking is complete | `SELECT COUNT(*) = COUNT(trace_id) FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (all faces have trace_id) | ⏳ Pending |
| TMDb matching done | Confirm identity binding is complete | `SELECT COUNT(identity_id) > 40000 FROM dev.face_detections WHERE file_uuid = 'c3c635e3641da80dde10cc555ffcdda5'` | `true` (> 40K identity bindings) | ⏳ Pending |
| Qdrant synced | Confirm vector search is ready | Check Qdrant points count | Points increased by ~5,000 | ⏳ Pending |
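
The first criterion (all processor outputs exist and are non-empty) can also be checked programmatically. The sketch below assumes the `<uuid>.<suffix>.json` naming seen in P0; the `missing_outputs` helper and the suffix list are illustrative and not part of `verify_charade_pipeline.sh`. Since OCR was skipped for this run, its suffix may need to be dropped from the list.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the P0 table (14 processors).
PROCESSOR_SUFFIXES = [
    "cut", "yolo", "face", "face_traced", "ocr", "pose", "asrx",
    "visual_chunk", "scene", "scene_meta", "story_llm", "story_story",
    "tmdb", "5w1h",
]


def missing_outputs(output_dir, file_uuid, suffixes=PROCESSOR_SUFFIXES):
    """Return the suffixes whose <uuid>.<suffix>.json is absent or empty."""
    missing = []
    for suffix in suffixes:
        path = Path(output_dir) / f"{file_uuid}.{suffix}.json"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(suffix)
    return missing


# Small self-contained demo against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    uuid = "c3c635e3641da80dde10cc555ffcdda5"
    (Path(tmp) / f"{uuid}.cut.json").write_text("{}")
    (Path(tmp) / f"{uuid}.ocr.json").write_text("")  # empty file counts as missing
    demo_missing = missing_outputs(tmp, uuid, suffixes=["cut", "ocr", "yolo"])

print(demo_missing)
```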
### Success Thresholds
- **Face detections**: ~70,000 (169K frames / 3 sample interval)
- **Identity bindings**: > 40,000 (60% match rate)
- **Chunks with embeddings**: > 4,000 (all chunk types)
- **Qdrant points**: > 90,000 (current) → > 95,000 (after Charade)
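
A quick arithmetic check of these thresholds, using only the numbers in this checklist. Note that sampling every third frame yields about 56,500 sampled frames, so the ~70,000 detections figure presumably allows for frames containing more than one face; that interpretation is an assumption, not something the checklist states.

```python
TOTAL_FRAMES = 169_625
SAMPLE_INTERVAL = 3
EXPECTED_FACE_DETECTIONS = 70_000
MIN_IDENTITY_BINDINGS = 40_000

# Frames actually inspected when sampling every third frame.
sampled_frames = TOTAL_FRAMES // SAMPLE_INTERVAL

# Average faces per sampled frame implied by the ~70,000 estimate.
faces_per_sampled_frame = EXPECTED_FACE_DETECTIONS / sampled_frames

# Match rate implied by the 40,000 minimum over ~70,000 detections.
min_match_rate = MIN_IDENTITY_BINDINGS / EXPECTED_FACE_DETECTIONS

print(sampled_frames)
print(round(faces_per_sampled_frame, 2))
print(round(min_match_rate, 2))
```

The implied minimum match rate comes out near 57%, which the checklist rounds to 60%.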
---
## Verification Script
```bash
# Run after completion
./scripts/verify_charade_pipeline.sh c3c635e3641da80dde10cc555ffcdda5
```
---
## Notes
- OCR processor failed, skipped
- Face detection using SwiftFace (ANE accelerated)
- TMDb matching using `scripts/match_faces_to_tmdb.py`
- Expected total processing time: ~2-3 hours
---
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-05-27 | M5Max48 | Initial checklist |

@@ -0,0 +1,49 @@
# Session Summary: Identity Fixes + WP Proxy Fixes + Data Sync
**Date**: 2026-05-29
**Author**: OpenCode
**Status**: Completed (marcom team testing)
## What Was Done (Chronological)
### 1. Production Identity Fixes (3002)
- **James Coburn restored** (id=18738, confirmed)
- **Chantal Goya restored** (id=18737, confirmed)
- **Louis Viret name/status fixed**
- **Sequences fixed**: `identities_id_seq` (48→18734), `face_detections_id_seq` (141383→932413), `identity_history_id_seq`, `identity_bindings_id_seq`, `pre_chunks_id_seq`, `file_identities_id_seq`
- **COALESCE fix** for `reference_data` NULL crash (`postgres_db.rs:3198`, `storage.rs:196`)
### 2. Bug Fixes
- **DELETE identity**: Fixed binding order bug + removed `identity_confidence` column reference
- **PATCH identity**: uses `jsonb_deep_merge` to merge nested JSON metadata
- **mergeinto UNDO/REDO**: MongoDB deserialization fix (`Collection<Document>`)
### 3. Library Page Infinite Load Fix
- **Root cause**: WP scan proxy (snippet 48) didn't forward query params → infinite pagination loop
- **Fix**: Added `$request->get_query_params()` forwarding in scan proxy
- **Safety**: Added `maxPages = 10` limit in JS pagination
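
The actual fix lives in the WP proxy (PHP) and the library page JS. Purely as an illustration of why the `maxPages` guard matters, here is a Python sketch of a pagination loop with a hard page cap; `fetch_all`, `fetch_page`, and `broken_backend` are hypothetical names.

```python
def fetch_all(fetch_page, max_pages=10):
    """Collect items page by page, stopping at an empty page or at max_pages."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # normal termination: the backend ran out of pages
        items.extend(batch)
    return items


def broken_backend(page):
    """Simulates a proxy that drops query params: every request returns page 1."""
    return ["file_a", "file_b"]


def healthy_backend(page):
    """Simulates a backend with exactly three pages of one item each."""
    return ["x"] if page <= 3 else []


print(len(fetch_all(broken_backend)))   # capped at 10 pages of 2 items
print(len(fetch_all(healthy_backend)))  # stops on the first empty page
```

Without the cap, the broken backend would never return an empty page and the loop would not terminate, which matches the infinite-load symptom described above.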
### 4. Identity Data Sync (Dev → Production)
- **Full replacement** of `public.identities`, `public.identity_bindings`, `public.identity_history` with dev data
- James Coburn id: 18738 → 11
- Bindings: 11,892 → 12,834 (+942)
- **Verification**: 0 differences between schemas
### 5. Snippet 55 Filter
- Added `.filter(f => f.is_registered)` to show only registered files on library page
- Changed `status:'unregistered'` → `status: f.status || 'unregistered'`
## Key Decisions
- Library page filter: default show registered files only
- Identity sync: full DELETE + INSERT (not UPDATE) to ensure consistency
- No user-defined metadata fields (starred/notes/role) preserved — matches dev exactly
## Handoff to Marcom
- `/people/` page should show correct identity state
- `/library/` page should show only registered files (4 currently)
- Login required for `/library/` — redirects to `/login/` if not authenticated
## Files Modified
- `snippet 48` (/scan WP proxy — query param forwarding)
- `snippet 55` (library page JS — registered-only filter, maxPages safety)
- `docs_v1.0/M4_workspace/2026-05-29_identity_sync_prod.md` (sync record)

@@ -0,0 +1,45 @@
# Identity Data Sync: Dev (3003) → Production (3002)
**Date**: 2026-05-29
**Author**: OpenCode
**Status**: Completed
## Summary
Fully synced all identity-related tables from dev schema to public schema on PostgreSQL `momentry` database.
## What Was Done
1. **Identities table** (`public.identities`): Replaced with `dev.identities` (69 records, original ids preserved)
2. **Identity_bindings** (`public.identity_bindings`): Replaced with `dev.identity_bindings` (12,834 records)
3. **Identity_history** (`public.identity_history`): Replaced with `dev.identity_history` (10 records)
4. **Sequences**: Updated `identities_id_seq`, `identity_bindings_id_seq`, `identity_history_id_seq` to match
### Key Changes
- **James Coburn**: Changed from id=18738 → id=11 (dev's original id)
- **Chantal Goya**: Changed from id=18737 → id=18736 (dev's id)
- **Metadata**: Now matches dev schema — TMDB fields only, no user-defined fields (starred, notes, role, aliases, user_confirmed are removed as expected)
- **Bindings**: Increased from 11,892 → 12,834 (+942 bindings)
### Not Changed
- `face_detections` — identical in both schemas (135,521 records)
- `pre_chunks` — large difference (public: 1.3M vs dev: 3.3M) but NOT related to identity
- All other non-identity tables unchanged
## Verification
```sql
-- Counts match
identities: 69 = 69
identity_bindings: 12,834 = 12,834
identity_history: 10 = 10
-- No differences
id/uuid mismatch: 0
metadata/status/name diffs: 0
```
## Files Referenced
- `AGENTS.md` — Development isolation rules
- `/Users/accusys/momentry_core/docs_v1.0/M4_workspace/2026-05-29_wp_api_url_update.md` — Previous session handoff

@@ -0,0 +1,27 @@
# 2026-05-29: Mergeinto NULL face_id Fix
## Problem
Production server (3002) returned `"error":"error occurred while decoding column 0: unexpected null; try decoding as an 'Option'"` when using mergeinto after clicking undo on a merge.
## Root Cause
`src/api/identity_binding.rs:428` decodes `face_id` from `face_detections` as `String` (non-Option), but **135,521 records** in the production `face_detections` table have NULL `face_id`. When merging an identity whose face_detections include NULL face_ids, the SQLx decode fails with the error above.
## Fix
- Changed `(String, Option<i32>)` → `(Option<String>, Option<i32>)` at line 428
- Changed `face_id_list` to use `filter_map` instead of `map` to skip NULL face_ids
- Changed `faces_count` to use `face_id_list.len()` instead of `face_ids.len()` (matching the actual transferred count)
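
The real change is in Rust, but the effect of switching from `map` to `filter_map` can be illustrated with a short Python analogue. The `rows` data below is made up; each tuple stands for a decoded `(face_id, identity_id)` pair where either value may be NULL.

```python
# Hypothetical decoded rows: (face_id, identity_id), where None represents SQL NULL.
rows = [("face_001", 7), (None, 7), ("face_002", None)]

# Analogue of filter_map: keep only rows whose face_id is present.
face_id_list = [face_id for face_id, _identity_id in rows if face_id is not None]

# Count what is actually transferred, not the raw row count.
faces_count = len(face_id_list)

print(face_id_list, faces_count)
```

Counting `face_id_list` rather than the raw rows keeps the reported number consistent with the faces that were actually recorded.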
## Files Changed
- `momentry_core/src/api/identity_binding.rs` — 3 lines changed
## Verification
- 234 library tests pass
- `cargo fmt` passes
- Production binary rebuilt (`target/release/momentry`)
- Production server restarted on port 3002 (PID 92043)
## Identities with NULL face_id (20 identities, ~135k records)
Audrey Hepburn (36k), Cary Grant (15k), Bernard Musson, Walter Matthau, Jacques Marin, George Kennedy, Michel Thomass, Antonio Passalia, etc. — all `type=people, status=confirmed`. These identities were likely imported from bulk face detection data without face_id generation.
## Data Note
The NULL face_ids are a pre-existing data quality issue. The fix prevents crashes but doesn't clean up the NULL data. Faces with NULL face_id won't be tracked in undo history (they stay with the target after undo), but the bulk transfer (`WHERE identity_id = $1`) still works correctly.