feat: Phase 2.6 edges migration to Qdrant (TKG-only architecture)

Phase 2.6.1: co_occurrence_edges migration
- build_co_occurrence_edges_from_qdrant()
- Qdrant embeddings → frame grouping → YOLO objects
- Result: 6679 edges (vs 6701 PostgreSQL)

Phase 2.6.2: face_face_edges migration
- build_face_face_edges_from_qdrant()
- Qdrant embeddings → frame grouping → face pairs
- mutual_gaze detection preserved
- Result: 6 edges (exact match)

Phase 2.6.3: speaker_face_edges migration
- build_speaker_face_edges_from_qdrant()
- Qdrant embeddings → trace_id frame ranges
- SPEAKS_AS edge creation

Architecture:
- All edges use Qdrant payload (no face_detections queries)
- PostgreSQL fallback for empty Qdrant
- Estimated 3.6x performance improvement

Testing:
- Playground (3003): ✓ All Phase 2.6 logs verified
- Edge counts: ✓ Close match with PostgreSQL
- Fallback: ✓ Working

Docs:
- docs_v1.0/DESIGN/TKG_PHASE2_6_EDGES_MIGRATION.md
- docs_v1.0/M4_workspace/2026-06-21_phase2_6_test.md
Accusys
2026-06-21 04:47:49 +08:00
parent 0afc70fc5b
commit 2cfcfdd1af
2926 changed files with 8311058 additions and 1394 deletions


@@ -0,0 +1,143 @@
---
title: Per-File Voice Collection V1.0
version: 1.0
date: 2026-06-20
author: OpenCode
status: approved
---
# Per-File Voice Collection V1.0
| Scope | Status | Applicable to | Binary |
|-------|--------|---------------|--------|
| Qdrant voice collection naming, storage, lifecycle | Approved | `momentry_playground`, `momentry` | Both |
## Problem Statement
The ASRX processor stores speaker voice embeddings (192-dim ECAPA-TDNN) in Qdrant for speaker diarization and future identity matching. The current design uses a single global collection `{prefix}_voice` for all files, which creates several issues:
1. **No isolation**: All files' voice embeddings share one collection, making per-file cleanup error-prone
2. **Unnecessary migration**: Workspace `_workspace_voice` → production `_voice` migration during checkin adds complexity with no benefit for per-file processing artifacts
3. **No event type distinction**: No payload field to distinguish speaker embeddings from future audio event types (gunshots, screams, music, etc.)
4. **Cross-file matching is impractical**: Current point ID includes file_uuid, but querying across files requires filtering rather than direct collection access
## Design
### Collection Naming: Per-File
```
{file_uuid}_voice
```
Examples:
- `d3f9ae8e471a1fc4d47022c66091b920_voice`
- `92ed12dbb7fbea5e6ddfe668e1f31444_voice`
### Collection Schema
| Property | Value |
|----------|-------|
| Name | `{file_uuid}_voice` |
| Vector dimension | 192 |
| Distance metric | Cosine |
| On-disk | false (default, in-memory for fast search during processing) |
### Point Schema
**Point ID**: `SHA256(speaker_id + "_" + segment_index)` → first 8 bytes as u64
- No file_uuid in hash (redundant, collection is per-file)
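The point ID derivation above can be sketched as follows. This is a minimal illustration, not the code in `processor.rs`: the function name `voice_point_id` is hypothetical, and the big-endian byte order is an assumption, since the design only says "first 8 bytes as u64".

```python
import hashlib


def voice_point_id(speaker_id: str, segment_index: int) -> int:
    """Hash "<speaker_id>_<segment_index>" and keep the first 8 bytes as a u64."""
    key = f"{speaker_id}_{segment_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Assumption: big-endian interpretation; the design does not name a byte order.
    return int.from_bytes(digest[:8], "big")


point_id = voice_point_id("SPEAKER_00", 5)
```

Because the hash input no longer contains `file_uuid`, the same speaker and segment index produce the same ID in every file; uniqueness across files comes from the per-file collection, not from the ID.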
**Payload**:
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `speaker_id` | String | Speaker label from ASRX | `"SPEAKER_00"` |
| `segment_index` | Integer | Segment index within ASRX result | `5` |
| `start_frame` | Integer | Start frame number | `120` |
| `end_frame` | Integer | End frame number | `240` |
| `start_time` | Float | Start time in seconds | `4.0` |
| `end_time` | Float | End time in seconds | `8.0` |
| `event_type` | String | Type of audio event | `"speaker"` |
### Event Type Extensibility
The `event_type` field reserves space for future audio recognition:
| event_type | Description | Future Model | Dim |
|------------|-------------|--------------|-----|
| `"speaker"` | Speaker voice embedding (current) | ECAPA-TDNN | 192 |
| `"gunshot"` | Gunshot detection embedding | YAMNet / custom | TBD |
| `"scream"` | Scream/shout detection | YAMNet / custom | TBD |
| `"music"` | Music segment embedding | CLMR / custom | TBD |
Each event type with a different dimension would use a separate per-file collection (`{file_uuid}_gunshot`, etc.).
### Lifecycle
```
Processing:
ASRX completes → store_voice_embeddings_to_qdrant()
→ ensure_collection("{file_uuid}_voice", 192)
→ upsert_vector per segment
Checkin:
No voice migration needed (data already in per-file collection)
Checkout / File Deletion:
Delete collection "{file_uuid}_voice" (or delete by filter)
Cross-File Matching (future):
Job scans all "*_voice" collections, or maintains {prefix}_speaker_profiles index
```
### Changes from Current Design
| Aspect | Current | New |
|--------|---------|-----|
| Collection name | `{prefix}_voice` | `{file_uuid}_voice` |
| Point ID hash input | `file_uuid + speaker_id + index` | `speaker_id + index` |
| Workspace dual-write | `_workspace_voice` → `_voice` migration | Removed (no migration needed) |
| Payload event_type | Not present | `"speaker"` |
| Checkin voice migration | Scroll + upsert | Nothing (data already isolated) |
| Checkout voice deletion | Filter by file_uuid from `{prefix}_voice` | Delete collection or filter |
| QdrantWorkspace voice methods | `voice_collection()`, `upsert_voice_embedding()` | Removed |
### Files Affected
| File | Change |
|------|--------|
| `src/worker/processor.rs:1291-1360` | `store_voice_embeddings_to_qdrant()` — per-file collection, event_type payload |
| `src/worker/processor.rs:919-942` | Remove workspace voice dual-write |
| `src/core/checkin.rs:208-242` | Remove voice migration block |
| `src/core/checkin.rs:358-379` | Update checkout voice deletion to target `{file_uuid}_voice` |
| `src/core/db/qdrant_workspace.rs` | Remove `voice_collection()`, `upsert_voice_embedding()`, voice from `ensure_all()`, `scroll_by_file_uuid()`, `WorkspaceScrollResult`, `delete_by_file_uuid()` |
### Cross-File Matching (Future Design)
For future multi-file speaker matching, a separate index collection can be maintained:
```
{prefix}_speaker_profiles (192-dim Cosine)
- payload: speaker_id (global), source_file_uuids[], reference_count, centroid_embedding
```
This index would be updated:
1. During a periodic batch job that scans all `*_voice` collections
2. Or incrementally when new voice data is added
The per-file collection design makes this cleaner because:
- Source data is cleanly partitioned
- The index is explicitly a derived/cached structure
- Index rebuild means rescanning `*_voice` collections, not untangling a global collection
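As a rough illustration of how a `centroid_embedding` for the future profile index might be derived, the sketch below averages a speaker's segment embeddings and L2-normalizes the result so it suits a Cosine collection. The function name and the normalization step are assumptions; the design does not specify how the centroid is computed.

```python
import math


def centroid_embedding(embeddings):
    """Average equal-length vectors, then L2-normalize the mean."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    mean = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero vector cannot be normalized; return it unchanged in that case.
    return [x / norm for x in mean] if norm > 0 else mean


# Two toy 2-dim vectors stand in for 192-dim ECAPA-TDNN embeddings.
centroid = centroid_embedding([[1.0, 0.0], [0.0, 1.0]])
```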
## Migration
Existing voice data in `{prefix}_voice` and `{prefix}_workspace_voice` can be left as-is for backward compatibility. New processing will write to `{file_uuid}_voice`. Old data in `{prefix}_voice` will remain queryable if needed.
No data migration script is required — old data is read-only legacy.
## Version History
| Version | Date | Author | Change |
|---------|------|--------|--------|
| 1.0 | 2026-06-20 | OpenCode | Initial design |


@@ -0,0 +1,758 @@
# Processor Module V1.0
**Date**: 2026-06-19
**Version**: 1.0.0
**Status**: Draft
---
## 1. Architecture Overview
### 1.1 PythonExecutor: Unified Execution Framework
Every processor runs its Python script through `PythonExecutor`, which provides:
- SHA256 checksum verification (read from `checksums.sha256`)
- Retry with exponential backoff (1s → 2s → 4s → ...)
- Timeout management (configured per processor)
- Real-time stdout/stderr handling (tracing::info/warn/error)
### 1.2 Dual-Track Design
| Type | Characteristics | Processors |
|------|------|-----------|
| **Frame-based** | Processes frame by frame; outputs per-frame data | yolo, ocr, face, pose, mediapipe, appearance |
| **Time-based** | Analyzes the whole video or a time series; outputs a list of events | cut, asrx, scene, story, 5w1h |
### 1.3 Unified 8Hz Sampling (New)
All frame-based processors share the same 8Hz frame list:
```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```
---
## 2. Processor Specification Summary
| # | Name | Type | Python script | Output file | Depends on | GPU | Model | CPU | Memory | Timeout |
|---|------|------|-------------|----------|------|-----|------|-----|--------|---------|
| 1 | cut | Time | `cut_processor.py` | `.cut.json` | — | ❌ | PySceneDetect | 0.5 | 512MB | 3600s |
| 2 | asrx | Time | `asrx_processor.py` | `.asrx.json` | cut | ❌ | speechbrain | 0.8 | 2048MB | 7200s |
| 3 | yolo | Frame | `yolo_processor.py` | `.yolo.json` | — | ✅ | yolov8n | 0.3 | 1024MB | 7200s |
| 4 | ocr | Frame | `ocr_processor.py` | `.ocr.json` | — | ❌ | paddleocr | 0.8 | 1024MB | 7200s |
| 5 | face | Frame | `face_processor.py` | `.face.json` | — | ✅ | insightface/buffalo_l | 0.6 | 1536MB | 7200s |
| 6 | pose | Frame | `pose_processor.py` | `.pose.json` | — | ✅ | mediapipe/pose | 0.4 | 1024MB | 7200s |
| 7 | mediapipe | Frame | `mediapipe_holistic_processor.py` | `.mediapipe.json` | — | ❌ | mediapipe/holistic | 0.3 | 1024MB | 7200s |
| 8 | appearance | Frame | `appearance_processor.py` | `.appearance.json` | pose | ❌ | HSV | 0.3 | 512MB | 7200s |
| 9 | scene | Time | `scene_classifier.py` | `.scene.json` | cut | ❌ | places365 | 0.3 | 512MB | 7200s |
| 10 | story | Time | `story_processor.py` | `.story.json` | asrx+cut+yolo+face | ❌ | gemma4 | 0.1 | 256MB | 7200s |
| 11 | 5w1h | Time | `parent_chunk_5w1h.py` | — | story | ❌ | gemma4 | 0.1 | 256MB | 7200s |
---
## 3. Detailed Processor Specifications
### 3.1 Cut: Scene Cut Detection
**Type**: Time-based
**Script**: `cut_processor.py`
**Model**: PySceneDetect
```rust
pub struct CutResult {
pub frame_count: u64,
pub fps: f64,
pub scenes: Vec<CutScene>,
}
pub struct CutScene {
pub scene_number: u32,
pub start_frame: u64,
pub end_frame: u64,
pub start_time: f64,
pub end_time: f64,
}
```
**Output JSON**:
```json
{
"frame_count": 8951,
"fps": 29.97,
"scenes": [
{"scene_number": 1, "start_frame": 0, "end_frame": 150, "start_time": 0.0, "end_time": 5.0},
...
]
}
```
---
### 3.2 ASRX: Speech Recognition + Speaker Diarization
**Type**: Time-based
**Script**: `asrx_processor.py`
**Model**: speechbrain/ecapa-tdnn
**Depends on**: cut (needs scene boundaries)
```rust
pub struct AsrxResult {
pub language: Option<String>,
pub segments: Vec<AsrxSegment>,
pub embeddings: Option<Vec<Vec<f32>>>,
}
pub struct AsrxSegment {
pub start_time: f64,
pub end_time: f64,
pub start_frame: u64,
pub end_frame: u64,
pub text: String,
pub speaker_id: Option<String>,
}
```
**Output JSON**:
```json
{
"language": "zh",
"segments": [
{
"start_time": 0.1,
"end_time": 2.0,
"start_frame": 3,
"end_frame": 60,
"text": "大家好",
"speaker_id": "SPEAKER_0"
},
...
]
}
```
---
### 3.3 YOLO: Object Detection
**Type**: Frame-based
**Script**: `yolo_processor.py`
**Model**: yolov8n
**GPU**: ✅
**Sampling**: 8Hz
```rust
pub struct YoloResult {
pub frame_count: u64,
pub fps: f64,
pub frames: Vec<YoloFrame>,
}
pub struct YoloFrame {
pub frame: u64,
pub timestamp: f64,
pub objects: Vec<YoloObject>,
}
pub struct YoloObject {
pub class_name: String,
pub class_id: u32,
pub x: i32,
pub y: i32,
pub width: i32,
pub height: i32,
pub confidence: f32,
}
```
**Output JSON**:
```json
{
"frame_count": 2238,
"fps": 29.97,
"frames": {
"0": {"detections": [{"class_name": "person", "class_id": 0, "x": 100, "y": 50, "width": 200, "height": 400, "confidence": 0.95}]},
"4": {"detections": [...]},
...
}
}
```
**Available classes** (43 COCO classes): person, bicycle, car, motorbike, chair, cup, cell phone, laptop, book, remote, tie, umbrella, baseball bat, ...
---
### 3.4 OCR: Text Recognition
**Type**: Frame-based
**Script**: `ocr_processor.py`
**Model**: paddleocr
**Sampling**: 8Hz
```rust
pub struct OcrResult {
pub frame_count: u64,
pub fps: f64,
pub frames: Vec<OcrFrame>,
}
pub struct OcrFrame {
pub frame: u64,
pub timestamp: f64,
pub texts: Vec<OcrText>,
}
pub struct OcrText {
pub text: String,
pub x: i32,
pub y: i32,
pub width: i32,
pub height: i32,
pub confidence: f32,
}
```
---
### 3.5 Face: Face Detection + Embedding
**Type**: Frame-based
**Script**: `face_processor.py`
**Model**: insightface/buffalo_l
**GPU**: ✅
**Sampling**: 8Hz
```rust
pub struct FaceResult {
pub frame_count: u64,
pub fps: f64,
pub frames: Vec<FaceFrame>,
}
pub struct FaceFrame {
pub frame: u64,
pub timestamp: f64,
pub faces: Vec<Face>,
}
pub struct Face {
pub face_id: Option<String>,
pub x: i32,
pub y: i32,
pub width: i32,
pub height: i32,
pub confidence: f32,
pub embedding: Option<Vec<f32>>,
pub landmarks: Option<serde_json::Value>,
pub attributes: Option<FaceAttributes>,
}
pub struct FaceAttributes {
pub age: Option<i32>,
pub gender: Option<String>,
}
```
**Output JSON**:
```json
{
"frame_count": 2238,
"fps": 29.97,
"frames": [
{
"frame": 0,
"timestamp": 0.0,
"faces": [{
"face_id": "face_0",
"x": 500, "y": 300, "width": 200, "height": 250,
"confidence": 0.98,
"embedding": [0.12, -0.34, ...],
"landmarks": {
"nose": [[x,y], ...],
"left_eye": [[x,y], ...],
"right_eye": [[x,y], ...]
},
"attributes": {"age": 35, "gender": "male"}
}]
}
]
}
```
**Landmarks**: nose (8pts) + left_eye (6pts) + right_eye (6pts) = 20 pts
---
### 3.6 Pose: Body Pose
**Type**: Frame-based
**Script**: `pose_processor.py`
**Model**: mediapipe/pose
**GPU**: ✅
**Sampling**: 8Hz
```rust
pub struct PoseResult {
pub frame_count: u64,
pub fps: f64,
pub frames: Vec<PoseFrame>,
}
pub struct PoseFrame {
pub frame: u64,
pub timestamp: f64,
pub persons: Vec<PersonPose>,
}
pub struct PersonPose {
pub keypoints: Vec<Keypoint>,
pub bbox: Bbox,
}
pub struct Keypoint {
pub x: f64,
pub y: f64,
pub z: f64,
pub visibility: f64,
}
pub struct Bbox {
pub x: i32,
pub y: i32,
pub width: i32,
pub height: i32,
}
```
**Output JSON**:
```json
{
"frame_count": 2238,
"fps": 29.97,
"frames": [
{
"frame": 0,
"timestamp": 0.0,
"persons": [{
"keypoints": [
{"x": 0.5, "y": 0.3, "z": 0.1, "visibility": 0.95},
...
],
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600}
}]
}
]
}
```
**Keypoints**: 33 body joints (nose, shoulders, elbows, wrists, hips, knees, ankles, ...)
**Purpose**: Supplies the bbox for appearance_processor, which uses it to compute the upper-body and lower-body color ROIs
---
### 3.7 MediaPipe Holistic: Full Keypoints
**Type**: Frame-based
**Script**: `mediapipe_holistic_processor.py`
**Model**: mediapipe/holistic
**GPU**: ❌
**Sampling**: 8Hz
```rust
pub struct MediaPipeResult {
pub metadata: MediaPipeMetadata,
pub frames: HashMap<String, MediaPipeDictEntry>,
}
pub struct MediaPipeMetadata {
pub fps: f64,
pub total_frames: i64,
pub processed_frames: i64,
pub sample_interval: i64,
pub width: i64,
pub height: i64,
pub processor: String,
}
pub struct MediaPipeDictEntry {
pub frame: String,
pub timestamp: f64,
pub persons: Vec<MediaPipePerson>,
}
pub struct MediaPipePerson {
pub person_id: u64,
pub bbox: Option<MediaPipeBBox>,
pub face_mesh: Option<MediaPipeFaceMesh>,
pub pose: Option<MediaPipePose>,
pub hands: MediaPipeHands,
}
pub struct MediaPipeHands {
pub left: Option<MediaPipeHand>,
pub right: Option<MediaPipeHand>,
}
```
**Output JSON**:
```json
{
"metadata": {
"fps": 29.97,
"total_frames": 8951,
"processed_frames": 2238,
"sample_interval": 4,
"width": 1920,
"height": 1080,
"processor": "mediapipe_holistic"
},
"frames": {
"0": {
"frame": "0",
"timestamp": 0.0,
"persons": [{
"person_id": 0,
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600},
"face_mesh": {
"landmarks": [[x,y,z], ...],
"eye_features": {"left_openness": 0.85, "right_openness": 0.82},
"mouth_features": {"openness": 0.3, "width": 45}
},
"pose": {
"landmarks": [[x,y,z,visibility], ...],
"arm_features": {"left_angle": 45, "right_angle": 30},
"leg_features": {"left_angle": 180, "right_angle": 175}
},
"hands": {
"left": {"landmarks": [[x,y,z], ...], "gesture": "point"},
"right": {"landmarks": [[x,y,z], ...], "gesture": "fist"}
}
}]
}
}
}
```
**Keypoint totals**:
| Part | Count | Notes |
|------|------|------|
| Face Mesh | 468 | Full face mesh |
| Pose | 33 | Body joints |
| Left Hand | 21 | Left-hand keypoints |
| Right Hand | 21 | Right-hand keypoints |
| **Total** | **543** | |
### Pose vs MediaPipe Comparison
| | Pose Processor | MediaPipe Holistic |
|--|----------------|--------------------|
| **Landmarks** | 33 pts (pose only) | 543 pts (face + pose + hands) |
| **Speed** | Fast (GPU-accelerated) | Slower (CPU) |
| **GPU** | ✅ | ❌ |
| **Output file** | `.pose.json` | `.mediapipe.json` |
| **Appearance sharing** | Body ROIs (neck, foot) | Face ROIs (hat, glasses), hand ROIs (watch, phone) |
| **Use** | Body pose, bbox source | Full keypoints, gesture recognition, lip-shape analysis |
---
### 3.8 Appearance: Color Features + Accessory Detection
**Type**: Frame-based
**Script**: `appearance_processor.py`
**Depends on**: pose (bbox source)
**Sampling**: 8Hz
**ROI sharing**: ROIs are fitted tightly to the face/pose/mediapipe landmarks
```rust
pub struct AppearanceResult {
pub frame_count: u64,
pub fps: f64,
pub frames: Vec<AppearanceFrame>,
}
pub struct AppearanceFrame {
pub frame: u64,
pub timestamp: f64,
pub persons: Vec<AppearancePerson>,
}
pub struct AppearancePerson {
pub person_id: u64,
pub bbox: BBox,
pub hsv_histogram: Vec<Vec<f64>>,
pub dominant_colors: Vec<Vec<f64>>,
pub upper_body: Option<Vec<Vec<f64>>>,
pub lower_body: Option<Vec<Vec<f64>>>,
}
```
**Output JSON**:
```json
{
"frame_count": 2238,
"fps": 29.97,
"frames": [
{
"frame": 0,
"timestamp": 0.0,
"persons": [{
"person_id": 0,
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600},
"hsv_histogram": [
[H0, H1, ...H29],
[S0, S1, ...S31],
[V0, V1, ...V31]
],
"dominant_colors": [[H,S,V], ...],
"upper_body": [[H...], [S...], [V...]],
"lower_body": [[H...], [S...], [V...]]
}]
}
]
}
```
#### ROI Localization
```python
def get_accessory_rois(frame, face_data, pose_data, hand_data):
    # Helper functions (expand_region, bbox_around_points, ...) are not shown here.
    rois = {}
    # Face region: uses the face bbox + landmarks
    face_bbox = face_data['bbox']
    landmarks = face_data['landmarks']  # nose, left_eye, right_eye
    # Hat ROI: extend upward from the face bbox
    rois['hat'] = expand_region(face_bbox, direction='up', factor=0.5)
    # Glasses ROI: horizontal band around the eye landmarks
    rois['glasses'] = bbox_around_points(landmarks['left_eye'], landmarks['right_eye'], padding=10)
    # Mask ROI: from below the nose down to the jaw
    rois['mask'] = region_below_point(landmarks['nose'], face_bbox.bottom)
    # Neck ROI: uses the pose neck keypoints
    rois['neck'] = region_between(pose_data['keypoints']['nose'], pose_data['keypoints']['neck'], width=80)
    # Wrist ROI: uses the MediaPipe hand landmarks
    rois['left_wrist'] = circle_around(hand_data['left']['wrist'], radius=30)
    # Foot ROI: uses the pose ankle/toe keypoints
    rois['left_foot'] = bbox_around_points(pose_data['left_ankle'], pose_data['left_toe'], padding=20)
    return rois
```
#### Accessory Detection Methods
| Method | Accessories | Notes |
|------|----------|------|
| **HSV color block** | tie, phone, watch, ring, bracelet, glasses, mask, hat, shoes, backpack, handbag | Primary method: analyzes regions whose color differs from their surroundings |
| **CLIP** | hairstyle, beard, face_tattoo, earrings, nose_ring, necklace, gloves | Auxiliary: used when color blocks are hard to tell apart |
| **MediaPipe** | gesture, arm_pose | 21 hand pts + 33 pose pts |
| **HSV** | upper_body_color, lower_body_color, skin_tone | Color feature extraction |
#### Full Accessory List (49 items)
| Body part | Accessories | Detection |
|------|------|------|
| Head (12) | hat, hairstyle, hair_accessory, earrings, nose_ring, lip_ring, face_tattoo, eyebrow_tattoo, glasses, mask, beard, headscarf | HSV color block + CLIP |
| Neck (5) | tie, scarf, shawl, necklace, neck_tattoo | HSV color block + CLIP |
| Hand/arm (16) | ring, bracelet, watch, gloves, phone, pen, laptop, book, cup, remote, tool, knife, gun, baseball_bat, gesture, arm_pose | HSV color block + CLIP + MP |
| Foot/vehicle (8) | shoes, socks, barefoot, skateboard, scooter, bicycle, motorbike, roller_skates | HSV color block + CLIP |
| Carried/environment (5) | backpack, handbag, luggage, chair, diningtable | HSV color block + CLIP |
| Color (3) | upper_body_hsv, lower_body_hsv, skin_tone | HSV |
---
### 3.9 Scene: Scene Classification
**Type**: Time-based
**Script**: `scene_classifier.py`
**Model**: places365
**Depends on**: cut
---
### 3.10 Story: Story Generation
**Type**: Time-based
**Script**: `story_processor.py`
**Model**: gemma4
**Depends on**: asrx + cut + yolo + face
---
### 3.11 5W1H: Story Summary
**Type**: Time-based
**Script**: `parent_chunk_5w1h.py`
**Model**: gemma4
**Depends on**: story
---
## 4. PythonExecutor Unified Framework
### 4.1 RetryConfig
```rust
pub struct RetryConfig {
    pub max_attempts: u32,        // default: 3
    pub initial_delay_ms: u64,    // default: 1000 (1s)
    pub max_delay_ms: u64,        // default: 30000 (30s)
    pub backoff_multiplier: f64,  // default: 2.0
}
```
**Backoff schedule**: 1s → 2s → 4s → 8s → ... → capped at 30s
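The schedule follows directly from the `RetryConfig` defaults. A minimal sketch of the arithmetic, assuming the delay is multiplied after every failed attempt and clamped to `max_delay_ms`; the function name `backoff_delays_ms` is illustrative and not part of the executor:

```python
def backoff_delays_ms(retries, initial_delay_ms=1000, max_delay_ms=30000,
                      backoff_multiplier=2.0):
    """Return the wait before each retry, in milliseconds."""
    delays = []
    delay = float(initial_delay_ms)
    for _ in range(retries):
        # Clamp to the configured ceiling before recording the delay.
        delays.append(int(min(delay, max_delay_ms)))
        delay *= backoff_multiplier
    return delays


delays = backoff_delays_ms(6)
# 1000, 2000, 4000, 8000, 16000, then 32000 clamped to 30000
```

If a wait only happens between attempts, the default `max_attempts` of 3 would use just the first two values of this sequence.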
### 4.2 SHA256 Checksum Verification
```
scripts/
├── checksums.sha256 # SHA256 manifest
├── face_processor.py
├── yolo_processor.py
└── ...
```
Contents of `checksums.sha256`:
```
a1b2c3d4... face_processor.py
e5f6g7h8... yolo_processor.py
...
```
Before launching a script, the executor verifies its integrity to guard against tampering.
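The check can be pictured with a small Python sketch. It assumes the manifest uses the common `sha256sum` layout (`<hex digest>  <file name>` per line); the helper names `load_manifest` and `verify_script` are hypothetical, and the real executor is written in Rust.

```python
import hashlib
import tempfile
from pathlib import Path


def load_manifest(manifest_path):
    """Parse '<hex digest>  <file name>' lines into {file name: digest}."""
    entries = {}
    for line in Path(manifest_path).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            entries[name.strip().lstrip("*")] = digest.lower()
    return entries


def verify_script(script_path, manifest):
    """True only if the script is listed and its SHA256 matches the manifest."""
    expected = manifest.get(Path(script_path).name)
    if expected is None:
        return False
    actual = hashlib.sha256(Path(script_path).read_bytes()).hexdigest()
    return actual == expected


# Demo in a temporary directory: verify, tamper, verify again.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "face_processor.py"
    script.write_text("print('face')\n")
    digest = hashlib.sha256(script.read_bytes()).hexdigest()
    manifest_file = Path(tmp) / "checksums.sha256"
    manifest_file.write_text(f"{digest}  face_processor.py\n")
    manifest = load_manifest(manifest_file)
    ok_before = verify_script(script, manifest)
    script.write_text("print('tampered')\n")
    ok_after = verify_script(script, manifest)
```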
### 4.3 Timeout Management
| Processor | Timeout |
|-----------|---------|
| cut | 3600s (1h) |
| asrx, yolo, ocr, face, pose, mediapipe, appearance, scene, story, 5w1h | 7200s (2h) |
---
## 5. 8Hz Sampling Framework
### 5.1 Basic Principle
```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```
| Video length | Total frames | 8Hz samples |
|----------|--------|------------|
| 5 min | 9,000 | ~2,250 |
| 10 min | 18,000 | ~4,500 |
| 30 min | 54,000 | ~13,500 |
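The sample counts in the table can be reproduced with a few lines of arithmetic. A sketch in Python, mirroring `round(fps / 8)` with half-up rounding (Python's built-in `round` rounds halves to even, so it is avoided here); the function name `sample_frames` is illustrative:

```python
import math


def sample_frames(total_frames, fps, target_hz=8.0):
    """Frame indices on the shared sampling grid."""
    # floor(x + 0.5) rounds halves up, matching Rust's f64::round for positive x.
    interval = max(1, math.floor(fps / target_hz + 0.5))
    return list(range(0, total_frames, interval))


frames = sample_frames(9000, 30.0)   # 5 minutes at 30 fps
count_5min = len(frames)             # 9000 / 4 = 2250
```

With an interval of 4 at 30 fps the effective rate is 30 / 4 = 7.5 Hz, which is why the counts in the table are marked as approximate.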
### 5.2 On-Demand Refinement
```
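Layer 1: 8Hz baseline (all processors)
Layer 2: Refinement (triggered by specific features)
Refinement scenarios:
- Blink confirmation: a sudden drop in eye openness at 8Hz → re-fetch the surrounding ±4 frames (30Hz)
- Lip-sync: time ranges covered by a sentence chunk → 16Hz
- Mutual gaze: two people's gaze directions converge → confirm with the surrounding ±2 frames (30Hz)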
```
### 5.3 Sample Frame Computation
```rust
fn compute_sample_frames(total_frames: i64, fps: f64) -> Vec<i64> {
let interval = (fps / 8.0).round() as i64;
(0..total_frames).step_by(interval.max(1) as usize).collect()
}
```
---
## 6. DAG Dependency Graph
```
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ cut │───►│asrx │───►│story│───►│5w1h │
└──┬──┘ └──┬──┘ └──┬──┘ └─────┘
│ │ │
│ ┌─────┘ │
▼ ▼ │
┌─────┐ ┌─────┐ ┌─────┐ │
│yolo │ │face │ │pose │ │
└──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌────────┐ │
│ └─►│appear │ │
│ └────────┘ │
▼ ▼ ▼
┌─────────────────────────┐
│ TKG (build_tkg) │
└─────────────────────────┘
Standalone processors (no dependencies unless noted):
┌─────┐ ┌─────┐ ┌───────────┐
│ ocr │ │mediap│ │ scene │
└─────┘ └─────┘ └─────┬─────┘
│ (depends on cut)
```
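The dependency column of the specification table in section 2 can be turned into an execution order with a topological sort. The sketch below uses Python's standard `graphlib`; the `DEPENDENCIES` dict is transcribed from that table, while the scheduling itself is only an illustration and not the actual JobWorker logic.

```python
from graphlib import TopologicalSorter

# processor -> set of processors that must finish first (from the table in section 2)
DEPENDENCIES = {
    "cut": set(),
    "asrx": {"cut"},
    "yolo": set(),
    "ocr": set(),
    "face": set(),
    "pose": set(),
    "mediapipe": set(),
    "appearance": {"pose"},
    "scene": {"cut"},
    "story": {"asrx", "cut", "yolo", "face"},
    "5w1h": {"story"},
}

# One valid sequential order.
order = list(TopologicalSorter(DEPENDENCIES).static_order())

# Processors that can start immediately (no unmet dependencies).
sorter = TopologicalSorter(DEPENDENCIES)
sorter.prepare()
first_wave = set(sorter.get_ready())
```

`get_ready()` and `done()` on the sorter map naturally onto a polling worker: each poll asks for the ready set, and each completed processor is reported with `done()`.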
---
## 7. Worker Integration
### 7.1 JobWorker Scheduling
```
Video Registration
Create Job (processor_list: [cut, asrx, yolo, ocr, face, pose, mediapipe, appearance, scene, story])
Poll Available Processors (dependency check + concurrency limit)
Execute Processor → Store JSON → Update Progress
All Processors Done → Rule 1 (chunk) → Vectorize → Complete
```
### 7.2 Concurrency Control
- **Dynamic concurrency**: adjusted dynamically based on CPU/memory/GPU load (default 2)
- **Processor pool**: runs at most N processors concurrently
### 7.3 Progress Reporting (Redis)
```
Redis Key: momentry_dev:progress:{file_uuid}
Value: {
"phase": "PROCESSING",
"progress": {
"FACE": {"current": 150, "total": 2238, "status": "running"},
"YOLO": {"current": 2238, "total": 2238, "status": "completed"},
...
},
"active_processors": ["FACE", "POSE"]
}
```
---
## Version History
| Version | Date | Author | Description |
|---------|------|--------|-------------|
| 1.0.0 | 2026-06-19 | OpenCode | Initial design document |


@@ -0,0 +1,187 @@
---
title: Rule 1 Chunk Ingestion V1.0
version: 1.0
date: 2026-06-20
author: OpenCode
status: approved
---
# Rule 1 Chunk Ingestion V1.0
| Scope | Status | Applicable to | Binary |
|-------|--------|---------------|--------|
| Sentence chunk creation from ASR + OCR | Approved | `momentry_playground`, `momentry` | Both |
## Overview
Rule 1 is the first chunking rule in Momentry's pipeline. It creates **sentence-level chunks** (`ChunkType::Sentence`, `ChunkRule::Rule1`) by taking ASR transcription segments and enriching them with OCR on-screen text from the same time range. Each chunk represents a spoken segment annotated with the visible text in the video frames.
These chunks are vectorized by the downstream `vectorize_chunks` step and become searchable through semantic search (Qdrant), keyword search (BM25 ILIKE), and identity-based search.
## Data Flow
```
┌─────────────────────────────────────────────────────────┐
│ UPSTREAM: pre_chunks table │
│ │
│ Processor outputs stored by store_raw_pre_chunks_batch: │
│ processor_type='asr' → ASR segments (text, timestamps) │
│ processor_type='ocr' → OCR texts per frame │
└─────────────────────────────────────────────────────────┘
▼ wait for ASRX completion
┌─────────────────────────────────────────────────────────┐
│ RULE 1 PROCESSING │
│ │
│ Triggered by: │
│ 1. Worker auto: job_worker.rs after ASRX completes │
│ 2. HTTP API: POST /api/v1/file/:file_uuid/rule1 │
│ 3. Pipeline: pipeline_core::execute_rule1 │
│ │
│ execute_rule1(file_uuid, fps): │
│ ├─ fetch_asr_segments() → Vec<AsrSegment> │
│ ├─ fetch_ocr_texts() → BTreeMap<frame, [texts]> │
│ │ │
│ └─ for each ASR segment: │
│ ├─ collect_ocr_text(frame_range, ocr_map) │
│ │ → deduplicated OCR texts within range │
│ ├─ build combined_text = "<ASR> <OCR>" │
│ ├─ build content = {text, ocr_text} │
│ ├─ build metadata = {language} │
│ └─ store_chunk_in_tx() → chunk table │
│ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ DOWNSTREAM: vectorize_chunks() │
│ │
│ SELECT ... WHERE chunk_type='sentence' AND embedding │
│ IS NULL │
│ │
│ 1. embedder.embed_document(combined_text) → vector │
│ 2. db.store_vector() → PG chunk.embedding │
│ 3. qdrant.upsert_vector() → momentry_rule1 collection │
│ │
└─────────────────────────────────────────────────────────┘
```
## Chunk Data Structure
### Content JSON (`content` column)
```json
{
"text": "今天的會議我們要討論 ...",
"ocr_text": "Q3 Revenue Slides Agenda"
}
```
| Field | Source | Purpose |
|-------|--------|---------|
| `text` | ASR transcription | Original spoken text, used by UI/reference |
| `ocr_text` | OCR detections in frame range | On-screen text (titles, labels, signs) |
### Text Content (`text_content` column)
```
"今天的會議我們要討論 Q3 Revenue Slides Agenda"
```
Combined ASR + OCR text used for:
- **Embedding generation**: The combined text is embedded to Qdrant, enabling semantic search to find segments based on both spoken and on-screen content
- **Keyword search (BM25 ILIKE)**: Queries match against this field, so searching for "Q3 Revenue" finds the segment even if not spoken aloud
### Metadata JSON (`metadata` column)
```json
{
"language": "zh"
}
```
Only the ASR-detected language is stored. See Design Decisions below.
## Search Contribution Analysis
| Search Path | Mechanism | Rule 1 Contribution |
|-------------|-----------|-------------------|
| **Semantic search** (Qdrant) | `chunk_type='sentence'` → embedding query | ASR + OCR text in embedding captures both spoken and visual content |
| **Keyword search** (BM25 ILIKE) | `text_content ILIKE '%query%'` | Both ASR and OCR text are searchable |
| **Title match** (smart_search) | `chunk_type='sentence' AND embedding IS NOT NULL` | Rule 1 chunks are the primary sentence chunks |
| **Identity search** | `face_detections` time overlap join | Rule 1 chunks match via frame ranges |
### What Was Excluded and Why
| Data Source | Considered For | Decision | Reason |
|-------------|---------------|----------|--------|
| **YOLO detections** | Adding class names to text_content | ❌ **Excluded** | 80 COCO classes are too generic ("person", "chair" appear in almost every segment). High error rate adds noise, dilutes embedding semantic density. Cross-segment distinctiveness is near zero. |
| **ASRX speaker** | Adding speaker_id to metadata | ❌ **Excluded** | At Rule 1 time, identity has not been paired yet. Speaker IDs are temporary labels without identity binding, providing no search value. |
| **Face detections** | Adding face_ids to metadata | ❌ **Excluded** | Same as speaker — identity not yet available. Face detection IDs alone have no search meaning. |
| **OCR text** | Adding to text_content + embedding | ✅ **Included** | OCR provides specific on-screen text (titles, labels, signs) that directly matches user search queries. Highly complementary to ASR. |
## Implementation Details
### `fetch_ocr_texts()`
Reads OCR per-frame data from `pre_chunks`:
```sql
SELECT coordinate_index as frame, data
FROM pre_chunks
WHERE file_uuid = $1 AND processor_type = 'ocr'
ORDER BY coordinate_index
```
Parses the `data.texts` JSON array, extracting `text` fields where `confidence > 0.5`. Returns `BTreeMap<i64, Vec<String>>` mapping frame number to list of recognized text strings.
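The parsing step can be sketched in Python as below. It assumes each row's `data` column is a JSON string shaped like the OCR processor output (`texts` entries with `text` and `confidence`); the function name `parse_ocr_rows` is hypothetical, and a sorted `dict` stands in for the Rust `BTreeMap`.

```python
import json


def parse_ocr_rows(rows, min_confidence=0.5):
    """rows: iterable of (frame, data_json). Returns {frame: [text, ...]} ordered by frame."""
    ocr_map = {}
    for frame, data in rows:
        texts = [
            item["text"]
            for item in json.loads(data).get("texts", [])
            # Keep only non-empty texts strictly above the confidence threshold.
            if item.get("confidence", 0.0) > min_confidence and item.get("text")
        ]
        if texts:
            ocr_map[frame] = texts
    return dict(sorted(ocr_map.items()))


rows = [
    (4, '{"texts": [{"text": "Q3 Revenue", "confidence": 0.93}, {"text": "blur", "confidence": 0.41}]}'),
    (0, '{"texts": [{"text": "Agenda", "confidence": 0.88}, {"text": "", "confidence": 0.99}]}'),
    (8, '{"texts": [{"text": "noise", "confidence": 0.5}]}'),
]
ocr_map = parse_ocr_rows(rows)
```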
### `collect_ocr_text()`
For a given frame range `[start_frame, end_frame]`:
1. Iterates frames using `BTreeMap::range(start_frame..=end_frame)`
2. Collects all OCR texts from those frames
3. Deduplicates using a `HashSet` (case-sensitive)
4. Joins with spaces: `"text1 text2 text3"`
Returns empty string if no OCR data exists in the range.
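The four steps above can be sketched in Python. `bisect` over the sorted frame numbers plays the role of `BTreeMap::range`; keeping texts in first-seen order is an assumption, since the description only says a `HashSet` is used for deduplication.

```python
from bisect import bisect_left, bisect_right


def collect_ocr_text(ocr_map, start_frame, end_frame):
    """Join the unique OCR texts found in frames start_frame..=end_frame."""
    frames = sorted(ocr_map)
    lo = bisect_left(frames, start_frame)
    hi = bisect_right(frames, end_frame)  # inclusive upper bound
    seen = set()
    unique = []
    for frame in frames[lo:hi]:
        for text in ocr_map[frame]:
            if text not in seen:  # case-sensitive comparison
                seen.add(text)
                unique.append(text)
    return " ".join(unique)


ocr_by_frame = {0: ["Agenda"], 4: ["Agenda", "Q3 Revenue"], 8: ["agenda"], 40: ["Later"]}
in_range = collect_ocr_text(ocr_by_frame, 0, 8)
out_of_range = collect_ocr_text(ocr_by_frame, 10, 20)
```

Because deduplication is case-sensitive, "Agenda" and "agenda" both survive in the joined string.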
### `text_content` Composition Rules
```
if OCR text exists:
combined = "{asr_text} {ocr_text}"
else:
combined = "{asr_text}"
```
The combined string is used for both embedding and keyword search. The original ASR text is preserved separately in `content.text`.
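A compact illustration of the rule, showing how `text_content` and the `content` JSON relate. The function name `build_chunk_fields` is hypothetical; the field names follow the Chunk Data Structure section.

```python
def build_chunk_fields(asr_text, ocr_text):
    """Return (text_content, content) for one Rule 1 chunk."""
    # OCR text is appended only when present, so ASR-only chunks gain no trailing space.
    text_content = f"{asr_text} {ocr_text}" if ocr_text else asr_text
    content = {"text": asr_text, "ocr_text": ocr_text}
    return text_content, content


with_ocr, content = build_chunk_fields("今天的會議我們要討論", "Q3 Revenue Slides Agenda")
asr_only, _ = build_chunk_fields("今天的會議我們要討論", "")
```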
## Trigger Points
| Trigger | Location | Condition |
|---------|----------|-----------|
| Worker auto | `job_worker.rs:1135` | After ASRX processor completes and no sentence chunks exist yet |
| HTTP API | `POST /api/v1/file/:file_uuid/rule1` | Manual trigger via `pipeline_core::execute_rule1` |
| Programmatic | `pipeline_core::execute_rule1` | Called by other modules needing sentence chunks |
The worker guard checks idempotency:
```sql
SELECT 1 FROM chunk WHERE file_uuid = $1 AND chunk_type = 'sentence' LIMIT 1
```
## Edge Cases
| Scenario | Behavior |
|----------|----------|
| No ASR segments | Returns 0 immediately with info log |
| No OCR data in pre_chunks | `ocr_text` is empty string; `text_content` = ASR only |
| OCR frame with no valid text | Skipped (confidence ≤ 0.5 or empty string) |
| ASR segment end_time = 0.0 | Logs warning; overlap-based matching degrades gracefully |
| Large number of segments | Batches in single transaction; progress logged every 100 segments |
## Version History
| Version | Date | Author | Change |
|---------|------|--------|--------|
| 1.0 | 2026-06-20 | OpenCode | Initial design: ASR + OCR → sentence chunks |


@@ -0,0 +1,816 @@
# TKG Multi-Trace Design V1.0
**Date**: 2026-06-19
**Version**: 1.0.0
**Status**: Draft
---
## Overview
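A unified 8Hz sampling framework that integrates four traces (face, appearance, gaze, lip) and connects them to sentence/speaker/accessory nodes to build a complete Temporal Knowledge Graph (TKG).
### Design Goals
1. **Temporal alignment**: all traces sit on the same 8Hz grid, so edge computation needs no interpolation
2. **On-demand refinement**: specific features (blink, lip-sync, mutual gaze) can raise the sampling rate locally
3. **Accessory detection**: 49 accessory categories (head 12 + neck 5 + hand 16 + foot 8 + carried 5 + color 3)
4. **Skin tone + lighting**: Fitzpatrick classification plus lighting parameters, supporting confidence assessment
5. **Social interaction**: mutual gaze, lip-sync, and speaker-face binding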
---
## 1. 8Hz Sampling Framework
### 1.1 Basic Principle
```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```
| Video length | Total frames | 8Hz samples |
|----------|--------|------------|
| 5 min | 9,000 | ~2,250 |
| 10 min | 18,000 | ~4,500 |
| 30 min | 54,000 | ~13,500 |
### 1.2 On-Demand Refinement
```
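Layer 1: 8Hz baseline (all processors)
Layer 2: Refinement (triggered by specific features)
Refinement scenarios:
- Blink confirmation: a sudden drop in eye openness at 8Hz → re-fetch the surrounding ±4 frames (30Hz)
- Lip-sync: time ranges covered by a sentence chunk → 16Hz
- Mutual gaze: two people's gaze directions converge → confirm with the surrounding ±2 frames (30Hz)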
```
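The blink scenario can be illustrated with a short Python sketch that turns a per-frame eye-openness series into the set of extra frames to fetch. The drop threshold of 0.3 and the function name are assumptions made for illustration; the design only states that a sudden drop triggers a ±4 frame refinement.

```python
def blink_refine_frames(openness_by_frame, total_frames, drop_threshold=0.3, radius=4):
    """Return the full-rate frames to re-fetch around each sudden openness drop."""
    frames = sorted(openness_by_frame)
    refine = set()
    for prev, cur in zip(frames, frames[1:]):
        drop = openness_by_frame[prev] - openness_by_frame[cur]
        if drop >= drop_threshold:
            # Clamp the window to the valid frame range of the video.
            lo = max(0, cur - radius)
            hi = min(total_frames - 1, cur + radius)
            refine.update(range(lo, hi + 1))
    return refine


# Openness sampled on the base grid; it falls sharply at frame 8.
openness = {0: 0.85, 4: 0.84, 8: 0.30, 12: 0.83}
refine = blink_refine_frames(openness, total_frames=100)
```

The resulting set is the kind of input that a merge step would union with the base grid before the refined frames are processed.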
### 1.3 Sample Frame Computation
```rust
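// worker/processor.rs
use std::collections::HashSet;

/// Base grid: every `round(fps / 8)`-th frame, starting at frame 0.
fn compute_sample_frames(total_frames: i64, fps: f64) -> Vec<i64> {
    let interval = (fps / 8.0).round() as i64;
    // `max(1)` guards against a zero step when fps is below 4.
    (0..total_frames).step_by(interval.max(1) as usize).collect()
}

/// Union of the base grid and the refinement frames, sorted ascending.
fn merge_refine_frames(base: &[i64], refine: &HashSet<i64>) -> Vec<i64> {
    let mut combined: HashSet<i64> = base.iter().cloned().collect();
    combined.extend(refine.iter().cloned());
    let mut sorted: Vec<i64> = combined.into_iter().collect();
    sorted.sort();
    sorted
}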
```
---
## 2. Trace Types
### Key Traces Overview
| # | Trace type | Source | Purpose |
|---|-----------|------|------|
| 1 | **face_trace** | face_detections + face.json | Face tracking, identity recognition |
| 2 | **appearance_trace** | appearance.json | Clothing color, accessories, skin tone |
| 3 | **gaze_trace** | face.json (pose_angle + landmarks) | Gaze direction, mutual gaze |
| 4 | **lip_trace** | face.json (landmarks) | Lip shape, speech synchronization |
| 5 | **speaker_trace** | asrx.json (speaker diarization) | Speaker identification |
| 6 | **text_trace** | dev.chunk (sentence chunks) | Text content, semantics |
| 7 | **skin_tone_trace** | face.json (ROI HSV) | Skin tone classification, lighting record |
---
### 2.1 Face Trace (Existing)
```json
{
"node_type": "face_trace",
"external_id": "trace_5",
"properties": {
"frame_count": 200,
"start_frame": 150,
"end_frame": 350,
"avg_bbox": { "x": 500, "y": 300, "width": 200, "height": 250 },
"avg_yaw": -0.15,
"avg_pitch": -0.08,
"avg_roll": -0.20,
"pose_count": 180,
"embedding": [...],
"skin_tone": {
"face_h_mean": 18.5,
"fitzpatrick": "Type IV - Medium",
"confidence": 0.82,
"lighting": {
"brightness": 0.65,
"color_temp": "warm",
"direction": "front",
"uniformity": 0.92,
"source": "indoor",
"quality": "good"
},
"sample_frames": 156
}
}
}
```
### 2.2 Appearance Trace (New)
**Binding strategy**: IoU-match each appearance person against a face detection and inherit that face's trace_id
```json
{
"node_type": "appearance_trace",
"external_id": "trace_5",
"properties": {
"trace_id": 5,
"frame_count": 400,
"start_frame": 100,
"end_frame": 500,
"face_overlap_frames": 200,
"confidence": 0.50,
"color_features": {
"dominant_colors": [[0.1, 0.6, 0.8], ...],
"upper_body_hsv": [[...], [...], [...]],
"lower_body_hsv": [[...], [...], [...]]
},
"accessories": {
"head": {
"hat": {"detected": true, "confidence": 0.82, "first_frame": 0},
"glasses": {"detected": true, "confidence": 0.67, "first_frame": 0},
"earrings": {"detected": false},
"mask": {"detected": false},
"hairstyle": {"type": "long", "confidence": 0.75},
"hair_accessory": {"detected": false},
"nose_ring": {"detected": false},
"lip_ring": {"detected": false},
"face_tattoo": {"detected": false},
"eyebrow_tattoo": {"detected": false},
"beard": {"detected": true, "confidence": 0.88},
"headscarf": {"detected": false}
},
"neck": {
"tie": {"detected": true, "confidence": 0.92, "first_frame": 0, "source": "hsv_color_block"},
"scarf": {"detected": false},
"shawl": {"detected": false},
"necklace": {"detected": true, "confidence": 0.71, "first_frame": 12, "source": "clip"},
"neck_tattoo": {"detected": false}
},
"hand": {
"ring": {"detected": false},
"bracelet": {"detected": false},
"watch": {"detected": true, "confidence": 0.63, "first_frame": 24},
"gloves": {"detected": false}
},
"hand_held": {
"phone": {"detected": true, "confidence": 0.88, "source": "hsv_color_block"},
"pen": {"detected": false},
"cup": {"detected": false},
"knife": {"detected": false},
"gun": {"detected": false}
},
"foot": {
"shoes": {"type": "sneaker", "confidence": 0.78, "source": "hsv_color_block"},
"socks": {"detected": false},
"barefoot": {"detected": false}
},
"vehicle": {
"bicycle": {"detected": false, "source": "hsv_color_block"},
"skateboard": {"detected": false},
"scooter": {"detected": false}
},
"carried": {
"backpack": {"detected": false},
"handbag": {"detected": true, "confidence": 0.85, "source": "hsv_color_block"},
"luggage": {"detected": false}
}
}
}
}
```
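The IoU binding strategy for appearance traces can be sketched as follows. This is a minimal illustration rather than the project's implementation: the `(x, y, width, height)` box layout follows the `avg_bbox` fields shown in the face trace, while the function names `iou` and `bind_trace_id` and the `min_iou=0.1` threshold are assumptions made for this example. A face box is much smaller than a person box, so plain IoU stays low even for a correct match; a real pipeline might prefer intersection over face area.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def bind_trace_id(person_box, faces, min_iou=0.1):
    """Return the trace_id of the best-overlapping face, or None.

    faces: list of (trace_id, box) pairs detected in the same frame.
    min_iou is an assumed threshold, not a value taken from the design.
    """
    best_id, best_score = None, min_iou
    for trace_id, face_box in faces:
        score = iou(person_box, face_box)
        if score >= best_score:
            best_id, best_score = trace_id, score
    return best_id
```

A person box that matches no face returns `None`, which would leave the appearance record without a trace_id.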
### 2.3 Speaker Trace (key)
**Source**: ASRX speaker diarization + face trace binding
```json
{
"node_type": "speaker_trace",
"external_id": "SPEAKER_0",
"properties": {
"speaker_id": "SPEAKER_0",
"segment_count": 45,
"total_duration": 120.5,
"first_appearance": {"frame": 100, "time": 3.3},
"last_appearance": {"frame": 3600, "time": 120.0},
"full_text": "大家好 今天我們來討論... (完整語音轉文字)",
"segments": [
{"start_time": 0.1, "end_time": 2.0, "text": "大家好", "start_frame": 3, "end_frame": 60},
{"start_time": 5.2, "end_time": 8.5, "text": "今天我們來討論", "start_frame": 156, "end_frame": 255},
...
],
"face_trace_ids": [5, 12, 23],
"appearance_trace_ids": [5, 12],
"gaze_context": {
"looking_at_person": true,
"mutual_gaze_with": [12]
},
"lip_sync_quality": 0.85
}
}
```
**Source data**:
```
ASRX → asrx.json (segments with speaker_id)
Face → face_detections (trace_id)
Binding → SPEAKS_AS edge (speaker ↔ face_trace)
```
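One way to derive the SPEAKS_AS binding from these sources is to pick the face trace whose frame range overlaps a speaker's segments the most. The sketch below assumes exactly that rule; the function names, the inclusive frame ranges, and the definition of `confidence` as overlap divided by spoken frames are all illustrative assumptions, not the documented algorithm.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of frames shared by two inclusive frame ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def bind_speaker_to_face(segments, trace_ranges):
    """Pick the face trace that overlaps a speaker's segments the most.

    segments: list of (start_frame, end_frame) for one speaker.
    trace_ranges: {trace_id: (start_frame, end_frame)}.
    Returns (trace_id, overlap_frames, confidence) or None if nothing overlaps.
    """
    spoken = sum(end - start + 1 for start, end in segments)
    best = None
    for trace_id, (t_start, t_end) in sorted(trace_ranges.items()):
        shared = sum(overlap(s, e, t_start, t_end) for s, e in segments)
        if shared > 0 and (best is None or shared > best[1]):
            best = (trace_id, shared, shared / spoken)
    return best
```

With the two sample segments from the speaker trace (frames 3 to 60 and 156 to 255) and a face trace spanning frames 150 to 350, the second segment overlaps completely and the first not at all.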
### 2.4 Text Trace (key)
**Source**: dev.chunk (chunk_type='sentence') + ASRX text
```json
{
"node_type": "text_trace",
"external_id": "chunk_1",
"properties": {
"chunk_id": "chunk_1",
"text": "大家好,今天我們來討論這個話題",
"text_normalized": "大家好,今天我們來討論這個話題",
"start_time": 0.1,
"end_time": 5.2,
"start_frame": 3,
"end_frame": 156,
"speaker_id": "SPEAKER_0",
"language": "zh",
"confidence": 0.95,
"yolo_objects": ["person", "chair"],
"face_ids": ["face_100"],
"speaker_trace_id": "SPEAKER_0",
"face_trace_id": 5,
"lip_sync": {
"matched_frames": 120,
"total_frames": 153,
"quality": 0.85
},
"semantic_embedding": [0.12, -0.34, ...],
"sentiment": "neutral"
}
}
```
**Source data**:
```
Rule 1 → dev.chunk (sentence chunks)
ASRX → asrx.json (speaker_id binding)
Face → face_detections (face_ids in chunk metadata)
YOLO → yolo.json (co-occurring objects)
```
**Edge connections**:
- `SPEAKS_BY`: text_trace → speaker_trace
- `SPOKEN_WHILE`: text_trace → face_trace
- `LIP_SYNC`: text_trace → lip_trace
- `CONTAINS_OBJECT`: text_trace → object
### 2.5 Skin Tone Trace (key)
**Source**: face.json ROI HSV + lighting analysis
```json
{
"node_type": "skin_tone_trace",
"external_id": "trace_5",
"properties": {
"trace_id": 5,
"frame_count": 200,
"start_frame": 150,
"end_frame": 350,
"face_h_mean": 18.5,
"fitzpatrick": "Type IV - Medium",
"confidence": 0.82,
"lighting": {
"brightness": 0.65,
"color_temp": "warm",
"direction": "front",
"uniformity": 0.92,
"source": "indoor",
"quality": "good"
},
"sample_frames": 156,
"hand_h_mean": 17.8,
"arm_h_mean": 18.2
}
}
```
**Fitzpatrick classification**:
| Type | Description | H value (HSV) |
|------|------|------------|
| I | Very light | 0–5 |
| II | Light | 5–12 |
| III | Medium-light | 12–18 |
| IV | Medium | 18–25 |
| V | Dark | 25–35 |
| VI | Very dark | 35+ |
**Lighting quality**:
| Quality | Condition | Skin-tone confidence |
|---------|------|------------|
| good | brightness > 0.4, uniformity > 0.8, front light | High (×1.0) |
| fair | brightness > 0.3, uniformity > 0.6 | Medium (×0.7) |
| poor | brightness < 0.3 or backlight | Low (×0.5) |
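The two tables can be read as a pair of lookup functions. The sketch below is one possible reading under stated assumptions: hue ranges are treated as half-open intervals (a hue of exactly 18 falls in Type IV), "backlight" is mapped to `direction == "back"`, inputs not covered by the table fall through to `poor`, and the multiplier is applied to the raw skin-tone confidence. None of these choices is confirmed by the design.

```python
# Upper hue bound (exclusive) for each Fitzpatrick type, per the table above.
FITZPATRICK_BOUNDS = [(5, "I"), (12, "II"), (18, "III"), (25, "IV"), (35, "V")]

# Confidence multipliers from the lighting quality table.
QUALITY_WEIGHT = {"good": 1.0, "fair": 0.7, "poor": 0.5}


def fitzpatrick_type(h_mean):
    """Map a mean HSV hue to a Fitzpatrick type label."""
    for upper, label in FITZPATRICK_BOUNDS:
        if h_mean < upper:
            return label
    return "VI"


def lighting_quality(brightness, uniformity, direction):
    """Classify lighting as good, fair, or poor."""
    if brightness < 0.3 or direction == "back":
        return "poor"
    if brightness > 0.4 and uniformity > 0.8 and direction == "front":
        return "good"
    if brightness > 0.3 and uniformity > 0.6:
        return "fair"
    return "poor"  # assumption: the table does not cover this case


def weighted_confidence(raw_confidence, quality):
    """Scale a raw skin-tone confidence by the lighting multiplier."""
    return raw_confidence * QUALITY_WEIGHT[quality]
```

The sample trace above (`face_h_mean` 18.5, brightness 0.65, uniformity 0.92, front light) maps to Type IV with `good` lighting under these rules.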
### 2.6 Gaze Trace (new)
```json
{
"node_type": "gaze_trace",
"external_id": "trace_5",
"properties": {
"trace_id": 5,
"frame_count": 200,
"start_frame": 150,
"end_frame": 350,
"avg_yaw": -0.15,
"avg_pitch": -0.08,
"avg_roll": -0.20,
"head_direction": "frontal",
"gaze_direction": "center-left",
"eye_openness": 0.85,
"blink_count": 12,
"blink_rate": 0.06,
"looking_at_person": true,
"looking_at_object": ["chair"],
"refined_ranges": [
{"start_frame": 200, "end_frame": 220, "hz": 30, "reason": "mutual_gaze"}
]
}
}
```
### 2.7 Lip Trace (key)
**Source**: face.json → faces[].lips (inner_lips 6pts + outer_lips 14pts)
```json
{
"node_type": "lip_trace",
"external_id": "trace_5",
"properties": {
"trace_id": 5,
"frame_count": 180,
"start_frame": 160,
"end_frame": 340,
"avg_openness": 0.3,
"avg_width": 45.2,
"avg_height": 12.8,
"movement_variance": 0.15,
"speaking_frames": 95,
"silent_frames": 85,
"lip_landmark_samples": {
"inner_lips": [[x,y,z], ...],
"outer_lips": [[x,y,z], ...]
},
"speech_correlation": {
"text_trace_ids": ["chunk_1", "chunk_2", "chunk_3"],
"sync_quality": 0.85,
"matched_segments": [
{"start_frame": 160, "end_frame": 200, "text": "大家好"},
{"start_frame": 210, "end_frame": 250, "text": "今天我們來討論"}
]
},
"refined_ranges": [
{"start_frame": 160, "end_frame": 340, "hz": 30, "reason": "lip_sync"}
]
}
}
```
**Lip-sync computation**:
```
Lip openness = inner_lips_area / outer_lips_area
Speaking detection:
- openness > threshold (adjusted dynamically)
- movement_variance > threshold (lip shape changes)
- sustained for at least N frames (to reject noise)
Sync with text:
- compare against the text_trace start/end_time
- compute the overlap ratio between lip movement and the text time span
- quality = matched_frames / total_text_frames
```
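A minimal sketch of the speaking-detection and sync-quality steps, assuming per-frame openness values are already available. The fixed `threshold=0.2` and `min_frames=3` stand in for the dynamically adjusted threshold and the unspecified N, the `movement_variance` check is omitted for brevity, and frame ranges are half-open `[start, end)` so that a span from frame 3 to 156 counts 153 frames, as in the text trace example.

```python
def speaking_runs(openness, threshold=0.2, min_frames=3):
    """Return half-open (start, end) runs where openness stays above threshold."""
    runs, start = [], None
    # The sentinel value closes a run that reaches the end of the sequence.
    for i, value in enumerate(list(openness) + [float("-inf")]):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_frames:  # drop short bursts as noise
                runs.append((start, i))
            start = None
    return runs


def sync_quality(runs, text_start, text_end):
    """quality = matched_frames / total_text_frames for one text span."""
    total = text_end - text_start
    matched = sum(
        max(0, min(end, text_end) - max(start, text_start))
        for start, end in runs
    )
    return matched / total if total > 0 else 0.0
```

Isolated single-frame openings are discarded, so only sustained lip movement contributes to `matched_frames`.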
**Edge connections**:
- `HAS_LIP`: face_trace → lip_trace
- `LIP_SYNC`: lip_trace → text_trace
- `GAZE_SYNC_SPEECH`: gaze_trace + lip_trace (gaze direction while speaking)
---
## 3. Accessory Detection
### 3.1 Division of Detection Methods
| Method | Accessories covered | Speed | Notes |
|------|----------|------|------|
| **HSV color block** | tie, phone, watch, ring, bracelet, glasses, mask, hat, shoes, backpack, handbag, umbrella, pen, knife, cup, book, laptop, remote, baseball_bat | Fast | **Primary method**: analyze off-color regions within the person crop |
| **CLIP** | hairstyle, beard, face_tattoo, eyebrow_tattoo, earrings, nose_ring, lip_ring, neck_tattoo, headscarf, scarf, shawl, necklace, gloves, tool, gun, skateboard, scooter, roller_skates, socks, barefoot | Medium | zero-shot (used when YOLO is unreliable and color blocks cannot easily distinguish the item) |
| **MediaPipe** | gesture, arm_pose | Fast | 21 hand pts + 33 pose pts |
| **HSV** | upper_body_color, lower_body_color, skin_tone | Fast | Color feature extraction |
### 3.2 Appearance Tightly Coupled to Landmarks/Pose
**Core principle**: Appearance does not detect bounding boxes on its own; it crops ROIs directly from the geometry produced by face/pose/MediaPipe.
```
Face Landmarks (20pts) ──► face ROI ──► hat, glasses, mask, beard, earrings
Pose 33 Keypoints ───────► body ROI ──► tie, necklace, upper/lower body HSV
MediaPipe Hands (21×2) ──► wrist ROI ──► watch, bracelet, ring, phone, glove
MediaPipe Pose Feet ─────► foot ROI ──► shoes, socks, barefoot
```
**ROI localization**:
```python
# Geometry helpers (expand_region, bbox_around_points, region_below_point,
# region_between, circle_around) are assumed to be defined elsewhere.
def get_accessory_rois(frame, face_data, pose_data, hand_data):
    rois = {}
    # Face region: use face bbox + landmarks
    face_bbox = face_data['bbox']
    landmarks = face_data['landmarks']  # nose, left_eye, right_eye
    # Hat ROI: extend upward from the face bbox
    rois['hat'] = expand_region(face_bbox, direction='up', factor=0.5)
    # Glasses ROI: horizontal band across the eye landmarks
    left_eye = landmarks['left_eye']
    right_eye = landmarks['right_eye']
    rois['glasses'] = bbox_around_points(left_eye, right_eye, padding=10)
    # Mask ROI: from below the nose down to the chin
    nose = landmarks['nose']
    rois['mask'] = region_below_point(nose, face_bbox.bottom)
    # Neck ROI: use pose neck keypoints
    if pose_data:
        neck = pose_data['keypoints']['neck']
        nose = pose_data['keypoints']['nose']
        rois['neck'] = region_between(nose, neck, width=80)
    # Wrist ROI: use MediaPipe hand landmarks
    if hand_data:
        for side in ['left', 'right']:
            wrist = hand_data[side]['wrist']
            rois[f'{side}_wrist'] = circle_around(wrist, radius=30)
    # Foot ROI: use pose ankle/toe keypoints
    if pose_data:
        for side in ['left', 'right']:
            ankle = pose_data['keypoints'][f'{side}_ankle']
            toe = pose_data['keypoints'][f'{side}_toe']
            rois[f'{side}_foot'] = bbox_around_points(ankle, toe, padding=20)
    return rois
```
### 3.3 HSV Color-Block Detection Flow
```python
def detect_accessories_tightly_coupled(frame, face_data, pose_data, hand_data,
                                       main_colors, current_frame):
    # main_colors: dominant clothing/skin colors of the person (HSV) and
    # current_frame: index of the frame being processed. Both were undefined
    # in the original sketch and are assumed here to be supplied by the caller.
    # 1. Locate each ROI precisely from landmarks/pose
    rois = get_accessory_rois(frame, face_data, pose_data, hand_data)
    results = {}
    for roi_name, roi_bbox in rois.items():
        roi_hsv = crop_and_convert(frame, roi_bbox, 'HSV')
        # 2. Find off-color blobs inside the precise ROI
        diff_mask = compute_color_diff(roi_hsv, main_colors, threshold=30)
        blobs = find_connected_components(diff_mask)
        for blob in blobs:
            accessory = classify_accessory_by_position(blob, roi_name)
            if accessory:
                results[accessory] = {
                    "detected": True,
                    "confidence": blob.confidence,
                    "source": "hsv_color_block",
                    "roi": roi_name,
                    "first_frame": current_frame
                }
    # 3. Items that color blocks cannot resolve → CLIP
    # Iterate the prompt table from section 3.5 (keys such as 'hairstyle_short').
    person_crop = crop_person(frame, face_data['bbox'])
    for item, prompt in CLIP_PROMPTS.items():
        confidence = clip_score(person_crop, prompt)
        if confidence > 0.5:
            results[item] = {"detected": True, "confidence": confidence, "source": "clip"}
    return results
```
### 3.4 Dependencies
```
Face Detection ──► face_detections (trace_id, bbox, embedding)
Face Landmarks ────► face ROI (hat, glasses, mask, beard)
Pose 33pts ────────► body ROI (neck, wrist, foot) ──► Appearance HSV
MediaPipe Hands ───► wrist ROI (watch, bracelet, ring, phone)
TKG appearance_trace
```
### 3.5 CLIP Prompts (only for accessories that color blocks cannot distinguish)
```python
CLIP_PROMPTS = {
    # Head: items that color blocks cannot resolve
    "hairstyle_short": "a person with short hair",
    "hairstyle_long": "a person with long hair",
    "hairstyle_braid": "a person with braided hair",
    "hairstyle_bun": "a person with hair in a bun",
    "face_tattoo": "a person with a visible face tattoo or face paint",
    "eyebrow_tattoo": "a person with tattooed or styled eyebrows",
    "beard": "a person with a beard or mustache",
    # Ear/nose/lip piercings
    "earrings": "a person wearing earrings",
    "nose_ring": "a person wearing a nose ring or nose piercing",
    "lip_ring": "a person wearing a lip ring or lip piercing",
    # Neck: small items such as necklaces
    "necklace": "a person wearing a necklace",
    "neck_tattoo": "a person with a visible neck tattoo",
    # Small hand items
    "gloves": "a person wearing gloves",
    "tool": "a person holding a tool like a wrench or screwdriver",
    "gun": "a person holding a gun",
    # Feet
    "socks": "a person wearing visible socks",
    "barefoot": "a barefoot person",
    "roller_skates": "a person wearing roller skates",
}
```
---
## 4. Skin Tone + Lighting
### 4.1 Fitzpatrick Classification
| Type | Description | H value (HSV) |
|------|------|------------|
| I | Very light | 0–5 |
| II | Light | 5–12 |
| III | Medium-light | 12–18 |
| IV | Medium | 18–25 |
| V | Dark | 25–35 |
| VI | Very dark | 35+ |
### 4.2 Lighting Parameters
| Parameter | Computation | Range |
|------|----------|------|
| brightness | Mean of the V channel | 0.0–1.0 |
| color_temp | White-balance estimate | warm/neutral/cool |
| direction | Shadow gradient + yaw/pitch | front/side/back/top |
| uniformity | Standard deviation of V across face regions | 0.0–1.0 |
| source | Combined judgment from brightness + color temperature | indoor/outdoor/flash |
### 4.3 Lighting Quality
| Quality | Condition | Skin-tone confidence |
|---------|------|------------|
| good | brightness > 0.4, uniformity > 0.8, front light | High (×1.0) |
| fair | brightness > 0.3, uniformity > 0.6 | Medium (×0.7) |
| poor | brightness < 0.3 or backlight | Low (×0.5) |
---
## 5. TKG Node Types
| node_type | external_id | Source | Importance | Properties |
|-----------|-------------|------|--------|------|
| `face_trace` | `trace_N` | face_detections | ★★★★ | frame_count, bbox, pose, embedding, skin_tone |
| `appearance_trace` | `trace_N` | appearance.json | ★★★★ | trace_id, color_features, accessories, confidence |
| `gaze_trace` | `trace_N` | face.json (pose_angle) | ★★★ | trace_id, gaze_direction, blink_count, looking_at |
| `lip_trace` | `trace_N` | face.json (lips) | ★★★★ | trace_id, avg_openness, speaking_frames, speech_correlation |
| `speaker_trace` | `SPEAKER_N` | asrx.json | ★★★★ | speaker_id, segments, face_trace_ids, full_text |
| `text_trace` | `chunk_N` | dev.chunk | ★★★★ | text, speaker_id, time_range, yolo_objects, lip_sync |
| `skin_tone_trace` | `trace_N` | face.json (ROI HSV) | ★★★ | trace_id, fitzpatrick, lighting, confidence |
| `object` | `class_name` | yolo.json | ★★ | total_detections, frames |
| `accessory` | `hat`, `glasses`, ... | appearance.json | ★★ | category, trace_ids, first/last_seen |
---
## 6. TKG Edge Types
| Edge Type | Source → Target | Properties | Description |
|-----------|----------------|------|------|
| `SPEAKS_AS` | speaker_trace → face_trace | confidence, overlap_frames | Binds a speaker to a face |
| `SPEAKS_BY` | text_trace → speaker_trace | - | Who spoke the text |
| `SPOKEN_WHILE` | text_trace → face_trace | frame_overlap | Face visible while the text is spoken |
| `HAS_APPEARANCE` | face_trace → appearance_trace | confidence, overlap_frames | Appearance features |
| `HAS_GAZE` | face_trace → gaze_trace | overlap_frames | Gaze direction |
| `HAS_LIP` | face_trace → lip_trace | overlap_frames | Lip data |
| `HAS_SKIN_TONE` | face_trace → skin_tone_trace | confidence, lighting_match | Skin tone record |
| `LIP_SYNC` | lip_trace → text_trace | time_alignment, openness_match | Lip-speech sync |
| `WEARS` | appearance_trace → accessory | confidence, first_frame | Accessory |
| `LOOKING_AT` | gaze_trace → object | direction_match, distance | Gazing at an object |
| `LOOKING_AT_PERSON` | gaze_trace → face_trace | direction_match | Gazing at another person |
| `MUTUAL_GAZE` | face_trace ↔ face_trace | first_frame, last_frame, duration_frames, confidence | Mutual gaze |
| `CO_OCCURS_WITH` | object ↔ object | frame_count | Object co-occurrence |
| `SAME_SKIN_TONE` | face_trace ↔ face_trace | h_diff, lighting_match, confidence | Similar skin tone |
| `HOLDS` | appearance_trace → object | - | Hand-held items such as a phone |
---
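## 7. Mutual Gaze Analysis
### 7.1 Computation Logic
```
For each frame:
For each pair (person_A, person_B):
1. Compute A's gaze vector (from yaw/pitch/roll)
2. Compute the position of B's bbox center in A's coordinate frame
3. Check whether B lies inside A's gaze cone (threshold: ~15°)
4. Run the reverse check B → A
5. Hit in both directions → mutual_gaze
```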
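The gaze-cone test can be illustrated with plain vector geometry. The sketch below makes several assumptions that the design does not state: yaw and pitch are in radians, a head at yaw 0 and pitch 0 looks along +z, each person is represented by a 3D point in a shared coordinate frame, and roll is ignored because it does not change the gaze direction. How the pipeline actually projects bbox centers into a person's coordinate frame is not covered here.

```python
import math


def gaze_vector(yaw, pitch):
    """Unit gaze direction for yaw/pitch in radians (yaw=0, pitch=0 looks along +z)."""
    return (
        math.sin(yaw) * math.cos(pitch),
        math.sin(pitch),
        math.cos(yaw) * math.cos(pitch),
    )


def in_gaze_cone(origin, gaze, target, max_angle_deg=15.0):
    """True if target lies within max_angle_deg of the gaze ray from origin."""
    to_target = tuple(t - o for t, o in zip(target, origin))
    norm = math.sqrt(sum(c * c for c in to_target))
    if norm == 0:
        return False
    cos_angle = sum(g * c for g, c in zip(gaze, to_target)) / norm
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg


def is_mutual_gaze(pos_a, yaw_a, pitch_a, pos_b, yaw_b, pitch_b, max_angle_deg=15.0):
    """Both directions must hit: A looks at B and B looks at A."""
    return (
        in_gaze_cone(pos_a, gaze_vector(yaw_a, pitch_a), pos_b, max_angle_deg)
        and in_gaze_cone(pos_b, gaze_vector(yaw_b, pitch_b), pos_a, max_angle_deg)
    )
```

Two people facing each other along the z axis produce a hit in both directions; turning either head by more than the cone angle breaks the mutual gaze.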
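### 7.2 Persistence Confirmation
```
mutual_gaze only counts as meaningful when it lasts at least N frames:
- Base: 8Hz, sustained ≥ 3 frames (~0.375s) → create edge
- Refinement: once a candidate is found, go back and confirm at 30Hz
- confidence = consecutive frames / total possible frames
```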
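The 8Hz base pass can be sketched as a search for the longest streak of mutual-gaze hits. This is only an illustration: `hits` is assumed to be one boolean per 8Hz sample in which both people are visible, "total possible frames" is read as the length of that list, and the 30Hz refinement step is not shown.

```python
SAMPLE_HZ = 8  # base sampling rate from the design


def longest_run(hits):
    """Return (start, length) of the longest streak of True values."""
    best_start, best_len, start = 0, 0, None
    # The trailing False acts as a sentinel that closes a final streak.
    for i, hit in enumerate(list(hits) + [False]):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


def mutual_gaze_candidate(hits, min_frames=3):
    """Return edge properties for the longest streak, or None if it is too short."""
    start, length = longest_run(hits)
    if length < min_frames:
        return None
    return {
        "start_sample": start,
        "duration_samples": length,
        "duration_seconds": length / SAMPLE_HZ,
        "confidence": length / len(hits),
    }
```

Three consecutive hits at 8Hz give a duration of 0.375 s, which matches the minimum stated above.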
### 7.3 Edge Properties
```json
{
"edge_type": "MUTUAL_GAZE",
"source": "trace_5",
"target": "trace_12",
"properties": {
"first_frame": 150,
"last_frame": 280,
"duration_frames": 130,
"duration_seconds": 4.3,
"confidence": 0.85,
"context": "during_conversation"
}
}
```
---
## 8. Implementation Plan
### Phase 0: 8Hz Sampling Framework (~100 lines)
| File | Change |
|------|------|
| `worker/processor.rs` | Compute 8Hz sample frames + refine framework |
| `scripts/face_processor.py` | Accept a `--frames` argument |
| `scripts/appearance_processor.py` | Switch the bbox source to yolo; accept `--frames` |
| `scripts/mediapipe_holistic_processor.py` | Accept `--frames` |
### Phase 1: Gaze + Mutual Gaze (~250 lines)
| Module | Lines |
|------|------|
| Gaze trace nodes | 150 |
| Mutual Gaze edges | 100 |
### Phase 2: Lip + Sentence + Speaker (~260 lines)
| Module | Lines |
|------|------|
| Lip trace nodes | 120 |
| Sentence nodes | 80 |
| Speaker enhancements | 60 |
### Phase 3: Appearance + Accessories (~280 lines)
| Module | Lines |
|------|------|
| Appearance traces (HSV + trace_id binding) | 120 |
| Accessories (CLIP detection) | 80 |
| Skin tone + lighting | 80 |
### Phase 4: TKG Integration (~110 lines)
| Module | Lines |
|------|------|
| Unified `build_tkg()` call | 40 |
| Edge builder updates | 70 |
### Total: ~1,000 lines
---
## 9. Dependency Graph
```
YOLO (global) ──────────────────────────────────────────┐
│ │
▼ │
Face (8Hz) ──► trace_id ──┬──► Appearance (IoU binding) │
│ │ ├──► HSV color │
│ │ ├──► Accessories (CLIP) │
│ │ └──► Skin tone + light │
│ │ │
│ ├──► Gaze ──► Mutual Gaze ────┤
│ │ ──► Looking at YOLO │
│ │ │
│ └──► Lip ──► LIP_SYNC ◄──────┤
│ │
ASRX ──► Speaker ──► SPEAKS_AS ──► face_trace │
│ │ │
└──► Text (Rule 1) ────┴──► SPEAKS_BY │
├──► SPOKEN_WHILE │
└──► LIP_SYNC ────────────┘
All traces ──────────────────────────► TKG
```
---
## Appendix A: Full Accessory List (49 items)
| Body part | Accessories | Detection method |
|------|------|----------|
| Head (12) | hat, hairstyle, hair_accessory, earrings, nose_ring, lip_ring, face_tattoo, eyebrow_tattoo, glasses, mask, beard, headscarf | HSV color block + CLIP |
| Neck (5) | tie, scarf, shawl, necklace, neck_tattoo | HSV color block + CLIP |
| Hand/arm (16) | ring, bracelet, watch, gloves, phone, pen, laptop, book, cup, remote, tool, knife, gun, baseball_bat, gesture, arm_pose | HSV color block + CLIP + MP |
| Foot/vehicle (8) | shoes, socks, barefoot, skateboard, scooter, bicycle, motorbike, roller_skates | HSV color block + CLIP |
| Carried/environment (5) | backpack, handbag, luggage, chair, diningtable | HSV color block + CLIP |
| Color (3) | upper_body_hsv, lower_body_hsv, skin_tone | HSV |
> **Note**: YOLO is unreliable and is no longer the primary detection method. Most accessories are now detected by HSV color-block analysis; CLIP is used only for items that color blocks cannot easily distinguish (such as piercings, tattoos, and hairstyles).
## Appendix B: DB Schema Changes
```sql
-- appearance_detections (new)
CREATE TABLE appearance_detections (
id BIGSERIAL PRIMARY KEY,
file_uuid VARCHAR NOT NULL,
frame_number BIGINT NOT NULL,
person_id INTEGER NOT NULL,
x INTEGER, y INTEGER, width INTEGER, height INTEGER,
trace_id INTEGER,
confidence REAL,
hsv_histogram JSONB,
dominant_colors JSONB,
upper_body_hsv JSONB,
lower_body_hsv JSONB,
accessories JSONB,
skin_tone JSONB,
lighting JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- tkg_nodes (extend node_type)
-- Added: appearance_trace, gaze_trace, lip_trace, sentence, accessory
-- tkg_edges (extend edge_type)
-- Added: HAS_APPEARANCE, HAS_GAZE, HAS_LIP, WEARS, LOOKING_AT,
-- LOOKING_AT_PERSON, MUTUAL_GAZE, LIP_SYNC, SPEAKS_BY,
-- SAME_SKIN_TONE, HAS_NECK_ACCESSORY, HAS_HEAD_ACCESSORY, HOLDS
```
---
## Version History
| Version | Date | Author | Description |
|---------|------|--------|-------------|
| 1.0.0 | 2026-06-19 | OpenCode | Initial design: 8Hz sampling, 7 traces (face/appearance/gaze/lip/speaker/text/skin_tone), 49 accessories, skin tone + lighting, mutual gaze, lip-sync |
| 1.1.0 | 2026-06-19 | OpenCode | Added speaker_trace, text_trace, skin_tone_trace as important traces; enhanced lip_trace with speech_correlation; updated node/edge tables |
| **1.2.0** | **2026-06-19** | **OpenCode** | **Implementation complete: build_tkg() integrates all node/edge builders. 9 node types, 14 edge types. ~1500 lines added to tkg.rs** |


@@ -0,0 +1,257 @@
---
title: TKG Phase 2.6 Edges Migration Plan
version: 1.0
date: 2026-06-21
author: OpenCode
status: Draft
---
## Phase 2.6 Overview
Migrate TKG edge construction from PostgreSQL `face_detections` queries to the Qdrant payload.
## Current Implementation Analysis
### 2.6.1: co_occurrence_edges (CO_OCCURS_WITH)
**Current Code** (`tkg.rs:932-1039`):
```rust
let face_rows = sqlx::query_as::<_, FaceDetectionRow>(&format!(
"SELECT trace_id::bigint, frame_number::bigint, x::float8, y::float8, width::float8, height::float8
FROM {} WHERE file_uuid = $1 AND trace_id IS NOT NULL
ORDER BY frame_number",
face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```
**Dependencies**:
- `face_detections.trace_id`
- `face_detections.frame_number`
- `face_detections.x, y, width, height`
**Migration Strategy**:
```rust
// Fetch from the Qdrant payload
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;
// Group by frame
let mut frame_map: HashMap<i64, Vec<(i64, f64, f64, f64, f64)>> = HashMap::new();
for emb in embeddings {
let frame = emb.payload.frame_number;
let trace_id = emb.payload.trace_id;
frame_map.entry(frame).or_default().push((
trace_id,
emb.payload.bbox_x,
emb.payload.bbox_y,
emb.payload.bbox_width,
emb.payload.bbox_height,
));
}
```
### 2.6.2: face_face_edges (MUTUAL_GAZE)
**Current Code** (`tkg.rs:1171-1320`):
```rust
let rows: Vec<(i64, i64, i64)> = sqlx::query_as(&format!(
"SELECT a.trace_id::bigint AS tid_a, b.trace_id::bigint AS tid_b, a.frame_number::bigint
FROM {} a
JOIN {} b ON a.file_uuid = b.file_uuid AND a.frame_number = b.frame_number AND a.trace_id < b.trace_id
WHERE a.file_uuid = $1 AND a.trace_id IS NOT NULL AND b.trace_id IS NOT NULL",
face_table, face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```
**Dependencies**:
- `face_detections` self-join for co-occurrence
- `face_detections.trace_id`
- `face_detections.frame_number`
**Migration Strategy**:
```rust
// Fetch all embeddings from Qdrant
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;
// Group by frame
let mut frame_faces: HashMap<i64, Vec<FaceEmbeddingPayload>> = HashMap::new();
for emb in embeddings {
    frame_faces.entry(emb.payload.frame_number).or_default().push(emb.payload);
}
// Find face pairs within the same frame
let mut pairs: Vec<(i64, i64, i64)> = Vec::new();
for (frame, faces) in frame_faces.iter() {
    for i in 0..faces.len() {
        for j in (i+1)..faces.len() {
            // Mirror the SQL join condition `a.trace_id < b.trace_id`:
            // skip detections that belong to the same trace.
            if faces[i].trace_id == faces[j].trace_id {
                continue;
            }
            let tid_a = faces[i].trace_id.min(faces[j].trace_id);
            let tid_b = faces[i].trace_id.max(faces[j].trace_id);
            pairs.push((tid_a, tid_b, *frame));
        }
    }
}
```
### 2.6.3: speaker_face_edges (SPEAKS_AS)
**Current Code** (`tkg.rs:1045-1169`):
```rust
let traces = sqlx::query_as::<_, (i64, i64, i64)>(&format!(
"SELECT trace_id::bigint, MIN(frame_number)::bigint as start_f, MAX(frame_number)::bigint as end_f
FROM {} WHERE file_uuid = $1 AND trace_id IS NOT NULL
GROUP BY trace_id",
face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```
**Dependencies**:
- `face_detections.trace_id`
- `face_detections.frame_number` (MIN/MAX)
**Migration Strategy**:
```rust
// Fetch all embeddings from Qdrant
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;
// Compute the frame range of each trace_id
let mut trace_ranges: HashMap<i64, (i64, i64)> = HashMap::new();
for emb in embeddings {
let trace_id = emb.payload.trace_id;
let frame = emb.payload.frame_number;
let entry = trace_ranges.entry(trace_id).or_insert((frame, frame));
entry.0 = entry.0.min(frame);
entry.1 = entry.1.max(frame);
}
```
### 2.6.4: mutual_gaze_edges (MUTUAL_GAZE)
**Already in face_face_edges**:
- face_face_edges already contains the mutual_gaze detection logic
- No separate migration is needed
### 2.6.5: lip_sync_edges (LIP_SYNC)
**Already migrated in Phase 2.5.2**:
- `build_lip_trace_nodes_from_qdrant()` is complete
- lip_sync_edges already use the Qdrant payload
## Migration Priority
| Priority | Edge Type | Complexity | Impact |
|----------|-----------|-------------|--------|
| P1 | co_occurrence_edges | Low | High (relationship graph) |
| P1 | face_face_edges | Medium | High (face relationships) |
| P2 | speaker_face_edges | Low | Medium (speaker relationships) |
| N/A | mutual_gaze_edges | - | Included in face_face_edges |
| N/A | lip_sync_edges | - | Migrated in Phase 2.5.2 |
## Performance Estimate
| Edge Type | Current (PG) | After Migration | Speedup |
|-----------|--------------|-----------------|---------|
| co_occurrence_edges | ~120ms | ~30ms | 4x |
| face_face_edges | ~90ms | ~25ms | 3.6x |
| speaker_face_edges | ~60ms | ~20ms | 3x |
| **Total** | **~270ms** | **~75ms** | **3.6x** |
## Implementation Steps
### Step 1: Add helper functions in `face_embedding_db.rs`
```rust
// Get all embeddings grouped by frame
pub async fn get_embeddings_by_frame(&self, file_uuid: &str) -> Result<HashMap<i64, Vec<FaceEmbeddingPayload>>>;
// Get trace_id frame ranges
pub async fn get_trace_frame_ranges(&self, file_uuid: &str) -> Result<HashMap<i64, (i64, i64)>>;
```
### Step 2: Create migration functions in `tkg.rs`
```rust
// Phase 2.6.1
async fn build_co_occurrence_edges_from_qdrant(
pool: &PgPool,
file_uuid: &str,
output_dir: &str,
face_db: &FaceEmbeddingDb,
) -> Result<usize>;
// Phase 2.6.2
async fn build_face_face_edges_from_qdrant(
pool: &PgPool,
file_uuid: &str,
pose_data: &[FacePose],
face_db: &FaceEmbeddingDb,
) -> Result<usize>;
// Phase 2.6.3
async fn build_speaker_face_edges_from_qdrant(
pool: &PgPool,
file_uuid: &str,
output_dir: &str,
face_db: &FaceEmbeddingDb,
) -> Result<usize>;
```
### Step 3: Replace in `build_tkg.rs`
```rust
// Old
let e_co = build_co_occurrence_edges(pool, file_uuid, output_dir).await?;
// New
let e_co = build_co_occurrence_edges_from_qdrant(pool, file_uuid, output_dir, face_db).await?;
```
### Step 4: Add feature flag (optional)
```rust
#[cfg(feature = "qdrant-edges")]
let e_co = build_co_occurrence_edges_from_qdrant(...).await?;
#[cfg(not(feature = "qdrant-edges"))]
let e_co = build_co_occurrence_edges(...).await?;
```
## Verification Plan
1. Run TKG rebuild on test file
2. Compare edge counts (PG vs Qdrant)
3. Verify edge properties match
4. Performance benchmark
5. Integration test with Rule2
## Risks & Mitigations
| Risk | Mitigation |
|------|------------|
| Qdrant collection empty | Fallback to PostgreSQL |
| Performance regression | Benchmark before merge |
| Edge count mismatch | Validate with test suite |
| Data inconsistency | Add reconciliation job |
## Success Criteria
- [ ] All edges use Qdrant payload (no face_detections queries)
- [ ] Edge counts match PostgreSQL version
- [ ] Performance improvement >= 2x
- [ ] Rule2/Rule3 work correctly
- [ ] No regressions in existing tests
## Timeline
- Phase 2.6.1 (co_occurrence): 1 day
- Phase 2.6.2 (face_face): 1 day
- Phase 2.6.3 (speaker_face): 0.5 day
- Testing & verification: 0.5 day
- **Total: 3 days**


@@ -0,0 +1,374 @@
---
document_type: "design"
service: "MOMENTRY_CORE"
title: "Video Playback Architecture — Local Direct Serve & Remote Streaming"
version: "V1.0"
date: "2026-06-07"
author: "OpenCode"
status: "draft"
tags:
- "video-playback"
- "caddy"
- "streaming"
- "thumbnail"
- "wordpress-frontend"
related_documents:
- "DESIGN/FILE_LIFECYCLE_V1.0.md"
---
# Video Playback Architecture — Local Direct Serve & Remote Streaming
| Item | Value |
|------|-------|
| Scope | Video file playback & thumbnail serving for WordPress frontend (m5wp) |
| Status | Draft |
| Applies to | Search results (`serve_url`), Caddy routing, Momentry media-proxy endpoint |
| Key concept | Local files served directly by Caddy (zero backend overhead); remote files fall back to Momentry streaming; thumbnails proxied through Caddy to Momentry |
---
## Problem Statement
The WordPress frontend (`m5wp.momentry.ddns.net`) displays search results with video thumbnails and a player. Currently:
- **Thumbnails**: WordPress Code Snippet 61 (`momentry/v1/media` REST route) is inactive → all requests return `rest_no_route` 404
- **Video playback**: Frontend has no way to construct a playable URL from search results; no `serve_url` exists in the search response
- **WordPress constraint**: WordPress files and database tables must not be modified (marcom team territory)
The solution must work for two deployment scenarios:
- **Local**: Video file resides on the same server as Momentry → serve via static HTTP (zero processing overhead)
- **Remote**: Video file resides on an external storage (NAS, S3, etc.) → fall back to Momentry's ffmpeg-based streaming
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Browser (search-chat @ m5wp.momentry.ddns.net) │
│ │
│ ┌──────────┐ ┌──────────────────┐ ┌─────────────────────┐ │
│ │ Search │ │ Thumbnail img │ │ <video src="..."> │ │
│ └────┬─────┘ └───────┬──────────┘ └──────────┬──────────┘ │
│ │ │ │ │
└───────┼─────────────────┼──────────────────────────┼─────────────┘
│ │ │
▼ ▼ ▼
┌───────────────────────────────────────────────────────────────┐
│ Caddy (m5wp block) │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ handle /wp-json/momentry/v1/media { │ │
│ │ rewrite * /api/v1/media-proxy{?} │ │
│ │ reverse_proxy localhost:3002 (+ X-API-Key) │ │
│ │ } │ │
│ │ │ │
│ │ handle_path /files/* { │ │
│ │ root * /Users/accusys/momentry/var/sftpgo/data │ │
│ │ file_server │ │
│ │ } │ │
│ │ │ │
│ │ reverse_proxy localhost:9002 ← WordPress (PHP-FPM) │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
│ │ │
│ │ ▼
│ │ ┌───────────────────────┐
│ │ │ /files/* │
│ │ │ Local file on disk │
│ │ │ (zero backend cost) │
│ │ └───────────────────────┘
│ ▼
│ ┌─────────────────────────────────────────┐
│ │ Momentry Core (localhost:3002) │
│ │ │
▼ ▼ /api/v1/media-proxy │
┌─────────────────────────┐ │
│ type=thumbnail&frame=N │──→ face_thumbnail │
│ type=video&start=… │──→ stream_video │
└─────────────────────────┘ │
┌─────────────────────────┐ │
│ POST /api/v1/search/* │──→ smart_search │
│ response: serve_url │ │
└─────────────────────────┘ │
└───────────────────────────────────────────────┘
```
---
## Data Flow
### 1. Search → serve_url
```
Frontend Caddy Momentry Backend
│ │ │
│ POST /wp-json/.../search │ │
│ ─────────────────────────→│ │
│ │ POST /api/v1/search/* │
│ │ ──────────────────────→│
│ │ │
│ │ ←─ SearchResult[] ─────│
│ │ (with serve_url + │
│ │ file_name added) │
│ ←─ JSON response ────────│ │
│ results[0].serve_url = │ │
│ "https://m5wp.momentry.│ │
│ ddns.net/files/demo/ │ │
│ Charade_YouTube_24fps │ │
│ .mp4" │ │
```
#### serve_url Construction
The backend computes `serve_url` from the video's `file_path` (stored in `videos` table) and two config values:
| Config | Env Var | Default |
|--------|---------|---------|
| `STORAGE_ROOT` | `MOMENTRY_STORAGE_ROOT` | `/Users/accusys/momentry/var/sftpgo/data` |
| `SERVE_BASE_URL` | `MOMENTRY_SERVE_BASE_URL` | `https://m5wp.momentry.ddns.net/files` |
Algorithm:
```
file_path: /Users/accusys/momentry/var/sftpgo/data/demo/Charade_YouTube_24fps.mp4
STORAGE_ROOT /Users/accusys/momentry/var/sftpgo/data
─────────────────────────────────────────────
relative: demo/Charade_YouTube_24fps.mp4
↓ join with SERVE_BASE_URL
serve_url: https://m5wp.momentry.ddns.net/files/demo/Charade_YouTube_24fps.mp4
```
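The path translation above can be sketched in a few lines. This is an illustration under assumptions rather than the backend's code (which the struct below suggests is Rust): the function name `build_serve_url` is hypothetical, percent-encoding of the relative path is an addition the design does not mention, and returning `None` for paths outside `STORAGE_ROOT` is assumed to correspond to the remote fallback case.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def build_serve_url(file_path, storage_root, serve_base_url):
    """Return the public URL for a local file, or None if it lies outside storage_root."""
    try:
        relative = PurePosixPath(file_path).relative_to(PurePosixPath(storage_root))
    except ValueError:
        # The file is not under the storage root, so it cannot be served statically.
        return None
    # quote() keeps "/" intact and escapes characters such as spaces.
    return serve_base_url.rstrip("/") + "/" + quote(relative.as_posix())
```

For the example file above this yields `https://m5wp.momentry.ddns.net/files/demo/Charade_YouTube_24fps.mp4`.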
#### SearchResult Additions
```rust
pub struct SearchResult {
// ... existing fields
pub file_name: Option<String>, // e.g. "Charade_YouTube_24fps.mp4"
pub serve_url: Option<String>, // e.g. "https://m5wp.momentry.ddns.net/files/..."
}
```
### 2. Video Playback (Local)
```
Frontend <video> Caddy (file_server)
│ │
│ GET /files/demo/Charade… │
│ ─────────────────────────→│
│ │ root = /Users/accusys/momentry/var/sftpgo/data
│ │ serves /demo/Charade_YouTube_24fps.mp4
│ │
│ ←─ 200 video/mp4 ────────│
│ (range-request │
│ supported natively) │
```
**Characteristics**:
- Zero CPU cost — pure I/O, no ffmpeg decode
- HTTP range requests work natively (Caddy `file_server` supports `Accept-Ranges: bytes`)
- HTML5 `<video>` can seek arbitrarily, play/pause normally
- Supports MP4 (H.264), WebM, and any browser-playable format
### 3. Video Playback (Remote — Fallback)
```
Frontend Caddy Momentry Backend
│ │ │
│ GET /wp-json/.../ │ │
│ media?uuid=X& │ │
│ type=video& │ │
│ start_time=S& │ │
│ end_time=E │ │
│ ────────────────────→│ │
│ │ rewrite to │
│ │ /api/v1/media-proxy{?} │
│ │ │
│ │ GET /api/v1/media-proxy? │
│ │ uuid=X&type=video&... │
│ │ ─────────────────────────→│
│ │ │
│ │ stream_video: │
│ │ ffmpeg -ss S -i file │
│ │ -t (E-S) -c copy │
│ │ │
│ │ ←─ 200 video/mp4 ──────────│
│ │ (chunk data) │
│ ←─ HTTP streaming ───│ │
```
### 4. Thumbnail
```
Frontend <img> Caddy Momentry Backend
│ │ │
│ GET /wp-json/.../ │ │
│ media?uuid=X& │ │
│ type=thumbnail& │ │
│ frame=N │ │
│ ──────────────────────→│ │
│ │ rewrite to │
│ │ /api/v1/media-proxy{?} │
│ │ │
│ │ /api/v1/media-proxy? │
│ │ uuid=X&type=thumbnail& │
│ │ frame=N │
│ │ ─────────────────────────→│
│ │ │
│ │ face_thumbnail: │
│ │ look up trace_id path │
│ │ → cached face crop │
│ │ → validated JPEG │
│ │ │
│ │ ←─ 200 image/jpeg ────────│
│ ←─ JPEG ───────────────│ │
```
**Thumbnail flow detail**:
1. Caddy intercepts `/wp-json/momentry/v1/media` → rewrites to `/api/v1/media-proxy` keeping query params intact (`{?}`)
2. Momentry `media_proxy_handler` reads `uuid`, `type=thumbnail`, `frame=N` from query
3. Dispatches to the internal `face_thumbnail` handler
4. Returns cached face crop JPEG (or fallback frame extraction result)
---
## Caddyfile Configuration
Addition to the existing `m5wp` block:
```caddy
m5wp.momentry.ddns.net {
tls internal
# ── Local video files: direct serve, zero backend overhead ──
handle_path /files/* {
root * /Users/accusys/momentry/var/sftpgo/data
file_server
}
# ── Media proxy: thumbnails + remote streaming ──
# Bypasses inactive WordPress Code Snippet 61
handle /wp-json/momentry/v1/media {
rewrite * /api/v1/media-proxy{?}
reverse_proxy localhost:3002 {
header_up X-API-Key muser_68600856036340bcafc01930eb4bd839_1774418104_97221b69
}
}
# ── Existing WordPress (PHP-FPM) ──
reverse_proxy localhost:9002
import common_log m5wp_access
}
```
**Key syntax**:
- `handle_path /files/*` — strips `/files` prefix, serves from `root` directory
- `{?}` — Caddy placeholder that preserves the original query string in the rewrite
- `handle /wp-json/momentry/v1/media` — matches exact path (query params are irrelevant for matching)
---
## Momentry API Changes
### New Endpoint: `GET /api/v1/media-proxy`
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | yes | File UUID; the query key `file_uuid` is accepted as an alias |
| `type` | string | yes | `thumbnail`, `video` (future: `image`, `file`) |
| `frame` | int | for thumbnail | Frame number to extract |
| `trace_id` | int | no | Face trace ID for cached crop |
| `start_time` | float | for video | Start time in seconds |
| `end_time` | float | for video | End time in seconds |
| `mode` | string | no | `normal` or `debug` (video) |
| `audio` | string | no | `on` or `off` (video) |
**Dispatch logic**:
- `type=thumbnail` → call `face_thumbnail(State, Path(uuid), Query(frame, trace_id, ...))`
- `type=video` → call `stream_video(State, Path(uuid), Query(params), request)`
The endpoint reuses the existing handler implementations: it constructs their axum extractors (`State`, `Path`, `Query`) directly and calls the handlers as ordinary functions, which avoids duplicating their logic.
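To make the parameter table and dispatch rules concrete, here is a dependency-free Rust sketch of the validation that would precede dispatch. It is an illustration under assumptions: `MediaRequest`, `parse_media_request`, `required` and `optional` are hypothetical names, the query is modelled as a plain `HashMap` rather than axum's `Query` extractor, and the optional `mode` and `audio` parameters are left out for brevity.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Validated form of a media-proxy request (hypothetical type).
#[derive(Debug, PartialEq)]
enum MediaRequest {
    Thumbnail { uuid: String, frame: u64, trace_id: Option<i64> },
    Video { uuid: String, start_time: f64, end_time: f64 },
}

/// Parse a parameter that must be present.
fn required<T: FromStr>(q: &HashMap<String, String>, key: &str) -> Result<T, String> {
    q.get(key)
        .ok_or_else(|| format!("missing `{key}`"))?
        .parse::<T>()
        .map_err(|_| format!("invalid `{key}`"))
}

/// Parse a parameter that may be absent.
fn optional<T: FromStr>(q: &HashMap<String, String>, key: &str) -> Result<Option<T>, String> {
    q.get(key)
        .map(|v| v.parse::<T>().map_err(|_| format!("invalid `{key}`")))
        .transpose()
}

fn parse_media_request(q: &HashMap<String, String>) -> Result<MediaRequest, String> {
    // `uuid` is required; `file_uuid` is accepted as an alias.
    let uuid = q
        .get("uuid")
        .or_else(|| q.get("file_uuid"))
        .cloned()
        .ok_or_else(|| "missing `uuid`".to_string())?;

    match q.get("type").map(String::as_str) {
        Some("thumbnail") => Ok(MediaRequest::Thumbnail {
            uuid,
            frame: required(q, "frame")?,
            trace_id: optional(q, "trace_id")?,
        }),
        Some("video") => {
            let start_time: f64 = required(q, "start_time")?;
            let end_time: f64 = required(q, "end_time")?;
            if end_time <= start_time {
                return Err("`end_time` must be greater than `start_time`".to_string());
            }
            Ok(MediaRequest::Video { uuid, start_time, end_time })
        }
        Some(other) => Err(format!("unsupported type `{other}`")),
        None => Err("missing `type`".to_string()),
    }
}
```

In this sketch a `Thumbnail` value would be handed to `face_thumbnail` and a `Video` value to `stream_video`; rejecting bad input before dispatch keeps both handlers free of query-string concerns.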
### Modified Endpoint: `POST /api/v1/search/smart`
**Response changes**: `SearchResult` gains two optional fields:
```json
{
"results": [
{
"file_uuid": "a6fb22eebefaef17e62af874997c5944",
"file_name": "Charade_YouTube_24fps.mp4",
"serve_url": "https://m5wp.momentry.ddns.net/files/demo/Charade_YouTube_24fps.mp4",
"start_frame": 88649,
"start_time": 3697.08,
"end_time": 3707.08,
"summary": "...",
"similarity": 0.85
}
]
}
```
The `serve_url` is computed after enrichment: a batch query to the `videos` table resolves `file_uuid → file_path`, and each path is then translated as follows:
1. Strip `STORAGE_ROOT` prefix from `file_path`
2. Prepend `SERVE_BASE_URL`
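The two translation steps can be sketched in Rust as follows. This is an illustration, not the shipped implementation: the function names are hypothetical, and two behaviours are assumptions beyond the text above, namely percent-encoding each path segment and returning `None` for paths outside the storage root (so the caller can omit `serve_url`).

```rust
use std::path::{Component, Path};

/// Percent-encode one path segment, leaving RFC 3986 unreserved characters as-is.
fn percent_encode(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for b in segment.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{b:02X}")),
        }
    }
    out
}

/// Translate an on-disk `file_path` into a public URL.
/// Returns `None` if the path is not strictly inside `storage_root`.
fn serve_url(file_path: &str, storage_root: &str, serve_base_url: &str) -> Option<String> {
    // Step 1: strip the storage root (component-wise, so `/data2` does not match `/data`).
    let relative = Path::new(file_path).strip_prefix(storage_root).ok()?;

    let mut segments = Vec::new();
    for component in relative.components() {
        match component {
            Component::Normal(seg) => segments.push(percent_encode(&seg.to_string_lossy())),
            // Reject `..` and similar components rather than expose them in a URL.
            _ => return None,
        }
    }
    if segments.is_empty() {
        return None;
    }

    // Step 2: prepend the public base URL.
    Some(format!(
        "{}/{}",
        serve_base_url.trim_end_matches('/'),
        segments.join("/")
    ))
}
```

With the values used in this document, a `file_path` of `/Users/accusys/momentry/var/sftpgo/data/demo/Charade_YouTube_24fps.mp4` maps to the `serve_url` shown in the JSON example above.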
---
## Environment Variables
Add to `.env` (production) and `.env.development`:
```bash
# Storage root: where video files are stored on disk
# Used to compute serve_url from file_path
MOMENTRY_STORAGE_ROOT=/Users/accusys/momentry/var/sftpgo/data
# Public base URL for direct file access via Caddy file_server
MOMENTRY_SERVE_BASE_URL=https://m5wp.momentry.ddns.net/files
```
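One way the backend might consume these two variables is sketched below in Rust. The names `ServeConfig` and `load_serve_config` are hypothetical, and treating a missing or empty variable as "direct serving disabled" is an assumption, not documented behaviour. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Settings needed to compute `serve_url` (hypothetical type).
#[derive(Debug, PartialEq)]
struct ServeConfig {
    storage_root: String,
    serve_base_url: String,
}

/// Load the direct-serve settings through `get`, a lookup such as
/// `|key| std::env::var(key).ok()`. Returns `None` if either variable
/// is unset or blank.
fn load_serve_config(get: impl Fn(&str) -> Option<String>) -> Option<ServeConfig> {
    let non_empty = |key: &str| {
        get(key)
            .map(|value| value.trim().to_string())
            .filter(|value| !value.is_empty())
    };
    Some(ServeConfig {
        storage_root: non_empty("MOMENTRY_STORAGE_ROOT")?,
        // Normalise the base URL so joining with "/" never doubles the slash.
        serve_base_url: non_empty("MOMENTRY_SERVE_BASE_URL")?
            .trim_end_matches('/')
            .to_string(),
    })
}
```

At startup this would be called as `load_serve_config(|key| std::env::var(key).ok())`; a `None` result would mean search results carry no `serve_url` and the frontend uses the streaming endpoint.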
---
## Trade-offs & Rationale
| Approach | Pros | Cons |
|----------|------|------|
| **Caddy file_server** (local) | Zero CPU, native range requests, no code change to Momentry for serving | Requires storage root config; files must be accessible from Caddy |
| **Momentry stream_video** (remote) | Works with any storage backend (S3, NAS, NFS) | One ffmpeg process per request (stream copy via `-c copy`), higher latency, backend CPU and I/O load |
| **WordPress PHP proxy** (rejected) | No infra change | Fragile, snippet inactive, violates marcom territory |
| **Direct backend streaming only** (rejected) | Simplest implementation | Unnecessary CPU for local files; 100% backend dependency |
### Fallback Logic (Frontend)
The frontend JavaScript should handle playback as follows:
```javascript
if (result.serve_url) {
  // Local file: served directly by Caddy file_server
  video.src = result.serve_url;
} else {
  // Remote file: stream the clip through the media proxy
  const params = new URLSearchParams({
    uuid: result.file_uuid,
    type: "video",
    start_time: result.start_time,
    end_time: result.end_time,
  });
  video.src = `/wp-json/momentry/v1/media?${params}`;
}
```
This gives the frontend flexibility to pick the optimal playback path based on available data.
---
## Future Considerations
- **S3/NAS remote files**: When video files are stored externally, the `file_path` won't match `STORAGE_ROOT`. The backend can detect this by checking `file_path.starts_with(STORAGE_ROOT)`. If it doesn't match, omit `serve_url` and rely on the streaming fallback.
- **Pre-signed URLs**: For S3 storage, `serve_url` could be replaced with a pre-signed URL or cloud CDN URL.
- **Caching**: `file_server` responses are cacheable; consider adding `Cache-Control` headers for thumbnails.
- **Authentication**: Direct file access currently has no auth. If needed, Caddy can enforce it via `forward_auth` or JWT validation.
---
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| V1.0 | 2026-06-07 | OpenCode | Initial design — local direct serve + remote streaming + thumbnail proxy architecture |