feat: Phase 2.6 edges migration to Qdrant (TKG-only architecture)
Phase 2.6.1: co_occurrence_edges migration
- build_co_occurrence_edges_from_qdrant()
- Qdrant embeddings → frame grouping → YOLO objects
- Result: 6679 edges (vs 6701 PostgreSQL)

Phase 2.6.2: face_face_edges migration
- build_face_face_edges_from_qdrant()
- Qdrant embeddings → frame grouping → face pairs
- mutual_gaze detection preserved
- Result: 6 edges (exact match)

Phase 2.6.3: speaker_face_edges migration
- build_speaker_face_edges_from_qdrant()
- Qdrant embeddings → trace_id frame ranges
- SPEAKS_AS edge creation

Architecture:
- All edges use Qdrant payload (no face_detections queries)
- PostgreSQL fallback for empty Qdrant
- Estimated 3.6x performance improvement

Testing:
- Playground (3003): ✓ All Phase 2.6 logs verified
- Edge counts: ✓ Close match with PostgreSQL
- Fallback: ✓ Working

Docs:
- docs_v1.0/DESIGN/TKG_PHASE2_6_EDGES_MIGRATION.md
- docs_v1.0/M4_workspace/2026-06-21_phase2_6_test.md
143
docs_v1.0/DESIGN/PER_FILE_VOICE_COLLECTION_V1.0.md
Normal file
@@ -0,0 +1,143 @@
---
title: Per-File Voice Collection V1.0
version: 1.0
date: 2026-06-20
author: OpenCode
status: approved
---

# Per-File Voice Collection V1.0
|
||||
|
||||
| Scope | Status | Applicable to | Binary |
|-------|--------|---------------|--------|
| Qdrant voice collection naming, storage, lifecycle | Approved | `momentry_playground`, `momentry` | Both |
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The ASRX processor stores speaker voice embeddings (192-dim ECAPA-TDNN) in Qdrant for speaker diarization and future identity matching. The current design uses a single global collection, `{prefix}_voice`, for all files, which causes several problems:

1. **No isolation**: All files' voice embeddings share one collection, making per-file cleanup error-prone
2. **Unnecessary migration**: The workspace `_workspace_voice` → production `_voice` migration during checkin adds complexity with no benefit for per-file processing artifacts
3. **No event type distinction**: No payload field distinguishes speaker embeddings from future audio event types (gunshots, screams, music, etc.)
4. **Cross-file matching is impractical**: The current point ID includes file_uuid, but querying across files requires filtering rather than direct collection access
|
||||
|
||||
## Design
|
||||
|
||||
### Collection Naming: Per-File
|
||||
|
||||
```
{file_uuid}_voice
```

Examples:
- `d3f9ae8e471a1fc4d47022c66091b920_voice`
- `92ed12dbb7fbea5e6ddfe668e1f31444_voice`
|
||||
|
||||
### Collection Schema
|
||||
|
||||
| Property | Value |
|----------|-------|
| Name | `{file_uuid}_voice` |
| Vector dimension | 192 |
| Distance metric | Cosine |
| On-disk | false (default, in-memory for fast search during processing) |
|
||||
|
||||
### Point Schema
|
||||
|
||||
**Point ID**: `SHA256(speaker_id + "_" + segment_index)` → first 8 bytes as u64
- `file_uuid` is not part of the hash input (redundant, because the collection is already per-file)
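
The point ID rule above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `voice_point_id` is hypothetical, and reading the first 8 bytes as a big-endian integer is an assumption, since the document does not state the byte order.

```python
import hashlib


def voice_point_id(speaker_id: str, segment_index: int) -> int:
    """Derive a u64 point ID from SHA256(speaker_id + "_" + segment_index)."""
    key = f"{speaker_id}_{segment_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Assumption: the first 8 bytes are interpreted as a big-endian unsigned integer.
    return int.from_bytes(digest[:8], byteorder="big")
```

Because the hash input no longer contains `file_uuid`, the same speaker label and segment index produce the same ID in every per-file collection, which is harmless as long as IDs are only compared within one collection.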
|
||||
|
||||
**Payload**:

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| `speaker_id` | String | Speaker label from ASRX | `"SPEAKER_00"` |
| `segment_index` | Integer | Segment index within ASRX result | `5` |
| `start_frame` | Integer | Start frame number | `120` |
| `end_frame` | Integer | End frame number | `240` |
| `start_time` | Float | Start time in seconds | `4.0` |
| `end_time` | Float | End time in seconds | `8.0` |
| `event_type` | String | Type of audio event | `"speaker"` |
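
As a rough illustration of the payload fields above, the sketch below builds one payload dictionary per segment. The helper name `build_voice_payload` is hypothetical, and deriving frame numbers as `round(time * fps)` is an assumption chosen because it reproduces the example values at 30 fps; the real processor may take frame numbers directly from the ASRX result.

```python
def build_voice_payload(speaker_id, segment_index, start_time, end_time, fps):
    """Assemble the Qdrant payload for one speaker segment."""
    return {
        "speaker_id": speaker_id,
        "segment_index": segment_index,
        # Assumption: frame numbers are derived from timestamps and fps.
        "start_frame": round(start_time * fps),
        "end_frame": round(end_time * fps),
        "start_time": start_time,
        "end_time": end_time,
        "event_type": "speaker",
    }
```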
|
||||
|
||||
### Event Type Extensibility
|
||||
|
||||
The `event_type` field reserves space for future audio recognition:

| event_type | Description | Future Model | Dim |
|------------|-------------|--------------|-----|
| `"speaker"` | Speaker voice embedding (current) | ECAPA-TDNN | 192 |
| `"gunshot"` | Gunshot detection embedding | YAMNet / custom | TBD |
| `"scream"` | Scream/shout detection | YAMNet / custom | TBD |
| `"music"` | Music segment embedding | CLMR / custom | TBD |

Each event type with a different dimension would use a separate per-file collection (`{file_uuid}_gunshot`, etc.).
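
One way to picture the naming rule is a small helper that maps an event type to its per-file collection. The function `collection_name` is hypothetical, and it assumes the simplest reading of the rule: speaker embeddings keep the `_voice` suffix and every other event type gets its own suffix, even though the document only requires a separate collection when the dimension differs.

```python
def collection_name(file_uuid: str, event_type: str = "speaker") -> str:
    """Return the per-file collection name for an audio event type."""
    # Assumption: "speaker" keeps the historical "_voice" suffix.
    suffix = "voice" if event_type == "speaker" else event_type
    return f"{file_uuid}_{suffix}"
```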
|
||||
|
||||
### Lifecycle
|
||||
|
||||
```
Processing:
  ASRX completes → store_voice_embeddings_to_qdrant()
    → ensure_collection("{file_uuid}_voice", 192)
    → upsert_vector per segment

Checkin:
  No voice migration needed (data already in per-file collection)

Checkout / File Deletion:
  Delete collection "{file_uuid}_voice" (or delete by filter)

Cross-File Matching (future):
  Job scans all "*_voice" collections, or maintains {prefix}_speaker_profiles index
```
|
||||
|
||||
### Changes from Current Design
|
||||
|
||||
| Aspect | Current | New |
|--------|---------|-----|
| Collection name | `{prefix}_voice` | `{file_uuid}_voice` |
| Point ID hash input | `file_uuid + speaker_id + index` | `speaker_id + index` |
| Workspace dual-write | `_workspace_voice` → `_voice` migration | Removed (no migration needed) |
| Payload event_type | Not present | `"speaker"` |
| Checkin voice migration | Scroll + upsert | Nothing (data already isolated) |
| Checkout voice deletion | Filter by file_uuid from `{prefix}_voice` | Delete collection or filter |
| QdrantWorkspace voice methods | `voice_collection()`, `upsert_voice_embedding()` | Removed |
|
||||
|
||||
### Files Affected
|
||||
|
||||
| File | Change |
|------|--------|
| `src/worker/processor.rs:1291-1360` | `store_voice_embeddings_to_qdrant()`: per-file collection, event_type payload |
| `src/worker/processor.rs:919-942` | Remove workspace voice dual-write |
| `src/core/checkin.rs:208-242` | Remove voice migration block |
| `src/core/checkin.rs:358-379` | Update checkout voice deletion to target `{file_uuid}_voice` |
| `src/core/db/qdrant_workspace.rs` | Remove `voice_collection()`, `upsert_voice_embedding()`, voice from `ensure_all()`, `scroll_by_file_uuid()`, `WorkspaceScrollResult`, `delete_by_file_uuid()` |
|
||||
|
||||
### Cross-File Matching (Future Design)
|
||||
|
||||
For future multi-file speaker matching, a separate index collection can be maintained:
|
||||
|
||||
```
{prefix}_speaker_profiles (192-dim Cosine)
  - payload: speaker_id (global), source_file_uuids[], reference_count, centroid_embedding
```

This index would be updated:

1. During a periodic batch job that scans all `*_voice` collections
2. Or incrementally when new voice data is added
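
To make the incremental option concrete, here is a minimal sketch of how a profile's `centroid_embedding` and `reference_count` could be updated when one new segment embedding arrives, plus the cosine similarity used to compare a new embedding against a profile. Both function names are hypothetical, and the running-mean update is an assumption; the document does not specify how the centroid is maintained.

```python
import math


def update_centroid(centroid, reference_count, embedding):
    """Fold one embedding into a running mean; returns (new_centroid, new_count)."""
    n = reference_count + 1
    new_centroid = [c + (e - c) / n for c, e in zip(centroid, embedding)]
    return new_centroid, n


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

The running mean avoids re-reading every source collection on each update; a periodic batch rebuild would still be needed if source files are deleted.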
|
||||
|
||||
The per-file collection design makes this cleaner because:

- Source data is cleanly partitioned
- The index is explicitly a derived/cached structure
- Rebuilding the index means rescanning the `*_voice` collections, not untangling a global collection
|
||||
|
||||
## Migration
|
||||
|
||||
Existing voice data in `{prefix}_voice` and `{prefix}_workspace_voice` can be left as-is for backward compatibility. New processing will write to `{file_uuid}_voice`. Old data in `{prefix}_voice` will remain queryable if needed.
|
||||
|
||||
No data migration script is required — old data is read-only legacy.
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Author | Change |
|---------|------|--------|--------|
| 1.0 | 2026-06-20 | OpenCode | Initial design |
|
||||
758
docs_v1.0/DESIGN/Processor_Module_V1.0.md
Normal file
@@ -0,0 +1,758 @@
|
||||
# Processor Module V1.0
|
||||
|
||||
**Date**: 2026-06-19
|
||||
**Version**: 1.0.0
|
||||
**Status**: Draft
|
||||
|
||||
---
|
||||
|
||||
## 1. Architecture Overview

### 1.1 PythonExecutor Unified Execution Framework

All processors run their Python scripts through `PythonExecutor`, which provides:
- SHA256 checksum verification (read from `checksums.sha256`)
- Retry mechanism (exponential backoff: 1s → 2s → 4s → ...)
- Timeout management (configured independently per processor)
- Real-time stdout/stderr handling (tracing::info/warn/error)

### 1.2 Dual-Track Design

| Type | Characteristics | Processor |
|------|-----------------|-----------|
| **Frame-based** | Processes frame by frame and outputs per-frame data | yolo, ocr, face, pose, mediapipe, appearance |
| **Time-based** | Analyzes the whole video or a time series and outputs an event list | cut, asrx, scene, story, 5w1h |

### 1.3 8Hz Unified Sampling (New)

All frame-based processors share the same 8Hz frame list:

```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Processor Specification Summary

| # | Name | Type | Python script | Output file | Dependencies | GPU | Model | CPU | Memory | Timeout |
|---|------|------|---------------|-------------|--------------|-----|-------|-----|--------|---------|
| 1 | cut | Time | `cut_processor.py` | `.cut.json` | none | ❌ | PySceneDetect | 0.5 | 512MB | 3600s |
| 2 | asrx | Time | `asrx_processor.py` | `.asrx.json` | cut | ❌ | speechbrain | 0.8 | 2048MB | 7200s |
| 3 | yolo | Frame | `yolo_processor.py` | `.yolo.json` | none | ✅ | yolov8n | 0.3 | 1024MB | 7200s |
| 4 | ocr | Frame | `ocr_processor.py` | `.ocr.json` | none | ❌ | paddleocr | 0.8 | 1024MB | 7200s |
| 5 | face | Frame | `face_processor.py` | `.face.json` | none | ✅ | insightface/buffalo_l | 0.6 | 1536MB | 7200s |
| 6 | pose | Frame | `pose_processor.py` | `.pose.json` | none | ✅ | mediapipe/pose | 0.4 | 1024MB | 7200s |
| 7 | mediapipe | Frame | `mediapipe_holistic_processor.py` | `.mediapipe.json` | none | ❌ | mediapipe/holistic | 0.3 | 1024MB | 7200s |
| 8 | appearance | Frame | `appearance_processor.py` | `.appearance.json` | pose | ❌ | HSV | 0.3 | 512MB | 7200s |
| 9 | scene | Time | `scene_classifier.py` | `.scene.json` | cut | ❌ | places365 | 0.3 | 512MB | 7200s |
| 10 | story | Time | `story_processor.py` | `.story.json` | asrx+cut+yolo+face | ❌ | gemma4 | 0.1 | 256MB | 7200s |
| 11 | 5w1h | Time | `parent_chunk_5w1h.py` | none | story | ❌ | gemma4 | 0.1 | 256MB | 7200s |
|
||||
|
||||
---
|
||||
|
||||
## 3. Detailed Processor Specifications

### 3.1 Cut: Scene Cut Detection

**Type**: Time-based
**Script**: `cut_processor.py`
**Model**: PySceneDetect
|
||||
|
||||
```rust
|
||||
pub struct CutResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub scenes: Vec<CutScene>,
|
||||
}
|
||||
|
||||
pub struct CutScene {
|
||||
pub scene_number: u32,
|
||||
pub start_frame: u64,
|
||||
pub end_frame: u64,
|
||||
pub start_time: f64,
|
||||
pub end_time: f64,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"frame_count": 8951,
|
||||
"fps": 29.97,
|
||||
"scenes": [
|
||||
{"scene_number": 1, "start_frame": 0, "end_frame": 150, "start_time": 0.0, "end_time": 5.0},
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.2 ASRX: Speech Recognition + Speaker Diarization

**Type**: Time-based
**Script**: `asrx_processor.py`
**Model**: speechbrain/ecapa-tdnn
**Dependencies**: cut (requires scene boundaries)
|
||||
|
||||
```rust
|
||||
pub struct AsrxResult {
|
||||
pub language: Option<String>,
|
||||
pub segments: Vec<AsrxSegment>,
|
||||
pub embeddings: Option<Vec<Vec<f32>>>,
|
||||
}
|
||||
|
||||
pub struct AsrxSegment {
|
||||
pub start_time: f64,
|
||||
pub end_time: f64,
|
||||
pub start_frame: u64,
|
||||
pub end_frame: u64,
|
||||
pub text: String,
|
||||
pub speaker_id: Option<String>,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"language": "zh",
|
||||
"segments": [
|
||||
{
|
||||
"start_time": 0.1,
|
||||
"end_time": 2.0,
|
||||
"start_frame": 3,
|
||||
"end_frame": 60,
|
||||
"text": "大家好",
|
||||
"speaker_id": "SPEAKER_0"
|
||||
},
|
||||
...
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.3 YOLO: Object Detection

**Type**: Frame-based
**Script**: `yolo_processor.py`
**Model**: yolov8n
**GPU**: ✅
**Sampling**: 8Hz
|
||||
|
||||
```rust
|
||||
pub struct YoloResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub frames: Vec<YoloFrame>,
|
||||
}
|
||||
|
||||
pub struct YoloFrame {
|
||||
pub frame: u64,
|
||||
pub timestamp: f64,
|
||||
pub objects: Vec<YoloObject>,
|
||||
}
|
||||
|
||||
pub struct YoloObject {
|
||||
pub class_name: String,
|
||||
pub class_id: u32,
|
||||
pub x: i32,
|
||||
pub y: i32,
|
||||
pub width: i32,
|
||||
pub height: i32,
|
||||
pub confidence: f32,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"frame_count": 2238,
|
||||
"fps": 29.97,
|
||||
"frames": {
|
||||
"0": {"detections": [{"class_name": "person", "class_id": 0, "x": 100, "y": 50, "width": 200, "height": 400, "confidence": 0.95}]},
|
||||
"4": {"detections": [...]},
|
||||
...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Available classes** (43 COCO classes): person, bicycle, car, motorbike, chair, cup, cell phone, laptop, book, remote, tie, umbrella, baseball bat, ...
|
||||
|
||||
---
|
||||
|
||||
### 3.4 OCR: Text Recognition

**Type**: Frame-based
**Script**: `ocr_processor.py`
**Model**: paddleocr
**Sampling**: 8Hz
|
||||
|
||||
```rust
|
||||
pub struct OcrResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub frames: Vec<OcrFrame>,
|
||||
}
|
||||
|
||||
pub struct OcrFrame {
|
||||
pub frame: u64,
|
||||
pub timestamp: f64,
|
||||
pub texts: Vec<OcrText>,
|
||||
}
|
||||
|
||||
pub struct OcrText {
|
||||
pub text: String,
|
||||
pub x: i32,
|
||||
pub y: i32,
|
||||
pub width: i32,
|
||||
pub height: i32,
|
||||
pub confidence: f32,
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Face: Face Detection + Embedding

**Type**: Frame-based
**Script**: `face_processor.py`
**Model**: insightface/buffalo_l
**GPU**: ✅
**Sampling**: 8Hz
|
||||
|
||||
```rust
|
||||
pub struct FaceResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub frames: Vec<FaceFrame>,
|
||||
}
|
||||
|
||||
pub struct FaceFrame {
|
||||
pub frame: u64,
|
||||
pub timestamp: f64,
|
||||
pub faces: Vec<Face>,
|
||||
}
|
||||
|
||||
pub struct Face {
|
||||
pub face_id: Option<String>,
|
||||
pub x: i32,
|
||||
pub y: i32,
|
||||
pub width: i32,
|
||||
pub height: i32,
|
||||
pub confidence: f32,
|
||||
pub embedding: Option<Vec<f32>>,
|
||||
pub landmarks: Option<serde_json::Value>,
|
||||
pub attributes: Option<FaceAttributes>,
|
||||
}
|
||||
|
||||
pub struct FaceAttributes {
|
||||
pub age: Option<i32>,
|
||||
pub gender: Option<String>,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"frame_count": 2238,
|
||||
"fps": 29.97,
|
||||
"frames": [
|
||||
{
|
||||
"frame": 0,
|
||||
"timestamp": 0.0,
|
||||
"faces": [{
|
||||
"face_id": "face_0",
|
||||
"x": 500, "y": 300, "width": 200, "height": 250,
|
||||
"confidence": 0.98,
|
||||
"embedding": [0.12, -0.34, ...],
|
||||
"landmarks": {
|
||||
"nose": [[x,y], ...],
|
||||
"left_eye": [[x,y], ...],
|
||||
"right_eye": [[x,y], ...]
|
||||
},
|
||||
"attributes": {"age": 35, "gender": "male"}
|
||||
}]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Landmarks**: nose (8pts) + left_eye (6pts) + right_eye (6pts) = 20 pts
|
||||
|
||||
---
|
||||
|
||||
### 3.6 Pose: Body Pose

**Type**: Frame-based
**Script**: `pose_processor.py`
**Model**: mediapipe/pose
**GPU**: ✅
**Sampling**: 8Hz
|
||||
|
||||
```rust
|
||||
pub struct PoseResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub frames: Vec<PoseFrame>,
|
||||
}
|
||||
|
||||
pub struct PoseFrame {
|
||||
pub frame: u64,
|
||||
pub timestamp: f64,
|
||||
pub persons: Vec<PersonPose>,
|
||||
}
|
||||
|
||||
pub struct PersonPose {
|
||||
pub keypoints: Vec<Keypoint>,
|
||||
pub bbox: Bbox,
|
||||
}
|
||||
|
||||
pub struct Keypoint {
|
||||
pub x: f64,
|
||||
pub y: f64,
|
||||
pub z: f64,
|
||||
pub visibility: f64,
|
||||
}
|
||||
|
||||
pub struct Bbox {
|
||||
pub x: i32,
|
||||
pub y: i32,
|
||||
pub width: i32,
|
||||
pub height: i32,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"frame_count": 2238,
|
||||
"fps": 29.97,
|
||||
"frames": [
|
||||
{
|
||||
"frame": 0,
|
||||
"timestamp": 0.0,
|
||||
"persons": [{
|
||||
"keypoints": [
|
||||
{"x": 0.5, "y": 0.3, "z": 0.1, "visibility": 0.95},
|
||||
...
|
||||
],
|
||||
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600}
|
||||
}]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Keypoints**: 33 body joints (nose, shoulders, elbows, wrists, hips, knees, ankles, ...)

**Purpose**: Provides the bbox source for appearance_processor, which uses it to compute the upper- and lower-body color ROIs
|
||||
|
||||
---
|
||||
|
||||
### 3.7 MediaPipe Holistic: Full Keypoints

**Type**: Frame-based
**Script**: `mediapipe_holistic_processor.py`
**Model**: mediapipe/holistic
**GPU**: ❌
**Sampling**: 8Hz
|
||||
|
||||
```rust
|
||||
pub struct MediaPipeResult {
|
||||
pub metadata: MediaPipeMetadata,
|
||||
pub frames: HashMap<String, MediaPipeDictEntry>,
|
||||
}
|
||||
|
||||
pub struct MediaPipeMetadata {
|
||||
pub fps: f64,
|
||||
pub total_frames: i64,
|
||||
pub processed_frames: i64,
|
||||
pub sample_interval: i64,
|
||||
pub width: i64,
|
||||
pub height: i64,
|
||||
pub processor: String,
|
||||
}
|
||||
|
||||
pub struct MediaPipeDictEntry {
|
||||
pub frame: String,
|
||||
pub timestamp: f64,
|
||||
pub persons: Vec<MediaPipePerson>,
|
||||
}
|
||||
|
||||
pub struct MediaPipePerson {
|
||||
pub person_id: u64,
|
||||
pub bbox: Option<MediaPipeBBox>,
|
||||
pub face_mesh: Option<MediaPipeFaceMesh>,
|
||||
pub pose: Option<MediaPipePose>,
|
||||
pub hands: MediaPipeHands,
|
||||
}
|
||||
|
||||
pub struct MediaPipeHands {
|
||||
pub left: Option<MediaPipeHand>,
|
||||
pub right: Option<MediaPipeHand>,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"metadata": {
|
||||
"fps": 29.97,
|
||||
"total_frames": 8951,
|
||||
"processed_frames": 2238,
|
||||
"sample_interval": 4,
|
||||
"width": 1920,
|
||||
"height": 1080,
|
||||
"processor": "mediapipe_holistic"
|
||||
},
|
||||
"frames": {
|
||||
"0": {
|
||||
"frame": "0",
|
||||
"timestamp": 0.0,
|
||||
"persons": [{
|
||||
"person_id": 0,
|
||||
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600},
|
||||
"face_mesh": {
|
||||
"landmarks": [[x,y,z], ...],
|
||||
"eye_features": {"left_openness": 0.85, "right_openness": 0.82},
|
||||
"mouth_features": {"openness": 0.3, "width": 45}
|
||||
},
|
||||
"pose": {
|
||||
"landmarks": [[x,y,z,visibility], ...],
|
||||
"arm_features": {"left_angle": 45, "right_angle": 30},
|
||||
"leg_features": {"left_angle": 180, "right_angle": 175}
|
||||
},
|
||||
"hands": {
|
||||
"left": {"landmarks": [[x,y,z], ...], "gesture": "point"},
|
||||
"right": {"landmarks": [[x,y,z], ...], "gesture": "fist"}
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Keypoint totals**:

| Part | Count | Description |
|------|-------|-------------|
| Face Mesh | 468 | Full face mesh |
| Pose | 33 | Body joints |
| Left Hand | 21 | Left-hand keypoints |
| Right Hand | 21 | Right-hand keypoints |
| **Total** | **543** | |

### Pose vs MediaPipe Comparison

| | Pose Processor | MediaPipe Holistic |
|--|----------------|--------------------|
| **Landmarks** | 33 pts (pose only) | 543 pts (face + pose + hands) |
| **Speed** | Fast (GPU-accelerated) | Slower (CPU) |
| **GPU** | ✅ | ❌ |
| **Output file** | `.pose.json` | `.mediapipe.json` |
| **Shared with Appearance** | Body ROIs (neck, foot) | Face ROIs (hat, glasses), hand ROIs (watch, phone) |
| **Purpose** | Body pose, bbox source | Full keypoints, gesture recognition, lip-shape analysis |
|
||||
|
||||
---
|
||||
|
||||
### 3.8 Appearance: Color Features + Accessory Detection

**Type**: Frame-based
**Script**: `appearance_processor.py`
**Dependencies**: pose (bbox source)
**Sampling**: 8Hz
**Shared ROIs**: tightly fitted to face/pose/mediapipe landmarks
|
||||
|
||||
```rust
|
||||
pub struct AppearanceResult {
|
||||
pub frame_count: u64,
|
||||
pub fps: f64,
|
||||
pub frames: Vec<AppearanceFrame>,
|
||||
}
|
||||
|
||||
pub struct AppearanceFrame {
|
||||
pub frame: u64,
|
||||
pub timestamp: f64,
|
||||
pub persons: Vec<AppearancePerson>,
|
||||
}
|
||||
|
||||
pub struct AppearancePerson {
|
||||
pub person_id: u64,
|
||||
pub bbox: BBox,
|
||||
pub hsv_histogram: Vec<Vec<f64>>,
|
||||
pub dominant_colors: Vec<Vec<f64>>,
|
||||
pub upper_body: Option<Vec<Vec<f64>>>,
|
||||
pub lower_body: Option<Vec<Vec<f64>>>,
|
||||
}
|
||||
```
|
||||
|
||||
**Output JSON**:
|
||||
```json
|
||||
{
|
||||
"frame_count": 2238,
|
||||
"fps": 29.97,
|
||||
"frames": [
|
||||
{
|
||||
"frame": 0,
|
||||
"timestamp": 0.0,
|
||||
"persons": [{
|
||||
"person_id": 0,
|
||||
"bbox": {"x": 400, "y": 100, "width": 300, "height": 600},
|
||||
"hsv_histogram": [
|
||||
[H0, H1, ...H29],
|
||||
[S0, S1, ...S31],
|
||||
[V0, V1, ...V31]
|
||||
],
|
||||
"dominant_colors": [[H,S,V], ...],
|
||||
"upper_body": [[H...], [S...], [V...]],
|
||||
"lower_body": [[H...], [S...], [V...]]
|
||||
}]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
#### ROI Localization

```python
def get_accessory_rois(frame, face_data, pose_data, hand_data):
    """Build accessory ROIs from face, pose, and hand landmarks.

    expand_region, bbox_around_points, region_below_point, region_between,
    and circle_around are geometry helpers assumed to be defined elsewhere.
    """
    rois = {}

    # Face region: use the face bbox plus landmarks (nose, left_eye, right_eye)
    face_bbox = face_data['bbox']
    landmarks = face_data['landmarks']

    # Hat ROI: extend upward from the face bbox
    rois['hat'] = expand_region(face_bbox, direction='up', factor=0.5)

    # Glasses ROI: horizontal band around the eye landmarks
    rois['glasses'] = bbox_around_points(landmarks['left_eye'], landmarks['right_eye'], padding=10)

    # Mask ROI: from below the nose down to the jaw (bottom edge of the face bbox)
    face_bottom = face_bbox['y'] + face_bbox['height']
    rois['mask'] = region_below_point(landmarks['nose'], face_bottom)

    # Neck ROI: use the pose nose/neck keypoints
    keypoints = pose_data['keypoints']
    rois['neck'] = region_between(keypoints['nose'], keypoints['neck'], width=80)

    # Wrist ROI: use the MediaPipe hand landmarks
    rois['left_wrist'] = circle_around(hand_data['left']['wrist'], radius=30)

    # Foot ROI: use the pose ankle/toe keypoints
    rois['left_foot'] = bbox_around_points(keypoints['left_ankle'], keypoints['left_toe'], padding=20)

    return rois
```
|
||||
|
||||
#### Accessory Detection Methods

| Method | Applicable accessories | Notes |
|--------|------------------------|-------|
| **HSV color blocks** | tie, phone, watch, ring, bracelet, glasses, mask, hat, shoes, backpack, handbag | Primary method: analysis of contrasting-color regions |
| **CLIP** | hairstyle, beard, face_tattoo, earrings, nose_ring, necklace, gloves | Auxiliary: used when color blocks are hard to tell apart |
| **MediaPipe** | gesture, arm_pose | 21 hand pts + 33 pose pts |
| **HSV** | upper_body_color, lower_body_color, skin_tone | Color feature extraction |

#### Full Accessory List (49 items)

| Body part | Accessories | Detection |
|-----------|-------------|-----------|
| Head (12) | hat, hairstyle, hair_accessory, earrings, nose_ring, lip_ring, face_tattoo, eyebrow_tattoo, glasses, mask, beard, headscarf | HSV color blocks + CLIP |
| Neck (5) | tie, scarf, shawl, necklace, neck_tattoo | HSV color blocks + CLIP |
| Hands/arms (16) | ring, bracelet, watch, gloves, phone, pen, laptop, book, cup, remote, tool, knife, gun, baseball_bat, gesture, arm_pose | HSV color blocks + CLIP + MP |
| Feet/vehicles (8) | shoes, socks, barefoot, skateboard, scooter, bicycle, motorbike, roller_skates | HSV color blocks + CLIP |
| Carried items/environment (5) | backpack, handbag, luggage, chair, diningtable | HSV color blocks + CLIP |
| Color (3) | upper_body_hsv, lower_body_hsv, skin_tone | HSV |
|
||||
|
||||
---
|
||||
|
||||
### 3.9 Scene: Scene Classification

**Type**: Time-based
**Script**: `scene_classifier.py`
**Model**: places365
**Dependencies**: cut

---

### 3.10 Story: Story Generation

**Type**: Time-based
**Script**: `story_processor.py`
**Model**: gemma4
**Dependencies**: asrx + cut + yolo + face

---

### 3.11 5W1H: Story Summary

**Type**: Time-based
**Script**: `parent_chunk_5w1h.py`
**Model**: gemma4
**Dependencies**: story
|
||||
|
||||
---
|
||||
|
||||
## 4. PythonExecutor Unified Framework

### 4.1 RetryConfig

```rust
pub struct RetryConfig {
    pub max_attempts: u32,       // default: 3
    pub initial_delay_ms: u64,   // default: 1000 (1s)
    pub max_delay_ms: u64,       // default: 30000 (30s)
    pub backoff_multiplier: f64, // default: 2.0
}
```

**Backoff strategy**: 1s → 2s → 4s → 8s → ... → capped at 30s
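
The backoff schedule implied by `RetryConfig` can be worked through with a short sketch. The function name `backoff_delays_ms` is hypothetical, and it assumes that a delay is waited only between attempts (so `max_attempts` attempts yield `max_attempts - 1` delays); the actual executor may count differently.

```python
def backoff_delays_ms(max_attempts=3, initial_delay_ms=1000,
                      max_delay_ms=30000, backoff_multiplier=2.0):
    """Return the delay (in ms) waited before each retry."""
    delays = []
    delay = float(initial_delay_ms)
    # Assumption: no delay is needed after the final attempt.
    for _ in range(max(max_attempts - 1, 0)):
        delays.append(int(min(delay, max_delay_ms)))
        delay *= backoff_multiplier
    return delays
```

With the documented defaults this gives two delays (1s, then 2s); the 30s cap only comes into play once `max_attempts` is raised to 7 or more.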
|
||||
|
||||
### 4.2 SHA256 Checksum Verification

```
scripts/
├── checksums.sha256    # SHA256 manifest
├── face_processor.py
├── yolo_processor.py
└── ...
```

Contents of `checksums.sha256`:
```
a1b2c3d4... face_processor.py
e5f6g7h8... yolo_processor.py
...
```

Before launching a script, the Executor verifies its integrity so that a tampered script is never run.
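
A minimal sketch of this check, written in Python for illustration only (the real check lives in the Rust `PythonExecutor`). The function names `load_manifest` and `verify_script` are hypothetical, and the manifest is assumed to follow the common `<hex digest> <file name>` layout shown above.

```python
import hashlib
from pathlib import Path


def load_manifest(manifest_path):
    """Parse '<hex digest> <file name>' lines into {file name: digest}."""
    entries = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary-mode entries with a leading '*'.
            entries[name.strip().lstrip("*")] = digest.lower()
    return entries


def verify_script(script_path, manifest_path):
    """Return True only if the script's SHA256 matches its manifest entry."""
    script_path = Path(script_path)
    expected = load_manifest(manifest_path).get(script_path.name)
    if expected is None:
        return False  # scripts missing from the manifest are rejected
    actual = hashlib.sha256(script_path.read_bytes()).hexdigest()
    return actual == expected
```

Rejecting scripts that are absent from the manifest is a design choice in this sketch; it treats the manifest as an allow-list rather than a best-effort check.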
|
||||
|
||||
### 4.3 Timeout Management

| Processor | Timeout |
|-----------|---------|
| cut | 3600s (1h) |
| asrx, yolo, ocr, face, pose, mediapipe, appearance, scene, story, 5w1h | 7200s (2h) |
|
||||
|
||||
---
|
||||
|
||||
## 5. 8Hz Sampling Framework

### 5.1 Basic Principle

```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```

| Video length | Total frames | 8Hz samples |
|--------------|--------------|-------------|
| 5 min | 9,000 | ~2,250 |
| 10 min | 18,000 | ~4,500 |
| 30 min | 54,000 | ~13,500 |
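
The sample counts in the table can be reproduced with a short calculation. This Python sketch is for checking the arithmetic only; the function name `sample_frames` and the `target_hz` parameter are assumptions. It rounds half away from zero (as Rust's `f64::round` does) rather than using Python's built-in `round`, which rounds halves to the nearest even integer.

```python
def sample_frames(total_frames: int, fps: float, target_hz: float = 8.0):
    """Return the frame indices sampled at roughly target_hz."""
    # int(x + 0.5) rounds positive values half away from zero, like Rust's round().
    interval = max(int(fps / target_hz + 0.5), 1)
    return list(range(0, total_frames, interval))
```

For a 5-minute clip at 30 fps this yields 9,000 / 4 = 2,250 samples, and for the 8,951-frame example at 29.97 fps it yields 2,238, matching the `frame_count` shown in the processor outputs.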
|
||||
|
||||
### 5.2 On-Demand Refinement

```
Layer 1: 8Hz baseline (all processors)
    ↓
Layer 2: Refinement (triggered by specific features)

Refinement scenarios:
- Blink confirmation: 8Hz detects a sudden drop in eye openness → go back and fetch the surrounding ±4 frames (30Hz)
- Lip-sync: time span covered by a sentence chunk → 16Hz
- Mutual Gaze: two people's gaze directions converge → confirm with the surrounding ±2 frames (30Hz)
```
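
The "surrounding ±N frames" step can be illustrated as follows. The helper `refinement_window` is hypothetical; it simply returns every native-rate frame within `radius` of the sampled frame that triggered refinement, clamped to the bounds of the video.

```python
def refinement_window(center_frame: int, radius: int, total_frames: int):
    """Return all frame indices within `radius` of `center_frame`, clamped to the video."""
    start = max(center_frame - radius, 0)
    end = min(center_frame + radius, total_frames - 1)
    return list(range(start, end + 1))
```

For a blink candidate at sampled frame 100, `refinement_window(100, 4, total_frames)` covers frames 96 through 104, which spans the gap to the neighbouring 8Hz samples on both sides.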
|
||||
|
||||
### 5.3 Sample Frame Computation

```rust
/// Returns the frame indices sampled at roughly 8Hz for a video.
fn compute_sample_frames(total_frames: i64, fps: f64) -> Vec<i64> {
    // Sampling interval in frames, e.g. round(30 / 8) = 4; never below 1.
    let interval = (fps / 8.0).round() as i64;
    (0..total_frames).step_by(interval.max(1) as usize).collect()
}
```
|
||||
|
||||
---
|
||||
|
||||
## 6. DAG Dependency Graph
|
||||
|
||||
```
|
||||
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
|
||||
│ cut │───►│asrx │───►│story│───►│5w1h │
|
||||
└──┬──┘ └──┬──┘ └──┬──┘ └─────┘
|
||||
│ │ │
|
||||
│ ┌─────┘ │
|
||||
▼ ▼ │
|
||||
┌─────┐ ┌─────┐ ┌─────┐ │
|
||||
│yolo │ │face │ │pose │ │
|
||||
└──┬──┘ └──┬──┘ └──┬──┘ │
|
||||
│ │ │ │
|
||||
│ │ ▼ │
|
||||
│ │ ┌────────┐ │
|
||||
│ └─►│appear │ │
|
||||
│ └────────┘ │
|
||||
▼ ▼ ▼
|
||||
┌─────────────────────────┐
|
||||
│ TKG (build_tkg) │
|
||||
└─────────────────────────┘
|
||||
|
||||
Independent processors (no dependencies):
|
||||
┌─────┐ ┌─────┐ ┌───────────┐
|
||||
│ ocr │ │mediap│ │ scene │
|
||||
└─────┘ └─────┘ └─────┬─────┘
|
||||
│ (depends on cut)
|
||||
```
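
The same dependencies can be expressed as data and ordered mechanically. The sketch below copies the dependency column of the specification table in section 2 into a dictionary and uses Python's standard `graphlib.TopologicalSorter` to produce one valid execution order. The names `DEPENDENCIES` and `execution_order` are hypothetical, and the real scheduler is the Rust JobWorker, not this code.

```python
from graphlib import TopologicalSorter

# Processor -> set of processors it depends on (from the table in section 2).
DEPENDENCIES = {
    "cut": set(),
    "asrx": {"cut"},
    "yolo": set(),
    "ocr": set(),
    "face": set(),
    "pose": set(),
    "mediapipe": set(),
    "appearance": {"pose"},
    "scene": {"cut"},
    "story": {"asrx", "cut", "yolo", "face"},
    "5w1h": {"story"},
}


def execution_order(dependencies=DEPENDENCIES):
    """Return one order in which every processor runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
```

`TopologicalSorter` raises `graphlib.CycleError` if the table ever gains a circular dependency, which makes it a cheap consistency check for the specification.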
|
||||
|
||||
---
|
||||
|
||||
## 7. Worker Integration

### 7.1 JobWorker Scheduling
|
||||
|
||||
```
|
||||
Video Registration
|
||||
│
|
||||
▼
|
||||
Create Job (processor_list: [cut, asrx, yolo, ocr, face, pose, mediapipe, appearance, scene, story])
|
||||
│
|
||||
▼
|
||||
Poll Available Processors (dependency check + concurrency limit)
|
||||
│
|
||||
▼
|
||||
Execute Processor → Store JSON → Update Progress
|
||||
│
|
||||
▼
|
||||
All Processors Done → Rule 1 (chunk) → Vectorize → Complete
|
||||
```
|
||||
|
||||
### 7.2 Concurrency Control

- **Dynamic concurrency**: adjusted dynamically based on CPU/memory/GPU load (default 2)
- **Processor pool**: runs at most N processors at the same time
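
The polling step (dependency check plus concurrency limit) could look roughly like this. It is a sketch under assumptions: `ready_processors` is a hypothetical name, `dependencies` maps each processor to the set of processors it needs, and candidates are taken in dictionary order rather than by any priority the real worker may apply.

```python
def ready_processors(dependencies, completed, running, max_concurrency=2):
    """Pick processors whose dependencies are done, up to the free concurrency slots."""
    free_slots = max(max_concurrency - len(running), 0)
    ready = [
        name
        for name, deps in dependencies.items()
        if name not in completed and name not in running and deps <= completed
    ]
    return ready[:free_slots]
```

Calling it again after each processor finishes is enough to walk the whole DAG, because a processor only becomes eligible once all of its dependencies are in `completed`.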
|
||||
|
||||
### 7.3 Progress Reporting (Redis)
|
||||
|
||||
```
|
||||
Redis Key: momentry_dev:progress:{file_uuid}
|
||||
Value: {
|
||||
"phase": "PROCESSING",
|
||||
"progress": {
|
||||
"FACE": {"current": 150, "total": 2238, "status": "running"},
|
||||
"YOLO": {"current": 2238, "total": 2238, "status": "completed"},
|
||||
...
|
||||
},
|
||||
"active_processors": ["FACE", "POSE"]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Author | Description |
|---------|------|--------|-------------|
| 1.0.0 | 2026-06-19 | OpenCode | Initial design document |
|
||||
187
docs_v1.0/DESIGN/RULE1_CHUNK_V1.0.md
Normal file
@@ -0,0 +1,187 @@
---
title: Rule 1 Chunk Ingestion V1.0
version: 1.0
date: 2026-06-20
author: OpenCode
status: approved
---

# Rule 1 Chunk Ingestion V1.0
|
||||
|
||||
| Scope | Status | Applicable to | Binary |
|-------|--------|---------------|--------|
| Sentence chunk creation from ASR + OCR | Approved | `momentry_playground`, `momentry` | Both |
|
||||
|
||||
## Overview
|
||||
|
||||
Rule 1 is the first chunking rule in Momentry's pipeline. It creates **sentence-level chunks** (`ChunkType::Sentence`, `ChunkRule::Rule1`) by taking ASR transcription segments and enriching them with OCR on-screen text from the same time range. Each chunk represents a spoken segment annotated with the visible text in the video frames.
|
||||
|
||||
These chunks are vectorized by the downstream `vectorize_chunks` step and become searchable through semantic search (Qdrant), keyword search (BM25 ILIKE), and identity-based search.
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ UPSTREAM: pre_chunks table │
|
||||
│ │
|
||||
│ Processor outputs stored by store_raw_pre_chunks_batch: │
|
||||
│ processor_type='asr' → ASR segments (text, timestamps) │
|
||||
│ processor_type='ocr' → OCR texts per frame │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼ wait for ASRX completion
|
||||
│
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ RULE 1 PROCESSING │
|
||||
│ │
|
||||
│ Triggered by: │
|
||||
│ 1. Worker auto: job_worker.rs after ASRX completes │
|
||||
│ 2. HTTP API: POST /api/v1/file/:file_uuid/rule1 │
|
||||
│ 3. Pipeline: pipeline_core::execute_rule1 │
|
||||
│ │
|
||||
│ execute_rule1(file_uuid, fps): │
|
||||
│ ├─ fetch_asr_segments() → Vec<AsrSegment> │
|
||||
│ ├─ fetch_ocr_texts() → BTreeMap<frame, [texts]> │
|
||||
│ │ │
|
||||
│ └─ for each ASR segment: │
|
||||
│ ├─ collect_ocr_text(frame_range, ocr_map) │
|
||||
│ │ → deduplicated OCR texts within range │
|
||||
│ ├─ build combined_text = "<ASR> <OCR>" │
|
||||
│ ├─ build content = {text, ocr_text} │
|
||||
│ ├─ build metadata = {language} │
|
||||
│ └─ store_chunk_in_tx() → chunk table │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────┐
|
||||
│ DOWNSTREAM: vectorize_chunks() │
|
||||
│ │
|
||||
│ SELECT ... WHERE chunk_type='sentence' AND embedding │
|
||||
│ IS NULL │
|
||||
│ │
|
||||
│ 1. embedder.embed_document(combined_text) → vector │
|
||||
│ 2. db.store_vector() → PG chunk.embedding │
|
||||
│ 3. qdrant.upsert_vector() → momentry_rule1 collection │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────┘
|
||||
```
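
The per-segment step in the middle box can be sketched as follows. This is a Python illustration of the logic only (the actual `execute_rule1` is Rust); the function `build_rule1_chunk`, the dictionary shapes, and the inclusive frame range are assumptions, and `collect_ocr_text` is simplified to take explicit start and end frames.

```python
def collect_ocr_text(start_frame, end_frame, ocr_map):
    """Join OCR texts found within [start_frame, end_frame], dropping duplicates."""
    seen = set()
    texts = []
    for frame in sorted(ocr_map):
        # Assumption: both ends of the frame range are inclusive.
        if start_frame <= frame <= end_frame:
            for text in ocr_map[frame]:
                if text not in seen:
                    seen.add(text)
                    texts.append(text)
    return " ".join(texts)


def build_rule1_chunk(segment, ocr_map):
    """Combine one ASR segment with its on-screen text into a sentence chunk."""
    ocr_text = collect_ocr_text(segment["start_frame"], segment["end_frame"], ocr_map)
    combined = f"{segment['text']} {ocr_text}".strip()
    return {
        "content": {"text": segment["text"], "ocr_text": ocr_text},
        "text_content": combined,
        "metadata": {"language": segment.get("language")},
    }
```

Deduplication keeps first-seen order, so text that stays on screen for many sampled frames contributes to the combined text only once.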
|
||||
|
||||
## Chunk Data Structure
|
||||
|
||||
### Content JSON (`content` column)
|
||||
|
||||
```json
|
||||
{
|
||||
"text": "今天的會議我們要討論 ...",
|
||||
"ocr_text": "Q3 Revenue Slides Agenda"
|
||||
}
|
||||
```
|
||||
|
||||
| Field | Source | Purpose |
|-------|--------|---------|
| `text` | ASR transcription | Original spoken text, used by UI/reference |
| `ocr_text` | OCR detections in frame range | On-screen text (titles, labels, signs) |
|
||||
|
||||
### Text Content (`text_content` column)
|
||||
|
||||
```
|
||||
"今天的會議我們要討論 Q3 Revenue Slides Agenda"
|
||||
```
|
||||
|
||||
Combined ASR + OCR text used for:
|
||||
- **Embedding generation**: The combined text is embedded to Qdrant, enabling semantic search to find segments based on both spoken and on-screen content
|
||||
- **Keyword search (BM25 ILIKE)**: Queries match against this field, so searching for "Q3 Revenue" finds the segment even if not spoken aloud
|
||||
|
||||
### Metadata JSON (`metadata` column)
|
||||
|
||||
```json
|
||||
{
|
||||
"language": "zh"
|
||||
}
|
||||
```
|
||||
|
||||
Only the ASR-detected language is stored. See Design Decisions below.
|
||||
|
||||
## Search Contribution Analysis
|
||||
|
||||
| Search Path | Mechanism | Rule 1 Contribution |
|
||||
|-------------|-----------|-------------------|
|
||||
| **Semantic search** (Qdrant) | `chunk_type='sentence'` → embedding query | ASR + OCR text in embedding captures both spoken and visual content |
|
||||
| **Keyword search** (BM25 ILIKE) | `text_content ILIKE '%query%'` | Both ASR and OCR text are searchable |
|
||||
| **Title match** (smart_search) | `chunk_type='sentence' AND embedding IS NOT NULL` | Rule 1 chunks are the primary sentence chunks |
|
||||
| **Identity search** | `face_detections` time overlap join | Rule 1 chunks match via frame ranges |
|
||||
|
||||
### What Was Excluded and Why
|
||||
|
||||
| Data Source | Considered For | Decision | Reason |
|
||||
|-------------|---------------|----------|--------|
|
||||
| **YOLO detections** | Adding class names to text_content | ❌ **Excluded** | 80 COCO classes are too generic ("person", "chair" appear in almost every segment). High error rate adds noise, dilutes embedding semantic density. Cross-segment distinctiveness is near zero. |
|
||||
| **ASRX speaker** | Adding speaker_id to metadata | ❌ **Excluded** | At Rule 1 time, identity has not been paired yet. Speaker IDs are temporary labels without identity binding, providing no search value. |
|
||||
| **Face detections** | Adding face_ids to metadata | ❌ **Excluded** | Same as speaker — identity not yet available. Face detection IDs alone have no search meaning. |
|
||||
| **OCR text** | Adding to text_content + embedding | ✅ **Included** | OCR provides specific on-screen text (titles, labels, signs) that directly matches user search queries. Highly complementary to ASR. |
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### `fetch_ocr_texts()`
|
||||
|
||||
Reads OCR per-frame data from `pre_chunks`:
|
||||
|
||||
```sql
|
||||
SELECT coordinate_index as frame, data
|
||||
FROM pre_chunks
|
||||
WHERE file_uuid = $1 AND processor_type = 'ocr'
|
||||
ORDER BY coordinate_index
|
||||
```
|
||||
|
||||
Parses the `data.texts` JSON array, extracting `text` fields where `confidence > 0.5`. Returns `BTreeMap<i64, Vec<String>>` mapping frame number to list of recognized text strings.
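
The parsing step can be sketched as follows. The real implementation is Rust and returns a `BTreeMap`; this Python version is only an illustration. It assumes each row is a `(frame, data)` pair whose `data["texts"]` entries carry `text` and `confidence` keys, as described above. The names `parse_ocr_rows` and `MIN_CONFIDENCE` are hypothetical.

```python
from collections import defaultdict

MIN_CONFIDENCE = 0.5  # threshold stated in the description above


def parse_ocr_rows(rows):
    """Map frame number -> list of OCR strings with confidence > 0.5.

    `rows` is an iterable of (frame, data) pairs; `data` mirrors the JSON in
    pre_chunks.data (assumed shape: {"texts": [{"text": ..., "confidence": ...}]}).
    """
    by_frame = defaultdict(list)
    for frame, data in rows:
        for item in data.get("texts", []):
            text = (item.get("text") or "").strip()
            # Keep only non-empty strings above the confidence threshold.
            if text and item.get("confidence", 0.0) > MIN_CONFIDENCE:
                by_frame[frame].append(text)
    # Sort by frame to mimic the ordered iteration of a BTreeMap.
    return dict(sorted(by_frame.items()))
```

Frames whose detections are all filtered out do not appear in the result, which matches the "OCR frame with no valid text" edge case.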
|
||||
|
||||
### `collect_ocr_text()`
|
||||
|
||||
For a given frame range `[start_frame, end_frame]`:
|
||||
1. Iterates frames using `BTreeMap::range(start_frame..=end_frame)`
|
||||
2. Collects all OCR texts from those frames
|
||||
3. Deduplicates using a `HashSet` (case-sensitive)
|
||||
4. Joins with spaces: `"text1 text2 text3"`
|
||||
|
||||
Returns empty string if no OCR data exists in the range.
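
A minimal Python sketch of the four steps above (the production code is Rust). It assumes `ocr_map` maps frame numbers to lists of strings. It also keeps texts in first-seen order, which is an assumption: the description only says a `HashSet` is used for deduplication.

```python
def collect_ocr_text(ocr_map, start_frame, end_frame):
    """Join deduplicated OCR strings from frames in [start_frame, end_frame]."""
    seen = set()
    ordered = []
    for frame in sorted(ocr_map):
        if start_frame <= frame <= end_frame:
            for text in ocr_map[frame]:
                if text not in seen:  # case-sensitive comparison
                    seen.add(text)
                    ordered.append(text)
    # Empty string when no OCR data falls inside the range.
    return " ".join(ordered)
```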
|
||||
|
||||
### `text_content` Composition Rules
|
||||
|
||||
```
|
||||
if OCR text exists:
|
||||
combined = "{asr_text} {ocr_text}"
|
||||
else:
|
||||
combined = "{asr_text}"
|
||||
```
|
||||
|
||||
The combined string is used for both embedding and keyword search. The original ASR text is preserved separately in `content.text`.
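
As an illustration only, the composition rule and the three stored fields can be sketched in Python. The function name `build_chunk_fields` is hypothetical; the field names follow the `content`, `text_content`, and `metadata` columns described earlier.

```python
def build_chunk_fields(asr_text, ocr_text, language):
    """Return the three chunk columns for one ASR segment."""
    # Append OCR text only when some was found in the frame range.
    combined = f"{asr_text} {ocr_text}" if ocr_text else asr_text
    return {
        "text_content": combined,  # used for embedding and keyword search
        "content": {"text": asr_text, "ocr_text": ocr_text},
        "metadata": {"language": language},
    }
```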
|
||||
|
||||
## Trigger Points
|
||||
|
||||
| Trigger | Location | Condition |
|
||||
|---------|----------|-----------|
|
||||
| Worker auto | `job_worker.rs:1135` | After ASRX processor completes and no sentence chunks exist yet |
|
||||
| HTTP API | `POST /api/v1/file/:file_uuid/rule1` | Manual trigger via `pipeline_core::execute_rule1` |
|
||||
| Programmatic | `pipeline_core::execute_rule1` | Called by other modules needing sentence chunks |
|
||||
|
||||
The worker guard checks idempotency:
|
||||
```sql
|
||||
SELECT 1 FROM chunk WHERE file_uuid = $1 AND chunk_type = 'sentence' LIMIT 1
|
||||
```
|
||||
|
||||
## Edge Cases
|
||||
|
||||
| Scenario | Behavior |
|
||||
|----------|----------|
|
||||
| No ASR segments | Returns 0 immediately with info log |
|
||||
| No OCR data in pre_chunks | `ocr_text` is empty string; `text_content` = ASR only |
|
||||
| OCR frame with no valid text | Skipped (confidence ≤ 0.5 or empty string) |
|
||||
| ASR segment end_time = 0.0 | Logs warning; overlap-based matching degrades gracefully |
|
||||
| Large number of segments | Batches in single transaction; progress logged every 100 segments |
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Author | Change |
|
||||
|---------|------|--------|--------|
|
||||
| 1.0 | 2026-06-20 | OpenCode | Initial design: ASR + OCR → sentence chunks |
|
||||
816
docs_v1.0/DESIGN/TKG_MultiTrace_V1.0.md
Normal file
@@ -0,0 +1,816 @@
|
||||
# TKG Multi-Trace Design V1.0
|
||||
|
||||
**Date**: 2026-06-19
|
||||
**Version**: 1.0.0
|
||||
**Status**: Draft
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
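A unified 8Hz sampling framework that integrates four traces (face, appearance, gaze, lip) and connects them to sentence, speaker, and accessory nodes to build a complete Temporal Knowledge Graph (TKG).

### Design Goals

1. **Temporal alignment**: All traces share the same 8Hz grid, so edge computation needs no interpolation
2. **On-demand refinement**: Specific features (blink, lip-sync, mutual gaze) can locally raise the sampling rate
3. **Accessory detection**: 49 accessory classes (head 12 + neck 5 + hand 16 + foot 8 + carried 5 + color 3)
4. **Skin tone + lighting**: Fitzpatrick classification plus lighting parameters, supporting confidence assessment
5. **Social interaction**: Mutual gaze, lip-sync, and speaker-face binding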
|
||||
|
||||
---
|
||||
|
||||
## 1. 8Hz Sampling Framework

### 1.1 Basic Principle

```
Video FPS: ~30
Sample Interval: round(fps / 8) = 4
Sample Frames: 0, 4, 8, 12, 16, ...
```

| Video length | Total frames | 8Hz samples |
|--------------|--------------|-------------|
| 5 min | 9,000 | ~2,250 |
| 10 min | 18,000 | ~4,500 |
| 30 min | 54,000 | ~13,500 |
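
The arithmetic behind the table can be checked with a short Python sketch. The function names are hypothetical. Rounding uses `floor(x + 0.5)` to match Rust's `f64::round` for positive values, because Python's built-in `round` rounds halves to even. Note that at 30 fps an interval of 4 gives an effective rate of 7.5 Hz, so "8Hz" is a nominal target.

```python
import math


def sample_interval(fps, target_hz=8.0):
    """Frame step between samples; never smaller than 1."""
    return max(1, math.floor(fps / target_hz + 0.5))


def sample_stats(total_frames, fps, target_hz=8.0):
    """Return (interval, number of sampled frames, effective sampling rate in Hz)."""
    interval = sample_interval(fps, target_hz)
    n_samples = len(range(0, total_frames, interval))
    return interval, n_samples, fps / interval
```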
|
||||
|
||||
### 1.2 On-Demand Refinement

```
Layer 1: 8Hz base (all processors)
  ↓
Layer 2: Refinement (triggered by specific features)

Refinement scenarios:
- Blink confirmation: 8Hz detects a sudden drop in eye openness → go back and fetch the ±4 surrounding frames (30Hz)
- Lip-sync: time spans covered by a sentence chunk → 16Hz
- Mutual gaze: two people's gaze directions converge → confirm with the ±2 surrounding frames (30Hz)
```
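
A small Python sketch of how a refinement window could be merged into the base grid. It assumes a refinement request is a center frame plus a radius in native frames (±4 for blink, ±2 for mutual gaze); `refine_window` and `merge_frames` are hypothetical names, not the project's API.

```python
def refine_window(center_frame, radius, total_frames):
    """Native-rate frames within ±radius of center_frame, clamped to the video."""
    lo = max(0, center_frame - radius)
    hi = min(total_frames - 1, center_frame + radius)
    return set(range(lo, hi + 1))


def merge_frames(base, refine):
    """Union of base-grid frames and refinement frames, sorted ascending."""
    return sorted(set(base) | set(refine))
```

For example, a blink candidate at frame 8 in a 20-frame clip adds frames 4 through 12 to the base grid `[0, 4, 8, 12, 16]`.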
|
||||
|
||||
### 1.3 Sample Frame Computation
|
||||
|
||||
```rust
|
||||
// worker/processor.rs
use std::collections::HashSet;
|
||||
fn compute_sample_frames(total_frames: i64, fps: f64) -> Vec<i64> {
|
||||
let interval = (fps / 8.0).round() as i64;
|
||||
(0..total_frames).step_by(interval.max(1) as usize).collect()
|
||||
}
|
||||
|
||||
fn merge_refine_frames(base: &[i64], refine: &HashSet<i64>) -> Vec<i64> {
|
||||
let mut combined: HashSet<i64> = base.iter().cloned().collect();
|
||||
combined.extend(refine.iter().cloned());
|
||||
let mut sorted: Vec<i64> = combined.into_iter().collect();
|
||||
sorted.sort();
|
||||
sorted
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Trace Types

### Key Trace Overview

| # | Trace type | Source | Purpose |
|---|-----------|--------|---------|
| 1 | **face_trace** | face_detections + face.json | Face tracking, identity recognition |
| 2 | **appearance_trace** | appearance.json | Clothing colors, accessories, skin tone |
| 3 | **gaze_trace** | face.json (pose_angle + landmarks) | Gaze direction, mutual gaze |
| 4 | **lip_trace** | face.json (landmarks) | Lip shape, speech sync |
| 5 | **speaker_trace** | asrx.json (speaker diarization) | Speaker identification |
| 6 | **text_trace** | dev.chunk (sentence chunks) | Text content, semantics |
| 7 | **skin_tone_trace** | face.json (ROI HSV) | Skin tone classification, lighting record |
|
||||
|
||||
---
|
||||
|
||||
### 2.1 Face Trace (existing)
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "face_trace",
|
||||
"external_id": "trace_5",
|
||||
"properties": {
|
||||
"frame_count": 200,
|
||||
"start_frame": 150,
|
||||
"end_frame": 350,
|
||||
"avg_bbox": { "x": 500, "y": 300, "width": 200, "height": 250 },
|
||||
"avg_yaw": -0.15,
|
||||
"avg_pitch": -0.08,
|
||||
"avg_roll": -0.20,
|
||||
"pose_count": 180,
|
||||
"embedding": [...],
|
||||
"skin_tone": {
|
||||
"face_h_mean": 18.5,
|
||||
"fitzpatrick": "Type IV - Medium",
|
||||
"confidence": 0.82,
|
||||
"lighting": {
|
||||
"brightness": 0.65,
|
||||
"color_temp": "warm",
|
||||
"direction": "front",
|
||||
"uniformity": 0.92,
|
||||
"source": "indoor",
|
||||
"quality": "good"
|
||||
},
|
||||
"sample_frames": 156
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 2.2 Appearance Trace (new)

**Binding strategy**: IoU-match each appearance person to a face detection and inherit its trace_id
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "appearance_trace",
|
||||
"external_id": "trace_5",
|
||||
"properties": {
|
||||
"trace_id": 5,
|
||||
"frame_count": 400,
|
||||
"start_frame": 100,
|
||||
"end_frame": 500,
|
||||
"face_overlap_frames": 200,
|
||||
"confidence": 0.50,
|
||||
"color_features": {
|
||||
"dominant_colors": [[0.1, 0.6, 0.8], ...],
|
||||
"upper_body_hsv": [[...], [...], [...]],
|
||||
"lower_body_hsv": [[...], [...], [...]]
|
||||
},
|
||||
"accessories": {
|
||||
"head": {
|
||||
"hat": {"detected": true, "confidence": 0.82, "first_frame": 0},
|
||||
"glasses": {"detected": true, "confidence": 0.67, "first_frame": 0},
|
||||
"earrings": {"detected": false},
|
||||
"mask": {"detected": false},
|
||||
"hairstyle": {"type": "long", "confidence": 0.75},
|
||||
"hair_accessory": {"detected": false},
|
||||
"nose_ring": {"detected": false},
|
||||
"lip_ring": {"detected": false},
|
||||
"face_tattoo": {"detected": false},
|
||||
"eyebrow_tattoo": {"detected": false},
|
||||
"beard": {"detected": true, "confidence": 0.88},
|
||||
"headscarf": {"detected": false}
|
||||
},
|
||||
"neck": {
|
||||
"tie": {"detected": true, "confidence": 0.92, "first_frame": 0, "source": "hsv_color_block"},
|
||||
"scarf": {"detected": false},
|
||||
"shawl": {"detected": false},
|
||||
"necklace": {"detected": true, "confidence": 0.71, "first_frame": 12, "source": "clip"},
|
||||
"neck_tattoo": {"detected": false}
|
||||
},
|
||||
"hand": {
|
||||
"ring": {"detected": false},
|
||||
"bracelet": {"detected": false},
|
||||
"watch": {"detected": true, "confidence": 0.63, "first_frame": 24},
|
||||
"gloves": {"detected": false}
|
||||
},
|
||||
"hand_held": {
|
||||
"phone": {"detected": true, "confidence": 0.88, "source": "hsv_color_block"},
|
||||
"pen": {"detected": false},
|
||||
"cup": {"detected": false},
|
||||
"knife": {"detected": false},
|
||||
"gun": {"detected": false}
|
||||
},
|
||||
"foot": {
|
||||
"shoes": {"type": "sneaker", "confidence": 0.78, "source": "hsv_color_block"},
|
||||
"socks": {"detected": false},
|
||||
"barefoot": {"detected": false}
|
||||
},
|
||||
"vehicle": {
|
||||
"bicycle": {"detected": false, "source": "hsv_color_block"},
|
||||
"skateboard": {"detected": false},
|
||||
"scooter": {"detected": false}
|
||||
},
|
||||
"carried": {
|
||||
"backpack": {"detected": false},
|
||||
"handbag": {"detected": true, "confidence": 0.85, "source": "hsv_color_block"},
|
||||
"luggage": {"detected": false}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
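
The IoU binding between an appearance person box and face detections can be sketched in Python as below. This is a sketch under assumptions: boxes are `{x, y, width, height}` dicts as in `avg_bbox`, and the `min_iou` cutoff is a hypothetical value that the design does not specify. Because a face box is much smaller than a full-body box, the IoU of a correct match is naturally low, so the cutoff is kept small.

```python
def iou(a, b):
    """Intersection-over-union of two {x, y, width, height} boxes."""
    ax2, ay2 = a["x"] + a["width"], a["y"] + a["height"]
    bx2, by2 = b["x"] + b["width"], b["y"] + b["height"]
    iw = max(0, min(ax2, bx2) - max(a["x"], b["x"]))
    ih = max(0, min(ay2, by2) - max(a["y"], b["y"]))
    inter = iw * ih
    union = a["width"] * a["height"] + b["width"] * b["height"] - inter
    return inter / union if union > 0 else 0.0


def bind_trace_id(person_bbox, faces, min_iou=0.05):
    """Return the trace_id of the best-overlapping face, or None if none qualifies."""
    best = max(faces, key=lambda f: iou(person_bbox, f["bbox"]), default=None)
    if best is None or iou(person_bbox, best["bbox"]) < min_iou:
        return None
    return best["trace_id"]
```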
|
||||
|
||||
### 2.3 Speaker Trace (important)

**Source**: ASRX speaker diarization + face trace binding
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "speaker_trace",
|
||||
"external_id": "SPEAKER_0",
|
||||
"properties": {
|
||||
"speaker_id": "SPEAKER_0",
|
||||
"segment_count": 45,
|
||||
"total_duration": 120.5,
|
||||
"first_appearance": {"frame": 100, "time": 3.3},
|
||||
"last_appearance": {"frame": 3600, "time": 120.0},
"full_text": "大家好 今天我們來討論... (full speech-to-text transcript)",
|
||||
"segments": [
|
||||
{"start_time": 0.1, "end_time": 2.0, "text": "大家好", "start_frame": 3, "end_frame": 60},
|
||||
{"start_time": 5.2, "end_time": 8.5, "text": "今天我們來討論", "start_frame": 156, "end_frame": 255},
|
||||
...
|
||||
],
|
||||
"face_trace_ids": [5, 12, 23],
|
||||
"appearance_trace_ids": [5, 12],
|
||||
"gaze_context": {
|
||||
"looking_at_person": true,
|
||||
"mutual_gaze_with": [12]
|
||||
},
|
||||
"lip_sync_quality": 0.85
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Source data**:
|
||||
```
|
||||
ASRX → asrx.json (segments with speaker_id)
|
||||
Face → face_detections (trace_id)
|
||||
Binding → SPEAKS_AS edge (speaker ↔ face_trace)
|
||||
```
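
One plausible way to derive the `overlap_frames` property of a `SPEAKS_AS` edge is to sum the frame overlap between a speaker's segments and a face trace's frame range. The Python sketch below assumes inclusive frame ranges and uses hypothetical function names; the actual binding logic may differ.

```python
def overlap_frames(seg_start, seg_end, trace_start, trace_end):
    """Number of frames shared by two inclusive [start, end] frame ranges."""
    return max(0, min(seg_end, trace_end) - max(seg_start, trace_start) + 1)


def speaker_face_overlap(segments, trace_start, trace_end):
    """Total overlap between a speaker's segments and one face trace."""
    return sum(
        overlap_frames(s["start_frame"], s["end_frame"], trace_start, trace_end)
        for s in segments
    )
```

With the sample segments above (frames 3 to 60 and 156 to 255) and a face trace spanning frames 150 to 350, only the second segment overlaps, contributing 100 frames.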
|
||||
|
||||
### 2.4 Text Trace (important)

**Source**: dev.chunk (chunk_type='sentence') + ASRX text
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "text_trace",
|
||||
"external_id": "chunk_1",
|
||||
"properties": {
|
||||
"chunk_id": "chunk_1",
|
||||
"text": "大家好,今天我們來討論這個話題",
|
||||
"text_normalized": "大家好,今天我們來討論這個話題",
|
||||
"start_time": 0.1,
|
||||
"end_time": 5.2,
|
||||
"start_frame": 3,
|
||||
"end_frame": 156,
|
||||
"speaker_id": "SPEAKER_0",
|
||||
"language": "zh",
|
||||
"confidence": 0.95,
|
||||
"yolo_objects": ["person", "chair"],
|
||||
"face_ids": ["face_100"],
|
||||
"speaker_trace_id": "SPEAKER_0",
|
||||
"face_trace_id": 5,
|
||||
"lip_sync": {
|
||||
"matched_frames": 120,
|
||||
"total_frames": 153,
|
||||
"quality": 0.85
|
||||
},
|
||||
"semantic_embedding": [0.12, -0.34, ...],
|
||||
"sentiment": "neutral"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Source data**:
|
||||
```
|
||||
Rule 1 → dev.chunk (sentence chunks)
|
||||
ASRX → asrx.json (speaker_id binding)
|
||||
Face → face_detections (face_ids in chunk metadata)
|
||||
YOLO → yolo.json (co-occurring objects)
|
||||
```
|
||||
|
||||
**Edge connections**:
- `SPEAKS_BY`: text_trace → speaker_trace
- `SPOKEN_WHILE`: text_trace → face_trace
- `LIP_SYNC`: lip_trace → text_trace
- `CONTAINS_OBJECT`: text_trace → object
|
||||
|
||||
### 2.5 Skin Tone Trace (important)

**Source**: face.json ROI HSV + lighting analysis
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "skin_tone_trace",
|
||||
"external_id": "trace_5",
|
||||
"properties": {
|
||||
"trace_id": 5,
|
||||
"frame_count": 200,
|
||||
"start_frame": 150,
|
||||
"end_frame": 350,
|
||||
"face_h_mean": 18.5,
|
||||
"fitzpatrick": "Type IV - Medium",
|
||||
"confidence": 0.82,
|
||||
"lighting": {
|
||||
"brightness": 0.65,
|
||||
"color_temp": "warm",
|
||||
"direction": "front",
|
||||
"uniformity": 0.92,
|
||||
"source": "indoor",
|
||||
"quality": "good"
|
||||
},
|
||||
"sample_frames": 156,
|
||||
"hand_h_mean": 17.8,
|
||||
"arm_h_mean": 18.2
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Fitzpatrick classification**:

| Type | Description | H value (HSV) |
|------|-------------|---------------|
| I | Very light | 0–5 |
| II | Light | 5–12 |
| III | Medium-light | 12–18 |
| IV | Medium | 18–25 |
| V | Dark | 25–35 |
| VI | Very dark | 35+ |

**Lighting quality**:

| Quality | Condition | Skin-tone confidence |
|---------|-----------|----------------------|
| good | brightness > 0.4, uniformity > 0.8, front light | High (×1.0) |
| fair | brightness > 0.3, uniformity > 0.6 | Medium (×0.7) |
| poor | brightness < 0.3 or backlight | Low (×0.5) |
|
||||
|
||||
### 2.6 Gaze Trace (new)
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "gaze_trace",
|
||||
"external_id": "trace_5",
|
||||
"properties": {
|
||||
"trace_id": 5,
|
||||
"frame_count": 200,
|
||||
"start_frame": 150,
|
||||
"end_frame": 350,
|
||||
"avg_yaw": -0.15,
|
||||
"avg_pitch": -0.08,
|
||||
"avg_roll": -0.20,
|
||||
"head_direction": "frontal",
|
||||
"gaze_direction": "center-left",
|
||||
"eye_openness": 0.85,
|
||||
"blink_count": 12,
|
||||
"blink_rate": 0.06,
|
||||
"looking_at_person": true,
|
||||
"looking_at_object": ["chair"],
|
||||
"refined_ranges": [
|
||||
{"start_frame": 200, "end_frame": 220, "hz": 30, "reason": "mutual_gaze"}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 2.7 Lip Trace (important)

**Source**: face.json → faces[].lips (inner_lips 6pts + outer_lips 14pts)
|
||||
|
||||
```json
|
||||
{
|
||||
"node_type": "lip_trace",
|
||||
"external_id": "trace_5",
|
||||
"properties": {
|
||||
"trace_id": 5,
|
||||
"frame_count": 180,
|
||||
"start_frame": 160,
|
||||
"end_frame": 340,
|
||||
"avg_openness": 0.3,
|
||||
"avg_width": 45.2,
|
||||
"avg_height": 12.8,
|
||||
"movement_variance": 0.15,
|
||||
"speaking_frames": 95,
|
||||
"silent_frames": 85,
|
||||
"lip_landmark_samples": {
|
||||
"inner_lips": [[x,y,z], ...],
|
||||
"outer_lips": [[x,y,z], ...]
|
||||
},
|
||||
"speech_correlation": {
|
||||
"text_trace_ids": ["chunk_1", "chunk_2", "chunk_3"],
|
||||
"sync_quality": 0.85,
|
||||
"matched_segments": [
|
||||
{"start_frame": 160, "end_frame": 200, "text": "大家好"},
|
||||
{"start_frame": 210, "end_frame": 250, "text": "今天我們來討論"}
|
||||
]
|
||||
},
|
||||
"refined_ranges": [
|
||||
{"start_frame": 160, "end_frame": 340, "hz": 30, "reason": "lip_sync"}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Lip-sync computation**:
|
||||
|
||||
```
|
||||
Lip openness = inner_lips_area / outer_lips_area
|
||||
|
||||
Speaking detection:
  - openness > threshold (dynamically adjusted)
  - movement_variance > threshold (lip-shape change)
  - sustained for at least N frames (to suppress noise)

Sync with text:
  - compare against the text_trace start/end_time
  - compute the overlap ratio between lip movement and the text time span
  - quality = matched_frames / total_text_frames
|
||||
```
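
The two formulas above can be sketched in Python. This is an illustration under assumptions: landmark lists are ordered polygon vertices (only the x and y components are used), area is computed with the shoelace formula, and `total_text_frames` is taken as `text_end - text_start`, which matches the `total_frames: 153` example for frames 3 to 156. The function names are hypothetical.

```python
def polygon_area(points):
    """Shoelace area of a polygon whose vertices are given in order."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i][0], points[i][1]
        x2, y2 = points[(i + 1) % n][0], points[(i + 1) % n][1]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def lip_openness(inner_lips, outer_lips):
    """inner_lips_area / outer_lips_area, or 0.0 for a degenerate outer contour."""
    outer = polygon_area(outer_lips)
    return polygon_area(inner_lips) / outer if outer > 0 else 0.0


def sync_quality(speaking_frames, text_start, text_end):
    """matched_frames / total_text_frames for one text span."""
    total = text_end - text_start
    if total <= 0:
        return 0.0
    matched = sum(1 for f in speaking_frames if text_start <= f < text_end)
    return matched / total
```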
|
||||
|
||||
**Edge connections**:
- `HAS_LIP`: face_trace → lip_trace
- `LIP_SYNC`: lip_trace → text_trace
- `GAZE_SYNC_SPEECH`: gaze_trace + lip_trace (gaze direction while speaking)
|
||||
|
||||
---
|
||||
|
||||
## 3. Accessory Detection

### 3.1 Division of Detection Methods

| Method | Applicable accessories | Speed | Notes |
|--------|------------------------|-------|-------|
| **HSV color block** | tie, phone, watch, ring, bracelet, glasses, mask, hat, shoes, backpack, handbag, umbrella, pen, knife, cup, book, laptop, remote, baseball_bat | Fast | **Primary method**: analyzes off-color regions in the person crop |
| **CLIP** | hairstyle, beard, face_tattoo, eyebrow_tattoo, earrings, nose_ring, lip_ring, neck_tattoo, headscarf, scarf, shawl, necklace, gloves, tool, gun, skateboard, scooter, roller_skates, socks, barefoot | Medium | zero-shot (used where YOLO is unreliable and color blocks are also hard to tell apart) |
| **MediaPipe** | gesture, arm_pose | Fast | 21 hand pts + 33 pose pts |
| **HSV** | upper_body_color, lower_body_color, skin_tone | Fast | Color feature extraction |
|
||||
|
||||
### 3.2 Tight Coupling of Appearance with Landmarks/Pose

**Core principle**: Appearance does not detect bboxes independently; it crops ROIs directly from the geometry produced by face/pose/MediaPipe.

```
Face Landmarks (20pts) ──► Face ROI ──► hat, glasses, mask, beard, earrings
Pose 33 Keypoints ───────► Body ROI ──► tie, necklace, upper/lower body HSV
MediaPipe Hands (21×2) ──► Wrist ROI ──► watch, bracelet, ring, phone, glove
MediaPipe Pose Feet ─────► Foot ROI ──► shoes, socks, barefoot
```

**ROI localization**:
|
||||
|
||||
```python
|
||||
def get_accessory_rois(frame, face_data, pose_data, hand_data):
    rois = {}

    # Face region: use the face bbox + landmarks
    face_bbox = face_data['bbox']
    landmarks = face_data['landmarks']  # nose, left_eye, right_eye

    # Hat ROI: extend upward from the face bbox
    rois['hat'] = expand_region(face_bbox, direction='up', factor=0.5)

    # Glasses ROI: horizontal band around the eye landmarks
    left_eye = landmarks['left_eye']
    right_eye = landmarks['right_eye']
    rois['glasses'] = bbox_around_points(left_eye, right_eye, padding=10)

    # Mask ROI: from below the nose down to the jaw
    nose = landmarks['nose']
    rois['mask'] = region_below_point(nose, face_bbox.bottom)

    # Neck ROI: use the pose neck keypoints
    if pose_data:
        neck = pose_data['keypoints']['neck']
        nose = pose_data['keypoints']['nose']
        rois['neck'] = region_between(nose, neck, width=80)

    # Wrist ROI: use the MediaPipe hand landmarks
    if hand_data:
        for side in ['left', 'right']:
            wrist = hand_data[side]['wrist']
            rois[f'{side}_wrist'] = circle_around(wrist, radius=30)

    # Foot ROI: use the pose ankle/toe keypoints
    if pose_data:
        for side in ['left', 'right']:
            ankle = pose_data['keypoints'][f'{side}_ankle']
            toe = pose_data['keypoints'][f'{side}_toe']
            rois[f'{side}_foot'] = bbox_around_points(ankle, toe, padding=20)

    return rois
|
||||
```
|
||||
|
||||
### 3.3 HSV Color-Block Detection Flow
|
||||
|
||||
```python
|
||||
def detect_accessories_tightly_coupled(frame, face_data, pose_data, hand_data,
                                       main_colors, current_frame):
    # main_colors: dominant colors to diff against
    # current_frame: index of the frame being processed

    # 1. Locate each ROI precisely from landmarks/pose
    rois = get_accessory_rois(frame, face_data, pose_data, hand_data)

    results = {}
    for roi_name, roi_bbox in rois.items():
        roi_hsv = crop_and_convert(frame, roi_bbox, 'HSV')

        # 2. Find off-color blobs inside the precise ROI
        diff_mask = compute_color_diff(roi_hsv, main_colors, threshold=30)
        blobs = find_connected_components(diff_mask)

        for blob in blobs:
            accessory = classify_accessory_by_position(blob, roi_name)
            if accessory:
                results[accessory] = {
                    "detected": True,
                    "confidence": blob.confidence,
                    "source": "hsv_color_block",
                    "roi": roi_name,
                    "first_frame": current_frame
                }

    # 3. Items that color blocks cannot resolve → CLIP
    clip_only_items = ['hairstyle', 'beard', 'earrings', 'nose_ring', ...]
    for item in clip_only_items:
        confidence = clip_score(crop_person(frame, face_data['bbox']), CLIP_PROMPTS[item])
        if confidence > 0.5:
            results[item] = {"detected": True, "confidence": confidence, "source": "clip"}

    return results
|
||||
```
|
||||
|
||||
### 3.4 Dependencies
|
||||
|
||||
```
|
||||
Face Detection ──► face_detections (trace_id, bbox, embedding)
      │
      ▼
Face Landmarks ────► Face ROI (hat, glasses, mask, beard)
      │
      ▼
Pose 33pts ────────► Body ROI (neck, wrist, foot) ──► Appearance HSV
      │
      ▼
MediaPipe Hands ───► Wrist ROI (watch, bracelet, ring, phone)
      │
      ▼
TKG appearance_trace
|
||||
```
|
||||
|
||||
### 3.5 CLIP Prompts (only for accessories that color blocks cannot easily distinguish)
|
||||
|
||||
```python
|
||||
CLIP_PROMPTS = {
    # Head: items that color blocks cannot easily resolve
    "hairstyle_short": "a person with short hair",
    "hairstyle_long": "a person with long hair",
    "hairstyle_braid": "a person with braided hair",
    "hairstyle_bun": "a person with hair in a bun",
    "face_tattoo": "a person with a visible face tattoo or face paint",
    "eyebrow_tattoo": "a person with tattooed or styled eyebrows",
    "beard": "a person with a beard or mustache",

    # Ear/nose/lip piercings
    "earrings": "a person wearing earrings",
    "nose_ring": "a person wearing a nose ring or nose piercing",
    "lip_ring": "a person wearing a lip ring or lip piercing",

    # Neck: small items such as necklaces
    "necklace": "a person wearing a necklace",
    "neck_tattoo": "a person with a visible neck tattoo",

    # Small hand items
    "gloves": "a person wearing gloves",
    "tool": "a person holding a tool like a wrench or screwdriver",
    "gun": "a person holding a gun",

    # Feet
    "socks": "a person wearing visible socks",
    "barefoot": "a barefoot person",
    "roller_skates": "a person wearing roller skates",
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Skin Tone + Lighting

### 4.1 Fitzpatrick Classification

| Type | Description | H value (HSV) |
|------|-------------|---------------|
| I | Very light | 0–5 |
| II | Light | 5–12 |
| III | Medium-light | 12–18 |
| IV | Medium | 18–25 |
| V | Dark | 25–35 |
| VI | Very dark | 35+ |
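
The hue bins in this table translate into a simple lookup, sketched here in Python. The table's ranges share their endpoints (5, 12, 18, ...), so this sketch assumes half-open intervals in which a boundary value falls into the higher type. That convention and the names used are assumptions, not part of the design.

```python
# Upper hue bound (exclusive) for each type; anything at or above 35 is type VI.
FITZPATRICK_BINS = [(5.0, "I"), (12.0, "II"), (18.0, "III"), (25.0, "IV"), (35.0, "V")]


def classify_fitzpatrick(h_mean):
    """Map a mean HSV hue value to a Fitzpatrick type label."""
    for upper, label in FITZPATRICK_BINS:
        if h_mean < upper:
            return label
    return "VI"
```

Under this convention the sample `face_h_mean` of 18.5 maps to type IV, consistent with the `"Type IV - Medium"` value in the trace examples.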
|
||||
|
||||
### 4.2 Lighting Parameters

| Parameter | Computation | Range |
|-----------|-------------|-------|
| brightness | Mean of the V channel | 0.0–1.0 |
| color_temp | White-balance estimate | warm/neutral/cool |
| direction | Shadow gradient + yaw/pitch | front/side/back/top |
| uniformity | Standard deviation of V across facial regions | 0.0–1.0 |
| source | Combined judgment from brightness + color temperature | indoor/outdoor/flash |
|
||||
|
||||
### 4.3 Lighting Quality

| Quality | Condition | Skin-tone confidence |
|---------|-----------|----------------------|
| good | brightness > 0.4, uniformity > 0.8, front light | High (×1.0) |
| fair | brightness > 0.3, uniformity > 0.6 | Medium (×0.7) |
| poor | brightness < 0.3 or backlight | Low (×0.5) |
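
A Python sketch of the quality rules and confidence multipliers in this table. Two points are assumptions: "backlight" is read as `direction == "back"`, and any input the table does not cover (for example brightness exactly 0.3) falls back to `poor`. The function names are hypothetical.

```python
QUALITY_WEIGHT = {"good": 1.0, "fair": 0.7, "poor": 0.5}


def lighting_quality(brightness, uniformity, direction):
    """Classify lighting as good / fair / poor from the table's conditions."""
    if brightness < 0.3 or direction == "back":
        return "poor"
    if brightness > 0.4 and uniformity > 0.8 and direction == "front":
        return "good"
    if brightness > 0.3 and uniformity > 0.6:
        return "fair"
    return "poor"  # assumption: uncovered cases are treated as poor


def adjusted_confidence(raw_confidence, quality):
    """Scale a skin-tone confidence by the lighting-quality multiplier."""
    return raw_confidence * QUALITY_WEIGHT[quality]
```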
|
||||
|
||||
---
|
||||
|
||||
## 5. TKG Node Types

| node_type | external_id | Source | Importance | Properties |
|-----------|-------------|--------|------------|------------|
| `face_trace` | `trace_N` | face_detections | ★★★★ | frame_count, bbox, pose, embedding, skin_tone |
| `appearance_trace` | `trace_N` | appearance.json | ★★★★ | trace_id, color_features, accessories, confidence |
| `gaze_trace` | `trace_N` | face.json (pose_angle) | ★★★ | trace_id, gaze_direction, blink_count, looking_at |
| `lip_trace` | `trace_N` | face.json (lips) | ★★★★ | trace_id, avg_openness, speaking_frames, speech_correlation |
| `speaker_trace` | `SPEAKER_N` | asrx.json | ★★★★ | speaker_id, segments, face_trace_ids, full_text |
| `text_trace` | `chunk_N` | dev.chunk | ★★★★ | text, speaker_id, time_range, yolo_objects, lip_sync |
| `skin_tone_trace` | `trace_N` | face.json (ROI HSV) | ★★★ | trace_id, fitzpatrick, lighting, confidence |
| `object` | `class_name` | yolo.json | ★★ | total_detections, frames |
| `accessory` | `hat`, `glasses`, ... | appearance.json | ★★ | category, trace_ids, first/last_seen |
|
||||
|
||||
---
|
||||
|
||||
## 6. TKG Edge Types

| Edge Type | Source → Target | Properties | Description |
|-----------|----------------|------------|-------------|
| `SPEAKS_AS` | speaker_trace → face_trace | confidence, overlap_frames | Binds a speaker to a face |
| `SPEAKS_BY` | text_trace → speaker_trace | - | Who spoke the text |
| `SPOKEN_WHILE` | text_trace → face_trace | frame_overlap | Face present while speaking |
| `HAS_APPEARANCE` | face_trace → appearance_trace | confidence, overlap_frames | Appearance features |
| `HAS_GAZE` | face_trace → gaze_trace | overlap_frames | Gaze direction |
| `HAS_LIP` | face_trace → lip_trace | overlap_frames | Lip-shape data |
| `HAS_SKIN_TONE` | face_trace → skin_tone_trace | confidence, lighting_match | Skin tone record |
| `LIP_SYNC` | lip_trace → text_trace | time_alignment, openness_match | Lip sync |
| `WEARS` | appearance_trace → accessory | confidence, first_frame | Accessory |
| `LOOKING_AT` | gaze_trace → object | direction_match, distance | Gazing at an object |
| `LOOKING_AT_PERSON` | gaze_trace → face_trace | direction_match | Gazing at another person |
| `MUTUAL_GAZE` | face_trace ↔ face_trace | first_frame, last_frame, duration_frames, confidence | Mutual gaze |
| `CO_OCCURS_WITH` | object ↔ object | frame_count | Object co-occurrence |
| `SAME_SKIN_TONE` | face_trace ↔ face_trace | h_diff, lighting_match, confidence | Similar skin tone |
| `HOLDS` | appearance_trace → object | - | Hand-held items such as a phone |
|
||||
|
||||
---
|
||||
|
||||
## 7. Mutual Gaze Analysis

### 7.1 Computation Logic
|
||||
|
||||
```
|
||||
For each frame:
  For each pair (person_A, person_B):
    1. Compute A's gaze vector (from yaw/pitch/roll)
    2. Compute the position of B's bbox center in A's coordinate frame
    3. Check whether B falls inside A's gaze cone (threshold: ~15°)
    4. Run the reverse check B → A
    5. Hit in both directions → mutual_gaze
|
||||
```
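
The cone test in steps 3 to 5 can be sketched in Python. This is a sketch under strong assumptions: positions and gaze vectors are already expressed in one shared coordinate frame (the conversion from yaw/pitch/roll is not specified here and is left out), and the ~15° threshold is interpreted as the cone's half-angle. All function names are hypothetical.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two vectors; 180 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 180.0
    cos = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cos))


def in_gaze_cone(origin, gaze_vec, target, half_angle_deg=15.0):
    """True if `target` lies within the gaze cone starting at `origin`."""
    to_target = [t - o for t, o in zip(target, origin)]
    return angle_between_deg(gaze_vec, to_target) <= half_angle_deg


def is_mutual_gaze(pos_a, gaze_a, pos_b, gaze_b, half_angle_deg=15.0):
    """Both A → B and B → A cone checks must hit."""
    return (in_gaze_cone(pos_a, gaze_a, pos_b, half_angle_deg)
            and in_gaze_cone(pos_b, gaze_b, pos_a, half_angle_deg))
```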
|
||||
|
||||
### 7.2 Persistence Check
|
||||
|
||||
```
|
||||
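mutual_gaze must persist for at least N frames to be meaningful:
  - Base: 8Hz, sustained for ≥ 3 frames (~0.375s) → create the edge
  - Refinement: after finding a candidate, go back and confirm at 30Hz
  - confidence = consecutive frames / total possible frames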
|
||||
```
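
The base-layer persistence rule amounts to finding runs of consecutive hits on the sampling grid. The Python sketch below assumes hits are given as sorted sample indices on the 8Hz grid (0, 1, 2, ... rather than native frame numbers); `mutual_gaze_runs` is a hypothetical name.

```python
def mutual_gaze_runs(hit_samples, min_run=3):
    """Return (first, last, length) for each run of consecutive sample indices
    that is at least `min_run` long."""
    runs = []
    start = prev = None
    for idx in hit_samples:
        if prev is not None and idx == prev + 1:
            prev = idx  # extend the current run
            continue
        # The run (if any) just ended; keep it when it is long enough.
        if start is not None and prev - start + 1 >= min_run:
            runs.append((start, prev, prev - start + 1))
        start = prev = idx
    if start is not None and prev - start + 1 >= min_run:
        runs.append((start, prev, prev - start + 1))
    return runs
```

Each returned run is a candidate edge; three samples at 8Hz correspond to 3 / 8 = 0.375 s, matching the threshold above.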
|
||||
|
||||
### 7.3 Edge Properties
|
||||
|
||||
```json
|
||||
{
|
||||
"edge_type": "MUTUAL_GAZE",
|
||||
"source": "trace_5",
|
||||
"target": "trace_12",
|
||||
"properties": {
|
||||
"first_frame": 150,
|
||||
"last_frame": 280,
|
||||
"duration_frames": 130,
|
||||
"duration_seconds": 4.3,
|
||||
"confidence": 0.85,
|
||||
"context": "during_conversation"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Implementation Plan

### Phase 0: 8Hz Sampling Framework (~100 lines)

| File | Change |
|------|--------|
| `worker/processor.rs` | Compute 8Hz sample frames + refinement framework |
| `scripts/face_processor.py` | Accept a `--frames` argument |
| `scripts/appearance_processor.py` | Switch bbox source to yolo; accept `--frames` |
| `scripts/mediapipe_holistic_processor.py` | Accept `--frames` |

### Phase 1: Gaze + Mutual Gaze (~250 lines)

| Module | Lines |
|--------|-------|
| Gaze trace nodes | 150 |
| Mutual Gaze edges | 100 |

### Phase 2: Lip + Sentence + Speaker (~260 lines)

| Module | Lines |
|--------|-------|
| Lip trace nodes | 120 |
| Sentence nodes | 80 |
| Speaker enhancements | 60 |

### Phase 3: Appearance + Accessories (~280 lines)

| Module | Lines |
|--------|-------|
| Appearance traces (HSV + trace_id binding) | 120 |
| Accessories (CLIP detection) | 80 |
| Skin tone + lighting | 80 |

### Phase 4: TKG Integration (~110 lines)

| Module | Lines |
|--------|-------|
| Unified `build_tkg()` call | 40 |
| Edge builder updates | 70 |

### Total: ~1,000 lines
|
||||
|
||||
---
|
||||
|
||||
## 9. Dependency Graph
|
||||
|
||||
```
|
||||
YOLO (global) ────────────────────────────────────────┐
|
||||
│ │
|
||||
▼ │
|
||||
Face (8Hz) ──► trace_id ──┬──► Appearance (IoU binding) │
|
||||
│ │ ├──► HSV colors │
|
||||
│ │ ├──► Accessories (CLIP) │
|
||||
│ │ └──► Skin tone + light │
|
||||
│ │ │
|
||||
│ ├──► Gaze ──► Mutual Gaze ────┤
|
||||
│ │ ──► Looking at YOLO │
|
||||
│ │ │
|
||||
│ └──► Lip ──► LIP_SYNC ◄──────┤
|
||||
│ │
|
||||
ASRX ──► Speaker ──► SPEAKS_AS ──► face_trace │
|
||||
│ │ │
|
||||
└──► Text (Rule 1) ────┴──► SPEAKS_BY │
|
||||
├──► SPOKEN_WHILE │
|
||||
└──► LIP_SYNC ────────────┘
|
||||
|
||||
All traces ──────────────────────────► TKG
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Appendix A: Full Accessory List (49 types)

| Body part | Accessories | Detection method |
|-----------|-------------|------------------|
| Head (12) | hat, hairstyle, hair_accessory, earrings, nose_ring, lip_ring, face_tattoo, eyebrow_tattoo, glasses, mask, beard, headscarf | HSV color block + CLIP |
| Neck (5) | tie, scarf, shawl, necklace, neck_tattoo | HSV color block + CLIP |
| Hand/arm (16) | ring, bracelet, watch, gloves, phone, pen, laptop, book, cup, remote, tool, knife, gun, baseball_bat, gesture, arm_pose | HSV color block + CLIP + MP |
| Foot/vehicle (8) | shoes, socks, barefoot, skateboard, scooter, bicycle, motorbike, roller_skates | HSV color block + CLIP |
| Carried/environment (5) | backpack, handbag, luggage, chair, diningtable | HSV color block + CLIP |
| Color (3) | upper_body_hsv, lower_body_hsv, skin_tone | HSV |

> **Note**: YOLO is unreliable and is no longer the primary detection method. Most accessories now use HSV color-block analysis; CLIP is used only for items that color blocks cannot easily distinguish (e.g. piercings, tattoos, hairstyles).
|
||||
|
||||
## Appendix B: DB Schema Changes
|
||||
|
||||
```sql
|
||||
-- appearance_detections (new)
|
||||
CREATE TABLE appearance_detections (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
file_uuid VARCHAR NOT NULL,
|
||||
frame_number BIGINT NOT NULL,
|
||||
person_id INTEGER NOT NULL,
|
||||
x INTEGER, y INTEGER, width INTEGER, height INTEGER,
|
||||
trace_id INTEGER,
|
||||
confidence REAL,
|
||||
hsv_histogram JSONB,
|
||||
dominant_colors JSONB,
|
||||
upper_body_hsv JSONB,
|
||||
lower_body_hsv JSONB,
|
||||
accessories JSONB,
|
||||
skin_tone JSONB,
|
||||
lighting JSONB,
|
||||
created_at TIMESTAMPTZ DEFAULT NOW()
|
||||
);
|
||||
|
||||
-- tkg_nodes (extend node_type)
-- Added: appearance_trace, gaze_trace, lip_trace, sentence, accessory

-- tkg_edges (extend edge_type)
-- Added: HAS_APPEARANCE, HAS_GAZE, HAS_LIP, WEARS, LOOKING_AT,
--        LOOKING_AT_PERSON, MUTUAL_GAZE, LIP_SYNC, SPEAKS_BY,
--        SAME_SKIN_TONE, HAS_NECK_ACCESSORY, HAS_HEAD_ACCESSORY, HOLDS
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Author | Description |
|
||||
|---------|------|--------|-------------|
|
||||
| 1.0.0 | 2026-06-19 | OpenCode | Initial design: 8Hz sampling, 7 traces (face/appearance/gaze/lip/speaker/text/skin_tone), 49 accessories, skin tone + lighting, mutual gaze, lip-sync |
|
||||
| 1.1.0 | 2026-06-19 | OpenCode | Added speaker_trace, text_trace, skin_tone_trace as important traces; enhanced lip_trace with speech_correlation; updated node/edge tables |
|
||||
| **1.2.0** | **2026-06-19** | **OpenCode** | **Implementation complete: build_tkg() integrates all node/edge builders. 9 node types, 14 edge types. ~1500 lines added to tkg.rs** |
|
||||
257
docs_v1.0/DESIGN/TKG_PHASE2_6_EDGES_MIGRATION.md
Normal file
@@ -0,0 +1,257 @@
---
title: TKG Phase 2.6 Edges Migration Plan
version: 1.0
date: 2026-06-21
author: OpenCode
status: Draft
---

## Phase 2.6 Overview

Migrate the TKG edge builders from the PostgreSQL `face_detections` table to Qdrant payloads.

## Current Implementation Analysis

### 2.6.1: co_occurrence_edges (CO_OCCURS_WITH)

**Current Code** (`tkg.rs:932-1039`):

```rust
let face_rows = sqlx::query_as::<_, FaceDetectionRow>(&format!(
    "SELECT trace_id::bigint, frame_number::bigint, x::float8, y::float8, width::float8, height::float8
     FROM {} WHERE file_uuid = $1 AND trace_id IS NOT NULL
     ORDER BY frame_number",
    face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```

**Dependencies**:

- `face_detections.trace_id`
- `face_detections.frame_number`
- `face_detections.x, y, width, height`

**Migration Strategy**:

```rust
// Fetch from the Qdrant payload
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;

// Group by frame
let mut frame_map: HashMap<i64, Vec<(i64, f64, f64, f64, f64)>> = HashMap::new();
for emb in embeddings {
    let frame = emb.payload.frame_number;
    let trace_id = emb.payload.trace_id;
    frame_map.entry(frame).or_default().push((
        trace_id,
        emb.payload.bbox_x,
        emb.payload.bbox_y,
        emb.payload.bbox_width,
        emb.payload.bbox_height,
    ));
}
```

### 2.6.2: face_face_edges (MUTUAL_GAZE)

**Current Code** (`tkg.rs:1171-1320`):

```rust
let rows: Vec<(i64, i64, i64)> = sqlx::query_as(&format!(
    "SELECT a.trace_id::bigint AS tid_a, b.trace_id::bigint AS tid_b, a.frame_number::bigint
     FROM {} a
     JOIN {} b ON a.file_uuid = b.file_uuid AND a.frame_number = b.frame_number AND a.trace_id < b.trace_id
     WHERE a.file_uuid = $1 AND a.trace_id IS NOT NULL AND b.trace_id IS NOT NULL",
    face_table, face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```

**Dependencies**:

- `face_detections` self-join for co-occurrence
- `face_detections.trace_id`
- `face_detections.frame_number`

**Migration Strategy**:

```rust
// Fetch all embeddings from Qdrant
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;

// Group by frame
let mut frame_faces: HashMap<i64, Vec<FaceEmbeddingPayload>> = HashMap::new();
for emb in embeddings {
    frame_faces.entry(emb.payload.frame_number).or_default().push(emb.payload);
}

// Find face pairs within the same frame
let mut pairs: Vec<(i64, i64, i64)> = Vec::new();
for (frame, faces) in frame_faces.iter() {
    for i in 0..faces.len() {
        for j in (i+1)..faces.len() {
            let tid_a = faces[i].trace_id.min(faces[j].trace_id);
            let tid_b = faces[i].trace_id.max(faces[j].trace_id);
            pairs.push((tid_a, tid_b, *frame));
        }
    }
}
```

### 2.6.3: speaker_face_edges (SPEAKS_AS)

**Current Code** (`tkg.rs:1045-1169`):

```rust
let traces = sqlx::query_as::<_, (i64, i64, i64)>(&format!(
    "SELECT trace_id::bigint, MIN(frame_number)::bigint as start_f, MAX(frame_number)::bigint as end_f
     FROM {} WHERE file_uuid = $1 AND trace_id IS NOT NULL
     GROUP BY trace_id",
    face_table
))
.bind(file_uuid)
.fetch_all(pool)
.await?;
```

**Dependencies**:

- `face_detections.trace_id`
- `face_detections.frame_number` (MIN/MAX)

**Migration Strategy**:

```rust
// Fetch all embeddings from Qdrant
let embeddings = face_db.get_all_embeddings_for_file(file_uuid).await?;

// Compute the frame range for each trace_id
let mut trace_ranges: HashMap<i64, (i64, i64)> = HashMap::new();
for emb in embeddings {
    let trace_id = emb.payload.trace_id;
    let frame = emb.payload.frame_number;
    let entry = trace_ranges.entry(trace_id).or_insert((frame, frame));
    entry.0 = entry.0.min(frame);
    entry.1 = entry.1.max(frame);
}
```

### 2.6.4: mutual_gaze_edges (MUTUAL_GAZE)

**Already in face_face_edges**:

- `face_face_edges` already contains the mutual_gaze detection logic
- No separate migration is needed

### 2.6.5: lip_sync_edges (LIP_SYNC)

**Already migrated in Phase 2.5.2**:

- `build_lip_trace_nodes_from_qdrant()` is complete
- lip_sync_edges already uses the Qdrant payload

## Migration Priority

| Priority | Edge Type | Complexity | Impact |
|----------|-----------|-------------|--------|
| P1 | co_occurrence_edges | Low | High (relationship graph) |
| P1 | face_face_edges | Medium | High (face relationships) |
| P2 | speaker_face_edges | Low | Medium (speaker relationships) |
| N/A | mutual_gaze_edges | - | Included in face_face_edges |
| N/A | lip_sync_edges | - | Migrated in Phase 2.5.2 |

## Performance Estimate

| Edge Type | Current (PG) | After Migration | Speedup |
|-----------|--------------|-----------------|---------|
| co_occurrence_edges | ~120ms | ~30ms | 4x |
| face_face_edges | ~90ms | ~25ms | 3.6x |
| speaker_face_edges | ~60ms | ~20ms | 3x |
| **Total** | **~270ms** | **~75ms** | **3.6x** |

## Implementation Steps

### Step 1: Add helper functions in `face_embedding_db.rs`

```rust
// Get all embeddings grouped by frame
pub async fn get_embeddings_by_frame(&self, file_uuid: &str) -> Result<HashMap<i64, Vec<FaceEmbeddingPayload>>>;

// Get trace_id frame ranges
pub async fn get_trace_frame_ranges(&self, file_uuid: &str) -> Result<HashMap<i64, (i64, i64)>>;
```
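
The two helpers in Step 1 are declared as signatures only. As a rough sketch of the in-memory part they would need once the payloads have been fetched, the block below groups payloads by frame and computes per-trace frame ranges. `FacePayload` is a hypothetical, trimmed-down stand-in for `FaceEmbeddingPayload` (only `trace_id` and `frame_number`), and `group_by_frame` / `trace_frame_ranges` are illustrative names, not functions from the codebase; no Qdrant call is made here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the Qdrant face payload (assumption:
/// only the two fields needed for grouping are modelled).
#[derive(Debug, Clone, PartialEq)]
pub struct FacePayload {
    pub trace_id: i64,
    pub frame_number: i64,
}

/// Group payloads by frame number.
pub fn group_by_frame(payloads: &[FacePayload]) -> HashMap<i64, Vec<FacePayload>> {
    let mut by_frame: HashMap<i64, Vec<FacePayload>> = HashMap::new();
    for p in payloads {
        by_frame.entry(p.frame_number).or_default().push(p.clone());
    }
    by_frame
}

/// Compute (first_frame, last_frame) for each trace_id,
/// mirroring the MIN/MAX GROUP BY query used on PostgreSQL.
pub fn trace_frame_ranges(payloads: &[FacePayload]) -> HashMap<i64, (i64, i64)> {
    let mut ranges: HashMap<i64, (i64, i64)> = HashMap::new();
    for p in payloads {
        let entry = ranges
            .entry(p.trace_id)
            .or_insert((p.frame_number, p.frame_number));
        entry.0 = entry.0.min(p.frame_number);
        entry.1 = entry.1.max(p.frame_number);
    }
    ranges
}
```

Keeping the grouping logic as pure functions over a slice would let it be unit-tested without a running Qdrant instance; the async methods could then be thin wrappers around a fetch plus these functions.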

### Step 2: Create migration functions in `tkg.rs`

```rust
// Phase 2.6.1
async fn build_co_occurrence_edges_from_qdrant(
    pool: &PgPool,
    file_uuid: &str,
    output_dir: &str,
    face_db: &FaceEmbeddingDb,
) -> Result<usize>;

// Phase 2.6.2
async fn build_face_face_edges_from_qdrant(
    pool: &PgPool,
    file_uuid: &str,
    pose_data: &[FacePose],
    face_db: &FaceEmbeddingDb,
) -> Result<usize>;

// Phase 2.6.3
async fn build_speaker_face_edges_from_qdrant(
    pool: &PgPool,
    file_uuid: &str,
    output_dir: &str,
    face_db: &FaceEmbeddingDb,
) -> Result<usize>;
```

### Step 3: Replace in `build_tkg.rs`

```rust
// Old
let e_co = build_co_occurrence_edges(pool, file_uuid, output_dir).await?;

// New
let e_co = build_co_occurrence_edges_from_qdrant(pool, file_uuid, output_dir, face_db).await?;
```

### Step 4: Add feature flag (optional)

```rust
#[cfg(feature = "qdrant-edges")]
let e_co = build_co_occurrence_edges_from_qdrant(...).await?;
#[cfg(not(feature = "qdrant-edges"))]
let e_co = build_co_occurrence_edges(...).await?;
```

## Verification Plan

1. Run TKG rebuild on test file
2. Compare edge counts (PG vs Qdrant)
3. Verify edge properties match
4. Performance benchmark
5. Integration test with Rule2
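
The edge-count comparison in the plan above can be made concrete with a small tolerance check. The sketch below is only one way to quantify a "close match": the function names and any tolerance value are assumptions for illustration, not part of the codebase. The commit message for this change reports 6701 co_occurrence edges from PostgreSQL versus 6679 from Qdrant, which is the kind of pair such a check would be fed.

```rust
/// Relative difference between the PostgreSQL and Qdrant edge counts,
/// measured against the PostgreSQL count (the reference implementation).
pub fn relative_count_diff(pg_count: usize, qdrant_count: usize) -> f64 {
    if pg_count == 0 {
        // No reference edges: identical only if Qdrant also produced none.
        return if qdrant_count == 0 { 0.0 } else { f64::INFINITY };
    }
    (pg_count as f64 - qdrant_count as f64).abs() / pg_count as f64
}

/// True when the two counts differ by at most `tolerance`
/// (a fraction, e.g. 0.01 for 1%; the threshold is an assumption).
pub fn counts_within_tolerance(pg_count: usize, qdrant_count: usize, tolerance: f64) -> bool {
    relative_count_diff(pg_count, qdrant_count) <= tolerance
}
```

A count check like this only catches gross divergence; verifying that edge properties match (step 3) would still need a comparison of the edges themselves.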

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Qdrant collection empty | Fallback to PostgreSQL |
| Performance regression | Benchmark before merge |
| Edge count mismatch | Validate with test suite |
| Data inconsistency | Add reconciliation job |
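
The first mitigation (fall back to PostgreSQL when the Qdrant collection is empty) could be expressed as a small selection helper. This is a minimal sketch under stated assumptions: `EdgeSource` and `rows_with_fallback` are hypothetical names, and the PostgreSQL query is represented by a closure so that it only runs when the Qdrant result is empty.

```rust
/// Which backend supplied the rows used for edge building.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeSource {
    Qdrant,
    Postgres,
}

/// Use the Qdrant rows when there are any; otherwise evaluate the
/// PostgreSQL fallback lazily and report which source was used.
pub fn rows_with_fallback<T>(
    qdrant_rows: Vec<T>,
    pg_fallback: impl FnOnce() -> Vec<T>,
) -> (Vec<T>, EdgeSource) {
    if qdrant_rows.is_empty() {
        (pg_fallback(), EdgeSource::Postgres)
    } else {
        (qdrant_rows, EdgeSource::Qdrant)
    }
}
```

Returning the source alongside the rows makes it straightforward to log which path a rebuild took, which helps when checking that the fallback is actually exercised.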

## Success Criteria

- [ ] All edges use Qdrant payload (no face_detections queries)
- [ ] Edge counts match PostgreSQL version
- [ ] Performance improvement >= 2x
- [ ] Rule2/Rule3 work correctly
- [ ] No regressions in existing tests

## Timeline

- Phase 2.6.1 (co_occurrence): 1 day
- Phase 2.6.2 (face_face): 1 day
- Phase 2.6.3 (speaker_face): 0.5 day
- Testing & verification: 0.5 day
- **Total: 3 days**
374
docs_v1.0/DESIGN/VideoPlayback_Architecture_V1.0.md
Normal file
@@ -0,0 +1,374 @@
---
document_type: "design"
service: "MOMENTRY_CORE"
title: "Video Playback Architecture: Local Direct Serve & Remote Streaming"
version: "V1.0"
date: "2026-06-07"
author: "OpenCode"
status: "draft"
tags:
  - "video-playback"
  - "caddy"
  - "streaming"
  - "thumbnail"
  - "wordpress-frontend"
related_documents:
  - "DESIGN/FILE_LIFECYCLE_V1.0.md"
---

# Video Playback Architecture: Local Direct Serve & Remote Streaming

| Item | Value |
|------|-------|
| Scope | Video file playback & thumbnail serving for WordPress frontend (m5wp) |
| Status | Draft |
| Applies to | Search results (`serve_url`), Caddy routing, Momentry media-proxy endpoint |
| Key concept | Local files served directly by Caddy (zero backend overhead); remote files fall back to Momentry streaming; thumbnails proxied through Caddy to Momentry |

---

## Problem Statement

The WordPress frontend (`m5wp.momentry.ddns.net`) displays search results with video thumbnails and a player. Currently:

- **Thumbnails**: WordPress Code Snippet 61 (`momentry/v1/media` REST route) is inactive → all requests return `rest_no_route` 404
- **Video playback**: The frontend has no way to construct a playable URL from search results; no `serve_url` exists in the search response
- **WordPress constraint**: WordPress files and database tables must not be modified (marcom team territory)

The solution must work for two deployment scenarios:

- **Local**: The video file resides on the same server as Momentry → serve via static HTTP (zero processing overhead)
- **Remote**: The video file resides on external storage (NAS, S3, etc.) → fall back to Momentry's ffmpeg-based streaming

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  Browser (search-chat @ m5wp.momentry.ddns.net)                  │
│                                                                  │
│  ┌──────────┐  ┌──────────────────┐  ┌─────────────────────┐   │
│  │ Search   │  │ Thumbnail img    │  │  <video src="...">  │   │
│  └────┬─────┘  └───────┬──────────┘  └──────────┬──────────┘   │
│       │                 │                          │             │
└───────┼─────────────────┼──────────────────────────┼─────────────┘
        │                 │                          │
        ▼                 ▼                          ▼
┌───────────────────────────────────────────────────────────────┐
│  Caddy (m5wp block)                                            │
│                                                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ handle /wp-json/momentry/v1/media {                     │   │
│  │     rewrite * /api/v1/media-proxy{?}                    │   │
│  │     reverse_proxy localhost:3002 (+ X-API-Key)          │   │
│  │ }                                                       │   │
│  │                                                         │   │
│  │ handle_path /files/* {                                  │   │
│  │     root * /Users/accusys/momentry/var/sftpgo/data      │   │
│  │     file_server                                         │   │
│  │ }                                                       │   │
│  │                                                         │   │
│  │ reverse_proxy localhost:9002  ← WordPress (PHP-FPM)     │   │
│  └─────────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────┘
        │                 │                          │
        │                 │                          ▼
        │                 │              ┌───────────────────────┐
        │                 │              │ /files/*              │
        │                 │              │ Local file on disk    │
        │                 │              │ (zero backend cost)   │
        │                 │              └───────────────────────┘
        │                 ▼
        │  ┌─────────────────────────────────────────┐
        │  │ Momentry Core (localhost:3002)           │
        │  │                                         │
        ▼  ▼  /api/v1/media-proxy                    │
  ┌─────────────────────────┐                        │
  │ type=thumbnail&frame=N  │──→ face_thumbnail     │
  │ type=video&start=…     │──→ stream_video        │
  └─────────────────────────┘                        │
  ┌─────────────────────────┐                        │
  │ POST /api/v1/search/*   │──→ smart_search       │
  │ response: serve_url     │                        │
  └─────────────────────────┘                        │
        └───────────────────────────────────────────────┘
```

---

## Data Flow

### 1. Search → serve_url

```
Frontend                    Caddy                    Momentry Backend
   │                          │                         │
   │ POST /wp-json/.../search │                         │
   │ ─────────────────────────→│                         │
   │                          │ POST /api/v1/search/*   │
   │                          │ ──────────────────────→│
   │                          │                         │
   │                          │ ←─ SearchResult[] ─────│
   │                          │    (with serve_url +   │
   │                          │     file_name added)    │
   │ ←─ JSON response ────────│                         │
   │   results[0].serve_url = │                         │
   │   "https://m5wp.momentry.│                         │
   │    ddns.net/files/demo/  │                         │
   │    Charade_YouTube_24fps │                         │
   │    .mp4"                 │                         │
```

#### serve_url Construction

The backend computes `serve_url` from the video's `file_path` (stored in `videos` table) and two config values:

| Config | Env Var | Default |
|--------|---------|---------|
| `STORAGE_ROOT` | `MOMENTRY_STORAGE_ROOT` | `/Users/accusys/momentry/var/sftpgo/data` |
| `SERVE_BASE_URL` | `MOMENTRY_SERVE_BASE_URL` | `https://m5wp.momentry.ddns.net/files` |

Algorithm:

```
file_path:    /Users/accusys/momentry/var/sftpgo/data/demo/Charade_YouTube_24fps.mp4
STORAGE_ROOT  /Users/accusys/momentry/var/sftpgo/data
              ─────────────────────────────────────────────
relative:     demo/Charade_YouTube_24fps.mp4
              ↓ join with SERVE_BASE_URL
serve_url:    https://m5wp.momentry.ddns.net/files/demo/Charade_YouTube_24fps.mp4
```
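
The path translation above can be sketched as a single pure function. This is an illustrative version, not the backend's actual code: `compute_serve_url` is a hypothetical name, paths are assumed to be Unix-style UTF-8 strings, and no percent-encoding of the relative path is attempted. It returns `None` when `file_path` is not under the storage root, which matches the idea of omitting `serve_url` for remote files.

```rust
use std::path::Path;

/// Translate an on-disk `file_path` into a public URL by stripping
/// `storage_root` and prepending `serve_base_url`.
/// Returns None when the file does not live under the storage root.
pub fn compute_serve_url(
    file_path: &str,
    storage_root: &str,
    serve_base_url: &str,
) -> Option<String> {
    // strip_prefix compares whole path components, so "/data" does not
    // accidentally match "/database/...".
    let relative = Path::new(file_path).strip_prefix(storage_root).ok()?;
    let relative = relative.to_str()?;
    if relative.is_empty() {
        // file_path was the storage root itself, not a file inside it.
        return None;
    }
    Some(format!(
        "{}/{}",
        serve_base_url.trim_end_matches('/'),
        relative
    ))
}
```

File names containing spaces or other reserved characters would need URL encoding before being handed to a browser; that step is left out of this sketch.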

#### SearchResult Additions

```rust
pub struct SearchResult {
    // ... existing fields
    pub file_name: Option<String>,   // e.g. "Charade_YouTube_24fps.mp4"
    pub serve_url: Option<String>,   // e.g. "https://m5wp.momentry.ddns.net/files/..."
}
```

### 2. Video Playback (Local)

```
Frontend <video>            Caddy (file_server)
   │                          │
   │ GET /files/demo/Charade… │
   │ ─────────────────────────→│
   │                          │ root = /Users/accusys/momentry/var/sftpgo/data
   │                          │ serves /demo/Charade_YouTube_24fps.mp4
   │                          │
   │ ←─ 200 video/mp4 ────────│
   │    (range-request        │
   │     supported natively)  │
```

**Characteristics**:

- Zero backend CPU cost: pure I/O, no ffmpeg decode
- HTTP range requests work natively (Caddy `file_server` supports `Accept-Ranges: bytes`)
- HTML5 `<video>` can seek arbitrarily, play/pause normally
- Supports MP4 (H.264), WebM, and any browser-playable format

### 3. Video Playback (Remote, Fallback)

```
Frontend               Caddy                      Momentry Backend
   │                     │                           │
   │ GET /wp-json/.../   │                           │
   │  media?uuid=X&      │                           │
   │  type=video&        │                           │
   │  start_time=S&      │                           │
   │  end_time=E         │                           │
   │ ────────────────────→│                           │
   │                     │ rewrite to                │
   │                     │ /api/v1/media-proxy{?}    │
   │                     │                           │
   │                     │ GET /api/v1/media-proxy?  │
   │                     │  uuid=X&type=video&...    │
   │                     │ ─────────────────────────→│
   │                     │                           │
   │                     │      stream_video:        │
   │                     │      ffmpeg -ss S -i file │
   │                     │        -t (E-S) -c copy   │
   │                     │                           │
   │                     │ ←─ 200 video/mp4 ──────────│
   │                     │    (chunk data)           │
   │ ←─ HTTP streaming ───│                           │
```

### 4. Thumbnail

```
Frontend <img>          Caddy                      Momentry Backend
   │                      │                           │
   │ GET /wp-json/.../    │                           │
   │  media?uuid=X&       │                           │
   │  type=thumbnail&     │                           │
   │  frame=N             │                           │
   │ ──────────────────────→│                           │
   │                      │ rewrite to                │
   │                      │ /api/v1/media-proxy{?}    │
   │                      │                           │
   │                      │ /api/v1/media-proxy?      │
   │                      │  uuid=X&type=thumbnail&   │
   │                      │  frame=N                  │
   │                      │ ─────────────────────────→│
   │                      │                           │
   │                      │      face_thumbnail:      │
   │                      │      look up trace_id path │
   │                      │      → cached face crop   │
   │                      │      → validated JPEG     │
   │                      │                           │
   │                      │ ←─ 200 image/jpeg ────────│
   │ ←─ JPEG ───────────────│                           │
```

**Thumbnail flow detail**:

1. Caddy intercepts `/wp-json/momentry/v1/media` → rewrites to `/api/v1/media-proxy` keeping query params intact (`{?}`)
2. Momentry `media_proxy_handler` reads `uuid`, `type=thumbnail`, `frame=N` from query
3. Dispatches to the internal `face_thumbnail` handler
4. Returns cached face crop JPEG (or fallback frame extraction result)

---

## Caddyfile Configuration

Addition to the existing `m5wp` block:

```caddy
m5wp.momentry.ddns.net {
    tls internal

    # ── Local video files: direct serve, zero backend overhead ──
    handle_path /files/* {
        root * /Users/accusys/momentry/var/sftpgo/data
        file_server
    }

    # ── Media proxy: thumbnails + remote streaming ──
    # Bypasses inactive WordPress Code Snippet 61
    handle /wp-json/momentry/v1/media {
        rewrite * /api/v1/media-proxy{?}
        reverse_proxy localhost:3002 {
            header_up X-API-Key muser_68600856036340bcafc01930eb4bd839_1774418104_97221b69
        }
    }

    # ── Existing WordPress (PHP-FPM) ──
    reverse_proxy localhost:9002
    import common_log m5wp_access
}
```

**Key syntax**:

- `handle_path /files/*`: strips the `/files` prefix and serves from the `root` directory
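- `{?}`: intended to carry the original query string through the rewrite. Caddy's documented query placeholder is `{query}` (e.g. `rewrite * /api/v1/media-proxy?{query}`), and a path-only rewrite leaves the query string unchanged, so confirm that `{?}` is accepted by the Caddy version in use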
- `handle /wp-json/momentry/v1/media`: matches the exact path (query params are irrelevant for matching)

---

## Momentry API Changes

### New Endpoint: `GET /api/v1/media-proxy`

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | yes | file_uuid (accepts `file_uuid` key as alias) |
| `type` | string | yes | `thumbnail`, `video` (future: `image`, `file`) |
| `frame` | int | for thumbnail | Frame number to extract |
| `trace_id` | int | no | Face trace ID for cached crop |
| `start_time` | float | for video | Start time in seconds |
| `end_time` | float | for video | End time in seconds |
| `mode` | string | no | `normal` or `debug` (video) |
| `audio` | string | no | `on` or `off` (video) |

**Dispatch logic**:

- `type=thumbnail` → call `face_thumbnail(State, Path(uuid), Query(frame, trace_id, ...))`
- `type=video` → call `stream_video(State, Path(uuid), Query(params), request)`

The endpoint reuses existing handler implementations via direct axum extractor composition, avoiding code duplication.
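
To make the dispatch rules concrete, here is a minimal sketch of the query validation step, written against a plain `HashMap` rather than axum so that it stays self-contained. `MediaRequest`, `parse_media_request`, and the error strings are hypothetical names chosen for illustration; the real handler would map each variant onto `face_thumbnail` or `stream_video`. The optional `mode` and `audio` parameters are left out for brevity.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Validated form of a media-proxy query (illustrative only).
#[derive(Debug, PartialEq)]
pub enum MediaRequest {
    Thumbnail { uuid: String, frame: i64, trace_id: Option<i64> },
    Video { uuid: String, start_time: f64, end_time: f64 },
}

/// Fetch a required parameter and parse it into the requested type.
fn required<T: FromStr>(q: &HashMap<String, String>, key: &str) -> Result<T, String> {
    q.get(key)
        .ok_or_else(|| format!("missing parameter: {key}"))?
        .parse::<T>()
        .map_err(|_| format!("invalid value for parameter: {key}"))
}

/// Turn raw query parameters into a typed request, following the
/// parameter table: `uuid` (or its alias `file_uuid`) and `type` are
/// always required; the rest depends on the type.
pub fn parse_media_request(q: &HashMap<String, String>) -> Result<MediaRequest, String> {
    let uuid = q
        .get("uuid")
        .or_else(|| q.get("file_uuid"))
        .ok_or("missing parameter: uuid")?
        .clone();

    match q.get("type").map(String::as_str) {
        Some("thumbnail") => {
            let frame = required::<i64>(q, "frame")?;
            let trace_id = match q.get("trace_id") {
                Some(v) => Some(
                    v.parse::<i64>()
                        .map_err(|_| "invalid value for parameter: trace_id".to_string())?,
                ),
                None => None,
            };
            Ok(MediaRequest::Thumbnail { uuid, frame, trace_id })
        }
        Some("video") => Ok(MediaRequest::Video {
            uuid,
            start_time: required::<f64>(q, "start_time")?,
            end_time: required::<f64>(q, "end_time")?,
        }),
        Some(other) => Err(format!("unsupported type: {other}")),
        None => Err("missing parameter: type".to_string()),
    }
}
```

Parsing into an enum first keeps the "which parameters are required for which type" rules in one place, so the two downstream handlers never see a half-valid request.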

### Modified Endpoint: `POST /api/v1/search/smart`

**Response changes**: `SearchResult` gains two optional fields:

```json
{
  "results": [
    {
      "file_uuid": "a6fb22eebefaef17e62af874997c5944",
      "file_name": "Charade_YouTube_24fps.mp4",
      "serve_url": "https://m5wp.momentry.ddns.net/files/demo/Charade_YouTube_24fps.mp4",
      "start_frame": 88649,
      "start_time": 3697.08,
      "end_time": 3707.08,
      "summary": "...",
      "similarity": 0.85
    }
  ]
}
```

The `serve_url` is computed after enrichment via a batch query to the `videos` table (`file_uuid → file_path`), then applying the path translation:

1. Strip `STORAGE_ROOT` prefix from `file_path`
2. Prepend `SERVE_BASE_URL`

---

## Environment Variables

Add to `.env` (production) and `.env.development`:

```bash
# Storage root: where video files are stored on disk
# Used to compute serve_url from file_path
MOMENTRY_STORAGE_ROOT=/Users/accusys/momentry/var/sftpgo/data

# Public base URL for direct file access via Caddy file_server
MOMENTRY_SERVE_BASE_URL=https://m5wp.momentry.ddns.net/files
```

---

## Trade-offs & Rationale

| Approach | Pros | Cons |
|----------|------|------|
| **Caddy file_server** (local) | Zero CPU, native range requests, no code change to Momentry for serving | Requires storage root config; files must be accessible from Caddy |
| **Momentry stream_video** (remote) | Works with any storage backend (S3, NAS, NFS) | ffmpeg decode per request, higher latency, CPU-bound |
| **WordPress PHP proxy** (rejected) | No infra change | Fragile, snippet inactive, violates marcom territory |
| **Direct backend streaming only** (rejected) | Simplest implementation | Unnecessary CPU for local files; 100% backend dependency |

### Fallback Logic (Frontend)

The frontend JavaScript should handle playback as follows:

```javascript
if (result.serve_url) {
  // Local file: direct Caddy file_server
  video.src = result.serve_url;
} else {
  // Remote: use streaming endpoint
  video.src = `/wp-json/momentry/v1/media?uuid=${result.file_uuid}&type=video&start_time=${result.start_time}&end_time=${result.end_time}`;
}
```

This gives the frontend flexibility to pick the optimal playback path based on available data.

---

## Future Considerations

- **S3/NAS remote files**: When video files are stored externally, the `file_path` won't match `STORAGE_ROOT`. The backend can detect this by checking `file_path.starts_with(STORAGE_ROOT)`. If it doesn't match, omit `serve_url` and rely on the streaming fallback.
- **Pre-signed URLs**: For S3 storage, `serve_url` could be replaced with a pre-signed URL or cloud CDN URL.
- **Caching**: `file_server` responses are cacheable; consider adding `Cache-Control` headers for thumbnails.
- **Authentication**: Direct file access currently has no auth. If needed, Caddy can inject auth via `forward_auth` or JWT validation.

---

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| V1.0 | 2026-06-07 | OpenCode | Initial design: local direct serve + remote streaming + thumbnail proxy architecture |