docs: file_uuid generation rules for M4

Accusys
2026-05-17 02:26:09 +08:00
parent 3a6c186575
commit eec2eea880
79 changed files with 23293 additions and 0 deletions


@@ -0,0 +1,731 @@
---
document_type: "reference_doc"
service: "MOMENTRY_CORE"
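title: "Momentry API Key Management System Design"
date: "2026-03-21"
version: "V1.0"
status: "active"
owner: "Warren"
created_by: "OpenCode"
tags:
- "momentry"
- "management-system-design"
ai_query_hints:
- "Look up the contents of the Momentry API Key Management System Design"
- "What is the main purpose of the Momentry API Key Management System Design?"
- "How do I operate or implement the Momentry API Key Management System Design?"
---
# Momentry API Key Management System Design
| Item | Value |
|------|------|
| Author | Warren |
| Created | 2026-03-21 |
| Document version | V1.2 |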
---
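## Version History
| Version | Date | Purpose | Operator | Tool/Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-03-18 | Document created | Warren | OpenCode / MiniMax M2.5 |
| V1.1 | 2026-03-20 | Added key types and management workflow | Warren | OpenCode |
| V1.2 | 2026-03-21 | Updated API key format and verification flow | Warren | OpenCode |
---
**Status**: In development
---
## 1. Overview
### 1.1 Goals
Build a secure API key management mechanism that supports:
- Multiple API key types (system, user, service)
- Automatic expiry and rotation
- Anomalous-usage detection
- A forced update mechanism
- Complete audit logging
- Gitea token integration
- n8n API key integration
### 1.2 Design Principles
| Principle | Description |
|------|------|
| Least privilege | Each key is granted only the permissions it needs |
| Regular rotation | Keys expire automatically, which forces renewal |
| Traceable and auditable | Every operation is logged |
| Separate storage | Keys are stored separately from user data |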
---
## 2. API Key Types
### 2.1 Key Type Matrix
| Type | Prefix | Purpose | Default validity | Rotation |
|------|------|------|------------|----------|
| `system` | `msys_` | Internal system services | 365 days | Manual |
| `user` | `muser_` | Individual users | 90 days | Automatic |
| `service` | `msvc_` | Service-to-service communication | 180 days | Automatic |
| `integration` | `mint_` | Third-party integrations | 30 days | Forced update |
| `emergency` | `memg_` | Emergency access | 24 hours | One-time |
### 2.2 Key Format
```
{prefix}{uuid_v4}_{timestamp}_{checksum}
```
**Example:**
```
msys_a1b2c3d4-e5f6-7890-abcd-ef1234567890_1710998400_sha256
```
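
The format above can be sketched in Python as follows. This is a minimal illustration, not the project's generator: the document does not say how the `checksum` field is computed, so the sketch assumes it is the first 8 hex characters of a SHA-256 digest over the rest of the key. `generate_key`, `checksum_ok`, and `KEY_PATTERN` are hypothetical names.

```python
import hashlib
import re
import time
import uuid
from typing import Optional

# Prefixes copied from the key type matrix in section 2.1.
PREFIXES = {
    "system": "msys_",
    "user": "muser_",
    "service": "msvc_",
    "integration": "mint_",
    "emergency": "memg_",
}

# Shape check only; the 8-character checksum length is an assumption.
KEY_PATTERN = re.compile(
    r"^(msys|muser|msvc|mint|memg)_[0-9a-f-]{36}_\d+_[0-9a-f]{8}$"
)


def generate_key(key_type: str, now: Optional[int] = None) -> str:
    """Build a key shaped like {prefix}{uuid_v4}_{timestamp}_{checksum}."""
    timestamp = int(time.time()) if now is None else now
    body = f"{PREFIXES[key_type]}{uuid.uuid4()}_{timestamp}"
    # Assumption: checksum = first 8 hex chars of SHA-256 over the key body.
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()[:8]
    return f"{body}_{checksum}"


def checksum_ok(key: str) -> bool:
    """Recompute the assumed checksum to reject mistyped keys early."""
    body, _, checksum = key.rpartition("_")
    expected = hashlib.sha256(body.encode("utf-8")).hexdigest()[:8]
    return checksum == expected
```

A checksum of this kind only catches typos before a database lookup; it is not a substitute for comparing against the stored hash.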
---
## 3. Database Schema
### 3.1 api_keys table
```sql
CREATE TABLE api_keys (
    id BIGSERIAL PRIMARY KEY,
    key_id VARCHAR(64) UNIQUE NOT NULL, -- public key ID
    key_hash VARCHAR(128) NOT NULL, -- SHA256 hash
    key_prefix VARCHAR(8) NOT NULL, -- key prefix
    name VARCHAR(128) NOT NULL, -- key name
    key_type VARCHAR(32) NOT NULL, -- system/user/service/integration/emergency
    user_id BIGINT, -- associated user (nullable for system)
    service_name VARCHAR(64), -- service name (for service keys)
    permissions JSONB NOT NULL DEFAULT '[]', -- permission list
    expires_at TIMESTAMP, -- expiry time
    last_used_at TIMESTAMP, -- last used time
    last_used_ip VARCHAR(45), -- last used IP
    usage_count BIGINT DEFAULT 0, -- usage count
    status VARCHAR(16) DEFAULT 'active', -- active/suspended/expired/revoked
    rotation_required BOOLEAN DEFAULT FALSE, -- forced-rotation flag
    rotation_reason VARCHAR(256), -- rotation reason
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_api_keys_key_id ON api_keys(key_id);
CREATE INDEX idx_api_keys_user_id ON api_keys(user_id);
CREATE INDEX idx_api_keys_type ON api_keys(key_type);
CREATE INDEX idx_api_keys_status ON api_keys(status);
CREATE INDEX idx_api_keys_expires ON api_keys(expires_at);
```
### 3.2 api_key_audit_log table
```sql
CREATE TABLE api_key_audit_log (
    id BIGSERIAL PRIMARY KEY,
    key_id VARCHAR(64) NOT NULL,
    action VARCHAR(32) NOT NULL, -- created/used/rotated/revoked/expired/suspended
    actor VARCHAR(64), -- who performed the action (user_id or 'system')
    ip_address VARCHAR(45),
    user_agent VARCHAR(512),
    request_path VARCHAR(256),
    response_code INTEGER,
    details JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_audit_key_id ON api_key_audit_log(key_id);
CREATE INDEX idx_audit_action ON api_key_audit_log(action);
CREATE INDEX idx_audit_created ON api_key_audit_log(created_at);
```
### 3.3 api_key_rotation_log table
```sql
CREATE TABLE api_key_rotation_log (
    id BIGSERIAL PRIMARY KEY,
    key_id VARCHAR(64) NOT NULL,
    old_key_id VARCHAR(64),
    new_key_id VARCHAR(64),
    rotation_type VARCHAR(32) NOT NULL, -- scheduled/manual/forced/emergency
    reason VARCHAR(256),
    triggered_by VARCHAR(64), -- system/user/scheduler
    grace_period_end TIMESTAMP, -- end of the grace period
    created_at TIMESTAMP DEFAULT NOW()
);
```
---
## 4. API Key State Machine
```
                ┌───────────┐
                │  created  │
                └─────┬─────┘
                      │
                      ▼
                ┌───────────┐
        ┌──────►│  active   │◄──────┐
        │       └─────┬─────┘       │
        │             │             │
        │     ┌───────┼───────┐     │
        │     ▼       │       ▼     │
   ┌────┴──────┐      │     ┌───────┴───┐
   │ suspended │      │     │  expired  │
   └────┬──────┘      │     └───────────┘
        │             ▼
        │       ┌───────────┐
        └──────►│  revoked  │
                └───────────┘
```
### State Transition Rules
| From | To | Trigger |
|----|----|----------|
| created | active | Key enabled |
| active | suspended | Anomalous usage detected |
| active | expired | Expiry time reached |
| active | revoked | Manual revocation |
| suspended | active | Lock lifted |
| suspended | revoked | Anomaly confirmed |
| expired | active | Key re-enabled |
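
The transition table can be enforced with a small lookup. The sketch below is illustrative only: `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical names, and `revoked` is treated as terminal because the table lists no transition out of it.

```python
# Allowed transitions, copied from the state transition table.
ALLOWED_TRANSITIONS = {
    "created": {"active"},
    "active": {"suspended", "expired", "revoked"},
    "suspended": {"active", "revoked"},
    "expired": {"active"},
    "revoked": set(),  # terminal: the table lists no way out of revoked
}


class InvalidTransition(ValueError):
    """Raised when a status change is not listed in the transition table."""


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Keeping the rules in one dictionary means the CLI and any scheduler would reject the same illegal moves.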
---
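## 5. Anomaly Detection
### 5.1 Anomaly Indicators
| Indicator | Threshold | Action |
|------|------|------|
| Requests per minute | > 1000 | Warn |
| Requests per hour | > 10000 | Lock |
| Error rate | > 50% | Warn |
| Distinct IPs | > 5/hour | Warn |
| Off-hours usage | Late-night requests | Warn |
| Abnormal pattern | Brute force | Lock |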
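
As a rough sketch of how the numeric thresholds might be evaluated, the function below checks the four countable indicators from the table. The `UsageWindow` fields and the `evaluate` function are hypothetical; off-hours usage and brute-force pattern detection are left out because the document gives no threshold for them.

```python
from dataclasses import dataclass


@dataclass
class UsageWindow:
    """Usage counters for one key; the field names are hypothetical."""
    requests_last_minute: int
    requests_last_hour: int
    errors_last_hour: int
    distinct_ips_last_hour: int


def evaluate(usage: UsageWindow) -> str:
    """Return "lock", "warn", or "ok" using the thresholds from the table."""
    # Locking takes priority over warnings.
    if usage.requests_last_hour > 10000:
        return "lock"
    error_rate = (
        usage.errors_last_hour / usage.requests_last_hour
        if usage.requests_last_hour
        else 0.0
    )
    if (
        usage.requests_last_minute > 1000
        or error_rate > 0.5
        or usage.distinct_ips_last_hour > 5
    ):
        return "warn"
    return "ok"
```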
### 5.2 Anomaly Handling Flow
```
Anomaly detected
     │
     ▼
┌──────────┐
│ Analyze  │──→ rule out normal traffic
└────┬─────┘
     ▼
┌──────────┐
│ Assess   │──→ minor → warn
└────┬─────┘
     ▼
┌──────────┐
│ Respond  │──→ severe → lock + rotate
└──────────┘
```
---
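## 6. Forced Update Mechanism
### 6.1 Trigger Conditions
| Condition | Severity | Action |
|------|--------|------|
| Suspected leak | High | Disable immediately + forced rotation |
| Anomalous usage | Medium | Warn + recommend rotation |
| Planned maintenance | Low | Notify + scheduled rotation |
| Policy requirement | High | Forced rotation |
| Expiry | Low | Disable + notify |
### 6.2 Forced Rotation Flow
```
1. The system detects that a forced update is needed
2. Create a new key (the old key stays valid during the grace period)
3. Send notifications (Email/Slack/Redis PubSub)
4. The grace period starts (default 24 hours)
├── Updated within the grace period → rotation complete
└── Grace period ends → old key disabled
```
### 6.3 Grace Period Configuration
| Key type | Grace period |
|----------|--------|
| system | 72 hours |
| user | 24 hours |
| service | 48 hours |
| integration | 24 hours |
| emergency | 0 hours |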
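
A minimal sketch of how the per-type grace periods could feed the `grace_period_end` column of `api_key_rotation_log`. The helper names are hypothetical, and the sketch assumes the grace period starts at the moment the new key is created.

```python
from datetime import datetime, timedelta, timezone

# Grace periods in hours, copied from the table in section 6.3.
GRACE_HOURS = {
    "system": 72,
    "user": 24,
    "service": 48,
    "integration": 24,
    "emergency": 0,
}


def grace_period_end(key_type: str, rotated_at: datetime) -> datetime:
    """Return the moment the old key stops being accepted."""
    return rotated_at + timedelta(hours=GRACE_HOURS[key_type])


def old_key_still_valid(key_type: str, rotated_at: datetime, now: datetime) -> bool:
    """True while the old key is inside its grace period."""
    return now < grace_period_end(key_type, rotated_at)
```

With a 0-hour grace period, an emergency key's predecessor is rejected immediately after rotation.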
---
## 7. CLI Management Commands
### 7.1 Command List
```bash
# Key management
momentry api-key create --name "My Key" --type user --permissions read,write
momentry api-key list --type user
momentry api-key info <key_id>
momentry api-key revoke <key_id> --reason "security reasons"
# Rotation management
momentry api-key rotate <key_id> # normal rotation
momentry api-key force-rotate <key_id> # forced rotation
momentry api-key rotation-status <key_id> # show rotation status
# Anomaly management
momentry api-key suspend <key_id> --reason "anomalous usage"
momentry api-key unsuspend <key_id>
momentry api-key blacklist <key_id> # add to the blacklist
# Auditing
momentry api-key audit <key_id> --since 7d
momentry api-key stats --type service --period 30d
```
### 7.2 Sample Output
```bash
$ momentry api-key list --type service
┌────────────────────────────────────┬─────────┬──────────────┬────────────────┐
│ Key ID │ Name │ Status │ Expires │
├────────────────────────────────────┼─────────┼──────────────┼────────────────┤
│ msvc_a1b2c3d4_1710998400_sha256 │ N8N │ active │ 2026-09-21 │
│ msvc_e5f6g7h8_1713600000_sha256 │ OpenCode│ rotation_req │ 2026-09-21 │
└────────────────────────────────────┴─────────┴──────────────┴────────────────┘
⚠️ 1 key requires rotation
```
---
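## 8. Implementation Plan
### Phase 1: Core Features
- [ ] Database schema
- [ ] Key generation and hashing
- [ ] Basic CRUD API
- [ ] Expiry checks
### Phase 2: Security Mechanisms
- [ ] Anomaly detection
- [ ] Automatic locking
- [ ] Forced rotation
- [ ] Grace period management
### Phase 3: Management Tools
- [ ] CLI commands
- [ ] Audit log
- [ ] Statistics reports
- [ ] Notification system
### Phase 4: Automation
- [ ] Scheduled rotation
- [ ] Prometheus metrics
- [ ] Alertmanager integration
- [ ] Automated response
---
## 9. Security Considerations
### 9.1 Key Storage
- The plaintext key is shown only once (at creation)
- Keys are stored as SHA256 hashes
- Sensitive configuration is encrypted symmetrically with Fernet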
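
The hash-only storage rule can be sketched as follows. `hash_key` and `verify_key` are hypothetical helper names; the constant-time comparison via `hmac.compare_digest` is an addition of this sketch, not something the document specifies.

```python
import hashlib
import hmac


def hash_key(plaintext_key: str) -> str:
    """Return the SHA-256 hex digest that would be stored in key_hash."""
    return hashlib.sha256(plaintext_key.encode("utf-8")).hexdigest()


def verify_key(presented_key: str, stored_hash: str) -> bool:
    """Compare digests in constant time so timing does not leak a partial match."""
    return hmac.compare_digest(hash_key(presented_key), stored_hash)
```

Because only the digest is stored, a lost key cannot be recovered from the database and has to be rotated instead.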
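### 9.2 Transport Security
- All APIs must use HTTPS
- Keys are sent in a header (X-API-Key)
- Keys must not appear in URLs
### 9.3 Access Control
- Only administrators can create or revoke keys
- Users can manage only their own keys
- System keys require special permission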
---
## 10. Environment Variables
```bash
# API key management
MOMENTRY_API_KEY_GRACE_PERIOD=86400 # grace period (seconds)
MOMENTRY_API_KEY_MAX_PER_USER=5 # maximum keys per user
MOMENTRY_API_KEY_ROTATION_DAYS=90 # automatic rotation interval (days)
# Anomaly detection
MOMENTRY_API_KEY_RATE_LIMIT=1000 # per-minute request limit
MOMENTRY_API_KEY_ERROR_THRESHOLD=0.5 # error-rate threshold
MOMENTRY_API_KEY_IP_LIMIT=5 # distinct IPs per hour
# Notifications
MOMENTRY_API_KEY_ALERT_WEBHOOK= # anomaly alert webhook
```
---
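## 11. Gitea API Token Integration
### 11.1 Overview
The API key management system can create and manage Gitea Personal Access Tokens, using a "managed from creation" model.
### 11.2 Management Model
```
User supplies username and password → call the Gitea API to create a token → plaintext shown only once → record saved to the management system
```
**Characteristics:**
- The token plaintext is available only at creation time
- The management system records token metadata (no plaintext)
- Local lookup and deletion are supported
### 11.3 Database Structure
```sql
CREATE TABLE gitea_tokens (
    id SERIAL PRIMARY KEY,
    gitea_token_id BIGINT NOT NULL, -- Gitea internal token ID
    gitea_user VARCHAR(128) NOT NULL, -- Gitea username
    token_name VARCHAR(128) NOT NULL, -- token name
    token_last_eight VARCHAR(8) NOT NULL, -- last 8 characters of the SHA1 (for display)
    scopes JSONB DEFAULT '[]', -- permission scopes
    api_key_id VARCHAR(48), -- associated API key ID (optional)
    last_verified TIMESTAMP, -- last verification time
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(gitea_user, token_name)
);
```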
### 11.4 Token Permission Scopes
| Scope | Description |
|------|------|
| `read:repository` | Read repositories |
| `write:repository` | Write to repositories |
| `read:issue` | Read issues |
| `write:issue` | Write issues |
| `read:user` | Read user information |
| `write:user` | Modify user information |
| `read:organization` | Read organizations |
| `write:organization` | Modify organizations |
| `read:package` | Read packages |
| `write:package` | Publish packages |
| `read:notification` | Read notifications |
| `write:notification` | Modify notifications |
| `read:admin` | Administrator read |
| `write:admin` | Administrator write |
### 11.5 CLI Commands
#### Create a Token
```bash
# Basic usage
momentry gitea create \
  --username <gitea_user> \
  --password <gitea_password> \
  --token-name <token_name> \
  --scopes "read:repository,write:repository"
# Example: create a token for an integration
momentry gitea create \
  --username admin \
  --password "MyPassword123" \
  --token-name "ci-pipeline" \
  --scopes "read:repository,write:repository,read:issue,write:issue"
```
**Sample output:**
```
✅ Gitea Token created successfully!
┌─────────────────────────────────────────────────────────────────────────────┐
│ ⚠️ IMPORTANT: Save this token now - it will not be shown again! │
└─────────────────────────────────────────────────────────────────────────────┘
Token ID: 9
Token Name: ci-pipeline
SHA1: 9a4f282e9ba817b430082e6bff2c18e2ae38e480
Last 8: ae38e480
Authorization Header:
Authorization: token 9a4f282e9ba817b430082e6bff2c18e2ae38e480
```
#### List Tokens
```bash
# List all of a user's tokens
momentry gitea list \
  --username <gitea_user> \
  --password <gitea_password>
```
**Sample output:**
```
📋 Gitea Tokens for user: admin
┌────────────────────────────────────────────────────────────────────────────┐
│ ID │ Name │ Last 8 │ Registered │
├────────────────────────────────────────────────────────────────────────────┤
│ 9 │ ci-pipeline │ ae38e480 │ ✓ │
│ 8 │ dev-token │ 1234abcd │ - │
└────────────────────────────────────────────────────────────────────────────┘
Total: 2 token(s)
```
#### Delete a Token
```bash
# Delete the named token
momentry gitea delete \
  --username <gitea_user> \
  --password <gitea_password> \
  --token-name <token_name>
```
#### Query Local Records
```bash
# Look up the record of a managed token
momentry gitea verify --token-name <token_name>
```
**Sample output:**
```
📋 Gitea Token: ci-pipeline
User: admin
Token ID: 9
Last 8: ae38e480
Scopes: ["read:repository","write:repository"]
Created: 2026-03-21 06:44:55.577586 UTC
Last Verified: never
```
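### 11.6 Scope of Use
#### Suitable Scenarios
| Scenario | Description |
|------|------|
| CI/CD integration | Create dedicated tokens for automated workflows |
| Service-to-service communication | Create tokens that let other services access the Gitea API |
| Development environments | Create short-lived tokens for developers |
| Monitoring integration | Create read-only tokens for monitoring and reporting |
#### Limitations
| Limitation | Description |
|------|------|
| Plaintext token | Available only at creation; it cannot be retrieved again |
| Management API | Requires username and password (BasicAuth) |
| Token verification | Validity can be checked only by making an API call |
| Deletion sync | Deleting a local record does not delete the token in Gitea |
### 11.7 Environment Variables
```bash
# Gitea connection settings
GITEA_URL=http://localhost:3000 # Gitea API URL
```
### 11.8 Security Considerations
| Item | Measure |
|------|------|
| Password handling | Used only in the CLI command, never stored |
| Token storage | Only metadata is stored locally, never the plaintext |
| Least privilege | Grant only the scopes that are needed |
| Regular rotation | Renew tokens on a regular schedule |
---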
## 12. n8n API Key Integration
### 12.1 Overview
The API key management system can create and manage n8n API keys, using a "managed from creation" model.
### 12.2 Management Model
```
User supplies an existing n8n API key → call the n8n API to create a new key → plaintext shown only once → record saved to the management system
```
**Characteristics:**
- An existing n8n API key is required as the management credential
- The API key plaintext is available only at creation time
- The management system records key metadata (no plaintext)
- Local lookup and deletion are supported
### 12.3 Database Structure
```sql
CREATE TABLE n8n_api_keys (
    id SERIAL PRIMARY KEY,
    n8n_key_id VARCHAR(64) UNIQUE NOT NULL, -- n8n internal key ID
    label VARCHAR(100) NOT NULL, -- key label
    api_key_last_eight VARCHAR(8) NOT NULL, -- last 8 characters of the API key (for display)
    momentry_api_key_id VARCHAR(48), -- associated API key ID (optional)
    expires_at TIMESTAMP WITH TIME ZONE, -- expiry time
    last_verified TIMESTAMP WITH TIME ZONE, -- last verification time
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
```
### 12.4 Authentication
n8n uses JWT-based API keys, authenticated through the `X-N8N-API-KEY` header:
```bash
curl -H "X-N8N-API-KEY: <your-api-key>" https://n8n.example.com/api/v1/workflows
```
### 12.5 CLI Commands
#### Create an API Key
```bash
# Basic usage
momentry n8n create \
  --api-key <existing_n8n_api_key> \
  --label <key_label> \
  --expires-in-days <days>
# Example: create a key for CI/CD
momentry n8n create \
  --api-key "n8n_api_xxxxxxxxxxxx" \
  --label "ci-pipeline" \
  --expires-in-days 90
```
**Sample output:**
```
✅ n8n API Key created successfully!
┌─────────────────────────────────────────────────────────────────────────────┐
│ ⚠️ IMPORTANT: Save this API key now - it will not be shown again! │
└─────────────────────────────────────────────────────────────────────────────┘
Key ID: abc123-def456
Label: ci-pipeline
API Key: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
Usage:
curl -H 'X-N8N-API-KEY: eyJhbGciOiJIUz...' https://n8n.momentry.ddns.net/api/v1/workflows
```
#### List API Keys
```bash
# List all API keys
momentry n8n list --api-key <existing_n8n_api_key>
```
**Sample output:**
```
📋 n8n API Keys
┌────────────────────────────────────────────────────────────────────────────┐
│ Label │ ID │
├────────────────────────────────────────────────────────────────────────────┤
│ ci-pipeline │ abc123-def456-789 │
│ monitoring │ xyz789-abc123-456 │
└────────────────────────────────────────────────────────────────────────────┘
Total: 2 key(s)
```
#### Delete an API Key
```bash
# Delete the named API key
momentry n8n delete \
  --api-key <existing_n8n_api_key> \
  --label <key_label>
```
#### Query Local Records
```bash
# Look up the record of a managed API key
momentry n8n verify --label <key_label>
```
**Sample output:**
```
📋 n8n API Key: ci-pipeline
Key ID: abc123-def456
Last 8: ...JVCJ9
Created: 2026-03-21 06:44:55.577586 UTC
Expires: 2026-06-19 06:44:55.577586 UTC
Last Verified: never
```
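### 12.6 Scope of Use
#### Suitable Scenarios
| Scenario | Description |
|------|------|
| CI/CD integration | Create dedicated keys for automated workflows |
| Monitoring integration | Create read-only keys for monitoring workflow status |
| Service-to-service communication | Create keys that let other services call the n8n API |
| Development environments | Create short-lived keys for developers |
#### Limitations
| Limitation | Description |
|------|------|
| Plaintext API key | Available only at creation; it cannot be retrieved again |
| Management credential | Requires an existing n8n API key |
| Local deletion | Is not synced to n8n automatically |
| Permission scopes | Non-Enterprise editions have no fine-grained permissions |
### 12.7 Environment Variables
```bash
# n8n connection settings
N8N_URL=https://n8n.momentry.ddns.net # n8n API URL
```
### 12.8 Security Considerations
| Item | Measure |
|------|------|
| Management key | Store it carefully; it is the credential for managing other keys |
| API key storage | Only metadata is stored locally, never the plaintext |
| Expiry | Set an expiry time |
| Regular rotation | Renew keys on a regular schedule |
---
## 13. References
- PostgreSQL Schema
- Redis key design (MOMENTRY_CORE_REDIS_KEYS.md)
- Monitoring system (MOMENTRY_CORE_MONITORING.md)
- Gitea installation guide (INSTALL_GITEA.md)
- n8n API documentation (https://docs.n8n.io/api/authentication/)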


@@ -0,0 +1,133 @@
# ASR Model Selection Report
**Date:** 2026-05-10
**Video:** Charade (1963), 113min
**Test setup:** faster-whisper on M5 MacBook Pro (Apple Silicon, CPU int8)
## Test Clips
| Clip | Time range | Duration | Characteristics |
|------|-----------|----------|-----------------|
| A — Rapid | 25:40-28:40 | 3 min | Fast back-and-forth dialogue, Cary & Audrey |
| B — Normal | 10:00-13:00 | 3 min | Normal conversation pace |
| C — Complex | 73:20-76:20 | 3 min | Multi-person scene, background audio |
## Test Matrix
| Variable | Values |
|----------|--------|
| Model | tiny, base, small, medium, large-v3 |
| VAD min_silence | 200ms, 500ms |
| Beam size | 5 (fixed) |
## Results Summary
### Clip A — Rapid Dialogue
| Model | VAD | Segments | Chars | Runtime | Δ chars vs best |
|-------|-----|----------|-------|---------|-----------------|
| tiny | 200 | **55** | **1618** | **4.8s** | — |
| tiny | 500 | **59** | 1582 | **4.8s** | 36 |
| base | 200 | 50 | 1543 | 9.7s | 75 |
| base | 500 | 51 | 1547 | 11.6s | 71 |
| small | 200 | 47 | 1538 | 15.0s | 80 |
| small | 500 | 47 | 1538 | 14.5s | 80 |
| medium | 200 | 45 | 1241 | 34.0s | 377 |
| medium | 500 | 45 | 1241 | 34.9s | 377 |
| large-v3 | 200 | 14 | 916 | 42.1s | 702 |
| large-v3 | 500 | 14 | 916 | 42.0s | 702 |
**Winner: tiny** — 55-59 segments, most text captured, 4.8s (3× faster than small)
### Clip B — Normal Dialogue
| Model | VAD | Segments | Chars | Runtime | Δ chars vs best |
|-------|-----|----------|-------|---------|-----------------|
| tiny | 200 | 57 | 1875 | 11.9s | 40 |
| tiny | 500 | **59** | 1801 | 10.9s | 114 |
| base | 200 | 23 | 1695 | **5.1s** | 220 |
| base | 500 | 23 | 1695 | **5.1s** | 220 |
| small | 200 | **62** | 1731 | 15.7s | 184 |
| small | 500 | **62** | 1731 | 16.4s | 184 |
| medium | 200 | 59 | 1758 | 44.9s | 157 |
| medium | 500 | 59 | 1758 | 44.8s | 157 |
| large-v3 | 200 | 32 | **1915** | 95.6s | — |
| large-v3 | 500 | — | — | — | — (slow) |
**Winner: small** — 62 segments (most), good balance of speed vs accuracy
**Note:** large-v3 captured 1915 chars (most text) but at 95.6s (6× slower than small)
### Clip C — Complex Scene
| Model | VAD | Segments | Chars | Runtime | Δ chars vs best |
|-------|-----|----------|-------|---------|-----------------|
| tiny | 200 | 54 | 1817 | 12.2s | 336 |
| tiny | 500 | 52 | 1788 | 10.5s | 365 |
| base | 200 | 51 | 2018 | 10.1s | 135 |
| base | 500 | 51 | 2006 | 9.2s | 147 |
| small | 200 | **64** | 1902 | 22.5s | 251 |
| small | 500 | 61 | **2041** | 21.2s | 112 |
| medium | 200 | 57 | 2044 | 999.3s | 109 |
| medium | 500 | — | — | — | — (hang) |
| large-v3 | 200 | — | — | — | — (hang) |
| large-v3 | 500 | — | — | — | — (hang) |
**Winner: base** — 51 segments, 2006-2018 chars, 9.2-10.1s; the fastest model that ran reliably on this clip
**Note:** medium and large-v3 both hang/timeout on complex audio in this scene
## Aggregate Scores
Weighted ranking (higher = better, equal weight: segment count, char count, inverse runtime):
| Model | Segments (avg) | Chars (avg) | Runtime (avg) | Score | Rank |
|-------|---------------|-------------|---------------|-------|------|
| **tiny** | 56.0 | 1730 | **9.2s** | **8.5** | 🥇 |
| **small** | 54.7 | 1704 | 17.6s | **7.8** | 🥈 |
| base | 41.5 | 1751 | 10.1s | 7.0 | 🥉 |
| medium | 51.5 | 1627 | 339.6s | 3.5 | 4 |
| large-v3 | 20.0 | 1249 | 68.8s | 2.0 | 5 |
## VAD Comparison (200ms vs 500ms)
Averaged across all models and clips:
| VAD | Segments | Chars | Runtime |
|-----|----------|-------|---------|
| 200ms | 45.9 | 1683 | 86.1s |
| 500ms | 46.6 | 1685 | 69.2s |
**Difference:** Negligible. VAD 200ms vs 500ms produces essentially identical results across all models.
## Conclusions
### 1. Smaller is better for this use case
Contrary to expectations, **tiny and small** outperform medium and large-v3 on the aggregate metrics for Charade's dialogue (large-v3 captured the most characters on Clip B, but at a much higher runtime):
| Metric | tiny | large-v3 | Δ |
|--------|------|----------|---|
| Segments/clip | 56 | 20 | **+180%** |
| Text captured | 98% | 72% | **+26%** |
| Speed | 9.2s | 68.8s | **7.5× faster** |
### 2. Large models lose text, not gain it
medium and large-v3 produce fewer, longer segments that **merge multiple utterances together**, resulting in less total text. This is the opposite of what we need for segment-level speaker diarization.
### 3. VAD parameter has minimal impact
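Changing `min_silence_duration_ms` between 200 and 500 produces <2% difference in segment and character counts. The averaged runtime differs more, but the 200ms column includes the 999.3s medium run on Clip C, which has no 500ms counterpart. The current default (500ms) is fine.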
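
For reference, this is roughly how the VAD setting is passed to faster-whisper. The sketch assumes the test setup described at the top of the report (CPU, int8, beam size 5); `build_transcribe_options` and `transcribe_clip` are hypothetical helper names, not functions from the project.

```python
def build_transcribe_options(min_silence_ms: int = 500, beam_size: int = 5) -> dict:
    """Keyword arguments for WhisperModel.transcribe with VAD filtering enabled."""
    return {
        "beam_size": beam_size,
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": min_silence_ms},
    }


def transcribe_clip(audio_path: str, model_size: str = "small") -> list:
    """Transcribe one clip and return (start, end, text) tuples."""
    # Imported lazily so the option builder works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, **build_transcribe_options())
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Switching between the two tested settings only changes the `min_silence_ms` argument, which makes A/B runs like the ones above easy to script.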
### 4. Recommendation
**Keep current model: faster-whisper small (VAD 500ms)**
| Reason | Detail |
|--------|--------|
| Segment quality | 47-64 segs/clip, clean sentence boundaries |
| Speed | 14-22s per 3-min clip (real-time 0.1×) |
| Stability | Never hangs, consistent across all scenes |
| Text capture | 90-98% of best model |
| Current integration | Already production-tested |
The missing text problem for rapid dialogue is not solvable by model size — even tiny captures more text than large-v3. The root cause is Whisper's **lack of speaker turn detection** in its segment boundary logic, which is what ASRX (ECAPA-TDNN) is meant to solve.


@@ -0,0 +1,133 @@
# ASR Segmentation Enhancement Report
**Date:** 2026-05-10
**Movie:** Charade (1963), 113 min
**Goal:** Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments.
## Problem
Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from **multiple speakers**:
```
ASR segment [1550.0-1554.0] (4.0s):
"What's she saying now?"
Actual dialogue:
1552.7: Audrey: "What's she saying now?"
1553.4: Cary: "That she's innocent."
```
The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary.
## Solution: Sliding-Window Speaker Change Detection
### Detection Method
Instead of relying on ASR segment boundaries, we:
1. **Slide a 1.5s window (0.75s stride)** across the entire audio
2. **Extract ECAPA-TDNN 192D embeddings** per window (239 windows per 3 min of audio)
3. **Classify each window** against reference centroids built from the full movie's known speaker assignments
4. **Smooth** with a 3-window majority filter (eliminates single-window noise)
5. **Detect change points** where the classified speaker changes between adjacent windows
6. **Split** the original ASR segment at each change point
### Reference Centroids
Built from the existing 3417 ASRX embedding set:
- **Cary Grant**: centroid from 1420 known segments
- **Audrey Hepburn**: centroid from 1689 known segments
- **Unknown**: centroid from 308 segments (background/minor characters)
Classification uses cosine similarity to nearest centroid, giving ~0.8+ similarity for main characters.
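
A minimal sketch of nearest-centroid classification by cosine similarity. The real embeddings are 192-dimensional; the helper names `cosine_similarity` and `classify` are hypothetical, and plain Python lists are used here only to keep the example dependency-free.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(embedding: list, centroids: dict) -> tuple:
    """Return (speaker_name, similarity) for the closest centroid."""
    name = max(centroids, key=lambda k: cosine_similarity(embedding, centroids[k]))
    return name, cosine_similarity(embedding, centroids[name])
```

A caller could additionally route low-similarity windows to "Unknown"; the report does not state whether the pipeline applies such a cutoff.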
### Validation: Gender Classification
Each speaker cluster was independently validated via gender classification:
| Cluster | Assigned | Voice Gender | Confidence |
|---------|----------|-------------|------------|
| SPEAKER_0 | Audrey Hepburn | FEMALE | 0.71 |
| SPEAKER_1 | Cary Grant | MALE | 0.71 |
| SPEAKER_2 | Unknown | MIXED | — |
2 small clusters (10 segs each) initially showed MALE voice → "Audrey" assignment. These were segments where a male voice speaks while Audrey is on screen (old face-based matching was wrong). The fine-grained segmentation correctly resolves these.
### Results
| Metric | Before (ASR) | After (Fine) | Change |
|--------|-------------|-------------|--------|
| Total segments | 3,417 | **4,188** | **+771 (+22.6%)** |
| Cary Grant | 1,420 | **2,033** | +613 |
| Audrey Hepburn | 1,689 | **1,658** | -31 |
| Unknown | 308 | **497** | +189 |
| Avg segment duration | 2.0s | **1.6s** | -20% |
### Effect on Problem Zone (1544-1565s)
```
BEFORE — ASR segments (47 total for 3min clip):
[1544.0-1546.0] "Who's that with the hat?" → single speaker
[1546.0-1548.0] "That's the policeman." → single speaker
[1548.0-1550.0] "He wants to arrest Judy for Punch." → single speaker
[1550.0-1554.0] "What's she saying now?" → merged! multiple speakers
[1554.0-1557.5] "That she's innocent. She didn't do it." → merged
[1557.5-1560.7] "Oh, she did it all right." → merged
...
AFTER — Fine segments (64 total for 3min clip):
[1550.3-1551.0] "He wants to arrest Judy..." → Audrey Hepburn
[1552.7-1553.4] "What's she saying now?" → Audrey Hepburn
[1553.4-1554.2] "now? That" → Cary Grant
[1554.2-1559.3] "That she's innocent. She didn't..." → Cary Grant
[1559.3-1560.5] "Oh, she did it all right." → Audrey Hepburn
[1560.5-1561.6] "right. I" → Cary Grant
[1561.6-1562.8] "I believe her." → Cary Grant
```
12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups.
### Text Acquisition
Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested:
1. **Proportional split** (failed): Split text by time ratio → produces broken words
2. **Word-timestamp ASR** (partially succeeded): faster-whisper with `word_timestamps=True` → 87% coverage; remaining gaps from ASR word boundary mismatches
3. **Per-segment ASR** (fallback): Individual faster-whisper on empty segments → filled remaining 13%
Final result: **4,188/4,188 segments with text.**
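
The word-timestamp approach (item 2) can be sketched as a mapping from timed words to fine segments. The rule used here, assigning each word to the segment that contains its midpoint, is an assumption of this sketch; `assign_words` is a hypothetical name.

```python
def assign_words(words: list, segments: list) -> list:
    """Attach each timed word to the fine segment containing its midpoint.

    words:    [(start_s, end_s, text), ...] from a word-timestamp ASR pass
    segments: [(start_s, end_s), ...] fine segments after splitting
    Returns one text string per segment; empty strings mark gaps that a
    fallback pass would still need to fill.
    """
    texts = [[] for _ in segments]
    for w_start, w_end, text in words:
        midpoint = (w_start + w_end) / 2.0
        for i, (s_start, s_end) in enumerate(segments):
            if s_start <= midpoint < s_end:
                texts[i].append(text.strip())
                break
    return [" ".join(parts) for parts in texts]
```

Segments that come back empty are the ones a per-segment ASR pass (item 3) would then re-transcribe.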
### Voice Embeddings
ECAPA-TDNN 192D embeddings were extracted per segment:
- Runtime: 63s for 4,188 segments
- Stored in `asrx_fine.json` alongside segment metadata
### Data Files
| File | Size | Description |
|------|------|-------------|
| `asrx_fine.json` | ~45 MB | 4,188 fine segments + 4,188 embeddings |
| `asrx_fine.json → segments[].speaker_name` | — | Centroid-matched identity |
| `asrx_fine.json → segments[].speaker_id` | — | SPEAKER_0/1/2 |
| `asrx_fine.json → segments[].text` | — | ASR text (word-timestamp mapped) |
| `asrx_fine.json → embeddings[]` | — | 192D ECAPA-TDNN per segment |
### Continued Limitations
1. **Word boundary alignment**: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic)
2. **ASR merge in silence zones**: Very short utterances (<0.5s) merged into adjacent segments
3. **Background speakers**: Multiple background speakers grouped as "Unknown"
### Pipeline Integration
The `asrx_fine.json` file serves as the new ASRX output. The original `asr.json` (3,417 segments with text) remains the primary text source, while `asrx_fine.json` provides superior speaker diarization at 4,188 segments.
Speaker assignments in DB `dev.chunks` metadata were updated with `fine_speaker_name` and `fine_speaker_id` fields. Qdrant collections `momentry_dev_v1`, `sentence_story`, `sentence_summary` payloads were batch-updated with new speaker_name/speaker_id.
### Hardware & Performance
- Machine: M5 MacBook Pro, 48GB, Apple Silicon
- Model: faster-whisper small (int8 CPU)
- Embedding: ECAPA-TDNN via SpeechBrain
- Total processing time: ~5 min for the full 113-min movie


@@ -0,0 +1,602 @@
# Momentry Core — Detector Registry
**Date**: 2026-05-13
**Version**: 1.0
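**Purpose**: Consolidates the coordinate conventions, transform chains, and verification status of every model/algorithm detector
---
## Principles
1. **One entry per detector**: record each detector's input/output format, coordinate origin, units, and transform formula independently.
2. **Label the native coordinate system**: never hide a transform; any output that is not Top-Left pixel must be stated explicitly.
3. **Traceable transform chain**: record every transform step from the detector's raw output to the stored field.
4. **Three verification levels**: `verified` (tested) / `assumed` (inferred from documentation, not measured) / `buggy` (known to be wrong).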
---
## Category Overview
| Category | Count | Active | Experimental | Deprecated |
|----------|:----:|:------:|:----------:|:--------:|
| face | 8 | 2 | 4 | 2 |
| body | 3 | 1 | 2 | 0 |
| object | 4 | 1 | 3 | 0 |
| text | 3 | 1 | 2 | 0 |
| speech | 3 | 2 | 1 | 0 |
| scene | 2 | 1 | 0 | 1 |
| stamps | 2 | 0 | 2 | 0 |
| **Total** | **25** | **8** | **14** | **3** |

| Status | Definition |
|:------:|------|
| **Active** | Runs in the production pipeline, is registered in `ProcessorType`, and its output is consumed |
| **Experimental** | Standalone script or CLI, not connected to the pipeline; under evaluation or kept as a backup |
| **Deprecated** | Abandoned after evaluation, or superseded by a newer version but not yet removed from the codebase |
---
## Pipeline Status Quick-Reference
| # | Detector ID | Short Name | Pipeline Status | Reason |
|---|-------------|-----------|:-----:|--------|
| 1 | DET-CUT-001 | PySceneDetect | active | CUT processor |
| 2 | DET-SCN-001 | Places365 | **active but rejected** ⚠️ | M5 eval rejected; never removed from ProcessorType |
| 3 | DET-ASR-001 | faster-whisper | active | ASR processor |
| 4 | DET-SPCH-003 | ECAPA-TDNN | active | ASRX speaker embedding |
| 5 | DET-OBJ-001 | YOLOv8s | active | YOLO processor (v5nu→v8s, 2026-05-13) |
| 6 | DET-TEXT-001 | swift_ocr | active | OCR processor (primary) |
| 7 | DET-FACE-001/002/003 | swift_face + FaceNet | active | Face processor |
| 8 | DET-BODY-001/002 | swift_pose + YOLOv8-pose | active | Pose processor (primary + fallback) |
| 9 | DET-FACE-006 | AgglomerativeClustering | active | Identity Agent (post-processing) |
| 10 | DET-TEXT-005 | llama.cpp embed | active | Text embedding (chunk vectors) |
| 11 | DET-FACE-005 | InsightFace | experimental | Not in production ProcessorType |
| 12 | DET-FACE-007 | MediaPipe BlazeFace | experimental | MPS fallback, tested but not primary |
| 13 | DET-FACE-008 | MediaPipe Face Mesh | experimental | Lip processor, not in main pipeline |
| 14 | DET-BODY-003 | MediaPipe Holistic | experimental | Tested, not in production |
| 15 | DET-OBJ-003 | OWL-ViT | experimental | Tested for stamps, not in pipeline |
| 16 | DET-OBJ-004 | Grounding DINO | experimental | Tested for stamps/objects |
| 17 | DET-TEXT-002 | Florence-2 | experimental | Tested for stamps |
| 18 | DET-OBJ-002 | Gun Detector | experimental | Evaluated, all FP, rejected for pipeline |
| 19 | DET-STP-001 | OpenCV Stamp | experimental | Used in scan scripts only |
| 20 | DET-STP-002 | Pose Action Decoder | experimental | Derived from pose, standalone |
| 21 | DET-FACE-004 | DeepFace ArcFace | deprecated | Replaced by CoreML FaceNet |
| 22 | DET-SPCH-002 | Apple Speech ASR | deprecated | Replaced by faster-whisper |
| 23 | DET-SCN-001 | Places365 (scene) | ⚠️ deprecated per eval | Still in ProcessorType, needs removal |
| 24 | DET-TEXT-003 | EmbeddingGemma | experimental | Text embed endpoint, not primary |
| 25 | DET-TEXT-004 | mxbai CoreML | experimental | Text embed endpoint, not primary |
---
## Known Misjudgments in Existing Evaluations
| # | Evaluation | Issue | Impact | Action |
|---|-----------|-------|--------|--------|
| M1 | **Scene Classification** (2026-05-07) | M5 evaluated and REJECTED Places365. But it was never removed from `ProcessorType::all()`. Still runs on every file. | Wastes ~2min per registration. Produces meaningless scene.json. | Remove from pipeline or re-evaluate |
| M2 | **Face Processor** benchmark (2026-04-28) | Compared InsightFace vs MediaPipe vs OpenCV vs Contract v1. But the final pipeline uses **swift_face + FaceNet**, a completely different solution not in the benchmark. | Selection criteria from benchmark don't apply to actual pipeline detector. | Document the actual selection decision for swift_face |
| M3 | **Gun Detector** (2026-05-07) | Properly rejected: 7/7 FP. Correct decision. Model files still in repo. | No impact (correctly excluded). Clean up model files. | Archive or remove `models/gun/` |
| M4 | **OCR processor** | No selection document exists. swift_ocr chosen without comparison against EasyOCR/PaddleOCR. | Unknown if optimal. PaddleOCR fallback may never trigger. | Document selection decision |
---
### Technical Classification (with vs. without spatial coordinates)
| Category | Count | Spatial coordinates | Embedding only | Time/text only |
|----------|:----:|:--------:|:----------:|:--------:|
| face | 8 | 5 | 3 | — |
| body | 3 | 3 | — | — |
| object | 4 | 4 | — | — |
| text | 3 | 1 | 2 | — |
| speech | 3 | — | 2 | 1 |
| scene | 2 | — | 1 | 1 |
| stamps | 2 | 2 | — | — |
| **Total** | **25** | **15** | **8** | **2** |
---
## Face Detectors
### DET-FACE-001 — Face Bbox (Apple Vision)
| Field | Value |
|-------|-------|
| **Framework** | Apple Vision |
| **Model** | `VNDetectFaceRectanglesRequest` |
| **Input** | `CVPixelBuffer` (BGRA, via CGImage) |
| **Output** | bbox: `x, y, width, height` |
| **Coordinate** | Input: normalized [0-1], origin **bottom-left** |
| **Transform** | `x = bb.origin.x * imgW` |
| | `y = (1.0 - bb.origin.y - bb.size.height) * imgH` |
| **Image size** | `cgImage.width / cgImage.height` |
| **Target** | Top-Left pixel integer |
| **File** | `scripts/swift_processors/swift_face.swift:134-136` |
| **Status** | ✅ verified (2026-05-13, landmark QC + visual check) |
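
The transform in the table can be restated in Python for quick sanity checks. This is a sketch, not a port of `swift_face.swift`: the function name is hypothetical, and rounding to the nearest integer is an assumption (the table only says the target is a pixel integer).

```python
def vision_bbox_to_top_left(origin_x, origin_y, width, height, img_w, img_h):
    """Convert a normalized bottom-left-origin box to top-left pixel integers."""
    x = origin_x * img_w
    # Flip Y: in bottom-left space the box's top edge is at origin_y + height.
    y = (1.0 - origin_y - height) * img_h
    return (round(x), round(y), round(width * img_w), round(height * img_h))
```

For example, a box with origin (0.25, 0.5) and size (0.5, 0.25) on a 1920×1080 frame maps to `(480, 270, 960, 270)`.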
---
### DET-FACE-002 — Face Landmarks (Apple Vision)
| Field | Value |
|-------|-------|
| **Framework** | Apple Vision |
| **Model** | `VNDetectFaceLandmarksRequest` |
| **Input** | `CVPixelBuffer` (BGRA, via CGImage) |
| **Output** | landmarks: `left_eye (6pt)`, `right_eye (6pt)`, `nose (8pt)`, `outer_lips`, `inner_lips` |
| **Coordinate** | Input: `VNFaceLandmarks2D.pointsInImage(imageSize:)` |
| | Returned: macOS AppKit convention → **bottom-left** origin ⚠️ |
| **Transform** | `y_top_left = imgH - $0.y` (Y-flip) |
| **Image size** | `cgImage.width / cgImage.height` |
| **Target** | Top-Left pixel float → JSON |
| **Pairing** | Not by array index. Landmark observations used as primary source (self-consistent bbox + landmarks). Face rect observations deduplicated via IoU > 0.3. |
| **File** | `scripts/swift_processors/swift_face.swift:155-184` |
| **Status** | ✅ verified (2026-05-13, Y-flip fix, 100% landmark-in-bbox) |
| **Bugs fixed** | BUG-001: index-based pairing (landmarkObs[idx] ≠ faceObs[idx]) |
| | BUG-002: macOS bottom-left Y axis (missing Y-flip) |
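
The Y-flip and the IoU-based deduplication can be sketched as below. The function names are hypothetical and boxes are assumed to be `(x, y, width, height)` tuples in top-left pixel space; only the `imgH - y` flip and the 0.3 IoU threshold come from the table.

```python
def flip_landmarks(points, img_h):
    """Convert bottom-left-origin pixel points to top-left origin."""
    return [(x, img_h - y) for x, y in points]


def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def drop_duplicate_rects(landmark_boxes, rect_boxes, threshold=0.3):
    """Keep only face rects that do not overlap any landmark-derived box."""
    return [
        r for r in rect_boxes
        if all(iou(r, lb) <= threshold for lb in landmark_boxes)
    ]
```

Matching by overlap rather than by array index is what avoids the BUG-001 failure mode, where the two observation lists were not in the same order.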
---
### DET-FACE-003 — Face Embedding (CoreML FaceNet)
| Field | Value |
|-------|-------|
| **Framework** | CoreML (ANE-accelerated) |
| **Model** | `models/facenet512.mlpackage` |
| **Input** | Face crop 160×160, RGB, normalized `[-1, 1]` |
| **Output** | 512-dim float embedding |
| **Coordinate** | N/A (no spatial output). Bbox from DET-FACE-001 used for crop. |
| **File** | `scripts/face_processor.py`, `scripts/embed_faces.py`, `scripts/tmdb_embed_extractor.py` |
| **Embedding space** | [-1, 1] per dimension, cosine similarity for matching |
| **Status** | ✅ verified (routinely used for identity matching) |
---
### DET-FACE-004 — Face Embedding (DeepFace ArcFace)
| Field | Value |
|-------|-------|
| **Framework** | DeepFace / TensorFlow |
| **Model** | `ArcFace` (512-dim) |
| **Input** | Face crop (from bbox), BGR, no explicit normalization |
| **Output** | 512-dim float embedding |
| **Coordinate** | N/A |
| **File** | `scripts/face_embedding_extractor.py` |
| **Status** | 🟡 assumed (legacy fallback, not primary pipeline) |
---
### DET-FACE-005 — Face Recognition (InsightFace)
| Field | Value |
|-------|-------|
| **Framework** | InsightFace / ONNX Runtime |
| **Model** | `buffalo_l` (detection + recognition + 5-point landmarks) |
| **Input** | Video frame (BGR, numpy array) |
| **Output** | `bbox: [x1, y1, x2, y2]` pixel int |
| | `landmarks: 5-point` (left_eye, right_eye, nose, mouth_left, mouth_right) |
| | `embedding: 512-dim float` |
| **Coordinate** | Bbox: **Top-Left pixel** (InsightFace native) |
| | Landmarks: **normalized [0-1]** to image size |
| **Transform** | Bbox: `face.bbox.astype(int)` — direct |
| | Landmarks: `kps * imgW, kps * imgH` — needs manual conversion ⚠️ |
| **File** | `scripts/face_recognition_processor.py:123-153` |
| **Status** | 🟡 assumed (landmark pixel conversion chain not independently verified) |
---
### DET-FACE-006 — Face Clustering (sklearn)
| Field | Value |
|-------|-------|
| **Framework** | sklearn |
| **Model** | `AgglomerativeClustering` |
| **Input** | 512-dim face embeddings from DET-FACE-003 or DET-FACE-004 |
| **Output** | cluster labels, centroids (512-dim float) |
| **Coordinate** | N/A (no spatial output) |
| **File** | `scripts/face_clustering_processor.py`, `scripts/identity_bind.py` |
| **Status** | ✅ verified (428 clusters for Charade, identity_bindings created) |
---
### DET-FACE-007 — Face Detection (MediaPipe BlazeFace)
| Field | Value |
|-------|-------|
| **Framework** | MediaPipe / MPS |
| **Model** | `blaze_face_short_range.tflite` |
| **Input** | Frame (numpy array / MPS image) |
| **Output** | `bbox: [x, y, width, height]` pixel |
| | `6 keypoints`: eyes, nose tip, mouth center, ear tragions — **pixel** |
| **Coordinate** | **Top-Left pixel** (MediaPipe native) |
| **Transform** | Direct, no conversion needed |
| **File** | `scripts/face_processor_mps.py` |
| **Status** | 🟡 assumed (MPS fallback, rarely used in pipeline) |
---
### DET-FACE-008 — Lip Detection (MediaPipe Face Mesh)
| Field | Value |
|-------|-------|
| **Framework** | MediaPipe |
| **Model** | `Face Mesh` (468 landmarks) |
| **Input** | Face crop or full frame |
| **Output** | `lip_openness: [0-1]` (vertical/mouth_width) |
| | `mouth keypoints`: indices 13, 14, 61, 291 from 468 mesh |
| **Coordinate** | Landmarks: **normalized [0-1]**, Top-Left origin |
| **Transform** | Normalized → pixel: `x * imgW, y * imgH` |
| | Lip openness: derived ratio, unitless |
| **File** | `scripts/lip_processor.py` |
| **Status** | 🟡 assumed |
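
The openness ratio above can be sketched as follows. This is a minimal illustration, not the code in `scripts/lip_processor.py`: it assumes indices 13/14 are the inner upper/lower lip and 61/291 the mouth corners, and it assumes the ratio is clamped to 1.0 to match the stated `[0-1]` range. The function name `lip_openness` is hypothetical.

```python
import math


def lip_openness(landmarks, img_w, img_h):
    """Vertical lip gap divided by mouth width, clamped to [0, 1].

    landmarks: mapping or sequence of normalized (x, y) pairs, indexable by
    Face Mesh index (top-left origin, as listed in the table above).
    """
    def to_pixel(index):
        x, y = landmarks[index]
        return x * img_w, y * img_h

    upper, lower = to_pixel(13), to_pixel(14)   # assumed: inner lips
    left, right = to_pixel(61), to_pixel(291)   # assumed: mouth corners
    width = math.dist(left, right)
    if width == 0:
        return 0.0
    # Clamping is an assumption made to honor the [0-1] output range.
    return min(math.dist(upper, lower) / width, 1.0)
```

Converting to pixels before taking the ratio matters when the frame is not square, because normalized x and y use different scales.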
---
## Body Pose Detectors
### DET-BODY-001 — Body Pose (Apple Vision)
| Field | Value |
|-------|-------|
| **Framework** | Apple Vision |
| **Model** | `VNDetectHumanBodyPoseRequest` |
| **Input** | `CGImage` (from frame export or NSImage) |
| **Output** | `19 keypoints`: nose, eyes, ears, neck, root, shoulders, elbows, wrists, hips, knees, ankles |
| | `bbox: [x, y, width, height]` derived from keypoint min/max |
| **Coordinate** | Input: normalized [0-1], origin **bottom-left** |
| **Transform** (current) | ✅ `y = h - location.y * h` — Y-flip applied |
| **Transform** (correct) | `y = h - location.y * h` |
| **Image size** | `cgImage.width / cgImage.height` |
| **Target** | Top-Left pixel float |
| **File** | `scripts/swift_processors/swift_pose.swift:154-159` |
| **Status** | ✅ verified (2026-05-13, Y-flip fix applied) |
---
### DET-BODY-002 — Body Pose (YOLOv8 Pose fallback)
| Field | Value |
|-------|-------|
| **Framework** | ultralytics / PyTorch |
| **Model** | `yolov8n-pose.pt` |
| **Input** | Frame (PIL or numpy) |
| **Output** | `17 COCO keypoints`: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles |
| | `bbox: [x, y, width, height]` derived from keypoints (conf > 0.1) |
| **Coordinate** | **Top-Left pixel** (YOLO native, `.xy[0]` → numpy float) |
| **Transform** | Direct: `x, y = float(kps[j][0]), float(kps[j][1])` |
| | Bbox: `min(xs), min(ys), max(xs)-min(xs), max(ys)-min(ys)` |
| **File** | `scripts/pose_processor.py:78-97` |
| **Status** | ✅ top-left native |
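
The keypoint-derived bbox described in this entry can be sketched as below. It is an illustration under stated assumptions, not the code at `scripts/pose_processor.py:78-97`: keypoints are taken as `(x, y, conf)` tuples, and the name `bbox_from_keypoints` is hypothetical.

```python
def bbox_from_keypoints(keypoints, min_conf=0.1):
    """Derive [x, y, width, height] from keypoints whose confidence exceeds min_conf.

    keypoints: iterable of (x, y, conf) in top-left pixel coordinates.
    Returns None when no keypoint passes the confidence filter.
    """
    points = [(x, y) for x, y, conf in keypoints if conf > min_conf]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]
```

Filtering low-confidence keypoints first keeps a single stray joint from inflating the box.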
---
### DET-BODY-003 — Full Body (MediaPipe Holistic)
| Field | Value |
|-------|-------|
| **Framework** | MediaPipe |
| **Model** | `Holistic` (pose + face mesh + hands) |
| **Input** | Frame (BGR numpy) |
| **Output** | `468 face mesh`: `[[x, y, z], ...]` normalized [0-1] |
| | `33 body pose`: `[[x, y, z, visibility], ...]` normalized [0-1] |
| | `21 hand × 2`: `[[x, y, z], ...]` normalized [0-1] |
| **Coordinate** | **normalized [0-1]**, Top-Left origin |
| **Transform** | `x * imgW, y * imgH` → pixel (if needed) |
| | Z: depth relative, not metric |
| **File** | `scripts/mediapipe_holistic_processor.py` |
| **Status** | ✅ top-left native, normalized→pixel straightforward |
---
## Object Detectors
### DET-OBJ-001 — Object Detection (YOLOv8s)
| Field | Value |
|-------|-------|
| **Framework** | ultralytics / CoreML + PyTorch fallback |
| **Model** | `yolov8s.mlpackage` (primary, CoreML ANE), `yolov8s.pt` (fallback) |
| **mAP (COCO)** | 44.9 (was 34.3 with YOLOv5nu, +31%) |
| **Input** | Frame (PIL or numpy) |
| **Output** | `bbox: [x1, y1, x2, y2]` — float pixel |
| | `class_name, class_id` (80 COCO classes) |
| | `confidence: [0-1]` |
| **Coordinate** | **Top-Left pixel** (YOLO `.xyxy[0]` → float) |
| **Transform** | Rust: `x = detection.x1 as i32, y = detection.y1 as i32` (**int truncation**) |
| | `width = x2 - x1, height = y2 - y1` |
| **Image size** | YOLO auto-handles via ultralytics inference |
| **File** | `scripts/yolo_processor.py:272-285`, `src/core/processor/yolo.rs:83-117` |
| **Status** | ✅ verified (2026-05-13, replaced YOLOv5nu, +19% detections, scene indicators +162~+473%) |
| **Replaced** | YOLOv5nu (mAP 34.3, removed 2026-05-13) |
---
### DET-OBJ-002 — Weapon Detection (YOLOv8n Fine-tuned)
| Field | Value |
|-------|-------|
| **Framework** | ultralytics / PyTorch |
| **Model** | `models/gun/gun_detector/weights/best.pt` |
| **Input** | Frame (numpy array) |
| **Output** | `bbox: [x1, y1, x2, y2]` pixel |
| | `class: {0: grenade, 1: knife, 2: pistol, 3: rifle}` |
| **Coordinate** | **Top-Left pixel** (YOLO native) |
| **File** | `scripts/gun_detector_scan.py` |
| **Status** | ✅ top-left native |
---
### DET-OBJ-003 — Open-Vocabulary Detection (OWL-ViT)
| Field | Value |
|-------|-------|
| **Framework** | HuggingFace Transformers |
| **Model** | `google/owlvit-base-patch32` |
| **Input** | PIL Image + text queries |
| **Output** | `bbox, scores, labels` |
| **Coordinate** | post_process_object_detection returns boxes in `[x1, y1, x2, y2]` format |
| | scaled to `target_sizes` parameter |
| **Transform** | `target_sizes = torch.Tensor([image_pil.size[::-1]])` — PIL (w,h) → (h,w) |
| | `box.int().tolist()` or `box.tolist()` → Python list |
| **Format risk** | HuggingFace processor version may return `[cx, cy, w, h]` not `[x1,y1,x2,y2]` |
| **File** | `scripts/test_owl_vit_stamps.py:69-80`, `scripts/magnifying_glass_owl.py:65-77` |
| **Status** | 🟡 **assumed** (bbox format not independently verified with visual check) |
| **Verify** | Render bbox overlay on a known target image, confirm x1 < x2, y1 < y2 |
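
The ordering part of the verification step can be automated before doing the visual overlay. The sketch below is a necessary check only: a `[cx, cy, w, h]` box can still happen to satisfy it, so it does not replace the overlay. The function name `xyxy_order_violations` is hypothetical.

```python
def xyxy_order_violations(boxes):
    """Return indices of boxes that violate x1 < x2 and y1 < y2.

    boxes: iterable of 4-element sequences, assumed to be [x1, y1, x2, y2].
    A non-empty result suggests the boxes are in another format.
    """
    return [
        i for i, (x1, y1, x2, y2) in enumerate(boxes)
        if not (x1 < x2 and y1 < y2)
    ]
```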
---
### DET-OBJ-004 — Open-Vocabulary Detection (Grounding DINO)
| Field | Value |
|-------|-------|
| **Framework** | HuggingFace Transformers |
| **Model** | `IDEA-Research/grounding-dino-base` |
| **Input** | PIL Image + text prompts |
| **Output** | `boxes, labels, scores` |
| **Coordinate** | processor rescales to `target_sizes`, returns pixel boxes |
| **Transform** | `target_sizes=[img.size[::-1]]` — PIL (w,h) → (h,w) |
| | `[round(v, 1) for v in dets["boxes"][i].tolist()]` |
| **Format risk** | `[::-1]` order depends on processor expectations. If processor expects (w,h), axes swapped. |
| **File** | `scripts/gdino_frame_api.py:176-180` |
| **Status** | 🟡 **assumed** (rescale direction not independently verified) |
| **Verify** | Single-frame output: check bbox x range ≤ imgW, y range ≤ imgH |
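
The single-frame range check can be sketched as follows, assuming boxes are `[x1, y1, x2, y2]` in pixels. If many boxes exceed `imgW` on x but would fit within `imgH`, a swapped `target_sizes` order is a likely cause. The function name `out_of_range_boxes` is hypothetical.

```python
def out_of_range_boxes(boxes, img_w, img_h):
    """Return indices of [x1, y1, x2, y2] boxes that fall outside the image."""
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        x_ok = 0 <= x1 <= img_w and 0 <= x2 <= img_w
        y_ok = 0 <= y1 <= img_h and 0 <= y2 <= img_h
        if not (x_ok and y_ok):
            bad.append(i)
    return bad
```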
---
## Text / OCR Detectors
### DET-TEXT-001 — OCR (Apple Vision)
| Field | Value |
|-------|-------|
| **Framework** | Apple Vision |
| **Model** | `VNRecognizeTextRequest` (accurate/fast) |
| **Input** | `CVPixelBuffer` (via CGImage) |
| **Output** | `text: string`, `bbox: [x, y, w, h]`, `confidence: [0-1]` |
| **Coordinate** | Input: `VNRecognizedTextObservation.boundingBox` — normalized [0-1], origin **bottom-left** |
| **Transform** | ✅ `y = (1.0 - bb.origin.y - bb.size.height) * cgH` — Y-flip applied |
| **Image size** | Main loop: `cgImage.width / cgImage.height` ✅ |
| | `recognizeText()` helper: `CVPixelBufferGetWidth/Height` ✅ |
| **File** | `scripts/swift_processors/swift_ocr.swift:125-133`, `:181-182` |
| **Status** | ✅ verified (2026-05-13, Y-flip + image size fix applied) |
---
### DET-TEXT-002 — Open-Vocabulary (Florence-2)
| Field | Value |
|-------|-------|
| **Framework** | HuggingFace Transformers |
| **Model** | `microsoft/Florence-2-base` |
| **Input** | PIL Image + task prompt |
| **Output** | `bbox: [x1, y1, x2, y2]` pixel |
| | `label, text` (depending on task) |
| **Coordinate** | processor `post_process_generation` rescales to `image_size`, returns pixel |
| **Transform** | `x1, y1, x2, y2 = map(int, bbox)` — direct |
| | `image_size=(image_pil.width, image_pil.height)` — (w, h) order ✅ |
| **File** | `scripts/florence2_scan_stamps.py:67-79`, `scripts/test_florence2_direct.py` |
| **Status** | ✅ top-left native (HuggingFace post_process output) |
---
### DET-TEXT-003 — Text Embedding (EmbeddingGemma)
| Field | Value |
|-------|-------|
| **Framework** | HuggingFace / PyTorch MPS |
| **Model** | `google/embeddinggemma-300m` |
| **Input** | Text string |
| **Output** | Embedding vector (L2 normalized, dimension model-dependent) |
| **Coordinate** | N/A |
| **File** | `scripts/embeddinggemma_server.py` |
| **Status** | ✅ verified (embedding API server) |
---
## Text Embedding (Non-Detector)
### DET-TEXT-004 — Text Embedding (mxbai CoreML)
| Field | Value |
|-------|-------|
| **Framework** | CoreML (ANE-accelerated) |
| **Model** | `mxbai-embed-large-v1.mlpackage` |
| **Input** | Text tokenized |
| **Output** | Embedding vector |
| **Coordinate** | N/A |
| **File** | `scripts/coreml_embed_server.py` |
| **Status** | 🟡 assumed |
---
### DET-TEXT-005 — Text Embedding (Ollama / llama.cpp)
| Field | Value |
|-------|-------|
| **Framework** | llama.cpp / Ollama API |
| **Model** | llama.cpp embedding endpoint (port 11436) |
| **Input** | Text (optionally prefixed `search_document:`) |
| **Output** | 768-dim float embedding |
| **Coordinate** | N/A |
| **File** | `src/core/embedding/comic_embed.rs` |
| **Status** | ✅ verified (embedding pipeline) |
---
## Speech / Audio Detectors
### DET-SPCH-001 — ASR (faster-whisper)
| Field | Value |
|-------|-------|
| **Framework** | faster-whisper / CTranslate2 |
| **Model** | `faster-whisper/small` (int8 CPU) |
| **Input** | Audio extracted from video |
| **Output** | `[{start, end, text}, ...]` — temporal segments (seconds) |
| **Coordinate** | Temporal only (seconds), no spatial |
| **File** | `scripts/asr_processor.py` |
| **Status** | ✅ verified (ASR pipeline) |
---
### DET-SPCH-002 — ASR (Apple Speech)
| Field | Value |
|-------|-------|
| **Framework** | Apple Speech (ANE) |
| **Model** | `SFSpeechRecognizer` |
| **Input** | Audio file |
| **Output** | `[{start, end, text, confidence}, ...]` — temporal segments |
| **Coordinate** | Temporal only (seconds), no spatial |
| **File** | `scripts/swift_processors/asr_swift.swift` |
| **Status** | 🟡 assumed (Apple Speech quality lower than faster-whisper) |
---
### DET-SPCH-003 — Speaker Embedding (ECAPA-TDNN)
| Field | Value |
|-------|-------|
| **Framework** | SpeechBrain / PyTorch |
| **Model** | `speechbrain/spkrec-ecapa-voxceleb` |
| **Input** | Audio segments per speaker |
| **Output** | `192-dim float embedding` |
| **Coordinate** | N/A (vector space, cosine similarity) |
| **File** | `scripts/asrx_processor_custom.py`, `scripts/voice_embedding_extractor.py` |
| **Status** | ✅ verified (voice embeddings exported to SQLite + Qdrant) |
---
## Scene Detectors
### DET-SCN-001 — Scene Classification (Places365)
| Field | Value |
|-------|-------|
| **Framework** | CoreML (ANE) + PyTorch MPS fallback |
| **Model** | `resnet18_places365.mlpackage` |
| **Input** | Frame resized to 224×224 |
| **Output** | `[{scene_type, confidence, top_5}, ...]` — temporal segments |
| **Coordinate** | Temporal only, no spatial |
| **File** | `scripts/scene_classifier.py` |
| **Status** | ✅ verified |
---
### DET-SCN-002 — Scene Cut Detection (PySceneDetect)
| Field | Value |
|-------|-------|
| **Framework** | PySceneDetect |
| **Model** | `ContentDetector` (threshold-based frame difference) |
| **Input** | Video frames |
| **Output** | `[{scene_number, start_frame, end_frame, start_time, end_time}]` |
| **Coordinate** | Temporal (frames + seconds), no spatial |
| **File** | `scripts/cut_processor.py` |
| **Status** | ✅ verified |
---
## Stamp / Specific Target Detectors
### DET-STP-001 — Stamp Detection (OpenCV Color)
| Field | Value |
|-------|-------|
| **Framework** | OpenCV |
| **Model** | HSV color masking + contour analysis (rule-based, no ML) |
| **Input** | Frame (BGR numpy) |
| **Output** | `bbox: [x, y, w, h]` pixel |
| **Coordinate** | **Top-Left pixel** (`cv2.boundingRect()` native) |
| **Transform** | Direct, no conversion |
| **File** | `scripts/scan_full_video_stamps.py`, `scripts/find_blue_stamp_opencv.py` |
| **Status** | ✅ top-left native |
---
### DET-STP-002 — Pose Action Decoder (Coordinate-derived)
| Field | Value |
|-------|-------|
| **Framework** | Rule-based from keypoints |
| **Model** | N/A (derived from DET-BODY-001/002/003 keypoints) |
| **Input** | Pose keypoints (pixel) |
| **Output** | Action labels: turn_left, turn_right, look_up, look_down, shake_head, nod_head, blink, smile, etc. |
| **Coordinate** | Derived angles/ratios, no raw spatial output |
| **File** | `scripts/utils/pose_action_decoder.py`, `scripts/utils/integrated_body_action_decoder.py` |
| **Status** | 🟡 assumed (actions derived from pose keypoints; dependent on upstream keypoint correctness) |
| **Warning** | Depends on DET-BODY-001: action labels derived from Vision pose keypoints produced before the Y-flip fix (BUG-003/004, fixed 2026-05-13) are wrong |
---
## Known Bugs Summary
| Bug ID | Detector | Issue | Impact | Fixed |
|:------|----------|-------|--------|:-----:|
| BUG-001 | DET-FACE-001/002 | Index-based landmark↔face pairing | Wrong landmarks assigned to wrong faces | ✅ 2026-05-13 |
| BUG-002 | DET-FACE-002 | macOS bottom-left → missing Y-flip | Landmarks 731px offset from bbox | ✅ 2026-05-13 |
| BUG-003 | DET-BODY-001 | Missing Y-flip on keypoints | All 19 joint Y coordinates inverted | ✅ 2026-05-13 |
| BUG-004 | DET-BODY-001 | Derived bbox Y inverted | Bbox doesn't cover actual person | ✅ 2026-05-13 |
| BUG-005 | DET-TEXT-001 | Missing Y-flip on bbox | Text bbox Y inverted | ✅ 2026-05-13 |
| BUG-006 | DET-TEXT-001 | Hardcoded 640×360 in `recognizeText()` | Wrong bbox scale for non-640×360 images | ✅ 2026-05-13 |
---
## Coordinate Convention Quick Reference
### Apple Vision (all detectors)
| Item | Convention |
|------|-----------|
| boundingBox origin | Bottom-Left |
| boundingBox units | normalized [0-1] |
| pointsInImage Y axis | Bottom-Left (macOS AppKit) |
| Required Y-flip formula | bbox: `y = (1 - y_norm - h_norm) * imgH` |
| | points: `y = imgH - raw_y` |
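
Both Y-flip formulas in this table can be written out as a short sketch. The x formula follows the DET-FACE-001 transform; scaling width and height by the image size is an assumption, since the registry only lists the x and y formulas. The function names are hypothetical.

```python
def vision_bbox_to_top_left(x_norm, y_norm, w_norm, h_norm, img_w, img_h):
    """Normalized bottom-left-origin bbox -> top-left pixel (x, y, w, h)."""
    x = x_norm * img_w
    y = (1.0 - y_norm - h_norm) * img_h   # bbox Y-flip from the table
    # Assumption: width/height scale directly with the image size.
    return x, y, w_norm * img_w, h_norm * img_h


def vision_point_to_top_left(raw_x, raw_y, img_h):
    """Pixel point with bottom-left origin -> top-left origin."""
    return raw_x, img_h - raw_y           # point Y-flip from the table
```

The bbox formula subtracts the height as well because the bottom-left origin refers to the lower edge of the box, while the top-left target needs the upper edge.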
### Non-Vision Detectors
| Framework | Origin | Units |
|-----------|:------:|-------|
| YOLO (ultralytics) | Top-Left | pixel float |
| MediaPipe | Top-Left | normalized [0-1] |
| InsightFace bbox | Top-Left | pixel int |
| InsightFace landmarks | Top-Left | normalized [0-1] |
| HuggingFace (post_process) | Top-Left | pixel (after rescale) |
| OpenCV | Top-Left | pixel int |
---
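## Governance Rules
1. **Adding a detector**: it must be registered in this Registry, including its coordinate system, transform formula, and file location.
2. **Coordinate changes**: any change to a transform formula must be reflected in this document, with the change date noted.
3. **Verification requirement**: every detector with spatial coordinates must pass at least one visual check (bbox/keypoints overlaid on the original image).
4. **Cross-detector comparison**: for bboxes output by different detectors on the same frame, the IoU should be plausible (non-zero and not 1.0).
5. **Vision detector hard rule**: any detector that uses the Apple Vision framework must have its Y-flip confirmed as implemented.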
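
For rule 4, a minimal IoU helper might look like the sketch below. It assumes both boxes are already in the registry's target form, top-left pixel `[x, y, w, h]`; the name `iou_xywh` is hypothetical.

```python
def iou_xywh(a, b):
    """Intersection over Union for two top-left pixel boxes [x, y, w, h]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

An IoU of exactly 0 between two detectors that should see the same object often points to a missed Y-flip, which is the failure mode rule 5 guards against.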
---
## Maintenance
- **Owner**: M5
- **Update frequency**: whenever a processor is added or a coordinate transform is modified
- **Reference**: `SPATIAL_COORDINATE_REGISTRY.md` (parent coordinate system)

@@ -0,0 +1,238 @@
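# Momentry Core: Detector Selection Standard Operating Procedure (SOP)
**Date**: 2026-05-13
**Version**: 1.0
**Ref**: `DETECTOR_REGISTRY.md`, `SPATIAL_COORDINATE_REGISTRY.md`
---
## Purpose
This SOP governs how detectors (models/algorithms) are added, evaluated, selected, and registered, so that every detector entering the production pipeline has been fully verified.
---
## Selection Process (6 Phases)
```
Phase 1: Requirements → Phase 2: Candidate list → Phase 3: Benchmarking
→ Phase 4: Coordinate verification → Phase 5: Selection decision → Phase 6: Registration
```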
---
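## Phase 1: Requirements Definition
### 1.1 Output Specification
| Item | Required |
|------|:--:|
| Output type (bbox / landmarks / keypoints / embedding / label / text) | ✅ |
| Whether it has spatial coordinates | ✅ |
| Expected accuracy (e.g. IoU > 0.5 with ground truth) | ✅ |
| Expected speed (e.g. < 0.1s/frame on MPS) | ✅ |
| Expected memory (e.g. < 1GB) | ✅ |
| License restrictions (MIT / Apache / GPL / commercial) | ✅ |
### 1.2 Input Specification
| Item | Required |
|------|:--:|
| Input type (frame image / audio / text) | ✅ |
| Whether preprocessing is required (resize / crop / normalize) | ✅ |
| Required input size | ✅ |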
---
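## Phase 2: Candidate List
### 2.1 Collection Criteria
Collect at least **3 candidates**, covering different technical approaches:
| Technical approach | Examples |
|---------|------|
| Apple Vision (ANE) | swift_face, swift_pose, swift_ocr |
| PyTorch / CoreML | YOLOv5n, FaceNet, ResNet18 |
| HuggingFace Transformers | OWL-ViT, Florence-2, Grounding DINO |
| Traditional CV | OpenCV Haar, HSV masking |
| MediaPipe | BlazeFace, Holistic, Face Mesh |
### 2.2 Exclusion Criteria
A candidate is excluded without testing if any of the following holds:
- Incompatible license (GPL/AGPL is excluded when no commercial license is available)
- Known not to run on the target platform (e.g. CUDA-only on Mac)
- Unmaintained: no updates for more than 2 years (unless there is no alternative)
- Model size above 1GB (unless there is a strong reason)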
---
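## Phase 3: Benchmarking
### 3.1 Test Items (all mandatory)
| # | Test item | Method | Minimum threshold |
|---|---------|------|:--:|
| T1 | **Processing speed** | Sample 100 frames from the same video, measure wall time | Within ±20% of the fastest candidate |
| T2 | **Peak memory** | Monitor with `psutil`, record the process RSS peak | < 2GB |
| T3 | **Detection rate (recall)** | Compare against hand-labeled ground truth (≥50 frames), compute Precision/Recall | Recall > 0.6 |
| T4 | **False-alarm control (precision)** | Precision = TP / (TP + FP), from the same ground truth | Precision > 0.3 (task-dependent) |
| T5 | **Output completeness** | Check that the output JSON conforms to the schema | 100% of fields present |
| **T6** | **Coordinate normalization** | **Newly added, see Phase 4** | |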
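
The T3/T4 thresholds can be checked with a few lines once TP/FP/FN counts are available from the ground-truth comparison. This is a sketch under that assumption; how detections are matched to ground truth is not specified here, and the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); an empty denominator yields 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def passes_t3_t4(tp, fp, fn, min_recall=0.6, min_precision=0.3):
    """Apply the table's thresholds: Recall > 0.6 and Precision > 0.3."""
    precision, recall = precision_recall(tp, fp, fn)
    return recall > min_recall and precision > min_precision
```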
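### 3.2 Benchmark Script Conventions
Each candidate group must produce:
```
output/benchmark/{category}/
├── BENCHMARK_REPORT.md # human-readable report
├── BENCHMARK_REPORT.json # machine-readable results
└── {scheme}_{detector}.json # raw output of each candidate
```
Use the existing `*_benchmark_runner.py` template, or refer to `scripts/compare_*.py`.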
---
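## Phase 4: Coordinate Normalization Verification (T6), newly mandatory
### 4.1 Why It Is Mandatory
All 6 coordinate bugs found so far trace back to **coordinates not being verified at selection time**:
| Bug | Detector | Problem |
|-----|----------|------|
| BUG-001 | face landmarks | Incorrect index-based pairing |
| BUG-002 | face landmarks | Missing macOS Vision Y-flip |
| BUG-003 | body pose | Missing Y-flip |
| BUG-004 | body pose | Inverted bbox Y |
| BUG-005 | OCR text | Missing Y-flip |
| BUG-006 | OCR text | Hardcoded 640×360 image size |
> **Principle: for any detector that outputs spatial coordinates, coordinate verification is a prerequisite for selection; a detector that has not passed must not enter the pipeline.**
### 4.2 Verification Items
| # | Item | Method | Threshold |
|---|------|------|:--:|
| C1 | **Origin** | Consult the detector framework documentation and record the native coordinate system (BL/TL/Center) | Must be stated |
| C2 | **Axis direction** | As above; record the X/Y axis directions (right-positive / down-positive) | Must be stated |
| C3 | **Units** | Record the native output units (normalized [0-1] / pixel / other) | Must be stated |
| C4 | **Y-flip check** | For Apple Vision detectors, inspect the output Y values: if a face is in the upper half of the frame, bbox y should be < frame_height/2 | Must pass |
| C5 | **bbox↔landmark consistency** | For each detection, check that ≥50% of landmark points fall inside the bbox | ≥90% of faces pass |
| C6 | **bbox range check** | Confirm x ∈ [0, imgW], y ∈ [0, imgH], w > 0, h > 0 | 100% |
| C7 | **Cross-detector alignment** | For bboxes from different detectors on the same frame, the IoU should be plausible (confidence-weighted) | - |
| C8 | **Documented transform chain** | Write out the full E→P→A coordinate transform formulas, including the image-size source at every step | Must be completed |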
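### 4.3 Verification Script
Follow the pattern of `scripts/face_landmark_qc.py` (extensible to other categories):
```python
# For each frame:
# 1. Read the detector output
# 2. Check x ∈ [0, imgW], y ∈ [0, imgH]
# 3. If landmarks exist: check that ≥50% are inside the bbox
# 4. Write a pass/fail report
```
Once complete, mark the detector `verified` in `DETECTOR_REGISTRY.md`.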
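
A runnable sketch of checks C5 and C6 follows. It is not the contents of `scripts/face_landmark_qc.py`: the detection layout (a dict with `bbox` as top-left pixel `[x, y, w, h]` and optional `landmarks` as `(x, y)` pairs) and all function names are assumptions made for illustration.

```python
def bbox_in_range(bbox, img_w, img_h):
    """C6: x in [0, imgW], y in [0, imgH], w > 0, h > 0."""
    x, y, w, h = bbox
    return 0 <= x <= img_w and 0 <= y <= img_h and w > 0 and h > 0


def landmarks_inside_ratio(bbox, landmarks):
    """Fraction of landmark points that fall inside the bbox."""
    x, y, w, h = bbox
    inside = sum(1 for lx, ly in landmarks
                 if x <= lx <= x + w and y <= ly <= y + h)
    return inside / len(landmarks)


def qc_detection(det, img_w, img_h, min_inside=0.5):
    """Return True when one detection passes C6 and, if it has landmarks, C5."""
    if not bbox_in_range(det["bbox"], img_w, img_h):
        return False
    landmarks = det.get("landmarks")
    if landmarks:
        return landmarks_inside_ratio(det["bbox"], landmarks) >= min_inside
    return True
```

A per-frame report can then be built by mapping `qc_detection` over every detection and counting the failures against the C5 and C6 thresholds.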
---
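## Phase 5: Selection Decision
### 5.1 Scoring Matrix
| Weight | Dimension | Scoring method |
|:---:|------|---------|
| 30% | Quality (precision/recall/accuracy) | vs ground truth |
| 25% | Speed (throughput) | ms/frame, lower is better |
| 15% | Coordinate correctness (C1-C8) | All pass = full marks |
| 15% | Memory | MB peak, lower is better |
| 10% | Maintainability (license, dependencies, update frequency) | Subjective score |
| 5% | Output richness (extra information such as pose/age/gender) | Bonus |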
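
The weighted total can be computed as in the sketch below. It assumes each dimension has already been normalized to a 0-1 score where higher is better (so speed and memory must be inverted beforehand); the normalization itself is not defined by this SOP, and the dictionary keys are hypothetical names.

```python
# Weights from the scoring matrix; keys are illustrative names.
WEIGHTS = {
    "quality": 0.30,
    "speed": 0.25,
    "coordinates": 0.15,
    "memory": 0.15,
    "maintainability": 0.10,
    "richness": 0.05,
}


def total_score(scores):
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```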
### 5.2 Decision Record
Every decision must be documented in the following format:
```markdown
# {Category} Detector Selection Decision
**Date**: YYYY-MM-DD
**Decision maker**: {name}
**Selected**: {detector_id}
**Rejected**: {list every candidate and the reason it was rejected}
## Evaluation Data
| Candidate | Quality | Speed | Coordinates | Memory | Total |
|------|------|------|------|--------|------|
| A | | | | | |
| B | | | | | |
## Coordinate Verification
| Candidate | C1-C3 | C4 | C5 | C6 | C7 | C8 | Pass |
|------|-------|----|----|----|----|----|:--:|
| A | | | | | | | |
| B | | | | | | | |
## Rationale
(1-2 paragraphs explaining why A was chosen over B)
```
Save to `docs_v1.0/decisions/{YYYY-MM-DD}_{category}_detector_selection.md`.
---
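## Phase 6: Registration and Governance
### 6.1 Registry Update
After selection, the following must be updated:
1. `DETECTOR_REGISTRY.md`: add the detector entry (if it does not exist yet) with status `verified`
2. `SPATIAL_COORDINATE_REGISTRY.md`: update the E-layer and P-layer calibration paths
3. In `src/worker/processor.rs` or the corresponding call site, add a comment noting the detector ID
### 6.2 Rollback Mechanism
If a deployed detector is found to have a serious problem (such as BUG-003/004), do the following:
1. Immediately mark it `buggy` in `DETECTOR_REGISTRY.md`
2. Fix it and rebuild
3. Update the calibration status in `SPATIAL_COORDINATE_REGISTRY.md`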
---
## Re-review Checklist for Existing Detectors
The following are all detectors currently active in the pipeline; each must be reviewed against this SOP:
| # | Detector | Current status | Coordinate verification | Selection document |
|---|----------|:------:|:--:|:--:|
| 1 | Cut (PySceneDetect) | active ✅ | N/A (no spatial coordinates) | ✅ |
| 2 | Scene (Places365) | **active but rejected in eval** ⚠️ | N/A | ❌ evaluation recommended dropping it, but it has not been removed |
| 3 | ASR (faster-whisper) | active ✅ | N/A | ✅ |
| 4 | ASRX (ECAPA-TDNN) | active ✅ | N/A | ✅ |
| 5 | YOLO (YOLOv5n) | active ✅ | TL native | ✅ |
| 6 | OCR (swift_ocr) | active ✅ | ✅ fixed | ❌ no selection document |
| 7 | Face (swift_face + FaceNet) | active ✅ | ✅ fixed | ❌ no selection document |
| 8 | Pose (swift_pose + YOLOv8-pose) | active ✅ | ✅ fixed | ❌ no selection document |
| 9 | VisualChunk | active ✅ | N/A (derived) | ❌ no selection document |
| 10 | Story (Gemma4) | active ✅ | N/A (LLM) | ❌ no selection document |
| 11 | TKG Builder | active ✅ | N/A (graph) | - |
| 12 | TMDB Matcher | active ✅ | N/A (cosine) | - |
| 13 | Identity Agent | active ✅ | N/A (clustering) | - |
| 14 | Embedding (llama.cpp) | active ✅ | N/A (vector) | ✅ |
---
## Maintenance
- **Owner**: M5
- **Update frequency**: whenever a detector is added
- **Audit**: review all active detectors once per quarter to confirm they still meet the quality bar

@@ -0,0 +1,187 @@
---
document_type: "reference_doc"
service: "MOMENTRY_CORE"
title: "Document Embedding Strategy - Parent-Child Chunks"
date: "2026-03-23"
version: "V1.0"
status: "active"
owner: "Warren"
created_by: "OpenCode"
tags:
- "embedding"
- "chunks"
- "strategy"
- "document"
ai_query_hints:
- "Look up the contents of Document Embedding Strategy - Parent-Child Chunks"
- "What is the main purpose of Document Embedding Strategy - Parent-Child Chunks?"
- "How do I operate or implement Document Embedding Strategy - Parent-Child Chunks?"
---
# Document Embedding Strategy - Parent-Child Chunks
| Item | Content |
|------|---------|
| Author | Warren |
| Created | 2026-03-23 |
| Document Version | V1.0 |
---
## Version History
| Version | Date | Purpose | Operator | Tool/Model |
|---------|------|---------|----------|------------|
| V1.0 | 2026-03-23 | Create document embedding strategy | Warren | OpenCode |
---
## Overview
Momentry uses a **parent-child chunk hierarchy** for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.
## Chunk Structure
### Parent Chunk
- **Purpose**: Summarize multiple child chunks with narrative description
- **Content**: High-level description of multiple scenes/segments
- **Example**:
```json
{
"chunk_id": "story_asr_0000",
"chunk_type": "story",
"text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
"child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}
```
### Child Chunk
- **Purpose**: Individual segments from ASR, scenes from CUT, etc.
- **Content**: Raw transcription or detection results
- **Example**:
```json
{
"chunk_id": "asr_0001",
"chunk_type": "sentence",
"text_content": "Hello world",
"parent_chunk_id": "story_asr_0000"
}
```
## Embedding Strategy
### For Vector Search
When embedding chunks for vector search, we combine **parent description + child content** to provide both context and detail.
#### Parent Chunk Embedding
```
embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
```
**Prefix**: `search_document:` (for documents in Qdrant)
**Example**:
```
search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.
```
#### Child Chunk Embedding
```
embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
```
**Prefix**: `search_document:`
**Example**:
```
search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
```
### For BM25 Text Search
BM25 operates on raw text with PostgreSQL full-text search.
- **Index**: `search_vector` (TSVECTOR) on `chunks.text_content`
- **Search**: Uses `ts_rank_cd()` for ranking
## Hybrid Search Ranking
Combined score = `(vector_score * 0.7) + (bm25_score * 0.3)`
### Why 0.7/0.3?
| Aspect | Vector | BM25 |
|--------|--------|------|
| Pros | Semantic similarity | Exact keyword match |
| Cons | May miss specific terms | No semantic understanding |
| Best for | Thematic queries | Fact lookup |
## Query Patterns
### Thematic Query ("What are the main themes?")
- Use higher `vector_weight` (0.8-0.9)
- Vector search finds semantically similar content
### Fact Lookup ("Who said X?")
- Use higher `bm25_weight` (0.5-0.7)
- BM25 finds exact matches
### Balanced ("Tell me about scene 5")
- Use default 0.7/0.3
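
The combined score and the per-query weight adjustment can be sketched as below. Two assumptions are made that this document does not state: both input scores are already normalized to a comparable 0-1 range (raw `ts_rank_cd()` values are not bounded that way), and the BM25 weight is taken as `1 - vector_weight`. The function name `hybrid_score` is hypothetical.

```python
def hybrid_score(vector_score, bm25_score, vector_weight=0.7):
    """Blend a vector similarity and a BM25 score into one ranking score.

    Assumes both scores are normalized to [0, 1]; the BM25 weight is
    derived as 1 - vector_weight (an assumption, not a documented rule).
    """
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be within [0, 1]")
    bm25_weight = 1.0 - vector_weight
    return vector_score * vector_weight + bm25_score * bm25_weight
```

With the default weights a pure vector hit scores 0.7 and a pure keyword hit 0.3; passing `vector_weight=0.4` corresponds to the fact-lookup pattern above.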
## Implementation
### Embedding Generation
```rust
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
match chunk.chunk_type {
ChunkType::Story => {
format!(
"Summary: {}\nChildren: {}",
chunk.text_content,
get_children_text(chunk)
)
}
_ => {
format!(
"[{}] {}\nParent: {}",
chunk.chunk_type.as_str(),
chunk.text_content,
parent_text.unwrap_or("N/A")
)
}
}
}
```
### Storage
- Parent chunks stored with their `child_chunk_ids`
- Child chunks reference `parent_chunk_id`
- Both stored in PostgreSQL with full-text index
- Vectors stored in Qdrant
## Example Flow
1. **Story Processing** generates parent-child hierarchy
2. **Embedding** creates vector for each chunk
3. **Storage** saves to PostgreSQL + Qdrant
4. **Search** retrieves using hybrid search
5. **Results** include both parent context and child details
## Best Practices
1. **Chunk Size**: 5 child chunks per parent (configurable)
2. **Text Length**: Keep embeddings under 512 tokens
3. **Parent Description**: Include temporal markers (timestamps)
4. **Child Content**: Preserve original transcription
## Future Enhancements
- [ ] GraphRAG integration for relationship traversal
- [ ] Cross-chunk entity linking
- [ ] Temporal graph building

@@ -0,0 +1,120 @@
# Face Pipeline: Detection → Clustering → Trace
**Date**: 2026-05-16
---
## Flow
```
Video Frames
┌─────────────────────────────┐
│ 0. Cut Detection            │ PySceneDetect
│    scene boundaries         │ → chunk (chunk_type='cut')
└─────────────────────────────┘
┌─────────────────────────────┐
│ 1. Face Detection           │ detect faces in every frame
│    confidence ≥ 0.5         │ → face_detections (cut_id = the cut it belongs to)
└─────────────────────────────┘
┌─────────────────────────────┐
│ 2. Face Clustering          │ embedding + IoU + distance
│    trace_id assignment      │ same person + same cut → same trace_id
│    per-file sequential      │ trace_id numbering continues across cuts (no reset)
└─────────────────────────────┘
┌─────────────────────────────┐
│ 3. Face Trace               │ continuous tracking across frames
│    per-file sequential      │ trace_id = 0, 1, 2, ...
│    scoped by cut            │ each trace lies entirely within one cut
└─────────────────────────────┘
┌─────────────────────────────┐
│ 4. Identity Binding         │ embedding matching
│    identity_id assignment   │ → known person / stranger
└─────────────────────────────┘
```
## scope
```sql
-- trace_id:    per-file sequential; unique on (file_uuid, trace_id)
-- cut_id:      chunk.id WHERE chunk_type='cut'; the scope of a trace
-- identity_id: global FK; spans cut / file
```
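## Constraints
| Constraint | Description |
|------|------|
| Unique | `(file_uuid, trace_id)` |
| Single cut | Each trace lies entirely within one cut (`0` traces span cuts) |
| Independent | `trace_id` and `identity_id` are independent: the former is an object trajectory, the latter distinguishes identities |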
## Data Volume per Stage
```
Stage | Count | Key
------------------------|-------------|----------------------
Raw faces | 262,021 | face_detections rows
After clustering | 6,892 | distinct trace_id
With identity | 147,602 | identity_id NOT NULL (2,035 identities)
Stranger (unbound) | 114,419 | identity_id IS NULL
```
## Trace Size Distribution
| Faces per trace | Trace count | Notes |
|:---------------:|:-----------:|------|
| 1 | 610 | Flashes by |
| 2-5 | 969 | Brief appearance |
| 6-20 | 1,541 | Fragment |
| 21-100 | 2,218 | Typical |
| 101+ | 1,554 | Main characters |
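## Clustering Method
Face Tracker (`scripts/face_tracker.py`) uses three measures to decide whether two detections are the same person:
1. **IoU (Intersection over Union)**: bbox overlap between consecutive frames
2. **Cosine distance**: face embedding similarity
3. **Euclidean distance**: distance between bbox centers
The three are combined into a single decision rule: `iou > 0.5 || (cosine < 0.3 && distance < 100px)`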
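
The decision rule can be expressed directly as a predicate. This is a sketch of the stated rule only, not the code in `scripts/face_tracker.py`; the function and parameter names are hypothetical, and the three inputs are assumed to be computed elsewhere.

```python
def same_person(iou, cosine_distance, center_distance_px,
                iou_thr=0.5, cosine_thr=0.3, distance_thr=100.0):
    """Rule: iou > 0.5 OR (cosine < 0.3 AND center distance < 100px)."""
    return iou > iou_thr or (
        cosine_distance < cosine_thr and center_distance_px < distance_thr
    )
```

High overlap alone is enough to link two detections, while the embedding match only counts when the boxes are also spatially close, which limits links between similar-looking faces far apart in the frame.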
## Trace Structure
```json
{
"trace_id": 2, // per-file sequential
"faces": [ // face_detections GROUP BY trace_id
{"face_id": "4587_0", "frame": 4587, "confidence": 0.92},
{"face_id": "4588_0", "frame": 4588, "confidence": 0.91},
...
],
"start_frame": 4587,
"end_frame": 4722,
"face_count": 46,
"identity_id": 101 // NULL = stranger
}
```
## API Queries
```bash
# List traces (with face_count and frame range)
POST /api/v1/file/:uuid/face_trace/sortby
# Faces within a trace (per frame, with optional interpolation)
GET /api/v1/file/:uuid/trace/:trace_id/faces
# Bind a trace to an identity
POST /api/v1/identity/:uuid/bind
```

@@ -0,0 +1,45 @@
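# Gun Detection Model: Charade Evaluation Report
**Date:** 2026-05-10
**Model:** YOLOv8n fine-tuned on Roboflow gun dataset (905 images)
**Classes:** grenade (0), knife (1), pistol (2), rifle (3)
**Weights:** `models/gun/gun_detector/weights/best.pt` (6MB)
## Training
- **Dataset**: 905 images, Roboflow CC BY 4.0
- **Validation mAP50**: 0.813
- **Problem**: the training data consists entirely of close-up gun shots, a completely different distribution from the medium and long shots in the film Charade
## Charade Test Results
### Systematic Scan (24 sample points, one every 300s)
| Time | Class | Confidence | Verdict |
|------|------|------|------|
| t=600s | pistol×2, rifle | 0.16-0.30 | ❌ FP |
| t=1200s | knife | 0.37 | ❌ FP |
| t=1800s | pistol | 0.19 | ❌ FP |
| t=2400s | knife | 0.18 | ❌ FP |
| t=3000s | pistol | 0.16 | ❌ FP |
| t=5400s | pistol×2 | 0.45, 0.17 | ❌ FP (stamp misclassified as a gun) |
| t=6600s | grenade | 0.22 | ❌ FP |
### Dense Scan (ASR trigger)
The gun detector was run around the timestamps where the ASR dialogue mentions "gun". It produced 5 pistol/gun hits (3188s / 5461s / 6309s / 6377s / 6479s) with confidence 0.300-0.387.
**Result: all false positives.** The training did not transfer: the model fails completely on the film's medium and long shots.
## Conclusions
1. Severe distribution mismatch between the training data and the inference scenes
2. 905 Roboflow close-ups → Charade's medium and long shots of hand-held or partially occluded guns → the model cannot generalize
3. Recommendation: collect real gun frames from films (200-500 frames from action-film clips) and retrain
4. Until then, gun search has to rely on ASR dialogue keyword matching plus manual confirmation
## Related Files
- `models/gun/gun_detector/weights/best.pt`: model weights (poor performance)
- `output_dev/gun_detections/`: detection screenshots (all FP)
- `scripts/object_search_agent.py`: integrated search agent (gun detector results are for reference only)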

@@ -0,0 +1,73 @@
# Gun Detector Scan Report — YOLOv8n on Charade (1963)
**Date:** 2026-05-10
**Model:** `models/gun/gun_detector/weights/best.pt`
**Base:** YOLOv8n fine-tuned on Roboflow gun dataset (905 images)
**Classes:** grenade, knife, pistol, rifle
**Scan script:** `scripts/gun_detector_scan.py`
## Scan Method
- **121 scan points**: 2 ASR "gun" mentions + 114 fixed intervals (60s) + 5 original hit timestamps
- **Per point**: scan ±30 frames at every 3rd frame = ~20 frames per point
- **Total frames processed**: ~2,420
- **Runtime**: ~2 min
## Results
| Class | Detections | Top Confidence |
|-------|-----------|---------------|
| pistol | **82** | 0.887 |
| rifle | 55 | 0.822 |
| grenade | 35 | 0.797 |
| knife | 38 | 0.810 |
| **Total** | **210** (after dedup) | — |
## Original 5 Pistol Timestamps
| Timestamp | Original | This Scan | Delta |
|-----------|----------|-----------|-------|
| 3188s (53:08) | pistol 0.387 | ✅ **0.474** | +22% |
| 5461s (91:01) | pistol 0.355 | ✅ **0.346** | -3% |
| 6309s (1:45:09) | pistol 0.374 | ❌ Not found | — |
| 6377s (1:46:17) | gun 0.316 | ✅ **0.757** | +140% |
| 6479s (1:47:59) | pistol 0.300 | ✅ **0.815** | +172% |
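
The Delta column is the relative change in confidence between the original hit and this scan. A small sketch of that arithmetic, assuming rounding to the nearest whole percent (the function name is hypothetical):

```python
def relative_change_percent(original, rescanned):
    """Percent change from the original confidence to the rescanned one."""
    return round((rescanned - original) / original * 100)
```

For example, `relative_change_percent(0.387, 0.474)` gives the +22% shown for 3188s. A higher confidence here only means the detector fires more strongly; it says nothing about whether the detection is a real gun.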
## Top Pistol Detections
| Time | Confidence | Image |
|------|-----------|-------|
| 84:00 (5040s) | **0.887** | `5040s_pistol_0.887.jpg` |
| 90:00 (5400s) | **0.816** | `5400s_pistol_0.816.jpg` |
| 108:00 (6480s) | **0.815** | `6480s_pistol_0.815.jpg` |
| 48:59 (2939s) | **0.805** | `2939s_pistol_0.805.jpg` |
| 53:07 (3187s) | **0.474** | `3187s_pistol_0.474.jpg` |
| 90:59 (5459s) | **0.346** | `5459s_pistol_0.346.jpg` |
## Analysis
### Model Performance
Compared to the original evaluation (May 7, 24 sample points, all FP):
- This scan found **significantly more detections** (210 vs 7)
- Confidence values are **much higher** (0.887 vs 0.45 max)
- 4/5 original pistol timestamps recovered
### Cautions
1. **Training data mismatch**: Model was trained on 905 close-up gun photos, NOT movie frames. High confidence ≠ real gun.
2. **Stamp false positive confirmed**: t=5400s (identified in original eval as stamp → pistol) continues to fire at 0.816
3. **Pattern suggests overconfidence**: Many detections at regular intervals (every 60s, same objects) suggest the model is detecting non-gun objects with high confidence
### Verified Findings
The original 5 pistol images from the gun_detections/ directory (3188s, 5461s, 6309s, 6377s, 6479s) were all produced by the same YOLOv8n model. The user previously stated that none of these have been confirmed as real guns.
## Files
| File | Description |
|------|-------------|
| `output_dev/gun_detections/gun_detections.json` | All 210 deduped detections |
| `output_dev/gun_detections/*.jpg` | Annotated screenshots (one per detection) |
| `scripts/gun_detector_scan.py` | Scan script (reproducible) |


@@ -0,0 +1,995 @@
---
document_type: "design"
service: "MOMENTRY_CORE"
title: "MarkBase 設計文件 V2.0"
date: "2026-05-14"
version: "V2.0"
status: "active"
owner: "M4"
created_by: "OpenCode"
tags:
- "markbase"
- "display-engine"
- "virtual-tree"
- "group-share"
- "storage-tier"
- "file-uuid"
- "sqlite"
- "design"
ai_query_hints:
- "查詢 MarkBase 設計文件 V2.0 的內容"
- "MarkBase 虛擬檔案樹如何設計"
- "MarkBase Group Share 怎麼實現"
- "MarkBase file_uuid 規則"
- "MarkBase 儲存層級 Hot Warm Cold 設計"
- "MarkBase 與 Momentry Core 整合方式"
- "MarkBase Display Mode trait 架構"
- "MarkBase 檔案操作 API 設計"
related_documents:
- "REFERENCE/MARKBASE_DESIGN_v1.0.0.md"
- "REFERENCE/file_uuid_spec.md"
- "REFERENCE/SPATIAL_COORDINATE_REGISTRY.md"
---
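# MarkBase Design Document V2.0
| Item | Details |
|------|------|
| Author | M4 / OpenCode |
| Created | 2026-05-14 |
| Document version | V2.0 |
---
## Version History
| Version | Date | Purpose | Operator | Tool / Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-05-12 | Initial design: Demo Display + Knowledge Graph | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Added the file tree, Group Share, storage tiers, tech stack, and file_uuid integration | M4 / OpenCode | DeepSeek V4 Pro |
---
## Overview
MarkBase is the Display Engine and file management platform of the Momentry ecosystem. Starting with V2.0, MarkBase is no longer only the presentation layer for the Demo Runner. It becomes a full platform with a virtual file tree, cross-user group sharing, multi-tier storage management, and a file operations API.
**Core design principles:**
| Principle | Description |
|------|------|
| Display layer first | The Demo Display feature is retained as the demo runner's fixed display window |
| Hierarchical files | A Virtual Tree lets users manage their own data structure |
| Tiered storage | Three storage tiers (Hot/Warm/Cold) let users control cost |
| Group collaboration | Group Share makes files readable and writable within a team |
| Per-user isolation | One user = one SQLite; databases are never shared between users |
---
## Key Terms
| Term | Definition |
|------|------|
| Virtual Tree | A logical file tree managed by the user; not a physical path |
| FileNode | A node in the virtual tree, carrying a label, aliases, icon, and colors |
| Display Mode | The file presentation chosen by the user (List / Tree / Small Icon / Large Icon) |
| Group Share | Cross-user group file sharing (Option A: Group SQLite) |
| Storage Tier | Three storage tiers (Hot / Warm / Cold) |
| file_uuid | A 32-character hexadecimal birth identifier for a file, computed by Momentry Core |
| Exit Record | The record retained when a file leaves MarkBase management |
| Mount | A physical storage mount point (NAS, external drive, LTO) |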
---
## 1. Architecture Overview
### 1.1 Modular Rust Design
```
markbase/
├── src/
│ ├── main.rs # CLI entry point
│ ├── server.rs # axum HTTP server (port 11438)
│ ├── display/ # Display engine (from V1.0)
│ │ ├── mod.rs
│ │ ├── render.rs # .md → HTML (pulldown-cmark)
│ │ ├── highlight.rs # syntax highlighting (syntect)
│ │ ├── mermaid.rs # Mermaid rendering
│ │ └── page.html # core HTML template
│ ├── filetree/ # Virtual file tree (NEW V2.0)
│ │ ├── mod.rs # FileTree struct, init_from_sqlite
│ │ ├── node.rs # FileNode struct
│ │ ├── mode.rs # DisplayMode trait
│ │ ├── modes/
│ │ │ ├── list.rs # list module (trait impl)
│ │ │ ├── tree.rs # tree module (trait impl, Phase 1)
│ │ │ ├── grid_sm.rs # small icon grid (trait impl)
│ │ │ └── grid_lg.rs # large icon grid (trait impl)
│ │ └── auto_layer.rs # auto-layer rules
│ ├── operations/ # File operations (NEW V2.0)
│ │ ├── mod.rs
│ │ ├── compress.rs # zip / tar
│ │ ├── transfer.rs # copy / move between tiers
│ │ ├── archive.rs # auto-archive logic
│ │ ├── restore.rs # restore from archive
│ │ ├── exit.rs # exit record management
│ │ └── registry.rs # file_registry table
│ ├── groups/ # Group share (NEW V2.0)
│ │ ├── mod.rs
│ │ ├── db.rs # Group SQLite create/open
│ │ ├── merge.rs # ATTACH + cross-DB merge
│ │ └── roles.rs # owner/editor/viewer
│ └── mount/ # Mount management (NEW V2.0)
│ ├── mod.rs
│ ├── tier.rs # Hot/Warm/Cold tier defs
│ └── history.rs # location_history table
```
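**DisplayMode trait design:**
```rust
use anyhow::Result; // assumed: the one-parameter `Result` alias used below
use async_trait::async_trait;
use serde_json::Value;

/// Common interface for display modes.
/// Each mode (List, Tree, Grid) implements this trait.
/// `FileTree`, `SortOption`, and `FilterOption` are defined elsewhere in the crate.
#[async_trait]
pub trait DisplayMode: Send + Sync {
    /// Mode name (used by the frontend).
    fn name(&self) -> &'static str;
    /// Convert a FileTree into the frontend data for this mode.
    fn render(&self, tree: &FileTree, user_id: &str) -> Result<Value>;
    /// Sort options supported by this mode.
    fn sort_options(&self) -> Vec<SortOption>;
    /// Filters supported by this mode.
    fn filter_options(&self) -> Vec<FilterOption>;
}
```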
### 1.2 One User = One SQLite
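```
data/
├── users/
│   ├── demo.sqlite       # virtual tree + operation log for user demo
│   ├── warren.sqlite     # virtual tree + operation log for user warren
│   └── alice.sqlite      # virtual tree + operation log for user alice
├── groups/
│   ├── groups.sqlite     # group registry (group_id → path)
│   ├── 1.sqlite          # shared data for group 1
│   └── 2.sqlite          # shared data for group 2
└── system.sqlite         # system-level data (mount points, global settings)
```
| Principle | Description |
|------|------|
| **User isolation** | Each user has an independent SQLite file (`user.sqlite`) |
| **Simple deployment** | No PostgreSQL server required; a single file is enough |
| **Easy backup** | Copy the `.sqlite` file |
| **Portable** | Carry it on a USB drive; works offline |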
### 1.3 Momentry Core Integration (A+B Hybrid Mode)
```
┌──────────────────────────────────────────────────────┐
│                       MarkBase                       │
│                                                      │
│  ┌──────────────────┐  ┌──────────────────────────┐  │
│  │ Mode A: Crate    │  │ Mode B: HTTP API         │  │
│  │ (momentry_core   │  │ (localhost:3003)         │  │
│  │  as dependency)  │  │                          │  │
│  │                  │  │ • file_uuid validation   │  │
│  │ • file_uuid calc │  │ • chunk queries          │  │
│  │ • embeddings     │  │ • identity queries       │  │
│  │ • local work     │  │ • trace data             │  │
│  └──────────────────┘  └──────────────────────────┘  │
│                                                      │
│  Selection strategy:                                 │
│  • Light computation → Crate mode (no server)        │
│  • Heavy queries / server ops → HTTP API (server up) │
└──────────────────────────────────────────────────────┘
```
| Operation | Mode | Reason |
|------|:----:|------|
| file_uuid computation / validation | Crate | Pure function; no server needed |
| SHA256 | Crate | Computed locally |
| Chunk query (by file_uuid) | HTTP | Needs PostgreSQL access |
| Identity query | HTTP | Needs PostgreSQL access |
| Trace data (time-series segments) | HTTP | Needs PostgreSQL access |
| Vector search (ANN) | HTTP | Needs the Qdrant server |
| Document conversion (soffice) | Crate/CLI | Processed locally |
---
## 2. Tech Stack
### 2.1 Crate Dependencies
| Crate | Purpose | License |
|-------|------|---------|
| axum 0.7 | HTTP server (port 11438) | MIT |
| tokio 1.0 | Async runtime | MIT |
| rusqlite 0.32 | SQLite client (bundled) | MIT |
| r2d2 / r2d2_sqlite | SQLite connection pool | MIT/Apache |
| serde / serde_json 1.0 | JSON serialization | MIT/Apache |
| sha2 0.10 | SHA256 (file_uuid validation) | MIT/Apache |
| notify 6.0 | File system monitoring (Hot tier) | CC0/MIT |
| zip 2.0 | ZIP compression | MIT |
| tar 0.4 | TAR packaging (LTO archiving) | MIT/Apache |
| walkdir 2.0 | Directory scanning | MIT/Unlicense |
| chrono 0.4 | Date and time | MIT/Apache |
| tracing 0.1 | Structured logging | MIT |
| pulldown-cmark | Markdown → HTML | MIT |
| syntect | Code syntax highlighting | MIT |
| anyhow / thiserror | Error handling | MIT/Apache |
| once_cell | Lazy initialization | MIT/Apache |
| async-trait | async trait support | MIT/Apache |
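### 2.2 SQLite Query Strategy
| Item | Decision |
|------|:--:|
| Crate | rusqlite (synchronous API) |
| Async wrapper | `tokio::task::spawn_blocking` |
| Connection pool | r2d2_sqlite |
| WAL mode | Enabled by default in MarkBase (SQLite itself defaults to rollback-journal mode) |
```rust
// Usage pattern inside an axum handler.
// `AppError` is assumed to be an application error type that implements
// `IntoResponse` and converts from both `anyhow::Error` and `JoinError`.
async fn get_tree(
    State(pool): State<DbPool>,
    Path(user_id): Path<String>,
) -> Result<Json<FileTree>, AppError> {
    let tree = tokio::task::spawn_blocking(move || {
        let conn = pool.get()?;
        let tree = FileTree::load(&conn, &user_id)?;
        Ok::<_, anyhow::Error>(tree)
    })
    .await??;
    Ok(Json(tree))
}
```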
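### 2.3 File System Monitoring
| Item | Decision |
|------|:--:|
| Crate | notify 6.0 (CC0/MIT) |
| Monitored scope | Hot tier only |
| Not monitored | Warm / Cold tiers (low change frequency) |
| Implementation | `notify::Watcher` + `mpsc::channel` → async stream |
### 2.4 Compression Engine
| Format | Crate | Purpose |
|------|-------|------|
| `.zip` | `zip` crate | General compression (user downloads, backups) |
| `.tar.gz` | `tar` + `flate2` crate | LTO archiving (Cold tier) |
No external CLI tools (ditto, hdiutil) are used for compression; everything is implemented with Rust crates.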
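### 2.5 File Transfer (Transfer Engine)
#### Dual-engine strategy
```
TransferEngine:
├── Direct mode (std::fs::copy)
│     For: small files (<50MB), fallback
│     Traits: no external dependencies, simple and reliable
└── Rsync mode (rsync CLI)
      For: large files (>=50MB), tier migration, NAS mirroring
      Traits: incremental transfer, resume, checksums
```
#### Automatic selection logic
```rust
use std::path::Path;
use std::process::Command;

/// The two transfer engines described above (declared here so the snippet compiles).
pub enum TransferMode {
    Direct,
    Rsync,
}

fn select_mode(file_path: &Path) -> TransferMode {
    // If the metadata cannot be read, the size is treated as 0 (Direct mode).
    let size = std::fs::metadata(file_path).map(|m| m.len()).unwrap_or(0);
    if size < 50 * 1024 * 1024 {
        // <50MB
        TransferMode::Direct
    } else if Command::new("rsync").arg("--version").output().is_ok() {
        TransferMode::Rsync
    } else {
        TransferMode::Direct // fall back when rsync is not installed
    }
}
```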
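#### When rsync applies
| Scenario | Tool | Reason |
|------|------|------|
| Copying a single small file (<50MB) | `std::fs::copy` | rsync protocol overhead outweighs the benefit |
| Large file migration (tier move) | **rsync** | Incremental transfer, resume, and checksums in one tool |
| Hot ↔ Warm on the same machine | **rsync** | Resume and checksum verification for large files; note that rsync disables delta transfer for local paths unless `--no-whole-file` is given |
| NAS ↔ NAS mirroring | **rsync** | `--delete` mirror mode |
| Packaging .zip/.tar.gz | `zip` / `tar` crate | rsync does not create archives |
| Writing LTO tape | `tar` crate | rsync cannot write to tape |
#### rsync CLI flags
| Flag | Purpose |
|------|------|
| `-a` | archive mode (preserves permissions and timestamps) |
| `-v` | verbose (shows progress) |
| `-P` | same as `--partial --progress` (resume + progress) |
| `-c` | checksum mode: compares file checksums instead of time/size (the digest is chosen by rsync, for example MD5 on rsync 3.0-3.1; it is not SHA256 by default) |
| `-n` | dry-run (preview before migrating) |
| `--delete` | mirror mode (for NAS sync) |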
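The flag table can be turned into a small command builder. This is a sketch only: `build_rsync_cmd` and `rsync_available` are hypothetical helper names, not part of MarkBase, and the example only assembles the argument list without running rsync.

```python
import shutil


def build_rsync_cmd(source: str, target: str, *, dry_run: bool = False,
                    mirror: bool = False) -> list:
    """Assemble an rsync argv list from the flags listed in the table."""
    cmd = ["rsync", "-a", "-v", "-P", "-c"]
    if dry_run:
        cmd.append("-n")        # preview only, nothing is copied
    if mirror:
        cmd.append("--delete")  # remove target files that no longer exist in source
    cmd += [source, target]
    return cmd


def rsync_available() -> bool:
    """True when an rsync executable is found on PATH."""
    return shutil.which("rsync") is not None


preview_cmd = build_rsync_cmd("hot/", "warm/", dry_run=True)
mirror_cmd = build_rsync_cmd("nas_a/", "nas_b/", mirror=True)
```

Passing the list to `subprocess.run` (rather than a shell string) avoids quoting problems with paths that contain spaces.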
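### 2.6 Group Share Cross-DB Queries
Uses SQLite `ATTACH DATABASE`:
```sql
ATTACH DATABASE '/path/to/groups/1.sqlite' AS g;
SELECT f.*, gm.role
FROM file_registry f
JOIN g.group_files gf ON f.file_uuid = gf.file_uuid
JOIN g.group_members gm ON gf.added_by = gm.user_id;
```
**Advantage:** once the group database is attached, the join is a single SQL statement; no extra merge logic is needed on the Rust side.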
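### 2.7 Async Strategy
```
axum handler (async)
├── Fast operations (await directly)
│   ├── serde_json serialization
│   ├── validation
│   └── in-memory operations
└── Blocking operations (spawn_blocking)
    ├── rusqlite queries
    ├── std::fs file operations
    ├── SHA256 computation
    └── compression / decompression
```
**Rule:** the axum handler itself is async; whenever it touches rusqlite or std::fs, the call is always wrapped in `tokio::task::spawn_blocking`.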
---
## 3. file_uuid Specification
### 3.1 Formula
```
file_uuid = SHA256(mac_address | birthday | physical_path_at_birth | filename)[0:32]
```
See `REFERENCE/file_uuid_spec.md` for the full specification.
### 3.2 Use in MarkBase
| Field | Source | Notes |
|------|------|------|
| file_uuid | Momentry Core | MarkBase does not recompute it; the value is reused as-is |
| Validation | `is_birth_uuid()` | Length 32, contains no `_` |
| Association | Primary key | `file_registry.file_uuid`, referenced by `file_nodes.file_uuid` |
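To make the formula in §3.1 concrete, here is a minimal sketch. It is not the Momentry Core implementation: it assumes the `|` in the formula is a literal separator, that fields are UTF-8 encoded, and that the result is the first 32 hex characters of the digest. The authoritative rules live in `REFERENCE/file_uuid_spec.md`, and the sample input values are made up.

```python
import hashlib


def compute_birth_uuid(mac_address: str, birthday: str,
                       physical_path_at_birth: str, filename: str) -> str:
    """Hash the four birth attributes and keep the first 32 hex characters."""
    joined = "|".join([mac_address, birthday, physical_path_at_birth, filename])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:32]


def is_birth_uuid(value: str) -> bool:
    """Length 32 and no underscore, per the table; the hex check is an extra assumption."""
    return (len(value) == 32
            and "_" not in value
            and all(ch in "0123456789abcdef" for ch in value))


# Hypothetical inputs, for illustration only.
sample = compute_birth_uuid("aa:bb:cc:dd:ee:ff", "2026-05-14T10:00:00Z",
                            "/Volumes/RAID5/projects", "clip.mp4")
```

Because the inputs are frozen at birth, calling the function again with the same four values always yields the same identifier, which is what lets the file_uuid survive later moves.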
### 3.3 Integration Flow
```
Momentry Core                      MarkBase
(file registration)                (import)
┌──────────┐                       ┌────────────┐
│ compute_ │                       │ INSERT     │
│ birth_   │──── file_uuid ───────▶│ INTO       │
│ uuid()   │      32 hex           │ file_      │
│          │                       │ registry   │
└──────────┘                       │ (file_uuid)│
                                   └────────────┘
```
---
## 4. Virtual File Tree
### 4.1 FileNode Structure
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FileNode {
    /// Unique node ID (UUIDv4)
    pub node_id: String,
    /// Display name
    pub label: String,
    /// Multilingual aliases
    pub aliases: Aliases,
    /// Associated file_uuid (sourced from Momentry Core)
    pub file_uuid: Option<String>,
    /// Parent node_id (None for the root)
    pub parent_id: Option<String>,
    /// Child node IDs
    pub children: Vec<String>,
    /// Node type
    pub node_type: NodeType,
    /// Custom icon (emoji or SVG path)
    pub icon: Option<String>,
    /// Text color (CSS hex)
    pub color: Option<String>,
    /// Background color (CSS hex)
    pub bg_color: Option<String>,
    /// Creation time
    pub created_at: String,
    /// Last modification time
    pub updated_at: String,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Aliases {
    /// Traditional Chinese
    pub zh_tw: Option<String>,
    /// English
    pub en_us: Option<String>,
    /// Japanese
    pub ja_jp: Option<String>,
    /// Korean
    pub ko_kr: Option<String>,
    /// French
    pub fr_fr: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum NodeType {
    /// Virtual folder (created by the user; no physical path)
    Folder,
    /// Physical file (points to a file_uuid)
    File,
    /// Dynamic layer (generated by auto-layer)
    DynamicLayer,
}
```
### 4.2 SQLite Schema (user.sqlite)
```sql
CREATE TABLE IF NOT EXISTS file_nodes (
node_id TEXT PRIMARY KEY,
label TEXT NOT NULL,
aliases_json TEXT NOT NULL DEFAULT '{}',
file_uuid TEXT,
parent_id TEXT,
children_json TEXT NOT NULL DEFAULT '[]',
node_type TEXT NOT NULL DEFAULT 'file',
icon TEXT,
color TEXT,
bg_color TEXT,
created_at TEXT NOT NULL DEFAULT (datetime('now')),
updated_at TEXT NOT NULL DEFAULT (datetime('now')),
sort_order INTEGER NOT NULL DEFAULT 0,
FOREIGN KEY (file_uuid) REFERENCES file_registry(file_uuid)
);
CREATE TABLE IF NOT EXISTS file_registry (
file_uuid TEXT PRIMARY KEY,
original_name TEXT NOT NULL,
file_size INTEGER,
file_type TEXT,
registered_at TEXT NOT NULL,
last_seen_at TEXT,
status TEXT NOT NULL DEFAULT 'active'
);
```
### 4.3 Display Modes
Users can switch between four display modes (stored in `localStorage.display_mode`):
| Mode | Enum value | Description | Module |
|------|--------|------|----------|
| **List** | `list` | List view: name, size, date | `modes/list.rs` |
| **Tree** | `tree` | Tree view: expand / collapse levels | `modes/tree.rs` (Phase 1) |
| **Small Icon** | `grid_sm` | Small icon grid: suited to thumbnails | `modes/grid_sm.rs` |
| **Large Icon** | `grid_lg` | Large icon grid: suited to video previews | `modes/grid_lg.rs` |
Each mode implements the `DisplayMode` trait (see §1.1).
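### 4.4 Multilingual Aliases
| Field | Language | Use |
|------|------|------|
| `zh_tw` | Traditional Chinese | Default language |
| `en_us` | English | International use |
| `ja_jp` | Japanese | Users in Japan |
| `ko_kr` | Korean | Users in Korea |
| `fr_fr` | French | Users in France / international |
After the user picks a language in the frontend, the system shows the matching alias automatically. If no alias exists for that language, it falls back to `label`.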
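The fallback rule can be sketched as a single lookup. The function name `display_label` and the dictionary shape are assumptions made for illustration; the real data comes from the `aliases_json` column.

```python
def display_label(node: dict, lang: str) -> str:
    """Return the alias for `lang`, or the node's label when no alias is set."""
    alias = (node.get("aliases") or {}).get(lang)
    return alias if alias else node["label"]


node = {"label": "trailer.mp4", "aliases": {"zh_tw": "預告片", "en_us": None}}
shown_zh = display_label(node, "zh_tw")
shown_en = display_label(node, "en_us")   # alias is None, so the label is used
shown_ja = display_label(node, "ja_jp")   # alias key missing, so the label is used
```

Treating a missing key, `None`, and an empty string the same way keeps the frontend from ever rendering a blank name.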
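### 4.5 Auto-Layer Rules
The system builds virtual levels for files automatically, based on preset rules:
| Rule | Condition | Level structure |
|------|------|----------|
| **by_type** | Same file extension | `Videos/`, `Images/`, `Documents/`, `Audio/`, `Other/` |
| **by_date** | By creation date | `2026/` → `2026/05/` → `2026/05/14/` |
| **by_size** | By file size | `<10MB`, `10–100MB`, `100MB–1GB`, `>1GB` |
Implemented in `auto_layer.rs`; generated levels are marked with `NodeType::DynamicLayer`.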
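The three rules reduce to simple bucketing functions. The sketch below is illustrative only: the extension-to-folder map and the exact boundary handling (for example, which bucket a file of exactly 10 MB falls into) are assumptions, not taken from `auto_layer.rs`.

```python
from datetime import date
from pathlib import PurePosixPath

# Assumed extension map; the real rule set may cover more types.
TYPE_FOLDERS = {
    ".mp4": "Videos", ".mov": "Videos",
    ".jpg": "Images", ".png": "Images",
    ".pdf": "Documents", ".md": "Documents",
    ".mp3": "Audio", ".wav": "Audio",
}
MB = 1024 * 1024


def layer_by_type(filename: str) -> str:
    ext = PurePosixPath(filename).suffix.lower()
    return TYPE_FOLDERS.get(ext, "Other") + "/"


def layer_by_date(created: date) -> str:
    return f"{created.year:04d}/{created.month:02d}/{created.day:02d}/"


def layer_by_size(size_bytes: int) -> str:
    if size_bytes < 10 * MB:
        return "<10MB"
    if size_bytes < 100 * MB:
        return "10-100MB"
    if size_bytes < 1024 * MB:
        return "100MB-1GB"
    return ">1GB"


type_layer = layer_by_type("Charade_1963.MP4")
date_layer = layer_by_date(date(2026, 5, 14))
size_layer = layer_by_size(50 * MB)
```

Each function returns a path fragment, so the results can be used directly as labels for `DynamicLayer` nodes.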
---
## 5. Group Sharing
### 5.1 Group SQLite Architecture (Option A)
```
data/groups/
├── groups.sqlite          # group registry (global)
│   └── groups(
│         group_id INTEGER PRIMARY KEY,
│         group_name TEXT,
│         db_path TEXT,      # points to 1.sqlite
│         created_by TEXT,   # creator's user_id
│         created_at TEXT
│       )
├── 1.sqlite               # shared data for group 1
└── 2.sqlite               # shared data for group 2
```
### 5.2 Group SQLite Schema
```sql
-- groups/1.sqlite
CREATE TABLE group_members (
user_id TEXT NOT NULL,
role TEXT NOT NULL DEFAULT 'viewer', -- owner / editor / viewer
joined_at TEXT NOT NULL DEFAULT (datetime('now')),
PRIMARY KEY (user_id)
);
CREATE TABLE group_files (
file_uuid TEXT NOT NULL,
added_by TEXT NOT NULL,
added_at TEXT NOT NULL DEFAULT (datetime('now')),
PRIMARY KEY (file_uuid),
FOREIGN KEY (added_by) REFERENCES group_members(user_id)
);
```
### 5.3 Cross-DB Query (ATTACH)
```rust
use rusqlite::{params, Connection};

pub fn get_group_files(conn: &Connection, group_id: i64) -> Result<Vec<GroupFile>> {
    let group_db = format!("/data/groups/{}.sqlite", group_id);
    // Bind the path as a parameter instead of formatting it into the SQL text.
    conn.execute("ATTACH DATABASE ?1 AS g", params![group_db])?;
    // Note: gm.role is the role of the member who added the file.
    let mut stmt = conn.prepare("
        SELECT f.file_uuid, f.original_name, gm.role
        FROM main.file_registry f
        JOIN g.group_files gf ON f.file_uuid = gf.file_uuid
        JOIN g.group_members gm ON gf.added_by = gm.user_id
    ")?;
    // ...
}
```
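The same ATTACH pattern can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under the schemas from §4.2 and §5.2 (trimmed to the columns the query needs); the helper names and sample rows are hypothetical.

```python
import os
import sqlite3
import tempfile


def list_group_files(user_db_path: str, group_db_path: str) -> list:
    """Join the user's file_registry with an attached group database."""
    conn = sqlite3.connect(user_db_path)
    try:
        # The group database path is bound as a parameter, not formatted into the SQL.
        conn.execute("ATTACH DATABASE ? AS g", (group_db_path,))
        return conn.execute(
            """
            SELECT f.file_uuid, f.original_name, gm.role
            FROM main.file_registry AS f
            JOIN g.group_files   AS gf ON f.file_uuid = gf.file_uuid
            JOIN g.group_members AS gm ON gf.added_by = gm.user_id
            ORDER BY f.original_name
            """
        ).fetchall()
    finally:
        conn.close()  # closing the connection also drops the attachment


def make_db(path: str, script: str) -> None:
    conn = sqlite3.connect(path)
    conn.executescript(script)
    conn.close()


workdir = tempfile.mkdtemp()
user_db = os.path.join(workdir, "demo.sqlite")
group_db = os.path.join(workdir, "1.sqlite")
shared_uuid = "a" * 32
private_uuid = "b" * 32

make_db(user_db, f"""
    CREATE TABLE file_registry (
        file_uuid TEXT PRIMARY KEY, original_name TEXT NOT NULL,
        registered_at TEXT NOT NULL);
    INSERT INTO file_registry VALUES ('{shared_uuid}', 'clip.mp4', datetime('now'));
    INSERT INTO file_registry VALUES ('{private_uuid}', 'private.mp4', datetime('now'));
""")
make_db(group_db, f"""
    CREATE TABLE group_members (
        user_id TEXT NOT NULL, role TEXT NOT NULL DEFAULT 'viewer',
        PRIMARY KEY (user_id));
    CREATE TABLE group_files (
        file_uuid TEXT NOT NULL, added_by TEXT NOT NULL,
        PRIMARY KEY (file_uuid));
    INSERT INTO group_members VALUES ('warren', 'owner');
    INSERT INTO group_files VALUES ('{shared_uuid}', 'warren');
""")

rows = list_group_files(user_db, group_db)
```

Only the file that appears in both databases is returned; the user's unshared file stays invisible to the group query.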
### 5.4 Role Permissions
| Role | Read | Write | Delete | Invite members |
|------|:----:|:----:|:----:|:----:|
| owner | ✅ | ✅ | ✅ | ✅ |
| editor | ✅ | ✅ | ❌ | ❌ |
| viewer | ✅ | ❌ | ❌ | ❌ |
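The permission matrix maps naturally onto a lookup table. The action names (`read`, `write`, `delete`, `invite`) and the `can` helper are illustrative choices, not identifiers from `roles.rs`.

```python
PERMISSIONS = {
    "owner": {"read", "write", "delete", "invite"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}


def can(role: str, action: str) -> bool:
    """True when the role grants the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())


editor_can_write = can("editor", "write")
editor_can_delete = can("editor", "delete")
```

Denying unknown roles by default means a typo in the `role` column fails closed rather than open.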
---
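## 6. Storage Tiers
### 6.1 The Three Tiers
| Tier | Symbol | Latency | Speed | Cost | Typical media |
|------|:----:|------|------|------|----------|
| **Hot** | 🔥 | <10ms | Fast | High | NVMe SSD / internal drive |
| **Warm** | 🌡️ | 10–500ms | Medium | Medium | NAS (network mount) |
| **Cold** | ❄️ | >1s | Slow | Low | LTO tape / external HDD |
### 6.2 Mount Point Configuration
Administrators can set the mount paths for each tier: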
```json
{
"tiers": {
"hot": ["/Users/accusys/sftpgo/data", "/Volumes/RAID5/projects"],
"warm": ["/Volumes/NAS_Archive"],
"cold": ["/Volumes/LTO_Archive"]
}
}
```
### 6.3 Auto-Archive Rules
Administrators can configure the conditions that trigger automatic archiving:
```json
{
"auto_archive": {
"enabled": true,
"rules": [
{
"condition": "idle_days > 90",
"action": "move_to_warm",
"schedule": "0 2 * * 0"
},
{
"condition": "idle_days > 365",
"action": "move_to_cold",
"schedule": "0 3 * * 0"
},
{
"condition": "tier_hot_usage > 80%",
"action": "move_oldest_to_warm",
"schedule": "0 * * * *"
}
]
}
}
```
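One way to read the rule list is as a priority-ordered check per file. The sketch below assumes the strictest idle threshold wins and that the capacity rule applies only when no idle rule matches. The configuration does not state a precedence, so this ordering is an assumption, and `pick_action` is a hypothetical name.

```python
from typing import Optional


def pick_action(idle_days: int, hot_usage_pct: float) -> Optional[str]:
    """Return the archive action for a Hot-tier file, or None to leave it in place."""
    if idle_days > 365:
        return "move_to_cold"
    if idle_days > 90:
        return "move_to_warm"
    if hot_usage_pct > 80:
        return "move_oldest_to_warm"
    return None


action_old = pick_action(idle_days=400, hot_usage_pct=50.0)
action_idle = pick_action(idle_days=120, hot_usage_pct=50.0)
action_full = pick_action(idle_days=10, hot_usage_pct=85.0)
action_none = pick_action(idle_days=10, hot_usage_pct=50.0)
```

In the configuration each rule also carries its own cron `schedule`, so in practice the rules may be evaluated at different times rather than in a single pass.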
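### 6.4 file_uuid and Tier Migration
The file_uuid **does not change during migration**. When a file moves from Hot to Cold:
1. Copy the file to the Cold tier path
2. Verify integrity (SHA256)
3. Write the new location to `location_history`
4. Remove the original file from the Hot tier
5. Update `file_registry.last_seen_at`
The file_uuid is always derived from the `physical_path_at_birth` recorded at birth (the Hot path), so it does not change when the file migrates.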
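The five steps can be sketched with the standard library alone. This is an illustration, not the Transfer Engine: it uses a plain file copy instead of rsync or tar, trims the tables to the columns from §4.2 and the `location_history` schema, and the `migrate` helper and demo values are hypothetical.

```python
import hashlib
import os
import shutil
import sqlite3
import tempfile


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA256 so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def migrate(conn, file_uuid, source, target, tier, delete_source=True):
    shutil.copy2(source, target)                    # 1. copy to the new tier
    if sha256_of(source) != sha256_of(target):      # 2. verify integrity
        os.remove(target)
        raise IOError(f"checksum mismatch for {file_uuid}")
    conn.execute(                                   # 3. record the new location
        "INSERT INTO location_history (file_uuid, location, tier, reason, verified) "
        "VALUES (?, ?, ?, 'migration', 1)",
        (file_uuid, target, tier),
    )
    if delete_source:                               # 4. remove the original
        os.remove(source)
    conn.execute(                                   # 5. refresh last_seen_at
        "UPDATE file_registry SET last_seen_at = datetime('now') WHERE file_uuid = ?",
        (file_uuid,),
    )
    conn.commit()


# Demo with throwaway files and an in-memory database.
workdir = tempfile.mkdtemp()
hot_path = os.path.join(workdir, "hot.bin")
cold_path = os.path.join(workdir, "cold.bin")
with open(hot_path, "wb") as handle:
    handle.write(b"example payload")

demo_uuid = "0123456789abcdef0123456789abcdef"
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_registry (
        file_uuid TEXT PRIMARY KEY, original_name TEXT NOT NULL,
        registered_at TEXT NOT NULL, last_seen_at TEXT);
    CREATE TABLE location_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT, file_uuid TEXT NOT NULL,
        location TEXT NOT NULL, tier TEXT NOT NULL,
        moved_at TEXT NOT NULL DEFAULT (datetime('now')),
        reason TEXT, moved_by TEXT, verified INTEGER DEFAULT 0);
""")
conn.execute(
    "INSERT INTO file_registry VALUES (?, 'hot.bin', datetime('now'), NULL)",
    (demo_uuid,),
)
migrate(conn, demo_uuid, hot_path, cold_path, "cold")
current = conn.execute(
    "SELECT location, tier, verified FROM location_history "
    "WHERE file_uuid = ? ORDER BY moved_at DESC, id DESC LIMIT 1",
    (demo_uuid,),
).fetchone()
```

The source is deleted only after the checksum matches and the new location is recorded, so a failure at any earlier step leaves the original file untouched. The extra `id DESC` tie-break is a small safeguard for moves recorded within the same second.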
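### 6.5 AI Agent: On-Demand Data Movement
The AI Agent manages data movement automatically at the lower level, so users do not need to know which tier a file is actually stored on.
#### Architecture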
```
User / Scheduler
┌─────────────────────────────────┐
│ AI Agent │
│ • Monitor tier usage │
│ • Detect hot/cold patterns │
│ • Trigger auto-archive │
│ • Restore on access (prefetch) │
└──────────┬──────────────────────┘
┌─────────────────────────────────┐
│ Transfer Engine │
│ Direct (std::fs::copy) │
│ Rsync (delta + checksum) │
│ S3 / SFS / NFS / CDN │
└──────────┬──────────────────────┘
┌─────────────────────────────────┐
│ file_locations │
│ (single source of truth) │
│ M2 M4 M5 Cloud LTO │
└─────────────────────────────────┘
```
#### Auto-archive rules
| Trigger | Action | Transfer Engine |
|----------|------|:--:|
| `idle_days > 90` | move to Warm | Rsync + checksum verify |
| `idle_days > 365` | move to Cold | Tar + checksum verify |
| `hot_tier_usage > 80%` | move oldest to Warm | Rsync `--progress` |
| user accesses cold file | restore to Hot | Rsync prefetch |
#### Example flow
```
1. The AI Agent detects that Charade_1963.mp4 has been idle for 120 days
2. rsync -avP --checksum → /Volumes/NAS_Archive/
3. POST /api/v2/files/aeed7134.../locations
   {"location": "/Volumes/NAS_Archive/Charade_1963.mp4",
    "label": "M4-warm"}
4. Remove the Hot tier location (or keep it for reference)
5. The user queries the file info → sees all tiers without needing to know the actual location
```
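#### Design principles
| Principle | Description |
|------|------|
| Transparent migration | Users querying `file_locations` always get a consistent view |
| Stable identifier | `file_uuid` does not change during migration |
| Location tracking | `file_locations` is updated after every migration; old locations can optionally be kept as history |
| Integrity verification | Checksums are verified after migration (rsync `--checksum` or a manual SHA256 comparison) |
| Memory-hierarchy analogy | The Agent acts as the memory controller (Hot = cache, Warm = main memory, Cold = disk) |
```
User queries a file → always sees a consistent view (single source of truth: file_locations)
Transfer Engine (rsync / Direct / S3 / SFS / CDN)
AI Agent (monitors tier usage, detects hot/cold patterns, auto-archives, prefetches)
Storage Tiers (M2 Hot → M4 Warm → M5 Cold → LTO)
```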
```sql
CREATE TABLE IF NOT EXISTS location_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
file_uuid TEXT NOT NULL,
location TEXT NOT NULL, -- actual file path
tier TEXT NOT NULL, -- hot / warm / cold
moved_at TEXT NOT NULL DEFAULT (datetime('now')),
reason TEXT,
moved_by TEXT,
verified INTEGER DEFAULT 0, -- 1 once the integrity check has passed
FOREIGN KEY (file_uuid) REFERENCES file_registry(file_uuid)
);
CREATE INDEX idx_location_history_file_uuid ON location_history(file_uuid);
```
Query the current location:
```sql
SELECT location, tier
FROM location_history
WHERE file_uuid = ?
ORDER BY moved_at DESC
LIMIT 1;
```
---
## 7. File Operations API
### 7.1 Operations Overview
| Operation | API | Description |
|------|-----|------|
| **Compress** | `POST /api/v2/files/compress` | Compress to .zip or .tar.gz |
| **Transfer** | `POST /api/v2/files/transfer` | Copy or move to the target tier |
| **Archive** | `POST /api/v2/files/archive` | Archive to the Cold tier |
| **Restore** | `POST /api/v2/files/restore` | Restore from the Cold tier to the Hot tier |
| **Exit** | `POST /api/v2/files/exit` | Remove from MarkBase (the record is kept) |
### 7.2 Compress
```jsonc
// Compress request
{
  "file_uuids": ["uuid1", "uuid2"],
  "format": "zip", // "zip" | "tar.gz"
  "output_path": "/path/to/output.zip"
}
// Compress response
{
  "status": "completed",
  "output_path": "/path/to/output.zip",
  "file_count": 2,
  "compressed_size": 1048576
}
```
### 7.3 Transfer (Tier Migration)
#### Request / response
```jsonc
// Transfer request
{
  "file_uuids": ["uuid1"],
  "target_tier": "cold",
  "target_path": "/Volumes/LTO_Archive/2026/",
  "delete_source": false
}
// Transfer response
{
  "status": "completed",
  "file_uuid": "uuid1",
  "new_location": "/Volumes/LTO_Archive/2026/uuid1.mp4",
  "new_tier": "cold"
}
```
#### Transfer Engine implementation flow
```
TransferEngine::execute(source, target, opts)
├── 1. select_mode(source)
│   ├── size < 50MB  ──→ DirectMode
│   └── size >= 50MB ──→ RsyncMode (fallback: DirectMode)
├── 2. preflight (RsyncMode)
│   ├── rsync -an --checksum source/ target/
│   └── return the change list for user confirmation
├── 3. transfer
│   ├── DirectMode: std::fs::copy + progress callback
│   └── RsyncMode: rsync -avP --checksum source target
│       ├── -a  archive mode
│       ├── -v  verbose (progress)
│       ├── -P  --partial (resume) + --progress
│       └── -c  checksum mode (compares rsync file checksums instead of time/size)
├── 4. verify (RsyncMode)
│   └── rsync -acn source target (dry-run checksum; output should be empty)
├── 5. update location_history
│   └── INSERT INTO location_history (file_uuid, location, tier, ...)
└── 6. cleanup
    └── if delete_source: remove source file
```
#### Rsync vs Direct selection
| Condition | Mode | Reason |
|------|:----:|------|
| `file_size < 50 MB` | Direct | rsync overhead outweighs the benefit |
| `file_size >= 50 MB` and rsync is available | Rsync | Incremental transfer, resume, checksums |
| `file_size >= 50 MB` and rsync is missing | Direct | Graceful fallback |
### 7.4 Archive / Restore
Archive is a convenience wrapper for Transfer to the Cold tier.
Restore is a convenience wrapper for restoring from the Cold tier to the Hot tier.
```jsonc
// Restore request
{
  "file_uuid": "uuid1",
  "target_path": "/Users/demo/restored/" // optional; defaults to the original birth path
}
// Restore response
{
  "status": "completed",
  "file_uuid": "uuid1",
  "restored_to": "/Users/demo/restored/uuid1.mp4"
}
```
### 7.5 Exit Records
When a file leaves MarkBase management, a record is kept for auditing:
```sql
CREATE TABLE IF NOT EXISTS exit_records (
id INTEGER PRIMARY KEY AUTOINCREMENT,
file_uuid TEXT NOT NULL,
original_name TEXT NOT NULL,
exited_at TEXT NOT NULL DEFAULT (datetime('now')),
exited_by TEXT NOT NULL,
reason TEXT,
last_location TEXT,
FOREIGN KEY (file_uuid) REFERENCES file_registry(file_uuid)
);
```
```jsonc
// Exit request
{
  "file_uuid": "uuid1",
  "reason": "Project completed, moved to long-term archive"
}
// Exit response
{
  "status": "completed",
  "file_uuid": "uuid1",
  "exited_at": "2026-05-14T10:00:00Z"
}
```
---
## 8. API Reference
### 8.1 Tree API
| Method | Path | Description |
|------|------|------|
| `GET` | `/api/v2/tree/:user_id` | Get the user's full virtual tree |
| `GET` | `/api/v2/tree/:user_id?mode=list` | Get the tree in a specific display mode |
| `POST` | `/api/v2/tree/:user_id/node` | Create a node |
| `PUT` | `/api/v2/tree/:user_id/node/:node_id` | Update a node (label, icon, color, aliases) |
| `DELETE` | `/api/v2/tree/:user_id/node/:node_id` | Delete a node |
| `PUT` | `/api/v2/tree/:user_id/node/:node_id/move` | Move a node (change its parent) |
| `PATCH` | `/api/v2/tree/:user_id/node/:node_id/alias` | Update the alias for a specific language |
### 8.2 File API
| Method | Path | Description |
|------|------|------|
| `GET` | `/api/v2/files/:file_uuid` | Get file information |
| `POST` | `/api/v2/files/compress` | Compress files |
| `POST` | `/api/v2/files/transfer` | Transfer files to the target tier |
| `POST` | `/api/v2/files/archive` | Archive to the Cold tier |
| `POST` | `/api/v2/files/restore` | Restore from the Cold tier |
| `POST` | `/api/v2/files/exit` | Remove from management |
| `GET` | `/api/v2/files/:file_uuid/locations` | Query the location history |
| `POST` | `/api/v2/files/validate` | Verify file integrity (SHA256) |
### 8.3 Mount API
| Method | Path | Description |
|------|------|------|
| `GET` | `/api/v2/mounts` | List all mount points |
| `POST` | `/api/v2/mounts` | Register a new mount point |
| `PUT` | `/api/v2/mounts/:mount_id` | Update a mount point |
| `DELETE` | `/api/v2/mounts/:mount_id` | Remove a mount point |
| `GET` | `/api/v2/mounts/:mount_id/status` | Query mount point status (online state, capacity) |
### 8.4 Group API
| Method | Path | Description |
|------|------|------|
| `GET` | `/api/v2/groups` | List all groups |
| `POST` | `/api/v2/groups` | Create a group |
| `DELETE` | `/api/v2/groups/:group_id` | Delete a group |
| `POST` | `/api/v2/groups/:group_id/members` | Invite a member |
| `DELETE` | `/api/v2/groups/:group_id/members/:user_id` | Remove a member |
| `PUT` | `/api/v2/groups/:group_id/members/:user_id/role` | Change a member's role |
| `POST` | `/api/v2/groups/:group_id/files` | Share a file with the group |
| `DELETE` | `/api/v2/groups/:group_id/files/:file_uuid` | Remove a file from the group |
| `GET` | `/api/v2/groups/:group_id/files` | List the group's files |
---
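## 9. Decision Log
| # | Date | Decision | Rationale |
|---|------|------|------|
| 1 | 2026-05-13 | Rust modular architecture (DisplayMode trait) | Same ecosystem as Momentry Core; modularity eases extension |
| 2 | 2026-05-13 | One user = one SQLite | User isolation, simple deployment, portable files |
| 3 | 2026-05-13 | Group Share → Option A (Group SQLite) | Self-contained and portable, no dedicated server, simple backup |
| 4 | 2026-05-13 | Three storage tiers (Hot/Warm/Cold) | Real-world file management needs, combining LTO/NAS/SSD |
| 5 | 2026-05-13 | Auto-archive rules (admin-configurable) | Less manual management; triggered by idle days and tier capacity |
| 6 | 2026-05-14 | file_uuid is inherited from Momentry Core and never recomputed | Single source; avoids inconsistency |
| 7 | 2026-05-14 | file_uuid does not change on tier migration | Frozen at birth to keep identity stable |
| 8 | 2026-05-14 | Display mode is stored in localStorage | Pure UI preference; no backend storage needed |
| 9 | 2026-05-14 | File operations are API-first | UI is added after the backend logic is done (compress, transfer, archive) |
| 10 | 2026-05-14 | Exit records (records are retained) | Audit requirement; records are not deleted outright |
| 11 | 2026-05-14 | rusqlite (sync) + spawn_blocking (async wrapper) | Avoids forcing the whole stack to be async; keeps things simple |
| 12 | 2026-05-14 | ATTACH DATABASE for Group Share cross-DB queries | A single SQL join; no merge logic on the Rust side |
| 13 | 2026-05-14 | notify crate (Hot tier only) | Lower resource use; Warm/Cold change rarely |
| 14 | 2026-05-14 | zip + tar crates (no external CLI) | Cross-platform; no need for ditto/hdiutil |
| 15 | 2026-05-14 | Momentry Core integration in A+B hybrid mode | Crate for light computation, HTTP API for heavy queries |
| 16 | 2026-05-14 | AI Agent on-demand data movement | Transparent migration, memory-hierarchy analogy, automatic hot/cold management |
| 17 | 2026-05-14 | file_locations supports arbitrary URIs | /path, s3://, sfs://, ipfs://, https://, \\SMB\path |
---
## 10. Version History
| Version | Date | Purpose | Operator | Tool / Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-05-12 | Initial design: Demo Display + Knowledge Graph | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Virtual file tree, Group Share, storage tiers, tech stack, file_uuid, file operations API, AI Agent on-demand data movement, cross-platform multi-location | M4 / OpenCode | DeepSeek V4 Pro |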


@@ -0,0 +1,730 @@
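# MarkBase: Momentry's Dedicated Display Engine, Design Proposal v1.0
## Product Positioning
**MarkBase** is Momentry's dedicated Display Engine and serves as the **fixed display for the demo runner**.
It is more than a Markdown reader: it is a controllable content presentation window that can dynamically show:
| Content type | Presentation |
|----------|----------|
| .md documents | Rendered as cleanly typeset HTML |
| Mermaid diagrams | Flowcharts, sequence diagrams, ER diagrams, and so on |
| API response JSON | Formatted JSON with syntax highlighting |
| Video | Embedded video player (HLS / MP4) |
| Images | Single image or carousel |
| HTML | Embedded directly |
| Text / code | Syntax highlighting |
**Positioning in one sentence:** *The demo runner's presentation layer: a focused, clean, controllable content display.*
| Aspect | Description |
|------|------|
| Vision | The UI output terminal of the Momentry ecosystem |
| Core scenario | The demo runner's fixed display window |
| Platform | macOS native (Rust + axum + Tauri WebView) |
| Licensing | Momentry-only tool, shipped with momentry_core |
---
## Naming
**MarkBase** = Markdown + Display Base
> The display base that carries every content type.
> Short, memorable, product-like.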
> 簡短、好記、產品感。
---
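## Phase Plan
### Phase 0: Demo Display (MVP, immediate value)
**Goal**: replace md_reader plus video playback and become the demo runner's fixed display window
| Feature | Description |
|------|------|
| Document rendering | CommonMark + GFM (tables, task lists, strikethrough, footnotes) |
| Mermaid diagrams | Built-in rendering (no CDN); supports flowchart / sequence / class / ER / mindmap |
| Code highlighting | Syntax highlighting (50+ languages) |
| JSON formatting | API responses are formatted and highlighted automatically |
| Video playback | Embedded MP4 / HLS playback (replaces opening the trace video in a browser) |
| Fullscreen mode | Clean, distraction-free display suited to presentations |
| CLI control | Load content dynamically over stdin / HTTP without restarting |
| Demo runner integration | The `--display` flag starts MarkBase as the fixed display window |
#### Demo Runner integration flow
```
demo_runner.py --display              MarkBase.app (fixed display window)
┌────────────────────┐          ┌─────────────────────────┐
│ Step 3: Markdown   │ ──HTTP──▶│ render GUIDE.md         │
│ Step 11: Trace 5   │ ──HTTP──▶│ play trace_5.mp4        │
│ Step 13: 3D Cube   │ ──HTTP──▶│ show iframe: portal     │
│ Step 22: API resp  │ ──HTTP──▶│ show formatted JSON     │
└────────────────────┘          └─────────────────────────┘
      (controller)                       (display)
```
The demo runner starts MarkBase as its display window with `--display`, then pushes content over HTTP at each step:
```python
# demo_runner.py example: step type -> request sent to MarkBase
# step_type == "markdown" -> POST /display {"type":"md","file":"GUIDE.md"}
# step_type == "video"    -> POST /display {"type":"video","url":"trace_5.mp4"}
# step_type == "curl"     -> POST /display {"type":"json","data":response}
# step_type == "browser"  -> POST /display {"type":"url","url":"..."}
```
### Phase 2: Knowledge Base
**Goal**: grow from a reader into a personal knowledge base manager
| Feature | Description |
|------|------|
| Multi-document index | Watch a directory and index every .md automatically |
| Full-text search | Fuzzy search across documents + heading index |
| Tag management | YAML frontmatter tags → tag cloud |
| Backlinks | Bidirectional links between documents (`[[wiki-link]]`) |
| Favorites / bookmarks | Mark frequently used documents |
| Reading history | Recently opened / recently searched |
### Phase 3: Collaboration
**Goal**: multi-user collaboration and publishing
| Feature | Description |
|------|------|
| Comments / annotations | Paragraph-level notes |
| Version history | git-based diff view |
| Static site generation | .md → full HTML site (for publishing) |
| Web version | Readable in a browser (optionally self-hosted) |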
---
## CLI Design (for Portal / Demo use)
### Main commands
```
markbase display             ← start the display window (blocking; waits for HTTP control)
markbase display "GUIDE.md"  ← start and immediately show a document
markbase preview "GUIDE.md"  ← (retained) one-off preview; does not hand back control
markbase render "GUIDE.md"   ← (retained) write HTML to stdout
```
### display: the core command (used by the demo runner)
```bash
# Start the display window; the demo runner controls it over HTTP
markbase display
# Set the control port (default 11438)
markbase display --port 11438
# Fullscreen mode
markbase display --fullscreen
# Show a document on startup
markbase display GUIDE.md
```
### HTTP control API (enabled in display mode)
Once started, `markbase display` listens for control requests on `localhost:11438`:
```bash
# Show a .md document
curl -X POST http://localhost:11438/display \
-H "Content-Type: application/json" \
-d '{"type":"md","file":"/path/to/doc.md","focus":"API 搜尋"}'
# Play a video
curl -X POST http://localhost:11438/display \
-H "Content-Type: application/json" \
-d '{"type":"video","url":"/path/to/trace.mp4","start":10,"end":30}'
# Show formatted JSON
curl -X POST http://localhost:11438/display \
-H "Content-Type: application/json" \
-d '{"type":"json","data":"{\"status\":\"ok\"}"}'
# Embed a web page
curl -X POST http://localhost:11438/display \
-H "Content-Type: application/json" \
-d '{"type":"url","url":"http://localhost:1420/trace-viz/..."}'
# Show an image
curl -X POST http://localhost:11438/display \
-H "Content-Type: application/json" \
-d '{"type":"image","url":"/path/to/thumbnail.jpg"}'
# Control commands
curl -X POST http://localhost:11438/control \
-H "Content-Type: application/json" \
-d '{"cmd":"fullscreen"}'
curl -X POST http://localhost:11438/control \
-H "Content-Type: application/json" \
-d '{"cmd":"zoom","level":1.5}'
curl -X POST http://localhost:11438/control \
-H "Content-Type: application/json" \
-d '{"cmd":"close"}'
```
### demo_runner.py integration
```python
import subprocess
import time

import requests


class MarkBaseDisplay:
    """Control the MarkBase display window."""

    def __init__(self, port=11438):
        self.port = port
        self.process = None

    def start(self):
        # Further Popen options were elided in the original design note.
        self.process = subprocess.Popen(
            ["markbase", "display", "--port", str(self.port)]
        )
        time.sleep(1)  # wait for the server to start

    def show(self, type, **kwargs):
        """Show content. type: md/video/json/url/image"""
        body = {"type": type, **kwargs}
        requests.post(f"http://localhost:{self.port}/display", json=body)

    def show_step(self, step):
        """Pick the display method from the demo step type."""
        t = step["type"]
        if t == "curl":
            # run_curl() is assumed to be defined elsewhere in demo_runner.py
            self.show("json", data=run_curl(step["cmd"]))
        elif t == "browser":
            self.show("url", url=step["url"])
        elif t == "markdown":
            self.show("md", file=step["cmd"], focus=step.get("focus"))
        elif t == "video":
            self.show("video", url=step.get("url"))
```
---
## 技術架構
```
┌─────────────────────────────────────────┐
│ MarkBase App │
├─────────────────┬───────────────────────┤
│ Frontend │ Engine │
│ (SwiftUI) │ (Rust core) │
│ │ │
│ • 視窗管理 │ • 解析 .md → AST │
│ • 選單、快捷鍵 │ • Mermaid 渲染 │
│ • 設定介面 │ • Code highlight │
│ • 搜尋 UI │ • 全文索引 │
│ • 目錄樹 │ • 文件監控 │
└─────────────────┴───────────────────────┘
│ │
▼ ▼
macOS Native API Rust 二進制
(WebKit + Swift) (pulldown-cmark + syntect + mermaid-rs)
```
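### Why the Engine Uses Rust
| Reason | Description |
|------|------|
| Performance | Large .md documents (1000+ lines) render instantly |
| No runtime | A single binary with no Node.js/Python dependency |
| Existing base | The rendering logic from md_reader can be reused directly |
| Embedded Mermaid | The mermaid-rs crate can replace the CDN |
### Why the Frontend Uses SwiftUI
| Reason | Description |
|------|------|
| Native experience | macOS native windows, menu bar, shortcuts |
| WebKit integration | Embeds WKWebView directly to render HTML |
| System integration | Spotlight, QuickLook, sharing |
| Performance | Uses 200MB+ less memory than Electron |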
---
## UI 設計
### 主視窗佈局
```
┌────────────────────────────────────────────────┐
│ Menu Bar: File Edit View Window Help │
├──────────┬─────────────────────────────────────┤
│ │ │
│ 左側欄 │ 主內容區 │
│ ────── │ ───────────────── │
│ 📁 文件 │ # 標題 │
│ ├ README│ 正文... │
│ ├ Guide│ ```code block``` │
│ └ API │ 表格 │
│ │ [Mermaid diagram] │
│ 目錄 │ │
│ ────── │ │
│ • Introduction│ │
│ • Getting...│ │
│ • API Ref │ │
│ │ │
├──────────┴─────────────────────────────────────┤
│ Status Bar: 字數 | 段落 | UTF-8 | dark mode toggle│
└────────────────────────────────────────────────┘
```
### Shortcuts
| Key | Function |
|------|------|
| `Cmd+O` | Open a .md document |
| `Cmd+F` | Full-text search |
| `Cmd+Shift+F` | Search across documents |
| `Cmd++` / `Cmd+-` | Adjust font size |
| `Cmd+D` | Toggle dark mode |
| `Cmd+B` | Toggle the left sidebar |
| `Cmd+P` | Print / export PDF |
| `Esc` | Close search / return to browsing |
---
## 目錄結構
```
markbase/
├── Cargo.toml # Rust core
├── src/
│ ├── main.rs # CLI entry point
│ ├── render.rs # .md → HTML
│ ├── highlight.rs # Code syntax highlighting
│ ├── mermaid.rs # Mermaid rendering
│ ├── search.rs # Full-text search
│ └── watch.rs # File watcher
├── app/ # SwiftUI app
│ ├── MarkBase.xcodeproj
│ ├── MarkBase/
│ │ ├── ContentView.swift
│ │ ├── SidebarView.swift
│ │ ├── SearchView.swift
│ │ └── SettingsView.swift
│ └── markbase-cli # Embedded Rust binary
└── docs/
└── ARCHITECTURE.md
```
---
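## Differences from the Existing md_reader
| Aspect | md_reader | MarkBase |
|------|-----------|----------|
| Language | Pure Rust CLI | Rust engine + SwiftUI app |
| Architecture | A single 1134-line main.rs | Modular, 6+ files |
| Window | Bare WebKit window | Full SwiftUI + WKWebView |
| Search | ❌ None | ✅ Cmd+F + cross-document search |
| Outline | ❌ None | ✅ Heading tree in the left sidebar |
| File watcher | ❌ None | ✅ Automatic directory indexing |
| Dark mode | ❌ None | ✅ Follows the system + manual toggle |
| Mermaid | CDN-based | Built-in engine |
| Code highlight | ❌ None | ✅ syntect, 50+ languages |
| Naming | Functional description | Product brand |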
---
## Technology Selection Notes
> Added 2026-05-12
### 1. Conversion Engines
| Tool | License | Purpose |
|------|---------|------|
| pandoc 3.9 | GPL 2.0 | MD ↔ DOCX/PPTX/PDF |
| LibreOffice 26.2 | Apache 2.0 | Any format ↔ any format (headless CLI) |
| mmdc | MIT | Mermaid → SVG/PNG |
| rsvg-convert | LGPL | SVG → PNG |
### 2. Editor Selection
| Option | Decision | Rationale |
|------|:--:|------|
| CodeMirror 6 | ✅ Selected | MIT, 190KB gzip, CDN without npm, modular |
| Monaco (VS Code) | ❌ | 5MB is too large; needs webpack |
| Ace | ❌ | Maintenance has stalled |
### 3. Markdown Ecosystem Review
| Tool | License | Type | Takeaway for MarkBase |
|------|---------|------|--------------|
| glow | MIT | CLI renderer | Keep as a standalone CLI viewer |
| MarkText | MIT | WYSIWYG GUI | Reference for split-pane edit/preview design |
| mdcat | MPL 2.0 | CLI | Reference for terminal image rendering |
| bat | MIT/Apache | CLI | Reference for syntax highlighting strategy |
| mdBook | MPL 2.0 | CLI | Export format for static documentation sites |
| MkDocs | BSD | CLI | Alternative documentation site option |
| Obsidian | Proprietary | Desktop PKM | Reference for `[[wiki links]]`, graph view, backlinks |
### 4. Desktop vs Web
| Decision | Choice | Rationale |
|------|:--:|------|
| Web first | ✅ | Works on any device with the same HTML/JS/CSS |
| Tauri shell | ✅ Optional | <10MB, cross-platform macOS/Win/Linux |
| Electron | ❌ | 300MB is too heavy |
### 5. MarkBase vs Obsidian Positioning
| | Obsidian | MarkBase |
|------|:--:|:--:|
| Positioning | Personal knowledge management (PKM) | **Document processing engine + editor** |
| Data format | .md only | All formats (via soffice) |
| Search | Full text | RAG + embedding (Qdrant) |
| Backend | None | axum HTTP + PSQL + Qdrant |
| CLI | None | ✅ CLI first |
| Pipeline | None | ✅ Chunking + LLM pipeline |
| Cross-device | Paid sync | A self-hosted server is enough |
| Size | ~300MB (Electron) | <10MB (Tauri) |
| Licensing | Proprietary (free for personal use) | Momentry-only |
### 6. CLI Design
```
markbase display [--port 11438] [FILE]    start the display server
markbase render <FILE> [-o output.html]   Markdown → HTML
markbase serve <DIR>                      file browser + editor (planned)
```
### 7. 架構對比
```
Obsidian:                     MarkBase:
┌──────────────────────┐      ┌──────────────────────┐
│  Electron Shell      │      │  Tauri / Browser     │
│  ┌────────────────┐  │      │  ┌────────────────┐  │
│  │ Renderer       │  │      │  │ Renderer       │  │
│  │ ├─ CodeMirror  │  │      │  │ ├─ CodeMirror  │  │  ← same
│  │ ├─ Graph/D3    │  │      │  │ ├─ Mermaid.js  │  │  ← same
│  │ ├─ Mermaid.js  │  │      │  │ └─ pulldown    │  │
│  │ └─ MathJax     │  │      │  └────────────────┘  │
│  └────────────────┘  │      │  ┌────────────────┐  │
│  ┌────────────────┐  │      │  │ Rust Backend   │  │  ← MarkBase only
│  │ Plugin API     │  │      │  │ ├─ axum HTTP   │  │
│  │ 1,800+ plugins │  │      │  │ ├─ Embedding   │  │
│  └────────────────┘  │      │  │ ├─ Qdrant ANN  │  │
│  ┌────────────────┐  │      │  │ ├─ pgvector    │  │
│  │ FS Access      │  │      │  │ ├─ PG TKG      │  │
│  │ .md files only │  │      │  │ ├─ SQLite TKG  │  │
│  └────────────────┘  │      │  │ ├─ sqlite-vec  │  │
└──────────────────────┘      │  │ └─ Pipeline    │  │
                              │  └────────────────┘  │
                              └──────────────────────┘
```
### 8. Vector storage: sqlite-vec + Datasette
> Adopted 2026-05-12
#### Selection
| 需求 | pgvector (PG) | Qdrant | sqlite-vec | 決策 |
|------|:--:|:--:|:--:|:--:|
| Production API (3003) | ✅ | — | — | pgvector (已有) |
| HNSW ANN 搜尋 | ⚠️ | ✅ | — | Qdrant (已有) |
| Desktop 本機 RAG | ❌ 需裝 PG | ❌ 需 server | ✅ 單檔 | sqlite-vec |
| 檔案包內嵌向量 | ❌ | ❌ | ✅ 隨包分發 | sqlite-vec |
| 離線可用 | ❌ | ❌ | ✅ | sqlite-vec |
| Web UI 查詢 | — | — | via Datasette | Datasette |
#### sqlite-vec specifications
| Property | Value |
|------|-----|
| License | MIT + Apache 2.0 (dual-licensed) |
| Author | Alex Garcia |
| Sponsors | Mozilla Builders + Fly.io + Turso + SQLite Cloud |
| Stars | 7,600+ |
| Language | Pure C, zero dependencies |
| Size | ~200KB `.dylib` |
| ANN engine | exhaustive, IVF, DiskANN |
| Rust binding | `cargo add sqlite-vec` |
#### Datasette (optional Web UI)
| Property | Value |
|------|-----|
| License | Apache 2.0 |
| Author | Simon Willison |
| Role | SQLite → Web UI + JSON API |
| Plugins | 154 |
| sqlite-vec plugin | `datasette-sqlite-vec` (by the sqlite-vec author) |
#### Usage example
```sql
.load ./vec0
CREATE VIRTUAL TABLE chunks USING vec0(
embedding float[768],
file_uuid text,
chunk_type text,
text_content text
);
INSERT INTO chunks VALUES (?, 'uuid-123', 'sentence', 'hello world');
SELECT rowid, text_content, distance
FROM chunks WHERE embedding MATCH ?
ORDER BY distance LIMIT 10;
```
#### 四層向量架構
```
Production ← Qdrant (HNSW ANN, fast at scale)
← pgvector (transactional, alongside chunk data)
↓ backup / export
Portable ← sqlite-vec (.sqlite single file, package distributable)
← Datasette (optional Web UI)
```
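### 9. Qdrant graph analysis
> 2026-05-12 conclusion: Qdrant has **no** native graph capability; it is a pure vector database
#### Existing Qdrant features
| Feature | Description | Graph capability |
|------|------|:--:|
| **Payload filtering** | Vector search + JSON condition filtering | ⚠️ Pseudo-relational queries |
| **Collection aliases** | Joint queries across multiple collections | ⚠️ Basic |
| **Hybrid Queries** | Vector + keyword hybrid | ❌ |
| **Qdrant Edge** | Embedded vector search | ❌ Not graph |
| **Data Graphs (third-party)** | Neo4j + Qdrant hybrid RAG | ✅ Not native |
#### Limits of payload filtering
Payload filtering can emulate a 1-hop relation (for example, "find the chunks where Cary Grant speaks"), but it cannot perform true graph traversal:
```json
// ✅ 1-hop: filter speaker = "Cary Grant"
{"filter": {"must": [{"key": "speaker", "match": {"value": "Cary Grant"}}]}}
// ❌ 2-hop: graph traversal, which Qdrant cannot do
// "Who appears in the same scene as Cary Grant?"
// "Which of those people also talk with Audrey Hepburn?"
```
| Limitation | Description |
|------|------|
| ❌ 2-hop+ traversal | No relational queries across nodes |
| ❌ Edge weight / time | No concept of edge properties |
| ❌ Graph algebra | No algorithms such as `shortest_path` or `PageRank` |
| ❌ Cypher/GQL | No graph query language |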
#### Momentry TKG 決策
| | Qdrant-only | PG TKG | SQLite TKG | Neo4j |
|---|:--:|:--:|:--:|:--:|
| 向量搜尋 | ✅ 原生 | via pgvector | via sqlite-vec | via plugin |
| Graph traversal | ❌ | ✅ CTE | ✅ CTE | ✅ 原生 |
| 2-hop+ 查詢 | ❌ | ✅ | ✅ | ✅ |
| 時間範圍邊緣 | ❌ | ✅ | ✅ | ✅ |
| 部署 | 需 server | 需 PG | **單檔** | 需 Java |
| 檔案包分發 | ❌ | ❌ | ✅ | ❌ |
| 適合規模 | 大 | 中 | 小-中 | 大 |
#### Division of responsibilities
```
Qdrant  → vector search (ANN): core performance
PG      → TKG graph queries (recursive CTE): API server
SQLite  → TKG graph queries (recursive CTE): file package / offline
```
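The multi-hop queries that the PG and SQLite TKG layers are meant to cover can be sketched with a recursive CTE. This is a minimal illustration only: the `edges` table, its columns, the helper names, and the sample relationships are all assumptions made for the example, not the Momentry schema or real data.

```python
import sqlite3


def demo_connection():
    """Build an in-memory graph with a hypothetical `edges` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, relation TEXT)")
    pairs = [
        ("Cary Grant", "Audrey Hepburn", "dialogue"),
        ("Audrey Hepburn", "Walter Matthau", "dialogue"),
        ("Walter Matthau", "Ned Glass", "same_scene"),
    ]
    # Store both directions so traversal treats edges as undirected.
    for src, dst, rel in pairs:
        conn.execute("INSERT INTO edges VALUES (?, ?, ?)", (src, dst, rel))
        conn.execute("INSERT INTO edges VALUES (?, ?, ?)", (dst, src, rel))
    return conn


def reachable(conn, person, max_hops=2):
    """Return {name: minimum hop count} for everyone within max_hops of person."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(name, hops) AS (
            SELECT ?, 0
            UNION
            SELECT e.dst, r.hops + 1
            FROM reach AS r
            JOIN edges AS e ON e.src = r.name
            WHERE r.hops < ?
        )
        SELECT name, MIN(hops) FROM reach WHERE name <> ? GROUP BY name
        """,
        (person, max_hops, person),
    ).fetchall()
    return dict(rows)
```

The hop bound in the `WHERE` clause is what guarantees termination on a cyclic graph. The same query shape should carry over to PostgreSQL with only placeholder syntax changes.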
---
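## Highlight: Knowledge Graph
> Added 2026-05-12
### Obsidian vs MarkBase graph comparison
| | Obsidian Graph | MarkBase Knowledge Graph |
|------|:--:|:--:|
| Node source | Manually created `.md` notes | Chunks generated automatically by the AI pipeline |
| Edge source | Hand-written `[[wikilinks]]` | **Semantic similarity**, structural hierarchy, co-occurrence |
| Generation | Manual | **Automatic** (embedding + clustering) |
| Video support | ❌ | ✅ face traces, speaker graph, scene transitions |
| Entity recognition | ❌ | ✅ Faces / speakers / objects / scenes |
| Scale | Hundreds of nodes | **Tens of thousands of nodes** (chunk level) |
| Filtering | None | Time range, confidence, chunk type |
### Graph types
#### A. Semantic Graph
Edges are built from the cosine similarity of chunk embeddings, so similar chunks sit close together.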
```
[Audrey Hepburn 說話] ──0.82── [Cary Grant 回應]
│ │
│ 0.75 │ 0.78
▼ ▼
[討論離婚原因] ──0.91── [緊張對話場景]
```
**Algorithm**
1. Fetch all chunk embeddings
2. Compute pairwise cosine similarity
3. Keep the top-K most similar edges (K=5 by default)
4. Project to 2D coordinates with UMAP/t-SNE
5. Render with a D3.js force layout
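Steps 1 to 3 above can be sketched in plain Python. This is an illustrative brute-force version under stated assumptions: `top_k_edges` and its input shape (`{chunk_id: vector}`) are hypothetical names, and a production pipeline would more likely ask the vector store for nearest neighbours than compare every pair. Dimensionality reduction and rendering (steps 4 and 5) are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_edges(embeddings, k=5):
    """Keep, for each chunk, its k most similar neighbours as undirected edges."""
    ids = list(embeddings)
    edges = {}
    for i in ids:
        scored = [(cosine(embeddings[i], embeddings[j]), j) for j in ids if j != i]
        scored.sort(reverse=True)
        for weight, j in scored[:k]:
            # Sort the pair so (a, b) and (b, a) collapse into one edge.
            edges[tuple(sorted((i, j)))] = round(weight, 4)
    return [
        {"source": s, "target": t, "weight": w, "type": "semantic"}
        for (s, t), w in sorted(edges.items())
    ]
```

The output dictionaries use the same `source` / `target` / `weight` / `type` keys as the graph API response described later in this document.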
#### B. 結構層級圖Hierarchy Graph
文件 → 章節 → 段落 的三層樹狀結構。
#### C. 人物關係圖Identity Graph
基於 face_detections + speaker_assign。
```
Cary Grant ──[對手戲]── Audrey Hepburn
│ │
│[對話] │[場景共現]
▼ ▼
Walter Matthau ────── Ned Glass
```
#### D. Timeline Graph
Chunks are laid out along the time axis, with scene cuts marked. X axis = time, Y axis = speaker.
### 渲染技術
| 層 | 工具 | License |
|----|------|---------|
| 力導向佈局 | D3-force (d3.js v7) | ISC |
| 降維 (UMAP) | umap-js | MIT |
| 2D 繪圖 | Canvas / SVG via D3 | ISC |
| 3D 繪圖 | Three.js | MIT |
| 節點過濾 | Crossfilter / vanilla JS | — |
### API 設計
```
GET /api/v1/graph/:file_uuid/identity → 人物關係圖資料
GET /api/v1/graph/:file_uuid/semantic?depth=3 → 語意圖資料
GET /api/v1/graph/:file_uuid/hierarchy → 結構層級圖
GET /api/v1/graph/:file_uuid/timeline → 時序圖資料
```
回傳格式:
```json
{
"nodes": [
{"id": "chunk_100", "label": "Cary Grant: What's your name?", "group": 3, "x": 0.1, "y": 0.5}
],
"edges": [
{"source": "chunk_100", "target": "chunk_104", "weight": 0.82, "type": "semantic"}
]
}
```
### 互動設計
| 操作 | 行為 |
|------|------|
| Drag node | 拖曳節點 |
| Click node | 展開 chunk 內容預覽 |
| Scroll | 縮放圖譜 |
| Filter bar | 依 chunk_type / speaker / confidence 過濾 |
| Double-click | 聚焦該節點,展開子圖 |
| Hover edge | 顯示相似度分數 |
### 圖譜渲染工具選型
> 2026-05-12 新增
#### 候選工具對比
| 工具 | License | 大小 | CDN | 圖論演算法 | 中國社群 | 最佳場景 |
|------|---------|:--:|:--:|:--:|:--:|------|
| **Cytoscape.js** | MIT | ~120KB | ✅ | ✅ BFS/DFS/PageRank | ⚠️ | 複雜網絡圖 |
| D3.js v7 | ISC | ~80KB | ✅ | ❌ 需自寫 | ⚠️ | 任何自訂圖表 |
| ECharts | Apache 2.0 | ~1MB | ✅ | ❌ | ✅ 非常大 | 通用圖表 + 地圖 |
| G6 (AntV) | MIT | ~500KB | ✅ | ✅ 多種佈局 | ✅ 非常大 | 關係圖專用 |
| vis-network | MIT/Apache | ~300KB | ✅ | ❌ | ❌ | 網絡圖 |
| Sigma.js | MIT | ~80KB | ✅ | ❌ | ❌ | WebGL 大圖 (>5000節點) |
| Graphviz | EPL 1.0 | ~3MB | ❌ CLI only | ✅ | ⚠️ | 靜態匯出 SVG/PNG |
#### 選型過程
**第一輪篩選**:排除 CLI-only (Graphviz)、無 CDN、中文社群弱且圖論支援差的 (vis-network, Sigma.js)。
剩餘Cytoscape.js, D3.js, ECharts, G6。
**第二輪深度評估**
| | Cytoscape.js | D3.js | ECharts | G6 |
|---|:--:|:--:|:--:|:--:|
| 力導向佈局 | ✅ 9 種 | ✅ 自寫 | ✅ 1 種內建 | ✅ 9 種 |
| 複合節點 (compound) | ✅ | ❌ | ❌ | ✅ |
| 圖論演算法 | ✅ 內建 | ❌ | ❌ | ✅ |
| JSON → Graph | ✅ 原生 | ⚠️ 手動 | ⚠️ 手動 | ✅ 原生 |
| TreeGraph | ⚠️ 需擴展 | ✅ | ❌ | ✅ 專用 |
| 大型圖效能 | ⚠️ (>5000會慢) | ✅ | ✅ Canvas | ✅ |
| 互動 API | ✅ 豐富 | ✅ 最靈活 | ✅ | ✅ |
| 零外部依賴 | ✅ | ✅ | ❌ (zrender) | ❌ |
**最終決策**
| 場景 | 選用 | 理由 |
|------|:--:|------|
| 知識圖譜核心 | **Cytoscape.js** | 圖論演算法、fCoSE 佈局、JSON 原生對接、Obsidian/Mermaid 都用 |
| 統計輔助圖表 | **ECharts** | 中文社群大、Apache 背書、長條/圓餅/分佈圖開箱即用 |
| 樹狀層級圖 | **G6 TreeGraph** | 專用 API文件結構圖最簡潔 |
| 自訂特殊需求 | **D3.js** | 保底方案,任何無法滿足的圖表 |
#### Cytoscape.js adopters
| Organization | Use |
|------|------|
| **Mermaid** | Layout engine for mindmap diagrams |
| **Obsidian** | Knowledge graph (Graph View) |
| Amazon, Google, Meta, Microsoft | Internal network graph visualization |
| IBM, Cisco, Tencent, Uber | Network topology visualization |
| GitHub | Dependency graph |
#### 整合架構
```
MarkBase Knowledge Graph:
┌──────────────────────────────────────┐
│ 圖譜類型 渲染引擎 │
│ ───────── ──────── │
│ 語意關係圖 → Cytoscape.js │
│ 結構層級圖 → G6 TreeGraph │
│ 人物關係圖 → Cytoscape.js │
│ 時序演進圖 → ECharts timeline │
│ 降維散點圖 → D3.js │
│ 統計分佈圖 → ECharts │
│ │
│ 全部 CDN 載入,無需 npm │
└──────────────────────────────────────┘
```
### 在 MarkBase 中的整合
```
MarkBase Control Bar:
⏮ ◀ ▶ ⏭ | Graph | Tree | Edit | 🔍
Knowledge Graph View
```
---
## 開發路線圖
| 階段 | 時程 | 交付 |
|------|:----:|------|
| P0 Core rendering | ✅ Done | Rust engine: .md→HTML with Mermaid + AJAX refresh |
| P1 macOS app | ✅ Done | Tauri shell (可選) |
| P2 File tree + Editor | 2-3d | CodeMirror 6 + lazy-load 樹狀瀏覽 + 存檔 |
| P3 Knowledge Graph | 3-5d | Cytoscape.js + G6 + ECharts: 語意/結構/人物關係圖譜 |
| P4 Knowledge base | 3-5d | 多文件索引、全文檢索、backlinks |
| P5 Export | 2d | 轉檔 CLI (md→pdf/docx/pptx) |
| P6 Collaboration | 5-10d | 評論、版本、靜態站點 |


@@ -0,0 +1,647 @@
---
document_type: "reference_doc"
service: "MOMENTRY_CORE"
title: "處理器模組標準化規範"
date: "2026-04-25"
version: "V1.0"
status: "active"
owner: "Warren"
created_by: "OpenCode"
tags:
- "處理器模組標準化規範"
ai_query_hints:
- "查詢 處理器模組標準化規範 的內容"
- "處理器模組標準化規範 的主要目的是什麼?"
- "如何操作或實施 處理器模組標準化規範?"
---
# 處理器模組標準化規範
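## Overview
This specification defines the standardized architecture, interfaces, and implementation patterns for processor modules in Momentry Core. The goal is for all processor modules (ASR, OCR, YOLO, Face, Pose, CUT, ASRX, Caption, Story) to follow consistent design principles, improving code maintainability, testability, and extensibility.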
## 架構原則
### 1. 分層架構
```
┌─────────────────────────────────────────┐
│ Rust API 層 │
│ (src/core/processor/*.rs) │
├─────────────────────────────────────────┤
│ Python 執行層 │
│ (scripts/*_processor.py) │
├─────────────────────────────────────────┤
│ AI 模型層 │
│ (Whisper, YOLO, EasyOCR, etc.) │
└─────────────────────────────────────────┘
```
### 2. 職責分離
- **Rust 層**: 接口定義、錯誤處理、配置管理、結果解析
- **Python 層**: AI 模型調用、數據處理、中間文件管理
- **模型層**: 特定 AI 任務執行
## Rust 模組規範
### 文件結構
```
src/core/processor/
├── mod.rs # 模組導出
├── executor.rs # Python 執行器(共享)
├── asr.rs # ASR 處理器
├── ocr.rs # OCR 處理器
├── yolo.rs # YOLO 處理器
├── face.rs # 人臉檢測處理器
├── pose.rs # 姿態檢測處理器
├── cut.rs # 場景切割處理器
├── asrx.rs # ASRX 處理器
├── caption.rs # 字幕生成處理器
└── story.rs # 故事分析處理器
```
### 模組模板
#### 1. 結果結構定義
```rust
use anyhow::{Context, Result};
use serde::{Deserialize, Serialize};
use std::time::Duration;

use super::executor::PythonExecutor;
use crate::core::config::processor;

// Main result structure
#[derive(Debug, Serialize, Deserialize)]
pub struct ModuleResult {
    // Common fields
    pub processing_time: Option<f64>,
    pub metadata: Option<serde_json::Value>,
    // Data units written by the Python script (read by process_module)
    pub data_units: Vec<DataUnit>,
    // Module-specific fields
    // ...
}

// Data unit structure
#[derive(Debug, Serialize, Deserialize)]
pub struct DataUnit {
    // Time- or frame-related fields
    pub start: f64,
    pub end: f64,
    pub frame: u64,
    // Data content
    // ...
}
```
#### 2. 處理函數模板
```rust
pub async fn process_module(
    video_path: &str,
    output_path: &str,
    uuid: Option<&str>,
) -> Result<ModuleResult> {
    // 1. Initialize the executor
    let executor = PythonExecutor::new()?;
    let script_path = executor.script_path("module_processor.py");

    // 2. Log the start of processing
    tracing::info!("[MODULE] Starting processing: {}", video_path);

    // 3. Run the Python script
    executor
        .run(
            "module_processor.py",
            &[video_path, output_path],
            uuid,
            "MODULE",
            Some(Duration::from_secs(*processor::MODULE_TIMEOUT_SECS)),
        )
        .await
        .with_context(|| format!("Failed to run {:?}", script_path))?;

    // 4. Read and parse the result
    let json_str = std::fs::read_to_string(output_path)
        .context("Failed to read module output")?;
    let result: ModuleResult = serde_json::from_str(&json_str)
        .context("Failed to parse module output")?;

    // 5. Log a summary of the result
    tracing::info!(
        "[MODULE] Result: processed {} units",
        result.data_units.len()
    );

    Ok(result)
}
```
#### 3. 配置管理
```rust
// 在 src/core/config.rs 中添加
pub mod processor {
use super::*;
pub static MODULE_TIMEOUT_SECS: Lazy<u64> = Lazy::new(|| {
env::var("MOMENTRY_MODULE_TIMEOUT")
.unwrap_or_else(|_| "3600".to_string())
.parse()
.unwrap_or(3600)
});
pub static MODULE_CHUNK_SIZE: Lazy<u64> = Lazy::new(|| {
env::var("MOMENTRY_MODULE_CHUNK_SIZE")
.unwrap_or_else(|_| "300".to_string())
.parse()
.unwrap_or(300)
});
}
```
#### 4. 測試規範
```rust
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_result_serialization() {
// 測試序列化/反序列化
}
#[test]
fn test_empty_result() {
// 測試邊界條件
}
#[tokio::test]
async fn test_integration() {
// 集成測試(可選)
}
}
```
## Python 腳本規範
### 文件命名
```
scripts/
├── module_processor.py # 主要處理腳本
├── module_utils.py # 工具函數(可選)
└── module_debug.py # 調試腳本(可選)
```
### 腳本模板
```python
#!/opt/homebrew/bin/python3.11
"""
Module processor: standardized template.

Purpose: run [module name] processing.
Input:   video file path, output file path.
Output:  processing result as JSON.
"""

import sys
import json
import os
import argparse
import signal
import tempfile
import time
from pathlib import Path
from typing import Dict, Any, List, Optional


# Environment check
def check_environment() -> bool:
    """Check that the required environment and dependencies are available."""
    try:
        # Placeholder: replace with the module's real dependency
        import required_library
        return True
    except ImportError as e:
        print(f"ERROR: Missing dependency: {e}", file=sys.stderr)
        return False


# Signal handling
def signal_handler(signum, frame):
    """Handle interrupt signals."""
    print(f"[MODULE] Received signal {signum}, cleaning up...")
    sys.exit(1)


# Main processing class
class ModuleProcessor:
    def __init__(self, video_path: str, output_path: str):
        self.video_path = video_path
        self.output_path = output_path
        self.start_time = time.time()

    def validate_input(self) -> bool:
        """Validate the input file."""
        if not os.path.exists(self.video_path):
            print(f"ERROR: Video file not found: {self.video_path}", file=sys.stderr)
            return False
        return True

    def process(self) -> Dict[str, Any]:
        """Run the processing logic."""
        try:
            # 1. Prepare a working directory; it is removed when the block exits
            with tempfile.TemporaryDirectory(prefix="module_") as work_dir:
                # 2. Run the core processing logic
                result = self._core_processing(work_dir)

            # 3. Attach metadata
            result["metadata"] = {
                "processing_time": time.time() - self.start_time,
                "video_path": self.video_path,
                "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
                "module_version": "1.0.0"
            }
            return result
        except Exception as e:
            print(f"ERROR: Processing failed: {e}", file=sys.stderr)
            raise

    def _core_processing(self, work_dir: str) -> Dict[str, Any]:
        """Core processing logic (module-specific)."""
        # Module-specific implementation goes here
        return {
            "data_units": [],
            "summary": {}
        }

    def save_result(self, result: Dict[str, Any]):
        """Save the result to the output file."""
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
        print(f"[MODULE] Result saved to: {self.output_path}")


# Command-line interface
def main():
    parser = argparse.ArgumentParser(description="Module processor")
    parser.add_argument("video_path", help="Input video file path")
    parser.add_argument("output_path", help="Output JSON file path")
    args = parser.parse_args()

    # Install signal handlers
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    # Environment check
    if not check_environment():
        sys.exit(1)

    # Run processing
    processor = ModuleProcessor(args.video_path, args.output_path)
    if not processor.validate_input():
        sys.exit(1)
    try:
        result = processor.process()
        processor.save_result(result)
        print("[MODULE] Processing completed successfully")
    except Exception as e:
        print(f"ERROR: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()
```
### 輸出格式規範
```json
{
"data_units": [
{
"id": "unit_1",
"start": 0.0,
"end": 2.5,
"frame": 0,
"data": {},
"confidence": 0.95
}
],
"summary": {
"total_units": 1,
"processing_time": 4.7,
"average_confidence": 0.95
},
"metadata": {
"video_path": "/path/to/video.mp4",
"module": "module_name",
"version": "1.0.0",
"timestamp": "2026-03-27 10:30:00"
}
}
```
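As a rough illustration of the format above, a consumer could sanity-check a result before storing it. This is a sketch rather than part of the specification: `validate_output` and the choice of required keys are assumptions (the keys mirror the non-optional fields of the Rust `DataUnit` template).

```python
# Assumed required keys, taken from the non-optional DataUnit fields.
REQUIRED_UNIT_KEYS = {"start", "end", "frame"}


def validate_output(result):
    """Return a list of problems; an empty list means the result looks well-formed."""
    units = result.get("data_units")
    if not isinstance(units, list):
        return ["data_units must be a list"]

    problems = []
    for i, unit in enumerate(units):
        missing = REQUIRED_UNIT_KEYS - set(unit)
        if missing:
            problems.append(f"data_units[{i}] missing {sorted(missing)}")
        elif unit["end"] < unit["start"]:
            problems.append(f"data_units[{i}] ends before it starts")

    summary = result.get("summary", {})
    if "total_units" in summary and summary["total_units"] != len(units):
        problems.append("summary.total_units does not match len(data_units)")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that is wrong with a damaged output file in one pass.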
## 配置標準化
### 環境變量
```
# 超時設置
MOMENTRY_ASR_TIMEOUT=3600
MOMENTRY_OCR_TIMEOUT=7200
MOMENTRY_YOLO_TIMEOUT=7200
MOMENTRY_FACE_TIMEOUT=3600
MOMENTRY_POSE_TIMEOUT=3600
MOMENTRY_CUT_TIMEOUT=3600
MOMENTRY_ASRX_TIMEOUT=3600
MOMENTRY_CAPTION_TIMEOUT=1800
MOMENTRY_STORY_TIMEOUT=1800
# 性能設置
MOMENTRY_MODULE_CHUNK_SIZE=300
MOMENTRY_MODULE_BATCH_SIZE=32
MOMENTRY_MODULE_CACHE_ENABLED=true
# 模型設置
MOMENTRY_MODULE_MODEL=base
MOMENTRY_MODULE_DEVICE=cpu
```
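### Configuration precedence
1. Command-line arguments (highest priority)
2. Environment variables
3. Configuration file
4. Default values (lowest priority)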
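The precedence order above can be sketched as a single lookup function. The helper name `resolve_setting`, its parameters, and the `MOMENTRY_<NAME>` mapping from setting name to environment variable are assumptions made for illustration; only the four-level order comes from this document.

```python
import os


def resolve_setting(name, cli_args, config_file, default, cast=int, env=None):
    """Resolve one setting: CLI argument > environment variable > config file > default."""
    env = os.environ if env is None else env
    env_key = f"MOMENTRY_{name.upper()}"  # assumed naming, e.g. MOMENTRY_ASR_TIMEOUT

    if cli_args.get(name) is not None:
        value = cli_args[name]
    elif env_key in env:
        value = env[env_key]
    elif name in config_file:
        value = config_file[name]
    else:
        value = default
    # Environment variables arrive as strings, so normalise the type at the end.
    return cast(value)
```

For example, `resolve_setting("asr_timeout", {}, {}, 3600)` would return the value of `MOMENTRY_ASR_TIMEOUT` when that variable is set and `3600` otherwise.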
## 錯誤處理規範
### Rust 錯誤處理
```rust
use anyhow::{Context, Result};

pub async fn process_module(...) -> Result<ModuleResult> {
    // Attach context with .context() (use .with_context() when the message is built lazily)
    executor.run(...)
        .await
        .context("Failed to run module script")?;

    // Return early with anyhow::bail!
    if !condition {
        anyhow::bail!("Condition not met: {}", reason);
    }
    // ...
}
```
### Python 錯誤處理
```python
def process(self) -> Dict[str, Any]:
    try:
        # Main logic
        with tempfile.TemporaryDirectory(prefix="module_") as work_dir:
            result = self._core_processing(work_dir)
        return result
    except FileNotFoundError as e:
        print(f"ERROR: File not found: {e}", file=sys.stderr)
        raise
    except RuntimeError as e:
        print(f"ERROR: Runtime error: {e}", file=sys.stderr)
        raise
    except Exception as e:
        print(f"ERROR: Unexpected error: {e}", file=sys.stderr)
        raise
```
### 錯誤分類
1. **輸入錯誤**: 文件不存在、格式不支持、權限問題
2. **配置錯誤**: 缺少依賴、環境變量錯誤、模型文件缺失
3. **運行時錯誤**: 內存不足、超時、模型推理錯誤
4. **輸出錯誤**: 結果解析失敗、文件寫入失敗
## 日誌規範
### Rust 日誌
```rust
tracing::info!("[MODULE] Starting processing: {}", video_path);
tracing::debug!("[MODULE] Processing details: {:?}", details);
tracing::warn!("[MODULE] Warning: {}", warning_message);
tracing::error!("[MODULE] Error: {}", error_message);
```
### Python 日誌
```python
import os
import sys


def log_info(message: str):
    print(f"[MODULE] INFO: {message}", file=sys.stderr)


def log_debug(message: str):
    # Debug output is opt-in via the MODULE_DEBUG environment variable
    if os.environ.get("MODULE_DEBUG") == "1":
        print(f"[MODULE] DEBUG: {message}", file=sys.stderr)


def log_error(message: str):
    print(f"[MODULE] ERROR: {message}", file=sys.stderr)
```
## 性能監控
### 指標收集
```rust
pub struct ProcessingMetrics {
    pub start_time: std::time::Instant,
    pub end_time: Option<std::time::Instant>,
    pub memory_usage_mb: f64,
    pub cpu_usage_percent: f64,
    pub items_processed: u64,
    pub items_per_second: f64,
}

impl ProcessingMetrics {
    pub fn new() -> Self {
        Self {
            start_time: std::time::Instant::now(),
            end_time: None,
            memory_usage_mb: 0.0,
            cpu_usage_percent: 0.0,
            items_processed: 0,
            items_per_second: 0.0,
        }
    }

    pub fn record_completion(&mut self, items_processed: u64) {
        let end_time = std::time::Instant::now();
        let secs = end_time.duration_since(self.start_time).as_secs_f64();
        self.end_time = Some(end_time);
        self.items_processed = items_processed;
        // Guard against dividing by zero when the run finishes within timer resolution
        self.items_per_second = if secs > 0.0 {
            items_processed as f64 / secs
        } else {
            0.0
        };
    }
}
```
### 性能報告
```json
{
"performance": {
"processing_time_seconds": 4.7,
"memory_usage_mb": 512.5,
"cpu_usage_percent": 45.2,
"items_processed": 8,
"items_per_second": 1.7,
"throughput_mb_per_second": 10.5
}
}
```
## 測試規範
### 單元測試
```rust
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_result_structure() {
// 測試數據結構
}
#[test]
fn test_serialization() {
// 測試序列化
}
#[test]
fn test_edge_cases() {
// 測試邊界條件
}
}
```
### 集成測試
```rust
#[tokio::test]
async fn test_module_integration() {
// 使用測試文件進行集成測試
let test_video = "test_data/sample.mp4";
let output_file = tempfile::NamedTempFile::new().unwrap();
let result = process_module(test_video, output_file.path().to_str().unwrap(), None)
.await
.expect("Processing should succeed");
assert!(!result.data_units.is_empty());
}
```
### Python 測試
```python
import tempfile


def test_module_processor():
    """Test the Python processor."""
    processor = ModuleProcessor("test.mp4", "output.json")
    # Input validation: the file does not exist
    assert not processor.validate_input()
    # Processing logic
    with tempfile.NamedTemporaryFile() as tmp:
        processor = ModuleProcessor("real_test.mp4", tmp.name)
        result = processor.process()
        assert "data_units" in result
        assert "metadata" in result
```
## 文檔規範
### Rust 文檔
```rust
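/// ASR processor module
///
/// Provides automatic speech recognition, with support for multiple languages and large files.
///
/// # Example
/// ```ignore
/// use momentry_core::processor::asr;
///
/// // Must be called from an async context
/// let result = asr::process_asr("video.mp4", "output.json", None).await?;
/// println!("Recognized {} speech segments", result.segments.len());
/// ```
pub mod asr {
    // ...
}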
```
### Python 文檔
```python
"""
模組處理器
提供 [功能描述] 功能。
使用示例:
python module_processor.py input.mp4 output.json
參數:
video_path: 輸入視頻文件路徑
output_path: 輸出 JSON 文件路徑
輸出格式:
詳見輸出格式規範部分。
"""
```
## 遷移指南
### 現有模組標準化步驟
1. **分析現有代碼**: 識別不符合規範的部分
2. **創建備份**: 備份原始文件
3. **重構 Rust 模組**: 按照模板重構
4. **重構 Python 腳本**: 按照模板重構
5. **更新配置**: 統一配置管理
6. **添加測試**: 補充單元和集成測試
7. **更新文檔**: 更新 API 文檔和使用說明
8. **驗證功能**: 確保功能正常
### 兼容性保證
- 保持現有 API 不變
- 逐步遷移,不中斷現有功能
- 提供遷移工具和文檔
## 附錄
### A. 模組分類
| 模組 | 功能 | 主要技術 | 輸出類型 |
|------|------|----------|----------|
| ASR | 語音識別 | Whisper | 時間段文本 |
| OCR | 文字識別 | EasyOCR | 幀級文字 |
| YOLO | 物體檢測 | YOLOv8 | 幀級物體 |
| Face | 人臉檢測 | OpenCV | 幀級人臉 |
| Pose | 姿態檢測 | OpenPose | 幀級姿態 |
| CUT | 場景切割 | PySceneDetect | 場景邊界 |
| ASRX | 語音增強 | WhisperX | 說話人分離 |
| Caption | 字幕生成 | BLIP | 幀級描述 |
| Story | 故事分析 | 自定義 | 故事結構 |
### B. 性能基準
| 模組 | 平均處理時間 | 內存使用 | CPU 使用 |
|------|--------------|----------|----------|
| ASR | 4.7s (小文件) | 1.2GB | 45% |
| OCR | 12.3s (小文件) | 800MB | 35% |
| YOLO | 8.5s (小文件) | 1.5GB | 60% |
| Face | 3.2s (小文件) | 500MB | 25% |
### C. 常見問題
1. **依賴問題**: 確保 Python 環境正確設置
2. **內存不足**: 調整 chunk_size 參數
3. **超時錯誤**: 增加 timeout 設置或優化算法
4. **模型加載慢**: 啟用模型緩存
---
*版本: 1.0.0*
*更新日期: 2026-03-27*
*負責人: Warren (Technical Lead)*
*狀態: 草案*


@@ -0,0 +1,353 @@
---
document_type: "reference_doc"
service: "MOMENTRY_CORE"
title: "Momentry Core 影片 RAG 系統說明稿"
date: "2026-03-22"
version: "V1.0"
status: "active"
owner: "Warren"
created_by: "OpenCode"
tags:
- "momentry"
- "core"
- "系統說明稿"
ai_query_hints:
- "查詢 Momentry Core 影片 RAG 系統說明稿 的內容"
- "Momentry Core 影片 RAG 系統說明稿 的主要目的是什麼?"
- "如何操作或實施 Momentry Core 影片 RAG 系統說明稿?"
---
# Momentry Core 影片 RAG 系統說明稿
| 項目 | 內容 |
|------|------|
| 建立者 | Warren |
| 建立時間 | 2026-03-22 |
| 文件版本 | V1.1 |
---
## 版本歷史
| 版本 | 日期 | 目的 | 操作人 | 工具/模型 |
|------|------|------|--------|-----------|
| V1.0 | 2026-03-22 | 創建文件 | Warren | OpenCode / MiniMax M2.5 |
| V1.1 | 2026-03-25 | 更新API回應格式 (media_url→file_path) 與認證標頭 | OpenCode | deepseek-reasoner |
---
## 系統架構
```
┌─────────────────────────────────────────────────────────────┐
│ 使用者 │
│ (marcom 團隊) │
└─────────────────┬───────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ WordPress 入口 │
│ (wp.momentry.ddns.net) │
└─────────────────┬───────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ n8n 自動化 │
│ (localhost:5678) │
│ │
│ [Webhook] → [HTTP Request] → [處理結果] → [回覆用戶] │
└─────────────────┬───────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Momentry Core API │
│ (localhost:3002) │
│ │
│ POST /api/v1/search → 語意搜尋 │
│ POST /api/v1/n8n/search → n8n 專用格式 │
│ GET /api/v1/videos → 影片列表 │
└─────────────────┬───────────────────────────────────────────┘
┌─────────┴──────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ PostgreSQL │ │ Qdrant │
│ (chunks) │ │ (vectors) │
└───────────────┘ └───────────────┘
```
---
## 資料流程
```
1. 上傳影片 → SFTPGo
2. 影片註冊 → PostgreSQL
3. ASR 處理 → 產生字幕區塊
4. 儲存 chunks → PostgreSQL
5. 向量化 → Qdrant
6. 搜尋查詢 → API
7. 回傳結果 → n8n → 用戶
```
---
## 示範影片
| 項目 | 內容 |
|------|------|
| 檔案名稱 | Old_Time_Movie_Show_-_Charade_1963.HD.mov |
| UUID | a1b10138a6bbb0cd |
| 時長 | 6879 秒(約 1.9 小時) |
| 區塊數 | 3,886 個 |
| 向量數 | 3,688 個 |
---
## API 端點
### 1. 語意搜尋
```
POST http://localhost:3002/api/v1/search
```
**請求:**
```json
{
"query": "charade",
"limit": 5,
"uuid": "a1b10138a6bbb0cd"
}
```
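> **Notes**:
> 1. **API authentication**: all `/api/v1/*` endpoints require the `X-API-Key` header
> 2. **File path conversion**: the API now returns `file_path` (a filesystem path), which must be converted to an accessible URL (for example, via an SFTPGo share link)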
---
### 2. n8n 專用格式
```
POST http://localhost:3002/api/v1/n8n/search
```
**請求:**
```json
{
"query": "charade",
"limit": 5
}
```
**回應:**
```json
{
"query": "charade",
"count": 5,
"hits": [
{
"id": "sentence_0006",
"vid": "a1b10138a6bbb0cd",
"start": 48.8,
"end": 55.44,
"title": "Chunk sentence_0006",
"text": "fun plot twists...",
"score": 0.526,
"file_path": "/Users/accusys/momentry/var/sftpgo/data/demo/video.mp4"
}
]
}
```
---
## 實作範例
### n8n Workflow 設計
```
┌─────────────┐
│ Webhook │ ← 接收用戶搜尋請求
└──────┬──────┘
┌─────────────┐
│ HTTP Request│ → POST /api/v1/n8n/search
└──────┬──────┘
┌─────────────┐
│ Code │ → 處理回傳結果
└──────┬──────┘
┌─────────────┐
│ Telegram │ → 回覆給用戶
│ (或 LINE) │
└─────────────┘
```
---
## Step-by-Step n8n Workflow
### Step 1: 建立 Webhook
1. n8n 開新 Workflow
2. 新增 node: **Webhook**
3. 設定 path: `video-search`
4. 複製 Webhook URL
---
### Step 2: 設定 HTTP Request
1. 新增 node: **HTTP Request**
2. 設定:
```
Method: POST
URL: http://localhost:3002/api/v1/n8n/search
Body Content Type: JSON
Headers: X-API-Key (需設定)
```
3. Body:
```json
{
"query": "={{ $json.body }}",
"limit": 5
}
```
---
### Step 3: 處理結果 (Code)
```javascript
const hits = $input.first().json.hits;
if (!hits || hits.length === 0) {
return {
json: { message: "找不到相關結果" }
};
}
const results = hits.map((hit, index) => ({
number: index + 1,
text: hit.text,
time: `${hit.start}s - ${hit.end}s`,
score: Math.round(hit.score * 100) + "%",
// 注意: API 現在返回 file_path檔案系統路徑需要轉換為可訪問的 URL
url: hit.file_path + "#t=" + hit.start + "," + hit.end // 需實作檔案路徑轉換為 URL
}));
return { json: { results } };
```
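> **Notes**:
> 1. **API authentication**: all `/api/v1/*` endpoints require the `X-API-Key` header
> 2. **File path conversion**: the API now returns `file_path` (a filesystem path), which must be converted to an accessible URL (for example, via an SFTPGo share link)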
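One possible shape for the path-to-URL conversion is sketched below in Python. Everything specific here is an assumption: the helper name, the idea that files are served under a single base URL that mirrors the storage root, and the example base URL. A real deployment would plug in whatever URL scheme its SFTPGo share links actually use.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def file_path_to_url(file_path, storage_root, base_url, start=None, end=None):
    """Map a filesystem path under storage_root to a URL under base_url.

    Raises ValueError if file_path is not inside storage_root, so paths
    outside the shared directory are never exposed.
    """
    relative = PurePosixPath(file_path).relative_to(storage_root)
    url = base_url.rstrip("/") + "/" + quote(relative.as_posix())
    if start is not None and end is not None:
        # Media fragment (#t=start,end) asks the player to seek to the hit.
        url += f"#t={start},{end}"
    return url
```

With the sample hit from the n8n response, `file_path_to_url(hit_path, "/Users/accusys/momentry/var/sftpgo/data", "https://media.example.com/share", 48.8, 55.44)` would yield a URL ending in `/demo/video.mp4#t=48.8,55.44`, where `media.example.com` is a placeholder host.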
---
### Step 4: 格式化輸出
**Telegram 格式:**
```
🎬 搜尋結果: "{{ $json.query }}"
1⃣ "fun plot twists, Woody Dialog and charming performances..."
⏱ 48.8s - 55.4s
📊 相關度: 53%
2⃣ "Don't you like me to say that a pretty girl..."
⏱ 4745.6s - 4748.6s
📊 相關度: 52%
```
---
## 測試指令
### curl 測試
```bash
# 語意搜尋
curl -X POST http://localhost:3002/api/v1/search \
-H "Content-Type: application/json" \
-H "X-API-Key: YOUR_API_KEY" \
-d '{"query": "charade", "limit": 3}'
# n8n 格式
curl -X POST http://localhost:3002/api/v1/n8n/search \
-H "Content-Type: application/json" \
-H "X-API-Key: YOUR_API_KEY" \
-d '{"query": "charade", "limit": 3}'
# 影片列表
curl -H "X-API-Key: YOUR_API_KEY" http://localhost:3002/api/v1/videos
# 特定影片區塊
curl -H "X-API-Key: YOUR_API_KEY" http://localhost:3002/api/v1/videos/a1b10138a6bbb0cd/chunks
```
---
## 實際搜尋範例
| 搜尋詞 | 結果摘要 |
|--------|----------|
| `charade` | "fun plot twists, Woody Dialog and charming performances..." |
| `woody` | "Well, you thick skull hair, brain half-witted..." |
| `classic movie` | "Hello and welcome to the old-time movie show..." |
| `charming` | "fun plot twists, Woody Dialog and charming performances..." |
---
## 資料庫狀態
| 資料庫 | 資料筆數 | 狀態 |
|--------|----------|------|
| PostgreSQL (videos) | 4 | ✅ |
| PostgreSQL (chunks) | 3,950 | ✅ |
| PostgreSQL (vectors) | 1,870 | ✅ |
| Qdrant (vectors) | 3,688 | ✅ |
| Redis (job cache) | 4 keys | ✅ |
---
## 下一步
1. **建立 SFTPGo 分享連結**
- 開啟 http://localhost:8080
- 登入 demo / demopassword123
- 建立影片分享連結
2. **測試 n8n Workflow**
- 匯入 Postman Collection
- 建立 Webhook
- 測試搜尋
3. **整合到 WordPress**
- 建立表單接收用戶輸入
- 呼叫 n8n Webhook
- 顯示搜尋結果
---
## 快速開始
```bash
# 1. Test the search API
curl -X POST http://localhost:3002/api/v1/search \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{"query": "charade", "limit": 3}'
# 2. List videos
curl -H "X-API-Key: YOUR_API_KEY" http://localhost:3002/api/v1/videos
# 3. Check whether n8n is running
curl http://localhost:5678
```


@@ -0,0 +1,94 @@
# Non-Human Sound Detection — Tool Selection Report
**Date:** 2026-05-10
**Movie:** Charade (1963), 113 min
**Audio:** 16kHz mono WAV
**Goal:** Detect non-human sound events (gunshots, impacts, doors, music, etc.)
## Tested Approaches
### Approach A: AST AudioSet (HuggingFace)
| Item | Detail |
|------|--------|
| Model | `MIT/ast-finetuned-audioset-10-10-0.4593` |
| Method | Audio Spectrogram Transformer, fine-tuned on AudioSet-2M (527 classes) |
| Dependencies | `transformers`, `torch` ✅ (no torchcodec needed) |
| Load time | ~1s on M5 |
| Inference time | ~0.5s per 3-second clip (~87M params, float32) |
| Accuracy | Good — correctly distinguishes speech vs. door vs. music |
**Test results on Charade:**
| Time | Energy-based said | AST AudioSet said | Verdict |
|------|------------------|-------------------|---------|
| 0:10 | — | Environmental noise (26%) | Background noise, plausible |
| 10:32 | Gunshot candidate (43x) | **Speech (76%)** | ✅ AST correct |
| 57:00 | Gunshot candidate (49x) | **Door (62%) + Slam (5%)** | ✅ AST correct |
| 65:13 | Gunshot candidate (50x) | **Speech (58%)** | ✅ AST correct |
| 85:12 | Gunshot candidate (39x) | **Speech (68%)** | ✅ AST correct |
**Conclusion**: All four energy-based gunshot candidates checked here were false positives (a 100% false positive rate on this sample). AST AudioSet classified every one of them as a non-gunshot sound.
### Approach B: Custom Energy + Spectral Features
| Item | Detail |
|------|--------|
| Method | RMS energy + spectral centroid + sub-band energy ratios |
| Speed | ~3s for full 113-min movie (every 10th window) |
| Accuracy | Poor — cannot distinguish gunshot from speech, door, music |
| Result | 1 "gunshot_candidate" from 453 test windows; all false positives on verification |
**Conclusion**: Useful as a **coarse pre-filter** (Stage 1), not as a standalone classifier.
## Two-Stage Design
```
Stage 1 (Energy filter, ~1 min):
  Full audio → sliding window RMS + centroid → ~200 candidate windows
                        |
                        v
Stage 2 (AST classifier, ~2 min):
  Extract 3-sec audio for each candidate → AST AudioSet classification
                        |
                        v
  Non-speech events: gunshot, explosion, door slam, music, etc.
```
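A minimal sketch of the Stage 1 idea, assuming the audio is already decoded into a list of float samples. It covers only the RMS part (the spectral centroid feature is omitted), and the function names, window sizes, and the "ratio over median RMS" threshold are illustrative assumptions rather than the settings used in the tests above.

```python
import math
import statistics


def rms(samples):
    """Root-mean-square energy of a window (0.0 for an empty window)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def candidate_windows(samples, sample_rate, window_s=0.5, hop_s=0.25, ratio=10.0):
    """Return (start_seconds, energy_ratio) for windows far louder than the median."""
    win = max(1, int(window_s * sample_rate))
    hop = max(1, int(hop_s * sample_rate))
    starts = range(0, max(len(samples) - win + 1, 1), hop)
    energies = [(s, rms(samples[s:s + win])) for s in starts]
    # Median RMS acts as the background level; avoid dividing by zero on silence.
    baseline = statistics.median(e for _, e in energies) or 1e-9
    return [
        (s / sample_rate, e / baseline)
        for s, e in energies
        if e / baseline >= ratio
    ]
```

Each returned start time would then be turned into a 3-second clip and handed to the Stage 2 classifier, so a permissive `ratio` here costs only extra Stage 2 inference time.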
Estimated processing: ~3 min for full movie (vs. 75 min for full AST scan)
## Key AudioSet Classes Relevant to Charade
| Class | AudioSet ID | Relevance |
|-------|-------------|-----------|
| Gunshot, gunfire | 402 | **Primary target** |
| Explosion | 400 | Hand grenade in plot |
| Door slams | 404 | Scenes at hotel, apartment |
| Music | 130-133 | Background score |
| Speech | 0-3 | Already handled by ASR |
| Vehicle | 100-110 | Car sounds in Paris chase |
| Glass break | 424 | Window breaking scene |
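Class indices are easier to keep correct when they are looked up by label at run time instead of being hard-coded. The sketch below assumes an `id2label` mapping of the kind exposed by a Hugging Face classification model's config; the helper name `find_label_ids` is hypothetical.

```python
def find_label_ids(id2label, keywords):
    """Map each keyword to the (index, label) pairs whose label contains it."""
    matches = {}
    for keyword in keywords:
        needle = keyword.lower()
        matches[keyword] = sorted(
            (int(index), label)
            for index, label in id2label.items()
            if needle in label.lower()
        )
    return matches
```

With `transformers` installed, the mapping for the model tested above should be available as `AutoConfig.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593").id2label`, and `find_label_ids(id2label, ["gunshot", "explosion", "door", "glass"])` would list the candidate classes with their actual indices.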
## Actor-voice gender mismatches (resolved by fine-grained ASRX)
During the speaker mapping work, we found 20 segments where the old face→TMDb assignment said "Audrey Hepburn" but the new ASRX voice embedding clearly said "MALE". These segments were verified via video clips and confirmed to be scenes where:
1. A male speaker (Cary Grant or other) is speaking while Audrey Hepburn's face is on screen
2. The old pipeline incorrectly assigned the speaker name based on face identity
3. The fine-grained sliding window approach correctly resolves these
The 20 segments were from SPEAKER_5 (10 segs) and SPEAKER_9 (10 segs), both of which mapped to MALE voice clusters. These were re-assigned to "Cary Grant" or "Unknown" as appropriate.
## Recommendations
| Approach | Speed | Accuracy | Best for |
|----------|-------|----------|----------|
| Energy pre-filter | ✅ 1 min | ❌ Low | Stage 1: candidate selection |
| AST AudioSet | ⚠️ 2 min | ✅ High | Stage 2: event classification |
| Full AST scan | ❌ 75 min | ✅ High | N/A — two-stage is better |
**Design**: Two-stage pipeline: energy pre-filter → AST classifier
**Implementation path**:
1. Write `scripts/non_human_sound_detector.py` with the two-stage design
2. Output `{uuid}.sound_events.json` with typed events
3. Integrate into the sound_event_detector framework


@@ -0,0 +1,134 @@
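# Processor Output Mechanism Review
## Definition of the three mechanisms
### 1. Interruption Resume
When a process is killed, it can pick up from its previous progress after a restart.
**Current state**: most processors have `.tmp` / `.partial` protection, but a rerun still starts from the beginning.
### 2. Supplement
When an output is incomplete, only the unfinished part is processed instead of rerunning everything.
**Current state**: everything reruns from the beginning; there is no supplement mechanism.
### 3. Error Correction
When an output file is corrupted, the corruption is detected and repaired automatically.
**Current state**: the file-existence check only verifies that the file exists, not that its content is valid.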
---
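## Processor-by-processor review
### ASR
| Aspect | Current state | Issue |
|------|------|------|
| Interruption resume | ✅ `.tmp` / `.partial` (executor) | ✅ OK |
| Supplement | ❌ Reruns from the beginning every time | If killed at 50%, the next run starts from 0% |
| Error correction | ❌ Content is not validated | The file-existence check skips the job as soon as `.json` exists, regardless of content |
| Pipe | ✅ executor.run() | ✅ |
| Timeout | ✅ Removed (None) | ✅ |
**Improvements**:
- Supplement: on rerun, scan the existing `.json` / `.partial`, find the `end_time` of the last segment, and pass it to the Python script as `--resume-from`
- Error correction: have the file-existence check validate `.json` with `serde_json::from_str`; an invalid file is treated as missing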
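### ASRX
| Aspect | Current state | Issue |
|------|------|------|
| Interruption resume | ❌ **Does not use the executor**; writes `.json` directly | A kill leaves a corrupted file behind |
| Supplement | ❌ Same as ASR | Depends on ASR; if ASR is incomplete, ASRX cannot run either |
| Error correction | ❌ Content is not validated | Same as above |
| Pipe | ❌ **Raw Command**, no `.tmp` protection | Urgent |
| Timeout | ⚠️ 7200s hardcoded | Should be None (same as ASR) |
**Improvements**:
- **Top priority**: switch to `executor.run()` to gain `.tmp` protection
- Otherwise the same as ASR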
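### YOLO
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ executor `.tmp` | ✅ |
| Supplement | ❌ Reruns from scratch | If killed at frame 100,000, the next run starts from frame 0 |
| Error correction | ❌ Content not validated | yolo.json was already corrupted before, yet the file check skipped it |
**Improvement plan**:
- Supplement: scan the `.partial` for the last frame and pass it to the Python script as `--resume-frame`
- Error correction: the file-existence check validates the `.json` with a JSON parse
### FACE / POSE / OCR
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ executor `.tmp` | ✅ |
| Supplement | ❌ Reruns from scratch | Same as YOLO |
| Error correction | ❌ Content not validated | Same as YOLO |
**Improvement plan**: Same as YOLO
### CUT
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ executor `.tmp` | ✅ |
| Supplement | ✅ Already completed in the register stage; loaded directly | ✅ |
| Error correction | ❌ Content not validated | Same as YOLO |
**Improvement plan**: Error correction only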
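### SCENE
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ **Most complete**: checks three states, `.err`/`.json`/`.tmp` | ✅ |
| Supplement | ❌ Reruns from scratch | ✅ (scene is fast) |
| Error correction | ⚠️ Checks `.err` | ✅ |
### VISUAL_CHUNK
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ executor `.tmp` | ✅ |
| Supplement | ❌ | ❌ |
| Error correction | ❌ **Errors are swallowed** (an empty result is returned) | Should report the error instead of failing silently |
**Improvement plan**: Stop swallowing errors; propagate them upward
### STORY
| Aspect | Current state | Problem |
|------|------|------|
| Interruption resume | ✅ executor `.tmp` | ✅ |
| Supplement | ❌ | ❌ |
| Error correction | ❌ | ❌ |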
---
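## Priorities
### P0: Fix immediately
1. **Switch ASRX to executor.run()**
- File: `src/core/processor/asrx.rs`
- Gains `.tmp` protection, SIGKILL of the process group, and `.partial` retention
- Remove the hardcoded timeout
### P1: Error correction
2. **Add JSON validation to the file-existence check**
- File: `src/worker/job_worker.rs`
- After `output_path.exists()`, run `serde_json::from_str::<Value>` on the `.json`
- If parsing fails → do not skip; treat the file as missing and run the processor
- If parsing succeeds but the content is empty (no segments/frames) → treat it as incomplete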
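The P1 check can be illustrated with a short sketch. The real change is planned in Rust (`job_worker.rs` with `serde_json`); the Python below only mirrors the decision logic, and the function names plus the assumption that outputs are JSON objects with a `segments` or `frames` list are hypothetical.

```python
import json
import os


def output_state(path, keys=("segments", "frames")):
    """Classify an output file as 'missing', 'invalid', 'incomplete' or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "invalid"  # unparsable: rerun as if the file were absent
    if not isinstance(data, dict):
        return "invalid"
    # Assumed meaning of "empty": none of the expected list keys holds any item.
    if not any(data.get(key) for key in keys):
        return "incomplete"
    return "ok"


def should_skip(path):
    """Only a parsable, non-empty output lets the worker skip the processor."""
    return output_state(path) == "ok"
```

Keeping the four states separate, rather than returning a bare boolean, leaves room for the worker to log why a file was rejected.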
### P2: Supplement
3. **ASR resume-from supplement**
- Files: `src/core/processor/asr.rs` + `scripts/asr_processor.py`
- When the Rust side finds a `.partial`, it reads the end_time of the last segment
- It passes `--resume-from {time}` to the Python script
- The Python script skips the audio before `--resume-from`
4. **YOLO/Face/Pose resume-frame supplement**
- Files: each processor.rs + the corresponding Python script
- Scan the `.partial` for the last frame_number
- Pass `--resume-frame {frame}` to the Python script
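A rough sketch of how the resume arguments could be derived from a `.partial` file. The layout assumed here (a JSON object with a `segments` list carrying `end_time`, or a `frames` list carrying `frame_number`) and the function name are illustrative assumptions; whether the processor should restart at the last frame or the one after it is also left open by the plan, so the last seen value is passed through unchanged.

```python
import json


def resume_args(partial_path, kind):
    """Build extra CLI args from a `.partial` file; return [] when nothing is reusable."""
    try:
        with open(partial_path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return []  # missing or truncated partial: fall back to a full run
    if not isinstance(data, dict):
        return []
    if kind == "asr":
        segments = data.get("segments") or []
        if not segments:
            return []
        return ["--resume-from", str(max(s["end_time"] for s in segments))]
    if kind == "frames":
        frames = data.get("frames") or []
        if not frames:
            return []
        return ["--resume-frame", str(max(f["frame_number"] for f in frames))]
    raise ValueError(f"unknown kind: {kind}")
```

A partial that was cut off mid-write will usually not parse as JSON at all, so in practice the writer would need to emit line-delimited records or flush complete documents for this to find anything.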
### P3: Other
5. **VisualChunk must not swallow errors**
6. **Executor two-stage shutdown: SIGTERM → SIGKILL**


@@ -0,0 +1,240 @@
# Momentry Model: Phased Delivery
## Core Architecture
```
Pipeline (training)
 │ each processor emits a .json
 │ Rule 1/3 Ingestion → chunks + embeddings
momentry model for {video} ← one model per video
 │ release/phase1/latest/
 │ release/phase2/latest/
momentry core (inference engine) ← Rust API server
 │ momentry_playground (dev)
 │ momentry (production)
Search / Query / Identity APIs
```
- **Pipeline** = training phase (video → processor output → chunks → embeddings)
- **Model** = the per-video output package (output_json + chunks + vectors)
- **Engine** = momentry core, which consumes a model and serves the APIs (search, trace, identity)
Each video can have multiple model versions; the naming leaves room for upgrades:
| Model version | Qdrant Collection | Contents | Trigger |
|-----------|------------------|------|---------|
| `{uuid}_v1` | `momentry_dev_v1` | sentence chunk embedding (base) | ASR + ASRX + Rule 1 complete |
| `{uuid}_v2` | `momentry_dev_v2` | full pipeline + 5W1H | everything complete |
| `{uuid}_v3` | `momentry_dev_v3` | object identity + custom detector | v2 + object instance matching complete |
Versions coexist and never overwrite one another.
## Phases
### Phase 1: Sentence Chunk Embedding (base model)
**Trigger**: ASR + ASRX complete, then Rule 1 Ingestion + vectorize complete
**Deliverables**:
- `{uuid}.asr.json`
- `{uuid}.asrx.json`
- chunks (chunk_type = 'sentence')
- chunk_vectors (sentence embedding)
**Purpose**: end users can run semantic search
### Phase 2: Full Pipeline (v2 model)
**Trigger**: all processors complete + Rule 3 Ingestion + 5W1H Agent
**Deliverables**:
- Everything from Phase 1
- All `{uuid}.*.json` (cut, yolo, face, pose, ocr, ...)
- chunks (chunk_type = 'cut', 'visual', 'trace', 'story')
- chunk_vectors (summary embedding)
- identities / identity_bindings / face_detections
**Purpose**: full search + summaries + person identification
---
## Worker Pipeline
```
ASR done → ASRX done
Rule 1 Ingestion (sentence chunks)
vectorize_chunks (sentence embedding)
📦 Phase 1 release ───→ release/phase1/latest/ (base model)
Other processors continue (yolo, face, pose, ocr, ...)
Rule 3 Ingestion + 5W1H Agent
📦 Phase 2 release ───→ release/phase2/latest/ (full model)
```
## Output Directory Structure
```
release/
├── phase1/
│   ├── {version}_{timestamp}/
│   │   ├── output_json/ ← all completed .json files
│   │   ├── chunks.csv ← sentence chunks
│   │   ├── vectors.csv ← sentence embeddings
│   │   ├── schema.sql ← chunks table DDL
│   │   └── RELEASE_INFO.txt
│   └── latest → {version}_{timestamp}
└── phase2/
    ├── {version}_{timestamp}/
    │   ├── output_json/ ← all .json files
    │   ├── chunks.csv ← all chunks
    │   ├── vectors.csv ← all embeddings
    │   ├── identities.csv ← person identities
    │   ├── schema.sql ← full schema
    │   └── RELEASE_INFO.txt
    └── latest → {version}_{timestamp}
```
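One way the `latest` pointer in this layout could be maintained is an atomic symlink swap. The sketch below is an assumption about how packaging might do it, not a description of the actual release script; `publish_release` and the `.latest.tmp` name are hypothetical.

```python
import os


def publish_release(phase_dir, release_name):
    """Point `<phase_dir>/latest` at `release_name` by swapping a symlink atomically."""
    target = os.path.join(phase_dir, release_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    tmp_link = os.path.join(phase_dir, ".latest.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # A relative link keeps working if the whole release/ tree is moved or copied.
    os.symlink(release_name, tmp_link)
    # Renaming over the old link replaces it in one step on POSIX filesystems.
    os.replace(tmp_link, os.path.join(phase_dir, "latest"))
    return os.path.realpath(os.path.join(phase_dir, "latest"))
```

Readers of `release/phase1/latest/` then never observe a half-updated pointer, even while a new release directory is being written next to the old one.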
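## momentry model vs momentry core
| | momentry model | momentry core |
|---|---|---|
| Analogy | trained weights | inference engine |
| Contents | `.json` + chunks + vectors | Rust binary |
| Lifecycle | one produced per video | one binary serves all videos |
| Versions | `{uuid}_v1` (base) / `{uuid}_v2` / `{uuid}_v3` | `momentry_playground` / `momentry` |
| Delivered to | end users | deployment engineers |
---
## Wiki Mechanism: Every Model Can Be Adjusted
Each momentry model (`{uuid}_v1` / `v2` / `v3`) is not just a read-only artifact; it can be improved continuously through the wiki mechanism.
### How It Differs from Traditional RAG
| | Traditional RAG | momentry wiki |
|---|---|---|
| Knowledge storage | vector DB (ephemeral) | model package (permanent) |
| How corrections are made | the LLM decides at query time whether to use them | users/Agents edit directly |
| Persistence of corrections | ❌ gone by the next query | ✅ written into the model and versioned |
| Model improvement | none (only the prompt changes) | merged as ground truth at the next version bump |
| Collaboration | one-way (retrieve → generate) | two-way (edit → merge → improve) |
| Offline use | ❌ needs vector DB + LLM | ✅ browse the wiki directory offline |
**momentry wiki is not a replacement for RAG; it is the lifecycle management mechanism for a model.**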
### Concept
```
momentry model (release package)
├── output_json/ ← read-only, processor output
├── chunks.csv ← read-only, ingestion output
├── vectors.csv ← read-only, embedding output
└── wiki/ ← editable, user-contributed knowledge
    ├── identities.json ← "trace 5 = Audrey Hepburn"
    ├── objects.json ← "object 42 = stamp #1"
    ├── corrections.json ← "ASR 'Hello' → 'Halo'"
    └── changelog.json ← edit history
```
### Data Flow
```
User/Agent edits the wiki
DB writes to wiki_entries + wiki_revisions
Merged into the model at the next release packaging
TKG label updated (tkg_nodes.label)
New model version bump
```
### Relationship to TKG
Identity and object annotations from the wiki are written back to the TKG node label:
```
(face_trace:5) label="Audrey Hepburn" ← wiki edit
(object_instance:42) label="stamp #1" ← wiki edit
```
As these edits accumulate, they can serve as ground truth for training the next model version.
### Implementation Direction
**DB layer**: new tables `wiki_entries` + `wiki_revisions`:
```sql
wiki_entries (target_type, target_id, title, body, summary, status, version, file_uuid)
wiki_revisions (entry_id, version, title, body, summary, change_summary, edited_by)
```
**API layer**: CRUD + version history:
```
GET /api/v1/wiki/{target_type}/{target_id}
PUT /api/v1/wiki/{target_type}/{target_id}
GET /api/v1/wiki/{target_type}/{target_id}/revisions
POST /api/v1/wiki/search
```
**Packaging layer**: `release_pack.py` gains a wiki export that ships alongside the model
---
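## Phase 3: Object Identity (v3 model)
### Goal
Extract key objects from the video (stamps, pistols, envelopes, magnifying glasses, ...) and perform instance-level cross-frame tracking and recognition of same-class objects, similar to face trace: not just detecting the class, but telling "this stamp" apart from "that stamp".
### Current Problems
1. **COCO's 80 classes do not cover the key objects**: stamps, pistols, envelopes, magnifying glasses and the like are not in the COCO dataset
2. **YOLOv5 nano has a low detection rate**: even for COCO classes (knife, cell phone), recall on the nano model is insufficient
3. **No object instance matching**: only frame-level detection exists today; there is no cross-frame object tracking
### Technical Direction
```
YOLOv8m/OWL-ViT → better detection coverage
Object Tracker (IoU + embedding, similar to the face tracker)
object_trace → TKG CO_OCCURS_WITH edges
object identity → recognize the same object across scenes
```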
| Direction | Method | Effect |
|------|------|------|
| Model upgrade | `yolov5nu` → `yolov8s.pt` / `yolov8m.pt` | Better COCO recall |
| Custom fine-tune | Collect stamps/guns data and fine-tune YOLO | Detects non-COCO objects |
| Zero-shot | OWL-ViT / Grounding DINO by text prompt | No training needed, but slow |
| Object trace | IoU + embedding matching across frames | Instance-level tracking |
| Object identity | Clustering to recognize the same object across scenes | Search the whole film for "this gun" |
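The "Object trace" row can be illustrated with a greedy IoU matcher. This is a simplified sketch: it uses IoU only (the embedding similarity mentioned above is omitted), matches against the previous frame only, and every name in it is hypothetical rather than taken from the face tracker.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_traces(frames, min_iou=0.3):
    """Give each detection a trace id.

    `frames` is a list of per-frame detection lists, each detection a
    (label, box) pair. A detection joins the best-overlapping unused trace of
    the same label from the previous frame, otherwise it opens a new trace.
    Returns one list of trace ids per frame, parallel to the input.
    """
    next_id = 0
    previous = []  # (trace_id, label, box) for the previous frame
    result = []
    for detections in frames:
        used, current, ids = set(), [], []
        for label, box in detections:
            best_id, best_score = None, min_iou
            for trace_id, prev_label, prev_box in previous:
                if prev_label != label or trace_id in used:
                    continue
                score = iou(box, prev_box)
                if score >= best_score:
                    best_id, best_score = trace_id, score
            if best_id is None:
                best_id = next_id
                next_id += 1
            used.add(best_id)
            current.append((best_id, label, box))
            ids.append(best_id)
        previous = current
        result.append(ids)
    return result
```

A production tracker would also need to bridge missed detections and occlusions, which is where an appearance embedding becomes necessary.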
### TKG Integration
```
face_trace -[:CO_OCCURS_WITH]-> object_instance:5 (this gun)
face_trace -[:CO_OCCURS_WITH]-> object_instance:42 (this stamp)
Query: "shots of Audrey Hepburn holding this gun"
→ face_trace:5 -[:SPEAKS_AS]-> SPEAKER_0
→ face_trace:5 -[:CO_OCCURS_WITH]-> object_instance:5
```
### Delivery Order
1. YOLO model upgrade (low difficulty, immediate payoff)
2. Object tracker (medium difficulty; see the face tracker implementation)
3. Custom fine-tune / zero-shot (high difficulty; needs data or a new model)


@@ -0,0 +1,361 @@
---
document_type: "design"
service: "MOMENTRY_CORE"
title: "TMDb 整合 — Identity 檔案系統設計"
date: "2026-05-16"
version: "V1.0"
status: "completed"
owner: "M5"
created_by: "OpenCode"
tags:
- "tmdb"
- "identity"
- "cache"
- "file-system"
- "resource"
- "design"
ai_query_hints:
- "查詢 TMDb Identity 檔案系統設計的內容"
- "TMDb 整合的三個階段是什麼"
- "如何從 cache 建立 TMDb identities"
- "identity 檔案化目錄結構"
- "TMDb resource API endpoint 列表"
- "TMDb face matching 整合位置"
related_documents:
- "REFERENCE/Face_Pipeline.md"
- "REFERENCE/Trace_Structure.md"
- "REFERENCE/Demo_EndToEnd.md"
- "REFERENCE/Services_Inventory.md"
---
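# TMDb Integration: Identity File System Design V1.0
| Item | Value |
|------|------|
| Created by | OpenCode |
| Created on | 2026-05-16 |
| Document version | V1.0 |
| Status | Completed |
---
## Version History
| Version | Date | Purpose | Operator | Tool/Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-05-16 | Three-phase TMDb integration design: identity file persistence, Agent Cache, resource management | OpenCode | DeepSeek V4 Flash |
---
## Overview
Three plans, implemented in sequence, establish a filesystem copy of each identity and integrate TMDb as an external resource:
1. **Plan 1: Identity file persistence**: every identity has a full backup at `{OUTPUT}/identities/{uuid}/identity.json`
2. **Plan 2: TMDb Agent + Cache**: the only outbound connection point; fetch the TMDb API → cache to `{uuid}.tmdb.json`
3. **Plan 3: TMDb resource management**: resource endpoint + health integration
### Design Principles
- **Fully local by default**: TMDb is the only service that needs an outbound connection and is treated as an optional plugin
- **Cache-first**: the TMDb API is called only once; everything afterwards is read from the local cache
- **Dual-write**: DB and filesystem stay consistent
- **Filesystem as canonical snapshot**: the DB is the primary store; the filesystem is a portable offline copy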
---
## Plan 1: Identity File Persistence
### Purpose
Create a filesystem snapshot for each identity so that identity data is:
- **Portable**: `cp -r identities/` to another machine is enough
- **Inspectable**: `cat {uuid}/identity.json` shows the complete identity data
- **Backup-friendly**: a tar of identities/ is a complete identity backup
- **Available offline**: basic identity information is reachable without the DB
### Directory Structure
```
{OUTPUT_DIR}/
├── identities/
│   ├── _index.json ← { uuid: name } index
│   ├── a9a901056d6b46ff92da0c3c1a57dff4/
│   │   └── identity.json ← V1: complete identity information
│   └── b0b101167e8c4a53a0.../
│       └── identity.json
└── {file_uuid}.tmdb.json ← V2: TMDb raw cache
```
### identity.json Format
```json
{
"version": 1,
"identity_uuid": "a9a901056d6b46ff92da0c3c1a57dff4",
"name": "Cary Grant",
"identity_type": "people",
"source": "tmdb",
"status": "confirmed",
"tmdb_id": 112,
"tmdb_profile": "https://image.tmdb.org/t/p/w185/abc.jpg",
"metadata": {
"tmdb_character": "Peter Joshua",
"tmdb_cast_order": 0,
"tmdb_movie_id": 4808
},
"file_bindings": [
{
"file_uuid": "3a6c1865...",
"trace_ids": [10, 23],
"face_count": 12
}
],
"created_at": "2026-05-16T12:00:00Z",
"updated_at": "2026-05-16T12:30:00Z"
}
```
### _index.json Format
```json
{
"version": 1,
"updated_at": "2026-05-16T12:00:00Z",
"entries": {
"a9a901056d6b46ff92da0c3c1a57dff4": "Cary Grant",
"b0b101167e8c4a53a09d6c2a68e0abf1": "Audrey Hepburn"
}
}
```
### Write Strategy (Dual-write)
Any identity change → DB write → `save_identity_file()` → filesystem write
```
Where identity changes occur:
├── TMDb probe (probe.rs) → create_identities_from_data() → save_identity_file() per identity
├── Face matching API (identity_agent_api.rs) → match_faces_iterative() → save_identity_file() per matched identity
├── Face matching Worker P2.5 (job_worker.rs) → match_faces_against_tmdb() → save_identity_file() per affected identity
├── Manual bind/unbind (identity_binding.rs) → bind/unbind handler → save_identity_file() per identity
└── One-time migration (migrate_identity_files.py) → persist all identities to files
```
### API: `storage.rs`
```rust
// structs
IdentityFile { version, identity_uuid, name, identity_type, source, status,
tmdb_id, tmdb_profile, metadata, file_bindings, created_at, updated_at }
FileBinding { file_uuid, trace_ids, face_count }
// core functions
identity_dir(uuid: &str) -> PathBuf
read_identity_file(uuid: &str) -> Result<IdentityFile>
write_identity_file(file: &IdentityFile) -> Result<()>
list_identity_uuids() -> Result<Vec<String>>
count_identity_files() -> usize
// index
read_index() -> Result<HashMap<String, String>>
update_index(uuid: &str, name: &str) -> Result<()>
// dual-write hook
async fn save_identity_file(db: &PostgresDb, uuid: &str) -> Result<()>
// 1. Query the DB for the full identity data
// 2. Query the DB for file_bindings
// 3. Write identity.json
// 4. Update _index.json
```
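The filesystem half of the dual-write hook might look roughly like the sketch below. It is written in Python for brevity and is not the `storage.rs` implementation: the DB lookups (steps 1 and 2) are replaced by an `identity` dict passed in by the caller, and the helper `_atomic_write_json` is a hypothetical name.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def _atomic_write_json(path, payload):
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(path)
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)


def save_identity_file(output_dir, identity):
    """Steps 3 and 4 of the hook: write identity.json, then refresh _index.json."""
    root = os.path.join(output_dir, "identities")
    uuid = identity["identity_uuid"]
    _atomic_write_json(os.path.join(root, uuid, "identity.json"), identity)

    index_path = os.path.join(root, "_index.json")
    try:
        with open(index_path, "r", encoding="utf-8") as fh:
            index = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        index = {"version": 1}  # rebuild a minimal index if absent or corrupted
    index.setdefault("entries", {})[uuid] = identity["name"]
    index["updated_at"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    _atomic_write_json(index_path, index)
    return index
```

Writing through a temp file and renaming means a reader (or a `cp -r` in progress) sees either the old file or the new one, never a half-written `identity.json`.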
### Change List
| # | File | Type | Contents |
|---|------|------|------|
| 1.1 | `src/core/identity/storage.rs` | NEW | IdentityFile struct + CRUD + index + save_identity_file() |
| 1.2 | `src/core/identity/mod.rs` | NEW | module declaration |
| 1.3 | `src/core/mod.rs` | EDIT | `pub mod identity;` |
| 1.4 | `src/core/db/postgres_db.rs` | EDIT | `get_identity_file_bindings(uuid)` helper |
| 1.5 | `src/core/tmdb/probe.rs` | EDIT | hook: save_identity_file() |
| 1.6 | `src/api/identity_binding.rs` | EDIT | hook: bind/unbind |
| 1.7 | `src/api/identity_agent_api.rs` | EDIT | hook: match_faces_iterative |
| 1.8 | `src/worker/job_worker.rs` | EDIT | hook: P2.5 matching |
| 1.9 | `src/api/server.rs` | EDIT | health/detailed: identities section |
| 1.10 | `scripts/migrate_identity_files.py` | NEW | one-time migration DB→filesystem |
---
## Plan 2: TMDb Agent + Cache
### Purpose
Make TMDb "the only outbound connection point + local cache" so that identity enrichment works fully offline.
### Directory Structure
```
{OUTPUT_DIR}/
├── {file_uuid}.tmdb.json ← TMDb raw cache (file-level)
├── identities/{uuid}/
│ └── identity.json ← Processed identity (identity-level)
```
### Cache Format (`{uuid}.tmdb.json`)
```json
{
"file_uuid": "3a6c1865...",
"fetched_at": "2026-05-16T12:00:00Z",
"source": "agent",
"movie": {
"tmdb_id": 4808,
"title": "Charade",
"release_date": "1963-12-05",
"overview": "After Regina Lampert...",
"poster_path": "/8wvQp...jpg"
},
"cast": [
{
"name": "Cary Grant",
"character": "Peter Joshua",
"profile_path": "/abc123.jpg",
"order": 0
}
],
"cast_count": 20,
"identities_created": 0
}
```
### Flow
```
Step 1: POST /agents/tmdb/prefetch
 → tmdb_agent.py (the only outbound call) → TMDB API search → credits
 → writes {uuid}.tmdb.json (source: agent)
Step 2: POST /file/:uuid/tmdb-probe
 → probe_from_cache() reads {uuid}.tmdb.json
 → INSERT identities (source='tmdb')
 → spawn tmdb_embed_extractor.py (background)
 → save_identity_file() for each identity (Plan 1 hook)
Step 3: POST /agents/identity/analyze (existing endpoint)
 → match_faces_iterative() automatically includes TMDb identities
```
### probe.rs Refactor
```rust
// New (reads the cache)
pub async fn probe_from_cache(db, file_uuid) -> Result<TmdbProbeResult> {
    let cache = cache::read_tmdb_cache(file_uuid)?;
    create_identities_from_data(db, file_uuid, &cache.movie, &cache.cast).await
}
// Shared internal function (extracted from probe_movie)
async fn create_identities_from_data(db, file_uuid, movie, cast) -> Result<TmdbProbeResult> {
    // The INSERT + embed spawn + store logic that used to live in probe_movie
    // Calls save_identity_file() per identity at the end
}
// Kept (direct API call, fallback)
pub async fn probe_movie(db, filename, file_uuid) -> Result<...> {
    let movie_name = extract_movie_name(filename)?;
    // search TMDB API → credits
    // Optionally write the cache for next time
    create_identities_from_data(db, file_uuid, &movie, &cast).await
}
```
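The cache-first rule behind this refactor can be summarized in a few lines. The sketch below is illustrative Python, not the Rust code: `load_tmdb_data` is a hypothetical name, and `fetch` stands in for whatever performs the single outbound TMDb call.

```python
import json
import os


def load_tmdb_data(output_dir, file_uuid, fetch=None):
    """Return (data, source), preferring `{file_uuid}.tmdb.json` over the network.

    `fetch` is an optional callable taking `file_uuid`; it is invoked only on a
    cache miss, and its result is written to the cache for later calls.
    """
    path = os.path.join(output_dir, f"{file_uuid}.tmdb.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh), "cache"
    if fetch is None:
        raise FileNotFoundError(f"no TMDb cache for {file_uuid} and no fetcher given")
    data = fetch(file_uuid)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
    return data, "api"
```

Calling it twice with the same UUID hits the fetcher once; on an offline machine the second call still succeeds because only the local file is read.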
### Change List
| # | File | Type | Contents |
|---|------|------|------|
| 2.1 | `src/core/tmdb/cache.rs` | NEW | TmdbCache struct + read/write |
| 2.2 | `src/core/tmdb/mod.rs` | EDIT | `pub mod cache;` `pub mod status;` |
| 2.3 | `src/core/tmdb/probe.rs` | EDIT | refactor: probe_from_cache() + create_identities_from_data() |
| 2.4 | `scripts/tmdb_agent.py` | NEW | fetch TMDB API → cache tmdb.json |
| 2.5 | `src/api/tmdb_api.rs` | NEW | 5 routes + 5 handlers |
| 2.6 | `src/api/server.rs` | EDIT | `.merge(tmdb_routes())` |
---
## Plan 3: TMDb Resource Management
### Purpose
Bring TMDb under system monitoring and management as a managed resource.
### health/detailed Extension
```json
{
"integrations": {
"tmdb": {
"api_key_configured": true,
"enabled": true,
"api_reachable": true,
"api_latency_ms": 120,
"api_error": null,
"last_check_at": "2026-05-16T12:00:00Z"
}
},
"identities": {
"directory_exists": true,
"files_count": 3481,
"index_ok": true,
"db_count": 3481,
"synced": true
}
}
```
### API
| Method | Path | Description |
|--------|------|------|
| `GET` | `/api/v1/resource/tmdb` | Full TMDb status + stats + cache count |
| `POST` | `/api/v1/resource/tmdb/check` | Ping the TMDb API → update the health status |
### Change List
| # | File | Type | Contents |
|---|------|------|------|
| 3.1 | `src/core/tmdb/status.rs` | NEW | check_tmdb_api(), count_tmdb_identities(), count_cache_files() |
| 3.2 | `src/api/tmdb_api.rs` | EDIT | GET/POST resource endpoints |
| 3.3 | `src/api/server.rs` | EDIT | integrations in health/detailed |
---
## Full API Table (Plan 2 + 3)
| Method | Path | Handler | Plan | Description |
|--------|------|---------|------|-------------|
| `POST` | `/api/v1/agents/tmdb/prefetch` | `prefetch_tmdb` | 2 | agent fetch TMDB → cache |
| `POST` | `/api/v1/file/:file_uuid/tmdb-probe` | `tmdb_probe` | 2 | cache → identities |
| `GET` | `/api/v1/file/:file_uuid/tmdb-cache` | `tmdb_cache_view` | 2 | view raw cache |
| `GET` | `/api/v1/resource/tmdb` | `tmdb_resource_status` | 3 | full TMDb status |
| `POST` | `/api/v1/resource/tmdb/check` | `tmdb_resource_check` | 3 | ping health check |
## Migration
One-time script: `scripts/migrate_identity_files.py`
```bash
python3 scripts/migrate_identity_files.py
# → reads the DB identities table → writes identity files → builds the index
```
---
## Execution Order
```
Plan 1 (identity file persistence) → Plan 2 (TMDb agent) → Plan 3 (TMDb resource management)
1.1 → 1.2 → 1.3 → 2.1 → 2.2 → 2.3 → 3.1 → 3.2 → 3.3
1.4 → 1.5 → 1.6 → 2.4 → 2.5 → 2.6
1.7 → 1.8 → 1.9 →
1.10
```


@@ -0,0 +1,101 @@
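# Trace Search API Design
## Concept
A trace is a kind of chunk.
Existing chunk_type values: `cut`, `sentence`, `visual`, `story`
New chunk_type: `trace`
Each trace (a person's tracking trajectory across frames) is simply a time interval plus the ASR text within that interval.
It is exactly like every other chunk; only the segmentation dimension differs:
- cut chunk = shot change
- sentence chunk = sentence boundary
- visual chunk = combination of on-screen objects
- **trace chunk = interval in which a person appears + the text spoken during it**
This lets a trace go straight into the existing `chunks` table and share the whole embedding, search, and Qdrant sync machinery, with no new table required.
## Existing Structure of the chunks Table
```sql
chunks (
 id, file_uuid, chunk_type, -- 'trace' added
 start_frame, end_frame, start_time, end_time,
 text_content, -- ASR text within the trace interval
 embedding, -- pgvector of text_content
 metadata JSONB, -- { trace_id, face_count, identity_id, identity_name }
 ...
)
```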
## Data Generation Flow (worker extension)
After face processing + `store_traced_faces.py` complete:
1. Query `face_detections` and aggregate `MIN(frame)`, `MAX(frame)`, `COUNT(*)` per trace
2. For each trace, query `pre_chunks WHERE processor_type='asr'` for text overlapping the trace time range
3. Combine the text → EmbeddingGemma produces the `embedding`
4. Write to `chunks` with `chunk_type='trace'`; metadata contains `trace_id`, `face_count`, `identity_id`
5. The embedding goes into Qdrant automatically (same collection as existing chunks)
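Steps 1, 2 and 4 can be sketched in memory as follows. This is only an illustration under assumptions: the real flow runs as SQL against `face_detections` and `pre_chunks`, the ASR segment field names (`start_time`, `end_time`, `text`) are guesses, and the embedding step is left out.

```python
def trace_text(trace_start, trace_end, asr_segments):
    """Join the text of ASR segments whose time span overlaps the trace interval."""
    hits = [
        seg for seg in asr_segments
        if seg["start_time"] < trace_end and seg["end_time"] > trace_start
    ]
    hits.sort(key=lambda seg: seg["start_time"])
    return " ".join(seg["text"].strip() for seg in hits if seg["text"].strip())


def build_trace_chunk(file_uuid, trace_id, frames, fps, asr_segments, identity_name=None):
    """Assemble a 'trace' chunk row from the frame numbers of one face trace."""
    start_frame, end_frame = min(frames), max(frames)
    start_time, end_time = start_frame / fps, end_frame / fps
    return {
        "file_uuid": file_uuid,
        "chunk_type": "trace",
        "start_frame": start_frame,
        "end_frame": end_frame,
        "start_time": start_time,
        "end_time": end_time,
        "text_content": trace_text(start_time, end_time, asr_segments),
        "metadata": {
            "trace_id": trace_id,
            "face_count": len(frames),
            "identity_name": identity_name,
        },
    }
```

Because the result has the same shape as any other chunk row, the existing embedding and Qdrant sync steps would not need to know it came from a face trace.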
## Search API Extension
Universal Search's `types` already supports `"chunk"`.
Filtering on `chunk_type = 'trace'` within chunk search is all that is needed.
**Request**
```json
{
"query": "open the door",
"types": ["chunk"],
"filters": { "chunk_type": "trace" },
"uuid": "aeed71342a899fe4b4c57b7d41bcb692",
"page": 1,
"page_size": 20
}
```
**Response** (identical to the existing Chunk result):
```json
{
"type": "chunk",
"chunk_id": "chunk_42",
"chunk_type": "trace",
"start_frame": 45200, "end_frame": 45900,
"start_time": 1808.0, "end_time": 1836.0,
"score": 0.87,
"text": "Open the door. Come on, hurry up.",
"metadata": {
"trace_id": 5,
"face_count": 42,
"identity_name": "Audrey Hepburn"
}
}
```
This reuses the existing `SearchResult::Chunk` variant as-is; no new enum variant is needed.
### Search Query
```sql
SELECT c.*
FROM dev.chunks c
WHERE c.file_uuid = $1
AND c.chunk_type = 'trace'
AND c.embedding IS NOT NULL
ORDER BY c.embedding <=> $2
LIMIT $3;
```
## Summary
| Item | Approach |
|------|------|
| New table | ❌ Not needed |
| New enum variant | ❌ Not needed |
| SearchResult changes | ❌ Not needed |
| New chunk_type | ✅ `'trace'` |
| Worker extension | ✅ Generate trace chunks (after face is done) |
| SearchFilters extension | ✅ Add a `chunk_type` filter |
| Qdrant | ✅ Automatic (existing chunk collection) |

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,264 @@
---
document_type: "reference_doc"
service: "MOMENTRY_CORE"
title: "Video Registration"
date: "2026-03-25"
version: "V1.0"
status: "active"
owner: "Warren"
created_by: "OpenCode"
tags:
- "video"
- "registration"
ai_query_hints:
- "查詢 Video Registration 的內容"
- "Video Registration 的主要目的是什麼?"
- "如何操作或實施 Video Registration"
---
# Video Registration
| Item | Value |
|------|------|
| Created by | Warren |
| Created on | 2026-03-25 |
| Document version | V1.1 |
---
## Version History
| Version | Date | Purpose | Operator | Tool/Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-03-25 | Document created | Warren | OpenCode |
| V1.1 | 2026-03-26 | Fixed curl examples; added the API Key auth header | OpenCode | deepseek-reasoner |
---
## Overview
The video registration API (`POST /api/v1/register`) adds a video to the Momentry Core system for processing.
## Path Format
### Supported Path Formats
| Format | Example | Notes |
|------|------|------|
| Relative path | `./demo/video.mp4` | Recommended |
| Relative path (no `./`) | `demo/video.mp4` | `./` is added automatically |
| Absolute path | `/Users/.../sftpgo/data/demo/video.mp4` | Supported but not recommended |
### Path Structure
```
./username/filepath
│ │        │
│ │        └── file path (may span multiple directory levels)
│ └── username (SFTPgo user directory name)
└── relative path prefix
```
**Examples**
- `./demo/video.mp4` → username=`demo`, filepath=`video.mp4`
- `./demo/movies/2024/video.mp4` → username=`demo`, filepath=`movies/2024/video.mp4`
- `./warren/project1/interview.mp4` → username=`warren`, filepath=`project1/interview.mp4`
## UUID Computation
### Rule
```
UUID = SHA256(username/filepath)[0:16]
```
**Example**
```rust
// path: ./demo/video.mp4
// username: "demo"
// filepath: "video.mp4"
// key: "demo/video.mp4"
// UUID: SHA256("demo/video.mp4")[0:16]
```
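### Properties
| Property | Description |
|------|------|
| User isolation | The same filename under different users yields different UUIDs |
| Consistency | The same relative path always yields the same UUID |
| Migration-safe | The UUID stays the same after the SFTPgo data path changes |
### Examples
```rust
// Video belonging to user demo
compute_uuid_from_relative_path("./demo/video.mp4")
// → "9760d0820f0cf9a7"
// Same filename belonging to user warren
compute_uuid_from_relative_path("./warren/video.mp4")
// → "<16 hex characters>" (placeholder; a different UUID from the one above)
```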
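## Duplicate Registration Check
### Behavior
1. The system checks whether the UUID already exists in the database
2. If it exists, the API returns `already_exists: true` with the existing video's information
3. If it does not exist, a new video record is created
### API Response
**New registration**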
```json
{
"uuid": "9760d0820f0cf9a7",
"video_id": 18,
"job_id": 2,
"file_name": "video.mp4",
"duration": 159.637188,
"width": 640,
"height": 360,
"already_exists": false
}
```
**Duplicate registration**
```json
{
"uuid": "9760d0820f0cf9a7",
"video_id": 18,
"job_id": 2,
"file_name": "video.mp4",
"duration": 159.637188,
"width": 640,
"height": 360,
"already_exists": true
}
```
## SFTPgo Integration
### Directory Structure
SFTPgo's user directory layout:
```
/Users/accusys/momentry/var/sftpgo/data/
├── demo/ ← user directory
│   ├── video.mp4
│   └── movies/
│       └── movie1.mp4
├── warren/ ← user directory
│   └── project1/
│       └── interview.mp4
└── momentry/ ← user directory
    └── presentation.mp4
```
### Registration Flow
1. SFTPgo users upload files to their own directories
2. n8n or another service calls the registration API
3. The relative path format is used: `./username/filepath`
4. The system computes the UUID and checks for duplicates
5. A processing job is created
## Code Examples
### Registering a Video
```bash
# Register with a relative path
curl -X POST http://localhost:3002/api/v1/register \
 -H "Content-Type: application/json" \
 -H "X-API-Key: YOUR_API_KEY" \
 -d '{"path": "./demo/video.mp4"}'
# Or with nested directories
curl -X POST http://localhost:3002/api/v1/register \
 -H "Content-Type: application/json" \
 -H "X-API-Key: YOUR_API_KEY" \
 -d '{"path": "./demo/movies/2024/video.mp4"}'
```
### UUID Computation Functions
```rust
// Compute the UUID from a relative path
pub fn compute_uuid_from_relative_path(relative_path: &str) -> String {
let (username, filepath) = extract_user_from_relative_path(relative_path);
compute_uuid(&username, &filepath)
}
// Extract the username and file path from a relative path
pub fn extract_user_from_relative_path(relative_path: &str) -> (String, String) {
let path = relative_path.strip_prefix("./").unwrap_or(relative_path);
let path_buf = PathBuf::from(path);
let mut components = path_buf.components();
let username = components
.next()
.map(|c| c.as_os_str().to_string_lossy().to_string())
.unwrap_or_default();
let filepath: String = components
.map(|c| c.as_os_str().to_string_lossy().to_string())
.collect::<Vec<_>>()
.join("/");
(username, filepath)
}
```
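The Rust excerpt above calls `compute_uuid` without showing it. The Python sketch below mirrors the documented rule `SHA256(username/filepath)[0:16]` end to end; it assumes the key is UTF-8 encoded and that the slice is taken from the lowercase hex digest (which matches the 16-character hex UUIDs shown in this document), so it is an approximation rather than a port of `uuid.rs`.

```python
import hashlib


def extract_user_from_relative_path(relative_path):
    """Split './username/file/path' into (username, 'file/path')."""
    path = relative_path[2:] if relative_path.startswith("./") else relative_path
    # Drop empty and '.' components, roughly as Rust's Path::components() does.
    parts = [part for part in path.split("/") if part not in ("", ".")]
    username = parts[0] if parts else ""
    return username, "/".join(parts[1:])


def compute_uuid(username, filepath):
    """First 16 hex characters of SHA-256 over 'username/filepath' (assumed UTF-8)."""
    key = f"{username}/{filepath}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def compute_uuid_from_relative_path(relative_path):
    username, filepath = extract_user_from_relative_path(relative_path)
    return compute_uuid(username, filepath)
```

Because only the `username/filepath` key is hashed, moving the SFTPgo data root does not change the UUID, while the same filename under two users hashes differently.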
## Related APIs
### Probe API (probe only, no registration)
To fetch video information without registering, use the Probe API:
```bash
curl -X POST http://localhost:3002/api/v1/probe \
-H "Content-Type: application/json" \
-H "X-API-Key: YOUR_API_KEY" \
-d '{"path": "./demo/video.mp4"}'
```
**Example response**
```json
{
"uuid": "a1b10138a6bbb0cd",
"file_name": "video.mp4",
"duration": 120.5,
"width": 1920,
"height": 1080,
"fps": 30.0,
"cached": false,
"format": {...},
"streams": [...]
}
```
**Differences from the Register API**
| Capability | Probe API | Register API |
|------|-----------|---------------|
| Compute UUID | ✓ | ✓ |
| Run ffprobe | ✓ | ✓ |
| Save probe.json | ✓ | ✓ |
| Write to videos table | ✗ | ✓ |
| Create monitor_job | ✗ | ✓ |
| Return job_id | ✗ | ✓ |
| Use case | Preview video information | Register and process the video |
## Related Files
| File | Description |
|------|------|
| `src/core/storage/uuid.rs` | UUID computation logic |
| `src/api/server.rs` | Register and Probe API implementation |
| `src/core/probe/ffprobe.rs` | ffprobe integration |
| `docs_v1.0/IMPLEMENTATION/SFTPGO_DEMO_USER.md` | SFTPgo user setup |
| `docs_v1.0/REFERENCE/API_ENDPOINTS.md` | API endpoint overview |


@@ -0,0 +1,201 @@
# Momentry Eye API Reference
**Vision Agent** — Multi-model zero-shot object detection service.
Port: `5052` | Resource IDs: `eye-gdino`, `eye-paligemma`
---
## Models
| Model | ID | Params | Size | Confidence | Speed | License |
|-------|-----|--------|------|------------|-------|---------|
| Grounding DINO | `grounding-dino` | 232M | 891MB | ✅ 0-1 score | ~340ms | Apache 2.0 |
| PaliGemma 3B | `paligemma` | 2,923M | ~3GB | ❌ no score | ~80ms | Gemma license |
## Endpoints
### `GET /health`
System status and loaded models.
```bash
curl localhost:5052/health
```
Response:
```json
{
"status": "ok",
"models_loaded": ["grounding-dino"],
"models_available": ["grounding-dino", "paligemma"],
"device": "mps",
"port": 5052
}
```
### `GET /models`
List available models with specs.
```bash
curl localhost:5052/models
```
### `POST /detect`
Detect objects in a single video frame.
```bash
curl localhost:5052/detect \
-H "Content-Type: application/json" \
-d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'
```
**Parameters:**
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `uuid` | string | `aeed71342a...` | Video file UUID |
| `time` | float | `0` | Timestamp in seconds |
| `prompt` | string | `"gun"` | Object to detect |
| `model` | string | `"grounding-dino"` | Model: `grounding-dino`, `paligemma`, or `fusion` |
| `threshold` | float | `0.1` | Minimum confidence (GDINO only) |
| `weights` | object | — | Fusion weights, e.g. `{"grounding-dino":0.6,"paligemma":0.4}` |
**Fusion mode** runs both models and combines results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.
```bash
# Fusion: run both models, combine results
curl localhost:5052/detect \
 -H "Content-Type: application/json" \
 -d '{"time":206, "prompt":"water gun", "model":"fusion"}'
# Custom fusion weights
curl localhost:5052/detect \
 -H "Content-Type: application/json" \
 -d '{"time":206, "prompt":"gun", "model":"fusion",
 "weights":{"grounding-dino":0.5,"paligemma":0.5}}'
```
**Response:**
```json
{
"model": "grounding-dino",
"detections": [
{"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
{"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
],
"time_ms": 345.2,
"n_detections": 2,
"shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}
```
**Fusion response** also includes `per_model` (detections per model) and `fusion` (deduplicated combined list with `fused_score`).
### `POST /search`
Search across a time range.
```bash
# Natural language query
curl localhost:5052/search \
 -H "Content-Type: application/json" \
 -d '{"query":"find the gun", "range":"5400-5600", "interval":10}'
```
**Parameters:**
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `query` | string | `"find the gun"` | Natural language query (parsed to extract object) |
| `target` | string | — | `file_uuid:chunk_id` or `file_uuid:trace_id` — resolves to time range |
| `range` | string | `"0-6780"` | Manual time range |
| `interval` | int | `30` | Scan interval in seconds |
| `model` | string | `"grounding-dino"` | Detection model |
| `threshold` | float | `0.15` | Minimum confidence |
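How `range` and `interval` combine into sampled timestamps can be sketched as below. This is an illustration only: `scan_times` is a hypothetical helper, and treating the end of the range as inclusive is an assumption about the agent's behaviour.

```python
def scan_times(range_str, interval):
    """Expand 'start-end' (seconds) into timestamps sampled every `interval` seconds."""
    start_text, end_text = range_str.split("-", 1)
    start, end = float(start_text), float(end_text)
    if interval <= 0:
        raise ValueError("interval must be positive")
    if end < start:
        raise ValueError("range end precedes range start")
    # Multiply instead of accumulating to avoid floating-point drift.
    count = int((end - start) // interval) + 1
    return [start + i * interval for i in range(count)]
```

Under these assumptions, the example request with `"range":"5400-5600"` and `"interval":10` would inspect 21 frames, while the default `0-6780` at 30 seconds would inspect 227.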
**Target resolution:**
| Format | Example | Resolves to |
|--------|---------|-------------|
| `file_uuid:chunk_id` | `uuid:uuid_story_90` | Chunk's time range |
| `file_uuid:trace_id` | `uuid:trace_5` | Trace's time range |
| `file_uuid:chunk_index` | `uuid:500` | Chunk index 500's range |
```bash
# Using target
curl localhost:5052/search \
 -H "Content-Type: application/json" \
 -d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'
# Using trace
curl localhost:5052/search \
 -H "Content-Type: application/json" \
 -d '{"target":"aeed71342...:trace_5", "query":"person"}'
```
### `POST /multimodal`
Multi-modal search across sentence chunks — combines ASR text match + visual confirmation.
```bash
# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
 -H "Content-Type: application/json" \
 -d '{"keyword":"Jean-Louis", "prompt":"child"}'
# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
 -H "Content-Type: application/json" \
 -d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'
```
**Parameters:**
| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `keyword` | string | — | ASR keyword to search in sentence text |
| `prompt` | string | same as keyword | Visual prompt for GDINO |
| `chunk_type` | string | `"sentence"` | `sentence`, `trace`, `story`, `cut` |
| `target` | string | — | Specific chunk target |
| `range` | string | `"0-6780"` | Time range (for non-sentence chunks) |
| `threshold` | float | `0.15` | Visual detection threshold |
### `GET /shots/<filename>`
Retrieve annotated detection images.
```bash
curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
```
## Object Detection Performance Summary
| Object type | Size in frame | GDINO | PaliGemma | Best prompt |
|-------------|--------------|-------|-----------|-------------|
| Gun (realistic) | 15-30% | ✅ 0.36-0.67 | ✅ | `pistol` / `handgun` |
| Water gun (toy) | 15-31% | ❌ 0 | ✅ | `water gun` (PaliGemma) |
| Child (Jean-Louis) | 30-60% | ⚠️ 0.3-0.9 | ❌ | `child` (high FP on adults) |
| Stamp | <5% | ❌ FP | ❌ | — |
| Passport | <10% | ❌ FP | ❌ | — |
| Magnifying glass | <5% | ❌ FP | ❌ | — |
| Cup / Bottle | 5-15% | ✅ 0.3-0.5 | — | `cup` / `bottle` |
| Cell phone | 5-10% | ✅ 0.3-0.5 | — | `cell phone` |
## Resource Registration
On startup, the agent auto-registers as resources in `dev.resources`:
| Resource ID | Type | Status |
|-------------|------|--------|
| `eye-gdino` | `vision_model` | `online` |
| `eye-paligemma` | `vision_model` | `online` |
Heartbeat updates every 60 seconds. Discover via:
```sql
SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
```
## Files
| File | Description |
|------|-------------|
| `scripts/vision_agent.py` | Vision Agent server (port 5052) |
| `output_dev/vision_shots/` | Annotated detection screenshots |
| `docs/ZERO_SHOT_DETECTION_RESEARCH.md` | Full model research report |


@@ -0,0 +1,105 @@
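# Visualization Tool Selection v1.0.0
A record of the visualization tools chosen for the Momentry frontend.
## SVG (built-in)
| Item | Details |
|------|------|
| Purpose | Trace timeline, swimlane chart, bar chart, matrix |
| License | Built into the browser; no licensing concerns |
| Used by | V1 TraceThumbnailTimeline, V2 IdentitySwimlane, V3 DurationHistogram, V4 SimilarityMatrix |
| Pros | Zero dependencies, crisp vectors, interactive |
| Cons | Performance degrades with a large number of nodes |
## Three.js
| Item | Details |
|------|------|
| Purpose | 3D face mesh, 3D space-time cube |
| License | **MIT**: commercial use allowed; the copyright notice must be retained |
| Used by | Face3DViewer (MediaPipe 468 landmarks), V5 3D Space-Time Cube |
| npm | `three` + `@types/three` |
| File | `node_modules/three/LICENSE` (MIT) |
| Bundle | About 120KB gzip |
| Pros | Complete WebGL wrapper, OrbitControls, large community |
| Cons | Dispose must be managed manually to avoid memory leaks |
## MediaPipe Face Mesh
| Item | Details |
|------|------|
| Purpose | Detects 468 3D facial landmarks |
| License | **Apache 2.0**: commercial use allowed |
| Used by | Face3DViewer |
| Deployment | `scripts/face_landmarks_server.py` (port 11437) |
| Input | Cropped face JPEG |
| Output | 478 (x, y, z) 3D coordinates (presumably the 468 mesh landmarks plus the 10 iris landmarks MediaPipe adds when iris refinement is enabled) |
| Pros | Lightweight, real-time, cross-platform |
| Cons | Frontal faces only, no texture |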
## Three.js Face3DViewer Memory Management
```typescript
// Correct dispose pattern
function disposeScene() {
cancelAnimationFrame(animId)
for (const obj of objects) {
scene?.remove(obj)
if (obj instanceof THREE.Mesh) {
obj.geometry?.dispose()
if (Array.isArray(obj.material)) obj.material.forEach(m => m.dispose())
else obj.material?.dispose()
}
if (obj instanceof THREE.Points) {
obj.geometry?.dispose()
if (obj.material) obj.material.dispose()
}
}
objects = []
controls?.dispose()
controls = null
if (renderer) { renderer.dispose(); renderer = null }
scene = null; camera = null
}
```
## 技術選型對照
| 視覺化 | 工具 | 授權 | Bundle | 狀態 |
|--------|------|:----:|:-----:|:----:|
| V0 Trace Grid | Vue + Tailwind | — | 0 KB | ✅ |
| V1 Thumbnail Timeline | SVG | — | 0 KB | ✅ |
| V2 Identity Swimlane | SVG | — | 0 KB | ✅ |
| V3 Duration Histogram | SVG | — | 0 KB | ✅ |
| V4 Similarity Matrix | SVG | — | 0 KB | ✅ |
| 3D Face Mesh | Three.js | MIT | ~120 KB | ✅ |
| V5 3D Space-Time Cube | Three.js | MIT | ~120 KB | 🔜 |
| Heatmap (Canvas) | Canvas 2D | — | 0 KB | 🔜 |
| Trace Video | ffmpeg | GPL | Separate process | ✅ |
| **Document rendering** | | | | |
| API docs | **Markdown** | — | 0 KB | ✅ |
| API diagrams | **Mermaid** (flowchart, sequence, ER, mindmap) | MIT | ~50 KB (VS Code plugin) | ✅ |
| CLI reading | **glow** (terminal MD renderer) | MIT | Standalone binary | ✅ |
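## Markdown
| Item | Details |
|------|------|
| Purpose | All API docs, design specs, and test reports |
| License | Plain-text format; no licensing concerns |
| Tools | VS Code built-in preview, `glow` CLI |
| Pros | Version-control friendly (readable diffs), plain text, cross-platform |
| Cons | No dynamic interactivity |
## Mermaid
| Item | Details |
|------|------|
| Purpose | API flow diagrams (sequence), architecture diagrams (flowchart), data models (ER), endpoint overviews (mindmap) |
| License | **MIT**: commercial use allowed |
| VS Code plugin | `Markdown Preview Mermaid Support` |
| Supported diagrams | flowchart, sequence, class, state, ER, mindmap, pie, gantt |
| File | `API_USAGE_GUIDE_V1.0.0.md` (contains 6 Mermaid diagrams) |
| Pros | Embedded in Markdown, version-control friendly, no screenshots needed |
| Cons | Requires plugin support outside VS Code/GitHub |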


@@ -0,0 +1,114 @@
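# Voice Interaction Technology Selection v1.0.0
Record of the voice technologies chosen for the Momentry Demo Runner.
## Speech Output (TTS)
### macOS `say` (adopted)
| Item | Details |
|------|------|
| Purpose | Reads the demo narration aloud |
| License | Built into macOS; no licensing concerns |
| Languages | 40+ languages, including Chinese (Meijia), English (Samantha), and Japanese (Kyoko) |
| Method | `subprocess.Popen(["say", "-v", "Meijia", "text"])` |
| Pros | No installation, no dependencies, low latency, multilingual |
| Cons | macOS only; no fine-grained control of speech rate |
**Conclusion**: the best-fit TTS option for Momentry. It is built into macOS, free, and has full multilingual support.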
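A minimal sketch of wrapping the `say` call shown in the table. The helper names `build_say_command` and `speak` are hypothetical (they are not taken from `scripts/demo_runner.py`); the sketch only assumes that `say` accepts `-v <voice>` followed by the text.

```python
import shutil
import subprocess


def build_say_command(text, voice="Meijia"):
    """Build the argv list for macOS `say` (hypothetical helper)."""
    return ["say", "-v", voice, text]


def speak(text, voice="Meijia"):
    """Start speaking without blocking; return None when `say` is missing."""
    if shutil.which("say") is None:  # e.g. not running on macOS
        return None
    return subprocess.Popen(build_say_command(text, voice))
```

Returning the `Popen` handle lets the caller wait for, or terminate, the narration when the user moves to another step.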
---
## Speech Input (Speech-to-Command)
### Option Comparison
| Option | Local/Cloud | Languages | Model size | Latency | Accuracy | License |
|------|:---------:|:----:|:--------:|:----:|:------:|:----:|
| **Vosk** (integrated) | ✅ **Local** | Chinese + English | 42MB | Real-time | Medium-high | Apache 2.0 |
| macOS NSSpeechRecognizer | ✅ Local | Multilingual | Built into system | Real-time | Medium | Built into macOS |
| Google Speech Recognition | ☁️ Cloud | 120+ languages | — | ~1s | High | Free (quota-limited) |
| Whisper (tiny) | ✅ Local | 100+ languages | ~150MB | ~2s | High | MIT |
| Porcupine | ✅ Local | Keywords | ~2MB | Real-time | High (keywords only) | Apache 2.0 |
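### Vosk (adopted as the local option)
| Item | Details |
|------|------|
| Model | `vosk-model-small-cn-0.22` (42MB, Chinese) |
| Languages | Chinese, English (each requires its own model download) |
| Method | Called directly through the Python `vosk` package |
| Pros | Fully local, real-time, handles Chinese and English, small model |
| Cons | One-time model download; accuracy drops in noisy environments |
| Scope | Detects command keywords only (next/stop/repeat/goto, etc.) |
### Google Speech Recognition (fallback)
| Item | Details |
|------|------|
| Purpose | Automatic fallback when the Vosk model is not installed |
| Method | Python `SpeechRecognition` + Google API |
| Pros | No model download, high accuracy, multilingual |
| Cons | **Requires network access**, ~1s latency per request, usage quota |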
### Integration Strategy
```
Start with --voice-control
├── Vosk model present? → use Vosk (local, offline)
└── Vosk model missing? → use Google (requires network)
    └── Also fails? → show "voice unavailable"
```
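The fallback order above can be sketched as a small selection function. This is an illustration only: `select_backend` and the `google_available` flag are hypothetical names, and the default path is the model location listed under the related files.

```python
from pathlib import Path

# Model location as listed in the related-files table
DEFAULT_MODEL_DIR = Path.home() / ".cache" / "vosk" / "vosk-model-small-cn-0.22"


def select_backend(model_dir=DEFAULT_MODEL_DIR, google_available=True):
    """Return "vosk", "google", or "unavailable" following the fallback order."""
    if Path(model_dir).is_dir():
        return "vosk"         # local, offline
    if google_available:
        return "google"       # requires network
    return "unavailable"      # caller shows the "voice unavailable" message
```

Keeping the decision in one pure function makes the degradation path easy to test without a microphone or network access.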
---
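## Demo Runner Integration
### Command Set (Chinese and English)
| Chinese | English | Function |
|:----:|:-------:|------|
| 下一個 / 繼續 | next / continue | Advance to the next step |
| 停止 | stop / quit | End the current demo |
| 重複 | repeat / again | Repeat the current narration |
| 跳到第 N 步 | go to N / step N | Jump to the specified step |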
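One way to map recognized text onto the commands in the table is sketched below. `parse_command` is a hypothetical helper, and it assumes the recognizer returns the phrases as written in the table, with step numbers as Arabic digits; real transcripts (spaced tokens, spelled-out numerals) would likely need extra normalization.

```python
import re

# Spoken phrases from the command table, mapped to canonical commands
_SIMPLE = {
    "下一個": "next", "繼續": "next", "next": "next", "continue": "next",
    "停止": "stop", "stop": "stop", "quit": "stop",
    "重複": "repeat", "repeat": "repeat", "again": "repeat",
}
# "go to N", "step N", or "跳到第 N 步"
_GOTO = re.compile(r"(?:go ?to|step|跳到第)\s*(\d+)\s*步?")


def parse_command(text):
    """Return ("next" | "stop" | "repeat", None), ("goto", N), or None."""
    text = text.strip().lower()
    match = _GOTO.search(text)
    if match:
        return ("goto", int(match.group(1)))
    for phrase, command in _SIMPLE.items():
        if phrase in text:
            return (command, None)
    return None
```

Checking the "goto" pattern first avoids misreading a phrase such as "go to 3" as one of the simple commands.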
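### Code Structure
```python
import queue
import time

commands = queue.Queue()  # shared by the listener thread and the main loop
# Background thread: listen for voice commands
def voice_command_listener(lang, recognize):
    # recognize(lang) is a placeholder: Vosk (local) first, then Google (cloud)
    for text in recognize(lang):
        commands.put(text)  # put the recognized command on the queue

def check_voice_command():
    return None if commands.empty() else commands.get_nowait()

# Main loop: poll the queue
def main(step=1):
    while True:
        cmd = check_voice_command()
        if cmd == "next":                      # advance
            step += 1
        elif cmd == "stop":                    # stop
            return step
        elif cmd and cmd.startswith("goto "):  # jump to step N
            step = int(cmd.split()[1])
        time.sleep(0.05)  # avoid busy-waiting
```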
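### How to Launch
```bash
# Local speech recognition (Vosk, no network needed)
python3 scripts/demo_runner.py --voice zh_TW --voice-control
# Fallback: if the Vosk model is not installed, Google is used automatically (requires network)
```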
---
## Related Files
| File | Description |
|------|------|
| `scripts/demo_runner.py` | Speech output + input integration |
| `~/.cache/vosk/vosk-model-small-cn-0.22/` | Vosk Chinese model (42MB) |
| `docs_v1.0/REFERENCE/DEMO_RUNNER_V1.0.0.md` | Demo Runner usage documentation |


@@ -0,0 +1,36 @@
# Speech Recognition Test Log v1.0.0
## Environment
- **Machine**: Mac Mini M4
- **Input device**: Display Audio (HDMI loopback)
- **Model**: Vosk small-en-us (40MB)
## Test Results
| Test | Settings | Max Level | Mean Level | Vosk result |
|------|------|:---------:|:----------:|:----------:|
| Raw audio 48kHz | pyaudio direct | 3510 | 654 | ❌ Empty |
| Denoised 16kHz | highpass200+lowpass4000+afftdn | 1224 | 110 | ❌ Empty |
| 3x gain | numpy boost | ~10K | ~1800 | ❌ Empty |
| ffmpeg recording | avfoundation :0 | 3698 | 636 | ❌ Empty |
## Findings
1. **Display Audio does receive audio** (mean ~600, max ~3500)
2. **Background noise is high** (a mean of 600 is far above the 10-50 typical of a normal microphone)
3. After denoising, the noise floor drops to a mean of 110, but recognition still fails
4. The Vosk small model is not tolerant enough of this noise
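The log does not say how Max Level and Mean Level were computed. The sketch below assumes they are the peak and mean absolute amplitude of signed 16-bit PCM samples, which is consistent with the magnitudes in the table; `audio_levels` is a hypothetical helper.

```python
import array


def audio_levels(pcm_bytes):
    """Return (max_level, mean_level) as absolute amplitudes of int16 samples."""
    samples = array.array("h")      # signed 16-bit, native byte order
    samples.frombytes(pcm_bytes)
    if not samples:
        return 0, 0.0
    magnitudes = [abs(s) for s in samples]
    return max(magnitudes), sum(magnitudes) / len(magnitudes)
```

Measuring a stretch of silence this way gives the noise floor, which can then be compared with the level measured during speech.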
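## Suspected Cause
Display Audio is an **HDMI audio return channel**; what it receives may be:
- Background noise from the display's built-in speakers
- Or electrical noise generated by the display itself
- It is unclear whether the display's microphone is actually routed back over HDMI
## To Try
- [ ] Whisper (local, high noise tolerance)
- [ ] Direct test with a USB microphone
- [ ] macOS built-in NSSpeechRecognizer (via PyObjC)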


@@ -0,0 +1,190 @@
# Zero-Shot Object Detection Model Research Report
**Date:** 2026-05-10
**Goal:** Evaluate models for detecting arbitrary objects in Charade (1963)
**System:** M5 MacBook Pro (Apple Silicon MPS, 48GB)
---
## Tested Models
| Model | Params | Size | Resolution | Type | License |
|-------|--------|------|------------|------|---------|
| YOLOv8n fine-tune (gun) | 3.2M | 6MB | 640px | Closed-set (4 classes) | AGPL-3.0 |
| OWL-ViT base | 109M | 586MB | 384px | Zero-shot | Apache 2.0 |
| **Grounding DINO Base** | **232M** | **891MB** | **384px** | **Zero-shot** | **Apache 2.0** |
| Grounding DINO Large | 232M | 895MB | 384px | Zero-shot | Apache 2.0 |
| Florence-2 Base | 231M | ~3GB | 384px | Zero-shot (generative) | MIT |
| Florence-2 Large | 776M | ~6GB | 384px | Zero-shot (generative) | MIT |
| PaliGemma 3B mix-224 | 2,923M | ~3GB | 224px | Zero-shot (generative) | Gemma license |
| PaliGemma 3B mix-448 | 2,923M | ~6GB | 448px | Zero-shot (generative) | Gemma license |
## Detection Performance on Charade
### Large Objects (gun)
| Model | 8 timepoints | Best confidence | Runtime |
|-------|-------------|----------------|---------|
| YOLOv8n fine-tune | ❌ 0/5 (all FP) | 0.45 (stamp→pistol) | 0.03s |
| OWL-ViT | ❌ 2/8 | 0.054 | 3.4s |
| **Grounding DINO Base** | **✅ 8/8** | **0.499** | **0.33s** |
| PaliGemma 3B mix-224 | ✅ 3/8 (gun), 3/8 overall | n/a (no confidence output) | 0.5-3s |
### Small Objects (stamp, passport, magnifying glass)
| Model | Stamp | Passport | Magnifying glass |
|-------|-------|----------|-----------------|
| Grounding DINO Base | ❌ FP (~0.3) | ❌ FP (~0.4) | ❌ FP (~0.3-0.5) |
| PaliGemma 3B mix-224 | ❌ no det | ❌ no det | not tested |
| PaliGemma 3B mix-448 | not tested | not tested | not tested |
**All tested models fail on objects smaller than ~50px at native 1920x1080 resolution.**
### Other Objects
| Object | YOLO COCO | Grounding DINO | Notes |
|--------|-----------|----------------|-------|
| knife | ✅ 368 frames | ✅ 84 hits | Small but detectable |
| cup | ✅ | ✅ 13 hits | Moderate size |
| bottle | ✅ | ✅ 12 hits | Moderate size |
| cell phone | ✅ | ✅ 5 hits | Hand-held |
| book | ✅ | ✅ 3 hits | Hand-held |
| car | ✅ | ✅ 9 hits | Large object |
| tie | ✅ | ✅ 139 hits | On-person (worn, not held) |
## Detailed Model Analysis
### Grounding DINO Base (Recommended)
**Scores:** Detection confidence 0.1-0.5 (typical for zero-shot)
**Timing per frame (MPS):**
| Component | Time | % of total |
|-----------|------|------------|
| Processor (text+image) | 17ms | 5% |
| Model inference | 310ms | 93% |
| Post-processing | 5ms | 2% |
| **Total** | **331ms** | **100%** |
**Multi-prompt batching:** 8 prompts in 335ms (42ms/prompt vs 309ms single)
**Memory:** ~1GB (MPS)
**License:** Apache 2.0 — fully commercial, no restrictions
### Grounding DINO Large
**Result:** Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as HuggingFace. The actual 7-dataset version (56.7 AP) was never released.
**Verdict: Do not use.** Base is identical and simpler.
### OWL-ViT
**Result:** Almost useless for this task. Maximum confidence was 0.054, with detections at only 2/8 timepoints.
**Verdict: Do not use.**
### Florence-2
**Issue:** `prepare_inputs_for_generation` bug in current transformers version. Cannot run inference without patching model code.
**Task format:** Uses task tokens (`<OD>`) instead of arbitrary text prompts. Cannot do "detect gun" directly — uses generic object detection.
**Verdict: Cannot use in current environment.**
### PaliGemma
**Result:** Works for gun detection (3/8) but misses small objects entirely.
**Key limitation:** No confidence score output (generative model). Either outputs bbox or nothing.
**Issues:**
- 224px variant: Too low resolution for small objects
- 448px variant: 6GB download, suspected better for detail but untested
- Gemma license may restrict commercial use vs Apache 2.0
**Verdict: Inferior to Grounding DINO for this use case.**
### YOLOv8n Fine-tune (Gun Detector)
| Item | Value |
|------|-------|
| Dataset | 905 images (Roboflow CC BY 4.0) |
| Classes | grenade, knife, pistol, rifle |
| Validation mAP50 | 0.813 |
| Charade FP rate | **100%** (all false positives) |
**Root cause:** Training images are close-up gun photos; Charade has distant/partial guns. Distribution mismatch makes this model unusable.
**Verdict: Requires completely new training dataset.**
## Root Cause Analysis: Small Object Failure
### Grounding DINO's Resolution Limit
Grounding DINO processes images at **384×384px**. At this resolution:
```
1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token
```
For comparison:
- **Gun** at 200×200px (close-up) → 40×40px → still detectable
- **Stamp** at 30×30px → 6×6px → lost in downsampling
- **Passport** at 80×120px → 16×24px → barely visible
- **Magnifying glass** at 40×40px → 8×8px → lost
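The figures in the list follow from a single scale factor. A small sketch, assuming the frame is resized by its width (1920px to 384px) with aspect ratio preserved; `size_at_model_input` is a hypothetical helper name.

```python
def size_at_model_input(width_px, height_px, frame_width=1920, input_width=384):
    """Scale an object's pixel size by the frame-to-input width ratio."""
    scale = input_width / frame_width   # 384 / 1920 = 0.2, i.e. a 5:1 reduction
    return round(width_px * scale), round(height_px * scale)
```

For example, `size_at_model_input(30, 30)` reproduces the 6×6px stamp and `size_at_model_input(80, 120)` the 16×24px passport.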
### Potential Solutions
| Solution | Pros | Cons | Feasibility |
|----------|------|------|-------------|
| **Crop + zoom** on person region | Leverages existing YOLO person detections | Requires two-stage pipeline | ✅ High |
| **PaliGemma 448px** | 448px native (about 36% more pixels than 384px) | 6GB, requires download | ⚠️ Medium |
| **YOLO fine-tune on stamps** | Fast inference (6MB) | Need 200+ training images | ⚠️ Medium |
| **Grounding DINO + tiling** | Split image into tiles, run per tile | 4-9x slower | ⚠️ Medium |
| **Florence-2 448px** | Higher resolution | Bug in transformers | ❌ Low |
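As an illustration of the tiling option in the table, the sketch below splits a frame into an overlapping grid. The 20% overlap and the `tile_boxes` name are assumptions for this example, not settings from the report; a 2×2 or 3×3 grid corresponds to the 4-9x slowdown listed.

```python
def tile_boxes(width, height, cols, rows, overlap=0.2):
    """Return (x0, y0, x1, y1) tiles covering the frame with fractional overlap."""
    # Tile size chosen so that cols/rows tiles with this overlap span the frame
    tile_w = width / (cols - (cols - 1) * overlap)
    tile_h = height / (rows - (rows - 1) * overlap)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = c * tile_w * (1 - overlap)
            y0 = r * tile_h * (1 - overlap)
            boxes.append((round(x0), round(y0),
                          min(width, round(x0 + tile_w)),
                          min(height, round(y0 + tile_h))))
    return boxes
```

Detections found in a tile would need to be shifted back by that tile's `(x0, y0)` and de-duplicated where tiles overlap.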
## Hand-Held Object Detection Feasibility
### Available Data Sources
| Source | Type | Coverage | Usefulness |
|--------|------|----------|------------|
| YOLO `pre_chunks` | Object detections | 169,625 frames | ✅ Every frame |
| Pose `pre_chunks` | Body keypoints (left_wrist, right_wrist) | 4,269 frames | ✅ Hand location |
| Grounding DINO | Zero-shot classification | On-demand | ✅ Object ID |
| ASR dialogue | Text mentions | 4,188 chunks | ✅ "holding a gun" |
### Approach: YOLO + Pose + Grounding DINO
```
Frame
→ YOLO: Find person + objects
→ Pose: Find wrist keypoints
→ Check: Object bbox overlaps with hand region (wrist ±100px)
→ Grounding DINO: Verify object class
```
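The overlap check in the third step can be sketched as a rectangle intersection test. The function names and the dictionary layout of a detection are hypothetical; only the ±100px margin comes from the pipeline above.

```python
def overlaps_hand_region(bbox, wrist, margin=100):
    """True when an object box intersects the square wrist ± margin (pixels)."""
    x0, y0, x1, y1 = bbox
    wx, wy = wrist
    hx0, hy0, hx1, hy1 = wx - margin, wy - margin, wx + margin, wy + margin
    return x0 < hx1 and x1 > hx0 and y0 < hy1 and y1 > hy0


def held_candidates(objects, wrists, margin=100):
    """Keep detections whose box overlaps the region around any wrist."""
    return [obj for obj in objects
            if any(overlaps_hand_region(obj["bbox"], w, margin) for w in wrists)]
```

Only the surviving candidates would then be passed to Grounding DINO for class verification.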
### Known Limitations
1. **Pose frame alignment:** Pose data (4,269 frames) doesn't always overlap with YOLO data at the same frame
2. **Object proximity ≠ holding:** YOLO objects near hands may be background, not held
3. **Small object blind spot:** Stamps, magnifying glasses at hand positions are too small to detect
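For the first limitation, one possible mitigation is to match each YOLO frame to the nearest available pose frame within a tolerance. A minimal sketch, assuming `pose_frames` is a sorted list of frame indices; the `max_gap` of 5 frames is an arbitrary illustrative value.

```python
from bisect import bisect_left


def nearest_pose_frame(frame, pose_frames, max_gap=5):
    """Return the pose frame index closest to `frame`, or None if beyond max_gap."""
    if not pose_frames:
        return None
    i = bisect_left(pose_frames, frame)
    candidates = pose_frames[max(0, i - 1):i + 1]   # neighbours on either side
    best = min(candidates, key=lambda f: abs(f - frame))
    return best if abs(best - frame) <= max_gap else None
```

Frames with no pose data inside the tolerance return `None` and can be skipped rather than paired with stale wrist positions.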
## Recommendations
| Priority | Action | Rationale |
|----------|--------|-----------|
| 1 | Use Grounding DINO Base (Apache 2.0) | Best zero-shot detector, proven on guns, clean license |
| 2 | Two-stage pipeline for small objects | YOLO person box → crop → upscale → Grounding DINO |
| 3 | Pose wrist alignment for hand-held confirmation | Reduce false positives by requiring hand proximity |
| 4 | Replace references to Grounding DINO "Large" with Base | Large has identical weights, so it offers no benefit |
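The two-stage pipeline in priority 2 gains resolution because the crop, not the full frame, is resized to the model input. The sketch below pads a person box and reports the effective zoom, assuming the longest side of the image is what gets mapped to the model input; `crop_and_zoom` and the 10% padding are hypothetical.

```python
def crop_and_zoom(person_box, frame_size=(1920, 1080), pad=0.1):
    """Pad a person box, clamp it to the frame, and return (crop, zoom factor)."""
    x0, y0, x1, y1 = person_box
    fw, fh = frame_size
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    crop = (max(0, round(x0 - pw)), max(0, round(y0 - ph)),
            min(fw, round(x1 + pw)), min(fh, round(y1 + ph)))
    longest = max(crop[2] - crop[0], crop[3] - crop[1])
    # Relative gain over feeding the whole frame to the detector
    return crop, max(fw, fh) / longest
```

With a zoom of about 2.3x, a 30px stamp that shrank to roughly 6px in the full-frame pass would cover roughly 14px in the cropped pass.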
## Appendix: License Summary
| Model | License | Commercial Use | Requires |
|-------|---------|---------------|----------|
| Grounding DINO | **Apache 2.0** | ✅ Yes | NOTICE file |
| OWL-ViT | Apache 2.0 | ✅ Yes | NOTICE file |
| PaliGemma | Gemma license | ⚠️ Needs review | Google ToS |
| Florence-2 | MIT | ✅ Yes | Copyright notice |
| YOLOv8 | AGPL-3.0 | ⚠️ Needs license | Open source or paid |


@@ -0,0 +1,49 @@
# Zero-Shot Gun Detection Test Plan
**Date:** 2026-05-10
**Goal:** Compare OWL-ViT vs Grounding DINO for detecting guns in Charade (1963)
## Models
| Model | Source | Type |
|-------|--------|------|
| `google/owlvit-base-patch32` | HuggingFace | Zero-shot object detection |
| `IDEA-Research/grounding-dino-base` | HuggingFace | Zero-shot object detection |
## Test Timepoints (8)
| Time | Label | Source |
|------|-------|--------|
| 2646s (44:06) | 2646s | ASR: "He has a gun" |
| 3188s (53:08) | 3188s | Original detection |
| 3697s (1:01:37) | 3697s | ASR: "Where's your gun" |
| 5341s (1:29:01) | 5341s | ASR: "He already killed 3 men" |
| 5461s (1:31:01) | 5461s | Original detection |
| 6309s (1:45:09) | 6309s | Original detection |
| 6377s (1:46:17) | 6377s | Original detection |
| 6479s (1:47:59) | 6479s | Original detection |
## Prompts
`"gun"`, `"pistol"`, `"rifle"`, `"weapon"`
## Matrix
8 timepoints × 2 models × 4 prompts = 64 inferences
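The matrix can be enumerated directly from the three lists in this plan. A minimal sketch; `build_test_matrix` is a hypothetical helper and is not taken from `scripts/zero_shot_gun_test.py`.

```python
from itertools import product

MODELS = ["google/owlvit-base-patch32", "IDEA-Research/grounding-dino-base"]
PROMPTS = ["gun", "pistol", "rifle", "weapon"]
TIMEPOINTS_S = [2646, 3188, 3697, 5341, 5461, 6309, 6377, 6479]


def build_test_matrix():
    """Enumerate every (timepoint, model, prompt) combination to run."""
    return [{"time_s": t, "model": m, "prompt": p}
            for t, m, p in product(TIMEPOINTS_S, MODELS, PROMPTS)]
```

Iterating over this list keeps the run order deterministic, which makes it easier to match annotated screenshots to entries in the results file.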
## Output
| File | Description |
|------|-------------|
| `output_dev/zero_shot_test/*.jpg` | Annotated screenshots |
| `output_dev/zero_shot_test/zero_shot_results.json` | Detection results |
| `scripts/zero_shot_gun_test.py` | Test script |
## Success Criteria
| Level | Criteria |
|-------|----------|
| Excellent | Finds real gun with confidence > 0.5 |
| Good | Finds real gun with confidence < 0.5 |
| Limited | Finds guns but many false positives |
| Failed | All false positives |


@@ -0,0 +1,67 @@
# Zero-Shot Gun Detection Test Report
**Date:** 2026-05-10
**Goal:** Compare OWL-ViT vs Grounding DINO for detecting guns in Charade (1963)
## Test Setup
| Model | Prompts | Timepoints | Total inferences |
|-------|---------|------------|-----------------|
| `google/owlvit-base-patch32` | gun, pistol, rifle, weapon | 8 | 32 |
| `IDEA-Research/grounding-dino-base` | gun, pistol, rifle, weapon | 8 | 32 |
## Results
| Model | Timepoints with detections | Total detections | Best confidence | Runtime |
|-------|---------------------------|-----------------|-----------------|---------|
| OWL-ViT | 2/8 | 2 | 0.054 | 1.5s |
| **Grounding DINO** | **8/8** | **109** | **0.186** | 11.5s |
## Grounding DINO — Per Timepoint
| Time | Source | Best prompt | Best confidence | Found? |
|------|--------|-------------|-----------------|--------|
| 2646s (44:06) | ASR: "He has a gun" | gun | 0.082 | ✅ |
| **3188s (53:08)** | **Original pistol** | **gun** | **0.149** | **✅** |
| 3697s (1:01:37) | ASR: "Where's your gun" | gun | 0.159 | ✅ |
| 5341s (1:29:01) | ASR: "He already killed 3 men" | gun | 0.074 | ✅ |
| **5461s (1:31:01)** | **Original pistol** | **gun** | **0.186** | **✅** |
| **6309s (1:45:09)** | **Original pistol** | **gun** | **0.077** | **✅** |
| **6377s (1:46:17)** | **Original gun** | **weapon** | **0.118** | **✅** |
| **6479s (1:47:59)** | **Original pistol** | **gun** | **0.060** | **✅** |
### Original 5 Pistol Frames
| Frame | OWL-ViT | Grounding DINO | Verdict |
|-------|---------|----------------|---------|
| 3188s | Not found | ✅ Found (0.149) | ✅ |
| 5461s | Not found | ✅ Found (0.186) | ✅ |
| 6309s | Not found | ✅ Found (0.077) | ✅ |
| 6377s | Not found | ✅ Found (0.118) | ✅ |
| 6479s | Not found | ✅ Found (0.060) | ✅ |
## Analysis
### OWL-ViT
- Almost completely failed: only 2 detections at 0.05 confidence
- Not suitable for this task
### Grounding DINO
- **Found all 8 timepoints**, including all 5 original pistol frames
- Best prompt is almost always `"gun"` (7/8 timepoints; `"weapon"` wins at 6377s)
- Confidence range: 0.060 - 0.186 (typical for zero-shot detection)
- Higher confidence correlates with user-confirmed detections
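The per-timepoint table can be summarized programmatically to check these observations. A small sketch using the values from that table; the `"asr"` / `"original"` grouping labels and the `summarize` helper are introduced here for illustration.

```python
from collections import Counter
from statistics import mean

# (time_s, source, best_prompt, best_confidence) from the per-timepoint table
RESULTS = [
    (2646, "asr", "gun", 0.082), (3188, "original", "gun", 0.149),
    (3697, "asr", "gun", 0.159), (5341, "asr", "gun", 0.074),
    (5461, "original", "gun", 0.186), (6309, "original", "gun", 0.077),
    (6377, "original", "weapon", 0.118), (6479, "original", "gun", 0.060),
]


def summarize(results):
    """Count best prompts and average the best confidence per source."""
    prompts = Counter(prompt for _, _, prompt, _ in results)
    by_source = {}
    for _, source, _, conf in results:
        by_source.setdefault(source, []).append(conf)
    return prompts, {s: round(mean(c), 3) for s, c in by_source.items()}
```

On these values the mean best confidence is 0.118 for the original-detection frames and 0.105 for the ASR-derived frames.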
### Key Finding
The 5 original pistol frames were produced by **Grounding DINO** (not YOLOv8n). The model was downloaded from HuggingFace at 15:43-15:44 on May 9, and the screenshots were generated at 15:49 — confirming OWL-ViT was tested first (failed) and then Grounding DINO was tested (succeeded).
## Integration
Grounding DINO has been integrated into `object_search_agent.py` as `--source zero_shot`:
```bash
python3 scripts/object_search_agent.py --keyword gun --source zero_shot
```
## Screenshots
All 64 annotated screenshots saved to `output_dev/zero_shot_test/*.jpg`


@@ -0,0 +1,115 @@
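# Zero-Shot vs Fine-Tune Object Detection Model Selection Report
**Date:** 2026-05-10
**Goal:** Search Charade (1963) for non-COCO objects (guns, stamps, envelopes, etc.)
**System:** M5 MacBook Pro (Apple Silicon MPS)
## Motivation
YOLOv8 COCO covers only 80 classes and does not include gun, stamp, envelope, or other objects central to Charade. A method is needed that can search the film for arbitrary objects.
## Candidate Approaches
| Approach | Method | Training data | Development cost |
|------|------|---------|---------|
| A. YOLOv8n fine-tune | Fine-tune on gun dataset | Requires 500+ annotated images | High |
| B. OWL-ViT zero-shot | Vision-language pretraining | No training needed | Low |
| C. Grounding DINO zero-shot | Vision-language pretraining | No training needed | Low |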
## Model Size and Performance
| Model | Disk | Params | Inference time (MPS) | Energy per frame | Model class |
|-------|------|------|---------------|---------|---------|
| YOLOv8n | **6MB** | **3.2M** | **0.03s** | **~0.5J** | Closed-set (80 classes) |
| OWL-ViT | 586MB | 109M | 3.4s | ~50J | Open-set (zero-shot) |
| **Grounding DINO** | **891MB** | **172M** | **4.3s** | **~65J** | **Open-set (zero-shot)** |
## Charade Test Results
| Model | Hits at 8 timepoints | 5 original pistol frames | Best confidence | Inference time | Model size |
|-------|-------------|-----------------|----------------|---------|---------|
| YOLOv8n COCO | ❌ N/A (no gun class) | — | — | 0.03s | 6MB |
| YOLOv8n fine-tune | 7/7 FP | ❌ All FP | 0.45 (stamp misclassified) | 0.03s | 6MB |
| OWL-ViT | 2/8 | ❌ 0/5 | 0.054 | 3.4s | 586MB |
| **Grounding DINO Base** | **31/32** | **✅ 5/5** | **0.672** | **11.6s** | **891MB** |
| **Grounding DINO Large** | **32/32** | **✅ 5/5** | **1.000** | **50.1s** | **895MB** |
### Base vs Large Comparison
| Metric | Base (3 datasets) | Large (7 datasets) |
|------|------------------|-------------------|
| Mean best confidence | 0.384 | **1.000** |
| Total detections | 333 | **28,800** |
| COCO zero-shot AP | 48.4 | **56.7** |
| Inference time (MPS) | 11.6s | 50.1s |
| Edge deployment | More feasible | Harder |
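### Conclusion
**Performance-first choice: Grounding DINO Large.** Confidence is 1.000 at all 8 timepoints with zero misses. It trades inference speed for detection quality well above the Base version.
**Edge deployment choice: Grounding DINO Base.** Similar in size but 4.3x faster at inference, which suits resource-constrained devices.
### Key Conclusions
1. **YOLOv8n fine-tune failed completely**: the 905 Roboflow close-up images do not match the distribution of Charade's medium and long shots, so training did not generalize
2. **OWL-ViT is nearly ineffective**: it cannot reliably recognize small objects in film footage
3. **Grounding DINO succeeded**: it recovered 5/5 pistol frames and also hit every ASR gun-mention timepoint
## Grounding DINO Pros and Cons
### Pros
- **Zero-shot search**: any object outside COCO can be searched directly with a text prompt
- **Extensibility**: the same model can search for gun, stamp, envelope, knife, hat, or any other object
- **No training needed**: no annotated data collection or fine-tuning required
- **Apache 2.0 License**: commercial use allowed
### Cons
- **Large size**: 891MB (vs 6MB for YOLOv8n)
- **Slow inference**: 4.3s/frame (vs 0.03s for YOLOv8n)
- **Not suited to real-time use**: cannot do live detection on an edge device; only suitable for offline scanning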
## Edge AI Deployment Considerations
| Item | YOLOv8n | Grounding DINO |
|---------|---------|---------------|
| Model size | 6MB ✅ | 891MB ⚠️ |
| RAM required | ~100MB | ~2.5GB |
| Inference time | 30ms | 4.3s |
| Energy per frame | ~0.5J | ~65J |
| Searchable classes | 80 (fixed) | Unlimited (text prompt) |
| Battery impact (1000 frames) | ~500J | ~65,000J |
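The battery row is the per-frame energy multiplied by the frame count. A small sketch that also converts to watt-hours for comparison with battery capacities; `scan_energy` is a hypothetical helper and the per-frame figures are the approximate values from the table.

```python
def scan_energy(frames, joules_per_frame):
    """Return (joules, watt_hours) for running one model over `frames` frames."""
    joules = frames * joules_per_frame
    return joules, joules / 3600   # 1 Wh = 3600 J
```

For 1000 frames this gives about 0.14 Wh for YOLOv8n and about 18 Wh for Grounding DINO.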
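### Recommended Strategy
```
Offline scan (Server/Gateway)
Use Grounding DINO to build an object index for the whole film
→ Slow but acceptable (about 2-3 hours for a 113 min film)
On-demand query (Edge Device)
At query time, run Grounding DINO only at the requested timepoint → 4s per query
→ Query experience is still acceptable
```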
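The 2-3 hour estimate implies that only a sample of frames is scanned. The sampling interval is not stated in this report, so the sketch below treats it as a parameter; `scan_hours` is a hypothetical helper, and the 4.3s figure is the per-frame inference time quoted earlier.

```python
def scan_hours(film_minutes, seconds_per_inference, sample_interval_s):
    """Hours needed to scan a film when one frame is sampled every interval."""
    sampled_frames = film_minutes * 60 / sample_interval_s
    return sampled_frames * seconds_per_inference / 3600
```

Under these assumptions, sampling one frame every 3 to 4 seconds of a 113 min film lands at roughly 2.7 and 2.0 hours respectively, which matches the quoted range.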
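## Integration Status
- ✅ Grounding DINO tests passed
- ✅ Integrated into `scripts/object_search_agent.py` (`--source zero_shot`)
- ✅ Test plan: `docs/ZERO_SHOT_GUN_TEST_PLAN.md`
- ✅ Test report: `docs/ZERO_SHOT_GUN_TEST_REPORT.md`
## License Notice
Grounding DINO is released under the Apache 2.0 License; commercial use is allowed.
If the product bundles this model, a `NOTICE` file must be included: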
```
Momentry
Copyright 2026 Accusys
This product includes software developed by IDEA Research:
- Grounding DINO (https://github.com/IDEA-Research/GroundingDINO)
Copyright 2023 IDEA Research
Licensed under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
```