feat: ASRX hybrid pipeline, identity history, worker fixes, checkpoint system

Accusys
2026-06-02 07:13:23 +08:00
parent e3066c3f49
commit e1572907ae
198 changed files with 43705 additions and 8910 deletions


@@ -0,0 +1,588 @@
# ASRX Hybrid Pipeline v1.0: Hybrid Speaker-Separation Architecture
| Item | Details |
|------|------|
| **Scope** | ASRX processor refactor: whisperx → VAD-first hybrid pipeline |
| **Status** | Draft |
| **Target version** | Momentry Core V4.0+ |
| **Author** | OpenCode / Warren |
| **Created** | 2026-06-01 |
---
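## 1. Problem
### 1.1 Current Issues
| Issue | Description | Impact |
|------|------|------|
| **Whisper merges short utterances** | `whisper small` mistakes a two-person exchange for one continuous segment (A+B → one sentence) | A single ASR segment mixes two speakers, so they cannot be separated |
| **ASRX v2 speaker_id = null** | `asrx_processor_v2.py` calls `whisperx.DiarizationPipeline()`, but that API is not exposed in whisperx's `__init__.py` | Every segment's speaker is null |
| **Lost text** | `asrx_processor_custom.py` calls `SelfASRXFixed.process_with_segments()`, which only outputs `text: ""` | No text is available when Rule 1 merges |
| **Wrong diarization backend** | `asrx_processor_v2.py` relies on whisperx's built-in diarization, which is unstable | Accuracy ~85%; requires an HF token |
| **Version sprawl** | 7 root-level variants and 14 asrx_self files; production runs the wrong version | Hard to maintain; unclear which one is correct |
### 1.2 Pain-Point Scenario
**Two speakers alternating short utterances** (interviews, conversations):
```
Audio: A(2s) → B(1.5s) → A(3s)
Whisper: ───────[0-7s, "A+B+A all mixed together"]───────
```
Whisper does not split at inter-sentence pauses, so the ASR time boundaries cannot reflect speaker changes.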
---
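## 2. Architecture
### 2.1 Core Principles
1. **VAD sets the boundaries first**: use VAD to split at inter-sentence pauses, replacing whisper's boundaries
2. **ASR runs afterwards**: each segment is transcribed on its own and keeps its own text
3. **Voiceprint clustering assigns speakers**: ECAPA-TDNN + AgglomerativeClustering
### 2.2 5-Step Core Pipeline
```
Audio
① whisper (single pass, coarse localization)
│ finds speech regions + preliminary text + language
│ [0-7s, "今天天氣很好我覺得也不錯對啊", zh]
② VAD scan (fine split inside each region)
│ splits at inter-sentence pauses
│ seg1 [0-2s] seg2 [2-3.5s] seg3 [3.5-7s]
③ whisper per refined segment (transcribe each segment)
│ seg1 → "今天天氣很好" (zh, 0.98)
│ seg2 → "我覺得也不錯" (zh, 0.97)
│ seg3 → "對啊" (zh, 0.96)
④ ECAPA-TDNN per refined segment (voiceprint extraction)
│ seg1 → emb[0] (192-dim)
│ seg2 → emb[1] (192-dim)
│ seg3 → emb[2] (192-dim)
⑤ AgglomerativeClustering (clustering assigns speakers)
│ emb[0]=SPEAKER_0, emb[1]=SPEAKER_1, emb[2]=SPEAKER_0
Output: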
start end text language speaker_id
0.0 2.0 今天天氣很好 zh SPEAKER_0
2.0 3.5 我覺得也不錯 zh SPEAKER_1
3.5 7.0 對啊 zh SPEAKER_0
```
### 2.3 Flow Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│ asrx_processor.py │
│ (wrapper) │
│ │
│ ① ffprobe → select best track → ffmpeg → 16kHz WAV │
│ │
│ ② SelfASRXFixed.process(audio_wav, file_uuid) │
│ │ │
│ ├─ Step 1: whisper.transcribe() → rough segments │
│ ├─ Step 2: VAD scan each rough segment │
│ ├─ Step 3: whisper per refined segment → text+language │
│ ├─ Step 4: ECAPA-TDNN per segment → 192-dim embedding │
│ ├─ Step 5: AgglomerativeClustering → speaker_labels │
│ │ │
│ ├─ Step 6: Store embeddings in Qdrant │
│ │ └─ {file_uuid, speaker_id, text, language, start, end} │
│ │ │
│ └─ Step 7: Classify high-quality embeddings │
│ ├─ quality > threshold → reference profile │
│ │ ├─ run voice classification model to infer gender/attributes │
│ │ └─ write to Qdrant (type: speaker_reference) │
│ │
│ ③ Output JSON (embeddings excluded) │
│ │
│ Rust: rule1_ingest.rs │
│ └─ pre_chunks(processor_type='asrx') → chunks │
└─────────────────────────────────────────────────────────────────────┘
```
---
## 3. File Organization
### 3.1 Final File Layout
```
scripts/
├── asrx_processor.py ← production (cleaned custom.py)
└── asrx_self/ ← core library
    ├── __init__.py ← package marker
    ├── vad.py ← Silero VAD (adds scan_within_segment)
    ├── whisper_local.py ← 🆕 wraps whisper loading + transcription
    ├── speaker_encoder.py ← ECAPA-TDNN 192-dim
    ├── speaker_cluster_fixed.py ← AgglomerativeClustering
    ├── speaker_classifier.py ← 🆕 high-quality voiceprint grading + classification
    └── main_fixed.py ← 🔧 rewritten as the 7-step pipeline
```
### 3.2 Deletion List
**Root-level variants** (delete all):
| File | Reason |
|------|------|
| `asrx_processor.py` | Original whisperx version; diarization is broken |
| `asrx_processor_v2.py` | Same as above; Rust currently calls this file by mistake |
| `asrx_processor_v2_noalign.py` | Skips alignment, but diarization is still broken |
| `asrx_processor_v2_transcribe.py` | Transcription only, no speaker assignment |
| `asrx_processor_simplified.py` | Variant |
| `asrx_processor_contract_v1.py` | 18 KB; uses pyannote; requires an HF token |
**Superseded older versions inside asrx_self**:
| File | Reason | Replaced by |
|------|------|--------|
| `main.py` | Uses SpectralClustering; has NaN issues | `main_fixed.py` |
| `speaker_cluster.py` | Uses SpectralClustering; unstable | `speaker_cluster_fixed.py` |
### 3.3 Relocation List
Non-production tools move to `tools/asrx/`:
```
tools/asrx/
├── integrate_face_asrx_speaker.py
├── speaker_player_gui.py
├── speaker_player_gui_face.py
├── speaker_player_interactive.py
├── speaker_audio_player.py
├── test_long_movie.py
├── test_gui_face_player.py
└── docs/
├── FINAL_TEST_REPORT.md
├── GUI_FACE_PLAYER_USAGE.md
├── LONG_MOVIE_TEST_SUMMARY.md
└── SPEAKER_PLAYER_GUIDE.md
```
---
## 4. Qdrant Voiceprint Vector Storage
### 4.1 Storage Flow
```
Step 4 output: each refined segment has {embedding: [192-dim], text, language, start, end}
Step 5 output: each segment is labeled with a speaker_id {SPEAKER_0, SPEAKER_1, ...}
Step 6: Qdrant storage
┌─ each segment → Qdrant point
│ point_id = hash(file_uuid + "_" + segment_index) ← deterministic, so the same point can be looked up again
│ vector = embedding (192-dim)
│ payload = {
│   "file_uuid": str, ← video identifier
│   "speaker_id": str, ← filled in after clustering
│   "text": str, ← ASR transcription
│   "language": str, ← language (zh/en/...)
│   "start_time": f64, ← seconds
│   "end_time": f64, ← seconds
│   "type": "speaker_embedding" ← distinguishes it from other point types
│ }
└─
```
### 4.2 Qdrant Collection
| Item | Details |
|------|------|
| Collection Name | `momentry_speaker` (or share an existing collection) |
| Vector Dimension | 192 (ECAPA-TDNN output) |
| Distance Metric | Cosine |
| Point ID | `hash(file_uuid + "_" + segment_index)` |
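The `hash(...)` in the Point ID row has to give the same value on every run, otherwise re-processing a file would create new points instead of overwriting the old ones. Python's built-in `hash()` is salted per process for strings, so it would not qualify. The sketch below is one possible approach, not the project's implementation: the function name `speaker_point_id` and the choice of SHA-256 truncated to 64 bits are assumptions made for illustration.

```python
import hashlib


def speaker_point_id(file_uuid: str, segment_index: int) -> int:
    """Derive a stable unsigned 64-bit point ID (hypothetical helper).

    Uses the key layout from the table: file_uuid + "_" + segment_index.
    """
    key = f"{file_uuid}_{segment_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the first 8 bytes so the value fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    print(speaker_point_id("example-file-uuid", 0))
```

Truncating to 64 bits keeps the ID compatible with the `point_id: u64` parameter shown in section 4.3; collisions are possible in principle but unlikely at per-file segment counts.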
### 4.3 Rust `upsert_speaker_embedding`
```rust
impl QdrantDb {
    pub async fn upsert_speaker_embedding(
        &self,
        point_id: u64,
        vector: &[f32],
        file_uuid: &str,
        speaker_id: &str,
        text: &str,
        language: &str,
        start_time: f64,
        end_time: f64,
    ) -> Result<()> {
        // Qdrant PUT /collections/{collection}/points?wait=true
        // payload: {file_uuid, speaker_id, text, language, start_time, end_time, type: "speaker_embedding"}
        todo!("signature only: build the payload and send the upsert request")
    }
}
```
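### 4.4 Relationship to Existing Face Embeddings
| Category | Qdrant Collection | Dim | Payload |
|------|-------------------|-----|---------|
| Face | `momentry` (self.collection_name) | 512 (FaceNet) | `file_uuid, trace_id, frame_number` |
| **Speaker** | `momentry` or a dedicated collection | **192** (ECAPA-TDNN) | `file_uuid, speaker_id, text, language, start, end` |
> **Note**: A Qdrant collection fixes the dimension of each vector it stores, so sharing `momentry` with the 512-dim face vectors would require named vectors. A dedicated collection avoids this.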
---
## 5. Detailed Module Design
### 5.1 `vad.py`: Voice Activity Detection
| Item | Details |
|------|------|
| Model | Silero VAD (torch.hub, snakers4/silero-vad) |
| Existing functions | `load_vad_model()`, `extract_speech_segments()` |
| **New function** | **`scan_within_segment(wav, start_sec, end_sec, model, utils, min_speech_duration_ms=500)`** |
What `scan_within_segment` does:
- Runs a VAD scan inside one time range `[start_sec, end_sec]`
- Returns only the speech sub-segments inside that range: `[(s1, e1), (s2, e2), ...]`
- Splits at inter-sentence pauses, which solves the whisper merging problem
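To make the range-restriction step concrete, here is a minimal sketch of the bookkeeping only. It does not call Silero VAD: it assumes speech spans (in seconds) have already been detected, and the helper name `clip_spans_to_segment` is hypothetical. The 0.5 s minimum mirrors the `min_speech_duration_ms=500` default above.

```python
def clip_spans_to_segment(speech_spans, start_sec, end_sec, min_speech_sec=0.5):
    """Keep the parts of detected speech spans that fall inside [start_sec, end_sec].

    speech_spans: list of (start, end) tuples in seconds, e.g. from a VAD pass.
    Spans shorter than min_speech_sec after clipping are dropped.
    """
    refined = []
    for span_start, span_end in speech_spans:
        clipped_start = max(span_start, start_sec)
        clipped_end = min(span_end, end_sec)
        if clipped_end - clipped_start >= min_speech_sec:
            refined.append((clipped_start, clipped_end))
    return refined


if __name__ == "__main__":
    spans = [(0.0, 1.9), (2.1, 3.4), (3.6, 7.0), (7.2, 9.0)]
    print(clip_spans_to_segment(spans, 0.0, 7.0))
```

With the sample spans, the three utterances inside the 0-7 s rough segment survive as separate sub-segments, and the span that starts after 7 s is discarded.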
### 5.2 `whisper_local.py` 🆕: Whisper Wrapper
| Item | Details |
|------|------|
| Model | `whisper.load_model("base")` (configurable) |
| Functions | `load_model()`, `transcribe_segment(wav, sample_rate, start_sec, end_sec, model)` |
```python
def transcribe_segment(wav, sample_rate, start_sec, end_sec, model) -> dict:
    """Transcribe a single segment; returns {text, language, lang_prob, segments}."""
```
Each segment is transcribed independently, keeping its own language and confidence.
### 5.3 `speaker_encoder.py`: Speaker Encoder
| Item | Details |
|------|------|
| Model | SpeechBrain ECAPA-TDNN (`spkrec-ecapa-voxceleb`) |
| Output dimension | 192-dim |
| EER | 0.80% (VoxCeleb1) |
| License | Apache 2.0 (no HuggingFace token required) |
| Functions | `load_speaker_encoder()`, `extract_speaker_embedding()`, `extract_speaker_embeddings_batch()` |
### 5.4 `speaker_cluster_fixed.py`: Speaker Clustering
| Item | Details |
|------|------|
| Algorithm | AgglomerativeClustering (cosine + average linkage) |
| Replaces | `speaker_cluster.py` (SpectralClustering, NaN issues) |
| Functions | `robust_speaker_clustering(embeddings, n_speakers=None, max_speakers=10)` |
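For readers unfamiliar with average-linkage clustering on cosine distance, the following toy version shows the idea in plain Python. It is a stand-in for illustration, not `robust_speaker_clustering` and not scikit-learn's `AgglomerativeClustering`: the function names and the `distance_threshold=0.5` stopping rule are assumptions, and the 2-D vectors stand in for 192-dim embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage_labels(embeddings, distance_threshold=0.5):
    """Merge the two closest clusters until none are closer than the threshold."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > distance_threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print([f"SPEAKER_{label}" for label in average_linkage_labels(toy)])
```

The toy input mirrors the A → B → A pattern from section 2.2: the first and third vectors end up in one cluster and the second in another.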
### 5.5 `main_fixed.py` 🔧: Core Orchestrator, 7-Step Pipeline
```python
class SelfASRXFixed:
    def process(self, audio_path, output_path=None, file_uuid=None):
        """
        7-step speaker diarization pipeline
        Steps:
            1. whisper.transcribe(audio) → rough segments + text + language
            2. VAD scan each rough segment → refined segments
            3. whisper per refined segment → {text, language, lang_prob}
            4. ECAPA-TDNN per refined segment → 192-dim embeddings
            5. AgglomerativeClustering → speaker_labels
            6. Store all embeddings in Qdrant (if file_uuid provided)
               payload: {file_uuid, speaker_id, text, language, start_time, end_time, type: "speaker_embedding"}
            7. High-quality embeddings (quality > threshold) → classify + store reference
               payload: {type: "speaker_reference", file_uuid, speaker_id, n_segments, avg_quality, ...}
        Returns:
            {
                "segments": [
                    {
                        "start": float, "end": float,
                        "text": str, "language": str,
                        "lang_prob": float, "speaker": str,
                        "speaker_id": str, "quality": float
                    },
                    ...
                ],
                "speaker_stats": {...},
                "n_speakers": int,
                "total_duration": float,
                "references": [
                    {
                        "speaker_id": str,
                        "n_segments": int,
                        "avg_quality": float,
                        "gender": str
                    }
                ]
            }
        """

    def _store_speaker_embeddings(self, segments, file_uuid):
        """Step 6: store each segment's 192-dim embedding in Qdrant."""

    def _classify_high_quality_speakers(self, segments, embeddings, labels, file_uuid):
        """Step 7: grade and classify high-quality voiceprints → Qdrant reference profile."""
```
**Removed**:
| Old method | Reason |
|--------|------|
| `process_with_segments(audio, asr_segments)` | External ASR boundaries are unreliable; replaced by VAD |
| `process()` VAD-only fallback | Produces no text; replaced by the full pipeline |
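### 5.6 `speaker_classifier.py` 🆕: High-Quality Voiceprint Grading and Classification
#### Purpose
After clustering, score the quality of every embedding in each cluster. Embeddings above the threshold get their own reference profile and are classified automatically by an external model.
#### Flow
```
After step ⑤ clustering, each segment has {embedding, speaker_id}
└─ Compute quality score per embedding
   ├─ below threshold → write to Qdrant (regular speaker_embedding)
   └─ above threshold (quality > 0.85)
      ├─ build a standalone reference profile
      └─ send to a voice-capable model for classification
         ├─ speaker gender (male/female)
         ├─ language/accent (zh-CN / zh-TW / en-US)
         └─ or use for cross-video speaker matching
```
#### Quality Score Computation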
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def compute_embedding_quality(embeddings, labels, threshold=0.85):
    """
    Cosine similarity of each embedding to the centroid of its own cluster.
    Args:
        embeddings: [n_segments, 192]
        labels: [n_segments] cluster labels
        threshold: high-quality cutoff
    Returns:
        qualities: [n_segments] quality score per embedding
        high_quality_mask: [n_segments] bool array
    """
    # Convert to arrays so the boolean mask `labels == label` works for list input.
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    centroids = {}
    for label in set(labels):
        mask = labels == label
        centroid = np.mean(embeddings[mask], axis=0)
        centroids[label] = centroid / np.linalg.norm(centroid)
    qualities = []
    for emb, label in zip(embeddings, labels):
        sim = cosine_similarity([emb], [centroids[label]])[0][0]
        qualities.append(sim)
    qualities = np.array(qualities)
    return qualities, qualities >= threshold
```
#### Reference Profile Format
```json
{
"point_id": "hash('speaker_reference_' + file_uuid + '_' + speaker_id + '_' + cluster_index)",
"vector": "[192-dim centroid embedding]",
"payload": {
"type": "speaker_reference",
"file_uuid": "",
"speaker_id": "SPEAKER_0",
"n_segments": 25,
"avg_quality": 0.92,
"total_duration": 45.3,
"language": "zh",
"gender": "male",
"text_samples": ["", "", "..."]
}
}
```
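#### Candidate Voice Classification Models (Options)
| Model | Use | Pros | Cons |
|------|------|------|------|
| **SpeechBrain gender classifier** | Gender classification | Already integrated with ECAPA-TDNN | Only male/female |
| **CLAP** (LAION) | Zero-shot audio classification | Label text can be customized | Requires extra installation |
| **YAMNet** | Sound event classification | From Google; 521 classes | Weak at speaker attributes |
| **Wav2Vec2-BERT** (speechbrain) | Emotion/attributes | Multi-dimensional classification | Larger model |
| **Custom identity classifier** | Cross-video speaker matching | Plugs into the existing identity system | Needs accumulated reference data |
> **TBD**: Which classification model to use will be decided by a follow-up POC.
#### New Method in `main_fixed.py`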
```python
class SelfASRXFixed:
    # ... the existing 6 steps ...
    def _classify_high_quality_speakers(self, segments, embeddings, labels, file_uuid):
        """
        Step 7: grade and classify high-quality voiceprints
        1. Compute the quality score
        2. Build a reference profile for embeddings above the threshold
        3. Infer gender/attributes with the classification model
        4. Write to Qdrant (type: speaker_reference)
        """
        qualities, mask = compute_embedding_quality(embeddings, labels)
        for seg, emb, quality, is_high in zip(segments, embeddings, qualities, mask):
            seg["quality"] = float(quality)
            if is_high:
                profile = self._build_reference_profile(emb, seg, file_uuid)
                # Classification (placeholder)
                # gender = classify_gender(emb)
                self._store_speaker_reference(profile)
```
### 5.7 `asrx_processor.py`: Cleaned-Up Wrapper
Cleanup items:
| Issue | Location | Fix |
|------|------|------|
| Hard-coded UUID `dd61fda8...` | line 155 | Remove that fallback path |
| `os.chdir(script_dir)` | line 112 | Use local `Path` operations instead |
| ASR text discarded | line 258 | `text` now comes from the new pipeline |
| `_debug` dict | line 222 | Remove |
| `max_speakers=10` hard-coded | line 201 | Make it a CLI parameter, `--max-speakers` |
| Loading external ASR segments | line 148-174 | Remove (no longer needed) |
---
## 6. Output Format
### 6.1 ASRX JSON Output (written by `asrx_processor.py`)
> **Note**: The 192-dim embedding is not part of this JSON. Embeddings go straight to Qdrant from the Python side; the JSON keeps metadata only.
```json
{
"language": "zh",
"segments": [
{
"start_time": 0.0,
"end_time": 2.0,
"start_frame": 0,
"end_frame": 60,
"text": "今天天氣很好",
"speaker_id": "SPEAKER_0",
"language": "zh",
"lang_prob": 0.98
},
{
"start_time": 2.0,
"end_time": 3.5,
"start_frame": 60,
"end_frame": 105,
"text": "我覺得也不錯",
"speaker_id": "SPEAKER_1",
"language": "zh",
"lang_prob": 0.97
}
],
"n_speakers": 2,
"speaker_stats": {
"SPEAKER_0": {"count": 1, "duration": 2.0},
"SPEAKER_1": {"count": 1, "duration": 1.5}
}
}
```
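The derived fields in the JSON above (`start_frame`, `end_frame`, `speaker_stats`, `n_speakers`) can all be computed from the segment list. The sketch below shows one way to do that; the helper names are hypothetical, and the 30 fps default is only inferred from the example (2.0 s ↔ frame 60), not stated by the source.

```python
from collections import defaultdict


def seconds_to_frame(seconds, fps=30.0):
    """Convert a timestamp to a frame index (fps=30 is an assumption)."""
    return round(seconds * fps)


def build_speaker_stats(segments):
    """Aggregate segment count and total duration per speaker_id."""
    stats = defaultdict(lambda: {"count": 0, "duration": 0.0})
    for seg in segments:
        entry = stats[seg["speaker_id"]]
        entry["count"] += 1
        entry["duration"] += seg["end_time"] - seg["start_time"]
    return dict(stats)


if __name__ == "__main__":
    segments = [
        {"start_time": 0.0, "end_time": 2.0, "speaker_id": "SPEAKER_0"},
        {"start_time": 2.0, "end_time": 3.5, "speaker_id": "SPEAKER_1"},
    ]
    stats = build_speaker_stats(segments)
    print(stats, "n_speakers =", len(stats))
    print(seconds_to_frame(3.5))
```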
### 6.2 Qdrant Point Format (written by Python `_store_speaker_embeddings`)
> Embeddings do not pass through Rust; the Qdrant HTTP PUT is issued directly from the Python side.
| Qdrant field | Value | Description |
|-------------|-----|------|
| `id` | `hash(file_uuid + "_" + segment_index)` | Deterministic point ID that can be looked up again |
| `vector` | `[f32; 192]` | ECAPA-TDNN voiceprint vector |
| `payload.file_uuid` | `str` | Video identifier |
| `payload.speaker_id` | `str` | Speaker label after clustering |
| `payload.text` | `str` | Transcribed text of the segment |
| `payload.language` | `str` | Language (`zh`/`en`) |
| `payload.start_time` | `f64` | Start time (seconds) |
| `payload.end_time` | `f64` | End time (seconds) |
| `payload.type` | `"speaker_embedding"` | Distinguishes it from face_embedding |
### 6.3 Rust `AsrxResult` Mapping
```rust
pub struct AsrxSegment {
    pub start_time: f64, // serde(alias = "start")
    pub end_time: f64, // serde(alias = "end")
    pub start_frame: u64, // default 0
    pub end_frame: u64, // default 0
    pub text: String,
    pub speaker_id: Option<String>,
    pub language: Option<String>, // 🆕 new
    pub lang_prob: Option<f64>, // 🆕 new
}
```
---
## 7. Rust-Side Changes
| File | Change |
|------|------|
| `src/core/processor/asrx.rs` | `asrx_processor_v2.py` → `asrx_processor.py` |
| `src/core/processor/asrx.rs` | Add `language` and `lang_prob` fields to `AsrxSegment` |
| `src/core/processor/asrx.rs` | Pass `--file-uuid` to the Python script so the Python side can write to Qdrant directly |
| `src/core/chunk/rule1_ingest.rs` | If `pre_chunks` data contains `language`, carry it into chunk metadata |
| `src/core/db/qdrant_db.rs` | 🆕 Add `upsert_speaker_embedding()` (optional; not needed if Python writes to Qdrant directly) |
---
## 8. Migration Plan
### Implementation Order (sorted by dependency)
| Step | Work | File | Risk |
|------|------|------|------|
| **S1** | `vad.py`: add `scan_within_segment()` | `asrx_self/vad.py` | Low |
| **S2** | 🆕 `whisper_local.py`: wrap whisper loading + transcription | `asrx_self/whisper_local.py` | Low |
| **S3** | 🔧 `main_fixed.py`: rewrite as the 7-step pipeline | `asrx_self/main_fixed.py` | Medium |
| **S4** | 🆕 `speaker_classifier.py`: gender classifier | `asrx_self/speaker_classifier.py` | Low |
| **S5** | 🔧 Clean up `custom.py` + rename → `asrx_processor.py` | `asrx_processor_custom.py` | Low |
| **S6** | 🔧 Rust `asrx.rs`: repoint + pass `--file-uuid` | `src/core/processor/asrx.rs` | Low |
| **S7** | ✅ Verify: build + playground test | - | Medium |
| **S8** | 🧹 Delete variants + relocate tools | - | Low |
### Acceptance Criteria
1. `cargo build` passes
2. Playground 3003: register a video → the ASRX processor completes
3. `speaker_id` in the output JSON is not `null`
4. The Qdrant collection contains `speaker_embedding` points
5. Gender is labeled correctly (male/female)
---
## 9. Version History
| Version | Date | Modified by | Notes |
|------|------|--------|------|
| V1.0 | 2026-06-01 | OpenCode | Initial version: 7-step hybrid pipeline + Qdrant voiceprint storage + high-quality classification |


@@ -0,0 +1,385 @@
---
document_type: "design"
service: "MOMENTRY_CORE"
title: "Modular Generated Documentation System"
date: "2026-05-17"
version: "V1.0"
status: "active"
owner: "M5"
created_by: "OpenCode"
tags:
- "documentation"
- "modular"
- "generated-docs"
- "workspace"
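ai_query_hints:
- "Look up the design rationale of the modular generated documentation system"
- "How to use API_WORKSPACE"
- "How to add API endpoint documentation"
- "make deploy workflow"
- "Custom deliverable documents"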
related_documents:
- "STANDARDS/USER_DOCS_STANDARD.md"
- "STANDARDS/DOCS_STANDARD.md"
- "API_WORKSPACE/README.md"
- "API_WORKSPACE/modules/_template.md"
---
# Modular Generated Documentation System
| Item | Details |
|------|------|
| Created by | OpenCode |
| Created | 2026-05-17 |
| Document version | V1.0 |
| Audience | developer, documentation maintainer |
---
## Version History
| Version | Date | Purpose | Operator |
|------|------|------|--------|
| V1.0 | 2026-05-17 | Create the design document | OpenCode |
---
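## 1. Design Rationale
### 1.1 Pain Points
Traditional API documentation maintenance suffers from a few common problems:
| Problem | Symptom |
|------|----------|
| **Duplicated content** | The same endpoint is written three times: in the quick reference, the full manual, and the training material |
| **Missed updates** | After a curl example is changed, the other copies are not updated |
| **Rigid delivery** | Different versions of the API docs cannot be produced per audience |
| **Version drift** | The YAML frontmatter version number falls out of sync with the actual content |
### 1.2 Core Principles
```
Single source of truth (modules/) → assembly engine (assemble_docs.sh) → multiple deliverables (GUIDES/)
Edit                        ──→  Generate  ──→  Deploy
change a module in 1 place       make all       make deploy
```
| Principle | Description |
|------|------|
| **Single source of truth** | Each endpoint is defined exactly once, in `modules/` |
| **Assemble, don't write** | Deliverables are compositions of modules, not hand-written documents |
| **Separate development from delivery** | Develop in `API_WORKSPACE/`, deliver in `GUIDES/` |
| **Module as the smallest testable unit** | Each module can be verified on its own |
| **Configuration-driven** | `.toml` configs define which modules are assembled, in which mode, into which output |
### 1.3 File Types
| Type | Role | Editable | Location |
|------|------|--------|------|
| Module | Smallest indivisible unit of content | ✅ Yes | `API_WORKSPACE/modules/` |
| Config (recipe) | Defines assembly rules | ✅ Yes | `API_WORKSPACE/configs/` |
| Narrative | Unstructured preface/background | ✅ Yes | `API_WORKSPACE/narratives/` |
| Assembled (output) | Deliverable assembled from modules | ❌ No (generated) | `API_WORKSPACE/_build/` → `GUIDES/` |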
---
## 2. Directory Structure
```
docs_v1.0/
├── API_WORKSPACE/ ← development area
│   ├── modules/ ← endpoint modules (single source of truth)
│   │   ├── _template.md ← module authoring guide
│   │   ├── 01_auth.md ← authentication, Base URL
│   │   ├── 02_health.md ← health check
│   │   ├── 03_register.md ← register, scan
│   │   ├── 04_lookup.md ← lookup, delete
│   │   ├── 05_process.md ← processing, progress, tasks
│   │   ├── 06_search.md ← search: vector, n8n, visual
│   │   ├── 07_identity.md ← identity CRUD, bind/unbind
│   │   ├── 08_identity_agent.md ← Identity Agent
│   │   ├── 09_tmdb.md ← TMDb Enrichment
│   │   ├── 10_pipeline.md ← stats, configuration, unmounted endpoints
│   │   └── 11_error_codes.md ← error code table
│   │
│   ├── configs/ ← assembly recipes (one per output)
│   │   ├── reference.toml → API_REFERENCE.md
│   │   ├── endpoints.toml → API_ENDPOINTS.md
│   │   ├── quickref.toml → API_QUICK_REFERENCE.md
│   │   ├── errors.toml → API_ERROR_CODES.md
│   │   ├── index.toml → API_INDEX.md
│   │   ├── marcom.toml → API_TRAINING_MARCOM.md
│   │   └── tmdb.toml → TMDb_User_Guide.md
│   │
│   ├── narratives/ ← non-endpoint narrative prefaces
│   │   └── marcom_intro.md
│   │
│   ├── _build/ ← generated staging area (gitignored)
│   ├── Makefile ← assembly automation entry point
│   ├── assemble_docs.sh ← assembly engine
│   └── README.md ← developer cheat sheet
├── GUIDES/ ← delivery area
│   ├── API_REFERENCE.md (generated)
│   ├── API_ENDPOINTS.md (generated)
│   ├── API_QUICK_REFERENCE.md (generated)
│   ├── API_ERROR_CODES.md (generated)
│   ├── API_INDEX.md (generated)
│   ├── API_TRAINING_MARCOM.md (generated)
│   ├── TMDb_User_Guide.md (generated)
│   ├── Demo_EndToEnd.md (hand-written, kept)
│   ├── Pipeline_API_Demo.md (hand-written, kept)
│   └── ... (other hand-written documents)
├── DESIGN/
├── REFERENCE/
├── OPERATIONS/
├── INTEGRATIONS/
└── STANDARDS/
```
---
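## 3. Module Specification
### 3.1 File Naming
- Format: `NN_<name>.md`, where NN is a two-digit sort key (01-99)
- Examples: `03_register.md`, `09_tmdb.md`
- The numeric prefix determines endpoint order at assembly time
### 3.2 Module Metadata Comments
Every module must begin with metadata comments: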
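As a rough illustration of the naming rule, the sketch below filters a directory listing down to valid module files and returns them in assembly order. The regular expression is an assumption: the source only requires a two-digit prefix, and restricting `<name>` to lowercase letters, digits, and underscores is a guess based on the existing file names. `ordered_modules` is a hypothetical helper, not part of `assemble_docs.sh`.

```python
import re

# NN_<name>.md with NN in 01-99; the <name> character set is an assumption.
MODULE_FILENAME = re.compile(r"^(0[1-9]|[1-9][0-9])_[a-z][a-z0-9_]*\.md$")


def ordered_modules(filenames):
    """Return valid module filenames sorted by their numeric prefix."""
    return sorted(name for name in filenames if MODULE_FILENAME.match(name))


if __name__ == "__main__":
    listing = ["09_tmdb.md", "_template.md", "03_register.md", "README.md"]
    print(ordered_modules(listing))
```

Because the prefix is zero-padded to two digits, a plain lexicographic sort already yields the intended order.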
```markdown
<!-- module: auth -->
<!-- description: Authentication, API Key, Base URL configuration -->
<!-- depends: -->
```
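| Field | Required | Description |
|------|------|------|
| `module` | Yes | Unique name; no spaces, must not start with a digit |
| `description` | Yes | One-sentence summary |
| `depends` | No | Names of other modules this one depends on (comma-separated) |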
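A parser for these comments could look like the following sketch. It is not taken from `assemble_docs.sh` (which is a shell script); `parse_module_metadata` and the rule that parsing stops at the first non-metadata line are assumptions for illustration.

```python
import re

META_LINE = re.compile(r"^<!--\s*(module|description|depends):\s*(.*?)\s*-->$")


def parse_module_metadata(text):
    """Read the leading metadata comments of a module file (hypothetical helper)."""
    meta = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = META_LINE.match(stripped)
        if match is None:
            break  # assumed rule: metadata sits at the very top of the module
        key, value = match.groups()
        meta[key] = value
    missing = [key for key in ("module", "description") if not meta.get(key)]
    if missing:
        raise ValueError(f"missing required metadata: {missing}")
    # `depends` is optional and comma-separated; an empty value means no dependencies.
    meta["depends"] = [d.strip() for d in meta.get("depends", "").split(",") if d.strip()]
    return meta


if __name__ == "__main__":
    sample = (
        "<!-- module: auth -->\n"
        "<!-- description: Authentication, API Key, Base URL configuration -->\n"
        "<!-- depends: -->\n"
        "# Auth\n"
    )
    print(parse_module_metadata(sample))
```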
### 3.3 Endpoint Structure
Every endpoint must follow the same structure:
````markdown
### `METHOD /path/to/endpoint`
**Auth**: Required / Optional / Public
**Scope**: file-level / identity-level / system-level
#### Request Parameters
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
#### Example
```bash
curl -s -X METHOD "$API/path" \
-H "X-API-Key: $KEY" \
-d '{"field": "value"}'
```
#### Response (200)
```json
{ ... }
```
#### Error Codes
| Code | HTTP | When |
|------|------|------|
````
### 3.4 Variables
| Variable | Purpose | Example value |
|------|------|--------|
| `$API` | Base URL | `http://localhost:3003` |
| `$KEY` | API Key | `your-api-key-here` |
| `$FILE_UUID` | File UUID | `3a6c1865...` |
| `$IDENTITY_UUID` | Identity UUID | `a9a90105...` |
---
## 4. Assembly Engine
### 4.1 `assemble_docs.sh`
A shell script that takes three arguments:
| Argument | Description | Example |
|------|------|------|
| `--config` | Path to the TOML recipe | `configs/reference.toml` |
| `--modules` | Module directory | `modules/` |
| `--build` | Output directory | `_build/` |
### 4.2 Three Assembly Modes
| mode | Behavior | Used for |
|------|------|------|
| `full` | Includes the entire module content (except metadata) | API_REFERENCE, API_ENDPOINTS |
| `summary` | Extracts only endpoint tables + curl examples | API_QUICK_REFERENCE |
| `index` | Generates a document overview (scans the modules directory and builds the index automatically) | API_INDEX |
### 4.3 Assembly Flow
```
1. Read config.toml → parse title, modules, mode, narrative
2. Generate YAML frontmatter (document_type, date, version)
3. Generate title heading + info block
4. (Optional) Generate a TOC from the modules' ## headings
5. (Optional) Insert narrative intro
6. Iterate over modules:
   - full mode: copy the entire content (skipping <!-- --> comments)
   - summary mode: extract only | table | rows + ```bash code blocks
   - index mode: scan the modules directory and generate a listing
7. Write the output file to _build/
```
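Step 6 is where the `full` and `summary` modes differ. The sketch below models that difference in Python for a single module's text. It is an illustration under assumptions, not a port of `assemble_docs.sh`: the function name `assemble_module` is hypothetical, metadata comments are assumed to be single-line, and `index` mode is left out because it works on the directory rather than on module content.

```python
FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable


def assemble_module(text, mode="full"):
    """Return the lines of one module that a given mode would keep."""
    kept = []
    in_bash = False
    in_other_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("<!--") and stripped.endswith("-->"):
            continue  # metadata comments never reach the output
        if mode == "full":
            kept.append(line)
        elif in_bash:
            kept.append(line)  # inside a bash block: keep everything up to the closing fence
            if stripped == FENCE:
                in_bash = False
        elif in_other_fence:
            if stripped == FENCE:
                in_other_fence = False  # non-bash blocks are dropped in summary mode
        elif stripped.startswith(FENCE + "bash"):
            in_bash = True
            kept.append(line)
        elif stripped.startswith(FENCE):
            in_other_fence = True
        elif stripped.startswith("|"):
            kept.append(line)  # table rows
    return "\n".join(kept)


if __name__ == "__main__":
    module = "\n".join([
        "<!-- module: auth -->",
        "## Auth",
        "| Field | Type |",
        FENCE + "bash",
        'curl -s "$API/health"',
        FENCE,
        FENCE + "json",
        "{}",
        FENCE,
    ])
    print(assemble_module(module, mode="summary"))
```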
---
## 5. Recipe Format (config.toml)
```toml
title = "Output document title"
output = "_build/FILENAME.md" # output path (relative to API_WORKSPACE)
mode = "full" # full | summary | index
modules = ["01_auth", "03_register"] # names of the modules to include
narrative = "narratives/xxx.md" # (optional) narrative preface to include
toc = true # (optional) whether to generate a table of contents
[frontmatter]
document_type = "api_reference" # used in the YAML frontmatter
service = "MOMENTRY_CORE"
version = "V1.0"
owner = "M5"
created_by = "OpenCode"
```
### Built-in Recipes
| File | Output | Modules | Mode |
|------|------|---------|------|
| `reference.toml` | API_REFERENCE.md | 01-11 | full |
| `endpoints.toml` | API_ENDPOINTS.md | 01-10 | full |
| `quickref.toml` | API_QUICK_REFERENCE.md | 01-06,09 | summary |
| `errors.toml` | API_ERROR_CODES.md | 11 | full |
| `index.toml` | API_INDEX.md | (auto) | index |
| `marcom.toml` | API_TRAINING_MARCOM.md | 01,03,06 + narrative | full |
| `tmdb.toml` | TMDb_User_Guide.md | 01,03,09 | full |
---
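## 6. Workflow
### 6.1 Day-to-Day Edits
```bash
# 1. Edit the module
cd API_WORKSPACE
vim modules/09_tmdb.md
# 2. Regenerate a single document
make tmdb
# 3. Preview the result
less _build/TMDb_User_Guide.md
# 4. Deploy
make deploy
```
### 6.2 Adding an Endpoint
```bash
# 1. Find the module it belongs to
ls modules/
# Decide which module the endpoint belongs to (e.g. tmdb, identity, search)
# 2. Add the endpoint documentation to that module
vim modules/09_tmdb.md
# 3. Regenerate all documents
make all
# 4. Confirm every document that references this endpoint was updated correctly
make check
# 5. Deploy
make deploy
```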
### 6.3 Custom Deliverables
```bash
# Add a custom recipe
cat > configs/integration_partner.toml << TOML
title = "Integration Partner API Guide"
output = "_build/PARTNER_GUIDE.md"
mode = "full"
modules = ["01_auth", "06_search", "09_tmdb", "11_error_codes"]
toc = true
[frontmatter]
document_type = "user_manual"
service = "MOMENTRY_CORE"
version = "V1.0"
owner = "M5"
created_by = "OpenCode"
TOML
# Add a matching target to the Makefile (recipe lines must start with a tab,
# and $(VAR) must reach the Makefile unescaped so make expands it)
printf 'partner:\n\t@$(SCRIPT) --config configs/integration_partner.toml --modules $(MODULES) --build $(BUILD)\n' >> Makefile
# Generate
make partner
# Deploy
make deploy
```
---
## 7. Deliverable Customization Matrix
| Audience | Modules needed | make target | Output |
|------|-------------|-------------|------|
| API Developer | 01-11 (all) | `make reference` | API_REFERENCE.md |
| Quick Start User | 01-06,09 | `make quickref` | API_QUICK_REFERENCE.md |
| Marcom Team | 01,03,06 + narrative | `make marcom` | API_TRAINING_MARCOM.md |
| TMDb User | 01,03,09 | `make tmdb` | TMDb_User_Guide.md |
| Integration Partner | 01,06,09,11 | Custom config | PARTNER_GUIDE.md |
---
## 8. GUIDES/ File Types
| Type | Source | Description |
|------|------|------|
| `API_*.md`, `TMDb_User_Guide.md` (7 files) | Generated from API_WORKSPACE | API feature docs: endpoint lists + curl examples |
| `Demo_*.md`, `M5API_*.md` | Hand-written | Narrative guides with complete step-by-step flows |
| `PORTAL_*.md` | Hand-written | Portal development plan and demo guide |
| `USER_MANUAL.md` | Hand-written | System operation manual |
> **Reminder**: Do not edit the generated files in GUIDES/ directly. Make changes in API_WORKSPACE/modules/, then run `make deploy`.
---
## Related Documents
- `API_WORKSPACE/README.md`: developer quick-start guide
- `API_WORKSPACE/modules/_template.md`: module authoring template
- `STANDARDS/DOCS_STANDARD.md`: document creation standard
- `STANDARDS/USER_DOCS_STANDARD.md`: user documentation standard


@@ -0,0 +1,128 @@
# Representative Frame API V1.0
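Portal video representative-frame API: when no frame_number is given, it auto-detects the male and female leads and finds their best interaction frame.
---
## 1. Overview
### Purpose
Portal needs to show one representative frame (thumbnail) per video. It should be the video's most representative scene, typically a moment where the male and female leads share the frame and look at each other.
### Principle
**No frame_number given → auto-detect representative frame**
The existing endpoint stays as it is; it only needs to auto-detect when the `frame` parameter is empty.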
---
## 2. Endpoint
### `GET /api/v1/file/:file_uuid/thumbnail`
**Query Parameters**:
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `frame` | i64 | ❌ | Frame to use; omit it to auto-detect |
| `x` | i32 | ❌ | bbox crop x |
| `y` | i32 | ❌ | bbox crop y |
| `w` | i32 | ❌ | bbox crop width |
| `h` | i32 | ❌ | bbox crop height |
**Response**: Pure JPEG bytes (Content-Type: image/jpeg)
**Examples**:
```
GET /api/v1/file/:uuid/thumbnail → auto-detect
GET /api/v1/file/:uuid/thumbnail?frame=38165 → explicit frame
GET /api/v1/file/:uuid/thumbnail?frame=38165&x=723&y=205&w=221&h=221 → explicit crop
```
---
## 3. Internal Algorithm
### Auto-detect Fallback Chain
```
Step 1: Auto-detect leads (top 2 by face count)
  └─ face_detections JOIN identities
Step 2: TKG Bridge: mutual_gaze?
  ├── mutual_gaze edge exists → first_frame ✅
  └── none → first frame in face_detections where both leads appear ✅
Step 3: Only one lead?
  └─ that lead's frame with the highest face_quality (w×h×confidence)
Step 4: No identity at all?
  └─ frame of any face with the highest face_quality
Step 5: No faces at all?
  └─ 404 "No faces in this file"
```
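The chain above can be read as a single decision function. The sketch below works on an in-memory list of detections instead of the database, so the data shape (dicts with `frame`, `identity_id`, `w`, `h`, `confidence`) and the name `pick_representative_frame` are assumptions. One case the chain does not spell out, two leads who never share a frame and have no mutual_gaze edge, is handled here by falling through to Step 3; that choice is also an assumption.

```python
from collections import Counter


def face_quality(det):
    """face_quality = w × h × confidence, as in Step 3."""
    return det["w"] * det["h"] * det["confidence"]


def pick_representative_frame(detections, mutual_gaze_frame=None):
    """Return a frame number, or None when the caller should answer 404."""
    if not detections:
        return None  # Step 5: no faces at all
    counts = Counter(d["identity_id"] for d in detections if d["identity_id"] is not None)
    leads = [identity for identity, _ in counts.most_common(2)]  # Step 1
    if len(leads) == 2:
        if mutual_gaze_frame is not None:
            return mutual_gaze_frame  # Step 2: TKG mutual_gaze edge found
        frames_a = {d["frame"] for d in detections if d["identity_id"] == leads[0]}
        frames_b = {d["frame"] for d in detections if d["identity_id"] == leads[1]}
        shared = frames_a & frames_b
        if shared:
            return min(shared)  # Step 2 fallback: first frame with both leads
    # Step 3 (top lead only) or Step 4 (no identity: consider every face)
    pool = [d for d in detections if leads and d["identity_id"] == leads[0]] or detections
    return max(pool, key=face_quality)["frame"]


if __name__ == "__main__":
    sample = [
        {"frame": 10, "identity_id": "a", "w": 100, "h": 100, "confidence": 0.9},
        {"frame": 20, "identity_id": "a", "w": 100, "h": 100, "confidence": 0.9},
        {"frame": 20, "identity_id": "b", "w": 100, "h": 100, "confidence": 0.9},
    ]
    print(pick_representative_frame(sample))
```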
### TKG Bridge Query
```sql
-- Assumed parameter meaning (not stated in the original):
--   $1 = file_uuid, $2/$3 = identity_id of lead A/B, $4/$5 = main trace_id of lead A/B
-- Find each lead's main trace (run once per lead, binding that lead's identity_id to $2)
SELECT trace_id FROM face_detections
WHERE file_uuid = $1 AND identity_id = $2 AND trace_id IS NOT NULL
GROUP BY trace_id ORDER BY COUNT(*) DESC LIMIT 1;
-- TKG mutual_gaze lookup
SELECT (e.properties->>'first_frame')::bigint
FROM tkg_edges e
JOIN tkg_nodes a ON a.id = e.source_node_id
JOIN tkg_nodes b ON b.id = e.target_node_id
WHERE e.file_uuid = $1
AND a.external_id = concat('trace_', $4)
AND b.external_id = concat('trace_', $5)
AND e.properties->>'mutual_gaze' = 'true'
LIMIT 1;
-- Fallback: first frame where both leads appear
-- (the join also matches file_uuid so frames from other files cannot pair up)
SELECT MIN(fd_a.frame_number)::bigint
FROM face_detections fd_a
JOIN face_detections fd_b
  ON fd_a.frame_number = fd_b.frame_number
 AND fd_a.file_uuid = fd_b.file_uuid
WHERE fd_a.file_uuid = $1 AND fd_a.identity_id = $2 AND fd_b.identity_id = $3;
```
---
## 4. Implementation
### Files Changed
| File | Change |
|------|--------|
| `src/api/media_api.rs` | `ThumbQuery.frame` → `Option<i64>`; add auto-detect fallback |
| `src/core/processor/tkg.rs` | Add `query_auto_representative_frame()` + structs (implemented) |
| `src/core/processor/mod.rs` | Export new function + structs (implemented) |
### Existing Trace-level Endpoints (unchanged)
```
GET /api/v1/file/:uuid/trace/:tid/representative-face → JSON (legacy)
GET /api/v1/file/:uuid/trace/:tid/thumbnail → JPEG (auto via select_rep_face)
```
### No Changes
- ❌ No new DB tables / migrations
- ❌ No changes to `select_rep_face` / blurdetect
- ❌ No chunk / cut / pre_chunks dependency
---
## 5. Version History
| Date | Version | Author | Change |
|------|---------|--------|--------|
| 2026-05-22 | 1.0 | OpenCode | Initial design |
| 2026-05-22 | 1.1 | OpenCode | Simplified to a single endpoint: auto-detect when frame is None |
*Updated: 2026-05-22*


@@ -0,0 +1,270 @@
---
document_type: "design_doc"
service: "MOMENTRY_CORE"
title: "Redis Progress Reporting V1.0"
version: "V1.0"
date: "2026-05-17"
author: "M5"
status: "draft"
---
# Redis Progress Reporting V1.0
| Item | Details |
|------|------|
| Service | `MOMENTRY_CORE` |
| Version | V1.0 |
| Date | 2026-05-17 |
| Author | M5 (OpenCode) |
| Status | Draft |
## 1. Overview
This document defines the standardized progress reporting architecture for Momentry Core processors. It replaces the inconsistent ad-hoc progress patterns found across `scripts/`, `src/worker/`, and `src/api/`.
### 1.1 Problems Addressed
| # | Problem | Detail |
|---|---------|--------|
| 1 | Worker Redis key does not match `OPERATIONS/MOMENTRY_CORE_REDIS_KEYS.md` V1.0 spec | Worker writes `worker:job:{uuid}:processor:{name}` instead of spec `job:{uuid}:processor:{name}` |
| 2 | Progress API reads wrong key | `get_progress()` reads `worker:job:{uuid}:processor:{name}`, which does not match the Playground subscriber that writes `job:{uuid}:processor:{name}` |
| 3 | Swift processors (Face/OCR/Pose) lack RedisPublisher | Progress lost — only stdout text |
| 4 | ASRX/Story/Visual chunk have no incremental progress | Start + complete only, no `current/total` updates |
| 5 | `frames_processed` / `chunks_produced` never updated in real-time | Worker only writes processor hash at start and exit |
| 6 | No `output_count` / `output_type` fields | Impossible to know how many faces/objects/segments were produced |
### 1.2 Key Design Decisions
| Decision | Rationale |
|----------|-----------|
| Progress unit = frames for video processors | All media-level processors work frame by frame |
| Output count separate from progress | Processors may produce N outputs per frame (multiple faces, objects) |
| Pub/sub for real-time, Hash for final state | Pub/sub is transient; Hash persists for API queries |
---
## 2. Redis Key Architecture
### 2.1 Key Patterns
All keys use the configured `REDIS_KEY_PREFIX` (default: `momentry:` for production, `momentry_dev:` for playground).
| Pattern | Type | TTL | Purpose | Owner |
|---------|------|-----|---------|-------|
| `{prefix}progress:{uuid}` | Pub/Sub | — | Real-time progress messages | Python scripts |
| `{prefix}job:{uuid}` | Hash | 24h | Per-video job state | Worker |
| `{prefix}job:{uuid}:processor:{name}` | Hash | 24h | Per-processor final state | Worker |
| `{prefix}job:{uuid}:processor:{name}:output_count` | String | 24h | Output count by type | Worker |
### 2.2 Processor Hash Fields
```
{prefix}job:{uuid}:processor:{name}
├── status String running / completed / failed / pending
├── current u32 Units processed (frames for video processors)
├── total u32 Total units
├── output_count u32 Output items produced (faces, objects, segments)
├── output_type String Type name of output: faces / objects / segments / cuts / etc.
├── pid i32 OS process ID (0 if not running)
├── error String Error message if failed
└── updated_at String ISO 8601 timestamp
```
### 2.3 Migrated Keys
The following key patterns from the original implementation are REMOVED:
| Old Key | Reason |
|---------|--------|
| `{prefix}worker:job:{uuid}:processor:{name}` | Non-standard prefix — not in `MOMENTRY_CORE_REDIS_KEYS.md` spec |
| `{prefix}job:{uuid}:processor:{name}:status` (flat) | Redundant — status stored in Hash |
| `{prefix}job:{uuid}:processor:{name}:progress` (flat) | Replaced by `current` + `total` for percent calculation |
| `{prefix}job:{uuid}:processor:{name}:current` (flat) | Replaced by Hash fields |
| `{prefix}job:{uuid}:processor:{name}:total` (flat) | Replaced by Hash fields |
| `{prefix}job:{uuid}:processor:{name}:started_at` (flat) | Replaced by Hash `updated_at` |
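A small set of key helpers makes the naming in sections 2.1 and 2.3 harder to get wrong. The following is a sketch only: the function names are hypothetical, and the real worker is written in Rust (`redis_client.rs`), so this Python version is meant for scripts or for checking key strings during the Phase 4 cleanup.

```python
def processor_key(prefix, uuid, name):
    """Per-processor Hash key, e.g. momentry:job:<uuid>:processor:face."""
    return f"{prefix}job:{uuid}:processor:{name}"


def progress_channel(prefix, uuid):
    """Pub/Sub channel for real-time progress messages."""
    return f"{prefix}progress:{uuid}"


def is_legacy_key(key, prefix):
    """True for keys that still use the removed `worker:job:` pattern."""
    return key.startswith(f"{prefix}worker:job:")


if __name__ == "__main__":
    print(processor_key("momentry:", "abc", "face"))
    print(progress_channel("momentry_dev:", "abc"))
    print(is_legacy_key("momentry:worker:job:abc:processor:face", "momentry:"))
```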
---
## 3. Pub/Sub Message Format
### 3.1 Channel
```
{prefix}progress:{uuid}
```
### 3.2 Message JSON
```json
{
"processor": "face",
"current": 150,
"total": 162696,
"output_count": 423,
"output_type": "faces",
"message": "Processing frame 150",
"timestamp": 1700000000
}
```
### 3.3 Field Definitions
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `processor` | String | ✅ | Processor name: asr / asrx / yolo / ocr / face / pose / cut / story / visual_chunk |
| `current` | u32 | ✅ | Units processed (frames for video processors) |
| `total` | u32 | ✅ | Total units |
| `output_count` | u32 | ❌ | Output items produced so far |
| `output_type` | String | ❌ | Type name: faces / objects / segments / cuts / text_regions / persons / speakers / stories / visual_chunks |
| `message` | String | ❌ | Human-readable progress description |
| `timestamp` | u64 | ✅ | Unix timestamp in seconds |
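As an illustration of the field rules, the sketch below builds the message JSON and rejects obviously malformed input. It does not use `redis_publisher.py` (whose interface is not shown in this document); `build_progress_message` is a hypothetical helper, and the validation rules beyond "required vs optional" are assumptions.

```python
import json
import time

PROCESSORS = {"asr", "asrx", "yolo", "ocr", "face", "pose", "cut", "story", "visual_chunk"}


def build_progress_message(processor, current, total, output_count=None,
                           output_type=None, message=None, timestamp=None):
    """Serialize one progress message; optional fields are omitted when None."""
    if processor not in PROCESSORS:
        raise ValueError(f"unknown processor: {processor}")
    if current < 0 or total < 0:
        raise ValueError("current and total must be non-negative")
    payload = {
        "processor": processor,
        "current": current,
        "total": total,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    optional = {"output_count": output_count, "output_type": output_type, "message": message}
    payload.update({key: value for key, value in optional.items() if value is not None})
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_progress_message("face", 150, 162696, output_count=423, output_type="faces"))
```

The resulting string is what a publisher would send on the `{prefix}progress:{uuid}` channel.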
---
## 4. Per-Processor Metrics
| Processor | current/total Unit | output_type | When to Publish |
|-----------|-------------------|-------------|-----------------|
| ASR | frames | `segments` | Every 100 segments processed |
| ASRX | frames | `speakers` | Every processing stage |
| YOLO | frames | `objects` | Every 500 frames |
| OCR | frames | `text_regions` | Every 5% |
| Face | frames | `faces` | Every batch (5% of frames) |
| Pose | frames | `persons` | Every 10% |
| CUT | frames | `cuts` | Every scene detected |
| Story | chunks | `stories` | Every chunk processed |
| Visual chunk | frames | `visual_chunks` | Every chunk processed |
### 4.1 Output Type Enum
```rust
pub enum OutputType {
Segments, // ASR
Speakers, // ASRX
Objects, // YOLO
TextRegions, // OCR
Faces, // Face
Persons, // Pose
Cuts, // CUT
Stories, // Story
VisualChunks, // Visual chunk
}
```
---
## 5. Data Flow
```
┌──────────────────┐ Pub/Sub ┌──────────────────────┐
│ Python Processor │ ───────── progress:{uuid} ──────────→│ Worker (subscriber) │
│ (ASR/YOLO/Face) │ {current, total, │ │
│ │ output_count, output_type} │ ──→ HSET │
└──────────────────┘ │ job:{uuid}: │
│ processor:{name} │
┌──────────────────┐ │ │
│ Swift Processor │ ──→ Python wrapper ──→ pub/sub │ (status, current, │
│ (Face/OCR/Pose) │ (add RedisPublisher) │ total, output_count,│
└──────────────────┘ │ output_type) │
└──────────┬───────────┘
│ HGETALL
┌──────────▼───────────┐
│ Progress API │
│ GET /progress/:uuid │
│ │
│ ─→ compute % │
│ ─→ return JSON │
└─────────────────────┘
```
---
## 6. Implementation Plan
### Phase 1: Python Processor RedisPublisher
| Task | Files | Effort |
|------|-------|--------|
| Add `RedisPublisher` to `face_processor.py` | `scripts/face_processor.py` | Medium |
| Add `RedisPublisher` to `ocr_processor.py` | `scripts/ocr_processor.py` | Medium |
| Add `RedisPublisher` to `pose_processor.py` | `scripts/pose_processor.py` | Medium |
| Add incremental `.progress()` to `asrx_processor_custom.py` | `scripts/asrx_processor_custom.py` | Low |
| Standardize pub/sub message to include `output_count`, `output_type` | All processor scripts | Low |
### Phase 2: Worker
| Task | Files | Effort |
|------|-------|--------|
| Fix Redis key from `worker:job:` to `job:` | `src/worker/processor.rs`, `src/core/db/redis_client.rs` | Low |
| Subscribe to `progress:{uuid}` channel in `run_processor()` | `src/worker/processor.rs` | Medium |
| HSET Processor Hash on each progress message | `src/worker/processor.rs` | Medium |
| Set `output_count` and `output_type` from pub/sub message | `src/worker/processor.rs` | Low |
### Phase 3: Progress API
| Task | Files | Effort |
|------|-------|--------|
| Read `output_count`, `output_type` from Redis Hash | `src/api/server.rs` | Low |
| Compute percentage from `current` / `total` | `src/api/server.rs` | Low |
| Return `output_count`, `output_type` in response JSON | `src/api/server.rs` | Low |
| Remove `worker:` fallback path | `src/api/server.rs` | Low |
### Phase 4: Cleanup
| Task | Files | Effort |
|------|-------|--------|
| Remove old `worker:job:` keys from Redis | Deployment script | Low |
| Remove `update_processor_progress()` DB path (stale `processing_status` JSONB) | `src/core/db/postgres_db.rs` | Medium |
---
## 7. API Response Changes
### ProgressResponse (new fields)
```json
{
"processors": [
{
"name": "face",
"status": "running",
"current": 150,
"total": 162696,
"progress": 0,
"frames_processed": 150,
"output_count": 423,
"output_type": "faces"
}
]
}
```
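The `progress` value in the response is derived from `current` and `total`. A minimal sketch of that calculation follows, assuming an integer percentage that rounds down (which is consistent with the example, where 150 of 162696 frames gives `0`) and assuming that a zero or missing `total` should report 0 rather than fail. The real handler lives in `src/api/server.rs`; `progress_percent` is a hypothetical name.

```python
def progress_percent(current, total):
    """Integer percentage, rounded down and clamped to 0..100."""
    if total <= 0:
        return 0  # total not known yet: avoid division by zero
    return max(0, min(100, current * 100 // total))


if __name__ == "__main__":
    print(progress_percent(150, 162696))
    print(progress_percent(81348, 162696))
```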
---
## 8. Dependencies
| Component | Version | Role |
|-----------|---------|------|
| Redis | ≥ 6.0 | Pub/Sub + Hash storage |
| `redis_publisher.py` | Existing | Python → Redis pub/sub client |
| `redis_client.rs` | Existing | Rust Redis client for worker + API |
---
## 9. References
| Doc | Relation |
|-----|----------|
| `OPERATIONS/MOMENTRY_CORE_REDIS_KEYS.md` | Parent spec — this doc supersedes sections 4, 7, 8 |
| `DESIGN/VIDEO_PROCESSING_SPEC.md` §2.3 | Original progress design (ProcessProgress struct) |
| `src/worker/processor.rs` | Worker progress write implementation |
| `scripts/redis_publisher.py` | Python pub/sub client |
| `src/api/server.rs` (get_progress) | Progress API handler |
---
## Version History
| Version | Date | Author | Change |
|---------|------|--------|--------|
| V1.0 | 2026-05-17 | M5 (OpenCode) | Initial draft — replaces ad-hoc progress patterns |