feat: ASRX hybrid pipeline, identity history, worker fixes, checkpoint system
588
docs_v1.0/DESIGN/ASRX_HYBRID_PIPELINE_V1.0.md
Normal file
@@ -0,0 +1,588 @@
# ASRX Hybrid Pipeline v1.0: Hybrid Speaker Diarization Architecture

| Item | Details |
|------|---------|
| **Scope** | ASRX processor refactor: whisperx → VAD-first hybrid pipeline |
| **Status** | Draft |
| **Applies to** | Momentry Core V4.0+ |
| **Authors** | OpenCode / Warren |
| **Created** | 2026-06-01 |
||||
---
|
||||
|
||||
## 1. Problem

### 1.1 Current Issues

| Issue | Description | Impact |
|-------|-------------|--------|
| **Whisper merges short utterances** | `whisper small` transcribes a two-person exchange as one continuous segment (A+B → one sentence) | An ASR segment mixes two speakers' words, so the speakers cannot be separated |
| **ASRX v2 `speaker_id = null`** | `asrx_processor_v2.py` calls `whisperx.DiarizationPipeline()`, but that API is not exposed in whisperx's `__init__.py` | Every segment's speaker is null |
| **Lost text** | `SelfASRXFixed.process_with_segments()` in `asrx_processor_custom.py` only emits `text: ""` | Rule 1 has no text to use when merging |
| **Wrong diarization backend** | `asrx_processor_v2.py` relies on whisperx's built-in diarization, which is unstable | Accuracy ~85%; requires an HF token |
| **Version sprawl** | 7 root-level variants and 14 `asrx_self` files; production runs the wrong version | Hard to maintain; unclear which version is correct |
|
||||
### 1.2 Pain-Point Scenario

**Two speakers trading short utterances** (interviews, conversations):

```
Audio:    A(2s) → B(1.5s) → A(3s)
Whisper:  ───────[0-7s, "A+B+A all merged together"]───────
```

Whisper does not split at the pauses between sentences, so the ASR time boundaries cannot reflect speaker changes.
|
||||
|
||||
---
|
||||
|
||||
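## 2. Architecture

### 2.1 Core Principles

1. **VAD sets the boundaries first**: VAD splits at the pauses between sentences, replacing whisper's boundaries
2. **ASR runs afterwards**: each segment is transcribed on its own and keeps its own text
3. **Voiceprint clustering assigns the speaker**: ECAPA-TDNN + AgglomerativeClustering

### 2.2 The 5-Step Core Pipeline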
|
||||
|
||||
```
|
||||
Audio
|
||||
│
|
||||
① whisper (一次, 粗略定位)
|
||||
│ 找到說話段 + 初步文字 + 語種
|
||||
│ [0-7s, "今天天氣很好我覺得也不錯對啊", zh]
|
||||
│
|
||||
② VAD scan (在每段內細切)
|
||||
│ 利用句間停頓切開
|
||||
│ 段1 [0-2s] 段2 [2-3.5s] 段3 [3.5-7s]
|
||||
│
|
||||
③ whisper per refined segment (各段轉錄)
|
||||
│ 段1 → "今天天氣很好" (zh, 0.98)
|
||||
│ 段2 → "我覺得也不錯" (zh, 0.97)
|
||||
│ 段3 → "對啊" (zh, 0.96)
|
||||
│
|
||||
④ ECAPA-TDNN per refined segment (聲紋提取)
|
||||
│ 段1 → emb[0] (192-dim)
|
||||
│ 段2 → emb[1] (192-dim)
|
||||
│ 段3 → emb[2] (192-dim)
|
||||
│
|
||||
⑤ AgglomerativeClustering (聚類定 speaker)
|
||||
│ emb[0]=SPEAKER_0, emb[1]=SPEAKER_1, emb[2]=SPEAKER_0
|
||||
│
|
||||
輸出:
|
||||
start end text language speaker_id
|
||||
0.0 2.0 今天天氣很好 zh SPEAKER_0
|
||||
2.0 3.5 我覺得也不錯 zh SPEAKER_1
|
||||
3.5 7.0 對啊 zh SPEAKER_0
|
||||
```
|
||||
|
||||
### 2.3 流程圖
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ asrx_processor.py │
|
||||
│ (wrapper) │
|
||||
│ │
|
||||
│ ① ffprobe → select best track → ffmpeg → 16kHz WAV │
|
||||
│ │
|
||||
│ ② SelfASRXFixed.process(audio_wav, file_uuid) │
|
||||
│ │ │
|
||||
│ ├─ Step 1: whisper.transcribe() → rough segments │
|
||||
│ ├─ Step 2: VAD scan each rough segment │
|
||||
│ ├─ Step 3: whisper per refined segment → text+language │
|
||||
│ ├─ Step 4: ECAPA-TDNN per segment → 192-dim embedding │
|
||||
│ ├─ Step 5: AgglomerativeClustering → speaker_labels │
|
||||
│ │ │
|
||||
│ ├─ Step 6: Store embeddings in Qdrant │
|
||||
│ │ └─ {file_uuid, speaker_id, text, language, start, end} │
|
||||
│ │ │
|
||||
│ └─ Step 7: Classify high-quality embeddings │
|
||||
│ ├─ quality > threshold → reference profile │
|
||||
│ ├─ 送入聲音分類模型推論性別/屬性 │
|
||||
│ └─ 寫入 Qdrant (type: speaker_reference) │
|
||||
│ │
|
||||
│ ③ 輸出 JSON 格式 (不含 embedding) │
|
||||
│ │
|
||||
│ Rust: rule1_ingest.rs │
|
||||
│ └─ pre_chunks(processor_type='asrx') → chunks │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 檔案組織
|
||||
|
||||
### 3.1 最終檔案結構
|
||||
|
||||
```
scripts/
├── asrx_processor.py              ← production (cleaned custom.py)
│
└── asrx_self/                     ← core library
    ├── __init__.py                ← package marker
    ├── vad.py                     ← Silero VAD (adds scan_within_segment)
    ├── whisper_local.py           ← 🆕 wraps whisper loading + transcription
    ├── speaker_encoder.py         ← ECAPA-TDNN 192-dim
    ├── speaker_cluster_fixed.py   ← AgglomerativeClustering
    ├── speaker_classifier.py      ← 🆕 high-quality voiceprint grading + classification (see 5.6)
    └── main_fixed.py              ← 🔧 rewritten as the 7-step pipeline (5 core steps + Qdrant storage + classification)
```
|
||||
|
||||
### 3.2 刪除清單
|
||||
|
||||
**Root-level 變體**(全部刪除):
|
||||
|
||||
| 檔案 | 原因 |
|
||||
|------|------|
|
||||
| `asrx_processor.py` | 原始 whisperx 版,diarization 壞的 |
|
||||
| `asrx_processor_v2.py` | 同上,Rust 目前錯誤呼叫此檔 |
|
||||
| `asrx_processor_v2_noalign.py` | 跳過對齊但 diarization 仍壞 |
|
||||
| `asrx_processor_v2_transcribe.py` | 只轉錄不做 speaker |
|
||||
| `asrx_processor_simplified.py` | 變體 |
|
||||
| `asrx_processor_contract_v1.py` | 18KB,pyannote,需 HF token |
|
||||
|
||||
**asrx_self 內被取代的舊版**:
|
||||
|
||||
| 檔案 | 原因 | 取代者 |
|
||||
|------|------|--------|
|
||||
| `main.py` | 用 SpectralClustering,有 NaN 問題 | `main_fixed.py` |
|
||||
| `speaker_cluster.py` | 用 SpectralClustering,不穩定 | `speaker_cluster_fixed.py` |
|
||||
|
||||
### 3.3 搬離清單
|
||||
|
||||
非生產工具搬至 `tools/asrx/`:
|
||||
|
||||
```
|
||||
tools/asrx/
|
||||
├── integrate_face_asrx_speaker.py
|
||||
├── speaker_player_gui.py
|
||||
├── speaker_player_gui_face.py
|
||||
├── speaker_player_interactive.py
|
||||
├── speaker_audio_player.py
|
||||
├── test_long_movie.py
|
||||
├── test_gui_face_player.py
|
||||
└── docs/
|
||||
├── FINAL_TEST_REPORT.md
|
||||
├── GUI_FACE_PLAYER_USAGE.md
|
||||
├── LONG_MOVIE_TEST_SUMMARY.md
|
||||
└── SPEAKER_PLAYER_GUIDE.md
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Qdrant 聲紋向量儲存
|
||||
|
||||
### 4.1 儲存流程
|
||||
|
||||
```
Step 4 output: each refined segment has {embedding: [192-dim], text, language, start, end}
Step 5 output: each segment is labeled with a speaker_id {SPEAKER_0, SPEAKER_1, ...}

Step 6: store in Qdrant
  ┌─ each segment → one Qdrant point
  │    point_id = hash(file_uuid + "_" + segment_index)   ← reproducible, so the same point can be looked up again
  │    vector   = embedding (192-dim)
  │    payload  = {
  │      "file_uuid": str,             ← source video identifier
  │      "speaker_id": str,            ← filled in after clustering
  │      "text": str,                  ← ASR transcript
  │      "language": str,              ← language (zh/en/...)
  │      "start_time": f64,            ← seconds
  │      "end_time": f64,              ← seconds
  │      "type": "speaker_embedding"   ← distinguishes it from other point types
  │    }
  └─
```
|
||||
|
||||
### 4.2 Qdrant Collection
|
||||
|
||||
| 項目 | 內容 |
|
||||
|------|------|
|
||||
| Collection Name | `momentry_speaker` (或共用現有 collection) |
|
||||
| Vector Dimension | 192 (ECAPA-TDNN 輸出) |
|
||||
| Distance Metric | Cosine |
|
||||
| Point ID | `hash(file_uuid + "_" + segment_index)` |
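
The `hash(...)` notation above leaves the hash function open. One caveat if the ID is computed in Python: the built-in `hash()` is salted per process for strings, so it would not give a reproducible ID across runs. A minimal sketch of one stable alternative follows. The function name `speaker_point_id` and the choice of SHA-256 truncated to 64 bits are assumptions for illustration, chosen to fit the `u64` point ID used in section 4.3.

```python
import hashlib


def speaker_point_id(file_uuid: str, segment_index: int) -> int:
    """Derive a reproducible unsigned 64-bit point ID (hypothetical helper).

    Uses the same key layout as the table: file_uuid + "_" + segment_index.
    """
    key = f"{file_uuid}_{segment_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep the first 8 bytes so the value fits an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")
```

Re-processing the same file then yields the same IDs, so an upsert replaces existing points instead of duplicating them.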
|
||||
|
||||
### 4.3 Rust `upsert_speaker_embedding`
|
||||
|
||||
```rust
impl QdrantDb {
    /// Upsert one speaker embedding as a Qdrant point.
    pub async fn upsert_speaker_embedding(
        &self,
        point_id: u64,
        vector: &[f32],
        file_uuid: &str,
        speaker_id: &str,
        text: &str,
        language: &str,
        start_time: f64,
        end_time: f64,
    ) -> Result<()> {
        // Qdrant PUT /collections/{collection}/points?wait=true
        // payload: {file_uuid, speaker_id, text, language, start_time, end_time, type: "speaker_embedding"}
        todo!("design stub: build the request body and send it to Qdrant")
    }
}
```
|
||||
|
||||
### 4.4 與現有 Face Embedding 的關係
|
||||
|
||||
| 類別 | Qdrant Collection | Dim | Payload |
|
||||
|------|-------------------|-----|---------|
|
||||
| Face | `momentry` (self.collection_name) | 512 (FaceNet) | `file_uuid, trace_id, frame_number` |
|
||||
| **Speaker** | `momentry` 或獨立 collection | **192** (ECAPA-TDNN) | `file_uuid, speaker_id, text, language, start, end` |
|
||||
|
||||
---
|
||||
|
||||
## 5. 模組詳細設計
|
||||
|
||||
### 5.1 `vad.py`: Voice Activity Detection

| Item | Details |
|------|---------|
| Model | Silero VAD (torch.hub, snakers4/silero-vad) |
| Existing functions | `load_vad_model()`, `extract_speech_segments()` |
| **New function** | **`scan_within_segment(wav, start_sec, end_sec, model, utils, min_speech_duration_ms=500)`** |

What `scan_within_segment` does:

- Runs a VAD scan inside a single time range `[start_sec, end_sec]`
- Returns only the speech sub-segments inside that range: `[(s1, e1), (s2, e2), ...]`
- Splits at the pauses between sentences, which resolves the whisper merging problem
|
||||
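
The windowing logic of `scan_within_segment` can be sketched independently of the VAD model. In the sketch below the model is replaced by an injected callable `vad_fn`, which is an assumption made so the example runs without torch; the real function takes `model, utils` as listed in the table. `vad_fn` is assumed to return speech spans as `(start_sample, end_sample)` pairs relative to the slice it receives.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for the Silero VAD call: slice of samples in,
# speech spans (start_sample, end_sample) relative to that slice out.
VadFn = Callable[[Sequence[float]], List[Tuple[int, int]]]


def scan_within_segment_sketch(
    wav: Sequence[float],
    start_sec: float,
    end_sec: float,
    vad_fn: VadFn,
    sample_rate: int = 16000,
    min_speech_duration_ms: int = 500,
) -> List[Tuple[float, float]]:
    """Return speech sub-segments inside [start_sec, end_sec] in absolute seconds."""
    lo = max(0, int(start_sec * sample_rate))
    hi = min(len(wav), int(end_sec * sample_rate))
    if hi <= lo:
        return []  # window lies outside the audio

    min_samples = sample_rate * min_speech_duration_ms // 1000
    refined = []
    for span_start, span_end in vad_fn(wav[lo:hi]):
        if span_end - span_start < min_samples:
            continue  # too short to transcribe or embed reliably
        # Shift slice-relative samples back to absolute time.
        refined.append(((lo + span_start) / sample_rate,
                        (lo + span_end) / sample_rate))
    return refined
```

Running the VAD on the slice and shifting the results by `lo` keeps every returned boundary inside the rough whisper segment, which is what lets step ③ transcribe each refined segment on its own.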
|
||||
### 5.2 `whisper_local.py` 🆕 — Whisper 封裝
|
||||
|
||||
| 項目 | 內容 |
|
||||
|------|------|
|
||||
| 模型 | `whisper.load_model("base")` (可設定) |
|
||||
| 函數 | `load_model()`, `transcribe_segment(audio, start, end)` |
|
||||
|
||||
```python
|
||||
def transcribe_segment(wav, sample_rate, start_sec, end_sec, model) -> dict:
|
||||
"""轉錄單一段落,回傳 {text, language, lang_prob, segments}"""
|
||||
```
|
||||
|
||||
每段獨立轉錄,保留語言與信心度。
|
||||
|
||||
### 5.3 `speaker_encoder.py` — 聲紋編碼器
|
||||
|
||||
| 項目 | 內容 |
|
||||
|------|------|
|
||||
| 模型 | SpeechBrain ECAPA-TDNN (`spkrec-ecapa-voxceleb`) |
|
||||
| 輸出維度 | 192-dim |
|
||||
| EER | 0.80% (VoxCeleb1) |
|
||||
| 授權 | MIT (不需要 HuggingFace token) |
|
||||
| 函數 | `load_speaker_encoder()`, `extract_speaker_embedding()`, `extract_speaker_embeddings_batch()` |
|
||||
|
||||
### 5.4 `speaker_cluster_fixed.py` — 說話人聚類
|
||||
|
||||
| 項目 | 內容 |
|
||||
|------|------|
|
||||
| 演算法 | AgglomerativeClustering (cosine + average linkage) |
|
||||
| 取代 | `speaker_cluster.py` (SpectralClustering, NaN 問題) |
|
||||
| 函數 | `robust_speaker_clustering(embeddings, n_speakers=None, max_speakers=10)` |
|
||||
|
||||
### 5.5 `main_fixed.py` 🔧 — 核心調度器(7 步 Pipeline)
|
||||
|
||||
```python
class SelfASRXFixed:
    def process(self, audio_path, output_path=None, file_uuid=None):
        """
        7-step speaker diarization pipeline.

        Steps:
          1. whisper.transcribe(audio) → rough segments + text + language
          2. VAD scan each rough segment → refined segments
          3. whisper per refined segment → {text, language, lang_prob}
          4. ECAPA-TDNN per refined segment → 192-dim embeddings
          5. AgglomerativeClustering → speaker_labels
          6. Store all embeddings in Qdrant (if file_uuid provided)
             payload: {file_uuid, speaker_id, text, language, start_time, end_time, type: "speaker_embedding"}
          7. High-quality embeddings (quality > threshold) → classify + store reference
             payload: {type: "speaker_reference", file_uuid, speaker_id, n_segments, avg_quality, ...}

        Returns:
          {
            "segments": [
              {
                "start": float, "end": float,
                "text": str, "language": str,
                "lang_prob": float, "speaker": str,
                "speaker_id": str, "quality": float
              },
              ...
            ],
            "speaker_stats": {...},
            "n_speakers": int,
            "total_duration": float,
            "references": [
              {
                "speaker_id": str,
                "n_segments": int,
                "avg_quality": float,
                "gender": str
              }
            ]
          }
        """

    def _store_speaker_embeddings(self, segments, file_uuid):
        """Step 6: store each segment's 192-dim embedding in Qdrant."""

    def _classify_high_quality_speakers(self, segments, embeddings, labels, file_uuid):
        """Step 7: grade and classify high-quality voiceprints → Qdrant reference profile."""
```

**Removed**:

| Old method | Reason |
|------------|--------|
| `process_with_segments(audio, asr_segments)` | External ASR boundaries are unreliable; replaced by VAD |
| `process()` VAD-only fallback | Produces no text; replaced by the full pipeline |
|
||||
|
||||
### 5.6 `speaker_classifier.py` 🆕 — 高品質聲紋分級與分類
|
||||
|
||||
#### 目的
|
||||
|
||||
聚類後,對每個 cluster 的 embedding 進行品質評估,高於閾值的獨立建檔,並用外部模型做自動分類。
|
||||
|
||||
#### 流程
|
||||
|
||||
```
|
||||
Step ⑤ 聚類後,每個 segment 有 {embedding, speaker_id}
|
||||
│
|
||||
└─ Compute quality score per embedding
|
||||
│
|
||||
├─ 低於閾值 → 寫入 Qdrant (一般 speaker_embedding)
|
||||
│
|
||||
└─ 高於閾值 (quality > 0.85)
|
||||
├─ 獨立建 reference profile
|
||||
└─ 送入「支持聲音的模型」做分類
|
||||
├─ 語者性別 (male/female)
|
||||
├─ 語種口音 (zh-CN / zh-TW / en-US)
|
||||
└─ 或跨影片 speaker 匹配用
|
||||
```
|
||||
|
||||
#### Quality Score 計算
|
||||
|
||||
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def compute_embedding_quality(embeddings, labels, threshold=0.85):
    """
    Cosine similarity of each embedding to the centroid of its own cluster.

    Args:
        embeddings: [n_segments, 192]
        labels: [n_segments] cluster labels
        threshold: high-quality cutoff

    Returns:
        qualities: [n_segments] quality score per embedding
        high_quality_mask: [n_segments] bool array
    """
    # Accept plain lists too: the `labels == label` mask needs an ndarray.
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # One L2-normalized centroid per cluster.
    centroids = {}
    for label in np.unique(labels):
        centroid = embeddings[labels == label].mean(axis=0)
        centroids[label] = centroid / np.linalg.norm(centroid)

    qualities = np.array([
        cosine_similarity([emb], [centroids[label]])[0][0]
        for emb, label in zip(embeddings, labels)
    ])
    return qualities, qualities >= threshold
```
|
||||
|
||||
#### Reference Profile 格式
|
||||
|
||||
```json
{
  "point_id": "hash(\"speaker_reference_\" + file_uuid + \"_\" + speaker_id + \"_\" + cluster_index)",
  "vector": "[192-dim centroid embedding]",
  "payload": {
    "type": "speaker_reference",
    "file_uuid": "<source video UUID>",
    "speaker_id": "SPEAKER_0",
    "n_segments": 25,
    "avg_quality": 0.92,
    "total_duration": 45.3,
    "language": "zh",
    "gender": "male",
    "text_samples": ["今天天氣很好", "我覺得也不錯", "..."]
  }
}
```
|
||||
|
||||
#### 支援的聲音分類模型(選項)
|
||||
|
||||
| 模型 | 用途 | 優點 | 缺點 |
|
||||
|------|------|------|------|
|
||||
| **SpeechBrain gender classifier** | 性別分類 | 已整合 ECAPA-TDNN | 只分 male/female |
|
||||
| **CLAP** (LAION) | 零樣本音頻分類 | 可自訂 label text | 需額外安裝 |
|
||||
| **YAMNet** | 聲音事件分類 | Google 出品,521 classes | 不擅長語者屬性 |
|
||||
| **Wav2Vec2-BERT** (speechbrain) | 情感/屬性 | 多維度分類 | 模型較大 |
|
||||
| **自建 identity classifier** | 跨影片 speaker 匹配 | 與現有 identity 系統對接 | 需累積 reference data |
|
||||
|
||||
> **待決定**: 選擇哪個分類模型,由後續 POC 決定。
|
||||
|
||||
#### `main_fixed.py` 新增方法
|
||||
|
||||
```python
class SelfASRXFixed:
    # ... the existing 6 steps ...

    def _classify_high_quality_speakers(self, segments, embeddings, labels, file_uuid):
        """
        Step 7: grade and classify high-quality voiceprints.

        1. Compute the quality score
        2. Build a reference profile for those above the threshold
        3. Infer gender/attributes with the classification model
        4. Write to Qdrant (type: speaker_reference)
        """
        qualities, mask = compute_embedding_quality(embeddings, labels)

        for seg, emb, quality, is_high in zip(segments, embeddings, qualities, mask):
            seg["quality"] = float(quality)
            if is_high:
                profile = self._build_reference_profile(emb, seg, file_uuid)
                # Classification (placeholder)
                # gender = classify_gender(emb)
                self._store_speaker_reference(profile)
```
|
||||
|
||||
### 5.7 `asrx_processor.py` — 清理後的 wrapper
|
||||
|
||||
清理項目:
|
||||
|
||||
| 問題 | 位置 | 修法 |
|
||||
|------|------|------|
|
||||
| 硬編碼 UUID `dd61fda8...` | line 155 | 移除該 fallback path |
|
||||
| `os.chdir(script_dir)` | line 112 | 改區域性 Path 操作 |
|
||||
| ASR 文字丟棄 | line 258 | `text` 來自新 pipeline |
|
||||
| `_debug` dict | line 222 | 移除 |
|
||||
| `max_speakers=10` 寫死 | line 201 | 改 CLI 參數 `--max-speakers` |
|
||||
| 載入外部 ASR segments | line 148-174 | 移除(不再需要) |
|
||||
|
||||
---
|
||||
|
||||
## 6. 輸出格式
|
||||
|
||||
### 6.1 ASRX JSON Output (由 `asrx_processor.py` 寫入)
|
||||
|
||||
> **注意**: 192-dim embedding 不在此 JSON 中。embedding 在 Python 端直接送入 Qdrant,JSON 只保留中繼資料。
|
||||
|
||||
```json
|
||||
{
|
||||
"language": "zh",
|
||||
"segments": [
|
||||
{
|
||||
"start_time": 0.0,
|
||||
"end_time": 2.0,
|
||||
"start_frame": 0,
|
||||
"end_frame": 60,
|
||||
"text": "今天天氣很好",
|
||||
"speaker_id": "SPEAKER_0",
|
||||
"language": "zh",
|
||||
"lang_prob": 0.98
|
||||
},
|
||||
{
|
||||
"start_time": 2.0,
|
||||
"end_time": 3.5,
|
||||
"start_frame": 60,
|
||||
"end_frame": 105,
|
||||
"text": "我覺得也不錯",
|
||||
"speaker_id": "SPEAKER_1",
|
||||
"language": "zh",
|
||||
"lang_prob": 0.97
|
||||
}
|
||||
],
|
||||
"n_speakers": 2,
|
||||
"speaker_stats": {
|
||||
"SPEAKER_0": {"count": 1, "duration": 2.0},
|
||||
"SPEAKER_1": {"count": 1, "duration": 1.5}
|
||||
}
|
||||
}
|
||||
```
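
The frame fields and `speaker_stats` in the JSON above can be derived from the pipeline's segments. The following is a minimal sketch under two assumptions: the frame rate is 30 fps (inferred from 2.0 s → frame 60 in the example, not stated in the source), and the input segments use the `start` / `end` keys returned by `SelfASRXFixed.process()`. The helper names are hypothetical, and the top-level `language` field is left out.

```python
from collections import defaultdict


def to_frame(seconds: float, fps: float = 30.0) -> int:
    """Convert a timestamp to the nearest frame index (fps is an assumption)."""
    return int(round(seconds * fps))


def build_asrx_output(segments, fps: float = 30.0) -> dict:
    """Reshape pipeline segments into the ASRX JSON layout, without embeddings."""
    stats = defaultdict(lambda: {"count": 0, "duration": 0.0})
    out_segments = []
    for seg in segments:
        out_segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            "start_frame": to_frame(seg["start"], fps),
            "end_frame": to_frame(seg["end"], fps),
            "text": seg["text"],
            "speaker_id": seg["speaker_id"],
            "language": seg["language"],
            "lang_prob": seg["lang_prob"],
        })
        # Accumulate per-speaker segment count and speaking time.
        entry = stats[seg["speaker_id"]]
        entry["count"] += 1
        entry["duration"] += seg["end"] - seg["start"]
    return {
        "segments": out_segments,
        "n_speakers": len(stats),
        "speaker_stats": dict(stats),
    }
```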
|
||||
|
||||
### 6.2 Qdrant Point 格式 (由 Python `_store_speaker_embeddings` 寫入)
|
||||
|
||||
> Embedding 不經過 Rust,直接在 Python 端完成 Qdrant HTTP PUT。
|
||||
|
||||
| Qdrant 欄位 | 值 | 說明 |
|
||||
|-------------|-----|------|
|
||||
| `id` | `hash(file_uuid + "_" + segment_index)` | 可重複查詢的 point ID |
|
||||
| `vector` | `[f32; 192]` | ECAPA-TDNN 聲紋向量 |
|
||||
| `payload.file_uuid` | `str` | 影片識別碼 |
|
||||
| `payload.speaker_id` | `str` | 聚類後的 speaker 標籤 |
|
||||
| `payload.text` | `str` | 該段的轉錄文字 |
|
||||
| `payload.language` | `str` | 語種 (`zh`/`en`) |
|
||||
| `payload.start_time` | `f64` | 開始時間(秒) |
|
||||
| `payload.end_time` | `f64` | 結束時間(秒) |
|
||||
| `payload.type` | `"speaker_embedding"` | 便於與 face_embedding 區分 |
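
As a rough illustration of the HTTP PUT described above, the sketch below only builds the request path and JSON body for one point and leaves the actual sending to whatever HTTP client the script uses. The function name `build_upsert_request` and the dimension check are assumptions; the body follows the Qdrant points-upsert layout (`{"points": [{id, vector, payload}]}`).

```python
import json


def build_upsert_request(collection, point_id, vector, file_uuid,
                         speaker_id, text, language, start_time, end_time):
    """Build (path, body) for a single speaker_embedding point (hypothetical helper)."""
    if len(vector) != 192:
        raise ValueError(f"expected a 192-dim vector, got {len(vector)}")
    path = f"/collections/{collection}/points?wait=true"
    body = {
        "points": [{
            "id": point_id,
            "vector": [float(x) for x in vector],
            "payload": {
                "file_uuid": file_uuid,
                "speaker_id": speaker_id,
                "text": text,
                "language": language,
                "start_time": start_time,
                "end_time": end_time,
                "type": "speaker_embedding",
            },
        }]
    }
    # ensure_ascii=False keeps CJK transcript text readable in the payload.
    return path, json.dumps(body, ensure_ascii=False)
```

Keeping the body construction separate from the network call makes the payload shape easy to unit-test against the table above.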
|
||||
|
||||
### 6.3 Rust `AsrxResult` 對應
|
||||
|
||||
```rust
|
||||
pub struct AsrxSegment {
|
||||
pub start_time: f64, // serde(alias = "start")
|
||||
pub end_time: f64, // serde(alias = "end")
|
||||
pub start_frame: u64, // default 0
|
||||
pub end_frame: u64, // default 0
|
||||
pub text: String,
|
||||
pub speaker_id: Option<String>,
|
||||
pub language: Option<String>, // 🆕 新增
|
||||
pub lang_prob: Option<f64>, // 🆕 新增
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. Rust-Side Changes

| File | Change |
|------|--------|
| `src/core/processor/asrx.rs` | `asrx_processor_v2.py` → `asrx_processor.py` |
| `src/core/processor/asrx.rs` | Add `language` and `lang_prob` fields to `AsrxSegment` |
| `src/core/processor/asrx.rs` | Pass `--file-uuid` to the Python script so the Python side can write to Qdrant directly |
| `src/core/chunk/rule1_ingest.rs` | If the `pre_chunks` data contains `language`, carry it into the chunk metadata |
| `src/core/db/qdrant_db.rs` | 🆕 Add `upsert_speaker_embedding()` (optional; not needed if the Python side writes to Qdrant directly) |
|
||||
|
||||
---
|
||||
|
||||
## 8. Migration Plan

### Implementation Order (sorted by dependency)

| Step | Work | File | Risk |
|------|------|------|------|
| **S1** | `vad.py`: add `scan_within_segment()` | `asrx_self/vad.py` | Low |
| **S2** | 🆕 `whisper_local.py`: wrap whisper loading + transcription | `asrx_self/whisper_local.py` | Low |
| **S3** | 🔧 `main_fixed.py`: rewrite as the 7-step pipeline | `asrx_self/main_fixed.py` | Medium |
| **S4** | 🆕 `speaker_classifier.py`: gender classifier | `asrx_self/speaker_classifier.py` | Low |
| **S5** | 🔧 Clean up `custom.py` and rename it to `asrx_processor.py` | `asrx_processor_custom.py` | Low |
| **S6** | 🔧 Rust `asrx.rs`: point at the new script + pass `--file-uuid` | `src/core/processor/asrx.rs` | Low |
| **S7** | ✅ Verification: build + playground test | n/a | Medium |
| **S8** | 🧹 Delete the variants + move the tools out | n/a | Low |

### Acceptance Criteria

1. `cargo build` passes
2. Playground 3003: register a video → the ASRX processor completes
3. `speaker_id` is non-`null` in the output JSON
4. The Qdrant collection contains `speaker_embedding` points
5. Gender is labeled correctly (male/female)
|
||||
|
||||
---
|
||||
|
||||
## 9. 版本歷史
|
||||
|
||||
| 版本 | 日期 | 修改者 | 說明 |
|
||||
|------|------|--------|------|
|
||||
| V1.0 | 2026-06-01 | OpenCode | 初始版本:7 步 hybrid pipeline + Qdrant 聲紋儲存 + 高品質分類 |
|
||||
385
docs_v1.0/DESIGN/Modular_Doc_System_V1.0.md
Normal file
@@ -0,0 +1,385 @@
|
||||
---
|
||||
document_type: "design"
|
||||
service: "MOMENTRY_CORE"
|
||||
title: "模組生成式文件產出系統"
|
||||
date: "2026-05-17"
|
||||
version: "V1.0"
|
||||
status: "active"
|
||||
owner: "M5"
|
||||
created_by: "OpenCode"
|
||||
tags:
|
||||
- "documentation"
|
||||
- "modular"
|
||||
- "generated-docs"
|
||||
- "workspace"
|
||||
ai_query_hints:
|
||||
- "查詢模組生成式文件產出系統的設計理念"
|
||||
- "如何使用 API_WORKSPACE"
|
||||
- "如何新增 API endpoint 文檔"
|
||||
- "make deploy 流程"
|
||||
- "自定義交付文件"
|
||||
related_documents:
|
||||
- "STANDARDS/USER_DOCS_STANDARD.md"
|
||||
- "STANDARDS/DOCS_STANDARD.md"
|
||||
- "API_WORKSPACE/README.md"
|
||||
- "API_WORKSPACE/modules/_template.md"
|
||||
---
|
||||
|
||||
# 模組生成式文件產出系統
|
||||
|
||||
| 項目 | 內容 |
|
||||
|------|------|
|
||||
| 建立者 | OpenCode |
|
||||
| 建立時間 | 2026-05-17 |
|
||||
| 文件版本 | V1.0 |
|
||||
| 目標讀者 | developer, documentation maintainer |
|
||||
|
||||
---
|
||||
|
||||
## 版本歷史
|
||||
|
||||
| 版本 | 日期 | 目的 | 操作人 |
|
||||
|------|------|------|--------|
|
||||
| V1.0 | 2026-05-17 | 建立設計文件 | OpenCode |
|
||||
|
||||
---
|
||||
|
||||
## 1. 設計理念
|
||||
|
||||
### 1.1 痛點
|
||||
|
||||
傳統 API 文件維護有常見問題:
|
||||
|
||||
| 問題 | 具體表現 |
|
||||
|------|----------|
|
||||
| **內容重複** | 同一個 endpoint 在快速參考、完整手冊、教育訓練文件中寫三次 |
|
||||
| **更新遺漏** | 修改 curl 範例後,忘記同步到另一份文件 |
|
||||
| **交付僵化** | 無法按對象產出不同版本的 API 文件 |
|
||||
| **版本失靈** | YAML frontmatter 版本號與實際內容脫節 |
|
||||
|
||||
### 1.2 核心原則
|
||||
|
||||
```
|
||||
單一真理源(modules/)→ 組裝引擎(assemble_docs.sh)→ 多種交付產品(GUIDES/)
|
||||
|
||||
編輯 ──→ 生成 ──→ 部署
|
||||
1 處修改模組 make all make deploy
|
||||
```
|
||||
|
||||
| 原則 | 說明 |
|
||||
|------|------|
|
||||
| **單一真理源** | 每個 endpoint 只在 `modules/` 中定義一次 |
|
||||
| **組裝而非撰寫** | 交付文件是 modules 的組合,不是手寫 |
|
||||
| **開發與交付分離** | `API_WORKSPACE/` 開發,`GUIDES/` 交付 |
|
||||
| **模組為最小可測試單位** | 每個 module 可獨立驗證正確性 |
|
||||
| **配置驅動** | `.toml` 配置定義哪些 module 以何種模式組裝成何種輸出 |
|
||||
|
||||
### 1.3 檔案類型對照
|
||||
|
||||
| 類型 | 角色 | 可編輯 | 位置 |
|
||||
|------|------|--------|------|
|
||||
| Module (模組) | 不可再拆的內容最小單位 | ✅ 是 | `API_WORKSPACE/modules/` |
|
||||
| Config (配方) | 定義組裝規則 | ✅ 是 | `API_WORKSPACE/configs/` |
|
||||
| Narrative (敘事) | 非結構化的前言/背景 | ✅ 是 | `API_WORKSPACE/narratives/` |
|
||||
| Assembled (產出) | 從模組組裝的交付文件 | ❌ 否(generated) | `API_WORKSPACE/_build/` → `GUIDES/` |
|
||||
|
||||
---
|
||||
|
||||
## 2. 目錄結構
|
||||
|
||||
```
|
||||
docs_v1.0/
|
||||
├── API_WORKSPACE/ ← 開發區
|
||||
│ ├── modules/ ← 端點模組(單一真理源)
|
||||
│ │ ├── _template.md ← 模組撰寫規範
|
||||
│ │ ├── 01_auth.md ← 認證、Base URL
|
||||
│ │ ├── 02_health.md ← 健康檢查
|
||||
│ │ ├── 03_register.md ← 註冊、掃描
|
||||
│ │ ├── 04_lookup.md ← 查詢、刪除
|
||||
│ │ ├── 05_process.md ← 處理、進度、任務
|
||||
│ │ ├── 06_search.md ← 搜尋(向量、n8n、視覺)
|
||||
│ │ ├── 07_identity.md ← 身份 CRUD、bind/unbind
|
||||
│ │ ├── 08_identity_agent.md ← Identity Agent
|
||||
│ │ ├── 09_tmdb.md ← TMDb Enrichment
|
||||
│ │ ├── 10_pipeline.md ← Stats、配置、未掛載端點
|
||||
│ │ └── 11_error_codes.md ← 錯誤碼對照表
|
||||
│ │
|
||||
│ ├── configs/ ← 組裝配方(每個輸出一份)
|
||||
│ │ ├── reference.toml → API_REFERENCE.md
|
||||
│ │ ├── endpoints.toml → API_ENDPOINTS.md
|
||||
│ │ ├── quickref.toml → API_QUICK_REFERENCE.md
|
||||
│ │ ├── errors.toml → API_ERROR_CODES.md
|
||||
│ │ ├── index.toml → API_INDEX.md
|
||||
│ │ ├── marcom.toml → API_TRAINING_MARCOM.md
|
||||
│ │ └── tmdb.toml → TMDb_User_Guide.md
|
||||
│ │
|
||||
│ ├── narratives/ ← 非端點敘事前言
|
||||
│ │ └── marcom_intro.md
|
||||
│ │
|
||||
│ ├── _build/ ← 生成暫存區(gitignored)
|
||||
│ ├── Makefile ← 組裝自動化入口
|
||||
│ ├── assemble_docs.sh ← 組裝引擎
|
||||
│ └── README.md ← 開發者速查
|
||||
│
|
||||
├── GUIDES/ ← 交付區
|
||||
│ ├── API_REFERENCE.md (generated)
|
||||
│ ├── API_ENDPOINTS.md (generated)
|
||||
│ ├── API_QUICK_REFERENCE.md (generated)
|
||||
│ ├── API_ERROR_CODES.md (generated)
|
||||
│ ├── API_INDEX.md (generated)
|
||||
│ ├── API_TRAINING_MARCOM.md (generated)
|
||||
│ ├── TMDb_User_Guide.md (generated)
|
||||
│ ├── Demo_EndToEnd.md (手寫保留)
|
||||
│ ├── Pipeline_API_Demo.md (手寫保留)
|
||||
│ └── ... (其他手寫文件)
|
||||
│
|
||||
├── DESIGN/
|
||||
├── REFERENCE/
|
||||
├── OPERATIONS/
|
||||
├── INTEGRATIONS/
|
||||
└── STANDARDS/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 模組規範
|
||||
|
||||
### 3.1 File Naming

- Format: `NN_<name>.md` (NN = two-digit sort key, 01-99)
- Examples: `03_register.md`, `09_tmdb.md`
- The numeric prefix determines the endpoint order at assembly time
|
||||
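
The naming rule can be checked mechanically. The sketch below is written in Python for illustration only (the actual engine is `assemble_docs.sh`), and the regular expression is an assumed reading of `NN_<name>.md` in which the name must start with a letter. Files such as `_template.md` are skipped because they carry no numeric prefix.

```python
import re

# NN = 01-99, then "_", then a name starting with a letter (assumed reading).
MODULE_NAME = re.compile(r"^(0[1-9]|[1-9][0-9])_([A-Za-z][A-Za-z0-9_]*)\.md$")


def ordered_modules(filenames):
    """Return the valid module files sorted by their numeric prefix."""
    valid = []
    for name in filenames:
        match = MODULE_NAME.match(name)
        if match:
            valid.append((int(match.group(1)), name))
    return [name for _, name in sorted(valid)]
```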
|
||||
### 3.2 Module Metadata Comments

Every module must begin with metadata comments:

```markdown
<!-- module: auth -->
<!-- description: Authentication, API Key, Base URL configuration -->
<!-- depends: -->
```

| Field | Required | Description |
|-------|----------|-------------|
| `module` | Yes | Unique name; no spaces, must not start with a digit |
| `description` | Yes | One-sentence summary |
| `depends` | No | Names of the other modules this one depends on (comma-separated) |
|
||||
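
A parser for these metadata comments might look like the sketch below. It is illustrative Python, not the shell implementation; the function name `parse_module_metadata` and the choice to stop at the first non-comment line are assumptions.

```python
import re

META_LINE = re.compile(r"^<!--\s*(module|description|depends)\s*:\s*(.*?)\s*-->$")


def parse_module_metadata(text: str) -> dict:
    """Read the leading metadata comments of a module and validate them."""
    meta = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = META_LINE.match(line)
        if match is None:
            break  # metadata is expected at the top of the module only
        meta[match.group(1)] = match.group(2)

    for required in ("module", "description"):
        if not meta.get(required):
            raise ValueError(f"missing required metadata field: {required}")
    name = meta["module"]
    if re.search(r"\s", name) or name[0].isdigit():
        raise ValueError("module name must have no spaces and must not start with a digit")

    # `depends` is optional and comma-separated; an empty value means no dependencies.
    depends = [d.strip() for d in meta.get("depends", "").split(",") if d.strip()]
    return {"module": name, "description": meta["description"], "depends": depends}
```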
|
||||
### 3.3 Endpoint 結構
|
||||
|
||||
每個 endpoint 必須使用一致結構:
|
||||
|
||||
````markdown
### `METHOD /path/to/endpoint`

**Auth**: Required / Optional / Public
**Scope**: file-level / identity-level / system-level

#### Request Parameters

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|

#### Example

```bash
curl -s -X METHOD "$API/path" \
  -H "X-API-Key: $KEY" \
  -d '{"field": "value"}'
```

#### Response (200)

```json
{ ... }
```

#### Error Codes

| Code | HTTP | When |
|------|------|------|
````
|
||||
|
||||
### 3.4 變數規則
|
||||
|
||||
| 變數 | 用途 | 範例值 |
|
||||
|------|------|--------|
|
||||
| `$API` | Base URL | `http://localhost:3003` |
|
||||
| `$KEY` | API Key | `your-api-key-here` |
|
||||
| `$FILE_UUID` | File UUID | `3a6c1865...` |
|
||||
| `$IDENTITY_UUID` | Identity UUID | `a9a90105...` |
|
||||
|
||||
---
|
||||
|
||||
## 4. 組裝引擎
|
||||
|
||||
### 4.1 `assemble_docs.sh`
|
||||
|
||||
Shell 腳本,接收三個參數:
|
||||
|
||||
| 參數 | 說明 | 範例 |
|
||||
|------|------|------|
|
||||
| `--config` | TOML 配方路徑 | `configs/reference.toml` |
|
||||
| `--modules` | Module 目錄 | `modules/` |
|
||||
| `--build` | 輸出目錄 | `_build/` |
|
||||
|
||||
### 4.2 三種組裝模式
|
||||
|
||||
| mode | 行為 | 適用 |
|
||||
|------|------|------|
|
||||
| `full` | 完整包含 module 全部內容(除 metadata) | API_REFERENCE, API_ENDPOINTS |
|
||||
| `summary` | 僅擷取 endpoint 表格 + curl 範例 | API_QUICK_REFERENCE |
|
||||
| `index` | 生成文件總覽(掃描 modules 目錄自動產生索引) | API_INDEX |
|
||||
|
||||
### 4.3 組裝流程
|
||||
|
||||
```
1. Read config.toml → parse title, modules, mode, narrative
2. Generate the YAML frontmatter (document_type, date, version)
3. Generate the title heading + info block
4. (Optional) Generate a TOC from the modules' ## headings
5. (Optional) Insert the narrative intro
6. Iterate over the modules:
   - full mode: copy the whole content (skipping <!-- --> comments)
   - summary mode: extract only | table | rows + ```bash code blocks
   - index mode: scan the modules directory and generate the list automatically
7. Write the output file into _build/
```
|
||||
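
The `summary` mode in step 6 can be sketched as a line filter. The following Python is only an illustration of the rule "keep table rows and bash code blocks"; the real engine is the shell script, and the function name `summarize_module` is hypothetical.

```python
FENCE = "`" * 3  # built indirectly so this sketch can itself sit inside a fenced block


def summarize_module(text: str) -> str:
    """Keep only Markdown table rows and bash code blocks from a module."""
    kept = []
    inside = False   # currently inside any fenced block
    keeping = False  # the current fenced block is a bash block
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not inside:
                inside = True
                keeping = stripped == FENCE + "bash"
                if keeping:
                    kept.append(line)
            else:
                if keeping:
                    kept.append(line)
                inside = keeping = False
            continue
        if inside:
            if keeping:
                kept.append(line)
        elif stripped.startswith("|"):
            kept.append(line)
    return "\n".join(kept)
```

Tracking fence state matters here: a line starting with `|` inside a JSON or other non-bash block is not a table row and is dropped along with the rest of that block.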
|
||||
---
|
||||
|
||||
## 5. 配方格式(config.toml)
|
||||
|
||||
```toml
|
||||
title = "輸出文件標題"
|
||||
output = "_build/FILENAME.md" # 輸出路徑(相對於 API_WORKSPACE)
|
||||
mode = "full" # full | summary | index
|
||||
modules = ["01_auth", "03_register"] # 要包含的 module 名稱
|
||||
narrative = "narratives/xxx.md" # (可選)包含的敘事前言
|
||||
toc = true # (可選)是否生成目錄
|
||||
|
||||
[frontmatter]
|
||||
document_type = "api_reference" # 用於 YAML frontmatter
|
||||
service = "MOMENTRY_CORE"
|
||||
version = "V1.0"
|
||||
owner = "M5"
|
||||
created_by = "OpenCode"
|
||||
```
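
A config in this shape could be validated before assembly. The sketch below assumes the TOML has already been parsed into a dict (for example with `tomllib` on Python 3.11+). Which keys count as required, and the rule that `output` must live under `_build/`, are assumptions drawn from the example above and not documented behavior of `assemble_docs.sh`.

```python
ALLOWED_MODES = {"full", "summary", "index"}


def validate_config(config: dict, available_modules: set) -> list:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    errors = []
    for key in ("title", "output", "mode"):
        if not config.get(key):
            errors.append(f"missing required key: {key}")

    mode = config.get("mode")
    if mode and mode not in ALLOWED_MODES:
        errors.append(f"unknown mode: {mode}")

    output = config.get("output")
    if output and not str(output).startswith("_build/"):
        errors.append("output must live under _build/")

    # index mode scans the modules directory itself, so it needs no module list.
    if mode != "index":
        modules = config.get("modules") or []
        if not modules:
            errors.append("modules must not be empty unless mode is index")
        for name in modules:
            if name not in available_modules:
                errors.append(f"unknown module: {name}")
    return errors
```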
|
||||
|
||||
### 內建配方一覽
|
||||
|
||||
| 檔案 | 輸出 | Modules | Mode |
|
||||
|------|------|---------|------|
|
||||
| `reference.toml` | API_REFERENCE.md | 01-11 | full |
|
||||
| `endpoints.toml` | API_ENDPOINTS.md | 01-10 | full |
|
||||
| `quickref.toml` | API_QUICK_REFERENCE.md | 01-06,09 | summary |
|
||||
| `errors.toml` | API_ERROR_CODES.md | 11 | full |
|
||||
| `index.toml` | API_INDEX.md | (auto) | index |
|
||||
| `marcom.toml` | API_TRAINING_MARCOM.md | 01,03,06 + narrative | full |
|
||||
| `tmdb.toml` | TMDb_User_Guide.md | 01,03,09 | full |
|
||||
|
||||
---
|
||||
|
||||
## 6. 工作流程
|
||||
|
||||
### 6.1 日常修改
|
||||
|
||||
```bash
|
||||
# 1. 編輯模組
|
||||
cd API_WORKSPACE
|
||||
vim modules/09_tmdb.md
|
||||
|
||||
# 2. 重新生成單一文件
|
||||
make tmdb
|
||||
|
||||
# 3. 預覽結果
|
||||
less _build/TMDb_User_Guide.md
|
||||
|
||||
# 4. 部署
|
||||
make deploy
|
||||
```
|
||||
|
||||
### 6.2 新增端點
|
||||
|
||||
```bash
|
||||
# 1. 找到所屬模組
|
||||
ls modules/
|
||||
# 決定該 endpoint 屬於哪個模組(如 tmdb, identity, search)
|
||||
|
||||
# 2. 在對應模組加入 endpoint 文檔
|
||||
vim modules/09_tmdb.md
|
||||
|
||||
# 3. 重新生成所有文件
|
||||
make all
|
||||
|
||||
# 4. 確認所有引用此端點的文件都有正確更新
|
||||
make check
|
||||
|
||||
# 5. 部署
|
||||
make deploy
|
||||
```
|
||||
|
||||
### 6.3 客製化交付
|
||||
|
||||
```bash
|
||||
# 新增一個客製化配方
|
||||
cat > configs/integration_partner.toml << TOML
|
||||
title = "Integration Partner API Guide"
|
||||
output = "_build/PARTNER_GUIDE.md"
|
||||
mode = "full"
|
||||
modules = ["01_auth", "06_search", "09_tmdb", "11_error_codes"]
|
||||
toc = true
|
||||
[frontmatter]
|
||||
document_type = "user_manual"
|
||||
service = "MOMENTRY_CORE"
|
||||
version = "V1.0"
|
||||
owner = "M5"
|
||||
created_by = "OpenCode"
|
||||
TOML
|
||||
|
||||
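# Add a matching target to the Makefile (the recipe line must start with a TAB,
# and the Makefile must see $(SCRIPT), not $$(SCRIPT), to expand its own variables)
printf 'partner:\n\t@$(SCRIPT) --config configs/integration_partner.toml --modules $(MODULES) --build $(BUILD)\n' >> Makefile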
|
||||
|
||||
# 生成
|
||||
make partner
|
||||
|
||||
# 部署
|
||||
make deploy
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. 交付客製化對照表
|
||||
|
||||
| 對象 | 需要 modules | make target | 輸出 |
|
||||
|------|-------------|-------------|------|
|
||||
| API Developer | 01-11 (all) | `make reference` | API_REFERENCE.md |
|
||||
| Quick Start User | 01-06,09 | `make quickref` | API_QUICK_REFERENCE.md |
|
||||
| Marcom Team | 01,03,06 + narrative | `make marcom` | API_TRAINING_MARCOM.md |
|
||||
| TMDb User | 01,03,09 | `make tmdb` | TMDb_User_Guide.md |
|
||||
| Integration Partner | 01,06,09,11 | Custom config | PARTNER_GUIDE.md |
|
||||
|
||||
---
|
||||
|
||||
## 8. GUIDES/ 文件類型說明
|
||||
|
||||
| 類型 | 來源 | 說明 |
|
||||
|------|------|------|
|
||||
| `API_*.md` (7 files) | Generated from API_WORKSPACE | API 功能文件,endpoint 列表 + curl 範例 |
|
||||
| `Demo_*.md`, `M5API_*.md` | 手寫 | 敘事性指引,含完整 step-by-step 流程 |
|
||||
| `PORTAL_*.md` | 手寫 | Portal 開發計畫與 Demo 指引 |
|
||||
| `USER_MANUAL.md` | 手寫 | 系統操作使用手冊 |
|
||||
|
||||
> **提醒**:不要直接修改 GUIDES/ 中的 generated files。修改應在 API_WORKSPACE/modules/ 中進行,然後執行 `make deploy`。
|
||||
|
||||
---
|
||||
|
||||
## 相關文件
|
||||
|
||||
- `API_WORKSPACE/README.md` — 開發者快速上手指南
|
||||
- `API_WORKSPACE/modules/_template.md` — 模組撰寫範本
|
||||
- `STANDARDS/DOCS_STANDARD.md` — 文件創建規範
|
||||
- `STANDARDS/USER_DOCS_STANDARD.md` — 使用者文件規範
|
||||
128
docs_v1.0/DESIGN/REPRESENTATIVE_FRAME_API_V1.md
Normal file
@@ -0,0 +1,128 @@
|
||||
# Representative Frame API V1.0
|
||||
|
||||
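Portal representative-frame API for videos. When no frame number is specified, it auto-detects the male and female leads and picks their best interaction frame.

---

## 1. Overview

### Purpose

The Portal needs to show one representative image (thumbnail) per video. It should capture the video's most representative scene, typically a moment where both leads are in frame and looking at each other.

### Principle

**No frame number specified → auto-detect the representative frame**

The existing endpoint does not need to change; auto-detection only runs when the `frame` parameter is empty.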
|
||||
|
||||
---
|
||||
|
||||
## 2. Endpoint
|
||||
|
||||
### `GET /api/v1/file/:file_uuid/thumbnail`
|
||||
|
||||
**Query Parameters**:

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `frame` | i64 | ❌ | Frame to use; auto-detect when omitted |
| `x` | i32 | ❌ | bbox crop x |
| `y` | i32 | ❌ | bbox crop y |
| `w` | i32 | ❌ | bbox crop width |
| `h` | i32 | ❌ | bbox crop height |

**Response**: Pure JPEG bytes (Content-Type: image/jpeg)

**Examples**:
```
GET /api/v1/file/:uuid/thumbnail                                       → auto-detect
GET /api/v1/file/:uuid/thumbnail?frame=38165                           → specific frame
GET /api/v1/file/:uuid/thumbnail?frame=38165&x=723&y=205&w=221&h=221   → specific crop
```

---

## 3. Internal Algorithm

### Auto-detect Fallback Chain

```
Step 1: Auto-detect leads (top 2 by face count)
  └─ face_detections JOIN identities

Step 2: TKG Bridge (mutual_gaze?)
  ├── mutual_gaze edge exists → first_frame ✅
  └── none → first frame in face_detections where both leads appear ✅

Step 3: Only one lead?
  └─ that lead's frame with the highest face_quality (w×h×confidence)

Step 4: No identity at all?
  └─ frame with the highest face_quality among all faces, regardless of identity

Step 5: No faces at all?
  └─ 404 "No faces in this file"
```
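The fallback chain can be sketched over in-memory rows. This is an illustration under stated assumptions, not the Rust implementation: the dict keys (`frame`, `identity_id`, `w`, `h`, `confidence`), the function names, and the `mutual_gaze_first_frame` argument (standing in for the TKG lookup result) are all hypothetical. The chain does not say what happens when two leads exist but never share a frame; the sketch assumes it falls through to the single-lead rule.

```python
from collections import Counter


def face_quality(d):
    # face_quality = w × h × confidence, as defined in Step 3.
    return d["w"] * d["h"] * d["confidence"]


def pick_representative_frame(detections, mutual_gaze_first_frame=None):
    """Return a frame number, or None when there are no faces (caller sends 404)."""
    if not detections:
        return None  # Step 5
    # Step 1: top two identities by face count.
    counts = Counter(d["identity_id"] for d in detections
                     if d["identity_id"] is not None)
    leads = [identity for identity, _ in counts.most_common(2)]
    if len(leads) == 2:
        # Step 2: prefer the mutual_gaze edge, else the first shared frame.
        if mutual_gaze_first_frame is not None:
            return mutual_gaze_first_frame
        frames_a = {d["frame"] for d in detections if d["identity_id"] == leads[0]}
        frames_b = {d["frame"] for d in detections if d["identity_id"] == leads[1]}
        shared = frames_a & frames_b
        if shared:
            return min(shared)
        # Assumption: no shared frame, so fall through to the top lead.
    if leads:
        pool = [d for d in detections if d["identity_id"] == leads[0]]  # Step 3
    else:
        pool = detections  # Step 4: no identities, consider every face
    return max(pool, key=face_quality)["frame"]
```

Passing the TKG result in as an argument keeps the selection logic testable without a database.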

### TKG Bridge Query

```sql
-- Find each lead's main trace
SELECT trace_id FROM face_detections
WHERE file_uuid = $1 AND identity_id = $2 AND trace_id IS NOT NULL
GROUP BY trace_id ORDER BY COUNT(*) DESC LIMIT 1;

-- TKG mutual_gaze query
SELECT (e.properties->>'first_frame')::bigint
FROM tkg_edges e
JOIN tkg_nodes a ON a.id = e.source_node_id
JOIN tkg_nodes b ON b.id = e.target_node_id
WHERE e.file_uuid = $1
  AND a.external_id = concat('trace_', $4)
  AND b.external_id = concat('trace_', $5)
  AND e.properties->>'mutual_gaze' = 'true'
LIMIT 1;

-- Fallback: first frame where both leads appear
-- (join on file_uuid as well, so frames from other files cannot match)
SELECT MIN(fd_a.frame_number)::bigint
FROM face_detections fd_a
JOIN face_detections fd_b
  ON fd_a.frame_number = fd_b.frame_number AND fd_a.file_uuid = fd_b.file_uuid
WHERE fd_a.file_uuid = $1 AND fd_a.identity_id = $2 AND fd_b.identity_id = $3;
```

---

## 4. Implementation

### Files Changed

| File | Change |
|------|--------|
| `src/api/media_api.rs` | `ThumbQuery.frame` → `Option<i64>`; add auto-detect fallback |
| `src/core/processor/tkg.rs` | Add `query_auto_representative_frame()` + structs (implemented) |
| `src/core/processor/mod.rs` | Export new function + structs (implemented) |

### Existing Trace-level Endpoints (unchanged)

```
GET /api/v1/file/:uuid/trace/:tid/representative-face  → JSON (legacy)
GET /api/v1/file/:uuid/trace/:tid/thumbnail            → JPEG (auto via select_rep_face)
```

### No Changes

- ❌ No new DB tables / migrations
- ❌ No changes to `select_rep_face` / blurdetect
- ❌ No chunk / cut / pre_chunks dependency

---

## 5. Version History

| Date | Version | Author | Change |
|------|---------|--------|--------|
| 2026-05-22 | 1.0 | OpenCode | Initial design |
| 2026-05-22 | 1.1 | OpenCode | Simplified to a single endpoint: auto-detect when `frame` is None |

*Updated: 2026-05-22*
270
docs_v1.0/DESIGN/Redis_Progress_Reporting_V1.0.md
Normal file
@@ -0,0 +1,270 @@
---
document_type: "design_doc"
service: "MOMENTRY_CORE"
title: "Redis Progress Reporting V1.0"
version: "V1.0"
date: "2026-05-17"
author: "M5"
status: "draft"
---

# Redis Progress Reporting V1.0

| Item | Value |
|------|------|
| Service | `MOMENTRY_CORE` |
| Version | V1.0 |
| Date | 2026-05-17 |
| Author | M5 (OpenCode) |
| Status | Draft |

## 1. Overview

This document defines the standardized progress reporting architecture for Momentry Core processors. It replaces the inconsistent ad-hoc progress patterns found across `scripts/`, `src/worker/`, and `src/api/`.

### 1.1 Problems Addressed

| # | Problem | Detail |
|---|---------|--------|
| 1 | Worker Redis key does not match `OPERATIONS/MOMENTRY_CORE_REDIS_KEYS.md` V1.0 spec | Worker writes `worker:job:{uuid}:processor:{name}` instead of the spec's `job:{uuid}:processor:{name}` |
| 2 | Progress API reads wrong key | `get_progress()` reads `worker:job:{uuid}:processor:{name}`, which does not match the `job:{uuid}:processor:{name}` key written by the Playground subscriber |
| 3 | Swift processors (Face/OCR/Pose) lack RedisPublisher | Progress is lost; only stdout text is emitted |
| 4 | ASRX/Story/Visual chunk have no incremental progress | Start + complete only, no `current/total` updates |
| 5 | `frames_processed` / `chunks_produced` never updated in real-time | Worker only writes processor hash at start and exit |
| 6 | No `output_count` / `output_type` fields | Impossible to know how many faces/objects/segments were produced |

### 1.2 Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Progress unit = frames for video processors | All media-level processors work frame by frame |
| Output count separate from progress | Processors may produce N outputs per frame (multiple faces, objects) |
| Pub/sub for real-time, Hash for final state | Pub/sub is transient; Hash persists for API queries |

---

## 2. Redis Key Architecture

### 2.1 Key Patterns

All keys use the configured `REDIS_KEY_PREFIX` (default: `momentry:` for production, `momentry_dev:` for playground).

| Pattern | Type | TTL | Purpose | Owner |
|---------|------|-----|---------|-------|
| `{prefix}progress:{uuid}` | Pub/Sub | none | Real-time progress messages | Python scripts |
| `{prefix}job:{uuid}` | Hash | 24h | Per-video job state | Worker |
| `{prefix}job:{uuid}:processor:{name}` | Hash | 24h | Per-processor final state | Worker |
| `{prefix}job:{uuid}:processor:{name}:output_count` | String | 24h | Output count by type | Worker |

### 2.2 Processor Hash Fields

```
{prefix}job:{uuid}:processor:{name}
  ├── status        String   running / completed / failed / pending
  ├── current       u32      Units processed (frames for video processors)
  ├── total         u32      Total units
  ├── output_count  u32      Output items produced (faces, objects, segments)
  ├── output_type   String   Type name of output: faces / objects / segments / cuts / etc.
  ├── pid           i32      OS process ID (0 if not running)
  ├── error         String   Error message if failed
  └── updated_at    String   ISO 8601 timestamp
```
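A minimal sketch of how the key patterns in 2.1 and the hash fields in 2.2 might be produced on the Python side. The function names, `TTL_SECONDS`, and `VALID_STATUSES` are hypothetical and do not come from `redis_publisher.py`. Values are rendered as strings because Redis hash values are stored as strings.

```python
from datetime import datetime, timezone

TTL_SECONDS = 24 * 60 * 60  # the 24h TTL listed in the key table
VALID_STATUSES = {"running", "completed", "failed", "pending"}


def progress_channel(prefix, uuid):
    """Pub/Sub channel for real-time progress messages."""
    return f"{prefix}progress:{uuid}"


def processor_key(prefix, uuid, name):
    """Hash key holding one processor's state for one job."""
    return f"{prefix}job:{uuid}:processor:{name}"


def processor_hash_fields(status, current, total, output_count=0,
                          output_type="", pid=0, error=""):
    """Build the field mapping for the processor hash (all values as strings)."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return {
        "status": status,
        "current": str(current),
        "total": str(total),
        "output_count": str(output_count),
        "output_type": output_type,
        "pid": str(pid),
        "error": error,
        "updated_at": datetime.now(timezone.utc).isoformat(),  # ISO 8601
    }
```

With a Redis client, the returned mapping would typically be written with `HSET` followed by `EXPIRE` using `TTL_SECONDS`; that wiring is left out so the sketch stays dependency-free.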

### 2.3 Migrated Keys

The following key patterns from the original implementation are REMOVED:

| Old Key | Reason |
|---------|--------|
| `{prefix}worker:job:{uuid}:processor:{name}` | Non-standard prefix; not in `MOMENTRY_CORE_REDIS_KEYS.md` spec |
| `{prefix}job:{uuid}:processor:{name}:status` (flat) | Redundant; status stored in Hash |
| `{prefix}job:{uuid}:processor:{name}:progress` (flat) | Replaced by `current` + `total` for percent calculation |
| `{prefix}job:{uuid}:processor:{name}:current` (flat) | Replaced by Hash fields |
| `{prefix}job:{uuid}:processor:{name}:total` (flat) | Replaced by Hash fields |
| `{prefix}job:{uuid}:processor:{name}:started_at` (flat) | Replaced by Hash `updated_at` |

---

## 3. Pub/Sub Message Format

### 3.1 Channel

```
{prefix}progress:{uuid}
```

### 3.2 Message JSON

```json
{
  "processor": "face",
  "current": 150,
  "total": 162696,
  "output_count": 423,
  "output_type": "faces",
  "message": "Processing frame 150",
  "timestamp": 1700000000
}
```
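A message of this shape could be serialized as follows. This is a sketch only: `build_progress_message` is a hypothetical helper, and it treats `processor`, `current`, `total`, and `timestamp` as always present and the remaining fields as optional, following the Field Definitions table in 3.3. The non-negative check is an added assumption.

```python
import json
import time


def build_progress_message(processor, current, total, output_count=None,
                           output_type=None, message=None, timestamp=None):
    """Serialize one progress message; optional fields are omitted when None."""
    if current < 0 or total < 0:
        raise ValueError("current and total must be non-negative")
    payload = {
        "processor": processor,
        "current": current,
        "total": total,
        # Default to the current Unix time when no timestamp is supplied.
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    optional = {
        "output_count": output_count,
        "output_type": output_type,
        "message": message,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps(payload)
```

Publishing is then a single `PUBLISH` of the returned string to the channel in 3.1.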

### 3.3 Field Definitions

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `processor` | String | ✅ | Processor name: asr / asrx / yolo / ocr / face / pose / cut / story / visual_chunk |
| `current` | u32 | ✅ | Units processed (frames for video processors) |
| `total` | u32 | ✅ | Total units |
| `output_count` | u32 | ❌ | Output items produced so far |
| `output_type` | String | ❌ | Type name: faces / objects / segments / cuts / text_regions / persons / speakers / stories / visual_chunks |
| `message` | String | ❌ | Human-readable progress description |
| `timestamp` | u64 | ✅ | Unix timestamp (seconds) |

---

## 4. Per-Processor Metrics

| Processor | current/total Unit | output_type | When to Publish |
|-----------|-------------------|-------------|-----------------|
| ASR | frames | `segments` | Every 100 segments processed |
| ASRX | frames | `speakers` | Every processing stage |
| YOLO | frames | `objects` | Every 500 frames |
| OCR | frames | `text_regions` | Every 5% |
| Face | frames | `faces` | Every batch (5% of frames) |
| Pose | frames | `persons` | Every 10% |
| CUT | frames | `cuts` | Every scene detected |
| Story | chunks | `stories` | Every chunk processed |
| Visual chunk | frames | `visual_chunks` | Every chunk processed |

### 4.1 Output Type Enum

```rust
pub enum OutputType {
    Segments,      // ASR
    Speakers,      // ASRX
    Objects,       // YOLO
    TextRegions,   // OCR
    Faces,         // Face
    Persons,       // Pose
    Cuts,          // CUT
    Stories,       // Story
    VisualChunks,  // Visual chunk
}
```
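On the Python side, the same mapping and the percentage-based publish cadences in the metrics table (for example OCR "every 5%") could look like the sketch below. `OUTPUT_TYPES` and `crossed_percent_step` are hypothetical names; the bucket arithmetic is one possible way to throttle publishing, not a description of the existing scripts.

```python
# Processor name -> output_type string, mirroring the OutputType enum.
OUTPUT_TYPES = {
    "asr": "segments",
    "asrx": "speakers",
    "yolo": "objects",
    "ocr": "text_regions",
    "face": "faces",
    "pose": "persons",
    "cut": "cuts",
    "story": "stories",
    "visual_chunk": "visual_chunks",
}


def crossed_percent_step(previous, current, total, step_percent):
    """Return True when `current` reaches a step_percent bucket beyond `previous`.

    Integer arithmetic avoids float rounding at bucket boundaries.
    """
    if total <= 0 or step_percent <= 0:
        return False

    def bucket(n):
        return (n * 100) // (total * step_percent)

    return bucket(current) > bucket(previous)
```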

---

## 5. Data Flow

```
┌──────────────────┐               Pub/Sub                ┌──────────────────────┐
│ Python Processor │ ───────── progress:{uuid} ──────────→│ Worker (subscriber)  │
│ (ASR/YOLO/Face)  │     {current, total,                 │                      │
│                  │      output_count, output_type}      │ ──→ HSET             │
└──────────────────┘                                      │     job:{uuid}:      │
                                                          │     processor:{name} │
┌──────────────────┐                                      │                      │
│ Swift Processor  │ ──→ Python wrapper ──→ pub/sub       │ (status, current,    │
│ (Face/OCR/Pose)  │     (add RedisPublisher)             │  total, output_count,│
└──────────────────┘                                      │  output_type)        │
                                                          └──────────┬───────────┘
                                                                     │ HGETALL
                                                          ┌──────────▼───────────┐
                                                          │ Progress API         │
                                                          │ GET /progress/:uuid  │
                                                          │                      │
                                                          │ ─→ compute %         │
                                                          │ ─→ return JSON       │
                                                          └──────────────────────┘
```
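The worker's subscriber step in this flow (receive a pub/sub message, then `HSET` the processor hash) can be sketched with a plain dict standing in for Redis. `apply_progress` and the `store` dict are hypothetical, and marking the processor as `running` on every progress message is an assumption rather than documented behavior.

```python
import json


def apply_progress(store, prefix, uuid, raw_message):
    """Fold one pub/sub progress message into the processor hash.

    `store` is a dict of {key: {field: value}} used in place of Redis.
    Returns the hash key that was updated.
    """
    msg = json.loads(raw_message)
    key = f"{prefix}job:{uuid}:processor:{msg['processor']}"
    fields = store.setdefault(key, {})
    fields["status"] = "running"  # assumption: progress implies running
    fields["current"] = str(msg["current"])
    fields["total"] = str(msg["total"])
    # Optional fields are copied only when the publisher sent them.
    for name in ("output_count", "output_type"):
        if name in msg:
            fields[name] = str(msg[name])
    return key
```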

---

## 6. Implementation Plan

### Phase 1: Python Processor RedisPublisher

| Task | Files | Effort |
|------|-------|--------|
| Add `RedisPublisher` to `face_processor.py` | `scripts/face_processor.py` | Medium |
| Add `RedisPublisher` to `ocr_processor.py` | `scripts/ocr_processor.py` | Medium |
| Add `RedisPublisher` to `pose_processor.py` | `scripts/pose_processor.py` | Medium |
| Add incremental `.progress()` to `asrx_processor_custom.py` | `scripts/asrx_processor_custom.py` | Low |
| Standardize pub/sub message to include `output_count`, `output_type` | All processor scripts | Low |

### Phase 2: Worker

| Task | Files | Effort |
|------|-------|--------|
| Fix Redis key from `worker:job:` to `job:` | `src/worker/processor.rs`, `src/core/db/redis_client.rs` | Low |
| Subscribe to `progress:{uuid}` channel in `run_processor()` | `src/worker/processor.rs` | Medium |
| HSET Processor Hash on each progress message | `src/worker/processor.rs` | Medium |
| Set `output_count` and `output_type` from pub/sub message | `src/worker/processor.rs` | Low |

### Phase 3: Progress API

| Task | Files | Effort |
|------|-------|--------|
| Read `output_count`, `output_type` from Redis Hash | `src/api/server.rs` | Low |
| Compute percentage from `current` / `total` | `src/api/server.rs` | Low |
| Return `output_count`, `output_type` in response JSON | `src/api/server.rs` | Low |
| Remove `worker:` fallback path | `src/api/server.rs` | Low |

### Phase 4: Cleanup

| Task | Files | Effort |
|------|-------|--------|
| Remove old `worker:job:` keys from Redis | Deployment script | Low |
| Remove `update_processor_progress()` DB path (stale `processing_status` JSONB) | `src/core/db/postgres_db.rs` | Medium |

---

## 7. API Response Changes

### ProgressResponse (new fields)

```json
{
  "processors": [
    {
      "name": "face",
      "status": "running",
      "current": 150,
      "total": 162696,
      "progress": 0,
      "frames_processed": 150,
      "output_count": 423,
      "output_type": "faces"
    }
  ]
}
```
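The `progress` value of 0 in this example is consistent with an integer percentage that rounds down: 150 of 162696 frames is about 0.09%. A minimal sketch of that calculation, assuming floor rounding and a clamp to 100 (neither is specified in this document; `progress_percent` is a hypothetical name):

```python
def progress_percent(current, total):
    """Integer percent complete, rounded down and clamped to 0..100."""
    if total <= 0:
        return 0  # avoid division by zero before the total is known
    return max(0, min(100, (current * 100) // total))
```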

---

## 8. Dependencies

| Component | Version | Role |
|-----------|---------|------|
| Redis | ≥ 6.0 | Pub/Sub + Hash storage |
| `redis_publisher.py` | Existing | Python → Redis pub/sub client |
| `redis_client.rs` | Existing | Rust Redis client for worker + API |

---

## 9. References

| Doc | Relation |
|-----|----------|
| `OPERATIONS/MOMENTRY_CORE_REDIS_KEYS.md` | Parent spec; this doc supersedes sections 4, 7, 8 |
| `DESIGN/VIDEO_PROCESSING_SPEC.md` §2.3 | Original progress design (ProcessProgress struct) |
| `src/worker/processor.rs` | Worker progress write implementation |
| `scripts/redis_publisher.py` | Python pub/sub client |
| `src/api/server.rs` (get_progress) | Progress API handler |

---

## Version History

| Version | Date | Author | Change |
|---------|------|--------|--------|
| V1.0 | 2026-05-17 | M5 (OpenCode) | Initial draft; replaces ad-hoc progress patterns |