# Momentry Model: Phased Delivery

## Core Architecture

```
Pipeline (training)
  │  each processor emits a .json
  │  Rule 1/3 Ingestion → chunks + embeddings
  ▼
momentry model for {video}        ← one model per video
  │  release/phase1/latest/
  │  release/phase2/latest/
  ▼
momentry core (inference engine)  ← Rust API server
  │  momentry_playground (dev)
  │  momentry (production)
  ▼
Search / Query / Identity APIs
```

- **Pipeline** = training phase: video → processor output → chunks → embeddings
- **Model** = the output package for each video (output_json + chunks + vectors)
- **Engine** = momentry core, which loads a model and serves the APIs (search, trace, identity)

Each video can have multiple model versions; the naming leaves room for upgrades:

| Model version | Qdrant collection | Contents | Trigger |
|---------------|-------------------|----------|---------|
| `{uuid}_v1` | `momentry_dev_v1` | sentence chunk embeddings (base) | ASR + ASRX + Rule 1 complete |
| `{uuid}_v2` | `momentry_dev_v2` | full pipeline + 5W1H | everything complete |
| `{uuid}_v3` | `momentry_dev_v3` | object identity + custom detector | v2 + object instance matching complete |

Versions coexist; a newer version does not overwrite an older one.

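The naming scheme in the table can be captured in a few lines. This is a minimal sketch, not the project's actual code: the `ModelVersion` class and `all_versions` helper are hypothetical names, and only the `{uuid}_v{n}` and `momentry_dev_v{n}` patterns come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """One versioned model for one video (hypothetical helper)."""

    video_uuid: str
    version: int  # 1 = base, 2 = full pipeline, 3 = object identity

    @property
    def model_name(self) -> str:
        # Follows the `{uuid}_v{n}` pattern from the table.
        return f"{self.video_uuid}_v{self.version}"

    @property
    def collection(self) -> str:
        # Follows the `momentry_dev_v{n}` pattern from the table.
        return f"momentry_dev_v{self.version}"


def all_versions(video_uuid: str, latest: int) -> list[ModelVersion]:
    """Every version up to `latest` stays addressable; none is overwritten."""
    return [ModelVersion(video_uuid, v) for v in range(1, latest + 1)]
```

Because each version maps to its own collection name, publishing v2 for a video leaves its v1 entry untouched.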
## Phase Breakdown

### Phase 1: Sentence Chunk Embedding (base model)

**Trigger**: ASR + ASRX complete, then Rule 1 Ingestion and vectorize complete

**Deliverables**:
- `{uuid}.asr.json`
- `{uuid}.asrx.json`
- chunks (chunk_type = 'sentence')
- chunk_vectors (sentence embedding)

**Use**: end users can run semantic search

### Phase 2: Full Pipeline (v2 model)

**Trigger**: all processors complete, then Rule 3 Ingestion and the 5W1H Agent complete

**Deliverables**:
- everything from Phase 1
- all `{uuid}.*.json` files (cut, yolo, face, pose, ocr, ...)
- chunks (chunk_type = 'cut', 'visual', 'trace', 'story')
- chunk_vectors (summary embedding)
- identities / identity_bindings / face_detections

**Use**: full search + summaries + person identification

---

## Worker Pipeline

```
ASR done → ASRX done
    ↓
Rule 1 Ingestion (sentence chunks)
    ↓
vectorize_chunks (sentence embedding)
    ↓
📦 Phase 1 release ───→ release/phase1/latest/  (base model)
    ↓
remaining processors continue (yolo, face, pose, ocr, ...)
    ↓
Rule 3 Ingestion + 5W1H Agent
    ↓
📦 Phase 2 release ───→ release/phase2/latest/  (full model)
```

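The release triggers in this flow amount to a set-containment check over completed steps. The sketch below assumes the worker records each finished step under a string key; the step names and the `releasable_phases` function are illustrative, not taken from the codebase.

```python
# Hypothetical step keys; the real worker may name these differently.
PHASE1_STEPS = {"asr", "asrx", "rule1_ingestion", "vectorize_chunks"}
PHASE2_EXTRA_STEPS = {"all_processors", "rule3_ingestion", "5w1h_agent"}


def releasable_phases(done: set[str]) -> list[str]:
    """Return the phases whose trigger conditions are met by `done`."""
    phases = []
    if PHASE1_STEPS <= done:
        phases.append("phase1")
        # Phase 2 includes everything from Phase 1, so it is only
        # considered once the Phase 1 steps are complete.
        if PHASE2_EXTRA_STEPS <= done:
            phases.append("phase2")
    return phases
```

With this shape, the Phase 1 release can go out as soon as the sentence embeddings exist, while the slower processors keep running toward Phase 2.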
## Output Directory Structure

```
release/
├── phase1/
│   ├── {version}_{timestamp}/
│   │   ├── output_json/        ← all .json files completed so far
│   │   ├── chunks.csv          ← sentence chunks
│   │   ├── vectors.csv         ← sentence embeddings
│   │   ├── schema.sql          ← chunks table DDL
│   │   └── RELEASE_INFO.txt
│   └── latest → {version}_{timestamp}
│
└── phase2/
    ├── {version}_{timestamp}/
    │   ├── output_json/        ← all .json files
    │   ├── chunks.csv          ← all chunks
    │   ├── vectors.csv         ← all embeddings
    │   ├── identities.csv      ← person identities
    │   ├── schema.sql          ← full schema
    │   └── RELEASE_INFO.txt
    └── latest → {version}_{timestamp}
```

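One way to produce this layout is to write each release into its own `{version}_{timestamp}` directory and then repoint the `latest` symlink. The sketch below is an assumption about how that could be done on a POSIX filesystem; `publish_release` is a hypothetical name, and the contents written to `RELEASE_INFO.txt` are made up for illustration.

```python
import os
from pathlib import Path


def publish_release(root: Path, phase: str, version: str, timestamp: str) -> Path:
    """Create release/<phase>/<version>_<timestamp>/ and repoint `latest` at it."""
    phase_dir = root / "release" / phase
    release_dir = phase_dir / f"{version}_{timestamp}"
    (release_dir / "output_json").mkdir(parents=True, exist_ok=True)
    # Placeholder metadata; the real RELEASE_INFO.txt format is not specified here.
    (release_dir / "RELEASE_INFO.txt").write_text(
        f"phase={phase}\nversion={version}\ntimestamp={timestamp}\n"
    )

    # Build the new link under a temporary name, then rename it over `latest`,
    # so readers never observe a missing link.
    tmp_link = phase_dir / ".latest.tmp"
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()
    tmp_link.symlink_to(release_dir.name)  # relative target, as in the tree
    os.replace(tmp_link, phase_dir / "latest")
    return release_dir
```

Older release directories are left in place, so rolling back is a matter of pointing `latest` at a previous directory.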
## momentry model vs momentry core

| | momentry model | momentry core |
|---|---|---|
| Analogy | trained weights | inference engine |
| Contents | `.json` + chunks + vectors | Rust binary |
| Lifecycle | one produced per video | one binary serves all videos |
| Versions | `{uuid}_v1` (base) / `{uuid}_v2` / `{uuid}_v3` | `momentry_playground` / `momentry` |
| Delivered to | end users | deployment engineers |

---

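## Phase 3: Object Identity (v3 model)

### Goal

Extract key objects from the video (stamps, pistols, envelopes, magnifying glasses, ...) and track and recognize objects of the same class at the instance level across frames, similar to face trace. The aim is not only to detect the class but also to tell "this stamp" apart from "that stamp".
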
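### Current Problems

1. **COCO's 80 classes do not cover the key objects**: stamps, pistols, envelopes, magnifying glasses, and similar items are not in the COCO dataset
2. **YOLOv5 nano has a low detection rate**: even for COCO classes (knife, cell phone), recall on the nano model is insufficient
3. **No object instance matching**: there is currently only frame-level detection, with no cross-frame object tracking
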
### Technical Direction

```
YOLOv8m/OWL-ViT → improve detection coverage
    ↓
Object Tracker (IoU + embedding, similar to the face tracker)
    ↓
object_trace → TKG CO_OCCURS_WITH edges
    ↓
object identity → recognize the same object across scenes
```

| Direction | Method | Effect |
|-----------|--------|--------|
| Model upgrade | `yolov5nu` → `yolov8s.pt` / `yolov8m.pt` | higher COCO recall |
| Custom fine-tune | collect stamps/guns data and fine-tune YOLO | detects non-COCO objects |
| Zero-shot | OWL-ViT / Grounding DINO driven by text prompts | no training needed, but slow |
| Object trace | cross-frame matching with IoU + embedding | instance-level tracking |
| Object identity | clustering to recognize the same object across scenes | "this gun" becomes searchable across the whole video |

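The "IoU + embedding" matching in the Object trace row could look roughly like the following. This is a sketch under several assumptions: matching is greedy rather than optimal, the score is a fixed weighted sum of box IoU and cosine similarity, and `ObjectTracker`, `iou_weight`, and `min_score` are hypothetical names and values. Where the per-detection embeddings come from is left open.

```python
import math
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


@dataclass
class Track:
    track_id: int
    box: Box
    embedding: list[float]
    frames: list[int] = field(default_factory=list)


class ObjectTracker:
    def __init__(self, iou_weight: float = 0.5, min_score: float = 0.4):
        self.iou_weight = iou_weight  # illustrative weighting, not tuned
        self.min_score = min_score
        self.tracks: list[Track] = []

    def _score(self, track: Track, box: Box, emb: list[float]) -> float:
        w = self.iou_weight
        return w * iou(track.box, box) + (1 - w) * cosine(track.embedding, emb)

    def update(self, frame_idx: int, detections: list[tuple[Box, list[float]]]) -> list[int]:
        """Assign each (box, embedding) detection to a track; return track ids."""
        # Score every (track, detection) pair, then match greedily, best first.
        pairs = sorted(
            ((self._score(t, box, emb), ti, di)
             for ti, t in enumerate(self.tracks)
             for di, (box, emb) in enumerate(detections)),
            reverse=True,
        )
        assigned: dict[int, int] = {}  # detection index -> track index
        used_tracks: set[int] = set()
        for score, ti, di in pairs:
            if score < self.min_score:
                break
            if ti in used_tracks or di in assigned:
                continue
            used_tracks.add(ti)
            assigned[di] = ti

        ids = []
        for di, (box, emb) in enumerate(detections):
            if di in assigned:
                track = self.tracks[assigned[di]]
                track.box, track.embedding = box, emb
            else:  # unmatched detection starts a new object instance
                track = Track(len(self.tracks), box, emb)
                self.tracks.append(track)
            track.frames.append(frame_idx)
            ids.append(track.track_id)
        return ids
```

Combining both signals means a strong embedding match can bridge a gap where the boxes no longer overlap, while IoU keeps two visually similar objects in the same frame apart.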
### TKG Integration

```
face_trace -[:CO_OCCURS_WITH]-> object_instance:5   (this gun)
face_trace -[:CO_OCCURS_WITH]-> object_instance:42  (this stamp)

Query: "shots of Audrey Hepburn holding this gun"
→ face_trace:5 -[:SPEAKS_AS]-> SPEAKER_0
→ face_trace:5 -[:CO_OCCURS_WITH]-> object_instance:5
```

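The query pattern above is an intersection of two edge lookups: face traces that speak as a given speaker, and face traces that co-occur with a given object instance. The sketch below uses a tiny in-memory edge store to show the idea; `TinyGraph` and `traces_with_object` are hypothetical, the real TKG backend is not specified here, and resolving a name such as "Audrey Hepburn" to a speaker or face trace is assumed to happen elsewhere.

```python
from collections import defaultdict


class TinyGraph:
    """Minimal directed, labeled edge store (illustrative only)."""

    def __init__(self):
        self._out = defaultdict(set)  # (src, relation) -> {dst, ...}

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self._out[(src, relation)].add(dst)

    def sources(self, relation: str, dst: str) -> set[str]:
        """All nodes that have a `relation` edge pointing at `dst`."""
        return {
            src
            for (src, rel), dsts in self._out.items()
            if rel == relation and dst in dsts
        }


def traces_with_object(graph: TinyGraph, speaker: str, object_instance: str) -> list[str]:
    """Face traces that speak as `speaker` and co-occur with `object_instance`."""
    speaking = graph.sources("SPEAKS_AS", speaker)
    co_occurring = graph.sources("CO_OCCURS_WITH", object_instance)
    return sorted(speaking & co_occurring)
```

For example, after adding the two edges from the query above, `traces_with_object(graph, "SPEAKER_0", "object_instance:5")` yields `["face_trace:5"]`.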
### Delivery Order

1. YOLO model upgrade (low difficulty, immediate payoff)
2. Object tracker (medium difficulty, modeled on the face tracker implementation)
3. Custom fine-tune / zero-shot (high difficulty, needs data or a new model)