feat: trace chunks with co-appearance relationships
- New trace_ingest module: creates chunks for each face trace (time + bbox + ASR text)
- Computes pairwise time overlaps between traces -> co_appearances in metadata
- Worker auto-triggers after face trace store + Qdrant sync
- SearchFilters: chunk_type filter (sentence/cut/trace/visual)
- SearchFilters: co_appears_with_trace_id filter
101
docs/TRACE_SEARCH_API_DESIGN.md
Normal file
@@ -0,0 +1,101 @@
# Trace Search API Design

## Concept
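
A trace is a kind of chunk.

The existing `chunk_type` values are `cut`, `sentence`, `visual`, and `story`; this design adds `trace`.

Each trace (the tracked trajectory of one person across frames) is a time interval plus the ASR text within that interval.
It behaves exactly like any other chunk; only the segmentation dimension differs:

- cut chunk = shot change
- sentence chunk = sentence boundary
- visual chunk = combination of on-screen objects
- **trace chunk = interval in which a person appears + the text spoken during it**

This lets traces go straight into the existing `chunks` table and share the whole embedding, search, and Qdrant sync pipeline, with no new table required.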

## Existing structure of the chunks table

```sql
chunks (
    id, file_uuid, chunk_type,      -- 'trace' is the new value
    start_frame, end_frame, start_time, end_time,
    text_content,                   -- ASR text within the trace interval
    embedding,                      -- pgvector embedding of text_content
    metadata JSONB,                 -- { trace_id, face_count, identity_id, identity_name }
    ...
)
```

## Data generation flow (worker extension)

After face processing and `store_traced_faces.py` complete:

1. Query `face_detections` and aggregate `MIN(frame)`, `MAX(frame)`, and `COUNT(*)` for each trace
2. For each trace, query `pre_chunks WHERE processor_type='asr'` for text that overlaps the trace's time range
3. Concatenate the text and generate the `embedding` with EmbeddingGemma
4. Write the row to `chunks` (`chunk_type='trace'`), with `trace_id`, `face_count`, and `identity_id` in metadata
5. The embedding is synced to Qdrant automatically (same collection as the existing chunks)
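
The aggregation and overlap logic in steps 1, 2, and 4 can be sketched in memory as follows. This is an illustration only, not the actual `trace_ingest` module: `FaceDetection`, `AsrSegment`, `build_trace_chunks`, and the `fps` parameter are hypothetical names, the frame-to-time conversion assumes a constant frame rate, touching intervals are assumed to count as overlapping, and the embedding (step 3), identity lookup, and database writes are left out.

```python
from dataclasses import dataclass


@dataclass
class FaceDetection:
    # Hypothetical stand-in for a face_detections row.
    trace_id: int
    frame: int


@dataclass
class AsrSegment:
    # Hypothetical stand-in for a pre_chunks row with processor_type='asr'.
    start_time: float
    end_time: float
    text: str


def build_trace_chunks(detections, asr_segments, fps):
    """Build one chunk-shaped dict per trace (embedding omitted)."""
    # Step 1: aggregate MIN(frame), MAX(frame), COUNT(*) per trace.
    spans = {}
    for det in detections:
        lo, hi, count = spans.get(det.trace_id, (det.frame, det.frame, 0))
        spans[det.trace_id] = (min(lo, det.frame), max(hi, det.frame), count + 1)

    ordered_segments = sorted(asr_segments, key=lambda s: s.start_time)
    chunks = []
    for trace_id, (start_frame, end_frame, face_count) in sorted(spans.items()):
        # Assumption: constant frame rate, so time = frame / fps.
        start_time = start_frame / fps
        end_time = end_frame / fps
        # Step 2: keep ASR segments whose range overlaps the trace's range
        # (inclusive comparison, so touching intervals count as overlapping).
        texts = [
            seg.text
            for seg in ordered_segments
            if seg.start_time <= end_time and seg.end_time >= start_time
        ]
        # Step 4: shape the result like a chunks row.
        chunks.append({
            "chunk_type": "trace",
            "start_frame": start_frame,
            "end_frame": end_frame,
            "start_time": start_time,
            "end_time": end_time,
            "text_content": " ".join(texts),
            "metadata": {"trace_id": trace_id, "face_count": face_count},
        })
    return chunks


detections = [FaceDetection(5, 45200), FaceDetection(5, 45900), FaceDetection(5, 45500)]
segments = [
    AsrSegment(1800.0, 1810.0, "Open the door."),
    AsrSegment(1820.0, 1825.0, "Come on, hurry up."),
    AsrSegment(1900.0, 1905.0, "Later."),
]
trace_chunks = build_trace_chunks(detections, segments, fps=25.0)
```

In this example the third ASR segment starts after the trace ends, so only the first two sentences end up in `text_content`.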

## Search API extension

Universal Search's `types` already supports `"chunk"`.
Filtering on `chunk_type = 'trace'` within chunk search is all that is needed.

**Request**:

```json
{
  "query": "open the door",
  "types": ["chunk"],
  "filters": { "chunk_type": "trace" },
  "uuid": "aeed71342a899fe4b4c57b7d41bcb692",
  "page": 1,
  "page_size": 20
}
```
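
For callers that build this body in code, a small helper might look like the sketch below. The function name `build_trace_search_request` and the `KNOWN_CHUNK_TYPES` set are hypothetical; the set simply lists the chunk types named in this document, and the real `SearchFilters` may accept a different set. No endpoint URL is assumed, so the sketch stops at producing the JSON body.

```python
import json

# Assumption: only the chunk types named in this document are valid.
KNOWN_CHUNK_TYPES = {"cut", "sentence", "visual", "story", "trace"}


def build_trace_search_request(query, file_uuid, page=1, page_size=20, chunk_type="trace"):
    """Return a Universal Search request body restricted to one chunk_type."""
    if chunk_type not in KNOWN_CHUNK_TYPES:
        raise ValueError(f"unknown chunk_type: {chunk_type!r}")
    return {
        "query": query,
        "types": ["chunk"],
        "filters": {"chunk_type": chunk_type},
        "uuid": file_uuid,
        "page": page,
        "page_size": page_size,
    }


body = build_trace_search_request("open the door", "aeed71342a899fe4b4c57b7d41bcb692")
print(json.dumps(body, indent=2))
```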

**Response** (identical to the existing Chunk result):

```json
{
  "type": "chunk",
  "chunk_id": "chunk_42",
  "chunk_type": "trace",
  "start_frame": 45200, "end_frame": 45900,
  "start_time": 1808.0, "end_time": 1836.0,
  "score": 0.87,
  "text": "Open the door. Come on, hurry up.",
  "metadata": {
    "trace_id": 5,
    "face_count": 42,
    "identity_name": "Audrey Hepburn"
  }
}
```

This reuses the existing `SearchResult::Chunk` variant as is; no new enum variant is needed.

### Search query

```sql
SELECT c.*
FROM dev.chunks c
WHERE c.file_uuid = $1
  AND c.chunk_type = 'trace'
  AND c.embedding IS NOT NULL
ORDER BY c.embedding <=> $2
LIMIT $3;
```
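
In pgvector, `<=>` is the cosine distance operator, so the query returns the trace chunks of one file ordered from the closest embedding to the farthest. The in-memory sketch below mirrors that filter-then-rank behavior for illustration; `cosine_distance` and `search_trace_chunks` are hypothetical helpers, chunks are plain dicts rather than database rows, and embeddings are assumed to be non-zero vectors of equal length.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search_trace_chunks(chunks, file_uuid, query_embedding, limit):
    """Mirror the WHERE / ORDER BY / LIMIT clauses of the SQL query."""
    candidates = [
        c for c in chunks
        if c["file_uuid"] == file_uuid
        and c["chunk_type"] == "trace"
        and c["embedding"] is not None
    ]
    candidates.sort(key=lambda c: cosine_distance(c["embedding"], query_embedding))
    return candidates[:limit]


rows = [
    {"id": 1, "file_uuid": "f1", "chunk_type": "trace", "embedding": [1.0, 0.0]},
    {"id": 2, "file_uuid": "f1", "chunk_type": "trace", "embedding": [0.0, 1.0]},
    {"id": 3, "file_uuid": "f1", "chunk_type": "sentence", "embedding": [1.0, 0.0]},
    {"id": 4, "file_uuid": "f1", "chunk_type": "trace", "embedding": None},
    {"id": 5, "file_uuid": "f2", "chunk_type": "trace", "embedding": [1.0, 0.0]},
    {"id": 6, "file_uuid": "f1", "chunk_type": "trace", "embedding": [1.0, 1.0]},
]
top = search_trace_chunks(rows, "f1", [1.0, 0.0], limit=2)
```

Rows 3, 4, and 5 are excluded by the filters (wrong chunk type, missing embedding, other file), and the remaining trace chunks are ranked by distance to the query vector.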

## Summary

| Item | Approach |
|------|----------|
| New table | ❌ Not needed |
| New enum variant | ❌ Not needed |
| SearchResult changes | ❌ Not needed |
| New chunk_type | ✅ `'trace'` |
| Worker extension | ✅ Generate trace chunks (after face processing completes) |
| SearchFilters extension | ✅ Add a `chunk_type` filter |
| Qdrant | ✅ Automatic (existing chunk collection) |