docs: update Rule 1 OCR independent chunks documentation

Updated Searchable_Chunk_Rules.md and pipeline.md to reflect:
- Phase 1: ASRX segments (pure speech, NO OCR merge)
- Phase 2: OCR-only chunks (all OCR frames grouped by proximity)
- New stats API steps: rule1_ocr, rule1_ocr_chunks
@@ -42,6 +42,7 @@ These steps run after the 10 processors and are **required for pipeline completi
 | # | Step | Triggers When | Verification |
 |---|------|--------------|-------------|
 | 1 | **Rule 1 Sentence Chunking** | ASR + ASRX done | `chunk` table has rows with `chunk_type = 'sentence'` |
+| 1.1 | **Rule 1 OCR Chunks** | OCR done | OCR pre_chunks grouped into sentence chunks |
 | 2 | **Auto-Vectorize** | Rule 1 done | `chunk.embedding` IS NOT NULL for sentence chunks |
 | 3 | **Phase 1 Pack** | Rule 1 done | `release_pack.py --phase 1` executed |
 | 4 | **Rule 3 Scene Chunking** | All 10 processors done + Cut + ASR | `chunk` table has rows with `chunk_type = 'cut'` |
@@ -81,15 +82,17 @@ curl "$API/api/v1/stats/ingestion-status/bd80fec9c42afb0307eb28f22c64c76a" | jq
 {
   "file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
   "steps": [
-    { "name": "rule1_sentence", "status": "pending", "detail": "0 sentence chunks" },
-    { "name": "auto_vectorize", "status": "pending", "detail": "0 embedded" },
-    { "name": "rule3_scene", "status": "pending", "detail": "0 scene chunks" },
-    { "name": "face_trace", "status": "pending", "detail": "0 traces" },
-    { "name": "trace_chunks", "status": "pending", "detail": "0 trace chunks" },
-    { "name": "tkg", "status": "pending", "detail": "0 nodes, 0 edges" },
-    { "name": "identity_match", "status": "pending", "detail": "0 identities" },
-    { "name": "scene_metadata", "status": "pending", "detail": null },
-    { "name": "5w1h", "status": "pending", "detail": "0 scenes with 5W1H" }
+    { "name": "rule1_sentence", "status": "done", "detail": "35 sentence chunks" },
+    { "name": "rule1_ocr", "status": "done", "detail": "30 OCR frames" },
+    { "name": "rule1_ocr_chunks", "status": "done", "detail": "3 OCR-only chunks" },
+    { "name": "auto_vectorize", "status": "pending", "detail": "0 embedded" },
+    { "name": "rule3_scene", "status": "pending", "detail": "0 scene chunks" },
+    { "name": "face_trace", "status": "pending", "detail": "0 traces" },
+    { "name": "trace_chunks", "status": "pending", "detail": "0 trace chunks" },
+    { "name": "tkg", "status": "pending", "detail": "0 nodes, 0 edges" },
+    { "name": "identity_match", "status": "pending", "detail": "0 identities" },
+    { "name": "scene_metadata", "status": "pending", "detail": null },
+    { "name": "5w1h", "status": "pending", "detail": "0 scenes with 5W1H" }
   ]
 }
 ```
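The updated response above can be checked programmatically. The following is a minimal sketch, assuming the payload has the `file_uuid` / `steps` shape shown in the diff; the helper name `pending_steps` is hypothetical and the sample is trimmed to three of the documented steps.

```python
import json

# Sample payload copied from the documented response, trimmed to three steps.
SAMPLE = """
{
  "file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
  "steps": [
    { "name": "rule1_sentence", "status": "done", "detail": "35 sentence chunks" },
    { "name": "rule1_ocr_chunks", "status": "done", "detail": "3 OCR-only chunks" },
    { "name": "auto_vectorize", "status": "pending", "detail": "0 embedded" }
  ]
}
"""


def pending_steps(payload: dict) -> list[str]:
    """Return the names of steps whose status is not 'done' (hypothetical helper)."""
    return [step["name"] for step in payload["steps"] if step["status"] != "done"]


payload = json.loads(SAMPLE)
print(pending_steps(payload))
```

In a real check the payload would come from `GET /api/v1/stats/ingestion-status/{file_uuid}` rather than from an inline string.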
@@ -1,6 +1,7 @@
 # Searchable Chunk: Composition by Combined Rules

-**Date**: 2026-05-16
+**Date**: 2026-05-16
+**Updated**: 2026-07-05 (OCR standalone chunks, Option A)

 ---

@@ -11,8 +12,8 @@ A searchable chunk is not a raw cut or sentence, but, after rule-based composition,
 ```
 Raw data                  Rule composition        Searchable chunk
 ─────────                 ──────────              ──────────────
-ASR sentence (audio)    ─┐
-YOLO objects (visual)   ─┤  Rule 1 / Rule 2      chunk (text + metadata + embedding)
+ASRX sentence (audio)   ─┐
+OCR text (visual text)  ─┤  Rule 1               chunk (text + metadata + embedding)
 Cut boundary (shot)     ─┘
 ```

@@ -21,27 +22,64 @@ Cut boundary (shot) ─┘
 | Level | Type | Description | Searchable |
 |------|------|------|:------:|
 | **Raw** | `cut` | Visual chunk (shot) | ❌ (no text) |
-| **Raw** | `sentence` | Audio chunk (ASR sentence) | ✅ Text search |
+| **Raw** | `sentence` | Audio chunk (ASRX sentence) | ✅ Text search |
+| **Raw** | `sentence` | OCR-only chunk (visual text only) | ✅ Text search |
 | **Composite** | `story_child` | Story child sentence | ✅ |
 | **Composite** | `story_parent` | Story paragraph (aggregates multiple sentences) | ✅ |

-## Rule 1: Direct Conversion
+## Rule 1: Two-Phase Conversion

-The simplest rule. Each sentence output by ASR becomes a chunk directly, with no aggregation.
+### Phase 1: ASRX Segments (speech only)
+
+Each segment output by ASRX becomes a chunk directly; **OCR text is not merged in**.

 ```json
 {
   "chunk_id": "0",
   "chunk_type": "sentence",
   "rule": "rule_1",
-  "data": {
-    "text": "I'm in scoby.",
-    "text_normalized": "i'm in scoby."
-  }
+  "text": "And speaking of storage and workflow...",
+  "ocr_text": "",
+  "start_time": 0.0,
+  "end_time": 5.4
 }
 ```

 - `chunk_type = 'sentence'`
-- Text-searchable (`text_content ILIKE`)
-- Vector-searchable (embedding in Qdrant)
+- `content.text` = ASRX speech text
+- `content.ocr_text` = "" (empty)
+- `text_content` = ASRX text
+
+### Phase 2: OCR-only Chunks (visual text only)
+
+All OCR frames are grouped by proximity (gap ≤ 5 frames), and each group becomes a standalone chunk.
+
+```json
+{
+  "chunk_id": "32",
+  "chunk_type": "sentence",
+  "rule": "rule_1",
+  "text": "",
+  "ocr_text": "Accusys. G Carry 2 AccusyS Purpose-built...",
+  "start_time": 0.125,
+  "end_time": 1.627
+}
+```
+
+- `chunk_type = 'sentence'`
+- `content.text` = "" (empty)
+- `content.ocr_text` = OCR text
+- `text_content` = OCR text
+- `metadata.language` = "ocr"
+
+### Grouping logic
+
+```
+OCR frames: 4, 5, 6, 7, 8, 9, 10, 11, 13, 16, ..., 66
+  ↓ grouped by proximity (gap ≤ 5 frames)
+Group 1: frames 4-16  → chunk "Accusys. G Carry 2..."
+Group 2: frames 48-66 → chunk "Western Digital..."
+```

 ## Rule 2: Aggregated Content

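The Phase 2 proximity grouping described in this hunk could be sketched roughly as follows. This is not the pipeline's actual implementation: the function name `group_frames` is hypothetical, "gap" is assumed to mean the difference between consecutive frame numbers, and the frame numbers between 48 and 66 are made up for illustration because the source elides them.

```python
def group_frames(frames, max_gap=5):
    """Group frame numbers so that consecutive frames in a group differ by at most max_gap."""
    groups = []
    for frame in sorted(frames):
        if groups and frame - groups[-1][-1] <= max_gap:
            # Close enough to the previous frame: extend the current group.
            groups[-1].append(frame)
        else:
            # Gap too large (or first frame): start a new group.
            groups.append([frame])
    return groups


# Illustrative input; only frames 4-16 and the endpoints 48 and 66 come from the document.
ocr_frames = [4, 5, 6, 7, 8, 9, 10, 11, 13, 16, 48, 52, 57, 61, 66]
groups = group_frames(ocr_frames)
spans = [(g[0], g[-1]) for g in groups]
print(spans)  # [(4, 16), (48, 66)]
```

Each resulting group would then map to one OCR-only chunk whose `ocr_text` is assembled from the text of its frames.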
@@ -80,11 +118,46 @@ Body: {"uuid": "...", "criteria": {"required_classes": ["person"]}}
 ## Flow

 ```
-ASR output (sentence)  ─── Rule 1 ───→ chunk (sentence, text+embedding)
-                                           │
-YOLO output (objects)  ─── Rule 2 ───→ chunk (visual, objects+classes)
-                                           │
-                                           ├── Text search (ILIKE)
-                                           ├── Vector search (Qdrant)
-                                           └── Visual filtering (objects/classes)
+ASRX output (sentence) ─── Rule 1 Phase 1 ──→ chunk (sentence, ASRX text)
+                                                  │
+OCR output (frames)    ─── Rule 1 Phase 2 ──→ chunk (sentence, OCR text)
+                                                  │
+                                                  ├── Text search (ILIKE)
+                                                  ├── Vector search (Qdrant)
+                                                  └── Visual filtering (objects/classes)
 ```
+
+## Stats API
+
+```
+GET /api/v1/stats/ingestion-status/{file_uuid}
+
+Response:
+  rule1_sentence: 35 sentence chunks
+  rule1_ocr: 30 OCR frames
+  rule1_ocr_chunks: 3 OCR-only chunks
+```
+
+| Step | Description |
+|------|------|
+| `rule1_sentence` | Total number of sentence chunks (ASRX + OCR-only) |
+| `rule1_ocr` | Number of OCR pre_chunks frames |
+| `rule1_ocr_chunks` | Number of OCR-only chunks |
+
+## Example: FilmRiot_test
+
+| Item | Count | Description |
+|------|------|------|
+| ASRX segments | 32 | Speech segments |
+| OCR frames | 30 | Frames with detected text |
+| Sentence chunks | 35 | 32 ASRX + 3 OCR-only |
+| OCR-only chunks | 3 | Opening-title text groups |
+
+### Chunk distribution
+
+| Chunk ID | Type | Time | Content |
+|----------|------|------|------|
+| 0-31 | ASRX | 0-81s | Speech text |
+| 32 | OCR-only | 0.12-1.62s | Opening title "Accusys. G Carry 2..." |
+| 33 | OCR-only | 1.91-2.21s | "Western Digital..." |
+| 34 | OCR-only | 2.46-2.79s | "WD Cold Enterprise..." |
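The FilmRiot_test figures imply a simple consistency rule: the `rule1_sentence` total should equal the ASRX segment count plus the OCR-only chunk count (32 + 3 = 35). A minimal sketch of that check follows, assuming each `detail` string starts with an integer as in the documented response; `leading_count` is a hypothetical helper, not part of the stats API.

```python
import re


def leading_count(detail: str) -> int:
    """Extract the integer at the start of a detail string such as '35 sentence chunks'."""
    match = re.match(r"\s*(\d+)", detail)
    if match is None:
        raise ValueError(f"no leading count in {detail!r}")
    return int(match.group(1))


# Detail strings as documented for FilmRiot_test.
details = {
    "rule1_sentence": "35 sentence chunks",
    "rule1_ocr": "30 OCR frames",
    "rule1_ocr_chunks": "3 OCR-only chunks",
}
asrx_segments = 32  # from the example table

total = leading_count(details["rule1_sentence"])
ocr_only = leading_count(details["rule1_ocr_chunks"])
print(total == asrx_segments + ocr_only)  # True
```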