docs: update Rule 1 OCR independent chunks documentation

Updated Searchable_Chunk_Rules.md and pipeline.md to reflect:
- Phase 1: ASRX segments (pure speech, NO OCR merge)
- Phase 2: OCR-only chunks (all OCR frames grouped by proximity)
- New stats API steps: rule1_ocr, rule1_ocr_chunks
2026-07-05 23:36:56 +08:00
parent e91d51cc5e
commit cb604b74ec
2 changed files with 104 additions and 28 deletions
--- a/docs_v1.0/API_WORKSPACE/modules/10_pipeline.md
+++ b/docs_v1.0/API_WORKSPACE/modules/10_pipeline.md
@@ -42,6 +42,7 @@ These steps run after the 10 processors and are **required for pipeline completi
 | # | Step | Triggers When | Verification |
 |---|------|--------------|-------------|
 | 1 | **Rule 1 Sentence Chunking** | ASR + ASRX done | `chunk` table has rows with `chunk_type = 'sentence'` |
+| 1.1 | **Rule 1 OCR Chunks** | OCR done | OCR pre_chunks grouped into sentence chunks |
 | 2 | **Auto-Vectorize** | Rule 1 done | `chunk.embedding` IS NOT NULL for sentence chunks |
 | 3 | **Phase 1 Pack** | Rule 1 done | `release_pack.py --phase 1` executed |
 | 4 | **Rule 3 Scene Chunking** | All 10 processors done + Cut + ASR | `chunk` table has rows with `chunk_type = 'cut'` |
@@ -81,15 +82,17 @@ curl "$API/api/v1/stats/ingestion-status/bd80fec9c42afb0307eb28f22c64c76a" | jq
 {
  "file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
  "steps": [
-    { "name": "rule1_sentence", "status": "pending", "detail": "0 sentence chunks" },
-    { "name": "auto_vectorize",  "status": "pending", "detail": "0 embedded" },
-    { "name": "rule3_scene",     "status": "pending", "detail": "0 scene chunks" },
-    { "name": "face_trace",      "status": "pending", "detail": "0 traces" },
-    { "name": "trace_chunks",    "status": "pending", "detail": "0 trace chunks" },
-    { "name": "tkg",             "status": "pending", "detail": "0 nodes, 0 edges" },
-    { "name": "identity_match",  "status": "pending", "detail": "0 identities" },
-    { "name": "scene_metadata",  "status": "pending", "detail": null },
-    { "name": "5w1h",            "status": "pending", "detail": "0 scenes with 5W1H" }
+    { "name": "rule1_sentence",   "status": "done",    "detail": "35 sentence chunks" },
+    { "name": "rule1_ocr",        "status": "done",    "detail": "30 OCR frames" },
+    { "name": "rule1_ocr_chunks", "status": "done",    "detail": "3 OCR-only chunks" },
+    { "name": "auto_vectorize",   "status": "pending", "detail": "0 embedded" },
+    { "name": "rule3_scene",      "status": "pending", "detail": "0 scene chunks" },
+    { "name": "face_trace",       "status": "pending", "detail": "0 traces" },
+    { "name": "trace_chunks",     "status": "pending", "detail": "0 trace chunks" },
+    { "name": "tkg",              "status": "pending", "detail": "0 nodes, 0 edges" },
+    { "name": "identity_match",   "status": "pending", "detail": "0 identities" },
+    { "name": "scene_metadata",   "status": "pending", "detail": null },
+    { "name": "5w1h",             "status": "pending", "detail": "0 scenes with 5W1H" }
  ]
 }
 ```
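
The updated response above can be summarized with a short script. This is a minimal sketch, not part of the documented API: the payload is a trimmed copy of the sample response, and the helper names `steps_by_status` and `leading_count` are hypothetical. Reading a count from the start of `detail` is an assumption based only on the sample strings shown here.

```python
import json

# Trimmed copy of the documented sample response (five of the eleven steps).
SAMPLE = """
{
  "file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
  "steps": [
    { "name": "rule1_sentence",   "status": "done",    "detail": "35 sentence chunks" },
    { "name": "rule1_ocr",        "status": "done",    "detail": "30 OCR frames" },
    { "name": "rule1_ocr_chunks", "status": "done",    "detail": "3 OCR-only chunks" },
    { "name": "auto_vectorize",   "status": "pending", "detail": "0 embedded" },
    { "name": "scene_metadata",   "status": "pending", "detail": null }
  ]
}
"""


def steps_by_status(payload):
    """Group step names by status, preserving the order of the response."""
    grouped = {}
    for step in payload.get("steps", []):
        grouped.setdefault(step["status"], []).append(step["name"])
    return grouped


def leading_count(detail):
    """Return the leading integer of a detail string, or None if there is none.

    Assumption: detail strings start with a count, as in the sample above.
    """
    if not detail:
        return None
    head = detail.split(maxsplit=1)[0]
    return int(head) if head.isdecimal() else None


payload = json.loads(SAMPLE)
grouped = steps_by_status(payload)
counts = {s["name"]: leading_count(s["detail"]) for s in payload["steps"]}

print(grouped["done"])     # steps already finished
print(grouped["pending"])  # steps still waiting
print(counts["rule1_ocr_chunks"])
```

Keeping `detail` parsing separate from status grouping means a change in the wording of `detail` only affects the count helper.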
--- a/docs_v1.0/REFERENCE/Searchable_Chunk_Rules.md
+++ b/docs_v1.0/REFERENCE/Searchable_Chunk_Rules.md
@@ -1,6 +1,7 @@
 # Searchable Chunk: Composite Rules

-**Date**: 2026-05-16
+**Date**: 2026-05-16  
+**Updated**: 2026-07-05, independent OCR chunks (Option A)

 ---

@@ -11,8 +12,8 @@ A searchable chunk is not a raw cut or sentence but, after rule composition,
 ```
 Raw data                 Rule composition       Searchable chunk
 ─────────                ──────────             ──────────────
-ASR sentence (audio)     ─┐
-YOLO objects (visual)    ─┤  Rule 1 / Rule 2    chunk (text + metadata + embedding)
+ASRX sentence (audio)    ─┐
+OCR text (visual text)   ─┤  Rule 1             chunk (text + metadata + embedding)
 Cut boundary (shot)      ─┘
 ```

@@ -21,27 +22,64 @@ Cut boundary (shot)      ─┘
 | Level | Type | Description | Searchable |
 |------|------|------|:------:|
 | **Raw** | `cut` | Visual chunk (shot) | ❌ (no text) |
-| **Raw** | `sentence` | Audio chunk (ASR sentence) | ✅ Text search |
+| **Raw** | `sentence` | Audio chunk (ASRX sentence) | ✅ Text search |
+| **Raw** | `sentence` | OCR-only chunk (visual text only) | ✅ Text search |
 | **Composite** | `story_child` | Story clause | ✅ |
 | **Composite** | `story_parent` | Story passage (aggregates multiple sentences) | ✅ |

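-## Rule 1: Direct Conversion
+## Rule 1: Two-Phase Conversion
 
-The simplest rule. Each sentence in the ASR output becomes a chunk directly, with no aggregation.
+### Phase 1: ASRX Segments (speech only)
+
+Each segment in the ASRX output becomes a chunk directly, **with no OCR text merged in**.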

 ```json
 {
+  "chunk_id": "0",
+  "chunk_type": "sentence",
  "rule": "rule_1",
-  "data": {
-    "text": "I'm in scoby.",
-    "text_normalized": "i'm in scoby."
-  }
+  "text": "And speaking of storage and workflow...",
+  "ocr_text": "",
+  "start_time": 0.0,
+  "end_time": 5.4
 }
 ```

 - `chunk_type = 'sentence'`
- Text-searchable (`text_content ILIKE`)
- Vector-searchable (embedding in Qdrant)
+- `content.text` = ASRX speech text
+- `content.ocr_text` = "" (empty)
+- `text_content` = ASRX text
+
+### Phase 2: OCR-only Chunks (visual text only)
+
+All OCR frames are grouped by proximity (gap ≤ 5 frames), and each group becomes an independent chunk.
+
+```json
+{
+  "chunk_id": "32",
+  "chunk_type": "sentence",
+  "rule": "rule_1",
+  "text": "",
+  "ocr_text": "Accusys. G Carry 2 AccusyS Purpose-built...",
+  "start_time": 0.125,
+  "end_time": 1.627
+}
+```
+
+- `chunk_type = 'sentence'`
+- `content.text` = "" (empty)
+- `content.ocr_text` = OCR text
+- `text_content` = OCR text
+- `metadata.language` = "ocr"
+
+### Grouping Logic
+
+```
+OCR frames: 4, 5, 6, 7, 8, 9, 10, 11, 13, 16, ..., 66
+  ↓ grouped by proximity (gap ≤ 5 frames)
+Group 1: frames 4-16 → chunk "Accusys. G Carry 2..."
+Group 2: frames 48-66 → chunk "Western Digital..."
+```

 ## Rule 2: Aggregated Content

@@ -80,11 +118,46 @@ Body: {"uuid": "...", "criteria": {"required_classes": ["person"]}}
 ## Flow

 ```
-ASR output (sentence) ─── Rule 1 ───→ chunk (sentence, text+embedding)
-                                           │
-YOLO output (objects) ─── Rule 2 ───→ chunk (visual, objects+classes)
-                                           │
-                                           ├── Text search (ILIKE)
-                                           ├── Vector search (Qdrant)
-                                           └── Visual filtering (objects/classes)
+ASRX output (sentence) ─── Rule 1 Phase 1 ──→ chunk (sentence, ASRX text)
+                                                 │
+OCR output (frames)    ─── Rule 1 Phase 2 ──→ chunk (sentence, OCR text)
+                                                 │
+                                                 ├── Text search (ILIKE)
+                                                 ├── Vector search (Qdrant)
+                                                 └── Visual filtering (objects/classes)
 ```
+
+## Stats API
+
+```
+GET /api/v1/stats/ingestion-status/{file_uuid}
+
+Response:
+  rule1_sentence: 35 sentence chunks
+  rule1_ocr: 30 OCR frames
+  rule1_ocr_chunks: 3 OCR-only chunks
+```
+
+| Step | Description |
+|------|------|
+| `rule1_sentence` | Total number of sentence chunks (ASRX + OCR-only) |
+| `rule1_ocr` | Number of OCR pre_chunks frames |
+| `rule1_ocr_chunks` | Number of OCR-only chunks |
+
+## Example: FilmRiot_test
+
+| Item | Count | Description |
+|------|------|------|
+| ASRX segments | 32 | Speech segments |
+| OCR frames | 30 | Frames with detected text |
+| Sentence chunks | 35 | 32 ASRX + 3 OCR-only |
+| OCR-only chunks | 3 | Opening-title text groups |
+
+### Chunk Distribution
+
+| Chunk ID | Type | Time | Content |
+|----------|------|------|------|
+| 0-31 | ASRX | 0-81s | Speech text |
+| 32 | OCR-only | 0.12-1.62s | Opening title "Accusys. G Carry 2..." |
+| 33 | OCR-only | 1.91-2.21s | "Western Digital..." |
+| 34 | OCR-only | 2.46-2.79s | "WD Cold Enterprise..." |
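
The Phase 2 grouping described in this change (OCR frames grouped by proximity, gap ≤ 5 frames) could be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `group_by_proximity` is a hypothetical helper, "gap" is assumed to mean the difference between consecutive frame indices, and the frame list is made up for the example rather than taken from FilmRiot_test.

```python
def group_by_proximity(frames, max_gap=5):
    """Split frame indices into runs whose neighbours are at most max_gap apart.

    Assumption: "gap" is the difference between consecutive frame indices.
    """
    groups = []
    for frame in sorted(set(frames)):
        # Extend the current group when the gap to its last frame is small enough;
        # otherwise start a new group.
        if groups and frame - groups[-1][-1] <= max_gap:
            groups[-1].append(frame)
        else:
            groups.append([frame])
    return groups


# Illustrative frame indices only; not the actual FilmRiot_test data.
frames = [4, 5, 6, 7, 8, 9, 10, 11, 13, 16, 48, 52, 57, 61, 66]
ranges = [(g[0], g[-1]) for g in group_by_proximity(frames)]
print(ranges)  # [(4, 16), (48, 66)]
```

Each resulting range would map to one OCR-only chunk. Converting frame indices to `start_time` and `end_time` depends on the frame sampling rate, which this commit does not state.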