docs: update Rule 1 OCR independent chunks documentation

Updated Searchable_Chunk_Rules.md and pipeline.md to reflect:
- Phase 1: ASRX segments (pure speech, NO OCR merge)
- Phase 2: OCR-only chunks (all OCR frames grouped by proximity)
- New stats API steps: rule1_ocr, rule1_ocr_chunks
Accusys
2026-07-05 23:36:56 +08:00
parent e91d51cc5e
commit cb604b74ec
2 changed files with 104 additions and 28 deletions


@@ -42,6 +42,7 @@ These steps run after the 10 processors and are **required for pipeline completi
| # | Step | Triggers When | Verification |
|---|------|--------------|-------------|
| 1 | **Rule 1 Sentence Chunking** | ASR + ASRX done | `chunk` table has rows with `chunk_type = 'sentence'` |
+| 1.1 | **Rule 1 OCR Chunks** | OCR done | OCR pre_chunks grouped into sentence chunks |
| 2 | **Auto-Vectorize** | Rule 1 done | `chunk.embedding` IS NOT NULL for sentence chunks |
| 3 | **Phase 1 Pack** | Rule 1 done | `release_pack.py --phase 1` executed |
| 4 | **Rule 3 Scene Chunking** | All 10 processors done + Cut + ASR | `chunk` table has rows with `chunk_type = 'cut'` |
@@ -81,15 +82,17 @@ curl "$API/api/v1/stats/ingestion-status/bd80fec9c42afb0307eb28f22c64c76a" | jq
{
"file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
"steps": [
-{ "name": "rule1_sentence", "status": "pending", "detail": "0 sentence chunks" },
-{ "name": "auto_vectorize", "status": "pending", "detail": "0 embedded" },
-{ "name": "rule3_scene", "status": "pending", "detail": "0 scene chunks" },
-{ "name": "face_trace", "status": "pending", "detail": "0 traces" },
-{ "name": "trace_chunks", "status": "pending", "detail": "0 trace chunks" },
-{ "name": "tkg", "status": "pending", "detail": "0 nodes, 0 edges" },
-{ "name": "identity_match", "status": "pending", "detail": "0 identities" },
-{ "name": "scene_metadata", "status": "pending", "detail": null },
-{ "name": "5w1h", "status": "pending", "detail": "0 scenes with 5W1H" }
+{ "name": "rule1_sentence", "status": "done", "detail": "35 sentence chunks" },
+{ "name": "rule1_ocr", "status": "done", "detail": "30 OCR frames" },
+{ "name": "rule1_ocr_chunks", "status": "done", "detail": "3 OCR-only chunks" },
+{ "name": "auto_vectorize", "status": "pending", "detail": "0 embedded" },
+{ "name": "rule3_scene", "status": "pending", "detail": "0 scene chunks" },
+{ "name": "face_trace", "status": "pending", "detail": "0 traces" },
+{ "name": "trace_chunks", "status": "pending", "detail": "0 trace chunks" },
+{ "name": "tkg", "status": "pending", "detail": "0 nodes, 0 edges" },
+{ "name": "identity_match", "status": "pending", "detail": "0 identities" },
+{ "name": "scene_metadata", "status": "pending", "detail": null },
+{ "name": "5w1h", "status": "pending", "detail": "0 scenes with 5W1H" }
]
}
```


@@ -1,6 +1,7 @@
# Searchable Chunk: Combined Rule Composition
-**Date**: 2026-05-16
+**Date**: 2026-05-16
+**Updated**: 2026-07-05 (independent OCR chunks, Option A)
---
@@ -11,8 +12,8 @@ A searchable chunk is not a raw cut or sentence but, once composed by rules,
```
Raw data                 Rule composition     Searchable chunk
───────── ────────── ──────────────
-ASR sentence (audio) ─┐
-YOLO objects (visual) ─┤ Rule 1 / Rule 2 chunk (text + metadata + embedding)
+ASRX sentence (audio) ─┐
+OCR text (on-screen text) ─┤ Rule 1 chunk (text + metadata + embedding)
Cut boundary (shot) ─┘
```
@@ -21,27 +22,64 @@ Cut boundary (shot) ─┘
| Level | Type | Description | Searchable |
|------|------|------|:------:|
| **Raw** | `cut` | Visual chunk (shot) | ❌ (no text) |
-| **Raw** | `sentence` | Audio chunk (ASR sentence) | ✅ text search |
+| **Raw** | `sentence` | Audio chunk (ASRX sentence) | ✅ text search |
+| **Raw** | `sentence` | OCR-only chunk (visual text only) | ✅ text search |
| **Synthesized** | `story_child` | Story child sentence | ✅ |
| **Synthesized** | `story_parent` | Story paragraph (aggregates multiple sentences) | ✅ |
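-## Rule 1: Direct Conversion
+## Rule 1: Two-Phase Conversion
-The simplest rule. Each sentence output by ASR becomes a chunk directly, with no aggregation.
+### Phase 1: ASRX Segments (speech only)
+Each segment output by ASRX becomes a chunk directly; **OCR text is not merged in**.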
```json
{
"chunk_id": "0",
"chunk_type": "sentence",
"rule": "rule_1",
-"data": {
-"text": "I'm in scoby.",
-"text_normalized": "i'm in scoby."
-}
+"text": "And speaking of storage and workflow...",
+"ocr_text": "",
+"start_time": 0.0,
+"end_time": 5.4
}
```
 - `chunk_type = 'sentence'`
-- Text-searchable (`text_content ILIKE`)
-- Vector-searchable (embedding in Qdrant)
+- `content.text` = ASRX speech text
+- `content.ocr_text` = "" (empty)
+- `text_content` = ASRX text
+### Phase 2: OCR-only Chunks (visual text only)
+All OCR frames are grouped by proximity (gap ≤ 5 frames); each group becomes an independent chunk.
+```json
+{
+"chunk_id": "32",
+"chunk_type": "sentence",
+"rule": "rule_1",
+"text": "",
+"ocr_text": "Accusys. G Carry 2 AccusyS Purpose-built...",
+"start_time": 0.125,
+"end_time": 1.627
+}
+```
+- `chunk_type = 'sentence'`
+- `content.text` = "" (empty)
+- `content.ocr_text` = OCR text
+- `text_content` = OCR text
+- `metadata.language` = "ocr"
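The field rules for the two phases can be summarized in a short Python sketch. It assumes the flat `text` / `ocr_text` fields shown in the JSON examples. The helper name `derive_text_content` and the returned dict shape are illustrative and not taken from the pipeline code.

```python
from typing import Optional, TypedDict


class SearchFields(TypedDict):
    text_content: str
    language: Optional[str]


def derive_text_content(text: str, ocr_text: str) -> SearchFields:
    """Hypothetical helper: pick the searchable text for a Rule 1 chunk."""
    if text:
        # Phase 1 (ASRX): the speech text is indexed; ocr_text is expected to be "".
        # The docs do not state a language value for ASRX chunks, so None is used here.
        return {"text_content": text, "language": None}
    # Phase 2 (OCR-only): text is "", so the OCR text is indexed instead.
    return {"text_content": ocr_text, "language": "ocr"}


phase1 = derive_text_content("And speaking of storage and workflow...", "")
phase2 = derive_text_content("", "Accusys. G Carry 2 AccusyS Purpose-built...")
```

Because exactly one of the two fields is non-empty in each phase, a single `text_content` column can serve text search for both chunk kinds.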
+### Grouping Logic
+```
+OCR frames: 4, 5, 6, 7, 8, 9, 10, 11, 13, 16, ..., 66
+↓ group by proximity (gap ≤ 5 frames)
+Group 1: frames 4-16 → chunk "Accusys. G Carry 2..."
+Group 2: frames 48-66 → chunk "Western Digital..."
+```
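A minimal Python sketch of proximity grouping, assuming "gap" means the difference between consecutive frame indices and that a gap of more than 5 starts a new group. The function name and the sample frame list are illustrative: the documented list is elided in the middle, so the values after frame 16 are made up for the example.

```python
from typing import List


def group_frames_by_proximity(frames: List[int], max_gap: int = 5) -> List[List[int]]:
    """Split frame indices into groups whose consecutive gaps are <= max_gap."""
    groups: List[List[int]] = []
    for frame in sorted(frames):
        if groups and frame - groups[-1][-1] <= max_gap:
            # Close enough to the previous frame: extend the current group.
            groups[-1].append(frame)
        else:
            # First frame, or the gap is too large: start a new group.
            groups.append([frame])
    return groups


# Illustrative input only; frames 48..66 are assumed, not taken from the source.
frames = [4, 5, 6, 7, 8, 9, 10, 11, 13, 16, 48, 50, 54, 58, 62, 66]
groups = group_frames_by_proximity(frames)
spans = [(g[0], g[-1]) for g in groups]
print(spans)  # [(4, 16), (48, 66)]
```

Each resulting group would then become one OCR-only chunk, with its text taken from the frames in that group.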
## Rule 2: Aggregated Content
@@ -80,11 +118,46 @@ Body: {"uuid": "...", "criteria": {"required_classes": ["person"]}}
## Flow
```
-ASR output (sentence) ─── Rule 1 ──→ chunk (sentence, text+embedding)
-YOLO output (objects) ─── Rule 2 ──→ chunk (visual, objects+classes)
-├── Text search (ILIKE)
-├── Vector search (Qdrant)
-└── Visual filtering (objects/classes)
+ASRX output (sentence) ─── Rule 1 Phase 1 ──→ chunk (sentence, ASRX text)
+OCR output (frames) ─── Rule 1 Phase 2 ──→ chunk (sentence, OCR text)
+├── Text search (ILIKE)
+├── Vector search (Qdrant)
+└── Visual filtering (objects/classes)
```
+## Stats API
+```
+GET /api/v1/stats/ingestion-status/{file_uuid}
+Response:
+rule1_sentence: 35 sentence chunks
+rule1_ocr: 30 OCR frames
+rule1_ocr_chunks: 3 OCR-only chunks
+```
+| Step | Description |
+|------|------|
+| `rule1_sentence` | Total number of sentence chunks (ASRX + OCR-only) |
+| `rule1_ocr` | Number of OCR pre_chunk frames |
+| `rule1_ocr_chunks` | Number of OCR-only chunks |
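The three Rule 1 steps can be read out of an ingestion-status response with a few lines of Python. This is a sketch under two assumptions: the response has the `steps` / `name` / `detail` shape shown in the pipeline.md sample, and each non-null `detail` string starts with an integer count. The helper names are hypothetical.

```python
import re
from typing import Dict, Optional

# Trimmed copy of the documented sample response (only the steps used here).
SAMPLE_STATUS = {
    "file_uuid": "bd80fec9c42afb0307eb28f22c64c76a",
    "steps": [
        {"name": "rule1_sentence", "status": "done", "detail": "35 sentence chunks"},
        {"name": "rule1_ocr", "status": "done", "detail": "30 OCR frames"},
        {"name": "rule1_ocr_chunks", "status": "done", "detail": "3 OCR-only chunks"},
        {"name": "scene_metadata", "status": "pending", "detail": None},
    ],
}


def leading_count(detail: Optional[str]) -> Optional[int]:
    """Return the integer at the start of a detail string, or None if absent."""
    match = re.match(r"\s*(\d+)", detail or "")
    return int(match.group(1)) if match else None


def rule1_counts(status: dict) -> Dict[str, Optional[int]]:
    """Collect the counts reported by the rule1_* steps."""
    return {
        step["name"]: leading_count(step.get("detail"))
        for step in status["steps"]
        if step["name"].startswith("rule1_")
    }


counts = rule1_counts(SAMPLE_STATUS)
# rule1_sentence includes OCR-only chunks, so subtract them to get ASRX chunks.
asrx_chunks = counts["rule1_sentence"] - counts["rule1_ocr_chunks"]
print(counts, asrx_chunks)
```

Since `rule1_sentence` is a total, the ASRX-only count has to be derived by subtraction rather than read directly.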
+## Example: FilmRiot_test
+| Item | Count | Description |
+|------|------|------|
+| ASRX segments | 32 | Speech segments |
+| OCR frames | 30 | Frames in which text was detected |
+| Sentence chunks | 35 | 32 ASRX + 3 OCR-only |
+| OCR-only chunks | 3 | Opening-title text groups |
+### Chunk Distribution
+| Chunk ID | Type | Time | Content |
+|----------|------|------|------|
+| 0-31 | ASRX | 0-81s | Speech text |
+| 32 | OCR-only | 0.12-1.62s | Opening title "Accusys. G Carry 2..." |
+| 33 | OCR-only | 1.91-2.21s | "Western Digital..." |
+| 34 | OCR-only | 2.46-2.79s | "WD Cold Enterprise..." |
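The table suggests that OCR-only chunk IDs continue directly after the last ASRX chunk ID. The sketch below reproduces that layout in Python; the numbering scheme is inferred from this one example and the function name is hypothetical, so treat it as an assumption rather than documented behavior.

```python
from typing import List, Tuple


def assign_chunk_ids(asrx_count: int, ocr_group_count: int) -> Tuple[List[str], List[str]]:
    """Give ASRX chunks IDs 0..n-1 and OCR-only chunks the IDs that follow."""
    # chunk_id values are strings in the JSON examples ("0", "32").
    asrx_ids = [str(i) for i in range(asrx_count)]
    ocr_ids = [str(i) for i in range(asrx_count, asrx_count + ocr_group_count)]
    return asrx_ids, ocr_ids


asrx_ids, ocr_ids = assign_chunk_ids(asrx_count=32, ocr_group_count=3)
total_sentence_chunks = len(asrx_ids) + len(ocr_ids)
```

With 32 ASRX segments and 3 OCR groups this yields 35 sentence chunks in total, matching the `rule1_sentence` count above.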