momentry_core/docs/CHUNK_SPEC.md

# Video Chunk Splitting Specification

| Item | Value |
|------|-------|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |

---

## Version History

| Version | Date | Purpose | Author | Tool/Model |
|---------|------|---------|--------|------------|
| V1.0 | 2026-03-16 | Document created | Warren | OpenCode / MiniMax M2.5 |

---

This document defines how video chunks are split in the Momentry Core system and the data structures used to represent them.

---

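## 1. Chunk Overview

### 1.1 Design Principles

1. **Overlap is allowed**: chunks of different types may overlap (for example, a sentence chunk and a time-based chunk)
2. **Frame precision**: time coordinates are accurate to the video frame
3. **Multiple classifications**: three splitting methods are supported: sentence, scene, and time

### 1.2 Chunk Types

| Type | Description | Overlap allowed |
|------|-------------|-----------------|
| **Sentence** | Split by spoken sentence | ✅ May overlap with other types |
| **Cut** | Split by scene cut | ✅ May overlap with other types |
| **TimeBased** | Split by fixed time length | ✅ May overlap with other types |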

---

## 2. Time Coordinate System

### 2.1 Time Format

All times are expressed in **seconds** as floating-point numbers, with precision down to the **microsecond**:

```json
{
  "start_time": 10.5,
  "end_time": 15.75
}
```

### 2.2 Frame Calculation

```
frame_number = floor(time_in_seconds * fps)
time_at_frame = frame_number / fps
```

**Example**:
- Video FPS: 24/1 (24 fps)
- Time: 10.5 seconds
- Frame: floor(10.5 * 24) = 252
- Check: 252 / 24 = 10.5 seconds ✅
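The two formulas above can be sketched in Rust as follows. The function names `time_to_frame` and `frame_to_time` are illustrative and do not come from the Momentry Core code base.

```rust
/// Converts a timestamp in seconds to a frame number: floor(t * fps).
pub fn time_to_frame(time_in_seconds: f64, fps: f64) -> i64 {
    (time_in_seconds * fps).floor() as i64
}

/// Converts a frame number back to the timestamp at which that frame starts.
pub fn frame_to_time(frame_number: i64, fps: f64) -> f64 {
    frame_number as f64 / fps
}
```

A timestamp that does not fall exactly on a frame boundary snaps down to the start of its frame when converted to a frame number and back.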

### 2.3 Frame Information Structure

```json
{
  "start_time": 10.5,
  "start_frame": 252,
  "end_time": 15.75,
  "end_frame": 378,
  "fps": "24/1",
  "fps_value": 24.0
}
```

---

## 3. The Three Splitting Methods

### 3.1 Sentence (Sentence Splitting)

**Principles**:
- Based on ASR (automatic speech recognition) results
- Each recognized sentence becomes one chunk
- The text content comes from the ASR output

**Example**:

```
ASR output:
[
  {"start": 10.0, "end": 15.0, "text": "Hello world"},
  {"start": 15.0, "end": 20.0, "text": "This is a test"},
  {"start": 20.0, "end": 25.5, "text": "Processing video"}
]

Converted to chunks:
┌──────────────────────────────────────────────────┐
│ sentence_0000: 10.0s - 15.0s "Hello world"       │
├──────────────────────────────────────────────────┤
│ sentence_0001: 15.0s - 20.0s "This is a test"    │
├──────────────────────────────────────────────────┤
│ sentence_0002: 20.0s - 25.5s "Processing video"  │
└──────────────────────────────────────────────────┘
```
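As a minimal sketch of this conversion, assuming ASR segments arrive already sorted by start time, each segment maps to one chunk whose ID follows the naming convention in section 5 (numbering from 0). `AsrSegment` and `SentenceChunk` are hypothetical types used only for illustration; they are simpler than the full `Chunk` structure defined later.

```rust
/// One ASR segment, mirroring the `start` / `end` / `text` fields of the ASR output.
#[derive(Debug, Clone, PartialEq)]
pub struct AsrSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// A reduced sentence chunk (hypothetical; the full structure is in section 4).
#[derive(Debug, Clone, PartialEq)]
pub struct SentenceChunk {
    pub chunk_id: String,
    pub chunk_index: u32,
    pub start_time: f64,
    pub end_time: f64,
    pub text: String,
}

/// Builds one sentence chunk per ASR segment, numbering from 0.
pub fn sentence_chunks(segments: &[AsrSegment]) -> Vec<SentenceChunk> {
    segments
        .iter()
        .enumerate()
        .map(|(i, seg)| SentenceChunk {
            chunk_id: format!("sentence_{:04}", i),
            chunk_index: i as u32,
            start_time: seg.start,
            end_time: seg.end,
            text: seg.text.clone(),
        })
        .collect()
}
```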

### 3.2 Cut (Scene Splitting)

**Principles**:
- Based on shot changes in the video (scene change / cut detection)
- Detected with ffmpeg or Python (scenedetect)
- Each scene becomes one chunk

**Detection method**:

```bash
# Detect scene changes with ffmpeg (frames whose scene score exceeds 0.3)
ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null -
```

**Example**:

```
Scene detection result:
[
  {"start": 0.0, "end": 45.2, "scene_id": 1},
  {"start": 45.2, "end": 120.5, "scene_id": 2},
  {"start": 120.5, "end": 180.0, "scene_id": 3}
]

Converted to chunks:
┌──────────────────────────────────────────────────┐
│ cut_0000: 0.0s - 45.2s (Scene 1)                 │
├──────────────────────────────────────────────────┤
│ cut_0001: 45.2s - 120.5s (Scene 2)               │
├──────────────────────────────────────────────────┤
│ cut_0002: 120.5s - 180.0s (Scene 3)              │
└──────────────────────────────────────────────────┘
```
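The step from detected cut timestamps to scene intervals can be sketched as below. This assumes the ffmpeg `showinfo` log reports each selected frame with a `pts_time:<seconds>` token; the exact log layout can vary between ffmpeg versions, so the parser is an illustration rather than a robust implementation. `Scene`, `parse_cut_times`, and `scenes_from_cuts` are hypothetical names.

```rust
/// A detected scene with a 1-based scene ID.
#[derive(Debug, Clone, PartialEq)]
pub struct Scene {
    pub scene_id: u32,
    pub start: f64,
    pub end: f64,
}

/// Extracts the value of every `pts_time:<seconds>` token found in the log text.
pub fn parse_cut_times(log: &str) -> Vec<f64> {
    log.lines()
        .filter_map(|line| {
            let rest = line.split("pts_time:").nth(1)?;
            let token = rest.split_whitespace().next()?;
            token.parse::<f64>().ok()
        })
        .collect()
}

/// Turns sorted cut timestamps into consecutive scenes covering [0, video_duration].
pub fn scenes_from_cuts(cut_times: &[f64], video_duration: f64) -> Vec<Scene> {
    let mut scenes = Vec::new();
    let mut start = 0.0;
    for &cut in cut_times {
        // Ignore cuts that are out of order or outside the video.
        if cut > start && cut < video_duration {
            let scene_id = scenes.len() as u32 + 1;
            scenes.push(Scene { scene_id, start, end: cut });
            start = cut;
        }
    }
    // The last scene runs from the final cut to the end of the video.
    let scene_id = scenes.len() as u32 + 1;
    scenes.push(Scene { scene_id, start, end: video_duration });
    scenes
}
```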

### 3.3 TimeBased (Fixed-Length Splitting)

**Principles**:
- Split at a fixed time length
- Each chunk is **10 seconds** by default
- The last chunk may be shorter than 10 seconds
- **Overlap is supported** (configurable in seconds)

**Parameters**:

| Parameter | Default | Description |
|-----------|---------|-------------|
| duration | 10.0 | Length of each chunk (seconds) |
| overlap | 0.0 | Overlap between consecutive chunks (seconds) |

**Example** (no overlap):

```
Video length: 35 seconds, duration=10

Chunks:
┌──────────────────────────────────────────────────┐
│ time_based_0000: 0.0s - 10.0s                    │
├──────────────────────────────────────────────────┤
│ time_based_0001: 10.0s - 20.0s                   │
├──────────────────────────────────────────────────┤
│ time_based_0002: 20.0s - 30.0s                   │
├──────────────────────────────────────────────────┤
│ time_based_0003: 30.0s - 35.0s (under 10 s)      │
└──────────────────────────────────────────────────┘
```

**Example** (with overlap, overlap=2):

```
Video length: 35 seconds, duration=10, overlap=2

Chunks:
┌──────────────────────────────────────────────────┐
│ time_based_0000: 0.0s - 10.0s                    │
├──────────────────────────────────────────────────┤
│ time_based_0001: 8.0s - 18.0s (2 s overlap)      │
├──────────────────────────────────────────────────┤
│ time_based_0002: 16.0s - 26.0s (2 s overlap)     │
├──────────────────────────────────────────────────┤
│ time_based_0003: 24.0s - 34.0s (2 s overlap)     │
├──────────────────────────────────────────────────┤
│ time_based_0004: 32.0s - 35.0s (overlap + short) │
└──────────────────────────────────────────────────┘
```
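The number of chunks in both examples follows from the step between chunk starts, `duration - overlap`: a new chunk starts every `step` seconds for as long as the start lies before the end of the video. A minimal sketch, with `expected_segments` as a hypothetical helper name:

```rust
/// Returns how many time-based chunks a video produces, or `None` when the
/// parameters are invalid (the step between chunk starts must be positive).
pub fn expected_segments(video_duration: f64, duration: f64, overlap: f64) -> Option<u32> {
    let step = duration - overlap;
    if step <= 0.0 {
        return None;
    }
    if video_duration <= 0.0 {
        return Some(0);
    }
    // Chunk starts are 0, step, 2*step, ... while start < video_duration.
    Some((video_duration / step).ceil() as u32)
}
```

This value could populate the `total_segments` field of the TimeBased content. Because it is computed with floating-point division, results for durations that are not exact multiples of the step should be checked against the chunks actually generated.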

---

## 4. Chunk Data Structure

### 4.1 Basic Structure

```json
{
  "uuid": "1636719dc31f78ac",
  "chunk_id": "sentence_0001",
  "chunk_index": 1,
  "chunk_type": "sentence",
  "start_time": 10.5,
  "start_frame": 252,
  "end_time": 15.75,
  "end_frame": 378,
  "fps": "24/1",
  "fps_value": 24.0,
  "content": {
    "text": "Hello world, this is a test"
  },
  "metadata": {
    "source": "asr",
    "confidence": 0.95,
    "language": "en"
  }
}
```

### 4.2 Field Descriptions

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `uuid` | String | ✅ | Video UUID (16 characters) |
| `chunk_id` | String | ✅ | Chunk ID, unique within a video |
| `chunk_index` | Integer | ✅ | Chunk index (starting from 0) |
| `chunk_type` | String | ✅ | Type: sentence/cut/time_based |
| `start_time` | Float | ✅ | Start time (seconds) |
| `start_frame` | Integer | ✅ | Start frame number |
| `end_time` | Float | ✅ | End time (seconds) |
| `end_frame` | Integer | ✅ | End frame number |
| `fps` | String | ✅ | FPS as a rational string (e.g. "24/1") |
| `fps_value` | Float | ✅ | FPS as a number (e.g. 24.0) |
| `content` | Object | ✅ | Content (see below) |
| `metadata` | Object | ❌ | Additional information (see below) |

### 4.3 Content Structure

The structure of `content` depends on `chunk_type`:

#### Sentence Content

```json
{
  "content": {
    "text": "Hello world, this is a test message",
    "text_normalized": "hello world this is a test message",
    "word_count": 7,
    "char_count": 34
  }
}
```

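| Field | Type | Description |
|-------|------|-------------|
| `text` | String | Original recognized text |
| `text_normalized` | String | Normalized text (lowercase, punctuation removed) |
| `word_count` | Integer | Number of words |
| `char_count` | Integer | Number of characters (in the example above, the length of `text_normalized`) |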
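The derived fields can be computed from `text` along the following lines. This is a sketch under two assumptions that the specification does not state explicitly: normalization keeps only alphanumeric characters and whitespace, and `char_count` is measured on the normalized text (which matches the example value of 34). The function names are illustrative.

```rust
/// Lowercases the text, drops punctuation, and collapses runs of whitespace.
pub fn normalize_text(text: &str) -> String {
    let cleaned: String = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .flat_map(|c| c.to_lowercase())
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Returns (text_normalized, word_count, char_count) for a recognized sentence.
pub fn sentence_stats(text: &str) -> (String, usize, usize) {
    let normalized = normalize_text(text);
    let word_count = normalized.split_whitespace().count();
    let char_count = normalized.chars().count();
    (normalized, word_count, char_count)
}
```

Whitespace-based word counting suits languages such as English; languages without word delimiters would need a tokenizer instead.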

#### Cut Content

```json
{
  "content": {
    "scene_id": 2,
    "scene_number": 2,
    "transition_type": "cut",
    "scene_change_score": 0.95
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `scene_id` | Integer | Scene ID |
| `scene_number` | Integer | Scene number |
| `transition_type` | String | Transition type: cut/dissolve/fade |
| `scene_change_score` | Float | Scene change score (0-1) |

#### TimeBased Content

```json
{
  "content": {
    "duration": 10.0,
    "is_last": false,
    "segment_number": 3,
    "total_segments": 10
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `duration` | Float | Length (seconds) |
| `is_last` | Boolean | Whether this is the last chunk |
| `segment_number` | Integer | Segment number |
| `total_segments` | Integer | Total number of segments |

### 4.4 Metadata Structure

```json
{
  "metadata": {
    "source": "asr",
    "confidence": 0.95,
    "language": "en",
    "model": "tiny",
    "created_at": "2026-03-16T10:00:00Z"
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `source` | String | Source: asr/scene_detect/time_based |
| `confidence` | Float | Confidence (0-1) |
| `language` | String | Language code |
| `model` | String | Model used |
| `created_at` | String | Creation time (ISO 8601) |

---

## 5. Chunk ID Naming Convention

### 5.1 Format

```
{chunk_type}_{chunk_index:04}
```

| Type | Prefix | Example |
|------|--------|---------|
| Sentence | `sentence_` | `sentence_0001` |
| Cut | `cut_` | `cut_0001` |
| TimeBased | `time_based_` | `time_based_0001` |

### 5.2 Numbering Rules

- Numbering starts from **0**
- Indexes are zero-padded to **4 digits**
- Indexes increase in chronological order
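A short sketch of formatting and parsing IDs in this scheme. The helper names are illustrative. Parsing splits at the last underscore, because the `time_based` type name itself contains one.

```rust
/// Formats a chunk ID as `{chunk_type}_{chunk_index:04}`.
pub fn format_chunk_id(chunk_type: &str, chunk_index: u32) -> String {
    format!("{}_{:04}", chunk_type, chunk_index)
}

/// Splits a chunk ID back into its type name and numeric index.
pub fn parse_chunk_id(chunk_id: &str) -> Option<(&str, u32)> {
    let (chunk_type, index) = chunk_id.rsplit_once('_')?;
    let index = index.parse::<u32>().ok()?;
    Some((chunk_type, index))
}
```

The `04` width is a minimum, so an index of 10000 or more is written with five digits rather than truncated.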

---

## 6. Database Schema

### 6.1 PostgreSQL Table

```sql
CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    start_frame BIGINT NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    end_frame BIGINT NOT NULL,
    fps VARCHAR(16) NOT NULL,
    fps_value DOUBLE PRECISION NOT NULL,
    content JSONB NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(uuid, chunk_id)
);

-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
```

### 6.2 Query Examples

```sql
-- All chunks of a video
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac';

-- Chunks of a specific type
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac' AND chunk_type = 'sentence';

-- Chunks that overlap the 20-30 second range
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0;

-- Chunks of all types that overlap the 20-30 second range, grouped by type
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0
ORDER BY chunk_type, chunk_index;
```
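The time-range condition used in these queries is an interval-overlap test: a chunk matches when it starts no later than the end of the range and ends no earlier than the start. The same predicate, sketched in Rust for in-memory filtering (the function name is illustrative):

```rust
/// Returns true when the chunk interval [chunk_start, chunk_end] overlaps the
/// query range [range_start, range_end]. Bounds are inclusive, matching the
/// `<=` / `>=` comparisons in the SQL above.
pub fn overlaps_range(chunk_start: f64, chunk_end: f64, range_start: f64, range_end: f64) -> bool {
    chunk_start <= range_end && chunk_end >= range_start
}
```

With inclusive bounds, a chunk that ends exactly at 20.0 matches a query for 20-30 seconds. Using strict comparisons instead would exclude chunks that only touch the range at a single instant.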

---

## 7. Rust Data Structures

### 7.1 Chunk Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum ChunkType {
    Sentence,
    Cut,
    TimeBased,
}

impl ChunkType {
    pub fn as_str(&self) -> &'static str {
        match self {
            ChunkType::Sentence => "sentence",
            ChunkType::Cut => "cut",
            ChunkType::TimeBased => "time_based",
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub uuid: String,
    pub chunk_id: String,
    pub chunk_index: u32,
    pub chunk_type: ChunkType,
    pub start_time: f64,
    pub start_frame: i64,
    pub end_time: f64,
    pub end_frame: i64,
    pub fps: String,
    pub fps_value: f64,
    pub content: serde_json::Value,
    pub metadata: Option<serde_json::Value>,
    /// ID of the associated vector (see section 11); `None` until an embedding is stored.
    /// The storage examples in section 10 read this field.
    pub vector_id: Option<String>,
}
```

### 7.2 Creating a Chunk

```rust
impl Chunk {
    pub fn new(
        uuid: String,
        chunk_index: u32,
        chunk_type: ChunkType,
        start_time: f64,
        end_time: f64,
        fps: &str,
        content: serde_json::Value,
    ) -> Self {
        // `parse_fps` turns a rational string such as "24/1" into a number;
        // it is a helper that this document does not define.
        let fps_value = parse_fps(fps);
        // floor() matches the frame formula in section 2.2.
        let start_frame = (start_time * fps_value).floor() as i64;
        let end_frame = (end_time * fps_value).floor() as i64;
        let chunk_id = format!("{}_{:04}", chunk_type.as_str(), chunk_index);

        Self {
            uuid,
            chunk_id,
            chunk_index,
            chunk_type,
            start_time,
            start_frame,
            end_time,
            end_frame,
            fps: fps.to_string(),
            fps_value,
            content,
            metadata: None,
            vector_id: None,
        }
    }
}
```
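`Chunk::new` calls `parse_fps`, which is not shown in this document. One possible shape for it, as a sketch only: it accepts either a rational string such as `"30000/1001"` or a plain number such as `"25"`, and falls back to `0.0` on malformed input. The fallback value is an assumption; a real implementation might prefer to return a `Result`.

```rust
/// Parses an FPS string ("24/1", "30000/1001", or "25") into frames per second.
/// Returns 0.0 when the input cannot be parsed or the denominator is zero.
pub fn parse_fps(fps: &str) -> f64 {
    let mut parts = fps.trim().splitn(2, '/');
    let numerator = parts.next().and_then(|n| n.trim().parse::<f64>().ok());
    let denominator = match parts.next() {
        Some(d) => d.trim().parse::<f64>().ok(),
        None => Some(1.0), // no "/" present: treat the value as a whole rate
    };
    match (numerator, denominator) {
        (Some(n), Some(d)) if d != 0.0 => n / d,
        _ => 0.0,
    }
}
```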

---

## 8. Time-Based Splitter Implementation

### 8.1 TimeBasedSplitter

```rust
pub struct TimeBasedSplitter {
    pub duration: f64, // length of each chunk (seconds)
    pub overlap: f64,  // overlap between consecutive chunks (seconds)
}

impl TimeBasedSplitter {
    pub fn new(duration: f64, overlap: f64) -> Self {
        Self { duration, overlap }
    }

    pub fn split(&self, uuid: &str, video_duration: f64, fps: f64) -> Vec<Chunk> {
        let step = self.duration - self.overlap;
        // A non-positive step would never advance `current_time`,
        // so the loop below would not terminate.
        assert!(step > 0.0, "overlap must be smaller than duration");

        let mut chunks = Vec::new();
        let mut current_time = 0.0;
        let mut index = 0;

        while current_time < video_duration {
            // The last chunk is clamped to the end of the video.
            let end_time = (current_time + self.duration).min(video_duration);

            let chunk = Chunk::new(
                uuid.to_string(),
                index,
                ChunkType::TimeBased,
                current_time,
                end_time,
                // Note: the integer cast drops fractional rates such as 23.976 (24000/1001).
                &format!("{}/1", fps as u32),
                // `total_segments` (section 4.3) is not populated here.
                serde_json::json!({
                    "duration": end_time - current_time,
                    "is_last": end_time >= video_duration,
                    "segment_number": index + 1,
                }),
            );
            chunks.push(chunk);

            current_time += step;
            index += 1;
        }

        chunks
    }
```

### 8.2 Usage Example

```rust
// Create a time-based splitter (10 seconds, no overlap)
let splitter = TimeBasedSplitter::new(10.0, 0.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);

// Create a time-based splitter (10 seconds, 2-second overlap)
let splitter = TimeBasedSplitter::new(10.0, 2.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
```

---

## 9. Processing Flow

### 9.1 Complete Flow

```
1. Register (register the video)
   └── Obtain UUID, video_duration, fps

2. Probe (probe the video)
   └── Obtain streams, format, fps

3. Generate Sentence Chunks
   └── Read the ASR output
       └── Create a chunk for each segment

4. Generate Cut Chunks
   └── Run scene detection
       └── Create a chunk for each scene

5. Generate TimeBased Chunks
   └── Use TimeBasedSplitter
       └── Create a chunk for each time window

6. Store in the database
   └── Batch write to PostgreSQL
```

### 9.2 Output Example

```
Video: 35 seconds, FPS: 24

Sentence Chunks (3):
  sentence_0000: 0.0s - 10.0s (240 frames)
  sentence_0001: 10.0s - 20.0s (240 frames)
  sentence_0002: 20.0s - 35.0s (360 frames)

Cut Chunks (3):
  cut_0000: 0.0s - 15.0s (360 frames)
  cut_0001: 15.0s - 28.0s (312 frames)
  cut_0002: 28.0s - 35.0s (168 frames)

TimeBased Chunks (5, 2-second overlap):
  time_based_0000: 0.0s - 10.0s (240 frames)
  time_based_0001: 8.0s - 18.0s (240 frames)
  time_based_0002: 16.0s - 26.0s (240 frames)
  time_based_0003: 24.0s - 34.0s (240 frames)
  time_based_0004: 32.0s - 35.0s (72 frames)
```
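The frame count of a chunk is the difference between its end frame and start frame, each computed with the formula from section 2.2. A minimal sketch, with `frame_count` as an illustrative name:

```rust
/// Number of frames spanned by a chunk: end_frame - start_frame,
/// where each frame number is floor(time * fps).
pub fn frame_count(start_time: f64, end_time: f64, fps: f64) -> i64 {
    let start_frame = (start_time * fps).floor() as i64;
    let end_frame = (end_time * fps).floor() as i64;
    end_frame - start_frame
}
```

At 24 fps, a 10-second chunk therefore spans 240 frames regardless of where it starts, and a 13-second chunk (15.0s to 28.0s) spans 312 frames.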

---

## 10. Database Storage

### 10.1 PostgreSQL Storage

#### Table Schema

```sql
CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    start_frame BIGINT NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    end_frame BIGINT NOT NULL,
    fps VARCHAR(16) NOT NULL,
    fps_value DOUBLE PRECISION NOT NULL,
    content JSONB NOT NULL,
    metadata JSONB,
    vector_id VARCHAR(64),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(uuid, chunk_id)
);

-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
CREATE INDEX idx_chunks_vector_id ON chunks(vector_id);
```

#### Storage Example

```rust
pub async fn store_chunk_to_postgres(db: &PostgresDb, chunk: &Chunk) -> Result<()> {
    sqlx::query!(
        r#"
        INSERT INTO chunks (
            uuid, chunk_id, chunk_index, chunk_type,
            start_time, start_frame, end_time, end_frame,
            fps, fps_value, content, metadata, vector_id
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
        ON CONFLICT (uuid, chunk_id) DO UPDATE SET
            content = EXCLUDED.content,
            metadata = EXCLUDED.metadata,
            vector_id = EXCLUDED.vector_id,
            updated_at = NOW()
        "#,
        chunk.uuid,
        chunk.chunk_id,
        chunk.chunk_index as i32,
        chunk.chunk_type.as_str(),
        chunk.start_time,
        chunk.start_frame,
        chunk.end_time,
        chunk.end_frame,
        chunk.fps,
        chunk.fps_value,
        serde_json::to_value(&chunk.content)?,
        serde_json::to_value(&chunk.metadata)?,
        chunk.vector_id,
    )
    .execute(&db.pool)
    .await?;
    Ok(())
}
```

---

### 10.2 MongoDB Storage

#### Collection Schema

```javascript
// chunks collection
{
  _id: ObjectId,
  uuid: "1636719dc31f78ac",
  chunk_id: "sentence_0001",
  chunk_index: 1,
  chunk_type: "sentence",
  start_time: 10.5,
  start_frame: 252,
  end_time: 15.75,
  end_frame: 378,
  fps: "24/1",
  fps_value: 24.0,
  content: {
    text: "Hello world, this is a test",
    text_normalized: "hello world this is a test",
    word_count: 6,
    char_count: 26
  },
  metadata: {
    source: "asr",
    confidence: 0.95,
    language: "en"
  },
  vector_id: "sentence_0001",
  created_at: ISODate("2026-03-16T10:00:00Z"),
  updated_at: ISODate("2026-03-16T10:00:00Z")
}

// Indexes
db.chunks.createIndex({ uuid: 1 })
db.chunks.createIndex({ chunk_type: 1 })
db.chunks.createIndex({ start_time: 1, end_time: 1 })
db.chunks.createIndex({ vector_id: 1 })
db.chunks.createIndex({ uuid: 1, chunk_type: 1 })
```

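#### Storage Example

```rust
use mongodb::bson::{self, doc, Document};
use mongodb::options::UpdateOptions;

// `MongoDb` is the project's own wrapper type; the call below uses the
// three-argument `update_one` signature of the mongodb driver 2.x.
pub async fn store_chunk_to_mongodb(db: &MongoDb, chunk: &Chunk) -> Result<()> {
    // Fields that are overwritten on every upsert.
    let fields = doc! {
        "uuid": chunk.uuid.as_str(),
        "chunk_id": chunk.chunk_id.as_str(),
        "chunk_index": chunk.chunk_index as i64,
        "chunk_type": chunk.chunk_type.as_str(),
        "start_time": chunk.start_time,
        "start_frame": chunk.start_frame,
        "end_time": chunk.end_time,
        "end_frame": chunk.end_frame,
        "fps": chunk.fps.as_str(),
        "fps_value": chunk.fps_value,
        // serde_json values are converted to BSON through serde.
        "content": bson::to_bson(&chunk.content)?,
        "metadata": bson::to_bson(&chunk.metadata)?,
        "vector_id": chunk.vector_id.clone(),
        "updated_at": bson::DateTime::now(),
    };

    let collection = db.database("momentry").collection::<Document>("chunks");
    collection.update_one(
        doc! { "uuid": chunk.uuid.as_str(), "chunk_id": chunk.chunk_id.as_str() },
        doc! {
            "$set": fields,
            // Written only when the document is first inserted.
            "$setOnInsert": { "created_at": bson::DateTime::now() },
        },
        UpdateOptions::builder().upsert(true).build(),
    ).await?;
    Ok(())
}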
```

---

## 11. Vector Storage Design

### 11.1 Design Principles

**A unified vector ID format** keeps Qdrant and PostgreSQL compatible:

```
{chunk_type}_{chunk_index:04}

Examples:
sentence_0001
cut_0002
time_based_0015
```
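Two practical points are worth keeping in mind with this format. Qdrant accepts only unsigned 64-bit integers or UUIDs as point IDs, and the string alone does not distinguish chunks that belong to different videos. One possible way to handle both, sketched here as an assumption rather than as the project's method, is to keep the string in the payload and derive a numeric point ID from the video UUID plus the vector ID with a stable hash such as 64-bit FNV-1a. The name `point_id` is hypothetical.

```rust
/// Derives a stable 64-bit point ID from the video UUID and the vector ID
/// using the FNV-1a hash. The output depends only on the input bytes, so the
/// same chunk always maps to the same ID across runs and Rust versions.
pub fn point_id(video_uuid: &str, vector_id: &str) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf29ce484222325;
    const FNV_PRIME: u64 = 0x100000001b3;

    let mut hash = FNV_OFFSET;
    // A '/' separator keeps ("ab", "c") distinct from ("a", "bc").
    let bytes = video_uuid
        .bytes()
        .chain(std::iter::once(b'/'))
        .chain(vector_id.bytes());
    for byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}
```

A 64-bit hash can collide in principle, so a system with very large numbers of chunks might prefer a name-based UUID instead.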

### 11.2 Qdrant Collection

#### Creating the Collection

```bash
# Create the collection through the Qdrant REST API
# (the API key is read from the QDRANT_API_KEY environment variable)
curl -X PUT http://localhost:6333/collections/chunks \
  -H "Content-Type: application/json" \
  -H "api-key: $QDRANT_API_KEY" \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    }
  }'
```

#### Point Structure

```json
{
  "id": "sentence_0001",
  "vector": [0.123, -0.456, ...],
  "payload": {
    "uuid": "1636719dc31f78ac",
    "chunk_id": "sentence_0001",
    "chunk_type": "sentence",
    "chunk_index": 1,
    "start_time": 10.5,
    "end_time": 15.75,
    "text": "Hello world, this is a test",
    "metadata": {
      "confidence": 0.95,
      "language": "en"
    }
  }
}
```

#### Rust Structure

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPoint {
    pub id: String,
    pub vector: Vec<f32>,
    pub payload: VectorPayload,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPayload {
    pub uuid: String,
    pub chunk_id: String,
    pub chunk_type: String,
    pub chunk_index: u32,
    pub start_time: f64,
    pub end_time: f64,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub scene_id: Option<i32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub segment_number: Option<i32>,
    pub metadata: Option<serde_json::Value>,
}
```

### 11.3 PostgreSQL Vector Storage

#### Table Schema

```sql
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunk_vectors (
    id BIGSERIAL PRIMARY KEY,
    vector_id VARCHAR(64) NOT NULL UNIQUE,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    chunk_index INTEGER NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    embedding vector(768) NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    FOREIGN KEY (uuid, chunk_id) REFERENCES chunks(uuid, chunk_id)
);

-- Vector search index (IVFFlat)
CREATE INDEX idx_chunk_vectors_embedding
ON chunk_vectors
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Lookup indexes
CREATE INDEX idx_chunk_vectors_uuid ON chunk_vectors(uuid);
CREATE INDEX idx_chunk_vectors_type ON chunk_vectors(chunk_type);
```

#### Storage Example

```rust
pub async fn store_vector_to_postgres(db: &PostgresDb, point: &VectorPoint) -> Result<()> {
    sqlx::query!(
        r#"
        INSERT INTO chunk_vectors (
            vector_id, uuid, chunk_id, chunk_type, chunk_index,
            start_time, end_time, embedding, metadata
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        ON CONFLICT (vector_id) DO UPDATE SET
            embedding = EXCLUDED.embedding,
            metadata = EXCLUDED.metadata
        "#,
        point.id,
        point.payload.uuid,
        point.payload.chunk_id,
        point.payload.chunk_type,
        point.payload.chunk_index as i32,
        point.payload.start_time,
        point.payload.end_time,
        point.vector,
        serde_json::to_value(&point.payload.metadata)?,
    )
    .execute(&db.pool)
    .await?;
    Ok(())
}
```

---

## 12. Query Examples

### 12.1 Semantic Search

#### Query Type 1: Similar-Text Search

```rust
// Search for chunks similar to the query sentence.
// `QdrantDb`, `embed_text`, `Filter`, `Condition`, and `SearchResult` are
// project-level wrappers that this document does not define.
pub async fn semantic_search(
    qdrant: &QdrantDb,
    query: &str,
    limit: usize,
) -> Result<Vec<SearchResult>> {
    // 1. Embed the query text
    let query_vector = embed_text(query).await?;

    // 2. Search Qdrant, restricted to sentence chunks
    let results = qdrant.search(
        "chunks",
        &query_vector,
        limit,
        Some(&Filter::must([
            Condition::Match("chunk_type", "sentence"),
        ])),
    ).await?;

    Ok(results)
}

// Usage example
let results = semantic_search(&qdrant, "find segments where someone is speaking", 10).await?;
for r in results {
    println!("{}: {:.3}", r.payload.chunk_id, r.score);
    println!("  Time: {}s - {}s", r.payload.start_time, r.payload.end_time);
    println!("  Text: {:?}", r.payload.text);
}
```

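#### Query Type 2: Combined Keyword and Vector Search

```sql
-- PostgreSQL: find sentence chunks containing specific text, ranked by vector distance.
-- `query_embedding` is assumed to be a user-defined function that returns the query vector.
-- `<=>` is the pgvector cosine distance operator, so smaller values mean closer matches.
SELECT
    c.chunk_id,
    c.chunk_type,
    c.start_time,
    c.end_time,
    c.content->>'text' AS text,
    v.embedding <=> query_embedding('find the driving scenes') AS distance
FROM chunks c
LEFT JOIN chunk_vectors v ON c.uuid = v.uuid AND c.chunk_id = v.chunk_id
WHERE c.chunk_type = 'sentence'
AND c.content->>'text' ILIKE '%car%'
ORDER BY distance
LIMIT 10;
```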

### 12.2 Time-Range Search

#### Query Type 3: Specific Time Range

```rust
// Find all chunks that overlap the given time range (for example, 30-60 seconds)
pub async fn search_by_time_range(
    db: &PostgresDb,
    uuid: &str,
    start: f64,
    end: f64,
) -> Result<Vec<Chunk>> {
    let chunks = sqlx::query_as!(
        Chunk,
        r#"
        SELECT * FROM chunks
        WHERE uuid = $1
        AND start_time < $3
        AND end_time > $2
        ORDER BY chunk_type, chunk_index
        "#,
        uuid, start, end
    )
    .fetch_all(&db.pool)
    .await?;
    Ok(chunks)
}

// Usage example
let chunks = search_by_time_range(&db, "1636719dc31f78ac", 30.0, 60.0).await?;
```

```javascript
// MongoDB: time-range query
db.chunks.find({
  uuid: "1636719dc31f78ac",
  start_time: { $lt: 60 },
  end_time: { $gt: 30 }
}).sort({ chunk_type: 1, chunk_index: 1 })
```

### 12.3 Hybrid Search

#### Query Type 4: Text Keywords + Vector Similarity

```rust
// Combine keyword matching with vector similarity
pub async fn hybrid_search(
    db: &PostgresDb,
    qdrant: &QdrantDb,
    query: &str,
    keywords: &[&str],
    limit: usize,
) -> Result<Vec<HybridResult>> {
    // 1. Vector search (over-fetch so enough candidates survive the keyword filter)
    let query_vector = embed_text(query).await?;
    let vector_results = qdrant.search("chunks", &query_vector, limit * 2, None).await?;

    // 2. Keyword filter: keep results whose text contains at least one keyword.
    //    Plain substrings are compared here; SQL-style `%keyword%` patterns
    //    would never match with `str::contains`.
    let filtered: Vec<_> = vector_results.into_iter()
        .filter(|r| {
            if let Some(text) = &r.payload.text {
                keywords.iter().any(|k| text.contains(*k))
            } else {
                false
            }
        })
        .take(limit)
        .collect();

    Ok(filtered)
}
```

### 12.4 Scene Search

#### Query Type 5: Finding a Specific Scene

```sql
-- PostgreSQL: find the chunks for a specific scene ID
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND (content->>'scene_id')::int = 5;

-- Find chunks that begin with a dissolve transition
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND content->>'transition_type' = 'dissolve';
```

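### 12.5 Video Summary

#### Query Type 6: Generating a Video Summary

```sql
-- Concatenate all sentences of a video into a full transcript
SELECT
    string_agg(content->>'text', ' ' ORDER BY start_time) as full_transcript
FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'sentence'
AND content->>'text' IS NOT NULL;

-- Aggregate text per scene. Cut chunks carry no text of their own, so each
-- sentence chunk is assigned to the scene in which it starts.
SELECT
    cut.content->>'scene_id' as scene,
    string_agg(s.content->>'text', ' ' ORDER BY s.start_time) as scene_text
FROM chunks cut
JOIN chunks s
  ON s.uuid = cut.uuid
 AND s.chunk_type = 'sentence'
 AND s.start_time >= cut.start_time
 AND s.start_time < cut.end_time
WHERE cut.uuid = '1636719dc31f78ac'
AND cut.chunk_type = 'cut'
GROUP BY cut.content->>'scene_id'
ORDER BY MIN(cut.start_time);
```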
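Grouping transcript text by scene can also be done in memory once the chunks are loaded. The sketch below assumes that a sentence belongs to the scene in which it starts and that sentences are already sorted by start time. The tuple-based signature and the name `scene_transcripts` are simplifications for illustration, not the project's API.

```rust
/// Builds one transcript per scene.
/// `scenes` holds (scene_id, start_time, end_time); `sentences` holds
/// (start_time, text) sorted by start time.
pub fn scene_transcripts(
    scenes: &[(u32, f64, f64)],
    sentences: &[(f64, &str)],
) -> Vec<(u32, String)> {
    scenes
        .iter()
        .map(|&(scene_id, start, end)| {
            // A sentence is assigned to the scene that contains its start time.
            let text = sentences
                .iter()
                .filter(|(s_start, _)| *s_start >= start && *s_start < end)
                .map(|(_, text)| *text)
                .collect::<Vec<_>>()
                .join(" ");
            (scene_id, text)
        })
        .collect()
}
```

A scene that contains no speech yields an empty string, which callers may want to skip when building a summary.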

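### 12.6 Common Query Patterns

| Query type | Description | Database | SQL/Code |
|------------|-------------|----------|----------|
| Semantic search | Find similar content | Qdrant | `search(vector, limit)` |
| Keyword search | Exact text matching | PostgreSQL | `ILIKE '%keyword%'` |
| Time range | A specific time span | PostgreSQL / MongoDB | `start_time < end AND end_time > start` |
| Scene search | A specific shot | PostgreSQL | `scene_id = N` |
| Hybrid search | Vector + keywords | PostgreSQL + Qdrant | Combines semantic and keyword search |
| Summary generation | Concatenate text | PostgreSQL | `string_agg()` |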

---

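## 13. Database Selection Recommendations

### 13.1 Storage Strategy

| Data type | Primary store | Backup/Query | Notes |
|-----------|---------------|--------------|-------|
| **Chunk metadata** | PostgreSQL | MongoDB | Mainly structured queries |
| **Vector data** | Qdrant | PostgreSQL | Mainly vector search |
| **Full-text search** | PostgreSQL | - | Keyword search |
| **Logs/History** | MongoDB | - | Flexibility first |

### 13.2 Read/Write Patterns

| Scenario | Write | Read |
|----------|-------|------|
| **Video processing** | PostgreSQL + Qdrant | - |
| **Semantic search** | - | Qdrant |
| **Timeline browsing** | - | PostgreSQL |
| **System analysis** | MongoDB | MongoDB |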

---

## 14. Related Documents

- [JSON_OUTPUT_SPEC.md](./JSON_OUTPUT_SPEC.md) - JSON output specification
- [RUST_DEVELOPMENT.md](./RUST_DEVELOPMENT.md) - Rust development guidelines
- [AGENTS.md](../AGENTS.md) - Development guidelines