Files
momentry_core/docs/CHUNK_SPEC.md
accusys 383201cacd feat: Initial v0.9 release with API Key authentication
## v0.9.20260325_144654

### Features
- API Key Authentication System
- Job Worker System
- V2 Backup Versioning

### Bug Fixes
- get_processor_results_by_job column mapping

Co-authored-by: OpenCode
2026-03-25 14:53:41 +08:00

1114 lines
28 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Video Chunk 切分規範
| 項目 | 內容 |
|------|------|
| 建立者 | Warren |
| 建立時間 | 2026-03-16 |
| 文件版本 | V1.0 |
---
## 版本歷史
| 版本 | 日期 | 目的 | 操作人 | 工具/模型 |
|------|------|------|--------|-----------|
| V1.0 | 2026-03-16 | 創建文件 | Warren | OpenCode / MiniMax M2.5 |
---
本文檔定義 Momentry Core 系統中影片 chunks 的切分原則與資料結構。
---
## 1. Chunk 概述
### 1.1 設計原則
1. **允許重疊**: 不同類型的 chunk 可以重疊(如語句 chunk 與時間 chunk
2. **Frame 精確度**: 時間坐標精確到影片 frame
3. **多元分類**: 支援語句、場景、時間三種分割方式
### 1.2 Chunk 類型
| 類型 | 說明 | 是否可重疊 |
|------|------|------------|
| **Sentence** | 語句分割 | ✅ 可與其他類型重疊 |
| **Cut** | 場景切割 | ✅ 可與其他類型重疊 |
| **TimeBased** | 時間長度切割 | ✅ 可與其他類型重疊 |
---
## 2. 時間坐標系統
### 2.1 時間格式
所有時間使用 **秒** 為單位,精確到 **微秒** (浮點數)
```json
{
"start_time": 10.5,
"end_time": 15.75
}
```
### 2.2 Frame 計算
```
frame_number = floor(time_in_seconds * fps)
time_at_frame = frame_number / fps
```
**範例**:
- 影片 FPS: 24/1 (24 fps)
- 時間: 10.5 秒
- Frame: floor(10.5 * 24) = 252
- 校驗: 252 / 24 = 10.5 秒 ✅
### 2.3 Frame 資訊結構
```json
{
"start_time": 10.5,
"start_frame": 252,
"end_time": 15.75,
"end_frame": 378,
"fps": "24/1",
"fps_value": 24.0
}
```
---
## 3. 三種切分方式
### 3.1 Sentence (語句分割)
**原則**:
- 根據 ASR 語音識別結果
- 每個識別的語句為一個 chunk
- 文字內容來自 ASR 輸出
**範例**:
```
ASR 輸出:
[
{"start": 10.0, "end": 15.0, "text": "Hello world"},
{"start": 15.0, "end": 20.0, "text": "This is a test"},
{"start": 20.0, "end": 25.5, "text": "Processing video"}
]
轉換為 Chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 10.0s - 15.0s "Hello world" │
├────────────────────────────────────────┤
│ chunk_0002: 15.0s - 20.0s "This is a test" │
├────────────────────────────────────────┤
│ chunk_0003: 20.0s - 25.5s "Processing video" │
└────────────────────────────────────────┘
```
### 3.2 Cut (場景切割)
**原則**:
- 根據影片鏡頭變化 (scene change / cut detection)
- 使用 ffmpeg 或 Python (scenedetect) 偵測
- 每個場景為一個 chunk
**偵測方法**:
```bash
# 使用 ffmpeg 偵測場景變化
ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null -
```
**範例**:
```
場景偵測結果:
[
{"start": 0.0, "end": 45.2, "scene_id": 1},
{"start": 45.2, "end": 120.5, "scene_id": 2},
{"start": 120.5, "end": 180.0, "scene_id": 3}
]
轉換為 Chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 45.2s (Scene 1) │
├────────────────────────────────────────┤
│ chunk_0002: 45.2s - 120.5s (Scene 2) │
├────────────────────────────────────────┤
│ chunk_0003: 120.5s - 180.0s (Scene 3) │
└────────────────────────────────────────┘
```
### 3.3 TimeBased (時間長度切割)
**原則**:
- 固定時間長度切割
- 預設 **10 秒** 為一個 chunk
- 最後一個 chunk 可能不足 10 秒
- **支援重疊** (可設定 overlap 秒數)
**參數配置**:
| 參數 | 預設值 | 說明 |
|------|--------|------|
| duration | 10.0 | 每個 chunk 時長 (秒) |
| overlap | 0.0 | 重疊時長 (秒) |
**範例** (無重疊):
```
影片時長: 35 秒, duration=10
Chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 10.0s │
├────────────────────────────────────────┤
│ chunk_0002: 10.0s - 20.0s │
├────────────────────────────────────────┤
│ chunk_0003: 20.0s - 30.0s │
├────────────────────────────────────────┤
│ chunk_0004: 30.0s - 35.0s (不足10秒) │
└────────────────────────────────────────┘
```
**範例** (有重疊, overlap=2):
```
影片時長: 35 秒, duration=10, overlap=2
Chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 10.0s │
├────────────────────────────────────────┤
│ chunk_0002: 8.0s - 18.0s (重疊 2秒) │
├────────────────────────────────────────┤
│ chunk_0003: 16.0s - 26.0s (重疊 2秒) │
├────────────────────────────────────────┤
│ chunk_0004: 24.0s - 34.0s (重疊 2秒) │
├────────────────────────────────────────┤
│ chunk_0005: 32.0s - 35.0s (重疊+不足) │
└────────────────────────────────────────┘
```
---
## 4. Chunk 資料結構
### 4.1 基本結構
```json
{
"uuid": "1636719dc31f78ac",
"chunk_id": "sentence_0001",
"chunk_index": 1,
"chunk_type": "sentence",
"start_time": 10.5,
"start_frame": 252,
"end_time": 15.75,
"end_frame": 378,
"fps": "24/1",
"fps_value": 24.0,
"content": {
"text": "Hello world, this is a test"
},
"metadata": {
"source": "asr",
"confidence": 0.95,
"language": "en"
}
}
```
### 4.2 欄位說明
| 欄位 | 類型 | 必填 | 說明 |
|------|------|------|------|
| `uuid` | String | ✅ | 影片 UUID (16 字元) |
| `chunk_id` | String | ✅ | Chunk 唯一 ID |
| `chunk_index` | Integer | ✅ | Chunk 索引 (從 0 開始) |
| `chunk_type` | String | ✅ | 類型: sentence/cut/time_based |
| `start_time` | Float | ✅ | 開始時間 (秒) |
| `start_frame` | Integer | ✅ | 開始 frame 編號 |
| `end_time` | Float | ✅ | 結束時間 (秒) |
| `end_frame` | Integer | ✅ | 結束 frame 編號 |
| `fps` | String | ✅ | FPS 表示 (如 "24/1") |
| `fps_value` | Float | ✅ | FPS 數值 (如 24.0) |
| `content` | Object | ✅ | 內容 (見下文) |
| `metadata` | Object | ❌ | 額外資訊 (見下文) |
### 4.3 Content 結構
根據 `chunk_type` 不同content 結構也不同:
#### Sentence Content
```json
{
"content": {
"text": "Hello world, this is a test message",
"text_normalized": "hello world this is a test message",
"word_count": 7,
"char_count": 34
}
}
```
| 欄位 | 類型 | 說明 |
|------|------|------|
| `text` | String | 原始識別文字 |
| `text_normalized` | String | 正規化文字 (小寫,去除標點) |
| `word_count` | Integer | 字詞數量 |
| `char_count` | Integer | 字元數量 |
#### Cut Content
```json
{
"content": {
"scene_id": 2,
"scene_number": 2,
"transition_type": "cut",
"scene_change_score": 0.95
}
}
```
| 欄位 | 類型 | 說明 |
|------|------|------|
| `scene_id` | Integer | 場景 ID |
| `scene_number` | Integer | 場景編號 |
| `transition_type` | String | 轉場類型: cut/dissolve/fade |
| `scene_change_score` | Float | 場景變化分數 (0-1) |
#### TimeBased Content
```json
{
"content": {
"duration": 10.0,
"is_last": false,
"segment_number": 3,
"total_segments": 10
}
}
```
| 欄位 | 類型 | 說明 |
|------|------|------|
| `duration` | Float | 時長 (秒) |
| `is_last` | Boolean | 是否最後一個 chunk |
| `segment_number` | Integer | 分段編號 |
| `total_segments` | Integer | 總分段數 |
### 4.4 Metadata 結構
```json
{
"metadata": {
"source": "asr",
"confidence": 0.95,
"language": "en",
"model": "tiny",
"created_at": "2026-03-16T10:00:00Z"
}
}
```
| 欄位 | 類型 | 說明 |
|------|------|------|
| `source` | String | 來源: asr/scene_detect/time_based |
| `confidence` | Float | 信心度 (0-1) |
| `language` | String | 語言代碼 |
| `model` | String | 使用模型 |
| `created_at` | String | 創建時間 (ISO 8601) |
---
## 5. Chunk ID 命名規範
### 5.1 格式
```
{chunk_type}_{chunk_index:04}
```
| 類型 | 前綴 | 範例 |
|------|------|------|
| Sentence | `sentence_` | `sentence_0001` |
| Cut | `cut_` | `cut_0001` |
| TimeBased | `time_based_` | `time_based_0001` |
### 5.2 編號規則
-**0** 開始
- 使用 **4 位數** 補零
- 按時間順序遞增
---
## 6. 資料庫 Schema
### 6.1 PostgreSQL Table
```sql
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_index INTEGER NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
start_frame BIGINT NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
end_frame BIGINT NOT NULL,
fps VARCHAR(16) NOT NULL,
fps_value DOUBLE PRECISION NOT NULL,
content JSONB NOT NULL,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(uuid, chunk_id)
);
-- 索引
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
```
### 6.2 查詢範例
```sql
-- 查詢影片所有 chunks
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac';
-- 查詢特定類型的 chunks
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac' AND chunk_type = 'sentence';
-- 查詢時間範圍內的 chunks
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0;
-- 查詢時間範圍內的所有 chunks (混合類型)
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0
ORDER BY chunk_type, chunk_index;
```
---
## 7. Rust 資料結構
### 7.1 Chunk 定義
```rust
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum ChunkType {
Sentence,
Cut,
TimeBased,
}
impl ChunkType {
pub fn as_str(&self) -> &'static str {
match self {
ChunkType::Sentence => "sentence",
ChunkType::Cut => "cut",
ChunkType::TimeBased => "time_based",
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
pub uuid: String,
pub chunk_id: String,
pub chunk_index: u32,
pub chunk_type: ChunkType,
pub start_time: f64,
pub start_frame: i64,
pub end_time: f64,
pub end_frame: i64,
pub fps: String,
pub fps_value: f64,
pub content: serde_json::Value,
pub metadata: Option<serde_json::Value>,
}
```
### 7.2 建立 Chunk
```rust
impl Chunk {
pub fn new(
uuid: String,
chunk_index: u32,
chunk_type: ChunkType,
start_time: f64,
end_time: f64,
fps: &str,
content: serde_json::Value,
) -> Self {
let fps_value = parse_fps(fps);
let start_frame = (start_time * fps_value) as i64;
let end_frame = (end_time * fps_value) as i64;
let chunk_id = format!("{}_{:04}", chunk_type.as_str(), chunk_index);
Self {
uuid,
chunk_id,
chunk_index,
chunk_type,
start_time,
start_frame,
end_time,
end_frame,
fps: fps.to_string(),
fps_value,
content,
metadata: None,
}
}
}
```
---
## 8. 時間切割器實作
### 8.1 TimeBasedSplitter
```rust
pub struct TimeBasedSplitter {
pub duration: f64, // 每個 chunk 時長 (秒)
pub overlap: f64, // 重疊時長 (秒)
}
impl TimeBasedSplitter {
pub fn new(duration: f64, overlap: f64) -> Self {
Self { duration, overlap }
}
pub fn split(&self, uuid: &str, video_duration: f64, fps: f64) -> Vec<Chunk> {
let mut chunks = Vec::new();
let step = self.duration - self.overlap;
let mut current_time = 0.0;
let mut index = 0;
while current_time < video_duration {
let end_time = (current_time + self.duration).min(video_duration);
let chunk = Chunk::new(
uuid.to_string(),
index,
ChunkType::TimeBased,
current_time,
end_time,
&format!("{:.0}/1", fps as u32),
serde_json::json!({
"duration": end_time - current_time,
"is_last": end_time >= video_duration,
"segment_number": index + 1,
}),
);
chunks.push(chunk);
current_time += step;
index += 1;
}
chunks
}
}
```
### 8.2 使用範例
```rust
// 建立時間切割器 (10秒, 無重疊)
let splitter = TimeBasedSplitter::new(10.0, 0.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
// 建立時間切割器 (10秒, 2秒重疊)
let splitter = TimeBasedSplitter::new(10.0, 2.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
```
---
## 9. 處理流程
### 9.1 完整流程
```
1. Register (註冊影片)
└── 取得 UUID, video_duration, fps
2. Probe (探測影片)
└── 取得 streams, format, fps
3. 產生 Sentence Chunks
└── 讀取 ASR 輸出
└── 為每個 segment 建立 chunk
4. 產生 Cut Chunks
└── 執行場景偵測
└── 為每個 scene 建立 chunk
5. 產生 TimeBased Chunks
└── 使用 TimeBasedSplitter
└── 為每個時間段建立 chunk
6. 儲存至資料庫
└── 批次寫入 PostgreSQL
```
### 9.2 輸出範例
```
影片: 35 秒, FPS: 24
Sentence Chunks (3 個):
sentence_0000: 0.0s - 10.0s (252 frames)
sentence_0001: 10.0s - 20.0s (480 frames)
sentence_0002: 20.0s - 35.0s (840 frames)
Cut Chunks (3 個):
cut_0000: 0.0s - 15.0s (360 frames)
cut_0001: 15.0s - 28.0s (672 frames)
cut_0002: 28.0s - 35.0s (168 frames)
TimeBased Chunks (4 個, 重疊 2秒):
time_based_0000: 0.0s - 10.0s (240 frames)
time_based_0001: 8.0s - 18.0s (240 frames)
time_based_0002: 16.0s - 26.0s (240 frames)
time_based_0003: 24.0s - 35.0s (264 frames)
```
---
## 10. 資料庫儲存
### 10.1 PostgreSQL 儲存
#### Table Schema
```sql
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_index INTEGER NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
start_frame BIGINT NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
end_frame BIGINT NOT NULL,
fps VARCHAR(16) NOT NULL,
fps_value DOUBLE PRECISION NOT NULL,
content JSONB NOT NULL,
metadata JSONB,
vector_id VARCHAR(64),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(uuid, chunk_id)
);
-- 索引
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
CREATE INDEX idx_chunks_vector_id ON chunks(vector_id);
```
#### 儲存範例
```rust
pub async fn store_chunk_to_postgres(db: &PostgresDb, chunk: &Chunk) -> Result<()> {
sqlx::query!(
r#"
INSERT INTO chunks (
uuid, chunk_id, chunk_index, chunk_type,
start_time, start_frame, end_time, end_frame,
fps, fps_value, content, metadata, vector_id
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (uuid, chunk_id) DO UPDATE SET
content = EXCLUDED.content,
metadata = EXCLUDED.metadata,
vector_id = EXCLUDED.vector_id,
updated_at = NOW()
"#,
chunk.uuid,
chunk.chunk_id,
chunk.chunk_index as i32,
chunk.chunk_type.as_str(),
chunk.start_time,
chunk.start_frame,
chunk.end_time,
chunk.end_frame,
chunk.fps,
chunk.fps_value,
serde_json::to_value(&chunk.content)?,
serde_json::to_value(&chunk.metadata)?,
chunk.vector_id,
)
.execute(&db.pool)
.await?;
Ok(())
}
```
---
### 10.2 MongoDB 儲存
#### Collection Schema
```javascript
// chunks collection
{
_id: ObjectId,
uuid: "1636719dc31f78ac",
chunk_id: "sentence_0001",
chunk_index: 1,
chunk_type: "sentence",
start_time: 10.5,
start_frame: 252,
end_time: 15.75,
end_frame: 378,
fps: "24/1",
fps_value: 24.0,
content: {
text: "Hello world, this is a test",
text_normalized: "hello world this is a test",
word_count: 7,
char_count: 34
},
metadata: {
source: "asr",
confidence: 0.95,
language: "en"
},
vector_id: "vec_sentence_0001",
created_at: ISODate("2026-03-16T10:00:00Z"),
updated_at: ISODate("2026-03-16T10:00:00Z")
}
// 索引
db.chunks.createIndex({ uuid: 1 })
db.chunks.createIndex({ chunk_type: 1 })
db.chunks.createIndex({ start_time: 1, end_time: 1 })
db.chunks.createIndex({ vector_id: 1 })
db.chunks.createIndex({ uuid: 1, chunk_type: 1 })
```
#### 儲存範例
```rust
pub async fn store_chunk_to_mongodb(db: &MongoDb, chunk: &Chunk) -> Result<()> {
let doc = bson::doc! {
"uuid": chunk.uuid,
"chunk_id": chunk.chunk_id,
"chunk_index": chunk.chunk_index,
"chunk_type": chunk.chunk_type.as_str(),
"start_time": chunk.start_time,
"start_frame": chunk.start_frame,
"end_time": chunk.end_time,
"end_frame": chunk.end_frame,
"fps": chunk.fps,
"fps_value": chunk.fps_value,
"content": serde_json::to_value(&chunk.content)?,
"metadata": serde_json::to_value(&chunk.metadata)?,
"vector_id": chunk.vector_id,
"created_at": chrono::Utc::now(),
"updated_at": chrono::Utc::now()
};
let collection = db.database("momentry").collection("chunks");
collection.update_one(
doc! { "uuid": &chunk.uuid, "chunk_id": &chunk.chunk_id },
doc! { "$set": doc },
UpdateOptions::builder().upsert(true).build(),
).await?;
Ok(())
}
```
---
## 11. 向量儲存設計
### 11.1 設計原則
**統一向量 ID 格式**,確保 Qdrant 與 PostgreSQL 相容:
```
{chunk_type}_{chunk_index:04}
範例:
sentence_0001
cut_0002
time_based_0015
```
### 11.2 Qdrant Collection
#### 建立 Collection
```bash
# 使用 Qdrant client 建立 collection
curl -X PUT http://localhost:6333/collections/chunks \
-H "Content-Type: application/json" \
-H "api-key: Test3200Test3200Test3200" \
-d '{
"vectors": {
"size": 768,
"distance": "Cosine"
}
}'
```
#### Point 結構
```json
{
"id": "sentence_0001",
"vector": [0.123, -0.456, ...],
"payload": {
"uuid": "1636719dc31f78ac",
"chunk_id": "sentence_0001",
"chunk_type": "sentence",
"chunk_index": 1,
"start_time": 10.5,
"end_time": 15.75,
"text": "Hello world, this is a test",
"metadata": {
"confidence": 0.95,
"language": "en"
}
}
}
```
#### Rust 結構
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPoint {
pub id: String,
pub vector: Vec<f32>,
pub payload: VectorPayload,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPayload {
pub uuid: String,
pub chunk_id: String,
pub chunk_type: String,
pub chunk_index: u32,
pub start_time: f64,
pub end_time: f64,
#[serde(skip_serializing_if = "Option::is_none")]
pub text: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub scene_id: Option<i32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub segment_number: Option<i32>,
pub metadata: Option<serde_json::Value>,
}
```
### 11.3 PostgreSQL Vector 儲存
#### Table Schema
```sql
-- 使用 pgvector 擴展
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunk_vectors (
id BIGSERIAL PRIMARY KEY,
vector_id VARCHAR(64) NOT NULL UNIQUE,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
chunk_index INTEGER NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
embedding vector(768) NOT NULL,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
FOREIGN KEY (uuid, chunk_id) REFERENCES chunks(uuid, chunk_id)
);
-- 向量檢索索引 (IVFFlat)
CREATE INDEX idx_chunk_vectors_embedding
ON chunk_vectors
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- 查詢索引
CREATE INDEX idx_chunk_vectors_uuid ON chunk_vectors(uuid);
CREATE INDEX idx_chunk_vectors_type ON chunk_vectors(chunk_type);
```
#### 儲存範例
```rust
pub async fn store_vector_to_postgres(db: &PostgresDb, point: &VectorPoint) -> Result<()> {
sqlx::query!(
r#"
INSERT INTO chunk_vectors (
vector_id, uuid, chunk_id, chunk_type, chunk_index,
start_time, end_time, embedding, metadata
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (vector_id) DO UPDATE SET
embedding = EXCLUDED.embedding,
metadata = EXCLUDED.metadata
"#,
point.id,
point.payload.uuid,
point.payload.chunk_id,
point.payload.chunk_type,
point.payload.chunk_index as i32,
point.payload.start_time,
point.payload.end_time,
point.vector,
serde_json::to_value(&point.payload.metadata)?,
)
.execute(&db.pool)
.await?;
Ok(())
}
```
---
## 12. 查詢範例
### 12.1 語義搜尋 (Semantic Search)
#### 查詢類型 1: 相似文字搜尋
```rust
// 搜尋與問句相似的 chunks
pub async fn semantic_search(
qdrant: &QdrantDb,
query: &str,
limit: usize,
) -> Result<Vec<SearchResult>> {
// 1. 將問句向量化
let query_vector = embed_text(query).await?;
// 2. 搜尋 Qdrant
let results = qdrant.search(
"chunks",
&query_vector,
limit,
Some(&Filter::must([
Condition::Match("chunk_type", "sentence"),
])),
).await?;
Ok(results)
}
// 使用範例
let results = semantic_search(&qdrant, "找出有人在說話的片段", 10).await?;
for r in results {
println!("{}: {:.3}", r.payload.chunk_id, r.score);
println!(" Time: {}s - {}s", r.payload.start_time, r.payload.end_time);
println!(" Text: {:?}", r.payload.text);
}
```
#### 查詢類型 2: 語音/文字混合搜尋
```sql
-- PostgreSQL: 搜尋特定文字的 chunks
SELECT
c.chunk_id,
c.chunk_type,
c.start_time,
c.end_time,
c.content->>'text' as text,
v.embedding <=> query_embedding('找出開車的場景') as similarity
FROM chunks c
LEFT JOIN chunk_vectors v ON c.chunk_id = v.chunk_id
WHERE c.chunk_type = 'sentence'
AND c.content->>'text' ILIKE '%car%'
ORDER BY v.embedding <=> query_embedding('找出開車的場景')
LIMIT 10;
```
### 12.2 時間範圍搜尋
#### 查詢類型 3: 特定時間範圍
```rust
// 找出 30-60 秒之間的所有 chunks
pub async fn search_by_time_range(
db: &PostgresDb,
uuid: &str,
start: f64,
end: f64,
) -> Result<Vec<Chunk>> {
let chunks = sqlx::query_as!(
Chunk,
r#"
SELECT * FROM chunks
WHERE uuid = $1
AND start_time < $3
AND end_time > $2
ORDER BY chunk_type, chunk_index
"#,
uuid, start, end
)
.fetch_all(&db.pool)
.await?;
Ok(chunks)
}
// 使用範例
let chunks = search_by_time_range(&db, "1636719dc31f78ac", 30.0, 60.0).await?;
```
```javascript
// MongoDB: 時間範圍查詢
db.chunks.find({
uuid: "1636719dc31f78ac",
start_time: { $lt: 60 },
end_time: { $gt: 30 }
}).sort({ chunk_type: 1, chunk_index: 1 })
```
### 12.3 混合搜尋 (Hybrid Search)
#### 查詢類型 4: 文字關鍵詞 + 向量相似度
```rust
// 結合關鍵詞匹配與向量相似度
pub async fn hybrid_search(
db: &PostgresDb,
qdrant: &QdrantDb,
query: &str,
keywords: &[&str],
limit: usize,
) -> Result<Vec<HybridResult>> {
// 1. 向量搜尋
let query_vector = embed_text(query).await?;
let vector_results = qdrant.search("chunks", &query_vector, limit * 2, None).await?;
// 2. 關鍵詞過濾
let keyword_filter: Vec<_> = keywords.iter()
.map(|k| format!("%{}%", k))
.collect();
let filtered: Vec<_> = vector_results.into_iter()
.filter(|r| {
if let Some(text) = &r.payload.text {
keyword_filter.iter().any(|k| text.contains(k.as_str()))
} else {
false
}
})
.take(limit)
.collect();
Ok(filtered)
}
```
### 12.4 場景搜尋
#### 查詢類型 5: 找出特定場景
```sql
-- PostgreSQL: 找出特定場景 ID 的 chunks
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND (content->>'scene_id')::int = 5;
-- 找出包含轉場效果的 chunks
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND content->>'transition_type' = 'dissolve';
```
### 12.5 影片摘要
#### 查詢類型 6: 產生影片摘要
```sql
-- 合併影片所有語句
SELECT
string_agg(content->>'text', ' ' ORDER BY start_time) as full_transcript
FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'sentence'
AND content->>'text' IS NOT NULL;
-- 按場景聚合文字
SELECT
content->>'scene_id' as scene,
string_agg(content->>'text', ' ' ORDER BY start_time) as scene_text
FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
GROUP BY content->>'scene_id'
ORDER BY MIN(start_time);
```
### 12.6 常見查詢模式
| 查詢類型 | 描述 | 資料庫 | SQL/程式碼 |
|----------|------|--------|-------------|
| 語義搜尋 | 找相似內容 | Qdrant | `search(vector, limit)` |
| 關鍵詞搜尋 | 精確文字匹配 | PostgreSQL | `ILIKE '%keyword%'` |
| 時間範圍 | 特定時段 | Both | `start_time < end AND end_time > start` |
| 場景搜尋 | 特定鏡頭 | PostgreSQL | `scene_id = N` |
| 混合搜尋 | 向量+關鍵詞 | Both |結合以上兩種 |
| 摘要產生 | 合併文字 | PostgreSQL | `string_agg()` |
---
## 13. 資料庫選擇建議
### 13.1 儲存策略
| 資料類型 | 主要儲存 | 備份/查詢 | 說明 |
|----------|----------|-----------|------|
| **Chunk 元數據** | PostgreSQL | MongoDB | 結構化查詢為主 |
| **向量資料** | Qdrant | PostgreSQL | 向量搜尋為主 |
| **全文檢索** | PostgreSQL | - | 關鍵詞搜尋 |
| **日誌/歷史** | MongoDB | - | 靈活性為主 |
### 13.2 讀寫模式
| 場景 | 寫入 | 讀取 |
|------|------|------|
| **影片處理** | PostgreSQL + Qdrant | - |
| **語義搜尋** | - | Qdrant |
| **時間軸瀏覽** | - | PostgreSQL |
| **系統分析** | MongoDB | MongoDB |
---
## 14. 相關文件
- [JSON_OUTPUT_SPEC.md](./JSON_OUTPUT_SPEC.md) - JSON 輸出規範
- [RUST_DEVELOPMENT.md](./RUST_DEVELOPMENT.md) - Rust 開發規範
- [AGENTS.md](../AGENTS.md) - 開發規範