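# Video Parsing Behavior Specification

| Item | Value |
|------|-------|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |

---

## Version History

| Version | Date | Purpose | Author | Tool / Model |
|---------|------|---------|--------|--------------|
| V1.0 | 2026-03-16 | Document created | Warren | OpenCode / MiniMax M2.5 |

---

This document defines the complete behavior of video parsing in the Momentry Core system. It covers triggering, status reporting, output, checkpointing and resumption, multilingual handling, and the labels used for each kind of recognition result.

---

## 1. Video File Trigger Specification

### 1.1 Supported Video Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| MP4 | .mp4 | Most widely supported |
| MOV | .mov | QuickTime format |
| AVI | .avi | Legacy format |
| MKV | .mkv | Matroska format |
| WebM | .webm | Web format |
| WMV | .wmv | Windows Media |
| FLV | .flv | Flash format |

### 1.2 Trigger Methods

#### 1.2.1 Command-Line Registration

```bash
cargo run -- register /path/to/video.mp4
```

#### 1.2.2 Automatic Trigger from Watched Directories

```yaml
# monitor_config.yaml
watch:
  directories:
    - /path/to/watch
  recursive: true
  extensions: [".mp4", ".mov", ".avi", ".mkv", ".webm"]
```

#### 1.2.3 API Trigger

```bash
# POST /api/v1/register
curl -X POST http://localhost:3002/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{"path": "/path/to/video.mp4", "auto_process": true}'
```

### 1.3 Pre-Trigger Validation

```rust
pub fn validate_video_file(path: &str) -> Result<VideoValidation> {
    // 1. Check that the file exists
    // 2. Check that the extension is a supported format (section 1.1)
    // 3. Check that the file size is > 0
    // 4. Check that the file is a valid video container (magic number)

    // Returned structure. Codec and stream flags come from probing;
    // the literal values below are illustrative.
    Ok(VideoValidation {
        path: path.to_string(),
        valid: true,
        codec: "h264".to_string(),
        has_video: true,
        has_audio: true,
    })
}
```

### 1.4 Video UUID Generation

The UUID is the first 16 hexadecimal characters of the MD5 digest of the file path:

```
UUID = MD5(file path)[0:16]

Example: "/media/videos/clip.mp4" → "3a2f1b9c4d5e6f0a"
```

The UUID is derived from the path, not from the file contents. Moving or renaming a file therefore produces a different UUID.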
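Steps 2 and 4 of `validate_video_file` need only the standard library. The sketch below is one possible approach, not the project's code. The names `has_supported_extension`, `sniff_container`, and `Container` are hypothetical. The byte signatures cover common container headers: the ISO BMFF `ftyp` box at offset 4 (MP4/MOV), the EBML magic `1A 45 DF A3` (MKV/WebM), `RIFF`…`AVI ` (AVI), the ASF header GUID prefix (WMV), and `FLV`.

```rust
/// Supported extensions from section 1.1 (compared case-insensitively).
const SUPPORTED_EXTENSIONS: [&str; 7] = ["mp4", "mov", "avi", "mkv", "webm", "wmv", "flv"];

/// Returns true if the path ends in one of the supported extensions.
pub fn has_supported_extension(path: &str) -> bool {
    std::path::Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| SUPPORTED_EXTENSIONS.contains(&e.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}

/// Container families recognizable from the first bytes of a file.
#[derive(Debug, PartialEq)]
pub enum Container {
    IsoBmff,  // MP4 / MOV: "ftyp" box at offset 4
    Matroska, // MKV / WebM: EBML header 1A 45 DF A3
    Avi,      // "RIFF" <size> "AVI "
    Asf,      // WMV: ASF header GUID prefix
    Flv,      // "FLV"
}

/// Identifies the container from the file header (the first 12 bytes suffice).
pub fn sniff_container(header: &[u8]) -> Option<Container> {
    if header.len() >= 8 && &header[4..8] == b"ftyp" {
        Some(Container::IsoBmff)
    } else if header.starts_with(&[0x1A, 0x45, 0xDF, 0xA3]) {
        Some(Container::Matroska)
    } else if header.len() >= 12 && &header[0..4] == b"RIFF" && &header[8..12] == b"AVI " {
        Some(Container::Avi)
    } else if header.starts_with(&[0x30, 0x26, 0xB2, 0x75, 0x8E, 0x66, 0xCF, 0x11]) {
        Some(Container::Asf)
    } else if header.starts_with(b"FLV") {
        Some(Container::Flv)
    } else {
        None
    }
}
```

Some older QuickTime files begin with a `moov`, `mdat`, or `wide` atom instead of `ftyp`. A production check may need to accept those as well.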
---

## 2. Video Processing Status Display Specification

### 2.1 Processing Status Definitions

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ProcessStatus {
    Pending,        // Waiting to be processed
    Registered,     // Registered
    Probing,        // Probing the video
    AsrProcessing,  // Speech recognition (ASR) in progress
    AsrxProcessing, // Speaker diarization in progress
    OcrProcessing,  // OCR in progress
    YoloProcessing, // YOLO object detection in progress
    FaceProcessing, // Face detection in progress
    PoseProcessing, // Pose estimation in progress
    Chunking,       // Chunking in progress
    Completed,      // Completed
    Failed,         // Failed
    Paused,         // Paused
    Resuming,       // Resuming
}
```

### 2.2 Status Output Format

#### 2.2.1 Standard Output (stdout)

```
[REGISTER] Video registered: 1636719dc31f78ac
[PROBE] Starting probe for video.mp4
[PROBE] Duration: 120.5s, FPS: 24/1, Resolution: 1920x1080
[ASR_START] Loading Whisper model...
[ASR_LANGUAGE:en] Language detected: English (99.45%)
[ASR_PROGRESS:50] Processed 50 segments...
[ASR_PROGRESS:100] Processed 100 segments...
[ASR_COMPLETE:150] Completed! Total: 150 segments
[ASRX_START] Loading pyannote model...
[ASRX_PROGRESS] Speaker diarization: 3 speakers identified
[ASRX_COMPLETE] Speaker diarization complete
[OCR_START] Starting OCR processing...
[OCR_PROGRESS:30/60] Frame 30/60 processed
[OCR_COMPLETE] OCR complete: 25 text regions found
[YOLO_START] Starting YOLO processing...
[YOLO_PROGRESS:60/120] Frame 60/120 processed
[YOLO_COMPLETE] YOLO complete: 189 objects detected
[FACE_START] Starting face detection...
[FACE_PROGRESS:60/120] Frame 60/120 processed
[FACE_COMPLETE] Face detection complete: 5 unique faces
[POSE_START] Starting pose estimation...
[POSE_PROGRESS:60/120] Frame 60/120 processed
[POSE_COMPLETE] Pose estimation complete: 12 persons detected
[CHUNK_START] Creating chunks...
[CHUNK_COMPLETE] 450 chunks created
[COMPLETE] Video processing complete!
```

#### 2.2.2 Status Message Prefixes

| Stage | Prefix | Examples |
|-------|--------|----------|
| Register | `[REGISTER]` | `[REGISTER] Video registered: 1636719dc31f78ac` |
| Probe | `[PROBE]` | `[PROBE] Duration: 120.5s` |
| ASR | `[ASR_*]` | `[ASR_START]`, `[ASR_PROGRESS:50]` |
| ASRx | `[ASRX_*]` | `[ASRX_START]`, `[ASRX_COMPLETE]` |
| OCR | `[OCR_*]` | `[OCR_START]`, `[OCR_PROGRESS:30/60]` |
| YOLO | `[YOLO_*]` | `[YOLO_START]`, `[YOLO_COMPLETE]` |
| Face | `[FACE_*]` | `[FACE_START]`, `[FACE_PROGRESS:60/120]` |
| Pose | `[POSE_*]` | `[POSE_START]`, `[POSE_COMPLETE]` |
| Chunk | `[CHUNK_*]` | `[CHUNK_START]`, `[CHUNK_COMPLETE]` |
| Complete | `[COMPLETE]` | `[COMPLETE] Video processing complete!` |
| Error | `[ERROR]` | `[ERROR] ASR processing failed` |
| Warning | `[WARN]` | `[WARN] No audio track detected` |

In `[ASR_PROGRESS:N]`, the number is a count of processed segments. The OCR, YOLO, Face, and Pose progress prefixes report frames as `processed/total`.

### 2.3 Real-Time Progress Reporting

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProcessProgress {
    pub uuid: String,
    pub status: ProcessStatus,
    pub current_processor: String,
    pub total_frames: i64,
    pub processed_frames: i64,
    pub progress_percentage: f64,
    pub elapsed_seconds: f64,
    pub estimated_remaining_seconds: f64,
    // Summary of the latest checkpoint, shaped like the example below
    pub last_checkpoint: Option<serde_json::Value>,
}
```

#### Example Output

```json
{
  "uuid": "1636719dc31f78ac",
  "status": "asr_processing",
  "current_processor": "asr",
  "total_frames": 3000,
  "processed_frames": 1500,
  "progress_percentage": 50.0,
  "elapsed_seconds": 120.5,
  "estimated_remaining_seconds": 120.5,
  "last_checkpoint": {
    "timestamp": 60.0,
    "segments_processed": 50,
    "output_file": "1636719dc31f78ac.asr.partial.json"
  }
}
```
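The example in 2.3 reports 50% done and 120.5 s remaining after 1500 of 3000 frames in 120.5 s. That matches a linear estimate: remaining = elapsed × (total − processed) / processed. Below is a minimal sketch that assumes constant throughput. `estimate_progress` and `progress_tag` are hypothetical helper names.

```rust
/// Returns (progress percentage, estimated remaining seconds) for a running
/// processor. Assumes throughput stays constant (a simplification).
pub fn estimate_progress(processed: i64, total: i64, elapsed_seconds: f64) -> (f64, Option<f64>) {
    if total <= 0 {
        return (0.0, None);
    }
    let processed = processed.clamp(0, total);
    let percentage = processed as f64 / total as f64 * 100.0;
    if processed == 0 {
        // No throughput measured yet, so no estimate is possible.
        return (percentage, None);
    }
    let remaining = elapsed_seconds * (total - processed) as f64 / processed as f64;
    (percentage, Some(remaining))
}

/// Formats a progress prefix such as `[ASR_PROGRESS:50]` (count only)
/// or `[OCR_PROGRESS:30/60]` (processed/total).
pub fn progress_tag(stage: &str, done: i64, total: Option<i64>) -> String {
    match total {
        Some(t) => format!("[{}_PROGRESS:{}/{}]", stage.to_uppercase(), done, t),
        None => format!("[{}_PROGRESS:{}]", stage.to_uppercase(), done),
    }
}
```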
---

## 3. Video Processing Output Specification

### 3.1 Output File Naming

```
{UUID}.{processing type}.json

Examples:
1636719dc31f78ac.probe.json    # Probe result
1636719dc31f78ac.asr.json      # ASR result
1636719dc31f78ac.asrx.json     # Speaker diarization result
1636719dc31f78ac.ocr.json      # OCR result
1636719dc31f78ac.yolo.json     # YOLO result
1636719dc31f78ac.face.json     # Face detection result
1636719dc31f78ac.pose.json     # Pose estimation result
1636719dc31f78ac.chunks.json   # Chunking result
```

### 3.2 Output Directory Structure

```
momentry_core/
├── output/
│   ├── {uuid}/
│   │   ├── {uuid}.probe.json
│   │   ├── {uuid}.asr.json
│   │   ├── {uuid}.asrx.json
│   │   ├── {uuid}.ocr.json
│   │   ├── {uuid}.yolo.json
│   │   ├── {uuid}.face.json
│   │   ├── {uuid}.pose.json
│   │   └── thumbnails/
│   │       ├── thumb_000.jpg
│   │       ├── thumb_001.jpg
│   │       └── ...
│   └── checkpoints/
│       └── {uuid}/
│           ├── {uuid}.asr.partial.001.json
│           ├── {uuid}.asr.partial.002.json
│           └── ...
```

### 3.3 Complete Processing Result JSON Structure

```json
{
  "uuid": "1636719dc31f78ac",
  "video_path": "/path/to/video.mp4",
  "video_info": {
    "duration": 120.5,
    "fps": "24/1",
    "fps_value": 24.0,
    "width": 1920,
    "height": 1080,
    "has_video": true,
    "has_audio": true,
    "has_music": false,
    "audio_codec": "aac",
    "video_codec": "h264"
  },
  "processing": {
    "status": "completed",
    "started_at": "2026-03-16T10:00:00Z",
    "completed_at": "2026-03-16T10:05:00Z",
    "elapsed_seconds": 300.0,
    "processors": {
      "asr": {
        "status": "completed",
        "language": "en",
        "language_probability": 0.9945,
        "segments_count": 150,
        "duration_seconds": 120.0
      },
      "asrx": {
        "status": "completed",
        "speakers_count": 3,
        "segments_count": 150,
        "duration_seconds": 60.0
      },
      "ocr": {
        "status": "completed",
        "text_regions_count": 25,
        "duration_seconds": 45.0
      },
      "yolo": {
        "status": "completed",
        "objects_count": 189,
        "unique_classes": ["person", "car", "dog"],
        "duration_seconds": 30.0
      },
      "face": {
        "status": "completed",
        "unique_faces_count": 5,
        "duration_seconds": 30.0
      },
      "pose": {
        "status": "completed",
        "persons_count": 12,
        "duration_seconds": 15.0
      }
    }
  },
  "asr": {
    "language": "en",
    "language_probability": 0.9945855736732483,
    "segments": [...]
  },
  "asrx": {
    "language": "en",
    "segments": [...]
  },
  "ocr": { "segments": [...] },
  "yolo": { "segments": [...] },
  "face": { "segments": [...] },
  "pose": { "segments": [...] }
}
```

---

## 4. Periodic Output During Video Processing (Checkpoint)

### 4.1 Purpose of Checkpoints

- Prevent an abnormal interruption from losing all results produced so far
- Provide resume points so that processing can continue after an interruption
- Make the output frequency configurable (every N seconds or every N frames)

### 4.2 Configuration Parameters

```yaml
processing:
  checkpoint:
    enabled: true
    interval_seconds: 60     # Write a checkpoint every 60 seconds
    interval_frames: 1500    # Or every 1500 frames (configure one or the other)
    output_dir: "checkpoints"
    keep_partial: true       # Keep partially completed files
```

### 4.3 Checkpoint Structure

```rust
use chrono::{DateTime, Utc}; // serializing DateTime requires chrono's "serde" feature
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Checkpoint {
    pub uuid: String,
    pub processor: String,
    pub checkpoint_id: String,
    pub timestamp: f64,
    pub frame_number: i64,
    pub total_frames: i64,
    pub progress_percentage: f64,
    pub partial_data: serde_json::Value,
    pub created_at: DateTime<Utc>,
}
```

### 4.4 Checkpoint File Naming

```
{UUID}.{processing type}.partial.{sequence number}.json

Examples:
1636719dc31f78ac.asr.partial.001.json   # 1st checkpoint
1636719dc31f78ac.asr.partial.002.json   # 2nd checkpoint
1636719dc31f78ac.asr.partial.003.json   # 3rd checkpoint
```

### 4.5 Checkpoint Output Example (ASR)

```json
{
  "uuid": "1636719dc31f78ac",
  "processor": "asr",
  "checkpoint_id": "partial_001",
  "timestamp": 60.0,
  "frame_number": 1440,
  "total_frames": 3000,
  "progress_percentage": 48.0,
  "partial_data": {
    "language": "en",
    "language_probability": 0.9945,
    "segments": [
      {"start": 0.0, "end": 5.0, "text": "Hello world"},
      {"start": 5.0, "end": 10.0, "text": "This is a test"},
      ...
    ]
  },
  "created_at": "2026-03-16T10:01:00Z"
}
```

### 4.6 Checkpoint Merge Logic

```rust
/// Merges checkpoints into one result by concatenating their segments.
/// Concatenation is correct only if each checkpoint holds just the segments
/// produced since the previous checkpoint.
pub fn merge_checkpoints(checkpoints: Vec<Checkpoint>) -> serde_json::Value {
    // Sort by checkpoint_id. The zero-padded sequence ("partial_001", ...)
    // keeps lexicographic order equal to numeric order up to 999 checkpoints.
    let mut sorted = checkpoints;
    sorted.sort_by(|a, b| a.checkpoint_id.cmp(&b.checkpoint_id));

    // Concatenate segments in checkpoint order
    let mut merged_segments: Vec<serde_json::Value> = vec![];
    for checkpoint in sorted {
        if let Some(seg_array) = checkpoint
            .partial_data
            .get("segments")
            .and_then(|s| s.as_array())
        {
            merged_segments.extend(seg_array.iter().cloned());
        }
    }

    serde_json::json!({ "segments": merged_segments })
}
```
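`merge_checkpoints` concatenates segments. If a checkpoint repeats segments that an earlier one already wrote, the merged result contains duplicates. The spec does not say whether checkpoints are incremental or cumulative snapshots. The sketch below is a hypothetical alternative: it builds file names as in 4.4 and drops duplicate segments, keyed on millisecond-rounded `(start, end)` times. `partial_file_name`, `Segment`, and `merge_segments_dedup` are illustrative names, and rounding to milliseconds is an assumption.

```rust
use std::collections::HashSet;

/// Builds `{uuid}.{processor}.partial.{seq}.json` with a zero-padded 3-digit sequence.
pub fn partial_file_name(uuid: &str, processor: &str, seq: u32) -> String {
    format!("{uuid}.{processor}.partial.{seq:03}.json")
}

#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Merges per-checkpoint segment lists (given in checkpoint order). A segment
/// whose (start, end) was already seen is skipped, so the merge works for both
/// incremental and cumulative checkpoints.
pub fn merge_segments_dedup(checkpoints: &[Vec<Segment>]) -> Vec<Segment> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for segments in checkpoints {
        for seg in segments {
            // Round to milliseconds so floating-point noise does not defeat deduplication.
            let key = ((seg.start * 1000.0).round() as i64, (seg.end * 1000.0).round() as i64);
            if seen.insert(key) {
                merged.push(seg.clone());
            }
        }
    }
    merged.sort_by(|a, b| a.start.total_cmp(&b.start));
    merged
}
```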
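---

## 5. Video Processing Interruption and Resume Specification

### 5.1 Supported Interruption Types

| Interruption | Description | Handling |
|--------------|-------------|----------|
| Process crash | The processing program exits abnormally | Resume from the last checkpoint |
| System shutdown | The system shuts down unexpectedly | Resume from the last checkpoint |
| Insufficient resources | Out of memory (OOM) or disk full | Free resources, then retry |
| User pause | The user pauses processing | Show the `Paused` status |
| Network interruption | A remote resource is unavailable | Reconnect, then continue |

### 5.2 Resume Flow

```
1. Detect the interruption
        │
        ▼
2. Find the latest checkpoint
        │
        ▼
3. Load the partial data
        │
        ▼
4. Verify data integrity
        │
        ▼
5. Continue processing from the checkpoint
        │
        ▼
6. Write the complete result
```

### 5.3 Resume Status Detection

```rust
use std::path::Path;

pub async fn check_resume_status(uuid: &str) -> Result<ResumeStatus> {
    // 1. Find all checkpoint files for this video (helper defined elsewhere)
    let checkpoints = find_checkpoints(uuid)?;

    // 2. The last checkpoint records how far processing got
    let last_checkpoint = checkpoints.last().cloned();

    // 3. Check whether the main output already exists (layout from section 3.2)
    let main_output = format!("output/{uuid}/{uuid}.asr.json");
    let main_output_exists = Path::new(&main_output).exists();

    // 4. Resumable if checkpoints exist but the main output does not.
    //    The action strings are example values; the spec does not fix them.
    let can_resume = !checkpoints.is_empty() && !main_output_exists;
    let suggested_action = if can_resume {
        "resume"
    } else if main_output_exists {
        "none"
    } else {
        "process"
    };

    Ok(ResumeStatus {
        can_resume,
        last_checkpoint,
        processed_processors: detect_processed_processors(uuid),
        remaining_processors: detect_remaining_processors(uuid),
        suggested_action: suggested_action.to_string(),
    })
}
```

### 5.4 ResumeStatus Structure

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResumeStatus {
    pub can_resume: bool,
    pub last_checkpoint: Option<Checkpoint>,
    pub processed_processors: Vec<String>,
    pub remaining_processors: Vec<String>,
    pub suggested_action: String,
}
```

### 5.5 Resume Commands

```bash
# Detect and resume automatically
cargo run -- resume /path/to/video.mp4

# Force processing from the beginning
cargo run -- process /path/to/video.mp4 --force

# Show processing status
cargo run -- status 1636719dc31f78ac

# List resumable checkpoints
cargo run -- checkpoints 1636719dc31f78ac
```

### 5.6 Conflict Resolution

When both checkpoint (partial) data and a main output file exist, a strategy decides which one is kept:

```rust
#[derive(Debug, Clone, Copy)]
pub enum ConflictStrategy {
    KeepMain,    // Keep the main output file
    KeepPartial, // Keep the checkpoint data
    Merge,       // Merge both
}

pub fn resolve_conflict(
    partial: &serde_json::Value,
    main: &serde_json::Value,
    strategy: ConflictStrategy,
) -> serde_json::Value {
    match strategy {
        ConflictStrategy::KeepMain => main.clone(),
        ConflictStrategy::KeepPartial => partial.clone(),
        ConflictStrategy::Merge => {
            // Merge segments and remove duplicates (merge_segments is defined elsewhere)
            merge_segments(partial, main)
        }
    }
}
```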
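This spec does not define `find_checkpoints`, `detect_processed_processors`, or `detect_remaining_processors`. Below is one possible shape for the file-name side of step 2 in 5.2. It parses the 4.4 naming pattern and compares sequence numbers as integers, so `partial.1000` still sorts after `partial.999`; a string comparison would put it first. All function names are hypothetical. The test passes in the processor order from the section 13 diagram.

```rust
/// Parses `{uuid}.{processor}.partial.{seq}.json` and returns the sequence
/// number if the file belongs to the given uuid and processor.
pub fn partial_seq(file_name: &str, uuid: &str, processor: &str) -> Option<u32> {
    let prefix = format!("{uuid}.{processor}.partial.");
    file_name
        .strip_prefix(prefix.as_str())?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

/// Picks the newest checkpoint file by numeric sequence number.
pub fn latest_checkpoint<'a>(files: &[&'a str], uuid: &str, processor: &str) -> Option<&'a str> {
    files
        .iter()
        .filter_map(|f| partial_seq(f, uuid, processor).map(|seq| (seq, *f)))
        .max_by_key(|(seq, _)| *seq)
        .map(|(_, f)| f)
}

/// Processors still to run, in pipeline order, given those already completed.
pub fn remaining_processors<'a>(order: &[&'a str], done: &[&str]) -> Vec<&'a str> {
    order
        .iter()
        .copied()
        .filter(|p| !done.iter().any(|d| d == p))
        .collect()
}
```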
---

## 6. Multilingual Labeling Specification

### 6.1 Language Code Standard

Languages are identified with ISO 639-1 two-letter codes. Cantonese and Min Nan have no ISO 639-1 code, so they use their ISO 639-3 codes (`yue`, `nan`).

| Language | Code |
|----------|------|
| English | en |
| Mandarin Chinese | zh |
| Cantonese | yue |
| Min Nan (Southern Min) | nan |
| Japanese | ja |
| Korean | ko |
| Spanish | es |
| French | fr |
| German | de |
| Italian | it |
| Russian | ru |
| Arabic | ar |
| Hindi | hi |
| Portuguese | pt |

### 6.2 Multilingual Detection Result

```json
{
  "language": "multi",
  "languages_detected": [
    { "code": "en", "name": "English", "probability": 0.75, "segments_count": 100 },
    { "code": "zh", "name": "Chinese", "probability": 0.20, "segments_count": 30 },
    { "code": "ja", "name": "Japanese", "probability": 0.05, "segments_count": 5 }
  ],
  "primary_language": "en",
  "segments": [
    { "start": 0.0, "end": 5.0, "text": "Hello world", "language": "en", "language_probability": 0.99 },
    { "start": 5.0, "end": 10.0, "text": "你好世界", "language": "zh", "language_probability": 0.98 }
  ]
}
```

### 6.3 Segment-Level Language Labeling

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AsrSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub language: Option<String>,          // Segment language
    pub language_probability: Option<f64>, // Segment language probability
    pub has_audio: Option<bool>,           // Whether the segment contains audio (see section 7.2)
}
```

### 6.4 Database Storage

```sql
CREATE TABLE asr_segments (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    segment_index INTEGER NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    text TEXT NOT NULL,
    language VARCHAR(10),
    language_probability DOUBLE PRECISION,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(uuid, segment_index)
);

CREATE INDEX idx_asr_language ON asr_segments(language);
```
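The top-level fields in 6.2 (`language: "multi"`, `primary_language`, `segments_count`) can be derived from segment-level labels. Below is a minimal sketch with hypothetical names. It computes each language's share of labeled segments. That share is not the detector `probability` shown in 6.2: 100 of 135 segments is about 0.74, while the example gives 0.75. Filter out non-language labels such as `no_speech` or `silent` (section 7) before calling it. Falling back to `unknown` when no segment is labeled is an assumption.

```rust
use std::collections::HashMap;

/// Per-language summary derived from segment-level labels.
#[derive(Debug, PartialEq)]
pub struct LanguageSummary {
    pub code: String,
    pub segments_count: usize,
    pub share: f64, // fraction of labeled segments, not a detector probability
}

/// Returns (top-level language, primary language, summaries sorted by
/// segment count, descending).
pub fn summarize_languages(segment_langs: &[Option<&str>]) -> (String, Option<String>, Vec<LanguageSummary>) {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for lang in segment_langs.iter().flatten() {
        *counts.entry(*lang).or_insert(0) += 1;
    }
    let total: usize = counts.values().sum();
    let mut summaries: Vec<LanguageSummary> = counts
        .into_iter()
        .map(|(code, n)| LanguageSummary {
            code: code.to_string(),
            segments_count: n,
            share: n as f64 / total as f64,
        })
        .collect();
    // Sort by count (descending), then by code so that ties are deterministic.
    summaries.sort_by(|a, b| b.segments_count.cmp(&a.segments_count).then(a.code.cmp(&b.code)));
    let primary = summaries.first().map(|s| s.code.clone());
    let top_level = match summaries.len() {
        0 => "unknown".to_string(),
        1 => summaries[0].code.clone(),
        _ => "multi".to_string(),
    };
    (top_level, primary, summaries)
}
```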
---

## 7. Unidentified Language Labeling Specification

### 7.1 Unidentified States

| State | Code | Description |
|-------|------|-------------|
| Unknown | unknown | The language cannot be determined |
| Uncertain | uncertain | Language identification confidence is too low |
| NoSpeech | no_speech | There is no speech content |
| Silent | silent | The audio is entirely silent |

### 7.2 Unidentified Result Structure

```json
{
  "language": "unknown",
  "language_probability": null,
  "language_detection_failed": true,
  "failure_reason": "low_confidence",
  "min_confidence_threshold": 0.7,
  "detected_language": "en",
  "detected_probability": 0.45,
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "",
      "language": "no_speech",
      "language_probability": 0.99,
      "has_audio": false
    }
  ]
}
```

### 7.3 Handling Strategy

```rust
/// Minimum language confidence (matches `min_confidence_threshold` in 7.2)
const MIN_LANGUAGE_CONFIDENCE: f64 = 0.7;
/// Loudness below which audio counts as silent (section 7.4)
const SILENCE_THRESHOLD_DB: f64 = -40.0;

pub fn handle_undetected_language(audio_path: &str, result: &AsrResult) -> AsrResult {
    // 1. Check whether the audio is silent
    let is_silent = detect_silence(audio_path, SILENCE_THRESHOLD_DB).is_silent;

    // 2. Silent audio: label the result and every segment as "silent"
    if is_silent {
        return AsrResult {
            language: Some("silent".to_string()),
            language_probability: Some(1.0),
            segments: result
                .segments
                .iter()
                .map(|s| AsrSegment {
                    language: Some("silent".to_string()),
                    has_audio: Some(false),
                    ..s.clone()
                })
                .collect(),
        };
    }

    // 3. Low language confidence: label as "uncertain"
    let probability = result.language_probability.unwrap_or(0.0);
    if probability < MIN_LANGUAGE_CONFIDENCE {
        return AsrResult {
            language: Some("uncertain".to_string()),
            language_probability: result.language_probability,
            segments: result
                .segments
                .iter()
                .map(|s| AsrSegment {
                    language: Some("uncertain".to_string()),
                    language_probability: Some(probability),
                    ..s.clone()
                })
                .collect(),
        };
    }

    // 4. Otherwise keep the detected language
    result.clone()
}
```

### 7.4 Silence Detection

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SilenceDetection {
    pub is_silent: bool,
    pub silence_ratio: f64,  // Fraction of the audio that is silent (0.0 - 1.0)
    pub audio_level_db: f64, // Average loudness (dB)
    pub threshold_db: f64,   // Silence threshold (e.g. -40 dB)
}

pub fn detect_silence(audio_path: &str, threshold_db: f64) -> SilenceDetection {
    // Analyze loudness with ffmpeg (e.g. its volumedetect or silencedetect filter)
    todo!("analyze {audio_path} with ffmpeg against {threshold_db} dB")
}
```
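`detect_silence` delegates to ffmpeg. Its `silencedetect` filter reports silent intervals against a dB threshold, for example `ffmpeg -i input.mp4 -af silencedetect=noise=-40dB:d=0.5 -f null -`. For in-process use on already-decoded PCM, the same measure can be sketched as shown below. It computes the RMS level of each fixed-size frame in dBFS, takes the fraction of frames under the threshold, and applies the labeling order from 7.3. The function names, the frame-based ratio, and the `silent_ratio` cutoff are assumptions, not part of the spec.

```rust
/// RMS level of samples normalized to [-1.0, 1.0], in dBFS.
/// Returns negative infinity for digital silence.
pub fn rms_dbfs(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_sq = samples.iter().map(|&s| (s as f64) * (s as f64)).sum::<f64>() / samples.len() as f64;
    10.0 * mean_sq.log10() // equal to 20 * log10(rms)
}

/// Fraction of fixed-size frames whose level is below `threshold_db` (e.g. -40.0).
/// Empty input counts as fully silent.
pub fn silence_ratio(samples: &[f32], frame_len: usize, threshold_db: f64) -> f64 {
    if samples.is_empty() || frame_len == 0 {
        return 1.0;
    }
    let total = samples.chunks(frame_len).count();
    let silent = samples.chunks(frame_len).filter(|f| rms_dbfs(f) < threshold_db).count();
    silent as f64 / total as f64
}

/// Labeling order from 7.3: "silent" first, then "uncertain" below the 0.7
/// confidence threshold, otherwise the detected language code.
pub fn classify_language<'a>(detected: &'a str, probability: Option<f64>, silence: f64, silent_ratio: f64) -> &'a str {
    if silence >= silent_ratio {
        "silent"
    } else if probability.unwrap_or(0.0) < 0.7 {
        "uncertain"
    } else {
        detected
    }
}
```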
---

## 8. Music Labeling Specification

### 8.1 Music Detection Result

```json
{
  "has_music": true,
  "music_segments": [
    {
      "start": 0.0,
      "end": 30.0,
      "type": "background_music",
      "confidence": 0.95,
      "genre": "classical",
      "tempo": 120,
      "has_lyrics": false
    },
    {
      "start": 60.0,
      "end": 90.0,
      "type": "song_with_vocals",
      "confidence": 0.88,
      "artist": "Unknown",
      "title": "Unknown",
      "has_lyrics": true
    }
  ],
  "audio_classification": {
    "speech": 0.30,
    "music": 0.60,
    "ambient": 0.10
  }
}
```

### 8.2 Music Type Classification

| Type | Description |
|------|-------------|
| background_music | Background music |
| song_with_vocals | Song with vocals (lyrics) |
| instrumental | Instrumental music |
| sound_effect | Sound effect |

### 8.3 Structure Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MusicSegment {
    pub start: f64,
    pub end: f64,
    #[serde(rename = "type")]
    pub music_type: String,     // Music type (section 8.2)
    pub confidence: f64,        // Detection confidence
    pub genre: Option<String>,  // Genre (optional)
    pub tempo: Option<f64>,     // Tempo in BPM (optional)
    pub has_lyrics: bool,       // Whether the music has lyrics
    pub artist: Option<String>, // Artist (optional)
    pub title: Option<String>,  // Title (optional)
}
```

---

## 9. No-Audio Labeling Specification

### 9.1 No-Audio Definitions

| State | Description |
|-------|-------------|
| no_audio_track | The video has no audio track |
| all_silent | An audio track exists but is entirely silent |
| audio_error | The audio track could not be read |

### 9.2 No-Audio Result

```json
{
  "has_audio": false,
  "audio_status": "no_audio_track",
  "audio_info": {
    "has_audio_track": false,
    "error_message": null,
    "audio_codec": null,
    "sample_rate": null,
    "channels": null,
    "duration": null
  },
  "asr": {
    "language": "no_speech",
    "language_probability": 1.0,
    "segments": [],
    "segments_count": 0,
    "total_speech_duration": 0.0,
    "speech_ratio": 0.0
  }
}
```

### 9.3 Processing Flow

```rust
pub async fn process_video_no_audio(uuid: &str, video_path: &str) -> Result<ProcessingResult> {
    // 1. Probe the video
    let probe = probe_video(video_path).await?;

    // 2. Determine why there is no usable audio
    let audio_status = if !probe.has_audio_stream {
        "no_audio_track"
    } else if probe.audio_is_silent {
        "all_silent"
    } else {
        "audio_error"
    };

    // 3. Build the result
    Ok(ProcessingResult {
        has_audio: false,
        audio_status: audio_status.to_string(),
        asr: AsrResult {
            language: Some("no_speech".to_string()),
            language_probability: Some(1.0),
            segments: vec![],
        },
        ..Default::default()
    })
}
```

---

## 10. Frame Object Recognition Labeling (YOLO)

### 10.1 YOLO Detection Result Structure

```json
{
  "model": "yolov8x",
  "model_version": "8.0",
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "objects": [
        {
          "class": "person",
          "class_id": 0,
          "confidence": 0.92,
          "box": { "x1": 150, "y1": 200, "x2": 400, "y2": 800 },
          "tracking_id": "person_001"
        },
        {
          "class": "car",
          "class_id": 2,
          "confidence": 0.87,
          "box": { "x1": 800, "y1": 400, "x2": 1200, "y2": 700 },
          "tracking_id": "car_001"
        }
      ]
    }
  ],
  "statistics": {
    "total_objects": 189,
    "unique_classes": ["person", "car", "dog", "bicycle"],
    "class_counts": { "person": 120, "car": 45, "dog": 15, "bicycle": 9 }
  }
}
```

### 10.2 Supported Classes (COCO)

| Class ID | Class name |
|----------|------------|
| 0 | person |
| 1 | bicycle |
| 2 | car |
| 3 | motorcycle |
| 4 | airplane |
| 5 | bus |
| 6 | train |
| 7 | truck |
| 8 | boat |
| 9 | traffic light |
| ... | ... |

### 10.3 Structure Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloResult {
    pub model: String,
    pub model_version: String,
    pub segments: Vec<YoloSegment>,
    pub statistics: YoloStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub objects: Vec<YoloObject>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloObject {
    pub class: String,
    pub class_id: i32,
    pub confidence: f64,
    // `box` is a reserved keyword in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub tracking_id: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BoundingBox {
    pub x1: i32,
    pub y1: i32,
    pub x2: i32,
    pub y2: i32,
}
```

### 10.4 Processing Configuration

```yaml
processing:
  yolo:
    model: "yolov8x"
    confidence_threshold: 0.25
    iou_threshold: 0.45
    max_det: 300
    device: "cuda"
    batch_size: 8
    skip_frames: 1   # Process one frame every N frames (1 = every frame)
```
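`iou_threshold` in 10.4 is the overlap cutoff for non-maximum suppression (NMS). Detectors usually apply NMS internally; the technique is shown here for reference only. Below is a sketch of intersection-over-union and greedy per-class NMS. `BoundingBox` mirrors 10.3 without the serde derives, and `iou`/`nms` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BoundingBox {
    pub x1: i32,
    pub y1: i32,
    pub x2: i32,
    pub y2: i32,
}

/// Area of a box; degenerate (inverted) boxes have zero area.
fn area(b: &BoundingBox) -> i64 {
    ((b.x2 - b.x1).max(0) as i64) * ((b.y2 - b.y1).max(0) as i64)
}

/// Intersection over union of two boxes with pixel corners (x1, y1, x2, y2).
pub fn iou(a: &BoundingBox, b: &BoundingBox) -> f64 {
    let inter = BoundingBox {
        x1: a.x1.max(b.x1),
        y1: a.y1.max(b.y1),
        x2: a.x2.min(b.x2),
        y2: a.y2.min(b.y2),
    };
    let inter_area = area(&inter);
    let union = area(a) + area(b) - inter_area;
    if union <= 0 { 0.0 } else { inter_area as f64 / union as f64 }
}

/// Greedy per-class NMS. `dets` holds (class_id, confidence, box). Detections
/// are visited in descending confidence; one is dropped if it overlaps an
/// already-kept detection of the same class by more than `iou_threshold`.
/// Returns the indices of kept detections.
pub fn nms(dets: &[(i32, f64, BoundingBox)], iou_threshold: f64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..dets.len()).collect();
    order.sort_by(|&i, &j| dets[j].1.total_cmp(&dets[i].1));
    let mut kept: Vec<usize> = Vec::new();
    for i in order {
        let suppressed = kept
            .iter()
            .any(|&k| dets[k].0 == dets[i].0 && iou(&dets[k].2, &dets[i].2) > iou_threshold);
        if !suppressed {
            kept.push(i);
        }
    }
    kept
}
```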
---

## 11. Frame Text Recognition Labeling (OCR)

### 11.1 OCR Detection Result Structure

```json
{
  "model": "easyocr",
  "model_version": "1.7",
  "language": ["en"],
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "texts": [
        {
          "text": "EXAMPLE TEXT",
          "text_normalized": "example text",
          "boxes": [{ "x1": 100, "y1": 50, "x2": 400, "y2": 100 }],
          "confidence": 0.95,
          "language": "en"
        },
        {
          "text": "SUBTITLE HERE",
          "text_normalized": "subtitle here",
          "boxes": [{ "x1": 200, "y1": 900, "x2": 1720, "y2": 1000 }],
          "confidence": 0.88,
          "language": "en"
        }
      ]
    }
  ],
  "statistics": {
    "total_text_regions": 25,
    "unique_texts": 18,
    "languages_detected": ["en"]
  }
}
```

### 11.2 Structure Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrResult {
    pub model: String,
    pub model_version: String,
    pub language: Vec<String>,
    pub segments: Vec<OcrSegment>,
    pub statistics: OcrStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub texts: Vec<OcrText>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrText {
    pub text: String,
    pub text_normalized: String,
    pub boxes: Vec<BoundingBox>,
    pub confidence: f64,
    pub language: Option<String>,
}
```

### 11.3 Text Type Classification

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum TextType {
    Subtitle,  // Subtitles
    Title,     // Titles
    Sign,      // Signs and signboards
    Caption,   // Captions and explanatory text
    Watermark, // Watermarks
    SceneText, // Text appearing in the scene
    Unknown,   // Unknown
}
```

### 11.4 Processing Configuration

```yaml
processing:
  ocr:
    model: "easyocr"
    languages: ["en", "zh", "ja"]
    confidence_threshold: 0.5
    text_detection: true
    text_recognition: true
    batch_size: 16
    skip_frames: 30   # Process every 30th frame (subtitles usually stay on screen for a while)
```
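The OCR examples normalize `"EXAMPLE TEXT"` to `"example text"`. The sketch below shows one normalization consistent with those examples (trim, collapse whitespace, lowercase), a `unique_texts` count built on it, and the frames visited for a given `skip_frames`. The exact normalization rules and all function names are assumptions. With `skip_frames: 30` on a 24 fps video, OCR samples one frame every 1.25 seconds.

```rust
use std::collections::HashSet;

/// Trims, collapses internal whitespace, and lowercases OCR text.
pub fn normalize_text(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// Number of distinct texts after normalization (the `unique_texts` statistic).
pub fn unique_texts<'a, I: IntoIterator<Item = &'a str>>(texts: I) -> usize {
    texts.into_iter().map(normalize_text).collect::<HashSet<_>>().len()
}

/// Frame indices visited when processing every `skip_frames`-th frame from frame 0.
pub fn sampled_frames(total_frames: u64, skip_frames: u64) -> Vec<u64> {
    (0..total_frames).step_by(skip_frames.max(1) as usize).collect()
}
```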
---

## 12. Frame Face Recognition Labeling (Face)

### 12.1 Face Detection Result Structure

```json
{
  "model": "retinaface",
  "model_version": "1.0",
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "faces": [
        {
          "face_id": "face_001",
          "box": { "x1": 100, "y1": 50, "x2": 300, "y2": 350 },
          "embedding": [0.123, -0.456, ...],
          "embedding_dim": 512,
          "emotion": {
            "dominant": "happy",
            "scores": { "happy": 0.75, "neutral": 0.20, "sad": 0.03, "angry": 0.02 }
          },
          "age": 35,
          "gender": "female",
          "confidence": 0.95
        }
      ]
    }
  ],
  "statistics": {
    "total_faces": 50,
    "unique_faces": 5,
    "face_tracks": [
      {
        "face_id": "face_001",
        "duration": 120.5,
        "appearances": 3000,
        "first_seen": 0.0,
        "last_seen": 120.5
      }
    ]
  }
}
```

### 12.2 Structure Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceResult {
    pub model: String,
    pub model_version: String,
    pub segments: Vec<FaceSegment>,
    pub statistics: FaceStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub faces: Vec<Face>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Face {
    pub face_id: String,
    // `box` is a reserved keyword in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub embedding: Option<Vec<f32>>,
    pub embedding_dim: Option<usize>,
    pub emotion: Option<Emotion>,
    pub age: Option<u32>,
    pub gender: Option<String>,
    pub confidence: f64,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Emotion {
    pub dominant: String,
    pub scores: std::collections::HashMap<String, f64>,
}
```

### 12.3 Face Tracking

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceTrack {
    pub face_id: String,
    pub duration: f64,
    pub appearances: i32,
    pub first_seen: f64,
    pub last_seen: f64,
    pub frames: Vec<i64>,            // Frame numbers in which the face appears
    pub embedding_average: Vec<f32>, // Mean embedding across appearances
}
```

### 12.4 Processing Configuration

```yaml
processing:
  face:
    model: "retinaface"
    recognition_model: "arcface"
    detection_threshold: 0.5
    recognition_threshold: 0.6
    track_faces: true
    detect_emotion: true
    detect_age: true
    detect_gender: true
    skip_frames: 1
```
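ArcFace-style embeddings are usually compared by cosine similarity. The sketch below assumes `recognition_threshold: 0.6` is a cosine-similarity cutoff; the spec does not say so. Each new embedding joins the most similar existing track if it clears the threshold, and otherwise opens a new track. The track's `embedding_average` is maintained as a running mean. `Track` is a trimmed stand-in for `FaceTrack`, and `cosine_similarity`/`assign` are hypothetical names.

```rust
/// Cosine similarity between two embeddings of equal length.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions differ");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Minimal track state: id, running-mean embedding, appearance count.
pub struct Track {
    pub face_id: String,
    pub embedding_average: Vec<f32>,
    pub appearances: u32,
}

/// Adds `emb` to the most similar track if similarity >= `threshold`,
/// otherwise opens a new track. Returns the face_id used.
pub fn assign(tracks: &mut Vec<Track>, emb: &[f32], threshold: f32) -> String {
    let best = tracks
        .iter()
        .enumerate()
        .map(|(i, t)| (i, cosine_similarity(&t.embedding_average, emb)))
        .max_by(|a, b| a.1.total_cmp(&b.1));
    match best {
        Some((i, sim)) if sim >= threshold => {
            let t = &mut tracks[i];
            t.appearances += 1;
            let n = t.appearances as f32;
            // Incremental mean: avg += (x - avg) / n
            for (avg, x) in t.embedding_average.iter_mut().zip(emb) {
                *avg += (x - *avg) / n;
            }
            t.face_id.clone()
        }
        _ => {
            let id = format!("face_{:03}", tracks.len() + 1);
            tracks.push(Track { face_id: id.clone(), embedding_average: emb.to_vec(), appearances: 1 });
            id
        }
    }
}
```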
---

## 13. Complete Processing Flow Diagram

```
VIDEO PROCESSING

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ REGISTER │───▶│  PROBE   │───▶│   ASR    │───▶│   ASRx   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │
     ▼               ▼               ▼               ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   UUID   │    │  Video   │    │ Language │    │ Speaker  │
│ Generate │    │   Info   │    │  Detect  │    │ Diariz.  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                     │
     ┌───────────────────────────────────────────────┘
     ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   OCR    │───▶│   YOLO   │───▶│   FACE   │───▶│   POSE   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │
     ▼               ▼               ▼               ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   Text   │    │  Object  │    │   Face   │    │   Pose   │
│  Detect  │    │  Detect  │    │  Track   │    │ Estimate │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                     │
     ┌───────────────────────────────────────────────┘
     ▼
┌──────────┐
│  CHUNK   │
└──────────┘
     │
     ▼
┌──────────┐
│ COMPLETE │
└──────────┘

┌────────────────────────────────────────┐
│ CHECKPOINT (output every minute)       │
│   {uuid}.asr.partial.001.json          │
│   {uuid}.asrx.partial.001.json         │
│   ...                                  │
└────────────────────────────────────────┘
```

---

## 14. Processor Matrix

| Processor | Input | Output | Required | Optional parameters |
|-----------|-------|--------|----------|---------------------|
| Probe | Video file | probe.json | ✅ | - |
| ASR | Video / audio track | asr.json | ✅ | model, language |
| ASRx | asr.json | asrx.json | ❌ | min_speakers, max_speakers |
| OCR | Video frames | ocr.json | ❌ | languages, threshold |
| YOLO | Video frames | yolo.json | ❌ | model, confidence |
| Face | Video frames | face.json | ❌ | recognition, track |
| Pose | Video frames | pose.json | ❌ | model, tracking |
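The matrix marks Probe and ASR as required and the other processors as optional, and the diagram in section 13 fixes their order. The sketch below derives a run plan from those two facts. `plan` is a hypothetical function. Treating `chunk` as an always-present final stage is an assumption drawn from the diagram.

```rust
/// Stage order from the flow diagram in section 13.
const PIPELINE: [&str; 7] = ["probe", "asr", "asrx", "ocr", "yolo", "face", "pose"];
/// Processors marked as required in the processor matrix.
const REQUIRED: [&str; 2] = ["probe", "asr"];

/// Returns the required processors plus the requested optional ones, in
/// pipeline order, followed by chunking. Unknown processor names are rejected.
pub fn plan(optional: &[&str]) -> Result<Vec<&'static str>, String> {
    for &p in optional {
        if !PIPELINE.iter().any(|&s| s == p) {
            return Err(format!("unknown processor: {p}"));
        }
    }
    let mut stages: Vec<&'static str> = PIPELINE
        .iter()
        .copied()
        .filter(|&p| REQUIRED.contains(&p) || optional.iter().any(|&o| o == p))
        .collect();
    stages.push("chunk");
    Ok(stages)
}
```

Because the output always follows `PIPELINE`, callers can list optional processors in any order.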
---

## 15. Error Handling

### 15.1 Error Types

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
// Internally tagged, so the variant name is serialized as the "error" field (see 15.2)
#[serde(tag = "error", rename_all = "snake_case")]
pub enum ProcessError {
    FileNotFound,
    InvalidFormat,
    NoAudioTrack,
    NoVideoTrack,
    ProcessingFailed { processor: String, message: String },
    OutOfMemory,
    DiskFull,
    Timeout,
    Cancelled,
}
```

### 15.2 Error Response

```json
{
  "error": "processing_failed",
  "processor": "asr",
  "message": "Failed to load Whisper model",
  "timestamp": "2026-03-16T10:00:00Z",
  "retryable": false,
  "suggestion": "Check GPU availability and model files"
}
```

---

## 16. Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-16 | Initial version |

---

## 17. Related Documents

- [CHUNK_SPEC.md](./CHUNK_SPEC.md) - Video chunking specification
- [JSON_OUTPUT_SPEC.md](./JSON_OUTPUT_SPEC.md) - JSON output specification
- [RUST_DEVELOPMENT.md](./RUST_DEVELOPMENT.md) - Rust development guidelines
- [AGENTS.md](../AGENTS.md) - Development guidelines