Visual scene classification: Phase 1+2 complete
- Extracted visual_stats per scene (face count, size, objects, duration, density)
- Classified 1130 scenes into 18 types (establishing/close_up/medium/long/two/group × dialogue/sparse/silent)
- All from existing data, no LLM needed
- Scene type stored in cut chunk metadata
@@ -0,0 +1,89 @@
# Visual Scene Classification Plan

## Core problem

With Places365 dropped, how can we fill the scene-classification gap using visual cues?

## Available visual features (no additional model required)

### Available directly from existing data

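| Visual feature | Data source | What it can infer |
|---------|---------|--------|
| **Face bbox size** | face_detections | Shot scale: a larger face means a closer shot (long shot < 10% of the frame, medium < 30%, close-up > 30%) |
| **Number of people** | face_detections, deduplicated per scene | Monologue (1), dialogue (2), group scene (3+) |
| **YOLO object classes** | pre_chunks | Car → outdoors; bed/sofa → indoors; dining table → restaurant; book → office |
| **YOLO object diversity** | pre_chunks unique class count | Single object → close-up; diverse → wide shot |
| **Object position distribution** | bbox center (x, y) | Centered → single subject; scattered → group scene; off to the left/right → compositional lead |
| **Scene length** | cut start/end time | Short (< 3 s) → fast-paced; long (> 30 s) → dialogue/heavy |
| **Dialogue density** | ASR segments / scene duration | High → dialogue scene; low → action/transition |
| **Speaker count** | ASRX speaker_id, deduplicated | Single speaker → monologue; multiple speakers → dialogue |
| **Frame-to-frame change** | bbox displacement between consecutive frames | Large → action; small → static |
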
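As a rough illustration of the scene length, dialogue density, and speaker count rows above, the sketch below derives those three values for one scene. The `AsrSegment` fields and both function names are hypothetical and do not reflect the real ASR/ASRX schema; only the < 3 s and > 30 s cutoffs come from the table.

```python
from dataclasses import dataclass


@dataclass
class AsrSegment:
    """One ASR segment. Field names are illustrative, not the real schema."""
    start: float       # seconds from the start of the file
    end: float         # seconds from the start of the file
    speaker_id: str


def pace_label(duration_s: float) -> str:
    """Label scene length using the < 3 s and > 30 s cutoffs from the table."""
    if duration_s < 3:
        return "fast"
    if duration_s > 30:
        return "long"
    return "normal"


def dialogue_stats(scene_start: float, scene_end: float,
                   segments: list[AsrSegment]) -> dict:
    """Compute dialogue density (segments per second) and speaker count."""
    duration = scene_end - scene_start
    if duration <= 0:
        raise ValueError("scene_end must be greater than scene_start")
    # Keep only the segments that overlap the scene at all.
    inside = [s for s in segments
              if s.end > scene_start and s.start < scene_end]
    return {
        "duration_s": duration,
        "pace": pace_label(duration),
        "dialogue_density": len(inside) / duration,
        "speaker_count": len({s.speaker_id for s in inside}),
    }
```

Counting overlapping segments rather than fully contained ones is a design choice made here so that speech straddling a cut still counts toward the scene.
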
### Requires ffmpeg computation (M4 experiment)

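| Visual feature | ffmpeg filter | What it can infer |
|---------|-------------|--------|
| **Brightness (Y_mean)** | signalstats | Daytime (Y > 100), night (Y < 40), indoor (40-100) |
| **Contrast (Y_std)** | signalstats | High → harsh lighting/sunlight; low → overcast/soft indoor light |
| **Color tone (U/V mean)** | signalstats | Warm (higher V) → sunset/candlelight; cool (higher U) → overcast/office |
| **Motion** | tblend / flow | Large → chase/action; small → dialogue/static |
| **Camera shake** | deshake | Large → handheld camera (news feel); small → stable (cinematic feel) |
| **Scene-cut frequency** | select='gt(scene,X)' | High → fast-paced/action film; low → slow-paced/dialogue-driven drama |
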
## Visual features → scene type mapping

| Scene type | People | Face size | YOLO objects | Brightness | Motion | Dialogue density |
|---------|------|---------|-----------|------|------|---------|
| Close-up monologue | 1 | Large (> 15%) | Few/none | Any | Low | High |
| Two-person dialogue | 2 | Medium | Indoor objects | Indoor | Low | High |
| Group scene | 3+ | Small to medium | Diverse | Any | Medium | High |
| Chase/action | 1-2 | Medium | Outdoor objects | Daytime | High | Low |
| Transitional empty shot | 0 | n/a | Scenery objects | Any | Low to medium | None |
| Night scene | 1-3 | Medium | Fewer | Low (Y < 40) | Low to medium | Low to medium |
| Object close-up | 0 | n/a | Mostly 1 class | Any | Low | None |

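The commit message describes 18 scene types formed as establishing/close_up/medium/long/two/group × dialogue/sparse/silent. The following is a minimal rule-based sketch of how such a label could be derived from face count, face ratio, and dialogue density. The function names, the `shot_talk` label format, and every threshold except the 15% close-up cutoff from the table are assumptions, not the rules the commit actually uses.

```python
SHOTS = ("establishing", "close_up", "medium", "long", "two", "group")
TALKS = ("dialogue", "sparse", "silent")


def shot_class(face_count: int, avg_face_ratio: float,
               close_up_min: float = 0.15,   # from the mapping table (> 15%)
               medium_min: float = 0.05) -> str:  # assumed cutoff
    """Pick the shot half of the label from face count and face size."""
    if face_count == 0:
        return "establishing"
    if face_count == 2:
        return "two"
    if face_count >= 3:
        return "group"
    # Exactly one face: the larger the face, the closer the shot.
    if avg_face_ratio >= close_up_min:
        return "close_up"
    if avg_face_ratio >= medium_min:
        return "medium"
    return "long"


def talk_class(dialogue_density: float,
               dialogue_min: float = 0.2) -> str:  # assumed cutoff, segments/s
    """Pick the dialogue half of the label from ASR segments per second."""
    if dialogue_density <= 0:
        return "silent"
    if dialogue_density >= dialogue_min:
        return "dialogue"
    return "sparse"


def scene_type(face_count: int, avg_face_ratio: float,
               dialogue_density: float) -> str:
    """Combine both halves into one of the 6 x 3 = 18 labels."""
    shot = shot_class(face_count, avg_face_ratio)
    talk = talk_class(dialogue_density)
    return f"{shot}_{talk}"
```

For example, `scene_type(1, 0.2, 0.5)` yields `close_up_dialogue`, which lines up with the close-up monologue row of the table.
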
## Implementation suggestions

### Phase 1: Generate from existing data (simplest)

```sql
-- Visual summary for each scene
SELECT
    c.chunk_id,
    -- Distinct tracked faces that appear inside the cut
    COUNT(DISTINCT fd.trace_id) AS face_count,
    -- Mean face area, normalized by a hard-coded 1920x1080 frame
    AVG(fd.width * fd.height)::float / (1920*1080) AS avg_face_ratio,
    -- Number of distinct YOLO labels detected inside the cut
    (SELECT COUNT(DISTINCT data->>'label') FROM pre_chunks
     WHERE file_uuid=c.file_uuid AND processor_type='yolo'
       AND frame_number BETWEEN c.start_frame AND c.end_frame) AS object_types
FROM chunks c
LEFT JOIN face_detections fd ON fd.file_uuid=c.file_uuid AND fd.trace_id IS NOT NULL
    AND fd.frame_number BETWEEN c.start_frame AND c.end_frame
WHERE c.chunk_type='cut'
GROUP BY c.id;
```

### Phase 2: Add ffmpeg features (M4 experimental direction)

Run ffmpeg signalstats on each scene clip and store the results in the scene metadata.

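A minimal sketch of the parsing side of this step, assuming the per-frame metadata has been captured as text with something like `ffmpeg -i clip.mp4 -vf signalstats,metadata=mode=print:file=- -f null -`, which prints lines such as `lavfi.signalstats.YAVG=97.4`. The function names are hypothetical; the day/night/indoor cutoffs are the ones from the brightness row of the ffmpeg table and presumably apply to 8-bit luma.

```python
import re
import statistics

# Matches the per-frame average luma key written by signalstats.
_YAVG = re.compile(r"lavfi\.signalstats\.YAVG=([0-9.]+)")


def mean_luma(metadata_text: str) -> float:
    """Average the per-frame YAVG values found in the printed metadata."""
    values = [float(v) for v in _YAVG.findall(metadata_text)]
    if not values:
        raise ValueError("no lavfi.signalstats.YAVG values found")
    return statistics.fmean(values)


def brightness_label(y_mean: float) -> str:
    """Apply the table's cutoffs: Y > 100 day, Y < 40 night, else indoor."""
    if y_mean > 100:
        return "day"
    if y_mean < 40:
        return "night"
    return "indoor"
```

Averaging over all frames of a clip is the simplest aggregation; a median may be more robust for scenes with flashes or fades, but that is left open here.
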
### Phase 3: Integrate into the 5W1H+ prompt

Feed the visual feature data to the LLM so that the summary includes the scene type.

```json
{
  "scene_summary": "...",
  "scene_type": "two_person_dialogue",
  "setting": "indoor_restaurant",
  "shot": "medium_two_shot",
  "mood": "tense",
  "visual_stats": {
    "face_count": 2,
    "avg_face_ratio": 0.08,
    "objects": ["bottle", "wine_glass", "dining_table"],
    "brightness": "dim",
    "motion": "low"
  }
}
```

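One way the visual stats could be passed to the LLM is to serialize them into the prompt text next to the transcript. The actual 5W1H+ prompt is not shown here, so `build_scene_prompt` and its wording are purely hypothetical; only the requested output keys are taken from the JSON example above.

```python
import json


def build_scene_prompt(transcript: str, visual_stats: dict) -> str:
    """Assemble a prompt that hands the visual stats to the LLM as JSON."""
    # ensure_ascii=False keeps any non-ASCII labels readable in the prompt.
    stats_json = json.dumps(visual_stats, ensure_ascii=False, indent=2)
    return (
        "Summarize the scene below and return JSON with the keys "
        "scene_summary, scene_type, setting, shot, mood.\n\n"
        f"Visual stats:\n{stats_json}\n\n"
        f"Transcript:\n{transcript}\n"
    )


prompt = build_scene_prompt(
    "A: Did you bring it?\nB: Not here.",
    {"face_count": 2, "avg_face_ratio": 0.08, "motion": "low"},
)
```
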