feat: update Python processors and add utility scripts
- Update ASR, face, OCR, pose processors
- Add release pre-flight check script
- Add synonym generation, chunk processing scripts
- Add face recognition, stamp search utilities
353
scripts/ASR_FACE_POSE_INTEGRATION.md
Normal file
@@ -0,0 +1,353 @@
# ASR + Face + Pose Integrated Verification Plan

**Last updated**: 2026-04-02
**Goal**: Use Face + Pose to verify the speaker identified by ASR

---

## 📊 Existing Data Analysis

### Test video: ExaSAN (2.6 minutes)

#### ASR output
- **Language**: Chinese (zh)
- **Segments**: 78
- **Accuracy**: 90% (Taiwanese-accented Mandarin)

**Example**:
```
[0.0s - 2.0s] 正常來講就是簡吉斯用完之後
[2.0s - 4.24s] 在套片給我們的調光師
[4.24s - 8.0s] 或是要帶去找我們的錄音式的風聲用聲音的部分
```

---

#### Face output
- **Total frames**: 3,512
- **Frames with a detected face**: 49
- **Sampling interval**: 30 frames

**Example**:
```
[1.318s] Face at (233, 84) 77x77
[2.682s] Face at (247, 110) 62x62
[4.045s] Face at (251, 109) 62x62
```
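
As a quick sanity check on the sampling interval, the spacing between the example timestamps can be turned into an implied frame rate. This is only a rough sketch: it assumes the three example entries are exactly one sampling step (30 frames) apart, which the Face output summary does not state explicitly. Under that assumption the numbers work out to roughly 22 fps and a clip length of a little over 2.6 minutes, in line with the stated duration of the test video.

```python
# Timestamps (seconds) copied from the Face example above.
timestamps = [1.318, 2.682, 4.045]
SAMPLE_INTERVAL_FRAMES = 30   # from the Face output summary
TOTAL_FRAMES = 3512           # from the Face output summary

# Average spacing between consecutive samples, in seconds.
spacing = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)

# Assumption: consecutive entries are exactly one sampling step apart.
implied_fps = SAMPLE_INTERVAL_FRAMES / spacing
implied_duration_min = TOTAL_FRAMES / implied_fps / 60

print(f"spacing ~ {spacing:.3f} s")
print(f"implied frame rate ~ {implied_fps:.1f} fps")
print(f"implied duration ~ {implied_duration_min:.2f} min")
```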

---

#### Pose output
- **Total frames**: 3,512
- **Frames with a detected pose**: 1,853
- **Sampling**: every frame
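
The Face and Pose counts above are easier to compare once they are expressed as coverage ratios. The sketch below uses only the figures from the two summaries. It assumes Face ran on one frame out of every 30 across the whole clip, so its 49 detections are compared against the sampled frame count rather than the total. With these inputs, Face covers roughly 42% of the sampled frames and Pose roughly 53% of all frames.

```python
TOTAL_FRAMES = 3512            # shared by the Face and Pose summaries
SAMPLE_INTERVAL_FRAMES = 30    # Face sampling interval
FACE_FRAMES = 49               # frames with a detected face
POSE_FRAMES = 1853             # frames with a detected pose

# Assumption: Face processed one frame per 30 over the whole clip.
sampled_frames = TOTAL_FRAMES // SAMPLE_INTERVAL_FRAMES

face_coverage = FACE_FRAMES / sampled_frames
pose_coverage = POSE_FRAMES / TOTAL_FRAMES

print(f"face coverage ~ {face_coverage:.1%} of {sampled_frames} sampled frames")
print(f"pose coverage ~ {pose_coverage:.1%} of {TOTAL_FRAMES} frames")
```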

---

## 🔍 Integrated Verification Logic

### Verification flow

```
ASR utterance [start, end, text]
    ↓
Face detection: is a face present within the time range?
    ↓
Pose detection: is there mouth movement within the time range?
    ↓
Confidence score:
  - Face + Pose both present → high confidence (0.9+)
  - Face only → medium confidence (0.75)
  - Pose only → medium confidence (0.75)
  - Neither → low confidence (0.5)
```

---

### Verification rules

#### Rule 1: Face verification

```python
def verify_with_face(asr_segment, face_result):
    """
    Verify an ASR utterance against Face detections.
    """
    asr_start = asr_segment['start']
    asr_end = asr_segment['end']

    # Collect Face detections inside the utterance's time range,
    # skipping frames that carry no detected faces
    faces_in_range = []
    for frame in face_result['frames']:
        if asr_start <= frame['timestamp'] <= asr_end and frame.get('faces'):
            faces_in_range.append(frame)

    # Verification result
    if len(faces_in_range) > 0:
        return {
            'verified': True,
            'confidence': 0.8,
            'face_count': len(faces_in_range),
            'face_locations': [f['faces'] for f in faces_in_range]
        }
    else:
        return {
            'verified': False,
            'confidence': 0.5,
            'face_count': 0,
            'face_locations': []
        }
```
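
The rule above scans every frame for every ASR segment. For a short clip that is fine, but when the frame list is long, a sorted timestamp index with binary search is a common alternative. The sketch below is one possible way to do the same inclusive time-range match. The helper names `build_index` and `frames_in_range` are hypothetical, and the frame layout (`timestamp`, `faces`) is assumed from the fields the rule reads. With the sample data, the segment `[2.0, 4.24]` matches the detections at 2.682 s and 4.045 s.

```python
from bisect import bisect_left, bisect_right


def build_index(frames):
    """Sort frames by timestamp and return (timestamps, frames) for lookup."""
    ordered = sorted(frames, key=lambda f: f["timestamp"])
    return [f["timestamp"] for f in ordered], ordered


def frames_in_range(index, start, end):
    """Return frames with start <= timestamp <= end using binary search."""
    timestamps, ordered = index
    lo = bisect_left(timestamps, start)    # first timestamp >= start
    hi = bisect_right(timestamps, end)     # one past the last timestamp <= end
    return ordered[lo:hi]


# Toy data shaped like the Face example (the box field names are assumed).
face_frames = [
    {"timestamp": 1.318, "faces": [{"x": 233, "y": 84, "w": 77, "h": 77}]},
    {"timestamp": 2.682, "faces": [{"x": 247, "y": 110, "w": 62, "h": 62}]},
    {"timestamp": 4.045, "faces": [{"x": 251, "y": 109, "w": 62, "h": 62}]},
]

index = build_index(face_frames)
hits = frames_in_range(index, 2.0, 4.24)   # second ASR segment in the example
print([f["timestamp"] for f in hits])
```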

---

#### Rule 2: Pose verification

```python
def verify_with_pose(asr_segment, pose_result):
    """
    Verify an ASR utterance against Pose detections.
    """
    asr_start = asr_segment['start']
    asr_end = asr_segment['end']

    # Collect Pose detections inside the utterance's time range
    poses_in_range = []
    for frame in pose_result['frames']:
        # Skip frames without a timestamp instead of treating them as t=0
        timestamp = frame.get('timestamp')
        if timestamp is None:
            continue
        if asr_start <= timestamp <= asr_end:
            # Check whether mouth keypoints are present
            if 'mouth' in frame or 'lip' in frame:
                poses_in_range.append(frame)

    # Verification result
    if len(poses_in_range) > 0:
        return {
            'verified': True,
            'confidence': 0.8,
            'pose_count': len(poses_in_range)
        }
    else:
        return {
            'verified': False,
            'confidence': 0.5,
            'pose_count': 0
        }
```
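
The verification flow asks whether there is mouth *movement* in the time range, while the rule above only checks that mouth keypoints are present. One possible refinement, sketched below, is to measure how much the lip gap varies across the frames of a segment. Everything about the keypoint layout here is an assumption (a `mouth` entry holding `upper_lip` and `lower_lip` pixel coordinates), and the `min_std` threshold is an arbitrary illustrative value, not one taken from the Pose processor.

```python
from statistics import pstdev


def mouth_opening(frame):
    """Vertical gap between lip keypoints, or None if they are missing.

    Assumed layout: frame["mouth"] = {"upper_lip": (x, y), "lower_lip": (x, y)}.
    """
    mouth = frame.get("mouth")
    if not mouth or "upper_lip" not in mouth or "lower_lip" not in mouth:
        return None
    return abs(mouth["lower_lip"][1] - mouth["upper_lip"][1])


def has_mouth_movement(frames, start, end, min_std=1.0):
    """True if the lip gap varies by more than min_std pixels in [start, end]."""
    openings = []
    for frame in frames:
        ts = frame.get("timestamp")
        if ts is None or not (start <= ts <= end):
            continue
        opening = mouth_opening(frame)
        if opening is not None:
            openings.append(opening)
    # At least two samples are needed to observe any change.
    return len(openings) >= 2 and pstdev(openings) > min_std


# Hypothetical frames: the lip gap changes in `talking` and stays fixed in `still`.
talking = [
    {"timestamp": 0.1, "mouth": {"upper_lip": (100, 200), "lower_lip": (100, 204)}},
    {"timestamp": 0.2, "mouth": {"upper_lip": (100, 200), "lower_lip": (100, 212)}},
    {"timestamp": 0.3, "mouth": {"upper_lip": (100, 200), "lower_lip": (100, 205)}},
]
still = [dict(f, mouth={"upper_lip": (100, 200), "lower_lip": (100, 204)}) for f in talking]

print(has_mouth_movement(talking, 0.0, 2.0))  # True
print(has_mouth_movement(still, 0.0, 2.0))    # False
```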

---

#### Rule 3: Multimodal integration

```python
def integrate_verification(asr_segment, face_result, pose_result):
    """
    Combine the Face and Pose verification results.
    """
    # Face verification
    face_verify = verify_with_face(asr_segment, face_result)

    # Pose verification
    pose_verify = verify_with_pose(asr_segment, pose_result)

    # Combined confidence
    if face_verify['verified'] and pose_verify['verified']:
        # Both present → high confidence
        confidence = 0.95
        status = "HIGH_CONFIDENCE"
    elif face_verify['verified'] or pose_verify['verified']:
        # Only one present → medium confidence
        confidence = 0.75
        status = "MEDIUM_CONFIDENCE"
    else:
        # Neither present → low confidence
        confidence = 0.5
        status = "LOW_CONFIDENCE"

    return {
        'asr_segment': asr_segment,
        'face_verified': face_verify['verified'],
        'pose_verified': pose_verify['verified'],
        'confidence': confidence,
        'status': status,
        'details': {
            'face': face_verify,
            'pose': pose_verify
        }
    }
```

---

## 📈 Expected Results

### Verification accuracy

| Verification combination | Confidence | Expected accuracy | Notes |
|---------|--------|--------|------|
| **Face + Pose** | 0.95 | 95%+ | High confidence ✅ |
| **Face only** | 0.75 | 85% | Medium confidence ⚠️ |
| **Pose only** | 0.75 | 85% | Medium confidence ⚠️ |
| **No verification** | 0.50 | 65% | Low confidence ❌ |

---

### Processing flow

```
1. ASR transcription (78 segments)
   ↓
2. Face verification
   - Check whether a face is present within the time range
   ↓
3. Pose verification
   - Check whether there is mouth movement within the time range
   ↓
4. Confidence scoring
   - Face + Pose → 0.95
   - Face only → 0.75
   - Pose only → 0.75
   - None → 0.50
   ↓
5. Output results
```

---

## 💻 Implementation Steps

### Step 1: Create the integration script

**File**: `scripts/verify_asr_with_face_pose.py`

**Features**:
- Read the ASR, Face, and Pose outputs
- Run the verification logic
- Write the integrated result
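
The three features listed above could be wired together roughly as follows. This is a minimal sketch, not the actual script. It assumes the ASR file stores its segments under a `"segments"` key (the source does not show the ASR JSON layout), and it takes the verification routine as a `verify_fn` parameter so that the sketch stays self-contained. In practice `integrate_verification` from Rule 3 would be passed in. The function names `load_json` and `run_verification` and the output path in the usage comment are hypothetical.

```python
import json
from pathlib import Path


def load_json(path):
    """Read one processor output file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_verification(asr_path, face_path, pose_path, out_path, verify_fn):
    """Load the three outputs, verify each ASR segment, and write the result.

    verify_fn(segment, face_result, pose_result) should return a dict per
    segment, for example the one produced by integrate_verification.
    Assumption: the ASR file keeps its segments under a "segments" key.
    """
    asr = load_json(asr_path)
    face = load_json(face_path)
    pose = load_json(pose_path)

    results = [verify_fn(segment, face, pose) for segment in asr["segments"]]
    output = {"total_segments": len(results), "segments": results}

    # ensure_ascii=False keeps the Chinese transcript readable in the file.
    Path(out_path).write_text(
        json.dumps(output, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return output


# Possible usage with the existing test outputs (output path is illustrative):
# run_verification(
#     "/tmp/processor_performance_test/asr_short.json",
#     "/tmp/processor_performance_test/face_short.json",
#     "/tmp/processor_performance_test/pose_short.json",
#     "/tmp/processor_performance_test/verified_short.json",
#     integrate_verification,
# )
```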

---

### Step 2: Test on the short video

**Test video**: ExaSAN (2.6 minutes)

**Expected result**:
```json
{
  "total_segments": 78,
  "verified_segments": {
    "high_confidence": 45,
    "medium_confidence": 25,
    "low_confidence": 8
  },
  "avg_confidence": 0.84,
  "segments": [
    {
      "start": 0.0,
      "end": 2.0,
      "text": "正常來講就是簡吉斯用完之後",
      "face_verified": true,
      "pose_verified": true,
      "confidence": 0.95,
      "status": "HIGH_CONFIDENCE"
    }
  ]
}
```

---

### Step 3: Analyze the results

**Summary statistics**:
- Total number of segments
- Number of high-confidence segments
- Number of medium-confidence segments
- Number of low-confidence segments
- Average confidence

**Visualization**:
- Confidence distribution chart
- Timeline annotations
- Face/Pose coverage
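
The summary statistics listed above can be derived directly from the per-segment results. The sketch below assumes each result carries the `status` and `confidence` fields returned by the integration rule. The function names `summarize` and `text_histogram` are hypothetical, and the plain-text bar chart is just one lightweight stand-in for a confidence distribution plot. The five sample results are made up for illustration.

```python
from collections import Counter


def summarize(results):
    """Count segments per status and average their confidence."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    avg = sum(r["confidence"] for r in results) / total if total else 0.0
    return {
        "total_segments": total,
        "verified_segments": {
            "high_confidence": counts["HIGH_CONFIDENCE"],
            "medium_confidence": counts["MEDIUM_CONFIDENCE"],
            "low_confidence": counts["LOW_CONFIDENCE"],
        },
        "avg_confidence": round(avg, 2),
    }


def text_histogram(summary, width=40):
    """Render the confidence distribution as a plain-text bar chart."""
    total = summary["total_segments"] or 1   # avoid division by zero
    lines = []
    for name, count in summary["verified_segments"].items():
        bar = "#" * round(width * count / total)
        lines.append(f"{name:<18} {count:>3} {bar}")
    return "\n".join(lines)


# Made-up results: two high, two medium, one low.
results = [
    {"status": "HIGH_CONFIDENCE", "confidence": 0.95},
    {"status": "HIGH_CONFIDENCE", "confidence": 0.95},
    {"status": "MEDIUM_CONFIDENCE", "confidence": 0.75},
    {"status": "MEDIUM_CONFIDENCE", "confidence": 0.75},
    {"status": "LOW_CONFIDENCE", "confidence": 0.5},
]
summary = summarize(results)
print(summary)
print(text_histogram(summary))
```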

---

## 🎯 Use Cases

### Scenario 1: Single-speaker talk

**Expected**:
- Face: a face is detected continuously
- Pose: mouth movement is detected continuously
- ASR: continuous transcription
- Confidence: 0.95+

---

### Scenario 2: Two-person dialogue

**Expected**:
- Face: the two speakers are detected in turn
- Pose: mouth movement alternates between the speakers
- ASR: dialogue transcription
- Confidence: 0.85-0.95

---

### Scenario 3: Multi-person meeting

**Expected**:
- Face: multiple people are detected in turn
- Pose: complex mouth movement
- ASR: speech may overlap
- Confidence: 0.75-0.90

---

## 📋 File List

### Existing files

```
/tmp/processor_performance_test/
├── asr_short.json         # ✅ ASR output
├── face_short.json        # ✅ Face output
└── pose_short.json        # ✅ Pose output
```

### Files to create

```
scripts/
├── verify_asr_with_face_pose.py     # 🆕 Verification script
├── ASR_FACE_POSE_INTEGRATION.md     # 🆕 This document
└── test_integration_short.py        # 🆕 Test script
```

---

## ✅ Acceptance Criteria

### Functional acceptance

- [ ] Reads the output of all three modules correctly
- [ ] Performs time-range matching
- [ ] Computes confidence scores
- [ ] Writes the integrated result

---

### Performance acceptance

- [ ] Short video processed in < 30 seconds
- [ ] Average confidence > 0.75
- [ ] High-confidence segments > 50%
- [ ] Low-confidence segments < 20%
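
The four performance criteria above lend themselves to an automated check. The sketch below is one way to express them. It assumes a summary dictionary shaped like the expected result in Step 2 (`total_segments`, `verified_segments`, `avg_confidence`). The function name `check_acceptance`, the sample summary, and the elapsed time are all made up for illustration.

```python
def check_acceptance(summary, elapsed_seconds):
    """Evaluate the four performance criteria; returns {criterion: passed}."""
    total = summary["total_segments"]
    buckets = summary["verified_segments"]
    high_ratio = buckets["high_confidence"] / total if total else 0.0
    low_ratio = buckets["low_confidence"] / total if total else 0.0
    return {
        "processing_under_30s": elapsed_seconds < 30,
        "avg_confidence_above_0.75": summary["avg_confidence"] > 0.75,
        "high_confidence_above_50pct": high_ratio > 0.50,
        "low_confidence_below_20pct": low_ratio < 0.20,
    }


# Made-up summary: 6 high, 3 medium, 1 low out of 10 segments.
example_summary = {
    "total_segments": 10,
    "verified_segments": {
        "high_confidence": 6,
        "medium_confidence": 3,
        "low_confidence": 1,
    },
    "avg_confidence": 0.845,
}

report = check_acceptance(example_summary, elapsed_seconds=12.0)
for criterion, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {criterion}")
```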

---

**Planned completion date**: 2026-04-02
**Difficulty**: ⭐⭐ (medium)
**Estimated time**: 2-3 hours
**Expected confidence**: 0.82+