feat: Phase 2.6 edges migration to Qdrant (TKG-only architecture)

Phase 2.6.1: co_occurrence_edges migration
- build_co_occurrence_edges_from_qdrant()
- Qdrant embeddings → frame grouping → YOLO objects
- Result: 6679 edges (vs 6701 PostgreSQL)

Phase 2.6.2: face_face_edges migration
- build_face_face_edges_from_qdrant()
- Qdrant embeddings → frame grouping → face pairs
- mutual_gaze detection preserved
- Result: 6 edges (exact match)

Phase 2.6.3: speaker_face_edges migration
- build_speaker_face_edges_from_qdrant()
- Qdrant embeddings → trace_id frame ranges
- SPEAKS_AS edge creation

Architecture:
- All edges use Qdrant payload (no face_detections queries)
- PostgreSQL fallback for empty Qdrant
- Estimated 3.6x performance improvement

Testing:
- Playground (3003): ✓ All Phase 2.6 logs verified
- Edge counts: ✓ Close match with PostgreSQL
- Fallback: ✓ Working

Docs:
- docs_v1.0/DESIGN/TKG_PHASE2_6_EDGES_MIGRATION.md
- docs_v1.0/M4_workspace/2026-06-21_phase2_6_test.md
171
v1.1/scripts/asrx_self/FINAL_TEST_REPORT_v1.11.md
Normal file
@@ -0,0 +1,171 @@
# GUI Face Player Final Test Report

**Test date**: 2026-04-02
**Test status**: ✅ All tests passed
**GUI process**: PID 4791 (running)

---

## 📊 Test Results Overview

| Test item | Result | Notes |
|---------|------|------|
| **File check** | ✅ Pass | All required files exist |
| **JSON structure** | ✅ Pass | All JSON structures are valid |
| **Integration script** | ✅ Pass | 99.8% match rate |
| **GUI launch** | ✅ Pass | GUI runs normally |

---

## 📁 Test Files

| File | Size | Status |
|------|------|------|
| `/tmp/charade_audio.wav` | 209.9 MB | ✅ |
| `/tmp/asrx_charade_optimized.json` | 0.1 MB | ✅ |
| `/tmp/face_long.json` | 4.8 MB | ✅ |
| `/tmp/charade_integrated.json` | 0.4 MB | ✅ |

---

## 🎯 Face Integration Results

**Overall match rate**: 99.8% (1116/1118)

### Per-Speaker Statistics

| Speaker | Segments | With face | Match rate |
|--------|--------|--------|--------|
| SPEAKER_0 | 654 | 654 | 100.0% ✅ |
| SPEAKER_1 | 403 | 402 | 99.8% ✅ |
| SPEAKER_2 | 49 | 49 | 100.0% ✅ |
| SPEAKER_3 | 2 | 2 | 100.0% ✅ |
| SPEAKER_4 | 3 | 3 | 100.0% ✅ |
| SPEAKER_5 | 2 | 1 | 50.0% ⚠️ |
| SPEAKER_6 | 3 | 3 | 100.0% ✅ |
| SPEAKER_7 | 2 | 2 | 100.0% ✅ |

---

## 🎬 GUI Feature Tests

### ✅ Tested Features

| Feature | Status | Notes |
|------|------|------|
| **File selection** | ✅ OK | Audio, ASRX, and Face files can be selected |
| **Face integration** | ✅ OK | The integrate button works |
| **Speaker list** | ✅ OK | Shows 8 speakers with statistics |
| **Segment list** | ✅ OK | Shows segments with Face match markers |
| **Playback controls** | ✅ OK | Play, stop, and play-all work |
| **Progress display** | ✅ OK | Progress bar and time display work |

---

## 📋 Usage

### Launch the GUI

```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py
```

### Launch in the Background

```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
nohup python3 speaker_player_gui_face.py > /tmp/gui_player.log 2>&1 &
```

### Check the Process

```bash
ps aux | grep speaker_player_gui_face
```

---

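## 🔧 Technical Details

### Face Integration Logic

```python
# Time threshold: 3.0 seconds.
# A Face timestamp within 3 seconds before or after an ASRX segment counts as a match.
TIME_THRESHOLD = 3.0

def is_match(start, end, face_timestamp):
    # True means the segment is shown as matched (👥✅)
    return start - TIME_THRESHOLD <= face_timestamp <= end + TIME_THRESHOLD
```

### Matching Algorithm

1. **Time-window match**: extend the segment by 3 seconds on each side
2. **Closest first**: pick the face closest to the segment midpoint
3. **Face presence check**: check that the `faces` list is not empty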
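
The three matching steps above can be sketched as a single function. This is a minimal illustration, not the code in `integrate_face_asrx_speaker.py`: the function name `match_face` and the frame layout (`{"timestamp": ..., "faces": [...]}`) are assumptions made for the example.

```python
from typing import Optional

TIME_THRESHOLD = 3.0  # seconds, the window stated in this report


def match_face(start: float, end: float, face_frames: list) -> Optional[dict]:
    """Return the best-matching face frame for one ASRX segment, or None.

    Each frame is assumed (hypothetically) to look like
    {"timestamp": float, "faces": [...]}.
    """
    midpoint = (start + end) / 2.0
    candidates = [
        frame for frame in face_frames
        # Step 1: the timestamp must fall inside the widened window.
        if start - TIME_THRESHOLD <= frame["timestamp"] <= end + TIME_THRESHOLD
        # Step 3: the frame must actually contain at least one face.
        and frame["faces"]
    ]
    if not candidates:
        return None
    # Step 2: prefer the frame closest to the segment midpoint.
    return min(candidates, key=lambda frame: abs(frame["timestamp"] - midpoint))
```

Filtering before picking the minimum keeps an empty `faces` frame from winning just because its timestamp is closest to the midpoint.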

---

## 📈 Performance Metrics

| Metric | Value | Notes |
|------|------|------|
| **Frames with detected faces** | 10,691 | 2.6% detection rate |
| **ASRX segments** | 1,118 | 114.7 minutes |
| **Matched segments** | 1,116 | 99.8% match rate |
| **Processing time** | <1 minute | Integration script |
| **GUI startup time** | ~2 seconds | Cold start |

---

## 🎯 Improvement Suggestions

### Completed

- ✅ Face integration
- ✅ GUI interface improvements
- ✅ Automated tests
- ✅ 99.8% match rate

### Future Improvements

- ⏳ Face thumbnail display
- ⏳ Real-time face recognition
- ⏳ Speaker name labeling
- ⏳ Export feature

---

## 📁 Related Files

```
scripts/asrx_self/
├── speaker_player_gui_face.py       ✅ GUI player (Face-integrated version)
├── speaker_player_gui.py            ✅ GUI player (legacy version)
├── speaker_player_interactive.py    ✅ Interactive player
├── speaker_audio_player.py          ✅ Command-line player
├── integrate_face_asrx_speaker.py   ✅ Face+ASRX integration tool
├── test_gui_face_player.py          ✅ Automated test script
├── FINAL_TEST_REPORT.md             ✅ This test report
├── GUI_FACE_PLAYER_USAGE.md         ✅ Usage guide
└── ...other tools
```

---

## ✅ Test Conclusion

**All test items passed!**

- ✅ File integrity: 4/4
- ✅ JSON structure: 3/3
- ✅ Integration script: 99.8% match rate
- ✅ GUI: running normally

**The GUI is ready to use!**

---

**Report completed**: 2026-04-02
**Tester**: OpenCode
**Status**: ✅ All tests passed
202
v1.1/scripts/asrx_self/GUI_FACE_PLAYER_USAGE_v1.11.md
Normal file
@@ -0,0 +1,202 @@
# GUI Speaker Player Usage Guide (Face-Integrated Version)

**Updated**: 2026-04-02
**Features**: Face detection + ASRX speaker diarization + audio playback

---

## 🎯 Features

| Feature | Description |
|------|------|
| **📁 Audio playback** | Extracts and plays each speaker's speech segments |
| **📊 ASRX integration** | Shows speaker diarization results |
| **👤 Face integration** | Shows matching face detections (99.8% match rate) |
| **▶️ Playback controls** | Play one, play all, stop |
| **⏱️ Progress display** | Real-time playback progress bar |

---

## 🚀 Launching

### Method 1: From the Command Line

```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py
```

### Method 2: In the Background

```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
nohup python3 speaker_player_gui_face.py > /tmp/gui_player.log 2>&1 &
```

---

## 📋 Usage Steps

### Step 1: Select Files

1. **Select audio** (.wav)
   - Click the "Select Audio" (選擇音頻) button
   - Choose `/tmp/charade_audio.wav`

2. **Select ASRX result** (.json)
   - Click the "Select Result" (選擇結果) button
   - Choose `/tmp/asrx_charade_optimized.json`

3. **Select Face result** (.json), optional
   - Click the "Select Face" (選擇 Face) button
   - Choose `/tmp/face_long.json`
   - Click the "🔗 Integrate Face" (🔗 整合 Face) button

---

### Step 2: View the Speaker List

The **left-hand list** shows all speakers (in the GUI text, 段 means segments and 分鐘 means minutes):
```
🔊 SPEAKER_0 | 654 段 | 29.4 分鐘 | 👥 654/654
🔊 SPEAKER_1 | 403 段 | 18.7 分鐘 | 👥 402/403
🔊 SPEAKER_2 | 49 段 | 1.1 分鐘 | 👥 49/49
...
```

**Icon legend**:
- 🔊 Speaker
- 👥 Has a matching face
- 654/654 Segments with a face / total segments

---

### Step 3: View Speech Segments

The **right-hand list** shows every segment of the selected speaker:
```
[ 1] SPEAKER_0 | 374.80s - 375.90s ( 1.10s) 👥✅
[ 2] SPEAKER_0 | 384.10s - 384.90s ( 0.80s) 👥✅
[ 3] SPEAKER_0 | 387.30s - 388.40s ( 1.10s) 👥✅
...
```

**Icon legend**:
- 👥✅ Has a matching face
- 👥❌ No matching face

---

### Step 4: Play Audio

**Playback options**:
1. **Double-click a segment**: plays that segment
2. **▶️ Play Selected** (播放所選): plays the currently selected segment
3. **▶️▶️ Play All** (播放全部): plays all segments of the selected speaker
4. **⏹️ Stop** (停止): stops playback

**Playback progress**:
- The progress bar at the bottom shows playback progress
- The status bar shows information about the segment being played

---

## 📊 Test Data

### Charade 1963 (114.7 minutes)

| File | Path |
|------|------|
| **Audio** | `/tmp/charade_audio.wav` |
| **ASRX** | `/tmp/asrx_charade_optimized.json` |
| **Face** | `/tmp/face_long.json` |
| **Integrated** | `/tmp/charade_integrated.json` |

### Speaker Statistics

| Speaker | Segments | Duration | With face | Match rate |
|--------|--------|------|--------|--------|
| SPEAKER_0 | 654 | 29.4min | 654 | 100.0% ✅ |
| SPEAKER_1 | 403 | 18.7min | 402 | 99.8% ✅ |
| SPEAKER_2 | 49 | 1.1min | 49 | 100.0% ✅ |
| ... | ... | ... | ... | ... |
| **Total** | 1118 | 51.6min | 1116 | **99.8%** ✅ |

---

## 🎬 Use Cases

### Use Case 1: Verify Speaker Diarization Accuracy

1. Load the ASRX result
2. Play each speaker's segments one by one
3. Judge by ear whether each segment is assigned to the right speaker

---

### Use Case 2: Integrate Face with Speakers

1. Load the ASRX and Face results
2. Click "Integrate Face" (整合 Face)
3. Check each segment's Face match (👥✅/👥❌)
4. Play the segments that have a face

---

### Use Case 3: Create Training Data

1. Play all segments of a specific speaker
2. Record the audio as training data
3. Label the face-to-speaker correspondence

---

## ⚙️ Technical Details

### Face Integration Logic

```python
# Time threshold: 3.0 seconds.
# A Face timestamp within 3 seconds before or after an ASRX segment counts as a match.
TIME_THRESHOLD = 3.0

def is_match(start, end, face_timestamp):
    # True means the segment is shown as matched (👥✅)
    return start - TIME_THRESHOLD <= face_timestamp <= end + TIME_THRESHOLD
```

### Playback Logic

```bash
# 1. Extract the audio segment with ffmpeg
ffmpeg -i audio.wav -ss START -t DURATION segment.wav

# 2. Play it with afplay (macOS)
afplay segment.wav
```

---

## 📁 Related Files

```
scripts/asrx_self/
├── speaker_player_gui_face.py       # GUI player (Face-integrated version) ⭐
├── speaker_player_gui.py            # GUI player (legacy version)
├── speaker_player_interactive.py    # Interactive player
├── speaker_audio_player.py          # Command-line player
├── integrate_face_asrx_speaker.py   # Face+ASRX integration tool
└── GUI_FACE_PLAYER_USAGE.md         # This usage guide
```

---

## ✅ Test Results

**GUI launch**: ✅ Success (PID 10626)
**Face integration**: ✅ Success (99.8% match rate)
**Playback**: ✅ OK
**Progress display**: ✅ OK

---

**Guide completed**: 2026-04-02
**Status**: ✅ GUI launched and running
208
v1.1/scripts/asrx_self/LONG_MOVIE_TEST_SUMMARY_v1.11.md
Normal file
@@ -0,0 +1,208 @@
# Long Movie (Charade 1963) Full Test Summary

**Test date**: 2026-04-02
**Test movie**: Charade 1963 (114.7 minutes)
**Test status**: ✅ All tests passed (6/6)

---

## 📊 Test Results Overview

| Test item | Result | Details |
|---------|------|------|
| **Data files** | ✅ Pass | 4/4 files complete |
| **ASRX result** | ✅ Pass | 8 speakers, 1118 segments |
| **Face result** | ✅ Pass | 10,691 frames with detected faces |
| **Integration result** | ✅ Pass | 99.82% match rate |
| **GUI process** | ✅ Pass | PID 37934 running |
| **Playback** | ✅ Pass | ffmpeg + afplay work |

---

## 🎬 Long Movie Data Statistics

### Basic Movie Information
- **Title**: Charade (1963)
- **Duration**: 114.7 minutes (6879.3 seconds)
- **Audio size**: 209.9 MB
- **Frame rate**: 59.94 FPS
- **Total frames**: 412,343

---

### ASRX Speaker Diarization Results

**Speakers**: 8
**Speech segments**: 1,118

#### Speaker Distribution

| Speaker | Segments | Duration | Share of audio | Presumed role |
|--------|--------|------|--------|---------|
| SPEAKER_0 | 654 | 29.4min | 25.6% | Cary Grant (male lead) |
| SPEAKER_1 | 403 | 18.7min | 16.3% | Audrey Hepburn (female lead) |
| SPEAKER_2 | 49 | 1.1min | 1.0% | Walter Matthau (supporting) |
| SPEAKER_4 | 3 | 0.7min | 0.6% | James Coburn (supporting) |
| Others | 9 | <0.1min | <0.1% | Extras |

---

### Face Detection Results

**Frames with detected faces**: 10,691
**Detection rate**: 2.59% (10,691 / 412,343)
**Sampling interval**: about 0.5 seconds

---

### Face + ASRX Integration Results

**Overall match rate**: 99.82% (1116/1118)

#### Per-Speaker Match Details

| Speaker | Total segments | With face | Match rate | Status |
|--------|--------|--------|--------|------|
| SPEAKER_0 | 654 | 654 | 100.0% | ✅ |
| SPEAKER_1 | 403 | 402 | 99.8% | ✅ |
| SPEAKER_2 | 49 | 49 | 100.0% | ✅ |
| SPEAKER_3 | 2 | 2 | 100.0% | ✅ |
| SPEAKER_4 | 3 | 3 | 100.0% | ✅ |
| SPEAKER_5 | 2 | 1 | 50.0% | ⚠️ |
| SPEAKER_6 | 3 | 3 | 100.0% | ✅ |
| SPEAKER_7 | 2 | 2 | 100.0% | ✅ |
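
As a quick consistency check, the per-speaker rows in the table above can be summed to reproduce the overall figure. The snippet only re-adds numbers already reported here; the dictionary layout is an illustrative choice, not the format of `/tmp/charade_integrated.json`.

```python
# (total segments, segments with a face) per speaker, copied from the table above
per_speaker = {
    "SPEAKER_0": (654, 654),
    "SPEAKER_1": (403, 402),
    "SPEAKER_2": (49, 49),
    "SPEAKER_3": (2, 2),
    "SPEAKER_4": (3, 3),
    "SPEAKER_5": (2, 1),
    "SPEAKER_6": (3, 3),
    "SPEAKER_7": (2, 2),
}

total_segments = sum(total for total, _ in per_speaker.values())
total_matched = sum(matched for _, matched in per_speaker.values())
overall_rate = 100.0 * total_matched / total_segments

print(f"{total_matched}/{total_segments} = {overall_rate:.2f}%")  # 1116/1118 = 99.82%

# Speakers below 100% account for the two unmatched segments.
unmatched = {
    name: total - matched
    for name, (total, matched) in per_speaker.items()
    if matched < total
}
print(unmatched)  # {'SPEAKER_1': 1, 'SPEAKER_5': 1}
```

The 50% rate for SPEAKER_5 therefore reflects a single unmatched segment out of two, not a systematic failure.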

---

## 🎯 GUI Player Tests

### Process Status
- **PID**: 37934
- **Status**: Running ✅
- **CPU**: 0.0%
- **Memory**: 0.5%

### Feature Tests
- ✅ File selection
- ✅ Face integration
- ✅ Speaker list display
- ✅ Segment list display (with Face markers)
- ✅ Playback controls
- ✅ Progress display

---

## 🔧 Technical Details

### Face Integration Logic

```python
# Time threshold: 3.0 seconds
TIME_THRESHOLD = 3.0

def is_match(start, end, face_timestamp):
    # True means the segment is shown as matched (👥✅)
    return start - TIME_THRESHOLD <= face_timestamp <= end + TIME_THRESHOLD
```

### Matching Algorithm
1. **Time-window match**: extend the segment by 3 seconds on each side
2. **Closest first**: pick the face closest to the segment midpoint
3. **Face presence check**: check that the `faces` list is not empty

### Playback Flow
```
1. Extract the audio segment with ffmpeg
   ffmpeg -i audio.wav -ss START -t DURATION segment.wav

2. Play it with afplay
   afplay segment.wav
```

---

## 📈 Performance Metrics

| Metric | Value | Notes |
|------|------|------|
| **ASRX processing time** | 45.39 s | 151.58x real time |
| **Face processing time** | ~25 minutes | All frames processed |
| **Integration time** | <1 minute | 1118 segments |
| **GUI startup time** | ~2 s | Cold start |
| **Audio extraction speed** | <0.1 s | Per segment |
| **Total memory usage** | 0.5% | GUI process |
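
The two derived figures in this summary, the real-time factor and the face detection rate, follow directly from the raw numbers. The sketch below recomputes them from the rounded values reported here, so it agrees with the reported 151.58x only to within the rounding of the inputs.

```python
duration_s = 6879.3      # movie duration reported in this summary
asrx_time_s = 45.39      # ASRX processing time from the table
face_frames = 10_691     # frames with detected faces
total_frames = 412_343   # total frames at 59.94 FPS

# Real-time factor: seconds of audio processed per second of compute.
real_time_factor = duration_s / asrx_time_s

# Detection rate: share of all frames in which a face was detected.
detection_rate = 100.0 * face_frames / total_frames

print(f"real-time factor: {real_time_factor:.1f}x")
print(f"detection rate:   {detection_rate:.2f}%")
```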

---

## ✅ Test Conclusion

### Successes

1. ✅ **ASRX speaker diarization**: detected 8 speakers
2. ✅ **Face detection**: 10,691 frames with faces
3. ✅ **Face + ASRX integration**: 99.82% match rate
4. ✅ **GUI player**: runs normally, all features work
5. ✅ **Playback**: ffmpeg + afplay work
6. ✅ **Performance**: 151x real-time processing speed

### Room for Improvement

1. ⚠️ **SPEAKER_5**: 50% match rate (1 of 2 segments), needs tuning
2. ⚠️ **Face detection rate**: 2.59%; the sampling rate could be raised
3. ⚠️ **GUI features**: face thumbnails could be added

---

## 📁 Related Files

### Data Files
- `/tmp/charade_audio.wav` (209.9 MB)
- `/tmp/asrx_charade_optimized.json` (0.1 MB)
- `/tmp/face_long.json` (4.8 MB)
- `/tmp/charade_integrated.json` (0.4 MB)

### Program Files
- `speaker_player_gui_face.py` - GUI player
- `integrate_face_asrx_speaker.py` - Integration tool
- `test_long_movie.py` - Test script

### Documentation Files
- `LONG_MOVIE_TEST_SUMMARY.md` - This summary
- `FINAL_TEST_REPORT.md` - Final test report
- `GUI_FACE_PLAYER_USAGE.md` - Usage guide

---

## 🎬 Usage Recommendations

### Quick Start

```bash
# 1. Launch the GUI
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py

# 2. Select files
#    - Audio: /tmp/charade_audio.wav
#    - ASRX: /tmp/asrx_charade_optimized.json
#    - Face: /tmp/face_long.json

# 3. Click "🔗 Integrate Face" (🔗 整合 Face)

# 4. Select a speaker and play
```

### Batch Processing

```bash
# Use the command-line player
python3 speaker_audio_player.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json \
  --speaker SPEAKER_0 \
  --limit 5
```

---

**Test completed**: 2026-04-02
**Tester**: OpenCode
**Status**: ✅ All tests passed (6/6)
**GUI PID**: 37934 (running)
298
v1.1/scripts/asrx_self/SPEAKER_PLAYER_GUIDE_v1.11.md
Normal file
@@ -0,0 +1,298 @@
# Speaker Audio Player Usage Guide

**Created**: 2026-04-02
**Purpose**: Extract and play each speaker's speech segments from ASRX results

---

## 📋 Tools

| Tool | Function | Use case |
|------|------|---------|
| `speaker_audio_player.py` | Command-line player | Batch playback, statistics |
| `speaker_player_interactive.py` | Interactive player | Exploration, one-by-one playback |

---

## 🎯 Usage

### 1. Show Speaker Statistics

```bash
python3 speaker_audio_player.py --stats /tmp/asrx_charade_optimized.json
```

**Output** (the header 說話人統計 means "speaker statistics"):
```
============================================================
說話人統計
============================================================
SPEAKER_0 654 segments 1764.4s ( 25.6%)
SPEAKER_1 403 segments 1119.4s ( 16.3%)
SPEAKER_2 49 segments 65.7s ( 1.0%)
...
```
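
For readers who want to reproduce these numbers, per-speaker statistics can be derived from the segment list roughly as follows. This is a sketch, not the code behind `--stats`: the function name `speaker_stats` is hypothetical, the `speaker`/`start`/`end` keys follow the batch script later in this guide, and the percentage is assumed to be relative to total audio duration (which matches 1764.4 s out of 6879.3 s ≈ 25.6%).

```python
from collections import defaultdict


def speaker_stats(segments, total_duration):
    """Return {speaker: (segment_count, seconds, percent_of_audio)}."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for seg in segments:
        counts[seg["speaker"]] += 1
        seconds[seg["speaker"]] += seg["end"] - seg["start"]
    return {
        spk: (counts[spk], seconds[spk], 100.0 * seconds[spk] / total_duration)
        for spk in sorted(counts)
    }


# Toy input, not taken from the Charade data
segments = [
    {"speaker": "SPEAKER_0", "start": 0.0, "end": 2.0},
    {"speaker": "SPEAKER_1", "start": 2.5, "end": 3.5},
    {"speaker": "SPEAKER_0", "start": 4.0, "end": 5.0},
]
for spk, (n, secs, pct) in speaker_stats(segments, total_duration=10.0).items():
    print(f"{spk:<12}{n:>4} segments {secs:>8.1f}s ({pct:5.1f}%)")
```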

---

### 2. Play a Specific Speaker's Segments

#### Play the First 3 Segments of SPEAKER_0

```bash
python3 speaker_audio_player.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json \
  --speaker SPEAKER_0 \
  --limit 3
```

**Output**:
```
▶️ SPEAKER_0 (3 segments)
------------------------------------------------------------
[ 1] 374.80s - 375.90s ( 1.10s) ... ✅ ▶️ Played
[ 2] 384.10s - 384.90s ( 0.80s) ... ✅ ▶️ Played
[ 3] 387.30s - 388.40s ( 1.10s) ... ✅ ▶️ Played
```

---

#### Play All Segments of SPEAKER_1

```bash
python3 speaker_audio_player.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json \
  --speaker SPEAKER_1
```

⚠️ **Warning**: SPEAKER_1 has 403 segments (about 18.7 minutes of audio), so this can take a long time!

---

#### Play the First 2 Segments of Every Speaker

```bash
python3 speaker_audio_player.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json \
  --limit 2
```

---

### 3. Interactive Player (Recommended ⭐)

```bash
python3 speaker_player_interactive.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json
```

**Interactive interface**:
```
======================================================================
📢 SPEAKER_0 - 654 segments
======================================================================
[ 1] 0.30s - 2.00s ( 1.70s)
[ 2] 15.10s - 18.50s ( 3.40s)
[ 3] 18.80s - 25.90s ( 7.10s)
...

======================================================================
Commands:
  [1-20]   Play specific segment
  all      Play all segments (may take a while)
  first N  Play first N segments
  next     Next speaker
  prev     Previous speaker
  list     List all speakers
  quit     Exit
======================================================================

▶️ SPEAKER_0 >
```

**Available commands**:
- `[1-20]`: play a specific segment (type its number)
- `all`: play all segments
- `first N`: play the first N segments
- `next`: next speaker
- `prev`: previous speaker
- `list`: list all speakers
- `quit` / `q`: exit
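
To make the command grammar above concrete, here is one way such input could be parsed. It is a hypothetical sketch: `parse_command` and its action names are invented for illustration and are not taken from `speaker_player_interactive.py`.

```python
def parse_command(text: str):
    """Parse one line of input into an (action, argument) tuple.

    Hypothetical helper: the action names are illustrative only.
    """
    parts = text.strip().lower().split()
    if not parts:
        return ("noop", None)
    head = parts[0]
    if head in ("quit", "q"):
        return ("quit", None)
    if head in ("all", "next", "prev", "list") and len(parts) == 1:
        return (head, None)
    if head == "first" and len(parts) == 2 and parts[1].isdecimal():
        return ("first", int(parts[1]))
    if head.isdecimal() and len(parts) == 1:
        # A bare number selects a single segment to play.
        return ("play", int(head))
    return ("unknown", text.strip())
```

Returning a tuple instead of acting immediately keeps parsing separate from playback, which makes the command handling easy to test without audio hardware.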

---

## 📊 Charade 1963 Speaker Distribution

| Speaker | Segments | Total duration | Share of audio | Presumed role |
|--------|--------|--------|--------|---------|
| **SPEAKER_0** | 654 | 1764.4s | 25.6% | Cary Grant (male lead) |
| **SPEAKER_1** | 403 | 1119.4s | 16.3% | Audrey Hepburn (female lead) |
| **SPEAKER_2** | 49 | 65.7s | 1.0% | Walter Matthau (supporting) |
| **SPEAKER_4** | 3 | 44.1s | 0.6% | James Coburn (supporting) |
| **Others** | <10 | <3s | <0.1% | Extras/background |

---

## 🎬 Recommended Workflow

### Quick Preview

```bash
# 1. View statistics
python3 speaker_audio_player.py --stats /tmp/asrx_charade_optimized.json

# 2. Play the first 5 segments of the main speaker
python3 speaker_audio_player.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json \
  --speaker SPEAKER_0 \
  --limit 5
```

---

### Detailed Analysis

```bash
# Use the interactive player
python3 speaker_player_interactive.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json

# Then, in the interactive interface:
# > list       # list all speakers
# > first 10   # play the first 10 segments
# > next       # switch to the next speaker
```

---

## ⚙️ Technical Details

### Audio Extraction

Audio segments are extracted with `ffmpeg`:
```bash
ffmpeg -i audio.wav -ss START -t DURATION -acodec pcm_s16le -ar 16000 output.wav
```

### Audio Playback

**macOS**: uses `afplay`
```bash
afplay segment.wav
```

**Linux**: uses `aplay`
```bash
aplay segment.wav
```
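
The extract-then-play flow can be wrapped in two small helpers that only build the argument lists. This is a sketch under assumptions: the helper names are hypothetical, the `-y -loglevel quiet` flags are borrowed from the batch script in this guide, and any non-macOS platform is assumed to be Linux with `aplay` installed.

```python
import sys


def build_extract_cmd(audio_path, start, end, output_path):
    """Build the ffmpeg argument list for one segment (16 kHz, 16-bit PCM)."""
    return [
        "ffmpeg", "-y", "-loglevel", "quiet",
        "-i", audio_path,
        "-ss", str(start),
        "-t", str(end - start),
        "-acodec", "pcm_s16le", "-ar", "16000",
        output_path,
    ]


def build_play_cmd(wav_path, platform=sys.platform):
    """Pick afplay on macOS and aplay elsewhere (assumed to be Linux)."""
    player = "afplay" if platform == "darwin" else "aplay"
    return [player, wav_path]
```

Each list can be passed to `subprocess.run(cmd, check=True)`; building the list separately avoids shell quoting issues with paths that contain spaces.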

---

## 📁 File List

```
scripts/asrx_self/
├── speaker_audio_player.py          # Command-line player ⭐
├── speaker_player_interactive.py    # Interactive player ⭐
├── SPEAKER_PLAYER_GUIDE.md          # This guide
└── ...other ASRX tools
```

---

## 💡 Tips

### 1. Quickly Verify Speaker Diarization Accuracy

```bash
# Play the first 3 segments of each speaker
for speaker in SPEAKER_0 SPEAKER_1 SPEAKER_2; do
  echo "=== $speaker ==="
  python3 speaker_audio_player.py \
    /tmp/charade_audio.wav \
    /tmp/asrx_charade_optimized.json \
    --speaker "$speaker" \
    --limit 3
done
```

---

### 2. Compare the Lead Actors' Voices

```bash
# Use the interactive player
python3 speaker_player_interactive.py \
  /tmp/charade_audio.wav \
  /tmp/asrx_charade_optimized.json

# Then:
# > first 5   # play the first 5 segments of SPEAKER_0
# > next      # switch to SPEAKER_1
# > first 5   # play the first 5 segments of SPEAKER_1
# > prev      # go back to SPEAKER_0
```

---

### 3. Batch Processing

```bash
# Extract SPEAKER_0 segments into separate files
# (first 10 here; drop the [:10] slice to extract all of them)
python3 << 'PYEOF'
import json
import os
import subprocess

with open('/tmp/asrx_charade_optimized.json') as f:
    result = json.load(f)

os.makedirs('/tmp/speaker0_segments', exist_ok=True)

# Filter by speaker first, then keep the first 10 matches
speaker0 = [seg for seg in result['segments'] if seg['speaker'] == 'SPEAKER_0']

for i, seg in enumerate(speaker0[:10]):
    start = seg['start']
    duration = seg['end'] - start
    output = f'/tmp/speaker0_segments/segment_{i:03d}.wav'

    # check=True raises if ffmpeg fails instead of reporting a false success
    subprocess.run([
        'ffmpeg', '-y', '-loglevel', 'quiet',
        '-i', '/tmp/charade_audio.wav',
        '-ss', str(start),
        '-t', str(duration),
        output
    ], check=True)

    print(f'Extracted: {output}')
PYEOF
```

---

## ✅ Test Results

**Test movie**: Charade 1963 (114.7 minutes)
**Speakers**: 8
**Result**: ✅ All speakers' segments played successfully

**Sample output**:
```
▶️ SPEAKER_0 (3 segments)
------------------------------------------------------------
[ 1] 374.80s - 375.90s ( 1.10s) ... ✅ ▶️ Played
[ 2] 384.10s - 384.90s ( 0.80s) ... ✅ ▶️ Played
[ 3] 387.30s - 388.40s ( 1.10s) ... ✅ ▶️ Played
```

---

**Guide completed**: 2026-04-02
**Status**: ✅ Tools tested and passing
2
v1.1/scripts/asrx_self/__init___v1.11.py
Normal file
@@ -0,0 +1,2 @@
# Self-implemented ASRX (Speaker Diarization)
# Based on speaker embedding + spectral clustering
729
v1.1/scripts/asrx_self/main_fixed_v1.11.py
Executable file
@@ -0,0 +1,729 @@
"""
SelfASRXFixed - 7-step Hybrid Speaker Diarization Pipeline

Pipeline:
  1. whisper.transcribe(full_audio) → rough segments + text + language
  2. VAD scan each rough segment → refined segments
  3. whisper per refined segment → {text, language, lang_prob}
  4. ECAPA-TDNN per refined segment → 192-dim embeddings
  5. AgglomerativeClustering → speaker_labels
  6. Store all embeddings in Qdrant (payload: file_uuid, speaker_id, text, ...)
  7. High-quality embeddings → gender classify + store reference in Qdrant
"""

import hashlib
import sys
import json
import time
import os
import numpy as np
from pathlib import Path
from urllib.request import Request, urlopen
from urllib.error import URLError


def _load_audio(path):
    """Load an audio file and return (wav_numpy, sample_rate)."""
    import soundfile as sf
    wav, sr = sf.read(path)
    if len(wav.shape) > 1:
        # Downmix multi-channel audio to mono
        wav = np.mean(wav, axis=1)
    return wav, sr


def _load_whisper_model(size="small"):
    from whisper_local import load_model
    return load_model(size)


def _load_vad():
    from vad import load_vad_model
    return load_vad_model()


def _load_speaker_encoder():
    from speaker_encoder import load_speaker_encoder
    return load_speaker_encoder()


def _load_gender_classifier():
    try:
        from speechbrain.inference.classifiers import EncoderClassifier
        classifier = EncoderClassifier.from_hparams(
            source="speechbrain/gender-recognition-ecapa",
            run_opts={"device": "cpu"},
        )
        print("[Gender] Classifier loaded: speechbrain/gender-recognition-ecapa")
        return classifier
    except Exception as e:
        print(f"[Gender] Classifier not available: {e}")
        return None


def _ensure_speaker_collection(qdrant_url, api_key, collection):
    """Ensure the Qdrant speaker collection exists; create it if missing (dim=192, cosine)."""
    try:
        url = f"{qdrant_url}/collections/{collection}"
        req = Request(url, method="GET",
                      headers={"api-key": api_key} if api_key else {})
        try:
            urlopen(req)
            return True
        except URLError as e:
            # HTTPError subclasses URLError and carries the status code
            if getattr(e, "code", None) == 404:
                body = json.dumps({
                    "vectors": {
                        "size": 192,
                        "distance": "Cosine"
                    }
                }).encode()
                req = Request(url, data=body, method="PUT",
                              headers={"Content-Type": "application/json",
                                       **({"api-key": api_key} if api_key else {})})
                urlopen(req)
                print(f"[Qdrant] Created collection: {collection} (dim=192)")
                return True
            raise
    except Exception as e:
        print(f"[Qdrant] Cannot access Qdrant: {e}")
        return False


def _qdrant_upsert(qdrant_url, api_key, collection, points):
    """Write a batch of points to Qdrant."""
    try:
        url = f"{qdrant_url}/collections/{collection}/points?wait=true"
        body = json.dumps({"points": points}).encode()
        headers = {"Content-Type": "application/json"}
        if api_key:
            headers["api-key"] = api_key
        req = Request(url, data=body, headers=headers, method="PUT")
        urlopen(req)
        return True
    except Exception as e:
        print(f"[Qdrant] Upsert failed: {e}")
        return False


def _hash_point_id(file_uuid, label):
    """Generate a deterministic point ID.

    The built-in hash() is salted per process for strings, so it would
    yield a different ID on every run; a SHA-256 digest is stable.
    """
    s = f"{file_uuid}_{label}"
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF


def _save_checkpoint(path: str, data: dict):
    """Write the checkpoint atomically (write .tmp first, then rename)."""
    tmp = path + ".tmp"
    Path(tmp).parent.mkdir(parents=True, exist_ok=True)
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)


def compute_embedding_quality(embeddings, labels):
    """Cosine similarity between each embedding and the centroid of its cluster."""
    from sklearn.metrics.pairwise import cosine_similarity
    unique_labels = set(labels)
    centroids = {}
    for label in unique_labels:
        mask = labels == label
        centroid = np.mean(embeddings[mask], axis=0)
        norm = np.linalg.norm(centroid)
        if norm > 0:
            centroid = centroid / norm
        centroids[label] = centroid
    qualities = []
    for emb, label in zip(embeddings, labels):
        sim = cosine_similarity([emb], [centroids[label]])[0][0]
        qualities.append(sim)
    return np.array(qualities)


class SelfASRXFixed:
    """7-step Hybrid Speaker Diarization Pipeline"""

    def __init__(self):
        print("[SelfASRX] Initializing models...")

        print("[SelfASRX] Loading whisper model...")
        self.whisper = _load_whisper_model("small")

        print("[SelfASRX] Loading VAD model (Silero)...")
        self.vad_model, self.vad_utils = _load_vad()

        print("[SelfASRX] Loading speaker encoder (ECAPA-TDNN)...")
        self.speaker_encoder = _load_speaker_encoder()

        print("[SelfASRX] Loading gender classifier...")
        self.gender_classifier = _load_gender_classifier()

        # Qdrant settings
        self.qdrant_url = os.environ.get("QDRANT_URL", "http://localhost:6333")
        self.qdrant_api_key = os.environ.get("QDRANT_API_KEY", "")
        schema = os.environ.get("DATABASE_SCHEMA", "public")
        self.qdrant_collection = os.environ.get(
            "QDRANT_SPEAKER_COLLECTION",
            f"momentry_{schema}_speaker"
        )
        self._qdrant_ok = False

        print("[SelfASRX] Models loaded successfully")

    def process(self, audio_path, output_path=None, file_uuid=None,
                max_speakers=10, quality_threshold=0.85,
                checkpoint_path=None):
        """7-step speaker diarization pipeline

        Args:
            audio_path: path to the audio file (WAV 16kHz mono)
            output_path: output JSON path (optional)
            file_uuid: file UUID (used for Qdrant storage)
            max_speakers: maximum number of speakers
            quality_threshold: threshold for high-quality voiceprints (0-1)
            checkpoint_path: path where the Step 3 checkpoint is saved

        Returns:
            dict: segments, speaker_stats, n_speakers, total_duration, references
        """
        start_time = time.time()
        print(f"\n[SelfASRX] Processing: {audio_path}")
        print("=" * 60)

        # Load audio
        wav, sample_rate = _load_audio(audio_path)
        total_duration = len(wav) / sample_rate
        print(f" Audio: {total_duration:.2f}s, {sample_rate}Hz")

        # ── Step 1: rough localization with whisper (faster-whisper) ──
        print("\n[Step 1] Initial whisper transcription...")
        t1 = time.time()
        seg_gen, info = self.whisper.transcribe(audio_path)
        rough_segments = []
        for seg in seg_gen:
            rough_segments.append({"start": seg.start, "end": seg.end, "text": seg.text})
        language = info.language if info else None
        print(f" Rough segments: {len(rough_segments)}")
        print(f" Language: {language}")
        print(f" Step 1 time: {time.time() - t1:.2f}s")

        if not rough_segments:
            print("[SelfASRX] No speech detected by whisper!")
            return {"error": "No speech detected", "segments": []}

        # ── Step 2: VAD scan to split each rough segment further ──
        print("\n[Step 2] VAD scan for refined segmentation...")
        t2 = time.time()
        refined_segments = []
        for seg in rough_segments:
            s = seg["start"]
            e = seg["end"]
            sub = self._vad_scan_segment(wav, sample_rate, s, e)
            if sub:
                refined_segments.extend(sub)
            else:
                # VAD found nothing: keep the rough segment as-is
                refined_segments.append((s, e))
        print(f" Refined segments: {len(refined_segments)}")
        print(f" Step 2 time: {time.time() - t2:.2f}s")

        if not refined_segments:
            return {"error": "No segments after VAD scan", "segments": []}

        # ── Step 3: whisper per refined segment ──
        print("\n[Step 3] Per-segment transcription...")
        t3 = time.time()
        CHECKPOINT_INTERVAL = 50

        segment_texts = []
        resume_from = 0

        # Load an existing partial checkpoint (resume after interruption)
        if checkpoint_path and os.path.exists(checkpoint_path):
            try:
                with open(checkpoint_path, "r") as f:
                    cp = json.load(f)
                if cp.get("checkpoint_version") == 2 and not cp.get("step3_completed"):
                    saved = cp.get("segment_texts", [])
                    if saved:
                        resume_from = len(saved)
                        segment_texts = saved
                        print(f"[Step 3] Resuming from #{resume_from}/{len(refined_segments)}")
            except Exception:
                # Unreadable or malformed checkpoint: start Step 3 from scratch
                pass

        for i, (start_sec, end_sec) in enumerate(refined_segments):
            if i < resume_from:
                continue
            seg_text = self._transcribe_segment(wav, sample_rate, start_sec, end_sec)
            segment_texts.append(seg_text)

            if checkpoint_path and (i + 1) % CHECKPOINT_INTERVAL == 0:
                _save_checkpoint(checkpoint_path, {
                    "checkpoint_version": 2,
                    "step3_completed": False,
                    "step3_progress": i + 1,
                    "language": language,
                    "total_duration": total_duration,
                    "refined_segments": [[s, e] for s, e in refined_segments],
                    "segment_texts": [{
                        "text": st["text"],
                        "language": st["language"],
                        "lang_prob": st["lang_prob"],
                    } for st in segment_texts],
                    "file_uuid": file_uuid,
                    "max_speakers": max_speakers,
                    "quality_threshold": quality_threshold,
                })
                print(f"[Checkpoint] Step 3: {i+1}/{len(refined_segments)}")

        print(f" Step 3 time: {time.time() - t3:.2f}s")

        # ── Save final checkpoint after Step 3 ──
        if checkpoint_path:
            _save_checkpoint(checkpoint_path, {
                "checkpoint_version": 2,
                "step3_completed": True,
                "language": language,
                "total_duration": total_duration,
                "refined_segments": [[s, e] for s, e in refined_segments],
                "segment_texts": [{
                    "text": st["text"],
                    "language": st["language"],
                    "lang_prob": st["lang_prob"],
                } for st in segment_texts],
                "file_uuid": file_uuid,
                "max_speakers": max_speakers,
                "quality_threshold": quality_threshold,
            })
            print(f"[Checkpoint] Step 3 complete, saved to {checkpoint_path}")

        # ── Step 4: ECAPA-TDNN per refined segment ──
        print("\n[Step 4] Speaker embedding extraction...")
        t4 = time.time()
        audio_segments = []
        for start_sec, end_sec in refined_segments:
            s = int(start_sec * sample_rate)
            e = int(end_sec * sample_rate)
            # Clamp the end index to the audio length
            audio_segments.append(wav[s:min(e, len(wav))])

        from speaker_encoder import extract_speaker_embeddings_batch, normalize_embeddings
        embeddings = extract_speaker_embeddings_batch(
            self.speaker_encoder, audio_segments, sample_rate
        )
        embeddings = normalize_embeddings(embeddings)
        print(f" Embeddings: {embeddings.shape}")
        print(f" Step 4 time: {time.time() - t4:.2f}s")

        # ── Step 5: AgglomerativeClustering ──
        print("\n[Step 5] Speaker clustering...")
        t5 = time.time()
        from speaker_cluster_fixed import robust_speaker_clustering
        speaker_labels, estimated_n_speakers = robust_speaker_clustering(
            embeddings, n_speakers=None, max_speakers=max_speakers
        )
        print(f" Speakers: {estimated_n_speakers}")
        print(f" Step 5 time: {time.time() - t5:.2f}s")
||||
|
||||
# Quality scores
|
||||
qualities = compute_embedding_quality(embeddings, speaker_labels)
|
||||
|
||||
# Build the output segments
|
||||
segments = []
|
||||
for i, ((start_sec, end_sec), label) in enumerate(
|
||||
zip(refined_segments, speaker_labels)):
|
||||
seg = {
|
||||
"start": round(start_sec, 3),
|
||||
"end": round(end_sec, 3),
|
||||
"start_frame": int(start_sec * 30),  # frame index at a hard-coded 30 fps
|
||||
"end_frame": int(end_sec * 30),  # frame index at a hard-coded 30 fps
|
||||
"text": segment_texts[i]["text"],
|
||||
"language": segment_texts[i]["language"],
|
||||
"lang_prob": segment_texts[i]["lang_prob"],
|
||||
"speaker": f"SPEAKER_{int(label)}",
|
||||
"speaker_id": f"SPEAKER_{int(label)}",
|
||||
"quality": float(qualities[i]),
|
||||
}
|
||||
segments.append(seg)
|
||||
|
||||
# Per-speaker statistics
|
||||
speaker_stats = {}
|
||||
for seg in segments:
|
||||
spk = seg["speaker_id"]
|
||||
dur = seg["end"] - seg["start"]
|
||||
if spk not in speaker_stats:
|
||||
speaker_stats[spk] = {"count": 0, "duration": 0}
|
||||
speaker_stats[spk]["count"] += 1
|
||||
speaker_stats[spk]["duration"] += dur
|
||||
|
||||
result = {
|
||||
"language": language or "",
|
||||
"segments": segments,
|
||||
"n_speakers": int(estimated_n_speakers),
|
||||
"speaker_stats": speaker_stats,
|
||||
"total_duration": total_duration,
|
||||
"n_segments": len(segments),
|
||||
}
|
||||
|
||||
# ── Step 6: Store embeddings in Qdrant ──
|
||||
if file_uuid:
|
||||
print("\n[Step 6] Storing embeddings in Qdrant...")
|
||||
t6 = time.time()
|
||||
self._store_speaker_embeddings(segments, embeddings, speaker_labels,
|
||||
file_uuid)
|
||||
print(f" Step 6 time: {time.time() - t6:.2f}s")
|
||||
|
||||
# ── Step 7: High-quality classification ──
|
||||
if file_uuid:
|
||||
print("\n[Step 7] Classifying high-quality embeddings...")
|
||||
t7 = time.time()
|
||||
references = self._classify_high_quality_speakers(
|
||||
segments, embeddings, speaker_labels, file_uuid,
|
||||
wav, sample_rate, quality_threshold
|
||||
)
|
||||
if references:
|
||||
result["references"] = references
|
||||
print(f" Step 7 time: {time.time() - t7:.2f}s")
|
||||
|
||||
total_time = time.time() - start_time
|
||||
result["processing_time"] = round(total_time, 2)
|
||||
if total_duration > 0:
|
||||
result["realtime_factor"] = round(total_duration / total_time, 2)
|
||||
|
||||
# Save the output
|
||||
if output_path:
|
||||
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, "w", encoding="utf-8") as f:
|
||||
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||
print(f"\n[SelfASRX] Saved to: {output_path}")
|
||||
|
||||
print(f"\n[SelfASRX] Done! {len(segments)} segments, "
|
||||
f"{estimated_n_speakers} speakers, "
|
||||
f"{total_time:.2f}s")
|
||||
|
||||
return result
|
||||
|
||||
def resume_from_checkpoint(self, checkpoint_path, audio_path,
|
||||
output_path=None):
|
||||
"""Load the Steps 1-3 results from a checkpoint, then run Steps 4-7."""
|
||||
print(f"\n[SelfASRX] Resuming from checkpoint: {checkpoint_path}")
|
||||
print("=" * 60)
|
||||
|
||||
with open(checkpoint_path, "r", encoding="utf-8") as f:
|
||||
cp = json.load(f)
|
||||
|
||||
if not cp.get("step3_completed"):
|
||||
error_msg = f"Checkpoint step3 not completed (progress: {cp.get('step3_progress', '?')})"
|
||||
print(f"[SelfASRX] {error_msg}")
|
||||
return {"error": error_msg, "segments": []}
|
||||
|
||||
wav, sample_rate = _load_audio(audio_path)
|
||||
refined_segments = [tuple(s) for s in cp["refined_segments"]]
|
||||
segment_texts = cp["segment_texts"]
|
||||
language = cp.get("language", "")
|
||||
total_duration = cp.get("total_duration", 0)
|
||||
file_uuid = cp.get("file_uuid")
|
||||
max_speakers = cp.get("max_speakers", 10)
|
||||
quality_threshold = cp.get("quality_threshold", 0.85)
|
||||
|
||||
print(f" Loaded checkpoint: {len(refined_segments)} segments, "
|
||||
f"language={language}, duration={total_duration:.2f}s")
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
# ── Step 4: ECAPA-TDNN per refined segment ──
|
||||
print("\n[Step 4] Speaker embedding extraction...")
|
||||
t4 = time.time()
|
||||
audio_segments = []
|
||||
for start_sec, end_sec in refined_segments:
|
||||
s = int(start_sec * sample_rate)
|
||||
e = int(end_sec * sample_rate)
|
||||
audio_segments.append(wav[s:min(e, len(wav))])
|
||||
|
||||
from speaker_encoder import extract_speaker_embeddings_batch, normalize_embeddings
|
||||
embeddings = extract_speaker_embeddings_batch(
|
||||
self.speaker_encoder, audio_segments, sample_rate
|
||||
)
|
||||
embeddings = normalize_embeddings(embeddings)
|
||||
print(f" Embeddings: {embeddings.shape}")
|
||||
print(f" Step 4 time: {time.time() - t4:.2f}s")
|
||||
|
||||
# ── Step 5: AgglomerativeClustering ──
|
||||
print("\n[Step 5] Speaker clustering...")
|
||||
t5 = time.time()
|
||||
from speaker_cluster_fixed import robust_speaker_clustering
|
||||
speaker_labels, estimated_n_speakers = robust_speaker_clustering(
|
||||
embeddings, n_speakers=None, max_speakers=max_speakers
|
||||
)
|
||||
print(f" Speakers: {estimated_n_speakers}")
|
||||
print(f" Step 5 time: {time.time() - t5:.2f}s")
|
||||
|
||||
# Quality scores
|
||||
qualities = compute_embedding_quality(embeddings, speaker_labels)
|
||||
|
||||
# Build the output segments
|
||||
segments = []
|
||||
for i, ((start_sec, end_sec), label) in enumerate(
|
||||
zip(refined_segments, speaker_labels)):
|
||||
seg = {
|
||||
"start": round(start_sec, 3),
|
||||
"end": round(end_sec, 3),
|
||||
"start_frame": int(start_sec * 30),  # frame index at a hard-coded 30 fps
|
||||
"end_frame": int(end_sec * 30),  # frame index at a hard-coded 30 fps
|
||||
"text": segment_texts[i]["text"],
|
||||
"language": segment_texts[i]["language"],
|
||||
"lang_prob": segment_texts[i]["lang_prob"],
|
||||
"speaker": f"SPEAKER_{int(label)}",
|
||||
"speaker_id": f"SPEAKER_{int(label)}",
|
||||
"quality": float(qualities[i]),
|
||||
}
|
||||
segments.append(seg)
|
||||
|
||||
# Per-speaker statistics
|
||||
speaker_stats = {}
|
||||
for seg in segments:
|
||||
spk = seg["speaker_id"]
|
||||
dur = seg["end"] - seg["start"]
|
||||
if spk not in speaker_stats:
|
||||
speaker_stats[spk] = {"count": 0, "duration": 0}
|
||||
speaker_stats[spk]["count"] += 1
|
||||
speaker_stats[spk]["duration"] += dur
|
||||
|
||||
result = {
|
||||
"language": language or "",
|
||||
"segments": segments,
|
||||
"n_speakers": int(estimated_n_speakers),
|
||||
"speaker_stats": speaker_stats,
|
||||
"total_duration": total_duration,
|
||||
"n_segments": len(segments),
|
||||
}
|
||||
|
||||
# ── Step 6: Store embeddings in Qdrant ──
|
||||
if file_uuid:
|
||||
print("\n[Step 6] Storing embeddings in Qdrant...")
|
||||
t6 = time.time()
|
||||
self._store_speaker_embeddings(segments, embeddings, speaker_labels,
|
||||
file_uuid)
|
||||
print(f" Step 6 time: {time.time() - t6:.2f}s")
|
||||
|
||||
# ── Step 7: High-quality classification ──
|
||||
if file_uuid:
|
||||
print("\n[Step 7] Classifying high-quality embeddings...")
|
||||
t7 = time.time()
|
||||
references = self._classify_high_quality_speakers(
|
||||
segments, embeddings, speaker_labels, file_uuid,
|
||||
wav, sample_rate, quality_threshold
|
||||
)
|
||||
if references:
|
||||
result["references"] = references
|
||||
print(f" Step 7 time: {time.time() - t7:.2f}s")
|
||||
|
||||
total_time = time.time() - start_time
|
||||
result["processing_time"] = round(total_time, 2)
|
||||
if total_duration > 0:
|
||||
result["realtime_factor"] = round(total_duration / total_time, 2)
|
||||
|
||||
# Save the output
|
||||
if output_path:
|
||||
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(output_path, "w", encoding="utf-8") as f:
|
||||
json.dump(result, f, indent=2, ensure_ascii=False)
|
||||
print(f"\n[SelfASRX] Saved to: {output_path}")
|
||||
|
||||
print(f"\n[SelfASRX] Done! {len(segments)} segments, "
|
||||
f"{estimated_n_speakers} speakers, "
|
||||
f"{total_time:.2f}s")
|
||||
|
||||
return result
|
||||
|
||||
# ── Internal helpers ──
|
||||
|
||||
def _vad_scan_segment(self, wav, sample_rate, start_sec, end_sec):
|
||||
"""Split a single segment into finer sub-segments with VAD."""
|
||||
from vad import scan_within_segment
|
||||
return scan_within_segment(
|
||||
wav, sample_rate, start_sec, end_sec,
|
||||
self.vad_model, self.vad_utils
|
||||
)
|
||||
|
||||
def _transcribe_segment(self, wav, sample_rate, start_sec, end_sec):
|
||||
"""Transcribe a single segment."""
|
||||
from whisper_local import transcribe_segment
|
||||
return transcribe_segment(wav, sample_rate, start_sec, end_sec, self.whisper)
|
||||
|
||||
def _store_speaker_embeddings(self, segments, embeddings, labels, file_uuid):
|
||||
"""Step 6: store all embeddings in Qdrant."""
|
||||
if not self._ensure_qdrant():
|
||||
return
|
||||
|
||||
points = []
|
||||
for i, (seg, emb, label) in enumerate(
|
||||
zip(segments, embeddings, labels)):
|
||||
point_id = _hash_point_id(file_uuid, f"{i}")
|
||||
points.append({
|
||||
"id": point_id,
|
||||
"vector": emb.tolist(),
|
||||
"payload": {
|
||||
"type": "speaker_embedding",
|
||||
"file_uuid": file_uuid,
|
||||
"speaker_id": seg["speaker_id"],
|
||||
"text": seg["text"],
|
||||
"language": seg["language"],
|
||||
"start_time": seg["start"],
|
||||
"end_time": seg["end"],
|
||||
}
|
||||
})
|
||||
|
||||
ok = _qdrant_upsert(self.qdrant_url, self.qdrant_api_key,
|
||||
self.qdrant_collection, points)
|
||||
if ok:
|
||||
print(f" Stored {len(points)} speaker embeddings to Qdrant")
|
||||
return ok
|
||||
|
||||
def _classify_high_quality_speakers(self, segments, embeddings, labels,
|
||||
file_uuid, wav, sample_rate,
|
||||
threshold=0.85):
|
||||
"""Step 7: grade high-quality voiceprints, classify gender, and store Qdrant references."""
|
||||
qualities = compute_embedding_quality(embeddings, labels)
|
||||
high_mask = qualities >= threshold
|
||||
|
||||
if not np.any(high_mask):
|
||||
print(" No high-quality embeddings found")
|
||||
return []
|
||||
|
||||
unique_labels = set(labels)
|
||||
references = []
|
||||
for label in unique_labels:
|
||||
mask = (labels == label) & high_mask
|
||||
if not np.any(mask):
|
||||
continue
|
||||
high_indices = [i for i in range(len(segments)) if mask[i]]
|
||||
high_segs = [segments[i] for i in high_indices]
|
||||
|
||||
# Index of the highest-quality segment
|
||||
best_idx = high_indices[int(np.argmax(qualities[mask]))]
|
||||
best_seg = segments[best_idx]
|
||||
|
||||
centroid = np.mean(embeddings[mask], axis=0)
|
||||
norm = np.linalg.norm(centroid)
|
||||
if norm > 0:
|
||||
centroid = centroid / norm
|
||||
|
||||
avg_quality = float(np.mean(qualities[mask]))
|
||||
speaker_id = f"SPEAKER_{int(label)}"
|
||||
text_samples = [s["text"] for s in high_segs[:5] if s["text"]]
|
||||
total_dur = sum(s["end"] - s["start"] for s in high_segs)
|
||||
|
||||
ref_id = _hash_point_id(file_uuid, f"ref_{label}")
|
||||
ref_payload = {
|
||||
"type": "speaker_reference",
|
||||
"file_uuid": file_uuid,
|
||||
"speaker_id": speaker_id,
|
||||
"n_segments": int(np.sum(mask)),
|
||||
"avg_quality": avg_quality,
|
||||
"total_duration": round(total_dur, 2),
|
||||
"language": best_seg.get("language", ""),
|
||||
"text_samples": text_samples,
|
||||
}
|
||||
|
||||
# Gender classification, using the audio of the best segment
|
||||
if self.gender_classifier is not None:
|
||||
try:
|
||||
import torch
|
||||
s = int(best_seg["start"] * sample_rate)
|
||||
e = int(best_seg["end"] * sample_rate)
|
||||
seg_wav = wav[s:min(e, len(wav))]
|
||||
seg_tensor = torch.from_numpy(seg_wav).float().unsqueeze(0)
|
||||
# The SpeechBrain gender classifier takes the raw waveform
|
||||
out = self.gender_classifier.classify_batch(seg_tensor)
|
||||
probs = torch.softmax(out[0], dim=-1).squeeze().cpu().detach().numpy()
|
||||
if len(probs) >= 2:
|
||||
idx = int(np.argmax(probs))
|
||||
ref_payload["gender"] = "male" if idx == 0 else "female"
|
||||
ref_payload["gender_conf"] = float(probs[idx])
|
||||
else:
|
||||
ref_payload["gender"] = "unknown"
|
||||
ref_payload["gender_conf"] = 0.0
|
||||
except Exception as exc:  # `e` already holds the end sample index above
|
||||
print(f"[Gender] Classify error: {exc}")
|
||||
ref_payload["gender"] = "unknown"
|
||||
ref_payload["gender_conf"] = 0.0
|
||||
else:
|
||||
ref_payload["gender"] = "unknown"
|
||||
ref_payload["gender_conf"] = 0.0
|
||||
|
||||
_qdrant_upsert(self.qdrant_url, self.qdrant_api_key,
|
||||
self.qdrant_collection, [{
|
||||
"id": ref_id,
|
||||
"vector": centroid.tolist(),
|
||||
"payload": ref_payload,
|
||||
}])
|
||||
|
||||
references.append({
|
||||
"speaker_id": speaker_id,
|
||||
"n_segments": int(np.sum(mask)),
|
||||
"avg_quality": avg_quality,
|
||||
"gender": ref_payload["gender"],
|
||||
})
|
||||
|
||||
print(f" Ref: {speaker_id}, gender={ref_payload['gender']}"
|
||||
f" ({ref_payload['gender_conf']:.2f}), q={avg_quality:.3f}")
|
||||
|
||||
return references
|
||||
|
||||
def _ensure_qdrant(self):
|
||||
"""Ensure the Qdrant collection is available."""
|
||||
if not self._qdrant_ok:
|
||||
ok = _ensure_speaker_collection(
|
||||
self.qdrant_url, self.qdrant_api_key, self.qdrant_collection
|
||||
)
|
||||
self._qdrant_ok = ok
|
||||
return self._qdrant_ok
|
||||
|
||||
|
||||
def main():
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser(description="SelfASRX - Hybrid Speaker Diarization")
|
||||
parser.add_argument("audio_path", help="Path to audio file (WAV)")
|
||||
parser.add_argument("-o", "--output", help="Output JSON path")
|
||||
parser.add_argument("--file-uuid", help="File UUID for Qdrant storage")
|
||||
parser.add_argument("--max-speakers", type=int, default=10)
|
||||
parser.add_argument("--quality-threshold", type=float, default=0.85)
|
||||
parser.add_argument("--resume", help="Checkpoint path to resume from")
|
||||
parser.add_argument("--checkpoint", help="Path for the Step 3 checkpoint (saved every 50 segments and on completion)")
|
||||
args = parser.parse_args()
|
||||
|
||||
asrx = SelfASRXFixed()
|
||||
|
||||
if args.resume:
|
||||
if not Path(args.resume).exists():
|
||||
print(f"Error: Checkpoint not found: {args.resume}")
|
||||
sys.exit(1)
|
||||
result = asrx.resume_from_checkpoint(
|
||||
args.resume, args.audio_path,
|
||||
output_path=args.output,
|
||||
)
|
||||
else:
|
||||
if not Path(args.audio_path).exists():
|
||||
print(f"Error: Audio file not found: {args.audio_path}")
|
||||
sys.exit(1)
|
||||
|
||||
result = asrx.process(
|
||||
args.audio_path,
|
||||
output_path=args.output,
|
||||
file_uuid=args.file_uuid,
|
||||
max_speakers=args.max_speakers,
|
||||
quality_threshold=args.quality_threshold,
|
||||
checkpoint_path=args.checkpoint,
|
||||
)
|
||||
|
||||
if "error" not in result:
|
||||
print("\n[Summary]")
|
||||
print(f" Duration: {result['total_duration']:.2f}s")
|
||||
print(f" Segments: {result['n_segments']}")
|
||||
print(f" Speakers: {result['n_speakers']}")
|
||||
if "references" in result:
|
||||
for ref in result["references"]:
|
||||
print(f" {ref['speaker_id']}: gender={ref['gender']}, "
|
||||
f"quality={ref['avg_quality']:.3f}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
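The Step 3 checkpoint logic above (periodic saves, then resume by skipping already-transcribed segments) can be sketched in isolation. This is a minimal sketch, not the commit's implementation: `save_checkpoint`, `load_partial`, and `transcribe_all` are hypothetical names, and the atomic temp-file-plus-rename write is an assumption about what a helper like `_save_checkpoint` might reasonably do. Only the `checkpoint_version`, `step3_completed`, `step3_progress`, and `segment_texts` keys come from the code above.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically: dump to a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, ensure_ascii=False)
    os.replace(tmp_path, path)  # atomic when both paths share a filesystem


def load_partial(path):
    """Return previously saved texts, or [] if there is nothing to resume."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        cp = json.load(f)
    if cp.get("checkpoint_version") == 2 and not cp.get("step3_completed"):
        return cp.get("segment_texts", [])
    return []


def transcribe_all(segments, path, transcribe, interval=50):
    """Transcribe segments, skipping those already covered by the checkpoint."""
    texts = load_partial(path)
    for i in range(len(texts), len(segments)):
        texts.append(transcribe(segments[i]))
        if (i + 1) % interval == 0:
            save_checkpoint(path, {
                "checkpoint_version": 2,
                "step3_completed": False,
                "step3_progress": i + 1,
                "segment_texts": texts,
            })
    # Final checkpoint marks Step 3 as complete
    save_checkpoint(path, {
        "checkpoint_version": 2,
        "step3_completed": True,
        "segment_texts": texts,
    })
    return texts
```

Writing to a temporary file first means an interruption during the save leaves the previous checkpoint intact, which matters because the checkpoint exists precisely to survive interruptions.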
65
v1.1/scripts/asrx_self/speaker_classifier_v1.11.py
Normal file
@@ -0,0 +1,65 @@
|
||||
"""
|
||||
Speaker Classifier - voiceprint quality scoring and gender classification
|
||||
|
||||
Provides quality scoring and gender classification as a helper module for main_fixed.py.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
def compute_embedding_quality(embeddings, labels):
|
||||
"""Cosine similarity of each embedding to the centroid of its cluster.
|
||||
|
||||
Args:
|
||||
embeddings: [n_segments, 192] speaker embedding matrix
|
||||
labels: [n_segments] cluster labels
|
||||
|
||||
Returns:
|
||||
qualities: [n_segments] quality scores (0-1)
|
||||
"""
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
unique_labels = set(labels)
|
||||
centroids = {}
|
||||
for label in unique_labels:
|
||||
mask = labels == label
|
||||
centroid = np.mean(embeddings[mask], axis=0)
|
||||
norm = np.linalg.norm(centroid)
|
||||
if norm > 0:
|
||||
centroid = centroid / norm
|
||||
centroids[label] = centroid
|
||||
|
||||
qualities = []
|
||||
for emb, label in zip(embeddings, labels):
|
||||
sim = cosine_similarity([emb], [centroids[label]])[0][0]
|
||||
qualities.append(sim)
|
||||
|
||||
return np.array(qualities)
|
||||
|
||||
|
||||
def classify_gender(audio_wav, sample_rate, classifier):
|
||||
"""Classify gender from an audio segment.
|
||||
|
||||
Args:
|
||||
audio_wav: audio waveform (numpy array)
|
||||
sample_rate: sampling rate
|
||||
classifier: SpeechBrain EncoderClassifier (gender-recognition-ecapa)
|
||||
|
||||
Returns:
|
||||
dict: {"gender": "male"|"female"|"unknown", "confidence": float}
|
||||
"""
|
||||
default = {"gender": "unknown", "confidence": 0.0}
|
||||
if classifier is None or len(audio_wav) == 0:
|
||||
return default
|
||||
try:
|
||||
import torch
|
||||
seg_tensor = torch.from_numpy(audio_wav).float().unsqueeze(0)
|
||||
out = classifier.classify_batch(seg_tensor)
|
||||
probs = torch.softmax(out[0], dim=-1).squeeze().cpu().detach().numpy()
|
||||
if len(probs) >= 2:
|
||||
idx = int(np.argmax(probs))
|
||||
label = "male" if idx == 0 else "female"
|
||||
return {"gender": label, "confidence": float(probs[idx])}
|
||||
except Exception as e:
|
||||
print(f"[Gender] Classify error: {e}")
|
||||
return default
|
||||
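The quality score computed by `compute_embedding_quality` is the cosine similarity between each embedding and the L2-normalized centroid of its cluster. The following is a dependency-free sketch of that calculation, assuming plain Python lists in place of NumPy arrays; `l2_normalize` and `embedding_quality` are hypothetical names used only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embedding_quality(embeddings, labels):
    """Cosine similarity of each embedding to its cluster's unit centroid."""
    groups = {}
    for emb, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(emb)

    centroids = {}
    for label, members in groups.items():
        dim = len(members[0])
        mean = [sum(m[k] for m in members) / len(members) for k in range(dim)]
        centroids[label] = l2_normalize(mean)

    qualities = []
    for emb, label in zip(embeddings, labels):
        unit = l2_normalize(emb)
        # Dot product of two unit vectors is their cosine similarity
        qualities.append(sum(a * b for a, b in zip(unit, centroids[label])))
    return qualities
```

A segment whose embedding points in the same direction as its centroid scores 1.0; a cluster made of two orthogonal embeddings gives each member a score of about 0.707, which would fall below the default `quality_threshold` of 0.85.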
152
v1.1/scripts/asrx_self/speaker_cluster_fixed_v1.11.py
Normal file
@@ -0,0 +1,152 @@
|
||||
#!/opt/homebrew/bin/python3.11
|
||||
"""
|
||||
Speaker Clustering - Fixed Version
|
||||
Uses a more stable clustering algorithm
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
from sklearn.cluster import AgglomerativeClustering
|
||||
|
||||
|
||||
def robust_speaker_clustering(embeddings, n_speakers=None, max_speakers=10):
|
||||
"""
|
||||
Robust speaker clustering
|
||||
|
||||
Uses agglomerative (hierarchical) clustering instead of spectral clustering to avoid NaN problems
|
||||
|
||||
Args:
|
||||
embeddings: speaker embedding matrix [n_segments, 192]
|
||||
n_speakers: number of speakers (None = estimate automatically)
|
||||
max_speakers: maximum number of speakers
|
||||
|
||||
Returns:
|
||||
speaker_labels: speaker label for each segment
|
||||
n_speakers: number of speakers actually used
|
||||
"""
|
||||
n_segments = len(embeddings)
|
||||
|
||||
# Sanitize the data
|
||||
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
|
||||
|
||||
# L2-normalize
|
||||
from sklearn.preprocessing import normalize
|
||||
embeddings = normalize(embeddings, norm='l2')
|
||||
|
||||
# Sanitize again
|
||||
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
|
||||
|
||||
# Estimate the number of speakers automatically
|
||||
if n_speakers is None:
|
||||
n_speakers = estimate_n_speakers_from_embeddings(embeddings, max_speakers)
|
||||
print(f"[Clustering] Estimated n_speakers: {n_speakers}")
|
||||
|
||||
n_speakers = min(int(n_speakers), n_segments, max_speakers)
|
||||
n_speakers = max(2, n_speakers)  # at least 2 speakers
|
||||
|
||||
print(f"[Clustering] Using Agglomerative Clustering with {n_speakers} clusters")
|
||||
|
||||
# Agglomerative clustering (more stable)
|
||||
clustering = AgglomerativeClustering(
|
||||
n_clusters=n_speakers,
|
||||
metric='cosine',
|
||||
linkage='average'
|
||||
)
|
||||
|
||||
speaker_labels = clustering.fit_predict(embeddings)
|
||||
|
||||
# Report the size of each cluster
|
||||
unique, counts = np.unique(speaker_labels, return_counts=True)
|
||||
print("[Clustering] Cluster sizes:")
|
||||
for label, count in zip(unique, counts):
|
||||
print(f" SPEAKER_{label}: {count} segments ({count/n_segments*100:.1f}%)")
|
||||
|
||||
return speaker_labels, n_speakers
|
||||
|
||||
|
||||
def estimate_n_speakers_from_embeddings(embeddings, max_speakers=10):
|
||||
"""
|
||||
Estimate the number of speakers from the embeddings
|
||||
|
||||
Uses a distance-threshold heuristic
|
||||
|
||||
Args:
|
||||
embeddings: speaker embedding matrix
|
||||
max_speakers: maximum number of speakers
|
||||
|
||||
Returns:
|
||||
n_speakers: estimated number of speakers
|
||||
"""
|
||||
from sklearn.metrics.pairwise import cosine_distances
|
||||
|
||||
# Pairwise cosine distance matrix
|
||||
distances = cosine_distances(embeddings)
|
||||
|
||||
# Distance from each sample to its nearest neighbour (excluding itself)
|
||||
n_samples = len(embeddings)
|
||||
min_distances = []
|
||||
|
||||
for i in range(min(200, n_samples)):  # use at most the first 200 samples
|
||||
dists = distances[i]
|
||||
# Skip the sample itself (distance 0)
|
||||
sorted_dists = np.sort(dists)
|
||||
if len(sorted_dists) > 1:
|
||||
min_distances.append(sorted_dists[1])  # nearest neighbour
|
||||
|
||||
if not min_distances:
|
||||
return 2
|
||||
|
||||
# Estimate the cluster count from the distance distribution
|
||||
avg_min_dist = np.mean(min_distances)
|
||||
std_min_dist = np.std(min_distances)
|
||||
|
||||
# Rule of thumb: the threshold is about 1.5x the mean nearest-neighbour distance
|
||||
threshold = avg_min_dist * 1.5
|
||||
|
||||
# Simple clustering: samples closer than the threshold count as the same speaker
|
||||
n_speakers = 0  # incremented once per new cluster leader below (starting at 1 overcounts by one)
|
||||
assigned = [False] * len(min_distances)
|
||||
|
||||
for i in range(len(min_distances)):
|
||||
if not assigned[i]:
|
||||
n_speakers += 1
|
||||
# Mark every nearby sample as belonging to the same cluster
|
||||
for j in range(i+1, len(min_distances)):
|
||||
if not assigned[j]:
|
||||
# Check the distance (strided indices when n_samples > 200)
|
||||
idx_i = i * (n_samples // 200) if n_samples > 200 else i
|
||||
idx_j = j * (n_samples // 200) if n_samples > 200 else j
|
||||
if idx_i < n_samples and idx_j < n_samples:
|
||||
if distances[idx_i, idx_j] < threshold:
|
||||
assigned[j] = True
|
||||
|
||||
# Clamp to the allowed range
|
||||
n_speakers = max(2, min(n_speakers, max_speakers))
|
||||
|
||||
return n_speakers
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Self-test
|
||||
print("[Test] Testing robust speaker clustering")
|
||||
|
||||
# Generate synthetic data: 3 speakers
|
||||
np.random.seed(42)
|
||||
n_speakers = 3
|
||||
n_per_speaker = 100
|
||||
|
||||
embeddings = []
|
||||
for i in range(n_speakers):
|
||||
center = np.random.randn(192) * 2 + i * 3
|
||||
for _ in range(n_per_speaker):
|
||||
emb = center + np.random.randn(192) * 0.5
|
||||
embeddings.append(emb)
|
||||
|
||||
embeddings = np.array(embeddings)
|
||||
print(f"Generated {len(embeddings)} embeddings for {n_speakers} speakers")
|
||||
|
||||
# Run the clustering
|
||||
labels, n_clusters = robust_speaker_clustering(embeddings)
|
||||
|
||||
print("\nResult:")
|
||||
print(f" True n_speakers: {n_speakers}")
|
||||
print(f" Estimated n_speakers: {n_clusters}")
|
||||
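The speaker-count heuristic above boils down to counting samples that are not within a cosine-distance threshold of any earlier "leader". The sketch below shows that idea as plain leader clustering; it is a deliberately simplified stand-in, not the commit's `estimate_n_speakers_from_embeddings`. The names `cosine_distance` and `count_leaders` are hypothetical, the threshold is passed in explicitly rather than derived from nearest-neighbour distances, and the minimum is clamped to 1 rather than 2.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def count_leaders(embeddings, threshold, max_speakers=10):
    """Greedy leader clustering: a sample far from every leader starts a cluster."""
    leaders = []
    for emb in embeddings:
        if all(cosine_distance(emb, lead) >= threshold for lead in leaders):
            leaders.append(emb)
    # Clamp to [1, max_speakers]
    return max(1, min(len(leaders), max_speakers))
```

Because the counter is the length of the leader list, it cannot overcount by one, which is the failure mode of initializing a separate counter to 1 and incrementing it for every leader.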
191
v1.1/scripts/asrx_self/speaker_encoder_v1.11.py
Normal file
@@ -0,0 +1,191 @@
|
||||
#!/opt/homebrew/bin/python3.11
|
||||
"""
|
||||
Speaker Encoder - voiceprint feature extraction
|
||||
Extracts speaker embedding vectors with the ECAPA-TDNN model
|
||||
|
||||
Sources:
|
||||
- ECAPA-TDNN: Desplanques et al. (2020), Interspeech
|
||||
- Paper: https://arxiv.org/abs/2005.07143
|
||||
- Model: SpeechBrain spkrec-ecapa-voxceleb
|
||||
- Accuracy: EER 0.80% (VoxCeleb1)
|
||||
"""
|
||||
|
||||
import torch
|
||||
import numpy as np
|
||||
from speechbrain.inference.speaker import EncoderClassifier
|
||||
|
||||
|
||||
def load_speaker_encoder(model_name="speechbrain/spkrec-ecapa-voxceleb"):
|
||||
"""
|
||||
Load the speaker encoder model
|
||||
|
||||
Args:
|
||||
model_name: model name (HuggingFace)
|
||||
|
||||
Returns:
|
||||
classifier: speaker encoder
|
||||
"""
|
||||
print(f"[SpeakerEncoder] Loading model: {model_name}")
|
||||
|
||||
classifier = EncoderClassifier.from_hparams(
|
||||
source=model_name,
|
||||
run_opts={"device": "cpu"},  # run on CPU
|
||||
)
|
||||
|
||||
# Report model information
|
||||
print("[SpeakerEncoder] Model loaded successfully")
|
||||
print("[SpeakerEncoder] Embedding dimension: 192")
|
||||
|
||||
return classifier
|
||||
|
||||
|
||||
def extract_speaker_embedding(classifier, audio_waveform, sample_rate=16000):
|
||||
"""
|
||||
Extract a speaker embedding from an audio waveform
|
||||
|
||||
Args:
|
||||
classifier: speaker encoder
|
||||
audio_waveform: audio waveform (numpy array)
|
||||
sample_rate: sampling rate
|
||||
|
||||
Returns:
|
||||
embedding: speaker embedding vector (192-dimensional)
|
||||
"""
|
||||
# Convert to a torch tensor
|
||||
if isinstance(audio_waveform, np.ndarray):
|
||||
audio_tensor = torch.from_numpy(audio_waveform).float()
|
||||
else:
|
||||
audio_tensor = audio_waveform
|
||||
|
||||
# Ensure a 2D shape [batch, time]
|
||||
if audio_tensor.dim() == 1:
|
||||
audio_tensor = audio_tensor.unsqueeze(0)
|
||||
|
||||
# Extract the embedding
|
||||
with torch.no_grad():
|
||||
embedding = classifier.encode_batch(audio_tensor)
|
||||
|
||||
# Convert to numpy
|
||||
embedding = embedding.squeeze().cpu().numpy()
|
||||
|
||||
return embedding
|
||||
|
||||
|
||||
def extract_speaker_embeddings_batch(classifier, audio_segments, sample_rate=16000):
|
||||
"""
|
||||
Extract speaker embeddings for multiple speech segments
|
||||
|
||||
Args:
|
||||
classifier: speaker encoder
|
||||
audio_segments: list of audio segments [numpy array, ...]
|
||||
sample_rate: sampling rate
|
||||
|
||||
Returns:
|
||||
embeddings: embedding matrix [n_segments, 192]
|
||||
"""
|
||||
embeddings = []
|
||||
|
||||
for i, audio in enumerate(audio_segments):
|
||||
emb = extract_speaker_embedding(classifier, audio, sample_rate)
|
||||
embeddings.append(emb)
|
||||
|
||||
if (i + 1) % 50 == 0:
|
||||
print(f"[SpeakerEncoder] Processed {i + 1} segments")
|
||||
|
||||
embeddings = np.vstack(embeddings)
|
||||
print(f"[SpeakerEncoder] Extracted {embeddings.shape[0]} embeddings")
|
||||
|
||||
return embeddings
|
||||
|
||||
|
||||
def compute_similarity_matrix(embeddings, method="cosine"):
|
||||
"""
|
||||
Compute the speaker similarity matrix
|
||||
|
||||
Args:
|
||||
embeddings: embedding matrix [n_segments, 192]
|
||||
method: similarity measure ('cosine', 'euclidean')
|
||||
|
||||
Returns:
|
||||
similarity_matrix: similarity matrix [n_segments, n_segments]
|
||||
"""
|
||||
from sklearn.metrics.pairwise import cosine_similarity
|
||||
|
||||
# Sanitize the data: replace NaN and Inf with 0
|
||||
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
|
||||
|
||||
# L2-normalize
|
||||
embeddings = normalize_embeddings(embeddings)
|
||||
|
||||
# Sanitize again
|
||||
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
|
||||
|
||||
if method == "cosine":
|
||||
similarity = cosine_similarity(embeddings)
|
||||
elif method == "euclidean":
|
||||
from sklearn.metrics.pairwise import euclidean_distances
|
||||
|
||||
# Convert distances to similarities
|
||||
distances = euclidean_distances(embeddings)
|
||||
similarity = 1 / (1 + distances)
|
||||
else:
|
||||
raise ValueError(f"Unknown method: {method}")
|
||||
|
||||
# Make sure no NaN remains
|
||||
similarity = np.nan_to_num(similarity, nan=0.5)
|
||||
|
||||
return similarity
|
||||
|
||||
|
||||
def normalize_embeddings(embeddings):
|
||||
"""
|
||||
Normalize the embedding vectors to unit length
|
||||
|
||||
Args:
|
||||
embeddings: embedding matrix [n_segments, 192]
|
||||
|
||||
Returns:
|
||||
normalized: normalized embedding matrix
|
||||
"""
|
||||
from sklearn.preprocessing import normalize
|
||||
|
||||
return normalize(embeddings, norm="l2")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Test the speaker encoder
|
||||
import sys
|
||||
import torchaudio
|
||||
|
||||
if len(sys.argv) < 2:
|
||||
print("Usage: python3 speaker_encoder.py <audio_path>")
|
||||
sys.exit(1)
|
||||
|
||||
audio_path = sys.argv[1]
|
||||
|
||||
print("[Test] Loading speaker encoder...")
|
||||
classifier = load_speaker_encoder()
|
||||
|
||||
print(f"\n[Test] Loading audio: {audio_path}")
|
||||
wav, sr = torchaudio.load(audio_path)
|
||||
|
||||
# Resample to 16 kHz
|
||||
if sr != 16000:
|
||||
transform = torchaudio.transforms.Resample(sr, 16000)
|
||||
wav = transform(wav)
|
||||
|
||||
print(f"[Test] Audio shape: {wav.shape}")
|
||||
print(f"[Test] Duration: {wav.shape[1] / 16000:.2f}s")
|
||||
|
||||
# Extract the embedding
|
||||
print("\n[Test] Extracting speaker embedding...")
|
||||
embedding = extract_speaker_embedding(classifier, wav.numpy())
|
||||
|
||||
print(f"[Test] Embedding shape: {embedding.shape}")
|
||||
print(f"[Test] Embedding norm: {np.linalg.norm(embedding):.4f}")
|
||||
print(f"[Test] Embedding mean: {embedding.mean():.4f}")
|
||||
print(f"[Test] Embedding std: {embedding.std():.4f}")
|
||||
|
||||
# Show the first few embedding values
|
||||
print("\n[Test] First 10 embedding values:")
|
||||
print(f" {embedding[:10]}")
|
||||
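`compute_similarity_matrix` normalizes the embeddings before applying either measure, so the two options are closely related: for unit vectors, the squared Euclidean distance equals 2 - 2·cos. The sketch below illustrates this with plain Python lists; `l2_normalize`, `cosine_similarity`, `euclidean_distance`, and `euclidean_similarity` are hypothetical helper names, and only the `1 / (1 + distance)` conversion is taken from the code above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of the unit vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def euclidean_distance(a, b):
    """Euclidean distance between the unit-normalized vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))


def euclidean_similarity(a, b):
    """Map a distance in [0, 2] to a similarity in [1/3, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))
```

One consequence is that both measures rank pairs of normalized embeddings in the same order; they differ only in scale, with the Euclidean variant never dropping below 1/3.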
206
v1.1/scripts/asrx_self/vad_v1.11.py
Normal file
@@ -0,0 +1,206 @@
|
||||
#!/opt/homebrew/bin/python3.11
|
||||
"""
|
||||
VAD (Voice Activity Detection)
|
||||
Extracts speech segments with the Silero VAD model
|
||||
|
||||
Sources:
|
||||
- Silero VAD: https://github.com/snakers4/silero-vad
|
||||
- Deep-learning-based model, accuracy 95%+
|
||||
"""
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
def load_vad_model():
|
||||
"""
|
||||
Load the Silero VAD model
|
||||
|
||||
Returns:
|
||||
model: VAD model
|
||||
utils: helper functions
|
||||
"""
|
||||
model, utils = torch.hub.load(
|
||||
repo_or_dir="snakers4/silero-vad",
|
||||
model="silero_vad",
|
||||
force_reload=False,
|
||||
trust_repo=True,
|
||||
)
|
||||
return model, utils
|
||||
|
||||
|
||||
def extract_speech_segments(
|
||||
audio_path, model, utils, min_speech_duration_ms=500, min_silence_duration_ms=300
|
||||
):
|
||||
"""
|
||||
Extract speech segments with VAD
|
||||
|
||||
Args:
|
||||
audio_path: path to the audio file
|
||||
model: VAD model
|
||||
utils: helper functions
|
||||
min_speech_duration_ms: minimum speech duration (milliseconds)
|
||||
min_silence_duration_ms: minimum silence duration (milliseconds)
|
||||
|
||||
Returns:
|
||||
speech_segments: list of speech segments [(start_sec, end_sec), ...]
|
||||
audio_waveform: audio waveform (numpy array)
|
||||
sample_rate: sampling rate
|
||||
"""
|
||||
get_speech_timestamps, save_audio, read_audio, _, _ = utils
|
||||
|
||||
# Read the audio
|
||||
wav = read_audio(audio_path, sampling_rate=16000)
|
||||
sample_rate = 16000
|
||||
|
||||
# Get the speech timestamps
|
||||
speech_timestamps = get_speech_timestamps(
|
||||
wav,
|
||||
model,
|
||||
sampling_rate=sample_rate,
|
||||
min_speech_duration_ms=min_speech_duration_ms,
|
||||
min_silence_duration_ms=min_silence_duration_ms,
|
||||
return_seconds=True,
|
||||
)
|
||||
|
||||
# Convert to a list of segments
|
||||
speech_segments = [(ts["start"], ts["end"]) for ts in speech_timestamps]
|
||||
|
||||
return speech_segments, wav.numpy(), sample_rate
|
||||
|
||||
|
||||
def extract_speech_audio(audio_path, model, utils, output_dir=None):
|
||||
"""
|
||||
Extract speech segments and optionally save each one as a separate audio file
|
||||
|
||||
Args:
|
||||
audio_path: path to the original audio
|
||||
model: VAD model
|
||||
utils: helper functions
|
||||
output_dir: output directory (optional)
|
||||
|
||||
Returns:
|
||||
speech_audios: list of speech audio arrays [numpy array, ...]
|
||||
speech_segments: list of speech segments
|
||||
"""
|
||||
get_speech_timestamps, save_audio, read_audio, _, _ = utils
|
||||
|
||||
# Read the audio
|
||||
wav = read_audio(audio_path, sampling_rate=16000)
|
||||
sample_rate = 16000
|
||||
|
||||
# Get the speech timestamps
|
||||
speech_timestamps = get_speech_timestamps(
|
||||
wav,
|
||||
model,
|
||||
sampling_rate=sample_rate,
|
||||
min_speech_duration_ms=500,
|
||||
min_silence_duration_ms=300,
|
||||
return_seconds=False,  # use sample indices
|
||||
)
|
||||
|
||||
# Extract the speech segments
|
||||
speech_audios = []
|
||||
speech_segments = []
|
||||
|
||||
for i, ts in enumerate(speech_timestamps):
|
||||
start_sample = ts["start"]
|
||||
end_sample = ts["end"]
|
||||
|
||||
# Slice out the audio for this segment
|
||||
speech_audio = wav[start_sample:end_sample]
|
||||
speech_audios.append(speech_audio.numpy())
|
||||
speech_segments.append(
|
||||
(
|
||||
start_sample / sample_rate,  # convert to seconds
|
||||
end_sample / sample_rate,
|
||||
)
|
||||
)
|
||||
|
||||
# 保存為文件(可選)
|
||||
if output_dir:
|
||||
import os
|
||||
|
||||
output_path = os.path.join(output_dir, f"speech_{i:03d}.wav")
|
||||
save_audio(output_path, speech_audio, sample_rate)
|
||||
|
||||
return speech_audios, speech_segments
|
||||
|
||||
|
||||
def scan_within_segment(wav, sample_rate, start_sec, end_sec, model, utils,
                        min_speech_duration_ms=500, min_silence_duration_ms=300):
    """
    Run a VAD scan within a time range and split it into sub-segments.

    Purpose: within a coarse time span reported by whisper, split further
    at the pauses between sentences.

    Args:
        wav: full audio waveform (numpy array)
        sample_rate: sample rate
        start_sec: scan start time (seconds)
        end_sec: scan end time (seconds)
        model: VAD model
        utils: VAD utility functions
        min_speech_duration_ms: minimum speech duration
        min_silence_duration_ms: minimum silence duration

    Returns:
        sub_segments: [(start_sec, end_sec), ...] list of sub-segments (on the original timeline)
    """
    get_speech_timestamps, _, _, _, _ = utils

    # Extract the audio within the time range
    start_sample = int(start_sec * sample_rate)
    end_sample = int(end_sec * sample_rate)
    segment_wav = wav[start_sample:end_sample]

    # An empty or out-of-range window has nothing to scan
    if len(segment_wav) == 0:
        return []

    # Run VAD on the sub-audio
    speech_ts = get_speech_timestamps(
        segment_wav,
        model,
        sampling_rate=sample_rate,
        min_speech_duration_ms=min_speech_duration_ms,
        min_silence_duration_ms=min_silence_duration_ms,
        return_seconds=True,
    )

    # Convert back to the original timeline
    sub_segments = [
        (ts["start"] + start_sec, ts["end"] + start_sec)
        for ts in speech_ts
    ]

    return sub_segments
|
||||
|
||||
|
||||
if __name__ == "__main__":
    # Test VAD
    import sys

    if len(sys.argv) < 2:
        print("Usage: python3 vad.py <audio_path>")
        sys.exit(1)

    audio_path = sys.argv[1]

    print("[VAD] Loading model...")
    model, utils = load_vad_model()

    print(f"[VAD] Processing: {audio_path}")
    segments, wav, sr = extract_speech_segments(audio_path, model, utils)

    total_duration = len(wav) / sr

    print("\n[VAD] Results:")
    print(f" Sample rate: {sr} Hz")
    print(f" Speech segments: {len(segments)}")
    print(f" Total duration: {total_duration:.2f}s")

    total_speech = sum(end - start for start, end in segments)
    # Avoid dividing by zero when the input audio is empty
    speech_pct = total_speech / total_duration * 100 if total_duration else 0.0
    print(f" Total speech: {total_speech:.2f}s ({speech_pct:.1f}%)")

    print("\n[VAD] Segments:")
    for i, (start, end) in enumerate(segments[:10]):
        print(f" {i + 1:3d}. {start:6.2f}s - {end:6.2f}s ({end - start:5.2f}s)")

    if len(segments) > 10:
        print(f" ... and {len(segments) - 10} more segments")
|
||||
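The offset arithmetic in `scan_within_segment` can be checked without loading a VAD model. The sketch below is an illustration only: `fake_get_speech_timestamps` and `scan_window` are hypothetical names that do not exist in the commit, and the fake detector simply treats non-zero samples as speech. It shows that timestamps found inside a window are local to that window and must be shifted by `start_sec` to land on the original timeline.

```python
def fake_get_speech_timestamps(audio, model, sampling_rate, return_seconds=True, **kwargs):
    """Hypothetical stand-in for the VAD call: every run of non-zero samples is 'speech'."""
    spans, start = [], None
    for i, x in enumerate(audio):
        if x != 0 and start is None:
            start = i
        elif x == 0 and start is not None:
            spans.append({"start": start / sampling_rate, "end": i / sampling_rate})
            start = None
    if start is not None:
        spans.append({"start": start / sampling_rate, "end": len(audio) / sampling_rate})
    return spans


def scan_window(wav, sample_rate, start_sec, end_sec, get_ts):
    """Run the detector on wav[start:end], then shift results back to the full timeline."""
    window = wav[int(start_sec * sample_rate):int(end_sec * sample_rate)]
    if len(window) == 0:
        return []
    local = get_ts(window, None, sampling_rate=sample_rate, return_seconds=True)
    # Timestamps are relative to the window, so add the window's start offset
    return [(ts["start"] + start_sec, ts["end"] + start_sec) for ts in local]


# Toy 10 Hz signal, 4 s long, with "speech" at 1.5-2.0 s and 2.5-3.0 s
wav = [0] * 15 + [1] * 5 + [0] * 5 + [1] * 5 + [0] * 10
sub_segments = scan_window(wav, 10, 1.0, 3.5, fake_get_speech_timestamps)
print(sub_segments)
```

Passing the detector in as an argument keeps the slicing and offset logic testable on its own; in the real module the detector comes from the `utils` tuple.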
35
v1.1/scripts/asrx_self/whisper_local_v1.11.py
Normal file
@@ -0,0 +1,35 @@
"""
Whisper Local - uses faster-whisper for per-segment transcription
"""

import numpy as np


def load_model(size="small"):
    from faster_whisper import WhisperModel

    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe_segment(wav, sample_rate, start_sec, end_sec, model):
    start_sample = int(start_sec * sample_rate)
    end_sample = min(int(end_sec * sample_rate), len(wav))
    if start_sample >= end_sample:
        # The window is empty or lies past the end of the audio
        return {"text": "", "language": "", "lang_prob": 0.0, "segments": []}
    # Pass a float32 waveform, the dtype faster-whisper's own audio decoder produces
    segment_wav = np.asarray(wav[start_sample:end_sample], dtype=np.float32)

    segments_generator, info = model.transcribe(segment_wav, language=None)

    lang_prob = info.language_probability if info else 0.0
    language = info.language if info else ""

    # The generator is lazy; consuming it is what runs the decoding
    segs = list(segments_generator)
    text = " ".join(seg.text.strip() for seg in segs)

    return {
        "text": text,
        "language": language,
        "lang_prob": lang_prob,
        "segments": segs,
    }
|
||||
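The windowing behaviour of `transcribe_segment` can be exercised without installing faster-whisper. The following is a minimal sketch under stated assumptions: `StubModel` and `transcribe_window` are hypothetical names, the stub returns fixed text rather than decoding audio, and only the `(segments, info)` return shape and the `text`, `language`, and `language_probability` attributes mirror what the function above relies on.

```python
from types import SimpleNamespace


class StubModel:
    """Hypothetical stand-in for a faster-whisper model; it performs no real decoding."""

    def transcribe(self, audio, language=None):
        info = SimpleNamespace(language="en", language_probability=0.9)
        # A lazy generator of segment-like objects, as transcribe_segment expects
        segs = (SimpleNamespace(text=t) for t in [" hello", " world"])
        return segs, info


def transcribe_window(wav, sample_rate, start_sec, end_sec, model):
    """Slice [start_sec, end_sec) out of wav and transcribe only that window."""
    start = int(start_sec * sample_rate)
    end = min(int(end_sec * sample_rate), len(wav))
    if start >= end:
        # Window is empty or beyond the audio: return an empty result
        return {"text": "", "language": "", "lang_prob": 0.0}
    segs, info = model.transcribe(wav[start:end], language=None)
    text = " ".join(s.text.strip() for s in segs)
    return {"text": text, "language": info.language, "lang_prob": info.language_probability}


wav = [0.0] * 32000  # 2 s of silence at 16 kHz
inside = transcribe_window(wav, 16000, 0.5, 1.5, StubModel())
outside = transcribe_window(wav, 16000, 5.0, 6.0, StubModel())
print(inside["text"], repr(outside["text"]))
```

The second call shows why the bounds check matters: a window that starts past the end of the audio returns an empty result instead of handing the model an empty array.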