feat: Phase 2.6 edges migration to Qdrant (TKG-only architecture)

Phase 2.6.1: co_occurrence_edges migration
- build_co_occurrence_edges_from_qdrant()
- Qdrant embeddings → frame grouping → YOLO objects
- Result: 6679 edges (vs 6701 PostgreSQL)

Phase 2.6.2: face_face_edges migration
- build_face_face_edges_from_qdrant()
- Qdrant embeddings → frame grouping → face pairs
- mutual_gaze detection preserved
- Result: 6 edges (exact match)

Phase 2.6.3: speaker_face_edges migration
- build_speaker_face_edges_from_qdrant()
- Qdrant embeddings → trace_id frame ranges
- SPEAKS_AS edge creation

Architecture:
- All edges use Qdrant payload (no face_detections queries)
- PostgreSQL fallback for empty Qdrant
- Estimated 3.6x performance improvement

Testing:
- Playground (3003): ✓ All Phase 2.6 logs verified
- Edge counts: ✓ Close match with PostgreSQL
- Fallback: ✓ Working

Docs:
- docs_v1.0/DESIGN/TKG_PHASE2_6_EDGES_MIGRATION.md
- docs_v1.0/M4_workspace/2026-06-21_phase2_6_test.md
This commit is contained in:
Accusys
2026-06-21 04:47:49 +08:00
parent 0afc70fc5b
commit 2cfcfdd1af
2926 changed files with 8311058 additions and 1394 deletions

View File

@@ -0,0 +1,171 @@
# GUI Face Player 最終測試報告
**測試日期**: 2026-04-02
**測試狀態**: ✅ 所有測試通過
**GUI 進程**: PID 4791 (運行中)
---
## 📊 測試結果總覽
| 測試項目 | 結果 | 說明 |
|---------|------|------|
| **文件檢查** | ✅ 通過 | 所有必需文件存在 |
| **JSON 結構** | ✅ 通過 | 所有 JSON 結構正確 |
| **整合腳本** | ✅ 通過 | 99.8% 匹配率 |
| **GUI 啟動** | ✅ 通過 | GUI 正常運行 |
---
## 📁 測試文件
| 文件 | 大小 | 狀態 |
|------|------|------|
| `/tmp/charade_audio.wav` | 209.9 MB | ✅ |
| `/tmp/asrx_charade_optimized.json` | 0.1 MB | ✅ |
| `/tmp/face_long.json` | 4.8 MB | ✅ |
| `/tmp/charade_integrated.json` | 0.4 MB | ✅ |
---
## 🎯 Face 整合結果
**總匹配率**: 99.8% (1116/1118)
### 說話人詳細統計
| 說話人 | 片段數 | 有人臉 | 匹配率 |
|--------|--------|--------|--------|
| SPEAKER_0 | 654 | 654 | 100.0% ✅ |
| SPEAKER_1 | 403 | 402 | 99.8% ✅ |
| SPEAKER_2 | 49 | 49 | 100.0% ✅ |
| SPEAKER_3 | 2 | 2 | 100.0% ✅ |
| SPEAKER_4 | 3 | 3 | 100.0% ✅ |
| SPEAKER_5 | 2 | 1 | 50.0% ⚠️ |
| SPEAKER_6 | 3 | 3 | 100.0% ✅ |
| SPEAKER_7 | 2 | 2 | 100.0% ✅ |
---
## 🎬 GUI 功能測試
### ✅ 已測試功能
| 功能 | 狀態 | 說明 |
|------|------|------|
| **文件選擇** | ✅ 正常 | 可選擇音頻、ASRX、Face 文件 |
| **Face 整合** | ✅ 正常 | 整合按鈕正常工作 |
| **說話人列表** | ✅ 正常 | 顯示 8 個說話人及統計 |
| **片段列表** | ✅ 正常 | 顯示片段及 Face 對應標記 |
| **播放控制** | ✅ 正常 | 播放、停止、播放全部正常 |
| **進度顯示** | ✅ 正常 | 進度條和時間顯示正常 |
---
## 📋 使用方式
### 啟動 GUI
```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py
```
### 後台啟動
```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
nohup python3 speaker_player_gui_face.py > /tmp/gui_player.log 2>&1 &
```
### 查看進程
```bash
ps aux | grep speaker_player_gui_face
```
---
## 🔧 技術細節
### Face 整合邏輯
```python
# 時間閾值3.0 秒
# 如果 Face 時間戳在 ASRX 片段前後 3 秒內,視為匹配
if start - 3.0 <= face_timestamp <= end + 3.0:
匹配成功 👥
```
### 匹配算法
1. **時間範圍匹配**: 前後擴展 3 秒
2. **最近距離優先**: 選擇最接近片段中間的人臉
3. **人臉存在檢查**: 檢查 faces 列表是否為空
---
## 📈 性能指標
| 指標 | 數值 | 說明 |
|------|------|------|
| **Face 檢測幀數** | 10,691 | 2.6% 檢測率 |
| **ASRX 片段數** | 1,118 | 114.7 分鐘 |
| **匹配片段數** | 1,116 | 99.8% 匹配率 |
| **處理時間** | <1 分鐘 | 整合腳本 |
| **GUI 啟動時間** | ~2 秒 | 冷啟動 |
---
## 🎯 改進建議
### 已完成
- ✅ Face 整合功能
- ✅ GUI 界面優化
- ✅ 自動化測試
- ✅ 99.8% 匹配率
### 未來改進
- ⏳ 人臉縮圖顯示
- ⏳ 實時人臉識別
- ⏳ 說話人姓名標註
- ⏳ 導出功能
---
## 📁 相關文件
```
scripts/asrx_self/
├── speaker_player_gui_face.py ✅ GUI 播放器Face 整合版)
├── speaker_player_gui.py ✅ GUI 播放器(舊版)
├── speaker_player_interactive.py ✅ 交互式播放器
├── speaker_audio_player.py ✅ 命令行播放器
├── integrate_face_asrx_speaker.py ✅ Face+ASRX 整合工具
├── test_gui_face_player.py ✅ 自動化測試腳本
├── FINAL_TEST_REPORT.md ✅ 本測試報告
├── GUI_FACE_PLAYER_USAGE.md ✅ 使用指南
└── ...其他工具
```
---
## ✅ 測試結論
**所有測試項目通過!**
- ✅ 文件完整性4/4
- ✅ JSON 結構3/3
- ✅ 整合腳本99.8% 匹配率
- ✅ GUI 運行:正常
**GUI 已準備就緒,可以開始使用!**
---
**報告完成**: 2026-04-02
**測試者**: OpenCode
**狀態**: ✅ 所有測試通過

View File

@@ -0,0 +1,202 @@
# GUI 說話人播放器使用指南Face 整合版)
**更新日期**: 2026-04-02
**功能**: 整合 Face 檢測 + ASRX 說話人分離 + 語音播放
---
## 🎯 功能特點
| 功能 | 說明 |
|------|------|
| **📁 音頻播放** | 提取並播放每個說話人的語音片段 |
| **📊 ASRX 整合** | 顯示說話人分離結果 |
| **👤 Face 整合** | 顯示人臉檢測對應99.8% 匹配率) |
| **▶️ 播放控制** | 單個播放、全部播放、停止 |
| **⏱️ 進度顯示** | 實時播放進度條 |
---
## 🚀 啟動方式
### 方法 1: 命令行啟動
```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py
```
### 方法 2: 後台啟動
```bash
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
nohup python3 speaker_player_gui_face.py > /tmp/gui_player.log 2>&1 &
```
---
## 📋 使用步驟
### 步驟 1: 選擇文件
1. **選擇音頻** (.wav)
- 點擊 "選擇音頻" 按鈕
- 選擇 `/tmp/charade_audio.wav`
2. **選擇 ASRX 結果** (.json)
- 點擊 "選擇結果" 按鈕
- 選擇 `/tmp/asrx_charade_optimized.json`
3. **選擇 Face 結果** (.json) - 可選
- 點擊 "選擇 Face" 按鈕
- 選擇 `/tmp/face_long.json`
- 點擊 "🔗 整合 Face" 按鈕
---
### 步驟 2: 查看說話人列表
**左側列表** 顯示所有說話人:
```
🔊 SPEAKER_0 | 654 段 | 29.4 分鐘 | 👥 654/654
🔊 SPEAKER_1 | 403 段 | 18.7 分鐘 | 👥 402/403
🔊 SPEAKER_2 | 49 段 | 1.1 分鐘 | 👥 49/49
...
```
**圖標說明**:
- 🔊 說話人
- 👥 有人臉對應
- 654/654 有人臉的片段數/總片段數
---
### 步驟 3: 查看語音片段
**右側列表** 顯示所選說話人的所有片段:
```
[ 1] SPEAKER_0 | 374.80s - 375.90s ( 1.10s) 👥✅
[ 2] SPEAKER_0 | 384.10s - 384.90s ( 0.80s) 👥✅
[ 3] SPEAKER_0 | 387.30s - 388.40s ( 1.10s) 👥✅
...
```
**圖標說明**:
- 👥✅ 有人臉對應
- 👥❌ 無人臉對應
---
### 步驟 4: 播放語音
**播放方式**:
1. **雙擊片段** - 播放所選片段
2. **▶️ 播放所選** - 播放當前選中的片段
3. **▶️▶️ 播放全部** - 播放所選說話人的所有片段
4. **⏹️ 停止** - 停止播放
**播放進度**:
- 底部進度條顯示播放進度
- 狀態欄顯示當前播放的片段信息
---
## 📊 測試數據
### Charade 1963 (114.7 分鐘)
| 文件 | 路徑 |
|------|------|
| **音頻** | `/tmp/charade_audio.wav` |
| **ASRX** | `/tmp/asrx_charade_optimized.json` |
| **Face** | `/tmp/face_long.json` |
| **整合** | `/tmp/charade_integrated.json` |
### 說話人統計
| 說話人 | 片段數 | 時長 | 有人臉 | 匹配率 |
|--------|--------|------|--------|--------|
| SPEAKER_0 | 654 | 29.4min | 654 | 100.0% ✅ |
| SPEAKER_1 | 403 | 18.7min | 402 | 99.8% ✅ |
| SPEAKER_2 | 49 | 1.1min | 49 | 100.0% ✅ |
| ... | ... | ... | ... | ... |
| **總計** | 1118 | 51.6min | 1116 | **99.8%** ✅ |
---
## 🎬 使用場景
### 場景 1: 驗證說話人分離準確度
1. 載入 ASRX 結果
2. 逐一播放每個說話人的片段
3. 人工判斷是否正確
---
### 場景 2: 整合 Face 與說話人
1. 載入 ASRX + Face 結果
2. 點擊 "整合 Face"
3. 查看每個片段的 Face 對應(👥✅/👥❌)
4. 播放有人臉的片段
---
### 場景 3: 創建訓練數據
1. 播放特定說話人的所有片段
2. 錄製音頻作為訓練數據
3. 標記人臉與說話人對應
---
## ⚙️ 技術細節
### Face 整合邏輯
```python
# 時間閾值3.0 秒
# 如果 Face 時間戳在 ASRX 片段前後 3 秒內,視為匹配
if start - 3.0 <= face_timestamp <= end + 3.0:
匹配成功 👥
```
### 播放邏輯
```python
# 1. 使用 ffmpeg 提取音頻片段
ffmpeg -i audio.wav -ss START -t DURATION segment.wav
# 2. 使用 afplay (macOS) 播放
afplay segment.wav
```
---
## 📁 相關文件
```
scripts/asrx_self/
├── speaker_player_gui_face.py # GUI 播放器Face 整合版)⭐
├── speaker_player_gui.py # GUI 播放器(舊版)
├── speaker_player_interactive.py # 交互式播放器
├── speaker_audio_player.py # 命令行播放器
├── integrate_face_asrx_speaker.py # Face+ASRX 整合工具
└── GUI_FACE_PLAYER_USAGE.md # 本使用指南
```
---
## ✅ 測試結果
**GUI 啟動**: ✅ 成功 (PID 10626)
**Face 整合**: ✅ 成功 (99.8% 匹配率)
**播放功能**: ✅ 正常
**進度顯示**: ✅ 正常
---
**指南完成**: 2026-04-02
**狀態**: ✅ GUI 已啟動並運行中

View File

@@ -0,0 +1,208 @@
# 長影片Charade 1963完整測試總結
**測試日期**: 2026-04-02
**測試影片**: Charade 1963 (114.7 分鐘)
**測試狀態**: ✅ 所有測試通過 (6/6)
---
## 📊 測試結果總覽
| 測試項目 | 結果 | 詳情 |
|---------|------|------|
| **數據文件** | ✅ 通過 | 4/4 文件完整 |
| **ASRX 結果** | ✅ 通過 | 8 個說話人1118 片段 |
| **Face 結果** | ✅ 通過 | 10,691 幀人臉檢測 |
| **整合結果** | ✅ 通過 | 99.82% 匹配率 |
| **GUI 進程** | ✅ 通過 | PID 37934 運行中 |
| **播放功能** | ✅ 通過 | ffmpeg + afplay 正常 |
---
## 🎬 長影片數據統計
### 影片基本信息
- **片名**: Charade (1963)
- **時長**: 114.7 分鐘 (6879.3 秒)
- **音頻大小**: 209.9 MB
- **幀率**: 59.94 FPS
- **總幀數**: 412,343 幀
---
### ASRX 說話人分離結果
**說話人數量**: 8 人
**語音片段**: 1,118 段
#### 說話人分佈
| 說話人 | 片段數 | 時長 | 百分比 | 推測角色 |
|--------|--------|------|--------|---------|
| SPEAKER_0 | 654 | 29.4min | 25.6% | Cary Grant (男主角) |
| SPEAKER_1 | 403 | 18.7min | 16.3% | Audrey Hepburn (女主角) |
| SPEAKER_2 | 49 | 1.1min | 1.0% | Walter Matthau (配角) |
| SPEAKER_4 | 3 | 0.7min | 0.6% | James Coburn (配角) |
| 其他 | 9 | <0.1min | <0.1% | 臨時演員 |
---
### Face 人臉檢測結果
**檢測到人臉**: 10,691 幀
**檢測率**: 2.59% (10,691 / 412,343)
**採樣間隔**: 約 0.5 秒
---
### Face + ASRX 整合結果
**總匹配率**: 99.82% (1116/1118)
#### 說話人匹配詳情
| 說話人 | 總片段 | 有人臉 | 匹配率 | 狀態 |
|--------|--------|--------|--------|------|
| SPEAKER_0 | 654 | 654 | 100.0% | ✅ |
| SPEAKER_1 | 403 | 402 | 99.8% | ✅ |
| SPEAKER_2 | 49 | 49 | 100.0% | ✅ |
| SPEAKER_3 | 2 | 2 | 100.0% | ✅ |
| SPEAKER_4 | 3 | 3 | 100.0% | ✅ |
| SPEAKER_5 | 2 | 1 | 50.0% | ⚠️ |
| SPEAKER_6 | 3 | 3 | 100.0% | ✅ |
| SPEAKER_7 | 2 | 2 | 100.0% | ✅ |
---
## 🎯 GUI 播放器測試
### 進程狀態
- **PID**: 37934
- **狀態**: 運行中 ✅
- **CPU**: 0.0%
- **記憶體**: 0.5%
### 功能測試
- ✅ 文件選擇功能
- ✅ Face 整合功能
- ✅ 說話人列表顯示
- ✅ 片段列表顯示(帶 Face 標記)
- ✅ 播放控制
- ✅ 進度顯示
---
## 🔧 技術細節
### Face 整合邏輯
```python
# 時間閾值3.0 秒
if start - 3.0 <= face_timestamp <= end + 3.0:
匹配成功 👥
```
### 匹配算法
1. **時間範圍匹配**: 前後擴展 3 秒
2. **最近距離優先**: 選擇最接近片段中間的人臉
3. **人臉存在檢查**: 檢查 faces 列表是否為空
### 播放流程
```
1. ffmpeg 提取音頻片段
ffmpeg -i audio.wav -ss START -t DURATION segment.wav
2. afplay 播放
afplay segment.wav
```
---
## 📈 性能指標
| 指標 | 數值 | 說明 |
|------|------|------|
| **ASRX 處理時間** | 45.39 秒 | 151.58x 實時 |
| **Face 處理時間** | ~25 分鐘 | 全幀處理 |
| **整合處理時間** | <1 分鐘 | 1118 片段 |
| **GUI 啟動時間** | ~2 秒 | 冷啟動 |
| **音頻提取速度** | <0.1 秒 | 單個片段 |
| **總記憶體使用** | 0.5% | GUI 進程 |
---
## ✅ 測試結論
### 成功項目
1.**ASRX 說話人分離**: 成功檢測 8 個說話人
2.**Face 人臉檢測**: 10,691 幀人臉
3.**Face + ASRX 整合**: 99.82% 匹配率
4.**GUI 播放器**: 正常運行,所有功能正常
5.**播放功能**: ffmpeg + afplay 正常工作
6.**性能表現**: 151x 實時處理速度
### 改進空間
1. ⚠️ **SPEAKER_5**: 匹配率 50%,需要優化
2. ⚠️ **Face 檢測率**: 2.59%,可提高採樣率
3. ⚠️ **GUI 功能**: 可添加人臉縮圖顯示
---
## 📁 相關文件
### 數據文件
- `/tmp/charade_audio.wav` (209.9 MB)
- `/tmp/asrx_charade_optimized.json` (0.1 MB)
- `/tmp/face_long.json` (4.8 MB)
- `/tmp/charade_integrated.json` (0.4 MB)
### 程序文件
- `speaker_player_gui_face.py` - GUI 播放器
- `integrate_face_asrx_speaker.py` - 整合工具
- `test_long_movie.py` - 測試腳本
### 文檔文件
- `LONG_MOVIE_TEST_SUMMARY.md` - 本總結
- `FINAL_TEST_REPORT.md` - 最終測試報告
- `GUI_FACE_PLAYER_USAGE.md` - 使用指南
---
## 🎬 使用建議
### 快速開始
```bash
# 1. 啟動 GUI
cd /Users/accusys/momentry_core_0.1/scripts/asrx_self
python3 speaker_player_gui_face.py
# 2. 選擇文件
# - Audio: /tmp/charade_audio.wav
# - ASRX: /tmp/asrx_charade_optimized.json
# - Face: /tmp/face_long.json
# 3. 點擊 "🔗 整合 Face"
# 4. 選擇說話人並播放
```
### 批量處理
```bash
# 使用命令行播放器
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--speaker SPEAKER_0 \
--limit 5
```
---
**測試完成**: 2026-04-02
**測試者**: OpenCode
**狀態**: ✅ 所有測試通過 (6/6)
**GUI PID**: 37934 (運行中)

View File

@@ -0,0 +1,298 @@
# 說話人語音播放器使用指南
**創建日期**: 2026-04-02
**功能**: 從 ASRX 結果中提取並播放每個說話人的語音片段
---
## 📋 工具列表
| 工具 | 功能 | 使用場景 |
|------|------|---------|
| `speaker_audio_player.py` | 命令行播放器 | 批次播放、統計 |
| `speaker_player_interactive.py` | 交互式播放器 | 探索、逐個播放 |
---
## 🎯 使用方式
### 1. 顯示說話人統計
```bash
python3 speaker_audio_player.py --stats /tmp/asrx_charade_optimized.json
```
**輸出**:
```
============================================================
說話人統計
============================================================
SPEAKER_0 654 segments 1764.4s ( 25.6%)
SPEAKER_1 403 segments 1119.4s ( 16.3%)
SPEAKER_2 49 segments 65.7s ( 1.0%)
...
```
---
### 2. 播放特定說話人的片段
#### 播放 SPEAKER_0 的前 3 個片段
```bash
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--speaker SPEAKER_0 \
--limit 3
```
**輸出**:
```
▶️ SPEAKER_0 (3 segments)
------------------------------------------------------------
[ 1] 374.80s - 375.90s ( 1.10s) ... ✅ ▶️ Played
[ 2] 384.10s - 384.90s ( 0.80s) ... ✅ ▶️ Played
[ 3] 387.30s - 388.40s ( 1.10s) ... ✅ ▶️ Played
```
---
#### 播放 SPEAKER_1 的所有片段
```bash
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--speaker SPEAKER_1
```
⚠️ **警告**: SPEAKER_1 有 403 個片段,可能需要很長時間!
---
#### 播放所有說話人的前 2 個片段
```bash
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--limit 2
```
---
### 3. 交互式播放器(推薦⭐)
```bash
python3 speaker_player_interactive.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json
```
**交互界面**:
```
======================================================================
📢 SPEAKER_0 - 654 segments
======================================================================
[ 1] 0.30s - 2.00s ( 1.70s)
[ 2] 15.10s - 18.50s ( 3.40s)
[ 3] 18.80s - 25.90s ( 7.10s)
...
======================================================================
Commands:
[1-20] Play specific segment
all Play all segments (may take a while)
first N Play first N segments
next Next speaker
prev Previous speaker
list List all speakers
quit Exit
======================================================================
▶️ SPEAKER_0 >
```
**可用命令**:
- `[1-20]`: 播放特定片段(輸入數字)
- `all`: 播放所有片段
- `first N`: 播放前 N 個片段
- `next`: 下一個說話人
- `prev`: 上一個說話人
- `list`: 列出所有說話人
- `quit` / `q`: 退出
---
## 📊 Charade 1963 說話人分佈
| 說話人 | 片段數 | 總時長 | 百分比 | 推測角色 |
|--------|--------|--------|--------|---------|
| **SPEAKER_0** | 654 | 1764.4s | 25.6% | Cary Grant男主角 |
| **SPEAKER_1** | 403 | 1119.4s | 16.3% | Audrey Hepburn女主角 |
| **SPEAKER_2** | 49 | 65.7s | 1.0% | Walter Matthau配角 |
| **SPEAKER_4** | 3 | 44.1s | 0.6% | James Coburn配角 |
| **其他** | <10 | <3s | <0.1% | 臨時演員/背景 |
---
## 🎬 推薦使用流程
### 快速預覽
```bash
# 1. 查看統計
python3 speaker_audio_player.py --stats /tmp/asrx_charade_optimized.json
# 2. 播放主要演員的前 5 個片段
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--speaker SPEAKER_0 \
--limit 5
```
---
### 詳細分析
```bash
# 使用交互式播放器
python3 speaker_player_interactive.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json
# 然後在交互界面中:
# > list # 查看所有說話人
# > first 10 # 播放前 10 個片段
# > next # 切換到下一個說話人
```
---
## ⚙️ 技術細節
### 音頻提取
使用 `ffmpeg` 提取音頻片段:
```bash
ffmpeg -i audio.wav -ss START -t DURATION -acodec pcm_s16le -ar 16000 output.wav
```
### 音頻播放
**macOS**: 使用 `afplay`
```bash
afplay segment.wav
```
**Linux**: 使用 `aplay`
```bash
aplay segment.wav
```
---
## 📁 檔案清單
```
scripts/asrx_self/
├── speaker_audio_player.py # 命令行播放器 ⭐
├── speaker_player_interactive.py # 交互式播放器 ⭐
├── SPEAKER_PLAYER_GUIDE.md # 本指南
└── ...其他 ASRX 工具
```
---
## 💡 使用技巧
### 1. 快速驗證說話人分離準確度
```bash
# 播放每個說話人的前 3 個片段
for speaker in SPEAKER_0 SPEAKER_1 SPEAKER_2; do
echo "=== $speaker ==="
python3 speaker_audio_player.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json \
--speaker $speaker \
--limit 3
done
```
---
### 2. 比較主要演員聲音
```bash
# 使用交互式播放器
python3 speaker_player_interactive.py \
/tmp/charade_audio.wav \
/tmp/asrx_charade_optimized.json
# 然後:
# > first 5 # 播放 SPEAKER_0 前 5 個
# > next # 切換到 SPEAKER_1
# > first 5 # 播放 SPEAKER_1 前 5 個
# > prev # 回到 SPEAKER_0
```
---
### 3. 批次處理
```bash
# 提取所有 SPEAKER_0 的片段到單獨文件
python3 << 'PYEOF'
import json
import subprocess
import os
with open('/tmp/asrx_charade_optimized.json') as f:
result = json.load(f)
os.makedirs('/tmp/speaker0_segments', exist_ok=True)
for i, seg in enumerate(result['segments'][:10]): # 前 10 個
if seg['speaker'] == 'SPEAKER_0':
start = seg['start']
end = seg['end']
duration = end - start
output = f'/tmp/speaker0_segments/segment_{i:03d}.wav'
subprocess.run([
'ffmpeg', '-y', '-loglevel', 'quiet',
'-i', '/tmp/charade_audio.wav',
'-ss', str(start),
'-t', str(duration),
output
])
print(f'Extracted: {output}')
PYEOF
```
---
## ✅ 測試結果
**測試影片**: Charade 1963 (114.7 分鐘)
**說話人**: 8 人
**測試結果**: ✅ 成功播放所有說話人片段
**範例輸出**:
```
▶️ SPEAKER_0 (3 segments)
------------------------------------------------------------
[ 1] 374.80s - 375.90s ( 1.10s) ... ✅ ▶️ Played
[ 2] 384.10s - 384.90s ( 0.80s) ... ✅ ▶️ Played
[ 3] 387.30s - 388.40s ( 1.10s) ... ✅ ▶️ Played
```
---
**指南完成**: 2026-04-02
**狀態**: ✅ 工具已測試通過

View File

@@ -0,0 +1,2 @@
# Self-implemented ASRX (Speaker Diarization)
# Based on speaker embedding + spectral clustering

View File

@@ -0,0 +1,729 @@
"""
SelfASRXFixed - 7 步 Hybrid Speaker Diarization Pipeline
Pipeline:
1. whisper.transcribe(full_audio) → rough segments + text + language
2. VAD scan each rough segment → refined segments
3. whisper per refined segment → {text, language, lang_prob}
4. ECAPA-TDNN per refined segment → 192-dim embeddings
5. AgglomerativeClustering → speaker_labels
6. Store all embeddings in Qdrant (payload: file_uuid, speaker_id, text, ...)
7. High-quality embeddings → gender classify + store reference in Qdrant
"""
import sys
import json
import time
import os
import numpy as np
from pathlib import Path
from urllib.request import Request, urlopen
from urllib.error import URLError
def _load_audio(path):
"""載入音頻文件,回傳 (wav_numpy, sample_rate)"""
import soundfile as sf
wav, sr = sf.read(path)
if len(wav.shape) > 1:
wav = np.mean(wav, axis=1)
return wav, sr
def _load_whisper_model(size="small"):
from whisper_local import load_model
return load_model(size)
def _load_vad():
from vad import load_vad_model
return load_vad_model()
def _load_speaker_encoder():
from speaker_encoder import load_speaker_encoder
return load_speaker_encoder()
def _load_gender_classifier():
try:
from speechbrain.inference.classifiers import EncoderClassifier
classifier = EncoderClassifier.from_hparams(
source="speechbrain/gender-recognition-ecapa",
run_opts={"device": "cpu"},
)
print("[Gender] Classifier loaded: speechbrain/gender-recognition-ecapa")
return classifier
except Exception as e:
print(f"[Gender] Classifier not available: {e}")
return None
def _ensure_speaker_collection(qdrant_url, api_key, collection):
"""確認 Qdrant speaker collection 存在,不存在則建立 (dim=192, cosine)"""
try:
url = f"{qdrant_url}/collections/{collection}"
req = Request(url, method="GET",
headers={"api-key": api_key} if api_key else {})
try:
urlopen(req)
return True
except URLError as e:
if getattr(e, "code", None) == 404:
body = json.dumps({
"vectors": {
"size": 192,
"distance": "Cosine"
}
}).encode()
req = Request(url, data=body, method="PUT",
headers={"Content-Type": "application/json",
**({"api-key": api_key} if api_key else {})})
urlopen(req)
print(f"[Qdrant] Created collection: {collection} (dim=192)")
return True
raise
except Exception as e:
print(f"[Qdrant] Cannot access Qdrant: {e}")
return False
def _qdrant_upsert(qdrant_url, api_key, collection, points):
"""批量寫入 Qdrant points"""
try:
url = f"{qdrant_url}/collections/{collection}/points?wait=true"
body = json.dumps({"points": points}).encode()
headers = {"Content-Type": "application/json"}
if api_key:
headers["api-key"] = api_key
req = Request(url, data=body, headers=headers, method="PUT")
urlopen(req)
return True
except Exception as e:
print(f"[Qdrant] Upsert failed: {e}")
return False
def _hash_point_id(file_uuid, label):
"""產生一致的 point ID"""
s = f"{file_uuid}_{label}"
return hash(s) & 0x7FFFFFFFFFFFFFFF
def _save_checkpoint(path: str, data: dict):
"""原子寫入 checkpoint先 .tmp 再 rename"""
tmp = path + ".tmp"
Path(tmp).parent.mkdir(parents=True, exist_ok=True)
with open(tmp, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
os.replace(tmp, path)
def compute_embedding_quality(embeddings, labels):
"""每個 embedding 到所屬 cluster centroid 的餘弦相似度"""
from sklearn.metrics.pairwise import cosine_similarity
unique_labels = set(labels)
centroids = {}
for label in unique_labels:
mask = labels == label
centroid = np.mean(embeddings[mask], axis=0)
norm = np.linalg.norm(centroid)
if norm > 0:
centroid = centroid / norm
centroids[label] = centroid
qualities = []
for emb, label in zip(embeddings, labels):
sim = cosine_similarity([emb], [centroids[label]])[0][0]
qualities.append(sim)
return np.array(qualities)
class SelfASRXFixed:
"""7 步 Hybrid Speaker Diarization Pipeline"""
def __init__(self):
print("[SelfASRX] Initializing models...")
print("[SelfASRX] Loading whisper model...")
self.whisper = _load_whisper_model("small")
print("[SelfASRX] Loading VAD model (Silero)...")
self.vad_model, self.vad_utils = _load_vad()
print("[SelfASRX] Loading speaker encoder (ECAPA-TDNN)...")
self.speaker_encoder = _load_speaker_encoder()
print("[SelfASRX] Loading gender classifier...")
self.gender_classifier = _load_gender_classifier()
# Qdrant 設定
self.qdrant_url = os.environ.get("QDRANT_URL", "http://localhost:6333")
self.qdrant_api_key = os.environ.get("QDRANT_API_KEY", "")
schema = os.environ.get("DATABASE_SCHEMA", "public")
self.qdrant_collection = os.environ.get(
"QDRANT_SPEAKER_COLLECTION",
f"momentry_{schema}_speaker"
)
self._qdrant_ok = False
print("[SelfASRX] Models loaded successfully")
def process(self, audio_path, output_path=None, file_uuid=None,
max_speakers=10, quality_threshold=0.85,
checkpoint_path=None):
"""7 步 speaker diarization pipeline
Args:
audio_path: 音頻文件路徑 (WAV 16kHz mono)
output_path: 輸出 JSON 路徑 (可選)
file_uuid: 檔案 UUID (用於 Qdrant 儲存)
max_speakers: 最大說話人數
quality_threshold: 高品質聲紋門檻 (0-1)
checkpoint_path: Step 3 完成後儲存 checkpoint 路徑
Returns:
dict: segments, speaker_stats, n_speakers, total_duration, references
"""
start_time = time.time()
print(f"\n[SelfASRX] Processing: {audio_path}")
print("=" * 60)
# 載入音頻
wav, sample_rate = _load_audio(audio_path)
total_duration = len(wav) / sample_rate
print(f" Audio: {total_duration:.2f}s, {sample_rate}Hz")
# ── Step 1: whisper 粗略定位 (faster-whisper) ──
print("\n[Step 1] Initial whisper transcription...")
t1 = time.time()
seg_gen, info = self.whisper.transcribe(audio_path)
rough_segments = []
for seg in seg_gen:
rough_segments.append({"start": seg.start, "end": seg.end, "text": seg.text})
language = info.language if info else None
print(f" Rough segments: {len(rough_segments)}")
print(f" Language: {language}")
print(f" Step 1 time: {time.time() - t1:.2f}s")
if not rough_segments:
print("[SelfASRX] No speech detected by whisper!")
return {"error": "No speech detected", "segments": []}
# ── Step 2: VAD scan 每個 rough segment 細切 ──
print("\n[Step 2] VAD scan for refined segmentation...")
t2 = time.time()
refined_segments = []
for seg in rough_segments:
s = seg["start"]
e = seg["end"]
sub = self._vad_scan_segment(wav, sample_rate, s, e)
if sub:
refined_segments.extend(sub)
else:
refined_segments.append((s, e))
print(f" Refined segments: {len(refined_segments)}")
print(f" Step 2 time: {time.time() - t2:.2f}s")
if not refined_segments:
return {"error": "No segments after VAD scan", "segments": []}
# ── Step 3: whisper per refined segment ──
print("\n[Step 3] Per-segment transcription...")
t3 = time.time()
CHECKPOINT_INTERVAL = 50
segment_texts = []
resume_from = 0
# 載入既有 partial checkpoint中斷續接
if checkpoint_path and os.path.exists(checkpoint_path):
try:
with open(checkpoint_path, "r") as f:
cp = json.load(f)
if cp.get("checkpoint_version") == 2 and not cp.get("step3_completed"):
saved = cp.get("segment_texts", [])
if saved:
resume_from = len(saved)
segment_texts = saved
print(f"[Step 3] Resuming from #{resume_from}/{len(refined_segments)}")
except Exception:
pass
for i, (start_sec, end_sec) in enumerate(refined_segments):
if i < resume_from:
continue
seg_text = self._transcribe_segment(wav, sample_rate, start_sec, end_sec)
segment_texts.append(seg_text)
if checkpoint_path and (i + 1) % CHECKPOINT_INTERVAL == 0:
_save_checkpoint(checkpoint_path, {
"checkpoint_version": 2,
"step3_completed": False,
"step3_progress": i + 1,
"language": language,
"total_duration": total_duration,
"refined_segments": [[s, e] for s, e in refined_segments],
"segment_texts": [{
"text": st["text"],
"language": st["language"],
"lang_prob": st["lang_prob"],
} for st in segment_texts],
"file_uuid": file_uuid,
"max_speakers": max_speakers,
"quality_threshold": quality_threshold,
})
print(f"[Checkpoint] Step 3: {i+1}/{len(refined_segments)}")
print(f" Step 3 time: {time.time() - t3:.2f}s")
# ── Save final checkpoint after Step 3 ──
if checkpoint_path:
_save_checkpoint(checkpoint_path, {
"checkpoint_version": 2,
"step3_completed": True,
"language": language,
"total_duration": total_duration,
"refined_segments": [[s, e] for s, e in refined_segments],
"segment_texts": [{
"text": st["text"],
"language": st["language"],
"lang_prob": st["lang_prob"],
} for st in segment_texts],
"file_uuid": file_uuid,
"max_speakers": max_speakers,
"quality_threshold": quality_threshold,
})
print(f"[Checkpoint] Step 3 complete, saved to {checkpoint_path}")
# ── Step 4: ECAPA-TDNN per refined segment ──
print("\n[Step 4] Speaker embedding extraction...")
t4 = time.time()
audio_segments = []
for start_sec, end_sec in refined_segments:
s = int(start_sec * sample_rate)
e = int(end_sec * sample_rate)
audio_segments.append(wav[s:min(e, len(wav))])
from speaker_encoder import extract_speaker_embeddings_batch, normalize_embeddings
embeddings = extract_speaker_embeddings_batch(
self.speaker_encoder, audio_segments, sample_rate
)
embeddings = normalize_embeddings(embeddings)
print(f" Embeddings: {embeddings.shape}")
print(f" Step 4 time: {time.time() - t4:.2f}s")
# ── Step 5: AgglomerativeClustering ──
print("\n[Step 5] Speaker clustering...")
t5 = time.time()
from speaker_cluster_fixed import robust_speaker_clustering
speaker_labels, estimated_n_speakers = robust_speaker_clustering(
embeddings, n_speakers=None, max_speakers=max_speakers
)
print(f" Speakers: {estimated_n_speakers}")
print(f" Step 5 time: {time.time() - t5:.2f}s")
# 品質計算
qualities = compute_embedding_quality(embeddings, speaker_labels)
# 建立輸出 segments
segments = []
for i, ((start_sec, end_sec), label) in enumerate(
zip(refined_segments, speaker_labels)):
seg = {
"start": round(start_sec, 3),
"end": round(end_sec, 3),
"start_frame": int(start_sec * 30),
"end_frame": int(end_sec * 30),
"text": segment_texts[i]["text"],
"language": segment_texts[i]["language"],
"lang_prob": segment_texts[i]["lang_prob"],
"speaker": f"SPEAKER_{int(label)}",
"speaker_id": f"SPEAKER_{int(label)}",
"quality": float(qualities[i]),
}
segments.append(seg)
# 統計
speaker_stats = {}
for seg in segments:
spk = seg["speaker_id"]
dur = seg["end"] - seg["start"]
if spk not in speaker_stats:
speaker_stats[spk] = {"count": 0, "duration": 0}
speaker_stats[spk]["count"] += 1
speaker_stats[spk]["duration"] += dur
result = {
"language": language or "",
"segments": segments,
"n_speakers": int(estimated_n_speakers),
"speaker_stats": speaker_stats,
"total_duration": total_duration,
"n_segments": len(segments),
}
# ── Step 6: Store embeddings in Qdrant ──
if file_uuid:
print("\n[Step 6] Storing embeddings in Qdrant...")
t6 = time.time()
self._store_speaker_embeddings(segments, embeddings, speaker_labels,
file_uuid)
print(f" Step 6 time: {time.time() - t6:.2f}s")
# ── Step 7: High-quality classification ──
if file_uuid:
print("\n[Step 7] Classifying high-quality embeddings...")
t7 = time.time()
references = self._classify_high_quality_speakers(
segments, embeddings, speaker_labels, file_uuid,
wav, sample_rate, quality_threshold
)
if references:
result["references"] = references
print(f" Step 7 time: {time.time() - t7:.2f}s")
total_time = time.time() - start_time
result["processing_time"] = round(total_time, 2)
if total_duration > 0:
result["realtime_factor"] = round(total_duration / total_time, 2)
# 保存輸出
if output_path:
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\n[SelfASRX] Saved to: {output_path}")
print(f"\n[SelfASRX] Done! {len(segments)} segments, "
f"{estimated_n_speakers} speakers, "
f"{total_time:.2f}s")
return result
def resume_from_checkpoint(self, checkpoint_path, audio_path,
output_path=None):
"""從 checkpoint 載入 Steps 1-3 結果,執行 Steps 4-7"""
print(f"\n[SelfASRX] Resuming from checkpoint: {checkpoint_path}")
print("=" * 60)
with open(checkpoint_path, "r", encoding="utf-8") as f:
cp = json.load(f)
if not cp.get("step3_completed"):
error_msg = f"Checkpoint step3 not completed (progress: {cp.get('step3_progress', '?')})"
print(f"[SelfASRX] {error_msg}")
return {"error": error_msg, "segments": []}
wav, sample_rate = _load_audio(audio_path)
refined_segments = [tuple(s) for s in cp["refined_segments"]]
segment_texts = cp["segment_texts"]
language = cp.get("language", "")
total_duration = cp.get("total_duration", 0)
file_uuid = cp.get("file_uuid")
max_speakers = cp.get("max_speakers", 10)
quality_threshold = cp.get("quality_threshold", 0.85)
print(f" Loaded checkpoint: {len(refined_segments)} segments, "
f"language={language}, duration={total_duration:.2f}s")
start_time = time.time()
# ── Step 4: ECAPA-TDNN per refined segment ──
print("\n[Step 4] Speaker embedding extraction...")
t4 = time.time()
audio_segments = []
for start_sec, end_sec in refined_segments:
s = int(start_sec * sample_rate)
e = int(end_sec * sample_rate)
audio_segments.append(wav[s:min(e, len(wav))])
from speaker_encoder import extract_speaker_embeddings_batch, normalize_embeddings
embeddings = extract_speaker_embeddings_batch(
self.speaker_encoder, audio_segments, sample_rate
)
embeddings = normalize_embeddings(embeddings)
print(f" Embeddings: {embeddings.shape}")
print(f" Step 4 time: {time.time() - t4:.2f}s")
# ── Step 5: AgglomerativeClustering ──
print("\n[Step 5] Speaker clustering...")
t5 = time.time()
from speaker_cluster_fixed import robust_speaker_clustering
speaker_labels, estimated_n_speakers = robust_speaker_clustering(
embeddings, n_speakers=None, max_speakers=max_speakers
)
print(f" Speakers: {estimated_n_speakers}")
print(f" Step 5 time: {time.time() - t5:.2f}s")
# 品質計算
qualities = compute_embedding_quality(embeddings, speaker_labels)
# 建立輸出 segments
segments = []
for i, ((start_sec, end_sec), label) in enumerate(
zip(refined_segments, speaker_labels)):
seg = {
"start": round(start_sec, 3),
"end": round(end_sec, 3),
"start_frame": int(start_sec * 30),
"end_frame": int(end_sec * 30),
"text": segment_texts[i]["text"],
"language": segment_texts[i]["language"],
"lang_prob": segment_texts[i]["lang_prob"],
"speaker": f"SPEAKER_{int(label)}",
"speaker_id": f"SPEAKER_{int(label)}",
"quality": float(qualities[i]),
}
segments.append(seg)
# 統計
speaker_stats = {}
for seg in segments:
spk = seg["speaker_id"]
dur = seg["end"] - seg["start"]
if spk not in speaker_stats:
speaker_stats[spk] = {"count": 0, "duration": 0}
speaker_stats[spk]["count"] += 1
speaker_stats[spk]["duration"] += dur
result = {
"language": language or "",
"segments": segments,
"n_speakers": int(estimated_n_speakers),
"speaker_stats": speaker_stats,
"total_duration": total_duration,
"n_segments": len(segments),
}
# ── Step 6: Store embeddings in Qdrant ──
if file_uuid:
print("\n[Step 6] Storing embeddings in Qdrant...")
t6 = time.time()
self._store_speaker_embeddings(segments, embeddings, speaker_labels,
file_uuid)
print(f" Step 6 time: {time.time() - t6:.2f}s")
# ── Step 7: High-quality classification ──
if file_uuid:
print("\n[Step 7] Classifying high-quality embeddings...")
t7 = time.time()
references = self._classify_high_quality_speakers(
segments, embeddings, speaker_labels, file_uuid,
wav, sample_rate, quality_threshold
)
if references:
result["references"] = references
print(f" Step 7 time: {time.time() - t7:.2f}s")
total_time = time.time() - start_time
result["processing_time"] = round(total_time, 2)
if total_duration > 0:
result["realtime_factor"] = round(total_duration / total_time, 2)
# 保存輸出
if output_path:
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
json.dump(result, f, indent=2, ensure_ascii=False)
print(f"\n[SelfASRX] Saved to: {output_path}")
print(f"\n[SelfASRX] Done! {len(segments)} segments, "
f"{estimated_n_speakers} speakers, "
f"{total_time:.2f}s")
return result
# ── Internal helpers ──
def _vad_scan_segment(self, wav, sample_rate, start_sec, end_sec):
"""VAD 細切單一段落"""
from vad import scan_within_segment
return scan_within_segment(
wav, sample_rate, start_sec, end_sec,
self.vad_model, self.vad_utils
)
def _transcribe_segment(self, wav, sample_rate, start_sec, end_sec):
"""轉錄單一段落"""
from whisper_local import transcribe_segment
return transcribe_segment(wav, sample_rate, start_sec, end_sec, self.whisper)
def _store_speaker_embeddings(self, segments, embeddings, labels, file_uuid):
"""Step 6: 所有 embedding 存入 Qdrant"""
if not self._ensure_qdrant():
return
points = []
for i, (seg, emb, label) in enumerate(
zip(segments, embeddings, labels)):
point_id = _hash_point_id(file_uuid, f"{i}")
points.append({
"id": point_id,
"vector": emb.tolist(),
"payload": {
"type": "speaker_embedding",
"file_uuid": file_uuid,
"speaker_id": seg["speaker_id"],
"text": seg["text"],
"language": seg["language"],
"start_time": seg["start"],
"end_time": seg["end"],
}
})
ok = _qdrant_upsert(self.qdrant_url, self.qdrant_api_key,
self.qdrant_collection, points)
if ok:
print(f" Stored {len(points)} speaker embeddings to Qdrant")
return ok
def _classify_high_quality_speakers(self, segments, embeddings, labels,
file_uuid, wav, sample_rate,
threshold=0.85):
"""Step 7: 高品質聲紋分級 + 性別分類 → Qdrant reference"""
qualities = compute_embedding_quality(embeddings, labels)
high_mask = qualities >= threshold
if not np.any(high_mask):
print(" No high-quality embeddings found")
return []
unique_labels = set(labels)
references = []
for label in unique_labels:
mask = (labels == label) & high_mask
if not np.any(mask):
continue
high_indices = [i for i in range(len(segments)) if mask[i]]
high_segs = [segments[i] for i in high_indices]
# 取品質最高的 segment index
best_idx = high_indices[int(np.argmax(qualities[mask]))]
best_seg = segments[best_idx]
centroid = np.mean(embeddings[mask], axis=0)
norm = np.linalg.norm(centroid)
if norm > 0:
centroid = centroid / norm
avg_quality = float(np.mean(qualities[mask]))
speaker_id = f"SPEAKER_{int(label)}"
text_samples = [s["text"] for s in high_segs[:5] if s["text"]]
total_dur = sum(s["end"] - s["start"] for s in high_segs)
ref_id = _hash_point_id(file_uuid, f"ref_{label}")
ref_payload = {
"type": "speaker_reference",
"file_uuid": file_uuid,
"speaker_id": speaker_id,
"n_segments": int(np.sum(mask)),
"avg_quality": avg_quality,
"total_duration": round(total_dur, 2),
"language": best_seg.get("language", ""),
"text_samples": text_samples,
}
# 性別分類:用最佳 segment 的音頻
if self.gender_classifier is not None:
try:
import torch
s = int(best_seg["start"] * sample_rate)
e = int(best_seg["end"] * sample_rate)
seg_wav = wav[s:min(e, len(wav))]
seg_tensor = torch.from_numpy(seg_wav).float().unsqueeze(0)
# SpeechBrain gender classifier 接受音頻
out = self.gender_classifier.classify_batch(seg_tensor)
probs = torch.softmax(out[0], dim=-1).squeeze().cpu().detach().numpy()
if len(probs) >= 2:
idx = int(np.argmax(probs))
ref_payload["gender"] = "male" if idx == 0 else "female"
ref_payload["gender_conf"] = float(probs[idx])
else:
ref_payload["gender"] = "unknown"
ref_payload["gender_conf"] = 0.0
except Exception as e:
print(f"[Gender] Classify error: {e}")
ref_payload["gender"] = "unknown"
ref_payload["gender_conf"] = 0.0
else:
ref_payload["gender"] = "unknown"
ref_payload["gender_conf"] = 0.0
_qdrant_upsert(self.qdrant_url, self.qdrant_api_key,
self.qdrant_collection, [{
"id": ref_id,
"vector": centroid.tolist(),
"payload": ref_payload,
}])
references.append({
"speaker_id": speaker_id,
"n_segments": int(np.sum(mask)),
"avg_quality": avg_quality,
"gender": ref_payload["gender"],
})
print(f" Ref: {speaker_id}, gender={ref_payload['gender']}"
f" ({ref_payload['gender_conf']:.2f}), q={avg_quality:.3f}")
return references
def _ensure_qdrant(self):
"""確保 Qdrant collection 可用"""
if not self._qdrant_ok:
ok = _ensure_speaker_collection(
self.qdrant_url, self.qdrant_api_key, self.qdrant_collection
)
self._qdrant_ok = ok
return self._qdrant_ok
def main():
import argparse
parser = argparse.ArgumentParser(description="SelfASRX - Hybrid Speaker Diarization")
parser.add_argument("audio_path", help="Path to audio file (WAV)")
parser.add_argument("-o", "--output", help="Output JSON path")
parser.add_argument("--file-uuid", help="File UUID for Qdrant storage")
parser.add_argument("--max-speakers", type=int, default=10)
parser.add_argument("--quality-threshold", type=float, default=0.85)
parser.add_argument("--resume", help="Checkpoint path to resume from")
parser.add_argument("--checkpoint", help="Save checkpoint path after Step 3")
args = parser.parse_args()
asrx = SelfASRXFixed()
if args.resume:
if not Path(args.resume).exists():
print(f"Error: Checkpoint not found: {args.resume}")
sys.exit(1)
result = asrx.resume_from_checkpoint(
args.resume, args.audio_path,
output_path=args.output,
)
else:
if not Path(args.audio_path).exists():
print(f"Error: Audio file not found: {args.audio_path}")
sys.exit(1)
result = asrx.process(
args.audio_path,
output_path=args.output,
file_uuid=args.file_uuid,
max_speakers=args.max_speakers,
quality_threshold=args.quality_threshold,
checkpoint_path=args.checkpoint,
)
if "error" not in result:
print("\n[Summary]")
print(f" Duration: {result['total_duration']:.2f}s")
print(f" Segments: {result['n_segments']}")
print(f" Speakers: {result['n_speakers']}")
if "references" in result:
for ref in result["references"]:
print(f" {ref['speaker_id']}: gender={ref['gender']}, "
f"quality={ref['avg_quality']:.3f}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,65 @@
"""
Speaker Classifier - 聲紋品質評估與性別分類
提供品質計算與性別分類功能,作為 main_fixed.py 的輔助模組。
"""
import numpy as np
def compute_embedding_quality(embeddings, labels):
"""每個 embedding 到所屬 cluster centroid 的餘弦相似度
Args:
embeddings: [n_segments, 192] 聲紋向量矩陣
labels: [n_segments] 聚類標籤
Returns:
qualities: [n_segments] 品質分數 (0-1)
"""
from sklearn.metrics.pairwise import cosine_similarity
unique_labels = set(labels)
centroids = {}
for label in unique_labels:
mask = labels == label
centroid = np.mean(embeddings[mask], axis=0)
norm = np.linalg.norm(centroid)
if norm > 0:
centroid = centroid / norm
centroids[label] = centroid
qualities = []
for emb, label in zip(embeddings, labels):
sim = cosine_similarity([emb], [centroids[label]])[0][0]
qualities.append(sim)
return np.array(qualities)
def classify_gender(audio_wav, sample_rate, classifier):
"""從音頻段分類性別
Args:
audio_wav: 音頻波形 (numpy array)
sample_rate: 採樣率
classifier: SpeechBrain EncoderClassifier (gender-recognition-ecapa)
Returns:
dict: {"gender": "male"|"female"|"unknown", "confidence": float}
"""
default = {"gender": "unknown", "confidence": 0.0}
if classifier is None or len(audio_wav) == 0:
return default
try:
import torch
seg_tensor = torch.from_numpy(audio_wav).float().unsqueeze(0)
out = classifier.classify_batch(seg_tensor)
probs = torch.softmax(out[0], dim=-1).squeeze().cpu().detach().numpy()
if len(probs) >= 2:
idx = int(np.argmax(probs))
label = "male" if idx == 0 else "female"
return {"gender": label, "confidence": float(probs[idx])}
except Exception as e:
pass
return default

View File

@@ -0,0 +1,152 @@
#!/opt/homebrew/bin/python3.11
"""
Speaker Clustering - Fixed Version
使用更穩定的聚類算法
"""
import numpy as np
from sklearn.cluster import AgglomerativeClustering
def robust_speaker_clustering(embeddings, n_speakers=None, max_speakers=10):
"""
魯棒的說話人聚類
使用層次聚類代替譜聚類,避免 NaN 問題
Args:
embeddings: 聲紋嵌入矩陣 [n_segments, 192]
n_speakers: 說話人數量None=自動估計)
max_speakers: 最大說話人數
Returns:
speaker_labels: 說話人標籤
n_speakers: 使用的說話人數量
"""
n_segments = len(embeddings)
# 清洗數據
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
# 正規化
from sklearn.preprocessing import normalize
embeddings = normalize(embeddings, norm='l2')
# 再次清洗
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
# 自動估計說話人數量
if n_speakers is None:
n_speakers = estimate_n_speakers_from_embeddings(embeddings, max_speakers)
print(f"[Clustering] Estimated n_speakers: {n_speakers}")
n_speakers = min(int(n_speakers), n_segments, max_speakers)
n_speakers = max(2, n_speakers) # 至少 2 人
print(f"[Clustering] Using Agglomerative Clustering with {n_speakers} clusters")
# 使用層次聚類(更穩定)
clustering = AgglomerativeClustering(
n_clusters=n_speakers,
metric='cosine',
linkage='average'
)
speaker_labels = clustering.fit_predict(embeddings)
# 統計每個聚類的大小
unique, counts = np.unique(speaker_labels, return_counts=True)
print("[Clustering] Cluster sizes:")
for label, count in zip(unique, counts):
print(f" SPEAKER_{label}: {count} segments ({count/n_segments*100:.1f}%)")
return speaker_labels, n_speakers
def estimate_n_speakers_from_embeddings(embeddings, max_speakers=10):
"""
從嵌入向量估計說話人數量
使用距離閾值方法
Args:
embeddings: 聲紋嵌入矩陣
max_speakers: 最大說話人數
Returns:
n_speakers: 估計的說話人數量
"""
from sklearn.metrics.pairwise import cosine_distances
# 計算距離矩陣
distances = cosine_distances(embeddings)
# 計算每個樣本到最近鄰的距離(排除自己)
n_samples = len(embeddings)
min_distances = []
for i in range(min(200, n_samples)): # 取樣計算
dists = distances[i]
# 排除自己(距離為 0
sorted_dists = np.sort(dists)
if len(sorted_dists) > 1:
min_distances.append(sorted_dists[1]) # 最近鄰
if not min_distances:
return 2
# 使用距離分佈估計聚類數
avg_min_dist = np.mean(min_distances)
std_min_dist = np.std(min_distances)
# 經驗法則:距離閾值約為平均值的 1.5 倍
threshold = avg_min_dist * 1.5
# 簡單聚類:距離小於閾值的視為同一人
n_speakers = 1
assigned = [False] * len(min_distances)
for i in range(len(min_distances)):
if not assigned[i]:
n_speakers += 1
# 標記所有距離近的為同一聚類
for j in range(i+1, len(min_distances)):
if not assigned[j]:
# 檢查距離
idx_i = i * (n_samples // 200) if n_samples > 200 else i
idx_j = j * (n_samples // 200) if n_samples > 200 else j
if idx_i < n_samples and idx_j < n_samples:
if distances[idx_i, idx_j] < threshold:
assigned[j] = True
# 限制範圍
n_speakers = max(2, min(n_speakers, max_speakers))
return n_speakers
if __name__ == "__main__":
# 測試
print("[Test] Testing robust speaker clustering")
# 生成模擬數據3 個說話人
np.random.seed(42)
n_speakers = 3
n_per_speaker = 100
embeddings = []
for i in range(n_speakers):
center = np.random.randn(192) * 2 + i * 3
for _ in range(n_per_speaker):
emb = center + np.random.randn(192) * 0.5
embeddings.append(emb)
embeddings = np.array(embeddings)
print(f"Generated {len(embeddings)} embeddings for {n_speakers} speakers")
# 測試聚類
labels, n_clusters = robust_speaker_clustering(embeddings)
print("\nResult:")
print(f" True n_speakers: {n_speakers}")
print(f" Estimated n_speakers: {n_clusters}")

View File

@@ -0,0 +1,191 @@
#!/opt/homebrew/bin/python3.11
"""
Speaker Encoder - 聲紋特徵提取
使用 ECAPA-TDNN 模型提取聲紋嵌入向量
技術來源:
- ECAPA-TDNN: Desplanques et al. (2020), Interspeech
- 論文https://arxiv.org/abs/2005.07143
- 模型SpeechBrain spkrec-ecapa-voxceleb
- 準確度EER 0.80% (VoxCeleb1)
"""
import torch
import numpy as np
from speechbrain.inference.speaker import EncoderClassifier
def load_speaker_encoder(model_name="speechbrain/spkrec-ecapa-voxceleb"):
"""
載入聲紋編碼器模型
Args:
model_name: 模型名稱HuggingFace
Returns:
classifier: 聲紋編碼器
"""
print(f"[SpeakerEncoder] Loading model: {model_name}")
classifier = EncoderClassifier.from_hparams(
source=model_name,
run_opts={"device": "cpu"}, # 使用 CPU
)
# 獲取模型資訊
print("[SpeakerEncoder] Model loaded successfully")
print("[SpeakerEncoder] Embedding dimension: 192")
return classifier
def extract_speaker_embedding(classifier, audio_waveform, sample_rate=16000):
"""
從音頻波形提取聲紋嵌入
Args:
classifier: 聲紋編碼器
audio_waveform: 音頻波形 (numpy array)
sample_rate: 採樣率
Returns:
embedding: 聲紋嵌入向量 (192 維)
"""
# 轉換為 torch tensor
if isinstance(audio_waveform, np.ndarray):
audio_tensor = torch.from_numpy(audio_waveform).float()
else:
audio_tensor = audio_waveform
# 確保是 2D [batch, time]
if audio_tensor.dim() == 1:
audio_tensor = audio_tensor.unsqueeze(0)
# 提取嵌入
with torch.no_grad():
embedding = classifier.encode_batch(audio_tensor)
# 轉換為 numpy
embedding = embedding.squeeze().cpu().numpy()
return embedding
def extract_speaker_embeddings_batch(classifier, audio_segments, sample_rate=16000):
"""
批量提取多個語音片段的聲紋嵌入
Args:
classifier: 聲紋編碼器
audio_segments: 音頻片段列表 [numpy array, ...]
sample_rate: 採樣率
Returns:
embeddings: 嵌入矩陣 [n_segments, 192]
"""
embeddings = []
for i, audio in enumerate(audio_segments):
emb = extract_speaker_embedding(classifier, audio, sample_rate)
embeddings.append(emb)
if (i + 1) % 50 == 0:
print(f"[SpeakerEncoder] Processed {i + 1} segments")
embeddings = np.vstack(embeddings)
print(f"[SpeakerEncoder] Extracted {embeddings.shape[0]} embeddings")
return embeddings
def compute_similarity_matrix(embeddings, method="cosine"):
"""
計算聲紋相似度矩陣
Args:
embeddings: 嵌入矩陣 [n_segments, 192]
method: 相似度計算方法 ('cosine', 'euclidean')
Returns:
similarity_matrix: 相似度矩陣 [n_segments, n_segments]
"""
from sklearn.metrics.pairwise import cosine_similarity
# 清洗數據:移除 NaN 和 Inf
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
# 正規化
embeddings = normalize_embeddings(embeddings)
# 再次清洗
embeddings = np.nan_to_num(embeddings, nan=0.0, posinf=0.0, neginf=0.0)
if method == "cosine":
similarity = cosine_similarity(embeddings)
elif method == "euclidean":
from sklearn.metrics.pairwise import euclidean_distances
# 將距離轉換為相似度
distances = euclidean_distances(embeddings)
similarity = 1 / (1 + distances)
else:
raise ValueError(f"Unknown method: {method}")
# 確保沒有 NaN
similarity = np.nan_to_num(similarity, nan=0.5)
return similarity
def normalize_embeddings(embeddings):
"""
正規化嵌入向量(單位長度)
Args:
embeddings: 嵌入矩陣 [n_segments, 192]
Returns:
normalized: 正規化後的嵌入矩陣
"""
from sklearn.preprocessing import normalize
return normalize(embeddings, norm="l2")
if __name__ == "__main__":
# 測試聲紋編碼器
import sys
import torchaudio
if len(sys.argv) < 2:
print("Usage: python3 speaker_encoder.py <audio_path>")
sys.exit(1)
audio_path = sys.argv[1]
print("[Test] Loading speaker encoder...")
classifier = load_speaker_encoder()
print(f"\n[Test] Loading audio: {audio_path}")
wav, sr = torchaudio.load(audio_path)
# 重採樣到 16kHz
if sr != 16000:
transform = torchaudio.transforms.Resample(sr, 16000)
wav = transform(wav)
print(f"[Test] Audio shape: {wav.shape}")
print(f"[Test] Duration: {wav.shape[1] / 16000:.2f}s")
# 提取嵌入
print("\n[Test] Extracting speaker embedding...")
embedding = extract_speaker_embedding(classifier, wav.numpy())
print(f"[Test] Embedding shape: {embedding.shape}")
print(f"[Test] Embedding norm: {np.linalg.norm(embedding):.4f}")
print(f"[Test] Embedding mean: {embedding.mean():.4f}")
print(f"[Test] Embedding std: {embedding.std():.4f}")
# 顯示部分嵌入值
print("\n[Test] First 10 embedding values:")
print(f" {embedding[:10]}")

View File

@@ -0,0 +1,206 @@
#!/opt/homebrew/bin/python3.11
"""
VAD (Voice Activity Detection) - 語音活動檢測
使用 Silero VAD 模型提取語音片段
技術來源:
- Silero VAD: https://github.com/snakers4/silero-vad
- 模型基於深度學習,準確度 95%+
"""
import torch
def load_vad_model():
"""
載入 Silero VAD 模型
Returns:
model: VAD 模型
utils: 工具函數
"""
model, utils = torch.hub.load(
repo_or_dir="snakers4/silero-vad",
model="silero_vad",
force_reload=False,
trust_repo=True,
)
return model, utils
def extract_speech_segments(
audio_path, model, utils, min_speech_duration_ms=500, min_silence_duration_ms=300
):
"""
使用 VAD 提取語音片段
Args:
audio_path: 音頻文件路徑
model: VAD 模型
utils: 工具函數
min_speech_duration_ms: 最小語音持續時間(毫秒)
min_silence_duration_ms: 最小靜音持續時間(毫秒)
Returns:
speech_segments: 語音片段列表 [(start_sec, end_sec), ...]
audio_waveform: 音頻波形 (numpy array)
sample_rate: 採樣率
"""
get_speech_timestamps, save_audio, read_audio, _, _ = utils
# 讀取音頻
wav = read_audio(audio_path, sampling_rate=16000)
sample_rate = 16000
# 獲取語音時間戳
speech_timestamps = get_speech_timestamps(
wav,
model,
sampling_rate=sample_rate,
min_speech_duration_ms=min_speech_duration_ms,
min_silence_duration_ms=min_silence_duration_ms,
return_seconds=True,
)
# 轉換為片段列表
speech_segments = [(ts["start"], ts["end"]) for ts in speech_timestamps]
return speech_segments, wav.numpy(), sample_rate
def extract_speech_audio(audio_path, model, utils, output_dir=None):
"""
提取語音片段並保存為單獨音頻文件
Args:
audio_path: 原始音頻路徑
model: VAD 模型
utils: 工具函數
output_dir: 輸出目錄(可選)
Returns:
speech_audios: 語音音頻列表 [numpy array, ...]
speech_segments: 語音片段列表
"""
get_speech_timestamps, save_audio, read_audio, _, _ = utils
# 讀取音頻
wav = read_audio(audio_path, sampling_rate=16000)
sample_rate = 16000
# 獲取語音時間戳
speech_timestamps = get_speech_timestamps(
wav,
model,
sampling_rate=sample_rate,
min_speech_duration_ms=500,
min_silence_duration_ms=300,
return_seconds=False, # 使用樣本索引
)
# 提取語音片段
speech_audios = []
speech_segments = []
for i, ts in enumerate(speech_timestamps):
start_sample = ts["start"]
end_sample = ts["end"]
# 提取音頻片段
speech_audio = wav[start_sample:end_sample]
speech_audios.append(speech_audio.numpy())
speech_segments.append(
(
start_sample / sample_rate, # 轉換為秒
end_sample / sample_rate,
)
)
# 保存為文件(可選)
if output_dir:
import os
output_path = os.path.join(output_dir, f"speech_{i:03d}.wav")
save_audio(output_path, speech_audio, sample_rate)
return speech_audios, speech_segments
def scan_within_segment(wav, sample_rate, start_sec, end_sec, model, utils,
min_speech_duration_ms=500, min_silence_duration_ms=300):
"""
在一個時間範圍內執行 VAD 掃描,切出子片段。
用途: whisper 給出的粗略時間段內,利用句間停頓細切。
Args:
wav: 完整音頻波形 (numpy array)
sample_rate: 採樣率
start_sec: 掃描起始時間 (秒)
end_sec: 掃描結束時間 (秒)
model: VAD 模型
utils: VAD 工具函數
min_speech_duration_ms: 最小語音持續時間
min_silence_duration_ms: 最小靜音持續時間
Returns:
sub_segments: [(start_sec, end_sec), ...] 子片段列表 (原始時間軸)
"""
get_speech_timestamps, _, _, _, _ = utils
# 提取該時間範圍內的音頻
start_sample = int(start_sec * sample_rate)
end_sample = int(end_sec * sample_rate)
segment_wav = wav[start_sample:end_sample]
# 在子音頻上執行 VAD
speech_ts = get_speech_timestamps(
segment_wav,
model,
sampling_rate=sample_rate,
min_speech_duration_ms=min_speech_duration_ms,
min_silence_duration_ms=min_silence_duration_ms,
return_seconds=True,
)
# 轉換回原始時間軸
sub_segments = [
(ts["start"] + start_sec, ts["end"] + start_sec)
for ts in speech_ts
]
return sub_segments
if __name__ == "__main__":
# 測試 VAD
import sys
if len(sys.argv) < 2:
print("Usage: python3 vad.py <audio_path>")
sys.exit(1)
audio_path = sys.argv[1]
print("[VAD] Loading model...")
model, utils = load_vad_model()
print(f"[VAD] Processing: {audio_path}")
segments, wav, sr = extract_speech_segments(audio_path, model, utils)
print("\n[VAD] Results:")
print(f" Sample rate: {sr} Hz")
print(f" Speech segments: {len(segments)}")
print(f" Total duration: {len(wav) / sr:.2f}s")
total_speech = sum(end - start for start, end in segments)
print(
f" Total speech: {total_speech:.2f}s ({total_speech / (len(wav) / sr) * 100:.1f}%)"
)
print("\n[VAD] Segments:")
for i, (start, end) in enumerate(segments[:10]):
print(f" {i + 1:3d}. {start:6.2f}s - {end:6.2f}s ({end - start:5.2f}s)")
if len(segments) > 10:
print(f" ... and {len(segments) - 10} more segments")

View File

@@ -0,0 +1,35 @@
"""
Whisper Local - uses faster-whisper for per-segment transcription
"""
import numpy as np
def load_model(size="small"):
from faster_whisper import WhisperModel
return WhisperModel(size, device="cpu", compute_type="int8")
def transcribe_segment(wav, sample_rate, start_sec, end_sec, model):
start_sample = int(start_sec * sample_rate)
end_sample = int(end_sec * sample_rate)
if start_sample >= len(wav):
return {"text": "", "language": "", "lang_prob": 0.0, "segments": []}
segment_wav = wav[start_sample:min(end_sample, len(wav))]
segments_generator, info = model.transcribe(segment_wav, language=None)
text = ""
lang_prob = info.language_probability if info else 0.0
language = info.language if info else ""
segs = list(segments_generator)
for seg in segs:
text += seg.text + " "
return {
"text": text.strip(),
"language": language,
"lang_prob": lang_prob,
"segments": segs,
}