fix: TKG stats API returning 0 - count_by_type used wrong column

- tkg_nodes has no edge_type column, query was failing silently
- Split into count_nodes(node_type) and count_edges(edge_type)
- Fixed text_region → text_trace node type name
- Also: OCR frame fix in rule1 (end_frame computed from end_time+FPS)
This commit is contained in:
Accusys
2026-07-02 14:53:47 +08:00
parent 6507766ea2
commit 619b056ada
3 changed files with 210 additions and 23 deletions

View File

@@ -0,0 +1,162 @@
---
title: Audio Scene & Instrument Detection POC Plan
version: 0.1
date: 2026-07-02
author: OpenCode
status: planned
---
| scope | status | applicable to |
|-------|--------|---------------|
| Audio processing pipeline | planned | Video files with non-speech audio |
## Goal
Detect non-speech audio events (instruments, music, environmental sounds) in video files alongside existing ASRX speech recognition.
## Why
Current pipeline only detects speech (ASRX → 64 segments + 1554 speaker embeddings). Instrument sounds, background music, and environmental audio are completely ignored.
## Technical Options
### Option A: PANNs (Pre-trained Audio Neural Networks)
- **Model**: Cnn14 (313M params, 700MB weights)
- **Classes**: 527 AudioSet classes (piano, guitar, drums, speech, etc.)
- **Pros**: Production-ready, accurate, PyTorch-based
- **Cons**: Large download, ~200MB RAM per inference
- **Install**: `pip install panns-inference`
### Option B: YAMNet (Google)
- **Model**: MobileNet-based, 4MB weights
- **Classes**: 521 AudioSet classes
- **Pros**: Lightweight, fast
- **Cons**: Requires TensorFlow (not currently installed)
- **Install**: `pip install yamnet` + TensorFlow
### Option C: torchaudio + heuristics (lightweight fallback)
- Use existing PyTorch + torchaudio
- Extract spectral features (MFCC, centroid, energy)
- Simple classification: speech vs music vs silence
- **Pros**: No extra dependencies
- **Cons**: Less accurate, limited classes
## Recommended: Option A (PANNs)
## Pipeline Integration
```
Video → Audio Extract → ASRX (speech) → Speaker Embeddings (3.4/s)
→ Audio Scene (new) → Scene Labels (1/s)
```
### New Processor: `audio_scene`
| Field | Value |
|-------|-------|
| Processor type | `audio_scene` |
| Input | Video file (audio track) |
| Output | `file_uuid.audio_scene.json` |
| Sampling | 1-second segments |
| Qdrant collection | `momentry_{schema}_audio_scene` |
### Output Format
```json
{
"file_uuid": "...",
"segments": [
{
"start_time": 0.0,
"end_time": 1.0,
"primary_class": "speech",
"confidence": 0.95,
"top_classes": [
{"class": "speech", "score": 0.95},
{"class": "music", "score": 0.03},
{"class": "piano", "score": 0.01}
]
}
],
"summary": {
"speech_ratio": 0.72,
"music_ratio": 0.15,
"silence_ratio": 0.08,
"instrument_ratio": 0.05,
"instruments_detected": ["piano", "guitar"]
}
}
```
### Qdrant Storage
| Field | Type | Purpose |
|-------|------|---------|
| `file_uuid` | string | Filter by file |
| `start_time` | float | Segment start |
| `end_time` | float | Segment end |
| `primary_class` | keyword | Filter by class |
| `confidence` | float | Filter by confidence |
| `instrument_name` | keyword | Search by instrument |
| `vector` | f32[2048] | Audio embedding for similarity search |
### Processor Dependencies
```
audio_scene → (no dependencies, runs parallel with ASRX)
```
## Key AudioSet Instrument Classes
| Category | Classes |
|----------|---------|
| Piano | Piano, Electric piano, Keyboard |
| Guitar | Guitar, Electric guitar, Acoustic guitar |
| Drums | Drum kit, Snare drum, Cymbal, Hi-hat |
| Strings | Violin, Cello, Harp, Double bass |
| Wind | Flute, Saxophone, Trumpet, Clarinet |
| Voice | Speech, Singing, Chant, Choir |
| Other | Music, Percussion, Organ, Synthesizer |
## POC Steps
1. **Install panns-inference**
```bash
pip install panns-inference
```
2. **Create `scripts/audio_scene_processor.py`**
- Load audio via ffmpeg → numpy array
- Process 1-second segments through Cnn14
- Save results to JSON + Qdrant
3. **Add processor type to pipeline**
- Add `AudioScene` to `ProcessorType` enum
- Add to worker's processor dispatch
- Add `AUDIO_SCENE_TIMEOUT` config
4. **Test with existing video**
- Run on KOBA interview video
- Verify instrument detection accuracy
- Check performance (time, memory)
5. **Integrate with search**
- Add audio_scene to universal_search
- Add filter by audio class (speech/music/instrument)
## Estimated Effort
| Step | Time |
|------|------|
| Install + prototype script | 2-3 hours |
| Pipeline integration | 1-2 hours |
| Qdrant + search integration | 1 hour |
| Testing + tuning | 1-2 hours |
| **Total** | **5-8 hours** |
## Future Enhancements
- Real-time audio classification during processing
- Audio event timeline visualization
- Combine with TKG for audio-visual relationships
- Background music detection for copyright checks

View File

@@ -836,24 +836,31 @@ async fn get_file_stats(
let tkg_nodes_table = schema::table_name("tkg_nodes"); let tkg_nodes_table = schema::table_name("tkg_nodes");
let tkg_edges_table = schema::table_name("tkg_edges"); let tkg_edges_table = schema::table_name("tkg_edges");
let tkg_nodes_total: i64 = sqlx::query_scalar::<_, i64>(&format!("SELECT COUNT(*) FROM {} WHERE file_uuid = $1", tkg_nodes_table))
.bind(&file_uuid).fetch_one(pool).await.unwrap_or(0);
let tkg_edges_total: i64 = sqlx::query_scalar::<_, i64>(&format!("SELECT COUNT(*) FROM {} WHERE file_uuid = $1", tkg_edges_table))
.bind(&file_uuid).fetch_one(pool).await.unwrap_or(0);
let tkg = TkgFileStats { let tkg = TkgFileStats {
face_track_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "face_track").await, total_nodes: tkg_nodes_total,
gaze_track_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "gaze_track").await, total_edges: tkg_edges_total,
lip_track_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "lip_track").await, face_track_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "face_track").await,
text_region_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "text_region").await, gaze_track_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "gaze_track").await,
appearance_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "appearance_trace").await, lip_track_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "lip_track").await,
accessory_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "accessory").await, text_region_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "text_trace").await,
object_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "yolo_object").await, appearance_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "appearance_trace").await,
hand_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "hand").await, accessory_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "accessory").await,
speaker_nodes: count_by_type(pool, &tkg_nodes_table, &file_uuid, "speaker").await, object_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "yolo_object").await,
co_occurrence_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "CO_OCCURS_WITH").await, hand_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "hand").await,
speaker_face_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "SPEAKS_AS").await, speaker_nodes: count_nodes(pool, &tkg_nodes_table, &file_uuid, "speaker").await,
face_face_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "FACE_TO_FACE").await, co_occurrence_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "CO_OCCURS_WITH").await,
mutual_gaze_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "MUTUAL_GAZE").await, speaker_face_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "SPEAKS_AS").await,
lip_sync_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "LIP_SYNC").await, face_face_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "FACE_TO_FACE").await,
has_appearance_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "HAS_APPEARANCE").await, mutual_gaze_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "MUTUAL_GAZE").await,
wears_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "WEARS").await, lip_sync_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "LIP_SYNC").await,
hand_object_edges: count_by_type(pool, &tkg_edges_table, &file_uuid, "HAND_OBJECT").await, has_appearance_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "HAS_APPEARANCE").await,
wears_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "WEARS").await,
hand_object_edges: count_edges(pool, &tkg_edges_table, &file_uuid, "HAND_OBJECT").await,
..Default::default() ..Default::default()
}; };
@@ -890,13 +897,25 @@ async fn get_file_stats(
})) }))
} }
async fn count_by_type(pool: &sqlx::PgPool, table: &str, file_uuid: &str, type_val: &str) -> i64 { async fn count_nodes(pool: &sqlx::PgPool, table: &str, file_uuid: &str, node_type: &str) -> i64 {
sqlx::query_scalar::<_, i64>(&format!( sqlx::query_scalar::<_, i64>(&format!(
"SELECT COUNT(*) FROM {} WHERE file_uuid = $1 AND (node_type = $2 OR edge_type = $2)", "SELECT COUNT(*) FROM {} WHERE file_uuid = $1 AND node_type = $2",
table table
)) ))
.bind(file_uuid) .bind(file_uuid)
.bind(type_val) .bind(node_type)
.fetch_one(pool)
.await
.unwrap_or(0)
}
async fn count_edges(pool: &sqlx::PgPool, table: &str, file_uuid: &str, edge_type: &str) -> i64 {
sqlx::query_scalar::<_, i64>(&format!(
"SELECT COUNT(*) FROM {} WHERE file_uuid = $1 AND edge_type = $2",
table
))
.bind(file_uuid)
.bind(edge_type)
.fetch_one(pool) .fetch_one(pool)
.await .await
.unwrap_or(0) .unwrap_or(0)

View File

@@ -13,7 +13,7 @@ pub async fn execute_rule1(db: &PostgresDb, file_uuid: &str, fps: f64) -> Result
let pool = db.pool(); let pool = db.pool();
let pre_chunks_table = schema::table_name("pre_chunks"); let pre_chunks_table = schema::table_name("pre_chunks");
let asr_segments = fetch_asr_segments(pool, file_uuid, &pre_chunks_table).await?; let asr_segments = fetch_asr_segments(pool, file_uuid, &pre_chunks_table, fps).await?;
let ocr_map = fetch_ocr_texts(pool, file_uuid, &pre_chunks_table).await?; let ocr_map = fetch_ocr_texts(pool, file_uuid, &pre_chunks_table).await?;
let video = db let video = db
@@ -97,6 +97,7 @@ async fn fetch_asr_segments(
pool: &PgPool, pool: &PgPool,
file_uuid: &str, file_uuid: &str,
table: &str, table: &str,
fps: f64,
) -> Result<Vec<AsrSegment>> { ) -> Result<Vec<AsrSegment>> {
let query = format!( let query = format!(
r#" r#"
@@ -114,8 +115,6 @@ async fn fetch_asr_segments(
let segments: Vec<AsrSegment> = rows let segments: Vec<AsrSegment> = rows
.iter() .iter()
.map(|row| { .map(|row| {
let start_frame: i64 = row.try_get("start_frame").unwrap_or(0);
let end_frame: i64 = row.try_get("end_frame").unwrap_or(0);
let start_time: f64 = row.try_get("start_time").unwrap_or(0.0); let start_time: f64 = row.try_get("start_time").unwrap_or(0.0);
let end_time_raw: Option<f64> = row.try_get("end_time").ok(); let end_time_raw: Option<f64> = row.try_get("end_time").ok();
let data: Value = row.try_get("data").unwrap_or(Value::Null); let data: Value = row.try_get("data").unwrap_or(Value::Null);
@@ -124,6 +123,13 @@ async fn fetch_asr_segments(
.or_else(|| data.get("end_time").and_then(|v| v.as_f64())) .or_else(|| data.get("end_time").and_then(|v| v.as_f64()))
.unwrap_or(0.0); .unwrap_or(0.0);
let start_frame = (start_time * fps) as i64;
let end_frame = if end_time > 0.0 {
(end_time * fps) as i64
} else {
start_frame
};
if end_time <= 0.0 { if end_time <= 0.0 {
warn!( warn!(
"ASR segment end_time is 0.0 for file {} (frame {}..{})", "ASR segment end_time is 0.0 for file {} (frame {}..{})",