fix: TKG stats API returning 0 - count_by_type used wrong column

- tkg_nodes has no edge_type column, query was failing silently
- Split into count_nodes(node_type) and count_edges(edge_type)
- Fixed text_region → text_trace node type name
- Also: OCR frame fix in rule1 (end_frame computed from end_time+FPS)
2026-07-02 14:53:47 +08:00
parent 6507766ea2
commit 619b056ada
3 changed files with 210 additions and 23 deletions
--- a/docs_v1.0/DESIGN/Audio_Scene_Detection_POC.md
+++ b/docs_v1.0/DESIGN/Audio_Scene_Detection_POC.md
@@ -0,0 +1,162 @@
+---
+title: Audio Scene & Instrument Detection POC Plan
+version: 0.1
+date: 2026-07-02
+author: OpenCode
+status: planned
+---
+
+| scope | status | applicable to |
+|-------|--------|---------------|
+| Audio processing pipeline | planned | Video files with non-speech audio |
+
+## Goal
+
+Detect non-speech audio events (instruments, music, environmental sounds) in video files alongside existing ASRX speech recognition.
+
+## Why
+
+The current pipeline detects only speech (ASRX → 64 segments + 1554 speaker embeddings). Instrument sounds, background music, and environmental audio are ignored entirely.
+
+## Technical Options
+
+### Option A: PANNs (Pre-trained Audio Neural Networks)
+- **Model**: Cnn14 (about 81M params, roughly 320MB checkpoint)
+- **Classes**: 527 AudioSet classes (piano, guitar, drums, speech, etc.)
+- **Pros**: Production-ready, accurate, PyTorch-based
+- **Cons**: Large download, ~200MB RAM per inference
+- **Install**: `pip install panns-inference`
+
+### Option B: YAMNet (Google)
+- **Model**: MobileNet-based, 4MB weights
+- **Classes**: 521 AudioSet classes
+- **Pros**: Lightweight, fast
+- **Cons**: Requires TensorFlow (not currently installed)
+- **Install**: `pip install tensorflow tensorflow-hub` (model loaded from TensorFlow Hub)
+
+### Option C: torchaudio + heuristics (lightweight fallback)
+- Use existing PyTorch + torchaudio
+- Extract spectral features (MFCC, centroid, energy)
+- Simple classification: speech vs music vs silence
+- **Pros**: No extra dependencies
+- **Cons**: Less accurate, limited classes
+
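Review note on Option C: the speech vs music vs silence split could be prototyped along these lines. This is a minimal sketch, not the planned implementation. It uses only the standard library, so RMS energy and frame-to-frame energy variation stand in for the torchaudio features (MFCC, centroid) named above. The function names and all thresholds are illustrative assumptions, not tuned values.

```python
import math
from statistics import mean, pstdev


def rms(samples):
    """Root-mean-square energy of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_segment(samples, frame_len=400, silence_rms=0.01, speech_cv=0.5):
    """Label one segment as 'silence', 'speech', or 'music'.

    Heuristic: speech alternates between loud syllables and short pauses,
    so its per-frame energy varies more than that of sustained music.
    Thresholds are hypothetical placeholders.
    """
    if rms(samples) < silence_rms:
        return "silence"
    # Split into non-overlapping full frames and measure each frame's energy.
    last_start = len(samples) - frame_len + 1
    energies = [rms(samples[i:i + frame_len]) for i in range(0, last_start, frame_len)]
    if len(energies) < 2 or mean(energies) == 0:
        return "music"  # too short to measure modulation
    # Coefficient of variation of frame energy: high means strongly modulated.
    cv = pstdev(energies) / mean(energies)
    return "speech" if cv > speech_cv else "music"
```

A steady tone is labelled `music`, the same tone gated on and off every 100 ms is labelled `speech`, and an all-zero buffer is labelled `silence`. Real recordings would need tuned thresholds and richer features.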
+## Recommended: Option A (PANNs)
+
+Option A reuses the existing PyTorch stack and covers 527 AudioSet classes. Option B would add a TensorFlow dependency, and Option C is limited to a few coarse classes.
+
+## Pipeline Integration
+
+```
+Video → Audio Extract → ASRX (speech)  → Speaker Embeddings (3.4/s)
+                  → Audio Scene (new) → Scene Labels (1/s)
+```
+
+### New Processor: `audio_scene`
+
+| Field | Value |
+|-------|-------|
+| Processor type | `audio_scene` |
+| Input | Video file (audio track) |
+| Output | `file_uuid.audio_scene.json` |
+| Sampling | 1-second segments |
+| Qdrant collection | `momentry_{schema}_audio_scene` |
+
+### Output Format
+
+```json
+{
+  "file_uuid": "...",
+  "segments": [
+    {
+      "start_time": 0.0,
+      "end_time": 1.0,
+      "primary_class": "speech",
+      "confidence": 0.95,
+      "top_classes": [
+        {"class": "speech", "score": 0.95},
+        {"class": "music", "score": 0.03},
+        {"class": "piano", "score": 0.01}
+      ]
+    }
+  ],
+  "summary": {
+    "speech_ratio": 0.72,
+    "music_ratio": 0.15,
+    "silence_ratio": 0.08,
+    "instrument_ratio": 0.05,
+    "instruments_detected": ["piano", "guitar"]
+  }
+}
+```
+
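Review note on the `summary` block: one way to derive it from `segments` is to weight each segment's `primary_class` by its duration. The sketch below assumes that reading. The document does not define how the ratios are computed, and `INSTRUMENTS` is a hypothetical subset of class names chosen for illustration.

```python
# Hypothetical instrument names; the real list would come from the AudioSet ontology.
INSTRUMENTS = {"piano", "guitar", "drums", "violin", "flute"}


def summarize(segments):
    """Build the summary dict from per-segment primary classes, weighted by duration."""
    total = sum(s["end_time"] - s["start_time"] for s in segments)
    buckets = {"speech": 0.0, "music": 0.0, "silence": 0.0, "instrument": 0.0}
    detected = []
    for s in segments:
        duration = s["end_time"] - s["start_time"]
        label = s["primary_class"]
        if label in INSTRUMENTS:
            buckets["instrument"] += duration
            if label not in detected:
                detected.append(label)  # keep first-seen order
        elif label in buckets:
            buckets[label] += duration
        # Any other class is left out of the four ratios.
    summary = {
        f"{name}_ratio": round(value / total, 2) if total else 0.0
        for name, value in buckets.items()
    }
    summary["instruments_detected"] = detected
    return summary
```

With this definition the four ratios sum to less than 1 whenever a segment's primary class falls outside the four buckets, which may or may not be the intended behaviour.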
+### Qdrant Storage
+
+| Field | Type | Purpose |
+|-------|------|---------|
+| `file_uuid` | string | Filter by file |
+| `start_time` | float | Segment start |
+| `end_time` | float | Segment end |
+| `primary_class` | keyword | Filter by class |
+| `confidence` | float | Filter by confidence |
+| `instrument_name` | keyword | Search by instrument |
+| `vector` | f32[2048] | Audio embedding for similarity search |
+
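Review note on the storage table: each segment could be turned into a Qdrant point of the usual `id` / `vector` / `payload` shape before upserting. The sketch below only builds the dictionary and makes no client call. The helper name `build_point`, the deterministic UUID scheme, and the rule that `instrument_name` is set only when the primary class is an instrument are all assumptions, not part of the design.

```python
import uuid

EMBEDDING_DIM = 2048  # matches the f32[2048] vector in the table above


def build_point(file_uuid, segment, vector, instruments=frozenset({"piano", "guitar"})):
    """Return a dict shaped like a Qdrant point for one audio segment.

    `instruments` is a hypothetical set used to decide whether to fill
    the `instrument_name` payload field.
    """
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(vector)}")
    label = segment["primary_class"]
    # Deterministic ID so re-processing a file overwrites instead of duplicating.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_uuid}:{segment['start_time']}"))
    return {
        "id": point_id,
        "vector": list(vector),
        "payload": {
            "file_uuid": file_uuid,
            "start_time": segment["start_time"],
            "end_time": segment["end_time"],
            "primary_class": label,
            "confidence": segment["confidence"],
            "instrument_name": label if label in instruments else None,
        },
    }
```

Deriving the ID from `file_uuid` and `start_time` is one possible choice; a random UUID per point would also work if duplicates are cleaned up by a `file_uuid` filter before re-indexing.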
+### Processor Dependencies
+
+```
+audio_scene → (no dependencies, runs parallel with ASRX)
+```
+
+## Key AudioSet Instrument Classes
+
+| Category | Classes |
+|----------|---------|
+| Piano | Piano, Electric piano, Keyboard |
+| Guitar | Guitar, Electric guitar, Acoustic guitar |
+| Drums | Drum kit, Snare drum, Cymbal, Hi-hat |
+| Strings | Violin, Cello, Harp, Double bass |
+| Wind | Flute, Saxophone, Trumpet, Clarinet |
+| Voice | Speech, Singing, Chant, Choir |
+| Other | Music, Percussion, Organ, Synthesizer |
+
+## POC Steps
+
+1. **Install panns-inference**
+   ```bash
+   pip install panns-inference
+   ```
+
+2. **Create `scripts/audio_scene_processor.py`**
+   - Load audio via ffmpeg → numpy array
+   - Process 1-second segments through Cnn14
+   - Save results to JSON + Qdrant
+
+3. **Add processor type to pipeline**
+   - Add `AudioScene` to `ProcessorType` enum
+   - Add to worker's processor dispatch
+   - Add `AUDIO_SCENE_TIMEOUT` config
+
+4. **Test with existing video**
+   - Run on KOBA interview video
+   - Verify instrument detection accuracy
+   - Check performance (time, memory)
+
+5. **Integrate with search**
+   - Add audio_scene to universal_search
+   - Add filter by audio class (speech/music/instrument)
+
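Review note on step 2: the windowing and result-shaping part of `scripts/audio_scene_processor.py` might look roughly like the sketch below. The model is abstracted behind a `classify` callable that maps a chunk of samples to a `{class_name: score}` dict; in the real script that callable would wrap Cnn14, which is not shown here. All function names are hypothetical.

```python
def iter_segments(samples, sample_rate, seconds=1.0):
    """Yield (start_time, end_time, chunk) for consecutive fixed-length windows."""
    step = int(sample_rate * seconds)
    for start in range(0, len(samples), step):
        chunk = samples[start:start + step]
        # The last chunk may be shorter than one full window.
        yield start / sample_rate, (start + len(chunk)) / sample_rate, chunk


def top_classes(scores, k=3):
    """Return the k highest-scoring classes in the output-format shape."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return [{"class": name, "score": score} for name, score in ranked]


def process(samples, sample_rate, classify):
    """Run `classify` on each 1-second window and build the `segments` list.

    `classify` must return at least one class score per chunk.
    """
    segments = []
    for start, end, chunk in iter_segments(samples, sample_rate):
        top = top_classes(classify(chunk))
        segments.append({
            "start_time": start,
            "end_time": end,
            "primary_class": top[0]["class"],
            "confidence": top[0]["score"],
            "top_classes": top,
        })
    return segments
```

Keeping the classifier as a parameter lets the windowing logic be tested with a stub before the model weights are downloaded.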
+## Estimated Effort
+
+| Step | Time |
+|------|------|
+| Install + prototype script | 2-3 hours |
+| Pipeline integration | 1-2 hours |
+| Qdrant + search integration | 1 hour |
+| Testing + tuning | 1-2 hours |
+| **Total** | **5-8 hours** |
+
+## Future Enhancements
+
+- Real-time audio classification during processing
+- Audio event timeline visualization
+- Combine with TKG for audio-visual relationships
+- Background music detection for copyright checks