fix: TKG stats API returning 0 - count_by_type used wrong column

- tkg_nodes has no edge_type column, query was failing silently
- Split into count_nodes(node_type) and count_edges(edge_type)
- Fixed text_region → text_trace node type name
- Also: OCR frame fix in rule1 (end_frame now computed from end_time and FPS)
Accusys
2026-07-02 14:53:47 +08:00
parent 6507766ea2
commit 619b056ada
3 changed files with 210 additions and 23 deletions

@@ -0,0 +1,162 @@
---
title: Audio Scene & Instrument Detection POC Plan
version: 0.1
date: 2026-07-02
author: OpenCode
status: planned
---
| scope | status | applicable to |
|-------|--------|---------------|
| Audio processing pipeline | planned | Video files with non-speech audio |
## Goal
Detect non-speech audio events (instruments, music, environmental sounds) in video files alongside existing ASRX speech recognition.
## Why
Current pipeline only detects speech (ASRX → 64 segments + 1554 speaker embeddings). Instrument sounds, background music, and environmental audio are completely ignored.
## Technical Options
### Option A: PANNs (Pre-trained Audio Neural Networks)
- **Model**: Cnn14 (~81M params, roughly 300 MB of weights)
- **Classes**: 527 AudioSet classes (piano, guitar, drums, speech, etc.)
- **Pros**: Production-ready, accurate, PyTorch-based
- **Cons**: Large download, ~200MB RAM per inference
- **Install**: `pip install panns-inference`
### Option B: YAMNet (Google)
- **Model**: MobileNet-based, 4MB weights
- **Classes**: 521 AudioSet classes
- **Pros**: Lightweight, fast
- **Cons**: Requires TensorFlow (not currently installed)
- **Install**: `pip install tensorflow tensorflow-hub` (YAMNet is loaded from TensorFlow Hub rather than installed as its own pip package)
### Option C: torchaudio + heuristics (lightweight fallback)
- Use existing PyTorch + torchaudio
- Extract spectral features (MFCC, centroid, energy)
- Simple classification: speech vs music vs silence
- **Pros**: No extra dependencies
- **Cons**: Less accurate, limited classes
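
To make Option C concrete, here is a minimal sketch of two of the listed features (energy and spectral centroid) for a single frame. It is a pure-Python stand-in for what torchaudio would compute far more efficiently; the function names and the `SILENCE_RMS` threshold are hypothetical, and the speech-vs-music decision itself is left out because its thresholds would need tuning on real data.

```python
import cmath
import math

# Hypothetical threshold: frames quieter than this RMS count as silence.
SILENCE_RMS = 0.01


def rms_energy(frame):
    """Root-mean-square amplitude of a frame of samples in [-1, 1]."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive O(n^2) DFT."""
    n = len(frame)
    weighted = total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        magnitude = abs(coeff)
        weighted += magnitude * (k * sample_rate / n)
        total += magnitude
    return weighted / total if total > 0 else 0.0


def frame_features(frame, sample_rate):
    """Collect the per-frame features a heuristic classifier would consume."""
    energy = rms_energy(frame)
    return {
        "energy": energy,
        "centroid_hz": spectral_centroid(frame, sample_rate),
        "is_silence": energy < SILENCE_RMS,
    }
```

A heuristic classifier would then threshold these values per frame, which is where the "less accurate, limited classes" trade-off comes from.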
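## Recommended: Option A (PANNs)
PANNs covers the instrument classes that Option C cannot, and it runs on the PyTorch stack already in use, whereas YAMNet would add a TensorFlow dependency.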
## Pipeline Integration
```
Video → Audio Extract → ASRX (speech)     → Speaker Embeddings (3.4/s)
                      → Audio Scene (new) → Scene Labels (1/s)
```
### New Processor: `audio_scene`
| Field | Value |
|-------|-------|
| Processor type | `audio_scene` |
| Input | Video file (audio track) |
| Output | `file_uuid.audio_scene.json` |
| Sampling | 1-second segments |
| Qdrant collection | `momentry_{schema}_audio_scene` |
### Output Format
```json
{
  "file_uuid": "...",
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 1.0,
      "primary_class": "speech",
      "confidence": 0.95,
      "top_classes": [
        {"class": "speech", "score": 0.95},
        {"class": "music", "score": 0.03},
        {"class": "piano", "score": 0.01}
      ]
    }
  ],
  "summary": {
    "speech_ratio": 0.72,
    "music_ratio": 0.15,
    "silence_ratio": 0.08,
    "instrument_ratio": 0.05,
    "instruments_detected": ["piano", "guitar"]
  }
}
```
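
One way the `summary` block could be derived from `segments` is sketched below. It assumes each ratio is the duration-weighted share of segments whose `primary_class` falls in that bucket; the plan does not define this. `INSTRUMENT_CLASSES` is a hypothetical placeholder for the real AudioSet instrument labels.

```python
from collections import defaultdict

# Assumption: labels that count as instruments; the real list would come from AudioSet.
INSTRUMENT_CLASSES = {"piano", "guitar", "drums", "violin", "flute"}


def summarize(segments):
    """Aggregate per-segment primary classes into duration-weighted ratios."""
    seconds = defaultdict(float)
    instruments = set()
    total = 0.0
    for seg in segments:
        duration = seg["end_time"] - seg["start_time"]
        total += duration
        label = seg["primary_class"]
        if label in INSTRUMENT_CLASSES:
            seconds["instrument"] += duration
            instruments.add(label)
        else:
            seconds[label] += duration

    def ratio(bucket):
        # Guard against an empty segment list.
        return round(seconds[bucket] / total, 2) if total > 0 else 0.0

    return {
        "speech_ratio": ratio("speech"),
        "music_ratio": ratio("music"),
        "silence_ratio": ratio("silence"),
        "instrument_ratio": ratio("instrument"),
        "instruments_detected": sorted(instruments),
    }
```

Classes outside these four buckets (environmental sounds, for example) are tracked but not reported, so the ratios need not sum to 1.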
### Qdrant Storage
| Field | Type | Purpose |
|-------|------|---------|
| `file_uuid` | string | Filter by file |
| `start_time` | float | Segment start |
| `end_time` | float | Segment end |
| `primary_class` | keyword | Filter by class |
| `confidence` | float | Filter by confidence |
| `instrument_name` | keyword | Search by instrument |
| `vector` | f32[2048] | Audio embedding for similarity search |
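
As an illustration of the table above, the sketch below turns one segment into a dict with the `id` / `vector` / `payload` shape of a Qdrant point. The helper name, the deterministic UUID scheme, and the way `instrument_name` is supplied are assumptions, not part of the plan; the actual upsert call is left to whichever Qdrant client the pipeline already uses.

```python
import uuid

EMBEDDING_DIM = 2048  # matches the f32[2048] vector in the table


def segment_to_point(file_uuid, segment, embedding, instrument_name=None):
    """Build one point dict (id, vector, payload) for a 1-second segment."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dims, got {len(embedding)}")
    # Hypothetical ID scheme: a deterministic UUID so re-processing a file
    # overwrites its old points instead of duplicating them.
    key = f"{file_uuid}:{segment['start_time']:.3f}"
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, key))
    return {
        "id": point_id,
        "vector": [float(x) for x in embedding],
        "payload": {
            "file_uuid": file_uuid,
            "start_time": segment["start_time"],
            "end_time": segment["end_time"],
            "primary_class": segment["primary_class"],
            "confidence": segment["confidence"],
            "instrument_name": instrument_name,
        },
    }
```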
### Processor Dependencies
```
audio_scene → (no dependencies, runs parallel with ASRX)
```
## Key AudioSet Instrument Classes
| Category | Classes |
|----------|---------|
| Piano | Piano, Electric piano, Keyboard |
| Guitar | Guitar, Electric guitar, Acoustic guitar |
| Drums | Drum kit, Snare drum, Cymbal, Hi-hat |
| Strings | Violin, Cello, Harp, Double bass |
| Wind | Flute, Saxophone, Trumpet, Clarinet |
| Voice | Speech, Singing, Chant, Choir |
| Other | Music, Percussion, Organ, Synthesizer |
## POC Steps
1. **Install panns-inference**
   ```bash
   pip install panns-inference
   ```
2. **Create `scripts/audio_scene_processor.py`**
   - Load audio via ffmpeg → numpy array
   - Process 1-second segments through Cnn14
   - Save results to JSON + Qdrant
3. **Add processor type to pipeline**
   - Add `AudioScene` to `ProcessorType` enum
   - Add to worker's processor dispatch
   - Add `AUDIO_SCENE_TIMEOUT` config
4. **Test with existing video**
   - Run on KOBA interview video
   - Verify instrument detection accuracy
   - Check performance (time, memory)
5. **Integrate with search**
   - Add audio_scene to universal_search
   - Add filter by audio class (speech/music/instrument)
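
The core loop of step 2 might look roughly like the sketch below. The model is passed in as a callable (`tag_fn`, a hypothetical name) so the segmentation and top-k logic can be exercised without downloading Cnn14; in the real script it would presumably wrap the panns-inference model. The 32 kHz mono decode is an assumption based on the sample rate PANNs models are trained at, and all function names here are illustrative.

```python
def ffmpeg_decode_cmd(video_path, sample_rate=32000):
    """ffmpeg command that writes the audio track as mono float32 PCM to stdout."""
    return ["ffmpeg", "-v", "error", "-i", video_path, "-vn",
            "-ac", "1", "-ar", str(sample_rate), "-f", "f32le", "-"]


def iter_segments(samples, sample_rate, seconds=1.0):
    """Yield (start_time, end_time, chunk); the last chunk may be shorter."""
    step = int(sample_rate * seconds)
    for start in range(0, len(samples), step):
        chunk = samples[start:start + step]
        yield start / sample_rate, (start + len(chunk)) / sample_rate, chunk


def top_classes(scores, labels, k=3):
    """Return the k highest-scoring classes in the plan's output format."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [{"class": name, "score": float(score)} for name, score in ranked[:k]]


def classify_segments(samples, sample_rate, tag_fn, labels, k=3):
    """Run tag_fn on each 1-second chunk and build the `segments` list."""
    results = []
    for start, end, chunk in iter_segments(samples, sample_rate):
        top = top_classes(tag_fn(chunk), labels, k)
        results.append({
            "start_time": start,
            "end_time": end,
            "primary_class": top[0]["class"],
            "confidence": top[0]["score"],
            "top_classes": top,
        })
    return results
```

Keeping the model behind `tag_fn` also makes it easy to swap in Option B or Option C later without touching the segment bookkeeping.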
## Estimated Effort
| Step | Time |
|------|------|
| Install + prototype script | 2-3 hours |
| Pipeline integration | 1-2 hours |
| Qdrant + search integration | 1 hour |
| Testing + tuning | 1-2 hours |
| **Total** | **5-8 hours** |
## Future Enhancements
- Real-time audio classification during processing
- Audio event timeline visualization
- Combine with TKG for audio-visual relationships
- Background music detection for copyright checks