fix: TKG stats API returning 0 - count_by_type used wrong column
- tkg_nodes has no edge_type column, query was failing silently
- Split into count_nodes(node_type) and count_edges(edge_type)
- Fixed text_region → text_trace node type name
- Also: OCR frame fix in rule1 (end_frame computed from end_time+FPS)
162
docs_v1.0/DESIGN/Audio_Scene_Detection_POC.md
Normal file
@@ -0,0 +1,162 @@
---
title: Audio Scene & Instrument Detection POC Plan
version: 0.1
date: 2026-07-02
author: OpenCode
status: planned
---

| scope | status | applicable to |
|-------|--------|---------------|
| Audio processing pipeline | planned | Video files with non-speech audio |

## Goal

Detect non-speech audio events (instruments, music, environmental sounds) in video files alongside existing ASRX speech recognition.

## Why

Current pipeline only detects speech (ASRX → 64 segments + 1554 speaker embeddings). Instrument sounds, background music, and environmental audio are completely ignored.

## Technical Options

### Option A: PANNs (Pre-trained Audio Neural Networks)
- **Model**: Cnn14 (about 80M params, roughly 300 MB checkpoint)
- **Classes**: 527 AudioSet classes (piano, guitar, drums, speech, etc.)
- **Pros**: Production-ready, accurate, PyTorch-based
- **Cons**: Large download, ~200MB RAM per inference
- **Install**: `pip install panns-inference`

### Option B: YAMNet (Google)
- **Model**: MobileNet-based, 4MB weights
- **Classes**: 521 AudioSet classes
- **Pros**: Lightweight, fast
- **Cons**: Requires TensorFlow (not currently installed)
- **Install**: TensorFlow + `tensorflow-hub` (the official YAMNet release is loaded from TensorFlow Hub rather than installed as a `yamnet` pip package)

### Option C: torchaudio + heuristics (lightweight fallback)
- Use existing PyTorch + torchaudio
- Extract spectral features (MFCC, centroid, energy)
- Simple classification: speech vs music vs silence
- **Pros**: No extra dependencies
- **Cons**: Less accurate, limited classes

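To make the Option C idea concrete, here is a minimal sketch of per-frame feature extraction followed by a threshold rule. It uses only the Python standard library rather than torchaudio so that it runs anywhere, it computes energy, zero-crossing rate, and spectral centroid (not MFCCs), and the function names and thresholds are hypothetical placeholders, not tuned values.

```python
import cmath
import math


def frame_features(samples, sample_rate):
    """Return RMS energy, zero-crossing rate, and spectral centroid (Hz) for one frame."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Fraction of adjacent sample pairs whose sign differs.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    # Naive O(n^2) DFT over the non-negative frequency bins; only suitable for short frames.
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    total = sum(mags)
    # Magnitude-weighted mean frequency; defined as 0.0 for an all-zero frame.
    centroid = sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total if total > 0 else 0.0
    return {"rms": rms, "zcr": zcr, "centroid_hz": centroid}


def classify_frame(features, silence_rms=0.01, speech_centroid_hz=2000.0):
    """Toy rule with hypothetical thresholds: silence by energy, then speech vs music by centroid."""
    if features["rms"] < silence_rms:
        return "silence"
    return "speech" if features["centroid_hz"] < speech_centroid_hz else "music"
```

A real fallback would swap the naive DFT for torchaudio's spectral transforms and calibrate the thresholds on labelled audio; the sketch only shows the shape of the approach.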
## Recommended: Option A (PANNs)

## Pipeline Integration

```
Video → Audio Extract → ASRX (speech) → Speaker Embeddings (3.4/s)
                      → Audio Scene (new) → Scene Labels (1/s)
```

### New Processor: `audio_scene`

| Field | Value |
|-------|-------|
| Processor type | `audio_scene` |
| Input | Video file (audio track) |
| Output | `file_uuid.audio_scene.json` |
| Sampling | 1-second segments |
| Qdrant collection | `momentry_{schema}_audio_scene` |

### Output Format

```json
{
  "file_uuid": "...",
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 1.0,
      "primary_class": "speech",
      "confidence": 0.95,
      "top_classes": [
        {"class": "speech", "score": 0.95},
        {"class": "music", "score": 0.03},
        {"class": "piano", "score": 0.01}
      ]
    }
  ],
  "summary": {
    "speech_ratio": 0.72,
    "music_ratio": 0.15,
    "silence_ratio": 0.08,
    "instrument_ratio": 0.05,
    "instruments_detected": ["piano", "guitar"]
  }
}
```

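One way the `summary` object could be derived from the per-segment results is sketched below. It assumes each ratio is the fraction of equal-length segments whose `primary_class` falls in that bucket, and that `instruments_detected` is collected from `primary_class` as well. Both are assumptions, as is the illustrative `INSTRUMENT_CLASSES` set; the plan does not specify either.

```python
from collections import Counter

# Hypothetical, illustrative subset of instrument labels (not a definitive list).
INSTRUMENT_CLASSES = {"piano", "guitar", "drums", "violin", "flute"}


def summarize(segments, instrument_classes=INSTRUMENT_CLASSES):
    """Build a `summary` dict from a list of segment dicts with a `primary_class` key."""
    total = len(segments)
    counts = Counter()
    instruments = []  # keeps first-seen order, without duplicates
    for seg in segments:
        label = seg["primary_class"]
        if label in instrument_classes:
            counts["instrument"] += 1
            if label not in instruments:
                instruments.append(label)
        else:
            counts[label] += 1

    def ratio(key):
        # Fraction of segments in this bucket; 0.0 for an empty file.
        return round(counts[key] / total, 2) if total else 0.0

    return {
        "speech_ratio": ratio("speech"),
        "music_ratio": ratio("music"),
        "silence_ratio": ratio("silence"),
        "instrument_ratio": ratio("instrument"),
        "instruments_detected": instruments,
    }
```

If the real processor emits segments of unequal length, weighting each bucket by duration would be a more faithful ratio than counting segments.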
### Qdrant Storage

| Field | Type | Purpose |
|-------|------|---------|
| `file_uuid` | string | Filter by file |
| `start_time` | float | Segment start |
| `end_time` | float | Segment end |
| `primary_class` | keyword | Filter by class |
| `confidence` | float | Filter by confidence |
| `instrument_name` | keyword | Search by instrument |
| `vector` | f32[2048] | Audio embedding for similarity search |

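The table above maps naturally onto Qdrant's id / vector / payload point structure. The sketch below builds one such point as a plain dict, without calling the Qdrant client. The helper name, the deterministic UUID scheme, and the rule for filling `instrument_name` (the first entry in `top_classes` that is a known instrument) are all assumptions made for illustration.

```python
import uuid

EMBEDDING_DIM = 2048  # matches the f32[2048] vector in the table


def segment_to_point(file_uuid, segment, embedding, instrument_classes):
    """Return a dict with `id`, `vector`, and `payload` for one audio segment."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}")
    # Assumption: instrument_name is the highest-ranked top class that is an instrument.
    instrument = next(
        (c["class"] for c in segment["top_classes"] if c["class"] in instrument_classes),
        None,
    )
    # Deterministic UUID so re-processing the same file overwrites the same point.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file_uuid}:{segment['start_time']}"))
    return {
        "id": point_id,
        "vector": list(embedding),
        "payload": {
            "file_uuid": file_uuid,
            "start_time": segment["start_time"],
            "end_time": segment["end_time"],
            "primary_class": segment["primary_class"],
            "confidence": segment["confidence"],
            "instrument_name": instrument,
        },
    }
```

Deriving the id from `file_uuid` and `start_time` keeps upserts idempotent, which is convenient while the POC is re-run repeatedly on the same video.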
### Processor Dependencies

```
audio_scene → (no dependencies, runs parallel with ASRX)
```

## Key AudioSet Instrument Classes

| Category | Classes |
|----------|---------|
| Piano | Piano, Electric piano, Keyboard |
| Guitar | Guitar, Electric guitar, Acoustic guitar |
| Drums | Drum kit, Snare drum, Cymbal, Hi-hat |
| Strings | Violin, Cello, Harp, Double bass |
| Wind | Flute, Saxophone, Trumpet, Clarinet |
| Voice | Speech, Singing, Chant, Choir |
| Other | Music, Percussion, Organ, Synthesizer |

## POC Steps

1. **Install panns-inference**
   ```bash
   pip install panns-inference
   ```

2. **Create `scripts/audio_scene_processor.py`**
   - Load audio via ffmpeg → numpy array (mono, 32 kHz, the sample rate the Cnn14 checkpoints were trained on)
   - Process 1-second segments through Cnn14
   - Save results to JSON + Qdrant

3. **Add processor type to pipeline**
   - Add `AudioScene` to `ProcessorType` enum
   - Add to worker's processor dispatch
   - Add `AUDIO_SCENE_TIMEOUT` config

4. **Test with existing video**
   - Run on KOBA interview video
   - Verify instrument detection accuracy
   - Check performance (time, memory)

5. **Integrate with search**
   - Add audio_scene to universal_search
   - Add filter by audio class (speech/music/instrument)

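The audio-loading and segmentation parts of step 2 could look roughly like the sketch below. It only builds the ffmpeg argument list and computes window boundaries; it does not run ffmpeg or the model. The function names are hypothetical, and the 32 kHz mono default and the choice to keep a trailing partial window are assumptions.

```python
def ffmpeg_pcm_command(video_path, sample_rate=32000):
    """Build an ffmpeg command that decodes the audio track to mono float32 PCM on stdout."""
    return [
        "ffmpeg", "-v", "error",
        "-i", video_path,
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (assumed 32 kHz for Cnn14)
        "-f", "f32le",            # raw little-endian float32 samples
        "pipe:1",                 # write to stdout
    ]


def one_second_windows(num_samples, sample_rate=32000):
    """Yield (start_time, end_time, start_index, end_index) for consecutive 1-second windows.

    Assumption: a trailing partial window is kept, with its true end time.
    """
    for start in range(0, num_samples, sample_rate):
        end = min(start + sample_rate, num_samples)
        yield (start / sample_rate, end / sample_rate, start, end)
```

The command list can be passed to `subprocess.run(..., capture_output=True)` and the resulting bytes viewed as float32 samples; each `(start_index, end_index)` slice is then one segment to feed to the classifier.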
## Estimated Effort

| Step | Time |
|------|------|
| Install + prototype script | 2-3 hours |
| Pipeline integration | 1-2 hours |
| Qdrant + search integration | 1 hour |
| Testing + tuning | 1-2 hours |
| **Total** | **5-8 hours** |

## Future Enhancements

- Real-time audio classification during processing
- Audio event timeline visualization
- Combine with TKG for audio-visual relationships
- Background music detection for copyright checks