Non-Human Sound Detection — Tool Selection Report
Date: 2026-05-10
Movie: Charade (1963), 113 min
Audio: 16 kHz mono WAV
Goal: Detect non-human sound events (gunshots, impacts, doors, music, etc.)
Tested Approaches
Approach A: AST AudioSet (HuggingFace)
| Item | Detail |
|---|---|
| Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Method | Audio Spectrogram Transformer, fine-tuned on AudioSet-2M (527 classes) |
| Dependencies | transformers, torch ✅ (no torchcodec needed) |
| Load time | ~1s on M5 |
| Inference time | ~0.5s per 3-second clip (~87M params, float32) |
| Accuracy | Good — correctly distinguishes speech vs. door vs. music |
Test results on Charade:
| Time | Energy-based said | AST AudioSet said | Verdict |
|---|---|---|---|
| 0:10 | — | Environmental noise (26%) | Background noise, plausible |
| 10:32 | Gunshot candidate (43x) | Speech (76%) | ✅ AST correct |
| 57:00 | Gunshot candidate (49x) | Door (62%) + Slam (5%) | ✅ AST correct |
| 65:13 | Gunshot candidate (50x) | Speech (58%) | ✅ AST correct |
| 85:12 | Gunshot candidate (39x) | Speech (68%) | ✅ AST correct |
Conclusion: All four gunshot candidates flagged by energy-based impulse detection were false positives (a 100% false-positive rate on this sample). AST AudioSet classified each of them as a non-gunshot class: speech in three cases and a door slam in one.
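A minimal sketch of how one 3-second clip might be scored with the checkpoint named above, using the Hugging Face `transformers` AST classes. The helper names `rank_labels` and `classify_clip` are illustrative and not taken from the project. Softmax is an assumption about how the percentages in the table were produced; AudioSet is multi-label, so a per-class sigmoid is a common alternative.

```python
import math

MODEL_ID = "MIT/ast-finetuned-audioset-10-10-0.4593"


def rank_labels(logits, id2label, k=3):
    """Softmax the raw logits and return the k most probable (label, prob) pairs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in order]


def classify_clip(waveform, sampling_rate=16000, k=3):
    """Score one mono clip (1-D float array) with AST; downloads the model on first use."""
    # Imported lazily so rank_labels stays usable without torch installed.
    import torch
    from transformers import ASTFeatureExtractor, ASTForAudioClassification

    extractor = ASTFeatureExtractor.from_pretrained(MODEL_ID)
    model = ASTForAudioClassification.from_pretrained(MODEL_ID)
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return rank_labels(logits.tolist(), model.config.id2label, k)
```

In a real run the extractor and model would be loaded once and reused across clips rather than reloaded on every call.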
Approach B: Custom Energy + Spectral Features
| Item | Detail |
|---|---|
| Method | RMS energy + spectral centroid + sub-band energy ratios |
| Speed | ~3s for full 113-min movie (every 10th window) |
| Accuracy | Poor — cannot distinguish gunshot from speech, door, music |
| Result | 1 "gunshot_candidate" from 453 test windows; all false positives on verification |
Conclusion: Useful as a coarse pre-filter (Stage 1), not as a standalone classifier.
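The three features named in the table can be sketched per window with NumPy as follows. This is an illustration, not the project's code: the function name `window_features`, the Hann window, and the 2 kHz split between the "low" and "high" sub-bands are all assumptions, since the report does not state which bands were used.

```python
import numpy as np


def window_features(x, sr=16000, split_hz=2000.0):
    """Return RMS, spectral centroid (Hz) and low-band energy ratio for one window."""
    x = np.asarray(x, dtype=np.float64)
    rms = float(np.sqrt(np.mean(x ** 2)))
    # Power spectrum of the Hann-windowed signal.
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    total = power.sum()
    if total == 0.0:  # silent window: centroid and ratio are undefined, report zeros
        return {"rms": rms, "centroid_hz": 0.0, "low_ratio": 0.0}
    centroid = float((freqs * power).sum() / total)
    # split_hz is an assumed band edge, not a value from the report.
    low_ratio = float(power[freqs < split_hz].sum() / total)
    return {"rms": rms, "centroid_hz": centroid, "low_ratio": low_ratio}
```

For a pure 1 kHz tone the centroid sits at about 1000 Hz and nearly all energy falls in the low band. Loud speech, a door slam, and a gunshot can all produce a high RMS with a broad spectrum, which is consistent with the poor accuracy reported above.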
Two-Stage Design
    Stage 1 (Energy filter, ~1 min):
      Full audio → sliding window RMS + centroid → ~200 candidate windows
            |
            v
    Stage 2 (AST classifier, ~2 min):
      Extract 3-sec audio for each candidate → AST AudioSet classification
            |
            v
    Non-speech events: gunshot, explosion, door slam, music, etc.
Estimated processing: ~3 min for full movie (vs. 75 min for full AST scan)
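The two-stage flow can be sketched as below. The function names, the 1.5 s hop, and the "RMS above 10x the median" rule are assumptions for illustration; the report's Stage 1 also uses the spectral centroid, which is omitted here for brevity. `classify` is any callable that maps a clip to a label, for example a wrapper around the AST model.

```python
import numpy as np


def energy_candidates(audio, sr=16000, win_s=3.0, hop_s=1.5, ratio=10.0):
    """Stage 1: start times (s) of windows whose RMS exceeds ratio x the median RMS."""
    audio = np.asarray(audio, dtype=np.float64)
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = range(0, max(len(audio) - win, 0) + 1, hop)
    rms = np.array([np.sqrt(np.mean(audio[s:s + win] ** 2)) for s in starts])
    floor = np.median(rms) + 1e-12  # avoid a zero threshold on silent audio
    return [s / sr for s, r in zip(starts, rms) if r > ratio * floor]


def two_stage(audio, classify, sr=16000, win_s=3.0, **stage1_kwargs):
    """Stage 2: run classify(clip) only on Stage 1 candidates; return (start_s, label)."""
    audio = np.asarray(audio, dtype=np.float64)
    events = []
    for start in energy_candidates(audio, sr=sr, win_s=win_s, **stage1_kwargs):
        begin = int(round(start * sr))
        clip = audio[begin:begin + int(win_s * sr)]
        events.append((start, classify(clip)))
    return events
```

With the figures reported above, about 200 candidates at roughly 0.5 s per clip come to about 100 s of AST inference, which matches the ~2 min Stage 2 estimate.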
Key AudioSet Classes Relevant to Charade
| Class | AudioSet ID | Relevance |
|---|---|---|
| Gunshot, gunfire | 402 | Primary target |
| Explosion | 400 | Hand grenade in plot |
| Door slams | 404 | Scenes at hotel, apartment |
| Music | 130-133 | Background score |
| Speech | 0-3 | Already handled by ASR |
| Vehicle | 100-110 | Car sounds in Paris chase |
| Glass break | 424 | Window breaking scene |
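Numeric class indices depend on the label file shipped with a given checkpoint, so one option is to resolve the classes of interest by name at load time instead of hard-coding them. The sketch below assumes an `id2label` mapping of the kind Hugging Face models expose as `model.config.id2label`; the function name `resolve_class_indices` is illustrative.

```python
def resolve_class_indices(id2label, wanted):
    """Map each wanted label name to its index in id2label (case-insensitive exact match)."""
    by_name = {name.lower(): int(idx) for idx, name in id2label.items()}
    found, missing = {}, []
    for name in wanted:
        key = name.lower()
        if key in by_name:
            found[name] = by_name[key]
        else:
            missing.append(name)  # report names the checkpoint does not define
    return found, missing
```

Returning the unmatched names separately makes it obvious when a class in the table is spelled differently in the checkpoint's label set.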
Actor-voice gender mismatches (resolved by fine-grained ASRX)
During the speaker mapping work, we found 20 segments where the old face→TMDb assignment said "Audrey Hepburn" but the new ASRX voice embedding clearly said "MALE". These segments were verified via video clips and confirmed to be scenes where:
- A male speaker (Cary Grant or other) is speaking while Audrey Hepburn's face is on screen
- The old pipeline incorrectly assigned the speaker name based on face identity
- The fine-grained sliding window approach correctly resolves these
The 20 segments were from SPEAKER_5 (10 segs) and SPEAKER_9 (10 segs), both of which mapped to MALE voice clusters. These were re-assigned to "Cary Grant" or "Unknown" as appropriate.
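The check described above amounts to comparing the gender implied by the face-based name with the gender of the voice cluster. A minimal sketch follows, assuming a hypothetical `Segment` record and an `ACTOR_GENDER` lookup, neither of which comes from the original pipeline. It only flags mismatches; the choice between "Cary Grant" and "Unknown" is left to the manual clip review.

```python
from dataclasses import dataclass

# Assumed lookup of each actor's expected voice gender; not part of the original pipeline.
ACTOR_GENDER = {"Audrey Hepburn": "FEMALE", "Cary Grant": "MALE"}


@dataclass
class Segment:
    speaker_id: str    # diarization label, e.g. "SPEAKER_5"
    face_name: str     # name from the old face -> TMDb assignment
    voice_gender: str  # "MALE" / "FEMALE" from the voice-embedding cluster


def find_gender_mismatches(segments, actor_gender=ACTOR_GENDER):
    """Return segments whose face-based name disagrees with the voice-cluster gender."""
    flagged = []
    for seg in segments:
        expected = actor_gender.get(seg.face_name)
        # Names without a known gender are skipped rather than guessed.
        if expected is not None and expected != seg.voice_gender:
            flagged.append(seg)
    return flagged
```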
Recommendations
| Approach | Speed | Accuracy | Best for |
|---|---|---|---|
| Energy pre-filter | ✅ 1 min | ❌ Low | Stage 1: candidate selection |
| AST AudioSet | ⚠️ 2 min | ✅ High | Stage 2: event classification |
| Full AST scan | ❌ 75 min | ✅ High | N/A — two-stage is better |
Design: Two-stage pipeline: energy pre-filter → AST classifier

Implementation path:
- Write `scripts/non_human_sound_detector.py` with the two-stage design
- Output `{uuid}.sound_events.json` with typed events
- Integrate into the sound_event_detector framework
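One possible shape for the typed events file is sketched below. The field names (`start_s`, `end_s`, `label`, `score`) and the top-level layout are assumptions, since the report does not define the schema; the function names are illustrative as well.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SoundEvent:
    start_s: float  # clip start, in seconds from the beginning of the movie
    end_s: float    # clip end, in seconds
    label: str      # AudioSet class name, e.g. "Door"
    score: float    # classifier confidence in [0, 1]


def events_to_payload(uuid, events):
    """Build the JSON-serialisable payload, with events sorted by start time."""
    ordered = sorted(events, key=lambda e: e.start_s)
    return {"uuid": uuid, "events": [asdict(e) for e in ordered]}


def write_sound_events(path, uuid, events):
    """Write the payload to path, e.g. '<uuid>.sound_events.json'."""
    payload = events_to_payload(uuid, events)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    return payload
```

For example, the door event at 57:00 in the test table would be stored as `SoundEvent(3420.0, 3423.0, "Door", 0.62)`.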