
Non-Human Sound Detection — Tool Selection Report

Date: 2026-05-10
Movie: Charade (1963), 113 min
Audio: 16 kHz mono WAV
Goal: Detect non-human sound events (gunshots, impacts, doors, music, etc.)

Tested Approaches

Approach A: AST AudioSet (HuggingFace)

| Item | Detail |
| --- | --- |
| Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Method | Audio Spectrogram Transformer, fine-tuned on AudioSet-2M (527 classes) |
| Dependencies | transformers, torch (no torchcodec needed) |
| Load time | ~1s on M5 |
| Inference time | ~0.5s per 3-second clip (~87M params, float32) |
| Accuracy | Good: correctly distinguishes speech vs. door vs. music |

Test results on Charade:

| Time | Energy-based said | AST AudioSet said | Verdict |
| --- | --- | --- | --- |
| 0:10 | | Environmental noise (26%) | Background noise, plausible |
| 10:32 | Gunshot candidate (43x) | Speech (76%) | AST correct |
| 57:00 | Gunshot candidate (49x) | Door (62%) + Slam (5%) | AST correct |
| 65:13 | Gunshot candidate (50x) | Speech (58%) | AST correct |
| 85:12 | Gunshot candidate (39x) | Speech (68%) | AST correct |

Conclusion: All four gunshot candidates flagged by energy-based impulse detection were false positives (4 of 4). AST AudioSet classified each of them as a non-gunshot class (speech or door), which matches the verified content.
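The verdict column can be reproduced mechanically by grouping AST's top labels into coarse event types. The following is a minimal sketch, assuming predictions arrive as a list of `{"label", "score"}` dicts (the shape a Hugging Face audio-classification pipeline returns). The `EVENT_GROUPS` mapping, the `classify_candidate` name, and the `min_score` threshold are illustrative assumptions, not part of the tested setup.

```python
# Illustrative grouping of AudioSet display names into coarse event types.
# The grouping itself is an assumption, not something the report defines.
EVENT_GROUPS = {
    "gunshot": {"Gunshot, gunfire"},
    "explosion": {"Explosion"},
    "door": {"Door", "Slam"},
    "music": {"Music"},
    "speech": {"Speech"},
}


def classify_candidate(predictions, min_score=0.30):
    """Map AST top-k predictions to a coarse event type.

    predictions: list of {"label": str, "score": float} dicts.
    Returns (event_type, summed_score), or ("unknown", 0.0) when no
    group reaches min_score (a hypothetical threshold).
    """
    totals = {}
    for pred in predictions:
        for event_type, labels in EVENT_GROUPS.items():
            if pred["label"] in labels:
                totals[event_type] = totals.get(event_type, 0.0) + pred["score"]
    if not totals:
        return "unknown", 0.0
    best = max(totals, key=totals.get)
    if totals[best] < min_score:
        return "unknown", 0.0
    return best, totals[best]


# Scores taken from the 57:00 and 10:32 rows of the table above.
door_clip = [{"label": "Door", "score": 0.62}, {"label": "Slam", "score": 0.05}]
speech_clip = [{"label": "Speech", "score": 0.76}]
print(classify_candidate(door_clip)[0])    # door
print(classify_candidate(speech_clip)[0])  # speech
```

Summing related labels (Door + Slam) is one possible design choice; taking only the single top label would give the same verdicts for these rows.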

Approach B: Custom Energy + Spectral Features

| Item | Detail |
| --- | --- |
| Method | RMS energy + spectral centroid + sub-band energy ratios |
| Speed | ~3s for full 113-min movie (every 10th window) |
| Accuracy | Poor: cannot distinguish gunshot from speech, door, music |
| Result | 1 "gunshot_candidate" from 453 test windows; all false positives on verification |

Conclusion: Useful as a coarse pre-filter (Stage 1), not as a standalone classifier.
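As a rough illustration of what such a pre-filter might look like, the sketch below flags windows whose RMS level is well above the median window level. It covers only the RMS part; spectral centroid and sub-band ratios are omitted to keep it dependency-free. The function names, the 0.1 s window, and the 10x ratio are assumptions for illustration, and it assumes the "43x"-style figures in the earlier table denote energy relative to a background level.

```python
import math
import statistics


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_candidates(samples, sample_rate=16000, window_s=0.1, ratio=10.0):
    """Return (start_time_s, energy_ratio) for loud windows.

    A window is flagged when its RMS is at least `ratio` times the median
    window RMS. window_s and ratio are illustrative values only.
    """
    window = int(sample_rate * window_s)
    levels = [
        rms(samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]
    if not levels:
        return []
    baseline = statistics.median(levels) or 1e-12  # guard against pure silence
    return [
        (idx * window_s, level / baseline)
        for idx, level in enumerate(levels)
        if level / baseline >= ratio
    ]


# Synthetic 1-second signal: quiet background with one loud 0.1 s burst.
signal = [0.01] * 16000
signal[8000:9600] = [0.5] * 1600
print(energy_candidates(signal))  # one candidate starting at 0.5 s
```

A filter like this only measures loudness relative to the background, which is consistent with the finding that loud speech and door slams are flagged alongside genuine impulses.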

Two-Stage Design

Stage 1 (Energy filter, ~1 min):
  Full audio → sliding window RMS + centroid → ~200 candidate windows
                    |
                    v
Stage 2 (AST classifier, ~2 min):
  Extract 3-sec audio for each candidate → AST AudioSet classification
                    |
                    v
  Non-speech events: gunshot, explosion, door slam, music, etc.

Estimated processing: ~3 min for full movie (vs. 75 min for full AST scan)
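The estimates can be checked with simple arithmetic from the figures reported earlier (0.5 s inference per 3-second clip, about 200 candidates, 113-minute runtime). The report does not state the hop size behind the 75-minute full-scan figure; the sketch below assumes a 0.75 s hop (75% overlap), which is one reading that reproduces it. With non-overlapping 3-second clips the same formula gives roughly 19 minutes.

```python
import math

MOVIE_S = 113 * 60   # runtime from the report header
CLIP_S = 3.0         # AST clip length
INFER_S = 0.5        # AST inference time per clip
CANDIDATES = 200     # approximate Stage 1 output


def scan_minutes(hop_s):
    """Minutes of AST inference for a full scan with the given hop size."""
    n_clips = math.floor((MOVIE_S - CLIP_S) / hop_s) + 1
    return n_clips * INFER_S / 60


stage2_min = CANDIDATES * INFER_S / 60
print(round(stage2_min, 1))          # 1.7  (Stage 2 on ~200 candidates)
print(round(scan_minutes(3.0), 1))   # 18.8 (full scan, no overlap)
print(round(scan_minutes(0.75), 1))  # 75.3 (full scan, assumed 0.75 s hop)
```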

Key AudioSet Classes Relevant to Charade

| Class | AudioSet ID | Relevance |
| --- | --- | --- |
| Gunshot, gunfire | 402 | Primary target |
| Explosion | 400 | Hand grenade in plot |
| Door slams | 404 | Scenes at hotel, apartment |
| Music | 130-133 | Background score |
| Speech | 0-3 | Already handled by ASR |
| Vehicle | 100-110 | Car sounds in Paris chase |
| Glass break | 424 | Window breaking scene |
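Numeric class indices depend on the label ordering shipped with a given checkpoint, so it may be safer to resolve them by display name at load time than to hard-code them. A minimal sketch, assuming the index-to-name mapping is available as a dict (for a loaded transformers model this is `model.config.id2label`); `resolve_class_ids` is a hypothetical helper, and the toy mapping below uses made-up indices.

```python
def resolve_class_ids(id2label, wanted_labels):
    """Return {label: index} for each wanted label found in id2label.

    id2label: mapping of class index -> display name.
    Raises KeyError listing any labels the mapping does not define.
    """
    label2id = {name: int(idx) for idx, name in id2label.items()}
    missing = [name for name in wanted_labels if name not in label2id]
    if missing:
        raise KeyError(f"labels not in model config: {missing}")
    return {name: label2id[name] for name in wanted_labels}


# Toy mapping with made-up indices, standing in for model.config.id2label.
toy_id2label = {0: "Speech", 1: "Music", 2: "Gunshot, gunfire"}
print(resolve_class_ids(toy_id2label, ["Gunshot, gunfire", "Speech"]))
# {'Gunshot, gunfire': 2, 'Speech': 0}
```

Failing loudly on a missing label catches spelling differences between a class list like the one above and the checkpoint's own display names.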

Actor-voice gender mismatches (resolved by fine-grained ASRX)

During the speaker mapping work, 20 segments were found where the old face→TMDb assignment said "Audrey Hepburn" but the new ASRX voice embedding clearly said "MALE". These segments were verified via video clips and confirmed to be scenes where:

  1. A male speaker (Cary Grant or other) is speaking while Audrey Hepburn's face is on screen
  2. The old pipeline incorrectly assigned the speaker name based on face identity
  3. The fine-grained sliding window approach correctly resolves these

The 20 segments were from SPEAKER_5 (10 segs) and SPEAKER_9 (10 segs), both of which mapped to MALE voice clusters. These were re-assigned to "Cary Grant" or "Unknown" as appropriate.
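The mismatch check described above can be sketched as a comparison between the gender expected for the face-derived name and the gender reported by the voice embedding. The `Segment` fields, the `ACTOR_GENDER` lookup, and `find_gender_mismatches` are hypothetical names for illustration; the actual ASRX data model is not shown in this report.

```python
from dataclasses import dataclass

# Hypothetical lookup of actor name -> expected voice gender.
ACTOR_GENDER = {"Audrey Hepburn": "FEMALE", "Cary Grant": "MALE"}


@dataclass
class Segment:
    speaker_id: str    # e.g. "SPEAKER_5"
    face_name: str     # name from the old face -> TMDb assignment
    voice_gender: str  # "MALE" or "FEMALE" from the voice embedding


def find_gender_mismatches(segments):
    """Return segments whose face-derived name disagrees with the voice gender.

    Segments with a name missing from ACTOR_GENDER are skipped, since
    there is nothing to compare against.
    """
    return [
        seg for seg in segments
        if ACTOR_GENDER.get(seg.face_name) not in (None, seg.voice_gender)
    ]


segments = [
    Segment("SPEAKER_5", "Audrey Hepburn", "MALE"),    # mismatch
    Segment("SPEAKER_2", "Audrey Hepburn", "FEMALE"),  # consistent
]
print([s.speaker_id for s in find_gender_mismatches(segments)])  # ['SPEAKER_5']
```

Flagged segments would still need manual or clip-based verification before being re-assigned, as was done for the 20 segments here.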

Recommendations

| Approach | Speed | Accuracy | Best for |
| --- | --- | --- | --- |
| Energy pre-filter | 1 min | Low | Stage 1: candidate selection |
| AST AudioSet ⚠️ | 2 min | High | Stage 2: event classification |
| Full AST scan | 75 min | High | N/A (two-stage is better) |

Design: Two-stage pipeline: energy pre-filter → AST classifier

Implementation path:

  1. Write scripts/non_human_sound_detector.py with the two-stage design
  2. Output {uuid}.sound_events.json with typed events
  3. Integrate into the sound_event_detector framework
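The report does not define the layout of `{uuid}.sound_events.json`. The sketch below shows one possible shape for typed events; every field name (`start_s`, `end_s`, `event_type`, `score`) and both helper functions are assumptions for illustration, not the format used by the sound_event_detector framework.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SoundEvent:
    start_s: float   # clip start, seconds from the start of the audio
    end_s: float     # clip end, seconds
    event_type: str  # coarse type, e.g. "door", "gunshot", "music"
    score: float     # classifier confidence in [0, 1]


def sound_events_filename(media_uuid):
    """File name following the {uuid}.sound_events.json pattern."""
    return f"{media_uuid}.sound_events.json"


def sound_events_json(media_uuid, events):
    """Serialize events to a JSON string (hypothetical schema)."""
    payload = {"uuid": media_uuid, "events": [asdict(e) for e in events]}
    return json.dumps(payload, indent=2)


# The 57:00 door event from the test results, as a 3-second clip.
events = [SoundEvent(start_s=3420.0, end_s=3423.0, event_type="door", score=0.62)]
print(sound_events_filename("example-uuid"))  # example-uuid.sound_events.json
print(sound_events_json("example-uuid", events))
```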