Non-Human Sound Detection — Tool Selection Report

Date: 2026-05-10
Movie: Charade (1963), 113 min
Audio: 16 kHz mono WAV
Goal: Detect non-human sound events (gunshots, impacts, doors, music, etc.)

Tested Approaches

Approach A: AST AudioSet (HuggingFace)

Item	Detail
Model	`MIT/ast-finetuned-audioset-10-10-0.4593`
Method	Audio Spectrogram Transformer, fine-tuned on AudioSet-2M (527 classes)
Dependencies	`transformers`, `torch` ✅ (no torchcodec needed)
Load time	~1s on M5
Inference time	~0.5 s per 3-second clip (~87M params, float32)
Accuracy	Good — correctly distinguishes speech vs. door vs. music

Test results on Charade:

Time	Energy-based result	AST AudioSet top prediction(s)	Verdict
0:10	—	Environmental noise (26%)	Background noise, plausible
10:32	Gunshot candidate (43x)	Speech (76%)	✅ AST correct
57:00	Gunshot candidate (49x)	Door (62%) + Slam (5%)	✅ AST correct
65:13	Gunshot candidate (50x)	Speech (58%)	✅ AST correct
85:12	Gunshot candidate (39x)	Speech (68%)	✅ AST correct

Conclusion: All four energy-based gunshot candidates that were checked turned out to be false positives (three were speech, one was a door slam). AST AudioSet classified every one of them as a non-gunshot class.
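The verdict column above can be derived mechanically from the classifier's top predictions. The following is a minimal sketch, assuming the HuggingFace `audio-classification` pipeline is used for inference; the helper names, the `min_score` threshold of 0.3, and the set of labels treated as gunshot-like are illustrative assumptions, not part of the tested setup.

```python
# Labels treated as gunshot-like (assumed selection of AudioSet display names).
GUNSHOT_LABELS = {"Gunshot, gunfire", "Machine gun", "Cap gun"}


def load_ast_classifier(model_id="MIT/ast-finetuned-audioset-10-10-0.4593"):
    """Build a HuggingFace audio-classification pipeline (needs transformers + torch)."""
    from transformers import pipeline  # lazy import: only needed for real inference
    return pipeline("audio-classification", model=model_id)


def classify_clip(classifier, clip, sample_rate=16000, top_k=5):
    """Return [{'label': str, 'score': float}, ...] for one mono clip (1-D float array)."""
    return classifier({"raw": clip, "sampling_rate": sample_rate}, top_k=top_k)


def verdict(predictions, min_score=0.3):
    """Summarise top-k predictions: best label, its score, and a gunshot flag.

    min_score is a hypothetical confidence floor, not a tuned value.
    """
    best = max(predictions, key=lambda p: p["score"])
    is_gunshot = best["label"] in GUNSHOT_LABELS and best["score"] >= min_score
    return {"label": best["label"], "score": best["score"], "is_gunshot": is_gunshot}
```

With the 57:00 row from the table, `verdict([{"label": "Door", "score": 0.62}, {"label": "Slam", "score": 0.05}])` reports `Door` with `is_gunshot` set to `False`, matching the manual verdict.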

Approach B: Custom Energy + Spectral Features

Item	Detail
Method	RMS energy + spectral centroid + sub-band energy ratios
Speed	~3s for full 113-min movie (every 10th window)
Accuracy	Poor — cannot distinguish gunshot from speech, door, music
Result	1 "gunshot_candidate" from 453 test windows; all false positives on verification

Conclusion: Useful as a coarse pre-filter (Stage 1), not as a standalone classifier.
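As a rough illustration of the pre-filter role, the sketch below flags windows whose RMS energy is many times the median RMS of the file. It assumes that figures such as "43x" denote an energy ratio against the background level, which the report does not state explicitly. The spectral centroid and sub-band ratios are omitted for brevity, and the window size and `ratio_threshold` are hypothetical values.

```python
import math
from statistics import median


def rms_windows(samples, sample_rate, win_s=0.1, hop_s=0.1):
    """Return (start_time_s, rms) for each window of a mono signal."""
    win = max(1, int(win_s * sample_rate))
    hop = max(1, int(hop_s * sample_rate))
    out = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        rms = math.sqrt(sum(x * x for x in frame) / win)
        out.append((start / sample_rate, rms))
    return out


def impulse_candidates(samples, sample_rate, ratio_threshold=20.0,
                       win_s=0.1, hop_s=0.1):
    """Return (start_time_s, ratio) for windows whose RMS is at least
    ratio_threshold times the median RMS (threshold is an assumption)."""
    windows = rms_windows(samples, sample_rate, win_s, hop_s)
    if not windows:
        return []
    floor = median(r for _, r in windows) or 1e-12  # guard against pure silence
    return [(t, r / floor) for t, r in windows if r / floor >= ratio_threshold]
```

Every window returned here is only a candidate; as the results above show, loud speech and door slams pass this kind of filter just as easily as gunshots, so a classifier still has to make the final call.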

Two-Stage Design

Stage 1 (Energy filter, ~1 min):
  Full audio → sliding window RMS + centroid → ~200 candidate windows
                    |
                    v
Stage 2 (AST classifier, ~2 min):
  Extract 3-sec audio for each candidate → AST AudioSet classification
                    |
                    v
  Non-speech events: gunshot, explosion, door slam, music, etc.

Estimated processing: ~3 min for full movie (vs. 75 min for full AST scan)
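The timing estimates can be sanity-checked from the per-clip inference time reported for Approach A. In the sketch below, the hop of 0.75 s for the full scan is an assumption: the report does not state it, and it is used here only because it reproduces the 75 min figure (a non-overlapping 3 s hop would give roughly 19 min instead).

```python
MOVIE_MINUTES = 113        # from the report header
CLIP_INFERENCE_S = 0.5     # ~0.5 s per 3-second clip (Approach A)


def stage2_minutes(n_candidates):
    """AST time for the candidate windows that survive Stage 1."""
    return n_candidates * CLIP_INFERENCE_S / 60


def full_scan_minutes(hop_s):
    """AST time if every hop_s seconds of the movie started a new clip."""
    n_clips = MOVIE_MINUTES * 60 / hop_s
    return n_clips * CLIP_INFERENCE_S / 60


print(round(stage2_minutes(200), 1))     # about 200 candidates
print(round(full_scan_minutes(0.75)))    # assumed overlapping hop
print(round(full_scan_minutes(3.0)))     # non-overlapping clips
```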

Key AudioSet Classes Relevant to Charade

Class	AudioSet ID	Relevance
Gunshot, gunfire	402	Primary target
Explosion	400	Hand grenade in plot
Door slams	404	Scenes at hotel, apartment
Music	130-133	Background score
Speech	0-3	Already handled by ASR
Vehicle	100-110	Car sounds in Paris chase
Glass break	424	Window breaking scene
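Because the classifier's output indices follow the checkpoint's own label order, it may be worth cross-checking the numeric IDs above against the model's `id2label` mapping before hard-coding them. The helper below is a minimal sketch of such a lookup; `find_label_ids` is a hypothetical name, and the mapping in the usage comment would come from the model configuration.

```python
def find_label_ids(id2label, keywords):
    """Map each keyword to the sorted class indices whose label contains it
    (case-insensitive substring match)."""
    hits = {}
    for keyword in keywords:
        needle = keyword.lower()
        hits[keyword] = sorted(
            int(idx) for idx, name in id2label.items() if needle in name.lower()
        )
    return hits


# Possible usage with the real checkpoint (requires transformers):
#   from transformers import AutoConfig
#   cfg = AutoConfig.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
#   find_label_ids(cfg.id2label, ["gunshot", "explosion", "door", "glass"])
```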

Actor-voice gender mismatches (resolved by fine-grained ASRX)

During the speaker mapping work, 20 segments were found where the old face→TMDb assignment said "Audrey Hepburn" but the new ASRX voice embedding clearly said "MALE". These segments were verified via video clips and confirmed to be scenes where:

- A male speaker (Cary Grant or another actor) is speaking while Audrey Hepburn's face is on screen
- The old pipeline incorrectly assigned the speaker name based on face identity
- The fine-grained sliding window approach correctly resolves these

The 20 segments were from SPEAKER_5 (10 segs) and SPEAKER_9 (10 segs), both of which mapped to MALE voice clusters. These were re-assigned to "Cary Grant" or "Unknown" as appropriate.
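The mismatch check described above can be expressed as a simple rule: flag any segment whose face-derived name implies a different gender than the voice cluster. The sketch below is only illustrative; the `Segment` fields, the `ACTOR_GENDER` table, and the `voice_cluster_names` mapping are hypothetical and do not reflect the actual ASRX schema. In the report the final names were confirmed manually from video clips.

```python
from dataclasses import dataclass, replace

# Illustrative subset; a real table would cover the full cast.
ACTOR_GENDER = {"Audrey Hepburn": "FEMALE", "Cary Grant": "MALE"}


@dataclass(frozen=True)
class Segment:
    speaker_id: str     # diarization label, e.g. "SPEAKER_5"
    speaker_name: str   # name from the old face -> TMDb assignment
    voice_gender: str   # "MALE" or "FEMALE" from the voice embedding cluster


def find_mismatches(segments):
    """Segments whose assigned actor's gender disagrees with the voice gender."""
    return [
        s for s in segments
        if ACTOR_GENDER.get(s.speaker_name) not in (None, s.voice_gender)
    ]


def reassign(segment, voice_cluster_names):
    """Use the name verified for the voice cluster, else fall back to 'Unknown'."""
    name = voice_cluster_names.get(segment.speaker_id, "Unknown")
    return replace(segment, speaker_name=name)
```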

Recommendations

Approach	Speed	Accuracy	Best for
Energy pre-filter	✅ 1 min	❌ Low	Stage 1: candidate selection
AST AudioSet	⚠️ 2 min	✅ High	Stage 2: event classification
Full AST scan	❌ 75 min	✅ High	N/A — two-stage is better

Design: Two-stage pipeline (energy pre-filter → AST classifier)

Implementation path:

1. Write `scripts/non_human_sound_detector.py` with the two-stage design
2. Output `{uuid}.sound_events.json` with typed events
3. Integrate into the `sound_event_detector` framework
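One possible shape for the typed events file is sketched below. The field names (`start_s`, `end_s`, `label`, `score`) and the top-level layout are assumptions made for illustration; the report only fixes the file name pattern `{uuid}.sound_events.json`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SoundEvent:
    start_s: float   # event start, seconds from the beginning of the movie
    end_s: float     # event end, seconds
    label: str       # AudioSet display name, e.g. "Door"
    score: float     # classifier confidence in [0, 1]


def output_filename(media_uuid):
    """File name pattern taken from the implementation path."""
    return f"{media_uuid}.sound_events.json"


def events_to_json(media_uuid, events):
    """Serialise events under a hypothetical {'uuid', 'events'} layout."""
    payload = {"uuid": media_uuid, "events": [asdict(e) for e in events]}
    return json.dumps(payload, indent=2)
```

For example, the door slam at 57:00 could be recorded as `SoundEvent(3420.0, 3423.0, "Door", 0.62)`.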
