# Non-Human Sound Detection — Tool Selection Report

**Date:** 2026-05-10
**Movie:** Charade (1963), 113 min
**Audio:** 16kHz mono WAV
**Goal:** Detect non-human sound events (gunshots, impacts, doors, music, etc.)

## Tested Approaches

### Approach A: AST AudioSet (HuggingFace)

| Item | Detail |
|------|--------|
| Model | `MIT/ast-finetuned-audioset-10-10-0.4593` |
| Method | Audio Spectrogram Transformer, fine-tuned on AudioSet-2M (527 classes) |
| Dependencies | `transformers`, `torch` ✅ (no torchcodec needed) |
| Load time | ~1s on M5 |
| Inference time | ~0.5s per 3-second clip (~87M params, float32) |
| Accuracy | Good — correctly distinguishes speech vs. door vs. music |

**Test results on Charade:**

| Time | Energy-based said | AST AudioSet said | Verdict |
|------|------------------|-------------------|---------|
| 0:10 | — | Environmental noise (26%) | Background noise, plausible |
| 10:32 | Gunshot candidate (43x) | **Speech (76%)** | ✅ AST correct |
| 57:00 | Gunshot candidate (49x) | **Door (62%) + Slam (5%)** | ✅ AST correct |
| 65:13 | Gunshot candidate (50x) | **Speech (58%)** | ✅ AST correct |
| 85:12 | Gunshot candidate (39x) | **Speech (68%)** | ✅ AST correct |

**Conclusion**: All 4 energy-based gunshot candidates checked here were false positives (0% precision on this sample). AST AudioSet classified each of them as a non-gunshot class (speech or door), which matches what is actually heard at those timestamps.
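The verification step above can be sketched roughly as follows. This is a minimal sketch, not the script used for the table: it assumes the audio is already loaded as a float32 NumPy array at 16 kHz, and the helper names (`extract_clip`, `load_classifier`, `classify_at`) are hypothetical. The `audio-classification` pipeline and its `top_k` argument are part of `transformers`; centring the 3-second clip on the candidate time is an assumption.

```python
import numpy as np

MODEL_ID = "MIT/ast-finetuned-audioset-10-10-0.4593"
SAMPLE_RATE = 16_000   # matches the 16 kHz mono WAV
CLIP_SECONDS = 3.0     # clip length used for the timings above


def extract_clip(audio, center_s, sr=SAMPLE_RATE, clip_s=CLIP_SECONDS):
    """Cut a clip_s-second window centred on center_s, zero-padded at the file edges."""
    n = int(round(clip_s * sr))
    start = int(round(center_s * sr)) - n // 2
    clip = np.zeros(n, dtype=np.float32)
    lo, hi = max(start, 0), min(start + n, len(audio))
    if hi > lo:
        clip[lo - start:hi - start] = audio[lo:hi]
    return clip


def load_classifier():
    """Build the Hugging Face pipeline (lazy import; downloads weights on first use)."""
    from transformers import pipeline
    return pipeline("audio-classification", model=MODEL_ID)


def classify_at(audio, center_s, classifier, top_k=5):
    """Return the top_k predictions for the clip at center_s as (label, score) pairs."""
    clip = extract_clip(audio, center_s)
    preds = classifier({"raw": clip, "sampling_rate": SAMPLE_RATE}, top_k=top_k)
    return [(p["label"], float(p["score"])) for p in preds]
```

For the 10:32 candidate, a call would look like `classify_at(audio, 632.0, load_classifier())`. Passing the classifier in as an argument keeps the model loaded once across all candidates.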

### Approach B: Custom Energy + Spectral Features

| Item | Detail |
|------|--------|
| Method | RMS energy + spectral centroid + sub-band energy ratios |
| Speed | ~3s for full 113-min movie (every 10th window) |
| Accuracy | Poor — cannot distinguish gunshot from speech, door, music |
| Result | 1 "gunshot_candidate" from 453 test windows; all false positives on verification |

**Conclusion**: Useful as a **coarse pre-filter** (Stage 1), not as a standalone classifier.
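A coarse pre-filter of this kind might look like the sketch below. It is an illustration under assumptions, not the tested implementation: the window length, hop, and the `ratio` threshold (RMS relative to the median RMS of the file) are placeholder values, and the sub-band energy ratios mentioned above are left out for brevity.

```python
import numpy as np


def energy_candidates(audio, sr=16_000, win_s=0.5, hop_s=0.25, ratio=20.0):
    """Return (time_s, rms_ratio, centroid_hz) for windows whose RMS is >= ratio x median RMS."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = range(0, len(audio) - win + 1, hop)
    taper = np.hanning(win)
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)

    rms, centroid = [], []
    for s in starts:
        frame = np.asarray(audio[s:s + win], dtype=np.float64)
        rms.append(float(np.sqrt(np.mean(frame ** 2))))
        # Spectral centroid: magnitude-weighted mean frequency of the window.
        mag = np.abs(np.fft.rfft(frame * taper))
        centroid.append(float(mag @ freqs) / max(float(mag.sum()), 1e-12))

    if not rms:
        return []
    floor = max(float(np.median(rms)), 1e-8)   # background level; guards against silence
    return [
        (s / sr, r / floor, c)
        for s, r, c in zip(starts, rms, centroid)
        if r / floor >= ratio
    ]
```

The loop processes one window at a time so that a full-length film does not have to be held as a frame matrix in memory. The returned times are window start times, which Stage 2 can turn into clip centres.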

## Two-Stage Design

```
Stage 1 (Energy filter, ~1 min):
  Full audio → sliding window RMS + centroid → ~200 candidate windows
                    |
                    v
Stage 2 (AST classifier, ~2 min):
  Extract 3-sec audio for each candidate → AST AudioSet classification
                    |
                    v
  Non-speech events: gunshot, explosion, door slam, music, etc.
```

Estimated processing: ~3 min for the full movie (Stage 1 ~1 min + Stage 2 ~2 min), vs. ~75 min for a full AST scan.
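The estimates can be checked with a few lines of arithmetic. The sketch below uses only figures from this report (113 min runtime, ~0.5 s per clip, ~200 candidates); the hop sizes are assumptions, since the hop behind the 75 min full-scan figure is not stated here.

```python
MOVIE_S = 113 * 60     # 113-minute runtime, in seconds
SEC_PER_CLIP = 0.5     # measured AST inference time per 3-second clip


def stage2_minutes(n_candidates, sec_per_clip=SEC_PER_CLIP):
    """AST time for classifying only the pre-filtered candidates."""
    return n_candidates * sec_per_clip / 60


def full_scan_minutes(duration_s=MOVIE_S, hop_s=3.0, sec_per_clip=SEC_PER_CLIP):
    """AST time for scanning the whole file with one clip every hop_s seconds."""
    n_clips = int(duration_s // hop_s)
    return n_clips * sec_per_clip / 60


print(f"Stage 2, 200 candidates: {stage2_minutes(200):.1f} min")
print(f"Full scan, 3.0 s hop:    {full_scan_minutes(hop_s=3.0):.1f} min")
print(f"Full scan, 0.75 s hop:   {full_scan_minutes(hop_s=0.75):.1f} min")
```

This gives about 1.7 min for Stage 2, about 18.8 min for a non-overlapping full scan, and about 75.3 min at a 0.75 s hop. The 75 min figure is therefore consistent with heavily overlapping clips, if the 0.5 s per-clip timing holds.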

## Key AudioSet Classes Relevant to Charade

| Class | AudioSet ID | Relevance |
|-------|-------------|-----------|
| Gunshot, gunfire | 402 | **Primary target** |
| Explosion | 400 | Hand grenade in plot |
| Door slams | 404 | Scenes at hotel, apartment |
| Music | 130-133 | Background score |
| Speech | 0-3 | Already handled by ASR |
| Vehicle | 100-110 | Car sounds in Paris chase |
| Glass break | 424 | Window breaking scene |
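Because class indices depend on the label map shipped with the checkpoint, it may be safer to look classes up by name than to hard-code the numbers in the table. A minimal sketch, assuming the label map is the `id2label` dictionary from the model config (for example `AutoConfig.from_pretrained(...).id2label`); the helper name `indices_for` is hypothetical, and the exact label strings should be checked against that map.

```python
def indices_for(id2label, wanted):
    """Map each wanted label name to its class index, or None if the model has no such label."""
    by_name = {name.lower(): int(idx) for idx, name in id2label.items()}
    return {name: by_name.get(name.lower()) for name in wanted}


# Toy label map for illustration only; the real one has 527 entries.
toy_id2label = {0: "Speech", 1: "Door", 2: "Slam", 3: "Gunshot, gunfire"}
print(indices_for(toy_id2label, ["Gunshot, gunfire", "Door", "Glass break"]))
```

A `None` result flags a name that does not exist in the model's vocabulary, which is a quick way to catch labels such as "Door slams" that may be split across several classes.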

## Actor-Voice Gender Mismatches (Resolved by Fine-Grained ASRX)

During the speaker mapping work, we found 20 segments where the old face→TMDb assignment said "Audrey Hepburn" but the new ASRX voice embedding clearly indicated "MALE". These segments were verified against video clips and confirmed to be scenes where:

1. A male speaker (Cary Grant or other) is speaking while Audrey Hepburn's face is on screen
2. The old pipeline incorrectly assigned the speaker name based on face identity
3. The fine-grained sliding window approach correctly resolves these

The 20 segments came from SPEAKER_5 (10 segments) and SPEAKER_9 (10 segments), both of which map to MALE voice clusters. They were reassigned to "Cary Grant" or "Unknown" as appropriate.
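The mismatch check itself is simple to express. The sketch below is illustrative only: the segment fields (`face_name`, `voice_gender`) and the `ACTOR_GENDER` table are hypothetical names, not the actual ASRX or pipeline schema.

```python
# Hypothetical lookup of the expected voice gender per actor.
ACTOR_GENDER = {"Audrey Hepburn": "FEMALE", "Cary Grant": "MALE"}


def find_gender_mismatches(segments, actor_gender=ACTOR_GENDER):
    """Return segments whose face-based actor name conflicts with the voice-cluster gender."""
    return [
        seg for seg in segments
        # Actors missing from the table are skipped rather than flagged.
        if actor_gender.get(seg["face_name"]) not in (None, seg["voice_gender"])
    ]


segments = [
    {"speaker": "SPEAKER_5", "face_name": "Audrey Hepburn", "voice_gender": "MALE"},
    {"speaker": "SPEAKER_2", "face_name": "Audrey Hepburn", "voice_gender": "FEMALE"},
]
print(find_gender_mismatches(segments))
```

Flagged segments would still need the manual video check described above before being reassigned.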

## Recommendations

| Approach | Speed | Accuracy | Best for |
|----------|-------|----------|----------|
| Energy pre-filter | ✅ 1 min | ❌ Low | Stage 1: candidate selection |
| AST AudioSet | ⚠️ 2 min | ✅ High | Stage 2: event classification |
| Full AST scan | ❌ 75 min | ✅ High | N/A — two-stage is better |

**Design**: Two-stage pipeline: energy pre-filter → AST classifier

**Implementation path**:

1. Write `scripts/non_human_sound_detector.py` with the two-stage design
2. Output `{uuid}.sound_events.json` with typed events
3. Integrate into the sound_event_detector framework
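One possible shape for the typed events in `{uuid}.sound_events.json` is sketched below. The schema is an assumption made for illustration (field names `start_s`, `end_s`, `label`, `score`, and the top-level `uuid` and `events` keys are hypothetical); the real format should follow whatever the sound_event_detector framework expects.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SoundEvent:
    start_s: float   # clip start, seconds from the beginning of the film
    end_s: float     # clip end, seconds
    label: str       # AST AudioSet label, e.g. "Door"
    score: float     # classifier confidence in [0, 1]


def dumps_sound_events(uuid, events):
    """Serialise events to a JSON string, ordered by start time."""
    ordered = sorted(events, key=lambda e: e.start_s)
    return json.dumps({"uuid": uuid, "events": [asdict(e) for e in ordered]}, indent=2)


def write_sound_events(uuid, events, directory="."):
    """Write the JSON to <directory>/<uuid>.sound_events.json and return the path."""
    path = f"{directory}/{uuid}.sound_events.json"
    with open(path, "w", encoding="utf-8") as f:
        f.write(dumps_sound_events(uuid, events))
    return path
```

For example, the 57:00 door candidate could be stored as `SoundEvent(3418.5, 3421.5, "Door", 0.62)`, assuming a 3-second clip centred on the candidate time.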