# ASR Segmentation Enhancement Report **Date:** 2026-05-10 **Movie:** Charade (1963), 113 min **Goal:** Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments. ## Problem Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from **multiple speakers**: ``` ASR segment [1550.0-1554.0] (4.0s): "What's she saying now?" Actual dialogue: 1552.7: Audrey: "What's she saying now?" 1553.4: Cary: "That she's innocent." ``` The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary. ## Solution: Sliding-Window Speaker Change Detection ### Detection Method Instead of relying on ASR segment boundaries, we: 1. **Slide a 1.5s window (0.75s stride)** across the entire audio 2. **Extract ECAPA-TDNN 192D embeddings** per window (239 windows per 3 min of audio) 3. **Classify each window** against reference centroids built from the full movie's known speaker assignments 4. **Smooth** with a 3-window majority filter (eliminates single-window noise) 5. **Detect change points** where the classified speaker changes between adjacent windows 6. **Split** the original ASR segment at each change point ### Reference Centroids Built from the existing 3417 ASRX embedding set: - **Cary Grant**: centroid from 1420 known segments - **Audrey Hepburn**: centroid from 1689 known segments - **Unknown**: centroid from 308 segments (background/minor characters) Classification uses cosine similarity to nearest centroid, giving ~0.8+ similarity for main characters. ### Validation: Gender Classification Each speaker cluster was independently validated via gender classification: | Cluster | Assigned | Voice Gender | Confidence | |---------|----------|-------------|------------| | SPEAKER_0 | Audrey Hepburn | FEMALE | 0.71 | | SPEAKER_1 | Cary Grant | MALE | 0.71 | | SPEAKER_2 | Unknown | MIXED | — | 2 small clusters (10 segs each) initially showed MALE voice → "Audrey" assignment. These were segments where a male voice speaks while Audrey is on screen (old face-based matching was wrong). The fine-grained segmentation correctly resolves these. ### Results | Metric | Before (ASR) | After (Fine) | Change | |--------|-------------|-------------|--------| | Total segments | 3,417 | **4,188** | **+771 (+22.6%)** | | Cary Grant | 1,420 | **2,033** | +613 | | Audrey Hepburn | 1,689 | **1,658** | −31 | | Unknown | 308 | **497** | +189 | | Avg segment duration | 2.0s | **1.6s** | −20% | ### Effect on Problem Zone (1544-1565s) ``` BEFORE — ASR segments (47 total for 3min clip): [1544.0-1546.0] "Who's that with the hat?" → single speaker [1546.0-1548.0] "That's the policeman." → single speaker [1548.0-1550.0] "He wants to arrest Judy for Punch." → single speaker [1550.0-1554.0] "What's she saying now?" → merged! multiple speakers [1554.0-1557.5] "That she's innocent. She didn't do it." → merged [1557.5-1560.7] "Oh, she did it all right." → merged ... AFTER — Fine segments (64 total for 3min clip): [1550.3-1551.0] "He wants to arrest Judy..." → Audrey Hepburn [1552.7-1553.4] "What's she saying now?" → Audrey Hepburn [1553.4-1554.2] "now? That" → Cary Grant [1554.2-1559.3] "That she's innocent. She didn't..." → Cary Grant [1559.3-1560.5] "Oh, she did it all right." → Audrey Hepburn [1560.5-1561.6] "right. I" → Cary Grant [1561.6-1562.8] "I believe her." → Cary Grant ``` 12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups. ### Text Acquisition Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested: 1. **Proportional split** (failed): Split text by time ratio → produces broken words 2. **Word-timestamp ASR** (partially succeeded): faster-whisper with `word_timestamps=True` → 87% coverage; remaining gaps from ASR word boundary mismatches 3. **Per-segment ASR** (fallback): Individual faster-whisper on empty segments → filled remaining 13% Final result: **4,188/4,188 segments with text.** ### Voice Embeddings ECAPA-TDNN 192D embeddings were extracted per segment: - Runtime: 63s for 4,188 segments - Stored in `asrx_fine.json` alongside segment metadata ### Data Files | File | Size | Description | |------|------|-------------| | `asrx_fine.json` | ~45 MB | 4,188 fine segments + 4,188 embeddings | | `asrx_fine.json → segments[].speaker_name` | — | Centroid-matched identity | | `asrx_fine.json → segments[].speaker_id` | — | SPEAKER_0/1/2 | | `asrx_fine.json → segments[].text` | — | ASR text (word-timestamp mapped) | | `asrx_fine.json → embeddings[]` | — | 192D ECAPA-TDNN per segment | ### Continued Limitations 1. **Word boundary alignment**: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic) 2. **ASR merge in silence zones**: Very short utterances (<0.5s) merged into adjacent segments 3. **Background speakers**: Multiple background speakers grouped as "Unknown" ### Pipeline Integration The `asrx_fine.json` file serves as the new ASRX output. The original `asr.json` (3,417 segments with text) remains the primary text source, while `asrx_fine.json` provides superior speaker diarization at 4,188 segments. Speaker assignments in DB `dev.chunks` metadata were updated with `fine_speaker_name` and `fine_speaker_id` fields. Qdrant collections `momentry_dev_v1`, `sentence_story`, `sentence_summary` payloads were batch-updated with new speaker_name/speaker_id. ### Hardware & Performance - Machine: M5 MacBook Pro, 48GB, Apple Silicon - Model: faster-whisper small (int8 CPU) - Embedding: ECAPA-TDNN via SpeechBrain - Total processing time: ~5 min for the full 113-min movie