# ASR Segmentation Enhancement Report

**Date:** 2026-05-10
**Movie:** Charade (1963), 113 min
**Goal:** Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments.

## Problem

Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from **multiple speakers**:

```
ASR segment [1550.0-1554.0] (4.0s):
  "What's she saying now?"

Actual dialogue:
  1552.7: Audrey: "What's she saying now?"
  1553.4: Cary:   "That she's innocent."
```

The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary.

## Solution: Sliding-Window Speaker Change Detection

### Detection Method

Instead of relying on ASR segment boundaries, we:

1. **Slide a 1.5s window (0.75s stride)** across the entire audio
2. **Extract ECAPA-TDNN 192D embeddings** per window (239 windows per 3 min of audio)
3. **Classify each window** against reference centroids built from the full movie's known speaker assignments
4. **Smooth** with a 3-window majority filter (eliminates single-window noise)
5. **Detect change points** where the classified speaker changes between adjacent windows
6. **Split** the original ASR segment at each change point

### Reference Centroids

Built from the existing 3417 ASRX embedding set:
- **Cary Grant**: centroid from 1420 known segments
- **Audrey Hepburn**: centroid from 1689 known segments
- **Unknown**: centroid from 308 segments (background/minor characters)

Classification uses cosine similarity to nearest centroid, giving ~0.8+ similarity for main characters.

### Validation: Gender Classification

Each speaker cluster was independently validated via gender classification:

| Cluster | Assigned | Voice Gender | Confidence |
|---------|----------|-------------|------------|
| SPEAKER_0 | Audrey Hepburn | FEMALE | 0.71 |
| SPEAKER_1 | Cary Grant | MALE | 0.71 |
| SPEAKER_2 | Unknown | MIXED | — |

2 small clusters (10 segs each) initially showed MALE voice → "Audrey" assignment. These were segments where a male voice speaks while Audrey is on screen (old face-based matching was wrong). The fine-grained segmentation correctly resolves these.

### Results

| Metric | Before (ASR) | After (Fine) | Change |
|--------|-------------|-------------|--------|
| Total segments | 3,417 | **4,188** | **+771 (+22.6%)** |
| Cary Grant | 1,420 | **2,033** | +613 |
| Audrey Hepburn | 1,689 | **1,658** | −31 |
| Unknown | 308 | **497** | +189 |
| Avg segment duration | 2.0s | **1.6s** | −20% |

### Effect on Problem Zone (1544-1565s)

```
BEFORE — ASR segments (47 total for 3min clip):
[1544.0-1546.0] "Who's that with the hat?"           → single speaker
[1546.0-1548.0] "That's the policeman."                → single speaker
[1548.0-1550.0] "He wants to arrest Judy for Punch."   → single speaker
[1550.0-1554.0] "What's she saying now?"               → merged! multiple speakers
[1554.0-1557.5] "That she's innocent. She didn't do it." → merged
[1557.5-1560.7] "Oh, she did it all right."            → merged
...

AFTER — Fine segments (64 total for 3min clip):
[1550.3-1551.0] "He wants to arrest Judy..."           → Audrey Hepburn
[1552.7-1553.4] "What's she saying now?"                → Audrey Hepburn
[1553.4-1554.2] "now? That"                              → Cary Grant
[1554.2-1559.3] "That she's innocent. She didn't..."    → Cary Grant
[1559.3-1560.5] "Oh, she did it all right."             → Audrey Hepburn
[1560.5-1561.6] "right. I"                               → Cary Grant
[1561.6-1562.8] "I believe her."                        → Cary Grant
```

12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups.

### Text Acquisition

Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested:

1. **Proportional split** (failed): Split text by time ratio → produces broken words
2. **Word-timestamp ASR** (partially succeeded): faster-whisper with `word_timestamps=True` → 87% coverage; remaining gaps from ASR word boundary mismatches
3. **Per-segment ASR** (fallback): Individual faster-whisper on empty segments → filled remaining 13%

Final result: **4,188/4,188 segments with text.**

### Voice Embeddings

ECAPA-TDNN 192D embeddings were extracted per segment:
- Runtime: 63s for 4,188 segments
- Stored in `asrx_fine.json` alongside segment metadata

### Data Files

| File | Size | Description |
|------|------|-------------|
| `asrx_fine.json` | ~45 MB | 4,188 fine segments + 4,188 embeddings |
| `asrx_fine.json → segments[].speaker_name` | — | Centroid-matched identity |
| `asrx_fine.json → segments[].speaker_id` | — | SPEAKER_0/1/2 |
| `asrx_fine.json → segments[].text` | — | ASR text (word-timestamp mapped) |
| `asrx_fine.json → embeddings[]` | — | 192D ECAPA-TDNN per segment |

### Continued Limitations

1. **Word boundary alignment**: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic)
2. **ASR merge in silence zones**: Very short utterances (<0.5s) merged into adjacent segments
3. **Background speakers**: Multiple background speakers grouped as "Unknown"

### Pipeline Integration

The `asrx_fine.json` file serves as the new ASRX output. The original `asr.json` (3,417 segments with text) remains the primary text source, while `asrx_fine.json` provides superior speaker diarization at 4,188 segments.

Speaker assignments in DB `dev.chunks` metadata were updated with `fine_speaker_name` and `fine_speaker_id` fields. Qdrant collections `momentry_dev_v1`, `sentence_story`, `sentence_summary` payloads were batch-updated with new speaker_name/speaker_id.

### Hardware & Performance

- Machine: M5 MacBook Pro, 48GB, Apple Silicon
- Model: faster-whisper small (int8 CPU)
- Embedding: ECAPA-TDNN via SpeechBrain
- Total processing time: ~5 min for the full 113-min movie