ASR Segmentation Enhancement Report
Date: 2026-05-10
Movie: Charade (1963), 113 min
Goal: Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments.
Problem
Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from multiple speakers:
ASR segment [1550.0-1554.0] (4.0s):
"What's she saying now?"
Actual dialogue:
1552.7: Audrey: "What's she saying now?"
1553.4: Cary: "That she's innocent."
The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary.
Solution: Sliding-Window Speaker Change Detection
Detection Method
Instead of relying on ASR segment boundaries, we:
- Slide a 1.5s window (0.75s stride) across the entire audio
- Extract ECAPA-TDNN 192D embeddings per window (239 windows per 3 min of audio)
- Classify each window against reference centroids built from the full movie's known speaker assignments
- Smooth with a 3-window majority filter (eliminates single-window noise)
- Detect change points where the classified speaker changes between adjacent windows
- Split the original ASR segment at each change point
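The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the per-window speaker labels are assumed to come from the centroid classifier, and placing each change point midway between the centres of the two adjacent windows is an assumption the report does not specify.

```python
from collections import Counter

WINDOW_S = 1.5   # window length stated in the report
STRIDE_S = 0.75  # stride stated in the report


def window_starts(duration_s, window_s=WINDOW_S, stride_s=STRIDE_S):
    """Start times of every full window that fits inside the audio."""
    if duration_s < window_s:
        return []
    count = int((duration_s - window_s) / stride_s) + 1
    return [i * stride_s for i in range(count)]


def majority_smooth(labels, width=3):
    """Replace each label with the majority label of its neighbourhood.

    On a tie (possible at the edges), the original label is kept.
    """
    half = width // 2
    smoothed = []
    for i, label in enumerate(labels):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        ranked = Counter(neighbourhood).most_common()
        best = ranked[0][1]
        winners = [name for name, n in ranked if n == best]
        smoothed.append(label if label in winners else winners[0])
    return smoothed


def change_points(labels, starts, window_s=WINDOW_S):
    """Times at which the label changes between adjacent windows.

    Assumption: the boundary sits midway between the two window centres.
    """
    points = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            prev_centre = starts[i - 1] + window_s / 2
            this_centre = starts[i] + window_s / 2
            points.append((prev_centre + this_centre) / 2)
    return points


def split_segment(start, end, points):
    """Cut one ASR segment at every change point strictly inside it."""
    cuts = [p for p in points if start < p < end]
    edges = [start] + cuts + [end]
    return list(zip(edges[:-1], edges[1:]))
```

With these definitions, `len(window_starts(180.0))` reproduces the 239 windows quoted for a 3-minute clip, and a single stray label such as the `"B"` in `["A", "A", "B", "A", "A"]` is removed by the 3-window filter before change points are computed.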
Reference Centroids
Built from the existing 3417 ASRX embedding set:
- Cary Grant: centroid from 1420 known segments
- Audrey Hepburn: centroid from 1689 known segments
- Unknown: centroid from 308 segments (background/minor characters)
Each window is assigned to the centroid with the highest cosine similarity; windows from the two main characters typically score about 0.8 or higher.
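A minimal sketch of nearest-centroid classification by cosine similarity follows. The helper names are hypothetical, and the example uses toy 3D vectors in place of the 192D ECAPA-TDNN embeddings so that it stays readable.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (the centroid)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(embedding, centroids):
    """Return (speaker name, similarity) for the closest centroid."""
    best = max(centroids, key=lambda name: cosine_similarity(embedding, centroids[name]))
    return best, cosine_similarity(embedding, centroids[best])


# Toy data: real centroids would be built from the known ASRX segments.
centroids = {
    "Cary Grant": mean_vector([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "Audrey Hepburn": mean_vector([[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]),
}
name, score = classify([0.9, 0.1, 0.0], centroids)
```

Returning the similarity alongside the name makes it possible to apply a threshold later, for example to route low-scoring windows to "Unknown"; whether the pipeline does this is not stated in the report.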
Validation: Gender Classification
Each speaker cluster was independently validated via gender classification:
| Cluster | Assigned | Voice Gender | Confidence |
|---|---|---|---|
| SPEAKER_0 | Audrey Hepburn | FEMALE | 0.71 |
| SPEAKER_1 | Cary Grant | MALE | 0.71 |
| SPEAKER_2 | Unknown | MIXED | — |
Two small clusters (10 segments each) initially paired a male voice with an "Audrey" assignment. These were segments in which a male voice speaks while Audrey is on screen, so the old face-based matching labelled them incorrectly. The fine-grained segmentation resolves these cases correctly.
Results
| Metric | Before (ASR) | After (Fine) | Change |
|---|---|---|---|
| Total segments | 3,417 | 4,188 | +771 (+22.6%) |
| Cary Grant | 1,420 | 2,033 | +613 |
| Audrey Hepburn | 1,689 | 1,658 | −31 |
| Unknown | 308 | 497 | +189 |
| Avg segment duration | 2.0s | 1.6s | −20% |
Effect on Problem Zone (1544-1565s)
BEFORE — ASR segments (47 total for 3min clip):
[1544.0-1546.0] "Who's that with the hat?" → single speaker
[1546.0-1548.0] "That's the policeman." → single speaker
[1548.0-1550.0] "He wants to arrest Judy for Punch." → single speaker
[1550.0-1554.0] "What's she saying now?" → merged! multiple speakers
[1554.0-1557.5] "That she's innocent. She didn't do it." → merged
[1557.5-1560.7] "Oh, she did it all right." → merged
...
AFTER — Fine segments (64 total for 3min clip):
[1550.3-1551.0] "He wants to arrest Judy..." → Audrey Hepburn
[1552.7-1553.4] "What's she saying now?" → Audrey Hepburn
[1553.4-1554.2] "now? That" → Cary Grant
[1554.2-1559.3] "That she's innocent. She didn't..." → Cary Grant
[1559.3-1560.5] "Oh, she did it all right." → Audrey Hepburn
[1560.5-1561.6] "right. I" → Cary Grant
[1561.6-1562.8] "I believe her." → Cary Grant
12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups.
Text Acquisition
Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested:
- Proportional split (failed): Split text by time ratio → produces broken words
- Word-timestamp ASR (partially succeeded): faster-whisper with `word_timestamps=True` → 87% coverage; remaining gaps from ASR word boundary mismatches
- Per-segment ASR (fallback): Individual faster-whisper on empty segments → filled remaining 13%
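The word-timestamp approach can be illustrated with a short sketch that assigns each word to the fine segment containing its midpoint. Assigning by midpoint is an assumption, as is the function name; the report does not say which rule was used. Words are represented here as plain `(start, end, text)` tuples with invented timings, standing in for the per-word output of faster-whisper.

```python
def assign_words(segments, words):
    """Map timestamped words onto segments by word midpoint.

    segments: list of (start, end) tuples, in seconds
    words:    list of (start, end, text) tuples, in seconds
    Returns one text string per segment (empty if no word landed in it).
    """
    texts = [[] for _ in segments]
    for w_start, w_end, token in words:
        midpoint = (w_start + w_end) / 2
        for i, (s_start, s_end) in enumerate(segments):
            if s_start <= midpoint < s_end:
                texts[i].append(token)
                break
    return [" ".join(tokens) for tokens in texts]


# Illustrative timings only; not taken from the movie.
segments = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
words = [
    (0.1, 0.4, "What's"),
    (0.4, 0.9, "she"),
    (0.9, 1.3, "saying"),
    (1.3, 1.8, "now?"),
]
texts = assign_words(segments, words)

# Segments left without text are the candidates for the per-segment fallback.
empty = [i for i, text in enumerate(texts) if not text]
coverage = 1 - len(empty) / len(segments)
```

A word that straddles a segment boundary lands wholly on one side, which avoids the broken words produced by the proportional split but can still shift a word into the neighbouring segment.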
Final result: 4,188/4,188 segments with text.
Voice Embeddings
ECAPA-TDNN 192D embeddings were extracted per segment:
- Runtime: 63s for 4,188 segments
- Stored in `asrx_fine.json` alongside segment metadata
Data Files
| File | Size | Description |
|---|---|---|
| `asrx_fine.json` | ~45 MB | 4,188 fine segments + 4,188 embeddings |
| `asrx_fine.json → segments[].speaker_name` | — | Centroid-matched identity |
| `asrx_fine.json → segments[].speaker_id` | — | SPEAKER_0/1/2 |
| `asrx_fine.json → segments[].text` | — | ASR text (word-timestamp mapped) |
| `asrx_fine.json → embeddings[]` | — | 192D ECAPA-TDNN per segment |
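A consumer of this file might start with a sanity check like the sketch below. It relies only on the fields listed in the table; the function names are hypothetical, and treating `embeddings[]` as index-aligned with `segments[]` is an assumption based on the matching counts.

```python
import json
from collections import Counter


def load_asrx(path):
    """Read the JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data):
    """Count segments per speaker and flag missing text or embeddings."""
    segments = data["segments"]
    embeddings = data["embeddings"]
    # Assumption: embeddings[i] belongs to segments[i].
    if len(segments) != len(embeddings):
        raise ValueError("segments and embeddings differ in length")
    return {
        "total": len(segments),
        "per_speaker": dict(Counter(seg["speaker_name"] for seg in segments)),
        "without_text": sum(1 for seg in segments if not seg["text"].strip()),
        "embedding_dims": sorted({len(vec) for vec in embeddings}),
    }


# Tiny in-memory stand-in for the real file (2D vectors instead of 192D).
sample = {
    "segments": [
        {"speaker_name": "Audrey Hepburn", "speaker_id": "SPEAKER_0", "text": "What's she saying now?"},
        {"speaker_name": "Cary Grant", "speaker_id": "SPEAKER_1", "text": "That she's innocent."},
    ],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}
report = summarize(sample)
```

On the real file, `summarize(load_asrx("asrx_fine.json"))` would be expected to report 4,188 segments, no segments without text, and a single embedding dimension of 192, if the layout matches the assumptions above.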
Continued Limitations
- Word boundary alignment: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic)
- ASR merge in silence zones: Very short utterances (<0.5s) merged into adjacent segments
- Background speakers: Multiple background speakers grouped as "Unknown"
Pipeline Integration
The asrx_fine.json file serves as the new ASRX output. The original asr.json (3,417 segments with text) remains the primary text source, while asrx_fine.json provides superior speaker diarization at 4,188 segments.
Speaker assignments in the DB `dev.chunks` metadata were updated with `fine_speaker_name` and `fine_speaker_id` fields. Payloads in the Qdrant collections `momentry_dev_v1`, `sentence_story`, and `sentence_summary` were batch-updated with the new `speaker_name`/`speaker_id` values.
Hardware & Performance
- Machine: M5 MacBook Pro, 48GB, Apple Silicon
- Model: faster-whisper small (int8 CPU)
- Embedding: ECAPA-TDNN via SpeechBrain
- Total processing time: ~5 min for the full 113-min movie