Files
momentry_core/docs/ASR_SEGMENTATION_ENHANCEMENT.md
Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
2026-05-11 07:03:22 +08:00

5.9 KiB
Raw Permalink Blame History

ASR Segmentation Enhancement Report

Date: 2026-05-10 Movie: Charade (1963), 113 min Goal: Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments.

Problem

Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from multiple speakers:

ASR segment [1550.0-1554.0] (4.0s):
  "What's she saying now?"

Actual dialogue:
  1552.7: Audrey: "What's she saying now?"
  1553.4: Cary:   "That she's innocent."

The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary.

Solution: Sliding-Window Speaker Change Detection

Detection Method

Instead of relying on ASR segment boundaries, we:

  1. Slide a 1.5s window (0.75s stride) across the entire audio
  2. Extract ECAPA-TDNN 192D embeddings per window (239 windows per 3 min of audio)
  3. Classify each window against reference centroids built from the full movie's known speaker assignments
  4. Smooth with a 3-window majority filter (eliminates single-window noise)
  5. Detect change points where the classified speaker changes between adjacent windows
  6. Split the original ASR segment at each change point

Reference Centroids

Built from the existing 3417 ASRX embedding set:

  • Cary Grant: centroid from 1420 known segments
  • Audrey Hepburn: centroid from 1689 known segments
  • Unknown: centroid from 308 segments (background/minor characters)

Classification uses cosine similarity to nearest centroid, giving ~0.8+ similarity for main characters.

Validation: Gender Classification

Each speaker cluster was independently validated via gender classification:

Cluster Assigned Voice Gender Confidence
SPEAKER_0 Audrey Hepburn FEMALE 0.71
SPEAKER_1 Cary Grant MALE 0.71
SPEAKER_2 Unknown MIXED

2 small clusters (10 segs each) initially showed MALE voice → "Audrey" assignment. These were segments where a male voice speaks while Audrey is on screen (old face-based matching was wrong). The fine-grained segmentation correctly resolves these.

Results

Metric Before (ASR) After (Fine) Change
Total segments 3,417 4,188 +771 (+22.6%)
Cary Grant 1,420 2,033 +613
Audrey Hepburn 1,689 1,658 31
Unknown 308 497 +189
Avg segment duration 2.0s 1.6s 20%

Effect on Problem Zone (1544-1565s)

BEFORE — ASR segments (47 total for 3min clip):
[1544.0-1546.0] "Who's that with the hat?"           → single speaker
[1546.0-1548.0] "That's the policeman."                → single speaker
[1548.0-1550.0] "He wants to arrest Judy for Punch."   → single speaker
[1550.0-1554.0] "What's she saying now?"               → merged! multiple speakers
[1554.0-1557.5] "That she's innocent. She didn't do it." → merged
[1557.5-1560.7] "Oh, she did it all right."            → merged
...

AFTER — Fine segments (64 total for 3min clip):
[1550.3-1551.0] "He wants to arrest Judy..."           → Audrey Hepburn
[1552.7-1553.4] "What's she saying now?"                → Audrey Hepburn
[1553.4-1554.2] "now? That"                              → Cary Grant
[1554.2-1559.3] "That she's innocent. She didn't..."    → Cary Grant
[1559.3-1560.5] "Oh, she did it all right."             → Audrey Hepburn
[1560.5-1561.6] "right. I"                               → Cary Grant
[1561.6-1562.8] "I believe her."                        → Cary Grant

12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups.

Text Acquisition

Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested:

  1. Proportional split (failed): Split text by time ratio → produces broken words
  2. Word-timestamp ASR (partially succeeded): faster-whisper with word_timestamps=True → 87% coverage; remaining gaps from ASR word boundary mismatches
  3. Per-segment ASR (fallback): Individual faster-whisper on empty segments → filled remaining 13%

Final result: 4,188/4,188 segments with text.

Voice Embeddings

ECAPA-TDNN 192D embeddings were extracted per segment:

  • Runtime: 63s for 4,188 segments
  • Stored in asrx_fine.json alongside segment metadata

Data Files

File Size Description
asrx_fine.json ~45 MB 4,188 fine segments + 4,188 embeddings
asrx_fine.json → segments[].speaker_name Centroid-matched identity
asrx_fine.json → segments[].speaker_id SPEAKER_0/1/2
asrx_fine.json → segments[].text ASR text (word-timestamp mapped)
asrx_fine.json → embeddings[] 192D ECAPA-TDNN per segment

Continued Limitations

  1. Word boundary alignment: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic)
  2. ASR merge in silence zones: Very short utterances (<0.5s) merged into adjacent segments
  3. Background speakers: Multiple background speakers grouped as "Unknown"

Pipeline Integration

The asrx_fine.json file serves as the new ASRX output. The original asr.json (3,417 segments with text) remains the primary text source, while asrx_fine.json provides superior speaker diarization at 4,188 segments.

Speaker assignments in DB dev.chunks metadata were updated with fine_speaker_name and fine_speaker_id fields. Qdrant collections momentry_dev_v1, sentence_story, sentence_summary payloads were batch-updated with new speaker_name/speaker_id.

Hardware & Performance

  • Machine: M5 MacBook Pro, 48GB, Apple Silicon
  • Model: faster-whisper small (int8 CPU)
  • Embedding: ECAPA-TDNN via SpeechBrain
  • Total processing time: ~5 min for the full 113-min movie