Files

Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes

Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4

2026-05-11 07:03:22 +08:00

5.9 KiB

Raw Permalink Blame History

ASR Segmentation Enhancement Report

Date: 2026-05-10 Movie: Charade (1963), 113 min Goal: Fix merged-speaker segments in ASR output by detecting speaker change points within ASR segments.

Problem

Whisper ASR produces segments at sentence boundaries, but during rapid back-and-forth dialogue (common in Charade), a single ASR segment may contain utterances from multiple speakers:

ASR segment [1550.0-1554.0] (4.0s):
  "What's she saying now?"

Actual dialogue:
  1552.7: Audrey: "What's she saying now?"
  1553.4: Cary:   "That she's innocent."

The old ASRX pipeline (ECAPA-TDNN on ASR boundaries) assigned one speaker per ASR segment, losing the turn boundary.

Solution: Sliding-Window Speaker Change Detection

Detection Method

Instead of relying on ASR segment boundaries, we:

Slide a 1.5s window (0.75s stride) across the entire audio
Extract ECAPA-TDNN 192D embeddings per window (239 windows per 3 min of audio)
Classify each window against reference centroids built from the full movie's known speaker assignments
Smooth with a 3-window majority filter (eliminates single-window noise)
Detect change points where the classified speaker changes between adjacent windows
Split the original ASR segment at each change point

Reference Centroids

Built from the existing 3417 ASRX embedding set:

Cary Grant: centroid from 1420 known segments
Audrey Hepburn: centroid from 1689 known segments
Unknown: centroid from 308 segments (background/minor characters)

Classification uses cosine similarity to nearest centroid, giving ~0.8+ similarity for main characters.

Validation: Gender Classification

Each speaker cluster was independently validated via gender classification:

Cluster	Assigned	Voice Gender	Confidence
SPEAKER_0	Audrey Hepburn	FEMALE	0.71
SPEAKER_1	Cary Grant	MALE	0.71
SPEAKER_2	Unknown	MIXED	—

2 small clusters (10 segs each) initially showed MALE voice → "Audrey" assignment. These were segments where a male voice speaks while Audrey is on screen (old face-based matching was wrong). The fine-grained segmentation correctly resolves these.

Results

Metric	Before (ASR)	After (Fine)	Change
Total segments	3,417	4,188	+771 (+22.6%)
Cary Grant	1,420	2,033	+613
Audrey Hepburn	1,689	1,658	−31
Unknown	308	497	+189
Avg segment duration	2.0s	1.6s	−20%

Effect on Problem Zone (1544-1565s)

BEFORE — ASR segments (47 total for 3min clip):
[1544.0-1546.0] "Who's that with the hat?"           → single speaker
[1546.0-1548.0] "That's the policeman."                → single speaker
[1548.0-1550.0] "He wants to arrest Judy for Punch."   → single speaker
[1550.0-1554.0] "What's she saying now?"               → merged! multiple speakers
[1554.0-1557.5] "That she's innocent. She didn't do it." → merged
[1557.5-1560.7] "Oh, she did it all right."            → merged
...

AFTER — Fine segments (64 total for 3min clip):
[1550.3-1551.0] "He wants to arrest Judy..."           → Audrey Hepburn
[1552.7-1553.4] "What's she saying now?"                → Audrey Hepburn
[1553.4-1554.2] "now? That"                              → Cary Grant
[1554.2-1559.3] "That she's innocent. She didn't..."    → Cary Grant
[1559.3-1560.5] "Oh, she did it all right."             → Audrey Hepburn
[1560.5-1561.6] "right. I"                               → Cary Grant
[1561.6-1562.8] "I believe her."                        → Cary Grant

12 long ASR segments (>3s) were detected; 78% were successfully split into multi-speaker groups.

Text Acquisition

Split segments needed their own text (since the parent ASR segment's text covers a different time range). Three approaches were tested:

Proportional split (failed): Split text by time ratio → produces broken words
Word-timestamp ASR (partially succeeded): faster-whisper with word_timestamps=True → 87% coverage; remaining gaps from ASR word boundary mismatches
Per-segment ASR (fallback): Individual faster-whisper on empty segments → filled remaining 13%

Final result: 4,188/4,188 segments with text.

Voice Embeddings

ECAPA-TDNN 192D embeddings were extracted per segment:

Runtime: 63s for 4,188 segments
Stored in asrx_fine.json alongside segment metadata

Data Files

File	Size	Description
`asrx_fine.json`	~45 MB	4,188 fine segments + 4,188 embeddings
`asrx_fine.json → segments[].speaker_name`	—	Centroid-matched identity
`asrx_fine.json → segments[].speaker_id`	—	SPEAKER_0/1/2
`asrx_fine.json → segments[].text`	—	ASR text (word-timestamp mapped)
`asrx_fine.json → embeddings[]`	—	192D ECAPA-TDNN per segment

Continued Limitations

Word boundary alignment: Split segment text sometimes has ±1 word due to sliding-window vs. ASR boundary mismatch (cosmetic, not semantic)
ASR merge in silence zones: Very short utterances (<0.5s) merged into adjacent segments
Background speakers: Multiple background speakers grouped as "Unknown"

Pipeline Integration

The asrx_fine.json file serves as the new ASRX output. The original asr.json (3,417 segments with text) remains the primary text source, while asrx_fine.json provides superior speaker diarization at 4,188 segments.

Speaker assignments in DB dev.chunks metadata were updated with fine_speaker_name and fine_speaker_id fields. Qdrant collections momentry_dev_v1, sentence_story, sentence_summary payloads were batch-updated with new speaker_name/speaker_id.

Hardware & Performance

Machine: M5 MacBook Pro, 48GB, Apple Silicon
Model: faster-whisper small (int8 CPU)
Embedding: ECAPA-TDNN via SpeechBrain
Total processing time: ~5 min for the full 113-min movie

5.9 KiB Raw Permalink Blame History Unescape Escape