
Document Embedding Strategy - Parent-Child Chunks

Overview

Momentry uses a parent-child chunk hierarchy for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.

Chunk Structure

Parent Chunk

  • Purpose: Summarize multiple child chunks in a single narrative description
  • Content: High-level description of multiple scenes/segments
  • Example:
{
  "chunk_id": "story_asr_0000",
  "chunk_type": "story",
  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}

Child Chunk

  • Purpose: Individual segments from ASR, scenes from CUT, etc.
  • Content: Raw transcription or detection results
  • Example:
{
  "chunk_id": "asr_0001",
  "chunk_type": "sentence",
  "text_content": "Hello world",
  "parent_chunk_id": "story_asr_0000"
}
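The two JSON shapes above can be modelled as one record type with optional links in each direction. The following is a minimal sketch, not Momentry's actual types: the `Chunk` and `ChunkType` definitions and the `links_are_consistent` helper are assumptions made for illustration, using only the field names shown in the examples.

```rust
/// Chunk kinds that appear in the examples above (hypothetical enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkType {
    Story,    // parent: narrative summary
    Sentence, // child: one ASR segment
}

/// One record covering both parent and child chunks (hypothetical struct).
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Chunk {
    chunk_id: String,
    chunk_type: ChunkType,
    text_content: String,
    parent_chunk_id: Option<String>, // set on children
    child_chunk_ids: Vec<String>,    // set on parents
}

/// Returns true when every id listed by `parent` exists in `children`
/// and that child points back at `parent`.
fn links_are_consistent(parent: &Chunk, children: &[Chunk]) -> bool {
    parent.child_chunk_ids.iter().all(|id| {
        children.iter().any(|c| {
            &c.chunk_id == id
                && c.parent_chunk_id.as_deref() == Some(parent.chunk_id.as_str())
        })
    })
}
```

A check like this is cheap to run after story processing, before anything is embedded, so that a dangling reference is caught early.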

Embedding Strategy

When embedding chunks for vector search, we combine parent description + child content to provide both context and detail.

Parent Chunk Embedding

embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."

Prefix: "search_document: " (prepended to every document text that is embedded and stored in Qdrant)

Example:

search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.

Child Chunk Embedding

embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"

Prefix: search_document:

Example:

search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
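The two templates and their prefix can be written as plain string functions. This is a sketch under assumptions: the function names and the `DOCUMENT_PREFIX` constant are hypothetical, and child texts are joined with a single space because that reproduces the parent example above; the separator Momentry actually uses may differ.

```rust
/// Prefix applied to every document text before embedding.
const DOCUMENT_PREFIX: &str = "search_document: ";

/// Parent text: the summary, then the children's texts joined by a space
/// (the separator is an assumption chosen to match the example).
fn parent_embedding_text(summary: &str, child_texts: &[&str]) -> String {
    format!(
        "{}Summary: {}\nChildren: {}",
        DOCUMENT_PREFIX,
        summary,
        child_texts.join(" ")
    )
}

/// Child text: a "[type]" tag and the child's own text, then the parent summary.
fn child_embedding_text(chunk_type: &str, text: &str, parent_summary: &str) -> String {
    format!(
        "{}[{}] {}\nParent: {}",
        DOCUMENT_PREFIX, chunk_type, text, parent_summary
    )
}
```

Keeping the prefix in one constant makes it harder for parent and child texts to drift apart if the prefix ever changes.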

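Keyword search does not use the combined embedding text. It runs on the raw chunk text through PostgreSQL full-text search and is referred to as BM25 throughout this document.

  • Index: search_vector (TSVECTOR) on chunks.text_content
  • Search: Uses ts_rank_cd() for ranking. ts_rank_cd() computes a cover-density rank rather than the Okapi BM25 formula, so bm25_score in this document means that rank.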

Hybrid Search Ranking

Combined score = (vector_score * 0.7) + (bm25_score * 0.3)
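A weighted sum only behaves as intended when both inputs are on a comparable scale, and a vector similarity and a full-text rank generally are not. The sketch below assumes both score lists are min-max normalized to [0, 1] before weighting; this document does not say whether Momentry normalizes, and the function and constant names are hypothetical.

```rust
/// Default weights from the formula above.
const VECTOR_WEIGHT: f32 = 0.7;
const BM25_WEIGHT: f32 = 0.3;

/// Min-max normalizes raw scores into [0, 1].
/// When all scores are equal, every score maps to 1.0.
fn min_max_normalize(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 1.0 })
        .collect()
}

/// Combined score = (vector_score * 0.7) + (bm25_score * 0.3).
fn combined_score(vector_score: f32, bm25_score: f32) -> f32 {
    vector_score * VECTOR_WEIGHT + bm25_score * BM25_WEIGHT
}
```

With normalized inputs the combined score also stays in [0, 1], which makes thresholds easier to reason about.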

Why 0.7/0.3?

| Aspect   | Vector                  | BM25                      |
|----------|-------------------------|---------------------------|
| Pros     | Semantic similarity     | Exact keyword match       |
| Cons     | May miss specific terms | No semantic understanding |
| Best for | Thematic queries        | Fact lookup               |

Query Patterns

Thematic Query ("What are the main themes?")

  • Use higher vector_weight (0.8-0.9)
  • Vector search finds semantically similar content

Fact Lookup ("Who said X?")

  • Use higher bm25_weight (0.5-0.7)
  • BM25 finds exact matches

Balanced ("Tell me about scene 5")

  • Use default 0.7/0.3
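The three patterns above amount to a lookup from query type to a weight pair. The following sketch picks one value inside each stated range and assumes the two weights always sum to 1; the `QueryPattern` and `Weights` types, the chosen values for the thematic and fact-lookup cases, and the assumption that the query has already been classified are all illustrative rather than taken from Momentry.

```rust
/// Query categories described above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum QueryPattern {
    Thematic,
    FactLookup,
    Balanced,
}

/// A vector/BM25 weight pair, assumed to sum to 1.0.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Weights {
    vector: f32,
    bm25: f32,
}

/// Picks weights for a query pattern; the values are examples within the stated ranges.
fn weights_for(pattern: QueryPattern) -> Weights {
    match pattern {
        QueryPattern::Thematic => Weights { vector: 0.8, bm25: 0.2 },
        QueryPattern::FactLookup => Weights { vector: 0.4, bm25: 0.6 },
        QueryPattern::Balanced => Weights { vector: 0.7, bm25: 0.3 },
    }
}

/// Applies a weight pair to one result's two scores.
fn weighted_score(w: Weights, vector_score: f32, bm25_score: f32) -> f32 {
    vector_score * w.vector + bm25_score * w.bm25
}
```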

Implementation

Embedding Generation

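// Builds the text that gets embedded for one chunk.
// Story (parent) chunks embed their summary plus their children's text;
// every other chunk embeds its own text plus the parent's summary.
// The "search_document: " prefix is not added here.
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk) // concatenated text of the child chunks
            )
        }
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A") // placeholder when no parent is known
            )
        }
    }
}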
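The function above relies on `ChunkType::as_str` and `get_children_text`, neither of which is shown in this document. The sketch below gives one possible shape for them, as assumptions only: the type definitions are hypothetical, and unlike the call above, this `get_children_text` takes an explicit id-to-text map because a chunk that stores only `child_chunk_ids` needs some lookup to resolve them.

```rust
use std::collections::HashMap;

/// Hypothetical chunk kinds; only the two used in this document.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkType {
    Story,
    Sentence,
}

impl ChunkType {
    /// Lowercase label used inside the "[...]" tag of child embedding text.
    fn as_str(&self) -> &'static str {
        match self {
            ChunkType::Story => "story",
            ChunkType::Sentence => "sentence",
        }
    }
}

/// Hypothetical chunk record with only the fields needed here.
#[allow(dead_code)]
struct Chunk {
    chunk_type: ChunkType,
    text_content: String,
    child_chunk_ids: Vec<String>,
}

/// Joins the text of a parent's children in `child_chunk_ids` order.
/// Ids missing from `texts_by_id` are skipped rather than treated as errors.
fn get_children_text(chunk: &Chunk, texts_by_id: &HashMap<String, String>) -> String {
    chunk
        .child_chunk_ids
        .iter()
        .filter_map(|id| texts_by_id.get(id).map(String::as_str))
        .collect::<Vec<&str>>()
        .join(" ")
}
```

Skipping unknown ids keeps embedding generation from failing on a partially processed video; a stricter variant could return an error instead.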

Storage

  • Parent chunks stored with their child_chunk_ids
  • Child chunks reference parent_chunk_id
  • Both stored in PostgreSQL with full-text index
  • Vectors stored in Qdrant

Example Flow

  1. Story Processing generates parent-child hierarchy
  2. Embedding creates vector for each chunk
  3. Storage saves to PostgreSQL + Qdrant
  4. Search retrieves using hybrid search
  5. Results include both parent context and child details

Best Practices

  1. Chunk Size: 5 child chunks per parent (configurable)
  2. Text Length: Keep each embedding text under 512 tokens
  3. Parent Description: Include temporal markers (timestamps)
  4. Child Content: Preserve original transcription
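The first three practices can be sketched as small helpers. Everything here is an assumption made for illustration: the names are hypothetical, children are grouped in their original order, and the length check counts whitespace-separated words, which is only a rough stand-in for a real tokenizer and will usually undercount tokens.

```rust
/// Default number of children per parent, per item 1 above.
const CHILDREN_PER_PARENT: usize = 5;

/// Splits child ids into consecutive groups, one group per parent chunk.
/// The last group is shorter when the count is not a multiple of `size`.
fn group_children(child_ids: &[String], size: usize) -> Vec<Vec<String>> {
    assert!(size > 0, "group size must be positive");
    child_ids.chunks(size).map(|group| group.to_vec()).collect()
}

/// Formats the temporal marker that opens a parent description, e.g. "[0s-125s]".
fn temporal_marker(start_secs: u32, end_secs: u32) -> String {
    format!("[{}s-{}s]", start_secs, end_secs)
}

/// Rough length check: counts whitespace-separated words instead of model tokens.
fn within_token_budget(text: &str, max_tokens: usize) -> bool {
    text.split_whitespace().count() <= max_tokens
}
```

Because the word count is an approximation, a budget somewhat below 512 leaves headroom for the difference between words and tokens.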

Future Enhancements

  • GraphRAG integration for relationship traversal
  • Cross-chunk entity linking
  • Temporal graph building