
accusys 383201cacd feat: Initial v0.9 release with API Key authentication

## v0.9.20260325_144654

### Features
- API Key Authentication System
- Job Worker System
- V2 Backup Versioning

### Bug Fixes
- get_processor_results_by_job column mapping

Co-authored-by: OpenCode

2026-03-25 14:53:41 +08:00

4.0 KiB


Document Embedding Strategy - Parent-Child Chunks

Overview

Momentry uses a parent-child chunk hierarchy for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.

Chunk Structure

Parent Chunk

Purpose: Summarize multiple child chunks with a narrative description
Content: High-level description of multiple scenes/segments
Example:

{
  "chunk_id": "story_asr_0000",
  "chunk_type": "story",
  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}

Child Chunk

Purpose: Individual segments from ASR, scenes from CUT, etc.
Content: Raw transcription or detection results
Example:

{
  "chunk_id": "asr_0001",
  "chunk_type": "sentence",
  "text_content": "Hello world",
  "parent_chunk_id": "story_asr_0000"
}
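
The two JSON shapes above could be modeled roughly as follows. This is a minimal sketch: the field names come from the examples, while the type names `ParentChunk` and `ChildChunk` and the helper `is_linked` are hypothetical and not taken from the Momentry codebase.

```rust
// Hypothetical types mirroring the JSON examples above. Field names follow
// the examples; the type names and the helper are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ParentChunk {
    pub chunk_id: String,
    pub chunk_type: String,
    pub text_content: String,
    pub child_chunk_ids: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChildChunk {
    pub chunk_id: String,
    pub chunk_type: String,
    pub text_content: String,
    pub parent_chunk_id: String,
}

/// Returns true when the link is recorded on both sides: the child points at
/// the parent and the parent lists the child.
pub fn is_linked(parent: &ParentChunk, child: &ChildChunk) -> bool {
    child.parent_chunk_id == parent.chunk_id
        && parent.child_chunk_ids.iter().any(|id| id == &child.chunk_id)
}
```

Checking both directions is a design choice for the sketch: because the hierarchy is stored redundantly (ids on the parent, a back-reference on the child), a two-sided check catches records where only one side was updated.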

Embedding Strategy

For Vector Search

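When embedding chunks for vector search, each chunk's text is combined with text from the other level of the hierarchy: a parent is embedded together with its children's content, and a child together with its parent's description. Each vector therefore carries both context and detail.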

Parent Chunk Embedding

embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."

Prefix: search_document: (for documents in Qdrant)

Example:

search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.

Child Chunk Embedding

embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"

Prefix: search_document:

Example:

search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
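
The parent and child formats above can be sketched as two small functions that also prepend the `search_document: ` prefix. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, children are joined with ". " as in the template, and the "N/A" fallback for a missing parent is taken from the Implementation section.

```rust
/// Prefix the document specifies for text stored in Qdrant.
const DOCUMENT_PREFIX: &str = "search_document: ";

/// Parent format: a summary line, then the children's text on a second line.
/// Children are joined with ". " as in the template (hypothetical helper).
pub fn parent_embedding_text(parent_text: &str, child_texts: &[&str]) -> String {
    format!(
        "{}Summary: {}\nChildren: {}",
        DOCUMENT_PREFIX,
        parent_text,
        child_texts.join(". ")
    )
}

/// Child format: the chunk type in brackets, the child's own text, then the
/// parent's description. A missing parent falls back to "N/A".
pub fn child_embedding_text(
    chunk_type: &str,
    child_text: &str,
    parent_text: Option<&str>,
) -> String {
    format!(
        "{}[{}] {}\nParent: {}",
        DOCUMENT_PREFIX,
        chunk_type,
        child_text,
        parent_text.unwrap_or("N/A")
    )
}
```

Calling `child_embedding_text("sentence", "Hello, how are you?", Some("A man enters a building. He walks down a hallway."))` reproduces the child example shown above.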

For BM25 Text Search

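The keyword side of hybrid search runs on the raw chunk text using PostgreSQL full-text search. Note that ts_rank_cd() computes a cover-density rank rather than the classic BM25 formula; this document uses "BM25" as shorthand for that keyword score.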

Index: search_vector (TSVECTOR) on chunks.text_content
Search: Uses ts_rank_cd() for ranking
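
A keyword query against this index might look like the statement below, held here as a Rust constant so it can be passed to whatever PostgreSQL client the project uses. Only `chunks`, `search_vector`, and `ts_rank_cd()` come from this document; the `chunk_id` column, the 'english' text search configuration, the use of `plainto_tsquery`, and the `bm25_score` alias are assumptions.

```rust
/// Keyword search over `chunks.search_vector`.
/// `$1` is the user's query text and `$2` is the maximum number of rows.
/// The `chunk_id` column, the 'english' configuration, and the `bm25_score`
/// alias are assumptions made for this sketch.
pub const KEYWORD_SEARCH_SQL: &str = "\
SELECT chunk_id,
       ts_rank_cd(search_vector, query) AS bm25_score
FROM chunks,
     plainto_tsquery('english', $1) AS query
WHERE search_vector @@ query
ORDER BY bm25_score DESC
LIMIT $2";
```

The `@@` match in the WHERE clause is what lets PostgreSQL use an index on `search_vector`; `ts_rank_cd()` is then computed only for the matching rows.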

Hybrid Search Ranking

Combined score = (vector_score * 0.7) + (bm25_score * 0.3)
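
As a rough sketch, the formula translates directly into a weighted sum. Because the two scores come from different systems and are not guaranteed to share a scale, the example also includes a min-max normalization step; that step and both function names are assumptions, not something this document specifies.

```rust
/// Weighted sum of the two scores, e.g. with weights 0.7 and 0.3.
/// Both inputs are assumed to be on a comparable scale already.
pub fn combined_score(
    vector_score: f64,
    bm25_score: f64,
    vector_weight: f64,
    bm25_weight: f64,
) -> f64 {
    vector_score * vector_weight + bm25_score * bm25_weight
}

/// Min-max normalization onto 0.0..=1.0, one possible way (an assumption) to
/// bring raw scores from one result list onto a common scale.
pub fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().copied().fold(f64::INFINITY, f64::min);
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    // An empty list or identical scores leave no range to scale by, so every
    // score maps to 1.0 (for an empty list the result is simply empty).
    if scores.is_empty() || (max - min).abs() < f64::EPSILON {
        return vec![1.0; scores.len()];
    }
    scores.iter().map(|s| (s - min) / (max - min)).collect()
}
```

With the default weights, a vector score of 0.8 and a keyword score of 0.5 combine to 0.8 * 0.7 + 0.5 * 0.3 = 0.71.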

Why 0.7/0.3?

| Aspect   | Vector                  | BM25                      |
|----------|-------------------------|---------------------------|
| Pros     | Semantic similarity     | Exact keyword match       |
| Cons     | May miss specific terms | No semantic understanding |
| Best for | Thematic queries        | Fact lookup               |

Query Patterns

Thematic Query ("What are the main themes?")

Use higher vector_weight (0.8-0.9)
Vector search finds semantically similar content

Fact Lookup ("Who said X?")

Use higher bm25_weight (0.5-0.7)
BM25 finds exact matches

Balanced ("Tell me about scene 5")

Use default 0.7/0.3
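
The three patterns above could be encoded as a small lookup. This is a sketch only: `QueryKind`, `SearchWeights`, and `weights_for` are hypothetical names, the specific values are picked from inside the suggested ranges, and the assumption that the two weights sum to 1.0 is not stated in this document.

```rust
/// Hypothetical classification of an incoming query.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QueryKind {
    Thematic,
    FactLookup,
    Balanced,
}

/// Weight pair for hybrid ranking; assumed to sum to 1.0.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SearchWeights {
    pub vector: f64,
    pub bm25: f64,
}

pub fn weights_for(kind: QueryKind) -> SearchWeights {
    match kind {
        // 0.8 is the low end of the suggested 0.8-0.9 vector range.
        QueryKind::Thematic => SearchWeights { vector: 0.8, bm25: 0.2 },
        // 0.6 sits inside the suggested 0.5-0.7 BM25 range.
        QueryKind::FactLookup => SearchWeights { vector: 0.4, bm25: 0.6 },
        // The documented default.
        QueryKind::Balanced => SearchWeights { vector: 0.7, bm25: 0.3 },
    }
}
```

How a query gets classified into one of these kinds is left open here; the document does not describe that step.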

Implementation

Embedding Generation

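// Builds the text that is sent to the embedding model for one chunk.
// `Chunk`, `ChunkType`, and `get_children_text` are defined elsewhere in the
// codebase. Note that this function does not add the `search_document: ` prefix.
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        // Parent (story) chunk: summary followed by the children's text.
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk)
            )
        }
        // Any other chunk is treated as a child: its own text plus the
        // parent's description, or "N/A" when no parent text is available.
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A")
            )
        }
    }
}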

Storage

Parent chunks stored with their child_chunk_ids
Child chunks reference parent_chunk_id
Both stored in PostgreSQL with full-text index
Vectors stored in Qdrant

Example Flow

1. Story Processing generates the parent-child hierarchy
2. Embedding creates a vector for each chunk
3. Storage saves to PostgreSQL + Qdrant
4. Search retrieves using hybrid search
5. Results include both parent context and child details

Best Practices

Chunk Size: 5 child chunks per parent (configurable)
Text Length: Keep the text passed to the embedding model under 512 tokens
Parent Description: Include temporal markers (timestamps)
Child Content: Preserve original transcription
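
The first two practices can be sketched as helpers. Both function names are hypothetical, and the length check counts whitespace-separated words as a stand-in for tokens, which is only an approximation: a real tokenizer usually produces more tokens than words, so a margin would be needed in practice.

```rust
/// Groups child chunk ids into consecutive runs of at most
/// `children_per_parent` (5 in the recommendation above).
pub fn group_children(child_ids: &[String], children_per_parent: usize) -> Vec<Vec<String>> {
    // `chunks` panics on a size of 0, so fall back to one child per group.
    let size = children_per_parent.max(1);
    child_ids.chunks(size).map(|group| group.to_vec()).collect()
}

/// Rough length check: whitespace-separated words stand in for model tokens
/// (an assumption; the actual tokenizer is not named in this document).
pub fn within_token_budget(text: &str, max_tokens: usize) -> bool {
    text.split_whitespace().count() <= max_tokens
}
```

With 12 child ids and a group size of 5, `group_children` yields three groups of 5, 5, and 2 ids; the last parent simply covers fewer children.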

Future Enhancements

GraphRAG integration for relationship traversal
Cross-chunk entity linking
Temporal graph building
