feat: Initial v0.9 release with API Key authentication

## v0.9.20260325_144654

### Features
- API Key Authentication System
- Job Worker System
- V2 Backup Versioning

### Bug Fixes
- get_processor_results_by_job column mapping

Co-authored-by: OpenCode
2026-03-25 14:52:51 +08:00
parent 47e86b696f
commit 383201cacd
193 changed files with 40268 additions and 422 deletions
--- a/docs/DOCUMENT_EMBEDDING_STRATEGY.md
+++ b/docs/DOCUMENT_EMBEDDING_STRATEGY.md
@@ -0,0 +1,151 @@
+# Document Embedding Strategy - Parent-Child Chunks
+
+## Overview
+
+Momentry uses a **parent-child chunk hierarchy** for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.
+
+## Chunk Structure
+
+### Parent Chunk
+- **Purpose**: Summarize multiple child chunks with a narrative description
+- **Content**: High-level description of multiple scenes/segments
+- **Example**:
+```json
+{
+  "chunk_id": "story_asr_0000",
+  "chunk_type": "story",
+  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
+  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
+}
+```
+
+### Child Chunk
+- **Purpose**: Individual segments from ASR, scenes from CUT, etc.
+- **Content**: Raw transcription or detection results
+- **Example**:
+```json
+{
+  "chunk_id": "asr_0001",
+  "chunk_type": "sentence",
+  "text_content": "Hello world",
+  "parent_chunk_id": "story_asr_0000"
+}
+```
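
The two JSON shapes above can be modelled as one record type. The following is a minimal sketch, not the project's actual definitions: the field names are taken from the JSON examples, while the `ChunkType` variants, the `Option`/`Vec` representation, and the `links_are_consistent` helper are assumptions made for illustration.

```rust
/// Chunk kinds that appear in the JSON examples; other kinds are omitted (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChunkType {
    Story,
    Sentence,
}

impl ChunkType {
    /// String form used in the `chunk_type` JSON field.
    pub fn as_str(self) -> &'static str {
        match self {
            ChunkType::Story => "story",
            ChunkType::Sentence => "sentence",
        }
    }
}

/// One chunk. Parents fill `child_chunk_ids`; children fill `parent_chunk_id`.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub chunk_id: String,
    pub chunk_type: ChunkType,
    pub text_content: String,
    pub parent_chunk_id: Option<String>,
    pub child_chunk_ids: Vec<String>,
}

/// Hypothetical helper: true when every id listed by `parent` is present in
/// `children` and that child points back at `parent`.
pub fn links_are_consistent(parent: &Chunk, children: &[Chunk]) -> bool {
    parent.child_chunk_ids.iter().all(|id| {
        children.iter().any(|c| {
            &c.chunk_id == id
                && c.parent_chunk_id.as_deref() == Some(parent.chunk_id.as_str())
        })
    })
}
```

Keeping both link directions in one struct mirrors the JSON, at the cost of having to keep the two sides in sync, which is what the consistency check is for.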
+
+## Embedding Strategy
+
+### For Vector Search
+
+When embedding chunks for vector search, we combine **parent description + child content** to provide both context and detail.
+
+#### Parent Chunk Embedding
+```
+embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
+```
+
+**Prefix**: `search_document: ` (for documents in Qdrant)
+
+**Example**:
+```
+search_document: Summary: A man enters a building. He walks down a hallway.
+Children: Hello, how are you? I'm fine thank you. The weather is nice today.
+```
+
+#### Child Chunk Embedding
+```
+embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
+```
+
+**Prefix**: `search_document: `
+
+**Example**:
+```
+search_document: [sentence] Hello, how are you?
+Parent: A man enters a building. He walks down a hallway.
+```
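
The two templates, including the `search_document: ` prefix, can be sketched as plain string builders. This is an illustration under assumptions: the function names are hypothetical, and child texts are assumed to carry their own end punctuation so that joining them with a single space reproduces the example strings above.

```rust
/// Prefix the document applies to texts indexed in Qdrant.
const DOCUMENT_PREFIX: &str = "search_document: ";

/// Parent embedding input: prefix, summary line, then the child texts.
/// Assumption: child texts already end with punctuation, so a space is the separator.
pub fn parent_embedding_text(summary: &str, child_texts: &[&str]) -> String {
    format!(
        "{}Summary: {}\nChildren: {}",
        DOCUMENT_PREFIX,
        summary,
        child_texts.join(" ")
    )
}

/// Child embedding input: prefix, chunk type in brackets, child text, then parent text.
pub fn child_embedding_text(chunk_type: &str, text: &str, parent_summary: &str) -> String {
    format!(
        "{}[{}] {}\nParent: {}",
        DOCUMENT_PREFIX, chunk_type, text, parent_summary
    )
}
```

Calling `child_embedding_text("sentence", "Hello, how are you?", "A man enters a building. He walks down a hallway.")` yields the child example shown above.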
+
+### For BM25 Text Search
+
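+The keyword side, referred to here as BM25, runs on raw text through PostgreSQL full-text search. PostgreSQL's built-in `ts_rank_cd()` computes a cover-density rank rather than the BM25 formula, so "BM25 score" in this document means that keyword rank.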
+
+- **Index**: `search_vector` (TSVECTOR) on `chunks.text_content`
+- **Search**: Uses `ts_rank_cd()` for ranking
+
+## Hybrid Search Ranking
+
+Combined score = `(vector_score * 0.7) + (bm25_score * 0.3)`
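
As a rough sketch, the weighted sum can be written as below. The document does not say whether the two scores are brought onto a common scale first; the `min_max_normalize` helper is an assumption added here because a vector similarity and a full-text rank are not naturally comparable. Both function names are hypothetical.

```rust
/// Weighted sum of the two scores. `vector_weight` is clamped to [0, 1]
/// and the keyword score receives the remainder, so the weights sum to 1.
pub fn hybrid_score(vector_score: f64, bm25_score: f64, vector_weight: f64) -> f64 {
    let w = vector_weight.clamp(0.0, 1.0);
    w * vector_score + (1.0 - w) * bm25_score
}

/// Assumed pre-processing step: min-max rescale a result list to [0, 1].
/// If all scores are equal (or the list is empty) every output is 0.0.
pub fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().copied().fold(f64::INFINITY, f64::min);
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}
```

With the default weight, `hybrid_score(0.8, 0.5, 0.7)` evaluates to 0.56 + 0.15 = 0.71.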
+
+### Why 0.7/0.3?
+
+| Aspect | Vector (0.7) | BM25 (0.3) |
+|--------|--------|------|
+| Pros | Semantic similarity | Exact keyword match |
+| Cons | May miss specific terms | No semantic understanding |
+| Best for | Thematic queries | Fact lookup |
+
+## Query Patterns
+
+### Thematic Query ("What are the main themes?")
+- Use higher `vector_weight` (0.8-0.9)
+- Vector search finds semantically similar content
+
+### Fact Lookup ("Who said X?")
+- Use higher `bm25_weight` (0.5-0.7)
+- BM25 finds exact matches
+
+### Balanced ("Tell me about scene 5")
+- Use default 0.7/0.3
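
The three patterns above could be mapped to weights as in this sketch. The `QueryPattern` enum and `weights_for` function are hypothetical, and the specific values picked inside each documented range (0.85 for thematic, 0.6 for fact lookup) are assumptions; how a query is classified into a pattern is out of scope here.

```rust
/// Query categories named in this section (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryPattern {
    Thematic,
    FactLookup,
    Balanced,
}

/// Returns `(vector_weight, bm25_weight)`; the pair always sums to 1.0.
pub fn weights_for(pattern: QueryPattern) -> (f64, f64) {
    let vector_weight = match pattern {
        // Documented range for vector_weight: 0.8-0.9.
        QueryPattern::Thematic => 0.85,
        // Documented range for bm25_weight: 0.5-0.7, so vector gets 0.4 here.
        QueryPattern::FactLookup => 0.4,
        // Default split from the ranking section.
        QueryPattern::Balanced => 0.7,
    };
    (vector_weight, 1.0 - vector_weight)
}
```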
+
+## Implementation
+
+### Embedding Generation
+```rust
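+/// Builds the text that is embedded for `chunk`.
+/// Story (parent) chunks get their summary plus their children's text; every other
+/// chunk type gets its own text plus the parent's text, or "N/A" when no parent is given.
+/// The `search_document: ` prefix is not added here.
+fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {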
+    match chunk.chunk_type {
+        ChunkType::Story => {
+            format!(
+                "Summary: {}\nChildren: {}",
+                chunk.text_content,
+                get_children_text(chunk)
+            )
+        }
+        _ => {
+            format!(
+                "[{}] {}\nParent: {}",
+                chunk.chunk_type.as_str(),
+                chunk.text_content,
+                parent_text.unwrap_or("N/A")
+            )
+        }
+    }
+}
+```
+
+### Storage
+- Parent chunks stored with their `child_chunk_ids`
+- Child chunks reference `parent_chunk_id`
+- Both stored in PostgreSQL with full-text index
+- Vectors stored in Qdrant
+
+## Example Flow
+
+1. **Story Processing** generates parent-child hierarchy
+2. **Embedding** creates vector for each chunk
+3. **Storage** saves to PostgreSQL + Qdrant
+4. **Search** retrieves using hybrid search
+5. **Results** include both parent context and child details
+
+## Best Practices
+
+1. **Chunk Size**: 5 child chunks per parent (configurable)
+2. **Text Length**: Keep each embedding input text under 512 tokens
+3. **Parent Description**: Include temporal markers (timestamps)
+4. **Child Content**: Preserve original transcription
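
The first two practices can be sketched as follows. Both helpers are hypothetical. The token check counts whitespace-separated words, which is only a crude stand-in: the embedding model's own tokenizer will usually report a different (often higher) count.

```rust
/// Splits child ids into consecutive groups of at most `children_per_parent`
/// (5 by default in this document). A group size of 0 yields no groups.
pub fn group_children(child_ids: &[String], children_per_parent: usize) -> Vec<Vec<String>> {
    if children_per_parent == 0 {
        // `slice::chunks` panics on a chunk size of 0, so handle it explicitly.
        return Vec::new();
    }
    child_ids
        .chunks(children_per_parent)
        .map(|group| group.to_vec())
        .collect()
}

/// Rough budget check: true when the whitespace word count is below `max_tokens`.
/// Assumption: word count approximates token count; real tokenizers differ.
pub fn within_token_budget(text: &str, max_tokens: usize) -> bool {
    text.split_whitespace().count() < max_tokens
}
```

With 12 child ids and a group size of 5, this produces three groups of 5, 5, and 2 ids.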
+
+## Future Enhancements
+
+- [ ] GraphRAG integration for relationship traversal
+- [ ] Cross-chunk entity linking
+- [ ] Temporal graph building