feat: Initial v0.9 release with API Key authentication
## v0.9.20260325_144654

### Features
- API Key Authentication System
- Job Worker System
- V2 Backup Versioning

### Bug Fixes
- get_processor_results_by_job column mapping

Co-authored-by: OpenCode
docs/DOCUMENT_EMBEDDING_STRATEGY.md
@@ -0,0 +1,151 @@
# Document Embedding Strategy - Parent-Child Chunks

## Overview

Momentry uses a **parent-child chunk hierarchy** for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.

## Chunk Structure

### Parent Chunk
- **Purpose**: Summarizes multiple child chunks with a narrative description
- **Content**: High-level description of multiple scenes/segments
- **Example**:
```json
{
  "chunk_id": "story_asr_0000",
  "chunk_type": "story",
  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}
```

### Child Chunk
- **Purpose**: Holds an individual segment (an ASR sentence, a scene from CUT, etc.)
- **Content**: Raw transcription or detection results
- **Example**:
```json
{
  "chunk_id": "asr_0001",
  "chunk_type": "sentence",
  "text_content": "Hello world",
  "parent_chunk_id": "story_asr_0000"
}
```

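The two JSON shapes above could be modelled with a single Rust type. The following is a minimal sketch, not the project's actual definition: the field names come from the JSON examples, while the two-variant `ChunkType`, the `is_parent` helper, and `links_are_consistent` are assumptions made for illustration.

```rust
/// Chunk kinds that appear in the examples; the real enum may have more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChunkType {
    Story,
    Sentence,
}

impl ChunkType {
    /// String form used in the JSON `chunk_type` field.
    pub fn as_str(&self) -> &'static str {
        match self {
            ChunkType::Story => "story",
            ChunkType::Sentence => "sentence",
        }
    }
}

/// One chunk; parents fill `child_chunk_ids`, children fill `parent_chunk_id`.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub chunk_id: String,
    pub chunk_type: ChunkType,
    pub text_content: String,
    pub parent_chunk_id: Option<String>,
    pub child_chunk_ids: Vec<String>,
}

impl Chunk {
    /// Hypothetical helper: a chunk counts as a parent if it lists children.
    pub fn is_parent(&self) -> bool {
        !self.child_chunk_ids.is_empty()
    }
}

/// Hypothetical check: every child ID listed by `parent` must exist in
/// `children` and point back to the parent.
pub fn links_are_consistent(parent: &Chunk, children: &[Chunk]) -> bool {
    parent.child_chunk_ids.iter().all(|id| {
        children.iter().any(|c| {
            c.chunk_id == *id
                && c.parent_chunk_id.as_deref() == Some(parent.chunk_id.as_str())
        })
    })
}
```

Keeping both link directions on the same type mirrors the JSON examples, at the cost of having to keep the two sides in sync; a check like `links_are_consistent` is one way to catch drift.
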
## Embedding Strategy

### For Vector Search

When embedding chunks for vector search, we combine the **parent description + child content** so that each vector carries both context and detail.

#### Parent Chunk Embedding
```
embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
```

**Prefix**: `search_document: ` (for documents in Qdrant)

**Example**:
```
search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.
```

#### Child Chunk Embedding
```
embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
```

**Prefix**: `search_document: `

**Example**:
```
search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
```

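### For BM25 Text Search

The keyword side of hybrid search runs on raw chunk text using PostgreSQL full-text search. Note that `ts_rank_cd()` computes a cover-density rank, which stands in for a BM25 score here rather than being a true BM25 implementation.

- **Index**: `search_vector` (TSVECTOR) on `chunks.text_content`
- **Search**: Uses `ts_rank_cd()` for ranking
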
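As an illustration of the index and ranking function listed above, the query could look roughly like the SQL held in the Rust constant below. This is a sketch under assumptions: the table and column names (`chunks`, `chunk_id`, `text_content`, `search_vector`) follow this document, but the `'english'` text-search configuration, the `bm25_score` alias, and the `$1`/`$2` parameter layout are hypothetical.

```rust
/// Hypothetical keyword-search query.
/// `$1` is the user's query text and `$2` the maximum number of rows.
/// The 'english' configuration is an assumption; use whichever configuration
/// the `search_vector` column was built with.
pub const KEYWORD_SEARCH_SQL: &str = "\
SELECT chunk_id,
       text_content,
       ts_rank_cd(search_vector, query) AS bm25_score
FROM chunks,
     plainto_tsquery('english', $1) AS query
WHERE search_vector @@ query
ORDER BY bm25_score DESC
LIMIT $2";
```

`plainto_tsquery` treats the input as plain words rather than query syntax, which is a safer default for free-form user input.
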
## Hybrid Search Ranking

Combined score = `(vector_score * 0.7) + (bm25_score * 0.3)`

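The combined score can be sketched as a small weighted sum. The function and constant names below are hypothetical, and the sketch assumes both input scores have already been normalized to a comparable range (for example 0 to 1), which this document does not specify.

```rust
use std::cmp::Ordering;

/// Default weights from the formula above.
pub const DEFAULT_VECTOR_WEIGHT: f64 = 0.7;
pub const DEFAULT_BM25_WEIGHT: f64 = 0.3;

/// Weighted sum of the two scores.
/// Assumes both scores are already on a comparable scale.
pub fn hybrid_score(
    vector_score: f64,
    bm25_score: f64,
    vector_weight: f64,
    bm25_weight: f64,
) -> f64 {
    vector_score * vector_weight + bm25_score * bm25_weight
}

/// Takes `(chunk_id, vector_score, bm25_score)` hits and returns
/// `(chunk_id, combined_score)` pairs sorted from highest to lowest score,
/// using the default weights.
pub fn rank_by_hybrid_score(hits: Vec<(String, f64, f64)>) -> Vec<(String, f64)> {
    let mut ranked: Vec<(String, f64)> = hits
        .into_iter()
        .map(|(id, v, b)| {
            let score = hybrid_score(v, b, DEFAULT_VECTOR_WEIGHT, DEFAULT_BM25_WEIGHT);
            (id, score)
        })
        .collect();
    // Scores are finite in normal use; treat incomparable values as equal.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    ranked
}
```

With the default weights, a hit scoring 0.9 on vector similarity and 0.2 on keyword rank combines to 0.69, which edges out a hit scoring 0.5 and 1.0 (combined 0.65).
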
### Why 0.7/0.3?

The default weights semantic similarity more heavily while keeping exact keyword matches as a secondary signal. The two signals compare as follows:

| Aspect | Vector | BM25 |
|--------|--------|------|
| Pros | Semantic similarity | Exact keyword match |
| Cons | May miss specific terms | No semantic understanding |
| Best for | Thematic queries | Fact lookup |

## Query Patterns

### Thematic Query ("What are the main themes?")
- Use higher `vector_weight` (0.8-0.9)
- Vector search finds semantically similar content

### Fact Lookup ("Who said X?")
- Use higher `bm25_weight` (0.5-0.7)
- BM25 finds exact matches

### Balanced ("Tell me about scene 5")
- Use default 0.7/0.3

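One way to encode these patterns is a lookup from query kind to a weight pair. The sketch below is illustrative only: `QueryKind`, `SearchWeights`, and `weights_for` are hypothetical names, the specific values are midpoints of the ranges above, and it assumes the two weights always sum to 1, which this document does not state.

```rust
/// Hypothetical classification of an incoming query.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QueryKind {
    Thematic,
    FactLookup,
    Balanced,
}

/// Weight pair passed to the hybrid ranking step.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SearchWeights {
    pub vector_weight: f64,
    pub bm25_weight: f64,
}

/// Picks weights for a query kind.
/// Assumption: the vector weight is derived as `1.0 - bm25_weight`.
pub fn weights_for(kind: QueryKind) -> SearchWeights {
    let bm25_weight = match kind {
        // vector_weight 0.85, the midpoint of the 0.8-0.9 range
        QueryKind::Thematic => 0.15,
        // the midpoint of the 0.5-0.7 range
        QueryKind::FactLookup => 0.6,
        // the documented default of 0.7/0.3
        QueryKind::Balanced => 0.3,
    };
    SearchWeights {
        vector_weight: 1.0 - bm25_weight,
        bm25_weight,
    }
}
```

How a query gets classified into a `QueryKind` is left open here; that step is outside what this document describes.
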
## Implementation

### Embedding Generation
```rust
// `Chunk`, `ChunkType`, and `get_children_text` are not shown in this document.
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        // Parent (story) chunk: its summary followed by the text of its children.
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk)
            )
        }
        // Child chunk: type tag and own text, then the parent's text for context.
        // Falls back to "N/A" when no parent text is supplied.
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A")
            )
        }
    }
}
```

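The function above relies on `get_children_text`, which this document does not show, and it does not itself add the `search_document: ` prefix. A self-contained sketch of those two missing pieces might look like the following. The helper names `join_children_text` and `parent_embedding_input` are hypothetical, and the punctuation rule is an assumption chosen to reproduce the parent example in the Embedding Strategy section.

```rust
/// Prefix this document specifies for document-side embedding input.
pub const DOCUMENT_PREFIX: &str = "search_document: ";

/// Hypothetical stand-in for `get_children_text`: joins child texts with
/// spaces, appending a period to any child that lacks terminal punctuation.
pub fn join_children_text(children: &[&str]) -> String {
    children
        .iter()
        .map(|text| text.trim())
        .filter(|text| !text.is_empty())
        .map(|text| {
            if text.ends_with(|c: char| matches!(c, '.' | '?' | '!')) {
                text.to_string()
            } else {
                format!("{}.", text)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Builds the complete parent embedding input, including the prefix.
pub fn parent_embedding_input(summary: &str, children: &[&str]) -> String {
    format!(
        "{}Summary: {}\nChildren: {}",
        DOCUMENT_PREFIX,
        summary,
        join_children_text(children)
    )
}
```

Applying the prefix in one place, right before the text is sent to the embedding model, keeps the stored `text_content` free of model-specific markers.
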
### Storage
- Parent chunks stored with their `child_chunk_ids`
- Child chunks reference `parent_chunk_id`
- Both stored in PostgreSQL with full-text index
- Vectors stored in Qdrant

## Example Flow

1. **Story Processing** generates parent-child hierarchy
2. **Embedding** creates vector for each chunk
3. **Storage** saves to PostgreSQL + Qdrant
4. **Search** retrieves using hybrid search
5. **Results** include both parent context and child details

## Best Practices

1. **Chunk Size**: 5 child chunks per parent (configurable)
2. **Text Length**: Keep the text passed to the embedding model under 512 tokens
3. **Parent Description**: Include temporal markers (timestamps)
4. **Child Content**: Preserve the original transcription

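The practices above can be sketched as a few small helpers. All names here are hypothetical. The token count is a rough whitespace-based estimate, not the embedding model's tokenizer, and the `[0s-125s]` marker format is copied from the parent chunk example earlier in this document.

```rust
/// Defaults taken from the list above.
pub const DEFAULT_CHILDREN_PER_PARENT: usize = 5;
pub const MAX_EMBEDDING_TOKENS: usize = 512;

/// Groups child chunk IDs into consecutive batches, one batch per parent.
/// A size of 0 is treated as 1 to avoid a panic in `chunks`.
pub fn group_children(child_ids: &[String], children_per_parent: usize) -> Vec<Vec<String>> {
    child_ids
        .chunks(children_per_parent.max(1))
        .map(|group| group.to_vec())
        .collect()
}

/// Rough token estimate by whitespace splitting.
/// A real tokenizer will usually report a different (often higher) count.
pub fn approx_token_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// True when the estimated token count stays under the 512-token limit.
pub fn fits_token_budget(text: &str) -> bool {
    approx_token_count(text) < MAX_EMBEDDING_TOKENS
}

/// Prepends a temporal marker such as `[0s-125s]` to a parent description.
pub fn with_time_marker(start_s: u32, end_s: u32, description: &str) -> String {
    format!("[{}s-{}s] {}", start_s, end_s, description)
}
```

Because the whitespace estimate tends to undercount, a production check would more likely use the tokenizer that matches the embedding model.
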
## Future Enhancements

- [ ] GraphRAG integration for relationship traversal
- [ ] Cross-chunk entity linking
- [ ] Temporal graph building