feat: Initial v0.9 release with API Key authentication

## v0.9.20260325_144654

### Features
- API Key Authentication System
- Job Worker System
- V2 Backup Versioning

### Bug Fixes
- get_processor_results_by_job column mapping

Co-authored-by: OpenCode
This commit is contained in:
accusys
2026-03-25 14:52:51 +08:00
parent 47e86b696f
commit 383201cacd
193 changed files with 40268 additions and 422 deletions

@@ -0,0 +1,151 @@
# Document Embedding Strategy - Parent-Child Chunks
## Overview
Momentry uses a **parent-child chunk hierarchy** for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.
## Chunk Structure
### Parent Chunk
- **Purpose**: Summarize multiple child chunks with a narrative description
- **Content**: High-level description of multiple scenes/segments
- **Example**:
```json
{
"chunk_id": "story_asr_0000",
"chunk_type": "story",
"text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
"child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}
```
### Child Chunk
- **Purpose**: Individual segments from ASR, scenes from CUT, etc.
- **Content**: Raw transcription or detection results
- **Example**:
```json
{
"chunk_id": "asr_0001",
"chunk_type": "sentence",
"text_content": "Hello world",
"parent_chunk_id": "story_asr_0000"
}
```
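The two JSON shapes above can be modeled with a single record type. The following is a minimal sketch, not the project's actual schema: the `Chunk` struct layout and the `broken_child_links` helper are hypothetical names introduced here for illustration, and only the field names come from the examples above.

```rust
use std::collections::HashMap;

/// Hypothetical chunk record mirroring the JSON fields shown above.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub chunk_id: String,
    pub chunk_type: String,
    pub text_content: String,
    /// Set on child chunks only.
    pub parent_chunk_id: Option<String>,
    /// Non-empty on parent chunks only.
    pub child_chunk_ids: Vec<String>,
}

/// Returns the IDs listed in `parent.child_chunk_ids` that are either missing
/// from `by_id` or whose `parent_chunk_id` does not point back to `parent`.
pub fn broken_child_links(parent: &Chunk, by_id: &HashMap<String, Chunk>) -> Vec<String> {
    parent
        .child_chunk_ids
        .iter()
        .filter(|id| match by_id.get(*id) {
            Some(child) => child.parent_chunk_id.as_deref() != Some(parent.chunk_id.as_str()),
            None => true,
        })
        .cloned()
        .collect()
}
```

A check like this is useful before embedding, since both embedding formats rely on the parent and child links agreeing in both directions.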
## Embedding Strategy
### For Vector Search
When embedding chunks for vector search, we combine **parent description + child content** to provide both context and detail.
#### Parent Chunk Embedding
```
embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
```
**Prefix**: `search_document: ` (for documents in Qdrant)
**Example**:
```
search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.
```
#### Child Chunk Embedding
```
embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
```
**Prefix**: `search_document: `
**Example**:
```
search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
```
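The prefix and the child template can be combined in one step. This is a minimal sketch under the assumption that the prefix is prepended as plain text before the string is sent to the embedding model; `DOCUMENT_PREFIX` and `child_embedding_input` are hypothetical names.

```rust
/// Prefix applied to documents stored for vector search (taken from the text above).
pub const DOCUMENT_PREFIX: &str = "search_document: ";

/// Builds the full input string for a child chunk:
/// prefix, then "[type] text", then the parent's text on a second line.
pub fn child_embedding_input(chunk_type: &str, text: &str, parent_text: &str) -> String {
    format!(
        "{}[{}] {}\nParent: {}",
        DOCUMENT_PREFIX, chunk_type, text, parent_text
    )
}
```

Calling `child_embedding_input("sentence", "Hello, how are you?", "A man enters a building. He walks down a hallway.")` reproduces the example shown above.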
### For BM25 Text Search
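The keyword side of the search runs on raw text through PostgreSQL full-text search. PostgreSQL has no built-in BM25 scorer, so `bm25_score` in this document refers to the `ts_rank_cd()` cover-density rank, used as a BM25-style keyword signal.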
- **Index**: `search_vector` (TSVECTOR) on `chunks.text_content`
- **Search**: Uses `ts_rank_cd()` for ranking
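The full-text lookup described above could be issued with a query along these lines. This is only a sketch: the constant name `KEYWORD_SEARCH_SQL`, the use of `plainto_tsquery`, the `'english'` text search configuration, and the `$1`/`$2` placeholders (query text and result limit) are assumptions, while the table and column names come from the bullets above. The string still has to be executed through whichever PostgreSQL client the project uses.

```rust
/// Hypothetical SQL for the keyword search: rank matching chunks with
/// `ts_rank_cd()` over the `search_vector` column of `chunks`.
/// $1 = user query text, $2 = maximum number of rows (both assumed).
pub const KEYWORD_SEARCH_SQL: &str = "\
    SELECT chunk_id, text_content, ts_rank_cd(search_vector, query) AS rank \
    FROM chunks, plainto_tsquery('english', $1) AS query \
    WHERE search_vector @@ query \
    ORDER BY rank DESC \
    LIMIT $2";
```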
## Hybrid Search Ranking
Combined score = `(vector_score * 0.7) + (bm25_score * 0.3)`
### Why 0.7/0.3?
| Aspect | Vector | BM25 |
|--------|--------|------|
| Default weight | 0.7 | 0.3 |
| Pros | Semantic similarity | Exact keyword match |
| Cons | May miss specific terms | No semantic understanding |
| Best for | Thematic queries | Fact lookup |
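The combined-score formula can be sketched as follows. The `HybridWeights` type and the `hybrid_score` and `rank_hits` functions are hypothetical names, and the sketch assumes both input scores have already been normalized to a comparable range such as [0, 1]; the document does not say how normalization is done.

```rust
/// Weights for combining the two retrieval scores (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HybridWeights {
    pub vector_weight: f64,
    pub bm25_weight: f64,
}

impl Default for HybridWeights {
    /// Default 0.7 / 0.3 split from the formula above.
    fn default() -> Self {
        Self { vector_weight: 0.7, bm25_weight: 0.3 }
    }
}

/// Combined score = (vector_score * vector_weight) + (bm25_score * bm25_weight).
/// Assumes both scores are already normalized to a comparable range.
pub fn hybrid_score(vector_score: f64, bm25_score: f64, w: HybridWeights) -> f64 {
    vector_score * w.vector_weight + bm25_score * w.bm25_weight
}

/// Takes (chunk_id, vector_score, bm25_score) hits and returns
/// (chunk_id, combined_score) pairs sorted by descending combined score.
pub fn rank_hits(hits: Vec<(String, f64, f64)>, w: HybridWeights) -> Vec<(String, f64)> {
    let mut scored: Vec<(String, f64)> = hits
        .into_iter()
        .map(|(id, v, b)| (id, hybrid_score(v, b, w)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

With the default weights, a hit scoring 0.8 on vectors and 0.1 on keywords (0.59 combined) outranks one scoring 0.2 and 0.9 (0.41 combined).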
## Query Patterns
### Thematic Query ("What are the main themes?")
- Use higher `vector_weight` (0.8-0.9)
- Vector search finds semantically similar content
### Fact Lookup ("Who said X?")
- Use higher `bm25_weight` (0.5-0.7)
- BM25 finds exact matches
### Balanced ("Tell me about scene 5")
- Use default 0.7/0.3
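The three patterns above can be expressed as a small lookup. This is a sketch only: `QueryKind` and `weights_for` are hypothetical names, the specific values 0.85 and 0.6 are midpoints picked from the stated ranges, and the assumption that the two weights sum to 1 is not stated in the document. How a query gets classified into a kind is out of scope here.

```rust
/// Hypothetical classification of an incoming query.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryKind {
    Thematic,
    FactLookup,
    Balanced,
}

/// Returns (vector_weight, bm25_weight) for a query kind.
/// Values are illustrative picks from the ranges above, assumed to sum to 1.
pub fn weights_for(kind: QueryKind) -> (f64, f64) {
    match kind {
        // Thematic: vector_weight in 0.8-0.9.
        QueryKind::Thematic => (0.85, 0.15),
        // Fact lookup: bm25_weight in 0.5-0.7.
        QueryKind::FactLookup => (0.4, 0.6),
        // Balanced: the default split.
        QueryKind::Balanced => (0.7, 0.3),
    }
}
```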
## Implementation
### Embedding Generation
```rust
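// `Chunk`, `ChunkType`, and `get_children_text` are assumed to be defined elsewhere.
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        // Parent (story) chunks: summary followed by the children's text.
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk)
            )
        }
        // Child chunks: typed content followed by the parent's text for context.
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A")
            )
        }
    }
}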
```
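The snippet above calls `get_children_text` without showing it. One way the joining step might look is sketched below, assuming the child texts have already been loaded in order; `join_children_text` is a hypothetical helper, not the project's function, and the rule of appending a period to texts that lack terminal punctuation is inferred from the `{child_text_1}. {child_text_2}.` template.

```rust
/// Joins child texts into the "Children:" portion of a parent embedding.
/// Trims each text, skips empty ones, and appends a period where a text
/// does not already end in '.', '?', or '!'.
pub fn join_children_text(children: &[&str]) -> String {
    children
        .iter()
        .map(|t| t.trim())
        .filter(|t| !t.is_empty())
        .map(|t| {
            if t.ends_with(|c: char| matches!(c, '.' | '?' | '!')) {
                t.to_string()
            } else {
                format!("{}.", t)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```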
### Storage
- Parent chunks stored with their `child_chunk_ids`
- Child chunks reference `parent_chunk_id`
- Both stored in PostgreSQL with full-text index
- Vectors stored in Qdrant
## Example Flow
1. **Story Processing** generates parent-child hierarchy
2. **Embedding** creates vector for each chunk
3. **Storage** saves to PostgreSQL + Qdrant
4. **Search** retrieves using hybrid search
5. **Results** include both parent context and child details
## Best Practices
1. **Chunk Size**: 5 child chunks per parent (configurable)
2. **Text Length**: Keep the embedding text under 512 tokens
3. **Parent Description**: Include temporal markers (timestamps)
4. **Child Content**: Preserve original transcription
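The chunk-size guideline can be sketched as a grouping step over ordered child IDs. `group_children` is a hypothetical helper, and the `story_asr_{:04}` ID pattern is an assumption extrapolated from the `story_asr_0000` example earlier in this document.

```rust
/// Groups ordered child chunk IDs into parents holding at most
/// `children_per_parent` children each.
/// Returns (parent_chunk_id, child_chunk_ids) pairs; the last group may be smaller.
pub fn group_children(
    child_ids: &[String],
    children_per_parent: usize,
) -> Vec<(String, Vec<String>)> {
    assert!(children_per_parent > 0, "children_per_parent must be positive");
    child_ids
        .chunks(children_per_parent)
        .enumerate()
        // Parent ID format is assumed, following the "story_asr_0000" example.
        .map(|(i, group)| (format!("story_asr_{:04}", i), group.to_vec()))
        .collect()
}
```

With seven children and the default size of 5, this yields one parent with five children and a second parent with the remaining two.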
## Future Enhancements
- [ ] GraphRAG integration for relationship traversal
- [ ] Cross-chunk entity linking
- [ ] Temporal graph building