feat: fix Chinese text search and duplicate chunk_id bug

- Add helper functions to extract text from nested content structure
- Update SearchResult to include uuid field
- Add PostgreSQL function get_chunk_by_chunk_id_and_uuid to handle duplicate chunk_ids
- Update Qdrant search functions to extract uuid from payload
- Change embedding model to nomic-embed-text-v2-moe:latest
- Update Qdrant collection name to momentry_rule1
- Fix MongoDB authentication and disable cache for development
- Improve error handling in processor.rs
- Update documentation with new embedding model
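
The duplicate chunk_id fix listed above can be pictured as a composite-key lookup. The commit names a PostgreSQL function, `get_chunk_by_chunk_id_and_uuid`, but its body is not shown here, so the following is only an in-memory Rust sketch of the idea. The `Chunk` struct, its field names, and the `build_index` and `get_chunk` helpers are hypothetical and are not taken from the repository.

```rust
use std::collections::HashMap;

/// Hypothetical chunk record; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    uuid: String,
    chunk_id: String,
    text: String,
}

/// Index chunks by (chunk_id, uuid) so that identical chunk_ids coming
/// from different documents do not overwrite each other.
fn build_index(chunks: &[Chunk]) -> HashMap<(String, String), Chunk> {
    chunks
        .iter()
        .map(|c| ((c.chunk_id.clone(), c.uuid.clone()), c.clone()))
        .collect()
}

/// Look up a single chunk by the composite key.
fn get_chunk<'a>(
    index: &'a HashMap<(String, String), Chunk>,
    chunk_id: &str,
    uuid: &str,
) -> Option<&'a Chunk> {
    index.get(&(chunk_id.to_string(), uuid.to_string()))
}

fn main() {
    // Two documents that both contain a chunk named "chunk_0001".
    let chunks = vec![
        Chunk { uuid: "doc-a".into(), chunk_id: "chunk_0001".into(), text: "first document".into() },
        Chunk { uuid: "doc-b".into(), chunk_id: "chunk_0001".into(), text: "second document".into() },
    ];
    let index = build_index(&chunks);

    // Keying on chunk_id alone would be ambiguous; the composite key is not.
    if let Some(chunk) = get_chunk(&index, "chunk_0001", "doc-b") {
        println!("{}", chunk.text);
    }
}
```

A search result that carries only a chunk_id cannot say which document it came from. Carrying the uuid alongside it, as the SearchResult and Qdrant payload changes above suggest, gives the lookup both halves of the key.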
2026-03-29 04:44:28 +08:00
parent 82955504f3
commit 2393d81a3f
13 changed files with 355 additions and 106 deletions
--- a/docs/PROCESSING_PIPELINE.md
+++ b/docs/PROCESSING_PIPELINE.md
@@ -119,11 +119,11 @@ cargo run --bin momentry -- chunk <uuid>
 ### Stage 4: Vectorization

 ```bash
-# Vectorize chunks
+# Vectorize chunks (uses the default model nomic-embed-text-v2-moe:latest)
 cargo run --bin momentry -- vectorize <uuid>

-# Specify a model
-cargo run --bin momentry -- vectorize <uuid> --model sentence-transformers/all-MiniLM-L6-v2
+# Specify the model explicitly
+cargo run --bin momentry -- vectorize <uuid> --model nomic-embed-text-v2-moe:latest
 ```

 ---
@@ -187,18 +187,27 @@ YOLO: ✓ Already complete, skipping

 ## Vectorization Model Selection

+### Unified Embedding Model
+Momentry Core uses **`nomic-embed-text-v2-moe:latest`** as the single embedding model for all rules:
+
 ```bash
-# Default model
--model sentence-transformers/all-MiniLM-L6-v2
+# Unified model (used by Rules 1, 2, and 3)
+--model nomic-embed-text-v2-moe:latest
+```

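-# High-accuracy model
--model sentence-transformers/all-mpnet-base-v2
+### Model Characteristics
+| Property | Description |
+|------|------|
+| **Model name** | `nomic-embed-text-v2-moe:latest` |
+| **Vector dimensions** | 768 |
+| **Multilingual support** | ✅ Full support (English, Chinese, Japanese, Korean, and others) |
+| **Architecture** | Mixture of Experts (MoE) |
+| **Inference speed** | Fast; suitable for real-time applications |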

-# Multilingual model
--model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
-
-# Chinese model
--model sentence-transformers/paraphrase-multilingual-mpnet-base-v2
+### Usage
+```bash
+# Vectorization command
+cargo run --bin momentry -- vectorize <uuid> --model nomic-embed-text-v2-moe:latest
 ```

 ---