feat: Phase 1 handover - schema migration, correction mechanism, API fixes

Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
2026-05-11 07:03:22 +08:00
parent ef894a44ad
commit 39ba5ddf76
147 changed files with 19843 additions and 3053 deletions
--- a/docs_v1.0/API_V1.0.0/INTEGRATION/VISION_AGENT_RUST_INTEGRATION.md
+++ b/docs_v1.0/API_V1.0.0/INTEGRATION/VISION_AGENT_RUST_INTEGRATION.md
@@ -0,0 +1,296 @@
+---
+document_type: "architecture_design"
+service: "MOMENTRY_CORE"
+title: "Vision Agent — Rust Integration Design"
+date: "2026-05-10"
+version: "V1.0"
+status: "active"
+owner: "M5"
+created_by: "OpenCode"
+current_state: "draft"
+tags:
+  - "vision-agent"
+  - "rust-integration"
+  - "python-executor"
+  - "grounding-dino"
+  - "architecture"
+ai_query_hints:
+  - "Vision Agent Rust integration architecture and PythonExecutor design"
+  - "Why Grounding DINO cannot be exported to ONNX, and the workaround"
+  - "How the Rust-side detect/search/multimodal handlers are implemented"
+  - "PythonExecutor persistent mode and model cache design"
+  - "Migration plan for moving Vision Agent from Flask 5052 to Rust 3003"
+related_documents:
+  - "../VISION_AGENT_API_V1.0.0.md"
+---
+
+# Vision Agent — Rust Integration Design
+
+**Goal:** Replace the standalone Python Flask service (port 5052) with a Rust-native agent under `3003/api/v1/agents/vision/*`, following the same pattern as the 5W1H, Identity, and Translate agents.
+
+---
+
+## Architecture
+
+```
+Client → 3003 (Rust Axum)
+           │
+           ├── /api/v1/agents/vision/detect      → PythonExecutor → vision_inference.py
+           ├── /api/v1/agents/vision/search       → PythonExecutor → vision_inference.py
+           ├── /api/v1/agents/vision/multimodal   → Rust DB query + PythonExecutor
+           └── /api/v1/agents/vision/models       → pure Rust (no Python needed)
+```
+
+### Why PythonExecutor?
+
+Grounding DINO uses `MultiScaleDeformableAttention`, a custom PyTorch op backed by a CUDA kernel, which has no equivalent in Rust, candle, or ort. The same custom op also blocks ONNX export, so Python is the only viable runtime.
+
+This matches the project's existing processor pattern:
+
+| Component | Rust side | Inference script |
+|-----------|------|-----------|
+| ASR | `PythonExecutor` | `asr_processor.py` |
+| ASRX | `PythonExecutor` | `asrx_processor_custom.py` |
+| YOLO | `PythonExecutor` | `yolo_processor.py` |
+| **Vision** | **`PythonExecutor`** | **`vision_inference.py`** |
+
+---
+
+## Config
+
+Follow the existing `MOMENTRY_*` env var pattern in `src/core/config.rs`:
+
+```rust
+// Existing pattern — env::var("MOMENTRY_*")
+pub fn vision_enabled() -> bool {
+    env::var("MOMENTRY_VISION_ENABLED")
+        .unwrap_or_else(|_| "true".to_string())
+        .parse()
+        .unwrap_or(true)
+}
+```
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `MOMENTRY_VISION_ENABLED` | `true` | Enable/disable all vision endpoints |
+| `MOMENTRY_VISION_MODEL` | `grounding-dino` | Default model: `grounding-dino` or `fusion` |
+| `MOMENTRY_VISION_GDINO_MODEL` | `IDEA-Research/grounding-dino-base` | HF model ID or local path |
+| `MOMENTRY_VISION_PALIGEMMA_ENABLED` | `false` | Enable PaliGemma (requires ~3GB download) |
+| `MOMENTRY_VISION_THRESHOLD` | `0.1` | Default confidence threshold |
+| `MOMENTRY_VISION_DEVICE` | `mps` on Apple Silicon, else `cpu` | Inference device |
+| `MOMENTRY_VISION_TIMEOUT` | `30000` | PythonExecutor timeout (ms) |
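The numeric variables in the table above could be read the same way as `vision_enabled()`. The following is a minimal sketch, not the project's actual config code: `parse_or`, `vision_threshold`, and `vision_timeout_ms` are hypothetical names, and only the variable names and defaults come from the table.

```rust
use std::env;
use std::str::FromStr;

/// Parse an optional raw value, falling back to `default` when the value is
/// missing or malformed. Kept separate from `env::var` so it is easy to test.
/// (Hypothetical helper, not part of the existing config module.)
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.trim().parse::<T>().ok())
        .unwrap_or(default)
}

/// Default confidence threshold (`MOMENTRY_VISION_THRESHOLD`, default 0.1).
pub fn vision_threshold() -> f64 {
    parse_or(env::var("MOMENTRY_VISION_THRESHOLD").ok(), 0.1)
}

/// PythonExecutor timeout in milliseconds (`MOMENTRY_VISION_TIMEOUT`, default 30000).
pub fn vision_timeout_ms() -> u64 {
    parse_or(env::var("MOMENTRY_VISION_TIMEOUT").ok(), 30_000)
}
```

Falling back to the default on a malformed value mirrors the `unwrap_or(true)` in `vision_enabled()`. A stricter variant could fail at startup instead.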
+
+---
+
+## Rust Route — `src/api/vision_agent_api.rs`
+
+### Route Registration
+
+```rust
+pub fn vision_agent_routes() -> Router<AppState> {
+    Router::new()
+        .route("/api/v1/agents/vision/detect", post(vision_detect))
+        .route("/api/v1/agents/vision/search", post(vision_search))
+        .route("/api/v1/agents/vision/multimodal", post(vision_multimodal))
+        .route("/api/v1/agents/vision/models", get(vision_models))
+}
+```
+
+Mount in `server.rs`:
+
+```rust
+if config::vision_enabled() {
+    app = app.merge(vision_agent_routes());
+}
+```
+
+### Detect Handler Flow
+
+```
+1. Receive JSON with {frame, query, model, threshold}
+2. Parse query → extract prompt (e.g., "find the gun" → "gun")
+3. Resolve frame → timestamp (for Python compatibility)
+4. Call PythonExecutor::run_script("vision_inference.py", args)
+5. Parse Python stdout → JSON response
+6. Return formatted result
+```
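Step 2 of the flow above does not say how the prompt is extracted. One naive possibility, shown only as an illustration, is to drop a leading command verb and any articles. The function name `extract_prompt` and both word lists are assumptions, not part of the design.

```rust
/// Naive prompt extraction: lowercase the query, drop one leading command
/// verb and any articles, and keep the rest. Purely illustrative; the word
/// lists below are assumptions, not taken from the design document.
pub fn extract_prompt(query: &str) -> String {
    const VERBS: [&str; 4] = ["find", "detect", "locate", "show"];
    const ARTICLES: [&str; 3] = ["the", "a", "an"];

    let lowered = query.to_lowercase();
    let mut words: Vec<&str> = lowered
        .split_whitespace()
        // Strip punctuation from both ends of each word ("car." -> "car").
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()))
        .filter(|w| !w.is_empty())
        .collect();

    // Drop one leading command verb, if present.
    if words.first().map_or(false, |w| VERBS.contains(w)) {
        words.remove(0);
    }
    // Drop articles anywhere in the remaining phrase.
    words.retain(|w| !ARTICLES.contains(w));
    words.join(" ")
}
```

With these rules, `"find the gun"` becomes `"gun"`, matching the example in step 2. A query made only of a verb and articles yields an empty string, which a handler would probably want to reject before calling Python.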
+
+### Frame/Time Resolution
+
+```rust
+fn resolve_frame(data: &Value, fps: f64) -> i64 {
+    // Priority: frame > time
+    if let Some(f) = data.get("frame").and_then(|v| v.as_i64()) {
+        return f;
+    }
+    if let Some(t) = data.get("time").and_then(|v| v.as_f64()) {
+        // Round rather than truncate so float error cannot land one frame early
+        return (t * fps).round() as i64;
+    }
+    0
+}
+```
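Step 3 of the detect flow needs the opposite direction as well: a timestamp derived from the frame index. A minimal sketch follows, assuming the timestamp is simply `frame / fps`. The name `frame_to_timestamp` is hypothetical, and the 25 fps used in the note below is inferred from the JSON example (136525 / 5461.0), not stated in the design.

```rust
/// Convert a frame index to a timestamp in seconds.
/// Returns `None` when `fps` is not a positive, finite number, so callers
/// never divide by zero or propagate NaN into the Python request.
pub fn frame_to_timestamp(frame: i64, fps: f64) -> Option<f64> {
    if !fps.is_finite() || fps <= 0.0 {
        return None;
    }
    Some(frame as f64 / fps)
}
```

For frame 136525 at 25 fps this gives 5461.0 seconds, which agrees with the `frame` and `timestamp` values in the protocol example.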
+
+### JSON Protocol (Rust ↔ Python)
+
+**Stdin (Rust → Python):**
+
+```json
+{
+  "action": "detect",
+  "frame": 136525,
+  "timestamp": 5461.0,
+  "prompt": "gun",
+  "model": "grounding-dino",
+  "threshold": 0.1,
+  "weights": {"grounding-dino": 0.6, "paligemma": 0.4},
+  "config": {
+    "gdino_model": "IDEA-Research/grounding-dino-base",
+    "paligemma_model": "google/paligemma-3b-mix-224",
+    "device": "mps"
+  }
+}
+```
+
+**Stdout (Python → Rust):**
+
+```json
+{
+  "success": true,
+  "frame": 136525,
+  "timestamp": 5461.0,
+  "detections": [
+    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"}
+  ],
+  "time_ms": 345.2
+}
+```
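On the Rust side, the `detections` array could map onto a small typed struct before the result is formatted. This is a sketch under assumptions: the `Detection` type and `rank_detections` helper are hypothetical, `bbox` is assumed to be `[x1, y1, x2, y2]` in pixels, and JSON parsing itself (for example with a serde derive) is left out to keep the block dependency-free.

```rust
/// One detection as it appears in the Python response.
/// `bbox` is assumed to be [x1, y1, x2, y2] in pixels.
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub bbox: [f64; 4],
    pub score: f64,
    pub label: String,
}

impl Detection {
    /// Box width in pixels (x2 - x1).
    pub fn width(&self) -> f64 {
        self.bbox[2] - self.bbox[0]
    }

    /// Box height in pixels (y2 - y1).
    pub fn height(&self) -> f64 {
        self.bbox[3] - self.bbox[1]
    }
}

/// Keep detections at or above `threshold`, highest score first.
pub fn rank_detections(mut dets: Vec<Detection>, threshold: f64) -> Vec<Detection> {
    dets.retain(|d| d.score >= threshold);
    // total_cmp gives a total order over f64, so NaN scores cannot panic the sort.
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    dets
}
```

Re-applying the threshold in Rust is optional, since the Python side already filters. It mainly guards against a script that ignores the `threshold` field.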
+
+---
+
+## Python Script — `scripts/vision_inference.py`
+
+### Design
+
+- **No Flask.** Pure stdin/stdout protocol.
+- **Model cache.** The `_model` global persists across calls only while the Python process stays alive (see Model Lifecycle).
+- **Single entry point.** Reads JSON from stdin, dispatches by `action` field.
+
+```python
+#!/opt/homebrew/bin/python3.11
+"""
+Vision inference — called by Rust PythonExecutor.
+Reads JSON from stdin, runs inference, writes JSON to stdout.
+"""
+import json, sys, os, torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
+
+_model = None
+_processor = None
+_device = None
+
+def load_model():
+    global _model, _processor, _device
+    if _model is not None:
+        return _model, _processor
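+    # Default per the config table: "mps" on Apple Silicon, else "cpu"
+    default_device = "mps" if torch.backends.mps.is_available() else "cpu"
+    _device = os.environ.get("MOMENTRY_VISION_DEVICE", default_device)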
+    model_name = os.environ.get("MOMENTRY_VISION_GDINO_MODEL",
+                                "IDEA-Research/grounding-dino-base")
+    _processor = AutoProcessor.from_pretrained(model_name)
+    _model = AutoModelForZeroShotObjectDetection.from_pretrained(model_name).to(_device)
+    return _model, _processor
+
+def detect_gdino(img, prompt, threshold):
+    model, processor = load_model()
+    inputs = processor(images=img, text=f"{prompt}.", return_tensors="pt").to(_device)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    dets = processor.post_process_grounded_object_detection(
+        outputs, threshold=threshold,
+        target_sizes=[img.size[::-1]])[0]
+    results = []
+    for i in range(len(dets["boxes"])):
+        results.append({
+            "bbox": [round(v, 1) for v in dets["boxes"][i].tolist()],
+            "score": round(dets["scores"][i].item(), 3),
+            "label": prompt,
+        })
+    return results
+
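+def handle(request):
+    action = request.get("action", "detect")
+    if action == "detect":
+        # Placeholder: load the requested frame as a PIL image, then call
+        # detect_gdino(img, request["prompt"], request.get("threshold", 0.1))
+        raise NotImplementedError("detect: frame loading not implemented")
+    if action == "search":
+        # Placeholder: iterate frames and collect detections
+        raise NotImplementedError("search not implemented")
+    if action == "models":
+        # Placeholder: return model info
+        raise NotImplementedError("models not implemented")
+    raise ValueError(f"unknown action: {action}")
+
+def main():
+    # One compact JSON request per line, one JSON response per line, so the
+    # same process (and its model cache) can serve several calls.
+    for line in sys.stdin:
+        if not line.strip():
+            continue
+        try:
+            result = handle(json.loads(line))
+        except Exception as exc:
+            result = {"success": False, "error": str(exc)}
+        json.dump(result, sys.stdout)
+        sys.stdout.write("\n")
+        sys.stdout.flush()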
+
+if __name__ == "__main__":
+    main()
+```
+
+---
+
+## Model Lifecycle
+
+### Issue
+
+GDINO takes ~4s to load (download + device init + weight load). PythonExecutor starts a new process per call, so every request would pay that ~4s again.
+
+### Solution: Warm Process
+
+Use `PythonExecutor` in persistent/session mode where the Python process stays alive between calls. The `_model` global cache keeps the model in memory.
+
+Check `src/core/processor/executor.rs` to see whether persistent mode is supported. If it is, usage could look like this (the builder methods shown are illustrative, not confirmed API):
+
+```rust
+// Keep Python process alive for multiple calls
+let executor = PythonExecutor::new("vision_inference.py")
+    .persistent(true)  // reuse same process
+    .timeout_ms(30000);
+```
+
+If `PythonExecutor` doesn't support persistent mode, implement a simple sidecar:
+
+```rust
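+use std::io::{BufRead, BufReader, Write};
+use std::process::{Command, Stdio};
+
+// Launch the Python process once, on agent init
+let mut child = Command::new(python_path)
+    .arg(script_path)
+    .stdin(Stdio::piped())
+    .stdout(Stdio::piped())
+    .spawn()?;
+let mut stdin = child.stdin.take().expect("stdin is piped");
+let mut stdout = BufReader::new(child.stdout.take().expect("stdout is piped"));
+
+// Per call: write one JSON line, read one JSON line back. Reading to EOF
+// would block until the child exits, so this assumes the script answers
+// each request line with exactly one response line.
+writeln!(stdin, "{}", json_request)?;
+stdin.flush()?;
+let mut response = String::new();
+stdout.read_line(&mut response)?;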
+```
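The sidecar idea can be wrapped in a small type so handlers only see a `call` method. The sketch below is one possible shape, not the project's implementation: the `Sidecar` type is hypothetical, and it assumes newline-delimited JSON (one compact request line in, one response line out), which the design does not specify.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// A long-lived child process that speaks newline-delimited JSON.
/// (Hypothetical wrapper; the framing is an assumption.)
pub struct Sidecar {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl Sidecar {
    /// Spawn `program` with `args`, piping its stdin and stdout.
    pub fn spawn(program: &str, args: &[&str]) -> io::Result<Self> {
        let mut child = Command::new(program)
            .args(args)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let missing = |what: &str| io::Error::new(io::ErrorKind::Other, format!("{what} not piped"));
        let stdin = child.stdin.take().ok_or_else(|| missing("stdin"))?;
        let stdout = child.stdout.take().ok_or_else(|| missing("stdout"))?;
        Ok(Self { child, stdin, stdout: BufReader::new(stdout) })
    }

    /// Send one request line and block until one response line arrives.
    pub fn call(&mut self, request: &str) -> io::Result<String> {
        if request.contains('\n') {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "request must be a single line",
            ));
        }
        self.stdin.write_all(request.as_bytes())?;
        self.stdin.write_all(b"\n")?;
        self.stdin.flush()?;

        let mut line = String::new();
        if self.stdout.read_line(&mut line)? == 0 {
            // Zero bytes read means the child closed stdout (likely exited).
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "sidecar closed stdout",
            ));
        }
        Ok(line.trim_end().to_string())
    }
}

impl Drop for Sidecar {
    fn drop(&mut self) {
        // Best effort: stop the child and reap it so no zombie is left behind.
        let _ = self.child.kill();
        let _ = self.child.wait();
    }
}
```

A handler would create it once with something like `Sidecar::spawn(python_path, &[script_path])` and reuse it for every request. Two things are deliberately left out: `call` blocks without enforcing `MOMENTRY_VISION_TIMEOUT`, and concurrent handlers would need to serialize access, for example behind a mutex in the shared state.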
+
+---
+
+## Files to Create/Modify
+
+| File | Action | Description |
+|------|--------|-------------|
+| `src/api/vision_agent_api.rs` | **Create** | Rust route handlers |
+| `src/core/config.rs` | **Modify** | Add `MOMENTRY_VISION_*` env vars |
+| `src/api/server.rs` | **Modify** | Merge `vision_agent_routes()` |
+| `scripts/vision_inference.py` | **Create** | Python inference script (stdin/stdout) |
+| `API_V1.0.0/VISION_AGENT_API_V1.0.0.md` | Created | API docs |
+
+## Migration Plan
+
+| Phase | Steps | Status |
+|-------|-------|--------|
+| **1** | Create `vision_inference.py` (stdin/stdout, model cache) | ⏳ |
+| **2** | Create `vision_agent_api.rs` (detect + search + multimodal handlers) | ⏳ |
+| **3** | Add config + mount routes to 3003 | ⏳ |
+| **4** | Test detect/search via 3003 (no 5052) | ⏳ |
+| **5** | Deprecate 5052 Flask service | ⏳ |