feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks -> dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
@@ -0,0 +1,296 @@
---
document_type: "architecture_design"
service: "MOMENTRY_CORE"
title: "Vision Agent — Rust Integration Design"
date: "2026-05-10"
version: "V1.0"
status: "active"
owner: "M5"
created_by: "OpenCode"
current_state: "draft"
tags:
  - "vision-agent"
  - "rust-integration"
  - "python-executor"
  - "grounding-dino"
  - "architecture"
ai_query_hints:
  - "Vision Agent Rust integration architecture and PythonExecutor design"
  - "Why Grounding DINO cannot be exported to ONNX, and the workaround"
  - "How the detect/search/multimodal handlers are implemented on the Rust side"
  - "PythonExecutor persistent mode and model cache design"
  - "Migration plan for moving Vision Agent from Flask 5052 to Rust 3003"
related_documents:
  - "../VISION_AGENT_API_V1.0.0.md"
---

# Vision Agent — Rust Integration Design

**Goal:** Replace the standalone Python Flask service (port 5052) with a Rust-native agent served on port 3003 under `/api/v1/agents/vision/*`, following the same pattern as the 5W1H, Identity, and Translate agents.

---

## Architecture

```
Client → 3003 (Rust Axum)
  │
  ├── /api/v1/agents/vision/detect      → PythonExecutor → vision_inference.py
  ├── /api/v1/agents/vision/search      → PythonExecutor → vision_inference.py
  ├── /api/v1/agents/vision/multimodal  → Rust DB query + PythonExecutor
  └── /api/v1/agents/vision/models      → pure Rust (no Python needed)
```

### Why PythonExecutor?

Grounding DINO uses `MultiScaleDeformableAttention`, a custom PyTorch CUDA kernel with no Rust/candle/ort equivalent. The same custom op also blocks ONNX export, which leaves Python as the only viable runtime.

This matches the project's existing processor pattern:

| Component | Rust side | Python inference script |
|-----------|-----------|-------------------------|
| ASR | `PythonExecutor` | `asr_processor.py` |
| ASRX | `PythonExecutor` | `asrx_processor_custom.py` |
| YOLO | `PythonExecutor` | `yolo_processor.py` |
| **Vision** | **`PythonExecutor`** | **`vision_inference.py`** |

---

## Config

Add the following to `src/core/config.rs`, following the existing `MOMENTRY_*` environment-variable pattern:

```rust
use std::env;

// Existing pattern: env::var("MOMENTRY_*")
pub fn vision_enabled() -> bool {
    env::var("MOMENTRY_VISION_ENABLED")
        .unwrap_or_else(|_| "true".to_string())
        .parse()
        // Unset or unparsable values fall back to enabled.
        .unwrap_or(true)
}
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MOMENTRY_VISION_ENABLED` | `true` | Enable/disable all vision endpoints |
| `MOMENTRY_VISION_MODEL` | `grounding-dino` | Default model: `grounding-dino` or `fusion` |
| `MOMENTRY_VISION_GDINO_MODEL` | `IDEA-Research/grounding-dino-base` | HF model ID or local path |
| `MOMENTRY_VISION_PALIGEMMA_ENABLED` | `false` | Enable PaliGemma (requires ~3GB download) |
| `MOMENTRY_VISION_THRESHOLD` | `0.1` | Default confidence threshold |
| `MOMENTRY_VISION_DEVICE` | `mps` on Apple Silicon, else `cpu` | Inference device |
| `MOMENTRY_VISION_TIMEOUT` | `30000` | PythonExecutor timeout (ms) |

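The Python side reads some of the same variables, so a small loader that mirrors the table can keep both sides in agreement. The following is a minimal sketch, not the project's implementation: `load_vision_config` and its dictionary keys are hypothetical names, and the set of strings accepted as "true" is an assumption (the Rust `parse()` shown above accepts only `true`/`false`).

```python
import os
import platform


def load_vision_config(env=None):
    """Read MOMENTRY_VISION_* settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # Documented default device: mps on Apple Silicon, cpu elsewhere.
    apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"

    def flag(name, default):
        # Assumption: accept a few common spellings of "true".
        return env.get(name, default).strip().lower() in ("1", "true", "yes")

    return {
        "enabled": flag("MOMENTRY_VISION_ENABLED", "true"),
        "model": env.get("MOMENTRY_VISION_MODEL", "grounding-dino"),
        "gdino_model": env.get("MOMENTRY_VISION_GDINO_MODEL",
                               "IDEA-Research/grounding-dino-base"),
        "paligemma_enabled": flag("MOMENTRY_VISION_PALIGEMMA_ENABLED", "false"),
        "threshold": float(env.get("MOMENTRY_VISION_THRESHOLD", "0.1")),
        "device": env.get("MOMENTRY_VISION_DEVICE", "mps" if apple_silicon else "cpu"),
        "timeout_ms": int(env.get("MOMENTRY_VISION_TIMEOUT", "30000")),
    }
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation, for example `load_vision_config({})["timeout_ms"]`.
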

---

## Rust Route — `src/api/vision_agent_api.rs`

### Route Registration

```rust
use axum::{
    routing::{get, post},
    Router,
};

// The handlers (vision_detect, vision_search, ...) and AppState are defined elsewhere in the crate.
pub fn vision_agent_routes() -> Router<AppState> {
    Router::new()
        .route("/api/v1/agents/vision/detect", post(vision_detect))
        .route("/api/v1/agents/vision/search", post(vision_search))
        .route("/api/v1/agents/vision/multimodal", post(vision_multimodal))
        .route("/api/v1/agents/vision/models", get(vision_models))
}
```

Mount in `server.rs`:

```rust
// `app` must be a mutable Router<AppState> at this point.
if config::vision_enabled() {
    app = app.merge(vision_agent_routes());
}
```

### Detect Handler Flow

```
1. Receive JSON with {frame, query, model, threshold}
2. Parse query → extract prompt (e.g., "find the gun" → "gun")
3. Resolve frame → timestamp (for Python compatibility)
4. Call PythonExecutor::run_script("vision_inference.py", args)
5. Parse Python stdout → JSON response
6. Return formatted result
```

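Step 2 does not say how the prompt is extracted, so the following is only one plausible reading of the `"find the gun"` → `"gun"` example: strip a leading command verb and any articles. The function name and both word lists are hypothetical, and the real rule would live in the Rust handler; Python is used here purely to illustrate the transformation.

```python
import re

# Hypothetical filler words; the actual parsing rules are not specified in this design.
_LEADING_VERBS = ("find", "detect", "locate", "show me", "search for", "look for")
_ARTICLES = ("the", "a", "an", "any", "all")


def extract_prompt(query):
    """Reduce a natural-language query to the bare object phrase."""
    # Lowercase and drop punctuation (hyphens are kept for compound nouns).
    text = re.sub(r"[^\w\s-]", "", query.lower()).strip()
    # Try longer verbs first so "search for" wins over a shorter prefix.
    for verb in sorted(_LEADING_VERBS, key=len, reverse=True):
        if text.startswith(verb + " "):
            text = text[len(verb):].strip()
            break
    words = text.split()
    while words and words[0] in _ARTICLES:
        words.pop(0)
    return " ".join(words)
```

A query that is already a bare noun phrase passes through unchanged, so clients can send either form.
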
### Frame/Time Resolution

```rust
use serde_json::Value;

/// Resolve the target frame number from a request body.
/// Priority: explicit `frame` > `time` (seconds) converted via `fps`.
fn resolve_frame(data: &Value, fps: f64) -> i64 {
    if let Some(f) = data.get("frame").and_then(|v| v.as_i64()) {
        return f;
    }
    if let Some(t) = data.get("time").and_then(|v| v.as_f64()) {
        // Round instead of truncating so float error (e.g. 0.29 * 100.0) cannot drop a frame.
        return (t * fps).round() as i64;
    }
    // Neither field present: fall back to the first frame.
    0
}
```

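As a worked example of the conversion in both directions, the sketch below uses the values from the protocol example in this document (frame 136525, timestamp 5461.0), which imply 25 fps. That frame rate is inferred from those two numbers, not stated anywhere, and the helper names are hypothetical.

```python
def frame_to_timestamp(frame, fps):
    """Convert a frame index to seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame / fps


def timestamp_to_frame(seconds, fps):
    """Convert seconds to the nearest frame index."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Rounding guards against float error such as 0.29 * 100.0 == 28.999999999999996.
    return round(seconds * fps)


print(frame_to_timestamp(136525, 25.0))  # 5461.0
print(timestamp_to_frame(5461.0, 25.0))  # 136525
```
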
### JSON Protocol (Rust ↔ Python)

**Stdin (Rust → Python):**

```json
{
  "action": "detect",
  "frame": 136525,
  "timestamp": 5461.0,
  "prompt": "gun",
  "model": "grounding-dino",
  "threshold": 0.1,
  "weights": {"grounding-dino": 0.6, "paligemma": 0.4},
  "config": {
    "gdino_model": "IDEA-Research/grounding-dino-base",
    "paligemma_model": "google/paligemma-3b-mix-224",
    "device": "mps"
  }
}
```

**Stdout (Python → Rust):**

```json
{
  "success": true,
  "frame": 136525,
  "timestamp": 5461.0,
  "detections": [
    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"}
  ],
  "time_ms": 345.2
}
```

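On the Python side, the two messages above could be handled by a pair of small helpers. This is a sketch under assumptions: the set of required fields, the defaults applied to missing optional fields, and the `{"success": false, "error": ...}` failure shape are not defined in this document, and all three function names are hypothetical.

```python
import json
import time

# Assumption: these three fields are treated as mandatory for a detect request.
REQUIRED_FIELDS = ("action", "frame", "prompt")


def parse_request(raw):
    """Parse one stdin message and fill in the documented defaults."""
    req = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in req]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    req.setdefault("model", "grounding-dino")
    req.setdefault("threshold", 0.1)
    return req


def build_response(req, detections, started):
    """Build the stdout message; `started` is a time.perf_counter() reading."""
    return {
        "success": True,
        "frame": req["frame"],
        "timestamp": req.get("timestamp"),
        "detections": detections,
        "time_ms": round((time.perf_counter() - started) * 1000, 1),
    }


def build_error(message):
    """Hypothetical failure shape; the document only shows the success case."""
    return {"success": False, "error": message}
```

Echoing `frame` and `timestamp` back lets the Rust handler match a response to its request without keeping extra state.
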

---

## Python Script — `scripts/vision_inference.py`

### Design

- **No Flask.** Pure stdin/stdout protocol.
- **Model cache.** The `_model` global persists across calls only while the same Python process stays alive (see Model Lifecycle).
- **Single entry point.** Reads JSON from stdin, dispatches by `action` field.

```python
#!/opt/homebrew/bin/python3.11
"""
Vision inference script, called by the Rust PythonExecutor.
Reads JSON from stdin, runs inference, writes JSON to stdout.
"""
import json
import os
import sys

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Module-level cache: survives only as long as this Python process does.
_model = None
_processor = None
_device = None


def load_model():
    """Load Grounding DINO once and reuse it on later calls in the same process."""
    global _model, _processor, _device
    if _model is not None:
        return _model, _processor
    # Documented default: mps on Apple Silicon, otherwise cpu.
    default_device = "mps" if torch.backends.mps.is_available() else "cpu"
    _device = os.environ.get("MOMENTRY_VISION_DEVICE", default_device)
    model_name = os.environ.get("MOMENTRY_VISION_GDINO_MODEL",
                                "IDEA-Research/grounding-dino-base")
    _processor = AutoProcessor.from_pretrained(model_name)
    _model = AutoModelForZeroShotObjectDetection.from_pretrained(model_name).to(_device)
    return _model, _processor


def detect_gdino(img, prompt, threshold):
    """Run zero-shot detection for `prompt` on a PIL image."""
    model, processor = load_model()
    # Grounding DINO expects the text query to end with a period.
    inputs = processor(images=img, text=f"{prompt}.", return_tensors="pt").to(_device)
    with torch.no_grad():
        outputs = model(**inputs)
    dets = processor.post_process_grounded_object_detection(
        outputs, threshold=threshold,
        target_sizes=[img.size[::-1]])[0]  # PIL size is (w, h); the model wants (h, w)
    results = []
    for box, score in zip(dets["boxes"], dets["scores"]):
        results.append({
            "bbox": [round(v, 1) for v in box.tolist()],
            "score": round(score.item(), 3),
            "label": prompt,
        })
    return results


def main():
    input_data = json.load(sys.stdin)
    action = input_data.get("action", "detect")

    # Placeholder so the script runs end to end; each branch below replaces it.
    result = {"success": False, "error": f"action '{action}' not implemented"}
    if action == "detect":
        pass  # ... run inference
    elif action == "search":
        pass  # ... iterate frames
    elif action == "models":
        pass  # ... return model info

    json.dump(result, sys.stdout)
    sys.stdout.flush()


if __name__ == "__main__":
    main()
```

---

## Model Lifecycle

### Issue

GDINO takes ~4s to load (download on first run, device init, weight load). PythonExecutor starts a new process per call, so every request would pay that ~4s again on top of inference time.

### Solution: Warm Process

Use `PythonExecutor` in persistent/session mode so the Python process stays alive between calls. The `_model` global cache then keeps the model in memory.

First check in `src/core/processor/executor.rs` whether persistent mode is supported. If it is, usage could look like this (the `persistent` and `timeout_ms` builder methods are illustrative, not confirmed API):

```rust
// Keep Python process alive for multiple calls
let executor = PythonExecutor::new("vision_inference.py")
    .persistent(true) // reuse same process
    .timeout_ms(30000);
```

If `PythonExecutor` doesn't support persistent mode, implement a simple sidecar:

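```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

// Launch the Python process once, on agent init
let mut child = Command::new(python_path)
    .arg(script_path)
    .stdin(Stdio::piped())
    .stdout(Stdio::piped())
    .spawn()?;
let mut stdin = child.stdin.take().expect("stdin was piped");
let mut stdout = BufReader::new(child.stdout.take().expect("stdout was piped"));

// Per call: write one JSON line, read one JSON line back (assumes a line-delimited
// protocol). read_to_string would block until the process exits.
writeln!(stdin, "{}", json_request)?;
stdin.flush()?;
let mut response = String::new();
stdout.read_line(&mut response)?;
```
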
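A sidecar only works if the script keeps reading after the first request, which `json.load(sys.stdin)` does not do because it waits for end of input. One way to adapt the script, sketched here under the assumption of a newline-delimited JSON protocol (one request per line, one response per line), is a read loop. `handle`, `serve`, and the `ping` action are hypothetical names used for illustration.

```python
import json
import sys


def handle(request):
    """Hypothetical dispatcher; a real one would call detect_gdino and friends."""
    action = request.get("action", "detect")
    if action == "ping":
        return {"success": True, "action": "ping"}
    return {"success": False, "error": f"action '{action}' not implemented"}


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer one JSON line per request line until stdin closes."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank keep-alive lines
        try:
            response = handle(json.loads(line))
        except Exception as exc:
            # Report the failure instead of crashing, so the loaded model stays warm.
            response = {"success": False, "error": str(exc)}
        stdout.write(json.dumps(response) + "\n")
        # Flush per response; otherwise the reader on the other end may block.
        stdout.flush()
```

Calling `serve()` at the bottom of the script in place of `main()` would keep the process, and therefore the cached model, alive until the Rust side closes the pipe.
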

---

## Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| `src/api/vision_agent_api.rs` | **Create** | Rust route handlers |
| `src/core/config.rs` | **Modify** | Add `MOMENTRY_VISION_*` env vars |
| `src/api/server.rs` | **Modify** | Merge `vision_agent_routes()` |
| `scripts/vision_inference.py` | **Create** | Python inference script (stdin/stdout) |
| `API_V1.0.0/VISION_AGENT_API_V1.0.0.md` | Created | API docs |

## Migration Plan

| Phase | Steps | Status |
|-------|-------|--------|
| **1** | Create `vision_inference.py` (stdin/stdout, model cache) | ⏳ |
| **2** | Create `vision_agent_api.rs` (detect + search + multimodal handlers) | ⏳ |
| **3** | Add config + mount routes to 3003 | ⏳ |
| **4** | Test detect/search via 3003 (no 5052) | ⏳ |
| **5** | Deprecate 5052 Flask service | ⏳ |