feat: Phase 1 handover - schema migration, correction mechanism, API fixes

Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
Accusys
2026-05-11 07:03:22 +08:00
parent ef894a44ad
commit 39ba5ddf76
147 changed files with 19843 additions and 3053 deletions


@@ -0,0 +1,296 @@
---
document_type: "architecture_design"
service: "MOMENTRY_CORE"
title: "Vision Agent — Rust Integration Design"
date: "2026-05-10"
version: "V1.0"
status: "active"
owner: "M5"
created_by: "OpenCode"
current_state: "draft"
tags:
- "vision-agent"
- "rust-integration"
- "python-executor"
- "grounding-dino"
- "architecture"
ai_query_hints:
- "Vision Agent Rust integration architecture and PythonExecutor design"
- "Why Grounding DINO cannot be exported to ONNX, and the workaround"
- "How the Rust-side detect/search/multimodal handlers are implemented"
- "PythonExecutor persistent mode and model cache design"
- "Migration plan for moving Vision Agent from Flask 5052 to Rust 3003"
related_documents:
- "../VISION_AGENT_API_V1.0.0.md"
---
# Vision Agent — Rust Integration Design
**Goal:** Replace the standalone Python Flask service (port 5052) with a Rust-native agent served on port 3003 under `/api/v1/agents/vision/*`, following the same pattern as the 5W1H, Identity, and Translate agents.
---
## Architecture
```
Client → 3003 (Rust Axum)
├── /api/v1/agents/vision/detect → PythonExecutor → vision_inference.py
├── /api/v1/agents/vision/search → PythonExecutor → vision_inference.py
├── /api/v1/agents/vision/multimodal → Rust DB query + PythonExecutor
└── /api/v1/agents/vision/models → pure Rust (no Python needed)
```
### Why PythonExecutor?
Grounding DINO relies on `MultiScaleDeformableAttention`, a custom PyTorch op (shipped as a custom CUDA kernel) that has no equivalent in Rust, candle, or ort. The same custom op blocks ONNX export, which leaves Python as the only viable runtime.
This matches the project's existing processor pattern:
| Component | Rust side | Inference script |
|-----------|------|-----------|
| ASR | `PythonExecutor` | `asr_processor.py` |
| ASRX | `PythonExecutor` | `asrx_processor_custom.py` |
| YOLO | `PythonExecutor` | `yolo_processor.py` |
| **Vision** | **`PythonExecutor`** | **`vision_inference.py`** |
---
## Config
Add the vision settings to `src/core/config.rs`, following the existing `MOMENTRY_*` env var pattern:
```rust
use std::env;

// Existing pattern: env::var("MOMENTRY_*")
pub fn vision_enabled() -> bool {
    env::var("MOMENTRY_VISION_ENABLED")
        .unwrap_or_else(|_| "true".to_string())
        .parse()
        .unwrap_or(true) // unparsable values fall back to enabled
}
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MOMENTRY_VISION_ENABLED` | `true` | Enable/disable all vision endpoints |
| `MOMENTRY_VISION_MODEL` | `grounding-dino` | Default model: `grounding-dino` or `fusion` |
| `MOMENTRY_VISION_GDINO_MODEL` | `IDEA-Research/grounding-dino-base` | HF model ID or local path |
| `MOMENTRY_VISION_PALIGEMMA_ENABLED` | `false` | Enable PaliGemma (requires ~3GB download) |
| `MOMENTRY_VISION_THRESHOLD` | `0.1` | Default confidence threshold |
| `MOMENTRY_VISION_DEVICE` | `mps` on Apple Silicon, else `cpu` | Inference device |
| `MOMENTRY_VISION_TIMEOUT` | `30000` | PythonExecutor timeout (ms) |
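Since `vision_inference.py` also reads some of these variables, a minimal sketch of a Python-side loader may help keep both sides aligned. `load_vision_config` and `parse_bool` are hypothetical helpers, not part of the existing codebase. The defaults are copied from the table above, and the boolean parsing deliberately accepts only `true`/`false`, the same strings Rust's `str::parse::<bool>()` accepts. The device fallback is passed in as a parameter so the sketch does not depend on torch.

```python
import os


def parse_bool(raw, default):
    """Accept only "true"/"false", like Rust's bool parsing; else use the default."""
    if raw == "true":
        return True
    if raw == "false":
        return False
    return default


def load_vision_config(env=None, fallback_device="cpu"):
    """Collect the MOMENTRY_VISION_* settings with the documented defaults."""
    env = os.environ if env is None else env
    return {
        "enabled": parse_bool(env.get("MOMENTRY_VISION_ENABLED"), True),
        "model": env.get("MOMENTRY_VISION_MODEL", "grounding-dino"),
        "gdino_model": env.get("MOMENTRY_VISION_GDINO_MODEL",
                               "IDEA-Research/grounding-dino-base"),
        "paligemma_enabled": parse_bool(
            env.get("MOMENTRY_VISION_PALIGEMMA_ENABLED"), False),
        "threshold": float(env.get("MOMENTRY_VISION_THRESHOLD", "0.1")),
        # The caller decides the fallback (e.g. "mps" on Apple Silicon).
        "device": env.get("MOMENTRY_VISION_DEVICE", fallback_device),
        "timeout_ms": int(env.get("MOMENTRY_VISION_TIMEOUT", "30000")),
    }
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real process environment.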
---
## Rust Route — `src/api/vision_agent_api.rs`
### Route Registration
```rust
use axum::{routing::{get, post}, Router};

pub fn vision_agent_routes() -> Router<AppState> {
    Router::new()
        .route("/api/v1/agents/vision/detect", post(vision_detect))
        .route("/api/v1/agents/vision/search", post(vision_search))
        .route("/api/v1/agents/vision/multimodal", post(vision_multimodal))
        .route("/api/v1/agents/vision/models", get(vision_models))
}
```
Mount in `server.rs`:
```rust
if config::vision_enabled() {
    app = app.merge(vision_agent_routes());
}
```
### Detect Handler Flow
```
1. Receive JSON with {frame (or time), query, model, threshold}
2. Parse query → extract prompt (e.g., "find the gun" → "gun")
3. Resolve frame → timestamp (for Python compatibility)
4. Call PythonExecutor::run_script("vision_inference.py", args)
5. Parse Python stdout → JSON response
6. Return formatted result
```
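Step 2 of the flow (query to prompt) could be sketched as follows. This is only an illustration of the idea, written in Python for brevity even though the handler itself lives in Rust. The function name `extract_prompt` and the list of leading command phrases and articles are assumptions; the design does not specify the actual parsing rules.

```python
import re

# Hypothetical command phrases that may precede the object of interest.
_LEADING = re.compile(
    r"^(?:please\s+)?(?:find|show|detect|locate|search for|look for)\s+(?:me\s+)?",
    re.IGNORECASE,
)
# Articles and quantifiers dropped from the front of the remaining phrase.
_ARTICLE = re.compile(r"^(?:the|a|an|any|all)\s+", re.IGNORECASE)


def extract_prompt(query):
    """Reduce a natural-language query to a short detection prompt."""
    text = query.strip().rstrip(".?!").strip()
    text = _LEADING.sub("", text, count=1)
    text = _ARTICLE.sub("", text, count=1)
    return text.lower()


print(extract_prompt("find the gun"))  # gun
```

A query that does not match any known phrase is passed through unchanged (lowercased), so a bare prompt such as `gun` still works.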
### Frame/Time Resolution
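```rust
use serde_json::Value;

/// Resolve the target frame index from the request body.
fn resolve_frame(data: &Value, fps: f64) -> i64 {
    // Priority: frame > time
    if let Some(f) = data.get("frame").and_then(|v| v.as_i64()) {
        return f;
    }
    if let Some(t) = data.get("time").and_then(|v| v.as_f64()) {
        // Round to the nearest frame. A bare `as i64` cast truncates toward
        // zero, which can land one frame early when `t * fps` is just below
        // an integer because of floating-point error.
        return (t * fps).round() as i64;
    }
    0 // neither field present: fall back to the first frame
}
```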
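The frame and timestamp are related by the video's frame rate. As a worked check, the example payload in the JSON Protocol section pairs frame 136525 with timestamp 5461.0, which is consistent with a 25 fps video (the frame rate is an inference from those numbers, not stated in the design). A minimal sketch of both directions, with hypothetical function names and round-to-nearest as an assumed policy:

```python
import math


def frame_to_timestamp(frame, fps):
    """Convert a frame index to seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame / fps


def time_to_frame(seconds, fps):
    """Convert seconds to the nearest frame index (halves round up)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return math.floor(seconds * fps + 0.5)


print(frame_to_timestamp(136525, 25.0))  # 5461.0
print(time_to_frame(5461.0, 25.0))       # 136525
```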
### JSON Protocol (Rust ↔ Python)
**Stdin (Rust → Python):**
```json
{
"action": "detect",
"frame": 136525,
"timestamp": 5461.0,
"prompt": "gun",
"model": "grounding-dino",
"threshold": 0.1,
"weights": {"grounding-dino": 0.6, "paligemma": 0.4},
"config": {
"gdino_model": "IDEA-Research/grounding-dino-base",
"paligemma_model": "google/paligemma-3b-mix-224",
"device": "mps"
}
}
```
**Stdout (Python → Rust):**
```json
{
"success": true,
"frame": 136525,
"timestamp": 5461.0,
"detections": [
{"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"}
],
"time_ms": 345.2
}
```
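The Python side of this protocol could validate requests and assemble responses roughly as below. This is a sketch under stated assumptions: the set of required fields per action and the `{"success": false, "error": ...}` failure shape are hypothetical, since the design only shows a successful detect exchange. The defaults for `model` and `threshold` come from the environment variable table.

```python
import json

# Assumption: fields each action requires; only detect is shown in the design.
REQUIRED_FIELDS = {"detect": ("frame", "prompt")}


def error_response(message):
    # Hypothetical failure shape; the design only specifies the success case.
    return {"success": False, "error": message}


def parse_request(raw):
    """Return (request, None) on success or (None, error_response) on failure."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, error_response(f"invalid JSON: {exc.msg}")
    if not isinstance(request, dict):
        return None, error_response("request must be a JSON object")
    action = request.setdefault("action", "detect")
    missing = [f for f in REQUIRED_FIELDS.get(action, ()) if f not in request]
    if missing:
        return None, error_response(f"missing fields: {', '.join(missing)}")
    request.setdefault("model", "grounding-dino")
    request.setdefault("threshold", 0.1)
    return request, None


def build_detect_response(request, detections, time_ms):
    """Echo frame/timestamp back so Rust can correlate the result."""
    return {
        "success": True,
        "frame": request["frame"],
        "timestamp": request.get("timestamp"),
        "detections": detections,
        "time_ms": round(time_ms, 1),
    }
```

Returning an error object instead of raising keeps stdout a valid JSON document in every case, which simplifies step 5 of the detect handler flow on the Rust side.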
---
## Python Script — `scripts/vision_inference.py`
### Design
- **No Flask.** Pure stdin/stdout protocol.
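- **Model cache.** The `_model` global lives for the lifetime of the Python process, so it only helps across PythonExecutor calls when the process is kept alive (see Model Lifecycle).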
- **Single entry point.** Reads JSON from stdin, dispatches by `action` field.
```python
#!/opt/homebrew/bin/python3.11
"""
Vision inference, called by Rust PythonExecutor.
Reads JSON from stdin, runs inference, writes JSON to stdout.
"""
import json
import os
import sys

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Process-level cache: filled on first use, reused while the process lives.
_model = None
_processor = None
_device = None


def default_device():
    # Documented default: `mps` on Apple Silicon, else `cpu`.
    return "mps" if torch.backends.mps.is_available() else "cpu"


def load_model():
    global _model, _processor, _device
    if _model is not None:
        return _model, _processor
    _device = os.environ.get("MOMENTRY_VISION_DEVICE") or default_device()
    model_name = os.environ.get("MOMENTRY_VISION_GDINO_MODEL",
                                "IDEA-Research/grounding-dino-base")
    _processor = AutoProcessor.from_pretrained(model_name)
    _model = AutoModelForZeroShotObjectDetection.from_pretrained(model_name).to(_device)
    return _model, _processor


def detect_gdino(img, prompt, threshold):
    model, processor = load_model()
    # The prompt is passed with a trailing period.
    inputs = processor(images=img, text=f"{prompt}.", return_tensors="pt").to(_device)
    with torch.no_grad():
        outputs = model(**inputs)
    # PIL reports size as (width, height); target_sizes expects (height, width).
    dets = processor.post_process_grounded_object_detection(
        outputs, threshold=threshold,
        target_sizes=[img.size[::-1]])[0]
    results = []
    for box, score in zip(dets["boxes"], dets["scores"]):
        results.append({
            "bbox": [round(v, 1) for v in box.tolist()],
            "score": round(score.item(), 3),
            "label": prompt,
        })
    return results


def not_implemented(action):
    # Placeholder result; the error shape is an assumption, not part of the spec.
    return {"success": False, "error": f"action '{action}' is not implemented yet"}


def main():
    input_data = json.load(sys.stdin)
    action = input_data.get("action", "detect")
    if action == "detect":
        # ... run inference
        result = not_implemented(action)
    elif action == "search":
        # ... iterate frames
        result = not_implemented(action)
    elif action == "models":
        # ... return model info
        result = not_implemented(action)
    else:
        result = {"success": False, "error": f"unknown action '{action}'"}
    json.dump(result, sys.stdout)
    sys.stdout.flush()


if __name__ == "__main__":
    main()
```
---
## Model Lifecycle
### Issue
GDINO loads in ~4s (download + CUDA init + weight load). PythonExecutor starts a new process per call — this would add 4s latency to every request.
### Solution: Warm Process
Use `PythonExecutor` in persistent/session mode where the Python process stays alive between calls. The `_model` global cache keeps the model in memory.
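First check whether `PythonExecutor` in `src/core/processor/executor.rs` already supports a persistent mode. If it does, usage might look like this (the `persistent` and `timeout_ms` builder methods are assumed here, not confirmed):
```rust
// Keep the Python process alive for multiple calls
let executor = PythonExecutor::new("vision_inference.py")
    .persistent(true) // reuse the same process (hypothetical API)
    .timeout_ms(30000);
```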
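If `PythonExecutor` doesn't support persistent mode, implement a simple sidecar that exchanges one JSON document per line. This assumes `vision_inference.py` reads requests in a loop rather than with a single `json.load(sys.stdin)`, which would block until stdin is closed:
```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

// Launch the Python process once, on agent init
let mut child = Command::new(python_path)
    .arg(script_path)
    .stdin(Stdio::piped())
    .stdout(Stdio::piped())
    .spawn()?;
let mut stdin = child.stdin.take().expect("stdin is piped");
let mut stdout = BufReader::new(child.stdout.take().expect("stdout is piped"));

// Per call: write one single-line JSON request, read one response line.
// read_to_string would block until the child exits, so frame by newline.
writeln!(stdin, "{json_request}")?;
stdin.flush()?;
let mut response = String::new();
stdout.read_line(&mut response)?;
```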
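For a warm process to work, the Python script has to answer many requests over one stdin/stdout pair. A minimal sketch of such a loop is shown below, assuming newline-delimited JSON (one request per line, one response per line). The `serve` function, the `handler` callback, and the error shape are hypothetical names introduced for illustration only.

```python
import json
import sys


def serve(handler, stdin=sys.stdin, stdout=sys.stdout):
    """Answer newline-delimited JSON requests until stdin is closed."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between requests
        try:
            response = handler(json.loads(line))
        except Exception as exc:
            # Hypothetical error shape; keeps the process alive after a bad request.
            response = {"success": False, "error": str(exc)}
        # json.dumps emits a single line by default, so "\n" delimits responses.
        stdout.write(json.dumps(response) + "\n")
        stdout.flush()  # without a flush the caller may wait on a buffered reply


# Usage inside the script (dispatch is a hypothetical request handler):
#     if __name__ == "__main__":
#         serve(dispatch)
```

Because stdout carries the protocol, anything else the script prints (progress bars, warnings) should go to stderr so it cannot corrupt a response line. The loop ends on its own when the Rust side closes the child's stdin.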
---
## Files to Create/Modify
| File | Action | Description |
|------|--------|-------------|
| `src/api/vision_agent_api.rs` | **Create** | Rust route handlers |
| `src/core/config.rs` | **Modify** | Add `MOMENTRY_VISION_*` env vars |
| `src/api/server.rs` | **Modify** | Merge `vision_agent_routes()` |
| `scripts/vision_inference.py` | **Create** | Python inference script (stdin/stdout) |
| `API_V1.0.0/VISION_AGENT_API_V1.0.0.md` | Created | API docs |
## Migration Plan
| Phase | Steps | Status |
|-------|-------|--------|
| **1** | Create `vision_inference.py` (stdin/stdout, model cache) | ⏳ |
| **2** | Create `vision_agent_api.rs` (detect + search + multimodal handlers) | ⏳ |
| **3** | Add config + mount routes to 3003 | ⏳ |
| **4** | Test detect/search via 3003 (no 5052) | ⏳ |
| **5** | Deprecate 5052 Flask service | ⏳ |