# Momentry Eye API Reference

**Vision Agent**: Multi-model zero-shot object detection service.

Port: `5052` | Resource IDs: `eye-gdino`, `eye-paligemma`

---

## Models

| Model | ID | Params | Size | Confidence | Speed | License |
|-------|-----|--------|------|------------|-------|---------|
| Grounding DINO | `grounding-dino` | 232M | 891MB | ✅ 0-1 score | ~340ms | Apache 2.0 |
| PaliGemma 3B | `paligemma` | 2,923M | ~3GB | ❌ no score | ~80ms | Gemma license |

## Endpoints

### `GET /health`

System status and loaded models.

```bash
curl localhost:5052/health
```

Response:

```json
{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
```

### `GET /models`

List available models with specs.

```bash
curl localhost:5052/models
```

### `POST /detect`

Detect objects in a single video frame.

```bash
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'
```

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `uuid` | string | `aeed71342a...` | Video file UUID |
| `time` | float | `0` | Timestamp in seconds |
| `prompt` | string | `"gun"` | Object to detect |
| `model` | string | `"grounding-dino"` | Model: `grounding-dino`, `paligemma`, or `fusion` |
| `threshold` | float | `0.1` | Minimum confidence (GDINO only) |
| `weights` | object | — | Fusion weights, e.g. `{"grounding-dino":0.6,"paligemma":0.4}` |

**Fusion mode** runs both models and combines results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.

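
The weighted scoring can be pictured with the sketch below. It is not the agent's actual implementation, which this reference does not spell out. It assumes three things that are not documented here: PaliGemma detections (which carry no score) count as `1.0`, boxes from the two models are treated as the same object when their IoU is at least `0.5`, and a detection's fused score is the sum of `weight × score` over the models that found it. The names `iou`, `fuse`, `iou_match`, and `scoreless` are hypothetical.

```python
from typing import Dict, List, Optional

Box = List[float]  # [x1, y1, x2, y2] in pixels, same layout as the "bbox" field

DEFAULT_WEIGHTS = {"grounding-dino": 0.6, "paligemma": 0.4}  # defaults from this reference


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(
    per_model: Dict[str, List[dict]],
    weights: Optional[Dict[str, float]] = None,
    iou_match: float = 0.5,   # ASSUMPTION: overlap needed to call two boxes the same object
    scoreless: float = 1.0,   # ASSUMPTION: stand-in score for models that return none
) -> List[dict]:
    """Merge per-model detections into one deduplicated list with a fused score."""
    weights = weights or DEFAULT_WEIGHTS
    fused: List[dict] = []
    for model, detections in per_model.items():
        weight = weights.get(model, 0.0)
        for det in detections:
            score = det.get("score")
            contribution = weight * (scoreless if score is None else score)
            for entry in fused:
                # Merge only with an entry from a different model that overlaps enough.
                if model not in entry["models"] and iou(entry["bbox"], det["bbox"]) >= iou_match:
                    entry["fused_score"] += contribution
                    entry["models"].append(model)
                    break
            else:
                # No overlapping entry: keep this detection as its own result.
                fused.append({
                    "bbox": det["bbox"],
                    "label": det["label"],
                    "fused_score": contribution,
                    "models": [model],
                })
    return sorted(fused, key=lambda e: e["fused_score"], reverse=True)
```

Under these assumptions, a box found by both models outranks one found by only a single model, and passing a custom `weights` dict shifts how much each model contributes.
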
```bash
# Fusion: run both models, combine results
curl localhost:5052/detect \
  -d '{"time":206, "prompt":"water gun", "model":"fusion"}'

# Custom fusion weights
curl localhost:5052/detect \
  -d '{"time":206, "prompt":"gun", "model":"fusion", "weights":{"grounding-dino":0.5,"paligemma":0.5}}'
```

**Response:**

```json
{
  "model": "grounding-dino",
  "detections": [
    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
    {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
  ],
  "time_ms": 345.2,
  "n_detections": 2,
  "shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}
```

**Fusion response** also includes `per_model` (detections per model) and `fusion` (deduplicated combined list with `fused_score`).

### `POST /search`

Search across a time range.

```bash
# Natural language query
curl localhost:5052/search \
  -d '{"query":"find the gun", "range":"5400-5600", "interval":10}'
```

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `query` | string | `"find the gun"` | Natural language query (parsed to extract object) |
| `target` | string | — | `file_uuid:chunk_id` or `file_uuid:trace_id`; resolves to a time range |
| `range` | string | `"0-6780"` | Manual time range |
| `interval` | int | `30` | Scan interval in seconds |
| `model` | string | `"grounding-dino"` | Detection model |
| `threshold` | float | `0.15` | Minimum confidence |

**Target resolution:**

| Format | Example | Resolves to |
|--------|---------|-------------|
| `file_uuid:chunk_id` | `uuid:uuid_story_90` | Chunk's time range |
| `file_uuid:trace_id` | `uuid:trace_5` | Trace's time range |
| `file_uuid:chunk_index` | `uuid:500` | Chunk index 500's range |

```bash
# Using target
curl localhost:5052/search \
  -d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'

# Using trace
curl localhost:5052/search \
  -d '{"target":"aeed71342...:trace_5", "query":"person"}'
```

### `POST /multimodal`

Multi-modal search across
sentence chunks: combines ASR text match + visual confirmation.

```bash
# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
  -d '{"keyword":"Jean-Louis", "prompt":"child"}'

# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
  -d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'
```

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `keyword` | string | — | ASR keyword to search in sentence text |
| `prompt` | string | same as keyword | Visual prompt for GDINO |
| `chunk_type` | string | `"sentence"` | `sentence`, `trace`, `story`, `cut` |
| `target` | string | — | Specific chunk target |
| `range` | string | `"0-6780"` | Time range (for non-sentence chunks) |
| `threshold` | float | `0.15` | Visual detection threshold |

### `GET /shots/`

Retrieve annotated detection images.

```bash
curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
```

## Object Detection Performance Summary

| Object type | Size in frame | GDINO | PaliGemma | Best prompt |
|-------------|--------------|-------|-----------|-------------|
| Gun (realistic) | 15-30% | ✅ 0.36-0.67 | ✅ | `pistol` / `handgun` |
| Water gun (toy) | 15-31% | ❌ 0 | ✅ | `water gun` (PaliGemma) |
| Child (Jean-Louis) | 30-60% | ⚠️ 0.3-0.9 | ❌ | `child` (high FP on adults) |
| Stamp | <5% | ❌ FP | ❌ | — |
| Passport | <10% | ❌ FP | ❌ | — |
| Magnifying glass | <5% | ❌ FP | ❌ | — |
| Cup / Bottle | 5-15% | ✅ 0.3-0.5 | — | `cup` / `bottle` |
| Cell phone | 5-10% | ✅ 0.3-0.5 | — | `cell phone` |

## Resource Registration

On startup, the agent auto-registers as resources in `dev.resources`:

| Resource ID | Type | Status |
|-------------|------|--------|
| `eye-gdino` | `vision_model` | `online` |
| `eye-paligemma` | `vision_model` | `online` |

Heartbeat updates every 60 seconds.

Discover via:

```sql
SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
```

## Files

| File | Description |
|------|-------------|
| `scripts/vision_agent.py` | Vision Agent server (port 5052) |
| `output_dev/vision_shots/` | Annotated detection screenshots |
| `docs/ZERO_SHOT_DETECTION_RESEARCH.md` | Full model research report |
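
As a usage sketch for `POST /detect`, the Python helpers below build the request body documented in this reference and filter the returned `detections` by score. This is only an illustration, not part of `scripts/vision_agent.py`. The function names, the `http://localhost` host, and the timeout are assumptions; the field names (`time`, `prompt`, `model`, `threshold`, `detections`, `score`) come from the tables above.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:5052"  # port from this reference; the host is an assumption


def build_detect_request(
    time_s: float,
    prompt: str,
    model: str = "grounding-dino",
    threshold: Optional[float] = None,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /detect."""
    body: Dict[str, Any] = {"time": time_s, "prompt": prompt, "model": model}
    if threshold is not None:
        body["threshold"] = threshold  # omitted so the server default applies otherwise
    return urllib.request.Request(
        f"{base_url}/detect",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response (needs a running agent)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def strong_detections(response: Dict[str, Any], min_score: float) -> List[Dict[str, Any]]:
    """Keep detections whose score is at least min_score.

    Detections without a score (e.g. from PaliGemma) are dropped here,
    which is a choice made for this sketch only.
    """
    return [
        det for det in response.get("detections", [])
        if det.get("score") is not None and det["score"] >= min_score
    ]
```

With the agent running, `strong_detections(send(build_detect_request(5461, "gun")), 0.3)` would return only the boxes scored 0.3 or higher.
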