Momentry Eye API Reference
Vision Agent — Multi-model zero-shot object detection service.
Port: 5052 | Resource IDs: eye-gdino, eye-paligemma
Models
| Model | ID | Params | Size | Confidence | Speed | License |
|---|---|---|---|---|---|---|
| Grounding DINO | grounding-dino | 232M | 891MB | ✅ 0-1 score | ~340ms | Apache 2.0 |
| PaliGemma 3B | paligemma | 2,923M | ~3GB | ❌ no score | ~80ms | Gemma license |
Endpoints
GET /health
System status and loaded models.
curl localhost:5052/health
Response:
{
"status": "ok",
"models_loaded": ["grounding-dino"],
"models_available": ["grounding-dino", "paligemma"],
"device": "mps",
"port": 5052
}
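A client may want to know which models are available but not yet loaded before sending a detection request. The following is a minimal sketch that works on an already parsed /health payload; the helper name `unloaded_models` is hypothetical and not part of the service.

```python
import json


def unloaded_models(health: dict) -> list[str]:
    """Return models listed as available but not currently loaded."""
    loaded = set(health.get("models_loaded", []))
    return [m for m in health.get("models_available", []) if m not in loaded]


# Sample payload copied from the /health response shown above.
sample = json.loads("""
{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
""")

print(unloaded_models(sample))  # ['paligemma']
```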
GET /models
List available models with specs.
curl localhost:5052/models
POST /detect
Detect objects in a single video frame.
curl localhost:5052/detect \
-H "Content-Type: application/json" \
-d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| uuid | string | aeed71342a... | Video file UUID |
| time | float | 0 | Timestamp in seconds |
| prompt | string | "gun" | Object to detect |
| model | string | "grounding-dino" | Model: grounding-dino, paligemma, or fusion |
| threshold | float | 0.1 | Minimum confidence (GDINO only) |
| weights | object | — | Fusion weights, e.g. {"grounding-dino":0.6,"paligemma":0.4} |
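The parameters in the table above can be assembled into a request from Python. This is a sketch, not the project's client: `build_detect_request` and `BASE_URL` are hypothetical names, the client-side checks are a design choice rather than documented server behavior, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5052"  # assumption: the agent runs locally
MODELS = {"grounding-dino", "paligemma", "fusion"}


def build_detect_request(time, prompt, model="grounding-dino",
                         uuid=None, threshold=None, weights=None):
    """Build (but do not send) a POST /detect request."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if weights is not None and model != "fusion":
        # Client-side guard (assumption): weights are documented for fusion only.
        raise ValueError("weights only apply to fusion mode")
    payload = {"time": time, "prompt": prompt, "model": model}
    # Optional fields are omitted so the server falls back to its defaults.
    if uuid is not None:
        payload["uuid"] = uuid
    if threshold is not None:
        payload["threshold"] = threshold
    if weights is not None:
        payload["weights"] = weights
    return urllib.request.Request(
        f"{BASE_URL}/detect",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_detect_request(5461, "gun", threshold=0.1)
# To send it: urllib.request.urlopen(req)
```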
Fusion mode runs both models and combines results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.
# Fusion: run both models, combine results
curl localhost:5052/detect \
-H "Content-Type: application/json" \
-d '{"time":206, "prompt":"water gun", "model":"fusion"}'
# Custom fusion weights
curl localhost:5052/detect \
-H "Content-Type: application/json" \
-d '{"time":206, "prompt":"gun", "model":"fusion",
"weights":{"grounding-dino":0.5,"paligemma":0.5}}'
Response:
{
"model": "grounding-dino",
"detections": [
{"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
{"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
],
"time_ms": 345.2,
"n_detections": 2,
"shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}
A fusion response also includes per_model (each model's detections) and fusion (a deduplicated combined list in which every detection carries a fused_score).
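This reference does not spell out how fused_score is computed. One plausible reading is sketched below, purely as an illustration: boxes from the two models that overlap strongly are merged, each model contributes its weight times its score, and PaliGemma boxes are given a nominal score of 1.0 because that model returns none. The IoU cutoff of 0.5, the nominal score, and the function names are all assumptions, not the service's actual algorithm.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(gdino, paligemma, weights=None, iou_min=0.5):
    """Merge overlapping boxes and attach a weighted fused_score (assumed scheme)."""
    weights = weights or {"grounding-dino": 0.6, "paligemma": 0.4}
    w_g, w_p = weights["grounding-dino"], weights["paligemma"]
    fused, matched = [], set()
    for g in gdino:
        score = w_g * g["score"]
        for i, p in enumerate(paligemma):
            if i not in matched and iou(g["bbox"], p["bbox"]) >= iou_min:
                matched.add(i)
                score += w_p * 1.0  # assumed nominal score for PaliGemma
                break
        fused.append({"bbox": g["bbox"], "label": g["label"],
                      "fused_score": round(score, 3)})
    # PaliGemma boxes with no GDINO counterpart keep only their own weight.
    for i, p in enumerate(paligemma):
        if i not in matched:
            fused.append({"bbox": p["bbox"], "label": p["label"],
                          "fused_score": round(w_p, 3)})
    return sorted(fused, key=lambda d: d["fused_score"], reverse=True)


gdino_hits = [{"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"}]
pali_hits = [{"bbox": [730.0, 570.0, 965.0, 690.0], "label": "gun"}]  # hypothetical box
result = fuse(gdino_hits, pali_hits)
```

Under these assumptions, a detection confirmed by both models always outranks one seen by a single model, which is the usual motivation for weighted fusion.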
POST /search
Search across a time range.
# Natural language query
curl localhost:5052/search \
-H "Content-Type: application/json" \
-d '{"query":"find the gun", "range":"5400-5600", "interval":10}'
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| query | string | "find the gun" | Natural language query (parsed to extract object) |
| target | string | — | file_uuid:chunk_id or file_uuid:trace_id (resolves to time range) |
| range | string | "0-6780" | Manual time range |
| interval | int | 30 | Scan interval in seconds |
| model | string | "grounding-dino" | Detection model |
| threshold | float | 0.15 | Minimum confidence |
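To see how range and interval interact, the sketch below lists the timestamps a scan would visit. It assumes whole-second bounds and an inclusive end point, neither of which is confirmed by this reference; `scan_times` is a hypothetical helper.

```python
def scan_times(range_str: str, interval: int = 30) -> list[int]:
    """Expand a "start-end" range into the timestamps sampled every `interval` seconds."""
    start_text, end_text = range_str.split("-", 1)
    start, end = int(start_text), int(end_text)
    if interval <= 0:
        raise ValueError("interval must be positive")
    if end < start:
        raise ValueError("range end precedes range start")
    # Assumption: the end of the range is itself sampled when it falls on the grid.
    return list(range(start, end + 1, interval))


times = scan_times("5400-5600", interval=10)
print(len(times), times[0], times[-1])  # 21 5400 5600
```

With the defaults ("0-6780" at 30-second steps) the same logic gives 227 frames, so narrowing the range is the main lever for keeping a search fast.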
Target resolution:
| Format | Example | Resolves to |
|---|---|---|
| file_uuid:chunk_id | uuid:uuid_story_90 | Chunk's time range |
| file_uuid:trace_id | uuid:trace_5 | Trace's time range |
| file_uuid:chunk_index | uuid:500 | Chunk index 500's range |
# Using target
curl localhost:5052/search \
-H "Content-Type: application/json" \
-d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'
# Using trace
curl localhost:5052/search \
-H "Content-Type: application/json" \
-d '{"target":"aeed71342...:trace_5", "query":"person"}'
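The three target formats can be told apart from the part after the colon. The sketch below shows one way a client might classify a target before sending it; the rules (all digits means chunk index, a trace_ prefix means trace) are inferred from the examples above and are an assumption about how the server resolves targets. The UUID `abc123` is a placeholder.

```python
def parse_target(target: str) -> dict:
    """Split "file_uuid:ref" and guess which kind of reference `ref` is."""
    file_uuid, sep, ref = target.partition(":")
    if not sep or not file_uuid or not ref:
        raise ValueError("target must look like file_uuid:reference")
    if ref.isdigit():
        kind = "chunk_index"
    elif ref.startswith("trace_"):
        kind = "trace_id"
    else:
        kind = "chunk_id"
    return {"file_uuid": file_uuid, "kind": kind, "ref": ref}


print(parse_target("abc123:trace_5"))
# {'file_uuid': 'abc123', 'kind': 'trace_id', 'ref': 'trace_5'}
```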
POST /multimodal
Multi-modal search across sentence chunks — combines ASR text match + visual confirmation.
# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
-H "Content-Type: application/json" \
-d '{"keyword":"Jean-Louis", "prompt":"child"}'
# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
-H "Content-Type: application/json" \
-d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| keyword | string | — | ASR keyword to search in sentence text |
| prompt | string | same as keyword | Visual prompt for GDINO |
| chunk_type | string | "sentence" | sentence, trace, story, cut |
| target | string | — | Specific chunk target |
| range | string | "0-6780" | Time range (for non-sentence chunks) |
| threshold | float | 0.15 | Visual detection threshold |
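The two-stage idea behind this endpoint (a text match first, then visual confirmation) can be sketched as a simple filter. This is an illustration only: the chunk fields `text` and `start`, and the `detect` callback that returns the best detection score at a timestamp, are hypothetical stand-ins for whatever the agent uses internally.

```python
from typing import Callable


def multimodal_search(chunks: list[dict], keyword: str,
                      detect: Callable[[float], float],
                      threshold: float = 0.15) -> list[dict]:
    """Keep chunks whose text matches `keyword` and whose frame passes `detect`."""
    hits = []
    for chunk in chunks:
        # Stage 1: ASR text match. An empty keyword skips this stage.
        if keyword and keyword.lower() not in chunk["text"].lower():
            continue
        # Stage 2: visual confirmation at the chunk's start time.
        score = detect(chunk["start"])
        if score >= threshold:
            hits.append({**chunk, "visual_score": score})
    return hits


chunks = [
    {"text": "Jean-Louis, come here", "start": 10.0},
    {"text": "Hello", "start": 20.0},
]


def fake_detector(t: float) -> float:
    """Stand-in for a real model call."""
    return 0.4 if t == 10.0 else 0.05


print(multimodal_search(chunks, "jean-louis", fake_detector))
```

Running the cheap text match first means the detector only sees frames that already matched, which matters when each visual check costs a few hundred milliseconds.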
GET /shots/<filename>
Retrieve annotated detection images.
curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
Object Detection Performance Summary
| Object type | Size in frame | GDINO | PaliGemma | Best prompt |
|---|---|---|---|---|
| Gun (realistic) | 15-30% | ✅ 0.36-0.67 | ✅ | pistol / handgun |
| Water gun (toy) | 15-31% | ❌ 0 | ✅ | water gun (PaliGemma) |
| Child (Jean-Louis) | 30-60% | ⚠️ 0.3-0.9 | ❌ | child (high FP on adults) |
| Stamp | <5% | ❌ FP | ❌ | — |
| Passport | <10% | ❌ FP | ❌ | — |
| Magnifying glass | <5% | ❌ FP | ❌ | — |
| Cup / Bottle | 5-15% | ✅ 0.3-0.5 | — | cup / bottle |
| Cell phone | 5-10% | ✅ 0.3-0.5 | — | cell phone |
Resource Registration
On startup, the agent automatically registers each model as a resource in dev.resources:
| Resource ID | Type | Status |
|---|---|---|
| eye-gdino | vision_model | online |
| eye-paligemma | vision_model | online |
The heartbeat is updated every 60 seconds. To discover registered models:
SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
Files
| File | Description |
|---|---|
| scripts/vision_agent.py | Vision Agent server (port 5052) |
| output_dev/vision_shots/ | Annotated detection screenshots |
| docs/ZERO_SHOT_DETECTION_RESEARCH.md | Full model research report |