momentry_core/docs/VISION_AGENT_API.md
Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
2026-05-11 07:03:22 +08:00

Momentry Eye API Reference

Vision Agent — Multi-model zero-shot object detection service. Port: 5052 | Resource IDs: eye-gdino, eye-paligemma


Models

| Model | ID | Params | Size | Confidence | Speed | License |
| --- | --- | --- | --- | --- | --- | --- |
| Grounding DINO | grounding-dino | 232M | 891MB | 0-1 score | ~340ms | Apache 2.0 |
| PaliGemma 3B | paligemma | 2,923M | ~3GB | no score | ~80ms | Gemma license |

Endpoints

GET /health

System status and loaded models.

curl localhost:5052/health

Response:

{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
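A client may want to know which models are available but not yet loaded before it sends a request. The following is a minimal sketch that assumes the response carries exactly the fields shown above; `unloaded_models` is a hypothetical helper, not part of the agent.

```python
import json

# Sample /health body, copied from the response shown above.
HEALTH_BODY = """
{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
"""


def unloaded_models(health: dict) -> list:
    """Return models that are available but not loaded, in listed order."""
    loaded = set(health.get("models_loaded", []))
    return [m for m in health.get("models_available", []) if m not in loaded]


health = json.loads(HEALTH_BODY)
print(unloaded_models(health))
```

For the sample body this reports `paligemma` as the only model that is available but not loaded.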

GET /models

List available models with specs.

curl localhost:5052/models

POST /detect

Detect objects in a single video frame.

curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| uuid | string | aeed71342a... | Video file UUID |
| time | float | 0 | Timestamp in seconds |
| prompt | string | "gun" | Object to detect |
| model | string | "grounding-dino" | Model: grounding-dino, paligemma, or fusion |
| threshold | float | 0.1 | Minimum confidence (GDINO only) |
| weights | object | | Fusion weights, e.g. {"grounding-dino":0.6,"paligemma":0.4} |
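The same request can be built from Python using only the standard library. This is a sketch under assumptions: `BASE_URL` assumes the agent runs locally on the documented port, and `build_detect_request` is a hypothetical helper. Optional parameters are omitted from the body so the server-side defaults listed above apply.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5052"  # assumption: agent is reachable locally


def build_detect_request(time_s, prompt, model="grounding-dino",
                         uuid=None, threshold=None, weights=None):
    """Build (but do not send) a POST /detect request with a JSON body."""
    payload = {"time": time_s, "prompt": prompt, "model": model}
    # Only include optional fields the caller actually set.
    if uuid is not None:
        payload["uuid"] = uuid
    if threshold is not None:
        payload["threshold"] = threshold
    if weights is not None:
        payload["weights"] = weights
    return urllib.request.Request(
        BASE_URL + "/detect",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_detect_request(5461, "gun")
# To send it against a running agent:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```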

Fusion mode runs both models and combines results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.

# Fusion: run both models, combine results
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"water gun", "model":"fusion"}'

# Custom fusion weights
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"gun", "model":"fusion",
       "weights":{"grounding-dino":0.5,"paligemma":0.5}}'

Response:

{
  "model": "grounding-dino",
  "detections": [
    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
    {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
  ],
  "time_ms": 345.2,
  "n_detections": 2,
  "shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}
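A caller typically wants the highest-scoring detection from this response. The sketch below assumes `bbox` is `[x1, y1, x2, y2]` in pixels, which matches the sample values but is not stated in the source. It treats detections without a `score` as 0, since PaliGemma reports no score. `best_detection` and `bbox_area` are hypothetical helpers.

```python
def best_detection(response, min_score=0.0):
    """Return the highest-scoring detection at or above min_score, or None."""
    candidates = [
        d for d in response.get("detections", [])
        if d.get("score", 0.0) >= min_score
    ]
    return max(candidates, key=lambda d: d.get("score", 0.0), default=None)


def bbox_area(bbox):
    """Area of a box, assuming the [x1, y1, x2, y2] corner layout."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


# Detections copied from the sample response above.
sample = {
    "model": "grounding-dino",
    "detections": [
        {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
        {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"},
    ],
}

top = best_detection(sample)
print(top["label"], top["score"], bbox_area(top["bbox"]))
```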

A fusion response also includes per_model (the detections from each model) and fusion (the deduplicated combined list, where each entry carries a fused_score).
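The source does not specify how detections are matched across models or how PaliGemma's score-less detections are weighted. As an illustration only, the sketch below assumes boxes are `[x1, y1, x2, y2]`, that two boxes are duplicates when their IoU is at least 0.5, and that a detection without a score counts as 1.0. None of these rules are confirmed by the agent's code, and `iou` and `fuse` are hypothetical names.

```python
DEFAULT_WEIGHTS = {"grounding-dino": 0.6, "paligemma": 0.4}  # defaults from the doc


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(gdino, pali, weights=None, iou_thresh=0.5):
    """Merge two detection lists into one list sorted by fused_score."""
    weights = weights or DEFAULT_WEIGHTS
    wg, wp = weights["grounding-dino"], weights["paligemma"]
    unmatched = list(pali)
    fused = []
    for g in gdino:
        # Assumed rule: the first PaliGemma box overlapping enough is a duplicate.
        match = next(
            (p for p in unmatched if iou(g["bbox"], p["bbox"]) >= iou_thresh),
            None,
        )
        score = wg * g["score"]
        if match is not None:
            unmatched.remove(match)
            score += wp * match.get("score", 1.0)  # assumed: no score means 1.0
        fused.append({"bbox": g["bbox"], "label": g["label"], "fused_score": score})
    # PaliGemma-only detections keep just their weighted share.
    for p in unmatched:
        fused.append({"bbox": p["bbox"], "label": p["label"],
                      "fused_score": wp * p.get("score", 1.0)})
    return sorted(fused, key=lambda d: d["fused_score"], reverse=True)
```

Under these assumptions, a box found by both models outranks one found by either model alone, which is the intended effect of weighted fusion.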

POST /search

Search across a time range.

# Natural language query
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"query":"find the gun", "range":"5400-5600", "interval":10}'

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | "find the gun" | Natural language query (parsed to extract object) |
| target | string | | file_uuid:chunk_id or file_uuid:trace_id; resolves to time range |
| range | string | "0-6780" | Manual time range |
| interval | int | 30 | Scan interval in seconds |
| model | string | "grounding-dino" | Detection model |
| threshold | float | 0.15 | Minimum confidence |
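To see how `range` and `interval` together determine how many frames are scanned, the sketch below expands a range string into timestamps. It assumes the range is `start-end` in seconds and that the end point is included; the source confirms neither. `parse_range` and `scan_times` are hypothetical helpers.

```python
def parse_range(range_str):
    """Split a 'start-end' string into two floats (seconds)."""
    start_s, end_s = range_str.split("-", 1)
    start, end = float(start_s), float(end_s)
    if end < start:
        raise ValueError(f"range end precedes start: {range_str!r}")
    return start, end


def scan_times(range_str, interval):
    """Timestamps sampled every `interval` seconds, end point included."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    start, end = parse_range(range_str)
    steps = int((end - start) // interval)
    # Multiply rather than accumulate to avoid floating-point drift.
    return [start + i * interval for i in range(steps + 1)]


times = scan_times("5400-5600", 10)
print(len(times), times[0], times[-1])
```

Under the inclusive-end assumption, the example request above samples 21 frames, and the defaults (`"0-6780"` at 30 seconds) would sample 227.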

Target resolution:

| Format | Example | Resolves to |
| --- | --- | --- |
| file_uuid:chunk_id | uuid:uuid_story_90 | Chunk's time range |
| file_uuid:trace_id | uuid:trace_5 | Trace's time range |
| file_uuid:chunk_index | uuid:500 | Chunk index 500's range |

# Using target
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'

# Using trace
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:trace_5", "query":"person"}'
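The three target formats can be told apart by the part after the first colon. The sketch below shows one way a client might classify a target before sending it. The classification rules (all digits means chunk index, a `trace_` prefix means trace ID, anything else is a chunk ID) are inferred from the examples in the table, not taken from the agent's code, and `parse_target` is a hypothetical helper.

```python
def parse_target(target):
    """Split 'file_uuid:ref' and guess which kind of reference `ref` is."""
    file_uuid, sep, ref = target.partition(":")
    if not sep or not file_uuid or not ref:
        raise ValueError(f"expected 'file_uuid:ref', got {target!r}")
    if ref.isdigit():
        kind = "chunk_index"   # e.g. uuid:500
    elif ref.startswith("trace_"):
        kind = "trace_id"      # e.g. uuid:trace_5
    else:
        kind = "chunk_id"      # e.g. uuid:uuid_story_90
    return {"file_uuid": file_uuid, "kind": kind, "ref": ref}


for example in ("uuid:uuid_story_90", "uuid:trace_5", "uuid:500"):
    print(parse_target(example))
```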

POST /multimodal

Multi-modal search over chunks (sentence chunks by default): combines an ASR text match with visual confirmation.

# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"Jean-Louis", "prompt":"child"}'

# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'

Parameters:

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| keyword | string | | ASR keyword to search in sentence text |
| prompt | string | same as keyword | Visual prompt for GDINO |
| chunk_type | string | "sentence" | sentence, trace, story, cut |
| target | string | | Specific chunk target |
| range | string | "0-6780" | Time range (for non-sentence chunks) |
| threshold | float | 0.15 | Visual detection threshold |
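The defaulting rule for `prompt` (fall back to `keyword`) can be mirrored on the client so the request body is explicit. This is a sketch only: `build_multimodal_payload` is a hypothetical helper, and rejecting a request that has neither keyword nor prompt is an assumption about sensible client behaviour, not documented server behaviour.

```python
import json

CHUNK_TYPES = {"sentence", "trace", "story", "cut"}  # values listed in the table


def build_multimodal_payload(keyword="", prompt=None, chunk_type="sentence",
                             target=None, range_str=None, threshold=None):
    """Build a JSON-serialisable body for POST /multimodal."""
    if chunk_type not in CHUNK_TYPES:
        raise ValueError(f"unknown chunk_type: {chunk_type!r}")
    if not keyword and not prompt:
        raise ValueError("need at least a keyword or a prompt")
    payload = {
        "keyword": keyword,
        "prompt": prompt if prompt else keyword,  # documented default
        "chunk_type": chunk_type,
    }
    # Optional fields are left out so the server defaults apply.
    if target is not None:
        payload["target"] = target
    if range_str is not None:
        payload["range"] = range_str
    if threshold is not None:
        payload["threshold"] = threshold
    return payload


print(json.dumps(build_multimodal_payload("Jean-Louis", prompt="child")))
```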

GET /shots/<filename>

Retrieve annotated detection images.

curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
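The `shot_url` field returned by `/detect` is a server-relative path, so a client has to join it with the agent's base address before fetching it. A minimal sketch, assuming the agent is reachable at `http://localhost:5052`; `shot_full_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

BASE_URL = "http://localhost:5052"  # assumption: agent is reachable locally


def shot_full_url(shot_url, base_url=BASE_URL):
    """Turn a relative shot_url from /detect into an absolute URL."""
    return urljoin(base_url + "/", shot_url)


url = shot_full_url("/shots/aeed7134_5461s_gun_grounding-dino.jpg")
print(url)
# To download against a running agent:
#   import urllib.request
#   urllib.request.urlretrieve(url, "result.jpg")
```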

Object Detection Performance Summary

Object type Size in frame GDINO PaliGemma Best prompt
Gun (realistic) 15-30% 0.36-0.67 pistol / handgun
Water gun (toy) 15-31% 0 water gun (PaliGemma)
Child (Jean-Louis) 30-60% ⚠️ 0.3-0.9 child (high FP on adults)
Stamp <5% FP
Passport <10% FP
Magnifying glass <5% FP
Cup / Bottle 5-15% 0.3-0.5 cup / bottle
Cell phone 5-10% 0.3-0.5 cell phone

Resource Registration

On startup, the agent registers itself in dev.resources as two resources:

| Resource ID | Type | Status |
| --- | --- | --- |
| eye-gdino | vision_model | online |
| eye-paligemma | vision_model | online |

The heartbeat is updated every 60 seconds. Discover the registered models with:

SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
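Because the heartbeat is refreshed every 60 seconds, a consumer can treat a resource as stale once several beats have been missed. The sketch below assumes the last heartbeat is available as a timezone-aware timestamp. The source does not name the column that stores it, and the two-missed-beats tolerance is an arbitrary choice for illustration. `is_stale` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=60)  # interval stated in the doc


def is_stale(last_heartbeat, now, missed_beats=2):
    """True when more than `missed_beats` heartbeat intervals have elapsed."""
    return now - last_heartbeat > HEARTBEAT_INTERVAL * missed_beats


now = datetime(2026, 5, 11, 7, 0, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(seconds=90), now))
print(is_stale(now - timedelta(seconds=121), now))
```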

Files

| File | Description |
| --- | --- |
| scripts/vision_agent.py | Vision Agent server (port 5052) |
| output_dev/vision_shots/ | Annotated detection screenshots |
| docs/ZERO_SHOT_DETECTION_RESEARCH.md | Full model research report |