Accusys 39ba5ddf76 feat: Phase 1 handover - schema migration, correction mechanism, API fixes

Schema changes: dev.chunks->dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4

2026-05-11 07:03:22 +08:00


Momentry Eye API Reference

Vision Agent: a multi-model, zero-shot object detection service.

Port: 5052 | Resource IDs: `eye-gdino`, `eye-paligemma`

Models

| Model | ID | Params | Size | Confidence | Speed | License |
|---|---|---|---|---|---|---|
| Grounding DINO | `grounding-dino` | 232M | 891 MB | ✅ 0-1 score | ~340 ms | Apache 2.0 |
| PaliGemma 3B | `paligemma` | 2,923M | ~3 GB | ❌ no score | ~80 ms | Gemma license |

Endpoints

`GET /health`

System status and loaded models.

curl localhost:5052/health

Response:

{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
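
A client may want to confirm that a model is loaded before sending detection requests. The following is a minimal sketch, assuming the response shape shown above; `fetch_health`, `unloaded_models`, and the `BASE_URL` host are illustrative names and are not part of the service.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:5052"  # assumed host; the port comes from this reference


def fetch_health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def unloaded_models(health: dict) -> list:
    """Return models listed as available but not currently loaded."""
    loaded = set(health.get("models_loaded", []))
    return [m for m in health.get("models_available", []) if m not in loaded]
```

For the sample response above, `unloaded_models` would return `["paligemma"]`. This reference does not say whether an unloaded model is loaded on first use, so treat the result as informational only.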

`GET /models`

List available models with specs.

curl localhost:5052/models

`POST /detect`

Detect objects in a single video frame.

curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'

Parameters:

| Param | Type | Default | Description |
|---|---|---|---|
| `uuid` | string | `aeed71342a...` | Video file UUID |
| `time` | float | `0` | Timestamp in seconds |
| `prompt` | string | `"gun"` | Object to detect |
| `model` | string | `"grounding-dino"` | Model: `grounding-dino`, `paligemma`, or `fusion` |
| `threshold` | float | `0.1` | Minimum confidence score (Grounding DINO only) |
| `weights` | object | - | Fusion weights, e.g. `{"grounding-dino":0.6,"paligemma":0.4}` |
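
The parameters above can also be sent from Python. This is a rough sketch using only the standard library: `build_detect_payload` and `detect` are hypothetical helper names, the host is assumed to be `localhost`, and fields left as `None` are omitted so that the server defaults in the table apply.

```python
import json
from urllib.request import Request, urlopen

VALID_MODELS = {"grounding-dino", "paligemma", "fusion"}  # values from the table above


def build_detect_payload(time, prompt="gun", model="grounding-dino",
                         threshold=None, weights=None, uuid=None) -> dict:
    """Assemble a /detect body, omitting fields left at their server defaults."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    payload = {"time": float(time), "prompt": prompt, "model": model}
    if uuid is not None:
        payload["uuid"] = uuid
    if threshold is not None:
        payload["threshold"] = threshold
    if weights is not None:
        payload["weights"] = weights
    return payload


def detect(payload: dict, base_url: str = "http://localhost:5052") -> dict:
    """POST the payload to /detect as JSON and return the decoded response."""
    req = Request(
        f"{base_url}/detect",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A call such as `detect(build_detect_payload(5461, prompt="gun"))` should be equivalent to the first `curl` example above.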

Fusion mode runs both models and combines results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.

# Fusion: run both models, combine results
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"water gun", "model":"fusion"}'

# Custom fusion weights
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"gun", "model":"fusion",
       "weights":{"grounding-dino":0.5,"paligemma":0.5}}'

Response:

{
  "model": "grounding-dino",
  "detections": [
    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
    {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
  ],
  "time_ms": 345.2,
  "n_detections": 2,
  "shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}

In fusion mode, the response also includes `per_model` (the detections from each model) and `fusion` (a deduplicated, combined list in which each entry carries a `fused_score`).
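
This reference does not spell out how the fused score is computed, so the following is only one plausible sketch of weighted fusion with deduplication, not the agent's actual algorithm. It assumes that `bbox` is `[x1, y1, x2, y2]`, that a detection without a score (as from PaliGemma) counts as 1.0, and that boxes from different models overlapping with IoU of at least 0.5 describe the same object. The names `iou`, `fuse`, and `iou_threshold` are illustrative.

```python
DEFAULT_WEIGHTS = {"grounding-dino": 0.6, "paligemma": 0.4}  # defaults from this reference


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes (assumed layout)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(per_model, weights=None, iou_threshold=0.5):
    """Merge per-model detections; overlapping boxes from different models add up."""
    weights = weights or dict(DEFAULT_WEIGHTS)
    fused = []
    for model, detections in per_model.items():
        weight = weights.get(model, 0.0)
        for det in detections:
            score = det.get("score")
            # Assumption: a model that reports no score contributes its full weight.
            contribution = weight * (1.0 if score is None else score)
            for item in fused:
                same_object = iou(item["bbox"], det["bbox"]) >= iou_threshold
                if same_object and model not in item["models"]:
                    item["fused_score"] += contribution
                    item["models"].append(model)
                    break
            else:
                # No existing entry matched, so start a new fused detection.
                fused.append({"bbox": det["bbox"], "label": det["label"],
                              "fused_score": contribution, "models": [model]})
    return sorted(fused, key=lambda d: d["fused_score"], reverse=True)
```

Under these assumptions, a box found by both models ranks above a box found by only one, which is the usual motivation for combining a scored detector with an unscored one.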

`POST /search`

Search across a time range.

# Natural language query
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"query":"find the gun", "range":"5400-5600", "interval":10}'

Parameters:

| Param | Type | Default | Description |
|---|---|---|---|
| `query` | string | `"find the gun"` | Natural language query (parsed to extract the object) |
| `target` | string | - | `file_uuid:chunk_id`, `file_uuid:trace_id`, or `file_uuid:chunk_index`; resolves to a time range |
| `range` | string | `"0-6780"` | Manual time range as `start-end`, in seconds |
| `interval` | int | `30` | Scan interval in seconds |
| `model` | string | `"grounding-dino"` | Detection model |
| `threshold` | float | `0.15` | Minimum confidence |
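
To estimate how many frames a search will scan, the `range` and `interval` parameters can be expanded into timestamps. This is a small sketch; `scan_timestamps` is an illustrative name, and whether the agent includes the end of the range is an assumption made here.

```python
def scan_timestamps(range_spec: str = "0-6780", interval: int = 30) -> list:
    """Expand a 'start-end' range (seconds) into timestamps spaced by interval."""
    start_text, end_text = range_spec.split("-", 1)
    start, end = float(start_text), float(end_text)
    if interval <= 0:
        raise ValueError("interval must be positive")
    if end < start:
        raise ValueError("range end precedes range start")
    # Assumption: both endpoints are sampled when they fall on the grid.
    count = int((end - start) // interval) + 1
    return [start + i * interval for i in range(count)]
```

With `range` set to `"5400-5600"` and `interval` set to 10, this yields 21 timestamps. At the ~340 ms per frame listed for Grounding DINO, that is roughly 7 seconds of detection time, ignoring frame extraction.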

Target resolution:

| Format | Example | Resolves to |
|---|---|---|
| `file_uuid:chunk_id` | `uuid:uuid_story_90` | Chunk's time range |
| `file_uuid:trace_id` | `uuid:trace_5` | Trace's time range |
| `file_uuid:chunk_index` | `uuid:500` | Chunk index 500's range |
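
The three target formats can be told apart from the part after the colon. The sketch below shows one way a client might classify a target before sending it. The rules (a `trace_` prefix, an all-digit index, anything else treated as a chunk ID) are inferred from the examples in the table and are not confirmed server behavior. `Target` and `parse_target` are hypothetical names.

```python
from typing import NamedTuple


class Target(NamedTuple):
    file_uuid: str
    kind: str  # "trace", "chunk_index", or "chunk_id"
    ref: str


def parse_target(target: str) -> Target:
    """Split 'file_uuid:reference' and guess which kind of reference it is."""
    # Assumption: the file UUID itself never contains a colon.
    file_uuid, sep, ref = target.partition(":")
    if not sep or not file_uuid or not ref:
        raise ValueError(f"expected 'file_uuid:reference', got {target!r}")
    if ref.startswith("trace_"):
        kind = "trace"
    elif ref.isdigit():
        kind = "chunk_index"
    else:
        kind = "chunk_id"
    return Target(file_uuid, kind, ref)
```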

# Using target
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'

# Using trace
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:trace_5", "query":"person"}'

`POST /multimodal`

Multimodal search across chunks (sentence chunks by default). It combines an ASR text match with visual confirmation.

# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"Jean-Louis", "prompt":"child"}'

# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'

Parameters:

| Param | Type | Default | Description |
|---|---|---|---|
| `keyword` | string | - | ASR keyword to search for in sentence text |
| `prompt` | string | same as `keyword` | Visual prompt for GDINO |
| `chunk_type` | string | `"sentence"` | `sentence`, `trace`, `story`, `cut` |
| `target` | string | - | Specific chunk target |
| `range` | string | `"0-6780"` | Time range (for non-sentence chunks) |
| `threshold` | float | `0.15` | Visual detection threshold |
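
The two-stage idea behind this endpoint (filter chunks by ASR text, then confirm visually) can be sketched as follows. This is an illustration under assumptions, not the agent's implementation. The chunk fields `text`, `start`, and `end` are hypothetical. Sampling one frame at the chunk midpoint is a choice made here. `detect_fn` stands in for any callable that returns scored detections for a timestamp and prompt.

```python
from typing import Callable, Iterable, Optional


def multimodal_search(chunks: Iterable[dict], keyword: str,
                      detect_fn: Callable[[float, str], list],
                      prompt: Optional[str] = None,
                      threshold: float = 0.15) -> list:
    """Keep chunks whose text matches keyword and whose frame shows the prompt."""
    prompt = prompt or keyword  # the table lists the keyword as the default prompt
    hits = []
    for chunk in chunks:
        # Stage 1: text match; an empty keyword skips the ASR filter.
        if keyword and keyword.lower() not in chunk["text"].lower():
            continue
        # Stage 2: visual confirmation on one frame (midpoint is an assumption).
        midpoint = (chunk["start"] + chunk["end"]) / 2
        detections = [d for d in detect_fn(midpoint, prompt)
                      if d["score"] >= threshold]
        if detections:
            hits.append({"chunk": chunk, "time": midpoint,
                         "detections": detections})
    return hits
```

Running the cheap text filter first limits the number of detection calls, which matters when each frame costs several hundred milliseconds.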

`GET /shots/<filename>`

Retrieve annotated detection images.

curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
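
The `shot_url` field returned by `/detect` is a path relative to the server, so a client has to join it with the base URL before downloading. Below is a minimal sketch with the standard library; `full_shot_url` and `save_shot` are illustrative names and the host is assumed.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = "http://localhost:5052"  # assumed host; the port comes from this reference


def full_shot_url(detect_response: dict, base_url: str = BASE_URL) -> str:
    """Turn the relative shot_url from a /detect response into an absolute URL."""
    return urljoin(base_url, detect_response["shot_url"])


def save_shot(detect_response: dict, dest: str, base_url: str = BASE_URL) -> str:
    """Download the annotated image to dest and return the local path."""
    path, _headers = urlretrieve(full_shot_url(detect_response, base_url), dest)
    return path
```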

Object Detection Performance Summary

| Object type | Size in frame | GDINO | PaliGemma | Best prompt |
|---|---|---|---|---|
| Gun (realistic) | 15-30% | ✅ 0.36-0.67 | ✅ | `pistol` / `handgun` |
| Water gun (toy) | 15-31% | ❌ 0 | ✅ | `water gun` (PaliGemma) |
| Child (Jean-Louis) | 30-60% | ⚠️ 0.3-0.9 | ❌ | `child` (high FP on adults) |
| Stamp | <5% | ❌ FP | ❌ | - |
| Passport | <10% | ❌ FP | ❌ | - |
| Magnifying glass | <5% | ❌ FP | ❌ | - |
| Cup / Bottle | 5-15% | ✅ 0.3-0.5 | - | `cup` / `bottle` |
| Cell phone | 5-10% | ✅ 0.3-0.5 | - | `cell phone` |

Resource Registration

On startup, the agent registers itself as two resources in `dev.resources`:

| Resource ID | Type | Status |
|---|---|---|
| `eye-gdino` | `vision_model` | `online` |
| `eye-paligemma` | `vision_model` | `online` |

The heartbeat is updated every 60 seconds. To discover the registered models, query:

SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
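
Because the heartbeat arrives every 60 seconds, a consumer of `dev.resources` can treat a resource as stale once several beats have been missed. The sketch below assumes that the table exposes a timezone-aware heartbeat timestamp. The column is not named in this reference, so `last_heartbeat` is hypothetical, as is the tolerance of two missed beats.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=60)  # interval stated in this reference


def is_stale(last_heartbeat: datetime, now: datetime = None,
             missed_beats: int = 2) -> bool:
    """Return True when more than missed_beats intervals passed without a beat."""
    # last_heartbeat: hypothetical timestamp column read from dev.resources.
    now = now or datetime.now(timezone.utc)
    return now - last_heartbeat > missed_beats * HEARTBEAT_INTERVAL
```

Allowing more than one missed beat avoids flagging a resource as offline because of a single delayed update.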

Files

| File | Description |
|---|---|
| `scripts/vision_agent.py` | Vision Agent server (port 5052) |
| `output_dev/vision_shots/` | Annotated detection screenshots |
| `docs/ZERO_SHOT_DETECTION_RESEARCH.md` | Full model research report |
