feat: Phase 1 handover - schema migration, correction mechanism, API fixes
Schema changes: dev.chunks -> dev.chunk, remove old_chunk_id/chunk_index
Correction: asr-1.json format, generate/apply scripts
API: 37/37 endpoints fixed and tested
Docs: HANDOVER_V2.0.md for M4
docs/VISION_AGENT_API.md
@@ -0,0 +1,201 @@
# Momentry Eye API Reference

**Vision Agent**: multi-model zero-shot object detection service.
Port: `5052` | Resource IDs: `eye-gdino`, `eye-paligemma`

---

## Models

| Model | ID | Params | Size | Confidence | Speed | License |
|-------|-----|--------|------|------------|-------|---------|
| Grounding DINO | `grounding-dino` | 232M | 891MB | ✅ 0-1 score | ~340ms | Apache 2.0 |
| PaliGemma 3B | `paligemma` | 2,923M | ~3GB | ❌ no score | ~80ms | Gemma license |

## Endpoints

### `GET /health`

System status and loaded models.

```bash
curl localhost:5052/health
```

**Response:**

```json
{
  "status": "ok",
  "models_loaded": ["grounding-dino"],
  "models_available": ["grounding-dino", "paligemma"],
  "device": "mps",
  "port": 5052
}
```
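
A client may want to confirm that a model is loaded before sending detection requests. The following is a minimal sketch that relies only on the `models_loaded` field shown in the response above; the helper names `fetch_health` and `missing_models` are hypothetical and not part of the agent.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:5052"):
    """GET /health and decode the JSON body (requires a running agent)."""
    with urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def missing_models(health, required):
    """Return the required model IDs that are absent from `models_loaded`."""
    loaded = set(health.get("models_loaded", []))
    return [m for m in required if m not in loaded]


# Offline example using the sample response documented above.
sample = {
    "status": "ok",
    "models_loaded": ["grounding-dino"],
    "models_available": ["grounding-dino", "paligemma"],
    "device": "mps",
    "port": 5052,
}
print(missing_models(sample, ["grounding-dino", "paligemma"]))  # ['paligemma']
```

Against a live agent, `missing_models(fetch_health(), [...])` would perform the same check on the real response.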

### `GET /models`

List available models with specs.

```bash
curl localhost:5052/models
```

### `POST /detect`

Detect objects in a single video frame.

```bash
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":5461, "prompt":"gun", "model":"grounding-dino"}'
```

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `uuid` | string | `aeed71342a...` | Video file UUID |
| `time` | float | `0` | Timestamp in seconds |
| `prompt` | string | `"gun"` | Object to detect |
| `model` | string | `"grounding-dino"` | Model: `grounding-dino`, `paligemma`, or `fusion` |
| `threshold` | float | `0.1` | Minimum confidence (GDINO only) |
| `weights` | object | `{"grounding-dino":0.6,"paligemma":0.4}` | Fusion weights (used only when `model` is `fusion`) |

**Fusion mode** runs both models and combines their results with weighted scoring. Default weights: GDINO 0.6, PaliGemma 0.4.

```bash
# Fusion: run both models, combine results
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"water gun", "model":"fusion"}'

# Custom fusion weights
curl localhost:5052/detect \
  -H "Content-Type: application/json" \
  -d '{"time":206, "prompt":"gun", "model":"fusion",
       "weights":{"grounding-dino":0.5,"paligemma":0.5}}'
```
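
The request bodies above can also be assembled programmatically. This is a sketch of a client-side helper, assuming only the parameter names in the table; `build_detect_payload` is a hypothetical name, and the validation it performs is a client convention rather than documented server behaviour.

```python
import json

# Model IDs accepted by the `model` parameter, per the table above.
VALID_MODELS = {"grounding-dino", "paligemma", "fusion"}


def build_detect_payload(time_s, prompt, model="grounding-dino",
                         threshold=None, weights=None, uuid=None):
    """Build a /detect JSON body, omitting parameters left at their defaults."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if weights is not None and model != "fusion":
        # Client-side guard (assumption): weights are only meaningful for fusion.
        raise ValueError("weights apply only when model is 'fusion'")
    payload = {"time": time_s, "prompt": prompt, "model": model}
    if uuid is not None:
        payload["uuid"] = uuid
    if threshold is not None:
        payload["threshold"] = threshold
    if weights is not None:
        payload["weights"] = weights
    return payload


body = json.dumps(build_detect_payload(
    206, "gun", model="fusion",
    weights={"grounding-dino": 0.5, "paligemma": 0.5},
))
print(body)
```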

**Response:**

```json
{
  "model": "grounding-dino",
  "detections": [
    {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
    {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"}
  ],
  "time_ms": 345.2,
  "n_detections": 2,
  "shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"
}
```
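
A typical consumer picks the highest-scoring detection and works with its box. The sketch below uses the sample response above and assumes `bbox` is `[x1, y1, x2, y2]` in pixels; the source does not state the box format, so treat that as an assumption. Both helper names are hypothetical.

```python
def best_detection(response):
    """Return the detection with the highest score, or None if there are none."""
    detections = response.get("detections", [])
    if not detections:
        return None
    # Detections without a score (e.g. PaliGemma) are ranked as 0.0 here.
    return max(detections, key=lambda d: d.get("score", 0.0))


def bbox_center(bbox):
    """Center of a box, assuming the [x1, y1, x2, y2] layout."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


response = {
    "model": "grounding-dino",
    "detections": [
        {"bbox": [726.2, 567.4, 969.0, 694.6], "score": 0.476, "label": "gun"},
        {"bbox": [686.7, 567.0, 969.6, 918.3], "score": 0.262, "label": "gun"},
    ],
    "time_ms": 345.2,
    "n_detections": 2,
}
top = best_detection(response)
print(top["score"], bbox_center(top["bbox"]))
```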

**Fusion response** also includes `per_model` (detections per model) and `fusion` (a deduplicated combined list in which each entry carries a `fused_score`).
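
The source does not specify how detections are deduplicated or how `fused_score` is computed. As an illustration only, the sketch below shows one plausible scheme: boxes from different models are merged when their IoU is at least 0.5, each model contributes `weight * score`, and detections without a score (as with PaliGemma) count as score 1.0. The IoU threshold, the score-1.0 convention, the `[x1, y1, x2, y2]` box layout, and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(per_model, weights, iou_threshold=0.5):
    """Merge overlapping boxes across models and sum their weighted scores."""
    fused = []
    for model, detections in per_model.items():
        weight = weights.get(model, 0.0)
        for det in detections:
            contribution = weight * det.get("score", 1.0)
            for entry in fused:
                # Merge only with a box that another model produced.
                if (model not in entry["models"]
                        and iou(entry["bbox"], det["bbox"]) >= iou_threshold):
                    entry["fused_score"] += contribution
                    entry["models"].append(model)
                    break
            else:
                fused.append({
                    "bbox": det["bbox"],
                    "label": det.get("label"),
                    "fused_score": contribution,
                    "models": [model],
                })
    return sorted(fused, key=lambda e: e["fused_score"], reverse=True)


per_model = {
    "grounding-dino": [{"bbox": [100, 100, 200, 200], "score": 0.5, "label": "gun"}],
    "paligemma": [
        {"bbox": [105, 105, 205, 205], "label": "gun"},
        {"bbox": [400, 400, 450, 450], "label": "gun"},
    ],
}
result = fuse(per_model, {"grounding-dino": 0.6, "paligemma": 0.4})
for entry in result:
    print(entry["models"], round(entry["fused_score"], 3))
```

Under this scheme a box confirmed by both models outranks a box seen by only one of them, which matches the intent of weighted fusion even if the agent's actual formula differs.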

### `POST /search`

Search for an object across a time range, sampling frames at a fixed interval.

```bash
# Natural language query
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"query":"find the gun", "range":"5400-5600", "interval":10}'
```
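
To estimate how many frames a search will touch, the `range` and `interval` values can be expanded into timestamps. This sketch assumes `range` is `start-end` in seconds and that both ends are sampled; the source does not confirm the endpoint handling, and `scan_timestamps` is a hypothetical helper.

```python
def parse_range(range_str):
    """Split a 'start-end' string into two floats (seconds)."""
    start_text, end_text = range_str.split("-", 1)
    start, end = float(start_text), float(end_text)
    if end < start:
        raise ValueError(f"range end precedes start: {range_str!r}")
    return start, end


def scan_timestamps(range_str, interval):
    """Timestamps from start to end (inclusive, an assumption) every `interval` seconds."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    start, end = parse_range(range_str)
    count = int((end - start) // interval) + 1
    return [start + i * interval for i in range(count)]


stamps = scan_timestamps("5400-5600", 10)
print(len(stamps), stamps[0], stamps[-1])  # 21 5400.0 5600.0
```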

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `query` | string | `"find the gun"` | Natural language query (parsed to extract the object) |
| `target` | string | (none) | `file_uuid:chunk_id` or `file_uuid:trace_id`; resolves to a time range |
| `range` | string | `"0-6780"` | Manual time range in seconds, as `start-end` |
| `interval` | int | `30` | Scan interval in seconds |
| `model` | string | `"grounding-dino"` | Detection model |
| `threshold` | float | `0.15` | Minimum confidence |

**Target resolution:**

| Format | Example | Resolves to |
|--------|---------|-------------|
| `file_uuid:chunk_id` | `uuid:uuid_story_90` | Chunk's time range |
| `file_uuid:trace_id` | `uuid:trace_5` | Trace's time range |
| `file_uuid:chunk_index` | `uuid:500` | Chunk index 500's range |

```bash
# Using target
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:aeed71342..._story_90", "query":"gun"}'

# Using trace
curl localhost:5052/search \
  -H "Content-Type: application/json" \
  -d '{"target":"aeed71342...:trace_5", "query":"person"}'
```
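
The three target formats can be told apart by the shape of the part after the colon. The sketch below is one way a client or server might classify them; the rule used here (a `trace_` prefix, all digits, otherwise a chunk ID) is inferred from the examples in the table and is an assumption, as are the function name and the placeholder UUID `abc123`.

```python
def parse_target(target):
    """Split 'file_uuid:ref' and classify ref as trace_id, chunk_index, or chunk_id."""
    file_uuid, sep, ref = target.partition(":")
    if not sep or not file_uuid or not ref:
        raise ValueError(f"expected 'file_uuid:ref', got {target!r}")
    if ref.startswith("trace_"):
        kind = "trace_id"
    elif ref.isdigit():
        kind = "chunk_index"
    else:
        kind = "chunk_id"
    return file_uuid, kind, ref


# 'abc123' stands in for a real file UUID.
print(parse_target("abc123:abc123_story_90"))  # ('abc123', 'chunk_id', 'abc123_story_90')
print(parse_target("abc123:trace_5"))          # ('abc123', 'trace_id', 'trace_5')
print(parse_target("abc123:500"))              # ('abc123', 'chunk_index', '500')
```

Splitting on the first colon only keeps chunk IDs intact even if they were to contain a colon themselves.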

### `POST /multimodal`

Multi-modal search across sentence chunks: combines an ASR text match with visual confirmation.

```bash
# Search for Jean-Louis: ASR match + GDINO child detection
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"Jean-Louis", "prompt":"child"}'

# Search trace chunks visually (no ASR)
curl localhost:5052/multimodal \
  -H "Content-Type: application/json" \
  -d '{"keyword":"", "prompt":"person", "chunk_type":"trace", "range":"3500-4000"}'
```
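
The two-stage idea (text match first, visual confirmation second) can be sketched as below. This is not the agent's implementation: the chunk fields `text`, `start`, and `end`, the choice of the chunk midpoint as the frame to check, and the injected `detect` callable are all assumptions made for illustration.

```python
def multimodal_search(chunks, keyword, detect, threshold=0.15):
    """Keep chunks whose text matches `keyword` and whose frame passes detection.

    `detect(time_s)` is any callable returning a list of {"score": float} dicts.
    An empty keyword skips the text filter, mirroring the visual-only example.
    """
    needle = keyword.lower()
    hits = []
    for chunk in chunks:
        if needle and needle not in chunk["text"].lower():
            continue  # ASR text does not mention the keyword
        midpoint = (chunk["start"] + chunk["end"]) / 2
        best = max((d["score"] for d in detect(midpoint)), default=0.0)
        if best >= threshold:
            hits.append({**chunk, "visual_score": best})
    return hits


def fake_detect(time_s):
    """Stand-in detector: reports a child only in the first 100 seconds."""
    return [{"score": 0.42, "label": "child"}] if time_s < 100 else []


chunks = [
    {"text": "Jean-Louis, come here", "start": 10.0, "end": 14.0},
    {"text": "Where is Jean-Louis?", "start": 200.0, "end": 203.0},
    {"text": "Nothing to see", "start": 20.0, "end": 22.0},
]
hits = multimodal_search(chunks, "jean-louis", fake_detect)
print([h["start"] for h in hits])  # [10.0]
```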

**Parameters:**

| Param | Type | Default | Description |
|-------|------|---------|-------------|
| `keyword` | string | (none) | ASR keyword to search for in sentence text; an empty string skips the ASR match |
| `prompt` | string | same as `keyword` | Visual prompt for GDINO |
| `chunk_type` | string | `"sentence"` | Chunk type: `sentence`, `trace`, `story`, or `cut` |
| `target` | string | (none) | Specific chunk target |
| `range` | string | `"0-6780"` | Time range (for non-sentence chunks) |
| `threshold` | float | `0.15` | Visual detection threshold |

### `GET /shots/<filename>`

Retrieve annotated detection images.

```bash
curl -o result.jpg localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
```
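
The `shot_url` field returned by `/detect` is a server-relative path, so a client has to join it with the agent's base URL before downloading. A minimal sketch, assuming the agent is reachable at `http://localhost:5052`; `shot_download_url` is a hypothetical helper.

```python
from urllib.parse import urljoin


def shot_download_url(base_url, detect_response):
    """Absolute URL of the annotated image, or None if no shot was produced."""
    shot = detect_response.get("shot_url")
    return urljoin(base_url, shot) if shot else None


response = {"shot_url": "/shots/aeed7134_5461s_gun_grounding-dino.jpg"}
print(shot_download_url("http://localhost:5052", response))
# http://localhost:5052/shots/aeed7134_5461s_gun_grounding-dino.jpg
```

The resulting URL can then be fetched with any HTTP client, as the `curl -o` example does.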

## Object Detection Performance Summary

| Object type | Size in frame | GDINO | PaliGemma | Best prompt |
|-------------|--------------|-------|-----------|-------------|
| Gun (realistic) | 15-30% | ✅ 0.36-0.67 | ✅ | `pistol` / `handgun` |
| Water gun (toy) | 15-31% | ❌ 0 | ✅ | `water gun` (PaliGemma) |
| Child (Jean-Louis) | 30-60% | ⚠️ 0.3-0.9 | ❌ | `child` (high FP on adults) |
| Stamp | <5% | ❌ FP | ❌ | - |
| Passport | <10% | ❌ FP | ❌ | - |
| Magnifying glass | <5% | ❌ FP | ❌ | - |
| Cup / Bottle | 5-15% | ✅ 0.3-0.5 | - | `cup` / `bottle` |
| Cell phone | 5-10% | ✅ 0.3-0.5 | - | `cell phone` |

Numbers in the GDINO column are confidence score ranges. FP = false positives.

## Resource Registration

On startup, the agent registers itself in `dev.resources` as two resources:

| Resource ID | Type | Status |
|-------------|------|--------|
| `eye-gdino` | `vision_model` | `online` |
| `eye-paligemma` | `vision_model` | `online` |

The heartbeat is updated every 60 seconds. Discover the resources with:

```sql
SELECT * FROM dev.resources WHERE resource_type = 'vision_model';
```
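
Because the status column alone cannot show that an agent has crashed, a consumer may want to compare the last heartbeat against the 60-second interval. The sketch below assumes a heartbeat timestamp is available for each resource; the source does not name such a column, and the two-missed-beats tolerance is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL_S = 60  # from the documentation above


def is_stale(last_heartbeat, now, missed_beats=2):
    """True if more than `missed_beats` heartbeat intervals have elapsed."""
    return now - last_heartbeat > timedelta(seconds=HEARTBEAT_INTERVAL_S * missed_beats)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(seconds=45), now))   # False
print(is_stale(now - timedelta(seconds=300), now))  # True
```

A resource that reports `online` but fails this check is probably no longer running.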

## Files

| File | Description |
|------|-------------|
| `scripts/vision_agent.py` | Vision Agent server (port 5052) |
| `output_dev/vision_shots/` | Annotated detection screenshots |
| `docs/ZERO_SHOT_DETECTION_RESEARCH.md` | Full model research report |