Zero-Shot Object Detection Model Research Report
Date: 2026-05-10
Goal: Evaluate models for detecting arbitrary objects in Charade (1963)
System: M5 MacBook Pro (Apple Silicon MPS, 48GB)
Tested Models
| Model | Params | Size | Resolution | Type | License |
|---|---|---|---|---|---|
| YOLOv8n fine-tune (gun) | 3.2M | 6MB | 640px | Closed-set (4 classes) | AGPL-3.0 |
| OWL-ViT base | 109M | 586MB | 384px | Zero-shot | Apache 2.0 |
| Grounding DINO Base | 232M | 891MB | 384px | Zero-shot | Apache 2.0 |
| Grounding DINO Large | 232M | 895MB | 384px | Zero-shot | Apache 2.0 |
| Florence-2 Base | 231M | ~3GB | 384px | Zero-shot (generative) | MIT |
| Florence-2 Large | 776M | ~6GB | 384px | Zero-shot (generative) | MIT |
| PaliGemma 3B mix-224 | 2,923M | ~3GB | 224px | Zero-shot (generative) | Gemma license |
| PaliGemma 3B mix-448 | 2,923M | ~6GB | 448px | Zero-shot (generative) | Gemma license |
Detection Performance on Charade
Large Objects (gun)
| Model | 8 timepoints | Best confidence | Runtime |
|---|---|---|---|
| YOLOv8n fine-tune | ❌ 0/5 (all FP) | 0.45 (stamp→pistol) | 0.03s |
| OWL-ViT | ❌ 2/8 | 0.054 | 3.4s |
| Grounding DINO Base | ✅ 8/8 | 0.499 | 0.33s |
| PaliGemma 3B mix-224 | ⚠️ 3/8 | n/a (no confidence output) | 0.5-3s |
Small Objects (stamp, passport, magnifying glass)
| Model | Stamp | Passport | Magnifying glass |
|---|---|---|---|
| Grounding DINO Base | ❌ FP (~0.3) | ❌ FP (~0.4) | ❌ FP (~0.3-0.5) |
| PaliGemma 3B mix-224 | ❌ no det | ❌ no det | not tested |
| PaliGemma 3B mix-448 | not tested | not tested | not tested |
Every model tested fails on objects smaller than roughly 50px in the native 1920×1080 frame.
Other Objects
| Object | YOLO COCO | Grounding DINO | Notes |
|---|---|---|---|
| knife | ✅ 368 frames | ✅ 84 hits | Small but detectable |
| cup | ✅ | ✅ 13 hits | Moderate size |
| bottle | ✅ | ✅ 12 hits | Moderate size |
| cell phone | ✅ | ✅ 5 hits | Hand-held |
| book | ✅ | ✅ 3 hits | Hand-held |
| car | ✅ | ✅ 9 hits | Large object |
| tie | ✅ | ✅ 139 hits | On-person (worn, not held) |
Detailed Model Analysis
Grounding DINO Base (Recommended)
Scores: Detection confidence 0.1-0.5 (typical for zero-shot)
Timing per frame (MPS):
| Component | Time | % of total |
|---|---|---|
| Processor (text+image) | 17ms | 5% |
| Model inference | 310ms | 93% |
| Post-processing | 5ms | 2% |
| Total | 331ms | 100% |
Multi-prompt batching: 8 prompts in 335ms (42ms/prompt vs 309ms single)
Memory: ~1GB (MPS)
License: Apache 2.0; commercial use is permitted, subject to retaining the license and NOTICE file
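The batching figure above can be checked with a little arithmetic. The sketch below assumes the prompts are combined into one text query using the lowercase, dot-separated convention that Grounding DINO expects; the report does not say whether batching used one combined query or a batch of separate queries, and the label list and helper names are illustrative.

```python
def build_prompt(labels):
    """Join class names into one text query: lowercase, ' . ' separated, trailing dot."""
    cleaned = [label.strip().lower() for label in labels if label.strip()]
    return " . ".join(cleaned) + " ."


def amortized_ms(total_ms, n_prompts):
    """Average latency per prompt when n_prompts share one forward pass."""
    return total_ms / n_prompts


# Hypothetical label set; the report only states that 8 prompts were batched.
labels = ["Gun", "knife", "stamp", "passport",
          "magnifying glass", "cup", "bottle", "book"]

prompt = build_prompt(labels)
per_prompt = amortized_ms(335, len(labels))  # 335 ms total, from the report
speedup = 309 / per_prompt                   # 309 ms single-prompt, from the report

print(prompt)
print(f"{per_prompt:.1f} ms/prompt, {speedup:.1f}x faster than single prompts")
```

With the report's numbers, the per-prompt cost comes to just under 42ms, which matches the stated 42ms/prompt.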
Grounding DINO Large
Result: Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as HuggingFace. The actual 7-dataset version (56.7 AP) was never released.
Verdict: Do not use. Base is identical and simpler.
OWL-ViT
Result: Almost useless for this task. Maximum confidence is 0.054, and the gun is detected at only 2 of 8 timepoints.
Verdict: Do not use.
Florence-2
Issue: prepare_inputs_for_generation bug in current transformers version. Cannot run inference without patching model code.
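Task format: Uses task tokens (<OD>) instead of arbitrary text prompts. The <OD> task runs generic object detection with no text query, so "detect gun" cannot be expressed with it. Text-conditioned task tokens such as <OPEN_VOCABULARY_DETECTION> could not be evaluated because of the bug above.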
Verdict: Cannot use in current environment.
PaliGemma
Result: Works for gun detection (3/8) but misses small objects entirely.
Key limitation: No confidence score output (generative model). Either outputs bbox or nothing.
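To illustrate the "bbox or nothing" behaviour, here is a minimal parsing sketch. It assumes PaliGemma's detection output follows the published convention of four `<locNNNN>` tokens per box, in the order y_min, x_min, y_max, x_max, with values binned to 0-1023, followed by the label. The function name and sample strings are hypothetical, not taken from the report's runs.

```python
import re

# Four location tokens followed by a label; multiple detections are separated by ';'.
LOC_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text, width, height):
    """Convert generated text into pixel boxes (x0, y0, x1, y1).

    Returns an empty list when the model emitted no location tokens.
    There is no score to threshold: a box is either present or absent.
    """
    detections = []
    for y0, x0, y1, x1, label in LOC_PATTERN.findall(text):
        detections.append({
            "label": label.strip(),
            "box": (
                int(x0) / 1024 * width,
                int(y0) / 1024 * height,
                int(x1) / 1024 * width,
                int(y1) / 1024 * height,
            ),
        })
    return detections


sample = "<loc0256><loc0512><loc0768><loc0768> gun ; <loc0000><loc0000><loc0512><loc0512> gun"
hits = parse_detections(sample, 1920, 1080)
misses = parse_detections("no gun visible", 1920, 1080)
print(hits, misses)
```

Because the output carries no confidence value, false positives cannot be filtered by threshold the way they can with Grounding DINO.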
Issues:
- 224px variant: Too low resolution for small objects
- 448px variant: 6GB download; expected to resolve more detail, but untested
- The Gemma license may restrict commercial use, unlike Apache 2.0
Verdict: Inferior to Grounding DINO for this use case.
YOLOv8n Fine-tune (Gun Detector)
| Property | Value |
|---|---|
| Dataset | 905 images (Roboflow CC BY 4.0) |
| Classes | grenade, knife, pistol, rifle |
| Validation mAP50 | 0.813 |
| Charade FP rate | 100% (all false positives) |
Root cause: Training images are close-up gun photos; Charade has distant/partial guns. Distribution mismatch makes this model unusable.
Verdict: Requires a completely new training dataset.
Root Cause Analysis: Small Object Failure
Grounding DINO's Resolution Limit
Grounding DINO processes images at 384×384px. At this resolution:
1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token
For comparison:
- Gun at 200×200px (close-up) → 40×40px → still detectable
- Stamp at 30×30px → 6×6px → lost in downsampling
- Passport at 80×120px → 16×24px → barely visible
- Magnifying glass at 40×40px → 8×8px → lost
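The sizes above follow directly from the 5:1 reduction. A small sketch of that arithmetic, assuming a uniform scale based on frame width (1920px to 384px) as the report does; the function name is illustrative.

```python
def downsampled_size(obj_w, obj_h, frame_w=1920, input_w=384):
    """Object size in pixels after the frame is resized to the model input width."""
    return round(obj_w * input_w / frame_w), round(obj_h * input_w / frame_w)


# Object sizes in the native frame, as listed in the report.
objects = {
    "gun (close-up)": (200, 200),
    "stamp": (30, 30),
    "passport": (80, 120),
    "magnifying glass": (40, 40),
}

sizes = {name: downsampled_size(w, h) for name, (w, h) in objects.items()}
for name, (w, h) in sizes.items():
    print(f"{name}: {w}x{h}px at model input")
```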
Potential Solutions
| Solution | Pros | Cons | Feasibility |
|---|---|---|---|
| Crop + zoom on person region | Leverages existing YOLO person detections | Requires two-stage pipeline | ✅ High |
| PaliGemma 448px | 448px native (36% more pixels than a 384px input) | 6GB, requires download | ⚠️ Medium |
| YOLO fine-tune on stamps | Fast inference (6MB) | Need 200+ training images | ⚠️ Medium |
| Grounding DINO + tiling | Split image into tiles, run per tile | 4-9x slower | ⚠️ Medium |
| Florence-2 448px | Higher resolution | Bug in transformers | ❌ Low |
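The tiling option can be sketched as follows. This is a minimal illustration rather than a tested pipeline: the grid sizes, the overlap value, and the helper names are assumptions, and detector boxes are assumed to be reported in tile pixel coordinates before any resize.

```python
def make_tiles(width, height, cols, rows, overlap_px=0):
    """Return (x0, y0, x1, y1) tiles covering the frame.

    Each tile is extended by overlap_px on every side and clipped to the frame,
    so objects on a tile border appear whole in at least one tile.
    """
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0 = max(0, c * width // cols - overlap_px)
            y0 = max(0, r * height // rows - overlap_px)
            x1 = min(width, (c + 1) * width // cols + overlap_px)
            y1 = min(height, (r + 1) * height // rows + overlap_px)
            tiles.append((x0, y0, x1, y1))
    return tiles


def to_frame_coords(box, tile):
    """Shift a box detected inside a tile back into full-frame coordinates."""
    x0, y0, _, _ = tile
    bx0, by0, bx1, by1 = box
    return (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0)


tiles_2x2 = make_tiles(1920, 1080, cols=2, rows=2)                 # 4 detector calls
tiles_3x3 = make_tiles(1920, 1080, cols=3, rows=3, overlap_px=32)  # 9 detector calls
print(len(tiles_2x2), len(tiles_3x3))
print(to_frame_coords((10, 20, 30, 40), tiles_2x2[3]))
```

With a 2x2 grid each 960px-wide tile is reduced 2.5:1 instead of 5:1, so a 50px object arrives at about 20px, at the cost of four detector calls per frame.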
Hand-Held Object Detection Feasibility
Available Data Sources
| Source | Type | Coverage | Usefulness |
|---|---|---|---|
| YOLO pre_chunks | Object detections | 169,625 frames | ✅ Every frame |
| Pose pre_chunks | Body keypoints (left_wrist, right_wrist) | 4,269 frames | ✅ Hand location |
| Grounding DINO | Zero-shot classification | On-demand | ✅ Object ID |
| ASR dialogue | Text mentions | 4,188 chunks | ✅ "holding a gun" |
Approach: YOLO + Pose + Grounding DINO
Frame
→ YOLO: Find person + objects
→ Pose: Find wrist keypoints
→ Check: Object bbox overlaps with hand region (wrist ±100px)
→ Grounding DINO: Verify object class
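The overlap check in the third step can be sketched as below. The data shapes (boxes as `(x0, y0, x1, y1)` tuples, wrists as `(x, y)` points) and the function names are assumptions for illustration; only the ±100px margin comes from the pipeline above.

```python
def hand_region(wrist_xy, margin=100):
    """Square region around a wrist keypoint (wrist ±margin px)."""
    x, y = wrist_xy
    return (x - margin, y - margin, x + margin, y + margin)


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def held_candidates(objects, wrists, margin=100):
    """Keep objects whose bbox overlaps the hand region of any wrist.

    objects: list of (label, bbox); wrists: list of (x, y) keypoints.
    Survivors are only candidates; they still need class verification.
    """
    regions = [hand_region(w, margin) for w in wrists]
    return [
        (label, box)
        for label, box in objects
        if any(boxes_overlap(box, region) for region in regions)
    ]


# Hypothetical detections for one frame.
wrists = [(900, 600), (1100, 620)]
objects = [("knife", (950, 650, 1010, 700)), ("car", (100, 100, 500, 400))]
candidates = held_candidates(objects, wrists)
print(candidates)
```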
Known Limitations
- Pose frame alignment: Pose data (4,269 frames) doesn't always overlap with YOLO data at the same frame
- Object proximity ≠ holding: YOLO objects near hands may be background, not held
- Small object blind spot: Stamps, magnifying glasses at hand positions are too small to detect
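One way to soften the frame-alignment limitation is to match each YOLO frame to the nearest pose frame within a small tolerance. This is a sketch under assumptions: `max_gap` is a hypothetical tuning parameter, and the frame indices are made up.

```python
from bisect import bisect_left


def nearest_pose_frame(frame_idx, pose_frames, max_gap=2):
    """Return the pose frame closest to frame_idx, or None if none is within max_gap.

    pose_frames must be a sorted list of frame indices that have pose data.
    """
    if not pose_frames:
        return None
    pos = bisect_left(pose_frames, frame_idx)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = pose_frames[max(0, pos - 1):pos + 1]
    best = min(neighbours, key=lambda f: abs(f - frame_idx))
    return best if abs(best - frame_idx) <= max_gap else None


pose_frames = [10, 50, 90]                      # hypothetical sparse pose coverage
print(nearest_pose_frame(11, pose_frames))      # close enough to frame 10
print(nearest_pose_frame(30, pose_frames))      # too far from any pose frame
```

Frames that return `None` have no usable wrist position and would need to skip the hand-proximity check.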
Recommendations
| Priority | Action | Rationale |
|---|---|---|
| 1 | Use Grounding DINO Base (Apache 2.0) | Best zero-shot detector, proven on guns, clean license |
| 2 | Two-stage pipeline for small objects | YOLO person box → crop → upscale → Grounding DINO |
| 3 | Pose wrist alignment for hand-held confirmation | Reduce false positives by requiring hand proximity |
| 4 | Replace Grounding DINO "Large" references with Base | Large has identical weights, so it offers no benefit |
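The coordinate bookkeeping for the two-stage pipeline (priority 2) could look like the sketch below. The padding amount, the upscale factor, and the function names are illustrative assumptions; the detector call itself is omitted.

```python
def padded_crop(person_box, frame_w, frame_h, pad_px=40):
    """Expand a person box by pad_px on each side, clipped to the frame."""
    x0, y0, x1, y1 = person_box
    return (
        max(0, x0 - pad_px),
        max(0, y0 - pad_px),
        min(frame_w, x1 + pad_px),
        min(frame_h, y1 + pad_px),
    )


def crop_to_frame(box_in_crop, crop, upscale):
    """Map a box found in the upscaled crop back to full-frame pixels."""
    cx0, cy0, _, _ = crop
    offsets = (cx0, cy0, cx0, cy0)
    return tuple(v / upscale + off for v, off in zip(box_in_crop, offsets))


# Hypothetical person detection in a 1920x1080 frame.
crop = padded_crop((800, 200, 1100, 1000), 1920, 1080)
# Suppose the detector, run on the crop upscaled 2x, returns this box.
frame_box = crop_to_frame((100, 200, 160, 260), crop, upscale=2)
print(crop, frame_box)
```

Upscaling adds no information, but cropping first means the detector's fixed input resolution is spent on the person region instead of the whole frame.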
Appendix: License Summary
| Model | License | Commercial Use | Requires |
|---|---|---|---|
| Grounding DINO | Apache 2.0 | ✅ Yes | NOTICE file |
| OWL-ViT | Apache 2.0 | ✅ Yes | NOTICE file |
| PaliGemma | Gemma license | ⚠️ Needs review | Google ToS |
| Florence-2 | MIT | ✅ Yes | Copyright notice |
| YOLOv8 | AGPL-3.0 | ⚠️ Needs license | Open source or paid |