# Zero-Shot Object Detection Model Research Report

**Date:** 2026-05-10

**Goal:** Evaluate models for detecting arbitrary objects in Charade (1963)

**System:** M5 MacBook Pro (Apple Silicon MPS, 48GB)

---

## Tested Models

| Model | Params | Size | Resolution | Type | License |
|-------|--------|------|------------|------|---------|
| YOLOv8n fine-tune (gun) | 3.2M | 6MB | 640px | Closed-set (4 classes) | AGPL-3.0 |
| OWL-ViT base | 109M | 586MB | 384px | Zero-shot | Apache 2.0 |
| **Grounding DINO Base** | **232M** | **891MB** | **384px** | **Zero-shot** | **Apache 2.0** |
| Grounding DINO Large | 232M | 895MB | 384px | Zero-shot | Apache 2.0 |
| Florence-2 Base | 231M | ~3GB | 384px | Zero-shot (generative) | MIT |
| Florence-2 Large | 776M | ~6GB | 384px | Zero-shot (generative) | MIT |
| PaliGemma 3B mix-224 | 2,923M | ~3GB | 224px | Zero-shot (generative) | Gemma license |
| PaliGemma 3B mix-448 | 2,923M | ~6GB | 448px | Zero-shot (generative) | Gemma license |

## Detection Performance on Charade

### Large Objects (gun)

| Model | Detections (8 timepoints) | Best confidence | Runtime |
|-------|---------------------------|-----------------|---------|
| YOLOv8n fine-tune | ❌ 0/5 (all FP) | 0.45 (stamp→pistol) | 0.03s |
| OWL-ViT | ❌ 2/8 | 0.054 | 3.4s |
| **Grounding DINO Base** | **✅ 8/8** | **0.499** | **0.33s** |
| PaliGemma 3B mix-224 | ✅ 3/8 (gun), 3/8 overall | 0.499 | 0.5-3s |

### Small Objects (stamp, passport, magnifying glass)

| Model | Stamp | Passport | Magnifying glass |
|-------|-------|----------|------------------|
| Grounding DINO Base | ❌ FP (~0.3) | ❌ FP (~0.4) | ❌ FP (~0.3-0.5) |
| PaliGemma 3B mix-224 | ❌ no det | ❌ no det | not tested |
| PaliGemma 3B mix-448 | not tested | not tested | not tested |

**All tested models fail on objects smaller than ~50px at the native 1920x1080 frame resolution.**

### Other Objects

| Object | YOLO COCO | Grounding DINO | Notes |
|--------|-----------|----------------|-------|
| knife | ✅ 368 frames | ✅ 84 hits | Small but detectable |
| cup | ✅ | ✅ 13 hits | Moderate size |
| bottle | ✅ | ✅ 12 hits | Moderate size |
| cell phone | ✅ | ✅ 5 hits | Hand-held |
| book | ✅ | ✅ 3 hits | Hand-held |
| car | ✅ | ✅ 9 hits | Large object |
| tie | ✅ | ✅ 139 hits | On-person (worn, not held) |

## Detailed Model Analysis

### Grounding DINO Base (Recommended)

**Scores:** Detection confidence 0.1-0.5 (typical for zero-shot)

**Timing per frame (MPS):**

| Component | Time | % of total |
|-----------|------|------------|
| Processor (text+image) | 17ms | 5% |
| Model inference | 310ms | 93% |
| Post-processing | 5ms | 2% |
| **Total** | **331ms** | **100%** |

**Multi-prompt batching:** 8 prompts in 335ms (42ms/prompt vs 309ms for a single prompt)

**Memory:** ~1GB (MPS)

**License:** Apache 2.0, which permits commercial use

### Grounding DINO Large

**Result:** Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as the one on HuggingFace. The actual 7-dataset version (56.7 AP) was never released.

**Verdict: Do not use.** Base is identical and simpler.

### OWL-ViT

**Result:** Almost useless for this task. Maximum confidence is 0.054, and it detects the gun at only 2/8 timepoints.

**Verdict: Do not use.**

### Florence-2

**Issue:** `prepare_inputs_for_generation` bug in the current transformers version. Inference cannot run without patching the model code.

**Task format:** Uses task tokens instead of arbitrary text prompts. It cannot do "detect gun" directly; it only offers generic object detection.

**Verdict: Cannot use in current environment.**

### PaliGemma

**Result:** Works for gun detection (3/8) but misses small objects entirely.

**Key limitation:** No confidence score output (generative model). It either outputs a bbox or nothing.
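Because PaliGemma returns generated text rather than scored boxes, downstream code has to parse that text, and an empty parse is the only "no detection" signal. The following is a minimal sketch, not the code used in this evaluation. It assumes the detection output format described in Google's PaliGemma documentation: four `<locNNNN>` tokens per object in `y_min, x_min, y_max, x_max` order, binned to 0-1023, followed by the label, with multiple objects separated by `;`. The `Box` type and `parse_detections` helper are hypothetical names introduced for illustration.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical container for one parsed detection (pixel coordinates)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Assumed format: <locYYYY><locXXXX><locYYYY><locXXXX> label [; ...]
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text: str, width: int, height: int) -> List[Box]:
    """Convert generated text into pixel-space boxes.

    Returns an empty list when the model produced no location tokens,
    which is the only way a "no detection" result shows up.
    """
    boxes = []
    for match in _DETECTION.finditer(text):
        # Token values are bins in [0, 1023]; divide by 1024 to normalize.
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        boxes.append(
            Box(match.group(5).strip(), x0 * width, y0 * height,
                x1 * width, y1 * height)
        )
    return boxes


if __name__ == "__main__":
    sample = "<loc0256><loc0512><loc0768><loc1023> gun"
    print(parse_detections(sample, width=1920, height=1080))
```

Since no score is available, any thresholding has to happen elsewhere, for example by requiring agreement with a second detector.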
**Issues:**
- 224px variant: Resolution is too low for small objects
- 448px variant: 6GB download; expected to resolve more detail, but untested
- The Gemma license may restrict commercial use compared with Apache 2.0

**Verdict: Inferior to Grounding DINO for this use case.**

### YOLOv8n Fine-tune (Gun Detector)

| Property | Value |
|----------|-------|
| Dataset | 905 images (Roboflow CC BY 4.0) |
| Classes | grenade, knife, pistol, rifle |
| Validation mAP50 | 0.813 |
| Charade FP rate | **100%** (all false positives) |

**Root cause:** The training images are close-up gun photos, while Charade shows distant or partially visible guns. This distribution mismatch makes the model unusable.

**Verdict: Requires a completely new training dataset.**

## Root Cause Analysis: Small Object Failure

### Grounding DINO's Resolution Limit

Grounding DINO processes images at **384×384px**. At this resolution:

```
1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token
```

For comparison:
- **Gun** at 200×200px (close-up) → 40×40px → still detectable
- **Stamp** at 30×30px → 6×6px → lost in downsampling
- **Passport** at 80×120px → 16×24px → barely visible
- **Magnifying glass** at 40×40px → 8×8px → lost

### Potential Solutions

| Solution | Pros | Cons | Feasibility |
|----------|------|------|-------------|
| **Crop + zoom** on person region | Leverages existing YOLO person detections | Requires two-stage pipeline | ✅ High |
| **PaliGemma 448px** | 448px native input (about 36% more pixels than 384px) | 6GB, requires download | ⚠️ Medium |
| **YOLO fine-tune on stamps** | Fast inference (6MB) | Needs 200+ training images | ⚠️ Medium |
| **Grounding DINO + tiling** | Split image into tiles, run per tile | 4-9x slower | ⚠️ Medium |
| **Florence-2 448px** | Higher resolution | Bug in transformers | ❌ Low |

## Hand-Held Object Detection Feasibility

### Available Data Sources

| Source | Type | Coverage | Usefulness |
|--------|------|----------|------------|
| YOLO `pre_chunks` | Object detections | 169,625 frames | ✅ Every frame |
| Pose `pre_chunks` | Body keypoints (left_wrist, right_wrist) | 4,269 frames | ✅ Hand location |
| Grounding DINO | Zero-shot classification | On-demand | ✅ Object ID |
| ASR dialogue | Text mentions | 4,188 chunks | ✅ "holding a gun" |

### Approach: YOLO + Pose + Grounding DINO

```
Frame → YOLO: Find person + objects
      → Pose: Find wrist keypoints
      → Check: Object bbox overlaps with hand region (wrist ±100px)
      → Grounding DINO: Verify object class
```

### Known Limitations

1. **Pose frame alignment:** Pose data (4,269 frames) does not always cover the same frames as the YOLO data
2. **Object proximity ≠ holding:** YOLO objects near hands may be in the background rather than held
3. **Small object blind spot:** Stamps and magnifying glasses at hand positions are too small to detect

## Recommendations

| Priority | Action | Rationale |
|----------|--------|-----------|
| 1 | Use Grounding DINO Base (Apache 2.0) | Best zero-shot detector, proven on guns, clean license |
| 2 | Two-stage pipeline for small objects | YOLO person box → crop → upscale → Grounding DINO |
| 3 | Pose wrist alignment for hand-held confirmation | Reduces false positives by requiring hand proximity |
| 4 | Replace references to Grounding DINO "Large" with Base | Large has identical weights, so it offers no benefit |

## Appendix: License Summary

| Model | License | Commercial Use | Requires |
|-------|---------|----------------|----------|
| Grounding DINO | **Apache 2.0** | ✅ Yes | NOTICE file |
| OWL-ViT | Apache 2.0 | ✅ Yes | NOTICE file |
| PaliGemma | Gemma license | ⚠️ Needs review | Google ToS |
| Florence-2 | MIT | ✅ Yes | Copyright notice |
| YOLOv8 | AGPL-3.0 | ⚠️ Needs license | Open source or paid |
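
The hand-proximity check from the YOLO + Pose + Grounding DINO approach (object bbox overlapping the wrist ±100px region) can be sketched as below. This is an illustration under stated assumptions rather than the pipeline's actual code: the object records are assumed to be dicts with `label` and `bbox` keys, boxes are `(x_min, y_min, x_max, y_max)` in frame pixels, and the helper names `hand_region`, `boxes_overlap`, and `held_candidates` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]               # (x, y) wrist keypoint


def hand_region(wrist: Point, margin: float = 100.0) -> BBox:
    """Square region of +/- margin pixels around a wrist keypoint."""
    x, y = wrist
    return (x - margin, y - margin, x + margin, y + margin)


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def held_candidates(
    objects: Sequence[Dict],
    wrists: Sequence[Point],
    margin: float = 100.0,
) -> List[Dict]:
    """Keep objects whose bbox overlaps the region around any wrist.

    The result is only a candidate list: proximity does not prove the
    object is held, so a verification step is still needed afterwards.
    """
    regions = [hand_region(w, margin) for w in wrists]
    return [
        obj for obj in objects
        if any(boxes_overlap(obj["bbox"], region) for region in regions)
    ]


if __name__ == "__main__":
    # Made-up detections for one frame (assumed record layout).
    detections = [
        {"label": "cup", "bbox": (550.0, 450.0, 620.0, 520.0)},
        {"label": "car", "bbox": (1000.0, 100.0, 1400.0, 400.0)},
    ]
    print(held_candidates(detections, wrists=[(500.0, 500.0)]))
```

A fixed 100px margin ignores how large the person appears in the frame; scaling the margin by the person box height would be one possible refinement, though it is untested here.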