# Zero-Shot Object Detection Model Research Report

**Date:** 2026-05-10
**Goal:** Evaluate models for detecting arbitrary objects in Charade (1963)
**System:** M5 MacBook Pro (Apple Silicon MPS, 48GB)

---

## Tested Models

| Model | Params | Size | Resolution | Type | License |
|-------|--------|------|------------|------|---------|
| YOLOv8n fine-tune (gun) | 3.2M | 6MB | 640px | Closed-set (4 classes) | AGPL-3.0 |
| OWL-ViT base | 109M | 586MB | 384px | Zero-shot | Apache 2.0 |
| **Grounding DINO Base** | **232M** | **891MB** | **384px** | **Zero-shot** | **Apache 2.0** |
| Grounding DINO Large | 232M | 895MB | 384px | Zero-shot | Apache 2.0 |
| Florence-2 Base | 231M | ~3GB | 384px | Zero-shot (generative) | MIT |
| Florence-2 Large | 776M | ~6GB | 384px | Zero-shot (generative) | MIT |
| PaliGemma 3B mix-224 | 2,923M | ~3GB | 224px | Zero-shot (generative) | Gemma license |
| PaliGemma 3B mix-448 | 2,923M | ~6GB | 448px | Zero-shot (generative) | Gemma license |

## Detection Performance on Charade

### Large Objects (gun)

| Model | 8 timepoints | Best confidence | Runtime |
|-------|-------------|----------------|---------|
| YOLOv8n fine-tune | ❌ 0/5 (all FP) | 0.45 (stamp→pistol) | 0.03s |
| OWL-ViT | ❌ 2/8 | 0.054 | 3.4s |
| **Grounding DINO Base** | **✅ 8/8** | **0.499** | **0.33s** |
| PaliGemma 3B mix-224 | ✅ 3/8 (gun), 3/8 overall | 0.499 | 0.5-3s |

### Small Objects (stamp, passport, magnifying glass)

| Model | Stamp | Passport | Magnifying glass |
|-------|-------|----------|-----------------|
| Grounding DINO Base | ❌ FP (~0.3) | ❌ FP (~0.4) | ❌ FP (~0.3-0.5) |
| PaliGemma 3B mix-224 | ❌ no det | ❌ no det | not tested |
| PaliGemma 3B mix-448 | ❌ (not tested) | ❌ (not tested) | ❌ (not tested) |

**All models fail on objects smaller than ~50px at native 1920x1080 resolution.**

### Other Objects

| Object | YOLO COCO | Grounding DINO | Notes |
|--------|-----------|----------------|-------|
| knife | ✅ 368 frames | ✅ 84 hits | Small but detectable |
| cup | ✅ | ✅ 13 hits | Moderate size |
| bottle | ✅ | ✅ 12 hits | Moderate size |
| cell phone | ✅ | ✅ 5 hits | Hand-held |
| book | ✅ | ✅ 3 hits | Hand-held |
| car | ✅ | ✅ 9 hits | Large object |
| tie | ✅ | ✅ 139 hits | On-person (worn, not held) |

## Detailed Model Analysis

### Grounding DINO Base (Recommended)

**Scores:** Detection confidence 0.1-0.5 (typical for zero-shot)

**Timing per frame (MPS):**
| Component | Time | % of total |
|-----------|------|------------|
| Processor (text+image) | 17ms | 5% |
| Model inference | 310ms | 93% |
| Post-processing | 5ms | 2% |
| **Total** | **331ms** | **100%** |

**Multi-prompt batching:** 8 prompts in 335ms (42ms/prompt vs 309ms single)

**Memory:** ~1GB (MPS)

**License:** Apache 2.0 — fully commercial, no restrictions

### Grounding DINO Large

**Result:** Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as HuggingFace. The actual 7-dataset version (56.7 AP) was never released.

**Verdict: Do not use.** Base is identical and simpler.

### OWL-ViT

**Result:** Almost useless for this task. Max confidence 0.054. Detect only 2/8 timepoints.

**Verdict: Do not use.**

### Florence-2

**Issue:** `prepare_inputs_for_generation` bug in current transformers version. Cannot run inference without patching model code.

**Task format:** Uses task tokens (`<OD>`) instead of arbitrary text prompts. Cannot do "detect gun" directly — uses generic object detection.

**Verdict: Cannot use in current environment.**

### PaliGemma

**Result:** Works for gun detection (3/8) but misses small objects entirely.

**Key limitation:** No confidence score output (generative model). Either outputs bbox or nothing.

**Issues:**
- 224px variant: Too low resolution for small objects
- 448px variant: 6GB download, suspected better for detail but untested
- Gemma license may restrict commercial use vs Apache 2.0

**Verdict: Inferior to Grounding DINO for this use case.**

### YOLOv8n Fine-tune (Gun Detector)

| Dataset | 905 images (Roboflow CC BY 4.0) |
| Classes | grenade, knife, pistol, rifle |
| Validation mAP50 | 0.813 |
| Charade FP rate | **100%** (all false positives) |

**Root cause:** Training images are close-up gun photos; Charade has distant/partial guns. Distribution mismatch makes this model unusable.

**Verdict: Requires completely new training dataset.**

## Root Cause Analysis: Small Object Failure

### Grounding DINO's Resolution Limit

Grounding DINO processes images at **384×384px**. At this resolution:

```
1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token
```

For comparison:
- **Gun** at 200×200px (close-up) → 40×40px → still detectable
- **Stamp** at 30×30px → 6×6px → lost in downsampling
- **Passport** at 80×120px → 16×24px → barely visible
- **Magnifying glass** at 40×40px → 8×8px → lost

### Potential Solutions

| Solution | Pros | Cons | Feasibility |
|----------|------|------|-------------|
| **Crop + zoom** on person region | Leverages existing YOLO person detections | Requires two-stage pipeline | ✅ High |
| **PaliGemma 448px** | 448px native (36% more detail) | 6GB, requires download | ⚠️ Medium |
| **YOLO fine-tune on stamps** | Fast inference (6MB) | Need 200+ training images | ⚠️ Medium |
| **Grounding DINO + tiling** | Split image into tiles, run per tile | 4-9x slower | ⚠️ Medium |
| **Florence-2 448px** | Higher resolution | Bug in transformers | ❌ Low |

## Hand-Held Object Detection Feasibility

### Available Data Sources

| Source | Type | Coverage | Usefulness |
|--------|------|----------|------------|
| YOLO `pre_chunks` | Object detections | 169,625 frames | ✅ Every frame |
| Pose `pre_chunks` | Body keypoints (left_wrist, right_wrist) | 4,269 frames | ✅ Hand location |
| Grounding DINO | Zero-shot classification | On-demand | ✅ Object ID |
| ASR dialogue | Text mentions | 4,188 chunks | ✅ "holding a gun" |

### Approach: YOLO + Pose + Grounding DINO

```
Frame
  → YOLO: Find person + objects
  → Pose: Find wrist keypoints
  → Check: Object bbox overlaps with hand region (wrist ±100px)
  → Grounding DINO: Verify object class
```

### Known Limitations

1. **Pose frame alignment:** Pose data (4,269 frames) doesn't always overlap with YOLO data at the same frame
2. **Object proximity ≠ holding:** YOLO objects near hands may be background, not held
3. **Small object blind spot:** Stamps, magnifying glasses at hand positions are too small to detect

## Recommendations

| Priority | Action | Rationale |
|----------|--------|-----------|
| 1 | Use Grounding DINO Base (Apache 2.0) | Best zero-shot detector, proven on guns, clean license |
| 2 | Two-stage pipeline for small objects | YOLO person box → crop → upscale → Grounding DINO |
| 3 | Pose wrist alignment for hand-held confirmation | Reduce false positives by requiring hand proximity |
| 4 | Replace Grounding DINO "Large" ref with Base | Large is identical weights, no benefit |

## Appendix: License Summary

| Model | License | Commercial Use | Requires |
|-------|---------|---------------|----------|
| Grounding DINO | **Apache 2.0** | ✅ Yes | NOTICE file |
| OWL-ViT | Apache 2.0 | ✅ Yes | NOTICE file |
| PaliGemma | Gemma license | ⚠️ Needs review | Google ToS |
| Florence-2 | MIT | ✅ Yes | Copyright notice |
| YOLOv8 | AGPL-3.0 | ⚠️ Needs license | Open source or paid |