
Zero-Shot Object Detection Model Research Report

Date: 2026-05-10

Goal: Evaluate models for detecting arbitrary objects in Charade (1963)

System: M5 MacBook Pro (Apple Silicon MPS, 48GB)


Tested Models

| Model | Params | Size | Resolution | Type | License |
|---|---|---|---|---|---|
| YOLOv8n fine-tune (gun) | 3.2M | 6MB | 640px | Closed-set (4 classes) | AGPL-3.0 |
| OWL-ViT base | 109M | 586MB | 384px | Zero-shot | Apache 2.0 |
| Grounding DINO Base | 232M | 891MB | 384px | Zero-shot | Apache 2.0 |
| Grounding DINO Large | 232M | 895MB | 384px | Zero-shot | Apache 2.0 |
| Florence-2 Base | 231M | ~3GB | 384px | Zero-shot (generative) | MIT |
| Florence-2 Large | 776M | ~6GB | 384px | Zero-shot (generative) | MIT |
| PaliGemma 3B mix-224 | 2,923M | ~3GB | 224px | Zero-shot (generative) | Gemma license |
| PaliGemma 3B mix-448 | 2,923M | ~6GB | 448px | Zero-shot (generative) | Gemma license |

Detection Performance on Charade

Large Objects (gun)

| Model | Hits (8 timepoints) | Best confidence | Runtime |
|---|---|---|---|
| YOLOv8n fine-tune | 0/5 (all FP) | 0.45 (stamp→pistol) | 0.03s |
| OWL-ViT | 2/8 | 0.054 | 3.4s |
| Grounding DINO Base | 8/8 | 0.499 | 0.33s |
| PaliGemma 3B mix-224 | 3/8 (gun), 3/8 overall | 0.499 | 0.5-3s |

Small Objects (stamp, passport, magnifying glass)

| Model | Stamp | Passport | Magnifying glass |
|---|---|---|---|
| Grounding DINO Base | FP (~0.3) | FP (~0.4) | FP (~0.3-0.5) |
| PaliGemma 3B mix-224 | no det | no det | not tested |
| PaliGemma 3B mix-448 | (not tested) | (not tested) | (not tested) |

All tested models fail on objects smaller than ~50px at the native 1920x1080 resolution. PaliGemma 3B mix-448 was not tested.

Other Objects

| Object | Detections | Notes |
|---|---|---|
| knife | 368 frames (YOLO COCO), 84 hits (Grounding DINO) | Small but detectable |
| cup | 13 hits | Moderate size |
| bottle | 12 hits | Moderate size |
| cell phone | 5 hits | Hand-held |
| book | 3 hits | Hand-held |
| car | 9 hits | Large object |
| tie | 139 hits | On-person (worn, not held) |

Detailed Model Analysis

Grounding DINO Base

Scores: Detection confidence 0.1-0.5 (typical for zero-shot)

Timing per frame (MPS):

| Component | Time | % of total |
|---|---|---|
| Processor (text+image) | 17ms | 5% |
| Model inference | 310ms | 93% |
| Post-processing | 5ms | 2% |
| Total | 331ms | 100% |

Multi-prompt batching: 8 prompts batched together take 335ms in total (about 42ms per prompt, versus 309ms for a single prompt run alone)
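
The batching figure can be reproduced with a little arithmetic. The sketch below is illustrative only: the eight labels are placeholders (the report does not list the prompts it used), and `build_prompt` is a hypothetical helper that follows the lowercase, period-separated query format the Hugging Face documentation describes for Grounding DINO.

```python
def build_prompt(labels):
    """Join labels into one lowercase, period-terminated query string."""
    cleaned = [label.strip().lower().rstrip(".") for label in labels if label.strip()]
    return " ".join(f"{label}." for label in cleaned)


def amortized_ms(total_ms, n_prompts):
    """Average cost per prompt when n_prompts share one batched run."""
    if n_prompts <= 0:
        raise ValueError("n_prompts must be positive")
    return total_ms / n_prompts


# Placeholder labels (assumption); timings are the ones reported above.
labels = ["gun", "knife", "stamp", "passport",
          "magnifying glass", "cup", "bottle", "book"]
prompt = build_prompt(labels)
per_prompt = amortized_ms(335, len(labels))  # batched cost per prompt
speedup = 309 / per_prompt                   # versus one prompt per run

print(prompt)
print(f"{per_prompt:.1f} ms/prompt, {speedup:.1f}x cheaper than single-prompt runs")
```

Because model inference dominates the per-frame cost, adding prompts to the same run is close to free, which is why batching is worth doing whenever several object classes are queried on the same frame.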

Memory: ~1GB (MPS)

License: Apache 2.0. Commercial use is permitted, subject to the license's attribution and NOTICE requirements.

Grounding DINO Large

Result: Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as HuggingFace. The actual 7-dataset version (56.7 AP) was never released.

Verdict: Do not use. Base is identical and simpler.

OWL-ViT

Result: Almost useless for this task. Max confidence is 0.054, and it detects the gun at only 2/8 timepoints.

Verdict: Do not use.

Florence-2

Issue: prepare_inputs_for_generation bug in current transformers version. Cannot run inference without patching model code.

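Task format: Uses task tokens rather than free-form instructions. The <OD> token runs generic object detection and takes no text query, so "detect gun" cannot be expressed with it. The model card also lists text-conditioned tasks (<CAPTION_TO_PHRASE_GROUNDING>, <OPEN_VOCABULARY_DETECTION>), but these could not be evaluated because inference does not run.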

Verdict: Cannot use in current environment.

PaliGemma

Result: Works for gun detection (3/8) but misses small objects entirely.

Key limitation: No confidence score output (generative model). Either outputs bbox or nothing.

Issues:

  • 224px variant: Too low resolution for small objects
  • 448px variant: 6GB download, suspected better for detail but untested
  • Gemma license may restrict commercial use vs Apache 2.0

Verdict: Inferior to Grounding DINO for this use case.

YOLOv8n Fine-tune (Gun Detector)

| Property | Value |
|---|---|
| Dataset | 905 images (Roboflow CC BY 4.0) |
| Classes | grenade, knife, pistol, rifle |
| Validation mAP50 | 0.813 |
| Charade FP rate | 100% (all false positives) |

Root cause: Training images are close-up gun photos; Charade has distant/partial guns. Distribution mismatch makes this model unusable.

Verdict: Requires a completely new training dataset.

Root Cause Analysis: Small Object Failure

Grounding DINO's Resolution Limit

Grounding DINO processes images at 384×384px. At this resolution:

1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token

For comparison:

  • Gun at 200×200px (close-up) → 40×40px → still detectable
  • Stamp at 30×30px → 6×6px → lost in downsampling
  • Passport at 80×120px → 16×24px → barely visible
  • Magnifying glass at 40×40px → 8×8px → lost
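
The downsampling arithmetic above can be checked with a short sketch. It assumes, as the figures above do, a uniform 5:1 scale taken from the frame width (1920px to 384px). The `min_side` cutoff of 10px is a hypothetical threshold chosen only to separate the "lost" cases from the others; it is not a measured property of the model.

```python
def downsampled_size(obj_w, obj_h, frame_w=1920, model_w=384):
    """Object size in pixels after the frame is resized to the model input."""
    scale = model_w / frame_w
    return round(obj_w * scale), round(obj_h * scale)


def likely_lost(obj_w, obj_h, min_side=10):
    """True if the shorter side falls below a (hypothetical) usable size."""
    return min(downsampled_size(obj_w, obj_h)) < min_side


# Object sizes at native resolution, taken from the list above.
objects = {
    "gun (close-up)": (200, 200),
    "stamp": (30, 30),
    "passport": (80, 120),
    "magnifying glass": (40, 40),
}
for name, (w, h) in objects.items():
    print(name, downsampled_size(w, h), "lost" if likely_lost(w, h) else "kept")
```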

Potential Solutions

| Solution | Pros | Cons | Feasibility |
|---|---|---|---|
| Crop + zoom on person region | Leverages existing YOLO person detections | Requires two-stage pipeline | High |
| PaliGemma 448px | 448px native (36% more detail) | 6GB, requires download | ⚠️ Medium |
| YOLO fine-tune on stamps | Fast inference (6MB) | Need 200+ training images | ⚠️ Medium |
| Grounding DINO + tiling | Split image into tiles, run per tile | 4-9x slower | ⚠️ Medium |
| Florence-2 448px | Higher resolution | Bug in transformers | Low |
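
As a rough illustration of the tiling option, the sketch below splits a frame into an overlapping grid and maps a tile-local box back to frame coordinates. The function names and the 64px overlap are assumptions made for this example; a 2×2 grid gives 4 tiles and a 3×3 grid gives 9, which matches the 4-9x slowdown estimate.

```python
import math


def tile_grid(frame_w, frame_h, cols, rows, overlap_px=64):
    """Return (x0, y0, x1, y1) tiles that cover the frame with a fixed overlap."""
    tile_w = math.ceil((frame_w + (cols - 1) * overlap_px) / cols)
    tile_h = math.ceil((frame_h + (rows - 1) * overlap_px) / rows)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0 = c * (tile_w - overlap_px)
            y0 = r * (tile_h - overlap_px)
            # Clamp the last row/column to the frame edge.
            tiles.append((x0, y0, min(x0 + tile_w, frame_w), min(y0 + tile_h, frame_h)))
    return tiles


def to_frame_coords(box, tile):
    """Shift a box given in tile pixel coordinates back into frame coordinates."""
    x0, y0, _, _ = tile
    bx0, by0, bx1, by1 = box
    return (bx0 + x0, by0 + y0, bx1 + x0, by1 + y0)


tiles = tile_grid(1920, 1080, cols=2, rows=2)
for t in tiles:
    print(t)
```

Objects that straddle a tile boundary can be reported by two neighbouring tiles, so a real pipeline would also need a de-duplication step (for example non-maximum suppression) after mapping boxes back to the frame.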

Hand-Held Object Detection Feasibility

Available Data Sources

| Source | Type | Coverage | Usefulness |
|---|---|---|---|
| YOLO pre_chunks | Object detections | 169,625 frames | Every frame |
| Pose pre_chunks | Body keypoints (left_wrist, right_wrist) | 4,269 frames | Hand location |
| Grounding DINO | Zero-shot classification | On-demand | Object ID |
| ASR dialogue | Text mentions | 4,188 chunks | "holding a gun" |

Approach: YOLO + Pose + Grounding DINO

Frame
  → YOLO: Find person + objects
  → Pose: Find wrist keypoints
  → Check: Object bbox overlaps with hand region (wrist ±100px)
  → Grounding DINO: Verify object class
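
The overlap check in the third step could look like the following minimal sketch. The data shapes (a list of `(label, box)` tuples and a list of wrist `(x, y)` points) are assumptions for illustration and not the actual pre_chunks schema; the ±100px radius is the value given above.

```python
def hand_region(wrist_xy, radius=100):
    """Square region of +/- radius pixels around a wrist keypoint."""
    x, y = wrist_xy
    return (x - radius, y - radius, x + radius, y + radius)


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def held_candidates(objects, wrists, radius=100):
    """Keep the objects whose box overlaps at least one hand region."""
    regions = [hand_region(w, radius) for w in wrists]
    return [(label, box) for label, box in objects
            if any(boxes_overlap(box, region) for region in regions)]


# Made-up detections for one frame.
objects = [("knife", (500, 400, 560, 430)), ("car", (1500, 100, 1900, 400))]
wrists = [(520, 450), (900, 460)]  # left_wrist, right_wrist
candidates = held_candidates(objects, wrists)
print(candidates)
```

Only the surviving candidates would be passed on to Grounding DINO for class verification, which keeps the slower model off objects that are nowhere near a hand.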

Known Limitations

  1. Pose frame alignment: Pose data (4,269 frames) doesn't always overlap with YOLO data at the same frame
  2. Object proximity ≠ holding: YOLO objects near hands may be background, not held
  3. Small object blind spot: Stamps, magnifying glasses at hand positions are too small to detect
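
One way to soften the first limitation is to match each YOLO frame to the nearest pose frame within a tolerance. This is a sketch under assumptions: `pose_frames` is a sorted list of frame indices, and the `max_gap` of 5 frames is a hypothetical tolerance that would need tuning against the footage.

```python
from bisect import bisect_left


def nearest_pose_frame(frame_idx, pose_frames, max_gap=5):
    """Return the pose frame closest to frame_idx, or None if it is too far away.

    pose_frames must be sorted in ascending order.
    """
    if not pose_frames:
        return None
    i = bisect_left(pose_frames, frame_idx)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = pose_frames[max(0, i - 1): i + 1]
    best = min(neighbours, key=lambda f: abs(f - frame_idx))
    return best if abs(best - frame_idx) <= max_gap else None


pose_frames = [0, 40, 80, 120]  # made-up sparse pose coverage
for yolo_frame in (42, 60, 120, 200):
    print(yolo_frame, "->", nearest_pose_frame(yolo_frame, pose_frames))
```

Frames with no pose data within the tolerance return `None` and would have to skip the hand-proximity check or fall back to another signal.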

Recommendations

| Priority | Action | Rationale |
|---|---|---|
| 1 | Use Grounding DINO Base (Apache 2.0) | Best zero-shot detector, proven on guns, clean license |
| 2 | Two-stage pipeline for small objects | YOLO person box → crop → upscale → Grounding DINO |
| 3 | Pose wrist alignment for hand-held confirmation | Reduce false positives by requiring hand proximity |
| 4 | Replace Grounding DINO "Large" ref with Base | Large is identical weights, no benefit |
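
The gain from the two-stage pipeline in priority 2 can be estimated before building it. The sketch below pads a person box, computes the scale at which the crop would reach the model, and maps a crop-local detection back to the frame. The 15% padding, the example person box, and the function names are assumptions; the 384px input side and 0.2 full-frame scale come from the resolution analysis above.

```python
def padded_crop(box, frame_w, frame_h, pad=0.15):
    """Grow a person box by a fraction of its size and clamp it to the frame."""
    x0, y0, x1, y1 = box
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    return (max(0, round(x0 - pw)), max(0, round(y0 - ph)),
            min(frame_w, round(x1 + pw)), min(frame_h, round(y1 + ph)))


def zoom_factor(crop, model_side=384):
    """Scale applied when the crop's longest side is resized to the model input."""
    longest = max(crop[2] - crop[0], crop[3] - crop[1])
    return model_side / longest


def crop_to_frame(box, crop):
    """Shift a box given in crop pixel coordinates back into frame coordinates."""
    return (box[0] + crop[0], box[1] + crop[1], box[2] + crop[0], box[3] + crop[1])


person = (800, 300, 1100, 900)           # made-up YOLO person box
crop = padded_crop(person, 1920, 1080)
zoom = zoom_factor(crop)
full_frame_scale = 384 / 1920            # 0.2, as in the analysis above

print(crop)
print(f"30px stamp: {30 * full_frame_scale:.0f}px full-frame, {30 * zoom:.1f}px in crop")
```

For this example box, the stamp reaches the model at roughly 2.5 times the linear size it has in the full-frame path. The actual gain depends on how much of the frame the person occupies.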

Appendix: License Summary

| Model | License | Commercial Use | Requires |
|---|---|---|---|
| Grounding DINO | Apache 2.0 | Yes | NOTICE file |
| OWL-ViT | Apache 2.0 | Yes | NOTICE file |
| PaliGemma | Gemma license | ⚠️ Needs review | Google ToS |
| Florence-2 | MIT | Yes | Copyright notice |
| YOLOv8 | AGPL-3.0 | ⚠️ Needs license | Open source or paid |