Zero-Shot Object Detection Model Research Report

Date: 2026-05-10
Goal: Evaluate models for detecting arbitrary objects in Charade (1963)
System: M5 MacBook Pro (Apple Silicon MPS, 48GB)

Tested Models

Model	Params	Size	Resolution	Type	License
YOLOv8n fine-tune (gun)	3.2M	6MB	640px	Closed-set (4 classes)	AGPL-3.0
OWL-ViT base	109M	586MB	384px	Zero-shot	Apache 2.0
Grounding DINO Base	232M	891MB	384px	Zero-shot	Apache 2.0
Grounding DINO Large	232M	895MB	384px	Zero-shot	Apache 2.0
Florence-2 Base	231M	~3GB	384px	Zero-shot (generative)	MIT
Florence-2 Large	776M	~6GB	384px	Zero-shot (generative)	MIT
PaliGemma 3B mix-224	2,923M	~3GB	224px	Zero-shot (generative)	Gemma license
PaliGemma 3B mix-448	2,923M	~6GB	448px	Zero-shot (generative)	Gemma license

Detection Performance on Charade

Large Objects (gun)

Model	8 timepoints	Best confidence	Runtime
YOLOv8n fine-tune	❌ 0/5 (all FP)	0.45 (stamp→pistol)	0.03s
OWL-ViT	❌ 2/8	0.054	3.4s
Grounding DINO Base	✅ 8/8	0.499	0.33s
PaliGemma 3B mix-224	✅ 3/8	n/a (no confidence output)	0.5-3s

Small Objects (stamp, passport, magnifying glass)

Model	Stamp	Passport	Magnifying glass
Grounding DINO Base	❌ FP (~0.3)	❌ FP (~0.4)	❌ FP (~0.3-0.5)
PaliGemma 3B mix-224	❌ no det	❌ no det	not tested
PaliGemma 3B mix-448	not tested	not tested	not tested

Every model tested on small objects fails on targets smaller than roughly 50px in the native 1920x1080 frame. PaliGemma 3B mix-448 was not run on these objects.

Other Objects

Object	YOLO COCO	Grounding DINO	Notes
knife	✅ 368 frames	✅ 84 hits	Small but detectable
cup	✅	✅ 13 hits	Moderate size
bottle	✅	✅ 12 hits	Moderate size
cell phone	✅	✅ 5 hits	Hand-held
book	✅	✅ 3 hits	Hand-held
car	✅	✅ 9 hits	Large object
tie	✅	✅ 139 hits	On-person (worn, not held)

Detailed Model Analysis

Grounding DINO Base (Recommended)

Scores: Detection confidence 0.1-0.5 (typical for zero-shot)

Timing per frame (MPS):

Component	Time	% of total
Processor (text+image)	17ms	5%
Model inference	310ms	93%
Post-processing	5ms	2%
Total	331ms	100%

Multi-prompt batching: 8 prompts in one pass take 335ms, about 42ms per prompt versus 309ms for a single-prompt run.
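The per-prompt figure above is simple amortization: one forward pass shared by several phrases. A minimal sketch follows, assuming the phrases are joined into one lowercase, period-terminated text query (the convention used in the Hugging Face Grounding DINO examples). The helper names `build_query` and `amortized_ms` and the label list are illustrative, not taken from the project code.

```python
def build_query(labels):
    """Join class phrases into a single text query.

    Assumption: each phrase is lowercased and ends with a period.
    """
    cleaned = [label.strip().lower().rstrip(".") for label in labels if label.strip()]
    return " ".join(f"{phrase}." for phrase in cleaned)


def amortized_ms(batch_ms, n_prompts):
    """Per-prompt latency when n_prompts share one forward pass."""
    if n_prompts <= 0:
        raise ValueError("n_prompts must be positive")
    return batch_ms / n_prompts


# Illustrative labels; the report does not list the 8 prompts it timed.
labels = ["gun", "knife", "cup", "bottle", "cell phone", "book", "car", "tie"]
query = build_query(labels)
per_prompt = amortized_ms(335, len(labels))

print(query)
print(f"{per_prompt:.1f} ms/prompt")
```

Because the model inference step dominates the frame time, adding phrases to one query costs far less than running separate passes.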

Memory: ~1GB (MPS)

License: Apache 2.0 — fully commercial, no restrictions

Grounding DINO Large

Result: Identical weights to Base. The GitHub "7-dataset" checkpoint is the same 3-dataset version as HuggingFace. The actual 7-dataset version (56.7 AP) was never released.

Verdict: Do not use. Base is identical and simpler.

OWL-ViT

Result: Almost useless for this task. Maximum confidence is 0.054, and it detects the gun at only 2 of 8 timepoints.

Verdict: Do not use.

Florence-2

Issue: The installed transformers version has a bug in `prepare_inputs_for_generation`, so inference fails unless the model code is patched.

Task format: Uses task tokens (<OD>) instead of arbitrary text prompts. Cannot do "detect gun" directly — uses generic object detection.

Verdict: Cannot use in current environment.

PaliGemma

Result: Works for gun detection (3/8) but misses small objects entirely.

Key limitation: As a generative model it outputs no confidence score. It either emits a bounding box or nothing, so detections cannot be thresholded.
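To illustrate why there is nothing to threshold, here is a hedged sketch of parsing a PaliGemma detection string. It assumes the output format described in the PaliGemma documentation: four `<locNNNN>` tokens per object in y_min, x_min, y_max, x_max order, each binned to 0-1023, followed by the label, with objects separated by `;`. Verify this against the checkpoint actually used; `parse_detections` is a hypothetical helper.

```python
import re

# Four location tokens followed by a free-text label (assumed format).
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text, width, height):
    """Convert '<loc....>' token groups into pixel-space (x0, y0, x1, y1) boxes."""
    results = []
    for locs, label in LOC_PATTERN.findall(text):
        # Assumed token order: y_min, x_min, y_max, x_max, normalized to 1024 bins.
        y0, x0, y1, x1 = (int(v) / 1024 for v in re.findall(r"<loc(\d{4})>", locs))
        box = (x0 * width, y0 * height, x1 * width, y1 * height)
        results.append((label.strip(), box))
    return results


sample = "<loc0256><loc0512><loc0512><loc0768> gun"
detections = parse_detections(sample, 1920, 1080)
print(detections)
```

An empty or box-free string yields an empty list, which is the only "no detection" signal available.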

Issues:

224px variant: Too low resolution for small objects
448px variant: 6GB download, suspected better for detail but untested
Gemma license may restrict commercial use vs Apache 2.0

Verdict: Inferior to Grounding DINO for this use case.

YOLOv8n Fine-tune (Gun Detector)

Root cause: Training images are close-up gun photos; Charade has distant/partial guns. Distribution mismatch makes this model unusable.

Verdict: Requires a completely new training dataset.

Root Cause Analysis: Small Object Failure

Grounding DINO's Resolution Limit

Grounding DINO processes images at 384×384px. At this resolution:

1920px frame → 384px input (5:1 reduction)
A 50×50px object → 10×10px at 384px → only ~1 patch token

For comparison:

Gun at 200×200px (close-up) → 40×40px → still detectable
Stamp at 30×30px → 6×6px → lost in downsampling
Passport at 80×120px → 16×24px → barely visible
Magnifying glass at 40×40px → 8×8px → lost
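The figures above follow from the 5:1 width reduction. A small sketch reproduces them, assuming (as the text does) that only the 1920px-to-384px width ratio matters and aspect-ratio handling is ignored. `downscaled_size` is an illustrative helper name.

```python
def downscaled_size(obj_w, obj_h, frame_w=1920, input_w=384):
    """Scale an object's pixel size by the frame-to-input width ratio."""
    scale = input_w / frame_w  # 384 / 1920 = 0.2, i.e. the 5:1 reduction
    return round(obj_w * scale), round(obj_h * scale)


# Object sizes in the native frame, taken from the list above.
objects = {
    "gun (close-up)": (200, 200),
    "stamp": (30, 30),
    "passport": (80, 120),
    "magnifying glass": (40, 40),
}

sizes = {name: downscaled_size(w, h) for name, (w, h) in objects.items()}
for name, (w, h) in sizes.items():
    print(f"{name}: {w}x{h}px at model input")
```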

Potential Solutions

Solution	Pros	Cons	Feasibility
Crop + zoom on person region	Leverages existing YOLO person detections	Requires two-stage pipeline	✅ High
PaliGemma 448px	448px native input (about 36% more pixels than 384px)	6GB, requires download	⚠️ Medium
YOLO fine-tune on stamps	Fast inference (6MB)	Need 200+ training images	⚠️ Medium
Grounding DINO + tiling	Split image into tiles, run per tile	4-9x slower	⚠️ Medium
Florence-2 448px	Higher resolution	Bug in transformers	❌ Low
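The tiling option can be sketched as follows. The 4-9x slowdown in the table corresponds to 2×2 through 3×3 grids, one detector pass per tile. This is a minimal illustration, not the project's implementation: the 20% overlap and the helper names `tile_boxes` and `to_frame_coords` are assumptions.

```python
def tile_boxes(width, height, cols, rows, overlap=0.2):
    """Return (x0, y0, x1, y1) tiles that cover the frame with fractional overlap."""
    # Tile size chosen so that cols tiles with the given overlap span the width exactly.
    tile_w = width / (cols - (cols - 1) * overlap)
    tile_h = height / (rows - (rows - 1) * overlap)
    step_x = tile_w * (1 - overlap)
    step_y = tile_h * (1 - overlap)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = round(c * step_x)
            y0 = round(r * step_y)
            x1 = min(width, round(c * step_x + tile_w))
            y1 = min(height, round(r * step_y + tile_h))
            boxes.append((x0, y0, x1, y1))
    return boxes


def to_frame_coords(det_box, tile):
    """Shift a box detected inside a tile back into full-frame coordinates."""
    off_x, off_y = tile[0], tile[1]
    bx0, by0, bx1, by1 = det_box
    return (bx0 + off_x, by0 + off_y, bx1 + off_x, by1 + off_y)


tiles_2x2 = tile_boxes(1920, 1080, cols=2, rows=2)
tiles_3x3 = tile_boxes(1920, 1080, cols=3, rows=3)
print(len(tiles_2x2), len(tiles_3x3))
print(tiles_2x2)
```

Overlap keeps an object that straddles a tile border fully visible in at least one tile; detections duplicated across overlapping tiles would still need merging, which this sketch leaves out.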

Hand-Held Object Detection Feasibility

Available Data Sources

Source	Type	Coverage	Usefulness
YOLO `pre_chunks`	Object detections	169,625 frames	✅ Every frame
Pose `pre_chunks`	Body keypoints (left_wrist, right_wrist)	4,269 frames	✅ Hand location
Grounding DINO	Zero-shot classification	On-demand	✅ Object ID
ASR dialogue	Text mentions	4,188 chunks	✅ "holding a gun"

Approach: YOLO + Pose + Grounding DINO

Frame
  → YOLO: Find person + objects
  → Pose: Find wrist keypoints
  → Check: Object bbox overlaps with hand region (wrist ±100px)
  → Grounding DINO: Verify object class
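The overlap check in the third step can be sketched as below, assuming boxes are (x0, y0, x1, y1) tuples in frame pixels and wrists are (x, y) points or `None` when the keypoint is missing. The function names are illustrative; the ±100px radius comes from the pipeline above.

```python
def hand_region(wrist_xy, radius=100):
    """Square region of +/- radius pixels around a wrist keypoint."""
    x, y = wrist_xy
    return (x - radius, y - radius, x + radius, y + radius)


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def near_hand(obj_box, wrists, radius=100):
    """True if the object box overlaps the region around any available wrist."""
    return any(
        boxes_overlap(obj_box, hand_region(w, radius))
        for w in wrists
        if w is not None
    )


wrists = [(500, 400), None]          # left wrist found, right wrist missing
held_candidate = (580, 450, 640, 520)
background = (900, 100, 950, 150)
print(near_hand(held_candidate, wrists), near_hand(background, wrists))
```

A positive result only marks the object as a candidate for the Grounding DINO verification step; proximity alone does not establish that it is held.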

Known Limitations

Pose frame alignment: Pose data covers 4,269 frames versus 169,625 for YOLO, so most YOLO frames have no pose data at the same frame index
Object proximity ≠ holding: YOLO objects near hands may be background, not held
Small object blind spot: Stamps, magnifying glasses at hand positions are too small to detect
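One way to soften the frame-alignment limitation is to match each YOLO frame to the nearest pose frame within a tolerance. The sketch below assumes pose frame indices are available as a sorted list of integers; `nearest_pose_frame` and the `max_gap` of 5 frames are hypothetical choices, not values from the project.

```python
from bisect import bisect_left


def nearest_pose_frame(pose_frames, frame, max_gap=5):
    """Return the pose frame closest to `frame`, or None if none is within max_gap.

    pose_frames must be sorted in ascending order.
    """
    if not pose_frames:
        return None
    i = bisect_left(pose_frames, frame)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = pose_frames[max(0, i - 1): i + 1]
    best = min(candidates, key=lambda f: abs(f - frame))
    return best if abs(best - frame) <= max_gap else None


pose_frames = [10, 50, 90]
matches = {f: nearest_pose_frame(pose_frames, f) for f in (5, 50, 52, 70, 100)}
print(matches)
```

Frames with no match would skip the hand-proximity check rather than reuse stale wrist positions.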

Recommendations

Priority	Action	Rationale
1	Use Grounding DINO Base (Apache 2.0)	Best zero-shot detector, proven on guns, clean license
2	Two-stage pipeline for small objects	YOLO person box → crop → upscale → Grounding DINO
3	Pose wrist alignment for hand-held confirmation	Reduce false positives by requiring hand proximity
4	Replace Grounding DINO "Large" references with Base	Large has identical weights, so it offers no benefit
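The crop step of the two-stage pipeline (priority 2) can be sketched as follows, to show how much effective resolution a small object gains. It assumes a 384px model input, a 10% padding margin around the YOLO person box, and a crop scaled by its longer side; `padded_crop`, `effective_size`, and the example person box are illustrative.

```python
def padded_crop(person_box, frame_w, frame_h, pad=0.1):
    """Expand a person box by a fractional margin and clamp it to the frame."""
    x0, y0, x1, y1 = person_box
    pad_x = (x1 - x0) * pad
    pad_y = (y1 - y0) * pad
    return (
        max(0, round(x0 - pad_x)),
        max(0, round(y0 - pad_y)),
        min(frame_w, round(x1 + pad_x)),
        min(frame_h, round(y1 + pad_y)),
    )


def effective_size(obj_px, crop_box, input_px=384):
    """Object size in model-input pixels when the crop's long side maps to input_px."""
    crop_long = max(crop_box[2] - crop_box[0], crop_box[3] - crop_box[1])
    return obj_px * input_px / crop_long


person = (800, 200, 1100, 900)               # hypothetical YOLO person box
crop = padded_crop(person, 1920, 1080)
full_frame = 30 * 384 / 1920                  # 30px stamp on the full frame
cropped = effective_size(30, crop)
print(crop, f"{full_frame:.1f}px -> {cropped:.1f}px")
```

In this example a 30px stamp grows from about 6px to about 14px at the model input, which is still small but more than doubles the linear resolution available to the detector.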

Appendix: License Summary

Model	License	Commercial Use	Requires
Grounding DINO	Apache 2.0	✅ Yes	NOTICE file
OWL-ViT	Apache 2.0	✅ Yes	NOTICE file
PaliGemma	Gemma license	⚠️ Needs review	Google ToS
Florence-2	MIT	✅ Yes	Copyright notice
YOLOv8	AGPL-3.0	⚠️ Needs license	Open-source release under AGPL-3.0 or a paid commercial license
