Accusys ffc30d7377 M4 handover: coordinate fixes, detector registry, deploy v2, YOLOv8s, identity lifecycle

- Fix swift_pose/swift_ocr Y-flip bugs (BUG-003~006)
- Add heuristic_scene module + post-processing trigger (replaces Places365)
- YOLOv5nu → YOLOv8s CoreML (+33% detections, +390% scene indicators)
- Per-table SQL export (split 4.7GB single file → 478MB max per table)
- Version/build check in deploy.sh (compare /health vs file_info.json)
- Add file_uuid column to identities table + backfill
- Identity pre-clean step in deploy (avoids UNIQUE conflicts on re-deploy)
- Stranger_xxx naming fix with UUID context
- Add DETECTOR_REGISTRY.md (25 detectors), DETECTOR_SELECTION_SOP.md
- Update SPATIAL_COORDINATE_REGISTRY.md (P layer, 6-layer architecture)
- New IDENTITY_LIFECYCLE.md
- M4 response docs for deploy_script_fix and 111614 test report

2026-05-13 20:00:47 +08:00

Momentry Core — Detector Registry

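Date: 2026-05-13
Version: 1.0
Purpose: Consolidated record of the coordinate conventions, transform chains, and verification status of all model and algorithm detectors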

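Principles

One entry per detector: independently record the input/output format, coordinate origin, units, and transform formulas.
Label the native coordinate system: do not hide transforms; any output that is not Top-Left pixel must be listed explicitly.
Traceable transform chain: record every conversion step from the detector's raw output to the stored database field.
Three verification levels: verified (tested) / assumed (inferred from documentation, not tested) / buggy (known to be wrong).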

Category Overview

Category	Count	Active	Experimental	Deprecated
face	8	2	4	2
body	3	1	2	0
object	4	1	3	0
text	3	1	2	0
speech	3	2	1	0
scene	2	1	0	1
stamps	2	0	2	0
Total	25	8	14	3

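Status	Definition
Active	Runs in the production pipeline, is registered in `ProcessorType`, and its output is consumed
Experimental	Standalone script or CLI, not connected to the pipeline; under evaluation or kept as a backup
Deprecated	Abandoned after evaluation, or superseded by a newer version but not yet removed from the codebase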

Pipeline Status Quick-Reference

#	Detector ID	Short Name	Pipeline Status	Reason
1	DET-SCN-002	PySceneDetect	active	CUT processor
2	DET-SCN-001	Places365	active but rejected ⚠️	M5 eval rejected; never removed from ProcessorType
3	DET-SPCH-001	faster-whisper	active	ASR processor
4	DET-SPCH-003	ECAPA-TDNN	active	ASRX speaker embedding
5	DET-OBJ-001	YOLOv8s	active	YOLO processor (v5nu→v8s, 2026-05-13)
6	DET-TEXT-001	swift_ocr	active	OCR processor (primary)
7	DET-FACE-001/002/003	swift_face + FaceNet	active	Face processor
8	DET-BODY-001/002	swift_pose + YOLOv8-pose	active	Pose processor (primary + fallback)
9	DET-FACE-006	AgglomerativeClustering	active	Identity Agent (post-processing)
10	DET-TEXT-005	llama.cpp embed	active	Text embedding (chunk vectors)
11	DET-FACE-005	InsightFace	experimental	Not in production ProcessorType
12	DET-FACE-007	MediaPipe BlazeFace	experimental	MPS fallback, tested but not primary
13	DET-FACE-008	MediaPipe Face Mesh	experimental	Lip processor, not in main pipeline
14	DET-BODY-003	MediaPipe Holistic	experimental	Tested, not in production
15	DET-OBJ-003	OWL-ViT	experimental	Tested for stamps, not in pipeline
16	DET-OBJ-004	Grounding DINO	experimental	Tested for stamps/objects
17	DET-TEXT-002	Florence-2	experimental	Tested for stamps
18	DET-OBJ-002	Gun Detector	experimental	Evaluated, all FP, rejected for pipeline
19	DET-STP-001	OpenCV Stamp	experimental	Used in scan scripts only
20	DET-STP-002	Pose Action Decoder	experimental	Derived from pose, standalone
21	DET-FACE-004	DeepFace ArcFace	deprecated	Replaced by CoreML FaceNet
22	DET-SPCH-002	Apple Speech ASR	deprecated	Replaced by faster-whisper
23	DET-SCN-001	Places365 (scene)	⚠️ deprecated per eval	Still in ProcessorType, needs removal
24	DET-TEXT-003	EmbeddingGemma	experimental	Text embed endpoint, not primary
25	DET-TEXT-004	mxbai CoreML	experimental	Text embed endpoint, not primary

Known Misjudgments in Existing Evaluations

#	Evaluation	Issue	Impact	Action
M1	Scene Classification (2026-05-07)	M5 evaluated and REJECTED Places365. But it was never removed from `ProcessorType::all()`. Still runs on every file.	Wastes ~2min per registration. Produces meaningless scene.json.	Remove from pipeline or re-evaluate
M2	Face Processor benchmark (2026-04-28)	Compared InsightFace vs MediaPipe vs OpenCV vs Contract v1. But the final pipeline uses swift_face + FaceNet, a completely different solution not in the benchmark.	Selection criteria from benchmark don't apply to actual pipeline detector.	Document the actual selection decision for swift_face
M3	Gun Detector (2026-05-07)	Properly rejected: 7/7 FP. Correct decision. Model files still in repo.	No impact (correctly excluded). Clean up model files.	Archive or remove `models/gun/`
M4	OCR processor	No selection document exists. swift_ocr chosen without comparison against EasyOCR/PaddleOCR.	Unknown if optimal. PaddleOCR fallback may never trigger.	Document selection decision

Technical Classification (Spatial Coordinates vs. None)

Category	Count	Spatial coordinates	Embedding only	Temporal/text only
face	8	5	3	—
body	3	3	—	—
object	4	4	—	—
text	3	1	2	—
speech	3	—	2	1
scene	2	—	1	1
stamps	2	2	—	—
Total	25	15	8	2

Face Detectors

DET-FACE-001 — Face Bbox (Apple Vision)

Field	Value
Framework	Apple Vision
Model	`VNDetectFaceRectanglesRequest`
Input	`CVPixelBuffer` (BGRA, via CGImage)
Output	bbox: `x, y, width, height`
Coordinate	Input: normalized [0-1], origin bottom-left
Transform	`x = bb.origin.x * imgW`
	`y = (1.0 - bb.origin.y - bb.size.height) * imgH`
Image size	`cgImage.width / cgImage.height`
Target	Top-Left pixel integer
File	`scripts/swift_processors/swift_face.swift:134-136`
Status	✅ verified (2026-05-13, landmark QC + visual check)
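
The transform above can be sketched in Python for quick offline checks. This is a minimal sketch, not the Swift implementation: the function name is hypothetical, and scaling width/height by the image size and truncating to `int` are assumptions, since the registry lists only the `x` and `y` formulas and the "pixel integer" target.

```python
def vision_bbox_to_top_left(bb_x, bb_y, bb_w, bb_h, img_w, img_h):
    """Map a normalized, bottom-left-origin bbox to top-left pixel integers."""
    x = bb_x * img_w
    # Y-flip: Vision measures origin.y from the bottom edge of the image to
    # the bottom edge of the box, so the box top is (1 - y - height) from the top.
    y = (1.0 - bb_y - bb_h) * img_h
    w = bb_w * img_w  # assumption: width/height are scaled, never flipped
    h = bb_h * img_h
    return int(x), int(y), int(w), int(h)  # assumption: truncate to int


# A box occupying the band 0.5..0.75 of the image height, measured from the bottom.
print(vision_bbox_to_top_left(0.25, 0.5, 0.5, 0.25, 1920, 1080))
# → (480, 270, 960, 270)
```

The same formula applies to any Apple Vision `boundingBox`, including the OCR boxes in DET-TEXT-001.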

DET-FACE-002 — Face Landmarks (Apple Vision)

Field	Value
Framework	Apple Vision
Model	`VNDetectFaceLandmarksRequest`
Input	`CVPixelBuffer` (BGRA, via CGImage)
Output	landmarks: `left_eye (6pt)`, `right_eye (6pt)`, `nose (8pt)`, `outer_lips`, `inner_lips`
Coordinate	Input: `VNFaceLandmarks2D.pointsInImage(imageSize:)`
	Returned: macOS AppKit convention → bottom-left origin ⚠️
Transform	`y_top_left = imgH - $0.y` (Y-flip)
Image size	`cgImage.width / cgImage.height`
Target	Top-Left pixel float → JSON
Pairing	Not by array index. Landmark observations used as primary source (self-consistent bbox + landmarks). Face rect observations deduplicated via IoU > 0.3.
File	`scripts/swift_processors/swift_face.swift:155-184`
Status	✅ verified (2026-05-13, Y-flip fix, 100% landmark-in-bbox)
Bugs fixed	BUG-001: index-based pairing (landmarkObs[idx] ≠ faceObs[idx])
	BUG-002: macOS bottom-left Y axis (missing Y-flip)
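
The Y-flip and the "landmark-in-bbox" check mentioned in the Status row can be illustrated as follows. This is a sketch under assumptions: the helper names and the sample points are made up for illustration, and the actual QC script may compute the rate differently.

```python
def flip_points_y(points, img_h):
    """Convert bottom-left-origin pixel points to top-left-origin pixel points."""
    return [(x, img_h - y) for x, y in points]


def landmark_in_bbox_rate(points, bbox):
    """Fraction of points inside a top-left pixel bbox given as (x, y, w, h)."""
    x, y, w, h = bbox
    inside = sum(1 for px, py in points if x <= px <= x + w and y <= py <= y + h)
    return inside / len(points) if points else 0.0


img_h = 1080
bbox = (450, 350, 150, 150)  # top-left pixel bbox (hypothetical face)
raw = [(500.0, 700.0), (560.0, 700.0), (530.0, 660.0)]  # bottom-left origin

print(landmark_in_bbox_rate(flip_points_y(raw, img_h), bbox))  # → 1.0
print(landmark_in_bbox_rate(raw, bbox))                        # → 0.0
```

Without the flip the same landmarks fall entirely outside the face box, which is the kind of offset BUG-002 describes.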

DET-FACE-003 — Face Embedding (CoreML FaceNet)

Field	Value
Framework	CoreML (ANE-accelerated)
Model	`models/facenet512.mlpackage`
Input	Face crop 160×160, RGB, normalized `[-1, 1]`
Output	512-dim float embedding
Coordinate	N/A (no spatial output). Bbox from DET-FACE-001 used for crop.
File	`scripts/face_processor.py`, `scripts/embed_faces.py`, `scripts/tmdb_embed_extractor.py`
Embedding space	[-1, 1] per dimension, cosine similarity for matching
Status	✅ verified (routinely used for identity matching)
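
As a rough illustration of the input normalization and the matching metric named above, the sketch below uses plain Python. The exact scaling used by `face_processor.py` is not stated in this registry; `value / 127.5 - 1.0` is one common way to reach `[-1, 1]` and is an assumption here, as are the function names.

```python
import math


def normalize_pixel(value):
    """Map an 8-bit channel value in [0, 255] to [-1, 1] (assumed formula)."""
    return value / 127.5 - 1.0


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_a * norm_b)


print(normalize_pixel(0), normalize_pixel(255))        # → -1.0 1.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))       # → 1.0
```

Real embeddings are 512-dimensional; the two-element vectors only keep the example short.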

DET-FACE-004 — Face Embedding (DeepFace ArcFace)

Field	Value
Framework	DeepFace / TensorFlow
Model	`ArcFace` (512-dim)
Input	Face crop (from bbox), BGR, no explicit normalization
Output	512-dim float embedding
Coordinate	N/A
File	`scripts/face_embedding_extractor.py`
Status	🟡 assumed (legacy fallback, not primary pipeline)

DET-FACE-005 — Face Recognition (InsightFace)

Field	Value
Framework	InsightFace / ONNX Runtime
Model	`buffalo_l` (detection + recognition + 5-point landmarks)
Input	Video frame (BGR, numpy array)
Output	`bbox: [x1, y1, x2, y2]` pixel int
	`landmarks: 5-point` (left_eye, right_eye, nose, mouth_left, mouth_right)
	`embedding: 512-dim float`
Coordinate	Bbox: Top-Left pixel (InsightFace native)
	Landmarks: normalized [0-1] to image size
Transform	Bbox: `face.bbox.astype(int)` — direct
	Landmarks: `kps * imgW, kps * imgH` — needs manual conversion ⚠️
File	`scripts/face_recognition_processor.py:123-153`
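Status	🟡 assumed (landmark pixel conversion chain not independently verified; InsightFace `kps` values are normally already in pixel units, so the `kps * imgW` scaling needs to be confirmed against real output)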

DET-FACE-006 — Face Clustering (sklearn)

Field	Value
Framework	sklearn
Model	`AgglomerativeClustering`
Input	512-dim face embeddings from DET-FACE-003 or DET-FACE-004
Output	cluster labels, centroids (512-dim float)
Coordinate	N/A (no spatial output)
File	`scripts/face_clustering_processor.py`, `scripts/identity_bind.py`
Status	✅ verified (428 clusters for Charade, identity_bindings created)

DET-FACE-007 — Face Detection (MediaPipe BlazeFace)

Field	Value
Framework	MediaPipe / MPS
Model	`blaze_face_short_range.tflite`
Input	Frame (numpy array / MPS image)
Output	`bbox: [x, y, width, height]` pixel
	`6 keypoints`: eyes, nose tip, mouth center, ear tragions — pixel
Coordinate	Top-Left pixel (MediaPipe native)
Transform	Direct, no conversion needed
File	`scripts/face_processor_mps.py`
Status	🟡 assumed (MPS fallback, rarely used in pipeline)

DET-FACE-008 — Lip Detection (MediaPipe Face Mesh)

Field	Value
Framework	MediaPipe
Model	`Face Mesh` (468 landmarks)
Input	Face crop or full frame
Output	`lip_openness: [0-1]` (vertical/mouth_width)
	`mouth keypoints`: indices 13, 14, 61, 291 from 468 mesh
Coordinate	Landmarks: normalized [0-1], Top-Left origin
Transform	Normalized → pixel: `x * imgW, y * imgH`
	Lip openness: derived ratio, unitless
File	`scripts/lip_processor.py`
Status	🟡 assumed
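
The derived ratio can be sketched as below. The roles assigned to the four indices (13 and 14 as the inner upper and lower lip, 61 and 291 as the mouth corners) follow the usual Face Mesh topology, but the exact formula and the clamp to `[0, 1]` are assumptions, not a copy of `lip_processor.py`.

```python
import math

UPPER_LIP, LOWER_LIP, LEFT_CORNER, RIGHT_CORNER = 13, 14, 61, 291


def lip_openness(landmarks, img_w, img_h):
    """Vertical lip gap divided by mouth width, from normalized landmarks."""
    def to_pixel(index):
        x, y = landmarks[index][0], landmarks[index][1]
        return (x * img_w, y * img_h)  # normalized → pixel, top-left origin

    vertical = math.dist(to_pixel(UPPER_LIP), to_pixel(LOWER_LIP))
    width = math.dist(to_pixel(LEFT_CORNER), to_pixel(RIGHT_CORNER))
    if width == 0.0:
        return 0.0
    return min(vertical / width, 1.0)  # assumption: clamp to the documented range


# Only the four mouth indices are needed, so a dict stands in for the 468-point mesh.
mesh = {13: (0.5, 0.60), 14: (0.5, 0.65), 61: (0.4, 0.62), 291: (0.6, 0.62)}
print(round(lip_openness(mesh, 1000, 1000), 3))
```

Converting to pixels before taking the ratio matters on non-square frames, where normalized X and Y units differ in length.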

Body Pose Detectors

DET-BODY-001 — Body Pose (Apple Vision)

Field	Value
Framework	Apple Vision
Model	`VNDetectHumanBodyPoseRequest`
Input	`CGImage` (from frame export or NSImage)
Output	`19 keypoints`: nose, eyes, ears, neck, root, shoulders, elbows, wrists, hips, knees, ankles
	`bbox: [x, y, width, height]` derived from keypoint min/max
Coordinate	Input: normalized [0-1], origin bottom-left
Transform	✅ `y = h - location.y * h` (Y-flip applied; current code matches the correct formula)
Image size	`cgImage.width / cgImage.height`
Target	Top-Left pixel float
File	`scripts/swift_processors/swift_pose.swift:154-159`
Status	✅ verified (2026-05-13, Y-flip fix applied)
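
For reference, the keypoint flip can be sketched in Python. The function name is hypothetical, and scaling X by the image width is an assumption, since the registry lists only the Y formula.

```python
def vision_point_to_top_left(nx, ny, img_w, img_h):
    """Map a normalized, bottom-left-origin keypoint to top-left pixels."""
    x = nx * img_w          # assumption: X needs scaling only
    y = img_h - ny * img_h  # Y-flip, as listed in the Transform row
    return x, y


# In Vision's convention a larger normalized y is higher in the frame.
nose = vision_point_to_top_left(0.5, 0.875, 1920, 1080)
ankle = vision_point_to_top_left(0.5, 0.125, 1920, 1080)
print(nose, ankle)  # → (960.0, 135.0) (960.0, 945.0)
```

After the flip the nose has the smaller Y, as expected for a top-left origin. Omitting the flip swaps the two, which is the inversion recorded as BUG-003.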

DET-BODY-002 — Body Pose (YOLOv8 Pose fallback)

Field	Value
Framework	ultralytics / PyTorch
Model	`yolov8n-pose.pt`
Input	Frame (PIL or numpy)
Output	`17 COCO keypoints`: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
	`bbox: [x, y, width, height]` derived from keypoints (conf > 0.1)
Coordinate	Top-Left pixel (YOLO native, `.xy[0]` → numpy float)
Transform	Direct: `x, y = float(kps[j][0]), float(kps[j][1])`
	Bbox: `min(xs), min(ys), max(xs)-min(xs), max(ys)-min(ys)`
File	`scripts/pose_processor.py:78-97`
Status	✅ top-left native
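
The derived bbox can be sketched as follows, assuming keypoints arrive as `(x, y)` pixel pairs with a parallel list of confidences. The function name and the `None` return for an empty selection are illustrative choices, not taken from `pose_processor.py`.

```python
def bbox_from_keypoints(keypoints, confidences, min_conf=0.1):
    """Return [x, y, width, height] covering keypoints with conf > min_conf."""
    kept = [
        (float(x), float(y))
        for (x, y), conf in zip(keypoints, confidences)
        if conf > min_conf
    ]
    if not kept:
        return None  # no reliable keypoints, so no bbox can be derived
    xs = [p[0] for p in kept]
    ys = [p[1] for p in kept]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


kps = [(100, 50), (120, 200), (0, 0)]  # the last point stands for an undetected joint
confs = [0.9, 0.8, 0.0]
print(bbox_from_keypoints(kps, confs))  # → [100.0, 50.0, 20.0, 150.0]
```

The confidence filter matters because an undetected joint left at `(0, 0)` would otherwise stretch the box to the image corner.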

DET-BODY-003 — Full Body (MediaPipe Holistic)

Field	Value
Framework	MediaPipe
Model	`Holistic` (pose + face mesh + hands)
Input	Frame (BGR numpy)
Output	`468 face mesh`: `[[x, y, z], ...]` normalized [0-1]
	`33 body pose`: `[[x, y, z, visibility], ...]` normalized [0-1]
	`21 hand × 2`: `[[x, y, z], ...]` normalized [0-1]
Coordinate	normalized [0-1], Top-Left origin
Transform	`x * imgW, y * imgH` → pixel (if needed)
	Z: depth relative, not metric
File	`scripts/mediapipe_holistic_processor.py`
Status	✅ top-left native, normalized→pixel straightforward

Object Detectors

DET-OBJ-001 — Object Detection (YOLOv8s)

Field	Value
Framework	ultralytics / CoreML + PyTorch fallback
Model	`yolov8s.mlpackage` (primary, CoreML ANE), `yolov8s.pt` (fallback)
mAP (COCO)	44.9 (was 34.3 with YOLOv5nu, +31%)
Input	Frame (PIL or numpy)
Output	`bbox: [x1, y1, x2, y2]` — float pixel
	`class_name, class_id` (80 COCO classes)
	`confidence: [0-1]`
Coordinate	Top-Left pixel (YOLO `.xyxy[0]` → float)
Transform	Rust: `x = detection.x1 as i32, y = detection.y1 as i32` — int truncation
	`width = x2 - x1, height = y2 - y1`
Image size	YOLO auto-handles via ultralytics inference
File	`scripts/yolo_processor.py:272-285`, `src/core/processor/yolo.rs:83-117`
Status	✅ verified (2026-05-13, replaced YOLOv5nu, +19% detections, scene indicators +162~+473%)
Replaced	YOLOv5nu (mAP 34.3, removed 2026-05-13)
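
The corner-to-size conversion on the Rust side can be mirrored in Python for spot checks. This is a sketch, not a port of `yolo.rs`: whether width and height are computed before or after truncation is not stated above, so computing them from the float corners and then truncating is an assumption.

```python
def xyxy_to_xywh_int(x1, y1, x2, y2):
    """Convert float corner coordinates to truncated (x, y, width, height)."""
    x = int(x1)  # int() truncates toward zero, like Rust's `as i32` on floats
    y = int(y1)
    width = int(x2 - x1)   # assumption: difference taken on floats first
    height = int(y2 - y1)
    return x, y, width, height


print(xyxy_to_xywh_int(10.7, 20.2, 110.9, 220.9))  # → (10, 20, 100, 200)
```

Truncation can move each edge by up to about one pixel (here the right edge lands at 110 instead of 110.9), which is worth keeping in mind when comparing IoU against float boxes from other detectors.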

DET-OBJ-002 — Weapon Detection (YOLOv8n Fine-tuned)

Field	Value
Framework	ultralytics / PyTorch
Model	`models/gun/gun_detector/weights/best.pt`
Input	Frame (numpy array)
Output	`bbox: [x1, y1, x2, y2]` pixel
	`class: {0: grenade, 1: knife, 2: pistol, 3: rifle}`
Coordinate	Top-Left pixel (YOLO native)
File	`scripts/gun_detector_scan.py`
Status	✅ top-left native

DET-OBJ-003 — Open-Vocabulary Detection (OWL-ViT)

Field	Value
Framework	HuggingFace Transformers
Model	`google/owlvit-base-patch32`
Input	PIL Image + text queries
Output	`bbox, scores, labels`
Coordinate	post_process_object_detection returns boxes in `[x1, y1, x2, y2]` format
	scaled to `target_sizes` parameter
Transform	`target_sizes = torch.Tensor([image_pil.size[::-1]])` — PIL (w,h) → (h,w)
	`box.int().tolist()` or `box.tolist()` → Python list
Format risk	HuggingFace processor version may return `[cx, cy, w, h]` not `[x1,y1,x2,y2]`
File	`scripts/test_owl_vit_stamps.py:69-80`, `scripts/magnifying_glass_owl.py:65-77`
Status	🟡 assumed (bbox format not independently verified with visual check)
Verify	Render bbox overlay on a known target image, confirm x1 < x2, y1 < y2
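
A lightweight ordering check can complement the visual overlay suggested in the Verify row. The helpers below are hypothetical and use plain lists in place of tensors. The check is necessary but not sufficient: a centre-format box can pass by coincidence, so it does not replace the overlay.

```python
def looks_like_xyxy(boxes):
    """True if every box satisfies x1 < x2 and y1 < y2."""
    return all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in boxes)


def cxcywh_to_xyxy(cx, cy, w, h):
    """Convert a centre-format box to corner format."""
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


corner_boxes = [[100, 50, 300, 200]]
centre_boxes = [[200, 125, 100, 75]]  # (cx, cy, w, h) misread as corners

print(looks_like_xyxy(corner_boxes))  # → True
print(looks_like_xyxy(centre_boxes))  # → False
print(cxcywh_to_xyxy(200, 125, 200, 150))  # → [100.0, 50.0, 300.0, 200.0]
```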

DET-OBJ-004 — Open-Vocabulary Detection (Grounding DINO)

Field	Value
Framework	HuggingFace Transformers
Model	`IDEA-Research/grounding-dino-base`
Input	PIL Image + text prompts
Output	`boxes, labels, scores`
Coordinate	processor rescales to `target_sizes`, returns pixel boxes
Transform	`target_sizes=[img.size[::-1]]` — PIL (w,h) → (h,w)
	`[round(v, 1) for v in dets["boxes"][i].tolist()]`
Format risk	`[::-1]` order depends on processor expectations. If processor expects (w,h), axes swapped.
File	`scripts/gdino_frame_api.py:176-180`
Status	🟡 assumed (rescale direction not independently verified)
Verify	Single-frame output: check bbox x range ≤ imgW, y range ≤ imgH
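
The range check in the Verify row can be sketched as below. `scale_box` is a hypothetical stand-in for the processor's rescale step and is used only to show what a swapped `(w, h)` / `(h, w)` order would look like. It is not the HuggingFace implementation.

```python
def target_size_from_pil(size):
    """PIL reports (width, height); the source passes (height, width)."""
    width, height = size
    return (height, width)


def scale_box(norm_box, target_size):
    """Hypothetical rescale of a normalized xyxy box to a (height, width) target."""
    height, width = target_size
    x1, y1, x2, y2 = norm_box
    return [x1 * width, y1 * height, x2 * width, y2 * height]


def boxes_within_image(boxes, img_w, img_h):
    """True if every xyxy box lies inside the image bounds."""
    return all(
        0 <= x1 <= x2 <= img_w and 0 <= y1 <= y2 <= img_h
        for x1, y1, x2, y2 in boxes
    )


pil_size = (1920, 1080)
norm_box = (0.1, 0.8, 0.3, 0.95)  # a box near the bottom of the frame

good = scale_box(norm_box, target_size_from_pil(pil_size))
swapped = scale_box(norm_box, pil_size)  # axes passed in the wrong order

print(boxes_within_image([good], 1920, 1080))     # → True
print(boxes_within_image([swapped], 1920, 1080))  # → False
```

On a landscape frame the check only trips for boxes low enough that the mis-scaled Y exceeds the image height, so it should be run over many detections and not a single box.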

Text / OCR Detectors

DET-TEXT-001 — OCR (Apple Vision)

Field	Value
Framework	Apple Vision
Model	`VNRecognizeTextRequest` (accurate/fast)
Input	`CVPixelBuffer` (via CGImage)
Output	`text: string`, `bbox: [x, y, w, h]`, `confidence: [0-1]`
Coordinate	Input: `VNRecognizedTextObservation.boundingBox` — normalized [0-1], origin bottom-left
Transform	✅ `y = (1.0 - bb.origin.y - bb.size.height) * cgH` — Y-flip applied
Image size	Main loop: `cgImage.width / cgImage.height` ✅
	`recognizeText()` helper: `CVPixelBufferGetWidth/Height` ✅
File	`scripts/swift_processors/swift_ocr.swift:125-133`, `:181-182`
Status	✅ verified (2026-05-13, Y-flip + image size fix applied)

DET-TEXT-002 — Open-Vocabulary (Florence-2)

Field	Value
Framework	HuggingFace Transformers
Model	`microsoft/Florence-2-base`
Input	PIL Image + task prompt
Output	`bbox: [x1, y1, x2, y2]` pixel
	`label, text` (depending on task)
Coordinate	processor `post_process_generation` rescales to `image_size`, returns pixel
Transform	`x1, y1, x2, y2 = map(int, bbox)` — direct
	`image_size=(image_pil.width, image_pil.height)` — (w, h) order ✅
File	`scripts/florence2_scan_stamps.py:67-79`, `scripts/test_florence2_direct.py`
Status	✅ top-left native (HuggingFace post_process output)

DET-TEXT-003 — Text Embedding (EmbeddingGemma)

Field	Value
Framework	HuggingFace / PyTorch MPS
Model	`google/embeddinggemma-300m`
Input	Text string
Output	Embedding vector (L2 normalized, dimension model-dependent)
Coordinate	N/A
File	`scripts/embeddinggemma_server.py`
Status	✅ verified (embedding API server)

Text Embedding (Non-Detector)

DET-TEXT-004 — Text Embedding (mxbai CoreML)

Field	Value
Framework	CoreML (ANE-accelerated)
Model	`mxbai-embed-large-v1.mlpackage`
Input	Text tokenized
Output	Embedding vector
Coordinate	N/A
File	`scripts/coreml_embed_server.py`
Status	🟡 assumed

DET-TEXT-005 — Text Embedding (Ollama / llama.cpp)

Field	Value
Framework	llama.cpp / Ollama API
Model	llama.cpp embedding endpoint (port 11436)
Input	Text (optionally prefixed `search_document:`)
Output	768-dim float embedding
Coordinate	N/A
File	`src/core/embedding/comic_embed.rs`
Status	✅ verified (embedding pipeline)

Speech / Audio Detectors

DET-SPCH-001 — ASR (faster-whisper)

Field	Value
Framework	faster-whisper / CTranslate2
Model	`faster-whisper/small` (int8 CPU)
Input	Audio extracted from video
Output	`[{start, end, text}, ...]` — temporal segments (seconds)
Coordinate	Temporal only (seconds), no spatial
File	`scripts/asr_processor.py`
Status	✅ verified (ASR pipeline)

DET-SPCH-002 — ASR (Apple Speech)

Field	Value
Framework	Apple Speech (ANE)
Model	`SFSpeechRecognizer`
Input	Audio file
Output	`[{start, end, text, confidence}, ...]` — temporal segments
Coordinate	Temporal only (seconds), no spatial
File	`scripts/swift_processors/asr_swift.swift`
Status	🟡 assumed (Apple Speech quality lower than faster-whisper)

DET-SPCH-003 — Speaker Embedding (ECAPA-TDNN)

Field	Value
Framework	SpeechBrain / PyTorch
Model	`speechbrain/spkrec-ecapa-voxceleb`
Input	Audio segments per speaker
Output	`192-dim float embedding`
Coordinate	N/A (vector space, cosine similarity)
File	`scripts/asrx_processor_custom.py`, `scripts/voice_embedding_extractor.py`
Status	✅ verified (voice embeddings exported to SQLite + Qdrant)

Scene Detectors

DET-SCN-001 — Scene Classification (Places365)

Field	Value
Framework	CoreML (ANE) + PyTorch MPS fallback
Model	`resnet18_places365.mlpackage`
Input	Frame resized to 224×224
Output	`[{scene_type, confidence, top_5}, ...]` — temporal segments
Coordinate	Temporal only, no spatial
File	`scripts/scene_classifier.py`
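Status	✅ verified (runs as documented); ⚠️ rejected by the M5 evaluation (2026-05-07) and pending removal from `ProcessorType`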

DET-SCN-002 — Scene Cut Detection (PySceneDetect)

Field	Value
Framework	PySceneDetect
Model	`ContentDetector` (threshold-based frame difference)
Input	Video frames
Output	`[{scene_number, start_frame, end_frame, start_time, end_time}]`
Coordinate	Temporal (frames + seconds), no spatial
File	`scripts/cut_processor.py`
Status	✅ verified

Stamp / Specific Target Detectors

DET-STP-001 — Stamp Detection (OpenCV Color)

Field	Value
Framework	OpenCV
Model	HSV color masking + contour analysis (rule-based, no ML)
Input	Frame (BGR numpy)
Output	`bbox: [x, y, w, h]` pixel
Coordinate	Top-Left pixel (`cv2.boundingRect()` native)
Transform	Direct, no conversion
File	`scripts/scan_full_video_stamps.py`, `scripts/find_blue_stamp_opencv.py`
Status	✅ top-left native

DET-STP-002 — Pose Action Decoder (Coordinate-derived)

Field	Value
Framework	Rule-based from keypoints
Model	N/A (derived from DET-BODY-001/002/003 keypoints)
Input	Pose keypoints (pixel)
Output	Action labels: turn_left, turn_right, look_up, look_down, shake_head, nod_head, blink, smile, etc.
Coordinate	Derived angles/ratios, no raw spatial output
File	`scripts/utils/pose_action_decoder.py`, `scripts/utils/integrated_body_action_decoder.py`
Status	🟡 assumed (actions derived from pose keypoints; dependent on upstream keypoint correctness)
Warning	Was affected by the DET-BODY-001 Y-flip bug (BUG-003, fixed 2026-05-13): action labels derived from Vision pose output produced before the fix are wrong and need regeneration

Known Bugs Summary

Bug ID	Detector	Issue	Impact	Fixed
BUG-001	DET-FACE-001/002	Index-based landmark↔face pairing	Landmarks assigned to the wrong faces	✅ 2026-05-13
BUG-002	DET-FACE-002	macOS bottom-left → missing Y-flip	Landmarks 731px offset from bbox	✅ 2026-05-13
BUG-003	DET-BODY-001	Missing Y-flip on keypoints	All 19 joint Y coordinates inverted	✅ 2026-05-13
BUG-004	DET-BODY-001	Derived bbox Y inverted	Bbox doesn't cover actual person	✅ 2026-05-13
BUG-005	DET-TEXT-001	Missing Y-flip on bbox	Text bbox Y inverted	✅ 2026-05-13
BUG-006	DET-TEXT-001	Hardcoded 640×360 in `recognizeText()`	Wrong bbox scale for non-640×360 images	✅ 2026-05-13

Coordinate Convention Quick Reference

Apple Vision (all detectors)

Item	Convention
boundingBox origin	Bottom-Left
boundingBox units	normalized [0-1]
pointsInImage Y axis	Bottom-Left (macOS AppKit)
Required Y-flip formula	bbox: `y = (1 - y_norm - h_norm) * imgH`
	points: `y = imgH - raw_y`

Non-Vision Detectors

Framework	Origin	Units
YOLO (ultralytics)	Top-Left	pixel float
MediaPipe	Top-Left	normalized [0-1]
InsightFace bbox	Top-Left	pixel int
InsightFace landmarks	Top-Left	normalized [0-1]
HuggingFace (post_process)	Top-Left	pixel (after rescale)
OpenCV	Top-Left	pixel int

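Governance Rules

New detectors: must be registered in this Registry, including coordinate system, transform formulas, and file locations.
Coordinate changes: any change to a transform formula requires updating this document and noting the change date.
Verification requirement: every detector with spatial coordinates must pass at least one visual check (bbox/keypoints overlaid on the original image).
Cross-detector comparison: bboxes output by different detectors for the same frame should have a plausible IoU (non-zero and not 1.0).
Vision detector hard rule: any detector that uses the Apple Vision Framework must have its Y-flip implementation confirmed.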
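
The cross-detector comparison rule can be sketched with a plain IoU helper. The function names are hypothetical, and boxes are assumed to be top-left pixel `(x, y, w, h)` tuples after each detector's transform has been applied.

```python
def iou_xywh(a, b):
    """Intersection over union of two top-left (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def iou_is_plausible(iou):
    """Rule from the list above: non-zero and not exactly 1.0."""
    return 0.0 < iou < 1.0


face_a = (0, 0, 100, 100)
face_b = (50, 0, 100, 100)
print(round(iou_xywh(face_a, face_b), 3))            # → 0.333
print(iou_is_plausible(iou_xywh(face_a, face_b)))    # → True
print(iou_is_plausible(iou_xywh(face_a, (500, 500, 10, 10))))  # → False
```

An IoU of exactly 0 between two detectors that should see the same subject usually points at a coordinate-system mismatch such as a missing Y-flip. The same helper fits the `IoU > 0.3` deduplication described for DET-FACE-002.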

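Maintenance

Owner: M5
Update frequency: whenever a processor is added or a coordinate transform is modified
Reference: SPATIAL_COORDINATE_REGISTRY.md (the upper-level coordinate system)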
