momentry_core/docs_v1.0/REFERENCE/DETECTOR_REGISTRY.md
Accusys ffc30d7377 M4 handover: coordinate fixes, detector registry, deploy v2, YOLOv8s, identity lifecycle
- Fix swift_pose/swift_ocr Y-flip bugs (BUG-003~006)
- Add heuristic_scene module + post-processing trigger (replaces Places365)
- YOLOv5nu → YOLOv8s CoreML (+33% detections, +390% scene indicators)
- Per-table SQL export (split 4.7GB single file → 478MB max per table)
- Version/build check in deploy.sh (compare /health vs file_info.json)
- Add file_uuid column to identities table + backfill
- Identity pre-clean step in deploy (avoids UNIQUE conflicts on re-deploy)
- Stranger_xxx naming fix with UUID context
- Add DETECTOR_REGISTRY.md (25 detectors), DETECTOR_SELECTION_SOP.md
- Update SPATIAL_COORDINATE_REGISTRY.md (P layer, 6-layer architecture)
- New IDENTITY_LIFECYCLE.md
- M4 response docs for deploy_script_fix and 111614 test report
2026-05-13 20:00:47 +08:00


Momentry Core — Detector Registry

Date: 2026-05-13
Version: 1.0
Purpose: Consolidates the coordinate conventions, transform chains, and verification status of every model and algorithm detector.


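Principles

  1. One entry per detector: record the input/output format, coordinate origin, units, and transform formula independently.
  2. State the native coordinate system: do not hide transforms; any output that differs from Top-Left pixel must be listed explicitly.
  3. Traceable transform chain: record every transform step from the detector's raw output to the stored database column.
  4. Three verification levels: verified (tested) / assumed (inferred from documentation, not tested) / buggy (known to be wrong).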
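
The principles above can be sketched as a small record type. This is only an illustration: the names `Status`, `DetectorEntry`, `needs_explicit_transform`, and `is_traceable` are hypothetical and do not come from the Momentry codebase.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    VERIFIED = "verified"  # tested
    ASSUMED = "assumed"    # inferred from documentation, not tested
    BUGGY = "buggy"        # known to be wrong


@dataclass
class DetectorEntry:
    """Hypothetical registry record: one entry per detector."""
    detector_id: str
    origin: str            # "top-left" or "bottom-left"
    units: str             # "pixel" or "normalized"
    transform_chain: List[str] = field(default_factory=list)
    status: Status = Status.ASSUMED

    def needs_explicit_transform(self) -> bool:
        # Anything other than Top-Left pixel must list its transform.
        return not (self.origin == "top-left" and self.units == "pixel")

    def is_traceable(self) -> bool:
        # A non-native output needs at least one recorded transform step.
        return not self.needs_explicit_transform() or bool(self.transform_chain)


entry = DetectorEntry(
    detector_id="DET-FACE-001",
    origin="bottom-left",
    units="normalized",
    transform_chain=["y = (1.0 - bb.origin.y - bb.size.height) * imgH"],
    status=Status.VERIFIED,
)
print(entry.needs_explicit_transform(), entry.is_traceable())
```

An entry that reports a bottom-left or normalized output with an empty `transform_chain` would fail `is_traceable()`, which is the situation principles 2 and 3 are meant to prevent.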

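Category Overview

| Category | Count | Active | Experimental | Deprecated |
|---|---|---|---|---|
| face | 8 | 2 | 4 | 2 |
| body | 3 | 1 | 2 | 0 |
| object | 4 | 1 | 3 | 0 |
| text | 3 | 1 | 2 | 0 |
| speech | 3 | 2 | 1 | 0 |
| scene | 2 | 1 | 0 | 1 |
| stamps | 2 | 0 | 2 | 0 |
| Total | 25 | 8 | 14 | 3 |

| Status | Definition |
|---|---|
| Active | Runs in the production pipeline; registered in ProcessorType; output is consumed |
| Experimental | Standalone script or CLI, not connected to the pipeline; under evaluation or kept as a backup |
| Deprecated | Abandoned after evaluation, or superseded by a newer version but not yet removed from the codebase |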

Pipeline Status Quick-Reference

| # | Detector ID | Short Name | Pipeline Status | Reason |
|---|---|---|---|---|
| 1 | DET-CUT-001 | PySceneDetect | active | CUT processor |
| 2 | DET-SCN-001 | Places365 | active but rejected ⚠️ | M5 eval rejected; never removed from ProcessorType |
| 3 | DET-ASR-001 | faster-whisper | active | ASR processor |
| 4 | DET-SPCH-003 | ECAPA-TDNN | active | ASRX speaker embedding |
| 5 | DET-OBJ-001 | YOLOv8s | active | YOLO processor (v5nu→v8s, 2026-05-13) |
| 6 | DET-TEXT-001 | swift_ocr | active | OCR processor (primary) |
| 7 | DET-FACE-001/002/003 | swift_face + FaceNet | active | Face processor |
| 8 | DET-BODY-001/002 | swift_pose + YOLOv8-pose | active | Pose processor (primary + fallback) |
| 9 | DET-FACE-006 | AgglomerativeClustering | active | Identity Agent (post-processing) |
| 10 | DET-TEXT-005 | llama.cpp embed | active | Text embedding (chunk vectors) |
| 11 | DET-FACE-005 | InsightFace | experimental | Not in production ProcessorType |
| 12 | DET-FACE-007 | MediaPipe BlazeFace | experimental | MPS fallback, tested but not primary |
| 13 | DET-FACE-008 | MediaPipe Face Mesh | experimental | Lip processor, not in main pipeline |
| 14 | DET-BODY-003 | MediaPipe Holistic | experimental | Tested, not in production |
| 15 | DET-OBJ-003 | OWL-ViT | experimental | Tested for stamps, not in pipeline |
| 16 | DET-OBJ-004 | Grounding DINO | experimental | Tested for stamps/objects |
| 17 | DET-TEXT-002 | Florence-2 | experimental | Tested for stamps |
| 18 | DET-OBJ-002 | Gun Detector | experimental | Evaluated, all FP, rejected for pipeline |
| 19 | DET-STP-001 | OpenCV Stamp | experimental | Used in scan scripts only |
| 20 | DET-STP-002 | Pose Action Decoder | experimental | Derived from pose, standalone |
| 21 | DET-FACE-004 | DeepFace ArcFace | deprecated | Replaced by CoreML FaceNet |
| 22 | DET-SPCH-002 | Apple Speech ASR | deprecated | Replaced by faster-whisper |
| 23 | DET-SCN-001 | Places365 (scene) | ⚠️ deprecated per eval | Still in ProcessorType, needs removal |
| 24 | DET-TEXT-003 | EmbeddingGemma | experimental | Text embed endpoint, not primary |
| 25 | DET-TEXT-004 | mxbai CoreML | experimental | Text embed endpoint, not primary |

Known Misjudgments in Existing Evaluations

| # | Evaluation | Issue | Impact | Action |
|---|---|---|---|---|
| M1 | Scene Classification (2026-05-07) | M5 evaluated and REJECTED Places365, but it was never removed from ProcessorType::all(). Still runs on every file. | Wastes ~2min per registration. Produces meaningless scene.json. | Remove from pipeline or re-evaluate |
| M2 | Face Processor benchmark (2026-04-28) | Compared InsightFace vs MediaPipe vs OpenCV vs Contract v1, but the final pipeline uses swift_face + FaceNet, a completely different solution not in the benchmark. | Selection criteria from the benchmark don't apply to the actual pipeline detector. | Document the actual selection decision for swift_face |
| M3 | Gun Detector (2026-05-07) | Properly rejected: 7/7 FP. Correct decision. Model files still in repo. | No impact (correctly excluded). | Clean up model files. Archive or remove models/gun/ |
| M4 | OCR processor | No selection document exists. swift_ocr chosen without comparison against EasyOCR/PaddleOCR. | Unknown if optimal. PaddleOCR fallback may never trigger. | Document selection decision |

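Technical Classification (with vs. without spatial coordinates)

| Category | Count | Spatial coordinates | Embedding only | Temporal/text only |
|---|---|---|---|---|
| face | 8 | 5 | 3 | |
| body | 3 | 3 | | |
| object | 4 | 4 | | |
| text | 3 | 1 | 2 | |
| speech | 3 | | 2 | 1 |
| scene | 2 | | 1 | 1 |
| stamps | 2 | 2 | | |
| Total | 25 | 15 | 8 | 2 |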

Face Detectors

DET-FACE-001 — Face Bbox (Apple Vision)

Field Value
Framework Apple Vision
Model VNDetectFaceRectanglesRequest
Input CVPixelBuffer (BGRA, via CGImage)
Output bbox: x, y, width, height
Coordinate Input: normalized [0-1], origin bottom-left
Transform x = bb.origin.x * imgW
y = (1.0 - bb.origin.y - bb.size.height) * imgH
Image size cgImage.width / cgImage.height
Target Top-Left pixel integer
File scripts/swift_processors/swift_face.swift:134-136
Status verified (2026-05-13, landmark QC + visual check)
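
As an illustration of the transform above, here is a minimal Python sketch (the production code is Swift). The function name is hypothetical. Scaling width and height by the image size and truncating with `int()` are assumptions: the entry only states that the target is a Top-Left pixel integer.

```python
def vision_bbox_to_top_left_pixel(x_norm, y_norm, w_norm, h_norm, img_w, img_h):
    """Convert a normalized, bottom-left-origin box to Top-Left pixel integers."""
    x = x_norm * img_w
    # Vision's origin.y is the box's bottom edge measured from the image bottom,
    # so the top edge in a top-left system is 1 - y - h.
    y = (1.0 - y_norm - h_norm) * img_h
    w = w_norm * img_w   # assumption: size scales with the same image dimensions
    h = h_norm * img_h
    return int(x), int(y), int(w), int(h)  # assumption: truncation, not rounding


box = vision_bbox_to_top_left_pixel(0.25, 0.5, 0.5, 0.25, 1920, 1080)
print(box)
```

A box whose bottom edge sits at half the image height and which is a quarter of the image tall ends up with its top edge a quarter of the way down from the top.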

DET-FACE-002 — Face Landmarks (Apple Vision)

Field Value
Framework Apple Vision
Model VNDetectFaceLandmarksRequest
Input CVPixelBuffer (BGRA, via CGImage)
Output landmarks: left_eye (6pt), right_eye (6pt), nose (8pt), outer_lips, inner_lips
Coordinate Input: VNFaceLandmarks2D.pointsInImage(imageSize:)
Returned: macOS AppKit convention → bottom-left origin ⚠️
Transform y_top_left = imgH - $0.y (Y-flip)
Image size cgImage.width / cgImage.height
Target Top-Left pixel float → JSON
Pairing Not by array index. Landmark observations used as primary source (self-consistent bbox + landmarks). Face rect observations deduplicated via IoU > 0.3.
File scripts/swift_processors/swift_face.swift:155-184
Status verified (2026-05-13, Y-flip fix, 100% landmark-in-bbox)
Bugs fixed BUG-001: index-based pairing (landmarkObs[idx] ≠ faceObs[idx])
BUG-002: macOS bottom-left Y axis (missing Y-flip)
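
The landmark Y-flip and the "landmark-in-bbox" check cited in the status line can be sketched as follows. This is a Python illustration with hypothetical function names, not the Swift implementation, and the sample coordinates are made up.

```python
def flip_points_y(points, img_h):
    """Bottom-left-origin pixel points -> top-left-origin pixel points."""
    return [(x, img_h - y) for x, y in points]


def landmark_in_bbox_ratio(points, bbox):
    """Fraction of points inside a top-left [x, y, w, h] pixel bbox."""
    x, y, w, h = bbox
    if not points:
        return 0.0
    inside = sum(1 for px, py in points if x <= px <= x + w and y <= py <= y + h)
    return inside / len(points)


raw = [(500.0, 800.0), (520.0, 790.0)]   # as returned with a bottom-left origin
bbox = (480, 270, 100, 50)               # top-left pixel bbox of the same face
flipped = flip_points_y(raw, img_h=1080)
print(landmark_in_bbox_ratio(raw, bbox), landmark_in_bbox_ratio(flipped, bbox))
```

Without the flip none of the sample landmarks fall inside the face box; with it all of them do, which is the kind of signal a landmark QC pass can look for.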

DET-FACE-003 — Face Embedding (CoreML FaceNet)

Field Value
Framework CoreML (ANE-accelerated)
Model models/facenet512.mlpackage
Input Face crop 160×160, RGB, normalized [-1, 1]
Output 512-dim float embedding
Coordinate N/A (no spatial output). Bbox from DET-FACE-001 used for crop.
File scripts/face_processor.py, scripts/embed_faces.py, scripts/tmdb_embed_extractor.py
Embedding space [-1, 1] per dimension, cosine similarity for matching
Status verified (routinely used for identity matching)
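
A minimal sketch of the two numeric conventions named in this entry: input normalization to [-1, 1] and cosine similarity for matching. The mapping `v / 127.5 - 1.0` from 8-bit pixel values is an assumption (a common convention, not confirmed by this document), and the function names are hypothetical.

```python
import math


def normalize_pixel(v):
    """Map an 8-bit channel value (0..255) to [-1, 1]. Assumed mapping."""
    return v / 127.5 - 1.0


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


print(normalize_pixel(0), normalize_pixel(255))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Real embeddings are 512-dimensional; the two-element vectors here only keep the example short.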

DET-FACE-004 — Face Embedding (DeepFace ArcFace)

Field Value
Framework DeepFace / TensorFlow
Model ArcFace (512-dim)
Input Face crop (from bbox), BGR, no explicit normalization
Output 512-dim float embedding
Coordinate N/A
File scripts/face_embedding_extractor.py
Status 🟡 assumed (legacy fallback, not primary pipeline)

DET-FACE-005 — Face Recognition (InsightFace)

Field Value
Framework InsightFace / ONNX Runtime
Model buffalo_l (detection + recognition + 5-point landmarks)
Input Video frame (BGR, numpy array)
Output bbox: [x1, y1, x2, y2] pixel int
landmarks: 5-point (left_eye, right_eye, nose, mouth_left, mouth_right)
embedding: 512-dim float
Coordinate Bbox: Top-Left pixel (InsightFace native)
Landmarks: normalized [0-1] to image size
Transform Bbox: face.bbox.astype(int) — direct
Landmarks: kps * imgW, kps * imgH — needs manual conversion ⚠️
File scripts/face_recognition_processor.py:123-153
Status 🟡 assumed (landmark pixel conversion chain not independently verified)

DET-FACE-006 — Face Clustering (sklearn)

Field Value
Framework sklearn
Model AgglomerativeClustering
Input 512-dim face embeddings from DET-FACE-003 or DET-FACE-004
Output cluster labels, centroids (512-dim float)
Coordinate N/A (no spatial output)
File scripts/face_clustering_processor.py, scripts/identity_bind.py
Status verified (428 clusters for Charade, identity_bindings created)

DET-FACE-007 — Face Detection (MediaPipe BlazeFace)

Field Value
Framework MediaPipe / MPS
Model blaze_face_short_range.tflite
Input Frame (numpy array / MPS image)
Output bbox: [x, y, width, height] pixel
6 keypoints: eyes, nose tip, mouth center, ear tragions — pixel
Coordinate Top-Left pixel (MediaPipe native)
Transform Direct, no conversion needed
File scripts/face_processor_mps.py
Status 🟡 assumed (MPS fallback, rarely used in pipeline)

DET-FACE-008 — Lip Detection (MediaPipe Face Mesh)

Field Value
Framework MediaPipe
Model Face Mesh (468 landmarks)
Input Face crop or full frame
Output lip_openness: [0-1] (vertical/mouth_width)
mouth keypoints: indices 13, 14, 61, 291 from 468 mesh
Coordinate Landmarks: normalized [0-1], Top-Left origin
Transform Normalized → pixel: x * imgW, y * imgH
Lip openness: derived ratio, unitless
File scripts/lip_processor.py
Status 🟡 assumed
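
The lip-openness ratio described above might be computed as in the sketch below. It assumes indices 13 and 14 are the upper and lower inner-lip points and 61 and 291 are the mouth corners, and it clamps the result to [0, 1] to match the stated output range. Neither assumption is confirmed against scripts/lip_processor.py, and the function name is hypothetical.

```python
import math


def lip_openness(landmarks, img_w, img_h):
    """Vertical lip gap divided by mouth width, from normalized mesh landmarks.

    `landmarks` maps a mesh index to a normalized (x, y) pair, Top-Left origin.
    """
    def to_pixel(index):
        x, y = landmarks[index][0], landmarks[index][1]
        return (x * img_w, y * img_h)

    upper, lower = to_pixel(13), to_pixel(14)     # assumed vertical pair
    left, right = to_pixel(61), to_pixel(291)     # assumed mouth corners
    vertical = math.dist(upper, lower)
    width = math.dist(left, right)
    if width == 0.0:
        return 0.0
    return min(1.0, vertical / width)             # assumed clamp to [0, 1]


mesh = {13: (0.5, 0.50), 14: (0.5, 0.55), 61: (0.4, 0.52), 291: (0.6, 0.52)}
print(lip_openness(mesh, 1000, 1000))
```

Converting to pixels before taking distances matters on non-square frames, because normalized x and y are scaled by different factors.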

Body Pose Detectors

DET-BODY-001 — Body Pose (Apple Vision)

Field Value
Framework Apple Vision
Model VNDetectHumanBodyPoseRequest
Input CGImage (from frame export or NSImage)
Output 19 keypoints: nose, eyes, ears, neck, root, shoulders, elbows, wrists, hips, knees, ankles
bbox: [x, y, width, height] derived from keypoint min/max
Coordinate Input: normalized [0-1], origin bottom-left
Transform y = h - location.y * h (Y-flip applied; the current code now matches the correct formula)
Image size cgImage.width / cgImage.height
Target Top-Left pixel float
File scripts/swift_processors/swift_pose.swift:154-159
Status verified (2026-05-13, Y-flip fix applied)

DET-BODY-002 — Body Pose (YOLOv8 Pose fallback)

Field Value
Framework ultralytics / PyTorch
Model yolov8n-pose.pt
Input Frame (PIL or numpy)
Output 17 COCO keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
bbox: [x, y, width, height] derived from keypoints (conf > 0.1)
Coordinate Top-Left pixel (YOLO native, .xy[0] → numpy float)
Transform Direct: x, y = float(kps[j][0]), float(kps[j][1])
Bbox: min(xs), min(ys), max(xs)-min(xs), max(ys)-min(ys)
File scripts/pose_processor.py:78-97
Status top-left native
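
The keypoint-derived bbox in this entry can be sketched in a few lines. The function name and the `(x, y, confidence)` tuple layout are illustrative assumptions; only the min/max formula and the `conf > 0.1` filter come from the entry.

```python
def bbox_from_keypoints(keypoints, min_conf=0.1):
    """Derive [x, y, width, height] from keypoints whose confidence exceeds min_conf.

    `keypoints` is a list of (x, y, confidence) tuples in Top-Left pixel coordinates.
    Returns None when no keypoint passes the confidence filter.
    """
    kept = [(x, y) for x, y, conf in keypoints if conf > min_conf]
    if not kept:
        return None
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


pose = [(100.0, 50.0, 0.9), (140.0, 200.0, 0.8), (0.0, 0.0, 0.05)]
print(bbox_from_keypoints(pose))
```

The low-confidence keypoint at (0, 0) is dropped before taking min/max; otherwise it would stretch the box to the image corner.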

DET-BODY-003 — Full Body (MediaPipe Holistic)

Field Value
Framework MediaPipe
Model Holistic (pose + face mesh + hands)
Input Frame (BGR numpy)
Output 468 face mesh: [[x, y, z], ...] normalized [0-1]
33 body pose: [[x, y, z, visibility], ...] normalized [0-1]
21 hand × 2: [[x, y, z], ...] normalized [0-1]
Coordinate normalized [0-1], Top-Left origin
Transform x * imgW, y * imgH → pixel (if needed)
Z: depth relative, not metric
File scripts/mediapipe_holistic_processor.py
Status top-left native, normalized→pixel straightforward

Object Detectors

DET-OBJ-001 — Object Detection (YOLOv8s)

Field Value
Framework ultralytics / CoreML + PyTorch fallback
Model yolov8s.mlpackage (primary, CoreML ANE), yolov8s.pt (fallback)
mAP (COCO) 44.9 (was 34.3 with YOLOv5nu, +31%)
Input Frame (PIL or numpy)
Output bbox: [x1, y1, x2, y2] — float pixel
class_name, class_id (80 COCO classes)
confidence: [0-1]
Coordinate Top-Left pixel (YOLO .xyxy[0] → float)
Transform Rust: x = detection.x1 as i32, y = detection.y1 as i32 (int truncation)
width = x2 - x1, height = y2 - y1
Image size YOLO auto-handles via ultralytics inference
File scripts/yolo_processor.py:272-285, src/core/processor/yolo.rs:83-117
Status verified (2026-05-13, replaced YOLOv5nu, +19% detections, scene indicators +162~+473%)
Replaced YOLOv5nu (mAP 34.3, removed 2026-05-13)
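
A Python sketch of the xyxy to [x, y, width, height] conversion with integer truncation. Python's `int()` truncates toward zero, as Rust's `as i32` does for in-range values. Whether width and height are computed before or after truncation is not stated in this entry; the sketch assumes the float difference is truncated, and the function name is hypothetical.

```python
def xyxy_to_xywh_truncated(x1, y1, x2, y2):
    """Float [x1, y1, x2, y2] pixel box -> integer (x, y, width, height)."""
    x = int(x1)               # truncation toward zero
    y = int(y1)
    width = int(x2 - x1)      # assumption: subtract in float, then truncate
    height = int(y2 - y1)
    return x, y, width, height


print(xyxy_to_xywh_truncated(10.7, 20.9, 50.2, 80.1))
```

Truncating the corners first and subtracting afterwards would give a width of 40 instead of 39 for this input, so the order of operations is worth confirming against src/core/processor/yolo.rs.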

DET-OBJ-002 — Weapon Detection (YOLOv8n Fine-tuned)

Field Value
Framework ultralytics / PyTorch
Model models/gun/gun_detector/weights/best.pt
Input Frame (numpy array)
Output bbox: [x1, y1, x2, y2] pixel
class: {0: grenade, 1: knife, 2: pistol, 3: rifle}
Coordinate Top-Left pixel (YOLO native)
File scripts/gun_detector_scan.py
Status top-left native

DET-OBJ-003 — Open-Vocabulary Detection (OWL-ViT)

Field Value
Framework HuggingFace Transformers
Model google/owlvit-base-patch32
Input PIL Image + text queries
Output bbox, scores, labels
Coordinate post_process_object_detection returns boxes in [x1, y1, x2, y2] format
scaled to target_sizes parameter
Transform target_sizes = torch.Tensor([image_pil.size[::-1]]) — PIL (w,h) → (h,w)
box.int().tolist() or box.tolist() → Python list
Format risk HuggingFace processor version may return [cx, cy, w, h] not [x1,y1,x2,y2]
File scripts/test_owl_vit_stamps.py:69-80, scripts/magnifying_glass_owl.py:65-77
Status 🟡 assumed (bbox format not independently verified with visual check)
Verify Render bbox overlay on a known target image, confirm x1 < x2, y1 < y2
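
Alongside the visual overlay, the ordering check in the Verify row can be automated. The sketch below uses hypothetical helper names and made-up box values; it is a heuristic that complements the visual check rather than replacing it.

```python
def is_ordered_xyxy(box):
    """True when the box satisfies x1 < x2 and y1 < y2."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


def cxcywh_to_xyxy(box):
    """Convert [cx, cy, w, h] to [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


suspect = [300.0, 200.0, 50.0, 40.0]   # plausible [cx, cy, w, h], invalid as xyxy
print(is_ordered_xyxy(suspect), is_ordered_xyxy(cxcywh_to_xyxy(suspect)))
```

The ordering test only catches center-format boxes whose width or height is smaller than the center coordinate, so a passing result does not prove the format is [x1, y1, x2, y2].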

DET-OBJ-004 — Open-Vocabulary Detection (Grounding DINO)

Field Value
Framework HuggingFace Transformers
Model IDEA-Research/grounding-dino-base
Input PIL Image + text prompts
Output boxes, labels, scores
Coordinate processor rescales to target_sizes, returns pixel boxes
Transform target_sizes=[img.size[::-1]] — PIL (w,h) → (h,w)
[round(v, 1) for v in dets["boxes"][i].tolist()]
Format risk [::-1] order depends on processor expectations. If processor expects (w,h), axes swapped.
File scripts/gdino_frame_api.py:176-180
Status 🟡 assumed (rescale direction not independently verified)
Verify Single-frame output: check bbox x range ≤ imgW, y range ≤ imgH
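
The range check in the Verify row, and the (w, h) to (h, w) reversal it is meant to validate, might look like the sketch below. Function names and the one-pixel tolerance are assumptions for illustration.

```python
def target_size_hw(pil_size):
    """PIL reports (width, height); return (height, width), same as pil_size[::-1]."""
    width, height = pil_size
    return (height, width)


def boxes_within_image(boxes, img_w, img_h, tol=1.0):
    """True when every [x1, y1, x2, y2] box lies inside the image, within tol pixels."""
    return all(
        -tol <= x1 and x2 <= img_w + tol and -tol <= y1 and y2 <= img_h + tol
        for x1, y1, x2, y2 in boxes
    )


print(target_size_hw((1920, 1080)))
print(boxes_within_image([[10.0, 20.0, 1900.0, 1000.0]], 1920, 1080))
print(boxes_within_image([[10.0, 20.0, 1000.0, 1900.0]], 1920, 1080))
```

If the axes were swapped during rescaling, boxes on a landscape frame tend to report y values beyond the image height, which is what the last call demonstrates.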

Text / OCR Detectors

DET-TEXT-001 — OCR (Apple Vision)

Field Value
Framework Apple Vision
Model VNRecognizeTextRequest (accurate/fast)
Input CVPixelBuffer (via CGImage)
Output text: string, bbox: [x, y, w, h], confidence: [0-1]
Coordinate Input: VNRecognizedTextObservation.boundingBox — normalized [0-1], origin bottom-left
Transform y = (1.0 - bb.origin.y - bb.size.height) * cgH — Y-flip applied
Image size Main loop: cgImage.width / cgImage.height
recognizeText() helper: CVPixelBufferGetWidth/Height
File scripts/swift_processors/swift_ocr.swift:125-133, :181-182
Status verified (2026-05-13, Y-flip + image size fix applied)

DET-TEXT-002 — Open-Vocabulary (Florence-2)

Field Value
Framework HuggingFace Transformers
Model microsoft/Florence-2-base
Input PIL Image + task prompt
Output bbox: [x1, y1, x2, y2] pixel
label, text (depending on task)
Coordinate processor post_process_generation rescales to image_size, returns pixel
Transform x1, y1, x2, y2 = map(int, bbox) — direct
image_size=(image_pil.width, image_pil.height) — (w, h) order
File scripts/florence2_scan_stamps.py:67-79, scripts/test_florence2_direct.py
Status top-left native (HuggingFace post_process output)

DET-TEXT-003 — Text Embedding (EmbeddingGemma)

Field Value
Framework HuggingFace / PyTorch MPS
Model google/embeddinggemma-300m
Input Text string
Output Embedding vector (L2 normalized, dimension model-dependent)
Coordinate N/A
File scripts/embeddinggemma_server.py
Status verified (embedding API server)

Text Embedding (Non-Detector)

DET-TEXT-004 — Text Embedding (mxbai CoreML)

Field Value
Framework CoreML (ANE-accelerated)
Model mxbai-embed-large-v1.mlpackage
Input Text tokenized
Output Embedding vector
Coordinate N/A
File scripts/coreml_embed_server.py
Status 🟡 assumed

DET-TEXT-005 — Text Embedding (Ollama / llama.cpp)

Field Value
Framework llama.cpp / Ollama API
Model llama.cpp embedding endpoint (port 11436)
Input Text (optionally prefixed search_document:)
Output 768-dim float embedding
Coordinate N/A
File src/core/embedding/comic_embed.rs
Status verified (embedding pipeline)

Speech / Audio Detectors

DET-SPCH-001 — ASR (faster-whisper)

Field Value
Framework faster-whisper / CTranslate2
Model faster-whisper/small (int8 CPU)
Input Audio extracted from video
Output [{start, end, text}, ...] — temporal segments (seconds)
Coordinate Temporal only (seconds), no spatial
File scripts/asr_processor.py
Status verified (ASR pipeline)

DET-SPCH-002 — ASR (Apple Speech)

Field Value
Framework Apple Speech (ANE)
Model SFSpeechRecognizer
Input Audio file
Output [{start, end, text, confidence}, ...] — temporal segments
Coordinate Temporal only (seconds), no spatial
File scripts/swift_processors/asr_swift.swift
Status 🟡 assumed (Apple Speech quality lower than faster-whisper)

DET-SPCH-003 — Speaker Embedding (ECAPA-TDNN)

Field Value
Framework SpeechBrain / PyTorch
Model speechbrain/spkrec-ecapa-voxceleb
Input Audio segments per speaker
Output 192-dim float embedding
Coordinate N/A (vector space, cosine similarity)
File scripts/asrx_processor_custom.py, scripts/voice_embedding_extractor.py
Status verified (voice embeddings exported to SQLite + Qdrant)

Scene Detectors

DET-SCN-001 — Scene Classification (Places365)

Field Value
Framework CoreML (ANE) + PyTorch MPS fallback
Model resnet18_places365.mlpackage
Input Frame resized to 224×224
Output [{scene_type, confidence, top_5}, ...] — temporal segments
Coordinate Temporal only, no spatial
File scripts/scene_classifier.py
Status verified

DET-SCN-002 — Scene Cut Detection (PySceneDetect)

Field Value
Framework PySceneDetect
Model ContentDetector (threshold-based frame difference)
Input Video frames
Output [{scene_number, start_frame, end_frame, start_time, end_time}]
Coordinate Temporal (frames + seconds), no spatial
File scripts/cut_processor.py
Status verified

Stamp / Specific Target Detectors

DET-STP-001 — Stamp Detection (OpenCV Color)

Field Value
Framework OpenCV
Model HSV color masking + contour analysis (rule-based, no ML)
Input Frame (BGR numpy)
Output bbox: [x, y, w, h] pixel
Coordinate Top-Left pixel (cv2.boundingRect() native)
Transform Direct, no conversion
File scripts/scan_full_video_stamps.py, scripts/find_blue_stamp_opencv.py
Status top-left native

DET-STP-002 — Pose Action Decoder (Coordinate-derived)

Field Value
Framework Rule-based from keypoints
Model N/A (derived from DET-BODY-001/002/003 keypoints)
Input Pose keypoints (pixel)
Output Action labels: turn_left, turn_right, look_up, look_down, shake_head, nod_head, blink, smile, etc.
Coordinate Derived angles/ratios, no raw spatial output
File scripts/utils/pose_action_decoder.py, scripts/utils/integrated_body_action_decoder.py
Status 🟡 assumed (actions derived from pose keypoints; dependent on upstream keypoint correctness)
Warning Affected by the DET-BODY-001 Y-flip bug (BUG-003, fixed 2026-05-13): action labels derived from Vision pose output produced before the fix are wrong

Known Bugs Summary

| Bug ID | Detector | Issue | Impact | Fixed |
|---|---|---|---|---|
| BUG-001 | DET-FACE-001/002 | Index-based landmark↔face pairing | Wrong landmarks assigned to wrong faces | 2026-05-13 |
| BUG-002 | DET-FACE-002 | macOS bottom-left → missing Y-flip | Landmarks 731px offset from bbox | 2026-05-13 |
| BUG-003 | DET-BODY-001 | Missing Y-flip on keypoints | All 19 joint Y coordinates inverted | 2026-05-13 |
| BUG-004 | DET-BODY-001 | Derived bbox Y inverted | Bbox doesn't cover actual person | 2026-05-13 |
| BUG-005 | DET-TEXT-001 | Missing Y-flip on bbox | Text bbox Y inverted | 2026-05-13 |
| BUG-006 | DET-TEXT-001 | Hardcoded 640×360 in recognizeText() | Wrong bbox scale for non-640×360 images | 2026-05-13 |

Coordinate Convention Quick Reference

Apple Vision (all detectors)

| Item | Convention |
|---|---|
| boundingBox origin | Bottom-Left |
| boundingBox units | normalized [0-1] |
| pointsInImage Y axis | Bottom-Left (macOS AppKit) |
| Required Y-flip formula | bbox: y = (1 - y_norm - h_norm) * imgH; points: y = imgH - raw_y |

Non-Vision Detectors

| Framework | Origin | Units |
|---|---|---|
| YOLO (ultralytics) | Top-Left | pixel float |
| MediaPipe | Top-Left | normalized [0-1] |
| InsightFace bbox | Top-Left | pixel int |
| InsightFace landmarks | Top-Left | normalized [0-1] |
| HuggingFace (post_process) | Top-Left | pixel (after rescale) |
| OpenCV | Top-Left | pixel int |
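
The two quick-reference tables can be read as a lookup from framework to (origin, units), after which a single routine brings any point into the Top-Left pixel target. The sketch below is illustrative: the `CONVENTIONS` keys and the function name are hypothetical, and only a subset of frameworks is listed.

```python
# (origin, units) per framework, taken from the tables above.
CONVENTIONS = {
    "apple_vision": ("bottom-left", "normalized"),
    "yolo": ("top-left", "pixel"),
    "mediapipe": ("top-left", "normalized"),
    "opencv": ("top-left", "pixel"),
}


def point_to_top_left_pixel(framework, x, y, img_w, img_h):
    """Convert one point from a framework's native convention to Top-Left pixel."""
    origin, units = CONVENTIONS[framework]
    if units == "normalized":
        x, y = x * img_w, y * img_h
    if origin == "bottom-left":
        y = img_h - y          # Y-flip for bottom-left-origin frameworks
    return x, y


print(point_to_top_left_pixel("apple_vision", 0.5, 0.25, 1920, 1080))
print(point_to_top_left_pixel("mediapipe", 0.5, 0.25, 1920, 1080))
```

The same normalized point lands at different pixel rows depending on the origin, which is why the origin has to be recorded per detector rather than assumed.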

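Governance Rules

  1. New detectors: must be registered in this Registry, including the coordinate system, transform formula, and file location.
  2. Coordinate changes: any change to a transform formula must be reflected in this document, with the change date noted.
  3. Verification requirement: every detector with spatial coordinates must pass at least one visual check (bbox/keypoints overlaid on the original image).
  4. Cross-detector comparison: for bboxes output by different detectors on the same frame, the IoU should be plausible (non-zero and not 1.0).
  5. Iron rule for Vision detectors: any detector that uses the Apple Vision framework must be confirmed to implement the Y-flip.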
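
Rule 4 can be turned into a small automated check. The sketch below assumes both boxes are already in Top-Left pixel [x, y, w, h] form and describe the same subject; the function names and sample boxes are hypothetical.

```python
def iou_xywh(a, b):
    """Intersection over union of two Top-Left [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def iou_is_plausible(a, b):
    """Rule 4: boxes from two detectors should overlap, but not be identical."""
    return 0.0 < iou_xywh(a, b) < 1.0


face_a = (0.0, 0.0, 10.0, 10.0)
face_b = (5.0, 5.0, 10.0, 10.0)
print(iou_xywh(face_a, face_b), iou_is_plausible(face_a, face_b))
```

An IoU of exactly 0 between two detectors looking at the same subject is a hint that one of them is still in a flipped or unscaled coordinate system, while exactly 1.0 suggests one output was copied from the other.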

Maintenance

  • Owner: M5
  • Update frequency: whenever a processor is added or a coordinate transform is modified
  • See also: SPATIAL_COORDINATE_REGISTRY.md (parent coordinate system)