Momentry Core — Detector Registry
Date: 2026-05-13 Version: 1.0 Purpose: Consolidated coordinate conventions, transform chains, and verification status for all model/algorithm detectors
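Principles
- One entry per detector: each records its input/output format, coordinate origin, units, and transform formula independently.
- Native coordinate system is labeled: no hidden transforms; any output that is not Top-Left pixel must be listed explicitly.
- Traceable transform chain: every transform step from raw detector output to the stored database field is recorded.
- Three verification levels: verified (tested) / assumed (inferred from documentation, not tested) / buggy (known to be wrong).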
Category Overview
| Category | Count | Active | Experimental | Deprecated |
|---|---|---|---|---|
| face | 8 | 2 | 4 | 2 |
| body | 3 | 1 | 2 | 0 |
| object | 4 | 1 | 3 | 0 |
| text | 3 | 1 | 2 | 0 |
| speech | 3 | 2 | 1 | 0 |
| scene | 2 | 1 | 0 | 1 |
| stamps | 2 | 0 | 2 | 0 |
| Total | 25 | 8 | 14 | 3 |
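
| Status | Definition |
|---|---|
| Active | Runs in the production pipeline, is registered in ProcessorType, and its output is consumed |
| Experimental | Standalone script or CLI, not connected to the pipeline; under evaluation or kept as a backup |
| Deprecated | Dropped after evaluation, or superseded by a newer version but not yet removed from the codebase |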
Pipeline Status Quick-Reference
| # | Detector ID | Short Name | Pipeline Status | Reason |
|---|---|---|---|---|
| 1 | DET-SCN-002 | PySceneDetect | active | CUT processor |
| 2 | DET-SCN-001 | Places365 | active but rejected ⚠️ | M5 eval rejected; never removed from ProcessorType |
| 3 | DET-SPCH-001 | faster-whisper | active | ASR processor |
| 4 | DET-SPCH-003 | ECAPA-TDNN | active | ASRX speaker embedding |
| 5 | DET-OBJ-001 | YOLOv8s | active | YOLO processor (v5nu→v8s, 2026-05-13) |
| 6 | DET-TEXT-001 | swift_ocr | active | OCR processor (primary) |
| 7 | DET-FACE-001/002/003 | swift_face + FaceNet | active | Face processor |
| 8 | DET-BODY-001/002 | swift_pose + YOLOv8-pose | active | Pose processor (primary + fallback) |
| 9 | DET-FACE-006 | AgglomerativeClustering | active | Identity Agent (post-processing) |
| 10 | DET-TEXT-005 | llama.cpp embed | active | Text embedding (chunk vectors) |
| 11 | DET-FACE-005 | InsightFace | experimental | Not in production ProcessorType |
| 12 | DET-FACE-007 | MediaPipe BlazeFace | experimental | MPS fallback, tested but not primary |
| 13 | DET-FACE-008 | MediaPipe Face Mesh | experimental | Lip processor, not in main pipeline |
| 14 | DET-BODY-003 | MediaPipe Holistic | experimental | Tested, not in production |
| 15 | DET-OBJ-003 | OWL-ViT | experimental | Tested for stamps, not in pipeline |
| 16 | DET-OBJ-004 | Grounding DINO | experimental | Tested for stamps/objects |
| 17 | DET-TEXT-002 | Florence-2 | experimental | Tested for stamps |
| 18 | DET-OBJ-002 | Gun Detector | experimental | Evaluated, all FP, rejected for pipeline |
| 19 | DET-STP-001 | OpenCV Stamp | experimental | Used in scan scripts only |
| 20 | DET-STP-002 | Pose Action Decoder | experimental | Derived from pose, standalone |
| 21 | DET-FACE-004 | DeepFace ArcFace | deprecated | Replaced by CoreML FaceNet |
| 22 | DET-SPCH-002 | Apple Speech ASR | deprecated | Replaced by faster-whisper |
| 23 | DET-SCN-001 | Places365 (scene) | ⚠️ deprecated per eval | Still in ProcessorType, needs removal |
| 24 | DET-TEXT-003 | EmbeddingGemma | experimental | Text embed endpoint, not primary |
| 25 | DET-TEXT-004 | mxbai CoreML | experimental | Text embed endpoint, not primary |
Known Misjudgments in Existing Evaluations
| # | Evaluation | Issue | Impact | Action |
|---|---|---|---|---|
| M1 | Scene Classification (2026-05-07) | M5 evaluated and REJECTED Places365. But it was never removed from ProcessorType::all(). Still runs on every file. | Wastes ~2min per registration. Produces meaningless scene.json. | Remove from pipeline or re-evaluate |
| M2 | Face Processor benchmark (2026-04-28) | Compared InsightFace vs MediaPipe vs OpenCV vs Contract v1. But the final pipeline uses swift_face + FaceNet, a completely different solution not in the benchmark. | Selection criteria from benchmark don't apply to actual pipeline detector. | Document the actual selection decision for swift_face |
| M3 | Gun Detector (2026-05-07) | Properly rejected: 7/7 FP. Correct decision. Model files still in repo. | No impact (correctly excluded). Clean up model files. | Archive or remove models/gun/ |
| M4 | OCR processor | No selection document exists. swift_ocr chosen without comparison against EasyOCR/PaddleOCR. | Unknown if optimal. PaddleOCR fallback may never trigger. | Document selection decision |
Technical Classification (with vs. without spatial coordinates)
| Category | Count | Spatial coordinates | Embedding only | Temporal/text only |
|---|---|---|---|---|
| face | 8 | 5 | 3 | — |
| body | 3 | 3 | — | — |
| object | 4 | 4 | — | — |
| text | 3 | 1 | 2 | — |
| speech | 3 | — | 2 | 1 |
| scene | 2 | — | 1 | 1 |
| stamps | 2 | 2 | — | — |
| Total | 25 | 15 | 8 | 2 |
Face Detectors
DET-FACE-001 — Face Bbox (Apple Vision)
| Field | Value |
|---|---|
| Framework | Apple Vision |
| Model | VNDetectFaceRectanglesRequest |
| Input | CVPixelBuffer (BGRA, via CGImage) |
| Output | bbox: x, y, width, height |
| Coordinate | Input: normalized [0-1], origin bottom-left |
| Transform | x = bb.origin.x * imgW; y = (1.0 - bb.origin.y - bb.size.height) * imgH |
| Image size | cgImage.width / cgImage.height |
| Target | Top-Left pixel integer |
| File | scripts/swift_processors/swift_face.swift:134-136 |
| Status | ✅ verified (2026-05-13, landmark QC + visual check) |
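
The transform in the table above can be sketched in Python to make the Y-flip explicit. This is a minimal illustration, not the code in swift_face.swift: the function name `vision_bbox_to_top_left_px` is hypothetical, and truncating to integers with `int()` is an assumption about how the "Top-Left pixel integer" target is produced.

```python
def vision_bbox_to_top_left_px(x_norm, y_norm, w_norm, h_norm, img_w, img_h):
    """Convert a normalized, bottom-left-origin bbox to top-left pixel integers.

    (x_norm, y_norm) is the lower-left corner of the box as Apple Vision
    reports it; all four inputs are in [0, 1].
    """
    x = x_norm * img_w
    # y_norm is the distance from the image bottom to the box's bottom edge,
    # so the box's top edge, measured from the image top, is 1 - y - h.
    y = (1.0 - y_norm - h_norm) * img_h
    w = w_norm * img_w
    h = h_norm * img_h
    # Assumption: truncate toward zero when storing integer pixels.
    return int(x), int(y), int(w), int(h)


if __name__ == "__main__":
    # A box covering the middle half of the width, sitting in the upper half.
    print(vision_bbox_to_top_left_px(0.25, 0.5, 0.5, 0.25, 1920, 1080))
    # → (480, 270, 960, 270)
```

Note that width and height are unaffected by the flip; only the Y origin changes.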
DET-FACE-002 — Face Landmarks (Apple Vision)
| Field | Value |
|---|---|
| Framework | Apple Vision |
| Model | VNDetectFaceLandmarksRequest |
| Input | CVPixelBuffer (BGRA, via CGImage) |
| Output | landmarks: left_eye (6pt), right_eye (6pt), nose (8pt), outer_lips, inner_lips |
| Coordinate | Input: VNFaceLandmarks2D.pointsInImage(imageSize:); returned in macOS AppKit convention → bottom-left origin ⚠️ |
| Transform | y_top_left = imgH - $0.y (Y-flip) |
| Image size | cgImage.width / cgImage.height |
| Target | Top-Left pixel float → JSON |
| Pairing | Not by array index. Landmark observations used as primary source (self-consistent bbox + landmarks). Face rect observations deduplicated via IoU > 0.3. |
| File | scripts/swift_processors/swift_face.swift:155-184 |
| Status | ✅ verified (2026-05-13, Y-flip fix, 100% landmark-in-bbox) |
| Bugs fixed | BUG-001: index-based pairing (landmarkObs[idx] ≠ faceObs[idx]); BUG-002: macOS bottom-left Y axis (missing Y-flip) |
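
The "landmark-in-bbox" check cited in the Status row can be approximated as follows. This is a sketch under assumptions: the helper names `flip_points_y` and `landmark_in_bbox_rate` are hypothetical, landmark points are taken to be pixel coordinates with a bottom-left origin (as described for pointsInImage above), and the bbox is already in Top-Left pixel [x, y, w, h] form.

```python
def flip_points_y(points, img_h):
    """Flip pixel points from bottom-left origin to top-left origin."""
    return [(x, img_h - y) for x, y in points]


def landmark_in_bbox_rate(points, bbox):
    """Fraction of points that fall inside a top-left [x, y, w, h] bbox."""
    if not points:
        return 0.0
    bx, by, bw, bh = bbox
    inside = sum(
        1 for px, py in points
        if bx <= px <= bx + bw and by <= py <= by + bh
    )
    return inside / len(points)


if __name__ == "__main__":
    img_h = 1080
    face_bbox = (400, 200, 200, 200)                  # top-left pixel
    raw_points = [(450, 800), (550, 800), (500, 750)]  # bottom-left pixel

    print(landmark_in_bbox_rate(raw_points, face_bbox))                        # 0.0
    print(landmark_in_bbox_rate(flip_points_y(raw_points, img_h), face_bbox))  # 1.0
```

A rate near zero on un-flipped points and near one after the flip is the signature of a missing Y-flip such as BUG-002.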
DET-FACE-003 — Face Embedding (CoreML FaceNet)
| Field | Value |
|---|---|
| Framework | CoreML (ANE-accelerated) |
| Model | models/facenet512.mlpackage |
| Input | Face crop 160×160, RGB, normalized [-1, 1] |
| Output | 512-dim float embedding |
| Coordinate | N/A (no spatial output). Bbox from DET-FACE-001 used for crop. |
| File | scripts/face_processor.py, scripts/embed_faces.py, scripts/tmdb_embed_extractor.py |
| Embedding space | [-1, 1] per dimension, cosine similarity for matching |
| Status | ✅ verified (routinely used for identity matching) |
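
For orientation, the two numeric conventions in this entry (input scaled to [-1, 1], cosine similarity for matching) can be sketched with the standard library alone. The names `normalize_pixel` and `cosine_similarity` are hypothetical, and the scaling `v / 127.5 - 1` is an assumption: the registry states only the target range, not the exact formula used by the scripts.

```python
import math


def normalize_pixel(value):
    """Map an 8-bit channel value (0..255) to [-1, 1].

    Assumption: linear scaling by 127.5; the production scripts may differ.
    """
    return value / 127.5 - 1.0


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treat as no similarity
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(normalize_pixel(0), normalize_pixel(255))      # -1.0 1.0
    # Toy 2-dim vectors stand in for 512-dim embeddings.
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))      # 0.0
```

Because cosine similarity ignores vector length, embeddings do not need to be L2-normalized before comparison.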
DET-FACE-004 — Face Embedding (DeepFace ArcFace)
| Field | Value |
|---|---|
| Framework | DeepFace / TensorFlow |
| Model | ArcFace (512-dim) |
| Input | Face crop (from bbox), BGR, no explicit normalization |
| Output | 512-dim float embedding |
| Coordinate | N/A |
| File | scripts/face_embedding_extractor.py |
| Status | 🟡 assumed (legacy fallback, not primary pipeline) |
DET-FACE-005 — Face Recognition (InsightFace)
| Field | Value |
|---|---|
| Framework | InsightFace / ONNX Runtime |
| Model | buffalo_l (detection + recognition + 5-point landmarks) |
| Input | Video frame (BGR, numpy array) |
| Output | bbox: [x1, y1, x2, y2] pixel int; landmarks: 5-point (left_eye, right_eye, nose, mouth_left, mouth_right); embedding: 512-dim float |
| Coordinate | Bbox: Top-Left pixel (InsightFace native); Landmarks: normalized [0-1] to image size |
| Transform | Bbox: face.bbox.astype(int) (direct); Landmarks: kps * imgW, kps * imgH (needs manual conversion ⚠️) |
| File | scripts/face_recognition_processor.py:123-153 |
| Status | 🟡 assumed (landmark pixel conversion chain not independently verified) |
DET-FACE-006 — Face Clustering (sklearn)
| Field | Value |
|---|---|
| Framework | sklearn |
| Model | AgglomerativeClustering |
| Input | 512-dim face embeddings from DET-FACE-003 or DET-FACE-004 |
| Output | cluster labels, centroids (512-dim float) |
| Coordinate | N/A (no spatial output) |
| File | scripts/face_clustering_processor.py, scripts/identity_bind.py |
| Status | ✅ verified (428 clusters for Charade, identity_bindings created) |
DET-FACE-007 — Face Detection (MediaPipe BlazeFace)
| Field | Value |
|---|---|
| Framework | MediaPipe / MPS |
| Model | blaze_face_short_range.tflite |
| Input | Frame (numpy array / MPS image) |
| Output | bbox: [x, y, width, height] pixel; 6 keypoints (eyes, nose tip, mouth center, ear tragions), pixel |
| Coordinate | Top-Left pixel (MediaPipe native) |
| Transform | Direct, no conversion needed |
| File | scripts/face_processor_mps.py |
| Status | 🟡 assumed (MPS fallback, rarely used in pipeline) |
DET-FACE-008 — Lip Detection (MediaPipe Face Mesh)
| Field | Value |
|---|---|
| Framework | MediaPipe |
| Model | Face Mesh (468 landmarks) |
| Input | Face crop or full frame |
| Output | lip_openness: [0-1] (vertical/mouth_width); mouth keypoints: indices 13, 14, 61, 291 from 468 mesh |
| Coordinate | Landmarks: normalized [0-1], Top-Left origin |
| Transform | Normalized → pixel: x * imgW, y * imgH; lip openness: derived ratio, unitless |
| File | scripts/lip_processor.py |
| Status | 🟡 assumed |
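
The lip-openness ratio described above might be computed as in the sketch below. Several details are assumptions rather than facts from lip_processor.py: the function name `lip_openness`, the role of each index (13 and 14 as the vertical pair, 61 and 291 as the mouth corners), and clamping the result to 1.0. Landmarks are converted to pixels first so that a non-square frame does not distort the ratio.

```python
import math


def lip_openness(landmarks, img_w, img_h):
    """Vertical lip gap divided by mouth width, clamped to [0, 1].

    `landmarks` maps a Face Mesh index to a normalized (x, y) pair with a
    top-left origin.
    """
    def to_px(index):
        x, y = landmarks[index]
        return (x * img_w, y * img_h)

    upper, lower = to_px(13), to_px(14)     # assumed vertical pair
    left, right = to_px(61), to_px(291)     # assumed mouth corners

    vertical = math.dist(upper, lower)
    width = math.dist(left, right)
    if width == 0.0:
        return 0.0  # degenerate mouth width; report closed
    return min(vertical / width, 1.0)


if __name__ == "__main__":
    mesh = {13: (0.50, 0.50), 14: (0.50, 0.52), 61: (0.45, 0.51), 291: (0.55, 0.51)}
    # 20 px gap over a 100 px wide mouth on a 1000x1000 frame, about 0.2.
    print(round(lip_openness(mesh, 1000, 1000), 3))
```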
Body Pose Detectors
DET-BODY-001 — Body Pose (Apple Vision)
| Field | Value |
|---|---|
| Framework | Apple Vision |
| Model | VNDetectHumanBodyPoseRequest |
| Input | CGImage (from frame export or NSImage) |
| Output | 19 keypoints: nose, eyes, ears, neck, root, shoulders, elbows, wrists, hips, knees, ankles; bbox: [x, y, width, height] derived from keypoint min/max |
| Coordinate | Input: normalized [0-1], origin bottom-left |
| Transform (current) | ✅ y = h - location.y * h — Y-flip applied |
| Transform (correct) | y = h - location.y * h |
| Image size | cgImage.width / cgImage.height |
| Target | Top-Left pixel float |
| File | scripts/swift_processors/swift_pose.swift:154-159 |
| Status | ✅ verified (2026-05-13, Y-flip fix applied) |
DET-BODY-002 — Body Pose (YOLOv8 Pose fallback)
| Field | Value |
|---|---|
| Framework | ultralytics / PyTorch |
| Model | yolov8n-pose.pt |
| Input | Frame (PIL or numpy) |
| Output | 17 COCO keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles; bbox: [x, y, width, height] derived from keypoints (conf > 0.1) |
| Coordinate | Top-Left pixel (YOLO native, .xy[0] → numpy float) |
| Transform | Direct: x, y = float(kps[j][0]), float(kps[j][1]); Bbox: min(xs), min(ys), max(xs)-min(xs), max(ys)-min(ys) |
| File | scripts/pose_processor.py:78-97 |
| Status | ✅ top-left native |
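
The keypoint-derived bbox in the Transform row can be written out as a short function. This is a sketch, not the code in pose_processor.py: `bbox_from_keypoints` is a hypothetical name, and keypoints are assumed to arrive as (x, y, confidence) triples in Top-Left pixel coordinates.

```python
def bbox_from_keypoints(keypoints, min_conf=0.1):
    """Derive a top-left [x, y, w, h] bbox from confident keypoints.

    Returns None when no keypoint exceeds `min_conf`, since a bbox cannot
    be derived from an empty set.
    """
    confident = [(x, y) for x, y, conf in keypoints if conf > min_conf]
    if not confident:
        return None
    xs = [x for x, _ in confident]
    ys = [y for _, y in confident]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


if __name__ == "__main__":
    kps = [
        (100.0, 50.0, 0.9),   # e.g. nose
        (140.0, 200.0, 0.8),  # e.g. hip
        (0.0, 0.0, 0.05),     # undetected joint, filtered out by confidence
    ]
    print(bbox_from_keypoints(kps))  # (100.0, 50.0, 40.0, 150.0)
```

Filtering by confidence matters because undetected joints are often reported at (0, 0), which would otherwise stretch the bbox to the image corner.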
DET-BODY-003 — Full Body (MediaPipe Holistic)
| Field | Value |
|---|---|
| Framework | MediaPipe |
| Model | Holistic (pose + face mesh + hands) |
| Input | Frame (BGR numpy) |
| Output | 468 face mesh: [[x, y, z], ...] normalized [0-1]; 33 body pose: [[x, y, z, visibility], ...] normalized [0-1]; 21 hand × 2: [[x, y, z], ...] normalized [0-1] |
| Coordinate | normalized [0-1], Top-Left origin |
| Transform | x * imgW, y * imgH → pixel (if needed); Z: depth relative, not metric |
| File | scripts/mediapipe_holistic_processor.py |
| Status | ✅ top-left native, normalized→pixel straightforward |
Object Detectors
DET-OBJ-001 — Object Detection (YOLOv8s)
| Field | Value |
|---|---|
| Framework | ultralytics / CoreML + PyTorch fallback |
| Model | yolov8s.mlpackage (primary, CoreML ANE), yolov8s.pt (fallback) |
| mAP (COCO) | 44.9 (was 34.3 with YOLOv5nu, +31%) |
| Input | Frame (PIL or numpy) |
| Output | bbox: [x1, y1, x2, y2] (float pixel); class_name, class_id (80 COCO classes); confidence: [0-1] |
| Coordinate | Top-Left pixel (YOLO .xyxy[0] → float) |
| Transform | Rust: x = detection.x1 as i32, y = detection.y1 as i32 (int truncation); width = x2 - x1, height = y2 - y1 |
| Image size | YOLO auto-handles via ultralytics inference |
| File | scripts/yolo_processor.py:272-285, src/core/processor/yolo.rs:83-117 |
| Status | ✅ verified (2026-05-13, replaced YOLOv5nu, +19% detections, scene indicators +162~+473%) |
| Replaced | YOLOv5nu (mAP 34.3, removed 2026-05-13) |
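
The xyxy → xywh step with integer truncation can be mirrored in Python, where `int()` truncates toward zero just as Rust's `as i32` does for in-range floats. The function name `xyxy_to_xywh_int` is hypothetical, and computing width and height from the float corners before truncating is an assumption; yolo.rs may order these steps differently.

```python
def xyxy_to_xywh_int(x1, y1, x2, y2):
    """Convert a float [x1, y1, x2, y2] box to integer [x, y, w, h].

    Truncation (not rounding) is used for every component, so the stored
    box can be up to one pixel smaller than the float box on each axis.
    """
    x = int(x1)
    y = int(y1)
    width = int(x2 - x1)   # assumption: subtract first, then truncate
    height = int(y2 - y1)
    return x, y, width, height


if __name__ == "__main__":
    print(xyxy_to_xywh_int(10.75, 20.5, 110.25, 220.75))  # (10, 20, 99, 200)
```

If sub-pixel accuracy matters for downstream IoU checks, rounding instead of truncating would be a design choice worth recording in this registry.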
DET-OBJ-002 — Weapon Detection (YOLOv8n Fine-tuned)
| Field | Value |
|---|---|
| Framework | ultralytics / PyTorch |
| Model | models/gun/gun_detector/weights/best.pt |
| Input | Frame (numpy array) |
| Output | bbox: [x1, y1, x2, y2] pixel; class: {0: grenade, 1: knife, 2: pistol, 3: rifle} |
| Coordinate | Top-Left pixel (YOLO native) |
| File | scripts/gun_detector_scan.py |
| Status | ✅ top-left native |
DET-OBJ-003 — Open-Vocabulary Detection (OWL-ViT)
| Field | Value |
|---|---|
| Framework | HuggingFace Transformers |
| Model | google/owlvit-base-patch32 |
| Input | PIL Image + text queries |
| Output | bbox, scores, labels |
| Coordinate | post_process_object_detection returns boxes in [x1, y1, x2, y2] format, scaled to the target_sizes parameter |
| Transform | target_sizes = torch.Tensor([image_pil.size[::-1]]) (PIL (w,h) → (h,w)); box.int().tolist() or box.tolist() → Python list |
| Format risk | HuggingFace processor version may return [cx, cy, w, h] not [x1,y1,x2,y2] |
| File | scripts/test_owl_vit_stamps.py:69-80, scripts/magnifying_glass_owl.py:65-77 |
| Status | 🟡 assumed (bbox format not independently verified with visual check) |
| Verify | Render bbox overlay on a known target image, confirm x1 < x2, y1 < y2 |
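
Alongside the visual overlay, the numeric part of the Verify step (x1 < x2, y1 < y2) and the fallback conversion for the format risk can be sketched as below. Both function names are hypothetical. The ordering check is necessary but not sufficient: a center-format box with a small center and a large size can still pass it, which is why the overlay remains the primary check.

```python
def looks_like_xyxy(box):
    """Return True if the box is consistent with [x1, y1, x2, y2] ordering."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


def cxcywh_to_xyxy(box):
    """Convert a [cx, cy, w, h] box to [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


if __name__ == "__main__":
    suspicious = (300.0, 200.0, 100.0, 50.0)
    print(looks_like_xyxy(suspicious))                  # False
    print(cxcywh_to_xyxy(suspicious))                   # (250.0, 175.0, 350.0, 225.0)
    print(looks_like_xyxy(cxcywh_to_xyxy(suspicious)))  # True
```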
DET-OBJ-004 — Open-Vocabulary Detection (Grounding DINO)
| Field | Value |
|---|---|
| Framework | HuggingFace Transformers |
| Model | IDEA-Research/grounding-dino-base |
| Input | PIL Image + text prompts |
| Output | boxes, labels, scores |
| Coordinate | processor rescales to target_sizes, returns pixel boxes |
| Transform | target_sizes=[img.size[::-1]] (PIL (w,h) → (h,w)); [round(v, 1) for v in dets["boxes"][i].tolist()] |
| Format risk | [::-1] order depends on processor expectations. If processor expects (w,h), axes swapped. |
| File | scripts/gdino_frame_api.py:176-180 |
| Status | 🟡 assumed (rescale direction not independently verified) |
| Verify | Single-frame output: check bbox x range ≤ imgW, y range ≤ imgH |
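
The range check in the Verify row could look like the following sketch. The names `target_size_hw` and `boxes_within_image` and the 1-pixel tolerance are assumptions for illustration. The check only catches swapped axes when at least one box extends past the shorter image dimension, so it should be run on a frame with detections near the right or bottom edge.

```python
def target_size_hw(pil_size):
    """Reorder a PIL-style (width, height) size into (height, width)."""
    width, height = pil_size
    return (height, width)  # same result as pil_size[::-1]


def boxes_within_image(boxes, img_w, img_h, tol=1.0):
    """Return True if every [x1, y1, x2, y2] box lies inside the image.

    `tol` allows for small overshoot from rounding during rescale.
    """
    for x1, y1, x2, y2 in boxes:
        if x1 < -tol or y1 < -tol or x2 > img_w + tol or y2 > img_h + tol:
            return False
    return True


if __name__ == "__main__":
    img_w, img_h = 1920, 1080
    print(target_size_hw((img_w, img_h)))                                       # (1080, 1920)
    print(boxes_within_image([(100.0, 50.0, 1800.0, 1000.0)], img_w, img_h))    # True
    # A box whose y extent exceeds imgH suggests the rescale axes were swapped.
    print(boxes_within_image([(56.2, 88.9, 1012.5, 1777.8)], img_w, img_h))     # False
```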
Text / OCR Detectors
DET-TEXT-001 — OCR (Apple Vision)
| Field | Value |
|---|---|
| Framework | Apple Vision |
| Model | VNRecognizeTextRequest (accurate/fast) |
| Input | CVPixelBuffer (via CGImage) |
| Output | text: string, bbox: [x, y, w, h], confidence: [0-1] |
| Coordinate | Input: VNRecognizedTextObservation.boundingBox — normalized [0-1], origin bottom-left |
| Transform | ✅ y = (1.0 - bb.origin.y - bb.size.height) * cgH — Y-flip applied |
| Image size | Main loop: cgImage.width / cgImage.height ✅; recognizeText() helper: CVPixelBufferGetWidth/Height ✅ |
| File | scripts/swift_processors/swift_ocr.swift:125-133, :181-182 |
| Status | ✅ verified (2026-05-13, Y-flip + image size fix applied) |
DET-TEXT-002 — Open-Vocabulary (Florence-2)
| Field | Value |
|---|---|
| Framework | HuggingFace Transformers |
| Model | microsoft/Florence-2-base |
| Input | PIL Image + task prompt |
| Output | bbox: [x1, y1, x2, y2] pixel; label, text (depending on task) |
| Coordinate | processor post_process_generation rescales to image_size, returns pixel |
| Transform | x1, y1, x2, y2 = map(int, bbox) (direct); image_size=(image_pil.width, image_pil.height), (w, h) order ✅ |
| File | scripts/florence2_scan_stamps.py:67-79, scripts/test_florence2_direct.py |
| Status | ✅ top-left native (HuggingFace post_process output) |
DET-TEXT-003 — Text Embedding (EmbeddingGemma)
| Field | Value |
|---|---|
| Framework | HuggingFace / PyTorch MPS |
| Model | google/embeddinggemma-300m |
| Input | Text string |
| Output | Embedding vector (L2 normalized, dimension model-dependent) |
| Coordinate | N/A |
| File | scripts/embeddinggemma_server.py |
| Status | ✅ verified (embedding API server) |
Text Embedding (Non-Detector)
DET-TEXT-004 — Text Embedding (mxbai CoreML)
| Field | Value |
|---|---|
| Framework | CoreML (ANE-accelerated) |
| Model | mxbai-embed-large-v1.mlpackage |
| Input | Text tokenized |
| Output | Embedding vector |
| Coordinate | N/A |
| File | scripts/coreml_embed_server.py |
| Status | 🟡 assumed |
DET-TEXT-005 — Text Embedding (Ollama / llama.cpp)
| Field | Value |
|---|---|
| Framework | llama.cpp / Ollama API |
| Model | llama.cpp embedding endpoint (port 11436) |
| Input | Text (optionally prefixed search_document:) |
| Output | 768-dim float embedding |
| Coordinate | N/A |
| File | src/core/embedding/comic_embed.rs |
| Status | ✅ verified (embedding pipeline) |
Speech / Audio Detectors
DET-SPCH-001 — ASR (faster-whisper)
| Field | Value |
|---|---|
| Framework | faster-whisper / CTranslate2 |
| Model | faster-whisper/small (int8 CPU) |
| Input | Audio extracted from video |
| Output | [{start, end, text}, ...] — temporal segments (seconds) |
| Coordinate | Temporal only (seconds), no spatial |
| File | scripts/asr_processor.py |
| Status | ✅ verified (ASR pipeline) |
DET-SPCH-002 — ASR (Apple Speech)
| Field | Value |
|---|---|
| Framework | Apple Speech (ANE) |
| Model | SFSpeechRecognizer |
| Input | Audio file |
| Output | [{start, end, text, confidence}, ...] — temporal segments |
| Coordinate | Temporal only (seconds), no spatial |
| File | scripts/swift_processors/asr_swift.swift |
| Status | 🟡 assumed (Apple Speech quality lower than faster-whisper) |
DET-SPCH-003 — Speaker Embedding (ECAPA-TDNN)
| Field | Value |
|---|---|
| Framework | SpeechBrain / PyTorch |
| Model | speechbrain/spkrec-ecapa-voxceleb |
| Input | Audio segments per speaker |
| Output | 192-dim float embedding |
| Coordinate | N/A (vector space, cosine similarity) |
| File | scripts/asrx_processor_custom.py, scripts/voice_embedding_extractor.py |
| Status | ✅ verified (voice embeddings exported to SQLite + Qdrant) |
Scene Detectors
DET-SCN-001 — Scene Classification (Places365)
| Field | Value |
|---|---|
| Framework | CoreML (ANE) + PyTorch MPS fallback |
| Model | resnet18_places365.mlpackage |
| Input | Frame resized to 224×224 |
| Output | [{scene_type, confidence, top_5}, ...] — temporal segments |
| Coordinate | Temporal only, no spatial |
| File | scripts/scene_classifier.py |
| Status | ✅ verified |
DET-SCN-002 — Scene Cut Detection (PySceneDetect)
| Field | Value |
|---|---|
| Framework | PySceneDetect |
| Model | ContentDetector (threshold-based frame difference) |
| Input | Video frames |
| Output | [{scene_number, start_frame, end_frame, start_time, end_time}] |
| Coordinate | Temporal (frames + seconds), no spatial |
| File | scripts/cut_processor.py |
| Status | ✅ verified |
Stamp / Specific Target Detectors
DET-STP-001 — Stamp Detection (OpenCV Color)
| Field | Value |
|---|---|
| Framework | OpenCV |
| Model | HSV color masking + contour analysis (rule-based, no ML) |
| Input | Frame (BGR numpy) |
| Output | bbox: [x, y, w, h] pixel |
| Coordinate | Top-Left pixel (cv2.boundingRect() native) |
| Transform | Direct, no conversion |
| File | scripts/scan_full_video_stamps.py, scripts/find_blue_stamp_opencv.py |
| Status | ✅ top-left native |
DET-STP-002 — Pose Action Decoder (Coordinate-derived)
| Field | Value |
|---|---|
| Framework | Rule-based from keypoints |
| Model | N/A (derived from DET-BODY-001/002/003 keypoints) |
| Input | Pose keypoints (pixel) |
| Output | Action labels: turn_left, turn_right, look_up, look_down, shake_head, nod_head, blink, smile, etc. |
| Coordinate | Derived angles/ratios, no raw spatial output |
| File | scripts/utils/pose_action_decoder.py, scripts/utils/integrated_body_action_decoder.py |
| Status | 🟡 assumed (actions derived from pose keypoints; dependent on upstream keypoint correctness) |
| Warning | Was affected by the DET-BODY-001 Y-flip bug (BUG-003/004, fixed 2026-05-13): all action labels were wrong when using Vision pose output produced before the fix |
Known Bugs Summary
| Bug ID | Detector | Issue | Impact | Fixed |
|---|---|---|---|---|
| BUG-001 | DET-FACE-001/002 | Index-based landmark↔face pairing | Wrong landmarks assigned to wrong faces | ✅ 2026-05-13 |
| BUG-002 | DET-FACE-002 | macOS bottom-left → missing Y-flip | Landmarks 731px offset from bbox | ✅ 2026-05-13 |
| BUG-003 | DET-BODY-001 | Missing Y-flip on keypoints | All 19 joint Y coordinates inverted | ✅ 2026-05-13 |
| BUG-004 | DET-BODY-001 | Derived bbox Y inverted | Bbox doesn't cover actual person | ✅ 2026-05-13 |
| BUG-005 | DET-TEXT-001 | Missing Y-flip on bbox | Text bbox Y inverted | ✅ 2026-05-13 |
| BUG-006 | DET-TEXT-001 | Hardcoded 640×360 in recognizeText() | Wrong bbox scale for non-640×360 images | ✅ 2026-05-13 |
Coordinate Convention Quick Reference
Apple Vision (all detectors)
| Item | Convention |
|---|---|
| boundingBox origin | Bottom-Left |
| boundingBox units | normalized [0-1] |
| pointsInImage Y axis | Bottom-Left (macOS AppKit) |
| Required Y-flip formula | bbox: y = (1 - y_norm - h_norm) * imgH; points: y = imgH - raw_y |
Non-Vision Detectors
| Framework | Origin | Units |
|---|---|---|
| YOLO (ultralytics) | Top-Left | pixel float |
| MediaPipe | Top-Left | normalized [0-1] |
| InsightFace bbox | Top-Left | pixel int |
| InsightFace landmarks | Top-Left | normalized [0-1] |
| HuggingFace (post_process) | Top-Left | pixel (after rescale) |
| OpenCV | Top-Left | pixel int |
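Governance Rules
- New detectors: must be registered in this Registry, including coordinate system, transform formula, and file location.
- Coordinate changes: any change to a transform formula requires updating this document and recording the change date.
- Verification requirement: every detector with spatial coordinates must pass at least one visual check (bbox/keypoints overlaid on the original image).
- Cross-detector comparison: for bboxes output by different detectors on the same frame, IoU should be plausible (non-zero and not 1.0).
- Iron rule for Vision detectors: any detector using the Apple Vision Framework must have its Y-flip confirmed as implemented.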
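
The cross-detector comparison rule can be sketched as an IoU computation over Top-Left pixel [x, y, w, h] boxes. The names `iou_xywh` and `iou_is_plausible` are hypothetical. An IoU of exactly 0 for the same subject suggests a coordinate-system mismatch (for example, a missing Y-flip), while exactly 1.0 across independent detectors suggests one output was copied from the other.

```python
def iou_xywh(a, b):
    """Intersection over union of two top-left [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def iou_is_plausible(a, b):
    """Apply the registry rule: IoU must be non-zero and not exactly 1.0."""
    value = iou_xywh(a, b)
    return 0.0 < value < 1.0


if __name__ == "__main__":
    face_from_vision = (0, 0, 100, 100)
    face_from_other = (50, 50, 100, 100)
    print(round(iou_xywh(face_from_vision, face_from_other), 4))  # 0.1429
    print(iou_is_plausible(face_from_vision, face_from_other))    # True
    print(iou_is_plausible(face_from_vision, (500, 500, 10, 10))) # False
```

The same `iou_xywh` function could also serve the IoU > 0.3 deduplication described for DET-FACE-002, though that reuse is a suggestion rather than a description of the existing code.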
Maintenance
- Owner: M5
- Update frequency: whenever a processor is added or a coordinate transform is modified
- Reference: SPATIAL_COORDINATE_REGISTRY.md (parent coordinate system)