docs: file_uuid generation rules for M4

2026-05-17 02:26:09 +08:00
parent 3a6c186575
commit eec2eea880
79 changed files with 23293 additions and 0 deletions
--- a/docs_v1.0/DESIGN/DETECTOR_REGISTRY.md
+++ b/docs_v1.0/DESIGN/DETECTOR_REGISTRY.md
@@ -0,0 +1,602 @@
+# Momentry Core — Detector Registry
+
+**Date**: 2026-05-13
+**Version**: 1.0
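+**Purpose**: Consolidated record of coordinate conventions, transform chains, and verification status for every model-based and algorithmic detector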
+
+---
+
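+## Principles
+
+1. **One entry per detector**: each entry independently records input/output format, coordinate origin, units, and transform formulas.
+2. **Native coordinate system is stated**: transforms are never hidden; any output that is not Top-Left pixel must be listed explicitly.
+3. **Transform chain is traceable**: every transform step from the raw detector output to the stored database field is recorded.
+4. **Three verification levels**: `verified` (tested) / `assumed` (inferred from documentation, not tested) / `buggy` (known to be wrong).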
+
+---
+
+## Category Overview
+
+| Category | Count | Active | Experimental | Deprecated |
+|----------|:----:|:------:|:----------:|:--------:|
+| face | 8 | 2 | 4 | 2 |
+| body | 3 | 1 | 2 | 0 |
+| object | 4 | 1 | 3 | 0 |
+| text | 3 | 1 | 2 | 0 |
+| speech | 3 | 2 | 1 | 0 |
+| scene | 2 | 1 | 0 | 1 |
+| stamps | 2 | 0 | 2 | 0 |
+| **Total** | **25** | **8** | **14** | **3** |
+
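+| Status | Definition |
+|:------:|------|
+| **Active** | Runs in the production pipeline, is registered in `ProcessorType`, and its output is consumed |
+| **Experimental** | Standalone script or CLI, not connected to the pipeline; under evaluation or kept as a backup |
+| **Deprecated** | Abandoned after evaluation, or superseded by a newer version but not yet removed from the codebase |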
+
+---
+
+## Pipeline Status Quick-Reference
+
+| # | Detector ID | Short Name | Pipeline Status | Reason |
+|---|-------------|-----------|:-----:|--------|
+| 1 | DET-SCN-002 | PySceneDetect | active | CUT processor |
+| 2 | DET-SCN-001 | Places365 | **active but rejected** ⚠️ | M5 eval rejected; never removed from ProcessorType |
+| 3 | DET-SPCH-001 | faster-whisper | active | ASR processor |
+| 4 | DET-SPCH-003 | ECAPA-TDNN | active | ASRX speaker embedding |
+| 5 | DET-OBJ-001 | YOLOv8s | active | YOLO processor (v5nu→v8s, 2026-05-13) |
+| 6 | DET-TEXT-001 | swift_ocr | active | OCR processor (primary) |
+| 7 | DET-FACE-001/002/003 | swift_face + FaceNet | active | Face processor |
+| 8 | DET-BODY-001/002 | swift_pose + YOLOv8-pose | active | Pose processor (primary + fallback) |
+| 9 | DET-FACE-006 | AgglomerativeClustering | active | Identity Agent (post-processing) |
+| 10 | DET-TEXT-005 | llama.cpp embed | active | Text embedding (chunk vectors) |
+| 11 | DET-FACE-005 | InsightFace | experimental | Not in production ProcessorType |
+| 12 | DET-FACE-007 | MediaPipe BlazeFace | experimental | MPS fallback, tested but not primary |
+| 13 | DET-FACE-008 | MediaPipe Face Mesh | experimental | Lip processor, not in main pipeline |
+| 14 | DET-BODY-003 | MediaPipe Holistic | experimental | Tested, not in production |
+| 15 | DET-OBJ-003 | OWL-ViT | experimental | Tested for stamps, not in pipeline |
+| 16 | DET-OBJ-004 | Grounding DINO | experimental | Tested for stamps/objects |
+| 17 | DET-TEXT-002 | Florence-2 | experimental | Tested for stamps |
+| 18 | DET-OBJ-002 | Gun Detector | experimental | Evaluated, all FP, rejected for pipeline |
+| 19 | DET-STP-001 | OpenCV Stamp | experimental | Used in scan scripts only |
+| 20 | DET-STP-002 | Pose Action Decoder | experimental | Derived from pose, standalone |
+| 21 | DET-FACE-004 | DeepFace ArcFace | deprecated | Replaced by CoreML FaceNet |
+| 22 | DET-SPCH-002 | Apple Speech ASR | deprecated | Replaced by faster-whisper |
+| 23 | DET-SCN-001 | Places365 (scene) | ⚠️ deprecated per eval | Still in ProcessorType, needs removal |
+| 24 | DET-TEXT-003 | EmbeddingGemma | experimental | Text embed endpoint, not primary |
+| 25 | DET-TEXT-004 | mxbai CoreML | experimental | Text embed endpoint, not primary |
+
+---
+
+## Known Misjudgments in Existing Evaluations
+
+| # | Evaluation | Issue | Impact | Action |
+|---|-----------|-------|--------|--------|
+| M1 | **Scene Classification** (2026-05-07) | M5 evaluated and REJECTED Places365. But it was never removed from `ProcessorType::all()`. Still runs on every file. | Wastes ~2min per registration. Produces meaningless scene.json. | Remove from pipeline or re-evaluate |
+| M2 | **Face Processor** benchmark (2026-04-28) | Compared InsightFace vs MediaPipe vs OpenCV vs Contract v1. But the final pipeline uses **swift_face + FaceNet**, a completely different solution not in the benchmark. | Selection criteria from benchmark don't apply to actual pipeline detector. | Document the actual selection decision for swift_face |
+| M3 | **Gun Detector** (2026-05-07) | Properly rejected: 7/7 FP. Correct decision. Model files still in repo. | No impact (correctly excluded). Clean up model files. | Archive or remove `models/gun/` |
+| M4 | **OCR processor** | No selection document exists. swift_ocr chosen without comparison against EasyOCR/PaddleOCR. | Unknown if optimal. PaddleOCR fallback may never trigger. | Document selection decision |
+
+---
+
+### Technical Classification (Spatial Coordinates vs. None)
+
+| Category | Count | Spatial coordinates | Embedding only | Temporal/text only |
+|----------|:----:|:--------:|:----------:|:--------:|
+| face | 8 | 5 | 3 | — |
+| body | 3 | 3 | — | — |
+| object | 4 | 4 | — | — |
+| text | 3 | 1 | 2 | — |
+| speech | 3 | — | 2 | 1 |
+| scene | 2 | — | 1 | 1 |
+| stamps | 2 | 2 | — | — |
+| **Total** | **25** | **15** | **8** | **2** |
+
+---
+
+## Face Detectors
+
+### DET-FACE-001 — Face Bbox (Apple Vision)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Apple Vision |
+| **Model** | `VNDetectFaceRectanglesRequest` |
+| **Input** | `CVPixelBuffer` (BGRA, via CGImage) |
+| **Output** | bbox: `x, y, width, height` |
+| **Coordinate** | Input: normalized [0-1], origin **bottom-left** |
+| **Transform** | `x = bb.origin.x * imgW` |
+| | `y = (1.0 - bb.origin.y - bb.size.height) * imgH` |
+| **Image size** | `cgImage.width / cgImage.height` |
+| **Target** | Top-Left pixel integer |
+| **File** | `scripts/swift_processors/swift_face.swift:134-136` |
+| **Status** | ✅ verified (2026-05-13, landmark QC + visual check) |
+
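The transform in the table above can be checked by hand with a short Python sketch. This is not the code in `swift_face.swift`: the function name is hypothetical, and both the width/height scaling and the truncation to `int` are assumptions, since the table only lists the `x` and `y` formulas and an integer target.

```python
def vision_bbox_to_top_left(bb_x, bb_y, bb_w, bb_h, img_w, img_h):
    """Convert a normalized, bottom-left-origin bbox to top-left pixel ints.

    bb_x, bb_y: normalized origin of the box (its bottom-left corner).
    bb_w, bb_h: normalized box size.
    """
    x = bb_x * img_w
    # Flip Y: the top edge of the box sits at (bb_y + bb_h) measured from the bottom.
    y = (1.0 - bb_y - bb_h) * img_h
    w = bb_w * img_w  # assumption: size is scaled the same way as the origin
    h = bb_h * img_h
    # Assumption: truncate to int; the real processor may round instead.
    return int(x), int(y), int(w), int(h)


print(vision_bbox_to_top_left(0.25, 0.5, 0.5, 0.25, 1920, 1080))
# -> (480, 270, 960, 270)
```

A box whose bottom edge is at half height and whose height is a quarter of the image ends up with its top edge a quarter of the way down, which is what the Y-flip is meant to produce.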
+---
+
+### DET-FACE-002 — Face Landmarks (Apple Vision)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Apple Vision |
+| **Model** | `VNDetectFaceLandmarksRequest` |
+| **Input** | `CVPixelBuffer` (BGRA, via CGImage) |
+| **Output** | landmarks: `left_eye (6pt)`, `right_eye (6pt)`, `nose (8pt)`, `outer_lips`, `inner_lips` |
+| **Coordinate** | Input: `VNFaceLandmarks2D.pointsInImage(imageSize:)` |
+| | Returned: macOS AppKit convention → **bottom-left** origin ⚠️ |
+| **Transform** | `y_top_left = imgH - $0.y` (Y-flip) |
+| **Image size** | `cgImage.width / cgImage.height` |
+| **Target** | Top-Left pixel float → JSON |
+| **Pairing** | Not by array index. Landmark observations used as primary source (self-consistent bbox + landmarks). Face rect observations deduplicated via IoU > 0.3. |
+| **File** | `scripts/swift_processors/swift_face.swift:155-184` |
+| **Status** | ✅ verified (2026-05-13, Y-flip fix, 100% landmark-in-bbox) |
+| **Bugs fixed** | BUG-001: index-based pairing (landmarkObs[idx] ≠ faceObs[idx]) |
+| | BUG-002: macOS bottom-left Y axis (missing Y-flip) |
+
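The pairing rule (landmark observations as the primary source, face rect observations deduplicated via IoU > 0.3) might look like the following Python sketch. The function names are hypothetical, boxes are assumed to be `[x, y, w, h]` in top-left pixels, and keeping the non-overlapping face rects instead of discarding them is an assumption, not something the table states.

```python
def iou(a, b):
    """Intersection over union of two [x, y, w, h] top-left pixel boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def merge_faces(landmark_boxes, rect_boxes, threshold=0.3):
    """Keep every landmark box; add only rect boxes that duplicate none of them."""
    merged = list(landmark_boxes)
    for rect in rect_boxes:
        # A rect box counts as a duplicate if it overlaps any landmark box above threshold.
        if all(iou(rect, lm) <= threshold for lm in landmark_boxes):
            merged.append(rect)
    return merged


landmark_boxes = [(100, 100, 50, 50)]
rect_boxes = [(105, 105, 50, 50), (400, 300, 40, 40)]
print(merge_faces(landmark_boxes, rect_boxes))
# -> [(100, 100, 50, 50), (400, 300, 40, 40)]
```

Matching by overlap instead of array index avoids the BUG-001 failure mode, where the two Vision requests return faces in different orders.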
+---
+
+### DET-FACE-003 — Face Embedding (CoreML FaceNet)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | CoreML (ANE-accelerated) |
+| **Model** | `models/facenet512.mlpackage` |
+| **Input** | Face crop 160×160, RGB, normalized `[-1, 1]` |
+| **Output** | 512-dim float embedding |
+| **Coordinate** | N/A (no spatial output). Bbox from DET-FACE-001 used for crop. |
+| **File** | `scripts/face_processor.py`, `scripts/embed_faces.py`, `scripts/tmdb_embed_extractor.py` |
+| **Embedding space** | [-1, 1] per dimension, cosine similarity for matching |
+| **Status** | ✅ verified (routinely used for identity matching) |
+
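As an illustration of the input normalization and the matching metric named above, here is a standard-library Python sketch. The `value / 127.5 - 1.0` scaling is an assumption (a common way to reach `[-1, 1]` from 8-bit pixels); the table does not say which formula `face_processor.py` uses, and both function names are hypothetical.

```python
import math


def normalize_pixel(value):
    """Map an 8-bit channel value (0..255) to [-1, 1]. Assumed scaling."""
    return value / 127.5 - 1.0


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treated here as "no match"
    return dot / (norm_a * norm_b)


print(normalize_pixel(0), normalize_pixel(255))  # -> -1.0 1.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0
```

Because cosine similarity ignores vector length, embeddings do not need to be L2-normalized before comparison, although normalizing once at write time makes later comparisons cheaper.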
+---
+
+### DET-FACE-004 — Face Embedding (DeepFace ArcFace)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | DeepFace / TensorFlow |
+| **Model** | `ArcFace` (512-dim) |
+| **Input** | Face crop (from bbox), BGR, no explicit normalization |
+| **Output** | 512-dim float embedding |
+| **Coordinate** | N/A |
+| **File** | `scripts/face_embedding_extractor.py` |
+| **Status** | 🟡 assumed (legacy fallback, not primary pipeline) |
+
+---
+
+### DET-FACE-005 — Face Recognition (InsightFace)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | InsightFace / ONNX Runtime |
+| **Model** | `buffalo_l` (detection + recognition + 5-point landmarks) |
+| **Input** | Video frame (BGR, numpy array) |
+| **Output** | `bbox: [x1, y1, x2, y2]` pixel int |
+| | `landmarks: 5-point` (left_eye, right_eye, nose, mouth_left, mouth_right) |
+| | `embedding: 512-dim float` |
+| **Coordinate** | Bbox: **Top-Left pixel** (InsightFace native) |
+| | Landmarks: **normalized [0-1]** to image size |
+| **Transform** | Bbox: `face.bbox.astype(int)` — direct |
+| | Landmarks: `kps * imgW, kps * imgH` — needs manual conversion ⚠️ |
+| **File** | `scripts/face_recognition_processor.py:123-153` |
+| **Status** | 🟡 assumed (landmark pixel conversion chain not independently verified) |
+
+---
+
+### DET-FACE-006 — Face Clustering (sklearn)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | sklearn |
+| **Model** | `AgglomerativeClustering` |
+| **Input** | 512-dim face embeddings from DET-FACE-003 or DET-FACE-004 |
+| **Output** | cluster labels, centroids (512-dim float) |
+| **Coordinate** | N/A (no spatial output) |
+| **File** | `scripts/face_clustering_processor.py`, `scripts/identity_bind.py` |
+| **Status** | ✅ verified (428 clusters for Charade, identity_bindings created) |
+
+---
+
+### DET-FACE-007 — Face Detection (MediaPipe BlazeFace)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | MediaPipe / MPS |
+| **Model** | `blaze_face_short_range.tflite` |
+| **Input** | Frame (numpy array / MPS image) |
+| **Output** | `bbox: [x, y, width, height]` pixel |
+| | `6 keypoints`: eyes, nose tip, mouth center, ear tragions — **pixel** |
+| **Coordinate** | **Top-Left pixel** (MediaPipe native) |
+| **Transform** | Direct, no conversion needed |
+| **File** | `scripts/face_processor_mps.py` |
+| **Status** | 🟡 assumed (MPS fallback, rarely used in pipeline) |
+
+---
+
+### DET-FACE-008 — Lip Detection (MediaPipe Face Mesh)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | MediaPipe |
+| **Model** | `Face Mesh` (468 landmarks) |
+| **Input** | Face crop or full frame |
+| **Output** | `lip_openness: [0-1]` (vertical/mouth_width) |
+| | `mouth keypoints`: indices 13, 14, 61, 291 from 468 mesh |
+| **Coordinate** | Landmarks: **normalized [0-1]**, Top-Left origin |
+| **Transform** | Normalized → pixel: `x * imgW, y * imgH` |
+| | Lip openness: derived ratio, unitless |
+| **File** | `scripts/lip_processor.py` |
+| **Status** | 🟡 assumed |
+
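The derived ratio can be sketched as follows. This is an assumption-heavy illustration, not the logic in `lip_processor.py`: it assumes indices 13 and 14 are the upper and lower inner-lip points and 61 and 291 are the mouth corners, converts to pixels before measuring so that a non-square frame does not distort the ratio, and clamps the result to 1.0 to match the stated `[0-1]` range.

```python
import math


def lip_openness(landmarks, img_w, img_h):
    """Vertical lip gap divided by mouth width, from normalized mesh landmarks.

    landmarks: mapping or sequence indexed by mesh index, each entry (x, y, ...)
    normalized to [0, 1] with a top-left origin.
    """
    def to_pixel(index):
        x, y = landmarks[index][0], landmarks[index][1]
        return x * img_w, y * img_h

    upper, lower = to_pixel(13), to_pixel(14)   # assumed: inner-lip midpoints
    left, right = to_pixel(61), to_pixel(291)   # assumed: mouth corners
    width = math.dist(left, right)
    if width == 0.0:
        return 0.0
    return min(1.0, math.dist(upper, lower) / width)


sample = {13: (0.5, 0.50), 14: (0.5, 0.55), 61: (0.4, 0.525), 291: (0.6, 0.525)}
print(round(lip_openness(sample, 1000, 1000), 3))  # -> 0.25
```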
+---
+
+## Body Pose Detectors
+
+### DET-BODY-001 — Body Pose (Apple Vision)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Apple Vision |
+| **Model** | `VNDetectHumanBodyPoseRequest` |
+| **Input** | `CGImage` (from frame export or NSImage) |
+| **Output** | `19 keypoints`: nose, eyes, ears, neck, root, shoulders, elbows, wrists, hips, knees, ankles |
+| | `bbox: [x, y, width, height]` derived from keypoint min/max |
+| **Coordinate** | Input: normalized [0-1], origin **bottom-left** |
+| **Transform** (current) | ✅ `y = h - location.y * h` — Y-flip applied |
+| **Transform** (correct) | `y = h - location.y * h` |
+| **Image size** | `cgImage.width / cgImage.height` |
+| **Target** | Top-Left pixel float |
+| **File** | `scripts/swift_processors/swift_pose.swift:154-159` |
+| **Status** | ✅ verified (2026-05-13, Y-flip fix applied) |
+
+---
+
+### DET-BODY-002 — Body Pose (YOLOv8 Pose fallback)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | ultralytics / PyTorch |
+| **Model** | `yolov8n-pose.pt` |
+| **Input** | Frame (PIL or numpy) |
+| **Output** | `17 COCO keypoints`: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles |
+| | `bbox: [x, y, width, height]` derived from keypoints (conf > 0.1) |
+| **Coordinate** | **Top-Left pixel** (YOLO native, `.xy[0]` → numpy float) |
+| **Transform** | Direct: `x, y = float(kps[j][0]), float(kps[j][1])` |
+| | Bbox: `min(xs), min(ys), max(xs)-min(xs), max(ys)-min(ys)` |
+| **File** | `scripts/pose_processor.py:78-97` |
+| **Status** | ✅ top-left native |
+
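The keypoint-derived bbox described above (min/max over keypoints with conf > 0.1) can be sketched in plain Python. The function name and the `(x, y, conf)` tuple layout are hypothetical; `pose_processor.py` reads the values from ultralytics result objects instead.

```python
def bbox_from_keypoints(keypoints, min_conf=0.1):
    """Derive [x, y, width, height] from keypoints whose confidence exceeds min_conf.

    keypoints: iterable of (x, y, conf) in top-left pixel coordinates.
    Returns None when no keypoint passes the confidence filter.
    """
    points = [(x, y) for x, y, conf in keypoints if conf > min_conf]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


keypoints = [(100.0, 50.0, 0.9), (140.0, 120.0, 0.8), (0.0, 0.0, 0.05)]
print(bbox_from_keypoints(keypoints))  # -> [100.0, 50.0, 40.0, 70.0]
```

The confidence filter matters because undetected keypoints are commonly reported at `(0, 0)`; without it, the derived box would stretch to the image origin.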
+---
+
+### DET-BODY-003 — Full Body (MediaPipe Holistic)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | MediaPipe |
+| **Model** | `Holistic` (pose + face mesh + hands) |
+| **Input** | Frame (BGR numpy) |
+| **Output** | `468 face mesh`: `[[x, y, z], ...]` normalized [0-1] |
+| | `33 body pose`: `[[x, y, z, visibility], ...]` normalized [0-1] |
+| | `21 hand × 2`: `[[x, y, z], ...]` normalized [0-1] |
+| **Coordinate** | **normalized [0-1]**, Top-Left origin |
+| **Transform** | `x * imgW, y * imgH` → pixel (if needed) |
+| | Z: depth relative, not metric |
+| **File** | `scripts/mediapipe_holistic_processor.py` |
+| **Status** | ✅ top-left native, normalized→pixel straightforward |
+
+---
+
+## Object Detectors
+
+### DET-OBJ-001 — Object Detection (YOLOv8s)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | ultralytics / CoreML + PyTorch fallback |
+| **Model** | `yolov8s.mlpackage` (primary, CoreML ANE), `yolov8s.pt` (fallback) |
+| **mAP (COCO)** | 44.9 (was 34.3 with YOLOv5nu, +31%) |
+| **Input** | Frame (PIL or numpy) |
+| **Output** | `bbox: [x1, y1, x2, y2]` — float pixel |
+| | `class_name, class_id` (80 COCO classes) |
+| | `confidence: [0-1]` |
+| **Coordinate** | **Top-Left pixel** (YOLO `.xyxy[0]` → float) |
+| **Transform** | Rust: `x = detection.x1 as i32, y = detection.y1 as i32` — **int truncation** |
+| | `width = x2 - x1, height = y2 - y1` |
+| **Image size** | YOLO auto-handles via ultralytics inference |
+| **File** | `scripts/yolo_processor.py:272-285`, `src/core/processor/yolo.rs:83-117` |
+| **Status** | ✅ verified (2026-05-13, replaced YOLOv5nu, +19% detections, scene indicators +162% to +473%) |
+| **Replaced** | YOLOv5nu (mAP 34.3, removed 2026-05-13) |
+
+---
+
+### DET-OBJ-002 — Weapon Detection (YOLOv8n Fine-tuned)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | ultralytics / PyTorch |
+| **Model** | `models/gun/gun_detector/weights/best.pt` |
+| **Input** | Frame (numpy array) |
+| **Output** | `bbox: [x1, y1, x2, y2]` pixel |
+| | `class: {0: grenade, 1: knife, 2: pistol, 3: rifle}` |
+| **Coordinate** | **Top-Left pixel** (YOLO native) |
+| **File** | `scripts/gun_detector_scan.py` |
+| **Status** | ✅ top-left native |
+
+---
+
+### DET-OBJ-003 — Open-Vocabulary Detection (OWL-ViT)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | HuggingFace Transformers |
+| **Model** | `google/owlvit-base-patch32` |
+| **Input** | PIL Image + text queries |
+| **Output** | `bbox, scores, labels` |
+| **Coordinate** | post_process_object_detection returns boxes in `[x1, y1, x2, y2]` format |
+| | scaled to `target_sizes` parameter |
+| **Transform** | `target_sizes = torch.Tensor([image_pil.size[::-1]])` — PIL (w,h) → (h,w) |
+| | `box.int().tolist()` or `box.tolist()` → Python list |
+| **Format risk** | HuggingFace processor version may return `[cx, cy, w, h]` not `[x1,y1,x2,y2]` |
+| **File** | `scripts/test_owl_vit_stamps.py:69-80`, `scripts/magnifying_glass_owl.py:65-77` |
+| **Status** | 🟡 **assumed** (bbox format not independently verified with visual check) |
+| **Verify** | Render bbox overlay on a known target image, confirm x1 < x2, y1 < y2 |
+
+---
+
+### DET-OBJ-004 — Open-Vocabulary Detection (Grounding DINO)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | HuggingFace Transformers |
+| **Model** | `IDEA-Research/grounding-dino-base` |
+| **Input** | PIL Image + text prompts |
+| **Output** | `boxes, labels, scores` |
+| **Coordinate** | processor rescales to `target_sizes`, returns pixel boxes |
+| **Transform** | `target_sizes=[img.size[::-1]]` — PIL (w,h) → (h,w) |
+| | `[round(v, 1) for v in dets["boxes"][i].tolist()]` |
+| **Format risk** | `[::-1]` order depends on processor expectations. If processor expects (w,h), axes swapped. |
+| **File** | `scripts/gdino_frame_api.py:176-180` |
+| **Status** | 🟡 **assumed** (rescale direction not independently verified) |
+| **Verify** | Single-frame output: check bbox x range ≤ imgW, y range ≤ imgH |
+
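The **Verify** rows for DET-OBJ-003 and DET-OBJ-004 describe numeric checks that can be automated before the visual overlay. The sketch below is a hypothetical helper, not part of any listed script, and it assumes boxes are meant to be `[x1, y1, x2, y2]` in pixels. Passing these checks is necessary but not sufficient: a `[cx, cy, w, h]` box or a swapped-axis rescale can still pass by chance, so the overlay check remains the deciding test.

```python
def check_xyxy_box(box, img_w, img_h):
    """Return a list of problems found in a supposed [x1, y1, x2, y2] pixel box."""
    x1, y1, x2, y2 = box
    problems = []
    if not x1 < x2:
        problems.append("x1 >= x2")
    if not y1 < y2:
        problems.append("y1 >= y2")
    if x1 < 0 or x2 > img_w:
        problems.append("x out of range")
    if y1 < 0 or y2 > img_h:
        problems.append("y out of range")
    return problems


print(check_xyxy_box([10, 20, 110, 220], 640, 360))  # -> []
print(check_xyxy_box([10, 20, 110, 500], 640, 360))  # -> ['y out of range']
```

On a landscape frame, a rescale that swaps width and height tends to surface as y values larger than `img_h`, which is the second case above.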
+---
+
+## Text / OCR Detectors
+
+### DET-TEXT-001 — OCR (Apple Vision)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Apple Vision |
+| **Model** | `VNRecognizeTextRequest` (accurate/fast) |
+| **Input** | `CVPixelBuffer` (via CGImage) |
+| **Output** | `text: string`, `bbox: [x, y, w, h]`, `confidence: [0-1]` |
+| **Coordinate** | Input: `VNRecognizedTextObservation.boundingBox` — normalized [0-1], origin **bottom-left** |
+| **Transform** | ✅ `y = (1.0 - bb.origin.y - bb.size.height) * cgH` — Y-flip applied |
+| **Image size** | Main loop: `cgImage.width / cgImage.height` ✅ |
+| | `recognizeText()` helper: `CVPixelBufferGetWidth/Height` ✅ |
+| **File** | `scripts/swift_processors/swift_ocr.swift:125-133`, `:181-182` |
+| **Status** | ✅ verified (2026-05-13, Y-flip + image size fix applied) |
+
+---
+
+### DET-TEXT-002 — Open-Vocabulary (Florence-2)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | HuggingFace Transformers |
+| **Model** | `microsoft/Florence-2-base` |
+| **Input** | PIL Image + task prompt |
+| **Output** | `bbox: [x1, y1, x2, y2]` pixel |
+| | `label, text` (depending on task) |
+| **Coordinate** | processor `post_process_generation` rescales to `image_size`, returns pixel |
+| **Transform** | `x1, y1, x2, y2 = map(int, bbox)` — direct |
+| | `image_size=(image_pil.width, image_pil.height)` — (w, h) order ✅ |
+| **File** | `scripts/florence2_scan_stamps.py:67-79`, `scripts/test_florence2_direct.py` |
+| **Status** | ✅ top-left native (HuggingFace post_process output) |
+
+---
+
+### DET-TEXT-003 — Text Embedding (EmbeddingGemma)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | HuggingFace / PyTorch MPS |
+| **Model** | `google/embeddinggemma-300m` |
+| **Input** | Text string |
+| **Output** | Embedding vector (L2 normalized, dimension model-dependent) |
+| **Coordinate** | N/A |
+| **File** | `scripts/embeddinggemma_server.py` |
+| **Status** | ✅ verified (embedding API server) |
+
+---
+
+## Text Embedding (Non-Detector)
+
+### DET-TEXT-004 — Text Embedding (mxbai CoreML)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | CoreML (ANE-accelerated) |
+| **Model** | `mxbai-embed-large-v1.mlpackage` |
+| **Input** | Text tokenized |
+| **Output** | Embedding vector |
+| **Coordinate** | N/A |
+| **File** | `scripts/coreml_embed_server.py` |
+| **Status** | 🟡 assumed |
+
+---
+
+### DET-TEXT-005 — Text Embedding (Ollama / llama.cpp)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | llama.cpp / Ollama API |
+| **Model** | llama.cpp embedding endpoint (port 11436) |
+| **Input** | Text (optionally prefixed `search_document:`) |
+| **Output** | 768-dim float embedding |
+| **Coordinate** | N/A |
+| **File** | `src/core/embedding/comic_embed.rs` |
+| **Status** | ✅ verified (embedding pipeline) |
+
+---
+
+## Speech / Audio Detectors
+
+### DET-SPCH-001 — ASR (faster-whisper)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | faster-whisper / CTranslate2 |
+| **Model** | `faster-whisper/small` (int8 CPU) |
+| **Input** | Audio extracted from video |
+| **Output** | `[{start, end, text}, ...]` — temporal segments (seconds) |
+| **Coordinate** | Temporal only (seconds), no spatial |
+| **File** | `scripts/asr_processor.py` |
+| **Status** | ✅ verified (ASR pipeline) |
+
+---
+
+### DET-SPCH-002 — ASR (Apple Speech)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Apple Speech (ANE) |
+| **Model** | `SFSpeechRecognizer` |
+| **Input** | Audio file |
+| **Output** | `[{start, end, text, confidence}, ...]` — temporal segments |
+| **Coordinate** | Temporal only (seconds), no spatial |
+| **File** | `scripts/swift_processors/asr_swift.swift` |
+| **Status** | 🟡 assumed (Apple Speech quality lower than faster-whisper) |
+
+---
+
+### DET-SPCH-003 — Speaker Embedding (ECAPA-TDNN)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | SpeechBrain / PyTorch |
+| **Model** | `speechbrain/spkrec-ecapa-voxceleb` |
+| **Input** | Audio segments per speaker |
+| **Output** | `192-dim float embedding` |
+| **Coordinate** | N/A (vector space, cosine similarity) |
+| **File** | `scripts/asrx_processor_custom.py`, `scripts/voice_embedding_extractor.py` |
+| **Status** | ✅ verified (voice embeddings exported to SQLite + Qdrant) |
+
+---
+
+## Scene Detectors
+
+### DET-SCN-001 — Scene Classification (Places365)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | CoreML (ANE) + PyTorch MPS fallback |
+| **Model** | `resnet18_places365.mlpackage` |
+| **Input** | Frame resized to 224×224 |
+| **Output** | `[{scene_type, confidence, top_5}, ...]` — temporal segments |
+| **Coordinate** | Temporal only, no spatial |
+| **File** | `scripts/scene_classifier.py` |
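+| **Status** | ✅ verified (note: rejected by the M5 evaluation but still registered in `ProcessorType`; see M1 under Known Misjudgments) |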
+
+---
+
+### DET-SCN-002 — Scene Cut Detection (PySceneDetect)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | PySceneDetect |
+| **Model** | `ContentDetector` (threshold-based frame difference) |
+| **Input** | Video frames |
+| **Output** | `[{scene_number, start_frame, end_frame, start_time, end_time}]` |
+| **Coordinate** | Temporal (frames + seconds), no spatial |
+| **File** | `scripts/cut_processor.py` |
+| **Status** | ✅ verified |
+
+---
+
+## Stamp / Specific Target Detectors
+
+### DET-STP-001 — Stamp Detection (OpenCV Color)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | OpenCV |
+| **Model** | HSV color masking + contour analysis (rule-based, no ML) |
+| **Input** | Frame (BGR numpy) |
+| **Output** | `bbox: [x, y, w, h]` pixel |
+| **Coordinate** | **Top-Left pixel** (`cv2.boundingRect()` native) |
+| **Transform** | Direct, no conversion |
+| **File** | `scripts/scan_full_video_stamps.py`, `scripts/find_blue_stamp_opencv.py` |
+| **Status** | ✅ top-left native |
+
+---
+
+### DET-STP-002 — Pose Action Decoder (Coordinate-derived)
+
+| Field | Value |
+|-------|-------|
+| **Framework** | Rule-based from keypoints |
+| **Model** | N/A (derived from DET-BODY-001/002/003 keypoints) |
+| **Input** | Pose keypoints (pixel) |
+| **Output** | Action labels: turn_left, turn_right, look_up, look_down, shake_head, nod_head, blink, smile, etc. |
+| **Coordinate** | Derived angles/ratios, no raw spatial output |
+| **File** | `scripts/utils/pose_action_decoder.py`, `scripts/utils/integrated_body_action_decoder.py` |
+| **Status** | 🟡 assumed (actions derived from pose keypoints; dependent on upstream keypoint correctness) |
+| **Warning** | Was affected by the DET-BODY-001 Y-flip bug (BUG-003, fixed 2026-05-13); action labels derived from Vision pose output produced before the fix are wrong |
+
+---
+
+## Known Bugs Summary
+
+| Bug ID | Detector | Issue | Impact | Fixed |
+|:------|----------|-------|--------|:-----:|
+| BUG-001 | DET-FACE-001/002 | Index-based landmark↔face pairing | Wrong landmarks assigned to wrong faces | ✅ 2026-05-13 |
+| BUG-002 | DET-FACE-002 | macOS bottom-left → missing Y-flip | Landmarks 731px offset from bbox | ✅ 2026-05-13 |
+| BUG-003 | DET-BODY-001 | Missing Y-flip on keypoints | All 19 joint Y coordinates inverted | ✅ 2026-05-13 |
+| BUG-004 | DET-BODY-001 | Derived bbox Y inverted | Bbox doesn't cover actual person | ✅ 2026-05-13 |
+| BUG-005 | DET-TEXT-001 | Missing Y-flip on bbox | Text bbox Y inverted | ✅ 2026-05-13 |
+| BUG-006 | DET-TEXT-001 | Hardcoded 640×360 in `recognizeText()` | Wrong bbox scale for non-640×360 images | ✅ 2026-05-13 |
+
+---
+
+## Coordinate Convention Quick Reference
+
+### Apple Vision (all detectors)
+
+| Item | Convention |
+|------|-----------|
+| boundingBox origin | Bottom-Left |
+| boundingBox units | normalized [0-1] |
+| pointsInImage Y axis | Bottom-Left (macOS AppKit) |
+| Required Y-flip formula | bbox: `y = (1 - y_norm - h_norm) * imgH` |
+| | points: `y = imgH - raw_y` |
+
+### Non-Vision Detectors
+
+| Framework | Origin | Units |
+|-----------|:------:|-------|
+| YOLO (ultralytics) | Top-Left | pixel float |
+| MediaPipe | Top-Left | normalized [0-1] |
+| InsightFace bbox | Top-Left | pixel int |
+| InsightFace landmarks | Top-Left | normalized [0-1] |
+| HuggingFace (post_process) | Top-Left | pixel (after rescale) |
+| OpenCV | Top-Left | pixel int |
+
+---
+
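+## Governance Rules
+
+1. **Adding a detector**: it must be registered in this Registry, including coordinate system, transform formulas, and file locations.
+2. **Coordinate changes**: any change to a transform formula must be reflected in this document, with the change date noted.
+3. **Verification requirement**: every detector with spatial coordinates must pass at least one visual check (bbox/keypoints overlaid on the source image).
+4. **Cross-detector comparison**: for bboxes that different detectors output on the same frame, the IoU should be plausible (neither zero nor 1.0).
+5. **Iron rule for Vision detectors**: any detector that uses the Apple Vision framework must be confirmed to implement the Y-flip.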
+
+---
+
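+## Maintenance
+
+- **Owner**: M5
+- **Update frequency**: whenever a processor is added or a coordinate transform is modified
+- **See also**: `SPATIAL_COORDINATE_REGISTRY.md` (the parent coordinate system document)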