docs: file lifecycle design — pre-process (birth certificate) + registration (civil registry)

Accusys
2026-05-15 12:05:13 +08:00
parent 802beb2db6
commit d81aec7360


@@ -0,0 +1,184 @@
---
title: "File Lifecycle — Pre-Processing & Registration"
version: "V1.0"
date: "2026-05-15"
author: "M5"
status: "draft"
---
# File Lifecycle — Pre-Processing & Registration
## Metaphor
```
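SHA256 = DNA or fingerprint (a unique, unchanging biometric trait)
file created time = moment of birth
birthday (UUID anchor) = birth timestamp
.pre.json = birth certificate
POST /api/v1/files/register = civil (household) registration
status = registered = household registration completed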
## Two-Phase Flow
A file enters the system in two distinct phases:
| Phase | Action | Analogy | Automatic? | Status |
|-------|--------|---------|:----------:|:------:|
| **Birth** | Pre-process: SHA256 + probe + UUID | Birth + the hospital issues a birth certificate | ✅ Watcher | `unregistered` |
| **Citizenship** | Register: INSERT into DB | Registration at the household registration office | ❌ User API | `registered` |
## Phase 1: Pre-Processing (Birth)
### Trigger
The file watcher (`src/watcher/watcher.rs`) polls the monitored directories every 60 seconds. When it detects a new file, the pre-processor runs automatically.
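The polling trigger can be pictured as a diff between two directory snapshots. The sketch below is in Python purely for illustration (the project code is Rust), and it is not the logic in `src/watcher/watcher.rs`: the names `snapshot` and `detect_new_files` are hypothetical, and a real watcher may track more state than a set of paths.

```python
from __future__ import annotations

import os
import tempfile

POLL_INTERVAL_SECS = 60  # polling period stated in this design


def snapshot(root: str) -> set[str]:
    """Return the set of file paths currently found under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def detect_new_files(previous: set[str], current: set[str]) -> list[str]:
    """Files present in the current poll but absent from the previous one."""
    return sorted(current - previous)


if __name__ == "__main__":
    # Simulate one poll cycle against a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        before = snapshot(root)
        with open(os.path.join(root, "charade.mp4"), "wb") as fh:
            fh.write(b"demo")
        after = snapshot(root)
        for path in detect_new_files(before, after):
            print("would pre-process:", path)
```

In a long-running loop, the caller would sleep for `POLL_INTERVAL_SECS` between snapshots and hand each new path to the pre-processor.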
### Computation Steps
```
1. fs::metadata(path).created()
→ birthday = file creation time (RFC 3339)
2. SHA256(full file, streaming 64 KB chunks)
→ content_hash = 256-bit digest as a 64-character hex string (the file's DNA/fingerprint)
3. ffprobe (or minimal fs metadata fallback for non-video)
→ probe_json
4. compute_birth_uuid(mac, birthday, canonical_path, filename)
→ file_uuid = SHA256(mac | birthday | path | filename)[0:32]
5. Write {OUTPUT_DIR}/{file_uuid}.pre.json
```
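Step 2 hashes the whole file without loading it into memory. A minimal Python sketch of that streaming pattern follows (illustrative only; the project code is Rust, and the helper name `sha256_file` is an assumption, not a name from the codebase):

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # 64 KB chunks, as in step 2


def sha256_file(path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file incrementally so memory use stays bounded by chunk_size."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()  # 64 hex characters = 256 bits


if __name__ == "__main__":
    # The streamed hash must equal the one-shot hash of the same bytes.
    data = os.urandom(200_000)  # spans several 64 KB chunks
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        print(sha256_file(tmp.name) == hashlib.sha256(data).hexdigest())
    finally:
        os.remove(tmp.name)
```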
### Output: `.pre.json` Schema
Stored alongside other processor outputs:
```
{OUTPUT_DIR}/
{file_uuid}.probe.json ← ffprobe
{file_uuid}.face.json ← face detection
{file_uuid}.pre.json ← pre-processor (NEW)
```
```json
{
"file_name": "charade.mp4",
"file_path": "/data/demo/charade.mp4",
"canonical_path": "/private/data/demo/charade.mp4",
"content_hash": "a1b2c3d4e5f6...",
"probe_json": {
"format": { "duration": "6879.3", "size": "2147483648" },
"streams": [...]
},
"birthday": "2026-05-15T02:15:00Z",
"file_uuid": "aeed71342a899fe4b4c57b7d41bcb692",
"file_size": 2147483648,
"file_type": "video",
"pre_processed_at": "2026-05-15T02:15:05Z"
}
```
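A record like the one above could be written and read back roughly as follows. This Python sketch is illustrative only (the project code is Rust). The helper names `write_pre_json` and `read_pre_json` are hypothetical, and writing to a temporary file before renaming is an added assumption so that a reader never sees a half-written file; the design itself does not specify this.

```python
import json
import os
import tempfile
from typing import Optional


def write_pre_json(output_dir: str, record: dict) -> str:
    """Write record to {output_dir}/{file_uuid}.pre.json and return its path."""
    final_path = os.path.join(output_dir, f"{record['file_uuid']}.pre.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
    os.replace(tmp_path, final_path)  # replaces the target in a single step
    return final_path


def read_pre_json(output_dir: str, file_uuid: str) -> Optional[dict]:
    """Return the cached record, or None if no .pre.json exists for this UUID."""
    path = os.path.join(output_dir, f"{file_uuid}.pre.json")
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    record = {
        "file_name": "charade.mp4",
        "birthday": "2026-05-15T02:15:00Z",
        "file_uuid": "aeed71342a899fe4b4c57b7d41bcb692",
        "file_type": "video",
    }
    with tempfile.TemporaryDirectory() as out:
        print(write_pre_json(out, record))
        print(read_pre_json(out, record["file_uuid"]) == record)
```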
### Key Design: UUID = f(mac, birthday, path, filename)
The `birthday` is the file's creation time, obtained from `fs::metadata().created()`. This is the **true birth time** of the file, not the registration time.
```
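birthday = 2026-05-15T02:15:00Z ← the file's birth time, never changes
file_uuid = SHA256(mac | birthday | path | filename)[0:32]
Same file: same mac + birthday + path + filename → same UUID, however many times it is registered
Different file: different birthday or path → different UUID, even with the same name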
```
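The derivation above can be sketched in a few lines of Python (illustrative only; the project code is Rust). `compute_birth_uuid` is the name used in this document, but the exact byte layout is an assumption: the sketch joins the four inputs with a literal `|` and encodes them as UTF-8, and the MAC address format in the demo is made up.

```python
import hashlib


def compute_birth_uuid(mac: str, birthday: str, canonical_path: str, filename: str) -> str:
    """SHA256 over the '|'-joined inputs, truncated to 32 hex chars (128 bits)."""
    # Assumption: '|' separator and UTF-8 encoding; the design only writes
    # SHA256(mac | birthday | path | filename)[0:32].
    material = "|".join([mac, birthday, canonical_path, filename])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


if __name__ == "__main__":
    mac = "00:11:22:33:44:55"  # hypothetical MAC address
    path = "/private/data/demo/charade.mp4"

    first = compute_birth_uuid(mac, "2026-05-15T02:15:00Z", path, "charade.mp4")
    again = compute_birth_uuid(mac, "2026-05-15T02:15:00Z", path, "charade.mp4")
    other = compute_birth_uuid(mac, "2026-05-16T09:00:00Z", path, "charade.mp4")

    print("stable across calls:", first == again)
    print("changes with birthday:", first != other)
```

Because every input is fixed at birth, re-running the function at registration time yields the same identifier, which is what lets the register handler locate `{file_uuid}.pre.json`.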
## Phase 2: Registration (Citizenship)
### POST /api/v1/files/register
```bash
curl -X POST http://localhost:3002/api/v1/files/register \
-H "X-API-Key: ..." \
-H "Content-Type: application/json" \
-d '{"file_path":"/data/demo/charade.mp4"}'
```
### Flow
```
1. Check {OUTPUT_DIR}/{file_uuid}.pre.json
├─ Exists AND content_hash matches → use cached (skip SHA256 + probe)
└─ Not exists OR hash mismatch → compute fresh (existing logic)
2. Dedup check: SELECT file_uuid FROM videos WHERE content_hash = $1
├─ Found → already_exists: true (identical DNA = same person)
└─ Not found → continue
3. Name conflict check + auto-rename if needed
└─ charade.mp4 → charade (1).mp4 (same name, different DNA)
4. INSERT INTO videos (
file_uuid, file_path, file_name, file_type,
duration, width, height, fps,
probe_json, content_hash, status, registration_time
) VALUES (
$1, $2, $3, $4, $5, $6, $7, $8, $9, $10,
'registered', NOW() ← status=registered, registration_time=NOW()
)
```
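Steps 2 to 4 can be exercised end to end against an in-memory database. The Python sketch below is illustrative only (the project code is Rust). It uses SQLite instead of the production database, a reduced `videos` table with only the columns needed for the demo, and hypothetical helper names (`open_demo_db`, `resolve_name`, `register`). Apart from `already_exists`, the returned dictionary shape is also an assumption.

```python
import os
import sqlite3
from datetime import datetime, timezone


def open_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the videos table (reduced column set)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE videos (
               file_uuid TEXT PRIMARY KEY,
               file_name TEXT NOT NULL,
               content_hash TEXT NOT NULL,
               status TEXT NOT NULL,
               registration_time TEXT NOT NULL
           )"""
    )
    return conn


def resolve_name(conn: sqlite3.Connection, file_name: str) -> str:
    """Return file_name, or 'stem (n).ext' with the first free n if it is taken."""
    stem, ext = os.path.splitext(file_name)
    candidate, n = file_name, 0
    while conn.execute(
        "SELECT 1 FROM videos WHERE file_name = ?", (candidate,)
    ).fetchone():
        n += 1
        candidate = f"{stem} ({n}){ext}"
    return candidate


def register(conn: sqlite3.Connection, file_uuid: str, file_name: str, content_hash: str) -> dict:
    # Step 2: dedup check on content_hash (identical DNA = same person).
    row = conn.execute(
        "SELECT file_uuid FROM videos WHERE content_hash = ?", (content_hash,)
    ).fetchone()
    if row:
        return {"file_uuid": row[0], "already_exists": True}

    # Step 3: name conflict check with auto-rename (same name, different DNA).
    final_name = resolve_name(conn, file_name)

    # Step 4: insert with status = registered and the current time.
    conn.execute(
        "INSERT INTO videos (file_uuid, file_name, content_hash, status, registration_time) "
        "VALUES (?, ?, ?, 'registered', ?)",
        (file_uuid, final_name, content_hash, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return {"file_uuid": file_uuid, "file_name": final_name, "already_exists": False}


if __name__ == "__main__":
    db = open_demo_db()
    print(register(db, "uuid-a", "charade.mp4", "hash-1"))
    print(register(db, "uuid-b", "charade.mp4", "hash-2"))  # renamed
    print(register(db, "uuid-c", "copy.mp4", "hash-1"))     # duplicate content
```

Running the dedup check before the rename means a byte-identical copy never consumes a new name or a new row.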
## Data Separation
| Field | Source | Computed When | Mutable |
|-------|--------|---------------|:------:|
| `birthday` | `fs::metadata().created()` | Pre-process (once) | ❌ Never |
| `content_hash` (SHA256) | Full file | Pre-process (once) | ❌ Never (unless file modified) |
| `file_uuid` | SHA256(mac\|birthday\|path\|filename) | Pre-process (once) | ❌ Never |
| `registration_time` | `NOW()` at register | Register API | ✅ Per registration |
| `status` | n/a | Register API | `unregistered` → `registered` |
## File Lifecycle State Diagram
```
File detected by watcher
    │
    ▼
[Pre-Processor]
    ├─ SHA256 (DNA/fingerprint)
    ├─ ffprobe (vital signs)
    └─ UUID (birth certificate ID)
    │
    ▼
{file_uuid}.pre.json
status = unregistered (no DB record)
    │
    │ (user calls POST /api/v1/files/register)
    ▼
[Register Handler]
    ├─ Read .pre.json (skip recomputation)
    ├─ Dedup check (content_hash collision?)
    ├─ Name check + rename?
    └─ INSERT INTO videos
    │
    ▼
status = registered
registration_time = NOW()
## Implementation Checklist
| # | Task | File |
|---|------|------|
| 1 | Modify watcher pre-processor: SHA256 + probe + write `.pre.json` | `src/watcher/watcher.rs` |
| 2 | Register: read `.pre.json`, skip SHA256/probe if cached | `src/api/server.rs` → `register_single_file` |
| 3 | UUID: use `birthday` from `.pre.json` (or `fs::metadata().created()` fallback) | `src/api/server.rs` |
| 4 | INSERT status: `registered`, registration_time: `NOW()` | `src/api/server.rs` |
| 5 | Pre-process all file types (not just video) | `src/watcher/watcher.rs` |
## Version History
| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2026-05-15 | Initial design — birth certificate (pre-process) + civil registration two-phase flow |