docs: reclassify — DESIGN→STANDARDS, conversion→M5_workspace, cleanup

Accusys
2026-05-15 12:18:29 +08:00
parent 33b6f3cc66
commit 9cf20d3f8e
16 changed files with 438 additions and 125 deletions

@@ -1,177 +0,0 @@
---
title: "File Conversion Strategy"
version: "V1.0"
date: "2026-05-15"
author: "M5"
status: "draft"
---
# File Conversion Strategy
## Overview
Momentry Core registers and processes various file types beyond video. Non-video files (documents, images, spreadsheets, presentations) need conversion to extract text content and metadata for indexing, search, and preview.
## Supported Formats
| Category | Extensions | Strategy | Status |
|----------|-----------|----------|--------|
| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
| Document | `.pdf` | `pdftotext` (poppler) or `PyPDF2` | ❌ Phase 2 |
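| Spreadsheet | `.xlsx` | `openpyxl` (MIT, Python) | ❌ Phase 1 |
| Spreadsheet | `.xls`, `.numbers` | LibreOffice (`openpyxl` reads `.xlsx` only) | ❌ Phase 3 |
| Presentation | `.pptx` | `python-pptx` (MIT, Python) | ❌ Phase 1 |
| Presentation | `.ppt`, `.key` | LibreOffice (`python-pptx` reads `.pptx` only) | ❌ Phase 3 |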
| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
| Image | `.svg`, `.heic` | ffprobe or `sips` | ✅ Phase 1 |
| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |
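The table above amounts to a routing decision keyed on file extension. A minimal sketch of that lookup follows; the `STRATEGIES` mapping and the `pick_strategy` name are hypothetical, and the entries only mirror table rows rather than any existing Momentry Core code.

```python
from pathlib import Path

# Hypothetical routing table mirroring the Supported Formats rows.
STRATEGIES = {
    ".docx": "textutil", ".doc": "textutil", ".odt": "textutil",
    ".pdf": "pdftotext",
    ".xlsx": "openpyxl",
    ".pptx": "python-pptx",
    ".jpg": "sips", ".png": "sips", ".gif": "sips",
    ".bmp": "sips", ".webp": "sips", ".heic": "sips",
    ".pages": "libreoffice", ".key": "libreoffice", ".numbers": "libreoffice",
}


def pick_strategy(path):
    """Return the conversion strategy name for a path, or None if unsupported."""
    return STRATEGIES.get(Path(path).suffix.lower())
```

Matching on the lower-cased suffix keeps `Report.DOCX` and `report.docx` on the same route; anything not listed falls through to `None` so the caller can skip conversion.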
## Available Tools
### macOS Built-in (No Installation)
| Tool | License | Supports | Command |
|------|---------|----------|---------|
| `textutil` | Apple EULA | `.docx` → `.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |
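The built-in tools can be driven from a script by building the same argument lists shown in the table. This is a sketch only: the helper names are hypothetical, and `run` will only succeed on macOS where `textutil` and `sips` exist.

```python
import subprocess


def textutil_cmd(src, dst):
    """Build the textutil command that converts src to plain text at dst."""
    return ["textutil", "-convert", "txt", src, "-output", dst]


def sips_cmd(src, dst, fmt="jpeg"):
    """Build the sips command that re-encodes an image as fmt."""
    return ["sips", "-s", "format", fmt, src, "-o", dst]


def run(cmd):
    """Run a conversion command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.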
### Python Packages (pip install, MIT/BSD License)
| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
| `PyPDF2` / `pypdf` | BSD-3 | `.pdf` → text | Page-by-page text extraction |
### Brew Packages
| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `pandoc` | GPL-2.0 | `.docx` ↔ `.md` ↔ `.html` | CLI subprocess, GPL not linked |
| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | CLI subprocess, GPL not linked |
| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1GB) |
## File Format Analysis
### Office Open XML (.docx, .xlsx, .pptx)
These are ZIP archives containing XML files:
```
document.docx = ZIP
├── word/document.xml ← Main content (text + formatting)
├── word/comments.xml
└── [Content_Types].xml
```
Python packages parse the XML directly — fast and reliable.
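To illustrate what "parsing the XML directly" means, the sketch below reads `word/document.xml` out of the ZIP with only the standard library and joins the `<w:t>` text runs of each paragraph. The function name `docx_paragraphs` is hypothetical; `python-docx` does the same job with far more care (tables, headers, nested content).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Text runs live in <w:t> elements; join them per paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```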
### Apple iWork (.pages, .key, .numbers)
These are also ZIP archives:
```
document.pages = ZIP
├── index.apxl.gz ← Actual content (Apple private format)
├── preview.pdf ← Embedded preview (usable for thumbnails)
├── QuickLook/
│ └── Thumbnail.jpg
└── Metadata/
```
Without Pages.app or LibreOffice, only the `preview.pdf`/`Thumbnail.jpg` is extractable via `unzip`. Full text extraction requires LibreOffice.
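The `unzip` step can also be done in-process. The sketch below assumes the archive layout shown above (real iWork files may be laid out differently) and uses a hypothetical helper name, `extract_previews`.

```python
import zipfile

# Entry names taken from the layout shown above; real bundles may differ.
PREVIEW_NAMES = ("preview.pdf", "QuickLook/Thumbnail.jpg")


def extract_previews(source):
    """Return {entry_name: bytes} for whichever preview entries exist."""
    found = {}
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        for name in PREVIEW_NAMES:
            if name in names:
                found[name] = zf.read(name)
    return found
```

An empty result means the file carries no embedded preview, in which case only a full converter such as LibreOffice can help.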
### PDF (.pdf)
PDF is not a ZIP format. Requires a dedicated parser:
- `pdftotext` (poppler) — best text extraction, CLI
- `PyPDF2` — pure Python, lighter dependency
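One way to combine the two options is to prefer `pdftotext` when it is installed and fall back to the pure-Python parser otherwise. This is a sketch under assumptions: the function names are hypothetical, and the fallback uses `pypdf` (the package that succeeded `PyPDF2`), which must be installed separately.

```python
import shutil
import subprocess


def pdftotext_cmd(pdf_path):
    """Build a pdftotext command that writes extracted text to stdout ('-')."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_pdf_text(pdf_path):
    """Extract text with pdftotext if installed, otherwise fall back to pypdf."""
    if shutil.which("pdftotext"):
        result = subprocess.run(pdftotext_cmd(pdf_path), check=True,
                                capture_output=True, text=True)
        return result.stdout
    from pypdf import PdfReader  # imported lazily; needs `pip install pypdf`
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```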
## Commercial Licensing Analysis
### Summary
| Tool | License | Commercial Use |
|------|---------|:-------------:|
| macOS built-in tools | Apple EULA | ✅ Licensed macOS required |
| openpyxl / python-pptx / python-docx / PyPDF2 | MIT / BSD | ✅ Permissive (keep copyright notices) |
| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Safe via CLI subprocess |
| LibreOffice (soffice) | **MPL-2.0** | ✅ Safe via CLI subprocess |
### GPL Analysis (pandoc, poppler)
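Calling GPL-licensed tools as a CLI subprocess (`std::process::Command`) keeps them as separate programs rather than linking them into Momentry Core. The GPL FAQ describes pipes, sockets, and command-line arguments as communication mechanisms normally used between two separate programs, so modules that communicate this way are normally separate programs. It adds one caveat: if the communication is intimate enough, for example exchanging complex internal data structures, the parts could still be treated as one combined program.
Momentry Core calls these tools via `std::process::Command` and exchanges only files and stdin/stdout/stderr streams, so **the GPL does not extend to Momentry Core's own code.** Shipping the tools alongside Momentry Core is **mere aggregation** in GPL terms; the tools themselves stay under the GPL.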
### MPL Analysis (LibreOffice)
MPL-2.0 is a file-level weak copyleft. Only source files from LibreOffice itself must remain MPL-licensed. Calling `soffice --headless` via subprocess does not affect Momentry Core's license.
### MIT/BSD (Python packages)
No copyleft. These packages can be distributed and linked with code under any license, provided their copyright and license notices are retained.
## Implementation Phases
### Phase 1: Zero-Install + Lightweight Python (Recommended First)
| Tool | Install | License |
|------|---------|---------|
| `textutil` | macOS built-in | Apple EULA |
| `sips` | macOS built-in | Apple EULA |
| `openpyxl` | `pip install` | MIT |
| `python-pptx` | `pip install` | MIT |
Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`
### Phase 2: Text Extraction Enhancement
| Tool | Install | License |
|------|---------|---------|
| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
| `PyPDF2` | `pip install` | BSD-3 |
Coverage: `.pdf` text extraction
### Phase 3: Full Office Suite
| Tool | Install | License |
|------|---------|---------|
| LibreOffice | `brew install libreoffice` | MPL-2.0 |
Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity
## Recommended Integration
File conversion should be implemented as Python scripts (matching existing processor pattern) and invoked at registration time:
```python
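# scripts/convert_file.py
# Called during file registration if file_type not in (video, audio, image)
import openpyxl
from pptx import Presentation


def extract_xlsx(path):
    """Yield each row of every sheet as a list of strings."""
    # data_only=True returns cached formula results instead of formula text
    wb = openpyxl.load_workbook(path, data_only=True)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows(values_only=True):
            yield [str(c) if c is not None else "" for c in row]


def extract_pptx(path):
    """Yield the text of each slide, one string per slide."""
    prs = Presentation(path)
    for slide in prs.slides:
        # Only shapes with a text frame carry text (pictures and charts do not)
        text = [shape.text_frame.text for shape in slide.shapes if shape.has_text_frame]
        yield "\n".join(text)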
```
## Version History
| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2026-05-15 | Initial document — conversion tool survey, licensing analysis, implementation phases |

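@@ -657,7 +657,81 @@ file_uuid **does not change during migration**. A file moved from Hot to Cold
file_uuid always points to the `physical_path_at_birth` recorded at birth (the Hot path) and does not change when the file is migrated.
### 6.5 location_history table
### 6.5 AI Agent: on-demand data movement
The AI Agent manages data movement automatically in the background; users do not need to know which tier a file actually resides on.
#### Architecture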
```
User / Scheduler
┌─────────────────────────────────┐
│ AI Agent │
│ • Monitor tier usage │
│ • Detect hot/cold patterns │
│ • Trigger auto-archive │
│ • Restore on access (prefetch) │
└──────────┬──────────────────────┘
┌─────────────────────────────────┐
│ Transfer Engine │
│ Direct (std::fs::copy) │
│ Rsync (delta + checksum) │
│ S3 / SFS / NFS / CDN │
└──────────┬──────────────────────┘
┌─────────────────────────────────┐
│ file_locations │
│ (single source of truth) │
│ M2 M4 M5 Cloud LTO │
└─────────────────────────────────┘
```
#### Auto-archive rules
| Trigger | Action | Transfer Engine |
|----------|------|:--:|
| `idle_days > 90` | move to Warm | Rsync + checksum verify |
| `idle_days > 365` | move to Cold | Tar + checksum verify |
| `hot_tier_usage > 80%` | move oldest to Warm | Rsync `--progress` |
| user accesses cold file | restore to Hot | Rsync prefetch |
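The first three rules can be read as two small decision functions. The sketch below is hypothetical: the function names are invented, the table does not define rule precedence (the 365-day rule is checked first here), and "oldest" is interpreted as "longest idle".

```python
def idle_rule(tier, idle_days):
    """Per-file rules: idle > 365 days goes to cold, idle > 90 days goes to warm."""
    if idle_days > 365 and tier != "cold":
        return "cold"
    if idle_days > 90 and tier == "hot":
        return "warm"
    return None  # no rule fires


def pressure_candidates(hot_files, hot_tier_usage, threshold=0.80, batch=1):
    """When Hot usage exceeds the threshold, pick the longest-idle Hot files.

    hot_files is a list of (name, idle_days) tuples.
    """
    if hot_tier_usage <= threshold:
        return []
    return sorted(hot_files, key=lambda f: f[1], reverse=True)[:batch]
```

The restore-on-access rule is event-driven rather than periodic, so it is left out of this polling-style sketch.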
#### Example flow
```
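1. The AI Agent detects that Charade_1963.mp4 has been idle for 120 days
2. rsync -avP --checksum → /Volumes/NAS_Archive/
3. POST /api/v2/files/aeed7134.../locations
   {"location": "/Volumes/NAS_Archive/Charade_1963.mp4",
    "label": "M4-warm"}
4. Remove the Hot tier location (or keep it for reference)
5. The user queries the file's info → sees every tier without needing to know the actual location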
```
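Steps 2 and 3 of the flow above could be prepared in code roughly as follows. The helper names are hypothetical, and the request is only assembled, not sent, since the API's authentication and response format are not described here.

```python
import json


def rsync_cmd(src, dest_dir):
    """Step 2: archive mode, verbose, progress, compare by checksum."""
    return ["rsync", "-avP", "--checksum", src, dest_dir]


def location_request(file_uuid, location, label):
    """Step 3: return (method, url_path, json_body) for registering a location."""
    path = f"/api/v2/files/{file_uuid}/locations"
    body = json.dumps({"location": location, "label": label})
    return "POST", path, body
```

Registering the new location only after rsync exits successfully keeps `file_locations` from pointing at a partial copy.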
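#### Design principles
| Principle | Description |
|------|------|
| Transparent migration | Users querying `file_locations` always get a consistent view |
| Immutable identifier | `file_uuid` does not change during migration |
| Location tracking | `file_locations` is updated after every migration; old locations may optionally be kept as historical reference |
| Integrity verification | Integrity is verified after each migration (rsync `--checksum`, or a manual SHA256 comparison) |
| Memory-hierarchy analogy | The Agent acts as the memory controller: Hot = cache, Warm = main memory, Cold = disk |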
```
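User queries a file → always sees a consistent view (single source of truth: file_locations)
Transfer Engine (rsync / Direct / S3 / SFS / CDN)
AI Agent (monitors tier usage, detects hot/cold patterns, auto-archives, prefetches)
Storage Tiers (M2 Hot → M4 Warm → M5 Cold → LTO)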
```
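The manual SHA256 comparison named under integrity verification could look like the sketch below. The function names are hypothetical; files are streamed in chunks so that large video files are not loaded into memory at once.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(src, dst):
    """True when source and destination have identical SHA-256 digests."""
    return sha256_of(src) == sha256_of(dst)
```

A failed `verify_copy` should leave the original location in place, so the Hot copy is removed only after the archive copy is known to be good.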
```sql
CREATE TABLE IF NOT EXISTS location_history (
@@ -908,6 +982,8 @@ CREATE TABLE IF NOT EXISTS exit_records (
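| 13 | 2026-05-14 | notify crate (Hot tier only) | Lower resource consumption; Warm/Cold tiers change infrequently |
| 14 | 2026-05-14 | zip + tar crate (no external CLI) | Cross-platform; no need for ditto/hdiutil |
| 15 | 2026-05-14 | Momentry Core integration uses the A+B hybrid mode | Crates for lightweight computation; HTTP API for heavy queries |
| 16 | 2026-05-14 | AI Agent on-demand data movement | Transparent migration, memory-hierarchy analogy, automatic hot/cold management |
| 17 | 2026-05-14 | file_locations supports arbitrary URIs | /path, s3://, sfs://, ipfs://, https://, \\SMB\path |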
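Decision 17 implies that code reading `file_locations` has to tell these location forms apart. A minimal sketch of that classification follows; the function name and the returned labels are hypothetical and not part of the schema.

```python
from urllib.parse import urlparse


def location_kind(location):
    """Classify a file_locations entry by its form; labels are illustrative."""
    if location.startswith("\\\\"):
        return "smb-unc"        # e.g. \\SMB\path
    if location.startswith("/"):
        return "local-path"     # e.g. /Volumes/NAS_Archive/...
    scheme = urlparse(location).scheme.lower()
    return scheme or "unknown"  # s3, sfs, ipfs, https, ...
```

The Transfer Engine could use the returned label to choose between a direct copy, rsync, or an object-store client.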
---
@@ -916,4 +992,4 @@ CREATE TABLE IF NOT EXISTS exit_records (
| Version | Date | Purpose | Operator | Tool/Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-05-12 | Initial design (Demo Display + Knowledge Graph) | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Virtual file tree, Group Share, storage tiers, tech stack, file_uuid, file operation API | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Virtual file tree, Group Share, storage tiers, tech stack, file_uuid, file operation API, AI Agent on-demand data movement, cross-platform multi-location | M4 / OpenCode | DeepSeek V4 Pro |