docs: reclassify — DESIGN→STANDARDS, conversion→M5_workspace, cleanup
@@ -1,177 +0,0 @@
---
title: "File Conversion Strategy"
version: "V1.0"
date: "2026-05-15"
author: "M5"
status: "draft"
---

# File Conversion Strategy

## Overview

Momentry Core registers and processes many file types besides video. Non-video files (documents, images, spreadsheets, presentations) must be converted so that their text content and metadata can be extracted for indexing, search, and preview.

## Supported Formats

| Category | Extensions | Strategy | Status |
|----------|-----------|----------|--------|
| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
| Document | `.pdf` | `pdftotext` (poppler) or `PyPDF2` | ❌ Phase 2 |
| Spreadsheet | `.xlsx`, `.xls`, `.numbers` | `openpyxl` (MIT, Python) for `.xlsx`; `.xls` and `.numbers` need LibreOffice (Phase 3) | ❌ Phase 1 |
| Presentation | `.pptx`, `.ppt`, `.key` | `python-pptx` (MIT, Python) for `.pptx`; `.ppt` and `.key` need LibreOffice (Phase 3) | ❌ Phase 1 |
| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
| Image | `.svg`, `.heic` | `ffprobe` or `sips` | ✅ Phase 1 |
| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |

## Available Tools

### macOS Built-in (No Installation)

| Tool | License | Supports | Command |
|------|---------|----------|---------|
| `textutil` | Apple EULA | `.docx`→`.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |

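The commands in the table can be wrapped for use from a conversion script. The following is a minimal sketch, not the project's actual wrapper: the function names `textutil_cmd`, `sips_cmd`, and `run_tool` are hypothetical, and only the command-line flags come from the table above.

```python
import shutil
import subprocess
from pathlib import Path


def textutil_cmd(src: Path, dst: Path) -> list[str]:
    """Build the argv for converting a document to plain text with textutil."""
    return ["textutil", "-convert", "txt", str(src), "-output", str(dst)]


def sips_cmd(src: Path, dst: Path, fmt: str = "jpeg") -> list[str]:
    """Build the argv for converting an image to `fmt` with sips."""
    return ["sips", "-s", "format", fmt, str(src), "-o", str(dst)]


def run_tool(cmd: list[str]) -> bool:
    """Run a tool; return False if it is missing or exits non-zero."""
    if shutil.which(cmd[0]) is None:  # e.g. not running on macOS
        return False
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

Keeping command construction separate from execution lets the argument lists be checked on any platform, while `run_tool` degrades to `False` where the macOS tools are absent.
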
### Python Packages (pip install, MIT/BSD License)

| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
| `PyPDF2` / `pypdf` | BSD-3 | `.pdf` → text | Page-by-page text extraction; `PyPDF2` is deprecated in favor of `pypdf` |

### Brew Packages

| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `pandoc` | GPL-2.0 | `.docx`⇄`.md`⇄`.html` | Invoked as a CLI subprocess; no GPL code is linked |
| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | Invoked as a CLI subprocess; no GPL code is linked |
| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1GB) |

## File Format Analysis

### Office Open XML (.docx, .xlsx, .pptx)

These are ZIP archives containing XML files:

```
document.docx = ZIP
├── word/document.xml      ← Main content (text + formatting)
├── word/comments.xml
└── [Content_Types].xml
```

The Python packages parse this XML directly, which is fast and requires no external converter.

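To illustrate the ZIP-plus-XML structure, the sketch below reads `word/document.xml` using only the standard library. It is an illustration of the file format rather than a replacement for `python-docx`; the function name `docx_paragraphs` is hypothetical, and tables, headers, and footnotes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source) -> list[str]:
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread across one or more <w:t> elements.
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```
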
### Apple iWork (.pages, .key, .numbers)

These are also ZIP archives:

```
document.pages = ZIP
├── index.apxl.gz          ← Actual content (Apple private format)
├── preview.pdf            ← Embedded preview (usable for thumbnails)
├── QuickLook/
│   └── Thumbnail.jpg
└── Metadata/
```

Without Pages.app or LibreOffice, only `preview.pdf` and `Thumbnail.jpg` can be extracted (via `unzip`). Full text extraction requires LibreOffice.

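The preview-only fallback can be sketched with the standard library instead of shelling out to `unzip`. This is a sketch under assumptions: the entry names are taken from the layout shown above and may differ between iWork versions, and `extract_iwork_preview` is a hypothetical helper name.

```python
import zipfile

# Entry names taken from the layout above; real bundles may differ by iWork version.
PREVIEW_CANDIDATES = ("preview.pdf", "QuickLook/Thumbnail.jpg")


def extract_iwork_preview(source):
    """Return (entry_name, data) for the first preview found, or None."""
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        for candidate in PREVIEW_CANDIDATES:
            if candidate in names:
                return candidate, zf.read(candidate)
    return None
```

Returning `None` rather than raising lets the caller register the file without a preview when the bundle contains neither entry.
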
### PDF (.pdf)

PDF is not a ZIP-based format and requires a dedicated parser:

- `pdftotext` (poppler): best text extraction, CLI
- `PyPDF2`: pure Python, lighter dependency

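The two options can be combined so that the CLI is preferred when installed and the pure-Python parser is the fallback. The sketch below assumes `pypdf` (the maintained successor to `PyPDF2`) is the installed package; the function names and the fallback order are illustrative choices, not something this document prescribes.

```python
import shutil
import subprocess


def pdftotext_cmd(path: str) -> list[str]:
    """argv for pdftotext; the trailing "-" sends the text to stdout."""
    return ["pdftotext", "-layout", path, "-"]


def extract_pdf_text(path: str) -> str:
    """Prefer pdftotext when installed, otherwise fall back to pypdf."""
    if shutil.which("pdftotext"):
        done = subprocess.run(
            pdftotext_cmd(path), capture_output=True, text=True, check=True
        )
        return done.stdout
    # Imported lazily so the CLI path works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```
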
## Commercial Licensing Analysis

### Summary

| Tool | License | Commercial Use |
|------|---------|:-------------:|
| macOS built-in tools | Apple EULA | ✅ Licensed macOS required |
| openpyxl / python-pptx / python-docx / PyPDF2 | MIT / BSD | ✅ Zero restrictions |
| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Safe via CLI subprocess |
| LibreOffice (soffice) | **MPL-2.0** | ✅ Safe via CLI subprocess |

### GPL Analysis (pandoc, poppler)

Calling GPL-licensed tools via CLI subprocess (`std::process::Command`) is **mere aggregation**, not linking. The GPL FAQ describes pipes, sockets, and command-line arguments as communication mechanisms normally used between two separate programs, so programs that communicate this way are normally treated as separate works. Its one caveat is that communication intimate enough to exchange complex internal data structures could still count as combining the programs.

Momentry Core calls these tools via `std::process::Command`, communicating over stdin/stdout/stderr pipes. **GPL does not propagate.**

### MPL Analysis (LibreOffice)

MPL-2.0 is a file-level weak copyleft: only source files taken from LibreOffice itself must remain MPL-licensed. Calling `soffice --headless` as a subprocess does not affect Momentry Core's license.

### MIT/BSD (Python packages)

No copyleft. These packages can be linked and distributed with code under any license, provided their copyright notices are retained.

## Implementation Phases

### Phase 1: Zero-Install + Lightweight Python (Recommended First)

| Tool | Install | License |
|------|---------|---------|
| `textutil` | macOS built-in | Apple EULA |
| `sips` | macOS built-in | Apple EULA |
| `openpyxl` | `pip install` | MIT |
| `python-pptx` | `pip install` | MIT |

Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`

### Phase 2: Text Extraction Enhancement

| Tool | Install | License |
|------|---------|---------|
| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
| `PyPDF2` | `pip install` | BSD-3 |

Coverage: `.pdf` text extraction

### Phase 3: Full Office Suite

| Tool | Install | License |
|------|---------|---------|
| LibreOffice | `brew install libreoffice` | MPL-2.0 |

Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity

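One way to make the phased rollout explicit in code is a lookup table keyed by extension. The sketch below transcribes the three "Coverage" lines; the names `CONVERTERS` and `pick_converter` and the `enabled_phase` parameter are hypothetical, and legacy `.doc` is assigned to Phase 3 following the Phase 3 coverage line even though `textutil` can also read it.

```python
from pathlib import Path

# Extension -> (phase, tool), transcribed from the phase coverage lines.
CONVERTERS = {
    ".docx": (1, "textutil"),
    ".xlsx": (1, "openpyxl"),
    ".pptx": (1, "python-pptx"),
    ".jpg": (1, "sips"),
    ".png": (1, "sips"),
    ".heic": (1, "sips"),
    ".svg": (1, "sips"),
    ".pdf": (2, "pdftotext"),
    ".pages": (3, "libreoffice"),
    ".key": (3, "libreoffice"),
    ".numbers": (3, "libreoffice"),
    ".doc": (3, "libreoffice"),
    ".ppt": (3, "libreoffice"),
    ".xls": (3, "libreoffice"),
}


def pick_converter(path: str, enabled_phase: int):
    """Return the tool name for `path`, or None if its phase is not enabled."""
    entry = CONVERTERS.get(Path(path).suffix.lower())
    if entry is None:
        return None  # unknown extension: leave the file unconverted
    phase, tool = entry
    return tool if phase <= enabled_phase else None
```

With this shape, enabling a later phase is a single configuration change rather than a code change.
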
## Recommended Integration

File conversion should be implemented as Python scripts (matching the existing processor pattern) and invoked at registration time:

```python
# scripts/convert_file.py
# Called during file registration if file_type not in (video, audio, image)

import openpyxl
from pptx import Presentation


def extract_xlsx(path):
    """Yield each row of every worksheet as a list of strings."""
    # data_only=True returns cached formula results instead of formula text.
    wb = openpyxl.load_workbook(path, data_only=True)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows(values_only=True):
            yield [str(c) if c is not None else "" for c in row]


def extract_pptx(path):
    """Yield the text of each slide, one string per slide."""
    prs = Presentation(path)
    for slide in prs.slides:
        # Skip shapes that expose no text attribute (e.g. pictures).
        text = [shape.text for shape in slide.shapes if hasattr(shape, "text")]
        yield "\n".join(text)
```

## Version History

| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2026-05-15 | Initial document: conversion tool survey, licensing analysis, implementation phases |

@@ -657,7 +657,81 @@ file_uuid **does not change during migration**. When a file moves from Hot to Cold:

file_uuid always points to the `physical_path_at_birth` recorded at birth (the Hot path) and does not change when the file migrates.

### 6.5 location_history table

### 6.5 AI Agent: On-Demand Data Flow

The AI Agent manages data movement automatically at the lower layer; users do not need to know which tier a file is actually stored on.

#### Architecture

```
    User / Scheduler
           │
           ▼
┌─────────────────────────────────┐
│  AI Agent                       │
│  • Monitor tier usage           │
│  • Detect hot/cold patterns     │
│  • Trigger auto-archive         │
│  • Restore on access (prefetch) │
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│  Transfer Engine                │
│  Direct (std::fs::copy)         │
│  Rsync (delta + checksum)       │
│  S3 / SFS / NFS / CDN           │
└──────────┬──────────────────────┘
           │
           ▼
┌─────────────────────────────────┐
│  file_locations                 │
│  (single source of truth)       │
│  M2  M4  M5  Cloud  LTO         │
└─────────────────────────────────┘
```

#### Auto-Archive Rules

| Trigger | Action | Transfer Engine |
|----------|------|:--:|
| `idle_days > 90` | move to Warm | Rsync + checksum verify |
| `idle_days > 365` | move to Cold | Tar + checksum verify |
| `hot_tier_usage > 80%` | move oldest to Warm | Rsync `--progress` |
| user accesses cold file | restore to Hot | Rsync prefetch |

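The three threshold rules can be expressed as a small decision function. This is a sketch under assumptions: the `FileStat` type, the function name, the `is_oldest_hot` flag, and the rule precedence (the 365-day rule is checked before the 90-day rule) are illustrative and not specified by the table. The fourth rule, restore on access, is event-driven and is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileStat:
    tier: str       # "hot", "warm", or "cold"
    idle_days: int  # days since last access


def archive_action(
    f: FileStat, hot_tier_usage: float, is_oldest_hot: bool = False
) -> Optional[str]:
    """Return the target tier for `f`, or None if no rule fires."""
    if f.tier != "cold" and f.idle_days > 365:
        return "cold"
    if f.tier == "hot" and f.idle_days > 90:
        return "warm"
    # Capacity pressure: evict only the file the caller marked as oldest.
    if f.tier == "hot" and hot_tier_usage > 0.80 and is_oldest_hot:
        return "warm"
    return None
```
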
#### Example Flow

```
1. The AI Agent detects that Charade_1963.mp4 has been idle for 120 days
2. rsync -avP --checksum → /Volumes/NAS_Archive/
3. POST /api/v2/files/aeed7134.../locations
   {"location": "/Volumes/NAS_Archive/Charade_1963.mp4",
    "label": "M4-warm"}
4. Remove the Hot tier location (or keep it for reference)
5. A user queries the file's info → sees every tier, without needing to know the physical location
```

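Steps 2 and 3 of the flow can be sketched as two small builders, one for the `rsync` invocation and one for the location-registration request. The helper names are hypothetical, the `file_uuid` is passed in by the caller because the example elides it, and the sketch only builds the command and request without sending anything.

```python
import json


def rsync_cmd(src: str, dest_dir: str) -> list[str]:
    """argv for the copy in step 2 (same flags as the example flow)."""
    return ["rsync", "-avP", "--checksum", src, dest_dir]


def location_request(file_uuid: str, location: str, label: str) -> tuple[str, str]:
    """Return (url_path, json_body) for registering a new location (step 3)."""
    url_path = f"/api/v2/files/{file_uuid}/locations"
    body = json.dumps({"location": location, "label": label})
    return url_path, body
```

Registering the new location only after `rsync` exits successfully keeps `file_locations` from pointing at a partial copy.
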
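#### Design Principles

| Principle | Description |
|------|------|
| Transparent migration | Users querying `file_locations` always get a consistent view |
| Stable identifier | `file_uuid` does not change during migration |
| Location tracking | `file_locations` is updated after every migration; the old location may optionally be kept as a historical reference |
| Integrity verification | Verify after every migration, using rsync `--checksum` or a manual SHA-256 comparison (rsync's `--checksum` uses its own checksum algorithm, not SHA-256) |
| Memory-hierarchy analogy | The Agent acts as the memory controller: Hot = cache, Warm = main memory, Cold = disk |
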
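The manual SHA-256 comparison mentioned under integrity verification could look like the sketch below. The function names are hypothetical; the file is hashed in chunks so that large video files are never read into memory at once.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(src: str, dst: str) -> bool:
    """True when the source and the migrated copy have identical digests."""
    return sha256_of(src) == sha256_of(dst)
```
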
```
User queries a file → always sees a consistent view (single source of truth: file_locations)
        ↑
Transfer Engine (rsync / Direct / S3 / SFS / CDN)
        ↑
AI Agent (monitors tier usage, detects hot/cold patterns, auto-archives, prefetches)
        ↑
Storage Tiers (M2 Hot → M4 Warm → M5 Cold → LTO)
```

```sql
CREATE TABLE IF NOT EXISTS location_history (
@@ -908,6 +982,8 @@ CREATE TABLE IF NOT EXISTS exit_records (
| 13 | 2026-05-14 | notify crate (Hot tier only) | Lower resource usage; Warm/Cold change infrequently |
| 14 | 2026-05-14 | zip + tar crates (no external CLI) | Cross-platform; no need for ditto/hdiutil |
| 15 | 2026-05-14 | Momentry Core integration in hybrid A+B mode | Crates for lightweight computation, HTTP API for heavy queries |
| 16 | 2026-05-14 | AI Agent on-demand data flow | Transparent migration, memory-hierarchy analogy, automatic hot/cold management |
| 17 | 2026-05-14 | file_locations supports arbitrary URIs | /path, s3://, sfs://, ipfs://, https://, \\SMB\path |

---

@@ -916,4 +992,4 @@ CREATE TABLE IF NOT EXISTS exit_records (
| Version | Date | Purpose | Operator | Tool/Model |
|------|------|------|--------|-----------|
| V1.0 | 2026-05-12 | Initial design (Demo Display + Knowledge Graph) | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Virtual file tree, Group Share, storage tiers, tech stack, file_uuid, file operation API | M4 / OpenCode | DeepSeek V4 Pro |
| V2.0 | 2026-05-14 | Virtual file tree, Group Share, storage tiers, tech stack, file_uuid, file operation API, AI Agent on-demand data flow, cross-platform multi-location | M4 / OpenCode | DeepSeek V4 Pro |
