docs: file conversion strategy — tools, licensing, implementation phases

2026-05-15 10:05:14 +08:00
parent 85b06b6169
commit d7a133e1e4
1 changed files with 177 additions and 0 deletions
--- a/docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md
+++ b/docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md
@@ -0,0 +1,177 @@
+---
+title: "File Conversion Strategy"
+version: "V1.0"
+date: "2026-05-15"
+author: "M5"
+status: "draft"
+---
+
+# File Conversion Strategy
+
+## Overview
+
+Momentry Core registers and processes various file types beyond video. Non-video files (documents, images, spreadsheets, presentations) need conversion to extract text content and metadata for indexing, search, and preview.
+
+## Supported Formats
+
+| Category | Extensions | Strategy | Status |
+|----------|-----------|----------|--------|
+| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
+| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
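+| Document | `.pdf` | `pdftotext` (poppler) or `pypdf` (successor to `PyPDF2`) | ❌ Phase 2 |
+| Spreadsheet | `.xlsx` | `openpyxl` (MIT, Python) | ❌ Phase 1 |
+| Spreadsheet | `.xls` | LibreOffice (`openpyxl` does not read legacy `.xls`) | ❌ Phase 3 |
+| Presentation | `.pptx` | `python-pptx` (MIT, Python) | ❌ Phase 1 |
+| Presentation | `.ppt` | LibreOffice (`python-pptx` does not read legacy `.ppt`) | ❌ Phase 3 |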
+| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
+| Image | `.svg`, `.heic` | `ffprobe` or `sips` | ✅ Phase 1 |
+| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |
+
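The table above can be read as a routing rule keyed on file extension. The following is a minimal sketch of that idea, not Momentry Core's actual dispatcher: the `STRATEGIES` mapping, the `pick_strategy` helper, and the strategy labels are hypothetical names introduced for illustration, and only a subset of the rows is covered.

```python
from pathlib import Path

# Hypothetical routing table derived from a subset of the Supported Formats
# table. Values are (strategy label, phase); the labels are illustrative only.
STRATEGIES = {
    ".docx": ("textutil", 1),
    ".doc": ("textutil", 1),
    ".odt": ("textutil", 1),
    ".xlsx": ("openpyxl", 1),
    ".pptx": ("python-pptx", 1),
    ".jpg": ("sips", 1),
    ".png": ("sips", 1),
    ".gif": ("sips", 1),
    ".bmp": ("sips", 1),
    ".webp": ("sips", 1),
    ".heic": ("sips", 1),
    ".pdf": ("pdftotext", 2),
    ".pages": ("libreoffice", 3),
    ".key": ("libreoffice", 3),
    ".numbers": ("libreoffice", 3),
}


def pick_strategy(path):
    """Return (strategy, phase) for a file, or None if the type is not listed."""
    # Lower-casing the suffix makes "Report.DOCX" match ".docx".
    return STRATEGIES.get(Path(path).suffix.lower())
```

Keeping the mapping as plain data means a later phase can be enabled by adding rows, without touching the lookup logic.
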
+## Available Tools
+
+### macOS Built-in (No Installation)
+
+| Tool | License | Supports | Command |
+|------|---------|----------|---------|
+| `textutil` | Apple EULA | `.docx`→`.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
+| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
+| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |
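
If these commands are driven from a Python script, one possible shape is to build the argument list separately from running it, so the command construction can be checked on any platform. This is a sketch under that assumption; `textutil_argv`, `sips_argv`, and `run_tool` are hypothetical helper names, and the flags are the ones shown in the table above.

```python
import subprocess


def textutil_argv(src, dst):
    """Build the textutil command that converts a document to plain text."""
    return ["textutil", "-convert", "txt", src, "-output", dst]


def sips_argv(src, dst, fmt="jpeg"):
    """Build the sips command that converts an image to another format."""
    return ["sips", "-s", "format", fmt, src, "-o", dst]


def run_tool(argv, timeout=60):
    """Run a converter; raises CalledProcessError on a non-zero exit code."""
    # Passing a list (and no shell=True) keeps file names containing spaces
    # or shell metacharacters from being interpreted by a shell.
    return subprocess.run(
        argv, check=True, capture_output=True, text=True, timeout=timeout
    )
```

On macOS, `run_tool(textutil_argv("input.docx", "out.txt"))` would perform the conversion; on other systems the tools are absent and the call fails with `FileNotFoundError`.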
+
+### Python Packages (pip install, MIT/BSD License)
+
+| Package | License | Supports | Notes |
+|---------|---------|----------|-------|
+| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
+| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
+| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
+| `pypdf` (successor to `PyPDF2`) | BSD-3 | `.pdf` → text | Page-by-page text extraction |
+
+### Brew Packages
+
+| Package | License | Supports | Notes |
+|---------|---------|----------|-------|
+| `pandoc` | GPL-2.0 | `.docx`⇄`.md`⇄`.html` | Invoked as a CLI subprocess, not linked |
+| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | Invoked as a CLI subprocess, not linked |
+| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1GB) |
+
+## File Format Analysis
+
+### Office Open XML (.docx, .xlsx, .pptx)
+
+These are ZIP archives containing XML files:
+
+```
+document.docx = ZIP
+  ├── word/document.xml    ← Main content (text + formatting)
+  ├── word/comments.xml
+  └── [Content_Types].xml
+```
+
+Python packages parse the XML directly — fast and reliable.
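
To make the ZIP-plus-XML structure concrete, here is a sketch that reads paragraph text from `word/document.xml` using only the standard library. It is an illustration of the container format rather than a replacement for `python-docx`; the helper name `docx_paragraphs` is hypothetical, and the sketch ignores headers, footers, and comments.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> elements inside its runs.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

For files from untrusted sources, a hardened XML parser may be preferable to `xml.etree.ElementTree`.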
+
+### Apple iWork (.pages, .key, .numbers)
+
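+These are also ZIP archives, but the internal layout varies by iWork version. The tree below is indicative: it resembles the older iWork '09 layout, while newer versions store content as `.iwa` files under `Index/`.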
+
+```
+document.pages = ZIP
+  ├── index.apxl.gz        ← Actual content (Apple private format)
+  ├── preview.pdf          ← Embedded preview (usable for thumbnails)
+  ├── QuickLook/
+  │   └── Thumbnail.jpg
+  └── Metadata/
+```
+
+Without Pages.app or LibreOffice, only the embedded `preview.pdf` and `Thumbnail.jpg` can be extracted (via `unzip`). Full text extraction requires Pages.app or LibreOffice.
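
The same extraction can be done from Python with the standard `zipfile` module instead of shelling out to `unzip`. The sketch below assumes the file really is a ZIP archive and that previews can be recognised by file-name ending. Both are assumptions, since the layout differs between iWork versions, and `PREVIEW_SUFFIXES`, `find_previews`, and `read_member` are hypothetical names.

```python
import zipfile

# Member-name endings treated as previews. This list is an assumption:
# the archive layout differs between iWork versions.
PREVIEW_SUFFIXES = ("preview.pdf", "thumbnail.jpg", "preview.jpg")


def find_previews(path_or_file):
    """Return the archive members that look like embedded previews."""
    with zipfile.ZipFile(path_or_file) as zf:
        return [n for n in zf.namelist() if n.lower().endswith(PREVIEW_SUFFIXES)]


def read_member(path_or_file, name):
    """Return the raw bytes of one archive member."""
    with zipfile.ZipFile(path_or_file) as zf:
        return zf.read(name)
```

Calling `zipfile.is_zipfile(path)` first is a cheap guard against iWork documents that are stored as directory bundles rather than single archives.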
+
+### PDF (.pdf)
+
+PDF is not a ZIP container, so it requires a dedicated parser:
+- `pdftotext` (poppler): CLI tool, generally the more robust option for text extraction
+- `pypdf` (successor to `PyPDF2`): pure Python, lighter dependency
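
One way to combine the two options is to prefer `pdftotext` when it is installed and fall back to `pypdf` otherwise. The sketch below assumes that policy, which the strategy above does not prescribe. `choose_backend` and `extract_pdf_text` are hypothetical names; the `which` parameter exists only so the selection logic can be exercised without either tool installed.

```python
import shutil
import subprocess


def choose_backend(which=shutil.which):
    """Prefer the pdftotext CLI when it is on PATH, else fall back to pypdf."""
    return "pdftotext" if which("pdftotext") else "pypdf"


def extract_pdf_text(path, backend=None):
    """Return the text of a PDF using the selected backend."""
    backend = backend or choose_backend()
    if backend == "pdftotext":
        # "-" as the output file name sends the extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", "-layout", path, "-"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout
    # Imported lazily so this module loads even when pypdf is not installed.
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Scanned PDFs without a text layer yield little or no text with either backend; OCR is outside the scope of this sketch.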
+
+## Commercial Licensing Analysis
+
+### Summary
+
+| Tool | License | Commercial Use |
+|------|---------|:-------------:|
+| macOS built-in tools | Apple EULA | ✅ Licensed macOS required |
+| openpyxl / python-pptx / python-docx / pypdf | MIT / BSD | ✅ No copyleft (retain license notices) |
+| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Safe via CLI subprocess |
+| LibreOffice (soffice) | **MPL-2.0** | ✅ Safe via CLI subprocess |
+
+### GPL Analysis (pandoc, poppler)
+
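+Calling GPL-licensed tools via a CLI subprocess (`std::process::Command`) is generally treated as using a separate program, not as linking. The GPL FAQ explains that pipes, sockets, and command-line arguments are mechanisms normally used between two separate programs, so modules that communicate this way are normally separate programs. It adds a caveat: if the communication is intimate enough (exchanging complex internal data structures), the parts could still be considered one combined program.
+
+Momentry Core calls these tools via `std::process::Command` and communicates only over stdin/stdout/stderr, so **the GPL does not extend to Momentry Core's own code.** If the GPL binaries are bundled with the product, the GPL's terms (including source availability) still apply to those binaries themselves.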
+
+### MPL Analysis (LibreOffice)
+
+MPL-2.0 is a file-level weak copyleft. Only source files from LibreOffice itself must remain MPL-licensed. Calling `soffice --headless` via subprocess does not affect Momentry Core's license.
+
+### MIT/BSD (Python packages)
+
+No copyleft. These packages can be linked and distributed with code under any license, provided their copyright and license notices are retained.
+
+## Implementation Phases
+
+### Phase 1: Zero-Install + Lightweight Python (Recommended First)
+
+| Tool | Install | License |
+|------|---------|---------|
+| `textutil` | macOS built-in | Apple EULA |
+| `sips` | macOS built-in | Apple EULA |
+| `openpyxl` | `pip install openpyxl` | MIT |
+| `python-pptx` | `pip install python-pptx` | MIT |
+
+Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`
+
+### Phase 2: Text Extraction Enhancement
+
+| Tool | Install | License |
+|------|---------|---------|
+| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
+| `pypdf` (successor to `PyPDF2`) | `pip install pypdf` | BSD-3 |
+
+Coverage: `.pdf` text extraction
+
+### Phase 3: Full Office Suite
+
+| Tool | Install | License |
+|------|---------|---------|
+| LibreOffice | `brew install libreoffice` | MPL-2.0 |
+
+Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity
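
For Phase 3, the headless conversion could be wrapped along the following lines. This is a sketch, assuming conversion to PDF as an intermediate step (one possible choice, not something the phases above specify); `soffice_argv` and `expected_output` are hypothetical helpers. On macOS the `soffice` binary usually lives inside the LibreOffice app bundle and may not be on `PATH`, so a real wrapper would likely need a configurable executable path.

```python
from pathlib import Path


def soffice_argv(src, outdir, target="pdf"):
    """Build a headless LibreOffice conversion command."""
    return [
        "soffice", "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(src),
    ]


def expected_output(src, outdir, target="pdf"):
    """Path LibreOffice writes to: the input's stem plus the target extension."""
    # Only valid for plain extensions such as "pdf"; a target that names an
    # export filter (for example "txt:Text") would need extra handling.
    return Path(outdir) / (Path(src).stem + "." + target)
```

The resulting argument list can be passed to `subprocess.run` in the same way as the other CLI tools, which keeps LibreOffice at arm's length as a separate process.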
+
+## Recommended Integration
+
+File conversion should be implemented as Python scripts (matching the existing processor pattern) and invoked at registration time:
+
+```python
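+# scripts/convert_file.py
+# Called during file registration if file_type not in (video, audio, image)
+
+import openpyxl
+from pptx import Presentation
+
+
+def extract_xlsx(path):
+    """Yield each row of every worksheet as a list of strings."""
+    # data_only=True returns cached formula results instead of formula text;
+    # read_only=True streams rows so large workbooks are not loaded whole.
+    wb = openpyxl.load_workbook(path, data_only=True, read_only=True)
+    try:
+        # wb.worksheets skips chart sheets, which have no rows to iterate.
+        for ws in wb.worksheets:
+            for row in ws.iter_rows(values_only=True):
+                yield [str(c) if c is not None else "" for c in row]
+    finally:
+        wb.close()
+
+
+def extract_pptx(path):
+    """Yield the text of each slide, one string per slide."""
+    prs = Presentation(path)
+    for slide in prs.slides:
+        # Only shapes with a text frame carry text here; pictures, tables,
+        # and grouped shapes are skipped by this simple pass.
+        text = [s.text_frame.text for s in slide.shapes if s.has_text_frame]
+        yield "\n".join(text)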
+```
+
+## Version History
+
+| Version | Date | Changes |
+|---------|------|---------|
+| V1.0 | 2026-05-15 | Initial document — conversion tool survey, licensing analysis, implementation phases |