From d7a133e1e46186f6af6c622ac4af9b4290059291 Mon Sep 17 00:00:00 2001
From: Accusys
Date: Fri, 15 May 2026 10:05:14 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20file=20conversion=20strategy=20?=
 =?UTF-8?q?=E2=80=94=20tools,=20licensing,=20implementation=20phases?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../FILE_CONVERSION_STRATEGY_V1.0.md | 177 ++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md

diff --git a/docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md b/docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md
new file mode 100644
index 0000000..5e33ed1
--- /dev/null
+++ b/docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md
@@ -0,0 +1,177 @@
+---
+title: "File Conversion Strategy"
+version: "V1.0"
+date: "2026-05-15"
+author: "M5"
+status: "draft"
+---
+
+# File Conversion Strategy
+
+## Overview
+
+Momentry Core registers and processes file types beyond video. Non-video files (documents, images, spreadsheets, presentations) must be converted so that their text content and metadata can be extracted for indexing, search, and preview.
+
+## Supported Formats
+
+| Category | Extensions | Strategy | Status |
+|----------|-----------|----------|--------|
+| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
+| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
+| Document | `.pdf` | `pdftotext` (poppler) or `PyPDF2` | ❌ Phase 2 |
+| Spreadsheet | `.xlsx`, `.xls`, `.numbers` | `openpyxl` (MIT, Python) for `.xlsx`; `.xls` and `.numbers` need LibreOffice (Phase 3) | ❌ Phase 1 |
+| Presentation | `.pptx`, `.ppt`, `.key` | `python-pptx` (MIT, Python) for `.pptx`; `.ppt` and `.key` need LibreOffice (Phase 3) | ❌ Phase 1 |
+| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
+| Image | `.svg`, `.heic` | `ffprobe` or `sips` | ✅ Phase 1 |
+| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |
+
+## Available Tools
+
+### macOS Built-in (No Installation)
+
+| Tool | License | Supports | Command |
+|------|---------|----------|---------|
+| `textutil` | Apple EULA | `.docx`→`.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
+| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
+| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |
+
+### Python Packages (pip install, MIT/BSD License)
+
+| Package | License | Supports | Notes |
+|---------|---------|----------|-------|
+| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
+| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
+| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
+| `PyPDF2` / `pypdf` | BSD-3 | `.pdf` → text | Page-by-page text extraction |
+
+### Brew Packages
+
+| Package | License | Supports | Notes |
+|---------|---------|----------|-------|
+| `pandoc` | GPL-2.0 | `.docx`⇄`.md`⇄`.html` | CLI subprocess, GPL not linked |
+| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | CLI subprocess, GPL not linked |
+| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1 GB) |
+
+## File Format Analysis
+
+### Office Open XML (.docx, .xlsx, .pptx)
+
+These are ZIP archives containing XML files:
+
+```
+document.docx = ZIP
+  ├── word/document.xml     ← Main content (text + formatting)
+  ├── word/comments.xml
+  └── [Content_Types].xml
+```
+
+Python packages parse the XML directly, which is fast and reliable.
+
+### Apple iWork (.pages, .key, .numbers)
+
+These are also ZIP archives:
+
+```
+document.pages = ZIP
+  ├── index.apxl.gz         ← Actual content (Apple private format)
+  ├── preview.pdf           ← Embedded preview (usable for thumbnails)
+  ├── QuickLook/
+  │   └── Thumbnail.jpg
+  └── Metadata/
+```
+
+Without Pages.app or LibreOffice, only `preview.pdf` and `QuickLook/Thumbnail.jpg` can be extracted (via `unzip`). Full text extraction requires LibreOffice.
+
+### PDF (.pdf)
+
+PDF is not a ZIP-based format; it requires a dedicated parser:
+- `pdftotext` (poppler): best text extraction, CLI
+- `PyPDF2`: pure Python, lighter dependency
+
+## Commercial Licensing Analysis
+
+### Summary
+
+| Tool | License | Commercial Use |
+|------|---------|:-------------:|
+| macOS built-in tools | Apple EULA | ✅ Requires licensed macOS |
+| openpyxl / python-pptx / python-docx / PyPDF2 | MIT / BSD | ✅ Permissive (keep license notices) |
+| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Via CLI subprocess (see GPL Analysis) |
+| LibreOffice (soffice) | **MPL-2.0** | ✅ Via CLI subprocess (see MPL Analysis) |
+
+### GPL Analysis (pandoc, poppler)
+
+Calling GPL-licensed tools as a CLI subprocess (`std::process::Command`) keeps them as separate programs rather than linking them into Momentry Core. The GPL FAQ describes this boundary:
+
+> "By contrast, pipes, sockets and command-line arguments are communication mechanisms normally used between two separate programs. So when they are used for communication, the modules normally are separate programs."
+
+Momentry Core calls these tools via `std::process::Command` and exchanges only plain data over stdin/stdout/stderr, so **the GPL is not expected to propagate** to Momentry Core. The FAQ adds that unusually intimate communication (exchanging complex internal data structures) could change this assessment.
+
+### MPL Analysis (LibreOffice)
+
+MPL-2.0 is a file-level weak copyleft. Only LibreOffice's own source files (and modifications to them) must remain MPL-licensed. Calling `soffice --headless` via subprocess does not affect Momentry Core's license.
+
+### MIT/BSD (Python packages)
+
+No copyleft. These packages can be distributed and linked under any license, provided their copyright and license notices are retained.
+
+## Implementation Phases
+
+### Phase 1: Zero-Install + Lightweight Python (Recommended First)
+
+| Tool | Install | License |
+|------|---------|---------|
+| `textutil` | macOS built-in | Apple EULA |
+| `sips` | macOS built-in | Apple EULA |
+| `openpyxl` | `pip install` | MIT |
+| `python-pptx` | `pip install` | MIT |
+
+Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`
+
+### Phase 2: Text Extraction Enhancement
+
+| Tool | Install | License |
+|------|---------|---------|
+| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
+| `PyPDF2` | `pip install` | BSD-3 |
+
+Coverage: `.pdf` text extraction
+
+### Phase 3: Full Office Suite
+
+| Tool | Install | License |
+|------|---------|---------|
+| LibreOffice | `brew install libreoffice` | MPL-2.0 |
+
+Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity
+
+## Recommended Integration
+
+File conversion should be implemented as Python scripts (matching the existing processor pattern) and invoked at registration time:
+
+```python
+# scripts/convert_file.py
+# Called during file registration if file_type not in (video, audio, image)
+
+import openpyxl
+from pptx import Presentation
+
+def extract_xlsx(path):
+    wb = openpyxl.load_workbook(path, data_only=True)  # cached values, not formulas
+    for sheet_name in wb.sheetnames:
+        ws = wb[sheet_name]
+        for row in ws.iter_rows(values_only=True):
+            yield [str(c) if c is not None else "" for c in row]  # one list per row
+
+def extract_pptx(path):
+    prs = Presentation(path)
+    for slide in prs.slides:
+        text = [shape.text for shape in slide.shapes if hasattr(shape, "text")]
+        yield "\n".join(text)  # one string per slide
+```
+
+## Version History
+
+| Version | Date | Changes |
+|---------|------|---------|
+| V1.0 | 2026-05-15 | Initial document: conversion tool survey, licensing analysis, implementation phases |
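
The iWork layout under "File Format Analysis" implies that the preview assets can be pulled out without Pages.app or LibreOffice. The following is a minimal sketch of that step using only the Python standard library, in the same style as `extract_xlsx` and `extract_pptx`. The function name `extract_iwork_previews` and the constant `PREVIEW_MEMBERS` are hypothetical and not part of the patch, and the sketch assumes the file is a ZIP archive with the member names shown in that layout; real packages may omit either member or use a different container.

```python
import zipfile
from pathlib import Path

# Member names taken from the iWork layout in "File Format Analysis";
# a given package may contain only one of them, or neither.
PREVIEW_MEMBERS = ("preview.pdf", "QuickLook/Thumbnail.jpg")


def extract_iwork_previews(package_path, out_dir):
    """Copy embedded preview assets out of a ZIP-based iWork file.

    Returns the list of written paths. An empty list means nothing usable
    was found (the file is not a ZIP archive, or has no preview members).
    """
    package_path = Path(package_path)
    out_dir = Path(out_dir)
    if not zipfile.is_zipfile(package_path):
        return []
    written = []
    with zipfile.ZipFile(package_path) as archive:
        names = set(archive.namelist())
        for member in PREVIEW_MEMBERS:
            if member not in names:
                continue
            # Write under the base name only, so output stays inside out_dir.
            target = out_dir / Path(member).name
            out_dir.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(member))
            written.append(target)
    return written
```

A caller at registration time could treat an empty result as the signal to fall back to the Phase 3 LibreOffice path, since this sketch recovers previews only and never the document text.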