docs: file conversion strategy — tools, licensing, implementation phases

Accusys
2026-05-15 10:05:14 +08:00
parent 85b06b6169
commit d7a133e1e4

---
title: "File Conversion Strategy"
version: "V1.0"
date: "2026-05-15"
author: "M5"
status: "draft"
---
# File Conversion Strategy
## Overview
Momentry Core registers and processes various file types beyond video. Non-video files (documents, images, spreadsheets, presentations) need conversion to extract text content and metadata for indexing, search, and preview.
## Supported Formats
| Category | Extensions | Strategy | Status |
|----------|-----------|----------|--------|
| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
| Document | `.pdf` | `pdftotext` (poppler) or `PyPDF2` | ❌ Phase 2 |
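| Spreadsheet | `.xlsx` | `openpyxl` (MIT, Python) | ❌ Phase 1 |
| Spreadsheet | `.xls` | LibreOffice (MPL-2.0); `openpyxl` does not read legacy `.xls` | ❌ Phase 3 |
| Presentation | `.pptx` | `python-pptx` (MIT, Python) | ❌ Phase 1 |
| Presentation | `.ppt` | LibreOffice (MPL-2.0); `python-pptx` does not read legacy `.ppt` | ❌ Phase 3 |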
| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
| Image | `.svg`, `.heic` | `ffprobe` or `sips` | ✅ Phase 1 |
| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |
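
The table above amounts to a routing rule from file extension to conversion strategy. A minimal sketch of that lookup might look like the following. The names `STRATEGY_BY_EXTENSION` and `pick_strategy` are hypothetical (they are not part of Momentry Core), and the mapping mirrors only a subset of the rows listed here.

```python
from pathlib import Path

# Hypothetical routing table mirroring a subset of the rows above.
# Keys are lowercase extensions; values are strategy names, not callables.
STRATEGY_BY_EXTENSION = {
    ".docx": "textutil", ".doc": "textutil", ".odt": "textutil",
    ".pdf": "pdftotext",
    ".xlsx": "openpyxl",
    ".pptx": "python-pptx",
    ".jpg": "sips", ".png": "sips", ".gif": "sips",
    ".bmp": "sips", ".webp": "sips", ".heic": "sips",
    ".pages": "libreoffice", ".key": "libreoffice", ".numbers": "libreoffice",
}


def pick_strategy(path):
    """Return the strategy name for a path, or None if the extension is unknown."""
    return STRATEGY_BY_EXTENSION.get(Path(path).suffix.lower())
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or fall back to a heavier tool.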
## Available Tools
### macOS Built-in (No Installation)
| Tool | License | Supports | Command |
|------|---------|----------|---------|
| `textutil` | Apple EULA | `.docx` → `.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |
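
The commands in this table can be wrapped from Python with `subprocess`. The sketch below builds the argument lists for `textutil` and `sips` exactly as shown in the table and runs them without a shell. The helper names (`textutil_argv`, `sips_argv`, `run_tool`) and the 60-second timeout are illustrative assumptions, not existing Momentry Core code.

```python
import subprocess


def textutil_argv(src, dst):
    """Argument list for converting a document to plain text with textutil."""
    return ["textutil", "-convert", "txt", src, "-output", dst]


def sips_argv(src, dst, fmt="jpeg"):
    """Argument list for converting an image to another format with sips."""
    return ["sips", "-s", "format", fmt, src, "-o", dst]


def run_tool(argv, timeout=60):
    """Run a tool without a shell; raises CalledProcessError on non-zero exit."""
    return subprocess.run(
        argv, check=True, capture_output=True, text=True, timeout=timeout
    )
```

On macOS, `run_tool(textutil_argv("input.docx", "out.txt"))` would perform the conversion. Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.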
### Python Packages (pip install, MIT/BSD License)
| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
| `PyPDF2` / `pypdf` | BSD-3 | `.pdf` → text | Page-by-page text extraction |
### Brew Packages
| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `pandoc` | GPL-2.0 | `.docx` ↔ `.md` ↔ `.html` | CLI subprocess, GPL not linked |
| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | CLI subprocess, GPL not linked |
| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1GB) |
## File Format Analysis
### Office Open XML (.docx, .xlsx, .pptx)
These are ZIP archives containing XML files:
```
document.docx = ZIP
├── word/document.xml ← Main content (text + formatting)
├── word/comments.xml
└── [Content_Types].xml
```
Python packages parse the XML directly — fast and reliable.
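
To illustrate the ZIP-plus-XML layout, the following sketch reads `word/document.xml` using only the standard library. It is not how `python-docx` is implemented; the function name `docx_paragraphs` is hypothetical, and the sketch ignores headers, footers, and comments.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each <w:p> paragraph in a .docx, in document order."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every <w:t> inside it
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

The dedicated packages remain the better choice in practice because they also handle tables, styles, and the other parts of the archive.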
### Apple iWork (.pages, .key, .numbers)
These are also ZIP archives:
```
document.pages = ZIP
├── index.apxl.gz ← Actual content (Apple private format)
├── preview.pdf ← Embedded preview (usable for thumbnails)
├── QuickLook/
│ └── Thumbnail.jpg
└── Metadata/
```
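Without Pages.app or LibreOffice, only `preview.pdf` and `Thumbnail.jpg` can be extracted via `unzip`. Full text extraction requires one of the two, and LibreOffice is the scriptable option. The layout shown applies to older iWork files; newer versions store content as `.iwa` files under `Index/` and may ship JPEG previews instead of `preview.pdf`, so check that the preview exists before relying on it.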
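
The `unzip` step can also be done with Python's `zipfile` module. The sketch below assumes the document is a single-file ZIP archive and that the preview members carry the names shown in the tree above; both assumptions should be verified against real files. `PREVIEW_NAMES` and `extract_previews` are hypothetical names.

```python
import zipfile

# Member names taken from the tree above; real files may differ by iWork version
PREVIEW_NAMES = ("preview.pdf", "QuickLook/Thumbnail.jpg")


def extract_previews(path_or_file):
    """Return {member_name: bytes} for whichever known preview members exist."""
    found = {}
    with zipfile.ZipFile(path_or_file) as zf:
        members = set(zf.namelist())
        for name in PREVIEW_NAMES:
            if name in members:
                found[name] = zf.read(name)
    return found
```

An empty result means no known preview was embedded, in which case the file has to wait for the LibreOffice path.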
### PDF (.pdf)
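PDF is not a ZIP-based format and requires a dedicated parser:
- `pdftotext` (poppler): best text extraction, CLI
- `PyPDF2`: pure Python, lighter dependency; the project is deprecated in favor of its successor `pypdf`, so new code should prefer `pypdf`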
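
One way to combine the two options is to prefer `pdftotext` when it is installed and fall back to the pure-Python parser otherwise. The sketch below shows that selection; the function names are hypothetical, the fallback uses `pypdf`, and the choice of the `-layout` flag is an assumption about the desired output rather than a project decision.

```python
import shutil


def pdftotext_argv(src):
    """Argument list for pdftotext; the trailing "-" sends text to stdout."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", src, "-"]


def pick_pdf_backend(which=shutil.which):
    """Prefer the poppler CLI when it is on PATH, else the Python parser."""
    return "pdftotext" if which("pdftotext") else "pypdf"


def extract_pdf_text_pypdf(src):
    """Page-by-page extraction with pypdf (imported lazily, optional dependency)."""
    from pypdf import PdfReader

    reader = PdfReader(src)
    # extract_text() can return an empty result for image-only pages
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

The `which` parameter exists only so the selection logic can be exercised without either tool installed.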
## Commercial Licensing Analysis
### Summary
| Tool | License | Commercial Use |
|------|---------|:-------------:|
| macOS built-in tools | Apple EULA | ✅ Licensed macOS required |
| openpyxl / python-pptx / python-docx / PyPDF2 | MIT / BSD | ✅ Permissive; retain license notices |
| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Safe via CLI subprocess |
| LibreOffice (soffice) | **MPL-2.0** | ✅ Safe via CLI subprocess |
### GPL Analysis (pandoc, poppler)
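Momentry Core invokes GPL-licensed tools as separate processes (`std::process::Command`) rather than linking against them. The GPL FAQ describes pipes, sockets, and command-line arguments as communication mechanisms normally used between two separate programs, so programs that interact this way are normally treated as separate works. It adds one caveat: if the communication is intimate enough, such as exchanging complex internal data structures, the parts could still be regarded as one combined program.
Momentry Core exchanges only command-line arguments and plain stdin/stdout/stderr streams with these tools, which falls on the separate-programs side of that line. **On this reading, the GPL does not propagate to Momentry Core.**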
### MPL Analysis (LibreOffice)
MPL-2.0 is a file-level weak copyleft: only LibreOffice source files that are modified and distributed must remain under the MPL. Calling `soffice --headless` as a subprocess does not affect Momentry Core's license.
### MIT/BSD (Python packages)
No copyleft. These packages can be distributed and combined with code under any license, provided their copyright and license notices are retained.
## Implementation Phases
### Phase 1: Zero-Install + Lightweight Python (Recommended First)
| Tool | Install | License |
|------|---------|---------|
| `textutil` | macOS built-in | Apple EULA |
| `sips` | macOS built-in | Apple EULA |
| `openpyxl` | `pip install` | MIT |
| `python-pptx` | `pip install` | MIT |
Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`
### Phase 2: Text Extraction Enhancement
| Tool | Install | License |
|------|---------|---------|
| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
| `PyPDF2` | `pip install` | BSD-3 |
Coverage: `.pdf` text extraction
### Phase 3: Full Office Suite
| Tool | Install | License |
|------|---------|---------|
| LibreOffice | `brew install libreoffice` | MPL-2.0 |
Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity
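
A headless LibreOffice conversion is typically driven through `soffice --headless --convert-to`. The sketch below builds that command and predicts the output path. The helper names are hypothetical, and the assumption that `soffice` is on `PATH` may not hold on macOS, where the binary usually lives inside the application bundle.

```python
from pathlib import Path


def soffice_argv(src, outdir, target="pdf", soffice="soffice"):
    """Argument list for a headless LibreOffice conversion into outdir."""
    return [
        soffice, "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(src),
    ]


def expected_output(src, outdir, target="pdf"):
    """Path LibreOffice is expected to write: same stem, new extension."""
    # A target such as "txt:Text" names a filter after the colon; keep only the extension
    extension = target.split(":")[0]
    return Path(outdir) / f"{Path(src).stem}.{extension}"
```

Converting to PDF first and then reusing the Phase 2 PDF text extraction keeps the number of code paths small, at the cost of one extra step per file.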
## Recommended Integration
File conversion should be implemented as Python scripts (matching existing processor pattern) and invoked at registration time:
```python
# scripts/convert_file.py
# Called during file registration if file_type not in (video, audio, image)
import openpyxl
from pptx import Presentation


def extract_xlsx(path):
    """Yield each row of every sheet as a list of strings."""
    # data_only=True returns cached formula results instead of formula text
    wb = openpyxl.load_workbook(path, data_only=True)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows(values_only=True):
            yield [str(c) if c is not None else "" for c in row]


def extract_pptx(path):
    """Yield the text of each slide, one shape per line."""
    prs = Presentation(path)
    for slide in prs.slides:
        # Only shapes with a text frame carry text; pictures do not
        text = [shape.text_frame.text for shape in slide.shapes if shape.has_text_frame]
        yield "\n".join(text)
```
## Version History
| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2026-05-15 | Initial document — conversion tool survey, licensing analysis, implementation phases |