docs: file conversion strategy — tools, licensing, implementation phases
docs_v1.0/REFERENCE/FILE_CONVERSION_STRATEGY_V1.0.md
---
title: "File Conversion Strategy"
version: "V1.0"
date: "2026-05-15"
author: "M5"
status: "draft"
---

# File Conversion Strategy

## Overview

Momentry Core registers and processes various file types beyond video. Non-video files (documents, images, spreadsheets, presentations) need conversion to extract text content and metadata for indexing, search, and preview.


## Supported Formats

| Category | Extensions | Strategy | Status |
|----------|-----------|----------|--------|
| Document | `.docx`, `.doc`, `.odt` | `textutil` (macOS built-in) | ✅ Phase 1 |
| Document | `.pages` | `unzip` → extract `preview.pdf` (limited) | ❌ Pages.app needed |
| Document | `.pdf` | `pdftotext` (poppler) or `PyPDF2` | ❌ Phase 2 |
| Spreadsheet | `.xlsx` | `openpyxl` (MIT, Python) | ❌ Phase 1 |
| Spreadsheet | `.xls` | LibreOffice (MPL-2.0); `openpyxl` does not read legacy `.xls` | ❌ Phase 3 |
| Presentation | `.pptx` | `python-pptx` (MIT, Python) | ❌ Phase 1 |
| Presentation | `.ppt` | LibreOffice (MPL-2.0); `python-pptx` does not read legacy `.ppt` | ❌ Phase 3 |
| Image | `.jpg`, `.png`, `.gif`, `.bmp`, `.webp` | `sips` (macOS built-in) | ✅ Phase 1 |
| Image | `.svg`, `.heic` | `ffprobe` or `sips` | ✅ Phase 1 |
| Apple iWork | `.pages`, `.key`, `.numbers` | LibreOffice (MPL-2.0) | ❌ Phase 3 |

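
The table above can be read as a lookup from file extension to conversion tool. The following is a minimal sketch of that lookup, not Momentry Core's actual code: `STRATEGIES` and `pick_strategy` are hypothetical names, and the mapping covers only a subset of the rows.

```python
from pathlib import Path

# Hypothetical mapping mirroring a subset of the Supported Formats table.
# Keys are lowercase extensions; values are (category, tool) pairs.
STRATEGIES = {
    ".docx": ("document", "textutil"),
    ".odt": ("document", "textutil"),
    ".pdf": ("document", "pdftotext"),
    ".xlsx": ("spreadsheet", "openpyxl"),
    ".pptx": ("presentation", "python-pptx"),
    ".jpg": ("image", "sips"),
    ".png": ("image", "sips"),
    ".heic": ("image", "sips"),
    ".key": ("iwork", "libreoffice"),
    ".numbers": ("iwork", "libreoffice"),
}


def pick_strategy(path):
    """Return (category, tool) for a path, or None if the extension is unknown."""
    # Path.suffix keeps only the last extension; lower() makes the match case-insensitive.
    return STRATEGIES.get(Path(path).suffix.lower())
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or register it without extracted text.
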

## Available Tools

### macOS Built-in (No Installation)

| Tool | License | Supports | Command |
|------|---------|----------|---------|
| `textutil` | Apple EULA | `.docx`→`.txt`, `.html`, `.rtf` | `textutil -convert txt input.docx -output out.txt` |
| `sips` | Apple EULA | Image format conversion | `sips -s format jpeg input.heic -o out.jpg` |
| `qlmanage` | Apple EULA | QuickLook thumbnails | `qlmanage -t -s 256 -o output input.pdf` |

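
The commands in this table could be driven from a Python script roughly as follows. This is a sketch under assumptions: the helper names (`textutil_command`, `sips_command`, `run_tool`) and the 60-second timeout are illustrative, and the argument lists simply reproduce the commands shown in the table.

```python
import shutil
import subprocess


def textutil_command(src, dst):
    """Build the textutil call from the table: convert a document to plain text."""
    return ["textutil", "-convert", "txt", str(src), "-output", str(dst)]


def sips_command(src, dst, fmt="jpeg"):
    """Build the sips call from the table: convert an image to another format."""
    return ["sips", "-s", "format", fmt, str(src), "-o", str(dst)]


def run_tool(cmd, timeout=60):
    """Run a command if its executable is on PATH; return True on success."""
    if shutil.which(cmd[0]) is None:
        return False  # tool not available, e.g. not running on macOS
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.
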

### Python Packages (pip install, MIT/BSD License)

| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `openpyxl` | MIT | `.xlsx` → text/csv | Sheet iteration, cell extraction |
| `python-pptx` | MIT | `.pptx` → text | Slide/shape text iteration |
| `python-docx` | MIT | `.docx` → text | Paragraph/run text extraction |
| `PyPDF2` / `pypdf` | BSD-3 | `.pdf` → text | Page-by-page text extraction; `pypdf` is the maintained successor to `PyPDF2` |

### Brew Packages

| Package | License | Supports | Notes |
|---------|---------|----------|-------|
| `pandoc` | GPL-2.0 | `.docx`⇄`.md`⇄`.html` | CLI subprocess, GPL not linked |
| `poppler` (pdftotext) | GPL-2.0 | `.pdf` → `.txt` | CLI subprocess, GPL not linked |
| `libreoffice` | MPL-2.0 | All office formats | Headless CLI, heavy (~1GB) |


## File Format Analysis

### Office Open XML (.docx, .xlsx, .pptx)

These are ZIP archives containing XML files:

```
document.docx = ZIP
├── word/document.xml     ← Main content (text + formatting)
├── word/comments.xml
└── [Content_Types].xml
```

The Python packages listed above parse this XML directly, which is fast and reliable.

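
To make the ZIP-plus-XML structure concrete, here is a small sketch that reads paragraph text from `word/document.xml` using only the standard library. It is an illustration of the file format, not a replacement for `python-docx`; the function name `docx_paragraphs` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph (<w:p>) in word/document.xml."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> elements; join them.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

This ignores comments, headers, footers, and formatting, which is usually acceptable when the goal is text for indexing.
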

### Apple iWork (.pages, .key, .numbers)

These are also ZIP archives:

```
document.pages = ZIP
├── index.apxl.gz         ← Actual content (Apple private format)
├── preview.pdf           ← Embedded preview (usable for thumbnails)
├── QuickLook/
│   └── Thumbnail.jpg
└── Metadata/
```

The exact member names vary between iWork versions. Without Pages.app or LibreOffice, only the embedded preview (`preview.pdf` or `Thumbnail.jpg`) can be extracted with `unzip`. Full text extraction requires LibreOffice.

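
The preview-only fallback can be sketched with the standard `zipfile` module instead of shelling out to `unzip`. This assumes the file really is a ZIP archive laid out as described above; `PREVIEW_NAMES` and `extract_preview` are hypothetical names, and matching is done on the final path component so that the preview is found wherever it sits in the archive.

```python
import zipfile

# Embedded preview names to look for, in order of preference (lowercase).
PREVIEW_NAMES = ("preview.pdf", "thumbnail.jpg")


def extract_preview(path_or_file):
    """Return (member_name, data) for the first embedded preview, or None."""
    with zipfile.ZipFile(path_or_file) as archive:
        members = archive.namelist()
        for wanted in PREVIEW_NAMES:
            for name in members:
                # Compare only the final path component, ignoring case.
                if name.rsplit("/", 1)[-1].lower() == wanted:
                    return name, archive.read(name)
    return None
```

A `None` result means the archive has no usable preview, in which case the file can only be handled once LibreOffice is available.
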

### PDF (.pdf)

PDF is not a ZIP format and requires a dedicated parser:

- `pdftotext` (poppler): best text extraction, CLI
- `PyPDF2`: pure Python, lighter dependency

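
One way to combine the two options is to prefer the CLI and fall back to the pure-Python parser when poppler is not installed. The sketch below makes that assumption explicit; it uses `pypdf` (the successor package to `PyPDF2`) for the fallback, and the function names and timeout are illustrative.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path):
    """Build a pdftotext call that writes UTF-8 text to stdout ("-")."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_pdf_text(pdf_path, timeout=120):
    """Prefer the pdftotext CLI; fall back to pypdf if it is not installed."""
    if shutil.which("pdftotext"):
        result = subprocess.run(
            pdftotext_command(pdf_path),
            capture_output=True,
            encoding="utf-8",
            timeout=timeout,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
        return result.stdout
    # Fallback: pure-Python extraction (requires `pip install pypdf`).
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Importing `pypdf` lazily keeps the script usable on machines where only poppler is installed.
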

## Commercial Licensing Analysis

### Summary

| Tool | License | Commercial Use |
|------|---------|:-------------:|
| macOS built-in tools | Apple EULA | ✅ Licensed macOS required |
| openpyxl / python-pptx / python-docx / PyPDF2 | MIT / BSD | ✅ Permissive; retain license notices |
| pandoc / poppler (pdftotext) | **GPL-2.0** | ✅ Safe via CLI subprocess |
| LibreOffice (soffice) | **MPL-2.0** | ✅ Safe via CLI subprocess |


### GPL Analysis (pandoc, poppler)

Calling GPL-licensed tools via CLI subprocess (`std::process::Command`) keeps them as separate programs rather than linking them into Momentry Core. The GPL FAQ describes pipes, sockets, and command-line arguments as communication mechanisms normally used between two separate programs, so programs that communicate this way are normally treated as separate works, unless they exchange complex internal data structures closely enough to count as one combined program.

Momentry Core calls these tools via `std::process::Command` (piping stdin/stdout/stderr), so **the GPL is not expected to propagate** to Momentry Core. The tools themselves remain under the GPL if they are redistributed with the product.


### MPL Analysis (LibreOffice)

MPL-2.0 is a file-level weak copyleft: only LibreOffice's own source files, and modifications to them, must remain MPL-licensed. Calling `soffice --headless` via subprocess does not affect Momentry Core's license.

### MIT/BSD (Python packages)

No copyleft. These packages can be distributed and linked with code under any license, provided their copyright and license notices are retained.


## Implementation Phases

### Phase 1: Zero-Install + Lightweight Python (Recommended First)

| Tool | Install | License |
|------|---------|---------|
| `textutil` | macOS built-in | Apple EULA |
| `sips` | macOS built-in | Apple EULA |
| `openpyxl` | `pip install` | MIT |
| `python-pptx` | `pip install` | MIT |

Coverage: `.docx`, `.xlsx`, `.pptx`, `.jpg`, `.png`, `.heic`, `.svg`


### Phase 2: Text Extraction Enhancement

| Tool | Install | License |
|------|---------|---------|
| `pdftotext` (poppler) | `brew install poppler` | GPL-2.0 |
| `PyPDF2` | `pip install` | BSD-3 |

Coverage: `.pdf` text extraction


### Phase 3: Full Office Suite

| Tool | Install | License |
|------|---------|---------|
| LibreOffice | `brew install libreoffice` | MPL-2.0 |

Coverage: `.pages`, `.key`, `.numbers`, legacy `.doc`/`.ppt`/`.xls`, full format fidelity

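
A Phase 3 conversion would typically go through LibreOffice's headless mode. The sketch below assumes `soffice` is on `PATH` (on macOS it may need the full path inside the application bundle) and converts to PDF by default; the helper names and the 300-second timeout are illustrative choices.

```python
import subprocess
from pathlib import Path


def soffice_command(src, out_dir, target="pdf", soffice="soffice"):
    """Build a headless LibreOffice conversion command."""
    return [soffice, "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(src)]


def convert_with_libreoffice(src, out_dir, target="pdf", timeout=300):
    """Convert `src` and return the expected output path (same stem, new suffix)."""
    subprocess.run(soffice_command(src, out_dir, target),
                   check=True, capture_output=True, timeout=timeout)
    # A target such as "txt:Text" names an export filter; the suffix is the part before ":".
    suffix = target.split(":", 1)[0]
    return Path(out_dir) / f"{Path(src).stem}.{suffix}"
```

Converting to PDF first and then reusing the Phase 2 PDF extraction is one possible design, since it keeps a single text-extraction path for all office formats.
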

## Recommended Integration

File conversion should be implemented as Python scripts (matching the existing processor pattern) and invoked at registration time:

```python
# scripts/convert_file.py
# Called during file registration if file_type not in (video, audio, image)

import openpyxl
from pptx import Presentation


def extract_xlsx(path):
    """Yield each row of every sheet as a list of cell strings."""
    # data_only=True returns cached formula results instead of formula text.
    wb = openpyxl.load_workbook(path, data_only=True)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        for row in ws.iter_rows(values_only=True):
            yield [str(c) if c is not None else "" for c in row]


def extract_pptx(path):
    """Yield the text of each slide, one string per slide."""
    prs = Presentation(path)
    for slide in prs.slides:
        # Shapes without a `.text` attribute (e.g. pictures) are skipped.
        text = [shape.text for shape in slide.shapes if hasattr(shape, "text")]
        yield "\n".join(text)
```

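
The extractors above could be tied together by a small dispatcher that the registration step calls. This is a sketch only: `convert`, the `extractors` argument, and the shape of the returned dictionary are assumptions for illustration, not an existing Momentry Core interface.

```python
from pathlib import Path


def convert(path, extractors):
    """Run the extractor registered for the file's extension.

    `extractors` maps a lowercase extension to a function that yields
    text chunks (rows, slides, ...). Returns a JSON-serializable dict.
    """
    suffix = Path(path).suffix.lower()
    extractor = extractors.get(suffix)
    if extractor is None:
        return {"path": str(path), "status": "unsupported", "chunks": []}
    chunks = []
    for chunk in extractor(path):
        # Spreadsheet rows arrive as lists of cell strings; join them with tabs.
        chunks.append(chunk if isinstance(chunk, str) else "\t".join(chunk))
    return {"path": str(path), "status": "ok", "chunks": chunks}
```

With the functions from `scripts/convert_file.py`, a call might look like `convert("deck.pptx", {".xlsx": extract_xlsx, ".pptx": extract_pptx})`, and the result can be passed to `json.dumps` for the caller to read from stdout.
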

## Version History

| Version | Date | Changes |
|---------|------|---------|
| V1.0 | 2026-05-15 | Initial document: conversion tool survey, licensing analysis, implementation phases |