# Ceph RADOS Integration Analysis for MarkBase

**Date**: 2026-06-25

**Status**: Shelved (does not fit MarkBase's macOS cross-platform positioning)

**Library**: ceph-async (4.0.5)

**Constraint**: Linux-only (requires librados.so symlink)

---

## Executive Summary

### Goal

Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.

### Key Findings

| Aspect | Finding |
|--------|---------|
| **Platform** | ❌ Linux-only (librados.so FFI; macOS needs Docker/VM) |
| **Deployment** | ⚠️ Requires a full cluster (Monitor + OSD + MGR) |
| **Complexity** | ⚠️⚠️⚠️⚠️⚠️ High (beyond the lightweight positioning) |
| **Positioning** | ❌ Does not fit MarkBase's macOS cross-platform positioning |

### Recommendation

**Shelved for now.** Options to consider first:

1. **MinIO**: S3-compatible, already supported through S3Vfs, cross-platform
2. **Built-in distributed storage**: DedupFs + S3Vfs combination, lightweight

---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────┐
│ MarkBase Application Layer                                 │
│ ├── SMB Server (Port 4445)                                 │
│ ├── SFTP Server (Port 2024)                                │
│ └── WebDAV Server (Port 11438)                             │
└────────────────────────────────────────────────────────────┘
                              │
                              ↓
┌────────────────────────────────────────────────────────────┐
│ VFS Abstraction Layer (VfsBackend trait)                   │
│ ├── LocalFs     - POSIX local filesystem                   │
│ ├── S3Vfs       - S3-compatible storage (HTTP API)         │
│ ├── SmbVfs      - SMB client backend                       │
│ ├── CephVfs     - Ceph RADOS backend (shelved)             │
│ ├── EncryptedFs - Encryption layer                         │
│ ├── Compression - ZSTD/LZ4 compression layer               │
│ ├── DedupFs     - Block deduplication layer                │
│ └── RaidFs      - RAID-Z emulation layer                   │
└────────────────────────────────────────────────────────────┘
                              │
                              ↓
┌────────────────────────────────────────────────────────────┐
│ Ceph Storage Cluster (RADOS)                               │
│ ├── Monitor (MON)   - Cluster map, authentication          │
│ ├── OSD Daemons     - Object storage (data replication)    │
│ ├── Manager (MGR)   - Dashboard, telemetry                 │
│ ├── MDS (optional)  - CephFS metadata server               │
│ └── RGW (optional)  - S3/Swift gateway                     │
└────────────────────────────────────────────────────────────┘
```

---

## Library Analysis

### Rust Ceph Crates

| Crate | Version | Description | Platform |
|-------|---------|-------------|----------|
| `ceph` | 3.2.5 | Official librados FFI (sync) | Linux-only |
| `ceph-async` | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
| `ceph-rbd` | 0.3.2 | RADOS Block Device bindings | Linux-only |

### ceph-async Module Structure

```
ceph_async::
├── CephClient      - Admin operations (OSD/Pool/Mon commands)
├── rados::         - Low-level FFI bindings (100+ functions)
│   ├── rados_read/write/stat/remove        - Object I/O
│   ├── rados_pool_create/delete/lookup     - Pool management
│   ├── rados_ioctx_*                       - I/O context (pool handle)
│   ├── rados_snap_*                        - Snapshot management
│   ├── rados_lock_*                        - Distributed locking
│   ├── rados_aio_*                         - Async I/O
│   ├── rados_omap_*                        - Key-value store per object
│   └── rados_write_op_* / rados_read_op_*  - Compound operations
├── completion::    - Async completion handling
├── read_stream::   - Async read stream
├── write_sink::    - Async write sink
└── list_stream::   - Async object listing
```

### CephClient API

```rust
let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;

// OSD operations
client.osd_tree()?;                     // Get OSD tree (CRUSH map)
client.osd_out(osd_id)?;                // Mark OSD out
client.osd_crush_remove(osd_id)?;       // Remove from CRUSH map

// Pool operations
client.osd_pool_get(pool, option)?;     // Get pool config
client.osd_pool_set(pool, key, val)?;   // Set pool config
client.osd_pool_quota_get(pool)?;       // Get pool quota

// Cluster status
client.status()?;                       // Cluster health
client.mon_dump()?;                     // Monitor list
client.version()?;                      // Ceph version
```

---

## Implementation Phases

| Phase | Task | Code Lines | Priority | Risk | Dependencies |
|-------|------|------------|----------|------|--------------|
| **Phase 1** | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
| **Phase 2** | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
| **Phase 3** | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
| **Phase 4** | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
| **Phase 5** | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
| **Phase 6** | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
| **Phase 7** | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
| **Phase 8** | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
| **Total** | | **~1450** | | | |

---

## Phase 1: CephVfs Core Implementation

### Key Design Decisions

**1. Object vs. file mapping**:
- RADOS is object storage (no directories)
- Path `/foo/bar.txt` → Object `foo/bar.txt` in pool
- Directories simulated via zero-byte objects with `/` suffix (like S3)

**2. Pool-per-share vs. single pool**:
- Option A: Single pool + path prefix (simpler, less isolation)
- Option B: Pool-per-share (better isolation, quota per pool)
- **Recommend**: Option B (pool-per-share) for enterprise use

**3. I/O context caching**:
- Each pool requires a separate `rados_ioctx_t`
- Cache the ioctx per share to avoid recreation overhead

### CephVfs Struct (Draft)

```rust
pub struct CephVfs {
    cluster: rados_t,          // RADOS cluster handle
    pool_name: String,         // Pool name for this share
    ioctx: rados_ioctx_t,      // I/O context (cached)
    root_prefix: String,       // Path prefix within pool
}

pub struct CephVfsFile {
    ioctx: rados_ioctx_t,
    object_id: String,         // Object name in pool
    position: u64,
    write_buffer: Vec<u8>,     // Buffer for writes (flush on close)
    size: u64,
}
```

### VfsBackend Method Mapping

| Method | RADOS equivalent | Complexity |
|--------|-----------------|------------|
| `read_dir()` | `rados_nobjects_list_*` | High (pagination) |
| `open_file()` | Custom (object ops) | Medium |
| `stat()` | `rados_stat()` | Low |
| `create_dir()` | `rados_write_full(0-byte)` | Low |
| `remove_dir()` | `rados_remove()` | Low |
| `remove_file()` | `rados_remove()` | Low |
| `rename()` | Custom (copy + delete) | Medium |
| `exists()` | `rados_stat()` | Low |
| `copy()` | `rados_clone_range()` | Low |
| `hard_link()` | `rados_clone_range()` | Low |
| `read_link()` | Unsupported | N/A |
| `create_symlink()` | Unsupported | N/A |

---

## Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| **Linux-only** | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
| **librados.so symlink** | ⚠️⚠️⚠️ Medium | Document setup; CI check |
| **Pool-level snapshots** | ⚠️⚠️ Low | Document limitation; consider RGW |
| **Async overhead** | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
| **Cluster complexity** | ⚠️⚠️⚠️⚠️⚠️ Critical | Beyond the lightweight positioning; Docker compose |
| **SMB Oplocks integration** | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |

---

## Alternatives (Recommended Options)

### Comparison

| Option | Cross-platform | Deployment complexity | Positioning fit | Status |
|--------|----------------|-----------------------|-----------------|--------|
| **Ceph RADOS** | ❌ Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | ❌ No fit | Shelved |
| **Ceph RGW (S3)** | ✅ HTTP API | ⚠️⚠️⚠️⚠️ High | ⭐⭐⭐ Medium | S3Vfs already available |
| **MinIO** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full fit | S3Vfs already available |
| **GlusterFS** | ✅ POSIX | ⚠️⚠️⚠️ Medium | ⭐⭐⭐⭐ High | To be investigated |
| **Built-in distributed** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full fit | Foundation in place |

### Option 1: MinIO (Recommended)

**Advantages**:
- ✅ S3-compatible API (S3Vfs already exists, no new code needed)
- ✅ Single-node deployment (lightweight)
- ✅ Cross-platform (macOS/Linux/Windows)
- ✅ High performance (erasure coding)
- ✅ Open source + enterprise edition

**Deployment**:

```bash
# macOS, single node
minio server /data --console-address ":9001"

# MarkBase configuration
MB_S3_ENDPOINT=http://localhost:9000
MB_S3_BUCKET=markbase
```

**Integration**: No code changes needed; S3Vfs already supports it.

---

### Option 2: Built-in Distributed Storage

**Existing foundation**:

| Feature | File | Distributed potential |
|---------|------|-----------------------|
| DedupFs | dedup.rs | ✅ SHA-256 block store can be shared across nodes |
| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
| Checksum | checksum.rs | ✅ Data integrity verification |
| Compression | compression.rs | ✅ ZSTD compression |

**Extension directions**:
1. DedupFs + S3Vfs: store dedup blocks in MinIO/S3 (shared across nodes)
2. Checksum + Replication: add cross-node replication
3. Send-Receive + Remote: add remote replication

---

## Technical Details

### librados API Functions

**Object I/O**:
- `rados_read(ioctx, oid, buf, len, offset)`: Read at offset
- `rados_write(ioctx, oid, buf, len, offset)`: Write at offset
- `rados_write_full(ioctx, oid, buf, len)`: Write entire object
- `rados_append(ioctx, oid, buf, len)`: Append to object
- `rados_stat(ioctx, oid, psize, pmtime)`: Get object size/mtime
- `rados_remove(ioctx, oid)`: Delete object

**Pool Operations**:
- `rados_pool_create(cluster, pool_name)`: Create pool
- `rados_pool_delete(cluster, pool_name)`: Delete pool
- `rados_pool_lookup(cluster, pool_name)`: Find pool ID
- `rados_ioctx_create(cluster, pool_name, ioctx)`: Create I/O context

**Snapshots**:
- `rados_ioctx_snap_create(ioctx, snap_name)`: Create pool snapshot
- `rados_ioctx_snap_list(ioctx, snaps)`: List snapshots
- `rados_ioctx_snap_remove(ioctx, snap_name)`: Delete snapshot
- `rados_ioctx_snap_rollback(ioctx, oid, snap_name)`: Roll back object

**Locking**:
- `rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags)`: Exclusive lock
- `rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags)`: Shared lock
- `rados_unlock(ioctx, oid, name, cookie)`: Release lock
- `rados_list_lockers(ioctx, oid, name, ...)`: List lock holders

**OMAP (Key-Value)**:
- `rados_omap_set(ioctx, oid, map)`: Set key-value pairs
- `rados_omap_get(ioctx, oid, ...)`: Get values by keys
- `rados_omap_get_keys(ioctx, oid, ...)`: List keys
- `rados_omap_rm_keys(ioctx, oid, keys)`: Delete keys

**Async I/O**:
- `rados_aio_read(ioctx, oid, completion, buf, len, offset)`: Async read
- `rados_aio_write(ioctx, oid, completion, buf, len, offset)`: Async write
- `rados_aio_flush(ioctx)`: Flush pending async ops
- `rados_aio_wait_for_complete(completion)`: Wait for completion

---

## Open Questions

1. **Deployment target**: Linux-only production vs. macOS development?
2. **Backend choice**: RADOS (librados) vs. RGW (S3 API)?
3. **Pool strategy**: Pool-per-share vs. single pool + path prefix?
4. **SMB Oplocks**: Should CephVfs support SMB Oplocks via RADOS locking?
5. **Priority**: Start with basic I/O or full async integration first?

---

## Conclusion

**Ceph RADOS integration is shelved for now**, for these reasons:

1. ❌ The Linux-only constraint does not fit the macOS cross-platform positioning
2. ⚠️ Deployment complexity exceeds the lightweight positioning
3. ⚠️ A full Ceph cluster is required (Monitor + OSD + MGR)

**Recommended alternatives**:

1. ⭐⭐⭐⭐⭐ **MinIO**: S3-compatible, S3Vfs already available, lightweight
2. ⭐⭐⭐⭐⭐ **Built-in distributed storage**: DedupFs + S3Vfs combination

**Next steps**:
- MinIO integration documentation (0 lines of code)
- DedupFs + S3Vfs combination research (~100 lines)
- Built-in replication feature (~400 lines)

---

**Document created**: 2026-06-25

**Last updated**: 2026-06-25
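
---

## Appendix: Path-to-Object Mapping Sketch

The object-versus-file mapping described under Phase 1 (path `/foo/bar.txt` becomes object `foo/bar.txt`, directories become zero-byte marker objects ending in `/`) could look roughly like the sketch below. It is a minimal illustration only: `object_name` and `parent_marker` are hypothetical helper names, not part of ceph-async or of any existing MarkBase code, and the choice to clamp `..` at the share root is an assumption rather than a decided behavior. The sketch uses only the Rust standard library and does not touch librados.

```rust
/// Hypothetical helper: map a VFS path to a RADOS object name.
/// `root_prefix` mirrors the `root_prefix` field of the draft `CephVfs`.
/// Directory markers get a trailing '/', as in the S3-style convention.
fn object_name(root_prefix: &str, path: &str, is_dir: bool) -> String {
    // Prefix segments are kept as-is (empty segments dropped).
    let prefix: Vec<&str> = root_prefix.split('/').filter(|s| !s.is_empty()).collect();

    // Normalize the path: drop "" and ".", resolve ".." without
    // ever climbing above the share root (assumed policy).
    let mut parts: Vec<&str> = Vec::new();
    for seg in path.split('/') {
        match seg {
            "" | "." => {}
            ".." => {
                parts.pop();
            }
            other => parts.push(other),
        }
    }

    let mut name = prefix.into_iter().chain(parts).collect::<Vec<_>>().join("/");
    if is_dir && !name.is_empty() {
        name.push('/');
    }
    name
}

/// Hypothetical helper: name of the directory marker that contains `object`,
/// or `None` when the object sits at the top level of the pool.
fn parent_marker(object: &str) -> Option<String> {
    let trimmed = object.trim_end_matches('/');
    trimmed.rfind('/').map(|i| trimmed[..=i].to_string())
}

fn main() {
    println!("{}", object_name("", "/foo/bar.txt", false));
    println!("{}", object_name("share1", "/docs", true));
    println!("{:?}", parent_marker("share1/docs/readme.md"));
}
```

With the recommended pool-per-share layout, `root_prefix` would normally be empty. With the single-pool layout it would carry the share name, so objects from different shares cannot collide.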