Ceph RADOS Integration Analysis for MarkBase
Date: 2026-06-25 Status: Shelved (does not fit the macOS cross-platform positioning) Library: ceph-async (4.0.5) Constraint: Linux-only (requires librados.so symlink)
Executive Summary
Goal
Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.
Key Findings
| Aspect | Finding |
|---|---|
| Platform | ❌ Linux-only (librados.so FFI, macOS needs Docker/VM) |
| Deployment | ⚠️ Requires full cluster (Monitor + OSD + MGR) |
| Complexity | ⚠️⚠️⚠️⚠️⚠️ High (beyond the lightweight positioning) |
| Positioning | ❌ Does not fit MarkBase's macOS cross-platform positioning |
Recommendation
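Shelved for now. Options to consider first:
- MinIO: S3-compatible, already supported by S3Vfs, cross-platform
- Built-in distributed storage: DedupFs + S3Vfs combination, lightweight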
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ MarkBase Application Layer │
│ ├── SMB Server (Port 4445) │
│ ├── SFTP Server (Port 2024) │
│ ├── WebDAV Server (Port 11438) │
│ └───────────────────────────────────────────────────────────────────────┘
│ ↓ │
┌─────────────────────────────────────────────────────────────────────────┐
│ VFS Abstraction Layer (VfsBackend trait) │
│ ├── LocalFs — POSIX local filesystem │
│ ├── S3Vfs — S3-compatible storage (HTTP API) │
│ ├── SmbVfs — SMB client backend │
│ ├── CephVfs — Ceph RADOS backend (shelved) │
│ ├── EncryptedFs — Encryption layer │
│ ├── Compression — ZSTD/LZ4 compression layer │
│ ├── DedupFs — Block deduplication layer │
│ ├── RaidFs — RAID-Z emulation layer │
│ └─────────────────────────────────────────────────────────────────────┘
│ ↓ │
┌─────────────────────────────────────────────────────────────────────────┐
│ Ceph Storage Cluster (RADOS) │
│ ├── Monitor (MON) — Cluster map, authentication │
│ ├── OSD Daemons — Object storage (data replication) │
│ ├── Manager (MGR) — Dashboard, telemetry │
│ ├── MDS (optional) — CephFS metadata server │
│ ├── RGW (optional) — S3/Swift gateway │
│ └─────────────────────────────────────────────────────────────────────┘
Library Analysis
Rust Ceph Crates
| Crate | Version | Description | Platform |
|---|---|---|---|
| ceph | 3.2.5 | Official librados FFI (sync) | Linux-only |
| ceph-async | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
| ceph-rbd | 0.3.2 | RADOS Block Device bindings | Linux-only |
ceph-async Module Structure
ceph_async::
├── CephClient — Admin operations (OSD/Pool/Mon commands)
├── rados:: — Low-level FFI bindings (100+ functions)
│ ├── rados_read/write/stat/remove — Object I/O
│ ├── rados_pool_create/delete/lookup — Pool management
│ ├── rados_ioctx_* — I/O context (pool handle)
│ ├── rados_snap_* — Snapshot management
│ ├── rados_lock_* — Distributed locking
│ ├── rados_aio_* — Async I/O
│ ├── rados_omap_* — Key-value store per object
│ └── rados_write_op_* / rados_read_op_* — Compound operations
├── completion:: — Async completion handling
├── read_stream:: — Async read stream
├── write_sink:: — Async write sink
└── list_stream:: — Async object listing
CephClient API
let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;
// OSD operations
client.osd_tree()?; // Get OSD tree (CRUSH map)
client.osd_out(osd_id)?; // Mark OSD out
client.osd_crush_remove(osd_id)?; // Remove from CRUSH map
// Pool operations
client.osd_pool_get(pool, option)?; // Get pool config
client.osd_pool_set(pool, key, val)?; // Set pool config
client.osd_pool_quota_get(pool)?; // Get pool quota
// Cluster status
client.status()?; // Cluster health
client.mon_dump()?; // Monitor list
client.version()?; // Ceph version
Implementation Phases
| Phase | Task | Code Lines | Priority | Risk | Dependencies |
|---|---|---|---|---|---|
| Phase 1 | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
| Phase 2 | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
| Phase 3 | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
| Phase 4 | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
| Phase 5 | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
| Phase 6 | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
| Phase 7 | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
| Phase 8 | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
| Total | | ~1450 | | | |
Phase 1: CephVfs Core Implementation
Key Design Decisions
1. Object vs File mapping:
- RADOS is object storage (no directories)
- Path /foo/bar.txt → object foo/bar.txt in pool
- Directories simulated via zero-byte objects with a / suffix (like S3)
2. Pool-per-share vs single pool:
- Option A: Single pool + path prefix (simpler, less isolation)
- Option B: Pool-per-share (better isolation, quota per pool)
- Recommend: Option B (pool-per-share) for enterprise use
3. I/O context caching:
- Each pool requires a separate rados_ioctx_t
- Cache the ioctx per share to avoid recreation overhead
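The path-to-object mapping from decision 1 can be sketched as a pair of pure functions. This is only an illustration: `object_name` and `dir_marker` are hypothetical helper names, and rejecting `..` segments is an assumption added here rather than something the design states.

```rust
/// Maps a VFS path to a flat object name under `root_prefix`.
/// Empty and "." segments are dropped, so "/foo//bar.txt" and
/// "foo/bar.txt" map to the same object. Returns None for any path
/// containing "..", so a path cannot escape the prefix (assumption).
fn object_name(root_prefix: &str, path: &str) -> Option<String> {
    let mut parts: Vec<&str> = root_prefix
        .split('/')
        .filter(|s| !s.is_empty())
        .collect();
    for seg in path.split('/') {
        match seg {
            "" | "." => continue,
            ".." => return None,
            s => parts.push(s),
        }
    }
    Some(parts.join("/"))
}

/// Name of the zero-byte marker object that stands in for a directory:
/// the object name with a trailing '/' (S3-style).
fn dir_marker(root_prefix: &str, path: &str) -> Option<String> {
    object_name(root_prefix, path).map(|name| format!("{}/", name))
}
```

With an empty prefix, `/foo/bar.txt` maps to `foo/bar.txt` and the directory `/foo` maps to the marker `foo/`. With pool-per-share (Option B) the prefix would normally be empty, since the pool itself provides the isolation.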
CephVfs Struct (Draft)
pub struct CephVfs {
    cluster: rados_t,       // RADOS cluster handle
    pool_name: String,      // Pool name for this share
    ioctx: rados_ioctx_t,   // I/O context (cached)
    root_prefix: String,    // Path prefix within pool
}
pub struct CephVfsFile {
    ioctx: rados_ioctx_t,
    object_id: String,      // Object name in pool
    position: u64,
    write_buffer: Vec<u8>,  // Buffer for writes (flush on close)
    size: u64,
}
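The "flush on close" behaviour of write_buffer can be illustrated without a cluster. The sketch below is a simplified model, not the draft's implementation: it replaces the ioctx with an in-memory map (`Pool`), handles only sequential appends, and omits the position field. `BufferedObjectFile` is a hypothetical name.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a pool: object name -> object bytes.
type Pool = HashMap<String, Vec<u8>>;

/// Simplified model of a file handle that buffers appended bytes.
struct BufferedObjectFile {
    object_id: String,
    write_buffer: Vec<u8>, // bytes written since open, not yet stored
    size: u64,             // logical size: stored bytes + buffered bytes
}

impl BufferedObjectFile {
    /// Opens a handle; a missing object is treated as empty.
    fn open(pool: &Pool, object_id: &str) -> Self {
        let size = pool.get(object_id).map_or(0, |o| o.len() as u64);
        Self {
            object_id: object_id.to_string(),
            write_buffer: Vec::new(),
            size,
        }
    }

    /// Buffers the data locally; nothing reaches the pool yet.
    fn write(&mut self, data: &[u8]) -> usize {
        self.write_buffer.extend_from_slice(data);
        self.size += data.len() as u64;
        data.len()
    }

    /// Flushes the whole buffer in one append (where a real backend
    /// might issue a single rados_append) and consumes the handle.
    fn close(self, pool: &mut Pool) {
        if !self.write_buffer.is_empty() {
            pool.entry(self.object_id)
                .or_default()
                .extend_from_slice(&self.write_buffer);
        }
    }
}
```

The trade-off is one round trip per close instead of one per write, in exchange for data that is invisible to other clients (and lost on a crash) until close is called.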
VfsBackend Method Mapping
| Method | RADOS equivalent | Complexity |
|---|---|---|
| read_dir() | rados_nobjects_list_* | High (pagination) |
| open_file() | Custom (object ops) | Medium |
| stat() | rados_stat() | Low |
| create_dir() | rados_write_full(0-byte) | Low |
| remove_dir() | rados_remove() | Low |
| remove_file() | rados_remove() | Low |
| rename() | Custom (copy + delete) | Medium |
| exists() | rados_stat() | Low |
| copy() | rados_clone_range() | Low |
| hard_link() | rados_clone_range() | Low |
| read_link() | Unsupported | N/A |
| create_symlink() | Unsupported | N/A |
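Part of what makes read_dir() harder than the other methods is that object listing yields flat names, so the immediate children of a directory have to be derived on the client. A minimal sketch of that derivation follows, assuming the names have already been collected from the listing calls; `list_dir` is a hypothetical helper and pagination is left out.

```rust
use std::collections::BTreeSet;

/// Returns the immediate children of `dir`, given flat object names.
/// Files are returned by name; sub-directories are returned once,
/// with a trailing '/', even if they contain many objects.
fn list_dir<'a, I>(object_names: I, dir: &str) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    // Normalize "/foo", "foo/" and "foo" to the prefix "foo/";
    // the root directory becomes the empty prefix.
    let trimmed = dir.trim_matches('/');
    let prefix = if trimmed.is_empty() {
        String::new()
    } else {
        format!("{}/", trimmed)
    };

    let mut children = BTreeSet::new();
    for name in object_names {
        // Skip objects outside the directory and the directory's own marker.
        let rest = match name.strip_prefix(prefix.as_str()) {
            Some(r) if !r.is_empty() => r,
            _ => continue,
        };
        // Keep only the first path segment below the prefix.
        let child = match rest.find('/') {
            Some(i) => &rest[..=i], // "sub/deep.txt" -> "sub/"
            None => rest,
        };
        children.insert(child.to_string());
    }
    children.into_iter().collect()
}
```

Given the objects `foo/`, `foo/bar.txt`, `foo/sub/deep.txt` and `other.txt`, listing `/foo` yields `bar.txt` and `sub/`. The cost grows with the number of objects in the pool rather than with the size of the directory, which is one argument for keeping pools small (pool-per-share).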
Risk Assessment
| Risk | Level | Mitigation |
|---|---|---|
| Linux-only | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
| librados.so symlink | ⚠️⚠️⚠️ Medium | Document setup; CI check |
| Pool-level snapshots | ⚠️⚠️ Low | Document limitation; consider RGW |
| Async overhead | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
| Cluster complexity | ⚠️⚠️⚠️⚠️⚠️ Critical | Beyond the lightweight positioning; Docker compose |
| SMB Oplocks integration | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |
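Alternatives (Recommended Options)
Option Comparison
| Option | Cross-platform | Deployment complexity | Positioning fit | Status |
|---|---|---|---|---|
| Ceph RADOS | ❌ Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | ❌ No match | Shelved |
| Ceph RGW (S3) | ✅ HTTP API | ⚠️⚠️⚠️⚠️ High | ⭐⭐⭐ Medium | S3Vfs already exists |
| MinIO | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | S3Vfs already exists |
| GlusterFS | ✅ POSIX | ⚠️⚠️⚠️ Medium | ⭐⭐⭐⭐ High | To be investigated |
| Built-in distributed storage | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | Foundation exists |
Option 1: MinIO (Recommended)
Advantages:
- ✅ S3-compatible API (S3Vfs already exists, no new code needed)
- ✅ Single-node deployment (lightweight)
- ✅ Cross-platform (macOS/Linux/Windows)
- ✅ High performance (erasure coding)
- ✅ Open source + enterprise edition
Deployment: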
# Single node on macOS
minio server /data --console-address ":9001"
# MarkBase configuration
MB_S3_ENDPOINT=http://localhost:9000
MB_S3_BUCKET=markbase
Integration: No code changes required; S3Vfs already supports it.
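For illustration, the two variables above could be read and validated as follows. How MarkBase actually parses its configuration is not described here, so `S3Config`, `s3_config_from` and the validation rules are assumptions; only the variable names come from the deployment snippet.

```rust
/// Hypothetical config struct for the S3-compatible backend.
#[derive(Debug, PartialEq)]
struct S3Config {
    endpoint: String,
    bucket: String,
}

/// Builds the config from a lookup function, so the logic can be
/// exercised without touching the real process environment.
fn s3_config_from<F>(lookup: F) -> Result<S3Config, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat unset and empty variables the same way.
    let get = |key: &str| {
        lookup(key)
            .filter(|v| !v.is_empty())
            .ok_or_else(|| format!("missing {}", key))
    };
    let endpoint = get("MB_S3_ENDPOINT")?;
    if !(endpoint.starts_with("http://") || endpoint.starts_with("https://")) {
        return Err(format!("MB_S3_ENDPOINT must be an http(s) URL, got {}", endpoint));
    }
    let bucket = get("MB_S3_BUCKET")?;
    Ok(S3Config { endpoint, bucket })
}

/// Reads the same settings from the process environment.
fn s3_config_from_env() -> Result<S3Config, String> {
    s3_config_from(|key| std::env::var(key).ok())
}
```

Pointing MB_S3_ENDPOINT at a MinIO node, a Ceph RGW gateway, or a hosted S3 service would then be a configuration change only, which is the main reason the S3 route needs no new backend code.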
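Option 2: Built-in Distributed Storage
Existing foundation:
| Feature | File | Distributed potential |
|---|---|---|
| DedupFs | dedup.rs | ✅ SHA-256 block store can be shared across nodes |
| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
| Checksum | checksum.rs | ✅ Data integrity verification |
| Compression | compression.rs | ✅ ZSTD compression |
Extension directions:
- DedupFs + S3Vfs: store dedup blocks in MinIO/S3 (shared across nodes)
- Checksum + Replication: add cross-node replication
- Send-Receive + Remote: add remote replication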
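The DedupFs + S3Vfs direction amounts to a content-addressed block store whose blocks live in a shared bucket. The sketch below shows the idea under loud assumptions: a HashMap stands in for the bucket, the block size is tiny for readability, and the standard library's DefaultHasher stands in for SHA-256 (it is not collision resistant and must not be used for real deduplication). None of the names come from dedup.rs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Illustrative block size; a real store would use far larger blocks.
const BLOCK_SIZE: usize = 4;

/// Derives the object key for a block from its content.
/// ASSUMPTION: DefaultHasher is only a placeholder for SHA-256.
fn block_key(block: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    format!("blocks/{:016x}", hasher.finish())
}

/// Splits `data` into fixed-size blocks, stores each distinct block
/// once, and returns the ordered list of keys (the file's manifest).
fn put_dedup(store: &mut HashMap<String, Vec<u8>>, data: &[u8]) -> Vec<String> {
    data.chunks(BLOCK_SIZE)
        .map(|block| {
            let key = block_key(block);
            // Only the first writer of a given block uploads it.
            store.entry(key.clone()).or_insert_with(|| block.to_vec());
            key
        })
        .collect()
}

/// Reassembles a file from its manifest; None if a block is missing.
fn get_dedup(store: &HashMap<String, Vec<u8>>, manifest: &[String]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for key in manifest {
        out.extend_from_slice(store.get(key)?);
    }
    Some(out)
}
```

Because keys depend only on content, two nodes writing the same block to the same bucket converge on one object, which is what makes the block store shareable across nodes without coordination.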
Technical Details
librados API Functions
Object I/O:
- rados_read(ioctx, oid, buf, len, offset): Read at offset
- rados_write(ioctx, oid, buf, len, offset): Write at offset
- rados_write_full(ioctx, oid, buf, len): Write entire object
- rados_append(ioctx, oid, buf, len): Append to object
- rados_stat(ioctx, oid, psize, pmtime): Get object size/mtime
- rados_remove(ioctx, oid): Delete object
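The librados C functions report failure as a negative errno value, and rados_read returns the number of bytes read on success. A Rust wrapper would typically translate that convention once, along these lines (`check_rados` is a hypothetical helper name, not part of ceph-async):

```rust
use std::io;

/// Converts a librados-style return code into an io::Result.
/// Non-negative values pass through (for example, bytes read);
/// negative values are interpreted as -errno.
fn check_rados(ret: i32) -> io::Result<usize> {
    if ret >= 0 {
        Ok(ret as usize)
    } else {
        // saturating_neg avoids overflow on i32::MIN.
        Err(io::Error::from_raw_os_error(ret.saturating_neg()))
    }
}
```

With this in place, a stat() on a missing object (return code -2, ENOENT) surfaces as an io::Error of kind NotFound, which maps naturally onto exists() returning false.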
Pool Operations:
- rados_pool_create(cluster, pool_name): Create pool
- rados_pool_delete(cluster, pool_name): Delete pool
- rados_pool_lookup(cluster, pool_name): Find pool ID
- rados_ioctx_create(cluster, pool_name, ioctx): Create I/O context
Snapshots:
- rados_ioctx_snap_create(ioctx, snap_name): Create pool snapshot
- rados_ioctx_snap_list(ioctx, snaps, maxlen): List snapshot IDs
- rados_ioctx_snap_remove(ioctx, snap_name): Delete snapshot
- rados_ioctx_snap_rollback(ioctx, oid, snap_name): Roll back object to snapshot
Locking:
- rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags): Exclusive lock
- rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags): Shared lock
- rados_unlock(ioctx, oid, name, cookie): Release lock
- rados_list_lockers(ioctx, oid, name, ...): List lock holders
OMAP (Key-Value):
- rados_omap_set(ioctx, oid, map): Set key-value pairs
- rados_omap_get(ioctx, oid, ...): Get values by keys
- rados_omap_get_keys(ioctx, oid, ...): List keys
- rados_omap_rm_keys(ioctx, oid, keys): Delete keys
Async I/O:
- rados_aio_read(ioctx, oid, completion, buf, len, offset): Async read
- rados_aio_write(ioctx, oid, completion, buf, len, offset): Async write
- rados_aio_flush(ioctx): Flush pending async ops
- rados_aio_wait_for_complete(completion): Wait for completion
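The submit-then-wait shape of the aio calls can be modelled with the standard library alone. The sketch below is an analogy, not how librados or ceph-async implement completions: a worker thread runs a blocking closure and a channel plays the role of the completion. `Completion` and `submit` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Handle for one in-flight operation, loosely modelled on a completion.
struct Completion<T> {
    rx: Receiver<T>,
}

impl<T> Completion<T> {
    /// Blocks until the operation has finished and returns its result.
    fn wait_for_complete(self) -> T {
        self.rx
            .recv()
            .expect("worker exited without sending a result")
    }
}

/// Starts a blocking operation on a worker thread and returns at once.
fn submit<T, F>(op: F) -> Completion<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A send error only means the caller dropped the Completion.
        let _ = tx.send(op());
    });
    Completion { rx }
}
```

Several operations can be submitted before any of them is waited on, which is where the benefit over the synchronous calls comes from. One thread per operation is the simplest possible choice; the spawn_blocking wrapper named in the risk table would draw on a shared thread pool instead.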
Open Questions
- Deployment target: Linux-only production vs macOS development?
- Backend choice: RADOS (librados) vs RGW (S3 API)?
- Pool strategy: Pool-per-share vs single pool + path prefix?
- SMB Oplocks: Should CephVfs support SMB Oplocks via RADOS locking?
- Priority: Start with basic I/O or full async integration first?
Conclusion
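Ceph RADOS integration is shelved for now, for the following reasons:
- ❌ The Linux-only constraint does not fit the macOS cross-platform positioning
- ⚠️ Deployment complexity exceeds the lightweight positioning
- ⚠️ A full Ceph cluster is required (Monitor + OSD + MGR)
Recommended alternatives:
- ⭐⭐⭐⭐⭐ MinIO: S3-compatible, S3Vfs already exists, lightweight
- ⭐⭐⭐⭐⭐ Built-in distributed storage: DedupFs + S3Vfs combination
Next steps:
- MinIO integration documentation (0 lines of code)
- DedupFs + S3Vfs combination study (~100 lines)
- Built-in replication feature (~400 lines)
Document created: 2026-06-25 Last updated: 2026-06-25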